Out with the Outliers: How to Remove Outliers in Python

Outliers are data points that differ markedly from the rest of a dataset. They can distort summary statistics, model fits, and the conclusions drawn from them, so it is often worth identifying them and deciding whether to remove them. In this tutorial, we will discuss how to identify and remove outliers using Python.

Identifying Outliers

Before we can remove outliers, we need to identify them. There are several ways to do this, including:

  • Visualization: One of the simplest ways to identify outliers is to create a visual plot of the data. Outliers will typically stand out as points that are significantly different from the rest of the data.

  • Z-score: The Z-score indicates how many standard deviations a data point lies from the mean. A data point with a Z-score above 3 or below -3 is typically considered an outlier.

  • Interquartile range (IQR): The IQR is the range between the first quartile (Q1) and the third quartile (Q3) of a dataset. A common rule flags any data point below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR as an outlier.
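
To make the two numeric rules concrete, here is a minimal sketch that applies both to a small, made-up series. The sample values and the variable names are assumptions for illustration only; the cutoffs of 3 and 1.5 are the conventional ones described above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: twenty readings near 10 plus one extreme value
values = pd.Series([9.8, 10.1, 10.0, 9.9, 10.2] * 4 + [25.0])

# Z-score rule: distance from the mean in units of standard deviation
z = (values - values.mean()) / values.std(ddof=0)
z_outliers = values[np.abs(z) > 3]

# IQR rule: fences placed 1.5 * IQR beyond the first and third quartiles
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

print(z_outliers.tolist())    # → [25.0]
print(iqr_outliers.tolist())  # → [25.0]
```

Both rules agree on this sample, but they need not agree in general: the mean and standard deviation are themselves pulled by extreme values, while the quartiles are not.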

Once we have identified the outliers in our dataset, we can proceed with removing them.

Removing Outliers

There are several ways to remove outliers, including:

  • Filtering: One option is to simply filter out the outliers from the dataset. This can be done by creating a new dataset that only includes data points that fall within a certain range.

  • Imputation: Another option is to impute the outliers, which means replacing the outlier values with more typical values. This can be done using techniques such as mean imputation, median imputation, or linear interpolation.

  • Transformation: In some cases, it may be appropriate to transform the data in order to make the outliers less extreme. This can be done using techniques such as a log transformation or a Box-Cox transformation, both of which require strictly positive values.
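
The three options can be sketched side by side on a single column. This is an illustrative example, not a recommendation of one approach over another: the sample values are made up, outliers are flagged with the 1.5 × IQR rule, and log1p stands in for the log transformation.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one extreme value
s = pd.Series([12.0, 14.0, 13.0, 15.0, 11.0, 13.0, 14.0, 12.0, 200.0])

# Flag outliers with the 1.5 * IQR rule
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
is_outlier = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

# Filtering: keep only the values that are not flagged
filtered = s[~is_outlier]

# Imputation: replace flagged values with the median of the remaining ones
imputed = s.mask(is_outlier, filtered.median())

# Transformation: log1p compresses large values (inputs must be > -1)
transformed = np.log1p(s)

print(len(s), len(filtered))  # → 9 8
print(imputed.iloc[-1])       # → 13.0
```

Filtering shortens the data, imputation keeps its length but alters values, and transformation keeps every point while shrinking the gap between typical and extreme values.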

Let's put this into practice. We'll start by importing the necessary libraries and loading some sample data.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the data
df = pd.read_csv("data.csv")
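
If you do not have a data.csv file at hand, a synthetic DataFrame works just as well for following along. The sketch below is an assumption for illustration: the column names, distributions, and injected extreme values are made up and are not part of the original dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data.csv: two numeric columns with a few
# extreme values injected on purpose
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({
    "height": rng.normal(loc=170, scale=10, size=200),
    "weight": rng.normal(loc=70, scale=8, size=200),
})
df.loc[[0, 1], "height"] = [260.0, 60.0]
df.loc[2, "weight"] = 190.0

# Summary statistics make the injected extremes visible in min and max
print(df.describe())
```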

Next, we'll use the boxplot function from the seaborn library to visualize the distribution of the data and identify any outliers.

import seaborn as sns

sns.boxplot(data=df)
plt.show()

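This will create a boxplot of the data, with one box per numeric column. Each box spans the interquartile range (IQR), and the line inside it marks the median. By default, the whiskers (the lines extending from the box) reach to the most extreme data points within 1.5 times the IQR of the box. Any data points beyond the whiskers are drawn individually and are considered outliers.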

Once we've identified the outliers, we can remove them from the data using the zscore function from scipy. This function calculates the z-score of each data point, which is the number of standard deviations it is from the mean.

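from scipy import stats

# Calculate the absolute z-score of each value in the numeric columns
# (zscore cannot handle text columns and, by default, propagates NaN)
num = df.select_dtypes(include="number")
z = np.abs(stats.zscore(num))

# Keep only rows where every column is within 3 standard deviations
df_clean = df[(z < 3).all(axis=1)]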

The zscore function returns the z-score of every value, computed column by column. The comparison z < 3 produces one boolean per value, and all(axis=1) reduces each row to a single boolean that is True only when every column in that row has an absolute z-score below 3 (a commonly used threshold for identifying outliers). We can use this boolean mask to select only the non-outlier rows from the original data.
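
To see how the row mask behaves, here is a small self-contained sketch. The frame, its column names, and the injected values are hypothetical; the data is converted with to_numpy() so that the result is a plain NumPy array regardless of the scipy version.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical frame: 20 rows, one extreme value in each column
df = pd.DataFrame({"a": np.arange(20.0), "b": np.arange(20.0) * 2})
df.loc[5, "a"] = 1000.0
df.loc[12, "b"] = -1000.0

z = np.abs(stats.zscore(df.to_numpy()))  # one z-score per cell
within = z < 3                           # True where a cell is within 3 std devs
keep = within.all(axis=1)                # True only if every column passes

df_clean = df[keep]
print(df.index[~keep].tolist())  # → [5, 12]
print(len(df), len(df_clean))    # → 20 18
```

A row is dropped as soon as any one of its columns is extreme, so with many columns this filter can remove more rows than expected.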

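Alternatively, we can identify and remove outliers based on the IQR. This uses a different threshold: 1.5 times the IQR beyond the quartiles instead of 3 standard deviations from the mean. Because quartiles are barely affected by extreme values, the threshold itself is more robust. scipy provides an iqr function (scipy.stats.iqr), but since we need the quartiles themselves to build the thresholds, we compute them directly with pandas, one column at a time.

# Work on the numeric columns only
num = df.select_dtypes(include="number")

# Calculate the quartiles and the IQR of each column
q25 = num.quantile(0.25)
q75 = num.quantile(0.75)
iqr = q75 - q25

# Calculate the outlier thresholds of each column
cutoff = iqr * 1.5
lower, upper = q25 - cutoff, q75 + cutoff

# Remove rows that have a value outside the thresholds in any column
df_clean = df[((num >= lower) & (num <= upper)).all(axis=1)]

This will remove any row in which at least one numeric column is less than its lower threshold or greater than its upper threshold.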
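
If you need this filter in more than one place, it can be wrapped in a small helper. The function below is a sketch under stated assumptions: remove_outliers_iqr and its k parameter are hypothetical names introduced here, thresholds are computed per numeric column, and non-numeric columns are carried along untouched.

```python
import pandas as pd

def remove_outliers_iqr(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Drop rows where any numeric column lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    num = df.select_dtypes(include="number")
    q1, q3 = num.quantile(0.25), num.quantile(0.75)
    iqr = q3 - q1
    inside = (num >= q1 - k * iqr) & (num <= q3 + k * iqr)
    return df[inside.all(axis=1)]

# Hypothetical data: one extreme value in column "x"
demo = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0],
    "label": list("abcdefghi"),
})

cleaned = remove_outliers_iqr(demo)
print(cleaned["x"].tolist())  # → [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
```

Raising k makes the filter more permissive; for example, k=3 is sometimes used to flag only the most extreme points.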

We can also draw the seaborn boxplot again, this time on the cleaned data, to make sure the outliers have been successfully removed.

sns.boxplot(data=df_clean)
plt.show()

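This will create a new boxplot of the cleaned data, in which the extreme points from the first plot should be gone. Because the quartiles and whiskers are recomputed from the cleaned data, a few points may still appear just beyond the new whiskers.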

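That's it! With these simple steps, we can identify and remove outliers from our data using Python. Removing outliers that stem from measurement or data-entry errors can improve the accuracy and reliability of our statistical analyses and machine learning models, but an extreme value is sometimes a genuine observation, so it is worth checking why a point is extreme before discarding it.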

I hope this tutorial has been helpful. If you have any questions or comments, please let me know.