@joseph-allen
Created December 29, 2017 14:22
Tukey method outlier detection
import numpy as np
from collections import Counter


def detect_outliers(df, n, features):
    """
    Takes a dataframe df of features and returns a list of the indices
    corresponding to the observations containing more than n outliers according
    to the Tukey method.
    """
    outlier_indices = []

    # iterate over features (columns)
    for col in features:
        # 1st quartile (25%)
        Q1 = np.percentile(df[col], 25)
        # 3rd quartile (75%)
        Q3 = np.percentile(df[col], 75)
        # Interquartile range (IQR)
        IQR = Q3 - Q1

        # outlier step
        outlier_step = 1.5 * IQR

        # Determine the indices of outliers for feature col
        outlier_list_col = df[(df[col] < Q1 - outlier_step) | (df[col] > Q3 + outlier_step)].index

        # append the found outlier indices for col to the list of outlier indices
        outlier_indices.extend(outlier_list_col)

    # select observations flagged as outliers in more than n features
    outlier_indices = Counter(outlier_indices)
    multiple_outliers = [k for k, v in outlier_indices.items() if v > n]

    return multiple_outliers


# detect outliers from list of features
lof = ['example_feature_1', 'example_feature_2']
# params: dataset (an existing pandas DataFrame), number of outliers for rejection, list of features
Outliers_to_drop = detect_outliers(dataset, 2, lof)
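The call above depends on a `dataset` DataFrame that the gist does not define. As a minimal, self-contained sketch of how the function might be used, the example below restates the same logic compactly and runs it on a made-up table. The `toy` frame, its column names `a`, `b`, `c`, and the final `drop` step are assumptions for illustration, not part of the original gist.

```python
from collections import Counter

import numpy as np
import pandas as pd


def detect_outliers(df, n, features):
    """Same logic as the gist: indices flagged by the Tukey fences in more than n features."""
    outlier_indices = []
    for col in features:
        q1, q3 = np.percentile(df[col], [25, 75])
        step = 1.5 * (q3 - q1)
        mask = (df[col] < q1 - step) | (df[col] > q3 + step)
        outlier_indices.extend(df[mask].index)
    counts = Counter(outlier_indices)
    return [idx for idx, hits in counts.items() if hits > n]


# Hypothetical data: row 9 is extreme in every column, row 0 only in column "a".
toy = pd.DataFrame({
    "a": [-100, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    "b": [10, 11, 12, 13, 14, 15, 16, 17, 18, 500],
    "c": [5, 5, 6, 6, 7, 7, 8, 8, 9, 900],
})
features = ["a", "b", "c"]

# With n=2 a row must be flagged in all three columns, so only row 9 qualifies.
strict = detect_outliers(toy, 2, features)

# With n=0 a single flagged column is enough, so row 0 is returned as well.
loose = detect_outliers(toy, 0, features)

# Drop the rejected rows and renumber the remaining ones.
cleaned = toy.drop(index=strict).reset_index(drop=True)
```

Because the comparison is `v > n`, passing `n=2` with only two features, as in the `lof` example, can never return a row; `n` has to be smaller than the number of features for anything to be rejected.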
@zahs123

zahs123 commented Dec 8, 2020

Why do you select observations containing more than 2 outliers?

@joseph-allen
Author

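The Tukey method decides, feature by feature, whether a value is an outlier (below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR). The function then takes n: a row is classified as an outlier only if it is flagged in more than n features.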
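To make the role of n concrete, here is a small sketch that counts, for each row, how many features flag it, so the threshold can be applied afterwards. The helper name `count_tukey_flags`, the `k` parameter, and the `toy` data are assumptions for illustration. Note that it uses pandas `quantile`, which skips missing values, whereas `np.percentile` in the gist returns NaN for a column that contains them.

```python
import pandas as pd


def count_tukey_flags(df, features, k=1.5):
    """Return, per row, the number of features whose value lies outside the Tukey fences."""
    q1 = df[features].quantile(0.25)   # one Q1 per feature
    q3 = df[features].quantile(0.75)   # one Q3 per feature
    step = k * (q3 - q1)               # 1.5 * IQR by default
    # Column-wise comparison: each column is checked against its own fences.
    flags = (df[features] < q1 - step) | (df[features] > q3 + step)
    return flags.sum(axis=1)


# Hypothetical data: row 9 is extreme in every column, row 0 only in column "a".
toy = pd.DataFrame({
    "a": [-100, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    "b": [10, 11, 12, 13, 14, 15, 16, 17, 18, 500],
    "c": [5, 5, 6, 6, 7, 7, 8, 8, 9, 900],
})

flag_counts = count_tukey_flags(toy, ["a", "b", "c"])

# Rows flagged in more than n features, mirroring the `v > n` test in the gist.
n = 2
rejected = flag_counts[flag_counts > n].index.tolist()
```

Looking at `flag_counts` directly shows why row 0 survives with n = 2: it is flagged in one column only, while row 9 is flagged in all three.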

@zahs123

zahs123 commented Dec 8, 2020

Where is this documented? So, e.g., if a row is classified as an outlier by two different columns, you would then classify it as an outlier? Not if it has been classified as an outlier by just one column?

@joseph-allen
Author

https://www.itl.nist.gov/div898/handbook/prc/section4/prc471.htm

The idea is that outliers generally trend together. For example, if you have a set of weights and heights, you would expect height and weight to trend together.

@joseph-allen
Author

This is basically a naive attempt at outlier classification without using ML.
