Created December 29, 2017 14:22
Tukey method outlier detection
import numpy as np
from collections import Counter


def detect_outliers(df, n, features):
    """
    Takes a dataframe df of features and returns a list of the indices
    corresponding to the observations containing more than n outliers according
    to the Tukey method.
    """
    outlier_indices = []

    # iterate over features (columns)
    for col in features:
        # 1st quartile (25%)
        Q1 = np.percentile(df[col], 25)
        # 3rd quartile (75%)
        Q3 = np.percentile(df[col], 75)
        # interquartile range (IQR)
        IQR = Q3 - Q1

        # outlier step
        outlier_step = 1.5 * IQR

        # determine the indices of outliers for feature col
        outlier_list_col = df[(df[col] < Q1 - outlier_step) | (df[col] > Q3 + outlier_step)].index

        # append the found outlier indices for col to the list of outlier indices
        outlier_indices.extend(outlier_list_col)

    # select observations containing more than n outliers
    outlier_indices = Counter(outlier_indices)
    multiple_outliers = list(k for k, v in outlier_indices.items() if v > n)

    return multiple_outliers


# detect outliers from list of features
lof = ['example_feature_1', 'example_feature_2']
# params: dataset (a pandas DataFrame you supply), number of outliers for rejection, list of features
# note: the test is strictly "more than n", so with only two features n = 2 can never be exceeded;
# use a longer feature list or a smaller n
Outliers_to_drop = detect_outliers(dataset, 2, lof)
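To make the per-column rule concrete, here is a minimal sketch of the same quartile and 1.5 * IQR computation on a single made-up array. The values and variable names are illustrative assumptions, not part of the gist.

```python
import numpy as np

# Made-up sample with one obviously extreme value (illustrative only)
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# Quartiles, using np.percentile's default linear interpolation
q1, q3 = np.percentile(values, [25, 75])   # 3.25 and 7.75
iqr = q3 - q1                              # 4.5

# Fences: anything outside [q1 - 1.5*IQR, q3 + 1.5*IQR] is flagged
lower = q1 - 1.5 * iqr                     # -3.5
upper = q3 + 1.5 * iqr                     # 14.5

outliers = values[(values < lower) | (values > upper)]
print(outliers)                            # [100.]
```

Only 100 falls outside the fences here, so it is the single value flagged for this column.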
That's how the function applies the Tukey method. It takes an n, and a row is returned as an outlier only if it has outlying values in more than n of the listed features.
Where is this documented? So, e.g., if a row is classified as an outlier by two different columns, you would then classify it as an outlier? Not if it has just been classified as an outlier by one column?
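As a rough illustration of how n changes the result, here is a self-contained sketch on made-up data. The helper name `flag_rows`, the `people` frame, and its values are hypothetical; the logic mirrors the gist's function but is condensed.

```python
from collections import Counter

import numpy as np
import pandas as pd


def flag_rows(df, n, features):
    """Return sorted index labels of rows outlying in more than n features."""
    hits = Counter()
    for col in features:
        q1, q3 = np.percentile(df[col], [25, 75])
        step = 1.5 * (q3 - q1)
        mask = (df[col] < q1 - step) | (df[col] > q3 + step)
        hits.update(df[mask].index)  # one count per outlying column
    return sorted(k for k, v in hits.items() if v > n)


# Made-up data: row 9 is extreme in height and weight, row 0 only in age
people = pd.DataFrame({
    "height": [150, 160, 165, 170, 172, 175, 178, 180, 185, 250],
    "weight": [50, 55, 60, 62, 65, 68, 70, 75, 80, 200],
    "age":    [95, 22, 25, 27, 30, 32, 35, 38, 40, 41],
})
cols = ["height", "weight", "age"]

print(flag_rows(people, 0, cols))  # [0, 9]  outlying in at least one column
print(flag_rows(people, 1, cols))  # [9]     outlying in at least two columns
print(flag_rows(people, 2, cols))  # []      no row is outlying in all three
```

Because the comparison is strictly "more than n", n = 1 keeps only the row that is extreme in two columns, and n = 2 would require a row to be extreme in all three.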
https://www.itl.nist.gov/div898/handbook/prc/section4/prc471.htm
The idea is that outlying values generally trend together. For example, if you have a set of weights and heights, you would expect height and weight to trend together.
This is basically a naive attempt at outlier classification without using ML.
Why do you select observations containing more than 2 outliers?