Last active
March 31, 2021 21:53
Remove outliers using numpy. Normally, an outlier is a value more than 1.5 * IQR below the lower quartile or above the upper quartile, but experimental analysis has shown that a higher or lower multiplier might produce more accurate results. Interestingly, after 1000 runs, removing outliers creates a larger standard deviation between test run results.
import numpy as np

def removeOutliers(x, outlierConstant):
    a = np.array(x)
    upper_quartile = np.percentile(a, 75)
    lower_quartile = np.percentile(a, 25)
    # Despite the name, IQR here is the interquartile range scaled by outlierConstant
    IQR = (upper_quartile - lower_quartile) * outlierConstant
    # Lower and upper bounds; values outside them are treated as outliers
    quartileSet = (lower_quartile - IQR, upper_quartile + IQR)
    resultList = []
    for y in a.tolist():
        if y >= quartileSet[0] and y <= quartileSet[1]:
            resultList.append(y)
    return resultList
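To see what the constant does in practice, here is a minimal usage sketch. The helper name `remove_outliers_demo`, the sample data, and the value 15.0 are made up for illustration; 1.5 is the conventional multiplier mentioned in the description. The helper mirrors the logic of `removeOutliers` so the snippet runs on its own.

```python
import numpy as np

def remove_outliers_demo(x, outlier_constant):
    # Hypothetical helper that mirrors removeOutliers from the gist
    a = np.array(x)
    upper_quartile = np.percentile(a, 75)
    lower_quartile = np.percentile(a, 25)
    # Distance allowed beyond each quartile
    fence = (upper_quartile - lower_quartile) * outlier_constant
    low, high = lower_quartile - fence, upper_quartile + fence
    return [y for y in a.tolist() if low <= y <= high]

# Illustrative data: one value (40) sits far from the rest
data = [10, 12, 11, 13, 12, 11, 14, 40]

strict = remove_outliers_demo(data, 1.5)   # conventional multiplier
loose = remove_outliers_demo(data, 15.0)   # much wider bounds

print(strict)
print(loose)
```

A smaller constant narrows the accepted range and drops more points; a larger one widens it and keeps more. Which value is appropriate depends on the data, so it is worth trying a few and inspecting what gets removed.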
How do I decide what the constant is?
Hi, here is my suggestion to take advantage of numpy's speed instead of a Python loop with a growing list. With big arrays the difference in time is noticeable.
def removeOutliers(x, outlierConstant):
    a = np.array(x)
    upper_quartile = np.percentile(a, 75)
    lower_quartile = np.percentile(a, 25)
    IQR = (upper_quartile - lower_quartile) * outlierConstant
    quartileSet = (lower_quartile - IQR, upper_quartile + IQR)
    result = a[np.where((a >= quartileSet[0]) & (a <= quartileSet[1]))]
    return result.tolist()
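As a possible further simplification, the `np.where` call is not strictly needed, since a boolean mask can index the array directly. The sketch below is one way to write that; the name `remove_outliers_mask` and the sample data are hypothetical, and it is meant to return the same values as the version above.

```python
import numpy as np

def remove_outliers_mask(x, outlier_constant):
    # Hypothetical variant: same bounds as removeOutliers, boolean-mask indexing
    a = np.asarray(x)
    # Passing a list of percentiles returns both quartiles in one call
    lower_quartile, upper_quartile = np.percentile(a, [25, 75])
    fence = (upper_quartile - lower_quartile) * outlier_constant
    mask = (a >= lower_quartile - fence) & (a <= upper_quartile + fence)
    return a[mask].tolist()

# Illustrative data, not from the gist
sample = [10, 12, 11, 13, 12, 11, 14, 40]
filtered = remove_outliers_mask(sample, 1.5)
print(filtered)
```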
Thanks, @adrian-alberto! Updated
Did you mean 0.25 and 0.75 rather than 25 and 75? Percentiles go from 0 to 100. Thanks for the code.
@marcoruizrueda
What you are talking about are quantiles.
0th quartile = 0 quantile = 0th percentile
1st quartile = 0.25 quantile = 25th percentile
2nd quartile = 0.5 quantile = 50th percentile (median)
3rd quartile = 0.75 quantile = 75th percentile
4th quartile = 1 quantile = 100th percentile
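The correspondence above can be checked in numpy, assuming a version that provides `np.quantile` (1.15 or later). `np.percentile` takes values from 0 to 100, while `np.quantile` takes values from 0 to 1. The array below is made up for illustration.

```python
import numpy as np

# Illustrative data: nine evenly spaced values
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])

results = {}
for q in (0.0, 0.25, 0.5, 0.75, 1.0):
    as_quantile = np.quantile(a, q)            # q on a 0..1 scale
    as_percentile = np.percentile(a, q * 100)  # same point on a 0..100 scale
    results[q] = (float(as_quantile), float(as_percentile))
    print(q, as_quantile, as_percentile)
```

So `np.percentile(a, 25)` in the gist is the lower quartile, and passing 0.25 there would instead ask for the 0.25th percentile.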
What is the outlier constant?
Thanks for posting. I need this code for a dataframe too. I will try to modify it for my case.