Created
April 5, 2022 15:31
A Python function that uses Chauvenet's Criterion to filter outliers from a dataset and returns only the reasonable values.
import numpy
from scipy.special import erfc


def filter_outliers(datapoints):
    """Run Chauvenet's Criterion to remove outliers

    @See: https://www.statisticshowto.com/chauvenets-criterion/
    @See: https://github.com/msproteomicstools/msproteomicstools/blob/master/msproteomicstoolslib/math/chauvenet.py

    Args:
        datapoints (list): Array of datapoints from which to filter the outliers

    Returns:
        list: Valid datapoints with outliers removed
    """
    # Nothing to filter in an empty list; this also avoids dividing by zero below
    if len(datapoints) == 0:
        return []

    # Chauvenet's threshold: reject values whose two-tailed probability is below 1/(2N)
    criterion = 1.0/(2*len(datapoints))
    valid_datapoints = []

    # Step 1: Determine sample mean
    mean = numpy.mean(datapoints)

    # Step 2: Calculate standard deviation of sample
    # (numpy.std defaults to the population form, ddof=0)
    standard_deviation = numpy.std(datapoints)

    # All values are identical, so none of them can be an outlier
    if standard_deviation == 0:
        return list(datapoints)

    # Step 3: For each value, calculate distance to mean in standard deviations
    # Compare to criterion and store those that pass in valid_datapoints
    for datapoint in datapoints:
        distance = abs(datapoint-mean)/standard_deviation  # Distance of a value to mean in stdv's
        distance /= 2.0**0.5                               # Scale by sqrt(2) so erfc yields the two-tailed probability
        probability = erfc(distance)                       # Two-tailed area under the normal distribution
        if probability >= criterion:
            valid_datapoints.append(datapoint)             # Store only non-outliers

    return valid_datapoints


# Let's use the function to filter some outliers from a list
mylist = [745, 801, 129876, 793, 698]
valid_list = filter_outliers(mylist)
print(valid_list)
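
For larger datasets, the same test can be applied to the whole array at once instead of looping over each value. The following is a minimal sketch, not part of the original gist: the names `chauvenet_mask`, `data`, and `mask` are hypothetical, and it assumes the same population standard deviation (`ddof=0`) and the same `1/(2N)` threshold used above.

```python
import numpy as np
from scipy.special import erfc


def chauvenet_mask(values):
    """Return a boolean mask that is True for values Chauvenet's Criterion keeps.

    Hypothetical helper for illustration; mirrors the loop-based filter above.
    """
    values = np.asarray(values, dtype=float)
    n = values.size
    if n == 0:
        # Empty input: return an empty mask
        return np.zeros(0, dtype=bool)

    std = values.std()  # population standard deviation (ddof=0), as numpy.std uses by default
    if std == 0:
        # All values identical: keep everything
        return np.ones(n, dtype=bool)

    # Distance of every value to the mean, in standard deviations
    z = np.abs(values - values.mean()) / std
    # Two-tailed probability of a deviation at least this large
    probability = erfc(z / np.sqrt(2.0))
    # Keep values whose probability meets the 1/(2N) threshold
    return probability >= 1.0 / (2 * n)


data = np.array([745, 801, 129876, 793, 698])
mask = chauvenet_mask(data)
print(data[mask])
```

Returning a mask rather than the filtered values keeps the positions of the rejected points available, which can be handy when the outliers themselves need to be inspected or when a parallel array has to be filtered the same way.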