Find cgMLST alleles unique to a group/cluster
GROUP | LOCUS1 | LOCUS2 | LOCUS3
---|---|---|---
A | 1 | 1 | 2
A | 2 | 1 | 2
B | 1 | 3 | 2
B | 2 | 4 | 2
B | 2 | 4 | 1
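The script below reads this table from a delimited text file; pd.read_csv is called with sep=None so the delimiter is sniffed automatically. As a sketch, a comma-separated version of the dummy data could be saved and passed to the script on the command line (for example, python find_unique.py data.csv, where both file names are only assumptions):

GROUP,LOCUS1,LOCUS2,LOCUS3
A,1,1,2
A,2,1,2
B,1,3,2
B,2,4,2
B,2,4,1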
'''
Find cgMLST alleles unique to a group.
'''
import click
import pandas as pd


def load_data(filename):
    '''
    Load a delimited text file into a pandas.DataFrame
    (sep=None lets pandas sniff the delimiter).
    '''
    return pd.read_csv(filename, engine="python", sep=None)


def find_unique(tab):
    '''
    1. One-hot encode the LOCUS columns.
    2. Group by the "GROUP" variable --- this assumes exactly two groups,
       an IN and an OUT group (here, A and B).
    3. For each group, compute the proportion of individuals that carry each allele.
    4. Taking A as the group of interest, flag the alleles whose proportion
       is 1.0 in group A and 0.0 in group B.
    5. Return a Boolean vector, one entry per allele.
    '''
    locus_cols = list(tab.columns[tab.columns.str.startswith("LOCUS")])
    tab_one = pd.get_dummies(tab, columns=locus_cols)
    # Proportion of isolates in each group carrying each allele.
    props = tab_one.groupby("GROUP").mean()
    # An allele is unique to A if every A isolate has it and no B isolate does.
    return props.apply(lambda col: col["A"] == 1.0 and col["B"] == 0.0)


@click.command()
@click.argument("input_file")
def run_finder(input_file):
    tab = load_data(input_file)
    unq = find_unique(tab)
    print(unq)


if __name__ == "__main__":
    run_finder()
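To make the steps listed in the find_unique docstring concrete, here is a small self-contained sketch (not part of the gist) that builds the dummy table in memory and prints the per-group allele proportions before the final uniqueness test:

import pandas as pd

# Dummy table from above, built in memory instead of being read from a file.
tab = pd.DataFrame({
    "GROUP":  ["A", "A", "B", "B", "B"],
    "LOCUS1": [1, 2, 1, 2, 2],
    "LOCUS2": [1, 1, 3, 4, 4],
    "LOCUS3": [2, 2, 2, 2, 1],
})

# Step 1: one-hot encode the LOCUS columns; LOCUS2 becomes LOCUS2_1, LOCUS2_3, LOCUS2_4.
tab_one = pd.get_dummies(tab, columns=["LOCUS1", "LOCUS2", "LOCUS3"])

# Steps 2-3: proportion of isolates in each group that carry each allele.
props = tab_one.groupby("GROUP").mean()
print(props)

# Steps 4-5: alleles present in every A isolate and absent from every B isolate.
print(props.apply(lambda col: col["A"] == 1.0 and col["B"] == 0.0))

In the proportions table, each row is a group and each column an allele such as LOCUS2_1; an allele counts as unique to group A when its proportion is 1.0 in the A row and 0.0 in the B row.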
When running this script with the dummy data, you get:
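With the cleaned-up script above, the printed Boolean Series looks roughly like this (exact formatting varies between pandas versions):

LOCUS1_1    False
LOCUS1_2    False
LOCUS2_1     True
LOCUS2_3    False
LOCUS2_4    False
LOCUS3_1    False
LOCUS3_2    False
dtype: bool

That is, allele 1 at LOCUS2 is the only allele carried by every group A isolate and by no group B isolate.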