Marcus Fedarko fedarko

@fedarko
fedarko / handle_duplicate_sample_ids.py
Last active December 16, 2019 22:12
Script to report on duplicate IDs in a plate map spreadsheet (and modify certain duplicate IDs, in a very specific case); also attempts to update Qiita prep files accordingly. As a warning, code is untested / pretty gross.
#! /usr/bin/env python3
import os
from collections import Counter
from math import ceil
import re
from numpy import argmax
import pandas as pd
from qiime2 import Metadata
# "Parameters" of this script
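The preview above imports `Counter`, which suggests the duplicate report is built by counting IDs. A minimal sketch of that idea follows, assuming the sample IDs have already been read into a plain list; the helper name `find_duplicate_ids` and the example IDs are hypothetical and are not taken from the script.

```python
from collections import Counter


def find_duplicate_ids(sample_ids):
    """Return a dict mapping each ID that occurs more than once to its count."""
    counts = Counter(sample_ids)
    return {sample_id: n for sample_id, n in counts.items() if n > 1}


# Hypothetical plate map IDs, for illustration only
ids = ["A1", "B2", "A1", "C3", "B2", "A1"]
for sample_id, n in find_duplicate_ids(ids).items():
    print("{} appears {} times".format(sample_id, n))
```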
fedarko / find_missing_dates.py
Last active December 10, 2019 00:54
In a timeseries metadata file, finds all days that are not "represented" by at least one sample in the metadata
#! /usr/bin/env python3
from dateutil.parser import parse
import pandas as pd
df = pd.read_csv("20191209_metadata.txt", sep="\t", index_col=0)
# Subset to a certain host subject ID, if desired
df = df[df["host_subject_id"] == "M03"]
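One way to find unrepresented days is to walk every calendar day between the earliest and latest sampled day and keep the ones with no sample. The sketch below assumes the dates have already been parsed into `datetime.date` objects (the script itself uses `dateutil` and pandas for that step); `find_missing_days` is a hypothetical name, and only the standard library is used so the example runs on its own.

```python
from datetime import date, timedelta


def find_missing_days(sampled_days):
    """Return the days between the first and last sampled day that have no sample."""
    sampled = set(sampled_days)
    if not sampled:
        return []
    first, last = min(sampled), max(sampled)
    # Every calendar day in the closed range [first, last]
    all_days = (first + timedelta(days=i) for i in range((last - first).days + 1))
    return [day for day in all_days if day not in sampled]


# Hypothetical sampling days; repeated days are fine
days = [date(2019, 12, 1), date(2019, 12, 4), date(2019, 12, 1)]
for missing in find_missing_days(days):
    print(missing.isoformat())
```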
fedarko / negative_control_stats.py
Last active October 27, 2019 23:55
Search taxonomies of negative controls
#! /usr/bin/env python3
"""
This is a small script that looks through the annotated taxonomies of all
features present in a dataset's negative control samples. It's handy for
checking that certain features are (for the most part) absent from these
samples.
This obviously isn't a very formal way of accounting for contamination,
but it is useful for quickly verifying that certain taxa are probably not
the product of contamination. (Better approaches include e.g. the decontam
fedarko / convert_timestamp_to_ordinal_date.py
Created October 8, 2019 22:26
Adds an ordinal date field, based on a timestamp, to a QIIME 2 metadata file
#! /usr/bin/env python3
from qiime2 import Metadata
from dateutil.parser import parse
from dateutil.relativedelta import relativedelta
m = Metadata.load("metadata-with-age.tsv")
m_df = m.to_dataframe()
m_df["ordinal-timestamp"] = 0
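The preview stops before the conversion itself, so the following is only a sketch of one plausible reading: "ordinal date" is taken to mean the proleptic Gregorian ordinal returned by `datetime.date.toordinal()`. Both that interpretation and the timestamp format `%Y-%m-%d %H:%M` are assumptions, and `timestamp_to_ordinal` is a hypothetical helper.

```python
from datetime import datetime


def timestamp_to_ordinal(timestamp, fmt="%Y-%m-%d %H:%M"):
    """Convert a timestamp string to the ordinal of its calendar date.

    Assumption: the time of day is ignored, so two samples taken on the
    same day get the same ordinal.
    """
    return datetime.strptime(timestamp, fmt).date().toordinal()


# Consecutive days differ by exactly one
print(timestamp_to_ordinal("2019-10-09 00:05") - timestamp_to_ordinal("2019-10-08 22:26"))
```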
fedarko / convert_timestamp_to_days_elapsed.py
Created October 5, 2019 01:02
For a QIIME 2-formatted sample metadata file, uses the collection_timestamp field to assign samples a "days since first day" field. This field is useful for visualizations like q2-longitudinal's volatility plots.
#! /usr/bin/env python3
from qiime2 import Metadata
from dateutil.parser import parse
m = Metadata.load("metadata-with-age-and-ordinal-timestamp.tsv")
m_df = m.to_dataframe()
# Compute earliest date
min_date = None
for sample_id in m_df.index:
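The "days since first day" computation can be sketched in two passes: find the earliest date, then subtract it from each sample's date. The version below assumes the timestamps are available as a dict of strings in the format `%Y-%m-%d %H:%M`; that format, the dict input, and the name `days_since_first` are assumptions rather than details from the script, which works on a QIIME 2 `Metadata` object.

```python
from datetime import datetime


def days_since_first(timestamps, fmt="%Y-%m-%d %H:%M"):
    """Map each sample ID to the number of whole days since the earliest sample date."""
    dates = {
        sample_id: datetime.strptime(ts, fmt).date()
        for sample_id, ts in timestamps.items()
    }
    # Compute the earliest date, then measure every sample against it
    min_date = min(dates.values())
    return {sample_id: (d - min_date).days for sample_id, d in dates.items()}


# Hypothetical samples, for illustration only
example = {
    "s1": "2019-10-01 08:00",
    "s2": "2019-10-04 23:30",
    "s3": "2019-11-01 00:05",
}
print(days_since_first(example))
```

Comparing calendar dates rather than full timestamps keeps the result an integer, which is what a per-day axis in a plot usually expects.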
fedarko / gh_url_to_raw_gh_url.py
Created October 2, 2019 22:10
Convert a github file URL to a raw.githubusercontent.com URL (that can be directly accessed for things like view.qiime2.org or wget)
# your link goes here
link = "https://github.com/knightlab-analyses/qurro-mackerel-analysis/blob/master/AnalysisOutput/qurro-plot.qzv"
# note: this will break if a repo/organization or subfolder is named "blob" -- would be ideal to use a fancy regex
# to be more precise here
print(link.replace("github.com", "raw.githubusercontent.com").replace("/blob/", "/"))
# example output link:
# https://raw.githubusercontent.com/knightlab-analyses/qurro-mackerel-analysis/master/AnalysisOutput/qurro-plot.qzv
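The comment above mentions that a regex would be more precise than chained `replace()` calls. A sketch of that approach follows: it anchors the pattern so that only the `/blob/` segment immediately after the owner and repository names is dropped. The function name `to_raw_url` and the pattern are illustrative, and the sketch assumes links of the form `https://github.com/<owner>/<repo>/blob/<ref>/<path>`.

```python
import re

# owner and repo are single path segments; everything after /blob/ is kept as-is
GITHUB_BLOB_URL = re.compile(
    r"^https://github\.com/(?P<owner>[^/]+)/(?P<repo>[^/]+)/blob/(?P<rest>.+)$"
)


def to_raw_url(link):
    """Convert a github.com file URL to its raw.githubusercontent.com equivalent."""
    match = GITHUB_BLOB_URL.match(link)
    if match is None:
        raise ValueError("Not a github.com file URL: {}".format(link))
    return "https://raw.githubusercontent.com/{owner}/{repo}/{rest}".format(
        **match.groupdict()
    )


link = "https://github.com/knightlab-analyses/qurro-mackerel-analysis/blob/master/AnalysisOutput/qurro-plot.qzv"
print(to_raw_url(link))
```

Because the pattern consumes the owner and repository segments first, an organization, repository, or subfolder named "blob" no longer changes the result.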
fedarko / split_metadata_by_run.py
Last active September 26, 2019 22:27
Splits up a QIIME 2 metadata file into separate metadata files, such that there is one file per specified "run" column. This is useful if multiple samples from different runs share barcode sequences, which can make QIIME 2 angry.
# NOTE: Assumes that there's a SAMPLE_METADATA environment variable declared pointing to a metadata file
# NOTE: Assumes that this metadata file contains BarcodeSequence and seq_run_ord columns
import pandas as pd
import os
md = pd.read_csv(os.environ["SAMPLE_METADATA"], sep="\t", index_col=0)
print("There are {} unique barcode sequences in this metadata file.".format(len(md["BarcodeSequence"].unique())))
runs = tuple(md["seq_run_ord"].unique())
print("Also, the {} runs listed in this metadata file are {}.".format(len(runs), runs))
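The preview ends before the split itself. As a rough sketch of the grouping step, the example below collects rows by their `seq_run_ord` value using only the standard library and an in-memory table; in the script's own pandas setting, `DataFrame.groupby` would be the natural tool. The helper name `split_rows_by_run`, the `sample-id` column name, and the example data are hypothetical.

```python
import csv
import io
from collections import defaultdict


def split_rows_by_run(tsv_text, run_column="seq_run_ord"):
    """Group the rows of a tab-separated metadata table by their run value."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows_by_run = defaultdict(list)
    for row in reader:
        rows_by_run[row[run_column]].append(row)
    return dict(rows_by_run)


# Hypothetical metadata: the barcode ACGT is reused across runs 1 and 2
example_tsv = (
    "sample-id\tBarcodeSequence\tseq_run_ord\n"
    "s1\tACGT\t1\n"
    "s2\tTTGA\t1\n"
    "s3\tACGT\t2\n"
)
for run, rows in split_rows_by_run(example_tsv).items():
    barcodes = [row["BarcodeSequence"] for row in rows]
    # Within a single run, every barcode should be unique
    print(run, len(barcodes) == len(set(barcodes)))
```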
fedarko / add_age_column_to_metadata.py
Last active August 30, 2019 06:25
Adds an "age in years" column to a QIIME 2 sample metadata file
#! /usr/bin/env python3
from qiime2 import Metadata
from dateutil.parser import parse
from dateutil.relativedelta import relativedelta
subject_id = "HOST SUBJECT ID"
subject_birthday = "HOST BIRTHDAY"
subject_birthday_datetime = parse(subject_birthday)
age_col_name = "subject_age_years"
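The script relies on `dateutil.relativedelta` for the age calculation, and the preview does not show whether ages are whole or fractional. The sketch below illustrates the whole-years reading with only the standard library: subtract the birth year, then take one year off if the birthday has not yet occurred in the sample's year. The function name `age_in_years` and the dates are hypothetical.

```python
from datetime import date


def age_in_years(birthday, on_date):
    """Return the number of completed years between birthday and on_date."""
    years = on_date.year - birthday.year
    # Tuple comparison checks whether the birthday is still ahead in on_date's year
    if (on_date.month, on_date.day) < (birthday.month, birthday.day):
        years -= 1
    return years


# Hypothetical birthday, for illustration only
birthday = date(1990, 8, 30)
print(age_in_years(birthday, date(2019, 8, 29)))
print(age_in_years(birthday, date(2019, 8, 30)))
```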
fedarko / validate_sample_ids_and_timestamps.py
Last active August 21, 2019 22:49
Compares "Qiita-style" sample IDs containing dates with the collection_timestamp dates in a metadata file.
#! /usr/bin/env python3
import re
import pandas as pd
from dateutil.parser import parse
m = pd.read_csv("metadata.tsv", sep='\t', index_col=0)
# Find all sample IDs in the metadata that include a "date"
# We assume that sample IDs that start with a 5-digit Qiita ID, then a period,
# then a two-character host ID string, then another period, will follow this
# convention.
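The prefix convention described in the comments above can be expressed as a regular expression. The sketch below only recognizes the prefix (five digits, a period, two non-period characters, a period) and hands back the remainder of the ID; how the date inside that remainder is laid out is not shown in the preview, so no attempt is made to parse it. The names `DATED_ID_PREFIX` and `split_dated_sample_id`, and the example IDs, are hypothetical.

```python
import re

# 5-digit Qiita ID, period, two-character host ID, period
DATED_ID_PREFIX = re.compile(r"^\d{5}\.[^.]{2}\.")


def split_dated_sample_id(sample_id):
    """Return (prefix, remainder) if the ID follows the convention, else None."""
    match = DATED_ID_PREFIX.match(sample_id)
    if match is None:
        return None
    return match.group(0), sample_id[match.end():]


# Hypothetical IDs, for illustration only
for sample_id in ["12345.M3.2019.08.21", "12345.blank.1"]:
    print(sample_id, split_dated_sample_id(sample_id))
```

The remainder can then be passed to whatever date parser matches the study's ID format and compared with the parsed `collection_timestamp` for the same row.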
fedarko / validate_collection_date_and_timestamp.py
Last active July 25, 2019 21:05
In a tab-separated metadata file, validates that every collection_timestamp value starts with that row's collection_date value
#! /usr/bin/env python3
# note that this is very preliminary + untested code
# We use pd.read_csv() because, unlike QIIME 2's Metadata object, it allows
# duplicate sample IDs.
import pandas as pd
import sys
if len(sys.argv) < 2:
    raise ValueError("You need to specify a metadata file to check.")
df = pd.read_csv(sys.argv[1], sep='\t', index_col=0)
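The check described for this script amounts to a per-row string prefix test. A minimal sketch follows, assuming each row has been reduced to a `(sample_id, collection_date, collection_timestamp)` tuple of strings; a list of tuples is used instead of a dict so that duplicate sample IDs are preserved, in the same spirit as the `pd.read_csv()` note above. The function name `find_mismatched_rows` and the example rows are hypothetical.

```python
def find_mismatched_rows(rows):
    """Return the rows whose collection_timestamp does not start with collection_date."""
    return [
        (sample_id, collection_date, collection_timestamp)
        for sample_id, collection_date, collection_timestamp in rows
        if not collection_timestamp.startswith(collection_date)
    ]


# Hypothetical rows; note the repeated sample ID "s1"
rows = [
    ("s1", "2019-07-25", "2019-07-25 21:05"),
    ("s1", "2019-07-25", "2019-07-26 09:00"),
    ("s2", "2019-07-24", "2019-07-24"),
]
for bad_row in find_mismatched_rows(rows):
    print("Mismatch:", bad_row)
```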