fedarko’s gists

fedarko / handle_duplicate_sample_ids.py

Last active December 16, 2019 22:12

Reports duplicate IDs in a plate map spreadsheet (and, in one very specific case, modifies certain duplicate IDs); also attempts to update Qiita prep files accordingly. Warning: this code is untested and fairly rough.

	#! /usr/bin/env python3
	import os
	from collections import Counter
	from math import ceil
	import re
	from numpy import argmax
	import pandas as pd
	from qiime2 import Metadata

	# "Parameters" of this script
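
The duplicate-reporting step can be sketched with `collections.Counter`, which the script imports. This is a minimal illustration only: the helper name `find_duplicate_ids` and the sample IDs are hypothetical, and it does not reproduce the script's ID-modification or prep-file logic.

```python
from collections import Counter


def find_duplicate_ids(sample_ids):
    """Map each ID that occurs more than once to its number of occurrences."""
    counts = Counter(sample_ids)
    return {sid: n for sid, n in counts.items() if n > 1}


# Hypothetical sample IDs, as they might be read from a plate map
plate_map_ids = ["S1", "S2", "S1", "BLANK", "S3", "BLANK", "BLANK"]
duplicates = find_duplicate_ids(plate_map_ids)
print(duplicates)
```

IDs that appear exactly once are left out of the result, so an empty dictionary means the plate map has no duplicates.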

fedarko / find_missing_dates.py

Last active December 10, 2019 00:54

In a timeseries metadata file, finds all days that are not "represented" by at least one sample in the metadata

	#! /usr/bin/env python3
	from dateutil.parser import parse
	import pandas as pd


	df = pd.read_csv("20191209_metadata.txt", sep="\t", index_col=0)

	# Subset to a certain host subject ID, if desired
	df = df[df["host_subject_id"] == "M03"]
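
One way to find unrepresented days, sketched with the standard library only: walk every calendar day between the earliest and latest sampled day and keep those with no sample. The function name `find_missing_days` and the dates are hypothetical, and the sketch assumes the collection dates have already been parsed into `datetime.date` objects.

```python
from datetime import date, timedelta


def find_missing_days(sampled_days):
    """Return, in ascending order, each day between the earliest and latest
    sampled day (inclusive) that has no sample."""
    present = set(sampled_days)
    if not present:
        return []
    first, last = min(present), max(present)
    all_days = (first + timedelta(days=i) for i in range((last - first).days + 1))
    return [day for day in all_days if day not in present]


# Hypothetical collection dates; repeated days are fine
sampled = [date(2019, 12, 1), date(2019, 12, 2), date(2019, 12, 5), date(2019, 12, 5)]
missing = find_missing_days(sampled)
print([day.isoformat() for day in missing])
```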

fedarko / negative_control_stats.py

Last active October 27, 2019 23:55

Search taxonomies of negative controls

	#! /usr/bin/env python3
	"""
	This is a small script that looks through the annotated taxonomies of all
	features present in a dataset's negative control samples. It's handy for
	checking that certain features are (for the most part) absent from these
	samples.

	This obviously isn't a very formal way of accounting for contamination,
	but it is useful for quickly verifying that certain taxa are probably not
	the product of contamination. (Better approaches include e.g. the decontam

fedarko / convert_timestamp_to_ordinal_date.py

Created October 8, 2019 22:26

Adds an ordinal date field, derived from a timestamp, to a QIIME 2 metadata file

	#! /usr/bin/env python3
	from qiime2 import Metadata
	from dateutil.parser import parse
	from dateutil.relativedelta import relativedelta

	m = Metadata.load("metadata-with-age.tsv")
	m_df = m.to_dataframe()

	m_df["ordinal-timestamp"] = 0
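
The preview does not show how the ordinal value is computed. One plausible reading, sketched below, is Python's proleptic Gregorian ordinal (`toordinal()`), which numbers days consecutively so that differences between values are day counts. The function name `timestamp_to_ordinal` is hypothetical, and the sketch assumes ISO-formatted timestamps rather than the free-form strings that `dateutil.parser.parse` accepts.

```python
from datetime import date, datetime


def timestamp_to_ordinal(timestamp):
    """Convert an ISO-formatted timestamp string to a day ordinal.

    The time of day is ignored; only the calendar date matters.
    """
    return datetime.fromisoformat(timestamp).toordinal()


# Hypothetical timestamps on two consecutive calendar days
first = timestamp_to_ordinal("2019-10-08 22:26")
second = timestamp_to_ordinal("2019-10-09 01:00")
print(second - first)
print(date.fromordinal(first).isoformat())
```

If "ordinal date" instead means day of the year, `datetime.timetuple().tm_yday` would be the relevant value.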

fedarko / convert_timestamp_to_days_elapsed.py

Created October 5, 2019 01:02

For a QIIME 2-formatted sample metadata file, uses the collection_timestamp field to assign samples a "days since first day" field. This field is useful for visualizations like q2-longitudinal's volatility plots.

	#! /usr/bin/env python3
	from qiime2 import Metadata
	from dateutil.parser import parse

	m = Metadata.load("metadata-with-age-and-ordinal-timestamp.tsv")
	m_df = m.to_dataframe()

	# Compute earliest date
	min_date = None
	for sample_id in m_df.index:
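
The "days since first day" idea can be sketched without QIIME 2: find the earliest collection date, then subtract it from each sample's date. The function name `days_since_first` and the sample data are hypothetical, and the sketch assumes ISO-formatted `collection_timestamp` strings.

```python
from datetime import datetime


def days_since_first(timestamps):
    """Map each sample ID to whole days elapsed since the earliest sample.

    `timestamps` maps sample IDs to ISO-formatted timestamp strings.
    Only calendar dates are compared, so the time of day has no effect.
    """
    days = {sid: datetime.fromisoformat(ts).date() for sid, ts in timestamps.items()}
    first_day = min(days.values())
    return {sid: (day - first_day).days for sid, day in days.items()}


# Hypothetical sample IDs and collection timestamps
elapsed = days_since_first({
    "s1": "2019-10-01 23:50",
    "s2": "2019-10-02 00:10",
    "s3": "2019-10-05 01:02",
})
print(elapsed)
```

Comparing dates rather than full timestamps means two samples taken twenty minutes apart across midnight still count as one day apart.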

fedarko / gh_url_to_raw_gh_url.py

Created October 2, 2019 22:10

Converts a GitHub file URL to a raw.githubusercontent.com URL (which can be accessed directly by tools like view.qiime2.org or wget)

	# your link goes here
	link = "https://github.com/knightlab-analyses/qurro-mackerel-analysis/blob/master/AnalysisOutput/qurro-plot.qzv"

	# note: this will break if a repo/organization or subfolder is named "blob" -- would be ideal to use a fancy regex
	# to be more precise here
	print(link.replace("github.com", "raw.githubusercontent.com").replace("/blob/", "/"))

	# example output link:
	# https://raw.githubusercontent.com/knightlab-analyses/qurro-mackerel-analysis/master/AnalysisOutput/qurro-plot.qzv
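
The note above suggests a regex would be more precise. A minimal sketch of that idea follows: it anchors the match so that only the `/blob/` segment directly after the owner and repository names is dropped, leaving any later `blob` folder in the path untouched. The names `GITHUB_BLOB_URL` and `to_raw_url` are hypothetical.

```python
import re

# owner and repo are single path segments; everything after /blob/ is kept as-is
GITHUB_BLOB_URL = re.compile(
    r"^https://github\.com/(?P<owner>[^/]+)/(?P<repo>[^/]+)/blob/(?P<rest>.+)$"
)


def to_raw_url(link):
    """Convert a github.com file URL to its raw.githubusercontent.com form."""
    match = GITHUB_BLOB_URL.match(link)
    if match is None:
        raise ValueError("Not a github.com file URL: {}".format(link))
    return "https://raw.githubusercontent.com/{owner}/{repo}/{rest}".format(
        **match.groupdict()
    )


link = "https://github.com/knightlab-analyses/qurro-mackerel-analysis/blob/master/AnalysisOutput/qurro-plot.qzv"
print(to_raw_url(link))
```

Unlike the chained `replace()` calls, this raises a `ValueError` for links that are not file URLs instead of returning a silently wrong result.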

fedarko / split_metadata_by_run.py

Last active September 26, 2019 22:27

Splits up a QIIME 2 metadata file into separate metadata files, such that there is one file per specified "run" column. This is useful if multiple samples from different runs share barcode sequences, which can make QIIME 2 angry.

	# NOTE: Assumes that there's a SAMPLE_METADATA environment variable declared pointing to a metadata file
	# NOTE: Assumes that this metadata file contains BarcodeSequence and seq_run_ord columns
	import pandas as pd
	import os


	md = pd.read_csv(os.environ["SAMPLE_METADATA"], sep="\t", index_col=0)
	print("There are {} unique barcode sequences in this metadata file.".format(len(md["BarcodeSequence"].unique())))
	runs = tuple(md["seq_run_ord"].unique())
	print("Also, the {} runs listed in this metadata file are {}.".format(len(runs), runs))
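
The splitting step itself is not shown in the preview. A dependency-free sketch of the idea is to group rows by their `seq_run_ord` value; each group could then be written to its own file. The function name `split_rows_by_run`, the `#SampleID` header, and the example rows are hypothetical.

```python
import csv
import io
from collections import defaultdict


def split_rows_by_run(tsv_text, run_column="seq_run_ord"):
    """Group the rows of a tab-separated metadata table by run.

    Returns a dict mapping each run value (as a string) to its list of rows.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    by_run = defaultdict(list)
    for row in reader:
        by_run[row[run_column]].append(row)
    return dict(by_run)


# Hypothetical metadata: s1 and s3 share a barcode but belong to different runs
example_tsv = (
    "#SampleID\tBarcodeSequence\tseq_run_ord\n"
    "s1\tACGT\t1\n"
    "s2\tTTGA\t1\n"
    "s3\tACGT\t2\n"
)
runs = split_rows_by_run(example_tsv)
for run, rows in runs.items():
    print(run, [row["#SampleID"] for row in rows])
```

Within each resulting group the barcode sequences are unique in this example, which is the property the per-run files are meant to restore.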

fedarko / add_age_column_to_metadata.py

Last active August 30, 2019 06:25

Adds an "age in years" column to a QIIME 2 sample metadata file

	#! /usr/bin/env python3
	from qiime2 import Metadata
	from dateutil.parser import parse
	from dateutil.relativedelta import relativedelta

	subject_id = "HOST SUBJECT ID"
	subject_birthday = "HOST BIRTHDAY"
	subject_birthday_datetime = parse(subject_birthday)
	age_col_name = "subject_age_years"
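
The age computation is cut off in the preview. As a rough sketch, a whole-year age can be computed by subtracting the years and then taking one off if the birthday has not yet occurred that year. The function name `age_in_years` and the dates are hypothetical, and the script itself may compute age differently (for example with `relativedelta`, or as a fractional value).

```python
from datetime import date


def age_in_years(birthday, on_date):
    """Return the number of full years between birthday and on_date."""
    years = on_date.year - birthday.year
    # Subtract a year if this year's birthday has not happened yet
    if (on_date.month, on_date.day) < (birthday.month, birthday.day):
        years -= 1
    return years


# Hypothetical birthday and collection dates
birthday = date(1990, 8, 30)
print(age_in_years(birthday, date(2019, 8, 29)))
print(age_in_years(birthday, date(2019, 8, 30)))
```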

fedarko / validate_sample_ids_and_timestamps.py

Last active August 21, 2019 22:49

Compares "Qiita-style" sample IDs containing dates with the collection_timestamp dates in a metadata file.

	#! /usr/bin/env python3
	import re
	import pandas as pd
	from dateutil.parser import parse

	m = pd.read_csv("metadata.tsv", sep='\t', index_col=0)
	# Find all sample IDs in the metadata that include a "date"
	# We assume that sample IDs that start with a 5-digit Qiita ID, then a period,
	# then a two-character host ID string, then another period, will follow this
	# convention.
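
The convention described in the comments above can be expressed as a regular expression. The sketch below only detects the prefix (five digits, a period, two characters, a period) and returns whatever follows it; the format of the date portion is not shown in the preview, so no date parsing is attempted. The names `QIITA_STYLE_PREFIX` and `remainder_after_prefix`, and the example IDs, are hypothetical.

```python
import re

# 5-digit Qiita ID, period, two-character host ID, period, then the remainder
QIITA_STYLE_PREFIX = re.compile(r"^\d{5}\.[^.]{2}\.(?P<rest>.+)$")


def remainder_after_prefix(sample_id):
    """Return the text after the Qiita-style prefix, or None if it is absent."""
    match = QIITA_STYLE_PREFIX.match(sample_id)
    return match.group("rest") if match else None


# Hypothetical sample IDs
for sample_id in ["12345.AB.2019.08.21", "12345.ABC.2019.08.21", "1234.AB.x"]:
    print(sample_id, remainder_after_prefix(sample_id))
```

The remainder can then be handed to a date parser and compared with the row's `collection_timestamp` value.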

fedarko / validate_collection_date_and_timestamp.py

Last active July 25, 2019 21:05

In a tab-separated metadata file, validates that every collection_timestamp value starts with that row's collection_date value

	#! /usr/bin/env python3
	# note that this is very preliminary + untested code

	# We use pd.read_csv() because, unlike QIIME 2's Metadata object, it allows
	# duplicate sample IDs.
	import pandas as pd
	import sys
	if len(sys.argv) < 2:
	    raise ValueError("You need to specify a metadata file to check.")
	df = pd.read_csv(sys.argv[1], sep='\t', index_col=0)
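
The check itself is not shown in the preview. A minimal sketch of the prefix comparison follows, operating on plain tuples so that duplicate sample IDs are allowed, as the comments above require. The function name `find_mismatched_rows` and the example rows are hypothetical.

```python
def find_mismatched_rows(rows):
    """Return the sample ID of each row whose timestamp does not start with its date.

    `rows` is an iterable of (sample_id, collection_date, collection_timestamp)
    string tuples. Sample IDs may repeat.
    """
    mismatches = []
    for sample_id, collection_date, collection_timestamp in rows:
        if not collection_timestamp.startswith(collection_date):
            mismatches.append(sample_id)
    return mismatches


# Hypothetical rows; "s2" appears twice, once with a mismatch
rows = [
    ("s1", "2019-07-25", "2019-07-25 21:05"),
    ("s2", "2019-07-25", "2019-07-26 08:00"),
    ("s2", "2019-07-26", "2019-07-26 08:00"),
]
bad = find_mismatched_rows(rows)
print(bad)
```

Because this is a plain string comparison, the two columns must use the same date format for the check to be meaningful.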

Marcus Fedarko fedarko