The Baylor-09, Baylor-10, and Baylor-11 samples have been uploaded to SRA, but need to be organized. The following workflow was created for this task.
First, download the metadata file containing the SRA accessions from https://submit.ncbi.nlm.nih.gov.
Then run the following bash commands to pull the matching records out of the downloaded files. They work as long as each ID is a unique identifier, such as sample_name, and the search covers only the downloaded files:
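# copy and paste Baylor IDs from Roger into the file baylor-09_ids
BAYLOR09=($(<baylor-09_ids))
for i in "${BAYLOR09[@]}" ; do
    # append every line of the downloaded metadata files that contains this ID
    # note: *.txt also matches baylor09_grep.txt once it exists; narrow the glob if needed
    grep "$i" *.txt >> baylor09_grep.txt
done
# keep only tab-separated columns 4-6 of the matched lines
cut -d$'\t' -f4,5,6 baylor09_grep.txt > baylor09_parsed.tsv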
Repeat for Baylor-10 and Baylor-11.
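To avoid retyping the commands for each batch, the same search-and-cut step can be sketched in Python. This is a rough equivalent, not a drop-in replacement: it does plain substring matching where grep uses regular expressions, and the function name parse_batch, the *.txt glob, and the baylor-10_ids and baylor-11_ids file names are assumptions that follow the Baylor-09 naming pattern.

```python
from pathlib import Path


def parse_batch(ids_file, metadata_files, out_file, columns=(3, 4, 5)):
    """Write columns 4-6 of every metadata line that contains one of the IDs.

    `columns` holds zero-based indexes, so (3, 4, 5) matches `cut -f4,5,6`.
    Returns the number of lines written.
    """
    ids = Path(ids_file).read_text().split()
    rows = []
    for sample_id in ids:
        for path in metadata_files:
            for line in Path(path).read_text().splitlines():
                if sample_id in line:  # substring match, like a literal grep
                    fields = line.split("\t")
                    rows.append("\t".join(fields[i] for i in columns if i < len(fields)))
    Path(out_file).write_text("".join(row + "\n" for row in rows))
    return len(rows)


if __name__ == "__main__":
    # Assumed layout: downloaded metadata as *.txt in the current directory
    metadata = sorted(Path(".").glob("*.txt"))
    for batch in ("09", "10", "11"):
        ids_file = Path(f"baylor-{batch}_ids")
        if ids_file.exists():  # skip batches whose ID list is not present
            parse_batch(ids_file, metadata, f"baylor{batch}_parsed.tsv")
```

Because the output is written to .tsv files, it is never picked up by the *.txt glob on a later run.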
Finally, perform a full outer join to match the IDs to the SRA data and determine whether any were missed. This can be done in Excel, or programmatically in R or Python. Then update the Google Document with the results.
An example using pandas is shown here:
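import pandas as pd

# Both files need a header row with an 'Animal_ID' column. grep output has no
# header, so add one to the parsed TSV first if it is missing.
df_identifiers = pd.read_csv('baylor_10_idList.csv')
df_SRA = pd.read_csv('baylor10_parsed.tsv', sep='\t')
# Full outer join keeps IDs that appear in only one of the two tables
merged = pd.merge(df_identifiers, df_SRA, on='Animal_ID', how='outer')
# valueIsNaN holds the rows with no BioSample, i.e. IDs not found in the SRA metadata
valueIsNaN = merged[merged['BioSample'].isnull()]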
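The outer join can also report mismatches in both directions. The sketch below uses the indicator option of pd.merge, which adds a _merge column marking each row as left_only, right_only, or both. The helper name find_unmatched and the example values are made up for illustration; only the Animal_ID and BioSample column names come from the example above.

```python
import pandas as pd


def find_unmatched(df_identifiers, df_sra, key="Animal_ID"):
    """Return (IDs missing from the SRA data, SRA IDs absent from the ID list)."""
    merged = pd.merge(df_identifiers, df_sra, on=key, how="outer", indicator=True)
    missing_from_sra = merged.loc[merged["_merge"] == "left_only", key].tolist()
    not_in_id_list = merged.loc[merged["_merge"] == "right_only", key].tolist()
    return missing_from_sra, not_in_id_list


# Made-up example data, not real sample identifiers
ids = pd.DataFrame({"Animal_ID": ["A1", "A2", "A3"]})
sra = pd.DataFrame({"Animal_ID": ["A2", "A3", "A4"],
                    "BioSample": ["S2", "S3", "S4"]})
missing, extra = find_unmatched(ids, sra)
print("Not found in SRA:", missing)
print("In SRA but not in the ID list:", extra)
```

Checking the _merge column instead of a null BioSample also avoids false positives when a matched SRA row happens to have an empty BioSample field.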