Notes by Greg Caporaso ([email protected])
From email to picrust-developers on 4 Oct 2012:
-
Randomly subsample the HMP (not HMP-mock) dataset to ~50-100k sequences to form a filtered dataset (to decrease run time).
-
Select ~12 of the most abundant IMG-defined OTUs from the HMP, and slice each reference sequence to the region amplified in that dataset. "IMG-defined" here means that we have an IMG genome attached to the OTU, as opposed to the Greengenes-defined OTUs, where we don't have a genome for that specific OTU. "Most abundant" will be somewhat arbitrary - I'm thinking something like 12 IMG-defined OTUs chosen at random from the 25% most abundant OTUs in the dataset.
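
A rough sketch of the two selection steps above, in plain Python. This is only an illustration under assumptions, not the code actually used: the function names, the dict of per-OTU counts, the set of IMG-defined OTU IDs, and the fixed seed are all hypothetical, and slicing the reference sequences to the amplified region is not covered.

```python
import random


def subsample_ids(seq_ids, n, seed=0):
    """Return n sequence IDs chosen uniformly at random, without replacement.

    seq_ids: iterable of sequence identifiers (hypothetical input format).
    If fewer than n IDs are available, all of them are returned.
    """
    rng = random.Random(seed)
    seq_ids = list(seq_ids)
    if n >= len(seq_ids):
        return seq_ids
    return rng.sample(seq_ids, n)


def pick_abundant_img_otus(otu_counts, img_otus, n=12, top_fraction=0.25, seed=0):
    """Pick n IMG-defined OTUs at random from the most abundant OTUs.

    otu_counts: dict mapping OTU ID -> total sequence count (assumed input).
    img_otus: set of OTU IDs that have an IMG genome attached (assumed input).
    top_fraction: fraction of OTUs, ranked by abundance, to draw from.
    """
    rng = random.Random(seed)
    # Rank OTUs from most to least abundant; break ties by ID for determinism.
    ranked = sorted(otu_counts, key=lambda otu: (-otu_counts[otu], otu))
    n_top = max(1, int(len(ranked) * top_fraction))
    # Keep only the IMG-defined OTUs within the top fraction.
    candidates = [otu for otu in ranked[:n_top] if otu in img_otus]
    if len(candidates) <= n:
        return candidates
    return rng.sample(candidates, n)


if __name__ == "__main__":
    # Toy data: 100 OTUs with decreasing abundance; every other one is IMG-defined.
    counts = {"otu%d" % i: 1000 - i for i in range(100)}
    img = {"otu%d" % i for i in range(0, 100, 2)}
    print(pick_abundant_img_otus(counts, img))
```

Fixing the seed keeps the "random" choice reproducible, so the same filtered dataset and OTU set can be regenerated later.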