genomewalker/get-constituent-genomes.jpg

We will analyse the run SRR061192:

Human metagenome sample from G_DNA_Supragingival plaque of a female participant in the dbGaP study "HMP Core Microbiome Sampling Protocol A (HMP-A)

We start from reads that have been already pre-processed by fastp, where adapters have been clipped and quality trimmed.

Let's set the stage:

mamba create -n khmer python=3.8.0 khmer=3.0.0a
mamba create -n sourmash python=3.9.0 "sourmash>=4.2.4,<5"

Get the signatures for GTDB R207:

mkdir get-constituent-genomes
cd get-constituent-genomes
curl -L https://osf.io/ernct/download -o gtdb-rs207.genomic-reps.k31.sbt.zip
wget http://files.metagenomics.eu/SRR061192.trim.fq.gz

Then we will trim the low abundant kmers from the reads.

conda activate khmer
mkdir abundtrim
trim-low-abund.py -C 3 -Z 18 -M 2e9 -V SRR061192.trim.fq.gz -o abundtrim/SRR061192.abundtrim.fq.gz --gzip
conda deactivate

This takes a while, you can download the data from http://files.metagenomics.eu/SRR061192.abundtrim.fq.gz

Then we will create the signatures with sourmash:

conda activate sourmash
mkdir sigs
sourmash sketch dna -p k=21,k=31,k=51,scaled=1000,abund \
    abundtrim/SRR061192.abundtrim.fq.gz \
    -o sigs/SRR061192.abundtrim.sig.gz \
    --name SRR061192

We will to whatever we can found in the GTDB R207.

sourmash prefetch sigs/SRR061192.abundtrim.sig.gz \
    gtdb-rs207.genomic-reps.k31.sbt.zip \
    -k 31 \
    --threshold-bp=100000.0 \
    --dna \
    --save-matching-hashes=gather/SRR061192.known.sig.gz \
    --save-unmatched-hashes=gather/SRR061192.unknown.sig.gz \
    -o gather/SRR061192.prefetch.csv

And let's see what we have found:

sourmash gather sigs/SRR061192.abundtrim.sig.gz \
    gtdb-rs207.genomic-reps.k31.sbt.zip \
    -o gather/SRR061192.gather.csv \
    --threshold-bp 100000.0 \
    --picklist gather/SRR061192.prefetch.csv::prefetch \
    --save-matches gather/SRR061192.matches.sig.gz

Let's check who they are:

head -n 5 gather/SRR061192.matches.sig.gz

The interesting columns to look at are:

intersect_bp: total k-mers matched
unique_intersect_bp: remaining k-mers matched
f_match_orig: total k-mer cover
f_match: remaining k-mer cover
name: name of the genome

intersect_bp,f_orig_query,f_match,f_unique_to_query,f_unique_weighted,average_abund,median_abund,std_abund,name,filename,md5,f_match_orig,unique_intersect_bp,gather_result_rank,remaining_bp,query_filename,query_name,query_md5,query_bp
2651000,0.007045023306244585,0.7810842663523866,0.007045023306244585,0.01203710958006823,8.519426631459826,7.0,9.394207335936839,"GCF_000269805.1 Actinomyces massiliensis 4401292 strain=4401292, ASM26980v1",gtdb-rs207.genomic-reps.k31.sbt.zip,b8d9b3912ab1d72c40680ab960f15ed9,0.7810842663523866,2651000,0,137886000,outputs.tutorial/abundtrim/SRR061192.abundtrim.fq.gz,SRR061192,bd965373,376294000
2334000,0.006202596905611038,0.7638388470357026,0.0061972819125471045,0.0041704840586244815,3.3554888507718696,3.0,2.641882916447153,"GCA_013333945.2 Propionibacterium sp., ASM1333394v2",gtdb-rs207.genomic-reps.k31.sbt.zip,556e71b644fd81e9690f6afdb62665b8,0.7644939403865051,2332000,1,135554000,outputs.tutorial/abundtrim/SRR061192.abundtrim.fq.gz,SRR061192,bd965373,376294000
2270000,0.006032517127565149,0.736475273813475,0.0058969848044348305,0.010891758750421712,9.209553853086977,3.0,15.892404639254796,"GCF_001956585.1 Actinomyces naeslundii strain=NCTC 10301, ASM195658v1",gtdb-rs207.genomic-reps.k31.sbt.zip,9196730273108bdb75cdf35e1ddeada2,0.7534019249917027,2219000,2,133335000,outputs.tutorial/abundtrim/SRR061192.abundtrim.fq.gz,SRR061192,bd965373,376294000
2069000,0.005498360324639777,0.7344692935747249,0.005498360324639777,0.060547433993095916,54.9076848719188,57.0,28.302954088838916,"GCF_011612265.2 Corynebacterium matruchotii strain=ATCC 14266, ASM1161226v2",gtdb-rs207.genomic-reps.k31.sbt.zip,db37b4d014ce2defcf45ffa5d152f30d,0.7344692935747249,2069000,3,131266000,outputs.tutorial/abundtrim/SRR061192.abundtrim.fq.gz,SRR061192,bd965373,376294000

A brief explanation of the gather results from here:

The matching genomes are provided in the order found by the greedy algorithm, i.e. by overlap with remaining k-mers in the metagenome. The high rank (early) matches reflect large and/or mostly-covered genomes with high containment, while later matches reflect genomes that share fewer k-mers with the remaining set of k-mers in the metagenome - smaller genomes, less-covered genomes, and/or genomes with substantial overlap with earlier matches. Where there are overlaps between genomes, shared common k-mers are “claimed” by higher rank matches and only k-mer content specific to the later genome is used to find lower rank matches.

A quick plotting of the gather results:

library(tidyverse)
library(ggpubr)

gather_df <- read_csv(file = "gather/SRR061192.gather.csv")

data <- gather_df %>%
  select(intersect_bp, unique_intersect_bp, f_match_orig, f_match, name) %>%
  mutate(intersect_bp = intersect_bp/1e6,
         unique_intersect_bp = unique_intersect_bp/1e6) %>%
  filter(f_match_orig > 0.5) %>%
  pivot_longer(names_to = "var", values_to = "value", -name) %>%
  separate(name, into = c("accession","genome"), extra = "merge", sep = " ") %>%
  separate(genome, into = "genome", sep = ",", extra = "drop")

p1 <- data %>%
  filter(grepl("intersect", var)) %>%
  mutate(var = ifelse(var == "intersect_bp", "total k-mers matched", "remaining k-mers matched")) %>%
  mutate(genome = fct_reorder(genome, value)) %>%
  ggplot(aes(genome, value, color = var)) +
  geom_point() +
  ggpubr::rotate() +
  #scale_color_manual(labels = c("total k-mers matched", "remaining k-mers matched"), values = c("blue", "red")) +
  xlab("") +
  ylab("millions of kmers") +
  theme_light() +
  theme(legend.position = "top")

p2 <- data %>%
  filter(grepl("match", var)) %>%
  mutate(var = ifelse(var == "f_match_orig", "total k-mer cover", "remaining k-mer cover")) %>%
  mutate(genome = fct_reorder(genome, value)) %>%
  ggplot(aes(genome, value, color = var)) +
  geom_point() +
  ggpubr::rotate() +
  scale_y_continuous(labels = scales::percent) +
  xlab("") +
  ylab("genome covered") +
  theme_light() +
  theme(legend.position = "top",
        axis.ticks.y = element_blank(),
        axis.text.y = element_blank())

ggpubr::ggarrange(plotlist = list(p1, p2), align = "h", widths = c(0.65, 0.35))