We will analyse the run SRR061192:
Human metagenome sample from G_DNA_Supragingival plaque of a female participant in the dbGaP study "HMP Core Microbiome Sampling Protocol A (HMP-A)
We start from reads that have been already pre-processed by fastp, where adapters have been clipped and quality trimmed.
Let's set the stage:
mamba create -n khmer python=3.8.0 khmer=3.0.0a
mamba create -n sourmash python=3.9.0 "sourmash>=4.2.4,<5"
Get the signatures for GTDB R207:
mkdir get-constituent-genomes
cd get-constituent-genomes
curl -L https://osf.io/ernct/download -o gtdb-rs207.genomic-reps.k31.sbt.zip
wget http://files.metagenomics.eu/SRR061192.trim.fq.gz
Then we will trim the low abundant kmers from the reads.
conda activate khmer
mkdir abundtrim
trim-low-abund.py -C 3 -Z 18 -M 2e9 -V SRR061192.trim.fq.gz -o abundtrim/SRR061192.abundtrim.fq.gz --gzip
conda deactivate
This takes a while, you can download the data from http://files.metagenomics.eu/SRR061192.abundtrim.fq.gz
Then we will create the signatures with sourmash:
conda activate sourmash
mkdir sigs
sourmash sketch dna -p k=21,k=31,k=51,scaled=1000,abund \
abundtrim/SRR061192.abundtrim.fq.gz \
-o sigs/SRR061192.abundtrim.sig.gz \
--name SRR061192
We will to whatever we can found in the GTDB R207.
sourmash prefetch sigs/SRR061192.abundtrim.sig.gz \
gtdb-rs207.genomic-reps.k31.sbt.zip \
-k 31 \
--threshold-bp=100000.0 \
--dna \
--save-matching-hashes=gather/SRR061192.known.sig.gz \
--save-unmatched-hashes=gather/SRR061192.unknown.sig.gz \
-o gather/SRR061192.prefetch.csv
And let's see what we have found:
sourmash gather sigs/SRR061192.abundtrim.sig.gz \
gtdb-rs207.genomic-reps.k31.sbt.zip \
-o gather/SRR061192.gather.csv \
--threshold-bp 100000.0 \
--picklist gather/SRR061192.prefetch.csv::prefetch \
--save-matches gather/SRR061192.matches.sig.gz
Let's check who they are:
head -n 5 gather/SRR061192.matches.sig.gz
The interesting columns to look at are:
- intersect_bp: total k-mers matched
- unique_intersect_bp: remaining k-mers matched
- f_match_orig: total k-mer cover
- f_match: remaining k-mer cover
- name: name of the genome
intersect_bp,f_orig_query,f_match,f_unique_to_query,f_unique_weighted,average_abund,median_abund,std_abund,name,filename,md5,f_match_orig,unique_intersect_bp,gather_result_rank,remaining_bp,query_filename,query_name,query_md5,query_bp
2651000,0.007045023306244585,0.7810842663523866,0.007045023306244585,0.01203710958006823,8.519426631459826,7.0,9.394207335936839,"GCF_000269805.1 Actinomyces massiliensis 4401292 strain=4401292, ASM26980v1",gtdb-rs207.genomic-reps.k31.sbt.zip,b8d9b3912ab1d72c40680ab960f15ed9,0.7810842663523866,2651000,0,137886000,outputs.tutorial/abundtrim/SRR061192.abundtrim.fq.gz,SRR061192,bd965373,376294000
2334000,0.006202596905611038,0.7638388470357026,0.0061972819125471045,0.0041704840586244815,3.3554888507718696,3.0,2.641882916447153,"GCA_013333945.2 Propionibacterium sp., ASM1333394v2",gtdb-rs207.genomic-reps.k31.sbt.zip,556e71b644fd81e9690f6afdb62665b8,0.7644939403865051,2332000,1,135554000,outputs.tutorial/abundtrim/SRR061192.abundtrim.fq.gz,SRR061192,bd965373,376294000
2270000,0.006032517127565149,0.736475273813475,0.0058969848044348305,0.010891758750421712,9.209553853086977,3.0,15.892404639254796,"GCF_001956585.1 Actinomyces naeslundii strain=NCTC 10301, ASM195658v1",gtdb-rs207.genomic-reps.k31.sbt.zip,9196730273108bdb75cdf35e1ddeada2,0.7534019249917027,2219000,2,133335000,outputs.tutorial/abundtrim/SRR061192.abundtrim.fq.gz,SRR061192,bd965373,376294000
2069000,0.005498360324639777,0.7344692935747249,0.005498360324639777,0.060547433993095916,54.9076848719188,57.0,28.302954088838916,"GCF_011612265.2 Corynebacterium matruchotii strain=ATCC 14266, ASM1161226v2",gtdb-rs207.genomic-reps.k31.sbt.zip,db37b4d014ce2defcf45ffa5d152f30d,0.7344692935747249,2069000,3,131266000,outputs.tutorial/abundtrim/SRR061192.abundtrim.fq.gz,SRR061192,bd965373,376294000
A brief explanation of the gather results from here:
The matching genomes are provided in the order found by the greedy algorithm, i.e. by overlap with remaining k-mers in the metagenome. The high rank (early) matches reflect large and/or mostly-covered genomes with high containment, while later matches reflect genomes that share fewer k-mers with the remaining set of k-mers in the metagenome - smaller genomes, less-covered genomes, and/or genomes with substantial overlap with earlier matches. Where there are overlaps between genomes, shared common k-mers are “claimed” by higher rank matches and only k-mer content specific to the later genome is used to find lower rank matches.
A quick plotting of the gather results:
library(tidyverse)
library(ggpubr)
gather_df <- read_csv(file = "gather/SRR061192.gather.csv")
data <- gather_df %>%
select(intersect_bp, unique_intersect_bp, f_match_orig, f_match, name) %>%
mutate(intersect_bp = intersect_bp/1e6,
unique_intersect_bp = unique_intersect_bp/1e6) %>%
filter(f_match_orig > 0.5) %>%
pivot_longer(names_to = "var", values_to = "value", -name) %>%
separate(name, into = c("accession","genome"), extra = "merge", sep = " ") %>%
separate(genome, into = "genome", sep = ",", extra = "drop")
p1 <- data %>%
filter(grepl("intersect", var)) %>%
mutate(var = ifelse(var == "intersect_bp", "total k-mers matched", "remaining k-mers matched")) %>%
mutate(genome = fct_reorder(genome, value)) %>%
ggplot(aes(genome, value, color = var)) +
geom_point() +
ggpubr::rotate() +
#scale_color_manual(labels = c("total k-mers matched", "remaining k-mers matched"), values = c("blue", "red")) +
xlab("") +
ylab("millions of kmers") +
theme_light() +
theme(legend.position = "top")
p2 <- data %>%
filter(grepl("match", var)) %>%
mutate(var = ifelse(var == "f_match_orig", "total k-mer cover", "remaining k-mer cover")) %>%
mutate(genome = fct_reorder(genome, value)) %>%
ggplot(aes(genome, value, color = var)) +
geom_point() +
ggpubr::rotate() +
scale_y_continuous(labels = scales::percent) +
xlab("") +
ylab("genome covered") +
theme_light() +
theme(legend.position = "top",
axis.ticks.y = element_blank(),
axis.text.y = element_blank())
ggpubr::ggarrange(plotlist = list(p1, p2), align = "h", widths = c(0.65, 0.35))