- Download the following two BED files from the RSEQC website, hosted on SourceForge:
hg38_RefSeq.bed.gz: All human RefSeq transcriptshg38.HouseKeepingGenes.bed.gz: Subset of human RefSeq transcripts considered housekeeping genes
- Extract the RefSeq identifiers from the
HouseKeepingGenesfile. - Map the RefSeq identifiers to orthologous genes in your species of interest, e.g. using
ensembl's BioMart web interface. Make sure
the return the
gene stable IDfor the target species. - Download the GTF file with gene annotations for your species of interest, e.g. from ensembls FTP server
- Next, we need to subset the GTF file to the
housekeepinggene identifiers we obtained by mapping the human RefSeq identifiers, e.g. usingrtracklayerBioC package.
library(purrr)
library(readr)
library(rtracklayer)
# read biomart output files
housekeeping <- purrr::map_df(
dir("~/Downloads/", pattern = "mart_export*", full.names = TRUE),
readr::read_tsv)
housekeeping <- unique(housekeeping$`Crab-eating macaque gene stable ID`)
# read ensembl GTF file
gtf <- import("~/tmp/Macaca_fascicularis.Macaca_fascicularis_6.0.103.gtf.gz")
# subset the `GRanges` object; also exclude gene rows
hk <- gtf[gtf$gene_id %in% housekeeping & gtf$type != "gene"]
# export the filtered GTF file
export(hk, con = "~/tmp/housekeeping.gtf", format = "gtf")- Convert the GTF files to BED format
Convert the original GTF file:
conda create -n gtf
conda activate gtf
mamba install -c bioconda ucsc-gtfToGenePred ucsc-genePredToBedgzip -cd ~/tmp/Macaca_fascicularis.Macaca_fascicularis_6.0.103.gtf.gz | \
gtfToGenePred /dev/stdin ~/tmp/all_transcripts.genepred
genePredToBed ~/tmp/all_transcripts.genepred ~/tmp/all_transcripts.bedConvert the housekeeping GTF file:
gtfToGenePred ~/tmp/housekeeping.gtf ~/tmp/housekeeping.genepred
genePredToBed ~/tmp/housekeeping.genepred ~/tmp/housekeeping.bed