Working through running HiFive on a Hi-C dataset.
First, a note on memory and performance: bin size influences everything. Starting with a bin size of 40kb, loading hg38 data seems to stay under ~16GB. At fend-level resolution, memory requirements approach ~32GB and running time increases several-fold.
HiFive stores a fend file with information on the locations of restriction fragments in the genome. To create it, we need to get the locations of the RE sites into a BED file. The UCSC findMotif tool can do this.
We are also going to bin at 40kb right from the beginning. This will help performance, and we know the data is sparse. If QuASAR suggests the data support a finer resolution, we can always reprocess.
# Install findMotif and hifive from bioconda
conda install ucsc-findmotif
conda install hifive
# Create a BED file with locations of HindIII sites
findMotif -strand=+ -motif=AAGCTT $(SOMEWHERE)/hg38.fa > hg38.HindIII.bed
# Import into HiFive optimized format
hifive fends -B hg38.HindIII.bed --binned 40000 -R HindIII -G hg38 hg38.HindIII.40k.fends
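Before moving on, it's worth a quick glance at the intermediate BED file to confirm the motif scan produced sensible per-chromosome coordinates (head and wc are just illustrative checks; the exact line count will depend on the build):
# Peek at the first few RE site records and count how many were found
head -3 hg38.HindIII.bed
wc -l hg38.HindIII.bed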
Alternatively, we can just bin without regard to RE site locations. We need the chromosome sizes from UCSC so HiFive knows how long the chromosomes are, then we just cut them into 40kb chunks.
wget https://genome.ucsc.edu/goldenpath/help/hg38.chrom.sizes
hifive fends --binned 40000 -L hg38.chrom.sizes -g hg38 hg38.NA.40k.fends
Then we would use hg38.NA.40k.fends in place of hg38.HindIII.40k.fends below.
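One optional tweak I'll just sketch here: hg38.chrom.sizes includes all the alt contigs and unplaced scaffolds, which mostly show up as nan columns in the QuASAR report later. If you only care about the primary chromosomes, you could trim the file before building the fends (the grep pattern is a rough heuristic based on UCSC naming, not something HiFive requires; file names are just my convention):
# Keep only primary chromosomes (UCSC names without an underscore)
grep -v '_' hg38.chrom.sizes > hg38.main.chrom.sizes
hifive fends --binned 40000 -L hg38.main.chrom.sizes -g hg38 hg38.main.40k.fends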
Just for demonstration, let's grab some reads from the Dixon et al. 2012 paper via SRA:
fastq-dump --accession SRR442158 --split-files
Now we will align with bwa mem. You will need an index first; I always name mine with the corresponding .fa file as the base name. Importantly, we align the two ends independently.
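If you don't already have an index, building one is a single command against the same .fa used below (it is slow and uses a fair amount of disk for hg38):
bwa index $(SOMEWHERE)/hg38.fa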
bwa mem -t 16 $(SOMEWHERE)/hg38.fa SRR442158_1.fastq | samtools view -b > SRR442158_1.bam
bwa mem -t 16 $(SOMEWHERE)/hg38.fa SRR442158_2.fastq | samtools view -b > SRR442158_2.bam
Now we can load the aligned ends into a HiFive "Hi-C Data" file. We simply provide the two aligned ends and the fend file, plus a filename to write to (SRR442158.hcd). Then we will create a "Hi-C Project". This step filters out fends with too few interactions (at least 5 here, via -f 5) and fits the distance-dependence function.
hifive hic-data -S SRR442158_1.bam SRR442158_2.bam hg38.HindIII.40k.fends SRR442158.hcd
hifive hic-project -f 5 SRR442158.hcd SRR442158.hcp
With 40kb bins, things look good:
Read 66055585 validly-mapped read pairs.
66055585 total validly-mapped read pairs loaded. 38346128 valid fend pairs
Parsing fend pairs... Done 17459539 cis reads, 21188745 trans reads
and then:
Filtering fends... Removed 6933 of 78722 bins
Finding distance curve... Done
Even if you get an error here (e.g., that the distance function can't be fit), we can still go ahead with QuASAR and get a resolution estimate.
hifive quasar -p SRR442158.hcp -r 1000000,200000,40000 -d 0 -o report.txt SRR442158.quasar
After QuASAR runs, we can check its report in report.txt:
Quality Score Results
Resolution  Coverage    All       chr1      chr10     chr11     ...
40 Kb       17,459,539  0.019164  0.019010  0.023605  0.019584  ...
200 Kb      17,459,539  0.083343  0.085528  0.102154  0.097350  ...
1 Mb        17,459,539  0.120813  0.127911  0.133518  0.142377  ...
(the remaining per-chromosome columns, including the alt contigs, are trimmed here; the alt contigs are mostly nan)
Strict quality maximum resolution: 105,888 bp
Loose quality maximum resolution: 73,549 bp
So, this data supports ~80-100kb bins; 40kb is too fine for this sequencing depth.
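Given that, the natural follow-up is to rebuild the fend file at a coarser bin size and push the same pipeline through again. A sketch using the uniform-bin variant from above (the 100kb choice and the file names are just my own convention here):
# Rebin at 100kb and reload the same aligned reads
hifive fends --binned 100000 -L hg38.chrom.sizes -g hg38 hg38.NA.100k.fends
hifive hic-data -S SRR442158_1.bam SRR442158_2.bam hg38.NA.100k.fends SRR442158.100k.hcd
hifive hic-project -f 5 SRR442158.100k.hcd SRR442158.100k.hcp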