CRAM chunk processing with R

Let’s dive into processing your 70 GB CRAM file on a machine with 32 GB RAM, focusing first on using R and then exploring alternatives. Your setup is ambitious, but with the right approach, it’s absolutely doable.

Processing a 70 GB CRAM File in R with 32 GB RAM

Based on the document, a 70 GB CRAM file likely contains around 700–1,400 million reads (assuming the 10 GB = 100–200 million reads scaling holds). Loading this entirely into memory would require roughly 350 GB to 1.4 TB of RAM (about 5–20x the file size, consistent with the scaling tables below), far exceeding your 32 GB. So, a full load is off the table. However, the document highlights strategies to make this manageable, and I'll tailor them to your case.

Is It Possible in R?

Yes, it's possible, but you'll need to avoid loading the whole file at once. The key is chunk-based processing, which keeps memory usage constant regardless of file size. For a 70 GB CRAM, chunk processing can work within 500 MB to 1 GB of RAM per chunk, well within your 32 GB limit. Even with overhead from R and other operations, you'll have plenty of headroom.

How to Do It in R

Using Bioconductor packages like Rsamtools and GenomicAlignments, here’s a step-by-step approach:

  1. Chunk-Based Processing:

    • Use BamFile (CRAM files are supported) with a yieldSize parameter to process reads in batches. The document suggests 100,000 reads as a starting point, requiring 100–500 MB per chunk.
    • Example code:
      library(Rsamtools)
      library(GenomicAlignments)  # provides readGAlignments()
      cram_file <- BamFile("your_file.cram", yieldSize = 100000)
      open(cram_file)
      while (length(chunk <- readGAlignments(cram_file)) > 0) {
          # Process each chunk here (e.g., count, filter, analyze)
          print(summary(chunk))
      }
      close(cram_file)
    • Memory stays low (500 MB–1 GB per chunk), and you can scale up yieldSize (e.g., to 500,000 reads, ~2–3 GB) if your analysis permits, still fitting within 32 GB.
  2. Region-Based Analysis:

    • If you only need specific genomic regions (e.g., a chromosome or exome targets), use ScanBamParam to restrict reads:
      # Requires an index (.crai) next to the CRAM; GRanges()/IRanges() are attached with GenomicAlignments
      param <- ScanBamParam(which = GRanges("chr1", IRanges(1, 1e6)))
      reads <- readGAlignments(cram_file, param = param)
    • Memory for a single chromosome at 30x coverage is 800 MB–2 GB, or up to 6 GB at 100x, leaving room in your 32 GB.
  3. Filtering During Read:

    • Apply filters (e.g., mapping quality > 20) to reduce loaded data:
      param <- ScanBamParam(flag = scanBamFlag(isUnmappedQuery = FALSE), mapqFilter = 20)
    • This cuts memory use by 20–60%, potentially dropping per-chunk needs below 500 MB.
  4. Optimize R Settings:

    • Trigger garbage collection manually with gc() after each chunk to free memory.
    • On Linux/Mac, ensure your ulimit allows R to use most of your 32 GB. On Windows, memory.limit(size = NA) maximized available RAM in older R versions, but it is defunct as of R 4.2, where R simply uses what the OS provides. A combined sketch of points 1, 3, and 4 follows below.
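
Here is a minimal sketch combining points 1, 3, and 4 (the file name is a placeholder; as above, reading a CRAM assumes htslib can locate its reference):

    library(Rsamtools)
    library(GenomicAlignments)

    # Filter while reading (point 3) and process in chunks (point 1)
    param <- ScanBamParam(flag = scanBamFlag(isUnmappedQuery = FALSE),
                          mapqFilter = 20)
    cram <- BamFile("your_file.cram", yieldSize = 100000)
    open(cram)
    n_reads <- 0
    while (length(chunk <- readGAlignments(cram, param = param)) > 0) {
        n_reads <- n_reads + length(chunk)  # replace with your per-chunk analysis
        rm(chunk); gc()                     # point 4: free the chunk before reading the next
    }
    close(cram)
    n_reads  # total reads passing the filter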

What Can You Do?

  • Counting Reads: 10–20 MB plus chunk overhead (~1 GB total)—easy.
  • Coverage Calculation: Whole-genome needs 2–8 GB, but per-chromosome (200 MB–1 GB) fits. Loop over chromosomes (see the sketch after this list).
  • Variant Calling: Whole-genome (8–16 GB+) is tight; targeted regions (1–5 GB) are safer. You might push 32 GB’s limit with multiple processes running.
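
Here is a rough per-chromosome coverage sketch to illustrate the looping approach (file path and chromosome names are placeholders; it assumes the CRAM is indexed and its reference is resolvable):

    library(Rsamtools)
    library(GenomicAlignments)

    cram <- BamFile("your_file.cram")
    chrom_lengths <- seqlengths(seqinfo(cram))          # chromosome sizes from the header
    cov_by_chrom <- list()
    for (chr in paste0("chr", c(1:22, "X", "Y"))) {
        param <- ScanBamParam(which = GRanges(chr, IRanges(1, chrom_lengths[[chr]])))
        reads <- readGAlignments(cram, param = param)   # only this chromosome in memory
        cov_by_chrom[[chr]] <- coverage(reads)[[chr]]   # per-base depth as an Rle
        rm(reads); gc()                                 # release before the next chromosome
    }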

Limits and Tips

  • With 32 GB, you meet the recommended RAM for whole-genome analysis (32+ GB) and sit well above the 16 GB minimum. Chunking makes it work, but avoid parallel processing unless you're sure of the memory overhead.
  • Test with a smaller yieldSize first to monitor RAM usage (e.g., via top or htop).
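
You can also track peak memory from inside R with base functions alone, for example:

    gc(reset = TRUE)    # reset the "max used" counters
    # ... process one chunk of your analysis ...
    gc()                # the "max used" column now reports the peak for that chunk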

Alternatives to R

If R feels limiting or you want flexibility, here are other tools to process your 70 GB CRAM file on 32 GB RAM:

  1. Samtools:

    • Why: Lightweight, command-line-based, designed for BAM/CRAM files.
    • How: Use samtools view with region filters or pipes to process incrementally:
      samtools view -q 20 your_file.cram chr1 | wc -l  # Count filtered reads (add -T reference.fa if the reference isn't cached locally)
    • Memory: Minimal (tens of MB), streams data without loading fully.
    • Use Case: Counting, filtering, or converting to other formats.
  2. Bcftools (for Variant Calling):

    • Why: Efficient for variant calling on CRAMs, integrates with samtools.
    • How:
      bcftools mpileup -r chr1 -f reference.fa your_file.cram | bcftools call -mv > variants.vcf
    • Memory: 1–5 GB for targeted regions, scalable with pipes.
    • Use Case: Variant calling without R’s overhead.
  3. Python with Pysam:

    • Why: Python’s pysam mirrors Rsamtools but with more control.
    • How:
      import pysam
      # "rc" = read CRAM; pass reference_filename="reference.fa" if the reference isn't auto-resolvable
      cram = pysam.AlignmentFile("your_file.cram", "rc")
      for read in cram.fetch("chr1", 1, 1000000):  # fetch() requires a .crai index
          # Process read (e.g., count, analyze)
          pass
      cram.close()
    • Memory: Similar to R’s chunking (500 MB–1 GB per batch), adjustable via iteration.
    • Use Case: Custom workflows, scripting flexibility.
  4. GATK (Genome Analysis Toolkit):

    • Why: Robust for large-scale genomics, optimized for CRAM.
    • How: Use tools like HaplotypeCaller with --intervals for regions:
      gatk HaplotypeCaller -R reference.fa -I your_file.cram -L chr1 -O output.vcf
    • Memory: 4–16 GB depending on region/depth, configurable via JVM flags (e.g., gatk --java-options "-Xmx20g" ...).
    • Use Case: Advanced variant calling, but watch memory limits.

Recommendation

  • Stick with R if you’re comfortable with Bioconductor and your analysis (e.g., coverage, gene expression) aligns with its strengths. Use chunking (yieldSize = 100,000–500,000) and region-based queries to stay under 32 GB.
  • Switch to Samtools/Pysam if you need simpler tasks (counting, filtering) or more lightweight processing. Add Bcftools or GATK for variant calling.
  • Hybrid Approach: Pre-filter with samtools (e.g., extract a chromosome, reducing size to ~2–4 GB), then analyze in R.
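
As a sketch of the hybrid approach (assuming samtools is on your PATH and reference.fa matches the CRAM; file names are placeholders), you can drive the pre-filtering step from R and then analyze the smaller per-chromosome file:

    # Extract chr1 into a smaller CRAM, then index it (shell steps driven from R)
    system2("samtools", c("view", "-C", "-T", "reference.fa",
                          "-o", "chr1.cram", "your_file.cram", "chr1"))
    system2("samtools", c("index", "chr1.cram"))

    # Analyze the ~2-4 GB per-chromosome file with the usual Bioconductor tools
    library(Rsamtools)
    library(GenomicAlignments)
    reads <- readGAlignments(BamFile("chr1.cram"))
    chr1_cov <- coverage(reads)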

You’ve got this—32 GB is enough with smart processing. What specific analysis are you aiming for? That’ll help me refine the advice further!

Memory Requirements for CRAM File Analysis in R

This document provides detailed estimates of memory requirements when working with CRAM files in R using Bioconductor packages like Rsamtools and GenomicAlignments.

Factors Affecting Memory Usage

Memory requirements for CRAM file analysis depend on several key factors:

  1. File Size: The size of your CRAM file, which correlates with:

    • Number of reads
    • Read length
    • Amount of metadata stored
  2. Analysis Type: Different operations have different memory footprints:

    • Simple counting operations are relatively lightweight
    • Coverage calculations require more memory
    • Variant calling is typically the most memory-intensive
  3. Processing Strategy: How you process the data:

    • Whole-file loading vs. chunk-based processing
    • Region-specific analysis vs. genome-wide analysis
    • Filtering criteria applied
  4. System Configuration: Your R environment settings:

    • Available physical RAM
    • R memory limits
    • Garbage collection settings

Estimated Memory Requirements by Operation

Basic File Operations

Operation | Estimated Memory | Notes
Loading file metadata | 10-50 MB | Includes header information, chromosome data
Counting reads | 10-20 MB | For the counting operation itself
Indexing a CRAM file | 50-200 MB | Depends on file complexity

Reading and Processing Alignments

Operation | Estimated Memory | Notes
Loading 1 million reads | 500 MB - 1 GB | Varies with read length and metadata
Loading 10 million reads | 5-10 GB | Scales approximately linearly with read count
Processing in 100k read chunks | 100-500 MB | Constant memory regardless of file size
Region-specific queries | 50-500 MB | Depends on region size and read density

Advanced Analysis

Operation | Estimated Memory | Notes
Coverage calculation (whole genome) | 2-8 GB | Depends on genome size and sequencing depth
Coverage calculation (single chromosome) | 200 MB - 1 GB | Depends on chromosome size and coverage
Variant calling (whole genome) | 8-16 GB or more | Very dependent on depth and algorithms used
Variant calling (targeted region) | 1-5 GB | For regions of several megabases

Memory Scaling Examples

By CRAM File Size

CRAM File Size | Approximate Read Count | Estimated Memory for Full Load | Chunk-based Processing
1 GB | 10-20 million | 5-20 GB | 500 MB - 1 GB
10 GB | 100-200 million | 50-200 GB (not recommended) | 500 MB - 1 GB
100 GB | 1-2 billion | Not feasible | 500 MB - 1 GB

By Genome Coverage

Sequencing Depth | Whole Genome Analysis | Single Chromosome Analysis
10x coverage | 4-8 GB | 300-800 MB
30x coverage | 8-16 GB | 800 MB - 2 GB
100x coverage | 16-32 GB | 2-6 GB

Memory Optimization Strategies

  1. Chunk-based Processing:

    • Set appropriate yieldSize in BamFile (100,000 is a good starting point)
    • Process data in a loop rather than loading all at once
    • Memory savings: 80-95% compared to whole-file loading
  2. Region-based Analysis:

    • Use ScanBamParam(which=...) to limit to specific genomic regions (see the combined sketch after this list)
    • Memory savings: Proportional to the fraction of genome analyzed
  3. Filtering During Read:

    • Apply filters during reading (mapping quality, flags, etc.)
    • Memory savings: 20-60% depending on filter stringency
  4. R Memory Settings:

    • Raise OS-level limits where relevant: adjust ulimit settings on Linux/Mac; on Windows, memory.limit(size=NA) applied only to R < 4.2 and is now defunct
    • Trigger garbage collection manually with gc() between chunks, and use gcinfo(TRUE) to report when collections happen
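
As a minimal sketch of strategies 2 and 3 combined (the BED and CRAM paths are hypothetical; the CRAM must be indexed and its reference resolvable):

    library(rtracklayer)        # import() reads BED files into GRanges
    library(Rsamtools)
    library(GenomicAlignments)

    targets <- import("targets.bed")                   # strategy 2: restrict to target regions
    param   <- ScanBamParam(which = targets,
                            mapqFilter = 20)           # strategy 3: filter while reading
    reads   <- readGAlignments(BamFile("your_file.cram"), param = param)
    target_cov <- coverage(reads)                      # coverage over the loaded reads only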

System Recommendations

Analysis Type | Minimum RAM | Recommended RAM
Basic exploration of small regions | 4 GB | 8 GB
Chromosome-level analysis | 8 GB | 16 GB
Whole-genome analysis | 16 GB | 32+ GB
Population-scale studies | 32 GB | 64+ GB

Real-world Examples

  1. Exome sequencing data (50x coverage, ~6 GB CRAM):

    • Full load: ~6-10 GB RAM
    • Chunk processing: ~1 GB RAM
    • Variant calling: ~4-8 GB RAM
  2. Whole genome sequencing (30x coverage, ~30 GB CRAM):

    • Full load: Not recommended
    • Chunk processing: ~1 GB RAM
    • Per-chromosome analysis: ~2-4 GB RAM
    • Variant calling: ~12-16 GB RAM
  3. RNA-seq data (30 million reads, ~3 GB CRAM):

    • Full load: ~3-6 GB RAM
    • Chunk processing: ~800 MB RAM
    • Gene expression quantification: ~2-4 GB RAM