@brantfaircloth
Created July 16, 2012 23:44
Convert BCL files to fastq

Install dependencies and Casava

The following assumes you are converting BCL files containing PE100 reads with a 10 nt index read. You can allow Casava to demultiplex for you or do it on your own, later. You can adjust values below if you are doing something different (e.g. shorter reads, longer indexes) but be careful.

  • You need a pretty beefy machine. Illumina recommends something with multiple cores and 48 GB RAM, running CentOS 5. CentOS 6 also works just fine. See their recommendations here:

    http://support.illumina.com/sequencing/sequencing_software/casava/computing_requirements.ilmn

  • you need compilers, etc. installed:

    yum groupinstall "Development Tools"
    
  • Casava 1.8.2 (or whichever version) has many dependencies. You can meet them all pretty easily on CentOS 5/6 using yum install packagename. You want the x86_64 versions:

    GNU make (3.81 recommended)
    Perl (>= 5.8)
    Python (>=2.3 and <=2.6)
    PyXML
    gnuplot (>= 3.7, 4.0 recommended)
    ImageMagick (>= 5.4.7)
    ghostscript
    libxslt
    libxslt-devel
    libxml2
    libxml2-devel
    libxml2-python
    ncurses
    ncurses-devel
    libtiff
    libtiff-devel
    bzip2
    bzip2-devel
    zlib
    zlib-devel
    
    Perl modules:
    perl-XML-Dumper
    perl-XML-Grove
    perl-XML-LibXML
    perl-XML-LibXML-Common
    perl-XML-NamespaceSupport
    perl-XML-Parser
    perl-XML-SAX
    perl-XML-Simple
    perl-XML-Twig
    perldoc
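
If you would rather install everything in one go, the list above can be collected into a single yum call. This is only a sketch: the package names are copied from the list, the variable and function names are made up for illustration, and whether every package exists under that exact name depends on your CentOS release and enabled repositories. PyXML and perldoc are left out on purpose (I installed PyXML in $HOME), so handle those separately.

```shell
# Package names copied from the dependency list above.
# Assumption: each one is available under this name in your repositories.
casava_pkgs="
make perl python gnuplot ImageMagick ghostscript
libxslt libxslt-devel libxml2 libxml2-devel libxml2-python
ncurses ncurses-devel libtiff libtiff-devel
bzip2 bzip2-devel zlib zlib-devel
perl-XML-Dumper perl-XML-Grove perl-XML-LibXML perl-XML-LibXML-Common
perl-XML-NamespaceSupport perl-XML-Parser perl-XML-SAX perl-XML-Simple
perl-XML-Twig
"

# Print the install command instead of running it, so it can be reviewed first.
# $casava_pkgs is left unquoted on purpose so the newlines collapse to spaces.
print_install_cmd() {
    echo "yum install -y" $casava_pkgs
}

print_install_cmd
```

Copy the printed line and run it as root once you are happy with it.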
    
  • Get Casava 1.8.2 (or whichever version) from Illumina.

  • Build the software according to the installation documents. I installed in my $HOME, after installing most dependencies as root (except for PyXML, which lives in $HOME so as not to pollute the system site-packages):

    ./configure --prefix=$HOME
    make
    make install
    
  • Make sure the install location is in your $PATH
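
A quick way to check this, assuming that building with --prefix=$HOME put the Casava executables in $HOME/bin (verify this against your own install):

```shell
# Assumption: Casava was configured with --prefix=$HOME, so its
# executables were installed under $HOME/bin.
case ":$PATH:" in
    *":$HOME/bin:"*) ;;                          # already on PATH, nothing to do
    *) PATH="$HOME/bin:$PATH"; export PATH ;;
esac

# Report whether the configure script can now be found.
if command -v configureBclToFastq.pl >/dev/null 2>&1; then
    echo "found: $(command -v configureBclToFastq.pl)"
else
    echo "configureBclToFastq.pl not found on PATH" >&2
fi
```

To make the change permanent, put the PATH line in your shell startup file.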

Prepare the data

  • If you have an entire flowcell of data from the HiSeq, you may want to pare this down to a single lane. You need the directory structure below. I'll use L008 as an example, but it could be any lane. An asterisk is a wildcard, representing all files of that type. Copy the files below from the full flowcell directory into a new directory that represents your lane. Several of the XML files are not split by lane, so just copy the main file. I'm working on a python script that will do this automatically. Note: replace L008 and the 8 in s_8_*_pos.txt below with the lane containing your data:

    - Date_InstrumentNumber_run/
        - RunInfo.xml
        - runParameters.xml
        Data/
            Intensities/
                - config.xml
                - RTAConfiguration.xml
                - s_8_*_pos.txt
                L008/*
                BaseCalls/
                    - config.xml
                    - SampleSheet.csv
                    L008/*
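
Until a proper script exists, the copy can be sketched in plain shell. This is a minimal, untested-on-real-data sketch of the layout above: the function name extract_lane and its arguments are hypothetical, and it assumes the source run folder uses exactly the file names shown in the tree.

```shell
# extract_lane SRC DEST LANE
# Copy the files for one lane from a full run folder (SRC) into a new,
# smaller run folder (DEST), following the layout shown above.
extract_lane() {
    src=$1 dest=$2 lane=$3
    lane_dir=$(printf 'L%03d' "$lane")           # e.g. 8 -> L008
    src_int="$src/Data/Intensities"
    dest_int="$dest/Data/Intensities"

    mkdir -p "$dest_int/BaseCalls" || return 1

    # Run-level XML files are not split by lane; copy them whole.
    cp "$src/RunInfo.xml" "$src/runParameters.xml" "$dest/" || return 1
    cp "$src_int/config.xml" "$src_int/RTAConfiguration.xml" "$dest_int/" || return 1
    cp "$src_int/BaseCalls/config.xml" "$dest_int/BaseCalls/" || return 1

    # Lane-specific position files and per-lane directories.
    cp "$src_int"/s_"$lane"_*_pos.txt "$dest_int/" || return 1
    cp -R "$src_int/$lane_dir" "$dest_int/" || return 1
    cp -R "$src_int/BaseCalls/$lane_dir" "$dest_int/BaseCalls/" || return 1

    # SampleSheet.csv is usually written by hand; copy it only if present.
    if [ -f "$src_int/BaseCalls/SampleSheet.csv" ]; then
        cp "$src_int/BaseCalls/SampleSheet.csv" "$dest_int/BaseCalls/" || return 1
    fi
}
```

For example, extract_lane /path/to/full_run /path/to/lane8_run 8 would build the L008 layout in /path/to/lane8_run (both paths are placeholders).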
    
  • Set up your SampleSheet.csv properly. If it is not configured correctly, the program may run, but you may get no output, which is confusing. In the following:

    FCID - the flowcell ID. Must place value here.
    Lane - the lane id.  Must place value here.
    SampleID - the sample ID.  Must place value here.
    SampleRef - the reference genome.  Can be empty.
    Index - the index sequence.  Can be empty.
    Description - description of what you're doing. Can be empty.
    Control - Y or N indicating control lane
    Recipe - the recipe name.  Must place value here.
    Operator - your name/initials.  Must place value here.
    SampleProject - the name of the "project".  Becomes directory holding your files. Must place value here.
    
  • Sample sheet to return data and indexes, only - no demultiplexing, just reads with indexes (assuming you sequenced indexes):

    FCID,Lane,SampleID,SampleRef,Index,Description,Control,Recipe,Operator,SampleProject
    D109LACXX,8,not_demultiplexed,,,Test bcl conversion,N,D109LACXX,BCF,testbclconv
    
  • Sample sheet to return demultiplexed data, according to values within. Note that my barcodes are 10 nt here:

    FCID,Lane,SampleID,SampleRef,Index,Description,Control,Recipe,Operator,SampleProject
    D109LACXX,8,my_sample_1,,AACCGAGTTA,Test demultiplexing,N,D109LACXX,BCF,testdmux
    D109LACXX,8,my_sample_2,,AATACTTCCG,Test demultiplexing,N,D109LACXX,BCF,testdmux
    D109LACXX,8,my_sample_3,,AACAACAACC,Test demultiplexing,N,D109LACXX,BCF,testdmux
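
Because a bad sample sheet can fail silently, a quick sanity check before configuring may save some time. The sketch below only tests what is described above (10 comma-separated fields per row, and non-empty values in the columns marked "Must place value here"). The function name check_samplesheet is made up, and this is not a substitute for whatever validation Casava does itself.

```shell
# check_samplesheet FILE
# Exit status 0 if every data row has 10 comma-separated fields and the
# fields described above as "Must place value here" are non-empty.
check_samplesheet() {
    awk -F',' '
        { sub(/\r$/, "") }                       # tolerate Windows line endings
        NR == 1 { next }                         # skip the header row
        NF != 10 {
            printf "line %d: expected 10 fields, got %d\n", NR, NF
            bad = 1
            next
        }
        {
            # Required columns: FCID(1) Lane(2) SampleID(3)
            # Recipe(8) Operator(9) SampleProject(10)
            n = split("1 2 3 8 9 10", req, " ")
            for (i = 1; i <= n; i++)
                if ($(req[i]) == "") {
                    printf "line %d: field %d is empty\n", NR, req[i]
                    bad = 1
                }
        }
        END { exit bad + 0 }
    ' "$1"
}
```

Run it as check_samplesheet SampleSheet.csv and look at the exit status and any messages it prints.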
    
  • Depending on what you want to do and where you're storing the resulting data, get your paths in order (know what they are).

Generate the Makefile

Standard Demultiplexing

  • For standard demultiplexing with no error correction, enter your TruSeq indexes in the sample sheet, then run:

    configureBclToFastq.pl \
        --input-dir /your/path/to/run_folder/Data/Intensities/BaseCalls \
        --output-dir /where/you/want/the/output \
        --sample-sheet /my/path/to/run_folder/Data/Intensities/BaseCalls/SampleSheet.csv
    

Standard Demultiplexing and correcting 1 index error

The standard TruSeq indexes allow you to correct one substitution error within their sequence.

  • For standard demultiplexing with 1 error correction (substitutions), enter your TruSeq indexes in the sample sheet and explicitly pass --mismatches:

    configureBclToFastq.pl \
        --input-dir /your/path/to/run_folder/Data/Intensities/BaseCalls \
        --output-dir /where/you/want/the/output \
        --sample-sheet /my/path/to/run_folder/Data/Intensities/BaseCalls/SampleSheet.csv \
        --mismatches 1
    

Demultiplexing Long (>9 nt) Indexes

  • For demultiplexing longer indexes (nt > 9) with no error correction, you need to explicitly pass --use-bases-mask and have the entire index sequence in your SampleSheet.csv, or you'll get an error:

    configureBclToFastq.pl \
        --input-dir /your/path/to/run_folder/Data/Intensities/BaseCalls \
        --output-dir /where/you/want/the/output \
        --sample-sheet /my/path/to/run_folder/Data/Intensities/BaseCalls/SampleSheet.csv \
        --use-bases-mask Y*,I*,Y*
    

Demultiplexing Long (>9 nt) Indexes and correcting 1 index error

  • For demultiplexing longer indexes (nt > 9) with 1 error correction (substitutions), you need to explicitly pass --use-bases-mask and --mismatches and have the entire index sequence in your SampleSheet.csv, or you'll get an error:

    configureBclToFastq.pl \
        --input-dir /your/path/to/run_folder/Data/Intensities/BaseCalls \
        --output-dir /where/you/want/the/output \
        --sample-sheet /my/path/to/run_folder/Data/Intensities/BaseCalls/SampleSheet.csv \
        --use-bases-mask Y*,I*,Y* --mismatches 1
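
The four demultiplexing cases above differ only in two optional flags, so they can be summarized in a small helper. This is a sketch with a made-up function name (bcl_extra_opts). It only encodes the rules stated in this document: indexes longer than 9 nt need --use-bases-mask Y*,I*,Y*, and error correction needs --mismatches.

```shell
# bcl_extra_opts INDEX_LENGTH MISMATCHES
# Print the extra configureBclToFastq.pl options for the four cases above.
bcl_extra_opts() {
    index_len=$1 mismatches=$2
    opts=""
    # Indexes longer than 9 nt need an explicit bases mask.
    if [ "$index_len" -gt 9 ]; then
        opts="--use-bases-mask Y*,I*,Y*"
    fi
    # Only pass --mismatches when error correction is requested.
    if [ "$mismatches" -gt 0 ]; then
        opts="${opts:+$opts }--mismatches $mismatches"
    fi
    echo "$opts"
}

bcl_extra_opts 6 1     # standard indexes, correct 1 error
bcl_extra_opts 10 1    # 10 nt indexes, correct 1 error
```

Append the printed options to the configureBclToFastq.pl command line after --sample-sheet.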
    

No Demultiplexing, just output reads and indices

  • If you want to process the data to fastq and demultiplex the data using an external method, you need to input a sample sheet with no demultiplexing requested (see above). Then, assuming you have 10 nucleotide indexes, run:

    configureBclToFastq.pl \
        --input-dir /your/path/to/run_folder/Data/Intensities/BaseCalls \
        --output-dir /where/you/want/the/output \
        --sample-sheet /my/path/to/run_folder/Data/Intensities/BaseCalls/SampleSheet.csv \
        --use-bases-mask Y*,Y10,Y*
    
  • This will output R1 (first 100 bp read), R2 (index read), and R3 (second 100 bp read) files rather than the "normal" R1 and R2 files that contain an index sequence. You will need to subsequently manipulate the files to prepare them for your downstream demultiplexing code. Thanks go to the excellent Illumina Tech Support staff for the solution above (and below).

  • If this causes problems you may need to specifically state the R1 and R3 read lengths. So, if you did a PE100 run, then:

    configureBclToFastq.pl \
        --input-dir /your/path/to/run_folder/Data/Intensities/BaseCalls \
        --output-dir /where/you/want/the/output \
        --sample-sheet /my/path/to/run_folder/Data/Intensities/BaseCalls/SampleSheet.csv \
        --use-bases-mask Y100,Y10,Y100
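
What "manipulating the files" means depends entirely on your downstream demultiplexing code. As one possible example, the sketch below appends each index sequence from the index-read file to the header of the matching read. Everything here is an assumption for illustration: the function name, the separator, and the requirement that both inputs are uncompressed 4-line FASTQ files with records in the same order (decompress the Casava output first if it is gzipped).

```shell
# tag_reads_with_index READS.fastq INDEX.fastq [SEPARATOR]
# Append each index sequence to the header line of the matching read.
# Assumes uncompressed 4-line FASTQ records in the same order in both files.
tag_reads_with_index() {
    sep=${3-#}                                   # default separator is "#"
    paste "$1" "$2" | awk -F'\t' -v sep="$sep" '
        NR % 4 == 1 { header = $1 }              # header of the read
        NR % 4 == 2 { print header sep $2        # $2 is the index sequence
                      print $1 }                 # the read sequence itself
        NR % 4 == 3 { print $1 }                 # "+" separator line
        NR % 4 == 0 { print $1 }                 # quality string of the read
    '
}
```

For example, tag_reads_with_index R1.fastq R2.fastq > R1.tagged.fastq, where the file names are placeholders for your own R1 and index-read files.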
    

Run the conversion/demultiplexing

  • change to the output directory you just created:

    cd /where/you/want/the/output
    
  • run make:

    make
    
  • if you have multiple processing cores (where N = number of cores):

    make -j N
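
If you don't want to look up N by hand, the core count can be queried from the system. A small sketch, assuming getconf supports _NPROCESSORS_ONLN on your machine (it does on the Linux systems I know of). The function name run_conversion is made up.

```shell
# Number of processor cores currently online.
cores=$(getconf _NPROCESSORS_ONLN)

# run_conversion OUTPUT_DIR
# Run make inside the configured output directory, one job per core.
# The subshell keeps the cd from changing your current directory.
run_conversion() {
    ( cd "$1" && make -j "$cores" )
}

echo "would run: make -j $cores"
```

Then call run_conversion /where/you/want/the/output to start the conversion.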
    
@guillermomarco

Hello,

I'm having some problems with HiSeq data conversion. I would like to try the idea you describe on this page: merging all the flowcell data into a single-lane structure. Did you ever create the Python script to prepare this data?

Thank you so much !

Best,
Guillermo.
