The following assumes you are converting BCL files containing PE100 reads with a 10 nt index read. You can allow Casava to demultiplex for you or do it on your own, later. You can adjust values below if you are doing something different (e.g. shorter reads, longer indexes) but be careful.
You need a pretty beefy machine. Illumina recommends something with multiple cores and 48 GB RAM, running CentOS 5. CentOS 6 also works just fine. See their recommendations here:
http://support.illumina.com/sequencing/sequencing_software/casava/computing_requirements.ilmn
You also need compilers and the usual build tools installed:
yum groupinstall "Development Tools"
Casava 1.8.2 (or whichever version) has many dependencies. You can meet them all pretty easily on CentOS 5/6 using `yum install packagename`. You want the x86_64 versions:
- GNU make (3.81 recommended)
- Perl (>= 5.8)
- Python (>= 2.3 and <= 2.6)
- PyXML
- gnuplot (>= 3.7, 4.0 recommended)
- ImageMagick (>= 5.4.7)
- ghostscript
- libxslt
- libxslt-devel
- libxml2
- libxml2-devel
- libxml2-python
- ncurses
- ncurses-devel
- libtiff
- libtiff-devel
- bzip2
- bzip2-devel
- zlib
- zlib-devel
Perl modules:
- perl-XML-Dumper
- perl-XML-Grove
- perl-XML-LibXML
- perl-XML-LibXML-Common
- perl-XML-NamespaceSupport
- perl-XML-Parser
- perl-XML-SAX
- perl-XML-Simple
- perl-XML-Twig
- perldoc
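Before building, it can help to see which of the command-line tools from the list above are already on your PATH. The following is only a rough sketch: `missing_tools` is a hypothetical helper name, it checks executables only (`convert` for ImageMagick, `gs` for ghostscript, `xsltproc` for libxslt), and it tells you nothing about versions, the -devel packages, or the Perl modules.

```shell
#!/bin/bash
# missing_tools TOOL...
# Hypothetical helper: prints the name of every tool that cannot be
# found on PATH. Prints nothing if all of them are found.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Executables provided by some of the dependencies listed above.
missing_tools make perl python gnuplot convert gs xsltproc
```

Anything this prints still needs to be installed with yum; an empty result only means the executables exist, not that every dependency is satisfied.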
Get Casava 1.8.2 (or whichever version) from Illumina.
Build the software according to the installation documents. I installed in my $HOME, after installing most dependencies as root (except for PyXML, which lives in $HOME so as not to pollute the system site-packages):
./configure --prefix=$HOME
make
make install
Make sure the install location is in your $PATH
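One way to do this from your shell startup file is sketched below. It assumes that installing with --prefix=$HOME put the executables in $HOME/bin, which is the usual layout but worth confirming on your own system; `CASAVA_BIN` is just a variable name chosen for this example.

```shell
#!/bin/bash
# Assumption: with --prefix=$HOME the executables land in $HOME/bin.
CASAVA_BIN="$HOME/bin"

# Prepend the directory to PATH only if it is not already listed.
case ":$PATH:" in
    *":$CASAVA_BIN:"*) ;;
    *) PATH="$CASAVA_BIN:$PATH"; export PATH ;;
esac

# Report whether the configuration script can now be found.
if command -v configureBclToFastq.pl >/dev/null 2>&1; then
    echo "configureBclToFastq.pl found at $(command -v configureBclToFastq.pl)"
else
    echo "configureBclToFastq.pl not found on PATH" >&2
fi
```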
If you have an entire flowcell of data from the HiSeq, you may want to pare this down to a single lane. You need the following directory structure. I'll use L008 as an example below, but it could be any lane. An asterisk below is a wildcard, representing all files of that type. Copy the files below from the entire flowcell into a new directory that represents your lane. Several of the XML files are not split by lane, so just copy the whole file. I'm working on a python script that will do this automatically. Note: you should replace L008 and the 8 in s_8_*_pos.txt below with the lane containing your data:
Date_InstrumentNumber_run/
    - RunInfo.xml
    - runParameters.xml
    Data/
        Intensities/
            - config.xml
            - RTAConfiguration.xml
            - s_8_*_pos.txt
            L008/*
            BaseCalls/
                - config.xml
                - SampleSheet.csv
                L008/*
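Until a proper script exists, the copy step can be roughed out in shell. This is a minimal sketch rather than the python script mentioned above: `copy_lane` is a hypothetical helper, it assumes the source run folder has exactly the layout shown, and it copies SampleSheet.csv unchanged, so you may still need to edit the sheet so it only describes your lane.

```shell
#!/bin/bash
# copy_lane SRC_RUN DEST_RUN LANE
# Hypothetical helper (not part of Casava): copies the files listed above
# for a single lane from a full run folder into a new run folder.
copy_lane() {
    local src=$1 dest=$2 lane=$3
    local lane_dir int=Data/Intensities
    lane_dir=$(printf 'L%03d' "$lane")    # e.g. 8 -> L008

    mkdir -p "$dest/$int/BaseCalls" || return 1

    # Run-level XML files are not split by lane, so copy them whole.
    cp "$src/RunInfo.xml" "$src/runParameters.xml" "$dest/" || return 1
    cp "$src/$int/config.xml" "$src/$int/RTAConfiguration.xml" \
        "$dest/$int/" || return 1

    # Position files and intensity data for the chosen lane only.
    cp "$src/$int"/s_"$lane"_*_pos.txt "$dest/$int/" || return 1
    cp -r "$src/$int/$lane_dir" "$dest/$int/" || return 1

    # Base calls for the chosen lane, plus the BaseCalls-level files.
    cp "$src/$int/BaseCalls/config.xml" \
        "$src/$int/BaseCalls/SampleSheet.csv" \
        "$dest/$int/BaseCalls/" || return 1
    cp -r "$src/$int/BaseCalls/$lane_dir" "$dest/$int/BaseCalls/" || return 1
}

# Usage (paths are placeholders):
#   copy_lane /path/to/full_run /path/to/lane8_run 8
```

The destination should be a new, empty directory; copying into an existing lane directory would nest the L008 folders.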
Set up your SampleSheet.csv properly. If it is not configured correctly, the program may run, but you may get no output, which is confusing. The fields are:
FCID - the flowcell ID. Must place value here.
Lane - the lane id. Must place value here.
SampleID - the sample ID. Must place value here.
SampleRef - the reference genome. Can be empty.
Index - the index sequence. Can be empty.
Description - description of what you're doing. Can be empty.
Control - Y or N indicating control lane.
Recipe - the recipe name. Must place value here.
Operator - your name/initials. Must place value here.
SampleProject - the name of the "project". Becomes directory holding your files. Must place value here.
Sample sheet to return data and indexes, only - no demultiplexing, just reads with indexes (assuming you sequenced indexes):
FCID,Lane,SampleID,SampleRef,Index,Description,Control,Recipe,Operator,SampleProject
D109LACXX,8,not_demultiplexed,,,Test bcl conversion,N,D109LACXX,BCF,testbclconv
Sample sheet to return demultiplexed data, according to values within. Note that my barcodes are 10 nt here:
FCID,Lane,SampleID,SampleRef,Index,Description,Control,Recipe,Operator,SampleProject
D109LACXX,8,my_sample_1,,AACCGAGTTA,Test demultiplexing,N,D109LACXX,BCF,testdmux
D109LACXX,8,my_sample_2,,AATACTTCCG,Test demultiplexing,N,D109LACXX,BCF,testdmux
D109LACXX,8,my_sample_3,,AACAACAACC,Test demultiplexing,N,D109LACXX,BCF,testdmux
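Because a malformed sheet can fail silently, a quick sanity check before running anything may save some confusion. The sketch below is not part of Casava: `check_sample_sheet` is a hypothetical helper that only confirms every row has 10 comma-separated fields and that the fields marked "Must place value here" (FCID, Lane, SampleID, Recipe, Operator, SampleProject) are filled in. It does not validate index sequences or anything else.

```shell
#!/bin/bash
# check_sample_sheet FILE
# Hypothetical helper: returns non-zero and prints a message for each row
# that does not have 10 fields or is missing a required value.
check_sample_sheet() {
    awk -F',' '
        { sub(/\r$/, "") }          # tolerate Windows line endings
        $0 == "" { next }           # ignore blank lines
        NF != 10 {
            printf "line %d: expected 10 fields, found %d\n", NR, NF
            bad = 1
            next
        }
        NR > 1 {                    # the header row is only counted
            n = split("1 2 3 8 9 10", req, " ")
            for (k = 1; k <= n; k++)
                if ($(req[k]) == "") {
                    printf "line %d: field %d is empty\n", NR, req[k]
                    bad = 1
                }
        }
        END { exit bad + 0 }
    ' "$1"
}

# Usage (path is a placeholder):
#   check_sample_sheet /path/to/SampleSheet.csv && echo "sheet looks OK"
```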
Depending on what you want to do and where you're storing the resulting data, get your paths in order (know what they are).
For standard demultiplexing with no error correction, after you enter your TruSeq indexes in the sample sheet, then:
configureBclToFastq.pl \
    --input-dir /your/path/to/run_folder/Data/Intensities/BaseCalls \
    --output-dir /where/you/want/the/output \
    --sample-sheet /my/path/to/run_folder/Data/Intensities/BaseCalls/SampleSheet.csv
The standard TruSeq indexes allow you to correct one substitution error within their sequence.
For demultiplexing longer indexes (nt > 9) with no error correction, you need to explicitly pass --use-bases-mask and have the entire index sequence in your SampleSheet.csv, or you'll get an error:
configureBclToFastq.pl \
    --input-dir /your/path/to/run_folder/Data/Intensities/BaseCalls \
    --output-dir /where/you/want/the/output \
    --sample-sheet /my/path/to/run_folder/Data/Intensities/BaseCalls/SampleSheet.csv \
    --use-bases-mask Y*,I*,Y*
For standard demultiplexing with 1 error correction (substitutions), after you enter your TruSeq indexes in the sample sheet, you need to explicitly pass `--mismatches`:
configureBclToFastq.pl \
    --input-dir /your/path/to/run_folder/Data/Intensities/BaseCalls \
    --output-dir /where/you/want/the/output \
    --sample-sheet /my/path/to/run_folder/Data/Intensities/BaseCalls/SampleSheet.csv \
    --mismatches 1
For demultiplexing longer indexes (nt > 9) with 1 error correction (substitutions), you need to explicitly pass --use-bases-mask and --mismatches and have the entire index sequence in your SampleSheet.csv, or you'll get an error:
configureBclToFastq.pl \
    --input-dir /your/path/to/run_folder/Data/Intensities/BaseCalls \
    --output-dir /where/you/want/the/output \
    --sample-sheet /my/path/to/run_folder/Data/Intensities/BaseCalls/SampleSheet.csv \
    --use-bases-mask Y*,I*,Y* \
    --mismatches 1
If you want to process the data to fastq and demultiplex the data using an external method, you need to input a sample sheet with no demultiplexing requested (see above). Then, assuming you have 10 nucleotide indexes, run:
configureBclToFastq.pl \
    --input-dir /your/path/to/run_folder/Data/Intensities/BaseCalls \
    --output-dir /where/you/want/the/output \
    --sample-sheet /my/path/to/run_folder/Data/Intensities/BaseCalls/SampleSheet.csv \
    --use-bases-mask Y*,Y10,Y*
This will output R1 (first 100 bp read), R2 (index read), and R3 (second 100 bp read) files rather than the "normal" R1 and R2 files that contain an index sequence. You will need to subsequently manipulate the files to prepare them for your downstream demultiplexing code. Thanks go to the excellent Illumina Tech Support staff for the solution above (and below).
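What that manipulation looks like depends entirely on your downstream code. As one illustration only, the sketch below copies each index sequence from the R2 file onto the header of the matching R1 record. The assumptions are mine: `tag_reads_with_index` is a hypothetical helper, both inputs are uncompressed four-line FASTQ files (decompress any .gz output first), the records are in the same order in both files, and the `#INDEX` suffix is an arbitrary convention that your demultiplexing tool may not expect.

```shell
#!/bin/bash
# tag_reads_with_index R1.fastq R2_index.fastq
# Hypothetical helper: writes the R1 records to stdout with the index
# sequence from the matching R2 record appended to the header line.
tag_reads_with_index() {
    # paste joins line N of both files with a tab; awk then rebuilds
    # each four-line record from the R1 column plus the R2 sequence.
    paste "$1" "$2" | awk -F'\t' '
        NR % 4 == 1 { header = $1; next }          # hold the R1 header
        NR % 4 == 2 { print header "#" $2          # header + index
                      print $1; next }             # R1 sequence
                    { print $1 }                   # "+" and quality lines
    '
}

# Usage (file names are placeholders):
#   tag_reads_with_index lane8_R1.fastq lane8_R2.fastq > lane8_R1.tagged.fastq
```

The same approach would work for the second 100 bp read by passing the R3 file as the first argument.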
If this causes problems you may need to specifically state the R1 and R3 read lengths. So, if you did a PE100 run, then:
configureBclToFastq.pl \
    --input-dir /your/path/to/run_folder/Data/Intensities/BaseCalls \
    --output-dir /where/you/want/the/output \
    --sample-sheet /my/path/to/run_folder/Data/Intensities/BaseCalls/SampleSheet.csv \
    --use-bases-mask Y100,Y10,Y100
Change to the output directory you just created:
cd /where/you/want/the/output
Run make:
make
Or, if you have multiple processing cores (where N = number of cores):
make -j N
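If you don't know N offhand, you can ask the system. A small sketch, assuming `getconf _NPROCESSORS_ONLN` is available on your machine (it is on typical Linux systems); `detect_cores` is a hypothetical helper name that falls back to 1 if the query fails.

```shell
#!/bin/bash
# detect_cores
# Hypothetical helper: prints the number of online processors, or 1 if
# the value cannot be determined.
detect_cores() {
    local n
    n=$(getconf _NPROCESSORS_ONLN 2>/dev/null)
    case "$n" in
        ''|*[!0-9]*) n=1 ;;     # empty or non-numeric: fall back to 1
    esac
    echo "$n"
}

# Usage, from inside the output directory:
#   make -j "$(detect_cores)"
```

Using every core can make the machine sluggish for other users, so you may prefer to pass a smaller number by hand.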
Hello,
I'm having some problems with HiSeq data conversion. I would like to try the idea you describe on this page: merging the flowcell data into a single-lane structure. Did you ever finish the python script to prepare this data?
Thank you so much !
Best,
Guillermo.