@drio
Created April 30, 2010 14:54
These are the key changes we suggest implementing in order to successfully
integrate the incoming HiSeq instruments:
1. Transfer process from Instruments to Cluster.
Currently the two GAIIs dump to a single intermediate machine (slxdump) and the
data is then moved to the cluster. We suggest removing the intermediate machine
and having the instruments dump directly to the cluster volumes, similar to the
process we already use for SOLiD.
The only protocol Illumina supports for transferring data off the instruments
is SMB, which means Samba will have to be set up on the dumping servers.
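As a post-setup sanity check, something along these lines could confirm from
the cluster side that the instrument-facing shares are actually exported. This
is only a sketch; the server names and the share name are placeholders, not
the real ones:

    #!/usr/bin/env python3
    # Check that each dumping server exposes the expected SMB share.
    # Hostnames and the share name are hypothetical placeholders.
    import subprocess

    DUMP_SERVERS = ["dumphost1", "dumphost2"]   # placeholder dumping servers
    EXPECTED_SHARE = "hiseq_runs"               # placeholder share name

    def share_visible(server, share):
        # Anonymous listing of the server's shares via smbclient.
        out = subprocess.run(["smbclient", "-L", "//" + server, "-N"],
                             capture_output=True, text=True)
        return share in out.stdout

    for server in DUMP_SERVERS:
        ok = share_visible(server, EXPECTED_SHARE)
        print("%s: %s %s" % (server, EXPECTED_SHARE, "ok" if ok else "MISSING"))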
2. What data to transfer.
The HiSeq does not save images by default. Image saving can be enabled, but it
requires about 30 TB of free space per run. In addition, with the new software
Illumina no longer saves intensities by default; keeping them would require
around 4 TB per run.
We agree with Illumina's default behaviour and recommend keeping only the base
calls. Even so, that still requires roughly 400 GB per run.
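For planning purposes, the per-run figures above translate into yearly storage
roughly as follows (the per-run sizes come from this note; the runs-per-year
number is only a made-up planning figure):

    # Per-run sizes from this note; runs_per_year is only an assumption.
    run_sizes_tb = {
        "images enabled":   30.0,
        "intensities only":  4.0,
        "base calls only":   0.4,   # the recommended default
    }
    runs_per_year = 50  # placeholder; adjust to the real run schedule

    for option, tb_per_run in run_sizes_tb.items():
        print("%-18s %6.1f TB/run  -> %8.1f TB/year" %
              (option, tb_per_run, tb_per_run * runs_per_year))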
3. Archiving Solution.
We need a better way to archive data. The current archiving procedure requires
human interaction, which is already a problem and will only get worse once the
HiSeqs arrive. Moving to a programmatic approach will save time and avoid human
errors. The old method of dropping the run locations in a file worked fine for
us, but we are open to any ideas the sysadmins may have.
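As a starting point, a minimal sketch of what a programmatic pass could look
like, keeping the old idea of listing run locations in a plain file but
removing the manual steps (all paths below are placeholders):

    #!/usr/bin/env python3
    # Read run directories from a locations file (one path per line) and
    # copy each one to the archive volume with rsync, reporting failures.
    import subprocess
    import sys

    LOCATIONS_FILE = "/var/tmp/runs_to_archive.txt"   # placeholder path
    ARCHIVE_ROOT = "/archive/hiseq"                    # placeholder path

    def archive(run_dir):
        # -a preserves permissions/timestamps; True means rsync succeeded.
        return subprocess.call(["rsync", "-a", run_dir, ARCHIVE_ROOT + "/"]) == 0

    failed = []
    with open(LOCATIONS_FILE) as fh:
        for line in fh:
            run_dir = line.strip()
            if not run_dir or run_dir.startswith("#"):
                continue
            if not archive(run_dir):
                failed.append(run_dir)

    for run_dir in failed:
        print("FAILED: " + run_dir, file=sys.stderr)
    sys.exit(1 if failed else 0)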
4. Computer Processing Requirements.
We are still working out how much parallelization we need to use the compute
resources efficiently. Illumina's documentation states that processing 200 GB
of data takes 44 hours on their three machines (8 cores and 32 GB of RAM each).
WashU can process 200 GB in 30 hours using 64 cores across eight machines of
the same spec (8 cores, 32 GB of RAM). One thing is clear: we will need 8-core
machines with 32 GB of RAM. Note that the ELAND step in the GAP spawns N
processes at the OS level rather than submitting jobs at the cluster level.
We should also consider moving the pipeline to Ardmore, but we will need the
sysadmins' feedback on that.
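A quick back-of-the-envelope comparison of the two figures above, expressed as
GB per core-hour (ignoring I/O and the fact that ELAND parallelizes within a
node rather than across the cluster):

    # Throughput figures quoted above: 200 GB per run in both cases.
    run_gb = 200.0
    configs = {
        "Illumina docs": {"hours": 44, "nodes": 3, "cores_per_node": 8},
        "WashU":         {"hours": 30, "nodes": 8, "cores_per_node": 8},
    }

    for name, c in configs.items():
        cores = c["nodes"] * c["cores_per_node"]
        per_core_hour = run_gb / (c["hours"] * cores)
        print("%-14s %2d cores, %2d h wall clock, %.2f GB per core-hour" %
              (name, cores, c["hours"], per_core_hour))

If those figures hold, WashU's shorter wall-clock time comes from using more
nodes, not from better per-core efficiency (roughly 0.10 GB per core-hour for
WashU vs 0.19 for Illumina's quoted setup).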