These are the key changes we suggest implementing in order to successfully assimilate
the incoming HiSEQ instruments:
1. Transfer process from Instruments to Cluster.
Currently the two GAIIs dump to a single machine (slxdump) and from there to the
cluster. We suggest removing the intermediate machine and having the instruments
dump data directly to the cluster volumes, similar to what we already do for SOLiD.
The only transfer protocol Illumina supports is SMB, which means Samba will have to
be set up on the dumping servers (a rough sketch of how a dump volume could be
monitored for completed runs follows below).
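To illustrate, here is a minimal sketch in Python of polling a Samba-exported dump
volume for completed runs. The paths and the completion-marker filename are
assumptions for illustration, not the actual Illumina output layout.

    # Minimal sketch: poll the Samba-exported cluster volume for finished runs.
    # DUMP_DIR and MARKER are placeholders, not the real layout.
    import os
    import time

    DUMP_DIR = "/cluster/hiseq_dump"   # cluster volume exported to the instrument (hypothetical)
    MARKER = "RunComplete.txt"         # completion marker written by the instrument (assumed)
    SEEN = set()

    def completed_runs(root):
        """Yield run directories that contain the completion marker."""
        for entry in os.listdir(root):
            run_dir = os.path.join(root, entry)
            if os.path.isdir(run_dir) and os.path.exists(os.path.join(run_dir, MARKER)):
                yield run_dir

    while True:
        for run in completed_runs(DUMP_DIR):
            if run not in SEEN:
                SEEN.add(run)
                print("run finished dumping, ready for processing:", run)
                # here we would hand the run off to the pipeline / archiver
        time.sleep(300)   # poll every 5 minutes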
2. What data to transfer.
The HiSEQ does not save images by default. This can be enabled, but it would require
about 30 TB of free space per run. In addition, Illumina's new software no longer
saves intensities by default; keeping them would need roughly 4 TB per run.
We agree with the default behaviour of Illumina's software and recommend keeping only
the base calls. Even so, that is still about 400 GB per run (a back-of-the-envelope
storage sketch follows below).
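For rough capacity planning, a back-of-the-envelope sketch in Python using the
per-run figures above. The instrument count and runs-per-year numbers are
placeholders, not projections we are committing to.

    # Yearly storage estimate from the per-run sizes quoted above.
    # Instrument count and runs per year are hypothetical inputs.
    SIZES_TB = {"images": 30.0, "intensities": 4.0, "basecalls": 0.4}   # TB per run

    instruments = 2        # hypothetical number of HiSEQs
    runs_per_year = 20     # hypothetical runs per instrument per year

    for what, tb_per_run in SIZES_TB.items():
        total = tb_per_run * instruments * runs_per_year
        print("%-12s %8.1f TB/year" % (what, total))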
3. Archiving Solution.
We need a better way to archive data. The current archiving protocol requires human
interaction, which is already a problem and will become even more so once the HiSEQs
arrive. Moving to a programmatic approach will save time and avoid human errors. The
old method of dropping the run locations in a file worked fine for us, but we are
open to any ideas the sysadmins may have (one possible shape for such an archiver is
sketched below).
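One possible shape for such an archiver, sketched in Python. The queue file, archive
volume, and tar-based packaging are assumptions for illustration, not a final design.

    # Minimal sketch of a programmatic archiver: completed runs are registered in a
    # plain queue file (one run directory per line) and packaged onto an archive
    # volume without human interaction. All paths are placeholders.
    import os
    import subprocess

    QUEUE_FILE = "/cluster/archive/queue.txt"      # runs waiting to be archived (assumed)
    DONE_FILE = "/cluster/archive/archived.txt"    # runs already archived (assumed)
    ARCHIVE_DIR = "/archive/hiseq"                 # destination volume (assumed)

    def read_lines(path):
        """Return non-empty lines from path, or an empty list if it does not exist."""
        if not os.path.exists(path):
            return []
        with open(path) as fh:
            return [line.strip() for line in fh if line.strip()]

    pending = [r for r in read_lines(QUEUE_FILE) if r not in set(read_lines(DONE_FILE))]

    for run_dir in pending:
        name = os.path.basename(run_dir.rstrip("/"))
        tarball = os.path.join(ARCHIVE_DIR, name + ".tar.gz")
        # tar the run directory onto the archive volume
        rc = subprocess.call(["tar", "czf", tarball, "-C", os.path.dirname(run_dir), name])
        if rc == 0:
            with open(DONE_FILE, "a") as fh:
                fh.write(run_dir + "\n")
        else:
            print("archiving failed for", run_dir)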
4. Computer Processing Requirements
We are still trying to work out how to improve the level of parallelization so that
we use the computational resources efficiently. Illumina's docs state that it takes
44 hours to process 200 GB of data on their 3 machines (8 cores, 32 GB each). WashU
can process 200 GB in 30 hours using 64 cores across 8 machines (8 cores, 32 GB
each). One thing is clear: we will need 8-core machines with 32 GB of RAM. The ELAND
step in the GAP spawns N processes at the OS level instead of at the cluster level
(a sketch of moving that work to the scheduler follows below).
We should also consider moving the pipeline to Ardmore, but we will need the
sysadmins' feedback on this.
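A rough sketch of moving the per-lane work to the scheduler instead of forking N
processes on one node. The qsub options and the per-lane wrapper script are
placeholders; the real invocation depends on the GAP version and on whatever
scheduler runs on Ardmore.

    # Submit one 8-core, 32 GB cluster job per lane instead of 8 local processes.
    # RUN_DIR, the wrapper script, and the qsub resource flags are hypothetical.
    import subprocess

    RUN_DIR = "/cluster/hiseq_runs/RUN_EXAMPLE"   # hypothetical run directory
    LANES = range(1, 9)                           # 8 lanes per flow cell

    for lane in LANES:
        job_script = "#!/bin/sh\ncd %s\nrun_lane_alignment.sh %d\n" % (RUN_DIR, lane)
        qsub = subprocess.Popen(
            ["qsub", "-l", "nodes=1:ppn=8,mem=32gb", "-N", "eland_lane%d" % lane],
            stdin=subprocess.PIPE,
        )
        qsub.communicate(job_script.encode())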