ALSPAC imputation: sharing the burden

Background

The ALSPAC imputation to the UK10K + 1000 Genomes reference panels is underway. Each chromosome is split into chunks of approximately 5 Mb, and each chunk is split into subsets of 700 individuals. For example, chromosome 22 is split into 8 chunks, and each chunk is split into 26 subsets, so completing chromosome 22 takes 8 x 26 = 208 jobs.

On Bluecrystal3 a single batch submission may contain at most 100 jobs. So, to complete chromosome 22 you need to submit 3 batches, of 100, 100 and 8 jobs.
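The batch arithmetic above can be sketched in a few lines of shell. This is only an illustration: `batch_sizes` is a hypothetical helper (not one of the shared scripts), and the chunk and subset counts are the chromosome 22 figures quoted above.

```shell
# Sketch: split (chunks x subsets) jobs into batches of at most `limit` jobs.
# batch_sizes is a hypothetical helper, not part of the shared scripts.
batch_sizes() {
    remaining=$(( $1 * $2 ))   # total jobs = chunks x subsets
    limit=$3                   # maximum jobs per batch submission
    sizes=""
    while [ "$remaining" -gt 0 ]; do
        if [ "$remaining" -gt "$limit" ]; then
            size=$limit
        else
            size=$remaining
        fi
        sizes="${sizes:+$sizes }$size"
        remaining=$(( remaining - size ))
    done
    echo "$sizes"
}

# Chromosome 22: 8 chunks, 26 subsets, limit of 100 jobs per batch
batch_sizes 8 26 100   # prints: 100 100 8
```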

All the data and scripts are ready and waiting to be submitted. They are in a shared location to which everyone should have write access, so all that is needed is to choose a chromosome and submit the submission scripts.

What to do

Step 1

First choose a chromosome. This spreadsheet shows which chromosomes still need to be done, which are underway, and which have been completed:

https://docs.google.com/spreadsheets/d/1bGpxgBuqloGV6wcgx9ITmEG2kkJO6v0SzCoC1_faKGI/edit?usp=sharing

You should be able to drag chromosomes to your name so that nobody else starts working on something that you are working on already.

Step 2

Navigate to the imputation directory:

cd /panfs/panasas01/shared/alspac/deprecated/gib/imputation/alspac_uk10k_1kg/data/imputed/

There is a separate directory for each chromosome. For example, if you are doing chromosome 10, go to the chr10 directory:

cd chr10

The directory contents look like this:

-rwxrwxr-x 1 gh13047 sscm 2117 Aug 12 15:12 01_submit_impute10.sh
-rwxrwxr-x 1 gh13047 sscm 2119 Aug 12 15:12 02_submit_impute10.sh
-rwxrwxr-x 1 gh13047 sscm 2119 Aug 12 15:12 03_submit_impute10.sh
-rwxrwxr-x 1 gh13047 sscm 2119 Aug 12 15:12 04_submit_impute10.sh
-rwxrwxr-x 1 gh13047 sscm 2119 Aug 12 15:12 05_submit_impute10.sh
-rwxrwxr-x 1 gh13047 sscm 2119 Aug 12 15:12 06_submit_impute10.sh
-rwxrwxr-x 1 gh13047 sscm 2119 Aug 12 15:12 07_submit_impute10.sh
-rwxrwxr-x 1 gh13047 sscm 2119 Aug 12 15:12 08_submit_impute10.sh
drwxrwxr-x 2 gh13047 sscm 4096 Aug 12 14:41 job_reports
-rw-rw-r-- 1 gh13047 sscm  566 Aug 12 14:42 split10.txt
-rwxrwxr-x 1 gh13047 sscm 2117 Aug 12 14:42 submit_impute10.sh

Step 3

You can test to see if the scripts are working for you by running them interactively. For example:

./01_submit_impute10.sh 1

will run the first job (i.e. chunk 1, subset 1). You should see impute2 starting up, reading in the haplotypes, reading in the reference data, reporting the numbers of type 0, type 1 and type 2 SNPs, etc. As long as it doesn't crash (for example, because it can't find a file), everything is working.

After a few moments, e.g. by the time it starts trying to read in the reference haplotypes, stop the test by pressing Ctrl+C. If you check the directory now you should see three new files:

ALSPAC10_1_1_samples
ALSPAC10_1_1_summary
ALSPAC10_1_1_warnings

These are log files from the run. They will be overwritten when you run the jobs properly using qsub.
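If you want to confirm that the interactive test left the three log files behind, a small check like the following may help. It is a sketch only: `check_test_logs` is a hypothetical helper, and it assumes the log files are named with the `_samples`, `_summary` and `_warnings` suffixes shown above.

```shell
# Sketch: report any of the three expected log files that are absent.
# check_test_logs is a hypothetical helper, not part of the shared scripts.
check_test_logs() {
    prefix=$1      # e.g. ALSPAC10_1_1
    missing=0
    for suffix in samples summary warnings; do
        if [ ! -e "${prefix}_${suffix}" ]; then
            echo "missing: ${prefix}_${suffix}"
            missing=1
        fi
    done
    return "$missing"   # 0 if all three files exist, 1 otherwise
}
```

For the chromosome 10 test above, this would be called as `check_test_logs ALSPAC10_1_1` from inside the chr10 directory.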

Step 4

All you need to do now is submit each of the numbered submission scripts, e.g.

qsub firstrun_1_submit_impute10.sh
qsub firstrun_2_submit_impute10.sh
...

or, equivalently, as a loop:

for i in {1..8}
do
    qsub firstrun_${i}_submit_impute10.sh
done
done
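Before submitting for real, it may be worth a dry run that prints the qsub commands and flags any script name that does not exist in the current directory. The sketch below assumes the `<prefix>_<n>_submit_impute<chr>.sh` naming used in the commands above; `dry_run_submit` is a hypothetical helper and it never calls qsub itself.

```shell
# Sketch: print the qsub commands that would be run, without submitting anything.
# dry_run_submit is a hypothetical helper, not part of the shared scripts.
dry_run_submit() {
    prefix=$1   # e.g. firstrun
    chr=$2      # e.g. 10
    n=$3        # number of submission scripts, e.g. 8
    i=1
    while [ "$i" -le "$n" ]; do
        script="${prefix}_${i}_submit_impute${chr}.sh"
        if [ -f "$script" ]; then
            echo "qsub $script"
        else
            echo "not found: $script" >&2
        fi
        i=$(( i + 1 ))
    done
}
```

Running `dry_run_submit firstrun 10 8` inside the chr10 directory lists what would be submitted; if any script is reported as not found, check the directory listing before going further.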

You can monitor what is running and what is not running by using the showq command. My username is gh13047, so I would run:

showq -u gh13047

You should see that some jobs are running, some are in a queue, and some are idle. Gradually they will start finishing.

Step 5

Check back in a couple of days and see how it's going. You should have some output files that look like this:

ALSPAC10_1_1_allele_probs.gz
ALSPAC10_1_1_diplotype_ordering
ALSPAC10_1_1.gz
ALSPAC10_1_1_haps.gz
ALSPAC10_1_1_info
ALSPAC10_1_1_info_by_sample
ALSPAC10_1_1_samples
ALSPAC10_1_1_summary
ALSPAC10_1_1_warnings
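To see which jobs have not yet produced their main output, you could scan for missing files. This is a sketch under assumptions: it takes the names to follow the pattern `ALSPAC<chr>_<chunk>_<subset>.gz` suggested by the listing above, `list_missing` is a hypothetical helper, and you need to supply the chunk and subset counts for your chromosome. A file that exists is not proof that its job finished cleanly, so treat the result as a rough guide.

```shell
# Sketch: list chunk_subset pairs whose main output file is absent or empty.
# list_missing is a hypothetical helper, not part of the shared scripts.
list_missing() {
    dir=$1        # directory holding the output files
    chr=$2        # chromosome number
    nchunks=$3    # number of chunks for this chromosome
    nsubsets=$4   # number of subsets per chunk
    chunk=1
    while [ "$chunk" -le "$nchunks" ]; do
        subset=1
        while [ "$subset" -le "$nsubsets" ]; do
            f="$dir/ALSPAC${chr}_${chunk}_${subset}.gz"
            # -s is true only if the file exists and is not empty
            [ -s "$f" ] || echo "${chunk}_${subset}"
            subset=$(( subset + 1 ))
        done
        chunk=$(( chunk + 1 ))
    done
}
```

For chromosome 22, with the 8 chunks and 26 subsets mentioned in the background, the call would be `list_missing . 22 8 26` from inside the chr22 directory.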

It's likely that some of the jobs will not have completed, either because they needed extra memory or they ran out of time. To complete all unfinished jobs, run the following:

for i in {1..8}
do
    qsub secondrun_${i}_submit_impute10.sh
done

This resubmits all the jobs to the 48 hour queue with 10 cores per node; each one will only run if its particular batch still needs to be completed.
