This is a walkthrough of the UCGD Pipeline simplified project setup.
$> sudo /bin/su - ucgd-pepipeline
$> ml ucgd_modules
$> UCGDProject BuildProject -p [% Project %]
By default, UCGDProject will use the build specified in the UCGD database.
To override the default, add the following option.
optional arguments:
--build BUILD, -b BUILD
Which reference version to build project. Will override database assembly. i.e. GRCh37/GRCh38.
Modify the examples below to fit the location and type of data.
Consider using a PE data transfer node to increase transfer rate from outside PE:
$> ssh pe-dtn.chpc.utah.edu
$> cd /scratch/ucgd/lustre/UCGD_Processing/[% Project %]/UCGD/GRCh38/Project_Setup
Download the FDT jar file from http://monalisa.cern.ch/FDT/download.html
Log into HCI-Gnomex. You must first be given permission to access the project by the investigator and/or Brian Dalley.
Search Proj # > Files tab > Download > FDT CL download
Paste that command line into redwood as shown below.
$> nohup [paste CL from HCI-Gnomex] 2> fdt.error &
# For example:
$> nohup java -jar fdt.jar -noupdates -pull -r -c hci-bio-app.hci.utah.edu -d ./ /scratch/fdtswap/fdt_sandbox_gnomex/f0e318c3-3fd8-4ea5-b24c-dc949a3d5580/15101R 2> fdt.error &
Move the fastq.gz files from the HCI download subdirectory to Project_Setup and perform all subsequent processing in that directory. The Snakemake workflow and the other steps in the README file expect the fastq files and the processing steps to be in Project_Setup. If you process in another directory, things will break and you will have to edit the Snakefile and some of the command lines below.
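The move described above can be scripted. The following is a minimal sketch, assuming the download landed in a subdirectory such as ./15101R (the directory name is taken from the example FDT command) and that fastq file names are unique across its subfolders. The helper name collect_fastqs is hypothetical and not part of the UCGD tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper: move every *.fastq.gz found under a download
# directory into a destination directory (the current directory by default).
collect_fastqs() {
    local src="$1" dest="${2:-.}"
    # -n (no-clobber) refuses to overwrite an existing file of the same name,
    # so a name collision leaves the source file in place instead of losing data.
    find "$src" -type f -name '*.fastq.gz' -exec mv -n {} "$dest"/ \;
    # Print anything left behind (for example because of a name collision).
    find "$src" -type f -name '*.fastq.gz'
}
```

Run from Project_Setup as `collect_fastqs ./15101R .`; any path it prints was not moved and needs a manual look.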
$> wget -m ftp://User_Name:Password@Data_Link
Note: If you are copying from one server to another, be sure to use scp instead of cp.
$> ls /path/to/data/*.fastq.gz | nohup parallel cp {} . 2> cp.fastq.error &
$> ls /path/to/data/*.md5 | parallel cp {} . 2> cp.md5.error
# Or
$> ls /path/to/data/*.sha256 | parallel cp {} . 2> cp.sha256.error
# If no md5sums (or sha256) provided:
$> cd /path/to/data/
$> ls *.fastq.gz | nohup parallel 'md5sum {} > {}.md5' 2> md5sum.err &
$> cd -
# Once the background md5sum job has finished, copy the checksum files:
$> cp /path/to/data/*.md5 ./
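After the copy, it is worth confirming that each file still matches its checksum. The sketch below assumes every *.md5 or *.sha256 file uses the "HASH  FILENAME" layout written by md5sum or sha256sum and sits in the same directory as the file it describes. Checksum files supplied by a sequencing provider may use a different layout. The helper name verify_checksums is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: verify every checksum file found in a directory.
# Returns 0 only if all listed files match their recorded checksums.
verify_checksums() {
    local dir="${1:-.}" status=0 f
    (
        cd "$dir" || exit 2
        for f in *.md5; do
            [ -e "$f" ] || continue        # glob matched nothing
            md5sum --quiet -c "$f" || status=1
        done
        for f in *.sha256; do
            [ -e "$f" ] || continue
            sha256sum --quiet -c "$f" || status=1
        done
        exit "$status"
    )
}
```

Calling `verify_checksums .` in Project_Setup prints only the files that fail and returns a non-zero status if any do.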
# Use rclone config to set up a new endpoint for the transfer.
$> nohup rclone sync janssen_coon_s3:sys_bio/PROJ-00161 PROJ-00161 &> rclone-sync-sys_bio-PROJ-00161.txt &
# Set up a profile in ~/.aws/credentials that has the format
# [profile_name]
# aws_access_key_id = #############
# aws_secret_access_key = ###################
$> aws s3 sync --profile profile_name s3://bucket_name/ .
NOTE: THIS MUST BE DONE VERY CAREFULLY AND MAY BE DIFFERENT FOR EACH PROJECT! THIS IS A STEP WHERE MISTAKES COULD RESULT IN DATA SWAPS OR DATA MIXING. SANITY CHECK THE RESULTING FILE MANIFEST FOR ACCURACY.
If necessary, change the fastq file names or the bam file SM tags so that the Sample_ID column in the sample manifest matches either A) the fastq file name prior to the first underscore or B) the SM tag in the bam file.
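Once source_files_ids.txt has been generated by one of the steps below, the matching requirement can be checked mechanically. This is a minimal sketch, assuming the sample manifest is tab-delimited with a header row that contains a column named Sample_ID; the real manifest layout may differ. The helper name compare_sample_ids is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list sample IDs that appear in only one of the two files.
# $1 = sample manifest (assumed tab-delimited, header row with Sample_ID)
# $2 = source_files_ids.txt (sample_id<TAB>path)
compare_sample_ids() {
    local manifest="$1" ids="$2" work
    work=$(mktemp -d) || return 2
    # Locate the Sample_ID column by header name, then print its values.
    awk -F'\t' '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "Sample_ID") col = i; next }
        col     { print $col }
    ' "$manifest" | sort -u > "$work/manifest_ids"
    cut -f1 "$ids" | sort -u > "$work/file_ids"
    # comm -3 hides IDs common to both lists: unindented lines are in the
    # manifest only, tab-indented lines are in source_files_ids.txt only.
    comm -3 "$work/manifest_ids" "$work/file_ids"
    rm -rf "$work"
}
```

Empty output means the two ID sets agree. Any line printed is an ID to investigate before processing.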
## Fastq files
$> ls *gz | perl -F'_' -lane 'use Cwd qw(abs_path); print join "\t", $F[0], abs_path($_)' > source_files_ids.txt
$> data_prep.pl -list source_files_ids.txt > source_file_manifest.txt
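The perl one-liner above takes the text before the first underscore of each file name as the sample ID and pairs it with the file's absolute path. For readers who want to see that rule spelled out, here is a rough shell equivalent. The function names and the example file name in the usage note are hypothetical, and realpath from GNU coreutils is assumed to be available.

```shell
#!/usr/bin/env bash
# Hypothetical helper: the sample ID is the file name up to the first underscore.
sample_id_from_fastq() {
    local name="${1##*/}"         # strip any leading directory
    printf '%s\n' "${name%%_*}"   # strip from the first underscore onward
}

# Print "sample_id<TAB>absolute_path" for each file given as an argument.
list_source_files() {
    local f
    for f in "$@"; do
        printf '%s\t%s\n' "$(sample_id_from_fastq "$f")" "$(realpath "$f")"
    done
}
```

For instance, `sample_id_from_fastq 15101X1_S1_L001_R1_001.fastq.gz` prints `15101X1`, so a Sample_ID of `15101X1` in the manifest would match that file.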
## BAM files
$> ml samtools
$> bam_sample_file_names.pl *.bam > source_files_ids.txt
$> data_prep.pl -list source_files_ids.txt > source_file_manifest.txt
NOTE: Every source_files_ids.txt must be of the form (tab delimited):
sample_id full/path/to/file
The Project_Setup directory must contain both source_files_ids.txt and source_file_manifest.txt in order to process workflows correctly.
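The format requirement above is easy to check before running data_prep.pl or a workflow. The following is a sketch of such a check, not part of the UCGD tooling; the helper name check_source_ids is hypothetical. It tests that every line has exactly two non-empty tab-separated fields, that each listed file exists, and that no path is listed twice.

```shell
#!/usr/bin/env bash
# Hypothetical helper: validate a source_files_ids.txt style file.
# Returns 0 if the file passes all three checks, 1 otherwise.
check_source_ids() {
    local list="$1" status=0 bad missing
    # Format: exactly two non-empty tab-separated fields on every line.
    bad=$(awk -F'\t' 'NF != 2 || $1 == "" || $2 == "" { print NR }' "$list")
    if [ -n "$bad" ]; then
        # Unquoted $bad joins the line numbers onto a single line.
        echo "malformed line(s): $(echo $bad)" >&2
        status=1
    fi
    # Existence: every path in column 2 must point at a real file.
    missing=$(cut -f2 "$list" | while IFS= read -r path; do
        [ -e "$path" ] || printf '%s\n' "$path"
    done)
    if [ -n "$missing" ]; then
        echo "no such file(s): $missing" >&2
        status=1
    fi
    # Uniqueness: a path listed twice may signal mixed-up samples.
    if [ -n "$(cut -f2 "$list" | sort | uniq -d)" ]; then
        echo "duplicate file path(s) found" >&2
        status=1
    fi
    return "$status"
}
```

A typical call is `check_source_ids source_files_ids.txt && echo OK`; problems are reported on standard error.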
$> mkdir ~/Box\ Sync/UCGD/Projects/NGS/[% Project %]
$> ucgd_db --report manifest --Project [% Project %] > ~/Box\ Sync/UCGD/Projects/NGS/[% Project %]/[% Project %]-Samples.txt
$> ucgd_db --report ped --Project [% Project %] > ~/Box\ Sync/UCGD/Projects/NGS/[% Project %]/[% Project %].ped
Check UCGD_DB.db > People for the PI's First_Name, Last_Name, and Email. These should exactly match the manifest entries in Projects.
$> ucgd_db --report project_wiki --Project [% Project %]
Paste the resulting markdown into the Jira task description.
$> cd /scratch/ucgd/lustre/UCGD_Processing/[% Project %]/UCGD/GRCh38/Reports
$> ucgd_db --report manifest --Project [% Project %] > ./[% Project %]-Samples.txt