Last active
August 24, 2018 19:25
Tutorial on how to download data from PGP-UK - https://www.personalgenomes.org.uk/data/
| # Download PGP Data CSV | |
| # Contains three columns: | |
| # - PGP-UK Hex ID | |
| # - Direct Link | |
| # - Type of Data | |
| wget https://www.personalgenomes.org.uk/data/data_file_links.csv | |
| # Display the types of data that are available, with the number of files of each type | |
| cat data_file_links.csv | cut -d ',' -f 3 | sort | uniq -c | |
| # You will see something like: | |
| # (Run on Fri, 24 Aug 2018) | |
| # 11 Methylation 450k Array Green Blood IDAT | |
| # 13 Methylation 450k Array Green Saliva IDAT | |
| # 11 Methylation 450k Array Red Blood IDAT | |
| # 13 Methylation 450k Array Red Saliva IDAT | |
| # 10 Transcriptomic - Amplicon Fastq | |
| # 20 Transcriptomic - Proton RNA Sequence Fastq | |
| # 30 Transcriptomic - RNAseq Fastq | |
| # 11 VCF | |
| # 11 VCF MD5 | |
| # 11 VCF Tabix Index | |
| # 10 WGBS Bam | |
| # 10 WGBS Bam Index | |
| # 20 WGBS Fastq | |
| # 11 WGS Bam | |
| # 1 WGS Bam Index | |
| # 90 WGS Cram | |
| # 203 WGS Fastq | |
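The counting pipeline above can be wrapped in a small function so that it is easy to rerun as the CSV changes. This is only a sketch: `count_types` is a hypothetical name, and it assumes plain comma-separated rows with no commas inside a field. It also sorts the result so the most common data types come first.

```shell
# Hypothetical helper: count rows per data type (column 3), most common first.
# Assumes no field contains a comma.
count_types() {
    cut -d ',' -f 3 "$1" | sort | uniq -c | sort -rn
}

# Usage (after downloading the CSV as shown above):
# count_types data_file_links.csv
```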
| ### To extract the links for a certain type of data | |
| # Here, we use `awk` to select rows where the third column is equal to "VCF Tabix Index". | |
| # We then print the second field (the URL), which is written to a text file. | |
| cat data_file_links.csv | \ | |
| awk -F ',' '$3 == "VCF Tabix Index" {print $2}' > download_urls.txt | |
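To reuse the same filter for other data types, the type can be passed in as a parameter instead of being hard-coded in the `awk` program. The sketch below assumes the three-column layout described at the top; `extract_urls` is a hypothetical name. It also strips a trailing carriage return from the third column, in case the CSV uses Windows line endings (an assumption; the step is harmless otherwise).

```shell
# Hypothetical helper: print the URL (column 2) of every row whose
# data type (column 3) exactly matches the second argument.
extract_urls() {
    awk -F ',' -v type="$2" '
        { sub(/\r$/, "", $3) }      # drop a trailing carriage return, if any
        $3 == type { print $2 }
    ' "$1"
}

# Usage:
# extract_urls data_file_links.csv "WGS Cram" > download_urls.txt
```

Because the comparison is an exact match, asking for "VCF" does not also return the "VCF MD5" or "VCF Tabix Index" rows.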
| # To download these URLs sequentially | |
| wget -i download_urls.txt | |
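Some of the listed data types (BAM, CRAM, FASTQ) are likely to be large, so it may help to keep downloads in their own directory and to resume interrupted transfers. The sketch below uses the standard `wget` options `-c` (continue a partial download) and `-P` (directory prefix). `download_all`, the default directory name `downloads`, and the `WGET` override variable are hypothetical and only there for illustration.

```shell
# Hypothetical helper: download every URL in a file into a directory,
# resuming partially downloaded files where possible.
download_all() {
    url_file=$1
    out_dir=${2:-downloads}          # assumed default directory name
    mkdir -p "$out_dir"
    # WGET can be overridden (e.g. WGET=echo) for a dry run
    "${WGET:-wget}" -c -P "$out_dir" -i "$url_file"
}

# Usage:
# download_all download_urls.txt vcf_index
```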
| # Downloading in parallel | |
| ## Using xargs (tested with macOS xargs and Ubuntu GNU xargs) | |
| # -n 1 passes one URL to each wget call; -P 8 runs up to 8 downloads at once | |
| cat download_urls.txt | xargs -n 1 -P 8 wget -q | |
| ## Alternative option with GNU Parallel | |
| # Requires prior installation of GNU Parallel - https://www.gnu.org/software/parallel/ | |
| # Set the number of parallel jobs using the -j argument (set to 8 below) | |
| parallel -a download_urls.txt -j 8 wget {} |
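When downloads run in parallel, and especially with `wget -q`, a failed transfer is easy to miss. One possible check is to compare the URL list against the files on disk. This sketch assumes each file is saved under the last path component of its URL, which is `wget`'s usual behaviour for plain URLs but is not guaranteed for every server. `missing_files` and `retry_urls.txt` are hypothetical names.

```shell
# Hypothetical helper: print every URL whose expected file is absent
# or empty in the given directory (default: current directory).
missing_files() {
    dir=${2:-.}
    while IFS= read -r url || [ -n "$url" ]; do
        [ -n "$url" ] || continue            # skip blank lines
        name=$(basename "$url")              # assumed local file name
        [ -s "$dir/$name" ] || printf '%s\n' "$url"
    done < "$1"
}

# Usage: collect the URLs that still need downloading, then retry them
# missing_files download_urls.txt > retry_urls.txt
# wget -i retry_urls.txt
```

Note that this only checks that a non-empty file exists; it does not detect a truncated download.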