The tabix command of htslib can query a locus in a remote S3 file using the s3:// protocol.
$ aws s3 ls s3://your_bucket/
vcf.gz
vcf.gz.tbi
$ tabix -l s3://your_bucket/vcf.gz
chr1
chr2
chr3
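The listing above only prints contig names. To query an actual locus, a region in chr:start-end form goes after the file name. Here is a minimal sketch of a wrapper, where the function name query_locus and the TABIX override variable are hypothetical names made up for this example, and the bucket and coordinates are placeholders:

```shell
# query_locus FILE CHROM START END
# Print the records overlapping CHROM:START-END in a tabix-indexed file.
# TABIX is a hypothetical override (handy for a dry run with TABIX=echo);
# when it is unset, the tabix found on PATH is used.
query_locus() {
    "${TABIX:-tabix}" "$1" "${2}:${3}-${4}"
}
```

For example, assuming the bucket from the listing above:

$ query_locus s3://your_bucket/vcf.gz chr1 10000 20000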
However, S3 support is not enabled by default. To enable it, we need to compile htslib with the --enable-libcurl option, which enables a variety of network protocols.
$ less htslib/INSTALL
...
--enable-libcurl
Use libcurl (<http://curl.haxx.se/>) to implement network access to
remote files via FTP, HTTP, HTTPS, etc. By default, HTSlib uses its
own simple networking code to provide access via FTP and HTTP only.
...
--enable-s3
Implement network access to Amazon AWS S3. By default or with
--enable-s3=check, this is enabled when libcurl is enabled.
...
At the time of writing, the latest release (1.9) does not fully support S3, but the develop branch includes a fix for it. So:
- Fetch a recent develop branch.
$ git clone --shallow-since 2019-07-01 https://github.com/samtools/htslib.git
- Install the dependencies listed in INSTALL.
$ cd htslib
$ less INSTALL
...
RedHat / CentOS
---------------
sudo yum install autoconf automake make gcc perl-Data-Dumper zlib-devel bzip2 bzip2-devel xz-devel curl-devel openssl-devel
...
$ sudo yum install autoconf automake make gcc perl-Data-Dumper zlib-devel bzip2 bzip2-devel xz-devel curl-devel openssl-devel
- Then, compile it with the --enable-libcurl option, which enables the s3:// protocol.
$ autoheader
$ autoconf
$ ./configure --enable-libcurl
$ make
$ sudo make install
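One rough way to confirm that the build picked up libcurl is to check whether the installed binary links against it. This is only a sketch, assuming Linux with ldd available and a dynamically linked libcurl (a statically linked libcurl would not show up). The function name links_libcurl is made up for this example:

```shell
# links_libcurl BINARY
# Succeed (exit 0) if ldd reports libcurl among the shared libraries
# of BINARY; fail otherwise, including when ldd cannot read the file.
links_libcurl() {
    ldd "$1" 2>/dev/null | grep -q 'libcurl'
}
```

For example:

$ links_libcurl "$(command -v tabix)" && echo "tabix links against libcurl"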
Now tabix can query a remote S3 file without downloading it. Make sure both the bgzip-compressed VCF (s3://your_bucket/vcf.gz) and its tabix index file (s3://your_bucket/vcf.gz.tbi) are in the same S3 location. At the time of writing, AWS instance profiles were not supported, so set AWS credentials via environment variables or ~/.aws/credentials.
$ export AWS_ACCESS_KEY_ID=XXX AWS_SECRET_ACCESS_KEY=XXX AWS_DEFAULT_REGION=us-west-2
$ tabix -l s3://your_bucket/vcf.gz
chr1
chr2
chr3
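Since credentials have to come from environment variables or ~/.aws/credentials, a small pre-flight check can help tell a credentials problem apart from other failures before calling tabix. This is a minimal sketch. The function name have_aws_creds is made up for this example, and it only checks that credentials are present, not that they are valid:

```shell
# have_aws_creds
# Succeed if both AWS key variables are set and non-empty,
# or if a readable ~/.aws/credentials file exists.
have_aws_creds() {
    if [ -n "${AWS_ACCESS_KEY_ID:-}" ] && [ -n "${AWS_SECRET_ACCESS_KEY:-}" ]; then
        return 0
    fi
    [ -r "${HOME}/.aws/credentials" ]
}
```

For example:

$ have_aws_creds && tabix -l s3://your_bucket/vcf.gz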