Last active
April 25, 2024 13:26
Create an indexed tabix file from a Pandas dataframe via "pure" Python
#!/usr/bin/env python
'''
Create an indexed tabix file from a Pandas dataframe
via "pure" Python (i.e., no subprocess)
'''
import os
import io
import pandas as pd
import pysam
import bgzip

# Tab-delimited, coordinate-sorted BED records (tabs written as \t escapes)
ds = io.StringIO('''chr1\t842320\t842327
chr1\t842328\t842330
chr1\t842328\t842330
chr1\t855426\t855427
chr1\t855739\t855740''')
df = pd.read_csv(ds, delimiter='\t', header=None)
df.columns = ['chrom', 'start', 'stop']

# Write the dataframe row by row into a bgzip-compressed archive
out_bgz_fn = "test_pd.bed.gz"
with open(out_bgz_fn, "wb") as out_bgz:
    with bgzip.BGZipWriter(out_bgz) as out_bgz_fh:
        for index, row in df.iterrows():
            out_line = '{}\t{}\t{}\n'.format(row['chrom'], row['start'], row['stop'])
            out_bgz_fh.write(out_line.encode())
if not os.path.exists(out_bgz_fn):
    raise Exception("Error: Could not create bgzip archive")

# Build the tabix index (.tbi) next to the archive, unless one already exists
out_index_fn = "{}.tbi".format(out_bgz_fn)
if not os.path.exists(out_index_fn):
    pysam.tabix_index(out_bgz_fn, preset="bed")
if not os.path.exists(out_index_fn):
    raise Exception("Error: Could not create index of bgzip archive")
wow... interaction of bgzip and pandas is weird... I find that just importing bgzip means to_csv automatically writes f"{filename}.gz" in bgzip format... like it's hooking into it somehow...
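One way to check what actually ended up on disk, rather than guessing from the extension, is to look at the gzip header. Below is a minimal sketch using only the standard library. It assumes the BGZF layout described in the SAM/BAM specification: a gzip member with the FEXTRA flag set and a "BC" extra subfield. The names is_bgzf and BGZF_EOF are made up for this example and are not part of bgzip, pysam, or pandas.

```python
import gzip
import os
import struct
import tempfile

# The 28-byte empty BGZF block that the SAM/BAM specification uses as an
# end-of-file marker; used here only as a tiny, known-good BGZF sample.
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff0600" "42430200"
    "1b000300" "00000000" "00000000"
)


def is_bgzf(path):
    """Return True if the file at `path` starts with a BGZF block header."""
    with open(path, "rb") as fh:
        header = fh.read(12)
        if len(header) < 12:
            return False
        id1, id2, cm, flg, _mtime, _xfl, _os, xlen = struct.unpack(
            "<BBBBIBBH", header
        )
        # Require the gzip magic bytes, the deflate method, and the FEXTRA flag
        if (id1, id2, cm) != (0x1F, 0x8B, 8) or not flg & 0x04:
            return False
        extra = fh.read(xlen)
    # Walk the extra subfields looking for 'BC' with a 2-byte payload
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if (si1, si2, slen) == (ord("B"), ord("C"), 2):
            return True
        pos += 4 + slen
    return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Plain gzip written by the standard library
        plain_fn = os.path.join(tmp, "plain.gz")
        with gzip.open(plain_fn, "wb") as fh:
            fh.write(b"chr1\t842320\t842327\n")
        # A file holding only the empty BGZF block
        bgzf_fn = os.path.join(tmp, "empty_bgzf.gz")
        with open(bgzf_fn, "wb") as fh:
            fh.write(BGZF_EOF)
        print(plain_fn, is_bgzf(plain_fn))
        print(bgzf_fn, is_bgzf(bgzf_fn))
```

Pointing is_bgzf at the file written by to_csv, once with and once without the bgzip import, should show whether the output really changes. Only the first block is inspected, so this is a quick sanity check and not a full validation of the archive.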
How does this compare (speed-wise) to df.to_csv(..., compression="gzip")? Obviously I know that a gzip'ed file can't be tabix indexed...