walterst/5984883
Created July 12, 2013 14:28
Parser to pull barcodes from fastq labels and write them to a separate barcodes fastq file. See the description at the beginning of the code for a usage example. Requires PyCogent 1.5.3 to be installed (http://sourceforge.net/projects/pycogent/files/PyCogent/1.5.3/PyCogent-1.5.3.tgz/download)
#!/usr/bin/env python
# Usage:
# python parse_bcs_from_fastq_labels.py X Y Z A
# where X is the input fastq file, Y is the output barcode reads file,
# Z is the character to split on in the label (wrap it in quotes), and
# A is the number of characters to trim from the end of the label (0 for none).
# This assumes the barcode is at the end of the label and that the number of
# characters following it is the same for every read.
""" Example sequence, would use: python parse_bcs_from_fastq_labels.py fastq_fp bc_reads.fastq '#' 2 to generate barcodes
@MCIC-SOLEXA_0051_FC:1:1:14637:1026#CGATGT/1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+MCIC-SOLEXA_0051_FC:1:1:14637:1026#CGATGT/1
cQRQOXXXXX_T___WTWWTQTVTV_____BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@MCIC-SOLEXA_0051_FC:1:1:4065:1039#CGATGT/1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+MCIC-SOLEXA_0051_FC:1:1:4065:1039#CGATGT/1
KPPPQWWWWWQQ________BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
"""
from sys import argv

from cogent.parse.fastq import MinimalFastqParser

f = open(argv[1], "U")
bc_out = open(argv[2], "w")
char_to_split = argv[3]
chars_to_trim = int(argv[4])

for data in MinimalFastqParser(f, strict=False):
    # Read in current label
    curr_label = data[0].strip()
    # Keep the text after the last occurrence of the split character
    curr_bc_read = curr_label.split(char_to_split)[-1]
    # Trim trailing characters such as "/1"; skip when 0, because a
    # [0:-0] slice would return an empty string
    if chars_to_trim > 0:
        curr_bc_read = curr_bc_read[:-chars_to_trim]
    # Create fake quality scores, since the label carries none; match length of barcode
    curr_bc_qual = "F" * len(curr_bc_read)
    bc_out.write("@%s\n" % curr_label)
    bc_out.write("%s\n" % curr_bc_read)
    bc_out.write("+%s\n" % curr_label)
    bc_out.write("%s\n" % curr_bc_qual)

f.close()
bc_out.close()
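
The label handling in the script can be tried without PyCogent installed. The following is a minimal sketch, not part of the original gist: the helper names `barcode_from_label` and `barcode_record` are hypothetical, and the sketch assumes, as the script does, that the barcode follows the last occurrence of the split character.

```python
def barcode_from_label(label, split_char, chars_to_trim=0):
    """Return the barcode found at the end of a fastq label (hypothetical helper)."""
    # Keep everything after the last occurrence of the split character
    tail = label.strip().split(split_char)[-1]
    # A slice ending at -0 would be empty, so only trim for positive counts
    if chars_to_trim > 0:
        tail = tail[:-chars_to_trim]
    return tail


def barcode_record(label, barcode, qual_char="F"):
    """Format a four-line fastq record with placeholder quality scores."""
    qual = qual_char * len(barcode)
    return "@%s\n%s\n+%s\n%s\n" % (label, barcode, label, qual)


# Label taken from the example reads in the script's docstring
label = "MCIC-SOLEXA_0051_FC:1:1:14637:1026#CGATGT/1"
barcode = barcode_from_label(label, "#", 2)
record = barcode_record(label, barcode)
```

Splitting on `'#'` and trimming 2 characters drops the trailing `/1` read indicator, leaving only the barcode bases. With a trim count of 0 the whole tail after `'#'` is kept.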