Skip to content

Instantly share code, notes, and snippets.

@wdecoster
Last active June 12, 2019 13:30
Show Gist options
  • Select an option

  • Save wdecoster/bbd446ff59991e34684226d757f847f8 to your computer and use it in GitHub Desktop.

Select an option

Save wdecoster/bbd446ff59991e34684226d757f847f8 to your computer and use it in GitHub Desktop.
from nanomath import aveQual
from nanoplotter import scatter
from Bio import SeqIO
import numpy as np
import pandas as pd
import sys
import gzip
from scipy import stats
import csv
def main():
datadict = getFQInput(gzip.open(sys.argv[1], 'rt'))
with open(sys.argv[2], 'r') as summaryf:
summary = csv.reader(summaryf, delimiter='\t')
next(summary, None)
for row in summary:
if not row[11] == '0':
datadict[row[1]].append(float(row[13]))
scatter(
x=pd.Series([entry[0] for entry in datadict.values()]),
y=pd.Series([entry[1] for entry in datadict.values()]),
names=["Calculated average read quality score", "Albacore summary derived quality score"],
path="AveQualvsSummary",
color="#4CB391",
figformat="png",
plots={"hex": 0, "kde": 1, "dot": 0, "pauvre": 0},
stat=stats.pearsonr,
plot_settings={})
def getFQInput(fq):
'''
Get all quality scores in the dataset returned as a list of lists
'''
return {rec.id : [aveQual(rec.letter_annotations["phred_quality"])] for rec in SeqIO.parse(fq, "fastq")}
if __name__ == '__main__':
main()
@sarah872

sarah872 commented Jun 5, 2019

Copy link
Copy Markdown

I there,
I'd like to reproduce the figure you produced, but I am getting the following error:

Traceback (most recent call last):
  File "average-fq-qual-vs-albacore-summary.py3", line 39, in <module>
    main()
  File "average-fq-qual-vs-albacore-summary.py3", line 16, in main
    next(summary, None)
  File "/home/user/viehboeck/00_scripts/py3-venv/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

@wdecoster

Copy link
Copy Markdown
Author

Is your summary file gzip compressed?

@sarah872

sarah872 commented Jun 7, 2019

Copy link
Copy Markdown

yes it is!

@wdecoster

Copy link
Copy Markdown
Author

That explains the error I believe. The easiest is to decompress it, but I think you can also change the code to use gzip.open (line 13).
This gist is 2 years old, so I'm not 100% sure the scatter() function will still work. But let me know if you encounter an issue and I'll help you fix it.

@sarah872

sarah872 commented Jun 7, 2019

Copy link
Copy Markdown

File "average-fq-qual-vs-albacore-summary.py3", line 39, in
main()
File "average-fq-qual-vs-albacore-summary.py3", line 28, in main
stat=stats.pearsonr)
File "/home/user/py3-venv/lib/python3.7/site-packages/nanoplotter/nanoplotter_main.py", line 90, in scatter
sns.set(style="ticks", **plot_settings)
TypeError: set() argument after ** must be a mapping, not NoneType

@wdecoster

Copy link
Copy Markdown
Author

There we have it :) as expected.

Could you try adding plot_settings={} in the scatter() function as below:

    scatter(
        x=np.array([entry[0] for entry in datadict.values()]),
        y=np.array([entry[1] for entry in datadict.values()]),
        names=["Calculated average read quality score", "Albacore summary derived quality score"],
        path="AveQualvsSummary",
        color="#4CB391",
        format="png",
        plots={"hex": 0, "kde": 1, "dot": 0},
        stat=stats.pearsonr,
        plot_settings={})

@sarah872

sarah872 commented Jun 7, 2019

Copy link
Copy Markdown

Getting the following error now:

File "average-fq-qual-vs-albacore-summary.py3", line 49, in <module>
   main()
 File "average-fq-qual-vs-albacore-summary.py3", line 39, in main
   plot_settings={})
TypeError: scatter() got an unexpected keyword argument 'format'

@wdecoster

Copy link
Copy Markdown
Author

Thanks for sticking around - the function has changed a bit the past two years. NanoPlot itself stays compatible of course, but this snippet is outdated.

You can change the 'format' to 'figformat' and then this error should be solved. As soon as we get this working I'll update the gist...

@sarah872

sarah872 commented Jun 7, 2019

Copy link
Copy Markdown

Well, thank YOU for the help!

New error:

  File "average-fq-qual-vs-albacore-summary.py3", line 49, in <module>
    main()
  File "average-fq-qual-vs-albacore-summary.py3", line 39, in main
    plot_settings={})
  File "/home/user/py3-venv/lib/python3.7/site-packages/nanoplotter/nanoplotter_main.py", line 153, in scatter
    idx = np.random.choice(x.index, min(2000, len(x)), replace=False)
AttributeError: 'numpy.ndarray' object has no attribute 'index'

@wdecoster

Copy link
Copy Markdown
Author

Hmmm I changed more in two years than I thought :-)

Try this:

import pandas as pd

scatter(
        x=pd.Series([entry[0] for entry in datadict.values()]),
        y=pd.Series([entry[1] for entry in datadict.values()]),
        names=["Calculated average read quality score", "Albacore summary derived quality score"],
        path="AveQualvsSummary",
        color="#4CB391",
        format="png",
        plots={"hex": 0, "kde": 1, "dot": 0},
        stat=stats.pearsonr,
        plot_settings={})

@sarah872

sarah872 commented Jun 7, 2019

Copy link
Copy Markdown

You probably meant ‘figformat’ ;)
It is working now, however, the pearsonr is very low (0.26). Any idea why?

@wdecoster

Copy link
Copy Markdown
Author

Oh yeah, heh, definitely figformat :-p

Could you show me the plot? Where is the data from?

@sarah872

sarah872 commented Jun 8, 2019

Copy link
Copy Markdown

The data was obtained by sequencing (as a beginner) a bacterial isolate.
AveQualvsSummary_kde

@wdecoster

Copy link
Copy Markdown
Author

Uh ow, I think I know what went wrong here. The ONT sequencing_summary.txt file format changed between my code and your file, and you are not plotting the right column here. The quality scores from the summary (y-axis) go to unrealistic numbers.
All of these changes are taken into account for NanoPlot (which might give you more relevant plots, too), but not for an old snippet like this.

In line 18 I take the thirteenth value of the summary file (index 12 in Python):

datadict[row[1]].append(float(row[12]))

You'll have to correct that number.

@sarah872

Copy link
Copy Markdown

This is my summary file:

filename	read_id	run_id	channel	start_time	duration	num_events	passes_filtering	template_start	num_events_template	template_duration	num_called_template	sequence_length_template	mean_qscore_template	strand_score_template	calibration_strand_genome_template	calibration_strand_identity_template	calibration_strand_accuracy_template	calibration_strand_speed_bps_template
PCW_VVVV_9_20180831_FAH89570_MN17058_sequencing_run_string.fast5	1b261311-a333-4af9-8f03-fefee722938b	1d2f069b4156b0bfdc46e8cd5825a3419de1a39d	248	112.06525	1.37275	957	False	0.17625	957	1.1965	957	229	5.177	-0.0008	filtered_out	-1.0	-1.0	0.0

So it should be index 13 in python.
However, that gives me the following error:

ng is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
/home/user/py3-venv/lib/python3.7/site-packages/seaborn/axisgrid.py:1847: UserWarning: JointGrid annotation is deprecated and will be removed in a future release.
  warnings.warn(UserWarning(msg))
Traceback (most recent call last):
  File "average-fq-qual-vs-albacore-summary.py3", line 40, in <module>
    main()
  File "average-fq-qual-vs-albacore-summary.py3", line 30, in main
    plot_settings={})
  File "/home/user/py3-venv/lib/python3.7/site-packages/nanoplotter/nanoplotter_main.py", line 182, in scatter
    if plots["pauvre"] and names == ['Read lengths', 'Average read quality'] and log is False:
KeyError: 'pauvre'

@wdecoster

Copy link
Copy Markdown
Author

That's progress!

Change plots={"hex": 0, "kde": 1, "dot": 0} to plots={"hex": 0, "kde": 1, "dot": 0, "pauvre": 0}.

I hope, after all these changes, you'll be happy with the results, but I expect you'll just get a boring pretty good correlation.

@sarah872

Copy link
Copy Markdown

That's it! Thanks a lot for your time!

@wdecoster

Copy link
Copy Markdown
Author

Good to hear! Based on your findings I have updated the gist. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment