Jason Chin cschin
@cschin
cschin / De_Bruijn_VIS.ipynb
Created August 14, 2012 02:36
Demo showing a De Bruijn graph with IPython notebook + d3.js
cschin / gist:6357971
Last active December 21, 2015 19:59
Some details on making a good visualization of the overlap data from Celera Assembler
(1) If we use Celera Assembler's bogart unitigger, it generates a file called "bests.edges" in "4-unitigger", one of its output directories.
(2) I wrote a simple script that converts the edge list to GML. The script can be downloaded from https://github.com/PacificBiosciences/HBAR-DTK/blob/master/src/CA_best_edge_to_GML.py
(3) Load the graph into Gephi (https://gephi.org/).
(4) I typically apply the following sequence of layout algorithms in Gephi to get a good layout:
1) "YifanHu's Multilevel" to get a rough layout. The output usually captures the large-scale structure well and untangles the graph reasonably well, so I can start to see features of the assembly overlap graph (or the string graph).
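The edge-list-to-GML conversion in step (2) can be sketched roughly as follows. This is not the HBAR-DTK script: the function name `edges_to_gml` is hypothetical, and the input is assumed to be already parsed into (source, target) read-name pairs, since the column layout of the edge file is not described here.

```python
def edges_to_gml(edges):
    """Return GML text for a directed graph given (source, target) label pairs."""
    edges = list(edges)  # allow any iterable; we need two passes
    node_ids = {}
    for src, dst in edges:
        for label in (src, dst):
            # Assign consecutive integer ids in order of first appearance.
            node_ids.setdefault(label, len(node_ids))
    lines = ["graph [", "  directed 1"]
    for label, idx in node_ids.items():
        lines.append(f'  node [ id {idx} label "{label}" ]')
    for src, dst in edges:
        lines.append(f"  edge [ source {node_ids[src]} target {node_ids[dst]} ]")
    lines.append("]")
    return "\n".join(lines) + "\n"


# Hypothetical read names, for illustration only.
gml = edges_to_gml([("read1", "read2"), ("read2", "read3")])
print(gml)
```

Writing the returned text to a `.gml` file should give something Gephi can open in step (3).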
Some thoughts on Heng Li's proposal for an assembly graph format: http://lh3.github.io/2014/07/19/a-proposal-of-the-grapical-fragment-assembly-format/
Some quick comments.
Is this format trying to represent the raw overlaps, the final assembly graph, or both?
It seems to me that it is more suitable for the former. In my work on representing diploid genome assemblies, I had to do multiple levels of graph reduction, starting from the initial string/overlap graph, to simplify the problem. If we are looking at a more reduced assembly, we may have to deal with edges that correspond to unitigs sharing the same in and out nodes. In this format, such bubble paths (where the difference between them is bigger than a small indel) will be in different rows, so the behavior of edges with the same in and out nodes should be defined. What I did for the diploid work was to assign a uid to each edge.
Also, I do think the final assembly should avoid bidirectional edges; they should be resolved by the assembler. From a pragmatic point of view, it will confuse a lot of bi
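A minimal sketch of the "uid per edge" idea mentioned above, so that two unitigs with the same in and out nodes stay distinguishable. The class and field names (`Edge`, `AssemblyGraph`, `by_ends`, `bubbles`) are hypothetical and are not taken from the proposal or from any assembler.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Edge:
    uid: int       # unique per edge, even when the end nodes are shared
    in_node: str
    out_node: str
    unitig: str


class AssemblyGraph:
    def __init__(self):
        self._next_uid = count()
        self.edges = {}                    # uid -> Edge
        self.by_ends = defaultdict(list)   # (in_node, out_node) -> [uid, ...]

    def add_edge(self, in_node, out_node, unitig):
        edge = Edge(next(self._next_uid), in_node, out_node, unitig)
        self.edges[edge.uid] = edge
        self.by_ends[(in_node, out_node)].append(edge.uid)
        return edge.uid

    def bubbles(self):
        """Return node pairs connected by more than one edge."""
        return {ends: uids for ends, uids in self.by_ends.items() if len(uids) > 1}


g = AssemblyGraph()
a = g.add_edge("n1", "n2", "utg_a")  # two alternative paths between n1 and n2
b = g.add_edge("n1", "n2", "utg_b")
c = g.add_edge("n2", "n3", "utg_c")
print(g.bubbles())
```

Keying edges by uid rather than by the (in, out) node pair is what keeps the parallel paths from collapsing into one record.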
cschin / gist:be6c52a08a9535839417
Last active August 29, 2015 14:23
Sunday Day Dream: My Vision for A Quantum Pore DNA Sequencer
I currently work for PacBio as a bioinformatician, developing methods to properly handle single-molecule data and genome assembly.
Recently, I have felt that I am lacking in vision. I have spent most of my time helping to develop methods in the hope that they will help the scientific community use PacBio data. While we were developing those methods, as far as I could tell, many ONT fans had zero vision about them. We openly released those methods so that the scientific community could understand the value of PacBio and PacBio-like data. We naively assumed ONT would generate great data with raw single-molecule read accuracy > 96%, as Clive presented at AGBT 2012. If so, those ONT fans would not have needed any of the methods we had developed. After a while, we found that some of the visionary ONT fans were finally "inspired" to use some of our methods to process ONT data and publish papers showing the value that some of those fans had questioned before. Wit
```
$ cat Dockerfile
FROM ubuntu
RUN apt-get update -qq && \
apt-get install -qqy tar gzip curl jq && \
apt-get install -qqy python python-pip vim-tiny less git
RUN pip install httpie requests
RUN apt-get install -qqy curl
RUN mkdir /build && cd /build && curl -s https://nim-lang.org/download/nim-0.17.2.tar.xz > nim-0.17.2.tar.xz && \
tar xvf nim-0.17.2.tar.xz && \
```
cschin / fastareader.py
Created April 7, 2019 14:19
fastareader.py
from os.path import abspath, expanduser
from io import StringIO
import contextlib
import gzip
import re
import subprocess
##
# Utility functions for FastaReader
##
cschin / gist:4c09f2f6888703a6e1e57d782d3244e5
Created July 26, 2020 21:41
A note on evaluating genome assemblies with CHM13 BACs
+ bash getStats.sh chm13_pergrine_p_ctg_cns.fasta
Checking attempted with maximum distance from contig end of 1000 bp..Done
******************* BAC SUMMARY ******************
TOTAL : 341
BP : 51532183
************** Statistics for: chm13_p_ctg_cns.fasta ****************
BACs closed: 321 (94.1349); BACs attempted: 333 %good = 96.3964; BASES 48527269 (94.1689)
Median: 99.9878
MedianQV: 39.1364
Mean: 99.94864
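As a sanity check on the summary above, the percentages can be recomputed from the raw counts. The variable names are mine, and the Phred-style conversion from median identity to QV is an assumption about how the script derives MedianQV, although the reported value is consistent with it.

```python
import math

# Values copied from the log above.
total_bacs, attempted, closed = 341, 333, 321
total_bp, closed_bp = 51532183, 48527269
median_identity = 99.9878  # percent

pct_closed = 100 * closed / total_bacs   # closed BACs out of all BACs
pct_good = 100 * closed / attempted      # closed BACs out of attempted BACs
pct_bases = 100 * closed_bp / total_bp   # closed bases out of all BAC bases

# Assumed conversion: QV = -10 * log10(error rate).
median_qv = -10 * math.log10(1 - median_identity / 100)

print(f"{pct_closed:.4f} {pct_good:.4f} {pct_bases:.4f} {median_qv:.2f}")
```

The three ratios reproduce the 94.1349, 96.3964 and 94.1689 figures in the log, and the QV comes out at about 39.14.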
cschin / hosts-yt-ads
Created December 16, 2020 07:33 — forked from ewpratten/hosts-yt-ads
youtube ads hosts file
0.0.0.0 ads.doubleclick.net
0.0.0.0 s.ytimg.com
0.0.0.0 ad.youtube.com
0.0.0.0 ads.youtube.com
0.0.0.0 clients1.google.com
0.0.0.0 dts.innovid.com
0.0.0.0 googleads4.g.doubleclick.net
0.0.0.0 pagead2.googlesyndication.com
0.0.0.0 pixel.moatads.com
0.0.0.0 rtd.tubemogul.com
cschin / example
Created July 8, 2021 16:09
maybe a bug in WFA
❯ ./wfa_adapt
s0 len: 16179, s1 len: 16326 Alignment contains 16168 matches 46 mismatches, 149 insertions, and 2 deletions
The alignment length is not consitent with sequence length:
16168 + 46 + 2 = 16216 != 16179
16168 + 46 + 149 = 16363 != 16326
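The check behind those two lines can be written as a small helper. It follows the pairing used in the output above (deletions counted against s0, insertions against s1); the function name `length_mismatch` is hypothetical and is not part of WFA.

```python
def length_mismatch(seq_len, matches, mismatches, gaps):
    """Return (alignment columns consuming this sequence) - (sequence length).

    A consistent alignment gives 0.
    """
    return matches + mismatches + gaps - seq_len


# Counts and lengths copied from the output above.
matches, mismatches, insertions, deletions = 16168, 46, 149, 2
diff_s0 = length_mismatch(16179, matches, mismatches, deletions)
diff_s1 = length_mismatch(16326, matches, mismatches, insertions)
print(diff_s0, diff_s1)  # both are 37
```

Both sums overshoot by the same amount, 37, though the output alone does not show where the extra columns come from.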