A personal diary of DataFrame munging over the years.
Convert Series datatype to numeric (will error if column has non-numeric values)
(h/t @makmanalp)
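As a rough illustration of the conversion note above, here is a minimal sketch that assumes pandas is installed. The Series contents and variable names are made up for the example. By default `pd.to_numeric` raises on values it cannot parse; passing `errors="coerce"` is one way to turn those values into NaN instead.

```python
import pandas as pd

# Hypothetical example data, not from the original notes.
clean = pd.Series(["1", "2", "3.5"])
dirty = pd.Series(["1", "two", "3"])

# Default behaviour: raises ValueError if any value is non-numeric.
clean_numeric = pd.to_numeric(clean)

try:
    pd.to_numeric(dirty)
    raised = False
except ValueError:
    raised = True

# Alternative: coerce unparseable values to NaN instead of raising.
coerced = pd.to_numeric(dirty, errors="coerce")
```

Whether to raise or coerce depends on whether a stray non-numeric value should stop the pipeline or be handled later as missing data.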
Pretty print tables summarizing properties of tensor arrays in numpy, pytorch, jax, etc.
Now on pip! `pip install arrgh` https://github.com/nmwsharp/arrgh
import re

def eterna_recalculated(row, scale_max=2.3):
    """Helper method that recalculates the Eterna score for an entry from a dataframe and puts the score back into the row. Note that there is not a 1:1 correspondence between the actual and recalculated scores."""
    assert len(row["target_structure"]) == len(row["sequence"])
    # Sometimes there is a fingerprint sequence at the end of the structure; if so, it needs to be removed.
    sequence = re.sub("AAAGAAACAACAACAACAAC$", "", row["sequence"])
    # data_len is the number of data points that will be reviewed
    data_len = min(
        len(row["target_structure"]),
        len(row["SHAPE_data"]),  # can probably get rid of this one
        len(sequence),
SAM and BAM filtering one-liners
@author: David Fredman, [email protected] (sans poly-A tail)
@dependencies: http://sourceforge.net/projects/bamtools/ and http://samtools.sourceforge.net/
Please extend with additional/faster/better solutions via a pull request!
BWA mapping (using piping for minimal disk I/O)
// -*- compile-command: "clang++ -ggdb -o random_selection -std=c++0x -stdlib=libc++ random_selection.cpp" -*-
//Reference implementation for doing random number selection from a container.
//Kept for posterity and because I made a surprising number of subtle mistakes on my first attempt.
#include <random>
#include <iterator>
template <typename RandomGenerator = std::default_random_engine>
struct random_selector
{
	//On most platforms, you probably want to use std::random_device("/dev/urandom")()
abc 1 2 3
def 4 5 6
ga 7 9 10
hij 1 5 99
import threading
# Based on tornado.ioloop.IOLoop.instance() approach.
# See https://github.com/facebook/tornado
class SingletonMixin(object):
    __singleton_lock = threading.Lock()
    __singleton_instance = None
    @classmethod
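A lock-guarded singleton of this kind is commonly completed with double-checked locking. The following is a minimal sketch under that assumption, not necessarily the author's version. The names `LockedSingleton`, `instance`, `_lock`, `_instance`, and `Config` are hypothetical and chosen only for illustration.

```python
import threading


class LockedSingleton(object):
    """Hypothetical mixin: one lazily created instance per class."""
    _lock = threading.Lock()

    @classmethod
    def instance(cls):
        # Look only in this class's own namespace so subclasses do not
        # reuse an instance created by a parent class.
        if "_instance" not in cls.__dict__:
            with cls._lock:
                # Re-check inside the lock: another thread may have won.
                if "_instance" not in cls.__dict__:
                    cls._instance = cls()
        return cls.__dict__["_instance"]


class Config(LockedSingleton):
    """Hypothetical class used only to demonstrate the mixin."""
    pass


first = Config.instance()
second = Config.instance()
same_object = first is second
```

The unlocked first check keeps the common path cheap, while the second check inside the lock prevents two threads from both constructing an instance.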
# Code written by brentp in response to BioStars question:
# http://www.biostars.org/post/show/6544/
import random
import sys
def write_random_records(fqa, fqb, N=100000):
    """ get N random headers from a fastq file without reading the
    whole thing into memory"""
    # each FASTQ record spans 4 lines; integer division keeps the count an int
    records = sum(1 for _ in open(fqa)) // 4
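One way to sample records without loading the whole file, sketched under stated assumptions: count the records in a first pass, pick N record indices with `random.sample`, then stream the file a second time and write only the chosen records. The function name `sample_fastq_records`, its parameters, and its single-file interface are hypothetical and are not the original script's API. The sketch also assumes strict 4-line FASTQ records.

```python
import random


def sample_fastq_records(src_path, dst_path, n, seed=None):
    """Write n randomly chosen 4-line records from src_path to dst_path.

    Returns the number of records written.
    """
    rng = random.Random(seed)

    # First pass: count records without keeping lines in memory.
    with open(src_path) as src:
        total = sum(1 for _ in src) // 4

    # Choose which record indices to keep (all of them if n >= total).
    keep = set(rng.sample(range(total), min(n, total)))

    # Second pass: copy only the chosen records.
    written = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for index in range(total):
            record = [src.readline() for _ in range(4)]
            if index in keep:
                dst.writelines(record)
                written += 1
    return written
```

Only the set of chosen indices is held in memory, so the cost scales with the sample size rather than the file size. For paired-end files, the same index set could be reused on the second file so that mates stay in sync.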