A personal diary of DataFrame munging over the years.
Convert Series datatype to numeric (will error if column has non-numeric values)
(h/t @makmanalp)
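As a minimal sketch of the conversion described above, assuming `pd.to_numeric` is the call in question (the diary's own snippet is not shown here, and the Series names below are made up for illustration):

```python
import pandas as pd

# Hypothetical example data: numbers stored as strings.
raw = pd.Series(["1", "2", "3.5"])
numeric = pd.to_numeric(raw)  # converts to a numeric dtype (float64 here)

# A Series with a non-numeric value: the default (strict) mode raises.
dirty = pd.Series(["1", "two", "3"])
try:
    pd.to_numeric(dirty)
    strict_failed = False
except ValueError:
    strict_failed = True

# errors="coerce" replaces unparseable values with NaN instead of raising.
coerced = pd.to_numeric(dirty, errors="coerce")
```

Whether to fail loudly or coerce to NaN depends on whether bad values should stop the pipeline or be handled downstream.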
| Pretty print tables summarizing properties of tensor arrays in numpy, pytorch, jax, etc. | |
| Now on pip! `pip install arrgh` https://github.com/nmwsharp/arrgh |
| def eterna_recalculated(row,scale_max=2.3): | |
| """Helper method that recalculates the Eterna score for an entry from a dataframe. It will then put the score back into the row. Please note that there is not a 1:1 correspondence between the actual and recalculated scores""" | |
| assert len(row["target_structure"]) == len(row["sequence"]) | |
| # Sometimes there is a fingerprint sequence at the end of the structure; if so, it needs to be removed | |
| sequence = re.sub("AAAGAAACAACAACAACAAC$","",row["sequence"]) | |
| # data_len is the number of data points that will be reviewed | |
| data_len = min( | |
| len(row["target_structure"]), | |
| len(row["SHAPE_data"]), # can probably get rid of this one | |
| len(sequence), |
SAM and BAM filtering one-liners
@author: David Fredman, [email protected] (sans poly-A tail)
@dependencies: http://sourceforge.net/projects/bamtools/ and http://samtools.sourceforge.net/
Please extend with additional/faster/better solutions via a pull request!
BWA mapping (using piping for minimal disk I/O)
| // -*- compile-command: "clang++ -ggdb -o random_selection -std=c++0x -stdlib=libc++ random_selection.cpp" -*- | |
| //Reference implementation for doing random number selection from a container. | |
| //Kept for posterity and because I made a surprising number of subtle mistakes on my first attempt. | |
| #include <random> | |
| #include <iterator> | |
| template <typename RandomGenerator = std::default_random_engine> | |
| struct random_selector | |
| { | |
| //On most platforms, you probably want to use std::random_device("/dev/urandom")() |
| abc 1 2 3 | |
| def 4 5 6 | |
| ga 7 9 10 | |
| hij 1 5 99 |
| import threading | |
| # Based on tornado.ioloop.IOLoop.instance() approach. | |
| # See https://github.com/facebook/tornado | |
| class SingletonMixin(object): | |
| __singleton_lock = threading.Lock() | |
| __singleton_instance = None | |
| @classmethod |
| # Code written by brentp in response to BioStars question: | |
| # http://www.biostars.org/post/show/6544/ | |
| import random | |
| import sys | |
| def write_random_records(fqa, fqb, N=100000): | |
| """ get N random headers from a fastq file without reading the | |
| whole thing into memory""" | |
| records = sum(1 for _ in open(fqa)) // 4 |