Selin Jessa sjessa

Anti-hype LLM reading list

Goals: Add links to reasonable, well-written explanations of how things work. No hype and, where possible, no vendor content. Practical first-hand accounts of models in production are eagerly sought.

Foundational Concepts

Pre-Transformer Models

A Checklist for Reproducibility in Reinforcement Learning

From a slide in a NeurIPS 2018 keynote by Joelle Pineau

For all algorithms presented, check if you include:

A clear description of the algorithm.
An analysis of the complexity (time, space, sample size) of the algorithm.
A link to downloadable source code, including all dependencies.

Use Singularity and Docker to run a kernel in a Jupyter notebook

This is an extension to this post about creating a kernel in a Jupyter notebook that runs a Singularity container.

Download Singularity (see here).

Create a Singularity file (making sure to install the ipykernel module in it), e.g.:

Bootstrap: docker

Keles -- Statistical Methods for profiling long range chromatin interactions from repetitive regions of the genome

Multi-mapping reads (multi-reads) are typically thrown out in many HTS analyses, including Hi-C
- Assays predominantly rely on short reads (50-150 bp), so multi-reads are common
- Using ChIP-seq as an example, incorporating multi-reads finds peaks in regions where "uni-reads" do not
- e.g. Perm-seq, using DHS + ChIP-seq data and multi-reads: 27.3% more peaks compared to the ENCODE uniform processing pipeline
How to combine this with Hi-C data?
- Hi-C read processing
  - Typical pipelines discard singletons, multi-mapping ends, low-mapping-quality reads, and unaligned reads
Evaluation of the impact of this using IMR90 and Plasmodium datasets

A Few Useful Things to Know about Machine Learning

The paper presents some key lessons and "folk wisdom" that machine learning researchers and practitioners have learnt from experience and which are hard to find in textbooks.

1. Learning = Representation + Evaluation + Optimization

All machine learning algorithms have three components:

Representation for a learner is the set of classifiers/functions that can possibly be learnt. This set is called the hypothesis space. If a function is not in the hypothesis space, it cannot be learnt.
The evaluation function tells how good a candidate machine learning model is.
Optimisation is the method used to search the hypothesis space for the best-scoring model.
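As a rough illustration of the three components, here is a minimal sketch on a toy problem. Everything in it is an assumption made for the example, not something taken from the paper: the dataset values, the choice of threshold classifiers as the hypothesis space, accuracy as the evaluation function, and exhaustive search as the optimiser.

```python
from typing import Callable, List, Tuple

# Toy 1-D dataset of (feature, label) pairs; the values are made up for illustration.
DATA: List[Tuple[float, int]] = [
    (0.5, 0), (1.0, 0), (1.5, 0),
    (2.5, 1), (3.0, 1), (3.5, 1),
]

# Representation: the hypothesis space is the set of threshold classifiers
# h_t(x) = 1 if x > t else 0, for thresholds t on a fixed grid 0.0, 0.5, ..., 4.0.
HYPOTHESIS_SPACE: List[float] = [t / 2 for t in range(0, 9)]


def make_classifier(threshold: float) -> Callable[[float], int]:
    """Build the classifier h_t for a given threshold t."""
    return lambda x: 1 if x > threshold else 0


# Evaluation: fraction of examples the classifier labels correctly.
def accuracy(classifier: Callable[[float], int], data: List[Tuple[float, int]]) -> float:
    correct = sum(1 for x, y in data if classifier(x) == y)
    return correct / len(data)


# Optimisation: exhaustive search over the hypothesis space.
# max() keeps the first threshold that reaches the best score.
def fit(data: List[Tuple[float, int]]) -> Tuple[float, float]:
    best_t = max(HYPOTHESIS_SPACE, key=lambda t: accuracy(make_classifier(t), data))
    return best_t, accuracy(make_classifier(best_t), data)


best_threshold, best_accuracy = fit(DATA)
print(best_threshold, best_accuracy)
```

A function outside the hypothesis space (for instance, one that labels only the middle of the range as 1) cannot be found by this search, however good the optimiser is.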

This gist lets you keep IPython notebooks in git repositories. It tells git to ignore prompt numbers and program outputs when checking that a file has changed.

To use the script, follow the instructions given in the script's docstring.

For further details, read this blogpost.

The procedure outlined here is inspired by this answer on Stack Overflow.
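The core idea, dropping prompt numbers and outputs so they never register as changes, can be sketched as below. This is not the gist's script: it assumes the nbformat 4 JSON layout (a top-level `cells` list, with `outputs` and `execution_count` on code cells), and `strip_notebook` is a hypothetical name chosen for the example.

```python
import json


def strip_notebook(notebook: dict) -> dict:
    """Clear outputs and execution counts from code cells (modifies in place)."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop program outputs
            cell["execution_count"] = None  # drop the prompt number
    return notebook


# A tiny hand-written notebook, assumed to follow the nbformat 4 layout.
example = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "source": ["1 + 1"],
            "execution_count": 7,
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 7,
                    "data": {"text/plain": ["2"]},
                    "metadata": {},
                }
            ],
        },
    ],
}

stripped = strip_notebook(example)
print(json.dumps(stripped, indent=1))
```

In a git clean filter, a function like this would typically be applied to the notebook read from standard input, with the result written to standard output.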

	# License CC0

	import httpx

	async def analyze_self_citations(doi):
	    async with httpx.AsyncClient() as client:
	        response = await client.get(
	            f"https://api.semanticscholar.org/graph/v1/paper/DOI:{doi}",
	            params={"fields": "title,authors,references.authors"}
	        )

	# Add this to your .Rprofile
	options(
	  error = quote(rlang::entrace()),
	  rlang__backtrace_on_error = "collapse" # or "branch" or "full"
	)

	#install UMAP from https://github.com/lmcinnes/umap
	#install.packages("rPython")

	umap <- function(x,n_neighbors=10,n_components=2,min_dist=0.1,metric="euclidean"){
	  x <- as.matrix(x)
	  colnames(x) <- NULL
	  rPython::python.exec( c( "def umap(data,n,d,mdist,metric):",
	    "\timport umap" ,
	    "\timport numpy",
	    "\tembedding = umap.UMAP(n_neighbors=n,n_components=d,min_dist=mdist,metric=metric).fit_transform(data)",

	/*

	This script is meant to be used with a Google Sheets spreadsheet. When you edit a cell containing a
	valid CSS hexadecimal color code (like #000 or #000000), the background color will change to that
	color and the font color will be changed to the inverse color for readability.

	To use this script in a Google Sheets spreadsheet:
	1. go to Tools » Script Editor;
	2. replace everything in the text editor with this code;
	3. click File » Save;