Stephen Anthony Rose arose13

Machine Learning - Statistics - Genomics

arose13 / extract-json.py

Created January 23, 2025 16:53

	import re

	def extract_json(text):
	    """
	    Extracts the first JSON object from a given string.

	    Args:
	        text (str): The input string containing JSON.

	    Returns:
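
The preview above is cut off before the function body, so the gist's actual approach is not visible. As a minimal sketch of one way to pull the first JSON object out of free text, assuming only the standard library: scan for each `{` and let `json.JSONDecoder.raw_decode` try to parse from that position. The name `extract_first_json_object` is hypothetical and not taken from the gist.

```python
import json


def extract_first_json_object(text):
    """Return the first parseable JSON object in `text` as a dict, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value beginning at `start`
            # and tolerates trailing text after it.
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # This brace did not open valid JSON; try the next one.
        start = text.find("{", start + 1)
    return None
```

Trying every opening brace keeps the sketch robust to stray `{` characters in surrounding prose, at the cost of repeated parse attempts on long inputs.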

arose13 / python-project-line-counter.sh

Created December 12, 2024 18:53

Count the number of lines in your Python project

	#!/bin/bash

	# Find all .py files and count their lines, excluding .venv directory
	find . -name "*.py" -type f -not -path "./.venv/*" -exec wc -l {} \; | {
	    total=0
	    while read -r lines file; do
	        echo "$lines lines in $file"
	        ((total += lines))
	    done
	    echo "-------------------------"

arose13 / both-torch-models-equal.py

Created April 2, 2024 20:01

Checks whether the state_dicts of two PyTorch models are the same

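	import torch


	def is_state_dict_equal(dict1, dict2):
	    """Return True if both state_dicts hold the same keys and tensor values."""
	    for key in dict1:
	        if key not in dict2:
	            print(f"Key {key} not in second dict")
	            return False
	        # torch.equal returns False on a shape mismatch instead of raising
	        if not torch.equal(dict1[key], dict2[key]):
	            print(f"Difference in values for key {key}")
	            return False
	    # Keys present only in the second dict also count as a difference
	    for key in dict2:
	        if key not in dict1:
	            print(f"Key {key} not in first dict")
	            return False
	    return True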

arose13 / quick-dirty-custom-dataloader.py

Created November 30, 2023 03:45

Dataloader for an LLM over a continuously concatenated sequence of tokens

	def custom_dataloader(dataset: Dataset, batch_size=16):
	    random_indices = torch.randperm(len(dataset['tokens']) - context_size)

	    for idx in range(0, len(random_indices), batch_size):
	        x = torch.stack([
	            dataset['tokens'][i: i+context_size]
	            for i in random_indices[idx: idx+batch_size]
	        ])
	        y = torch.stack([
	            dataset['tokens'][i+1: i+context_size+1]
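
The visible part of the loop builds inputs `x` from `tokens[i : i+context_size]` and targets `y` from the same window shifted one token to the right, which is the usual next-token setup. A torch-free sketch of just that windowing on a plain list may make the offsets easier to see; the helper name `shifted_windows` is hypothetical and the rest of the gist's loop is not shown in the preview.

```python
def shifted_windows(tokens, context_size, starts):
    """Build (input, target) windows where each target is its input shifted by one token."""
    # Input window: context_size tokens beginning at each start index.
    x = [tokens[i: i + context_size] for i in starts]
    # Target window: the same span moved one position to the right.
    y = [tokens[i + 1: i + context_size + 1] for i in starts]
    return x, y


tokens = list(range(10))
# Valid start indices are 0 .. len(tokens) - context_size - 1, so that every
# target window still has context_size tokens.
x, y = shifted_windows(tokens, context_size=4, starts=[0, 5])
# x == [[0, 1, 2, 3], [5, 6, 7, 8]]
# y == [[1, 2, 3, 4], [6, 7, 8, 9]]
```

This is also why the gist draws its random indices from `len(dataset['tokens']) - context_size` rather than the full length: the last usable start must leave room for the extra target token.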

arose13 / computer_sample_weight.py

Created December 14, 2022 15:17

Convert propensity score to sample weight for causal inference

	def compute_sample_weight(treatment_label, propensity_score, enforce_positivity=True, max_sample_weight=1e3):
	    """
	    Demystifying Double Robustness
	    https://projecteuclid.org/download/pdfview_1/euclid.ss/1207580167
	    https://arxiv.org/pdf/1706.10029.pdf
	    weights = [ti/g(Xi) + (1−ti)/(1−g(Xi))]
	    :param treatment_label: known treatment labels
	    :param propensity_score: estimated propensity scores
	    :param enforce_positivity: self explanatory
	    :param max_sample_weight: this is to prevent inf in subsequent calculation. N
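
The docstring's formula, w_i = t_i/g(X_i) + (1 − t_i)/(1 − g(X_i)), can be sketched in plain Python. How the gist itself applies `enforce_positivity` and `max_sample_weight` is cut off in the preview, so the handling below (clipping the propensity score away from 0 and 1, then capping the weight) is an assumption, as are the name `ipw_weights` and the `eps` parameter.

```python
def ipw_weights(treatment_label, propensity_score, max_sample_weight=1e3, eps=1e-12):
    """Inverse propensity weights: t/g + (1 - t)/(1 - g), capped at max_sample_weight."""
    weights = []
    for t, g in zip(treatment_label, propensity_score):
        # Assumption: keep g strictly inside (0, 1) so neither division hits zero.
        g = min(max(g, eps), 1.0 - eps)
        w = t / g + (1 - t) / (1 - g)
        # Assumption: cap extreme weights so later calculations stay finite.
        weights.append(min(w, max_sample_weight))
    return weights
```

A treated unit with propensity 0.25 gets weight 4, while an untreated unit with the same propensity gets 1/0.75; units whose observed treatment was unlikely under the model are weighted up.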

arose13 / .bash-aliases

Created December 8, 2021 21:59

arose13 / gradient-boosting-for-the-plebs.py

Created September 20, 2021 18:08

	import numpy as np
	import matplotlib.pyplot as graph
	from sklearn.datasets import load_boston
	from sklearn.metrics import r2_score
	from sklearn.model_selection import train_test_split
	from sklearn.tree import DecisionTreeRegressor
	from tqdm import trange


	x, y = load_boston(return_X_y=True) # type: np.ndarray, np.ndarray

arose13 / dask-on-databricks-script.sh

Created January 9, 2020 14:46

How to get Dask to run on Databricks

	#!/bin/bash
	# CREATE DIRECTORY ON DBFS FOR LOGS
	LOG_DIR=/dbfs/databricks/scripts/logs/$DB_CLUSTER_ID/dask/
	HOSTNAME=`hostname`
	mkdir -p $LOG_DIR
	# INSTALL DASK AND OTHER DEPENDENCIES
	set -ex
	/databricks/python/bin/python -V
	. /databricks/conda/etc/profile.d/conda.sh
	conda activate /databricks/python

arose13 / keras_onehot_encoder.py

Last active October 28, 2019 18:38

Automatic One Hot encoding layer for Keras

	import tensorflow.keras as k
	import tensorflow.keras.backend as K


	def _one_hot_layer(num_classes: int):
	    """
	    One hot encoding layer to save massive amounts of memory in Keras

	    :param num_classes:
	    :return:

arose13 / fast_lasso.py

Last active September 27, 2019 15:53

Compute Lasso and optimize lambda without needing cross validation

	import numpy as np
	import sklearn.linear_model as lm

	class FastLasso:
	    def __init__(self, verbose=False):
	        self.alphas, self.coefs = 2*[None]
	        self.score_path_ = None
	        self.best_iteration_ = -1
	        self.best_score_ = -np.inf
	        self.verbose = verbose