Tim Head betatim

betatim / AGENTS.md
Last active February 24, 2026 12:46
Template AGENTS.md file

AGENTS Instruction

This project is called foobar. Its goal is to provide ...

This file contains additional guidance for AI agents and other AI editors.


Core Principles

betatim / onnx_as_sklearn.py
Created February 18, 2026 09:07
Serialise cuml estimators with onnx via `as_sklearn`
"""
Test: Validate that cuml native estimators can be converted to ONNX
via as_sklearn() -> skl2onnx -> onnxruntime.
Unlike cuml.accel proxies (which skl2onnx recognizes directly), native cuml
estimators must first be converted to sklearn via as_sklearn() before
skl2onnx.convert_sklearn() will accept them.
Run without cuml.accel:
python test_onnx_as_sklearn.py
betatim / array-api-architecture.md
Created February 9, 2026 07:29
Agent support documents for array API work in scikit-learn

Array API Architecture

Created: 2026-01-07 Last Updated: 2026-01-07

Overview

Scikit-learn's Array API support enables estimators and functions to work with arrays from different libraries (NumPy, CuPy, PyTorch) without modification. This allows computations to run on GPUs when using GPU-backed array libraries.

The implementation follows the Array API Standard, a specification that defines a common API for array manipulation libraries.
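
The namespace-dispatch idea behind the standard can be sketched in a few lines. The standard specifies an `__array_namespace__()` method on array objects; library-agnostic code asks the input array for its namespace and calls functions only through it. To stay dependency-free, the sketch below uses a toy array type and a toy namespace (`ToyArray` and `toy_xp` are hypothetical names, not part of scikit-learn or any array library), and its `mean` returns a plain float rather than a zero-dimensional array as the standard requires.

```python
from types import SimpleNamespace


class ToyArray:
    """Hypothetical 1-D array type standing in for a real array library."""

    def __init__(self, values):
        self.values = [float(v) for v in values]

    def __array_namespace__(self, api_version=None):
        # Hook defined by the Array API Standard: return the namespace
        # that holds the functions operating on this array type.
        return toy_xp

    def __sub__(self, scalar):
        return ToyArray(v - scalar for v in self.values)


# Toy namespace exposing the single function this sketch needs.
# Simplification: returns a Python float instead of a 0-d array.
toy_xp = SimpleNamespace(mean=lambda x: sum(x.values) / len(x.values))


def center(x):
    """Subtract the mean using only the namespace of the input array."""
    xp = x.__array_namespace__()
    return x - xp.mean(x)


centered = center(ToyArray([1, 2, 3]))
print(centered.values)
```

Because `center` never names a specific library, the same function body would run unchanged on any array type that provides the hook and the functions it uses.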

betatim / test_rf_with_max_calls.py
Last active December 15, 2025 16:45
Investigate ray with max_calls=1 for cuml.accel
#!/usr/bin/env python3
"""
Ray + RandomForestClassifier with max_calls=1
Demonstrates the impact of max_calls=1 on Ray task execution when using
scikit-learn's RandomForestClassifier.
"""
import time
import ray
from sklearn.datasets import make_classification
betatim / benchmark_rf_sklearn_vs_lightgbm.py
Last active December 10, 2025 07:38
Comparing scikit-learn's random forest with lightgbm's implementation.
"""
Benchmark: scikit-learn RandomForest vs LightGBM RandomForest
Compares performance across:
- Number of samples (1K, 10K, 100K, 500K)
- Number of features (10, 50, 200)
- Feature types (numerical, categorical, mixed)
- Number of classes (2, 5, 10)
Includes cases optimized for LightGBM's strengths:
name: tabareana-20251202
channels:
- conda-forge
dependencies:
- _libgcc_mutex=0.1=conda_forge
- _openmp_mutex=4.5=2_gnu
- bzip2=1.0.8=hda65f42_8
- ca-certificates=2025.11.12=hbd8a1cb_0
- cuda-cccl_linux-64=13.0.85=ha770c72_0
- cuda-cudart-dev_linux-64=13.0.96=h376f20c_0
from __future__ import annotations
import warnings
warnings.simplefilter("error", FutureWarning)
from pathlib import Path
from typing import Any
import pandas as pd
from tabarena.benchmark.experiment import AGModelBagExperiment, ExperimentBatchRunner
betatim / README.md
Created November 6, 2025 15:06
Show recent Pull Request activity for a user.

GitHub PR Activity Tracker

A Python script that tracks Pull Request activity for a specific user over a configurable time period using the PyGithub library.

Features

  • Tracks PRs where the user:
    • Created the PR
    • Added comments
    • Submitted reviews
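
The script itself is not shown here, so the following is only a sketch of one way the three kinds of activity could be looked up: by building GitHub search queries with the `author:`, `commenter:`, and `reviewed-by:` qualifiers. The function name `build_pr_queries` and its parameters are assumptions made for illustration, not names from the original script.

```python
from datetime import date, timedelta


def build_pr_queries(user, days, today=None):
    """Return one GitHub search query per kind of PR activity.

    Hypothetical helper: `user` is a GitHub login, `days` is the length
    of the look-back window, `today` can be fixed for reproducibility.
    """
    today = today or date.today()
    since = (today - timedelta(days=days)).isoformat()
    base = f"type:pr updated:>={since}"
    return {
        "created": f"{base} author:{user}",
        "commented": f"{base} commenter:{user}",
        "reviewed": f"{base} reviewed-by:{user}",
    }


queries = build_pr_queries("octocat", days=7, today=date(2025, 11, 6))
for kind, query in queries.items():
    print(f"{kind}: {query}")
```

Each query string could then be passed to PyGithub's `Github.search_issues`, which requires an authenticated client and network access and is therefore left out of the sketch.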

AI-Assisted PR Review Checklist for Scikit-learn

Purpose: This checklist is optimized for AI assistants (like Cursor) to perform automated PR reviews. It separates automatable checks from those requiring human judgment, provides specific patterns to detect, and includes commands to run.


How to Use This Checklist

For AI Agents:

  1. Run all AUTOMATED checks first and report findings with severity levels

Summary of Issues

  • Classification Metrics Sparse Support Bug (Issue #32036): Classification metrics in scikit-learn claim sparse matrix support in their docstrings but raise a TypeError when given sparse inputs. The report is reliably reproducible: it provides the code steps, the expected behavior (sparse inputs supported) versus the actual behavior (TypeError), and environment details in the traceback. No major elements are missing.

  • RandomizedSearchCV Feature Request (Issue #32032): A proposal to add weights that control the probability of selecting each item in a list of parameter distributions, which is useful for complex pipelines with interdependent hyperparameters. This is a feature enhancement, not a bug, and includes clear examples and rationale.

  • CI Failure on Linux Build (Issue #32022): Reported CI failure on a specific build configuration, with a reference to logs but no detailed steps to rep