ngs-ml-learn.md

Sequencing / Bioinformatics Analysis

We focus on the basecalling (Basecall) task for short-read sequencing platforms. The key characteristics of the task are:

Input data: $\mathbf{X} \in \mathbb{R}^{T \times N \times C}$, where
- $T$: number of sequencing cycles, in the range $50 \sim 1000$
- $N$: number of DNA molecules, in the range $10^7 \sim 10^{11}$
- $C$: dimension of the input signal, in the range $2 \sim 10^2$

Output data: $\mathbf{Y} \in \{a, t, c, g\}^{N \times T}$, i.e. $N$ DNA sequences of length $T$
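
To make the tensor layout concrete, here is a minimal pure-Python sketch of the input and output shapes. It is not a real basecaller: it assumes, purely for illustration, that the channel dimension $C$ equals 4 and that channel `c` carries the intensity of base `BASES[c]`, so each call is an argmax over channels. The names `BASES`, `naive_basecall`, and `tensor_bytes` are hypothetical, as is the choice of 4 bytes per value.

```python
BASES = "atcg"  # output alphabet, in the order used in the text


def naive_basecall(x):
    """Map a signal tensor x[t][n][c] (T x N x C) to N reads of length T.

    Assumption (illustration only): C == 4 and channel c holds the
    intensity of BASES[c], so the call is the strongest channel.
    """
    num_cycles = len(x)
    num_molecules = len(x[0])
    reads = []
    for n in range(num_molecules):
        read = []
        for t in range(num_cycles):
            signal = x[t][n]
            # index of the strongest channel at cycle t for molecule n
            best = max(range(len(signal)), key=lambda c: signal[c])
            read.append(BASES[best])
        reads.append("".join(read))
    return reads


def tensor_bytes(num_cycles, num_molecules, num_channels, bytes_per_value=4):
    """Raw size of a dense T x N x C tensor (4 bytes per value assumed)."""
    return num_cycles * num_molecules * num_channels * bytes_per_value


# Toy input: T = 3 cycles, N = 2 molecules, C = 4 channels.
x = [
    [[0.9, 0.1, 0.0, 0.0], [0.0, 0.2, 0.7, 0.1]],  # cycle 0
    [[0.1, 0.8, 0.1, 0.0], [0.0, 0.0, 0.1, 0.9]],  # cycle 1
    [[0.0, 0.1, 0.2, 0.7], [0.6, 0.3, 0.1, 0.0]],  # cycle 2
]
reads = naive_basecall(x)
print(reads)  # ['atg', 'cga']

# Mid-range example: T = 150, N = 10^9, C = 4 gives 2.4e12 bytes (2.4 TB).
print(tensor_bytes(150, 10**9, 4))  # 2400000000000
```

Note that the input is indexed cycle-first ($T \times N \times C$) while the output is molecule-first ($N \times T$), so any implementation has to transpose along the way. The size estimate also shows why the full tensor cannot be held in memory at realistic values of $N$ and has to be processed in chunks.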

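Note: the short-read basecalling techniques we care about are maturing, so few papers cover them. A recent search for the keyword "Basecall" mostly returns work on long-read (third-generation, nanopore) sequencing platforms. That work has little direct bearing on ours, but it is still worth studying for reference.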

We are interested in papers on model training and acceleration, as well as interesting theory papers.

- Scaling Laws for Neural Language Models
- Language Models are Few-Shot Learners
- Attention Is All You Need
- On the Efficiency of Convolutional Neural Networks
- White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?
- Efficient LLM Inference on CPUs
- How to avoid machine learning pitfalls: a guide for academic researchers
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
- Efficiently Modeling Long Sequences with Structured State Spaces
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- LoRA: Low-Rank Adaptation of Large Language Models
- Zoom In: An Introduction to Circuits
- The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
- A high-bias, low-variance introduction to Machine Learning for physicists
- BranchyNet: Fast Inference via Early Exiting from Deep Neural Networks
- A Gentle Tutorial of Recurrent Neural Network with Error Backpropagation
- Masked Autoencoders As Spatiotemporal Learners
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
- You Only Look Once: Unified, Real-Time Object Detection
- Deep Residual Learning for Image Recognition

Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems: an introduction to neural networks

- Andrej Karpathy - a blend of general and technical content, Zero to Hero playlist is a must-watch youtube.com/andrejkarpathy @karpathy
- Umar Jamil - highly technical, implements ML and LLM techniques from scratch youtube.com/@umarjamilai @hkproj
- Simon Oz - technical low-level machine learning videos youtube.com/@szymonozog7862
- Tunadorable - paper review, implementation, triton youtube.com/@Tunadorable
- GPU Mode - technical interviews and walkthroughs about anything related to GPUs youtube.com/@GPUMODE
- AI Jason - AI experiments, software design, and new techniques beautifully explained youtube.com/@AIJasonZ @jasonzhou1993
- Ferdinand Mom - everything related to distributed training & inference youtube.com/@FerdinandMom @FerdinandMom
- Welch Labs - unique in-depth look at machine learning complexities like nobody else youtube.com/@WelchLabsVideo @welchlabs
- Artem Kirsanov - neuroscience and machine learning from a different look, great visuals youtube.com/@ArtemKirsanov @ArtemKRSV
- David Ondrej - new models, building apps with AI, practical for developers youtube.com/@DavidOndrej @DavidOndrej1

We write code in Python and C++, and use Git for version control and team collaboration.

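- Programming Massively Parallel Processors: A Hands-on Approach, Fourth Edition: the well-known introductory CUDA textbook (PMPP)
- C++ Concurrency in Action
- GPU编程实战（基于Python和CUDA） (Chinese-language book; the title translates to "GPU Programming in Practice, Based on Python and CUDA")
- Python 和 HDF5 大数据应用 (Chinese-language book; the title translates to "Python and HDF5 for Big Data Applications")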

Don't worry, I haven't read most of these either.