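We focus on the base-calling (basecall) task for short-read sequencing platforms. The key characteristics of the task are:
- Input data: $\mathbf{X} \in \mathbb{R}^{T \times N \times C}$, where
  - $T$: number of sequencing cycles, in the range $50 \sim 1000$
  - $N$: number of DNA molecules, in the range $10^7 \sim 10^{11}$
  - $C$: dimension of the input signal, in the range $2 \sim 10^2$
- Output data: $\mathbf{Y} \in \{a, t, c, g\}^{N \times T}$, i.e. $N$ gene sequences of length $T$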
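To make the shapes concrete, here is a minimal NumPy sketch of the input/output contract. It is only an illustration, not the method used on any real platform: it assumes $C = 4$ with one intensity channel per base, and the channel order in `BASES` and the function name `naive_basecall` are hypothetical.

```python
import numpy as np

# Hypothetical channel-to-base order; real platforms define their own mapping.
BASES = np.array(["a", "c", "g", "t"])


def naive_basecall(x: np.ndarray) -> np.ndarray:
    """Map intensities of shape (T, N, 4) to base calls of shape (N, T).

    Assumption: each of the 4 channels corresponds to one base, and the
    brightest channel wins. No crosstalk or phasing correction is applied.
    """
    if x.ndim != 3 or x.shape[2] != len(BASES):
        raise ValueError("expected an array of shape (T, N, 4)")
    calls = x.argmax(axis=2)   # (T, N): index of the brightest channel
    return BASES[calls.T]      # (N, T): one row per DNA molecule


# Toy sizes; real runs have T in 50..1000 and N in 1e7..1e11.
T, N, C = 50, 8, 4
rng = np.random.default_rng(0)
x = rng.random((T, N, C), dtype=np.float32)
y = naive_basecall(x)
reads = ["".join(row) for row in y]  # N strings, each of length T
```

At real scale $\mathbf{X}$ does not fit in memory, so any practical pipeline would process it in chunks along the $N$ axis; the sketch ignores that.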
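Note: the short-read basecall techniques we are interested in are maturing, so few related papers are being published. Recent searches for the keyword "basecall" mostly return work on long-read (third-generation, nanopore) sequencing platforms. Those tasks are not closely related to ours, but they are still worth studying as references.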
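- CiMBA: Accelerating Genome Sequencing through On-Device Basecalling via Compute-in-Memory: long-read sequencing; accelerator card design
- Comprehensive benchmark and architectural analysis of deep learning models for nanopore sequencing basecalling: long-read sequencing; evaluation of basecall model architectures
- A Comparison of Base-calling Algorithms for Illumina Sequencing Technology: short-read sequencing; survey of basecall methods
- naiveBayesCall: An Efficient Model-Based Base-Calling Algorithm for High-Throughput Sequencing: short-read sequencing; an upgraded version of BayesCall
- BayesCall: A model-based base-calling algorithm for high-throughput short-read sequencing: short-read sequencing; probabilistic basecall model plus parameter inference
- Analysis of Software Read Cross-Contamination in DNBSEQ Data: short-read sequencing; analysis of sequencing results
- Bioinformatics for High Throughput Sequencing: NGS textbook; describes in detail the technical principles upstream and downstream of basecalling
- 人类全基因组遗传变异解读的高通量测序数据规范 (specification for high-throughput sequencing data used in interpreting human whole-genome genetic variation): contains the official definitions of many QC metrics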
We care about model training, acceleration, and interesting theory papers.
- Scaling Laws for Neural Language Models
- Language Models are Few-Shot Learners
- Attention is all you need
- On the Efficiency of Convolutional Neural Networks
- White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?
- Efficient LLM Inference on CPUs
- How to avoid machine learning pitfalls: a guide for academic researchers
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
- Efficiently Modeling Long Sequences with Structured State Spaces
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- LoRA: Low-Rank Adaptation of Large Language Models
- Zoom In: An Introduction to Circuits
- The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
- A high-bias, low-variance introduction to Machine Learning for physicists
- BranchyNet: Fast Inference via Early Exiting from Deep Neural Networks
- A Gentle Tutorial of Recurrent Neural Network with Error Backpropagation
- Masked Autoencoders As Spatiotemporal Learners
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
- You only look once: Unified, real-time object detection
- Deep Residual Learning for Image Recognition
- Hands-on machine learning with scikit-learn, keras, and TensorFlow: Concepts, tools, and techniques to build intelligent systems: an introduction to neural networks
- https://zeux.io/2024/03/15/llm-inference-sol/
- https://karpathy.github.io/2019/04/25/recipe/
- https://aman.ai/primers/ai/top-30-papers/
- Andrej Karpathy - a blend of general and technical content, Zero to Hero playlist is a must-watch youtube.com/andrejkarpathy @karpathy
- Umar Jamil - highly technical, implements ML and LLM techniques from scratch youtube.com/@umarjamilai @hkproj
- Simon Oz - technical low-level machine learning videos youtube.com/@szymonozog7862
- Tunadorable - paper review, implementation, triton youtube.com/@Tunadorable
- GPU Mode - technical interviews and walkthroughs about anything related to GPUs youtube.com/@GPUMODE
- AI Jason - AI experiments, software design, and new techniques beautifully explained youtube.com/@AIJasonZ @jasonzhou1993
- Ferdinand Mom - everything related to distributed training & inference youtube.com/@FerdinandMom @FerdinandMom
- Welch Labs - unique in-depth look at machine learning complexities like nobody else youtube.com/@WelchLabsVideo @welchlabs
- Artem Kirsanov - neuroscience and machine learning from a different look, great visuals youtube.com/@ArtemKirsanov @ArtemKRSV
- David Ondrej - new models, building apps with AI, practical for developers youtube.com/@DavidOndrej @DavidOndrej1
We write code in Python/C++ and use Git for version control and team collaboration.
- Programming Massively Parallel Processors: A Hands-on Approach, Fourth Edition: the well-known introductory CUDA textbook (PMPP)
- C++ Concurrency in Action
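- GPU编程实战(基于Python和CUDA) (Chinese-language book on GPU programming in practice with Python and CUDA)
- Python 和 HDF5 大数据应用 (Chinese-language book on Python and HDF5 for big-data applications)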
Don't worry, I haven't read most of these either.