Jesse (createthis) — public gists
@createthis
createthis / bench_indexer_tilelang.py
Created November 12, 2025 13:14
bench_indexer_tilelang.py
#!/usr/bin/env python3
import argparse
import torch

# Prefer local examples path resolution if running from repo root
try:
    from examples.deepseek_v32.utils import per_custom_dims_cast_to_fp8 as _to_fp8

    def to_fp8(x):
        # Cast along last dim to FP8 E4M3 to match kernel expectations.
        # Handle both (x, dims, use_ue8m0) and (x, dims) signatures and
        # return the scaled tensor only.
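The preview cuts off before the wrapper body. A minimal sketch of how a dual-signature shim like the one the comments describe is typically written — the helper name and both signatures come from the snippet's own comments, but the stand-in `_to_fp8` body and the `TypeError` fallback are assumptions, not the gist's actual code:

```python
# Sketch only: _to_fp8 stands in for
# examples.deepseek_v32.utils.per_custom_dims_cast_to_fp8 (assumed behavior).
def _to_fp8(x, dims, use_ue8m0):
    # Pretend (scaled_tensor, scale) return value for illustration.
    return x, 1.0

def to_fp8(x):
    # Try the (x, dims, use_ue8m0) signature first; on TypeError fall back
    # to (x, dims). Either way, return only the scaled tensor.
    try:
        out = _to_fp8(x, (-1,), False)
    except TypeError:
        out = _to_fp8(x, (-1,))
    return out[0] if isinstance(out, tuple) else out
```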
@createthis
createthis / bench_topk_tilelang.py
Created November 12, 2025 13:17
bench_topk_tilelang.py
#!/usr/bin/env python3
import argparse
import time
import torch
# TileLang example kernels
from examples.deepseek_v32.topk_selector import tl_topk, tl_topk_impl
def bench_tl_topk(seq_len: int, topk: int = 256, batch: int = 1, iters: int = 50, warmup: int = 5):
    torch.cuda.synchronize()
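The body of `bench_tl_topk` is not shown in the preview. A generic sketch of the warmup-then-timed-iterations pattern such benchmarks use, with a stand-in workload instead of `tl_topk` (the real gist would bracket the timed region with `torch.cuda.synchronize()` so GPU work is finished before reading the clock):

```python
import time

def bench(fn, iters=50, warmup=5):
    # Warm-up runs are excluded from timing (JIT compilation, caches).
    for _ in range(warmup):
        fn()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    # Return average milliseconds per iteration.
    return (time.perf_counter() - t0) / iters * 1e3

avg_ms = bench(lambda: sum(range(10_000)))
```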
@createthis
createthis / dump_indexer_tilelang.py
Created November 12, 2025 18:50
dump_indexer_tilelang.py
#!/usr/bin/env python3
import argparse
import torch
import os
import sys
from typing import Optional
# Optional TVM runtime import to dump CUDA/PTX sources
import tilelang
from tilelang import tvm
@createthis
createthis / mqa_attn_return_logits_kernel.cu
Created November 12, 2025 18:52
mqa_attn_return_logits_kernel.cu
#include <tl_templates/cuda/cuda_fp8.h>
#include <tl_templates/cuda/gemm.h>
#include <tl_templates/cuda/copy.h>
#include <tl_templates/cuda/reduce.h>
#include <tl_templates/cuda/ldsm.h>
#include <tl_templates/cuda/threadblock_swizzle.h>
#include <tl_templates/cuda/debug.h>
#ifdef ENABLE_BF16
#include <tl_templates/cuda/cuda_bf16_fallbacks.cuh>
#endif
@createthis
createthis / DeepSeek_V3_2.md
Created November 19, 2025 19:43
DeepSeek_V3_2 PDF converted to Markdown using DeepSeek OCR

DeepSeek-V3.2-Exp: Boosting Long-Context Efficiency with DeepSeek Sparse Attention

DeepSeek-AI

research@deepseek.com

Abstract

We introduce DeepSeek-V3.2-Exp, an experimental sparse-attention model, which equips DeepSeek-V3.1-Terminus with DeepSeek Sparse Attention (DSA) through continued training. With DSA, a fine-grained sparse attention mechanism powered by a lightning indexer, DeepSeek-V3.2-Exp achieves significant efficiency improvements in both training and inference, especially in long-context scenarios. The model checkpoints are available at https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp.
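The abstract does not show the mechanism itself. As a rough illustration of the general shape of fine-grained sparse attention — not the paper's actual implementation — a cheap indexer scores every past token per query, and full attention then runs only over the top-k highest-scoring tokens:

```python
def select_topk(scores, k):
    # scores: indexer logits for one query over all previous tokens.
    # Keep the indices of the k largest scores; attention is computed
    # only over this subset instead of the whole context.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])

keep = select_topk([0.1, 2.5, -1.0, 0.7, 3.2], k=2)  # -> [1, 4]
```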

@createthis
createthis / chat_template.jinja
Created December 29, 2025 17:39
DeepSeek 3.2 chat template
{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% if not thinking is defined %}{% set thinking = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, system_prompt='', is_first_sp=true, is_last_user=false, is_only_sys=false, is_prefix=false) %}{%- for message in messages %}{%- if message['role'] == 'system' %}{%- if ns.is_first_sp %}{% set ns.system_prompt = ns.system_prompt + message['content'] %}{% set ns.is_first_sp = false %}{%- else %}{% set ns.system_prompt = ns.system_prompt + '\n\n' + message['content'] %}{%- endif %}{% set ns.is_only_sys = true %}{%- endif %}{%- endfor %}{{ bos_token }}{{ ns.system_prompt }}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{%- set ns.is_first = false -%}{%- set ns.is_last_user = true -%}{{'<|User|>' + message['content']}}{%- endif %}{%- if message['role'] == 'assistant' and message['tool_calls'] is defined and message['tool_calls'] is not none %}{%- if ns.i
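The template above is truncated mid-expression. As a plain-Python illustration (not the template itself) of the two behaviors visible in the preview — consecutive system messages are accumulated into one prompt separated by blank lines, and user turns are wrapped with the `<|User|>` tag after the BOS token — one might write (the `bos_token` default here is an assumption):

```python
def render(messages, bos_token="<BOS>"):
    # Merge all system messages into one prompt, blank-line separated,
    # mirroring the ns.system_prompt accumulation in the Jinja template.
    system = "\n\n".join(m["content"] for m in messages if m["role"] == "system")
    out = bos_token + system
    for m in messages:
        if m["role"] == "user":
            out += "<|User|>" + m["content"]
    return out
```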