Config: https://huggingface.co/deepseek-ai/DeepSeek-V3-Base/blob/main/config.json
This configuration file defines the architecture and hyperparameters for `DeepseekV3ForCausalLM`, a causal language model (LM) based on the DeepseekV3 architecture. Below is an explanation of the key configuration entries.

- `architectures`: Specifies the model class, `DeepseekV3ForCausalLM`. This indicates the model is designed for causal language modeling (e.g., text generation).
- `model_type`: The model type, `deepseek_v3`. This is used to identify the model architecture in the Hugging Face Transformers library (see the loading sketch below).
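To inspect these fields programmatically, the config can be loaded with the Transformers `AutoConfig` API. A minimal sketch, assuming the `deepseek_v3` architecture is only available through the repository's remote code on older Transformers versions (hence `trust_remote_code=True`):

```python
# Minimal sketch: loading and inspecting the published config.
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "deepseek-ai/DeepSeek-V3-Base",
    trust_remote_code=True,   # may be unnecessary on recent Transformers versions
)

print(config.model_type)         # "deepseek_v3"
print(config.architectures)      # ["DeepseekV3ForCausalLM"]
print(config.num_hidden_layers)  # 61
```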
- `attention_bias`: `false`, so no additional bias terms are used in the attention projections.
- `attention_dropout`: Dropout rate for attention layers (`0.0`, meaning no dropout).
- `num_attention_heads`: Number of attention heads in the multi-head attention mechanism (128).
- `num_key_value_heads`: Number of key/value heads (128). This field is typically used for grouped-query attention; here it equals `num_attention_heads`, so there is no head grouping.
- `qk_nope_head_dim`: Per-head dimension of the query/key component without positional encoding (128).
- `qk_rope_head_dim`: Per-head dimension of the query/key component with rotary positional encoding (64). The sketch below adds these dimensions up.
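A quick back-of-the-envelope check of the head geometry these fields imply (values copied from the list above; `v_head_dim` appears further down in the config):

```python
# Attention head geometry implied by the config values above.
num_attention_heads = 128
qk_nope_head_dim = 128   # non-positional ("nope") part of each query/key head
qk_rope_head_dim = 64    # rotary-encoded part of each query/key head
v_head_dim = 128         # value head dimension (listed later in the config)

qk_head_dim = qk_nope_head_dim + qk_rope_head_dim
print(qk_head_dim)                            # 192 per query/key head
print(num_attention_heads * qk_head_dim)      # 24576 total query/key width
print(num_attention_heads * v_head_dim)       # 16384 total value width
```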
- `hidden_size`: Size of the hidden states (7168).
- `intermediate_size`: Size of the intermediate layer in the dense feed-forward network (18432).
- `num_hidden_layers`: Number of transformer layers (61).
- `vocab_size`: Vocabulary size (129280 tokens); a rough parameter estimate follows below.
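These numbers already pin down some rough sizes. A small sketch, assuming the dense FFN is a gated (SwiGLU-style) MLP with gate, up, and down projections, which is how such models are usually structured (treat the FFN count as an estimate):

```python
# Rough parameter counts implied by the core dimensions (estimates only).
vocab_size = 129_280
hidden_size = 7_168
intermediate_size = 18_432

embedding_params = vocab_size * hidden_size             # input embedding table
dense_ffn_params = 3 * hidden_size * intermediate_size  # gate + up + down projections

print(f"embeddings:      {embedding_params:,}")   # 926,679,040 (~0.93B)
print(f"dense FFN/layer: {dense_ffn_params:,}")   # 396,361,728 (~0.40B)
```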
- `max_position_embeddings`: The maximum sequence length the model can handle (163840 tokens).
- `rope_theta`: The base value for rotary positional embeddings (10000).
- `rope_scaling`: Configuration for extending the rotary positional embeddings:
  - `type`: The scaling type, `yarn`.
  - `factor`: Scaling factor (40).
  - `beta_fast` and `beta_slow`: YaRN parameters that control which rotary frequency bands are interpolated and which are left unchanged.
  - `mscale`: Multiplicative scaling factor (1.0).
  - `original_max_position_embeddings`: The original maximum sequence length before scaling (4096). The sketch below checks the arithmetic.
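The relationship between these fields is simple: the YaRN factor stretches the original 4096-token window to the advertised maximum.

```python
# YaRN context extension: original window times scaling factor.
original_max_position_embeddings = 4_096
factor = 40

print(original_max_position_embeddings * factor)   # 163840 == max_position_embeddings
```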
- `moe_intermediate_size`: Intermediate size of each MoE expert (2048).
- `moe_layer_freq`: Frequency of MoE layers (1, meaning every layer past the initial dense layers is an MoE layer).
- `n_routed_experts`: Number of routed experts per MoE layer (256).
- `n_shared_experts`: Number of shared experts per MoE layer (1).
- `num_experts_per_tok`: Number of routed experts activated per token (8).
- `routed_scaling_factor`: Scaling factor applied to the routed experts' outputs (2.5).
- `scoring_func`: Scoring function used for expert routing (`sigmoid`).
- `topk_method`: Method used for selecting the top-k experts (`noaux_tc`).
- `topk_group`: Number of expert groups from which each token's top-k experts may be drawn (4). The sketch below puts the sparsity numbers together.
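A rough sketch of the sparsity these settings buy, assuming each expert is a gated MLP with gate, up, and down projections (an approximation; router and normalization parameters are ignored):

```python
# Rough MoE sparsity arithmetic for one layer (estimates only).
hidden_size = 7_168
moe_intermediate_size = 2_048
n_routed_experts = 256
n_shared_experts = 1
num_experts_per_tok = 8

params_per_expert = 3 * hidden_size * moe_intermediate_size           # ~44.0M
total_expert_params = (n_routed_experts + n_shared_experts) * params_per_expert
active_expert_params = (num_experts_per_tok + n_shared_experts) * params_per_expert

print(f"per layer, total:  {total_expert_params / 1e9:.2f}B")   # ~11.32B
print(f"per layer, active: {active_expert_params / 1e9:.2f}B")  # ~0.40B (about 3.5%)
```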
- `kv_lora_rank`: Rank of the low-rank compression applied to keys and values (512). Despite the "lora" name, this is the latent dimension of Multi-head Latent Attention (MLA), not a LoRA fine-tuning adapter.
- `q_lora_rank`: Rank of the low-rank compression applied to queries (1536). The sketch below shows why this matters for the KV cache.
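Instead of caching full per-head keys and values, MLA caches a compressed latent plus the shared rotary key component. The exact cached tensors depend on the implementation, so treat the numbers in this sketch as an approximation:

```python
# Approximate KV-cache footprint per token, per layer (values, not bytes).
num_attention_heads = 128
qk_head_dim = 128 + 64      # nope + rope parts of each key head
v_head_dim = 128
kv_lora_rank = 512
qk_rope_head_dim = 64

# Standard multi-head attention would cache full keys and values per token:
mha_cache_per_token = num_attention_heads * (qk_head_dim + v_head_dim)   # 40960
# MLA caches the compressed latent plus the shared rotary key component:
mla_cache_per_token = kv_lora_rank + qk_rope_head_dim                    # 576

print(mha_cache_per_token / mla_cache_per_token)   # ~71x smaller per layer
```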
- `rms_norm_eps`: Epsilon value for RMS normalization (1e-06); see the minimal RMSNorm sketch below.
- `initializer_range`: Standard deviation used for initializing model weights (0.02).
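A minimal, generic RMSNorm sketch (not DeepSeek's exact module) showing where `rms_norm_eps` enters the computation:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps   # corresponds to rms_norm_eps in the config

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the root-mean-square of the features, then rescale.
        variance = x.pow(2).mean(dim=-1, keepdim=True)
        return self.weight * x * torch.rsqrt(variance + self.eps)

norm = RMSNorm(hidden_size=7168, eps=1e-6)
print(norm(torch.randn(1, 4, 7168)).shape)   # torch.Size([1, 4, 7168])
```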
- `hidden_act`: Activation function used in the feed-forward layers (`silu`, also known as Swish).
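SiLU is simply `x * sigmoid(x)`. The feed-forward blocks are gated, so the activation is applied to the gate projection; a short sketch under that assumption:

```python
# SiLU (Swish) and a gated MLP sketch using it.
import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)
print(torch.allclose(F.silu(x), x * torch.sigmoid(x)))   # True

def gated_mlp(x, w_gate, w_up, w_down):
    # hidden -> intermediate (gated by SiLU) -> hidden
    return (F.silu(x @ w_gate) * (x @ w_up)) @ w_down

h, inter = 8, 16
y = gated_mlp(torch.randn(2, h), torch.randn(h, inter),
              torch.randn(h, inter), torch.randn(inter, h))
print(y.shape)   # torch.Size([2, 8])
```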
- `quantization_config`: Configuration for weight quantization:
  - `quant_method`: Quantization method (`fp8`, 8-bit floating point).
  - `fmt`: Floating-point format (`e4m3`, i.e., 4 exponent bits and 3 mantissa bits).
  - `activation_scheme`: Scheme for quantizing activations (`dynamic`).
  - `weight_block_size`: Block size for quantizing weights ([128, 128]); a rough sketch of block-wise scaling follows below.
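A rough sketch of block-wise quantization with one scale per 128x128 weight block (illustrative only; the real fp8 e4m3 kernels, scale storage, and rounding policy differ):

```python
import torch

def blockwise_quantize(w: torch.Tensor, block: int = 128, fp8_max: float = 448.0):
    """Scale a 2-D weight in block x block tiles, one scale per tile."""
    rows, cols = w.shape
    q = torch.empty_like(w)
    scales = torch.empty(rows // block, cols // block)
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            blk = w[i:i + block, j:j + block]
            scale = blk.abs().max() / fp8_max        # map block range onto e4m3's range
            scales[i // block, j // block] = scale
            q[i:i + block, j:j + block] = blk / scale  # would be cast to float8_e4m3fn here
    return q, scales

w = torch.randn(256, 256)
q, s = blockwise_quantize(w)
print(q.shape, s.shape)   # torch.Size([256, 256]) torch.Size([2, 2])
```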
- `bos_token_id`: ID of the beginning-of-sequence token (0).
- `eos_token_id`: ID of the end-of-sequence token (1).
- `pretraining_tp`: Tensor-parallelism degree used during pretraining (1, meaning no tensor parallelism is assumed).
- `use_cache`: Whether to use the key/value cache during inference (`true`).
- `torch_dtype`: Default tensor data type (`bfloat16`).
- `transformers_version`: Version of the Hugging Face Transformers library used to save the config (4.33.1). An illustrative loading snippet follows below.
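An illustrative loading call matching the config's `torch_dtype`. The full checkpoint is far too large for a single GPU, so this is a sketch rather than a practical single-machine recipe, and `trust_remote_code` may or may not be needed depending on the Transformers version:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-V3-Base",
    torch_dtype=torch.bfloat16,   # matches torch_dtype in the config
    device_map="auto",            # shards the model across available devices
    trust_remote_code=True,
)
```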
- `aux_loss_alpha`: Weight of the auxiliary load-balancing loss (0.001).
- `seq_aux`: Whether the auxiliary loss is computed at the sequence level (`true`). A generic sketch of such a loss follows below.
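A generic sketch of an expert load-balancing auxiliary loss weighted by `aux_loss_alpha`; DeepSeek-V3's sequence-wise formulation differs in detail, so this only shows where such a term would enter training:

```python
import torch

def load_balance_aux_loss(router_probs: torch.Tensor, expert_mask: torch.Tensor,
                          alpha: float = 0.001) -> torch.Tensor:
    # router_probs: [tokens, experts] routing probabilities
    # expert_mask:  [tokens, experts] 1.0 where an expert was actually selected
    mean_prob = router_probs.mean(dim=0)         # average routing probability per expert
    mean_load = expert_mask.float().mean(dim=0)  # fraction of tokens sent to each expert
    return alpha * (mean_prob * mean_load).sum() * router_probs.shape[-1]

probs = torch.softmax(torch.randn(16, 256), dim=-1)
mask = torch.zeros_like(probs).scatter(1, probs.topk(8, dim=-1).indices, 1.0)
print(load_balance_aux_loss(probs, mask))
```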
- `ep_size`: Expert-parallelism size (1, meaning no expert parallelism).
- `first_k_dense_replace`: Number of leading layers that use a dense FFN instead of MoE (3).
- `norm_topk_prob`: Whether to normalize the top-k routing probabilities (`true`).
- `num_nextn_predict_layers`: Number of Multi-Token Prediction (MTP) modules used for next-n token prediction (1).
- `tie_word_embeddings`: Whether the input embeddings are tied to the output projection (`false`).
- `v_head_dim`: Per-head dimension of the values in attention (128). The sketch below shows how the dense/MoE layer layout falls out of these settings.
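Putting `first_k_dense_replace`, `moe_layer_freq`, and `num_hidden_layers` together gives the dense/MoE layout of the stack; the rule below mirrors the one used in the reference modeling code:

```python
# Dense vs. MoE layer layout implied by the config.
num_hidden_layers = 61
first_k_dense_replace = 3
moe_layer_freq = 1

layer_types = [
    "moe" if (i >= first_k_dense_replace and i % moe_layer_freq == 0) else "dense"
    for i in range(num_hidden_layers)
]
print(layer_types[:5])           # ['dense', 'dense', 'dense', 'moe', 'moe']
print(layer_types.count("moe"))  # 58 MoE layers out of 61
```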
This configuration describes a large-scale causal language model with a mixture-of-experts (MoE) architecture, YaRN-extended rotary positional embeddings, and low-rank query/key-value compression (Multi-head Latent Attention). It supports long sequences (up to 163840 tokens) and ships fp8-quantized weights for efficiency. The model is designed for high-performance text generation tasks.