@vinhnx
Last active December 26, 2024 05:17
deepseek-v3 config params

Configuration Explanation for DeepseekV3ForCausalLM

Config: https://huggingface.co/deepseek-ai/DeepSeek-V3-Base/blob/main/config.json

This configuration file defines the architecture and hyperparameters for a model named DeepseekV3ForCausalLM, which is a causal language model (LM) based on the DeepseekV3 architecture. Below is an explanation of the key configurations:


Model Architecture

  • architectures: Specifies the model class, which is DeepseekV3ForCausalLM. This indicates the model is designed for causal language modeling (e.g., text generation).
  • model_type: The type of model, which is deepseek_v3. This is used to identify the model architecture in the Hugging Face Transformers library.
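
To sanity-check these fields programmatically, the config can be loaded directly from the Hub. A minimal sketch, assuming trust_remote_code is acceptable for the custom deepseek_v3 modeling code shipped in the repository:

```python
from transformers import AutoConfig

# Load the DeepSeek-V3 configuration straight from the Hugging Face Hub.
config = AutoConfig.from_pretrained(
    "deepseek-ai/DeepSeek-V3-Base",
    trust_remote_code=True,  # the repo ships custom deepseek_v3 modeling code
)

print(config.model_type)         # "deepseek_v3"
print(config.architectures)      # ['DeepseekV3ForCausalLM']
print(config.num_hidden_layers)  # 61
```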

Attention Mechanism

  • attention_bias: Whether bias terms are added to the attention projections (false, so no bias is used).
  • attention_dropout: Dropout rate for attention layers (set to 0.0, meaning no dropout).
  • num_attention_heads: Number of attention heads in the multi-head attention mechanism (128).
  • num_key_value_heads: Number of key/value heads (128). Since this equals num_attention_heads, there is no grouped-query sharing here; DeepSeek-V3 instead compresses keys and values through Multi-head Latent Attention (see the low-rank ranks below).
  • qk_nope_head_dim: Per-head dimension of the query/key channels that do not receive positional encoding (128).
  • qk_rope_head_dim: Per-head dimension of the query/key channels that receive rotary positional encoding (64).
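
The decoupled head dimensions combine as follows; a small worked example using the values above (plus v_head_dim from the Miscellaneous section):

```python
# Worked arithmetic from the attention fields (values taken from config.json).
num_attention_heads = 128
qk_nope_head_dim = 128   # query/key channels without rotary embedding
qk_rope_head_dim = 64    # query/key channels that receive rotary embedding
v_head_dim = 128         # value channels per head (listed under Miscellaneous)

qk_head_dim = qk_nope_head_dim + qk_rope_head_dim
print(qk_head_dim)                        # 192 channels per query/key head
print(num_attention_heads * qk_head_dim)  # 24576 total query/key width
print(num_attention_heads * v_head_dim)   # 16384 total value width
```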

Model Dimensions

  • hidden_size: The size of the hidden layers in the model (7168).
  • intermediate_size: The size of the intermediate layer in the feed-forward network (18432).
  • num_hidden_layers: The number of hidden layers in the model (61).
  • vocab_size: The size of the vocabulary (129280 tokens).
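
Since tie_word_embeddings is false (see Miscellaneous), the input embedding table and the output head are separate matrices; a quick back-of-the-envelope count from the values above:

```python
# Rough parameter count implied by the embedding-related fields.
vocab_size = 129_280
hidden_size = 7_168

embedding_params = vocab_size * hidden_size
print(embedding_params)      # 926_679_040 (~0.93B parameters)

# With untied embeddings, the LM head is a second matrix of the same shape.
print(2 * embedding_params)  # 1_853_358_080 (~1.85B parameters)
```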

Positional Encoding

  • max_position_embeddings: The maximum sequence length the model can handle (163840 tokens).
  • rope_theta: The base value for rotary positional embeddings (10000).
  • rope_scaling: Configuration for scaling rotary positional embeddings:
    • type: The scaling method, yarn (YaRN, a technique for extending the RoPE context window).
    • factor: Scaling factor (40).
    • beta_fast and beta_slow: YaRN parameters that set which rotary frequency bands are interpolated and which are left unchanged.
    • mscale: Multiplicative scaling factor (1.0).
    • original_max_position_embeddings: The original maximum sequence length before scaling (4096).
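
The numbers are consistent with each other: YaRN stretches the original 4096-token training window by the scaling factor to reach the 163840-token limit listed above.

```python
# Worked arithmetic: YaRN extends the pre-scaling context by `factor`.
original_max_position_embeddings = 4_096
factor = 40

print(original_max_position_embeddings * factor)  # 163_840 == max_position_embeddings
```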

Mixture of Experts (MoE)

  • moe_intermediate_size: Intermediate size for MoE layers (2048).
  • moe_layer_freq: Frequency of MoE layers (1, meaning every layer after the initial dense layers uses MoE; see first_k_dense_replace under Miscellaneous).
  • n_routed_experts: Number of routed experts in the MoE layer (256).
  • n_shared_experts: Number of shared experts in the MoE layer (1).
  • num_experts_per_tok: Number of experts activated per token (8).
  • routed_scaling_factor: Scaling factor for routed experts (2.5).
  • scoring_func: The scoring function used for expert routing (sigmoid).
  • topk_method: The method used for selecting top-k experts (noaux_tc, the auxiliary-loss-free selection that steers routing with a per-expert bias instead of relying on an auxiliary loss).
  • topk_group: The number of expert groups each token is routed within (4); experts are partitioned into groups and only the highest-scoring groups are considered.
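
How these routing fields interact can be illustrated with a minimal sketch of sigmoid-scored top-k routing. This is a simplification: it omits the group-limited selection controlled by topk_group, the bias term used by the noaux_tc method, and the always-active shared expert.

```python
import torch

hidden_size, n_routed_experts, num_experts_per_tok = 7168, 256, 8
routed_scaling_factor = 2.5

router = torch.nn.Linear(hidden_size, n_routed_experts, bias=False)
tokens = torch.randn(4, hidden_size)            # a small batch of token states

scores = torch.sigmoid(router(tokens))          # scoring_func = "sigmoid"
topk_scores, topk_idx = scores.topk(num_experts_per_tok, dim=-1)

# norm_topk_prob = true: renormalize the selected scores, then apply the
# routed scaling factor before mixing the chosen experts' outputs.
weights = topk_scores / topk_scores.sum(dim=-1, keepdim=True)
weights = weights * routed_scaling_factor

print(topk_idx.shape, weights.shape)            # torch.Size([4, 8]) twice
```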

Low-Rank Attention Projections (MLA)

Despite the "lora" naming, these ranks configure the low-rank query and key/value compression used by Multi-head Latent Attention (MLA), not LoRA fine-tuning adapters.

  • kv_lora_rank: The rank of the compressed key/value latent (512).
  • q_lora_rank: The rank of the compressed query latent (1536).
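
A minimal shape sketch of these low-rank projections, assuming simplified down/up layers (the real MLA layers additionally split off the rotary part of the keys and normalize the compressed latents):

```python
import torch

hidden_size, q_lora_rank, kv_lora_rank = 7168, 1536, 512
num_heads, qk_nope_head_dim, qk_rope_head_dim, v_head_dim = 128, 128, 64, 128

# Queries: compress to a 1536-dim latent, then expand to all heads.
q_down = torch.nn.Linear(hidden_size, q_lora_rank, bias=False)
q_up = torch.nn.Linear(q_lora_rank, num_heads * (qk_nope_head_dim + qk_rope_head_dim), bias=False)

# Keys/values: compress to a 512-dim latent (what gets cached), then expand.
kv_down = torch.nn.Linear(hidden_size, kv_lora_rank, bias=False)
kv_up = torch.nn.Linear(kv_lora_rank, num_heads * (qk_nope_head_dim + v_head_dim), bias=False)

x = torch.randn(2, hidden_size)
print(q_up(q_down(x)).shape)    # torch.Size([2, 24576])
print(kv_up(kv_down(x)).shape)  # torch.Size([2, 32768])
```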

Normalization

  • rms_norm_eps: The epsilon value for RMS normalization (1e-06).
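
A minimal sketch of RMS normalization using this epsilon:

```python
import torch

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Scale each vector by the reciprocal of its root-mean-square, then by a learned weight.
    variance = x.pow(2).mean(dim=-1, keepdim=True)
    return x * torch.rsqrt(variance + eps) * weight

hidden_size = 7168
x = torch.randn(2, hidden_size)
weight = torch.ones(hidden_size)   # learned scale, typically initialized to ones
print(rms_norm(x, weight).shape)   # torch.Size([2, 7168])
```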

Initialization

  • initializer_range: The range for initializing model weights (0.02).

Activation Function

  • hidden_act: The activation function used in the model (silu, also known as Swish).
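
SiLU is defined as silu(x) = x * sigmoid(x). A minimal sketch of how it is typically applied here, assuming the LLaMA-style gated feed-forward layout (gate, up, and down projections):

```python
import torch
import torch.nn.functional as F

hidden_size, intermediate_size = 7168, 18432   # dense FFN sizes from this config

gate_proj = torch.nn.Linear(hidden_size, intermediate_size, bias=False)
up_proj = torch.nn.Linear(hidden_size, intermediate_size, bias=False)
down_proj = torch.nn.Linear(intermediate_size, hidden_size, bias=False)

x = torch.randn(2, hidden_size)
y = down_proj(F.silu(gate_proj(x)) * up_proj(x))   # gated SiLU ("SwiGLU") block
print(y.shape)                                     # torch.Size([2, 7168])
```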

Quantization

  • quantization_config: Configuration for quantization:
    • quant_method: The quantization method (fp8, 8-bit floating point).
    • fmt: The floating-point format (e4m3, i.e. 4 exponent bits and 3 mantissa bits).
    • activation_scheme: The scheme for quantizing activations (dynamic).
    • weight_block_size: The block size for quantizing weights ([128, 128]).
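
A minimal sketch of what block-wise scaling with [128, 128] blocks looks like, assuming a simple per-block absmax scale; real FP8 (e4m3) kernels store the quantized blocks together with one scale per block and dequantize on the fly.

```python
import torch

block = 128
e4m3_max = 448.0                    # largest finite value representable in FP8 e4m3

weight = torch.randn(7168, 2048)    # e.g. one MoE expert projection
rows, cols = weight.shape
blocks = weight.reshape(rows // block, block, cols // block, block)

absmax = blocks.abs().amax(dim=(1, 3), keepdim=True)      # one scale per 128x128 block
scales = absmax / e4m3_max
quantized = (blocks / scales).clamp(-e4m3_max, e4m3_max)  # values now fit the e4m3 range

print(scales.squeeze().shape)       # torch.Size([56, 16]) -> one scale per block
```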

Tokenization

  • bos_token_id: The ID of the beginning-of-sequence token (0).
  • eos_token_id: The ID of the end-of-sequence token (1).

Training and Inference

  • pretraining_tp: Tensor parallelism during pretraining (1, meaning no parallelism).
  • use_cache: Whether to use caching during inference (true).
  • torch_dtype: The data type used for tensors (bfloat16).
  • transformers_version: The version of the Hugging Face Transformers library used (4.33.1).
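
Putting the dtype and caching fields together, a minimal loading sketch; note the full checkpoint is far too large for a single GPU, so device placement (accelerate, vLLM, SGLang, etc.) is assumed to be handled separately:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-V3-Base",
    torch_dtype=torch.bfloat16,   # matches torch_dtype in the config
    trust_remote_code=True,
)
model.eval()   # use_cache=true enables KV caching during generation
```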

Auxiliary Loss

  • aux_loss_alpha: The weight for auxiliary loss (0.001).
  • seq_aux: Whether to use sequence-level auxiliary loss (true).
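
A minimal sketch of a sequence-level load-balancing auxiliary loss, assuming the common "token fraction times mean routing probability" formulation; the exact DeepSeek-V3 loss may differ in its details:

```python
import torch

aux_loss_alpha = 0.001
n_routed_experts, num_experts_per_tok, seq_len = 256, 8, 16

scores = torch.rand(seq_len, n_routed_experts)        # router scores for one sequence
probs = scores / scores.sum(dim=-1, keepdim=True)     # normalized routing probabilities
topk_idx = scores.topk(num_experts_per_tok, dim=-1).indices

# Fraction of routed slots each expert received within this sequence.
counts = torch.zeros(n_routed_experts).scatter_add_(
    0, topk_idx.reshape(-1), torch.ones(seq_len * num_experts_per_tok)
)
fraction = counts * n_routed_experts / (num_experts_per_tok * seq_len)

aux_loss = aux_loss_alpha * (fraction * probs.mean(dim=0)).sum()
print(aux_loss)   # small scalar; pushes routing toward a uniform expert load
```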

Miscellaneous

  • ep_size: Expert parallelism size (1, meaning no parallelism).
  • first_k_dense_replace: The number of initial layers that use a dense feed-forward network instead of MoE (3).
  • norm_topk_prob: Whether to normalize top-k probabilities (true).
  • num_nextn_predict_layers: The number of Multi-Token Prediction (MTP) modules used to predict additional future tokens (1).
  • tie_word_embeddings: Whether to tie the word embeddings to the output layer (false).
  • v_head_dim: The dimension of the value head in attention (128).
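
Taken together with num_hidden_layers and moe_layer_freq, first_k_dense_replace implies the following layer layout; a quick sketch:

```python
num_hidden_layers, first_k_dense_replace, moe_layer_freq = 61, 3, 1

layer_types = [
    "dense" if i < first_k_dense_replace or i % moe_layer_freq != 0 else "moe"
    for i in range(num_hidden_layers)
]
print(layer_types[:5])           # ['dense', 'dense', 'dense', 'moe', 'moe']
print(layer_types.count("moe"))  # 58 MoE layers out of 61
```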

Summary

This configuration defines a large-scale causal language model with a Mixture-of-Experts (MoE) architecture, rotary positional embeddings extended via YaRN scaling, and low-rank attention compression (Multi-head Latent Attention). It supports long sequences (up to 163840 tokens) and ships with 8-bit floating-point (FP8) weight quantization for efficiency. The model is designed for high-performance text generation tasks.
