@d0rc
Created December 16, 2025 11:22
EGGROLL hacks

USE_ADAPTIVE_THRESHOLD

#ifndef USE_ADAPTIVE_THRESHOLD
#  define USE_ADAPTIVE_THRESHOLD 1
#endif
#ifndef ADAPTIVE_THRESHOLD_ALPHA
#  define ADAPTIVE_THRESHOLD_ALPHA 0.1f
#endif

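When enabled, each fitness difference is reduced to its sign, and any difference whose magnitude is below ADAPTIVE_THRESHOLD_ALPHA times the mean absolute difference is zeroed: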

#if USE_ADAPTIVE_THRESHOLD
            // Adaptive Thresholding: Filter out weak signals based on Mean Absolute Difference
            double sum_abs_diff = 0.0;
            for(int i=0; i<num_fitness; i++) {
                sum_abs_diff += std::abs((double)diffs[i]);
            }
            
            double mad = sum_abs_diff / num_fitness;
            double threshold = mad * ADAPTIVE_THRESHOLD_ALPHA;
            
            for(int i=0; i<num_fitness; i++) {
                if (std::abs((double)diffs[i]) < threshold) {
                    h_fit[i] = 0;
                } else {
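                    // A zero diff reaches this branch only when threshold == 0 (all diffs zero); keep it at 0.
                    h_fit[i] = (diffs[i] > 0) ? 1 : ((diffs[i] < 0) ? -1 : 0);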
                }
            }
#else
            // Original Pairwise Comparison
            for(int i=0; i<num_fitness; i++) {
                h_fit[i] = (diffs[i] > 0) ? 1 : ((diffs[i] < 0) ? -1 : 0);
            }
#endif
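
As a worked example of the thresholding above, here is a minimal host-side sketch. The helper name `threshold_signs` and its vector-based interface are hypothetical; the gist does this inline on `diffs` and `h_fit`. For the diffs {100, -3, 50, 2, -80} the mean absolute difference is 47, so with alpha = 0.1 the threshold is about 4.7 and the two small diffs are zeroed.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical standalone helper; the gist performs this inline on h_fit.
std::vector<int> threshold_signs(const std::vector<int32_t>& diffs, float alpha) {
    assert(!diffs.empty());  // avoids dividing by zero below

    // Mean of |diff| over the chunk.
    double sum_abs_diff = 0.0;
    for (int32_t d : diffs) sum_abs_diff += std::abs(static_cast<double>(d));
    const double threshold = (sum_abs_diff / diffs.size()) * alpha;

    std::vector<int> signs;
    signs.reserve(diffs.size());
    for (int32_t d : diffs) {
        if (std::abs(static_cast<double>(d)) < threshold) {
            signs.push_back(0);  // weak signal: filtered out
        } else {
            // Zero stays zero even when the threshold itself is 0.
            signs.push_back(d > 0 ? 1 : (d < 0 ? -1 : 0));
        }
    }
    return signs;
}
```

Calling `threshold_signs({100, -3, 50, 2, -80}, 0.1f)` yields {1, 0, 1, 0, -1}. Because the threshold scales with the chunk's own magnitudes, the same alpha filters proportionally whether the diffs are large or small.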

CHUNK_MEAN_FILTER

When enabled, the chunk's mean difference (sum_diff_val / num_fitness) is reshaped by a sign-preserving power with exponent CHUNK_MEAN_EXPONENT, truncated to an integer, and added to every diff:

#if CHUNK_MEAN_FILTER
            double mean_diff = sum_diff_val / num_fitness;
            
            if (mean_diff != 0.0) {
                double sign = (mean_diff > 0) ? 1.0 : -1.0;
                mean_diff = sign * std::pow(std::abs(mean_diff), (double)CHUNK_MEAN_EXPONENT);
            }

            for(int i=0; i<num_fitness; i++) {
                diffs[i] += (int32_t)mean_diff;
            }
#endif
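
The filter can be tried in isolation with a sketch like the one below. It rests on two assumptions: that `sum_diff_val` is the plain sum of the diffs in the chunk, and that the `exponent` parameter stands in for CHUNK_MEAN_EXPONENT, whose value the gist does not show. The helper name `apply_chunk_mean_filter` is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical standalone helper. Assumes sum_diff_val in the gist is the
// plain sum of diffs over the chunk; `exponent` stands in for
// CHUNK_MEAN_EXPONENT.
void apply_chunk_mean_filter(std::vector<int32_t>& diffs, double exponent) {
    assert(!diffs.empty());  // avoids dividing by zero below

    double sum = 0.0;
    for (int32_t d : diffs) sum += d;
    double mean_diff = sum / diffs.size();

    // Sign-preserving power: sign(m) * |m|^exponent.
    if (mean_diff != 0.0) {
        const double sign = (mean_diff > 0) ? 1.0 : -1.0;
        mean_diff = sign * std::pow(std::abs(mean_diff), exponent);
    }

    // The cast truncates toward zero, so a shaped mean with magnitude
    // below 1 contributes nothing.
    const int32_t offset = static_cast<int32_t>(mean_diff);
    for (int32_t& d : diffs) d += offset;
}
```

With exponent 1.0 the diffs {10, 20, 30, 5} have mean 16.25, which truncates to an offset of 16 and gives {26, 36, 46, 21}. An exponent below 1 compresses large means (mathematically, a mean of 16 with exponent 0.5 maps to 4), which limits how far a strongly biased chunk can shift every diff.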

ADAPTIVE_NOISE_ENABLED

__device__ __forceinline__ float get_adaptive_scale(WeightType ov) {
#if ADAPTIVE_NOISE_ENABLED
    if (ov < 0) return 0.0f;
    if (ov < 64) return (float)ov / 64.0f;
    return 1.0f;
#else
    return 1.0f;
#endif
}

  1. Activity Tracking:

    • The system monitors the Adam optimizer updates.
    • If a weight row or column receives a significant update, it is marked as "active".
  2. Hysteresis Mechanism (Rank-1 Overlay):

    • We maintain int8_t counters for each row and column (the AdaptiveScales).
    • Reinforcement: Active features increment their counter (+5), increasing the noise scale for future steps.
    • Decay: Inactive features decrement their counter (-1), gradually reducing noise.
    • Dead Zone: Counters at or below 0 produce zero noise, effectively "freezing" stable weights until a strong signal reactivates them.
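
The reinforcement and decay rules above can be sketched as a single counter update. This is a minimal illustration, not the gist's code: the helper name `update_counter`, the clamp bounds, and the saturating behaviour are assumptions (the gist only states that the counters are int8_t), and how a row or column is judged "active" is left to the caller.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical clamp bounds: the gist only says the counters are int8_t.
constexpr int kCounterMin = -128;
constexpr int kCounterMax = 127;

// One update of a row or column activity counter.
// `active` is whatever the activity tracker decided for this step.
int8_t update_counter(int8_t counter, bool active) {
    // Work in int so that +5 cannot overflow the 8-bit counter.
    const int next = static_cast<int>(counter) + (active ? 5 : -1);
    const int clamped = std::max(kCounterMin, std::min(next, kCounterMax));
    assert(clamped >= kCounterMin && clamped <= kCounterMax);  // safe to narrow
    return static_cast<int8_t>(clamped);
}
```

The asymmetry (+5 versus -1) means a single active step offsets five inactive ones, so a feature has to stay quiet for a while before its counter drops into the dead zone.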

get_adaptive_scale() acts as a transfer function that maps the integer "activity counter" (stored in AdaptiveScales) to a floating-point noise multiplier ($0.0$ to $1.0$): zero for negative counters, a linear ramp ov / 64 for counters from 0 to 63, and $1.0$ from 64 upward. This scale factor is injected directly into the noise generation logic for every layer (Attention, MLP, Norms).
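
To check the mapping by hand, here is a host-side mirror of the device function with ADAPTIVE_NOISE_ENABLED set. It assumes `WeightType` is int8_t, which matches the counter description but is not shown in the gist.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: WeightType is an 8-bit signed counter, as the int8_t
// description suggests. The gist does not show its definition.
using WeightType = int8_t;

// Host-side mirror of get_adaptive_scale() with ADAPTIVE_NOISE_ENABLED set.
float get_adaptive_scale(WeightType ov) {
    float scale = 1.0f;                          // saturated: 64 and above
    if (ov < 0) {
        scale = 0.0f;                            // dead zone
    } else if (ov < 64) {
        scale = static_cast<float>(ov) / 64.0f;  // linear ramp, 0 maps to 0.0
    }
    assert(scale >= 0.0f && scale <= 1.0f);
    return scale;
}
```

Sample values: -5 and 0 both map to 0.0, 16 maps to 0.25, 32 maps to 0.5, and 64 or more maps to 1.0.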

  1. Retrieval: In kernels like compute_mlp or compute_attention, the code fetches the integer overlay value for the specific weight row/column (e.g., scales->w_q_row[l][tid]).
  2. Conversion: It calls get_adaptive_scale() to convert this integer into a float scale.
  3. Modulation: This scale multiplies the random noise term before it is added to the weights or activations.

Example (Linear Projection):

// acc = dot_product(input, weights)
// noise = random_hash() * scale_out
// acc += noise * global_noise_strength
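
The pseudocode above can be fleshed out into a small host-side sketch. Everything beyond the three commented steps is an assumption: the body of `random_hash` here is a generic integer mixer chosen for illustration, not the gist's generator, and the function name `project_with_noise` and its parameters are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the gist's random_hash(): maps (seed, index)
// to a repeatable value in [-1, 1).
float random_hash(uint32_t seed, uint32_t index) {
    uint32_t x = seed ^ (index * 0x9E3779B9u);
    x ^= x >> 16; x *= 0x85EBCA6Bu;
    x ^= x >> 13; x *= 0xC2B2AE35u;
    x ^= x >> 16;
    // Top 24 bits scaled to [0, 2), then shifted to [-1, 1).
    return static_cast<float>(x >> 8) / 8388608.0f - 1.0f;
}

// One output element of a noisy linear projection.
// scale_out is the adaptive scale for this output row (0 freezes the noise).
float project_with_noise(const std::vector<float>& input,
                         const std::vector<float>& weights,
                         float scale_out, float global_noise_strength,
                         uint32_t seed, uint32_t out_index) {
    assert(input.size() == weights.size());
    float acc = 0.0f;
    for (std::size_t i = 0; i < input.size(); ++i) acc += input[i] * weights[i];
    const float noise = random_hash(seed, out_index) * scale_out;
    return acc + noise * global_noise_strength;
}
```

With `scale_out` at 0 the result is the plain dot product, which is the "frozen" case from the dead zone. With `scale_out` at 1 the result deviates from the dot product by at most `global_noise_strength`.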