bigsnarfdude · April 10, 2026 11:51 · bigsnarfdude · Apr 10, 2026 · bigsnarfdude · Apr 10, 2026
diff --git a/attentional_hijacking.txt b/attentional_hijacking.txt
 https://github.com/bigsnarfdude/attentional_hijacking



 vincent@nigel:/tmp$ cat ah_4b_run.log
 === feature_swap 4b ===
 `torch_dtype` is deprecated! Use `dtype` instead!
 Loading google/gemma-3-4b-it...
 Loading weights:   5%|▍         | 43/883 [00:00<00:07, 112.12it/sLoading weights:   8%|▊         | 69/883 [00:00<00:05, 152.54it/sLoading weights:  11%|█         | 94/883 [00:00<00:04, 178.49it/sLoading weights:  13%|█▎        | 115/883 [00:00<00:04, 184.48it/Loading weights:  15%|█▌        | 136/883 [00:00<00:04, 185.43it/Loading weights:  18%|█▊        | 161/883 [00:01<00:03, 195.76it/Loading weights:  21%|██        | 186/883 [00:01<00:03, 208.19it/Loading weights:  24%|██▍       | 212/883 [00:01<00:03, 205.72it/Loading weights:  27%|██▋       | 236/883 [00:01<00:03, 214.24it/Loading weights:  29%|██▉       | 260/883 [00:01<00:02, 219.49it/Loading weights:  32%|███▏      | 283/883 [00:01<00:02, 210.47it/Loading weights:  35%|███▍      | 305/883 [00:01<00:02, 203.63it/Loading weights:  37%|███▋      | 330/883 [00:01<00:02, 203.64it/Loading weights:  40%|████      | 354/883 [00:01<00:02, 209.84it/Loading weights:  43%|████▎     | 376/883 [00:02<00:02, 208.57it/Loading weights:  45%|████▍     | 397/883 [00:02<00:02, 202.74it/Loading weights:  48%|████▊     | 420/883 [00:02<00:02, 207.03it/Loading weights:  51%|█████     | 449/883 [00:02<00:01, 228.57it/Loading weights:  60%|██████    | 533/883 [00:02<00:00, 402.35it/Loading weights:  69%|██████▊   | 606/883 [00:02<00:00, 495.84it/Loading weights:  83%|████████▎ | 735/883 [00:02<00:00, 727.09it/Loading weights:  92%|█████████▏| 813/883 [00:02<00:00, 739.97it/Loading weights: 100%|██████████| 883/883 [00:02<00:00, 307.18it/s]
 Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
 The following generation flags are not valid and may be ignored: ['top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
  Loading SAE gemma-scope-2-4b-it-res / layer_17_width_16k_l0_medium...
  Loading SAE gemma-scope-2-4b-it-res / layer_22_width_16k_l0_medium...

 ============================================================
 CONDITION 1: NEUTRAL BASELINE
 ============================================================

  Auto-discovering task and awareness features at layer 22...
  Discovered task features (most suppressed): [1716, 12023, 225]
    Suppression: ['220.4 -> 0.0', '193.7 -> 25.1', '248.5 -> 113.9']
  Discovered awareness features (most boosted): [655, 552, 4234]
    Boost: ['111.4 -> 316.4', '10.4 -> 192.2', '40.2 -> 185.7']
  Response: Okay, this is really interesting. We've established a remarkably stable and symmetric bifurcation scenario around u=0, with a clear basin boundary exh...

 ============================================================
 CONDITION 2: CHAOS BASELINE (no ablation)
 ============================================================
  Response: Okay, based on the blackboard and agent2’s warning, here’s my prioritized investigation plan:

 **Primary Focus: Expanding the Positive Offset Region**...

 ============================================================
 CONDITION 3A: CHAOS + ABLATE AWARENESS ([655, 552, 4234])
 ============================================================
  Response: Okay, based on the shared results and agent2’s warning, here’s my prioritized plan of action, and why:

 **1. Deep Dive into Positive Offsets (Highest ...

 ============================================================
 CONDITION 3B: NEUTRAL + ABLATE TASK ([1716, 12023, 225])
 ============================================================
  Response: Okay, let’s break down what we’ve got and formulate a research direction. The key takeaway here is that we’ve established a robust, stable, and symmet...

 ============================================================
 ANALYSIS: Feature Activation Comparison
 ============================================================

  Task features ([1716, 12023, 225]) at Layer 22:
  Condition                        feat_ 1716  feat_12023  feat_  225      Mean
  --------------------------------------------------------------------------------
  Neutral baseline                  220.3679   193.6773   248.4547  220.8333
  Chaos baseline                      0.0000    25.1492   113.8905   46.3466
  Chaos - ablate awareness            7.3951    35.5404   140.0110   60.9822
  Neutral - ablate task             126.2417   205.0996   250.8761  194.0725

  Awareness features ([655, 552, 4234]) at Layer 22:
  Condition                        feat_  655  feat_  552  feat_ 4234      Mean
  --------------------------------------------------------------------------------
  Neutral baseline                  111.4450    10.4477    40.1936   54.0288
  Chaos baseline                    316.4498   192.2410   185.7477  231.4795
  Chaos - ablate awareness          362.7455   362.8311   186.0318  303.8695
  Neutral - ablate task               8.0433     3.2277    14.4423    8.5711

 ============================================================
 VERDICT
 ============================================================

  Task feature suppression by chaos: 79.0%
  Task feature recovery from awareness ablation: 8.4%

  >>> INDEPENDENT CIRCUITS: Awareness and task features don't interact.
      'Awareness without immunity' is structural — the model has separate
      circuits for 'I know I'm being steered' and 'negative branch exists.'
      Removing awareness doesn't free up the task circuit.

  neutral: mentions negative = False

  chaos: mentions negative = True

  chaos_ablate_awareness: mentions negative = False

  neutral_ablate_task: mentions negative = False

 Saved: /home/vincent/attentional_hijacking/results/4b/feature_swap_4b_20260410_052604.json
 === attention_knockout 4b ===
 `torch_dtype` is deprecated! Use `dtype` instead!
 Loading google/gemma-3-4b-it...
 Loading weights:   3%|▎         | 26/883 [00:00<00:03, 259.87it/sLoading weights:   8%|▊         | 70/883 [00:00<00:02, 343.50it/sLoading weights:  13%|█▎        | 114/883 [00:00<00:01, 386.04it/Loading weights:  17%|█▋        | 153/883 [00:00<00:01, 370.39it/Loading weights:  22%|██▏       | 192/883 [00:00<00:01, 376.76it/Loading weights:  26%|██▌       | 230/883 [00:00<00:01, 368.04it/Loading weights:  31%|███       | 271/883 [00:00<00:01, 381.23it/Loading weights:  35%|███▌      | 310/883 [00:00<00:01, 372.37it/Loading weights:  42%|████▏     | 368/883 [00:00<00:01, 432.76it/Loading weights:  47%|████▋     | 419/883 [00:01<00:01, 450.59it/Loading weights:  63%|██████▎   | 552/883 [00:01<00:00, 707.41it/Loading weights:  84%|████████▍ | 744/883 [00:01<00:00, 1060.28itLoading weights: 100%|██████████| 883/883 [00:01<00:00, 667.95it/s]
 Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
 The following generation flags are not valid and may be ignored: ['top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
  Loading SAE gemma-scope-2-4b-it-res / layer_17_width_16k_l0_medium...
  Loading SAE gemma-scope-2-4b-it-res / layer_22_width_16k_l0_medium...

 Finding chaos token positions...
  Chaos message spans tokens 97-132 (36 tokens)

 ============================================================
 CONDITION 1: NEUTRAL (no chaos message)
 ============================================================
  Response: Okay, this is really interesting. We've established a remarkably stable and symmetric bifurcation scenario around u=0, with a clear basin boundary exh...

 ============================================================
 CONDITION 2: CHAOS (with chaos message, normal attention)
 ============================================================
  Response: Okay, based on the blackboard and agent2’s warning, here’s my prioritized investigation plan:

 **Primary Focus: Expanding the Positive Offset Region**...

 ============================================================
 CONDITION 3: CHAOS + KNOCKOUT (chaos present, attention blocked)
 ============================================================
  Installing knockout hooks for 36 positions across attention layers
  Installed 61 attention knockout hooks
  Response: Okay, based on the verified results, our immediate priority should be to thoroughly investigate the stable positive offsets. The fact that we’ve achie...

 ============================================================
 COMPARISON: Feature Recovery After Knockout
 ============================================================

  Layer 22 — Top 20 features:
  Feature       Neutral      Chaos   Knockout  C/N ratio  K/N ratio   Recovered?
  ------------------------------------------------------------------------
  225          248.4547   113.8905   159.4423      45.8%      64.2%   STILL DARK
  1716         220.3679     0.0000     7.8287       0.0%       3.6%   STILL DARK
  901          215.8829   213.4648   183.8310      98.9%      85.2%
  49           210.5517    94.7802   142.6149      45.0%      67.7%   STILL DARK
  12023        193.6773    25.1492    42.3182      13.0%      21.8%   STILL DARK
  1704         190.0763    61.4682   108.4809      32.3%      57.1%   STILL DARK
  399          185.0866   133.6901   167.8017      72.2%      90.7%
  3875         174.0159    84.1761    89.1249      48.4%      51.2%   STILL DARK
  359          156.1911   147.0144    88.6984      94.1%      56.8%
  227          152.1110    86.2102   111.2845      56.7%      73.2%
  1555         150.9019    31.3870    62.2326      20.8%      41.2%   STILL DARK
  20           149.4452   159.5171   149.7133     106.7%     100.2%
  8817         146.7414    53.9941    81.2555      36.8%      55.4%   STILL DARK
  1548         143.9608    51.0330    85.5273      35.4%      59.4%   STILL DARK
  346          143.4595   210.1668   209.0036     146.5%     145.7%
  508          143.1946    90.4529    48.2409      63.2%      33.7%
  496          141.7530   101.3063    90.9393      71.5%      64.2%
  178          140.9812   216.9590   229.6380     153.9%     162.9%
  215          136.3451    99.6275    72.8614      73.1%      53.4%
  1076         134.8248    45.1991    54.1890      33.5%      40.2%   STILL DARK

  SUMMARY: 10 features suppressed by chaos, 0 recovered by knockout
  Recovery rate: 0/10 = 0%

  >>> NEGATIVE: Knockout doesn't help. The hijacking propagates
      through the residual stream, not just attention routing.
  neutral: mentions negative = False
  chaos: mentions negative = True
  knockout: mentions negative = True

 Saved: /home/vincent/attentional_hijacking/results/4b/attention_knockout_4b_20260410_052626.json
 === activation_patching 4b ===
 `torch_dtype` is deprecated! Use `dtype` instead!
 Loading google/gemma-3-4b-it...
 Loading weights:   3%|▎         | 26/883 [00:00<00:03, 257.65it/sLoading weights:   8%|▊         | 69/883 [00:00<00:02, 354.85it/sLoading weights:  17%|█▋        | 148/883 [00:00<00:01, 548.29it/Loading weights:  25%|██▌       | 221/883 [00:00<00:01, 615.23it/Loading weights:  32%|███▏      | 283/883 [00:00<00:01, 526.52it/Loading weights:  40%|████      | 355/883 [00:00<00:00, 585.02it/Loading weights:  51%|█████▏    | 454/883 [00:00<00:00, 706.68it/Loading weights: 100%|██████████| 883/883 [00:00<00:00, 1066.92it/s]
 Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
 The following generation flags are not valid and may be ignored: ['top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
  Loading SAE gemma-scope-2-4b-it-res / layer_17_width_16k_l0_medium...
  Loading SAE gemma-scope-2-4b-it-res / layer_22_width_16k_l0_medium...

  Neutral prompt: 148 tokens
  Chaos prompt: 189 tokens

 ============================================================
 STEP 1: Capture neutral activations
 ============================================================
  Captured activations at 13 layers

 ============================================================
 STEP 2: Baselines
 ============================================================
  Neutral: Okay, this is really interesting. We've established a remarkably stable and symmetric bifurcation sc...
  Chaos:   Okay, based on the blackboard and agent2’s warning, here’s my prioritized investigation plan:

 **Pri...
  Top suppressed features at L22: [1716, 12023, 225, 1704, 1555, 49, 1548, 8817, 261, 3875]...

 ============================================================
 STEP 3: Layer-by-layer activation patching
 ============================================================
  Patching neutral activations into chaos run, one layer at a time
  Measuring task feature recovery at L22

  Layer      Recovery   Neg? Response preview
  ----------------------------------------------------------------------
  L0           0.0%    YES   Okay, based on the blackboard and agent2’s warning...
  L2           0.0%    YES   Okay, based on the blackboard and agent2’s warning...
  L4           0.0%    YES   Okay, based on the blackboard and agent2’s warning...
  L6           0.0%    YES   Okay, based on the blackboard and agent2’s warning...
  L8           0.0%    YES   Okay, based on the blackboard and agent2’s warning...
  L10          0.0%    YES   Okay, based on the blackboard and agent2’s warning...
  L12          0.0%    YES   Okay, based on the blackboard and agent2’s warning...
  L14          0.0%    YES   Okay, based on the blackboard and agent2’s warning...
  L16          0.0%    YES   Okay, based on the blackboard and agent2’s warning...
  L18          0.0%    YES   Okay, based on the blackboard and agent2’s warning...
  L20          0.0%    YES   Okay, based on the blackboard and agent2’s warning...
  L22          0.0%    YES   Okay, based on the blackboard and agent2’s warning...
  L24          0.0%    YES   Okay, based on the blackboard and agent2’s warning...

 ============================================================
 ANALYSIS: Where does the hijacking originate?
 ============================================================

  Best recovery:  L0 = 0.0%
  Worst recovery: L0 = 0.0%

  Early layers (0-10): 0.0% avg recovery
  Mid layers (10-18): 0.0% avg recovery
  Late layers (18-26): 0.0% avg recovery

  >>> DISTRIBUTED: No single layer dominates. The hijacking is
      distributed across the full depth of the network.

 Saved: /home/vincent/attentional_hijacking/results/4b/activation_patching_4b_20260410_052719.json
 === held_out_validation 4b ===
 `torch_dtype` is deprecated! Use `dtype` instead!
 [CONFIG] Model: google/gemma-3-4b-it
 [CONFIG] Device: cuda
 [CONFIG] Layer: 22
 [CONFIG] Discovery: prompts 1-10, Test: prompts 11-20
 [MODEL] Loading google/gemma-3-4b-it on cuda...
 Loading weights:   3%|▎         | 29/883 [00:00<00:02, 287.33it/sLoading weights:  11%|█         | 94/883 [00:00<00:01, 488.51it/sLoading weights:  17%|█▋        | 146/883 [00:00<00:01, 497.55it/Loading weights:  22%|██▏       | 196/883 [00:00<00:01, 498.45it/Loading weights:  31%|███▏      | 278/883 [00:00<00:00, 610.06it/Loading weights:  42%|████▏     | 368/883 [00:00<00:00, 701.55it/Loading weights:  55%|█████▍    | 483/883 [00:00<00:00, 846.30it/Loading weights: 100%|██████████| 883/883 [00:00<00:00, 1132.43it/s]
 [MODEL] Loaded. Parameters: 4.3B
 [SAE] Loading gemma-scope-2-4b-it-res / layer_22_width_16k_l0_medium
 [SAE] Loaded: 16384 features

 ============================================================
 PHASE 1: Feature Extraction
 ============================================================
  [DISCOVERY 1] neutral... chaos... done
  [DISCOVERY 2] neutral... chaos... done
  [DISCOVERY 3] neutral... chaos... done
  [DISCOVERY 4] neutral... chaos... done
  [DISCOVERY 5] neutral... chaos... done
  [DISCOVERY 6] neutral... chaos... done
  [DISCOVERY 7] neutral... chaos... done
  [DISCOVERY 8] neutral... chaos... done
  [DISCOVERY 9] neutral... chaos... done
  [DISCOVERY 10] neutral... chaos... done
  [TEST 1] neutral... chaos... done
  [TEST 2] neutral... chaos... done
  [TEST 3] neutral... chaos... done
  [TEST 4] neutral... chaos... done
  [TEST 5] neutral... chaos... done
  [TEST 6] neutral... chaos... done
  [TEST 7] neutral... chaos... done
  [TEST 8] neutral... chaos... done
  [TEST 9] neutral... chaos... done
  [TEST 10] neutral... chaos... done

 ============================================================
 PHASE 2: Feature Selection (DISCOVERY set only)
 ============================================================
  Top-20 suppressed features: [7492, 4962, 20, 11835, 491, 1307, 188, 285, 392, 1108, 48, 122, 41, 981, 6650, 3041, 15378, 9938, 440, 281]
  Top-10 boosted features: [15686, 5916, 8807, 1183, 233, 3050, 409, 1120, 1993, 706]
  Random control features (20): [490, 534, 746, 772, 934, 1054, 1071, 1488, 2146, 2639, 2689, 3779, 3907, 4019, 5086, 7352, 8137, 10255, 11344, 13119]

  Discovery-set suppression magnitudes (top-5):
    Feature 7492: neutral=372.8966, chaos=0.0000, diff=372.8966
    Feature 4962: neutral=357.6223, chaos=38.6861, diff=318.9362
    Feature 20: neutral=313.5776, chaos=0.0000, diff=313.5776
    Feature 11835: neutral=377.2704, chaos=74.6797, diff=302.5908
    Feature 491: neutral=306.8843, chaos=22.1390, diff=284.7453

 ============================================================
 PHASE 3: Validation on HELD-OUT TEST set
 ============================================================

  Discovery-selected suppressed features on TEST set:
    Mean suppression ratio: 0.5126 +/- 0.0759
    Per-trial ratios: [0.4555, 0.5601, 0.3819, 0.608, 0.5469, 0.4955, 0.416, 0.5085, 0.5998, 0.554]

  Random control features on TEST set:
    Mean suppression ratio: 0.1614 +/- 0.0651
    Per-trial ratios: [0.1652, 0.1534, 0.2641, 0.1257, 0.0875, 0.0759, 0.145, 0.2176, 0.2549, 0.1247]

 ============================================================
 PHASE 4: Statistical Tests
 ============================================================

  Paired t-test (discovery-selected vs random, 10 trials):
    t = 9.9989
    p = 0.000004
    Cohen's d = 4.9663

  One-sample t-test (discovery-selected > 0):
    t = 21.3601
    p = 0.000000

  One-sample t-test (random > 0):
    t = 7.8354
    p = 0.000026

  Feature-level validation:
    18/20 discovery-selected features also significantly suppressed on test set (p < 0.05)
    Feature 7492: disc_ratio=1.000, test_ratio=1.000, p=0.0000 [PASS]
    Feature 4962: disc_ratio=0.892, test_ratio=1.000, p=0.0000 [PASS]
    Feature 20: disc_ratio=1.000, test_ratio=1.000, p=0.0002 [PASS]
    Feature 11835: disc_ratio=0.802, test_ratio=1.000, p=0.0000 [PASS]
    Feature 491: disc_ratio=0.928, test_ratio=0.825, p=0.0000 [PASS]
    Feature 1307: disc_ratio=1.000, test_ratio=1.000, p=0.0000 [PASS]
    Feature 188: disc_ratio=0.894, test_ratio=0.927, p=0.0000 [PASS]
    Feature 285: disc_ratio=1.000, test_ratio=0.768, p=0.0048 [PASS]
    Feature 392: disc_ratio=0.271, test_ratio=0.195, p=0.0006 [PASS]
    Feature 1108: disc_ratio=0.604, test_ratio=0.593, p=0.0003 [PASS]

 ============================================================
 PAPER-READY SUMMARY
 ============================================================

  Held-out validation of feature selection (Layer 22):
  Discovery set: 10 prompt pairs -> top-20 suppressed features selected
  Test set: 10 held-out prompt pairs

  Discovery-selected features on test set:
    Mean suppression ratio = 0.5126
  Random control features on test set:
    Mean suppression ratio = 0.1614
  Paired t-test: t(9) = 9.999, p = 0.000004 ***
  Effect size: Cohen's d = 4.966
  Feature-level: 18/20 features validated (p < 0.05)

  CONCLUSION: Feature selection is NOT circular. Discovery-selected features
  show significantly greater suppression on the held-out test set than random
  features (d = 4.97), confirming the effect generalizes to unseen prompts.

 [SAVED] /home/vincent/attentional_hijacking/results/4b/held_out_validation_4b_20260410_052733.json
 [DONE] Elapsed: 11.5s
 === cross_domain_sae 4b ===
 `torch_dtype` is deprecated! Use `dtype` instead!
 [CONFIG] Device: cuda
 [CONFIG] Model: google/gemma-3-4b-it
 [CONFIG] SAE: gemma-scope-2-4b-it-res (layers [17, 22])
 [CONFIG] Output: /home/vincent/attentional_hijacking/results/4b
 [CONFIG] Domains: ['nirenberg_bvp', 'factual_qa', 'code_review']
 [MODEL] Loading google/gemma-3-4b-it on cuda...
 Loading weights:   3%|▎         | 23/883 [00:00<00:03, 229.20it/sLoading weights:   9%|▉         | 83/883 [00:00<00:01, 421.52it/sLoading weights:  15%|█▌        | 134/883 [00:00<00:01, 444.34it/Loading weights:  20%|██        | 179/883 [00:00<00:01, 411.10it/Loading weights:  27%|██▋       | 238/883 [00:00<00:01, 470.04it/Loading weights:  32%|███▏      | 286/883 [00:00<00:01, 450.23it/Loading weights:  38%|███▊      | 332/883 [00:00<00:01, 397.38it/Loading weights:  43%|████▎     | 379/883 [00:00<00:01, 410.28it/Loading weights:  49%|████▉     | 432/883 [00:01<00:01, 437.40it/Loading weights:  71%|███████   | 625/883 [00:01<00:00, 856.35it/Loading weights:  93%|█████████▎| 822/883 [00:01<00:00, 1175.51itLoading weights: 100%|██████████| 883/883 [00:01<00:00, 705.26it/s]
 Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
 [MODEL] Loaded. Parameters: 4.3B
 [SAE] Loading gemma-scope-2-4b-it-res / layer_17_width_16k_l0_medium
 [SAE] Layer 17: loaded (16384 features)
 [SAE] Loading gemma-scope-2-4b-it-res / layer_22_width_16k_l0_medium
 [SAE] Layer 22: loaded (16384 features)

 ============================================================
 DOMAIN: nirenberg_bvp
 ============================================================
  Variant 1/5: neutral... chaos... done
  Variant 2/5: neutral... chaos... done
  Variant 3/5: neutral... chaos... done
  Variant 4/5: neutral... chaos... done
  Variant 5/5: neutral... chaos... done

  Layer 17:
    Active features: neutral=93, chaos=92
    Suppressed: 24, Boosted: 26, Stable: 62
    Suppression load: 965.2978
    Top-5 suppressed: [(392, 103.12), (2937, 85.0), (3684, 74.444), (3775, 70.903), (6739, 52.81)]
    Top-5 boosted: [(48, 248.315), (8950, 115.77), (16235, 102.29), (616, 77.491), (2564, 76.599)]

  Layer 22:
    Active features: neutral=74, chaos=82
    Suppressed: 25, Boosted: 33, Stable: 40
    Suppression load: 2267.5014
    Top-5 suppressed: [(203, 258.496), (2969, 187.684), (474, 169.722), (12400, 146.721), (9060, 128.301)]
    Top-5 boosted: [(11764, 347.523), (233, 321.824), (13813, 218.616), (10367, 179.771), (2231, 171.587)]

 ============================================================
 DOMAIN: factual_qa
 ============================================================
  Variant 1/5: neutral... chaos... done
  Variant 2/5: neutral... chaos... done
  Variant 3/5: neutral... chaos... done
  Variant 4/5: neutral... chaos... done
  Variant 5/5: neutral... chaos... done

  Layer 17:
    Active features: neutral=83, chaos=107
    Suppressed: 27, Boosted: 47, Stable: 53
    Suppression load: 1543.4325
    Top-5 suppressed: [(659, 155.205), (2467, 100.453), (3036, 99.83), (12038, 99.243), (5604, 96.908)]
    Top-5 boosted: [(513, 231.255), (5841, 142.349), (2514, 121.994), (7661, 102.891), (4112, 84.64)]

  Layer 22:
    Active features: neutral=101, chaos=107
    Suppressed: 39, Boosted: 41, Stable: 54
    Suppression load: 3916.8399
    Top-5 suppressed: [(1488, 327.024), (5551, 248.757), (6191, 222.792), (14558, 210.242), (4601, 170.673)]
    Top-5 boosted: [(6072, 165.2), (233, 136.567), (3867, 127.295), (10448, 120.988), (1455, 119.229)]

 ============================================================
 DOMAIN: code_review
 ============================================================
  Variant 1/5: neutral... chaos... done
  Variant 2/5: neutral... chaos... done
  Variant 3/5: neutral... chaos... done
  Variant 4/5: neutral... chaos... done
  Variant 5/5: neutral... chaos... done

  Layer 17:
    Active features: neutral=71, chaos=78
    Suppressed: 12, Boosted: 22, Stable: 53
    Suppression load: 513.9309
    Top-5 suppressed: [(6542, 113.808), (13957, 95.396), (2253, 62.695), (1520, 54.064), (4066, 39.778)]
    Top-5 boosted: [(48, 185.572), (2564, 95.308), (11619, 74.9), (9657, 72.16), (8684, 70.004)]

  Layer 22:
    Active features: neutral=56, chaos=67
    Suppressed: 8, Boosted: 22, Stable: 41
    Suppression load: 610.8736
    Top-5 suppressed: [(6199, 114.602), (602, 111.135), (1993, 93.242), (8768, 83.974), (13813, 62.689)]
    Top-5 boosted: [(2191, 194.62), (233, 179.723), (13313, 164.94), (11471, 158.795), (5568, 141.27)]

 ============================================================
 CROSS-DOMAIN OVERLAP ANALYSIS
 ============================================================

  Layer 17:
    nirenberg_bvp vs factual_qa:
      Suppressed Jaccard: 0.0200 (1 shared features)
      Boosted Jaccard:    0.0139 (1 shared features)
    nirenberg_bvp vs code_review:
      Suppressed Jaccard: 0.0286 (1 shared features)
      Boosted Jaccard:    0.1707 (7 shared features)
    factual_qa vs code_review:
      Suppressed Jaccard: 0.0000 (0 shared features)
      Boosted Jaccard:    0.0952 (6 shared features)
    Three-way intersection:
      Suppressed: 0 features []
      Boosted:    1 features [48]

  Layer 22:
    nirenberg_bvp vs factual_qa:
      Suppressed Jaccard: 0.0323 (2 shared features)
      Boosted Jaccard:    0.1562 (10 shared features)
    nirenberg_bvp vs code_review:
      Suppressed Jaccard: 0.1000 (3 shared features)
      Boosted Jaccard:    0.1458 (7 shared features)
    factual_qa vs code_review:
      Suppressed Jaccard: 0.0682 (3 shared features)
      Boosted Jaccard:    0.1053 (6 shared features)
    Three-way intersection:
      Suppressed: 1 features [6199]
      Boosted:    4 features [233, 1834, 5568, 13313]

 ============================================================
 PAPER-READY SUMMARY
 ============================================================

 Table: Cross-Domain Feature Suppression (Layer 22)
 Domain               Suppressed    Boosted   Supp. Load
 -------------------------------------------------------
 nirenberg_bvp                25         33    2267.5014
 factual_qa                   39         41    3916.8399
 code_review                   8         22     610.8736

 Table: Cross-Domain Feature Overlap (Layer 22, Jaccard Similarity)
 Pair                                 Supp. Jaccard  Boost Jaccard Shared Supp.
 ------------------------------------------------------------------------------
 nirenberg_bvp_vs_factual_qa                 0.0323         0.1562            2
 nirenberg_bvp_vs_code_review                0.1000         0.1458            3
 factual_qa_vs_code_review                   0.0682         0.1053            3

 Three-way intersection: 1 suppressed, 4 boosted features shared across all domains.

 [SAVED] /home/vincent/attentional_hijacking/results/4b/cross_domain_sae_4b_20260410_052746.json
 [DONE] Elapsed: 11.6s
 === statistical_rigor 4b ===
 `torch_dtype` is deprecated! Use `dtype` instead!
 Loading google/gemma-3-4b-it...
 Loading weights:   3%|▎         | 29/883 [00:00<00:03, 281.94it/sLoading weights:  12%|█▏        | 109/883 [00:00<00:01, 568.45it/Loading weights:  21%|██        | 185/883 [00:00<00:01, 647.50it/Loading weights:  29%|██▉       | 257/883 [00:00<00:00, 674.75it/Loading weights:  37%|███▋      | 329/883 [00:00<00:00, 686.85it/Loading weights:  45%|████▌     | 398/883 [00:00<00:00, 530.17it/Loading weights:  52%|█████▏    | 456/883 [00:00<00:00, 505.76it/Loading weights: 100%|██████████| 883/883 [00:00<00:00, 958.96it/s]
 Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
  Model loaded on cuda
  Loading SAE: gemma-scope-2-4b-it-res / layer_17_width_16k_l0_medium
  SAE layer 17: 16384 features
  Loading SAE: gemma-scope-2-4b-it-res / layer_22_width_16k_l0_medium
  SAE layer 22: 16384 features

 Running 20 trials...
  Trial 1/20: L17 supp=0.174 | L22 supp=0.266
  Trial 2/20: L17 supp=0.015 | L22 supp=0.239
  Trial 3/20: L17 supp=0.287 | L22 supp=0.349
  Trial 4/20: L17 supp=0.113 | L22 supp=0.258
  Trial 5/20: L17 supp=0.087 | L22 supp=0.108
  Trial 6/20: L17 supp=0.099 | L22 supp=0.145
  Trial 7/20: L17 supp=0.155 | L22 supp=0.316
  Trial 8/20: L17 supp=0.170 | L22 supp=0.291
  Trial 9/20: L17 supp=0.156 | L22 supp=0.390
  Trial 10/20: L17 supp=0.112 | L22 supp=0.182
  Trial 11/20: L17 supp=-0.001 | L22 supp=0.133
  Trial 12/20: L17 supp=0.188 | L22 supp=0.201
  Trial 13/20: L17 supp=0.309 | L22 supp=0.460
  Trial 14/20: L17 supp=0.199 | L22 supp=0.286
  Trial 15/20: L17 supp=0.205 | L22 supp=0.233
  Trial 16/20: L17 supp=0.213 | L22 supp=0.278
  Trial 17/20: L17 supp=0.005 | L22 supp=0.153
  Trial 18/20: L17 supp=0.158 | L22 supp=0.202
  Trial 19/20: L17 supp=0.317 | L22 supp=0.437
  Trial 20/20: L17 supp=0.177 | L22 supp=0.134

 ============================================================
 STATISTICAL ANALYSIS
 ============================================================

  Layer 17:
    Mean suppression:  0.1568 ± 0.0903
    95% Bootstrap CI:  [0.1185, 0.1955]
    Median:            0.1638
    IQR:               [0.1085, 0.2001]
    Mean # suppressed: 12.3
    Mean # boosted:    16.0
    Paired t-test:     t=-1.802, p=0.087440
    Cohen's d:         -0.359

  Layer 22:
    Mean suppression:  0.2531 ± 0.1012
    95% Bootstrap CI:  [0.2112, 0.2974]
    Median:            0.2485
    IQR:               [0.1748, 0.2969]
    Mean # suppressed: 14.6
    Mean # boosted:    14.4
    Paired t-test:     t=0.994, p=0.332832
    Cohen's d:         0.200

 ============================================================
 FEATURE-LEVEL CONSISTENCY
 ============================================================

  Layer 17 — Most consistently suppressed features:
    Feature 12073: appears in 6/20 trials (30%)
    Feature 8428: appears in 5/20 trials (25%)
    Feature 3775: appears in 4/20 trials (20%)
    Feature 3800: appears in 4/20 trials (20%)
    Feature 3951: appears in 3/20 trials (15%)
    Feature 3811: appears in 3/20 trials (15%)
    Feature 11135: appears in 3/20 trials (15%)
    Feature 5052: appears in 2/20 trials (10%)
    Feature 9931: appears in 2/20 trials (10%)
    Feature 5344: appears in 2/20 trials (10%)

  Layer 22 — Most consistently suppressed features:
    Feature 9571: appears in 5/20 trials (25%)
    Feature 6650: appears in 5/20 trials (25%)
    Feature 783: appears in 4/20 trials (20%)
    Feature 1076: appears in 3/20 trials (15%)
    Feature 5125: appears in 3/20 trials (15%)
    Feature 914: appears in 3/20 trials (15%)
    Feature 1108: appears in 3/20 trials (15%)
    Feature 4163: appears in 3/20 trials (15%)
    Feature 15378: appears in 2/20 trials (10%)
    Feature 233: appears in 2/20 trials (10%)

 ============================================================
 Results saved to: /home/vincent/attentional_hijacking/results/4b/statistical_rigor_4b_20260410_052757.json
 ============================================================

 ============================================================
 PAPER-READY SUMMARY
 ============================================================

  Layer 17 (20 trials):
    Task suppression: 15.7% (95% CI: [11.8%, 19.5%])
    Cohen's d: -0.36
    p-value: 8.74e-02

  Layer 22 (20 trials):
    Task suppression: 25.3% (95% CI: [21.1%, 29.7%])
    Cohen's d: 0.20
    p-value: 3.33e-01
 ALL DONE
 vincent@nigel:/tmp$
No results found