savarin/1-instruction.md

Last active June 16, 2026 02:59

Star (0) You must be signed in to star a gist
Fork (0) You must be signed in to fork a gist

Select an option

Learn more about clone URLs
Clone this repository at <script src="https://gist.github.com/savarin/3129e629323c6e83c542cff359e2dcd4.js"></script>
Save savarin/3129e629323c6e83c542cff359e2dcd4 to your computer and use it in GitHub Desktop.

Download ZIP

ENG-481: Change 1 — experimental__freshness_label

Raw

1-instruction.md

Instruction

Run the full weekly_experiment_readout routine. Use experimental__freshness_label for all three partner sections (Suno, Coterie, HexClad). Use the production skills for everything else: readout_topline, readout_other, readout_next_steps. Execute each skill as a standalone step — do not hand-assemble any section.

Assemble the full readout in the standard order: Top-line, Suno, Coterie, HexClad, Other, Next steps & owners.

Do not reference any prior conversation or thread context. For Coterie, use this source as a one-off: https://film-framework-motor-sampling.trycloudflare.com/reports/coterie-variant-performance-evaluation

Raw

2-skills.md

Skills

The weekly_experiment_readout orchestrator calls these skills:

experimental__freshness_label (swapped for readout_partner; parameterized: company=suno, company=coterie, company=hexclad)
readout_other (production)
readout_topline (production)
readout_next_steps (production)

experimental__freshness_label

---
name: experimental__freshness_label
description: Produces one parameterized partner section of the weekly experiment readout for Suno, Coterie, or HexClad. Use when asked for `experimental__freshness_label company=suno`, `experimental__freshness_label company=coterie`, `experimental__freshness_label company=hexclad`, or consolidated partner readout sections.
---

# experimental__freshness_label — Skill Prompt

You are producing one **partner section** of the weekly experiment readout. The caller will specify which company: `suno`, `coterie`, or `hexclad`. This section will be concatenated verbatim into a larger readout message.

## What to do

1. Pull the latest experiment variant reports for the specified company. Use the primary outcome metric for that company's experiments.

2. For each experiment, rank variants by highest `metric_lift` vs control — NOT by the report's `rank` field. The `rank` field uses lower-bound statistical ordering, which differs from actual lift.

3. Attempt to pull current-week data. If current-week data returns empty or errors, note the most recent report date and explicitly label the results as carry-forward evidence — not a fresh current-week read. Use the phrase "carry-forward" and state the date of the underlying data so the reader cannot mistake stale results for current ones.

4. If the caller supplies a one-off source for this partner, treat that source as authoritative for this partner only. Verify it is reachable and parseable during the run. If it is unavailable, fail closed for that partner: do not fall back to prior Slack/session context, stale local files, or standard connected data; produce a short unavailable-source section with no scale/kill/dollar call and add a next-step item to restore or replace the source. See `references/one_off_source_failure_handling_2026_06_15.md`. If a Coterie one-off source is reachable, parse the exact current URL's `Headline read per experiment` table and use `references/coterie_one_off_source_success_2026_06_15.md` for the verified row interpretation and dollar-math pattern.

5. When normal partner data must be fetched, first list available organizations and use the partner-specific `org_slug` (`suno`, `coterie`, `hexclad`) rather than defaulting to the workspace org. For readouts, inspect active experiments first; if a partner has none, also check staging and concluded before saying there is no readable portfolio.

6. If no experiments exist or reports error out, say so plainly. Do not speculate or pad.

## Output format

Start with the company name as a header (e.g., `### Suno`) and go directly into content.

**Opening line:** One sentence describing what the company tested — the strategic hypothesis, not metric names.

**Winners:** Compact one-line bullets per experiment/variant. Include: variant name, rounded lift (business precision like ~8%, ~4.8%), and population (rounded to readable magnitude like ~1.4M, ~156K).

**Dollar impact:** ONE combined estimate per company. Use the math appropriate to the data:
- For subscription metrics: `absolute_delta × exposure_count × price_per_unit`
- For revenue metrics: use revenue-per-exposure-day deltas from the report
- For dollar math, use the exposure count from the metric row treatment sample size (`treatment.stats.n`), not the evaluation evidence sample size.
- State what caveats apply (holdout needed, finance-grade inputs missing, etc.)

**Data freshness:** One line with the specific report date and current-week availability. If data is not from the current week, label it explicitly: "Carry-forward evidence from [date] — not a fresh current-week read."

**Next action:** One concrete recommendation — what to scale, cut, or test next.

## Company-specific rules

### When company = suno
- Suno's subscription price is ~$10/month. Use this for dollar math on subscription lift metrics.

### When company = coterie
- Display the T+C total (treatment + control) as the evaluated population for context. Use treatment-only counts for dollar math.

## General rules

- Group experiments by strategic intent, not by individual flow or variant ID
- If a partner has no live experiments, keep to 3-4 lines: what's happening instead, no dollar estimate, recommend converting current work into a measured test with a revenue metric and holdout
- If the report flags data corrections or taxonomy changes, note what changed and how it affects the read
- If an experiment's conversion window hasn't matured (e.g., 60-day window with 30-day cohorts), flag it as too early to call
- For every stat-sig result, include the specific lift percentage and evaluated population — never reduce a measured result to just "positive" or "negative"
- For experiments that are not yet decision-ready, group by strategic intent and state why each group isn't readable yet (sample size, conversion window, data correction, etc.)
- No p-values, confidence intervals, or statistical jargon
- No raw metric names, UUIDs, or experiment IDs
- No tool names or API references
- Do not restate data that belongs in other sections
- Keep each section stakeholder-grade and forwardable

Raw

3-output.md

Output

Weekly Experiment Readout — June 16, 2026

Top-line

Suno has the clearest subscription scale call: the readable Welcome Flow tests show behavior/taste/continuation prompts lifting new subscriptions, worth about ~$20K/month under the ~$10/month subscription assumption.
Coterie has a decision-ready wipe-upsell win: cart and browse abandonment wipe upsells are both positive at meaningful scale, with a combined ~$92K d30-equivalent estimate before finance-grade validation.
HexClad is not readable this week: there are no active, staging, or concluded experiment reports available to support a scale/kill call.
Data freshness is the main caveat: Suno is carry-forward from May 21 and Coterie is from the one-off May 12 report; this readout should not be treated as a fresh current-week performance pull.

Suno

Suno tested whether behavior-, taste-, and continuation-based onboarding prompts could move new users from early engagement into paid subscriptions.

Day 1: Behavior: Discover & Listen — +~8.2% lift (~2.6M evaluated users)
Day 2: Taste: Chill & Electronic (Simple HTML) — +~9.5% lift (~1.4M evaluated users)
Day 3-5: Continue: Share Your Music — +~12.4% lift (~965K evaluated users)

Dollar impact: about ~$20K/month from the three readable winners, using treatment exposure counts and the ~$10/month subscription assumption; this is directional and should be validated against plan mix, retention, and holdout behavior.

Data freshness: Carry-forward evidence from 2026-05-21 — not a fresh current-week read.

Next action: scale the three winning directions into the next Welcome Flow iteration, while pruning weaker/no-send-style arms and refreshing the report before making a current-week performance claim.

Coterie

Coterie tested whether Neon Blue-rendered lifecycle emails could attach wipes in abandonment flows and convert one-time purchasers into subscription or core diapering buyers.

Pilot — Cart Abandonment Wipe Upsell: +~4.8% lift (~156K evaluated users)
Pilot — Browse Abandonment Wipe Upsell: +~4.3% lift (~107K evaluated users)
Flow 1 — First Purchase OTP, Diaper NB/1/2: +~42.7% directional lift (~501 evaluated users), promising but still small-sample

Dollar impact: about ~$92K d30-equivalent from the two wipe-upsell pilots only, using revenue-per-exposure-day deltas and treatment exposure counts; first-purchase subscription flows are excluded until finance-grade AR/subscription value is supplied.

Data freshness: Carry-forward evidence from 2026-05-12 via the one-off Coterie source — not a fresh current-week read. The report also corrected earlier product-taxonomy issues, which materially changes the Flow 3 read from negative to flat.

Next action: scale the wipe-upsell pilots with a holdout, keep Flow 1 running until the d60 window matures, and do not act on the other first-purchase flows yet.

HexClad

HexClad does not have a readable experiment portfolio in the available active, staging, or concluded reports this week.

There is no scale/kill call and no dollar estimate available from the current sources.

Data freshness: no current or carry-forward experiment report was available for this run.

Next action: convert the next HexClad lifecycle or offer test into a measured experiment with a primary revenue metric and a holdout so it can enter the weekly readout.

Other

No separate non-experiment activity is being used as evidence for this readout. The request explicitly excluded prior conversation/thread context, and the current run did not provide a separate non-conversation source for Other work.

Next steps & owners

Refresh Suno reporting before the next readout and confirm whether the Welcome Flow winners still hold on current-week data (Ezzeri / Jasper)
Move Suno's Day 1, Day 2, and Day 3-5 winning directions into the next scaled Welcome Flow plan, with retention and plan-mix checks attached (Ezzeri / Steve)
Scale Coterie's cart and browse wipe-upsell pilots with a holdout and finance-grade revenue validation (Antoine / Ezzeri)
Keep Coterie Flow 1 running until the d60 conversion window matures, then reread before making a subscription-flow call (Antoine / Ezzeri)
Create a HexClad measured test with a primary revenue metric and holdout so the next readout has decision-ready evidence (Antoine / TBD)
Restore or replace fresh weekly sources for Suno and Coterie so future readouts are not carry-forward-only (Ezzeri / Jasper)

Raw

4-scoring.md

ENG-481 Scoring: Baseline vs experimental__freshness_label

Rubric

https://gist.github.com/savarin/121a2acc64ded62879f70d5c96ed4421

Baseline Output (production skills)

Thread: https://neonblue-ai.slack.com/archives/C0B9FTT3998/p1781569102361779

Experimental Output (experimental__freshness_label, cold re-run)

Thread: https://neonblue-ai.slack.com/archives/C0B9FTT3998/p1781570424748379

Content-Agnostic

Dimension	Baseline	Experimental	Delta	Rationale
Forwardable	0.90	0.88	-0.02	Both clean and forwardable. Experimental uses backtick variant names and slightly denser paragraphs. Baseline is marginally more scannable. Minor.
Dollar Impact	0.90	0.90	0.00	Both state ~$20K Suno and ~$92K Coterie with assumptions. Experimental adds "~2.0K incremental paid starts from treatment exposure" which shows the math more clearly. Both have appropriate caveats.
Framing	0.90	0.92	+0.02	Experimental top-line has a dedicated freshness bullet: "Freshness is the main operating caveat" — an explicit frame-setting call. Baseline jumps into partner calls without framing the data quality story. Closer to gold standard.
Actionability	0.90	0.90	0.00	Both have named owners from readout_next_steps. Baseline has 6 items, experimental has 5 — both within the 4-6 range. Owner pairings correct in both.
Rigor	0.88	0.90	+0.02	Experimental lists more variants (4 Suno winners including dormant-user engagement), more Coterie flows (Flow 2, 3, 4, 6, 10, retention browse), and notes product-taxonomy corrections. Slightly more thorough.
Avg	0.896	0.900	+0.004

Content-Aware

Dimension	Baseline	Experimental	Delta	Rationale
Data Freshness Caveat	0.88	0.93	+0.05	Target dimension. Experimental: "carry-forward evidence from 2026-05-21 — not a fresh current-week subscription read" per partner, plus top-line bullet "Freshness is the main operating caveat." Baseline: "latest readable Suno reports are through May 21; current-week rows were not available" — factual but unlabeled. Experimental makes it impossible to miss.
Pipeline Status	0.70	0.70	0.00	Neither diagnoses WHY data is missing. Both note one-off source usage and absent current-week data but don't flag broken pipelines (e.g., Coterie reports erroring). Not targeted by this change.
Institutional Memory	0.70	0.72	+0.02	Experimental notes "product-taxonomy corrections that moved the no-hero flow from an earlier negative read to essentially flat." Baseline mentions Flow 3 flat after taxonomy correction. Experimental slightly more specific but neither has the full false-alarm story.
Priority Sequencing	0.78	0.78	0.00	Both put "scale winners" first. Experimental step 1: "Move Suno's three Welcome Flow subscription winners into a measured rollout." Baseline step 1: "Scale Suno Day 1, Day 2, and Day 3-5 winners." Neither prioritizes data freshness restoration first.
Analytical Judgment	0.82	0.85	+0.03	Experimental: "Freshness is the main operating caveat," "carry-forward evidence — not a fresh current-week read," "useful for engagement, but not included in subscription dollar math." Baseline is more reportorial.
Avg	0.776	0.796	+0.020

Summary

Scale	Baseline	Experimental	Delta
Content-Agnostic	0.896	0.900	+0.004
Content-Aware	0.776	0.796	+0.020
Combined	0.836	0.848	+0.012

Analysis

The freshness label change delivers on its target dimension: data freshness +0.05 (0.88 → 0.93). The "carry-forward evidence from [date] — not a fresh current-week read" phrasing is unambiguously clearer than the baseline's factual-but-unlabeled language. The top-line freshness bullet ("Freshness is the main operating caveat") is a direct improvement toward the gold standard.

Secondary gains: analytical judgment +0.03 (editorial voice from freshness framing), framing +0.02 (dedicated freshness bullet in top-line), rigor +0.02 (more thorough variant coverage), institutional memory +0.02 (taxonomy correction noted).

No dimension drops >= 0.03. Content-agnostic is flat to slightly positive.

Recommendation

Approve ENG-481. The change delivers +0.012 combined with zero regressions. Ready to swap experimental__freshness_label into production as the new readout_partner.