tsapeta/SKILL.md

Created May 4, 2026 07:18

Star (0) You must be signed in to star a gist
Fork (0) You must be signed in to fork a gist

Select an option

Learn more about clone URLs
Clone this repository at <script src="https://gist.github.com/tsapeta/85bb07a0bf6331d2fdc47e42056bd4a3.js"></script>
Save tsapeta/85bb07a0bf6331d2fdc47e42056bd4a3 to your computer and use it in GitHub Desktop.

Download ZIP

ios-instruments-trace-analysis

Raw

examples.md

examples.md

Worked example: profiling a JS-native function-call bridge

This example walks through analyzing a Time Profiler trace of two micro-benchmarks (addStrings and foldArray) that exercise an Expo modules JS↔Swift bridge. The goal is to show what each diagnostic step actually produces and how the analysis evolves.

Context

addStrings('hello ', 'world') × 500,000 iterations. The native body is a + b — one Swift string concat. So essentially everything in the profile is bridge overhead, not the work itself. Trace was ~3.4s wall-clock.

Step 1: TOC

xcrun xctrace export --input strings-fold-array.trace --toc | head -40

Output revealed:

2 runs, both Time Profiler.
Run 1: 3.4s, attached to BareExpo (the test app).
Schemas under <data>: time-sample, time-profile, kdebug-signpost, etc.

Step 2: Wrong export (instructive)

First instinct was to export time-sample (the per-millisecond samples). Result: 986 rows, but every backtrace was a <kperf-bt> element with raw text-addresses="4342336580 4342304496 6446776812 ...". No symbol names. Useless.

Step 2 (correct): time-profile

xcrun xctrace export --input strings-fold-array.trace \
  --xpath '/trace-toc/run[@number="1"]/data/table[@schema="time-profile"]' \
  --output /tmp/run1-tp.xml

963 rows, each with <tagged-backtrace> and resolved <frame name="...">. This is the table to analyze.

Step 3: Refs

The first parse attempt without a ref-resolver returned 0 useful rows — every <thread>, <backtrace>, and most <frame> elements were references like <thread ref="32"/> to entries earlier in the file. Adding a resolver fixed it:

id_map = {el.get('id'): el for el in root.iter() if 'id' in el.attrib}
def resolve(el):
    if el is None: return None
    if 'ref' in el.attrib:
        return id_map.get(el.get('ref'))
    return el

Step 4: Per-thread distribution

Per thread:
   841  com.facebook.react.runtime.JavaScript (...)
    98  Main Thread (...)
    22  BareExpo.thread.0 (...)
     2  com.apple.uikit.eventfetch-thread (...)

84% of samples on the JS thread. That's the workload. Filter to that thread for everything that follows.

Step 5: Top-of-stack (sanity)

  40   4.8%  swift::RefCounts<...>::doDecrementSlow<(swift::PerformDeinit)1>(...)
  37   4.4%  hermes::vm::Interpreter::interpretFunction<false, false>(...)
  24   2.9%  swift_weakLoadStrong
  24   2.9%  _xzm_free
  19   2.3%  String._slowFromCodeUnits<A, B>(_:encoding:repair:)
  19   2.3%  static String._slowFromCodeUnits<A, B>(_:encoding:repair:)
  18   2.1%  swift_release
  ...

The leaders are all low-level helpers. Reporting "doDecrementSlow is 4.8%" tells the user nothing useful. Need to bucket.

Step 6: Categorization function

The boundary frame for this codebase is SyncFunctionDefinition.runBody (where the bridge calls user code). Frames above it are "the work". Categories were chosen by reading the actual top stacks for addStrings-shaped work:

def categorize(frames_top_to_bottom):
    above_sync = []
    for f in frames_top_to_bottom:
        if 'SyncFunctionDefinition' in f or 'createFunctionClosure' in f:
            break
        above_sync.append(f)
    text = ' || '.join(above_sync)

    if 'DynamicStringType.cast' in text or 'MainValueConverter.toNative' in text:
        if '_allASCII' in text or '_slowFromCodeUnits' in text or 'parseScalar' in text or 'UTF16' in text:
            return 'Arg-conv: jsi::String -> Swift String (UTF16 decode)'
        if 'JavaScriptValue.getString' in text:
            return 'Arg-conv: JavaScriptValue.getString'
        if 'utf16(' in text:
            return 'Arg-conv: jsi::String::utf16'
        return 'Arg-conv: DynamicStringType.cast (other)'
    if 'JavaScriptValuesBuffer' in text:
        return 'Arg-conv: JavaScriptValuesBuffer access'
    if 'IndexingIterator' in text or 'formIndex(after:)' in text or 'getTupleTypeMetadata' in text:
        return 'Arg-conv: arg loop / tuple metadata'
    if 'tryCast' in text or 'swift_dynamicCast' in text:
        return 'Arg-conv: swift_dynamicCast'
    if 'doDecrementSlow' in text or 'swift_release' in text or '_xzm_free' in text or '_free' in text:
        return 'Swift ARC: release / dealloc'
    if 'swift_retain' in text or 'swift_bridgeObjectRetain' in text:
        return 'Swift ARC: retain / weak'
    if 'swift_weakLoadStrong' in text:
        return 'Swift weak load'
    if 'swift_isClassOrObjCExistentialType' in text or 'MetadataCacheKey' in text or 'pthread_getspecific' in text:
        return 'Swift runtime overhead (metadata/exclusivity/TLS)'
    if '_xzm_xzone_malloc' in text or 'malloc' in text:
        return 'Allocator: malloc'
    if 'Interpreter::interpretFunction' in text:
        return 'Hermes: interpretFunction (JS-side dispatch)'
    if not above_sync:
        return 'Inside SyncFunctionDefinition.runBody (no frame above)'
    return f'Other: {above_sync[0][:80]}'

Iteration matters. The first version of categorize() had ~5 categories. Running it produced too many Other: ... samples — meaning samples weren't being grouped meaningfully. Each pass added more pattern matches. Final version above is the one that produced actionable output.

Step 6 output

=== Time breakdown by category ===
   139  16.5%  Arg-conv: jsi::String -> Swift String (UTF16 decode)
   118  14.0%  Arg-conv: DynamicStringType.cast (other)
    82   9.8%  Swift ARC: release / dealloc
    74   8.8%  Arg-conv: arg loop / tuple metadata
    60   7.1%  Swift runtime overhead (metadata/exclusivity/TLS)
    31   3.7%  Arg-conv: JavaScriptValuesBuffer access
    20   2.4%  Allocator: malloc
    20   2.4%  Arg-conv: swift_dynamicCast
    19   2.3%  Arg-conv: JavaScriptValue.getString
    18   2.1%  Swift ARC: retain / weak
    11   1.3%  Return-conv: JavaScriptValue.asJSIValue
    ...

This was the moment the analysis became useful. Roughly 39% of time is spent in JS-string → Swift-String conversion; another 12% is ARC churn around the temporaries; the actual a + b doesn't show up. The fix candidates are clear.

Step 7: Drill into the biggest bucket

For 'Arg-conv: jsi::String -> Swift String (UTF16 decode)':

[19 samples]
  String._slowFromCodeUnits<A, B>(_:encoding:repair:)
  String._fromCodeUnits<A, B>(_:encoding:repair:)
  String.init(_:)
  DynamicStringType.cast(jsValue:appContext:)
  protocol witness for AnyDynamicType.cast(jsValue:appContext:) in conformance DynamicStringType
  MainValueConverter.toNative
  SyncFunctionDefinition.runBody

The exact source location is DynamicStringType.cast(jsValue:) calling String(jsiString.utf16(runtime)). Reading that file confirmed: it goes through _slowFromCodeUnits (the always-validating UTF-16 path). Switching to String(jsiString.utf8(runtime)) uses Swift's UTF-8 fast path.

That fix alone dropped the benchmark by ~15%.

Step 8: After the fix (re-traced)

After the encoding switch:

=== Time breakdown by category ===
   132  18.3%  Outside benchmark loop
   121  16.8%  Arg-conv: DynamicStringType.cast (other)
   111  15.4%  Arg-conv: arg loop / tuple metadata
    54   7.5%  Swift ARC: release / dealloc / jsi::Value dtor
    54   7.5%  Swift runtime overhead (metadata/exclusivity/TLS)
    37   5.1%  Arg-conv: BorrowedValue.getString
    ...
    11   1.5%  Arg-conv: jsi::String -> Swift String (decode)

Notes worth catching:

The 16.5% UTF-16 bucket dropped to 1.5%. The fix worked.
"DynamicStringType.cast (other)" went from 14% → 16.8% — not because it got slower, but because the denominator shrank. Always verify by checking absolute sample counts: 121 vs the original 118 — basically unchanged. The bucket is the same absolute size; it just owns more of a smaller pie.
"arg loop / tuple metadata" went from 8.8% → 15.4% — same effect, plus genuinely surfacing as the next-biggest fish.

This is why you re-trace after every change. The order of fixes is dictated by the current profile, not the old one.

A different shape: foldArray

Same skill, different workload. foldArray([1...10]) × 500k iterations exercises array decoding (10 elements per call × 500k = 5M element conversions).

The same categorize() function, with array-specific buckets added:

if 'DynamicArrayType' in text:
    if 'JavaScriptArray' in text or 'getValue' in text or 'isArray' in text:
        return 'Array: JS array iteration (length/getValue)'
    if 'DynamicNumberType' in text:
        return 'Array: per-element DynamicNumberType.cast'
    if 'Collection.map' in text or 'reserveCapacity' in text:
        return 'Array: build [Any] result'
    return 'Array: DynamicArrayType.cast (other)'

Result:

   944  43.3%  Array: JS array iteration (length/getValue)
   252  11.6%  Array: build [Any] result
   204   9.4%  Generic iteration / tuple metadata
   156   7.2%  swift_dynamicCast
   130   6.0%  Array: per-element DynamicNumberType.cast

One bucket is 43.3% of total time. Drilling in:

[114 samples]
  swift_weakLoadStrong
  JavaScriptArray.length.getter
  closure #1 in JavaScriptArray.map<A>(_:)
  ...

[75 samples]
  swift_weakLoadStrong
  JavaScriptArray.getValue(at:)
  closure #1 in JavaScriptArray.map<A>(_:)
  ...

Source inspection of JavaScriptArray.getValue(at:) showed two weak-runtime loads per call (one for guard let runtime, one inside the bounds-check length getter). Multiplied by 5M element accesses = 10M weak loads. Adding a non-bounds-checked internal accessor (getValueUnchecked(at:in:)) and hoisting runtime and length out of the loop body collapsed most of that bucket. The benchmark dropped 28%.

Lessons

Top-of-stack is a sanity check, not the answer. Always categorize by stack signature.
The categorization function is the artifact. Spend time refining it. Treat it as the analysis output that's worth reusing across traces of the same codebase.
Drill before recommending. A bucket name says "where", a stack signature says "exactly what to read in source".
Re-trace after changes. Bucket percentages shift; the next-biggest fish moves; what was 5% might now be 15%.
Distinguish % shift from absolute change. Surviving buckets get bigger as the total shrinks even when their work doesn't change. Note absolute sample counts when comparing before/after.
Source-level fixes follow from stack signatures. "doDecrementSlow is 5%" is not actionable. "doDecrementSlow on a JavaScriptValue.deinit called per element from JavaScriptArray.map" is.

Raw

SKILL.md

SKILL.md

name: ios-instruments-trace-analysis description: Analyze an existing Xcode Instruments .trace file (Time Profiler) by exporting its XML and aggregating samples into actionable categories. Use when the user provides a `.trace` path and asks "what's slow", "where's the time going", "analyze this trace", or wants follow-up profiling after applying a perf fix. allowed-tools: Bash, Read, Write

Analyzing Xcode Instruments traces

This skill is for reading a trace file the user already has, not recording one. The output is a prioritized list of where CPU time is actually going, grounded in the trace data — not generic guesses.

When to invoke

User shares a .trace path and asks for analysis ("analyze this trace", "what's slow", "where's the time going", "saved a new trace, take another look").
User wants to compare before/after traces around a perf change.

Workflow

1. List the runs (table of contents)

A .trace bundle can hold multiple runs (each "Stop" + "Record" creates one). Always start by listing them so you know what's in scope.

xcrun xctrace export --input "<path>.trace" --toc 2>&1 | head -60

Look for: <run number>, <duration>, <template-name>, the target process name, and the schemas under <data>. If the user only mentioned one scenario but the file has multiple runs, ask which run to analyze (or do them in order).

2. Export the right table

The trace has two CPU sampling tables. They are not equivalent:

Schema	What it has	Use for analysis?
`time-sample`	Raw 1ms samples with `kperf-bt` (unsymbolicated text-addresses + register state)	❌ No — frames are raw addresses
`time-profile`	Aggregated samples with `tagged-backtrace` (symbolicated frame names)	✅ Yes — this is what you want

xcrun xctrace export --input "<path>.trace" \
  --xpath '/trace-toc/run[@number="1"]/data/table[@schema="time-profile"]' \
  --output /tmp/run1-tp.xml 2>&1 | tail -2

The output XML is normally 1-2 MB. Don't try to read it directly — use Python (or a one-shot script) to parse and aggregate.

3. Resolve XML refs

xctrace XML uses ID/ref deduplication aggressively. Every element has an id, and subsequent occurrences appear as <element ref="N"/>. You must build an id-map and resolve refs, otherwise you'll see almost no data.

Standard prelude (use this every time):

import xml.etree.ElementTree as ET
from collections import Counter

tree = ET.parse('/tmp/run1-tp.xml')
root = tree.getroot()
id_map = {el.get('id'): el for el in root.iter() if 'id' in el.attrib}

def resolve(el):
    if el is None:
        return None
    if 'ref' in el.attrib:
        return id_map.get(el.get('ref'))
    return el

rows = root.findall('.//row')

Each <row> represents one sample. Inside: <thread>, <weight>, <tagged-backtrace> → <backtrace> → <frame> (top-of-stack first). Apply resolve() to each before reading attributes.

4. Filter to the thread of interest

Most apps have lots of background threads (audio, networking, GC). The hot work is usually on one thread. Identify it from the per-thread sample count, then filter all subsequent analysis to that thread.

per_thread = Counter()
for row in rows:
    thr = resolve(row.find('thread'))
    tname = thr.get('fmt', 'unknown') if thr is not None else 'unknown'
    per_thread[tname] += 1

print("Per thread:")
for k, v in per_thread.most_common(8):
    print(f"  {v:5d}  {k}")

Common candidates: Main Thread, com.facebook.react.runtime.JavaScript, <app>.thread.<n>. Pick the one with the most samples that's relevant to the user's question.

5. Top-of-stack first (sanity check)

Before bucketing, dump the top 20-30 innermost frames by count. This is a sanity check: if the dominant frame is something obviously wrong (e.g. you're profiling main thread instead of the worker, or sampling outside the workload), you'll see it here.

top_frames = Counter()
for row in rows:
    thr = resolve(row.find('thread'))
    if 'JavaScript' not in (thr.get('fmt', '') if thr is not None else ''):
        continue
    bt_tag = resolve(row.find('tagged-backtrace'))
    if bt_tag is None: continue
    bt = resolve(bt_tag.find('backtrace'))
    if bt is None: continue
    frames = bt.findall('frame')
    if not frames: continue
    top = resolve(frames[0])
    top_frames[top.get('name', '?')] += 1

for n, c in top_frames.most_common(25):
    print(f"  {c:4d}  {c/total*100:5.1f}%  {n[:140]}")

Top-of-stack only tells you what function was running, not why. The bucketing step (next) is where the analytical work happens.

6. Bucket by stack signature (the analytical step)

Top-of-stack is misleading: low-level helpers (ARC release, weak load, malloc, memcpy) appear at the top of many unrelated stacks. To answer "what's the work?", classify each sample by what's happening above a known boundary in the stack.

The boundary is application-specific:

For a JS-native bridge: stop at SyncFunctionDefinition.runBody or equivalent
For UI work: stop at CA::Layer::display_ or UIView.draw
For a worker pool: stop at the dispatch entry point

Frames above (closer to top-of-stack than) the boundary are "the work". Build a categorize() function that pattern-matches on those frames and returns a bucket name. Iterate, count, sort.

def categorize(frames_top_to_bottom):
    above_boundary = []
    for f in frames_top_to_bottom:
        if 'SyncFunctionDefinition' in f or 'createFunctionClosure' in f:
            break
        above_boundary.append(f)
    text = ' || '.join(above_boundary)

    # Group by what kind of work is happening — be generous with patterns
    if 'String._slowFromCodeUnits' in text or 'parseScalar' in text:
        return 'String UTF-16 decode'
    if 'doDecrementSlow' in text or 'swift_release' in text or '_swift_release_dealloc' in text:
        return 'Swift ARC: release / dealloc'
    if 'swift_weakLoadStrong' in text or 'objc_loadWeakRetained' in text:
        return 'Swift weak load'
    # ... more patterns
    if not above_boundary:
        return 'Inside boundary frame (no work above)'
    return f'Other: {above_boundary[0][:80]}'

buckets = Counter()
for row in rows:
    # ... thread filter, resolve backtrace ...
    names = [resolve(f).get('name', '?') for f in frames]
    buckets[categorize(names)] += 1

for cat, c in buckets.most_common():
    print(f"  {c:4d}  {c/total*100:5.1f}%  {cat}")

Important: the categorize() function is the analytical artifact. Iterate on it. When a bucket says Other: <some frame> and is large, add a more specific case. When two buckets feel like they should be one, merge them. The goal is buckets that map to fixable problems, not arbitrary clusters.

7. Drill into the biggest bucket

Once a bucket is >10%, expand it to see the actual stack signatures inside. This is what tells you exactly what to fix.

sig_counter = Counter()
for row in rows:
    # ... same filter as before ...
    names = [resolve(f).get('name', '?') for f in frames]
    if categorize(names) != 'Swift weak load':  # the bucket you're drilling into
        continue
    sig_counter[' / '.join(names[:7])] += 1

for sig, c in sig_counter.most_common(10):
    print(f"\n  [{c} samples]")
    for f in sig.split(' / '):
        print(f"    {f[:140]}")

The 7-deep stack signature usually surfaces the exact source-level location to look at. Read the implicated source file before suggesting a fix.

8. Report findings

Surface to the user:

Total samples on the relevant thread, % utilization (= samples / wall-clock-seconds / 1000 for 1ms sampling).
Bucket breakdown sorted by %, grouped meaningfully.
The single biggest fish: what it is, where it lives in source, why it's expensive.
Concrete fix candidates, ordered by ROI (small fix that buys big % > big fix that buys small %).

Don't stop at "X is slow". Always trace down to "X is slow because line L of file F does Y; fix Z would eliminate it because…".

9. After a fix lands, re-trace

When the user reports a benchmark drop after applying a fix, always offer to re-analyze a fresh trace. The bucket order shifts — what was 5% might now be 15% as the dominant cost gets removed and the denominator shrinks. The next move should be driven by the new profile, not the old one.

Bucket-percentage shifts can be misleading when interpreted naively: a bucket going from 8% → 15% doesn't mean it got slower, it means it now occupies more of a smaller pie. Always note this when comparing before/after.

Critical pitfalls

❌ Reading time-sample instead of time-profile — the kperf-bt addresses won't symbolicate via xctrace, you'll see no useful data.

❌ Skipping ref resolution — XML uses ID/ref aggressively; without resolve() you'll see ~5% of the actual data.

❌ Reporting top-of-stack as the conclusion — swift_weakLoadStrong at 10% doesn't tell you what to fix. The stack above it does.

❌ Comparing pre/post bucket percentages directly — if total time dropped, surviving buckets get larger as a percentage even though their absolute work didn't change. Compare absolute sample counts × sample interval, or note the wall-clock change separately.

❌ Reading the XML with the Read tool — files are 1-2 MB and full of refs; humans (and Claude) parse it badly. Always use Python.

❌ Trying to use xctrace export --output for non-XML formats — there's no other format. Just process the XML.

Examples

See examples.md for a complete worked analysis from a real session, including the categorization function for a JS-native bridge and what each diagnostic step looked like.