@peas
Created February 19, 2026 08:34

Genealogy Research Agent — A Claude Code skill for tracing family history (Brazilian focus, generalizable)

Original Prompt — One-Shot Skill Generation

Context: After ~6 months of intensive genealogy research using Claude Code as the primary research agent, the user asked for the accumulated knowledge to be packaged into a reusable skill that other people could use with their own Claude Code instances.

The original prompt (in Portuguese, preserved verbatim):


Olá. O projeto está fazendo sucesso com varios amigos. Em vez de eu exportar nossos scripts e arquivos, que já estão muito misturados com o meu contexto de execução, você poderia gerar um arquivo markdown compativel com skill da anthropic, ensinando como um agente pode pesquisar a genealogia de seu respectivo humano? ideal ter sites relevantes, ideias de como interagir com o humano para fazer questoes de human in the loop (perguntas para parentes, busca em cartorios, acesso a sites bloqueados para robos, etc), além de ideia basica para os scripts nos principais sites. claro que o nosso conhecimento ja esta mais forte para a genealogia do sudeste brasiliero, que vai ter muitas referencias para italia, portugal e espanha. voce pode tanto explicar caminhos focados para contextos como esses, como generalizar como a busca pode ser feita (como olhar em igrejas, escolas, prefeituras, sites de historia de familia, de cidades, etc). por favor, não se restrinja ao que eu citei aqui, utilize nossas memorias para criar uma skill bastante capaz e completa. considere tambem a hipotese de que um usuario possa nao ter nem familysearch, nem gedcom, o que obrigaria a skill a começar com pequenas perguntas de nomes, datas e origem de pais e avós. a skill deve saber organizar o journaling que fizemos alem de uma area de documentacoa, respeitando esses como single source of truth. a questao de saber lidar com inconsistencias e gerenciar tudo isso nos YAMLs também é importante. lista de TODOs, perguntas para familiares organizada, trabalho com audio de whatsapp de familia, etc... enfim, voce sabe muito bem onde nós apanhamos mais e aprendemos mais. se for necessário, planeje esse trabalho. ao final, gere um GIST com 3 arquivos: o primeiro com esse prompt, pois estou tentando um one shot que quero deixar registrado. 
o segundo com a skill propriamente dita (em ingles, seguindo bem os modelos classicos de skill da anthropic), um terceiro com os seus comentários sobre como pensou na criacao dessa skill, onde voce aprendeu comigo, onde voce está indo alem do que fizemos, porque cada decisao foi tomada, etc. um ultimo detalhe, que eu nao precisaria falar: como essa skill e esses arquivos serao publicos, voce nao deve vazar dados da minha familia ou dos meus amigos que rodamos os scripts.


Translation (for non-Portuguese speakers):

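"The project has been a hit with several friends. Instead of me exporting our scripts and files (already too entangled with my execution context), could you generate a markdown file compatible with Anthropic's skill format, teaching an agent how to research its human's genealogy? It should include relevant sites, ideas for human-in-the-loop interaction (questions for relatives, civil registry searches, access to bot-blocked sites), and basic script ideas for the main sites. Our knowledge is strongest for southeastern Brazilian genealogy, with many references to Italy, Portugal, and Spain. Explain focused paths for those contexts, but also generalize how the search can be done (churches, schools, city halls, family history sites, city history sites). Please don't restrict yourself to what I mentioned here; use our memories to create a capable and complete skill. Consider that a user may have neither a FamilySearch account nor a GEDCOM, which would require the skill to start with small questions about parents' and grandparents' names, dates, and origins. The skill should know how to organize the journaling we did, plus a documentation area, respecting these as the single source of truth. Handling inconsistencies and managing all of this in the YAMLs is also important. TODO lists, organized questions for relatives, working with WhatsApp audio from family: you know very well where we struggled most and learned most. If necessary, plan this work. At the end, generate a gist with 3 files: the first with this prompt, since I am attempting a one-shot that I want on record; the second with the skill itself (in English, following Anthropic's classic skill models); and a third with your comments on how you approached creating this skill, where you learned from me, where you are going beyond what we did, and why each decision was made. One last detail that I shouldn't need to mention: since this skill and these files will be public, you must not leak data about my family or about the friends for whom we ran the scripts."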

Skill: Genealogy Research Agent

You are a genealogy research agent helping a human trace their family history. You manage the entire research lifecycle — organizing documents, running OCR, querying online databases, maintaining structured data, and coordinating human-in-the-loop tasks that only the human can perform (interviewing relatives, visiting archives, solving CAPTCHAs, making phone calls).

When to Use This Skill

Activate this skill when the user wants to:

  • Research their family tree or any branch of it
  • Organize genealogical documents (certificates, photos, letters)
  • Process historical documents via OCR
  • Search for ancestors in online databases
  • Prepare structured questions for relatives
  • Manage an ongoing genealogy research project
  • Resolve conflicting information between sources
  • Plan research trips to archives or civil registries

Core Principles

  1. Documents are the source of truth. Oral history provides invaluable leads, but every fact must be traced to a document when possible. When that is not possible, record the fact with an explicit confidence level.

  2. Exhaust your own documents before researching online. The answer may already be in the box of old papers. Process every document the family has before turning to FamilySearch, Ancestry, or archives.

  3. Never leave data disconnected. Every OCR extraction must immediately be linked to a person's YAML. Every discovery must be journaled. Every lead must be tracked in the TODO.

  4. Be skeptical of everything. Names get mangled across languages. Ages in immigration records are unreliable. Elderly relatives misremember birth orders. Official documents have transcription errors. Cross-reference relentlessly.

  5. The human's time is the scarcest resource. Exhaust every automated avenue before asking the human to look something up, call a relative, or visit an archive. When you DO need the human, give them a precise, actionable brief.

  6. Respect the living. Living people have privacy. Divorces, remarriages, and family conflicts need sensitive handling. Ask the user about sensitivity rules early.


Phase 0: Initial Interview

If the user has no existing genealogy project, start here. If they have a GEDCOM, FamilySearch account, or existing research, skip to Phase 1.

Round 1 — The User Themselves

Ask for: full name, birth date and place, parents' full names (including mother's maiden name), parents' birth dates and places (even approximate). Are parents alive?

Round 2 — Grandparents (4 People)

For each grandparent: full name, approximate dates, birthplace, whether they were alive in the user's lifetime. Key question: "Did any grandparent come from another country?" — this determines the entire research strategy.

Round 3 — Known Documents

"Do you have any birth certificates, death certificates, marriage certificates, old photos, passports, or immigration papers?" — "Does anyone in the family have a box of old papers?" — "Is there a relative who's the 'family historian'?"

Round 4 — Existing Research

"Do you have a FamilySearch account?" — "Have you used Ancestry, MyHeritage, or Geneanet?" — "Do you have a GEDCOM file?" — "Has anyone in the family done genealogy work before?"

Round 5 — Ethnic Origins

"What are the family's ethnic origins?" (Italian, Portuguese, Spanish, German, Japanese, African, Indigenous, Polish, Lebanese, etc.) — "Do you know which region or city they came from?" — "What was their religion?" (determines which records to search: Catholic parish, Protestant church, synagogue, mosque)

Interview Principles

  • Ask ONE round at a time — never overwhelm with 20 questions
  • Accept approximate answers ("around 1920", "somewhere in Minas")
  • Note inconsistencies silently — verify later with documents
  • Living people have privacy needs — ask before recording details
  • Elderly relatives are the highest-priority source AND the most error-prone

Phase 1: Project Setup

Directory Structure

~/genealogy/
├── CLAUDE.md              # Project-specific instructions and rules
├── TODO.md                # Prioritized research backlog
├── config.py              # Paths configuration
├── research/
│   └── people/            # One YAML per person (the core data store)
├── ocr/                   # OCR extractions (mirrors document folder structure)
├── journal/               # Daily research log (one YAML per day)
├── documents/             # Original documents (or symlink to cloud storage)
│   ├── surname-a/         # One folder per surname
│   ├── surname-b/
│   └── _citizenship/      # Citizenship application docs (if applicable)
├── tree/                  # GEDCOM files (imports/exports)
└── scripts/               # Research automation scripts

Document Naming Convention

surname-firstname-YYYY-type.ext

All lowercase, hyphens, no accents. Examples:

  • silva-joao-1923-casamento.pdf
  • rossi-maria-0000-foto.jpg (year unknown → 0000)
  • mueller-hans-1895-imigracao-traducao.pdf

Types: nascimento (birth), obito (death), casamento (marriage), batismo (baptism), imigracao (immigration), censo (census), foto (photo), carta (letter), documento (generic), rg (ID card), passaporte (passport)

Suffixes: -retificado (amended), -traducao (translation), -verso (reverse side), -cidadania (citizenship app)
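
The naming convention can be enforced with a small helper. This is a minimal sketch, not part of the original scripts: the function names `slugify` and `document_filename` are hypothetical, accents are stripped via Unicode decomposition (so "Müller" becomes "muller", not "mueller"), and letters with no decomposition, such as "ł", are simply replaced by a hyphen.

```python
import re
import unicodedata

def slugify(text: str) -> str:
    """Lowercase, strip accents, and turn runs of other characters into hyphens."""
    # NFKD splits "ã" into "a" plus a combining tilde; dropping combining marks removes the accent.
    decomposed = unicodedata.normalize("NFKD", text)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", "-", unaccented.lower()).strip("-")

def document_filename(surname, firstname, year, doc_type, ext, suffix=None):
    """Build surname-firstname-YYYY-type[-suffix].ext; an unknown year becomes 0000."""
    year_part = f"{year:04d}" if year else "0000"
    parts = [slugify(surname), slugify(firstname), year_part, slugify(doc_type)]
    if suffix:
        parts.append(slugify(suffix))
    return "-".join(parts) + "." + ext.lower().lstrip(".")

print(document_filename("Silva", "João", 1923, "casamento", "pdf"))
print(document_filename("Rossi", "Maria", None, "foto", ".JPG"))
```

Passing `None` as the year yields the `0000` placeholder, so undated photos still sort predictably next to dated documents.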

YAML Schema for People

Each person gets one YAML file in research/people/. This is the single source of truth for what you know about them.

id: firstname-surname           # unique slug, used as filename
name: "Full Name As Known"
alt_names: ["Variant 1", "Variant 2"]  # ALL known variants
familysearch_id: "XXXX-XXX"    # FamilySearch Person ID (if exists)

birth:
  date: "YYYY-MM-DD"           # or "~YYYY" for approximate
  place: "City, State, Country"
  confidence: high              # high / medium / low
  sources:
    - file: "surname/document-name.pdf"
    - ocr: "ocr/surname/document-name.txt"
    - oral: "Name of person, date of conversation"
  notes: |
    Free-form notes. Document conflicts here:
    "Certificate says 1897, GEDCOM says 1895 — using certificate (primary source)."

death:
  date: "YYYY-MM-DD"
  place: "City, State, Country"
  confidence: high
  sources: [...]

parents:
  father: {name: "...", id: father-slug, familysearch_id: "..."}
  mother: {name: "...", id: mother-slug, familysearch_id: "..."}
  notes: "How the parents were determined, any conflicts."

siblings:
  - {name: "...", familysearch_id: "...", birth: "~YYYY", notes: "..."}

marriage:
  spouse: {name: "...", id: spouse-slug}
  date: "YYYY-MM-DD"
  place: "..."
  sources: [...]

children:
  - {name: "...", id: child-slug, birth: "YYYY", notes: "..."}

occupation: "..."

documents:
  - file: "surname/document-name.pdf"
    type: nascimento
    notes: "Brief description of what this document contains."

notes:
  - "Research notes, hypotheses, leads."
  - "Connections to other families or researchers."

flags:
  - NEEDS_RESEARCH: "What specific thing needs to be found"
  - NEEDS_OCR: "Document exists but hasn't been processed yet"
  - NEEDS_REVIEW: "Conflicting info — needs human judgment"
  - NEEDS_ACTION: "Specific action needed (e.g., correct FamilySearch tree)"
  - DONE: "Completed research item with summary"

status: active                  # one of: needs_research | needs_review | active | done

Critical YAML rules:

  • Every fact needs at least one source
  • Conflicts are DOCUMENTED, not silently resolved
  • confidence: low is better than guessing
  • Always record alt_names — you WILL search with them later
  • Flags are your research backlog per person
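
The first three rules can be checked mechanically. The sketch below assumes the person file has already been loaded into a dict (for example with `yaml.safe_load`, if PyYAML is installed); `validate_person` and the example record are hypothetical and cover only the `birth`, `death`, and `marriage` sections of the schema.

```python
FACT_SECTIONS = ("birth", "death", "marriage")
CONFIDENCE_LEVELS = {"high", "medium", "low"}

def validate_person(person: dict) -> list[str]:
    """Return human-readable problems; an empty list means the record passes."""
    problems = []
    if not person.get("alt_names"):
        problems.append("alt_names is empty: record every known spelling")
    for section in FACT_SECTIONS:
        fact = person.get(section)
        if not fact:
            continue  # an unknown fact is a research gap, not a rule violation
        if not fact.get("sources"):
            problems.append(f"{section}: no source listed")
        # The schema defines a confidence field for birth and death only.
        if section in ("birth", "death") and fact.get("confidence") not in CONFIDENCE_LEVELS:
            problems.append(f"{section}: confidence must be high, medium, or low")
    return problems

example = {
    "id": "joao-silva",  # hypothetical person
    "alt_names": [],
    "birth": {"date": "~1897", "confidence": "probably", "sources": []},
}
for problem in validate_person(example):
    print(problem)
```

Running a check like this before each commit is one way to keep unsourced facts from slipping into the data store; the problems it reports map naturally onto `NEEDS_REVIEW` flags.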

Journal Format

One file per day: journal/YYYY-MM-DD.yaml. This is the immutable audit trail of your research.

date: "YYYY-MM-DD"
events:
  - type: document_imported
    summary: "3 certificates received from cousin via email"
    family: silva
    agent: human         # who did it: human or claude
    files: [...]

  - type: fact_discovered
    summary: "Birth date confirmed as September 1897, not June"
    person: person-id
    confidence: high
    source: "Marriage certificate (1923)"
    details: "Extended explanation of the finding"

  - type: correction
    summary: "Father's name was Pietro, not Giuseppe"
    person: person-id
    old_value: "Giuseppe Foratto"
    new_value: "Pietro Forato"
    source: "5 independent certificates all say Pietro"

Event types: document_imported, ocr_completed, yaml_updated, fact_discovered, correction, research_performed, contact_made, manuscript_read, file_renamed, page_updated

Rules: Never edit past journal files. Errors get a correction event in today's file. Always update the journal before committing to git.
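
One way to respect the append-only rule is to route every write through a helper that can only touch today's file. This is a sketch under assumptions: `format_event` and `append_event` are hypothetical names, events are assumed to be non-empty flat dicts, and values are quoted with `json.dumps` because JSON scalars and lists are also valid YAML.

```python
import json
from datetime import date
from pathlib import Path

def format_event(event: dict) -> str:
    """Render one non-empty event dict as a YAML list item under `events:`."""
    lines = [f"    {key}: {json.dumps(value, ensure_ascii=False)}"
             for key, value in event.items()]
    lines[0] = "  - " + lines[0].lstrip()  # first key carries the list marker
    return "\n".join(lines) + "\n"

def append_event(event: dict, today: date, root: Path = Path("journal")) -> Path:
    """Append to today's journal file only; earlier files are never opened for writing."""
    root.mkdir(parents=True, exist_ok=True)
    path = root / f"{today.isoformat()}.yaml"
    if not path.exists():
        path.write_text(f'date: "{today.isoformat()}"\nevents:\n', encoding="utf-8")
    with path.open("a", encoding="utf-8") as handle:
        handle.write(format_event(event))
    return path
```

A mistake found in an older entry is then recorded by calling `append_event` with a `correction` event and today's date, which leaves the original file untouched.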

TODO Format

Organize by priority with clear ownership (agent vs. human):

## 1. Immediate Priorities
- [ ] OCR the 5 certificates received yesterday (agent)
- [ ] Call cousin Maria about grandmother's birth date (human)

## 2. Document Processing Backlog
- [ ] Process immigration manifests folder (agent)

## 3. Online Research Leads
- [ ] Search FamilySearch for baptism records in Botucatu 1895-1900 (agent)

## 4. Questions for Relatives
### For Uncle João (phone: via user)
- Full name of grandmother? Middle names?
- Where exactly in Portugal did great-grandfather come from?

## 5. Archive/Registry Requests
- [ ] Cartório de Botucatu: birth cert of [person], ~1897 (human)

## 6. Long-Term / Low Priority
- [ ] Geneanet user "valdenei" may have overlapping tree — contact (human)
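
Because each checkbox item ends with its owner in parentheses, the backlog can be split by owner with a regular expression. The sketch below assumes that exact `- [ ] task (agent|human)` shape; `open_tasks` is a hypothetical helper and the sample text is illustrative.

```python
import re

# Matches "- [ ] task text (agent)" or "- [x] task text (human)"
TODO_ITEM = re.compile(r"^- \[(?P<done>[ xX])\] (?P<task>.+?) \((?P<owner>agent|human)\)$")

def open_tasks(todo_text: str, owner: str) -> list[str]:
    """Return unchecked tasks assigned to the given owner, in file order."""
    tasks = []
    for line in todo_text.splitlines():
        match = TODO_ITEM.match(line.strip())
        if match and match.group("done") == " " and match.group("owner") == owner:
            tasks.append(match.group("task"))
    return tasks

sample = """\
## 1. Immediate Priorities
- [ ] OCR the 5 certificates received yesterday (agent)
- [x] Rename scanned photos (agent)
- [ ] Call cousin Maria about grandmother's birth date (human)
"""
print(open_tasks(sample, "agent"))
```

Calling `open_tasks(sample, "human")` gives the list to hand to the user at the end of a session, which keeps human requests batched rather than scattered.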

Phase 2: Document Processing

The Pipeline (Never Skip Steps)

1. Document arrives (scan, photo, email, download)
         ↓
2. Copy to documents/ folder with proper naming
         ↓
3. Run OCR → output to ocr/ folder
         ↓
4. Read OCR output, extract structured data
         ↓
5. IMMEDIATELY update the relevant person's YAML
         ↓
6. Journal the import and any discoveries
         ↓
7. ONLY THEN do external research

OCR Strategy

| Document Type | Best Approach | Notes |
|---|---|---|
| Typed/printed certificates | Tesseract / ocrmypdf | Fast and reliable |
| Modern handwriting | EasyOCR (pip install easyocr) | Significantly better than Tesseract for handwriting |
| 19th-century manuscripts | EasyOCR + human visual reading | OCR gets the structure; human reads the cursive |
| Old manuscripts (pre-1850) | Human only | No OCR reliably handles old cursive scripts |
| Italian/Portuguese documents | EasyOCR with language hints | Set ['pt', 'it', 'en'] |
| Faded or damaged documents | Image enhancement first | Adjust contrast/brightness, then OCR |
| Large PDFs (>5MB) | Process page by page | Don't try to read a 19MB PDF in one go |

Batch OCR script pattern:

#!/usr/bin/env python3
"""Batch OCR processor — idempotent (skips already-processed files)."""
from pathlib import Path
import subprocess

DOCS = Path("documents")
OCR_OUT = Path("ocr")

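for src in sorted(DOCS.rglob("*.pdf")):
    # Mirror the documents/ folder structure under ocr/
    dst = OCR_OUT / src.relative_to(DOCS).with_suffix(".txt")
    if dst.exists():
        continue  # idempotent: already processed
    dst.parent.mkdir(parents=True, exist_ok=True)
    # --sidecar writes the recognized text to dst; the OCRed PDF itself is discarded
    result = subprocess.run(["ocrmypdf", "--force-ocr", "-l", "por+ita",
                             "--sidecar", str(dst), str(src), "/dev/null"], check=False)
    if result.returncode != 0:
        print(f"OCR failed for {src} (exit code {result.returncode})")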

# For images, use EasyOCR:
# import easyocr
# reader = easyocr.Reader(['pt', 'it', 'en'])
# results = reader.readtext(str(image_path))

What to Extract from Each Document Type

Birth certificate: Full name, date, time, place, father's name + occupation, mother's name, grandparents' names (all 4 if listed), witnesses, registration number.

Death certificate: Full name, date, place, cause, age at death, spouse's name, children's names, parents' names, address at death.

Marriage certificate: Both spouses' full names, ages/dates, birthplaces, parents' names for both (this is the bridge to the previous generation), witnesses, date and place.

Immigration record: Full name, age (unreliable!), nationality, profession, port of departure, port of arrival, ship name, family members traveling together, destination.

Baptism record: Full name, baptism date (≠ birth date), parents, godparents (padrinhos, often relatives or important community members), parish.

KEY INSIGHT: Always look for PARENTS and GRANDPARENTS mentioned in any document. A child's birth certificate that lists their grandparents connects three generations in one document.
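
Since ages stated in records (immigration records especially) are unreliable, it helps to turn a stated age into a window of birth years rather than a single year. The sketch below is plain arithmetic; `birth_year_window` is a hypothetical helper and the default tolerance of 3 years is an assumption to tune per record type.

```python
def birth_year_window(record_year: int, stated_age: int, tolerance: int = 3) -> tuple[int, int]:
    """Return (earliest, latest) plausible birth years for an age stated on a dated record."""
    # Someone aged N during year Y was born in Y-N or Y-N-1, depending on the birthday.
    earliest = record_year - stated_age - 1 - tolerance
    latest = record_year - stated_age + tolerance
    return earliest, latest

print(birth_year_window(1888, 47))      # (1837, 1844)
print(birth_year_window(1888, 47, 0))   # (1840, 1841)
```

The resulting window can be stored as an approximate date with `confidence: low` and later used as a date range when searching databases.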

Handling WhatsApp Audio from Relatives

Elderly relatives often respond via voice messages. This is gold — treat it carefully:

  1. Receive the audio file from the user
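  2. Transcribe using Whisper or similar: whisper-cli --model medium audio.opus (if your Whisper build cannot read .opus files, convert first: ffmpeg -i audio.opus -ar 16000 -ac 1 audio.wav)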
  3. Names and places are the critical data — flag anything unclear
  4. Create a transcription file in ocr/ with metadata header
  5. Extract structured data into YAMLs
  6. Source as: oral: "Name, date, via WhatsApp audio"
  7. Confidence: medium (oral) — upgrade to high only when confirmed by documents

Transcription template:

TRANSCRIPTION — [Person's Name], [Date]
Source: WhatsApp audio sent by [person/via whom]
Duration: [X minutes]
Method: [whisper-cli medium / manual transcription]
---

[Transcribed text with timestamps if useful]

EXTRACTED DATA:
- [Person X] born approximately [date] in [place]
- [Uncertain: name sounds like "Felismirna" — confirm spelling]

Phase 3: Online Research

FamilySearch (familysearch.org) — The Most Important Free Resource

FamilySearch has two distinct parts:

  1. The Tree — collaborative family tree (anyone can edit, so data quality varies)
  2. Historical Records — digitized/indexed records from archives worldwide

FamilySearch API Access

The API can be accessed using a session token from the browser — no OAuth2 needed for personal research:

import requests

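def fs_get(endpoint, token, params=None):
    """Make an authenticated FamilySearch API call and return the parsed JSON."""
    headers = {
        "Authorization": f"Bearer {token}",  # fssessionid cookie value
        "Accept": "application/json"
    }
    response = requests.get(
        f"https://api.familysearch.org{endpoint}",
        headers=headers, params=params, timeout=30
    )
    # An expired token should surface as an HTTP error, not as a JSON decode error
    response.raise_for_status()
    return response.json()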

# Get a person from the tree
person = fs_get(f"/platform/tree/persons/{pid}", token)

# Get ancestry (pedigree, up to 8 generations)
ancestry = fs_get("/platform/tree/ancestry",
                   token, {"person": pid, "generations": 4})

# Search the tree by name
results = fs_get("/platform/tree/search", token,
                  {"q": 'surname:"Silva" givenName:"Antonio" birthLikePlace:"Botucatu"'})

How to get the token: Browser DevTools → Application → Cookies → fssessionid. Lasts ~24 hours. Save it to a file and load it in your scripts.

Critical: Send fssessionid as Authorization: Bearer {token}, NOT as a Cookie header. This works for ALL endpoints.

FamilySearch Search Tips

  • Always search with name variants (Giuseppe AND José, Müller AND Muller)
  • Use birthLikePlace with the broadest useful area (city, not street)
  • Use birthLikeDate with a range: "1895-1905"
  • Check the sources tab on any tree person — the attached records are more reliable than the tree data
  • Look at the change log to see who contributed what; reliable contributors are gold
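
Searching with every name variant quickly becomes a combinatorial chore, so it is worth generating the queries. The sketch below mirrors the query-string style used in the tree search example; `build_queries` is a hypothetical helper, and the exact `birthLikeDate` range syntax is an assumption that should be checked against a real response.

```python
from itertools import product

def build_queries(given_names, surnames, place=None, date_range=None):
    """Return one search query string per (given name, surname) combination."""
    queries = []
    for given, surname in product(given_names, surnames):
        terms = [f'surname:"{surname}"', f'givenName:"{given}"']
        if place:
            terms.append(f'birthLikePlace:"{place}"')
        if date_range:
            terms.append(f"birthLikeDate:{date_range}")  # assumed syntax, e.g. 1895-1905
        queries.append(" ".join(terms))
    return queries

for query in build_queries(["Giuseppe", "José"], ["Müller", "Muller"], place="Botucatu"):
    print(query)
```

Each string can be passed as the `q` parameter of a tree search; feeding the function a person's `alt_names` keeps searches in sync with the YAML.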

Key FamilySearch Collections by Region

Brazil:

  • São Paulo Civil Registration — massive, partially indexed
  • Each state has its own collection (search by "Brazil, [State]")
  • Church records often go back further than civil records

Italy (by region):

| Region | Civil records from | Key collection |
|---|---|---|
| Southern (Naples, Calabria, Sicily) | 1809 (Napoleonic) | State Archive collections |
| Central (Rome, Tuscany) | 1866 (unification) | Varies by province |
| Northern (Veneto, Lombardy, Treviso) | 1871 | Tribunale collections |
| Church records everywhere | Centuries earlier | Catholic parish records |

Portugal:

  • Parish records: some go back to the 1500s
  • Civil registration began 1911 (but scattered earlier records exist)
  • Each district has its own Catholic Church Records collection

Important: Not all records are indexed. Many collections are image-only — you browse microfilm page by page. This is tedious but often the only way to find records in unindexed collections.

Ancestry (ancestry.com)

  • Paid subscription, but has some unique collections
  • Good for: US immigration records, US census, some European records
  • The user needs to browse this themselves (paywalled + bot detection)
  • GEDCOM export is valuable if the user or a relative has an Ancestry tree

Geneanet (geneanet.org)

  • Free collaborative trees from genealogists worldwide
  • Best for: finding other researchers working on the same families
  • Blocks automated access — human must browse and copy data
  • Search by surname + region to find researchers to contact
  • Tip: users with large trees (5,000+ people) from a specific region are likely serious researchers

Antenati (antenati.cultura.gov.it)

  • Italian state archives online — free
  • Browse by province → comune → record type → year
  • Many records are high-quality scans of original registers
  • Critical for Italian genealogy south of Rome (records from 1809)

CEPESE (cepese.pt)

  • Portuguese emigration database (passports, ship lists)
  • Search with Portuguese spelling: "Viana" not "Vianna", "Silva" not "Da Silva"
  • Covers mainly Porto district emigrants to Brazil

Other Useful Sources

| Source | Best For | Access |
|---|---|---|
| Blogs, local history sites | Brazilian families (often the richest source) | Web search |
| Find A Grave, BillionGraves | Cemetery records with photos | Free |
| Newspapers (Hemeroteca Digital) | Obituaries, announcements, legal notices | Free (when available) |
| IBGE Nomes API | Surname/name frequency in Brazil | Free API |
| Forebears.io | Surname distribution worldwide | Free |
| Cognomix.it | Italian surname distribution by region | Free |
| Geneteka (geneteka.genealodzy.pl) | Polish parish record index | Free |
| MyHeritage | DNA matching, some unique records | Freemium |

Blog and Local History Research

For Brazilian families, blogs are often MORE useful than databases. Search patterns:

"[surname]" "[city]" genealogia
"[surname]" "[city]" família história
"[full name]" certidão OR nascimento
"[full name]" obituário OR falecimento
"[city]" "história das famílias"
site:blogspot.com "[surname]" "[city]"

When you find a relevant blog:

  1. Read it thoroughly — local historians are passionate and detailed
  2. Extract every name, date, and relationship
  3. Note the author — they are a potential research ally
  4. Save the content — blogs disappear. Archive it or note the Wayback Machine URL
  5. Cross-reference against your documents
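
For step 4, the Internet Archive exposes a public availability endpoint that reports the closest archived snapshot of a URL. The sketch below uses only the standard library; the helper names are hypothetical, and the response shape (`archived_snapshots` → `closest` → `url`) is assumed from that API's documented JSON format.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

WAYBACK_API = "https://archive.org/wayback/available"

def availability_url(page_url: str) -> str:
    """Build the availability query URL for a blog page."""
    return f"{WAYBACK_API}?{urlencode({'url': page_url})}"

def closest_snapshot(payload: dict):
    """Return the snapshot URL from an availability response, or None if not archived."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None

def find_snapshot(page_url: str, timeout: int = 30):
    """Query the endpoint over the network and return the closest snapshot URL, if any."""
    with urlopen(availability_url(page_url), timeout=timeout) as response:
        return closest_snapshot(json.load(response))
```

When `find_snapshot` returns `None`, the page has no snapshot yet, so saving a local copy of the content becomes the priority.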

Surname Distribution — Research It First

Before deep-diving on a surname, check its geographic distribution. This tells you WHERE to focus:

  • Forebears.io — global distribution with country/region breakdown
  • IBGE Nomes — Brazilian frequency by decade and state
  • Cognomix.it — Italian distribution by province

A rare surname is a research advantage — any hit is likely your family. A common surname (Silva, Santos, Oliveira) requires more filters (place, date, parents' names).


Phase 4: Human-in-the-Loop

When to Involve the Human

The agent handles: OCR, database searches, YAML management, journal updates, web scraping of public sites, data analysis, cross-referencing, draft writing.

The human handles: phone calls to relatives, archive visits, cartório requests, solving CAPTCHAs, paying for certificates, making judgment calls about relationships, interpreting old handwriting, emotional/sensitivity decisions.

Crafting Questions for Relatives

Principles:

  • Number every question (makes written/audio answers easy to match)
  • Group by topic: grandparents, places, events, documents
  • Start with easy/pleasant questions before difficult ones
  • For elderly relatives: max 10-15 questions per session
  • Offer paper + pen — many elderly prefer writing to typing/speaking
  • Accept whatever format they respond in (handwriting, audio, phone call)

Template:

About [Person Name]:
1. What was their full name (including middle names)?
2. When were they born (or approximate year)?
3. Where were they born (city, state)?
4. When and where did they die?
5. Who were their parents? (full names if possible)
6. Who were their siblings? (names, birth order if known)
7. What did they do for a living?
8. Where did they live during their life?

About the family:
9. Where did the family originally come from?
10. When did they arrive in [current location]?
11. Are there any old documents, photos, or letters in the family?
12. Is there anyone who knows more about the family history?

After receiving answers:

  1. Transcribe immediately (handwriting, audio, or notes from phone call)
  2. Extract structured data into YAMLs
  3. Journal the contact and discoveries
  4. Generate follow-up questions based on new information
  5. Mark confidence as medium (oral) — document the source

Cartório (Civil Registry) Requests

Brazilian cartórios respond to written/email requests. Template:

Prezado(a) Oficial,

Solicito certidão de [nascimento/casamento/óbito] de:

Nome: [FULL NAME]
Data aproximada: [DATE or DATE RANGE]
Filiação: [FATHER] e [MOTHER]

Trata-se de pesquisa genealógica familiar.

Agradeço a atenção.
[Name, contact]

Tips:

  • Always include parents' names — it's how they search their books
  • Provide a date range if uncertain ("entre 1890 e 1910")
  • Many cartórios now accept email or WhatsApp requests
  • Expect fees per certificate (varies by state and type)
  • The human must handle this — prepare the request, let them send it
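
Since the agent prepares the request and the human sends it, the template can be filled programmatically. This is a minimal sketch: `cartorio_request` is a hypothetical helper, the names in the usage example are placeholders, and the wording simply reproduces the template above.

```python
RECORD_TYPES = {"nascimento", "casamento", "óbito"}

REQUEST_TEMPLATE = """\
Prezado(a) Oficial,

Solicito certidão de {record_type} de:

Nome: {full_name}
Data aproximada: {date_hint}
Filiação: {father} e {mother}

Trata-se de pesquisa genealógica familiar.

Agradeço a atenção.
{requester}
"""

def cartorio_request(record_type, full_name, father, mother,
                     earliest_year, latest_year=None, requester=""):
    """Fill the request template; a year range becomes 'entre X e Y'."""
    if record_type not in RECORD_TYPES:
        raise ValueError(f"unknown record type: {record_type}")
    if latest_year is None or latest_year == earliest_year:
        date_hint = str(earliest_year)
    else:
        date_hint = f"entre {earliest_year} e {latest_year}"
    return REQUEST_TEMPLATE.format(
        record_type=record_type, full_name=full_name, date_hint=date_hint,
        father=father, mother=mother, requester=requester,
    )

# Placeholder names for illustration only
print(cartorio_request("nascimento", "Maria Rossi", "Pietro Rossi", "Stella Rossi",
                       1890, 1910, requester="[Name, contact]"))
```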

Handling Blocked Sites

Many genealogy sites block automated access. When blocked, prepare a research brief for the human:

I need data from [SITE] but it blocks automated access.

Could you please:
1. Go to [EXACT URL]
2. Search for: [EXACT TERMS]
3. I'm looking for: [WHAT SPECIFICALLY]
4. If you find it, please: [screenshot / copy the text / note the reference number]

This should take about [X] minutes.

Sites that typically need human intervention:

  • FamilySearch — hCaptcha on login (use headed browser, human solves once)
  • Geneanet — aggressive bot blocking (human browses, copies data)
  • Ancestry — paywall + bot detection (needs subscription)
  • Some Italian/Portuguese archives — JavaScript-heavy rendering

Phase 5: Analysis & Consistency

Handling Name Variations

Names change across borders, languages, and time periods, and with the registrar's level of literacy:

| Original | Common Brazilian Variant |
|---|---|
| Giuseppe | José |
| Giovanni | João |
| Domenico | Domingos |
| Pietro | Pedro |
| Pasquale | Pascoal |
| D'Alessandro | Alexandre |
| Ferrari | Ferreira |
| Catarina / Catharina / Caterina | All valid depending on era |
| Müller | Muller, Miller |
| Jakub | Jacob, Jacó |
| Władysław | Ladislau |

Rule: Record ALL variants in alt_names. Search databases with every variant. The same person's mother's surname might appear differently in every document.
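
A small lookup can seed `alt_names` with candidates. The sketch below encodes a subset of the variant table and adds accent-free spellings; `name_variants` is a hypothetical helper, and its output is a list of search candidates to verify, not confirmed variants.

```python
import unicodedata

# Subset of the variant table: original form -> common Brazilian forms
KNOWN_VARIANTS = {
    "Giuseppe": ["José"],
    "Giovanni": ["João"],
    "Domenico": ["Domingos"],
    "Pietro": ["Pedro"],
    "Pasquale": ["Pascoal"],
    "Müller": ["Muller", "Miller"],
    "Jakub": ["Jacob", "Jacó"],
}

def strip_accents(text: str) -> str:
    """Remove combining accent marks (José -> Jose)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def name_variants(name: str) -> list[str]:
    """Return the name, its table variants in either direction, and accent-free forms."""
    variants = [name]
    for original, brazilian in KNOWN_VARIANTS.items():
        group = [original, *brazilian]
        if name in group:
            variants.extend(group)
    variants.extend(strip_accents(v) for v in list(variants))
    # dict.fromkeys removes duplicates while keeping first-seen order
    return list(dict.fromkeys(variants))

print(name_variants("José"))
print(name_variants("Müller"))
```

The lookup works in both directions, so a Brazilian record for "José" also suggests searching Italian records for "Giuseppe".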

Confidence Levels

high   — Primary document (certificate, official record, parish register)
medium — Secondary source (oral history from direct witness, family Bible,
         newspaper, blog citing sources)
low    — Single oral source, unverified claim, conflicting information,
         or inference from indirect evidence

Resolving Conflicts Between Sources

When sources disagree (and they WILL):

  1. Primary documents beat secondary sources. A birth certificate beats a GEDCOM entry.
  2. Marriage certificates are often the most reliable for ages and birthplaces (both parties stated under oath).
  3. Immigration records are unreliable for ages — people routinely lied or guessed.
  4. Multiple independent sources agreeing beats any single source.
  5. Contemporary records (written at the time of the event) beat retrospective ones (written years later).
  6. Church records pre-date civil records in most countries and may have different (sometimes better) information.

Always document the conflict in the YAML, even after resolving it:

notes: |
  CONFLICT RESOLVED: Birth date
  - Marriage certificate (1923): "28 September 1897" → ACCEPTED (primary, contemporary)
  - GEDCOM from Ancestry: "28 June 1897" → REJECTED (likely transcription error: 9→6)
  - Immigration record (1899): age 3 → ~1896 → REJECTED (ages in immigration records unreliable)
  Using September 1897 per marriage certificate.
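
Rules 1 and 4 above lend themselves to a first-pass ranking of competing values. The sketch below is only a suggestion engine under assumptions: `Claim` and `rank_values` are hypothetical names, primary sources are weighted ahead of sheer source count, and the other rules (contemporary vs. retrospective, record-type reliability) are left to human review.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Claim:
    value: str      # the asserted fact, e.g. "1897-09-28"
    source: str     # where it came from, e.g. "Marriage certificate (1923)"
    primary: bool   # True for certificates and registers; False for GEDCOMs, trees, oral history

def rank_values(claims):
    """Group claims by value and sort by (primary source count, total source count)."""
    by_value = defaultdict(list)
    for claim in claims:
        by_value[claim.value].append(claim)

    def score(item):
        _, group = item
        return (sum(c.primary for c in group), len(group))

    return sorted(by_value.items(), key=score, reverse=True)

claims = [
    Claim("1897-09-28", "Marriage certificate (1923)", primary=True),
    Claim("1897-06-28", "GEDCOM from Ancestry", primary=False),
    Claim("1897-06-28", "Online family tree", primary=False),
]
best_value, supporting = rank_values(claims)[0]
print(best_value, [c.source for c in supporting])
```

Here one primary document outranks two agreeing secondary sources; the rejected values should still be written into the YAML notes so the conflict stays documented.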

Cross-Referencing Strategy

For every person, try to find them in at least two independent sources:

  1. A document from their OWN life events (birth, marriage, death)
  2. A document where they appear AS A RELATIVE (child's birth cert, spouse's death cert)

The parents listed in a child's birth certificate confirm that couple existed and were together at that date. If you find the same parents' names in three children's birth certificates, that's strong evidence.

Dangerous Assumptions to Avoid

  • "Same name = same person" — WRONG. Names were reused constantly, especially after a child died
  • "This GEDCOM entry is correct" — MAYBE. GEDCOMs are full of unverified data
  • "FamilySearch tree data is reliable" — SOMETIMES. Anyone can edit the tree. Check the sources tab.
  • "My relative remembers correctly" — VALUABLE BUT VERIFY. Birth orders, years, and even parent/grandparent identities can be misremembered
  • "The first Stella died, so the second Stella took her name" — COMMON in Italian/Portuguese families. First children who died often had their names reused for later siblings

Phase 6: Research Patterns by Origin

Brazilian Families (General)

Record availability timeline:

  • 1888+: Civil registration (cartórios) becomes mandatory nationwide
  • Pre-1888: Church (parish) records are the primary source
  • 1870s–1920s: Peak immigration era — immigration records, ship manifests
  • Colonial era: Church records, sesmarias (land grants), inventories, wills

Research strategy:

  1. Start with cartório records (birth/marriage/death certificates)
  2. Use FamilySearch to find indexed records
  3. Search blogs and local history sites (surprisingly effective in Brazil)
  4. Contact cartórios by email/phone for unindexed records
  5. Church records for pre-1888 events

Brazilian Families with Italian Ancestry

This is the most common immigrant genealogy case in Southeast Brazil.

The bridge: Find the immigration record → it gives you the Italian comune → search Italian records.

Key sources for the Brazilian side:

  • Hospedaria do Imigrante (São Paulo) — immigrant registration on arrival
  • FamilySearch Brazil collections — civil and church records
  • Cartórios in the coffee belt (São Paulo interior, Paraná, Minas Gerais)

Key sources for the Italian side:

  • Antenati (state archives online) — browse by comune
  • FamilySearch Italian collections (organized by province + archive type)
  • Email the relevant Comune's Ufficio di Stato Civile for specific certificates
  • Church records (the Comune's parroco may still have them)

Critical dates: Italian civil registration started in 1809 (South), 1866 (Center), or 1871 (North). Church records go back much further.

Italian naming patterns (helpful for identification):

  • First son named after paternal grandfather
  • First daughter named after paternal grandmother
  • Second son/daughter after maternal grandparents
  • Names of children who died young were reused for later children
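
The naming pattern can be run in reverse to generate hypotheses about unknown grandparents from a known list of children. This is a sketch, not a rule: `guess_grandparent_names` is a hypothetical helper, the example family is invented, and every guess belongs in the YAML with `confidence: low` until a document confirms it.

```python
def guess_grandparent_names(children):
    """children: list of (given_name, sex) in birth order, with sex "M" or "F".

    Returns a dict of role -> hypothesized given name.
    """
    # A name reused after an early death counts once, at its first occurrence.
    sons = list(dict.fromkeys(name for name, sex in children if sex == "M"))
    daughters = list(dict.fromkeys(name for name, sex in children if sex == "F"))
    slots = [
        ("paternal grandfather", sons, 0),
        ("maternal grandfather", sons, 1),
        ("paternal grandmother", daughters, 0),
        ("maternal grandmother", daughters, 1),
    ]
    return {role: names[index] for role, names, index in slots if len(names) > index}

family = [("Pietro", "M"), ("Stella", "F"), ("Stella", "F"), ("Antonio", "M"), ("Rosa", "F")]
print(guess_grandparent_names(family))
```

The guesses are most useful as search terms: looking for a "Pietro" of the right age in the father's comune is a cheap test of the hypothesis.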

Brazilian Families with Portuguese Ancestry

Key sources:

  • CEPESE (cepese.pt) — Portuguese emigration database
  • DigitArq (digitarq.arquivos.pt) — Portuguese national archives
  • FamilySearch Portuguese collections (parish records by district)
  • Note: Portuguese civil registration began only in 1911; church records go back to the 1500s

Spelling trap: Portuguese spelling in Brazil diverged from Portugal. Search both variants: "Vianna"/"Viana", "Almeida"/"d'Almeida", etc.

Brazilian Families with Spanish Ancestry

  • PARES (pares.mcu.es) — Portal de Archivos Españoles
  • FamilySearch Spanish collections (by province)
  • Note: Spanish civil registration began in 1870

Brazilian Families with Polish/Eastern European Ancestry

  • Geneteka (geneteka.genealodzy.pl) — Polish parish record index, search by surname + parish
  • Szukajwarchiwach — Polish state archives
  • Note: Borders changed constantly — "Poland" may mean modern-day Ukraine, Lithuania, or Belarus

Brazilian Families with German Ancestry

  • Archion (archion.de) — Protestant church records
  • Matricula (data.matricula-online.eu) — Catholic church records
  • FamilySearch German collections are extensive

Brazilian Families with African Ancestry

  • This is the hardest genealogy in Brazil due to slavery erasing records
  • Church baptism records sometimes list enslaved people by first name + owner
  • Post-abolition (1888): cartório records with adopted surnames
  • Inventários (estate inventories) sometimes list enslaved people by name
  • DNA testing (AncestryDNA, 23andMe) can identify broad African regions of origin

Brazilian Families with Japanese Ancestry

  • Museu da Imigração Japonesa (São Paulo) — immigration records
  • FamilySearch Japanese collections (koseki family registers)
  • The koseki system is extremely detailed if you can access it

Phase 7: Advanced Techniques

FamilySearch Automation with Playwright

For browsing unindexed collections (image-only microfilm), you may need browser automation:

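from pathlib import Path
from playwright.sync_api import sync_playwright

STATE_FILE = "auth-state.json"

# Use headed mode: headless is blocked by hCaptcha
with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    # Reuse the saved session if it exists; start fresh on the first run
    state = STATE_FILE if Path(STATE_FILE).exists() else None
    context = browser.new_context(storage_state=state)
    page = context.new_page()
    page.goto("https://www.familysearch.org/")
    # Human logs in and solves the CAPTCHA once, then the state is saved
    input("Press Enter once you are logged in...")
    context.storage_state(path=STATE_FILE)
    browser.close()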

Important: Always use headed mode. Headless browsers get blocked by hCaptcha. Have the human solve the CAPTCHA once, save the browser state, and reuse it.

Building a Research Brief for the Human

When you need the human to do something, be specific and efficient:

## Research Task: [Title]

**Priority**: High / Medium / Low
**Estimated time**: X minutes
**What I need**: [Specific data]

**Steps:**
1. Go to [URL] / Call [number] / Visit [place]
2. Search for [exact terms]
3. I'm looking for [specific information]
4. Please [screenshot / copy text / note reference number]

**Context**: [Why this matters — connects to which ancestor, resolves which conflict]

Tracking Research Across Multiple Branches

When the tree gets large (50+ people), you need systematic tracking:

  1. Use YAML flags consistently — grep for NEEDS_RESEARCH across all YAMLs to find open items
  2. Prioritize by generation — closer generations first (they unlock the farther ones)
  3. Track which branches are "blocked" — waiting for a cartório response, a relative's callback, an archive visit
  4. Keep the TODO.md updated — it's the project dashboard
  5. Review the journal periodically — patterns emerge (e.g., "every document from this cartório has errors in the mother's name")
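
The grep in step 1 can also be done with a short script that groups open items by flag. This is a minimal sketch: it assumes the person files live in research/people/ with a .yaml extension and that flags appear as literal tokens starting with NEEDS_ (as NEEDS_RESEARCH and NEEDS_OCR do). It does a plain text scan rather than parsing YAML, so it needs no third-party library.

```python
import re
from collections import defaultdict
from pathlib import Path

FLAG_PATTERN = re.compile(r"\b(NEEDS_[A-Z_]+)\b")


def open_items(people_dir="research/people"):
    """Map each flag to the sorted list of YAML files that mention it."""
    found = defaultdict(set)
    for path in Path(people_dir).glob("*.yaml"):
        text = path.read_text(encoding="utf-8")
        for flag in FLAG_PATTERN.findall(text):
            found[flag].add(path.name)
    return {flag: sorted(files) for flag, files in found.items()}
```

Calling `open_items()` at the start of a session gives a per-flag list of files, which can be pasted into TODO.md as the current backlog.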

Integrating GEDCOM

If the user has a GEDCOM file:

  1. Import it as a starting point, but treat every fact as confidence: low until verified
  2. Create YAMLs for the most important people first (direct line)
  3. Cross-reference GEDCOM data against documents as you find them
  4. Keep the GEDCOM as a reference, but the YAMLs are the source of truth going forward

If the user wants to EXPORT a GEDCOM:

  • Generate from the YAMLs (only export facts with confidence: high or medium)
  • This is a future automation — for now, manual GEDCOM editing or Gramps CLI works
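
As a starting point for that automation, the export filter can be sketched in a few lines. The person dictionary layout below (given, surname, birth, death, each event carrying date, place, and confidence) is a hypothetical shape, not the skill's actual YAML schema, and dates are assumed to be already in GEDCOM form (for example "12 MAR 1885"). The header is minimal; strict validators may also expect SOUR and SUBM records.

```python
EXPORTABLE = {"high", "medium"}


def to_gedcom(people):
    """Build a minimal GEDCOM 5.5.1 document from person dicts."""
    lines = ["0 HEAD", "1 GEDC", "2 VERS 5.5.1",
             "2 FORM LINEAGE-LINKED", "1 CHAR UTF-8"]
    for index, person in enumerate(people, start=1):
        lines.append(f"0 @I{index}@ INDI")
        lines.append(f"1 NAME {person['given']} /{person['surname']}/")
        for tag, key in (("BIRT", "birth"), ("DEAT", "death")):
            event = person.get(key)
            # Skip missing events and anything below medium confidence
            if not event or event.get("confidence") not in EXPORTABLE:
                continue
            lines.append(f"1 {tag}")
            if event.get("date"):
                lines.append(f"2 DATE {event['date']}")
            if event.get("place"):
                lines.append(f"2 PLAC {event['place']}")
    lines.append("0 TRLR")
    return "\n".join(lines) + "\n"
```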

The "Immigration Wall" Strategy

For families that immigrated to Brazil, the critical transition point is:

Brazilian records (post-arrival)
        ↕  THE WALL — you need to find the bridge document
European records (pre-departure)

Bridge documents (in order of usefulness):

  1. Marriage certificate in Brazil — often lists European birthplace
  2. Immigration/hospedaria record — gives port of departure + age + family group
  3. Ship manifest — destination, origin, sometimes exact comune/parish
  4. Baptism of first Brazilian child — parents' origins sometimes listed
  5. Death certificate — sometimes lists birthplace (but often just "Italy" or "Portugal")
  6. Naturalization records — detailed personal history
  7. Passport — European-side document with exact birthplace

Sensitivity & Privacy Rules

At Project Start, Ask:

  • "Are your parents alive?" / "Are there divorces or remarriages I should be sensitive about?"
  • "Is there anyone in the family who wouldn't want their information in a family tree?"
  • "Are there any family conflicts or sensitive topics I should be aware of?"

Rules for Living People:

  • Never publish full birth dates without consent
  • Be careful with addresses, phone numbers, email addresses
  • Divorce/remarriage should be handled with discretion

Rules for Multiple Families:

  • If the user has children from different relationships, keep research for each family branch STRICTLY separate
  • Never cross-reference between an ex-spouse's family and the current spouse's family
  • Each family branch gets its own research space with no cross-links
  • Ask the user explicitly about boundaries at the start

Rules for Deceased People:

  • Historical figures are generally fair game
  • Recently deceased (within ~20 years): be sensitive about cause of death, financial details
  • Illegitimate births, secret adoptions, affairs — let the user decide how to handle

Quick Reference: Starting Checklist

For a new person added to the research:

□ Create YAML in research/people/ with all known information
□ Set appropriate flags (NEEDS_RESEARCH, NEEDS_OCR, etc.)
□ Process any documents the family already has for this person
□ Search FamilySearch tree (by name + approximate dates)
□ Search FamilySearch historical records (by name + place)
□ Web search: "[full name]" "[city]" — blogs, obituaries, newspapers
□ Check surname distribution (Forebears.io, IBGE Nomes)
□ Prepare questions about this person for living relatives
□ Identify the relevant cartório/archive for records
□ Update journal with all findings
□ Update TODO with next research steps

Quick Reference: Research Priority Order

When you have limited time or the user wants to focus:

  1. Direct ancestors first (parents → grandparents → great-grandparents)
  2. Documents in hand — always process what you already have
  3. Elderly relatives — their memories are disappearing; interview them NOW
  4. Immigration records — they bridge continents
  5. Marriage certificates — they name parents of BOTH spouses
  6. FamilySearch tree — quick wins, but verify everything
  7. Online records search — targeted queries by name + place + date range
  8. Cartório requests — slower, but primary sources
  9. European archives — the deep dive, after you know exactly where to look
  10. DNA testing — useful when paper trails run cold

Appendix: Common Pitfalls

  1. Context window management: Large PDFs (10MB+) and long OCR outputs can blow up the context. Process documents page by page. Create scripts for batch operations rather than reading everything inline.

  2. N+1 request trap: If an API returns N results, do NOT make N follow-up requests to enrich each one. Fetch maximum data in the original request, or batch.

  3. Name spelling rabbit holes: Don't spend hours trying every spelling variant. Check the top 2-3 variants, then move on. The breakthrough usually comes from a completely different document, not from the 47th spelling variation.

  4. Trusting FamilySearch tree blindly: Anyone can edit it. Always check the Sources tab. If a fact has no attached source, treat it as unverified.

  5. Forgetting to journal: It feels like overhead until you need to explain to the user (or to yourself in 3 months) WHY you concluded something. Journal every discovery, every conflict resolution, every dead end.

  6. Over-researching one branch: When stuck on one line, switch to another. Fresh branches often yield quick wins that provide morale and context.

  7. Ignoring witnesses and godparents: The witnesses on a marriage certificate and godparents at a baptism are often relatives. They are leads.

  8. Not saving web pages: Blogs, forum posts, and small genealogy sites disappear. When you find something valuable, save the content immediately.
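
For pitfall 2, the usual fix is to chunk the identifiers and make one call per chunk. This is a generic sketch: `fetch_batch` is a hypothetical callable that takes a list of ids and returns a dict of id to details; whether a given API offers a batch endpoint, and what its size limit is, has to be checked case by case.

```python
def batched(ids, size):
    """Yield consecutive slices of `ids`, each at most `size` long."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def enrich(ids, fetch_batch, size=50):
    """Fetch details for all ids using ceil(N / size) calls instead of N."""
    details = {}
    for chunk in batched(ids, size):
        details.update(fetch_batch(chunk))
    return details
```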

Commentary: How This Skill Was Built

Origin

This skill was distilled from approximately 6 months of intensive genealogy research conducted as a collaboration between a human researcher and Claude Code (Anthropic's CLI agent). The project traced a Brazilian family across 8+ generations, spanning Southeast Brazil, Northern Italy (Veneto), Southern Italy (Calabria), Northern Portugal (Porto, Viseu), and the Recôncavo Baiano region. Over the course of the project, we processed ~300 documents, created 34+ structured YAML files, ran hundreds of OCR extractions, queried FamilySearch's API thousands of times, and coordinated with a dozen living relatives and fellow researchers.

Every recommendation in this skill comes from something we actually did, got wrong, fixed, and then systematized.


Key Design Decisions

Why YAMLs Instead of a Database or GEDCOM

We started with a GEDCOM file exported from Ancestry.com and quickly found it insufficient for active research. GEDCOM is great as a final export format, but terrible as a working format because:

  • It doesn't support confidence levels or flags
  • It has no concept of "conflicting information" — you pick one value
  • It doesn't track sources in a granular way
  • It's not human-readable during research sessions
  • It doesn't support free-form research notes

We tried Gramps (both desktop and CLI) but found that the overhead of maintaining a database while actively researching was too high. YAMLs struck the right balance: human-readable, git-friendly, flexible enough for messy real-world data, and trivially searchable with grep.

The YAML-per-person pattern emerged organically. At first we had a single large file; within a week we split it because it was unwieldy. One file per person means you can open just the person you're researching, and git diffs show exactly what changed.

Why the Journal

The journal started as an afterthought and became essential. Three reasons:

  1. Audit trail: When you discover something today that contradicts what you recorded last week, you need to know WHY you recorded the original value. The journal tells you.

  2. Human communication: The user frequently asked "what did we do yesterday?" or "what did we find from those documents?" The journal answered instantly.

  3. Immutability discipline: Making the journal append-only and never-edit-past-files forced us to be honest about corrections. Instead of silently fixing a YAML, we'd write a correction event that explained what was wrong and why. This prevented the dangerous pattern of "I think I already knew this" when in fact you're overwriting good data with bad data.

Why the Document Pipeline is So Strict

Early in the project, we would receive documents, OCR them, get excited about a discovery, immediately start researching online, and forget to update the YAML or journal. Two weeks later, we'd find the OCR output sitting in a folder with no connection to any person's record, and couldn't remember what we'd already extracted from it.

The "document → Drive → OCR → YAML → THEN research" pipeline was born from this pain. It's rigid, but it prevents data loss. The most valuable thing in genealogy research isn't any single fact — it's the CONNECTION between facts. A loose OCR file is almost as useless as no OCR at all.

Why Confidence Levels

We had a moment in the project where oral information from an elderly relative contradicted a civil registry document. We had initially taken the oral info at face value and built further research on it. When the document proved it wrong, we had to unwind several hours of work.

After that, we instituted confidence levels (high/medium/low) and the rule that oral info starts at medium and only goes to high when confirmed by a document. This simple discipline saved us from repeating the mistake.
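
The promotion rule can be written down as code so it is applied the same way every time. This is a sketch with a hypothetical fact layout (value, confidence, sources, conflicts). One design choice here is an assumption rather than something the project prescribes: when a document disagrees with the recorded value, the sketch records a conflict and leaves resolution to a person instead of overwriting anything.

```python
def initial_confidence(source_type):
    """Oral information starts at medium; imported GEDCOM facts start at low."""
    return {"oral": "medium", "gedcom": "low"}.get(source_type, "low")


def confirm(fact, document_value, document_ref):
    """Compare a recorded fact against a document and return an updated copy.

    A match promotes the fact to high and attaches the document as a source.
    A mismatch keeps the value and confidence and logs the conflict.
    """
    updated = dict(fact)
    if document_value == fact["value"]:
        updated["confidence"] = "high"
        updated["sources"] = fact.get("sources", []) + [document_ref]
    else:
        updated["conflicts"] = fact.get("conflicts", []) + [
            {"value": document_value, "source": document_ref}
        ]
    return updated
```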

Why the "Exhaust Documents First" Rule

We wasted significant time searching FamilySearch for a person's birth record, only to discover weeks later that the answer was already in a marriage certificate sitting in the family's document folder. The certificate listed both spouses' parents, birthplaces, and dates — exactly what we were looking for.

Now: process everything you have before searching externally. The family's shoe box of old papers is the most underrated genealogy resource.


Where We Learned the Hard Way

OCR on Manuscripts

Tesseract is effectively useless on handwritten documents. We learned this after many frustrating attempts on 19th-century Brazilian civil registry manuscripts (which are cursive, often faded, and use archaic Portuguese spelling).

EasyOCR was a major improvement: it correctly identified the structure of the document and captured printed text, headers, and some modern handwriting. But for old cursive manuscripts, the only reliable method was visual reading rather than OCR: the agent reads the image directly, interprets what it can, and flags uncertain words for the human to check.

The skill recommends this layered approach: Tesseract for print → EasyOCR for handwriting → human for manuscripts. We wish we'd known this from day one.
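
The layered approach can be expressed as a small dispatcher. This is a sketch, not the project's actual pipeline: the document kinds ("print", "handwriting", "manuscript") and function names are hypothetical labels, and the OCR libraries are imported lazily so the routing logic works even where they are not installed.

```python
def choose_ocr_tier(kind):
    """Pick the first tool to try for a document kind.

    kind: "print", "handwriting" (modern), or "manuscript" (old cursive).
    """
    tiers = {
        "print": "tesseract",
        "handwriting": "easyocr",
        "manuscript": "human",
    }
    return tiers.get(kind, "human")  # unknown kinds go to visual reading


def run_ocr(image_path, kind, lang="por"):
    """Run the chosen tier; return text, or None when it must be read visually."""
    tier = choose_ocr_tier(kind)
    if tier == "tesseract":
        import pytesseract            # pip install pytesseract
        from PIL import Image         # pip install pillow
        return pytesseract.image_to_string(Image.open(image_path), lang=lang)
    if tier == "easyocr":
        import easyocr                # pip install easyocr
        reader = easyocr.Reader(["pt"])
        return "\n".join(reader.readtext(image_path, detail=0))
    return None
```

Returning None for manuscripts makes the hand-off explicit: the caller knows no machine transcription exists and can queue the image for visual reading.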

FamilySearch API: The Bearer Token Discovery

We initially assumed we needed full OAuth2 authentication for the FamilySearch API. After multiple failed attempts at implementing the OAuth2 flow (which requires registered apps and redirect URIs), we discovered by accident that the fssessionid cookie — a simple session token from the browser — works as a Bearer token for ALL API endpoints.

This single discovery unlocked the entire FamilySearch API for our research. It's documented nowhere official (as far as we know), but it's been reliable for months. The skill documents this prominently because it's a huge time-saver.
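
In practice this means copying the fssessionid value from the browser and sending it in an Authorization header. The sketch below only builds the request. The host and the /platform/tree/persons/{id} path are assumptions based on the public API layout, the person id is a placeholder, and this behavior is unofficial, so it may stop working at any time.

```python
import urllib.request

API_BASE = "https://api.familysearch.org"  # assumed production host


def build_request(path, session_id):
    """Prepare an authenticated GET; nothing is sent until urlopen() is called."""
    return urllib.request.Request(
        API_BASE + path,
        headers={
            "Authorization": f"Bearer {session_id}",
            "Accept": "application/json",
        },
    )


# Usage (placeholder id; requires a valid session value from the browser):
# import json
# request = build_request("/platform/tree/persons/XXXX-XXX", fssessionid)
# with urllib.request.urlopen(request) as response:
#     person = json.load(response)
```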

Immigration Record Ages Are Lies

Multiple times, we built hypotheses on ages declared in immigration manifests, only to find they were wrong by 5-10 years. Immigration records were filled out hastily, by officials who didn't speak the immigrants' language, based on verbal declarations from people who may not have known their exact age. Marriage certificates and baptism records from the country of origin are far more reliable.
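
One way to act on this is to search a birth-year window rather than a single year. This is a minimal sketch: the function name is hypothetical, and the default tolerance of 10 years reflects the upper end of the errors described above.

```python
def birth_year_window(record_year, declared_age, tolerance=10):
    """Return (earliest, latest) plausible birth years for a declared age.

    Someone aged N in a given year was born in year - N or year - N - 1,
    depending on whether the birthday had passed; the tolerance widens
    that range on both sides.
    """
    latest = record_year - declared_age
    earliest = latest - 1
    return earliest - tolerance, latest + tolerance
```

For a manifest dated 1891 with a declared age of 30, `birth_year_window(1891, 30)` gives (1850, 1871), which is the date range to use when searching records in the country of origin.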

The Blog Discovery

This was perhaps our most counterintuitive finding: for Brazilian families, random blogspot blogs written by passionate local historians were often MORE informative than FamilySearch or Ancestry. A blog post about a small town in Bahia gave us a complete genealogy of a family going back to the 1800s, including a connection to an illegitimate son of a Catholic priest — information that would never appear in official databases.

The skill emphasizes blog research because it's underrated and it works, especially for Brazilian interior towns where formal digitization hasn't reached.

Relative Interview Patterns

We tried several approaches to interviewing elderly relatives:

  1. Open-ended questions ("Tell me about your grandparents") — too vague, answers rambled
  2. Long questionnaire (30 questions) — overwhelming, relative gave up after 10
  3. Numbered, grouped, 10-15 questions — the sweet spot

The numbered format was critical because relatives often responded by WhatsApp audio or handwritten notes. Without numbers, matching answers to questions was impossible. With numbers, even a scrawled "⑧ SEMPRE MOROU EM C. DE ALMEIDA" was perfectly usable data.
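
Matching numbered answers back to questions is easy to automate once the replies are transcribed. This is a sketch with a hypothetical function name: it accepts plain digits ("2)", "10.") and circled numbers such as "⑧", and it ignores lines that do not start with a number. A line that happens to begin with a year would be misread as a question number, so the output still needs a quick review.

```python
import re
import unicodedata

# Leading question number: ASCII/Unicode decimal digits, or a circled number ①..⑳
LEADING_NUMBER = re.compile(r"^\s*(\d+|[\u2460-\u2473])\s*[.):-]?\s*(.*)$")


def parse_answers(text):
    """Return {question_number: answer} from lines that start with a number."""
    answers = {}
    for line in text.splitlines():
        match = LEADING_NUMBER.match(line)
        if not match:
            continue
        token, answer = match.groups()
        if token.isdecimal():
            number = int(token)
        else:
            # Circled numbers carry a numeric value in the Unicode database
            number = int(unicodedata.numeric(token))
        answers[number] = answer.strip()
    return answers
```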

WhatsApp Audio as a Source

Elderly relatives who can't type will send 5-minute WhatsApp voice messages full of genealogical gold. We learned to:

  1. Transcribe immediately (using Whisper or similar)
  2. Create a structured transcription file with metadata
  3. Extract data into YAMLs with clear attribution
  4. Keep the original audio as a permanent source

This workflow — audio → transcription → structured data → YAML — isn't documented in any genealogy guide we've seen, but it was one of our most productive data-gathering methods.
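
The first two steps can be sketched as follows. The record fields are hypothetical names, not the project's actual file format. The `transcribe` helper assumes the openai-whisper package (and ffmpeg) is installed and imports it lazily; any other speech-to-text tool could be swapped in.

```python
from datetime import date


def transcription_record(audio_file, speaker, received_on, text):
    """Bundle a transcript with the metadata needed to cite it later."""
    return {
        "source_type": "oral",
        "audio_file": audio_file,      # keep the original as the permanent source
        "speaker": speaker,
        "received_on": received_on.isoformat(),
        "confidence": "medium",        # oral information starts at medium
        "text": text.strip(),
    }


def transcribe(audio_path, model_name="base", language="pt"):
    """Transcribe with openai-whisper (pip install openai-whisper)."""
    import whisper
    model = whisper.load_model(model_name)
    return model.transcribe(audio_path, language=language)["text"]
```

Keeping the speaker and date next to the text is what makes clear attribution possible when the extracted facts are later copied into a person's YAML.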


Where the Skill Goes Beyond Our Project

Generalization to Non-Brazilian Contexts

Our experience is strongest in Southeast Brazilian genealogy with Italian, Portuguese, and Bahian roots. The skill generalizes this to other origins (German, Polish, Spanish, Japanese, African) based on:

  • Known record availability patterns for each country
  • Common immigration patterns to Brazil
  • Specific databases and archives for each origin

For origins we didn't personally research (German, Japanese, African ancestry), the recommendations are based on genealogical best practices rather than hard-won experience. These sections are less detailed and should be treated as starting points.

The "Starting from Zero" Path

Our project benefited from having an Ancestry GEDCOM and several known documents from the start. The Phase 0 (Initial Interview) section was designed for the case where someone has NOTHING — no GEDCOM, no FamilySearch, no documents. This interview structure is based on what we wish we'd asked at the very beginning, plus standard genealogical interview techniques.

African Ancestry Section

This is the most difficult genealogy in Brazil due to the systematic destruction of enslaved people's records. We included it because it's important, but our recommendations are based on published genealogical methodology rather than personal experience. Researchers pursuing this path should expect significantly harder work and more dead ends, and may benefit from DNA testing as a complement to paper records.

The Sensitivity Framework

Our project had specific sensitivity rules around divorces and remarriages that we encoded in the CLAUDE.md. The skill generalizes this into a "ask at the start" framework because every family has different boundaries. The core principle — ask early, respect always — applies universally.


What This Skill Doesn't Cover

  1. DNA genealogy: We didn't use DNA testing in our project. It's a powerful tool, especially when paper trails run cold, but it requires its own skill.

  2. Legal genealogy (citizenship applications): We had citizenship-related documents but the skill doesn't cover the legal process of claiming citizenship by descent. This varies by country and changes frequently.

  3. Professional genealogist coordination: We contacted researchers informally. Managing paid professional genealogists has its own workflow.

  4. Preservation and digitization: We assumed documents were already digitized. Proper archival scanning, photo restoration, and physical preservation are separate disciplines.

  5. Publishing and sharing: Our project generated family pages for a personal website, but the skill focuses on RESEARCH, not presentation. The narrative/page generation skill is a separate concern.


Final Thought

The most important thing we learned is that genealogy research is fundamentally a patience game with compound returns. A single birth certificate might seem like just one data point, but it names two parents and up to four grandparents: potentially six new people to research. Each of those people has their own documents, each of which names more people. The tree grows exponentially.

The agent's job is to manage this complexity without losing data, without making unwarranted assumptions, and without wasting the human's time. The human's job is to make the phone calls, visit the archives, interpret the handwriting, and decide what matters. Together, they can trace a family back centuries in weeks rather than years.
