@ccerv1
Last active September 26, 2025 08:00

Workflow for mapping dependencies (LLM generated based on overview of current approach)

Decision Tree: Parsing Dependencies for a GitHub Repository

Objective

Produce a complete, source-attributable dependency set for a GitHub repo with clear provenance, direct vs transitive labeling (where feasible), and predictable fallbacks.

High-Level Flow

  1. Detect ecosystem(s) from repo files

    • npm: package.json, package-lock.json/yarn.lock/pnpm-lock.yaml
    • Rust: Cargo.toml, Cargo.lock
    • Python: pyproject.toml, poetry.lock, requirements*.txt, Pipfile.lock
    • .NET: *.csproj, packages.config, Directory.Packages.props
    • Java/Maven: pom.xml (optionally dependency:tree output)
    • Go: go.mod, go.sum
    • Else: mark as “other”
  2. Follow ecosystem branch

    • npm → gather all deps, then post-process to label direct vs transitive
    • crates.io / PyPI / NuGet / Maven → gather all deps (direct+transitive)
    • Go → gather deps hosted on GitHub; ignore or queue non-GitHub modules
    • Others → scripted fetch + AI parse → human review
  3. Normalize to a common schema and emit artifacts

    • dependency records with fields listed in “Output Schema” below
    • logs + evidence: the exact files, lock snapshots, and commands used
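Step 1 of the flow can be sketched as a filename match over the repo's file list. This is a minimal illustration, assuming the caller already has a flat list of repo-relative paths; the names `ECOSYSTEM_PATTERNS` and `detect_ecosystems` are hypothetical, and the patterns are copied from the detection list above.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Filename patterns per ecosystem, taken from the detection list above.
ECOSYSTEM_PATTERNS = {
    "npm": ["package.json", "package-lock.json", "yarn.lock", "pnpm-lock.yaml"],
    "crates": ["Cargo.toml", "Cargo.lock"],
    "pypi": ["pyproject.toml", "poetry.lock", "requirements*.txt", "Pipfile.lock"],
    "nuget": ["*.csproj", "packages.config", "Directory.Packages.props"],
    "maven": ["pom.xml"],
    "go": ["go.mod", "go.sum"],
}


def detect_ecosystems(repo_files):
    """Return {ecosystem: [matching paths]}, or {"other": []} if nothing matched."""
    found = {}
    for path in repo_files:
        name = PurePosixPath(path).name  # match on the file name only
        for eco, patterns in ECOSYSTEM_PATTERNS.items():
            if any(fnmatchcase(name, pat) for pat in patterns):
                found.setdefault(eco, []).append(path)
    return found or {"other": []}
```

Returning the matching paths alongside each ecosystem keeps the evidence trail intact: every later record can cite the file that triggered the branch.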

Ecosystem Branches

A) npm (Node.js)

  • Source of truth:
    • Prefer lockfiles (package-lock.json, yarn.lock, pnpm-lock.yaml) for a deterministic graph
    • Fallback to package.json for direct deps if no lockfile
  • Retrieval:
    • Parse lockfile to enumerate full graph and versions
    • Derive direct deps from package.json (dependencies, devDependencies, optionalDependencies, peerDependencies)
  • Post-processing for direct vs transitive:
    • Mark a package “direct” if it appears in package.json’s top-level sections
    • Everything else present only via lockfile resolution is “transitive”
  • Edge cases:
    • Workspaces/monorepos (multiple package.json files)
    • Peer dependencies (treat as direct if declared at top level; otherwise transitive)
    • Git or file specifiers in package.json (record as non-registry source with URL)
    • The engines field (Node/npm version constraints) and optional deps: include but flag type
  • Output: full graph with source_file and lockfile_path per node
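The direct vs transitive rule above can be sketched against the `packages` map found in package-lock.json (lockfileVersion 2 and 3). This is an illustrative sketch, not the author's implementation: `label_npm` is a hypothetical name, both inputs are assumed to be already-parsed JSON dicts, and workspace entries (lock keys without a `node_modules/` segment) are simply skipped.

```python
DIRECT_SECTIONS = (
    "dependencies",
    "devDependencies",
    "optionalDependencies",
    "peerDependencies",
)


def label_npm(package_json, package_lock, lockfile_path="package-lock.json"):
    """Label every installed package in the lockfile as direct or transitive."""
    declared = set()
    for section in DIRECT_SECTIONS:
        declared.update(package_json.get(section, {}))

    records = []
    for path, meta in package_lock.get("packages", {}).items():
        if "node_modules/" not in path:
            continue  # "" is the root project; other keys are workspace links
        # "node_modules/a/node_modules/b" -> "b"; scoped names keep "@scope/pkg"
        name = path.rsplit("node_modules/", 1)[-1]
        nested = path.count("node_modules/") > 1
        records.append({
            "ecosystem": "npm",
            "package_name": name,
            "version": meta.get("version"),
            "relationship": "direct" if name in declared and not nested else "transitive",
            "source_type": "lockfile",
            "source_file": "package.json",
            "lockfile_path": lockfile_path,
        })
    return records
```

A nested copy of a declared package is labeled transitive here because it was installed to satisfy another package's range, not the root's.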

B) Rust (crates.io)

  • Source of truth:
    • Cargo.lock for resolved versions; Cargo.toml for declared direct deps
  • Retrieval:
    • Use Cargo.lock to list all resolved crates (direct+transitive)
    • Optionally mark direct deps by comparing the lock entries against Cargo.toml’s [dependencies], [dev-dependencies], and [build-dependencies] tables
  • Notes:
    • Features can alter the graph; record active features if detectable
    • Workspace members may have separate manifests; iterate members
  • Output: all deps; direct flag optional (if Cargo.toml is reliably mapped)

C) Python (PyPI)

  • Source of truth:
    • Prefer lockfiles (poetry.lock, Pipfile.lock); otherwise requirements*.txt or pyproject.toml with a resolver
  • Retrieval:
    • If lockfile present, parse for pinned transitive graph
    • If only requirements.txt, resolve to lock (deterministic) before graphing when possible; otherwise capture declared list and mark transitivity as unknown
  • Edge cases:
    • Extras (pkg[extra]) → record extra selector
    • VCS/URL installs → record source URL and commit/ref if present
    • Multiple requirement files (prod/dev/test) → tag environment
  • Output: all deps when locked; otherwise declared-only with a needs_resolution flag
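For the unlocked case, a declared-only pass over a requirements file might look like the sketch below. It is a simplification under stated assumptions: it uses a small regular expression rather than a full PEP 508 parser, skips pip option lines such as `-r` and `-e`, and the name `parse_requirements` is hypothetical.

```python
import re

_REQ = re.compile(
    r"^(?P<name>[A-Za-z0-9][A-Za-z0-9._-]*)\s*"
    r"(?:\[(?P<extras>[^\]]*)\])?\s*"
    r"(?P<spec>[^;#]*)"
)


def parse_requirements(text, environment="prod"):
    """Capture declared requirements; transitivity stays unknown without a lock."""
    records = []
    for raw in text.splitlines():
        line = raw.split(" #", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith(("#", "-")):
            continue  # comments and pip options (-r, -e, --hash, ...)
        base = {"ecosystem": "pypi", "relationship": "unknown",
                "environment": environment, "extras": [], "source_url": None}
        if "://" in line:
            # VCS/URL install, e.g. "pkg @ git+https://host/repo@ref"
            name, _, url = line.rpartition(" @ ")
            records.append({**base, "package_name": name or None, "version": None,
                            "source_url": url, "needs_resolution": True})
            continue
        match = _REQ.match(line)
        if not match:
            continue
        spec = match.group("spec").strip()
        pinned = re.fullmatch(r"==\s*([^\s,*]+)", spec)
        extras = [e.strip() for e in (match.group("extras") or "").split(",") if e.strip()]
        records.append({**base, "package_name": match.group("name"), "extras": extras,
                        "version": pinned.group(1) if pinned else None,
                        "needs_resolution": pinned is None})
    return records
```

Calling it once per file with a different `environment` value (for example "dev" or "test") covers the multiple-requirements-file case.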

D) .NET (NuGet)

  • Source of truth:
    • *.csproj and Directory.Packages.props; packages.lock.json when present
  • Retrieval:
    • Prefer packages.lock.json to get resolved transitive graph
    • Otherwise parse csproj references and, if allowed, run restore to snapshot lock
  • Edge cases:
    • Central package management (Directory.Packages.*) — merge contexts
    • Private feeds — record source feed when available
  • Output: full graph if lock exists; else direct declared; mark unresolved transitive
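When packages.lock.json exists, the resolved graph can be read directly, since each entry records its own `type`. The sketch below assumes the file has already been loaded as a dict and handles only the `Direct`, `Transitive`, and `Project` types; any other value is passed through as unknown. `parse_packages_lock` is a hypothetical name.

```python
def parse_packages_lock(lock, source_file="packages.lock.json"):
    """Flatten a parsed packages.lock.json into dependency records."""
    records = []
    for framework, packages in lock.get("dependencies", {}).items():
        for name, meta in packages.items():
            kind = meta.get("type", "").lower()
            if kind == "project":
                continue  # project-to-project reference, not a NuGet package
            records.append({
                "ecosystem": "nuget",
                "package_name": name,
                "version": meta.get("resolved"),
                "relationship": kind if kind in ("direct", "transitive") else "unknown",
                "source_type": "lockfile",
                "source_file": source_file,
                "notes": f"target framework: {framework}",
            })
    return records
```

Because the lock is keyed by target framework, the same package can appear once per framework; deduplication is left to the normalization step.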

E) Java (Maven)

  • Source of truth:
    • pom.xml; optionally capture output of mvn dependency:tree with -DoutputType=dot (or -DoutputType=json on recent plugin versions)
  • Retrieval:
    • If build permissible, run dependency:tree to materialize the resolved graph
    • Without build, parse pom.xml declared deps; transitive remains unresolved
  • Edge cases:
    • Multi-module builds — iterate all modules and merge graphs
    • Profiles — record active profile assumptions
  • Output: full graph when tree available; else declared set marked needs_resolution
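The no-build path can be sketched with the standard-library XML parser. This illustration reads only the top-level `<dependencies>` block of a single pom.xml; it does not resolve parent POMs, `dependencyManagement`, profiles, or `${...}` properties, so every record is marked needs_resolution. `parse_pom` is a hypothetical name, and the `{*}` namespace wildcard requires Python 3.8+.

```python
import xml.etree.ElementTree as ET


def parse_pom(pom_xml, source_file="pom.xml"):
    """Extract declared dependencies from pom.xml text without running Maven."""
    root = ET.fromstring(pom_xml)
    records = []
    # "{*}" matches any XML namespace, so POMs with or without xmlns both work.
    for dep in root.findall("./{*}dependencies/{*}dependency"):
        version = dep.findtext("{*}version")
        if version is not None and "${" in version:
            version = None  # property placeholder; needs a real Maven resolution
        records.append({
            "ecosystem": "maven",
            "package_name": f'{dep.findtext("{*}groupId")}:{dep.findtext("{*}artifactId")}',
            "version": version,
            # Maven's own scope name; mapping to the common schema happens later.
            "scope": dep.findtext("{*}scope") or "compile",
            "relationship": "direct",
            "source_type": "manifest",
            "source_file": source_file,
            "needs_resolution": True,  # transitive deps are unknown without a build
        })
    return records
```

For multi-module builds the same function would run per module, with the results merged under a component label.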

F) Go (Modules)

  • Source of truth:
    • go.mod/go.sum
  • Retrieval policy (opinionated):
    • Include only dependencies whose module path maps to GitHub (e.g., github.com/owner/repo)
    • Non-GitHub modules: ignore by default or queue for manual resolution
  • Mapping detail:
    • For each require in go.mod (and replacements), map module path → VCS host
    • If host is github.com, extract owner/repo; record the version from go.mod (go.sum holds checksums and can list versions that are not part of the final build)
  • Edge cases:
    • replace directives — honor replacement path/version; may change host
    • Pseudo-versions — record exact revision
    • Private or vanity domains — queue for manual resolution
  • Output: GitHub-hosted subset; list “skipped_modules” for transparency
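The GitHub-only policy, including `replace` handling, can be sketched as a small line-based go.mod reader. This is a simplified illustration rather than a full go.mod parser: replacements are keyed by module path only (version-specific replaces are not distinguished), and `parse_go_mod` is a hypothetical name.

```python
def parse_go_mod(text):
    """Return (included, skipped) module lists after applying replace directives."""
    requires, replaces, block = [], {}, None
    for raw in text.splitlines():
        line = raw.split("//", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        if line == ")":
            block = None
            continue
        if line.endswith("("):
            block = line[:-1].strip()  # e.g. "require" or "replace"
            continue
        directive, rest = (block, line) if block else (line.split(None, 1) + [""])[:2]
        if directive == "require":
            path, version = rest.split()[:2]
            requires.append({"path": path, "version": version,
                             "indirect": "// indirect" in raw})
        elif directive == "replace":
            old, new = rest.split("=>")
            target = new.split()
            replaces[old.split()[0]] = (target[0], target[1] if len(target) > 1 else None)

    included, skipped = [], []
    for mod in requires:
        if mod["path"] in replaces:
            new_path, new_version = replaces[mod["path"]]
            mod = {**mod, "path": new_path, "version": new_version or mod["version"],
                   "replaced_from": mod["path"]}
        parts = mod["path"].split("/")
        if parts[0] == "github.com" and len(parts) >= 3:
            included.append({**mod, "repo": f"{parts[1]}/{parts[2]}"})
        else:
            skipped.append(mod)  # vanity domains, local paths, other hosts
    return included, skipped
```

Applying replacements before the host check matters: a module required under a vanity path may resolve to GitHub after replacement, and the reverse is also possible.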

G) Others (fallback)

  • Scripted approach:
    • Download repo at the target ref (commit/branch/tag)
    • Scan file tree for common manifest/lock patterns
    • Generate an SBOM (e.g., using a proven local tool) for a first pass
    • Feed manifests + SBOM into an AI parser with instructions to:
      • extract probable ecosystem(s)
      • list dependencies, versions, and sources
      • highlight low-confidence items and missing lockfiles
    • Human review mandatory:
      • verify ecosystem mapping
      • confirm version pins and transitive completeness
  • Output: reviewed dependency list with confidence scores and reviewer stamp

Normalization and Output Schema

  • dependency_id: stable hash of (ecosystem, name, version, source_url)
  • repo: GitHub org/repo
  • commit_sha: the exact commit analyzed
  • ecosystem: npm|crates|pypi|nuget|maven|go|other
  • package_name: registry identifier or module path
  • version: pinned version or VCS ref; null if not resolved
  • scope: runtime|dev|build|test|optional (best-effort per ecosystem)
  • relationship: direct|transitive|unknown (see notes by ecosystem)
  • source_type: lockfile|manifest|build_output|sbom|ai_inference
  • source_file: path to the file used (e.g., package-lock.json)
  • registry_or_host: npmjs|crates.io|pypi.org|nuget.org|maven central|github|other
  • homepage_url: if available from registry metadata
  • license_spdx: if available (optional enrichment)
  • notes: free text for anomalies (peer deps, replace, private feed, etc.)
  • confidence: high|medium|low
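One possible shape for a record, with `dependency_id` derived from the four fields named above, is sketched here. It covers only a subset of the schema for brevity; the class name `DependencyRecord`, the choice of SHA-256, and the unit-separator delimiter are assumptions, not part of the original schema.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DependencyRecord:
    repo: str
    commit_sha: str
    ecosystem: str
    package_name: str
    version: Optional[str]
    source_type: str
    source_file: str
    relationship: str = "unknown"
    scope: str = "runtime"
    source_url: str = ""
    confidence: str = "medium"

    @property
    def dependency_id(self) -> str:
        """Stable hash of (ecosystem, name, version, source_url)."""
        # The unit separator avoids collisions such as ("a", "bc") vs ("ab", "c").
        key = "\x1f".join(
            [self.ecosystem, self.package_name, self.version or "", self.source_url]
        )
        return hashlib.sha256(key.encode("utf-8")).hexdigest()
```

Because `repo` and `commit_sha` are excluded from the hash, the same package version seen in two repositories gets the same `dependency_id`, which is what makes cross-repo aggregation straightforward.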

Determinism and Evidence

  • Prefer lockfiles or build-tool “dependency tree” outputs for determinism
  • Always store raw artifacts:
    • the exact lock/manifest files
    • the command(s) executed and their stdout/stderr
    • the SBOM (where generated)
  • Hash all inputs and include in output for reproducibility
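Hashing the inputs could be as simple as the sketch below, which assumes SHA-256 and repo-relative paths; `hash_inputs` is a hypothetical helper name.

```python
import hashlib
from pathlib import Path


def hash_inputs(paths, root="."):
    """Return {relative_path: sha256 hex digest} for every input file."""
    digests = {}
    for rel in sorted(paths):  # sorted so the output order is reproducible
        hasher = hashlib.sha256()
        with open(Path(root) / rel, "rb") as fh:
            # Read in chunks so large lockfiles do not need to fit in memory.
            for chunk in iter(lambda: fh.read(65536), b""):
                hasher.update(chunk)
        digests[rel] = hasher.hexdigest()
    return digests
```

Storing this map next to the emitted records lets a later run confirm it analyzed byte-identical manifests and lockfiles.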

Failure Modes and Mitigations

  • No lockfile:
    • Emit declared direct deps; set needs_resolution; attempt a dry-run lock in CI if permitted
  • Monorepos:
    • Discover multiple manifests; produce one graph per package/module, then aggregate with a “component” field
  • Vendored code:
    • Exclude vendored directories by default (node_modules, vendor/, third_party/) unless explicitly whitelisted
  • Private registries/feeds:
    • Record feed URL; do not attempt credentialed resolution unless configured
  • Rate limits:
    • Cache registry metadata; backoff and resume
  • Non-GitHub Go modules:
    • Queue for manual mapping; do not guess ownership

Quality Gates

  • Every dependency record must cite a source_type and source_file
  • For npm, direct vs transitive must be set
  • If ecosystem is go, record skipped_modules explicitly
  • If source_type is ai_inference, human review must be present before publish
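These gates translate naturally into a pre-publish check. The sketch below assumes records are plain dicts using the schema field names; `check_quality_gates` and its `reviewed` flag are hypothetical.

```python
def check_quality_gates(records, skipped_modules=None, reviewed=False):
    """Return a list of gate violations; an empty list means the gates pass."""
    errors = []
    for i, rec in enumerate(records):
        if not rec.get("source_type") or not rec.get("source_file"):
            errors.append(f"record {i}: missing source_type or source_file")
        if rec.get("ecosystem") == "npm" and rec.get("relationship") not in (
            "direct", "transitive"
        ):
            errors.append(f"record {i}: npm record needs direct/transitive label")
        if rec.get("source_type") == "ai_inference" and not reviewed:
            errors.append(f"record {i}: ai_inference requires human review")
    has_go = any(rec.get("ecosystem") == "go" for rec in records)
    if has_go and skipped_modules is None:
        # An empty list is fine; None means the skipped set was never recorded.
        errors.append("go: skipped_modules must be recorded explicitly")
    return errors
```

Returning all violations at once, rather than failing on the first, makes the check usable as a report in summary.md as well as a hard gate.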

Minimal Pseudocode

ecosystems = detect_ecosystems(repo_files)
for eco in ecosystems:
  skipped = []  # only populated by the Go branch
  if eco == "npm":
    graph = parse_npm(repo_files)
    graph = label_direct_from_package_json(graph)
  elif eco in {"crates", "pypi", "nuget", "maven"}:
    graph = parse_with_lock_or_tree(repo_files, eco)
  elif eco == "go":
    all_mods = parse_go_modules(repo_files)
    graph = [m for m in all_mods if host(m.path) == "github.com"]
    skipped = [m for m in all_mods if host(m.path) != "github.com"]
  else:
    # the SBOM is kept as evidence; only the AI parse goes to review
    sbom, ai_guess = scripted_sbom_and_ai_parse(repo_snapshot)
    graph = human_review(ai_guess)

  # pass the skipped Go modules along so they are reported, not silently dropped
  emit(normalize(graph), skipped_modules=skipped, evidence=evidence_blobs)

Rationale (opinionated)

  • Lockfiles and tool-generated trees are the most reliable snapshot of reality
  • npm gets explicit direct vs transitive labeling because package.json provides a clean signal and downstream analyses often depend on it
  • For Go, module paths are the canonical identity; restricting to GitHub-hosted modules ensures consistent owner/repo mapping for graph analytics. Non-GitHub modules frequently require custom VCS resolution or vanity domain logic; forcing manual review avoids false attributions
  • An “others” lane with SBOM+AI+human review prevents silent failure on less common ecosystems and unusual build systems

Deliverables

  • dependencies.jsonl adhering to the schema above
  • evidence/ directory containing:
    • raw manifests/lockfiles
    • build logs or dependency:tree outputs
    • sbom.* (if generated)
    • review_notes.md with human sign-off
  • summary.md with counts by ecosystem, direct vs transitive, unresolved items, and skipped Go modules
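A minimal sketch of producing dependencies.jsonl together with the counts that feed summary.md follows. It assumes records are plain dicts using the schema field names; `write_deliverables` is a hypothetical name, and rendering the counts into Markdown is left out.

```python
import json
from collections import Counter


def write_deliverables(records, jsonl_path):
    """Write one JSON object per line and return summary counts."""
    with open(jsonl_path, "w", encoding="utf-8") as fh:
        for rec in records:
            # sort_keys keeps the output byte-stable across runs
            fh.write(json.dumps(rec, sort_keys=True) + "\n")
    return {
        "by_ecosystem": dict(Counter(r["ecosystem"] for r in records)),
        "by_relationship": dict(
            Counter(r.get("relationship", "unknown") for r in records)
        ),
        "unresolved": sum(1 for r in records if r.get("version") is None),
    }
```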

ccerv1 commented Sep 26, 2025

flowchart TD
  A[Start: repo files & eco] --> B{Which ecosystem?}

  %% npm
  B -->|npm| N1[Parse npm files]
  N1 --> N2[Label direct vs transitive]
  N2 --> O1[[Output: Graph with labels]]

  %% crates / pypi / nuget / maven
  B -->|crates / pypi / nuget / maven| R1[Parse with lock or tree]
  R1 --> O2[[Output: Graph resolved]]

  %% go
  B -->|go| G1[Parse Go modules]
  G1 --> G2{Hosted on GitHub?}
  G2 -->|yes| G3[Include in graph]
  G2 -->|no| G4[Add to skipped list]
  G3 --> O3[[Output: GitHub-only graph]]
  G4 --> O3

  %% others
  B -->|other| X1[Run SBOM + AI parse]
  X1 --> X2[Human review]
  X2 --> O4[[Output: Reviewed graph]]

  %% normalization
  O1 --> Z[Normalize & emit artifacts]
  O2 --> Z
  O3 --> Z
  O4 --> Z