Produce a complete, source-attributable dependency set for a GitHub repo with clear provenance, direct vs transitive labeling (where feasible), and predictable fallbacks.
Detect ecosystem(s) from repo files
- npm: package.json, package-lock.json/yarn.lock/pnpm-lock.yaml
- Rust: Cargo.toml, Cargo.lock
- Python: pyproject.toml, poetry.lock, requirements*.txt, Pipfile.lock
- .NET: *.csproj, packages.config, Directory.Packages.props, packages.lock.json
- Java/Maven: pom.xml (optionally dependency:tree output)
- Go: go.mod, go.sum
- Else: mark as “other”
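The detection step above can be sketched as a filename match. This is a minimal illustration, not a definitive implementation: the function name `detect_ecosystems` is borrowed from the pseudocode later in this document, while the `ECOSYSTEM_PATTERNS` table and the return shape are assumptions made for the example.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Filename patterns per ecosystem, copied from the detection list above.
ECOSYSTEM_PATTERNS = {
    "npm": ["package.json", "package-lock.json", "yarn.lock", "pnpm-lock.yaml"],
    "crates": ["Cargo.toml", "Cargo.lock"],
    "pypi": ["pyproject.toml", "poetry.lock", "requirements*.txt", "Pipfile.lock"],
    "nuget": ["*.csproj", "packages.config", "Directory.Packages.props"],
    "maven": ["pom.xml"],
    "go": ["go.mod", "go.sum"],
}

def detect_ecosystems(repo_files):
    """Return {ecosystem: [matching paths]}, or {"other": []} if nothing matched."""
    found = {}
    for path in repo_files:
        name = PurePosixPath(path).name  # match on the file name, not the directory
        for eco, patterns in ECOSYSTEM_PATTERNS.items():
            if any(fnmatchcase(name, pattern) for pattern in patterns):
                found.setdefault(eco, []).append(path)
    return found or {"other": []}
```

Returning the matching paths, not just the ecosystem names, keeps the evidence trail: each later record can cite the exact file that triggered its branch.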
Follow ecosystem branch
- npm → gather all deps, then post-process to label direct vs transitive
- crates.io / PyPI / NuGet / Maven → gather all deps (direct+transitive)
- Go → gather deps hosted on GitHub; ignore or queue non-GitHub modules
- Others → scripted fetch + AI parse → human review
Normalize to a common schema and emit artifacts
- dependency records with fields listed in “Output Schema” below
- logs + evidence: the exact files, lock snapshots, and commands used
A) npm (Node.js)
- Source of truth:
- Prefer lockfiles (package-lock.json, yarn.lock, pnpm-lock.yaml) for a deterministic graph
- Fallback to package.json for direct deps if no lockfile
- Retrieval:
- Parse lockfile to enumerate full graph and versions
- Derive direct deps from package.json (dependencies, devDependencies, optionalDependencies, peerDependencies)
- Post-processing for direct vs transitive:
- Mark a package “direct” if it appears in package.json’s top-level sections
- Everything else present only via lockfile resolution is “transitive”
- Edge cases:
- Workspaces/monorepos (multiple package.json files)
- Peer dependencies (treat as direct if declared at top level; otherwise transitive)
- Git or file specifiers in package.json (record as non-registry source with URL)
- Engines and optional deps: include but flag type
- Output: full graph with source_file and lockfile_path per node
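As a sketch of the npm post-processing described above, the following labels each lockfile entry by checking it against the top-level sections of package.json. It assumes the `packages` map used by package-lock.json lockfileVersion 2 and 3; yarn.lock and pnpm-lock.yaml would need their own parsers. The function name `label_npm` and the record layout are hypothetical, and treating nested copies as transitive is a simplifying assumption.

```python
import json

# Top-level package.json sections that make a dependency "direct".
DIRECT_SECTIONS = ("dependencies", "devDependencies",
                   "optionalDependencies", "peerDependencies")

def label_npm(package_json_text, lock_text, lockfile_path="package-lock.json"):
    manifest = json.loads(package_json_text)
    direct = {name for section in DIRECT_SECTIONS
              for name in manifest.get(section, {})}
    records = []
    for key, entry in json.loads(lock_text).get("packages", {}).items():
        if "node_modules/" not in key:
            continue  # "" is the root project; other bare keys are workspace folders
        name = key.rsplit("node_modules/", 1)[1]
        nested = key.count("node_modules/") > 1  # installed under another package
        records.append({
            "ecosystem": "npm",
            "package_name": name,
            "version": entry.get("version"),
            "relationship": "direct" if name in direct and not nested else "transitive",
            "source_file": "package.json",
            "lockfile_path": lockfile_path,
        })
    return records
```

In a workspace or monorepo, this would be run once per package.json, since what counts as direct differs per package.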
B) Rust (crates.io)
- Source of truth:
- Cargo.lock for resolved versions; Cargo.toml for declared direct deps
- Retrieval:
- Use Cargo.lock to list all resolved crates (direct+transitive)
- Optionally mark direct by diffing Cargo.toml’s [dependencies]/[dev-dependencies]/[build-dependencies]
- Notes:
- Features can alter the graph; record active features if detectable
- Workspace members may have separate manifests; iterate members
- Output: all deps; direct flag optional (if Cargo.toml is reliably mapped)
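A minimal sketch of the Cargo.lock pass follows. It relies on the regular, machine-generated layout of Cargo.lock (one `[[package]]` table per crate) and uses line-anchored regular expressions instead of a TOML parser, which is an assumption made to keep the example dependency-free. The `direct_names` argument is hypothetical: it stands for the set of crate names already extracted from the Cargo.toml dependency tables.

```python
import re

def _field(chunk, key):
    """Read a top-level string field such as name = "serde" from one package table."""
    match = re.search(rf'^{key} = "([^"]*)"', chunk, re.MULTILINE)
    return match.group(1) if match else None

def parse_cargo_lock(lock_text, direct_names=None):
    records = []
    for chunk in lock_text.split("[[package]]")[1:]:  # [0] is the file header
        name, source = _field(chunk, "name"), _field(chunk, "source")
        if name is None or source is None:
            continue  # entries without a source are local or workspace crates
        if direct_names is None:
            relationship = "unknown"  # Cargo.toml was not mapped
        else:
            relationship = "direct" if name in direct_names else "transitive"
        records.append({
            "ecosystem": "crates",
            "package_name": name,
            "version": _field(chunk, "version"),
            "relationship": relationship,
            "source_type": "lockfile",
            "source_file": "Cargo.lock",
            "source_url": source,
        })
    return records
```

Passing `direct_names=None` matches the "direct flag optional" output: the graph is still complete, but the relationship stays `unknown`.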
C) Python (PyPI)
- Source of truth:
- Prefer lockfiles (poetry.lock, Pipfile.lock); otherwise requirements*.txt or pyproject.toml with a resolver
- Retrieval:
- If lockfile present, parse for pinned transitive graph
- If only requirements.txt, resolve to lock (deterministic) before graphing when possible; otherwise capture declared list and mark transitivity as unknown
- Edge cases:
- Extras (pkg[extra]) → record extra selector
- VCS/URL installs → record source URL and commit/ref if present
- Multiple requirement files (prod/dev/test) → tag environment
- Output: all deps when locked; otherwise declared-only with a needs_resolution flag
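For the no-lockfile case, the declared-only capture can be sketched as below. This is a deliberately small parser, not a full implementation of the requirements file format: it skips pip option lines (such as `-r` and `-e`), ignores environment markers, and only recognizes a single exact `==` pin. The names `parse_requirements`, `environment`, and `extras` are assumptions for the example; `needs_resolution` is the flag named above.

```python
import re

REQ_RE = re.compile(
    r"^(?P<name>[A-Za-z0-9][A-Za-z0-9._-]*)"  # distribution name
    r"\s*(?:\[(?P<extras>[^\]]*)\])?"         # optional [extra1,extra2]
    r"\s*(?P<spec>[^;]*)"                     # version specifier, up to any marker
)
PIN_RE = re.compile(r"==\s*([^\s,*=]+)")      # one exact pin, no wildcard
VCS_REF_RE = re.compile(r"@([^@/#]+)(?:#|$)") # trailing @ref on a URL

def parse_requirements(text, environment, source_file="requirements.txt"):
    records = []
    for raw in text.splitlines():
        line = re.sub(r"(^|\s)#.*$", "", raw).strip()  # drop comments
        if not line or line.startswith("-"):
            continue  # blank lines and pip options are not dependencies
        record = {"ecosystem": "pypi", "environment": environment,
                  "source_file": source_file, "relationship": "unknown",
                  "needs_resolution": True, "extras": [], "source_url": None}
        if line.startswith(("git+", "http://", "https://")):
            ref = VCS_REF_RE.search(line)
            record.update(package_name=None, source_url=line,
                          version=ref.group(1) if ref else None)
        else:
            match = REQ_RE.match(line)
            if not match:
                continue
            pin = PIN_RE.fullmatch(match.group("spec").strip())
            extras = [e.strip() for e in (match.group("extras") or "").split(",") if e.strip()]
            record.update(package_name=match.group("name"), extras=extras,
                          version=pin.group(1) if pin else None)
        records.append(record)
    return records
```

Calling it once per file with a different `environment` tag (for example prod, dev, test) covers the multiple-requirement-files edge case.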
D) .NET (NuGet)
- Source of truth:
- *.csproj and Directory.Packages.props; packages.lock.json when present
- Retrieval:
- Prefer packages.lock.json to get resolved transitive graph
- Otherwise parse csproj references and, if allowed, run restore to snapshot lock
- Edge cases:
- Central package management (Directory.Packages.props): merge centrally pinned versions with per-project references
- Private feeds — record source feed when available
- Output: full graph if lock exists; else direct declared; mark unresolved transitive
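When packages.lock.json exists, the resolved graph can be read directly. The sketch below assumes the usual layout of that file (a `dependencies` object keyed by target framework, with each package carrying `type` and `resolved`); the function name and the `component` field used for the framework are assumptions for illustration.

```python
import json

def parse_packages_lock(lock_text, source_file="packages.lock.json"):
    records = []
    frameworks = json.loads(lock_text).get("dependencies", {})
    for framework, packages in frameworks.items():
        for name, entry in packages.items():
            kind = entry.get("type", "").lower()
            if kind == "direct":
                relationship = "direct"
            elif kind.endswith("transitive"):
                relationship = "transitive"
            else:
                relationship = "unknown"  # e.g. project references
            records.append({
                "ecosystem": "nuget",
                "package_name": name,
                "version": entry.get("resolved"),
                "relationship": relationship,
                "component": framework,  # one graph per target framework
                "source_type": "lockfile",
                "source_file": source_file,
            })
    return records
```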
E) Java (Maven)
- Source of truth:
- pom.xml; optionally capture the output of mvn dependency:tree with -DoutputType=dot or -DoutputType=json (JSON output requires a recent maven-dependency-plugin)
- Retrieval:
- If build permissible, run dependency:tree to materialize the resolved graph
- Without build, parse pom.xml declared deps; transitive remains unresolved
- Edge cases:
- Multi-module builds — iterate all modules and merge graphs
- Profiles — record active profile assumptions
- Output: full graph when tree available; else declared set marked needs_resolution
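For the no-build path, the declared set can be read from pom.xml with the standard library. This sketch only looks at the project's own `<dependencies>` block; it does not resolve parent POMs, `dependencyManagement`, profiles, or `${...}` properties, so versions that depend on those are left as null. The function and helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}

def _child_text(node, tag):
    child = node.find(f"m:{tag}", POM_NS)
    return child.text.strip() if child is not None and child.text else None

def parse_pom_declared(pom_text, source_file="pom.xml"):
    root = ET.fromstring(pom_text)
    records = []
    for dep in root.findall("m:dependencies/m:dependency", POM_NS):
        version = _child_text(dep, "version")
        if version is not None and "${" in version:
            version = None  # property placeholder: not resolvable without a build
        records.append({
            "ecosystem": "maven",
            "package_name": f"{_child_text(dep, 'groupId')}:{_child_text(dep, 'artifactId')}",
            "version": version,
            "scope": "test" if _child_text(dep, "scope") == "test" else "runtime",
            "relationship": "direct",
            "needs_resolution": True,  # transitive deps are unknown without the tree
            "source_type": "manifest",
            "source_file": source_file,
        })
    return records
```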
F) Go (Modules)
- Source of truth:
- go.mod/go.sum
- Retrieval policy (opinionated):
- Include only dependencies whose module path maps to GitHub (e.g., github.com/owner/repo)
- Non-GitHub modules: ignore by default or queue for manual resolution
- Mapping detail:
- For each require in go.mod (and replacements), map module path → VCS host
- If host is github.com, extract owner/repo; record version from go.mod/go.sum
- Edge cases:
- replace directives — honor replacement path/version; may change host
- Pseudo-versions — record exact revision
- Private or vanity domains — queue for manual resolution
- Output: GitHub-hosted subset; list “skipped_modules” for transparency
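The Go policy above can be sketched as a two-pass read of go.mod: collect `require` and `replace` directives first, then apply replacements and split by host. This is a simplified reader, not a full go.mod parser: it ignores `exclude` and `retract`, and it applies a `replace` to every version of the old path. The function name `split_go_modules` and the record fields are assumptions; `// indirect` is the marker Go itself writes for modules not imported directly by the main module.

```python
def split_go_modules(go_mod_text):
    requires, replaces, block = [], {}, None
    for raw in go_mod_text.splitlines():
        indirect = "// indirect" in raw
        line = raw.split("//", 1)[0].strip()  # drop comments
        if not line:
            continue
        if line == ")":
            block = None
            continue
        if line.endswith("("):  # start of a require ( ... ) or replace ( ... ) block
            block = line[:-1].strip()
            continue
        if block:
            directive, rest = block, line
        else:
            directive, _, rest = line.partition(" ")
        if directive == "require":
            path, version = rest.split()[:2]
            requires.append((path, version, indirect))
        elif directive == "replace":
            old, _, new = rest.partition("=>")
            replaces[old.split()[0]] = new.split()

    graph, skipped = [], []
    for path, version, indirect in requires:
        if path in replaces:  # honor replace: it may change host and version
            target = replaces[path]
            path, version = target[0], (target[1] if len(target) > 1 else None)
        parts = path.split("/")
        if parts[0] == "github.com" and len(parts) >= 3:
            graph.append({"ecosystem": "go", "package_name": path, "version": version,
                          "repo": "/".join(parts[1:3]),
                          "relationship": "transitive" if indirect else "direct",
                          "source_type": "manifest", "source_file": "go.mod"})
        else:
            skipped.append(path)  # non-GitHub, vanity, or local path: manual queue
    return graph, skipped
```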
G) Others (fallback)
- Scripted approach:
- Download repo at the target ref (commit/branch/tag)
- Scan file tree for common manifest/lock patterns
- Generate an SBOM (e.g., using a proven local tool) for a first pass
- Feed manifests + SBOM into an AI parser with instructions to:
- extract probable ecosystem(s)
- list dependencies, versions, and sources
- highlight low-confidence items and missing lockfiles
- Human review mandatory:
- verify ecosystem mapping
- confirm version pins and transitive completeness
- Output: reviewed dependency list with confidence scores and reviewer stamp
Output Schema
- dependency_id: stable hash of (ecosystem, name, version, source_url)
- repo: GitHub org/repo
- commit_sha: the exact commit analyzed
- ecosystem: npm|crates|pypi|nuget|maven|go|other
- package_name: registry identifier or module path
- version: pinned version or VCS ref; null if not resolved
- scope: runtime|dev|build|test|optional (best-effort per ecosystem)
- relationship: direct|transitive|unknown (see notes by ecosystem)
- source_type: lockfile|manifest|build_output|sbom|ai_inference
- source_file: path to the file used (e.g., package-lock.json)
- registry_or_host: npmjs|crates.io|pypi.org|nuget.org|maven central|github|other
- homepage_url: if available from registry metadata
- license_spdx: if available (optional enrichment)
- notes: free text for anomalies (peer deps, replace, private feed, etc.)
- confidence: high|medium|low
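One way to derive the stable `dependency_id` is to hash a canonical serialization of the four identifying fields. The sketch below uses SHA-256 over a compact JSON array; both choices are assumptions, since the schema only requires that the hash be stable.

```python
import hashlib
import json

def dependency_id(ecosystem, name, version, source_url):
    """Stable hash of (ecosystem, name, version, source_url)."""
    # A JSON array keeps field boundaries unambiguous and encodes None as null.
    canonical = json.dumps([ecosystem, name, version, source_url],
                           separators=(",", ":"), ensure_ascii=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Serializing as an array, not a joined string, avoids collisions such as ("a", "b@1") versus ("a@b", "1"), and lets an unresolved version (null) hash differently from an empty string.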
- Prefer lockfiles or build-tool “dependency tree” outputs for determinism
- Always store raw artifacts:
- the exact lock/manifest files
- the command(s) executed and their stdout/stderr
- the SBOM (where generated)
- Hash all inputs and include in output for reproducibility
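Hashing the inputs can be as simple as a digest per evidence file. The following is a minimal sketch assuming SHA-256 and a path-to-digest mapping; the function name `hash_inputs` and the output shape are hypothetical.

```python
import hashlib
from pathlib import Path

def hash_inputs(root, relative_paths):
    """Return {relative path: SHA-256 hex digest} for each input file."""
    digests = {}
    for rel in sorted(relative_paths):  # sorted for a stable output order
        digests[rel] = hashlib.sha256((Path(root) / rel).read_bytes()).hexdigest()
    return digests
```

Storing this mapping next to the dependency records lets a later run confirm it analyzed byte-identical manifests and lockfiles.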
- No lockfile:
- Emit declared direct deps; set needs_resolution; attempt a dry-run lock in CI if permitted
- Monorepos:
- Discover multiple manifests; produce one graph per package/module, then aggregate with a “component” field
- Vendored code:
- Exclude vendored directories by default (node_modules, vendor/, third_party/) unless explicitly whitelisted
- Private registries/feeds:
- Record feed URL; do not attempt credentialed resolution unless configured
- Rate limits:
- Cache registry metadata; backoff and resume
- Non-GitHub Go modules:
- Queue for manual mapping; do not guess ownership
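The vendored-code rule above can be sketched as a path filter applied before manifest discovery. The directory names come from the rule itself; the function name `is_vendored` and the prefix-based `allowlist` argument are assumptions for illustration.

```python
from pathlib import PurePosixPath

# Directories excluded by default, per the vendored-code rule.
DEFAULT_EXCLUDED_DIRS = {"node_modules", "vendor", "third_party"}

def is_vendored(path, allowlist=()):
    """True if the file sits under an excluded directory that was not explicitly allowed."""
    if any(path.startswith(prefix) for prefix in allowlist):
        return False
    parent_dirs = PurePosixPath(path).parts[:-1]  # ignore the file name itself
    return any(part in DEFAULT_EXCLUDED_DIRS for part in parent_dirs)
```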
- Every dependency record must cite a source_type and source_file
- For npm, relationship must be set to direct or transitive (never unknown)
- If ecosystem is go, record skipped_modules explicitly
- If source_type is ai_inference, human review must be present before publish
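These rules lend themselves to a small pre-publish check. The sketch below returns a list of violations instead of raising, so all problems surface in one pass; the function name `validate` and its `skipped_modules` and `human_reviewed` parameters are assumptions about how the surrounding pipeline would pass that state in.

```python
def validate(records, skipped_modules=None, human_reviewed=False):
    """Return a list of rule violations; an empty list means the batch may be published."""
    errors = []
    for i, rec in enumerate(records):
        if not rec.get("source_type") or not rec.get("source_file"):
            errors.append(f"record {i}: missing source_type or source_file")
        if rec.get("ecosystem") == "npm" and rec.get("relationship") not in ("direct", "transitive"):
            errors.append(f"record {i}: npm relationship must be direct or transitive")
        if rec.get("source_type") == "ai_inference" and not human_reviewed:
            errors.append(f"record {i}: ai_inference requires human review")
    # For Go, an empty list is fine, but the list must be recorded explicitly.
    if any(r.get("ecosystem") == "go" for r in records) and skipped_modules is None:
        errors.append("go: skipped_modules must be recorded, even if empty")
    return errors
```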
# Pseudocode: helper names are placeholders, not an existing API
ecosystems = detect_ecosystems(repo_files)
for eco in ecosystems:
    skipped = []  # only populated on the Go branch
    if eco == "npm":
        graph = parse_npm(repo_files)
        graph = label_direct_from_package_json(graph)
    elif eco in {"crates", "pypi", "nuget", "maven"}:
        graph = parse_with_lock_or_tree(repo_files, eco)
    elif eco == "go":
        all_mods = parse_go_modules(repo_files)
        graph = [m for m in all_mods if host(m.path) == "github.com"]
        skipped = [m for m in all_mods if host(m.path) != "github.com"]
    else:
        sbom, ai_guess = scripted_sbom_and_ai_parse(repo_snapshot)
        graph = human_review(ai_guess)
    # Emit once per ecosystem so no graph is overwritten by the next iteration
    emit(normalize(graph), skipped_modules=skipped, evidence=evidence_blobs)
- Lockfiles and tool-generated trees are the most reliable snapshot of reality
- npm gets explicit direct vs transitive labeling because package.json provides a clean signal and downstream analyses often depend on it
- For Go, module paths are the canonical identity; restricting to GitHub-hosted modules ensures consistent owner/repo mapping for graph analytics. Non-GitHub modules frequently require custom VCS resolution or vanity domain logic; forcing manual review avoids false attributions
- An “others” lane with SBOM+AI+human review prevents silent failure on less common ecosystems and unusual build systems
- dependencies.jsonl adhering to the schema above
- evidence/ directory containing:
- raw manifests/lockfiles
- build logs or dependency:tree outputs
- sbom.* (if generated)
- review_notes.md with human sign-off
- summary.md with counts by ecosystem, direct vs transitive, unresolved items, and skipped Go modules
flowchart TD
    A[Start: repo files & eco] --> B{Which ecosystem?}

    %% npm
    B -->|npm| N1[Parse npm files]
    N1 --> N2[Label direct vs transitive]
    N2 --> O1[[Output: Graph with labels]]

    %% crates / pypi / nuget / maven
    B -->|crates / pypi / nuget / maven| R1[Parse with lock or tree]
    R1 --> O2[[Output: Graph resolved]]

    %% go
    B -->|go| G1[Parse Go modules]
    G1 --> G2{Hosted on GitHub?}
    G2 -->|yes| G3[Include in graph]
    G2 -->|no| G4[Add to skipped list]
    G3 --> O3[[Output: GitHub-only graph]]
    G4 --> O3

    %% others
    B -->|other| X1[Run SBOM + AI parse]
    X1 --> X2[Human review]
    X2 --> O4[[Output: Reviewed graph]]

    %% normalization
    O1 --> Z[Normalize & emit artifacts]
    O2 --> Z
    O3 --> Z
    O4 --> Z