Proposal Title: Development of an Open Source Software Dependency & Code Analytics Tool (MVP Phase)
Authors: Brian 'redbeard' Harrington
Date: April 29, 2025
Related Working Group: [Insert Relevant RISE Working Group, e.g., "Toolchain WG" or "Software Ecosystem WG"]
Technical Lead for Proposal Review: [Name of the assigned Technical Lead from the WG/TSC]
TSC Reviewers: [Names of at least 2 other TSC members for review]
Background
RISE's mission is to enhance the RISC-V software ecosystem. A significant challenge in porting software to new architectures like RISC-V is understanding the intricate web of software dependencies and identifying code that relies on architecture-specific features (e.g., ISA extensions like Vector or Atomic instructions). Modern software projects pull in dependencies from various ecosystems (PyPI, npm, Maven) that ultimately interact with system-level packages (RPMs, Debs). Furthermore, the dependency graph resolved for a project is not fixed: it varies significantly with the build environment, package manager version, target OS, and architecture.
Currently, no single open-source tool provides a comprehensive, contextual view of software dependencies that is correlated with analysis of source code content (specifically for architecture-specific patterns) and traversable across different packaging ecosystems. Such a tool should be able to answer questions like "Why was package X installed in this specific build environment?" or "Which packages in this dependency tree contain potential RISC-V assembly code?". Addressing this gap requires a dedicated, well-architected open-source solution.
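To illustrate the first question, the sketch below treats resolved dependencies as edges tagged with a resolution context and searches for a path from a project to a package within one context only. It is a minimal in-memory illustration, not the proposed Neo4j-backed design: the `ResolvesTo` record, the `why_installed` function, and all package and context names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ResolvesTo:
    """One resolved dependency edge; field names are hypothetical."""
    source: str
    target: str
    context_id: str


def why_installed(edges: List[ResolvesTo], root: str, target: str,
                  context_id: str) -> Optional[List[str]]:
    """Return a shortest dependency path root -> target within one
    resolution context, or None if the target is not reachable there."""
    # Keep only the edges that were resolved in the requested context.
    adjacency: Dict[str, List[str]] = {}
    for edge in edges:
        if edge.context_id == context_id:
            adjacency.setdefault(edge.source, []).append(edge.target)

    # Breadth-first search, remembering each node's parent to rebuild the path.
    parents: Dict[str, Optional[str]] = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node == target:
            path: List[str] = []
            current: Optional[str] = node
            while current is not None:
                path.append(current)
                current = parents[current]
            return path[::-1]
        for neighbor in adjacency.get(node, []):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


# Made-up data: the same project resolves differently per context.
EDGES = [
    ResolvesTo("app", "libfoo", "ctx-riscv64"),
    ResolvesTo("libfoo", "libbar-generic", "ctx-riscv64"),
    ResolvesTo("app", "libfoo", "ctx-x86_64"),
    ResolvesTo("libfoo", "libbar-simd", "ctx-x86_64"),
]

riscv_path = why_installed(EDGES, "app", "libbar-generic", "ctx-riscv64")
x86_path = why_installed(EDGES, "app", "libbar-generic", "ctx-x86_64")
```

With this made-up data, `libbar-generic` is reachable from `app` only in the `ctx-riscv64` context, so the same question has a different answer in each context.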
Challenge
The primary challenge is to build a scalable and flexible tool capable of:
- Ingesting and accurately modeling complex, context-dependent dependency graphs from multiple package ecosystems.
- Acquiring and analyzing source code artifacts associated with these packages to identify architecture-specific code patterns (initially, just identifying files likely containing assembly).
- Storing this combined graph and analysis data efficiently (leveraging a graph database and object storage).
- Providing a powerful, user-friendly interface for querying and visualizing dependency relationships and code characteristics under specific resolution contexts.
- Ensuring the system is modular and extensible to support new package managers, analysis types, and architectural features over time.
- Implementing robust security measures to handle the analysis of potentially untrusted code and manage access control.
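As one way to picture the extensibility requirement, the following Python sketch registers ecosystem-specific parsers behind a common interface. The `ManifestParser` name is taken from the milestones in this document; its method signature, the `Dependency` record, the registry functions, and the toy requirements-file parser are assumptions made for illustration and are not drawn from the design specification.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Protocol


@dataclass(frozen=True)
class Dependency:
    """A declared (unresolved) dependency; field names are hypothetical."""
    name: str
    constraint: str  # e.g. ">=1.2"; empty string when unconstrained


class ManifestParser(Protocol):
    """Assumed shape of a manifest-parsing plugin."""

    def parse(self, text: str) -> List[Dependency]: ...


_PARSERS: Dict[str, ManifestParser] = {}


def register_parser(ecosystem: str, parser: ManifestParser) -> None:
    """Make a parser available under an ecosystem key such as 'pypi'."""
    _PARSERS[ecosystem] = parser


def parse_manifest(ecosystem: str, text: str) -> List[Dependency]:
    """Dispatch to whichever plugin is registered for the ecosystem."""
    try:
        parser = _PARSERS[ecosystem]
    except KeyError:
        raise ValueError(f"no parser registered for {ecosystem!r}") from None
    return parser.parse(text)


class RequirementsTxtParser:
    """Toy parser: a leading name followed by an optional constraint.
    Extras, environment markers, and include directives are not handled."""

    _LINE = re.compile(r"^([A-Za-z0-9._-]+)\s*(.*)$")

    def parse(self, text: str) -> List[Dependency]:
        deps: List[Dependency] = []
        for raw in text.splitlines():
            line = raw.split("#", 1)[0].strip()  # drop comments
            match = self._LINE.match(line)
            if match:
                deps.append(Dependency(match.group(1), match.group(2).strip()))
        return deps


register_parser("pypi", RequirementsTxtParser())
parsed = parse_manifest("pypi", "numpy>=1.2\n# a comment\nrequests\n")
```

Under this arrangement, supporting another ecosystem means adding a class with a `parse` method and one `register_parser` call, without touching the dispatch code.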
Building such a system requires expertise across multiple domains, including dependency management tools, static analysis, graph databases, distributed systems, and security. While RISE member companies contribute valuable expertise, developing this comprehensive tool as a dedicated project exceeds typical contribution scopes and requires focused resources.
Current Solution
There is no single open-source tool that fulfills the requirements of this project. Existing solutions are often siloed:
- Package managers handle resolution but don't expose the contextual graph in a queryable format.
- Software Bill of Materials (SBOMs) often represent a resolved graph but struggle to capture the multiple possible graphs or link back to why a specific version was chosen in a context, and typically don't include deep code content analysis.
- Static analysis tools can find code patterns but are generally not integrated with dependency graph analysis or capable of correlating findings across a full, context-dependent dependency tree sourced from diverse ecosystems.
This project proposes to build a novel, integrated open-source solution addressing this gap.
Why Should RISE Fund
Funding the MVP phase of this Dependency & Code Analytics Tool directly aligns with and significantly advances RISE's core mission by:
- Directly Supporting RISC-V Porting: The tool provides critical insights into dependency trees and architecture-specific code usage, which are essential for identifying porting challenges and prioritizing effort within the RISC-V software ecosystem. The ability to query for potential assembly code (MVP) is a direct aid in finding areas needing architectural review.
- Enhancing Open Source Ecosystem Understanding: It provides a valuable open-source tool for the entire software community to better understand software supply chains, dependency complexity, and code characteristics across ecosystems, fostering more informed development and maintenance practices.
- Enabling Data-Driven Collaboration: By making dependency and code data queryable, it facilitates data-driven identification of common porting issues or architectural hotspots across projects, aiding collaborative upstream contributions.
- Showcasing Openness and Transparency: The project itself will be developed as open source, with a public design specification (v0.1 attached) and transparent development process, embodying RISE's principles.
- Addressing a Community Need: The lack of a comprehensive tool in this space is a recognized gap, particularly challenging for cross-architecture efforts. RISE funding will make this tool available for the benefit of the entire RISC-V and broader open-source community.
Milestones to Deliver
This RFP seeks proposals to deliver the Minimum Viable Product (MVP) phase as defined in the attached Design Specification (v0.1), focusing on establishing the core architecture, contextual dependency graph ingestion, basic file analysis (potential assembly), and fundamental query/visualization capabilities with robust AAA.
Proposals should break down the work into logical milestones, which could include but are not limited to:
- Milestone 1: Core Architecture & Framework Setup:
  - Establish core service components (Backend API, Worker, Job Queue).
  - Set up initial Neo4j and SpiceDB instances.
  - Implement basic API framework with Authentication and SpiceDB Authorization integration for initial endpoints (e.g., health checks, listing repositories a user can view).
  - Implement foundational job queuing and worker processing logic.
  - Implement basic configuration management.
- Milestone 2: Basic Data Ingestion & Modeling (Repo Metadata, Files):
  - Implement ScanRepository job handler and its associated plugin interface (RepositoryMetadataSource).
  - Develop initial plugins for one or two key package ecosystems (e.g., RPM, PyPI) sufficient for parsing repository metadata and getting package names, versions, and declared source pointers.
  - Implement AcquireSourceFiles and ExtractAndListSourceFiles job handlers.
  - Develop initial source acquisition/extraction plugins (SourceAcquirer, ArchiveExtractor).
  - Implement logic to create :Repository, :Package, and :File nodes and :FROM_REPOSITORY, :CONTAINS_FILE relationships in Neo4j.
  - Implement logic to detect and set the potential_assembly flag on :File nodes based on extensions.
  - Implement initial storage of file lists (path, checksum, type) in Object Storage.
- Milestone 3: Contextual Dependency Graph & Core Queries:
  - Implement ParseManifest job handler and its plugin interface (ManifestParser).
  - Enhance ecosystem plugins to parse detailed dependency declarations (:DECLARES_DEPENDENCY).
  - Implement ResolveDependencies job handler and its plugin interface (DependencyResolver).
  - Develop initial DependencyResolver plugin(s) (e.g., for RPM or PyPI) capable of resolving declared dependencies within a mocked or controlled set of available packages for a given context (OS/Arch/Resolver).
  - Implement logic to create :RESOLVES_TO relationships with full context properties (context_id, resolver_tool, etc.) in Neo4j.
  - Implement Backend API endpoints for core graph queries (GET /packages/{packageId}/dependencies, GET /query/dependency-path), leveraging resolution context parameters.
- Milestone 4: MVP UI & Admin Features, Correlation (Basic), Documentation, and Deployment:
  - Develop the MVP Public UI (Package search/list, Package details showing basic metadata, dependencies, files with potential_assembly flag).
  - Implement interactive graph visualization in the UI for dependency subgraphs filtered by context.
  - Develop the MVP Admin UI (Trigger scan jobs, basic job monitoring).
  - Implement basic :CORRELATES_TO relationship creation (e.g., manual correlation via API).
  - Finalize deployment artifacts (Dockerfiles, Docker Compose) for the MVP system.
  - Develop initial user and developer documentation covering installation, basic usage, architecture overview, and how to contribute new basic plugins (manifest parsing, file listing).
  - Establish initial CI/CD pipeline for automated testing and build artifact generation.
Proposers should propose specific deliverables for each milestone, including tested code, documentation updates, and deployment instructions.
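For orientation, the extension-based potential_assembly check named in Milestone 2 could look roughly like the Python sketch below. The extension set and both function names are assumptions made for illustration; the attached specification may define a different list or detection rule.

```python
from pathlib import PurePosixPath
from typing import Dict, Iterable

# Assumed extension list (compared case-insensitively, so ".S" also matches).
ASSEMBLY_EXTENSIONS = frozenset({".s", ".asm"})


def is_potential_assembly(path: str) -> bool:
    """Return True when the file extension suggests assembly source."""
    return PurePosixPath(path).suffix.lower() in ASSEMBLY_EXTENSIONS


def flag_files(paths: Iterable[str]) -> Dict[str, bool]:
    """Map each path from an extracted file list to its flag value."""
    return {path: is_potential_assembly(path) for path in paths}


flags = flag_files([
    "arch/riscv/kernel/entry.S",
    "src/main.c",
    "boot/start.asm",
    "README",
])
```

A check like this is deliberately coarse: it cannot see inline assembly or intrinsics embedded in C or C++ files, which is consistent with the flag being described as "potential" rather than definitive.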
Estimated Cost
Proposers should provide a detailed cost estimate based on their proposed approach and the milestones outlined above, or a modified set of milestones that still deliver the MVP as defined in the attached specification. The cost should be broken down per milestone.
Proposal Submission
Proposers should submit their bids via the Google Form on the Project wiki by [Date - min 1 week from publication].
Review Process
Proposals will be reviewed by the named Technical Lead and TSC Reviewers based on the proposer's understanding of the problem, proposed approach, technical expertise, plan for delivering the milestones, estimated cost, and alignment with open-source principles and the attached specification (v0.1). Recommendations will be made to the RISE Governing Board for final award approval.
Contracting
The awarded contractor will enter into a standard Linux Foundation Europe contract with a Statement of Work (SOW) and payment schedule based on the agreed-upon milestones.
Attachment: Design Specification: Open Source Software Dependency & Code Analytics Tool, Version 0.1 (Draft)