brianredbeard/draft_rfp.md

Last active April 29, 2025 21:54


package graph


draft_rfp.md

RISE TSC Project Proposal: Open Source Software Dependency & Code Analytics Tool (MVP)

Proposal Title: Development of an Open Source Software Dependency & Code Analytics Tool (MVP Phase)

Authors: Brian 'redbeard' Harrington

Date: April 29, 2025

Related Working Group: [Insert Relevant RISE Working Group, e.g., "Toolchain WG" or "Software Ecosystem WG"]

Technical Lead for Proposal Review: [Name of the assigned Technical Lead from the WG/TSC]

TSC Reviewers: [Names of at least 2 other TSC members for review]

Background

RISE's mission is to enhance the RISC-V software ecosystem. A significant challenge in porting software to new architectures like RISC-V is understanding the intricate web of software dependencies and identifying code that relies on architecture-specific features (e.g., ISA extensions such as Vector or Atomic instructions). Modern software projects pull in dependencies from various ecosystems (PyPI, npm, Maven) that ultimately interact with system-level packages (RPMs, Debs). Furthermore, the dependency graph resolved for a project is not fixed: it varies significantly with the build environment, package manager version, target OS, and architecture.

Currently, there is no single open-source tool that provides a comprehensive, contextual view of software dependencies correlated with the analysis of source code content (specifically for architecture-specific patterns), traversable across different packaging ecosystems, and capable of answering questions like "Why was package X installed in this specific build environment?" or "Which packages in this dependency tree contain potential RISC-V assembly code?". Addressing this gap requires a dedicated, well-architected open-source solution.

Challenge

The primary challenge is to build a scalable and flexible tool capable of:

Ingesting and accurately modeling complex, context-dependent dependency graphs from multiple package ecosystems.
Acquiring and analyzing source code artifacts associated with these packages to identify architecture-specific code patterns (initially, just identifying files likely containing assembly).
Storing this combined graph and analysis data efficiently (leveraging graph database and object storage).
Providing a powerful, user-friendly interface for querying and visualizing dependency relationships and code characteristics under specific resolution contexts.
Ensuring the system is modular and extensible to support new package managers, analysis types, and architectural features over time.
Implementing robust security measures to handle the analysis of potentially untrusted code and manage access control.

Building such a system requires expertise across multiple domains, including dependency management tools, static analysis, graph databases, distributed systems, and security. While RISE member companies contribute valuable expertise, developing this comprehensive tool as a dedicated project exceeds typical contribution scopes and requires focused resources.

Current Solution

There is no single open-source tool that fulfills the requirements of this project. Existing solutions are often siloed:

Package managers handle resolution but don't expose the contextual graph in a queryable format.
Software Bill of Materials (SBOMs) often represent a resolved graph but struggle to capture the multiple possible graphs or link back to why a specific version was chosen in a context, and typically don't include deep code content analysis.
Static analysis tools can find code patterns but are generally not integrated with dependency graph analysis or capable of correlating findings across a full, context-dependent dependency tree sourced from diverse ecosystems.

This project proposes to build a novel, integrated open-source solution addressing this gap.

Why should RISE Fund

Funding the MVP phase of this Dependency & Code Analytics Tool directly aligns with and significantly advances RISE's core mission by:

Directly Supporting RISC-V Porting: The tool provides critical insights into dependency trees and architecture-specific code usage, which are essential for identifying porting challenges and prioritizing effort within the RISC-V software ecosystem. The ability to query for potential assembly code (MVP) is a direct aid in finding areas needing architectural review.
Enhancing Open Source Ecosystem Understanding: It provides a valuable open-source tool for the entire software community to better understand software supply chains, dependency complexity, and code characteristics across ecosystems, fostering more informed development and maintenance practices.
Enabling Data-Driven Collaboration: By making dependency and code data queryable, it facilitates data-driven identification of common porting issues or architectural hotspots across projects, aiding collaborative upstream contributions.
Showcasing Openness and Transparency: The project itself will be developed as open source, with a public design specification (v0.1 attached) and transparent development process, embodying RISE's principles.
Addressing a Community Need: The lack of a comprehensive tool in this space is a recognized gap, particularly challenging for cross-architecture efforts. RISE funding will make this tool available for the benefit of the entire RISC-V and broader open-source community.

Milestones to Deliver

This RFP seeks proposals to deliver the Minimum Viable Product (MVP) phase as defined in the attached Design Specification (v0.1), focusing on establishing the core architecture, contextual dependency graph ingestion, basic file analysis (potential assembly), and fundamental query/visualization capabilities with robust authentication and authorization (AAA).

Proposals should break down the work into logical milestones, which could include but are not limited to:

Milestone 1: Core Architecture & Framework Setup:
- Establish core service components (Backend API, Worker, Job Queue).
- Set up initial Neo4j and SpiceDB instances.
- Implement basic API framework with Authentication and SpiceDB Authorization integration for initial endpoints (e.g., health checks, listing repositories user can view).
- Implement foundational job queuing and worker processing logic.
- Implement basic configuration management.
Milestone 2: Basic Data Ingestion & Modeling (Repo Metadata, Files):
- Implement ScanRepository job handler and its associated plugin interface (RepositoryMetadataSource).
- Develop initial plugins for one or two key package ecosystems (e.g., RPM, PyPI) sufficient for parsing repository metadata and getting package names, versions, and declared source pointers.
- Implement AcquireSourceFiles and ExtractAndListSourceFiles job handlers.
- Develop initial source acquisition/extraction plugins (SourceAcquirer, ArchiveExtractor).
- Implement logic to create :Repository, :Package, and :File nodes and :FROM_REPOSITORY, :CONTAINS_FILE relationships in Neo4j.
- Implement logic to detect and set the potential_assembly flag on :File nodes based on extensions.
- Implement initial storage of file lists (path, checksum, type) in Object Storage.
Milestone 3: Contextual Dependency Graph & Core Queries:
- Implement ParseManifest job handler and its plugin interface (ManifestParser).
- Enhance ecosystem plugins to parse detailed dependency declarations (:DECLARES_DEPENDENCY).
- Implement ResolveDependencies job handler and its plugin interface (DependencyResolver).
- Develop initial DependencyResolver plugin(s) (e.g., for RPM or PyPI) capable of resolving declared dependencies within a mocked or controlled set of available packages for a given context (OS/Arch/Resolver).
- Implement logic to create :RESOLVES_TO relationships with full context properties (context_id, resolver_tool, etc.) in Neo4j.
- Implement Backend API endpoints for core graph queries (GET /packages/{packageId}/dependencies, GET /query/dependency-path), leveraging resolution context parameters.
Milestone 4: MVP UI & Admin Features, Correlation (Basic), Documentation, and Deployment:
- Develop the MVP Public UI (Package search/list, Package details showing basic metadata, dependencies, files with potential_assembly flag).
- Implement interactive graph visualization in the UI for dependency subgraphs filtered by context.
- Develop the MVP Admin UI (Trigger scan jobs, basic job monitoring).
- Implement basic :CORRELATES_TO relationship creation (e.g., manual correlation via API).
- Finalize deployment artifacts (Dockerfiles, Docker Compose) for the MVP system.
- Develop initial user and developer documentation covering installation, basic usage, architecture overview, and how to contribute new basic plugins (manifest parsing, file listing).
- Establish initial CI/CD pipeline for automated testing and build artifact generation.

Proposers should specify deliverables for each milestone, including tested code, documentation updates, and deployment instructions.

Estimated Cost

Proposers should provide a detailed cost estimate based on their proposed approach and the milestones outlined above, or a modified set of milestones that still deliver the MVP as defined in the attached specification. The cost should be broken down per milestone.

Proposal Submission

Proposers should submit their bids via the Google Form on the Project wiki by [Date - min 1 week from publication].

Review Process

Proposals will be reviewed by the named Technical Lead and TSC Reviewers based on the proposer's understanding of the problem, proposed approach, technical expertise, plan for delivering the milestones, estimated cost, and alignment with open-source principles and the attached specification (v0.1). Recommendations will be made to the RISE Governing Board for final award approval.

Contracting

The awarded contractor will enter into a standard Linux Foundation Europe contract with a Statement of Work (SOW) and payment schedule based on the agreed-upon milestones.

Attachment: Design Specification: Open Source Software Dependency & Code Analytics Tool, Version 0.1 (Draft)


proposal.md

Design Specification: Open Source Software Dependency & Code Analytics Tool

Version 0.1 (Draft)

Date: April 29, 2025

Authors: Brian 'redbeard' Harrington

1. Introduction

1.1 Problem Statement

Understanding the complex landscape of software dependencies, especially within large open source ecosystems and distributions like Fedora, is a significant challenge. Dependency graphs are not static; they vary based on the resolution context (package manager, version, OS, architecture, available package set). Furthermore, gaining insight into the content of source code within these dependencies – such as the usage of architecture-specific instructions or compiler intrinsics – is critical for tasks like porting to new Instruction Set Architectures (ISAs) such as RISC-V, or for security analysis. Existing tools often focus solely on declared dependencies or static code analysis in isolation, lacking the ability to correlate complex dependency resolutions with deep code characteristics across diverse software supply chains.

1.2 Goals

MVP Goals:
    * Construct and store a comprehensive dependency graph across various package managers and ecosystems (RPM, npm, PyPI, etc.) in a graph database (Neo4j).
    * Model the dependency graph resolution context explicitly to represent how dependencies resolve differently based on factors like package manager, version, target OS, and architecture.
    * Store core package and file metadata (name, version, checksums, paths) in Neo4j.
    * Identify files within packages that potentially contain assembly code based on file extensions.
    * Enable complex querying against the contextual dependency graph to answer questions like "Why is package X a transitive dependency in Context A?", "What packages depend on package Y?", and "Show dependency paths that include packages with potential assembly files in Context B?".
    * Provide a read-only UI (Public View) for exploring packages, visualizing dependency subgraphs for specific contexts, and executing predefined or guided queries.
    * Provide an administrative UI (Admin View) for initiating and monitoring data ingestion jobs.
    * Implement a Relationship-Based Access Control (ReBAC) system (SpiceDB) to secure Admin and Public views and potentially enable more granular permissions in the future.
    * Design the ingestion pipeline and data model with abstractions to facilitate adding support for new package managers and ecosystems.
Future Goals (Post-MVP):
    * Perform detailed static source code analysis for specific patterns (e.g., inline assembly, compiler intrinsics, specific ISA instruction mnemonics) across different languages and architectures.
    * Determine, to the extent feasible via static analysis and build metadata, if detected ISA usage appears required or optional.
    * Store these detailed analysis results (what instruction/intrinsic, where found, architecture, assessment) externally in commodity object storage (S3-compatible, GCS, etc.).
    * Enable querying that combines dependency graph traversal (Neo4j) with filtering/annotation based on detailed analysis data (Object Storage).
    * Provide functionality to correlate packages across different ecosystems that are derived from the same upstream source.
    * Extend the UI to visualize detailed analysis findings and enable combined queries.
    * Design and implement a robust extensibility model for source code analysis plugins.
    * Implement advanced operational features (detailed monitoring, alerting, automated scaling).
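
As an illustration of the MVP goal of identifying potential assembly files by extension, here is a minimal sketch. The extension list and the function names are assumptions made for this example; the specification does not enumerate which extensions count.

```python
from pathlib import PurePosixPath

# Assumed extension set for illustration only; the specification does not fix one.
ASSEMBLY_EXTENSIONS = {".s", ".asm"}


def is_potential_assembly(path: str) -> bool:
    """Return True when the file extension suggests assembly source."""
    suffix = PurePosixPath(path).suffix
    # Compare case-insensitively so that both "foo.s" and "foo.S" match.
    return suffix.lower() in ASSEMBLY_EXTENSIONS


def flag_files(paths):
    """Build per-file records carrying the potential_assembly flag."""
    return [
        {"path": p, "potential_assembly": is_potential_assembly(p)}
        for p in paths
    ]
```

An extension check is only a coarse filter: it cannot see inline assembly in C or C++ sources, which is why detailed source analysis is listed under the post-MVP goals.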

1.3 Non-Goals (Initial MVP)

Runtime or dynamic analysis of software behavior.
Comprehensive vulnerability scanning (though analysis findings could inform this).
Automatic dependency resolution or installation capabilities.
Suggesting fixes or remediation actions for detected issues.
Storing full source code content within the database.
Analysis of binary artifacts (EXEs, SOs, DLLs, etc.).
Providing legal advice based on license analysis.

1.4 Target Audience

Software engineers involved in porting to new architectures (e.g., RISC-V).
Package maintainers for operating system distributions and language ecosystems.
Software supply chain security analysts.
Developers interested in understanding the composition of their dependencies.
Open source contributors to the tool itself.

2. User Stories

Investigating Dependency Origins and ISA Usage for RISC-V Porting:
    * As a Porting Engineer targeting a new architecture (e.g., RISC-V) or a Package Maintainer debugging build environments,
    * I want to:
        * Understand the full dependency closure for a specific package version within a defined build context (e.g., pypi:pytorch:2.1.2 resolved using pip on Ubuntu 22.04 x86_64, or rpm:pytorch:2.1.2-1.fc39 resolved using dnf on Fedora 39 aarch64).
        * Traverse the dependency graph to see why specific versions of transitive dependencies were included.
        * Identify files within those packages that contain architecture-specific code patterns (such as ISA-specific assembly or intrinsics) and view the detailed analysis findings for those files.
        * Correlate the package with its counterparts in other ecosystems or with its upstream sources,
    * So that I can:
        * Identify the exact set of dependencies that need to be considered for porting or build environment replication.
        * Pinpoint which packages and specific files require architectural review or modification due to detected ISA usage.
        * Prioritize porting effort based on detected required ISA extensions versus optional usage.
        * Understand differences in dependency trees and package contents between various distribution methods (e.g., PyPI wheels vs. RPMs) or target architectures.
        * Trace a system package back to its upstream source or equivalent packages in other language ecosystems to find relevant community discussions or alternative implementations.
        * Debug unexpected dependencies or build failures related to differing resolution outcomes across environments.
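
The "why was this transitive dependency included in this context?" question in the story above amounts to a path search over resolution edges that belong to one context. The sketch below shows the idea in memory with a breadth-first search; in the actual system this would be a Neo4j traversal, and the `ResolvesTo` tuple and package identifiers here are hypothetical.

```python
from collections import deque
from typing import List, NamedTuple, Optional


class ResolvesTo(NamedTuple):
    """Hypothetical in-memory stand-in for a :RESOLVES_TO relationship."""
    source: str
    target: str
    context_id: str


def dependency_path(edges, root: str, target: str, context_id: str) -> Optional[List[str]]:
    """Return the shortest path from root to target using only edges of one context."""
    adjacency = {}
    for edge in edges:
        if edge.context_id == context_id:
            adjacency.setdefault(edge.source, []).append(edge.target)

    queue = deque([[root]])
    seen = {root}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # target is not reachable in this context
```

The same pair of packages can be connected in one context and unconnected in another, which is the behavior the context-qualified edges are meant to capture.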

3. Architecture Overview

The system follows a microservices-oriented architecture centered around a graph database and external object storage, orchestrated via a job queue and exposed through a backend API and frontend UI.

[Conceptual Diagram Here - Components: Data Sources -> Ingestion/Scanning Workers -> Job Queue -> Backend API <- UI, Backend API <-> Neo4j, Backend API <-> Object Storage, Backend API <-> SpiceDB]

Data Sources: Package manager repositories (repodata, registries), Source code repositories (Git), Package archives (SRPMs, wheels, tarballs).
Ingestion/Scanning Workers: Processes responsible for acquiring data from sources, parsing manifests, acquiring source code, listing files, resolving dependencies, and performing static code analysis. They communicate with the Job Queue and load specific plugins for different ecosystems and analysis types.
Job Queue: Decouples the process of requesting work (via API or internal triggers) from the execution of that work by the workers. Ensures asynchronous processing and provides reliability.
Backend API Service: A Node.js application serving as the central interface. Receives requests from the UI and external clients, validates them, interacts with SpiceDB for authorization, orchestrates jobs by placing them on the queue, queries Neo4j for graph data, and retrieves/processes detailed analysis results from Object Storage.
Frontend UI: A JavaScript web application (using a framework like React/Vue/Angular, built with Node.js tooling) providing the user interface for browsing, querying, visualization, and administering the system. Interacts with the Backend API.
Graph Database (Neo4j): Stores the core dependency graph topology, package/file metadata, and summary analysis findings. Optimized for relationships and traversal queries.
Object Storage: Commodity storage (S3-compatible) for storing detailed, potentially large, results from static code analysis jobs (e.g., list of detected ISA instructions per file).
ReBAC Service (SpiceDB): Stores the authorization model and relationships, providing permission check (CheckPermission) and resource listing (LookupResources) capabilities to the Backend API.

4. Data Model

Data is stored across two systems: Neo4j for the graph structure and core metadata, and Object Storage for detailed analysis findings.

4.1 Neo4j Data Model

Node Labels & Properties:
    * :Repository (url, vcs_type, last_scanned_identifier, name)
    * :Package (name, version, ecosystem, epoch, release, arch, checksum, license, declared_source_url, package_manager_metadata, detected_isa_usage_summary)
    * :File (path, checksum, inferred_type, potential_assembly, analysis_results_uri)
Relationship Types & Properties:
    * :FROM_REPOSITORY (scanned_identifier)
    * :DECLARES_DEPENDENCY (type, specifier, via_package_manager, raw_declaration)
    * :RESOLVES_TO (specifier, via_package_manager, context_id, resolver_tool, resolver_version, target_os, target_arch, package_set_id, resolution_timestamp)
    * :CONTAINS_FILE (no properties)
    * :CORRELATES_TO (method, confidence, details)
Constraints/Indices: Unique constraints on :Repository and :Package; indices on key properties and relationship types.
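
One way to picture the context properties on :RESOLVES_TO is as a small value object from which context_id is derived. The specification does not say how context_id is generated; hashing the context fields, as below, is an assumption made for illustration, as are the class and function names.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ResolutionContext:
    """Context fields taken from the :RESOLVES_TO property list."""
    resolver_tool: str
    resolver_version: str
    target_os: str
    target_arch: str
    package_set_id: str

    def context_id(self) -> str:
        # Assumption: a deterministic ID derived from the canonical JSON of the fields.
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def resolves_to_properties(ctx: ResolutionContext, specifier: str,
                           via_package_manager: str, resolution_timestamp: str) -> dict:
    """Assemble the full property map for one :RESOLVES_TO relationship."""
    props = asdict(ctx)
    props.update(
        context_id=ctx.context_id(),
        specifier=specifier,
        via_package_manager=via_package_manager,
        resolution_timestamp=resolution_timestamp,
    )
    return props
```

A deterministic ID means that re-running a resolution in an identical context produces the same context_id, which fits the idempotency goals of the ETL pipeline.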

4.2 Object Storage Data Model

Purpose: Store detailed, granular results of analysis jobs (e.g., lists of findings per file).
Object Keying Scheme: analysis_results/{analysis_type}/{package_checksum}/{file_checksum}.{format}
File Format: Parquet is the recommended format for structured analysis results due to its efficiency for storage and analytical queries. JSON Lines is an acceptable simpler alternative.
Data Schemas (Example for isa_usage_scan):
* Object Schema includes file_checksum, package_checksum, analysis_timestamp, analyzer_version, architecture, and a list of findings.
* finding Struct includes pattern_id, isa_extension, location, context, code_snippet, is_conditional, conditional_symbol, heuristic_assessment, details.
Linking Neo4j to Object Storage: File.checksum and Package.checksum in Neo4j nodes are used to construct the object key in Object Storage.
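
The keying scheme above can be sketched as a small helper, together with a JSON Lines serializer for the simpler of the two formats. The function names and the separator check are assumptions for this example, not part of the specification.

```python
import json


def analysis_object_key(analysis_type: str, package_checksum: str,
                        file_checksum: str, fmt: str = "parquet") -> str:
    """Build the object key analysis_results/{type}/{package}/{file}.{format}."""
    for part in (analysis_type, package_checksum, file_checksum, fmt):
        # Reject empty parts and path separators so a key cannot escape its prefix.
        if not part or "/" in part:
            raise ValueError(f"invalid key component: {part!r}")
    return f"analysis_results/{analysis_type}/{package_checksum}/{file_checksum}.{fmt}"


def to_json_lines(findings) -> str:
    """Serialize findings as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(finding, sort_keys=True) for finding in findings)
```

Because both checksums are already stored on the Neo4j nodes, the API can compute the key for any :File without an extra lookup table.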

5. ETL Pipeline & Job Management

Architecture: An API service receives requests and queues jobs. A pool of workers pulls jobs from the queue. Job Handlers within workers execute specific ETL tasks leveraging plugin interfaces.
Job Queue: Decouples API from workers (implementation detail TBD - e.g., Redis, RabbitMQ).
Job Types: Formal definitions for job types corresponding to ETL stages: ScanRepository, ParseManifest, AcquireSourceFiles, ExtractAndListSourceFiles, ResolveDependencies, ScheduleFileAnalysis, AnalyzeFileContent, SummarizePackageAnalysis, CorrelatePackages.
Process Flow: Each job handler performs a specific task (Extract, Transform, Load to Neo4j or Object Storage) and may enqueue subsequent jobs. Diagrams will be used to illustrate complex flows (e.g., how a ScanRepository job triggers ParseManifest and ResolveDependencies).
Error Handling: Jobs will have retry mechanisms for transient errors. Permanent errors will mark the job as failed and log details. Worker crashes should not lose job state from the queue.
Idempotency: Jobs will be designed to be safely retried (e.g., by using checksums, relying on Neo4j constraints, or using optimistic locking/conditional writes where necessary).
Concurrency: The worker pool size will be configurable. Potential resource limits per job or per worker instance will be implemented, especially for source acquisition and analysis jobs.
Abstractions: Plugin interfaces (ManifestParser, SourceAcquirer, DependencyResolver, CodeAnalyzer) are key to the ETL process, enabling modularity and extensibility.
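
The process flow and error handling described above can be sketched with an in-memory queue. This is a simplified stand-in for whichever queue technology is eventually chosen; the `Job`, `TransientError`, and `InMemoryQueue` names and the retry limit are assumptions for the example.

```python
from collections import deque
from dataclasses import dataclass


class TransientError(Exception):
    """Raised by a handler when the failure may succeed on retry."""


@dataclass
class Job:
    job_type: str
    payload: dict
    attempts: int = 0
    status: str = "queued"


class InMemoryQueue:
    def __init__(self, max_attempts: int = 3):
        self.pending = deque()
        self.finished = []
        self.max_attempts = max_attempts

    def enqueue(self, job: Job) -> None:
        self.pending.append(job)

    def run(self, handlers: dict) -> None:
        """Process jobs until the queue is empty, retrying transient failures."""
        while self.pending:
            job = self.pending.popleft()
            job.attempts += 1
            try:
                follow_ups = handlers[job.job_type](job.payload) or []
            except TransientError:
                if job.attempts < self.max_attempts:
                    self.pending.append(job)  # put it back for another attempt
                    continue
                job.status = "failed"
            except Exception:
                job.status = "failed"  # permanent error: no retry
            else:
                job.status = "succeeded"
                for follow_up in follow_ups:
                    self.enqueue(follow_up)  # e.g. ScanRepository -> ParseManifest
            self.finished.append(job)
```

A handler returns the follow-up jobs it wants enqueued, which is how one stage (for example ScanRepository) can trigger the next (ParseManifest) without the stages knowing about each other's internals.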

6. Backend API Specification

Base URL: /api/v1
Authentication: Standard mechanisms (JWT, API Key, etc.) identify the user for ReBAC checks.
General Structure: RESTful, JSON request/response, standard HTTP status codes, JSON error bodies. Pagination via limit/offset.
Endpoints:
    * System & Job Management: GET /jobs, GET /jobs/{jobId}, GET /jobs/{jobId}/logs, POST /jobs/{jobId}/cancel, POST /scan/repository, POST /analyze/package/{packageChecksum}/{analysisType}, GET /system/config, PUT /system/config.
    * Repository Endpoints: GET /repositories, GET /repositories/{repositoryId}.
    * Package Endpoints: GET /packages, GET /packages/{packageId}, GET /packages/{packageId}/files.
    * Dependency Graph Query: GET /packages/{packageId}/dependencies, GET /query/dependency-path, POST /query/cypher (Optional Admin).
    * Analysis Endpoints: GET /packages/{packageId}/analysis-summary, GET /packages/{packageId}/analysis-results/{analysisType}, GET /analysis-types.
    * Correlation Endpoints: GET /packages/{packageId}/correlations, POST /packages/{packageId}/correlations (Admin).
Authorization: Every endpoint is protected by ReBAC checks via SpiceDB, mapping API calls to permission checks (object:...#permission@user:...). Listing endpoints use SpiceDB LookupResources.
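
The limit/offset pagination and JSON error bodies mentioned under General Structure could look roughly like the sketch below. The response envelope (`items`, `total`) and the error shape are hypothetical; the specification only states that pagination uses limit/offset and that errors are returned as JSON.

```python
def paginate(items, limit: int = 50, offset: int = 0, max_limit: int = 200) -> dict:
    """Return one page of items plus the values a client needs to fetch the next page."""
    if limit < 1 or offset < 0:
        raise ValueError("limit must be >= 1 and offset must be >= 0")
    limit = min(limit, max_limit)  # cap the page size to protect the backend
    return {
        "items": items[offset:offset + limit],
        "limit": limit,
        "offset": offset,
        "total": len(items),
    }


def error_body(status: int, message: str) -> dict:
    """Build a JSON-serializable error body (hypothetical shape)."""
    return {"error": {"status": status, "message": message}}
```

For listing endpoints, the page would be taken from the set of resources the caller is permitted to see, so that `total` does not leak the existence of hidden objects.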

7. User Interface (UI) Specification

Technology: JavaScript frontend framework (React, Vue, or Angular TBD), built with Node.js tooling.
Views: Dashboard, Repositories/Ecosystems, Package Catalog, Graph Explorer, Query Interface, Job Monitor.
Functionality:
* Public View (Read-only): Search packages, view package details (metadata, dependencies, files, analysis summary), navigate relationships, visualize dependency graphs for specific contexts, execute predefined and guided queries (including combined graph + analysis queries in future), view correlations.
* Admin View: All Public View capabilities plus: Initiate scan and analysis jobs, monitor job status and logs, cancel jobs, manage system configuration.
Graph Visualization: Interactive visualization of Neo4j subgraphs. Must handle different layouts, filtering by node/edge properties (including resolution context and analysis summaries), expanding/collapsing nodes, linking to detail views. Performance on larger graphs is a key challenge.
Query Interface: Guided builder for common queries, ability to specify resolution context for graph queries, presentation of results (graph and tabular).
Analysis Result Display: Structured presentation of detailed analysis findings fetched from Object Storage, potentially linked to a source code viewer component.

8. Authentication and Authorization (AAA) with ReBAC

Strategy: Relationship-Based Access Control (ReBAC).
Implementation: SpiceDB (or a comparable Zanzibar-style service such as OpenFGA) will be used as the authorization service.
SpiceDB Schema: Defines object types (user, group, repository, package, job, system_config) and relations (member, admin, viewer, trigger_scan, view_logs, cancel, edit, view_analysis, repository, resource, editor).
Permission Modeling: Permissions (view, trigger_scan, etc.) are defined using logic over relations (e.g., permission view = viewer + repository->admin, where + denotes union in the SpiceDB schema language).
Admin/Public Split: Implemented by assigning users to groups (group:admins, group:public_viewers) and granting relations (admin, viewer, etc.) to these groups on relevant object types in SpiceDB.
Backend Integration: The Backend API acts as a Policy Enforcement Point (PEP), querying SpiceDB (Policy Decision Point - PDP) using CheckPermission and LookupResources calls for every request that requires authorization.
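
To make the permission logic concrete, the sketch below evaluates the example rule from Permission Modeling (a user may view a package if they are a direct viewer, or an admin of the package's repository) over an in-memory set of relationship tuples. It does not use the SpiceDB client or API; `RelationStore` and the helper functions are hypothetical, and group membership is left out for brevity.

```python
class RelationStore:
    """Toy store of (resource, relation, subject) tuples."""

    def __init__(self, tuples):
        self.tuples = set(tuples)

    def has_relation(self, resource: str, relation: str, subject: str) -> bool:
        return (resource, relation, subject) in self.tuples

    def related(self, resource: str, relation: str) -> set:
        """All subjects that hold the given relation on the resource."""
        return {s for (r, rel, s) in self.tuples if r == resource and rel == relation}


def can_view_package(store: RelationStore, package: str, user: str) -> bool:
    """view = direct viewer, or admin of any repository linked to the package."""
    if store.has_relation(package, "viewer", user):
        return True
    return any(
        store.has_relation(repo, "admin", user)
        for repo in store.related(package, "repository")
    )


def lookup_viewable(store: RelationStore, packages, user: str) -> list:
    """Analogue of a resource lookup: filter packages down to those the user may view."""
    return [p for p in packages if can_view_package(store, p, user)]
```

In the real system both functions would be replaced by calls to the authorization service, so the Backend API never embeds the policy itself.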

9. Extensibility Model

Architecture: Plugin-based for ecosystem scanning/parsing and source code analysis.
Interfaces: Formal interfaces (ManifestParser, SourceAcquirer, DependencyResolver, CodeAnalyzer) define the contract for plugins.
Discovery & Loading: Plugins installed alongside workers, loaded via configuration or entry points.
Isolation: Plugins run in sandboxed environments (containers) for security and resource control.
Contribution: New plugins can be contributed by implementing the defined interfaces and updating configuration/deployment.
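
A plugin contract of this kind might be expressed as follows. The specification names the ManifestParser interface but not its methods, so the `parse` signature, the `DeclaredDependency` record, the registry, and the example parser are all assumptions for illustration. The example parser handles only `name==version` lines and ignores everything else.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol


@dataclass(frozen=True)
class DeclaredDependency:
    name: str
    specifier: str
    raw_declaration: str


class ManifestParser(Protocol):
    """Hypothetical contract: one parser per ecosystem."""
    ecosystem: str

    def parse(self, manifest_text: str) -> List[DeclaredDependency]: ...


class PinnedRequirementsParser:
    """Example plugin for requirements-style files with pinned versions only."""
    ecosystem = "pypi"

    def parse(self, manifest_text: str) -> List[DeclaredDependency]:
        deps = []
        for raw in manifest_text.splitlines():
            line = raw.split("#", 1)[0].strip()  # drop trailing comments
            name, sep, version = line.partition("==")
            if not sep:
                continue  # blank, comment-only, or unpinned line: ignored in this sketch
            deps.append(DeclaredDependency(name.strip(), "==" + version.strip(), raw.strip()))
        return deps


PARSERS: Dict[str, ManifestParser] = {}


def register(parser: ManifestParser) -> None:
    PARSERS[parser.ecosystem] = parser


def parse_manifest(ecosystem: str, manifest_text: str) -> List[DeclaredDependency]:
    return PARSERS[ecosystem].parse(manifest_text)
```

Keeping the raw declaration alongside the parsed fields mirrors the raw_declaration property on :DECLARES_DEPENDENCY and makes parser bugs easier to diagnose later.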

10. Testing Strategy & Testability

Levels: Unit, Integration, System (End-to-End), Performance, Security.
Methodology: Automated tests integrated into CI/CD pipelines. Test data management using curated samples and subsets of real-world data.
Testability: Design components and plugin interfaces to be easily testable in isolation or with mocked dependencies.
Security Testing: Include fuzzing and vulnerability scanning of project dependencies.

11. Migration and Backward Compatibility

Neo4j: Database migration tool (Flyway/Liquibase/Custom) for schema evolution. Versioned migration scripts.
Object Storage: Versioned analysis result schemas within the data files. Support for reading multiple past schema versions. Tools for optional data re-processing on schema changes.
API: URL-based versioning (/v1, /v2). Clear deprecation policy for old versions.
Application: Versioned deployment artifacts. Documented upgrade procedures covering data migration steps.
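
Reading multiple past schema versions of analysis results can be handled by upgrading old records step by step on read. The sketch below assumes a `schema_version` field inside each record and an invented difference between versions 1 and 2; neither is defined by the specification.

```python
CURRENT_SCHEMA_VERSION = 2


def upgrade_v1_to_v2(record: dict) -> dict:
    """Hypothetical change: version 2 made the architecture field mandatory."""
    upgraded = dict(record)  # never mutate the stored record
    upgraded.setdefault("architecture", "unknown")
    upgraded["schema_version"] = 2
    return upgraded


# One upgrader per source version; each returns a record one version newer.
UPGRADERS = {1: upgrade_v1_to_v2}


def read_record(record: dict) -> dict:
    """Return the record in the current schema, upgrading older versions on the fly."""
    version = record.get("schema_version", 1)
    if version > CURRENT_SCHEMA_VERSION:
        raise ValueError(f"record uses newer schema version {version}")
    while version < CURRENT_SCHEMA_VERSION:
        record = UPGRADERS[version](record)
        version = record["schema_version"]
    return record
```

Upgrading on read keeps old objects untouched in storage, which leaves bulk re-processing as an optional step rather than a requirement of every schema change.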

12. Security Considerations

Untrusted Input: Treat source code and manifests as untrusted. Process in isolated, resource-limited sandbox environments (containers). Input validation and sanitization.
Secrets Management: Use environment variables, Docker Secrets, Kubernetes Secrets, or dedicated secrets managers. No secrets in code or version control.
Dependency Security: Regular scanning of project dependencies for vulnerabilities.
Least Privilege: Components and workers run with minimal necessary permissions.
Auditing: Log security-relevant events.
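
As one example of input validation for untrusted archives, the sketch below rejects archive member names that would land outside the extraction directory. It only covers path traversal by name; symlinks, device files, and resource limits still need the sandboxing described above. The function names are hypothetical.

```python
import posixpath


def is_safe_member(name: str) -> bool:
    """Return True if an archive member name stays inside the extraction directory."""
    if not name or posixpath.isabs(name):
        return False
    normalized = posixpath.normpath(name)
    # After normalization, any remaining leading ".." means the path escapes the root.
    return normalized != ".." and not normalized.startswith("../")


def split_members(names):
    """Partition member names into (safe, rejected) lists."""
    safe = [n for n in names if is_safe_member(n)]
    rejected = [n for n in names if not is_safe_member(n)]
    return safe, rejected
```

Rejected names are worth logging as security-relevant events, since a package that ships such paths is either malformed or hostile.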

13. Deployment and Operational Detail

Deployment Model: Primarily containerized (Docker Compose for dev, Kubernetes for prod). Manual install option available.
Components: Clear list of runtime components to deploy (API, UI, Worker, Queue, Neo4j, SpiceDB, Object Storage).
Configuration: Environment variables are the primary configuration mechanism. A full list of variables per service will be provided.
Health Checks: Standard HTTP health endpoints for each service.
Logging: Structured logging (JSON to stdout/stderr), configurable levels. Recommend log aggregation.
Monitoring: Expose application and system metrics (Prometheus format). Recommend monitoring tools (Grafana).
Alerting: Define critical alert conditions based on metrics.
Backups: Document backup strategy for Neo4j and Object Storage.
Scaling: Components designed for horizontal scaling. Identify stateful components and bottlenecks.
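
Environment-based configuration for a worker could be loaded and validated at startup along these lines. The variable names (`NEO4J_URI`, `QUEUE_URL`, `WORKER_CONCURRENCY`, `LOG_LEVEL`) and defaults are hypothetical, since the per-service variable list is not yet defined.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerConfig:
    neo4j_uri: str
    queue_url: str
    worker_concurrency: int
    log_level: str


def load_config(env=None) -> WorkerConfig:
    """Read configuration from a mapping (os.environ by default) and fail fast on errors."""
    env = os.environ if env is None else env
    missing = [name for name in ("NEO4J_URI", "QUEUE_URL") if not env.get(name)]
    if missing:
        raise ValueError("missing required variables: " + ", ".join(missing))
    concurrency = int(env.get("WORKER_CONCURRENCY", "4"))
    if concurrency < 1:
        raise ValueError("WORKER_CONCURRENCY must be at least 1")
    return WorkerConfig(
        neo4j_uri=env["NEO4J_URI"],
        queue_url=env["QUEUE_URL"],
        worker_concurrency=concurrency,
        log_level=env.get("LOG_LEVEL", "info"),
    )
```

Accepting a plain mapping makes the loader easy to unit test without touching the real process environment.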

14. Open Questions

Specific choice of Job Queue technology.
Specific choice of frontend UI framework.
Detailed implementation strategy for the Object Storage query integration layer in the Backend API (direct access vs. external query engine).
Precise heuristics and implementation details for the heuristic_assessment property in ISA analysis results.
Specific technologies or libraries for implementing the sandboxed worker environments.
Initial set of package managers and analysis types to support in MVP vs. initial post-MVP.
Data retention policy for old analysis results or resolution contexts.