@Animesh-Ghosh
Last active March 15, 2026 10:18
Lucia Challenge scratchpad
[
{
"idx": 1,
"question": "What are the revenue figures for Meta for Q1, Q2 and Q3?",
"expected": "Q2 2024: $39.1B \n\nQ3 2024: $40.6B \n\nQ1 2025: $42.3B",
"generated": "Q1 2025: $42.3B, Q2 2024: $39.1B, Q3 2024: $40.6B",
"doc_name": "META-Q1-2025-Earnings-Call-Transcript-1.pdf",
"confidence": "high",
"fuzzy_sim": 0.3,
"semantic_sim": 0.995417594909668,
"doc_match": 1.0,
"chunks_retrieved": 5
},
{
"idx": 2,
"question": "What was KFIN's revenue in 2021?",
"expected": "Rs. 4586550000",
"generated": "Rs. 4586550000",
"doc_name": "1649061896992.PDF",
"confidence": "high",
"fuzzy_sim": 1.0,
"semantic_sim": 1.0,
"doc_match": 1.0,
"chunks_retrieved": 5
},
{
"idx": 3,
"question": "What metrics helped CCI determine if the combination would be anticompetitive?",
"expected": "The CCI used the Herfindahl Hirschman Index to assess market concentration",
"generated": "The CCI used the Herfindahl Hirschman Index to assess market concentration.",
"doc_name": "4. C-2015-07-288.pdf",
"confidence": "high",
"fuzzy_sim": 0.9932885906040269,
"semantic_sim": 0.9943408966064453,
"doc_match": 1.0,
"chunks_retrieved": 5
},
{
"idx": 4,
"question": "What was the bench in the Eastman Kodak Case?",
"expected": "BLACKMUN, J., delivered the opinion of the Court, in which REHNQUIST, C. J., and WHITE, STEVENS, KENNEDY, and SOUTER, JJ., joined. SCALIA, J., filed a dissenting opinion, in which O\u2019CONNOR and THOMAS, JJ., joined.",
"generated": "The bench in the Eastman Kodak Case is not explicitly mentioned in the provided context. However, it can be inferred that the case was heard by the Supreme Court of the United States as indicated by '504 U.S. 451 (1992)' which is a citation format for Supreme Court cases.",
"doc_name": "Eastman Kodak Co. v. Image Technical Services, Inc., 504 U.S. 451 (1992) .pdf",
"confidence": "medium",
"fuzzy_sim": 0.024742268041237112,
"semantic_sim": 0.7343320250511169,
"doc_match": 1.0,
"chunks_retrieved": 5
},
{
"idx": 5,
"question": "How many SCOTUS cases are in the set? Name them.",
"expected": "There are 5 cases\nBell Atlantic Corp. v. Twombly, 550 U.S. 544 (2007)\nEastman Kodak Co. v. Image Technical Services, Inc., 504 U.S. 451 (1992)\nStandard Oil Co. of New Jersey v. United States, 221 U.S. 1 (1911)\nUnited States v. Socony-Vacuum Oil Co., Inc., 310 U.S. 150 (1940)\nBrown Shoe Co., Inc. v. United States, 370 U.S. 294 (1962) ",
"generated": "There are 4 cases in the set. Bell Atlantic Corp. v. Twombly, 550 U.S. 544 (2007), Eastman Kodak Co. v. Image Technical Services, Inc., 504 U.S. 451 (1992), Standard Oil Co. of New Jersey v. United States, 221 U.S. 1 (1911), and United States v. Socony-Vacuum Oil Co., Inc., 310 U.S. 150 (1940).",
"doc_name": "Testing Set Questions.xlsx",
"confidence": "high",
"fuzzy_sim": 0.8394276629570747,
"semantic_sim": 0.9673870205879211,
"doc_match": 0.0,
"chunks_retrieved": 5
},
{
"idx": 6,
"question": "What is the governing law in the NVCA IRA?",
"expected": "Delaware",
"generated": "The governing law for the NVCA Investor Rights Agreement is the internal law of the State of Delaware.",
"doc_name": "NVCA-Model-Document-Investor-Rights-Agreement.docx",
"confidence": "high",
"fuzzy_sim": 0.14545454545454545,
"semantic_sim": 0.7153719663619995,
"doc_match": 1.0,
"chunks_retrieved": 5
},
{
"idx": 7,
"question": "If Pristine were to acquire an indian company that had turnover of 1Cr and no assets, would it have to notify the deal to the CCI?",
"expected": "The thresholds for notification are:\nAssets of the acquirer and target > INR 25 Billion; or\nRevenue of the acquirer and target > INR 75 Billion\n\n\nPristine's revenue in 2021 was INR 5 billion; and total assets were INR 10 billion. Therefore, even after adding the turnover and the asset figures of the target, a notification will not be required.",
"generated": "No, Pristine would not have to notify the deal to the CCI as the thresholds for notification are Assets of the acquirer and target > INR 25 Billion; or Revenue of the acquirer and target > INR 75 Billion.",
"doc_name": "Testing Set Questions.xlsx",
"confidence": "high",
"fuzzy_sim": 0.38979963570127507,
"semantic_sim": 0.8773993253707886,
"doc_match": 0.0,
"chunks_retrieved": 5
},
{
"idx": 8,
"question": "Which questions in the Testing Set Questions has \u201cDelaware\u201d as an answer",
"expected": "The question is \u201cWhat is the governing law in the NVCA IRA\u201d? ",
"generated": "The Delaware General Corporation Law (DGCL)",
"doc_name": "NVCA-Model-Document-Investor-Rights-Agreement.docx",
"confidence": "high",
"fuzzy_sim": 0.2912621359223301,
"semantic_sim": 0.786303699016571,
"doc_match": 0.0,
"chunks_retrieved": 5
},
{
"idx": 9,
"question": "Which question has an answer as \u20b94586550000",
"expected": "The question is \u201cWhat was KFIN's revenue in 2021?\u201d",
"generated": "NOT_FOUND",
"doc_name": "1649061896992.PDF",
"confidence": "low",
"fuzzy_sim": 0.1694915254237288,
"semantic_sim": 0.7544689774513245,
"doc_match": 0.0,
"chunks_retrieved": 5
},
{
"idx": 10,
"question": "Does any question has reference to document 4. C-2015-07-288.pdf, p12",
"expected": "It's \u201cWhat metrics helped CCI determine if the combination would be anticompetitive?\u201d",
"generated": "NOT_FOUND",
"doc_name": "None",
"confidence": "low",
"fuzzy_sim": 0.1276595744680851,
"semantic_sim": 0.8461841940879822,
"doc_match": 0.0,
"chunks_retrieved": 5
}
]

FAISS GPU Migration Plan (No faiss-cpu)

Goal

Adopt FAISS as a GPU-only dense retrieval backend for benchmark runs, while preserving current Qdrant behavior as fallback and avoiding faiss-cpu installation in shared environments.

Scope

  • In scope:
    • GPU-only FAISS dependency path.
    • Backend selection via config/env.
    • Dense retrieval migration (coarse + rerank path).
    • Benchmark parity checks (latency + retrieval quality).
  • Out of scope:
    • Removing Qdrant immediately.
    • ANN index tuning (IVF/HNSW/PQ) in phase 1.

Current Baseline

Design Decisions

  1. Do not add FAISS to requirements.txt.
  2. Install FAISS only in GPU-specific setup paths (Dockerfile.gpu and the Makefile GPU setup target).
  3. Add VECTOR_BACKEND switch:
    • qdrant (default)
    • faiss (GPU benchmark path)
  4. Keep BM25 unchanged for hybrid retrieval.
  5. Start with exact FAISS indexes:
    • IndexFlatIP(128) for coarse stage.
    • IndexFlatIP(384) for full-stage rerank/scoring.
  6. Reassess cluster routing after FAISS benchmark data; keep initially for parity.

Implementation Plan

Phase 1: Dependency and Backend Plumbing

  1. Add backend config in config.py:
    • VECTOR_BACKEND = os.getenv("VECTOR_BACKEND", "qdrant")
  2. Add FAISS install in Dockerfile.gpu only.
  3. Add FAISS install in Makefile setup target only.
  4. Add GPU benchmark targets in Makefile:
    • bench-faiss
    • e2e-faiss
    • export VECTOR_BACKEND=faiss for these targets.
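The backend flag from step 1 can be sketched as follows (a minimal sketch; the validation set and error handling are illustrative additions, not part of the current config.py):

```python
import os

# Backend selection via env, defaulting to qdrant so the
# non-GPU/local paths remain unaffected.
VECTOR_BACKEND = os.getenv("VECTOR_BACKEND", "qdrant")

# Illustrative guard: fail fast on typos rather than silently
# falling back to the default backend.
SUPPORTED_BACKENDS = {"qdrant", "faiss"}
if VECTOR_BACKEND not in SUPPORTED_BACKENDS:
    raise ValueError(f"Unknown VECTOR_BACKEND: {VECTOR_BACKEND!r}")
```

The Makefile targets then only need `export VECTOR_BACKEND=faiss` before invoking the benchmark entry point.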

Phase 2: FAISS Index Build + Persistence

  1. Introduce FAISS index build functions in storage/indexer.py:
    • build coarse and full indexes from normalized vectors.
  2. Persist artifacts under data/:
    • coarse.faiss
    • full.faiss (optional if rerank is numpy-only over candidate vectors)
    • metadata map (row_id -> chunk metadata).
  3. Keep existing BM25 and chunk cache unchanged.
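Since phase 1 uses exact IndexFlatIP indexes, the search semantics are just inner product over L2-normalized vectors. A numpy stand-in (FAISS itself is not assumed installed here; function names are hypothetical) makes the intended behavior and the row_id -> metadata contract concrete:

```python
import numpy as np

def build_flat_index(vectors: np.ndarray) -> np.ndarray:
    """L2-normalize rows so inner product equals cosine similarity,
    which is what IndexFlatIP computes over normalized inputs."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

def flat_search(index: np.ndarray, query: np.ndarray, k: int):
    """Exact top-k inner-product search; row ids returned here are
    the keys into the persisted row_id -> chunk metadata map."""
    q = query / max(float(np.linalg.norm(query)), 1e-12)
    scores = index @ q
    top = np.argsort(-scores)[:k]
    return top, scores[top]
```

In the real phase 2 code the matrix would be wrapped in `faiss.IndexFlatIP` and persisted with `faiss.write_index` to `coarse.faiss` / `full.faiss`; the metadata map is persisted alongside it.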

Phase 3: Retrieval Path Integration

  1. In retrieval/retriever.py, route _coarse_search by backend:
    • Qdrant path (existing)
    • FAISS path (new)
  2. In retrieval/retriever.py, route _rerank by backend:
    • Qdrant path (existing)
    • FAISS/numpy candidate scoring path (new)
  3. Keep query embedding and BM25 flow unchanged in retrieval/retriever.py.
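The routing in steps 1 and 2 amounts to a dispatch on the backend flag; a sketch with stand-in search functions (all names below are illustrative, not the actual retriever internals):

```python
def coarse_search(query_vec, k, backend="qdrant"):
    """Dispatch the coarse stage on the configured backend."""
    if backend == "qdrant":
        return _coarse_search_qdrant(query_vec, k)   # existing path
    if backend == "faiss":
        return _coarse_search_faiss(query_vec, k)    # new GPU path
    raise ValueError(f"unknown backend: {backend}")

def _coarse_search_qdrant(query_vec, k):
    # Stand-in for the existing Qdrant client call.
    return [("qdrant", i) for i in range(k)]

def _coarse_search_faiss(query_vec, k):
    # Stand-in for the new FAISS index.search call.
    return [("faiss", i) for i in range(k)]
```

`_rerank` would be routed the same way, with the FAISS path scoring candidate vectors in numpy instead of issuing a second Qdrant query.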

Phase 4: Benchmarks and Acceptance

  1. Compare qdrant vs faiss on same corpus/questions:
    • ingestion index build time
    • query latency for all 15 questions
    • top-k overlap and final answer quality metrics
  2. Acceptance criteria:
    • FAISS backend installs only in GPU paths.
    • No faiss-cpu anywhere.
    • FAISS benchmark latency is equal to or better than Qdrant's.
    • No quality regression beyond agreed threshold.
  3. If FAISS wins, make faiss default only in GPU benchmark workflows.
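For the top-k overlap check in step 1, a simple set-overlap metric over the two backends' ranked result ids suffices (a sketch; the run format is assumed to be a ranked list of chunk ids):

```python
def topk_overlap(run_a, run_b, k=5):
    """Fraction of shared chunk ids in the top-k of two ranked runs.
    1.0 means both backends retrieved exactly the same candidates."""
    a, b = set(run_a[:k]), set(run_b[:k])
    return len(a & b) / k
```

Averaging this over all benchmark questions gives a single parity number to compare against the agreed quality-regression threshold.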

Risks and Mitigations

  1. FAISS wheel/CUDA compatibility:
    • Pin versions and validate in both Docker and bare-metal setup.
  2. Divergence between Docker and Makefile setups:
    • Use same install source/version in both files.
  3. Metadata mapping bugs:
    • Add integrity checks for row_id/chunk_id consistency.
  4. Cluster routing complexity:
    • Keep behind flag, disable if no measurable gain.
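The integrity check in risk 3 can be as simple as verifying that the metadata map covers exactly the index's rows (a sketch assuming dense row ids 0..n-1, which matches how flat FAISS indexes number rows):

```python
def check_index_metadata_parity(num_rows: int, metadata: dict) -> None:
    """Raise if the row_id -> chunk metadata map and the index rows
    disagree; run after every index build and load."""
    expected = set(range(num_rows))
    got = set(metadata)
    if got != expected:
        raise AssertionError(
            f"row/metadata mismatch: missing={expected - got}, "
            f"extra={got - expected}"
        )
```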

Suggested Milestones

  1. M1 (0.5 day): dependency plumbing + backend flag.
  2. M2 (1 day): FAISS indexing + persistence.
  3. M3 (1 day): retrieval integration + parity tests.
  4. M4 (0.5 day): benchmark report + decision.

Definition of Done

  1. GPU-only FAISS path operational in Dockerfile.gpu and Makefile.
  2. VECTOR_BACKEND=faiss runs ingestion and query successfully.
  3. Benchmarks documented with Qdrant comparison.
  4. Default non-GPU/local paths remain unaffected.
https://luciohackathon.purplewater-eec0a096.centralindia.azurecontainerapps.io/docs#/Submissions/submit_submissions_post

Streaming Embedding Migration Plan

Goal

Move ingestion from a batch handoff model to a bounded streaming pipeline so parsing, chunking, embedding, and vector indexing overlap instead of running as isolated stages.

Scope

  • In scope:
    • Streaming Go to Python handoff for chunk batches via Unix Domain Sockets (UDS).
    • Deterministic streamed chunk ordering and ID assignment.
    • Bounded backpressure across parse, chunk, embed, and index stages.
    • Append-oriented vector indexing for both supported dense backends (Qdrant/FAISS).
    • Benchmarking and retrieval parity checks against the current fast path.
    • Infrastructure: Targeted for Ubuntu/Debian on vast.ai GPU boxes.
  • Out of scope for phase 1:
    • Moving embeddings fully into Go.
    • Replacing BM25 or cluster-map retrieval components.
    • Redesigning the query pipeline.

Current Baseline

The current fast path is: Go parse + chunk -> Arrow file on disk -> Python read-all -> embedding -> vector index -> BM25 -> cluster map

Key current bottleneck:

  • Python only starts embedding after Go has finished writing the full Arrow file.
  • go-ingestor collects all raw chunks before assigning IDs and writing.

Target Architecture

The target streaming fast path is: file discovery -> parse workers -> chunk sequencer -> Arrow stream (UDS) -> Python embedding re-batcher -> vector index writer -> finalization (BM25/Cluster map)

Design Constraints & Refinements

  1. Transport: Use Arrow IPC streaming over a Unix Domain Socket (UDS).
    • Python (consumer) creates a temporary socket file (e.g., /tmp/lucio_ingest.sock).
    • Go (producer) connects to the socket and streams Arrow batches.
    • Benefit: Keeps stdout and stderr completely free for human-readable logging and progress bars without risk of stream corruption.
  2. Backpressure: Leverage UDS buffer semantics. When the embedding queue is full, Python's asyncio consumer stops reading; the socket's send buffer fills and blocks the Go producer, applying backpressure naturally.
  3. Deterministic IDs: Ensure chunk_id generation (hashing/UUID) is identical between batch and stream paths to maintain parity.
  4. Re-batching Layer: Python will implement a bounded queue to re-batch incoming Arrow chunks into optimal GPU inference sizes (e.g., 128/256) independently of Go's transport batch size.
  5. Finalization Manifest: Python will maintain an append-only in-memory or temporary manifest of all chunk metadata to build the BM25 index and Cluster map after the socket closes.
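Constraint 3 (deterministic IDs) only holds if the id is a pure function of stable chunk fields, never of arrival order. A sketch (the exact field choice here is an assumption, not the current implementation):

```python
import hashlib

def chunk_id(doc_name: str, chunk_idx: int, text: str) -> str:
    """Deterministic chunk id: the same (doc, index, text) triple
    yields the same id in both the batch and streaming paths."""
    payload = f"{doc_name}\x00{chunk_idx}\x00{text}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:32]
```

Go would compute the identical hash on its side, so parity tests can compare ids directly across the two paths.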

Implementation Plan

Phase 1: Lock the rollout shape & Infrastructure

  1. Add STREAMING_ENABLED=1 flag support to pipeline.py and fast_ingest.py.
  2. Update Makefile with a new e2e-stream-faiss target that drives run_e2e_test.py --ingest with streaming enabled.
  3. VastAI Verification: Verify libarrow-dev (apt) and apache-arrow-go (go.mod) version compatibility on the production GPU box.

Phase 2: Make chunk IDs stream-safe

  1. Introduce a document-aware sequencer in Go that buffers out-of-order pages and emits chunks in deterministic (page_num, page_order) order.
  2. Assign chunk_idx and chunk_id immediately upon sequencing.
  3. Test: go-ingestor/chunker/sequencer_test.go proving parity with batch ordering.
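The sequencer's core logic is small; a Python model of the Go component (illustrative only, the real implementation lives in go-ingestor/chunker) shows the buffering behavior the parity test must prove:

```python
class PageSequencer:
    """Pages may finish parsing out of order; hold them back and emit
    chunks only when the next expected page is available, so chunk_idx
    assignment matches the batch path exactly."""

    def __init__(self):
        self.next_page = 0
        self.pending = {}     # page_num -> list of chunk texts
        self.chunk_idx = 0

    def push(self, page_num, chunks):
        """Accept one parsed page; return any chunks now emittable,
        as (chunk_idx, text) pairs in deterministic order."""
        self.pending[page_num] = chunks
        out = []
        while self.next_page in self.pending:
            for text in self.pending.pop(self.next_page):
                out.append((self.chunk_idx, text))
                self.chunk_idx += 1
            self.next_page += 1
        return out
```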

Phase 3: Replace Arrow file handoff with Arrow streaming (Go side)

  1. Add a streaming Arrow writer in Go that can write to a net.Conn (UDS).
  2. Add a --socket-path flag to go-ingestor to trigger streaming mode.
  3. Ensure all logs go to stderr or stdout (both now safe for human-readable output, since data flows over the UDS).

Phase 4: Python streaming consumer and re-batcher

  1. Update fast_ingest.py to create the UDS server, spawn the Go process with the --socket-path flag, and accept the connection.
  2. Implement the Python-side asyncio re-batching queue to feed the Embedder.
  3. Manage socket lifecycle: create, listen, and ensure cleanup of the /tmp/*.sock file on exit or failure.
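The re-batching layer in step 2 decouples Go's transport batch size from the GPU inference batch size. A synchronous sketch of that core logic (the asyncio version wraps the same loop around a bounded queue; names are illustrative):

```python
def rebatch(incoming, batch_size=128):
    """Re-batch arbitrary-size transport batches from the socket into
    fixed-size GPU inference batches, flushing the tail at stream end."""
    buf = []
    for batch in incoming:
        buf.extend(batch)
        while len(buf) >= batch_size:
            yield buf[:batch_size]
            buf = buf[batch_size:]
    if buf:
        # Partial final batch after the socket closes.
        yield buf
```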

Phase 5: Append-only Vector Indexing

  1. For FAISS: Use an in-process appendable builder.
  2. For Qdrant: Perform batch upserts as re-batched vectors become available.

Phase 6: Finalization & Metadata

  1. Accumulate chunk metadata (text, doc_name, page_num) in an append-only manifest during the stream.
  2. After the stream closes, trigger the build of BM25 and Cluster map from this manifest.

Phase 7: Backpressure and Failure Handling

  1. Use bounded channels in Go and asyncio.Queue in Python.
  2. Implement structured failure propagation: if Go fails, Python should clean up partial artifacts and exit.

Verification

  1. Parity: Run make e2e-stream-faiss and ensure similarity scores and document matches are identical to make e2e-faiss.
  2. Performance: Record "Time to first embedding" and "Total ingestion time".
  3. Stability: Verify clean socket cleanup and log readability on vast.ai.

Related Docs
