Project: Notebook

Overview

Notebook is a minimal, trust-driven platform for students in grades 8–12 to share study materials, organized by board, grade, and subject. It supports uploads of notes (PDFs, PPTs), question papers, and worksheets. Students can vote, comment, report, and build a trust score similar to Reddit karma. A clean black-and-white interface, realtime updates, and offline access through a PWA keep it lightweight but powerful.

Core Concept

A peer-driven learning space where credibility rises naturally through trust, not algorithms. Every upload, vote, and report shapes the reputation ecosystem.

Architecture

Stack:

  • Frontend: Vite + Tailwind + PWA
  • Backend: Bun.js server
  • Storage: JSON files (state + users) and filesystem for uploaded content

Folder Structure

notebook/
├── server/
│   ├── index.ts                 # Bun server entry
│   ├── routes/
│   │   ├── upload.ts            # file uploads
│   │   ├── feed.ts              # fetch feeds
│   │   ├── vote.ts              # upvote/downvote
│   │   ├── report.ts            # moderation logic
│   │   ├── user.ts              # profiles and leaderboard
│   ├── db/
│   │   ├── recent.json          # ranked list of recent posts
│   │   ├── posts/
│   │   │   └── uuid.json        # stores information for each post
│   │   └── users/
│   │       └── uuid.json        # stores information for each user
│   ├── uploads/
│   │   └── uuid.pdf             # uploaded PDFs/PPTs, organized by board/grade/subject
│   └── utils/
│       ├── trust.ts             # trust logic
│       └── id.ts                # UUID generator
│
├── client/
│   ├── index.html
│   ├── main.tsx
│   ├── App.tsx
│   ├── components/ (Feed, UploadForm, NoteCard, Profile, Leaderboard, CommentSection)
│   ├── pages/ (Home, FeedPage, UploadPage)
│   ├── service-worker.js        # caching logic
│   └── manifest.json            # PWA manifest

Data Models

posts/uuid.json:

{
  "title": "Electromagnetism Notes",
  "board": "CBSE",
  "grade": "11",
  "bloomData": {},
  "subject": "Physics",
  "filename": "uuid.pdf",
  "uploader": "user123",
  "trust": 35,
  "votes": { "up": 12, "down": 2 },
  "reports": 1,
  "timestamp": 1697000000,
  "comments": []
}

bloomData holds a Bloom-filter-style record of the users who have liked the post.

users/uuid.json:

{
  "username": "user123",
  "trust": 85,
  "uploads": 10,
  "reports": 0
}

API Endpoints

Endpoint                      Method  Purpose
/upload                       POST    Upload file + metadata
/feed/:board/:grade/:subject  GET     Fetch feed sorted by trust/time
/vote/:id                     POST    Upvote/downvote a note
/report/:id                   POST    Report content (auto-deleted after N reports)
/user/:username               GET     Get a user profile
/leaderboard                  GET     Fetch top trusted users

Moderation Logic

// Auto-remove a note once reports reach 5 or 10% of its views, whichever is smaller
if (note.reports >= Math.min(5, 0.1 * note.views)) {
  removeFile(note.filename);
  uploader.reports += 1;   // strike recorded against the uploader
  uploader.trust -= 10;    // trust penalty for the uploader
}

Frontend Structure

Pages: Feed, Profile, Leaderboard, Documents

  • Feed: lazy-loads N new posts at a time, ranked by trust score and likes/views
  • Documents: downloaded and starred notes, with a modal to upload new ones
  • Profile: view your own profile and other users' profiles
  • Leaderboard: shows the top 10 users, plus your rank and the three ranks above and below you

Core Components:

  • NoteCard: shows note title, votes, trust
  • UploadForm: handles metadata + file upload
  • CommentSection: threaded discussion
  • Leaderboard: global trust ranking
  • Profile: user uploads and stats

UI / UX

  • Theme: Black and white monochrome, grayscale shadows
  • Typography: Sans (Inter) + Mono (JetBrains Mono)
  • Icons: Thin-line Lucide or Tabler icons
  • Layout: Minimal, 3–4 pages; feed-focused UI

Features

  • Core: upload, download, trust score, vote, report, profile, feed
  • Extras: comments, leaderboard, realtime feed, search bar
  • Advanced: offline mode, export, auto trust decay

Realtime Updates

import { WebSocketServer } from "ws";

const wss = new WebSocketServer({ port: 8081 });

// Push a newly uploaded note to every connected client
function broadcastUpdate(note) {
  wss.clients.forEach(c => {
    if (c.readyState === c.OPEN) c.send(JSON.stringify({ type: "new", note }));
  });
}

Frontend listens for new uploads and injects into the live feed.

PWA & Offline Mode

  • Cache static assets and last-viewed feeds.
  • IndexedDB/localStorage for offline reading.
  • Manifest for installable app experience.
  • Banner indicator for offline state.

Hackathon Build Plan

Day 1:

  • Setup Bun server and routes (upload/feed/vote/report)
  • Create JSON storage system
  • Build core UI (Feed + Upload)
  • Implement trust/voting logic

Day 2:

  • Add comments, profiles, leaderboard
  • Integrate WebSocket realtime feed
  • Add offline caching + final polish
  • Prepare final demo pitch

Pitch Summary

Notebook is a peer-to-peer academic sharing space where trust, not algorithms, decides visibility. Students upload, vote, and collaborate on notes across boards, grades, and subjects. It’s fast, minimal, works offline, and rewards credible contribution.

@triadwoozie commented Oct 23, 2025

RAG

The best approach is to build the AI as a separate Python microservice. Your main Bun.js server will remain the lightweight "frontend" server, which will then make internal HTTP requests (acting as a proxy) to this new Python AI service.

This keeps your stacks separate: Bun.js handles fast I/O and web logic, while Python handles the heavy compute of AI/ML.

Here are the two main AI features we'll design:

  1. AI Q&A Assistant (The RAG Pipeline):

    • Purpose: Allows students to ask questions ("Explain electromagnetism," "What are the key themes in Chapter 3?") and get answers based specifically on the uploaded notes and question papers.
    • Tech:
      • Retrieval: all-mpnet-base-v2 will be used to embed (vectorize) all uploaded documents (PDFs, PPTs) into a vector database (like ChromaDB, which is file-system-based and fits your "minimal" ethos).
      • Generation: A student's question is embedded. We find the most relevant document chunks from the vector DB. These chunks (the "context") and the original question are sent to Gemma 3 12B (via Ollama) to generate a helpful, context-aware answer.
  2. AI Question Paper Predictor (Hybrid ML + LLM Model):

    • Purpose: Generates a "predicted," new practice paper for a specific board, grade, and subject.
    • Tech: This is a hybrid approach.
      • Analyzer (Classic ML): We'll use TF-IDF and K-Means Clustering to analyze the existing corpus of question papers.
      • Generator (LLM): We'll use Gemma 3 12B (via Ollama) to author a new paper based on the ML model's analysis.
    • How it works:
      1. All questions from all papers are extracted and stored.
      2. We apply TF-IDF + K-Means to this corpus to cluster questions into topics (e.g., "Topic 1: Kinematics," "Topic 2: Optics").
      3. When a user requests a "predicted paper," the service:
        • Filters all known questions by board/grade/subject.
        • Uses the ML model to find the most important topics and a set of example questions from those topics.
        • Feeds this analysis (e.g., "Key Topics: Kinematics, Optics. Example Kinematics Question: ...") as rich context to Gemma 3 12B.
        • Prompts the LLM to generate a new, original practice paper (with a specific number of questions) based on these topics and examples. This is an intensive process and can take up to a minute.

📁 AI Service: Filesystem Structure

You will create a new folder next to your client and server folders, called ai_service.

notebook/
├── server/
│   ├── index.ts
│   ├── routes/
│   │   ├── ... (existing routes)
│   │   └── ai.ts          # <--- NEW: Proxy for AI service
│   └── ... (rest of server)
│
├── client/
│   ├── ... (existing client files)
│   ├── components/
│   │   ├── ... (existing components)
│   │   ├── AiChatBot.js   # <--- NEW: UI for Q&A
│   │   └── PaperGen.js    # <--- NEW: UI for paper generator
│
└── ai_service/            # <--- NEW: Python AI Microservice
    ├── main.py            # FastAPI server entry point
    ├── requirements.txt   # Python dependencies
    ├── config.py          # Configuration (model names, paths)
    ├── data/              # Persistent data for AI
    │   ├── vector_store/  # ChromaDB vector database files
    │   └── ml_models/     # Saved TF-IDF/K-Means models (as .pkl)
    │
    ├── rag_pipeline/
    │   ├── __init__.py
    │   ├── embedder.py    # Loads all-mpnet-base-v2
    │   ├── text_extractor.py # Logic to read PDFs and PPTs
    │   ├── vector_db.py   # Manages ChromaDB (add, query)
    │   └── generator.py   # Interfaces with Ollama (Gemma 3 12B)
    │
    ├── ml_models/
    │   ├── __init__.py
    │   ├── topic_modeler.py # TF-IDF & K-Means logic
    │   └── question_db.py   # Manages a simple DB of questions
    │
    └── processing/
        ├── __init__.py
        └── indexer.py     # Main script to read, process, & index files

📚 Incredibly Detailed Documentation: ai_service

1. Overview & Purpose

The ai_service is a self-contained Python microservice built with FastAPI. Its sole purpose is to handle all computationally expensive AI and ML tasks. It communicates only with the main notebook/server (Bun.js) via a local HTTP API. It never talks directly to the end-user's client.

  • Python Stack: FastAPI, Ollama, Sentence-Transformers, ChromaDB, Scikit-learn, pypdf, python-pptx.
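
main.py itself is only named in the filesystem structure, so here is a minimal sketch of how it could expose the two features over the local HTTP API. The route names (/ask, /generate-paper), request models, and port are assumptions rather than a final contract; the helper functions are the ones documented in the sections below.

# main.py -- minimal sketch; route names and request models here are assumptions.
from typing import Optional

from fastapi import FastAPI
from pydantic import BaseModel

from rag_pipeline.vector_db import query_db
from rag_pipeline.generator import get_ai_response, generate_paper_from_context
from ml_models.topic_modeler import get_topic_analysis

app = FastAPI(title="Notebook AI Service")

class AskRequest(BaseModel):
    question: str
    board: Optional[str] = None        # optional metadata filter for retrieval

class PaperRequest(BaseModel):
    board: str
    grade: str
    subject: str
    num_questions: int = 10

@app.post("/ask")
def ask(req: AskRequest):
    # Retrieve the most relevant chunks, then answer from that context.
    filters = {"board": req.board} if req.board else {}
    context = query_db(req.question, n_results=5, filter_dict=filters)
    return {"answer": get_ai_response(req.question, context)}

@app.post("/generate-paper")
def generate_paper(req: PaperRequest):
    # Slow path: ML topic analysis + long LLM generation (can take up to a minute).
    analysis = get_topic_analysis(req.board, req.grade, req.subject)
    return {"paper": generate_paper_from_context(analysis, req.num_questions)}

Run it with uvicorn main:app --port 8000 from ai_service/; the Bun server's ai.ts route then simply forwards requests to these endpoints and relays the JSON back to the client.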

2. Setup & Installation

ai_service/requirements.txt:

fastapi
uvicorn[standard]
ollama
sentence-transformers
chromadb
scikit-learn
pypdf
python-pptx
numpy

Installation Steps:

  1. Install Python & Pip: Ensure you have Python 3.9+ installed.
  2. Install Dependencies:
    cd notebook/ai_service
    pip install -r requirements.txt
  3. Install & Run Ollama:
    • Follow the official Ollama setup instructions.
    • Pull the generation model (all-mpnet-base-v2 is not an Ollama model; sentence-transformers downloads it automatically on first use):
      ollama pull gemma3:12b-it-qat  # or whichever tag you use for the generation model
    • Ensure the Ollama server is running.
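
config.py is listed in the structure but never spelled out; one plausible sketch is below. Every value in it is either a default pulled from elsewhere in this document (chunk sizes, cluster count, paths) or an assumed model tag, to be adjusted to your setup.

# config.py -- illustrative defaults only; adjust paths and model tags to your setup.
from pathlib import Path

BASE_DIR = Path(__file__).parent

# Models
EMBEDDING_MODEL = "all-mpnet-base-v2"       # loaded via sentence-transformers
GENERATION_MODEL = "gemma3:12b-it-qat"      # Ollama tag for the generation model

# Paths into the existing Bun server and the AI service's own data
SERVER_DB_DIR = BASE_DIR / ".." / "server" / "db"
UPLOADS_DIR = BASE_DIR / ".." / "server" / "uploads"
VECTOR_STORE_DIR = BASE_DIR / "data" / "vector_store"
ML_MODELS_DIR = BASE_DIR / "data" / "ml_models"
QUESTION_DB_PATH = BASE_DIR / "ml_models" / "question_db.json"

# RAG chunking parameters (see the indexer workflow below)
CHUNK_SIZE = 500        # characters per chunk
CHUNK_OVERLAP = 100     # characters of overlap between consecutive chunks

# Topic modelling
N_CLUSTERS = 50         # K-Means clusters fit over the full question corpus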

3. Core Component Deep-Dive

processing/indexer.py (The "Data Pipeline")

This is the most critical offline script. It must be run to build the AI's knowledge.

  • Purpose: To find all uploaded documents, extract their text, process them for RAG and ML, and store the results.
  • Workflow:
    1. Scan ../server/db/posts/: Reads all uuid.json files to get a list of all posts.
    2. Filter for Documents: It looks for posts that are "question papers" or "notes" (PDF/PPT).
    3. For each document:
      • Get Metadata: Reads the board, grade, subject, and filename from the JSON.
      • Get File Path: Constructs the path (e.g., ../server/uploads/[filename]).
      • Extract Text (using rag_pipeline/text_extractor.py):
        • If .pdf, use pypdf to read text page by page.
        • If .pptx, use python-pptx to read text from slides.
      • Chunk Text: Splits the full text into smaller, overlapping chunks (e.g., 500 characters per chunk with 100 overlap).
      • Store for RAG (using rag_pipeline/vector_db.py):
        • Each chunk is embedded using all-mpnet-base-v2.
        • The resulting vector is stored in ChromaDB (data/vector_store/) along with its metadata: { "text": chunk_text, "source_file": filename, "board": board, "grade": grade }.
      • Store for ML (using ml_models/question_db.py):
        • A simpler regex/heuristic identifies "questions" (e.g., lines ending in "?", or starting with "Q.").
        • These questions are stored raw in a separate simple JSON file or SQLite DB (ml_models/question_db.json) for the topic modeler.
    4. Train ML Model (using ml_models/topic_modeler.py):
      • After processing all questions, it loads the entire question bank.
      • It fits a TfidfVectorizer on the text and saves the vectorizer to data/ml_models/tfidf.pkl.
      • It then fits a KMeans model (e.g., n_clusters=50) on the TF-IDF vectors and saves the model to data/ml_models/kmeans.pkl.

How to run it:
You must run this script the first time you set up the service.

cd notebook/ai_service
python processing/indexer.py
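
For concreteness, the extraction, chunking, and question-spotting steps from the workflow above could look roughly like the sketch below. Helper names are illustrative rather than the final text_extractor.py API, and the question heuristic is deliberately crude.

# Rough sketch of the indexer's per-document steps; names are illustrative.
import re
from pypdf import PdfReader
from pptx import Presentation

def extract_text(path: str) -> str:
    # Read a PDF page by page, or a PPTX slide by slide.
    if path.lower().endswith(".pdf"):
        reader = PdfReader(path)
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    if path.lower().endswith(".pptx"):
        texts = []
        for slide in Presentation(path).slides:
            for shape in slide.shapes:
                if shape.has_text_frame:
                    texts.append(shape.text_frame.text)
        return "\n".join(texts)
    return ""

def chunk_text(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    # Fixed-size character chunks with overlap, as described above.
    chunks = []
    for start in range(0, len(text), size - overlap):
        chunk = text[start:start + size]
        if chunk.strip():
            chunks.append(chunk)
    return chunks

QUESTION_RE = re.compile(r"^(?:Q\d*|\d+)[\.\)]?\s+", re.IGNORECASE)

def extract_questions(text: str) -> list[str]:
    # Heuristic: keep lines ending in "?" or starting like "Q." / "1)".
    questions = []
    for line in text.splitlines():
        line = line.strip()
        if line.endswith("?") or QUESTION_RE.match(line):
            questions.append(line)
    return questions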

rag_pipeline/ (The Q&A System)

  • embedder.py:

    • Loads the all-mpnet-base-v2 model using sentence_transformers.
    • Provides a simple function: def get_embedding(text: str) -> List[float]: ...
  • vector_db.py:

    • Initializes the ChromaDB client: client = chromadb.PersistentClient(path="./data/vector_store").
    • Gets or creates a collection: collection = client.get_or_create_collection("notebook_docs").
    • def add_documents(chunks: List[str], metadatas: List[dict], ids: List[str]): ... (used by indexer.py).
    • def query_db(query_text: str, n_results=5, filter_dict={}) -> List[str]: ...
      1. Gets the embedding for query_text from embedder.py.
      2. Queries the collection: results = collection.query(query_embeddings=[embedding], n_results=n_results, where=filter_dict).
      3. Returns the text from the documents in the results.
  • generator.py:

    • Interfaces with Ollama.
    • def get_ai_response(question: str, context: List[str]) -> str:
      1. Initializes the Ollama client: client = ollama.Client().
      2. Builds the RAG prompt:
        You are a helpful study assistant...
        CONTEXT:
        ---
        [context[0]]
        ---
        QUESTION:
        [question]
        ANSWER:
        
      3. Calls the model: response = client.chat(model='gemma3:12b-it-qat', messages=[...]).
      4. Returns response['message']['content'].
    • def generate_paper_from_context(analysis_data: dict, num_questions: int) -> str:
      1. Initializes Ollama client.
      2. Builds the detailed paper generation prompt:
        You are an expert curriculum developer and high school teacher.
        Your task is to generate a new, high-quality practice question paper.
        
        Use the following analysis of past papers as your guide.
        
        Board: [analysis_data.board]
        Grade: [analysis_data.grade]
        Subject: [analysis_data.subject]
        
        Key Topics & Examples:
        ---
        Topic 1:
        Keywords: [analysis_data.top_topics[0].keywords]
        Example Questions:
        - [analysis_data.top_topics[0].examples[0]]
        - [analysis_data.top_topics[0].examples[1]]
        ---
        ... (repeat for all top topics) ...
        
        INSTRUCTIONS:
        1.  Generate a new practice paper with exactly [num_questions] questions.
        2.  The questions must be *new* and *original*, but inspired by the topics, keywords, and example questions provided.
        3.  Do NOT just copy the example questions.
        4.  Ensure the paper covers the identified topics in a balanced way.
        5.  Format the output clearly (e.g., "Q1: ...", "Q2: ...").
        
        NEW PRACTICE PAPER:
        
      3. Calls the model: response = client.chat(model='gemma3:12b-it-qat', messages=[...]).
        • Note: This call will be slow (up to 1 minute). The Ollama client and FastAPI server must be configured to handle this long timeout.
      4. Returns response['message']['content'] (the full paper as a single string).
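
Pulling these pieces together, the full Q&A path can be sketched in one place. This is only an illustration: the collection name, model tag, and prompt wording are the ones assumed in this document, and a real service would call the embedder/vector_db/generator modules rather than one script.

# End-to-end RAG sketch: embed the question, retrieve chunks, generate an answer.
import chromadb
import ollama
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-mpnet-base-v2")
chroma = chromadb.PersistentClient(path="./data/vector_store")
collection = chroma.get_or_create_collection("notebook_docs")
llm = ollama.Client()

def answer_question(question: str, n_results: int = 5) -> str:
    # 1. Embed the question and pull the closest chunks from ChromaDB.
    embedding = embedder.encode(question).tolist()
    results = collection.query(query_embeddings=[embedding], n_results=n_results)
    context_chunks = results["documents"][0]

    # 2. Build the RAG prompt from the retrieved context.
    prompt = (
        "You are a helpful study assistant. Answer using only the context below.\n"
        "CONTEXT:\n---\n" + "\n---\n".join(context_chunks) + "\n---\n"
        f"QUESTION:\n{question}\nANSWER:"
    )

    # 3. Ask the generation model via Ollama (this call can be slow).
    response = llm.chat(model="gemma3:12b-it-qat",
                        messages=[{"role": "user", "content": prompt}])
    return response["message"]["content"]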

ml_models/ (The Paper Predictor's Analyzer)

  • topic_modeler.py:
    • def load_models(): ...
      • Loads tfidf.pkl and kmeans.pkl from data/ml_models/ using joblib or pickle.
    • def get_topic_analysis(board: str, grade: str, subject: str) -> dict:
      1. Loads the pre-trained TF-IDF and K-Means models.
      2. Loads all questions for the matching board, grade, and subject from question_db.py.
      3. Transforms these filtered questions using the loaded TF-IDF vectorizer.
      4. Uses the loaded K-Means model to predict the topic (cluster label) for each question.
      5. Counts the most frequent topics (e.g., Top 5 topics).
      6. For each top topic, finds 2-3 representative questions (e.g., those closest to the cluster center).
      7. Returns a dictionary structure (the analysis_data for the LLM):
        {
          "board": "CBSE",
          "grade": "12",
          "subject": "Physics",
          "top_topics": [
            { "topic_id": 3, "keywords": "kinematics, velocity", "examples": ["Q: ...", "Q: ..."] },
            { "topic_id": 7, "keywords": "optics, lens, mirror", "examples": ["Q: ..."] }
          ]
        }
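
A condensed sketch of get_topic_analysis as described above, assuming joblib-saved models and a question_db.json whose records carry text/board/grade/subject fields (that schema is an assumption; question_db.py owns the real format):

# Condensed sketch of the topic analysis step; the question-bank schema is assumed.
import json
from collections import Counter

import joblib
import numpy as np

def load_questions(board: str, grade: str, subject: str) -> list[str]:
    # Stand-in for question_db.py: filter the stored question bank by metadata.
    with open("ml_models/question_db.json") as f:
        records = json.load(f)
    return [r["text"] for r in records
            if r["board"] == board and r["grade"] == grade and r["subject"] == subject]

def get_topic_analysis(board: str, grade: str, subject: str,
                       top_n: int = 5, examples_per_topic: int = 3) -> dict:
    # 1. Load the models saved by indexer.py.
    tfidf = joblib.load("data/ml_models/tfidf.pkl")
    kmeans = joblib.load("data/ml_models/kmeans.pkl")

    # 2. Filter the question bank, vectorize, and assign each question to a topic.
    questions = load_questions(board, grade, subject)
    vectors = tfidf.transform(questions)
    labels = kmeans.predict(vectors)

    # 3. Keep the most frequent topics and pick representative questions/keywords.
    terms = tfidf.get_feature_names_out()
    top_topics = []
    for topic_id, _count in Counter(labels).most_common(top_n):
        centre = kmeans.cluster_centers_[topic_id]
        idx = np.where(labels == topic_id)[0]
        # Representative questions: those closest to the cluster centre.
        dists = np.linalg.norm(vectors[idx].toarray() - centre, axis=1)
        best = idx[np.argsort(dists)[:examples_per_topic]]
        # Keywords: the highest-weighted TF-IDF terms at the cluster centre.
        keywords = ", ".join(terms[i] for i in np.argsort(centre)[::-1][:5])
        top_topics.append({
            "topic_id": int(topic_id),
            "keywords": keywords,
            "examples": [questions[i] for i in best],
        })

    return {"board": board, "grade": grade, "subject": subject, "top_topics": top_topics}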
