Sprint 3: Machine Learning Model Guide

Overview

This guide covers the third ticket of the BandersnatchStarter project, focused on building and integrating a machine learning model using Scikit-learn, as outlined in BandersnatchStarter/tickets/thirdTicket.md. The objective of Sprint 3 is to create a machine learning model that analyzes monster data from MongoDB (set up in Sprint 1) and integrates it into the Flask web application for user interaction, such as predicting monster attributes. This guide is tailored for a junior JavaScript/TypeScript developer familiar with web development and possibly TensorFlow.js, but new to Python and Scikit-learn. By relating Python and Scikit-learn to JavaScript concepts, this guide will help you build a machine learning model while learning new tools.

Ticket Objectives

The third ticket focuses on:

Preparing Monster Data: Query and preprocess monster data from MongoDB for machine learning.
Building a Machine Learning Model: Use Scikit-learn to train a model (e.g., classify monster types or predict attributes like strength).
Integrating the Model into Flask: Create routes to accept user input and display model predictions.
Enabling User Interaction: Allow users to input monster attributes and receive predictions via the web app.

This is similar to building a predictive function in a JavaScript app (e.g., using TensorFlow.js) and exposing it via an Express.js API endpoint.

Prerequisites

Before starting, ensure you’ve completed Sprints 1 and 2 and have:

A MongoDB database with monster data (e.g., documents with fields like name, strength, type, speed) from Sprint 1.
Visualizations set up with Altair (Sprint 2) to understand the data.
The BandersnatchStarter repository set up:
- Cloned: git clone https://github.com/your-username/BandersnatchStarter.git
- Virtual environment activated:
  - Windows: venv\Scripts\activate
  - macOS/Linux: source venv/bin/activate
- Dependencies installed: python -m pip install -r requirements.txt
A running Flask app: python -m app.main (Windows) or python3 -m app.main (macOS/Linux).
Familiarity with Python dictionaries, lists, and pandas DataFrames (from Sprint 2), which are like JavaScript objects, arrays, and data manipulation libraries.

You’ll need Scikit-learn and pandas installed (included in requirements.txt) and a basic understanding of machine learning concepts (e.g., features, labels, training).

Step-by-Step Guide

Below are the steps to complete Sprint 3, with explanations and JavaScript analogies to help you understand the process.

Step 1: Understand Scikit-learn and Machine Learning Basics

Goal: Learn how Scikit-learn works and its role in building machine learning models.

What to Do:

Explore Scikit-learn:
- Scikit-learn is a Python library for classical machine learning. Where TensorFlow.js has you assemble neural networks from layers, Scikit-learn provides ready-made algorithms such as decision trees, logistic regression, and linear regression behind one consistent fit/predict interface.
- For this project, you’ll likely build a classification model (e.g., predict a monster’s type based on strength and speed) or a regression model (e.g., predict strength based on other attributes).
- Machine learning involves:
  - Features: Input data (e.g., strength, speed), like variables in a JavaScript function.
  - Labels: Output data (e.g., type), like the return value of a function.
  - Training: Teaching the model to map features to labels, like training a neural network in TensorFlow.js.
Verify Scikit-learn Installation:
- Check that Scikit-learn is installed: python -m pip show scikit-learn
- If it is not, make sure requirements.txt lists scikit-learn and reinstall: python -m pip install -r requirements.txt
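
As a rough illustration of the features, labels, and training vocabulary above, the sketch below fits a classifier on four made-up monsters. The numbers and type names are invented for this example and are not taken from the project's database.

```python
from sklearn.tree import DecisionTreeClassifier

# Features: one [strength, speed] row per monster (made-up values).
features = [[90, 20], [85, 25], [30, 80], [25, 90]]
# Labels: the answer the model should learn for each row.
labels = ["Earth", "Earth", "Fire", "Fire"]

# Training: fit() learns a mapping from features to labels.
toy_model = DecisionTreeClassifier(random_state=42)
toy_model.fit(features, labels)

# Prediction: ask for the label of a monster the model has not seen.
toy_prediction = toy_model.predict([[88, 22]])[0]
print(toy_prediction)
```

The call shape mirrors the function analogy: features go in, a label comes out. The only unusual part for a JavaScript developer is that predict() always takes a list of rows, even for a single monster.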

JavaScript Analogy:

Scikit-learn is like TensorFlow.js but with simpler APIs for common tasks. Instead of defining neural networks, you use algorithms like DecisionTreeClassifier (like a pre-built function).
Think of a machine learning model as a JavaScript function that takes inputs (features) and returns predictions (labels).

Tips:

Start with a simple algorithm like DecisionTreeClassifier for classification or LinearRegression for regression.
Visit the Scikit-learn Documentation for tutorials, similar to TensorFlow.js docs.

Step 2: Prepare Monster Data for Machine Learning

Goal: Query monster data from MongoDB and preprocess it for Scikit-learn.

What to Do:

Reuse MongoDB Connection:

Use the MongoDB connection from Sprint 1 (e.g., in app/database.py):

from pymongo import MongoClient
from dotenv import load_dotenv
import os
load_dotenv()
db_url = os.getenv("DB_URL")
client = MongoClient(db_url)
db = client["bandersnatch"]
collection = db["monsters"]

Fetch and Convert Data:

Query all monsters and convert to a pandas DataFrame (from Sprint 2):

import pandas as pd
def get_monsters():
    return list(collection.find())
monsters = get_monsters()
df = pd.DataFrame(monsters)

Remove the MongoDB _id field, as it’s not useful for machine learning:
```
df = df.drop("_id", axis=1, errors="ignore")
```
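
To try the conversion without a database connection, a small sketch like the following can help. The two documents are invented stand-ins for what list(collection.find()) might return; the real field names depend on what was stored in Sprint 1.

```python
import pandas as pd

# Stand-in for list(collection.find()); the values are invented.
sample_monsters = [
    {"_id": "a1", "name": "Ember", "type": "Fire", "strength": 70, "speed": 55},
    {"_id": "b2", "name": "Boulder", "type": "Earth", "strength": 95, "speed": 15},
]

# Each dictionary becomes a row; each key becomes a column.
sample_df = pd.DataFrame(sample_monsters)
# errors="ignore" keeps this safe to run even if _id is already gone.
sample_df = sample_df.drop("_id", axis=1, errors="ignore")

print(sample_df.columns.tolist())
print(sample_df.dtypes)
```

Checking dtypes at this point is worthwhile: numeric columns that show up as object usually contain strings and need converting before training. With PyMongo, _id can also be left out at query time by passing a projection such as {"_id": 0} as the second argument to find().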

Preprocess Data:
- Select Features and Labels: Decide what to predict. For example:
  - Classification: Predict type (e.g., Fire, Earth) using features like strength and speed.
  - Regression: Predict strength using features like speed and type.
- Handle Missing Data: Drop rows with missing values in features or labels before encoding, because .cat.codes turns a missing type into -1 instead of leaving it missing:
```
df = df.dropna()
```
- Handle Categorical Data: If type is a string (e.g., “Fire”), convert it to numbers:
```
df["type"] = df["type"].astype("category").cat.codes
```
  This is like mapping strings to numbers in JavaScript. Codes follow the sorted category order, so the mapping here would be { "Earth": 0, "Fire": 1 }.
- Split Features and Labels:
```
X = df[["strength", "speed"]]  # Features
y = df["type"]  # Labels
```
  X is like an array of input objects, and y is an array of outputs.
Split Data for Training and Testing:
- Split the data into training (e.g., 80%) and testing (20%) sets to evaluate the model:
```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
- This is like splitting a dataset in TensorFlow.js to train and validate a model.
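
The preprocessing and splitting steps above can be sketched end to end on invented data. Two choices here are assumptions rather than part of the original snippets: incomplete rows are dropped before encoding (since .cat.codes assigns -1 to a missing category), and stratify is passed to train_test_split so that every type appears in both splits.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented rows; the last monster has no type recorded.
raw = pd.DataFrame({
    "strength": [70, 95, 40, 65, 90, 35, 72, 88, 42, 60],
    "speed":    [55, 15, 80, 60, 20, 85, 50, 25, 75, 58],
    "type": ["Fire", "Earth", "Water", "Fire", "Earth",
             "Water", "Fire", "Earth", "Water", None],
})

# Encoding first would turn the missing type into the code -1.
codes_before_cleaning = raw["type"].astype("category").cat.codes
print(codes_before_cleaning.tolist())

# Dropping incomplete rows first avoids training on a fake "-1" class.
clean = raw.dropna(subset=["strength", "speed", "type"]).copy()
clean["type"] = clean["type"].astype("category").cat.codes

X_demo = clean[["strength", "speed"]]
y_demo = clean["type"]

# test_size=3 holds out three rows; stratify keeps one of each type among them.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=3, random_state=42, stratify=y_demo
)
print(len(X_tr), len(X_te))
```

On a dataset this small, an unstratified split can easily leave a type out of the test set entirely, which makes the accuracy number hard to interpret.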

JavaScript Analogy:

A pandas DataFrame is like a JavaScript array of objects, but with built-in methods for filtering or transforming data.
Preprocessing is like cleaning and formatting JSON data before passing it to a TensorFlow.js model.
train_test_split is like manually splitting an array in JavaScript for training and testing.

Tips:

Ensure your dataset has enough data (e.g., 20+ monsters) for meaningful training. If not, add more in MongoDB.
Print X and y to verify: print(X.head(), y.head()).
Use a fixed random_state (e.g., 42) for reproducible splits.

Step 3: Build and Train a Machine Learning Model

Goal: Train a Scikit-learn model on the monster data.

What to Do:

Choose a Model:

For classification (e.g., predicting type), use DecisionTreeClassifier:

from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(random_state=42)

For regression (e.g., predicting strength), use LinearRegression:

from sklearn.linear_model import LinearRegression
model = LinearRegression()

Train the Model:
- Fit the model to the training data:
```
model.fit(X_train, y_train)
```
- This is like calling model.fit() in TensorFlow.js to train a neural network.

Evaluate the Model:

Test the model on the test data:
```
predictions = model.predict(X_test)
```

For classification, check accuracy:

from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy}")

For regression, check mean squared error:

from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, predictions)
print(f"Mean Squared Error: {mse}")
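
To see training and evaluation run together, the sketch below uses synthetic monsters whose type depends only on strength. The data, the variable names, and the use of confusion_matrix are assumptions added for illustration; real monster data will be noisier and will score lower.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic monsters: type 0 is weak (strength 10-29), type 1 is strong (60-79).
# Speed follows the same pattern for both types, so it carries no signal.
rows = []
for i in range(20):
    rows.append({"strength": 10 + i, "speed": i, "type": 0})
    rows.append({"strength": 60 + i, "speed": i, "type": 1})
synthetic = pd.DataFrame(rows)

X_syn = synthetic[["strength", "speed"]]
y_syn = synthetic["type"]
X_train_syn, X_test_syn, y_train_syn, y_test_syn = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train_syn, y_train_syn)
test_predictions = clf.predict(X_test_syn)

test_accuracy = accuracy_score(y_test_syn, test_predictions)
# Rows are true types, columns are predicted types.
matrix = confusion_matrix(y_test_syn, test_predictions)
print(f"Accuracy: {test_accuracy:.2f}")
print(matrix)
```

The confusion matrix shows which types get mistaken for which, something a single accuracy number hides.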

JavaScript Analogy:

A Scikit-learn model is like a TensorFlow.js model, but simpler to set up (no need to define layers).
model.fit() is like model.fit() in TensorFlow.js, training the model on input-output pairs, except that it runs synchronously, so there is no Promise to await.
model.predict() is like model.predict() in TensorFlow.js, generating predictions for new data.

Tips:

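Start with DecisionTreeClassifier for simplicity: it is interpretable and needs no feature scaling. On small datasets it can overfit, so consider limiting max_depth if test accuracy is much lower than training accuracy.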
If accuracy is low, try adding more data or features, or experiment with other models (e.g., RandomForestClassifier).
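
One possible way to compare models, sketched here on synthetic data, is cross-validation. The dataset, the candidate list, and the choice of 5 folds are assumptions for illustration only.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration only: two types separated by strength.
demo = pd.DataFrame({
    "strength": [10 + i for i in range(15)] + [60 + i for i in range(15)],
    "speed": [i % 5 for i in range(30)],
    "type": [0] * 15 + [1] * 15,
})
X_cmp = demo[["strength", "speed"]]
y_cmp = demo["type"]

candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# cross_val_score trains and scores each model on 5 different splits,
# which is less noisy than a single train/test split on a small dataset.
mean_scores = {}
for name, candidate in candidates.items():
    scores = cross_val_score(candidate, X_cmp, y_cmp, cv=5)
    mean_scores[name] = scores.mean()
    print(f"{name}: {mean_scores[name]:.2f}")
```

If two models score about the same, the simpler one is usually the better choice for this project, since it is easier to explain and faster to load in Flask.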

Save the model for reuse:

import joblib
joblib.dump(model, "model.pkl")
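
A quick round trip is a cheap way to confirm that a saved model behaves like the original. The sketch below uses a tiny made-up training set and a temporary directory (an assumption, so that it never overwrites a real model.pkl).

```python
import os
import tempfile

import joblib
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up training set: [strength, speed] -> encoded type.
train_X = [[90, 20], [85, 25], [30, 80], [25, 90]]
train_y = [0, 0, 1, 1]
original = DecisionTreeClassifier(random_state=42).fit(train_X, train_y)

# A temporary directory keeps this demo from overwriting a real model.pkl.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")
    joblib.dump(original, model_path)
    restored = joblib.load(model_path)

# The restored model should answer exactly like the one that was saved.
probe = [[88, 22], [28, 85]]
same_answers = bool((original.predict(probe) == restored.predict(probe)).all())
print(same_answers)
```

Two cautions apply to any joblib file: it is a pickle, so only load files you created or trust, and it is best loaded with the same Scikit-learn version that saved it.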

Step 4: Integrate the Model into Flask

Goal: Create Flask routes to accept user input and display model predictions.

What to Do:

Load the Model:

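Load the saved model in app/main.py. The relative path is resolved against the directory the app is started from, so model.pkl must be in that directory: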

import joblib
model = joblib.load("model.pkl")

Create a Prediction Route:

Add a route to accept user input (e.g., strength, speed) and return predictions:

from flask import request, render_template
@app.route("/predict", methods=["GET", "POST"])
def predict():
    if request.method == "POST":
        data = request.form  # or request.get_json() for JSON input
        strength = float(data["strength"])
        speed = float(data["speed"])
        prediction = model.predict([[strength, speed]])
        return render_template("predict.html", prediction=prediction[0])
    return render_template("predict.html")

This accepts form data (like a POST request in Express.js) and predicts the monster’s type or strength.
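
The route can be exercised without a browser by using Flask's test client. The following is a self-contained sketch under several assumptions: demo_app, demo_model, and the inline PAGE template are stand-ins invented here, replacing the project's app, model.pkl, and predict.html.

```python
from flask import Flask, render_template_string, request
from sklearn.tree import DecisionTreeClassifier

demo_app = Flask(__name__)

# Stand-in for joblib.load("model.pkl"): a tiny model on made-up data.
demo_model = DecisionTreeClassifier(random_state=42).fit(
    [[90, 20], [85, 25], [30, 80], [25, 90]], ["Earth", "Earth", "Fire", "Fire"]
)

# Inline template so the sketch runs without a templates/ folder.
PAGE = "{% if prediction is defined %}Predicted Type: {{ prediction }}{% endif %}"


@demo_app.route("/predict", methods=["GET", "POST"])
def demo_predict():
    if request.method == "POST":
        strength = float(request.form["strength"])
        speed = float(request.form["speed"])
        # predict() expects a 2D input: one row per monster.
        prediction = str(demo_model.predict([[strength, speed]])[0])
        return render_template_string(PAGE, prediction=prediction)
    return render_template_string(PAGE)


# The test client sends requests without starting a server.
http_client = demo_app.test_client()
get_response = http_client.get("/predict")
post_response = http_client.post("/predict", data={"strength": "88", "speed": "22"})
print(post_response.get_data(as_text=True))
```

This is roughly what supertest does for an Express.js app: the request goes through the routing and handler code, but no port is opened.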

Create a Jinja2 Template:

In templates/predict.html, create a form for user input and display predictions:

<!DOCTYPE html>
<html>
<head>
    <title>Predict Monster Type</title>
</head>
<body>
    <h1>Predict Monster Type</h1>
    <form method="POST">
        <label>Strength:</label>
        <input type="number" name="strength" step="any" required><br>
        <label>Speed:</label>
        <input type="number" name="speed" step="any" required><br>
        <button type="submit">Predict</button>
    </form>
    {% if error %}
        <p>{{ error }}</p>
    {% endif %}
    {% if prediction is defined and prediction is not none %}
        <p>Predicted Type: {{ prediction }}</p>
    {% endif %}
</body>
</html>

Test the Route:
- Run the Flask app: python -m app.main (Windows) or python3 -m app.main (macOS/Linux).
- Visit http://127.0.0.1:5000/predict, enter values, and check the prediction.

JavaScript Analogy:

The Flask route is like an Express.js POST endpoint (e.g., app.post('/predict', ...)) that processes input and returns a response.
The Jinja2 template is like an EJS or Handlebars template, rendering a form and displaying results.
model.predict() is like a TensorFlow.js model’s predict() method, taking input and returning a result.

Tips:

Ensure input data matches the model’s expected format (e.g., [[strength, speed]] for two features).

Validate user input to avoid errors:

try:
    strength = float(data["strength"])
    speed = float(data["speed"])
except (KeyError, ValueError):
    return render_template("predict.html", error="Invalid input")

Step 5: Enhance User Interaction

Goal: Make the prediction interface user-friendly and robust.

What to Do:

Add Input Validation:

Check that inputs are valid numbers and within reasonable ranges:

if strength < 0 or speed < 0:
    return render_template("predict.html", error="Values must not be negative")
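
One way to keep the route short is to move the checks into a helper that can be tested on its own. The function below is a hypothetical sketch; its name, its return shape, and its error messages are assumptions, not part of the starter repository.

```python
import math


def parse_monster_input(form):
    """Return ((strength, speed), None) on success or (None, message) on failure.

    `form` is any mapping of field names to strings, such as request.form.
    """
    values = []
    for field in ("strength", "speed"):
        raw = form.get(field)
        if raw is None or raw.strip() == "":
            return None, f"Missing value for {field}"
        try:
            number = float(raw)
        except ValueError:
            return None, f"{field} must be a number"
        # float() also accepts "nan" and "inf", which are not useful inputs.
        if not math.isfinite(number):
            return None, f"{field} must be a finite number"
        if number < 0:
            return None, f"{field} must not be negative"
        values.append(number)
    return tuple(values), None


print(parse_monster_input({"strength": "50", "speed": "30"}))
print(parse_monster_input({"strength": "abc", "speed": "30"}))
print(parse_monster_input({"strength": "-5", "speed": "30"}))
```

Inside the route, the result could be unpacked as `values, error = parse_monster_input(request.form)`, rendering the template with the error when it is not None.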

Display Meaningful Predictions:

If predicting type (encoded as numbers), map numbers back to strings:

type_map = {0: "Earth", 1: "Fire", 2: "Water"}  # Must match your encoding; .cat.codes numbers types in sorted order
prediction = type_map[int(prediction[0])]
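
Rather than typing the map by hand, it can be derived from the same categorical encoding used during preprocessing. A minimal sketch, assuming an invented type column; the name derived_type_map is hypothetical.

```python
import pandas as pd

# Invented type column standing in for df["type"] before encoding.
types = pd.Series(["Fire", "Earth", "Water", "Fire", "Earth"])

as_category = types.astype("category")
codes = as_category.cat.codes
# cat.categories lists the labels in code order, so enumerate() rebuilds the map.
derived_type_map = dict(enumerate(as_category.cat.categories))

print(codes.tolist())
print(derived_type_map)

# Decoding a model output; int() also handles NumPy integer types.
decoded = derived_type_map[int(codes.iloc[0])]
print(decoded)
```

Because pandas sorts string categories, the codes are alphabetical rather than in order of appearance. Building the map once at training time and saving it next to the model keeps the two from drifting apart when new types are added.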

Add Visual Feedback:

Combine predictions with visualizations (from Sprint 2). For example, show the new data point on a scatter plot:

import altair as alt
new_data = pd.DataFrame([[strength, speed, prediction]], columns=["strength", "speed", "type"])
scatter = alt.Chart(new_data).mark_point().encode(
    x="strength",
    y="speed",
    color="type"
)
return render_template("predict.html", prediction=prediction, chart=scatter.to_json())

JavaScript Analogy:

Input validation is like checking req.body in Express.js before processing.
Mapping predictions is like converting a numeric ID to a string in a JavaScript app.
Combining predictions with visualizations is like updating a Chart.js chart with new data in React.

Tips:

Use CSS in predict.html to style the form, similar to styling a React form.
Test the interface with various inputs to ensure robustness.

Learning Tips for JavaScript Developers

Python vs. JavaScript: Python dictionaries ({}) are like JavaScript objects, and lists ([]) are like arrays. Use print() like console.log() for debugging.
Scikit-learn vs. TensorFlow.js: Scikit-learn is simpler, with pre-built algorithms that don’t require defining layers or tensors.
pandas DataFrames: Like JavaScript arrays of objects but with powerful methods for filtering or grouping (e.g., df.dropna()).
Flask Routes: Like Express.js routes. Use request.form for form data (like req.body) or request.get_json() for JSON.
Debugging:
- Print DataFrames or predictions: print(X.head(), y.head()).
- Use try-except for error handling:
```
try:
    # Your code
except Exception as e:
    print(f"Error: {e}")
```
Resources:
- Scikit-learn Documentation: Like TensorFlow.js docs.
- pandas Documentation: For DataFrame operations.
- Flask Documentation: Like Express.js docs.
- Python Tutorial: Like MDN for JavaScript.

Potential Challenges and Solutions

Insufficient Data: If the model performs poorly, ensure you have enough varied data (e.g., 20+ monsters). Add more via MongoDB.
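Type Mismatch Errors: Ensure features (X) are numeric, using pd.to_numeric() or .cat.codes as needed. Regression targets (y) must also be numeric; Scikit-learn classifiers accept string labels as well as numeric codes.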
Model Accuracy Issues: If accuracy is low, try more features, a different model (e.g., RandomForestClassifier), or more data.
Flask Route Errors: Check that form inputs match expected keys (e.g., strength, speed) and are converted to floats.
Template Rendering Issues: Ensure predict.html uses correct Jinja2 syntax. If visualizations are included, render the chart JSON with the safe filter (e.g., {{ chart | safe }}) so Jinja2 does not escape it.

Next Steps

Test Your Work:
- Verify that the model trains successfully and produces reasonable predictions.
- Test the /predict route with various inputs and check the output in the template.
- If using visualizations, ensure new predictions appear correctly.
Enhance the Model:
- Experiment with other Scikit-learn models (e.g., RandomForestClassifier, KNeighborsClassifier).
- Add more features (e.g., health, defense) to improve predictions.
Explore Stretch Goals:
- Allow users to retrain the model via a Flask route.
- Save and download the trained model or dataset.
- Add authentication to protect the prediction route.
Deploy: Test the model in the deployed app on Render.com (see repository’s deployment instructions).

Example Workflow

Here’s a high-level workflow:

Query monster data from MongoDB and convert to a pandas DataFrame.
Preprocess data by encoding categorical variables (e.g., type) and splitting into features (X) and labels (y).
Train a DecisionTreeClassifier on the training data and evaluate its accuracy.
Save the model using joblib.
Create a /predict route in Flask to accept strength and speed inputs and return predictions.
Build a predict.html template with a form and display the prediction.
Test the route at http://127.0.0.1:5000/predict and verify predictions.

Conclusion

Sprint 3 introduces you to machine learning with Scikit-learn, building on the MongoDB setup (Sprint 1) and visualizations (Sprint 2). By preparing data, training a model, and integrating it into Flask, you’ll create an interactive prediction interface. For a JavaScript developer, Scikit-learn is like a simplified TensorFlow.js, and Flask/Jinja2 is like Express.js/Handlebars. Focus on small steps—data preprocessing, model training, route creation—to master Python and machine learning.

decagondev/Bandersnatch-Ticket3-Guide.md
