This guide covers the third ticket of the BandersnatchStarter project, focused on building and integrating a machine learning model using Scikit-learn, as outlined in BandersnatchStarter/tickets/thirdTicket.md. The objective of Sprint 3 is to create a machine learning model that analyzes monster data from MongoDB (set up in Sprint 1) and integrates it into the Flask web application for user interaction, such as predicting monster attributes. This guide is tailored for a junior JavaScript/TypeScript developer familiar with web development and possibly TensorFlow.js, but new to Python and Scikit-learn. By relating Python and Scikit-learn to JavaScript concepts, this guide will help you build a machine learning model while learning new tools.
The third ticket focuses on:
- Preparing Monster Data: Query and preprocess monster data from MongoDB for machine learning.
- Building a Machine Learning Model: Use Scikit-learn to train a model (e.g., classify monster types or predict attributes like strength).
- Integrating the Model into Flask: Create routes to accept user input and display model predictions.
- Enabling User Interaction: Allow users to input monster attributes and receive predictions via the web app.
This is similar to building a predictive function in a JavaScript app (e.g., using TensorFlow.js) and exposing it via an Express.js API endpoint.
Before starting, ensure you’ve completed Sprints 1 and 2 and have:
- A MongoDB database with monster data (e.g., documents with fields like `name`, `strength`, `type`, `speed`) from Sprint 1.
- Visualizations set up with Altair (Sprint 2) to understand the data.
- The BandersnatchStarter repository set up:
  - Cloned: `git clone https://github.com/your-username/BandersnatchStarter.git`.
  - Virtual environment activated:
    - Windows: `venv\Scripts\activate`.
    - macOS/Linux: `source venv/bin/activate`.
  - Dependencies installed: `python -m pip install -r requirements.txt`.
- A running Flask app: `python -m app.main` (Windows) or `python3 -m app.main` (macOS/Linux).
- Familiarity with Python dictionaries, lists, and pandas DataFrames (from Sprint 2), which are like JavaScript objects, arrays, and data manipulation libraries.

You’ll need Scikit-learn and pandas installed (included in `requirements.txt`) and a basic understanding of machine learning concepts (e.g., features, labels, training).
Below are the steps to complete Sprint 3, with explanations and JavaScript analogies to help you understand the process.
Goal: Learn how Scikit-learn works and its role in building machine learning models.
What to Do:
- Explore Scikit-learn:
  - Scikit-learn is a Python library for machine learning, similar to TensorFlow.js but easier for beginners. It provides pre-built algorithms like decision trees, logistic regression, or linear regression.
  - For this project, you’ll likely build a classification model (e.g., predict a monster’s `type` based on `strength` and `speed`) or a regression model (e.g., predict `strength` based on other attributes).
  - Machine learning involves:
    - Features: Input data (e.g., `strength`, `speed`), like variables in a JavaScript function.
    - Labels: Output data (e.g., `type`), like the return value of a function.
    - Training: Teaching the model to map features to labels, like training a neural network in TensorFlow.js.
- Verify Scikit-learn Installation:
  - Check that Scikit-learn is installed: `pip show scikit-learn`.
  - If not, ensure `requirements.txt` includes `scikit-learn` and reinstall: `pip install -r requirements.txt`.
JavaScript Analogy:
- Scikit-learn is like TensorFlow.js but with simpler APIs for common tasks. Instead of defining neural networks, you use algorithms like `DecisionTreeClassifier` (like a pre-built function).
- Think of a machine learning model as a JavaScript function that takes inputs (features) and returns predictions (labels).
Tips:
- Start with a simple algorithm like `DecisionTreeClassifier` for classification or `LinearRegression` for regression.
- Visit the Scikit-learn Documentation for tutorials, similar to TensorFlow.js docs.
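
To make the features, labels, and training vocabulary concrete, here is a minimal sketch on a tiny hand-made dataset. The monster values and type names are invented for illustration and are not taken from the project's database.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: each row is [strength, speed] (the features).
X = [[90, 20], [85, 25], [30, 80], [25, 90]]
# One label per row (the value the model should learn to predict).
y = ["Earth", "Earth", "Fire", "Fire"]

# Training: the model learns a mapping from features to labels.
model = DecisionTreeClassifier(random_state=42)
model.fit(X, y)

# Prediction: like calling a function with new inputs.
prediction = model.predict([[88, 22]])
print(prediction[0])  # Earth
```

The strong, slow monster lands with the other strong, slow examples, which is the whole idea: the model generalizes from labeled rows to rows it has not seen.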
Goal: Query monster data from MongoDB and preprocess it for Scikit-learn.
What to Do:
- Reuse MongoDB Connection:
  - Use the MongoDB connection from Sprint 1 (e.g., in `app/database.py`):

        from pymongo import MongoClient
        from dotenv import load_dotenv
        import os

        load_dotenv()
        db_url = os.getenv("DB_URL")
        client = MongoClient(db_url)
        db = client["bandersnatch"]
        collection = db["monsters"]

- Fetch and Convert Data:
  - Query all monsters and convert to a pandas DataFrame (from Sprint 2):

        import pandas as pd

        def get_monsters():
            return list(collection.find())

        monsters = get_monsters()
        df = pd.DataFrame(monsters)

  - Remove the MongoDB `_id` field, as it’s not useful for machine learning:

        df = df.drop("_id", axis=1, errors="ignore")

- Preprocess Data:
  - Select Features and Labels: Decide what to predict. For example:
    - Classification: Predict `type` (e.g., Fire, Earth) using features like `strength` and `speed`.
    - Regression: Predict `strength` using features like `speed` and `type`.
  - Handle Missing Data: Ensure no missing values in features or labels. Do this before encoding `type`, because `.cat.codes` turns a missing value into `-1`, which `dropna()` no longer treats as missing:

        df = df.dropna()

  - Handle Categorical Data: If `type` is a string (e.g., “Fire”), convert it to numbers:

        df["type"] = df["type"].astype("category").cat.codes

    This is like mapping strings to numbers in JavaScript (e.g., `{ "Earth": 0, "Fire": 1 }`). pandas sorts the categories before numbering them, so the codes follow alphabetical order rather than order of appearance.
  - Split Features and Labels:

        X = df[["strength", "speed"]]  # Features
        y = df["type"]  # Labels

    `X` is like an array of input objects, and `y` is an array of outputs.
- Split Data for Training and Testing:
  - Split the data into training (e.g., 80%) and testing (20%) sets to evaluate the model:

        from sklearn.model_selection import train_test_split

        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

  - This is like splitting a dataset in TensorFlow.js to train and validate a model.
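
The preparation steps can be tried without a database connection. The sketch below assumes a small in-memory list of records standing in for the result of `collection.find()`; the monster names and values are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical records standing in for the result of collection.find().
monsters = [
    {"_id": 1, "name": "Ash", "strength": 80, "speed": 30, "type": "Fire"},
    {"_id": 2, "name": "Boulder", "strength": 95, "speed": 10, "type": "Earth"},
    {"_id": 3, "name": "Cinder", "strength": 70, "speed": 45, "type": "Fire"},
    {"_id": 4, "name": "Dune", "strength": 90, "speed": 15, "type": "Earth"},
    {"_id": 5, "name": "Ember", "strength": None, "speed": 50, "type": "Fire"},
    {"_id": 6, "name": "Flint", "strength": 85, "speed": 20, "type": "Earth"},
]

df = pd.DataFrame(monsters).drop("_id", axis=1, errors="ignore")

# Drop incomplete rows first, then encode the label column as integers.
df = df.dropna(subset=["strength", "speed", "type"])
df["type"] = df["type"].astype("category").cat.codes

X = df[["strength", "speed"]]  # Features
y = df["type"]                 # Labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 4 1
```

Passing `subset=` to `dropna()` is a design choice in this sketch: it only discards rows that are missing a column the model actually uses, so an empty `name` would not cost a training example.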
JavaScript Analogy:
- A pandas DataFrame is like a JavaScript array of objects, but with built-in methods for filtering or transforming data.
- Preprocessing is like cleaning and formatting JSON data before passing it to a TensorFlow.js model.
- `train_test_split` is like manually splitting an array in JavaScript for training and testing.
Tips:
- Ensure your dataset has enough data (e.g., 20+ monsters) for meaningful training. If not, add more in MongoDB.
- Print `X` and `y` to verify: `print(X.head(), y.head())`.
- Use a fixed `random_state` (e.g., 42) for reproducible splits.
Goal: Train a Scikit-learn model on the monster data.
What to Do:
- Choose a Model:
  - For classification (e.g., predicting `type`), use `DecisionTreeClassifier`:

        from sklearn.tree import DecisionTreeClassifier

        model = DecisionTreeClassifier(random_state=42)

  - For regression (e.g., predicting `strength`), use `LinearRegression`:

        from sklearn.linear_model import LinearRegression

        model = LinearRegression()

- Train the Model:
  - Fit the model to the training data:

        model.fit(X_train, y_train)

  - This is like calling `model.fit()` in TensorFlow.js to train a neural network.
- Evaluate the Model:
  - Test the model on the test data:

        predictions = model.predict(X_test)

  - For classification, check accuracy:

        from sklearn.metrics import accuracy_score

        accuracy = accuracy_score(y_test, predictions)
        print(f"Accuracy: {accuracy}")

  - For regression, check mean squared error:

        from sklearn.metrics import mean_squared_error

        mse = mean_squared_error(y_test, predictions)
        print(f"Mean Squared Error: {mse}")

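
The choose, train, and evaluate steps fit together as in the following sketch, which runs both the classification and the regression variant. The data frame is a hypothetical, already-cleaned stand-in for the monster data, with `type` encoded as 0 or 1.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical, already-cleaned data: type is encoded as 0 or 1.
df = pd.DataFrame({
    "strength": [90, 85, 95, 88, 92, 30, 25, 35, 28, 32],
    "speed":    [20, 25, 15, 22, 18, 80, 90, 75, 85, 82],
    "type":     [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})

# Classification: predict type from strength and speed.
X_train, X_test, y_train, y_test = train_test_split(
    df[["strength", "speed"]], df["type"], test_size=0.2, random_state=42
)
clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f"Accuracy: {accuracy}")

# Regression: predict strength from speed and type.
X_train, X_test, y_train, y_test = train_test_split(
    df[["speed", "type"]], df["strength"], test_size=0.2, random_state=42
)
reg = LinearRegression().fit(X_train, y_train)
mse = mean_squared_error(y_test, reg.predict(X_test))
print(f"Mean Squared Error: {mse}")
```

Accuracy is the fraction of correct test predictions, so it lies between 0 and 1 and higher is better. Mean squared error is in squared units of the target, and lower is better.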
JavaScript Analogy:
- A Scikit-learn model is like a TensorFlow.js model, but simpler to set up (no need to define layers).
- `model.fit()` is like `model.fit()` in TensorFlow.js, training the model on input-output pairs.
- `model.predict()` is like `model.predict()` in TensorFlow.js, generating predictions for new data.
Tips:
- Start with `DecisionTreeClassifier` for simplicity, as it’s interpretable and works well with small datasets.
- If accuracy is low, try adding more data or features, or experiment with other models (e.g., `RandomForestClassifier`).
- Save the model for reuse:

      import joblib

      joblib.dump(model, "model.pkl")

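
A quick way to confirm that saving worked is to load the file back and compare predictions. This sketch writes to a temporary directory instead of the project folder, and the training data is a hypothetical placeholder.

```python
import os
import tempfile

import joblib
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: [strength, speed] rows with encoded types.
X = [[90, 20], [85, 25], [30, 80], [25, 90]]
y = [0, 0, 1, 1]
model = DecisionTreeClassifier(random_state=42).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    joblib.dump(model, path)       # Serialize the fitted model to disk.
    restored = joblib.load(path)   # Read it back as a new object.

# The restored model should predict exactly what the original predicts.
same = bool((restored.predict(X) == model.predict(X)).all())
print(same)  # True
```

A saved model is generally expected to be loaded with the same Scikit-learn version that wrote it, so it is worth keeping the version in `requirements.txt` pinned.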
Goal: Create Flask routes to accept user input and display model predictions.
What to Do:
- Load the Model:
  - Load the saved model in `app/main.py`:

        import joblib

        model = joblib.load("model.pkl")

- Create a Prediction Route:
  - Add a route to accept user input (e.g., `strength`, `speed`) and return predictions:

        from flask import request, render_template

        @app.route("/predict", methods=["GET", "POST"])
        def predict():
            if request.method == "POST":
                data = request.form  # or request.get_json() for JSON input
                strength = float(data["strength"])
                speed = float(data["speed"])
                prediction = model.predict([[strength, speed]])
                return render_template("predict.html", prediction=prediction[0])
            return render_template("predict.html")

  - This accepts form data (like a POST request in Express.js) and predicts the monster’s `type` or `strength`.
- Create a Jinja2 Template:
  - In `templates/predict.html`, create a form for user input and display predictions. The template tests `prediction is defined` rather than `prediction` alone, because an encoded type of `0` is falsy in Jinja2 and would otherwise never be shown. `step="any"` lets the number inputs accept decimals:

        <!DOCTYPE html>
        <html>
        <head>
          <title>Predict Monster Type</title>
        </head>
        <body>
          <h1>Predict Monster Type</h1>
          <form method="POST">
            <label>Strength:</label>
            <input type="number" step="any" name="strength" required><br>
            <label>Speed:</label>
            <input type="number" step="any" name="speed" required><br>
            <button type="submit">Predict</button>
          </form>
          {% if error %}
          <p>{{ error }}</p>
          {% endif %}
          {% if prediction is defined %}
          <p>Predicted Type: {{ prediction }}</p>
          {% endif %}
        </body>
        </html>

- Test the Route:
  - Run the Flask app: `python -m app.main` (Windows) or `python3 -m app.main` (macOS/Linux).
  - Visit `http://127.0.0.1:5000/predict`, enter values, and check the prediction.
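
The route can also be exercised without a browser by using Flask's built-in test client. The sketch below is self-contained rather than a copy of the project's `app/main.py`: `StubModel` is a hypothetical stand-in for the model loaded with `joblib`, and the inline `PAGE` string replaces `predict.html`.

```python
from flask import Flask, render_template_string, request

app = Flask(__name__)


class StubModel:
    """Hypothetical stand-in for the model loaded with joblib."""

    def predict(self, rows):
        # Label a monster 1 when speed exceeds strength, otherwise 0.
        return [1 if speed > strength else 0 for strength, speed in rows]


model = StubModel()

# Inline template used instead of templates/predict.html.
PAGE = "{% if prediction is defined %}Predicted Type: {{ prediction }}{% endif %}"


@app.route("/predict", methods=["GET", "POST"])
def predict():
    if request.method == "POST":
        strength = float(request.form["strength"])
        speed = float(request.form["speed"])
        prediction = model.predict([[strength, speed]])
        return render_template_string(PAGE, prediction=prediction[0])
    return render_template_string(PAGE)


# The test client sends requests without starting a server.
client = app.test_client()
get_body = client.get("/predict").get_data(as_text=True)
post_body = client.post(
    "/predict", data={"strength": "30", "speed": "80"}
).get_data(as_text=True)
print(post_body)  # Predicted Type: 1
```

The GET request renders an empty page because no prediction is passed to the template, while the POST request renders the stub's answer.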
JavaScript Analogy:
- The Flask route is like an Express.js POST endpoint (e.g., `app.post('/predict', ...)`) that processes input and returns a response.
- The Jinja2 template is like an EJS or Handlebars template, rendering a form and displaying results.
- `model.predict()` is like a TensorFlow.js model’s `predict()` method, taking input and returning a result.
Tips:
- Ensure input data matches the model’s expected format (e.g., `[[strength, speed]]` for two features).
- Validate user input to avoid errors. A missing field raises `KeyError` and a non-numeric value raises `ValueError`, so catch both:

      try:
          strength = float(data["strength"])
      except (KeyError, ValueError):
          return render_template("predict.html", error="Invalid input")

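
Validation is easier to reuse and test when it lives in a small helper instead of inside the route. `parse_feature` below is a hypothetical helper, not part of the starter repository; it works on any dict-like object, including `request.form`.

```python
import math


def parse_feature(form, key):
    """Return (value, error); exactly one of the two is None."""
    raw = form.get(key)
    if raw is None or raw.strip() == "":
        return None, f"{key} is required"
    try:
        value = float(raw)
    except ValueError:
        return None, f"{key} must be a number"
    # float() accepts "nan" and "inf", which a model cannot use.
    if not math.isfinite(value):
        return None, f"{key} must be a finite number"
    if value < 0:
        return None, f"{key} must not be negative"
    return value, None


# Usage with a plain dict in place of request.form.
print(parse_feature({"strength": "42"}, "strength"))   # (42.0, None)
print(parse_feature({"strength": "abc"}, "strength"))  # (None, 'strength must be a number')
```

Inside the route, a non-`None` error can be passed straight to `render_template("predict.html", error=error)`.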
Goal: Make the prediction interface user-friendly and robust.
What to Do:
- Add Input Validation:
  - Check that inputs are valid numbers and within reasonable ranges:

        if strength < 0 or speed < 0:
            return render_template("predict.html", error="Values must not be negative")

- Display Meaningful Predictions:
  - If predicting `type` (encoded as numbers), map numbers back to strings. With `.cat.codes`, the numbers follow the alphabetical order of the type names:

        type_map = {0: "Earth", 1: "Fire", 2: "Water"}  # Based on your encoding
        prediction = type_map[prediction[0]]

- Add Visual Feedback:
  - Combine predictions with visualizations (from Sprint 2). For example, show the new data point on a scatter plot:

        import altair as alt

        new_data = pd.DataFrame([[strength, speed, prediction]], columns=["strength", "speed", "type"])
        scatter = alt.Chart(new_data).mark_point().encode(
            x="strength",
            y="speed",
            color="type"
        )
        return render_template("predict.html", prediction=prediction, chart=scatter.to_json())

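
Rather than typing the code-to-name mapping by hand, it can be derived from the same categorical column that produced the codes. This is a sketch with made-up rows; the `type_code` column name is an assumption chosen so the original strings stay visible.

```python
import pandas as pd

# Hypothetical rows; only the type column matters here.
df = pd.DataFrame({"type": ["Fire", "Earth", "Water", "Fire"]})

categories = df["type"].astype("category")
df["type_code"] = categories.cat.codes

# Build the code-to-name mapping from the same categorical column,
# so it cannot drift out of sync with the encoding.
type_map = dict(enumerate(categories.cat.categories))
print(type_map)  # {0: 'Earth', 1: 'Fire', 2: 'Water'}
```

Because the mapping depends on which types were present at training time, one option is to save it next to the model (for example with `joblib.dump`) and load both in the Flask app.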
JavaScript Analogy:
- Input validation is like checking `req.body` in Express.js before processing.
- Mapping predictions is like converting a numeric ID to a string in a JavaScript app.
- Combining predictions with visualizations is like updating a Chart.js chart with new data in React.
Tips:
- Use CSS in `predict.html` to style the form, similar to styling a React form.
- Test the interface with various inputs to ensure robustness.
- Python vs. JavaScript: Python dictionaries (`{}`) are like JavaScript objects, and lists (`[]`) are like arrays. Use `print()` like `console.log()` for debugging.
- Scikit-learn vs. TensorFlow.js: Scikit-learn is simpler, with pre-built algorithms that don’t require defining layers or tensors.
- pandas DataFrames: Like JavaScript arrays of objects but with powerful methods for filtering or grouping (e.g., `df.dropna()`).
- Flask Routes: Like Express.js routes. Use `request.form` for form data (like `req.body`) or `request.get_json()` for JSON.
- Debugging:
  - Print DataFrames or predictions: `print(X.head(), y.head())`.
  - Use try-except for error handling:

        try:
            # Your code
        except Exception as e:
            print(f"Error: {e}")

- Resources:
  - Scikit-learn Documentation: Like TensorFlow.js docs.
  - pandas Documentation: For DataFrame operations.
  - Flask Documentation: Like Express.js docs.
  - Python Tutorial: Like MDN for JavaScript.
- Insufficient Data: If the model performs poorly, ensure you have enough varied data (e.g., 20+ monsters). Add more via MongoDB.
- Type Mismatch Errors: Ensure features (`X`) and labels (`y`) are numeric. Use `pd.to_numeric()` or `.cat.codes` as needed.
- Model Accuracy Issues: If accuracy is low, try more features, a different model (e.g., `RandomForestClassifier`), or more data.
- Flask Route Errors: Check that form inputs match expected keys (e.g., `strength`, `speed`) and are converted to floats.
- Template Rendering Issues: Ensure `predict.html` uses correct Jinja2 syntax and that `chart.to_json()` uses the `safe` filter if visualizations are included.
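
For the type mismatch case, here is a small sketch of `pd.to_numeric()` cleaning a column that arrived as strings. The sample values, including the stray `"unknown"` entry, are invented for illustration.

```python
import pandas as pd

# Hypothetical data where strength was stored as text.
df = pd.DataFrame({"strength": ["80", "95", "unknown"], "speed": [30, 10, 45]})

# errors="coerce" turns values that cannot be parsed into NaN
# instead of raising an exception.
df["strength"] = pd.to_numeric(df["strength"], errors="coerce")

# Remove the rows that could not be converted.
df = df.dropna(subset=["strength"])
print(df["strength"].tolist())  # [80.0, 95.0]
```

Coercing and then dropping keeps the clean rows usable. Whether to drop or repair the bad rows depends on how much data you can afford to lose.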
- Test Your Work:
  - Verify that the model trains successfully and produces reasonable predictions.
  - Test the `/predict` route with various inputs and check the output in the template.
  - If using visualizations, ensure new predictions appear correctly.
- Enhance the Model:
  - Experiment with other Scikit-learn models (e.g., `RandomForestClassifier`, `KNeighborsClassifier`).
  - Add more features (e.g., `health`, `defense`) to improve predictions.
- Explore Stretch Goals:
  - Allow users to retrain the model via a Flask route.
  - Save and download the trained model or dataset.
  - Add authentication to protect the prediction route.
- Deploy: Test the model in the deployed app on Render.com (see repository’s deployment instructions).
Here’s a high-level workflow:
- Query monster data from MongoDB and convert to a pandas DataFrame.
- Preprocess data by encoding categorical variables (e.g., `type`) and splitting into features (`X`) and labels (`y`).
- Train a `DecisionTreeClassifier` on the training data and evaluate its accuracy.
- Save the model using `joblib`.
- Create a `/predict` route in Flask to accept `strength` and `speed` inputs and return predictions.
- Build a `predict.html` template with a form and display the prediction.
- Test the route at `http://127.0.0.1:5000/predict` and verify predictions.
Sprint 3 introduces you to machine learning with Scikit-learn, building on the MongoDB setup (Sprint 1) and visualizations (Sprint 2). By preparing data, training a model, and integrating it into Flask, you’ll create an interactive prediction interface. For a JavaScript developer, Scikit-learn is like a simplified TensorFlow.js, and Flask/Jinja2 is like Express.js/Handlebars. Focus on small steps (data preprocessing, model training, route creation) to master Python and machine learning.