This guide covers the third ticket of the BandersnatchStarter project, titled "Machine Learning Model," as outlined in BandersnatchStarter/tickets/thirdTicket.md. The objective of Sprint 3 is to explore machine learning models in a Jupyter Notebook, develop a Machine class in app/machine.py using Scikit-learn to predict monster attributes (e.g., Rarity) from data generated in Sprint 1, serialize the model with joblib, and integrate it into the Flask web application. This guide is designed for someone with a beginner data science background, familiar with Python, pandas, and Jupyter Notebooks, but new to Flask and Git. You’ll learn to use Scikit-learn (like in your course), serialize models (like saving notebook outputs), and integrate predictions with Flask (like sharing results online), while managing code with Git (like notebook checkpoints). The guide relates tasks to your notebook experience, guiding you toward a solution without providing complete code to encourage learning.
The third ticket focuses on:
- Notebook Model Training & Tuning: Create a Jupyter Notebook to train and tune at least three machine learning models, report accuracy, and describe the best model.
- Machine Learning Interface Class: Implement a `Machine` class in `app/machine.py` to initialize, train, and predict using a Scikit-learn model.
- Model Serialization: Add methods to save and load the model using joblib.
- API Model Integration: Include an `info()` method and integrate the model into Flask for predictions.
- Using Git: Track changes with Git for collaboration and version control.
This is like moving from training models in a notebook to creating a reusable model class and making predictions available via a web app, using the monster data (from MonsterLab) stored in MongoDB.
Before starting, ensure you’ve completed the setup and previous sprints:
- Install Python 3, Git (git-scm.com), and Jupyter Notebook (`pip install jupyter`).
- Fork and clone the repository:
  ```
  git clone https://github.com/your-username/BandersnatchStarter.git
  cd BandersnatchStarter
  ```
- Set up a virtual environment:
  - Windows: `python -m venv venv` and `venv\Scripts\activate`.
  - macOS/Linux: `python3 -m venv venv` and `source venv/bin/activate`.
- Install dependencies: `python -m pip install -r requirements.txt`. Ensure `scikit-learn`, `joblib`, `pandas`, `MonsterLab`, and `pymongo` are installed (a quick import check follows this list).
- Complete Sprint 1 (MongoDB with 1,000+ monsters) and Sprint 2 (Altair visualizations).
- Set the `SPRINT` variable in `app/main.py` to `3`:
  ```python
  SPRINT = 3
  ```
- Test the Flask app: `python -m app.main` (Windows) or `python3 -m app.main` (macOS/Linux), and visit `http://127.0.0.1:5000`.
- Complete the onboarding module in your course.
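To confirm the key libraries are importable before you start, run a quick check in a notebook or Python shell (a minimal sketch; the printed versions will be whatever your environment has installed):

```python
# Sanity check: every library Sprint 3 relies on should import cleanly.
import joblib
import pandas
import pymongo
import sklearn
import MonsterLab  # imported only to confirm installation

print("scikit-learn:", sklearn.__version__)
print("pandas:", pandas.__version__)
print("joblib:", joblib.__version__)
print("pymongo:", pymongo.version)
```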
You’ll need familiarity with pandas DataFrames, Scikit-learn models (e.g., RandomForestClassifier), and Jupyter Notebooks from your data science course.
Below are the steps to complete Sprint 3, with explanations relating tasks to your Jupyter Notebook and pandas experience, incorporating the monster data from Sprint 1 and the specific requirements for the Machine class.
Goal: Use a Jupyter Notebook to train and tune at least three machine learning models, like experimenting with models in a notebook.
What to Do:
- Start a Jupyter Notebook:
  - In the project root, run:
    ```
    jupyter notebook
    ```
  - Create a new notebook (e.g., `model_exploration.ipynb`), like starting a notebook for model training.
- Load Monster Data:
  - Use the `MongoDB` class from Sprint 1 (`app/data.py`) to fetch data:
    ```python
    from app.data import MongoDB
    import pandas as pd

    db = MongoDB("monsters")
    df = db.dataframe()
    ```
  - Inspect the DataFrame: `df.head()` to see attributes (e.g., `name`, `strength`, `type`, `speed`, `Rarity`).
- Preprocess Data:
  - Choose `Rarity` as the target (based on the example) and other columns as features.
  - Handle categorical columns (e.g., `type`, `name`):
    ```python
    df["type"] = df["type"].astype("category").cat.codes
    df = df.drop(columns=["name"], errors="ignore")  # Drop non-numeric columns
    ```
  - Handle missing values: `df = df.dropna()`.
  - Split features and target:
    ```python
    X = df.drop(columns=["Rarity"])
    y = df["Rarity"]
    ```
  - This is like preprocessing a dataset in a notebook before model training.
- Train and Tune Models:
  - Train at least three Scikit-learn models (e.g., `RandomForestClassifier`, `DecisionTreeClassifier`, `LogisticRegression`):
    ```python
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = RandomForestClassifier()
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print(f"Accuracy: {accuracy_score(y_test, predictions)}")
    ```
  - Tune hyperparameters (e.g., `n_estimators` for RandomForest); see the grid-search sketch after these steps:
    ```python
    model = RandomForestClassifier(n_estimators=100, max_depth=10)
    ```
  - Repeat for other models, comparing accuracy.
- Report the Best Model:
  - Record accuracy for each model in the notebook.
  - Write 1–2 paragraphs describing your best model (e.g., RandomForestClassifier), including:
    - Why it performed best (e.g., highest accuracy).
    - Hyperparameters used (e.g., `n_estimators=100`).
    - Potential improvements (e.g., more data, feature engineering).
  - Save the notebook, like documenting a model experiment.
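Rather than tweaking hyperparameters by hand, you can search a small grid systematically. A minimal sketch, assuming `X_train` and `y_train` from the split above (the parameter ranges are illustrative, not recommendations):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Try every combination with 5-fold cross-validation and keep the best.
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, None],
}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)
print(f"Best params: {search.best_params_}")
print(f"Best CV accuracy: {search.best_score_:.3f}")
```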
Notebook Analogy:
- The notebook is like your usual environment for training Scikit-learn models.
- Preprocessing is like cleaning a dataset in a notebook (e.g., `df.dropna()`).
- Model training and tuning is like fitting and tweaking models in a notebook cell.
Tips:
- Use `df.info()` to check data types and missing values.
- Save the best model's configuration for use in `app/machine.py`.
- Explore the Scikit-learn Documentation for model options and tuning.
Goal: Create a `Machine` class in `app/machine.py` to train and predict, like a reusable model class in a notebook.
What to Do:
- Open `app/machine.py`:
  - This is where you'll define the `Machine` class, like writing a class in a notebook to encapsulate model operations.
- Define the `__init__` Method:
  - Initialize the class with a DataFrame and train a model:
    ```python
    from pandas import DataFrame
    from sklearn.ensemble import RandomForestClassifier


    class Machine:
        def __init__(self, df: DataFrame):
            pass  # Replace with your code
    ```
  - Store the model name (e.g., `"Random Forest Classifier"`).
  - Extract the target (`Rarity`) and features (all other columns):
    ```python
    self.target = df["Rarity"]
    self.features = df.drop(columns=["Rarity"])
    ```
  - Initialize and train a Scikit-learn model (e.g., `RandomForestClassifier`) using the best model from your notebook:
    ```python
    self.model = RandomForestClassifier(n_estimators=100, random_state=42)
    self.model.fit(self.features, self.target)
    ```
- Define the `__call__` Method:
  - Add a method to make predictions:
    ```python
    def __call__(self, pred_basis: DataFrame):
        pass  # Replace with your code
    ```
  - Accept a DataFrame of features, predict the target, and return the prediction and probability (see the probability sketch after these steps):
    ```python
    prediction = self.model.predict(pred_basis)[0]
    probability = self.model.predict_proba(pred_basis)[0]  # one probability per class
    # Tip: max(probability) is the confidence of the predicted class.
    return prediction, probability
    ```
  - This is like calling `model.predict()` in a notebook but with a class interface.
- Test the Class:
  - Test in a notebook:
    ```python
    from app.machine import Machine
    from app.data import MongoDB

    # Apply the same preprocessing as in your notebook (encode categoricals,
    # drop non-numeric columns) before constructing the Machine.
    df = MongoDB("monsters").dataframe()
    model = Machine(df)
    test_data = df.drop(columns=["Rarity"]).iloc[[0]]  # Single row
    pred, prob = model(test_data)
    print(f"Prediction: {pred}, Probability: {prob}")
    ```
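To see what the probability half of the return value means, here is a short sketch, assuming `model` and `test_data` from the test above (`classes_` and `predict_proba` are standard Scikit-learn attributes):

```python
# One probability per class, ordered by the estimator's classes_ attribute.
probs = model.model.predict_proba(test_data)[0]
for label, p in zip(model.model.classes_, probs):
    print(f"{label}: {p:.2%}")
print(f"Confidence in the predicted class: {max(probs):.2%}")
```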
Notebook Analogy:
- The `Machine` class is like a notebook class that wraps `model.fit()` and `model.predict()`.
- `__init__` is like a notebook cell that trains a model.
- `__call__` is like a cell that runs predictions on new data.
Tips:
- Ensure `pred_basis` has the same feature columns as `self.features`.
- Add a docstring for each method, per PEP 8.
- Use the best model from your notebook (e.g., RandomForestClassifier with tuned parameters).
Goal: Add `save()` and `open()` methods to serialize the model, like saving a notebook model to a file.
What to Do:
- Define the `save()` Method:
  - Add a method to save the model:
    ```python
    def save(self, filepath: str) -> None:
        pass  # Replace with your code
    ```
  - Use `joblib` to save the model:
    ```python
    import joblib

    joblib.dump(self.model, filepath)
    ```
- Define the `open()` Method:
  - Add a method to load a model:
    ```python
    @classmethod
    def open(cls, filepath: str) -> "Machine":
        pass  # Replace with your code
    ```
  - Load the model and attach it to a new `Machine` instance. Note that `cls(df=None)` would fail, because `__init__` expects a real DataFrame; `cls.__new__(cls)` creates an instance without calling `__init__` (an alternative that sidesteps this is sketched after these steps):
    ```python
    model = joblib.load(filepath)
    instance = cls.__new__(cls)  # Bypass __init__; no retraining needed
    instance.model = model
    return instance
    ```
- Test Serialization:
  - Test in a notebook:
    ```python
    model = Machine(df)
    model.save("model.pkl")
    loaded_model = Machine.open("model.pkl")
    pred, prob = loaded_model(test_data)
    print(f"Loaded Prediction: {pred}")
    ```
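An alternative design worth considering (an assumption, not a ticket requirement): serialize the whole `Machine` instance rather than just the underlying estimator, so attributes like the name and timestamp survive the round trip and `open()` needs no placeholder. A self-contained sketch using a hypothetical `TinyMachine` stand-in:

```python
import joblib
from sklearn.ensemble import RandomForestClassifier


class TinyMachine:
    """Hypothetical stand-in for Machine, used only to show round-tripping."""

    def __init__(self):
        self.name = "Random Forest Classifier"
        self.model = RandomForestClassifier()

    def save(self, filepath: str) -> None:
        joblib.dump(self, filepath)  # dump the whole instance, not just the model

    @classmethod
    def open(cls, filepath: str) -> "TinyMachine":
        return joblib.load(filepath)  # attributes come back intact


machine = TinyMachine()
machine.save("tiny.joblib")
restored = TinyMachine.open("tiny.joblib")
print(restored.name)  # -> "Random Forest Classifier"
```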
Notebook Analogy:
- `joblib.dump()` is like `df.to_csv()` to save data.
- `joblib.load()` is like `pd.read_csv()` to load data.
- Serialization is like saving a trained model in a notebook for later use.
Tips:
- Add docstrings, e.g., “Saves the model to the specified filepath.”
- Ensure `filepath` is valid (e.g., `model.pkl`).
- Handle errors with try-except:
  ```python
  try:
      joblib.dump(self.model, filepath)
  except Exception as e:
      print(f"Error: {e}")
  ```
Goal: Add an `info()` method and integrate the model into Flask, like sharing predictions online.
What to Do:
- Define the `info()` Method:
  - Add a method to return model details:
    ```python
    def info(self) -> str:
        pass  # Replace with your code
    ```
  - Return a string with the model name and initialization timestamp:
    ```python
    from datetime import datetime

    self.init_time = datetime.now()  # Store in __init__
    return f"Model: {self.name}, Initialized: {self.init_time}"
    ```
- Integrate with Flask:
  - In `app/main.py`, create a `/model` route to use the `Machine` class (note the added `import pandas as pd`, which the form-handling code needs):
    ```python
    import pandas as pd
    from flask import render_template, request

    from app.data import MongoDB
    from app.machine import Machine


    @app.route("/model", methods=["GET", "POST"])
    def model():
        db = MongoDB("monsters")
        df = db.dataframe()
        machine = Machine(df)
        if request.method == "POST":
            data = request.form
            features = pd.DataFrame([{
                "strength": float(data["strength"]),
                "speed": float(data["speed"]),
                # Add other features
            }])
            pred, prob = machine(features)
            return render_template("model.html", prediction=pred, probability=prob)
        return render_template("model.html", info=machine.info())
    ```
  - Create `templates/model.html`:
    ```html
    <!DOCTYPE html>
    <html>
    <head>
        <title>Monster Prediction</title>
        <style>
            body { background-color: #1a1a1a; color: #ffffff; }
        </style>
    </head>
    <body>
        <h1>Monster Rarity Prediction</h1>
        <p>{{ info }}</p>
        <form method="POST">
            <label>Strength:</label><input type="number" name="strength" required><br>
            <label>Speed:</label><input type="number" name="speed" required><br>
            <!-- Add other features -->
            <button type="submit">Predict</button>
        </form>
        {% if prediction %}
        <p>Prediction: {{ prediction }}</p>
        <p>Probability: {{ probability }}</p>
        {% endif %}
    </body>
    </html>
    ```
- Test Integration:
  - Run the Flask app: `python -m app.main` (Windows) or `python3 -m app.main` (macOS/Linux).
  - Visit `http://127.0.0.1:5000/model` and compare to the deployed site (https://bandersnatch.herokuapp.com/model).
  - Enter feature values (e.g., `strength`, `speed`) and check predictions.
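You can also exercise the route without a browser using Flask's built-in test client. A minimal sketch, assuming the Flask instance in `app/main.py` is named `app` (adjust the import if yours differs) and the form fields match the template above:

```python
from app.main import app

client = app.test_client()

# GET should render the form and model info.
print(client.get("/model").status_code)  # expect 200

# POST with sample form values; field names must match the template.
response = client.post("/model", data={"strength": "5", "speed": "3"})
print(response.status_code)
```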
Notebook Analogy:
- The `/model` route is like a notebook cell that runs predictions and displays results.
- The Jinja2 template is like a markdown cell formatting model outputs for sharing.
- `info()` is like a cell that prints model metadata.
Tips:
- Match the template's dark theme (e.g., `#1a1a1a` background, `#ffffff` text).
- Validate form inputs:
  ```python
  try:
      float(data["strength"])
  except ValueError:
      return render_template("model.html", error="Invalid input")
  ```
- Check browser developer tools (F12) for errors.
Goal: Use Git to manage code, like saving notebook checkpoints.
What to Do:
- Create a Branch:
  - Create a branch for Sprint 3:
    ```
    git checkout -b sprint-3
    ```
- Commit Changes:
  - After editing `app/machine.py`, `app/main.py`, `templates/model.html`, and your notebook:
    ```
    git add .
    git commit -m "Implement Machine class and Flask integration for Sprint 3"
    ```
- Push to GitHub:
  - Push your branch:
    ```
    git push origin sprint-3
    ```
  - Create a pull request on your GitHub fork.
Notebook Analogy:
- A Git branch is like a separate notebook for a task.
- Committing is like saving a notebook, and pushing is like uploading it.
Tips:
- Run `git status` to check changes.
- Use clear commit messages (e.g., "Add model serialization").
- Check GitHub’s documentation.
Goal: Submit deliverables for the ticket.
- Forked Repository Link:
- Ensure changes are pushed to your GitHub fork.
- Copy the repository URL.
- Loom Video:
  - Record a Loom video answering the prompt in the Submit Your Deliverables assignment.
  - Demonstrate:
    - The notebook with model training, tuning, and results.
    - The `/model` route showing predictions and model info.
    - Your explanation of the `Machine` class and serialization.
    - A discussion of your best model's performance.
- Submit in Course:
- Submit the repository link and Loom video link in your course platform.
Notebook Analogy:
- The Loom video is like presenting your notebook results.
- Submitting links is like sharing a notebook and demo.
Tips:
- Test the `/model` route locally.
- Keep the video concise, focusing on notebook results, the `Machine` class, and web integration.
- Scikit-learn vs. Notebooks: The `Machine` class wraps familiar `fit()` and `predict()` methods.
- MongoDB vs. pandas: Use `MongoDB.dataframe()` like `pd.read_csv()` to load data.
- Flask vs. Notebooks: The `/model` route is like a notebook cell outputting predictions to HTML.
- Git vs. Checkpoints: Git tracks changes across files. Branch per sprint, commit often.
- Code Style:
  - Add docstrings for all methods, per PEP 8.
  - Remove inline `print()` statements in final code.
  - Follow PEP 8 for clean code.
- Debugging:
  - Use `print(df.head())` in the notebook, but remove it from `app/machine.py`.
  - Use try-except in `Machine`:
    ```python
    try:
        ...  # Prediction code
    except Exception as e:
        print(f"Error: {e}")
    ```
  - Check browser developer tools (F12) for Flask errors.
- Resources:
  - Scikit-learn Documentation: For models and tuning.
  - joblib Documentation: For serialization.
  - Flask Documentation: For web routes.
  - Git for Beginners: For Git basics.
- Model Accuracy Issues: If accuracy is low, check preprocessing (e.g., categorical encoding, missing values) or try more tuning.
- Serialization Errors: Ensure `filepath` is valid and `joblib` is installed (`pip show joblib`).
- Flask Integration Issues: Verify form inputs match feature columns. Check browser console (F12) for errors.
- Git Errors: If `git push` fails, ensure you're on `sprint-3` (`git branch`) and pushing to your fork.
- DataFrame Mismatch: Ensure `pred_basis` in `__call__` has the same columns as `self.features`.
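One way to guard against that mismatch (a hedged sketch, assuming `machine` is a trained `Machine` and the dictionary below mimics parsed form data):

```python
import pandas as pd

# Columns built from form data may arrive in a different order than training.
form_df = pd.DataFrame([{"speed": 3.0, "strength": 5.0}])

# reindex() reorders the columns to match the training features
# (missing columns become NaN, so make sure the form covers them all).
aligned = form_df.reindex(columns=machine.features.columns)
pred, prob = machine(aligned)
```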
- Test Your Work:
  - Train and tune models in the notebook, documenting the best model.
  - Test the `Machine` class, serialization, and `/model` route.
  - Verify predictions match the deployed site's behavior.
- Git Workflow: Commit and push changes, and create a pull request for `sprint-3`.
- Deliverables: Submit your repository link and Loom video in your course.
- Create a Jupyter Notebook to train and tune at least three models, reporting the best model.
- Implement the `Machine` class in `app/machine.py` with `__init__`, `__call__`, `save()`, `open()`, and `info()`.
- Update `app/main.py` and `templates/model.html` to integrate predictions.
- Test at `http://127.0.0.1:5000/model`.
- Commit changes to a `sprint-3` branch and push to GitHub.
- Record a Loom video and submit deliverables.
Sprint 3 builds on your Scikit-learn and pandas skills to create a machine learning model for monster data, using a notebook for exploration and a Machine class for deployment. By training models, serializing them, and integrating with Flask, you’ll create an interactive prediction interface, like sharing notebook results online. Treat Scikit-learn like your course models, Flask like notebook outputs, and Git like checkpoints. Work through each step, test frequently, and leverage your data science experience to succeed.