Sprint 3: Machine Learning Model Guide for Data Scientists

Overview

This guide covers the third ticket of the BandersnatchStarter project, titled "Machine Learning Model," as outlined in BandersnatchStarter/tickets/thirdTicket.md. The objective of Sprint 3 is to explore machine learning models in a Jupyter Notebook, develop a Machine class in app/machine.py using Scikit-learn to predict monster attributes (e.g., Rarity) from the data generated in Sprint 1, serialize the model with joblib, and integrate it into the Flask web application. This guide is designed for someone with a beginner data science background who is familiar with Python, pandas, and Jupyter Notebooks but new to Flask and Git. You'll use Scikit-learn (as in your course), serialize models (like saving notebook outputs), and integrate predictions with Flask (like sharing results online), while managing code with Git (like notebook checkpoints). Throughout, the guide relates tasks to your notebook experience and points you toward a solution rather than handing you complete code, to encourage learning.

Ticket Objectives

The third ticket focuses on:

  1. Notebook Model Training & Tuning: Create a Jupyter Notebook to train and tune at least three machine learning models, report accuracy, and describe the best model.
  2. Machine Learning Interface Class: Implement a Machine class in app/machine.py to initialize, train, and predict using a Scikit-learn model.
  3. Model Serialization: Add methods to save and load the model using joblib.
  4. API Model Integration: Include an info() method and integrate the model into Flask for predictions.
  5. Using Git: Track changes with Git for collaboration and version control.

This is like moving from training models in a notebook to creating a reusable model class and making predictions available via a web app, using the monster data (from MonsterLab) stored in MongoDB.

Prerequisites

Before starting, ensure you’ve completed the setup and previous sprints:

  • Install Python 3, Git (git-scm.com), and Jupyter Notebook (pip install jupyter).
  • Fork and clone the repository:
    git clone https://github.com/your-username/BandersnatchStarter.git
    cd BandersnatchStarter
  • Set up a virtual environment:
    • Windows: python -m venv venv and venv\Scripts\activate.
    • macOS/Linux: python3 -m venv venv and source venv/bin/activate.
  • Install dependencies: python -m pip install -r requirements.txt. Ensure scikit-learn, joblib, pandas, MonsterLab, and pymongo are installed.
  • Complete Sprint 1 (MongoDB with 1,000+ monsters) and Sprint 2 (Altair visualizations).
  • Set the SPRINT variable in app/main.py to 3:
    SPRINT = 3
  • Test the Flask app: python -m app.main (Windows) or python3 -m app.main (macOS/Linux), and visit http://127.0.0.1:5000.
  • Complete the onboarding module in your course.

You’ll need familiarity with pandas DataFrames, Scikit-learn models (e.g., RandomForestClassifier), and Jupyter Notebooks from your data science course.
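
As a quick sanity check, you can confirm the key libraries import cleanly from the activated virtual environment (a minimal sketch; run it in a notebook cell or a throwaway script):

    # Verify the Sprint 3 dependencies are importable
    import sklearn, joblib, pandas, pymongo
    print("scikit-learn", sklearn.__version__)
    print("joblib", joblib.__version__)
    print("pandas", pandas.__version__)
    print("pymongo imported OK")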

Step-by-Step Guide

Below are the steps to complete Sprint 3, with explanations relating tasks to your Jupyter Notebook and pandas experience, incorporating the monster data from Sprint 1 and the specific requirements for the Machine class.

Step 1: Create a Jupyter Notebook for Model Training & Tuning

Goal: Use a Jupyter Notebook to train and tune at least three machine learning models, like experimenting with models in a notebook.

What to Do:

  1. Start a Jupyter Notebook:

    • In the project root, run:
      jupyter notebook
    • Create a new notebook (e.g., model_exploration.ipynb), like starting a notebook for model training.
  2. Load Monster Data:

    • Use the MongoDB class from Sprint 1 (app/data.py) to fetch data:
      from app.data import MongoDB
      import pandas as pd
      db = MongoDB("monsters")
      df = db.dataframe()
    • Inspect the DataFrame: df.head() to see attributes (e.g., name, strength, type, speed, Rarity).
  3. Preprocess Data:

    • Choose Rarity as the target (based on the example) and other columns as features.
    • Handle categorical columns (e.g., type, name):
      df["type"] = df["type"].astype("category").cat.codes
      df = df.drop(columns=["name"], errors="ignore")  # Drop non-numeric columns
    • Handle missing values: df = df.dropna().
    • Split features and target:
      X = df.drop(columns=["Rarity"])
      y = df["Rarity"]
    • This is like preprocessing a dataset in a notebook before model training.
  4. Train and Tune Models:

    • Train at least three Scikit-learn models (e.g., RandomForestClassifier, DecisionTreeClassifier, LogisticRegression):
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.tree import DecisionTreeClassifier
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import train_test_split
      from sklearn.metrics import accuracy_score
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
      model = RandomForestClassifier()
      model.fit(X_train, y_train)
      predictions = model.predict(X_test)
      print(f"Accuracy: {accuracy_score(y_test, predictions)}")
    • Tune hyperparameters (e.g., n_estimators for RandomForest):
      model = RandomForestClassifier(n_estimators=100, max_depth=10)
    • Repeat for the other models and compare accuracy (a comparison sketch follows this list).
  5. Report the Best Model:

    • Record accuracy for each model in the notebook.
    • Write 1–2 paragraphs describing your best model (e.g., RandomForestClassifier), including:
      • Why it performed best (e.g., highest accuracy).
      • Hyperparameters used (e.g., n_estimators=100).
      • Potential improvements (e.g., more data, feature engineering).
    • Save the notebook, like documenting a model experiment.
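
To make the comparison systematic, here is a minimal sketch that loops over the three models, assuming the X and y produced by the preprocessing step above; the hyperparameter values are illustrative, not prescribed:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    candidates = {
        "Random Forest": RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42),
        "Decision Tree": DecisionTreeClassifier(max_depth=10, random_state=42),
        "Logistic Regression": LogisticRegression(max_iter=1000),
    }
    for name, model in candidates.items():
        model.fit(X_train, y_train)  # Train on the 80% split
        score = accuracy_score(y_test, model.predict(X_test))
        print(f"{name}: {score:.3f}")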

Notebook Analogy:

  • The notebook is like your usual environment for training Scikit-learn models.
  • Preprocessing is like cleaning a dataset in a notebook (e.g., df.dropna()).
  • Model training and tuning is like fitting and tweaking models in a notebook cell.

Tips:

  • Use df.info() to check data types and missing values.
  • Save the best model’s configuration for use in app/machine.py.
  • Explore the Scikit-learn Documentation for model options and tuning.

Step 2: Implement the Machine Learning Interface Class

Goal: Create a Machine class in app/machine.py to train and predict, like a reusable model class in a notebook.

What to Do:

  1. Open app/machine.py:

    • This is where you’ll define the Machine class, like writing a class in a notebook to encapsulate model operations.
  2. Define the __init__ Method:

    • Initialize the class with a DataFrame and train a model:
      from pandas import DataFrame
      from sklearn.ensemble import RandomForestClassifier
      class Machine:
          def __init__(self, df: DataFrame):
              pass  # Replace with your code
    • Store the model name (e.g., "Random Forest Classifier").
    • Extract the target (Rarity) and features (all other columns):
      self.target = df["Rarity"]
      self.features = df.drop(columns=["Rarity"])
    • Initialize and train a Scikit-learn model (e.g., RandomForestClassifier) using the best model from your notebook:
      self.model = RandomForestClassifier(n_estimators=100, random_state=42)
      self.model.fit(self.features, self.target)
  3. Define the __call__ Method:

    • Add a method to make predictions:
      def __call__(self, pred_basis: DataFrame):
          pass  # Replace with your code
    • Accept a DataFrame of features, predict the target, and return the prediction and its confidence. Note that predict_proba returns one probability per class, so take the maximum as the confidence of the predicted class:
      prediction = self.model.predict(pred_basis)[0]
      probability = self.model.predict_proba(pred_basis)[0].max()
      return prediction, probability
    • This is like calling model.predict() in a notebook, but wrapped in a class interface; a consolidated sketch of the whole class follows this list.
  4. Test the Class:

    • Test in a notebook:
      from app.machine import Machine
      from app.data import MongoDB
      df = MongoDB("monsters").dataframe()
      model = Machine(df)
      test_data = df.drop(columns=["Rarity"]).iloc[[0]]  # Single row
      pred, prob = model(test_data)
      print(f"Prediction: {pred}, Probability: {prob}")

Notebook Analogy:

  • The Machine class is like a notebook class that wraps model.fit() and model.predict().
  • __init__ is like a notebook cell that trains a model.
  • __call__ is like a cell that runs predictions on new data.

Tips:

  • Ensure pred_basis has the same feature columns as self.features.
  • Add a docstring for each method, per PEP 8.
  • Use the best model from your notebook (e.g., RandomForestClassifier with tuned parameters).

Step 3: Implement Model Serialization

Goal: Add save() and open() methods to serialize the model, like saving a notebook model to a file.

What to Do:

  1. Define the save() Method:

    • Add a method to save the model:
      def save(self, filepath: str) -> None:
          pass  # Replace with your code
    • Use joblib to save the model:
      import joblib
      joblib.dump(self.model, filepath)
  2. Define the open() Method:

    • Add a method to load a model:
      @classmethod
      def open(cls, filepath: str) -> 'Machine':
          pass  # Replace with your code
    • Load the model and attach it to a new Machine instance. Passing df=None to cls() would crash, since __init__ immediately indexes the DataFrame; one workaround is to bypass __init__ with cls.__new__ (see the sketch after this list):
      instance = cls.__new__(cls)  # Create an instance without calling __init__
      instance.model = joblib.load(filepath)
      return instance
  3. Test Serialization:

    • Test in a notebook:
      model = Machine(df)
      model.save("model.pkl")
      loaded_model = Machine.open("model.pkl")
      pred, prob = loaded_model(test_data)
      print(f"Loaded Prediction: {pred}")

Notebook Analogy:

  • joblib.dump() is like df.to_csv() to save data.
  • joblib.load() is like pd.read_csv() to load data.
  • Serialization is like saving a trained model in a notebook for later use.

Tips:

  • Add docstrings, e.g., “Saves the model to the specified filepath.”
  • Ensure filepath is valid (e.g., model.pkl).
  • Handle errors with try-except:
    try:
        joblib.dump(self.model, filepath)
    except Exception as e:
        print(f"Error: {e}")

Step 4: Implement API Model Integration

Goal: Add an info() method and integrate the model into Flask, like sharing predictions online.

What to Do:

  1. Define the info() Method:

    • Add a method to return model details:
      def info(self) -> str:
          pass  # Replace with your code
    • Return a string with the model name and initialization timestamp:
      from datetime import datetime
      self.init_time = datetime.now()  # Store in __init__
      return f"Model: {self.name}, Initialized: {self.init_time}"
  2. Integrate with Flask:

    • In app/main.py, create a /model route to use the Machine class:
      from app.machine import Machine
      from app.data import MongoDB
      from flask import render_template, request
      import pandas as pd  # Needed for pd.DataFrame below
      @app.route("/model", methods=["GET", "POST"])
      def model():
          db = MongoDB("monsters")
          df = db.dataframe()
          machine = Machine(df)
          if request.method == "POST":
              data = request.form
              features = pd.DataFrame([{
                  "strength": float(data["strength"]),
                  "speed": float(data["speed"]),
                  # Add other features
              }])
              pred, prob = machine(features)
              return render_template("model.html", prediction=pred, probability=prob)
          return render_template("model.html", info=machine.info())
    • Create templates/model.html:
      <!DOCTYPE html>
      <html>
      <head>
          <title>Monster Prediction</title>
          <style>
              body { background-color: #1a1a1a; color: #ffffff; }
          </style>
      </head>
      <body>
          <h1>Monster Rarity Prediction</h1>
          <p>{{ info }}</p>
          <form method="POST">
              <label>Strength:</label><input type="number" name="strength" required><br>
              <label>Speed:</label><input type="number" name="speed" required><br>
              <!-- Add other features -->
              <button type="submit">Predict</button>
          </form>
          {% if prediction %}
              <p>Prediction: {{ prediction }}</p>
              <p>Probability: {{ probability }}</p>
          {% endif %}
      </body>
      </html>
  3. Test Integration:

    • Run the Flask app: python -m app.main (Windows) or python3 -m app.main (macOS/Linux).
    • Visit http://127.0.0.1:5000/model and compare to the deployed site (https://bandersnatch.herokuapp.com/model).
    • Enter feature values (e.g., strength, speed) and check predictions.

Notebook Analogy:

  • The /model route is like a notebook cell that runs predictions and displays results.
  • The Jinja2 template is like a markdown cell formatting model outputs for sharing.
  • info() is like a cell that prints model metadata.

Tips:

  • Match the template’s dark theme (e.g., #1a1a1a background, #ffffff text).
  • Validate form inputs:
    try:
        float(data["strength"])
    except ValueError:
        return render_template("model.html", error="Invalid input")
  • Check browser developer tools (F12) for errors.
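
One caveat with the route above: it retrains the model on every request, which gets slow as the collection grows. A common alternative, sketched below under the assumption that the save() and open() methods from Step 3 exist and that model.pkl is a path of your choosing, is to load or train once and reuse the instance:

    import os
    from app.machine import Machine
    from app.data import MongoDB

    def get_machine(filepath: str = "model.pkl") -> Machine:
        """Loads a serialized model if one exists; otherwise trains and saves one."""
        if os.path.exists(filepath):
            return Machine.open(filepath)
        machine = Machine(MongoDB("monsters").dataframe())
        machine.save(filepath)
        return machine

The /model route can then call get_machine() instead of constructing Machine(df) on every request.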

Step 5: Use Git to Track Changes

Goal: Use Git to manage code, like saving notebook checkpoints.

What to Do:

  1. Create a Branch:

    • Create a branch for Sprint 3:
      git checkout -b sprint-3
  2. Commit Changes:

    • After editing app/machine.py, app/main.py, templates/model.html, and your notebook:
      git add .
      git commit -m "Implement Machine class and Flask integration for Sprint 3"
  3. Push to GitHub:

    • Push your branch:
      git push origin sprint-3
    • Create a pull request on your GitHub fork.

Notebook Analogy:

  • A Git branch is like a separate notebook for a task.
  • Committing is like saving a notebook, and pushing is like uploading it.

Tips:

  • Run git status to check changes.
  • Use clear commit messages (e.g., “Add model serialization”).
  • Check GitHub’s documentation.

Step 6: Prepare Deliverables

Goal: Submit deliverables for the ticket.

  1. Forked Repository Link:

    • Ensure changes are pushed to your GitHub fork.
    • Copy the repository URL.
  2. Loom Video:

    • Record a Loom video answering the prompt in the Submit Your Deliverables assignment.
    • Demonstrate:
      • The notebook with model training, tuning, and results.
      • The /model route showing predictions and model info.
      • An explanation of the Machine class and serialization.
      • A discussion of your best model’s performance.
  3. Submit in Course:

    • Submit the repository link and Loom video link in your course platform.

Notebook Analogy:

  • The Loom video is like presenting your notebook results.
  • Submitting links is like sharing a notebook and demo.

Tips:

  • Test the /model route locally.
  • Keep the video concise, focusing on notebook results, the Machine class, and web integration.

Learning Tips for Data Scientists

  • Scikit-learn vs. Notebooks: The Machine class wraps familiar fit() and predict() methods.
  • MongoDB vs. pandas: Use MongoDB.dataframe() like pd.read_csv() to load data.
  • Flask vs. Notebooks: The /model route is like a notebook cell outputting predictions to HTML.
  • Git vs. Checkpoints: Git tracks changes across files. Branch per sprint, commit often.
  • Code Style:
    • Add docstrings for all methods, per PEP 8.
    • Remove inline print() statements in final code.
    • Follow PEP 8 for clean code.
  • Debugging:
    • Use print(df.head()) in the notebook, but remove from app/machine.py.
    • Use try-except in Machine:
      try:
          # Prediction code
      except Exception as e:
          print(f"Error: {e}")
    • Check browser developer tools (F12) for Flask errors.
  • Resources:
    • Scikit-learn documentation: https://scikit-learn.org
    • joblib documentation: https://joblib.readthedocs.io
    • Flask documentation: https://flask.palletsprojects.com
    • GitHub Docs: https://docs.github.com

Potential Challenges and Solutions

  • Model Accuracy Issues: If accuracy is low, check preprocessing (e.g., categorical encoding, missing values) or try more tuning.
  • Serialization Errors: Ensure filepath is valid and joblib is installed (pip show joblib).
  • Flask Integration Issues: Verify form inputs match feature columns. Check browser console (F12) for errors.
  • Git Errors: If git push fails, ensure you’re on sprint-3 (git branch) and pushing to your fork.
  • DataFrame Mismatch: Ensure pred_basis in __call__ has the same columns as self.features (a defensive sketch follows this list).
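
For the last point, a small guard inside __call__ can surface column mismatches early. A sketch, assuming the self.features attribute from Step 2; reindex inserts NaN for any missing column, which the check then catches:

    # Inside Machine.__call__, before predicting:
    pred_basis = pred_basis.reindex(columns=self.features.columns)
    if pred_basis.isna().any().any():
        raise ValueError("pred_basis is missing one or more feature columns")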

Next Steps

  • Test Your Work:
    • Train and tune models in the notebook, documenting the best model.
    • Test the Machine class, serialization, and /model route.
    • Verify predictions match the deployed site’s behavior.
  • Git Workflow: Commit and push changes, create a pull request for sprint-3.
  • Deliverables: Submit your repository link and Loom video in your course.

Example Workflow

  1. Create a Jupyter Notebook to train and tune three models, reporting the best model.
  2. Implement the Machine class in app/machine.py with __init__, __call__, save(), open(), and info().
  3. Update app/main.py and templates/model.html to integrate predictions.
  4. Test at http://127.0.0.1:5000/model.
  5. Commit changes to a sprint-3 branch and push to GitHub.
  6. Record a Loom video and submit deliverables.

Conclusion

Sprint 3 builds on your Scikit-learn and pandas skills to create a machine learning model for monster data, using a notebook for exploration and a Machine class for deployment. By training models, serializing them, and integrating with Flask, you’ll create an interactive prediction interface, like sharing notebook results online. Treat Scikit-learn like your course models, Flask like notebook outputs, and Git like checkpoints. Work through each step, test frequently, and leverage your data science experience to succeed.
