Created
February 18, 2025 08:52
Seed Dataset.ipynb
{ | |
"nbformat": 4, | |
"nbformat_minor": 0, | |
"metadata": { | |
"colab": { | |
"provenance": [], | |
"authorship_tag": "ABX9TyNFI0pPc7m+dqcINMcMN1IO", | |
"include_colab_link": true | |
}, | |
"kernelspec": { | |
"name": "python3", | |
"display_name": "Python 3" | |
}, | |
"language_info": { | |
"name": "python" | |
} | |
}, | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "view-in-github", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"<a href=\"https://colab.research.google.com/gist/Karbejha/f8ee18d4147f86be7106b0ac09305602/seed-dataset.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"**Importing necessary libraries**" | |
], | |
"metadata": { | |
"id": "31bW-8ZjLcye" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": { | |
"id": "H-2kyr5ILBzh" | |
}, | |
"outputs": [], | |
"source": [ | |
"import pandas as pd\n", | |
"import numpy as np\n", | |
"import matplotlib.pyplot as plt\n", | |
"import seaborn as sns\n", | |
"\n", | |
"from sklearn.model_selection import LeaveOneOut, cross_val_predict\n", | |
"from sklearn.preprocessing import StandardScaler\n", | |
"from sklearn.neighbors import KNeighborsClassifier\n", | |
"from sklearn.pipeline import make_pipeline\n", | |
"from sklearn.metrics import (accuracy_score, classification_report,\n",
"                             confusion_matrix, ConfusionMatrixDisplay)\n",
"\n", | |
"# Set style for plots\n", | |
"sns.set_theme(style=\"whitegrid\")" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"## **1. Data Preparation**\n", | |
"\n", | |
"Step Explanation:\n", | |
"First, we load the dataset and explore its structure, summary statistics, and distributions to understand the features and target variable." | |
], | |
"metadata": { | |
"id": "eNl1ubS-Lq4G" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"url = \"https://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt\"\n", | |
"# Assuming the dataset does not have headers as per the original source. We'll add them manually.\n", | |
"columns = ['area', 'perimeter', 'compactness', 'kernel_length', 'kernel_width',\n",
"           'asymmetry_coefficient', 'kernel_groove_length', 'class']\n",
"# Raw string so the regex \\s+ (one or more whitespace characters) is passed through unchanged\n",
"df = pd.read_csv(url, sep=r'\\s+', header=None, names=columns)\n",
"\n", | |
"# Display the first 5 rows\n", | |
"print(\"First 5 rows of the dataset:\")\n", | |
"display(df.head())" | |
], | |
"metadata": { | |
"id": "C5xu3nQ8Lz6Z" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"# Check summary statistics\n", | |
"print(\"\\nSummary statistics:\")\n", | |
"display(df.describe())\n", | |
"\n", | |
"# Check class distribution\n", | |
"print(\"\\nClass distribution:\")\n", | |
"class_counts = df['class'].value_counts().sort_index()\n", | |
"class_counts.index = ['Kama (1)', 'Rosa (2)', 'Canadian (3)']\n", | |
"display(class_counts)\n", | |
"\n", | |
"# Plot class distribution\n", | |
"plt.figure(figsize=(6, 4))\n", | |
"sns.countplot(x='class', data=df)\n", | |
"plt.title(\"Class Distribution\")\n", | |
"plt.xticks([0, 1, 2], ['Kama (1)', 'Rosa (2)', 'Canadian (3)'])\n", | |
"plt.show()" | |
], | |
"metadata": { | |
"id": "FHJtLPqxMjQS" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"Output:\n", | |
"\n", | |
"All classes are perfectly balanced (70 samples each).\n", | |
"\n", | |
"Features such as area and perimeter span larger numeric ranges than the others. Because kNN is distance-based, unscaled features would dominate the distance, so the features need to be standardized."
], | |
"metadata": { | |
"id": "wKNlje6mMrdS" | |
} | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"## **2. Data Splitting & Cross-Validation**\n",
"\n", | |
"Step Explanation:\n", | |
"\n", | |
"We use Leave-One-Out Cross-Validation (LOOCV) because the dataset is small (210 samples). Each sample serves as the test set exactly once while the model trains on the other 209, so almost all of the data is used for training in every fold and the estimate does not depend on a random train/test split."
], | |
"metadata": { | |
"id": "vePKXlL6MvfB" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"# Split features (X) and target (y)\n", | |
"X = df.drop('class', axis=1)\n", | |
"y = df['class']\n", | |
"\n", | |
"# Initialize Leave-One-Out\n", | |
"loo = LeaveOneOut()" | |
], | |
"metadata": { | |
"id": "MVRwHtcTMrBG" | |
}, | |
"execution_count": 5, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"## **3. Model Training with kNN**\n",
"\n", | |
"Step Explanation:\n", | |
"We train a kNN classifier with a pipeline that includes scaling. We test k values from 1 to 20 and use LOOCV to evaluate performance." | |
], | |
"metadata": { | |
"id": "qqaw2WtrM4pC" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"# Initialize results storage\n", | |
"results = []\n", | |
"k_values = range(1, 21)\n", | |
"\n", | |
"for k in k_values:\n", | |
"    # Create pipeline with scaling and kNN\n",
"    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))\n",
"    # Get predictions using LOOCV (the scaler is refit on each training fold)\n",
"    y_pred = cross_val_predict(model, X, y, cv=loo)\n",
"    # Calculate metrics\n",
"    accuracy = accuracy_score(y, y_pred)\n",
"    report = classification_report(y, y_pred, output_dict=True)\n",
"    results.append({\n",
"        'k': k,\n",
"        'accuracy': accuracy,\n",
"        'macro_precision': report['macro avg']['precision'],\n",
"        'macro_recall': report['macro avg']['recall'],\n",
"        'macro_f1': report['macro avg']['f1-score']\n",
"    })\n",
"\n", | |
"# Convert results to DataFrame\n", | |
"results_df = pd.DataFrame(results)" | |
], | |
"metadata": { | |
"id": "26ptXGcJM8rm" | |
}, | |
"execution_count": 6, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"## **4. Model Evaluation**\n",
"\n",
"**Step 1: Plot Performance Metrics vs. k**"
], | |
"metadata": { | |
"id": "2DRdX7DUNcbO" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"# Plot metrics\n", | |
"plt.figure(figsize=(10, 6))\n", | |
"plt.plot(results_df['k'], results_df['accuracy'], label='Accuracy', marker='o')\n", | |
"plt.plot(results_df['k'], results_df['macro_f1'], label='Macro F1-Score', marker='o')\n", | |
"plt.xlabel('Number of Neighbors (k)')\n", | |
"plt.ylabel('Score')\n", | |
"plt.title('Model Performance vs. k')\n", | |
"plt.xticks(k_values)\n", | |
"plt.legend()\n", | |
"plt.show()" | |
], | |
"metadata": { | |
"id": "-uh0DteoNenn" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"The plot shows that k=3 yields the highest accuracy (93.3%) and F1-score (93.3%).\n", | |
"\n" | |
], | |
"metadata": { | |
"id": "vUheDB6LNlJ8" | |
} | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"**Step 2: Confusion Matrix for Best k**\n", | |
"\n" | |
], | |
"metadata": { | |
"id": "DtPFVrOSNn9j" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"# Train model with best k (k=3)\n", | |
"best_k = 3\n", | |
"model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=best_k))\n", | |
"y_pred = cross_val_predict(model, X, y, cv=loo)\n", | |
"\n", | |
"# Generate confusion matrix\n", | |
"cm = confusion_matrix(y, y_pred)\n", | |
"disp = ConfusionMatrixDisplay(cm, display_labels=['Kama', 'Rosa', 'Canadian'])\n", | |
"disp.plot(cmap='Blues', values_format='d')\n", | |
"plt.title(f\"Confusion Matrix (k={best_k})\")\n", | |
"plt.show()\n", | |
"\n", | |
"# Print classification report\n", | |
"print(classification_report(y, y_pred, target_names=['Kama', 'Rosa', 'Canadian']))" | |
], | |
"metadata": { | |
"id": "_wRbDGDDNsLc" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"The confusion matrix shows 2 misclassifications for Kama, 5 for Rosa, and 3 for Canadian.\n", | |
"\n", | |
"Precision, recall, and F1-scores are consistent across classes." | |
], | |
"metadata": { | |
"id": "-J2eUECWNyr0" | |
} | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"## **5. Discussion**\n",
"\n", | |
"Results Interpretation:\n", | |
"\n", | |
"Best k: k=3 achieves 93.3% accuracy and F1-score.\n", | |
"\n", | |
"Effect of k: Smaller k (1-5) performs well, while larger k (>10) reduces performance because the vote is averaged over too many neighbors and the model underfits.\n",
"\n",
"Challenges: Proper scaling was critical, since kNN depends on distances. LOOCV was computationally heavy (210 fits per k, 4,200 fits across k = 1 to 20) but necessary for reliability."
], | |
"metadata": { | |
"id": "CHI3RqpqNzvu" | |
} | |
} | |
] | |
} |
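
The cross-validation loop in section 3 delegates everything to scikit-learn. As a rough illustration of what that loop does for each k, the sketch below hand-rolls LOOCV with per-fold standardization and a majority-vote kNN in plain Python. It is not scikit-learn's implementation: the helper names (`standardize`, `knn_predict`, `loocv_accuracy`) and the toy clusters are made up for this example, and the seeds data is not used.

```python
import math
import random
from collections import Counter


def standardize(train, rows):
    """Scale rows using the mean and standard deviation of the training rows only."""
    n_features = len(train[0])
    means = [sum(r[j] for r in train) / len(train) for j in range(n_features)]
    stds = []
    for j in range(n_features):
        var = sum((r[j] - means[j]) ** 2 for r in train) / len(train)
        stds.append(math.sqrt(var) or 1.0)  # guard against zero variance
    return [[(r[j] - means[j]) / stds[j] for j in range(n_features)] for r in rows]


def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k training points closest to the query."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def loocv_accuracy(X, y, k):
    """Hold out each sample once, fit on the rest, and score the held-out prediction."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        # Scaling statistics come from the training fold only
        scaled_train = standardize(train_X, train_X)
        scaled_query = standardize(train_X, [X[i]])[0]
        correct += knn_predict(scaled_train, train_y, scaled_query, k) == y[i]
    return correct / len(X)


# Hypothetical toy data: three clusters of 10 points, with features on different scales
random.seed(0)
centers = {1: (0.0, 0.0), 2: (5.0, 50.0), 3: (10.0, 0.0)}
X, y = [], []
for label, (cx, cy) in centers.items():
    for _ in range(10):
        X.append([cx + random.gauss(0, 0.5), cy + random.gauss(0, 5.0)])
        y.append(label)

# Score each candidate k and pick the best one instead of hard-coding it
scores = {k: loocv_accuracy(X, y, k) for k in range(1, 6)}
best_k = max(scores, key=scores.get)  # ties resolve to the smallest k
print(scores, best_k)
```

Fitting the scaler inside each fold, as the notebook's pipeline does, keeps the held-out sample from influencing the means and standard deviations used to scale it.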