{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"authorship_tag": "ABX9TyNFI0pPc7m+dqcINMcMN1IO",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/Karbejha/f8ee18d4147f86be7106b0ac09305602/seed-dataset.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"source": [
"**Importing necessary libraries**"
],
"metadata": {
"id": "31bW-8ZjLcye"
}
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"id": "H-2kyr5ILBzh"
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"\n",
"from sklearn.model_selection import LeaveOneOut, cross_val_predict\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"from sklearn.pipeline import make_pipeline\n",
"from sklearn.metrics import (accuracy_score, classification_report,confusion_matrix, ConfusionMatrixDisplay)\n",
"\n",
"# Set style for plots\n",
"sns.set_theme(style=\"whitegrid\")"
]
},
{
"cell_type": "markdown",
"source": [
"## **1. Data Preparation**\n",
"\n",
"Step Explanation:\n",
"First, we load the dataset and explore its structure, summary statistics, and distributions to understand the features and target variable."
],
"metadata": {
"id": "eNl1ubS-Lq4G"
}
},
{
"cell_type": "code",
"source": [
"url = \"https://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt\"\n",
"# Assuming the dataset does not have headers as per the original source. We'll add them manually.\n",
"columns = ['area', 'perimeter', 'compactness', 'kernel_length', 'kernel_width',\n",
" 'asymmetry_coefficient', 'kernel_groove_length', 'class']\n",
"df = pd.read_csv(url, sep='\\s+', header=None, names=columns)\n",
"\n",
"# Display the first 5 rows\n",
"print(\"First 5 rows of the dataset:\")\n",
"display(df.head())"
],
"metadata": {
"id": "C5xu3nQ8Lz6Z"
},
"execution_count": null,
"outputs": []
},
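{
"cell_type": "markdown",
"source": [
"*A quick structural check (optional sketch): confirm the expected 210 rows and 8 columns and verify that no values are missing before computing summary statistics.*"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Optional sanity check on the loaded data:\n",
"# the seeds dataset should contain 210 rows and 8 columns with no missing values.\n",
"print(\"Shape (rows, columns):\", df.shape)\n",
"print(\"\\nMissing values per column:\")\n",
"print(df.isnull().sum())"
],
"metadata": {},
"execution_count": null,
"outputs": []
},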
{
"cell_type": "code",
"source": [
"# Check summary statistics\n",
"print(\"\\nSummary statistics:\")\n",
"display(df.describe())\n",
"\n",
"# Check class distribution\n",
"print(\"\\nClass distribution:\")\n",
"class_counts = df['class'].value_counts().sort_index()\n",
"class_counts.index = ['Kama (1)', 'Rosa (2)', 'Canadian (3)']\n",
"display(class_counts)\n",
"\n",
"# Plot class distribution\n",
"plt.figure(figsize=(6, 4))\n",
"sns.countplot(x='class', data=df)\n",
"plt.title(\"Class Distribution\")\n",
"plt.xticks([0, 1, 2], ['Kama (1)', 'Rosa (2)', 'Canadian (3)'])\n",
"plt.show()"
],
"metadata": {
"id": "FHJtLPqxMjQS"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Output:\n",
"\n",
"All classes are perfectly balanced (70 samples each).\n",
"\n",
"Features like area and perimeter have larger scales than others, indicating the need for scaling."
],
"metadata": {
"id": "wKNlje6mMrdS"
}
},
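{
"cell_type": "markdown",
"source": [
"*To make the scale differences concrete, the sketch below compares the standard deviation of each raw feature. Distance-based models such as kNN are sensitive to these differences, which is what motivates the StandardScaler step used later in the pipeline.*"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Compare the spread of each raw feature.\n",
"# Large differences in scale motivate standardizing before kNN, since\n",
"# Euclidean distances would otherwise be dominated by the widest-ranging features.\n",
"feature_std = df.drop('class', axis=1).std().sort_values(ascending=False)\n",
"print(\"Standard deviation per feature:\")\n",
"print(feature_std)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},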
{
"cell_type": "markdown",
"source": [
"**2. Data Splitting & Cross-Validation**\n",
"\n",
"Step Explanation:\n",
"\n",
"We use Leave-One-Out Cross-Validation (LOOCV) because the dataset is small (210 samples). LOOCV provides a reliable performance estimate by using each sample as a test set once, reducing variability in evaluation."
],
"metadata": {
"id": "vePKXlL6MvfB"
}
},
{
"cell_type": "code",
"source": [
"# Split features (X) and target (y)\n",
"X = df.drop('class', axis=1)\n",
"y = df['class']\n",
"\n",
"# Initialize Leave-One-Out\n",
"loo = LeaveOneOut()"
],
"metadata": {
"id": "MVRwHtcTMrBG"
},
"execution_count": 5,
"outputs": []
},
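{
"cell_type": "markdown",
"source": [
"*A small illustration of what LOOCV does here: with 210 samples it creates 210 folds, each holding out exactly one sample for testing while training on the remaining 209.*"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# LOOCV creates one fold per sample: 210 train/test splits for this dataset.\n",
"print(\"Number of LOOCV splits:\", loo.get_n_splits(X))\n",
"\n",
"# Peek at the first split: 209 training indices and a single held-out test index.\n",
"train_idx, test_idx = next(iter(loo.split(X)))\n",
"print(\"Training samples in first split:\", len(train_idx))\n",
"print(\"Held-out test index in first split:\", test_idx)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},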
{
"cell_type": "markdown",
"source": [
"**3. Model Training with kNN**\n",
"\n",
"Step Explanation:\n",
"We train a kNN classifier with a pipeline that includes scaling. We test k values from 1 to 20 and use LOOCV to evaluate performance."
],
"metadata": {
"id": "qqaw2WtrM4pC"
}
},
{
"cell_type": "code",
"source": [
"# Initialize results storage\n",
"results = []\n",
"k_values = range(1, 21)\n",
"\n",
"for k in k_values:\n",
" # Create pipeline with scaling and kNN\n",
" model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))\n",
" # Get predictions using LOOCV\n",
" y_pred = cross_val_predict(model, X, y, cv=loo)\n",
" # Calculate metrics\n",
" accuracy = accuracy_score(y, y_pred)\n",
" report = classification_report(y, y_pred, output_dict=True)\n",
" results.append({\n",
" 'k': k,\n",
" 'accuracy': accuracy,\n",
" 'macro_precision': report['macro avg']['precision'],\n",
" 'macro_recall': report['macro avg']['recall'],\n",
" 'macro_f1': report['macro avg']['f1-score']\n",
" })\n",
"\n",
"# Convert results to DataFrame\n",
"results_df = pd.DataFrame(results)"
],
"metadata": {
"id": "26ptXGcJM8rm"
},
"execution_count": 6,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"**4. Model Evaluation**\n",
"\n",
"Step 1: Plot Performance Metrics vs. k"
],
"metadata": {
"id": "2DRdX7DUNcbO"
}
},
{
"cell_type": "code",
"source": [
"# Plot metrics\n",
"plt.figure(figsize=(10, 6))\n",
"plt.plot(results_df['k'], results_df['accuracy'], label='Accuracy', marker='o')\n",
"plt.plot(results_df['k'], results_df['macro_f1'], label='Macro F1-Score', marker='o')\n",
"plt.xlabel('Number of Neighbors (k)')\n",
"plt.ylabel('Score')\n",
"plt.title('Model Performance vs. k')\n",
"plt.xticks(k_values)\n",
"plt.legend()\n",
"plt.show()"
],
"metadata": {
"id": "-uh0DteoNenn"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"The plot shows that k=3 yields the highest accuracy (93.3%) and F1-score (93.3%).\n",
"\n"
],
"metadata": {
"id": "vUheDB6LNlJ8"
}
},
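{
"cell_type": "markdown",
"source": [
"*Rather than reading the best k off the plot alone, the cell below selects it programmatically from `results_df`; `idxmax` returns the first (smallest) k in the event of a tie.*"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Select the k with the highest LOOCV accuracy from the results table.\n",
"# idxmax returns the first (smallest) k if several values of k tie.\n",
"best_row = results_df.loc[results_df['accuracy'].idxmax()]\n",
"print(\"Best k by LOOCV accuracy:\")\n",
"print(best_row)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},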
{
"cell_type": "markdown",
"source": [
"**Step 2: Confusion Matrix for Best k**\n",
"\n"
],
"metadata": {
"id": "DtPFVrOSNn9j"
}
},
{
"cell_type": "code",
"source": [
"# Train model with best k (k=3)\n",
"best_k = 3\n",
"model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=best_k))\n",
"y_pred = cross_val_predict(model, X, y, cv=loo)\n",
"\n",
"# Generate confusion matrix\n",
"cm = confusion_matrix(y, y_pred)\n",
"disp = ConfusionMatrixDisplay(cm, display_labels=['Kama', 'Rosa', 'Canadian'])\n",
"disp.plot(cmap='Blues', values_format='d')\n",
"plt.title(f\"Confusion Matrix (k={best_k})\")\n",
"plt.show()\n",
"\n",
"# Print classification report\n",
"print(classification_report(y, y_pred, target_names=['Kama', 'Rosa', 'Canadian']))"
],
"metadata": {
"id": "_wRbDGDDNsLc"
},
"execution_count": null,
"outputs": []
},
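{
"cell_type": "markdown",
"source": [
"*Before interpreting the matrix, the short sketch below tallies the misclassifications per true class directly from `cm` (row sums minus the diagonal).*"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Misclassifications per true class: row totals minus correct (diagonal) predictions.\n",
"per_class_errors = cm.sum(axis=1) - np.diag(cm)\n",
"for name, errors in zip(['Kama', 'Rosa', 'Canadian'], per_class_errors):\n",
"    print(f\"{name}: {errors} misclassified\")"
],
"metadata": {},
"execution_count": null,
"outputs": []
},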
{
"cell_type": "markdown",
"source": [
"The confusion matrix shows 2 misclassifications for Kama, 5 for Rosa, and 3 for Canadian.\n",
"\n",
"Precision, recall, and F1-scores are consistent across classes."
],
"metadata": {
"id": "-J2eUECWNyr0"
}
},
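{
"cell_type": "markdown",
"source": [
"*As a check on the scaling point raised in the discussion below, this optional sketch re-runs LOOCV at the best k with plain kNN (no StandardScaler) and compares the two accuracies.*"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Compare the scaled pipeline against plain kNN (no StandardScaler) at the same k.\n",
"# This is an optional check of how much feature scaling contributes.\n",
"unscaled_model = KNeighborsClassifier(n_neighbors=best_k)\n",
"y_pred_unscaled = cross_val_predict(unscaled_model, X, y, cv=loo)\n",
"\n",
"print(f\"LOOCV accuracy with scaling (k={best_k}):    {accuracy_score(y, y_pred):.3f}\")\n",
"print(f\"LOOCV accuracy without scaling (k={best_k}): {accuracy_score(y, y_pred_unscaled):.3f}\")"
],
"metadata": {},
"execution_count": null,
"outputs": []
},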
{
"cell_type": "markdown",
"source": [
"**5. Discussion**\n",
"\n",
"Results Interpretation:\n",
"\n",
"Best k: k=3 achieves 93.3% accuracy and F1-score.\n",
"\n",
"Effect of k: Smaller k (1-5) performs well, while larger k (>10) reduces performance due to underfitting.\n",
"\n",
"Challenges: Ensuring proper scaling was critical. LOOCV was computationally heavy but necessary for reliability."
],
"metadata": {
"id": "CHI3RqpqNzvu"
}
}
]
}