Karbejha · February 18, 2025 08:52
diff --git a/seed-dataset.ipynb b/seed-dataset.ipynb
 {
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "authorship_tag": "ABX9TyNFI0pPc7m+dqcINMcMN1IO",
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/gist/Karbejha/f8ee18d4147f86be7106b0ac09305602/seed-dataset.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "**Importing necessary libraries**"
      ],
      "metadata": {
        "id": "31bW-8ZjLcye"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "id": "H-2kyr5ILBzh"
      },
      "outputs": [],
      "source": [
        "import pandas as pd\n",
        "import numpy as np\n",
        "import matplotlib.pyplot as plt\n",
        "import seaborn as sns\n",
        "\n",
        "from sklearn.model_selection import LeaveOneOut, cross_val_predict\n",
        "from sklearn.preprocessing import StandardScaler\n",
        "from sklearn.neighbors import KNeighborsClassifier\n",
        "from sklearn.pipeline import make_pipeline\n",
        "from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix, ConfusionMatrixDisplay)\n",
        "\n",
        "# Set style for plots\n",
        "sns.set_theme(style=\"whitegrid\")"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## **1. Data Preparation**\n",
        "\n",
        "Step Explanation:\n",
        "First, we load the dataset and explore its structure, summary statistics, and distributions to understand the features and target variable."
      ],
      "metadata": {
        "id": "eNl1ubS-Lq4G"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "url = \"https://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt\"\n",
        "# The source file has no header row, so the column names are supplied manually.\n",
        "columns = ['area', 'perimeter', 'compactness', 'kernel_length', 'kernel_width',\n",
        "           'asymmetry_coefficient', 'kernel_groove_length', 'class']\n",
        "df = pd.read_csv(url, sep=r'\\s+', header=None, names=columns)\n",
        "\n",
        "# Display the first 5 rows\n",
        "print(\"First 5 rows of the dataset:\")\n",
        "display(df.head())"
      ],
      "metadata": {
        "id": "C5xu3nQ8Lz6Z"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Check summary statistics\n",
        "print(\"\\nSummary statistics:\")\n",
        "display(df.describe())\n",
        "\n",
        "# Check class distribution\n",
        "print(\"\\nClass distribution:\")\n",
        "class_counts = df['class'].value_counts().sort_index()\n",
        "class_counts.index = ['Kama (1)', 'Rosa (2)', 'Canadian (3)']\n",
        "display(class_counts)\n",
        "\n",
        "# Plot class distribution\n",
        "plt.figure(figsize=(6, 4))\n",
        "sns.countplot(x='class', data=df)\n",
        "plt.title(\"Class Distribution\")\n",
        "plt.xticks([0, 1, 2], ['Kama (1)', 'Rosa (2)', 'Canadian (3)'])\n",
        "plt.show()"
      ],
      "metadata": {
        "id": "FHJtLPqxMjQS"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Output:\n",
        "\n",
        "All classes are perfectly balanced (70 samples each).\n",
        "\n",
        "Features such as area and perimeter span larger numeric ranges than the others. Because kNN relies on distances between samples, the features need to be scaled first."
      ],
      "metadata": {
        "id": "wKNlje6mMrdS"
      }
    },
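    {
      "cell_type": "markdown",
      "source": [
        "To illustrate why scale matters for a distance-based model, here is a minimal sketch on a small made-up array (`X_demo` is hypothetical and is not the seeds data). It compares Euclidean distances from the first row to the other rows before and after `StandardScaler`. In the raw data the first column, which has the larger scale, drives the distances almost entirely."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "import numpy as np\n",
        "from sklearn.preprocessing import StandardScaler\n",
        "\n",
        "# Hypothetical toy data: column 0 has a much larger scale than column 1\n",
        "X_demo = np.array([[10.0, 0.80], [20.0, 0.82], [11.0, 0.90]])\n",
        "\n",
        "# Distances from the first row to the other two rows, before scaling\n",
        "raw_dist = np.linalg.norm(X_demo[1:] - X_demo[0], axis=1)\n",
        "\n",
        "# The same distances after standardizing each column to mean 0 and std 1\n",
        "X_scaled = StandardScaler().fit_transform(X_demo)\n",
        "scaled_dist = np.linalg.norm(X_scaled[1:] - X_scaled[0], axis=1)\n",
        "\n",
        "print(\"Raw distances:   \", raw_dist)\n",
        "print(\"Scaled distances:\", scaled_dist)"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },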
    {
      "cell_type": "markdown",
      "source": [
        "## **2. Data Splitting & Cross-Validation**\n",
        "\n",
        "Step Explanation:\n",
        "\n",
        "We use Leave-One-Out Cross-Validation (LOOCV) because the dataset is small (210 samples). LOOCV uses each sample as the test set exactly once and trains on the remaining 209, so every fold sees almost all of the data and the estimate does not depend on a random split."
      ],
      "metadata": {
        "id": "vePKXlL6MvfB"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Split features (X) and target (y)\n",
        "X = df.drop('class', axis=1)\n",
        "y = df['class']\n",
        "\n",
        "# Initialize Leave-One-Out\n",
        "loo = LeaveOneOut()"
      ],
      "metadata": {
        "id": "MVRwHtcTMrBG"
      },
      "execution_count": 5,
      "outputs": []
    },
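    {
      "cell_type": "markdown",
      "source": [
        "As a quick sanity check of how LOOCV splits the data, the sketch below runs `LeaveOneOut` on a small made-up array (`X_demo` and `loo_demo` are hypothetical names, not part of the analysis). It should produce one split per sample, with exactly one row held out each time."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "import numpy as np\n",
        "from sklearn.model_selection import LeaveOneOut\n",
        "\n",
        "# Hypothetical toy data: 4 samples, 2 features (not the seeds dataset)\n",
        "X_demo = np.arange(8).reshape(4, 2)\n",
        "\n",
        "loo_demo = LeaveOneOut()\n",
        "splits = list(loo_demo.split(X_demo))\n",
        "\n",
        "# One split per sample; each test fold holds exactly one row index\n",
        "print(\"Number of splits:\", len(splits))\n",
        "for train_idx, test_idx in splits:\n",
        "    print(\"train:\", train_idx, \"test:\", test_idx)"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },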
    {
      "cell_type": "markdown",
      "source": [
        "## **3. Model Training with kNN**\n",
        "\n",
        "Step Explanation:\n",
        "We train a kNN classifier inside a pipeline that includes scaling, so the scaler is fit only on the training portion of each fold. We test k values from 1 to 20 and use LOOCV to evaluate performance."
      ],
      "metadata": {
        "id": "qqaw2WtrM4pC"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Initialize results storage\n",
        "results = []\n",
        "k_values = range(1, 21)\n",
        "\n",
        "for k in k_values:\n",
        "    # Create pipeline with scaling and kNN\n",
        "    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))\n",
        "    # Get predictions using LOOCV\n",
        "    y_pred = cross_val_predict(model, X, y, cv=loo)\n",
        "    # Calculate metrics\n",
        "    accuracy = accuracy_score(y, y_pred)\n",
        "    report = classification_report(y, y_pred, output_dict=True)\n",
        "    results.append({\n",
        "        'k': k,\n",
        "        'accuracy': accuracy,\n",
        "        'macro_precision': report['macro avg']['precision'],\n",
        "        'macro_recall': report['macro avg']['recall'],\n",
        "        'macro_f1': report['macro avg']['f1-score']\n",
        "    })\n",
        "\n",
        "# Convert results to DataFrame\n",
        "results_df = pd.DataFrame(results)"
      ],
      "metadata": {
        "id": "26ptXGcJM8rm"
      },
      "execution_count": 6,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## **4. Model Evaluation**\n",
        "\n",
        "**Step 1: Plot Performance Metrics vs. k**"
      ],
      "metadata": {
        "id": "2DRdX7DUNcbO"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Plot metrics\n",
        "plt.figure(figsize=(10, 6))\n",
        "plt.plot(results_df['k'], results_df['accuracy'], label='Accuracy', marker='o')\n",
        "plt.plot(results_df['k'], results_df['macro_f1'], label='Macro F1-Score', marker='o')\n",
        "plt.xlabel('Number of Neighbors (k)')\n",
        "plt.ylabel('Score')\n",
        "plt.title('Model Performance vs. k')\n",
        "plt.xticks(k_values)\n",
        "plt.legend()\n",
        "plt.show()"
      ],
      "metadata": {
        "id": "-uh0DteoNenn"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "The plot shows that k=3 yields the highest accuracy (93.3%) and F1-score (93.3%).\n",
        "\n"
      ],
      "metadata": {
        "id": "vUheDB6LNlJ8"
      }
    },
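    {
      "cell_type": "markdown",
      "source": [
        "Rather than reading the best k off the plot, it can also be picked programmatically. The sketch below uses a small made-up table (`demo_df` and its scores are hypothetical, not results from this notebook). The same `idxmax` pattern should apply to a results table with `k` and `accuracy` columns."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "import pandas as pd\n",
        "\n",
        "# Hypothetical scores for illustration only (not results from this notebook)\n",
        "demo_df = pd.DataFrame({'k': [1, 3, 5], 'accuracy': [0.90, 0.95, 0.93]})\n",
        "\n",
        "# idxmax returns the index label of the first row with the highest accuracy\n",
        "best_row = demo_df.loc[demo_df['accuracy'].idxmax()]\n",
        "print(\"Best k:\", int(best_row['k']))"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },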
    {
      "cell_type": "markdown",
      "source": [
        "**Step 2: Confusion Matrix for Best k**\n",
        "\n"
      ],
      "metadata": {
        "id": "DtPFVrOSNn9j"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Re-run LOOCV with the best k (k=3)\n",
        "best_k = 3\n",
        "model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=best_k))\n",
        "y_pred = cross_val_predict(model, X, y, cv=loo)\n",
        "\n",
        "# Generate confusion matrix\n",
        "cm = confusion_matrix(y, y_pred)\n",
        "disp = ConfusionMatrixDisplay(cm, display_labels=['Kama', 'Rosa', 'Canadian'])\n",
        "disp.plot(cmap='Blues', values_format='d')\n",
        "plt.title(f\"Confusion Matrix (k={best_k})\")\n",
        "plt.show()\n",
        "\n",
        "# Print classification report\n",
        "print(classification_report(y, y_pred, target_names=['Kama', 'Rosa', 'Canadian']))"
      ],
      "metadata": {
        "id": "_wRbDGDDNsLc"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "The confusion matrix shows 2 misclassifications for Kama, 5 for Rosa, and 3 for Canadian.\n",
        "\n",
        "Precision, recall, and F1-scores are consistent across classes."
      ],
      "metadata": {
        "id": "-J2eUECWNyr0"
      }
    },
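    {
      "cell_type": "markdown",
      "source": [
        "Per-class error counts can be computed from a confusion matrix instead of being read off the figure. This is a minimal sketch on a made-up matrix (`cm_demo` is hypothetical and its values are not results from this notebook), assuming rows are true classes and columns are predicted classes, as in scikit-learn's `confusion_matrix`."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "import numpy as np\n",
        "\n",
        "# Hypothetical 3x3 confusion matrix (rows = true class, columns = predicted class)\n",
        "cm_demo = np.array([[8, 1, 1], [0, 10, 0], [2, 0, 8]])\n",
        "\n",
        "# Row total minus the diagonal entry = misclassified samples of that true class\n",
        "errors_per_class = cm_demo.sum(axis=1) - np.diag(cm_demo)\n",
        "\n",
        "# Overall accuracy = correctly classified samples / all samples\n",
        "accuracy = np.trace(cm_demo) / cm_demo.sum()\n",
        "\n",
        "print(\"Misclassified per class:\", errors_per_class)\n",
        "print(\"Overall accuracy:\", accuracy)"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },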
    {
      "cell_type": "markdown",
      "source": [
        "## **5. Discussion**\n",
        "\n",
        "Results Interpretation:\n",
        "\n",
        "Best k: k=3 achieves 93.3% accuracy and F1-score.\n",
        "\n",
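        "Effect of k: Smaller k (1-5) performs well, while larger k (>10) reduces performance because voting over many neighbors oversmooths the decision boundary (underfitting).\n",
        "\n",
        "Challenges: Proper scaling was critical because kNN is distance-based. LOOCV was computationally heavier than a single train/test split (210 fits per k value, 4,200 in total for k = 1 to 20) but gave an estimate that does not depend on a random split."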
      ],
      "metadata": {
        "id": "CHI3RqpqNzvu"
      }
    }
  ]
 }