KuRRe8 · June 22, 2025 09:09 · KuRRe8 · May 8, 2025
diff --git a/_intro.md b/_intro.md
diff --git a/app_deploy_tutorial.ipynb b/app_deploy_tutorial.ipynb
diff --git a/asyncio_tutorial.ipynb b/asyncio_tutorial.ipynb
diff --git a/boosting_libs_tutorial.ipynb b/boosting_libs_tutorial.ipynb
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 高性能梯度提升库教程 (XGBoost, LightGBM, CatBoost)\n",
    "\n",
    "欢迎来到 XGBoost, LightGBM, 和 CatBoost 教程！这三个库都是**梯度提升决策树 (Gradient Boosting Decision Tree, GBDT)** 算法的高效、可扩展且流行的实现。它们在处理**表格/结构化数据**方面表现出色，经常在数据科学竞赛和工业界应用中取得顶尖性能。\n",
    "\n",
    "**为什么使用这些库？**\n",
    "\n",
    "它们通常比 Scikit-learn 内置的梯度提升实现更快、更精确，并提供更多高级功能，如内置正则化、缺失值处理和对类别特征的特殊支持（尤其是 CatBoost）。\n",
    "\n",
    "本教程将独立地介绍这三个库的基础用法：\n",
    "\n",
    "1.  **XGBoost**: 最早广泛流行的高效 GBDT 实现之一。\n",
    "2.  **LightGBM**: 以速度快和内存占用低著称。\n",
    "3.  **CatBoost**: 特别擅长自动处理类别特征。\n",
    "\n",
    "我们将使用 Scikit-learn 内置的数据集进行分类和回归任务，分别展示每个库如何训练、预测和评估模型。\n",
    "\n",
    "**本教程结构：**\n",
    "1.  准备工作（安装库、公共数据准备）。\n",
    "2.  使用 XGBoost。\n",
    "3.  使用 LightGBM。\n",
    "4.  使用 CatBoost。\n",
    "5.  性能比较与总结。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. 准备工作\n",
    "\n",
    "安装必要的库，并准备用于演示的数据集。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.1 安装库\n",
    "\n",
    "```bash\n",
    "pip install xgboost lightgbm catboost scikit-learn pandas numpy matplotlib\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# --- 公共导入 (用于数据准备和评估) ---\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import accuracy_score, mean_squared_error, r2_score, roc_auc_score\n",
    "from sklearn.datasets import load_breast_cancer, fetch_california_housing\n",
    "import time\n",
    "import os\n",
    "import warnings\n",
    "\n",
    "# 忽略特定库可能产生的未来警告，使输出更整洁\n",
    "warnings.filterwarnings('ignore', category=FutureWarning)\n",
    "\n",
    "# 用于计时的辅助函数\n",
    "def time_it(func, *args, **kwargs):\n",
    "    start_time = time.time()\n",
    "    result = func(*args, **kwargs)\n",
    "    end_time = time.time()\n",
    "    print(f\"Execution time: {end_time - start_time:.4f} seconds\")\n",
    "    return result\n",
    "\n",
    "# --- 检查库版本 --- \n",
    "print(\"Checking library versions...\")\n",
    "try: import xgboost as xgb; print(f\"  XGBoost version: {xgb.__version__}\")\n",
    "except ImportError: print(\"  XGBoost not installed.\"); xgb = None\n",
    "try: import lightgbm as lgb; print(f\"  LightGBM version: {lgb.__version__}\")\n",
    "except ImportError: print(\"  LightGBM not installed.\"); lgb = None\n",
    "try: import catboost as cb; print(f\"  CatBoost version: {cb.__version__}\")\n",
    "except ImportError: print(\"  CatBoost not installed.\"); cb = None\n",
    "\n",
    "# --- 公共数据准备 (执行一次) ---\n",
    "print(\"\\n--- Preparing Datasets ---\")\n",
    "# Classification Data\n",
    "cancer = load_breast_cancer()\n",
    "X_cancer_base, y_cancer_base = cancer.data, cancer.target\n",
    "cancer_feature_names = cancer.feature_names\n",
    "print(f\"Base classification data shape: X={X_cancer_base.shape}\")\n",
    "\n",
    "# Regression Data\n",
    "regression_available = False\n",
    "X_housing_base, y_housing_base, housing_feature_names = None, None, None\n",
    "try:\n",
    "    housing = fetch_california_housing()\n",
    "    X_housing_base, y_housing_base = housing.data, housing.target\n",
    "    housing_feature_names = housing.feature_names\n",
    "    print(f\"Base regression data shape: X={X_housing_base.shape}\")\n",
    "    regression_available = True\n",
    "except ImportError:\n",
    "    print(\"California housing dataset not available (requires scikit-learn >= 0.20). Skipping regression examples.\")\n",
    "except Exception as e:\n",
    "     print(f\"Error loading regression dataset: {e}. Skipping regression examples.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. 使用 XGBoost\n",
    "\n",
    "XGBoost (eXtreme Gradient Boosting) 是一个优化过的分布式梯度提升库，旨在高效、灵活和可移植。它实现了正则化学习目标，有助于控制过拟合。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# --- XGBoost: 导入与数据划分 ---\n",
    "print(\"\\n--- XGBoost Section --- \")\n",
    "accuracy_xgb, auc_xgb, mse_xgb, r2_xgb = None, None, None, None # Initialize results\n",
    "if xgb:\n",
    "    # 划分分类数据\n",
    "    X_cancer_train_xgb, X_cancer_test_xgb, y_cancer_train_xgb, y_cancer_test_xgb = train_test_split(\n",
    "        X_cancer_base, y_cancer_base, test_size=0.2, random_state=42, stratify=y_cancer_base\n",
    "    )\n",
    "    print(\"XGBoost: Classification data split.\")\n",
    "    \n",
    "    # 划分回归数据\n",
    "    if regression_available:\n",
    "        X_housing_train_xgb, X_housing_test_xgb, y_housing_train_xgb, y_housing_test_xgb = train_test_split(\n",
    "            X_housing_base, y_housing_base, test_size=0.2, random_state=42\n",
    "        )\n",
    "        print(\"XGBoost: Regression data split.\")\n",
    "    else:\n",
    "        print(\"XGBoost: Regression data unavailable.\")\n",
    "else:\n",
    "     print(\"XGBoost not available, skipping this section.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# --- XGBoost: 分类 --- \n",
    "if xgb:\n",
    "    print(\"\\nXGBoost: Training XGBClassifier...\")\n",
    "    xgb_clf = xgb.XGBClassifier(\n",
    "        objective='binary:logistic',\n",
    "        n_estimators=100,         \n",
    "        learning_rate=0.1,\n",
    "        max_depth=3,\n",
    "        subsample=0.8,            \n",
    "        colsample_bytree=0.8,     \n",
    "        use_label_encoder=False,  # Recommended setting \n",
    "        eval_metric='logloss',    \n",
    "        random_state=42,\n",
    "        n_jobs=-1                 \n",
    "    )\n",
    "    \n",
    "    # Train the model with early stopping\n",
    "    time_it(xgb_clf.fit, X_cancer_train_xgb, y_cancer_train_xgb, \n",
    "            early_stopping_rounds=10, \n",
    "            eval_set=[(X_cancer_test_xgb, y_cancer_test_xgb)], \n",
    "            verbose=False)\n",
    "    \n",
    "    # Predict and Evaluate\n",
    "    y_pred_xgb_clf = xgb_clf.predict(X_cancer_test_xgb)\n",
    "    y_proba_xgb_clf = xgb_clf.predict_proba(X_cancer_test_xgb)[:, 1] \n",
    "    accuracy_xgb = accuracy_score(y_cancer_test_xgb, y_pred_xgb_clf)\n",
    "    auc_xgb = roc_auc_score(y_cancer_test_xgb, y_proba_xgb_clf)\n",
    "    print(f\"XGBoost Classifier Accuracy: {accuracy_xgb:.4f}\")\n",
    "    print(f\"XGBoost Classifier AUC: {auc_xgb:.4f}\")\n",
    "else:\n",
    "    print(\"Skipping XGBoost classification (library not loaded).\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# --- XGBoost: 回归 --- \n",
    "if xgb and regression_available:\n",
    "    print(\"\\nXGBoost: Training XGBRegressor...\")\n",
    "    xgb_reg = xgb.XGBRegressor(\n",
    "        objective='reg:squarederror', \n",
    "        n_estimators=100,\n",
    "        learning_rate=0.1,\n",
    "        max_depth=5,\n",
    "        subsample=0.8,\n",
    "        colsample_bytree=0.8,\n",
    "        random_state=42,\n",
    "        n_jobs=-1\n",
    "    )\n",
    "    \n",
    "    time_it(xgb_reg.fit, X_housing_train_xgb, y_housing_train_xgb,\n",
    "            early_stopping_rounds=10,\n",
    "            eval_set=[(X_housing_test_xgb, y_housing_test_xgb)],\n",
    "            verbose=False)\n",
    "            \n",
    "    y_pred_xgb_reg = xgb_reg.predict(X_housing_test_xgb)\n",
    "    mse_xgb = mean_squared_error(y_housing_test_xgb, y_pred_xgb_reg)\n",
    "    r2_xgb = r2_score(y_housing_test_xgb, y_pred_xgb_reg)\n",
    "    print(f\"XGBoost Regressor MSE: {mse_xgb:.4f}\")\n",
    "    print(f\"XGBoost Regressor R2: {r2_xgb:.4f}\")\n",
    "elif xgb:\n",
    "    print(\"\\nXGBoost: Skipping Regressor example (data unavailable).\")\n",
    "else:\n",
    "     print(\"Skipping XGBoost regression (library not loaded).\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. 使用 LightGBM\n",
    "\n",
    "LightGBM (Light Gradient Boosting Machine) 以其训练速度快和内存占用低而闻名。它使用基于直方图的算法和叶子优先 (leaf-wise) 的树生长策略。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# --- LightGBM: 导入与数据划分 ---\n",
    "print(\"\\n--- LightGBM Section --- \")\n",
    "accuracy_lgb, auc_lgb, mse_lgb, r2_lgb = None, None, None, None # Initialize results\n",
    "if lgb:\n",
    "    from lightgbm import early_stopping, log_evaluation # Callbacks\n",
    "\n",
    "    # 划分分类数据\n",
    "    X_cancer_train_lgb, X_cancer_test_lgb, y_cancer_train_lgb, y_cancer_test_lgb = train_test_split(\n",
    "        X_cancer_base, y_cancer_base, test_size=0.2, random_state=42, stratify=y_cancer_base\n",
    "    )\n",
    "    print(\"LightGBM: Classification data split.\")\n",
    "    \n",
    "    # 划分回归数据\n",
    "    if regression_available:\n",
    "        X_housing_train_lgb, X_housing_test_lgb, y_housing_train_lgb, y_housing_test_lgb = train_test_split(\n",
    "            X_housing_base, y_housing_base, test_size=0.2, random_state=42\n",
    "        )\n",
    "        print(\"LightGBM: Regression data split.\")\n",
    "    else:\n",
    "         print(\"LightGBM: Regression data unavailable.\")\n",
    "else:\n",
    "     print(\"LightGBM not available, skipping this section.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# --- LightGBM: 分类 --- \n",
    "if lgb:\n",
    "    print(\"\\nLightGBM: Training LGBMClassifier...\")\n",
    "    lgb_clf = lgb.LGBMClassifier(\n",
    "        objective='binary',\n",
    "        metric='auc',\n",
    "        n_estimators=100,\n",
    "        learning_rate=0.1,\n",
    "        num_leaves=31,           \n",
    "        max_depth=-1,            \n",
    "        subsample=0.8,           # bagging_fraction\n",
    "        colsample_bytree=0.8,    # feature_fraction\n",
    "        random_state=42,\n",
    "        n_jobs=-1\n",
    "    )\n",
    "    \n",
    "    lgbm_clf_callbacks = [\n",
    "        early_stopping(stopping_rounds=10, verbose=False),\n",
    "        log_evaluation(period=0)\n",
    "    ]\n",
    "    time_it(lgb_clf.fit, X_cancer_train_lgb, y_cancer_train_lgb, \n",
    "            eval_set=[(X_cancer_test_lgb, y_cancer_test_lgb)], \n",
    "            eval_metric='auc',\n",
    "            callbacks=lgbm_clf_callbacks)\n",
    "    \n",
    "    y_pred_lgb_clf = lgb_clf.predict(X_cancer_test_lgb)\n",
    "    y_proba_lgb_clf = lgb_clf.predict_proba(X_cancer_test_lgb)[:, 1]\n",
    "    accuracy_lgb = accuracy_score(y_cancer_test_lgb, y_pred_lgb_clf)\n",
    "    auc_lgb = roc_auc_score(y_cancer_test_lgb, y_proba_lgb_clf)\n",
    "    print(f\"LightGBM Classifier Accuracy: {accuracy_lgb:.4f}\")\n",
    "    print(f\"LightGBM Classifier AUC: {auc_lgb:.4f}\")\n",
    "else:\n",
    "    print(\"Skipping LightGBM classification (library not loaded).\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# --- LightGBM: 回归 --- \n",
    "if lgb and regression_available:\n",
    "    print(\"\\nLightGBM: Training LGBMRegressor...\")\n",
    "    lgb_reg = lgb.LGBMRegressor(\n",
    "        objective='regression_l2',\n",
    "        metric='rmse',\n",
    "        n_estimators=100,\n",
    "        learning_rate=0.1,\n",
    "        num_leaves=31,\n",
    "        max_depth=-1,\n",
    "        subsample=0.8,\n",
    "        colsample_bytree=0.8,\n",
    "        random_state=42,\n",
    "        n_jobs=-1\n",
    "    )\n",
    "    \n",
    "    lgbm_reg_callbacks = [\n",
    "         early_stopping(stopping_rounds=10, verbose=False),\n",
    "         log_evaluation(period=0)\n",
    "    ]\n",
    "    time_it(lgb_reg.fit, X_housing_train_lgb, y_housing_train_lgb,\n",
    "            eval_set=[(X_housing_test_lgb, y_housing_test_lgb)],\n",
    "            eval_metric='rmse',\n",
    "            callbacks=lgbm_reg_callbacks)\n",
    "            \n",
    "    y_pred_lgb_reg = lgb_reg.predict(X_housing_test_lgb)\n",
    "    mse_lgb = mean_squared_error(y_housing_test_lgb, y_pred_lgb_reg)\n",
    "    r2_lgb = r2_score(y_housing_test_lgb, y_pred_lgb_reg)\n",
    "    print(f\"LightGBM Regressor MSE: {mse_lgb:.4f}\")\n",
    "    print(f\"LightGBM Regressor R2: {r2_lgb:.4f}\")\n",
    "elif lgb:\n",
    "    print(\"\\nLightGBM: Skipping Regressor example (data unavailable).\")\n",
    "else:\n",
    "    print(\"Skipping LightGBM regression (library not loaded).\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. 使用 CatBoost\n",
    "\n",
    "CatBoost (Categorical Boosting) 的主要特点是其内置的对类别特征的高效处理，通常无需预处理即可获得良好效果。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# --- CatBoost: 导入与数据划分 ---\n",
    "print(\"\\n--- CatBoost Section --- \")\n",
    "accuracy_cb, auc_cb, mse_cb, r2_cb = None, None, None, None # Initialize results\n",
    "if cb:\n",
    "    # 划分分类数据\n",
    "    X_cancer_train_cb, X_cancer_test_cb, y_cancer_train_cb, y_cancer_test_cb = train_test_split(\n",
    "        X_cancer_base, y_cancer_base, test_size=0.2, random_state=42, stratify=y_cancer_base\n",
    "    )\n",
    "    print(\"CatBoost: Classification data split.\")\n",
    "\n",
    "    # 划分回归数据\n",
    "    if regression_available:\n",
    "        X_housing_train_cb, X_housing_test_cb, y_housing_train_cb, y_housing_test_cb = train_test_split(\n",
    "            X_housing_base, y_housing_base, test_size=0.2, random_state=42\n",
    "        )\n",
    "        print(\"CatBoost: Regression data split.\")\n",
    "    else:\n",
    "         print(\"CatBoost: Regression data unavailable.\")\n",
    "\n",
    "else:\n",
    "    print(\"CatBoost not available, skipping this section.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# --- CatBoost: 分类 --- \n",
    "if cb:\n",
    "    print(\"\\nCatBoost: Training CatBoostClassifier...\")\n",
    "    cb_clf = cb.CatBoostClassifier(\n",
    "        iterations=100,         \n",
    "        learning_rate=0.1,\n",
    "        depth=6,                \n",
    "        l2_leaf_reg=3,          \n",
    "        loss_function='Logloss',\n",
    "        eval_metric='AUC',      \n",
    "        random_seed=42,\n",
    "        verbose=0,              # Suppress iteration output\n",
    "        early_stopping_rounds=10\n",
    "    )\n",
    "    \n",
    "    time_it(cb_clf.fit, X_cancer_train_cb, y_cancer_train_cb,\n",
    "            eval_set=(X_cancer_test_cb, y_cancer_test_cb),\n",
    "            verbose=0) # Pass verbose=0 to fit as well\n",
    "    \n",
    "    y_pred_cb_clf = cb_clf.predict(X_cancer_test_cb)\n",
    "    y_proba_cb_clf = cb_clf.predict_proba(X_cancer_test_cb)[:, 1]\n",
    "    accuracy_cb = accuracy_score(y_cancer_test_cb, y_pred_cb_clf)\n",
    "    auc_cb = roc_auc_score(y_cancer_test_cb, y_proba_cb_clf)\n",
    "    print(f\"CatBoost Classifier Accuracy: {accuracy_cb:.4f}\")\n",
    "    print(f\"CatBoost Classifier AUC: {auc_cb:.4f}\")\n",
    "else:\n",
    "    print(\"Skipping CatBoost classification (library not loaded).\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# --- CatBoost: 回归 --- \n",
    "if cb and regression_available:\n",
    "    print(\"\\nCatBoost: Training CatBoostRegressor...\")\n",
    "    cb_reg = cb.CatBoostRegressor(\n",
    "        iterations=100,\n",
    "        learning_rate=0.1,\n",
    "        depth=6,\n",
    "        l2_leaf_reg=3,\n",
    "        loss_function='RMSE', \n",
    "        eval_metric='RMSE',\n",
    "        random_seed=42,\n",
    "        verbose=0,\n",
    "        early_stopping_rounds=10\n",
    "    )\n",
    "    \n",
    "    time_it(cb_reg.fit, X_housing_train_cb, y_housing_train_cb,\n",
    "            eval_set=(X_housing_test_cb, y_housing_test_cb),\n",
    "            verbose=0)\n",
    "            \n",
    "    y_pred_cb_reg = cb_reg.predict(X_housing_test_cb)\n",
    "    mse_cb = mean_squared_error(y_housing_test_cb, y_pred_cb_reg)\n",
    "    r2_cb = r2_score(y_housing_test_cb, y_pred_cb_reg)\n",
    "    print(f\"CatBoost Regressor MSE: {mse_cb:.4f}\")\n",
    "    print(f\"CatBoost Regressor R2: {r2_cb:.4f}\")\n",
    "elif cb:\n",
    "    print(\"\\nCatBoost: Skipping Regressor example (data unavailable).\")\n",
    "else:\n",
    "    print(\"Skipping CatBoost regression (library not loaded).\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# --- CatBoost: Categorical Feature Handling --- \n",
    "if cb:\n",
    "    print(\"\\n--- CatBoost Categorical Feature Handling Example ---\")\n",
    "    # Create sample data with categorical features\n",
    "    cat_df = pd.DataFrame({\n",
    "        'Num1': np.random.rand(100),\n",
    "        'City': np.random.choice(['London', 'Paris', 'Tokyo', 'NYC'], 100),\n",
    "        'Weather': np.random.choice(['Sunny', 'Cloudy', 'Rainy'], 100),\n",
    "        'Target': np.random.randint(0, 2, 100)\n",
    "    })\n",
    "    X_cat_train, X_cat_test, y_cat_train, y_cat_test = train_test_split(\n",
    "        cat_df[['Num1', 'City', 'Weather']], cat_df['Target'], test_size=0.25, random_state=42\n",
    "    )\n",
    "    \n",
    "    # Identify categorical features\n",
    "    categorical_features_indices = np.where(X_cat_train.dtypes != float)[0]\n",
    "    print(f\"Categorical feature indices: {categorical_features_indices}\") # Should be [1, 2]\n",
    "    \n",
    "    cb_clf_cat = cb.CatBoostClassifier(\n",
    "        iterations=50, verbose=0, random_seed=42,\n",
    "        cat_features=categorical_features_indices # Pass indices\n",
    "    )\n",
    "    print(\"Fitting CatBoostClassifier with cat_features...\")\n",
    "    time_it(cb_clf_cat.fit, X_cat_train, y_cat_train, \n",
    "            eval_set=(X_cat_test, y_cat_test), verbose=0)\n",
    "    \n",
    "    y_pred_cb_cat = cb_clf_cat.predict(X_cat_test)\n",
    "    print(f\"CatBoost (with cat features) Accuracy: {accuracy_score(y_cat_test, y_pred_cb_cat):.4f}\")\n",
    "else:\n",
    "    print(\"Skipping CatBoost categorical example (library not loaded).\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 比较与总结"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. 性能比较与总结\n",
    "\n",
    "让我们回顾一下这三个库在我们的简单示例上的性能。\n",
    "**免责声明**: 本次运行使用了非常基础的参数和有限的训练轮数，结果仅供演示 API 用法，**不代表**各库在优化后的真实相对性能。实际项目中需要仔细进行超参数调优。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"--- Performance Summary (Basic Run) ---\")\n",
    "\n",
    "print(\"Classification (Breast Cancer - Higher is better):\")\n",
    "if xgb:\n",
    "    print(f\"  XGBoost : Accuracy={accuracy_xgb if accuracy_xgb is not None else 'N/A':.4f}, AUC={auc_xgb if auc_xgb is not None else 'N/A':.4f}\")\n",
    "else: print(\"  XGBoost: Not run.\")\n",
    "if lgb:\n",
    "    print(f\"  LightGBM: Accuracy={accuracy_lgb if accuracy_lgb is not None else 'N/A':.4f}, AUC={auc_lgb if auc_lgb is not None else 'N/A':.4f}\")\n",
    "else: print(\"  LightGBM: Not run.\")\n",
    "if cb:\n",
    "    print(f\"  CatBoost: Accuracy={accuracy_cb if accuracy_cb is not None else 'N/A':.4f}, AUC={auc_cb if auc_cb is not None else 'N/A':.4f}\")\n",
    "else: print(\"  CatBoost: Not run.\")\n",
    "\n",
    "if regression_available:\n",
    "    print(\"\\nRegression (California Housing):\")\n",
    "    print(\"  Metric: MSE (Lower is better), R2 (Higher is better)\")\n",
    "    if xgb:\n",
    "        print(f\"  XGBoost : MSE={mse_xgb if mse_xgb is not None else 'N/A':.4f}, R2={r2_xgb if r2_xgb is not None else 'N/A':.4f}\")\n",
    "    else: print(\"  XGBoost: Not run.\")\n",
    "    if lgb:\n",
    "        print(f\"  LightGBM: MSE={mse_lgb if mse_lgb is not None else 'N/A':.4f}, R2={r2_lgb if r2_lgb is not None else 'N/A':.4f}\")\n",
    "    else: print(\"  LightGBM: Not run.\")\n",
    "    if cb:\n",
    "        print(f\"  CatBoost: MSE={mse_cb if mse_cb is not None else 'N/A':.4f}, R2={r2_cb if r2_cb is not None else 'N/A':.4f}\")\n",
    "    else: print(\"  CatBoost: Not run.\")\n",
    "else:\n",
    "    print(\"\\nRegression results not available.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 总结要点\n",
    "\n",
    "*   **XGBoost, LightGBM, CatBoost** 都是处理表格数据的强大武器。\n",
    "*   它们都提供了方便的 **Scikit-learn 兼容接口**。\n",
    "*   **LightGBM** 通常以**速度**见长。\n",
    "*   **CatBoost** 在**类别特征处理**上具有独特优势。\n",
    "*   **XGBoost** 是一个**成熟、稳定、功能全面**的选择。\n",
    "*   **超参数调优**和**提前停止**对于获得最佳性能至关重要。\n",
    "\n",
    "在实际项目中，建议根据数据特点和需求尝试不同的库，并通过交叉验证和调优来选择最佳模型。"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 5
 }
diff --git a/c_extensions_tutorial.ipynb b/c_extensions_tutorial.ipynb
diff --git a/concurrency_tutorial.ipynb b/concurrency_tutorial.ipynb
diff --git a/context_managers_tutorial.ipynb b/context_managers_tutorial.ipynb
diff --git a/descriptor.md b/descriptor.md
diff --git a/design_patterns_tutorial.ipynb b/design_patterns_tutorial.ipynb
diff --git a/dunder_methods_tutorial.ipynb b/dunder_methods_tutorial.ipynb
diff --git a/experiment_tracking_tutorial.ipynb b/experiment_tracking_tutorial.ipynb
diff --git a/faiss_tutorial.ipynb b/faiss_tutorial.ipynb
diff --git a/generics_tutorial.ipynb b/generics_tutorial.ipynb
diff --git a/gymnasium_tutorial.ipynb b/gymnasium_tutorial.ipynb
diff --git a/hpo_tutorial.ipynb b/hpo_tutorial.ipynb
diff --git a/iterators_generators_tutorial.ipynb b/iterators_generators_tutorial.ipynb
diff --git a/langchain_tutorial.ipynb b/langchain_tutorial.ipynb
diff --git a/llamaindex_tutorial.ipynb b/llamaindex_tutorial.ipynb
diff --git a/matplotlib_tutorial.ipynb b/matplotlib_tutorial.ipynb
diff --git a/metaprogramming_tutorial.ipynb b/metaprogramming_tutorial.ipynb
diff --git a/numpy_tutorial.ipynb b/numpy_tutorial.ipynb
diff --git a/opencv_tutorial.ipynb b/opencv_tutorial.ipynb
diff --git a/pandas_tutorial.ipynb b/pandas_tutorial.ipynb
diff --git a/pattern_matching_tutorial.ipynb b/pattern_matching_tutorial.ipynb
diff --git a/python_internals_performance_tutorial.ipynb b/python_internals_performance_tutorial.ipynb
diff --git a/pytorch_tutorial.ipynb b/pytorch_tutorial.ipynb
diff --git a/scikit_learn_tutorial.ipynb b/scikit_learn_tutorial.ipynb
diff --git a/seaborn_tutorial.ipynb b/seaborn_tutorial.ipynb
diff --git a/shap_tutorial.ipynb b/shap_tutorial.ipynb
diff --git a/sigh.md b/sigh.md
diff --git a/stable_baselines3_tutorial.ipynb b/stable_baselines3_tutorial.ipynb
diff --git a/standard_library_tutorial.ipynb b/standard_library_tutorial.ipynb
diff --git a/transformers_tutorial.ipynb b/transformers_tutorial.ipynb
	{
	"cells": [
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# 高性能梯度提升库教程 (XGBoost, LightGBM, CatBoost)\n",
	"\n",
	"欢迎来到 XGBoost, LightGBM, 和 CatBoost 教程！这三个库都是梯度提升决策树 (Gradient Boosting Decision Tree, GBDT) 算法的高效、可扩展且流行的实现。它们在处理表格/结构化数据方面表现出色，经常在数据科学竞赛和工业界应用中取得顶尖性能。\n",
	"\n",
	"为什么使用这些库？\n",
	"\n",
	"它们通常比 Scikit-learn 内置的梯度提升实现更快、更精确，并提供更多高级功能，如内置正则化、缺失值处理和对类别特征的特殊支持（尤其是 CatBoost）。\n",
	"\n",
	"本教程将独立地介绍这三个库的基础用法：\n",
	"\n",
	"1. XGBoost: 最早广泛流行的高效 GBDT 实现之一。\n",
	"2. LightGBM: 以速度快和内存占用低著称。\n",
	"3. CatBoost: 特别擅长自动处理类别特征。\n",
	"\n",
	"我们将使用 Scikit-learn 内置的数据集进行分类和回归任务，分别展示每个库如何训练、预测和评估模型。\n",
	"\n",
	"本教程结构：\n",
	"1. 准备工作（安装库、公共数据准备）。\n",
	"2. 使用 XGBoost。\n",
	"3. 使用 LightGBM。\n",
	"4. 使用 CatBoost。\n",
	"5. 性能比较与总结。"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## 1. 准备工作\n",
	"\n",
	"安装必要的库，并准备用于演示的数据集。"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### 1.1 安装库\n",
	"\n",
	"```bash\n",
	"pip install xgboost lightgbm catboost scikit-learn pandas numpy matplotlib\n",
	"```"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"# --- 公共导入 (用于数据准备和评估) ---\n",
	"import numpy as np\n",
	"import pandas as pd\n",
	"from sklearn.model_selection import train_test_split\n",
	"from sklearn.metrics import accuracy_score, mean_squared_error, r2_score, roc_auc_score\n",
	"from sklearn.datasets import load_breast_cancer, fetch_california_housing\n",
	"import time\n",
	"import os\n",
	"import warnings\n",
	"\n",
	"# 忽略特定库可能产生的未来警告，使输出更整洁\n",
	"warnings.filterwarnings('ignore', category=FutureWarning)\n",
	"\n",
	"# 用于计时的辅助函数\n",
	"def time_it(func, args, *kwargs):\n",
	" start_time = time.time()\n",
	" result = func(args, *kwargs)\n",
	" end_time = time.time()\n",
	" print(f\"Execution time: {end_time - start_time:.4f} seconds\")\n",
	" return result\n",
	"\n",
	"# --- 检查库版本 --- \n",
	"print(\"Checking library versions...\")\n",
	"try: import xgboost as xgb; print(f\" XGBoost version: {xgb.__version__}\")\n",
	"except ImportError: print(\" XGBoost not installed.\"); xgb = None\n",
	"try: import lightgbm as lgb; print(f\" LightGBM version: {lgb.__version__}\")\n",
	"except ImportError: print(\" LightGBM not installed.\"); lgb = None\n",
	"try: import catboost as cb; print(f\" CatBoost version: {cb.__version__}\")\n",
	"except ImportError: print(\" CatBoost not installed.\"); cb = None\n",
	"\n",
	"# --- 公共数据准备 (执行一次) ---\n",
	"print(\"\\n--- Preparing Datasets ---\")\n",
	"# Classification Data\n",
	"cancer = load_breast_cancer()\n",
	"X_cancer_base, y_cancer_base = cancer.data, cancer.target\n",
	"cancer_feature_names = cancer.feature_names\n",
	"print(f\"Base classification data shape: X={X_cancer_base.shape}\")\n",
	"\n",
	"# Regression Data\n",
	"regression_available = False\n",
	"X_housing_base, y_housing_base, housing_feature_names = None, None, None\n",
	"try:\n",
	" housing = fetch_california_housing()\n",
	" X_housing_base, y_housing_base = housing.data, housing.target\n",
	" housing_feature_names = housing.feature_names\n",
	" print(f\"Base regression data shape: X={X_housing_base.shape}\")\n",
	" regression_available = True\n",
	"except ImportError:\n",
	" print(\"California housing dataset not available (requires scikit-learn >= 0.20). Skipping regression examples.\")\n",
	"except Exception as e:\n",
	" print(f\"Error loading regression dataset: {e}. Skipping regression examples.\")"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## 2. 使用 XGBoost\n",
	"\n",
	"XGBoost (eXtreme Gradient Boosting) 是一个优化过的分布式梯度提升库，旨在高效、灵活和可移植。它实现了正则化学习目标，有助于控制过拟合。"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"# --- XGBoost: 导入与数据划分 ---\n",
	"print(\"\\n--- XGBoost Section --- \")\n",
	"accuracy_xgb, auc_xgb, mse_xgb, r2_xgb = None, None, None, None # Initialize results\n",
	"if xgb:\n",
	" # 划分分类数据\n",
	" X_cancer_train_xgb, X_cancer_test_xgb, y_cancer_train_xgb, y_cancer_test_xgb = train_test_split(\n",
	" X_cancer_base, y_cancer_base, test_size=0.2, random_state=42, stratify=y_cancer_base\n",
	" )\n",
	" print(\"XGBoost: Classification data split.\")\n",
	" \n",
	" # 划分回归数据\n",
	" if regression_available:\n",
	" X_housing_train_xgb, X_housing_test_xgb, y_housing_train_xgb, y_housing_test_xgb = train_test_split(\n",
	" X_housing_base, y_housing_base, test_size=0.2, random_state=42\n",
	" )\n",
	" print(\"XGBoost: Regression data split.\")\n",
	" else:\n",
	" print(\"XGBoost: Regression data unavailable.\")\n",
	"else:\n",
	" print(\"XGBoost not available, skipping this section.\")"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"# --- XGBoost: 分类 --- \n",
	"if xgb:\n",
	" print(\"\\nXGBoost: Training XGBClassifier...\")\n",
	" xgb_clf = xgb.XGBClassifier(\n",
	" objective='binary:logistic',\n",
	" n_estimators=100, \n",
	" learning_rate=0.1,\n",
	" max_depth=3,\n",
	" subsample=0.8, \n",
	" colsample_bytree=0.8, \n",
	" use_label_encoder=False, # Recommended setting \n",
	" eval_metric='logloss', \n",
	" random_state=42,\n",
	" n_jobs=-1 \n",
	" )\n",
	" \n",
	" # Train the model with early stopping\n",
	" time_it(xgb_clf.fit, X_cancer_train_xgb, y_cancer_train_xgb, \n",
	" early_stopping_rounds=10, \n",
	" eval_set=[(X_cancer_test_xgb, y_cancer_test_xgb)], \n",
	" verbose=False)\n",
	" \n",
	" # Predict and Evaluate\n",
	" y_pred_xgb_clf = xgb_clf.predict(X_cancer_test_xgb)\n",
	" y_proba_xgb_clf = xgb_clf.predict_proba(X_cancer_test_xgb)[:, 1] \n",
	" accuracy_xgb = accuracy_score(y_cancer_test_xgb, y_pred_xgb_clf)\n",
	" auc_xgb = roc_auc_score(y_cancer_test_xgb, y_proba_xgb_clf)\n",
	" print(f\"XGBoost Classifier Accuracy: {accuracy_xgb:.4f}\")\n",
	" print(f\"XGBoost Classifier AUC: {auc_xgb:.4f}\")\n",
	"else:\n",
	" print(\"Skipping XGBoost classification (library not loaded).\")"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"# --- XGBoost: 回归 --- \n",
	"if xgb and regression_available:\n",
	" print(\"\\nXGBoost: Training XGBRegressor...\")\n",
	" xgb_reg = xgb.XGBRegressor(\n",
	" objective='reg:squarederror', \n",
	" n_estimators=100,\n",
	" learning_rate=0.1,\n",
	" max_depth=5,\n",
	" subsample=0.8,\n",
	" colsample_bytree=0.8,\n",
	" random_state=42,\n",
	" n_jobs=-1\n",
	" )\n",
	" \n",
	" time_it(xgb_reg.fit, X_housing_train_xgb, y_housing_train_xgb,\n",
	" early_stopping_rounds=10,\n",
	" eval_set=[(X_housing_test_xgb, y_housing_test_xgb)],\n",
	" verbose=False)\n",
	" \n",
	" y_pred_xgb_reg = xgb_reg.predict(X_housing_test_xgb)\n",
	" mse_xgb = mean_squared_error(y_housing_test_xgb, y_pred_xgb_reg)\n",
	" r2_xgb = r2_score(y_housing_test_xgb, y_pred_xgb_reg)\n",
	" print(f\"XGBoost Regressor MSE: {mse_xgb:.4f}\")\n",
	" print(f\"XGBoost Regressor R2: {r2_xgb:.4f}\")\n",
	"elif xgb:\n",
	" print(\"\\nXGBoost: Skipping Regressor example (data unavailable).\")\n",
	"else:\n",
	" print(\"Skipping XGBoost regression (library not loaded).\")"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## 3. 使用 LightGBM\n",
	"\n",
	"LightGBM (Light Gradient Boosting Machine) 以其训练速度快和内存占用低而闻名。它使用基于直方图的算法和叶子优先 (leaf-wise) 的树生长策略。"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"# --- LightGBM: 导入与数据划分 ---\n",
	"print(\"\\n--- LightGBM Section --- \")\n",
	"accuracy_lgb, auc_lgb, mse_lgb, r2_lgb = None, None, None, None # Initialize results\n",
	"if lgb:\n",
	" from lightgbm import early_stopping, log_evaluation # Callbacks\n",
	"\n",
	" # 划分分类数据\n",
	" X_cancer_train_lgb, X_cancer_test_lgb, y_cancer_train_lgb, y_cancer_test_lgb = train_test_split(\n",
	" X_cancer_base, y_cancer_base, test_size=0.2, random_state=42, stratify=y_cancer_base\n",
	" )\n",
	" print(\"LightGBM: Classification data split.\")\n",
	" \n",
	" # 划分回归数据\n",
	" if regression_available:\n",
	" X_housing_train_lgb, X_housing_test_lgb, y_housing_train_lgb, y_housing_test_lgb = train_test_split(\n",
	" X_housing_base, y_housing_base, test_size=0.2, random_state=42\n",
	" )\n",
	" print(\"LightGBM: Regression data split.\")\n",
	" else:\n",
	" print(\"LightGBM: Regression data unavailable.\")\n",
	"else:\n",
	" print(\"LightGBM not available, skipping this section.\")"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"# --- LightGBM: 分类 --- \n",
	"if lgb:\n",
	" print(\"\\nLightGBM: Training LGBMClassifier...\")\n",
	" lgb_clf = lgb.LGBMClassifier(\n",
	" objective='binary',\n",
	" metric='auc',\n",
	" n_estimators=100,\n",
	" learning_rate=0.1,\n",
	" num_leaves=31, \n",
	" max_depth=-1, \n",
	" subsample=0.8, # bagging_fraction\n",
	" colsample_bytree=0.8, # feature_fraction\n",
	" random_state=42,\n",
	" n_jobs=-1\n",
	" )\n",
	" \n",
	" lgbm_clf_callbacks = [\n",
	" early_stopping(stopping_rounds=10, verbose=False),\n",
	" log_evaluation(period=0)\n",
	" ]\n",
	" time_it(lgb_clf.fit, X_cancer_train_lgb, y_cancer_train_lgb, \n",
	" eval_set=[(X_cancer_test_lgb, y_cancer_test_lgb)], \n",
	" eval_metric='auc',\n",
	" callbacks=lgbm_clf_callbacks)\n",
	" \n",
	" y_pred_lgb_clf = lgb_clf.predict(X_cancer_test_lgb)\n",
	" y_proba_lgb_clf = lgb_clf.predict_proba(X_cancer_test_lgb)[:, 1]\n",
	" accuracy_lgb = accuracy_score(y_cancer_test_lgb, y_pred_lgb_clf)\n",
	" auc_lgb = roc_auc_score(y_cancer_test_lgb, y_proba_lgb_clf)\n",
	" print(f\"LightGBM Classifier Accuracy: {accuracy_lgb:.4f}\")\n",
	" print(f\"LightGBM Classifier AUC: {auc_lgb:.4f}\")\n",
	"else:\n",
	" print(\"Skipping LightGBM classification (library not loaded).\")"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"# --- LightGBM: 回归 --- \n",
	"if lgb and regression_available:\n",
	" print(\"\\nLightGBM: Training LGBMRegressor...\")\n",
	" lgb_reg = lgb.LGBMRegressor(\n",
	" objective='regression_l2',\n",
	" metric='rmse',\n",
	" n_estimators=100,\n",
	" learning_rate=0.1,\n",
	" num_leaves=31,\n",
	" max_depth=-1,\n",
	" subsample=0.8,\n",
	" colsample_bytree=0.8,\n",
	" random_state=42,\n",
	" n_jobs=-1\n",
	" )\n",
	" \n",
	" lgbm_reg_callbacks = [\n",
	" early_stopping(stopping_rounds=10, verbose=False),\n",
	" log_evaluation(period=0)\n",
	" ]\n",
	" time_it(lgb_reg.fit, X_housing_train_lgb, y_housing_train_lgb,\n",
	" eval_set=[(X_housing_test_lgb, y_housing_test_lgb)],\n",
	" eval_metric='rmse',\n",
	" callbacks=lgbm_reg_callbacks)\n",
	" \n",
	" y_pred_lgb_reg = lgb_reg.predict(X_housing_test_lgb)\n",
	" mse_lgb = mean_squared_error(y_housing_test_lgb, y_pred_lgb_reg)\n",
	" r2_lgb = r2_score(y_housing_test_lgb, y_pred_lgb_reg)\n",
	" print(f\"LightGBM Regressor MSE: {mse_lgb:.4f}\")\n",
	" print(f\"LightGBM Regressor R2: {r2_lgb:.4f}\")\n",
	"elif lgb:\n",
	" print(\"\\nLightGBM: Skipping Regressor example (data unavailable).\")\n",
	"else:\n",
	" print(\"Skipping LightGBM regression (library not loaded).\")"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## 4. 使用 CatBoost\n",
	"\n",
	"CatBoost (Categorical Boosting) 的主要特点是其内置的对类别特征的高效处理，通常无需预处理即可获得良好效果。"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"# --- CatBoost: 导入与数据划分 ---\n",
	"print(\"\\n--- CatBoost Section --- \")\n",
	"accuracy_cb, auc_cb, mse_cb, r2_cb = None, None, None, None # Initialize results\n",
	"if cb:\n",
	" # 划分分类数据\n",
	" X_cancer_train_cb, X_cancer_test_cb, y_cancer_train_cb, y_cancer_test_cb = train_test_split(\n",
	" X_cancer_base, y_cancer_base, test_size=0.2, random_state=42, stratify=y_cancer_base\n",
	" )\n",
	" print(\"CatBoost: Classification data split.\")\n",
	"\n",
	" # 划分回归数据\n",
	" if regression_available:\n",
	" X_housing_train_cb, X_housing_test_cb, y_housing_train_cb, y_housing_test_cb = train_test_split(\n",
	" X_housing_base, y_housing_base, test_size=0.2, random_state=42\n",
	" )\n",
	" print(\"CatBoost: Regression data split.\")\n",
	" else:\n",
	" print(\"CatBoost: Regression data unavailable.\")\n",
	"\n",
	"else:\n",
	" print(\"CatBoost not available, skipping this section.\")"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"# --- CatBoost: 分类 --- \n",
	"if cb:\n",
	" print(\"\\nCatBoost: Training CatBoostClassifier...\")\n",
	" cb_clf = cb.CatBoostClassifier(\n",
	" iterations=100, \n",
	" learning_rate=0.1,\n",
	" depth=6, \n",
	" l2_leaf_reg=3, \n",
	" loss_function='Logloss',\n",
	" eval_metric='AUC', \n",
	" random_seed=42,\n",
	" verbose=0, # Suppress iteration output\n",
	" early_stopping_rounds=10\n",
	" )\n",
	" \n",
	" time_it(cb_clf.fit, X_cancer_train_cb, y_cancer_train_cb,\n",
	" eval_set=(X_cancer_test_cb, y_cancer_test_cb),\n",
	" verbose=0) # Pass verbose=0 to fit as well\n",
	" \n",
	" y_pred_cb_clf = cb_clf.predict(X_cancer_test_cb)\n",
	" y_proba_cb_clf = cb_clf.predict_proba(X_cancer_test_cb)[:, 1]\n",
	" accuracy_cb = accuracy_score(y_cancer_test_cb, y_pred_cb_clf)\n",
	" auc_cb = roc_auc_score(y_cancer_test_cb, y_proba_cb_clf)\n",
	" print(f\"CatBoost Classifier Accuracy: {accuracy_cb:.4f}\")\n",
	" print(f\"CatBoost Classifier AUC: {auc_cb:.4f}\")\n",
	"else:\n",
	" print(\"Skipping CatBoost classification (library not loaded).\")"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"# --- CatBoost: 回归 --- \n",
	"if cb and regression_available:\n",
	" print(\"\\nCatBoost: Training CatBoostRegressor...\")\n",
	" cb_reg = cb.CatBoostRegressor(\n",
	" iterations=100,\n",
	" learning_rate=0.1,\n",
	" depth=6,\n",
	" l2_leaf_reg=3,\n",
	" loss_function='RMSE', \n",
	" eval_metric='RMSE',\n",
	" random_seed=42,\n",
	" verbose=0,\n",
	" early_stopping_rounds=10\n",
	" )\n",
	" \n",
	" time_it(cb_reg.fit, X_housing_train_cb, y_housing_train_cb,\n",
	" eval_set=(X_housing_test_cb, y_housing_test_cb),\n",
	" verbose=0)\n",
	" \n",
	" y_pred_cb_reg = cb_reg.predict(X_housing_test_cb)\n",
	" mse_cb = mean_squared_error(y_housing_test_cb, y_pred_cb_reg)\n",
	" r2_cb = r2_score(y_housing_test_cb, y_pred_cb_reg)\n",
	" print(f\"CatBoost Regressor MSE: {mse_cb:.4f}\")\n",
	" print(f\"CatBoost Regressor R2: {r2_cb:.4f}\")\n",
	"elif cb:\n",
	" print(\"\\nCatBoost: Skipping Regressor example (data unavailable).\")\n",
	"else:\n",
	" print(\"Skipping CatBoost regression (library not loaded).\")"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"# --- CatBoost: Categorical Feature Handling --- \n",
	"if cb:\n",
	" print(\"\\n--- CatBoost Categorical Feature Handling Example ---\")\n",
	" # Create sample data with categorical features\n",
	" cat_df = pd.DataFrame({\n",
	" 'Num1': np.random.rand(100),\n",
	" 'City': np.random.choice(['London', 'Paris', 'Tokyo', 'NYC'], 100),\n",
	" 'Weather': np.random.choice(['Sunny', 'Cloudy', 'Rainy'], 100),\n",
	" 'Target': np.random.randint(0, 2, 100)\n",
	" })\n",
	" X_cat_train, X_cat_test, y_cat_train, y_cat_test = train_test_split(\n",
	" cat_df[['Num1', 'City', 'Weather']], cat_df['Target'], test_size=0.25, random_state=42\n",
	" )\n",
	" \n",
	" # Identify categorical features\n",
	" categorical_features_indices = np.where(X_cat_train.dtypes != float)[0]\n",
	" print(f\"Categorical feature indices: {categorical_features_indices}\") # Should be [1, 2]\n",
	" \n",
	" cb_clf_cat = cb.CatBoostClassifier(\n",
	" iterations=50, verbose=0, random_seed=42,\n",
	" cat_features=categorical_features_indices # Pass indices\n",
	" )\n",
	" print(\"Fitting CatBoostClassifier with cat_features...\")\n",
	" time_it(cb_clf_cat.fit, X_cat_train, y_cat_train, \n",
	" eval_set=(X_cat_test, y_cat_test), verbose=0)\n",
	" \n",
	" y_pred_cb_cat = cb_clf_cat.predict(X_cat_test)\n",
	" print(f\"CatBoost (with cat features) Accuracy: {accuracy_score(y_cat_test, y_pred_cb_cat):.4f}\")\n",
	"else:\n",
	" print(\"Skipping CatBoost categorical example (library not loaded).\")"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# 比较与总结"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## 5. 性能比较与总结\n",
	"\n",
	"让我们回顾一下这三个库在我们的简单示例上的性能。\n",
	"免责声明: 本次运行使用了非常基础的参数和有限的训练轮数，结果仅供演示 API 用法，不代表各库在优化后的真实相对性能。实际项目中需要仔细进行超参数调优。"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"print(\"--- Performance Summary (Basic Run) ---\")\n",
	"\n",
	"print(\"Classification (Breast Cancer - Higher is better):\")\n",
	"if xgb:\n",
	" print(f\" XGBoost : Accuracy={accuracy_xgb if accuracy_xgb is not None else 'N/A':.4f}, AUC={auc_xgb if auc_xgb is not None else 'N/A':.4f}\")\n",
	"else: print(\" XGBoost: Not run.\")\n",
	"if lgb:\n",
	" print(f\" LightGBM: Accuracy={accuracy_lgb if accuracy_lgb is not None else 'N/A':.4f}, AUC={auc_lgb if auc_lgb is not None else 'N/A':.4f}\")\n",
	"else: print(\" LightGBM: Not run.\")\n",
	"if cb:\n",
	" print(f\" CatBoost: Accuracy={accuracy_cb if accuracy_cb is not None else 'N/A':.4f}, AUC={auc_cb if auc_cb is not None else 'N/A':.4f}\")\n",
	"else: print(\" CatBoost: Not run.\")\n",
	"\n",
	"if regression_available:\n",
	" print(\"\\nRegression (California Housing):\")\n",
	" print(\" Metric: MSE (Lower is better), R2 (Higher is better)\")\n",
	" if xgb:\n",
	" print(f\" XGBoost : MSE={mse_xgb if mse_xgb is not None else 'N/A':.4f}, R2={r2_xgb if r2_xgb is not None else 'N/A':.4f}\")\n",
	" else: print(\" XGBoost: Not run.\")\n",
	" if lgb:\n",
	" print(f\" LightGBM: MSE={mse_lgb if mse_lgb is not None else 'N/A':.4f}, R2={r2_lgb if r2_lgb is not None else 'N/A':.4f}\")\n",
	" else: print(\" LightGBM: Not run.\")\n",
	" if cb:\n",
	" print(f\" CatBoost: MSE={mse_cb if mse_cb is not None else 'N/A':.4f}, R2={r2_cb if r2_cb is not None else 'N/A':.4f}\")\n",
	" else: print(\" CatBoost: Not run.\")\n",
	"else:\n",
	" print(\"\\nRegression results not available.\")"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### 总结要点\n",
	"\n",
	"* XGBoost, LightGBM, CatBoost 都是处理表格数据的强大武器。\n",
	"* 它们都提供了方便的 Scikit-learn 兼容接口。\n",
	"* LightGBM 通常以速度见长。\n",
	"* CatBoost 在类别特征处理上具有独特优势。\n",
	"* XGBoost 是一个成熟、稳定、功能全面的选择。\n",
	"* 超参数调优和提前停止对于获得最佳性能至关重要。\n",
	"\n",
	"在实际项目中，建议根据数据特点和需求尝试不同的库，并通过交叉验证和调优来选择最佳模型。"
	]
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "Python 3 (ipykernel)",
	"language": "python",
	"name": "python3"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython3",
	"version": "3.10.12"
	},
	"orig_nbformat": 4
	},
	"nbformat": 4,
	"nbformat_minor": 5
	}