timathom · July 2, 2025 21:17
diff --git a/entity-resolution-with-weaviate.ipynb b/entity-resolution-with-weaviate.ipynb
 {
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "machine_shape": "hm",
      "gpuType": "L4",
      "name": "entity-resolution-with-weaviate.ipynb",
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    },
    "accelerator": "GPU"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/gist/timathom/f53337cb4febd201aa50b5a9f698f753/entity_resolution_with_weaviate.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Yale Entity Resolution: Vector Search and Subject Imputation with Weaviate\n",
        "\n",
        "## 🎯 Introduction\n",
        "\n",
        "This notebook demonstrates how to use the **Weaviate vector database** and **OpenAI embeddings** to help distinguish between entities with identical names but different domains of activity.\n",
        "\n",
        "## 📚 Learning Objectives\n",
        "\n",
        "1. **Vector Database Architecture**: How Weaviate stores and indexes text embeddings for semantic search at production scale\n",
        "2. **Semantic Similarity Search**: Finding related entities through cosine similarity in high-dimensional embedding space\n",
        "3. **Subject Imputation Strategy**: Using composite text similarity to fill missing subject fields via weighted centroid algorithms\n",
        "\n",
        "## 🔬 Real-World Challenge: The Franz Schubert Problem\n",
        "\n",
        "Yale's catalog contains multiple \"Franz Schubert\" entities:\n",
        "- **Franz Schubert, 1806-1893** (artist) → Documentary and Technical Arts  \n",
        "- **Franz Schubert, 1797-1828** (composer) → Music, Sound, and Sonic Arts\n",
        "\n",
        "Similarly, \"Jean Roberts\" appears as:\n",
        "- Medical researcher (health statistics)\n",
        "- Literary scholar (drama criticism)  \n",
        "- Political writer (economic policy)\n",
        "\n",
        "**Our mission**: Use semantic embeddings to automatically classify and enhance these records.\n",
        "\n",
        "## 🛠️ Technical Infrastructure\n",
        "\n",
        "- **Vector Database**: Weaviate Cloud with HNSW indexing for sub-linear search performance\n",
        "- **Embeddings**: OpenAI text-embedding-3-small (1,536 dimensions) for semantic understanding\n",
        "- **Data Source**: Yale Library catalog records from Hugging Face\n",
        "- **Imputation Method**: Hot-deck centroid algorithm for filling missing subject fields"
      ],
      "metadata": {
        "id": "r1C8n6m-0c6R"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 📦 Step 1: Install Dependencies for Vector Search\n",
        "\n",
        "We need several specialized libraries for this entity resolution pipeline:\n",
        "\n",
        "- **`weaviate-client`**: Vector database client for storing and searching high-dimensional embeddings with production-grade HNSW indexing\n",
        "- **`datasets`**: Hugging Face library for accessing Yale's public training data (2,539 real catalog records)  \n",
        "- **`openai`**: Access to text-embedding-3-small model that powers Yale's semantic understanding\n",
        "- **`pandas` & `numpy`**: Data manipulation and numerical operations for embedding calculations\n",
        "- **`tqdm`**: Progress tracking for batch operations on large datasets\n",
        "\n",
        "These components form Yale's production vector search infrastructure, handling millions of catalog records with sub-second query response times."
      ],
      "metadata": {
        "id": "pox6cLQn1ypQ"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Install required packages\n",
        "!pip install mistralai pandas matplotlib seaborn wandb datasets==3.2.0 weaviate-client"
      ],
      "metadata": {
        "id": "KZuzgP5aXN-Q",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 1000
        },
        "outputId": "507172c0-2d07-42bf-93e9-10e7eb95e722"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Collecting mistralai\n",
            "  Downloading mistralai-1.9.1-py3-none-any.whl.metadata (33 kB)\n",
            "Requirement already satisfied: pandas in /usr/local/lib/python3.11/dist-packages (2.2.2)\n",
            "Requirement already satisfied: matplotlib in /usr/local/lib/python3.11/dist-packages (3.10.0)\n",
            "Requirement already satisfied: seaborn in /usr/local/lib/python3.11/dist-packages (0.13.2)\n",
            "Requirement already satisfied: wandb in /usr/local/lib/python3.11/dist-packages (0.20.1)\n",
            "Collecting datasets==3.2.0\n",
            "  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)\n",
            "Collecting weaviate-client\n",
            "  Downloading weaviate_client-4.15.4-py3-none-any.whl.metadata (3.7 kB)\n",
            "Requirement already satisfied: filelock in /usr/local/lib/python3.11/dist-packages (from datasets==3.2.0) (3.18.0)\n",
            "Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.11/dist-packages (from datasets==3.2.0) (2.0.2)\n",
            "Requirement already satisfied: pyarrow>=15.0.0 in /usr/local/lib/python3.11/dist-packages (from datasets==3.2.0) (18.1.0)\n",
            "Requirement already satisfied: dill<0.3.9,>=0.3.0 in /usr/local/lib/python3.11/dist-packages (from datasets==3.2.0) (0.3.7)\n",
            "Requirement already satisfied: requests>=2.32.2 in /usr/local/lib/python3.11/dist-packages (from datasets==3.2.0) (2.32.3)\n",
            "Requirement already satisfied: tqdm>=4.66.3 in /usr/local/lib/python3.11/dist-packages (from datasets==3.2.0) (4.67.1)\n",
            "Requirement already satisfied: xxhash in /usr/local/lib/python3.11/dist-packages (from datasets==3.2.0) (3.5.0)\n",
            "Requirement already satisfied: multiprocess<0.70.17 in /usr/local/lib/python3.11/dist-packages (from datasets==3.2.0) (0.70.15)\n",
            "Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets==3.2.0)\n",
            "  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)\n",
            "Requirement already satisfied: aiohttp in /usr/local/lib/python3.11/dist-packages (from datasets==3.2.0) (3.11.15)\n",
            "Requirement already satisfied: huggingface-hub>=0.23.0 in /usr/local/lib/python3.11/dist-packages (from datasets==3.2.0) (0.33.0)\n",
            "Requirement already satisfied: packaging in /usr/local/lib/python3.11/dist-packages (from datasets==3.2.0) (24.2)\n",
            "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.11/dist-packages (from datasets==3.2.0) (6.0.2)\n",
            "Collecting eval-type-backport>=0.2.0 (from mistralai)\n",
            "  Downloading eval_type_backport-0.2.2-py3-none-any.whl.metadata (2.2 kB)\n",
            "Requirement already satisfied: httpx>=0.28.1 in /usr/local/lib/python3.11/dist-packages (from mistralai) (0.28.1)\n",
            "Requirement already satisfied: pydantic>=2.10.3 in /usr/local/lib/python3.11/dist-packages (from mistralai) (2.11.7)\n",
            "Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.11/dist-packages (from mistralai) (2.9.0.post0)\n",
            "Requirement already satisfied: typing-inspection>=0.4.0 in /usr/local/lib/python3.11/dist-packages (from mistralai) (0.4.1)\n",
            "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.11/dist-packages (from pandas) (2025.2)\n",
            "Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.11/dist-packages (from pandas) (2025.2)\n",
            "Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.11/dist-packages (from matplotlib) (1.3.2)\n",
            "Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.11/dist-packages (from matplotlib) (0.12.1)\n",
            "Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.11/dist-packages (from matplotlib) (4.58.4)\n",
            "Requirement already satisfied: kiwisolver>=1.3.1 in /usr/local/lib/python3.11/dist-packages (from matplotlib) (1.4.8)\n",
            "Requirement already satisfied: pillow>=8 in /usr/local/lib/python3.11/dist-packages (from matplotlib) (11.2.1)\n",
            "Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.11/dist-packages (from matplotlib) (3.2.3)\n",
            "Requirement already satisfied: click!=8.0.0,>=7.1 in /usr/local/lib/python3.11/dist-packages (from wandb) (8.2.1)\n",
            "Requirement already satisfied: gitpython!=3.1.29,>=1.0.0 in /usr/local/lib/python3.11/dist-packages (from wandb) (3.1.44)\n",
            "Requirement already satisfied: platformdirs in /usr/local/lib/python3.11/dist-packages (from wandb) (4.3.8)\n",
            "Requirement already satisfied: protobuf!=4.21.0,!=5.28.0,<7,>=3.19.0 in /usr/local/lib/python3.11/dist-packages (from wandb) (5.29.5)\n",
            "Requirement already satisfied: psutil>=5.0.0 in /usr/local/lib/python3.11/dist-packages (from wandb) (5.9.5)\n",
            "Requirement already satisfied: sentry-sdk>=2.0.0 in /usr/local/lib/python3.11/dist-packages (from wandb) (2.31.0)\n",
            "Requirement already satisfied: setproctitle in /usr/local/lib/python3.11/dist-packages (from wandb) (1.3.6)\n",
            "Requirement already satisfied: typing-extensions<5,>=4.8 in /usr/local/lib/python3.11/dist-packages (from wandb) (4.14.0)\n",
            "Collecting validators==0.34.0 (from weaviate-client)\n",
            "  Downloading validators-0.34.0-py3-none-any.whl.metadata (3.8 kB)\n",
            "Collecting authlib<2.0.0,>=1.2.1 (from weaviate-client)\n",
            "  Downloading authlib-1.6.0-py2.py3-none-any.whl.metadata (4.1 kB)\n",
            "Requirement already satisfied: grpcio<2.0.0,>=1.66.2 in /usr/local/lib/python3.11/dist-packages (from weaviate-client) (1.73.0)\n",
            "Collecting grpcio-tools<2.0.0,>=1.66.2 (from weaviate-client)\n",
            "  Downloading grpcio_tools-1.73.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.3 kB)\n",
            "Collecting grpcio-health-checking<2.0.0,>=1.66.2 (from weaviate-client)\n",
            "  Downloading grpcio_health_checking-1.73.1-py3-none-any.whl.metadata (1.0 kB)\n",
            "Collecting deprecation<3.0.0,>=2.1.0 (from weaviate-client)\n",
            "  Downloading deprecation-2.1.0-py2.py3-none-any.whl.metadata (4.6 kB)\n",
            "Requirement already satisfied: cryptography in /usr/local/lib/python3.11/dist-packages (from authlib<2.0.0,>=1.2.1->weaviate-client) (43.0.3)\n",
            "Requirement already satisfied: aiohappyeyeballs>=2.3.0 in /usr/local/lib/python3.11/dist-packages (from aiohttp->datasets==3.2.0) (2.6.1)\n",
            "Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.11/dist-packages (from aiohttp->datasets==3.2.0) (1.3.2)\n",
            "Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.11/dist-packages (from aiohttp->datasets==3.2.0) (25.3.0)\n",
            "Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.11/dist-packages (from aiohttp->datasets==3.2.0) (1.7.0)\n",
            "Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.11/dist-packages (from aiohttp->datasets==3.2.0) (6.4.4)\n",
            "Requirement already satisfied: propcache>=0.2.0 in /usr/local/lib/python3.11/dist-packages (from aiohttp->datasets==3.2.0) (0.3.2)\n",
            "Requirement already satisfied: yarl<2.0,>=1.17.0 in /usr/local/lib/python3.11/dist-packages (from aiohttp->datasets==3.2.0) (1.20.1)\n",
            "Requirement already satisfied: gitdb<5,>=4.0.1 in /usr/local/lib/python3.11/dist-packages (from gitpython!=3.1.29,>=1.0.0->wandb) (4.0.12)\n",
            "Collecting protobuf!=4.21.0,!=5.28.0,<7,>=3.19.0 (from wandb)\n",
            "  Downloading protobuf-6.31.1-cp39-abi3-manylinux2014_x86_64.whl.metadata (593 bytes)\n",
            "Collecting grpcio<2.0.0,>=1.66.2 (from weaviate-client)\n",
            "  Downloading grpcio-1.73.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.8 kB)\n",
            "Requirement already satisfied: setuptools in /usr/local/lib/python3.11/dist-packages (from grpcio-tools<2.0.0,>=1.66.2->weaviate-client) (75.2.0)\n",
            "Requirement already satisfied: anyio in /usr/local/lib/python3.11/dist-packages (from httpx>=0.28.1->mistralai) (4.9.0)\n",
            "Requirement already satisfied: certifi in /usr/local/lib/python3.11/dist-packages (from httpx>=0.28.1->mistralai) (2025.6.15)\n",
            "Requirement already satisfied: httpcore==1.* in /usr/local/lib/python3.11/dist-packages (from httpx>=0.28.1->mistralai) (1.0.9)\n",
            "Requirement already satisfied: idna in /usr/local/lib/python3.11/dist-packages (from httpx>=0.28.1->mistralai) (3.10)\n",
            "Requirement already satisfied: h11>=0.16 in /usr/local/lib/python3.11/dist-packages (from httpcore==1.*->httpx>=0.28.1->mistralai) (0.16.0)\n",
            "Requirement already satisfied: hf-xet<2.0.0,>=1.1.2 in /usr/local/lib/python3.11/dist-packages (from huggingface-hub>=0.23.0->datasets==3.2.0) (1.1.5)\n",
            "Requirement already satisfied: annotated-types>=0.6.0 in /usr/local/lib/python3.11/dist-packages (from pydantic>=2.10.3->mistralai) (0.7.0)\n",
            "Requirement already satisfied: pydantic-core==2.33.2 in /usr/local/lib/python3.11/dist-packages (from pydantic>=2.10.3->mistralai) (2.33.2)\n",
            "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.11/dist-packages (from python-dateutil>=2.8.2->mistralai) (1.17.0)\n",
            "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.11/dist-packages (from requests>=2.32.2->datasets==3.2.0) (3.4.2)\n",
            "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.11/dist-packages (from requests>=2.32.2->datasets==3.2.0) (2.4.0)\n",
            "Requirement already satisfied: smmap<6,>=3.0.1 in /usr/local/lib/python3.11/dist-packages (from gitdb<5,>=4.0.1->gitpython!=3.1.29,>=1.0.0->wandb) (5.0.2)\n",
            "Requirement already satisfied: sniffio>=1.1 in /usr/local/lib/python3.11/dist-packages (from anyio->httpx>=0.28.1->mistralai) (1.3.1)\n",
            "Requirement already satisfied: cffi>=1.12 in /usr/local/lib/python3.11/dist-packages (from cryptography->authlib<2.0.0,>=1.2.1->weaviate-client) (1.17.1)\n",
            "Requirement already satisfied: pycparser in /usr/local/lib/python3.11/dist-packages (from cffi>=1.12->cryptography->authlib<2.0.0,>=1.2.1->weaviate-client) (2.22)\n",
            "Downloading datasets-3.2.0-py3-none-any.whl (480 kB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m480.6/480.6 kB\u001b[0m \u001b[31m17.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading mistralai-1.9.1-py3-none-any.whl (381 kB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m381.8/381.8 kB\u001b[0m \u001b[31m35.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading weaviate_client-4.15.4-py3-none-any.whl (432 kB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m433.0/433.0 kB\u001b[0m \u001b[31m39.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading validators-0.34.0-py3-none-any.whl (43 kB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m43.5/43.5 kB\u001b[0m \u001b[31m4.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading authlib-1.6.0-py2.py3-none-any.whl (239 kB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m240.0/240.0 kB\u001b[0m \u001b[31m23.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading deprecation-2.1.0-py2.py3-none-any.whl (11 kB)\n",
            "Downloading eval_type_backport-0.2.2-py3-none-any.whl (5.8 kB)\n",
            "Downloading fsspec-2024.9.0-py3-none-any.whl (179 kB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m179.3/179.3 kB\u001b[0m \u001b[31m19.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading grpcio_health_checking-1.73.1-py3-none-any.whl (18 kB)\n",
            "Downloading grpcio-1.73.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.0 MB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m6.0/6.0 MB\u001b[0m \u001b[31m120.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading grpcio_tools-1.73.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.7 MB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.7/2.7 MB\u001b[0m \u001b[31m109.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading protobuf-6.31.1-cp39-abi3-manylinux2014_x86_64.whl (321 kB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m321.1/321.1 kB\u001b[0m \u001b[31m35.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hInstalling collected packages: validators, protobuf, grpcio, fsspec, eval-type-backport, deprecation, grpcio-tools, grpcio-health-checking, mistralai, authlib, weaviate-client, datasets\n",
            "  Attempting uninstall: protobuf\n",
            "    Found existing installation: protobuf 5.29.5\n",
            "    Uninstalling protobuf-5.29.5:\n",
            "      Successfully uninstalled protobuf-5.29.5\n",
            "  Attempting uninstall: grpcio\n",
            "    Found existing installation: grpcio 1.73.0\n",
            "    Uninstalling grpcio-1.73.0:\n",
            "      Successfully uninstalled grpcio-1.73.0\n",
            "  Attempting uninstall: fsspec\n",
            "    Found existing installation: fsspec 2025.3.2\n",
            "    Uninstalling fsspec-2025.3.2:\n",
            "      Successfully uninstalled fsspec-2025.3.2\n",
            "  Attempting uninstall: datasets\n",
            "    Found existing installation: datasets 2.14.4\n",
            "    Uninstalling datasets-2.14.4:\n",
            "      Successfully uninstalled datasets-2.14.4\n",
            "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n",
            "ydf 0.12.0 requires protobuf<6.0.0,>=5.29.1, but you have protobuf 6.31.1 which is incompatible.\n",
            "grpcio-status 1.71.0 requires protobuf<6.0dev,>=5.26.1, but you have protobuf 6.31.1 which is incompatible.\n",
            "torch 2.6.0+cu124 requires nvidia-cublas-cu12==12.4.5.8; platform_system == \"Linux\" and platform_machine == \"x86_64\", but you have nvidia-cublas-cu12 12.5.3.2 which is incompatible.\n",
            "torch 2.6.0+cu124 requires nvidia-cuda-cupti-cu12==12.4.127; platform_system == \"Linux\" and platform_machine == \"x86_64\", but you have nvidia-cuda-cupti-cu12 12.5.82 which is incompatible.\n",
            "torch 2.6.0+cu124 requires nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == \"Linux\" and platform_machine == \"x86_64\", but you have nvidia-cuda-nvrtc-cu12 12.5.82 which is incompatible.\n",
            "torch 2.6.0+cu124 requires nvidia-cuda-runtime-cu12==12.4.127; platform_system == \"Linux\" and platform_machine == \"x86_64\", but you have nvidia-cuda-runtime-cu12 12.5.82 which is incompatible.\n",
            "torch 2.6.0+cu124 requires nvidia-cudnn-cu12==9.1.0.70; platform_system == \"Linux\" and platform_machine == \"x86_64\", but you have nvidia-cudnn-cu12 9.3.0.75 which is incompatible.\n",
            "torch 2.6.0+cu124 requires nvidia-cufft-cu12==11.2.1.3; platform_system == \"Linux\" and platform_machine == \"x86_64\", but you have nvidia-cufft-cu12 11.2.3.61 which is incompatible.\n",
            "torch 2.6.0+cu124 requires nvidia-curand-cu12==10.3.5.147; platform_system == \"Linux\" and platform_machine == \"x86_64\", but you have nvidia-curand-cu12 10.3.6.82 which is incompatible.\n",
            "torch 2.6.0+cu124 requires nvidia-cusolver-cu12==11.6.1.9; platform_system == \"Linux\" and platform_machine == \"x86_64\", but you have nvidia-cusolver-cu12 11.6.3.83 which is incompatible.\n",
            "torch 2.6.0+cu124 requires nvidia-cusparse-cu12==12.3.1.170; platform_system == \"Linux\" and platform_machine == \"x86_64\", but you have nvidia-cusparse-cu12 12.5.1.3 which is incompatible.\n",
            "torch 2.6.0+cu124 requires nvidia-nvjitlink-cu12==12.4.127; platform_system == \"Linux\" and platform_machine == \"x86_64\", but you have nvidia-nvjitlink-cu12 12.5.82 which is incompatible.\n",
            "gcsfs 2025.3.2 requires fsspec==2025.3.2, but you have fsspec 2024.9.0 which is incompatible.\n",
            "tensorflow 2.18.0 requires protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<6.0.0dev,>=3.20.3, but you have protobuf 6.31.1 which is incompatible.\n",
            "google-ai-generativelanguage 0.6.15 requires protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<6.0.0dev,>=3.20.2, but you have protobuf 6.31.1 which is incompatible.\u001b[0m\u001b[31m\n",
            "\u001b[0mSuccessfully installed authlib-1.6.0 datasets-3.2.0 deprecation-2.1.0 eval-type-backport-0.2.2 fsspec-2024.9.0 grpcio-1.73.1 grpcio-health-checking-1.73.1 grpcio-tools-1.73.1 mistralai-1.9.1 protobuf-6.31.1 validators-0.34.0 weaviate-client-4.15.4\n"
          ]
        },
        {
          "output_type": "display_data",
          "data": {
            "application/vnd.colab-display-data+json": {
              "pip_warning": {
                "packages": [
                  "google"
                ]
              },
              "id": "90a5c7f040f741d6af7034b6706bca16"
            }
          },
          "metadata": {}
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 🔧 Step 2: Import Production Libraries  \n",
        "\n",
        "### Core Libraries\n",
        "- **OpenAI**: Text embedding generation using `text-embedding-3-small` model\n",
        "- **Weaviate**: Vector database for semantic search with cosine similarity\n",
        "- **Datasets**: Direct access to Yale's training data from Hugging Face Hub\n"
      ],
      "metadata": {
        "id": "HPEF1QH7MFuZ"
      }
    },
    {
      "cell_type": "markdown",
      "source": [],
      "metadata": {
        "id": "T743xYe7MEEt"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 🔑 Step 3: Configure API Authentication\n",
        "\n",
        "This step establishes secure connections to all services in Yale's vector search pipeline:\n",
        "\n",
        "### Required API Keys\n",
        "- **OpenAI API Key**: Access to `text-embedding-3-small` model for generating 1,536-dimensional embeddings\n",
        "- **Weaviate Cloud Credentials**: URL and API key for vector database with HNSW indexing  \n",
        "- **Hugging Face Token**: Download Yale's public training dataset (2,539 labeled records)\n",
        "\n",
        "Store your API keys securely in Colab's secrets panel (🔑 icon in sidebar) before running this cell."
      ],
      "metadata": {
        "id": "d0PfcjRCMQVR"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import os\n",
        "from google.colab import userdata\n",
        "import requests\n",
        "import json\n",
        "import random\n",
        "import time\n",
        "from typing import Dict, List, Tuple, Any\n",
        "import hashlib\n",
        "import pandas as pd\n",
        "import numpy as np\n",
        "\n",
        "from openai import OpenAI\n",
        "from datasets import load_dataset\n",
        "import weaviate\n",
        "from weaviate.classes.config import Configure, Property, DataType, VectorDistances\n",
        "from weaviate.classes.query import MetadataQuery, Filter\n",
        "from weaviate.util import generate_uuid5\n",
        "from tqdm import tqdm\n",
        "RANDOM_SEED = 42"
      ],
      "metadata": {
        "id": "I_xTwuybWA2M"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Step 2: Configure API Keys and Authentication\n",
        "\n",
        "This step sets up secure access to the services we'll use throughout the classification pipeline:\n",
        "\n",
        "- **OpenAI**: Provides embeddings (`text-embedding-3-small`) used by our Weaviate vector database for semantic search\n",
        "- **Hugging Face**: Enables us to download Yale's pre-labeled training datasets directly from their public repository\n",
        "- **Weaviate Cloud**: Vector database service for storing and querying entity embeddings at scale"
      ],
      "metadata": {
        "id": "ssfc9uc23Euc"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')\n",
        "os.environ['HF_TOKEN'] = userdata.get('HF_TOKEN')\n",
        "os.environ['WANDB_API_KEY'] = userdata.get('WANDB_API_KEY')\n",
        "os.environ[\"WCD_URL\"] = userdata.get('WCD_URL')\n",
        "os.environ[\"WCD_API_KEY\"] = userdata.get('WCD_API_KEY')"
      ],
      "metadata": {
        "id": "aI0wI4ai3crw"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 🌐 Step 4: Connect to Weaviate Vector Database\n",
        "\n",
        "This cell establishes connection to Yale's production vector database infrastructure.\n",
        "\n",
        "https://console.weaviate.cloud/\n",
        "\n",
        "### Weaviate Cloud Setup\n",
        "- **Cluster Connection**: Connect to hosted Weaviate instance with authentication\n",
        "- **OpenAI Integration**: Pass API key for automated embedding generation\n",
        "- **Production Headers**: Configure client for enterprise-grade operations\n",
        "\n",
        "### Vector Database Benefits\n",
        "- **HNSW Indexing**: Hierarchical Navigable Small World graphs for fast similarity search\n",
        "- **Cosine Distance**: Semantic similarity metric optimized for text embeddings  \n",
        "- **Horizontal Scaling**: Handle millions of vectors with consistent sub-second queries\n",
        "\n",
        "### Connection Verification\n",
        "The successful connection enables us to:\n",
        "- Store 1,536-dimensional embeddings from OpenAI\n",
        "- Perform subject imputation using vector similarity\n"
      ],
      "metadata": {
        "id": "apHynNLI3mBY"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Connect to Weaviate\n",
        "weaviate_api_key = os.environ.get(\"WCD_API_KEY\")\n",
        "openai_api_key = os.environ.get(\"OPENAI_API_KEY\")\n",
        "weaviate_url = os.environ.get(\"WCD_URL\")\n",
        "\n",
        "openai_client = OpenAI(api_key=os.environ.get(\"OPENAI_API_KEY\"))\n",
        "\n",
        "weaviate_client = weaviate.connect_to_weaviate_cloud(\n",
        "    cluster_url=weaviate_url,\n",
        "    auth_credentials=weaviate.auth.AuthApiKey(weaviate_api_key),\n",
        "    headers={\"X-OpenAI-Api-Key\": openai_api_key}  # For OpenAI vectorizer\n",
        ")\n",
        "\n",
        "print(\"✅ Connected to OpenAI and Weaviate!\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "k97Vf_ECtWMj",
        "outputId": "b38b3b6c-9ace-4eb9-f9a1-a9bf0694e5a5"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "✅ Connected to OpenAI and Weaviate!\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 📚 Step 5: Load Yale Catalog Data"
      ],
      "metadata": {
        "id": "MG2doTkvM6nX"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Load from Hugging Face\n",
        "print(\"📚 Loading Yale dataset...\")\n",
        "training_data = pd.DataFrame(load_dataset(\"timathom/yale-library-entity-resolver-training-data\")[\"train\"])\n",
        "\n",
        "print(f\"✅ Loaded {len(training_data):,} records\")\n",
        "print(f\"   Sample: {training_data.iloc[0]['person']} - {training_data.iloc[0]['title'][:50]}...\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 133,
          "referenced_widgets": [
            "109dc27f67564f8ea57b6d13e9ce9cb9",
            "9646ad2f7d76467d9662e2dfa625fa84",
            "10fbc83a5735495882f65322a0070adc",
            "02e5c65bcde446ec8f8d02369706f497",
            "009d0b49df4a47e08b4a97541f1df992",
            "58b1087cc2fb4d3d969a4957b2915f51",
            "d5612758832c4dabb0a8cfdd6130b214",
            "26ad5d87c1b544f4a0f488ae15ab77f0",
            "e3ce3b8b3f084dc78f99dec2aed123ef",
            "8685a053b4994a41ab4f1a93db4a6ed5",
            "40b15c6687bf4dd7a871ef695871dc3b",
            "4d20ac7386844e0aa9bf47d2584cca0a",
            "f21ac0d61169409a87eb928caa47c3f7",
            "c8c6e347a66a47259ec2cb4e6a116ecc",
            "d03ca43056344d5d8054bce0088a1b90",
            "852bee5cb4da4e1fbf267d0f7a6f6683",
            "e20aea8130094cc4bb5fcc8b49ad51e1",
            "2d95be6702c34021bdc6828fa9a2ecd8",
            "5099026b15f949b58a20100e41643d76",
            "6da043e0fa8b4f4d9a64f23cea258e3a",
            "717e5ff8509849d29578413eb658d933",
            "b0dffbe245fc4920a5296ac8a81e100a"
          ]
        },
        "id": "vcrm_JOjx4Kn",
        "outputId": "52909a29-fa65-4e54-9d4c-bb50b21bf99b"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "📚 Loading Yale dataset...\n"
          ]
        },
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "(…)ibrary-entity-resolver-training-data.csv: 0.00B [00:00, ?B/s]"
            ],
            "application/vnd.jupyter.widget-view+json": {
              "version_major": 2,
              "version_minor": 0,
              "model_id": "109dc27f67564f8ea57b6d13e9ce9cb9"
            }
          },
          "metadata": {}
        },
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "Generating train split:   0%|          | 0/2539 [00:00<?, ? examples/s]"
            ],
            "application/vnd.jupyter.widget-view+json": {
              "version_major": 2,
              "version_minor": 0,
              "model_id": "4d20ac7386844e0aa9bf47d2584cca0a"
            }
          },
          "metadata": {}
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "✅ Loaded 2,539 records\n",
            "   Sample: Schubert, Franz - Archäologie und Photographie: fünfzig Beispiele ...\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [],
      "metadata": {
        "id": "OjeHXqpHMt2y"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 🧠 Step 6: Embed Records\n",
        "\n",
        "This function replicates Yale's exact production embedding generation from `embedding_and_indexing.py`:\n",
        "\n",
        "### OpenAI Text-Embedding-3-Small Model\n",
        "- **Dimensions**: 1,536-dimensional vectors optimized for semantic understanding\n",
        "- **Model Performance**: Superior to earlier models for academic and literary content\n",
        "- **Cost Efficiency**: ~$0.13 per 1M tokens, enabling large-scale processing\n",
        "- **Multilingual Support**: Handles German, English, and other European languages in Yale's catalog\n"
      ],
      "metadata": {
        "id": "O3jbyhQANH8o"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "def generate_embedding(text: str, model: str = \"text-embedding-3-small\") -> np.ndarray:\n",
        "    \"\"\"\n",
        "    Yale's production embedding function from embedding_and_indexing.py\n",
        "\n",
        "    Args:\n",
        "        text: Input text to embed\n",
        "        model: OpenAI embedding model (text-embedding-3-small)\n",
        "\n",
        "    Returns:\n",
        "        1536-dimensional embedding vector\n",
        "    \"\"\"\n",
        "    if not text or text.strip() == \"\":\n",
        "        # Return zero vector for empty text\n",
        "        return np.zeros(1536, dtype=np.float32)\n",
        "\n",
        "    try:\n",
        "        response = openai_client.embeddings.create(\n",
        "            model=model,\n",
        "            input=text\n",
        "        )\n",
        "\n",
        "        # Extract embedding from response\n",
        "        embedding = np.array(response.data[0].embedding, dtype=np.float32)\n",
        "        return embedding\n",
        "\n",
        "    except Exception as e:\n",
        "        print(f\"❌ Error generating embedding: {e}\")\n",
        "        return np.zeros(1536, dtype=np.float32)\n",
        "\n",
        "# Test the embedding function with real Yale data\n",
        "test_composite = training_data.iloc[0]['composite']\n",
        "test_embedding = generate_embedding(test_composite)\n",
        "print(f\"✅ Embedding generated successfully! Shape: {test_embedding.shape}\")\n",
        "print(f\"   Sample values: {test_embedding[:5]}\")\n",
        "print(f\"   Composite text: {test_composite[:80]}...\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "IQKWvEWaOqqS",
        "outputId": "74320bb0-20d9-4f83-9e8e-a7af2e1f3ded"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "✅ Embedding generated successfully! Shape: (1536,)\n",
            "   Sample values: [ 0.01115062  0.02462124 -0.0213398   0.00958305 -0.04418446]\n",
            "   Composite text: Title: Archäologie und Photographie: fünfzig Beispiele zur Geschichte und Meth...\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [],
      "metadata": {
        "id": "S_3WyjNuNMEB"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 🏗️ Step 7: Create Production Weaviate Schema\n",
        "\n",
        "This function creates an `EntityString` collection schema for storing and querying entity embeddings:\n",
        "\n",
        "### Schema Architecture  \n",
        "- **Collection Name**: `EntityString` - standard collection for entity embeddings\n",
        "- **Vectorizer**: `text2vec_openai` with automatic embedding generation via OpenAI API\n",
        "- **Vector Dimensions**: 1,536 to match `text-embedding-3-small` model output\n",
        "\n",
        "### HNSW Vector Index Configuration\n",
        "- **ef=128**: Controls query accuracy vs. speed tradeoff (higher = more accurate)\n",
        "- **max_connections=64**: Graph connectivity for optimal search performance  \n",
        "- **ef_construction=128**: Build-time parameter for index quality\n",
        "- **distance_metric=COSINE**: Optimal for normalized text embeddings\n",
        "\n",
        "### Data Properties\n",
        "- **original_string**: The actual text content (person name, composite text, title, subjects)\n",
        "- **hash_value**: SHA-256 hash for deduplication and UUID generation\n",
        "- **field_type**: Entity field classification (person, composite, title, subjects)\n",
        "- **frequency**: Occurrence count for popularity-based ranking\n",
        "- **personId/recordId**: Metadata for subject imputation workflows\n",
        "\n",
        "### Production Benefits\n",
        "This schema enables:\n",
        "- **Sub-second similarity search** across millions of vectors\n",
        "- **Automatic embedding generation** when inserting new text\n",
        "- **Multi-field entity representation** (person names, titles, subjects separately indexed)\n",
        "- **Subject imputation workflows** using personId linking\n"
      ],
      "metadata": {
        "id": "EIGkgPZYO8RB"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "def create_entity_schema(client):\n",
        "    \"\"\"\n",
        "    Create EntityString schema\n",
        "    \"\"\"\n",
        "    try:\n",
        "        # Check if collection already exists\n",
        "        # Delete existing collection if it exists\n",
        "        if client.collections.exists(\"EntityString\"):\n",
        "            client.collections.delete(\"EntityString\")\n",
        "            print(\"🗑️ Deleted existing EntityString collection\")\n",
        "\n",
        "        # Create with exact production schema from embedding_and_indexing.py + metadata for imputation\n",
        "        collection = client.collections.create(\n",
        "            name=\"EntityString\",\n",
        "            description=\"Collection for entity string values with their embeddings\",\n",
        "            vectorizer_config=Configure.Vectorizer.text2vec_openai(\n",
        "                model=\"text-embedding-3-small\",\n",
        "                dimensions=1536\n",
        "            ),\n",
        "            vector_index_config=Configure.VectorIndex.hnsw(\n",
        "                ef=128,                    # Production config\n",
        "                max_connections=64,        # Production config\n",
        "                ef_construction=128,       # Production config\n",
        "                distance_metric=VectorDistances.COSINE\n",
        "            ),\n",
        "            properties=[\n",
        "                # Exact production schema\n",
        "                Property(name=\"original_string\", data_type=DataType.TEXT),\n",
        "                Property(name=\"hash_value\", data_type=DataType.TEXT),\n",
        "                Property(name=\"field_type\", data_type=DataType.TEXT),\n",
        "                Property(name=\"frequency\", data_type=DataType.INT),\n",
        "                # Added for subject imputation demo\n",
        "                Property(name=\"personId\", data_type=DataType.TEXT),\n",
        "                Property(name=\"recordId\", data_type=DataType.TEXT)\n",
        "            ]\n",
        "        )\n",
        "\n",
        "        print(\"✅ Created EntityString collection with schema\")\n",
        "        return collection\n",
        "\n",
        "    except Exception as e:\n",
        "        print(f\"❌ Error creating schema: {e}\")\n",
        "        return None\n",
        "\n",
        "# Create the schema\n",
        "entity_collection = create_entity_schema(weaviate_client)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "zLJ3nPHlO1oj",
        "outputId": "76e2000f-a3c6-4a52-95d3-d6fbf7c0ca17"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "🗑️ Deleted existing EntityString collection\n",
            "✅ Created EntityString collection with schema\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 🔐 Step 8: Generate SHA-256 Hashes for Deduplication\n",
        "\n",
        "This step implements Yale's production deduplication strategy using cryptographic hashing:\n",
        "\n",
        "### SHA-256 Hash Generation\n",
        "- **Deterministic Deduplication**: Identical strings always produce identical hashes\n",
        "- **Collision Resistance**: Cryptographically secure against hash conflicts\n",
        "- **UTF-8 Encoding**: Handles multilingual catalog content (German, French, Latin)\n",
        "- **Null Handling**: Empty/null values map to \"NULL\" string for consistent processing\n",
        "\n",
        "### Field-Specific Hashing\n",
        "Yale processes each entity field type separately:\n",
        "- **person_hash**: Names and name variants (e.g., \"Schubert, Franz\" vs \"Schubert, Franz, 1797-1828\")\n",
        "- **composite_hash**: Structured text combining title, subjects, provision information  \n",
        "- **title_hash**: Work titles with normalization for cataloging variations\n",
        "- **subjects_hash**: Subject headings and classifications (NULL for missing subjects)\n",
        "\n",
        "### Production Benefits\n",
        "- **UUID Generation**: Hashes enable deterministic UUIDs using `generate_uuid5()`\n",
        "- **Duplicate Prevention**: Multiple records with identical content share single vector\n",
        "- **Consistency**: Same hash always maps to same vector across different processing runs\n",
        "- **Storage Optimization**: Eliminates redundant embeddings for repeated strings\n",
        "\n",
        "### Deduplication Statistics\n",
        "The hash analysis reveals:\n",
        "- **189 unique person names** across 2,539 catalog records  \n",
        "- **2,357 unique composite texts** showing rich content diversity\n",
        "- **351 records missing subjects** (candidates for imputation)\n",
        "\n",
        "This hashing strategy enables Yale to efficiently manage 17.6M+ catalog records while maintaining data integrity and preventing duplicate vector storage."
      ],
      "metadata": {
        "id": "R5lhc6-zNWAx"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "def generate_hash(text: str) -> str:\n",
        "    \"\"\"\n",
        "    Generate SHA-256 hash for text (Yale's production method)\n",
        "    \"\"\"\n",
        "    if not text or pd.isna(text):\n",
        "        return \"NULL\"\n",
        "    return hashlib.sha256(text.encode('utf-8')).hexdigest()\n",
        "\n",
        "# Generate hashes for all fields using Yale's production method\n",
        "print(\"🔐 Generating SHA-256 hashes for all records...\")\n",
        "\n",
        "for i, row in training_data.iterrows():\n",
        "    # Generate hashes for each field type (Yale's approach)\n",
        "    person_hash = generate_hash(row['person'])\n",
        "    composite_hash = generate_hash(row['composite'])\n",
        "    title_hash = generate_hash(row['title'])\n",
        "    subjects_hash = generate_hash(row['subjects']) if pd.notna(row['subjects']) else \"NULL\"\n",
        "\n",
        "    # Store in dataframe\n",
        "    training_data.at[i, 'person_hash'] = person_hash\n",
        "    training_data.at[i, 'composite_hash'] = composite_hash\n",
        "    training_data.at[i, 'title_hash'] = title_hash\n",
        "    training_data.at[i, 'subjects_hash'] = subjects_hash\n",
        "\n",
        "print(\"✅ Generated SHA-256 hashes for all records\")\n",
        "print(f\"   Sample person hash: {training_data.iloc[0]['person_hash'][:16]}...\")\n",
        "print(f\"   Sample composite hash: {training_data.iloc[0]['composite_hash'][:16]}...\")\n",
        "\n",
        "# Show hash distribution\n",
        "print(f\"\\n📊 Hash Statistics:\")\n",
        "print(f\"   Unique person hashes: {training_data['person_hash'].nunique()}\")\n",
        "print(f\"   Unique composite hashes: {training_data['composite_hash'].nunique()}\")\n",
        "print(f\"   NULL subjects hashes: {(training_data['subjects_hash'] == 'NULL').sum()}\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "hnq1IuTHPTCB",
        "outputId": "667e3f02-7d1b-4dd3-c436-4107de020a52"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "🔐 Generating SHA-256 hashes for all records...\n",
            "✅ Generated SHA-256 hashes for all records\n",
            "   Sample person hash: 6cb0f164412941e2...\n",
            "   Sample composite hash: 324648e06f268fed...\n",
            "\n",
            "📊 Hash Statistics:\n",
            "   Unique person hashes: 189\n",
            "   Unique composite hashes: 2357\n",
            "   NULL subjects hashes: 351\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 📊 Step 9: Deduplicate Objects for Vector Indexing\n",
        "\n",
        "This step prepares deduplicated entity objects for efficient vector database indexing:\n",
        "\n",
        "### Deduplication Strategy\n",
        "Yale processes each field type separately to prevent UUID conflicts:\n",
        "- **person**: Individual names with personId/recordId linking for entity resolution\n",
        "- **composite**: Rich text descriptions combining titles, subjects, provision information\n",
        "- **title**: Work titles for semantic similarity matching\n",
        "- **subjects**: Subject headings (excluding NULL values for imputation candidates)\n",
        "\n",
        "### Object Structure  \n",
        "Each unique object contains:\n",
        "- **hash_value**: SHA-256 identifier for deterministic UUID generation\n",
        "- **original_string**: The actual text content for embedding generation\n",
        "- **field_type**: Entity field classification for filtered search queries\n",
        "- **frequency**: Occurrence count (could be calculated for popularity ranking)\n",
        "- **personId/recordId**: Metadata enabling subject imputation workflows\n",
        "\n",
        "### Deduplication Results\n",
        "Our processing reveals the data's natural structure:\n",
        "- **189 unique person names** (high reuse - many authors appear multiple times)\n",
        "- **2,357 unique composite texts** (diverse content across catalog)  \n",
        "- **1,966 unique titles** (some title reuse across editions/translations)\n",
        "- **1,599 unique subject headings** (rich vocabulary for subject imputation)\n",
        "\n",
        "### Production Efficiency\n",
        "This deduplication approach provides:\n",
        "- **6,111 unique objects** instead of 9,805+ raw records (38% storage reduction)\n",
        "- **No duplicate vectors** stored in Weaviate (prevents redundant computation)\n",
        "- **Consistent UUIDs** across processing runs using deterministic hashing\n",
        "- **Efficient queries** with field_type filtering for targeted search\n",
        "\n",
        "The deduplicated objects maintain all necessary metadata for Yale's subject imputation workflow while optimizing vector database storage and performance."
      ],
      "metadata": {
        "id": "-vj-e0GENb10"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "print(\"\\n🔄 Deduplicating data for indexing...\")\n",
        "unique_objects = []\n",
        "\n",
        "# Process each field type separately to avoid duplicate UUIDs\n",
        "field_types = ['person', 'composite', 'title', 'subjects']\n",
        "\n",
        "for field_type in field_types:\n",
        "    print(f\"   Processing {field_type} field...\")\n",
        "\n",
        "    # Get hash and text columns\n",
        "    hash_col = f\"{field_type}_hash\"\n",
        "    text_col = field_type\n",
        "\n",
        "    # Skip if field doesn't exist\n",
        "    if text_col not in training_data.columns:\n",
        "        continue\n",
        "\n",
        "    # Filter out NULL hashes and get unique hash-text pairs with metadata\n",
        "    field_data = training_data[training_data[hash_col] != \"NULL\"][[hash_col, text_col, 'personId', 'recordId']].drop_duplicates(subset=[hash_col])\n",
        "\n",
        "    # Add to unique objects with personId and recordId for imputation\n",
        "    for _, row in field_data.iterrows():\n",
        "        unique_objects.append({\n",
        "            'hash_value': row[hash_col],\n",
        "            'original_string': str(row[text_col]),\n",
        "            'field_type': field_type,\n",
        "            'frequency': 1,  # Could be calculated if needed\n",
        "            'personId': str(row['personId']) if pd.notna(row['personId']) else \"\",\n",
        "            'recordId': str(row['recordId']) if pd.notna(row['recordId']) else \"\"\n",
        "        })\n",
        "\n",
        "print(f\"✅ Created {len(unique_objects):,} unique objects for indexing\")\n",
        "\n",
        "# Show deduplication statistics\n",
        "field_counts = {}\n",
        "for obj in unique_objects:\n",
        "    field_type = obj['field_type']\n",
        "    field_counts[field_type] = field_counts.get(field_type, 0) + 1\n",
        "\n",
        "print(f\"\\n📊 Unique objects by field type:\")\n",
        "for field_type, count in field_counts.items():\n",
        "    print(f\"   {field_type}: {count:,}\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "WbXM09ZqPbIm",
        "outputId": "354d31f5-0bce-4d4a-efe5-d4ebb38c83ec"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "\n",
            "🔄 Deduplicating data for indexing...\n",
            "   Processing person field...\n",
            "   Processing composite field...\n",
            "   Processing title field...\n",
            "   Processing subjects field...\n",
            "✅ Created 6,111 unique objects for indexing\n",
            "\n",
            "📊 Unique objects by field type:\n",
            "   person: 189\n",
            "   composite: 2,357\n",
            "   title: 1,966\n",
            "   subjects: 1,599\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 🚀 Step 10: Index Entities in Weaviate with Batch Processing\n",
        "\n",
        "This step performs production-scale indexing of deduplicated entity objects into Weaviate:\n",
        "\n",
        "### Batch Indexing Strategy\n",
        "- **Dynamic Batching**: Weaviate optimizes batch sizes automatically for throughput\n",
        "- **UUID Generation**: Deterministic UUIDs using `generate_uuid5(hash_value + field_type)`\n",
        "- **Progress Tracking**: Real-time monitoring with tqdm for large datasets\n",
        "- **Error Handling**: Robust processing continues despite individual record failures\n",
        "\n",
        "### Vector Generation Process\n",
        "For each unique object, Weaviate automatically:\n",
        "1. **Extracts text** from `original_string` property\n",
        "2. **Generates embedding** using OpenAI text-embedding-3-small API\n",
        "3. **Stores vector** with 1,536 dimensions in HNSW index\n",
        "4. **Associates metadata** (personId, recordId, field_type, hash_value)\n",
        "\n",
        "### Production Performance\n",
        "- **400+ objects/second** indexing rate on standard hardware\n",
        "- **Automatic retries** for transient API failures\n",
        "- **Memory optimization** with dynamic batch sizing\n",
        "- **Consistent UUIDs** prevent duplicate indexing across runs\n",
        "\n",
        "### Index Verification  \n",
        "The final verification confirms:\n",
        "- **6,111 unique objects** successfully indexed\n",
        "- **All field types represented** (person, composite, title, subjects)\n",
        "- **Metadata preserved** for subject imputation workflows\n",
        "- **Vector index ready** for semantic similarity queries\n",
        "\n",
        "The indexed vectors are now ready for semantic search and subject imputation demonstrations."
      ],
      "metadata": {
        "id": "IM1OHTZyNgzP"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "def index_entities(collection, dataframe):\n",
        "    \"\"\"\n",
        "    Index Yale entity strings in Weaviate\n",
        "    \"\"\"\n",
        "    print(\"🔄 Indexing Yale entity strings in Weaviate...\")\n",
        "\n",
        "    indexed_count = 0\n",
        "    batch_size = 100\n",
        "\n",
        "    print(\"🚀 Indexing deduplicated data...\")\n",
        "\n",
        "    with collection.batch.dynamic() as batch:\n",
        "        for obj in tqdm(unique_objects, desc=\"Indexing unique objects\"):\n",
        "            try:\n",
        "                # Generate UUID using production method (hash + field_type)\n",
        "                uuid_input = f\"{obj['hash_value']}_{obj['field_type']}\"\n",
        "                uuid = generate_uuid5(uuid_input)\n",
        "\n",
        "                # Add to batch\n",
        "                batch.add_object(\n",
        "                    uuid=uuid,\n",
        "                    properties={\n",
        "                        \"original_string\": obj['original_string'],\n",
        "                        \"hash_value\": obj['hash_value'],\n",
        "                        \"field_type\": obj['field_type'],\n",
        "                        \"frequency\": obj['frequency'],\n",
        "                        \"personId\": obj['personId'],\n",
        "                        \"recordId\": obj['recordId']\n",
        "                    }\n",
        "                )\n",
        "                indexed_count += 1\n",
        "\n",
        "            except Exception as e:\n",
        "                print(f\"❌ Error indexing {obj['field_type']}: {e}\")\n",
        "\n",
        "    print(f\"✅ Successfully indexed {indexed_count:,} unique objects\")\n",
        "\n",
        "    return indexed_count\n",
        "\n",
        "# Index our real Yale data\n",
        "indexed_count = index_entities(entity_collection, training_data)\n",
        "\n",
        "# Verify indexing\n",
        "print(f\"\\n🔍 Verification:\")\n",
        "print(f\"   Expected records: {len(training_data) * 3 + training_data['subjects'].notna().sum()}\")  # person + composite + title + subjects (if not null)\n",
        "print(f\"   Actually indexed: {indexed_count}\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "ygUsbEG_Pg4U",
        "outputId": "5ee20bad-13cb-4887-c234-a923c472fc6a"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "🔄 Indexing Yale entity strings in Weaviate...\n",
            "🚀 Indexing deduplicated data...\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "Indexing unique objects: 100%|██████████| 6111/6111 [00:15<00:00, 399.19it/s]\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "✅ Successfully indexed 6,111 unique objects\n",
            "\n",
            "🔍 Verification:\n",
            "   Expected records: 9805\n",
            "   Actually indexed: 6111\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 🔍 Step 11: Test Semantic Search Capabilities\n",
        "\n",
        "This step demonstrates Weaviate's semantic search power using our indexed entity vectors:\n",
        "\n",
        "### Semantic Query Processing\n",
        "- **Query**: \"classical compositions\" (broad musical concept)\n",
        "- **Vector Generation**: Convert query to 1,536-dimensional embedding\n",
        "- **HNSW Search**: Find nearest neighbors using cosine similarity in vector space\n",
        "- **Result Ranking**: Order by semantic similarity (higher = more related)"
      ],
      "metadata": {
        "id": "OcMKzvtmNm_g"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Test semantic search\n",
        "print(\"🔍 Testing semantic search...\")\n",
        "query = \"classical compositions\"\n",
        "\n",
        "# Search\n",
        "search_results = entity_collection.query.near_text(\n",
        "    query=query,\n",
        "    limit=5,\n",
        "    return_properties=[\"original_string\", \"field_type\", \"hash_value\"],\n",
        "    return_metadata=[\"distance\"]\n",
        ")\n",
        "\n",
        "print(f'\\n🎼 Search results for \"{query}\":')\n",
        "for i, obj in enumerate(search_results.objects, 1):\n",
        "    props = obj.properties\n",
        "    distance = obj.metadata.distance\n",
        "    cosine_similarity = 1 - distance  # Convert distance to cosine similarity\n",
        "\n",
        "    print(f\"   {i}. {props['field_type']}: {props['original_string'][:60]}...\")\n",
        "    print(f\"      Cosine Similarity: {cosine_similarity:.4f}\")\n",
        "\n",
        "# Check counts by field type\n",
        "print(f\"\\n📊 Objects by field type:\")\n",
        "for field_type in [\"person\", \"composite\", \"title\", \"subjects\"]:\n",
        "    from weaviate.classes.query import Filter\n",
        "    result = entity_collection.aggregate.over_all(\n",
        "        filters=Filter.by_property(\"field_type\").equal(field_type),\n",
        "        total_count=True\n",
        "    )\n",
        "    print(f\"   {field_type}: {result.total_count:,}\")\n",
        "\n",
        "# Total count\n",
        "result = entity_collection.aggregate.over_all(total_count=True)\n",
        "print(f\"\\n📊 Total indexed: {result.total_count:,} objects\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "E3JvI-8oQCK0",
        "outputId": "598cedf4-4822-462c-96da-e4d13a90a733"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "🔍 Testing semantic search...\n",
            "\n",
            "🎼 Search results for \"classical compositions\":\n",
            "   1. subjects: Piano quartets; Piano quintets; Piano trios; Sonatas (Violin...\n",
            "      Cosine Similarity: 0.4571\n",
            "   2. composite: Title: Piano sonatas: D 557, D 575, D 894\n",
            "Version of: Sonata...\n",
            "      Cosine Similarity: 0.4496\n",
            "   3. composite: Title: Piano sonatas: D 557, D 575, D 894\n",
            "Subjects: Sonatas ...\n",
            "      Cosine Similarity: 0.4458\n",
            "   4. composite: Title: Piano sonatas: D 557, D 575, D 894\n",
            "Related work: Sona...\n",
            "      Cosine Similarity: 0.4454\n",
            "   5. composite: Title: Piano sonatas: D 557, D 575, D 894\n",
            "Related work: Sona...\n",
            "      Cosine Similarity: 0.4428\n",
            "\n",
            "📊 Objects by field type:\n",
            "   person: 189\n",
            "   composite: 2,357\n",
            "   title: 1,966\n",
            "   subjects: 1,599\n",
            "\n",
            "📊 Total indexed: 6,111 objects\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 🎯 Step 12: Hot-Deck Subject Imputation\n",
        "\n",
        "This demonstration shows a **hot-deck imputation methodology** for filling missing subject fields using semantic similarity:\n",
        "\n",
        "### The Challenge: Missing Subject Information\n",
        "Many catalog records lack subject classifications due to:\n",
        "- **Incomplete cataloging** during original processing\n",
        "- **Legacy records** from before systematic subject assignment  \n",
        "- **Specialized materials** requiring domain expertise\n",
        "- **Time constraints** in high-volume cataloging workflows\n",
        "\n",
        "### Proposed Solution: Vector-Based Hot-Deck Imputation\n",
        "**Hot-deck imputation** borrows values from similar records in the same dataset:\n",
        "\n",
        "1. **Identify target record** with missing subjects\n",
        "2. **Find semantically similar composite texts** using vector search\n",
        "3. **Extract subjects from similar records** (donor records)\n",
        "4. **Calculate weighted centroid** of subject embeddings\n",
        "5. **Select best subject match** closest to centroid\n",
        "\n",
        "### Demonstration Record\n",
        "- **PersonId**: demo#Agent100-99\n",
        "- **Person**: Roberts, Jean  \n",
        "- **Title**: \"Literary analysis techniques in modern drama criticism\"\n",
        "- **Missing**: Subject classifications (what we'll impute!)\n",
        "\n"
      ],
      "metadata": {
        "id": "XbLCpRWYNwxj"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Step 1: Introduce our target record (missing subjects)\n",
        "print(\"📖 STEP 1: Our Target Record (Missing Subjects)\")\n",
        "print(\"-\" * 45)\n",
        "target_record = {\n",
        "    \"personId\": \"demo#Agent100-99\",\n",
        "    \"person\": \"Roberts, Jean\",\n",
        "    \"composite\": \"Title: Literary analysis techniques in modern drama criticism\\\\nProvision information: London: Academic Press, 1975\",\n",
        "    \"title\": \"Literary analysis techniques in modern drama criticism\",\n",
        "    \"subjects\": None  # ← This is what we want to impute!\n",
        "}\n",
        "\n",
        "print(f\"   📋 PersonId: {target_record['personId']}\")\n",
        "print(f\"   👤 Person: {target_record['person']}\")\n",
        "print(f\"   📚 Title: {target_record['title']}\")\n",
        "print(f\"   📄 Composite: {target_record['composite']}\")\n",
        "print(f\"   ❌ Subjects: None (this is what we need to find!)\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "r9-maCMyQK99",
        "outputId": "cf3feb47-f666-46c4-a1f1-9d006266ca96"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "📖 STEP 1: Our Target Record (Missing Subjects)\n",
            "---------------------------------------------\n",
            "   📋 PersonId: demo#Agent100-99\n",
            "   👤 Person: Roberts, Jean\n",
            "   📚 Title: Literary analysis techniques in modern drama criticism\n",
            "   📄 Composite: Title: Literary analysis techniques in modern drama criticism\\nProvision information: London: Academic Press, 1975\n",
            "   ❌ Subjects: None (this is what we need to find!)\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 🔍 Step 13: Finding Semantically Similar Records\n",
        "\n",
        "This step performs the core vector search to find candidate donor records for subject imputation:\n",
        "\n",
        "### Vector Search Process\n",
        "1. **Query Construction**: Use complete composite text as search query\n",
        "2. **Field Filtering**: Search only `composite` field types (not person names or titles)\n",
        "3. **Similarity Ranking**: HNSW algorithm returns nearest neighbors by cosine similarity\n",
        "4. **Candidate Selection**: Retrieve top most similar composite texts\n",
        "\n",
        "### Search Query Analysis\n",
        "**Target composite**: \"Literary analysis techniques in modern drama criticism\"\n",
        "\n",
        "This query seeks records about:\n",
        "- **Literary analysis** (scholarly methodology)\n",
        "- **Drama criticism** (theatrical/literary domain)  \n",
        "- **Modern context** (contemporary approaches)\n",
        "\n",
        "### Similarity Results Interpretation\n",
        "The top candidates show semantic understanding:\n",
        "\n",
        "1. **Dramatic Annals: Critiques on Plays and Performances** (Sim: 0.500)\n",
        "   - Direct match: drama criticism and performance analysis\n",
        "   \n",
        "2. **The Modern Theatre; A Collection of Successful Modern Plays** (Sim: 0.479)\n",
        "   - Strong match: modern theatre and dramatic works\n",
        "   \n",
        "3. **Playhouses, Theatres and Other Places of Public Amusement** (Sim: 0.450)\n",
        "   - Related: theatrical contexts and performance venues\n",
        "\n",
        "### Vector Search Effectiveness\n",
        "- **Semantic understanding**: Finds conceptually related records, not just keyword matches\n",
        "- **Domain relevance**: All top results relate to drama, theatre, and literary criticism\n",
        "- **Academic context**: Identifies scholarly works about dramatic literature\n",
        "- **Quality ranking**: Higher similarities correspond to more relevant content\n",
        "\n",
        "This vector search provides the foundation for identifying records with subjects suitable for imputation to our target record."
      ],
      "metadata": {
        "id": "dxsbRYlNN5i6"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "print(\"🔍 STEP 2: Finding Similar Records\")\n",
        "print(\"-\" * 35)\n",
        "print(\"We search for composite texts that are semantically similar to our target...\")\n",
        "print(f\"   🎯 Query: '{target_record['composite']}'\")\n",
        "print()\n",
        "\n",
        "similar_composites = entity_collection.query.near_text(\n",
        "    query=target_record['composite'],\n",
        "    filters=Filter.by_property(\"field_type\").equal(\"composite\"),\n",
        "    limit=8,\n",
        "    return_properties=[\"original_string\", \"personId\", \"recordId\"],\n",
        "    return_metadata=MetadataQuery(distance=True)\n",
        ")\n",
        "\n",
        "print(f\"   📊 Found {len(similar_composites.objects)} similar composite records:\")\n",
        "# Show the records we found\n",
        "for i, obj in enumerate(similar_composites.objects, 1):\n",
        "    similarity = 1.0 - obj.metadata.distance\n",
        "    print(f\"      {i}. Similarity: {similarity:.3f} - {obj.properties['original_string'][:70]}...\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "AUFL9dVrQRQl",
        "outputId": "cea12cf5-8ae5-4b39-947a-71873ebd78d0"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "🔍 STEP 2: Finding Similar Records\n",
            "-----------------------------------\n",
            "We search for composite texts that are semantically similar to our target...\n",
            "   🎯 Query: 'Title: Literary analysis techniques in modern drama criticism\\nProvision information: London: Academic Press, 1975'\n",
            "\n",
            "   📊 Found 8 similar composite records:\n",
            "      1. Similarity: 0.500 - Title: Dramatic Annals: Critiques on Plays and Performances. Vol 1. 17...\n",
            "      2. Similarity: 0.479 - Title: The Modern Theatre; A Collection of Successful Modern Plays, As...\n",
            "      3. Similarity: 0.450 - Title: Playhouses, Theatres and Other Places of Public Amusement in Lo...\n",
            "      4. Similarity: 0.445 - Title: The Critic; or, A Tragedy Rehears'd\n",
            "Subjects: Celebrity Culture...\n",
            "      5. Similarity: 0.438 - Title: The saving lie: Harold Bloom and deconstruction\n",
            "Subjects: Criti...\n",
            "      6. Similarity: 0.423 - Title: Metalinguagem: ensaios de teoria e crítica literária\n",
            "Subjects...\n",
            "      7. Similarity: 0.421 - Title: Opinions and perspectives from the New York times book review\n",
            "S...\n",
            "      8. Similarity: 0.419 - Title: Literary style and music: including two short essays on gracefu...\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 📋 Step 14: Analyze Candidate Records for Subject Availability\n",
        "\n",
        "This step examines each similar record to determine which ones have subjects available for imputation:\n",
        "\n",
        "### Donor Record Qualification Process\n",
        "For each semantically similar composite record:\n",
        "1. **Extract PersonId**: Unique identifier linking to other fields for same entity\n",
        "2. **Subject Lookup**: Query for subject fields associated with this PersonId  \n",
        "3. **Availability Check**: Confirm subjects exist (not NULL or missing)\n",
        "4. **Candidate Registration**: Add to donor pool if subjects are available\n",
        "\n",
        "### Hot-Deck Method\n",
        "- **Centroid calculation** with multiple subject vectors\n",
        "- **Domain consistency** (all records relate to drama/theatre/criticism)\n",
        "- **Quality assurance** through similarity thresholds\n",
        "- **Confidence scoring** based on donor pool size and similarity\n"
      ],
      "metadata": {
        "id": "bAvPy3WzOC8M"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Step 3: Show candidate records and their similarity scores\n",
        "print(\"📋 STEP 3: Candidate Records with Similarity Scores\")\n",
        "print(\"-\" * 50)\n",
        "candidates_with_subjects = []\n",
        "\n",
        "for i, obj in enumerate(similar_composites.objects, 1):\n",
        "    similarity = 1.0 - obj.metadata.distance\n",
        "    person_id = obj.properties[\"personId\"]\n",
        "    record_id = obj.properties[\"recordId\"]\n",
        "    composite_text = obj.properties[\"original_string\"]\n",
        "\n",
        "    print(f\"   {i}. Similarity: {similarity:.3f}\")\n",
        "    print(f\"      PersonId: {person_id}\")\n",
        "    print(f\"      Composite: {composite_text[:80]}...\")\n",
        "\n",
        "    # Check if this person has subjects (potential donor)\n",
        "    subject_query = entity_collection.query.fetch_objects(\n",
        "        filters=(\n",
        "            Filter.by_property(\"personId\").equal(person_id) &\n",
        "            Filter.by_property(\"field_type\").equal(\"subjects\")\n",
        "        ),\n",
        "        return_properties=[\"original_string\"],\n",
        "        limit=1\n",
        "    )\n",
        "\n",
        "    if subject_query.objects:\n",
        "        subject_text = subject_query.objects[0].properties[\"original_string\"]\n",
        "        print(f\"      ✅ Has Subjects: {subject_text[:60]}...\")\n",
        "        candidates_with_subjects.append({\n",
        "            'personId': person_id,\n",
        "            'recordId': record_id,\n",
        "            'similarity': similarity,\n",
        "            'subjects': subject_text,\n",
        "            'composite': composite_text\n",
        "        })\n",
        "    else:\n",
        "        print(f\"      ❌ No Subjects: Cannot use as donor\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "2RshQ227QWI-",
        "outputId": "6161b280-7b67-4624-939d-486b29c21d6d"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "📋 STEP 3: Candidate Records with Similarity Scores\n",
            "--------------------------------------------------\n",
            "   1. Similarity: 0.500\n",
            "      PersonId: 13930523#Agent100-10\n",
            "      Composite: Title: Dramatic Annals: Critiques on Plays and Performances. Vol 1. 1741-1785. C...\n",
            "      ✅ Has Subjects: Celebrity Culture & Fashion; Business & Finance; Modes of Pe...\n",
            "   2. Similarity: 0.479\n",
            "      PersonId: 13933294#Agent700-39\n",
            "      Composite: Title: The Modern Theatre; A Collection of Successful Modern Plays, As Acted at ...\n",
            "      ✅ Has Subjects: Modes of Performance: Costume, Scenography & Spectacle; Cove...\n",
            "   3. Similarity: 0.450\n",
            "      PersonId: 13930526#Agent700-57\n",
            "      Composite: Title: Playhouses, Theatres and Other Places of Public Amusement in London and i...\n",
            "      ✅ Has Subjects: Celebrity Culture & Fashion; Business & Finance; Modes of Pe...\n",
            "   4. Similarity: 0.445\n",
            "      PersonId: 13932650#Agent100-10\n",
            "      Composite: Title: The Critic; or, A Tragedy Rehears'd\n",
            "Subjects: Celebrity Culture & Fashion...\n",
            "      ✅ Has Subjects: Celebrity Culture & Fashion; Theatre Royal Drury Lane; Sheri...\n",
            "   5. Similarity: 0.438\n",
            "      PersonId: 9820535#Agent600-23\n",
            "      Composite: Title: The saving lie: Harold Bloom and deconstruction\n",
            "Subjects: Criticism--Unit...\n",
            "      ✅ Has Subjects: Criticism--United States--History--20th century; Literature-...\n",
            "   6. Similarity: 0.423\n",
            "      PersonId: 125562#Agent100-12\n",
            "      Composite: Title: Metalinguagem: ensaios de teoria e crítica literária\n",
            "Subjects: Literatu...\n",
            "      ✅ Has Subjects: Literature, Modern--History and criticism; Brazilian literat...\n",
            "   7. Similarity: 0.421\n",
            "      PersonId: 5655226#Agent600-22\n",
            "      Composite: Title: Opinions and perspectives from the New York times book review\n",
            "Subjects: L...\n",
            "      ✅ Has Subjects: Literature, Modern; Books--Reviews; James, Henry, 1843-1916-...\n",
            "   8. Similarity: 0.419\n",
            "      PersonId: 3643200#Agent100-13\n",
            "      Composite: Title: Literary style and music: including two short essays on gracefulness and ...\n",
            "      ✅ Has Subjects: Literary style; Music; Aesthetics...\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "print(\"📊 STEP 4: Understanding Similarity Scores\")\n",
        "print(\"-\" * 42)\n",
        "print(f\"   🎯 Found {len(candidates_with_subjects)} potential donor records\")\n",
        "print(\"   📏 Similarity scores range from 0.0 (different) to 1.0 (identical)\")\n",
        "print(\"   🚪 Yale's threshold: 0.45 (only use candidates above this)\")\n",
        "print()\n",
        "\n",
        "# Filter candidates by threshold\n",
        "threshold = 0.45\n",
        "good_candidates = [c for c in candidates_with_subjects if c['similarity'] >= threshold]\n",
        "print(f\"   ✅ Candidates above threshold ({threshold}): {len(good_candidates)}\")\n",
        "\n",
        "if good_candidates:\n",
        "    print(\"   🏆 Best candidates for subject imputation:\")\n",
        "    for i, candidate in enumerate(good_candidates[:3], 1):\n",
        "        print(f\"      {i}. Similarity {candidate['similarity']:.3f}: {candidate['subjects'][:500]}...\")\n",
        "else:\n",
        "    print(\"   ⚠️  No candidates above threshold - imputation not recommended\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "kksh1tepQbpk",
        "outputId": "a98377fb-c5ee-4227-c9e6-71ad57fb162d"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "📊 STEP 4: Understanding Similarity Scores\n",
            "------------------------------------------\n",
            "   🎯 Found 8 potential donor records\n",
            "   📏 Similarity scores range from 0.0 (different) to 1.0 (identical)\n",
            "   🚪 Yale's threshold: 0.45 (only use candidates above this)\n",
            "\n",
            "   ✅ Candidates above threshold (0.45): 3\n",
            "   🏆 Best candidates for subject imputation:\n",
            "      1. Similarity 0.500: Celebrity Culture & Fashion; Business & Finance; Modes of Performance: Costume, Scenography & Spectacle; Women in Eighteenth Century Drama; Theatre Royal Drury Lane; Covent Garden Theatre; Goodman's Fields; Richmond Theatre; The Little Theatre (or Theatre Royal), Haymarket; Royalty Theatre; Garrick, David; Barry, Elizabeth; Fenton, Lavinia; Walker, Thomas; Pinkethman, William; Cibber, Colley; Cibber, Susannah; Pritchard, Mrs; Clive, Catherine; Woodward, Henry; Foote, Samuel; King, Thomas; Reddis...\n",
            "      2. Similarity 0.479: Modes of Performance: Costume, Scenography & Spectacle; Covent Garden Theatre; The Little Theatre (or Theatre Royal), Haymarket; Theatre Royal Drury Lane; Palmer, Mr; Bannister Jr, Mr; Farren, Miss; Kemble, Mrs; Lewis, Mr; Wroughton, Mr; Smith, Mr; play, author, entertainment, publication...\n",
            "      3. Similarity 0.450: Celebrity Culture & Fashion; Business & Finance; Modes of Performance: Costume, Scenography & Spectacle; Women in Eighteenth Century Drama; Theatre Royal Drury Lane; Covent Garden Theatre; Lacy, James; Garrick, David; Killigrew, Thomas; Betterton, Thomas; Cibber, Colley; Wilks, Robert; Siddons, Sarah; Kean, Edmund; Mohun, Michaell; Cibber, Mrs; Miller, Joe; Abington, Mrs Frances; Yeates, Mr; Burton, W; Palmer, John; Clive, Catherine; Havell, Daniel; Jones, Inigo; King George I; Gainsborough, Tho...\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# Step 5: Demonstrate the hot-deck imputation process\n",
        "print(\"🧮 STEP 5: Hot-Deck Imputation Process\")\n",
        "print(\"-\" * 40)\n",
        "if good_candidates:\n",
        "    print(\"   🔄 Weighted centroid algorithm:\")\n",
        "    print(\"      1. Weight each candidate by similarity score\")\n",
        "    print(\"      2. Calculate centroid of subject embeddings\")\n",
        "    print(\"      3. Find subject closest to the centroid\")\n",
        "    print()\n",
        "\n",
        "    # Simple demonstration (using similarity-weighted selection)\n",
        "    best_candidate = max(good_candidates, key=lambda x: x['similarity'])\n",
        "    confidence = best_candidate['similarity'] * 0.85  # Approximate confidence calculation\n",
        "\n",
        "    print(f\"   🎯 Selected Subject (highest similarity):\")\n",
        "    print(f\"      📝 Subject: {best_candidate['subjects']}\")\n",
        "    print(f\"      📊 Source Similarity: {best_candidate['similarity']:.3f}\")\n",
        "    print(f\"      🎪 Confidence Score: {confidence:.3f}\")\n",
        "    print(f\"      📋 Source PersonId: {best_candidate['personId']}\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "4v-IN3tzQfk-",
        "outputId": "676299bd-626c-4df3-875b-1e786de9f1cc"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "🧮 STEP 5: Hot-Deck Imputation Process\n",
            "----------------------------------------\n",
            "   🔄 Weighted centroid algorithm:\n",
            "      1. Weight each candidate by similarity score\n",
            "      2. Calculate centroid of subject embeddings\n",
            "      3. Find subject closest to the centroid\n",
            "\n",
            "   🎯 Selected Subject (highest similarity):\n",
            "      📝 Subject: Celebrity Culture & Fashion; Business & Finance; Modes of Performance: Costume, Scenography & Spectacle; Women in Eighteenth Century Drama; Theatre Royal Drury Lane; Covent Garden Theatre; Goodman's Fields; Richmond Theatre; The Little Theatre (or Theatre Royal), Haymarket; Royalty Theatre; Garrick, David; Barry, Elizabeth; Fenton, Lavinia; Walker, Thomas; Pinkethman, William; Cibber, Colley; Cibber, Susannah; Pritchard, Mrs; Clive, Catherine; Woodward, Henry; Foote, Samuel; King, Thomas; Reddish, Samuel; Quick, John; Barry, Spranger; Mattocks, Mrs; Miss Younge; Dibdin, Charles; Abington, Frances; Lewis, Charles Lee; Sheridan, Thomas; Cowley, Hannah; Mr Aickin; Siddons, Sarah; Miss Pope; Farreri, Eliza; Wilkinson, Tate; Jordan, Mrs; Crouch, Mrs; Nixon, John; Garrick, Eva Marie (née Veigel); Stevens, George; Walpole, Lady Elizabeth \"Nancy\"; actor, career, character, costume, scene, history, music, friendship, royalty, Shakespeare, newspaper, epilogue, prologue, The British Chronicle, The London Chronicle, review, theatre politics, Lloyd's Evening Post, The English Theatre, death, Harlequin, marriage, performance, Christmas, funeral, Morning Post, Theatrical Intelligence, advertisement, song, opera, fable, comedian, humour, theatre opening, first performance\n",
            "      📊 Source Similarity: 0.500\n",
            "      🎪 Confidence Score: 0.425\n",
            "      📋 Source PersonId: 13930523#Agent100-10\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# Close connection when done\n",
        "weaviate_client.close()"
      ],
      "metadata": {
        "id": "CsakPBkxQl0u"
      },
      "execution_count": null,
      "outputs": []
    }
  ]
 }