Semantic centroids

Formal Definition: Hybrid Semantic-Dimensional Similarity

1. Preliminaries: The Embedding Space

Let $\mathcal{T}$ be the set of all possible text inputs. We define a pre-trained embedding function $E$ that maps text to a $d$-dimensional vector space $\mathbb{R}^d$.

$$ E: \mathcal{T} \to \mathbb{R}^d $$

For any text input $x \in \mathcal{T}$, its vector representation is $\mathbf{v}_x = E(x)$. We assume the standard measure of semantic distance in this space is Cosine Similarity.

Definition 1 (Semantic Similarity). Given two text vectors $\mathbf{v}_A, \mathbf{v}_B \in \mathbb{R}^d$, the semantic similarity $S_{sem}$ is defined as the cosine of the angle between them:

$$ S_{sem}(\mathbf{v}_A, \mathbf{v}_B) = \frac{\mathbf{v}_A \cdot \mathbf{v}_B}{\|\mathbf{v}_A\| \, \|\mathbf{v}_B\|} $$
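For reference, a minimal NumPy sketch of Definition 1. The embedding function $E$ is assumed to be any pre-trained model that returns a $d$-dimensional vector; only the similarity itself is shown here:

```python
import numpy as np

def cosine_similarity(v_a: np.ndarray, v_b: np.ndarray) -> float:
    """S_sem: cosine of the angle between two embedding vectors."""
    return float(v_a @ v_b / (np.linalg.norm(v_a) * np.linalg.norm(v_b)))
```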

2. Construction of Dimensional Centroids

We wish to bias the generic space $\mathbb{R}^d$ towards a specific taxonomy consisting of $n$ distinct classes (e.g., Accessibility, Automation, UX). Let $D = \{D_1, D_2, \dots, D_n\}$ represent these dimensions.

For each dimension $D_j$, we curate a set of representative text examples $X_j = \{x_1, x_2, \dots, x_m\}$.

Definition 2 (Dimensional Centroid). The centroid $\mathbf{c}_j$ for dimension $D_j$ is the arithmetic mean of the normalized embeddings of its examples.

$$ \mathbf{c}_j = \frac{1}{|X_j|} \sum_{x \in X_j} \frac{E(x)}{\|E(x)\|} $$

To ensure consistent projection magnitude, we normalize the resulting centroid vector to the unit hypersphere:

$$ \hat{\mathbf{c}}_j = \frac{\mathbf{c}_j}{\|\mathbf{c}_j\|} $$
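A minimal sketch of Definition 2 plus the normalization step, assuming `example_embeddings` is an $(m, d)$ array stacking $E(x)$ for every $x \in X_j$ (the function name is an illustrative choice):

```python
import numpy as np

def build_centroid(example_embeddings: np.ndarray) -> np.ndarray:
    """Definition 2: average the unit-normalized example embeddings for one
    dimension D_j, then re-normalize onto the unit hypersphere."""
    unit = example_embeddings / np.linalg.norm(example_embeddings, axis=1, keepdims=True)
    c_j = unit.mean(axis=0)
    return c_j / np.linalg.norm(c_j)
```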

We now possess a set of $n$ reference vectors $\mathcal{C} = \{\hat{\mathbf{c}}_1, \dots, \hat{\mathbf{c}}_n\}$ defining the basis of your specific taxonomy.

3. Dimensional Profiling (Basis Transformation)

We define a transformation function $\Phi$ that maps a vector from the generic embedding space $\mathbb{R}^d$ to a lower-dimensional profile space $\mathbb{R}^n$, where $n$ is the number of Katalon dimensions.

Definition 3 (Dimensional Profile Vector). For an arbitrary query vector $\mathbf{v}_q$, the dimensional profile $\mathbf{p}_q \in \mathbb{R}^n$ is constructed by projecting $\mathbf{v}_q$ onto each centroid in $\mathcal{C}$. The $j$-th component of $\mathbf{p}_q$ is given by:

$$ \mathbf{p}_q[j] = S_{sem}(\mathbf{v}_q, \hat{\mathbf{c}}_j) $$

Thus, the profile vector is:

$$ \mathbf{p}_q = \left[ S_{sem}(\mathbf{v}_q, \hat{\mathbf{c}}_1), \ S_{sem}(\mathbf{v}_q, \hat{\mathbf{c}}_2), \ \dots, \ S_{sem}(\mathbf{v}_q, \hat{\mathbf{c}}_n) \right]^T $$

This vector $\mathbf{p}_q$ describes the "shape" of the query relative only to your specific taxonomy, stripping away generic noise.
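In practice $\Phi$ reduces to a single matrix-vector product once the centroids are stacked into an $(n, d)$ matrix. A sketch (the `profile` name and the stacked-matrix layout are illustrative choices, not part of the definition):

```python
import numpy as np

def profile(v_q: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Phi: map a d-dimensional vector to its n-dimensional profile.
    `centroids` is an (n, d) matrix whose rows are the unit vectors
    c_hat_j, so one matrix-vector product yields all n components.
    Normalizing v_q first makes each dot product equal S_sem."""
    v_hat = v_q / np.linalg.norm(v_q)
    return centroids @ v_hat
```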

4. The Hybrid Similarity Metric

To compare a query $Q$ and a target document $T$, we calculate similarity in both the generic space ($\mathbb{R}^d$) and the profile space ($\mathbb{R}^n$).

Definition 4 (Profile Similarity). The similarity between the dimensional profiles of $Q$ and $T$ is defined as:

$$ S_{dim}(Q, T) = S_{sem}(\mathbf{p}_Q, \mathbf{p}_T) = \frac{\mathbf{p}_Q \cdot \mathbf{p}_T}{\|\mathbf{p}_Q\| \, \|\mathbf{p}_T\|} $$

Note: We use Cosine Similarity here rather than Euclidean distance because we care about the relative distribution of dimensions (e.g., "Is this 80% UX and 20% Accessibility?") rather than the absolute magnitude of the match.

Definition 5 (Hybrid Similarity Score). The final similarity function $S_{hybrid}$ is a convex combination of the semantic and dimensional similarities, controlled by a hyperparameter $\lambda \in [0, 1]$.

$$ S_{hybrid}(Q, T) = (1 - \lambda) \cdot S_{sem}(\mathbf{v}_Q, \mathbf{v}_T) + \lambda \cdot S_{dim}(Q, T) $$

Where:

  • $S_{sem}$ captures general linguistic closeness (synonyms, context).
  • $S_{dim}$ captures taxonomic structural alignment.
  • $\lambda$ dictates the influence of the Katalon-specific taxonomy.
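Putting the pieces together, a sketch of Definition 5 that reuses the `cosine_similarity` and `profile` helpers above:

```python
import numpy as np

def hybrid_similarity(v_q: np.ndarray, v_t: np.ndarray,
                      centroids: np.ndarray, lam: float) -> float:
    """Definition 5: convex combination of generic and taxonomic similarity.
    lam = 0 recovers plain cosine similarity; lam = 1 ranks purely on
    alignment with the taxonomy's dimensional profiles."""
    s_sem = cosine_similarity(v_q, v_t)
    s_dim = cosine_similarity(profile(v_q, centroids), profile(v_t, centroids))
    return (1.0 - lam) * s_sem + lam * s_dim
```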

Summary of Algorithm

For a system with $n$ defined dimensions and hyperparameter $\lambda$:

  1. Offline Phase: Compute normalized centroids $\hat{\mathbf{c}}_1 \dots \hat{\mathbf{c}}_n$.
  2. Online Phase (Input $Q, T$):
    • Compute embeddings $\mathbf{v}_Q, \mathbf{v}_T$.
    • Compute profiles: $$\mathbf{p}_Q = \Phi(\mathbf{v}_Q) = [\mathbf{v}_Q \cdot \hat{\mathbf{c}}_1, \dots, \mathbf{v}_Q \cdot \hat{\mathbf{c}}_n]$$ $$\mathbf{p}_T = \Phi(\mathbf{v}_T) = [\mathbf{v}_T \cdot \hat{\mathbf{c}}_1, \dots, \mathbf{v}_T \cdot \hat{\mathbf{c}}_n]$$ (Note: the plain dot product equals $S_{sem}$ here because the centroids $\hat{\mathbf{c}}_j$ are unit vectors; $\mathbf{v}_Q$ and $\mathbf{v}_T$ must likewise be normalized first.)
    • Calculate Final Score: $$Score = (1-\lambda)\frac{\mathbf{v}_Q \cdot \mathbf{v}_T}{\|\mathbf{v}_Q\|\|\mathbf{v}_T\|} + \lambda \frac{\mathbf{p}_Q \cdot \mathbf{p}_T}{\|\mathbf{p}_Q\|\|\mathbf{p}_T\|}$$
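End to end, the two phases look like this in the same sketch notation. `embed` stands in for the pre-trained model $E$, `example_sets[j]` for the curated texts $X_j$, and `query`/`target` for the inputs; all four names are hypothetical placeholders, and $\lambda = 0.3$ is an arbitrary illustrative value to be tuned on labeled data:

```python
import numpy as np

# Offline phase: one unit centroid per taxonomy dimension D_j.
centroids = np.stack([
    build_centroid(np.stack([embed(x) for x in examples]))
    for examples in example_sets
])

# Online phase: score a query/target pair.
score = hybrid_similarity(embed(query), embed(target), centroids, lam=0.3)
```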