Let $\mathcal{T}$ be the set of all possible text inputs. We define a pre-trained embedding function $E$ that maps text to a $d$-dimensional vector space $\mathbb{R}^d$.
$$
E: \mathcal{T} \to \mathbb{R}^d
$$
For any text input $x \in \mathcal{T}$, its vector representation is $\mathbf{v}_x = E(x)$. We assume that semantic closeness in this space is measured by cosine similarity.
Definition 1 (Semantic Similarity). Given two nonzero text vectors $\mathbf{v}_A, \mathbf{v}_B \in \mathbb{R}^d$, the semantic similarity $S_{sem}$ is defined as the cosine of the angle between them:
$$
S_{sem}(\mathbf{v}_A, \mathbf{v}_B) = \frac{\mathbf{v}_A \cdot \mathbf{v}_B}{\|\mathbf{v}_A\| \, \|\mathbf{v}_B\|}
$$
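As a minimal sketch of Definition 1, the function below computes the cosine of the angle between two vectors using only the Python standard library. The name `cosine_similarity` and the choice to raise an error on zero vectors are illustrative assumptions, not part of the original specification.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b (S_sem in Definition 1)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Assumption: the zero vector has no direction, so we reject it.
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Parallel vectors score 1, orthogonal vectors score 0, and opposite vectors score -1, regardless of their lengths.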
We wish to bias the generic space $\mathbb{R}^d$ towards a specific taxonomy $K$ consisting of $n$ distinct classes (e.g., Accessibility, Automation, UX). Let $D = \{D_1, D_2, \dots, D_n\}$ represent these dimensions.
For each dimension $D_j$, we curate a set of representative text examples $X_j = \{x_1, x_2, \dots, x_m\}$.
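Definition 2 (Dimensional Centroid). The centroid $\mathbf{c}_j$ for dimension $D_j$ is the arithmetic mean of the normalized embeddings of its examples:
$$
\mathbf{c}_j = \frac{1}{m} \sum_{i=1}^{m} \frac{E(x_i)}{\|E(x_i)\|}, \quad x_i \in X_j
$$
The centroid is then rescaled to unit length, $\hat{\mathbf{c}}_j = \mathbf{c}_j / \|\mathbf{c}_j\|$.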
We now possess a set of $n$ reference vectors $\mathcal{C} = \{\hat{\mathbf{c}}_1, \dots, \hat{\mathbf{c}}_n\}$, one per dimension, that serve as the basis of the taxonomy. These vectors are generally not orthogonal, so "basis" is used loosely here.
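The centroid construction can be sketched as follows, assuming the example embeddings have already been computed by $E$ and are passed in as plain lists of floats. The helper names `normalize` and `centroid` are hypothetical, and the final rescaling to unit length is an assumption inferred from the hat notation $\hat{\mathbf{c}}_j$.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Rescale v to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def centroid(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Mean of the unit-normalized embeddings, rescaled to unit length."""
    if not embeddings:
        raise ValueError("at least one example embedding is required")
    units = [normalize(e) for e in embeddings]
    m = len(units)
    # Average each coordinate across the m normalized examples.
    mean = [sum(column) / m for column in zip(*units)]
    # Assumption: the reference vector is the normalized centroid.
    return normalize(mean)


# Toy example with d = 2: two examples of different lengths.
c_hat = centroid([[2.0, 0.0], [0.0, 3.0]])
```

Normalizing each example before averaging means a long embedding does not dominate the centroid; in the toy example both examples contribute equally, so `c_hat` points along the diagonal.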
3. Dimensional Profiling (Basis Transformation)
We define a transformation function $\Phi$ that maps a vector from the generic embedding space $\mathbb{R}^d$ to a lower-dimensional profile space $\mathbb{R}^n$, where $n$ is the number of Katalon dimensions.
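Definition 3 (Dimensional Profile Vector). For an arbitrary query vector $\mathbf{v}_q$, the dimensional profile $\mathbf{p}_q = \Phi(\mathbf{v}_q) \in \mathbb{R}^n$ is constructed by projecting $\mathbf{v}_q$ onto each centroid in $\mathcal{C}$. The $j$-th component of $\mathbf{p}_q$ is given by:
$$
p_{q,j} = S_{sem}(\mathbf{v}_q, \hat{\mathbf{c}}_j) = \frac{\mathbf{v}_q \cdot \hat{\mathbf{c}}_j}{\|\mathbf{v}_q\| \, \|\hat{\mathbf{c}}_j\|}, \quad j = 1, \dots, n
$$
This vector $\mathbf{p}_q$ describes the "shape" of the query relative only to the taxonomy: it keeps how strongly the query aligns with each dimension and discards variation in the embedding that is unrelated to those dimensions.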
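A small sketch of the profiling step, under the assumption that each profile component is the cosine similarity between the query and one centroid. The function name `dimensional_profile` and the toy centroids are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def dimensional_profile(
    v_q: Sequence[float], centroids: Sequence[Sequence[float]]
) -> List[float]:
    """Map a d-dimensional vector to an n-dimensional profile (one entry per centroid)."""
    return [cosine_similarity(v_q, c) for c in centroids]


# Toy setup: d = 3 embedding space, n = 2 taxonomy dimensions.
centroids = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
query = [3.0, 4.0, 12.0]  # Euclidean norm is 13
profile = dimensional_profile(query, centroids)  # components 3/13 and 4/13
```

The third coordinate of `query` is large but aligns with neither centroid, so it only shrinks the profile components; it does not appear as a dimension of its own.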
4. The Hybrid Similarity Metric
To compare a query $Q$ and a target document $T$, we calculate similarity in both the generic space ($\mathbb{R}^d$) and the profile space ($\mathbb{R}^n$).
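Definition 4 (Profile Similarity). Let $\mathbf{p}_Q = \Phi(\mathbf{v}_Q)$ and $\mathbf{p}_T = \Phi(\mathbf{v}_T)$ be the dimensional profiles of $Q$ and $T$. The similarity between them is defined as:
$$
S_{dim}(Q, T) = \frac{\mathbf{p}_Q \cdot \mathbf{p}_T}{\|\mathbf{p}_Q\| \, \|\mathbf{p}_T\|}
$$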
Note: We use Cosine Similarity here rather than Euclidean distance because we care about the relative distribution of dimensions (e.g., "Is this 80% UX and 20% Accessibility?") rather than the absolute magnitude of the match.
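Definition 5 (Hybrid Similarity Score). The final similarity function $S_{hybrid}$ is a convex combination of the semantic and dimensional similarities, controlled by a hyperparameter $\lambda \in [0, 1]$:
$$
S_{hybrid}(Q, T) = \lambda \, S_{sem}(\mathbf{v}_Q, \mathbf{v}_T) + (1 - \lambda) \, S_{dim}(Q, T)
$$
Here $S_{dim}$ denotes the profile similarity of Definition 4. Under this convention, $\lambda = 1$ recovers the purely semantic score and $\lambda = 0$ compares the dimensional profiles only.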
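Putting Definitions 1 through 5 together, the hybrid score might be computed as in the sketch below. It assumes that $\lambda$ weights the semantic term and $1 - \lambda$ weights the profile term; the source only states that the score is a convex combination, so swap the weights if the opposite convention is intended. The function names and the default `lam=0.5` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def dimensional_profile(
    v: Sequence[float], centroids: Sequence[Sequence[float]]
) -> List[float]:
    """One cosine similarity per taxonomy centroid."""
    return [cosine_similarity(v, c) for c in centroids]


def hybrid_similarity(
    v_q: Sequence[float],
    v_t: Sequence[float],
    centroids: Sequence[Sequence[float]],
    lam: float = 0.5,
) -> float:
    """Convex combination of semantic and profile similarity."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    s_sem = cosine_similarity(v_q, v_t)
    s_dim = cosine_similarity(
        dimensional_profile(v_q, centroids),
        dimensional_profile(v_t, centroids),
    )
    # Assumed convention: lam weights the semantic term.
    return lam * s_sem + (1.0 - lam) * s_dim


# Toy example: orthogonal embeddings that share the same taxonomy profile.
centroids = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
v_q = [1.0, 0.0, 0.0]
v_t = [0.0, 1.0, 0.0]
score = hybrid_similarity(v_q, v_t, centroids, lam=0.25)
```

In the toy example the two embeddings are orthogonal, so the semantic term is 0, yet both align with the first centroid and not the second, so the profile term is 1. The hybrid score therefore equals $1 - \lambda$, which shows how the profile term can surface matches that the generic space misses.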