Notes for Bellet 2014 survey
Mahalanobis distance: d_M(x, x')^2 = (x − x')^T M (x − x')
where M is a symmetric PSD d x d matrix (i.e. M lies in the PSD cone)
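A minimal numpy sketch of this definition (function and variable names are mine, not the survey's):

```python
import numpy as np

def mahalanobis_sq(x, x_prime, M):
    """Squared Mahalanobis distance d_M(x, x')^2 = (x - x')^T M (x - x')."""
    diff = x - x_prime
    return float(diff @ M @ diff)

# With M = I this reduces to the squared Euclidean distance.
d = 3
M = np.eye(d)                                   # any symmetric PSD d x d matrix
x, x_prime = np.ones(d), np.zeros(d)
print(mahalanobis_sq(x, x_prime, M))            # 3.0
```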
Must-link / cannot-link constraints (sometimes called positive / negative pairs):
S = {(xi, xj) : xi and xj should be similar},
D = {(xi, xj) : xi and xj should be dissimilar}.
Relative constraints (sometimes called training triplets):
R = {(xi, xj, xk) : xi should be more similar to xj than to xk}.
Optimization problem of the following general form: min_M { ℓ(M, S, D, R) + λ R(M) }
where ℓ(M, S, D, R) is a loss function that incurs a penalty when training constraints
are violated, R(M) is a regularizer on the parameters M of the learned metric, and λ ≥ 0 is the regularization parameter (sketch below)
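A sketch of one possible instance of this general form, assuming a hinge-type loss on the pairs in S and D, a Frobenius-norm regularizer, and projected subgradient descent; the thresholds u and ell, the step size, and all names are illustrative choices of mine, not the survey's specific formulation:

```python
import numpy as np

def project_psd(M):
    """Project a symmetric matrix onto the PSD cone by clipping negative eigenvalues."""
    w, V = np.linalg.eigh((M + M.T) / 2)
    return V @ np.diag(np.clip(w, 0, None)) @ V.T

def pairwise_hinge_step(M, S, D, lam=0.1, lr=0.01, u=1.0, ell=2.0):
    """One projected subgradient step on a hinge-type pair loss plus a Frobenius
    regularizer: similar pairs (S) should fall below distance u, dissimilar
    pairs (D) should exceed distance ell."""
    grad = 2 * lam * M                           # gradient of lam * ||M||_F^2
    for (x, xp) in S:
        diff = x - xp
        if diff @ M @ diff > u:                  # similar pair violating its constraint
            grad += np.outer(diff, diff)
    for (x, xp) in D:
        diff = x - xp
        if diff @ M @ diff < ell:                # dissimilar pair violating its constraint
            grad -= np.outer(diff, diff)
    return project_psd(M - lr * grad)            # stay inside the PSD cone
```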
Related topics:
Kernel learning
- limited to the transductive setting (the learned kernel only covers the observed data)
Multiple Kernel learning
- effective and inductive, but more restricted than kernel learning
Dimensionality reduction
- assumes that the (unlabeled) data lie on an embedded low-dimensional manifold within the higher-dimensional space and aims at “unfolding” it
Metric learning taxonomy:
Learning paradigm
- fully, weakly, semi-supervised
Form of metric
- linear, nonlinear, local
Scalability
- w.r.t. number of examples / dimension
Optimality of the solution
- local, global
Dimensionality reduction
- yes, no
Early Approaches
Mahalanobis metric for clustering (MMC) (Xing) [&]
- convex formulation with no regularization
- projected gradient approach requiring a full eigenvalue decomposition of M at each iteration
S&J (Schultz & Joachims)
- M = A^T W A, where A is fixed and known and W is diagonal (sketch below)
- less general; A must be chosen by hand
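A small numpy check of this parametrization, showing that M = A^T diag(w) A is just a weighted Euclidean distance in the fixed projected space Ax (all names are mine):

```python
import numpy as np

def sj_distance_sq(x, x_prime, A, w):
    """d^2 under M = A^T diag(w) A: a weighted Euclidean distance in the space A x."""
    diff = A @ (x - x_prime)
    return float(np.sum(w * diff ** 2))

# Consistency check against the explicit Mahalanobis form.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 5))      # fixed, known projection
w = rng.random(4)                    # learned diagonal weights (must stay >= 0)
x, xp = rng.standard_normal(5), rng.standard_normal(5)
M = A.T @ np.diag(w) @ A
assert np.isclose(sj_distance_sq(x, xp, A, w), (x - xp) @ M @ (x - xp))
```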
Approaches Driven by Nearest Neighbours
Neighbourhood Component Analysis (NCA) (Goldberger) [&]
- "to optimize the expected leave-one-out error of a stochastic nearest neighbor classifier in the projection space induced by dM" (sketch below)
- non-convex => local maxima
- later generalized to k-NN (Tarlow)
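A compact sketch of the NCA objective (the softmax-over-distances neighbor probabilities p_ij and the expected leave-one-out accuracy they induce), assuming a linear map L; all names are mine, and a real implementation would optimize this quantity via its gradient:

```python
import numpy as np

def nca_objective(L, X, y):
    """Expected leave-one-out accuracy of the stochastic 1-NN classifier in the space Lx."""
    Z = X @ L.T                                   # project the data
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)                  # a point never picks itself
    P = np.exp(-d2)
    P /= P.sum(axis=1, keepdims=True)             # p_ij: prob. that i picks j as its neighbor
    same = (y[:, None] == y[None, :])
    return (P * same).sum(axis=1).mean()          # to be maximized (non-convex in L)
```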
Maximally Collapsing Metric Learning (MCML) (Globerson & Roweis)
- modification of NCA into a convex problem, but requires costly projections onto the PSD cone
- based on minimizing a KL divergence between pij and an ideal distribution
Large Margin Nearest Neighbors (LMNN) (Weinberger) [&] [!!]
- constraints are defined locally (sketch below)
- sub-gradient descent (+ 4 more papers with different approaches)
- performs well, some overfitting
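A sketch of an LMNN-style objective under a given metric M: pull each point's same-class target neighbors close, and apply a unit-margin hinge to differently-labeled impostors. The brute-force impostor scan and all names are simplifications of mine:

```python
import numpy as np

def lmnn_loss(M, X, y, target_neighbors, mu=0.5):
    """LMNN-style loss: pull target neighbors close, push differently-labeled
    impostors at least one unit further away (hinge with unit margin)."""
    pull, push = 0.0, 0.0
    for i, neighbors in target_neighbors.items():
        for j in neighbors:                        # same-class target neighbors of x_i
            dij = X[i] - X[j]
            d_ij = dij @ M @ dij
            pull += d_ij
            for l in np.where(y != y[i])[0]:       # candidate impostors (brute force)
                dil = X[i] - X[l]
                push += max(0.0, 1.0 + d_ij - dil @ M @ dil)
    return pull + mu * push
```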
Information-Theoretic Approaches
(optimization problem involving an information measure)
Relevant Component Analysis (RCA) (Shental) [&]
- positive pairs only (S), using transitive closure
- efficient algorithm, but poor results (doesn't use any negative information)
Information-Theoretic Metric Learning (ITML) (Davis) [&] [!!]
- introduces LogDet divergence regularization (sketch below)
- efficient, converges to the global minimum, and the resulting distance performs well in practice
- uses positive (S) and negative (D) pairs
- M0 (regularization matrix) must be picked by hand, usually I => Euclidean distance
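A numpy sketch of the LogDet divergence ITML regularizes with, D_ld(M, M0) = tr(M M0^{-1}) − log det(M M0^{-1}) − d (names are mine):

```python
import numpy as np

def logdet_divergence(M, M0):
    """LogDet (Burg) divergence D_ld(M, M0) = tr(M M0^-1) - log det(M M0^-1) - d.
    Finite only when M is positive definite, so minimizing it keeps M inside the PSD cone."""
    d = M.shape[0]
    A = M @ np.linalg.inv(M0)
    sign, logdet = np.linalg.slogdet(A)
    return float(np.trace(A) - logdet - d)

# With M0 = I (the common default), the divergence is 0 iff M = I.
print(logdet_divergence(np.eye(3), np.eye(3)))   # 0.0
```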
Sparse Distance Metric Learning (SDML) (Qi)
- deals with high-dimensional data (n << d)
Online Approaches
(online learning methods often come with regret bounds, stating that the accumulated loss suffered along the way is not much worse than that of the best hypothesis chosen in hindsight)
Pseudo-metric Online Learning Algorithm (POLA) (Shalev-Shwartz)
- at each step it receives a pair (from S or D) and performs two orthogonal projections
Logdet Exact Gradient Online (LEGO) (Jain)
- overall an improved POLA with LogDet regularization
Regularized Distance Metric Learning (RDML) (Jin)
- similar to POLA, but more flexible
- performs similarly to LMNN and ITML
Mirror Descent Metric Learning (MDML) (Kunapuli & Shavlik)
- composite mirror descent, which allows online optimization of many regularized problems
- comparable to LMNN and ITML, is fast, induces low-rank solutions, but has not been evaluated on large-scale datasets
Multi-Task Metric Learning
(given a set of related tasks, one learns a metric for each in a coupled fashion in order to improve the performance on all tasks)
Multi-Task LMNN (mt-LMNN) (Parameswaran & Weinberger) [&]
- outperforms single-task metric learning methods
- formulation remains convex and can be solved using the same efficient solver as LMNN
Multi-Task Low-Rank Metric Learning Based on Common Subspace (MLCS) (Yang)
- learns a Mahalanobis metric for each task, parametrized by a transformation matrix that is then decomposed
- not convex (local minima); unlike mt-LMNN, it can be made low-rank (fewer parameters)
Geometry Preserving Multi-task Metric Learning (GPML) (Yang)
- generalization of the previous MTL approaches
- their formulation can extend any metric learning algorithm to the multi-task setting
- outperforms single-task LMNN and mt-LMNN, especially when training data is scarce
Transfer Metric Learning (TML) (Zhang & Yeung) [&]
- assumes similar tasks have already been solved with enough labeled data and a Mahalanobis distance learned for each of them
- this knowledge is then transferred to another task with scarce data
Other Approaches
(sparse metric learning, theory of boosting, ...)
Learning Sparse Metrics via Linear Programming (LPML) (Rosales & Fung)
- aims at learning matrices with entire columns/rows set to zero (making M low-rank)
- but uses a regularizer that only drives individual entries to zero (not whole rows/columns)
Sparse Metric Learning (SML) (Ying) [&]
- uses the L_{2,1} norm, which tends to zero out entire rows of M (=> performs feature selection; sketch below)
- difficult to optimize; fast convergence, but each iteration is slow, O(d^3)
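For concreteness, a short sketch of the mixed norm in question (names are mine):

```python
import numpy as np

def l21_norm(M):
    """L_{2,1} norm: the sum of the Euclidean norms of the rows of M.
    Penalizing it drives entire rows to zero, i.e. discards input features."""
    return float(np.linalg.norm(M, axis=1).sum())

M = np.array([[1.0, 2.0], [0.0, 0.0], [3.0, 4.0]])
print(l21_norm(M))   # ||(1,2)|| + 0 + ||(3,4)|| = sqrt(5) + 5
```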
BoostMetric (Shen) [&]
- the method is based on the property that any PSD matrix can be decomposed into a positive linear combination of trace-one rank-one matrices (sketch below)
- requires many iterations for high-dimensional datasets
- improved upon by many others (scalability, redundancy, regularization)
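A small numpy illustration of that property: any nonnegative combination of trace-one rank-one matrices u u^T (with ||u|| = 1) is PSD by construction (all names are mine):

```python
import numpy as np

def combine_rank_one(weights, units):
    """Build M as a positive combination of trace-one rank-one matrices u u^T (||u|| = 1).
    Any such combination is automatically PSD, the property BoostMetric exploits."""
    d = units[0].shape[0]
    M = np.zeros((d, d))
    for w, u in zip(weights, units):
        assert w >= 0 and np.isclose(np.linalg.norm(u), 1.0)
        M += w * np.outer(u, u)                  # trace(u u^T) = ||u||^2 = 1
    return M

rng = np.random.default_rng(0)
units = [u / np.linalg.norm(u) for u in rng.standard_normal((5, 3))]
M = combine_rank_one(rng.random(5), units)
print(np.linalg.eigvalsh(M).min() >= -1e-10)     # True: M is PSD by construction
```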
Distance Metric Learning with Eigenvalue Optimization (DML-p) (Ying)
- revisits the original MMC
- solved efficiently using a "first-order algorithm" that only needs the largest eigenvalue at each iteration
- good results and low computational complexity, but not clear how to accommodate a regularizer
Robust Metric Learning (RML) (Huang)
- deals with noisy, incorrect training constraints
- good robustness with up to 30% incorrect triplets
Metric Learning to Rank (MLR) (McFee & Lanckriet) [&]
- learns a metric for a ranking task (given a query)
- needs a full eigendecomposition at each iteration (a problem for high-dimensional data)
Other Advances in Metric Learning
Linear Similarity Learning
Similarity Learning for Nearest Neighbor Classification (SiLA) (Qamar)
- M is not required to be PSD or symmetric
- generalization of the cosine similarity (text and image retrieval)
- subsequent work studied an online feature reweighting algorithm
Online and Batch Learning of Generalized Cosine Similarities (gCosLA) (Qamar & Gaussier)
- similar to POLA (2 projections per iteration)
- fewer iterations and better accuracy than SiLA
- competitive with LMNN and ITML
An Online Algorithm for Large Scale Image Similarity Learning (OASIS) (Chechik) [&]
- bilinear, online, uses triplets (R); efficient, especially for sparse inputs (sketch below)
- can define similarity between instances of different dimensions
- competitive results on medium-scale problems
- scalable to problems with millions of training instances
- cannot incorporate complex regularizers
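A sketch of the passive-aggressive style update OASIS performs on each triplet (x, x_pos, x_neg), assuming the bilinear similarity x^T W x'; the aggressiveness parameter C and all names are mine:

```python
import numpy as np

def oasis_update(W, x, x_pos, x_neg, C=0.1):
    """One passive-aggressive step on a triplet: the bilinear score x^T W x_pos
    should exceed x^T W x_neg by a margin of 1. W is neither symmetric nor PSD."""
    loss = max(0.0, 1.0 - x @ W @ x_pos + x @ W @ x_neg)
    if loss == 0.0:
        return W                                  # passive: constraint already satisfied
    V = np.outer(x, x_pos - x_neg)                # gradient of the hinge w.r.t. W
    tau = min(C, loss / (np.linalg.norm(V) ** 2)) # aggressive but bounded step size
    return W + tau * V
```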
Similarity Learning for Linear Classification (SLLC) (Bellet)
- different approach: instead of individual pairs or triplets, it uses averages over sets of examples
- bilinear, produces extremely sparse classifiers
- con: cannot directly deal with the multi-class setting? (one-vs-all, one-vs-one)
Riemannian Similarity Learning (RSL) (Cheng)
- bilinear, focuses on the pair matching setting (predicting whether two examples match)
Nonlinear methods
Kernelization of Linear Methods
Learning Nonlinear Forms of Metrics
Learning a Similarity Metric Discriminatively, with Application to Face Verification (LSMD) (Chopra)
- convolutional NN trained with SGD (=> local optima)
- requires fine-tuning of hyperparameters
Nonlinear NCA (NNCA) (Salakhutdinov & Hinton)
- deep belief network + optimizing NCA for the last layer
- performs well given enough data
Support Vector Metric Learning (SVML) (Xu)
- alternates between (i) learning the SVM model with respect to the current Mahalanobis distance
- and (ii) learning a Mahalanobis distance that minimizes a surrogate of the validation error of the current SVM model
Gradient-Boosted LMNN (GB-LMNN) (Kedem)
- gradient-boosted regression trees of limited depth
- robust to overfitting, comparable to or better than LMNN and ITML
Hamming Distance Metric Learning (HDML) (Norouzi) [&]
- the goal here is to optimize a mapping b(x) that projects a d-dimensional real-valued input x to a q-dimensional binary code (sketch below)
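A toy sketch of such a mapping, assuming the simplest linear-threshold form b(x) = sign(Wx); HDML learns the mapping with a triplet ranking loss, which is not shown here, and all names are mine:

```python
import numpy as np

def binary_code(x, W):
    """b(x): map a d-dimensional real vector to a q-dimensional binary code.
    A linear-threshold mapping is the simplest instance of such a b(x)."""
    return (W @ x > 0).astype(np.uint8)

def hamming_distance(b1, b2):
    """Number of bits on which two binary codes differ."""
    return int(np.count_nonzero(b1 != b2))

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 64))                 # q = 16 bits, d = 64 input features
x1, x2 = rng.standard_normal(64), rng.standard_normal(64)
print(hamming_distance(binary_code(x1, W), binary_code(x2, W)))
```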
Local Metric Learning
Learning Globally-Consistent Local Distance Functions for Shape-Based Image Retrieval and Classification (Frome)
- not reviewed in the paper
- feature vectors computed from image patches (an important step)
- too much training data, so 15 candidates are chosen per category -> 15*14*1500*101 (candidates * positive * negative * categories)
- trained over triplets of data, tightly coupled (slower, but globally comparable), convex problem (solved using the dual)
Multiple Metrics LMNN (MM-LMNN) (Weinberger & Saul) [&]
- expects C clusters (from labeled data or K-means)
- learns C metrics in a coupled fashion
- large improvements, but prone to overfitting
Generative Local Metric Learning (GLML) (Noh)
- 1-NN classifier, learning a Mahalanobis distance for each training instance, regularized towards a diagonal matrix
- each metric is computed independently (=> scalable)
- performs poorly on complex problems
Learning Bregman Distance Functions and Its Application for Semi-Supervised Clustering (Bk-means) (Wu)
- does not satisfy the triangle inequality; convex; generalizes the Mahalanobis distance and the KL divergence
- positive (S) and negative (D) pairs; learning ϕ can be seen as learning an infinite number of local Mahalanobis distances
- n+1 variables instead of d^2 (scalable), but requires n kernel evaluations (expensive for large datasets)
Parametric Local Metric Learning (PLML) (Wang) [&]
- a Mahalanobis metric d^2_{M_i} is learned for each training instance x_i
- needs an eigendecomposition at each step, O(d^3) (intractable for high-dimensional data)
- like LMNN, PLML is sensitive to how relevant the Euclidean distance is for assessing the similarity between anchor points (obtained as cluster means from K-means)
Random Forest Distance (RFD) (Xiong) [&]
- treats metric learning as a pair classification problem; encodes both relative and absolute position (unlike the Mahalanobis distance, which only encodes relative information) (sketch below)
- uses a forest of decision trees
- very efficient: each tree takes O(n log n) to build, and trees can be built in parallel
- drawback: every evaluation needs the output of all trees
- outperforms some global and local metric learning methods
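A sketch of this pair classification view with scikit-learn's RandomForestClassifier; the exact pair feature map used by RFD may differ from the one below, and all names are mine:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def pair_features(x, x_prime):
    """Encode both relative (|x - x'|) and absolute ((x + x')/2) position of a pair."""
    return np.concatenate([np.abs(x - x_prime), (x + x_prime) / 2.0])

def fit_rfd(pairs, labels, n_trees=100):
    """Train a forest to classify pairs as similar (1) / dissimilar (0);
    the averaged tree votes then serve as a (non-metric) similarity score."""
    F = np.stack([pair_features(x, xp) for (x, xp) in pairs])
    return RandomForestClassifier(n_estimators=n_trees).fit(F, labels)

# Usage: rfd = fit_rfd(pairs, labels)
#        score = rfd.predict_proba([pair_features(x, xp)])[0, 1]
```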
Metric Learning for Histogram Data
Generalization Guarantees for Metric Learning
Standard Semi-Supervised Setting
(unlabeled pairs)
Laplacian Regularized Metric Learning (LRML) (Hoi) [&]
- manifold regularization; a weight matrix W encodes the similarity between pairs of points
- good for data with scarce label information (??)
- drawback: computing W is intractable for large-scale datasets
Semisupervised Multiview Distance Metric Learning (M-DML) (Zha)
- augments LRML by learning multiple metrics from auxiliary datasets
Information-theoretic Semisupervised Metric Learning via Entropy Regularization (SERAPH) (Niu) [&]
- optimizes the probability of labeling a given pair, parameterized by a Mahalanobis distance
- intuitively, the regularization enforces low uncertainty of unobserved weak labels
Metric Learning for Domain Adaptation
(the labeled training data and the test data come from different (but somehow related) distributions, referred to as the source and target distributions respectively)
Consistent Distance Metric Learning (CDML) (Cao)
- deals with the setting of covariate shift, which assumes that the source and target data distributions pS(x) and pT(x) are different but the conditional distribution of the labels given the features, p(y|x), remains the same
Domain Adaptation Metric Learning (DAML) (Geng)
- based on the maximum mean discrepancy
- trades off between satisfying the constraints on the labeled source data and finding a projection that minimizes the discrepancy between the source and target distributions
String Edit Distance Learning