Machine Learning

Precision and Recall

The terrorist detection task is an imbalanced classification problem: we have two classes we need to identify (terrorists and not terrorists), with one category (non-terrorists) representing the overwhelming majority of the data points. Another imbalanced classification problem occurs in disease detection when the rate of the disease in the public is very low. In both these cases, the positive class (disease or terrorist) is greatly outnumbered by the negative class. These types of problems are examples of the fairly common case in data science where accuracy is not a good measure for assessing model performance.

Recall

Intuitively, we know that proclaiming all data points as negative (not a terrorist) in the terrorist detection problem isn’t helpful, and, instead, we should focus on identifying the positive cases. The metric our intuition tells us we should maximize is known in statistics as recall, or the ability of a model to find all the relevant cases within a data set.
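
In confusion-matrix terms (true positives TP, false negatives FN), recall is the fraction of actual positives that the model retrieves:

$$\text{recall} = \frac{TP}{TP + FN}$$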


Precision

While recall expresses the ability to find all relevant instances of a class in a data set, precision expresses the proportion of the data points our model assigns to the relevant class that are indeed relevant.
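
In the same confusion-matrix terms (true positives TP, false positives FP):

$$\text{precision} = \frac{TP}{TP + FP}$$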


ROC Curve

One of the main visualization techniques for showing the performance of a classification model is the Receiver Operating Characteristic (ROC) curve. The idea is simple: the ROC curve shows how the true positive rate (recall) and the false positive rate change together as we vary the threshold for identifying a positive data point in our model. The threshold represents the value above which we consider a data point to be in the positive class.
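
The two axes of the curve come straight from the confusion matrix:

$$\text{TPR} = \frac{TP}{TP + FN}, \qquad \text{FPR} = \frac{FP}{FP + TN}$$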

(Figure: ROC curves for two models, one blue and one red, with a black diagonal line for a random classifier.)

AUC

We can quantify a model’s ROC curve by calculating the total Area Under the Curve (AUC), a metric that falls between zero and one with a higher number indicating better classification performance. In the graph above, the AUC for the blue curve will be greater than that for the red curve, meaning the blue model is better at achieving a blend of precision and recall. A random classifier (the black line) achieves an AUC of 0.5.
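
As a concrete illustration, scikit-learn's roc_curve and roc_auc_score compute these quantities from true labels and predicted scores; the labels and scores below are synthetic, just to make the sketch runnable:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Synthetic labels and model scores, for illustration only.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]

# Each threshold gives one (FPR, TPR) point on the ROC curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(list(zip(fpr, tpr)))

# AUC summarizes the whole curve; 1.0 is perfect, 0.5 is a random classifier.
print(roc_auc_score(y_true, y_score))
```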

Metric for Evaluating Recommendation Engines

Cumulative Gain (CG)

If every recommendation has a graded relevance score associated with it, CG is the sum of graded relevance values of all results in a search result list.

The Cumulative Gain at a particular rank position p is the sum of the graded relevance values of the top p results, where rel_i is the graded relevance of the result at position i. The problem with CG is that it does not take into consideration the rank of each result when determining the usefulness of the result set.
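
Formally:

$$CG_p = \sum_{i=1}^{p} rel_i$$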

Discounted Cumulative Gain

To overcome this we introduce DCG. DCG penalizes highly relevant documents that appear lower in the search results by reducing the graded relevance value logarithmically in proportion to the position of the result.
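
The standard form discounts each result's gain by the logarithm of its position; the figures in the example below are consistent with the exponential-gain variant, which rewards highly relevant results more strongly:

$$DCG_p = \sum_{i=1}^{p} \frac{rel_i}{\log_2(i+1)} \qquad \text{or} \qquad DCG_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(i+1)}$$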

For example:

  • setA: [3, 1, 2, 3, 2, 0] CG setA: 11
  • setB: [3, 3, 2, 2, 1, 0] CG setB: 11
  • DCG setA: 13.306224081788834
  • DCG setB: 14.595390756454924
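
A minimal Python sketch, assuming the exponential-gain DCG above, reproduces these figures:

```python
import math

def cg(rels):
    """Cumulative Gain: plain sum of graded relevances."""
    return sum(rels)

def dcg(rels):
    """Discounted Cumulative Gain with exponential gain (2**rel - 1)."""
    return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(rels))

set_a = [3, 1, 2, 3, 2, 0]
set_b = [3, 3, 2, 2, 1, 0]

print(cg(set_a), cg(set_b))  # 11 11
print(dcg(set_a))            # 13.306224081788834
print(dcg(set_b))            # 14.595390756454924
```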

Normalized Discounted Cumulative Gain

An issue arises with DCG when we want to compare the search engine's performance from one query to the next, because search result lists can vary in length depending on the query that has been provided.

Hence, by normalizing the cumulative gain at each position for a chosen value of p across queries, we arrive at NDCG. We perform this by sorting all the relevant documents in the corpus by their relative relevance, producing the maximum possible DCG through position p (a.k.a. the Ideal Discounted Cumulative Gain, IDCG).
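
That is, with IDCG_p denoting the DCG of that ideally ordered list:

$$NDCG_p = \frac{DCG_p}{IDCG_p}$$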

The ratio will always be in the range [0, 1], with 1 being a perfect score, meaning that the DCG is the same as the IDCG. Therefore, NDCG values can be averaged over all queries to obtain a measure of the average performance of a recommender system's ranking algorithm.

Limitations of NDCG

  • NDCG does not penalize bad documents in the results.
  • It does not penalize missing documents in the results.
  • It may not be suitable for measuring the performance of queries that often have several equally good results.

LambdaRank and LambdaMART

What problem are they solving?

  • In ranking, your goal is not just to predict relevance scores — it's to get the order right.
  • Metrics like NDCG are non-differentiable (they involve sorting and discrete jumps), so you can't directly optimize them with gradient descent.
  • LambdaRank and LambdaMART were invented to indirectly optimize ranking metrics like NDCG, using clever tricks.

What are LambdaRank and LambdaMART?

  • Both LambdaRank and LambdaMART are algorithms designed to optimize ranking metrics like NDCG (rather than a generic regression or classification loss).
  • They are based on the idea of learning to rank by pairwise comparisons, while smartly adjusting how much each kind of ranking mistake matters.

LambdaRank

Problem it solves:

  • Standard ranking models (like RankNet) used a pairwise loss but did not take the ranking metric into account directly.
  • LambdaRank adjusts the gradient dynamically based on the change in NDCG if you swap two items.

High-Level Idea of LambdaRank:

  • For each pair of documents/items for a query:
      • Compute the swap effect: if we swap these two items, how much would the NDCG change?

Use that swap impact to scale the gradient:

  • If swapping the items would hurt NDCG a lot, you apply a larger gradient.
  • If swapping would not affect NDCG much, you apply a smaller gradient.

Thus: the model focuses more on mistakes that matter more to the actual ranking quality (NDCG).
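
A minimal sketch of this swap-effect weighting, assuming the exponential-gain DCG from the earlier example (the full LambdaRank gradient also includes a RankNet-style pairwise term, omitted here):

```python
import numpy as np

def dcg(rels):
    """Exponential-gain DCG of a list in its current ranked order."""
    positions = np.arange(1, len(rels) + 1)
    return np.sum((2.0 ** np.asarray(rels) - 1) / np.log2(positions + 1))

def delta_ndcg(rels, i, j):
    """|Change in NDCG| if the items at ranks i and j trade places."""
    ideal = dcg(sorted(rels, reverse=True))
    swapped = list(rels)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return abs(dcg(swapped) - dcg(rels)) / ideal

# Graded relevances in the model's current ranked order (toy data).
rels = [1, 3, 0, 2]

# Pairs whose swap would move NDCG a lot get a larger gradient weight.
for i in range(len(rels)):
    for j in range(i + 1, len(rels)):
        print(f"swap ranks {i} and {j}: delta NDCG = {delta_ndcg(rels, i, j):.4f}")
```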

LambdaMART

  • LambdaRank is usually trained with a neural network.
  • LambdaMART replaces the neural net with gradient boosted decision trees (GBDT).
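
For instance, LightGBM's LGBMRanker trains a LambdaMART-style GBDT. A minimal sketch, assuming synthetic features and graded relevance labels made up purely for illustration:

```python
import numpy as np
from lightgbm import LGBMRanker

rng = np.random.default_rng(0)

# Two queries with 5 and 3 candidate items; features and labels are synthetic.
X = rng.normal(size=(8, 4))
y = np.array([3, 2, 0, 1, 0, 2, 1, 0])
group = [5, 3]  # number of items per query, in the order they appear in X

ranker = LGBMRanker(objective="lambdarank", n_estimators=50)
ranker.fit(X, y, group=group)

# Higher predicted scores should mean higher rank within each query.
print(ranker.predict(X[:5]))
```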