@malteos
Created February 6, 2025 21:47
DeepSeek deep thinking about retrieval system evaluation with multiple annotators and low inter-rater agreement.

Prompt

In the context of an information retrieval system, I have human annotated data in the form of query - retrieved documents. The documents are labeled according to their relevance for the given query. Each document is annotated by two to three annotators. How can I evaluate two retrieval systems with this test data while accounting for low inter-rater agreement? Provide Python code as an example.

Thinking

Okay, so I need to figure out how to evaluate two retrieval systems when there's low inter-rater agreement in the annotations. Let's start by understanding the problem. The user has queries and retrieved documents, each document labeled by 2-3 annotators for relevance. But since the annotators don't agree much, traditional evaluation metrics like average precision or NDCG might not be reliable because they depend on consistent relevance labels.

First, I should recall what inter-rater agreement means. Low agreement suggests that the relevance judgments are subjective or ambiguous. So, using a single "ground truth" by taking majority vote might not capture the true picture, especially if the disagreements are significant. Instead, maybe I should incorporate the uncertainty from the annotators into the evaluation.

Possible approaches:

  1. Use all annotations without aggregation: Instead of taking majority votes, evaluate each system against each annotator's labels separately and then average the results. This way, the variability among annotators is considered in the evaluation.

  2. Statistical tests that account for variability: Since there's low agreement, standard significance tests might not be appropriate. Maybe use tests that consider the variance across annotators.

  3. Use metrics designed for multiple judgments: Some metrics like Krippendorff's alpha measure agreement, but that's more about the annotations themselves (a quick agreement check is sketched below). Alternatively, maybe compute the expected performance over the distribution of possible relevance labels inferred from the annotators.
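
As a quick side check, the level of disagreement itself can be quantified before comparing any systems. A minimal sketch, assuming binary labels and no fixed annotator identities: mean pairwise observed agreement per document (a real Krippendorff's alpha would need the full annotator-by-item matrix, e.g., via the third-party krippendorff package).

from itertools import combinations

def pairwise_agreement(annotations):
    # annotations maps doc IDs to lists of binary labels (2-3 per doc)
    rates = []
    for labels in annotations.values():
        pairs = list(combinations(labels, 2))
        if pairs:
            rates.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(rates) / len(rates) if rates else float('nan')

# Low agreement shows up as a value well below 1.0
print(pairwise_agreement({'doc1': [1, 0], 'doc2': [1, 1, 0]}))  # ≈ 0.17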

Let me think more about the first approach. For each query and document, instead of collapsing the labels into a single relevance score, I can compute the evaluation metric for each annotator and then aggregate the results. For example, calculate Precision@k, Recall@k, or NDCG for each annotator's labels and then take the average across annotators. This way, each system's performance is measured against each annotator's perspective, and the average gives a more robust measure considering the disagreement.

But how to handle cases where the same document is judged by different numbers of annotators? Wait, the user mentioned each document is annotated by 2-3 annotators. So for each document, there are 2 or 3 relevance labels. When evaluating a system's retrieved documents for a query, each document in the retrieved list has multiple labels.

Another idea: For each query, the system returns a ranked list of documents. For each document in the list, take the average relevance score across annotators (if relevance is graded) or compute the proportion of annotators that deemed it relevant (if binary). Then compute metrics like average precision or NDCG using these aggregated scores. This softens the binary judgment into a continuous score, which might be more robust to disagreement.

But if the relevance is binary (relevant/not relevant), maybe taking the majority vote is standard, but with low agreement, majority might not be reliable. Alternatively, using the proportion of relevant votes as a graded relevance (e.g., 0, 0.5, 1 if two annotators; 0, 0.33, 0.66, 1 if three) and then compute metrics that handle graded relevance, like NDCG.

But how does that account for low agreement? Well, it's incorporating the uncertainty into the metric. If a document is marked relevant by 1 out of 2 annotators, it's 0.5 relevance. The evaluation metric would then use these soft scores instead of hard labels.

Alternatively, use the expected value of the metric over all possible annotators. For example, for a document with two annotators (one says relevant, one not), when calculating precision, we average the possible precisions if we took either annotator's label as correct.

Wait, but that might be computationally intensive. However, for each document, the possible relevance labels are the ones provided by the annotators. So, if a document has two annotations (say, relevant and not relevant), then when evaluating a system that retrieved this document, the contribution to the metric would be the average over the two possible labels.

But how to implement this? Let's think in terms of the evaluation process. For each query, the system returns a ranked list of documents. For each document in the list, we have multiple relevance labels. Let's say, for example, if the metric is Precision@k, which depends on whether each document is relevant. Normally, we count a document as 1 if relevant, 0 otherwise. But with multiple annotations, the count could be the average of the 0/1 from each annotator. So, if two annotators, one says relevant (1), the other not (0), the contribution to precision would be 0.5 for that document. Then, summing over the top k documents and dividing by k gives an average precision that accounts for the disagreement.
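
To make that concrete, a soft Precision@k could look like the sketch below (the helper name soft_precision_at_k and the toy labels are only illustrative):

import numpy as np

def soft_precision_at_k(ranked_docs, annotations, k):
    # Each retrieved doc contributes the fraction of annotators who marked it relevant;
    # unannotated docs default to 0 (non-relevant)
    scores = [np.mean(annotations.get(doc, [0])) for doc in ranked_docs[:k]]
    return float(np.sum(scores)) / k

annotations = {'doc1': [1, 0], 'doc2': [1, 1], 'doc3': [0, 0]}
print(soft_precision_at_k(['doc1', 'doc2', 'doc3'], annotations, k=3))  # (0.5 + 1.0 + 0.0) / 3 = 0.5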

Similarly, for metrics like NDCG, which can handle graded relevance, instead of using binary 0/1, we can use the average relevance score across annotators. For example, if relevance is on a scale (like 0-3), average the scores. If it's binary, average the 0/1 to get a score between 0 and 1. Then compute NDCG using these averaged scores. This way, the evaluation uses the soft scores, which might better reflect the uncertainty.

So, the steps would be:

  1. For each query and document, compute the average relevance score across annotators. If relevance is binary, this is the proportion of annotators that labeled it as relevant. If graded, average the grades.

  2. Compute evaluation metrics (e.g., AP, NDCG, etc.) using these averaged scores.

  3. Compare the two systems based on these metrics across all queries.

But wait, how does this account for low inter-rater agreement? If there's high disagreement, the averaged scores would be more towards the middle (e.g., 0.5 for binary with two annotators). But when comparing two systems, the evaluation would take into account that some documents are ambiguously relevant. So, a system that retrieves documents with higher agreement (either all annotators say relevant or not) would be favored over one that retrieves documents with mixed annotations.

Alternatively, if we use the approach of evaluating each annotator's labels separately and then averaging the results, that could also account for variability. For example, for each query and each annotator, compute the metric (like AP) as if that annotator's labels are the ground truth. Then, average the metric across all annotators for all queries. This way, each annotator's perspective is considered, and the system's performance is the average over all possible perspectives.

Which approach is better? Using the average relevance score might be more straightforward, especially with existing metrics that can handle graded relevance. But evaluating per-annotator and then averaging might be more statistically sound, as each annotator's judgment is treated as a separate ground truth.

Another consideration: if the number of annotators per document varies (2-3), when using per-annotator evaluation, how to handle documents with different numbers of annotators? For example, a document with 2 annotators would contribute two separate evaluations, while one with three would contribute three. But this might complicate the per-query averaging. Alternatively, for each document, randomly select one annotator's label per trial and compute the metric over many trials, then average. But that's more computationally heavy.
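
A rough sketch of that sampling idea, assuming binary labels and a simple Precision@k as the metric (the helpers sampled_metric and precision_at_k are illustrative, not an established API):

import random

def precision_at_k(ranked_docs, labels, k):
    # labels holds one sampled 0/1 judgment per doc; unlabeled docs count as non-relevant
    return sum(labels.get(doc, 0) for doc in ranked_docs[:k]) / k

def sampled_metric(ranked_docs, annotations, k, n_trials=1000, seed=0):
    # Each trial draws one annotator label per document; the metric is averaged over trials
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        sampled = {doc: rng.choice(labels) for doc, labels in annotations.items()}
        total += precision_at_k(ranked_docs, sampled, k)
    return total / n_trials

annotations = {'doc1': [1, 0], 'doc2': [1, 1], 'doc3': [0, 0]}
print(sampled_metric(['doc1', 'doc2', 'doc3'], annotations, k=3))  # converges to ~0.5

For a linear metric like precision this converges to the same value as simply averaging the labels per document, so the averaging shortcut is usually the cheaper way to get there.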

Alternatively, for each query, collect all annotator-label pairs. So, if a document has two annotators, it's treated as two separate judgments. Then, for each system's retrieved list, compute the metric for each annotator's labels. For example, if a query has 5 retrieved documents, each with 2 annotators, then there are 10 (5x2) relevance judgments. Then compute the metric over all these judgments. Wait, but each annotator only labeled the documents they were assigned. Hmm, maybe not, because each document is labeled by a subset of annotators. So perhaps it's better to pair each annotator's labels for the documents they annotated. But annotators might not have labeled all documents. For example, in the dataset, each document is labeled by 2-3 annotators, but not all annotators labeled all documents. So each annotator has a subset of documents they labeled. Therefore, when evaluating a system's retrieval for a query, we can only use the documents that each annotator labeled. But that complicates things because different annotators might have labeled different documents. Unless the setup is that for each query-document pair, there are 2-3 annotators who labeled that pair. Then, for each query, when evaluating against an annotator, we can take all the documents that the annotator labeled for that query. But that requires that each annotator labeled all retrieved documents for the query, which is unlikely. So perhaps this approach isn't feasible.

So maybe the better approach is to use the averaged relevance scores per document. Let's think in terms of code. Suppose for each query, the system returns a list of documents. For each document, we have a list of relevance labels from annotators. We average those to get a soft relevance score. Then, compute metrics like NDCG using these scores.

Another approach is to use statistical tests that consider the variance across annotators. For example, compute the performance for each annotator and then perform a paired t-test or Wilcoxon signed-rank test across queries, using each annotator's scores as separate data points. However, if some queries have more annotators than others, this might complicate the analysis.

Alternatively, for each query, compute the metric for each annotator and then average them per query. Then, compare the two systems using the per-query averages with a statistical test. For example:

For each query q:
    For each annotator a who labeled documents in q:
        Compute the metric (e.g., AP) for system 1 and system 2 using a's labels
    Average the metrics across annotators, giving avg1 and avg2 for q

Collect all avg1 and avg2 across queries, then perform a paired test.
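
A rough sketch of this per-annotator scheme for a single query, assuming a hypothetical structure that maps each annotator to the labels they assigned (the annotator IDs and the average_precision helper are illustrative; the per-query means for both systems would then go into a paired test such as scipy.stats.ttest_rel):

import numpy as np

def average_precision(ranked_docs, labels):
    # Binary AP for one ranking against one annotator's labels
    hits, score = 0, 0.0
    for i, doc in enumerate(ranked_docs, start=1):
        if labels.get(doc, 0) == 1:
            hits += 1
            score += hits / i
    return score / hits if hits else 0.0

def per_annotator_ap(ranked_docs, annotator_labels):
    # Mean AP over annotators; assumes each annotator judged every retrieved doc
    return float(np.mean([average_precision(ranked_docs, labels)
                          for labels in annotator_labels.values()]))

# Hypothetical per-annotator labels for one query
annotator_labels = {
    'a1': {'doc1': 1, 'doc2': 1, 'doc3': 0},
    'a2': {'doc1': 0, 'doc2': 1, 'doc3': 0},
}
print(per_annotator_ap(['doc1', 'doc2', 'doc3'], annotator_labels))  # (1.0 + 0.5) / 2 = 0.75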

But this requires that for each query, all annotators have labeled all the documents that the systems retrieved. Otherwise, some annotators may not have labels for some documents, leading to missing data. In practice, this might not be the case. The annotations might be such that each document is labeled by 2-3 annotators, but not all documents are labeled by all annotators. So, for a given query and annotator, the annotator may have labeled only a subset of the documents. Therefore, when computing the metric for an annotator, you can only consider the documents they labeled. But in a ranked list, the unlabeled documents would be considered non-relevant? Or excluded? That's a problem because if an annotator didn't label a document, we don't know their relevance judgment. So this approach might not be feasible unless we assume that unlabeled documents are non-relevant, which is a common assumption in IR, but might introduce noise if that's not the case.

Given these complexities, perhaps the most feasible approach is to compute the average relevance score per document and use that with standard metrics. Let's outline this approach step by step:

  1. Aggregate Annotations: For each query-document pair, compute the average relevance score. If relevance is binary (0/1), this is the proportion of annotators who labeled it as relevant. If it's graded (e.g., 0-3), average the grades.

  2. Compute Metrics with Soft Scores: Use metrics that can handle graded relevance, such as NDCG or Average Precision (with soft relevance). For example, NDCG can use the averaged scores as the gain values. Average Precision can be computed by considering the soft scores as the degree of relevance, though the standard AP is for binary relevance. Alternatively, use a graded version of AP.

  3. Compare Systems: Compute the metrics for each query using the soft scores, then average across queries. Use statistical tests (e.g., paired t-test, Wilcoxon signed-rank test) to determine if the differences between systems are significant.

This approach accounts for the disagreement by softening the relevance scores. If two annotators disagree, the document gets a medium score, which affects the metrics accordingly. Systems that retrieve documents with higher agreement (higher average scores) will be favored.

Now, for Python code. Let's outline a possible implementation.

First, we need a dataset structure. Let's assume we have a list of queries. For each query, we have a list of retrieved documents by each system. Each document has annotations from 2-3 annotators.

Suppose the data is structured as follows:

  • A dictionary where each key is a query ID.

  • For each query ID, another dictionary with keys 'system1' and 'system2', each containing a list of document IDs retrieved by that system.

  • For each document ID, a list of relevance labels from annotators.

Alternatively, maybe for each query and document, we have the list of relevance labels.

But to keep it simple, let's represent the data as:

queries = {
    'q1': {
        'system1': ['doc1', 'doc2', 'doc3'],
        'system2': ['doc4', 'doc5', 'doc6'],
        'annotations': {
            'doc1': [1, 0],     # two annotators: one relevant, one not
            'doc2': [1, 1],
            'doc3': [0, 0],
            'doc4': [1, 0, 1],  # three annotators
            'doc5': [0, 0, 0],
            'doc6': [1, 1, 0]
        }
    },
    # more queries...
}

But this is a simplified structure. Each query has the systems' retrieved documents and the annotations for each document.

Then, for each system's retrieved list, compute the average relevance for each document. Then compute NDCG or other metrics using these averages.

But how to compute NDCG? NDCG requires a relevance score for each document in the ranked list. So for each document in the list, we get its average relevance from the annotations. Then compute the discounted cumulative gain.

Let's code a function to compute average relevance.

But first, let's think about the code steps:

  1. For each query and each system, get the ranked list of documents.

  2. For each document in the list, retrieve its annotations and compute the average relevance.

  3. Compute the evaluation metric (e.g., NDCG) using these average relevances.

  4. Aggregate the metric across all queries.

Additionally, to compare two systems, compute the metric for each query and perform a statistical test.

But how to handle cases where a retrieved document has no annotations? Probably, such documents should be considered non-relevant, but in the given setup, maybe all retrieved documents are annotated.

Assuming that all retrieved documents have annotations, proceed.

Now, code example:

First, let's import necessary libraries.

import numpy as np
from sklearn.metrics import ndcg_score
from scipy.stats import ttest_rel

Assuming we use sklearn's ndcg_score, which expects true relevance scores and predicted scores. Wait, but in our case, the true relevance is the average of the annotations, and the system's ranking is the order of the documents. So, the true relevance is the average scores, and the predicted is the system's ranking (since in retrieval, the system's output is the ranked list). Therefore, the NDCG is computed based on the true average relevance scores and the system's ranking.

But wait, the ndcg_score function in sklearn compares the predicted scores (e.g., relevance scores predicted by the system) to the true relevance. However, in our case, the system's output is the ranked list, and the true relevance is the average of the annotations. So the ideal DCG is based on the ordering of the true average relevances. But the system's DCG is based on its own ranking. So for each query, we can compute the NDCG by comparing the system's ranking to the true average relevances.

Yes, that's correct. So for each query, the true relevance is the average of the annotations for each document. The system's ranking is the list of documents it retrieved in order. The NDCG compares the system's ranking to the ideal ranking (sorted by true relevance).

So, for a given query and system, extract the true average relevance scores for the documents in the system's ranked list. Then compute the NDCG for that ranking.

Now, let's code a function to compute the average relevance.

Then, for each query and system, compute NDCG.

But in the example data structure, each query's annotations are under that query. So for system1's retrieved documents in q1, we take the order ['doc1', 'doc2', 'doc3'], get their average relevance from q1's annotations, then compute NDCG.

But the true relevance for the query is all the documents in the system's retrieved list, each with their average score. The ideal ranking would be those documents sorted by their average relevance in descending order.

Wait, but NDCG typically requires the true relevance of all documents in the collection to compute the ideal DCG. However, in practice, most systems only have the retrieved subset. So in practice, the NDCG is computed over the retrieved documents, assuming that unretrieved documents have lower relevance. But this can be problematic. However, given that the systems retrieve a subset, and annotations are available only for retrieved documents, perhaps we can proceed under the assumption that the ideal DCG is based on the retrieved documents sorted by their true relevance.

This is a simplification, but for the sake of example, we'll proceed.

So, for each query and system:

  • Get the list of retrieved documents (ranked)

  • For each document in the list, get the average relevance score

  • The true relevance vector is the average scores

  • The system's ranking is the order of the documents

  • Compute NDCG by comparing the system's ranking's DCG to the ideal DCG (sorted by true scores)

In sklearn's ndcg_score, the true relevance is a 2D array (one sample), and the predicted scores are the system's scores. Wait, the function is a bit confusing. Let me check:

sklearn.metrics.ndcg_score(y_true, y_score, k=None)
    y_true : ndarray of shape (n_samples, n_labels)
        True targets (multilabel).
    y_score : ndarray of shape (n_samples, n_labels)
        Predicted probabilities.

But in our case, the y_true would be the average relevance scores of the retrieved documents, and y_score would be the system's ranking. Wait, no: the ndcg_score expects y_score to be the predicted relevance scores used to generate the ranking. The function then computes the DCG of the ranking induced by y_score and compares it to the ideal DCG from y_true.

But in our case, the systems have already produced a ranking (the retrieved documents in order). So the y_score would be the ranking's order. However, the ndcg_score function isn't directly applicable because it uses y_score to generate the ranking. So perhaps we need to compute NDCG manually.
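
That said, ndcg_score can still be pressed into service by passing descending pseudo-scores that merely reproduce the system's ranking: the gains then come from the averaged relevance, and the normalization covers only the documents passed in (i.e., the non-pooled variant). A quick sketch with the q1 / system1 numbers from the example data:

import numpy as np
from sklearn.metrics import ndcg_score

# Averaged relevance of system1's retrieved docs for q1, in retrieval order
y_true = np.asarray([[0.5, 1.0, 0.0]])
# Descending pseudo-scores that reproduce the system's ranking (rank 1 gets the highest score)
y_score = np.asarray([[3.0, 2.0, 1.0]])

print(ndcg_score(y_true, y_score))  # ≈ 0.86, matching the manual calculation further below

Since the pooled-IDCG variant developed later needs control over which documents enter the ideal ranking, computing NDCG manually is still the cleaner route here.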

Alternatively, for the system's ranking, the gain for each document is the true average relevance, and the position is based on the system's order. Then compute DCG and IDCG as follows:

For a ranked list of documents D = [d_1, d_2, ..., d_n] where d_1 is the first document, the DCG is sum_{i=1 to n} (relevance_i / log2(i + 1)).

The IDCG is the DCG of the ranking sorted by true relevance in descending order.

Then, NDCG = DCG / IDCG.

So, let's code this manually.

First, for a given ranked list and average relevance scores:

def compute_ndcg(ranked_docs, true_relevance):
    # Compute DCG
    dcg = 0.0
    for i, doc in enumerate(ranked_docs):
        rel = true_relevance.get(doc, 0.0)  # default 0 if not found
        rank = i + 1
        dcg += rel / np.log2(rank + 1)
    # Compute IDCG
    sorted_relevances = sorted(true_relevance.values(), reverse=True)
    idcg = 0.0
    for i, rel in enumerate(sorted_relevances):
        rank = i + 1
        idcg += rel / np.log2(rank + 1)
    return dcg / idcg if idcg > 0 else 0.0

But in our case, the true_relevance is the average scores for each document in the ranked list. Wait, but in the function above, true_relevance is a dict mapping doc IDs to their average relevance. Then, for the ranked_docs (the system's retrieved list), we take their relevance in order.

But wait, the IDCG should be computed based on the true relevance of all documents in the collection, but since we don't have that, we can only compute it based on the retrieved documents. This is a limitation but common practice when full judgments aren't available.

So, proceed with the code.

Now, let's structure the data.

Sample data:

queries = {
    'q1': {
        'system1': ['doc1', 'doc2', 'doc3'],
        'system2': ['doc4', 'doc5', 'doc6'],
        'annotations': {
            'doc1': [1, 0],     # avg 0.5
            'doc2': [1, 1],     # avg 1.0
            'doc3': [0, 0],     # avg 0.0
            'doc4': [1, 0, 1],  # avg 2/3 ≈ 0.666
            'doc5': [0, 0, 0],  # avg 0.0
            'doc6': [1, 1, 0]   # avg 2/3 ≈ 0.666
        }
    },
    'q2': {
        'system1': ['doc7', 'doc8', 'doc9'],
        'system2': ['doc10', 'doc11', 'doc12'],
        'annotations': {
            'doc7': [1, 0],
            'doc8': [0, 1],
            'doc9': [1, 1],
            'doc10': [0, 0, 1],
            'doc11': [1, 0, 1],
            'doc12': [0, 1, 0]
        }
    }
}

For each query, compute for each system the NDCG.

Let's process query 'q1' for system1:

ranked_docs = ['doc1', 'doc2', 'doc3']

true_relevance is built from the annotations for these docs:

doc1: 0.5, doc2: 1.0, doc3: 0.0

So the system's ranking is doc1, doc2, doc3.

The DCG is (0.5 / log2(2)) + (1.0 / log2(3)) + (0.0 / log2(4))

= 0.5/1 + 1/1.58496 + 0 ≈ 0.5 + 0.6309 ≈ 1.1309

The ideal order is doc2 (1.0), doc1 (0.5), doc3 (0.0)

IDCG is (1.0 / log2(2)) + (0.5 / log2(3)) + (0.0 / log2(4))

= 1/1 + 0.5/1.58496 ≈ 1 + 0.315 ≈ 1.315

NDCG = 1.1309 / 1.315 ≈ 0.859

For system1 in q1, NDCG is ~0.86.

Similarly for system2 in q1:

ranked_docs = ['doc4', 'doc5', 'doc6']

annotations: doc4: 0.666..., doc5: 0.0, doc6: 0.666...

System's order: doc4, doc5, doc6.

DCG: 0.666/log2(2) + 0.0/log2(3) + 0.666/log2(4)

= 0.666/1 + 0 + 0.666/2 ≈ 0.666 + 0 + 0.333 ≈ 0.999

Ideal order: doc4 and doc6 both 0.666, so sorted as doc4, doc6, doc5.

IDCG: 0.666/log2(2) + 0.666/log2(3) + 0.0/log2(4)

≈ 0.666/1 + 0.666/1.58496 ≈ 0.666 + 0.42 ≈ 1.086

NDCG = 0.999 / 1.086 ≈ 0.92

So system2's NDCG for q1 is ~0.92.

Now, code this.

First, process each query and system:

def average_relevance(annotations):
    avg = {}
    for doc, labels in annotations.items():
        avg[doc] = sum(labels) / len(labels)
    return avg

Then, for each query:

for q_id, q_data in queries.items():
    annotations = q_data['annotations']
    avg_rel = average_relevance(annotations)
    # For each system
    for system in ['system1', 'system2']:
        ranked_docs = q_data[system]
        # Get the average relevance for each doc in the ranked list
        # (some docs might not be in annotations; assuming all are annotated)
        rel_scores = [avg_rel.get(doc, 0.0) for doc in ranked_docs]
        # Compute DCG
        dcg = 0.0
        for i, score in enumerate(rel_scores):
            rank = i + 1
            dcg += score / np.log2(rank + 1)
        # Compute IDCG
        ideal_sorted = sorted(rel_scores, reverse=True)
        idcg = 0.0
        for i, score in enumerate(ideal_sorted):
            rank = i + 1
            idcg += score / np.log2(rank + 1)
        ndcg = dcg / idcg if idcg > 0 else 0.0
        print(f"Query {q_id}, {system}: NDCG = {ndcg:.4f}")

But wait, this code uses only the retrieved documents to compute IDCG, which is not correct, because the ideal ranking should include all relevant documents in the collection, not just the ones retrieved by the system. Since we don't have annotations for every document in the collection, this is a common approximation (assuming that unretrieved documents are non-relevant). However, it can bias the NDCG scores: if a system fails to retrieve a highly relevant document, it isn't penalized. Given the problem setup, where we only have annotations for the retrieved documents, we should collect all documents that were retrieved by either system for a query and treat those as the full judged set for that query. Otherwise, each system's IDCG is computed only over its own retrieved documents, which makes the NDCG scores not comparable across systems.

This is a problem. For example, if system1 retrieves doc1, doc2, doc3, and system2 retrieves doc4, doc5, doc6, then each system's IDCG is based on their own retrieved documents. But in reality, the ideal ranking would include all relevant documents from the entire collection. However, without knowing the relevance of all documents in the collection, we can't compute true IDCG. This is a common issue in IR evaluation when using incomplete judgments (i.e., only judging the pooled results from multiple systems).

In the case of the example data, perhaps all documents retrieved by any system for a query have been annotated. So, to compute IDCG properly, for each query, we need to collect all documents retrieved by either system and compute the ideal DCG based on their average relevance. Then, for each system's ranking, compute DCG based on their order and divide by this common IDCG.

So, modifying the approach:

For each query:

  1. Collect all documents retrieved by either system1 or system2.

  2. Compute the average relevance for each of these documents.

  3. Compute IDCG as the DCG of the ideal ranking of all these documents.

  4. For each system's retrieved list, compute DCG based on their order (including only their retrieved documents, but using the common IDCG from all documents).

But this requires that for each query, we have annotations for all documents retrieved by any system. This is a typical pooling method in IR evaluation (like TREC). Assuming that's the case, the code can be adjusted.

Let's adjust the code accordingly.

For each query q:

  • Get all documents retrieved by system1 or system2: all_docs = system1_docs + system2_docs (unique)

  • Compute average relevance for all_docs

  • Compute IDCG based on all_docs sorted by average relevance

  • For each system, compute DCG for their ranked list (truncated to the length of the list), using the global IDCG

But how to handle retrieved lists of different lengths, i.e., if the systems retrieve different numbers of documents? In the example both retrieve 3 docs, so assume that both systems retrieve the same number of docs per query.

But in real scenarios, systems might retrieve different numbers. However, for metrics like NDCG@k, we can fix k.

Alternatively, proceed under the assumption that the evaluation is done at a certain k (e.g., 10), and all systems retrieve at least k documents. But in the example, it's 3.

So, modifying the code:

For each query:

all_docs = list(set(q_data['system1'] + q_data['system2']))

But we need to maintain the order in the ideal ranking. Wait, no. For IDCG, we sort all_docs by their average relevance in descending order and take the top k, where k is the length of the system's retrieved list.

Wait, NDCG is often computed at a specific k (like NDCG@5, NDCG@10). So for each system's retrieved list, compute NDCG@k, where k is the length of the list. But if systems have different lengths, it's harder. Alternatively, fix k for all queries.

But in this example, let's compute NDCG at the length of the retrieved list for each query-system pair. So if a system retrieves 3 documents for a query, compute NDCG@3.

But to compute IDCG correctly, we need to consider all documents in the pool (retrieved by any system for that query) and take the top k (3 in this case) sorted by their average relevance.

So, for each query:

  1. Collect all documents from both systems.

  2. Compute average relevance for each.

  3. Sort them in descending order of average relevance.

  4. Take the top k documents (where k is the number of documents retrieved by the system for this query) to compute IDCG.

Then, for each system's retrieved list of length k:

  • Compute DCG@k using their ranking.

  • Compute IDCG@k using the top k documents from the sorted pool.

  • NDCG@k = DCG@k / IDCG@k.

This way, the IDCG is based on the best possible ranking of all known documents for the query, which is more accurate.

But this requires that for each query and system, we know the number of documents retrieved (k). In the example data, each system retrieves 3 docs per query.

So, modifying the code:

def compute_ndcg(system_docs, all_docs_avg, k):
    # Compute DCG for the system's ranking
    dcg = 0.0
    for i, doc in enumerate(system_docs[:k]):  # truncate to top k
        rel = all_docs_avg.get(doc, 0.0)
        rank = i + 1
        dcg += rel / np.log2(rank + 1)
    # Compute IDCG using the top k docs from all_docs_avg
    sorted_docs = sorted(all_docs_avg.items(), key=lambda x: -x[1])
    ideal_relevances = [rel for doc, rel in sorted_docs[:k]]
    idcg = 0.0
    for i, rel in enumerate(ideal_relevances):
        rank = i + 1
        idcg += rel / np.log2(rank + 1)
    return dcg / idcg if idcg > 0 else 0.0

Then, for each query:

all_docs = list(set(q_data['system1'] + q_data['system2']))
annotations = q_data['annotations']
avg_rel = average_relevance(annotations)
all_docs_avg = {doc: avg_rel.get(doc, 0.0) for doc in all_docs}

k_system1 = len(q_data['system1'])
k_system2 = len(q_data['system2'])

ndcg_system1 = compute_ndcg(q_data['system1'], all_docs_avg, k_system1)
ndcg_system2 = compute_ndcg(q_data['system2'], all_docs_avg, k_system2)

This way, the IDCG is computed based on all the documents in the pool for the query, ensuring a fair comparison.

In the example for q1, system1 retrieves doc1-3 and system2 retrieves doc4-6, so all_docs is doc1-doc6, and the annotations for q1 cover all of them. The average relevance of each is computed and sorted; the top 3 (since k=3) are the three highest average scores.

For system1, which retrieves doc1-3, the average relevances are 0.5, 1.0, 0.0. But the pool also includes doc4 (0.666), doc5 (0.0), and doc6 (0.666). Sorting the whole pool by average relevance gives:

doc2 (1.0), doc4 (0.666), doc6 (0.666), doc1 (0.5), doc3 (0.0), doc5 (0.0).

So top 3 are doc2, doc4, doc6. So the ideal DCG@3 is:

1.0 / log2(2) + 0.666 / log2(3) + 0.666 / log2(4)

Term by term:

1.0 / log2(2) = 1.0/1 = 1.0

0.666 / log2(3) ≈ 0.666 / 1.58496 ≈ 0.420

0.666 / log2(4) = 0.666 / 2 = 0.333

Total IDCG = 1.0 + 0.420 + 0.333 ≈ 1.753

System1's retrieved docs are doc1 (0.5), doc2 (1.0), doc3 (0.0)

Their DCG@3:

doc1 at pos1: 0.5 / log2(2) = 0.5/1 = 0.5

doc2 at pos2: 1.0 / log2(3) ≈ 0.6309

doc3 at pos3: 0.0 → 0

Total DCG = 0.5 + 0.6309 = 1.1309

NDCG = 1.1309 / 1.753 ≈ 0.645

Previously, when the IDCG was based only on system1's own retrieved docs, the NDCG was higher (~0.86). With the pooled documents the IDCG is larger, so system1's NDCG drops to ~0.645.

This approach is more accurate because it considers all known relevant documents for the query.

Thus, the code needs to be adjusted to consider the pooled documents.

This makes the evaluation fairer between the two systems, as the ideal is based on all documents that any system retrieved.

Now, integrating this into the code:

For each query, collect all documents from both systems, compute their average relevance, then compute IDCG based on the top k documents from this pool, where k is the number of documents retrieved by the system.

So, the code would look something like:

queries = {
    # sample data as before
}

def average_relevance(annotations):
    avg = {}
    for doc, labels in annotations.items():
        avg[doc] = sum(labels) / len(labels)
    return avg

def compute_ndcg(system_docs, all_docs_avg, k):
    # Compute DCG for the system's top k docs
    dcg = 0.0
    for i, doc in enumerate(system_docs[:k]):
        rel = all_docs_avg.get(doc, 0.0)
        dcg += rel / np.log2(i + 2)  # i=0 is rank 1 → log2(2)
    # Compute IDCG from the top k docs in all_docs_avg
    sorted_docs = sorted(all_docs_avg.items(), key=lambda x: -x[1])
    ideal_relevances = [rel for doc, rel in sorted_docs[:k]]
    idcg = 0.0
    for i, rel in enumerate(ideal_relevances):
        idcg += rel / np.log2(i + 2)
    return dcg / idcg if idcg > 0 else 0.0

# Collect NDCG scores for each system
system1_ndcg = []
system2_ndcg = []

for q_id, q_data in queries.items():
    annotations = q_data['annotations']
    avg_rel = average_relevance(annotations)
    system1_docs = q_data['system1']
    system2_docs = q_data['system2']
    # Get all docs from both systems (pooling)
    all_docs = list(set(system1_docs + system2_docs))
    # According to the data structure, all retrieved docs are annotated,
    # so no default value is needed here
    all_docs_avg = {doc: avg_rel[doc] for doc in all_docs}
    # k is the number of docs retrieved by each system (assuming same per query)
    k = len(system1_docs)
    # Compute NDCG for each system
    ndcg1 = compute_ndcg(system1_docs, all_docs_avg, k)
    ndcg2 = compute_ndcg(system2_docs, all_docs_avg, k)
    system1_ndcg.append(ndcg1)
    system2_ndcg.append(ndcg2)

# Now, compare the two systems
print("System 1 NDCG scores:", system1_ndcg)
print("System 2 NDCG scores:", system2_ndcg)

# Perform a paired t-test
t_stat, p_value = ttest_rel(system1_ndcg, system2_ndcg)
print(f"Paired t-test: t={t_stat:.4f}, p={p_value:.4f}")

# Alternatively, a Wilcoxon signed-rank test
from scipy.stats import wilcoxon
w_stat, w_p_value = wilcoxon(system1_ndcg, system2_ndcg)
print(f"Wilcoxon test: statistic={w_stat}, p={w_p_value:.4f}")

This code computes the NDCG for each system per query, using the pooled documents for IDCG calculation. Then, it performs statistical tests to compare the two systems.

This approach accounts for the inter-rater disagreement by using averaged relevance scores and ensures a fair comparison by considering all relevant documents retrieved by either system. The statistical tests then determine if the differences in NDCG scores are significant.

Another consideration is the handling of tie-breakers in the ideal ranking. For example, if two documents have the same average relevance, their order in the ideal ranking doesn't affect the IDCG.

In summary, the steps are:

  1. For each query, average the relevance annotations for each document.

  2. Pool all documents retrieved by either system for the query.

  3. Compute NDCG for each system's ranking using the average relevance and the pooled ideal ranking.

  4. Compare the systems using statistical tests on the per-query NDCG scores.

This approach effectively accounts for low inter-rater agreement by incorporating the uncertainty into the relevance scores and provides a statistically sound comparison between the two retrieval systems.

Response

To evaluate two retrieval systems with low inter-rater agreement, we use averaged relevance scores from multiple annotators and compute NDCG@k for each system. We then compare systems using statistical tests. Here's the Python code:

import numpy as np
from scipy.stats import ttest_rel, wilcoxon

# Example data structure
queries = {
    'q1': {
        'system1': ['doc1', 'doc2', 'doc3'],
        'system2': ['doc4', 'doc5', 'doc6'],
        'annotations': {
            'doc1': [1, 0],   # avg: 0.5
            'doc2': [1, 1],   # avg: 1.0
            'doc3': [0, 0],   # avg: 0.0
            'doc4': [1, 0, 1],# avg: 0.666
            'doc5': [0, 0, 0],# avg: 0.0
            'doc6': [1, 1, 0] # avg: 0.666
        }
    },
    'q2': {
        'system1': ['doc7', 'doc8', 'doc9'],
        'system2': ['doc10', 'doc11', 'doc12'],
        'annotations': {
            'doc7': [1, 0],    # avg: 0.5
            'doc8': [0, 1],    # avg: 0.5
            'doc9': [1, 1],    # avg: 1.0
            'doc10': [0, 0, 1],# avg: 0.333
            'doc11': [1, 0, 1],# avg: 0.666
            'doc12': [0, 1, 0] # avg: 0.333
        }
    }
}

def average_relevance(annotations):
    return {doc: sum(labels)/len(labels) for doc, labels in annotations.items()}

def compute_ndcg(system_docs, all_docs_avg, k):
    # DCG of the system's ranking
    dcg = sum( (all_docs_avg[doc] / np.log2(i + 2)) for i, doc in enumerate(system_docs[:k]) )
    # Ideal ranking from all pooled docs
    ideal_sorted = sorted(all_docs_avg.values(), reverse=True)[:k]
    idcg = sum( (rel / np.log2(i + 2)) for i, rel in enumerate(ideal_sorted) )
    return dcg / idcg if idcg > 0 else 0.0

system1_scores = []
system2_scores = []

for q_id, q_data in queries.items():
    avg_rel = average_relevance(q_data['annotations'])
    system1_docs = q_data['system1']
    system2_docs = q_data['system2']
    all_docs = list(set(system1_docs + system2_docs))
    all_docs_avg = {doc: avg_rel[doc] for doc in all_docs}
    k = len(system1_docs)  # Assuming both systems retrieve the same number of docs
    
    ndcg1 = compute_ndcg(system1_docs, all_docs_avg, k)
    ndcg2 = compute_ndcg(system2_docs, all_docs_avg, k)
    
    system1_scores.append(ndcg1)
    system2_scores.append(ndcg2)

print(f"System 1 Average NDCG: {np.mean(system1_scores):.4f}")
print(f"System 2 Average NDCG: {np.mean(system2_scores):.4f}")

# Paired t-test
t_stat, p_value = ttest_rel(system1_scores, system2_scores)
print(f"Paired t-test: t = {t_stat:.4f}, p = {p_value:.4f}")

# Wilcoxon signed-rank test
w_stat, w_p = wilcoxon(system1_scores, system2_scores)
print(f"Wilcoxon Test: statistic = {w_stat}, p = {w_p:.4f}")

Explanation:

  1. Averaging Relevance Scores: Each document's relevance score is averaged across annotators, converting disagreements into a continuous measure.
  2. Pooled Evaluation: For each query, all documents retrieved by either system are pooled to compute the ideal NDCG, ensuring fair comparison.
  3. Statistical Testing: Paired t-test and Wilcoxon signed-rank test determine if differences in NDCG scores are statistically significant, accounting for variability across queries.

This approach mitigates the impact of low inter-rater agreement by incorporating annotator uncertainty into the evaluation metrics.
