In the context of an information retrieval system, I have human-annotated data in the form of query - retrieved documents. The documents are labeled according to their relevance for the given query. Each document is annotated by two to three annotators. How can I evaluate two retrieval systems with this test data while accounting for low inter-rater agreement? Provide Python code as an example.
Okay, so I need to figure out how to evaluate two retrieval systems when there's low inter-rater agreement in the annotations. Let's start by understanding the problem. The user has queries and retrieved documents, each document labeled by 2-3 annotators for relevance. But since the annotators don't agree much, traditional evaluation metrics like average precision or NDCG might not be reliable because they depend on consistent relevance labels.
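To make that dependence concrete, here is a minimal sketch showing how NDCG for the same system ranking changes depending on whose labels are treated as ground truth. The scores and relevance grades below are hypothetical; in practice they would come from the retrieval systems and the annotated test data.

```python
# Minimal sketch: NDCG for one ranking under two annotators' labels.
import numpy as np
from sklearn.metrics import ndcg_score

# Scores the retrieval system assigned to 5 documents for one query (hypothetical).
system_scores = np.array([[0.9, 0.8, 0.6, 0.4, 0.2]])

# Relevance grades (0 = not relevant, 1 = partially relevant, 2 = relevant)
# from two annotators who disagree on several documents (hypothetical).
annotator_a = np.array([[2, 0, 1, 0, 0]])
annotator_b = np.array([[1, 2, 0, 1, 0]])

print("NDCG vs annotator A:", ndcg_score(annotator_a, system_scores))
print("NDCG vs annotator B:", ndcg_score(annotator_b, system_scores))
# The gap between these two numbers is the kind of instability that
# low inter-rater agreement introduces into the evaluation.
```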
First, I should recall what inter-rater agreement means. Low agreement suggests that the relevance judgments are subjective or ambiguous. So, using a single aggregated "ground truth" label per document (for example, a majority vote) could hide that ambiguity rather than account for it.
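Before choosing an evaluation strategy, it can help to quantify how low the agreement actually is. Below is a minimal sketch that averages pairwise Cohen's kappa over annotator pairs; the data layout (one row per query-document pair, one column per annotator, NaN where an annotator did not label the item) is an assumption for illustration.

```python
# Minimal sketch: average pairwise Cohen's kappa across 2-3 annotators.
from itertools import combinations

import numpy as np
from sklearn.metrics import cohen_kappa_score

# rows: query-document pairs, columns: annotators (NaN = not annotated).
# These values are hypothetical placeholders for the real annotations.
labels = np.array([
    [2, 1, np.nan],
    [0, 0, 1],
    [1, 2, 2],
    [0, 1, np.nan],
    [2, 2, 1],
])

kappas = []
for i, j in combinations(range(labels.shape[1]), 2):
    # Keep only items labeled by both annotators of this pair.
    mask = ~np.isnan(labels[:, i]) & ~np.isnan(labels[:, j])
    if mask.sum() > 1:
        kappas.append(
            cohen_kappa_score(labels[mask, i].astype(int),
                              labels[mask, j].astype(int))
        )

print("Mean pairwise Cohen's kappa:", np.mean(kappas))
```

Since the number of annotators varies per document, a chance-corrected coefficient that tolerates missing ratings (such as Krippendorff's alpha) would also be a natural choice here; the pairwise-kappa average is just the simplest option with standard scikit-learn tooling.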