In the context of an information retrieval system, I have human-annotated data in the form of query - retrieved documents. The documents are labeled according to their relevance for the given query. Each document is annotated by two to three annotators. How can I evaluate two retrieval systems with this test data while accounting for low inter-rater agreement? Provide Python code as an example.
Okay, so I need to figure out how to evaluate two retrieval systems when there's low inter-rater agreement in the annotations. Let's start by understanding the problem. The user has queries and retrieved documents, each document labeled by 2-3 annotators for relevance. But since the annotators don't agree much, traditional evaluation metrics like average precision or NDCG might not be reliable because they depend on consistent relevance labels.
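To make that dependence concrete, here is a minimal sketch showing how NDCG for the same system ranking changes depending on whose labels are treated as ground truth. The scores and relevance grades below are hypothetical; in practice they would come from the retrieval systems and the annotated test data.

```python
# Minimal sketch: NDCG for one ranking under two annotators' labels.
import numpy as np
from sklearn.metrics import ndcg_score

# Scores the retrieval system assigned to 5 documents for one query (hypothetical).
system_scores = np.array([[0.9, 0.8, 0.6, 0.4, 0.2]])

# Relevance grades (0 = not relevant, 1 = partially relevant, 2 = relevant)
# from two annotators who disagree on several documents (hypothetical).
annotator_a = np.array([[2, 0, 1, 0, 0]])
annotator_b = np.array([[1, 2, 0, 1, 0]])

print("NDCG vs annotator A:", ndcg_score(annotator_a, system_scores))
print("NDCG vs annotator B:", ndcg_score(annotator_b, system_scores))
# The gap between these two numbers is the kind of instability that
# low inter-rater agreement introduces into the evaluation.
```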
First, I should recall what inter-rater agreement means. Low agreement suggests that the relevance judgments are subjective or ambiguous. So, using a single aggregated "ground truth" label per document (for example, a majority vote) could hide that ambiguity rather than account for it.
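Before choosing an evaluation strategy, it can help to quantify how low the agreement actually is. Below is a minimal sketch that averages pairwise Cohen's kappa over annotator pairs; the data layout (one row per query-document pair, one column per annotator, NaN where an annotator did not label the item) is an assumption for illustration.

```python
# Minimal sketch: average pairwise Cohen's kappa across 2-3 annotators.
from itertools import combinations

import numpy as np
from sklearn.metrics import cohen_kappa_score

# rows: query-document pairs, columns: annotators (NaN = not annotated).
# These values are hypothetical placeholders for the real annotations.
labels = np.array([
    [2, 1, np.nan],
    [0, 0, 1],
    [1, 2, 2],
    [0, 1, np.nan],
    [2, 2, 1],
])

kappas = []
for i, j in combinations(range(labels.shape[1]), 2):
    # Keep only items labeled by both annotators of this pair.
    mask = ~np.isnan(labels[:, i]) & ~np.isnan(labels[:, j])
    if mask.sum() > 1:
        kappas.append(
            cohen_kappa_score(labels[mask, i].astype(int),
                              labels[mask, j].astype(int))
        )

print("Mean pairwise Cohen's kappa:", np.mean(kappas))
```

Since the number of annotators varies per document, a chance-corrected coefficient that tolerates missing ratings (such as Krippendorff's alpha) would also be a natural choice here; the pairwise-kappa average is just the simplest option with standard scikit-learn tooling.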