@PrithivirajDamodaran
Created April 28, 2024 03:39
BGE-M3 Spot check on MIRACL Hindi Dev set
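The log below is from evaluating BAAI/bge-m3 dense retrieval on the MIRACL Hindi dev set (350 queries against a ~506K-passage corpus), launched through torch.distributed.launch with a Hugging Face Trainer. The exact evaluation script is not shown here; as a minimal, non-authoritative sketch of the same kind of spot check, dense scores can be produced with the FlagEmbedding package's BGEM3FlagModel (API as documented on the BGE-M3 model card; the Hindi query and passages are placeholders, not MIRACL data):

from FlagEmbedding import BGEM3FlagModel

# Load BGE-M3; use_fp16 speeds up encoding at a small accuracy cost.
model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

queries = ["भारत की राजधानी क्या है?"]            # placeholder query
passages = ["नई दिल्ली भारत की राजधानी है।",
            "गंगा भारत की सबसे लंबी नदी है।"]      # placeholder passages

q_emb = model.encode(queries, batch_size=32, max_length=512)["dense_vecs"]
p_emb = model.encode(passages, batch_size=32, max_length=8192)["dense_vecs"]

# Dense vectors are normalized, so the dot product is the relevance score.
scores = q_emb @ p_emb.T
print(scores)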
/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py:183: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
2024-04-27 16:05:43.073043: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-04-27 16:05:43.125561: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-27 16:05:43.125608: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-27 16:05:43.127388: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-27 16:05:43.135621: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-04-27 16:05:44.214271: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Downloading config.json: 100% 687/687 [00:00<00:00, 3.71MB/s]
[INFO|configuration_utils.py:666] 2024-04-27 16:05:46,547 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--BAAI--bge-m3/snapshots/babcf60cae0a1f438d7ade582983d4ba462303c2/config.json
[INFO|configuration_utils.py:718] 2024-04-27 16:05:46,548 >> Model config XLMRobertaConfig {
"_name_or_path": "BAAI/bge-m3",
"architectures": [
"XLMRobertaModel"
],
"attention_probs_dropout_prob": 0.1,
"bos_token_id": 0,
"classifier_dropout": null,
"eos_token_id": 2,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 1024,
"initializer_range": 0.02,
"intermediate_size": 4096,
"layer_norm_eps": 1e-05,
"max_position_embeddings": 8194,
"model_type": "xlm-roberta",
"num_attention_heads": 16,
"num_hidden_layers": 24,
"output_past": true,
"pad_token_id": 1,
"position_embedding_type": "absolute",
"torch_dtype": "float32",
"transformers_version": "4.26.1",
"type_vocab_size": 1,
"use_cache": true,
"vocab_size": 250002
}
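The config above shows the backbone: a 24-layer, 1024-hidden XLM-RoBERTa encoder with position embeddings extended to 8194, i.e. roughly 8K usable tokens. A minimal sketch of loading the same backbone directly with transformers and taking the normalized [CLS] hidden state as the dense embedding, which is BGE-M3's documented dense-retrieval pooling (the example sentence is a placeholder):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
model = AutoModel.from_pretrained("BAAI/bge-m3")   # loads the XLMRobertaModel above
model.eval()

inputs = tokenizer(["यह एक उदाहरण वाक्य है।"], return_tensors="pt",
                   truncation=True, max_length=8192)
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state      # (1, seq_len, 1024)
cls = hidden[:, 0]                                  # [CLS] token state
dense = torch.nn.functional.normalize(cls, dim=-1)  # unit-norm dense embedding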
Downloading tokenizer_config.json: 100% 444/444 [00:00<00:00, 2.77MB/s]
Downloading (…)tencepiece.bpe.model: 100% 5.07M/5.07M [00:00<00:00, 72.4MB/s]
Downloading tokenizer.json: 100% 17.1M/17.1M [00:00<00:00, 127MB/s]
Downloading (…)cial_tokens_map.json: 100% 964/964 [00:00<00:00, 6.36MB/s]
[INFO|tokenization_utils_base.py:1802] 2024-04-27 16:05:48,301 >> loading file sentencepiece.bpe.model from cache at /root/.cache/huggingface/hub/models--BAAI--bge-m3/snapshots/babcf60cae0a1f438d7ade582983d4ba462303c2/sentencepiece.bpe.model
[INFO|tokenization_utils_base.py:1802] 2024-04-27 16:05:48,301 >> loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--BAAI--bge-m3/snapshots/babcf60cae0a1f438d7ade582983d4ba462303c2/tokenizer.json
[INFO|tokenization_utils_base.py:1802] 2024-04-27 16:05:48,301 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:1802] 2024-04-27 16:05:48,301 >> loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--BAAI--bge-m3/snapshots/babcf60cae0a1f438d7ade582983d4ba462303c2/special_tokens_map.json
[INFO|tokenization_utils_base.py:1802] 2024-04-27 16:05:48,301 >> loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--BAAI--bge-m3/snapshots/babcf60cae0a1f438d7ade582983d4ba462303c2/tokenizer_config.json
Downloading pytorch_model.bin: 100% 2.27G/2.27G [00:10<00:00, 217MB/s]
[INFO|modeling_utils.py:2275] 2024-04-27 16:06:00,102 >> loading weights file pytorch_model.bin from cache at /root/.cache/huggingface/hub/models--BAAI--bge-m3/snapshots/babcf60cae0a1f438d7ade582983d4ba462303c2/pytorch_model.bin
[INFO|configuration_utils.py:543] 2024-04-27 16:06:01,464 >> Generate config GenerationConfig {
"bos_token_id": 0,
"eos_token_id": 2,
"pad_token_id": 1,
"transformers_version": "4.26.1"
}
[INFO|modeling_utils.py:2860] 2024-04-27 16:06:06,569 >> All model checkpoint weights were used when initializing XLMRobertaDense.
[WARNING|modeling_utils.py:2862] 2024-04-27 16:06:06,569 >> Some weights of XLMRobertaDense were not initialized from the model checkpoint at BAAI/bge-m3 and are newly initialized: ['embeddings.position_ids']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[INFO|modeling_utils.py:2522] 2024-04-27 16:06:06,742 >> Generation config file not found, using a generation config created from the model config.
Encoding Queries...
[INFO|trainer.py:2964] 2024-04-27 16:06:07,475 >> ***** Running Prediction *****
[INFO|trainer.py:2966] 2024-04-27 16:06:07,475 >> Num examples = 350
[INFO|trainer.py:2969] 2024-04-27 16:06:07,476 >> Batch size = 32
/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
self.pid = os.fork()
506264it [00:02, 200781.80it/s]
Sorting Corpus by document length (Longest first)...
Use len(str.split()) to sort the corpus for efficiently encoding
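The two lines above note that the corpus is sorted by whitespace-token count, longest first, so each batch holds documents of similar length and padding overhead stays low. A small sketch of that sorting step, assuming a BEIR-style corpus dict of {doc_id: {"title", "text"}} (the documents shown are placeholders):

# Hypothetical sketch: sort a BEIR-style corpus by document length, longest first,
# so documents in the same encoding batch have similar lengths.
corpus = {
    "doc1": {"title": "", "text": "छोटा दस्तावेज़"},
    "doc2": {"title": "", "text": "यह एक लंबा दस्तावेज़ है जिसमें कहीं अधिक शब्द हैं"},
}

# len(str.split()) counts whitespace-separated tokens, as the log message suggests.
sorted_ids = sorted(
    corpus,
    key=lambda doc_id: len(corpus[doc_id].get("title", "").split())
    + len(corpus[doc_id]["text"].split()),
    reverse=True,  # longest first
)
sorted_corpus = [corpus[doc_id] for doc_id in sorted_ids]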
Split corpus encoding: 0% 0/6 [00:00<?, ?it/s]
[INFO|trainer.py:2964] 2024-04-27 16:06:16,364 >> ***** Running Prediction *****
[INFO|trainer.py:2966] 2024-04-27 16:06:16,364 >> Num examples = 84378
[INFO|trainer.py:2969] 2024-04-27 16:06:16,364 >> Batch size = 32
Split corpus encoding: 17% 1/6 [51:21<4:16:49, 3081.85s/it]
[INFO|trainer.py:2964] 2024-04-27 16:57:38,275 >> ***** Running Prediction *****
[INFO|trainer.py:2966] 2024-04-27 16:57:38,275 >> Num examples = 84378
[INFO|trainer.py:2969] 2024-04-27 16:57:38,275 >> Batch size = 32
Split corpus encoding: 33% 2/6 [1:19:16<2:30:16, 2254.18s/it]
[INFO|trainer.py:2964] 2024-04-27 17:25:33,077 >> ***** Running Prediction *****
[INFO|trainer.py:2966] 2024-04-27 17:25:33,077 >> Num examples = 84377
[INFO|trainer.py:2969] 2024-04-27 17:25:33,077 >> Batch size = 32
Split corpus encoding: 50% 3/6 [1:37:40<1:26:26, 1728.81s/it]
[INFO|trainer.py:2964] 2024-04-27 17:43:56,688 >> ***** Running Prediction *****
[INFO|trainer.py:2966] 2024-04-27 17:43:56,688 >> Num examples = 84377
[INFO|trainer.py:2969] 2024-04-27 17:43:56,688 >> Batch size = 32
Split corpus encoding: 67% 4/6 [1:49:28<44:11, 1325.80s/it]
[INFO|trainer.py:2964] 2024-04-27 17:55:44,668 >> ***** Running Prediction *****
[INFO|trainer.py:2966] 2024-04-27 17:55:44,668 >> Num examples = 84377
[INFO|trainer.py:2969] 2024-04-27 17:55:44,668 >> Batch size = 32
Split corpus encoding: 83% 5/6 [1:56:58<16:50, 1010.03s/it]
[INFO|trainer.py:2964] 2024-04-27 18:03:14,790 >> ***** Running Prediction *****
[INFO|trainer.py:2966] 2024-04-27 18:03:14,790 >> Num examples = 84377
[INFO|trainer.py:2969] 2024-04-27 18:03:14,790 >> Batch size = 32
Split corpus encoding: 100% 6/6 [2:01:55<00:00, 1219.18s/it]
Batch search: 100% 1/1 [00:00<00:00, 11.57it/s]
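The batch-search step scores every query embedding against every corpus embedding and keeps the top-ranked documents per query. A minimal exact-search sketch with NumPy, assuming both embedding matrices are L2-normalized so the dot product equals cosine similarity (array and function names are illustrative):

import numpy as np

def dense_search(query_emb: np.ndarray, doc_emb: np.ndarray, top_k: int = 1000):
    # query_emb: (num_queries, dim), doc_emb: (num_docs, dim); top_k < num_docs.
    scores = query_emb @ doc_emb.T                              # cosine similarity
    top_idx = np.argpartition(-scores, top_k, axis=1)[:, :top_k]  # unsorted top-k ids
    # Re-sort the top-k candidates of each query by score, descending.
    rows = np.arange(scores.shape[0])[:, None]
    order = np.argsort(-scores[rows, top_idx], axis=1)
    top_idx = np.take_along_axis(top_idx, order, axis=1)
    return top_idx, np.take_along_axis(scores, top_idx, axis=1)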
{'NDCG@1': 0.28857, 'NDCG@3': 0.28552, 'NDCG@5': 0.29595, 'NDCG@10': 0.32177, 'NDCG@100': 0.37571, 'NDCG@1000': 0.40721}
{'MAP@1': 0.14605, 'MAP@3': 0.21954, 'MAP@5': 0.23868, 'MAP@10': 0.25409, 'MAP@100': 0.26767, 'MAP@1000': 0.2692}
{'Recall@10': 0.40017, 'Recall@50': 0.55551, 'Recall@100': 0.61155, 'Recall@200': 0.69742, 'Recall@500': 0.78185, 'Recall@1000': 0.82583}
{'P@1': 0.28857, 'P@3': 0.19238, 'P@5': 0.14114, 'P@10': 0.08714, 'P@100': 0.01383, 'P@1000': 0.00179}
{'MRR@10': 0.37349, 'MRR@100': 0.38027, 'MRR@1000': 0.38097}
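For reference on reading these numbers: MRR@10 is the mean, over the 350 queries, of the reciprocal rank of the first relevant passage within the top 10 retrieved, while NDCG, MAP, Recall and P are typically computed with pytrec_eval in BEIR-style evaluations. A hand-rolled MRR@k sketch over qrels/run dicts (the dict structures are assumed, not taken from this run):

def mrr_at_k(qrels: dict, results: dict, k: int = 10) -> float:
    # qrels:   {query_id: {doc_id: relevance, ...}}
    # results: {query_id: {doc_id: score, ...}} from the retriever
    total = 0.0
    for qid, doc_scores in results.items():
        relevant = {d for d, rel in qrels.get(qid, {}).items() if rel > 0}
        ranked = sorted(doc_scores, key=doc_scores.get, reverse=True)[:k]
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank      # reciprocal rank of first relevant hit
                break
    return round(total / len(results), 5)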