@PrithivirajDamodaran
Created April 28, 2024 03:39
BGE-M3 Spot check on MIRACL Hindi Dev set
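The log below is from evaluating BAAI/bge-m3 dense retrieval on the MIRACL Hindi dev set (350 queries against a ~506K-passage corpus), launched through torch.distributed.launch with a Hugging Face Trainer. The exact evaluation script is not shown here; as a minimal, non-authoritative sketch of the same kind of spot check, dense scores can be produced with the FlagEmbedding package's BGEM3FlagModel (API as documented on the BGE-M3 model card; the Hindi query and passages are placeholders, not MIRACL data):

from FlagEmbedding import BGEM3FlagModel

# Load BGE-M3; use_fp16 speeds up encoding at a small accuracy cost.
model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

queries = ["भारत की राजधानी क्या है?"]            # placeholder query
passages = ["नई दिल्ली भारत की राजधानी है।",
            "गंगा भारत की सबसे लंबी नदी है।"]      # placeholder passages

q_emb = model.encode(queries, batch_size=32, max_length=512)["dense_vecs"]
p_emb = model.encode(passages, batch_size=32, max_length=8192)["dense_vecs"]

# Dense vectors are normalized, so the dot product is the relevance score.
scores = q_emb @ p_emb.T
print(scores)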
/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py:183: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
2024-04-27 16:05:43.073043: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-04-27 16:05:43.125561: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-27 16:05:43.125608: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-27 16:05:43.127388: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-27 16:05:43.135621: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-04-27 16:05:44.214271: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Downloading config.json: 100% 687/687 [00:00<00:00, 3.71MB/s]
[INFO|configuration_utils.py:666] 2024-04-27 16:05:46,547 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--BAAI--bge-m3/snapshots/babcf60cae0a1f438d7ade582983d4ba462303c2/config.json
[INFO|configuration_utils.py:718] 2024-04-27 16:05:46,548 >> Model config XLMRobertaConfig {
"_name_or_path": "BAAI/bge-m3",
"architectures": [
"XLMRobertaModel"
],
"attention_probs_dropout_prob": 0.1,
"bos_token_id": 0,
"classifier_dropout": null,
"eos_token_id": 2,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 1024,
"initializer_range": 0.02,
"intermediate_size": 4096,
"layer_norm_eps": 1e-05,
"max_position_embeddings": 8194,
"model_type": "xlm-roberta",
"num_attention_heads": 16,
"num_hidden_layers": 24,
"output_past": true,
"pad_token_id": 1,
"position_embedding_type": "absolute",
"torch_dtype": "float32",
"transformers_version": "4.26.1",
"type_vocab_size": 1,
"use_cache": true,
"vocab_size": 250002
}
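The config above shows the backbone: a 24-layer, 1024-hidden XLM-RoBERTa encoder with position embeddings extended to 8194, i.e. roughly 8K usable tokens. A minimal sketch of loading the same backbone directly with transformers and taking the normalized [CLS] hidden state as the dense embedding, which is BGE-M3's documented dense-retrieval pooling (the example sentence is a placeholder):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
model = AutoModel.from_pretrained("BAAI/bge-m3")   # loads the XLMRobertaModel above
model.eval()

inputs = tokenizer(["यह एक उदाहरण वाक्य है।"], return_tensors="pt",
                   truncation=True, max_length=8192)
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state      # (1, seq_len, 1024)
cls = hidden[:, 0]                                  # [CLS] token state
dense = torch.nn.functional.normalize(cls, dim=-1)  # unit-norm dense embedding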
Downloading tokenizer_config.json: 100% 444/444 [00:00<00:00, 2.77MB/s]
Downloading (…)tencepiece.bpe.model: 100% 5.07M/5.07M [00:00<00:00, 72.4MB/s]
Downloading tokenizer.json: 100% 17.1M/17.1M [00:00<00:00, 127MB/s]
Downloading (…)cial_tokens_map.json: 100% 964/964 [00:00<00:00, 6.36MB/s]
[INFO|tokenization_utils_base.py:1802] 2024-04-27 16:05:48,301 >> loading file sentencepiece.bpe.model from cache at /root/.cache/huggingface/hub/models--BAAI--bge-m3/snapshots/babcf60cae0a1f438d7ade582983d4ba462303c2/sentencepiece.bpe.model
[INFO|tokenization_utils_base.py:1802] 2024-04-27 16:05:48,301 >> loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--BAAI--bge-m3/snapshots/babcf60cae0a1f438d7ade582983d4ba462303c2/tokenizer.json
[INFO|tokenization_utils_base.py:1802] 2024-04-27 16:05:48,301 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:1802] 2024-04-27 16:05:48,301 >> loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--BAAI--bge-m3/snapshots/babcf60cae0a1f438d7ade582983d4ba462303c2/special_tokens_map.json
[INFO|tokenization_utils_base.py:1802] 2024-04-27 16:05:48,301 >> loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--BAAI--bge-m3/snapshots/babcf60cae0a1f438d7ade582983d4ba462303c2/tokenizer_config.json
Downloading pytorch_model.bin: 100% 2.27G/2.27G [00:10<00:00, 217MB/s]
[INFO|modeling_utils.py:2275] 2024-04-27 16:06:00,102 >> loading weights file pytorch_model.bin from cache at /root/.cache/huggingface/hub/models--BAAI--bge-m3/snapshots/babcf60cae0a1f438d7ade582983d4ba462303c2/pytorch_model.bin
[INFO|configuration_utils.py:543] 2024-04-27 16:06:01,464 >> Generate config GenerationConfig {
"bos_token_id": 0,
"eos_token_id": 2,
"pad_token_id": 1,
"transformers_version": "4.26.1"
}
[INFO|modeling_utils.py:2860] 2024-04-27 16:06:06,569 >> All model checkpoint weights were used when initializing XLMRobertaDense.
[WARNING|modeling_utils.py:2862] 2024-04-27 16:06:06,569 >> Some weights of XLMRobertaDense were not initialized from the model checkpoint at BAAI/bge-m3 and are newly initialized: ['embeddings.position_ids']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[INFO|modeling_utils.py:2522] 2024-04-27 16:06:06,742 >> Generation config file not found, using a generation config created from the model config.
Encoding Queries...
[INFO|trainer.py:2964] 2024-04-27 16:06:07,475 >> ***** Running Prediction *****
[INFO|trainer.py:2966] 2024-04-27 16:06:07,475 >> Num examples = 350
[INFO|trainer.py:2969] 2024-04-27 16:06:07,476 >> Batch size = 32
/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
self.pid = os.fork()
506264it [00:02, 200781.80it/s]
Sorting Corpus by document length (Longest first)...
Use len(str.split()) to sort the corpus for efficiently encoding
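The two lines above note that the corpus is sorted by whitespace-token count, longest first, so each batch holds documents of similar length and padding overhead stays low. A small sketch of that sorting step, assuming a BEIR-style corpus dict of {doc_id: {"title", "text"}} (the documents shown are placeholders):

# Hypothetical sketch: sort a BEIR-style corpus by document length, longest first,
# so documents in the same encoding batch have similar lengths.
corpus = {
    "doc1": {"title": "", "text": "छोटा दस्तावेज़"},
    "doc2": {"title": "", "text": "यह एक लंबा दस्तावेज़ है जिसमें कहीं अधिक शब्द हैं"},
}

# len(str.split()) counts whitespace-separated tokens, as the log message suggests.
sorted_ids = sorted(
    corpus,
    key=lambda doc_id: len(corpus[doc_id].get("title", "").split())
    + len(corpus[doc_id]["text"].split()),
    reverse=True,  # longest first
)
sorted_corpus = [corpus[doc_id] for doc_id in sorted_ids]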
Split corpus encoding: 0% 0/6 [00:00<?, ?it/s]
[INFO|trainer.py:2964] 2024-04-27 16:06:16,364 >> ***** Running Prediction *****
[INFO|trainer.py:2966] 2024-04-27 16:06:16,364 >> Num examples = 84378
[INFO|trainer.py:2969] 2024-04-27 16:06:16,364 >> Batch size = 32
Split corpus encoding: 17% 1/6 [51:21<4:16:49, 3081.85s/it]
[INFO|trainer.py:2964] 2024-04-27 16:57:38,275 >> ***** Running Prediction *****
[INFO|trainer.py:2966] 2024-04-27 16:57:38,275 >> Num examples = 84378
[INFO|trainer.py:2969] 2024-04-27 16:57:38,275 >> Batch size = 32
Split corpus encoding: 33% 2/6 [1:19:16<2:30:16, 2254.18s/it]
[INFO|trainer.py:2964] 2024-04-27 17:25:33,077 >> ***** Running Prediction *****
[INFO|trainer.py:2966] 2024-04-27 17:25:33,077 >> Num examples = 84377
[INFO|trainer.py:2969] 2024-04-27 17:25:33,077 >> Batch size = 32
Split corpus encoding: 50% 3/6 [1:37:40<1:26:26, 1728.81s/it]
[INFO|trainer.py:2964] 2024-04-27 17:43:56,688 >> ***** Running Prediction *****
[INFO|trainer.py:2966] 2024-04-27 17:43:56,688 >> Num examples = 84377
[INFO|trainer.py:2969] 2024-04-27 17:43:56,688 >> Batch size = 32
Split corpus encoding: 67% 4/6 [1:49:28<44:11, 1325.80s/it]
[INFO|trainer.py:2964] 2024-04-27 17:55:44,668 >> ***** Running Prediction *****
[INFO|trainer.py:2966] 2024-04-27 17:55:44,668 >> Num examples = 84377
[INFO|trainer.py:2969] 2024-04-27 17:55:44,668 >> Batch size = 32
Split corpus encoding: 83% 5/6 [1:56:58<16:50, 1010.03s/it]
[INFO|trainer.py:2964] 2024-04-27 18:03:14,790 >> ***** Running Prediction *****
[INFO|trainer.py:2966] 2024-04-27 18:03:14,790 >> Num examples = 84377
[INFO|trainer.py:2969] 2024-04-27 18:03:14,790 >> Batch size = 32
Split corpus encoding: 100% 6/6 [2:01:55<00:00, 1219.18s/it]
Batch search: 100% 1/1 [00:00<00:00, 11.57it/s]
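The batch-search step scores every query embedding against every corpus embedding and keeps the top-ranked documents per query. A minimal exact-search sketch with NumPy, assuming both embedding matrices are L2-normalized so the dot product equals cosine similarity (array and function names are illustrative):

import numpy as np

def dense_search(query_emb: np.ndarray, doc_emb: np.ndarray, top_k: int = 1000):
    # query_emb: (num_queries, dim), doc_emb: (num_docs, dim); top_k < num_docs.
    scores = query_emb @ doc_emb.T                              # cosine similarity
    top_idx = np.argpartition(-scores, top_k, axis=1)[:, :top_k]  # unsorted top-k ids
    # Re-sort the top-k candidates of each query by score, descending.
    rows = np.arange(scores.shape[0])[:, None]
    order = np.argsort(-scores[rows, top_idx], axis=1)
    top_idx = np.take_along_axis(top_idx, order, axis=1)
    return top_idx, np.take_along_axis(scores, top_idx, axis=1)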
{'NDCG@1': 0.28857, 'NDCG@3': 0.28552, 'NDCG@5': 0.29595, 'NDCG@10': 0.32177, 'NDCG@100': 0.37571, 'NDCG@1000': 0.40721}
{'MAP@1': 0.14605, 'MAP@3': 0.21954, 'MAP@5': 0.23868, 'MAP@10': 0.25409, 'MAP@100': 0.26767, 'MAP@1000': 0.2692}
{'Recall@10': 0.40017, 'Recall@50': 0.55551, 'Recall@100': 0.61155, 'Recall@200': 0.69742, 'Recall@500': 0.78185, 'Recall@1000': 0.82583}
{'P@1': 0.28857, 'P@3': 0.19238, 'P@5': 0.14114, 'P@10': 0.08714, 'P@100': 0.01383, 'P@1000': 0.00179}
{'MRR@10': 0.37349, 'MRR@100': 0.38027, 'MRR@1000': 0.38097}
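For reference on reading these numbers: MRR@10 is the mean, over the 350 queries, of the reciprocal rank of the first relevant passage within the top 10 retrieved, while NDCG, MAP, Recall and P are typically computed with pytrec_eval in BEIR-style evaluations. A hand-rolled MRR@k sketch over qrels/run dicts (the dict structures are assumed, not taken from this run):

def mrr_at_k(qrels: dict, results: dict, k: int = 10) -> float:
    # qrels:   {query_id: {doc_id: relevance, ...}}
    # results: {query_id: {doc_id: score, ...}} from the retriever
    total = 0.0
    for qid, doc_scores in results.items():
        relevant = {d for d, rel in qrels.get(qid, {}).items() if rel > 0}
        ranked = sorted(doc_scores, key=doc_scores.get, reverse=True)[:k]
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank      # reciprocal rank of first relevant hit
                break
    return round(total / len(results), 5)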