BGE-M3 Spot check on MIRACL Hindi Dev set
/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py:183: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
  warnings.warn(
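The FutureWarning above concerns the launcher, not the model: `torch.distributed.launch` still works here, and the migration it asks for is small. A minimal sketch of the change (hypothetical script, not part of this gist):

```python
import os

# Under torchrun, each worker's rank arrives via the environment rather than
# a --local-rank CLI flag; default to 0 for single-process runs.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
```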
2024-04-27 16:05:43.073043: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-04-27 16:05:43.125561: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-27 16:05:43.125608: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-27 16:05:43.127388: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-27 16:05:43.135621: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-04-27 16:05:44.214271: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
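(The oneDNN, cuDNN/cuFFT/cuBLAS factory, and TF-TRT messages above are routine startup noise from a stock Colab-style image; this run only uses PyTorch, so they can be ignored.)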
Downloading config.json: 100% 687/687 [00:00<00:00, 3.71MB/s]
[INFO|configuration_utils.py:666] 2024-04-27 16:05:46,547 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--BAAI--bge-m3/snapshots/babcf60cae0a1f438d7ade582983d4ba462303c2/config.json
[INFO|configuration_utils.py:718] 2024-04-27 16:05:46,548 >> Model config XLMRobertaConfig {
  "_name_or_path": "BAAI/bge-m3",
  "architectures": [
    "XLMRobertaModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 8194,
  "model_type": "xlm-roberta",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "output_past": true,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.26.1",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 250002
}
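Note `max_position_embeddings: 8194`: XLM-RoBERTa offsets positions by 2 for the padding index, so this is BGE-M3's advertised 8192-token context. To reproduce just this config fetch (minimal sketch, assuming only `transformers` is installed):

```python
from transformers import AutoConfig

# Pulls the same config.json shown in the log above from the Hugging Face Hub.
config = AutoConfig.from_pretrained("BAAI/bge-m3")
print(config.hidden_size, config.max_position_embeddings)  # 1024 8194
```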
Downloading tokenizer_config.json: 100% 444/444 [00:00<00:00, 2.77MB/s]
Downloading (…)tencepiece.bpe.model: 100% 5.07M/5.07M [00:00<00:00, 72.4MB/s]
Downloading tokenizer.json: 100% 17.1M/17.1M [00:00<00:00, 127MB/s]
Downloading (…)cial_tokens_map.json: 100% 964/964 [00:00<00:00, 6.36MB/s]
[INFO|tokenization_utils_base.py:1802] 2024-04-27 16:05:48,301 >> loading file sentencepiece.bpe.model from cache at /root/.cache/huggingface/hub/models--BAAI--bge-m3/snapshots/babcf60cae0a1f438d7ade582983d4ba462303c2/sentencepiece.bpe.model
[INFO|tokenization_utils_base.py:1802] 2024-04-27 16:05:48,301 >> loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--BAAI--bge-m3/snapshots/babcf60cae0a1f438d7ade582983d4ba462303c2/tokenizer.json
[INFO|tokenization_utils_base.py:1802] 2024-04-27 16:05:48,301 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:1802] 2024-04-27 16:05:48,301 >> loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--BAAI--bge-m3/snapshots/babcf60cae0a1f438d7ade582983d4ba462303c2/special_tokens_map.json
[INFO|tokenization_utils_base.py:1802] 2024-04-27 16:05:48,301 >> loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--BAAI--bge-m3/snapshots/babcf60cae0a1f438d7ade582983d4ba462303c2/tokenizer_config.json
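These files are the stock XLM-RoBERTa SentencePiece tokenizer (250,002-token vocabulary, per the config above), which covers Hindi natively with no transliteration or preprocessing. Loading it on its own (minimal sketch):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
# Hindi text tokenizes directly into SentencePiece subwords.
print(tokenizer.tokenize("भारत की राजधानी क्या है?"))
```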
Downloading pytorch_model.bin: 100% 2.27G/2.27G [00:10<00:00, 217MB/s]
[INFO|modeling_utils.py:2275] 2024-04-27 16:06:00,102 >> loading weights file pytorch_model.bin from cache at /root/.cache/huggingface/hub/models--BAAI--bge-m3/snapshots/babcf60cae0a1f438d7ade582983d4ba462303c2/pytorch_model.bin
[INFO|configuration_utils.py:543] 2024-04-27 16:06:01,464 >> Generate config GenerationConfig {
  "bos_token_id": 0,
  "eos_token_id": 2,
  "pad_token_id": 1,
  "transformers_version": "4.26.1"
}
[INFO|modeling_utils.py:2860] 2024-04-27 16:06:06,569 >> All model checkpoint weights were used when initializing XLMRobertaDense.
[WARNING|modeling_utils.py:2862] 2024-04-27 16:06:06,569 >> Some weights of XLMRobertaDense were not initialized from the model checkpoint at BAAI/bge-m3 and are newly initialized: ['embeddings.position_ids']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[INFO|modeling_utils.py:2522] 2024-04-27 16:06:06,742 >> Generation config file not found, using a generation config created from the model config.
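The WARNING here is benign: `embeddings.position_ids` is a deterministic position-index buffer, not a learned weight, so the checkpoint loads completely, and the "you should probably TRAIN this model" hint does not apply to an already-trained embedder. Outside a Trainer-based harness like this one, the usual way to run BGE-M3 is through the FlagEmbedding package (sketch, assuming `pip install FlagEmbedding`):

```python
from FlagEmbedding import BGEM3FlagModel

# Loads BAAI/bge-m3 for dense retrieval (sparse/ColBERT outputs are also available).
model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
```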
Encoding Queries...
[INFO|trainer.py:2964] 2024-04-27 16:06:07,475 >> ***** Running Prediction *****
[INFO|trainer.py:2966] 2024-04-27 16:06:07,475 >> Num examples = 350
[INFO|trainer.py:2969] 2024-04-27 16:06:07,476 >> Batch size = 32
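The 350 examples are the MIRACL Hindi dev queries. With the FlagEmbedding sketch above, the equivalent step would look like this (the query text is a placeholder, not from the dataset):

```python
queries = ["भारत की राजधानी क्या है?"]  # 350 dev queries in the real run
q_emb = model.encode(queries, batch_size=32, max_length=512)["dense_vecs"]
print(q_emb.shape)  # (len(queries), 1024)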
/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = os.fork()
506264it [00:02, 200781.80it/s]
Sorting Corpus by document length (Longest first)...
Use len(str.split()) to sort the corpus for efficiently encoding
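The 506,264 lines read here are the MIRACL Hindi corpus passages; they match the six encoding splits below exactly (2 × 84,378 + 4 × 84,377 = 506,264). Sorting longest-first keeps padding within a batch roughly uniform and surfaces out-of-memory failures on the longest documents immediately. A sketch of that sort, assuming a BEIR-style `corpus` dict of `{doc_id: {"title": ..., "text": ...}}`:

```python
# Longest-first order, measured in whitespace tokens as the log message suggests.
doc_ids = sorted(
    corpus,
    key=lambda cid: len(corpus[cid].get("title", "").split())
    + len(corpus[cid]["text"].split()),
    reverse=True,
)
```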
Split corpus encoding: 0% 0/6 [00:00<?, ?it/s]
[INFO|trainer.py:2964] 2024-04-27 16:06:16,364 >> ***** Running Prediction *****
[INFO|trainer.py:2966] 2024-04-27 16:06:16,364 >> Num examples = 84378
[INFO|trainer.py:2969] 2024-04-27 16:06:16,364 >> Batch size = 32
Split corpus encoding: 17% 1/6 [51:21<4:16:49, 3081.85s/it]
[INFO|trainer.py:2964] 2024-04-27 16:57:38,275 >> ***** Running Prediction *****
[INFO|trainer.py:2966] 2024-04-27 16:57:38,275 >> Num examples = 84378
[INFO|trainer.py:2969] 2024-04-27 16:57:38,275 >> Batch size = 32
Split corpus encoding: 33% 2/6 [1:19:16<2:30:16, 2254.18s/it]
[INFO|trainer.py:2964] 2024-04-27 17:25:33,077 >> ***** Running Prediction *****
[INFO|trainer.py:2966] 2024-04-27 17:25:33,077 >> Num examples = 84377
[INFO|trainer.py:2969] 2024-04-27 17:25:33,077 >> Batch size = 32
Split corpus encoding: 50% 3/6 [1:37:40<1:26:26, 1728.81s/it]
[INFO|trainer.py:2964] 2024-04-27 17:43:56,688 >> ***** Running Prediction *****
[INFO|trainer.py:2966] 2024-04-27 17:43:56,688 >> Num examples = 84377
[INFO|trainer.py:2969] 2024-04-27 17:43:56,688 >> Batch size = 32
Split corpus encoding: 67% 4/6 [1:49:28<44:11, 1325.80s/it]
[INFO|trainer.py:2964] 2024-04-27 17:55:44,668 >> ***** Running Prediction *****
[INFO|trainer.py:2966] 2024-04-27 17:55:44,668 >> Num examples = 84377
[INFO|trainer.py:2969] 2024-04-27 17:55:44,668 >> Batch size = 32
Split corpus encoding: 83% 5/6 [1:56:58<16:50, 1010.03s/it]
[INFO|trainer.py:2964] 2024-04-27 18:03:14,790 >> ***** Running Prediction *****
[INFO|trainer.py:2966] 2024-04-27 18:03:14,790 >> Num examples = 84377
[INFO|trainer.py:2969] 2024-04-27 18:03:14,790 >> Batch size = 32
Split corpus encoding: 100% 6/6 [2:01:55<00:00, 1219.18s/it]
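Each split is one Trainer prediction pass over ~84k passages, and the full corpus takes about two hours to embed. Continuing the sketches above (same hypothetical `model`, `corpus`, and `doc_ids` names), the equivalent chunked encoding is:

```python
import numpy as np

# Encode the ~506k passages in 6 chunks so each split's results stay bounded in memory.
texts = [corpus[cid]["text"] for cid in doc_ids]
n_splits = 6
size = (len(texts) + n_splits - 1) // n_splits
doc_emb = np.concatenate([
    model.encode(texts[i:i + size], batch_size=32)["dense_vecs"]
    for i in range(0, len(texts), size)
])
```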
Batch search: 100% 1/1 [00:00<00:00, 11.57it/s]
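With only 350 queries, exhaustive search over all ~506k document embeddings fits in a single batch. BGE-M3's dense vectors come out L2-normalized, so a plain inner product is cosine similarity (sketch, reusing `q_emb` and `doc_emb` from above):

```python
import numpy as np

scores = q_emb @ doc_emb.T                     # (350, 506264) similarity matrix
top_k = np.argsort(-scores, axis=1)[:, :1000]  # top-1000 doc indices per query
```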
{'NDCG@1': 0.28857, 'NDCG@3': 0.28552, 'NDCG@5': 0.29595, 'NDCG@10': 0.32177, 'NDCG@100': 0.37571, 'NDCG@1000': 0.40721}
{'MAP@1': 0.14605, 'MAP@3': 0.21954, 'MAP@5': 0.23868, 'MAP@10': 0.25409, 'MAP@100': 0.26767, 'MAP@1000': 0.2692}
{'Recall@10': 0.40017, 'Recall@50': 0.55551, 'Recall@100': 0.61155, 'Recall@200': 0.69742, 'Recall@500': 0.78185, 'Recall@1000': 0.82583}
{'P@1': 0.28857, 'P@3': 0.19238, 'P@5': 0.14114, 'P@10': 0.08714, 'P@100': 0.01383, 'P@1000': 0.00179}
{'MRR@10': 0.37349, 'MRR@100': 0.38027, 'MRR@1000': 0.38097}
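Headline numbers for the spot check: NDCG@10 = 0.3218, Recall@100 = 0.6116, MRR@10 = 0.3735 on the MIRACL Hindi dev set. Metrics in this shape are typically computed with pytrec_eval, which BEIR-style evaluators wrap; a sketch assuming `qrels` (`{qid: {docid: relevance_int}}`) and `results` (`{qid: {docid: score_float}}`) dicts:

```python
import pytrec_eval

evaluator = pytrec_eval.RelevanceEvaluator(
    qrels, {"ndcg_cut.10", "map_cut.10", "recall.100", "P.10", "recip_rank"}
)
per_query = evaluator.evaluate(results)
# Macro-average a per-query measure, e.g. NDCG@10.
ndcg10 = sum(m["ndcg_cut_10"] for m in per_query.values()) / len(per_query)
```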