rag-reranking-gpt-colbert.ipynb
@truebit If I have done it right, you need to add:

# Add these lines (tokenizer, model, maxsim, and splits are defined earlier in the notebook)
query = "Your query in string format..."
query_encoding = tokenizer(query, return_tensors='pt', truncation=True, max_length=512)
query_embedding = model(**query_encoding).last_hidden_state.squeeze(0)

# Get a MaxSim score for each document
for document in splits:
    document_encoding = tokenizer(document, return_tensors='pt', truncation=True, max_length=512)
    document_embedding = model(**document_encoding).last_hidden_state
    # Calculate the MaxSim score between query tokens and document tokens
    score = maxsim(query_embedding.unsqueeze(0), document_embedding)
    ...
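
For context, maxsim can be sketched roughly as below. The notebook defines its own version; this is only a minimal illustration, assuming cosine similarity between L2-normalized token embeddings, with ColBERT's sum-of-max reduction:

import torch
import torch.nn.functional as F

def maxsim(query_embedding, document_embedding):
    # query_embedding:    (1, n_query_tokens, dim)
    # document_embedding: (1, n_doc_tokens, dim)
    q = F.normalize(query_embedding, p=2, dim=-1)
    d = F.normalize(document_embedding, p=2, dim=-1)
    # Pairwise cosine similarities: (1, n_query_tokens, n_doc_tokens)
    sim = q @ d.transpose(-1, -2)
    # For each query token, keep its best-matching document token, then sum
    return sim.max(dim=-1).values.sum(dim=-1)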
@Psancs05 thx
Great catch - updated 🙏
@virattt Do you know the difference between using:

query_embedding = model(**query_encoding).last_hidden_state.squeeze(0)

and

query_embedding = model(**query_encoding).last_hidden_state.mean(dim=1)

I have tested both, and it seems that squeeze(0) returns better-quality similar documents (maybe it's just the use case I tried).
query_embedding = model(**query_encoding).last_hidden_state.squeeze(0)

is correct, since it returns one vector per token, whilst

query_embedding = model(**query_encoding).last_hidden_state.mean(dim=1)

returns a single vector averaged over all tokens, which discards the token-level detail that MaxSim's late interaction relies on.
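
A quick shape check makes the difference concrete. bert-base-uncased is used here purely for illustration; the notebook loads its own checkpoint:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

encoding = tokenizer("What is ColBERT?", return_tensors='pt')
hidden = model(**encoding).last_hidden_state    # (1, num_tokens, 768)

per_token = hidden.squeeze(0)                   # (num_tokens, 768): one vector per token
pooled    = hidden.mean(dim=1)                  # (1, 768): a single averaged vector

# MaxSim needs the per-token matrix; mean-pooling collapses it to one vector.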
Thanks for sharing, but the query_embedding variable is missing its assignment statement.