@quicksilver0
Last active March 12, 2024 12:47
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"gpuType": "T4"
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
},
"accelerator": "GPU"
},
"cells": [
{
"cell_type": "markdown",
"source": [
"Get hsbc bank dataset from this google drive: https://drive.google.com/file/d/1gFVEzyTBgDmMP48SjySA3skq8nwRxZZw/view?usp=sharing"
],
"metadata": {
"id": "-6LGAl2AGj4j"
}
},
{
"cell_type": "code",
"source": [
"filename = 'dataset_hsbc'\n",
"with open(filename, 'r') as f:\n",
" document = f.read()"
],
"metadata": {
"id": "cWtP9m9unZNP"
},
"execution_count": 2,
"outputs": []
},
{
"cell_type": "code",
"source": [
"document_chunks = document.split('########')\n",
"print('Number of texts:', len(document_chunks))"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "cbWUXpnVobB9",
"outputId": "7b25a9f2-af4a-49dd-b294-3064317883cf"
},
"execution_count": 3,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Number of texts: 565\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"## RAG (Retrieval Augmented Generation)\n",
"Retriving relevant HSBC information.\n",
"1. Define query.\n",
"2. Encode into embedding space. Retrieve most similar chunks of text.\n",
"3. Improve retrieved chunks of text via reranker - rearrange it in an improved order.\n",
"\n",
"Excellent open source embeddings: https://github.com/FlagOpen/FlagEmbedding"
],
"metadata": {
"id": "3klGmAwEDxDY"
}
},
{
"cell_type": "code",
"source": [
"pip install FlagEmbedding"
],
"metadata": {
"id": "OqVExC56qYXF"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"from FlagEmbedding import FlagModel, FlagReranker\n",
"import numpy as np"
],
"metadata": {
"id": "nrBLEiWqq16F"
},
"execution_count": 13,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Initialize embedding model."
],
"metadata": {
"id": "1BlAqzrbTLi4"
}
},
{
"cell_type": "code",
"source": [
"embeddings_model_bge = FlagModel('BAAI/bge-base-en-v1.5',\n",
" use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation"
],
"metadata": {
"id": "maa9zPqp1jKR"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"query = \"how do i get student loan\""
],
"metadata": {
"id": "3VwI3SmcZhca"
},
"execution_count": 9,
"outputs": []
},
{
"cell_type": "code",
"source": [
"## encode dataset\n",
"embeddings_bge = embeddings_model_bge.encode(document_chunks)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "038d455b-b26f-4a10-e9ae-ccabe1310429",
"id": "5mgQaML-3dYm"
},
"execution_count": 10,
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": [
"Inference Embeddings: 100%|██████████| 3/3 [00:07<00:00, 2.38s/it]\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"emb_bge_query = embeddings_model_bge.encode([query])"
],
"metadata": {
"id": "oROlBTnm0hYC"
},
"execution_count": 11,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Getting similarity scores via dot product\n",
"scores = emb_bge_query @ embeddings_bge.T\n",
"scores = np.squeeze(scores)"
],
"metadata": {
"id": "NOGFWBg-0rp7"
},
"execution_count": 15,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Sort scores from large to small and obtaining indexes of them\n",
"max_idx = np.argsort(-scores)\n",
"max_idx[:8]"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "ggpWrw8N896o",
"outputId": "6029604e-dc4e-4f79-ea14-6f81c80c6163"
},
"execution_count": 18,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"array([200, 452, 196, 253, 252, 176, 201, 72])"
]
},
"metadata": {},
"execution_count": 18
}
]
},
{
"cell_type": "code",
"source": [
"print(f\"Query: {query}\\n\")\n",
"context_chunks_init = []\n",
"context_scores = []\n",
"for idx in max_idx[:8]:\n",
" print(f\"Score: {scores[idx]:.3f}\")\n",
" print(document_chunks[idx].split('\\n')[1])\n",
" print(\"--------\")\n",
" context_chunks_init.append(document_chunks[idx])\n",
" context_scores.append(scores[idx])"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "fuRPAA_v2Dfd",
"outputId": "1ede5358-8e4d-4db9-d731-98fa29c1216a"
},
"execution_count": 25,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Query: how do i get student loan\n",
"\n",
"Score: 0.777\n",
"Borrowing money as a student \n",
"--------\n",
"Score: 0.735\n",
"Paying off your student loan early \n",
"--------\n",
"Score: 0.720\n",
"How to budget as a student \n",
"--------\n",
"Score: 0.715\n",
"What is a loan? \n",
"--------\n",
"Score: 0.709\n",
"Tips to successfully apply for a loan \n",
"--------\n",
"Score: 0.708\n",
"How much can I borrow? \n",
"--------\n",
"Score: 0.702\n",
"Should you get a student credit card? \n",
"--------\n",
"Score: 0.701\n",
"Ways to borrow \n",
"--------\n"
]
}
]
}
]
}
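
Step 3 of the notebook's outline (reranking) is described and `FlagReranker` is imported, but no cell above actually runs it. The sketch below is one way that step might look, not the author's implementation. The function `rerank` and the stand-in scorer `toy_overlap_scorer` are hypothetical names introduced here for illustration. The scorer only counts shared words so that the example runs without downloading a model. In the notebook, one would presumably pass something like `FlagReranker('BAAI/bge-reranker-base', use_fp16=True).compute_score` instead (the model name is an assumption).

```python
from typing import Callable, List, Optional, Sequence, Tuple


def rerank(
    query: str,
    chunks: Sequence[str],
    score_pairs: Callable[[List[List[str]]], object],
    top_k: Optional[int] = None,
) -> List[Tuple[str, float]]:
    """Reorder retrieved chunks by a pairwise relevance score, best first.

    `score_pairs` takes a list of [query, chunk] pairs and returns one score
    per pair (higher means more relevant).
    """
    pairs = [[query, chunk] for chunk in chunks]
    scores = score_pairs(pairs)
    # Some scorers return a bare number when given a single pair.
    if isinstance(scores, (int, float)):
        scores = [scores]
    scores = [float(s) for s in scores]
    # sorted() is stable, so tied chunks keep their retrieval order.
    order = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    return [(chunks[i], scores[i]) for i in order]


def toy_overlap_scorer(pairs: List[List[str]]) -> List[float]:
    """Hypothetical stand-in for a cross-encoder: counts shared lowercase words."""
    return [
        float(len(set(q.lower().split()) & set(p.lower().split())))
        for q, p in pairs
    ]
```

With the notebook's variables, a call would look like `rerank(query, context_chunks_init, toy_overlap_scorer, top_k=3)`. Keeping the scorer as a parameter means the toy function can be swapped for a real cross-encoder without touching the sorting logic.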