Created April 8, 2023 06:58
llama_index_public.ipynb
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"authorship_tag": "ABX9TyM/S+q86umB0Nsri/Z87hjh",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/donvito/bf3575f7d8d87d39e15301da9ee3e9eb/llama_index_public.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"source": [
"# **Install Dependencies**"
],
"metadata": {
"id": "R5SQMrNJeqil"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "mCnB4zUGecU6"
},
"outputs": [],
"source": [
"!pip install llama-index PyPDF2"
]
},
{
"cell_type": "markdown",
"source": [
"# **Prepare your Data**"
],
"metadata": {
"id": "hCCrHiLxUzvj"
}
},
{
"cell_type": "markdown",
"source": [
"Before you run the sample code, create a `data` folder in Google Colab, then upload all the PDFs you want to index or search.\n",
"\n",
"For this example, I downloaded my resume from LinkedIn as a PDF and uploaded it to the data folder as **Melvin Resume.pdf**."
],
"metadata": {
"id": "mi7xc9vwWG00"
}
},
{
"cell_type": "markdown",
"source": [
"# **Llama Index Code**"
],
"metadata": {
"id": "VZZV0o7yev6t"
}
},
{
"cell_type": "markdown",
"source": [
"Index the documents and save the index to disk. You only need to run this step once: re-running it repeats the OpenAI embedding API calls that GPTSimpleVectorIndex.from_documents() makes under the hood, which costs money."
],
"metadata": {
"id": "HgAb5WjsA-do"
}
},
{
"cell_type": "code",
"source": [
"import os\n",
"os.environ[\"OPENAI_API_KEY\"] = ''\n",
"\n",
"from llama_index import LLMPredictor, GPTSimpleVectorIndex, PromptHelper, ServiceContext, SimpleDirectoryReader\n",
"from langchain.chat_models import ChatOpenAI\n",
"\n",
"# set maximum input size\n",
"max_input_size = 4096\n",
"# set number of output tokens\n",
"num_output = 256\n",
"# set maximum chunk overlap\n",
"max_chunk_overlap = 20\n",
"# set temperature\n",
"temperature = 0\n",
"\n",
"# define the LLM; text-davinci-003 can give better results but is more expensive\n",
"# Reference: https://platform.openai.com/docs/models\n",
"model_name = 'gpt-3.5-turbo'\n",
"llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=temperature, model_name=model_name, max_tokens=num_output))\n",
"\n",
"prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)\n",
"\n",
"service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper)\n",
"\n",
"indexFile = 'index.json'\n",
"\n",
"# START indexing part - comment this block out if you already have the index saved to disk\n",
"documents = SimpleDirectoryReader('data').load_data()\n",
"index = GPTSimpleVectorIndex.from_documents(\n",
"    documents, service_context=service_context\n",
")\n",
"index.save_to_disk(indexFile)\n",
"# END indexing part\n",
"\n",
"# load from disk\n",
"index = GPTSimpleVectorIndex.load_from_disk(indexFile)\n"
],
"metadata": {
"id": "6aeTTQ4CemMo"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Run a query against the index."
],
"metadata": {
"id": "-kgQrSzSBGFt"
}
},
{
"cell_type": "code",
"source": [
"response = index.query(\"Which companies did Melvin work for in the Philippines?\", service_context=service_context)\n",
"print(llm_predictor.last_token_usage)\n",
"print(response)"
],
"metadata": {
"id": "gpPvJH36fhYe"
},
"execution_count": null,
"outputs": []
}
]
}
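The three PromptHelper numbers in the notebook (max_input_size=4096, num_output=256, max_chunk_overlap=20) work together: the prompt must leave room for the model's answer, and consecutive chunks share some tokens of context. This is an illustrative sketch of that budget arithmetic, not llama_index's actual implementation; `chunk_tokens` is a hypothetical helper.

```python
# Illustrative sketch (not llama_index's internal code) of how the
# PromptHelper parameters interact when splitting text into chunks.

def chunk_tokens(tokens, max_input_size, num_output, overlap):
    """Split a token list into chunks that fit the prompt budget.

    Each chunk holds at most (max_input_size - num_output) tokens, and
    consecutive chunks repeat the last `overlap` tokens for continuity.
    """
    chunk_size = max_input_size - num_output   # room left for the answer
    if chunk_size <= overlap:
        raise ValueError("chunk size must exceed the overlap")
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
        start += chunk_size - overlap          # step back by the overlap
    return chunks

# With the notebook's settings, each chunk may hold 4096 - 256 = 3840 tokens.
tokens = list(range(10000))                    # stand-in for tokenized text
chunks = chunk_tokens(tokens, max_input_size=4096, num_output=256, overlap=20)
```

With a larger num_output the per-chunk budget shrinks, which is why raising max_tokens on the LLM without also adjusting PromptHelper can lead to context-window errors.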
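The notebook asks you to comment out the indexing block once the index is saved, so you do not pay for the embedding calls twice. A common alternative is to guard the build with an existence check on the index file. Below is a minimal stdlib sketch of that pattern; `load_or_build_index`, `fake_build`, and `index_demo.json` are hypothetical stand-ins for the GPTSimpleVectorIndex save/load calls.

```python
import os, json

# Sketch of the "index once" pattern: only pay for the costly build
# (the embedding API calls) when no saved index exists on disk.

def load_or_build_index(index_file, build_index):
    if not os.path.exists(index_file):          # first run: build and save
        index = build_index()
        with open(index_file, "w") as f:
            json.dump(index, f)
        return index
    with open(index_file) as f:                 # later runs: just load
        return json.load(f)

calls = []
def fake_build():
    calls.append(1)                             # counts costly build calls
    return {"docs": ["Melvin Resume.pdf"]}

path = "index_demo.json"
if os.path.exists(path):
    os.remove(path)
first = load_or_build_index(path, fake_build)   # builds and saves
second = load_or_build_index(path, fake_build)  # loads from disk, no rebuild
os.remove(path)
```

With this guard in place, the indexing block never needs to be commented out by hand; deleting index.json is enough to force a rebuild.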