Created
April 12, 2023 10:00
LangChain/LlamaIndexによる独自データとLLMの連携.ipynb
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/gist/kun432/6f465f6163fce76899b3180cc2861fac/langchain-llamaindex-llm.ipynb)
# Connecting Your Own Data to an LLM with LangChain/LlamaIndex

This notebook shows how to have an LLM answer questions based on your own data using LangChain/LlamaIndex. It covers two examples:

- answering from plain text data
- answering from blog content (RSS)

Run the cells in order from top to bottom.
## 0. Setup: install packages and set the API key

Install the required packages and set your OpenAI API key.
```python
!pip install openai langchain llama-index feedparser chromadb
```

```python
import os

OPENAI_API_KEY = ""  #@param{type: "string"}
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
```
## 1. Answering from text data

Embed the text data locally, build a vector index, and have the LLM reference it.

As sample data, we use a text version of the Japanese Prime Minister's policy speech of October 3, 2022. Create a directory and download the file into it.

The data is available here:
https://gist.github.com/kun432/dff61851a88d543dd146052edfbcb27f
```python
import os

os.makedirs("data", exist_ok=True)
!wget https://gist.githubusercontent.com/kun432/dff61851a88d543dd146052edfbcb27f/raw/dacac3ba7fcc17b92b471bd87785ecfdb98d45c1/gistfile1.txt -O data/sample.txt
```
Convert the text data to embeddings and build the index.
```python
!rm -rf .chroma

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

# Split the text into ~500-character chunks
with open("data/sample.txt") as f:
    state_of_the_union = f.read()
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=0)
texts = text_splitter.split_text(state_of_the_union)

# Embed each chunk with OpenAI and store the vectors in Chroma
embeddings = OpenAIEmbeddings()
docsearch = Chroma.from_texts(texts, embeddings, metadatas=[{"source": str(i)} for i in range(len(texts))])
```
Try searching the indexed documents (similarity search).
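Under the hood, a similarity search compares the query's embedding against each chunk's embedding, typically by cosine similarity, and returns the closest chunks. A minimal sketch with tiny made-up vectors (real OpenAI embeddings have ~1,500 dimensions; these toy values are purely illustrative):

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" for three chunks and a query
chunks = {
    "chunk0": [1.0, 0.0, 0.0],
    "chunk1": [0.9, 0.1, 0.0],
    "chunk2": [0.0, 1.0, 0.0],
}
query = [1.0, 0.05, 0.0]

# Rank chunks by similarity to the query, most similar first
ranked = sorted(chunks, key=lambda k: cosine_similarity(query, chunks[k]), reverse=True)
print(ranked)  # chunk0 and chunk1 rank above chunk2
```

`similarity_search` does essentially this ranking for you (with an approximate-nearest-neighbor index instead of a full scan) and returns the top chunks as `Document` objects.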
```python
import pprint

# "What about the measures for Ukraine?"
user_input = "ウクライナ対策についてはどうでしょうか"  #@param{type: "string"}
docs = docsearch.similarity_search(user_input)
pprint.pprint(docs)
```
Next we include these retrieved chunks in a prompt and ask the LLM for an answer, but the combined text is fairly large. What happens if we send it to the LLM as-is?
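A rough back-of-the-envelope check shows why stuffing everything into one prompt is risky. This sketch uses the common ~4-characters-per-token rule of thumb and a 4,097-token context limit (the limit of the older `text-davinci-003` model); both numbers and the chunk sizes are illustrative assumptions, not real tokenizer output:

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate: ~4 characters per token (English rule of thumb)."""
    return len(text) // 4

CONTEXT_LIMIT = 4097  # e.g. text-davinci-003; purely illustrative

question = "What about the measures for Ukraine?"
# Pretend we retrieved four large chunks plus some prompt boilerplate
retrieved_chunks = ["x" * 5000] * 4
prompt_overhead = 200  # template text, instructions, etc.

total = prompt_overhead + estimate_tokens(question) + sum(estimate_tokens(c) for c in retrieved_chunks)
print(total, total > CONTEXT_LIMIT)  # the stuffed prompt exceeds the limit
```

Once the question, the template, and every retrieved chunk are concatenated, the prompt can easily exceed the model's window, which is exactly the failure the next cell demonstrates.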
```python
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.llms import OpenAI

chain = load_qa_with_sources_chain(OpenAI(temperature=0), chain_type="stuff")
chain({"input_documents": docs, "question": user_input}, return_only_outputs=True)
```
You should have gotten an error.

When you put your own data into a local vector store, the flow is: retrieve the chunks most relevant to the query from the vector store, combine them with the user's original prompt, and send it all to the LLM to generate the final answer. The combined prompt can get quite large, so in some cases it exceeds the request limit.

There are several strategies for how to collect your data and feed it to the LLM, each with its own trade-offs.

The following article is a useful reference:

https://note.com/npaka/n/nb9b70619939a

Since `stuff` tends to exceed the OpenAI API request size, let's try `map_reduce` instead.
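The map-reduce idea can be sketched without calling the API: ask the question against each chunk independently (map), then combine the per-chunk results in one final call (reduce), so no single prompt ever has to hold all the chunks. The two functions below are hypothetical keyword-matching stand-ins for what would really be LLM calls:

```python
def map_step(question: str, chunk: str) -> str:
    """Stand-in for an LLM call that extracts the part of `chunk` relevant to `question`."""
    # Here: keep the chunk only if it mentions a keyword from the question
    keywords = question.lower().split()
    return chunk if any(k in chunk.lower() for k in keywords) else ""

def reduce_step(question: str, partials: list) -> str:
    """Stand-in for the final LLM call that combines the partial answers."""
    relevant = [p for p in partials if p]
    return " / ".join(relevant)

chunks = [
    "Support for Ukraine will be strengthened.",
    "Tax reform is under discussion.",
    "Ukraine-related sanctions continue.",
]
question = "Ukraine policy"

partials = [map_step(question, c) for c in chunks]  # one small prompt per chunk
answer = reduce_step(question, partials)            # one final, small prompt
print(answer)
```

The trade-off is latency and cost: map-reduce makes one LLM call per chunk plus a final combining call, where `stuff` makes only one.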
```python
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.llms import OpenAI

chain = load_qa_with_sources_chain(OpenAI(temperature=0), chain_type="map_reduce")
chain({"input_documents": docs, "question": user_input}, return_only_outputs=True)
```
## 2. Answering from a blog

LangChain and LlamaIndex ship with many ready-made data loaders. Here we use LlamaIndex's "web/rss" loader to embed and index a blog's content (RSS), then have the LLM answer based on the search results.
Fetch the RSS feed, convert it to embeddings, and build the index. I used my own blog.
```python
from llama_index import GPTSimpleVectorIndex, download_loader

rss_url = "https://kun432.hatenablog.com/rss"  #@param{type: "string"}

# Use LlamaIndex's web/rss loader to index the blog
RssReader = download_loader("RssReader")
loader = RssReader()
documents = loader.load_data([rss_url])
index_from_rss = GPTSimpleVectorIndex.from_documents(documents)
```
Now let's ask something. My blog has many articles about Voiceflow, so let's ask about that.
```python
from langchain.agents import initialize_agent, Tool
from langchain.llms import OpenAI

user_input = "Voiceflowについて教えて"  #@param{type: "string"}  # "Tell me about Voiceflow"
debug = False  #@param{type: "boolean"}

# Expose the LlamaIndex RSS index as a LangChain Tool
tools = [
    Tool(
        name="rss search",
        func=lambda q: str(index_from_rss.query(q)),
        description="search from rss"
    ),
]

# LangChain agent that decides when to call the tool
llm = OpenAI(temperature=0)
agent = initialize_agent(tools, llm, agent="zero-shot-react-description", verbose=debug)

agent.run(user_input)
```
My blog mostly covers things like new Voiceflow features rather than "what is Voiceflow?" articles, so you may get an incorrect answer.

### Summary

When combining an LLM with your own data, fine-tuning is the approach that usually gets the attention, but with these frameworks you can have the LLM answer from your own data sources without any fine-tuning.

However, this pattern can involve multiple round trips to the LLM, so response times tend to be long. That can be a problem in certain use cases (such as Alexa skills), so fine-tuning is still worth considering.

I actually tried fine-tuning as well: depending on the amount and content of the training data, the model got things wrong or said unexpected things. Honestly it is hard to evaluate, and you don't know until you try (even a small dataset costs around $5).

So rather than an either/or choice between fine-tuning and this retrieval-based approach, combining the two seems like the practical sweet spot for now.