@showjim
Forked from ninehills/chatpdf-zh.ipynb
Created March 26, 2023 01:36
ChatPDF-zh.ipynb
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"source": [
"# Chat with pdf file "
],
"metadata": {
"id": "4Sw_ysmQlk-8"
}
},
{
"cell_type": "code",
"source": [
"# 建议将 PDF 文件保存在 Google Drive 上,左侧 Connect to Google Drive\n",
"\n",
"from google.colab import drive\n",
"drive.mount('/content/drive')"
],
"metadata": {
"id": "WKhC2AZRjyok",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "ab33bce1-88fd-432d-d65e-1c9892a860a8"
},
"execution_count": 1,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount(\"/content/drive\", force_remount=True).\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"WORK_DIR = \"/content/drive/MyDrive/ChatGPT/Notebooks/ChatPDF/\"\n",
"SRC_FILE = \"jianshang.pdf\"\n",
"INDEX_FILE = SRC_FILE + \".index\""
],
"metadata": {
"id": "UXw1TWw_nj_F"
},
"execution_count": 3,
"outputs": []
},
{
"cell_type": "code",
"source": [
"%%capture\n",
"# update or install the necessary libraries\n",
"!pip install --upgrade llama_index\n",
"!pip install --upgrade langchain\n",
"!pip install --upgrade python-dotenv\n"
],
"metadata": {
"id": "Aqef8N2RlUpo"
},
"execution_count": 4,
"outputs": []
},
{
"cell_type": "code",
"source": [
"from llama_index import GPTSimpleVectorIndex, LLMPredictor, PromptHelper\n",
"from llama_index.response.notebook_utils import display_response\n",
"from llama_index.prompts.prompts import QuestionAnswerPrompt\n",
"from langchain.chat_models import ChatOpenAI\n",
"from IPython.display import Markdown, display\n",
"from langchain.callbacks.base import CallbackManager\n",
"from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler"
],
"metadata": {
"id": "Vp6JcErhmt_w"
},
"execution_count": 20,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Load environment variables, Just create a .env file with your OPENAI_API_KEY then load it.\n",
"\n",
"import os \n",
"from dotenv import load_dotenv\n",
"\n",
"load_dotenv()\n",
"\n",
"# API configuration\n",
"OPENAI_API_KEY = os.getenv(\"OPENAI_API_KEY\")\n",
"\n",
"if OPENAI_API_KEY == \"\":\n",
" raise Exception(\"Need set OPENAI_API_KEY\")"
],
"metadata": {
"id": "WKoA2bzul7Gz"
},
"execution_count": 7,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"准备 Index 文件,为了避免重复索引,增加缓存\n",
"\n",
"\n"
],
"metadata": {
"id": "SApFHwHCpEGJ"
}
},
{
"cell_type": "code",
"source": [
"# Load pdf to documents\n",
"\n",
"from pathlib import Path\n",
"from llama_index import download_loader\n",
"\n",
"# 中文 PDF 建议使用 CJKPDFReader,英文建议用 PDFReader\n",
"# 其他类型文件,请去 https://llamahub.ai/ 寻找合适的 Loader\n",
"CJKPDFReader = download_loader(\"CJKPDFReader\")\n",
"\n",
"loader = CJKPDFReader()\n",
"index_file = os.path.join(Path(WORK_DIR), Path(INDEX_FILE))\n",
"\n",
"if os.path.exists(index_file) == False:\n",
" documents = loader.load_data(file=os.path.join(Path(WORK_DIR), Path(SRC_FILE)))\n",
" index = GPTSimpleVectorIndex(documents)\n",
" index.save_to_disk(index_file)\n",
"else:\n",
" index = GPTSimpleVectorIndex.load_from_disk(index_file)\n"
],
"metadata": {
"id": "Cb98YMtrnTxU"
},
"execution_count": 9,
"outputs": []
},
{
"cell_type": "code",
"source": [
"llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0.2, model_name=\"gpt-3.5-turbo\"))\n",
"llm_predictor_stream = LLMPredictor(llm=ChatOpenAI(\n",
" temperature=0.2, model_name=\"gpt-3.5-turbo\", stream=True,\n",
" callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]), verbose=True\n",
" )\n",
")\n",
"\n",
"QUESTION_ANSWER_PROMPT_TMPL = (\n",
" \"Context information is below. \\n\"\n",
" \"---------------------\\n\"\n",
" \"{context_str}\"\n",
" \"\\n---------------------\\n\"\n",
" \"{query_str}\\n\"\n",
")\n",
"\n",
"QUESTION_ANSWER_PROMPT_TMPL_2 = \"\"\"\n",
"You are an AI assistant providing helpful advice. You are given the following extracted parts of a long document and a question. Provide a conversational answer based on the context provided.\n",
"If you can't find the answer in the context below, just say \"Hmm, I'm not sure.\" Don't try to make up an answer.\n",
"If the question is not related to the context, politely respond that you are tuned to only answer questions that are related to the context.\n",
"Context information is below.\n",
"=========\n",
"{context_str}\n",
"=========\n",
"{query_str}\n",
"\"\"\"\n",
"\n",
"QUESTION_ANSWER_PROMPT = QuestionAnswerPrompt(QUESTION_ANSWER_PROMPT_TMPL_2)\n",
"\n",
"def chat(query):\n",
" result = index.query(\n",
" query,\n",
" llm_predictor=llm_predictor,\n",
" text_qa_template=QUESTION_ANSWER_PROMPT,\n",
" # default: For the given index, “create and refine” an answer by sequentially \n",
" # going through each Node; make a separate LLM call per Node. Good for more \n",
" # detailed answers.\n",
" # compact: For the given index, “compact” the prompt during each LLM call \n",
" # by stuffing as many Node text chunks that can fit within the maximum prompt size. \n",
" # If there are too many chunks to stuff in one prompt, “create and refine” an answer \n",
" # by going through multiple prompts.\n",
" # tree_summarize: Given a set of Nodes and the query, recursively construct a \n",
" # tree and return the root node as the response. Good for summarization purposes.\n",
" response_mode=\"tree_summarize\",\n",
" similarity_top_k=3,\n",
" # mode=\"default\" will a create and refine an answer sequentially through \n",
" # the nodes of the list. \n",
" # mode=\"embedding\" will synthesize an answer by \n",
" # fetching the top-k nodes by embedding similarity.\n",
" mode=\"embedding\",\n",
" )\n",
" print(f\"Token used: {llm_predictor.last_token_usage}, total used: {llm_predictor.total_tokens_used}\")\n",
" return result\n",
"\n",
"# It's not work now, please don't use it.\n",
"# Bug: https://github.com/jerryjliu/llama_index/issues/831\n",
"def chat_stream(query):\n",
" return index.query(\n",
" query,\n",
" llm_predictor=llm_predictor_stream,\n",
" text_qa_template=QUESTION_ANSWER_PROMPT,\n",
" response_mode=\"tree_summarize\",\n",
" similarity_top_k=3,\n",
" streaming=True,\n",
" mode=\"embedding\",\n",
" )"
],
"metadata": {
"id": "6ddjxclno8tg"
},
"execution_count": 28,
"outputs": []
},
{
"cell_type": "code",
"source": [
"resp = chat(\"这本书讲了什么?\")\n",
"display_response(resp)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 410
},
"id": "psvZXfWirq31",
"outputId": "16c58a16-c570-4b6a-c6cf-8552ddbf8828"
},
"execution_count": 29,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Token used: 10843, total used: 10843\n"
]
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<IPython.core.display.Markdown object>"
],
"text/markdown": "**`Final Response:`** 这本书主要讲述了商周之变的历史背景和周朝的兴起,以及商文化和周文化的差异。它还提到了一些关于人祭习俗的历史知识,并分享了作者的研究经历和认知。"
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<IPython.core.display.Markdown object>"
],
"text/markdown": "---"
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<IPython.core.display.Markdown object>"
],
"text/markdown": "**`Source Node 1/3`**"
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<IPython.core.display.Markdown object>"
],
"text/markdown": "**Document ID:** 1f43ae29-e41d-474b-b9c7-2afcb36d0d0b<br>**Similarity:** 0.8000785245735972<br>**Text:** 如《象传》和《彖传》可能是周公作品。其他篇章里常出现“子日”,孔子 \n自己肯定不会这样写,它们应当是孔门弟子编写的。《周易》经传的详细知识, \n可参考廖明春《周易经传十五讲》,北京大学出版社,2...<br>"
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<IPython.core.display.Markdown object>"
],
"text/markdown": "---"
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<IPython.core.display.Markdown object>"
],
"text/markdown": "**`Source Node 2/3`**"
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<IPython.core.display.Markdown object>"
],
"text/markdown": "**Document ID:** 1f43ae29-e41d-474b-b9c7-2afcb36d0d0b<br>**Similarity:** 0.7965259185103453<br>**Text:** 的支持,其实是心理上的,让我意识到除了祭祀坑里的尸骨,这世界 上还有别的东西。 也许,人不应当凝视深渊;虽然深渊就在那里。 \f \f 始于一页,抵达世界 Humanities ■ Histor...<br>"
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<IPython.core.display.Markdown object>"
],
"text/markdown": "---"
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<IPython.core.display.Markdown object>"
],
"text/markdown": "**`Source Node 3/3`**"
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<IPython.core.display.Markdown object>"
],
"text/markdown": "**Document ID:** 1f43ae29-e41d-474b-b9c7-2afcb36d0d0b<br>**Similarity:** 0.7947722920317227<br>**Text:** 书》则是“太姒梦见商之庭产棘”。此事应载于《逸周书•程寤》篇,但传 \n世本只存篇名,正文缺。参见黄怀信等《逸周书汇校集注》(修订本),上海 \n古籍出版社,2007年,第262、1141页;李学勤...<br>"
},
"metadata": {}
}
]
},
{
"cell_type": "code",
"source": [
"display_response(chat(\"牧野之战的具体过程是什么?\"))"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 544
},
"id": "bS5LAJkuqR4U",
"outputId": "18515036-427e-4e63-fd10-2eda66aca4ce"
},
"execution_count": 31,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Token used: 14013, total used: 37002\n"
]
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<IPython.core.display.Markdown object>"
],
"text/markdown": "**`Final Response:`** 在牧野之战开始时,武王率领西土联军面对着数量远超自己的商军。武王的前提是有殷都内部联络人的密约,但局势不断变化,没有商人助战,西土联军将被一边倒地屠杀。武王没有别的选择,他只能相信父亲描述的那位上帝站在自己一边,只要全心信任他,父亲开启的翦商事业就能成功。在战斗开始时,武王一方没有任何章法和战术可言,但商军阵列却突然自行解体,变成了互相砍杀的人群。或许是看到周军义无反顾的冲锋,商军中的密谋者终于鼓起勇气,倒戈杀向纣王中军。接着,西土联军全部投入了混战。后世的周人史诗说,“商庶若化\",即是说,商军队伍就像滚水冲刷的油脂,瞬间溃散,融化。最终,武王率领的西土联军获胜,商王朝终结。"
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<IPython.core.display.Markdown object>"
],
"text/markdown": "---"
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<IPython.core.display.Markdown object>"
],
"text/markdown": "**`Source Node 1/3`**"
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<IPython.core.display.Markdown object>"
],
"text/markdown": "**Document ID:** 1f43ae29-e41d-474b-b9c7-2afcb36d0d0b<br>**Similarity:** 0.7968888490270644<br>**Text:** 王受命第十一年,5他再度起兵东征。有好几种文献记载武王此次伐\n商的行军日程,但年份和月份皆有所不同。总的来说,武王此次起兵\n是在隆冬季节,决战则是在冬末春初。\n\n总攻的前期工作在前一年底就开始了...<br>"
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<IPython.core.display.Markdown object>"
],
"text/markdown": "---"
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<IPython.core.display.Markdown object>"
],
"text/markdown": "**`Source Node 2/3`**"
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<IPython.core.display.Markdown object>"
],
"text/markdown": "**Document ID:** 1f43ae29-e41d-474b-b9c7-2afcb36d0d0b<br>**Similarity:** 0.7918882416293495<br>**Text:** 并不是武王的私人属下。他们很在意这种身份区别。7\n\n天色渐明,雨势渐小,对面的商军阵列逐渐成形。周人史诗的描\n述是,敌军的戈矛像森林一样密集,所谓“殷商之旅,其会如林”。(《诗\n经•大雅•大明》...<br>"
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<IPython.core.display.Markdown object>"
],
"text/markdown": "---"
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<IPython.core.display.Markdown object>"
],
"text/markdown": "**`Source Node 3/3`**"
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<IPython.core.display.Markdown object>"
],
"text/markdown": "**Document ID:** 1f43ae29-e41d-474b-b9c7-2afcb36d0d0b<br>**Similarity:** 0.7853722213787273<br>**Text:** 作的“金花”。这都是王室才会有的财物,看来王室和奴隶们居住\n的地方相隔并不远。\n\n\f\n第十一章商人的思维与国家\n\n225\n\n需要注意的是,只有殷墟王宫区发现有大量集中存放的石头农具,\n其他任何商...<br>"
},
"metadata": {}
}
]
},
{
"cell_type": "code",
"source": [
"display_response(chat(\"被商朝献祭的人群中,分别都有哪些角色?\"))"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 433
},
"id": "7c1Kwsj2waEB",
"outputId": "a5383960-998d-4876-c3d0-141f5aeb71ff"
},
"execution_count": 33,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Token used: 12048, total used: 60520\n"
]
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<IPython.core.display.Markdown object>"
],
"text/markdown": "**`Final Response:`** 在商朝献祭的人群中,有征伐周边斩获的首级的侯来、陈本等人,以及现场屠宰的牲畜,牛“五百有四”头,猪、羊等牲畜共“三千七百有一”头。其中还向天(上帝)和后稷献祭,以及向其他百神、水土之神献祭。其中有王族和方伯,以及纣王自己。"
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<IPython.core.display.Markdown object>"
],
"text/markdown": "---"
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<IPython.core.display.Markdown object>"
],
"text/markdown": "**`Source Node 1/3`**"
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<IPython.core.display.Markdown object>"
],
"text/markdown": "**Document ID:** 1f43ae29-e41d-474b-b9c7-2afcb36d0d0b<br>**Similarity:** 0.850064718034232<br>**Text:** 仪式上,首先奉献的是侯来、陈本等征伐周边斩获的首级,并 搭配现场屠宰的牲畜,“断牛六,断羊二”;然后向天(上帝)和后 稷献祭,用的是牛“五百有四”头;再向其他百神、水土之神献祭, 用猪、羊等牲畜...<br>"
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<IPython.core.display.Markdown object>"
],
"text/markdown": "---"
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<IPython.core.display.Markdown object>"
],
"text/markdown": "**`Source Node 2/3`**"
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<IPython.core.display.Markdown object>"
],
"text/markdown": "**Document ID:** 1f43ae29-e41d-474b-b9c7-2afcb36d0d0b<br>**Similarity:** 0.8454964094383671<br>**Text:** 之内,中土世界天翻地覆。\n\n纣王焚身而死,后世人大都将其理解为一种走投无路的自绝。其 \n实,按照商人的宗教理念,这是一场最高级的献祭--王把自己奉献 \n给了上帝和祖宗诸神。商朝开国之王成汤(天乙...<br>"
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<IPython.core.display.Markdown object>"
],
"text/markdown": "---"
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<IPython.core.display.Markdown object>"
],
"text/markdown": "**`Source Node 3/3`**"
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<IPython.core.display.Markdown object>"
],
"text/markdown": "**Document ID:** 1f43ae29-e41d-474b-b9c7-2afcb36d0d0b<br>**Similarity:** 0.8425104275046603<br>**Text:** 郭宝钧:《1950年春殷墟发掘报告》,第45页。\n20王平、顾彬:《甲骨文与殷商人祭》,大象出版社,2007年,第88、97页。\n21唐际根、汤毓赞:《再论殷墟人祭坑与甲骨文中羌祭卜辞的相关性》...<br>"
},
"metadata": {}
}
]
}
]
}
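
The `chat` function above queries the index with `mode="embedding"` and `similarity_top_k=3`, and each source node in the outputs carries a similarity score. As a rough illustration of that retrieval step, here is a minimal sketch in plain Python. It assumes cosine similarity as the metric, and the chunk texts, vectors, and helper names (`cosine`, `top_k`) are made up for the example; it is not llama_index's actual implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunks, k=3):
    """Return the k (score, text) pairs whose vectors are most similar to query_vec."""
    scored = [(cosine(query_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical chunks with toy 3-dimensional "embeddings".
# Real embeddings come from an embedding model and have many more dimensions.
chunks = [
    ("chunk about the battle", [0.9, 0.1, 0.0]),
    ("chunk about rituals", [0.1, 0.9, 0.0]),
    ("chunk about footnotes", [0.0, 0.2, 0.9]),
    ("chunk about the march", [0.8, 0.3, 0.1]),
]
query = [1.0, 0.0, 0.0]  # stand-in for the embedded question

results = top_k(query, chunks, k=3)
for score, text in results:
    print(f"{score:.4f}  {text}")
```

Only the selected chunks are then placed into `{context_str}` of the prompt template, which is why a larger `similarity_top_k` raises the token count reported after each query.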