4 | 4 | "cell_type": "markdown",
5 | 5 | "metadata": {},
6 | 6 | "source": [
7 |   | - "## 从Arxiv加载论文并进行摘要\n",
  | 7 | + "## Extract Key Information from Arxiv Pages\n",
8 | 8 | "Arxiv网站上一篇《Efficient Prompt Compression with Evaluator Heads for Long-Context Transformer Inference》英文论文,其论文编号为:2501.12959。示例尝试加载这篇论文,并对其内容进行中文摘要。"
9 | 9 | ]
10 | 10 | },
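The cell that actually fetches the paper (notebook lines 17-30) falls outside this hunk. As context, a minimal sketch of how such a loader might look with LangChain's ArxivLoader; the `load_max_docs` setting is an assumption, and `docs` mirrors the variable the later cells consume:

```python
# Sketch only: the real loader cell is elided from this diff.
# Requires the `arxiv` and `pymupdf` packages alongside langchain_community.
from langchain_community.document_loaders import ArxivLoader

# The query can be a raw Arxiv ID; load_max_docs=1 keeps just this paper.
loader = ArxivLoader(query="2501.12959", load_max_docs=1)
docs = loader.load()  # list of Documents that the splitter cell consumes

print(docs[0].metadata["Title"])  # sanity check on what was fetched
```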
11 | 11 | {
12 | 12 | "cell_type": "code",
13 |    | - "execution_count": 20,
   | 13 | + "execution_count": 30,
14 | 14 | "metadata": {},
15 | 15 | "outputs": [],
16 | 16 | "source": [

31 | 31 | },
32 | 32 | {
33 | 33 | "cell_type": "code",
34 |    | - "execution_count": 21,
   | 34 | + "execution_count": 31,
35 | 35 | "metadata": {},
36 | 36 | "outputs": [
37 | 37 | {

58 | 58 | },
59 | 59 | {
60 | 60 | "cell_type": "code",
61 |    | - "execution_count": 22,
   | 61 | + "execution_count": 45,
62 | 62 | "metadata": {},
63 | 63 | "outputs": [],
64 | 64 | "source": [
65 |    | - "spliter = RecursiveCharacterTextSplitter(chunk_size=256, chunk_overlap=2)\n",
   | 65 | + "spliter = RecursiveCharacterTextSplitter(chunk_size=256, chunk_overlap=32)\n",
66 | 66 | "texts = spliter.split_documents(docs)\n",
67 | 67 | "# pprint(texts)"
68 | 68 | ]
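The only substantive change in this cell is bumping chunk_overlap from 2 to 32. With chunk_size=256, a 2-character overlap gives neighbouring chunks essentially no shared context, so sentences are severed cleanly at chunk boundaries; 32 characters of repeated tail text keeps each boundary readable. A standalone sketch of the effect (the sample string is invented; newer LangChain versions import from langchain_text_splitters, older ones from langchain.text_splitter):

```python
# Sketch: observe the overlap between neighbouring chunks.
from langchain_text_splitters import RecursiveCharacterTextSplitter

sample = " ".join(f"Sentence {i} about evaluator heads." for i in range(30))
splitter = RecursiveCharacterTextSplitter(chunk_size=256, chunk_overlap=32)
chunks = splitter.split_text(sample)

print(len(chunks))           # number of ~256-char windows
# The tail of one chunk reappears (word-aligned, up to 32 chars)
# at the head of the next:
print(repr(chunks[0][-32:]))
print(repr(chunks[1][:32]))
```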

76 | 76 | },
77 | 77 | {
78 | 78 | "cell_type": "code",
79 |    | - "execution_count": 23,
   | 79 | + "execution_count": 46,
80 | 80 | "metadata": {},
81 | 81 | "outputs": [
82 |    | - {
83 |    | - "name": "stderr",
84 |    | - "output_type": "stream",
85 |    | - "text": [
86 |    | - "/var/folders/5x/c0q41fpx6l540lsl42_bzk5h0000gq/T/ipykernel_73163/2510094347.py:5: UserWarning: Parameters {'presence_penalty', 'top_p', 'frequency_penalty'} should be specified explicitly. Instead they were passed in as part of `model_kwargs` parameter.\n",
87 |    | - " llm = get_model('openai')\n"
88 |    | - ]
89 |    | - },
90 | 82 | {
91 | 83 | "name": "stdout",
92 | 84 | "output_type": "stream",
93 | 85 | "text": [
94 |    | - "'本文提出了一种高效的、无需训练的提示压缩方法EHPC,通过评估头部在长文本输入中选择最重要的令牌,从而加速长文本推理。EHPC在两个主流基准测试中取得了最先进的结果,有效降低了商业API调用的复杂性和成本。与基于键值缓存的加速方法相比,EHPC具有竞争力,有望提高LLM在长文本任务中的效率。EHPC通过评估头部选择重要令牌,加速长文本推理,降低内存使用,并与KV缓存压缩方法竞争。EHPC在提示压缩基准测试上取得了新的最先进性能,降低了商业LLM的API成本和内存使用。'\n"
   | 86 | + "('本文提出了一种基于评估头(Evaluator '\n",
   | 87 | + " 'Heads)的高效提示压缩方法EHPC,用于加速长上下文Transformer推理。通过识别Transformer模型中特定的注意力头,EHPC能够在预填充阶段快速筛选出重要信息,仅保留关键token进行推理。该方法无需额外训练,显著降低了长上下文处理的计算成本和内存开销。实验表明,EHPC在主流基准测试中达到了最先进的性能,有效减少了商业API调用成本,并在长文本推理加速任务中表现出色。')\n"
95 | 88 | ]
96 | 89 | }
97 | 90 | ],
98 | 91 | "source": [
99 | 92 | "doc_prompt = PromptTemplate.from_template(\"{page_content}\")\n",
100 | 93 | "# concatenate the chunk texts\n",
101 |    | - "content = lambda docs: \"\\n\\n\".join(doc.page_content for doc in docs) \n",
102 |    | - "prompt = PromptTemplate.from_template(\"请使用中文总结以下内容,控制在140个字以内:\\n\\n{content}\")\n",
103 |    | - "llm = get_model('openai')\n",
    | 94 | + "prompt = PromptTemplate.from_template(\"请使用中文总结以下内容,控制在140个字以内:{content}\")\n",
    | 95 | + "# The full document exceeds openai gpt-3.5-turbo's 16,385-token context limit, so use the deepseek model here\n",
    | 96 | + "llm = get_model('deepseek')\n",
    | 97 | + "# pprint(prompt.invoke('{input}'))\n",
104 | 98 | "\n",
105 | 99  | "# chain\n",
106 | 100 | "chain = (\n",
107 |     | - " {\"content\": lambda docs: content(docs)}\n",
    | 101 | + " {\"content\": lambda docs: \"\\n\\n\".join(doc.page_content for doc in docs)}\n",
108 | 102 | " | prompt\n",
109 | 103 | " | llm\n",
110 | 104 | " | StrOutputParser()\n",
111 | 105 | ")\n",
112 | 106 | "\n",
113 |     | - "pprint(chain.invoke(texts[:50]))\n"
    | 107 | + "pprint(chain.invoke(texts))\n"
114 | 108 | ]
115 | 109 | },
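Two things happen in this hunk: the summary now runs over all chunks instead of the first 50, and `get_model('openai')` is swapped for `get_model('deepseek')`, which also removes the stderr UserWarning about `model_kwargs`. The `get_model` helper itself appears nowhere in the diff, so this is a hypothetical reconstruction of its deepseek branch, with the model name, env var, and sampling values all assumed:

```python
# Hypothetical get_model(); the repo's real helper is not in this diff.
# Passing top_p (and any penalties) as explicit kwargs rather than via
# model_kwargs is what silences the UserWarning the old cell printed.
import os
from langchain_openai import ChatOpenAI

def get_model(provider: str) -> ChatOpenAI:
    if provider == "deepseek":
        # deepseek exposes an OpenAI-compatible endpoint, and its context
        # window comfortably exceeds gpt-3.5-turbo's 16,385 tokens.
        return ChatOpenAI(
            model="deepseek-chat",                   # assumed model name
            api_key=os.environ["DEEPSEEK_API_KEY"],  # assumed env var
            base_url="https://api.deepseek.com",
            temperature=0.7,                         # assumed sampling values
            top_p=0.9,
        )
    return ChatOpenAI(model="gpt-3.5-turbo", temperature=0.7, top_p=0.9)
```

As for the chain itself, the bare dict `{"content": lambda docs: ...}` is coerced by LCEL into a RunnableParallel step, so `chain.invoke(texts)` hands the chunk list to the lambda and pipes the joined string through the prompt, the model, and StrOutputParser.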
116 | 110 | {

122 | 116 | },
123 | 117 | {
124 | 118 | "cell_type": "code",
125 |     | - "execution_count": 24,
    | 119 | + "execution_count": 34,
126 | 120 | "metadata": {},
127 | 121 | "outputs": [
128 | 122 | {