Commit 97ca34f

committed: document load
1 parent 96ed156 commit 97ca34f

8 files changed: +375 −33 lines changed

langchain/2.role_player.ipynb

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## 角色扮演"
+    "## Role Player"
    ]
   },
   {

langchain/3.length_selector.ipynb

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## 长度选择器"
+    "## Length Selector"
    ]
   },
   {

langchain/4.output_parser_json.ipynb

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Output Parser解析器"
+    "## Output Parser"
    ]
   },
   {

langchain/5.output_parser_xml.ipynb

Lines changed: 8 additions & 8 deletions
@@ -4,7 +4,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Output Parser解析器"
+    "## Output Parser"
    ]
   },
   {
@@ -16,7 +16,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 14,
+   "execution_count": 1,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -37,7 +37,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 15,
+   "execution_count": 2,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -65,7 +65,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 16,
+   "execution_count": 3,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -87,7 +87,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 18,
+   "execution_count": 4,
    "metadata": {},
    "outputs": [
    {
@@ -136,14 +136,14 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 17,
+   "execution_count": 5,
    "metadata": {},
    "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
-     "/var/folders/5x/c0q41fpx6l540lsl42_bzk5h0000gq/T/ipykernel_61871/4044007954.py:1: UserWarning: Parameters {'presence_penalty', 'frequency_penalty', 'top_p'} should be specified explicitly. Instead they were passed in as part of `model_kwargs` parameter.\n",
+     "/var/folders/5x/c0q41fpx6l540lsl42_bzk5h0000gq/T/ipykernel_56994/4044007954.py:1: UserWarning: Parameters {'top_p', 'presence_penalty', 'frequency_penalty'} should be specified explicitly. Instead they were passed in as part of `model_kwargs` parameter.\n",
     " llm = get_model(\"openai\")\n"
    ]
   },
@@ -155,7 +155,7 @@
    " {'city': 'New York City'}]}"
   ]
   },
-  "execution_count": 17,
+  "execution_count": 5,
   "metadata": {},
   "output_type": "execute_result"
  }
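The re-run output above ends with XML parsed into nested dicts (`{'city': 'New York City'}]}`). As a stdlib-only sketch of that XML-to-dict shape (a simplification for illustration; the notebook presumably uses LangChain's `XMLOutputParser`, whose internals differ):

```python
import xml.etree.ElementTree as ET

def xml_to_dict(element):
    # Leaf node: return its text content.
    if len(element) == 0:
        return element.text
    # Branch node: a list of {tag: parsed-child} mappings, mirroring the
    # nested shape visible in the notebook output above.
    return [{child.tag: xml_to_dict(child)} for child in element]

# A model reply formatted as XML (example data, not from the notebook):
reply = "<cities><city>New York City</city><city>Tokyo</city></cities>"
root = ET.fromstring(reply)
result = {root.tag: xml_to_dict(root)}
print(result)  # {'cities': [{'city': 'New York City'}, {'city': 'Tokyo'}]}
```

The recursive list-of-dicts layout keeps repeated tags (several `<city>` elements) without key collisions, which is why the parsed output is a list rather than a flat dict.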

langchain/6.runnable_bind_tools.ipynb

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Rannable 对象绑定函数"
+    "## Rannable Object Binding Tools"
    ]
   },
   {

langchain/7.document_loader.ipynb

Lines changed: 343 additions & 0 deletions
Large diffs are not rendered by default.
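The new notebook's diff is hidden, but judging from the PDF-handling packages added in requirement.txt (pymupdf, pdfminer.six, pdf2image, rapidocr_onnxruntime), it presumably demonstrates LangChain document loaders. A stdlib-only sketch of the loader pattern (the `Document`/`load()` shape matches LangChain's interface; `TextFileLoader` is a hypothetical stand-in, not a LangChain class):

```python
import os
import tempfile
from dataclasses import dataclass, field

@dataclass
class Document:
    # The two fields LangChain's Document exposes.
    page_content: str
    metadata: dict = field(default_factory=dict)

class TextFileLoader:
    """Hypothetical minimal loader: one Document per file, source path in metadata."""
    def __init__(self, path: str):
        self.path = path

    def load(self) -> list:
        with open(self.path, encoding="utf-8") as f:
            return [Document(page_content=f.read(), metadata={"source": self.path})]

# Usage: write a temp file and load it back as a Document.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as f:
    f.write("hello loader")
    path = f.name
docs = TextFileLoader(path).load()
print(docs[0].page_content)  # hello loader
os.remove(path)
```

Real loaders such as the PDF ones implied by these dependencies return one `Document` per page with richer metadata, but they keep the same `load() -> list[Document]` contract, which is what lets splitters and chains consume any loader interchangeably.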

langchain/7.document_loader_from_arxiv.ipynb renamed to langchain/8.extract_information_from_arxiv.ipynb

Lines changed: 15 additions & 21 deletions
@@ -4,13 +4,13 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## 从Arxiv加载论文并进行摘要\n",
+    "## Extract Key Information from Arxiv Pages\n",
     "Arxiv网站上一篇《Efficient Prompt Compression with Evaluator Heads for Long-Context Transformer Inference》英文论文,其论文编号为:2501.12959。示例尝试加载这篇论文,并对其内容进行中文摘要。"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 20,
+   "execution_count": 30,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -31,7 +31,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 21,
+   "execution_count": 31,
    "metadata": {},
    "outputs": [
    {
@@ -58,11 +58,11 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 22,
+   "execution_count": 45,
    "metadata": {},
    "outputs": [],
    "source": [
-    "spliter = RecursiveCharacterTextSplitter(chunk_size=256, chunk_overlap=2)\n",
+    "spliter = RecursiveCharacterTextSplitter(chunk_size=256, chunk_overlap=32)\n",
     "texts = spliter.split_documents(docs)\n",
     "# pprint(texts)"
    ]
@@ -76,41 +76,35 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 23,
+   "execution_count": 46,
    "metadata": {},
    "outputs": [
-   {
-    "name": "stderr",
-    "output_type": "stream",
-    "text": [
-     "/var/folders/5x/c0q41fpx6l540lsl42_bzk5h0000gq/T/ipykernel_73163/2510094347.py:5: UserWarning: Parameters {'presence_penalty', 'top_p', 'frequency_penalty'} should be specified explicitly. Instead they were passed in as part of `model_kwargs` parameter.\n",
-     " llm = get_model('openai')\n"
-    ]
-   },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
-     "'本文提出了一种高效的、无需训练的提示压缩方法EHPC,通过评估头部在长文本输入中选择最重要的令牌,从而加速长文本推理。EHPC在两个主流基准测试中取得了最先进的结果,有效降低了商业API调用的复杂性和成本。与基于键值缓存的加速方法相比,EHPC具有竞争力,有望提高LLM在长文本任务中的效率。EHPC通过评估头部选择重要令牌,加速长文本推理,降低内存使用,并与KV缓存压缩方法竞争。EHPC在提示压缩基准测试上取得了新的最先进性能,降低了商业LLM的API成本和内存使用。'\n"
+     "('本文提出了一种基于评估头(Evaluator '\n",
+     " 'Heads)的高效提示压缩方法EHPC,用于加速长上下文Transformer推理。通过识别Transformer模型中特定的注意力头,EHPC能够在预填充阶段快速筛选出重要信息,仅保留关键token进行推理。该方法无需额外训练,显著降低了长上下文处理的计算成本和内存开销。实验表明,EHPC在主流基准测试中达到了最先进的性能,有效减少了商业API调用成本,并在长文本推理加速任务中表现出色。')\n"
     ]
    }
   ],
   "source": [
    "doc_prompt = PromptTemplate.from_template(\"{page_content}\")\n",
    "#文本拼接\n",
-   "content = lambda docs: \"\\n\\n\".join(doc.page_content for doc in docs) \n",
-   "prompt = PromptTemplate.from_template(\"请使用中文总结以下内容,控制在140个字以内:\\n\\n{content}\")\n",
-   "llm = get_model('openai')\n",
+   "prompt = PromptTemplate.from_template(\"请使用中文总结以下内容,控制在140个字以内:{content}\")\n",
+   "# 由于openai gpt-3.5-tubro 最大token数为16385,超出了文档的限制,此处使用deepseek模型\n",
+   "llm = get_model('deepseek')\n",
+   "# pprint(prompt.invoke('{input}'))\n",
    "\n",
    "# 链\n",
    "chain = (\n",
-   " {\"content\": lambda docs: content(docs)}\n",
+   " {\"content\": lambda docs: \"\\n\\n\".join(doc.page_content for doc in docs)}\n",
    " | prompt\n",
    " | llm\n",
    " | StrOutputParser()\n",
    ")\n",
    "\n",
-   "pprint(chain.invoke(texts[:50]))\n"
+   "pprint(chain.invoke(texts))\n"
   ]
  },
  {
@@ -122,7 +116,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 24,
+   "execution_count": 34,
    "metadata": {},
    "outputs": [
    {
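The rewritten cell above builds a "stuff"-style summarization chain: a dict step joins all chunks, then pipes through a prompt ("请使用中文总结以下内容,控制在140个字以内" asks for a Chinese summary within 140 characters), the model, and a string parser. A pure-Python sketch of how that `|` composition works (an illustration of the idea only, not LangChain's actual LCEL classes; the model step is a fake stand-in):

```python
class Runnable:
    """Minimal pipeable step: invoke() runs it, `|` chains two steps."""
    def __init__(self, fn):
        self.fn = fn

    def invoke(self, value):
        return self.fn(value)

    def __or__(self, other):
        # Output of self becomes input of other, like LCEL's pipe operator.
        return Runnable(lambda value: other.invoke(self.invoke(value)))

# Hypothetical stand-ins for the notebook's components:
join_docs = Runnable(lambda docs: {"content": "\n\n".join(docs)})
prompt = Runnable(lambda d: "Summarize within 140 chars: " + d["content"])
fake_llm = Runnable(lambda p: p.split(": ", 1)[1][:140])  # placeholder for get_model('deepseek')
parser = Runnable(lambda s: s.strip())                    # placeholder for StrOutputParser()

chain = join_docs | prompt | fake_llm | parser
print(chain.invoke(["chunk one", "chunk two"]))
```

Because every step exposes the same `invoke()` contract, the joining lambda, prompt, model, and parser compose in any order the data flow requires, which is what the notebook's `{"content": ...} | prompt | llm | StrOutputParser()` expression relies on.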

requirement.txt

Lines changed: 5 additions & 0 deletions
@@ -13,6 +13,11 @@ langchain-openai=0.3.1
 ## tools
 arxiv==2.1.3
 pymupdf==1.25.2
+rapidocr_onnxruntime==1.4.4
+pdfminer.six==20240706
+pi_heif==0.21.0
+unstructured_inference==0.8.6
+pdf2image==1.17.0
 
 ## parser
 defusedxml==0.7.1
