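- Before running, make sure your Python version is 3.9 or later, then set up a local Python environment from `requirements.txt` and update the Python path in `setup.sh`.
- To enable PDF output, install `wkhtmltopdf` manually and set its path in `path_wk` in `main.py`.
- To push messages through WeCom (企业微信, Enterprise WeChat), set the correct key for `msger` in `main.py`.
- Run `setup.sh`. The script checks the Python version, creates the required folders, and registers any custom tools found under `agent/custom_tools/`.
- Configure the LLM service API for `MetaGPT` in `~/.metagpt/config2.yaml`. Reference: Configure LLM API | MetaGPT (deepwisdom.ai).
- Run `main.py` with the correct Python environment to start the `team` under the `MetaGPT` framework, which consists of a `SimpleCrawler` and a `Summarizer`. The former crawls a set number of the day's articles from an arXiv (`arxiv.org`) search results page; the latter classifies, groups, and summarizes all of them. Command: `python ./main.py`
- Alternatively, use `crontab -e` to schedule runs (use absolute paths). For example, to run `task.sh` once a day at 9:30 and redirect the shell output to a `task.log` file, first set the Python path and any required `PATH` entries in `task.sh`, then add: `30 9 * * * /bin/bash /home/int.orion.que/dev/my_programs/arxiv-agent/task.sh > /home/int.orion.que/dev/task.log 2>&1`
- Once crawling and summarizing are complete, the main process produces `summary@{today}.md` and `paper@{today}.md`, converts them to PDF, and pushes them to the configured bot. The message text can be configured in `main.py`.
- Each day's crawl record is saved in `output/paperdone.pkl`; the crawler uses it to avoid crawling the same articles again (details below).
- All daily output goes under `output/`. Historical data can be cleared by scheduling `clear.sh` with `crontab -e`.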
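The first two prerequisites above (Python 3.9 or later, and a reachable `wkhtmltopdf` binary) can be checked before launching `main.py`. The following is a minimal sketch, not the check that `setup.sh` actually performs; the helper names `check_python` and `find_wkhtmltopdf` are hypothetical and exist only for illustration.

```python
import shutil
import sys

MIN_PYTHON = (3, 9)  # minimum version stated in the setup steps


def check_python(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


def find_wkhtmltopdf(candidate=None):
    """Return a usable path to wkhtmltopdf, or None if none is found.

    `candidate` is an optional explicit path (for example, the value you
    intend to assign to path_wk); otherwise the PATH is searched.
    """
    if candidate:
        found = shutil.which(candidate)
        if found:
            return found
    return shutil.which("wkhtmltopdf")


if __name__ == "__main__":
    if not check_python():
        print("Python 3.9 or later is required")
    print("wkhtmltopdf:", find_wkhtmltopdf() or "not found")
```

If `find_wkhtmltopdf()` returns a path, that value is a reasonable candidate for `path_wk` in `main.py`.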
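- Main parameters to configure:
  - Paths in `main.py`
  - Bot key in `main.py`
  - Custom message in `main.py`
  - Crawl target URL (edit the URL to apply advanced search settings) and crawl size `TaskSize` in `agent/custom_actions/DataActions.py`
  - Prompts in `agent/custom_actions/TextActions.py`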
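- Output produced by each run:
  - `SimpleCrawler` log: `logs/crawler_{today}.log`
  - `MetaGPT` log: `logs/{today}.txt`
  - All article entries collected by the crawler: `output/raw/crawler_{today}.json`
  - Articles confirmed as new after filtering: `output/paper@{today}.md` and `output/paper@{today}.pdf`
  - Classification and grouping results: `output/summary@{today}.md` and `output/summary@{today}.pdf`
  - Cache of the URLs of all articles crawled in the past 7 days: `output/paperdone.pkl`, a dictionary keyed by date whose values are lists of links. It is maintained automatically by the main process `main.py`; do not delete it casually, or articles will be crawled again.
  - For `crontab` scheduled runs, a shell output log file at the location you specified.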
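To verify that a run produced what the list above describes, the expected paths can be built from the date and checked for existence. This is a sketch under assumptions: the exact string that the project substitutes for `{today}` is not stated here, so the date format in the example is a guess, and `expected_outputs` and `missing_outputs` are hypothetical helpers, not project code.

```python
from pathlib import Path


def expected_outputs(today, root="."):
    """Map a short label to each per-run file named in the README.

    `today` must be the same string the project uses for {today};
    its format is an assumption of this sketch.
    """
    root = Path(root)
    out = root / "output"
    return {
        "crawler_log": root / "logs" / f"crawler_{today}.log",
        "metagpt_log": root / "logs" / f"{today}.txt",
        "raw_json": out / "raw" / f"crawler_{today}.json",
        "paper_md": out / f"paper@{today}.md",
        "summary_md": out / f"summary@{today}.md",
        "summary_pdf": out / f"summary@{today}.pdf",
        "url_cache": out / "paperdone.pkl",
    }


def missing_outputs(today, root="."):
    """Return the labels of expected files that do not exist yet."""
    paths = expected_outputs(today, root)
    return [label for label, path in paths.items() if not path.exists()]
```

For example, `missing_outputs("2024-05-10")` run from the project root returns the labels of any files the day's run did not produce.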
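The project is structured as follows.
- `agent/` contains three modules: `custom_actions`, `custom_roles`, and `custom_tools`. They define the Action, Role, and Tool elements of the `MetaGPT` framework, respectively. For details, see: Agent 101 | MetaGPT (deepwisdom.ai). `custom_tools` is currently unused; if you define a new tool there, `setup.sh` must create a link for it in the directory that `MetaGPT` requires (as described in the `MetaGPT` documentation).
- `output/` stores all non-log output, including `json`, `md`, and `pkl` files. New summaries and article lists are saved in the root of `output/`; everything else goes into the corresponding subfolder.
- `tools/` is a module containing the rest of the project's code, such as the crawler, data processing, file access, logging, and bot messaging. It is the core directory of the project.
- `logs/` is the folder where logs are saved.
- `main.py` is the main program.
- `test.py` is a program used for testing.
- `task.sh` is the script used to run the main program on a schedule.
- `clear.sh` is the script used to clear cached data on a schedule.
- `setup.sh` is the script that configures the environment (see the `setup.sh` step above).
- `LICENSE` is the project's open-source license.
- `README.md` is this document.
- `requirements.txt` lists the Python environment requirements.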
arxiv-agent
|agent/
|——custom_actions/
|——__init__.py
|——DataActions.py
|——TextActions.py
|——custom_roles/
|——__init__.py
|——SimpleCrawler.py
|——Summarizer.py
|——custom_tools/
|——__init__.py
|——CustomStructedCrawler.py
|output/
|——paperdone.pkl
|——outdated/
|——pdf/
|——raw/
|tools/
|——__init__.py
|——AccessFile.py
|——Crawler.py
|——DataProcessor.py
|——Logger.py
|——Messenger.py
|——OutputMD.py
|logs/
|main.py
|test.py
|task.sh
|clear.sh
|setup.sh
|README.md
|LICENSE
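|requirements.txt

- `output/paperdone.pkl` caches the URLs of all articles crawled in the past 7 days. It is a binary dump of a dictionary that the main process maintains automatically, mapping each date (key) in the past 7 days to the article links crawled on that date (value).
- The dictionary is used and maintained in `tools/DataProcessor.py` to decide whether an article has been seen in the past week; an unseen article is added as a new entry and the file is saved.
- On every run, the main program `main.py` prunes this record, checking for and deleting entries older than 7 days.
- For this reason, do not delete the file by hand without good cause: the next run may re-summarize articles that were already summarized.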
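The cache behaviour described above can be sketched as follows. This is not the code in `tools/DataProcessor.py` or `main.py`; it is a minimal illustration that assumes the keys are `datetime.date` objects, that the file is written with Python's `pickle` module, and that "older than 7 days" means an age of 7 days or more. All function names are hypothetical.

```python
import pickle
from datetime import date, timedelta
from pathlib import Path

KEEP_DAYS = 7  # retention window stated in the README


def load_cache(path):
    """Load the {date: [links]} dictionary, or start empty if absent."""
    p = Path(path)
    if not p.exists():
        return {}
    with p.open("rb") as f:
        return pickle.load(f)


def save_cache(cache, path):
    """Write the dictionary back to disk as a pickle file."""
    with Path(path).open("wb") as f:
        pickle.dump(cache, f)


def prune(cache, today, keep_days=KEEP_DAYS):
    """Drop entries whose date is keep_days or more before today."""
    limit = timedelta(days=keep_days)
    return {d: links for d, links in cache.items() if today - d < limit}


def filter_new(cache, links, today):
    """Return the links not yet in the cache and record them under today."""
    seen = {link for day_links in cache.values() for link in day_links}
    fresh = []
    for link in links:
        if link not in seen:
            seen.add(link)  # also de-duplicates within the same batch
            fresh.append(link)
    cache.setdefault(today, []).extend(fresh)
    return fresh
```

A run would call `load_cache`, then `prune`, then `filter_new` on the day's crawl results, and finally `save_cache`. Deleting the file makes `load_cache` start from an empty dictionary, which is why every article then looks new.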
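- Other data consists of the daily outputs listed above. These are cleared on a schedule by `./clear.sh`, run as a `crontab` job in the same way. For example, the following entry clears them every Monday at midnight: `0 0 * * 1 /bin/bash /home/int.orion.que/dev/my_programs/arxiv-agent/clear.sh > /home/int.orion.que/dev/clear.log 2>&1`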
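A crontab entry like the one above has five schedule fields (minute, hour, day of month, month, day of week) followed by the command. As a reading aid, the small sketch below splits an entry into those parts; it does not validate or evaluate the schedule, and `CronEntry` and `parse_crontab_line` are hypothetical names used only here.

```python
from typing import NamedTuple


class CronEntry(NamedTuple):
    minute: str
    hour: str
    day_of_month: str
    month: str
    day_of_week: str
    command: str


def parse_crontab_line(line):
    """Split a crontab entry into its five schedule fields and command."""
    parts = line.strip().split(None, 5)
    if len(parts) != 6:
        raise ValueError("expected five schedule fields followed by a command")
    return CronEntry(*parts)


entry = parse_crontab_line(
    "0 0 * * 1 /bin/bash /path/to/clear.sh > /path/to/clear.log 2>&1"
)
# minute 0, hour 0, any day of month, any month, day of week 1 (Monday)
print(entry.minute, entry.hour, entry.day_of_week)
print(entry.command)
```

The `/path/to/...` values are placeholders; as noted earlier, real entries should use absolute paths.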