Releases: NaiboWang/EasySpider
Version 0.6.2
If the download speed is slow, consider the download mirror within mainland China: 中国境内下载地址.
The Windows x64 version supports 64-bit Windows 10/Windows Server 2016 and above; the Windows x32 version supports Windows 7 and above in both 32-bit and 64-bit editions, so even 64-bit Windows 7 should use the x32 version of EasySpider. Note that the Chrome browser bundled with the x32 version is always version 109 and will not be updated along with Chrome releases (to stay compatible with Windows 7), so if you want to collect data with the latest Chrome, run the x64 version on Windows 10 x64 or above. No build supports Windows Server 2012 or below; on those systems you need to compile and run the software yourself.
For the MacOS version, please use the system's built-in Archive Utility to unzip the .7z file. The MacOS version supports all chipsets, including Apple silicon (such as M1 and M2) and Intel chips (such as Core i7); make sure you download the build that matches your chip. The minimum supported operating system version is 11.1; for older systems, please use the v0.2.0 Mac release, or download the code and compile it yourself. An example compilation method can be found in this issue.
Similarly, the Linux version only works on Ubuntu 20.04 and above, Deepin, Debian, and their derivatives. If you want to collect data on another Linux distribution, please download the code and compile it yourself; for an example of compiling on CentOS, see this issue.
On Ubuntu 24.04, replace the contents of the easy-spider.sh startup script in the folder with the contents of this script in order to start and use the software: easy-spider.sh
Please scroll down to the bottom of this section to download EasySpider.
Guide to compiling the program from source and designing, running, and debugging tasks (based on Ubuntu 24.04):
https://www.bilibili.com/video/BV1VE421P7yj/
Docker Example
Instructions for running the EasySpider task execution stage with Docker on Linux:
https://github.com/NaiboWang/EasySpider/wiki/Docker%E8%BF%90%E8%A1%8C%E7%A4%BA%E4%BE%8B
Update Notes
- Added a prompt that shows the final XPath after concatenation operations within loops.
- In data extraction operations, real-time display of extracted values for most types of elements is now available when testing each field.
- Added automatic file renaming feature when a file already exists during data write operations, and automatic renaming for files with the same name during download.
- Added "Generate new data row", "Clear field values", and "Exit program" options to the custom operation.
- Added a feature to prompt return values when testing JavaScript (JS).
- The default task reading mode for command-line invocation is changed to local mode.
- Automatic line wrap for field content examples when values are too long.
- Task list now supports sort and search functionality.
- Shortened the duration of the prompt shown when saving a task.
- Fixed a bug with loops that click each link in turn on MacOS, and a bug where the open-webpage operation should open the first link in the link pool by default but did not.
- Updated Chrome browser to version 124.
Version 0.6.0
If the download speed is slow, consider the download mirror within mainland China: 中国境内下载地址.
The Windows x64 version supports 64-bit Windows 10/Windows Server 2016 and above; the Windows x32 version supports Windows 7 and above in both 32-bit and 64-bit editions, so even 64-bit Windows 7 should use the x32 version of EasySpider. Note that the Chrome browser bundled with the x32 version is always version 109 and will not be updated along with Chrome releases (to stay compatible with Windows 7), so if you want to collect data with the latest Chrome, run the x64 version on Windows 10 x64 or above. No build supports Windows Server 2012 or below; on those systems you need to compile and run the software yourself.
For the MacOS version, please use the system's built-in Archive Utility to unzip the .7z file. The MacOS version supports all chipsets, including Apple silicon (such as M1 and M2) and Intel chips (such as Core i7); make sure you download the build that matches your chip. The minimum supported operating system version is 11.1; for older systems, please use the v0.2.0 Mac release, or download the code and compile it yourself. An example compilation method can be found in this issue.
Similarly, the Linux version is only suitable for Ubuntu 20.04 and above, Deepin, Debian, and their derivatives. If you want to collect data using other Linux distributions, please download the code and compile it yourself. For an example of compiling on CentOS, see this issue.
Please scroll down to the bottom of this section to download EasySpider.
Tutorials for this version
Paging method when there is no 'next page' button, only specific page buttons (EXEC and EVAL tutorial)
Solution for the operation console blocking the login page
Setting the number of iterations (including infinite loops) and extracting data only when page content is detected
Update Notes
For the English version of the update notes, please see: https://github.com/NaiboWang/EasySpider/wiki/Update-notes-of-version-0.6.0
- During task design, with the browser open, single-clicking an operation makes the browser automatically highlight the corresponding element(s), which makes debugging easier. Any browser-related operation can be debugged this way, including JavaScript commands and conditional branches, where the condition is checked automatically and matching elements are highlighted.
- During task design, with the browser open, double-clicking an operation test-runs it for dynamic debugging and shows the execution result in the browser in real time.
- Speed-up: for extract-data operations inside loops, data extraction is dramatically faster when there are no extra steps such as executing JS or downloading images.
- Dynamically modify XPaths and code snippets with the eval feature: in any XPath or JavaScript snippet, `eval("expression")` can be used to reference a Python expression from the execution environment directly, without storing the value in a variable via a custom operation first. Example (see also the sketch below):
  - Define a variable a with the exec option of a custom operation: `self.a = 1`
  - In the XPath of an extract-data operation, use the following value to represent /html/body/div[1]: `/html/body/div[eval("self.a")]`
  - Change the value of a with another exec custom operation: `self.a = self.a + 1`
  - The XPath of the extract-data operation now becomes /html/body/div[2]
  This suits pages that have no 'next page' button and can only be paged by clicking successive page-number buttons; see the example tutorial.
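A minimal, self-contained Python sketch of this eval substitution (an illustration of the idea only, not EasySpider's actual code; the `TaskEnv` class and `resolve` helper are invented for the example):

```python
import re

class TaskEnv:
    """Toy stand-in for the execution environment."""
    def __init__(self):
        self.a = 1  # set via a custom operation's exec option, e.g. self.a = 1

    def resolve(self, xpath: str) -> str:
        # Replace every eval("...") placeholder with the value of the Python expression.
        return re.sub(r'eval\("([^"]*)"\)',
                      lambda m: str(eval(m.group(1), {}, {"self": self})),
                      xpath)

env = TaskEnv()
print(env.resolve('/html/body/div[eval("self.a")]'))  # -> /html/body/div[1]
env.a += 1                                            # e.g. self.a = self.a + 1 via exec
print(env.resolve('/html/body/div[eval("self.a")]'))  # -> /html/body/div[2]
```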
- All Exec and Eval options can load an external code file: write the Python code locally in an IDE such as VSCode, then enter `outside:myCode.py` in the task input box, and the program will read and execute the code in myCode.py under the EasySpider directory. This is useful when a large amount of code is easier to write with IDE support. Note that EasySpider supports defining custom Python functions, importing external Python packages, and handling exceptions with try...except; a hypothetical example file is sketched below.
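What such an external file might contain (the helper name and the parsing logic below are illustrative assumptions, not code shipped with EasySpider):

```python
# myCode.py -- hypothetical contents for a file referenced as outside:myCode.py
import re                      # external Python packages/modules can be imported

def clean_price(raw: str) -> float:
    """Custom helper: keep digits and dots, then convert to float."""
    return float(re.sub(r"[^0-9.]", "", raw) or 0)

try:
    print(clean_price("$ 1,299.00"))   # -> 1299.0
except ValueError as exc:              # errors can be handled with try...except
    print("failed to parse price:", exc)
```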
- The input-text operation (including batch text input) can likewise use the `eval("Python code")` keyword to type values generated dynamically by Python at execution time, and the `JS("return JS code")` keyword to type text generated dynamically by JavaScript (the JS code must stay on one line). For example, `JS("return new Date().getMonth()+1")/2023` types "current month/2023", i.e. 12/2023 when run in December 2023. A conceptual sketch follows below.
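Conceptually, a `JS("return ...")` placeholder amounts to evaluating the snippet in the page before typing, which can be pictured with plain Selenium (a sketch assuming a local ChromeDriver and a placeholder URL; EasySpider's real resolution logic may differ):

```python
from selenium import webdriver

driver = webdriver.Chrome()        # assumes a matching ChromeDriver is available
driver.get("https://example.com")  # placeholder page
month = driver.execute_script("return new Date().getMonth()+1")
print(f"{month}/2023")             # e.g. "12/2023" when run in December 2023
driver.quit()
```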
- Multi-level nested iframes can now be handled, with the same experience as pages without iframes. Note, however, that the XPath must be one that can only be located inside the intended iframe; an XPath such as `//body` will only match the body tag of the first-level iframe.
- After an extract-data operation has been designed, the browser operation console asks whether further pagination is needed; you can then point to the pagination button, and an extract-data operation with pagination is generated automatically in the flowchart.
- The custom operation can now pause program execution, e.g. to wait automatically for the user to solve a CAPTCHA when one pops up.
- The custom operation can now refresh the page.
- The custom operation can now send e-mails.
- The click-element operation can handle Alert pop-ups, with a choice of accepting or dismissing the dialog.
- Parallel execution improvements: in the execution mode that carries user profile data, the user data directory is now copied first and the copy is used for execution, which solves the problem of running multiple instances in parallel. You can now click the execute-task (with user data) button several times, or run several command-line instances at once, without copying user data folders manually. The copied folder is deleted automatically when the task finishes (if you quit mid-way, delete the temporary user data directory under the TempUserDataFolder folder manually).
- Operation names are now matched and renamed automatically according to context, which saves tedious manual renaming: click-element and move-to-element operations default to the text of the target element, loop operations are renamed according to the loop type, and custom operations/loops/conditional branches are renamed automatically when their type is switched.
- For single-element loops, such as a loop that keeps clicking the next-page button, the "content unchanged" exit check can be limited to a specific element instead of the whole page.
- The default download location for files is now inside the task folder.
- Newly added conditional branches are now appended on the far right.
- Right-clicking any operation in the flowchart opens a context menu for test-running (debugging), copying, cutting and deleting the operation, and for reordering conditional branches.
- The operation tip box now has a close button in the lower-right corner, useful when a login QR code is covered; click the × to close the console.
- When saving a task, the pause/resume key can be customized, so that different parallel instances can be controlled with different keys.
- When saving a task, you can choose whether to maximize the browser window before the task runs.
- When saving a task, the write mode can be set to overwrite: each run of a task with the same task ID deletes the source file first and collects the data again (the file name must be set to a static name).
- When writing to a MySQL database, duplicate rows are skipped and execution continues, for scenarios where duplicates are unwanted (you need to set the table's primary key to the relevant field yourself; with the table EasySpider generates, the primary key is an auto-increment ID, so duplicates never occur).
- Added downloading of data:base64 images, and images that require a login to download can also be handled (not guaranteed to work in every case).
- Better exception handling to prevent unexpected interruptions during collection; interrupted steps are retried, and bugs such as history rollback have been fixed. See the tutorial on setting the number of iterations (including infinite loops) and extracting data only when page content is detected.
- Extracted data fields can be wrapped automatically, e.g. automatic line wrapping when collecting long articles.
- In the mode that carries user profile data, the browser window remembers its position from the last design session instead of always splitting the screen with the flowchart.
- The click-element operation can now click by coordinates, useful for clicking a blank area to close a window or dialog; for example, if the blank area is at (10, 10), write point(10, 10) in the XPath field of the click-element operation to click page coordinate (10, 10) (see the sketch after this list).
- You can choose whether to remove duplicate rows after data collection finishes. Note that deduplication only runs when the task ends, so it will not happen if the task is aborted mid-way!
- In unfixed-element-list/fixed-element-list/text-list/URL-list loops, the first n iterations can be skipped, useful when a task was interrupted and you do not want to start from the beginning.
- When executing a task, the task ID can be specified manually; clicking "Execute directly" or "Get ID" then reuses that ID instead of generating a new one. If the specified task ID already exists, the task invocation file is overwritten. This is useful when you have modified a task flow and want to keep appending to the files in the original task ID folder instead of starting under a new ID.
- Upgraded the ddddocr library.
- Updated the software UI.
- Updated the Chrome browser to version 120.
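For the coordinate-based click mentioned above, the effect of point(10, 10) can be approximated with plain Selenium as follows (an illustration only, assuming a local ChromeDriver and a placeholder URL; EasySpider's own handling of point(x, y) may differ):

```python
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder page
# Click whatever element is rendered at viewport coordinate (10, 10),
# e.g. a blank area used to dismiss a dialog.
driver.execute_script("document.elementFromPoint(10, 10).click();")
driver.quit()
```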
Version 0.5.0
Tutorial with examples of EXEC and EVAL usage: https://github.com/NaiboWang/EasySpider/wiki/EXEC%E5%92%8CEVAL%E7%94%A8%E6%B3%95%E7%A4%BA%E4%BE%8B
This version is only released for Windows x64 and x32 and for Apple Silicon Macs; you are welcome to try it out and report any bugs as issues in a timely manner. For the other operating system versions, please use version 0.3.5 for now.
If the download speed is slow, consider the download mirror within mainland China: 中国境内下载地址.
For the MacOS version, run the following command to modify the package attributes and fix the "package is damaged" error:
xattr -cr YourPathToEasySpider.app
For example:
xattr -cr /Users/YourUserName/Downloads/EasySpider_MacOS_all_arch/EasySpider.app
Then try opening it again.
The Windows x64 version supports 64-bit Windows 10 and above; the Windows x32 version supports Windows 7 and above in both 32-bit and 64-bit editions, so even 64-bit Windows 7 should download this version. Note that the Chrome browser bundled with the x32 version is always version 109 and will not be updated along with Chrome releases (to stay compatible with Windows 7), so if you want to collect data with the latest Chrome, run the x64 version on Windows 10 x64 or above.
For the MacOS version, please use the system's built-in Archive Utility to unzip the archive. The MacOS version supports all chipsets, including Intel chips (such as Core i7) and Apple silicon (such as M1 and M2); make sure you download the build that matches your chip. The minimum supported operating system version is 11.1; for older systems, please use the v0.2.0 Mac release, or download the code and compile it yourself. An example compilation method can be found in this issue.
Similarly, the Linux version is only compatible with Ubuntu 20.04 and above, Deepin, Debian, and their derivatives. If you want to use other Linux distributions for data collection, please download the code and compile it yourself.
Release Notes
- Major Update: The custom operation can now run Python code directly in the current environment to define custom variables and retrieve variable values; loop and branch conditions also recognize custom variables and expressions.
  This is an advanced option: you can manipulate the running browser with Python code, define variables for the whole execution environment, and modify and assign them. Examples:
  - Use `self.browser` to refer to the browser currently being operated on and drive it directly with the Selenium API; for instance, `self.browser.find_element(By.CSS_SELECTOR, "body").send_keys(Keys.END)` scrolls to the bottom of the page.
  - Define a global variable: `self.myVar = 1`
  - Manipulate the global variable defined above: `self.myVar = self.myVar + 1`
  - Print the global variable defined above: `print(self.myVar)`
  If you want to record a custom variable as a field, choose the next option, "Retrieve Python Expression Value in Execution Environment" (the eval operation).
  The eval option directly returns the value of a Python expression, and the return value of this operation can be referenced elsewhere as `Field["operation name"]`. Examples:
  - Return a value from the current browser object: `self.browser` refers to the browser currently being operated on and can be driven with the Selenium API, e.g. `self.browser.find_element(By.CSS_SELECTOR, "body").text` returns the text of the current page.
  - Return the value of a custom global variable: `self.myVar`
  - Return the value of a condition: `self.myVar == 1`; the result of such an expression can be used in conditional branches and loops.
  Note that the eval option cannot assign to variables, i.e. you cannot write `self.myVar = 1`; if you need an assignment, choose the previous option, "Run Python code on current environment" (the exec operation). A minimal sketch of this exec/eval mechanism follows below.
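The exec/eval pair described above can be pictured with a minimal, self-contained Python sketch (an illustration of the underlying Python exec/eval mechanism only, not EasySpider's actual executor; the class and method names are invented):

```python
class MiniExecutor:
    """Toy stand-in for the task execution environment."""
    def run_exec(self, code: str) -> None:
        # "exec" option: run statements, e.g. assignments such as self.myVar = 1.
        exec(code, {}, {"self": self})

    def run_eval(self, expression: str):
        # "eval" option: return the value of an expression, e.g. self.myVar == 1.
        return eval(expression, {}, {"self": self})

env = MiniExecutor()
env.run_exec("self.myVar = 1")               # define a global variable
env.run_exec("self.myVar = self.myVar + 1")  # manipulate it
print(env.run_eval("self.myVar"))            # -> 2, can be recorded as a field value
print(env.run_eval("self.myVar == 1"))       # -> False, usable in conditions and loops
```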
- Within a loop, text from a looped text list can be fed into multiple input boxes, as long as the index values match.
- During execution you can choose a specific Excel file to read, including its path; for multiple fields in one looped text list, several columns with the same name can be read from Excel and merged automatically.
- Click-element and move-to-element events inside a relative loop can use an XPath relative to the loop. Note that this feature is not compatible with task files from previous versions: old files must be fixed manually by clearing the XPath of every click-element operation inside loops, so it is recommended to design tasks directly with the new version.
- Major UI update: operations can be added, the flow modified, and anchor points adjusted by drag-and-drop; right-click to delete elements; double-click an arrow to adjust its anchor points directly.
- Added a close button in the bottom-right corner of the browser operation console, for cases where the console blocks a CAPTCHA or login box.
- Before recording a field, you can choose whether to clear the values of fields not defined by the current operation.
- Added the ability to skip the current loop iteration, i.e. a `Continue` feature.
- Every XPath can use `Field["field name"]` to substitute a variable value.
- For extract-data operations, a task can now resume from the last saved position when re-executed (set when saving the task), so an unexpected exit no longer forces a restart from the beginning.
- OCR functionality switched to `ddddocr`, which removes the manual environment setup and improves recognition accuracy.
- Fixed a bug where one extra row appeared when extracting data with fields set not to be saved.
- An operation can be set to wait for a specific element to appear before it executes.
- Attribute values of elements can be extracted.
- Added copyright and usage agreement statements.
- All editions now support "keep scrolling down until the page content no longer changes", and the exit conditions of the loop that clicks the next page are now "next page button not found" and "no page content change detected".
- Optimized the log format.
- Files can now be saved in JSON format.
- Updated Chrome to version 115.
Version 0.3.5
If the download speed is slow, consider the download mirror within mainland China: 中国境内下载地址.
Explanation of the issue where link addresses cannot be collected in some cases on the Windows x64 version: #128
The Windows x64 version supports 64-bit Windows 10 and above; the Windows x86 version supports Windows 7 and above in both 32-bit and 64-bit editions, so even 64-bit Windows 7 should download this version. Note that the Chrome browser bundled with the x86 version is always version 109 and will not be updated along with Chrome releases (to stay compatible with Windows 7), so if you want to collect data with the latest Chrome, run the x64 version on Windows 10 x64 or above.
For the MacOS version, please unzip the .tar.gz file with the system's built-in Archive Utility. The MacOS version supports all chipsets, including Intel, M1, M2 and other processors, but the minimum operating system version is 11.1. For older systems, please use the v0.2.0 Mac release, or download the code and compile it yourself; an example compilation method can be found in this issue.
Similarly, the Linux version is only compatible with Ubuntu 20.04 and above, Deepin, Debian, and their derivatives. If you want to use other Linux distributions for data collection, please download the code and compile it yourself.
Update Instruction
- Speed up: Greatly improved the collection speed in most scenarios.
- Variable Functionality: In all places where you write JavaScript/system command code statements and open web page links, you can use Field["parameter_name"] to represent the recently extracted page parameter value/custom operation return. This provides comprehensive variable functionality.
- Loop Control: During a loop, you can use the `exit loop` option of a `custom operation` at any position to exit the loop directly, i.e. a `Break` function has been added.
- Data Extraction: Data within `<iframe>` tags can be extracted.
- Task Control: Added a pause feature; long-press the `p` key to pause and resume task execution.
- (Windows x64 only for now; other systems in the next version) Added a "keep scrolling down until the page content does not change" feature, and changed the exit conditions of the loop that repeatedly clicks the next page to "next page button not found" and "no page content change detected".
- XPath Debugging: The `XPath Helper` extension can also be used to debug XPath during the execution stage, in combination with the pause feature above.
- Data Export and Writing: Data can be exported to `Excel/TXT` files or written to a `MySQL` database, and data types can be specified as `integer/decimal/date`, etc.; click here to view the MySQL writing tutorial.
- Parameter Handling: Input parameter values for task invocation can be replaced by reading an Excel file.
- Interface Adjustment: The browser operation console can be resized by dragging its top-left corner.
- Data Handling: An extracted field can be set not to be saved (useful when the field is only needed as a variable input).
- Text Input: After an input-text operation, `<enter>` or `<ENTER>` represents a hard return, i.e. pressing Enter in the current text box after typing.
- Device Simulation: Can simulate a mobile browser.
- (Windows x64 only, not stable) Cloudflare Handling: Capable of handling and collecting data from websites protected by Cloudflare's CAPTCHA; click here to view the video tutorial.
- XPath Indexing: Added a hint for XPaths that use last() to count from the end as the default index position.
- Wait Time Control: The wait time after an operation can be set to a random value between 50% and 150% of the configured time.
- Source Code Included: The package ships with the Python source code so that professionals can modify the task flow and debug.
- Cookie Handling: The advanced options of `open webpage` support reading the current page's cookies and modifying them.
- Click Simulation: Changed the way elements are clicked to truly simulate real-world mouse clicks.
- General Parameter Settings: how many records to buffer before each local write (default 10); preview data length in the control bar (default 15), etc.
- File Compression: Reduced the task file size.
- Name and Location Changes: The default file save path is `Data/Task_ID`. To save to another path, use a relative reference such as `../../`; for example, `../../JS` saves a file named `JS` in the folder one level above the `Data` folder, i.e. the main `EasySpider` folder.
- Flowchart Updates: The flowchart and option configuration refresh automatically without clicking the `Confirm` button, but the task still has to be saved manually.
- Source Code Optimization: Optimized the source code to make secondary development easier.
- Bug Fixes: error information is now printed when a system command fails, and system command execution failures on MacOS and Linux were fixed; also fixed URL format detection and incorrect index values for cumulatively growing field names, among other bugs.
- Filtered out irrelevant log messages for a cleaner execution interface.
Version 0.3.2
If the download speed is slow, consider the download mirror within mainland China: 中国境内下载地址.
Video walkthrough of the new features: scheduled task execution, multiple modes for selecting child elements, and using extracted values as variable input.
The Windows x64 and x32 versions support Windows 10 and above. For Windows 7, please download the dedicated Windows 7 edition (Chrome 109 is the last Chrome version that supports Windows 7); make sure you do not download the wrong build.
The MacOS version supports all chipsets, including Intel, M1, M2 and other processors, but the minimum operating system version is 11.1. For older operating system versions, please use the v0.2.0 Mac release, or download the code and compile it yourself; an example compilation method can be found in this post.
Similarly, the Linux version is only compatible with Ubuntu 20.04 and above, Deepin, Debian, and their derivatives. If you want to use other Linux distributions for data collection, please download the code and compile it yourself.
Version 0.3.2 is compatible with tasks from version 0.3.1.
Update Instruction
- Selected child element operations can delete fields and unmark deleted fields in real-time in the browser.
- Selecting child elements adds a selection mode that allows you to choose only the child elements that are present in all blocks or the child elements that are the same as the first selected block.
- In the text input and webpage open options, you can use the extracted field value as a variable for text input, represented by `Field["field_name"]`.
- Files can be downloaded, such as PDF files.
- Fixed a bug where the software could display a blank screen for about 10 seconds after opening, making it usable in intranets, darknets, and any local network.
- Fixed a bug where the current page URL and title could not be extracted.
- Fixed a bug where OCR recognition could fail to extract information.
- Updated extraction logic to save locally every 10 records collected.
- When modifying a task, the default anchor position is set to after the last operation in the task flow.
- Updated Chrome version to 114.
Version 0.3.1
If the download speed is slow, consider the download mirror within mainland China: 中国境内下载地址.
The Windows version supports Windows 10 and above. There is no ready-to-use build for Windows 7 in this release (Chrome 109 is the last Chrome version that supports Windows 7), but the 32-bit v0.2.0 release still works, and the software can also be compiled and run from source. So if you want to collect data on Windows 7, please download the 32-bit v0.2.0 release or download the code and compile it yourself.
The MacOS version supports all chipsets, including Intel, M1, M2 and other processors, but the minimum operating system version is 11.1. For older operating system versions, please use the v0.2.0 Mac release, or download the code and compile it yourself; an example compilation method can be found in this post.
Similarly, the Linux version is only compatible with Ubuntu 20.04 and above, Deepin, Debian, and their derivatives. If you want to use other Linux distributions for data collection, please download the code and compile it yourself.
We strongly recommend watching the new-feature walkthrough videos.
The latest feature video has been uploaded to Bilibili; it is very useful and well worth watching.
[Important] Custom condition checks using the return value of a JS command inside a loop item - Part 2
Note that the `.json` files in the `tasks` folder of v0.3.1 are not compatible with any previous version; please redesign your tasks in v0.3.1.
Update Instruction
- Advanced Operations:
- Custom scripts can be executed in the workflow, including executing JavaScript commands in the browser and invoking scripts at the operating system level. The command's return value can be obtained and recorded, greatly expanding the scope of operations.
- Before and after each operation, you can specify a JavaScript command to be executed targeting the current located element.
- Custom scripts are also supported in conditions and loop conditions: the return value of the custom script is used as the condition for judgments and loops, greatly enhancing the flexibility of tasks. A code-based break can be set inside loops, and custom operations can manipulate elements within the loop.
- Multiple XPath expressions are generated simultaneously for the user to choose from, and the XPath Helper extension is pre-installed for XPath debugging.
- Added the ability to extract an element's background image URL, the current page title, and the current page URL.
- Added the ability to save screenshots of elements or entire web pages (works best in combination with headless mode).
- Added the ability to download images.
- Added OCR recognition of elements. To use this feature, the Tesseract library needs to be installed first: https://tesseract-ocr.github.io/tessdoc/Installation.html
- The return value of JavaScript code executed on an element can be extracted directly, enabling functionality such as regular expression matching or obtaining an element's background color.
- Added the ability to switch dropdown options and extract the selected value and text of dropdown options.
- Significantly improved usage hints and explanations to make the software more user-friendly, including instructions on handling iframe tags, the meaning of each option's parameters, and how to modify the XPath of loop items.
- Added instructions on how to execute tasks from the command line: https://github.com/NaiboWang/EasySpider/wiki/Argument-Instruction
- Added a parallel mode that can run different tasks concurrently.
- Added a headless mode configuration, allowing the software to run without a browser interface.
- Fixed the issue where Chinese paths were not recognized correctly in the user-profile browser mode.
- Fixed the issue where the program would freeze when a conditional branch had no unconditional branch.
- Fixed the issue where the input box would freeze after saving a task.
- Added an option to set the maximum page-load wait time in the "Open Page" and "Click Element" operations.
- Added the ability to move the mouse to an element.
- A prompt is displayed when an element cannot be found.
- Fixed the webpage scrolling bug.
- Added a new "extract data field" operation.
- The task name is initialized with the title of the first page visited.
- Added version update prompts.
- Added publisher information as requested.
- Updated Chrome to version 113.
Beta Version 0.3.0
If the download speed is slow, consider the download mirror within mainland China: 中国境内下载地址.
The Windows 64-bit beta has been uploaded; everyone is welcome to test it, and please open an issue promptly if you find problems or bugs. The stable release and the builds for other operating systems will be published as v0.3.1 by the end of May.
We strongly recommend watching the new-feature walkthrough videos.
The latest feature video has been uploaded to Bilibili; it is very useful and well worth watching.
[Important] Custom condition checks using the return value of a JS command inside a loop item - Part 2
Note that the `.json` files in the task folder of v0.3.0 are not compatible with v0.2.0; please redesign your tasks in v0.3.0.
Version 0.2.0
A completely restructured version that supports the following features:
- Full system platform support (Windows 32-bit, 64-bit; Ubuntu 20.04 x64 and above; all MacOS versions, including Intel and Arm series chips, such as M1).
- All task information is stored by default in the local tasks folder, so there is no need to set up a server. You can copy a generated task-ID .json file directly into another machine's task folder to import or export tasks; if you want to share task information across machines, you can also set up a cloud server and point to it by modifying config.json.
- Chinese-English bilingual support.
- Supports clean mode and data mode, which can retain previous login credentials, added extensions and other information every time you design and execute a web crawler task (all information is stored locally, so data is absolutely safe).
- Chrome extension updated to manifest V3 version and modularized.
- Data is automatically saved locally every 100 records to prevent data loss.
- Updated UI design.
- Users can operate manually while the crawler runs automatically, for example to enter a CAPTCHA that suddenly pops up (when designing the task you can set a wait time to leave room for manual input).
Beta Version 0.1.0
This version has been deprecated, please download the latest version for use.
Supported on Windows 10/11 x64 (amd64), Windows 10/11 x86 (386), Windows 7 (.NET Framework 4.7 required), Linux x64 (tested on Ubuntu 20.04 and above), and MacOS x64 (supported on both Intel and Arm chips such as M1).