Commit 92d7d06

chore(bench): add large benches

j-mendez committed Dec 27, 2023
1 parent 50ad1c6 commit 92d7d06
Showing 4 changed files with 58 additions and 38 deletions.
45 changes: 10 additions & 35 deletions README.md
@@ -2,59 +2,34 @@

The [spider](https://github.com/spider-rs/spider) project ported to Python.

+ Test url: `https://espn.com`
+
+ | `libraries`                    | `pages`   | `speed` |
+ | :----------------------------- | :-------- | :------ |
+ | **`spider-rs(python): crawl`** | `150,387` | `186s`  |
+ | **`scrapy(python): crawl`**    | `49,598`  | `1h`    |
+
+ The benches above were run on a Mac M1; spider on Linux Arm machines performs about 2-10x faster.
+
## Getting Started

1. `pip install spider_rs`

- ```python
- import asyncio
-
- from spider_rs import crawl
-
- async def main():
-     website = await crawl("https://choosealicense.com")
-     print(website.links)
-     # print(website.pages)
-
- asyncio.run(main())
- ```

Use the Website class to build the crawler you need.

```python
import asyncio

from spider_rs import Website

async def main():
-    website = Website("https://choosealicense.com", False).with_headers({ "authorization": "myjwttoken" })
+    website = Website("https://choosealicense.com")
    website.crawl()
    print(website.get_links())

asyncio.run(main())
```

Setting up real-time subscriptions can be done too.

```python
import asyncio

from spider_rs import Website

class Subscription:
    def __init__(self):
        print("Subscription Created...")

    def __call__(self, page):
        print(page.url + " - status: " + str(page.status_code))

async def main():
    website = Website("https://choosealicense.com", False)
    website.crawl(Subscription())

asyncio.run(main())
```

- View the [examples](./examples/) for more.
+ View the [examples](./examples/) to learn more.

## Development

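Aside: the line this commit removed from the Website example configured the crawler through the builder API before crawling. A minimal sketch of that pattern, assuming the `with_headers` builder and two-argument constructor shown in the removed line (the token value is a placeholder):

```python
import asyncio

from spider_rs import Website

async def main():
    # Build the crawler, then attach custom request headers before crawling.
    # Constructor arguments and `with_headers` mirror the removed README line;
    # "myjwttoken" is a placeholder, not a real credential.
    website = Website("https://choosealicense.com", False).with_headers(
        { "authorization": "myjwttoken" }
    )
    website.crawl()
    print(website.get_links())

asyncio.run(main())
```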
22 changes: 20 additions & 2 deletions bench/README.md
@@ -3,7 +3,7 @@
You can run the benches with Python from a terminal.

```sh
- python scrapy.py && python spider.py
+ python scrappy.py && python spider.py
```

## Cases
@@ -32,4 +32,22 @@
pages found 200
elapsed duration 5.860108852386475
```

- Linux performance for Spider-RS increases around 10x, especially on Arm.
+ Test url: `https://a11ywatch.com` (medium)
+ 648 pages
+
+ | `libraries`                        | `speed` |
+ | :--------------------------------- | :------ |
+ | **`spider-rs: crawl 10 samples`**  | `2s`    |
+ | **`scrapy: crawl 10 samples`**     | `7.7s`  |
+
+ Test url: `https://espn.com` (large)
+ 150,387 pages
+
+ | `libraries`                                | `pages`   | `speed` |
+ | :----------------------------------------- | :-------- | :------ |
+ | **`spider-rs(python): crawl 10 samples`**  | `150,387` | `186s`  |
+ | **`scrapy(python): crawl 10 samples`**     | `49,598`  | `1h`    |
+
+ Scrapy used too much memory; the crawl was cancelled after an hour.
+
+ Note: performance gains grow with website size and when throttling is needed. Linux benchmarks are about 10x faster than macOS for spider-rs.
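For reference, a minimal sketch of how a single spider-rs sample like the output above could be timed, using only the `Website` API shown in the README; the timing harness itself is an assumption, and whether `crawl` must be awaited depends on the binding (this mirrors the README's un-awaited usage):

```python
import asyncio
import time

from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com")
    start = time.perf_counter()
    website.crawl()  # mirrors the README example; awaiting may be required
    elapsed = time.perf_counter() - start
    # Mirrors the "pages found ... / elapsed duration ..." output shown above.
    print("pages found", len(website.get_links()))
    print("elapsed duration", elapsed)

asyncio.run(main())
```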
9 changes: 9 additions & 0 deletions book/src/README.md
@@ -13,3 +13,12 @@
- Written in [Rust](https://www.rust-lang.org/) for speed, safety, and simplicity

Spider powers some big tools and, with the correct setup, brings crawling downtime to almost none; view the [spider](https://github.com/spider-rs/spider) project to learn more.

+ Test url: `https://espn.com`
+
+ | `libraries`                    | `pages`   | `speed` |
+ | :----------------------------- | :-------- | :------ |
+ | **`spider-rs(python): crawl`** | `150,387` | `186s`  |
+ | **`scrapy(python): crawl`**    | `49,598`  | `1h`    |
+
+ The benches above were run on a Mac M1; spider on Linux Arm machines performs about 2-10x faster.
20 changes: 19 additions & 1 deletion book/src/benchmarks.md
@@ -50,4 +50,22 @@
Test url: `https://rsseau.fr` (medium)
| **`spider-rs: crawl 10 samples`** | `2.5s` |
| **`scrapy: crawl 10 samples`** | `10s` |

- The performance scales the larger the website and if throttling is needed. Linux benchmarks are about 10x faster than macOS for spider-rs.
+ Test url: `https://a11ywatch.com` (medium)
+ 648 pages
+
+ | `libraries`                        | `speed` |
+ | :--------------------------------- | :------ |
+ | **`spider-rs: crawl 10 samples`**  | `2s`    |
+ | **`scrapy: crawl 10 samples`**     | `7.7s`  |
+
+ Test url: `https://espn.com` (large)
+ 150,387 pages
+
+ | `libraries`                        | `pages`   | `speed` |
+ | :--------------------------------- | :-------- | :------ |
+ | **`spider-rs: crawl 10 samples`**  | `150,387` | `186s`  |
+ | **`scrapy: crawl 10 samples`**     | `49,598`  | `1h`    |
+
+ Scrapy used too much memory; the crawl was cancelled after an hour.
+
+ Note: performance gains grow with website size and when throttling is needed. Linux benchmarks are about 10x faster than macOS for spider-rs.
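The scrapy side of these tables is a plain link-following crawl. The bench's actual scrapy script is not shown on this page; a minimal equivalent using scrapy's standard `CrawlSpider` API might look like the sketch below (the spider name, target domain, and settings are illustrative):

```python
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class BenchSpider(CrawlSpider):
    # Follow every same-site link and record each page, mirroring a raw crawl.
    name = "bench"
    allowed_domains = ["choosealicense.com"]
    start_urls = ["https://choosealicense.com"]
    rules = (Rule(LinkExtractor(), callback="parse_page", follow=True),)

    def parse_page(self, response):
        yield {"url": response.url, "status": response.status}

process = CrawlerProcess(settings={"LOG_ENABLED": False})
process.crawl(BenchSpider)
process.start()
```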
