Commit 92d7d06

chore(bench): add large benches

j-mendez committed Dec 27, 2023
1 parent 50ad1c6 commit 92d7d06
Showing 4 changed files with 58 additions and 38 deletions.
45 changes: 10 additions & 35 deletions README.md
@@ -2,59 +2,34 @@

The [spider](https://github.com/spider-rs/spider) project ported to Python.

+ Test url: `https://espn.com`
+
+ | `libraries`                    | `pages`   | `speed` |
+ | :----------------------------- | :-------- | :------ |
+ | **`spider-rs(python): crawl`** | `150,387` | `186s`  |
+ | **`scrapy(python): crawl`**    | `49,598`  | `1h`    |
+
+ The benches above were run on a Mac M1; spider on Linux Arm machines performs about 2-10x faster.
+
## Getting Started

1. `pip install spider_rs`

- ```python
- import asyncio
-
- from spider_rs import crawl
-
- async def main():
-     website = await crawl("https://choosealicense.com")
-     print(website.links)
-     # print(website.pages)
-
- asyncio.run(main())
- ```

Use the Website class to build the crawler you need.

```python
import asyncio

from spider_rs import Website

async def main():
-    website = Website("https://choosealicense.com", False).with_headers({ "authorization": "myjwttoken" })
+    website = Website("https://choosealicense.com")
    website.crawl()
    print(website.get_links())

asyncio.run(main())
```

Setting up real-time subscriptions can be done too.

```python
import asyncio

from spider_rs import Website

class Subscription:
    def __init__(self):
        print("Subscription Created...")

    def __call__(self, page):
        print(page.url + " - status: " + str(page.status_code))

async def main():
    website = Website("https://choosealicense.com", False)
    website.crawl(Subscription())

asyncio.run(main())
```

- View the [examples](./examples/) for more.
+ View the [examples](./examples/) to learn more.

## Development

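Aside: the line this commit removed from the Website example configured the crawler through the builder API before crawling. A minimal sketch of that pattern, assuming the `with_headers` builder and two-argument constructor shown in the removed line (the token value is a placeholder):

```python
import asyncio

from spider_rs import Website

async def main():
    # Build the crawler, then attach custom request headers before crawling.
    # Constructor arguments and `with_headers` mirror the removed README line;
    # "myjwttoken" is a placeholder, not a real credential.
    website = Website("https://choosealicense.com", False).with_headers(
        { "authorization": "myjwttoken" }
    )
    website.crawl()
    print(website.get_links())

asyncio.run(main())
```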
22 changes: 20 additions & 2 deletions bench/README.md
@@ -3,7 +3,7 @@
You can run the benches with Python from a terminal.

```sh
- python scrapy.py && python spider.py
+ python scrappy.py && python spider.py
```

## Cases
@@ -32,4 +32,22 @@
pages found 200
elapsed duration 5.860108852386475
```

- Linux performance for Spider-RS increases around 10x, especially on Arm.
+ Test url: `https://a11ywatch.com` (medium)
+ 648 pages
+
+ | `libraries`                        | `speed` |
+ | :--------------------------------- | :------ |
+ | **`spider-rs: crawl 10 samples`**  | `2s`    |
+ | **`scrapy: crawl 10 samples`**     | `7.7s`  |
+
+ Test url: `https://espn.com` (large)
+ 150,387 pages
+
+ | `libraries`                                | `pages`   | `speed` |
+ | :----------------------------------------- | :-------- | :------ |
+ | **`spider-rs(python): crawl 10 samples`**  | `150,387` | `186s`  |
+ | **`scrapy(python): crawl 10 samples`**     | `49,598`  | `1h`    |
+
+ Scrapy used too much memory; the crawl was cancelled after an hour.
+
+ Note: performance gains grow with website size and when throttling is needed. Linux benchmarks are about 10x faster than macOS for spider-rs.
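For reference, a minimal sketch of how a single spider-rs sample like the output above could be timed, using only the `Website` API shown in the README; the timing harness itself is an assumption, and whether `crawl` must be awaited depends on the binding (this mirrors the README's un-awaited usage):

```python
import asyncio
import time

from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com")
    start = time.perf_counter()
    website.crawl()  # mirrors the README example; awaiting may be required
    elapsed = time.perf_counter() - start
    # Mirrors the "pages found ... / elapsed duration ..." output shown above.
    print("pages found", len(website.get_links()))
    print("elapsed duration", elapsed)

asyncio.run(main())
```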
9 changes: 9 additions & 0 deletions book/src/README.md
@@ -13,3 +13,12 @@
- Written in [Rust](https://www.rust-lang.org/) for speed, safety, and simplicity

Spider powers some big tools and, with the correct setup, brings crawling downtime to almost none; view the [spider](https://github.com/spider-rs/spider) project to learn more.

+ Test url: `https://espn.com`
+
+ | `libraries`                    | `pages`   | `speed` |
+ | :----------------------------- | :-------- | :------ |
+ | **`spider-rs(python): crawl`** | `150,387` | `186s`  |
+ | **`scrapy(python): crawl`**    | `49,598`  | `1h`    |
+
+ The benches above were run on a Mac M1; spider on Linux Arm machines performs about 2-10x faster.
20 changes: 19 additions & 1 deletion book/src/benchmarks.md
@@ -50,4 +50,22 @@
Test url: `https://rsseau.fr` (medium)
| **`spider-rs: crawl 10 samples`** | `2.5s` |
| **`scrapy: crawl 10 samples`** | `10s` |

- The performance scales the larger the website and if throttling is needed. Linux benchmarks are about 10x faster than macOS for spider-rs.
+ Test url: `https://a11ywatch.com` (medium)
+ 648 pages
+
+ | `libraries`                        | `speed` |
+ | :--------------------------------- | :------ |
+ | **`spider-rs: crawl 10 samples`**  | `2s`    |
+ | **`scrapy: crawl 10 samples`**     | `7.7s`  |
+
+ Test url: `https://espn.com` (large)
+ 150,387 pages
+
+ | `libraries`                        | `pages`   | `speed` |
+ | :--------------------------------- | :-------- | :------ |
+ | **`spider-rs: crawl 10 samples`**  | `150,387` | `186s`  |
+ | **`scrapy: crawl 10 samples`**     | `49,598`  | `1h`    |
+
+ Scrapy used too much memory; the crawl was cancelled after an hour.
+
+ Note: performance gains grow with website size and when throttling is needed. Linux benchmarks are about 10x faster than macOS for spider-rs.
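The scrapy side of these tables is a plain link-following crawl. The bench's actual scrapy script is not shown on this page; a minimal equivalent using scrapy's standard `CrawlSpider` API might look like the sketch below (the spider name, target domain, and settings are illustrative):

```python
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class BenchSpider(CrawlSpider):
    # Follow every same-site link and record each page, mirroring a raw crawl.
    name = "bench"
    allowed_domains = ["choosealicense.com"]
    start_urls = ["https://choosealicense.com"]
    rules = (Rule(LinkExtractor(), callback="parse_page", follow=True),)

    def parse_page(self, response):
        yield {"url": response.url, "status": response.status}

process = CrawlerProcess(settings={"LOG_ENABLED": False})
process.crawl(BenchSpider)
process.start()
```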
