
CLI: download files as they arrive? #192

Closed
gjtorikian opened this issue Jul 19, 2024 · 4 comments

Comments

@gjtorikian

Hey, thanks for this amazing tool. I've noticed that files are written to disk only after the website is completely processed. For performance, this makes sense: making HTTP requests and then stopping to write out an HTML file is an interruption that takes time.

However, in some cases the trade-off is worth it. If one were to crawl all of docs.github.com, they would have to hold roughly 1.5 GB in memory before flushing it all to disk.

Is there any interest in either a) an option to write to disk as pages are found, or b) some kind of memory limit (e.g., flush to disk after 10 MB of content)? I have experience with Rust, but not with this project, so I'm happy to make a PR if there's interest. Thank you!

@j-mendez
Member

Hi @gjtorikian,

Thank you for your input. We have this code for storing files temporarily to help with large pages:

#[cfg(feature = "fs")]
pub async fn fetch_page_html(target_url: &str, client: &Client) -> PageResponse {

Currently, we delete the file after parsing. The fs feature is only necessary if you need to store very large pages that exceed what can be buffered in memory.

We can update the code so that the file deletion process is conditionally compiled, making it optional.
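
A minimal sketch of that conditional compilation, assuming a hypothetical fs_persist feature flag (the name is illustrative, not part of the crate):

// Sketch only: "fs_persist" is a hypothetical feature name.
// When it is enabled, the temporary file is kept instead of
// being deleted after parsing.
#[cfg(all(feature = "fs", not(feature = "fs_persist")))]
async fn cleanup_temp_file(path: &std::path::Path) {
    if let Err(e) = tokio::fs::remove_file(path).await {
        eprintln!("failed to remove temp file {}: {}", path.display(), e);
    }
}

#[cfg(all(feature = "fs", feature = "fs_persist"))]
async fn cleanup_temp_file(_path: &std::path::Path) {
    // No-op: leave the file on disk for the caller to consume.
}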

If you would like to implement this, feel free. I can get to it tonight or tomorrow.

The fs flag currently stores the files in the OS's temporary directory. We could introduce an environment variable to control this path, configurable via the CLI.
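
For example, a sketch of resolving that path from an environment variable (SPIDER_FS_PATH is an assumed name for illustration):

use std::path::PathBuf;

// Sketch only: SPIDER_FS_PATH is a hypothetical variable name.
// Fall back to the OS temp directory when it is unset.
fn storage_dir() -> PathBuf {
    std::env::var("SPIDER_FS_PATH")
        .map(PathBuf::from)
        .unwrap_or_else(|_| std::env::temp_dir())
}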

Additionally, the spider core crate could use some refactoring. The current split of the scrape methods makes the code verbose and hard to extend with new logic. Since the CLI uses the scrape functionality (though it may differ slightly from crawl), a refactor should be considered by the end of the month to make development easier.

@j-mendez
Member

We can also update the CLI code to use website.subscription to process each page concurrently as it is streamed. That would most likely be the best approach.
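
A rough sketch of that approach using the crate's subscribe channel (this assumes the sync feature is enabled; the file-naming scheme here is purely illustrative):

use spider::website::Website;
use spider::tokio;

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://docs.github.com");
    // Subscribe before crawling; each page is broadcast as it arrives.
    let mut rx = website.subscribe(16).unwrap();

    let writer = tokio::spawn(async move {
        while let Ok(page) = rx.recv().await {
            // Illustrative naming: flush each page's HTML to disk
            // immediately instead of holding the whole crawl in memory.
            let name = page.get_url().replace(['/', ':'], "_");
            let _ = tokio::fs::write(format!("{name}.html"), page.get_html()).await;
        }
    });

    website.crawl().await;
    website.unsubscribe();
    let _ = writer.await;
}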

@j-mendez
Member

Released in 1.99.6. A major performance increase is also coming!

@gjtorikian
Author

Wow! From one open source maintainer to another, thank you. That was incredibly fast.

As an aside, I should mention that I'm working on a Ruby wrapper for this wonderful lib, though "wrapper" might be very optimistic phrasing. I've worked with Rust/Ruby FFI boundaries before, but the underlying glue does not yet support futures, unlike PyO3, so I can't do anything with Ruby fibers + Rust async.

Regardless, I hope to provide a proper API around the crate soon and would be happy to let y'all know when that's done.
