
CLI: download files as they arrive? #192

Closed
gjtorikian opened this issue Jul 19, 2024 · 4 comments

Comments

@gjtorikian

Hey, thanks for this amazing tool. I've noticed that files are written to disk only after the website is completely processed. For performance, this makes sense: making HTTP requests and then stopping to write out an HTML file is an interruption that takes time.

However, in some cases the trade-off is worth it. If one were to crawl all of docs.github.com, they would have to hold roughly 1.5 GB in memory before flushing it all to disk.

Is there any interest in either a) an option to write to disk as pages are found, or b) some kind of memory limit (e.g., flush to disk after 10 MB of content)? I have experience with Rust, but not with this project, so I'm happy to make a PR if there's interest. Thank you!

@j-mendez
Member

Hi @gjtorikian,

Thank you for your input. We have this code for storing files temporarily to help with large pages:

#[cfg(feature = "fs")]
pub async fn fetch_page_html(target_url: &str, client: &Client) -> PageResponse {

Currently, we delete the file after parsing. The fs feature is only necessary if you need to store very large pages that exceed what can be buffered in memory.

We can update the code so that the file deletion process is conditionally compiled, making it optional.
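
A minimal sketch of that conditional compilation, assuming a hypothetical fs_persist feature flag (the name is illustrative, not part of the crate):

// Sketch only: "fs_persist" is a hypothetical feature name.
// When it is enabled, the temporary file is kept instead of
// being deleted after parsing.
#[cfg(all(feature = "fs", not(feature = "fs_persist")))]
async fn cleanup_temp_file(path: &std::path::Path) {
    if let Err(e) = tokio::fs::remove_file(path).await {
        eprintln!("failed to remove temp file {}: {}", path.display(), e);
    }
}

#[cfg(all(feature = "fs", feature = "fs_persist"))]
async fn cleanup_temp_file(_path: &std::path::Path) {
    // No-op: leave the file on disk for the caller to consume.
}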

If you would like to implement this, feel free. I can get to it tonight or tomorrow.

The fs flag currently stores the files in the OS's temporary directory. We could introduce an environment variable to control this path, configurable via the CLI.
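
For example, a sketch of resolving that path from an environment variable (SPIDER_FS_PATH is an assumed name for illustration):

use std::path::PathBuf;

// Sketch only: SPIDER_FS_PATH is a hypothetical variable name.
// Fall back to the OS temp directory when it is unset.
fn storage_dir() -> PathBuf {
    std::env::var("SPIDER_FS_PATH")
        .map(PathBuf::from)
        .unwrap_or_else(|_| std::env::temp_dir())
}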

Additionally, the spider core crate could use some refactoring. The current split of the scrape methods makes the code verbose and hard to extend with new logic. Since the CLI uses the scrape functionality (though it may differ slightly from crawl), a refactor should be considered by the end of the month to make development easier.

@j-mendez
Member

We can also update the CLI code to use website.subscription to process each page concurrently as it is streamed. That would most likely be the best approach.
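
A rough sketch of that approach using the crate's subscribe channel (this assumes the sync feature is enabled; the file-naming scheme here is purely illustrative):

use spider::website::Website;
use spider::tokio;

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://docs.github.com");
    // Subscribe before crawling; each page is broadcast as it arrives.
    let mut rx = website.subscribe(16).unwrap();

    let writer = tokio::spawn(async move {
        while let Ok(page) = rx.recv().await {
            // Illustrative naming: flush each page's HTML to disk
            // immediately instead of holding the whole crawl in memory.
            let name = page.get_url().replace(['/', ':'], "_");
            let _ = tokio::fs::write(format!("{name}.html"), page.get_html()).await;
        }
    });

    website.crawl().await;
    website.unsubscribe();
    let _ = writer.await;
}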

@j-mendez
Member

Released in 1.99.6. A major performance increase is also coming!

@gjtorikian
Author

Wow! From one open source maintainer to another, thank you. That was incredibly fast.

As an aside, I should mention that I'm working on a Ruby wrapper for this wonderful lib, though "wrapper" might be very optimistic phrasing. I've worked with Rust/Ruby FFI boundaries before, but the underlying glue does not yet support futures, unlike PyO3, so I can't do anything with Ruby fibers + Rust async.

Regardless, I hope to provide a proper API around the crate soon and would be happy to let y'all know when that's done.
