CLI: download files as they arrive? #192
Comments
Hi @gjtorikian, thank you for your input. We have this code for storing files temporarily to help with large pages:

```rust
#[cfg(feature = "fs")]
pub async fn fetch_page_html(target_url: &str, client: &Client) -> PageResponse {
```

Currently, we delete the file after parsing. We can update the code so that the file deletion is conditionally compiled, making it optional. If you would like to implement this, feel free; otherwise I can get to it tonight or tomorrow. Additionally, the spider core crate could use some refactoring: the current split of scrape methods makes the code verbose and hard to extend with new logic. Since the CLI uses the scrape functionality (though it may be slightly different from …
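If the deletion were made optional, a minimal sketch under the assumption of an extra cargo feature might look like the following; the `fs_keep` feature and the `cleanup_temp_file` helper are hypothetical names for illustration, not part of the crate's actual API:

```rust
// Sketch only: `fs_keep` and `cleanup_temp_file` are hypothetical.
#[cfg(feature = "fs")]
async fn cleanup_temp_file(store_path: &std::path::Path) {
    // Default behaviour: remove the temporary file once parsing is done.
    #[cfg(not(feature = "fs_keep"))]
    {
        let _ = tokio::fs::remove_file(store_path).await;
    }
    // With the opt-in feature enabled, leave the file on disk for the CLI to collect.
    #[cfg(feature = "fs_keep")]
    {
        let _ = store_path; // no-op: keep the downloaded page
    }
}
```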
We can also update the CLI code to use …
Released in …
Wow! From one open source maintainer to another, thank you. That was incredibly fast. As an aside, I should mention I'm working on a Ruby wrapper for this wonderful lib, although "wrapper" might be a very optimistic phrasing. I've worked with Rust/Ruby FFI boundaries before, but the underlying glue does not yet support futures, unlike PyO3, so I can't do anything with Ruby fibers + Rust async. Regardless, I hope to provide a proper API around the crate soon and would be happy to let y'all know when that's done.
Hey, thanks for this amazing tool. I've noticed that files are written to disk after the website is completely processed. For performance, this makes sense: making HTTP requests and then stopping to write an HTML file out is an interruption that takes time.
However, in some cases the trade-off is worth it. If one were to crawl all of docs.github.com, they would have to hold on to 1.5 GB of memory before flushing it all to disk. Is there any interest in either a) an option to write to disk as pages are found, or b) some kind of memory limit (e.g., flush to disk after 10 MB of content)? I have experience with Rust, but not with this project, so I'm happy to make a PR if there's interest. Thank you!
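For context, a minimal sketch of option (a) under the assumption that the crawl loop can hand each fetched page to a channel as it arrives; the `Page` struct, file naming, and channel wiring here are illustrative of the general tokio pattern, not spider's actual API:

```rust
use std::path::PathBuf;
use tokio::{fs, sync::mpsc};

// Illustrative page payload; spider's real type differs.
struct Page {
    url: String,
    html: String,
}

// Writer task: persist each page as soon as the crawler sends it, so memory
// usage stays bounded by the channel capacity instead of the whole site.
async fn write_pages(mut rx: mpsc::Receiver<Page>, out_dir: PathBuf) -> std::io::Result<()> {
    fs::create_dir_all(&out_dir).await?;
    while let Some(page) = rx.recv().await {
        // Derive a flat file name from the URL (real code would sanitize properly).
        let name = page.url.replace('/', "_").replace(':', "_");
        fs::write(out_dir.join(format!("{name}.html")), page.html).await?;
    }
    Ok(())
}

#[tokio::main]
async fn main() -> std::io::Result<()> {
    // Bounded channel: the crawler backpressures once 64 pages are in flight.
    let (tx, rx) = mpsc::channel::<Page>(64);
    let writer = tokio::spawn(write_pages(rx, PathBuf::from("./output")));

    // Stand-in for the crawl loop: each fetched page is sent instead of buffered.
    for url in ["https://docs.github.com/en", "https://docs.github.com/en/actions"] {
        let html = format!("<!-- fetched body of {url} -->");
        if tx.send(Page { url: url.into(), html }).await.is_err() {
            break;
        }
    }
    drop(tx); // close the channel so the writer task finishes
    writer.await.expect("writer task panicked")
}
```

Option (b) would follow the same shape, with the writer accumulating pages in a buffer and flushing whenever the buffered size crosses the configured threshold.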