Added option to include page body in crawl results #19
Oh right I guess this new "body" field doesn't match the type hint for
@indrajithi I'm not quite sure how to do the type hints for the
If the type hint for crawl_result is not working, we can just set it to a basic dict or override/suppress checking that case and move on.
Alright I'm on my phone right now but I'll get to it when I get home 👍 |
Nope, that's what raised the error on the CI.
@indrajithi Ok I just introduced a temporary fix, I set the type hint to
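For anyone following the type-hint discussion above, here is a minimal sketch of the two options mentioned: a basic dict versus a stricter hint that allows an optional body. The names `UrlEntry`, `LooseCrawlResult`, `StrictCrawlResult` and `make_entry` are hypothetical. The assumption that each entry holds a `urls` list plus an optional `body` string is inferred from this thread, not taken from the project's code.

```python
from typing import Any, Dict, List, Optional, TypedDict


class UrlEntry(TypedDict, total=False):
    """Hypothetical per-page entry; total=False makes every key optional."""
    urls: List[str]
    body: str


# Option 1: the "basic dict" fallback, which accepts any value per key.
LooseCrawlResult = Dict[str, Dict[str, Any]]

# Option 2: a stricter hint that still permits the optional "body" key.
StrictCrawlResult = Dict[str, UrlEntry]


def make_entry(urls: List[str], body: Optional[str] = None) -> UrlEntry:
    """Build one entry, adding "body" only when a body is supplied."""
    entry: UrlEntry = {"urls": list(urls)}
    if body is not None:
        entry["body"] = body
    return entry


# Both entries fit StrictCrawlResult, with or without a body.
result: StrictCrawlResult = {
    "http://example.com": make_entry(["http://example.com/about"]),
    "http://example.com/about": make_entry([], body="<html></html>"),
}
```

The loose alias is the quickest way to get a type checker to pass. The `TypedDict` variant keeps the checker useful for the known keys, at the cost of a little more code.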
Closes #8
Changes
- `include_body` argument in the `Spider` class. This is a boolean that defaults to false. When set to true, the body of the crawled pages will be included in `crawl_result`.
- In `Spider.crawl`, added the code to add the body to `crawl_result` from `soup.html`.

Right now the body of the page is added regardless of whether it finds links inside it or not! This just felt like the most expected behavior, but let me know if I should change it.
Also there are no verbose prints, I didn't find it necessary but let me know if I should add some 👍
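To illustrate the behavior described above, here is a minimal, self-contained sketch. It is not the code from this PR. `LinkCollector` and `build_entry` are hypothetical names, the standard-library `HTMLParser` stands in for BeautifulSoup, the raw HTML string stands in for `soup.html`, and the `urls` key is an assumption about the shape of `crawl_result`.

```python
from html.parser import HTMLParser
from typing import Any, Dict, List


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags (stand-in for BeautifulSoup)."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def build_entry(html: str, include_body: bool = False) -> Dict[str, Any]:
    """Build one hypothetical crawl_result entry for a fetched page."""
    collector = LinkCollector()
    collector.feed(html)
    entry: Dict[str, Any] = {"urls": collector.links}
    if include_body:
        # The body is stored whether or not the page contains any links,
        # matching the behavior described in the PR text.
        entry["body"] = html
    return entry


page = "<html><body><p>No links here</p></body></html>"
without_body = build_entry(page)                    # default: no "body" key
with_body = build_entry(page, include_body=True)    # body kept, zero links
```

Because the flag defaults to false, existing callers would see the same result shape as before. Only callers that opt in get the extra `body` key.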