Conversation

@Mews
Collaborator

@Mews Mews commented Jun 16, 2024

Closes #8

Changes

  • Added include_body argument in the Spider class. This is a boolean that defaults to false. When set to true, the body of the crawled pages will be included in crawl_result.
  • In Spider.crawl, added the code that copies the page body from soup.html into crawl_result.
  • Added a test for this feature.

Right now the body of the page is added regardless of whether any links are found in it! This just felt like the most expected behavior, but let me know if I should change it.
Also, there are no verbose prints. I didn't find them necessary, but let me know if I should add some 👍
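
For readers skimming the thread, the behavior described above can be sketched roughly as follows. This is a minimal illustration and not the code from the PR: `SketchSpider`, `record_page`, the `"urls"` key, and the standard-library link collector (standing in for the project's `soup`-based parsing) are all assumptions made for the example. Only `include_body`, `crawl_result`, and the `"body"` field come from the description.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import Any, Dict, List


class _LinkCollector(HTMLParser):
    """Collects href values from <a> tags (hypothetical stand-in parser)."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


@dataclass
class SketchSpider:
    # Mirrors the described argument: a boolean that defaults to False.
    include_body: bool = False
    crawl_result: Dict[str, Dict[str, Any]] = field(default_factory=dict)

    def record_page(self, url: str, html: str) -> None:
        """Store the links found in `html`, plus the body when requested."""
        collector = _LinkCollector()
        collector.feed(html)
        entry: Dict[str, Any] = {"urls": collector.links}  # key name assumed
        if self.include_body:
            # The body is stored whether or not the page contained any links.
            entry["body"] = html
        self.crawl_result[url] = entry
```

With `include_body=True`, a page without links still gets a `"body"` entry next to an empty link list; with the default, the entry contains only the links.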

@Mews
Collaborator Author

Mews commented Jun 16, 2024

Oh right I guess this new "body" field doesn't match the type hint for crawl_result :P
Should it be Dict[str, Dict[str, Union[List[str], str]]]?

@Mews Mews mentioned this pull request Jun 16, 2024
@Mews
Collaborator Author

Mews commented Jun 16, 2024

@indrajithi I'm not quite sure how to do the type hints for the crawl_result variable now that it has this new body field :/

@indrajithi
Collaborator

> @indrajithi I'm not quite sure how to do the type hints for the crawl_result variable now that it has this new body field :/

self.crawl_result: Dict[str, Dict[str, Union[List[str], str]]] = {}

Does this not work?

@indrajithi
Collaborator

If the type hint for crawl_result is not working, we can just set it to a basic dict or override/suppress checking that case and move on.

@Mews
Collaborator Author

Mews commented Jun 16, 2024

Alright I'm on my phone right now but I'll get to it when I get home 👍

@Mews
Collaborator Author

Mews commented Jun 16, 2024

> > @indrajithi I'm not quite sure how to do the type hints for the crawl_result variable now that it has this new body field :/
>
> self.crawl_result: Dict[str, Dict[str, Union[List[str], str]]] = {}
>
> Does this not work?

Nope, that's what raised the error on the CI.
I'll open an issue about it so that it can be dealt with later.

@Mews
Collaborator Author

Mews commented Jun 16, 2024

@indrajithi Ok, I just introduced a temporary fix: I set the type hint to Dict[str, Any], so you can rerun the CI and merge if everything passes. I'll open the issue now.
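
As a possible follow-up to the temporary `Dict[str, Any]` hint, one option is a `TypedDict` that gives each key its own type. The thread does not show the actual CI error, so this is only a guess at the cause: with a value type of `Union[List[str], str]`, a checker such as mypy rejects list operations like `.append` because the value might be a `str`. The names `PageResult` and `urls` below are hypothetical; only `body` and `crawl_result` appear in the discussion.

```python
from typing import Dict, List, TypedDict


class PageResult(TypedDict, total=False):
    """Per-URL entry; total=False makes every key optional."""

    urls: List[str]  # hypothetical key name for the discovered links
    body: str        # present only when include_body is enabled


crawl_result: Dict[str, PageResult] = {}

# Each key has a single type, so a checker can accept list operations on
# "urls" and string assignment to "body" without a Union.
crawl_result["http://example.com"] = {"urls": []}
crawl_result["http://example.com"]["urls"].append("http://example.com/about")
crawl_result["http://example.com"]["body"] = "<html></html>"
```

At runtime a `TypedDict` is an ordinary `dict`, so this would not change the shape of the returned data; it only affects static checking.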

@indrajithi indrajithi merged commit c813f93 into DataCrawl-AI:master Jun 16, 2024
@indrajithi indrajithi mentioned this pull request Jun 16, 2024
25 tasks

Development

Successfully merging this pull request may close these issues.

Feature: Add option to return the crawled website body in the response

2 participants