Added option to include page body in crawl results #19

Mews · 2024-06-16T12:40:35Z

Closes #8

Changes

Added an include_body argument to the Spider class. This is a boolean that defaults to false. When set to true, the body of each crawled page is included in crawl_result.
In Spider.crawl, added the code that stores the body from soup.html in crawl_result.
Added a test for this feature.

Right now the body of the page is added regardless of whether it finds links inside it or not! This just felt like the most expected behavior, but let me know if I should change it.
Also, there are no verbose prints. I didn't find them necessary, but let me know if I should add some 👍
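
For readers skimming the thread, the behavior described in this PR can be sketched roughly as follows. This is a network-free illustration, not the project's code: `MiniSpider`, `record_page`, and the `"urls"` key are hypothetical names, and the standard-library `HTMLParser` stands in for the soup-based parsing that the real `Spider.crawl` uses.

```python
from html.parser import HTMLParser
from typing import Any, Dict, List


class _LinkCollector(HTMLParser):
    """Collects href values from <a> tags (stand-in for the soup parsing)."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


class MiniSpider:
    """Hypothetical, simplified stand-in for the Spider class."""

    def __init__(self, include_body: bool = False) -> None:
        # Defaults to False, matching the PR description.
        self.include_body = include_body
        self.crawl_result: Dict[str, Any] = {}

    def record_page(self, url: str, html: str) -> None:
        collector = _LinkCollector()
        collector.feed(html)
        entry: Dict[str, Any] = {"urls": collector.links}  # key name is an assumption
        if self.include_body:
            # Stored even when the page contains no links.
            entry["body"] = html
        self.crawl_result[url] = entry
```

With `include_body=True`, a page without any links still gets a `"body"` entry next to an empty link list; with the default, the `"body"` key is absent.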

Mews · 2024-06-16T12:44:47Z

Oh right I guess this new "body" field doesn't match the type hint for crawl_result :P
Should it be Dict[str, Dict[str, Union[List[str], str]]]?

tiny_web_crawler/crawler.py

Mews · 2024-06-16T13:33:40Z

@indrajithi I'm not quite sure how to do the type hints for the crawl_result variable now that it has this new body field :/

indrajithi · 2024-06-16T13:50:05Z

> @indrajithi I'm not quite sure how to do the type hints for the crawl_result variable now that it has this new body field :/

self.crawl_result: Dict[str, Dict[str, Union[List[str], str]]] = {} Does this not work?

indrajithi · 2024-06-16T14:44:51Z

If the type hint for crawl_result is not working, we can just set it to a basic dict or override/suppress checking that case and move on.

Mews · 2024-06-16T15:56:17Z

Alright I'm on my phone right now but I'll get to it when I get home 👍

Mews · 2024-06-16T15:57:11Z

> > @indrajithi I'm not quite sure how to do the type hints for the crawl_result variable now that it has this new body field :/
>
> self.crawl_result: Dict[str, Dict[str, Union[List[str], str]]] = {} Does this not work?

Nope, that's what raised the error on the CI.
I'll open an issue about it so that it can be dealt with later.

Mews · 2024-06-16T16:36:43Z

@indrajithi Ok, I just introduced a temporary fix: I set the type hint to Dict[str, Any], so you can rerun the CI and merge if everything passes. I'll open the issue now.
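
As a rough illustration of the trade-off discussed here: the CI log is not shown in this thread, but a common reason a `Union[List[str], str]` value type fails type checking is that calling `.append` on such a value is rejected, since the value might be a `str`. The sketch below contrasts the temporary `Dict[str, Any]` hint with one possible stricter alternative using `TypedDict`. The names `PageResult`, `_RequiredPageFields`, and the `"urls"` key are assumptions for illustration, not what the project adopted.

```python
from typing import Any, Dict, List, TypedDict


class _RequiredPageFields(TypedDict):
    urls: List[str]  # assumed key name for the list of found links


class PageResult(_RequiredPageFields, total=False):
    body: str  # optional: only present when include_body is true


# Temporary fix from this PR: loose typing that accepts any value shape.
loose_result: Dict[str, Any] = {}
loose_result["http://example.com"] = {"urls": [], "body": "<html></html>"}

# Possible stricter alternative: each key has its own type, so appending
# to "urls" is well-typed while "body" remains a plain string.
typed_result: Dict[str, PageResult] = {}
typed_result["http://example.com"] = {"urls": []}
typed_result["http://example.com"]["urls"].append("http://example.com/about")
typed_result["http://example.com"]["body"] = "<html></html>"
```

At runtime both variables are ordinary dictionaries; the difference only matters to a static checker such as mypy.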

Mews added 2 commits June 16, 2024 13:27

Add option to include page body in results

c2d56ba

Added test for include_body option

a01711d

Mews added 2 commits June 16, 2024 13:47

Fix small formatting issue in test_include_body

c1916e9

Fix type hint for crawl_result

51a6610

Mews mentioned this pull request Jun 16, 2024

Add mypy to pre-commit hooks #20

Closed

indrajithi reviewed Jun 16, 2024

tiny_web_crawler/crawler.py (outdated, resolved)

Get full html content from soup

df59615

Temporary fix for crawl_result type hint

3f5a437

Updated type hint in docstring

7164422

Mews mentioned this pull request Jun 16, 2024

Fix crawl_result type hint #21

Open

indrajithi merged commit c813f93 into DataCrawl-AI:master Jun 16, 2024

indrajithi mentioned this pull request Jun 16, 2024

First Major Release v1.0.0 #24

Open

25 tasks
