null byte issue

**Description**
There is a problem with null byte characters being inserted in HTML pages created with Docusaurus when the language is cjk. Of course, the issue mentioned is also registered as an issue in Docusaurus.

When I scrape that page with docs-scraper, I run into the problem that it doesn't scrape anything. Logic to replace null byte characters is required.

example site:
> Docs-Scraper: https://docs.whatap.io/java/agent-load-amount 0 records)
> Docs-Scraper: https://docs.whatap.io/java/agent-dbsql 0 records)
> Docs-Scraper: https://docs.whatap.io/java/agent-apdex 0 records)

I proceeded with the work by modifying the files as shown below. Please refer to the information below and correct it for the better.

documentation_spider.py:162

```python
def parse_from_sitemap(self, response):
    if self.reason_to_stop is not None:
        raise CloseSpider(reason=self.reason_to_stop)
    
    # remove null byte
    response_text = response.text.replace('\u0000', '')

    if (not self.force_sitemap_urls_crawling) and (
            not self.is_rules_compliant(response)):
        print("\033[94m> Ignored from sitemap:\033[0m " + response.url)
    else:
        # self.add_records(response, from_sitemap=True)
        self.add_records(response.replace(body=response_text), from_sitemap=True)
        # We don't return self.parse(response) in order to avoid crawling those web page

def parse_from_start_url(self, response):
    if self.reason_to_stop is not None:
        raise CloseSpider(reason=self.reason_to_stop)

    # remove null byte
    response_text = response.text.replace('\u0000', '')

    if self.is_rules_compliant(response):
        self.add_records(response, from_sitemap=False)
    else:
        print("\033[94m> Ignored: from start url\033[0m " + response.url)

    # return self.parse(response)
    return self.parse(response.replace(body=response_text))
```

custom_downloader_middleware.py:37

```python
body = self.driver.page_source.encode('utf-8')
# remove null byte
body = self.driver.page_source.replace('\u0000', '')
body = body.encode('utf-8')  # UTF-8 encoding
url = self.driver.current_url
```

default_strategy.py:37

```python
if self._body_contains_stop_content(response):
    return []

# remove null byte
cleaned_body = response.text.replace('\u0000', '')

self.dom = self.get_dom(response.replace(body=cleaned_body.encode('utf-8')))
self.dom = self.remove_from_dom(self.dom, self.config.selectors_exclude)

records = self.get_records_from_dom(response.url)
return records
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

null byte issue #474

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

null byte issue #474

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions