null byte issue #474

@kijung-iM

Description

Docusaurus inserts null byte characters into the HTML pages it generates when the site language is CJK. This has also been reported as an issue in the Docusaurus repository.

When I scrape such a page with docs-scraper, no records are extracted. The scraper needs logic to strip null byte characters before the page is parsed.
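The stripping step itself is small. The following is a minimal sketch in plain Python, not code from docs-scraper: `strip_null_bytes` and `NULL_CHAR` are hypothetical names, and the sample markup is made up for illustration.

```python
NULL_CHAR = "\u0000"  # hypothetical constant name, used only in this sketch


def strip_null_bytes(html):
    """Return html with every NUL removed; accepts str or bytes."""
    if isinstance(html, bytes):
        return html.replace(b"\x00", b"")
    return html.replace(NULL_CHAR, "")


# Made-up sample markup containing a stray NUL between CJK characters.
page = "<h1>에이전트\u0000 부하량</h1>"
cleaned = strip_null_bytes(page)
print(NULL_CHAR in page, NULL_CHAR in cleaned)  # True False
```

Handling both `str` and `bytes` means the same helper could be applied before or after the body is encoded.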

example site:

Docs-Scraper: https://docs.whatap.io/java/agent-load-amount 0 records)
Docs-Scraper: https://docs.whatap.io/java/agent-dbsql 0 records)
Docs-Scraper: https://docs.whatap.io/java/agent-apdex 0 records)

I worked around the problem by modifying the files as shown below. Please use these changes as a reference and improve on them as needed.

documentation_spider.py:162

def parse_from_sitemap(self, response):
    if self.reason_to_stop is not None:
        raise CloseSpider(reason=self.reason_to_stop)
    
    # remove null byte
    response_text = response.text.replace('\u0000', '')

    if (not self.force_sitemap_urls_crawling) and (
            not self.is_rules_compliant(response)):
        print("\033[94m> Ignored from sitemap:\033[0m " + response.url)
    else:
        # self.add_records(response, from_sitemap=True)
        self.add_records(response.replace(body=response_text), from_sitemap=True)
        # We don't return self.parse(response) in order to avoid crawling those web pages

def parse_from_start_url(self, response):
    if self.reason_to_stop is not None:
        raise CloseSpider(reason=self.reason_to_stop)

    # remove null byte
    response_text = response.text.replace('\u0000', '')

    if self.is_rules_compliant(response):
        self.add_records(response.replace(body=response_text), from_sitemap=False)
    else:
        print("\033[94m> Ignored: from start url\033[0m " + response.url)

    # return self.parse(response)
    return self.parse(response.replace(body=response_text))

custom_downloader_middleware.py:37

# remove null byte before encoding
body = self.driver.page_source.replace('\u0000', '')
body = body.encode('utf-8')  # UTF-8 encoding
url = self.driver.current_url

default_strategy.py:37

if self._body_contains_stop_content(response):
    return []

# remove null byte
cleaned_body = response.text.replace('\u0000', '')

self.dom = self.get_dom(response.replace(body=cleaned_body.encode('utf-8')))
self.dom = self.remove_from_dom(self.dom, self.config.selectors_exclude)

records = self.get_records_from_dom(response.url)
return records
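The three changes above repeat the same cleanup, so it could live in one helper. The sketch below is only an illustration under stated assumptions: `without_null_bytes` is a hypothetical name, and `FakeResponse` is a stand-in that mimics just the `text` property and the `replace(...)` copy method of a Scrapy text response so the example runs without Scrapy. The URL is a placeholder.

```python
from dataclasses import dataclass, replace as dc_replace


@dataclass(frozen=True)
class FakeResponse:
    """Hypothetical stand-in exposing only what the helper touches."""
    url: str
    body: bytes
    encoding: str = "utf-8"

    @property
    def text(self):
        return self.body.decode(self.encoding)

    def replace(self, **kwargs):
        # Copy-with-changes, in the spirit of a Scrapy response's replace().
        return dc_replace(self, **kwargs)


def without_null_bytes(response):
    """Return a copy of response whose body has no NUL characters.

    The original object is returned unchanged when there is nothing to strip.
    """
    text = response.text
    if "\u0000" not in text:
        return response
    cleaned = text.replace("\u0000", "")
    # Pass the encoding explicitly so the bytes and the declared encoding agree.
    return response.replace(body=cleaned.encode("utf-8"), encoding="utf-8")


# Placeholder URL and made-up markup, for illustration only.
dirty = FakeResponse(url="https://example.invalid/page",
                     body="<p>문서\u0000</p>".encode("utf-8"))
clean = without_null_bytes(dirty)
print(clean.text)
```

Returning the original object when no null byte is present avoids copying responses for pages that do not need cleaning.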

Metadata

Labels: bug
