Documentation
m92 edited this page Jul 27, 2024 · 1 revision
page_source (str): HTML source text of the page.
base_url (str): URL of the HTML source.
html_parser (str): which BeautifulSoup HTML parser to use. Default 'lxml'
keep_images (bool): keep image links. If False, image links are removed from the text, which saves tokens for the LLM when the scraping job does not need images. Default True
remove_svg_image (bool): remove .svg images, which are usually not required while scraping. Default True
remove_gif_image (bool): remove .gif images, which are usually not required while scraping. Default True
remove_image_types (list): image extensions to remove, e.g. ['.png', '.jpg']. Default []
keep_webpage_links (bool): keep webpage links. If the scraping job does not require links, set this to False to reduce the input token count to the LLM. Default True
remove_script_tag (bool): remove script tags. Default True
remove_style_tag (bool): remove style tags. Default True
remove_tags (list): list of tags to be removed. Default []
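
The image options above interact, so a small sketch may help show how they could combine. This is not the library's actual implementation: the helper names blocked_image_extensions and keep_image are hypothetical, the sketch uses only the Python standard library rather than BeautifulSoup, and it assumes that base_url is used to resolve relative image paths.

```python
import posixpath
from urllib.parse import urljoin, urlparse


def blocked_image_extensions(remove_svg_image=True, remove_gif_image=True,
                             remove_image_types=None):
    """Combine the three image options into one set of lowercase extensions."""
    blocked = {ext.lower() for ext in (remove_image_types or [])}
    if remove_svg_image:
        blocked.add(".svg")
    if remove_gif_image:
        blocked.add(".gif")
    return blocked


def keep_image(src, base_url, keep_images=True, **image_options):
    """Return the absolute image URL to keep, or None if it should be dropped.

    Hypothetical helper: it mirrors the documented defaults, not the real code.
    """
    if not keep_images:
        return None
    # Assumption: relative image paths are resolved against base_url.
    absolute = urljoin(base_url, src)
    # Inspect only the URL path so a query string such as ?v=2
    # does not hide the file extension.
    extension = posixpath.splitext(urlparse(absolute).path)[1].lower()
    if extension in blocked_image_extensions(**image_options):
        return None
    return absolute


if __name__ == "__main__":
    base = "https://example.com/a/"
    print(keep_image("/img/logo.svg", base))                      # dropped by default
    print(keep_image("pic.png", base))                            # kept
    print(keep_image("pic.png", base, remove_image_types=[".png"]))  # dropped
```

Under these assumptions, keep_images=False drops every image regardless of the other options, while remove_image_types only extends the default .svg and .gif filtering.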