
Documentation

m92 edited this page Jul 27, 2024 · 1 revision

Function: get_processed_text(page_source, base_url)

page_source (str): HTML source text of the page.
base_url (str): URL of the HTML source.
html_parser (str): which BeautifulSoup HTML parser to use. Default 'lxml'
keep_images (bool): keep image links. If False, image links are removed from the text, which saves tokens for the LLM when the scraping job does not need them. Default True
remove_svg_image (bool): remove .svg images, which are usually not needed while scraping. Default True
remove_gif_image (bool): remove .gif images, which are usually not needed while scraping. Default True
remove_image_types (list): image extensions to remove, e.g. ['.png', '.jpg']. Default []
keep_webpage_links (bool): keep webpage links. If the scraping job does not need links, set this to False to reduce the input token count for the LLM. Default True
remove_script_tag (bool): remove script tags. Default True
remove_style_tag (bool): remove style tags. Default True
remove_tags (list): list of tags to remove. Default []
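
The options above can be collected into a keyword-argument dictionary before calling the function. The following is a minimal sketch, not part of the library: the defaults are copied from this page rather than read from the code, `build_options` is a hypothetical helper, and the tag-name format passed to remove_tags is an assumption. The call to get_processed_text is left commented out because the import path depends on how the package is installed.

```python
# Defaults as documented on this page (not read from the library itself).
DOCUMENTED_DEFAULTS = {
    "html_parser": "lxml",
    "keep_images": True,
    "remove_svg_image": True,
    "remove_gif_image": True,
    "remove_image_types": [],
    "keep_webpage_links": True,
    "remove_script_tag": True,
    "remove_style_tag": True,
    "remove_tags": [],
}


def build_options(**overrides):
    """Hypothetical helper: merge overrides into the documented defaults."""
    unknown = set(overrides) - set(DOCUMENTED_DEFAULTS)
    if unknown:
        raise TypeError(f"unknown option(s): {sorted(unknown)}")
    # Copy list defaults so callers never mutate the shared template.
    options = {
        key: list(value) if isinstance(value, list) else value
        for key, value in DOCUMENTED_DEFAULTS.items()
    }
    options.update(overrides)
    return options


# Text-only scrape: drop images and links to reduce the LLM token count.
text_only = build_options(keep_images=False, keep_webpage_links=False)

# Keep images, but also drop PNG and JPG files and two extra tags.
# Passing bare tag names such as "nav" is an assumption about the format.
lean_images = build_options(
    remove_image_types=[".png", ".jpg"],
    remove_tags=["nav", "footer"],
)

# Assumed call shape, based on the signature documented above:
# text = get_processed_text(page_source, base_url, **text_only)
```

Rejecting unknown keys in the helper catches misspelled option names early, before they reach the library call.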