WebSurfer Updated (Selenium, Playwright, and support for many filetypes) #1929

Merged: 61 commits, Sep 25, 2024
96683ee
Feat/headless browser (retargeted) (#1832)
INF800 Mar 2, 2024
348d676
Handle missing Selenium package.
afourney Mar 4, 2024
bb7a249
Added browser_chat.py example to simplify testing.
afourney Mar 4, 2024
7535226
Based browser on mdconvert. (#1847)
afourney Mar 4, 2024
8dc2220
Added an initial POC with Playwright.
afourney Mar 4, 2024
4e7e6a5
Merge branch 'main' into headless_web_surfer
afourney Mar 4, 2024
1d96568
Separated Bing search into its own utility module.
afourney Mar 8, 2024
21b1789
Simple browser now uses Bing tools.
afourney Mar 8, 2024
19bb19c
Updated Playwright browser to inherit from SimpleTextBrowser
afourney Mar 9, 2024
c6a7ee3
Got Selenium working too.
afourney Mar 9, 2024
d5d6644
Renamed classes and files for consistency.
afourney Mar 9, 2024
acb08c3
Added more instructions.
afourney Mar 9, 2024
d19c9c7
Merge branch 'main' into headless_web_surfer
afourney Mar 9, 2024
f595516
Initial work to support other search providers.
afourney Mar 12, 2024
e8e8de0
Merge branch 'headless_web_surfer' of github.com:microsoft/autogen in…
afourney Mar 12, 2024
df4e3e1
Added some basic behavior when the BING_API_KEY is missing.
afourney Mar 12, 2024
e33a2fa
Cleaned up some search results.
afourney Mar 12, 2024
e221a5f
Moved to using the request.Sessions object. Moved Bing SERP parsing to…
afourney Mar 12, 2024
35c48fe
Added backward compatibility to WebSurferAgent
afourney Mar 12, 2024
df3ef28
Selenium and Playwright now grab the whole DOM, not just the body, all…
afourney Mar 12, 2024
0a52483
Fixed printing of page titles in Playwright.
afourney Mar 13, 2024
802f099
Merge branch 'main' into headless_web_surfer
afourney Mar 14, 2024
156e6f7
Moved installation of WebSurfer dependencies to contrib-tests.yml
afourney Mar 14, 2024
8744405
Fixing pre-commit issues.
afourney Mar 14, 2024
3c2a118
Reverting conversable_agent, which should not have been changed in pr…
afourney Mar 14, 2024
92bc064
Added RequestMarkdownBrowser tests.
afourney Mar 14, 2024
87119a4
Fixed a bug with Bing search, and added search test cases.
afourney Mar 15, 2024
ccf37a4
Added tests for Bing search.
afourney Mar 15, 2024
c33ac26
Added tests for md_convert
afourney Mar 15, 2024
6af5ff9
Added test files.
afourney Mar 15, 2024
9581c07
Added missing pptx.
afourney Mar 15, 2024
ecd5329
Added more tests for WebSurfer coverage.
afourney Mar 15, 2024
b5dca7e
Merge branch 'main' into headless_web_surfer
afourney Mar 15, 2024
25c78c0
Fixed guard on requests_markdown_browser test.
afourney Mar 15, 2024
de011b8
Updated test coverage for mdconvert.
afourney Mar 15, 2024
f897bf3
Fix browser_utils tests.
afourney Mar 16, 2024
3f8c65f
Removed image test from browser, since exiftool isn't installed on te…
afourney Mar 16, 2024
8e6b5e8
Removed image test from browser, since exiftool isn't installed on te…
afourney Mar 16, 2024
b280028
Merge branch 'main' into headless_web_surfer
afourney Mar 18, 2024
d3b6f68
Disable Selenium GPU and sandbox to ensure it runs headless in Docker.
afourney Mar 18, 2024
852ee33
Merge branch 'main' into headless_web_surfer
afourney Mar 27, 2024
f094e69
Added option for Bing API results to be interleaved (as Bing specifie…
afourney Mar 29, 2024
745dc21
Print more details when requests exceptions are thrown.
afourney Mar 29, 2024
fe8fa07
Merge branch 'main' into headless_web_surfer
afourney Apr 1, 2024
7353681
Added additional documentation to markdown_search
afourney Apr 1, 2024
a174d42
Added documentation to the selenium_markdown_browser.
afourney Apr 1, 2024
2f9de28
Added documentation to playwright_markdown_browser.py
afourney Apr 1, 2024
371b991
Added documentation to requests_markdown_browser
afourney Apr 1, 2024
1ac0a4d
Added documentation to mdconvert.py
afourney Apr 1, 2024
2c1398b
Updated agentchat_surfer notebook.
afourney Apr 1, 2024
6ba05c9
Merge branch 'main' into headless_web_surfer
ekzhu Apr 2, 2024
266cefc
Update .github/workflows/contrib-tests.yml
afourney Apr 2, 2024
8a6ebe1
Merge main. Resolve conflicts.
afourney May 20, 2024
ccbdd1b
Merge main. Resolve conflicts.
afourney May 20, 2024
37b3292
Resolve pre-commit checks.
afourney May 20, 2024
b1ca235
Merge branch 'main' into headless_web_surfer
ekzhu May 22, 2024
5304bab
Removed offending LFS file.
afourney Sep 25, 2024
a486843
Re-added offending LFS file.
afourney Sep 25, 2024
0c5a4a9
Merged main.
afourney Sep 25, 2024
764fb3f
Fixed browser_utils tests.
afourney Sep 25, 2024
42fe8f5
Fixed style errors.
afourney Sep 25, 2024
Added documentation to mdconvert.py
afourney committed Apr 1, 2024
commit 1ac0a4d700f20ee7ad9c0c212ec13095c9e4dc8d
43 changes: 43 additions & 0 deletions autogen/browser_utils/mdconvert.py
@@ -64,6 +64,15 @@


class _CustomMarkdownify(markdownify.MarkdownConverter):
    """
    A custom version of markdownify's MarkdownConverter. Changes include:

    - Altering the default heading style to use '#', '##', etc.
    - Removing javascript hyperlinks.
    - Truncating images with large data:uri sources.
    - Ensuring URIs are properly escaped, and do not conflict with Markdown syntax
    """

    def __init__(self, **options):
        options["heading_style"] = options.get("heading_style", markdownify.ATX)
        super().__init__(**options)
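
Two of the changes listed in the docstring above (dropping javascript hyperlinks and truncating large data: URIs) can be illustrated with a minimal, standard-library-only sketch. The helper names `truncate_data_uri` and `is_javascript_link`, and the 32-character threshold, are assumptions made for illustration; they are not taken from mdconvert.py, and the real class may apply these rules differently.

```python
import re

# Hypothetical helpers, not part of mdconvert.py.
# Splits a data: URI into its header (up to the first comma) and its payload.
_DATA_URI = re.compile(r"^(data:[^,]*,)(.*)$", re.DOTALL)


def truncate_data_uri(src: str, max_payload: int = 32) -> str:
    """Return src unchanged unless it is a data: URI with a long payload."""
    match = _DATA_URI.match(src)
    if match is None:
        return src  # ordinary URL, leave as-is
    header, payload = match.groups()
    if len(payload) <= max_payload:
        return src  # short inline data is kept whole
    return header + "..."  # keep the media type, drop the bulk


def is_javascript_link(href: str) -> bool:
    """True for hrefs such as 'javascript:void(0)' that carry no destination."""
    return href.strip().lower().startswith("javascript:")
```

A converter built on these helpers would presumably emit only the link text when `is_javascript_link` is true, and pass image sources through `truncate_data_uri` before writing them into the Markdown.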
@@ -134,6 +143,8 @@ def __init__(self, title: Union[str, None] = None, text_content: str = ""):


class DocumentConverter:
    """Abstract superclass of all DocumentConverters."""

    def convert(self, local_path, **kwargs) -> Union[None, DocumentConverterResult]:
        raise NotImplementedError()
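
The converters in this diff share one convention: `convert` returns `None` when the file is not of the type it handles (the "Bail if not a ..." checks). A rough sketch of how a caller might use that convention follows. All names here (`SketchResult`, `SketchConverter`, `PlainTextConverter`, `convert_first_match`) are hypothetical stand-ins, and the dispatch loop is an assumption about usage rather than the code in mdconvert.py.

```python
from typing import List, Optional


class SketchResult:
    """Stand-in for DocumentConverterResult (title plus text content)."""

    def __init__(self, title: Optional[str] = None, text_content: str = ""):
        self.title = title
        self.text_content = text_content


class SketchConverter:
    """Stand-in for the abstract DocumentConverter."""

    def convert(self, local_path: str, **kwargs) -> Optional[SketchResult]:
        raise NotImplementedError()


class PlainTextConverter(SketchConverter):
    """Hypothetical converter that only accepts .txt files."""

    def convert(self, local_path: str, **kwargs) -> Optional[SketchResult]:
        # Bail if not a TXT file, mirroring the pattern used in the diff.
        extension = kwargs.get("file_extension", "")
        if extension.lower() != ".txt":
            return None
        with open(local_path, "rt", encoding="utf-8") as fh:
            return SketchResult(title=None, text_content=fh.read())


def convert_first_match(
    local_path: str, converters: List[SketchConverter], **kwargs
) -> Optional[SketchResult]:
    """Try each converter in order and return the first non-None result."""
    for converter in converters:
        result = converter.convert(local_path, **kwargs)
        if result is not None:
            return result
    return None  # no converter recognised the file
```

Returning `None` instead of raising lets converters be tried in sequence without exception handling in the caller.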

@@ -427,6 +438,10 @@ def convert(self, local_path, **kwargs) -> Union[None, DocumentConverterResult]:


class PdfConverter(DocumentConverter):
    """
    Converts PDFs to Markdown. Most style information is ignored, so the results are essentially plain-text.
    """

    def convert(self, local_path, **kwargs) -> Union[None, DocumentConverterResult]:
        # Bail if not a PDF
        extension = kwargs.get("file_extension", "")
@@ -440,6 +455,10 @@ def convert(self, local_path, **kwargs) -> Union[None, DocumentConverterResult]:


class DocxConverter(HtmlConverter):
    """
    Converts DOCX files to Markdown. Style information (e.g., headings) and tables are preserved where possible.
    """

    def convert(self, local_path, **kwargs) -> Union[None, DocumentConverterResult]:
        # Bail if not a DOCX
        extension = kwargs.get("file_extension", "")
@@ -456,6 +475,10 @@ def convert(self, local_path, **kwargs) -> Union[None, DocumentConverterResult]:


class XlsxConverter(HtmlConverter):
    """
    Converts XLSX files to Markdown, with each sheet presented as a separate Markdown table.
    """

    def convert(self, local_path, **kwargs) -> Union[None, DocumentConverterResult]:
        # Bail if not a XLSX
        extension = kwargs.get("file_extension", "")
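
To make "each sheet presented as a separate Markdown table" concrete, here is a small standard-library sketch of the output shape. It does not read XLSX files and is not how XlsxConverter is implemented (that class derives from HtmlConverter). The function name `sheets_to_markdown`, the `## sheet name` heading, and the choice to treat the first row as the header are all assumptions.

```python
from typing import Dict, List, Sequence


def sheets_to_markdown(sheets: Dict[str, List[Sequence[object]]]) -> str:
    """Render each sheet as a '## name' heading followed by a pipe table.

    The first row of every sheet is treated as the header row.
    """
    parts: List[str] = []
    for name, rows in sheets.items():
        parts.append(f"## {name}")
        if not rows:
            continue  # empty sheet: heading only
        header, *body = rows
        parts.append("| " + " | ".join(str(c) for c in header) + " |")
        parts.append("| " + " | ".join("---" for _ in header) + " |")
        for row in body:
            parts.append("| " + " | ".join(str(c) for c in row) + " |")
        parts.append("")  # blank line separates consecutive sheets
    return "\n".join(parts).strip()
```

Calling it with `{"Sales": [["item", "qty"], ["apple", 3]]}` yields a heading line and a three-line table.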
@@ -476,6 +499,10 @@ def convert(self, local_path, **kwargs) -> Union[None, DocumentConverterResult]:


class PptxConverter(HtmlConverter):
    """
    Converts PPTX files to Markdown. Supports headings, tables and images with alt text.
    """

    def convert(self, local_path, **kwargs) -> Union[None, DocumentConverterResult]:
        # Bail if not a PPTX
        extension = kwargs.get("file_extension", "")
@@ -558,6 +585,10 @@ def _is_table(self, shape):


class MediaConverter(DocumentConverter):
    """
    Abstract class for multi-modal media (e.g., images and audio)
    """

    def _get_metadata(self, local_path):
        exiftool = shutil.which("exiftool")
        if not exiftool:
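
The `shutil.which("exiftool")` guard above makes metadata extraction optional: if the executable is not on the PATH, the converter can skip that step. The rest of `_get_metadata` is not shown in this hunk, so the sketch below is only one plausible way to finish the idea. The function name `read_metadata` and its `tool` parameter are hypothetical; `-json` is ExifTool's option for JSON output.

```python
import json
import shutil
import subprocess
from typing import Any, Dict, Optional


def read_metadata(local_path: str, tool: str = "exiftool") -> Optional[Dict[str, Any]]:
    """Return the tool's metadata for local_path, or None if unavailable."""
    executable = shutil.which(tool)
    if not executable:
        return None  # tool not installed: metadata is simply skipped

    completed = subprocess.run(
        [executable, "-json", local_path],
        capture_output=True,
        text=True,
    )
    if completed.returncode != 0 or not completed.stdout.strip():
        return None  # the tool ran but could not read the file

    try:
        records = json.loads(completed.stdout)
    except json.JSONDecodeError:
        return None
    # With -json, ExifTool prints a list with one object per input file.
    return records[0] if records else None
```

Treating a missing tool as "no metadata" rather than an error keeps the converter usable on machines where ExifTool is not installed.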
@@ -571,6 +602,10 @@ def _get_metadata(self, local_path):


class WavConverter(MediaConverter):
    """
    Converts WAV files to markdown via extraction of metadata (if `exiftool` is installed), and speech transcription (if `speech_recognition` is installed).
    """

    def convert(self, local_path, **kwargs) -> Union[None, DocumentConverterResult]:
        # Bail if not a WAV
        extension = kwargs.get("file_extension", "")
@@ -620,6 +655,10 @@ def _transcribe_audio(self, local_path) -> str:


class Mp3Converter(WavConverter):
    """
    Converts MP3 files to markdown via extraction of metadata (if `exiftool` is installed), and speech transcription (if `speech_recognition` AND `pydub` are installed).
    """

    def convert(self, local_path, **kwargs) -> Union[None, DocumentConverterResult]:
        # Bail if not a MP3
        extension = kwargs.get("file_extension", "")
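
The docstring above makes transcription conditional on optional Python packages. One common way to test for such packages without importing them is `importlib.util.find_spec`. The helper name `modules_available` is hypothetical, and mdconvert.py may detect its optional dependencies differently (for example with try/except around the imports).

```python
import importlib.util
from typing import Iterable


def modules_available(names: Iterable[str]) -> bool:
    """True only if every named top-level module can be found."""
    return all(importlib.util.find_spec(name) is not None for name in names)


# MP3 transcription would need both packages; WAV only speech_recognition.
can_transcribe_wav = modules_available(["speech_recognition"])
can_transcribe_mp3 = modules_available(["speech_recognition", "pydub"])
```

When a flag is false, the converter would presumably still return the metadata portion and omit the transcript.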
@@ -677,6 +716,10 @@ def convert(self, local_path, **kwargs) -> Union[None, DocumentConverterResult]:


class ImageConverter(MediaConverter):
    """
    Converts images to markdown via extraction of metadata (if `exiftool` is installed), OCR (if `easyocr` is installed), and description via a multimodal LLM (if an mlm_client is configured).
    """

    def convert(self, local_path, **kwargs) -> Union[None, DocumentConverterResult]:
        # Bail if not an image
        extension = kwargs.get("file_extension", "")