WebSurfer Updated (Selenium, Playwright, and support for many filetypes) #1929

Merged
merged 61 commits into from
Sep 25, 2024
Commits
96683ee
Feat/headless browser (retargeted) (#1832)
INF800 Mar 2, 2024
348d676
Handle missing Selenium package.
afourney Mar 4, 2024
bb7a249
Added browser_chat.py example to simplify testing.
afourney Mar 4, 2024
7535226
Based browser on mdconvert. (#1847)
afourney Mar 4, 2024
8dc2220
Added an initial POC with Playwright.
afourney Mar 4, 2024
4e7e6a5
Merge branch 'main' into headless_web_surfer
afourney Mar 4, 2024
1d96568
Separated Bing search into its own utility module.
afourney Mar 8, 2024
21b1789
Simple browser now uses Bing tools.
afourney Mar 8, 2024
19bb19c
Updated Playwright browser to inherit from SimpleTextBrowser
afourney Mar 9, 2024
c6a7ee3
Got Selenium working too.
afourney Mar 9, 2024
d5d6644
Renamed classes and files for consistency.
afourney Mar 9, 2024
acb08c3
Added more instructions.
afourney Mar 9, 2024
d19c9c7
Merge branch 'main' into headless_web_surfer
afourney Mar 9, 2024
f595516
Initial work to support other search providers.
afourney Mar 12, 2024
e8e8de0
Merge branch 'headless_web_surfer' of github.com:microsoft/autogen in…
afourney Mar 12, 2024
df4e3e1
Added some basic behavior when the BING_API_KEY is missing.
afourney Mar 12, 2024
e33a2fa
Cleaned up some search results.
afourney Mar 12, 2024
e221a5f
Moved to using the request.Sessions object. Moved Bing SERP parsing to…
afourney Mar 12, 2024
35c48fe
Added backward compatibility to WebSurferAgent
afourney Mar 12, 2024
df3ef28
Selenium and Playwright now grab the whole DOM, not just the body, all…
afourney Mar 12, 2024
0a52483
Fixed printing of page titles in Playwright.
afourney Mar 13, 2024
802f099
Merge branch 'main' into headless_web_surfer
afourney Mar 14, 2024
156e6f7
Moved installation of WebSurfer dependencies to contrib-tests.yml
afourney Mar 14, 2024
8744405
Fixing pre-commit issues.
afourney Mar 14, 2024
3c2a118
Reverting conversable_agent, which should not have been changed in pr…
afourney Mar 14, 2024
92bc064
Added RequestMarkdownBrowser tests.
afourney Mar 14, 2024
87119a4
Fixed a bug with Bing search, and added search test cases.
afourney Mar 15, 2024
ccf37a4
Added tests for Bing search.
afourney Mar 15, 2024
c33ac26
Added tests for md_convert
afourney Mar 15, 2024
6af5ff9
Added test files.
afourney Mar 15, 2024
9581c07
Added missing pptx.
afourney Mar 15, 2024
ecd5329
Added more tests for WebSurfer coverage.
afourney Mar 15, 2024
b5dca7e
Merge branch 'main' into headless_web_surfer
afourney Mar 15, 2024
25c78c0
Fixed guard on requests_markdown_browser test.
afourney Mar 15, 2024
de011b8
Updated test coverage for mdconvert.
afourney Mar 15, 2024
f897bf3
Fix browser_utils tests.
afourney Mar 16, 2024
3f8c65f
Removed image test from browser, since exiftool isn't installed on te…
afourney Mar 16, 2024
8e6b5e8
Removed image test from browser, since exiftool isn't installed on te…
afourney Mar 16, 2024
b280028
Merge branch 'main' into headless_web_surfer
afourney Mar 18, 2024
d3b6f68
Disable Selenium GPU and sandbox to ensure it runs headless in Docker.
afourney Mar 18, 2024
852ee33
Merge branch 'main' into headless_web_surfer
afourney Mar 27, 2024
f094e69
Added option for Bing API results to be interleaved (as Bing specifie…
afourney Mar 29, 2024
745dc21
Print more details when requests exceptions are thrown.
afourney Mar 29, 2024
fe8fa07
Merge branch 'main' into headless_web_surfer
afourney Apr 1, 2024
7353681
Added additional documentation to markdown_search
afourney Apr 1, 2024
a174d42
Added documentation to the selenium_markdown_browser.
afourney Apr 1, 2024
2f9de28
Added documentation to playwright_markdown_browser.py
afourney Apr 1, 2024
371b991
Added documentation to requests_markdown_browser
afourney Apr 1, 2024
1ac0a4d
Added documentation to mdconvert.py
afourney Apr 1, 2024
2c1398b
Updated agentchat_surfer notebook.
afourney Apr 1, 2024
6ba05c9
Merge branch 'main' into headless_web_surfer
ekzhu Apr 2, 2024
266cefc
Update .github/workflows/contrib-tests.yml
afourney Apr 2, 2024
8a6ebe1
Merge main. Resolve conflicts.
afourney May 20, 2024
ccbdd1b
Merge main. Resolve conflicts.
afourney May 20, 2024
37b3292
Resolve pre-commit checks.
afourney May 20, 2024
b1ca235
Merge branch 'main' into headless_web_surfer
ekzhu May 22, 2024
5304bab
Removed offending LFS file.
afourney Sep 25, 2024
a486843
Re-added offending LFS file.
afourney Sep 25, 2024
0c5a4a9
Merged main.
afourney Sep 25, 2024
764fb3f
Fixed browser_utils tests.
afourney Sep 25, 2024
42fe8f5
Fixed style errors.
afourney Sep 25, 2024
Added documentation to the selenium_markdown_browser.
afourney committed Apr 1, 2024
commit a174d42c1d3488ffda9b3093854ce397006003f0
23 changes: 20 additions & 3 deletions autogen/browser_utils/selenium_markdown_browser.py
@@ -4,7 +4,7 @@
 from urllib.parse import urljoin, urlparse, quote_plus, unquote, parse_qs
 from .requests_markdown_browser import RequestsMarkdownBrowser
 
-# Check if Playwright dependencies are installed
+# Check if Selenium dependencies are installed
 IS_SELENIUM_ENABLED = False
 try:
     from selenium import webdriver
@@ -19,10 +19,17 @@
 class SeleniumMarkdownBrowser(RequestsMarkdownBrowser):
     """
     (In preview) A Selenium and Chromium powered Markdown web browser.
-    See AbstractMarkdownBrowser for more details.
+    SeleniumMarkdownBrowser extends RequestsMarkdownBrowser, and replaces only the functionality of `visit_page(url)`.
     """
 
     def __init__(self, **kwargs):
+        """
+        Instantiate a new SeleniumMarkdownBrowser.
+
+        Arguments:
+            **kwargs: SeleniumMarkdownBrowser passes all arguments to the RequestsMarkdownBrowser superclass. See RequestsMarkdownBrowser documentation for more details.
+        """
+
         super().__init__(**kwargs)
         self._webdriver = None
 
@@ -41,13 +48,23 @@ def __init__(self, **kwargs):
         self._webdriver.get(self.start_page)
 
     def __del__(self):
+        """
+        Close the Selenium session when garbage-collected. Garbage collection may not always occur, or may happen at a later time. Call `close()` explicitly if you wish to free up resources used by Selenium or Chromium.
+        """
         self.close()
 
     def close(self):
+        """
+        Close the Selenium session used by this instance. The session cannot be reopened without instantiating a new SeleniumMarkdownBrowser instance.
+        """
         if self._webdriver is not None:
-            pass
+            self._webdriver.quit()
+            self._webdriver = None
 
     def _fetch_page(self, url) -> None:
+        """
+        Fetch a page. If the page is a regular HTTP page, use Selenium to gather the HTML. If the page is a download, or a local file, rely on superclass behavior.
+        """
         if url.startswith("file://"):
             super()._fetch_page(url)
         else:
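
The new `__del__` docstring recommends calling `close()` explicitly rather than relying on garbage collection. One way a caller might do that is with `contextlib.closing` from the standard library, sketched below. `FakeMarkdownBrowser` is a hypothetical stand-in written only for this illustration; it is not part of autogen and starts no real Selenium session. It copies only the close-once-then-clear-the-handle pattern visible in the diff.

```python
from contextlib import closing


class FakeMarkdownBrowser:
    """Hypothetical stand-in for SeleniumMarkdownBrowser (illustration only)."""

    def __init__(self):
        # Placeholder for a live webdriver session.
        self._webdriver = object()
        self.close_calls = 0

    def close(self):
        # Same shape as the diff: release the session once, then clear the handle
        # so that later calls do nothing.
        if self._webdriver is not None:
            self.close_calls += 1
            self._webdriver = None


# closing() calls browser.close() when the block exits, even if the body raises.
with closing(FakeMarkdownBrowser()) as browser:
    assert browser._webdriver is not None

# A later close() (for example, one triggered from __del__) is a no-op.
browser.close()
```

Because `close()` sets the handle to `None` after releasing it, the explicit call and the later call from `__del__` do not conflict, which is presumably why the commit clears `self._webdriver` after `quit()`.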