Feat: Add Sogou Text Backend #392

Open
scarletkc wants to merge 7 commits into deedy5:main from scarletkc:main
Conversation

@scarletkc

Feature

  • add ddgs/engines/sogou.py implementing text search with timelimit + pagination support
  • wire the backend into CLI (ddgs text -b sogou) and list it in README’s engines table
  • results/downloads expose the real destination URLs

Notes

  • endpoint: https://www.sogou.com/web (GET)
  • selectors: items_xpath = //div[contains(@class,'vrwrap') and not(contains(@class,'hint'))]
  • _normalize_href follows redirects via inline JS/meta refresh and caches the decoded hrefs
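
As an illustration only (the HTML snippet below is invented, not real Sogou markup), the selectors listed above can be exercised with lxml roughly like this:

```python
# Hypothetical sketch of applying the PR's XPaths to a Sogou-style result block.
# The SAMPLE markup is invented for illustration; only the XPaths come from the PR.
from lxml import html

ITEMS_XPATH = "//div[contains(@class, 'vrwrap') and not(contains(@class, 'hint'))]"
TITLE_XPATH = ".//h3//a//text()"
HREF_XPATH = ".//h3//a/@href"

SAMPLE = """
<div>
  <div class="vrwrap"><h3><a href="/link?url=abc">Python</a></h3></div>
  <div class="vrwrap hint"><h3><a href="/ad">Sponsored</a></h3></div>
</div>
"""

def extract(doc_text: str) -> list[dict[str, str]]:
    tree = html.fromstring(doc_text)
    results = []
    for item in tree.xpath(ITEMS_XPATH):
        # Join text nodes for the title; take the first href attribute if any.
        title = " ".join(t.strip() for t in item.xpath(TITLE_XPATH) if t.strip())
        href = next(iter(item.xpath(HREF_XPATH)), "")
        results.append({"title": title, "href": href})
    return results

print(extract(SAMPLE))  # the 'hint' (sponsored) block is excluded by the XPath
```

Note how the `not(contains(@class,'hint'))` predicate drops Sogou's sponsored blocks at selection time, before any post-processing runs.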

Tests

  • pytest tests/ → 17/18 passing (pre-existing test_books_command failure)
  • pre-commit run --all-files
  • manual DDGS().text('python', backend='sogou', max_results=5)

- implement ddgs/engines/sogou.py
@scarletkc
Author

@deedy5 Sogou is a well-known search engine in China. Because of the Great Firewall (GFW), many users hope to use ddgs without a proxy, but none of the existing search backends work there without one. This backend closes that gap and makes ddgs usable in China.

Occasionally, the first result may be missing (since it is Sogou’s own sponsored recommendation), but this has little impact on overall functionality.

Moved the import of Mapping from collections.abc into a TYPE_CHECKING block to optimize runtime imports and improve type checking performance.
Expanded .gitignore to match all .venv* directories. Enhanced the README.md with a more detailed and nested Table of Contents for better navigation.
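
The TYPE_CHECKING change described above follows the standard typing pattern: the import is evaluated only by static type checkers, never at runtime. A minimal sketch (the function here is hypothetical, just to show the pattern):

```python
# Sketch of the TYPE_CHECKING pattern the commit describes: collections.abc
# is imported only during static analysis, not at runtime.
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from collections.abc import Mapping

def headers() -> "Mapping[str, str]":
    # The annotation is a string, so Mapping need not exist at runtime.
    return {"User-Agent": "ddgs"}

print(headers())
```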
Owner

@deedy5 deedy5 left a comment


Thanks for the contribution — your other changes look good. Could you please revert the last commit that modified .gitignore and README? Those edits aren’t necessary and should be left unchanged.

@scarletkc
Author

> Thanks for the contribution — your other changes look good. Could you please revert the last commit that modified .gitignore and README? Those edits aren’t necessary and should be left unchanged.

Reverted now. Thanks!

@scarletkc scarletkc requested a review from deedy5 November 21, 2025 02:23
Removed redirect resolution and caching from Sogou search engine. The post_extract_results method now only normalizes hrefs to absolute URLs and filters results with valid titles and hrefs, reducing complexity.
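
The simplified normalization this commit describes can be sketched with the standard library's urljoin (the base URL and function name below are assumptions, not the PR's actual code):

```python
# Minimal sketch of normalizing relative hrefs to absolute URLs,
# as the commit message describes. BASE is an assumed value.
from urllib.parse import urljoin

BASE = "https://www.sogou.com/"

def normalize(href: str) -> str:
    # urljoin leaves already-absolute URLs unchanged.
    return urljoin(BASE, href)

print(normalize("/web?query=python"))    # https://www.sogou.com/web?query=python
print(normalize("https://example.com/")) # https://example.com/
```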
@scarletkc scarletkc requested a review from deedy5 December 8, 2025 10:43
@deedy5
Owner

deedy5 commented Dec 17, 2025

@scarletkc
We require direct, unencoded URLs (not redirect/wrapper links like https://www.sogou.com/link?url=...)
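
The wrapper links the maintainer means are straightforward to recognize; a minimal sketch using the standard library (the function name is illustrative, not the PR's code):

```python
# Hypothetical check for Sogou's redirect/wrapper links of the form
# https://www.sogou.com/link?url=<encoded> — the pattern the maintainer rejects.
from urllib.parse import urlparse

def is_wrapper_link(href: str) -> bool:
    parsed = urlparse(href)
    return parsed.path == "/link" and "url=" in parsed.query

print(is_wrapper_link("https://www.sogou.com/link?url=abc"))  # True
print(is_wrapper_link("https://www.python.org/"))             # False
```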

scarletkc and others added 2 commits December 18, 2025 15:48
Enhanced the Sogou search engine to extract real target URLs from the data-url attribute, avoiding unnecessary requests to resolve wrapper links. Added helper methods for XPath extraction and updated post-processing to filter out unresolved wrapper links. Included tests to verify correct behavior when data-url is present or missing.
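
The data-url approach this commit describes can be sketched as follows (the markup and helper name are invented for illustration; only the XPaths echo the PR):

```python
# Illustrative sketch: prefer the real target URL carried in a data-url
# attribute over the sogou.com/link wrapper href, avoiding extra requests.
from lxml import html

ITEM = html.fromstring(
    '<div class="vrwrap"><h3>'
    '<a href="/link?url=encoded" data-url="https://www.python.org/">Python</a>'
    '</h3></div>'
)

def real_href(item) -> str:
    # Prefer the unencoded target from data-url; fall back to the raw href.
    data_urls = item.xpath(".//*[@data-url]/@data-url")
    if data_urls:
        return data_urls[0]
    hrefs = item.xpath(".//h3//a/@href")
    return hrefs[0] if hrefs else ""

print(real_href(ITEM))  # https://www.python.org/
```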
Owner

@deedy5 deedy5 Dec 18, 2025


Please remove the tests: they don't make sense in their current form.

@@ -0,0 +1,94 @@
"""Sogou search engine implementation."""

from __future__ import annotations
Owner

@deedy5 deedy5 Dec 18, 2025


please delete `from __future__ import annotations`

items_xpath = "//div[contains(@class, 'vrwrap') and not(contains(@class, 'hint'))]"
elements_xpath: ClassVar[Mapping[str, str]] = {
"title": ".//h3//a//text()",
"href": ".//h3//a/@href",
Owner

@deedy5 deedy5 Dec 18, 2025


"href": use xpath from _data_url_xpath

return payload

@staticmethod
def _xpath_join(item: Any, xpath: str) -> str: # noqa: ANN401
Owner

@deedy5 deedy5 Dec 18, 2025


function is unnecessary

return " ".join(x.strip() for x in item.xpath(xpath) if x and x.strip())

@staticmethod
def _xpath_first(item: Any, xpath: str) -> str: # noqa: ANN401
Owner

@deedy5 deedy5 Dec 18, 2025


function is unnecessary

return ""

@staticmethod
def _is_wrapper_link(href: str) -> bool:
Owner

@deedy5 deedy5 Dec 18, 2025


function is unnecessary

def _is_wrapper_link(href: str) -> bool:
return "/link?url=" in href or "sogou.com/link?url=" in href

def extract_results(self, html_text: str) -> list[TextResult]:
Owner

@deedy5 deedy5 Dec 18, 2025


def extract_results is unnecessary

"body": ".//div[contains(@class, 'space-txt')]//text()",
}

_data_url_xpath = ".//*[@data-url]/@data-url"
Owner

@deedy5 deedy5 Dec 18, 2025


del _data_url_xpath

results.append(TextResult(title=title, href=href, body=body))
return results

def post_extract_results(self, results: list[TextResult]) -> list[TextResult]:
Owner

@deedy5 deedy5 Dec 18, 2025


validate href in post_extract_results

Owner

@deedy5 deedy5 left a comment


Everything works, thanks!
Could you optimize the code and rename the branch containing your changes to something other than 'main'?
You only need to implement the correct XPaths and the functions build_payload and post_extract_results.
Check Contributing.md and see how the BaseSearchEngine class (the one Sogou inherits from) works:

ddgs/ddgs/base.py

Lines 19 to 122 in 7962c03

class BaseSearchEngine(ABC, Generic[T]):
    """Abstract base class for all search-engine backends."""

    name: ClassVar[str]  # unique key, e.g. "google"
    category: ClassVar[Literal["text", "images", "videos", "news", "books"]]
    provider: ClassVar[str]  # source of the search results (e.g. "bing" for DuckDuckgo)
    disabled: ClassVar[bool] = False  # if True, the engine is disabled
    priority: ClassVar[float] = 1
    search_url: str
    search_method: ClassVar[str]  # GET or POST
    search_headers: ClassVar[Mapping[str, str]] = {}
    items_xpath: ClassVar[str]
    elements_xpath: ClassVar[Mapping[str, str]]
    elements_replace: ClassVar[Mapping[str, str]]

    def __init__(self, proxy: str | None = None, timeout: int | None = None, *, verify: bool | str = True) -> None:
        self.http_client = HttpClient(proxy=proxy, timeout=timeout, verify=verify)
        self.results: list[T] = []

    @property
    def result_type(self) -> type[T]:
        """Get result type based on category."""
        categories = {
            "text": TextResult,
            "images": ImagesResult,
            "videos": VideosResult,
            "news": NewsResult,
            "books": BooksResult,
        }
        return categories[self.category]

    @abstractmethod
    def build_payload(
        self,
        query: str,
        region: str,
        safesearch: str,
        timelimit: str | None,
        page: int,
        **kwargs: str,
    ) -> dict[str, Any]:
        """Build a payload for the search request."""
        raise NotImplementedError

    def request(self, *args: Any, **kwargs: Any) -> str | None:  # noqa: ANN401
        """Make a request to the search engine."""
        resp = self.http_client.request(*args, **kwargs)
        if resp.status_code == 200:
            return resp.text
        return None

    @cached_property
    def parser(self) -> LHTMLParser:
        """Get HTML parser."""
        return LHTMLParser(remove_blank_text=True, remove_comments=True, remove_pis=True, collect_ids=False)

    def extract_tree(self, html_text: str) -> html.Element:
        """Extract html tree from html text."""
        return html.fromstring(html_text, parser=self.parser)

    def pre_process_html(self, html_text: str) -> str:
        """Pre-process html_text before extracting results."""
        return html_text

    def extract_results(self, html_text: str) -> list[T]:
        """Extract search results from html text."""
        html_text = self.pre_process_html(html_text)
        tree = self.extract_tree(html_text)
        items = tree.xpath(self.items_xpath)
        results = []
        for item in items:
            result = self.result_type()
            for key, value in self.elements_xpath.items():
                data = " ".join(x.strip() for x in item.xpath(value))
                result.__setattr__(key, data)
            results.append(result)
        return results

    def post_extract_results(self, results: list[T]) -> list[T]:
        """Post-process search results."""
        return results

    def search(
        self,
        query: str,
        region: str = "us-en",
        safesearch: str = "moderate",
        timelimit: str | None = None,
        page: int = 1,
        **kwargs: str,
    ) -> list[T] | None:
        """Search the engine."""
        payload = self.build_payload(
            query=query, region=region, safesearch=safesearch, timelimit=timelimit, page=page, **kwargs
        )
        if self.search_method == "GET":
            html_text = self.request(self.search_method, self.search_url, params=payload, headers=self.search_headers)
        else:
            html_text = self.request(self.search_method, self.search_url, data=payload, headers=self.search_headers)
        if not html_text:
            return None
        results = self.extract_results(html_text)
        return self.post_extract_results(results)

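Putting the review guidance together, a minimal engine only supplies the class attributes (URL, method, XPaths) plus build_payload and post_extract_results, and inherits everything else from the base class. A hedged sketch (standalone here rather than inheriting BaseSearchEngine, results shown as plain dicts instead of TextResult, and the Sogou time-filter parameter name is an assumption):

```python
# Hedged sketch of the minimal engine shape the maintainer describes.
# In the real code this would subclass BaseSearchEngine[TextResult];
# the "tsn" timelimit parameter is an assumed Sogou query parameter.
from typing import Any

class SogouSketch:
    name = "sogou"
    category = "text"
    search_url = "https://www.sogou.com/web"
    search_method = "GET"
    items_xpath = "//div[contains(@class, 'vrwrap') and not(contains(@class, 'hint'))]"
    elements_xpath = {
        "title": ".//h3//a//text()",
        "href": ".//*[@data-url]/@data-url",
        "body": ".//div[contains(@class, 'space-txt')]//text()",
    }

    def build_payload(self, query, region, safesearch, timelimit, page, **kwargs) -> dict[str, Any]:
        payload: dict[str, Any] = {"query": query, "page": page}
        if timelimit:  # map ddgs timelimit codes to an assumed Sogou value
            payload["tsn"] = {"d": 1, "w": 2, "m": 3, "y": 4}.get(timelimit)
        return payload

    def post_extract_results(self, results: list[dict]) -> list[dict]:
        # Drop entries with no href or with an unresolved sogou.com wrapper link.
        return [r for r in results if r.get("href") and "/link?url=" not in r["href"]]
```

With this shape, extract_results, request, and search all come from BaseSearchEngine unchanged, which is what makes the extra helper methods flagged in the review unnecessary.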