Feat: Add Sogou Text Backend #392

Open
scarletkc wants to merge 7 commits into deedy5:main from scarletkc:main
Conversation

@scarletkc

Feature

  • add ddgs/engines/sogou.py implementing text search with timelimit + pagination support
  • wire the backend into CLI (ddgs text -b sogou) and list it in README’s engines table
  • results/downloads expose the real destination URLs

Notes

  • endpoint: https://www.sogou.com/web (GET)
  • selectors: items_xpath = //div[contains(@class,'vrwrap') and not(contains(@class,'hint'))]
  • _normalize_href follows redirects via inline JS/meta refresh and caches the decoded hrefs
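
As an illustration only (the HTML snippet below is invented, not real Sogou markup), the selectors listed above can be exercised with lxml roughly like this:

```python
# Hypothetical sketch of applying the PR's XPaths to a Sogou-style result block.
# The SAMPLE markup is invented for illustration; only the XPaths come from the PR.
from lxml import html

ITEMS_XPATH = "//div[contains(@class, 'vrwrap') and not(contains(@class, 'hint'))]"
TITLE_XPATH = ".//h3//a//text()"
HREF_XPATH = ".//h3//a/@href"

SAMPLE = """
<div>
  <div class="vrwrap"><h3><a href="/link?url=abc">Python</a></h3></div>
  <div class="vrwrap hint"><h3><a href="/ad">Sponsored</a></h3></div>
</div>
"""

def extract(doc_text: str) -> list[dict[str, str]]:
    tree = html.fromstring(doc_text)
    results = []
    for item in tree.xpath(ITEMS_XPATH):
        # Join text nodes for the title; take the first href attribute if any.
        title = " ".join(t.strip() for t in item.xpath(TITLE_XPATH) if t.strip())
        href = next(iter(item.xpath(HREF_XPATH)), "")
        results.append({"title": title, "href": href})
    return results

print(extract(SAMPLE))  # the 'hint' (sponsored) block is excluded by the XPath
```

Note how the `not(contains(@class,'hint'))` predicate drops Sogou's sponsored blocks at selection time, before any post-processing runs.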

Tests

  • pytest tests/ → 17/18 passing (pre-existing test_books_command failure)
  • pre-commit run --all-files
  • manual DDGS().text('python', backend='sogou', max_results=5)

- implement ddgs/engines/sogou.py
@scarletkc
Author

@deedy5 Sogou is a well-known search engine in China. Because of the Great Firewall (GFW), many users hope to use ddgs without a proxy, but none of the existing search backends work there without one. This backend closes that gap and makes ddgs usable in China.

Occasionally, the first result may be missing (since it is Sogou’s own sponsored recommendation), but this has little impact on overall functionality.

Moved the import of Mapping from collections.abc into a TYPE_CHECKING block to optimize runtime imports and improve type checking performance.
Expanded .gitignore to match all .venv* directories. Enhanced the README.md with a more detailed and nested Table of Contents for better navigation.
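
The TYPE_CHECKING change described above follows the standard typing pattern: the import is evaluated only by static type checkers, never at runtime. A minimal sketch (the function here is hypothetical, just to show the pattern):

```python
# Sketch of the TYPE_CHECKING pattern the commit describes: collections.abc
# is imported only during static analysis, not at runtime.
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from collections.abc import Mapping

def headers() -> "Mapping[str, str]":
    # The annotation is a string, so Mapping need not exist at runtime.
    return {"User-Agent": "ddgs"}

print(headers())
```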
Owner

@deedy5 deedy5 left a comment


Thanks for the contribution — your other changes look good. Could you please revert the last commit that modified .gitignore and README? Those edits aren’t necessary and should be left unchanged.

@scarletkc
Author

> Thanks for the contribution — your other changes look good. Could you please revert the last commit that modified .gitignore and README? Those edits aren’t necessary and should be left unchanged.

Reverted now. Thanks!

@scarletkc scarletkc requested a review from deedy5 November 21, 2025 02:23
Removed redirect resolution and caching from Sogou search engine. The post_extract_results method now only normalizes hrefs to absolute URLs and filters results with valid titles and hrefs, reducing complexity.
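
The simplified normalization this commit describes can be sketched with the standard library's urljoin (the base URL and function name below are assumptions, not the PR's actual code):

```python
# Minimal sketch of normalizing relative hrefs to absolute URLs,
# as the commit message describes. BASE is an assumed value.
from urllib.parse import urljoin

BASE = "https://www.sogou.com/"

def normalize(href: str) -> str:
    # urljoin leaves already-absolute URLs unchanged.
    return urljoin(BASE, href)

print(normalize("/web?query=python"))    # https://www.sogou.com/web?query=python
print(normalize("https://example.com/")) # https://example.com/
```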
@scarletkc scarletkc requested a review from deedy5 December 8, 2025 10:43
@deedy5
Owner

deedy5 commented Dec 17, 2025

@scarletkc
We require direct, unencoded URLs (not redirect/wrapper links like https://www.sogou.com/link?url=...)
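
The wrapper links the maintainer means are straightforward to recognize; a minimal sketch using the standard library (the function name is illustrative, not the PR's code):

```python
# Hypothetical check for Sogou's redirect/wrapper links of the form
# https://www.sogou.com/link?url=<encoded> — the pattern the maintainer rejects.
from urllib.parse import urlparse

def is_wrapper_link(href: str) -> bool:
    parsed = urlparse(href)
    return parsed.path == "/link" and "url=" in parsed.query

print(is_wrapper_link("https://www.sogou.com/link?url=abc"))  # True
print(is_wrapper_link("https://www.python.org/"))             # False
```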

scarletkc and others added 2 commits December 18, 2025 15:48
Enhanced the Sogou search engine to extract real target URLs from the data-url attribute, avoiding unnecessary requests to resolve wrapper links. Added helper methods for XPath extraction and updated post-processing to filter out unresolved wrapper links. Included tests to verify correct behavior when data-url is present or missing.
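
The data-url approach this commit describes can be sketched as follows (the markup and helper name are invented for illustration; only the XPaths echo the PR):

```python
# Illustrative sketch: prefer the real target URL carried in a data-url
# attribute over the sogou.com/link wrapper href, avoiding extra requests.
from lxml import html

ITEM = html.fromstring(
    '<div class="vrwrap"><h3>'
    '<a href="/link?url=encoded" data-url="https://www.python.org/">Python</a>'
    '</h3></div>'
)

def real_href(item) -> str:
    # Prefer the unencoded target from data-url; fall back to the raw href.
    data_urls = item.xpath(".//*[@data-url]/@data-url")
    if data_urls:
        return data_urls[0]
    hrefs = item.xpath(".//h3//a/@href")
    return hrefs[0] if hrefs else ""

print(real_href(ITEM))  # https://www.python.org/
```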
Owner

@deedy5 deedy5 Dec 18, 2025


Please remove the tests: they don't make sense in their current form.

@@ -0,0 +1,94 @@
"""Sogou search engine implementation."""

from __future__ import annotations
Owner

@deedy5 deedy5 Dec 18, 2025


please delete `from __future__ import annotations`

items_xpath = "//div[contains(@class, 'vrwrap') and not(contains(@class, 'hint'))]"
elements_xpath: ClassVar[Mapping[str, str]] = {
"title": ".//h3//a//text()",
"href": ".//h3//a/@href",
Owner

@deedy5 deedy5 Dec 18, 2025


"href": use xpath from _data_url_xpath

return payload

@staticmethod
def _xpath_join(item: Any, xpath: str) -> str: # noqa: ANN401
Owner

@deedy5 deedy5 Dec 18, 2025


function is unnecessary

return " ".join(x.strip() for x in item.xpath(xpath) if x and x.strip())

@staticmethod
def _xpath_first(item: Any, xpath: str) -> str: # noqa: ANN401
Owner

@deedy5 deedy5 Dec 18, 2025


function is unnecessary

return ""

@staticmethod
def _is_wrapper_link(href: str) -> bool:
Owner

@deedy5 deedy5 Dec 18, 2025


function is unnecessary

def _is_wrapper_link(href: str) -> bool:
return "/link?url=" in href or "sogou.com/link?url=" in href

def extract_results(self, html_text: str) -> list[TextResult]:
Owner

@deedy5 deedy5 Dec 18, 2025


def extract_results is unnecessary

"body": ".//div[contains(@class, 'space-txt')]//text()",
}

_data_url_xpath = ".//*[@data-url]/@data-url"
Owner

@deedy5 deedy5 Dec 18, 2025


del _data_url_xpath

results.append(TextResult(title=title, href=href, body=body))
return results

def post_extract_results(self, results: list[TextResult]) -> list[TextResult]:
Owner

@deedy5 deedy5 Dec 18, 2025


validate href in post_extract_results

Owner

@deedy5 deedy5 left a comment


Everything works, thanks!
Could you optimize the code and rename the branch containing your changes to something other than 'main'?
You only need to implement the correct XPaths and the functions build_payload and post_extract_results.
Check Contributing.md and see how the BaseSearchEngine class (the one Sogou inherits from) works:

ddgs/ddgs/base.py

Lines 19 to 122 in 7962c03

class BaseSearchEngine(ABC, Generic[T]):
    """Abstract base class for all search-engine backends."""

    name: ClassVar[str]  # unique key, e.g. "google"
    category: ClassVar[Literal["text", "images", "videos", "news", "books"]]
    provider: ClassVar[str]  # source of the search results (e.g. "bing" for DuckDuckgo)
    disabled: ClassVar[bool] = False  # if True, the engine is disabled
    priority: ClassVar[float] = 1
    search_url: str
    search_method: ClassVar[str]  # GET or POST
    search_headers: ClassVar[Mapping[str, str]] = {}
    items_xpath: ClassVar[str]
    elements_xpath: ClassVar[Mapping[str, str]]
    elements_replace: ClassVar[Mapping[str, str]]

    def __init__(self, proxy: str | None = None, timeout: int | None = None, *, verify: bool | str = True) -> None:
        self.http_client = HttpClient(proxy=proxy, timeout=timeout, verify=verify)
        self.results: list[T] = []

    @property
    def result_type(self) -> type[T]:
        """Get result type based on category."""
        categories = {
            "text": TextResult,
            "images": ImagesResult,
            "videos": VideosResult,
            "news": NewsResult,
            "books": BooksResult,
        }
        return categories[self.category]

    @abstractmethod
    def build_payload(
        self,
        query: str,
        region: str,
        safesearch: str,
        timelimit: str | None,
        page: int,
        **kwargs: str,
    ) -> dict[str, Any]:
        """Build a payload for the search request."""
        raise NotImplementedError

    def request(self, *args: Any, **kwargs: Any) -> str | None:  # noqa: ANN401
        """Make a request to the search engine."""
        resp = self.http_client.request(*args, **kwargs)
        if resp.status_code == 200:
            return resp.text
        return None

    @cached_property
    def parser(self) -> LHTMLParser:
        """Get HTML parser."""
        return LHTMLParser(remove_blank_text=True, remove_comments=True, remove_pis=True, collect_ids=False)

    def extract_tree(self, html_text: str) -> html.Element:
        """Extract html tree from html text."""
        return html.fromstring(html_text, parser=self.parser)

    def pre_process_html(self, html_text: str) -> str:
        """Pre-process html_text before extracting results."""
        return html_text

    def extract_results(self, html_text: str) -> list[T]:
        """Extract search results from html text."""
        html_text = self.pre_process_html(html_text)
        tree = self.extract_tree(html_text)
        items = tree.xpath(self.items_xpath)
        results = []
        for item in items:
            result = self.result_type()
            for key, value in self.elements_xpath.items():
                data = " ".join(x.strip() for x in item.xpath(value))
                result.__setattr__(key, data)
            results.append(result)
        return results

    def post_extract_results(self, results: list[T]) -> list[T]:
        """Post-process search results."""
        return results

    def search(
        self,
        query: str,
        region: str = "us-en",
        safesearch: str = "moderate",
        timelimit: str | None = None,
        page: int = 1,
        **kwargs: str,
    ) -> list[T] | None:
        """Search the engine."""
        payload = self.build_payload(
            query=query, region=region, safesearch=safesearch, timelimit=timelimit, page=page, **kwargs
        )
        if self.search_method == "GET":
            html_text = self.request(self.search_method, self.search_url, params=payload, headers=self.search_headers)
        else:
            html_text = self.request(self.search_method, self.search_url, data=payload, headers=self.search_headers)
        if not html_text:
            return None
        results = self.extract_results(html_text)
        return self.post_extract_results(results)

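Putting the review guidance together, a minimal engine only supplies the class attributes (URL, method, XPaths) plus build_payload and post_extract_results, and inherits everything else from the base class. A hedged sketch (standalone here rather than inheriting BaseSearchEngine, results shown as plain dicts instead of TextResult, and the Sogou time-filter parameter name is an assumption):

```python
# Hedged sketch of the minimal engine shape the maintainer describes.
# In the real code this would subclass BaseSearchEngine[TextResult];
# the "tsn" timelimit parameter is an assumed Sogou query parameter.
from typing import Any

class SogouSketch:
    name = "sogou"
    category = "text"
    search_url = "https://www.sogou.com/web"
    search_method = "GET"
    items_xpath = "//div[contains(@class, 'vrwrap') and not(contains(@class, 'hint'))]"
    elements_xpath = {
        "title": ".//h3//a//text()",
        "href": ".//*[@data-url]/@data-url",
        "body": ".//div[contains(@class, 'space-txt')]//text()",
    }

    def build_payload(self, query, region, safesearch, timelimit, page, **kwargs) -> dict[str, Any]:
        payload: dict[str, Any] = {"query": query, "page": page}
        if timelimit:  # map ddgs timelimit codes to an assumed Sogou value
            payload["tsn"] = {"d": 1, "w": 2, "m": 3, "y": 4}.get(timelimit)
        return payload

    def post_extract_results(self, results: list[dict]) -> list[dict]:
        # Drop entries with no href or with an unresolved sogou.com wrapper link.
        return [r for r in results if r.get("href") and "/link?url=" not in r["href"]]
```

With this shape, extract_results, request, and search all come from BaseSearchEngine unchanged, which is what makes the extra helper methods flagged in the review unnecessary.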