AzureAISearch Retriever only returns up to 50 docs #27830

Open
5 tasks done
sjjpo2002 opened this issue Nov 1, 2024 · 0 comments
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

To reproduce the issue, create an Azure AI Search index and upload more than 50 documents that share a value in a searchable field. This could be source in the metadata, for example the same file name on all chunks. Then instantiate the retriever:

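from langchain_community.retrievers import AzureAISearchRetriever

retriever = AzureAISearchRetriever(
    service_name=AZURE_SEARCH_ENDPOINT,  # service name or full endpoint URL
    index_name=AZURE_SEARCH_INDEX_NAME,
    api_key=AZURE_SEARCH_KEY,
    content_key="content",
    top_k=None,  # documented as "retrieve all results"
)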

and invoke a query like:

retriever.invoke(doc.metadata["source"])

Setting top_k to None should return all results, according to the documentation:

top_k: Optional[int] = None
"""Number of results to retrieve. Set to None to retrieve all results."""

However, when top_k is None the request carries no $top parameter, so Azure applies its default of 50 and the current implementation never returns more than 50 results.
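The effect can be seen in isolation with a minimal sketch. The helper name `top_param` is hypothetical; its body copies the expression the retriever's `_build_search_url` uses to build the `$top` fragment, as quoted in the Description.

```python
from typing import Optional


def top_param(top_k: Optional[int]) -> str:
    """Hypothetical helper mirroring the $top expression in _build_search_url."""
    # A falsy top_k (None, and also 0) produces no $top parameter at all,
    # which leaves the result count up to the service default.
    return f"&$top={top_k}" if top_k else ""


print(repr(top_param(None)))  # ''
print(repr(top_param(100)))   # '&$top=100'
```

With `top_k=None` nothing is appended to the URL, so the request is indistinguishable from one that never asked for a result count.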

Error Message and Stack Trace (if applicable)

No response

Description

By default, the Azure AI Search service does not return all matches for a query, as its documentation states:

"By default, the search engine returns up to the first 50 matches. The top 50 are determined by search score, assuming the query is full text search or semantic."

From the same documentation we can understand that we need to implement pagination if we want to retrieve all the documents when we query the service:

"To control the paging of all documents returned in a result set, add $top and $skip parameters to the GET query request, or top and skip to the POST query request. The following list explains the logic.

Return the first set of 15 matching documents plus a count of total matches: GET /indexes//docs?search=&$top=15&$skip=0&$count=true

Return the second set, skipping the first 15 to get the next 15: $top=15&$skip=15. Repeat for the third set of 15: $top=15&$skip=30"
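The paging logic in the quoted passage can be sketched as a small generator. The function name `paging_queries` is an assumption made for illustration; it only produces the `$top`/`$skip` fragments and sends no requests.

```python
from typing import Iterator


def paging_queries(total: int, page_size: int) -> Iterator[str]:
    """Yield the $top/$skip fragment for each page needed to cover `total` matches."""
    for skip in range(0, total, page_size):
        yield f"$top={page_size}&$skip={skip}"


# Three pages of 15 cover 45 matches, as in the quoted example.
print(list(paging_queries(45, 15)))
# ['$top=15&$skip=0', '$top=15&$skip=15', '$top=15&$skip=30']
```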

The existing code implements no pagination, so with top_k=None the retriever returns at most 50 results no matter how many records match. This behavior is not fully documented and can surprise users who intend to retrieve all documents. It is clear from the function that builds the API query:

def _build_search_url(self, query: str) -> str:
        url_suffix = get_from_env("", "AZURE_AI_SEARCH_URL_SUFFIX", DEFAULT_URL_SUFFIX)
        if url_suffix in self.service_name and "https://" in self.service_name:
            base_url = f"{self.service_name}/"
        elif url_suffix in self.service_name and "https://" not in self.service_name:
            base_url = f"https://{self.service_name}/"
        elif url_suffix not in self.service_name and "https://" in self.service_name:
            base_url = f"{self.service_name}.{url_suffix}/"
        elif (
            url_suffix not in self.service_name and "https://" not in self.service_name
        ):
            base_url = f"https://{self.service_name}.{url_suffix}/"
        else:
            # pass to Azure to throw a specific error
            base_url = self.service_name
        endpoint_path = f"indexes/{self.index_name}/docs?api-version={self.api_version}"
        top_param = f"&$top={self.top_k}" if self.top_k else ""
        filter_param = f"&$filter={self.filter}" if self.filter else ""
        return base_url + endpoint_path + f"&search={query}" + top_param + filter_param
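One possible shape for a fix is a loop that keeps requesting pages until a short page comes back. This is only a sketch, not the retriever's actual code: `fetch_all` and the `fetch_page(skip, top)` callable are hypothetical names, and an in-memory list stands in for the HTTP call so the example runs without an Azure service.

```python
from typing import Callable, Dict, List

Page = List[Dict]


def fetch_all(fetch_page: Callable[[int, int], Page], page_size: int = 50) -> Page:
    """Collect every match by advancing $skip until a page comes back short."""
    results: Page = []
    skip = 0
    while True:
        page = fetch_page(skip, page_size)
        results.extend(page)
        if len(page) < page_size:  # a short (or empty) page means no more matches
            break
        skip += page_size
    return results


# Stand-in for the search service: 120 matching documents held in memory.
INDEX = [{"id": i} for i in range(120)]


def fake_fetch_page(skip: int, top: int) -> Page:
    """Return the slice a $skip/$top request would cover."""
    return INDEX[skip : skip + top]


docs = fetch_all(fake_fetch_page)
print(len(docs))  # 120
```

In a real implementation `fetch_page` would issue the search request with `$top` and `$skip` appended to the URL. Stopping on a short page costs one extra request when the total is an exact multiple of the page size; requesting `$count=true`, as the quoted documentation shows, is an alternative stopping condition.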

System Info

System Information

OS: Linux
OS Version: #1 SMP Wed Sep 11 18:02:00 EDT 2024
Python Version: 3.11.9 (main, Aug 26 2024, 10:40:41) [GCC 8.5.0 20210514 (Red Hat 8.5.0-22)]

Package Information

langchain_core: 0.2.33
langchain: 0.2.5
langchain_community: 0.2.5
langsmith: 0.1.101
langchain_cli: 0.0.29
langchain_openai: 0.1.22
langchain_text_splitters: 0.2.2
langserve: 0.2.2

Optional packages not installed

langgraph

Other Dependencies

aiohttp: 3.9.5
async-timeout: Installed. No version info available.
dataclasses-json: 0.6.7
fastapi: 0.110.0
gitpython: 3.1.43
httpx: 0.27.0
jsonpatch: 1.33
langserve[all]: Installed. No version info available.
libcst: 1.4.0
numpy: 1.26.4
openai: 1.41.0
orjson: 3.10.5
packaging: 23.2
pydantic: 2.6.2
pyproject-toml: 0.0.10
PyYAML: 5.3.1
requests: 2.32.3
SQLAlchemy: 2.0.27
sse-starlette: 1.8.2
tenacity: 8.4.1
tiktoken: 0.7.0
tomlkit: 0.12.5
typer[all]: Installed. No version info available.
typing-extensions: 4.12.2
uvicorn: 0.23.2

sjjpo2002 changed the title from "AzureAISearch Retriever only returns up to 5 docs" to "AzureAISearch Retriever only returns up to 50 docs" on Nov 1, 2024
@dosubot dosubot bot added the 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature label Nov 1, 2024