
community: Add llm-extraction option to FireCrawl Document Loader #25231

Merged: 6 commits, Aug 9, 2024

Conversation

shivendrasoni
Contributor

Description: This minor PR aims to add llm_extraction to Firecrawl loader. This feature is supported on API and PythonSDK, but the langchain loader omits adding this to the response.
Twitter handle: scalable_pizza

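For context, a hedged sketch of the request parameters a caller might pass to enable LLM extraction through the loader. The schema and field names below are illustrative only, not taken from this PR; the extractorOptions shape mirrors the Firecrawl v0 API examples shown later in this thread:

```python
# Illustrative only: a JSON schema describing the fields we want extracted.
# "field1" and "arrayField" are made-up names for demonstration.
extraction_schema = {
    "type": "object",
    "properties": {
        "field1": {"type": "string"},
        "arrayField": {"type": "array", "items": {"type": "string"}},
    },
}

# Params dict in the shape the Firecrawl v0 scrape endpoint expects.
params = {
    "extractorOptions": {
        "mode": "llm-extraction",
        "extractionSchema": extraction_schema,
    }
}
```

With params like these, the API response gains an llm_extraction key alongside content and metadata, which is the field this PR surfaces.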
@dosubot dosubot bot added the size:XS This PR changes 0-9 lines, ignoring generated files. label Aug 9, 2024

@dosubot dosubot bot added community Related to langchain-community Ɑ: doc loader Related to document loader module (not documentation) 🤖:improvement Medium size change to existing code to handle new use-cases labels Aug 9, 2024
Minor fix (replace : with = for assignment)
@ccurme ccurme (Collaborator) left a comment

Thanks for this. llm_extraction is not an attribute of Document. Was the intent here to store doc.get("llm_extraction") in the document metadata?

@shivendrasoni (Contributor, Author)

Thanks for this. llm_extraction is not an attribute of Document. Was the intent here to store doc.get("llm_extraction") in the document metadata?

Hi, yes, and I realized it just now. Going by the API response schema, I was thinking llm_extraction should sit parallel to page_content, but even though Document allows kwargs, this arbitrary param doesn't get set.

Do you think saving it in metadata is semantically correct?

File updated: firecrawl.py
Method updated: lazy_load

Updated the lazy_load method to add llm_extraction to the metadata if it is available.
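A minimal sketch of the metadata merge this change describes, factored into a standalone helper so it can be inspected without the loader. The function name build_metadata is hypothetical, not from the PR diff:

```python
def build_metadata(doc: dict) -> dict:
    """Merge llm_extraction into a Firecrawl document's metadata, if present.

    Hypothetical helper illustrating the lazy_load change: the API response's
    llm_extraction field is copied into the Document metadata dict.
    """
    metadata = doc.get("metadata", {})
    llm_extraction = doc.get("llm_extraction")
    if llm_extraction is not None:
        metadata["llm_extraction"] = llm_extraction
    return metadata
```

Documents without an llm_extraction key pass through unchanged, so the change is backward compatible for callers that never request extraction.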
@ccurme (Collaborator)

ccurme commented Aug 9, 2024

Is llm_extraction serializable? Do you have an example? Document metadata isn't very opinionated and there aren't many constraints on it. If it's useful metadata about the document and is serializable then makes sense to me.

@shivendrasoni (Contributor, Author)

shivendrasoni commented Aug 9, 2024

Is llm_extraction serializable? Do you have an example? Document metadata isn't very opinionated and there aren't many constraints on it. If it's useful metadata about the document and is serializable then makes sense to me.

Hi, yes, llm_extraction is serializable, since it is guaranteed to be a valid JSON object.

This response is received from the Firecrawl API.
Attaching a snippet from the Firecrawl Python SDK, which is used by this loader. Note the scrape_url method, which returns the data portion of the API response (response['data']).
Example response:

{
  "data": {
    "content": "Markdown of the HTML string / Or HTML string (depending on input param)",
    "markdown": "Markdown of the HTML string",
    "metadata": {
      "title": "London student accommodation at Arch View House | Unite StudentsBack ButtonSearch IconFilter Icon",
      "description": "Book your high quality student accommodation in London with Unite Students at Arch View House",
      "ogTitle": "London student accommodation at Arch View House | Unite Students",
      "ogDescription": "Book your high quality student accommodation in London with Unite Students at Arch View House",
      "ogUrl": "https://www.unitestudents.com/london/Arch-View-House",
      "ogLocaleAlternate": [],
      "sourceURL": "https://www.unitestudents.com/london/arch-view-house",
      "pageStatusCode": 200
    },
    "linksOnPage": [
      "https://link1.com",
      "https://link2.com"
    ],
    "llm_extraction": {
      "field1": "value",
      "arrayField": ["str1", "str2", "str3"]
    }
  }
}
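The serializability claim above can be checked directly: a value that round-trips through json.dumps/json.loads unchanged is safe to store in Document metadata. A small sketch using the llm_extraction portion of the example response:

```python
import json

# The llm_extraction object from the example API response above.
response_data = {
    "llm_extraction": {
        "field1": "value",
        "arrayField": ["str1", "str2", "str3"],
    }
}

# Round-trip through JSON: a serializable value survives unchanged.
restored = json.loads(json.dumps(response_data))
```

Because the API guarantees llm_extraction is a JSON object, this round-trip always succeeds, satisfying the metadata constraint discussed above.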

Firecrawl's scrape_url method:

    def scrape_url(self, url: str, params: Optional[Dict[str, Any]] = None) -> Any:
        """
        Scrape the specified URL using the Firecrawl API.

        Args:
            url (str): The URL to scrape.
            params (Optional[Dict[str, Any]]): Additional parameters for the scrape request.

        Returns:
            Any: The scraped data if the request is successful.

        Raises:
            Exception: If the scrape request fails.
        """

        headers = self._prepare_headers()

        # Prepare the base scrape parameters with the URL
        scrape_params = {'url': url}

        # If there are additional params, process them
        if params:
            # Initialize extractorOptions if present
            extractor_options = params.get('extractorOptions', {})
            # Check and convert the extractionSchema if it's a Pydantic model
            if 'extractionSchema' in extractor_options:
                if hasattr(extractor_options['extractionSchema'], 'schema'):
                    extractor_options['extractionSchema'] = extractor_options['extractionSchema'].schema()
                # Ensure 'mode' is set, defaulting to 'llm-extraction' if not explicitly provided
                extractor_options['mode'] = extractor_options.get('mode', 'llm-extraction')
                # Update the scrape_params with the processed extractorOptions
                scrape_params['extractorOptions'] = extractor_options

            # Include any other params directly at the top level of scrape_params
            for key, value in params.items():
                if key != 'extractorOptions':
                    scrape_params[key] = value
        # Make the POST request with the prepared headers and JSON data
        response = requests.post(
            f'{self.api_url}/v0/scrape',
            headers=headers,
            json=scrape_params,
        )
        if response.status_code == 200:
            response = response.json()
            if response['success'] and 'data' in response:
                return response['data']
            else:
                raise Exception(f'Failed to scrape URL. Error: {response["error"]}')
        else:
            self._handle_error(response, 'scrape URL')
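To see the parameter handling in isolation, here is a re-implementation of just the request-preparation logic from the method above, minus the HTTP call and class context. This is an illustrative extraction for inspection, not code from the SDK:

```python
from typing import Any, Dict, Optional


def prepare_scrape_params(
    url: str, params: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    """Mirror the param handling in Firecrawl's scrape_url (illustrative only)."""
    scrape_params: Dict[str, Any] = {"url": url}
    if params:
        extractor_options = params.get("extractorOptions", {})
        if "extractionSchema" in extractor_options:
            schema = extractor_options["extractionSchema"]
            # A Pydantic model is converted to its JSON schema dict.
            if hasattr(schema, "schema"):
                extractor_options["extractionSchema"] = schema.schema()
            # Default the mode to 'llm-extraction' if not explicitly provided.
            extractor_options["mode"] = extractor_options.get(
                "mode", "llm-extraction"
            )
            scrape_params["extractorOptions"] = extractor_options
        # All other params pass through at the top level.
        for key, value in params.items():
            if key != "extractorOptions":
                scrape_params[key] = value
    return scrape_params
```

Note that extractorOptions is only forwarded when an extractionSchema is present, which matches the nesting in the original snippet.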

@dosubot dosubot bot added size:S This PR changes 10-29 lines, ignoring generated files. and removed size:XS This PR changes 0-9 lines, ignoring generated files. labels Aug 9, 2024
@dosubot dosubot bot added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label Aug 9, 2024
@ccurme ccurme enabled auto-merge (squash) August 9, 2024 13:55
@ccurme ccurme merged commit 66b7206 into langchain-ai:master Aug 9, 2024
43 checks passed
olgamurraft pushed a commit to olgamurraft/langchain that referenced this pull request Aug 16, 2024
…ngchain-ai#25231)

**Description:** This minor PR aims to add `llm_extraction` to Firecrawl
loader. This feature is supported on API and PythonSDK, but the
langchain loader omits adding this to the response.
**Twitter handle:** [scalable_pizza](https://x.com/scalablepizza)

---------

Co-authored-by: Chester Curme <chester.curme@gmail.com>