
community: Add llm-extraction option to FireCrawl Document Loader #25231

Merged: 6 commits, Aug 9, 2024

Conversation

shivendrasoni
Contributor

Description: This minor PR aims to add llm_extraction to Firecrawl loader. This feature is supported on API and PythonSDK, but the langchain loader omits adding this to the response.
Twitter handle: scalable_pizza

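For context, a hedged sketch of the request parameters a caller might pass to enable LLM extraction through the loader. The schema and field names below are illustrative only, not taken from this PR; the extractorOptions shape mirrors the Firecrawl v0 API examples shown later in this thread:

```python
# Illustrative only: a JSON schema describing the fields we want extracted.
# "field1" and "arrayField" are made-up names for demonstration.
extraction_schema = {
    "type": "object",
    "properties": {
        "field1": {"type": "string"},
        "arrayField": {"type": "array", "items": {"type": "string"}},
    },
}

# Params dict in the shape the Firecrawl v0 scrape endpoint expects.
params = {
    "extractorOptions": {
        "mode": "llm-extraction",
        "extractionSchema": extraction_schema,
    }
}
```

With params like these, the API response gains an llm_extraction key alongside content and metadata, which is the field this PR surfaces.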
@dosubot dosubot bot added the size:XS This PR changes 0-9 lines, ignoring generated files. label Aug 9, 2024

@dosubot dosubot bot added community Related to langchain-community Ɑ: doc loader Related to document loader module (not documentation) 🤖:improvement Medium size change to existing code to handle new use-cases labels Aug 9, 2024
Minor fix (replace : with = for assignment)
@ccurme ccurme (Collaborator) left a comment

Thanks for this. llm_extraction is not an attribute of Document. Was the intent here to store doc.get("llm_extraction") in the document metadata?

@shivendrasoni (Contributor, Author)

Thanks for this. llm_extraction is not an attribute of Document. Was the intent here to store doc.get("llm_extraction") in the document metadata?

Hi, yes, and I realized it just now. Going by the API response schema, I was thinking llm_extraction should sit parallel to page_content, but even though Document allows kwargs, this arbitrary param doesn't get set.

Do you think saving it in metadata is semantically correct?

File updated: firecrawl.py
Method updated: lazy_load

Updated the lazy_load method to add llm_extraction to the metadata if it is available.
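A minimal sketch of the metadata merge this change describes, factored into a standalone helper so it can be inspected without the loader. The function name build_metadata is hypothetical, not from the PR diff:

```python
def build_metadata(doc: dict) -> dict:
    """Merge llm_extraction into a Firecrawl document's metadata, if present.

    Hypothetical helper illustrating the lazy_load change: the API response's
    llm_extraction field is copied into the Document metadata dict.
    """
    metadata = doc.get("metadata", {})
    llm_extraction = doc.get("llm_extraction")
    if llm_extraction is not None:
        metadata["llm_extraction"] = llm_extraction
    return metadata
```

Documents without an llm_extraction key pass through unchanged, so the change is backward compatible for callers that never request extraction.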
@ccurme (Collaborator)

ccurme commented Aug 9, 2024

Is llm_extraction serializable? Do you have an example? Document metadata isn't very opinionated and there aren't many constraints on it. If it's useful metadata about the document and is serializable then makes sense to me.

@shivendrasoni (Contributor, Author)

shivendrasoni commented Aug 9, 2024

Is llm_extraction serializable? Do you have an example? Document metadata isn't very opinionated and there aren't many constraints on it. If it's useful metadata about the document and is serializable then makes sense to me.

Hi, yes, llm_extraction is serializable, since it is guaranteed to be a valid JSON object.

This response is received from the Firecrawl API.
Attaching a snippet from the Firecrawl Python SDK, which is used by this loader. Note the scrape_url method, which returns the data portion of the API response (response['data']).
Example response:

{
  "data": {
    "content": "Markdown of the HTML string / Or HTML string (depending on input param)",
    "markdown": "Markdown of the HTML string",
    "metadata": {
      "title": "London student accommodation at Arch View House | Unite StudentsBack ButtonSearch IconFilter Icon",
      "description": "Book your high quality student accommodation in London with Unite Students at Arch View House",
      "ogTitle": "London student accommodation at Arch View House | Unite Students",
      "ogDescription": "Book your high quality student accommodation in London with Unite Students at Arch View House",
      "ogUrl": "https://www.unitestudents.com/london/Arch-View-House",
      "ogLocaleAlternate": [],
      "sourceURL": "https://www.unitestudents.com/london/arch-view-house",
      "pageStatusCode": 200
    },
    "linksOnPage": [
      "https://link1.com",
      "https://link2.com"
    ],
    "llm_extraction": {
      "field1": "value",
      "arrayField": ["str1", "str2", "str3"]
    }
  }
}
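The serializability claim above can be checked directly: a value that round-trips through json.dumps/json.loads unchanged is safe to store in Document metadata. A small sketch using the llm_extraction portion of the example response:

```python
import json

# The llm_extraction object from the example API response above.
response_data = {
    "llm_extraction": {
        "field1": "value",
        "arrayField": ["str1", "str2", "str3"],
    }
}

# Round-trip through JSON: a serializable value survives unchanged.
restored = json.loads(json.dumps(response_data))
```

Because the API guarantees llm_extraction is a JSON object, this round-trip always succeeds, satisfying the metadata constraint discussed above.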

Firecrawl's scrape_url method:

    def scrape_url(self, url: str, params: Optional[Dict[str, Any]] = None) -> Any:
        """
        Scrape the specified URL using the Firecrawl API.

        Args:
            url (str): The URL to scrape.
            params (Optional[Dict[str, Any]]): Additional parameters for the scrape request.

        Returns:
            Any: The scraped data if the request is successful.

        Raises:
            Exception: If the scrape request fails.
        """

        headers = self._prepare_headers()

        # Prepare the base scrape parameters with the URL
        scrape_params = {'url': url}

        # If there are additional params, process them
        if params:
            # Initialize extractorOptions if present
            extractor_options = params.get('extractorOptions', {})
            # Check and convert the extractionSchema if it's a Pydantic model
            if 'extractionSchema' in extractor_options:
                if hasattr(extractor_options['extractionSchema'], 'schema'):
                    extractor_options['extractionSchema'] = extractor_options['extractionSchema'].schema()
                # Ensure 'mode' is set, defaulting to 'llm-extraction' if not explicitly provided
                extractor_options['mode'] = extractor_options.get('mode', 'llm-extraction')
                # Update the scrape_params with the processed extractorOptions
                scrape_params['extractorOptions'] = extractor_options

            # Include any other params directly at the top level of scrape_params
            for key, value in params.items():
                if key != 'extractorOptions':
                    scrape_params[key] = value
        # Make the POST request with the prepared headers and JSON data
        response = requests.post(
            f'{self.api_url}/v0/scrape',
            headers=headers,
            json=scrape_params,
        )
        if response.status_code == 200:
            response = response.json()
            if response['success'] and 'data' in response:
                return response['data']
            else:
                raise Exception(f'Failed to scrape URL. Error: {response["error"]}')
        else:
            self._handle_error(response, 'scrape URL')
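To see the parameter handling in isolation, here is a re-implementation of just the request-preparation logic from the method above, minus the HTTP call and class context. This is an illustrative extraction for inspection, not code from the SDK:

```python
from typing import Any, Dict, Optional


def prepare_scrape_params(
    url: str, params: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    """Mirror the param handling in Firecrawl's scrape_url (illustrative only)."""
    scrape_params: Dict[str, Any] = {"url": url}
    if params:
        extractor_options = params.get("extractorOptions", {})
        if "extractionSchema" in extractor_options:
            schema = extractor_options["extractionSchema"]
            # A Pydantic model is converted to its JSON schema dict.
            if hasattr(schema, "schema"):
                extractor_options["extractionSchema"] = schema.schema()
            # Default the mode to 'llm-extraction' if not explicitly provided.
            extractor_options["mode"] = extractor_options.get(
                "mode", "llm-extraction"
            )
            scrape_params["extractorOptions"] = extractor_options
        # All other params pass through at the top level.
        for key, value in params.items():
            if key != "extractorOptions":
                scrape_params[key] = value
    return scrape_params
```

Note that extractorOptions is only forwarded when an extractionSchema is present, which matches the nesting in the original snippet.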

@dosubot dosubot bot added size:S This PR changes 10-29 lines, ignoring generated files. and removed size:XS This PR changes 0-9 lines, ignoring generated files. labels Aug 9, 2024
@dosubot dosubot bot added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label Aug 9, 2024
@ccurme ccurme enabled auto-merge (squash) August 9, 2024 13:55
@ccurme ccurme merged commit 66b7206 into langchain-ai:master Aug 9, 2024
43 checks passed
olgamurraft pushed a commit to olgamurraft/langchain that referenced this pull request Aug 16, 2024
…ngchain-ai#25231)

**Description:** This minor PR aims to add `llm_extraction` to Firecrawl
loader. This feature is supported on API and PythonSDK, but the
langchain loader omits adding this to the response.
**Twitter handle:** [scalable_pizza](https://x.com/scalablepizza)

---------

Co-authored-by: Chester Curme <chester.curme@gmail.com>