community: Add llm-extraction option to FireCrawl Document Loader #25231
This PR adds `llm_extraction` to the Firecrawl loader. The feature is supported by the API and the Python SDK, but the LangChain loader omits it from the response.
Minor fix (replace : with = for assignment)
Thanks for this. `llm_extraction` is not an attribute of `Document`. Was the intent here to store `doc.get("llm_extraction")` in the document metadata?
Hi, yes, and I only just realized it. Going by the API response schema, do you think saving it in the metadata is semantically correct?
File updated: `firecrawl.py`. Method updated: `lazy_load`. The `lazy_load` method now adds `llm_extraction` to the metadata if it is available.
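For readers following along, here is a minimal sketch of the change described above. It is not the loader's actual code: `Document` is a small stand-in for the LangChain class, `docs_from_firecrawl` is a hypothetical helper name, and taking `page_content` from the `markdown` field is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List


@dataclass
class Document:
    """Minimal stand-in for the LangChain Document class (assumption)."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def docs_from_firecrawl(items: List[Dict[str, Any]]) -> Iterator[Document]:
    """Yield one Document per Firecrawl result (hypothetical helper).

    llm_extraction is copied into the metadata only when the result has it.
    """
    for item in items:
        # Copy the metadata so the caller's response dict is not mutated.
        metadata = dict(item.get("metadata") or {})
        llm_extraction = item.get("llm_extraction")
        if llm_extraction is not None:
            metadata["llm_extraction"] = llm_extraction
        yield Document(page_content=item.get("markdown", ""), metadata=metadata)
```

Results without an `llm_extraction` key produce documents whose metadata is unchanged, so existing callers would see no difference.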
Is |
Hi, yes, `llm_extraction` is serializable, since it is always a valid JSON object. This is the response received from the Firecrawl API:
{
    "data": {
        "content": " Markdown of the HTML string / Or HTML string (depending on input param)",
        "markdown": " Markdown of the HTML string",
        "metadata": {
            "title": "London student accommodation at Arch View House | Unite StudentsBack ButtonSearch IconFilter Icon",
            "description": "Book your high quality student accommodation in London with Unite Students at Arch View House",
            "ogTitle": "London student accommodation at Arch View House | Unite Students",
            "ogDescription": "Book your high quality student accommodation in London with Unite Students at Arch View House",
            "ogUrl": "https://www.unitestudents.com/london/Arch-View-House",
            "ogLocaleAlternate": [],
            "sourceURL": "https://www.unitestudents.com/london/arch-view-house",
            "pageStatusCode": 200
        },
        "linksOnPage": [
            "https://link1.com",
            "https://link2.com"
        ],
        "llm_extraction": {
            "field1": "value",
            "arrayField": [
                "str1",
                "str2",
                "str3"
            ]
        }
    }
}

Firecrawl's `scrape_url` method:

    def scrape_url(self, url: str, params: Optional[Dict[str, Any]] = None) -> Any:
        """
        Scrape the specified URL using the Firecrawl API.

        Args:
            url (str): The URL to scrape.
            params (Optional[Dict[str, Any]]): Additional parameters for the scrape request.

        Returns:
            Any: The scraped data if the request is successful.

        Raises:
            Exception: If the scrape request fails.
        """
        headers = self._prepare_headers()
        # Prepare the base scrape parameters with the URL
        scrape_params = {'url': url}
        # If there are additional params, process them
        if params:
            # Initialize extractorOptions if present
            extractor_options = params.get('extractorOptions', {})
            # Check and convert the extractionSchema if it's a Pydantic model
            if 'extractionSchema' in extractor_options:
                if hasattr(extractor_options['extractionSchema'], 'schema'):
                    extractor_options['extractionSchema'] = extractor_options['extractionSchema'].schema()
                # Ensure 'mode' is set, defaulting to 'llm-extraction' if not explicitly provided
                extractor_options['mode'] = extractor_options.get('mode', 'llm-extraction')
                # Update the scrape_params with the processed extractorOptions
                scrape_params['extractorOptions'] = extractor_options
            # Include any other params directly at the top level of scrape_params
            for key, value in params.items():
                if key != 'extractorOptions':
                    scrape_params[key] = value
        # Make the POST request with the prepared headers and JSON data
        response = requests.post(
            f'{self.api_url}/v0/scrape',
            headers=headers,
            json=scrape_params,
        )
        if response.status_code == 200:
            response = response.json()
            if response['success'] and 'data' in response:
                return response['data']
            else:
                raise Exception(f'Failed to scrape URL. Error: {response["error"]}')
        else:
            self._handle_error(response, 'scrape URL')
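
To make the parameter handling in `scrape_url` easier to follow, here is a rough sketch that isolates it as a pure function with no HTTP call. `build_scrape_params` is a hypothetical name, not part of the Firecrawl SDK. The sketch assumes the `mode` default and the `extractorOptions` assignment apply only when an `extractionSchema` is supplied, which is one reading of the flattened snippet.

```python
from typing import Any, Dict, Optional


def build_scrape_params(url: str, params: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build the JSON body for a scrape request (hypothetical helper)."""
    scrape_params: Dict[str, Any] = {"url": url}
    if params:
        # Work on a copy so the caller's extractorOptions dict is left untouched.
        extractor_options = dict(params.get("extractorOptions", {}))
        if "extractionSchema" in extractor_options:
            schema = extractor_options["extractionSchema"]
            # Pydantic-style models expose .schema(); plain dicts pass through as-is.
            if hasattr(schema, "schema"):
                extractor_options["extractionSchema"] = schema.schema()
            # Default the mode only when the caller did not set one.
            extractor_options.setdefault("mode", "llm-extraction")
            scrape_params["extractorOptions"] = extractor_options
        # Every other parameter is passed through at the top level.
        for key, value in params.items():
            if key != "extractorOptions":
                scrape_params[key] = value
    return scrape_params
```

With a plain dict schema, `build_scrape_params("https://example.com", {"extractorOptions": {"extractionSchema": {"type": "object"}}})` yields a body whose `extractorOptions` carries the schema plus `mode` set to `llm-extraction`. Under the assumption above, that is the kind of request that leads the API to return an `llm_extraction` field.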
…ngchain-ai#25231) **Description:** This minor PR aims to add `llm_extraction` to Firecrawl loader. This feature is supported on API and PythonSDK, but the langchain loader omits adding this to the response. **Twitter handle:** [scalable_pizza](https://x.com/scalablepizza) --------- Co-authored-by: Chester Curme <chester.curme@gmail.com>