Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs[minor]: Update FireCrawl integration docs #6326

Merged
merged 2 commits into from
Aug 1, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion deno.json
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,7 @@
"youtubei.js": "npm:/youtubei.js",
"youtube-transcript": "npm:/youtube-transcript",
"neo4j-driver": "npm:/neo4j-driver",
"axios": "npm:/axios"
"axios": "npm:/axios",
"@mendable/firecrawl-js": "npm:/@mendable/firecrawl-js"
}
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,221 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"sidebar_label: FireCrawl\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# FireCrawlLoader\n",
"\n",
"This notebook provides a quick overview for getting started with [FireCrawlLoader](/docs/integrations/document_loaders/). For detailed documentation of all FireCrawlLoader features and configurations head to the [API reference](https://api.js.langchain.com/classes/langchain_community_document_loaders_web_firecrawl.FireCrawlLoader.html).\n",
"\n",
"## Overview\n",
"### Integration details\n",
"\n",
"| Class | Package | Local | Serializable | [PY support](https://python.langchain.com/docs/integrations/document_loaders/firecrawl)|\n",
"| :--- | :--- | :---: | :---: | :---: |\n",
"| [FireCrawlLoader](https://api.js.langchain.com/classes/langchain_community_document_loaders_web_firecrawl.FireCrawlLoader.html) | [@langchain/community](https://api.js.langchain.com/modules/langchain_community_document_loaders_web_firecrawl.html) | 🟠 (see details below) | beta | ✅ | \n",
"### Loader features\n",
"| Source | Web Loader | Node Envs Only\n",
"| :---: | :---: | :---: | \n",
"| FireCrawlLoader | ✅ | ❌ | \n",
"\n",
"[FireCrawl](https://firecrawl.dev) crawls and convert any website into LLM-ready data. It crawls all accessible sub-pages and give you clean markdown and metadata for each. No sitemap required.\n",
"\n",
"FireCrawl handles complex tasks such as reverse proxies, caching, rate limits, and content blocked by JavaScript. Built by the [mendable.ai](https://mendable.ai) team.\n",
"\n",
"This guide shows how to scrap and crawl entire websites and load them using the `FireCrawlLoader` in LangChain.\n",
"\n",
"## Setup\n",
"\n",
"To access `FireCrawlLoader` document loader you'll need to install the `@langchain/community` integration, and the `@mendable/firecrawl-js` package. Then create a **[FireCrawl](https://firecrawl.dev)** account and get an API key.\n",
"\n",
"### Credentials\n",
"\n",
"Sign up and get your free [FireCrawl API key](https://firecrawl.dev) to start. FireCrawl offers 300 free credits to get you started, and it's [open-source](https://github.com/mendableai/firecrawl) in case you want to self-host.\n",
"\n",
"Once you've done this set the `FIRECRAWL_API_KEY` environment variable:\n",
"\n",
"```bash\n",
"export FIRECRAWL_API_KEY=\"your-api-key\"\n",
"```\n",
"\n",
"If you want to get automated tracing of your model calls you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting below:\n",
"\n",
"```bash\n",
"# export LANGCHAIN_TRACING_V2=\"true\"\n",
"# export LANGCHAIN_API_KEY=\"your-api-key\"\n",
"```\n",
"\n",
"### Installation\n",
"\n",
"The LangChain FireCrawlLoader integration lives in the `@langchain/community` package:\n",
"\n",
"```{=mdx}\n",
"import IntegrationInstallTooltip from \"@mdx_components/integration_install_tooltip.mdx\";\n",
"import Npm2Yarn from \"@theme/Npm2Yarn\";\n",
"\n",
"<IntegrationInstallTooltip></IntegrationInstallTooltip>\n",
"\n",
"<Npm2Yarn>\n",
" @langchain/community @mendable/firecrawl-js\n",
"</Npm2Yarn>\n",
"\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Instantiation\n",
"\n",
"Here's an example of how to use the `FireCrawlLoader` to load web search results:\n",
"\n",
"Firecrawl offers 2 modes: `scrape` and `crawl`. In `scrape` mode, Firecrawl will only scrape the page you provide. In `crawl` mode, Firecrawl will crawl the entire website.\n",
"\n",
"Now we can instantiate our model object and load documents:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import \"@mendable/firecrawl-js\";\n",
"import { FireCrawlLoader } from \"@langchain/community/document_loaders/web/firecrawl\"\n",
"\n",
"const loader = new FireCrawlLoader({\n",
" url: \"https://firecrawl.dev\", // The URL to scrape\n",
" apiKey: \"...\", // Optional, defaults to `FIRECRAWL_API_KEY` in your env.\n",
" mode: \"scrape\", // The mode to run the crawler in. Can be \"scrape\" for single urls or \"crawl\" for all accessible subpages\n",
" params: {\n",
" // optional parameters based on Firecrawl API docs\n",
" // For API documentation, visit https://docs.firecrawl.dev\n",
" },\n",
"})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Document {\n",
" pageContent: \u001b[32m\"Introducing [Smart Crawl!](https://www.firecrawl.dev/smart-crawl)\\n\"\u001b[39m +\n",
" \u001b[32m\" Join the waitlist to turn any web\"\u001b[39m... 18721 more characters,\n",
" metadata: {\n",
" title: \u001b[32m\"Home - Firecrawl\"\u001b[39m,\n",
" description: \u001b[32m\"Firecrawl crawls and converts any website into clean markdown.\"\u001b[39m,\n",
" keywords: \u001b[32m\"Firecrawl,Markdown,Data,Mendable,Langchain\"\u001b[39m,\n",
" robots: \u001b[32m\"follow, index\"\u001b[39m,\n",
" ogTitle: \u001b[32m\"Firecrawl\"\u001b[39m,\n",
" ogDescription: \u001b[32m\"Turn any website into LLM-ready data.\"\u001b[39m,\n",
" ogUrl: \u001b[32m\"https://www.firecrawl.dev/\"\u001b[39m,\n",
" ogImage: \u001b[32m\"https://www.firecrawl.dev/og.png?123\"\u001b[39m,\n",
" ogLocaleAlternate: [],\n",
" ogSiteName: \u001b[32m\"Firecrawl\"\u001b[39m,\n",
" sourceURL: \u001b[32m\"https://firecrawl.dev\"\u001b[39m,\n",
" pageStatusCode: \u001b[33m500\u001b[39m\n",
" },\n",
" id: \u001b[90mundefined\u001b[39m\n",
"}"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"const docs = await loader.load()\n",
"docs[0]"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\n",
" title: \"Home - Firecrawl\",\n",
" description: \"Firecrawl crawls and converts any website into clean markdown.\",\n",
" keywords: \"Firecrawl,Markdown,Data,Mendable,Langchain\",\n",
" robots: \"follow, index\",\n",
" ogTitle: \"Firecrawl\",\n",
" ogDescription: \"Turn any website into LLM-ready data.\",\n",
" ogUrl: \"https://www.firecrawl.dev/\",\n",
" ogImage: \"https://www.firecrawl.dev/og.png?123\",\n",
" ogLocaleAlternate: [],\n",
" ogSiteName: \"Firecrawl\",\n",
" sourceURL: \"https://firecrawl.dev\",\n",
" pageStatusCode: 500\n",
"}\n"
]
}
],
"source": [
"console.log(docs[0].metadata)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Additional Parameters\n",
"\n",
"For `params` you can pass any of the params according to the [Firecrawl documentation](https://docs.firecrawl.dev)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## API reference\n",
"\n",
"For detailed documentation of all FireCrawlLoader features and configurations head to the API reference: https://api.js.langchain.com/classes/langchain_community_document_loaders_web_firecrawl.FireCrawlLoader.html"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Deno",
"language": "typescript",
"name": "deno"
},
"language_info": {
"file_extension": ".ts",
"mimetype": "text/x.typescript",
"name": "typescript",
"nb_converter": "script",
"pygments_lexer": "typescript",
"version": "5.3.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

This file was deleted.

Loading