langchain[minor],docs[minor]: Add SitemapLoader
#4331
Merged
11 commits
592c18b init (bracesproul)
7789192 fixed and docs (bracesproul)
af9135c add page metadata to document metadata (bracesproul)
7beab34 chore: lint files (bracesproul)
f90b178 cr (bracesproul)
4232a85 chore: lint files (bracesproul)
4e5b349 Merge branch 'main' into brace/sitemap-loader (bracesproul)
94b7c78 cr (bracesproul)
3b77d0f chore: lint files (bracesproul)
8fc1ff6 chore: lint files (bracesproul)
5a7003f chore: lint files (bracesproul)
24 changes: 24 additions & 0 deletions
docs/core_docs/docs/integrations/document_loaders/web_loaders/sitemap.mdx
@@ -0,0 +1,24 @@
# Sitemap Loader

This notebook goes over how to use the [`SitemapLoader`](https://api.js.langchain.com/classes/langchain_document_loaders_web_sitemap.SitemapLoader.html) class to load sitemaps into `Document`s.

## Setup

First, we need to install the `langchain` package:

```bash npm2yarn
npm install --save langchain
```

The URL passed in must either contain the `.xml` path to the sitemap, or a default `/sitemap.xml` will be appended to the URL.

import CodeBlock from "@theme/CodeBlock";
import Example from "@examples/document_loaders/sitemap.ts";

<CodeBlock language="typescript">{Example}</CodeBlock>

Or, if you want to only load the sitemap and not the contents of each page from the sitemap, you can use the `parseSitemap` method:

import ParseSitemapExample from "@examples/document_loaders/parse_sitemap.ts";

<CodeBlock language="typescript">{ParseSitemapExample}</CodeBlock>
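The docs page in this diff says that a URL without an `.xml` path gets a default `/sitemap.xml` appended. A minimal sketch of that rule follows. The helper name `resolveSitemapUrl` is hypothetical, and treating "contains the `.xml` path" as "ends with `.xml`" is an assumption; the loader's actual implementation may differ.

```typescript
// Hypothetical helper, not part of langchain: it only illustrates the URL rule
// described in the docs. Assumption: "contains the .xml path" means the URL
// ends with ".xml".
function resolveSitemapUrl(
  webPath: string,
  defaultPath = "/sitemap.xml"
): string {
  // A URL that already points at an .xml file is used unchanged.
  if (webPath.endsWith(".xml")) {
    return webPath;
  }
  // Otherwise drop any trailing slashes and append the default sitemap path.
  return webPath.replace(/\/+$/, "") + defaultPath;
}

console.log(resolveSitemapUrl("https://www.langchain.com/"));
console.log(resolveSitemapUrl("https://example.com/custom-sitemap.xml"));
```

Under these assumptions the first call yields `https://www.langchain.com/sitemap.xml`, and the second URL is returned as given.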
@@ -0,0 +1,35 @@
import { SitemapLoader } from "langchain/document_loaders/web/sitemap";

const loader = new SitemapLoader("https://www.langchain.com/");

const sitemap = await loader.parseSitemap();
console.log(sitemap);
/**
[
  {
    loc: 'https://www.langchain.com/blog-detail/starting-a-career-in-design',
    changefreq: '',
    lastmod: '',
    priority: ''
  },
  {
    loc: 'https://www.langchain.com/blog-detail/building-a-navigation-component',
    changefreq: '',
    lastmod: '',
    priority: ''
  },
  {
    loc: 'https://www.langchain.com/blog-detail/guide-to-creating-a-website',
    changefreq: '',
    lastmod: '',
    priority: ''
  },
  {
    loc: 'https://www.langchain.com/page-1/terms-and-conditions',
    changefreq: '',
    lastmod: '',
    priority: ''
  },
  ...42 more items
]
*/
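The logged output above shows that each sitemap entry carries `loc`, `changefreq`, `lastmod`, and `priority`, with empty strings for tags the sitemap omits. The sketch below shows one way such entries could be extracted from sitemap XML. `parseSitemapXml` is a hypothetical, regex-based stand-in written for illustration; it is an assumption that it approximates what `parseSitemap` does, and the real loader may use a proper XML or HTML parser.

```typescript
// Entry shape taken from the logged output in the example above.
interface SitemapEntry {
  loc: string;
  changefreq: string;
  lastmod: string;
  priority: string;
}

// Hypothetical regex-based parser (an assumption, not the library's code).
function parseSitemapXml(xml: string): SitemapEntry[] {
  // Return the trimmed text of the first <tag> inside a block, or "" if absent.
  const field = (block: string, tag: string): string => {
    const match = block.match(new RegExp(`<${tag}>([\\s\\S]*?)</${tag}>`));
    return match ? match[1].trim() : "";
  };

  const entries: SitemapEntry[] = [];
  const urlPattern = /<url>([\s\S]*?)<\/url>/g;
  let m: RegExpExecArray | null;
  while ((m = urlPattern.exec(xml)) !== null) {
    const block = m[1];
    const loc = field(block, "loc");
    // Skip <url> elements that carry no location.
    if (!loc) continue;
    entries.push({
      loc,
      changefreq: field(block, "changefreq"),
      lastmod: field(block, "lastmod"),
      priority: field(block, "priority"),
    });
  }
  return entries;
}

// Made-up sample input for illustration.
const sampleXml = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc><lastmod>2024-01-01</lastmod></url>
  <url><loc>https://example.com/b</loc></url>
</urlset>`;

console.log(parseSitemapXml(sampleXml));
```

Missing optional tags come back as empty strings, which matches the empty `changefreq`, `lastmod`, and `priority` values in the output above.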
@@ -0,0 +1,34 @@
import { SitemapLoader } from "langchain/document_loaders/web/sitemap";

const loader = new SitemapLoader("https://www.langchain.com/");

const docs = await loader.load();
console.log(docs.length);
/**
26
*/
console.log(docs[0]);
/**
Document {
  pageContent: '\n' +
    ' \n' +
    '\n' +
    ' \n' +
    ' \n' +
    ' Blog ArticleApr 8, 2022As the internet continues to develop and grow exponentially, jobs related to the industry do too, particularly those that relate to web design and development. The prediction is that by 2029, the job outlook for these two fields will grow by 8%—significantly faster than average. Whether you’re seeking salaried employment or aiming to work in a freelance capacity, a career in web design can offer a variety of employment arrangements, competitive salaries, and opportunities to utilize both technical and creative skill sets.What does a career in web design involve?A career in website design can involve the design, creation, and coding of a range of website types. Other tasks will typically include liaising with clients and discussing website specifications, incorporating feedback, working on graphic design and image editing, and enabling multimedia features such as audio and video. Requiring a range of creative and technical skills, web designers may be involved in work across a range of industries, including software companies, IT consultancies, web design companies, corporate organizations, and more. In contrast with web developers, web designers tend to play a more creative role, crafting the overall vision and design of a site, and determining how to best incorporate the necessary functionality. However, there can be significant overlap between the roles.Full-stack, back-end, and front-end web developmentThe U.S. Bureau of Labor Statistics (BLS) Occupational Outlook Handbook tends to group web developers and digital designers into one category. However, they define them separately, stating that web developers create and maintain websites and are responsible for the technical aspects including performance and capacity. Web or digital designers, on the other hand, are responsible for the look and functionality of websites and interfaces. They develop, create, and test the layout, functions, and navigation for usability. 
Web developers can focus on the back-end, front-end, or full-stack development, and typically utilize a range of programming languages, libraries, and frameworks to do so. Web designers may work more closely with front-end engineers to establish the user-end functionality and appearance of a site.Are web designers in demand in 2022?In our ever-increasingly digital environment, there is a constant need for websites—and therefore for web designers and developers. With 17.4 billion websites in existence as of January 2020, the demand for web developers is only expected to rise.Web designers with significant coding experience are typically in higher demand, and can usually expect a higher salary. Like all jobs, there are likely to be a range of opportunities, some of which are better paid than others. But certain skill sets are basic to web design, most of which are key to how to become a web designer in 2022.const removeHiddenBreakpointLayers = function ie(e){function t(){for(let{hash:r,mediaQuery:i}of e){if(!i)continue;if(window.matchMedia(i).matches)return r}return e[0]?.hash}let o=t();if(o)for(let r of document.querySelectorAll(".hidden-"+o))r.parentNode?.removeChild(r);for(let r of document.querySelectorAll(".ssr-variant")){for(;r.firstChild;)r.parentNode?.insertBefore(r.firstChild,r);r.parentNode?.removeChild(r)}for(let r of document.querySelectorAll("[data-framer-original-sizes]")){let i=r.getAttribute("data-framer-original-sizes");i===""?r.removeAttribute("sizes"):r.setAttribute("sizes",i),r.removeAttribute("data-framer-original-sizes")}};removeHiddenBreakpointLayers([{"hash":"1ksv3g6"}])\n' +
    '\n' +
    ' \n' +
    ' \n' +
    ' \n' +
    ' \n' +
    ' \n' +
    '\n' +
    '\n',
  metadata: {
    changefreq: '',
    lastmod: '',
    priority: '',
    source: 'https://www.langchain.com/blog-detail/starting-a-career-in-design'
  }
}
*/
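In the logged `Document` above, the sitemap fields appear in `metadata` next to a `source` key holding the page URL, which lines up with the commit "add page metadata to document metadata". A minimal sketch of that mapping follows. The `PageDocument` type and the `toDocument` helper are hypothetical names used for illustration, not the library's `Document` class or the loader's internal code.

```typescript
// Entry shape as logged by the parseSitemap example.
interface SitemapEntry {
  loc: string;
  changefreq: string;
  lastmod: string;
  priority: string;
}

// Simplified stand-in for a Document (assumption: only these two fields matter here).
interface PageDocument {
  pageContent: string;
  metadata: Record<string, string>;
}

// Hypothetical helper: carry the sitemap fields into metadata and expose the
// page URL under `source`, mirroring the metadata keys in the output above.
function toDocument(entry: SitemapEntry, pageText: string): PageDocument {
  const { loc, ...rest } = entry;
  return {
    pageContent: pageText,
    metadata: { ...rest, source: loc },
  };
}

const exampleDoc = toDocument(
  {
    loc: "https://example.com/a",
    changefreq: "",
    lastmod: "2024-01-01",
    priority: "",
  },
  "Example page text"
);
console.log(exampleDoc.metadata);
```

Keeping `loc` only as `source` avoids storing the URL twice; whether the loader makes the same choice is not shown in this diff beyond the logged keys.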
@@ -0,0 +1,30 @@
import { test } from "@jest/globals";
import { SitemapLoader } from "../web/sitemap.js";

test("SitemapLoader", async () => {
  const loader = new SitemapLoader("https://www.langchain.com/");

  const docs = await loader.load();
  expect(docs.length).toBeGreaterThan(0);
});

test("checkUrlPatterns can properly identify unwanted links", async () => {
  const links = [
    "https://js.langchain.com/docs/use_cases/agent_simulations/",
    "https://js.langchain.com/docs/use_cases/agent_simulations/generative_agents",
    "https://js.langchain.com/docs/integrations/platforms/google",
    "https://js.langchain.com/docs/integrations/vectorstores/analyticdb",
    "https://js.langchain.com/docs/expression_language/interface",
    "https://js.langchain.com/docs/modules/data_connection/",
  ];

  const linkRegex =
    /^(https:\/\/js\.langchain\.com\/docs\/use_cases)|.*interface$/;

  const loader = new SitemapLoader("https://www.langchain.com/", {
    filterUrls: [linkRegex.source],
  });

  const matches = links.map((link) => loader._checkUrlPatterns(link));
  expect(matches).toEqual([true, true, false, false, true, false]);
});
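The second test passes a regex source string through `filterUrls` and expects `_checkUrlPatterns` to return `true` for links that match it. The body of `_checkUrlPatterns` is not part of the diff shown here, so the sketch below is only a guess at equivalent behaviour: `matchesAnyPattern` is a hypothetical function that returns `true` when a URL matches at least one pattern string.

```typescript
// Hypothetical stand-in for the private _checkUrlPatterns method (assumption:
// a URL is flagged when any filter pattern, compiled as a RegExp, matches it).
function matchesAnyPattern(url: string, filterUrls: string[]): boolean {
  return filterUrls.some((pattern) => new RegExp(pattern).test(url));
}

// Same links and regex as the test above.
const links = [
  "https://js.langchain.com/docs/use_cases/agent_simulations/",
  "https://js.langchain.com/docs/use_cases/agent_simulations/generative_agents",
  "https://js.langchain.com/docs/integrations/platforms/google",
  "https://js.langchain.com/docs/integrations/vectorstores/analyticdb",
  "https://js.langchain.com/docs/expression_language/interface",
  "https://js.langchain.com/docs/modules/data_connection/",
];

const linkRegex =
  /^(https:\/\/js\.langchain\.com\/docs\/use_cases)|.*interface$/;

const flagged = links.map((link) => matchesAnyPattern(link, [linkRegex.source]));
console.log(flagged);
// → [ true, true, false, false, true, false ]
```

Note that the `^` anchor applies only to the first alternative of the regex and the `$` only to the second, so the pattern matches any URL that starts with the `use_cases` prefix or ends with `interface`.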
Hey there! I noticed that this PR introduces a new dependency in the package.json file for "./document_loaders/web/sitemap". This change is flagged for maintainers to review, as it may impact peer/dev/hard dependencies. Thank you for your contribution!