community[minor]: add proxy support to RecursiveUrlLoader #27364

Merged
merged 5 commits into from
Oct 16, 2024

Conversation

Contributor

@ccq1 ccq1 commented Oct 15, 2024

Description
This PR adds a proxies parameter to the RecursiveUrlLoader class so that users can specify proxy servers for the loader's requests. This enables crawling through proxy servers in network configurations that require them.
The key changes are:
1. Added an optional proxies parameter to the constructor (__init__).
2. Updated the documentation to explain the proxies parameter, with an example.
3. Modified the _get_child_links_recursive method to pass the proxies parameter to requests.get.
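
The three changes above follow a common pattern: accept an optional mapping in the constructor, store it, and forward it unchanged on each request. Below is a minimal sketch of that pattern, not the actual RecursiveUrlLoader code. The class name `TinyLoader` and the injectable `fetch` argument are hypothetical, added only so the example runs without network access; the default path assumes `requests.get` is called with its standard `timeout` and `proxies` keyword arguments.

```python
from typing import Callable, Dict, List, Optional, Tuple


class TinyLoader:
    """Hypothetical stand-in for a URL loader; not the real RecursiveUrlLoader."""

    def __init__(
        self,
        url: str,
        proxies: Optional[Dict[str, str]] = None,
        fetch: Optional[Callable[..., str]] = None,
    ) -> None:
        self.url = url
        # Optional mapping such as {"http": "...", "https": "..."}; None means no proxy.
        self.proxies = proxies
        # `fetch` is an assumption of this sketch so that it can run offline.
        self._fetch = fetch or self._fetch_with_requests

    @staticmethod
    def _fetch_with_requests(url: str, **kwargs) -> str:
        import requests  # imported lazily; only needed for real network calls

        return requests.get(url, **kwargs).text

    def load(self) -> str:
        # The stored proxies are forwarded unchanged on every request.
        return self._fetch(self.url, timeout=10, proxies=self.proxies)


# Offline demonstration with a fake fetch function that records its arguments.
calls: List[Tuple[str, dict]] = []


def fake_fetch(url: str, **kwargs) -> str:
    calls.append((url, kwargs))
    return "<html>ok</html>"


loader = TinyLoader(
    "https://example.com",
    proxies={"https": "http://localhost:1080"},
    fetch=fake_fetch,
)
page = loader.load()
```

Defaulting `proxies` to None keeps existing callers unaffected, since passing `proxies=None` to `requests.get` behaves the same as omitting it.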

Sample Usage

from bs4 import BeautifulSoup as Soup
from langchain_community.document_loaders.recursive_url_loader import RecursiveUrlLoader

proxies = {
    "http": "http://localhost:1080",
    "https": "http://localhost:1080",
}
url = "https://python.langchain.com/docs/concepts/#langchain-expression-language-lcel"
loader = RecursiveUrlLoader(
    url=url,
    max_depth=1,
    extractor=lambda x: Soup(x, "html.parser").text,
    proxies=proxies,
)
docs = loader.load()
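
To illustrate how a proxies dictionary like the one above is interpreted, here is a simplified sketch that looks up the proxy by the URL's scheme. The helper `pick_proxy` is hypothetical and uses only the standard library. It is not the lookup that requests performs, which, as far as I know, also accepts more specific keys such as a scheme plus host.

```python
from typing import Dict, Optional
from urllib.parse import urlparse


def pick_proxy(url: str, proxies: Optional[Dict[str, str]]) -> Optional[str]:
    """Simplified lookup: return the proxy registered for the URL's scheme, if any."""
    if not proxies:
        return None
    scheme = urlparse(url).scheme.lower()
    return proxies.get(scheme)


proxies = {
    "http": "http://localhost:1080",
    "https": "http://localhost:1080",
}

# An https URL is matched by the "https" key.
https_proxy = pick_proxy("https://python.langchain.com/docs/concepts/", proxies)

# A scheme without an entry gets no proxy in this sketch.
ftp_proxy = pick_proxy("ftp://example.com/file.txt", proxies)
```

Note that the "https" key names the scheme of the target URL, not of the proxy itself, which is why an https entry can point at an http:// proxy address as in the sample.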

…introducing the proxies parameter to allow the use of specified proxy servers in requests.

@dosubot dosubot bot added size:S This PR changes 10-29 lines, ignoring generated files. community Related to langchain-community Ɑ: doc loader Related to document loader module (not documentation) labels Oct 15, 2024
@@ -313,6 +314,16 @@ def simple_metadata_extractor(
encoding, unless the `encoding` argument has already been explicitly set.
encoding: The encoding of the response. If manually set, the encoding will be
set to given value, regardless of the `autoset_encoding` argument.
proxies: A dictionary mapping protocol names to the proxy URLs to be used for requests.
Collaborator

Looks good, any chance you'd be willing to add a sentence to the security note so that folks know they can specify proxies?

@eyurtsev eyurtsev self-assigned this Oct 15, 2024
@eyurtsev eyurtsev changed the title community: add proxy support to RecursiveUrlLoader community[minor]: add proxy support to RecursiveUrlLoader Oct 15, 2024
Contributor Author

ccq1 commented Oct 15, 2024

Hi @eyurtsev, I just fixed a lint check error in my commit and added a sentence about proxies to the security note. Please check this commit.

@ccq1 ccq1 requested a review from eyurtsev October 15, 2024 16:55
@dosubot dosubot bot added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label Oct 15, 2024
Collaborator

@ccq1 you can auto format like this:

cd langchain/libs/community
make format

you'll need to set up the environment with ruff (poetry install in that directory)

@ccq1 ccq1 requested a review from eyurtsev October 16, 2024 02:53
Contributor Author

ccq1 commented Oct 16, 2024

@eyurtsev thanks, I autoformatted this commit using ruff. Please check. (I'll know how to do this for the next PR.)

@eyurtsev eyurtsev enabled auto-merge (squash) October 16, 2024 16:28
@eyurtsev eyurtsev enabled auto-merge (squash) October 16, 2024 16:29
@eyurtsev eyurtsev merged commit 31e7664 into langchain-ai:master Oct 16, 2024
20 checks passed