Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rate Limiting, Max Concurrency, Infinite Crawl & Additional Configurations #102

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

cpdata
Copy link

@cpdata cpdata commented Dec 4, 2023

Initial Improvements

Main Additions

  • maxPagesToCrawl if = 0 then will crawl all matching urls and display during progress as 1/∞.
  • maxConcurrency Sets the number of concurrent crawl requests. If left unset then the undefined maxConcurrency will do maximum parallel connections like the originals default. Now defaults to 1 to avoid getting IP banned.
  • waitPerPageCrawlTimeoutRange Defaults to a range of 1 second to 1 second but can be set to create a random delay between any 2 numbers in milliseconds to avoid rate limit rejection when crawling.
  • headless is true by default but can now be configured in the config.ts file for situations that require it.
  • Improved README.md & config.ts documentation. ( More to be done.)

Full Summery

I would like to contribute to this project on a regular basis. I have a lot of Web-scraping, A.I./LLMs, CI/CD, Automation, experience and would like to discuss with the main collaborators and see were I can be of the most use.

README.md Outdated

```ts
type Config = {
/** URL to start the crawl, if sitemap is provided then it will be used instead and download all pages in the sitemap */

/** Required - URL to start the crawl, if sitemap is provided then it will be used instead and download all pages in the sitemap */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of using tags instead, for example:

Suggested change
/** Required - URL to start the crawl, if sitemap is provided then it will be used instead and download all pages in the sitemap */
/**
* URL to start the crawl, if sitemap is provided then it will be used instead and download all pages in the sitemap
* @required
*/

This way we could eventually bring jsdoc/typedoc into the mix to generate meaningful documentation from them. Just a suggestion, as even jsdoc/typedoc can infer whether a property is required or not based on its typings 🤗

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree generating meaningful documentation is the way to go. I've updated the code accordingly. I will add more to the documentation as we progress.

config.ts Outdated
Comment on lines 9 to 20
const url_prefix = "https://"
const domain = "www.builder.io";
const url_suffix = "/c/docs";
const base_url = url_prefix + domain;
const match_url_prefix = base_url + url_suffix;
const match_url = match_url_prefix + "/**";

// Now date stamp for output file name
const now = new Date();
const date = now.toISOString().split('T')[0];
const time = now.toTimeString().split(' ')[0];
const outputs_dir = __dirname.split('/').slice(0, -1).join('/') + '/outputs';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a nitpick here to guarantee code consistency, let's use camelCase for these variables:

Suggested change
const url_prefix = "https://"
const domain = "www.builder.io";
const url_suffix = "/c/docs";
const base_url = url_prefix + domain;
const match_url_prefix = base_url + url_suffix;
const match_url = match_url_prefix + "/**";
// Now date stamp for output file name
const now = new Date();
const date = now.toISOString().split('T')[0];
const time = now.toTimeString().split(' ')[0];
const outputs_dir = __dirname.split('/').slice(0, -1).join('/') + '/outputs';
const urlPrefix = "https://"
const domain = "www.builder.io";
const urlSuffix = "/c/docs";
const baseUrl = urlPrefix + domain;
const matchUrlPrefix = baseUrl + urlSuffix;
const matchUrl = matchUrlPrefix + "/**";
// Now date stamp for output file name
const now = new Date();
const date = now.toISOString().split('T')[0];
const time = now.toTimeString().split(' ')[0];
const outputsDir = __dirname.split('/').slice(0, -1).join('/') + '/outputs';

Requires changes below 👇

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated to camelCase as you suggested and will continue to use that convention.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind nitpicks 👍establishing a shared nomenclature for project is ideal in my opinion. I am always open to feedback and recommendations.

src/config.ts Outdated Show resolved Hide resolved
@marcelovicentegc marcelovicentegc added documentation Improvements or additions to documentation enhancement New feature or request labels Dec 5, 2023
… and config.ts and formatting for jsdoc/typedoc as recommened by @marcelovicentegc in pull request BuilderIO#102, added .prettierignore file
@cpdata
Copy link
Author

cpdata commented Dec 6, 2023

I updated with prettier formatting for the files that failed README.md, src/config.ts, src/core.ts, and config.ts.
I also added the formatting for jsdoc/typedoc as recommened by @marcelovicentegc in response to my orginal pull request #102. Additionally, I added .prettierignore file.

@steve8708
Copy link
Contributor

@marcelovicentegc this look good to you to merge?

Comment on lines +13 to +23
const matchUrl_prefix = baseUrl + urlSuffix;
const matchUrl = matchUrl_prefix + "/**";

// Now date stamp for output file name
const now = new Date();
const date = now.toISOString().split("T")[0];
const time = now.toTimeString().split(" ")[0];
const outputs_dir = __dirname.split("/").slice(0, -1).join("/") + "/outputs";

const outputFileName =
outputs_dir + "/" + domain + "-" + date + "-" + time + ".json";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @cpdata! Just a couple more nitpicks (replacing snake_case for camelCase) and a rebase and we are good to go!

Suggested change
const matchUrl_prefix = baseUrl + urlSuffix;
const matchUrl = matchUrl_prefix + "/**";
// Now date stamp for output file name
const now = new Date();
const date = now.toISOString().split("T")[0];
const time = now.toTimeString().split(" ")[0];
const outputs_dir = __dirname.split("/").slice(0, -1).join("/") + "/outputs";
const outputFileName =
outputs_dir + "/" + domain + "-" + date + "-" + time + ".json";
const matchUrlPrefix = baseUrl + urlSuffix;
const matchUrl = matchUrlPrefix + "/**";
// Now date stamp for output file name
const now = new Date();
const date = now.toISOString().split("T")[0];
const time = now.toTimeString().split(" ")[0];
const outputsDir = __dirname.split("/").slice(0, -1).join("/") + "/outputs";
const outputFileName =
outputsDir + "/" + domain + "-" + date + "-" + time + ".json";

@marcelovicentegc
Copy link
Contributor

@marcelovicentegc this look good to you to merge?

Hey @steve8708! Happy new years! One rebase and a few nitpicks ☝️ and it occurs to me that we are good to go 🤗

@Ademrobert
Copy link

Please merge this branch ASAP!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants