cckellogg/docsearch-configs

DocSearch configurations

This is the repository hosting the public DocSearch configurations.

DocSearch is composed of 3 different projects: the front-end library, the scraper, and the configurations hosted in this repository.

If you want to run your own DocSearch instance with these configuration files, please familiarize yourself with the scraper setup guidelines first.

Introduction

The DocSearch scraper will use a configuration file specifying:

  • the Algolia index name that will store the records resulting from the crawling
  • the URLs it needs to crawl
  • the URLs it shouldn't crawl
  • the (hierarchical) CSS selectors to use to extract the relevant content from your webpages
  • the CSS selectors to skip
  • an optional sitemap URL that will be crawled and then scraped
  • additional options you might provide to fine-tune the scraping
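
To make the list above concrete, here is a minimal sketch of what such a configuration could look like, built as a Python dictionary and serialized to JSON. The key names (`index_name`, `start_urls`, `stop_urls`, `selectors`, `selectors_exclude`, `sitemap_urls`, `min_indexed_level`) follow the scraper's conventions as best I understand them and should be checked against the scraper documentation; the site, index name, and selector values are entirely hypothetical.

```python
import json

# Hypothetical configuration for an imaginary documentation site.
# All values below are illustrative assumptions, not a real config.
config = {
    # Algolia index that will store the records
    "index_name": "example_docs",
    # URLs to crawl
    "start_urls": ["https://docs.example.com/"],
    # URLs that should not be crawled
    "stop_urls": ["https://docs.example.com/changelog"],
    # Hierarchical CSS selectors used to extract content
    "selectors": {
        "lvl0": "article h1",
        "lvl1": "article h2",
        "lvl2": "article h3",
        "text": "article p, article li",
    },
    # CSS selectors to skip
    "selectors_exclude": [".sidebar", "footer"],
    # Optional sitemap to crawl and scrape
    "sitemap_urls": ["https://docs.example.com/sitemap.xml"],
    # One example of an additional fine-tuning option
    "min_indexed_level": 1,
}

# Configuration files are stored as JSON, so serialize the dictionary.
config_json = json.dumps(config, indent=2)
print(config_json)
```

Each top-level key maps to one bullet in the list above, which makes it easy to see which parts of a configuration are required for a given site and which are optional tuning.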

How it works

Once you run the DocSearch scraper on a specific configuration, it will:

  • crawl all the URLs you specified (from the start_urls or the sitemap)
  • follow all the hyperlinks found on each page, and continue crawling from there
  • stop following links as soon as it reaches a URL that is neither specified in your configuration nor under one of the start URLs
  • extract the content of every single crawled page following the logic you defined using the CSS selectors
  • push the resulting records to the Algolia index you configured
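
The steps above can be sketched as a small breadth-first crawl. This is a toy model under stated assumptions, not the scraper's actual implementation: the website is a hypothetical in-memory dictionary, URL matching is simplified to prefix checks, CSS-selector extraction is replaced by reading a prepared `text` field, and the final push to Algolia is left out (the function just returns the records). The names `PAGES`, `is_allowed`, and `crawl` are invented for this illustration.

```python
from collections import deque

# Hypothetical in-memory "website": URL -> outgoing links and page text.
PAGES = {
    "https://docs.example.com/": {
        "links": [
            "https://docs.example.com/guide",
            "https://docs.example.com/changelog",
            "https://other.example.org/",
        ],
        "text": "Welcome",
    },
    "https://docs.example.com/guide": {
        "links": ["https://docs.example.com/"],
        "text": "Guide",
    },
    "https://docs.example.com/changelog": {"links": [], "text": "Changelog"},
    "https://other.example.org/": {"links": [], "text": "Elsewhere"},
}

# Hypothetical configuration, reduced to the two keys this sketch needs.
CONFIG = {
    "start_urls": ["https://docs.example.com/"],
    "stop_urls": ["https://docs.example.com/changelog"],
}


def is_allowed(url, start_urls, stop_urls):
    """Simplified rule: the URL must sit under a start URL and under no stop URL."""
    under_start = any(url.startswith(prefix) for prefix in start_urls)
    stopped = any(url.startswith(prefix) for prefix in stop_urls)
    return under_start and not stopped


def crawl(config, pages):
    """Breadth-first crawl from the start URLs; returns one record per visited page."""
    queue = deque(config["start_urls"])
    seen = set(queue)
    records = []
    while queue:
        url = queue.popleft()
        page = pages.get(url)
        if page is None:
            continue  # unknown page in this toy site
        # Stand-in for CSS-selector extraction: take the prepared text as-is.
        records.append({"url": url, "text": page["text"]})
        for link in page["links"]:
            if link not in seen and is_allowed(
                link, config["start_urls"], config["stop_urls"]
            ):
                seen.add(link)
                queue.append(link)
    return records


records = crawl(CONFIG, PAGES)
for record in records:
    print(record["url"], "->", record["text"])
```

In this toy site the changelog page is skipped because it matches a stop URL, and the external page is skipped because it is not under any start URL, so only the home page and the guide produce records.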

Update: for more details on how to fine-tune your configuration, see the dedicated DocSearch documentation website.
