Update README.md #80
Open: urvishp80 wants to merge 1 commit into `master` from `scraper-guide`

# Welcome to the Scraper

This project is designed to automate the process of gathering information from a variety of key Bitcoin-related sources. It leverages GitHub Actions to schedule nightly cron jobs, ensuring that the most up-to-date content is captured from each source according to a defined frequency. The scraped data are then stored in an Elasticsearch index.

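To make that flow concrete, here is a minimal, hypothetical sketch of the shared pattern (fetch documents from a source, then index them into Elasticsearch), assuming the `requests` and `elasticsearch` Python packages and the environment variables described in the setup sections below; the URL and document shape are placeholders, not one of the actual scrapers.

```python
# Hypothetical sketch of the shared scrape-and-index pattern, not one of
# the actual scrapers. Assumes the `requests` and `elasticsearch` packages
# and the env vars described in the setup sections below.
import os

import requests
from elasticsearch import Elasticsearch

es = Elasticsearch(
    cloud_id=os.environ["CLOUD_ID"],
    basic_auth=(os.environ["USERNAME"], os.environ["USER_PASSWORD"]),
)

# Placeholder source URL; each scraper has its own fetching and parsing logic.
posts = requests.get("https://example.com/posts.json").json()
for doc in posts:
    # Using a stable id keeps nightly re-runs idempotent: a re-scraped
    # document overwrites its previous copy instead of duplicating it.
    es.index(index=os.environ["INDEX"], id=doc["id"], document=doc)
```
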
Below is a detailed breakdown of the sources scraped and the schedule for each:

**Daily at 00:00 UTC**

- [Lightning Mailing List](https://lists.linuxfoundation.org/pipermail/lightning-dev/) ([cron](.github/workflows/mailing-list-lightning.yml), [source](mailing-list))
- [New Bitcoin Mailing List](https://gnusha.org/pi/bitcoindev/) ([cron](.github/workflows/mailing-list-bitcoin-new.yml), [source](mailing-list/main.py))
- [Bitcoin Mailing List](https://lists.linuxfoundation.org/pipermail/bitcoin-dev/) ([cron](.github/workflows/mailing-list-bitcoin.yml), [source](mailing-list))
- [Delving Bitcoin](https://delvingbitcoin.org/) ([cron](.github/workflows/delving-bitcoin.yml), [source](delvingbitcoin_2_elasticsearch))

**Weekly**

- [bitcoin.stackexchange](https://bitcoin.stackexchange.com/) ([cron](.github/workflows/stackexchange.yml), [source](bitcoin.stackexchange.com))
- Bitcoin Talk Forum ([cron](.github/workflows/bitcointalk.yml), [source](bitcointalk))
  - only the [Development & Technical Discussion Board](https://bitcointalk.org/index.php?board=6.0)
  - only for specific authors
- [Bitcoin Transcripts](https://btctranscripts.com/) ([cron](.github/workflows/bitcointranscripts.yml), [source](bitcointranscripts))
- [Bitcoin Optech](https://bitcoinops.org/) ([cron](.github/workflows/bitcoinops.yml), [source](bitcoinops))

Additionally, for on-demand scraping tasks, we use Scrapybot; see the [Scrapybot section](#scrapybot) below for details.

## Setup

You need an env file to indicate where you are pushing the data.

1. Copy the sample env file: `cp .env.sample .env`
2. Install the Node.js dependencies: `cd common && yarn install && cd ../mailing-list && yarn install && cd ..`
3. To scrape a mailing list, run `node mailing-list/main.js` with additional env vars such as `URL='https://lists.linuxfoundation.org/pipermail/bitcoin-dev/'` and `NAME='bitcoin'`. Alternatively, run one of the Python scrapers, e.g. `cd bitcoin.stackexchange.com && pip install -r requirements.txt && cd .. && python3 bitcoin.stackexchange.com/main.py`

Call the scrapers from the root directory, because they rely on the `common` dir.

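As a rough illustration of this env-driven setup, a hypothetical Python equivalent of the variable handling looks like the sketch below; the real mailing-list scraper is `mailing-list/main.js`, which reads the same variables in Node.js, and the defaults shown here are illustrative.

```python
# Hypothetical illustration of env-driven configuration; the real scraper
# is mailing-list/main.js, which reads the same URL/NAME variables in Node.js.
import os

url = os.environ.get("URL", "https://lists.linuxfoundation.org/pipermail/bitcoin-dev/")
name = os.environ.get("NAME", "bitcoin")
print(f"Scraping list {name!r} from {url}")
```
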
## Scrapybot

We have implemented a variety of crawlers (spiders), each designed for a specific website of interest. You can find all the spiders in the [`scrapybot/scrapybot/spiders`](scrapybot/scrapybot/spiders) directory.

This section explains how to run the scrapers in the `scrapybot` folder.

To run a crawler with Scrapybot, for example `rusty`, which scrapes the site `https://rusty.ozlabs.org`, switch to the root directory (the one containing this README) and run these commands from your terminal:

- `pip install -r requirements.txt && cd scrapybot`
- `scrapy crawl rusty -O rusty.json`

The commands above install the Scrapy dependencies, then run the `rusty` spider (one of the crawlers) and store the collected documents in a `rusty.json` file in the `scrapybot` project directory.

The same procedure applies to any of the crawlers in the `scrapybot/spiders` directory. There is also a script in the `scrapybot` directory, `scraper.sh`, which can run all the spiders at once.

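For orientation, a spider in `scrapybot/scrapybot/spiders` follows the standard Scrapy shape; the sketch below is illustrative, with assumed selectors and field names rather than the project's actual schema.

```python
# Illustrative sketch of a Scrapybot spider; the selectors and field names
# are assumptions for demonstration, not the project's actual implementation.
import scrapy


class RustySpider(scrapy.Spider):
    name = "rusty"  # the name used on the command line: `scrapy crawl rusty`
    start_urls = ["https://rusty.ozlabs.org"]

    def parse(self, response):
        # Yield one item per post; Scrapy writes these to rusty.json
        # when run with `-O rusty.json`.
        for post in response.css("article"):
            yield {
                "title": post.css("h2 a::text").get(),
                "url": post.css("h2 a::attr(href)").get(),
            }
```
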
## Prerequisites

Before you begin, ensure you have the following installed:

- [Python 3.8+](https://www.python.org/downloads/)
- [Node.js 14+](https://nodejs.org/)
- [pip](https://pip.pypa.io/en/stable/)
- [yarn](https://classic.yarnpkg.com/en/docs/install/)
- [Elasticsearch](https://www.elastic.co/downloads/elasticsearch)
- [virtualenv](https://virtualenv.pypa.io/en/latest/)

## Setup Instructions

1. **Clone this repository**
   ```bash
   git clone https://github.com/bitcoinsearch/scraper.git
   cd scraper
   ```

2. **Create a virtual environment**
   ```bash
   python -m venv venv
   ```
3. **Activate the virtual environment**
   - **Windows:**
     ```bash
     venv\Scripts\activate
     ```
   - **macOS/Linux:**
     ```bash
     source venv/bin/activate
     ```
4. **Install the required Python packages**
   ```bash
   pip install -r requirements.txt
   ```
5. **Create a `.env` file from the sample file**
   ```bash
   cp .env.sample .env
   ```
   Open the `.env` file and provide values for the following variables (a sketch after these setup steps shows how they are consumed):
   - `DATA_DIR`: Path to store temporary files.
   - `DAYS_TO_SUBTRACT`: Number of days to subtract from today's date, which determines the date range for scraping and downloading mailing-list documents.
   - `URL`: For the mailing-list scraper, use one of these URLs:
     1. https://lists.linuxfoundation.org/pipermail/lightning-dev/
     2. https://lists.linuxfoundation.org/pipermail/bitcoin-dev/
   - `CLOUD_ID`: Your Elasticsearch Cloud ID, required for connecting to your Elasticsearch cluster.
   - `USERNAME`: The username for your Elasticsearch instance.
   - `USER_PASSWORD`: The API key or password associated with `USERNAME` in Elasticsearch.
   - `INDEX`: The name of the Elasticsearch index where documents will be stored.
6. **Install the Node.js packages/dependencies**
   Run the command below to install the required packages for the Node.js scrapers.
   ```bash
   cd common && yarn install && cd ../mailing-list && yarn install && cd ..
   ```

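As referenced in step 5, here is a minimal, hypothetical sketch of how a scraper might consume those `.env` values, assuming the `python-dotenv` package; the actual loading code in each scraper may differ.

```python
# Hypothetical sketch: loading the .env values from step 5.
# Assumes python-dotenv; each scraper's real loading code may differ.
import os
from datetime import datetime, timedelta

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory (the repo root)

data_dir = os.environ["DATA_DIR"]
# DAYS_TO_SUBTRACT sets the scraping window: documents newer than
# today minus that many days are fetched.
since = datetime.now() - timedelta(days=int(os.environ["DAYS_TO_SUBTRACT"]))
print(f"Temp files in {data_dir}; scraping documents since {since:%Y-%m-%d}")
```
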
## Running Scrapers

To run a specific scraper, use the respective command listed below (run from the repository root; the forward-slash paths work on Windows as well):

1. [bitcoin.stackexchange.com](bitcoin.stackexchange.com)
   ```bash
   python bitcoin.stackexchange.com/main.py
   ```
2. [bitcoinbook](bitcoinbook)
   ```bash
   python bitcoinbook/main.py
   ```
3. [bitcoinops](bitcoinops)
   ```bash
   python bitcoinops/main.py
   ```
4. [bitcointalk](bitcointalk)
   ```bash
   python bitcointalk/main.py
   ```
5. [bitcointranscripts](bitcointranscripts)
   ```bash
   python bitcointranscripts/main.py
   ```
6. [delvingbitcoin_2_elasticsearch](delvingbitcoin_2_elasticsearch)
   ```bash
   python delvingbitcoin_2_elasticsearch/delvingbitcoin_2_elasticsearch.py
   ```
7. [mailing-list](mailing-list)
   To run the mailing-list scrapers, use the following commands based on the type of documents you want to scrape:
   - **For Linux Foundation documents**
     Ensure that the `URL` environment variable is set to the appropriate mailing-list URL (e.g., `https://lists.linuxfoundation.org/pipermail/lightning-dev/` or `https://lists.linuxfoundation.org/pipermail/bitcoin-dev/`), then run:
     ```bash
     node mailing-list/main.js
     ```
   > **Comment on lines +115 to +122:** this is a deprecated scraper
   - **For the New Bitcoin Dev list**
     Use the following command to run the Bitcoin Dev scraper:
     ```bash
     python mailing-list/main.py
     ```

### Sending the Scrapybot output to Elasticsearch

- Create an `example.ini` file inside the `scrapybot` directory with the following contents:
  ```ini
  [ELASTIC]
  cloud_id = your_cloud_id
  user = your_elasticsearch_username
  password = your_elasticsearch_password
  ```
- Inside the `pipelines.py` file in the `scrapybot` directory, read the above file to load your Elasticsearch credentials:
  ```python
  config.read("/path/to/your/example.ini")  # replace the path with the location of your ini file
  ```

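Put together, that credential loading amounts to something like the following sketch, using Python's standard `configparser`; the path is a placeholder, and the section and key names are the ones shown in `example.ini` above.

```python
# Sketch of the credential loading described above, using the
# standard-library configparser; the path below is a placeholder.
import configparser

config = configparser.ConfigParser()
config.read("/path/to/your/example.ini")  # point this at your actual ini file

cloud_id = config["ELASTIC"]["cloud_id"]
user = config["ELASTIC"]["user"]
password = config["ELASTIC"]["password"]
```
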
## Other quirks

> **Review comment:** We recently moved away from Node.js-implemented scrapers, right? So Node.js and yarn are probably not needed in this list. Also, Elasticsearch is installed as part of `requirements.txt`, so I feel that this line will create confusion for people.