151 changes: 107 additions & 44 deletions README.md
@@ -1,68 +1,131 @@
# Welcome to the Scraper

This project is designed to automate the process of gathering information from a variety of key Bitcoin-related sources.
It leverages GitHub Actions to schedule nightly cron jobs, ensuring that the most up-to-date content is captured from each source according to a defined frequency.
The scraped data are then stored in an Elasticsearch index.
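
Once a run completes, the indexed documents can be inspected with a standard Elasticsearch count query. This is an illustrative check only; the cluster URL, credentials, and index name below are placeholders for your own deployment:

```bash
# Count the documents currently stored in the target index (placeholder values)
curl -u "elastic:your-password" "https://your-cluster-url:9243/your-index/_count"
```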

Below is a detailed breakdown of the sources scraped and the schedule for each:

Daily at 00:00 UTC

- [Lightning Mailing List](https://lists.linuxfoundation.org/pipermail/lightning-dev/) ([cron](.github/workflows/mailing-list-lightning.yml), [source](mailing-list))
- [New Bitcoin Mailing List](https://gnusha.org/pi/bitcoindev/) ([cron](.github/workflows/mailing-list-bitcoin-new.yml), [source](mailing-list/main.py))
- [Bitcoin Mailing List](https://lists.linuxfoundation.org/pipermail/bitcoin-dev/) ([cron](.github/workflows/mailing-list-bitcoin.yml), [source](mailing-list))
- [Delving Bitcoin](https://delvingbitcoin.org/) ([cron](.github/workflows/delving-bitcoin.yml), [source](delvingbitcoin_2_elasticsearch))

Weekly

- [bitcoin.stackexchange](https://bitcoin.stackexchange.com/) ([cron](.github/workflows/stackexchange.yml), [source](bitcoin.stackexchange.com))
- Bitcoin Talk Forum ([cron](.github/workflows/bitcointalk.yml), [source](bitcointalk))
  - only the [Development & Technical Discussion Board](https://bitcointalk.org/index.php?board=6.0)
  - only for specific authors
- [Bitcoin Transcript](https://btctranscripts.com/) ([cron](.github/workflows/bitcointranscripts.yml), [source](bitcointranscripts))
- [Bitcoin Optech](https://bitcoinops.org/) ([cron](.github/workflows/bitcoinops.yml), [source](bitcoinops))

Additionally, for on-demand scraping tasks we use Scrapybot, described in the [Scrapybot section](#scrapybot) below.


## Scrapybot

We have implemented a variety of crawlers (spiders), each designed for a specific website of interest.
You can find all the spiders in the [`scrapybot/scrapybot/spiders`](scrapybot/scrapybot/spiders) directory.

This section explains how to run the scrapers in the `scrapybot` folder.

To run a crawler using scrapybot, for example `rusty`, which scrapes the site `https://rusty.ozlabs.org`, switch to the root directory (where this README file lives) and run these commands from your terminal:
- `pip install -r requirements.txt && cd scrapybot`
- `scrapy crawl rusty -O rusty.json`

The above commands install the scrapy dependencies, then run the `rusty` spider (one of the crawlers) and store the collected documents in a `rusty.json` file in the `scrapybot` project directory.

The same procedure applies to any of the crawlers in the `scrapybot/spiders` directory.
There is also a script in the `scrapybot` directory, `scraper.sh`, which can run all the spiders at once; see the example below.
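
For example, a run-everything invocation might look like this (assuming the script is executable and is run from inside `scrapybot`):

```bash
cd scrapybot
./scraper.sh   # runs each spider in scrapybot/scrapybot/spiders in turn
```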


## Prerequisites

Before you begin, ensure you have the following installed:

- [Python 3.8+](https://www.python.org/downloads/)
- [Node.js 14+](https://nodejs.org/)
- [pip](https://pip.pypa.io/en/stable/)
- [yarn](https://classic.yarnpkg.com/en/docs/install/)
- [Elasticsearch](https://www.elastic.co/downloads/elasticsearch)
- [virtualenv](https://virtualenv.pypa.io/en/latest/)

> **Review comment (Member) on lines +30 to +34:** We recently moved away from nodejs implemented scrapers, right? So nodejs and yarn are probably not needed in this list. Also, elasticsearch is installed as part of requirements.txt, so I feel that this line will create confusion for people.
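
A quick way to confirm the tooling is in place (an ad-hoc check, not part of the project's scripts; skip the Node.js lines if you only run the Python scrapers):

```bash
python --version   # expect Python 3.8+
node --version     # expect v14+ (Node.js scrapers only)
pip --version
yarn --version     # Node.js scrapers only
```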

## Setup Instructions

1. **Clone this repository**
```bash
git clone https://github.com/bitcoinsearch/scraper.git
cd scraper
```

2. **Create a virtual environment**
```bash
python -m venv venv
```
3. **Activate the virtual environment**
- **Windows:**
```bash
venv\Scripts\activate
```
- **MacOS/Linux:**
```bash
source venv/bin/activate
```
4. **Install the required Python packages**
```bash
pip install -r requirements.txt
```
5. **Create a `.env` file from the `.env.sample` file**
```bash
cp .env.sample .env
```
Open the `.env` file and provide values for the following variables (a filled-in example follows the list):
- `DATA_DIR`: Path to store temporary files.
- `DAYS_TO_SUBTRACT`: Number of days to subtract from today's date to determine the date range for scraping and downloading mailing list documents.
- `URL`: For the mailing list scraper, use one of these URLs:
1. https://lists.linuxfoundation.org/pipermail/lightning-dev/
2. https://lists.linuxfoundation.org/pipermail/bitcoin-dev/
- `CLOUD_ID`: Your Elasticsearch cloud ID. This is required for connecting to your Elasticsearch cluster.
- `USERNAME`: The username for your Elasticsearch instance.
- `USER_PASSWORD`: The API key or password associated with the `USERNAME` in Elasticsearch.
- `INDEX`: The name of the index where documents will be stored in Elasticsearch.
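
For reference, a filled-in `.env` might look like the sketch below; every value is a placeholder, and the index name is only an example:

```bash
DATA_DIR=/tmp/scraper-data
DAYS_TO_SUBTRACT=7
URL=https://lists.linuxfoundation.org/pipermail/bitcoin-dev/
CLOUD_ID=your-deployment:your-cloud-id
USERNAME=elastic
USER_PASSWORD=your-api-key-or-password
INDEX=bitcoin-search
```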
6. **Install the Node.js packages**
Run the command below to install the required packages for the Node.js scrapers.
```bash
cd common && yarn install && cd ../mailing-list && yarn install && cd ..
```
## Running Scrapers
Call the scrapers from the repository root, since they rely on the `common` directory. To run a specific scraper, use the respective command listed below (the commands use Windows-style paths; on macOS/Linux, use forward slashes, e.g. `python bitcoin.stackexchange.com/main.py`):
1. [bitcoin.stackexchange.com](bitcoin.stackexchange.com)
```bash
python .\bitcoin.stackexchange.com\main.py
```
2. [bitcoinbook](bitcoinbook)
```bash
python .\bitcoinbook\main.py
```
3. [bitcoinops](bitcoinops)
```bash
python .\bitcoinops\main.py
```
4. [bitcointalk](bitcointalk)
```bash
python .\bitcointalk\main.py
```
5. [bitcointranscripts](bitcointranscripts)
```bash
python .\bitcointranscripts\main.py
```
6. [delvingbitcoin_2_elasticsearch](delvingbitcoin_2_elasticsearch)
```bash
python .\delvingbitcoin_2_elasticsearch\delvingbitcoin_2_elasticsearch.py
```
7. [mailing-list](mailing-list)
- To run the mailing list scrapers, use the following commands based on the type of documents you want to scrape:
- **For Linux Foundation Documents**
Ensure that the `URL` environment variable is set to the appropriate mailing list URL (e.g., `https://lists.linuxfoundation.org/pipermail/lightning-dev/` or `https://lists.linuxfoundation.org/pipermail/bitcoin-dev/`).
Run the following command:
```bash
node mailing-list/main.js
```
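The scraper also accepts its configuration inline via environment variables, with `NAME` labeling the list being scraped:
```bash
URL='https://lists.linuxfoundation.org/pipermail/bitcoin-dev/' NAME='bitcoin' node mailing-list/main.js
```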
> **Review comment (Member) on lines +115 to +122:** this is a deprecated scraper

- **For the New Bitcoin Mailing List**
Use the following command to run the Bitcoin Dev scraper:
```bash
python .\mailing-list\main.py
```
### Sending the scrapybot output to Elasticsearch

- Create an `example.ini` file inside the `scrapybot` directory with the following contents:
```ini
[ELASTIC]
cloud_id = your_cloud_id
user = your_elasticsearch_username
password = your_elasticsearch_password
```
- In `pipelines.py` in the `scrapybot` directory, read the above file to load your Elasticsearch credentials:
```python
config.read("/path/to/your/example.ini")  # replace the path with the location of your actual ini file
```
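
With the credentials in place, running a spider as usual should also send the collected documents through `pipelines.py` into your cluster. A minimal sketch, assuming the Elasticsearch pipeline is enabled in the project's scrapy settings:

```bash
cd scrapybot
scrapy crawl rusty   # items pass through pipelines.py and into Elasticsearch
```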
## Other quirks
Expand Down