Twitter Bulk Fetcher

A Go script that fetches large batches of tweets from the Gopher AI subnet API using pagination. This tool automatically optimizes batch sizes based on your target amount and saves tweets to a JSON file for analysis.

Features

  • Fetches configurable amounts of tweets using pagination (default: 10,000)
  • Automatically optimizes batch size: uses AMOUNT if ≤100, otherwise uses API maximum (100)
  • Uses max_id for proper chronological pagination
  • Progress logging during collection
  • Saves tweets to JSON with metadata
  • Fully configurable via environment variables (QUERY, AMOUNT)

Prerequisites

  • Go 1.24.6 or later
  • A Gopher AI API token

Setup

  1. Clone or navigate to the repository:

    cd sn42-datasets
  2. Create a .env file in the root directory:

    GOPHER_CLIENT_TOKEN=your_api_token_here
    QUERY="bitcoin min_faves:1000"
    AMOUNT=10000

    The script will automatically load this file. You can also set these as environment variables:

    export GOPHER_CLIENT_TOKEN=your_api_token_here
    export QUERY="bitcoin min_faves:1000"
    export AMOUNT=10000
  3. Install dependencies:

    go mod tidy
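
For reference, the automatic .env loading behaves roughly like the sketch below. It assumes the widely used github.com/joho/godotenv package; the script's actual loading code may differ.

package main

import (
    "log"
    "os"

    "github.com/joho/godotenv" // assumed for illustration; the repo may load .env differently
)

func main() {
    // Load variables from ./.env if present; exported environment variables
    // still apply when the file is missing.
    if err := godotenv.Load(); err != nil {
        log.Println("no .env file found, using exported environment variables")
    }

    if os.Getenv("GOPHER_CLIENT_TOKEN") == "" {
        log.Fatal("GOPHER_CLIENT_TOKEN is required")
    }
}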

Usage

Basic Usage

Run the script to fetch 10,000 tweets matching your query:

go run ./cmd/fetch-tweets

The query and amount are read from the QUERY and AMOUNT environment variables in your .env file. If they are not set, they default to bitcoin min_faves:1000 and 10,000 tweets, respectively.
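
Conceptually, the fallback to those defaults works like this sketch (the helper name is illustrative, not an identifier from the repo):

package main

import (
    "fmt"
    "os"
)

// getenvDefault is a hypothetical helper showing the fallback behaviour.
func getenvDefault(key, fallback string) string {
    if v := os.Getenv(key); v != "" {
        return v
    }
    return fallback
}

func main() {
    query := getenvDefault("QUERY", "bitcoin min_faves:1000")
    amount := getenvDefault("AMOUNT", "10000")
    fmt.Printf("query=%q amount=%s\n", query, amount)
}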

Customization

Custom Query and Amount

Set the QUERY and AMOUNT environment variables in your .env file:

QUERY="ethereum min_faves:500"
AMOUNT=5000

Or set them as environment variables:

export QUERY="ethereum min_faves:500"
export AMOUNT=5000
go run ./cmd/fetch-tweets

Note:

  • AMOUNT is the total number of tweets to collect
  • The batch size (tweets per API request) is automatically set to min(AMOUNT, 100)
    • If AMOUNT is 50, it will fetch 50 tweets in one request
    • If AMOUNT is 5000, it will fetch 100 tweets per request (API maximum) until reaching 5000
  • The output filename is automatically generated from your query and amount

Output

The script automatically generates a filename based on your query and target tweet count. The format is:

data/{sanitized_query}_{target_count}.json

For example:

  • Query: "bitcoin min_faves:1000" with 10,000 tweets → data/bitcoin_min_faves_1000_10000.json
  • Query: "ethereum min_retweets:50" with 5,000 tweets → data/ethereum_min_retweets_50_5000.json

The output JSON file has the following structure:

{
  "total_tweets": 10000,
  "query": "bitcoin min_faves:1000",
  "collected_at": "2026-02-04T01:22:46Z",
  "tweets": [
    {
      "id": "2018797961606557803",
      "source": "twitter",
      "content": "Tweet content...",
      "metadata": {
        "tweet_id": 2018797961606557700,
        "created_at": "2026-02-03T21:25:16Z",
        "likes": 1715,
        "public_metrics": { ... },
        ...
      }
    },
    ...
  ]
}
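
If you want to load the file back into Go for analysis, types along these lines should map onto it. Field names are inferred from the sample above, and metadata is kept loosely typed since its full shape is not shown:

package main

import (
    "encoding/json"
    "fmt"
    "os"
    "time"
)

// Output mirrors the JSON structure shown above; these are illustrative types,
// not the script's internal ones.
type Output struct {
    TotalTweets int       `json:"total_tweets"`
    Query       string    `json:"query"`
    CollectedAt time.Time `json:"collected_at"`
    Tweets      []Tweet   `json:"tweets"`
}

type Tweet struct {
    ID       string                 `json:"id"`
    Source   string                 `json:"source"`
    Content  string                 `json:"content"`
    Metadata map[string]interface{} `json:"metadata"`
}

func main() {
    raw, err := os.ReadFile("data/bitcoin_min_faves_1000_10000.json")
    if err != nil {
        panic(err)
    }
    var out Output
    if err := json.Unmarshal(raw, &out); err != nil {
        panic(err)
    }
    fmt.Println(out.TotalTweets, "tweets for query", out.Query)
}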

How It Works

  1. Initial Request: Fetches the first batch of tweets matching the query (batch size = min(AMOUNT, 100))
  2. Pagination: Uses the last tweet's ID as max_id for the next request
  3. Collection: Continues fetching batches until reaching the target count (AMOUNT) or running out of results
  4. Output: Saves all collected tweets to a JSON file with metadata

Batch Size Examples:

  • AMOUNT=50 → 1 request of 50 tweets
  • AMOUNT=500 → 5 requests of 100 tweets each
  • AMOUNT=10000 → 100 requests of 100 tweets each
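
In code, the batch-size rule above boils down to something like this (a sketch, not the script's actual function):

package main

import "fmt"

// batchSize applies the min(AMOUNT, 100) rule described above.
func batchSize(amount int) int {
    if amount <= 100 {
        return amount
    }
    return 100 // API maximum per request
}

func main() {
    fmt.Println(batchSize(50), batchSize(500), batchSize(10000)) // 50 100 100
}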

Pagination Logic

The script uses Twitter's max_id parameter for pagination:

  • First request: Uses your QUERY as-is (e.g., bitcoin min_faves:1000)
  • Subsequent requests: Appends max_id:{last_tweet_id} to your query (e.g., bitcoin min_faves:1000 max_id:1234567890)

Each subsequent request therefore walks backwards through the timeline, so you collect progressively older tweets; the sketch below puts the whole loop together.
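
The Tweet type and searchTweets helper in this sketch are placeholders standing in for the gopher client's real types and calls:

package main

import (
    "fmt"
    "time"
)

// Tweet and searchTweets are illustrative stand-ins, not the gopher client's API.
type Tweet struct{ ID string }

func searchTweets(query string, count int) ([]Tweet, error) {
    // ... issue the API request here ...
    return nil, nil
}

func collect(query string, amount int) ([]Tweet, error) {
    batch := amount
    if batch > 100 {
        batch = 100 // API maximum per request
    }

    var collected []Tweet
    current := query // first request uses the query as-is
    for len(collected) < amount {
        tweets, err := searchTweets(current, batch)
        if err != nil {
            return collected, err
        }
        if len(tweets) == 0 {
            break // no more results available
        }
        collected = append(collected, tweets...)

        // Subsequent requests append max_id of the last tweet seen,
        // walking backwards to progressively older tweets.
        current = fmt.Sprintf("%s max_id:%s", query, tweets[len(tweets)-1].ID)

        time.Sleep(time.Second) // simple rate limiting between requests
    }
    if len(collected) > amount {
        collected = collected[:amount]
    }
    return collected, nil
}

func main() {
    tweets, err := collect("bitcoin min_faves:1000", 10000)
    if err != nil {
        fmt.Println("error:", err)
    }
    fmt.Println("collected", len(tweets), "tweets")
}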

Environment Variables

The script uses the following environment variables (loaded from .env file):

  • GOPHER_CLIENT_TOKEN: Your Gopher AI API token (required)
  • QUERY: Twitter search query (optional, defaults to "bitcoin min_faves:1000")
  • AMOUNT: Total number of tweets to collect (optional, defaults to 10000)
  • GOPHER_CLIENT_URL: API base URL (optional, defaults to https://data.gopher-ai.com/api)
  • GOPHER_CLIENT_TIMEOUT: Request timeout (optional, defaults to 60s)
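
A sketch of how the optional client settings could be resolved with the standard library (defaults taken from the list above; the client's actual configuration code may differ):

package main

import (
    "fmt"
    "os"
    "time"
)

func main() {
    baseURL := os.Getenv("GOPHER_CLIENT_URL")
    if baseURL == "" {
        baseURL = "https://data.gopher-ai.com/api"
    }

    timeout := 60 * time.Second
    if raw := os.Getenv("GOPHER_CLIENT_TIMEOUT"); raw != "" {
        if d, err := time.ParseDuration(raw); err == nil {
            timeout = d // e.g. "60s" or "2m"
        }
    }

    fmt.Println("base URL:", baseURL, "timeout:", timeout)
}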

Batch Size Logic: The script automatically sets the batch size (tweets per API request) to min(AMOUNT, 100). This means:

  • If AMOUNT=50, it fetches 50 tweets in one request
  • If AMOUNT=5000, it fetches 100 tweets per request (API max) until reaching 5000

Query Examples

  • QUERY="bitcoin min_faves:1000" - Bitcoin tweets with at least 1000 likes
  • QUERY="ethereum min_retweets:50" - Ethereum tweets with at least 50 retweets
  • QUERY="crypto -filter:retweets" - Crypto tweets excluding retweets
  • QUERY="from:elonmusk" - All tweets from a specific user

Error Handling

The script handles:

  • Missing or invalid API tokens
  • API errors and rate limiting
  • Empty result sets
  • File write errors

If an error occurs, the script will log it and exit gracefully.

Performance

  • Batch Size: Automatically optimized based on AMOUNT:
    • If AMOUNT ≤ 100: Single request with that amount
    • If AMOUNT > 100: Multiple requests of 100 tweets each (API maximum)
  • Rate Limiting: 1 second delay between requests
  • Total Time: Varies based on AMOUNT:
    • AMOUNT=50: ~5-10 seconds (1 request)
    • AMOUNT=1000: ~10-20 seconds (10 requests)
    • AMOUNT=10000: ~100-200 seconds (100 requests)

Troubleshooting

"Failed to create client from config"

  • Ensure your .env file exists in the sn42-datasets/ directory
  • Check that GOPHER_CLIENT_TOKEN is set correctly
  • Verify there are no extra spaces or quotes around the token value

"QUERY not set in .env, using default"

  • This is informational only - the script will use the default query
  • To use a custom query, add QUERY="your query here" to your .env file

"AMOUNT not set in .env, using default"

  • This is informational only - the script will use the default amount (10,000)
  • To use a custom amount, add AMOUNT=5000 (or your desired number) to your .env file

"Invalid AMOUNT value in .env"

  • AMOUNT must be a positive integer
  • Check that there are no quotes, spaces, or non-numeric characters
  • Example: AMOUNT=10000 (correct) vs AMOUNT="10000" or AMOUNT=ten thousand (incorrect)
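
The validation described here is what a plain strconv parse gives you, roughly like this sketch (error wording in the script may differ):

package main

import (
    "fmt"
    "os"
    "strconv"
)

func main() {
    raw := os.Getenv("AMOUNT") // e.g. "10000"
    amount, err := strconv.Atoi(raw)
    if err != nil || amount <= 0 {
        fmt.Println("Invalid AMOUNT value:", raw, "- must be a positive integer")
        os.Exit(1)
    }
    fmt.Println("collecting", amount, "tweets")
}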

"No more results available"

  • The query may not have enough matching tweets
  • Try adjusting the query parameters (e.g., lower min_faves threshold)
  • Check if you've reached the end of available results

Rate Limiting

If you encounter rate limiting:

  • Increase the delay between requests in the code (currently 1 second)
  • Reduce the AMOUNT value to fetch fewer tweets
  • Contact Gopher AI support for API rate limit information

Note: The batch size is automatically optimized, so you don't need to manually adjust it. Simply set AMOUNT to your desired total number of tweets.

fetch-trends: Get trends and collect tweets per trend

A second tool, fetch-trends, uses the gopher client to get current Twitter trends, then for each trend collects up to 10,000 tweets with at least 100 likes and saves them under data/.

How to run fetch-trends

Use the same .env as for fetch-tweets (at minimum GOPHER_CLIENT_TOKEN). Optionally set AMOUNT (default: 10,000 tweets per trend).

go run ./cmd/fetch-trends

Or build and run:

go build -o fetch-trends ./cmd/fetch-trends
./fetch-trends

What it does

  1. Get trends – Calls the gopher API with a “get trends” job (CapGetTrends), waits for completion, and reads the list of trending topic strings.
  2. For each trend – Builds a query "{trend}" min_faves:100 and fetches tweets the same way as fetch-tweets (pagination, batch size 100).
  3. Output – One JSON file per trend in data/, e.g. data/trend_bitcoin_10000.json, with the same structure as fetch-tweets (metadata, collected_at, etc.).

So you get “trends → 10k tweets (min 100 likes) per trend” in one run.
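
In outline, the per-trend loop looks something like the sketch below. getTrends and collectTweets are placeholders for the gopher client's trends job and the fetch-tweets pagination logic, not identifiers from the repo:

package main

import (
    "fmt"
    "log"
)

// getTrends and collectTweets stand in for the real client calls.
func getTrends() ([]string, error) {
    return []string{"bitcoin"}, nil
}

func collectTweets(query string, amount int) (int, error) {
    return 0, nil
}

func main() {
    trends, err := getTrends() // CapGetTrends job, waited until completion
    if err != nil {
        log.Fatal(err)
    }
    for _, trend := range trends {
        // Each trend becomes a quoted query with a minimum-likes filter.
        query := fmt.Sprintf("%q min_faves:100", trend)
        n, err := collectTweets(query, 10000)
        if err != nil {
            log.Printf("trend %s: %v", trend, err)
            continue
        }
        fmt.Printf("trend %s: collected %d tweets\n", trend, n)
    }
}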

Building

To build standalone binaries:

# Tweet fetcher (single query)
go build -o fetch-tweets ./cmd/fetch-tweets

# Trend-based fetcher (trends + 10k tweets per trend with min 100 likes)
go build -o fetch-trends ./cmd/fetch-trends

Then run:

./fetch-tweets
# or
./fetch-trends

License

See the repository license file for details.
