License: GPL v3

Twitter Bird Watcher: A Twitter Profile Archival Tool

TBWatcher snapshots a Twitter profile page when given a URL (or an exported .js list from the official Twitter exporter). It saves tweets as UTF-8 JSON text files and captures an image snapshot of each Twitter post.

This script is intended for archival use only.

Quick Highlights

  • ⚡ Multi-threaded!
  • 🗄️ Neatly stores metadata in JSON format for each specified Twitter profile.
  • 📸 Snapshots tweets, thread replies, and responses.
  • ♻️ Marks tweets that are potentially self-retweeted.
  • 🚩 Removes Tweet Ads.
  • 🖥️ Allows for manual login (use at your own risk).

Usage

# Install the requirements (only needed once).
python -m pip install -r requirements.txt

# Take a snapshot from a given profile URL.
python bin/watcher.py --url www.twitter.com/<profile>

# Take a snapshot of profile tweets and their replies
python bin/watcher.py --url www.twitter.com/<profile> -d 2

# For more help use:
python bin/watcher.py --help

Tested on Python 3.10.
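
To archive several profiles in one go, the commands above can be wrapped in a short Python script. The sketch below is illustrative only: build_watcher_command and archive_profiles are hypothetical helper names, not part of TBWatcher, and it assumes the script is run from the repository root so that bin/watcher.py resolves.

```python
import subprocess
import sys


def build_watcher_command(profile, depth=None):
    """Build the argument list for one bin/watcher.py run.

    Uses only the --url and -d flags shown in the usage examples.
    """
    command = [sys.executable, "bin/watcher.py", "--url", f"www.twitter.com/{profile}"]
    if depth is not None:
        command += ["-d", str(depth)]
    return command


def archive_profiles(profiles, depth=None):
    """Run the watcher once per profile, stopping at the first failure."""
    for profile in profiles:
        subprocess.run(build_watcher_command(profile, depth), check=True)
```

Calling archive_profiles(["<profile_a>", "<profile_b>"], depth=2) would run the watcher sequentially for each profile. Running them one after another, rather than in parallel, avoids stacking extra browser windows on top of the threads TBWatcher already spawns.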

Output

TBWatcher generates the following in the snapshots folder (assuming --depth 2):

└───snapshots
    └───<user_id>           # Username
        │   metadata.json   # profile metadata
        │   profile.png     # snapshot of profile page
        │   tweets.json     # text format of all tweets on profile page
        │
        └───<prof_tweet_id_0>
            │   <prof_tweet_id_0>.png  # Snapshot
            │   tweets.json            # Responses to <prof_tweet_id_0>
            │
            ├───<response_tweet_id_0>
            │       <response_tweet_id_0>.png # Snapshot
            │
            └───<response_tweet_id_1>
                    <response_tweet_id_1>.png # Snapshot
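
If you want to post-process an archive, the layout above can be walked with a few lines of Python. This is a minimal sketch that assumes the folder structure shown in the tree; iter_snapshot_tweets is a hypothetical helper name, not a function shipped with TBWatcher.

```python
import json
from pathlib import Path


def iter_snapshot_tweets(snapshots_dir, user_id):
    """Yield (tweets_json_path, tweet) pairs for one archived profile.

    Assumes the layout shown above: a profile-level tweets.json, plus one
    tweets.json inside each profile tweet folder holding its responses.
    """
    profile_dir = Path(snapshots_dir) / user_id
    # rglob finds the profile-level file and every nested responses file.
    for tweets_file in profile_dir.rglob("tweets.json"):
        with tweets_file.open(encoding="utf-8") as fh:
            for tweet in json.load(fh):
                yield tweets_file, tweet
```

For example, sum(1 for _ in iter_snapshot_tweets("snapshots", "<user_id>")) counts every archived tweet and response for that profile.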

Detailed Highlights

Multi-Threading

By default, multi-threading is enabled and the number of threads is proportional to the number of cores on your computer. Each thread spawns its own window. Resist the urge to resize the windows, as it can mess up the renders. Moving the windows around is fine.

If you find yourself out of memory, consider lowering the number of threads.

Self Boosted Tweet Detection

A self-boosted tweet is a tweet that its original author has retweeted. These tweets are marked with potential_boost set to true in tweets.json. The script detects them by looking for exact metadata matches, i.e. duplicate posts.
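
The idea of flagging duplicates by exact metadata match can be sketched as follows. This is not TBWatcher's actual implementation: mark_potential_boosts is a hypothetical name, and the choice of handle, timestamp, and tweet_text as the comparison key is an assumption made for illustration.

```python
def mark_potential_boosts(tweets):
    """Return copies of the tweets with potential_boost set on duplicates.

    Assumption: two entries are duplicates when handle, timestamp, and
    tweet_text all match. The real script may compare other fields.
    """
    seen = set()
    marked = []
    for tweet in tweets:
        key = (tweet.get("handle"), tweet.get("timestamp"), tweet.get("tweet_text"))
        # Copy each entry so the caller's dictionaries are left untouched.
        marked.append({**tweet, "potential_boost": key in seen})
        seen.add(key)
    return marked
```

In this sketch the first occurrence stays unmarked and only later repeats are flagged. Whether TBWatcher marks the first copy, the repeat, or both is not specified here.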

Schemas

Assume all data is UTF-8 compliant.

Input File

The input is the .js file that the Twitter exporter generates for the users you are following:

window.* = [
    {
        "following": {
            "accountId": <id>,
            "userLink": <url>
        }
        ...
    }
]

You can rename the file to .json or specify it via the input flags to have it parsed. The window.* = prefix is generated by Twitter by default and is removed automatically by the script. You can also remove it manually so that the file parses directly as JSON.
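
Stripping the assignment prefix and reading the profile links might look like the sketch below. It is an illustration under assumptions, not the script's own parser: load_following is a hypothetical name, and the regular expression assumes the prefix is a dotted window.* name followed by an equals sign.

```python
import json
import re

# Matches a leading "window.<dotted name> = " assignment (assumed shape).
_PREFIX = re.compile(r"^\s*window\.[\w.]+\s*=\s*")


def load_following(text):
    """Return the userLink of every "following" entry in an exported file."""
    # Drop the JavaScript assignment so the remainder is plain JSON.
    payload = _PREFIX.sub("", text, count=1)
    entries = json.loads(payload)
    return [entry["following"]["userLink"] for entry in entries if "following" in entry]
```

Because the substitution is a no-op when the prefix is absent, the same function also accepts a file that has already been cleaned up by hand.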

tweets.json

[
    {
        "id": int,
        "tag_text": str,
        "name": str,
        "handle": str,
        "timestamp": str,
        "tweet_text": str,
        "retweet_count": str,
        "like_count": str,
        "reply_count": str,
        "potential_boost":  bool,
        "parent_id": str | null
    }
]

id is the index assigned by Twitter. Invalid string entries will be marked as "NULL".
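
As a rough example of consuming this schema, the sketch below groups responses under the tweet they reply to. group_by_parent is a hypothetical helper, and it assumes parent_id refers to the id of the parent tweet; ids are compared as strings because the schema lists id as an int and parent_id as a str.

```python
from collections import defaultdict


def group_by_parent(tweets):
    """Map each parent id (as a string) to the tweets that reply to it."""
    children = defaultdict(list)
    for tweet in tweets:
        parent = tweet.get("parent_id")
        # Top-level tweets have a null parent; "NULL" marks an invalid entry.
        if parent is None or parent == "NULL":
            continue
        children[str(parent)].append(tweet)
    return dict(children)
```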

metadata.json

{
    "bio": str,
    "name": str,
    "username": str,
    "location": str,
    "website": str,
    "join_date": str,
    "following": str,
    "followers": str
}

Invalid string entries will be marked as "NULL".
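
When loading this file in your own tooling, it can be convenient to turn the "NULL" marker into a real null value. A minimal sketch, where load_metadata is a hypothetical helper and not part of TBWatcher:

```python
import json


def load_metadata(path):
    """Read metadata.json and replace "NULL" markers with None."""
    with open(path, encoding="utf-8") as fh:
        raw = json.load(fh)
    # Only exact "NULL" strings are treated as missing values.
    return {key: (None if value == "NULL" else value) for key, value in raw.items()}
```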

Troubleshoot

  • TBWatcher terminates early?

Your images may be taking some time to load. Consider using -s to adjust the load time. Alternatively, your scrolling height may be too low or too high. Consider using --scroll-algorithm to change the type of scrolling algorithm, then pass a value to that algorithm with --scroll-value.

Run with --help for more information on what --scroll-value encodes.

  • TBWatcher does not scrape anything, or tweets are cut off?

Try running with --debug and check for any "Unable to locate element" errors. If there are any, your render window may be too small. Under the hood, we use Chrome to render tweets, which requires a sufficiently large browser window.

Try adjusting --window-size so that each tweet is clearly rendered.

  • Out of memory issues?

Each thread spawns its own Chrome window. Try reducing the number of threads with -t / --multi-threading.

Contributing

Interested in contributing? Take a look at our CONTRIBUTING.md.

Future Updates and Goals

  • Support Running Multiple Sessions to Resume Per-Profile Fetching
  • Save and Expand Post Attachments