Website to crowd-annotate tweets for humor research. Originally created for pgHumor, and also used in the HAHA competitions. For general information about the data and its format, see the HUMOR website.

There are two ways to run this code after cloning the repo: with Docker or via Pipenv. The first is the recommended way to get started (or to use just for the database); the second is for the extraction and analysis parts, and for advanced usage (such as debugging with an IDE).
You need Docker and Docker Compose for this. To run the Flask development server in debug mode, auto-detecting changes:

```bash
docker compose up --build
```
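To check that the server is up, you can hit it with `curl`. The port here is an assumption: Flask's development server defaults to 5000, but the actual mapping is whatever the `ports` entry in `docker-compose.override.yml` says, so check that file first.

```bash
# Hypothetical port; confirm against the compose file's `ports` mapping.
curl -I http://localhost:5000/
```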
To run it with Pipenv instead:

- Install the Python and MySQL library headers. In Ubuntu, it'd be:

  ```bash
  sudo apt install libmysqlclient-dev python3-dev
  ```

- Install the dependencies using Pipenv:

  ```bash
  pipenv install -d
  ```

- Create a `.env` file with the following content, setting the env var values (see the tip right after this list for one way to generate `FLASK_SECRET_KEY`):

  ```
  FLASK_APP=clasificahumor/main.py
  FLASK_DEBUG=1
  FLASK_SECRET_KEY=SET_VALUE
  DB_HOST=SET_VALUE
  DB_USER=SET_VALUE
  DB_PASS=SET_VALUE
  DB_NAME=SET_VALUE
  ```

- Run:

  ```bash
  pipenv shell  # It will load the environment, along with the .env file.
  flask run
  ```

- Set up a MySQL 5.7 instance. It could be the instance generated with the Docker setup.
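`FLASK_SECRET_KEY` just needs to be a long, unpredictable random string. One way to generate one (an optional suggestion using Python's standard library, not something this repo requires):

```bash
python3 -c 'import secrets; print(secrets.token_hex(32))'
```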
You also need data to work with. There's a dump of the downloaded tweets in the HUMOR repo.
First, create a database with the options `DEFAULT CHARSET utf8mb4 COLLATE utf8mb4_unicode_ci`. It can be created with `schema.sql`:

```bash
mysql -u $USER -p < schema.sql
```

The default user for Docker is `root`. The default password for the dev environment in Docker is specified in the `docker-compose.override.yml` file.
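If you'd rather create the database by hand before loading the tables (a sketch; `schema.sql` remains the source of truth, and `pghumor` is the database name used throughout this README):

```bash
mysql -u root -p -e 'CREATE DATABASE pghumor DEFAULT CHARSET utf8mb4 COLLATE utf8mb4_unicode_ci;'
```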
To load a database dump, run in another shell:

```bash
mysql -u $USER -p pghumor < dump.sql
```

You can prefix `docker compose exec database` to the command to run it in the database Docker container. Or you can use a local `mysql`:

```bash
# First, check the IP address of the container.
# Note the actual Docker container name depends on the local folder name.
docker container inspect pghumor-clasificahumor_database_1 | grep IPAddress
# Then use the IP address (e.g., 172.19.0.3) to connect:
mysql -h 172.19.0.3 -u root -p
# You can also set the password in the command, like: -p$PASSWORD
```
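A tidier way to get just the IP address is Docker's built-in Go-template formatting (an optional alternative to the `grep` above; adjust the container name to match yours):

```bash
docker container inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' pghumor-clasificahumor_database_1
```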
Pro tip: you can use `mycli`, which is included in the dev dependencies for this project. It's more powerful than the default MySQL CLI client (e.g., it has syntax highlighting and command auto-completion, and doesn't need the semicolon at the end of every command):

```bash
mycli -h 172.19.0.3 -u root
# You can also set the password in the command, like: -p $PASSWORD
```

For both `mysql` and `mycli`, you can append a database name at the end of the command (e.g., `pghumor`) to select it when starting the session.
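For example, to open a session with the `pghumor` database already selected:

```bash
mysql -h 172.19.0.3 -u root -p pghumor
```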
List the databases:

```sql
SHOW DATABASES;
```

List the `pghumor` database tables:

```sql
USE pghumor;
SHOW TABLES;
```

Describe a particular table (e.g., `tweets`):

```sql
DESCRIBE tweets;
```

Show some data from a table:

```sql
SELECT * FROM tweets LIMIT 10;
```
To run it using a WSGI server, just like in production, do:

```bash
docker compose -f docker-compose.yml -f docker-compose.testing.yml up -d --build
```

Then you can do some testing, such as running a load test:

```bash
./load_test.sh
```
To back up the data in production:

```bash
docker exec clasificahumor_database_1 mysqldump -u root -p pghumor > dump.sql
```
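If you want to keep multiple backups around, you can timestamp the output file name (just a shell convenience layered on the same command):

```bash
docker exec clasificahumor_database_1 mysqldump -u root -p pghumor > "dump-$(date +%F).sql"
```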
To run a SQL script in production (e.g., to restore some data):

```bash
docker exec -i clasificahumor_database_1 mysql -u root -p pghumor < dump.sql
```

To open a `mysql` interactive session in production:

```bash
docker exec -it clasificahumor_database_1 mysql -u root -p pghumor
```
For these commands, using Docker Compose directly (`docker compose exec database …`) is also supported instead of the Docker CLI (`docker exec clasificahumor_database_1 …`). However, the extra flags each one needs differ: Docker Compose's `exec` subcommand allocates a pseudo-TTY and is interactive by default, while the Docker CLI's `exec` subcommand isn't.
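Concretely, to pipe a file into the container through Docker Compose you have to disable the pseudo-TTY allocation with `-T` (a sketch of the restore command from above, rewritten for Compose):

```bash
docker compose exec -T database mysql -u root -p pghumor < dump.sql
```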
The repo was first cloned in production in `/opt/clasificahumor`. The following command was run:

```bash
git config receive.denyCurrentBranch updateInstead
```

The file `/opt/clasificahumor/.git/hooks/post-update` in production has been set with the following content to deploy on `git push`:

```bash
#!/usr/bin/env bash
pushd .. > /dev/null  # So it loads the .env file in the working directory.
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d --build
popd > /dev/null
```
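If you ever recreate that hook, remember that Git only runs hooks with the executable bit set (a general Git requirement, not anything specific to this repo):

```bash
chmod +x /opt/clasificahumor/.git/hooks/post-update
```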
Add a git remote to push to production:

```bash
git remote add production $YOUR_USERNAME@clasificahumor.com:/opt/clasificahumor
```

Then just push to production:

```bash
git push production
```
Follow the steps here to download new tweets and get them into the database.

Add the following to the `.env` file, replacing the values with the Twitter API credentials:

```
CONSUMER_TOKEN=...
CONSUMER_SECRET=...
ACCESS_TOKEN=...
ACCESS_TOKEN_SECRET=...
```
Note that normally we wouldn't need the access token and access token secret, as we're not authenticating other users to this "Twitter app". However, the app's access token can be used to act in the name of the user who owns the Twitter app (user-based authentication), and thus get higher Twitter API rate limits than in an app-based authentication context.
Then, download tweets:

```bash
./extraction/download_hose.py > tweets1.jsonl
./extraction/download_from_accounts.py > tweets2.jsonl
```
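`persist.py` reads tweets from standard input. The command below expects a `tweets.jsonl` file, so if you kept the two downloads separate as above, you can concatenate them first (an assumption about the intended flow; the file name is whatever you choose to feed it):

```bash
cat tweets1.jsonl tweets2.jsonl > tweets.jsonl
```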
```bash
./extraction/persist.py < tweets.jsonl
```

See the options available in the command with `./extraction/persist.py --help`.
To compute the agreement (for example, with the `annotations_by_tweet.csv` file):

```bash
./analysis/agreement.py FILE
```
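For instance, assuming that CSV file sits in the current directory:

```bash
./analysis/agreement.py annotations_by_tweet.csv
```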
If you get an SSL connection error when trying to access the database, see "MySQL ERROR 2026 - SSL connection error - Ubuntu 20.04".