This repository was archived by the owner on May 17, 2024. It is now read-only.

Chiel dat 3098 create dockerized dev setup for xdiff #2

Merged

Conversation

cfernhout
Collaborator

Adding a docker-compose setup for PostgreSQL and MySQL, with an automated example of how to use XDiff on the MovieLens dataset.


linear bot commented Apr 26, 2022

DAT-3098 Create dockerized dev setup for xDiff

https://github.com/datafold/xdiff

Create a one-click dev/test environment. I imagine a docker-compose that launches MySQL and PostgreSQL in their respective containers, provisions a test dataset, and allows you to run and test xDiff in one command. This will simplify further development and also make it super easy for anyone to test the tool.

@cfernhout cfernhout requested a review from erezsh April 26, 2022 12:36
Contributor

@erezsh erezsh left a comment

A few things I recommend changing, but all minor

@@ -1,5 +1,6 @@
// // Declare table & functions
func run_sql(code) {
print code
Contributor

I think it just adds too much clutter. If you want to see the SQL statements, you can add --print-sql when running preql.

Like

preql -m prepare_db --print-sql

Collaborator Author

--print-sql is not really workable, because then it will spam everything when importing the CSV. Like:

Importing CSV file: 'ml-25m/ratings.csv' ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0:05:32/**/;;  INSERT INTO "Rating" ( "userId", "movieId", "rating", "timestamp" ) VALUES ('756', '2115', '3.5', '1167792434'), ('756', '2117', '4.0', '1167793054'), ('756', '2159', '2.5', '1167861707'), ('756', '2162', '1.5', '1167792625'), ('756', '2167', '4.5', '1167673470'), ('756', '2186', '0.5', '1167790965'), ...

Contributor

I see, makes sense

@@ -39,10 +49,10 @@ And it's ready to use!
Example:

```bash
xdiff postgres:/// Rating postgres:/// Rating_del1p -c timestamp --stats
xdiff postgres://<uri> Rating postgres://<uri> Rating_del1 -c timestamp --stats
Contributor

Technically, postgres:// is part of the uri :)
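
For illustration, a full invocation might then look something like this (the connection details are just the dev credentials used for preql later in this PR, shown purely as an example, not something xdiff prescribes):

```bash
# The scheme is part of the connection string, so the whole URI is passed as one argument
xdiff postgres://postgres:Password1@localhost/postgres Rating \
      postgres://postgres:Password1@localhost/postgres Rating_del1 \
      -c timestamp --stats
```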

dev/example.sh Outdated

main () {
initialize
prepaire_db
Contributor

prepare

dev/example.sh Outdated
initialize() {
pip install poetry
poetry install
pip install preql==0.2.10 # Temporary due to version conflicts for runtype
Contributor

0.2.11 now (or higher)

dev/example.sh Outdated
unzip ml-25m.zip
fi
MYSQL_IMAGE=${MYSQL_IMAGE} docker-compose up -d
sleep 15 # Increase if you receive error like: `mysql.connector.errors.InterfaceError: 2013: Lost connection to MySQL server during query`
Contributor

Maybe instead of sleep, check for the connection every few seconds? Sounds more robust to me.
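
A sketch of the idea (ports match the compose setup in this PR; the exact readiness check is up to you):

```bash
# Poll until both databases accept TCP connections instead of sleeping a fixed amount of time
until nc -z localhost 3306 && nc -z localhost 5432; do
  echo "Waiting for MySQL (3306) and PostgreSQL (5432) to come up..."
  sleep 2
done
```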

Contributor

@sirupsen sirupsen left a comment

Awesome Chiel!! Thanks for getting this here. My biggest comment is about whether we really need this whole dev pyproject.toml, rather than using the main one with dev-dependencies.

command: >
-c work_mem=1GB
-c maintenance_work_mem=1GB
-c max_wal_size=8GB
Contributor

Why was this necessary?

Collaborator Author

Yes, for the size of the movielens dataset, the defaults are too low.

--key_buffer_size=8G
--read_buffer_size=2G
--max_connections=10
--innodb_io_capacity=400
Contributor

same here

exit
fi

# Use a linux/arm64 docker image for MySQL when running on ARM
Contributor

This worked automatically for me on the oracle images, did it not for you? Are you sure we need this?

Collaborator Author

Well, some images don't support all architectures, so you need a specific one. That's the case for this image.
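
For reference, a minimal sketch of how the image could be selected at runtime (the image names here are illustrative, not necessarily the ones this PR ends up using):

```bash
# Pick an arm64-capable MySQL image when running on Apple Silicon or other ARM hosts
if [[ "$(uname -m)" == "arm64" || "$(uname -m)" == "aarch64" ]]; then
  MYSQL_IMAGE="mysql/mysql-server:8.0"   # illustrative: an image with linux/arm64 support
else
  MYSQL_IMAGE="mysql:8.0"
fi
MYSQL_IMAGE=${MYSQL_IMAGE} docker-compose up -d
```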

@@ -0,0 +1,57 @@
#!/usr/bin/env bash
Contributor

I would call this dev/setup.sh or something instead—makes it a bit more obvious what it's doing

fi
MYSQL_IMAGE=${MYSQL_IMAGE} docker-compose up -d

until nc -z -v -w30 localhost 3306 && nc -z -v -w30 localhost 5432; do
Contributor

😇 for not using a raw sleep(1)

set -ex

if [[ $( dirname $0 ) != "." ]]; then
echo "Execute from /dev folder."
Contributor

Please make this execute from the root directory as vanilla script/setup.sh ❤️

I know this is partly because of preql below. The trick to not pollute the whole script is to cd in a subshell, like so:

prepare_db() {
  (
    cd dev/ # won't pollute the main script
    poetry run preql -m prepare_db mysql://mysql:Password1@localhost/mysql
    poetry run preql -m prepare_db postgres://postgres:Password1@localhost/postgres
  ) 
}

@@ -0,0 +1,57 @@
#!/usr/bin/env bash
set -ex
Contributor

❤️ I sense a fellow scarred basher

@@ -0,0 +1,17 @@
[tool.poetry]
Contributor

I don't love this as a separate pyproject... Can't we just add these as dev-dependencies and, as part of the poetry install in the script, install all the extras?

I'm not super familiar with Python, but I think we should follow standard conventions for contributions if possible. Presumably those are that you go to the main directory, run poetry install, and then you're in business.
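
A minimal sketch of what that contributor flow could look like, assuming the optional drivers are declared as extras in the root pyproject.toml (the extra names here are illustrative; adjust to whatever the root pyproject actually defines):

```bash
# From the repository root: install the project plus the optional DB drivers as extras
poetry install -E mysql -E postgresql
```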

snowflake-connector-python = { version = "*", optional = true }

[tool.poetry.extras]
mysql = ["mysqlclient"]
mysql = ["mysql-connector-python"]
Contributor

👍🏻 ran into this too


if [ ! -f ./ml-25m/ratings.csv ]; then
echo "Example data not found. Downloading.."
wget https://files.grouplens.org/datasets/movielens/ml-25m.zip
Contributor

Might be worth moving this to a test/fixtures directory or something for the longer term. Don't blame you if you don't do that now though :)

@sirupsen sirupsen merged commit 4b6e004 into master May 5, 2022
nolar pushed a commit that referenced this pull request Apr 6, 2023
Initial code copied from data-diff