Chiel dat 3098 create dockerized dev setup for xdiff #2
Conversation
DAT-3098 Create dockerized dev setup for xDiff
https://github.com/datafold/xdiff Create a one-click dev/test environment. I imagine a docker-compose that launches MySQL and PostgreSQL in their respective containers, provisions a test dataset, and lets you run and test xDiff with one command. This will simplify further development and also make it super easy for anyone to test the tool.
A few things I recommend changing, but all minor
@@ -1,5 +1,6 @@
// // Declare table & functions
func run_sql(code) {
    print code
I think it just adds too much clutter. If you want to see the SQL statements, you can add `--print-sql` when running preql, like:

preql -m prepare_db --print-sql
`--print-sql` is not really workable, because then it will spam everything when importing the CSV. Like:
Importing CSV file: 'ml-25m/ratings.csv' ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0:05:32/**/;; INSERT INTO "Rating" ( "userId", "movieId", "rating", "timestamp" ) VALUES ('756', '2115', '3.5', '1167792434'), ('756', '2117', '4.0', '1167793054'), ('756', '2159', '2.5', '1167861707'), ('756', '2162', '1.5', '1167792625'), ('756', '2167', '4.5', '1167673470'), ('756', '2186', '0.5', '1167790965'), ...
I see, makes sense
@@ -39,10 +49,10 @@ And it's ready to use!
Example:

```bash
xdiff postgres:/// Rating postgres:/// Rating_del1p -c timestamp --stats
xdiff postgres://<uri> Rating postgres://<uri> Rating_del1 -c timestamp --stats
```
Technically, `postgres://` is part of the URI :)
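For illustration, roughly what the corrected invocation could look like, reusing the local test connection string from the prepare_db example later in this thread (a sketch only, with throwaway credentials):

```bash
# pass the full URI, since postgres:// is part of the URI itself
xdiff "postgres://postgres:Password1@localhost/postgres" Rating \
      "postgres://postgres:Password1@localhost/postgres" Rating_del1 \
      -c timestamp --stats
```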
dev/example.sh (Outdated)

main () {
    initialize
    prepaire_db
prepare
dev/example.sh (Outdated)

initialize() {
    pip install poetry
    poetry install
    pip install preql==0.2.10 # Temporary due to version conflicts for runtype
0.2.11 now (or higher)
dev/example.sh (Outdated)

    unzip ml-25m.zip
fi
MYSQL_IMAGE=${MYSQL_IMAGE} docker-compose up -d
sleep 15 # Increase if you receive an error like: `mysql.connector.errors.InterfaceError: 2013: Lost connection to MySQL server during query`
Maybe instead of sleep, check for the connection every few seconds? Sounds more robust to me.
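For example, a minimal sketch of such a readiness check (ports taken from the compose setup in this PR: MySQL on 3306, PostgreSQL on 5432):

```bash
# poll until both databases accept TCP connections, instead of a fixed sleep
until nc -z localhost 3306 && nc -z localhost 5432; do
  echo "Waiting for MySQL and PostgreSQL to come up..."
  sleep 2
done
```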
Awesome Chiel!! Thanks for getting this here. Biggest comment to me is about whether we really need this whole separate dev `pyproject.toml`, rather than using the main one and dev-dependencies.
command: >
  -c work_mem=1GB
  -c maintenance_work_mem=1GB
  -c max_wal_size=8GB
Why was this necessary?
Yes, for the size of the movielens dataset, the defaults are too low.
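If it helps to sanity-check, one way to confirm the overrides actually took effect (the `postgres` service name and user here are assumptions about the compose file):

```bash
# query the running PostgreSQL container for the effective settings
docker-compose exec postgres psql -U postgres -c "SHOW work_mem;"
docker-compose exec postgres psql -U postgres -c "SHOW max_wal_size;"
```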
--key_buffer_size=8G
--read_buffer_size=2G
--max_connections=10
--innodb_io_capacity=400
same here
    exit
fi

# Use a linux/arm64 docker image for MySQL when running on ARM
This worked automatically for me on the oracle images, did it not for you? Are you sure we need this?
Well, some images don't support all architectures, so you need a specific one. That's the case for this one.
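A rough sketch of how that selection could look in the setup script (the image tags here are placeholders, not necessarily what the PR uses):

```bash
# pick an arm64-capable MySQL image on ARM hosts, otherwise the default image
if [[ "$(uname -m)" == "arm64" || "$(uname -m)" == "aarch64" ]]; then
  MYSQL_IMAGE="mysql/mysql-server:8.0"   # placeholder: an image with linux/arm64 support
else
  MYSQL_IMAGE="mysql:8.0"                # placeholder default
fi
MYSQL_IMAGE=${MYSQL_IMAGE} docker-compose up -d
```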
@@ -0,0 +1,57 @@
#!/usr/bin/env bash
I would call this `dev/setup.sh` or something instead; it makes it a bit more obvious what it's doing.
fi
MYSQL_IMAGE=${MYSQL_IMAGE} docker-compose up -d

until nc -z -v -w30 localhost 3306 && nc -z -v -w30 localhost 5432; do
😇 for not using a raw sleep(1)
set -ex

if [[ $( dirname $0 ) != "." ]]; then
    echo "Execute from /dev folder."
Please make this execute from the root directory as a vanilla `script/setup.sh` ❤️
I know this is partly because of `preql` below. The trick to not pollute the whole script is to `cd` in a subshell, like so:
prepare_db() {
    (
        cd dev/ # won't pollute the main script
        poetry run preql -m prepare_db mysql://mysql:Password1@localhost/mysql
        poetry run preql -m prepare_db postgres://postgres:Password1@localhost/postgres
    )
}
@@ -0,0 +1,57 @@
#!/usr/bin/env bash
set -ex
❤️ I sense a fellow scarred basher
@@ -0,0 +1,17 @@
[tool.poetry]
I don't love this as a separate pyproject... Can't we just add these as dev-dependencies, and have `poetry install` in the script install all the extras?
I'm not super familiar with Python, but I think we should follow standard conventions for contributions if possible. Presumably those are that you go to the main directory, run `poetry install`, and then you're in business.
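As a sketch of what that flow might look like from the repo root (the `mysql` extra name comes from this PR's pyproject; anything beyond that is an assumption):

```bash
# install the project plus dev-dependencies (poetry install pulls in dev deps by default)
poetry install
# if the database drivers are declared as extras, pull them in explicitly
poetry install -E mysql
```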
snowflake-connector-python = { version = "*", optional = true }

[tool.poetry.extras]
mysql = ["mysqlclient"]
mysql = ["mysql-connector-python"]
👍🏻 ran into this too
if [ ! -f ./ml-25m/ratings.csv ]; then
    echo "Example data not found. Downloading.."
    wget https://files.grouplens.org/datasets/movielens/ml-25m.zip
Might be worth moving this to a `test/fixtures` directory or something for the longer term. Don't blame you if you don't do that now though :)
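If that move happens later, a sketch of what it could look like (the `test/fixtures` path is just the suggestion above, not an existing directory):

```bash
# download the MovieLens archive into a fixtures directory, skipping work already done
mkdir -p test/fixtures
wget -nc -P test/fixtures https://files.grouplens.org/datasets/movielens/ml-25m.zip
unzip -n test/fixtures/ml-25m.zip -d test/fixtures
```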
Initial code copied from data-diff
Adding docker-compose setup for PostgreSQL and MySQL with an automated example of how to use XDiff on the Movielens dataset.