-
Notifications
You must be signed in to change notification settings - Fork 288
Threading Support #30
Conversation
Wrong implementation.. |
This seems to be okay now, as a first implementation. Main disadvantage of this approach is that we may end up opening a lot of threads. However, it's not necessarily that bad. We can also add measures to curb the thread count and keep it within a reasonable limit. An alternative would be to use futures/async, instead of regular threads, which will reduce the overhead, but probably complicate the code. (note: Threads & connections per database are still capped. But "algorithm" threads currently aren't.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall no major qualms here 🎉
You've clearly gone through a few iterations here to make it as simple as possible! 👏🏻
|
||
def _query_in_worker(self, sql_code: str): | ||
"This method runs in a worker thread" | ||
return _query_conn(self.thread_local.conn, sql_code) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not too familiar with error handling in threads in Python–if an exception is raised in the thread pool, is it propagated to kill the main thread? 👂🏻 I think it should.
I ask because e.g. in Ruby/Go, that's not the case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's how it's supposed to work. I will verify.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Verified. Exceptions bubble up from every thread.
data_diff/diff_tables.py
Outdated
task_pool = ThreadPoolExecutor() | ||
return task_pool.map(func, iter) | ||
|
||
def precalc_attr(attr, iter): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
precalc_
doesn't imply to me it's threaded, is this naming a Python pattern I am not aware of? Could we call it threaded_map_attr
or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything in this module is now threaded, so I didn't feel the need to make the distinction.
But if you think it's better this way, I don't mind renaming it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like threading in it if you don't mind
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Used for database connectors that do not support sharing their connection between different threads. | ||
""" | ||
|
||
def __init__(self, thread_count=1): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a follow-up you'll allow specifying the concurrency in the CLI? 👍🏻
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even with this default though, it does seem to use concurrency?
[10:39:11] INFO - . Diffing segment 1/12 of size 90908 and 90908
[10:39:11] INFO - . Diffing segment 8/12 of size 90909 and 90909
[10:39:11] INFO - . Diffing segment 9/12 of size 90909 and 90909
[10:39:11] INFO - . Diffing segment 10/12 of size 90909 and 90909
[10:39:11] INFO - . Diffing segment 11/12 of size 90909 and 90909
[10:39:11] INFO - . Diffing segment 12/12 of size 2 and 2
[10:39:11] INFO - . Diffing segment 7/12 of size 90909 and 90909
[10:39:11] INFO - . Diffing segment 5/12 of size 90909 and 90909
[10:39:11] INFO - . Diffing segment 6/12 of size 90909 and 90908
[10:39:11] INFO - . Diffing segment 4/12 of size 90909 and 90909
[10:39:11] INFO - . Diffing segment 2/12 of size 90909 and 90909
[10:39:11] INFO - . Diffing segment 3/12 of size 90909 and 90909
Why isn't it basically single-threaded with thread_count=1
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the CLI will pass an argument for that, once we figure out the interface.
The database access is single-threaded (when =1), but the algorithm itself is still threaded. That's necessary for taking advantage of thread_count>1. I could limit the algorithm to 1 thread too in this case, but I'm not sure that it matters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it matters because in that case I'd expect to see the output in order if threads=1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it will now show in order if threads=1
data_diff/diff_tables.py
Outdated
for i, (t1, t2) in enumerate(safezip(segmented1, segmented2)) | ||
] | ||
|
||
for res in thread_map(list, diff_iters): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder whether it's worth it to wait with printing the verbose output until this point too... so you can sort it by segment. I see pros and cons with each...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's harder to keep track when it's not organized. But also, we want the diff message to appear when it actually starts, and not before. Also, it's inevitable that threads will interlace their printing and create a "mess", even if sorted.
I don't think there's really a full solution for this, unless we limit the concurrency to be less efficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I agree... I don't think this is clear... it at least makes it quite clear that it's threaded. I don't think we need to change it. It just makes your beautiful output less cool 😄
I have some ideas on how to make the thread-count predictable, but I'm not even sure if it's necessary, since in 99% of the cases it won't be more than a few hundreds. (and max database connections are already predictable) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, nice and simple!
I think having lots of green threads in Python is acceptable. I think it's ok to not worry more about it for now. It'd make things more difficult to reason about I think.
data_diff/__main__.py
Outdated
@@ -37,6 +37,7 @@ | |||
@click.option("-d", "--debug", is_flag=True, help="Print debug info") | |||
@click.option("-v", "--verbose", is_flag=True, help="Print extra info") | |||
@click.option("-i", "--interactive", is_flag=True, help="Confirm queries, implies --debug") | |||
@click.option("--threads", default=None, help="Number of threads to use. 1 means no threading. Auto if not specified.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-j
is a common short-hand here, e.g. from ripgrep
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! I've seen it around. Any idea what it stands for?
They are regular threads, not green threads. But they'll spend most of their time idling. |
PR #20 with added infrastructure (TimestampTZ repr)
No description provided.