FEAT-#1205: melt implementation #1689
Conversation
Codecov Report
```
@@            Coverage Diff             @@
##           master    #1689      +/-   ##
==========================================
- Coverage   82.32%   81.75%   -0.58%
==========================================
  Files          77       77
  Lines        8189     8867     +678
==========================================
+ Hits         6742     7249     +507
- Misses       1447     1618     +171
```
Continue to review full report at Codecov.
devin-petersohn
left a comment
Do you have performance on this? Looking through the implementation, it might be expensive.
@devin-petersohn Thanks for your comments. The first implementation was indeed extremely slow because of inserting a column with … As we can see from the results, the native Modin frame function … Here are the results without restoring pandas-order: …

How it was measured:

```python
import numpy as np
import string
import csv
import os
from timeit import default_timer as timer

IS_PANDAS = True
RAND_LOW = -100
RAND_HIGH = 100
N = 500000
M = 45
TEST_FILENAME = os.path.abspath(f"int_dataset-{N},{M},{RAND_LOW},{RAND_HIGH}.csv")


def generate_data_file(filename, row_n, col_n):
    letters = list(string.ascii_letters)
    columns = [
        "".join(np.random.choice(letters) for _ in np.arange(10))
        for _ in np.arange(col_n)
    ]
    with open(filename, "w") as file:
        writer = csv.writer(file, delimiter=",")
        writer.writerow(columns)
        for _ in np.arange(row_n):
            writer.writerow(np.random.randint(RAND_LOW, RAND_HIGH, col_n))


if __name__ == "__main__":
    import pandas
    import modin.pandas as pd

    if not os.path.exists(TEST_FILENAME):
        generate_data_file(TEST_FILENAME, N, M)

    if IS_PANDAS:
        pd_df = pandas.read_csv(TEST_FILENAME)
        t1_pd = timer()
        pd_df.melt()
        t2_pd = timer()
        print("Pandas:", t2_pd - t1_pd)

    md_df = pd.read_csv(TEST_FILENAME)
    t1_md = timer()
    md_df.melt()
    t2_md = timer()
    print("Modin:", t2_md - t1_md)
```

What is restoring pandas-order?

For applying … For example, we have a DataFrame split into row partitions like this: … and then we apply … Restoring the order of the right result to match the left result is what I mean by restoring pandas-order. Restoring the order is the most expensive operation, because we must have knowledge about the whole column axis, which is the longest one in the case of melt.

So, @devin-petersohn, is it acceptable for Modin to output shuffled data in such functions, or does Modin have a pattern for fast reordering that I don't know about?
@dchigarev maybe you can find some useful things in #725
devin-petersohn
left a comment
_apply_full_axis is asynchronous. Timing before and after isn't enough; you also have to print the result to block. It is 4+ times slower than a MapFunction.
I left some questions for you.
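A minimal sketch of the measurement pattern described above, assuming a placeholder CSV path; repr(result) is used as one way to force the deferred result before the timer stops, following the suggestion to print the result. This is illustrative, not the exact benchmark from the thread.

```python
import modin.pandas as pd
from timeit import default_timer as timer

df = pd.read_csv("int_dataset.csv")  # placeholder path for the benchmark data

# Naive timing: mostly measures how long it takes to *submit* the work,
# because the operation is scheduled asynchronously.
t0 = timer()
result = df.melt()
t1 = timer()
print("submit only:", t1 - t0)

# Blocking timing: printing/repr-ing the result forces the computation
# to finish before the second timestamp is taken.
t0 = timer()
result = df.melt()
repr(result)
t1 = timer()
print("with materialization:", t1 - t0)
```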
@devin-petersohn I've reimplemented melt. I've also fixed a strange bug in …
devin-petersohn
left a comment
@dchigarev Yes, if there is no work being done, the column partition is still shuffled together to the same worker, which is some significant work.
Thanks for running the performance numbers on this. Is there a reason this cannot be done by applying the function to row partitions instead of column partitions?
@devin-petersohn It could be implemented by applying the function to row partitions, but there are two details: …
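Setting those details aside, here is a rough plain-pandas sketch of what the row-partition ("map") variant could look like. The _original_row column is a hypothetical bookkeeping device for this illustration, not something Modin exposes; it only exists so the original order could be rebuilt afterwards.

```python
import numpy as np
import pandas as pd

def melt_row_partition(part: pd.DataFrame) -> pd.DataFrame:
    # Melt one partition on its own, remembering each row's original label.
    melted = part.reset_index().melt(id_vars="index")
    return melted.rename(columns={"index": "_original_row"})

df = pd.DataFrame({"a": range(6), "b": range(6, 12)})
row_partitions = np.array_split(df, 3)  # stand-in for Modin's row partitions

# Each partition is processed independently (embarrassingly parallel),
# and the pieces are simply concatenated afterwards.
result = pd.concat(
    [melt_row_partition(p) for p in row_partitions], ignore_index=True
)
print(result)
```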
Ah, I forgot about that, sorry. Reordering the rows can happen via sort, which isn't implemented scalably yet but should be soon. It is okay to rely on functionality that isn't implemented yet. We could also perform the sort first, which would probably perform better. What do you think about this?
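As a rough illustration of that idea, continuing the plain-pandas sketch above (the _original_row helper is still hypothetical): sorting by the melted column's position and then by the original row label reproduces the order pandas.melt emits.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": range(6), "b": range(6, 12)})
row_partitions = np.array_split(df, 3)

# Per-partition melt that keeps the original row label (see previous sketch).
melted = pd.concat(
    [p.reset_index().melt(id_vars="index").rename(columns={"index": "_original_row"})
     for p in row_partitions],
    ignore_index=True,
)

# pandas orders melt output column by column, and by source row within each
# column, so sorting on (column position, original row) restores that order.
column_position = {name: i for i, name in enumerate(df.columns)}
restored = (
    melted.assign(_col=melted["variable"].map(column_position))
    .sort_values(["_col", "_original_row"])
    .drop(columns=["_col", "_original_row"])
    .reset_index(drop=True)
)

print(restored.equals(df.melt()))  # True
```

Whether such a sort scales well is exactly the open question in this exchange; the sketch only shows what the reordering key would have to encode.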
@devin-petersohn what do you mean by "perform the sort first"? The problem is that after applying … And another question: I don't know how data scientists use …
@dchigarev It will scale better if operations are performed across the columns |
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
@devin-petersohn I've reimplemented melt … Also added a warning about the incorrect order of rows and created an issue so we can fix this later (#1793)
Thanks @dchigarev! Did those times include a print statement to ensure computation was finished?
@devin-petersohn It's shameful, but it seems that I forgot to add …
@dchigarev 500,000 rows is not very many, so that speedup is quite good, especially on 128 columns.
@dchigarev The implementation looks good. I have retriggered GitHub Actions because it showed a -40% coverage drop.
@devin-petersohn now the codecov check is okay, I guess.
devin-petersohn
left a comment
Thanks @dchigarev, looks great!
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
What do these changes do?
Implementation of pandas.DataFrame.melt + apply_full_axis_select_indices optimization (a short usage sketch follows the checklist below).

Checklist:
- flake8 modin
- black --check modin
- git commit -s
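For context on the feature itself, a small usage sketch of melt with Modin; the data here is made up for illustration. melt unpivots a wide frame into long format, producing one row per (id, column) pair.

```python
import modin.pandas as pd

# A small "wide" frame: one row per student, one column per subject.
grades = pd.DataFrame(
    {"student": ["ann", "bob"], "math": [90, 75], "physics": [85, 80]}
)

# melt turns it into "long" format: one row per (student, subject) pair.
long_form = grades.melt(id_vars="student", var_name="subject", value_name="grade")
print(long_form)
#   student  subject  grade
# 0     ann     math     90
# 1     bob     math     75
# 2     ann  physics     85
# 3     bob  physics     80
```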