-
Notifications
You must be signed in to change notification settings - Fork 670
FEAT-#1523: 'nrows' support to 'read_csv' added #1894
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1894 +/- ##
==========================================
- Coverage 81.80% 81.79% -0.02%
==========================================
Files 79 79
Lines 9389 9441 +52
==========================================
+ Hits 7681 7722 +41
- Misses 1708 1719 +11
Continue to review full report at Codecov.
|
|
@devin-petersohn, I guess it's ready for review. I've let the ability for Here is a time measurement of Data generatorimport numpy as np
import os
import pandas
RAND_LOW = -100
RAND_HIGH = 100
N = 2500000
M = 128
MULTILINE = True
TEST_FILENAME = os.path.abspath(
f"int_dataset-{N},{M},{RAND_LOW},{RAND_HIGH},{MULTILINE}.csv"
)
def generate_data_file(filename, row_n, col_n, multiline_rows=False):
data = {
f"col{i}": np.concatenate(
[
np.concatenate(
[
["some\nvery very very\nlong string\nwith many multilines"],
np.random.randint(RAND_LOW, RAND_HIGH, 9),
]
)
for _ in np.arange(row_n // 10)
]
)
if (i % 10 == 0 and multiline_rows)
else np.random.randint(RAND_LOW, RAND_HIGH, row_n)
for i in np.arange(col_n)
}
print("dict generated!")
df = pandas.DataFrame(data)
print("dataframe created!")
df.to_csv(filename)
print("csv ready!")
if __name__ == "__main__":
if not os.path.exists(TEST_FILENAME):
generate_data_file(TEST_FILENAME, N, M, MULTILINE)As you can see per row reading gives noticeable overhead in case of big files (2,500,000 rows). |
0a6ab80 to
21ed431
Compare
|
@anmyachev @devin-petersohn I think, that I've addressed all your comments, please take a look |
|
@dchigarev ok, what about current performance? |
the implementation is not significantly changed since that message if it's necessary I'll run these tests again at the latest commit |
Then that's enough. |
|
@anmyachev I've rerun performance tests on the latest commit and saw some degradation between that message and current results, I'll investigate |
|
@anmyachev I've fixed performance issue at the latest commit, the reason was that I've passed |
df0d78c to
b805866
Compare
So, you did skip rows twice, why tests were be passed? |
No, it was just triggering a slow path of reading data (via |
It's strange :) Ok |
b805866 to
66a49f9
Compare
|
@dchigarev look at TeamCity build and rerun if needed. |
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
66a49f9 to
b981ecf
Compare
now all tests passes |
anmyachev
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dchigarev great change!
|
@devin-petersohn it's up to you :) |
|
@devin-petersohn just a reminder to review) |
devin-petersohn
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran some testing locally, I do not see a significant change in performance from master on default behavior (very little change if any). The refactor looks good.
LGTM, thanks @dchigarev!
…roject#1894) Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Signed-off-by: Dmitry Chigarev dmitry.chigarev@intel.com
What do these changes do?
flake8 modinblack --check modingit commit -sskiprowsworks incorrect forread_csvin case of multiline rows #1924