-
Notifications
You must be signed in to change notification settings - Fork 651
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FEAT-#1067: skiprows
support added for read_csv
#2607
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2607 +/- ##
===========================================
- Coverage 85.34% 26.13% -59.22%
===========================================
Files 151 145 -6
Lines 15916 14946 -970
===========================================
- Hits 13584 3906 -9678
- Misses 2332 11040 +8708
Continue to review full report at Codecov.
|
8bd2638
to
46aff86
Compare
ae42db1
to
6407fe5
Compare
Performance numbers obtained by
According to the above table, performance of |
@anmyachev @dchigarev please review this PR. |
modin/pandas/test/utils.py
Outdated
if isinstance(result.index, pandas.MultiIndex): | ||
levels_number = len(result.index.levels) | ||
new_index_data = [ | ||
result.index.get_level_values(lev_id).astype(str) | ||
for lev_id in range(levels_number) | ||
] | ||
new_index = pandas.MultiIndex.from_arrays(new_index_data) | ||
new_index.names = result.index.names | ||
result.index = new_index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not just result.index = result.index.astype(str)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we apply astytpe
to the Multiindex
object directly we will obtain error
TypeError: Setting <class 'pandas.core.indexes.multi.MultiIndex'> dtype to anything other than object is not supported
But actually in this PR Multiindex
conversion is not used, so these changes were removed as unrelated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So are you going to delete this?
@anmyachev @dchigarev do you have any comments on this PR? |
a2f73a4
to
2da042f
Compare
@anmyachev PR is ready for review. |
@dchigarev please review this |
You are right, we can drop |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amyskov it makes sense, maybe we should test the Cython addition first?
There are some test failures here as well. Can you take a look?
5dcff16
to
0878d06
Compare
@devin-petersohn , this PR was reworked in the way it was discussed offline - now array-like or callable
As it can be seen |
Hi @devin-petersohn please take a look at this PR. |
@amyskov you got some conflicts here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amyskov What is the status of this PR?
@devin-petersohn this PR was rebased and now ready for review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just need to move one minor change to a new PR, overall it looks great. I am excited to get this change in!
Unrelated change was moved into separated PR. Yes, it was really long way to get it done) |
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com> FEAT-modin-project#1067: skiprows support for read_csv added Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com> FEAT-modin-project#1067: make 'skiprows' and 'nrows' work together Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com> FEAT-modin-project#1067: range like 'skiprows' optimizations Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com> Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: rework read_csv part Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: rework json and fwf dispatchers Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: add description Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: add tests Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: mark xfailed tests Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: fix skiprows when encoding is passed Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: handle partitions with empty DataFrames Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: mark xfailed tests Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: remove unused code Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: add asv tests Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: handle narrow dataframes Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: fix asv tests Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: fix Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: remove code duplication Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: consolidate some code into functions Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: correct io test to check pre_reading part Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: remove unrelated changes Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: Apply suggestions from code review Co-authored-by: Anatoly Myachev <45976948+anmyachev@users.noreply.github.com> Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: correct asv tests Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: addressing review comments Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: Update modin/engines/base/io/text/text_file_dispatcher.py Co-authored-by: Anatoly Myachev <45976948+anmyachev@users.noreply.github.com> Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: Update modin/engines/base/io/text/csv_dispatcher.py Co-authored-by: Dmitry Chigarev <62142979+dchigarev@users.noreply.github.com> Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: rework asv tests Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: rollback parser changes Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: Update modin/engines/base/io/text/text_file_dispatcher.py Co-authored-by: Anatoly Myachev <45976948+anmyachev@users.noreply.github.com> Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: update tests Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: fix for cloud tests Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: Update modin/engines/base/io/text/text_file_dispatcher.py Co-authored-by: Dmitry Chigarev <62142979+dchigarev@users.noreply.github.com> Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: remove asv tests Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: addressing review comments Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: addressing review comments Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: correct getting dummy_df Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: remove duplicated code Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: addressing review comments Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: addressing review comments Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: handle corner case with callable skiprows Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: localize the work with rows_considered Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: remove unnecessary changes Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: correct docstrings Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: drop complex `skiprows` after data import Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: remove fallback to pandas case Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: optimize rows skipping Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: rollback unrelated changes, formatting Signed-off-by: Alexander Myskov <alexander.myskov@intel.com> FEAT-modin-project#1067: fix incorrect rebase Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>
Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>
Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>
Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>
@devin-petersohn CI is green now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @amyskov, looks great!
This PR introduces parallel implementation for
read_csv
function with array-like and callableskiprows
parameter.skiprows
is handled differently depending on the parameter type:skiprows
is integer - rows will be eliminated from partitioned data in thepartitioned_file
function and wouldn't be actually read.skiprows
is uniformly distributed array (for example [1, 2, 3]) - in this case rows within header and the first row fromskiprows
will be read in the first partition (if there are any) andskiprows
will be handled as integer.skiprows
is array, but not uniformly distributed, or callable - in this caseskiprows
will be dropped only after full dataset is imported (this is done because handling of such kind ofskiprows
is inefficient inpartitioned_file
function).Note 1: this PR is continuation of #1932.
What do these changes do?
flake8 modin
black --check modin
git commit -s