add firstindex check and fix a bug in mapcols! #2594

Merged: 10 commits, Jan 24, 2021

bkamins
Member

@bkamins bkamins commented Jan 8, 2021

We currently assume that vectors are 1-based in DataFrames.jl. This PR makes sure we check for that, since not checking will lead to bugs. In the process I also found a small bug in mapcols! related to AbstractRange handling.

I think we should decide on this before the 1.0 release (I think it is better to be strict now; we can be more flexible later).

I will add tests, a NEWS.md entry, and documentation updates to this PR once we agree that we want these changes.
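
For illustration, the kind of check described above could look roughly like the sketch below. It is not the code from this PR: `check_one_based` and `ZeroBasedVector` are hypothetical names, and the wrapper type exists only to provide a vector whose first index is not 1 without depending on OffsetArrays.jl.

```julia
# Hypothetical zero-based vector, defined here only to exercise the check.
struct ZeroBasedVector{T} <: AbstractVector{T}
    data::Vector{T}
end

Base.size(v::ZeroBasedVector) = size(v.data)
# Report indices 0:(n - 1). Base.IdentityUnitRange is unexported.
Base.axes(v::ZeroBasedVector) = (Base.IdentityUnitRange(0:length(v.data) - 1),)
Base.getindex(v::ZeroBasedVector, i::Int) = v.data[i + 1]

# Hypothetical helper: accept a column only if it uses 1-based indexing.
function check_one_based(col::AbstractVector)
    if firstindex(col) != 1
        throw(ArgumentError("columns must use 1-based indexing, but got " *
                            "a vector whose first index is $(firstindex(col))"))
    end
    return col
end

check_one_based([1, 2, 3])                      # passes and returns the vector
# check_one_based(ZeroBasedVector([1, 2, 3]))   # would throw an ArgumentError
```

Being strict here means a non-1-based vector is rejected up front instead of being silently misindexed later.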

@bkamins bkamins added this to the 1.0 milestone Jan 8, 2021
@bkamins bkamins requested a review from nalimilan January 8, 2021 11:27
@bkamins
Member Author

bkamins commented Jan 17, 2021

@nalimilan - are you OK with these proposed changes? If so, I will finalize the PR.

@bkamins
Member Author

bkamins commented Jan 20, 2021

@nalimilan - the issue came up again in #2604. Are you OK to finalize this PR in the form I have proposed?

@bkamins bkamins marked this pull request as ready for review January 21, 2021 14:20
@bkamins
Member Author

bkamins commented Jan 21, 2021

@nalimilan - this should be ready for a look (especially the corner cases in the tests).

@bkamins
Member Author

bkamins commented Jan 21, 2021

I had to split the tests into a separate file as OffsetArrays.jl does not run on Julia 1.0 (due to its dependency).

@bkamins
Member Author

bkamins commented Jan 22, 2021

CI and coverage pass cleanly now.

@@ -454,7 +454,17 @@ function mapcols!(f::Union{Function, Type}, df::DataFrame)
if len_min != len_max
throw(DimensionMismatch("lengths of returned vectors must be identical"))
end
_columns(df) .= vs
Member

What was the problem with this syntax? Compilation overhead?

Member Author

Yes - I am trying to remove broadcasting from the internal code to reduce latency a bit (I also thought that the syntax might seem too magical to a casual reader).

Member Author

The difference is:

current release:

julia> using DataFrames

julia> df = DataFrame(a=1)
1×1 DataFrame
 Row │ a     
     │ Int64 
─────┼───────
   1 │     1

julia> @time mapcols!(length, df)
  0.003640 seconds (2.75 k allocations: 146.932 KiB)
1×1 DataFrame
 Row │ a     
     │ Int64 
─────┼───────
   1 │     1

this PR:

julia> using DataFrames

julia> df = DataFrame(a=1)
1×1 DataFrame
 Row │ a     
     │ Int64 
─────┼───────
   1 │     1

julia> @time mapcols!(length, df)
  0.000015 seconds (4 allocations: 256 bytes)
1×1 DataFrame
 Row │ a     
     │ Int64 
─────┼───────
   1 │     1

so it is not a big deal (but we have to iterate vs anyway to check firstindex).
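
As a rough sketch of the idea (not the actual diff in this PR), the broadcasted assignment can be replaced with an explicit loop that also checks `firstindex` while it iterates. Here `assign_columns!` is a hypothetical helper, and a plain `Vector{AbstractVector}` stands in for the data frame's internal column storage.

```julia
# Hypothetical helper: copy the new vectors into the column storage one by
# one, checking each vector's first index along the way.
function assign_columns!(cols::Vector{AbstractVector}, vs::Vector{<:AbstractVector})
    length(cols) == length(vs) ||
        throw(DimensionMismatch("number of columns and new vectors must match"))
    for (i, v) in enumerate(vs)
        firstindex(v) == 1 ||
            throw(ArgumentError("columns must use 1-based indexing"))
        cols[i] = v   # plain assignment instead of `cols .= vs`
    end
    return cols
end

cols = AbstractVector[[1, 2], [3, 4]]
assign_columns!(cols, [[10, 20], [30, 40]])
```

Since the loop is needed for the check anyway, doing the assignment inside it avoids a separate broadcasting call.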

Member

Wow. I wouldn't have expected such a large difference in the number of allocations.

Member Author

Broadcasting is heavy (when doing the PRs that Tim Holy prompted, I will review the whole codebase for uses of broadcasting). Of course, here the difference is one-time and not that big.

@yurivish

@bkamins I just learned about the function Base.require_one_based_indexing – are you aware of it? It's mentioned in the devdocs here and I'm not sure if it's sanctioned for use in user code, but the devdocs are written as if it is.

It is defined here: https://github.com/JuliaLang/julia/blob/ae53238c45a0cd6dafc6e121f4daaa93143bf627/base/abstractarray.jl#L103

@bkamins
Member Author

bkamins commented Jan 22, 2021

@yurivish - thank you for looking into this. I preferred custom checks for the following reasons (although the function seems to be part of the "official" API even though it is not exported):

  • it is not available in Julia 1.0, which we support
  • we can throw a custom error message if we handle the check ourselves
  • it is easy to make this check
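
To make the trade-off concrete, here is a small sketch comparing the two approaches. The names `require_one_based`, `builtin_check`, and `ZeroBasedVector` are hypothetical and used only for illustration. The built-in call is guarded with `isdefined` because, as noted above, `Base.require_one_based_indexing` is missing on Julia 1.0.

```julia
# Hypothetical zero-based vector used only to trigger the checks.
struct ZeroBasedVector{T} <: AbstractVector{T}
    data::Vector{T}
end
Base.size(v::ZeroBasedVector) = size(v.data)
Base.axes(v::ZeroBasedVector) = (Base.IdentityUnitRange(0:length(v.data) - 1),)
Base.getindex(v::ZeroBasedVector, i::Int) = v.data[i + 1]

# Custom check: works on any Julia version and can name the offending column.
function require_one_based(v::AbstractVector, name::Symbol)
    firstindex(v) == 1 && return v
    throw(ArgumentError("column :$name must use 1-based indexing, " *
                        "but its first index is $(firstindex(v))"))
end

# Built-in check: only called when the running Julia version defines it.
function builtin_check(v::AbstractVector)
    isdefined(Base, :require_one_based_indexing) || return nothing
    Base.require_one_based_indexing(v)   # throws a generic ArgumentError
    return nothing
end

require_one_based([1, 2, 3], :a)   # passes
builtin_check([1, 2, 3])           # passes
```

Both versions raise an `ArgumentError` for a vector such as `ZeroBasedVector([1, 2, 3])`, but only the custom one controls the wording of the message.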

@nalimilan
Member

Mmm, is the CI failure related to recent changes in Tables?

@bkamins
Member Author

bkamins commented Jan 24, 2021

Yes, I will rebase and merge after CI passes.

@bkamins bkamins merged commit f6eb4e3 into JuliaData:main Jan 24, 2021
@bkamins bkamins deleted the check_firstindex branch January 24, 2021 22:30
@bkamins
Member Author

bkamins commented Jan 24, 2021

Thank you!

3 participants