-
Notifications
You must be signed in to change notification settings - Fork 367
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add firstindex check and fix a bug in mapcols! #2594
Conversation
@nalimilan - are you OK with these proposed changes. If yes I would finalize the PR |
@nalimilan - the issue came up again in #2604. Are you OK to finalize this PR in the form I have proposed? |
@nalimilan - this should be good to have a look at (especially corner cases in the tests). |
I had to split the tests into a separate file as OffsetArrays.jl does not run on Julia 1.0 (due to its dependency). |
CI and coverage passes cleanly now. |
@@ -454,7 +454,17 @@ function mapcols!(f::Union{Function, Type}, df::DataFrame) | |||
if len_min != len_max | |||
throw(DimensionMismatch("lengths of returned vectors must be identical")) | |||
end | |||
_columns(df) .= vs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What was the problem with this syntax? Compilation overhead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes - I try to remove broadcasting in the internal code to reduce latency a bit (also I thought that the syntax might seem too magic for casual reader).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The difference is:
current release:
julia> using DataFrames
julia> df = DataFrame(a=1)
1×1 DataFrame
Row │ a
│ Int64
─────┼───────
1 │ 1
julia> @time mapcols!(length, df)
0.003640 seconds (2.75 k allocations: 146.932 KiB)
1×1 DataFrame
Row │ a
│ Int64
─────┼───────
1 │ 1
this PR:
julia> using DataFrames
julia> df = DataFrame(a=1)
1×1 DataFrame
Row │ a
│ Int64
─────┼───────
1 │ 1
julia> @time mapcols!(length, df)
0.000015 seconds (4 allocations: 256 bytes)
1×1 DataFrame
Row │ a
│ Int64
─────┼───────
1 │ 1
so it is not a big deal (but we have to iterate vs
anyway to check firstindex
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow. I wouldn't have expected such a large difference in the number of allocations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
broadcasting is heavy (when doing the PRs that Tim Holy prompted I will review the whole codebase against using broadcasting). Of course here the difference is one-time and not that big.
@bkamins I just learned about the function It is defined here: https://github.com/JuliaLang/julia/blob/ae53238c45a0cd6dafc6e121f4daaa93143bf627/base/abstractarray.jl#L103 |
@yurivish - thank you for looking into this. I preferred custom checks for the following reasons (although the function seems to be a part of "official" API even if it is not exported):
|
Mmm, is the CI failure related to recent changes in Tables? |
Yes, I will rebase and merge after CI passes. |
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
eaf2e99
to
5281272
Compare
Thank you! |
We currently assume that vectors are 1-based in DataFrames.jl. This PR makes sure we check for that. Not checking this will lead to bugs. Also I found some small bug in
mapcols!
in the process related toAbstractRange
handling.I think we should decide on this before 1.0 release (it is better to be strict I think now; we can be more flexible later).
I will add tests, NEWS.md and update documentation with this PR after we agree that we want to add these changes.