ENH: Add sample function for DataFrames #997

Closed
wants to merge 2 commits into from

s-celles

"""
sample(df[, N])

Returns a (random) sample of a DataFrame

I prefer something like "Returns a (random) sample of rows from a DataFrame", to be explicit about the selection of rows.

Member

+1, but say "of N rows".

@diegozea

Thanks for doing this @scls19fr ! :) Tests are missing. It would be great to have some tests for this function.

```
julia> using RDatasets
julia> iris = dataset("datasets", "iris")
julia> sample(iris, 5)
Member

Please add a call to srand(1) to ensure the results are always the same.

@nalimilan
Member

Thanks. Please add tests for the new feature, and add the function to the list of exports.

@s-celles
Author

I haven't added sample to utils.jl but to sample.jl, because AbstractDataFrame must be defined before it is used.

@@ -0,0 +1,29 @@
import StatsBase: sample

function sample(df::AbstractDataFrame; replace::Bool=true, ordered::Bool=false)
Member

Just do N::Integer=1 in the definition below, and you'll get this one for free.

@s-celles s-celles force-pushed the master branch 3 times, most recently from 86b4c36 to dc436ee Compare June 15, 2016 05:54
"""
sample(df[, n])

Draw a random sample of `n` rows from a data frame `df` and return the result as a data frame
Member

Missing ending dot.

@nalimilan
Member

Any opinions on whether it is worth adding this function?

@nalimilan nalimilan closed this Sep 5, 2016
@nalimilan nalimilan reopened this Sep 5, 2016
│ 5 │ 5.8 │ 2.7 │ 5.1 │ 1.9 │ "virginica" │
```
"""
function sample(df::AbstractDataFrame, n::Integer=1; replace::Bool=true, ordered::Bool=false)
Member

Should extend the function from StatsBase, i.e. function StatsBase.sample(...)
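
For illustration, a minimal sketch of what that extension could look like. This is not the code from the PR (only the signature is taken from the diff above); the body, which draws row indices with StatsBase and then indexes the data frame, is an assumption. Giving `n` a default value also provides the one-argument method suggested earlier in the review. Note that defining this method outside DataFrames.jl or StatsBase.jl would be type piracy, so treat it as a sketch only.

```julia
using DataFrames
import StatsBase

# Hypothetical sketch: add a method to StatsBase.sample instead of
# defining a new, unrelated `sample` function.
function StatsBase.sample(df::AbstractDataFrame, n::Integer = 1;
                          replace::Bool = true, ordered::Bool = false)
    # Sample row indices, then materialize the selected rows as a new data frame.
    rows = StatsBase.sample(1:nrow(df), n; replace = replace, ordered = ordered)
    return df[rows, :]
end

df = DataFrame(x = 1:5, y = ["a", "b", "c", "d", "e"])
s = StatsBase.sample(df, 3; replace = false)
```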

@quinnj
Member

quinnj commented Sep 7, 2017

I think this could be useful, but needs a rebase.

@bkamins bkamins mentioned this pull request Jan 15, 2019
31 tasks
@bkamins
Member

bkamins commented Jul 24, 2019

I am closing this since we removed StatsBase.jl dependency in DataFrames.jl.

Given the cost of materializing a DataFrame relative to the cost of sampling, I think it is easy enough to use df[sample(something selecting rows), :], even if this is suboptimal performance-wise (the difference will not be noticeable).

Please reopen if you disagree.
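
A minimal sketch of the indexing approach described above, assuming StatsBase.jl is loaded separately by the user. The example data frame and the choice of `replace = false` are illustrative assumptions. The `srand(1)` call requested earlier in the review is spelled `Random.seed!(1)` in Julia 1.x.

```julia
using DataFrames
using Random
using StatsBase: sample

# Illustrative data; any data frame works the same way.
df = DataFrame(id = 1:10, value = (1:10) .^ 2)

Random.seed!(1)  # fix the seed so the draw is reproducible

# Draw 3 distinct row indices, then materialize the sampled rows.
rows = sample(1:nrow(df), 3; replace = false)
sampled = df[rows, :]
```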

@bkamins bkamins closed this Jul 24, 2019
@nalimilan
Member

nalimilan commented Jul 25, 2019

It would still be kind of nice to be able to do sample(df, n) at some point. Maybe we could define sample in DataAPI, just like we do for describe. EDIT: and then we could have a generic definition in Tables.jl.

@bkamins
Member

bkamins commented Jul 25, 2019

Exactly, DataFrame is not specific here and we will delegate all the work to StatsBase.jl sample anyway.

@anandijain
Contributor

What about for groupby objects?

groups = groupby(df, :A)
I think it would be nice to be able to sample(groups, 3) and get three of the groups.

@bkamins
Member

bkamins commented Oct 2, 2019

There are two issues here:

  1. The question is whether we want GroupedDataFrame to be a subtype of AbstractVector.
    @nalimilan - what is your current status of thinking about it?

  2. In general such sampling is not allowed as it would possibly produce duplicates in sample(groups, 3) and GroupedDataFrame does not allow for duplicate groups (and I think it should not allow for this).

In short, to sample subgroups now you should write sample(collect(groups), 3), and this will probably stay this way.
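
A minimal sketch of that workaround, assuming StatsBase.jl is loaded by the user; the example data and `replace = false` are illustrative assumptions. The result is a plain vector of sub data frames rather than a GroupedDataFrame, which is why duplicates would be harmless if sampling with replacement were used instead.

```julia
using DataFrames
using StatsBase: sample

# Illustrative data with five groups in column :A.
df = DataFrame(A = repeat(1:5, inner = 2), B = 1:10)
groups = groupby(df, :A)

# collect turns the GroupedDataFrame into a Vector of SubDataFrames,
# which StatsBase can sample from like any other vector.
picked = sample(collect(groups), 3; replace = false)
```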

@nalimilan
Member

IIRC we concluded that subtyping AbstractVector wasn't a good idea at least at this point. And indeed you're right that duplicating groups would be problematic.

@bkamins
Member

bkamins commented Oct 2, 2019

Thank you. I just wanted to make sure about the AbstractVector status (and I agree).


Successfully merging this pull request may close these issues.

sample might support DataFrames
7 participants