Skip to content

Split fct_lump components to fct_prop and fct_count (new branch name) #247

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
Feb 28, 2020

Conversation

jonocarroll
Copy link
Contributor

  • Splits out functionality to dedicated functions, which are then used inside fct_lump.
  • New tests, all passing.
  • Renamed fct_lump test file.

Closes #167
Closes #142

@jonocarroll jonocarroll requested a review from hadley January 31, 2020 23:23
@hadley hadley added the tidy-dev-day 🤓 Tidyverse Developer Day rstd.io/tidy-dev-day label Jan 31, 2020
@hadley
Copy link
Member

hadley commented Jan 31, 2020

Can you take a look at the code coverage please? (with e.g. devtools::test_coverage_file())

one correction
@jonocarroll
Copy link
Contributor Author

Test coverage updated: R/lump.R - 97.12%

The only remaining lines are, I think, impossible to get to, e.g. returning from an if when no 'other' was produced but the levels were changed in fct_lump_count - this can only either return the input or a lumped output.

@hadley
Copy link
Member

hadley commented Feb 21, 2020

I made a few documentation tweaks which with left me with two questions:

  1. What does fct_lump_count() do? I can't understand the point of it from the current docs, implementation, or tests.

  2. Instead of fct_lump_n() and fct_lump_prop() should we have fct_lump_smallest() and fct_lump_largest() that take n and prop arguments? (like new dplyr::slice_min() and slice_max())

@jonocarroll
Copy link
Contributor Author

jonocarroll commented Feb 23, 2020

Did you mean to push changes?

My understanding is that fct_lump_count() lumps together levels without a particular count of appearances, e.g.

f <- c("a", "a", "a", "b", "b", "c", "d", "e", "f", "g")
table(f)
#> f
#> a b c d e f g 
#> 3 2 1 1 1 1 1
fct_lump_count(f, count = 2)
#>  [1] Other Other Other b     b     Other Other Other Other Other
#> Levels: b Other
fct_lump_count(f, count = 3)
#>  [1] a     a     a     Other Other Other Other Other Other Other
#> Levels: a Other

Created on 2020-02-23 by the reprex package (v0.3.0)

I don't know what the use-case is for such a function but the request for it was #142. There was also some discussion about making the logic >= rather than just == (still distinct from fct_lump_n() which is the n most frequently appearing) in which case one might want "lump together levels not appearing at least 3 times".

The smallest and largest refactoring makes sense to me but would require changing the internal logic for positive/negative prop and n so perhaps a new PR for that?

@huftis
Copy link

huftis commented Feb 23, 2020

I don't know what the use-case is for such a function but the request for it was #142.

The function as described in #142 seems useful. From my understanding, fct_lump_count(f, count = n) should lump together all levels that have a frequency ≤ n. But I don’t understand the behaviour of fct_lump_count() in the above example, and it does not seem useful.

@hadley
Copy link
Member

hadley commented Feb 27, 2020

@jonocarroll oops yes, pushed.

Oh, I didn't notice that you added more arguments to fct_lump(); I don't think that's necessary, so I'll revert them.

It looks like fct_lump_count() can be removed since it was never released.

@hadley hadley merged commit 4ece0e7 into tidyverse:master Feb 28, 2020
@hadley
Copy link
Member

hadley commented Feb 28, 2020

I've decided to merge what we have so far so I can get the release out today. I think we'll probably have to think holistically about fct_lump_*() again in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tidy-dev-day 🤓 Tidyverse Developer Day rstd.io/tidy-dev-day
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Splitting up fct_lump() fct_lump all with count above/below threshold
3 participants