
Computing stats between groups #6476

Closed
@kieran-mace

Description

When a group-level stat requires knowledge of other groups, it would be useful for compute_group to have access to the rest of the data.

I would like to add a new computed variable, bin_prop, to StatBin that returns, for each bin, the proportion of the data in that bin that belongs to the group.

In the example below, I want to analyze the number of plays by each Lakers player. I use geom_freqpoly to show the counts, but what I really want is each player's proportion of the plays within each bin.
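To make the intended quantity concrete, here is a minimal base R sketch of bin_prop on toy data. The vectors and breaks are made up for illustration; the only point is that every group must be binned on the same breaks before each count is divided by the bin total.

```r
# Toy data (illustrative only): values and the group each value belongs to
x     <- c(1, 2, 3, 12, 13, 14, 15, 25)
group <- c("a", "a", "b", "a", "b", "b", "b", "a")

# Shared breaks, so all groups are binned identically
breaks <- seq(0, 30, by = 10)
bin <- cut(x, breaks = breaks, right = FALSE)

# Rows are groups, columns are bins
counts <- table(group, bin)

# Divide each count by its column (bin) total: this is the proposed bin_prop
bin_prop <- prop.table(counts, margin = 2)
bin_prop
```

Each column of bin_prop sums to 1, which is exactly the part that needs information from the other groups.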

Set up data

library(lubridate)
library(ggplot2) 
library(dplyr)


# set up data
laker_player_plays = lakers |> 
  tibble::as_tibble() |> 
  filter(team == 'LAL', stringr::str_length(player) > 0) |> 
  mutate(date = ymd(date))

Just counts. This is close to what I want, but I would love to use an after_stat(bin_prop) instead.

# I'd like to do this, but instead create a new computed variable `bin_prop` that shows the proportion of plays by that player
ggplot(laker_player_plays) +
  geom_freqpoly(aes(x = date,
                    color = player,
                    y = after_stat(count)
  ),
  binwidth = 31)

Side note

I do see that something equivalent can be done with geom_histogram + position = 'fill', but I do not believe this is computed in the stat step; presumably it happens later, in the position adjustment?

# geom_histogram + position = 'fill' gets partway there, but I believe this position adjustment is not computed during the stat step
ggplot(laker_player_plays) +
  geom_histogram(aes(x = date, fill = player), position = 'fill', binwidth = 31)
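One way to check where the normalisation happens is to inspect the built layer data. The sketch below uses a small made-up data frame rather than the lakers data, and assumes that layer_data() returns the data after both the stat and the position adjustment have run: the count column still holds raw counts, while ymin and ymax have been rescaled to the 0 to 1 range.

```r
library(ggplot2)

# Toy data (illustrative only)
df <- data.frame(
  x = c(1, 2, 3, 12, 13, 14, 15, 25),
  g = c("a", "a", "b", "a", "b", "b", "b", "a")
)

p <- ggplot(df, aes(x, fill = g)) +
  geom_histogram(position = 'fill', breaks = c(0, 10, 20, 30))

# Data as drawn: the stat's output plus the position adjustment
ld <- layer_data(p)
ld[, c("group", "count", "ymin", "ymax")]
```

If the counts are untouched and only ymin/ymax are normalised, the proportions are not available as an after_stat() variable, which matches the behaviour described above.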


Desired output

Here is an example of what I'd like to achieve, but using stats instead of precomputing proportion_of_plays ahead of time.

# This is the type of plot I think we should be able to create, without having to pre-calculate the proportions (should be computed in StatBin)
# calculate breaks, for solutions that can't use stat_bin

breaks = seq(min(laker_player_plays$date), max(laker_player_plays$date)+31, by = 31)

laker_player_plays |> 
  mutate(date_group = cut(date, breaks = breaks)) |>
  group_by(player, date_group) |> 
  count(name = 'plays') |> 
  group_by(date_group) |> 
  mutate(proportion_of_plays = plays/sum(plays)) |> 
  ggplot(aes(x = date_group, 
             y = proportion_of_plays,
             color = player,
             group = player)) +
  geom_point() +
  geom_line() +
  scale_y_continuous(labels=scales::percent)

Created on 2025-05-22 with reprex v2.1.1

Suggested API

ggplot(laker_player_plays) +
  geom_freqpoly(aes(x = date,
                    color = player,
                    y = after_stat(bin_prop)
  ),
  binwidth = 31)

I attempted to write a PR for this, but noticed that each group is computed independently. Is there a solution or workaround you would propose so that a PR could compute bin_prop in StatBin, given that it requires proportions across groups? I see that after_stat(prop) is available for geom_bar, so I suspect this pattern has been solved before?
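As a possible workaround outside StatBin, a custom stat can override compute_panel instead of compute_group, since compute_panel receives all groups in a panel at once. The sketch below is an assumption-laden illustration, not ggplot2's API: StatBinProp is a hypothetical name, the binning is a simplified equal-width scheme driven only by binwidth, and it does not reproduce StatBin's handling of breaks, boundaries, or centers.

```r
library(ggplot2)

# Hypothetical stat, not part of ggplot2. Overriding compute_panel (rather
# than compute_group) hands the function every group in the panel together.
StatBinProp <- ggproto("StatBinProp", Stat,
  required_aes = "x",
  default_aes = aes(y = after_stat(bin_prop)),

  compute_panel = function(data, scales, binwidth = 1) {
    # Shared, equally spaced breaks covering all groups (simplified binning)
    rng <- range(data$x, na.rm = TRUE)
    breaks <- seq(rng[1], rng[2] + binwidth, by = binwidth)
    mids <- breaks[-length(breaks)] + binwidth / 2
    bin <- factor(findInterval(data$x, breaks), levels = seq_along(mids))

    # Count per (group, bin), then divide by the total of each bin
    out <- as.data.frame(table(group = data$group, bin = bin),
                         responseName = "count")
    total <- ave(out$count, out$bin, FUN = sum)
    out$bin_prop <- ifelse(total > 0, out$count / total, 0)
    out$x <- mids[as.integer(out$bin)]
    out$group <- as.integer(as.character(out$group))
    out$bin <- NULL

    # Re-attach aesthetics that are constant within a group (e.g. colour)
    keys <- data[!duplicated(data$group), setdiff(names(data), "x"),
                 drop = FALSE]
    out <- merge(out, keys, by = "group")
    out[order(out$group, out$x), ]
  }
)
```

Under these assumptions, something like `ggplot(laker_player_plays, aes(x = date, colour = player)) + geom_line(stat = StatBinProp, binwidth = 31)` should draw one line per player with the per-bin proportions on the y axis. Whether StatBin itself should move this logic into compute_panel is a design question for the maintainers.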
