Description
When using geom_boxplot() to plot data distributions, the whiskers appear to be computed as a linear graphical multiple of the IQR, regardless of the axis scale. As a result, applying scale_x_log10() or scale_y_log10() produces incorrect whiskers and incorrect outliers.
Reproducible example
R and ggplot2 version
> R.version.string
[1] "R version 4.5.1 (2025-06-13 ucrt)"
>
> # Print ggplot2 version
> packageVersion("ggplot2")
[1] ‘4.0.0’
Demonstration of the issue
library(ggplot2)
set.seed(123)
# Generate exponential data
n <- 2000
rate <- 1
x <- (rexp(n, rate = rate) + 1) * 1e4
# Compute percentiles and IQR-based bounds
qs <- quantile(x, probs = c(0.25, 0.50, 0.75), names = FALSE)
q1 <- qs[1]
q2 <- qs[2]
q3 <- qs[3]
iqr <- IQR(x)
lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr
# Inspect computed values
stats <- data.frame(
  p25 = q1,
  p50 = q2,
  p75 = q3,
  IQR = iqr,
  lower_1p5_IQR = lower_bound,
  upper_1p5_IQR = upper_bound
)
stats
       p25     p50      p75      IQR lower_1p5_IQR upper_1p5_IQR
1 12847.19 17139.3 24405.04 11557.84      -4489.57       41741.8
# Plotting
base_plot <- ggplot(data.frame(x = x), aes(x = x)) +
  geom_boxplot() +
  labs(title = "Exponential sample (linear scale)")
p_log <- base_plot + scale_x_log10() + labs(title = "Exponential sample (log10 scale)")
base_plot

p_log

In the example above, the upper whisker should extend to the most extreme data point within the range [p75, p75 + 1.5 × IQR]. Here p75 + 1.5 × IQR = 41741.8, so the whisker should end at the largest observation less than or equal to that value. This behaviour seems correct in the linear-scale plot.
However, when using a logarithmic scale, the whisker incorrectly extends beyond 50,000, which is outside the valid range. It appears that the whisker's position is determined by a linear pixel distance corresponding to 1.5 × IQR, rather than using the correct scale transformation.
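To make the comparison concrete, here is a minimal base-R sketch of the whisker rule described above. It computes the endpoints once on the raw values and once on log10(x), mapped back to data units. The helper name whisker_ends is hypothetical (it is not a ggplot2 function), and the log10 variant is included only as a comparison point for the hypothesis, not as a statement of what ggplot2 does internally.

```r
# Whisker endpoints under the 1.5 x IQR rule: the most extreme observations
# that still lie within `coef` IQRs of the quartiles.
# `whisker_ends` is a hypothetical helper name, not part of ggplot2.
whisker_ends <- function(v, coef = 1.5) {
  qs <- quantile(v, probs = c(0.25, 0.75), names = FALSE)
  iqr <- qs[2] - qs[1]
  lower_fence <- qs[1] - coef * iqr
  upper_fence <- qs[2] + coef * iqr
  c(lower = min(v[v >= lower_fence]), upper = max(v[v <= upper_fence]))
}

# Same sample as in the reproducible example
set.seed(123)
x <- (rexp(2000, rate = 1) + 1) * 1e4

# Rule applied to the raw values
raw_ends <- whisker_ends(x)

# Same rule applied to log10(x), then mapped back to data units
log_ends <- 10^whisker_ends(log10(x))

rbind(raw = raw_ends, log10 = log_ends)
```

If the whisker drawn on the log10 plot matches the second row rather than the first, that would suggest the rule is being applied to the transformed values rather than to the raw data.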
Expected behaviour
- The upper whisker should extend to the most extreme data point within [p75, p75 + 1.5 × IQR].
- The lower whisker should extend to the most extreme data point within [p25 - 1.5 × IQR, p25].
- "Extreme point" refers to the observation farthest from the median within that range.
- The whisker range should respect the scale of the axis on which it is plotted (e.g., logarithmic or linear), not a linear pixel distance.
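The whisker positions that are actually drawn can be read from the computed layer data instead of being estimated from the image. The sketch below is one way to do that with layer_data(). As an assumption for simplicity, it uses the vertical orientation (aes(y = x) with scale_y_log10()) rather than the horizontal one from the example, so the relevant columns are ymin, lower, middle, upper and ymax.

```r
library(ggplot2)

# Same sample as in the reproducible example
set.seed(123)
x <- (rexp(2000, rate = 1) + 1) * 1e4

# Vertical variant of the log-scale plot (assumption: chosen only so that
# the layer data columns are ymin / lower / middle / upper / ymax)
p_log_y <- ggplot(data.frame(x = x), aes(y = x)) +
  geom_boxplot() +
  scale_y_log10()

# Computed boxplot statistics for the first layer; on a log10 scale
# these values are expressed in log10 units
box <- layer_data(p_log_y, 1)

# Map the five box statistics back to data units
drawn <- 10^unlist(box[c("ymin", "lower", "middle", "upper", "ymax")])
drawn
```

Comparing drawn[["ymax"]] with upper_bound from the example gives a numeric check of the upper whisker against the expected range.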