-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Update documentation to midwest variables #4274
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update documentation to midwest variables #4274
Conversation
Thanks. Could you explain a bit about where you found these details of the fields? I have no knowledge to judge whether these new descriptions are true or not. |
@@ -48,16 +48,16 @@ | |||
|
|||
#' Midwest demographics | |||
#' | |||
#' Demographic information of midwest counties | |||
#' Demographic information of midwest counties from 2000 US census |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#' \item{PID}{Unique county identifier.} | ||
#' \item{county}{County name.} | ||
#' \item{state}{State to which county belongs to.} | ||
#' \item{area}{Area of county (units unknown).} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for area
, I tried looking up the land area units for this and nothing came out as something reasonable. So I've just made this units unknown
.
The other variables here (PID
, county
, and state
), I'm just spelling out more on what those variables are to be thorough.
@@ -69,17 +69,17 @@ | |||
#' \item{percasian}{Percent Asian.} | |||
#' \item{percother}{Percent other races.} | |||
#' \item{popadults}{Number of adults.} | |||
#' \item{perchsd}{} | |||
#' \item{perchsd}{Percent with high school diploma.} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not very reliable, but I found someone's homework that lists perchsd
as "the percentage of people with a high schooldiploma in each county".
I couldn't verify this, but given the surrounding other variables on education status (percollege
and percprof
), I figured this to be a high school diploma.
#' \item{percelderlypoverty}{} | ||
#' \item{inmetro}{In a metro area.} | ||
#' \item{category}{} | ||
#' \item{percprof}{Percent with professional degree.} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For percprof
, I got the same hint from that one homework assignment. And with the context of the other variables (perchsd
and percollege
), this would make more sense that this is a professional degree rather than merely "profession". The percentages of professional degrees in the data set are much lower than the percollege
variable, which is a good sanity check.
#' \item{poppovertyknown}{Population with known poverty status.} | ||
#' \item{percpovertyknown}{Percent of population with known poverty status.} | ||
#' \item{percbelowpoverty}{Percent of people below poverty line.} | ||
#' \item{percchildbelowpovert}{Percent of children below poverty line.} | ||
#' \item{percadultpoverty}{Percent of adults below poverty line.} | ||
#' \item{percelderlypoverty}{Percent of elderly below poverty line.} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
poppovertyknown
is always less than population total poptotal
. So I personally inferred that this meant how many of these individuals do we actually know their poverty status for.
This is validated with some manual calculations to check if our manual calculation of percent poverty is equal to the one in the data set.
library(ggplot2)
library(dplyr)
data("midwest")
# Find number of counties our manual calculation fails
midwest %>%
select(poptotal, poppovertyknown, percpovertyknown) %>%
mutate(manual_percpoverty = poppovertyknown / poptotal * 100) %>%
mutate(not_equal_per = !all.equal(percpovertyknown, manual_percpoverty)) %>%
pull(not_equal_per) %>%
sum()
#> [1] 0
Created on 2020-11-27 by the reprex package (v0.3.0)
The remaining variables (percbelowpoverty
, percchildbelowpovert
, percadultpoverty
, and percelderlypoverty
) were my own personal inferences on the pieces of information based on the variable name itself. The ages for poverty (child, adult, elderly) are also seen in poverty reports (PDF) as "Under 18 years", "18 to 64 years", and "65 years and over".
#' \item{percchildbelowpovert}{Percent of children below poverty line.} | ||
#' \item{percadultpoverty}{Percent of adults below poverty line.} | ||
#' \item{percelderlypoverty}{Percent of elderly below poverty line.} | ||
#' \item{inmetro}{County considered in a metro area.} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For inmetro
, this is merely giving more detail on what is in a metro area, which is the county/
#' \item{percadultpoverty}{Percent of adults below poverty line.} | ||
#' \item{percelderlypoverty}{Percent of elderly below poverty line.} | ||
#' \item{inmetro}{County considered in a metro area.} | ||
#' \item{category}{Miscellaneous.} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For category,
I could not find out what this was but I didn't want to leave this blank, so I just gave it this name. Happy to remove this if it is pointless.
@yutannihilation I've made comments on my changes on what led me to those descriptions. Most are my own inferences on what the variable is based on the variables around it and the only hint I got about one variable, All in all, this PR was more for filling out the documentation to some degree. If my speculations on what these variables mean is not enough, feel free to pick and choose which variable descriptions seem fit, or close this PR entirely. |
Thanks, I really appreciate your efforts on this, and these inferences seem reasonable. However, I'm not 100% confident if it's good to include them in ggplot2 officially, though it's ggplot2's fault that the descriptions are not included when the data was added... @hadley @clauswilke @karawoo @thomasp85 |
I think that if we don't know for sure what the variables are then we should at least mention this. Let's not pretend we know something for sure when we don't. |
Should the body of the documentation be something like this then? #' Midwest demographics
#'
#' Demographic information of midwest counties from 2000 US census.
#'
+ #' Note: data is included for illustrative purposes and may not contain accurate
+ #' data values or column descriptions. For more accurate US census data, see the
+ #' [`acs` package](https://cran.r-project.org/web/packages/acs/index.html) for
+ #' more up to date data. |
Thanks, agreed.
I think this note should be clearer that we've lost the track of the original descriptions and the current ones are based on speculation. |
Thanks everyone. I've added a note based on the feedback. Feel free to comment directly on those changes with any modifications. |
/document |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, looks good to me now.
Before I merge this, I want to ask you one more thing. Tidyverse team is trying to change the license of ggplot2 to more permissive one (the MIT license) to minimize the problems and confusions arising from license incompatibilities. If you are fine that your contribution will be merged to ggplot2 and get licensed under MIT, would you mind commenting "I agree" on #4231?
For more details, you can join the discussion on #4236, or wait for their upcoming blog post.
@yutannihilation thanks for the review. I've commented with my agreement. Thanks for your time! |
Thanks! |
Attempts to address concerns and more in #4213