Skip to content

Update documentation to midwest variables #4274

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

erictleung
Copy link
Contributor

Attempts to address concerns and more in #4213

@erictleung erictleung changed the title Add periods to midwest variables for consistency Update documentation to midwest variables Nov 26, 2020
@yutannihilation
Copy link
Member

Thanks. Could you explain a bit about where you found these details of the fields? I have no knowledge to judge whether these new descriptions are true or not.

@@ -48,16 +48,16 @@

#' Midwest demographics
#'
#' Demographic information of midwest counties
#' Demographic information of midwest counties from 2000 US census
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#' \item{PID}{Unique county identifier.}
#' \item{county}{County name.}
#' \item{state}{State to which county belongs to.}
#' \item{area}{Area of county (units unknown).}
Copy link
Contributor Author

@erictleung erictleung Nov 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for area, I tried looking up the land area units for this and nothing came out as something reasonable. So I've just made this units unknown.

The other variables here (PID, county, and state), I'm just spelling out more on what those variables are to be thorough.

@@ -69,17 +69,17 @@
#' \item{percasian}{Percent Asian.}
#' \item{percother}{Percent other races.}
#' \item{popadults}{Number of adults.}
#' \item{perchsd}{}
#' \item{perchsd}{Percent with high school diploma.}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not very reliable, but I found someone's homework that lists perchsd as "the percentage of people with a high schooldiploma in each county".

I couldn't verify this, but given the surrounding other variables on education status (percollege and percprof), I figured this to be a high school diploma.

#' \item{percelderlypoverty}{}
#' \item{inmetro}{In a metro area.}
#' \item{category}{}
#' \item{percprof}{Percent with professional degree.}
Copy link
Contributor Author

@erictleung erictleung Nov 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For percprof, I got the same hint from that one homework assignment. And with the context of the other variables (perchsd and percollege), this would make more sense that this is a professional degree rather than merely "profession". The percentages of professional degrees in the data set are much lower than the percollege variable, which is a good sanity check.

Comment on lines +75 to +80
#' \item{poppovertyknown}{Population with known poverty status.}
#' \item{percpovertyknown}{Percent of population with known poverty status.}
#' \item{percbelowpoverty}{Percent of people below poverty line.}
#' \item{percchildbelowpovert}{Percent of children below poverty line.}
#' \item{percadultpoverty}{Percent of adults below poverty line.}
#' \item{percelderlypoverty}{Percent of elderly below poverty line.}
Copy link
Contributor Author

@erictleung erictleung Nov 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

poppovertyknown is always less than population total poptotal. So I personally inferred that this meant how many of these individuals do we actually know their poverty status for.

This is validated with some manual calculations to check if our manual calculation of percent poverty is equal to the one in the data set.

library(ggplot2)
library(dplyr)

data("midwest")

# Find number of counties our manual calculation fails
midwest %>%
  select(poptotal, poppovertyknown, percpovertyknown) %>%
  mutate(manual_percpoverty = poppovertyknown / poptotal * 100) %>%
  mutate(not_equal_per = !all.equal(percpovertyknown, manual_percpoverty)) %>%
  pull(not_equal_per) %>%
  sum()
#> [1] 0

Created on 2020-11-27 by the reprex package (v0.3.0)

The remaining variables (percbelowpoverty, percchildbelowpovert, percadultpoverty, and percelderlypoverty) were my own personal inferences on the pieces of information based on the variable name itself. The ages for poverty (child, adult, elderly) are also seen in poverty reports (PDF) as "Under 18 years", "18 to 64 years", and "65 years and over".

#' \item{percchildbelowpovert}{Percent of children below poverty line.}
#' \item{percadultpoverty}{Percent of adults below poverty line.}
#' \item{percelderlypoverty}{Percent of elderly below poverty line.}
#' \item{inmetro}{County considered in a metro area.}
Copy link
Contributor Author

@erictleung erictleung Nov 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For inmetro, this is merely giving more detail on what is in a metro area, which is the county/

#' \item{percadultpoverty}{Percent of adults below poverty line.}
#' \item{percelderlypoverty}{Percent of elderly below poverty line.}
#' \item{inmetro}{County considered in a metro area.}
#' \item{category}{Miscellaneous.}
Copy link
Contributor Author

@erictleung erictleung Nov 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For category, I could not find out what this was but I didn't want to leave this blank, so I just gave it this name. Happy to remove this if it is pointless.

@erictleung
Copy link
Contributor Author

@yutannihilation I've made comments on my changes on what led me to those descriptions. Most are my own inferences on what the variable is based on the variables around it and the only hint I got about one variable, perchsd came from a random person's homework assignment.

All in all, this PR was more for filling out the documentation to some degree. If my speculations on what these variables mean is not enough, feel free to pick and choose which variable descriptions seem fit, or close this PR entirely.

@yutannihilation
Copy link
Member

Thanks, I really appreciate your efforts on this, and these inferences seem reasonable. However, I'm not 100% confident if it's good to include them in ggplot2 officially, though it's ggplot2's fault that the descriptions are not included when the data was added...

@hadley @clauswilke @karawoo @thomasp85
Any thoughts on adding the descriptions of midwest dataset based mainly on speculation? If there's no strong concern, I'll just merge this because maybe this is at least better than nothing and we can update this when the formal definition is found.

@clauswilke
Copy link
Member

I think that if we don't know for sure what the variables are then we should at least mention this. Let's not pretend we know something for sure when we don't.

@erictleung
Copy link
Contributor Author

I think that if we don't know for sure what the variables are then we should at least mention this. Let's not pretend we know something for sure when we don't.

Should the body of the documentation be something like this then?

#' Midwest demographics
#'
#' Demographic information of midwest counties from 2000 US census.
#' 
+ #' Note: data is included for illustrative purposes and may not contain accurate
+ #' data values or column descriptions. For more accurate US census data, see the
+ #' [`acs` package](https://cran.r-project.org/web/packages/acs/index.html) for
+ #' more up to date data.

@yutannihilation
Copy link
Member

Let's not pretend we know something for sure when we don't.

Thanks, agreed.

may not contain accurate data values or column descriptions

I think this note should be clearer that we've lost the track of the original descriptions and the current ones are based on speculation.

@erictleung
Copy link
Contributor Author

Thanks everyone. I've added a note based on the feedback. Feel free to comment directly on those changes with any modifications.

@yutannihilation
Copy link
Member

/document

Copy link
Member

@yutannihilation yutannihilation left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks good to me now.

Before I merge this, I want to ask you one more thing. Tidyverse team is trying to change the license of ggplot2 to more permissive one (the MIT license) to minimize the problems and confusions arising from license incompatibilities. If you are fine that your contribution will be merged to ggplot2 and get licensed under MIT, would you mind commenting "I agree" on #4231?

For more details, you can join the discussion on #4236, or wait for their upcoming blog post.

@erictleung erictleung mentioned this pull request Dec 1, 2020
55 tasks
@erictleung
Copy link
Contributor Author

@yutannihilation thanks for the review. I've commented with my agreement. Thanks for your time!

@yutannihilation yutannihilation merged commit e60a3b9 into tidyverse:master Dec 2, 2020
@yutannihilation
Copy link
Member

Thanks!

@erictleung erictleung deleted the add-midwest-documentation branch December 2, 2020 01:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants