This repository has been archived by the owner on Jun 26, 2020. It is now read-only.

medians, etc are aggregated incorrectly #1

Open
JoeGermuska opened this issue May 20, 2014 · 14 comments

@JoeGermuska
Member

Here is a prescribed method for averaging medians https://www.chrispy.net/pipermail/ctpp-news/2011-January/005355.html

@JoeGermuska changed the title from "medians are aggregated incorrectly" to "medians, etc are aggregated incorrectly" on May 20, 2014
@JoeGermuska
Member Author

Other (non-median) kinds of tables that may not be amenable to the basic aggregation method:

  • B19301 series (per capita income) -- could it be recalculated from the summed aggregate income and the summed population?
  • averages: B23020 (Mean Usual Hours Worked for Workers 16 to 64 Years) and B25010 (Average Household Size of Occupied Housing Units by Tenure)
  • quintiles for income (B19080, B19081, B19082) and quartiles for housing cost groups (B25057, B25059, B25076, B25078) -- average like medians?
  • Gini index -- ?
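On the per capita question, here is a minimal sketch of the recalculation, assuming you can get an aggregate income figure and a population figure for each area. The tract numbers and the `per_capita` helper are made up for illustration; nothing here comes from the project's code.

```python
# Hypothetical tract-level inputs: (aggregate income in dollars, total population).
tracts = [
    (50_000_000, 2_000),  # per capita income 25,000
    (30_000_000, 500),    # per capita income 60,000
]


def per_capita(areas):
    """Recompute per capita income from summed aggregate income and summed population."""
    total_income = sum(income for income, _ in areas)
    total_pop = sum(pop for _, pop in areas)
    return total_income / total_pop


# Composite value: divide the summed numerator by the summed denominator.
combined = per_capita(tracts)

# Naive value: average the published per capita figures, ignoring tract size.
naive = sum(income / pop for income, pop in tracts) / len(tracts)

print(combined, naive)
```

With these numbers the two approaches disagree noticeably, because the smaller, richer tract gets the same weight as the larger one in the naive average.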

Other table types that don't have a "denominator column" and so may need a bit more thought:

  • "aggregate *" -- probably OK?
  • unweighted sample counts -- just total (?)
  • coverage rate -- just total (?)
  • imputation -- just total (?)

@palewire

Is this the same neck of the woods we're wandering here?

@JoeGermuska
Member Author

I think it is. Looks like you guys have a better line on best practices.

@anthonyjpesce

We have an algorithm working that seems pretty accurate based on some initial testing: https://github.com/datadesk/latimes-calculate/blob/pareto/calculate/pareto.py

@palewire

palewire commented Aug 1, 2014

Seems being the key word in that sentence.

@JoeGermuska
Member Author

OK, so now that I read the code, we are not working on the same problem.

The STF3-P80 table from 1990 referenced in Steve's original SAS offers counts of households in each income bracket. I don't exactly understand why they did the exercise, since P80A offers the median income -- unless they were specifically interested to see how the formula compared to the published value.

My question is: given a list of median income by census tract in a neighborhood, can I compute the median income for the neighborhood?

image

@palewire

palewire commented Aug 1, 2014

The objective of the so-called "Pareto" formula is to calculate an estimated median in cases where you are combining geographies.

@JoeGermuska
Member Author

I missed that part. Your code refers to bins of known size, etc. Show me what I overlooked!

@anthonyjpesce

The idea is to sum the counts/people for each bin in the areas you want to combine, then run the algorithm. We need it right now to calculate median income for combined counties. To combine it across multiple areas you need a table with actual counts of people, and the only way to get that is to go with the brackets (that we know of). It's absolutely an estimate, but when we run it against data like the table above -- where you have the brackets and an actual median -- the estimate comes back very accurate.
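The first step described here (summing the counts for each bin across the areas being combined) might look roughly like this. The county figures and the `sum_bins` helper are hypothetical, and the sketch assumes every area reports the same brackets in the same order.

```python
# Hypothetical household counts per income bracket for two counties.
# Every area must use identical brackets, listed in the same order.
county_a = [120, 340, 510, 280, 90]
county_b = [60, 150, 400, 420, 210]


def sum_bins(*areas):
    """Sum bracket counts position by position across areas."""
    lengths = {len(area) for area in areas}
    if len(lengths) != 1:
        raise ValueError("all areas must use the same brackets")
    return [sum(counts) for counts in zip(*areas)]


combined = sum_bins(county_a, county_b)
print(combined)
```

The combined list is what gets passed to the interpolation step; the per-area medians are never used.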

I've reached out to the Census for help/confirmation of the method, but I haven't heard back yet.

@anthonyjpesce

OK I heard back from the Census and they pointed me to pages 16 and 17 of this document: http://www.census.gov/content/dam/Census/programs-surveys/sipp/tech-documentation/source-accuracy-statements/2008/SIPP%202008%20Panel%20Wave%2005%20-%20Core%20Source%20and%20Accuracy%20Statements.pdf

What we've been calling Pareto is actually linear interpolation (we'll have to rename that), though it seems they use both depending on the application. I think we're going to stick with linear for our purposes.
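For anyone comparing the two, here is a rough sketch of how linear and Pareto interpolation differ inside a single bracket. It is not taken from the linked Census document or from latimes-calculate. It assumes the usual textbook forms: linear interpolation treats units as evenly spread across the bracket, while Pareto interpolation is linear in log-log space on the share of units at or above each bound. The function names, the bracket bounds, and the shares are all invented for illustration.

```python
import math


def linear_point(lower, upper, share_below_lower, share_below_upper, target=0.5):
    """Linear interpolation: assume units are spread evenly across the bracket."""
    frac = (target - share_below_lower) / (share_below_upper - share_below_lower)
    return lower + frac * (upper - lower)


def pareto_point(lower, upper, share_below_lower, share_below_upper, target=0.5):
    """Pareto interpolation: interpolate log(value) against log(share at or above).

    Requires lower > 0 and share_below_upper < 1.
    """
    above_lower = 1 - share_below_lower
    above_upper = 1 - share_below_upper
    frac = (math.log(1 - target) - math.log(above_lower)) / (
        math.log(above_upper) - math.log(above_lower)
    )
    return math.exp(math.log(lower) + frac * (math.log(upper) - math.log(lower)))


# Hypothetical bracket: $40,000 to $50,000, with 45% of households below the
# lower bound and 60% below the upper bound, so the median falls inside it.
linear = linear_point(40_000, 50_000, 0.45, 0.60)
pareto = pareto_point(40_000, 50_000, 0.45, 0.60)
print(round(linear, 2), round(pareto, 2))
```

In this example the Pareto estimate comes out a few hundred dollars below the linear one, which is the kind of gap that makes the choice of method matter for wide brackets.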

@JoeGermuska
Member Author

Forgive me if it's because I just ate lunch, but here's where I break down:

    "The total population is 2,976,831, so the midpoint of the
    population is 2,976,831/2=1,488,416. That value falls into
    the 35 to 44 years range, which begins with 1,473,483 counted
    in ages 0-34. There are 459,664 people in the 35-44 range.
    The midpoint is 1,488,416-1,473,483 = 14,933 people into the
    range. As a decimal, it is 14,933/459,664 = 0.032 into the range.
    The 35-44 range is 10 years wide. 35+(0.032*10) = 35.3 years"

(Doig quote, lines 39-45)
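The arithmetic in that quote can be checked step by step. This only restates the quoted numbers; the variable names are mine. Note that the quote rounds the midpoint up to 1,488,416, while the code keeps the half.

```python
total_population = 2_976_831
below_bracket = 1_473_483   # people counted in ages 0-34
in_bracket = 459_664        # people in the 35-44 range
bracket_start = 35          # lower edge of the bracket, in years
bracket_width = 10          # the 35-44 range is 10 years wide

midpoint = total_population / 2            # 1,488,415.5
into_bracket = midpoint - below_bracket    # about 14,933 people into the range
fraction = into_bracket / in_bracket       # about 0.032 of the way through
median_age = bracket_start + fraction * bracket_width

print(round(fraction, 3), round(median_age, 1))
```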

With age buckets, there's a linear progression, so you can say that the midpoint is "0.032 into the range." With medians by geography, even with population counts, there's no "range" to be some fraction into. And there's no income.

Is your idea that for each explicit median value (as in table P80A in the example) we would need to identify a bucketed table (P80) and use it to produce a "P80A equivalent"? For my purposes, I'd have to scout around in the ACS to see if I can find those.

@JoeGermuska
Member Author

I made a gist which may identify pairs suitable for use with these interpolation methods. And some tables I'm less clear about.

@thisisbinod

I was running into the same issue. Pareto interpolation was very helpful. Thanks a lot!

@fgregg

fgregg commented Jul 22, 2017

It seems we might want to distinguish between two types of functions: aggregations and reductions.

Aggregations are functions that take a sequence of values and return a value of the same semantic type. For count data, a sum is an aggregation, and so is sum_moes. A sum of counts is a count for the composite; a sum of MOEs is still an MOE for the composite.
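To make the aggregation idea concrete, here is a small sketch. The thread does not show how `sum_moes` is implemented, so this assumes the root-sum-of-squares approximation the Census Bureau documents for the margin of error of a sum of estimates. `combine_moes` is a hypothetical stand-in, not the project's function, and the numbers are made up.

```python
import math


def combine_moes(moes):
    """Approximate the MOE of a sum of estimates as the root sum of squared MOEs.

    Assumption: this mirrors the Census Bureau's published approximation; the
    project's own sum_moes may differ.
    """
    return math.sqrt(sum(m * m for m in moes))


# Hypothetical count estimates and their margins of error for two tracts.
counts = [1200, 800]
moes = [30, 40]

total = sum(counts)             # a count for the composite
total_moe = combine_moes(moes)  # an MOE for the composite

print(total, total_moe)
```

Both results keep the semantic type of their inputs: a count in, a count out; an MOE in, an MOE out.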

Then there are reductions, like medians, averages, and standard deviations. An average of counts is not the count of the composite.

Things might be a little confusing because sometimes a reduction of reductions is an aggregation. A weighted average of averages can be the overall average for the composite. However, this is not true in general: the unweighted average of averages is not the overall average (except when each subunit has the same number of elements).
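A toy illustration of the weighted versus unweighted point, using invented numbers in the spirit of an average-household-size table. The weights are the number of households in each subunit.

```python
# Hypothetical subunits: (average household size, number of households).
subunits = [(2.0, 100), (4.0, 300)]

# Unweighted: every subunit counts the same, regardless of size.
unweighted = sum(avg for avg, _ in subunits) / len(subunits)

# Weighted: rebuild the total number of people, then divide by total households.
total_people = sum(avg * n for avg, n in subunits)
total_households = sum(n for _, n in subunits)
weighted = total_people / total_households

print(unweighted, weighted)
```

The weighted result matches what you would get from the pooled households; the unweighted one understates it because the larger subunit is under-represented.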

The average of medians is not the overall median of the composite, and unless conditions are unusual, the average of medians does not even approximate the median of the composite. The same is true of the median of medians.
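A quick way to convince yourself, using two made-up tracts of unequal size where the individual values happen to be known (which is never the case with published census tables):

```python
from statistics import mean, median

# Hypothetical individual incomes (in thousands) for two tracts.
tract_a = [20, 22, 25, 27, 30]
tract_b = [40, 90, 95, 100, 110, 120, 130, 140, 150]

tract_medians = [median(tract_a), median(tract_b)]

average_of_medians = mean(tract_medians)
median_of_medians = median(tract_medians)
composite_median = median(tract_a + tract_b)

print(tract_medians, average_of_medians, median_of_medians, composite_median)
```

Here both shortcuts land well below the composite median, because the larger, higher-income tract contributes more of the pooled values than its single median suggests.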

For calculating the median of a composite with census data, the most correct procedure is to aggregate (i.e. sum) the binned data and then interpolate the composite median using linear, Pareto, or some other appropriate interpolation.
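Put together, the procedure could be sketched as below, using linear interpolation for the last step. This is an illustration under assumptions, not the latimes-calculate implementation: `interpolated_median`, the bracket bounds, and the counts are all hypothetical, and an open-ended top bracket is represented with `None` as its upper bound.

```python
def interpolated_median(bounds, counts):
    """Estimate a median from bracket bounds and summed bracket counts.

    bounds: list of (lower, upper) pairs, one per bracket, in ascending order;
            upper may be None for an open-ended top bracket.
    counts: summed counts per bracket, in the same order.
    """
    total = sum(counts)
    if total <= 0:
        raise ValueError("no observations to take a median of")
    midpoint = total / 2
    cumulative = 0
    for (lower, upper), count in zip(bounds, counts):
        if count and cumulative + count >= midpoint:
            if upper is None:
                raise ValueError("median falls in the open-ended top bracket")
            # Linear interpolation: how far into this bracket the midpoint sits.
            return lower + (midpoint - cumulative) / count * (upper - lower)
        cumulative += count
    raise ValueError("bounds and counts do not line up")


# Hypothetical income brackets and household counts for two areas.
bounds = [(0, 10_000), (10_000, 25_000), (25_000, 50_000),
          (50_000, 100_000), (100_000, None)]
area_a = [10, 20, 40, 20, 10]
area_b = [30, 30, 20, 10, 10]

# Step 1: aggregate (sum) the binned data. Step 2: interpolate.
combined = [a + b for a, b in zip(area_a, area_b)]
estimate = interpolated_median(bounds, combined)

print(combined, round(estimate, 2))
```

Swapping in Pareto interpolation would only change the line that computes the return value inside the loop; the summing step stays the same.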
