medians, etc are aggregated incorrectly #1

JoeGermuska · 2014-05-20T12:21:57Z

Here is a prescribed method for averaging medians https://www.chrispy.net/pipermail/ctpp-news/2011-January/005355.html

JoeGermuska · 2014-05-20T12:54:25Z

Other (non-median) kinds of tables that may not be amenable to the basic aggregation method:

- B19301 series (per capita income) -- could it be recalculated from the aggregated aggregate income and population?
- averages: B23020 (Mean Usual Hours Worked for Workers 16 to 64 Years) and B25010 (Average Household Size of Occupied Housing Units by Tenure)
- quintiles for income (B19080, B19081, B19082) and quartiles for housing cost groups (B25057, B25059, B25076, B25078) -- average like medians?
- gini index -- ?

Other table types that don't have a "denominator column" and so may need a bit more thought:

- "aggregate *" -- probably OK?
- unweighted sample counts -- just total (?)
- coverage rate -- just total (?)
- imputation -- just total (?)


palewire · 2014-07-31T22:02:49Z

Is this the same neck of the woods we're wandering here?

JoeGermuska · 2014-07-31T22:05:22Z

I think it is. Looks like you guys have a better line on best practices.

anthonyjpesce · 2014-07-31T22:15:21Z

We have an algorithm working that seems pretty accurate based on some initial testing: https://github.com/datadesk/latimes-calculate/blob/pareto/calculate/pareto.py

palewire · 2014-08-01T00:51:13Z

Seems being the key word in that sentence.

JoeGermuska · 2014-08-01T15:20:47Z

OK, so now that I read the code, we are not working on the same problem.

The STF3-P80 table from 1990 referenced in Steve's original SAS offers counts of households in each income bracket. I don't exactly understand why they did the exercise, since P80A offers the median income -- unless they were specifically interested to see how the formula compared to the published value.

My question is: given a list of median income by census tract in a neighborhood, can I compute the median income for the neighborhood?

palewire · 2014-08-01T17:22:49Z

The objective of the so-called "Pareto" formula is to calculate an estimated median in cases where you are combining geographies.

JoeGermuska · 2014-08-01T17:40:41Z

I missed that part. Your code refers to bins of known size, etc. Show me what I overlooked!

anthonyjpesce · 2014-08-01T17:50:49Z

The idea is to sum the counts/people for each bin in the areas you want to combine, then run the algorithm. We need it right now to calculate median income for combined counties. To combine it across multiple areas you need a table with actual counts of people, and the only way to get that is to go with the brackets (that we know of). It's absolutely an estimate, but when we run it against data like the table above -- where you have the brackets and an actual median -- the estimate comes back very accurate.
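The summing step described here can be sketched roughly as follows. This is a minimal illustration, not the code in latimes-calculate: the function name `sum_bins`, the `(lower, upper)` bracket keys, and the counts are all hypothetical.

```python
from collections import Counter


def sum_bins(areas):
    """Sum per-bracket counts across areas.

    Each area is a dict mapping a (lower, upper) bracket to a count.
    Returns a list of (lower, upper, count) tuples sorted by lower bound.
    """
    total = Counter()
    for area in areas:
        total.update(area)  # adds counts bracket by bracket
    return sorted((lower, upper, count) for (lower, upper), count in total.items())


# Hypothetical household counts by income bracket for two counties.
county_a = {(0, 25000): 100, (25000, 50000): 200, (50000, 100000): 50}
county_b = {(0, 25000): 300, (25000, 50000): 100, (50000, 100000): 150}

combined = sum_bins([county_a, county_b])
```

The combined bins are what the interpolation would then run against, rather than any per-county median.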

I've reached out to the Census for help/confirmation of the method, but I haven't heard back yet.

anthonyjpesce · 2014-08-01T18:29:30Z

OK I heard back from the Census and they pointed me to pages 16 and 17 of this document: http://www.census.gov/content/dam/Census/programs-surveys/sipp/tech-documentation/source-accuracy-statements/2008/SIPP%202008%20Panel%20Wave%2005%20-%20Core%20Source%20and%20Accuracy%20Statements.pdf

What we've been calling Pareto is actually linear interpolation (we'll have to rename that), though it seems they use both depending on the application. I think we're going to stick with linear for our purposes.

JoeGermuska · 2014-08-01T19:09:56Z

Forgive me if it's because I just ate lunch, but here's where I break down:

    "The total population is 2,976,831, so the midpoint of the
    population is 2,976,831/2=1,488,416. That value falls into
    the 35 to 44 years range, which begins with 1,473,483 counted
    in ages 0-34. There are 459,664 people in the 35-44 range.
    The midpoint is 1,488,416-1,473,483 = 14,933 people into the
    range. As a decimal, it is 14,933/459,664 = 0.032 into the range.
    The 35-44 range is 10 years wide. 35+(0.032*10) = 35.3 years"

(Doig quote, lines 39-45)
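For reference, the arithmetic in that quote can be sketched as a small function. The name `linear_median` is made up for illustration, the ages below 35 are collapsed into one bin, and the upper bound of 100 on the last bin is an assumption (it does not affect the result here).

```python
def linear_median(bins):
    """Estimate a median from binned counts by linear interpolation.

    bins: list of (lower, upper, count) tuples sorted by lower bound.
    Assumes values are spread evenly within the bin holding the midpoint.
    """
    total = sum(count for _, _, count in bins)
    midpoint = total / 2
    cumulative = 0
    for lower, upper, count in bins:
        if count > 0 and cumulative + count >= midpoint:
            # How far into this bin the midpoint falls, as a fraction.
            fraction = (midpoint - cumulative) / count
            return lower + fraction * (upper - lower)
        cumulative += count
    raise ValueError("bins contain no data")


# Figures from the quote: 1,473,483 people aged 0-34, 459,664 aged 35-44,
# and the remainder of the 2,976,831 total in an assumed 45-100 bin.
age_bins = [
    (0, 35, 1473483),
    (35, 45, 459664),
    (45, 100, 2976831 - 1473483 - 459664),
]
median_age = linear_median(age_bins)
```

Rounded to one decimal place, `median_age` matches the 35.3 years in the quote.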

With age buckets, there's a linear progression, so you can say that the midpoint is "0.032 into the range." With medians by geography, even with population counts, there's no "range" to be some fraction into. And there's no income.

Is your idea that for each explicit median value (as in table P80A in the example) we would need to identify a bucketed table (P80) and produce a "P80A equivalent"? For my purposes, I'd have to scout around in the ACS to see if I can find those.

JoeGermuska · 2014-08-01T20:02:20Z

I made a gist which may identify pairs suitable for use with these interpolation methods. And some tables I'm less clear about.

thisisbinod · 2014-09-04T02:39:18Z

I was running into the same issue. Pareto interpolation was very helpful. Thanks a lot!

fgregg · 2017-07-22T16:34:28Z

It seems we might want to distinguish between two types of functions: aggregations and reductions.

Aggregations are functions that take a sequence of values and return a value of the same semantic type. For count data, a sum is an aggregation and so is sum_moes. A sum of counts is a count for the composite, and a sum of MOEs is still an MOE for the composite.

Then there are reductions, like medians, averages, and standard deviations. An average of counts is not the count of the composite.

Things might be a little confusing because sometimes a reduction of reductions is an aggregation. A weighted average of averages, with each subunit weighted by its number of elements, is the overall average for the composite. However, this does not hold in general: the unweighted average of averages is not the overall average (except when each subunit has the same number of elements).
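A small sketch of the difference, using made-up numbers and a hypothetical helper named `weighted_mean`:

```python
def weighted_mean(means, sizes):
    """Combine subunit means into the composite mean, weighting by subunit size."""
    total = sum(sizes)
    if total == 0:
        raise ValueError("sizes sum to zero")
    return sum(m * n for m, n in zip(means, sizes)) / total


# Hypothetical subunits: one with 3 elements averaging 2.0,
# one with a single element whose value is 10.0.
means = [2.0, 10.0]
sizes = [3, 1]

unweighted = sum(means) / len(means)   # ignores subunit sizes
weighted = weighted_mean(means, sizes)  # equals the mean over all 4 elements
```

Here the four underlying values sum to 16, so the composite average is 4.0, which the weighted version recovers and the unweighted one does not.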

The average of medians is not the overall median of the composite and, unless conditions are unusual, the average of medians does not even approximate the median of the composite. The same is true of the median of medians.
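A toy illustration with invented values (in thousands of dollars), assuming we could see the individual incomes, which census tables do not provide:

```python
import statistics

# Hypothetical subunits of very different sizes.
small_tract = [20, 30, 40]   # median 30
large_tract = [100] * 9      # median 100

tract_medians = [statistics.median(small_tract), statistics.median(large_tract)]

average_of_medians = statistics.mean(tract_medians)
median_of_medians = statistics.median(tract_medians)

# The true composite median uses every underlying value.
composite_median = statistics.median(small_tract + large_tract)
```

Both reductions of the tract medians land midway between the two tracts, while the composite median sits with the larger tract.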

For calculating the median of a composite with census data, the most correct procedure will be to aggregate (i.e. sum) the binned data and then interpolate the composite median using a linear, Pareto, or some other appropriate method.
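As a rough sketch of the Pareto variant, assuming the bins have already been summed across subunits: the function name `pareto_median` and the example counts are hypothetical, and this follows the textbook log-log interpolation rather than any specific Census or latimes-calculate implementation.

```python
import math


def pareto_median(bins):
    """Estimate a median from binned counts by Pareto interpolation.

    bins: list of (lower, upper, count) tuples sorted by lower bound.
    Assumes that, within the bin holding the midpoint, the share of
    units at or above x is log-linear in log(x).
    """
    total = sum(count for _, _, count in bins)
    midpoint = total / 2
    cumulative = 0
    for lower, upper, count in bins:
        if count > 0 and cumulative + count >= midpoint:
            if lower <= 0 or cumulative + count >= total:
                # log(0) is undefined: needs a positive lower bound and
                # at least some units above the bin.
                raise ValueError("Pareto interpolation not defined for this bin")
            share_above_lower = 1 - cumulative / total
            share_above_upper = 1 - (cumulative + count) / total
            fraction = (math.log(share_above_lower) - math.log(0.5)) / (
                math.log(share_above_lower) - math.log(share_above_upper)
            )
            return math.exp(
                math.log(lower) + fraction * (math.log(upper) - math.log(lower))
            )
        cumulative += count
    raise ValueError("bins contain no data")


# Hypothetical combined household counts by income bracket.
combined_bins = [(1, 25000, 400), (25000, 50000, 300), (50000, 100000, 200)]
estimate = pareto_median(combined_bins)
```

For these counts the Pareto estimate comes out lower than the linear one would, since it assumes households thin out toward the top of the bracket instead of being spread evenly.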

JoeGermuska added the bug label May 20, 2014

JoeGermuska changed the title ~~medians are aggregated incorrectly~~ medians, etc are aggregated incorrectly May 20, 2014

JoeGermuska added a commit that referenced this issue May 20, 2014

handle -1 MOE as N/A instead of aggregating them. Relates to #1

f7f802e

JoeGermuska added a commit that referenced this issue May 22, 2014

fix implentation for handling NA MOE. (#1)

ec117f3

JoeGermuska added a commit that referenced this issue May 31, 2014

regenerate technical tables with correct (-1) MOE (see #1)

89fb9c0

JoeGermuska added a commit that referenced this issue May 31, 2014

step to implement median aggregation (see #1). Eyeball averages seem OK. Skeptical about MOE aggregation

8813b09

palewire mentioned this issue Aug 1, 2014

Pareto formula for estimating the median given only subgroup totals datadesk/latimes-calculate#44

Open

fgregg mentioned this issue May 9, 2017

Aggregate functions datamade/census_area#6

Open

andrewbt mentioned this issue Nov 8, 2017

'universe' target should not be an option for normalization CartoDB/bigmetadata#326

Closed
