
Aggregate functions #6

Open
fgregg opened this issue May 9, 2017 · 15 comments

Comments

@fgregg
Member

fgregg commented May 9, 2017

It would be great if census_area handled the aggregations of census variables correctly.

Prior art

@patwater

patwater commented Nov 21, 2018

@fgregg hey, we're looking to develop this functionality at ARGO. A key need is to aggregate census statistics like median income correctly for our California water agency partners, which have service area boundaries that don't align nicely with census boundaries. You know the story :)

We have a team of CUSP grad students looking to sprint on this mid-December to mid-January and would love your thoughts. The plan is a simple fork for the sprint, and then we can PR assuming everything works nicely :)

@fgregg
Member Author

fgregg commented Nov 21, 2018

Sounds great!

I think I would start by following the Census's guidance on aggregating statistics https://www.census.gov/content/dam/Census/library/publications/2018/acs/acs_general_handbook_2018_ch08.pdf
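For reference, the handbook's rule for derived count estimates is to sum the estimates and approximate the combined margin of error as the square root of the sum of squared MOEs. A minimal sketch (the function name is mine, not part of census_area):

```python
import math

def aggregate_counts(estimates, moes):
    """Sum ACS count estimates and approximate the combined margin of
    error using the handbook rule: MOE = sqrt(sum of squared MOEs)."""
    total = sum(estimates)
    moe = math.sqrt(sum(m ** 2 for m in moes))
    return total, moe

# e.g. two tracts with estimates 100 and 200 and MOEs 3 and 4:
# aggregate_counts([100, 200], [3, 4]) -> (300, 5.0)
```

Note this approximation assumes the component estimates are independent; it overstates uncertainty when they are correlated, which is where the variance tables come in.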

It would be very, very nice to make use of the variance data that the Census has started to make available (https://www.census.gov/programs-surveys/acs/data/variance-tables.html), but that's probably a phase II or phase III project.

I'd also recommend that you develop the aggregation code in separate files from the existing ones, as it may be nice, in the future, to pull the aggregation code into a separate library.

@dmarulli

dmarulli commented Nov 26, 2018

Hey @fgregg - I put together an initial project board for our team of students. I will be continuing to update that, but wanted to drop it in this thread for those interested.

I also wanted to run the actual technical approach by you all to increase the probability of things lining up nicely.

So right now it looks like there is a family of .geo_X() methods that can return geojson-like structures with statistics and geometries for lower-level census geographies within higher-level ones, as well as for arbitrary geometries. (Though for sf3, the naming convention changes?)

One approach came to mind that would act pretty independently of the existing codebase, which would allow us to pull things into a separate library if that ends up feeling better. In this approach, one would create a new aggregator function that takes as inputs the statistic and geometry outputs of the .geo_X() methods, along with the type of statistic to aggregate and the geometry to aggregate to. The thinking is that this last piece would be necessary to properly downscale the statistics for the partial edge geometries.

So something like:


```python
def new_aggregator_function(
        list_of_dictionaries_with_statistic_and_geometry,
        type_of_statistic,
        geometry_to_aggregate_to):

    areally_interpolated_statistics = check_for_edge_geometries_and_downscale_statistics(...)

    aggregated_statistic = aggregate(areally_interpolated_statistics, type_of_statistic)

    return aggregated_statistic
```

Any feedback there?


Lastly, on the Census Data API side of things, the table and attribute names do seem cryptic, e.g. B25034_010E. I found this reference, but it still feels pretty dense.

The human-readable table/attribute name --> code direction might be tough, but the other direction doesn't seem too far-fetched, and it would really be great if these codes were parsable for type of statistic. This could be used to help prevent statistical gotchas like trying to aggregate a median like an average. Not sure if you all have thought about this bit. It may be for down the road, though. Hopefully explicitly asking the user to provide the type of statistic is a reasonable enough solution for now.
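Splitting the code into its parts doesn't look too bad. A hypothetical parser, assuming the "TABLEID_LINE" + suffix convention (E = estimate, M = margin of error) that variables like B25034_010E appear to follow:

```python
import re

# Assumed structure: table ID (letters and digits), underscore,
# three-digit line number, and an E/M suffix for estimate vs. MOE.
VAR_RE = re.compile(r"^(?P<table>[A-Z]+\d+[A-Z]*)_(?P<line>\d{3})(?P<kind>E|M)$")

def parse_variable(code):
    """Split an ACS-style variable code into table, line, and kind."""
    m = VAR_RE.match(code)
    if m is None:
        raise ValueError(f"unrecognized variable code: {code}")
    return m.groupdict()

# parse_variable("B25034_010E")
# -> {'table': 'B25034', 'line': '010', 'kind': 'E'}
```

Knowing the table and line still wouldn't tell you whether the value is a count or a median, so a lookup keyed on table ID would be needed on top of this.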

cc: @patwater @christophertull

@fgregg
Member Author

fgregg commented Nov 26, 2018

  1. Could you tell me a little bit more about what you mean by "necessary to properly downscale the statistics for the partial edge geometries"?

  2. I think it's reasonable to have the user supply the type of aggregation in the first phase. There's a lot that could be done to infer what type of aggregation is appropriate, but that can wait.

@fgregg
Member Author

fgregg commented Nov 26, 2018

Do you mean that the desired shape can cut across census geographies, and you'll need to figure out what data to apportion?

@dmarulli

Yep, that's all I meant by that. We see that with California water district boundaries for example.
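The downscaling step amounts to computing, for each census unit, what fraction of it falls inside the target geography. A toy sketch with axis-aligned boxes (a real implementation would use shapely's polygon intersection, as census_area does internally):

```python
def overlap_fraction(tract, target):
    """Fraction of `tract` covered by `target`, where both are
    axis-aligned boxes given as (minx, miny, maxx, maxy) tuples.
    Illustrative only; real geographies need polygon intersection."""
    ax0, ay0, ax1, ay1 = tract
    bx0, by0, bx1, by1 = target
    # width and height of the overlapping region (0 if disjoint)
    w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    h = max(0.0, min(ay1, by1) - max(ay0, by0))
    tract_area = (ax1 - ax0) * (ay1 - ay0)
    return (w * h) / tract_area

# A tract half-covered by the target geography:
# overlap_fraction((0, 0, 2, 2), (1, 0, 3, 2)) -> 0.5
```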

@fgregg
Member Author

fgregg commented Nov 26, 2018

Okay, finding the intersections is a fairly expensive operation.

When we do it here:

```python
if intersection.area/area_geo.area > 0.1:
    yield area
```

It would probably be a good idea to go ahead and return the proportion of the census tract falling within the target geography, and stuff it into the statistics dictionary.

That coverage proportion is probably what you would be calculating with check_for_edge_geometries_and_downscale_statistics anyway.

If you did it that way, the aggregator would only need a sequence of statistics, a sequence of weights, and the type of statistic.
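That slimmer interface could be sketched roughly like this (the function name and the two supported statistic types are illustrative assumptions, not the library's API):

```python
def aggregate_weighted(statistics, weights, type_of_statistic):
    """Combine per-tract statistics using coverage or population weights."""
    if type_of_statistic == "count":
        # counts downscale linearly with the coverage proportion
        return sum(s * w for s, w in zip(statistics, weights))
    if type_of_statistic == "mean":
        # means need a weighted average, not a sum
        total_w = sum(weights)
        return sum(s * w for s, w in zip(statistics, weights)) / total_w
    raise ValueError(f"unsupported statistic type: {type_of_statistic}")

# aggregate_weighted([100, 200], [0.5, 1.0], "count") -> 250.0
# aggregate_weighted([10, 20], [1.0, 1.0], "mean") -> 15.0
```

Medians would need a different treatment again (e.g. the Census's pareto-interpolation approach over distribution tables), which is exactly the gotcha the type argument guards against.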

@dmarulli

Nice, thanks a lot Forest. I'll look into that.

@fgregg
Member Author

fgregg commented Nov 26, 2018

Weights are going to be important; for example, sometimes you'll want to know the size of the associated population. Anyway, I think you have enough to move forward.

@fgregg
Member Author

fgregg commented Dec 14, 2018

@dmarulli, any updates on your project?

@patwater

His student team has their kickoff call scheduled for this upcoming Friday, 12/21, so probably not.

@patwater

patwater commented Jan 28, 2019

@fgregg FYI the functionality to calculate the areal interpolation is getting pretty close, though there's some outstanding refactoring to clean up the student code. See here for the latest: https://github.com/argo-marketplace/census_area/tree/dev_branch

Do you A) have any stylistic preferences on integration to note and B) capacity to help with that integration (bit swamped on our end)? Thanks much!

@fgregg
Member Author

fgregg commented Jan 28, 2019

Hi @patwater, this looks like it's pretty far from ready to be brought in. There are some nice ideas in here, but

  1. there are many extraneous files
  2. the interface is very different from the current library's
  3. the code needs to be split out of the one giant method
  4. it's out of sync with master
  5. there are no tests

I'm sorry to hear that you don't have the bandwidth to work on the integration. Let me know when you do.

@patwater

Yeah I hear you. Part of working with grad students early in their program... will keep you posted.

@christophertull

There's some interest in reviving this (also I want my Hacktoberfest contributions ;).

@fgregg I see your reference to census-data-aggregator above. Would it make sense to use census_area to fetch the data for our census units of interest and then feed that into census-data-aggregator?
