add reference to Figure 1, other text changes
ks905383 committed Aug 22, 2024
1 parent 18a5834 commit 3efe325
Showing 1 changed file with 7 additions and 7 deletions.
14 changes: 7 additions & 7 deletions joss_paper/paper.md
@@ -9,7 +9,7 @@ tags:
authors:
- name: Kevin Schwarzwald
orcid: 0000-0001-8309-7124
equal-contrib: true
equal-contrib: false
corresponding: true
affiliation: "1, 2" # (Multiple affiliations must be quoted)
- name: Kerrie Geil
@@ -29,23 +29,23 @@ bibliography: paper.bib
# Summary
Scientific data is often stored on grids or rasters: gridded weather observations, interpolated pollution data, night-time lights, or other remote sensing products all approximate the continuous real world for ease of calculation, standardization, or technical limitations. However, living things don't live on grids, and rarely act or observe data on grids either. Instead, demographic or agricultural data is often collected on the county or city level, birds fly along complex migratory corridors, and rain- and watersheds follow valleys and mountains, in other words, along areas that can be described using geographic polygons.

When these raster and polygon worlds collide, as they often do in social or natural science research, data must often be aggregated between them. This aggregation must, however, be done with care. Consider a researcher who needs to aggregate temperature data from a gridded reanalysis product onto Los Angeles County, at which level they observe population or mortality statistics. The simplest way to aggregate data would be to average across every grid cell that partially overlaps with the county. However, given the complex topography of the region, a grid cell only slightly overlapping with the county, or only overlapping with the sparsely populated mountains of the county, would be unhelpful if studying the relationship between temperature and society.
When these raster and polygon worlds collide, as they often do in social or natural science research, data must often be aggregated between them (e.g., @auffhammer_using_2013). This aggregation must, however, be done with care. Consider a researcher who needs to aggregate temperature data from a gridded reanalysis product onto Los Angeles County, at which level they observe population or mortality statistics (Figure \autoref{fig1}). The simplest way to aggregate data would be to average across every grid cell that partially overlaps with the county. However, given the complex topography of the region, a grid cell only slightly overlapping with the county, or only overlapping with the sparsely populated mountains of the county, would be unhelpful if studying the relationship between temperature and society.

![Illustration of `xagg` workflow. Variables stored on a geographic grid (in this case 2-meter daily temperature from ERA5 reanalysis; @hersbach_era5_2020), a set of geographic polygons (in this case US county borders, focusing on Los Angeles County as an example), and an optional second weight on a geographic grid (in this case LandScan Day Population; @rose_landscan_2017) are inputted (panels a., c.). `xagg` calculates the relative overlap between each ERA5 grid cell and each county (panel b.). `xagg` regrids the population grid to the ERA5 grid (panel d.), and produces a set of final grid cell weights composed of both the area overlap and the population density (panel e.). For each county, these weights are used to calculate weighted averages of daily temperature (panel f.), which can then be outputted in multiple formats for further analysis.](xagg_joss_figure1.pdf)
![Illustration of `xagg` workflow. Variables stored on a geographic grid (in this case 2-meter daily temperature from ERA5 reanalysis; @hersbach_era5_2020), a set of geographic polygons (in this case US county borders, focusing on Los Angeles County as an example), and an optional second weight on a geographic grid (in this case LandScan Day Population; @rose_landscan_2017) are inputted (panels a., c.). `xagg` calculates the relative overlap between each ERA5 grid cell and each county (panel b.). `xagg` regrids the population grid to the ERA5 grid (panel d.), and produces a set of final grid cell weights composed of both the area overlap and the population density (panel e.). For each county, these weights are used to calculate weighted averages of daily temperature (panel f.), which can then be outputted in multiple formats for further analysis.\label{fig1}](xagg_joss_figure1.pdf)

Therefore, an ideal aggregation would weight not only by the area overlap between grid cells and polygons, but also optionally by other densities of relevant variables - population, area planted, etc. @[auffhammer_using_2013].
Therefore, an ideal aggregation would weight not only by the area overlap between grid cells and polygons, but also optionally by other densities of relevant variables - population, area planted, etc. [@auffhammer_using_2013].

`xagg` fulfills this need, by providing a simple interface for aggregating raster data stored in `xarray` @[hoyer_xarray_2017] `Datasets` or `DataArrays` onto polygons stored in `geopandas` @[bossche_geopandasgeopandas_2024] `geodataframes`, weighted by the fractional area overlap between the raster grid and the polygon, and optionally additionally weighted by a secondary gridded variable. Fractional area weights are generated by constructing polygons for each grid cell and using `geopandas`' `gpd.overlay()` function to calculate the overlaps between input polygons and grid cells. Aggregated data is then returned as an `xarray` `Dataset`, a `pandas` `DataFrame`, or a `geopandas` `GeoDataFrame`, depending on the user's needs.
`xagg` fulfills this need, by providing a simple interface for aggregating raster data stored in `xarray` [@hoyer_xarray_2017] `Datasets` or `DataArrays` onto polygons stored in `geopandas` [@bossche_geopandasgeopandas_2024] `geodataframes`, weighted by the fractional area overlap between the raster grid and the polygon, and optionally additionally weighted by a secondary gridded variable (see Figure \autoref{fig1} for a sample workflow). Fractional area weights are generated by constructing polygons for each grid cell and using `geopandas`' `gpd.overlay()` function to calculate the overlaps between input polygons and grid cells. Aggregated data is then returned as an `xarray` `Dataset`, a `pandas` `DataFrame`, or a `geopandas` `GeoDataFrame`, depending on the user's needs.
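
The weighting described above can be illustrated with a minimal sketch in plain Python. This is not `xagg`'s implementation: the function name `weighted_polygon_mean`, the choice to combine the two weights by multiplication, and all numbers below are assumptions made purely for illustration.

```python
def weighted_polygon_mean(values, overlap_areas, secondary_weights=None):
    """Average grid-cell values over a single polygon.

    values            -- value of the gridded variable in each cell
    overlap_areas     -- area of overlap between each cell and the polygon
    secondary_weights -- optional extra weight per cell (e.g., population
                         density); assumed here to multiply the area weight
    """
    if secondary_weights is None:
        secondary_weights = [1.0] * len(values)
    weights = [a * w for a, w in zip(overlap_areas, secondary_weights)]
    total = sum(weights)
    if total == 0:
        raise ValueError("polygon has zero total weight")
    return sum(v * w for v, w in zip(values, weights)) / total


# Hypothetical example: three grid cells touching one county.
temps = [30.0, 20.0, 10.0]       # temperature in each cell
overlaps = [1.0, 0.5, 0.1]       # overlap area with the county (arbitrary units)
population = [100.0, 10.0, 0.0]  # population density in each cell

naive = sum(temps) / len(temps)                       # every touching cell counts equally
area_weighted = weighted_polygon_mean(temps, overlaps)
pop_weighted = weighted_polygon_mean(temps, overlaps, population)
```

In this toy case the barely overlapping, unpopulated third cell pulls the naive mean down, while the area- and population-weighted means stay closer to the temperature of the cells where the county's area and people actually are.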


# Statement of need
Aggregating gridded data onto polygons is a fundamental aspect of much social and natural science research (e.g., @auffhammer_using_2013; @hsiang_estimating_2017; @carleton_valuing_2022; @mastrantonas_forecasting_2022). Historically, this process has been conducted on an ad hoc basis by individual research groups, often using simplifications such as averaging over all grid cells that overlap with a county, regardless of the size of that overlap (e.g., @schlenker_nonlinear_2009).

`xagg` fills a need for an easy, standardized, and accurate workflow for this aggregation. Working and outputting data in `xarray` and `*pandas` formats (including keeping by default relevant metadata and attributes from the inputted polygons) means `xagg` can be plugged into a wide array of existing workflows in natural and social sciences, and can easily export aggregated results in formats read by other languages often used in research, including R, QGIS, or STATA.

Though other `python` packages allow aggregation of raster data, to the authors' knowledge, none provide the same depth of functionality. `regionmask` @[hauser_regionmaskregionmask_2023]'s `mask_3D_frac_approx` function also approximates relative overlaps between grid cells and regions, for example; this however only works for regular rectangular grids (while `xagg` works with any rectangular grid), and can be less accurate than `xagg`'s. In addition, none allow easy weighting by a secondary raster variable (e.g., population density or yield), or keep polygon metadata intact.
Though other `python` packages allow aggregation of raster data, to the authors' knowledge, none provide the same depth of functionality. `regionmask` [@hauser_regionmaskregionmask_2023]'s `mask_3D_frac_approx` function also approximates relative overlaps between grid cells and regions, for example; this however only works for regular rectangular grids (while `xagg` works with any rectangular grid), and can be less accurate than `xagg`'s. In addition, none allow easy weighting by a secondary raster variable (e.g., population density or yield), or keep polygon metadata intact.

`xagg` has already been used in peer-reviewed (e.g., @pulla_grace_2023-1; @mastrantonas_forecasting_2022; @schwarzwald_importance_2022) and upcoming (e.g., @sichone_assessment_2024; @peard_combining_2023) scientific publications, has reached over 15,000 cumulative downloads across versions, and is a key component of a how-to guide for climate econometrics @[rising_practical_2024].
`xagg` has already been used in peer-reviewed (e.g., @pulla_grace_2023-1; @mastrantonas_forecasting_2022; @schwarzwald_importance_2022) and upcoming (e.g., @sichone_assessment_2024; @peard_combining_2023) scientific publications, has reached over 15,000 cumulative downloads across versions, and is a key component of a how-to guide for climate econometrics [@rising_practical_2024].

# Acknowledgements
The authors would like to thank Ryan Abernathy, Julius Busecke, Tom Nicholas, and James Rising for help in getting this project off the ground, in addition to anyone who contributed to GitHub issues or the codebase over the years.