
Handle reading, grouping, and aggregating tens of millions of raster summary records #2

Open
robyngit opened this issue May 4, 2022 · 6 comments
Labels: question (Further information is requested)

robyngit (Member) commented May 4, 2022

This issue outlines a potential problem that could arise once we start running the rasterization workflow on data that spans the entire Arctic. Ideally, we should test and fix this problem before we run the whole workflow (especially if it will take many days to complete).

Background

For every GeoTIFF tile that is created during rasterization, a record of summary statistics for each band within the GeoTIFF is saved. Here is an example of the data saved for a single band in a GeoTIFF tile created at zoom level 13:

stat                                            iwp_count
bounds  [-148.84277343750014, -148.82080078125014, 70....
min                                                   0.0
max                                                   1.0
mean                                             0.000443
median                                                0.0
std                                              0.021031
var                                              0.000442
sum                                                  29.0
path    /home/thiessenbock/PDG-test/parsl-test/20-file...
tile                            Tile(x=1418, y=887, z=13)
z                                                      13

Since the stat for this band is iwp_count (a count of the centroids of each Ice Wedge Polygon within each pixel), the min gives the smallest number of polygons that any pixel contains, the max gives the largest number of polygons any pixel contains, sum gives the total number of ice wedge polygons within that tile, etc.

After creating all of the GeoTIFFs, the RasterTiler reads in these records and calculates the min and max across each z-level (using the RasterTiler.get_z_ranges method). The min and max are then mapped to the 'lowest' and 'highest' colors in the color palette for that statistic. (Unless min and/or max values are explicitly set in the configuration, in which case the calculated values are ignored.) The web tile images must use a consistent mapping of value to color so that a given color has the same meaning across all tiles at the same zoom level (e.g. 'yellow' should mean '1' for every tile at z=13).
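For reference, the aggregation step amounts to something like this (a minimal sketch, not the actual viz-raster implementation; the column names are assumptions based on the example record above):

```python
import pandas as pd

def get_z_ranges(summary: pd.DataFrame) -> dict:
    """Sketch of what RasterTiler.get_z_ranges computes: the overall
    min and max for each statistic at each z-level."""
    ranges = {}
    for (z, stat), group in summary.groupby(["z", "stat"]):
        # The smallest 'min' and largest 'max' across all tiles at this
        # z-level anchor the two ends of the color palette.
        ranges[(z, stat)] = (group["min"].min(), group["max"].max())
    return ranges
```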

The potential problem

Currently, these records are stored in a CSV file. Each time a GeoTIFF is created, rows are appended to the CSV. This strategy works well in the tests we've done so far, where we've been creating thousands or tens of thousands of GeoTIFFs. Once we start processing the entire Arctic, this CSV will grow to over 30 million rows.

Previously, we calculated that there are about 12,225,697 tiles covering land in the 50-90°N band at z=13. We also create tiles for the z=12 to z=0 range; since each parent z-level has roughly 25% as many tiles as its child level, those levels together add about a third as many tiles again, or ~4,075,239 tiles. This means we'll be creating about 16,300,936 GeoTIFFs, each with 2 bands, resulting in a CSV that has 32,601,872 rows.

This ~32-million-row CSV file will then need to be read into memory, then grouped and aggregated with pandas in order to calculate the min & max for each statistic and z-level.

Question

Will whichever hardware we run the workflow on be able to handle reading a 32-million-row CSV into memory, then grouping and aggregating it with pandas?

Possible solutions

Some possible solutions:

  1. Read and write to a database (instead of a CSV) with an SQL library like sqlite3 (serverless and self-contained, and it handles locking and concurrency automatically)
  2. Only read in and process chunks of the CSV at a time using pandas' chunking feature (see the sketch after this list)
  3. Use a library like Dask, which has an API for working with larger-than-memory datasets in parallel
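To illustrate option 2: because min and max can be folded incrementally, each chunk can be aggregated on its own and merged into running ranges, so the full table never has to fit in memory at once. A rough sketch (the file path and chunk size are placeholders):

```python
import pandas as pd

ranges = {}  # (z, stat) -> (running_min, running_max)

# Stream the summary CSV one million rows at a time, loading only the
# columns needed for the range calculation.
for chunk in pd.read_csv(
    "raster_summary.csv",  # placeholder path
    usecols=["stat", "min", "max", "z"],
    chunksize=1_000_000,
):
    agg = chunk.groupby(["z", "stat"]).agg({"min": "min", "max": "max"})
    for key, row in agg.iterrows():
        lo, hi = ranges.get(key, (float("inf"), float("-inf")))
        ranges[key] = (min(lo, row["min"]), max(hi, row["max"]))
```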

Next steps

  • Create a 32-million-row raster summary CSV (perhaps by just duplicating the rows from one of the summaries created during a test run; see the sketch below). Check the file size of the CSV.
  • Try running the RasterTiler.get_z_ranges method on this data (on a laptop? on the datateam server?). Time how long it takes, and see if any issues arise.
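For the first step, the test file can be fabricated by repeatedly re-appending the data rows of an existing summary, e.g. (both paths are hypothetical):

```python
import os

# Read a small test summary once, then repeat its data rows until the
# copy has roughly 32 million lines (header written only once).
with open("test_summary.csv") as src:
    header = src.readline()
    rows = src.read()

n_rows = rows.count("\n")
with open("big_summary.csv", "w") as dst:
    dst.write(header)
    for _ in range(32_000_000 // n_rows + 1):
        dst.write(rows)

print(f"File size: {os.path.getsize('big_summary.csv') / 1e9:.2f} GB")
```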
robyngit added the question label May 4, 2022
robyngit self-assigned this May 4, 2022
mbjones (Member) commented May 4, 2022

Thanks for the summary @robyngit. Rather than parsing the whole CSV, isn't it feasible to simply seek to the end, and append a new row on the end of the file? Seems like parsing the whole file is not needed for the append that needs to be done.

In addition to the scale problems, do you also potentially have a locking problem where many threads might be trying to update this file at once? As they seem like append-only operations, it seems pretty tractable, but one does need to avoid race conditions.

robyngit (Member, Author) commented May 4, 2022

> Rather than parsing the whole CSV, isn't it feasible to simply seek to the end, and append a new row on the end of the file? Seems like parsing the whole file is not needed for the append that needs to be done.

Yep, and this is exactly what RasterTiler does. The problem is not with writing, but with the step where we need to read the entire file and then group and aggregate it to get the overall min and max for each stat and each z-level. That only needs to happen once.

> In addition to the scale problems, do you also potentially have a locking problem where many threads might be trying to update this file at once? As they seem like append-only operations, it seems pretty tractable, but one does need to avoid race conditions.

This error hasn't come up yet, but I assume that's because the operation is so quick that a conflict is very rare. If we continue using a CSV for these records, we should lock the file while writing, like we do now when saving tiles in viz-staging.
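For what it's worth, the locked append is cheap to sketch with the stdlib (POSIX-only; viz-staging's actual approach may differ, and the function name is hypothetical):

```python
import fcntl

def append_rows(path: str, csv_rows: str) -> None:
    # Hold an exclusive advisory lock for the duration of the append so
    # concurrent workers can't interleave partial lines.
    with open(path, "a") as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        try:
            f.write(csv_rows)
            f.flush()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```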

robyngit (Member, Author) commented May 4, 2022

(Added a line to the initial comment to clarify that the problem is with the reading & processing, not the writing to CSV)

mbjones (Member) commented May 4, 2022

That lock might end up being a bottleneck to scaling. An alternative is to use the map-reduce strategy where each thread writes its own results out to an independent file with a single row (and has no dependencies on a shared file), and then a different process concatenates all of the incoming individual files into the master file. That process could even be made to keep the running cumulative stats totals as it parses and appends each line, such that the whole 32M row file would never have to be read in its entirety.
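Roughly like this (a sketch only, assuming each worker drops its one-row CSV into an incoming/ directory; the file layout and column names are placeholders):

```python
import csv
import glob

running = {}  # (z, stat) -> (running_min, running_max)

# One consolidation pass: append every incoming per-tile file to the
# master CSV while folding its values into the running ranges, so the
# full 32M-row file never has to be re-read in its entirety.
with open("raster_summary.csv", "w", newline="") as master:
    writer = None
    for path in sorted(glob.glob("incoming/*.csv")):
        with open(path, newline="") as f:
            reader = csv.DictReader(f)
            if writer is None:
                writer = csv.DictWriter(master, fieldnames=reader.fieldnames)
                writer.writeheader()
            for row in reader:
                writer.writerow(row)
                key = (row["z"], row["stat"])
                lo, hi = running.get(key, (float("inf"), float("-inf")))
                running[key] = (min(lo, float(row["min"])),
                                max(hi, float(row["max"])))
```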

robyngit (Member, Author) commented May 4, 2022

Oh, I like this idea! I think writing to individual files and appending them separately would be the way to go if we use a CSV. Is the idea to keep the running cumulative stats totals in memory, then? I switched to writing stats to disk to avoid losing the summary stats if the viz-raster process fails partway through. We'd still have to think of a way around that problem.

Still, I think it might be nice to have this in a database that is easier to query than a 30-million-row CSV file. Since we'll have information like the total number of polygons per tile, I was imagining that we could use this data in the future to create a coarser-grained visualization like the one discussed earlier:

[Screenshot: example of the coarser-grained visualization discussed earlier]

mbjones (Member) commented May 4, 2022

Yeah, in memory. The process could checkpoint its results periodically (say, every 5 minutes or something) so that recovery from a failure or restart would be possible.
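The checkpoint itself can stay tiny, since only the running ranges need to survive a restart. A sketch (the checkpoint path, format, and interval are all placeholders):

```python
import json
import time

CHECKPOINT = "z_ranges_checkpoint.json"  # placeholder path
INTERVAL = 300  # seconds, i.e. every ~5 minutes
_last = time.monotonic()

def maybe_checkpoint(running: dict) -> None:
    """Periodically dump the running (z, stat) -> (min, max) ranges."""
    global _last
    if time.monotonic() - _last >= INTERVAL:
        # JSON keys must be strings, so flatten the tuple keys.
        serializable = {f"{z}/{stat}": v for (z, stat), v in running.items()}
        with open(CHECKPOINT, "w") as f:
            json.dump(serializable, f)
        _last = time.monotonic()
```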
