
High Level Tilezen Architecture

This is a description of the Tilezen architecture as it ran in production at Mapzen. Note that it captures the various systems as they were running before RAWR tiles and global builds.

The system that emerged reflects the choices and trade-offs made given our goals. The variables that drove most of the decisions were acceptable latency (both typical and worst case), data freshness, and cost (database, compute, and storage). We wanted to serve tiles that were usually only a few hours out of date, with most of the service being reasonably fast, while accepting that a small percentage of tiles would be slow to render.

The way we chose to balance these goals was to introduce a concept called the TOI, or tiles-of-interest list. This represented the set of tiles that should be fast and carry the most up-to-date data: the tiles that were "most requested" in a particular time window. These tiles were pre-generated by tilequeue and stored on S3, giving good (fast, <100 ms) latency. Requests for tiles outside this set were served live, with poor (slow, ~450 ms) latency.
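A minimal sketch of the TOI concept, assuming tiles are identified by (z, x, y) coordinates and the list is held as a plain in-memory set (illustrative only; the production TOI was far larger and persisted elsewhere):

  # Hypothetical sketch of the TOI concept: membership decides the fast or slow path.
  toi = {
      (0, 0, 0),
      (10, 301, 384),
      (15, 9647, 12320),
  }

  def request_path(z, x, y):
      """Tiles in the TOI are pre-generated (fast); everything else renders live."""
      return "fast, pre-generated on S3" if (z, x, y) in toi else "slow, rendered on demand"

  print(request_path(15, 9647, 12320))   # fast
  print(request_path(18, 77196, 98568))  # slow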

Key Workflows

User request

Requests are satisfied by first trying a cache, then checking whether the tile is pre-generated on S3, and finally rendering it on demand if necessary. The Fastly content delivery network ran this logic in custom VCL code: the tapalcatl service returned a 404 if the tile didn't exist on S3, which signaled to Fastly that the request should be directed to tileserver to generate the tile on demand.
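The routing logic lived in Fastly VCL in production; the sketch below restates the same fallback chain in Python, with the cache, tapalcatl, and tileserver parameters standing in as hypothetical service clients rather than the real APIs.

  # Hypothetical Python restatement of the VCL fallback chain (illustrative only).
  def serve_tile(coord, cache, tapalcatl, tileserver):
      """Try the CDN cache, then pre-generated tiles on S3, then render on demand."""
      tile = cache.get(coord)            # 1. CDN edge cache
      if tile is not None:
          return tile

      tile = tapalcatl.get(coord)        # 2. pre-generated tile on S3
      if tile is not None:               #    a 404 from tapalcatl maps to None here
          cache.put(coord, tile)
          return tile

      tile = tileserver.render(coord)    # 3. slow path: render on demand
      cache.put(coord, tile)
      return tile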

NOTE: Additional VCL logic would route requests for custom vector tile layers or terrain tiles. The newer tapalcatl-py enables a "serverless" deployment but doesn't yet support all of the original's features.

Diff processing

osm2pgsql applies OpenStreetMap planet file diffs to the database and generates a tile expiry list. tilequeue (the tilequeue intersect command) reads this list, and the expired tiles that are in the TOI (tiles-of-interest list) get enqueued onto Amazon SQS for processing. A number of instances running tilequeue process read tiles from SQS, generate them, and store them on Amazon S3.
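A rough sketch of the intersect step, assuming the expiry list is a text file of "z/x/y" lines and the TOI is available as a set of coordinates; the real tilequeue command does more (coalescing, zoom clamping, batching onto SQS), and the queue interface here is an assumption.

  # Hypothetical sketch of the intersect idea: only expired tiles that are also
  # in the TOI are enqueued for re-rendering.
  def load_expired(path):
      """Parse an expiry list of 'z/x/y' lines into (z, x, y) tuples."""
      with open(path) as f:
          for line in f:
              z, x, y = (int(part) for part in line.strip().split("/"))
              yield (z, x, y)

  def intersect_and_enqueue(expiry_path, toi, queue):
      enqueued = 0
      for coord in load_expired(expiry_path):
          if coord in toi:        # outside the TOI: left to on-demand rendering
              queue.send(coord)   # inside the TOI: re-render proactively
              enqueued += 1
      return enqueued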

Initial seed run

Initially, there is no TOI (tiles-of-interest list) and there are no pre-generated tiles. Although the system can be started this way, it's typically prudent to "seed" the TOI with an initial state and pre-generate those tiles. The tilequeue seed command provides that functionality.

The configuration in the tilequeue sample config includes common scenarios and our recommendations. These can all be added with a single seed run, and are:

  • all of z0-10 (roughly 1.4 million tiles globally)
  • the Metro Extracts "cities" from z11-15 (like New York)
  • zoom 11-14 in your countries of interest (like the United States)
  • a list of the top 50k most popular tiles from Mapbox (for areas not covered in the cases above)

There is also support for adding custom bounding boxes.
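As a back-of-the-envelope check on the z0-10 bullet above, and to show what seeding a zoom range or a bounding box amounts to, here is a small sketch. The tile math follows the standard Web Mercator / slippy-map scheme; the helper names are made up and are not tilequeue's API.

  import math

  def tiles_in_zoom_range(low, high):
      """Total number of tiles from zoom `low` through `high`, inclusive."""
      return sum(4 ** z for z in range(low, high + 1))

  def lonlat_to_tile(lon, lat, z):
      """Return the (z, x, y) tile containing a lon/lat point."""
      n = 2 ** z
      x = int((lon + 180.0) / 360.0 * n)
      y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
      return (z, x, y)

  def tiles_in_bbox(min_lon, min_lat, max_lon, max_lat, z):
      """Enumerate the tiles at zoom z covering a bounding box."""
      _, x_west, y_south = lonlat_to_tile(min_lon, min_lat, z)
      _, x_east, y_north = lonlat_to_tile(max_lon, max_lat, z)
      for x in range(x_west, x_east + 1):
          for y in range(y_north, y_south + 1):   # tile y grows southward
              yield (z, x, y)

  print(tiles_in_zoom_range(0, 10))   # 1398101 tiles, roughly 1.4 million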

NOTE: seed runs typically take on the order of days to complete (assuming several million tiles are enqueued).

Tile Gardening

To manage the TOI (tiles-of-interest list), a "gardener" process (the tilequeue prune-tiles-of-interest command and its config) would run periodically (say, daily) to add and remove tiles from the TOI based on a rolling window of requests (usually around 2 months of average service usage). Newly added tiles would get enqueued for processing by tilequeue, and removed tiles would get deleted from S3.
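A simplified sketch of the gardening step, assuming request logs have already been aggregated into per-tile counts over the rolling window; the data shapes and the queue and store objects are illustrative, not the production prune-tiles-of-interest implementation.

  # Hypothetical sketch of TOI gardening: keep the most-requested tiles over the
  # rolling window, enqueue newcomers for pre-generation, delete dropped tiles from S3.
  def garden_toi(current_toi, request_counts, target_size, queue, store):
      """request_counts maps (z, x, y) -> hits over the rolling request window."""
      ranked = sorted(request_counts, key=request_counts.get, reverse=True)
      new_toi = set(ranked[:target_size])

      for coord in new_toi - current_toi:   # newly popular tiles
          queue.send(coord)                 # enqueue for pre-generation
      for coord in current_toi - new_toi:   # tiles that fell out of the window
          store.delete(coord)               # remove the pre-generated copy from S3

      return new_toi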

Cache settings by zoom range

When tiles are syndicated through a content delivery network (CDN), custom cache settings can be configured by zoom range to optimize cache hit rates and lower origin costs. Since low-zoom tile content doesn't change much, the following example keeps it around longer, and prefers to update "max zoom" 15 and 16 tiles at a faster cadence to create a virtuous cycle of edit in OSM.org, go-live in vector tiles, review, and iterate. Zoom 17+ tiles are excessive but popular for generic low-value uses, so they are kept in cache the longest.

  ttls:
    "0-10":
      ttl: 12h
      max-age: 43200
      grace: 13h
    "11-12":
      ttl: 8h
      max-age: 43200
      grace: 9h
    "13-14":
      ttl: 4h
      max-age: 43200
      grace: 5h
    "15-16":
      ttl: 2h
      max-age: 43200
      grace: 3h
    "17-20":
      ttl: 1w
      max-age: 604800
      grace: 2w
    default:
      ttl: 4h
      max-age: 43200
      grace: 5h
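A small sketch of how a server might look up the per-zoom-range settings above for an incoming tile request. The lookup helper is illustrative, and mapping ttl/grace to CDN behaviour versus max-age to the Cache-Control header is an assumption rather than the documented production behaviour.

  # Hypothetical lookup of the per-zoom-range cache settings above.
  TTLS = {
      (0, 10):  {"ttl": "12h", "max_age": 43200,  "grace": "13h"},
      (11, 12): {"ttl": "8h",  "max_age": 43200,  "grace": "9h"},
      (13, 14): {"ttl": "4h",  "max_age": 43200,  "grace": "5h"},
      (15, 16): {"ttl": "2h",  "max_age": 43200,  "grace": "3h"},
      (17, 20): {"ttl": "1w",  "max_age": 604800, "grace": "2w"},
  }
  DEFAULT = {"ttl": "4h", "max_age": 43200, "grace": "5h"}

  def cache_settings_for_zoom(z):
      """Return the cache settings for a tile at zoom z, falling back to the default."""
      for (low, high), settings in TTLS.items():
          if low <= z <= high:
              return settings
      return DEFAULT

  print(cache_settings_for_zoom(15))   # a zoom 15 tile: 2h CDN TTL, 12h max-age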

Instance Types

  • applying diffs and tilequeue intersect: r.xlarge
  • tilequeue process: ~12 c.xlarge instances (and additional autoscaling based on load)
  • tileserver: ~3 r.large instances (and additional autoscaling based on load/latency)
  • tapalcatl: ~6 m.large instances (and additional autoscaling based on load/latency). These also existed in other regions.
  • RDS instances: 5 r.2xlarge instances (1 master and 4 replicas; the master and 1 replica served tileserver, and 3 replicas served tilequeue process)