Command to automatically prune tiles of interest #176
tilequeue/command.py
Outdated
cur.execute("""
    select x, y, z
    from tile_traffic_v4
    where (date >= dateadd(day, -30, current_date))
Do we want to make this time window configurable?
Fixed in 071cc52.
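For illustration, a minimal sketch of what a configurable window could look like. The helper name build_toi_query and the integer check are assumptions and not necessarily what 071cc52 does; the table and column names come from the query under review.

```python
def build_toi_query(days=30):
    """Return the traffic query restricted to the last `days` days.

    Hypothetical helper: `days` is interpolated into the SQL text, so it is
    coerced to a positive int first to keep arbitrary strings out of the query.
    """
    days = int(days)
    if days <= 0:
        raise ValueError("days must be a positive integer")
    return (
        "select x, y, z "
        "from tile_traffic_v4 "
        "where (date >= dateadd(day, -{days}, current_date))"
    ).format(days=days)
```

The value could then come from the command's config, with 30 as the default.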
tilequeue/command.py
Outdated
for coord_int in toi_to_remove:
    # FIXME: Think about doing this in a thread/process pool
    delete_tile_of_interest(coord_int)
We can also think about formalizing this a bit more and doing this out of process. What order of magnitude have we been managing in the past, several million?
I haven't calculated the number of tiles that would be removed. I can do that now.
I was thinking about putting together an SQS queue and a worker process to do the deletes, but that seemed heavy-handed. Maybe a lambda task to do the delete?
> I was thinking about putting together an SQS queue and a worker process to do the deletes, but that seemed heavy-handed. Maybe a lambda task to do the delete?
I was thinking the same. It's operationally heavier, but I think we'll need something like that if we want to scale past multiple processes/threads on a single instance.
Maybe a good use case for batch?
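A rough sketch of the single-instance thread pool variant mentioned in the FIXME, for comparison with the out-of-process options. The names delete_all and max_workers are hypothetical, and delete_fn stands in for delete_tile_of_interest, whose real signature may differ.

```python
from concurrent.futures import ThreadPoolExecutor


def delete_all(coord_ints, delete_fn, max_workers=8):
    """Call delete_fn(coord_int) for every coordinate using a thread pool.

    Returns the number of coordinates processed. An exception raised by
    delete_fn propagates to the caller when the results are consumed.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() schedules every call; consuming the iterator waits for the
        # calls to finish and re-raises the first failure, if any.
        return sum(1 for _ in pool.map(delete_fn, coord_ints))
```

This only helps while the deletes are I/O bound and one machine is enough; it does not address scaling past a single instance.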
Not sure if this is a comprehensive list of event sources for lambda, but thinking about it more I think that's a reasonable option. One idea is that we can split up the list into groups of 10k or so, push those groups to a location on s3, and have lambda listen to that. Lambda would perform the delete and remove that object from s3.
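The batching half of that idea might look something like the sketch below. The key prefix and the JSON payload format are assumptions made for illustration; the S3 upload and the Lambda handler are deliberately left out.

```python
import json


def chunk_coords(coord_ints, size=10000):
    """Yield successive lists of at most `size` coordinates."""
    batch = []
    for coord_int in coord_ints:
        batch.append(coord_int)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter, batch.
        yield batch


def batch_payloads(coord_ints, size=10000):
    """Yield (key, body) pairs, one JSON document per batch.

    The key layout is hypothetical; each pair would be written to S3 as
    one object for a worker to pick up and delete afterwards.
    """
    for i, batch in enumerate(chunk_coords(coord_ints, size)):
        yield "toi-prune/batch-%05d.json" % i, json.dumps(batch)
```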
I just tested this on dev and each run of the delete_tile_of_interest() function (which deletes 1000 tiles at a time) takes ~2 seconds. With ~9 million tiles to remove, that'll take ~5 hours or so. Subsequent runs should be faster.
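The estimate can be checked with a few lines of arithmetic. The function name is made up for this example; the inputs are the rough figures quoted above.

```python
def estimate_hours(total_tiles, tiles_per_call=1000, seconds_per_call=2.0):
    """Estimate the wall-clock hours for a sequential prune run."""
    # Round up so that a final partial batch still counts as one call.
    calls = (total_tiles + tiles_per_call - 1) // tiles_per_call
    return calls * seconds_per_call / 3600.0


# 9,000,000 tiles / 1000 per call = 9000 calls; 9000 * 2 s = 18000 s = 5 h
print(estimate_hours(9000000))
```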
tilequeue/tile.py
Outdated
@@ -33,6 +33,10 @@ def deserialize_coord(coord_string):
    return coord


def create_coord(x, y, z):
    return Coordinate(row=x, column=y, zoom=z)
column and row are transposed here; this should be:
Coordinate(column=x, row=y, zoom=z)
Fixed in 254ac0f.
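For reference, a small runnable sketch of the corrected mapping. The namedtuple is only a stand-in for the real Coordinate class so the example runs on its own; it is not the class tilequeue imports.

```python
from collections import namedtuple

# Stand-in for the real Coordinate class, used only to keep this runnable.
Coordinate = namedtuple("Coordinate", ["row", "column", "zoom"])


def create_coord(x, y, z):
    # x is the tile column and y is the tile row.
    return Coordinate(row=y, column=x, zoom=z)


coord = create_coord(1, 2, 3)
print(coord.column, coord.row, coord.zoom)
```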
tilequeue/command.py
Outdated
with psycopg2.connect(redshift_uri) as conn:
    with conn.cursor() as cur:
        cur.execute("""
            select x, y, z
            from tile_traffic_v4
            where (date >= dateadd(day, -{days}, current_date))
              and (z between 0 and 16)
              and (z between 10 and 16)
Should 16 here be a configurable max zoom? Right now that's 16, but with 2x2 metatiles it'd be 15.
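One way to derive the value instead of hard-coding it, as a sketch only. It assumes the relationship implied above (16 with 1x1 metatiles, 15 with 2x2) generalizes to one zoom level per doubling of the metatile size; max_toi_zoom and its parameters are hypothetical names.

```python
def max_toi_zoom(max_tile_zoom=16, metatile_size=1):
    """Highest zoom to query for a given metatile size (assumed rule).

    An N x N metatile, with N a power of two, is taken to sit log2(N)
    zoom levels above the individual tiles it contains.
    """
    if metatile_size < 1 or metatile_size & (metatile_size - 1):
        raise ValueError("metatile_size must be a power of two")
    # bit_length() - 1 is log2 for exact powers of two.
    return max_tile_zoom - (metatile_size.bit_length() - 1)
```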
Here's a (redacted) bit of YAML that I added to my tilequeue config.yaml file to run this on dev:

toi-prune:
  redshift-uri: postgresql://user:pass@localhost:5439/analytics
  days: 30
  s3:
    bucket: mapzen-tiles-dev
    date-prefix: 20170123
    path: osm
    layer: all
    format: zip
  always-include-bboxes:
    conus:
      bbox: -124.8,24.8,-66.1,49.3
      min_zoom: 11
      max_zoom: 14
    world:
      bbox: -180.0,-85.06,180.0,85.06
      min_zoom: 0
      max_zoom: 10
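To make the always-include-bboxes entries concrete, here is a sketch of how such a check could work. It is an assumption about the semantics (a tile is kept if it intersects a box and its zoom falls within that box's range), not the code in this PR. The dict mirrors the config above, and tile_bounds and always_include are hypothetical names; the tile math is the standard web mercator formula.

```python
import math

# Mirrors the always-include-bboxes section of the config above.
ALWAYS_INCLUDE = {
    "conus": {"bbox": "-124.8,24.8,-66.1,49.3", "min_zoom": 11, "max_zoom": 14},
    "world": {"bbox": "-180.0,-85.06,180.0,85.06", "min_zoom": 0, "max_zoom": 10},
}


def tile_bounds(x, y, z):
    """Return (min_lon, min_lat, max_lon, max_lat) of a web mercator tile."""
    n = 2 ** z

    def lon(col):
        return col / float(n) * 360.0 - 180.0

    def lat(row):
        return math.degrees(math.atan(math.sinh(math.pi * (1 - 2.0 * row / n))))

    # Row numbers grow southward, so row y + 1 is the southern edge.
    return lon(x), lat(y + 1), lon(x + 1), lat(y)


def always_include(x, y, z, boxes=ALWAYS_INCLUDE):
    """True if the tile intersects any configured bbox within its zoom range."""
    west, south, east, north = tile_bounds(x, y, z)
    for box in boxes.values():
        if not box["min_zoom"] <= z <= box["max_zoom"]:
            continue
        b_west, b_south, b_east, b_north = (
            float(v) for v in box["bbox"].split(","))
        if west < b_east and east > b_west and south < b_north and north > b_south:
            return True
    return False
```

With this config, every tile at zoom 10 or below is kept, and tiles over the continental US are kept at zooms 11 through 14 as well.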
Going to merge this for now and will work on supporting the 512/256 hybrid stuff later.
Adding a command that prunes the tiles of interest by checking for tiles that were frequently requested and removing tiles from the tiles of interest set (and from S3) that were not frequently requested.