Support incremental loading of daily prod #13

Open
sbailey opened this issue Nov 29, 2023 · 8 comments
@sbailey
Contributor

sbailey commented Nov 29, 2023

Documenting a conversation with @akremin and @weaverba137 about how to support incremental loading of the daily production:

  • We'd start with a bulk load of the existing production
    • starting on 20201214 (first night of sv1/fuji) would simplify some of the evolution of the early data model, but there are likely cases where things changed, and we'll need to rediscover them empirically and decide how to handle them
  • Each day we'd perform an incremental update, which includes
    • New data that has just been processed (sometimes new tiles, sometimes new exposures for pre-existing tiles)
    • Reprocessed tiles (could be days, weeks, or even months since they were initially processed and loaded)
  • See desispec#2143 ("Add UPDATED column to prod tiles file"), which adds a new UPDATED column with a timestamp to daily/tiles-daily.csv to help identify which tiles need to be updated.
  • daily only includes tiles/cumulative redshifts, not healpix redshifts
  • The first time a tile is observed, the targets and fiberassign tables can be loaded; they don't change with subsequent observations. New observations of the same tile would trigger additional entries in the exposures table, and updated (replaced) entries in the redshifts table.
  • The same cronjob that updates tiles-daily.* could also trigger the DB load, but structurally we should keep the DB loading as a separate operation that could be run independently.
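
As a rough illustration of how the UPDATED column could drive the daily incremental step, here is a minimal sketch that picks out tiles whose timestamp is later than the previous load. The ISO 8601 timestamp format, the `TILEID` column name, the helper `tiles_to_update`, and the sample rows are all assumptions made for this example, not the actual desispec format or data.

```python
import csv
import io
from datetime import datetime


def tiles_to_update(csv_text, last_load):
    """Return TILEIDs whose UPDATED timestamp is later than last_load.

    Assumes UPDATED holds an ISO 8601 timestamp; the real column format
    may differ.
    """
    selected = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        updated = datetime.fromisoformat(row["UPDATED"])
        if updated > last_load:
            selected.append(int(row["TILEID"]))
    return selected


# Made-up rows for illustration only; these are not real tiles.
sample = """TILEID,LASTNIGHT,UPDATED
1000,20231120,2023-11-21T08:00:00
1001,20231127,2023-11-28T08:00:00
1002,20231128,2023-11-29T08:00:00
"""

last_load = datetime(2023, 11, 27, 12, 0, 0)
print(tiles_to_update(sample, last_load))  # [1001, 1002]
```

Because the selection depends only on the tiles file and a stored "last load" time, it could run from the same cronjob or independently of it, in line with keeping the DB load a separate operation.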
@weaverba137
Member

> daily only includes tiles/cumulative redshifts, not healpix redshifts

This is good, but let's explicitly decide whether to include all types of tile redshifts: perexp, pernight, cumulative, or a subset of those.

@akremin
Member

akremin commented Nov 29, 2023 via email

@weaverba137
Member

Interesting. This might be a separate ticket, but we should define under what circumstances we want to continue to support loading tile redshifts other than cumulative. Right now, we do not load perexp or pernight for data release specprods (fuji, etc.), and it looks like we would never load anything other than cumulative for daily either.

@weaverba137
Member

@sbailey, in the absence of any other document, I think this ticket is our de facto requirements document for daily database loading. As such here are some additional questions:

  • Most important: What does it mean when the PROGRAM column in tiles-daily.csv is empty? This is not something we've had to handle for fuji, guadalupe, iron. There are a lot of these, including in SURVEY = 'main'.
  • As mentioned above, we need a "last updated" column in the tiles file (referred to as UPDATED above).
    • I recall hearing that tiles-daily.csv is written by one and only one function, but which function is that? It's easier to just ask than hunt for it.
    • Is the tiles-daily.csv file always written out last? In other words, if a process is watching the tiles-daily.csv file for updates, can we assume that all other files related to any updated tile have already been written out before tiles-daily.csv?
  • In order to create the UPDATED column from scratch, where is the metadata that tracks tile history?
  • Would a tile ever be removed from the tiles file? From the database perspective: is loading only ever INSERT and UPDATE or INSERT, UPDATE, DELETE?
  • We've previously mentioned that it would be desirable to load photometry directly from the code that generates the lsdr9-photometry files.
    • Mentioning this here as a reminder.
    • I think this is something that should be addressed after we have a working system that can handle updates to the tiles and exposures files.
  • There was an unaddressed question above about tile-based redshifts: can we formally say that we will never support loading perexp or pernight tile redshifts, even for daily?
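
To make the INSERT/UPDATE question concrete, here is a minimal sketch of the INSERT-and-UPDATE-only case, following the behavior described at the top of this issue: new exposures are appended and redshifts for a re-observed tile are replaced. It uses an in-memory SQLite database purely as a stand-in; the table names, columns, and the `load_tile` helper are hypothetical and do not reflect the real schema or loader.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, heavily simplified schema for illustration only.
conn.executescript("""
CREATE TABLE exposure (expid INTEGER PRIMARY KEY, tileid INTEGER);
CREATE TABLE redshift (targetid INTEGER, tileid INTEGER, z REAL,
                       PRIMARY KEY (targetid, tileid));
""")


def load_tile(conn, tileid, expids, redshifts):
    """Append unseen exposures and replace redshifts for one tile."""
    with conn:  # one transaction per tile: commit on success, roll back on error
        conn.executemany(
            "INSERT OR IGNORE INTO exposure (expid, tileid) VALUES (?, ?)",
            [(expid, tileid) for expid in expids],
        )
        conn.executemany(
            "INSERT OR REPLACE INTO redshift (targetid, tileid, z) VALUES (?, ?, ?)",
            [(targetid, tileid, z) for targetid, z in redshifts],
        )


# First observation of a made-up tile, then a later update with one more exposure.
load_tile(conn, 1000, [1, 2], [(10, 0.50), (11, 1.20)])
load_tile(conn, 1000, [1, 2, 3], [(10, 0.52), (11, 1.20)])

n_exposures = conn.execute("SELECT COUNT(*) FROM exposure").fetchone()[0]
n_redshifts = conn.execute("SELECT COUNT(*) FROM redshift").fetchone()[0]
z_target_10 = conn.execute(
    "SELECT z FROM redshift WHERE targetid = 10 AND tileid = 1000"
).fetchone()[0]
print(n_exposures, n_redshifts, z_target_10)  # 3 2 0.52
```

If tiles can be removed from the tiles file, this pattern is not sufficient on its own: the loader would also need a DELETE pass for rows belonging to tiles that have disappeared.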

@weaverba137 weaverba137 self-assigned this May 13, 2024
@weaverba137
Member

More details about missing PROGRAM. About 25% of all entries in tiles-daily.csv have an empty value for PROGRAM, and it seems like those are all at the "top" of the file. So there is some kind of ordering going on that is not based on tile number. It does seem to be ordered by LASTNIGHT, with non-null values of PROGRAM starting around 20220210.
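
For anyone who wants to repeat this check, here is a minimal sketch of the kind of summary described above: the fraction of rows with an empty PROGRAM and the earliest LASTNIGHT that has a non-empty one. The helper name `program_summary` and the sample rows are made up for illustration; they are not real entries from tiles-daily.csv.

```python
import csv
import io


def program_summary(csv_text):
    """Return (fraction of rows with empty PROGRAM, earliest LASTNIGHT with PROGRAM set)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return 0.0, None
    n_empty = sum(1 for row in rows if not row["PROGRAM"].strip())
    filled_nights = [int(row["LASTNIGHT"]) for row in rows if row["PROGRAM"].strip()]
    first_filled = min(filled_nights) if filled_nights else None
    return n_empty / len(rows), first_filled


# Made-up rows for illustration only.
sample = """TILEID,SURVEY,PROGRAM,LASTNIGHT
1,main,,20210514
2,main,dark,20220210
3,main,bright,20220211
4,main,dark,20220301
"""

fraction, first_filled = program_summary(sample)
print(fraction, first_filled)  # 0.25 20220210
```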

@weaverba137
Member

Note on loading: patched versions of the top-level tiles and exposures files are needed for the initial load. From these patched versions, load all tiles that are considered complete, then catch up on the new tiles. Although in principle this load could be done without the extra status column in tiles-daily.csv, the extra column would be useful to identify completed tiles.

See also desihub/desispec#2251.

@weaverba137
Member

Proposal for initial loading of the daily database.

  1. Freeze the patched versions of tiles-daily.csv and exposures-daily.fits with a particular timestamp. This timestamp also becomes the seed value for the new UPDATED column in tiles-daily.csv.
  2. Load tiles, exposures and frames tables.
  3. Assemble zcatalog-style redshift catalogs for daily.
  4. Assemble lsdr9-photometry catalogs for daily. These are actually built from zcatalog files, so step 3 is necessary.
  5. Load photometry.
  6. Load redshift catalogs.
  7. Load fiberassign/potential tables.
  8. Update primary information, or alternatively create a zall file as part of step 3.
  9. The patched files in step 1 become the official daily files.
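
One way to keep the ordering constraint in this proposal explicit is a small runner that refuses to start a step before its prerequisites have finished. The sketch below is only an illustration: the step names and the `run_in_order` helper are hypothetical, and the only dependency encoded is the one stated above (step 4 needs step 3). Any other dependencies would have to be added once they are confirmed.

```python
# Hypothetical step names mirroring the numbered proposal; not real function names.
STEPS = [
    "freeze_patched_files",          # step 1
    "load_tiles_exposures_frames",   # step 2
    "assemble_zcatalog",             # step 3
    "assemble_photometry",           # step 4
    "load_photometry",               # step 5
    "load_redshifts",                # step 6
    "load_fiberassign_potential",    # step 7
    "update_primary",                # step 8
    "promote_patched_files",         # step 9
]

# Only the dependency stated in the proposal: photometry catalogs are built from zcatalog files.
REQUIRES = {"assemble_photometry": ["assemble_zcatalog"]}


def run_in_order(steps, requires):
    """Walk the steps in order, raising if a prerequisite has not run yet."""
    done = []
    for name in steps:
        missing = [req for req in requires.get(name, []) if req not in done]
        if missing:
            raise RuntimeError(f"{name} requires {missing} to run first")
        # A real loader would call the function for this step here.
        done.append(name)
    return done


completed = run_in_order(STEPS, REQUIRES)
print(len(completed))  # 9
```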

Background: tile-based loading runs at about 10-12 minutes per tile. This is completely dominated by the time needed to read the tractor photometry from files, not the time needed to load it into the database. For the initial load, it will probably be faster to create the equivalent of the lsdr9-photometry VAC for daily and load that all at once. Once the initial load is done, adding a few new tiles a day at 10 min/tile should be acceptable.
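
To put the per-tile rate in perspective, here is a back-of-the-envelope sketch. The tile counts are hypothetical placeholders (the real number of tiles in daily is not stated here), and the estimate assumes tiles are loaded strictly one after another.

```python
def serial_load_days(n_tiles, minutes_per_tile):
    """Wall-clock days needed to load n_tiles one after another."""
    return n_tiles * minutes_per_tile / (60 * 24)


n_tiles = 1000  # hypothetical tile count, for illustration only
low = serial_load_days(n_tiles, 10)
high = serial_load_days(n_tiles, 12)
print(f"{low:.1f} to {high:.1f} days")  # 6.9 to 8.3 days

# A daily increment of a few tiles is much cheaper, e.g. 5 tiles at 10 min/tile:
daily_minutes = 5 * 10
print(daily_minutes)  # 50
```

Under these assumptions, every thousand tiles costs about a week of serial tile-by-tile loading, which is why a single bulk photometry load looks preferable for the initial pass.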

@weaverba137
Member

PR #14 got us most of the way there, but we still need to implement code for tile-based updates.
