
Batch pipeline steps r91

Mike Jongbloet edited this page Jan 11, 2021 · 3 revisions

This documentation is outdated!

🚧 The latest setup guidance for Snowplow can be found on the Snowplow documentation site.


This page refers to Snowplow R91-R101

Click here for the corresponding documentation for other releases

Dataflow diagram

Recovery steps

The table below summarizes the recovery actions to take when a particular step in the dataflow diagram above fails.

Failed step Recovery actions
1 If no files have been moved yet (raw:processing [A] is empty), rerun EmrEtlRunner as usual. If some files have already been moved, rerun EmrEtlRunner with the --skip staging option to continue processing those log files.
2 Rerun EmrEtlRunner with the --skip staging option.
3 Rerun EmrEtlRunner with the --skip staging option.

Note: enriched:bad [D] and enriched:error [E] may already contain files produced by step 3, so rerunning EmrEtlRunner could result in duplicated bad/error files. This could be significant if the Elasticsearch step (8-9) is used to examine bad data [D]: the same data would appear with different timestamps from different EMR runs.
4 Delete the enriched:good files [F] and rerun EmrEtlRunner with either the --skip staging option or --resume-from enrich.
5 Delete the enriched:good files [F] and rerun EmrEtlRunner with either the --skip staging option or --resume-from enrich.
6 Delete the enriched:good files [F] and rerun EmrEtlRunner with either the --skip staging option or --resume-from enrich.
7 Delete the enriched:good files [F] and rerun EmrEtlRunner with either the --skip staging option or --resume-from enrich.

Note: enriched:bad [D] and shredded:bad [H] may already contain files produced by steps 3 and 6 respectively, so rerunning EmrEtlRunner could result in duplicated bad files. This could be significant if the Elasticsearch step (8-9) is used to examine bad data ([D], [H]): the same data would appear with different timestamps from different EMR runs.
8 Delete the enriched:good [F] and shredded:good [K] files. Rerun EmrEtlRunner with either the --skip staging option or --resume-from enrich.
9 Rerun EmrEtlRunner with either the --skip staging,enrich,shred option, --resume-from elasticsearch (if Elasticsearch is used), or --resume-from archive_raw.
10 If duplicated bad data is not critical, rerun EmrEtlRunner with the --skip staging,enrich,shred option. If duplicated bad data is critical, instructions to come (#2593).

WARNING: In R90/R91, if you pass --skip shred to EmrEtlRunner then RDB Loader does not load unstructured events and contexts. This issue is resolved in R92.
11 If duplicated bad data is not critical, rerun EmrEtlRunner with the --skip staging,enrich,shred option. If duplicated bad data is critical, instructions to come (#2593).

WARNING: In R90/R91, if you pass --skip shred to EmrEtlRunner then RDB Loader does not load unstructured events and contexts. This issue is resolved in R92.
12 Rerun EmrEtlRunner with either the --skip staging,enrich,shred,elasticsearch option or --resume-from archive_raw.

WARNING: In R90/R91, if you pass --skip shred to EmrEtlRunner then RDB Loader does not load unstructured events and contexts. This issue is resolved in R92.
13 The data loads are wrapped in a single transaction, so an RDB Loader failure will not result in a partial load. However, if multiple data targets are used and some targets have already been loaded, you may need to temporarily remove those targets from config.yml during your recovery process.

The rdb_load step has three stages: "discover", "load", and "analyze" (in that order). At the "discover" stage, the availability of JSONPaths files is checked. After the data is loaded at the "load" stage, the tables are analyzed to update the table statistics used by the query planner. To start RDB Loader from the beginning, use the --resume-from rdb_load option.

If the failure occurred at the "analyze" stage (i.e. after the data was successfully loaded), you can skip that stage with the --resume-from archive_enriched option. To run the analysis, resume with --resume-from analyze.
14 Rerun EmrEtlRunner with the --resume-from archive_enriched option.
15 Rerun EmrEtlRunner with the --resume-from archive_enriched option.
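
The table above can be condensed into a small lookup helper. The following is only a minimal sketch, not part of Snowplow: the function name `recovery_plan` and its parameters are hypothetical, and the only flag values it uses are the ones listed in the table. Where the table offers both a --skip and a --resume-from variant, the sketch arbitrarily returns the --resume-from one. It does not delete anything or launch EmrEtlRunner; it only reports which locations the table says to clear and which arguments to append to your usual invocation.

```python
from typing import List, Tuple


def recovery_plan(
    failed_step: int,
    raw_processing_empty: bool = False,   # hypothetical parameter: is raw:processing [A] empty?
    elasticsearch_used: bool = True,      # hypothetical parameter: is the Elasticsearch step enabled?
    failed_at_analyze: bool = False,      # hypothetical parameter: did rdb_load fail at "analyze"?
) -> Tuple[List[str], List[str]]:
    """Return (locations_to_delete, extra_args) for a failed step, per the table."""
    if failed_step == 1:
        # Nothing moved yet: a plain rerun is enough, so no extra arguments.
        args = [] if raw_processing_empty else ["--skip", "staging"]
        return [], args
    if failed_step in (2, 3):
        return [], ["--skip", "staging"]
    if 4 <= failed_step <= 7:
        return ["enriched:good"], ["--resume-from", "enrich"]
    if failed_step == 8:
        return ["enriched:good", "shredded:good"], ["--resume-from", "enrich"]
    if failed_step == 9:
        target = "elasticsearch" if elasticsearch_used else "archive_raw"
        return [], ["--resume-from", target]
    if failed_step in (10, 11):
        # Only appropriate when duplicated bad data is not critical.
        # Note the R90/R91 caveat about --skip shred described in the table.
        return [], ["--skip", "staging,enrich,shred"]
    if failed_step == 12:
        return [], ["--resume-from", "archive_raw"]
    if failed_step == 13:
        # Re-run only the analysis if the load itself already succeeded.
        target = "analyze" if failed_at_analyze else "rdb_load"
        return [], ["--resume-from", target]
    if failed_step in (14, 15):
        return [], ["--resume-from", "archive_enriched"]
    raise ValueError(f"unknown step: {failed_step}")


# Example: step 8 failed.
to_delete, extra_args = recovery_plan(8)
print("delete first:", to_delete)
print("append to EmrEtlRunner command:", " ".join(extra_args))
```

Keeping the deletions separate from the arguments mirrors the table: for steps 4 to 8 the cleanup has to happen before the rerun, and it is safer to review that list by hand than to automate it.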