How we render a step
Each Step in a Workflow corresponds to a "module". Its inputs are:
- `arrow_table` -- the prior Step's output
- `settings` -- e.g., row-number limit, column-name-length limit
- `fetch_result` -- if this is a Fetch module and data was downloaded
- `params` -- form fields entered by the user (and validated and massaged by Workbench at render time)
- `secrets` -- tokens even the user may not read
- `tabs` -- output tables from other Tabs that this step's params require
Its outputs are:
- `table`: the new output table (may be null, if an error is set)
- `errors`: a list of "errors" (if `table` is null) or "warnings" (if `table` is not null), in an i18n-ready format. (We store translation keys in the database; Workbench translates them when a visitor views the workflow.)
Outputs are cached so that users can view them on the Workbench website.
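To make this interface concrete, here is a minimal sketch of a module entry point with these inputs and outputs. It is illustrative only: the names `RenderError` and `RenderResult` and the exact keyword arguments are assumptions for this sketch, not Workbench's published module API.

```python
# Illustrative sketch of a module's render interface -- not Workbench's exact API.
# It mirrors the inputs and outputs listed above: an Arrow table (plus params,
# settings, fetch_result, secrets and tabs) in; a new table plus i18n-ready errors out.
from typing import Any, Dict, List, NamedTuple, Optional

import pyarrow as pa


class RenderError(NamedTuple):        # hypothetical container for one error/warning
    i18n_key: str                     # translation key stored in the database
    arguments: Dict[str, Any]         # values interpolated when a visitor views the workflow


class RenderResult(NamedTuple):       # hypothetical container for a Step's output
    table: Optional[pa.Table]         # None when the errors are fatal
    errors: List[RenderError]         # "errors" if table is None, "warnings" otherwise


def render(
    arrow_table: pa.Table,            # the prior Step's output
    params: Dict[str, Any],           # form fields, validated and massaged by Workbench
    *,
    settings: Dict[str, Any],         # e.g., row-number limit, column-name-length limit
    fetch_result: Optional[Any],      # downloaded data, if this is a Fetch module
    secrets: Dict[str, Any],          # tokens even the user may not read
    tabs: Dict[str, pa.Table],        # other Tabs' output tables this Step's params require
) -> RenderResult:
    """Compute a new output table (and errors/warnings) from the inputs."""
    ...
```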
Ignoring optimizations, here's what happens during a render:
- Workbench loads `params`, `secrets` and `settings` from the database.
- Workbench loads `fetch_result` from the "Stored Objects" S3 bucket (if it exists). The file may have any format -- the module's own `fetch()` function generated it.
- Workbench loads `table` and `tabs` -- output from the previous step and other tabs, if applicable -- from the "Render Cache" S3 bucket. This is Parquet format (see Table Data Structures).
- Workbench converts the Parquet files to Arrow (sketched below). Parquet-format files are compressed and the format won't change frequently; Arrow-format files are for fast computation and inter-process communication.
- Workbench spawns a new process with the module. It links the `fetch_result` file and Arrow files into the process's sandbox.
- Workbench transmits all inputs (other than the `fetch_result` file and Arrow files) in a Thrift format to the module process.
- The module process decodes the Thrift data and opens the `fetch_result` and Arrow files.
- The module process executes `render()`.
- The module process writes a new Arrow table to `output_filename`, outputs table metadata and `errors` to stdout, and exits with status `0`.
- Workbench validates the Step-output Arrow file, metadata and exit status.
- Workbench converts the Arrow file to Parquet and stores it in the "Render Cache". It stores metadata (such as `errors` and `columns`) in the database.
... and then Workbench moves on to the next Step.
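The Parquet-to-Arrow step (and the reverse conversion into the Render Cache) can be pictured with pyarrow. This is a minimal sketch with made-up local paths; the real renderer also handles S3, validation and caching:

```python
# Minimal sketch of the Parquet <-> Arrow conversions described above.
# Paths are hypothetical; Workbench's real code also talks to S3 and validates output.
import pyarrow as pa
import pyarrow.parquet as parquet


def parquet_to_arrow(parquet_path: str, arrow_path: str) -> None:
    """Load a cached Parquet file and rewrite it as an Arrow file for a module."""
    table = parquet.read_table(parquet_path)              # compressed, stable cache format
    with pa.ipc.new_file(arrow_path, table.schema) as writer:
        writer.write_table(table)                         # fast format for computation and IPC


def arrow_to_parquet(arrow_path: str, parquet_path: str) -> None:
    """Store a module's Arrow output back into the Render Cache as Parquet."""
    table = pa.ipc.open_file(arrow_path).read_all()
    parquet.write_table(table, parquet_path)
```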
- Workbench renders one workflow at a time. When it finishes a Step but still needs its output, it re-uses the Arrow file instead of reading the Parquet file it just wrote to the render cache.
- Workbench skips the module code entirely if it runs into an error while massaging `params` (for instance, if a user chose a "Timestamp" column for timestamp math, but it's now a "Text" column come render time).
- Module-process spawning is cached.
- Each module process is sandboxed during creation. Read our 3-part series on sandboxing for details. Briefly:
  - The process can't access our internal network.
  - The process runs as non-root, and it can't gain privileges.
  - The process can't read Workbench's environment variables.
  - The process has a greatly-constrained filesystem: it can read only the files we provide (such as approved Python modules and input files), and it can only write output files and temporary files -- and only up to a constrained filesystem size. All files the process writes are destroyed before another process is spawned.
  - The process's RAM and CPU are bounded by cgroups.
  - The process's child processes all die when the process exits.
  - The process is killed if it exceeds a timeout.
  - The process gets no file handles but `stdin`, `stdout` and `stderr`. Its `stderr` and `stdout` are truncated to a fixed buffer size, and its `stdout` is validated using Thrift.
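The real sandbox combines Linux namespaces, cgroups and seccomp (see the sandboxing series). As a much-simplified illustration of just two of the restrictions above -- resource limits and the timeout -- here is roughly what they can look like from Python; the command, limits and timeout are made-up values, not Workbench's actual configuration:

```python
# Much-simplified illustration of resource limits and a timeout -- NOT Workbench's
# actual sandbox, which uses Linux namespaces, cgroups and seccomp.
import resource
import subprocess


def limit_child_resources() -> None:
    """Runs in the child just before exec: cap address space and CPU seconds."""
    resource.setrlimit(resource.RLIMIT_AS, (1 << 30, 1 << 30))   # ~1 GiB of memory
    resource.setrlimit(resource.RLIMIT_CPU, (30, 30))            # 30 CPU-seconds


def run_module_sandboxed() -> subprocess.CompletedProcess:
    """Spawn a (hypothetical) module runner with limits, a bare environment and a timeout."""
    return subprocess.run(
        ["python3", "run_module.py"],       # hypothetical module-runner entry point
        preexec_fn=limit_child_resources,   # apply limits in the child before exec
        env={},                             # hide the parent's environment variables
        capture_output=True,                # parent collects (and can bound) stdout/stderr
        timeout=60,                         # kill the module if it exceeds a timeout
    )
```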