Module Framework

The lowest level: calling convention

A Module is a callable that accepts a Thrift data structure and a filename; writes its output to that file; and returns a Thrift data structure.

Thrift definitions
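
In pseudo-signature form, the convention looks roughly like this sketch. ("RenderRequest" and "RenderResult" are placeholders: the real types are the generated Thrift classes from the definitions above.)

    from pathlib import Path

    def module(request: "RenderRequest", output_path: Path) -> "RenderResult":
        # Read the request, compute, write the output table to
        # output_path, and return metadata describing what was written.
        ...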

Why? Mainly, security. Workbench modules are untrusted code -- so they run in a separate process, and the data they produce is untrusted. Workbench proper must read that untrusted data after the module stops executing. Data can only be sent over pipes and in files; and the Workbench code that handles this untrusted data must never have undefined behavior. (The data-transfer format can't be Python "pickle" data, for instance, because unpickling executes arbitrary code on the reader's side.)

We send table data using files -- mostly Arrow files. The beauty of Arrow: files are mmapped, so data isn't serialized or deserialized when passed between Workbench and the module. A huge file and a tiny file cost about the same to open.
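
For example, opening an Arrow file with pyarrow memory-maps it instead of copying it into RAM (the filename here is illustrative):

    import pyarrow as pa

    # The table's buffers point into the memory-mapped file, so no
    # bytes are copied or deserialized up front.
    source = pa.memory_map("table.arrow", "r")
    table = pa.ipc.open_file(source).read_all()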

We send metadata using Thrift over standard input and output. One could make a case for Protobuf or JSON; but Thrift adds no new dependency (modules already depend on Parquet libraries, which use Thrift), and JSON would mean writing frustrating serialize/deserialize code by hand. (Remember, Workbench needs to validate all this data.)
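
Reading one message looks something like this sketch, assuming Thrift's standard Python bindings. (The struct class comes from generated code, so the name here is a placeholder.)

    import sys
    from thrift.protocol import TBinaryProtocol
    from thrift.transport import TTransport

    # Deserialize one Thrift message from stdin.
    transport = TTransport.TMemoryBuffer(sys.stdin.buffer.read())
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    # request = RenderRequest()  # placeholder: a generated Thrift struct
    # request.read(protocol)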

The low-level Arrow framework

cjwkernel and cjwmodule.arrow handle Thrift and pipes, so module authors don't need to.

The Arrow framework marshals arguments into already-opened Arrow tables and data structures such as Table, TabOutput and Column.

It calls the module code. The module returns a cjwmodule.arrow.RenderResult.

The framework writes the result to the Arrow file and writes the Thrift metadata to stdout.
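
A minimal sketch of module code at this level, assuming a render() entrypoint and a text column named "name" (the real signature and the RenderResult wrapper are defined by cjwmodule.arrow):

    import pyarrow as pa
    import pyarrow.compute as pc

    def render(table: pa.Table, params: dict):
        # Transform a text column with a vectorized Arrow kernel. A real
        # module would wrap the table in cjwmodule.arrow's RenderResult.
        i = table.schema.get_field_index("name")
        return table.set_column(i, "name", pc.utf8_upper(table.column("name")))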

The advantages of writing code at this level:

  • It's efficient. Arrow operations are very fast, the RAM overhead is negligible, and the only fixed cost is opening the file.
  • It's the standard. Workbench itself uses these exact data types.

The disadvantage:

  • Tooling. pyarrow has some great features; but as of 2021-04-06, at least, it's missing a "day-of-week" function, a "divmod" operator, a "join" operation, and so on (see the sketch below).
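
For instance, computing day-of-week (circa 2021) meant falling back to Pandas or plain Python for that one step -- a sketch with illustrative data:

    import datetime
    import pyarrow as pa

    timestamps = pa.array(
        [datetime.datetime(2021, 4, 5), datetime.datetime(2021, 4, 6)],
        type=pa.timestamp("ns"),
    )
    # No Arrow compute kernel for this at the time, so round-trip
    # one column through Pandas:
    day_of_week = pa.array(timestamps.to_pandas().dt.dayofweek)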

The high-level Pandas framework

cjwkernel.pandas handles Arrow, so module authors don't need to.

The Pandas framework converts Arrow tables to Pandas DataFrames.

It calls the module code. The module returns all sorts of weird stuff. (Back in the day, there were way too many features here!)

The framework converts all that weird stuff into a cjwmodule.arrow.RenderResult and passes it back to the Arrow framework.
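
A minimal sketch of module code at this level, assuming a render() entrypoint (column names are illustrative):

    import pandas as pd

    def render(table: pd.DataFrame, params: dict):
        # Ordinary Pandas code. The framework converts the returned
        # DataFrame back into an Arrow-based RenderResult.
        table["total"] = table["price"] * table["quantity"]
        return table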

The advantages to writing code at this level:

  • There are all sorts of guides on the Internet. Everybody uses Pandas.

The disadvantages to writing code at this level:

  • Pandas is buggy. As of 2021-04-06, Pandas has 3,563 open bug reports. Most of them will never be fixed; some force extreme workarounds. The group and reshape modules have been horrendous time sinks: we've spent days debugging bizarre edge cases, filing bug reports with Pandas, and then implementing costly workarounds because the Pandas bugs won't be fixed.
    • Pandas APIs promise impossible feats. There is no way to parse dates automatically; no way to parse numbers without locale settings; no way to turn HTML into a table without extra options. When using Pandas, ask yourself: will computers ever, in a million years, be able to achieve the task you're setting out to achieve?
    • Those guides online are also buggy. Workbench modules must handle edge cases -- for instance, what happens when every value is null? What happens when dividing by zero? And so on. Those guides online will often suggest solutions that work in the author's case and break in the real world.
  • Pandas is inefficient. A 1M-row text column costs ~60MB in Pandas (heavily fragmented Python objects), even if there's only one byte per cell. That same column in Arrow costs ~9MB (contiguous buffers). See the sketch after this list.
  • Pandas invites mistakes. For example, it's far too easy to create a DataFrame with two columns that share a name, or an integer-named column, or an int in an object column. And so on.
  • Nullable integers become floating-point. Pandas has an IntegerArray type, but we haven't researched its edge cases and we don't support it. Workbench handles 64-bit nullable integers correctly; Pandas modules don't. So if someone adds a Pandas step to a workflow, Workbench irreversibly converts every 64-bit integer column that contains nulls to floating-point -- even columns your module didn't create or alter.
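
Two of the pitfalls above, demonstrated (values are illustrative; exact byte counts vary by version):

    import pandas as pd
    import pyarrow as pa

    # Memory: the same 1M-row text column, measured both ways.
    values = ["x"] * 1_000_000
    print(pd.Series(values).memory_usage(deep=True))  # one Python str object per cell
    print(pa.array(values).nbytes)                    # contiguous offsets + data buffers

    # Nullable integers: a single null silently turns the column to float.
    print(pd.Series([1, 2, None]).dtype)  # float64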