[Zyp] This and that: Mostly documentation and software tests #53

Merged: 8 commits, Sep 22, 2024
1 change: 1 addition & 0 deletions CHANGES.md
@@ -2,6 +2,7 @@

## Unreleased
- MongoDB: Fixed edge case when decoding MongoDB Extended JSON elements
- Zyp: Added capability to skip rule evaluation when `disabled: true`

## 2024/09/19 v0.0.16
- MongoDB: Added `MongoDBFullLoadTranslator` and `MongoDBCrateDBConverter`
29 changes: 22 additions & 7 deletions doc/zyp/backlog.md
@@ -1,17 +1,24 @@
# Zyp Backlog

## Iteration +1
- [x] Refactor module namespace to `zyp`
- [x] Documentation
- [ ] CLI interface
- [x] Apply to MongoDB Table Loader in CrateDB Toolkit
- [ ] Document `jq` functions
- [ ] Documentation: jqlang stdlib's `to_object` function for substructure management
- [ ] Documentation: Type casting
`echo '{"a": 42, "b": {}, "c": []}' | jq -c '.|= (.b |= objects | .c |= objects)'`
`{"a":42,"b":{}}`
- [ ] Renaming currently needs JSON Pointer support, implemented in Python.
Alternatively, can `jq` also do it?
- [ ] Simple IFTTT: When condition, do that (i.e. add tag)
- [ ] Documentation: `jq` functions
- `builtin.jq`: https://github.com/jqlang/jq/blob/master/src/builtin.jq
- `function.jq`
- [ ] Renaming needs JSON Pointer support. Alternatively, can `jq` do it?
- [ ] Documentation: Add Python example to "Synopsis" section on /index.html
- [ ] Documentation: Update "What’s Inside"
- [ ] Documentation: Usage (build (API, from_yaml), apply)
- [ ] Documentation: How to extend `function.{jq,py}`

## Iteration +2
- [ ] CLI interface
- [ ] Documentation: Add Python example to "Synopsis" section on /index.html

Demonstrate more use cases, like...
- [ ] math expressions
- [ ] omit key (recursively)
@@ -49,3 +56,11 @@ Demonstrate more use cases, like...
- https://github.com/meltano/sdk/blob/v0.39.1/singer_sdk/mapper.py
- [ ] Is `jqpy` better than `jq`?
- https://baterflyrity.github.io/jqpy/

## Done
- [x] Refactor module namespace to `zyp`
- [x] Documentation
- [x] Apply to MongoDB Table Loader in CrateDB Toolkit
- [x] Model: Toggle rule active / inactive by respecting `disabled` flag
- [x] Documentation: How to delete attributes from lists using jq?
- [x] Review and test jqlang stdlib's `to_object` function
223 changes: 219 additions & 4 deletions doc/zyp/examples.md
@@ -7,6 +7,12 @@ to Zyp's capabilities.
If you discover the need for another kind of transformation, or need assistance
crafting transformation rules, please reach out to us on the [issue tracker].

## General Information

- Transformation recipes include a number of transformation rules
- Transformation rules can use different kinds of processors
- Individual rules can be toggled inactive by using the attribute `disabled: true` on them
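
To picture the `disabled` flag, the following minimal sketch filters a list of rule definitions before they would be applied. It uses plain dictionaries and a hypothetical `active_rules` helper, and the second rule is made up for illustration. It is not Zyp's implementation, only an approximation of the behavior described above.

```python
# Hypothetical sketch: rules are plain dicts here, not Zyp model objects.
def active_rules(rules):
    """Return only those rules that are not flagged with `disabled: true`."""
    return [rule for rule in rules if not rule.get("disabled", False)]


rules = [
    {"old": "_id", "new": "id"},
    # Illustrative rule, switched off via the `disabled` attribute.
    {"old": "ts", "new": "timestamp", "disabled": True},
]
print(active_rules(rules))
```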


## Bucket Transformation
A `BucketTransformation` works on **individual data records**, i.e. on a per-record level,
@@ -86,8 +92,8 @@ meta:
version: 1
names:
rules:
- new: id
old: _id
- old: _id
new: id
values:
rules:
- pointer: /id
@@ -237,8 +243,8 @@ pre:
bucket:
names:
rules:
- new: id
old: _id
- old: _id
new: id
values:
rules:
- pointer: /id
@@ -546,5 +552,214 @@ how to define transformation rules, and the corresponding YAML representation.
:::::::


## jqlang cheat sheet
This section enumerates a few jqlang expressions that you may find useful
in this context. Please also visit the [jqlang manual].

### Drop Elements
::::::{card}
Drop one or more object attributes by path in a single expression.
```yaml
expression: .[] |= del(.meta.timestamp, .data.def)
```
:::::{dropdown} Example
:margin: 0

::::{grid} 2
:gutter: 0
:margin: 0
:padding: 0

:::{grid-item-card}
:margin: 0
:padding: 0
Input Data
```json
[{
"meta": {"id": "Hotzenplotz", "timestamp": 123456789},
"data": {"abc": 123, "def": 456}
}]
```
:::
:::{grid-item-card}
:margin: 0
:padding: 0
Output Data
```json
[{
"meta": {"id": "Hotzenplotz"},
"data": {"abc": 123}
}]
```
:::
::::
:::::
::::::
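
Since transformation steps can also be written in Python, it may help to see roughly what the expression above does in plain Python. The sketch below assumes dotted paths and uses a hypothetical `drop_paths` helper. It is not part of Zyp, and unlike a strict path deletion it silently skips paths whose parents are missing.

```python
import copy


def drop_paths(record, *paths):
    """Return a copy of `record` with each dotted path removed, if present."""
    result = copy.deepcopy(record)
    for path in paths:
        *parents, leaf = path.split(".")
        node = result
        # Walk down to the parent object of the attribute to delete.
        for key in parents:
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        if isinstance(node, dict):
            node.pop(leaf, None)
    return result


records = [{
    "meta": {"id": "Hotzenplotz", "timestamp": 123456789},
    "data": {"abc": 123, "def": 456},
}]
print([drop_paths(record, "meta.timestamp", "data.def") for record in records])
```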

::::::{card}
Drop an attribute from all objects in an array, tolerating documents where
the array does not exist or is not an array.
```yaml
expression: .[] |= del(.data.array[]?.def)
```
:::::{dropdown} Example
::::{grid} 2
:gutter: 0
:margin: 0
:padding: 0

:::{grid-item-card}
:margin: 0
:padding: 0
Input Data
```json
[
{"data": {"array": [
{"abc": 123, "def": 456},
{"abc": 123, "def": 456},
{"abc": 123}
]}},
{"data": {"array": 42}},
{"data": {}},
{"meta": {"version": 42}}
]
```
:::
:::{grid-item-card}
:margin: 0
:padding: 0
Output Data
```json
[
{"data": {"array": [
{"abc": 123},
{"abc": 123},
{"abc": 123}
]}},
{"data": {"array": 42}},
{"data": {}},
{"meta": {"version": 42}}
]
```
:::
::::
:::::
::::::
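
The tolerant behavior of the expression above can be approximated in plain Python as follows. The helper name `drop_from_array_items` is hypothetical and not part of Zyp. The sketch only touches `data.array` when it is a list, and leaves every other document unchanged.

```python
import copy


def drop_from_array_items(record, key="def"):
    """Return a copy of `record` with `key` removed from each object in `data.array`."""
    result = copy.deepcopy(record)
    data = result.get("data")
    array = data.get("array") if isinstance(data, dict) else None
    # Only act when `data.array` exists and actually is a list.
    if isinstance(array, list):
        for item in array:
            if isinstance(item, dict):
                item.pop(key, None)
    return result


records = [
    {"data": {"array": [
        {"abc": 123, "def": 456},
        {"abc": 123, "def": 456},
        {"abc": 123},
    ]}},
    {"data": {"array": 42}},
    {"data": {}},
    {"meta": {"version": 42}},
]
print([drop_from_array_items(record) for record in records])
```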

::::::{card}
Drop array elements by index.
```yaml
expression: .[] |= del(.data.[1])
```
:::::{dropdown} Example
::::{grid} 2
:gutter: 0
:margin: 0
:padding: 0

:::{grid-item-card}
:margin: 0
:padding: 0
Input Data
```json
[{"data": [1, {"foo": "bar"}, 2]}]
```
:::
:::{grid-item-card}
:margin: 0
:padding: 0
Output Data
```json
[{"data": [1, 2]}]
```
:::
::::
:::::
::::::
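
For comparison, here is a plain Python sketch of dropping an array element by index. The `drop_index` helper is hypothetical and not part of Zyp. It handles only non-negative indexes and returns the record unchanged when `data` is not a list or the index is out of range.

```python
def drop_index(record, index=1):
    """Return `record` without the element at `index` of its `data` list."""
    data = record.get("data")
    if isinstance(data, list) and 0 <= index < len(data):
        # Build a new list, so the input record is not mutated.
        return {**record, "data": data[:index] + data[index + 1:]}
    return record


records = [{"data": [1, {"foo": "bar"}, 2]}]
print([drop_index(record) for record in records])
```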

### Manipulate Values
::::::{card}
Update the value of a deeply nested attribute, if it exists.
```yaml
expression: .[] |= if .data.abc then .data.abc *= 2 end
```
:::::{dropdown} Example
::::{grid} 2
:gutter: 0
:margin: 0
:padding: 0

:::{grid-item-card}
:margin: 0
:padding: 0
Input Data
```json
[
{"data": {"abc": 123}},
{"data": {"def": 456}},
{"meta": {"version": 42}}
]
```
:::
:::{grid-item-card}
:margin: 0
:padding: 0
Output Data
```json
[
{"data": {"abc": 246}},
{"data": {"def": 456}},
{"meta": {"version": 42}}
]
```
:::
::::
:::::
::::::
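
A plain Python counterpart of the conditional update might look like the sketch below. The helpers `is_number` and `double_abc` are hypothetical and not part of Zyp. As a simplification, the sketch doubles the value only when it is numeric.

```python
import copy


def is_number(value):
    """True for int/float, but not for bool (a subclass of int in Python)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def double_abc(record):
    """Return a copy of `record` with `data.abc` doubled, if it is a number."""
    result = copy.deepcopy(record)
    data = result.get("data")
    if isinstance(data, dict) and is_number(data.get("abc")):
        data["abc"] *= 2
    return result


records = [
    {"data": {"abc": 123}},
    {"data": {"def": 456}},
    {"meta": {"version": 42}},
]
print([double_abc(record) for record in records])
```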

::::::{card}
Update the value of a deeply nested attribute within an array, if it exists.
```yaml
expression: .[] |= if (.data | type == "array") and .data[].abc then .data[].abc *= 2 end
```
:::::{dropdown} Example
::::{grid} 2
:gutter: 0
:margin: 0
:padding: 0

:::{grid-item-card}
:margin: 0
:padding: 0
Input Data
```json
[
{"data": [{"abc": 123}]},
{"data": [{"def": 456}]},
{"data": null},
{"data": 42},
{"meta": {"version": 42}}
]
```
:::
:::{grid-item-card}
:margin: 0
:padding: 0
Output Data
```json
[
{"data": [{"abc": 246}]},
{"data": [{"def": 456}]},
{"data": null},
{"data": 42},
{"meta": {"version": 42}}
]
```
:::
::::
:::::
::::::
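
The array variant can be sketched in plain Python as well. The helpers `is_number` and `double_abc_in_items` are hypothetical and not part of Zyp. The sketch checks each array item individually, which is a design choice of this example and not a statement about how the expression above evaluates.

```python
import copy


def is_number(value):
    """True for int/float, but not for bool (a subclass of int in Python)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def double_abc_in_items(record):
    """Return a copy of `record` with numeric `abc` doubled in each item of `data`."""
    result = copy.deepcopy(record)
    data = result.get("data")
    # Skip documents where `data` is missing, null, or not a list.
    if isinstance(data, list):
        for item in data:
            if isinstance(item, dict) and is_number(item.get("abc")):
                item["abc"] *= 2
    return result


records = [
    {"data": [{"abc": 123}]},
    {"data": [{"def": 456}]},
    {"data": None},
    {"data": 42},
    {"meta": {"version": 42}},
]
print([double_abc_in_items(record) for record in records])
```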


[issue tracker]: https://github.com/crate/commons-codec/issues
[jqlang manual]: https://jqlang.github.io/jq/manual/
53 changes: 39 additions & 14 deletions doc/zyp/index.md
@@ -8,27 +8,44 @@ The reference implementation is written in [Python], using [attrs] and [cattrs].
The design, conventions, and definitions also encourage implementations
in other programming languages.

## Ideas
## Features
:Conciseness:
Define a multistep data refinement process with as little code as possible.
:Precision:
When filtering and manipulating deeply nested documents, you want to exactly
address specific elements and substructures.
:Polyglot:
The toolbox includes different kinds of tools, to always have the
right one at hand. The venerable jq language is always at your fingertips,
while people accustomed to JMESPath expressions as employed by AWS CLI's
\\-\\-query parameter can use those as well. On top of that, transformation
steps can also be written in Python.
:Flexibility:
Zyp is a data transformation library that can be used within frameworks and
ad hoc pipelines equally well. To be invoked, it doesn't need any infrastructure
services and is pipeline framework agnostic.
The library can be used within frameworks, applications, and ad hoc
pipelines equally well. It does not depend on any infrastructure services,
and can be used together with any other ETL or pipeline framework.
:Interoperability:
Transformation recipe definitions are represented by a concise data model, which
can be marshalled to/from text-only representations like JSON or YAML, in order to
Transformation recipe definitions are represented by a concise data model,
which can be marshalled to/from text-only representations like JSON or YAML,
in order to
a) encourage implementations in other programming languages, and
b) be transferred, processed and stored by third party systems.
:Performance:
Depending on how many transformation rules are written in pure Python vs. more
efficient processors like jqlang or other compiled transformation languages, it
may be slower or faster. When applicable, hot spots of the library
may gradually be rewritten in Rust if that topic becomes an issue.
Depending on how many transformation rules are written in pure Python vs.
more efficient processors like jqlang or other compiled transformation
languages, Zyp may be slower or faster. When applicable, hot spots of the
library may gradually be rewritten in Rust if performance becomes an issue.
:Immediate:
Other ETL frameworks and concepts often need to first land your data in the target
system before applying subsequent transformations. Zyp is working directly within
the data pipeline, before data is inserted into the target system.
Other ETL frameworks and concepts often need to first land your data in the
target system before applying subsequent transformations. Zyp works
directly within the data pipeline, before data is inserted into the target
system.
:Human:
Zyp provides capabilities to imperatively filter and reshape data structures
in an iterative authoring process, based on deterministic procedures building
upon each other. When it comes to ad hoc or automated data conversion tasks,
it puts you into the driver's seat, and encourages sharing and reuse of
transformation recipes.

## Design
:Data Model:
@@ -43,7 +60,7 @@ in other programming languages.
JSON Pointer, `jq`, and friends. The components are configured using rules.

:Phases and Process:
The transformation process is conducted on behalf of multiple phases that are
The transformation process includes multiple phases that are
defined by labels like `pre`, `bucket`, `post`, `treatment`, in that order.
Each phase can include multiple rules of different kinds.

@@ -123,6 +140,11 @@ inspirations that might not have been reflected on the documentation yet.
- [tests/transform/mongodb]
- [tests/transform/test_zyp_generic.py]

## Tools
- [jp]: A command line interface to JMESPath, an expression language for manipulating JSON.
- [jq]: A lightweight and flexible command-line JSON processor.
- [jsonpointer]: A command-line utility that can be used to resolve JSON pointers on JSON files.

## Prior Art
See [research and development notes](project:#zyp-research),
specifically [an introduction and overview about Singer].
@@ -145,7 +167,10 @@ Backlog <backlog>
[cattrs]: https://catt.rs/
[DWIM]: https://en.wikipedia.org/wiki/DWIM
[Kris Zyp]: https://github.com/kriszyp
[jp]: https://github.com/jmespath/jp
[jq]: https://jqlang.github.io/jq/
[jsonpointer]: https://python-json-pointer.readthedocs.io/en/latest/commandline.html
[jqlang]: https://jqlang.github.io/jq/manual/
[JMESPath]: https://jmespath.org/
[JSON Pointer]: https://datatracker.ietf.org/doc/html/rfc6901
[Python]: https://en.wikipedia.org/wiki/Python_(programming_language)