4 changes: 3 additions & 1 deletion .github/workflows/framework-meltano.yml
@@ -38,7 +38,9 @@ jobs:
os: [ 'ubuntu-latest' ]
python-version: [
'3.10',
'3.14',
# TODO: Update to more recent versions after updating software components.
# https://github.com/crate-workbench/meltano-tap-cratedb/issues/13
'3.12',
]
cratedb-version: [ 'nightly' ]

8 changes: 6 additions & 2 deletions framework/meltano/README.md
@@ -1,14 +1,17 @@
# Meltano Examples

Concise examples about working with [CrateDB] and [Meltano], for conceiving and
running flexible ELT tasks. All the recipes are using [meltano-target-cratedb]
for reading and writing data from/to CrateDB.
running flexible ELT tasks. All the recipes are using [meltano-tap-cratedb] or
[meltano-target-cratedb] for reading and writing data from/to CrateDB.

## What's inside

- `file-to-cratedb`: Acquire data from a Singer file, and load it into a
CrateDB database table.

- `cratedb-to-file`: Export data from a CrateDB database table into
different kinds of files.

## Prerequisites

Before running the examples within the subdirectories, make sure to install
@@ -39,4 +42,5 @@ poe check

[CrateDB]: https://cratedb.com/product
[Meltano]: https://meltano.com/
[meltano-tap-cratedb]: https://github.com/crate/meltano-tap-cratedb
[meltano-target-cratedb]: https://github.com/crate/meltano-target-cratedb
77 changes: 77 additions & 0 deletions framework/meltano/cratedb-to-file/README.md
@@ -0,0 +1,77 @@
# Use Meltano to export data from CrateDB into files

## About

Export data from CrateDB into different file-based output formats, using
[meltano-tap-cratedb], [target-csv], [target-jsonl], [target-singer-jsonl],
and [target-jsonl-blob].

## Configuration

### tap-cratedb

Within the `extractors` section, at `tap-cratedb`, adjust
`config.sqlalchemy_url` to match your database connectivity settings
as pipeline source. Within the `select` section, configure which
tables you are targeting for export.
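
To illustrate how `select` patterns combine, here is a minimal Python sketch.
It is a simplified approximation based on `fnmatch`, not Meltano's actual
implementation, and the function name `is_selected` is hypothetical. It assumes
that patterns have the form `<stream>.<attribute>`, and that a leading `!`
marks an exclusion which takes precedence over inclusions.

```python
from fnmatch import fnmatch


def is_selected(stream: str, attribute: str, patterns: list[str]) -> bool:
    """Approximate select rules: exclusions (`!`) win over inclusions."""
    name = f"{stream}.{attribute}"
    included = any(fnmatch(name, p) for p in patterns if not p.startswith("!"))
    excluded = any(fnmatch(name, p[1:]) for p in patterns if p.startswith("!"))
    return included and not excluded


# Patterns as used in this recipe's `meltano.yml`.
patterns = ["sys-summits.*", "!sys-summits.coordinates"]
print(is_selected("sys-summits", "mountain", patterns))     # True
print(is_selected("sys-summits", "coordinates", patterns))  # False
```

To inspect the authoritative result, use `meltano select tap-cratedb --list`.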

### target-{csv,jsonl}

Within the `loaders` section, at the corresponding subsections, adjust the
relevant settings `output_path_prefix`, `destination_path`, `local.folder`,
or `bucket` to configure the filesystem destination where exported data
is saved.

## Usage

Install dependencies.
```shell
meltano install
```

Discover data schema.
```shell
meltano invoke tap-cratedb --discover > catalog.json
```
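
To get a quick overview of the discovered catalog, a small Python sketch like
the following may help. It assumes the usual Singer catalog layout, a top-level
`streams` list whose entries carry `tap_stream_id` and a JSON `schema`. The
helper name `list_streams` and the abbreviated sample catalog are hypothetical.

```python
import json


def list_streams(catalog: dict) -> dict[str, list[str]]:
    """Map each stream id to the sorted column names of its JSON schema."""
    result = {}
    for stream in catalog.get("streams", []):
        properties = stream.get("schema", {}).get("properties", {})
        result[stream["tap_stream_id"]] = sorted(properties)
    return result


# Hypothetical, heavily abbreviated catalog for illustration only.
sample = json.loads("""
{"streams": [{"tap_stream_id": "sys-summits",
              "schema": {"properties": {"mountain": {}, "height": {}}}}]}
""")
print(list_streams(sample))  # {'sys-summits': ['height', 'mountain']}
```

For the real file, load it with `json.load(open("catalog.json"))` and pass the
result to `list_streams`.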

Run the plugin standalone, as a test drive.
```shell
meltano invoke tap-cratedb --catalog catalog.json
```

Transfer data from the CrateDB database to output files.
```shell
meltano run tap-cratedb target-csv
meltano run tap-cratedb target-jsonl
meltano run tap-cratedb target-singer-jsonl
```
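
After a run, the exported files can be checked with a few lines of Python.
This is only a sketch: the helper names `count_lines` and `verify_export` are
hypothetical, and the file path and the threshold of 1605 lines are taken from
this recipe's `pyproject.toml` checks. Unlike a bare `test` command, it reports
what went wrong.

```python
from pathlib import Path


def count_lines(path: Path) -> int:
    """Count newline characters in a file, like `wc -l`."""
    with path.open("rb") as f:
        return sum(chunk.count(b"\n") for chunk in iter(lambda: f.read(65536), b""))


def verify_export(path: Path, minimum: int) -> str:
    """Return a one-line report; raise if the file is missing or too short."""
    if not path.exists():
        raise FileNotFoundError(f"Export file missing: {path}")
    count = count_lines(path)
    if count < minimum:
        raise ValueError(f"{path}: expected at least {minimum} lines, found {count}")
    return f"{path}: {count} lines, OK"
```

For example, `verify_export(Path("output/csv/sys-summits.csv"), 1605)` either
returns a short success message or raises an error that names the problem.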

## Screenshot

Enjoy the list of mountains.

```shell
cat output/jsonl-txt/sys-summits.jsonl | head -n 3
```

```json lines
{"classification": "I/B-07.V-B", "country": "FR/IT", "first_ascent": 1786, "height": 4808, "mountain": "Mont Blanc", "prominence": 4695, "range": "U-Savoy/Aosta", "region": "Mont Blanc massif"}
{"classification": "I/B-09.III-A", "country": "CH", "first_ascent": 1855, "height": 4634, "mountain": "Monte Rosa", "prominence": 2165, "range": "Valais", "region": "Monte Rosa Alps"}
{"classification": "I/B-09.V-A", "country": "CH", "first_ascent": 1858, "height": 4545, "mountain": "Dom", "prominence": 1046, "range": "Valais", "region": "Mischabel"}
```
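
The JSONL output is easy to process further. As a minimal sketch, the following
Python snippet parses such records and ranks them by `height`. The function
names `parse_jsonl` and `tallest` are hypothetical, and the sample records are
abbreviated from the output above.

```python
import json


def parse_jsonl(lines):
    """Parse an iterable of JSON Lines strings, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def tallest(records, n=3):
    """Return the names of the `n` highest mountains."""
    ranked = sorted(records, key=lambda r: r["height"], reverse=True)
    return [r["mountain"] for r in ranked[:n]]


sample = [
    '{"height": 4545, "mountain": "Dom"}',
    '{"height": 4808, "mountain": "Mont Blanc"}',
    '{"height": 4634, "mountain": "Monte Rosa"}',
]
print(tallest(parse_jsonl(sample), n=2))  # ['Mont Blanc', 'Monte Rosa']
```

To use it on the exported file, open `output/jsonl-txt/sys-summits.jsonl` and
pass the file object to `parse_jsonl`.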

## Development

In order to link the sandbox to a development installation of [meltano-tap-cratedb],
configure the `pip_url` of the component like this:

```yaml
pip_url: --editable=/path/to/sources/meltano-tap-cratedb
```


[meltano-tap-cratedb]: https://github.com/crate/meltano-tap-cratedb
[target-csv]: https://hub.meltano.com/loaders/target-csv/
[target-jsonl]: https://hub.meltano.com/loaders/target-jsonl/
[target-jsonl-blob]: https://github.com/MeltanoLabs/target-jsonl-blob
[target-singer-jsonl]: https://hub.meltano.com/loaders/target-singer-jsonl/
115 changes: 115 additions & 0 deletions framework/meltano/cratedb-to-file/meltano.yml
@@ -0,0 +1,115 @@
# A Meltano project is just a directory on your filesystem containing text-based files.
# At a minimum, a Meltano project must contain a project file named `meltano.yml`,
# which contains your project configuration, and tells Meltano that a particular
# directory is a Meltano project.
#
# A Meltano project defines the elements of a data processing pipeline, and can
# include both configuration and code. A pipeline is a classic construct to
# describe a data mangling process from A (source) to B (sink). In Singer/Meltano
# jargon, those elements are named differently, but you will get over it ;].
#
# Legend:
# - Data Source: Singer Tap, Meltano Extractor
# - Data Sink: Singer Target, Meltano Loader
---
version: 1
default_environment: dev
send_anonymous_usage_stats: false
project_id: da405837-da1d-48c8-b7b1-4fb2e5e60f80

environments:
- name: dev
- name: staging
- name: prod

plugins:

  # Configure data sources (Singer Tap / Meltano Extractor).
  extractors:

    - name: tap-cratedb
      namespace: cratedb
      variant: cratedb

      # Acquire package from PyPI.
      pip_url: meltano-tap-cratedb

      # Acquire package from GitHub.
      # pip_url: git+https://github.com/crate/meltano-tap-cratedb.git@make-it-work

      # Until the tap is published on Meltano Hub, this needs to be defined here.
      # Otherwise, automatic reflection will not work, and the `select` attribute
      # is not honoured. See also `meltano select tap-cratedb --list`.
      # Selecting streams within the Meltano project definition is a feature implemented
      # on behalf of the Meltano package, within `meltano.core.select_service`. It does
      # not exist in the Singer SDK.
      capabilities:
        - state
        - catalog
        - discover

      # Configure database server and credentials.
      config:
        sqlalchemy_url: crate://crate@localhost/

      # Configure selected streams / database tables.
      select:

        # Use all columns from `sys.summits`.
        - "sys-summits.*"

        # Omit `sys.summits.coordinates`, because the
        # geospatial data type is not supported yet.
        # TODO: Update software components and validate again.
        - "!sys-summits.coordinates"

        # Evaluate selecting other streams.
        #- "*"
        #- "sys-jobs.*"
        #- "!sys-jobs.node"
        #- "!sys-operations.node"

  # Configure data sinks (Singer Target / Meltano Loader).
  loaders:

    # https://github.com/MeltanoLabs/target-csv
    - name: target-csv
      variant: meltanolabs
      config:
        output_path_prefix: ./output/csv/

    # Receive Singer messages and write to JSONL formatted files.
    # https://hub.meltano.com/loaders/target-jsonl
    - name: target-jsonl
      variant: andyh1203
      pip_url: target-jsonl
      config:
        destination_path: ./output/jsonl-txt

    # Receive Singer messages and write to JSONL formatted and g-zipped files.
    # https://hub.meltano.com/loaders/target-singer-jsonl
    #
    # Supports local filesystem and S3 destinations.
    # The schema is to use folder/key per stream using the stream name and a
    # timestamp for each file.
    # local: <folder>/<stream name>/<stream name>-<timestamp>.singer.gz
    # s3: s3://<bucket>/<prefix>/<stream name>/<stream name>-<timestamp>.singer.gz
    - name: target-singer-jsonl
      variant: kgpayne
      pip_url: git+https://github.com/kgpayne/target-singer-jsonl
      config:
        add_record_metadata: false
        local:
          folder: ./output/jsonl-gz

    # Receive Singer messages and write to JSONL formatted files.
    # https://github.com/MeltanoLabs/target-jsonl-blob
    #
    # Supports local filesystem, S3, Azure, and GCS.
    # It is implemented in Go, and needs a customized installation.
    - name: target-jsonl-blob
      variant: meltanolabs
      namespace: target_jsonl_blob
      executable: ./target-jsonl-blob
      config:
        bucket: file://./output/my-bucket
47 changes: 47 additions & 0 deletions framework/meltano/cratedb-to-file/pyproject.toml
@@ -0,0 +1,47 @@
[tool.poe.tasks]

test = [
# Invoke Meltano ETL pipeline.
"meltano-pipeline",

# Verify exported results.
"verify-export",
]

meltano-pipeline = [

# Install recipe.
{ cmd = "meltano lock --update --all" },
{ cmd = "meltano install" },

# Singer tap schema discovery, for generating catalogs.
{ shell = "meltano invoke tap-cratedb --discover > catalog.json" },

# Display Meltano extractor selection patterns.
{ cmd = "meltano select tap-cratedb --list" },

# Run plugin standalone.
{ cmd = "meltano invoke tap-cratedb" },

# Invoke pipeline, exporting data from the database, for real.
# Spell out the directories, because brace expansion is not available in every `sh`.
{ shell = "mkdir -p output/csv output/jsonl-gz output/jsonl-txt" },
{ cmd = "meltano el tap-cratedb target-csv" },
{ cmd = "meltano el tap-cratedb target-jsonl" },
{ cmd = "meltano el tap-cratedb target-singer-jsonl" },

]

# TODO: In a future iteration, I would like to see validators of better quality here,
# that provide concise output to the user instead of just an exitcode != 0.
verify-export = [

# Verify existence of output files.
{ cmd = "test -e output/csv/sys-summits.csv" },
{ cmd = "test -e output/jsonl-txt/sys-summits.jsonl" },
{ shell = 'test ! -z "$(ls output/jsonl-gz/sys-summits/sys-summits-*.singer.gz)"' },

# Verify number of exported records.
{ shell = "test $(cat output/csv/sys-summits.csv | wc -l | tr -d ' ') -ge 1605" },
{ shell = "test $(cat output/jsonl-txt/sys-summits.jsonl | wc -l | tr -d ' ') -ge 1605" },

]
1 change: 1 addition & 0 deletions framework/meltano/pyproject.toml
@@ -22,4 +22,5 @@ lint = [

test = [
{ cmd = "poe --root file-to-cratedb test" },
{ cmd = "poe --root cratedb-to-file test" },
]