Source Apify Dataset: fix broken stream, manifest refactor #30428

Merged · 45 commits · Oct 6, 2023

Commits
43d75bd
Improve descriptions in `spec.yaml` of Apify connector
vdusek Sep 4, 2023
87904fd
Remove broken Item Collection stream
vdusek Sep 13, 2023
90c8bf2
Add Item Collection WCC stream
vdusek Sep 13, 2023
2e173d0
Rename streams
vdusek Sep 13, 2023
244ed93
Remove partitioning since it was broken
vdusek Sep 13, 2023
b77c7bc
Update to version 0.51.11
vdusek Sep 13, 2023
6992b1e
Move Spec into Manifest
vdusek Sep 13, 2023
8237df8
Every stream has its own Selector
vdusek Sep 13, 2023
1c1e5c6
Properties are in camel_case
vdusek Sep 13, 2023
9bbb7a9
Update auth to use bearer
vdusek Sep 13, 2023
162e4bb
Remove useless base defs
vdusek Sep 13, 2023
c78bda8
Simplify definitions of streams
vdusek Sep 13, 2023
6117192
Simplification of Paginator
vdusek Sep 13, 2023
a919845
Schema loader
vdusek Sep 13, 2023
400149c
Add Building via Python section to README
vdusek Sep 14, 2023
038c808
Version, changelog, metadata
vdusek Sep 14, 2023
61636d5
Fixing stuff based on review
vdusek Sep 15, 2023
abfce14
fix schema paths
Sep 25, 2023
3073737
Automated Commit - Formatting Changes
flash1293 Sep 25, 2023
64a0b55
Merge remote-tracking branch 'origin/master' into update-apify-connector
Sep 25, 2023
318fd68
Merge branch 'update-apify-connector' of github.com:airbytehq/airbyte…
Sep 25, 2023
13c01e5
fix metadata
Sep 25, 2023
5812f57
Merge branch 'master' into update-apify-connector
Sep 25, 2023
413cc92
Automated Commit - Formatting Changes
flash1293 Sep 25, 2023
e620256
Changes based on the review
vdusek Sep 29, 2023
8561c90
Merge remote-tracking branch 'vdusek/update-apify-connector' into upd…
Oct 2, 2023
0ad772a
Merge remote-tracking branch 'origin/master' into update-apify-connector
Oct 2, 2023
566815b
adjust acceptance tests
Oct 2, 2023
769c7c8
fix tests
Oct 2, 2023
75934fb
Merge remote-tracking branch 'origin/master' into update-apify-connector
Oct 2, 2023
e85d045
Merge branch 'master' into update-apify-connector
Oct 2, 2023
609befa
Merge branch 'master' into update-apify-connector
Oct 2, 2023
95f3f03
fix tests
Oct 2, 2023
f676417
Merge branch 'update-apify-connector' of github.com:airbytehq/airbyte…
Oct 2, 2023
1ab61da
Merge branch 'master' into update-apify-connector
Oct 3, 2023
f3dce63
Merge branch 'update-apify-connector' of github.com:vdusek/airbyte in…
Oct 3, 2023
e77ed45
Trigger Build
Oct 3, 2023
855f99f
Merge remote-tracking branch 'origin/master' into update-apify-connector
Oct 6, 2023
97c000b
fix schemas
Oct 6, 2023
7174201
fix schemas
Oct 6, 2023
8b8555c
fix schema for good
Oct 6, 2023
f6a10d1
Trigger Build
Oct 6, 2023
3d5edc1
Merge branch 'master' into update-apify-connector
Oct 6, 2023
404b1de
Merge branch 'master' into update-apify-connector
Oct 6, 2023
6f18ff4
Trigger Build
Oct 6, 2023
Files changed

airbyte-integrations/connectors/source-apify-dataset/Dockerfile
@@ -34,5 +34,5 @@ COPY source_apify_dataset ./source_apify_dataset
ENV AIRBYTE_ENTRYPOINT "python /airbyte/integration_code/main.py"
ENTRYPOINT ["python", "/airbyte/integration_code/main.py"]

LABEL io.airbyte.version=1.0.0
LABEL io.airbyte.version=1.1.0
LABEL io.airbyte.name=airbyte/source-apify-dataset
56 changes: 54 additions & 2 deletions airbyte-integrations/connectors/source-apify-dataset/README.md
@@ -5,15 +5,50 @@ For information about how to use this connector within Airbyte, see [the documen

## Local development

#### Building via Python

Create a Python virtual environment

```
virtualenv --python $(which python3.10) .venv
```

Source it

```
source .venv/bin/activate
```

Check the connector specification/definition

```
python main.py spec
```

Basic check - verify the connection to the API

```
python main.py check --config secrets/config.json
```

Integration tests - run a read operation against the API

```
python main.py read --config secrets/config.json --catalog integration_tests/configured_catalog.json
```

#### Building via Gradle

You can also build the connector in Gradle. This is typically used in CI and not needed for your development workflow.

To build using Gradle, from the Airbyte repository root, run:

```
./gradlew :airbyte-integrations:connectors:source-apify-dataset:build
```

#### Create credentials

**If you are a community contributor**, follow the instructions in the [documentation](https://docs.airbyte.com/integrations/sources/apify-dataset)
to generate the necessary credentials. Then create a file `secrets/config.json` conforming to the `source_apify_dataset/spec.yaml` file.
Note that any directory named `secrets` is gitignored across the entire Airbyte repo, so there is no danger of accidentally checking in sensitive information.
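For reference, here is a minimal sketch of creating such a config (illustrative only; the field names follow the connector spec, the values are placeholders, and `clean` is optional):

```
import json
from pathlib import Path

# Illustrative only: placeholder values, not real credentials.
config = {
    "token": "apify_api_...",           # Apify personal API token (keep secret)
    "dataset_id": "rHuMdwm6xCFt6WiGU",  # example dataset ID from the spec
    "clean": False,                     # optional: only download "clean" items
}

Path("secrets").mkdir(exist_ok=True)
Path("secrets/config.json").write_text(json.dumps(config, indent=2))
```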
@@ -25,56 +25,73 @@ and place them into `secrets/config.json`.
### Locally running the connector docker image

#### Build

First, make sure you build the latest Docker image:

```
docker build . -t airbyte/source-apify-dataset:dev
```

You can also build the connector image via Gradle:

```
./gradlew :airbyte-integrations:connectors:source-apify-dataset:airbyteDocker
```

When building via Gradle, the docker image name and tag, respectively, are the values of the `io.airbyte.name` and `io.airbyte.version` `LABEL`s in
the Dockerfile.

#### Run

Then run any of the connector commands as follows:

```
docker run --rm airbyte/source-apify-dataset:dev spec
docker run --rm -v $(pwd)/secrets:/secrets airbyte/source-apify-dataset:dev check --config /secrets/config.json
docker run --rm -v $(pwd)/secrets:/secrets airbyte/source-apify-dataset:dev discover --config /secrets/config.json
docker run --rm -v $(pwd)/secrets:/secrets -v $(pwd)/integration_tests:/integration_tests airbyte/source-apify-dataset:dev read --config /secrets/config.json --catalog /integration_tests/configured_catalog.json
```

## Testing

#### Acceptance Tests

Customize the `acceptance-test-config.yml` file to configure the tests. See [Connector Acceptance Tests](https://docs.airbyte.com/connector-development/testing-connectors/connector-acceptance-tests-reference) for more information.
If your connector requires creating or destroying resources for use during acceptance tests, create fixtures for them and place them inside `integration_tests/acceptance.py`, for example as sketched below.
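A minimal sketch of such a fixture, assuming the pytest-based acceptance test suite (the setup and teardown bodies are placeholders):

```
# integration_tests/acceptance.py - illustrative sketch only
import pytest


@pytest.fixture(scope="session", autouse=True)
def connector_setup():
    """Create any external resources the acceptance tests need, then clean them up."""
    # Set up resources here (this connector currently needs none).
    yield
    # Tear down resources here.
```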

To run your integration tests with Docker, run:

```
./acceptance-test-docker.sh
```

### Using gradle to run tests

All commands should be run from the Airbyte project root.
To run unit tests:

```
./gradlew :airbyte-integrations:connectors:source-apify-dataset:unitTest
```

To run acceptance and custom integration tests:

```
./gradlew :airbyte-integrations:connectors:source-apify-dataset:integrationTest
```

## Dependency Management

All of your dependencies should go in `setup.py`, NOT `requirements.txt`. The requirements file is only used to connect internal Airbyte dependencies in the monorepo for local development.
We split dependencies between two groups, dependencies that are:
* required for your connector to work need to go to `MAIN_REQUIREMENTS` list.
* required for the testing need to go to `TEST_REQUIREMENTS` list

- required for your connector to work need to go to `MAIN_REQUIREMENTS` list.
- required for the testing need to go to `TEST_REQUIREMENTS` list
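As an illustrative sketch of this split (the package names below are examples, not the connector's actual pinned dependencies):

```
# setup.py - illustrative sketch of the dependency split
from setuptools import find_packages, setup

MAIN_REQUIREMENTS = ["airbyte-cdk"]              # runtime dependencies
TEST_REQUIREMENTS = ["pytest", "requests-mock"]  # test-only dependencies

setup(
    name="source_apify_dataset",
    packages=find_packages(exclude=["unit_tests"]),
    install_requires=MAIN_REQUIREMENTS,
    extras_require={"tests": TEST_REQUIREMENTS},
)
```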

### Publishing a new version of the connector

You've checked out the repo, implemented a million dollar feature, and you're ready to share your changes with the world. Now what?

1. Make sure your changes are passing unit and integration tests.
1. Bump the connector version in `Dockerfile` -- just increment the value of the `LABEL io.airbyte.version` appropriately (we use [SemVer](https://semver.org/)).
1. Create a Pull Request.
airbyte-integrations/connectors/source-apify-dataset/integration_tests/configured_catalog.json
@@ -2,7 +2,7 @@
"streams": [
{
"stream": {
"name": "datasets",
"name": "Dataset Collection",
"json_schema": {},
"supported_sync_modes": ["full_refresh"]
},
@@ -11,7 +11,7 @@
},
{
"stream": {
"name": "dataset",
"name": "Dataset",
"json_schema": {},
"supported_sync_modes": ["full_refresh"]
},
@@ -20,7 +20,7 @@
},
{
"stream": {
"name": "item_collection",
"name": "Item Collection - Website Content Crawler (WCC)",
"json_schema": {},
"supported_sync_modes": ["full_refresh"]
},
airbyte-integrations/connectors/source-apify-dataset/metadata.yaml
@@ -23,6 +23,9 @@ data:
1.0.0:
upgradeDeadline: 2023-08-30
message: "Update spec to use token and ingest all 3 streams correctly"
1.1.0:
upgradeDeadline: 2023-09-14
message: "Fix broken stream, manifest refactor"
supportLevel: community
documentationUrl: https://docs.airbyte.com/integrations/sources/apify-dataset
tags:
airbyte-integrations/connectors/source-apify-dataset/source_apify_dataset/manifest.yaml
@@ -1,109 +1,122 @@
version: "0.29.0"
version: "0.51.11"
type: DeclarativeSource

definitions:
selector:
type: RecordSelector
extractor:
type: DpathExtractor
field_path: ["data"]
requester:
type: HttpRequester
url_base: "https://api.apify.com/v2/"
http_method: "GET"
authenticator:
type: NoAuth
request_parameters:
token: "{{ config['token'] }}"
spec:
type: Spec
documentation_url: https://docs.airbyte.com/integrations/sources/apify-dataset
connection_specification:
$schema: http://json-schema.org/draft-07/schema#
title: Apify Dataset Spec
type: object
required:
- token
- dataset_id
properties:
token:
type: string
title: API token
description: >-
Personal API token of your Apify account. In Apify Console, you can find your API token in the
<a href="https://console.apify.com/account/integrations">Settings section under the Integrations tab</a>
after you login. See the <a href="https://docs.apify.com/platform/integrations/api#api-token">Apify Docs</a>
for more information.
examples:
- apify_api_PbVwb1cBbuvbfg2jRmAIHZKgx3NQyfEMG7uk
airbyte_secret: true
dataset_id:
type: string
title: Dataset ID
description: >-
ID of the dataset you would like to load to Airbyte. In Apify Console, you can view your datasets in the
<a href="https://console.apify.com/storage/datasets">Storage section under the Datasets tab</a>
after you login. See the <a href="https://docs.apify.com/platform/storage/dataset">Apify Docs</a>
for more information.
examples:
- rHuMdwm6xCFt6WiGU
clean:
type: boolean
title: Clean
description: >-
If set to true, only clean items will be downloaded from the dataset. See description of what clean means in
<a href="https://docs.apify.com/api/v2#/reference/datasets/item-collection/get-items">Apify API docs</a>.
If not sure, set clean to false.
additionalProperties: true

definitions:
retriever:
type: SimpleRetriever
record_selector:
$ref: "#/definitions/selector"
paginator:
type: "NoPagination"
requester:
$ref: "#/definitions/requester"

base_paginator:
type: "DefaultPaginator"
page_size_option:
type: "RequestOption"
inject_into: "request_parameter"
field_name: "limit"
pagination_strategy:
type: "OffsetIncrement"
page_size: 50
page_token_option:
type: "RequestOption"
field_name: "offset"
inject_into: "request_parameter"

base_stream:
type: DeclarativeStream
retriever:
$ref: "#/definitions/retriever"
type: HttpRequester
url_base: "https://api.apify.com/v2/"
http_method: "GET"
authenticator:
type: BearerAuthenticator
api_token: "{{ config['token'] }}"
paginator:
type: "DefaultPaginator"
page_size_option:
type: "RequestOption"
inject_into: "request_parameter"
field_name: "limit"
pagination_strategy:
type: "OffsetIncrement"
page_size: 50
page_token_option:
type: "RequestOption"
field_name: "offset"
inject_into: "request_parameter"

datasets_stream:
$ref: "#/definitions/base_stream"
streams:
- type: DeclarativeStream
name: "Dataset Collection"
primary_key: "id"
$parameters:
name: "datasets"
primary_key: "id"
path: "datasets"
schema_loader:
type: JsonFileSchemaLoader
file_path: "schemas/dataset_collection.json"
retriever:
$ref: "#/definitions/retriever"
paginator:
$ref: "#/definitions/base_paginator"
record_selector:
$ref: "#/definitions/selector"
type: RecordSelector
extractor:
type: DpathExtractor
field_path: ["data", "items"]

datasets_partition_router:
type: SubstreamPartitionRouter
parent_stream_configs:
- stream: "#/definitions/datasets_stream"
parent_key: "id"
partition_field: "parent_id"

dataset_stream:
$ref: "#/definitions/base_stream"
- type: DeclarativeStream
name: "Dataset"
primary_key: "id"
$parameters:
name: "dataset"
primary_key: "id"
path: "datasets/{{ stream_partition.parent_id }}"
path: "datasets/{{ config['dataset_id'] }}"
schema_loader:
type: JsonFileSchemaLoader
file_path: "schemas/dataset.json"
retriever:
$ref: "#/definitions/retriever"
paginator:
$ref: "#/definitions/base_paginator"
partition_router:
$ref: "#/definitions/datasets_partition_router"
record_selector:
type: RecordSelector
extractor:
type: DpathExtractor
field_path: ["data"]

item_collection_stream:
$ref: "#/definitions/base_stream"
- type: DeclarativeStream
name: "Item Collection - Website Content Crawler (WCC)"
$parameters:
name: "item_collection"
path: "datasets/{{ stream_partition.parent_id }}/items"
path: "datasets/{{ config['dataset_id'] }}/items"
schema_loader:
type: JsonFileSchemaLoader
file_path: "schemas/item_collection_wcc.json"
retriever:
$ref: "#/definitions/retriever"
paginator:
$ref: "#/definitions/base_paginator"
record_selector:
$ref: "#/definitions/selector"
type: RecordSelector
extractor:
type: DpathExtractor
field_path: []
partition_router:
$ref: "#/definitions/datasets_partition_router"

streams:
- "#/definitions/datasets_stream"
- "#/definitions/dataset_stream"
- "#/definitions/item_collection_stream"

check:
type: CheckStream
stream_names:
- "datasets"
- "dataset"
- "item_collection"
- "Dataset Collection"
- "Dataset"
- "Item Collection - Website Content Crawler (WCC)"

This file was deleted.
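To make the refactored manifest above easier to follow, here is a rough sketch (illustrative only, not part of the connector code) of the request described by the new bearer-token requester and offset paginator for the item collection stream; the token and dataset ID values are placeholders:

```
# Illustrative only: mirrors the manifest's BearerAuthenticator and
# OffsetIncrement paginator (limit/offset request parameters, page size 50).
import requests

API_TOKEN = "apify_api_..."        # placeholder; the spec's `token` field
DATASET_ID = "rHuMdwm6xCFt6WiGU"   # example ID taken from the spec

response = requests.get(
    f"https://api.apify.com/v2/datasets/{DATASET_ID}/items",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    params={"limit": 50, "offset": 0},
    timeout=30,
)
response.raise_for_status()
items = response.json()  # the item collection stream reads records from this array
```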
