Components - TFX #2671

Merged
merged 26 commits into from
Dec 5, 2019
1de4368
Added CsvExampleGen component
Ark-kun Oct 21, 2019
1751f99
Switched to using some processing code from the component class
Ark-kun Oct 31, 2019
0ccd7c3
Renamed output_examples to example_artifacts for consistency with the…
Ark-kun Oct 31, 2019
ace062e
Fixed the docstring a bit
Ark-kun Oct 31, 2019
8e30a62
Added StatisticsGen
Ark-kun Oct 31, 2019
b8fd5a7
Added SchemaGen
Ark-kun Oct 31, 2019
35ab2e0
Fixed the input_dict construction
Ark-kun Nov 1, 2019
fcef473
Use None defaults
Ark-kun Nov 1, 2019
8a1d1e5
Switched to TFX container image
Ark-kun Nov 1, 2019
d6e6b52
Updated component definitions
Ark-kun Nov 1, 2019
fa7374c
Fixed StatisticsGen and SchemaGen
Ark-kun Nov 2, 2019
a9e784e
Printing component instance in CsvExampleGen
Ark-kun Nov 2, 2019
3a1159a
Moved components to directories
Ark-kun Nov 2, 2019
5645997
Updated the sample TFX pipeline
Ark-kun Nov 2, 2019
eb8f281
Renamed ExamplesPath to Examples for data passing components
Ark-kun Nov 7, 2019
f84d7c9
Corrected output_component_file paths
Ark-kun Nov 7, 2019
1cc4a0f
Added the Transform component
Ark-kun Nov 7, 2019
7cc3350
Added the Trainer component
Ark-kun Nov 7, 2019
9f5fe9c
Added the BigQueryExampleGen component
Ark-kun Nov 7, 2019
91ec94d
Added the ImportExampleGen component
Ark-kun Nov 7, 2019
4892bbf
Added the Evaluator component
Ark-kun Nov 7, 2019
bda6978
Added the ExampleValidator component
Ark-kun Nov 7, 2019
093e2d2
Updated the sample
Ark-kun Nov 26, 2019
034cd32
Upgraded to TFX 0.15.0
Ark-kun Nov 28, 2019
8fec5f1
Upgraded the sample to 0.15.0
Ark-kun Nov 28, 2019
bbf11e3
Silence Flake8 for annotations
Ark-kun Nov 28, 2019
Updated component definitions
Ark-kun committed Nov 2, 2019
commit d6e6b52084420a50e05f997697cdf0ac5f98c84f
265 changes: 147 additions & 118 deletions components/tfx/CsvExampleGen.component.yaml
@@ -1,3 +1,10 @@
name: CsvExampleGen
inputs:
- {name: input_base, type: ExternalPath}
- {name: input_config, optional: true, type: ExampleGen.Input}
- {name: output_config, optional: true, type: ExampleGen.Output}
outputs:
- {name: example_artifacts, type: ExamplesPath}
description: |
Executes the CsvExampleGen component.

@@ -10,132 +17,154 @@ description: |
output_config: An example_gen_pb2.Output instance, providing output
configuration. If unset, default splits will be 'train' and 'eval' with
size 2:1.
??? example_artifacts: Optional channel of 'ExamplesPath' for output train and
eval examples.
??? input: Forwards compatibility alias for the 'input_base' argument.
??? instance_name: Optional unique instance name. Necessary if multiple
CsvExampleGen components are declared in the same pipeline.
Returns:
example_artifacts: Artifact of type 'ExamplesPath' for output train and
eval examples.
implementation:
container:
image: tensorflow/tfx:0.15.0rc0
command:
- python3
- -u
- -c
- |
def _make_parent_dirs_and_return_path(file_path: str):
    import os
    os.makedirs(os.path.dirname(file_path), exist_ok=True)
    return file_path

class OutputPath:
    '''When creating component from function, OutputPath should be used as function parameter annotation to tell the system that the function wants to output data by writing it into a file with the given path instead of returning the data from the function.'''
    def __init__(self, type=None):
        self.type = type

class InputPath:
    '''When creating component from function, InputPath should be used as function parameter annotation to tell the system to pass the *data file path* to the function instead of passing the actual data.'''
    def __init__(self, type=None):
        self.type = type

def CsvExampleGen(
    # Inputs
    input_base_path: InputPath('ExternalPath'),
    #input_base_path: 'ExternalPath', # A Channel of 'ExternalPath' type, which includes one artifact whose uri is an external directory with csv files inside (required).

    # Outputs
    example_artifacts_path: OutputPath('ExamplesPath'),
    #example_artifacts_path: 'ExamplesPath',

    # Execution properties
    #input_config_splits: {'List' : {'item_type': 'ExampleGen.Input.Split'}},
    input_config: 'ExampleGen.Input' = None, # = '{"splits": []}', # JSON-serialized example_gen_pb2.Input instance, providing input configuration. If unset, the files under input_base will be treated as a single split.
    #output_config_splits: {'List' : {'item_type': 'ExampleGen.SplitConfig'}},
    output_config: 'ExampleGen.Output' = None, # = '{"splitConfig": {"splits": []}}', # JSON-serialized example_gen_pb2.Output instance, providing output configuration. If unset, default splits will be 'train' and 'eval' with size 2:1.
    #custom_config: 'ExampleGen.CustomConfig' = None,
):
    """\
    Executes the CsvExampleGen component.

    Args:
        input_base: A Channel of 'ExternalPath' type, which includes one artifact
            whose uri is an external directory with csv files inside (required).
        input_config: An example_gen_pb2.Input instance, providing input
            configuration. If unset, the files under input_base will be treated as a
            single split.
        output_config: An example_gen_pb2.Output instance, providing output
            configuration. If unset, default splits will be 'train' and 'eval' with
            size 2:1.
        ??? input: Forwards compatibility alias for the 'input_base' argument.
    Returns:
        example_artifacts: Artifact of type 'ExamplesPath' for output train and
            eval examples.
    """

    import json
    import os
    from google.protobuf import json_format
    from tfx.components.example_gen.csv_example_gen.component import CsvExampleGen
    from tfx.proto import example_gen_pb2
    from tfx.types import standard_artifacts
    from tfx.types import channel_utils

    # Create input dict.
    input_base = standard_artifacts.ExternalArtifact()
    input_base.uri = input_base_path
    input_base_channel = channel_utils.as_channel([input_base])

    input_config_obj = None
    if input_config:
        input_config_obj = example_gen_pb2.Input()
        json_format.Parse(input_config, input_config_obj)

    output_config_obj = None
    if output_config:
        output_config_obj = example_gen_pb2.Output()
        json_format.Parse(output_config, output_config_obj)

    component_class_instance = CsvExampleGen(
        input=input_base_channel,
        input_config=input_config_obj,
        output_config=output_config_obj,
    )

    # component_class_instance.inputs/outputs are wrappers that do not behave like real dictionaries. The underlying dict can be accessed using .get_all()
    # Channel artifacts can be accessed by calling .get()
    input_dict = {name: channel.get() for name, channel in component_class_instance.inputs.get_all().items()}
    output_dict = {name: channel.get() for name, channel in component_class_instance.outputs.get_all().items()}

    exec_properties = component_class_instance.exec_properties

    # Generating paths for output artifacts
    for output_artifact in output_dict['examples']:
        output_artifact.uri = example_artifacts_path
        if output_artifact.split:
            output_artifact.uri = os.path.join(output_artifact.uri, output_artifact.split)

    executor = CsvExampleGen.EXECUTOR_SPEC.executor_class()
    executor.Do(
        input_dict=input_dict,
        output_dict=output_dict,
        exec_properties=exec_properties,
    )

import argparse
_parser = argparse.ArgumentParser(prog='Csvexamplegen', description="Executes the CsvExampleGen component.\n\n Args:\n input_base: A Channel of 'ExternalPath' type, which includes one artifact\n whose uri is an external directory with csv files inside (required).\n input_config: An example_gen_pb2.Input instance, providing input\n configuration. If unset, the files under input_base will be treated as a\n single split.\n output_config: An example_gen_pb2.Output instance, providing output\n configuration. If unset, default splits will be 'train' and 'eval' with\n size 2:1.\n ??? input: Forwards compatibility alias for the 'input_base' argument.\n Returns:\n example_artifacts: Artifact of type 'ExamplesPath' for output train and\n eval examples.\n")
_parser.add_argument("--input-base", dest="input_base_path", type=str, required=True, default=argparse.SUPPRESS)
_parser.add_argument("--input-config", dest="input_config", type=str, required=False, default=argparse.SUPPRESS)
_parser.add_argument("--output-config", dest="output_config", type=str, required=False, default=argparse.SUPPRESS)
_parser.add_argument("--example-artifacts", dest="example_artifacts_path", type=_make_parent_dirs_and_return_path, required=True, default=argparse.SUPPRESS)
_parsed_args = vars(_parser.parse_args())
_output_files = _parsed_args.pop("_output_paths", [])

_outputs = CsvExampleGen(**_parsed_args)

if not hasattr(_outputs, '__getitem__') or isinstance(_outputs, str):
    _outputs = [_outputs]

_output_serializers = [

]

import os
for idx, output_file in enumerate(_output_files):
    try:
        os.makedirs(os.path.dirname(output_file))
    except OSError:
        pass
    with open(output_file, 'w') as f:
        f.write(_output_serializers[idx](_outputs[idx]))
args:
- --input-base
- inputPath: input_base
- {inputPath: input_base}
- if:
cond:
isPresent: input_config
cond: {isPresent: input_config}
then:
- --input-config
- inputValue: input_config
- {inputValue: input_config}
- if:
cond:
isPresent: output_config
cond: {isPresent: output_config}
then:
- --output-config
- inputValue: output_config
- --output-examples
- outputPath: output_examples
command:
- sh
- -c
- (PIP_DISABLE_PIP_VERSION_CHECK=1 python3 -m pip install --quiet --no-warn-script-location
'tfx==0.14' 'six>=1.12.0' || PIP_DISABLE_PIP_VERSION_CHECK=1 python3 -m pip
install --quiet --no-warn-script-location 'tfx==0.14' 'six>=1.12.0' --user)
&& "$0" "$@"
- python3
- -u
- -c
- "def _make_parent_dirs_and_return_path(file_path: str):\n import os\n \
\ os.makedirs(os.path.dirname(file_path), exist_ok=True)\n return file_path\n\
\nclass InputPath:\n '''When creating component from function, InputPath\
\ should be used as function parameter annotation to tell the system to pass\
\ the *data file path* to the function instead of passing the actual data.'''\n\
\ def __init__(self, type=None):\n self.type = type\n\nclass OutputPath:\n\
\ '''When creating component from function, OutputPath should be used as\
\ function parameter annotation to tell the system that the function wants to\
\ output data by writing it into a file with the given path instead of returning\
\ the data from the function.'''\n def __init__(self, type=None):\n \
\ self.type = type\n\ndef CsvExampleGen(\n # Inputs\n input_base_path:\
\ InputPath('ExternalPath'),\n#input_base_path: 'ExternalPath', # A Channel\
\ of 'ExternalPath' type, which includes one artifact whose uri is an external\
\ directory with csv files inside (required).\n\n # Outputs\n output_examples_path:\
\ OutputPath('ExamplesPath'),\n #output_examples_path: 'ExamplesPath',\n\n\
\ # Execution properties\n #input_config_splits: {'List' : {'item_type':\
\ 'ExampleGen.Input.Split'}},\n input_config: 'ExampleGen.Input' = None,\
\ # = '{\"splits\": []}', # JSON-serialized example_gen_pb2.Input instance,\
\ providing input configuration. If unset, the files under input_base will be\
\ treated as a single split.\n #output_config_splits: {'List' : {'item_type':\
\ 'ExampleGen.SplitConfig'}},\n output_config: 'ExampleGen.Output' = None,\
\ # = '{\"splitConfig\": {\"splits\": []}}', # JSON-serialized example_gen_pb2.Output\
\ instance, providing output configuration. If unset, default splits will be\
\ 'train' and 'eval' with size 2:1.\n #custom_config: 'ExampleGen.CustomConfig'\
\ = None,\n):\n \"\"\"Executes the CsvExampleGen component.\n\n Args:\n\
\ input_base: A Channel of 'ExternalPath' type, which includes one artifact\n\
\ whose uri is an external directory with csv files inside (required).\n\
\ input_config: An example_gen_pb2.Input instance, providing input\n \
\ configuration. If unset, the files under input_base will be treated as\
\ a\n single split.\n output_config: An example_gen_pb2.Output instance,\
\ providing output\n configuration. If unset, default splits will be\
\ 'train' and 'eval' with\n size 2:1.\n ??? example_artifacts: Optional\
\ channel of 'ExamplesPath' for output train and\n eval examples.\n \
\ ??? input: Forwards compatibility alias for the 'input_base' argument.\n\
\ ??? instance_name: Optional unique instance name. Necessary if multiple\n\
\ CsvExampleGen components are declared in the same pipeline.\n \"\
\"\"\n\n import json\n import os\n from google.protobuf import json_format\n\
\ from tfx.components.example_gen.csv_example_gen.component import CsvExampleGen\n\
\ from tfx.proto import example_gen_pb2\n from tfx.types import standard_artifacts\n\
\ from tfx.types import channel_utils\n\n # Create input dict.\n input_base\
\ = standard_artifacts.ExternalArtifact()\n input_base.uri = input_base_path\n\
\ input_base_channel = channel_utils.as_channel([input_base])\n\n input_config_obj\
\ = None\n if input_config:\n input_config_obj = example_gen_pb2.Input()\n\
\ json_format.Parse(input_config, input_config_obj)\n\n output_config_obj\
\ = None\n if output_config:\n output_config_obj = example_gen_pb2.Output()\n\
\ json_format.Parse(output_config, output_config_obj)\n\n component_class_instance\
\ = CsvExampleGen(\n input=input_base_channel,\n input_config=input_config_obj,\n\
\ output_config=output_config_obj,\n )\n\n input_dict = {name:\
\ channel.artifacts for name, channel in component_class_instance.inputs.items()}\n\
\ output_dict = {name: channel.artifacts for name, channel in component_class_instance.outputs.items()}\n\
\ exec_properties = component_class_instance.exec_properties\n\n # Generating\
\ paths for output artifacts\n for output_artifact in output_dict['examples']:\n\
\ output_artifact.uri = output_examples_path\n if output_artifact.split:\n\
\ output_artifact.uri = os.path.join(output_artifact.uri, output_artifact.split)\n\
\n executor = CsvExampleGen.EXECUTOR_SPEC.executor_class()\n executor.Do(\n\
\ input_dict=input_dict,\n output_dict=output_dict,\n exec_properties=exec_properties,\n\
\ )\n\nimport argparse\n_parser = argparse.ArgumentParser(prog='Csvexamplegen',\
\ description=\"Executes the CsvExampleGen component.\\n\\n Args:\\n \
\ input_base: A Channel of 'ExternalPath' type, which includes one artifact\\\
n whose uri is an external directory with csv files inside (required).\\\
n input_config: An example_gen_pb2.Input instance, providing input\\n \
\ configuration. If unset, the files under input_base will be treated\
\ as a\\n single split.\\n output_config: An example_gen_pb2.Output\
\ instance, providing output\\n configuration. If unset, default splits\
\ will be 'train' and 'eval' with\\n size 2:1.\\n ??? example_artifacts:\
\ Optional channel of 'ExamplesPath' for output train and\\n eval examples.\\\
n ??? input: Forwards compatibility alias for the 'input_base' argument.\\\
n ??? instance_name: Optional unique instance name. Necessary if multiple\\\
n CsvExampleGen components are declared in the same pipeline.\\n\")\n\
_parser.add_argument(\"--input-base\", dest=\"input_base_path\", type=str, required=True,\
\ default=argparse.SUPPRESS)\n_parser.add_argument(\"--input-config\", dest=\"\
input_config\", type=str, required=False, default=argparse.SUPPRESS)\n_parser.add_argument(\"\
--output-config\", dest=\"output_config\", type=str, required=False, default=argparse.SUPPRESS)\n\
_parser.add_argument(\"--output-examples\", dest=\"output_examples_path\", type=_make_parent_dirs_and_return_path,\
\ required=True, default=argparse.SUPPRESS)\n_parsed_args = vars(_parser.parse_args())\n\
_output_files = _parsed_args.pop(\"_output_paths\", [])\n\n_outputs = CsvExampleGen(**_parsed_args)\n\
\nif not hasattr(_outputs, '__getitem__') or isinstance(_outputs, str):\n \
\ _outputs = [_outputs]\n\n_output_serializers = [\n \n]\n\nimport os\n\
for idx, output_file in enumerate(_output_files):\n try:\n os.makedirs(os.path.dirname(output_file))\n\
\ except OSError:\n pass\n with open(output_file, 'w') as f:\n\
\ f.write(_output_serializers[idx](_outputs[idx]))\n"
image: tensorflow/tensorflow:1.14.0-py3
inputs:
- name: input_base
type: ExternalPath
- name: input_config
optional: true
type: ExampleGen.Input
- name: output_config
optional: true
type: ExampleGen.Output
name: Csvexamplegen
outputs:
- name: output_examples
type: ExamplesPath
- {inputValue: output_config}
- --example-artifacts
- {outputPath: example_artifacts}
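
Note on the `args` section: the placeholders (`inputPath`, `inputValue`, `outputPath`, and the `if` / `cond` / `isPresent` / `then` form) are expanded into a plain command line before the container starts. The sketch below illustrates that expansion for the new definition. It is not the Kubeflow Pipelines implementation; `resolve_args`, its parameter names, and the example paths are hypothetical and exist only for this illustration.

```python
from typing import Any, Dict, List

# The args template from the updated component definition, written as Python data.
ARGS_TEMPLATE: List[Any] = [
    "--input-base", {"inputPath": "input_base"},
    {"if": {"cond": {"isPresent": "input_config"},
            "then": ["--input-config", {"inputValue": "input_config"}]}},
    {"if": {"cond": {"isPresent": "output_config"},
            "then": ["--output-config", {"inputValue": "output_config"}]}},
    "--example-artifacts", {"outputPath": "example_artifacts"},
]


def resolve_args(template: List[Any],
                 input_values: Dict[str, str],
                 input_paths: Dict[str, str],
                 output_paths: Dict[str, str]) -> List[str]:
    """Expand an args template into a flat argv list (illustrative only)."""
    argv: List[str] = []
    for item in template:
        if isinstance(item, str):
            # Literal flags are passed through unchanged.
            argv.append(item)
        elif "inputValue" in item:
            # The argument value itself is placed on the command line.
            argv.append(input_values[item["inputValue"]])
        elif "inputPath" in item:
            # A path to a file or directory holding the input data.
            argv.append(input_paths[item["inputPath"]])
        elif "outputPath" in item:
            # A path where the program is expected to write the output.
            argv.append(output_paths[item["outputPath"]])
        elif "if" in item:
            # Optional inputs: emit the "then" branch only when a value was supplied.
            branch = item["if"]
            if branch["cond"]["isPresent"] in input_values:
                argv.extend(resolve_args(branch["then"], input_values,
                                         input_paths, output_paths))
        else:
            raise ValueError(f"Unsupported placeholder: {item!r}")
    return argv


if __name__ == "__main__":
    print(resolve_args(
        ARGS_TEMPLATE,
        input_values={"input_config": '{"splits": []}'},
        input_paths={"input_base": "/tmp/inputs/input_base/data"},
        output_paths={"example_artifacts": "/tmp/outputs/example_artifacts/data"},
    ))
```

With `output_config` left unset, the `--output-config` flag is dropped entirely, which is what lets the wrapper function fall back to its `None` default.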
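
Note on the generated argparse wrapper: every argument is declared with `default=argparse.SUPPRESS`, so an optional flag that is not passed does not appear in the parsed dictionary at all, and `CsvExampleGen(**_parsed_args)` then uses the function's own `None` defaults. The sketch below reproduces that behavior in isolation. `run_component` and `parse_cli` are hypothetical stand-ins, and the output argument uses `type=str` here instead of `_make_parent_dirs_and_return_path` to avoid touching the filesystem.

```python
import argparse
from typing import Dict, List, Optional


def run_component(input_base_path: str,
                  example_artifacts_path: str,
                  input_config: Optional[str] = None,
                  output_config: Optional[str] = None) -> Dict[str, Optional[str]]:
    """Stand-in for the component function; it only echoes what it received."""
    return {
        "input_base_path": input_base_path,
        "example_artifacts_path": example_artifacts_path,
        "input_config": input_config,
        "output_config": output_config,
    }


def parse_cli(argv: List[str]) -> Dict[str, str]:
    """Parse the same flags as the generated wrapper in the diff."""
    parser = argparse.ArgumentParser(prog="Csvexamplegen")
    parser.add_argument("--input-base", dest="input_base_path", type=str,
                        required=True, default=argparse.SUPPRESS)
    parser.add_argument("--input-config", dest="input_config", type=str,
                        required=False, default=argparse.SUPPRESS)
    parser.add_argument("--output-config", dest="output_config", type=str,
                        required=False, default=argparse.SUPPRESS)
    parser.add_argument("--example-artifacts", dest="example_artifacts_path", type=str,
                        required=True, default=argparse.SUPPRESS)
    # SUPPRESS means absent flags are omitted from the namespace, not set to None.
    return vars(parser.parse_args(argv))


if __name__ == "__main__":
    parsed = parse_cli(["--input-base", "/data/csv",
                        "--example-artifacts", "/out/examples"])
    print(sorted(parsed))           # only the two flags that were passed
    print(run_component(**parsed))  # the optional configs fall back to None
```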