[Task]: Support more Beam portable schema types as Python types #25946

Open
1 of 15 tasks
ahmedabu98 opened this issue Mar 23, 2023 · 3 comments
Labels
P1 python schemas Issues related to Beam Schemas task types

ahmedabu98 commented Mar 23, 2023

What needs to happen?

Beam portable schemas include primitive and more complex types (represented as logical types). Some of these types are supported in the Python SDK:

Python             Schema
np.int8    <-----> BYTE
np.int16   <-----> INT16
np.int32   <-----> INT32
np.int64   <-----> INT64
int        ------> INT64
np.float32 <-----> FLOAT
np.float64 <-----> DOUBLE
float      ------> DOUBLE
bool       <-----> BOOLEAN
str        <-----> STRING
bytes      <-----> BYTES
ByteString ------> BYTES
Timestamp  <-----> LogicalType(urn="beam:logical_type:micros_instant:v1")
Decimal    <-----> LogicalType(urn="beam:logical_type:fixed_decimal:v1")
Mapping    <-----> MapType
Sequence   <-----> ArrayType
NamedTuple <-----> RowType
beam.Row   ------> RowType
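
To make the arrow directions concrete, here is a minimal sketch of the primitive rows of the table as two lookup dictionaries. It is illustrative only and is not Beam's conversion code: the names `PYTHON_TO_SCHEMA`, `SCHEMA_TO_PYTHON`, and `round_trips` are made up for this example, and types are written as plain strings so it runs without Beam or NumPy installed.

```python
# Illustrative only: not Beam's actual implementation.
# Type names are plain strings so this runs without Beam or NumPy.

# Python type -> portable schema type (every arrow, read left to right).
PYTHON_TO_SCHEMA = {
    "np.int8": "BYTE",
    "np.int16": "INT16",
    "np.int32": "INT32",
    "np.int64": "INT64",
    "int": "INT64",          # one-way
    "np.float32": "FLOAT",
    "np.float64": "DOUBLE",
    "float": "DOUBLE",       # one-way
    "bool": "BOOLEAN",
    "str": "STRING",
    "bytes": "BYTES",
    "ByteString": "BYTES",   # one-way
}

# Portable schema type -> Python type (only the two-way arrows).
SCHEMA_TO_PYTHON = {
    "BYTE": "np.int8",
    "INT16": "np.int16",
    "INT32": "np.int32",
    "INT64": "np.int64",
    "FLOAT": "np.float32",
    "DOUBLE": "np.float64",
    "BOOLEAN": "bool",
    "STRING": "str",
    "BYTES": "bytes",
}


def round_trips(python_type: str) -> bool:
    """Return True if converting to a schema type and back yields the same type."""
    schema_type = PYTHON_TO_SCHEMA[python_type]
    return SCHEMA_TO_PYTHON[schema_type] == python_type
```

Read this way, a one-way arrow means the Python type is accepted as input but does not come back out: `int` becomes INT64, and INT64 is read back as `np.int64`.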

When necessary, Python classes are created to represent a portable type. For example, see Timestamp below:

class Timestamp(object):

Some portable types are still missing from the Python SDK (e.g. Date, DateTime, Time). We should add support for them to make the cross-language experience smoother.
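
As a rough idea of what one of the missing types could look like, here is a hypothetical sketch of a Date type that converts between `datetime.date` and a count of days since the Unix epoch. Everything here is an assumption for illustration: the class name `DateLogicalTypeSketch` is invented, it does not subclass anything from Beam, and both the URN and the days-since-epoch INT64 representation are assumed to match what the Java SDK uses and would need to be checked against it.

```python
import datetime

# Reference point for the assumed days-since-epoch representation.
EPOCH = datetime.date(1970, 1, 1)


class DateLogicalTypeSketch:
    """Hypothetical sketch; not part of the Beam Python SDK."""

    # Assumption: URN intended to match the Java SDK's date logical type.
    URN = "beam:logical_type:date:v1"

    @staticmethod
    def to_representation_type(value: datetime.date) -> int:
        """Convert a date to whole days since 1970-01-01 (assumed INT64 base)."""
        return (value - EPOCH).days

    @staticmethod
    def to_language_type(days: int) -> datetime.date:
        """Convert days since 1970-01-01 back to a date."""
        return EPOCH + datetime.timedelta(days=days)
```

A real implementation would also need to be registered with the SDK's schema machinery so that `datetime.date` fields in a NamedTuple or beam.Row resolve to it; that part is left out here.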

Issue Priority

Priority: 2 (default / most normal work should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner

pavleec commented May 21, 2024

JSON type is also missing in Python SDK 😕

@unography

Hi @ahmedabu98, GEOGRAPHY is currently not supported as a data type, and it throws an error here: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L110-L123

Are there plans to add support for it?

jd185367 commented Oct 18, 2024

I'd also like to bump this as needed for using WriteToBigQuery in Python:

class WriteToBigQuery(PTransform):

Google recommends using the STORAGE_WRITE_API method in their Dataflow Best Practices, which requires passing this transform the schema argument for a table. But since many of our BigQuery tables have a DATE or DATETIME column, which isn't supported yet for these schemas in Python, we aren't able to use this.

As of Beam 2.60.0, we haven't found a workaround. For example, specifying our DATE columns as TIMESTAMP in the Python schema seems to fail, either when Beam tries to actually write to BigQuery or at some point while the Java code is executing and doing its own conversion. If anyone knows a workaround for this, I'd appreciate it.

As a side-note: why does STORAGE_WRITE_API require specifying a schema in advance, while STREAMING_INSERT does not?

@liferoad added P1 and removed P2 labels Oct 18, 2024