
How to store BigQuery schemas for later use. #3419

Closed

tweeter0830 opened this issue May 15, 2017 · 15 comments

Labels: api: bigquery Issues related to the BigQuery API. · type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

tweeter0830 commented May 15, 2017
Is there an easy way to get the schema from a BigQuery table and save it to disk for later use?

For example, say that I wanted to get the schema of a table that currently exists, save it to disk and later create a table with the same schema.

I see that there is a SchemaField class in the API, but there doesn't seem to be a clean way to serialize it.

@tseaver tseaver added api: bigquery Issues related to the BigQuery API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. labels May 15, 2017
tseaver (Contributor) commented May 15, 2017

@tweeter0830 There is no built-in support for serializing SchemaField instances. The non-public table._build_schema_resource and table._parse_schema_resource functions could be used as the basis for serializing them to JSON.
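
For illustration, a minimal sketch of that approach (these are non-public helpers whose names and signatures may change between releases; the dataset and table names here are placeholders):

import json

from google.cloud import bigquery
from google.cloud.bigquery.table import (
    _build_schema_resource,
    _parse_schema_resource,
)

client = bigquery.Client()
dataset = client.dataset("my_dataset")
table = dataset.table("my_table")
table.reload()  # fetch table metadata, including the schema

# Serialize: list of SchemaField -> JSON on disk.
with open("schema.json", "w") as handle:
    json.dump(_build_schema_resource(table.schema), handle, indent=2)

# Deserialize: JSON on disk -> list of SchemaField.
with open("schema.json") as handle:
    schema = _parse_schema_resource({"fields": json.load(handle)})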

tweeter0830 (Author) commented May 15, 2017

Thanks.

@tseaver tseaver added the priority: p2 Moderately-important priority. Fix may not be included in next release. label May 15, 2017
@tseaver tseaver reopened this May 15, 2017
tseaver (Contributor) commented May 15, 2017

I'll leave the issue open, as making those two functions public / better documented seems like a reasonable enhancement.

tseaver (Contributor) commented May 15, 2017

E.g., I could see adding a schema.SchemaField.from_api_repr classmethod and a corresponding schema.SchemaField.to_api_repr instance method to support your use case.
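
A sketch of how those proposed methods might be used (hypothetical at the time of this comment; both methods were added to the library in later releases):

import json

from google.cloud.bigquery.schema import SchemaField

field = SchemaField("qtr", "STRING", mode="REQUIRED", description="quarter")

# Round-trip a single field through its API representation.
as_dict = field.to_api_repr()
restored = SchemaField.from_api_repr(as_dict)

# A whole schema is then just a list of such mappings.
schema_json = json.dumps([f.to_api_repr() for f in [field, restored]], indent=2)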

tseaver (Contributor) commented Aug 10, 2017

#3786 doesn't fix this issue.

@tseaver tseaver reopened this Aug 10, 2017
lukesneeringer (Contributor) commented Aug 10, 2017

Why? As soon as it is a dictionary, saving it to disk is trivial (json.dump). Never mind, I saw your comment on the other issue.

tseaver (Contributor) commented Aug 10, 2017

The goal of this issue is to store the entire schema for a table (not just a single field), and in a format which is "good" for people (i.e., with the appropriate top-level mapping / key holding that list).
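
For example, the REST API's table resource nests the field list under a top-level "fields" key, which suggests one natural on-disk format:

{
  "fields": [
    {"name": "qtr", "type": "STRING", "mode": "REQUIRED"},
    {"name": "rep", "type": "STRING", "mode": "NULLABLE"}
  ]
}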

yan-hic commented Jan 10, 2018

Similar use case. I already have the JSON schema files in GCS for all tables. The bq load CLI supports JSON schema files, but the Python client library does not (as of 0.29).
Next to a suggested getter method, LoadJobConfig.schema.save, there should be a LoadJobConfig.schema.load method too that takes a URL as a parameter (gs:// or local).
That would be consistent with the CLI and other languages; the Java library has setSchema.
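
In the meantime, a workaround sketch for that use case (bucket and object names are placeholders, and it assumes SchemaField.from_api_repr is available, which was added in later releases):

import json

from google.cloud import bigquery, storage
from google.cloud.bigquery.schema import SchemaField

# Pull the JSON schema file from GCS and build the field list by hand.
blob = storage.Client().bucket("my-bucket").blob("schemas/my_table.json")
fields = json.loads(blob.download_as_string())

job_config = bigquery.LoadJobConfig()
job_config.schema = [SchemaField.from_api_repr(f) for f in fields]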

chemelnucfin (Contributor) commented Jan 22, 2018

Hello, feature requests will now be tracked in the Feature Requests project. I will close this issue now; please feel free to continue to address any issues/concerns here.

tswast (Contributor) commented Apr 10, 2019

Background

With the BigQuery command-line tool, it's possible to specify a schema via a JSON file. See: https://cloud.google.com/bigquery/docs/schemas#using_a_json_schema_file

Schema format:

[
  {
    "description": "quarter",
    "mode": "REQUIRED",
    "name": "qtr",
    "type": "STRING"
  },
  {
    "description": "sales representative",
    "mode": "NULLABLE",
    "name": "rep",
    "type": "STRING"
  },
  {
    "description": "total sales",
    "mode": "NULLABLE",
    "name": "sales",
    "type": "FLOAT"
  }
]

Phase 1

This feature request is to add a helper function that can load such files and return a list of SchemaField objects.

# file_or_path = "path/to/schema.json"
# OR
# file_or_path = open("path/to/schema.json")

schema_fields = client.schema_from_json(file_or_path)

# Now the value of schema_fields is
# schema_fields = [
#      schema.SchemaField(...),
#      schema.SchemaField(...),
#      schema.SchemaField(...),
# ]

Phase 1.b

Add client.schema_to_json(schema_list, destination) which serializes a list of SchemaFields to JSON and writes it to destination, which is a file-like object or path.
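
The intended round trip, using the names proposed in this request:

schema_fields = client.schema_from_json("path/to/schema.json")
client.schema_to_json(schema_fields, "path/to/schema_copy.json")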

Phase 2

Blocked by #7693

client.schema_from_json should accept GCS paths like file_or_path = "gs://bucket_name/path/to/schema.json". Likewise, client.schema_to_json should accept GCS paths for destination.

This (optional) feature only works if a storage_client is passed in as an additional argument.

schema_fields = client.schema_from_json(file_or_path, storage_client=storage_client)

Phase 3

If a gs:// URL is provided but not a storage_client, try to construct one using the BigQuery client's credentials. Fails if the google-cloud-storage library is not installed.
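
A sketch of that fallback logic (illustrative only, not actual library code; the helper name is hypothetical):

def _storage_client_for(bq_client, storage_client=None):
    # Prefer an explicitly supplied storage client (Phase 2 behavior).
    if storage_client is not None:
        return storage_client
    # Otherwise (Phase 3), build one from the BigQuery client's credentials.
    try:
        from google.cloud import storage
    except ImportError:
        raise ValueError(
            "google-cloud-storage is required to read/write gs:// paths"
        )
    return storage.Client(
        project=bq_client.project, credentials=bq_client._credentials
    )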

tswast (Contributor) commented Apr 10, 2019

cc @lbristol88

yan-hic commented Apr 11, 2019

@tswast could you make schema_from_json accept a string too?

We have the schema in GCS but need to touch it up a bit before using it in BigQuery.
This would avoid saving it locally just to use that new function.

FYI, we do this today: job_config._properties['load']['schema'] = schema_json

tswast (Contributor) commented Apr 11, 2019

In the case of in-memory JSON, I'd prefer you pass in a file-like object, e.g.

schema_from_json(io.BytesIO(b'[{"name": "col1", "type": "INTEGER"}]'))

Otherwise it gets a little tricky to disambiguate a path from file contents.

yan-hic commented Apr 11, 2019

Makes sense.

tswast (Contributor) commented Nov 12, 2019

Fixed by #9550 (allow passing JSON version of schema directly to schema setters) and #7761 (adds schema_from_json and schema_to_json helpers).

Closing despite missing GCS integration, as that wasn't in the original request.
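
The shipped helpers, as they landed in #7761 (local files or file-like objects only; GCS paths remain out of scope):

from google.cloud import bigquery

client = bigquery.Client()

# Round-trip a table schema through a local JSON file.
schema = client.schema_from_json("path/to/schema.json")
client.schema_to_json(schema, "path/to/schema_copy.json")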

@tswast tswast closed this as completed Nov 12, 2019