Docs: Location Provider Documentation#1537
Conversation
mkdocs/docs/configuration.md
Outdated
| | `write.metadata.previous-versions-max` | Integer | 100 | The max number of previous version metadata files to keep before deleting after commit. | | ||
| | `write.object-storage.enabled` | Boolean | True | Enables the [ObjectStoreLocationProvider](configuration.md#objectstorelocationprovider) that adds a hash component to file paths | | ||
| | `write.object-storage.partitioned-paths` | Boolean | True | Controls whether [partition values are included in file paths](configuration.md#partition-exclusion) when object storage is enabled | | ||
| | `write.py-location-provider.impl` | String of form `module.ClassName` | null | Optional, [custom LocationProvider](configuration.md#loading-a-custom-locationprovider) implementation | |
There was a problem hiding this comment.
Unsure about what Options to put here. Happy to take suggestions.
There was a problem hiding this comment.
maybe similar to the custom catalog
https://py.iceberg.apache.org/configuration/#custom-catalog-implementations
There was a problem hiding this comment.
Don't love how this looks. I prefer what it is now:
I've changed the section linked above to have mymodule.MyLocationProvider in custom Catalog style. WDYT?
There was a problem hiding this comment.
(The above screenshot also shows how code/backticks hyperlinks look, I think they're fine. This is now relevant because of #1537 (comment).
mkdocs/docs/configuration.md
Outdated
|
|
||
| ### SimpleLocationProvider | ||
|
|
||
| The SimpleLocationProvider places file names underneath a `data` directory in the table's storage location. For example, |
There was a problem hiding this comment.
I realised I was wrong in #1510 (comment) about docs not needing to change when write.data.path is supported. I think we do want to say that data files are under a data directory, here and below.
But I think this is fine because the change will be small - it'd just be "write.data.path" instead of "data directory under the table's location" throughout. write.data.path defaults to this.
mkdocs/docs/configuration.md
Outdated
|
|
||
| The SimpleLocationProvider is enabled for a table by explicitly setting its `write.object-storage.enabled` table property to `false`. | ||
|
|
||
| ### ObjectStoreLocationProvider |
There was a problem hiding this comment.
There's a lot of natural duplication between this section and https://iceberg.apache.org/docs/latest/aws/#object-store-file-layout. I've gone less in-depth here though.
I was unsure whether to link to this webpage (and if so, how to word it) because there's a lot that's not relevant to us, e.g.
Note, the path resolution logic for ObjectStoreLocationProvider is write.data.path then /data. However, for the older versions up to 0.12.0, the logic is as follows: - before 0.12.0, write.object-storage.path must be set. - at 0.12.0, write.object-storage.path then write.folder-storage.path then /data.
and
Previously provided base64 hash was updated to base2 in order to provide an improved auto-scaling behavior on S3 General Purpose Buckets.
There was a problem hiding this comment.
i think we should link to that for additional context
mkdocs/docs/configuration.md
Outdated
| s3://bucket/ns/table/data/1101/0100/1011/00111010-00000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet | ||
| ``` | ||
|
|
||
| ### Loading a Custom LocationProvider |
There was a problem hiding this comment.
|
|
||
| class LocationProvider(ABC): | ||
| """A base class for location providers, that provide data file locations for write tasks.""" | ||
| """A base class for location providers, that provide data file locations for a table's write tasks. |
mkdocs/docs/configuration.md
Outdated
| ### Loading a Custom LocationProvider | ||
|
|
||
| Similar to FileIO, a custom LocationProvider may be provided for a table by concretely subclassing the abstract base | ||
| class [LocationProvider](../reference/pyiceberg/table/locations/#pyiceberg.table.locations.LocationProvider). The |
There was a problem hiding this comment.
I wanted to link to this, and this works for me locally, but I get the following warning when serving docs locally:
INFO - Doc file 'configuration md' contains an unrecognized relative link '.. / reference/pyiceberg/table/locations/#pyiceberg.table.locations.LocationProvider', it was left as is. Did you mean ' reference/pyiceberg/table/locations.md#pyiceberg.table.locations.LocationProvider'?
But I get a similar warning elsewhere: Doc file 'SUMMARY.md' contains an unrecognized relative link 'reference/', it was left as is.. So I think this is fine?
There was a problem hiding this comment.
yea i think its fine, as long as the hyperlink works when you run it locally
| An example, custom `LocationProvider` implementation is shown below. | ||
|
|
||
| ```py | ||
| import uuid |
There was a problem hiding this comment.
I've only shown this import for conciseness.
mkdocs/docs/configuration.md
Outdated
| | `write.parquet.page-row-limit` | Number of rows | 20000 | Set a target threshold for the maximum number of rows within a column chunk | | ||
| | `write.parquet.dict-size-bytes` | Size in bytes | 2MB | Set the dictionary page size limit per row group | | ||
| | `write.metadata.previous-versions-max` | Integer | 100 | The max number of previous version metadata files to keep before deleting after commit. | | ||
| | `write.object-storage.enabled` | Boolean | True | Enables the [ObjectStoreLocationProvider](configuration.md#objectstorelocationprovider) that adds a hash component to file paths | |
There was a problem hiding this comment.
(Same wording, more or less, as https://iceberg.apache.org/docs/latest/configuration/#write-properties)
There was a problem hiding this comment.
maybe we should add a warning or something about how this default differs from the java implementation
mkdocs/docs/configuration.md
Outdated
| The ObjectStoreLocationProvider counteracts this by injecting deterministic hashes, in the form of binary directories, | ||
| into file paths, to distribute files across a larger number of object store prefixes. | ||
|
|
||
| Paths contain partitions just before the file name and a `data` directory beneath the table's location, in a similar |
kevinjqliu
left a comment
There was a problem hiding this comment.
Thanks for the PR! These docs are great. I've added a few nit comments
mkdocs/docs/configuration.md
Outdated
| | `write.metadata.previous-versions-max` | Integer | 100 | The max number of previous version metadata files to keep before deleting after commit. | | ||
| | `write.object-storage.enabled` | Boolean | True | Enables the [ObjectStoreLocationProvider](configuration.md#objectstorelocationprovider) that adds a hash component to file paths | | ||
| | `write.object-storage.partitioned-paths` | Boolean | True | Controls whether [partition values are included in file paths](configuration.md#partition-exclusion) when object storage is enabled | | ||
| | `write.py-location-provider.impl` | String of form `module.ClassName` | null | Optional, [custom LocationProvider](configuration.md#loading-a-custom-locationprovider) implementation | |
There was a problem hiding this comment.
maybe similar to the custom catalog
https://py.iceberg.apache.org/configuration/#custom-catalog-implementations
mkdocs/docs/configuration.md
Outdated
| | `write.parquet.page-row-limit` | Number of rows | 20000 | Set a target threshold for the maximum number of rows within a column chunk | | ||
| | `write.parquet.dict-size-bytes` | Size in bytes | 2MB | Set the dictionary page size limit per row group | | ||
| | `write.metadata.previous-versions-max` | Integer | 100 | The max number of previous version metadata files to keep before deleting after commit. | | ||
| | `write.object-storage.enabled` | Boolean | True | Enables the [ObjectStoreLocationProvider](configuration.md#objectstorelocationprovider) that adds a hash component to file paths | |
There was a problem hiding this comment.
maybe we should add a warning or something about how this default differs from the java implementation
mkdocs/docs/configuration.md
Outdated
| | `write.parquet.dict-size-bytes` | Size in bytes | 2MB | Set the dictionary page size limit per row group | | ||
| | `write.metadata.previous-versions-max` | Integer | 100 | The max number of previous version metadata files to keep before deleting after commit. | | ||
| | `write.object-storage.enabled` | Boolean | True | Enables the [ObjectStoreLocationProvider](configuration.md#objectstorelocationprovider) that adds a hash component to file paths | | ||
| | `write.object-storage.partitioned-paths` | Boolean | True | Controls whether [partition values are included in file paths](configuration.md#partition-exclusion) when object storage is enabled | |
There was a problem hiding this comment.
hyperlinks are weird sometimes, can you make sure that these work as intended
There was a problem hiding this comment.
I've checked all hyperlinks on the current version and they work as intended
mkdocs/docs/configuration.md
Outdated
|
|
||
| ### SimpleLocationProvider | ||
|
|
||
| The SimpleLocationProvider places file names underneath a `data` directory in the table's storage location. For example, |
There was a problem hiding this comment.
| The SimpleLocationProvider places file names underneath a `data` directory in the table's storage location. For example, | |
| The `SimpleLocationProvider` places a table's data file names underneath a `data` directory in the table's storage location. For example, |
There was a problem hiding this comment.
I think we should also call out that the "base location" is the table.metadata.location
From the spec, https://iceberg.apache.org/spec/#table-metadata-fields
The table's base location. This is used by writers to determine where to store data files, manifest files, and table metadata files.
mkdocs/docs/configuration.md
Outdated
| s3://bucket/ns/table/data/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet | ||
| ``` | ||
|
|
||
| When data is partitioned, files under a given partition are grouped into a subdirectory, with that partition key |
There was a problem hiding this comment.
| When data is partitioned, files under a given partition are grouped into a subdirectory, with that partition key | |
| When the table is partitioned, files under a given partition are grouped into a subdirectory, with that partition key |
There was a problem hiding this comment.
nit: maybe also when the hive-style partition path
| ```txt | ||
| s3://bucket/ns/table/data/1101/0100/1011/00111010-00000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet | ||
| ``` |
There was a problem hiding this comment.
nit: what about giving an example of this set to True and another one set to False
There was a problem hiding this comment.
I have the False case just above ("the same data file above" here) - or do you mean making that more explicit?
There was a problem hiding this comment.
i think given the new wording above, its clear now :) thanks!
mkdocs/docs/configuration.md
Outdated
|
|
||
| The SimpleLocationProvider is enabled for a table by explicitly setting its `write.object-storage.enabled` table property to `false`. | ||
|
|
||
| ### ObjectStoreLocationProvider |
There was a problem hiding this comment.
i think we should link to that for additional context
mkdocs/docs/configuration.md
Outdated
| ### Loading a Custom LocationProvider | ||
|
|
||
| Similar to FileIO, a custom LocationProvider may be provided for a table by concretely subclassing the abstract base | ||
| class [LocationProvider](../reference/pyiceberg/table/locations/#pyiceberg.table.locations.LocationProvider). The |
There was a problem hiding this comment.
yea i think its fine, as long as the hyperlink works when you run it locally
mkdocs/docs/configuration.md
Outdated
| class [LocationProvider](../reference/pyiceberg/table/locations/#pyiceberg.table.locations.LocationProvider). The | ||
| table property `write.py-location-provider.impl` should be set to the fully-qualified name of the custom | ||
| LocationProvider (i.e. `module.CustomLocationProvider`). Recall that a LocationProvider is configured per-table, | ||
| permitting different location provision for different tables. |
There was a problem hiding this comment.
also mention that java uses write.location-provider.impl
kevinjqliu
left a comment
There was a problem hiding this comment.
LGTM! Thanks for the great docs
| ```txt | ||
| s3://bucket/ns/table/data/1101/0100/1011/00111010-00000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet | ||
| ``` |
There was a problem hiding this comment.
i think given the new wording above, its clear now :) thanks!
Fokko
left a comment
There was a problem hiding this comment.
This looks great @smaheshwar-pltr thanks for the comprehensive documentation.
|
This PR concludes initial support for location providers in PyIceberg 🎉 Huge thanks to Kevin and Fokko for their excellent and patient reviews throughout these PRs - I really appreciate your responsiveness and focus on quality that made this contribution enjoyable. Thank you for that and for all your work with PyIceberg! |





(See below for screenshots)
Closes #1510. This is my first time writing docs here! Happy to receive style feedback - I already suspect I've written too much.
cc @kevinjqliu @Fokko