Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs(python): Add examples for scanning hive datasets to user guide #17431

Merged
merged 1 commit into from
Jul 5, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
118 changes: 118 additions & 0 deletions docs/src/python/user-guide/io/hive.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,118 @@
# --8<-- [start:init_paths]
import polars as pl
from pathlib import Path

dfs = [
pl.DataFrame({"x": [1, 2]}),
pl.DataFrame({"x": [3, 4, 5]}),
pl.DataFrame({"x": [6, 7]}),
pl.DataFrame({"x": [8, 9, 10, 11]}),
]

parts = [
"year=2023/month=11",
"year=2023/month=12",
"year=2024/month=01",
"year=2024/month=02",
]

for df, part in zip(dfs, parts):
path = Path("docs/data/hive/") / part / "data.parquet"
Path(path).parent.mkdir(exist_ok=True, parents=True)
df.write_parquet(path)

path = Path("docs/data/hive_mixed/") / part / "data.parquet"
Path(path).parent.mkdir(exist_ok=True, parents=True)
df.write_parquet(path)

Path("docs/data/hive_mixed/description.txt").touch()


def print_paths(path: str) -> None:
def dir_recurse(path: Path):
if path.is_dir():
for p in path.iterdir():
yield from dir_recurse(p)
else:
yield path

df = (
pl.Series(
"File path",
(str(x) for x in dir_recurse(Path(path))),
dtype=pl.String,
)
.sort()
.to_frame()
)

with pl.Config(
tbl_hide_column_data_types=True,
tbl_hide_dataframe_shape=True,
fmt_str_lengths=999,
):
print(df)


print_paths("docs/data/hive/")
# --8<-- [end:init_paths]

# --8<-- [start:show_mixed_paths]
print_paths("docs/data/hive_mixed/")
# --8<-- [end:show_mixed_paths]

# --8<-- [start:scan_dir]
import polars as pl

df = pl.scan_parquet("docs/data/hive/").collect()

with pl.Config(tbl_rows=99):
print(df)
# --8<-- [end:scan_dir]

# --8<-- [start:scan_dir_err]
from pathlib import Path

try:
pl.scan_parquet("docs/data/hive_mixed/").collect()
except Exception as e:
print(e)

# --8<-- [end:scan_dir_err]

# --8<-- [start:scan_glob]
df = pl.scan_parquet(
# Glob to match all files ending in `.parquet`
"docs/data/hive_mixed/**/*.parquet",
hive_partitioning=True,
).collect()

with pl.Config(tbl_rows=99):
print(df)

# --8<-- [end:scan_glob]

# --8<-- [start:scan_file_no_hive]
df = pl.scan_parquet(
[
"docs/data/hive/year=2024/month=01/data.parquet",
"docs/data/hive/year=2024/month=02/data.parquet",
],
).collect()

print(df)

# --8<-- [end:scan_file_no_hive]

# --8<-- [start:scan_file_hive]
df = pl.scan_parquet(
[
"docs/data/hive/year=2024/month=01/data.parquet",
"docs/data/hive/year=2024/month=02/data.parquet",
],
hive_partitioning=True,
).collect()

print(df)

# --8<-- [end:scan_file_hive]
71 changes: 71 additions & 0 deletions docs/user-guide/io/hive.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,71 @@
## Hive partitioned data

Polars supports scanning hive partitioned parquet and IPC datasets, with planned support for other
formats in the future.

Hive partition parsing is enabled by default if `scan_parquet` receives a single directory path,
otherwise it is disabled by default. This can be explicitly configured using the `hive_partitioning`
parameter.

### Scanning a hive directory

For this example the following directory structure is used:

```python exec="on" result="text" session="user-guide/io/hive"
--8<-- "python/user-guide/io/hive.py:init_paths"
```

Simply pass the directory to `scan_parquet`, and all files will be loaded with the hive parts in the
path included in the output:

{{code_block('user-guide/io/hive','scan_dir',['scan_parquet'])}}

```python exec="on" result="text" session="user-guide/io/hive"
--8<-- "python/user-guide/io/hive.py:scan_dir"
```

### Handling mixed files

Passing a directory to `scan_parquet` may not work if there are extra non-data files next to the
data files.

For this example the following directory structure is used:

```python exec="on" result="text" session="user-guide/io/hive"
--8<-- "python/user-guide/io/hive.py:show_mixed_paths"
```

{{code_block('user-guide/io/hive','scan_dir_err',['scan_parquet'])}}

The above fails as `description.txt` is not a valid parquet file:

```python exec="on" result="text" session="user-guide/io/hive"
--8<-- "python/user-guide/io/hive.py:scan_dir_err"
```

In this situation, a glob pattern can be used to be more specific about which files to load. Note
that `hive_partitioning` must explicitly set to `True`:

{{code_block('user-guide/io/hive','scan_glob',['scan_parquet'])}}

```python exec="on" result="text" session="user-guide/io/hive"
--8<-- "python/user-guide/io/hive.py:scan_glob"
```

### Scanning file paths with hive parts

`hive_partitioning` is not enabled by default for file paths:

{{code_block('user-guide/io/hive','scan_file_no_hive',['scan_parquet'])}}

```python exec="on" result="text" session="user-guide/io/hive"
--8<-- "python/user-guide/io/hive.py:scan_file_no_hive"
```

Pass `hive_partitioning=True` to enable hive partition parsing:

{{code_block('user-guide/io/hive','scan_file_hive',['scan_parquet'])}}

```python exec="on" result="text" session="user-guide/io/hive"
--8<-- "python/user-guide/io/hive.py:scan_file_hive"
```
1 change: 1 addition & 0 deletions docs/user-guide/io/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,7 @@ Reading and writing your data is crucial for a DataFrame library. In this chapte
- [Parquet](parquet.md)
- [Json](json.md)
- [Multiple](multiple.md)
- [Hive](hive.md)
- [Database](database.md)
- [Cloud storage](cloud-storage.md)
- [Google Big Query](bigquery.md)
1 change: 1 addition & 0 deletions mkdocs.yml
Original file line number Diff line number Diff line change
Expand Up @@ -65,6 +65,7 @@ nav:
- user-guide/io/parquet.md
- user-guide/io/json.md
- user-guide/io/multiple.md
- user-guide/io/hive.md
- user-guide/io/database.md
- user-guide/io/cloud-storage.md
- user-guide/io/bigquery.md
Expand Down