Add pygmt.gmtread to read a dataset/grid/image into pandas.DataFrame/xarray.DataArray #3673


Draft: wants to merge 40 commits into base: main

Commits (40)
d913c86
Add pygmt.read to read a dataset/grid/image into pandas.DataFrame/xar…
seisman Dec 2, 2024
f456bf8
Set GMT accessor
seisman Dec 5, 2024
c3cbb6e
Need to set 'source' encoding to make GMT accessor work
seisman Dec 5, 2024
f2a4ce4
Merge branch 'main' into feature/read
seisman Dec 5, 2024
1dd97c6
Fix the source encoding
seisman Dec 5, 2024
7790ea3
No need to set the source encoding in load_remote_dataset.py
seisman Dec 5, 2024
e588008
Revert changes in pygmt/datasets/load_remote_dataset.py
seisman Dec 6, 2024
40d12ee
Improve docstring in pygmt/helpers/testing.py
seisman Dec 6, 2024
fa1021d
Improve docstrinbgs
seisman Dec 6, 2024
c378225
Get rid of decorators
seisman Dec 8, 2024
7b749e0
Improve comment
seisman Dec 8, 2024
8befa58
Get rid of the fmt_docstring alias
seisman Dec 8, 2024
a758752
Fix type hints issue with overload
seisman Dec 9, 2024
9d66cf4
Remove the type ignore flag
seisman Dec 9, 2024
a05383a
region defaults to None
seisman Dec 9, 2024
6ca4ef2
Merge branch 'main' into feature/read
seisman Dec 9, 2024
7851ced
Improve type hints and add tests
seisman Dec 9, 2024
084b87a
Improve the checking of return value of which
seisman Dec 9, 2024
b21997c
Use the read funciton in pygmt/tests/test_datatypes_dataset.py
seisman Dec 9, 2024
a812317
Use the read function instead of the load_dataarray method
seisman Dec 9, 2024
1f0f158
Add one test to make sure that read and load_dataarray returns the sa…
seisman Dec 9, 2024
957c7eb
Simplify pygmt/tests/test_clib_read_data.py with read
seisman Dec 9, 2024
6aef3ca
Fix a typo
seisman Dec 9, 2024
72afbfe
Replace xr.open_dataarray with read
seisman Dec 9, 2024
03de9b7
Fix a typo
seisman Dec 9, 2024
85c533d
Merge branch 'main' into feature/read
seisman Dec 19, 2024
663c76d
Merge branch 'main' into feature/read
seisman Mar 12, 2025
3ed1032
Fix styling
seisman Mar 12, 2025
7d320f4
Merge branch 'main' into feature/read
seisman Mar 12, 2025
2e72ebe
Merge branch 'main' into feature/read
seisman Apr 16, 2025
6d634cc
Minor fix
seisman Apr 16, 2025
4dc7974
Add parameter name
seisman Apr 16, 2025
69f5c45
Rename read to gmtread
seisman Apr 17, 2025
061f5f2
Restructure io.py into a directory
seisman Apr 17, 2025
4f0779e
Move gmtread from src to io
seisman Apr 17, 2025
a06ddca
Fixes, and clean up
seisman Apr 17, 2025
82b80f5
Fix a doctest
seisman Apr 17, 2025
a6c4ee7
Add tests for reading images
seisman Apr 17, 2025
b4a0b9d
Merge branch 'main' into feature/read
seisman May 2, 2025
37fc1de
Revert pygmt/tests/test_clib_put_matrix.py
seisman May 2, 2025
1 change: 1 addition & 0 deletions doc/api/index.rst
@@ -172,6 +172,7 @@ Input/output
     :toctree: generated

     load_dataarray
+    read
Comment on lines 174 to +175
Member

The load_dataarray function was put under the pygmt.io namespace. Should we consider putting read under pygmt.io too? (Thinking about whether we need a low-level pygmt.clib.read and high-level pygmt.io.read in my other comment).

Member Author (@seisman, Apr 16, 2025)

Yes, that sounds good. I have two questions:

  1. Should we place the read source code in pygmt/io.py, or restructure io.py into a directory and put it in pygmt/io/read.py instead?
  2. Should we deprecate the load_dataarray function in favor of the new read function?

I'm expecting to have a write function that writes a pandas.DataFrame/xarray.DataArray into a tabular/netCDF file.

GMT.jl also wraps the read module (xref: https://www.generic-mapping-tools.org/GMTjl_doc/documentation/utilities/gmtread/). The differences are:

  1. It uses the name gmtread, which I think is better since read is a little too general.
  2. It returns custom data types like GMTVector, GMTGrid. [This doesn't work in PyGMT]
  3. It guesses the data kind based on the extensions. [Perhaps we can also do a similar guess?]
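
Item 3 above could look roughly like the following. This is a minimal sketch, not PyGMT or GMT.jl code: the helper name `guess_kind` and the extension table are assumptions made for illustration. Real files are more ambiguous than this (a `.tif` can hold a grid as well as an image, and remote names such as `@earth_relief_01d` have no extension at all), so an explicit `kind` would still be needed as a fallback.

```python
from pathlib import PurePath

# Hypothetical extension table, for illustration only; not GMT's actual rules.
_KIND_BY_SUFFIX = {
    ".txt": "dataset",
    ".csv": "dataset",
    ".nc": "grid",
    ".grd": "grid",
    ".png": "image",
    ".jpg": "image",
}


def guess_kind(file, default=None):
    """Guess 'dataset', 'grid', or 'image' from the file extension."""
    suffix = PurePath(file).suffix.lower()
    kind = _KIND_BY_SUFFIX.get(suffix, default)
    if kind is None:
        msg = f"Cannot guess the kind of '{file}'; pass kind explicitly."
        raise ValueError(msg)
    return kind
```

With this sketch, `guess_kind("@hotspots.txt")` gives `"dataset"`, while `guess_kind("@earth_relief_01d")` raises unless a `default` is supplied.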

Member (@weiji14, Apr 16, 2025)

> 1. Should we place the read source code in pygmt/io.py, or restructure io.py into a directory and put it in pygmt/io/read.py instead?

I think making the io directory sounds good, especially if you're planning on making a write function in the future.

> Should we deprecate the load_dataarray function in favor of the new read function?

No, let's keep load_dataarray for now. Something I'm contemplating is to make an xarray BackendEntrypoint that uses GMT read, so that users can then do pygmt.io.load_dataarray(..., engine="gmtread") or something like that. The load_dataarray function would use this new gmtread backend engine by default instead of netcdf4.


GMT Defaults
------------
2 changes: 1 addition & 1 deletion pygmt/__init__.py
@@ -24,7 +24,7 @@
 from pygmt import datasets
 from pygmt._show_versions import __commit__, __version__, show_versions
 from pygmt.figure import Figure, set_display
-from pygmt.io import load_dataarray
+from pygmt.io import gmtread, load_dataarray
 from pygmt.session_management import begin as _begin
 from pygmt.session_management import end as _end
 from pygmt.src import (
6 changes: 3 additions & 3 deletions pygmt/helpers/testing.py
@@ -144,13 +144,13 @@ def wrapper(*args, ext="png", request=None, **kwargs):
     return decorator


-def load_static_earth_relief():
+def load_static_earth_relief() -> xr.DataArray:
     """
-    Load the static_earth_relief file for internal testing.
+    Load the static_earth_relief.nc file for internal testing.

     Returns
     -------
-    data : xarray.DataArray
+    data
         A grid of Earth relief for internal tests.
     """
     fname = which("@static_earth_relief.nc", download="c")
6 changes: 6 additions & 0 deletions pygmt/io/__init__.py
@@ -0,0 +1,6 @@
+"""
+PyGMT input/output (I/O) utilities.
+"""
+
+from pygmt.io.gmtread import gmtread
+from pygmt.io.load_dataarray import load_dataarray
125 changes: 125 additions & 0 deletions pygmt/io/gmtread.py
@@ -0,0 +1,125 @@
+"""
+Read a file into an appropriate object.
+"""
+
+from collections.abc import Mapping, Sequence
+from pathlib import PurePath
+from typing import Any, Literal
+
+import pandas as pd
+import xarray as xr
+from pygmt.clib import Session
+from pygmt.helpers import build_arg_list, is_nonstr_iter
+from pygmt.src.which import which
+
+
+def gmtread(
+    file: str | PurePath,
+    kind: Literal["dataset", "grid", "image"],
Member

Does GMT read also handle 'cube'?

+    region: Sequence[float] | str | None = None,
+    header: int | None = None,
+    column_names: pd.Index | None = None,
+    dtype: type | Mapping[Any, type] | None = None,
+    index_col: str | int | None = None,
+) -> pd.DataFrame | xr.DataArray:
Comment on lines +16 to +24
Member

On second thought, I'm wondering whether we should make gmtread a private function for internal use only for now. The fact that it can read either tabular or grid/image files seems like a lot of magic.

Member Author

It seems the gmtread function is no longer needed if PR #3919 is implemented, right?

Member

Yes, not needed for grids/images, but we could still use gmtread for tabular datasets? Though let's think about #3673 (comment).

+    """
+    Read a dataset, grid, or image from a file and return the appropriate object.
+
+    The returned object is a :class:`pandas.DataFrame` for datasets, and
+    :class:`xarray.DataArray` for grids and images.
+
+    For datasets, keyword arguments ``column_names``, ``header``, ``dtype``, and
+    ``index_col`` are supported.
+
+    Parameters
+    ----------
+    file
+        The file name to read.
+    kind
+        The kind of data to read. Valid values are ``"dataset"``, ``"grid"``, and
+        ``"image"``.
+    region
+        The region of interest. Only data within this region will be read.
+    column_names
+        A list of column names.
+    header
+        Row number containing column names. ``header=None`` means not to parse the
+        column names from table header. Ignored if the row number is larger than the
+        number of headers in the table.
+    dtype
+        Data type. Can be a single type for all columns or a dictionary mapping
+        column names to types.
+    index_col
+        Column to set as index.
Comment on lines +43 to +53
Member

Should we indicate in the docstring that these params are only used for kind="dataset"?

Member Author

At line 31:

> For datasets, keyword arguments column_names, header, dtype, and
> index_col are supported.


+    Returns
+    -------
+    data
+        Return type depends on the ``kind`` argument:
+
+        - ``"dataset"``: :class:`pandas.DataFrame`
+        - ``"grid"`` or ``"image"``: :class:`xarray.DataArray`
+
+
+    Examples
+    --------
+    Read a dataset into a :class:`pandas.DataFrame` object:
+
+    >>> from pygmt import gmtread
+    >>> df = gmtread("@hotspots.txt", kind="dataset")
+    >>> type(df)
+    <class 'pandas.core.frame.DataFrame'>
+
+    Read a grid into an :class:`xarray.DataArray` object:
+
+    >>> dataarray = gmtread("@earth_relief_01d", kind="grid")
+    >>> type(dataarray)
+    <class 'xarray.core.dataarray.DataArray'>
+
+    Read an image into an :class:`xarray.DataArray` object:
+    >>> image = gmtread("@earth_day_01d", kind="image")
+    >>> type(image)
+    <class 'xarray.core.dataarray.DataArray'>
+    """
+    if kind not in {"dataset", "grid", "image"}:
+        msg = f"Invalid kind '{kind}': must be one of 'dataset', 'grid', or 'image'."
+        raise ValueError(msg)
+
+    if kind != "dataset" and any(
+        v is not None for v in [column_names, header, dtype, index_col]
+    ):
+        msg = (
+            "Only the 'dataset' kind supports the 'column_names', 'header', 'dtype', "
+            "and 'index_col' arguments."
+        )
+        raise ValueError(msg)
+
+    kwdict = {
+        "R": "/".join(f"{v}" for v in region) if is_nonstr_iter(region) else region,  # type: ignore[union-attr]
+        "T": {"dataset": "d", "grid": "g", "image": "i"}[kind],
+    }
+
+    with Session() as lib:
+        with lib.virtualfile_out(kind=kind) as voutfile:
+            lib.call_module(
+                module="read", args=[file, voutfile, *build_arg_list(kwdict)]
+            )
+
+            match kind:
+                case "dataset":
+                    return lib.virtualfile_to_dataset(
+                        vfname=voutfile,
+                        column_names=column_names,
+                        header=header,
+                        dtype=dtype,
+                        index_col=index_col,
+                    )
+                case "grid" | "image":
+                    raster = lib.virtualfile_to_raster(vfname=voutfile, kind=kind)
+                    # Add "source" encoding
+                    source = which(fname=file)
+                    raster.encoding["source"] = (
+                        source[0] if isinstance(source, list) else source
+                    )
+                    _ = raster.gmt  # Load GMTDataArray accessor information
+                    return raster

Codecov / codecov/patch: added lines L84-L86, L88, L91, L95, L97, L102-L104, L108-L110, L117-L118, L120-L121, and L124-L125 of pygmt/io/gmtread.py were not covered by tests.
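
The `-R` handling in `kwdict` can be looked at in isolation. The following is a minimal sketch under stated assumptions: `region_to_arg` is a hypothetical standalone helper used here in place of PyGMT's `is_nonstr_iter` and `build_arg_list`, and it only mirrors the join-with-slashes behaviour visible in the diff.

```python
from collections.abc import Iterable


def region_to_arg(region):
    """Turn a region into the value of GMT's -R option, or None if unset."""
    if region is None or isinstance(region, str):
        # Strings (e.g. a preformatted "0/10/-5/5") and None pass through.
        return region
    if isinstance(region, Iterable):
        # Sequences such as [xmin, xmax, ymin, ymax] are joined with "/".
        return "/".join(f"{v}" for v in region)
    msg = f"Unsupported region: {region!r}"
    raise TypeError(msg)
```

For example, `region_to_arg([0, 10, -5, 5])` gives `"0/10/-5/5"`, and a string region is returned unchanged.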
2 changes: 1 addition & 1 deletion pygmt/io.py → pygmt/io/load_dataarray.py
@@ -1,5 +1,5 @@
 """
-PyGMT input/output (I/O) utilities.
+Load xarray.DataArray from a file or file-like object.
 """

 import warnings
9 changes: 2 additions & 7 deletions pygmt/tests/test_datatypes_dataset.py
@@ -6,8 +6,7 @@

 import pandas as pd
 import pytest
-from pygmt import which
-from pygmt.clib import Session
+from pygmt import gmtread, which
 from pygmt.helpers import GMTTempFile


@@ -44,11 +43,7 @@ def dataframe_from_gmt(fname, **kwargs):
     """
     Read tabular data as pandas.DataFrame using GMT virtual file.
     """
-    with Session() as lib:
-        with lib.virtualfile_out(kind="dataset") as vouttbl:
-            lib.call_module("read", [fname, vouttbl, "-Td"])
-            df = lib.virtualfile_to_dataset(vfname=vouttbl, **kwargs)
-            return df
+    return gmtread(fname, kind="dataset", **kwargs)


 @pytest.mark.benchmark
61 changes: 61 additions & 0 deletions pygmt/tests/test_io_gmtread.py
@@ -0,0 +1,61 @@
+"""
+Test the gmtread function.
+"""
+
+import importlib
+
+import numpy as np
+import pytest
+import rioxarray
+import xarray as xr
+from pygmt import gmtread, which
+
+_HAS_NETCDF4 = bool(importlib.util.find_spec("netCDF4"))
+_HAS_RIORASTERIO = bool(importlib.util.find_spec("rioxarray"))
+
+
+@pytest.mark.skipif(not _HAS_NETCDF4, reason="netCDF4 is not installed.")
+def test_io_gmtread_grid():
+    """
+    Test that reading a grid returns an xr.DataArray and the grid is the same as the one
+    loaded via xarray.load_dataarray.
+    """
+    grid = gmtread("@static_earth_relief.nc", kind="grid")
+    assert isinstance(grid, xr.DataArray)
+    expected_grid = xr.load_dataarray(which("@static_earth_relief.nc", download="a"))
+    assert np.allclose(grid, expected_grid)
Comment on lines +17 to +26
Member

Also should have a similar test for kind="image", comparing against rioxarray.open_rasterio?

Member Author

Done in a6c4ee7.

When I tried to add a test for reading datasets, I realized that the DataFrame returned by the load_sample_data is not ideal:

In [1]: from pygmt.datasets import load_sample_data

In [2]: data = load_sample_data("hotspots")

In [3]: data.dtypes
Out[3]: 
longitude      float64
latitude       float64
symbol_size    float64
place_name      object
dtype: object

The last column place_name should be string dtype, rather than object. We also have similar issues for other sample datasets.

We have three options:

  1. Do nothing and keep them unchanged
  2. Fix and use appropriate dtypes
  3. Use the new gmtread function instead of pd.read_csv in _load_xxx functions.

I'm inclined to option 3.

Member

> 3. Use the new gmtread function instead of pd.read_csv in _load_xxx functions.
>
> I'm inclined to option 3.

Agree with this. We should also add dtype related checks for the tabular dataset tests in pygmt/tests/test_datasets_samples.py.



+@pytest.mark.skipif(not _HAS_RIORASTERIO, reason="rioxarray is not installed.")
+def test_io_gmtread_image():
+    """
+    Test that reading an image returns an xr.DataArray.
+    """
+    image = gmtread("@earth_day_01d", kind="image")
+    assert isinstance(image, xr.DataArray)
+    with rioxarray.open_rasterio(
+        which("@earth_day_01d", download="a")
+    ) as expected_image:
+        assert np.allclose(image, expected_image)
+
+
+def test_io_gmtread_invalid_kind():
+    """
+    Test that an invalid kind raises a ValueError.
+    """
+    with pytest.raises(ValueError, match="Invalid kind"):
+        gmtread("file.cpt", kind="cpt")
+
+
+def test_io_gmtread_invalid_arguments():
+    """
+    Test that invalid arguments raise a ValueError for non-'dataset' kind.
+    """
+    with pytest.raises(ValueError, match="Only the 'dataset' kind supports"):
+        gmtread("file.nc", kind="grid", column_names="foo")
+
+    with pytest.raises(ValueError, match="Only the 'dataset' kind supports"):
+        gmtread("file.nc", kind="grid", header=1)
+
+    with pytest.raises(ValueError, match="Only the 'dataset' kind supports"):
+        gmtread("file.nc", kind="grid", dtype="float")
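
The `_HAS_NETCDF4` and `_HAS_RIORASTERIO` flags in this test module rely on `importlib.util.find_spec`. A small standalone sketch of that check follows. `has_module` is a hypothetical helper name, not part of PyGMT, and the explicit `import importlib.util` is used because `import importlib` alone is not guaranteed to load the `util` submodule.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module can be located, without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec can raise for dotted names whose parent is missing
        # (ModuleNotFoundError) or for modules with a broken __spec__.
        return False
```

In a test module, calling `pytest.importorskip("rioxarray")` inside the test is another common way to skip when the package is missing; it would also avoid needing a module-level `import rioxarray`.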
File renamed without changes.