Skip to content

ENH: Support pd.json_normalize for normalizing only meta fields  #60479

Open
@Ynjxsjmh

Description

@Ynjxsjmh

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Currently, meta is used with when record_path is not None. The logic is to extract both the record_path and the meta. For example:

data = [
    {
        "state": "Florida",
        "shortname": "FL",
        "info": {"governor": "Rick Scott", "year": 2014},
        "counties": [
            {"name": "Dade", "population": 12345},
            {"name": "Broward", "population": 40000},
            {"name": "Palm Beach", "population": 60000},
        ],
    },
    {
        "state": "Ohio",
        "shortname": "OH",
        "info": {"governor": "John Kasich", "year": 2015},
        "counties": [
            {"name": "Summit", "population": 1234},
            {"name": "Cuyahoga", "population": 1337},
        ],
    },
]
result = pd.json_normalize(
    data, "counties", ["state", "shortname", ["info", "governor"]]
)

result
         name  population    state shortname info.governor
0        Dade       12345   Florida    FL    Rick Scott
1     Broward       40000   Florida    FL    Rick Scott
2  Palm Beach       60000   Florida    FL    Rick Scott
3      Summit        1234   Ohio       OH    John Kasich
4    Cuyahoga        1337   Ohio       OH    John Kasich

In the above example, pd.json_normalize not only retrieves counties, but also retrieves state, shortname and info.governor.


When record_path is not given, meta is ignored, for example:

result = pd.json_normalize(
    data, meta=["state", "shortname", ["info", "governor"]]
)

result
     state shortname                                           counties    info.governor   info.year
0  Florida        FL  [{'name': 'Dade', 'population': 12345}, {'name...       Rick Scott        2014
1     Ohio        OH  [{'name': 'Summit', 'population': 1234}, {'nam...      John Kasich        2015

Feature Description

I would expect when record_path is None or an empty list, json_normalize can only extract the meta.

result = pd.json_normalize(
    data, meta=["state", "shortname", ["info", "governor"]]
)

result
  shortname    state info.governor
0        FL  Florida    Rick Scott
1        OH     Ohio   John Kasich

The behavior can be summarized as:

  • record_path is None, meta is None: normalize all records.
  • record_path is not None, meta is None: normalize only record_path.
  • record_path is not None, meta is not None: normalize record_path and meta.
  • record_path is None, meta is not None: normalize only meta. [This request]

Alternative Solutions

The alternative solution is to use pd.json_normalize to create the dataframe then select desired columns:

result = pd.json_normalize(data)[["shortname", "state", "info.governor"]]

  shortname    state info.governor
0        FL  Florida    Rick Scott
1        OH     Ohio   John Kasich

However, this is inconvenient that when you specify sep in pd.json_normalize, the columns name would also change as f"info{sep}governor":

pd.json_normalize(data, sep='-')[["shortname", "state", "info-governor"]]

The most similar feature is DataFrame.from_records columns parameter, but it is not able to parse inner dictionary.

result = pd.DataFrame.from_records(data, columns=["state", "shortname", "info"])

result
     state shortname                         info
0  Florida        FL   {'governor': 'Rick Scott'}
1     Ohio        OH  {'governor': 'John Kasich'}

df = pd.DataFrame.from_records(data, columns=["state", "shortname", ["info", "governor"]])

Traceback (most recent call last):
  File "d:\sourcecode\pandas\test.py", line 20, in <module>
    df = pd.DataFrame.from_records(data, columns=["state", "shortname", ["info", "governor"]])
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\software\Python\Python312\Lib\site-packages\pandas\core\frame.py", line 2491, in from_records
    arrays, arr_columns = to_arrays(data, columns)
                          ^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\software\Python\Python312\Lib\site-packages\pandas\core\internals\construction.py", line 837, in to_arrays
    arr, columns = _list_of_dict_to_arrays(data, columns)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\software\Python\Python312\Lib\site-packages\pandas\core\internals\construction.py", line 924, in _list_of_dict_to_arrays
    content = lib.dicts_to_array(data, list(columns))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "lib.pyx", line 401, in pandas._libs.lib.dicts_to_array
TypeError: unhashable type: 'list'

Metadata

Metadata

Assignees

No one assigned

    Labels

    EnhancementIO JSONread_json, to_json, json_normalizeNeeds DiscussionRequires discussion from core team before further action

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions