Description
After the change in #1535, loading a DataFrame whose index has the same name as one of its columns now fails:
[ins] In [42]: df
Out[42]:
   a
a
A  A
B  B
[ins] In [43]: bigquery.Client().load_table_from_dataframe(df, "tmp.blah")
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [43], in <cell line: 1>()
----> 1 bigquery.Client().load_table_from_dataframe(df, "tmp.blah")
...
File ~/model/.venv/lib/python3.10/site-packages/google/cloud/bigquery/_pandas_helpers.py:484, in dataframe_to_bq_schema(dataframe, bq_schema)
482 bq_type = _PANDAS_DTYPE_TO_BQ.get(dtype.name)
483 if bq_type is None:
--> 484 sample_data = _first_valid(dataframe.reset_index()[column])
485 if (
486 isinstance(sample_data, _BaseGeometry)
487 and sample_data is not None # Paranoia
488 ):
489 bq_type = "GEOGRAPHY"
...
File ~/model/.venv/lib/python3.10/site-packages/pandas/core/frame.py:4440, in DataFrame.insert(self, loc, column, value, allow_duplicates)
4434 raise ValueError(
4435 "Cannot specify 'allow_duplicates=True' when "
4436 "'self.flags.allows_duplicate_labels' is False."
4437 )
4438 if not allow_duplicates and column in self.columns:
4439 # Should this be a different kind of error??
-> 4440 raise ValueError(f"cannot insert {column}, already exists")
4441 if not isinstance(loc, int):
4442 raise TypeError("loc must be int")
ValueError: cannot insert a, already exists
This is kind of a weird edge case, but I think the goal of that PR could have been accomplished without a breaking change. Perhaps the easiest fix would be to call reset_index() in a separate statement and catch the ValueError (if you hit it, the name is already a column, so the reset_index() call wasn't needed)?
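For illustration, here is a rough sketch of that idea outside the library. The helper name `values_for` is hypothetical and is not part of `google-cloud-bigquery`. The DataFrame is my reconstruction of the one printed above (index named `a`, column named `a`). I have not checked how this would interact with the rest of `dataframe_to_bq_schema`.

```python
import pandas as pd


def values_for(dataframe: pd.DataFrame, column: str) -> pd.Series:
    """Hypothetical helper: look up `column` among the columns and index levels."""
    try:
        # Move the index levels into regular columns so they can be looked up by name.
        flattened = dataframe.reset_index()
    except ValueError:
        # reset_index() refuses to insert an index level whose name is already
        # a column. Assumption: in that case the plain column lookup is enough.
        flattened = dataframe
    return flattened[column]


# Reconstruction of the frame from the report: the index and the column are both named "a".
df = pd.DataFrame({"a": ["A", "B"]}, index=pd.Index(["A", "B"], name="a"))

# Calling df.reset_index() directly raises ValueError; the helper falls back to the column.
sample = values_for(df, "a")
```

One caveat with catching the error this broadly: with a MultiIndex where only one level collides with a column, the fallback would skip the reset entirely, so the other levels would not be reachable by name. Checking `column in dataframe.columns` before resetting the index might be a cleaner alternative.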