
PERF: always slice when indexing on columns #33597


Closed
wants to merge 63 commits

Changes from 1 commit
63 commits
1697252
PERF: block-wise arithmetic for frame-with-frame
jbrockmendel Mar 17, 2020
a7764d6
Merge branch 'master' of https://github.com/pandas-dev/pandas into pe…
jbrockmendel Mar 17, 2020
30a836d
lint fixup
jbrockmendel Mar 17, 2020
3559698
Merge branch 'master' of https://github.com/pandas-dev/pandas into pe…
jbrockmendel Mar 18, 2020
4334353
troubleshoot npdev build
jbrockmendel Mar 18, 2020
cb40b0c
Merge branch 'master' of https://github.com/pandas-dev/pandas into pe…
jbrockmendel Mar 24, 2020
713a776
comment
jbrockmendel Mar 25, 2020
95ef3ad
checkpoint passing
jbrockmendel Mar 25, 2020
61e5cd6
checkpoint passing
jbrockmendel Mar 25, 2020
89c3d7b
refactor
jbrockmendel Mar 25, 2020
e348e46
blackify
jbrockmendel Mar 25, 2020
519c757
Merge branch 'master' of https://github.com/pandas-dev/pandas into pe…
jbrockmendel Mar 31, 2020
2b1ba18
disable assertions for perf
jbrockmendel Mar 31, 2020
53e93fc
Merge branch 'master' of https://github.com/pandas-dev/pandas into pe…
jbrockmendel Apr 1, 2020
91c86a3
asv
jbrockmendel Apr 1, 2020
2034084
whatsnew
jbrockmendel Apr 1, 2020
8aedf35
Merge branch 'master' of https://github.com/pandas-dev/pandas into pe…
jbrockmendel Apr 3, 2020
0c12d35
revert warning suppression
jbrockmendel Apr 3, 2020
9727562
Fixup indentation
jbrockmendel Apr 3, 2020
6661dd3
Merge branch 'master' of https://github.com/pandas-dev/pandas into pe…
jbrockmendel Apr 4, 2020
42bbbf3
suppress warning
jbrockmendel Apr 4, 2020
65ab023
Merge branch 'master' of https://github.com/pandas-dev/pandas into pe…
jbrockmendel Apr 6, 2020
0d958a3
update asv
jbrockmendel Apr 6, 2020
7f91e74
Merge branch 'master' of https://github.com/pandas-dev/pandas into pe…
jbrockmendel Apr 6, 2020
56eef51
_data->_mgr
jbrockmendel Apr 6, 2020
4baea6f
Merge branch 'master' of https://github.com/pandas-dev/pandas into pe…
jbrockmendel Apr 7, 2020
41a4e7a
Merge branch 'master' of https://github.com/pandas-dev/pandas into pe…
jbrockmendel Apr 7, 2020
b23144e
Merge branch 'master' of https://github.com/pandas-dev/pandas into pe…
jbrockmendel Apr 9, 2020
7f24d57
Merge branch 'master' of https://github.com/pandas-dev/pandas into pe…
jbrockmendel Apr 9, 2020
ae744b7
update to use fastpath constructor
jbrockmendel Apr 9, 2020
b14a98c
Merge branch 'master' of https://github.com/pandas-dev/pandas into pe…
jbrockmendel Apr 10, 2020
f42c403
update import
jbrockmendel Apr 10, 2020
8a2807e
remove unused import
jbrockmendel Apr 10, 2020
fa046f0
Merge branch 'master' of https://github.com/pandas-dev/pandas into pe…
jbrockmendel Apr 10, 2020
fd10fb6
rebase compat
jbrockmendel Apr 10, 2020
7ea5d3a
Merge branch 'master' of https://github.com/pandas-dev/pandas into pe…
jbrockmendel Apr 11, 2020
7150e87
slice instead of take
jbrockmendel Apr 12, 2020
bddfbb0
Merge branch 'master' of https://github.com/pandas-dev/pandas into pe…
jbrockmendel Apr 13, 2020
25f83d6
Merge branch 'master' of https://github.com/pandas-dev/pandas into pe…
jbrockmendel Apr 13, 2020
2142d29
Merge branch 'master' of https://github.com/pandas-dev/pandas into pe…
jbrockmendel Apr 14, 2020
0ca2125
Dummy commit to force CI
jbrockmendel Apr 14, 2020
1ea0cc0
Merge branch 'master' of https://github.com/pandas-dev/pandas into pe…
jbrockmendel Apr 15, 2020
2bfc308
update call bound
jbrockmendel Apr 15, 2020
d5ad2a0
update max_len
jbrockmendel Apr 15, 2020
92007e8
revert
jbrockmendel Apr 16, 2020
10f6349
Merge branch 'master' of https://github.com/pandas-dev/pandas into pe…
jbrockmendel Apr 16, 2020
3ce6f51
Merge branch 'master' of https://github.com/pandas-dev/pandas into pe…
jbrockmendel Apr 16, 2020
df471d8
PERF: always slice when indexing on columns
jbrockmendel Apr 16, 2020
b323ea6
typo fixup
jbrockmendel Apr 16, 2020
e557ace
Merge branch 'master' of https://github.com/pandas-dev/pandas into pe…
jbrockmendel Apr 21, 2020
96b7d6a
avoid take in one more case
jbrockmendel Apr 21, 2020
2ab3917
Merge branch 'master' into perf-never-take
jbrockmendel Jul 16, 2021
c63464f
Merge branch 'master' into perf-never-take
jbrockmendel Jul 18, 2021
6e34e3e
Merge branch 'master' into perf-never-take
jbrockmendel Jul 21, 2021
314a0f8
Merge branch 'master' into perf-never-take
jbrockmendel Jul 21, 2021
587ffef
Merge branch 'master' into perf-never-take
jbrockmendel Jul 24, 2021
7fad752
update
jbrockmendel Jul 24, 2021
7e8b0bd
Merge branch 'master' into perf-never-take
jbrockmendel Jul 31, 2021
6d82359
Merge branch 'master' into perf-never-take
jbrockmendel Aug 8, 2021
201e57d
fix last 2 failing tests
jbrockmendel Aug 14, 2021
69f08cb
mypy fixup
jbrockmendel Aug 15, 2021
f65cbf1
Merge branch 'master' into perf-never-take
jbrockmendel Aug 24, 2021
f337dfe
Merge branch 'master' into perf-never-take
jbrockmendel Aug 26, 2021
revert
jbrockmendel committed Apr 16, 2020
commit 92007e8c508c5ca6b2b98cb52999268b5e2e8b44
50 changes: 0 additions & 50 deletions asv_bench/benchmarks/arithmetic.py
Original file line number Diff line number Diff line change
Expand Up @@ -97,56 +97,6 @@ def time_frame_op_with_series_axis0(self, opname):
getattr(self.df, opname)(self.ser, axis=0)


class FrameWithFrameWide:
# Many-columns, mixed dtypes

params = [
[
operator.add,
operator.sub,
operator.mul,
operator.truediv,
operator.floordiv,
operator.pow,
operator.mod,
operator.eq,
operator.ne,
operator.gt,
operator.ge,
operator.lt,
operator.le,
]
]
param_names = ["op"]

def setup(self, op):
# we choose dtypes so as to make the blocks
# a) not perfectly match between right and left
# b) appreciably bigger than single columns
arr = np.random.randn(10 ** 6).reshape(500, 2000).astype(np.float64)
df = pd.DataFrame(arr)
df[1000] = df[1000].astype(np.float32)
df.iloc[:, 1000:] = df.iloc[:, 1000:].astype(np.float32)
df._consolidate_inplace()

# TODO: GH#33198 the setting here shouldn't need two steps
df2 = pd.DataFrame(arr)
df2[1000] = df2[1000].astype(np.int64)
df2.iloc[:, 500:1500] = df2.iloc[:, 500:1500].astype(np.int64)
df2._consolidate_inplace()

self.left = df
self.right = df2

def time_op_different_blocks(self, op):
# blocks (and dtypes) are not aligned
op(self.left, self.right)

def time_op_same_blocks(self, op):
# blocks (and dtypes) are aligned
op(self.left, self.left)


class Ops:

params = [[True, False], ["default", 1]]
Expand Down
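The reverted `FrameWithFrameWide` benchmark timed frame-with-frame arithmetic where the two operands' internal blocks do not line up. A minimal standalone sketch of that setup (sizes and names are illustrative, not the asv code):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
arr = rng.standard_normal((100, 20))

# Two equal-shaped frames whose internal blocks do NOT line up:
# df is all-float64, while df2 stores a band of int64 columns, so a
# binary op must split and realign blocks instead of running block-wise.
df = pd.DataFrame(arr)
df2 = pd.DataFrame(
    {i: arr[:, i].astype(np.int64) if 5 <= i < 15 else arr[:, i] for i in range(20)}
)
assert df2.dtypes.nunique() == 2

result = df + df2  # the "different blocks" case the benchmark measured
assert result.shape == (100, 20)
```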
3 changes: 2 additions & 1 deletion doc/source/whatsnew/v1.1.0.rst
Expand Up @@ -379,7 +379,7 @@ Performance improvements
:meth:`DataFrame.sparse.from_spmatrix` constructor (:issue:`32821`,
:issue:`32825`, :issue:`32826`, :issue:`32856`, :issue:`32858`).
- Performance improvement in reductions (sum, prod, min, max) for nullable (integer and boolean) dtypes (:issue:`30982`, :issue:`33261`, :issue:`33442`).
- Performance improvement in arithmetic operations between two :class:`DataFrame` objects (:issue:`32779`)


.. ---------------------------------------------------------------------------

Expand Down Expand Up @@ -467,6 +467,7 @@ Indexing
- Bug in :meth:`DatetimeIndex.get_loc` raising ``KeyError`` with converted-integer key instead of the user-passed key (:issue:`31425`)
- Bug in :meth:`Series.xs` incorrectly returning ``Timestamp`` instead of ``datetime64`` in some object-dtype cases (:issue:`31630`)
- Bug in :meth:`DataFrame.iat` incorrectly returning ``Timestamp`` instead of ``datetime`` in some object-dtype cases (:issue:`32809`)
- Bug in :meth:`DataFrame.at` when either columns or index is non-unique (:issue:`33041`)
- Bug in :meth:`Series.loc` and :meth:`DataFrame.loc` when indexing with an integer key on a object-dtype :class:`Index` that is not all-integers (:issue:`31905`)
- Bug in :meth:`DataFrame.iloc.__setitem__` on a :class:`DataFrame` with duplicate columns incorrectly setting values for all matching columns (:issue:`15686`, :issue:`22036`)
- Bug in :meth:`DataFrame.loc` and :meth:`Series.loc` with a :class:`DatetimeIndex`, :class:`TimedeltaIndex`, or :class:`PeriodIndex` incorrectly allowing lookups of non-matching datetime-like dtypes (:issue:`32650`)
Expand Down
108 changes: 39 additions & 69 deletions pandas/_libs/tslibs/period.pyx
Expand Up @@ -806,24 +806,22 @@ cdef int64_t get_period_ordinal(npy_datetimestruct *dts, int freq) nogil:
return unix_date_to_week(unix_date, freq - FR_WK)


cdef void get_date_info(int64_t ordinal, int freq,
npy_datetimestruct *dts) nogil:
cdef void get_date_info(int64_t ordinal, int freq, npy_datetimestruct *dts) nogil:
cdef:
int64_t unix_date
double abstime
int64_t unix_date, nanos
npy_datetimestruct dts2

unix_date = get_unix_date(ordinal, freq)
abstime = get_abs_time(freq, unix_date, ordinal)

while abstime < 0:
abstime += 86400
unix_date -= 1
nanos = get_time_nanos(freq, unix_date, ordinal)

while abstime >= 86400:
abstime -= 86400
unix_date += 1
pandas_datetime_to_datetimestruct(unix_date, NPY_FR_D, dts)

date_info_from_days_and_time(dts, unix_date, abstime)
dt64_to_dtstruct(nanos, &dts2)
dts.hour = dts2.hour
dts.min = dts2.min
dts.sec = dts2.sec
dts.us = dts2.us
dts.ps = dts2.ps


cdef int64_t get_unix_date(int64_t period_ordinal, int freq) nogil:
Expand Down Expand Up @@ -855,74 +853,50 @@ cdef int64_t get_unix_date(int64_t period_ordinal, int freq) nogil:


@cython.cdivision
cdef void date_info_from_days_and_time(npy_datetimestruct *dts,
int64_t unix_date,
double abstime) nogil:
cdef int64_t get_time_nanos(int freq, int64_t unix_date, int64_t ordinal) nogil:
"""
Set the instance's value using the given date and time.
Find the number of nanoseconds after midnight on the given unix_date
that the ordinal represents in the given frequency.

Parameters
----------
dts : npy_datetimestruct*
freq : int
unix_date : int64_t
days elapsed since datetime(1970, 1, 1)
abstime : double
seconds elapsed since beginning of day described by unix_date
ordinal : int64_t

Notes
-----
Updates dts inplace
Returns
-------
int64_t
"""
cdef:
int inttime
int hour, minute
double second, subsecond_fraction

# Bounds check
# The calling function is responsible for ensuring that
# abstime >= 0.0 and abstime <= 86400

# Calculate the date
pandas_datetime_to_datetimestruct(unix_date, NPY_FR_D, dts)
int64_t sub, factor

# Calculate the time
inttime = <int>abstime
hour = inttime / 3600
minute = (inttime % 3600) / 60
second = abstime - <double>(hour * 3600 + minute * 60)
freq = get_freq_group(freq)

dts.hour = hour
dts.min = minute
dts.sec = <int>second

subsecond_fraction = second - dts.sec
dts.us = int((subsecond_fraction) * 1e6)
dts.ps = int(((subsecond_fraction) * 1e6 - dts.us) * 1e6)
if freq <= FR_DAY:
return 0

elif freq == FR_NS:
factor = 1

@cython.cdivision
cdef double get_abs_time(int freq, int64_t unix_date, int64_t ordinal) nogil:
cdef:
int freq_index, day_index, base_index
int64_t per_day, start_ord
double unit, result
elif freq == FR_US:
factor = 10**3

if freq <= FR_DAY:
return 0
elif freq == FR_MS:
factor = 10**6

freq_index = freq // 1000
day_index = FR_DAY // 1000
base_index = FR_SEC // 1000
elif freq == FR_SEC:
factor = 10**9

per_day = get_daytime_conversion_factor(day_index, freq_index)
unit = get_daytime_conversion_factor(freq_index, base_index)
elif freq == FR_MIN:
factor = 10**9 * 60

if base_index < freq_index:
unit = 1 / unit
else:
# We must have freq == FR_HR
factor = 10**9 * 3600

start_ord = unix_date * per_day
result = <double>(unit * (ordinal - start_ord))
return result
sub = ordinal - unix_date * 24 * 3600 * 10**9 / factor
return sub * factor


cdef int get_yq(int64_t ordinal, int freq, int *quarter, int *year):
Expand Down Expand Up @@ -1176,11 +1150,7 @@ cdef int64_t period_ordinal_to_dt64(int64_t ordinal, int freq) except? -1:
if ordinal == NPY_NAT:
return NPY_NAT

if freq == 11000:
# Microsecond, avoid get_date_info to prevent floating point errors
pandas_datetime_to_datetimestruct(ordinal, NPY_FR_us, &dts)
else:
get_date_info(ordinal, freq, &dts)
get_date_info(ordinal, freq, &dts)

check_dts_bounds(&dts)
return dtstruct_to_dt64(&dts)
Expand Down
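The new `get_time_nanos` replaces the floating-point time-of-day math of `get_abs_time` with integer arithmetic: each sub-daily frequency maps to a per-unit nanosecond factor, and the time of day falls out of an integer subtraction. A pure-Python sketch of that computation (the function name and the explicit `factor` argument are illustrative, not the Cython signature):

```python
# An ordinal at a sub-daily frequency counts units since the epoch; the
# per-unit factor converts the sub-day remainder to nanoseconds.
NANOS_PER_DAY = 24 * 3600 * 10**9

def time_nanos(ordinal: int, unix_date: int, factor: int) -> int:
    # Subtract the ordinal corresponding to midnight of unix_date,
    # then rescale the remainder to nanoseconds.
    sub = ordinal - unix_date * (NANOS_PER_DAY // factor)
    return sub * factor

# Second frequency (factor 10**9): 90_061 s = 1 day + 1 h 1 min 1 s.
assert time_nanos(90_061, 1, 10**9) == 3_661 * 10**9
# Minute frequency (factor 10**9 * 60): 1_441 min = 1 day + 1 min.
assert time_nanos(1_441, 1, 10**9 * 60) == 60 * 10**9
```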
7 changes: 3 additions & 4 deletions pandas/core/arrays/datetimelike.py
Expand Up @@ -1291,20 +1291,19 @@ def _addsub_object_array(self, other: np.ndarray, op):
result : same class as self
"""
assert op in [operator.add, operator.sub]
if len(other) == 1 and self.ndim == other.ndim == 1:
# If both 1D then broadcasting is unambiguous
if len(other) == 1:
return op(self, other[0])

warnings.warn(
"Adding/subtracting object-dtype array to "
"Adding/subtracting array of DateOffsets to "
f"{type(self).__name__} not vectorized",
PerformanceWarning,
)

# Caller is responsible for broadcasting if necessary
assert self.shape == other.shape, (self.shape, other.shape)

res_values = op(self.astype("O"), np.asarray(other))
res_values = op(self.astype("O"), np.array(other))
result = array(res_values.ravel())
result = extract_array(result, extract_numpy=True).reshape(self.shape)
return result
Expand Down
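The reworded warning fires when a full-length object array of DateOffsets forces an elementwise loop. A small demonstration through the public API (behavior as of pandas 1.x):

```python
import warnings

import numpy as np
import pandas as pd

dti = pd.date_range("2020-01-01", periods=3)
offsets = np.array([pd.offsets.Day(n) for n in (1, 2, 3)])

# Adding an array of DateOffsets cannot be vectorized, so pandas falls
# back to an object-dtype loop and emits the PerformanceWarning above.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    shifted = dti + offsets

assert list(shifted) == list(pd.to_datetime(["2020-01-02", "2020-01-04", "2020-01-06"]))
assert any(issubclass(w.category, pd.errors.PerformanceWarning) for w in caught)
```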
25 changes: 10 additions & 15 deletions pandas/core/frame.py
Expand Up @@ -452,7 +452,6 @@ def __init__(
mgr = self._init_mgr(
data, axes=dict(index=index, columns=columns), dtype=dtype, copy=copy
)

elif isinstance(data, dict):
mgr = init_dict(data, index, columns, dtype=dtype)
elif isinstance(data, ma.MaskedArray):
Expand Down Expand Up @@ -2587,7 +2586,7 @@ def _ixs(self, i: int, axis: int = 0):
label = self.columns[i]

values = self._mgr.iget(i)
result = self._box_col_values(values, label)
result = self._box_col_values(values, i)

# this is a cached value, mark it so
result._set_as_cached(label, self)
Expand Down Expand Up @@ -2692,7 +2691,7 @@ def _getitem_bool_array(self, key):
def _getitem_multilevel(self, key):
# self.columns is a MultiIndex
loc = self.columns.get_loc(key)
if isinstance(loc, (slice, Series, np.ndarray, Index)):
if isinstance(loc, (slice, np.ndarray)):
new_columns = self.columns[loc]
result_columns = maybe_droplevels(new_columns, key)
if self._is_mixed_type:
Expand Down Expand Up @@ -2725,7 +2724,8 @@ def _getitem_multilevel(self, key):
result._set_is_copy(self)
return result
else:
return self._get_item_cache(key)
# loc is neither a slice nor ndarray, so must be an int
return self._ixs(loc, axis=1)

def _get_value(self, index, col, takeable: bool = False):
"""
Expand Down Expand Up @@ -2916,19 +2916,15 @@ def _ensure_valid_index(self, value):
value.index.copy(), axis=1, fill_value=np.nan
)

def _box_item_values(self, key, values):
items = self.columns[self.columns.get_loc(key)]
if values.ndim == 2:
return self._constructor(values.T, columns=items, index=self.index)
else:
return self._box_col_values(values, items)

def _box_col_values(self, values, items):
def _box_col_values(self, values, loc: int) -> Series:
"""
Provide boxed values for a column.
"""
# Lookup in columns so that if e.g. a str datetime was passed
# we attach the Timestamp object as the name.
name = self.columns[loc]
klass = self._constructor_sliced
return klass(values, index=self.index, name=items, fastpath=True)
return klass(values, index=self.index, name=name, fastpath=True)

# ----------------------------------------------------------------------
# Unsorted
Expand Down Expand Up @@ -5536,11 +5532,10 @@ def _construct_result(self, result) -> "DataFrame":
-------
DataFrame
"""
out = self._constructor(result, copy=False)
out = self._constructor(result, index=self.index, copy=False)
# Pin columns instead of passing to constructor for compat with
# non-unique columns case
out.columns = self.columns
out.index = self.index
return out

def combine(
Expand Down
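`_box_col_values` now takes an integer position and looks up the label itself; the observable behavior is what positional column access already shows through the public API:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})

# Positional access boxes the column's values into a Series and attaches
# the column label as the name -- the job _box_col_values now does from
# an integer location instead of a pre-fetched label.
col = df.iloc[:, 1]
assert isinstance(col, pd.Series)
assert col.name == "y"
assert str(col.dtype) == "float64"
```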
12 changes: 7 additions & 5 deletions pandas/core/generic.py
Expand Up @@ -3558,18 +3558,20 @@ def _get_item_cache(self, item):
cache = self._item_cache
res = cache.get(item)
if res is None:
values = self._mgr.get(item)
res = self._box_item_values(item, values)
# All places that call _get_item_cache have unique columns,
# pending resolution of GH#33047

loc = self.columns.get_loc(item)
values = self._mgr.iget(loc)
res = self._box_col_values(values, loc)

cache[item] = res
res._set_as_cached(item, self)

# for a chain
res._is_copy = self._is_copy
return res

def _box_item_values(self, key, values):
raise AbstractMethodError(self)

def _slice(self: FrameOrSeries, slobj: slice, axis=0) -> FrameOrSeries:
"""
Construct a slice of this container.
Expand Down
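The rewritten `_get_item_cache` leans on `Index.get_loc` returning a plain integer when columns are unique; with duplicates it returns a slice or mask, which is why the comment defers that case to GH#33047. Illustrated with public API:

```python
import pandas as pd

# Unique columns: get_loc returns a plain integer position -- exactly
# what the revised _get_item_cache feeds to _mgr.iget.
df = pd.DataFrame({"a": [1], "b": [2], "c": [3]})
assert df.columns.get_loc("b") == 1

# Duplicate labels: get_loc returns a slice or boolean mask instead,
# so the integer-based path cannot be used.
df2 = pd.DataFrame([[1, 2, 3]], columns=["a", "b", "a"])
loc = df2.columns.get_loc("a")
assert not isinstance(loc, int)
```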
8 changes: 7 additions & 1 deletion pandas/core/indexes/category.py
Expand Up @@ -25,6 +25,7 @@
from pandas.core.algorithms import take_1d
from pandas.core.arrays.categorical import Categorical, contains, recode_for_categories
import pandas.core.common as com
from pandas.core.construction import extract_array
import pandas.core.indexes.base as ibase
from pandas.core.indexes.base import Index, _index_shared_docs, maybe_extract_name
from pandas.core.indexes.extension import ExtensionIndex, inherit_names
Expand Down Expand Up @@ -198,8 +199,13 @@ def __new__(
data = []

assert isinstance(dtype, CategoricalDtype), dtype
if not isinstance(data, Categorical) or data.dtype != dtype:
data = extract_array(data, extract_numpy=True)

if not isinstance(data, Categorical):
data = Categorical(data, dtype=dtype)
elif isinstance(dtype, CategoricalDtype) and dtype != data.dtype:
# we want to silently ignore dtype='category'
data = data._set_dtype(dtype)

data = data.copy() if copy else data

Expand Down
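The new branch recodes an existing `Categorical` to a compatible dtype via `_set_dtype` instead of re-factorizing from scratch. Through the public constructor that looks like this (a sketch of the observable behavior, internals aside):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Recoding an existing Categorical to a compatible dtype: the values are
# preserved and only the category order changes.
cat = pd.Categorical(["a", "b", "a"])
idx = pd.CategoricalIndex(cat, dtype=CategoricalDtype(categories=["b", "a"]))

assert list(idx) == ["a", "b", "a"]
assert list(idx.categories) == ["b", "a"]
```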