Commit 9c3856d
Merge remote-tracking branch 'upstream/master' into sparse-astype-fill-value
2 parents: 7454e31 + adc54fe

38 files changed: +567 additions, -329 deletions

.circleci/config.yml

Lines changed: 1 addition & 11 deletions

@@ -1,10 +1,6 @@
 version: 2
 jobs:
-
-  # --------------------------------------------------------------------------
-  # 1. py36_locale
-  # --------------------------------------------------------------------------
-  py36_locale:
+  build:
     docker:
       - image: continuumio/miniconda:latest
     # databases configuration
@@ -34,9 +30,3 @@ jobs:
      - run:
          name: test
          command: ./ci/circle/run_circle.sh --skip-slow --skip-network
-
-workflows:
-  version: 2
-  build_and_test:
-    jobs:
-      - py36_locale

doc/source/groupby.rst

Lines changed: 27 additions & 0 deletions

@@ -995,6 +995,33 @@ Note that ``df.groupby('A').colname.std().`` is more efficient than
 is only interesting over one column (here ``colname``), it may be filtered
 *before* applying the aggregation function.

+.. note::
+   Any object column, even if it contains numerical values such as ``Decimal``
+   objects, is considered a "nuisance" column. Nuisance columns are excluded
+   from aggregate functions automatically in groupby.
+
+   If you do wish to include decimal or object columns in an aggregation with
+   other non-nuisance data types, you must do so explicitly.
+
+.. ipython:: python
+
+   from decimal import Decimal
+   df_dec = pd.DataFrame(
+       {'id': [1, 2, 1, 2],
+        'int_column': [1, 2, 3, 4],
+        'dec_column': [Decimal('0.50'), Decimal('0.15'),
+                       Decimal('0.25'), Decimal('0.40')]}
+   )
+
+   # Decimal columns can be summed explicitly by themselves...
+   df_dec.groupby(['id'])[['dec_column']].sum()
+
+   # ...but cannot be combined with standard data types or they will be excluded
+   df_dec.groupby(['id'])[['int_column', 'dec_column']].sum()
+
+   # Use .agg to aggregate over standard and "nuisance" data types at once
+   df_dec.groupby(['id']).agg({'int_column': 'sum', 'dec_column': 'sum'})
+
 .. _groupby.observed:

 Handling of (un)observed Categorical values
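The ``.agg`` path shown in the doc addition above can be run standalone; this is a minimal sketch (names like ``df_dec`` come from the example itself) showing that an explicit per-column aggregation spec keeps both the standard and the object-dtype column in the result:

```python
from decimal import Decimal

import pandas as pd

df_dec = pd.DataFrame(
    {'id': [1, 2, 1, 2],
     'int_column': [1, 2, 3, 4],
     'dec_column': [Decimal('0.50'), Decimal('0.15'),
                    Decimal('0.25'), Decimal('0.40')]}
)

# Naming each column in .agg aggregates standard and object
# ("nuisance") dtypes in one pass.
result = df_dec.groupby('id').agg({'int_column': 'sum', 'dec_column': 'sum'})
print(result)
```

For group ``id=1`` this sums rows 0 and 2, giving ``int_column`` 4 and ``dec_column`` ``Decimal('0.75')``.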

doc/source/io.rst

Lines changed: 30 additions & 0 deletions

@@ -4806,6 +4806,36 @@ default ``Text`` type for string columns:
 Because of this, reading the database table back in does **not** generate
 a categorical.

+.. _io.sql_datetime_data:
+
+Datetime data types
+'''''''''''''''''''
+
+Using SQLAlchemy, :func:`~pandas.DataFrame.to_sql` is capable of writing
+datetime data that is timezone naive or timezone aware. However, the resulting
+data stored in the database ultimately depends on the supported data type
+for datetime data of the database system being used.
+
+The following table lists supported data types for datetime data for some
+common databases. Other database dialects may have different data types for
+datetime data.
+
+===========  =============================================  ================
+Database     SQL Datetime Types                             Timezone Support
+===========  =============================================  ================
+SQLite       ``TEXT``                                       No
+MySQL        ``TIMESTAMP`` or ``DATETIME``                  No
+PostgreSQL   ``TIMESTAMP`` or ``TIMESTAMP WITH TIME ZONE``  Yes
+===========  =============================================  ================
+
+When writing timezone aware data to databases that do not support timezones,
+the data will be written as timezone naive timestamps that are in local time
+with respect to the timezone.
+
+:func:`~pandas.read_sql_table` is also capable of reading datetime data that is
+timezone aware or naive. When reading ``TIMESTAMP WITH TIME ZONE`` types, pandas
+will convert the data to UTC.
+
 Reading Tables
 ''''''''''''''

doc/source/whatsnew/v0.24.0.txt

Lines changed: 6 additions & 0 deletions

@@ -222,6 +222,7 @@ Other Enhancements
 - :class:`IntervalIndex` has gained the :meth:`~IntervalIndex.set_closed` method to change the existing ``closed`` value (:issue:`21670`)
 - :func:`~DataFrame.to_csv`, :func:`~Series.to_csv`, :func:`~DataFrame.to_json`, and :func:`~Series.to_json` now support ``compression='infer'`` to infer compression based on filename extension (:issue:`15008`).
   The default compression for ``to_csv``, ``to_json``, and ``to_pickle`` methods has been updated to ``'infer'`` (:issue:`22004`).
+- :meth:`DataFrame.to_sql` now supports writing ``TIMESTAMP WITH TIME ZONE`` types for supported databases. For databases that don't support timezones, datetime data will be stored as timezone unaware local timestamps. See :ref:`io.sql_datetime_data` for implications (:issue:`9086`).
 - :func:`to_timedelta` now supports iso-formatted timedelta strings (:issue:`21877`)
 - :class:`Series` and :class:`DataFrame` now support :class:`Iterable` in constructor (:issue:`2193`)
 - :class:`DatetimeIndex` gained :attr:`DatetimeIndex.timetz` attribute. Returns local time with timezone information. (:issue:`21358`)
@@ -853,6 +854,7 @@ update the ``ExtensionDtype._metadata`` tuple to match the signature of your
 - Updated the ``.type`` attribute for ``PeriodDtype``, ``DatetimeTZDtype``, and ``IntervalDtype`` to be instances of the dtype (``Period``, ``Timestamp``, and ``Interval`` respectively) (:issue:`22938`)
 - :func:`ExtensionArray.isna` is allowed to return an ``ExtensionArray`` (:issue:`22325`).
 - Support for reduction operations such as ``sum``, ``mean`` via opt-in base class method override (:issue:`22762`)
+- :meth:`DataFrame.stack` no longer converts to object dtype for DataFrames where each column has the same extension dtype. The output Series will have the same dtype as the columns (:issue:`23077`).
 - :meth:`Series.unstack` and :meth:`DataFrame.unstack` no longer convert extension arrays to object-dtype ndarrays. Each column in the output ``DataFrame`` will now have the same dtype as the input (:issue:`23077`).
 - Bug when grouping :meth:`Dataframe.groupby()` and aggregating on ``ExtensionArray`` it was not returning the actual ``ExtensionArray`` dtype (:issue:`23227`).
@@ -1245,6 +1247,9 @@ MultiIndex
 I/O
 ^^^

+- Bug in :meth:`to_sql` when writing timezone aware data (``datetime64[ns, tz]`` dtype) would raise a ``TypeError`` (:issue:`9086`)
+- Bug in :meth:`to_sql` where a naive DatetimeIndex would be written as ``TIMESTAMP WITH TIMEZONE`` type in supported databases, e.g. PostgreSQL (:issue:`23510`)
+
 .. _whatsnew_0240.bug_fixes.nan_with_str_dtype:

 Proper handling of `np.NaN` in a string data-typed column with the Python engine
@@ -1292,6 +1297,7 @@ Notice how we now instead output ``np.nan`` itself instead of a stringified form
 - Bug in :func:`DataFrame.to_csv` where a single level MultiIndex incorrectly wrote a tuple. Now just the value of the index is written (:issue:`19589`).
 - Bug in :meth:`HDFStore.append` when appending a :class:`DataFrame` with an empty string column and ``min_itemsize`` < 8 (:issue:`12242`)
 - Bug in :meth:`read_csv()` in which :class:`MultiIndex` index names were being improperly handled in the cases when they were not provided (:issue:`23484`)
+- Bug in :meth:`read_html()` in which the error message was not displaying the valid flavors when an invalid one was provided (:issue:`23549`)

 Plotting
 ^^^^^^^^

pandas/_libs/intervaltree.pxi.in

Lines changed: 4 additions & 3 deletions

@@ -105,7 +105,7 @@ cdef class IntervalTree(IntervalMixin):
         self.root.query(result, key)
         if not result.data.n:
             raise KeyError(key)
-        return result.to_array()
+        return result.to_array().astype('intp')

     def _get_partial_overlap(self, key_left, key_right, side):
         """Return all positions corresponding to intervals with the given side
@@ -155,7 +155,7 @@ cdef class IntervalTree(IntervalMixin):
                 raise KeyError(
                     'indexer does not intersect a unique set of intervals')
             old_len = result.data.n
-        return result.to_array()
+        return result.to_array().astype('intp')

     def get_indexer_non_unique(self, scalar_t[:] target):
         """Return the positions corresponding to intervals that overlap with
@@ -175,7 +175,8 @@
                 result.append(-1)
                 missing.append(i)
             old_len = result.data.n
-        return result.to_array(), missing.to_array()
+        return (result.to_array().astype('intp'),
+                missing.to_array().astype('intp'))

     def __repr__(self):
         return ('<IntervalTree[{dtype},{closed}]: '
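The ``IntervalTree`` methods patched above back ``IntervalIndex.get_indexer``; the user-visible effect of the ``astype('intp')`` casts is that indexer positions come back as platform-native ``intp`` integers. A minimal sketch of that behavior:

```python
import numpy as np
import pandas as pd

# Three intervals (0, 1], (1, 2], (2, 3] (closed='right' by default).
idx = pd.IntervalIndex.from_breaks([0, 1, 2, 3])

# Positions of the intervals containing each target value.
indexer = idx.get_indexer([0.5, 1.5, 2.5])
print(indexer)        # [0 1 2]
print(indexer.dtype)  # intp (prints as int64 on most 64-bit platforms)
```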

pandas/_libs/tslibs/timestamps.pyx

Lines changed: 3 additions & 3 deletions

@@ -733,11 +733,11 @@ class Timestamp(_Timestamp):
         if ts.value == NPY_NAT:
             return NaT

-        if is_string_object(freq):
-            freq = to_offset(freq)
-        elif not is_offset_object(freq):
+        if freq is None:
             # GH 22311: Try to extract the frequency of a given Timestamp input
             freq = getattr(ts_input, 'freq', None)
+        elif not is_offset_object(freq):
+            freq = to_offset(freq)

         return create_timestamp_from_ts(ts.value, ts.dts, ts.tzinfo, freq)
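The reordered branches above check ``freq is None`` first (inheriting the input's frequency) and otherwise normalize any non-offset ``freq`` through ``to_offset``. A small sketch of that string-to-offset normalization, using the public ``to_offset`` helper:

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset

# A frequency given as a string normalizes to the same offset object
# you would get by constructing it directly.
off = to_offset('3D')
print(off)  # <3 * Days>

assert off == pd.offsets.Day(3)
```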

pandas/core/generic.py

Lines changed: 15 additions & 6 deletions

@@ -10,7 +10,7 @@
 import numpy as np
 import pandas as pd

-from pandas._libs import tslib, properties
+from pandas._libs import properties, Timestamp, iNaT
 from pandas.core.dtypes.common import (
     ensure_int64,
     ensure_object,
@@ -2397,6 +2397,15 @@ def to_sql(self, name, con, schema=None, if_exists='fail', index=True,
         --------
         pandas.read_sql : read a DataFrame from a table

+        Notes
+        -----
+        Timezone aware datetime columns will be written as
+        ``Timestamp with timezone`` type with SQLAlchemy if supported by the
+        database. Otherwise, the datetimes will be stored as timezone unaware
+        timestamps local to the original timezone.
+
+        .. versionadded:: 0.24.0
+
         References
         ----------
         .. [1] http://docs.sqlalchemy.org
@@ -5091,7 +5100,7 @@ def get_ftype_counts(self):
         1  b  2  2.0
         2  c  3  3.0

-        >>> df.get_ftype_counts()
+        >>> df.get_ftype_counts()  # doctest: +SKIP
         float64:dense    1
         int64:dense      1
         object:dense     1
@@ -9273,9 +9282,9 @@ def describe_categorical_1d(data):
             tz = data.dt.tz
             asint = data.dropna().values.view('i8')
             names += ['top', 'freq', 'first', 'last']
-            result += [tslib.Timestamp(top, tz=tz), freq,
-                       tslib.Timestamp(asint.min(), tz=tz),
-                       tslib.Timestamp(asint.max(), tz=tz)]
+            result += [Timestamp(top, tz=tz), freq,
+                       Timestamp(asint.min(), tz=tz),
+                       Timestamp(asint.max(), tz=tz)]
         else:
             names += ['top', 'freq']
             result += [top, freq]
@@ -10613,7 +10622,7 @@ def cum_func(self, axis=None, skipna=True, *args, **kwargs):
                 issubclass(y.dtype.type, (np.datetime64, np.timedelta64))):
             result = accum_func(y, axis)
             mask = isna(self)
-            np.putmask(result, mask, tslib.iNaT)
+            np.putmask(result, mask, iNaT)
         elif skipna and not issubclass(y.dtype.type, (np.integer, np.bool_)):
             mask = isna(self)
             np.putmask(y, mask, mask_a)
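The ``cum_func`` hunk above masks ``NaT`` positions back into cumulative results using the ``iNaT`` sentinel. The user-visible behavior it implements can be sketched with a datetime Series containing a missing value:

```python
import pandas as pd

s = pd.Series(pd.to_datetime(['2018-01-02', None, '2018-01-01']))

# With skipna=True (the default), the NaT position stays NaT while
# the running maximum continues past it.
out = s.cummax()
print(out)
```

Here ``out`` is ``[2018-01-02, NaT, 2018-01-02]``: the missing value is preserved and the cumulative maximum carries through.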

pandas/core/indexes/base.py

Lines changed: 5 additions & 9 deletions

@@ -1875,35 +1875,31 @@ def get_duplicates(self):

         Works on different Index of types.

-        >>> pd.Index([1, 2, 2, 3, 3, 3, 4]).get_duplicates()
+        >>> pd.Index([1, 2, 2, 3, 3, 3, 4]).get_duplicates()  # doctest: +SKIP
         [2, 3]
-        >>> pd.Index([1., 2., 2., 3., 3., 3., 4.]).get_duplicates()
-        [2.0, 3.0]
-        >>> pd.Index(['a', 'b', 'b', 'c', 'c', 'c', 'd']).get_duplicates()
-        ['b', 'c']

         Note that for a DatetimeIndex, it does not return a list but a new
         DatetimeIndex:

         >>> dates = pd.to_datetime(['2018-01-01', '2018-01-02', '2018-01-03',
         ...                         '2018-01-03', '2018-01-04', '2018-01-04'],
         ...                        format='%Y-%m-%d')
-        >>> pd.Index(dates).get_duplicates()
+        >>> pd.Index(dates).get_duplicates()  # doctest: +SKIP
         DatetimeIndex(['2018-01-03', '2018-01-04'],
                       dtype='datetime64[ns]', freq=None)

         Sorts duplicated elements even when indexes are unordered.

-        >>> pd.Index([1, 2, 3, 2, 3, 4, 3]).get_duplicates()
+        >>> pd.Index([1, 2, 3, 2, 3, 4, 3]).get_duplicates()  # doctest: +SKIP
         [2, 3]

         Return empty array-like structure when all elements are unique.

-        >>> pd.Index([1, 2, 3, 4]).get_duplicates()
+        >>> pd.Index([1, 2, 3, 4]).get_duplicates()  # doctest: +SKIP
         []
         >>> dates = pd.to_datetime(['2018-01-01', '2018-01-02', '2018-01-03'],
         ...                        format='%Y-%m-%d')
-        >>> pd.Index(dates).get_duplicates()
+        >>> pd.Index(dates).get_duplicates()  # doctest: +SKIP
         DatetimeIndex([], dtype='datetime64[ns]', freq=None)
         """
         warnings.warn("'get_duplicates' is deprecated and will be removed in "
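Since the hunk above is part of deprecating ``get_duplicates`` (its doctests are being skipped ahead of removal), the stable way to get the same answer is through ``Index.duplicated``, which is what the deprecation message in pandas points users toward:

```python
import pandas as pd

idx = pd.Index([1, 2, 2, 3, 3, 3, 4])

# Equivalent of the deprecated idx.get_duplicates(): keep one copy of
# each value that appears more than once.
dupes = idx[idx.duplicated()].unique()
print(list(dupes))  # [2, 3]
```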

pandas/core/indexes/period.py

Lines changed: 4 additions & 4 deletions

@@ -22,11 +22,11 @@
 )
 from pandas.core.tools.datetimes import parse_time_string, DateParseError

-from pandas._libs import tslib, index as libindex
+from pandas._libs import index as libindex
 from pandas._libs.tslibs.period import (Period, IncompatibleFrequency,
                                         DIFFERENT_FREQ_INDEX)

-from pandas._libs.tslibs import resolution
+from pandas._libs.tslibs import resolution, NaT, iNaT

 from pandas.core.algorithms import unique1d
 import pandas.core.arrays.datetimelike as dtl
@@ -336,7 +336,7 @@ def _box_func(self):
         # places outside of indexes/period.py are calling this _box_func,
         # but passing data that's already boxed.
         def func(x):
-            if isinstance(x, Period) or x is tslib.NaT:
+            if isinstance(x, Period) or x is NaT:
                 return x
             else:
                 return Period._from_ordinal(ordinal=x, freq=self.freq)
@@ -726,7 +726,7 @@ def get_loc(self, key, method=None, tolerance=None):
             raise KeyError(key)

         try:
-            ordinal = iNaT if key is NaT else key.ordinal
+            ordinal = iNaT if key is NaT else key.ordinal
             if tolerance is not None:
                 tolerance = self._convert_tolerance(tolerance,
                                                     np.asarray(key))
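The identity comparisons in the hunks above (``x is NaT``) are safe because ``NaT`` is a singleton; every reference to it is the same object, so ``is`` checks are equivalent to, and cheaper than, equality-based missing-value tests:

```python
import pandas as pd

# NaT is a singleton: the module-level object and any other reference
# to it are the same instance.
x = pd.NaT
print(x is pd.NaT)  # True

# Note that equality does NOT work for this check, since NaT != NaT.
print(x == pd.NaT)  # False
```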

pandas/core/internals/blocks.py

Lines changed: 1 addition & 1 deletion

@@ -35,9 +35,9 @@
     is_numeric_v_string_like, is_extension_type,
     is_extension_array_dtype,
     is_list_like,
-    is_sparse,
     is_re,
     is_re_compilable,
+    is_sparse,
     pandas_dtype)
 from pandas.core.dtypes.cast import (
     maybe_downcast_to_dtype,
