Date parsing is slow for dates of the form "%m.%d.%Y" compared to "%Y-%m-%d". I am on 0.17.1, so this may have improved in 0.18, but I didn't see anything about it in the release notes. The benchmark below compares the two datetime formats.
Some observations:
- Letting pandas figure out the format is unacceptably slow. On small files it won't make much of a difference, but it became a massive bottleneck in an application.
- In my particular use case I know every timestamp in the file has the same date, so a plain string replace is easy. That turns out to be by far the fastest option here, even including the time spent on the replacement itself.
- Regex is slow. That was probably obvious, but I tried it for comparison.
- `infer_datetime_format` is the next best option, but it still appears to be a massive slowdown: compare the relative times on the large file to the relative times of the pure read on the small file versus date parsing with `infer_datetime_format`.
- A custom date parser doesn't help. I'm not sure why; maybe it's not compiled?
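One option missing from the observations above (not benchmarked in this issue, so timings are untested): skip `parse_dates` entirely, read the column as plain strings, and convert it afterwards with `pd.to_datetime` and an explicit `format`. The in-memory 5000-row fixture below is illustrative, standing in for `test2.csv`:

```python
import pandas as pd
from io import StringIO  # cStringIO.StringIO on Python 2

# Stand-in for test2.csv: 5000 identical "%m.%d.%Y" rows.
csv = StringIO("03.12.2016, 1\n" * 5000)

# Read without date parsing, then convert in one vectorized call.
# An explicit format avoids per-row dateutil inference.
df = pd.read_csv(csv, header=None)
df[0] = pd.to_datetime(df[0], format="%m.%d.%Y")
```

Whether this beats `infer_datetime_format` on a given pandas version would need to be timed, but it at least sidesteps the fallback parser.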
Code Sample
test.csv (1 million lines):

```
2016-03-12, 1
2016-03-12, 1
2016-03-12, 1
2016-03-12, 1
...
```
test2.csv (5000 lines):

```
03.12.2016, 1
03.12.2016, 1
03.12.2016, 1
03.12.2016, 1
...
```
```python
import pandas as pd
from cStringIO import StringIO  # Python 2; use io.StringIO on Python 3
import datetime
import re

print "Large file: pure read"
%timeit pd.read_csv('test.csv')

print "Large file: parse dates"
%timeit pd.read_csv('test.csv', parse_dates=[0])

print "Small file: pure read"
%timeit pd.read_csv('test2.csv')

print "Small file: parse dates"
%timeit pd.read_csv('test2.csv', parse_dates=[0])

print "Small file: parse dates with infer datetime format"
%timeit pd.read_csv('test2.csv', parse_dates=[0], infer_datetime_format=True)

print "Small file: string replace then parse dates"
def test_replace():
    s = open('test2.csv').read()
    s = s.replace('03.12.2016', '2016-03-12')
    pd.read_csv(StringIO(s), parse_dates=[0])
%timeit test_replace()

print "Small file: parse dates with custom date parser"
date_parser = lambda t: datetime.datetime.strptime(t, "%m.%d.%Y")
%timeit pd.read_csv('test2.csv', parse_dates=[0], date_parser=date_parser)

print "Small file: regex sub then parse dates"
def test_re_sub():
    s = open('test2.csv').read()
    # Dots escaped and stray spaces removed from the groups,
    # otherwise the pattern never matches.
    p = re.compile(r'(?P<month>\d{2})\.(?P<day>\d{2})\.(?P<year>\d{4})')
    s = p.sub(r'\g<year>-\g<month>-\g<day>', s)
    pd.read_csv(StringIO(s), parse_dates=[0])
%timeit test_re_sub()
```
Output (2012 MacBook Air):

```
Large file: pure read
1 loop, best of 3: 253 ms per loop
Large file: parse dates
1 loop, best of 3: 790 ms per loop
Small file: pure read
100 loops, best of 3: 2.9 ms per loop
Small file: parse dates
1 loop, best of 3: 1.22 s per loop
Small file: parse dates with infer datetime format
10 loops, best of 3: 32.7 ms per loop
Small file: string replace then parse dates
100 loops, best of 3: 4.97 ms per loop
Small file: parse dates with custom date parser
10 loops, best of 3: 90.6 ms per loop
Small file: regex sub then parse dates
1 loop, best of 3: 1.08 s per loop
```
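Since the benchmark file repeats a single date, a related trick (not timed in this issue) is to parse each distinct string once and broadcast the result with `Series.map`. With one unique date that is a single `strptime`-style parse instead of 5000. A sketch, again using an in-memory stand-in for `test2.csv`:

```python
import pandas as pd
from io import StringIO  # cStringIO.StringIO on Python 2

# Stand-in for test2.csv: 5000 identical "%m.%d.%Y" rows.
csv = StringIO("03.12.2016, 1\n" * 5000)
df = pd.read_csv(csv, header=None)

# Parse each distinct date string once, then map the cached
# Timestamps back onto the column.
lookup = {v: pd.to_datetime(v, format="%m.%d.%Y") for v in df[0].unique()}
df[0] = df[0].map(lookup)
```

The payoff scales with repetition: for a column with few unique values this should behave much like the string-replace approach, without hard-coding the date.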
Output of `pd.show_versions()`:

```
>>> pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.17.1
nose: 1.3.7
pip: 8.1.0
setuptools: 20.2.2
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.0
statsmodels: None
IPython: 4.1.2
sphinx: 1.3.6
patsy: None
dateutil: 2.5.1
pytz: 2016.1
blosc: None
bottleneck: None
tables: None
numexpr: 2.4.6
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
Jinja2: None
```