Skip to content

PERF: parse non-ambiguous date formats in c #12667

Closed
@apapanico

Description

@apapanico

Date parsing is slow on dates of the form "%m.%d.%Y" compared to "%Y-%m-%d". I am using 0.17.1 so maybe it's improved in 0.18 but I didn't see anything about it. I provide a benchmark below to compare the two datetime formats.

Some observations:

  • Letting pandas figure it out is unacceptably slow. In small files it won't make much of a difference but it became a massive bottleneck in an application.
  • In my particular use case, I know each timestamp in my file will have the same date so it is very easy to do a string replace. And this turns out to be by far the fastest option here. Even with the actual string replacement time.
  • Regex is slow. I'm sure that would have been obvious but I tried it out for comparison.
  • infer_datetime_format is the next best option but it appears to be a massive slowdown. Compare the relative times on the large file to the relative times of the pure read on the small file and date parsing with infer_datetime_format.
  • A custom date parser doesn't help. Not sure why it doesn't help. Maybe it's not compiled?

Code Sample

test.csv (1 million lines)

2016-03-12, 1
2016-03-12, 1
2016-03-12, 1
2016-03-12, 1
...

test2.csv (5000 lines)

03.12.2016, 1
03.12.2016, 1
03.12.2016, 1
03.12.2016, 1
...
import pandas as pd
from cStringIO import StringIO
import datetime
import re
print "Large file: pure read"
%timeit pd.read_csv('test.csv')

print "Large file: parse dates"
%timeit pd.read_csv('test.csv', parse_dates=[0])

print "Small file: pure read"
%timeit pd.read_csv('test2.csv')

print "Small file: parse dates"
%timeit pd.read_csv('test2.csv', parse_dates=[0])

print "Small file: parse dates with infer datetime format"
%timeit pd.read_csv('test2.csv', parse_dates=[0], infer_datetime_format=True)

print "Small file: string replace then parse dates "
def test_replace():
    s = file('test2.csv', 'r').read()
    s = s.replace('03.12.2016', '2016-03-12')
    pd.read_csv(StringIO(s), parse_dates=[0])
%timeit test_replace()

print "Small file: parse dates with custom date parser"
date_parser=lambda t: datetime.datetime.strptime(t, "%m.%d.%Y")
%timeit pd.read_csv('test2.csv', parse_dates=[0], date_parser=date_parser)

print "Small file: regex sub then parse dates"
def test_re_sub():
    s = file('test2.csv', 'r').read()
    p = re.compile('(?P<month> \d{2}).(?P<day> \d{2}).(?P<year> \d{4})')
    s = p.sub('\g<year>-\g<month>-\g<day>', s)
    pd.read_csv(StringIO(s), parse_dates=[0])
%timeit test_re_sub()

Output (2012 MacBook Air)

Large file: pure read
1 loop, best of 3: 253 ms per loop
Large file: parse dates
1 loop, best of 3: 790 ms per loop

Small file: pure read
100 loops, best of 3: 2.9 ms per loop
Small file: parse dates
1 loop, best of 3: 1.22 s per loop
Small file: parse dates with infer datetime format
10 loops, best of 3: 32.7 ms per loop
Small file: string replace then parse dates 
100 loops, best of 3: 4.97 ms per loop
Small file: parse dates with custom date parser
10 loops, best of 3: 90.6 ms per loop
Small file: regex sub then parse dates
1 loop, best of 3: 1.08 s per loop

output of pd.show_versions()

>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.17.1
nose: 1.3.7
pip: 8.1.0
setuptools: 20.2.2
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.0
statsmodels: None
IPython: 4.1.2
sphinx: 1.3.6
patsy: None
dateutil: 2.5.1
pytz: 2016.1
blosc: None
bottleneck: None
tables: None
numexpr: 2.4.6
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
Jinja2: None

Metadata

Metadata

Assignees

No one assigned

    Labels

    DatetimeDatetime data dtypePerformanceMemory or execution speed performance

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions