Description
Code Sample, a copy-pastable example if possible
from __future__ import print_function
import timeit

import numpy as np
import pandas as pd

numiter = 100
for n in [10, 1000, 1000000, 10000000]:
    domain = np.arange(0, n)
    values = domain + 10  # avoid shadowing the builtin `range`
    maptable = pd.Series(values, index=domain).sort_index()
    query_vals = pd.Series([1, 2, 3])

    def f():
        query_vals.map(maptable)

    print(n, timeit.timeit(stmt=f, number=numiter) / numiter)
which prints, for each n, the average seconds per .map() call:

10        0.000630810260773
1000      0.000978469848633
1000000   0.00130645036697
10000000  0.0162791204453
Problem description
The above maps 3 values through a lookup table, much like looking up values in a Python dictionary.
n is the size of the table (not the query).
At n=10000000 each call takes ~0.016 s, i.e. roughly 0.016/3 ≈ 0.005 seconds per mapped value -- shockingly slow.
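For comparison, a minimal baseline sketch of the hashed-lookup cost model (reusing domain, values, query_vals and numiter from the snippet above; lookup and g are just illustrative names): mapping through a plain Python dict with an elementwise callable does one O(1) lookup per query, so its timing should be flat in n.

# Baseline sketch, not a pandas fix: each dict lookup is O(1),
# so the whole map is O(len(query_vals)) regardless of len(maptable).
lookup = dict(zip(domain, values))

def g():
    query_vals.map(lookup.get)  # callable arg => applied elementwise

print(n, timeit.timeit(stmt=g, number=numiter) / numiter)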
Expected Output
Observed: the cost grows as O(len(maptable)).
My series' index is sorted, so my expectation is that pandas' cost is well below O(len(maptable)) for this type of operation.
Costs should scale at or below:
- O(len(query_vals) * log(len(maptable))), for a binary search on the index (see the sketch below), or
- O(len(query_vals)), for a hashed index.
In other words, there shouldn't be a huge difference between the timings at n=10 and n=10000000
under any straightforward implementation.
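For reference, a sketch of the binary-search variant using numpy.searchsorted on the already-sorted index (pos and result are illustrative names; this assumes every query value is actually present in maptable.index, otherwise the positions point at insertion slots rather than matches):

# Binary search on the sorted index:
# O(len(query_vals) * log(len(maptable))) total.
pos = np.searchsorted(maptable.index.values, query_vals.values)
result = pd.Series(maptable.values[pos], index=query_vals.index)

Either strategy would keep the n=10 and n=10000000 timings in the same ballpark.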
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-24-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None
pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 36.0.1
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 5.3.0
sphinx: None
patsy: 0.2.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: 3.1.1
numexpr: 2.6.2
feather: None
matplotlib: 2.1.0
openpyxl: 1.7.0
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: 1.2.7
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None