FastPandas

Lazy evaluation with DuckDB for lightning-fast pandas functions. Scales to hundreds of millions of rows in fractions of a second!

Includes tons of DuckDB numeric, aggregate, and comparison functions.

To find the exact functions, go to the compiled code.

What is the library? And why lazy?

DuckDB runs SQL and is extremely fast. Like really really really fast.

This library makes it easy to run DuckDB on your pandas dataframes.

It works by selecting a column in your dataframe df

import pandas as pd
from fastpandas import FastPandas

df = pd.DataFrame() # with your own data
FastPandas(df)["column_in_df"]

and applying numeric functions and aggregate functions.

For example, taking the natural log, squaring, then averaging.

df = pd.DataFrame() # with your own data
output = FastPandas(df)["column_in_df"].ln().pow(2).avg()

But nothing is computed yet! Only when you need the value is the entire chain fused together and executed all at once

df = pd.DataFrame() # with your own data
output = FastPandas(df)["column_in_df"].ln().pow(2).avg() # not executed yet

print(output.item()) # NOW gets executed with .item()

There is no need to compute them separately, then combine. Just run 'em all at once! We love being lazy!
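To see why laziness pays off, here is a minimal, hypothetical sketch of the pattern (not FastPandas's actual internals): each chained call only records an operation as a SQL fragment, so the whole chain compiles down to one query that DuckDB can run in a single pass.

```python
# Hypothetical sketch of lazy chaining: nothing is computed per call,
# each method just wraps the SQL expression built so far.
class LazyColumn:
    def __init__(self, expr):
        self.expr = expr  # SQL fragment accumulated so far

    def ln(self):
        return LazyColumn(f"ln({self.expr})")

    def pow(self, n):
        return LazyColumn(f"pow({self.expr}, {n})")

    def avg(self):
        return LazyColumn(f"avg({self.expr})")

    def sql(self, table="df"):
        # only here does the chain become one fused query
        return f"SELECT {self.expr} FROM {table}"

query = LazyColumn('"column_in_df"').ln().pow(2).avg().sql()
# query == 'SELECT avg(pow(ln("column_in_df"), 2)) FROM df'
```

Because the fused query is built before anything runs, DuckDB sees the whole computation at once instead of materializing an intermediate column per step.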

Usage

Check out example.ipynb for real examples you can run through or continue.

Open In Colab

Installation

git clone https://github.com/xnought/fastpandas.git
cd fastpandas
python3 -m pip install -e .

Lazy Evaluation

FastPandas is lazy, meaning you chain your desired operations and the value is only computed when you actually request the data (with .item() or .df()).

For example, to average over the dataframe's "a" column you could do

import pandas as pd
from fastpandas import FastPandas

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})

output = FastPandas(df)["a"].avg()
print(output.item())
#  5.5

If you would rather output the entire dataframe (i.e. you never applied an aggregate function), use .df() instead of .item().
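For comparison, the same average computed eagerly in plain pandas gives the identical result:

```python
import pandas as pd

df = pd.DataFrame({'a': list(range(1, 11))})

# eager pandas equivalent of the lazy .avg().item() chain above
eager_avg = df['a'].mean()
# eager_avg == 5.5
```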

Or if you have more complex operations

Say I wanted to multiply two columns, take the absolute value of the result, add 1, then compute the entropy of that column. I could do

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'b': [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]})

output = FastPandas(df)["a"].mul(FastPandas(df)["b"]).abs().add(1).entropy()
print(output.item())
# 2.321928094887362

This lazily fuses the operations into a single SQL query, which is insanely fast on DuckDB.
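As a sanity check, DuckDB's entropy() is the base-2 Shannon entropy of the column's value distribution, which you can reproduce eagerly in plain pandas:

```python
import math
import pandas as pd

df = pd.DataFrame({'a': list(range(1, 11)), 'b': list(range(10, 0, -1))})

# eager equivalent of mul -> abs -> add(1)
col = (df['a'] * df['b']).abs() + 1

# base-2 Shannon entropy over the distribution of values in the column
probs = col.value_counts(normalize=True)
h = -sum(p * math.log2(p) for p in probs)
# 5 distinct values, each appearing twice, so h == log2(5) ≈ 2.3219
```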

Or if you want to approximately count the number of unique values in a column

Use DuckDB's hyperloglog implementation!!!

First create a df; this one has 100 million rows!

import random
import pandas as pd

large_df = pd.DataFrame({"a": random.choices(range(35_000_000), k=100_000_000)})

Then easily use a DuckDB Aggregate function like approx_count_distinct to count the number of unique elements in the column (roughly).

unique_elements = FastPandas(large_df)["a"].approx_count_distinct()
print(unique_elements.item()) # uses hyperloglog under the hood and took 0.1 seconds
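HyperLogLog trades a small relative error for constant memory. On a smaller frame (toy data here, not the 100M-row example) you can compute the exact distinct count in pandas to compare against the approximation:

```python
import random
import pandas as pd

random.seed(0)  # deterministic toy data
small_df = pd.DataFrame({"a": random.choices(range(1000), k=10_000)})

# exact distinct count, for comparison with approx_count_distinct
exact = small_df["a"].nunique()
# with 10,000 draws from 1,000 values, nearly all values appear at least once
```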

Filtering

If you want to also filter down, you can do that. Note that the last .filter() will be the only one applied.

For example, if I want to sum all the values in column "a" that are greater than 1, I could do

FastPandas(large_df)["a"].filter(FastPandas(large_df)["a"].gt(1)).sum().item()
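The eager pandas equivalent of that filter-then-sum, on a tiny illustrative frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 1, 2, 3]})

# eager equivalent of .filter(...gt(1)).sum()
total = df.loc[df["a"] > 1, "a"].sum()
# total == 5
```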

If I wanted all the values less than 1,000,000 and greater than 1 summed up, I could do

_filter = FastPandas(large_df)["a"].lt(1_000_000)._and(FastPandas(large_df)["a"].gt(1))
FastPandas(large_df)["a"].filter(_filter).sum().item()

Or, going back to counting unique elements: if I wanted to count the number of unique elements between 0 and 10,

between_0_and_10 = FastPandas(large_df)["a"].gte(0)._and(FastPandas(large_df)["a"].lte(10))
FastPandas(large_df)["a"].filter(between_0_and_10).approx_count_distinct().item()
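The same combined condition expressed eagerly as a pandas boolean mask, on toy data:

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 5, 5, 12, 3]})

# eager equivalent of gte(0)._and(lte(10)) as a boolean mask
mask = (df["a"] >= 0) & (df["a"] <= 10)
n_unique = df.loc[mask, "a"].nunique()
# n_unique == 3 (the values 0, 5, and 3)
```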

Type Casting

You can easily call type cast functions like .int() or .float() on a column for type conversion.

For example, the .factorial() function needs an integer in each row, so you can do

df = pd.DataFrame({'a': [1, 2, 3]})
_factorial = FastPandas(df)["a"].int().factorial() # notice the .int()
print(_factorial.df())
"""
 1
 2
 6
"""

About

Use DuckDB aggregate, numeric, and comparison functions natively on your pandas dataframes.
