FastPandas

Lazy evaluation with DuckDB for lightning-fast pandas functions. Scales to hundreds of millions of rows in fractions of a second!

Includes tons of DuckDB numeric, aggregate, and comparison functions.

To find the exact functions, go to the compiled code.

What is the library? And why lazy?

DuckDB runs SQL and is extremely fast. Like really really really fast.

This library makes it easy to run DuckDB on your pandas dataframes.

It works by selecting a column in your dataframe df

import pandas as pd
from fastpandas import FastPandas

df = pd.DataFrame() # with your own data
FastPandas(df)["column_in_df"]

and applying numeric functions and aggregate functions.

For example, taking the natural log, squaring, then averaging.

df = pd.DataFrame() # with your own data
output = FastPandas(df)["column_in_df"].ln().pow(2).avg()

But nothing is computed yet! Only when you need the value is the entire chain fused together and executed all at once

df = pd.DataFrame() # with your own data
output = FastPandas(df)["column_in_df"].ln().pow(2).avg() # not executed yet

print(output.item()) # NOW gets executed with .item()

There is no need to compute them separately, then combine. Just run 'em all at once! We love being lazy!
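To see why laziness pays off, here is a minimal, hypothetical sketch of the pattern (not FastPandas's actual internals): each chained call only records an operation as a SQL fragment, so the whole chain compiles down to one query that DuckDB can run in a single pass.

```python
# Hypothetical sketch of lazy chaining: nothing is computed per call,
# each method just wraps the SQL expression built so far.
class LazyColumn:
    def __init__(self, expr):
        self.expr = expr  # SQL fragment accumulated so far

    def ln(self):
        return LazyColumn(f"ln({self.expr})")

    def pow(self, n):
        return LazyColumn(f"pow({self.expr}, {n})")

    def avg(self):
        return LazyColumn(f"avg({self.expr})")

    def sql(self, table="df"):
        # only here does the chain become one fused query
        return f"SELECT {self.expr} FROM {table}"

query = LazyColumn('"column_in_df"').ln().pow(2).avg().sql()
# query == 'SELECT avg(pow(ln("column_in_df"), 2)) FROM df'
```

Because the fused query is built before anything runs, DuckDB sees the whole computation at once instead of materializing an intermediate column per step.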

Usage

Check out example.ipynb for real examples you can run through or continue.

Open In Colab

Installation

git clone https://github.com/xnought/fastpandas.git
cd fastpandas
python3 -m pip install -e .

Lazy Evaluation

FastPandas is lazy, meaning you chain your desired operations and the value is only computed when you actually request the data (with .item() or .df()).

For example, to average over the dataframe's "a" column you could do

import pandas as pd
from fastpandas import FastPandas

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})

output = FastPandas(df)["a"].avg()
print(output.item())
#  5.5

If you would rather output the entire dataframe (i.e. you never applied an aggregate function), use .df() instead of .item().
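For comparison, the same average computed eagerly in plain pandas gives the identical result:

```python
import pandas as pd

df = pd.DataFrame({'a': list(range(1, 11))})

# eager pandas equivalent of the lazy .avg().item() chain above
eager_avg = df['a'].mean()
# eager_avg == 5.5
```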

Or if you have more complex operations

Say I wanted to multiply two columns, take the absolute value of the result, add 1, then compute the entropy of that column. I could do

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'b': [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]})

output = FastPandas(df)["a"].mul(FastPandas(df)["b"]).abs().add(1).entropy()
print(output.item())
# 2.321928094887362

This lazily fuses the operations into a single SQL query, which is insanely fast on DuckDB.
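As a sanity check, DuckDB's entropy() is the base-2 Shannon entropy of the column's value distribution, which you can reproduce eagerly in plain pandas:

```python
import math
import pandas as pd

df = pd.DataFrame({'a': list(range(1, 11)), 'b': list(range(10, 0, -1))})

# eager equivalent of mul -> abs -> add(1)
col = (df['a'] * df['b']).abs() + 1

# base-2 Shannon entropy over the distribution of values in the column
probs = col.value_counts(normalize=True)
h = -sum(p * math.log2(p) for p in probs)
# 5 distinct values, each appearing twice, so h == log2(5) ≈ 2.3219
```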

Or if you want to approximately count the number of unique values in a column

Use DuckDB's hyperloglog implementation!!!

First create a df; this one has 100 million rows!

import random
import pandas as pd

large_df = pd.DataFrame({"a": random.choices(range(35_000_000), k=100_000_000)})

Then easily use a DuckDB Aggregate function like approx_count_distinct to count the number of unique elements in the column (roughly).

unique_elements = FastPandas(large_df)["a"].approx_count_distinct()
print(unique_elements.item()) # uses hyperloglog under the hood and took 0.1 seconds
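HyperLogLog trades a small relative error for constant memory. On a smaller frame (toy data here, not the 100M-row example) you can compute the exact distinct count in pandas to compare against the approximation:

```python
import random
import pandas as pd

random.seed(0)  # deterministic toy data
small_df = pd.DataFrame({"a": random.choices(range(1000), k=10_000)})

# exact distinct count, for comparison with approx_count_distinct
exact = small_df["a"].nunique()
# with 10,000 draws from 1,000 values, nearly all values appear at least once
```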

Filtering

If you want to also filter down, you can do that. Note that the last .filter() will be the only one applied.

For example, if I want to sum all the values in column "a" that are greater than 1, I could do

FastPandas(large_df)["a"].filter(FastPandas(large_df)["a"].gt(1)).sum().item()
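The eager pandas equivalent of that filter-then-sum, on a tiny illustrative frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 1, 2, 3]})

# eager equivalent of .filter(...gt(1)).sum()
total = df.loc[df["a"] > 1, "a"].sum()
# total == 5
```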

If I wanted all the values less than 1,000,000 and greater than 1 summed up, I could do

_filter = FastPandas(large_df)["a"].lt(1_000_000)._and(FastPandas(large_df)["a"].gt(1))
FastPandas(large_df)["a"].filter(_filter).sum().item()

Or, going back to counting unique elements: if I wanted to count the number of unique elements between 0 and 10,

between_0_and_10 = FastPandas(large_df)["a"].gte(0)._and(FastPandas(large_df)["a"].lte(10))
FastPandas(large_df)["a"].filter(between_0_and_10).approx_count_distinct().item()
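The same combined condition expressed eagerly as a pandas boolean mask, on toy data:

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 5, 5, 12, 3]})

# eager equivalent of gte(0)._and(lte(10)) as a boolean mask
mask = (df["a"] >= 0) & (df["a"] <= 10)
n_unique = df.loc[mask, "a"].nunique()
# n_unique == 3 (the values 0, 5, and 3)
```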

Type Casting

You can easily call type cast functions like .int() or .float() on a column for type conversion.

For example, the .factorial() function needs an integer in each row, so you can do

df = pd.DataFrame({'a': [1, 2, 3]})
_factorial = FastPandas(df)["a"].int().factorial() # notice the .int()
print(_factorial.df())
"""
 1
 2
 6
"""

About

Use DuckDB aggregate, numeric, and comparison functions natively on your pandas dataframes.
