Skip to content

Latest commit

 

History

History
672 lines (464 loc) · 16.5 KB

exploring_explorer.livemd

File metadata and controls

672 lines (464 loc) · 16.5 KB

Ten Minutes to Explorer

Mix.install([
  {:explorer, "~> 0.5.0"},
  {:kino, "~> 0.8.0"}
])

Introduction

Explorer is a dataframe library for Elixir. A dataframe is a common data structure used in data analysis. It is a two-dimensional table composed of columns and rows similar to a SQL table or a spreadsheet.

Explorer's aim is to provide a simple and powerful API for manipulating dataframes. It takes influences mainly from the tidyverse, but if you've used other dataframe libraries like pandas you shouldn't have too much trouble working with Explorer.

This document is meant to give you a crash course in using Explorer. More in-depth documentation can be found in the relevant sections of the docs.

We strongly recommend you run this livebook locally so you can see the outputs and play with the inputs!

Reading and writing data

Data can be read from delimited files (like CSV), NDJSON, Parquet, and the Arrow IPC (feather) format. You can also load in data from a map or keyword list of columns with Explorer.DataFrame.new/1.

For CSV, your 'usual suspects' of options are available:

  • delimiter - A single character used to separate fields within a record. (default: ",")
  • dtypes - A keyword list of [column_name: dtype]. If a type is not specified for a column, it is imputed from the first 1000 rows. (default: [])
  • header - Does the file have a header of column names as the first row or not? (default: true)
  • max_rows - Maximum number of lines to read. (default: nil)
  • null_character - The string that should be interpreted as a nil value. (default: "NA")
  • skip_rows - The number of lines to skip at the beginning of the file. (default: 0)
  • columns - A list of column names to keep. If present, only these columns are read into the dataframe. (default: nil)

Explorer also has multiple example datasets built in, which you can load from the Explorer.Datasets module like so:

df = Explorer.Datasets.fossil_fuels()

You'll notice that the output looks slightly different than many dataframe libraries. Explorer takes inspiration on this front from glimpse in R. A benefit to this approach is that you will rarely need to elide columns.

If you'd like to see a table with your data, we've got you covered there too.

Explorer.DataFrame.table(df)

Writing files is very similar to reading them. The options are a little more limited:

  • header - Should the column names be written as the first line of the file? (default: true)
  • delimiter - A single character used to separate fields within a record. (default: ",")

First, let's add some useful aliases:

alias Explorer.DataFrame
alias Explorer.Series

And then write to a file of your choosing:

input = Kino.Input.text("Filename")
filename = Kino.Input.read(input)
DataFrame.to_csv(df, filename)

Working with Series

Explorer, like Polars, works up from the concept of a Series. In many ways, you can think of a dataframe as a row-aligned map of Series. These are like vectors in R or series in Pandas.

For simplicity, Explorer uses the following Series dtypes:

  • :float - 64-bit floating point number
  • :integer - 64-bit signed integer
  • :boolean - Boolean
  • :string - UTF-8 encoded binary
  • :date - Date type that unwraps to Elixir.Date
  • :datetime - DateTime type that unwraps to Elixir.NaiveDateTime

Series can be constructed from Elixir basic types. For example:

s1 = Series.from_list([1, 2, 3])
s2 = Series.from_list(["a", "b", "c"])
s3 = Series.from_list([~D[2011-01-01], ~D[1965-01-21]])

You'll notice that the dtype and size of the Series are at the top of the printed value. You can get those programmatically as well.

Series.dtype(s3)
Series.size(s3)

And the printed values max out at 50:

1..100 |> Enum.to_list() |> Series.from_list()

Series are also nullable.

s = Series.from_list([1.0, 2.0, nil, nil, 5.0])

And you can fill in those missing values using one of the following strategies:

  • :forward - replace nil with the previous value
  • :backward - replace nil with the next value
  • :max - replace nil with the series maximum
  • :min - replace nil with the series minimum
  • :mean - replace nil with the series mean
Series.fill_missing(s, :forward)

In the case of mixed numeric types (i.e. integers and floats), Series will downcast to a float:

Series.from_list([1, 2.0])

In all other cases, Series must all be of the same dtype or else you'll get an ArgumentError.

Series.from_list([1, 2, 3, "a"])

One of the goals of Explorer is useful error messages. If you look at the error above, you get:

Cannot make a series from mismatched types. Type of "a" does not match inferred dtype integer.

Hopefully this makes abundantly clear what's going on.

Series also implements the Access protocol. You can slice and dice in many ways:

s = 1..10 |> Enum.to_list() |> Series.from_list()
s[1]
s[-1]
s[0..4]
s[[0, 4, 4]]

And of course, you can convert back to an Elixir list.

Series.to_list(s)

Explorer supports comparisons.

s = 1..11 |> Enum.to_list() |> Series.from_list()
s1 = 11..1 |> Enum.to_list() |> Series.from_list()
Series.equal(s, s1)
Series.equal(s, 5)
Series.not_equal(s, 10)
Series.greater_equal(s, 4)

And arithmetic.

Series.add(s, s1)
Series.subtract(s, 4)
Series.multiply(s, s1)

Remember those helpful errors? We've tried to add those throughout. So if you try to do arithmetic with mismatching dtypes:

s = Series.from_list([1, 2, 3])
s1 = Series.from_list([1.0, 2.0, 3.0])
Series.add(s, s1)

Just kidding! Integers and floats will downcast to floats. Let's try again:

s = Series.from_list([1, 2, 3])
s1 = Series.from_list(["a", "b", "c"])
Series.add(s, s1)

You can flip them around.

s = Series.from_list([1, 2, 3, 4])
Series.reverse(s)

And sort.

1..100 |> Enum.to_list() |> Enum.shuffle() |> Series.from_list() |> Series.sort()

Or argsort.

s = 1..100 |> Enum.to_list() |> Enum.shuffle() |> Series.from_list()
ids = Series.argsort(s) |> Series.to_list()

Which you can pass to Explorer.Series.slice/2 if you want the sorted values.

Series.slice(s, ids)

You can calculate cumulative values.

s = 1..100 |> Enum.to_list() |> Series.from_list()
Series.cumulative_sum(s)

Or rolling ones.

Series.window_sum(s, 4)

You can count and list unique values.

s = Series.from_list(["a", "b", "b", "c", "c", "c"])
Series.distinct(s)
Series.n_distinct(s)

And you can even get a dataframe showing the frequencies for each distinct value.

Series.frequencies(s)

Working with DataFrames

A DataFrame is really just a collection of Series of the same size. Which is why you can create a DataFrame from a Keyword list.

DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])

Similarly to Series, the Inspect implementation prints some info at the top and to the left. At the top we see the shape of the dataframe (rows and columns) and then for each column we see the name, dtype, and first five values. We can see a bit more from that built-in dataset we loaded in earlier.

df

You will also see grouping information there, but we'll get to that later. You can get the info yourself directly:

DataFrame.names(df)
DataFrame.dtypes(df)
DataFrame.shape(df)
{DataFrame.n_rows(df), DataFrame.n_columns(df)}

We can grab the head.

DataFrame.head(df)

Or the tail. Let's get a few more values from the tail.

DataFrame.tail(df, 10)

Verbs and macros

In Explorer, like in dplyr, we have five main verbs to work with dataframes:

  • select
  • filter
  • mutate
  • arrange
  • summarise

We are going to explore then in this notebook, but first we need to "require" the Explorer.DataFrame module in order to load the macros needed for these verbs.

I want to take the opportunity to create a shorter alias for the DataFrame module, called DF:

require DataFrame, as: DF

From now on we are using the shorter version, DF, to refer to the required Explorer.DataFrame module.

Select

Let's jump right into it. We can select columns pretty simply.

DF.select(df, ["year", "country"])

But Elixir gives us some superpowers. In R there's tidy-select. I don't think we need that in Elixir. Anywhere in Explorer where you need to pass a list of column names, you can also execute a filtering callback on the column names. It's just an anonymous function passed to df |> DataFrame.names() |> Enum.filter(callback_here).

DF.select(df, &String.ends_with?(&1, "fuel"))

Want all but some columns? discard/2 performs the opposite of select/2.

DF.discard(df, &String.ends_with?(&1, "fuel"))

Filter

The next verb we'll look at is filter.

This is implemented using a macro, so it's possible to use expressions like you would if comparing variables in Elixir:

DF.filter(df, country == "BRAZIL")

Using complex filters is also possible:

DF.filter(df, country == "ALGERIA" and year > 2012)

You can also write the same filter without the macro, by using the callback version function which is filter_with/2:

DF.filter_with(df, fn ldf ->
  ldf["country"]
  |> Series.equal("ALGERIA")
  |> Series.and(Series.greater(ldf["year"], 2012))
end)

By the way, all the Explorer.DataFrame macros have a correspondent function that accepts a callback. In fact, our macros are implemented using those functions.

The filter_with/2 function is going to use a virtual representation of the dataframe that we call a "lazy frame". With lazy frames you can´t access the series contents, but every operation will be optimized and run only once.

Remember those helpful error messages?

DF.filter(df, cuontry == "BRAZIL")

Mutate

A common task in data analysis is to add columns or change existing ones. Mutate is a handy verb.

DF.mutate(df, new_column: solid_fuel + cement)

Did you catch that? You can pass in new columns as keyword arguments. It also works to transform existing columns.

DF.mutate(df,
  gas_fuel: Series.cast(gas_fuel, :float),
  gas_and_liquid_fuel: gas_fuel + liquid_fuel
)

DataFrame.mutate/2 is flexible though. You may not always want to use keyword arguments. Given that column names are String.t(), it may make more sense to use a map.

DF.mutate(df, %{"gas_fuel" => gas_fuel - 10})

DF.transmute/2, which is DF.mutate/2 that only retains the specified columns, is forthcoming.

Arrange

Sorting the dataframe is pretty straightforward.

DF.arrange(df, year)

But it comes with some tricks up its sleeve.

DF.arrange(df, asc: total, desc: year)

As the examples show, arrange/2 is a macro, and therefore you can use some functions to arrange your dataframe:

DF.arrange(df, asc: Series.window_sum(total, 2))

Sort operations happen left to right. And keyword list args permit specifying the direction.

Distinct

Okay, as expected here too. Very straightforward.

DF.distinct(df, ["year", "country"])

You can specify whether to keep the other columns as well, so the first row of each distinct value is kept:

DF.distinct(df, ["country"], keep_all: true)

Rename

Rename can take either a list of new names or a callback that is passed to Enum.map/2 against the names. You can also use a map or keyword args to rename specific columns.

DF.rename(df, year: "year_test")
DF.rename_with(df, &(&1 <> "_test"))

Dummies

This is fun! We can get dummy variables for unique values.

DF.dummies(df, ["year"])
DF.dummies(df, ["country"])

Sampling

Random samples can give us a percent or a specific number of samples, with or without replacement, and the function is seedable.

DF.sample(df, 10)
DF.sample(df, 0.4)

Trying for those helpful error messages again.

DF.sample(df, 10000)
DF.sample(df, 10000, replacement: true)

Pull and slice

Slicing and dicing can be done with the Access protocol or with explicit pull/slice/take functions.

df["year"]
DF.pull(df, "year")
df[["year", "country"]]
DF.slice(df, [1, 20, 50])

Negative offsets work for slice!

DF.slice(df, -10, 5)
DF.slice(df, 10, 5)

Slice also works with ranges:

DF.slice(df, 12..42)

Pivot

We can pivot_longer/3 and pivot_wider/4. These are inspired by tidyr.

There are some shortcomings in pivot_wider/4 related to polars. The values_from column must be a numeric type.

DF.pivot_longer(df, ["year", "country"], value_columns: &String.ends_with?(&1, "fuel"))
DF.pivot_wider(df, "country", "total", id_columns: ["year"])

Let's make those names look nicer!

tidy_names = fn name ->
  name
  |> String.downcase()
  |> String.replace(~r/\s/, " ")
  |> String.replace(~r/[^A-Za-z\s]/, "")
  |> String.replace(" ", "_")
end

df
|> DF.pivot_wider("country", "total", id_columns: ["year"])
|> DF.rename_with(tidy_names)

Joins

Joining is fast and easy. You can specify the columns to join on and how to join. Polars even supports cartesian (cross) joins, so Explorer does too.

df1 = DF.select(df, ["year", "country", "total"])
df2 = DF.select(df, ["year", "country", "cement"])

DF.join(df1, df2)
df3 = df |> DF.select(["year", "cement"]) |> DF.slice(0, 500)

DF.join(df1, df3, how: :left)

Grouping

Explorer supports groupby operations. They're limited based on what's possible in Polars, but they do most of what you need to do.

grouped = DF.group_by(df, ["country"])

Notice that the Inspect call now shows groups as well as rows and columns. You can, of course, get them explicitly.

DF.groups(grouped)

And you can ungroup explicitly.

DF.ungroup(grouped)

But what we care about the most is aggregating! Let's see which country has the max per_capita value.

grouped
|> DF.summarise(max_per_capita: max(per_capita))
|> DF.arrange(desc: max_per_capita)

Qatar it is.

You may noticed that we are using max/1 inside the summarise macro. This is possible because we expose all functions from the Series module. You can use the following aggregations inside summarise:

  • min/1 - Take the minimum value within the group. See Explorer.Series.min/1.
  • max/1 - Take the maximum value within the group. See Explorer.Series.max/1.
  • sum/1 - Take the sum of the series within the group. See Explorer.Series.sum/1.
  • mean/1 - Take the mean of the series within the group. See Explorer.Series.mean/1.
  • median/1 - Take the median of the series within the group. See Explorer.Series.median/1.
  • first/1 - Take the first value within the group. See Explorer.Series.first/1.
  • last/1 - Take the last value within the group. See Explorer.Series.last/1.
  • count/1 - Count the number of rows per group.
  • n_unique/1 - Count the number of unique rows per group.

The API is similar to mutate: you can use keyword args or a map and specify aggregations to use.

DF.summarise(grouped, min_per_capita: min(per_capita), min_total: min(total))

Speaking of mutate, it's 'group-aware'. As are arrange, distinct, and n_rows.

DF.mutate(grouped, total_window_sum: window_sum(total, 3), rows_in_group: count(country))

It's also possible to use aggregations inside other functions:

grouped
|> DF.summarise(greater_than_9: greater(max(per_capita), 9.0), per_capita_max: max(per_capita))
|> DataFrame.arrange(desc: per_capita_max)

That's it!

And not. This is certainly not exhaustive, but I hope it gives you a good idea of what can be done and what the 'flavour' of the API is like. I'd love contributions and issues raised where you find them!