DataFrame feature suggestions #5670

Open
@MikaelUmaN

I’ve had time to check some of the APIs now and this is my feedback. Sorry if I have missed something that already exists. I just tested this yesterday so please forgive any mistakes.

Before saying anything else, I would just like to say that I think this whole initiative is awesome! So happy to see these ideas coming to dotnet. Great job.

Quick background: I work in the finance industry, dealing heavily with time series data of all kinds. Currently we use a mix of .NET, Python, R and MATLAB. My favored .NET language for data analysis is F# (which will be reflected below).

F# Interface

Is there an F# tailored interface?

I didn’t see one, and while it’s possible to use all features, the differences between F# and C# really stand out for some operations. Example (grid approximation of a posterior distribution):

let l2 = ps1 |> Array.map (fun p -> Binomial.PMFLn(p, n2, k2))
let p2 = ps1 |> Array.map (fun p -> ContinuousUniform.PDFLn(0., 1., p))
let l2Col = PrimitiveDataFrameColumn("Likelihood", l2)
let p2Col = PrimitiveDataFrameColumn("Prior", p2)
let bdf2 = DataFrame(l2Col, p2Col)

// The unstandardized likelihood.
bdf2.["UnstdPostLn"] <- bdf2.["Likelihood"] + bdf2.["Prior"]

// What I really want to do is the equivalent of pandas "assign" operation. I want to create a new column based on existing columns
// in a non-trivial way. The only alternative I found was to clone and then apply elementwise.
bdf2.["UnstdPost"] <- bdf2.["UnstdPostLn"].Clone()

// Here, type information is lost so I have to cast. Then I have to work with nullable which is a pain.
// F# has good support for a lot of nullable operators but no support for when you want to apply functions like exp.
(bdf2.["UnstdPost"] :?> PrimitiveDataFrameColumn<float>).ApplyElementwise(fun (x: Nullable<float>) i -> Nullable(exp x.Value))

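// Normalizing constant (the evidence), summed on the linear scale.
let evidence2 = bdf2.["UnstdPost"].Sum() :?> float
// Normalize on the log scale: subtract log(evidence) from the log posterior.
bdf2.["StdPostLn"] <- bdf2.["UnstdPostLn"] - log evidence2

// Final, standardized posterior approximation. Same issues as before: clone, cast, then exponentiate in place.
bdf2.["StdPost"] <- bdf2.["StdPostLn"].Clone()
(bdf2.["StdPost"] :?> PrimitiveDataFrameColumn<float>).ApplyElementwise(fun (x: Nullable<float>) i -> Nullable(exp x.Value))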

I don’t think this code is that nice from an F# perspective. I would hope that some of these quirks can be done away with either by tailoring an interface for F# or by making some other adjustments, discussed below.
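To make the Nullable complaint concrete, here is a minimal sketch of the kind of helper an F#-tailored layer could provide. The name liftNullable is hypothetical and not part of Microsoft.Data.Analysis; it only uses System.Nullable. Unlike x.Value, which throws on a missing entry, it passes missing values through untouched.

```fsharp
open System

// Hypothetical helper (not part of Microsoft.Data.Analysis): lift a plain
// float -> float function so it works on Nullable<float>, passing nulls through.
let liftNullable (f: float -> float) (x: Nullable<float>) : Nullable<float> =
    if x.HasValue then Nullable(f x.Value) else Nullable()

// A small column-like array with one missing entry.
let values : Nullable<float>[] = [| Nullable 0.0; Nullable(); Nullable 1.0 |]

// Apply exp without ever touching .Value on a missing entry.
let result = values |> Array.map (liftNullable exp)
```

Assuming the ApplyElementwise signature used in the snippet above, the same helper could presumably be passed as fun x _ -> liftNullable exp x, which would at least hide the Nullable plumbing at the call site.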

Extend the concept of an index

In other dataframe solutions the concept of an index column takes a central role. Usually the index is an integer or a datetime. This enables easy joins with new time series, among other operations. One example for time series data is resampling: given data on a millisecond basis, I may want to resample it to seconds and perform a custom aggregation in doing so (see pandas resample).
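As an illustration of what resampling means here, the following is a rough sketch in plain F# with no dataframe library involved. The Tick type and the function names are made up for this example; the point is only that a datetime index lets you bucket rows by second and apply any aggregation per bucket.

```fsharp
open System

// Hypothetical tick type for illustration: a timestamp and a value.
type Tick = { Time: DateTime; Value: float }

// Truncate a timestamp to the start of its second.
let floorToSecond (t: DateTime) =
    DateTime(t.Year, t.Month, t.Day, t.Hour, t.Minute, t.Second, t.Kind)

// Resample to one-second buckets, applying a custom aggregation to each bucket.
let resampleToSeconds (aggregate: float list -> float) (ticks: Tick list) =
    ticks
    |> List.groupBy (fun t -> floorToSecond t.Time)
    |> List.map (fun (second, bucket) ->
        second, aggregate (bucket |> List.map (fun t -> t.Value)))
    |> List.sortBy fst

let t0 = DateTime(2020, 1, 1, 9, 0, 0, DateTimeKind.Utc)

let ticks =
    [ { Time = t0.AddMilliseconds 100.0; Value = 1.0 }
      { Time = t0.AddMilliseconds 900.0; Value = 3.0 }
      { Time = t0.AddMilliseconds 1200.0; Value = 10.0 } ]

// Two buckets: the first second averages 1.0 and 3.0, the next holds 10.0.
let perSecondMean = resampleToSeconds List.average ticks
```

With a first-class index, this would ideally be a single call on the dataframe, with the aggregation passed in as a function.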

In the .NET implementation there is an index, but it’s always integer-based and you can’t supply it when creating a series (data frame column). This makes it harder than it has to be to quickly put together a dataframe from disparate data sources. Requiring all columns to have the same length is not good enough for production use and inhibits productivity.
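A sketch of what I mean by combining sources of different lengths, again in plain F#. Representing a series as a Map keyed by date is an assumption made for this example (the Series alias and alignOuter are hypothetical names); it just shows an outer join on the index where gaps become missing values instead of errors.

```fsharp
open System

// Hypothetical representation: a series is a Map from index key to value.
type Series = Map<DateTime, float>

// All index keys of a series as an ordered set.
let keysOf (s: Series) : Set<DateTime> =
    s |> Map.toSeq |> Seq.map fst |> Set.ofSeq

// Outer-join two series on their index; entries absent on one side become None.
let alignOuter (a: Series) (b: Series) : (DateTime * float option * float option) list =
    [ for k in Set.union (keysOf a) (keysOf b) -> k, Map.tryFind k a, Map.tryFind k b ]

let d (day: int) = DateTime(2020, 1, day)

// Two sources with different lengths and only partly overlapping dates.
let prices : Series = Map.ofList [ d 1, 100.0; d 2, 101.0; d 3, 99.5 ]
let volumes : Series = Map.ofList [ d 2, 5.0; d 3, 6.0; d 4, 7.0 ]

// Four rows, one per distinct date, ordered by the index.
let aligned = alignOuter prices volumes
```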

See pandas or deedle:

index
series

Missing values treatment

The dataframe has a DropNulls operation, but there seems to be no equivalent when working on individual columns?

From my previous code example, what I could have been OK with would have been to drop all nulls from the column and then call Apply with my custom function, not having to deal with Nullable. This would have given me a new column where I have my index info (datetime) together with my new values. Then I would assign that to a new column in my dataframe. For the indices where I am missing values, the dataframe would just know that.

Currently that’s not possible (?) and it makes anything non-trivial a hassle.
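The workflow I have in mind looks roughly like the sketch below, written in plain F# with made-up names (dropMissing and reindex are hypothetical, and a column is modeled as a list of index/value pairs). It assumes missing values are represented as None rather than Nullable.

```fsharp
open System

let day (n: int) = DateTime(2020, 1, n)

// Hypothetical column: index keys paired with optional values (None = missing).
let column : (DateTime * float option) list =
    [ day 1, Some 0.0
      day 2, None
      day 3, Some 1.0 ]

// Drop missing rows, keeping the index key beside each surviving value.
let dropMissing (col: (DateTime * float option) list) : (DateTime * float) list =
    col |> List.choose (fun (k, v) -> v |> Option.map (fun x -> k, x))

// Apply a plain float -> float function; no Nullable handling is needed here.
let applied = column |> dropMissing |> List.map (fun (k, x) -> k, exp x)

// Reattach to the original index: keys without a value become missing again.
let reindex (index: DateTime list) (values: (DateTime * float) list) =
    let lookup = Map.ofList values
    index |> List.map (fun k -> k, Map.tryFind k lookup)

let newColumn = reindex (column |> List.map fst) applied
```

The important part is the last step: because each value carries its index key, assigning the result back does not require the lengths to match.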

Time series operations

The dataframe comes from the world of time series analysis in different forms. I think the design and implementation should recognize and honour that. Otherwise I don’t see the point as that’s where practically all applications lie.

This means out-of-the-box support for standard calculations such as moving averages. Much of this can of course be done in a third-party library but at least the necessary concepts have to exist. As I see it this is primarily what’s called “window”-functionality. In deedle and pandas it’s possible to perform windowed calculations. Either a moving window of a fixed size or an expanding window that adds a new row for each iteration. This is really useful for smoothing data and the expanding functionality is very powerful for making sure that all computations are done in a chronologically consistent way (no benefit of hindsight).
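For reference, both kinds of window are easy to express over a plain sequence with the F# core library, which is roughly the behaviour I would like to see on a column. This is a minimal sketch with function names of my own choosing, not an existing dataframe API.

```fsharp
// Moving average over a fixed-size window. Seq.windowed yields only full
// windows, so the result has (length - size + 1) entries.
let movingAverage (size: int) (xs: float seq) : float list =
    xs |> Seq.windowed size |> Seq.map Array.average |> List.ofSeq

// Expanding mean: the i-th output uses only observations 0..i, so no
// value ever depends on data that comes later in time.
let expandingMean (xs: float seq) : float list =
    xs
    |> Seq.scan (fun (n, total) x -> n + 1, total + x) (0, 0.0)
    |> Seq.skip 1 // drop the initial (0, 0.0) state
    |> Seq.map (fun (n, total) -> total / float n)
    |> List.ofSeq

let data = [ 1.0; 2.0; 3.0; 4.0 ]

let ma = movingAverage 2 data // [1.5; 2.5; 3.5]
let em = expandingMean data   // [1.0; 1.5; 2.0; 2.5]
```

What is missing without an index is the alignment: the moving average has fewer entries than the input, and it is the index that would tell the dataframe which rows those entries belong to.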

See pandas or deedle:

windowing

Summary

Great initiative. Please improve the F# experience, introduce the concept of an index (usually datetime-based), and put time series analysis center stage.
