This is an initial implementation of DataFrames for the Go programming language (Golang). This is still at a very early stage of development and changes to the API are to be expected. Use at your own risk.
The term DataFrame typically refers to a tabular dataset that can be viewed as a two dimensional table. Often the columns of this dataset refers to a list of features, while the rows represent a number of measurements.
Common examples of DataFrames can be found on Excel sheets, CSV files or data originated on a SQL database, but this data can come on a variety of other formats, like a collection of JSON objects or XML files.
The utility of DataFrames resides on the ability to subset them, merge them, summarize the data for individual features or apply functions to entire rows or columns.
When used in data analysis, techniques like Split Apply Combine can be used to obtain insightful information for our dataset.
Golang powerful features for backend and API servers make it very appealing to a lot of programmers, but it is severly lacking when trying to use it for data science.
Sure, there are fantastic packages out there that allow us to perform data analysis over numeric elements, like gonum. But what if what we really want is to retrieve data from some source, like a file or a database? What if we have different types of data on said dataset? What if we want to subset the data using those non-numeric features? What if after processing this data we want to pass it nicely formatted to our javascript frontend via json for visualization?
We could always resort to call some dedicated statistical software from Go to perform this functions and use the excellent marshaling capabilities to transform it to our desired format, but I think it would be much nicer and less cumbersome to be able to perform all these actions from within Go.
Since I couldn't find any packages out there that fulfil my requirements I decided to create it myself.The ambition is to create a package that:
- Load/save CSV data
- Load/save XML data
- Load/save JSON data
- Parse loaded data to the given types (Currently supported:
Int
,Float
,Bool
, &String
) - Row/Column subsetting (Indexing, column names, row numbers, range)
- Unique/Duplicate row subsetting
- Conditional subsetting (i.e.:
Age > 35 && City == "London"
) - DataFrame combinations by rows and columns (cbind/rbind)
- DataFrame joining by keys (InnerJoin, LeftJoin, RightJoin)
- DataFrame joining CrossJoin
- DataFrame joining by keys FullOuterJoin
- Function application over rows
- Function application over columns
- Statistics and summaries over the different features (Type dependant)
- Value counting (For histogram representations)
- Conversion between wide and long formats
Each column in a DataFrame can only have elements of a given type. To
be able parse columns to different types, sort them and keep
everything compatible the types included into a column have to comply
with the Cell
interface. Right now the Cell
interface have the
following methods:
type Cell interface {
String() string
Int() (*int, error)
Float() (*float64, error)
Bool() (*bool, error)
NA() Cell
IsNA() bool
Checksum() [16]byte
Copy() Cell
}
d := df.DataFrame{}
absPath, _ := filepath.Abs("dataset.csv")
csvfile, err := os.Open(absPath)
if err != nil {
fmt.Println(err)
return
}
r := csv.NewReader(csvfile)
records, err := r.ReadAll()
if err != nil {
fmt.Println(err)
return
}
// Load the data as string columns
err = d.LoadData(records)
if err != nil {
fmt.Println(err)
return
}
// Load and parse the features to the given format
err = d.LoadAndParse(records, df.T{"Age": "int", "Amount": "float"})
if err != nil {
fmt.Println(err)
return
}
// Create a new DataFrame with a custom constructor
d, err := df.New(
df.C{"A", df.Strings("a", "b", "c")},
df.C{"B", df.Ints(1, nil, 2)},
df.C{"C", df.Ints(1, 2, 3)},
)
// Print a DataFrame to console
fmt.Println(d)
> Country Date Age Amount Id
>
> 0: United States 2012-02-01 50 112.1 01234
> 1: United States 2012-02-01 32 321.31 54320
> 2: United Kingdom 2012-02-01 17 18.2 12345
> 3: United States 2012-02-01 32 321.31 54320
> 4: United Kingdom 2012-02-01 NA 18.2 12345
> 5: United States 2012-02-01 32 321.31 54320
> 6: United States 2012-02-01 32 321.31 54320
> 7: Spain 2012-02-01 66 555.42 00241
// Subset by column and rearrange the columns by name on the given order
d1, err := d.SubsetColumns([]string{"Date", "Country"})
// Subset by column using a range element
d2, err := d.SubsetColumns(df.R{0, 1})
// Subset by column using an array of column numbers
d3, err := d.SubsetColumns([]int{0, 3, 1})
// Subset by rows using a range element
d4, err := d.SubsetRows(df.R{0, 1})
// Subset by column using an array of row numbers
d5, err := d.SubsetRows([]int{0, 2, 1})
// Subset by both columns and rows any subsetting format can be used
d6, err := d.Subset([]string{"Date", "Age"}, df.R{0, 2})
// Only unique elements
d7, err := d.Unique()
// Only duplicated elements
d8, err := d.Duplicated()
da, _ := d.SubsetRows(df.R{0, 3})
db, _ := d.SubsetRows(df.R{3, 4})
dc, _ := d.SubsetColumns([]string{"Age", "Country"})
dd, _ := d.SubsetColumns([]string{"Date"})
fmt.Println(df.Rbind(*da, *db))
fmt.Println(df.Cbind(*dc, *dd))
Copyright 2016 Alejandro Sanchez Brotons
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.