This repository is now deprecated. The last supported release is MLJScientificTypes 0.4.8. ScientificTypes 2.0 and higher now serves the original purpose of MLJScientificTypes, implementing a scientific type convention called `DefaultConvention` (but previously known as the `MLJ` convention).
The scientific types themselves (on which all scientific type conventions are based) are now defined in ScientificTypesBase. Previously ScientificTypes (versions 1.1.1 and lower) defined the basic types and API.
Implementation of a convention for scientific types, as used in the MLJ universe.
Important note. While this document refers to the MLJ convention, this convention could (and, hopefully, will) be adopted in statistical/scientific software outside of the MLJ project. Of its dependencies, only the tiny package ScientificTypes.jl has any direct connection to MLJ.
This package makes a distinction between the machine type and the scientific type of a Julia object:

- The machine type refers to the Julia type being used to represent the object (for instance, `Float64`).
- The scientific type is one of the types defined in ScientificTypes.jl reflecting how the object should be interpreted (for instance, `Continuous` or `Multiclass`).
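The distinction can be seen by comparing Julia's built-in `typeof` with the `scitype` function that this package re-exports. The following is a minimal sketch, assuming MLJScientificTypes is installed and the MLJ convention is active; the variable names are illustrative only.

```julia
using MLJScientificTypes

x = 3.14
typeof(x)    # machine type: Float64
scitype(x)   # scientific type under the MLJ convention: Continuous

n = 42
typeof(n)    # machine type: Int64 on 64-bit systems
scitype(n)   # scientific type under the MLJ convention: Count
```

The same value therefore carries two kinds of type information: one describing its representation in memory, the other describing how it is meant to be interpreted.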
```julia
using Pkg
Pkg.add("MLJScientificTypes")  # the package name must be passed as a string
```
This repository has two kinds of users in mind:

- users of software in the MLJ universe seeking a deeper understanding of the use of scientific types and associated tools; these users do not need to directly install this package but may find its documentation helpful
- developers of statistical and scientific software who want to articulate their data type requirements in a generic, purpose-oriented way, and who are furthermore happy to adopt an existing convention about what data types should be used for what purpose (a convention that has been successfully adopted in an existing large-scale Julia project)
Developers interested in implementing a different convention will instead import ScientificTypes.jl, following the documentation there, possibly using this repo as a template.
The module `MLJScientificTypes` defined in this repo re-exports the scientific types and associated methods defined in ScientificTypes.jl and provides:

- a collection of `ScientificTypes.scitype` definitions that articulate the MLJ convention; importing the module automatically activates the convention
- a `coerce` function, for changing machine types to reflect a specified scientific interpretation (scientific type)
- an `autotype` function for "guessing" the intended scientific type of data
For more information and examples please refer to the manual.
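As a rough illustration of how `autotype` and `coerce` can work together, the sketch below asks for suggested scientific types and then applies them. The table and its column names (`height`, `flag`) are hypothetical, and the suggestions returned depend on the rules `autotype` applies, so no particular output is claimed here.

```julia
using MLJScientificTypes

# A small column table (a NamedTuple of vectors); names are illustrative only
X = (height = [1.85, 1.67, 1.50, 1.92, 1.78],
     flag   = [0, 1, 0, 1, 0])

# Ask for suggested scientific types, keyed by column name
suggestions = autotype(X)

# Apply the suggestions and inspect the resulting schema
Xc = coerce(X, suggestions)
schema(Xc)
```

The suggestions are only a guess, so it is worth reviewing them before coercing, and overriding individual columns with explicit `coerce` pairs where the guess is wrong.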
```julia
using MLJScientificTypes, DataFrames
X = DataFrame(
    a = randn(5),
    b = [-2.0, 1.0, 2.0, missing, 3.0],
    c = [1, 2, 3, 4, 5],
    d = [0, 1, 0, 1, 0],
    e = ['M', 'F', missing, 'M', 'F'],
)
sch = schema(X)
```
will print
```
_.table =
┌─────────┬─────────────────────────┬────────────────────────────┐
│ _.names │ _.types                 │ _.scitypes                 │
├─────────┼─────────────────────────┼────────────────────────────┤
│ a       │ Float64                 │ Continuous                 │
│ b       │ Union{Missing, Float64} │ Union{Missing, Continuous} │
│ c       │ Int64                   │ Count                      │
│ d       │ Int64                   │ Count                      │
│ e       │ Union{Missing, Char}    │ Union{Missing, Unknown}    │
└─────────┴─────────────────────────┴────────────────────────────┘
_.nrows = 5
```
Detail is obtained in the obvious way; for example:
```julia
julia> sch.names
(:a, :b, :c, :d, :e)
```
To specify instead that `b` should be regarded as `Count`, and that both `d` and `e` are `Multiclass`, we use the `coerce` function:
```julia
Xc = coerce(X, :b=>Count, :d=>Multiclass, :e=>Multiclass)
schema(Xc)
```
which prints
```
_.table =
┌─────────┬───────────────────────────────────────────────┬───────────────────────────────┐
│ _.names │ _.types                                       │ _.scitypes                    │
├─────────┼───────────────────────────────────────────────┼───────────────────────────────┤
│ a       │ Float64                                       │ Continuous                    │
│ b       │ Union{Missing, Int64}                         │ Union{Missing, Count}         │
│ c       │ Int64                                         │ Count                         │
│ d       │ CategoricalValue{Int64,UInt32}                │ Multiclass{2}                 │
│ e       │ Union{Missing, CategoricalValue{Char,UInt32}} │ Union{Missing, Multiclass{2}} │
└─────────┴───────────────────────────────────────────────┴───────────────────────────────┘
_.nrows = 5
```
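`coerce` can also be applied to a single vector rather than a whole table. The sketch below, which assumes the MLJ convention and uses an illustrative vector `v`, shows the machine type changing so that the data matches the requested scientific type.

```julia
using MLJScientificTypes

v = [1, 2, 3]              # integers, interpreted as Count by default
elscitype(v)               # element scientific type before coercion

vc = coerce(v, Continuous) # request a Continuous interpretation
eltype(vc)                 # machine type is now a floating point type
elscitype(vc)              # element scientific type after coercion
```

Note that `coerce` returns a new object here and leaves `v` itself unchanged.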