numobs and getobs #1

Merged
merged 6 commits into from
Dec 27, 2021
2 changes: 1 addition & 1 deletion .github/workflows/CI.yml
Original file line number Diff line number Diff line change
@@ -19,7 +19,7 @@ jobs:
matrix:
version:
- '1.6'
- '1.7'
- '1'
- 'nightly'
os:
- ubuntu-latest
11 changes: 7 additions & 4 deletions Project.toml
@@ -1,13 +1,16 @@
name = "MLBase"
name = "MLUtils"
uuid = "f1d291b0-491e-4a28-83b9-f70985020b54"
authors = ["Carlo Lucibello <carlo.lucibello@gmail.com> and contributors"]
version = "0.1.0"

[extras]
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
[deps]

[compat]
julia = "1.6"

[extras]
SparseArrays = "2f01184e-e22b-5df5-ae63-d93ebab69eaf"
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

[targets]
test = ["Test"]
test = ["SparseArrays", "Test"]
58 changes: 55 additions & 3 deletions README.md
@@ -1,4 +1,56 @@
# MLBase
# MLUtils

[![Build Status](https://github.com/JuliaML/MLUtils.jl/actions/workflows/CI.yml/badge.svg?branch=main)](https://github.com/JuliaML/MLUtils.jl/actions/workflows/CI.yml?query=branch%3Amain)
[![Coverage](https://codecov.io/gh/JuliaML/MLUtils.jl/branch/main/graph/badge.svg)](https://codecov.io/gh/JuliaML/MLUtils.jl)


This package embodies a community effort to provide common and extensible functionalities for Machine Learning packages in Julia.

The aim is to consolidate packages in the ML ecosystem such as [MLDataPattern.jl](https://github.com/JuliaML/MLDataPattern.jl) and [MLLabelUtils.jl](https://github.com/JuliaML/MLLabelUtils.jl) into a single well-maintained repository.

## Interface

### Observation

The functions `numobs`, `getobs`, and `getobs!` are implemented for basic types (arrays, tuples, ...)
and can be extended to custom types.

For array types, the observation dimension is the last dimension. This means that when working
with matrices, each column is treated as an individual observation.
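
As a minimal sketch of this convention (assuming the package is loaded with `using MLUtils`; the matrix `X` is purely illustrative), observations are selected along the last dimension:

```julia
using MLUtils

# Illustrative data: 2 features (rows) and 3 observations (columns).
X = [1 3 5;
     2 4 6]

numobs(X)            # number of columns, i.e. size(X, 2)
getobs(X, 2)         # the second column, as a vector
getobs(X, [1, 3])    # a 2×2 matrix built from columns 1 and 3
```

A scalar index drops the observation dimension, while a vector of indices keeps it.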

```
numobs(data)

Return the total number of observations contained in `data`.
See also [`getobs`](@ref)
```

```
getobs(data, idx)

Return the observations corresponding to the observation-index `idx`.
Note that `idx` can be any type as long as `data` has defined
`getobs` for that type.
The returned observation(s) should be in the form intended to
be passed as-is to some learning algorithm. There is no strict
interface requirement on what this "actual data" must look like.
Every author behind some custom data container can make this
decision themselves.
The output should be consistent when `idx` is a scalar vs vector.
See also [`getobs!`](@ref) and [`numobs`](@ref)
```
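
The following sketch illustrates how tuples and named tuples of containers might be used, assuming `using MLUtils` and the tuple methods added in this PR; the names `X`, `y`, and `data` are illustrative only. Each element is indexed with the same `idx`, and all elements must agree on the number of observations.

```julia
using MLUtils

# Illustrative data: 3 observations with 2 features each, plus one label each.
X = [1 3 5;
     2 4 6]
y = ["a", "b", "c"]
data = (X, y)

numobs(data)                       # shared number of observations
getobs(data, 2)                    # a tuple: (second column of X, y[2])
getobs((x = X, y = y), [1, 3])     # named tuples keep their field names

# Containers with different numbers of observations are rejected:
# numobs((X, ["a", "b"])) throws a DimensionMismatch.
```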

```
getobs!(buffer, data, idx)

In-place version of `getobs(data, idx)`. If this method
is defined for the type of `data`, then `buffer` should be used
to store the result, instead of allocating a dedicated object.
Implementing this function is optional. If no such method is
provided for the type of `data`, then `buffer` is *ignored* and
the result of `getobs` is returned. This can happen when the
type of `data` does not lend itself to the concept of `copy!`.
```
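
A short sketch of the in-place variant for arrays, assuming `using MLUtils` and the array method of `getobs!` from this PR; `X` and `buffer` are illustrative names. The buffer must already have the shape of the requested observation(s).

```julia
using MLUtils

# Illustrative data: 3 observations with 2 features each.
X = [1.0 3.0 5.0;
     2.0 4.0 6.0]

# Preallocate storage for a single observation.
buffer = zeros(2)

# Overwrite `buffer` with the third observation; the same buffer is returned.
result = getobs!(buffer, X, 3)
```

Reusing one buffer across many calls avoids allocating a fresh result array each time, which is the main reason a container might implement `getobs!`.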

[![Build Status](https://github.com/Carlo/MLBase.jl/actions/workflows/CI.yml/badge.svg?branch=main)](https://github.com/Carlo/MLBase.jl/actions/workflows/CI.yml?query=branch%3Amain)
[![Coverage](https://codecov.io/gh/Carlo/MLBase.jl/branch/main/graph/badge.svg)](https://codecov.io/gh/Carlo/MLBase.jl)
5 changes: 0 additions & 5 deletions src/MLBase.jl

This file was deleted.

10 changes: 10 additions & 0 deletions src/MLUtils.jl
@@ -0,0 +1,10 @@
module MLUtils


include("observation.jl")
export numobs, getobs, getobs!

include("randobs.jl")
export randobs

end
103 changes: 103 additions & 0 deletions src/observation.jl
@@ -0,0 +1,103 @@
"""
    numobs(data)

Return the total number of observations contained in `data`.

See also [`getobs`](@ref)
"""
function numobs end

"""
    getobs(data, idx)

Return the observations corresponding to the observation-index `idx`.
Note that `idx` can be any type as long as `data` has defined
`getobs` for that type.
The returned observation(s) should be in the form intended to
be passed as-is to some learning algorithm. There is no strict
interface requirement on what this "actual data" must look like.
Every author behind some custom data container can make this
decision themselves.
The output should be consistent when `idx` is a scalar vs vector.

See also [`getobs!`](@ref) and [`numobs`](@ref)
"""
function getobs end

getobs(data, idx) = data[idx]

"""
    getobs!(buffer, data, idx)

In-place version of `getobs(data, idx)`. If this method
is defined for the type of `data`, then `buffer` should be used
to store the result, instead of allocating a dedicated object.
Implementing this function is optional. If no such method is
provided for the type of `data`, then `buffer` is *ignored* and
the result of `getobs` is returned. This can happen when the
type of `data` does not lend itself to the concept of `copy!`.
"""
function getobs! end
getobs!(buffer, data, idx) = getobs(data, idx)

# --------------------------------------------------------------------
# Arrays
# We are very opinionated with arrays: the observation dimension
# is the last dimension. For different behavior wrap the array in
# a custom type, e.g. with Tables.table.


numobs(A::AbstractArray{<:Any, N}) where {N} = size(A, N)

# 0-dim arrays
numobs(A::AbstractArray{<:Any, 0}) = 1

function getobs(A::AbstractArray{<:Any, N}, idx) where N
    # Select everything along the first N-1 dimensions.
    I = ntuple(_ -> :, N-1)
    return A[I..., idx]
end

getobs(A::AbstractArray{<:Any, 0}, idx) = A[idx]

function getobs!(buffer::AbstractArray, A::AbstractArray{<:Any, N}, idx) where N
    I = ntuple(_ -> :, N-1)
    buffer .= A[I..., idx]
    return buffer
end

# --------------------------------------------------------------------
# Tuples and NamedTuples

_check_numobs_error() =
    throw(DimensionMismatch("All data containers must have the same number of observations."))

# Verify that every element of `tup` has the same number of observations.
function _check_numobs(tup::Union{Tuple, NamedTuple})
    length(tup) == 0 && return
    n1 = numobs(tup[1])
    for i in 2:length(tup)
        numobs(tup[i]) != n1 && _check_numobs_error()
    end
end

function numobs(tup::Union{Tuple, NamedTuple})::Int
    _check_numobs(tup)
    return length(tup) == 0 ? 0 : numobs(tup[1])
end

function getobs(tup::Union{Tuple, NamedTuple}, indices)
    _check_numobs(tup)
    return map(x -> getobs(x, indices), tup)
end

function getobs!(buffers::Union{Tuple, NamedTuple},
                 tup::Union{Tuple, NamedTuple},
                 indices)
    _check_numobs(tup)

    return map(buffers, tup) do buffer, x
        getobs!(buffer, x, indices)
    end
end
13 changes: 13 additions & 0 deletions src/randobs.jl
@@ -0,0 +1,13 @@
# TODO: allow passing a rng as first parameter

"""
    randobs(data, [n])

Pick a random observation or a batch of `n` random observations
from `data`.
For this function to work, the type of `data` must implement
[`numobs`](@ref) and [`getobs`](@ref).
"""
randobs(data) = getobs(data, rand(1:numobs(data)))

randobs(data, n) = getobs(data, rand(1:numobs(data), n))