% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/mapreduce.R
\name{mapreduce}
\alias{mapreduce}
\alias{mapreduce,FileArray,ANY,function-method}
\alias{mapreduce,FileArray,ANY,NULL-method}
\alias{mapreduce,FileArray,ANY,missing-method}
\title{A map-reduce method to iterate blocks of file-array data with little memory usage}
\usage{
mapreduce(x, map, reduce, ...)

\S4method{mapreduce}{FileArray,ANY,function}(x, map, reduce, buffer_size = NA, ...)

\S4method{mapreduce}{FileArray,ANY,NULL}(x, map, reduce, buffer_size = NA, ...)

\S4method{mapreduce}{FileArray,ANY,missing}(x, map, reduce, buffer_size = NA, ...)
}
\arguments{
\item{x}{a file array object}

\item{map}{mapping function that receives 3 arguments; see 'Details'}

\item{reduce}{\code{NULL}, or a function that takes a list as input}

\item{...}{passed to other methods}

\item{buffer_size}{controls how the array is split into blocks; see 'Details'}
}
\value{
If \code{reduce} is \code{NULL}, the mapped results are returned; otherwise,
the value returned by the \code{reduce} function
}
\description{
A map-reduce method to iterate blocks of file-array data with little memory usage
}
\details{
When handling out-of-memory arrays, it is recommended to load
one block of the array at a time and operate at the block level. See
\code{\link{apply}} for an implementation. When an array is too large
and there are too many blocks, this operation becomes
very slow if computer memory is low,
because R performs garbage collection frequently.
Implemented in \code{C++}, \code{mapreduce} creates a buffer to store
the block data. By reusing this memory over and over again, it is possible
to iterate through the array with minimal garbage collection. Many
statistics, including \code{min}, \code{max}, \code{sum}, and
\code{mean}, can be calculated efficiently in this way.

The function \code{map} takes three arguments: \code{data} (mandatory),
\code{size} (optional), and \code{first_index} (optional).
\code{data} is the buffer,
whose length is consistent across iterations. \code{size} indicates
the effective size of the buffer. If the partition size
is not divisible by the buffer size, only the first \code{size} elements of
\code{data} come from the array, and the remaining elements are \code{NA}.
This situation can only occur when \code{buffer_size} is manually
specified. By default, all elements of \code{data} come from the array.
The last argument, \code{first_index}, is the index of the first element
\code{data[1]} in the whole array. It is useful when positional information
is needed.

The buffer size, specified by \code{buffer_size}, is an
additional optional argument passed via \code{...}. Its default is \code{NA},
in which case it is calculated automatically. If manually specified, a
large buffer size is preferable to speed up the calculation.
The default buffer size will not exceed \eqn{nThreads x 2MB}, where
\code{nThreads} is the number of threads set by \code{\link{filearray_threads}}.
When the partition length is not divisible by the buffer size, the buffer
is not trimmed; instead, it is padded with \code{NA}s before being
passed to the \code{map} function; see the previous paragraph for how to handle this.

The function \code{mapreduce} ignores missing partitions. That means
if a partition is missing, its data will be neither read nor passed to
the \code{map} function. Please run \code{x$initialize_partition()} to make sure
the partition files exist.
}
\examples{


x <- filearray_create(tempfile(), c(100, 100, 10))
x[] <- rnorm(1e5)

## calculate the sum
# equivalent to sum(x[]), but feasible for arrays too large to load at once

mapreduce(x, map = function(data, size) {
    # keep only the elements that come from the array (drop NA padding)
    if (length(data) != size) {
        data <- data[seq_len(size)]
    }
    sum(data)
}, reduce = function(mapped_list) {
    # add up the per-block sums
    do.call(sum, mapped_list)
})
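
## calculate the mean in a single pass
# A minimal sketch, assuming `x` is the array created above: each block
# returns its partial sum and its effective element count, and `reduce`
# combines them. The names `block_mean` and `totals` are illustrative only.
block_mean <- mapreduce(x, map = function(data, size) {
    # keep only the elements that come from the array (drop NA padding)
    data <- data[seq_len(size)]
    c(sum(data), size)
}, reduce = function(mapped_list) {
    # element-wise sum of all (partial sum, count) pairs
    totals <- Reduce(`+`, mapped_list)
    totals[1] / totals[2]
})

# should agree with mean(x[]) up to floating-point error
all.equal(block_mean, mean(x[]))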


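## Find the positions of elements that are less than -3
positions <- mapreduce(
    x,
    map = function(data, size, first_index) {
        # keep only the elements that come from the array (drop NA padding)
        if (length(data) != size) {
            data <- data[seq_len(size)]
        }
        # convert within-buffer positions to positions in the whole array
        which(data < -3) + (first_index - 1)
    },
    reduce = function(mapped_list) {
        do.call(c, mapped_list)
    }
)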

if(length(positions)) {
    x[[positions[1]]]
}
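
## Keep the per-block results by setting `reduce = NULL`
# A minimal sketch, assuming `x` is the array created above: per the 'Value'
# section, the mapped results are returned without reduction, so they can be
# combined afterwards in ordinary R code. `block_max` is an illustrative name.
block_max <- mapreduce(x, map = function(data, size) {
    # keep only the elements that come from the array (drop NA padding)
    max(data[seq_len(size)])
}, reduce = NULL)

# overall maximum; should equal max(x[])
max(unlist(block_max))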


}