This cheatsheet is meant to be an exhaustive reference to the operators that compose ni as a language, except for its many Perl extensions (which have their own cheatsheet). This document is designed to be read in order, and it should make sense that way, though it's quite long and light on details.

However, there's a lot more to achieving fluency than just knowing the words; if you use ni regularly, you'll be well-rewarded for walking through ni by Example, the more formal tutorial. And if you ever write your own ni pipelines, you should read ni-fu to learn how to debug them. I guarantee it will save you time.
`n`
: Integer stream

- `n#`: Stream the sequence `1..#`, one number per line
- `n0#`: Stream the sequence `0..#-1`, one number per line
- `n`: Stream the sequence `1..`, one number per line
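For example, a quick sanity check (output shown below the command):

```
$ ni n3
1
2
3
```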
`<filename>`
: File input

- Automatically decompresses the file and streams out one line at a time.
- Files are automatically decompressed from `gzip`; `bzip`, `xzip`, and `lzo` decompression are also supported if you have the binaries.
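For example (`logs.gz` here is a hypothetical gzip-compressed file):

```
$ ni logs.gz     # streams the decompressed lines, one at a time
```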
`e[<script>]`
: Evaluate script

- Evaluates `<script>` in bash and streams out the results one line at a time.
- Can also be written `e'<script>'`; for example, `$ ni n10 e'wc -l'` will count lines.
`i<text>`
: Literal text input

- `$ ni iOK!` -- add the literal text `OK!` (and a newline) to the stream.
- `$ ni i'a cat'` -- add the literal text `a cat` (and a newline) to the stream. The quotes are necessary to tell the ni parser where the boundaries of the string are; double quotes work as well.
- `$ ni i[these cats]` -- add a tab-separated line consisting of `these` and `cats`, and a newline.
`D:<field1>,:<field2>...`
: JSON Destructure

- ni can easily grab scalar fields (i.e. strings, numbers, and nulls) from JSON. For example, `$ ni i'{"hi":5,"there":"alive"}' D:there,:hi` yields `alive` and `5` on one tab-separated line.
- The JSON destructurer does not support pulling out list-valued fields or nested hashes within JSON.
`1`
: Dummy pulse

- `$ ni 1` is syntactic sugar for `$ ni n1`. It is useful for launching scripts in Perl, Ruby, Lisp, or Python from ni.
- The `1` operator is primarily used to make a Perl script run; it's often more useful in testing than in actual development.
`\>filename`
: Redirect stream to file and emit filename

- Consumes the stream, outputs it to the file named `filename`, and emits the filename.
- Note that there is no whitespace between the angle bracket and the filename.
- This "literal right angle bracket" operator `\>` is usually much more useful than a file redirect `>`, since the filename(s) output to the stream can be opened using the "literal left angle bracket" operator, `\<`.
`\<`
: Read from filenames

- Normally when ni sees a filename in its instructions, it will open the file and `cat` its contents to the stream.
- However, if your stream is a list of filenames, `\<` will open each one and `cat` its contents to the stream.
- This tool can be powerful in combination with Hadoop operations, described later.
`z`
: Compress stream

- Defaults to `gzip` compression, but `xzip`, `bzip`, and others are available. See ni by Example Chapter 1 for details.
The operator `r` is used to filter rows.

- `$ ni <data> r3` - take the first 3 rows of the stream.
  - CAVEAT: `r#` is a wrapper over the Unix utility `head`, and emits a `SIGPIPE` which will break Streaming MapReduce jobs. To use `r` in the context of MapReduce, use the safe row operator `rs` instead.
- `$ ni <data> rs3` - take the first 3 rows of the stream and do not emit a `SIGPIPE`.
- `$ ni <data> r-3` - take everything after the first 3 rows of the stream.
- `$ ni <data> r~3` - take the last 3 rows of the stream.
- `$ ni <data> rx100` - take the first row in the stream, and every 100th row that follows.
- `$ ni <data> r.05` - sample 5% of the rows in the stream.
  - The sampling here is deterministic (conditioned on the environment variable `NI_SEED`) and will always return the same set of rows.
- `$ ni <data> r/<regex>/` - take rows where `<regex>` matches (see the example below).
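For example, using the regex form on a numeric stream:

```
$ ni n10 r/0/
10
```

Only `10` contains the character `0`, so it's the only row kept.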
Columns are referenced "spreadsheet-style"--the first column is `A`, the second is `B`, etc.
`f`
: Take columns

- `$ ni <data> fAA` - select the first column and duplicate it.
- `$ ni <data> fAD.` - select the first column and all remaining columns starting with the fourth.
- `$ ni <data> fB-E` - select columns 2-5.
- `$ ni <data> fCAHDF` - select columns 3, 1, 8, 4, 6, and stream them out in that order.
- `$ ni <data> f#<N1>,#<N2>` - select data from the zero-indexed columns `<N1>` and `<N2>`.
  - This can be used to select columns beyond the 26th column. `$ ni <data> f#87,#45,#9,#18` will take the 88th, 46th, 10th, and 19th columns from the data source.
- Every column operation written with letters can be rewritten using the numeric form:
  - `$ ni <data> f#0,#0` is equivalent to `$ ni <data> fAA`
  - `$ ni <data> f#0,#3.` is equivalent to `$ ni <data> fAD.`
  - `$ ni <data> f#1-#4` is equivalent to `$ ni <data> fB-E`
  - `$ ni <data> f#2,#0,#7,#3,#5` is equivalent to `$ ni <data> fCAHDF`
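A concrete example (output columns are tab-separated):

```
$ ni i[a b c d] fBA
b	a
```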
`r<cols>`
: Take rows where the given columns exist

- `$ ni <data> rCF` - take rows where columns 3 and 6 are nonempty.
`F`
: Split stream into columns

- `F:<char>`: split on a character.
  - WARNING: this does not work with certain characters that need to be escaped; use `F/regex/` below for more flexibility (at the cost of less concision).
- `F/regex/`: split on occurrences of the regex. If present, the first capture group will be included before a tab is appended to a field.
- `Fm/regex/`: don't split; instead, look for matches of the regex and use those as the field values.
- `FC`: split on commas (doesn't handle special CSV cases).
- `FV`: parse CSV "correctly", up to newlines in fields.
- `FS`: split on runs of horizontal whitespace.
- `FW`: split on runs of non-word characters.
- `FP`: split on pipe symbols.
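For example, splitting a comma-separated line into tab-separated columns:

```
$ ni i'a,b,c' FC
a	b	c
```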
`x`
: Exchange columns

- `x`: exchange the first two columns. `$ ni data x` is equivalent to `$ ni data fBA.`
- `xC`: exchange column 3 with column 1. `$ ni data xC` is equivalent to `$ ni data fCBA.`
- `xBE`: exchange columns 2 and 5 with columns 1 and 2.
  - This runs in order, so `B` will be switched with `A` first, and whatever is now in the second column will be switched with column `E`. `$ ni data xBE` is equivalent to `$ ni data fBECDA.`
- WARNING: Sorting is often a rate-limiting step in ni jobs, as data will need to be buffered to disk if a sort is too large to fit in memory. If your data is larger than 1 GB, you should consider distributing your workload using Hadoop operations.

`g`
: General sorting

- `gB` - sort rows ascending by the lexicographic value of the second column.
  - Lexicographic order is determined by the ordering of characters in the ASCII table. `ni ia iC g` will put the capital `C` before the lowercase `a`, because capital Latin letters precede lowercase Latin letters in ASCII.
- `gC-` - sort rows descending by the lexicographic value of the third column.
- `gCA-` - sort rows first by the lexicographic value of the third column, ascending; for rows with the same value in the third column, sort by descending value of the first column.
- `gDn` - sort rows ascending by the numerical value of the fourth column.
- `gEnA-` - sort rows ascending by the numerical value of the fifth column; where values in the fifth column are equal, sort by the lexicographic value of the first column, descending.
- CAVEAT: The numeric sort works on integers and floating-point numbers written as decimals. It will not work on numbers written in exponential/scientific notation.
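A quick demonstration of numeric versus lexicographic order:

```
$ ni i3 i10 i2 gAn
2
3
10
```

A plain `gA` would instead yield `10`, `2`, `3`, since `"10"` sorts before `"2"` in ASCII.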
`u`
: Unique sorted rows

- `$ ni <data> fACgABu`: get the lexicographically-sorted unique values from the first and third columns of `<data>`.
`c`
: Count sorted rows

- `$ ni <data> fBgc`: return the number of times each unique value of the second column occurs in `<data>`.
- Note that the above operation is superior to `$ ni <data> gBfBc` (which gives the same results), since the total amount of data that needs to be sorted is reduced.
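A minimal sketch (assuming `c` prepends the count to each distinct value, in the style of `uniq -c`):

```
$ ni ia ib ia gc
2	a
1	b
```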
`o` and `O`
: Numeric sorting

- `o`: sort rows ascending (numerical). `oA` is syntactic sugar for `$ ni <data> gAn`.
- `O`: sort rows descending (numerical). `OB` is equivalent to `$ ni <data> gBn-`.
- Important Note: `o` and `O` sorts cannot be chained together or combined with `g`. There is no guarantee that the output of `$ ni <data> gAoB` will have a lexicographically sorted first column, and there is no guarantee that `$ ni <data> oBOA` will have a numerically sorted second column. If you want to sort by multiple conditions, you must use `g`.
`$ ni <data> ,<op><columns>`

These provide keystroke-efficient ways to do transformations on a single column of the input data. Of particular use is the deterministic hashing function (`,h`), which does a good job of compacting long IDs into 32-bit integers. A worked example follows the list.

- `,a`: Running average
- `,d`: Difference between consecutive rows
- `,e`: Natural exponential (`e**x`)
- `,h`: Murmurhash (deterministic 32-bit hash function)
- `,j<amt>`: Jitter (add uniform random noise in the range `[-amt/2, amt/2]`)
- `,l`: Natural log (`ln x`)
- `,s`: Running sum
- `,q`: Quantize
- `,z`: Intify (hash and then convert hash values to integers starting with 1)
- `,t`: Convert timestamp to readable ISO 8601 form
- `,g`: Geohash encode
- `,G`: Geohash decode
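For example, a running sum over the first column:

```
$ ni n5 ,sA
1
3
6
10
15
```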
When `ni HS...` is called, ni packages itself into a `.jar` and sends it to the configured Hadoop server; the jar includes all the instructions Hadoop needs to run ni.

When ni uploads itself, it will also upload all data stored in data closures; if these closures are too large, the Hadoop server will refuse the job.
`hdfs://<path>`
: HDFS cat

- Equivalent to `hadoop fs -cat <path>`

`hdfst://<path>`
: HDFS text

- Equivalent to `hadoop fs -text <path>`

`hdfsj://<path>`
: HDFS join

- Identifies the correct file within the directory `<path>` that should be joined with the map file (in a Hadoop Streaming context), `$ENV{mapreduce_map_input_file}`.
`HS[mapper] [combiner] [reducer]`
: Hadoop Streaming Job

- Any ni snippet can be used for the mapper, combiner, and reducer. Be careful that all of the data you reference is available to the Hadoop cluster; `w`/`W` operations referencing a local file are good examples of operations that may work on your machine but fail on a Hadoop cluster with no access to those files.
- `_` -- skip the mapper/combiner/reducer.
- `:` -- apply the trivial operation (i.e. redirect STDIN to STDOUT) for the mapper/combiner/reducer.
- If the reducer step is skipped with `_`, the output may not be sorted, as one might expect from a Hadoop operation. Use `:` for the reducer to ensure that output is sorted correctly.
- Remember that you will be limited in the size of the `.jar` that can be uploaded to your Hadoop job server; you can upload data closures that are large, but not too large.
- Using HDFS paths in Hadoop Streaming Jobs: `ni ... ihdfst://<path> HS...`
  - The path must be input as literal text (with `i`) so that ni knows to get the data during the Hadoop job.
  - WARNING: if you do not use a literal path, for example with `ni hdfst://...`, ni will try to download all of the data, upload it to HDFS, and then run the job.
  - If you find that a job takes over a minute to start, you may want to check that you haven't made this error.
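A minimal sketch of a filtering job, assembled from the pieces above (the HDFS path here is hypothetical):

```
$ ni ihdfst://user/me/input HS[r/error/] _ :
```

The mapper keeps rows matching `error`, the combiner is skipped with `_`, and the trivial `:` reducer ensures the output comes back sorted.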
`s<host>[...]`
: Execute `[...]` on `<host>`

- You will need to set up your hosts properly in your `.ssh/config` to use a named host.
- Remember that within the bracketed operator, you will have access to the `<host>` filesystem.
`C<container_name>[...]`
: Execute `[...]` in the container `<container_name>`

- Running in containers requires that Docker be installed on your machine.
- Running containers can be especially useful to take advantage of better OS-dependent utilities.
  - For example, Mac OS X's `sort` is painfully slow compared to Ubuntu's. If you are developing on a Mac, there will be a noticeable performance increase if you replace `ni n1E7 g` with `ni n1E7 Cubuntu[g]`, because the sort will be done in the faster Ubuntu container.
- Containers are also useful for testing the portability of your code.
Note that you will need sufficient processing cores to scale horizontally with any effect. If your computer has 2 cores and you call `S8`, it may slow your work down, as ni tries to spin up more processes than your machine can bear.

`S`
: Horizontal Scaling

- `$ ni <data> S<# cores>[...]`: distributes the computation of `...` across `<# cores>` processors.
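For example, a parallel row filter (assuming a machine with at least 4 cores):

```
$ ni n1E7 S4[r/1234/]
```

Each of the 4 worker processes filters its shard of the stream for rows containing `1234`.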
We can weave together row, column, and Perl operations to create more complex row operations. We also introduce some more advanced column operators.

`j`
: Streaming join

- This will (inner) join two streams using one or more of their columns as a key; if not specified, the key will be the first column, which is assumed to be sorted.
- Adding columns to the operator will use those as the join columns: `jAB` will join the left and right datasets on their first two columns.
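A minimal sketch (assuming the right-hand stream is given as a bracketed lambda and both sides are sorted on the key):

```
$ ni i[a 1] i[b 2] j[i[a x] i[b y]]
a	1	x
b	2	y
```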
`w`
: Append column to stream

- `$ ni <data> w[np'a*a']`
- `w` will add columns only up to the length of the input stream.

`W`
: Prepend column to stream

- `$ ni <data> Wn` - add line numbers to the stream (by prepending one element of the infinite stream `n` to each row).
- `W` will add columns only up to the length of the input stream.
`v`
: Vertical operation on columns

- Important Note: This operator is too slow to use in production.
Data closures are compiled into the ni source before the pipeline is executed. This allows them to serve as a broadcasted dataset across nodes in a distributed context. Closures can be accessed from within Perl snippets by using their name.

`::closure_name[...]`
: Create a data closure

- A common motif for closures is using them as a filter:

```
$ ni ::good_points[i100 i3 i76] \
    n100 rp'^{%h = ab_ good_points} exists($h{+a})'
```

- Here we've created a closure called `good_points` which contains the data `100\n3\n76\n` as a string (initially). We use some Perl (see the Perl Cheatsheet for details) to convert those lines into a Perl hash called `%h`, whose keys are the closure's values, and we check whether the value of the first column `+a` exists as a key in the hash.
- Closures are computed separately from each other, which means that one closure cannot in general reference the value of another closure. If you find a situation where you need to create a closure that depends on the value of another closure, this is likely not (currently) good ni style, and you should look for another way to solve the problem.

`@:[disk_backed_data_closure]`
: Create a disk-backed data closure

- A disk-backed data closure operates in much the same way as a regular data closure.
- In general, backing your data closures to disk is a way to get around memory restrictions on machines when using particularly large data closures.
- Disk-backed data closures may fail in environments where you have restrictions about writing to disk (e.g. in Hadoop mappers or reducers).
`:<checkpoint_name>`
: Create a checkpoint

- Checkpoints are useful for building extended pipelines, especially ones with long-running or expensive operations (e.g. Hadoop MapReduce jobs).
- The use of checkpoints is a bit too tricky for your author's taste, but you are likely smarter than he.
- Here's an appropriate (if contrived) use of checkpoints: `$ ni nE7 g :alpha_numbers r3`
  - With the checkpoint, the result of `$ ni nE7 g` will be stored in `:alpha_numbers`, and you can re-run the spell quickly.
  - Without the checkpoint, the entire sort would have to be re-run each time, which takes a painfully long amount of time.
  - In the background, ni is sinking the data from `$ ni nE7 g` to a file called `alpha_numbers`.
- WARNING: Once a checkpoint has been written, it will remain the same. For example, if, after running `$ ni nE7 g :alpha_numbers r10`, you run `$ ni n1000 g :alpha_numbers r10`, ni will keep using the previous `alpha_numbers` checkpoint data. This failure mode is silent, so BE CAREFUL!! (Or don't use them, like me.)
- Checkpoints are also disk-backed, which means they suffer from many of the same limitations as disk-backed data closures.
`W\<`
: Read from files and prepend filename

- The `\<` operator will read data from files, but will not state which file each line came from. `W\<` does the same thing as `\<` except it prepends a column to the output data with the name of the file each line came from.

`W\>`
: Redirect filename-prepended stream to files

- This operator consumes stream data with lines of the form `<filename> <data1> <data2> ... <dataN>`, outputs the lines `<data1> <data2> ... <dataN>` to `<filename>`, and outputs `<filename>`.
- Be aware that input data should be in sorted order; `ni i[a.txt 1] i[b.txt 2] i[a.txt 3] W\>` will leave file `a.txt` with only one line with the value `3`.
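A small round trip, writing a file and reading it back (note that this creates `a.txt` in the current directory):

```
$ ni i[a.txt hello] W\> \<
hello
```

`W\>` writes `hello` to `a.txt` and emits the filename; `\<` then opens that filename and cats its contents back into the stream.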
`Y`
: Dense-to-sparse transformation

- Explodes each row of the stream into several rows, each with three columns:
  - the index of the row the input value came from,
  - the index of the column the input value came from,
  - the value of the input stream at that row and column.

`X`
: Sparse-to-dense transformation

- `X` inverts `Y`; it converts a specifically-formatted 3-column stream into a multiple-column stream.
- In the case that there are collisions for locations, `X` will sum the values. For example: `ni n010p'r 0, a%3, 1' X`
- The specification for what the input must look like is described above in the `Y` operator.
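For example, exploding two rows into (row, column, value) triples:

```
$ ni i[a b] i[c d] Y
0	0	a
0	1	b
1	0	c
1	1	d
```

Appending `X` to this pipeline (`ni i[a b] i[c d] Y X`) reproduces the original two rows.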
`N'x = ...'`
: Numpy matrix operations

- Dense matrices can be transformed using Numpy-like operations.
- The entire input matrix (i.e. the stream) is referred to as `x`.
- Example: `ni n10p'r map a*$_, 1..10' N'x = x + 1'` creates a matrix and adds one to every element with high keystroke efficiency.
- Example: `ni n5p'r map a*$_, 1..10' N'x = x.T'` transposes the matrix.
- You also have the entire numpy package at your disposal:
  - Example: `ni n10p'r map a*$_, 1..10' N'x = dot(x, x.T)'`
  - Example: `ni 1N'x = random.normal(size=(5,3))'`
- Note that your statements must always store the matrix back in the variable `x`.
`%<#>[stream]`
: Interleave streams

- This is the most useful of the stream-combining operations, especially for plotting data using `ni --js`.
- CAVEAT: Interleaving is not an exact operation (though it's negligibly close for large datasets), and output can somewhat depend on the speed with which the two streams are generated.
`=`
: Duplicate this stream and discard its output

- The best use of this operation is to sink data to a file in the middle of a pipeline without impeding its progress. `$ ni n100 =\>hundo.txt fAA p'r a*3, b' \>other.txt` will sink the stream after `$ ni n100` to a file called `hundo.txt`, while still allowing the data to be processed through the rest of the pipeline.
`+`
: Append a stream

- In general, a stream written later in the spell will be appended automatically to the stream that comes before it, so I rarely use this. `$ ni ia +ib` yields `a`, then `b`, the same as `$ ni ia ib` would. Not very useful.

`^`
: Prepend a stream

- This is a more useful operator than `+` in theory, but it is also rarely used, since it generally makes more sense to order the streams in the first place. `$ ni ia ^ib` yields `b` then `a`. Slightly useful!
One improvement of ni over its predecessor, nfu, is that mathematical operations on tall and wide matrices have better support through partitioning.

Some matrix operations are not performant in ni because they may require space greater than memory. If you're doing a lot of complex matrix operations, ni may not be the right tool. However, ni does provide some partitioned variants of its matrix operators, described below.
`Y<col>`
: Dense-to-sparse transform with partitioning

```
$ ni i[a b c d] i[a b x y] i[a b foo bar] YC
a	b	0	0	c
a	b	0	1	d
a	b	1	0	x
a	b	1	1	y
a	b	2	0	foo
a	b	2	1	bar
```

`Y` has reduced over the first two columns of the data and kept their values in the first two columns of its output. The next two columns represent the row and column in the matrix, as the normal `Y` operator does, and the final column has the value.
`N<col>`
: Numpy on dense partitioned matrix

Reduces over the columns before `<col>`, and then does the standard `N` operation.
```
$ ni i[a b 1 5] i[a b 100 500] i[a b -10 -20] \
     i[c d 1 0] i[c d 1 1] \
     NC'x = dot(x.T, x)'
a	b	10101	50205
a	b	50205	250425
c	d	2	1
c	d	1	1
```
`X<col>`
: Sparse-to-dense transform of partitioned data

`X<col>` inverts `Y<col>`.

```
$ ni i[a b c d] i[a b x y] i[a b foo bar] YC XC
a	b	c	d
a	b	x	y
a	b	foo	bar
```
Don't use things other than Perl. Here are the other things:

`m'<...>'`
: Ruby

- Applies the Ruby snippet `<...>` to each row of the stream.

`l'<...>'`
: Lisp

- Applies the Lisp snippet `<...>` to each row of the stream.

`y'<...>'`
: Python

- Applies the Python snippet `<...>` to each row of the stream.

I have only ever used the Ruby extension, and only to use a library not written in ni.

Keep in mind that code written in any other language will not be portable and will result in configuration headaches. For example, if you have a Ruby gem installed on your local machine and are able to run a ni spell locally, you will have to install the same gem on your remote machine to use it over SSH. Moreover, if you want to run the same task on your Hadoop cluster, you'll have to have the gem installed on every node of the cluster.

When I need another language for its library, I'll usually create a copy of the (part of the) library that I need and add it to ni instead.
The primary use of binary operations is to operate on data that is most effectively represented in raw binary form (for example, `.wav` files). See the binary docs until I figure out something useful.