Skip to content

Commit

Permalink
filename options for split (iss. #1365) (#1366)
Browse files Browse the repository at this point in the history
* #1365 - filename options for `split`

* Don't use joiner string when prefix is empty.
* Add option to specify joiner string.
* Add option to not URL-escape file names.

* #1365 - update documentation

* #1365 - don't URL-escape file name prefix

I **_thought_** it'd be cool to apply URL-escaping to the file name prefix as well, just in case it included spaces or other characters.  I forgot that a common use for the prefix is to specify a directory path that will contain the file.  When the slashes ("`/`") of the path are URL-escaped, they become "`%2F`" and the directories will not be created.  So, I moved the prefix handling code to come after the URL-escaping.

* #1365 - new `split` options for CLI help output

* #1365 - fix escape/suffix logic error

Trying to make the `return` statement cleaner, I thought it'd be good to add the file name suffix immediately after the file name is URL-escaped.  I'd forgotten that the suffix will not be added if the new `-e` option is used to skip URL-escaping.  So, I put the suffix back where I had it.

* #1365 - add `split` to the "10 minutes" document

Not strictly part of this issue, but as I was checking for docs that I should update as a result of my changes, I noticed this document showed how to split data using the `put` and `tee` combination, but not about the `split` verb.

* #1365 - updated manpage

When I ran `make dev`, generating `data-diving-examples.md` failed.  The two `manpage.txt` files ended up empty, but `mlr.1` seems to be correct.

---------

Co-authored-by: Mr. Lance E Sloan (sloanlance) <sloanlance@users.noreply.github.com>
  • Loading branch information
sloanlance and sloanlance authored Aug 23, 2023
1 parent 12f3b14 commit e233819
Show file tree
Hide file tree
Showing 6 changed files with 113 additions and 26 deletions.
37 changes: 37 additions & 0 deletions docs/src/10min.md
Original file line number Diff line number Diff line change
Expand Up @@ -909,3 +909,40 @@ yellow,triangle,true,1,11,43.6498,9.8870
purple,triangle,false,5,51,81.2290,8.5910
purple,triangle,false,7,65,80.1405,5.8240
</pre>

Alternatively, the `split` verb can do the same thing:

<pre class="pre-highlight-non-pair">
<b>mlr --csv --from example.csv split -g shape</b>
</pre>

<pre class="pre-highlight-in-pair">
<b>cat split_circle.csv</b>
</pre>
<pre class="pre-non-highlight-in-pair">
color,shape,flag,k,index,quantity,rate
red,circle,true,3,16,13.8103,2.9010
yellow,circle,true,8,73,63.9785,4.2370
yellow,circle,true,9,87,63.5058,8.3350
</pre>

<pre class="pre-highlight-in-pair">
<b>cat split_square.csv</b>
</pre>
<pre class="pre-non-highlight-in-pair">
color,shape,flag,k,index,quantity,rate
red,square,true,2,15,79.2778,0.0130
red,square,false,4,48,77.5542,7.4670
red,square,false,6,64,77.1991,9.5310
purple,square,false,10,91,72.3735,8.2430
</pre>

<pre class="pre-highlight-in-pair">
<b>cat split_triangle.csv</b>
</pre>
<pre class="pre-non-highlight-in-pair">
color,shape,flag,k,index,quantity,rate
yellow,triangle,true,1,11,43.6498,9.8870
purple,triangle,false,5,51,81.2290,8.5910
purple,triangle,false,7,65,80.1405,5.8240
</pre>
18 changes: 18 additions & 0 deletions docs/src/10min.md.in
Original file line number Diff line number Diff line change
Expand Up @@ -434,3 +434,21 @@ GENMD-EOF
GENMD-RUN-COMMAND
cat triangle.csv
GENMD-EOF

Alternatively, the `split` verb can do the same thing:

GENMD-RUN-COMMAND
mlr --csv --from example.csv split -g shape
GENMD-EOF

GENMD-RUN-COMMAND
cat split_circle.csv
GENMD-EOF

GENMD-RUN-COMMAND
cat split_square.csv
GENMD-EOF

GENMD-RUN-COMMAND
cat split_triangle.csv
GENMD-EOF
2 changes: 2 additions & 0 deletions docs/src/manpage.txt
Original file line number Diff line number Diff line change
Expand Up @@ -1813,6 +1813,8 @@ MILLER(1) MILLER(1)
--suffix {s} Specify filename suffix; default is from mlr output format, e.g. "csv".
-a Append to existing file(s), if any, rather than overwriting.
-v Send records along to downstream verbs as well as splitting to files.
-e Do NOT URL-escape names of output files.
-j {j} Use string J to join filename parts; default "_".
-h|--help Show this message.
Any of the output-format command-line flags (see mlr -h). For example, using
mlr --icsv --from myfile.csv split --ojson -n 1000
Expand Down
74 changes: 50 additions & 24 deletions internal/pkg/transformers/split.go
Original file line number Diff line number Diff line change
@@ -1,7 +1,6 @@
package transformers

import (
"bytes"
"container/list"
"fmt"
"net/url"
Expand All @@ -17,6 +16,7 @@ import (
// ----------------------------------------------------------------
const verbNameSplit = "split"
const splitDefaultOutputFileNamePrefix = "split"
const splitDefaultFileNamePartJoiner = "_"

var SplitSetup = TransformerSetup{
Verb: verbNameSplit,
Expand All @@ -39,6 +39,8 @@ Exactly one of -m, -n, or -g must be supplied.
--suffix {s} Specify filename suffix; default is from mlr output format, e.g. "csv".
-a Append to existing file(s), if any, rather than overwriting.
-v Send records along to downstream verbs as well as splitting to files.
-e Do NOT URL-escape names of output files.
-j {j} Use string J to join filename parts; default "`+splitDefaultFileNamePartJoiner+`".
-h|--help Show this message.
Any of the output-format command-line flags (see mlr -h). For example, using
mlr --icsv --from myfile.csv split --ojson -n 1000
Expand Down Expand Up @@ -88,6 +90,8 @@ func transformerSplitParseCLI(
var doSize bool = false
var groupByFieldNames []string = nil
var emitDownstream bool = false
var escapeFileNameCharacters bool = true
var fileNamePartJoiner string = splitDefaultFileNamePartJoiner
var doAppend bool = false
var outputFileNamePrefix string = splitDefaultOutputFileNamePrefix
var outputFileNameSuffix string = "uninit"
Expand Down Expand Up @@ -138,6 +142,12 @@ func transformerSplitParseCLI(
} else if opt == "-v" {
emitDownstream = true

} else if opt == "-e" {
escapeFileNameCharacters = false

} else if opt == "-j" {
fileNamePartJoiner = cli.VerbGetStringArgOrDie(verb, opt, args, &argi, argc)

} else {
// This is inelegant. For error-proofing we advance argi already in our
// loop (so individual if-statements don't need to). However,
Expand Down Expand Up @@ -180,6 +190,8 @@ func transformerSplitParseCLI(
doSize,
groupByFieldNames,
emitDownstream,
escapeFileNameCharacters,
fileNamePartJoiner,
doAppend,
outputFileNamePrefix,
outputFileNameSuffix,
Expand All @@ -195,14 +207,16 @@ func transformerSplitParseCLI(

// ----------------------------------------------------------------
type TransformerSplit struct {
n int64
outputFileNamePrefix string
outputFileNameSuffix string
emitDownstream bool
ungroupedCounter int64
groupByFieldNames []string
recordWriterOptions *cli.TWriterOptions
doAppend bool
n int64
outputFileNamePrefix string
outputFileNameSuffix string
emitDownstream bool
escapeFileNameCharacters bool
fileNamePartJoiner string
ungroupedCounter int64
groupByFieldNames []string
recordWriterOptions *cli.TWriterOptions
doAppend bool

// For doSize ungrouped: only one file open at a time
outputHandler output.OutputHandler
Expand All @@ -220,21 +234,25 @@ func NewTransformerSplit(
doSize bool,
groupByFieldNames []string,
emitDownstream bool,
escapeFileNameCharacters bool,
fileNamePartJoiner string,
doAppend bool,
outputFileNamePrefix string,
outputFileNameSuffix string,
recordWriterOptions *cli.TWriterOptions,
) (*TransformerSplit, error) {

tr := &TransformerSplit{
n: n,
outputFileNamePrefix: outputFileNamePrefix,
outputFileNameSuffix: outputFileNameSuffix,
emitDownstream: emitDownstream,
ungroupedCounter: 0,
groupByFieldNames: groupByFieldNames,
recordWriterOptions: recordWriterOptions,
doAppend: doAppend,
n: n,
outputFileNamePrefix: outputFileNamePrefix,
outputFileNameSuffix: outputFileNameSuffix,
emitDownstream: emitDownstream,
escapeFileNameCharacters: escapeFileNameCharacters,
fileNamePartJoiner: fileNamePartJoiner,
ungroupedCounter: 0,
groupByFieldNames: groupByFieldNames,
recordWriterOptions: recordWriterOptions,
doAppend: doAppend,

outputHandler: nil,
previousQuotient: -1,
Expand Down Expand Up @@ -402,13 +420,21 @@ func (tr *TransformerSplit) makeUngroupedOutputFileName(k int64) string {
func (tr *TransformerSplit) makeGroupedOutputFileName(
groupByFieldValues []*mlrval.Mlrval,
) string {
var buffer bytes.Buffer
buffer.WriteString(tr.outputFileNamePrefix)
var fileNameParts []string

for _, groupByFieldValue := range groupByFieldValues {
buffer.WriteString("_")
buffer.WriteString(url.QueryEscape(groupByFieldValue.String()))
fileNameParts = append(fileNameParts, groupByFieldValue.String())
}
buffer.WriteString(".")
buffer.WriteString(tr.outputFileNameSuffix)
return buffer.String()

fileName := strings.Join(fileNameParts, tr.fileNamePartJoiner)

if tr.escapeFileNameCharacters {
fileName = url.QueryEscape(fileName)
}

if tr.outputFileNamePrefix != "" {
fileName = tr.outputFileNamePrefix + tr.fileNamePartJoiner + fileName
}

return fileName + "." + tr.outputFileNameSuffix
}
6 changes: 4 additions & 2 deletions man/mlr.1
Original file line number Diff line number Diff line change
Expand Up @@ -2,12 +2,12 @@
.\" Title: mlr
.\" Author: [see the "AUTHOR" section]
.\" Generator: ./mkman.rb
.\" Date: 2023-08-20
.\" Date: 2023-08-22
.\" Manual: \ \&
.\" Source: \ \&
.\" Language: English
.\"
.TH "MILLER" "1" "2023-08-20" "\ \&" "\ \&"
.TH "MILLER" "1" "2023-08-22" "\ \&" "\ \&"
.\" -----------------------------------------------------------------
.\" * Portability definitions
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Expand Down Expand Up @@ -2296,6 +2296,8 @@ Exactly one of -m, -n, or -g must be supplied.
--suffix {s} Specify filename suffix; default is from mlr output format, e.g. "csv".
-a Append to existing file(s), if any, rather than overwriting.
-v Send records along to downstream verbs as well as splitting to files.
-e Do NOT URL-escape names of output files.
-j {j} Use string J to join filename parts; default "_".
-h|--help Show this message.
Any of the output-format command-line flags (see mlr -h). For example, using
mlr --icsv --from myfile.csv split --ojson -n 1000
Expand Down
2 changes: 2 additions & 0 deletions test/cases/cli-help/0001/expout
Original file line number Diff line number Diff line change
Expand Up @@ -997,6 +997,8 @@ Exactly one of -m, -n, or -g must be supplied.
--suffix {s} Specify filename suffix; default is from mlr output format, e.g. "csv".
-a Append to existing file(s), if any, rather than overwriting.
-v Send records along to downstream verbs as well as splitting to files.
-e Do NOT URL-escape names of output files.
-j {j} Use string J to join filename parts; default "_".
-h|--help Show this message.
Any of the output-format command-line flags (see mlr -h). For example, using
mlr --icsv --from myfile.csv split --ojson -n 1000
Expand Down

0 comments on commit e233819

Please sign in to comment.