Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simple TSV output format #645

Closed
wants to merge 1 commit into from

Conversation

amandasaurus
Copy link

I've added a @tsv output filter by copying the @csv filter. It outputs into TSV (tab-separated values) file. Some people have asked for this on #48 .

I haven't done anything with string escaping, because (a) I'm not sure how to use escape_string, and (b) I'm unsure what "the standard" TSV approach is. I think many unix tools don't escape tabs.

@pkoppstein
Copy link
Contributor

The proposed implementation of @tsv mainly succeeds but I would argue fails in one important case.

The successes are illustrated by the following "#-annotated" transcript:

$ jq -M -r -n '["a,b","c"] | @csv'
"a,b","c"                               # correct - quotation marks needed for embedded comma
$ jq -M -r -n '["a,b","c"] | @tsv'
a,b c                                   # correct - adding quotation marks would be wrong

$ jq -M -r -n '["a\"b","c"] | @csv'
"a""b","c"                              # correct - duplication of quotation mark is required by CSV standard
# jq -M -r -n '["a\"b","c"] | @tsv'
a"b c                                   # correct - duplication is unneeded and would be wrong

The failure is illustrated by the following:

$ jq -M -r -n '["a\tb","c"] | @csv'
"a  b","c"                          # acceptable
$ jq -M -r -n '["a\tb","c"] | @tsv'
a   b   c                           # incorrect (*)

(*) Apart from the fact that we requested two fields but got three, this result can be deemed to be incorrect because the IANA standard states:

Note that fields that contain tabs are not allowable in this encoding.

jq thus has a choice: to raise an error condition, or to retain the embedded escaped tab as such, i.e. in this case, to produce the six-character string:

a\tb    c

Raising an error condition here would be aggravating at best. Since the two-character sequence "\t" is often used to denote an embedded tab, not least within JSON itself, I believe that alternative is the better choice here.

(There is no need to consider here the case of embedded but unescaped tabs as they are not valid in JSON.)

@amandasaurus
Copy link
Author

That is a good point. I think replacing tab with \t is the best way. How does escape_string work? I can add the escaping to that.

@pkoppstein
Copy link
Contributor

escape_string escapes single characters, e.g. & to &. Since @tsv must produce valid JSON, escape_string is not applicable here.

That is, @tsv must convert occurrences of \t to \\t in JSON strings. For example, we want @tsv to convert the JSON string "a\tb" to the JSON string "a\\tb", so that ultimately (with the -r option) we get a\tb. As in:

jq -n -r '"a\\tb"'
a\tb

@amandasaurus
Copy link
Author

I understand. My apologies, I saw escape_string being used in @csv and thought it was a general purpose string escaping function. I will add this escaping

@nicowilliams
Copy link
Contributor

Can we merge this codepath with the @csv codepath? It seems the only things that should be different are: a) the separator (\t vs ,) and b) the need to escape it when present in strings in the input array.

@pkoppstein
Copy link
Contributor

@nicowilliams - In the interests of simplicity and brevity, the following has much to recommend it, except perhaps for the name:

def tsv: map( gsub("\t";"\\t") ) | join("\t");

Is there a way to add "@TSV" defined as above?

@amandasaurus
Copy link
Author

Can we merge this codepath with the @csv codepath? It seems the only things that should be different are: a) the separator (\t vs ,) and b) the need to escape it when present in strings in the input array.

That would probably be a good idea. However I'm not very good at C, so I took the simple, and direct, approach with this.

@nicowilliams
Copy link
Contributor

@pkoppstein @csv does not escape commas. It does place double-quotes around strings though (and escapes internal double-quotes).

@nicowilliams
Copy link
Contributor

@rory This is my proposed change:

diff --git a/builtin.c b/builtin.c
--- a/builtin.c
+++ b/builtin.c
@@ -358,13 +358,14 @@ static jv f_format(jq_state *jq, jv input, jv fmt) {
   } else if (!strcmp(fmt_s, "text")) {
     jv_free(fmt);
     return f_tostring(jq, input);
-  } else if (!strcmp(fmt_s, "csv")) {
+  } else if (!strcmp(fmt_s, "csv") || !strcmp(fmt_s, "tsv")) {
+    const char *sep = strcmp(fmt_s, "csv") == 0 ? "," : "\t";
     jv_free(fmt);
     if (jv_get_kind(input) != JV_KIND_ARRAY)
       return type_error(input, "cannot be csv-formatted, only array");
     jv line = jv_string("");
     jv_array_foreach(input, i, x) {
-      if (i) line = jv_string_append_str(line, ",");
+      if (i) line = jv_string_append_str(line, sep);
       switch (jv_get_kind(x)) {
       case JV_KIND_NULL:
         /* null rendered as empty string */

@pkoppstein
Copy link
Contributor

@nicowilliams wrote:

@pkoppstein @csv does not escape commas. It does place double-quotes around strings though (and escapes internal double-quotes).

Perhaps you were intending this remark for someone else. I merely observed that it seems to me that all that is needed to support @tsv would be along the lines:

def tsv: map( gsub("\t";"\\t") ) | join("\t");

except that I don't know how to use "def" to define @tsv.

The differences between @csv (complex) and @tsv (simple) are so great that I wonder whether tweaking the C code will yield the correct results so easily ....

@nicowilliams
Copy link
Contributor

@pkoppstein Ah, you're right that tsv is simpler: no need to quote fields, just s/\t/\\t/g, as you say.

Here's a new patch:

diff --git a/builtin.c b/builtin.c
index aad738c..c9d3131 100644
--- a/builtin.c
+++ b/builtin.c
@@ -358,13 +358,24 @@ static jv f_format(jq_state *jq, jv input, jv fmt) {
   } else if (!strcmp(fmt_s, "text")) {
     jv_free(fmt);
     return f_tostring(jq, input);
-  } else if (!strcmp(fmt_s, "csv")) {
+  } else if (!strcmp(fmt_s, "csv") || !strcmp(fmt_s, "tsv")) {
+    const char *quotes, *sep, *escapings;
+    if (!strcmp(fmt_s, "csv")) {
+      quotes = "\"";
+      sep = ",";
+      escapings = "\"\"\"\0";
+    } else {
+      assert(!strcmp(fmt_s, "tsv"));
+      quotes = "";
+      sep = "\t";
+      escapings = "\t\\t\0";
+    }
     jv_free(fmt);
     if (jv_get_kind(input) != JV_KIND_ARRAY)
       return type_error(input, "cannot be csv-formatted, only array");
     jv line = jv_string("");
     jv_array_foreach(input, i, x) {
-      if (i) line = jv_string_append_str(line, ",");
+      if (i) line = jv_string_append_str(line, sep);
       switch (jv_get_kind(x)) {
       case JV_KIND_NULL:
         /* null rendered as empty string */
@@ -383,9 +394,9 @@ static jv f_format(jq_state *jq, jv input, jv fmt) {
         }
         break;
       case JV_KIND_STRING: {
-        line = jv_string_append_str(line, "\"");
-        line = jv_string_concat(line, escape_string(x, "\"\"\"\0"));
-        line = jv_string_append_str(line, "\"");
+        line = jv_string_append_str(line, quotes);
+        line = jv_string_concat(line, escape_string(x, escapings));
+        line = jv_string_append_str(line, quotes);
         break;
       }

Also included, for paranoia:

diff --git a/builtin.c b/builtin.c
index aad738c..c9d3131 100644
--- a/builtin.c
+++ b/builtin.c
@@ -334,7 +334,7 @@ static jv escape_string(jv input, const char* escapings) {
   const char* cstart;
   int c = 0;
   while ((i = jvp_utf8_next((cstart = i), end, &c))) {
-    assert(c != -1);
+    assert(c > 0);
     if (c < 128 && lookup[c]) {
       ret = jv_string_append_str(ret, lookup[c]);
     } else {

@nicowilliams
Copy link
Contributor

That produces:

$ ./jq -rn '@tsv "\(["a,b\tc","d e","f",0])"'
a,b\tc  d e     f       0
$ ./jq -rn '@csv "\(["a,b\tc","d e","f",0])"'
"a,b    c","d e","f",0
$ 

which I think is what you meant. And it shows that tsvis simpler indeed.

Hmmm, we don't have fromcsv and fromtsvbuiltins. Should we?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants