CFE-2768: Added function to filter CSV by class expressions #3463

karlhto · 2018-12-10T10:08:59Z

Currently this only filters data; it does not remove duplicates or
the like, and it cannot sort yet.

This is not ready to be merged, but I want feedback.

@nickanderson if you want to try it, you could check out my branch and compile it.

cf-bottom · 2018-12-10T10:09:26Z

Thank you for submitting a pull request! Maybe @olehermanse can review this?

nickanderson · 2018-12-10T14:58:53Z

@cf-bottom jenkins, please

cf-bottom · 2018-12-10T15:00:24Z

Alright, I triggered a build:

https://ci.cfengine.com/job/pr-pipeline/1785/

craigcomstock · 2018-12-10T15:12:55Z

libpromises/evalfunction.c

+        Seq *seq = SeqParseCsvString(line);
+        if (seq == NULL)
+        {
+            return FnFailure();


Could we log a message that a particular line does not parse properly?

Wouldn't you need to free(line) if you are returning FnFailure()?
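
A minimal, standalone sketch of the two points above (log which line failed, and free the line before the early return). The names count_fields, copy_line and process_line are hypothetical stand-ins, not CFEngine functions, and fprintf stands in for the project's logging call:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a CSV line parser: returns the number of
 * comma-separated fields, or -1 if a double quote is left unterminated. */
static int count_fields(const char *line)
{
    int fields = 1;
    int in_quotes = 0;
    for (const char *p = line; *p != '\0'; p++)
    {
        if (*p == '"')
        {
            in_quotes = !in_quotes;
        }
        else if (*p == ',' && !in_quotes)
        {
            fields++;
        }
    }
    return in_quotes ? -1 : fields;
}

/* Hypothetical helper: heap-allocated copy of a string (NULL on failure). */
static char *copy_line(const char *text)
{
    size_t size = strlen(text) + 1;
    char *copy = malloc(size);
    if (copy != NULL)
    {
        memcpy(copy, text, size);
    }
    return copy;
}

/* Takes ownership of a heap-allocated line and frees it on every path. */
static int process_line(char *line, size_t line_number)
{
    if (line == NULL)
    {
        return -1;
    }
    int fields = count_fields(line);
    if (fields < 0)
    {
        /* Report which line failed, then release it before returning. */
        fprintf(stderr, "line %zu does not parse as CSV: '%s'\n",
                line_number, line);
        free(line);
        return -1;
    }
    free(line);
    return fields;
}
```

The point of the sketch is only that the failure path and the success path both release the buffer; how the real function reports the failure is up to the author.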

craigcomstock · 2018-12-10T15:23:00Z

libpromises/evalfunction.c

+            {
+                JsonObjectAppendString(json_object,
+                                       SeqAt(heading_seq, i),
+                                       SeqAt(seq, i));


It seems you would need to range check the heading_seq length against the sequence length for each line. SeqAt() only has asserts, so its return seq->data[i] could read out of range.
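
One way to picture the range check, as a standalone sketch. StringSeq, string_seq_at_checked and row_matches_heading are hypothetical names used only for illustration; the real Seq type is not reproduced here:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical minimal sequence type, standing in for the real Seq. */
typedef struct
{
    const char **data;
    size_t length;
} StringSeq;

/* Returns the element at index i, or NULL when i is out of range. */
static const char *string_seq_at_checked(const StringSeq *seq, size_t i)
{
    if (seq == NULL || i >= seq->length)
    {
        return NULL;
    }
    return seq->data[i];
}

/* A row can be paired with the heading only if the lengths match. */
static int row_matches_heading(const StringSeq *heading, const StringSeq *row)
{
    return heading != NULL && row != NULL && heading->length == row->length;
}
```

Checking the lengths once per row, before the loop that pairs heading and value, avoids relying on asserts that are compiled out in release builds.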

nickanderson

I grabbed a build with this function present. Either I am not doing something correctly, or it doesn't work.

I tested with a very simple data file:

linux,1,net.ipv4.ip_forward,0
router,0,net.ipv4.ip_forward,1

And a simple policy to filter the data and show the resulting data container.

  bundle agent main
  {
      vars:
        "data_file" string => "/tmp/data-file.csv";
        "d" data => classexpression_filterdata( $(data_file),
                                                1,      # Column containing class expression to filter with
                                                ",",    # Delim
                                                "false", # Data file contains column headings
                                                1);      # Column to sort by

      reports:
        "$(with)" with => string_mustache( "{{%-top-}}", d );
  }

But the resulting data container is empty.

R: []

When running on a linux host that does not have the class router defined, I expect a result similar to this:

  { 
    "0": "linux",
    "1": "1",
    "2": "net.ipv4.ip_forward",
    "3": "0"
  }

nickanderson · 2018-12-10T19:35:01Z

libpromises/evalfunction.c

+static const FnCallArg CLASSEXPRESSION_FILTERDATA_ARGS[] =
+{
+    {CF_ABSPATHRANGE, CF_DATA_TYPE_STRING, "File name to read"},
+    {CF_VALRANGE, CF_DATA_TYPE_INT, "Column with class expression"},


Is this the column name or column index position?
If it is an index position, is it counting from 0 or 1?
If it can be a column name, will it error if the column name is not found?

Currently only column position. For the sake of consistency, it should definitely count from 0 like everything else (which is also what it does).

I am unsure if allowing both column position and name is a good idea. If column names are numbers, how will the function know whether to count the given parameter as a column index or name? If there are several matches of the column name, does it just go for the first match?

If this should be allowed, it should be decided by has_heading, but I would say it's better to only use indexes. Might be different for the sort_by column.

+1 for only indices, keep it simple. I think you should update the description string to say:

"Column index with class expression"

nickanderson · 2018-12-10T19:37:00Z

libpromises/evalfunction.c

+{
+    {CF_ABSPATHRANGE, CF_DATA_TYPE_STRING, "File name to read"},
+    {CF_VALRANGE, CF_DATA_TYPE_INT, "Column with class expression"},
+    {CF_ANYSTRING, CF_DATA_TYPE_STRING, "String to delimit data by"},


Is this a literal string delimiter, or is it a regular expression?

Currently this is not used. The problems I see with allowing you to choose the delimiters are:

1. To change this in csv_parser.c, the code has to be refactored a bunch. This will probably take a bit of time, since the code has a lot of spaghetti.

2. If it is to be done "on top" of the CSV parsing -- as in replacing occurrences of a given delimiter with , -- you can end up with situations where one line suddenly has more delimiters than intended because a comma was used as well.

I do however think that it should be changed. Allowing CFEngine to read CSV-files with non-comma delimiters would be a good idea. It's just that it will probably take more time, and if it should work for this function, it should also work for readcsv().

2 is no good, we would have to change the csv parser as you mention in 1. I think you should remove this function param, it can be added as a last optional parameter later.
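
The problem with rewriting delimiters "on top" of the comma parser can be shown with a small standalone sketch. Both function names are hypothetical, and quoting is deliberately ignored to keep the example short:

```c
#include <assert.h>
#include <stddef.h>

/* Counts comma-separated fields, ignoring quoting for simplicity. */
static size_t count_comma_fields(const char *s)
{
    size_t fields = 1;
    for (; *s != '\0'; s++)
    {
        if (*s == ',')
        {
            fields++;
        }
    }
    return fields;
}

/* Naively rewrites every occurrence of delim to ',' in place. */
static void replace_delimiter(char *s, char delim)
{
    for (; *s != '\0'; s++)
    {
        if (*s == delim)
        {
            *s = ',';
        }
    }
}
```

With a row such as web;1;Hello, world there are three intended fields, but after the rewrite the comma inside the text is indistinguishable from a delimiter and the row splits into four.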

nickanderson · 2018-12-10T19:37:12Z

libpromises/evalfunction.c

+    {CF_VALRANGE, CF_DATA_TYPE_INT, "Column with class expression"},
+    {CF_ANYSTRING, CF_DATA_TYPE_STRING, "String to delimit data by"},
+    {CF_BOOL, CF_DATA_TYPE_OPTION, "CSV file has heading"},
+    {CF_ANYSTRING, CF_DATA_TYPE_STRING, "Column to sort by"},


Is this the column name or column index position?
If it is an index position, is it counting from 0 or 1?
If it can be a column name, will it error if the column name is not found?

Currently not implemented properly. I think that this should be explicitly based on the index, but in case you want both name and index to be possible, it should be decided based on the has_heading parameter.

Add the word index to description.

karlhto · 2018-12-11T13:15:00Z

Thanks for the input! :)

What should the function do if line length is inconsistent or the CSV-file is corrupted in some other way? Should it just FnFailure(), or should it skip the corrupted lines?

nickanderson · 2018-12-11T14:38:46Z

With line based data, if one line is corrupt, then I think just that one record should be skipped, and we should emit some kind of warning with the info. One of the nice things about CSV, is that one bad record does not invalidate all records as you would typically get with JSON or some other structured data.
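
A standalone sketch of the "skip the bad record and warn" behaviour described above. The names field_count and keep_valid_rows are hypothetical, fprintf stands in for a warning-level log call, and quoting is ignored for brevity:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Counts comma-separated fields, ignoring quoting for simplicity. */
static size_t field_count(const char *line)
{
    size_t fields = 1;
    for (const char *p = line; *p != '\0'; p++)
    {
        if (*p == ',')
        {
            fields++;
        }
    }
    return fields;
}

/* Copies pointers to rows with the expected number of fields into kept[]
 * and returns how many were kept. Rows of any other length are skipped
 * with a warning instead of failing the whole file. */
static size_t keep_valid_rows(const char *const *lines, size_t n_lines,
                              size_t expected, const char **kept)
{
    size_t n_kept = 0;
    for (size_t i = 0; i < n_lines; i++)
    {
        size_t found = field_count(lines[i]);
        if (found != expected)
        {
            fprintf(stderr,
                    "warning: skipping line %zu: expected %zu fields, found %zu\n",
                    i + 1, expected, found);
            continue;
        }
        kept[n_kept++] = lines[i];
    }
    return n_kept;
}
```

The caller supplies a kept[] array with room for n_lines entries, so no allocation is needed in the sketch.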

nickanderson · 2018-12-11T16:33:34Z

@cf-bottom jenkins, please

cf-bottom · 2018-12-11T16:34:45Z

Sure, I triggered a build:

https://ci.cfengine.com/job/pr-pipeline/1794/

nickanderson · 2018-12-11T17:17:28Z

./01_vars/02_functions/classexpression_filterdata_1.cf FAIL (UNEXPECTED FAILURE)
Makefile.testall:104: recipe for target '01_vars/02_functions/classexpression_filterdata_1.cf_rule' failed
make: [01_vars/02_functions/classexpression_filterdata_1.cf_rule] Error 1 (ignored)

nickanderson · 2018-12-11T18:15:20Z

./01_vars/02_functions/classexpression_filterdata_1.cf FAIL (UNEXPECTED FAILURE)
Makefile.testall:104: recipe for target '01_vars/02_functions/classexpression_filterdata_1.cf_rule' failed
make: [01_vars/02_functions/classexpression_filterdata_1.cf_rule] Error 1 (ignored)

So, the data container returned was empty.

2018-12-11T16:53:00+0000     info: Created directory "/home/jenkins/workspace/testing-pr/label/PACKAGES_HUB_x86_64_linux_redhat_7/cfengine-3.14.0a.283cd76/tests/acceptance/workdir/__01_vars_02_functions_classexpression_filterdata_1_cf/tmp/TESTDIR.cfengine/."
R: test description: Test that classexpression_filterdata() works with column headings filtered by first column.
R: /home/jenkins/workspace/testing-pr/label/PACKAGES_HUB_x86_64_linux_redhat_7/cfengine-3.14.0a.283cd76/tests/acceptance/./01_vars/02_functions/classexpression_filterdata_1.cf FAIL
R: dcs_passif: failing based on class "TokenValueLength_OK"
R: Function returned:[]

karlhto · 2018-12-12T20:41:11Z

@cf-bottom jenkins dude

cf-bottom · 2018-12-12T20:44:26Z

Sure, I triggered a build:

https://ci.cfengine.com/job/pr-pipeline/1797/

karlhto · 2018-12-13T15:05:49Z

@cf-bottom jenkins, my dude

cf-bottom · 2018-12-13T15:09:37Z

Sure, I triggered a build:

https://ci.cfengine.com/job/pr-pipeline/1806/

nickanderson · 2018-12-13T20:54:53Z

@cf-bottom jenkins, pwease

cf-bottom · 2018-12-13T20:59:36Z

Alright, I triggered a build:

https://ci.cfengine.com/job/pr-pipeline/1808/

karlhto · 2018-12-14T13:46:32Z

I realise that I should probably move the contents of this function to json-utils.c or something for more readability.

karlhto · 2018-12-14T13:47:00Z

@cf-bottom jenkins :v

cf-bottom · 2018-12-14T13:49:56Z

Alright, I triggered a build:

https://ci.cfengine.com/job/pr-pipeline/1813/

olehermanse

Logic looks good overall. Haven't reviewed the tests yet.

olehermanse · 2018-12-17T12:46:25Z

libpromises/evalfunction.c

+static const FnCallArg CLASSEXPRESSION_FILTERDATA_ARGS[] =
+{
+    {CF_ABSPATHRANGE, CF_DATA_TYPE_STRING, "File name to read"},
+    {CF_VALRANGE, CF_DATA_TYPE_INT, "Column with class expression"},


+1 for only indices, keep it simple. I think you should update the description string to say:

"Column index with class expression"

olehermanse · 2018-12-17T12:47:53Z

libpromises/evalfunction.c

+{
+    {CF_ABSPATHRANGE, CF_DATA_TYPE_STRING, "File name to read"},
+    {CF_VALRANGE, CF_DATA_TYPE_INT, "Column with class expression"},
+    {CF_ANYSTRING, CF_DATA_TYPE_STRING, "String to delimit data by"},


2 is no good, we would have to change the csv parser as you mention in 1. I think you should remove this function param, it can be added as a last optional parameter later.

olehermanse · 2018-12-17T12:48:20Z

libpromises/evalfunction.c

+    {CF_VALRANGE, CF_DATA_TYPE_INT, "Column with class expression"},
+    {CF_ANYSTRING, CF_DATA_TYPE_STRING, "String to delimit data by"},
+    {CF_BOOL, CF_DATA_TYPE_OPTION, "CSV file has heading"},
+    {CF_ANYSTRING, CF_DATA_TYPE_STRING, "Column to sort by"},


Add the word index to description.

karlhto · 2018-12-17T13:10:19Z

@olehermanse While I personally think that sticking to indices would be better, I think @nickanderson (and probably other users) would prefer being able to find a column by its name when the file has a header. It would be nice to know which is better.

https://tracker.mender.io/browse/CFE-2768 is probably the best place to discuss that.

olehermanse · 2018-12-17T13:36:24Z

@karlhto I would still stick to indices for the simple reason that it is faster to implement. To support both without ambiguity we could add a syntax for it later; for example, a quoted string or a value beginning with # could mean a column name instead of an index.
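
Purely as an illustration of the "# means column name" idea floated above (nothing in this thread says it was implemented), a standalone sketch might look like this. ColumnSpec, ColumnKind and parse_column_spec are hypothetical names:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

typedef enum { COLUMN_BY_INDEX, COLUMN_BY_NAME, COLUMN_INVALID } ColumnKind;

typedef struct
{
    ColumnKind kind;
    size_t index;      /* valid when kind == COLUMN_BY_INDEX */
    const char *name;  /* valid when kind == COLUMN_BY_NAME; points into spec */
} ColumnSpec;

/* "#name" selects a column by name, a plain decimal number selects it
 * by index, and anything else is rejected. */
static ColumnSpec parse_column_spec(const char *spec)
{
    ColumnSpec result = { COLUMN_INVALID, 0, NULL };
    if (spec == NULL || spec[0] == '\0')
    {
        return result;
    }
    if (spec[0] == '#')
    {
        if (spec[1] != '\0')
        {
            result.kind = COLUMN_BY_NAME;
            result.name = spec + 1;
        }
        return result;
    }
    /* strtoul() accepts leading whitespace and signs, so require a digit. */
    if (spec[0] < '0' || spec[0] > '9')
    {
        return result;
    }
    errno = 0;
    char *end = NULL;
    unsigned long value = strtoul(spec, &end, 10);
    if (errno != 0 || *end != '\0')
    {
        return result;
    }
    result.kind = COLUMN_BY_INDEX;
    result.index = (size_t) value;
    return result;
}
```

Under this scheme a column whose name is the number 2 is written #2, which removes the ambiguity raised earlier about numeric column names.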

karlhto · 2018-12-17T13:38:37Z

Alright. I'll stick to indices for now.

olehermanse · 2018-12-17T13:54:21Z

@karlhto tell me when you've addressed my comments, fixed the commit history and tests are ready, and I'll review again.

olehermanse · 2019-03-19T09:17:52Z

libpromises/evalfunction.c

+        {
+            Log(LOG_LEVEL_VERBOSE,
+                    "%s: sorting column (%zu) is the same as class "
+                    "expression column (%zu). Not sorting data container.",


Why this limitation when sort_arg is optional anyway?

IMO, if you want to keep this, it should be bumped up to WARNING log level. The user is specifying a sorting column which you have defined to be invalid.

olehermanse

Looks good overall. I added some smaller comments, but they are not blockers. Wait for @vpodzime also.

olehermanse · 2019-03-19T09:21:38Z

@cf-bottom jenkins with exotics

cf-bottom · 2019-03-19T09:25:09Z

Sure, I triggered a build:

(with exotics)

https://ci.cfengine.com/job/pr-pipeline/2189/

vpodzime

Looks good to me otherwise.

vpodzime · 2019-03-22T13:05:52Z

libpromises/evalfunction.c

+            if (class_index >= num_columns)
+            {
+                Log(LOG_LEVEL_ERR,
+                    "%s: Class expression index is out of bounds. Row "


Split the log message on punctuation ;-)

vpodzime · 2019-03-22T13:09:12Z

libpromises/evalfunction.c

+            }
+
+            SeqRemove(list, class_index);
+            JsonElement *json_object = JsonObjectCreate(num_columns);


I think a better name than the generic json_object could make the code that follows more readable and self-explanatory.

vpodzime · 2019-03-22T13:10:40Z

libpromises/evalfunction.c

+        {
+            Log(LOG_LEVEL_WARNING,
+                    "%s: sorting index %zu out of bounds. Not sorting data "
+                    "container.",


Again, split on punctuation. Or just add that one more word to the same line, really. :)

vpodzime

Some last nitpicks.

vpodzime · 2019-03-25T10:19:00Z

libpromises/evalfunction.c

+                               void *user_data)
+{
+    assert(JsonGetContainerType(left_obj) == JSON_ELEMENT_TYPE_PRIMITIVE);
+    assert(JsonGetContainerType(right_obj) == JSON_ELEMENT_TYPE_PRIMITIVE);


Now that I'm looking at this, maybe the function could get a better name like JsonPrimitiveComparator?

vpodzime · 2019-03-25T10:20:01Z

libpromises/evalfunction.c

+    size_t const index = user_data;
+    char const *left = JsonPrimitiveGetAsString(JsonAt(left_obj, index));
+    char const *right = JsonPrimitiveGetAsString(JsonAt(right_obj, index));
+    return strcmp(left, right);


This should use StringSafeCompare.
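
The likely motivation is that strcmp() has undefined behaviour when either argument is NULL. The following standalone sketch shows what a NULL-tolerant comparison can look like; null_safe_compare is a hypothetical name and the ordering chosen for NULL (sorting before any string) is an assumption, not a statement about how StringSafeCompare behaves:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Compares two strings, tolerating NULL on either side.
 * Assumption for this sketch: NULL sorts before any non-NULL string. */
static int null_safe_compare(const char *a, const char *b)
{
    if (a == b)
    {
        return 0;  /* same pointer, including both NULL */
    }
    if (a == NULL)
    {
        return -1;
    }
    if (b == NULL)
    {
        return 1;
    }
    return strcmp(a, b);
}
```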

karlhto · 2019-03-25T12:54:37Z

@cf-bottom Jenkins with exotics.

cf-bottom · 2019-03-25T12:54:57Z

Alright, I triggered a build:

(with exotics)

Jenkins: https://ci.cfengine.com/job/pr-pipeline/2215/

Packages: http://buildcache.cfengine.com/packages/testing-pr/jenkins-pr-pipeline-2215/

Takes arguments `path`, `heading`, `class_index` and optionally `sort_index`. Generates a data container filtered by defined classes, which is sorted by the elements in column `sort_index` of the CSV file.
Ticket: CFE-2768
Changelog: Title

Tests with header, without header, sorting and invalid entries.
Ticket: CFE-2768
Changelog: None

karlhto · 2019-03-29T12:59:33Z

@cf-bottom jenkins with exotics

@olehermanse @vpodzime I followed up the last of your nitpicks. :)

cf-bottom · 2019-03-29T13:00:27Z

Alright, I triggered a build:

(with exotics)

Jenkins: https://ci.cfengine.com/job/pr-pipeline/2257/

Packages: http://buildcache.cfengine.com/packages/testing-pr/jenkins-pr-pipeline-2257/

vpodzime

Looks good to me.

vpodzime · 2019-04-01T09:33:47Z

@olehermanse please review and merge this if you find it 👌

olehermanse

Looks good overall!

olehermanse · 2019-04-01T15:37:46Z

libpromises/evalfunction.c

+                                   JsonElement const *right_obj,
+                                   void *user_data)
+{
+    size_t const index = *(size_t *)user_data;


Style nitpick:

size_t const index = *((size_t *) user_data);

(Space after cast, and paren to show what you are dereferencing.)
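
For context, an earlier revision quoted in this thread assigned user_data directly to a size_t; the version above dereferences it instead. A standalone sketch of passing a sort index through a void pointer follows. Row, compare_rows_by_column and sort_rows are hypothetical names, and a plain insertion sort stands in for whatever sort routine the real code uses:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed-width row of strings. */
typedef struct
{
    const char *columns[3];
} Row;

typedef int (*RowComparator)(const Row *left, const Row *right,
                             void *user_data);

/* user_data points at a size_t holding the column index to compare on. */
static int compare_rows_by_column(const Row *left, const Row *right,
                                  void *user_data)
{
    size_t const index = *((size_t *) user_data);
    return strcmp(left->columns[index], right->columns[index]);
}

/* Simple insertion sort that forwards user_data to the comparator. */
static void sort_rows(Row *rows, size_t n, RowComparator cmp, void *user_data)
{
    for (size_t i = 1; i < n; i++)
    {
        Row tmp = rows[i];
        size_t j = i;
        while (j > 0 && cmp(&rows[j - 1], &tmp, user_data) > 0)
        {
            rows[j] = rows[j - 1];
            j--;
        }
        rows[j] = tmp;
    }
}
```

The caller keeps the index in a local variable and passes its address, so the variable must stay alive for the duration of the sort.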

olehermanse · 2019-04-01T15:57:14Z

libpromises/evalfunction.c

+    Seq *heading = NULL;
+    JsonElement *json = JsonArrayCreate(50);
+    char *line;
+    size_t num_columns = 0;


Initializing only the necessary variables, good! ;)

olehermanse · 2019-05-21T14:27:12Z

libpromises/evalfunction.c

@@ -7128,7 +7312,6 @@ static FnCallResult FnCallFileSexist(EvalContext *ctx, ARG_UNUSED const Policy *
        {
            file_found = false;
        }
-        free(val);


@vpodzime @karlhto oops, memory leak introduced.

olehermanse self-requested a review December 10, 2018 12:29

craigcomstock added the WIP Work in Progress label Dec 10, 2018

craigcomstock reviewed Dec 10, 2018

nickanderson reviewed Dec 10, 2018

karlhto force-pushed the CFE-2768 branch 2 times, most recently from da93eb1 to f1aa864 Compare December 11, 2018 16:10

olehermanse requested changes Dec 17, 2018

karlhto force-pushed the CFE-2768 branch from 9214766 to 68745ed Compare December 17, 2018 15:04

olehermanse reviewed Mar 19, 2019

olehermanse previously approved these changes Mar 19, 2019

olehermanse removed the WIP Work in Progress label Mar 19, 2019

olehermanse requested a review from vpodzime March 19, 2019 09:21

vpodzime reviewed Mar 22, 2019

karlhto dismissed olehermanse’s stale review via a788927 March 25, 2019 09:17

karlhto force-pushed the CFE-2768 branch 2 times, most recently from a788927 to 089f2dc Compare March 25, 2019 09:42

vpodzime reviewed Mar 25, 2019

karlhto force-pushed the CFE-2768 branch from 089f2dc to dc9b86e Compare March 25, 2019 10:38

karlhto force-pushed the CFE-2768 branch 2 times, most recently from 69033b0 to 6f339ed Compare March 25, 2019 11:56

karlhto force-pushed the CFE-2768 branch from 6f339ed to a5b9c6e Compare March 25, 2019 12:46

olehermanse self-requested a review March 25, 2019 13:16

karlhto added 2 commits March 29, 2019 13:58

CFE-2768: Added tests for classfiltercsv

dc7714c

Tests with header, without header, sorting and invalid entries.
Ticket: CFE-2768
Changelog: None

karlhto force-pushed the CFE-2768 branch from a5b9c6e to dc7714c Compare March 29, 2019 12:58

vpodzime approved these changes Apr 1, 2019

olehermanse approved these changes Apr 2, 2019

olehermanse merged commit 1bd81dd into cfengine:master Apr 2, 2019

olehermanse reviewed May 21, 2019
