Skip to content

Conversation

karlhto
Copy link
Contributor

@karlhto karlhto commented Dec 10, 2018

Will currently only filter data, does not remove duplicates and
such. Also cannot sort yet.

This is not ready to be merged, but I want feedback.

@nickanderson if you want to try it, you could check out my branch and compile it.

@cf-bottom
Copy link

Thank you for submitting a pull request! Maybe @olehermanse can review this?

@olehermanse olehermanse self-requested a review December 10, 2018 12:29
@nickanderson
Copy link
Member

@cf-bottom jenkins, please

@cf-bottom
Copy link

Alright, I triggered a build:

Build Status

https://ci.cfengine.com/job/pr-pipeline/1785/

@craigcomstock craigcomstock added the WIP Work in Progress label Dec 10, 2018
Seq *seq = SeqParseCsvString(line);
if (seq == NULL)
{
return FnFailure();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we log a message that a particular line does not parse properly?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't you need to free(line) if you are returning FnFailure()?

{
JsonObjectAppendString(json_object,
SeqAt(heading_seq, i),
SeqAt(seq, i));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems you would need to range check the heading_seq length against the sequence length for each line. SeqAt() only has asserts so it's return seq->data[i] could get out of range.

Copy link
Member

@nickanderson nickanderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I grabbed a build with this function present. Either I am not doing something correctly, or it doesn't work.

I tested with a very simple data file:

linux,1,net.ipv4.ip_forward,0
router,0,net.ipv4.ip_forward,1

And a simple policy to filter the data and show the resulting data container.

  bundle agent main
  {
      vars:
        "data_file" string => "/tmp/data-file.csv";
        "d" data => classexpression_filterdata( $(data_file),
                                                1,      # Column containing class expression to filter with
                                                ",",    # Delim
                                                "false", # Data file contains column headings
                                                1);      # Column to sort by

      reports:
        "$(with)" with => string_mustache( "{{%-top-}}", d );
  }

But the resulting data container is empty.

R: []

When running on a linux host that does not have the class router defined, I expect a result similar to this:

  { 
    "0": "linux",
    "1": "1",
    "2": "net.ipv4.ip_forward",
    "3": "0"
  }

static const FnCallArg CLASSEXPRESSION_FILTERDATA_ARGS[] =
{
{CF_ABSPATHRANGE, CF_DATA_TYPE_STRING, "File name to read"},
{CF_VALRANGE, CF_DATA_TYPE_INT, "Column with class expression"},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the column name or column index position?
If it is an index position, is it counting from 0 or 1?
If it is can be a column name, will it error if the column name is not found?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently only column position. For the sake of consistency, it should definitely count from 0 like everything else (which is also what it does).

I am unsure if allowing both column position and name is a good idea. If column names are numbers, how will the function know whether to count the given parameter as a column index or name? If there are several matches of the column name, does it just go for the first match?

If this should be allowed, it should be decided by has_heading, but I would say it's better to only use indexes. Might be different for the sort_by column.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for only indices, keep it simple. I think you should update the description string to say:

"Column index with class expression"

{
{CF_ABSPATHRANGE, CF_DATA_TYPE_STRING, "File name to read"},
{CF_VALRANGE, CF_DATA_TYPE_INT, "Column with class expression"},
{CF_ANYSTRING, CF_DATA_TYPE_STRING, "String to delimit data by"},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a literal string delimiter, or is it a regular expression?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently this is not used. The problems I see with allowing you to choose the delimiters are:

  1. To change this in csv_parser.c, the code has to be refactored a bunch. This will probably take a bit of time, since the code has a lot of spaghetti.
  2. If it is to be done "on top" of the CSV parsing -- as in replacing occurrences of a given delimiter with , -- you can end up with situations where one line suddenly has more delimiters than intended because a comma was used as well.

I do however think that it should be changed. Allowing CFEngine to read CSV-files with non-comma delimiters would be a good idea. It's just that it will probably take more time, and if it should work for this function, it should also work for readcsv().

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 is no good, we would have to change the csv parser as you mention in 1. I think you should remove this function param, it can be added as a last optional parameter later.

{CF_VALRANGE, CF_DATA_TYPE_INT, "Column with class expression"},
{CF_ANYSTRING, CF_DATA_TYPE_STRING, "String to delimit data by"},
{CF_BOOL, CF_DATA_TYPE_OPTION, "CSV file has heading"},
{CF_ANYSTRING, CF_DATA_TYPE_STRING, "Column to sort by"},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the column name or column index position?
If it is an index position, is it counting from 0 or 1?
If it is can be a column name, will it error if the column name is not found?

Copy link
Contributor Author

@karlhto karlhto Dec 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently not implemented properly. I think that this should be explicitly based on the index, but in case you want both name and index to be possible, it should be decided based on the has_heading parameter.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add the word index to description.

@karlhto
Copy link
Contributor Author

karlhto commented Dec 11, 2018

Thanks for the input! :)

What should the function do if line length is inconsistent or the CSV-file is corrupted in some other way? Should it just FnFailure(), or should it skip the corrupted lines?

@nickanderson
Copy link
Member

Thanks for the input! :)

What should the function do if line length is inconsistent or the CSV-file is corrupted in some other way? Should it just FnFailure(), or should it skip the corrupted lines?

With line based data, if one line is corrupt, then I think just that one record should be skipped, and we should emit some kind of warning with the info. One of the nice things about CSV, is that one bad record does not invalidate all records as you would typically get with JSON or some other structured data.

@karlhto karlhto force-pushed the CFE-2768 branch 2 times, most recently from da93eb1 to f1aa864 Compare December 11, 2018 16:10
@nickanderson
Copy link
Member

@cf-bottom jenkins, please

@cf-bottom
Copy link

Sure, I triggered a build:

Build Status

https://ci.cfengine.com/job/pr-pipeline/1794/

@nickanderson
Copy link
Member

./01_vars/02_functions/classexpression_filterdata_1.cf FAIL (UNEXPECTED FAILURE)
Makefile.testall:104: recipe for target '01_vars/02_functions/classexpression_filterdata_1.cf_rule' failed
make: [01_vars/02_functions/classexpression_filterdata_1.cf_rule] Error 1 (ignored)

@nickanderson
Copy link
Member

./01_vars/02_functions/classexpression_filterdata_1.cf FAIL (UNEXPECTED FAILURE)
Makefile.testall:104: recipe for target '01_vars/02_functions/classexpression_filterdata_1.cf_rule' failed
make: [01_vars/02_functions/classexpression_filterdata_1.cf_rule] Error 1 (ignored)

So, the datacontainer returned was empty.

2018-12-11T16:53:00+0000     info: Created directory "/home/jenkins/workspace/testing-pr/label/PACKAGES_HUB_x86_64_linux_redhat_7/cfengine-3.14.0a.283cd76/tests/acceptance/workdir/__01_vars_02_functions_classexpression_filterdata_1_cf/tmp/TESTDIR.cfengine/."
R: test description: Test that classexpression_filterdata() works with column headings filtered by first column.
R: /home/jenkins/workspace/testing-pr/label/PACKAGES_HUB_x86_64_linux_redhat_7/cfengine-3.14.0a.283cd76/tests/acceptance/./01_vars/02_functions/classexpression_filterdata_1.cf FAIL
R: dcs_passif: failing based on class "TokenValueLength_OK"
R: Function returned:[]

@karlhto
Copy link
Contributor Author

karlhto commented Dec 12, 2018

@cf-bottom jenkins dude

@cf-bottom
Copy link

Sure, I triggered a build:

Build Status

https://ci.cfengine.com/job/pr-pipeline/1797/

@karlhto
Copy link
Contributor Author

karlhto commented Dec 13, 2018

@cf-bottom jenkins, my dude

@cf-bottom
Copy link

Sure, I triggered a build:

Build Status

https://ci.cfengine.com/job/pr-pipeline/1806/

@nickanderson
Copy link
Member

@cf-bottom jenkins, pwease

@cf-bottom
Copy link

Alright, I triggered a build:

Build Status

https://ci.cfengine.com/job/pr-pipeline/1808/

@karlhto
Copy link
Contributor Author

karlhto commented Dec 14, 2018

I realise that I should probably move the contents of this function to json-utils.c or something for more readability.

@karlhto
Copy link
Contributor Author

karlhto commented Dec 14, 2018

@cf-bottom jenkins :v

@cf-bottom
Copy link

Alright, I triggered a build:

Build Status

https://ci.cfengine.com/job/pr-pipeline/1813/

Copy link
Member

@olehermanse olehermanse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logic looks good overall. Haven't reviewed the tests yet.

static const FnCallArg CLASSEXPRESSION_FILTERDATA_ARGS[] =
{
{CF_ABSPATHRANGE, CF_DATA_TYPE_STRING, "File name to read"},
{CF_VALRANGE, CF_DATA_TYPE_INT, "Column with class expression"},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for only indices, keep it simple. I think you should update the description string to say:

"Column index with class expression"

{
{CF_ABSPATHRANGE, CF_DATA_TYPE_STRING, "File name to read"},
{CF_VALRANGE, CF_DATA_TYPE_INT, "Column with class expression"},
{CF_ANYSTRING, CF_DATA_TYPE_STRING, "String to delimit data by"},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 is no good, we would have to change the csv parser as you mention in 1. I think you should remove this function param, it can be added as a last optional parameter later.

{CF_VALRANGE, CF_DATA_TYPE_INT, "Column with class expression"},
{CF_ANYSTRING, CF_DATA_TYPE_STRING, "String to delimit data by"},
{CF_BOOL, CF_DATA_TYPE_OPTION, "CSV file has heading"},
{CF_ANYSTRING, CF_DATA_TYPE_STRING, "Column to sort by"},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add the word index to description.

@karlhto
Copy link
Contributor Author

karlhto commented Dec 17, 2018

@olehermanse While I personally think that sticking to indices would be better, I think @nickanderson (and probably other users) would prefer being able to use find column based on column names if header is given as an argument. Would be nice to know which is better.

https://tracker.mender.io/browse/CFE-2768 is probably the best place to discuss that.

@olehermanse
Copy link
Member

@karlhto I would still stick to indices for the simple reason that it is faster to implement. To support both without ambiguity we could add a syntax for it later, for example quoted string or beginning with # could mean column name instead of index.

@karlhto
Copy link
Contributor Author

karlhto commented Dec 17, 2018

Alright. I'll stick to indices for now.

@olehermanse
Copy link
Member

@karlhto tell me when you've addressed my comments, fixed the commit history and tests are ready, and I'll review again.

{
Log(LOG_LEVEL_VERBOSE,
"%s: sorting column (%zu) is the same as class "
"expression column (%zu). Not sorting data container.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this limitation when sort_arg is optional anyway?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, if you want to keep this, it should be bumped up to WARNING log level. The user is specifying a sorting column which you have defined to be invalid.

olehermanse
olehermanse previously approved these changes Mar 19, 2019
Copy link
Member

@olehermanse olehermanse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall. I added some smaller comments, but they are not blockers. Wait for @vpodzime also.

@olehermanse
Copy link
Member

@cf-bottom jenkins with exotics

@olehermanse olehermanse removed the WIP Work in Progress label Mar 19, 2019
@olehermanse olehermanse requested a review from vpodzime March 19, 2019 09:21
@cf-bottom
Copy link

Sure, I triggered a build:

Build Status

(with exotics)

https://ci.cfengine.com/job/pr-pipeline/2189/

Copy link
Contributor

@vpodzime vpodzime left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me otherwise.

if (class_index >= num_columns)
{
Log(LOG_LEVEL_ERR,
"%s: Class expression index is out of bounds. Row "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Split the log message on punctuation ;-)

}

SeqRemove(list, class_index);
JsonElement *json_object = JsonObjectCreate(num_columns);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a better name than the generic json_object could make the code that follows more readable and self-explanatory.

{
Log(LOG_LEVEL_WARNING,
"%s: sorting index %zu out of bounds. Not sorting data "
"container.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, split on punctuation. Or just add that one more word to the same line, really. :)

Copy link
Contributor

@vpodzime vpodzime left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some last nitpicks.

void *user_data)
{
assert(JsonGetContainerType(left_obj) == JSON_ELEMENT_TYPE_PRIMITIVE);
assert(JsonGetContainerType(right_obj) == JSON_ELEMENT_TYPE_PRIMITIVE);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I'm looking at this, maybe the function could get a better name like JsonPrimitiveComparator?

size_t const index = user_data;
char const *left = JsonPrimitiveGetAsString(JsonAt(left_obj, index));
char const *right = JsonPrimitiveGetAsString(JsonAt(right_obj, index));
return strcmp(left, right);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should use StringSafeCompare.

@cf-bottom

This comment has been minimized.

@karlhto karlhto force-pushed the CFE-2768 branch 2 times, most recently from 69033b0 to 6f339ed Compare March 25, 2019 11:56
@cf-bottom

This comment has been minimized.

@karlhto
Copy link
Contributor Author

karlhto commented Mar 25, 2019

@cf-bottom Jenkins with exotics.

@cf-bottom
Copy link

Alright, I triggered a build:

Build Status

(with exotics)

Jenkins: https://ci.cfengine.com/job/pr-pipeline/2215/

Packages: http://buildcache.cfengine.com/packages/testing-pr/jenkins-pr-pipeline-2215/

@olehermanse olehermanse self-requested a review March 25, 2019 13:16
karlhto added 2 commits March 29, 2019 13:58
Takes arguments `path`, `heading`, `class_index` and optionally
`sort_index`. Generates a data container filtered by defined
classes, which is sorted by the elements in column `sort_index`
of the CSV file.

Ticket: CFE-2768
Changelog: Title
Tests with header, without header, sorting and
invalid entries.

Ticket: CFE-2768
Changelog: None
@karlhto
Copy link
Contributor Author

karlhto commented Mar 29, 2019

@cf-bottom jenkins with exotics

@olehermanse @vpodzime I followed up the last of your nitpicks. :)

@cf-bottom
Copy link

Alright, I triggered a build:

Build Status

(with exotics)

Jenkins: https://ci.cfengine.com/job/pr-pipeline/2257/

Packages: http://buildcache.cfengine.com/packages/testing-pr/jenkins-pr-pipeline-2257/

Copy link
Contributor

@vpodzime vpodzime left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

@vpodzime
Copy link
Contributor

vpodzime commented Apr 1, 2019

@olehermanse please review and merge this if you find it 👌

Copy link
Member

@olehermanse olehermanse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall!

JsonElement const *right_obj,
void *user_data)
{
size_t const index = *(size_t *)user_data;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style nitpick:

size_t const index = *((size_t *) user_data);

(Space after cast, and paren to show what you are dereferencing.)

Seq *heading = NULL;
JsonElement *json = JsonArrayCreate(50);
char *line;
size_t num_columns = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initializing only the necessary variables, good! ;)

@olehermanse olehermanse merged commit 1bd81dd into cfengine:master Apr 2, 2019
@@ -7128,7 +7312,6 @@ static FnCallResult FnCallFileSexist(EvalContext *ctx, ARG_UNUSED const Policy *
{
file_found = false;
}
free(val);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vpodzime @karlhto oops, memory leak introduced.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

6 participants