-
Notifications
You must be signed in to change notification settings - Fork 189
CFE-2768: Added function to filter CSV by class expressions #3463
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Thank you for submitting a pull request! Maybe @olehermanse can review this? |
@cf-bottom jenkins, please |
Alright, I triggered a build: |
libpromises/evalfunction.c
Outdated
Seq *seq = SeqParseCsvString(line); | ||
if (seq == NULL) | ||
{ | ||
return FnFailure(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we log a message that a particular line does not parse properly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't you need to free(line) if you are returning FnFailure()?
libpromises/evalfunction.c
Outdated
{ | ||
JsonObjectAppendString(json_object, | ||
SeqAt(heading_seq, i), | ||
SeqAt(seq, i)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems you would need to range check the heading_seq length against the sequence length for each line. SeqAt() only has asserts so it's return seq->data[i]
could get out of range.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I grabbed a build with this function present. Either I am not doing something correctly, or it doesn't work.
I tested with a very simple data file:
linux,1,net.ipv4.ip_forward,0
router,0,net.ipv4.ip_forward,1
And a simple policy to filter the data and show the resulting data container.
bundle agent main
{
vars:
"data_file" string => "/tmp/data-file.csv";
"d" data => classexpression_filterdata( $(data_file),
1, # Column containing class expression to filter with
",", # Delim
"false", # Data file contains column headings
1); # Column to sort by
reports:
"$(with)" with => string_mustache( "{{%-top-}}", d );
}
But the resulting data container is empty.
R: []
When running on a linux host that does not have the class router
defined, I expect a result similar to this:
{
"0": "linux",
"1": "1",
"2": "net.ipv4.ip_forward",
"3": "0"
}
libpromises/evalfunction.c
Outdated
static const FnCallArg CLASSEXPRESSION_FILTERDATA_ARGS[] = | ||
{ | ||
{CF_ABSPATHRANGE, CF_DATA_TYPE_STRING, "File name to read"}, | ||
{CF_VALRANGE, CF_DATA_TYPE_INT, "Column with class expression"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the column name or column index position?
If it is an index position, is it counting from 0 or 1?
If it is can be a column name, will it error if the column name is not found?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently only column position. For the sake of consistency, it should definitely count from 0 like everything else (which is also what it does).
I am unsure if allowing both column position and name is a good idea. If column names are numbers, how will the function know whether to count the given parameter as a column index or name? If there are several matches of the column name, does it just go for the first match?
If this should be allowed, it should be decided by has_heading
, but I would say it's better to only use indexes. Might be different for the sort_by
column.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for only indices, keep it simple. I think you should update the description string to say:
"Column index with class expression"
libpromises/evalfunction.c
Outdated
{ | ||
{CF_ABSPATHRANGE, CF_DATA_TYPE_STRING, "File name to read"}, | ||
{CF_VALRANGE, CF_DATA_TYPE_INT, "Column with class expression"}, | ||
{CF_ANYSTRING, CF_DATA_TYPE_STRING, "String to delimit data by"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a literal string delimiter, or is it a regular expression?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently this is not used. The problems I see with allowing you to choose the delimiters are:
- To change this in
csv_parser.c
, the code has to be refactored a bunch. This will probably take a bit of time, since the code has a lot of spaghetti. - If it is to be done "on top" of the CSV parsing -- as in replacing occurrences of a given delimiter with
,
-- you can end up with situations where one line suddenly has more delimiters than intended because a comma was used as well.
I do however think that it should be changed. Allowing CFEngine to read CSV-files with non-comma delimiters would be a good idea. It's just that it will probably take more time, and if it should work for this function, it should also work for readcsv()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 is no good, we would have to change the csv parser as you mention in 1. I think you should remove this function param, it can be added as a last optional parameter later.
libpromises/evalfunction.c
Outdated
{CF_VALRANGE, CF_DATA_TYPE_INT, "Column with class expression"}, | ||
{CF_ANYSTRING, CF_DATA_TYPE_STRING, "String to delimit data by"}, | ||
{CF_BOOL, CF_DATA_TYPE_OPTION, "CSV file has heading"}, | ||
{CF_ANYSTRING, CF_DATA_TYPE_STRING, "Column to sort by"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the column name or column index position?
If it is an index position, is it counting from 0 or 1?
If it is can be a column name, will it error if the column name is not found?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently not implemented properly. I think that this should be explicitly based on the index, but in case you want both name and index to be possible, it should be decided based on the has_heading
parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add the word index to description.
Thanks for the input! :) What should the function do if line length is inconsistent or the CSV-file is corrupted in some other way? Should it just |
With line based data, if one line is corrupt, then I think just that one record should be skipped, and we should emit some kind of warning with the info. One of the nice things about CSV, is that one bad record does not invalidate all records as you would typically get with JSON or some other structured data. |
da93eb1
to
f1aa864
Compare
@cf-bottom jenkins, please |
Sure, I triggered a build: |
|
So, the datacontainer returned was empty.
|
@cf-bottom jenkins dude |
Sure, I triggered a build: |
@cf-bottom jenkins, my dude |
Sure, I triggered a build: |
@cf-bottom jenkins, pwease |
Alright, I triggered a build: |
I realise that I should probably move the contents of this function to |
@cf-bottom jenkins :v |
Alright, I triggered a build: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logic looks good overall. Haven't reviewed the tests yet.
libpromises/evalfunction.c
Outdated
static const FnCallArg CLASSEXPRESSION_FILTERDATA_ARGS[] = | ||
{ | ||
{CF_ABSPATHRANGE, CF_DATA_TYPE_STRING, "File name to read"}, | ||
{CF_VALRANGE, CF_DATA_TYPE_INT, "Column with class expression"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for only indices, keep it simple. I think you should update the description string to say:
"Column index with class expression"
libpromises/evalfunction.c
Outdated
{ | ||
{CF_ABSPATHRANGE, CF_DATA_TYPE_STRING, "File name to read"}, | ||
{CF_VALRANGE, CF_DATA_TYPE_INT, "Column with class expression"}, | ||
{CF_ANYSTRING, CF_DATA_TYPE_STRING, "String to delimit data by"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 is no good, we would have to change the csv parser as you mention in 1. I think you should remove this function param, it can be added as a last optional parameter later.
libpromises/evalfunction.c
Outdated
{CF_VALRANGE, CF_DATA_TYPE_INT, "Column with class expression"}, | ||
{CF_ANYSTRING, CF_DATA_TYPE_STRING, "String to delimit data by"}, | ||
{CF_BOOL, CF_DATA_TYPE_OPTION, "CSV file has heading"}, | ||
{CF_ANYSTRING, CF_DATA_TYPE_STRING, "Column to sort by"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add the word index to description.
@olehermanse While I personally think that sticking to indices would be better, I think @nickanderson (and probably other users) would prefer being able to use find column based on column names if header is given as an argument. Would be nice to know which is better. https://tracker.mender.io/browse/CFE-2768 is probably the best place to discuss that. |
@karlhto I would still stick to indices for the simple reason that it is faster to implement. To support both without ambiguity we could add a syntax for it later, for example quoted string or beginning with |
Alright. I'll stick to indices for now. |
@karlhto tell me when you've addressed my comments, fixed the commit history and tests are ready, and I'll review again. |
libpromises/evalfunction.c
Outdated
{ | ||
Log(LOG_LEVEL_VERBOSE, | ||
"%s: sorting column (%zu) is the same as class " | ||
"expression column (%zu). Not sorting data container.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this limitation when sort_arg
is optional anyway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, if you want to keep this, it should be bumped up to WARNING log level. The user is specifying a sorting column which you have defined to be invalid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall. I added some smaller comments, but they are not blockers. Wait for @vpodzime also.
@cf-bottom jenkins with exotics |
Sure, I triggered a build: (with exotics) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me otherwise.
libpromises/evalfunction.c
Outdated
if (class_index >= num_columns) | ||
{ | ||
Log(LOG_LEVEL_ERR, | ||
"%s: Class expression index is out of bounds. Row " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Split the log message on punctuation ;-)
libpromises/evalfunction.c
Outdated
} | ||
|
||
SeqRemove(list, class_index); | ||
JsonElement *json_object = JsonObjectCreate(num_columns); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a better name than the generic json_object
could make the code that follows more readable and self-explanatory.
libpromises/evalfunction.c
Outdated
{ | ||
Log(LOG_LEVEL_WARNING, | ||
"%s: sorting index %zu out of bounds. Not sorting data " | ||
"container.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, split on punctuation. Or just add that one more word to the same line, really. :)
a788927
to
089f2dc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some last nitpicks.
libpromises/evalfunction.c
Outdated
void *user_data) | ||
{ | ||
assert(JsonGetContainerType(left_obj) == JSON_ELEMENT_TYPE_PRIMITIVE); | ||
assert(JsonGetContainerType(right_obj) == JSON_ELEMENT_TYPE_PRIMITIVE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that I'm looking at this, maybe the function could get a better name like JsonPrimitiveComparator
?
libpromises/evalfunction.c
Outdated
size_t const index = user_data; | ||
char const *left = JsonPrimitiveGetAsString(JsonAt(left_obj, index)); | ||
char const *right = JsonPrimitiveGetAsString(JsonAt(right_obj, index)); | ||
return strcmp(left, right); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should use StringSafeCompare
.
This comment has been minimized.
This comment has been minimized.
69033b0
to
6f339ed
Compare
This comment has been minimized.
This comment has been minimized.
@cf-bottom Jenkins with exotics. |
Alright, I triggered a build: (with exotics) Jenkins: https://ci.cfengine.com/job/pr-pipeline/2215/ Packages: http://buildcache.cfengine.com/packages/testing-pr/jenkins-pr-pipeline-2215/ |
Takes arguments `path`, `heading`, `class_index` and optionally `sort_index`. Generates a data container filtered by defined classes, which is sorted by the elements in column `sort_index` of the CSV file. Ticket: CFE-2768 Changelog: Title
Tests with header, without header, sorting and invalid entries. Ticket: CFE-2768 Changelog: None
@cf-bottom jenkins with exotics @olehermanse @vpodzime I followed up the last of your nitpicks. :) |
Alright, I triggered a build: (with exotics) Jenkins: https://ci.cfengine.com/job/pr-pipeline/2257/ Packages: http://buildcache.cfengine.com/packages/testing-pr/jenkins-pr-pipeline-2257/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
@olehermanse please review and merge this if you find it 👌 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall!
JsonElement const *right_obj, | ||
void *user_data) | ||
{ | ||
size_t const index = *(size_t *)user_data; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style nitpick:
size_t const index = *((size_t *) user_data);
(Space after cast, and paren to show what you are dereferencing.)
Seq *heading = NULL; | ||
JsonElement *json = JsonArrayCreate(50); | ||
char *line; | ||
size_t num_columns = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initializing only the necessary variables, good! ;)
@@ -7128,7 +7312,6 @@ static FnCallResult FnCallFileSexist(EvalContext *ctx, ARG_UNUSED const Policy * | |||
{ | |||
file_found = false; | |||
} | |||
free(val); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will currently only filter data, does not remove duplicates and
such. Also cannot sort yet.
This is not ready to be merged, but I want feedback.
@nickanderson if you want to try it, you could check out my branch and compile it.