Skip to content

Conversation

@michaelmior
Copy link
Contributor

I'd like to be able to use Cleora for JSON data, so this is an attempt to implement support.
I'm not an experienced Rustacean, so there's definitely some cruft here to be removed, but it's basically working.
There is now a file type parameter that can be passed in to parse input data as JSON.
Note that I've removed any mention of CSV since it doesn't actually seem to be supported, but correct if I'm wrong.
If this seems like something you may eventually want to merge, happy to take suggestions on improvements

@piobab
Copy link
Contributor

piobab commented Nov 15, 2020

@michaelmior thanks for the implementation - the JSON idea is very nice :) I have made some comments. Let me know if you need any help. If you wouldn't like to deal with it, I can take it over from you.

@kodieg
Copy link
Collaborator

kodieg commented Nov 17, 2020

@michaelmior I allowed myself to add my two cents here. Hope you don't mind :)

Pushed commit that makes process_row allow to accept both &str and String. This allows to keep previous performance of CSV parsing.

Also moved file_type test out of the loop just in case that could cause some regressions. It adds some code duplication, so possibly in the future we should refactor this portion. Especially, if we will want to add another file type.

@piobab I would propose merging this as this PR after your final review.

@michaelmior
Copy link
Contributor Author

@kodieg Thanks! I just rebased and made one minor change.

@kodieg
Copy link
Collaborator

kodieg commented Nov 17, 2020

Removed two new clippy warnings regarding &Vec<...> refs in function arguments. Replaced with slices.

@michaelmior
Copy link
Contributor Author

If all else looks good, let me know if you want to rebase to clean up the messy history.

src/pipeline.rs Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use other types for non-complex column as well (something similar as it is for arrays):

let elem = parsed.at_key(&c.name).unwrap();
let value = match elem.get_type() {
  dom::element::ElementType::String => elem.get_string().unwrap(),
  _ => elem.minify(),
};
smallvec![value]

After modification we can read such lines:

{"users": 1, "products": ["p1", "p2"], "brands": ["b1", "b2"]}
{"users": 2, "products": ["p2", "p3", "p4"], "brands": ["b1"]}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. Although personally I would lean towards disallowing array values here since they should probably be treated as complex. This is why I only allowed strings initially. Although certainly other atomic values would make sense here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right but column config is here for "support". If there is some array it is treated as single entity and a. user see it in the output file. Otherwise the user should prepare json file with just strings.

@piobab
Copy link
Contributor

piobab commented Nov 18, 2020

@michaelmior I made a bench for TSV based on very big file - no regression. We will add benches to CI pipeline in other PR.

I've made a review. Please take a look. After your changes I think we can merge.

@michaelmior michaelmior changed the title Add basic JSON support (WIP) Add basic JSON support Nov 18, 2020
@michaelmior
Copy link
Contributor Author

I've addressed a couple final issues and rebased into a single commit so I think this is ready to be merged.

@piobab piobab merged commit 0ac407d into BaseModelAI:master Nov 18, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants