Feature request: random partition #181

Open
plainas opened this issue Jun 2, 2019 · 6 comments

Comments

@plainas

plainas commented Jun 2, 2019

For those of us working in machine learning, a feature to quickly divide a data set into training data and test data would be really nice to have.
Is there a way to do this already?

I am tempted to use other command line tools to achieve this by partitioning lines rather than CSV rows. Is there a way to escape newlines inside values so I can ensure that each line of output is exactly one CSV row?

@BurntSushi
Owner

> Is there a way to do this already?

I can't think of any simple way. But if xsv sort grew a flag to shuffle the rows (analogous to sort's -R/--random-sort flag), then it would be a simple matter of a shuffle followed by xsv slice.

> I am tempted to use other command line tools to achieve this by partitioning lines rather than CSV rows. Is there a way to escape newlines inside values so I can ensure that each line of output is exactly one CSV row?

No. Not without layering your own encoding on top of CSV. If you need to handle arbitrary CSV data, then using other command line tools won't work. If you can guarantee that all CSV records occupy a single line, then other line-oriented tools would work okay.
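To make "layering your own encoding on top of CSV" concrete, here is a minimal sketch in Python using only the standard library. The backslash escaping scheme and the function names (`escape_field`, `unescape_field`, `to_single_line_csv`, `from_single_line_csv`) are assumptions made for illustration; they are not part of xsv or of any CSV standard, and any line-oriented tool used in between would have to leave the escapes untouched.

```python
import csv
import io


def escape_field(value):
    # Hypothetical encoding: escape backslashes first so decoding is unambiguous,
    # then turn real CR/LF characters into two-character escapes.
    return value.replace("\\", "\\\\").replace("\r", "\\r").replace("\n", "\\n")


def unescape_field(value):
    # Reverse escape_field by walking the string one escape at a time.
    out = []
    chars = iter(value)
    for ch in chars:
        if ch != "\\":
            out.append(ch)
            continue
        nxt = next(chars, "")
        out.append({"n": "\n", "r": "\r", "\\": "\\"}.get(nxt, "\\" + nxt))
    return "".join(out)


def to_single_line_csv(text):
    # Parse with a real CSV parser, then write each record on one physical line.
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(text, newline="")):
        writer.writerow([escape_field(field) for field in row])
    return out.getvalue()


def from_single_line_csv(text):
    # Decode back into records with the original embedded newlines restored.
    reader = csv.reader(io.StringIO(text, newline=""))
    return [[unescape_field(field) for field in row] for row in reader]
```

After `to_single_line_csv`, every output line is one record, so tools like `head`, `tail`, or `split` can partition it safely; `from_single_line_csv` restores the original values afterwards.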

@sd2k

sd2k commented Jun 11, 2019

@plainas This may or may not help but a while ago I wrote a separate tool for doing this: https://github.com/sd2k/ttv

You can compose it with xsv if desired, e.g. if you need to select columns etc.

@BurntSushi
Owner

BurntSushi commented Jun 11, 2019

@sd2k Neat tool, although it doesn't look like it correctly supports CSV data? I don't see any CSV parsing happening in that tool. (A single CSV record can span an arbitrary number of lines.)
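The point about records spanning lines can be checked with a small Python sketch (standard library only; the sample data is made up for illustration). It compares the number of physical lines with the number of CSV records in the same input:

```python
import csv
import io

# Made-up sample: the second record has a quoted field containing a newline.
data = 'id,comment\n1,"first line\nsecond line"\n2,short\n'

physical_lines = data.splitlines()
records = list(csv.reader(io.StringIO(data, newline="")))

print(len(physical_lines))  # 4 physical lines
print(len(records))         # 3 CSV records (header + 2 rows)
```

A splitter that cuts between physical lines 2 and 3 would break the quoted field in half, leaving both output files malformed.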

plainas closed this as completed Jun 11, 2019
plainas reopened this Jun 11, 2019
@sd2k

sd2k commented Jun 11, 2019

Ah, I misread the initial description. You're right, that tool is completely naive when it comes to nested newlines. It could potentially be 'upgraded' if there's a need for it!

@plainas
Author

plainas commented Jun 11, 2019

There definitely is :)

@BurntSushi
Owner

BurntSushi commented Jun 11, 2019

Y'all might consider my suggested implementation strategy. There's really no need for a separate tool for the stated use case. That is, all you need to do is add random sorting to xsv sort. Once you have that, you can dice it up any way you want. It should be fairly easy to implement using rand's shuffle routine. PRs are welcome.
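For readers who want to see the shuffle-then-slice idea in isolation, here is a rough sketch in Python. It is not xsv code and not the Rust implementation suggested above: Python's `random.shuffle` stands in for rand's shuffle routine, and the function name `shuffle_and_split`, the `test_fraction` parameter, and the assumption that the first record is a header are all hypothetical choices for this example.

```python
import csv
import io
import random


def shuffle_and_split(text, test_fraction=0.2, seed=None):
    """Shuffle CSV records (not lines), then slice into train and test sets."""
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader)  # assumes the input has a header row
    rows = list(reader)    # whole records, so embedded newlines stay intact

    # Shuffle in place; a fixed seed makes the split reproducible.
    random.Random(seed).shuffle(rows)

    # Slice the shuffled records: the first chunk is test, the rest is train.
    n_test = int(len(rows) * test_fraction)
    return header, rows[n_test:], rows[:n_test]
```

Calling `header, train, test = shuffle_and_split(text, test_fraction=0.2, seed=42)` returns the header plus two disjoint lists of records. Like a sort, this reads the whole input into memory before shuffling.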
