
awk #160

Open
JohannesBuchner opened this issue Aug 21, 2020 · 11 comments

@JohannesBuchner

awk is a small DSL that can parse text relatively quickly. It is installed by default on many Unix-based systems, requires little code, and is easy to integrate into shell-script pipelines.

I placed some solutions for the groupby questions here: https://gist.github.com/JohannesBuchner/442e09b7c77c7150a4885c715eb17e6b
Some of them may be correct.
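The gist's exact commands are not reproduced here, but as a minimal sketch of the general shape of such a solution, assuming a hypothetical comma-separated input with a header row, a string key in column 1, and an integer value in column 2, a grouped sum driven from Python could look like:

```python
import subprocess

# Toy input (hypothetical): header row, then key,value records.
data = "id,v\na,1\nb,2\na,3\n"

# Sum column 2 grouped by column 1; NR > 1 skips the header.
prog = 'NR > 1 { s[$1] += $2 } END { for (k in s) print k "," s[k] }'

out = subprocess.run(
    ["awk", "-F,", prog],
    input=data, capture_output=True, text=True, check=True,
).stdout
# The order of `for (k in s)` is unspecified, so sort before inspecting.
print(sorted(out.splitlines()))  # ['a,4', 'b,2']
```

Because awk holds only one accumulator per group, this streams through input of any size with memory proportional to the number of groups.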

mawk used to be faster than gawk; I am not sure whether that difference is still significant.

The median question requires sorting in my solution, and the sort step can be parallelized. I am not sure whether there is a more elegant approach.
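One common shape for such a median pipeline (not necessarily the gist's version): sort the values numerically, then let awk pick the middle element. With GNU sort, the sorting step can be parallelized via `--parallel=N`. A sketch, assuming a single numeric column:

```python
import subprocess

data = "5\n1\n4\n2\n3\n"

# Numeric sort feeds awk, which buffers the sorted values and prints the
# middle one (or the mean of the middle two for an even count). For data
# larger than RAM you would count rows first and print only the middle line.
pipeline = (
    "sort -n | awk '{ a[NR] = $1 } "
    "END { if (NR % 2) print a[(NR + 1) / 2]; "
    "else print (a[NR / 2] + a[NR / 2 + 1]) / 2 }'"
)
out = subprocess.run(
    pipeline, shell=True, input=data, capture_output=True, text=True, check=True,
).stdout.strip()
print(out)  # 3
```

sort(1) itself handles inputs larger than RAM via external merge sort, so the sort stage scales even when the awk stage is the part that needs care.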

@JohannesBuchner
Author

This should work OK for very large datasets, in particular those much larger than RAM.

@jangorecki
Contributor

Thank you, I will try it out. AFAIU it prints results to stdout. What is the best way to capture them in an in-memory variable? Piping into a file on a ramdisk? Also, the last question should include a count by group, not just a sum.

@JohannesBuchner
Author

I am not sure I understand: stdout is already in RAM. If you want to capture it in a Python program, subprocess.check_output is probably the easiest way.
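A sketch of the `subprocess.check_output` route, assuming a toy comma-separated key/value input; the awk program here is illustrative, not one of the benchmark queries:

```python
import subprocess

# Toy comma-separated key/value input (hypothetical).
data = "a,1\nb,2\na,3\n"

# check_output captures awk's stdout directly into a Python string,
# so the result lands in memory without touching the filesystem.
result = subprocess.check_output(
    ["awk", "-F,", '{ s[$1] += $2 } END { for (k in s) print k "," s[k] }'],
    input=data, text=True,
)
print(sorted(result.splitlines()))  # ['a,4', 'b,2']
```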

@JohannesBuchner
Author

Updated the last command to include count.

@JohannesBuchner
Author

For very large outputs, reading through a pipe (also possible with subprocess) may be useful to avoid holding everything in memory at once.
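A sketch of consuming awk's output through a pipe with `subprocess.Popen`, so the Python side processes one line at a time instead of buffering the whole result (the awk program and input are toy placeholders):

```python
import subprocess

data = "a,1\nb,2\na,3\nb,4\n"

# Popen exposes awk's stdout as a pipe; iterating over it consumes the
# output line by line rather than materializing it all in memory.
proc = subprocess.Popen(
    ["awk", "-F,", "{ print $2 * 10 }"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
)
proc.stdin.write(data)  # fine for small inputs; large inputs need a writer thread
proc.stdin.close()
total = 0
for line in proc.stdout:
    total += int(line)
proc.wait()
print(total)  # 100
```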

@jangorecki
Contributor

The problem is that printing to the console adds overhead, so piping the output into a file should be preferred to reduce it.

@jangorecki
Contributor

Also, each command reads the data from disk again; that is another overhead that should be reduced. Ideally the data would be read once, with all queries run in sequence, producing an output file for each query.
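awk can actually do this within a single pass: `print` can redirect to per-query files, so one program can read the input once and write each query's result separately. A sketch with two toy queries (sum and count per group); the file names and input are placeholders:

```python
import os
import subprocess
import tempfile

# Toy key,value input; SUM/CNT output paths are passed in as awk variables.
data = "a,1\nb,2\na,3\n"

prog = (
    '{ s[$1] += $2; c[$1]++ } '
    'END { for (k in s) print k "," s[k] > SUM; '
    'for (k in c) print k "," c[k] > CNT }'
)

with tempfile.TemporaryDirectory() as d:
    sums = os.path.join(d, "sum.csv")
    counts = os.path.join(d, "count.csv")
    # One awk process reads the data once and writes both query results.
    subprocess.run(
        ["awk", "-F,", "-v", "SUM=" + sums, "-v", "CNT=" + counts, prog],
        input=data, text=True, check=True,
    )
    with open(sums) as f:
        sum_rows = sorted(f.read().splitlines())
    with open(counts) as f:
        count_rows = sorted(f.read().splitlines())

print(sum_rows)    # ['a,4', 'b,2']
print(count_rows)  # ['a,2', 'b,1']
```

The trade-off is that all per-query accumulators now live in one process, so memory grows with the combined number of groups across queries.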

@JohannesBuchner
Author


OK, if you want to remove the I/O time, a ramdisk is probably a good solution.

@JohannesBuchner
Author

I am not sure whether you want to look at the output or not. If not, then you can pipe it to /dev/null, which will avoid the console printing overhead.
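From Python, the equivalent of piping to /dev/null is `subprocess.DEVNULL`, which discards the output without console or file overhead. A minimal sketch with a toy input:

```python
import subprocess

data = "a,1\nb,2\n" * 1000

# stdout=subprocess.DEVNULL discards awk's output entirely, so neither
# terminal rendering nor file writing adds to the measured time.
proc = subprocess.run(
    ["awk", "-F,", '{ s[$1] += $2 } END { print s["a"] }'],
    input=data, text=True, stdout=subprocess.DEVNULL,
)
print(proc.returncode)  # 0
```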

@jangorecki
Contributor

Any idea if this is the most recent version? https://github.com/ploxiln/mawk-2

@JohannesBuchner
Author

I simply installed the Ubuntu package, which provides mawk 1.3.3.
