Error running benchmark for datatable #89

Closed
st-pasha opened this issue Jun 29, 2019 · 10 comments

@st-pasha

In run.conf I configure the benchmark to run for datatable only:

# task, used in init-setup-iteration.R
export RUN_TASKS="groupby" # join sort read"

# solution, used in init-setup-iteration.R
export RUN_SOLUTIONS="pydatatable"

# not run benchmarks but print what would run and what skipped
export MOCKUP=false

# print csv entries to console, used when writing timings to csv
export CSV_VERBOSE=false

# flag to upgrade tools, used in run.sh on init
export DO_UPGRADE=false

# force run, ignore if same version was run already
export FORCE_RUN=true

# flag to build reports, used in run.sh before publish
export DO_REPORT=false

# flag to publish, used in run.sh before exit
export DO_PUBLISH=false

Still, running run.sh returns an error about the missing clickhouse client:

$ ./run.sh
Unexpected return code from clickhouse-client: 127
Error: '\.' is an unrecognized escape in character string starting ""[^0-9\."
Execution halted
# Benchmark run 1561766743 started
./versions.sh: line 5: clickhouse-client: command not found
Error in read.dcf(system.file(package = "dplyr", "DESCRIPTION"), fields = c("Version",  : 
  cannot open the connection
In addition: Warning message:
In read.dcf(system.file(package = "dplyr", "DESCRIPTION"), fields = c("Version",  :
  cannot open compressed file '', probable reason 'No such file or directory'
Execution halted
# Benchmark run 1561766743 failed to check versions of currently installed solutions

What is the proper way to run a single solution?

@jangorecki
Contributor

This is the proper way to run a single solution; unfortunately, clickhouse is not yet escaped nicely.

@jangorecki
Contributor

@st-pasha please retry on latest master

@jangorecki
Contributor

jangorecki commented Jul 29, 2019

@st-pasha any update on this?

@st-pasha
Author

Apologies, I missed your previous comment somehow.

With latest master I no longer see any clickhouse-related problems:

Error: '\.' is an unrecognized escape in character string starting ""[^0-9\."
Execution halted
# Benchmark run 1564421368 started
starting: pydatatable groupby G1_1e7_1e2_0_0
/bin/bash: out/run_pydatatable_groupby_G1_1e7_1e2_0_0.out: No such file or directory
finished: pydatatable groupby G1_1e7_1e2_0_0
starting: pydatatable groupby G1_1e7_1e1_0_0
/bin/bash: out/run_pydatatable_groupby_G1_1e7_1e1_0_0.out: No such file or directory
finished: pydatatable groupby G1_1e7_1e1_0_0
starting: pydatatable groupby G1_1e7_2e0_0_0
/bin/bash: out/run_pydatatable_groupby_G1_1e7_2e0_0_0.out: No such file or directory
finished: pydatatable groupby G1_1e7_2e0_0_0
starting: pydatatable groupby G1_1e7_1e2_0_1
/bin/bash: out/run_pydatatable_groupby_G1_1e7_1e2_0_1.out: No such file or directory
finished: pydatatable groupby G1_1e7_1e2_0_1
# Benchmark run 1564421368 has been completed in 1s

For the first error, my guess is that bash "eats" one level of escaping, so R sees only \. which is not a proper escape. An easy way to fix this is to remove backslashes altogether, since in regex language a dot inside square brackets is always interpreted literally. So, after doing that and running the command in R I get:

Error in `[.data.table`(data.table::fread("free -h | grep Swap", header = FALSE),  : 
  Item 1 of j is 1 which is outside the column number range [1,ncol=0]
In addition: Warning message:
In data.table::fread("free -h | grep Swap", header = FALSE) :
  File '/var/folders/d7/dw1pt7c114711zdyqf4gtg0h0000gn/T//RtmpVh7i9Q/file85061dade4dc' has size 0. Returning a NULL data.table.
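
The claim about a dot inside square brackets can be checked outside R as well. Below is a minimal Python sketch using the `re` module (not R's regex engine, so it only illustrates the general rule); the sample string is made up for the example.

```python
import re

# Inside a character class a dot is literal, so escaping it is optional.
escaped = re.compile(r"[^0-9\.]")
unescaped = re.compile(r"[^0-9.]")

sample = "version 1.2.3-beta"  # hypothetical input

# Both patterns remove every character that is not a digit or a dot.
cleaned_escaped = escaped.sub("", sample)
cleaned_unescaped = unescaped.sub("", sample)
print(cleaned_escaped, cleaned_unescaped)
```

Both substitutions give the same result, which is why dropping the backslash sidesteps the escaping problem without changing what the pattern matches.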

Running just the first fread command returns:

> data.table::fread("free -h | grep Swap", header=FALSE)
sh: free: command not found
Null data.table (0 rows and 0 cols)
Warning message:
In data.table::fread("free -h | grep Swap", header = FALSE) :
  File '/var/folders/d7/dw1pt7c114711zdyqf4gtg0h0000gn/T//RtmpVh7i9Q/file850638c36bd' has size 0. Returning a NULL data.table.

So the actual issue is that my shell doesn't have the free command line utility, yet somehow data.table gobbles that error and issues a warning instead.
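
One way to avoid depending on that warning is to check for the tool before calling it. A rough Python sketch of the idea follows; `swap_line` is a hypothetical helper and is not part of the benchmark scripts, which do this check in R.

```python
import shutil
import subprocess
from typing import Optional


def swap_line(tool: str = "free") -> Optional[str]:
    """Return the 'Swap' line printed by `<tool> -h`, or None if unavailable."""
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which(tool) is None:
        return None
    result = subprocess.run([tool, "-h"], capture_output=True, text=True)
    if result.returncode != 0:
        return None
    for line in result.stdout.splitlines():
        if line.startswith("Swap"):
            return line
    return None


line = swap_line()
print("swap check skipped" if line is None else line)
```

Returning `None` instead of an empty table makes the "tool is missing" case explicit, so the caller can skip the check rather than fail on a column lookup.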


Still, despite the errors above the benchmark runs, producing some more error messages:

starting: pydatatable groupby G1_1e7_1e2_0_0
/bin/bash: out/run_pydatatable_groupby_G1_1e7_1e2_0_0.out: No such file or directory
finished: pydatatable groupby G1_1e7_1e2_0_0

I don't know what was supposed to be printed here, but I was hoping for something similar to the benchmark chart:

Question 1 -- first run time -- second run time
Question 2 -- first run time -- second run time
...

jangorecki added a commit that referenced this issue Jul 29, 2019
@jangorecki
Contributor

Are you trying to run the benchmark on macOS? It was designed with a Debian-compatible OS in mind.
Software used on the machine that runs the benchmark:

GNU bash, version 4.3.48(1)-release (x86_64-pc-linux-gnu)
free from procps-ng 3.3.10

The last issue is, I believe, about the missing out directory; I will amend the code to create it automatically if it doesn't exist.
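
The idea is simply to create the directory before anything redirects output into it. A minimal Python sketch of that idea (the actual scripts are bash and R; `ensure_out_dir` is a hypothetical name, and the demo runs in a temporary directory rather than the repository):

```python
import tempfile
from pathlib import Path


def ensure_out_dir(base: Path) -> Path:
    """Create base/out if it does not exist and return its path."""
    out_dir = base / "out"
    # exist_ok=True makes the call a no-op when the directory is already there.
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir


# Demonstrate in a throwaway directory rather than the real repository.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = ensure_out_dir(Path(tmp))
    log_file = out_dir / "run_pydatatable_groupby_G1_1e7_1e2_0_0.out"
    log_file.write_text("placeholder\n")
    created = log_file.exists()
print(created)
```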

Timings land in the time.csv file; attempts to run scripts land in logs.csv.
The structure of the timings is as follows:

question 1 -- first run time
question 1 -- second run time
question 2 -- first run time
question 2 -- second run time

which is later processed for the reports into the structure you mentioned, in

model_time = function(d) {
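
To illustrate the reshaping only, here is a small Python sketch that pivots long-format timings into one entry per question. The field names (`question`, `run`, `time_sec`) and the numbers are invented for the example and are not claimed to match the columns of time.csv or the logic of `model_time`.

```python
from collections import defaultdict

# Long format: one row per (question, run). Values are made up.
long_rows = [
    {"question": "q1", "run": 1, "time_sec": 0.50},
    {"question": "q1", "run": 2, "time_sec": 0.25},
    {"question": "q2", "run": 1, "time_sec": 1.00},
    {"question": "q2", "run": 2, "time_sec": 0.75},
]


def to_wide(rows):
    """Pivot long timing rows into {question: {run: time_sec}}."""
    wide = defaultdict(dict)
    for row in rows:
        wide[row["question"]][row["run"]] = row["time_sec"]
    return dict(wide)


wide = to_wide(long_rows)
for question, runs in wide.items():
    # One line per question: first run time, then second run time.
    print(question, "--", runs[1], "--", runs[2])
```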

Please retry on the latest master, ideally after installing free.

@st-pasha
Author

According to SO, the equivalent of free on macOS is vm_stat, which reports things like this:

$ vm_stat
Mach Virtual Memory Statistics: (page size of 4096 bytes)
Pages free:                              208197.
Pages active:                           1478906.
Pages inactive:                          868832.
Pages speculative:                       107124.
Pages throttled:                              0.
Pages wired down:                        997248.
Pages purgeable:                           9437.
"Translation faults":               36699619531.
Pages copy-on-write:                  444929577.
Pages zero filled:                   5459321091.
Pages reactivated:                    487618793.
Pages purged:                          19600537.
File-backed pages:                       468271.
Anonymous pages:                        1986591.
Pages stored in compressor:             4044251.
Pages occupied by compressor:            533481.
Decompressions:                       196753666.
Compressions:                        1049140452.
Pageins:                               87517994.
Pageouts:                                129923.
Swapins:                              148769198.
Swapouts:                             370675952.
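
For what it's worth, that output is easy to parse. Below is a rough Python sketch that reads the page size from the header and converts a counter to bytes; the sample text is an excerpt of the output above, and `parse_vm_stat` is a hypothetical helper, not something the benchmark provides.

```python
import re

SAMPLE = """\
Mach Virtual Memory Statistics: (page size of 4096 bytes)
Pages free:                              208197.
Pages active:                           1478906.
Swapins:                              148769198.
Swapouts:                             370675952.
"""


def parse_vm_stat(text):
    """Return (page_size, {counter_name: value}) from vm_stat-style text."""
    header = re.search(r"page size of (\d+) bytes", text)
    page_size = int(header.group(1)) if header else None
    counters = {}
    for line in text.splitlines():
        # Counter lines look like 'Name:   12345.' (the name may be quoted).
        match = re.match(r'^"?([^":]+)"?:\s+(\d+)\.$', line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return page_size, counters


page_size, counters = parse_vm_stat(SAMPLE)
free_bytes = counters["Pages free"] * page_size
print(f"free: {free_bytes / 2**20:.0f} MiB")
```

Note that vm_stat reports page counts and cumulative swap-in/swap-out events, not a configured swap size, so it is not a drop-in replacement for the `free -h | grep Swap` line.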

Now, disabling swap can be done (https://summercode.com/wiki/how-to-disable-or-enable-swapping-in-mac-os-x), but it seems mighty dangerous...
However, since the check is optional (the script keeps running even if the check fails), I guess it's not that important.

This is the output that I'm currently getting:

sh: free: command not found
Error in `[.data.table`(data.table::fread("free -h | grep Swap", header = FALSE),  : 
  Item 1 of j is 1 which is outside the column number range [1,ncol=0]
Calls: [ -> [.data.table
In addition: Warning message:
In data.table::fread("free -h | grep Swap", header = FALSE) :
  File '/var/folders/d7/dw1pt7c114711zdyqf4gtg0h0000gn/T//Rtmp7qDc35/filef3461723cb91' has size 0. Returning a NULL data.table.
Execution halted
# Benchmark run 1564431516 started
starting: pydatatable groupby G1_1e7_1e2_0_0
finished: pydatatable groupby G1_1e7_1e2_0_0: stderr 5
starting: pydatatable groupby G1_1e7_1e1_0_0
finished: pydatatable groupby G1_1e7_1e1_0_0: stderr 5
starting: pydatatable groupby G1_1e7_2e0_0_0
finished: pydatatable groupby G1_1e7_2e0_0_0: stderr 5
starting: pydatatable groupby G1_1e7_1e2_0_1
finished: pydatatable groupby G1_1e7_1e2_0_1: stderr 5
# Benchmark run 1564431516 has been completed in 2s

At first it was complaining about # Benchmark run 1564431330 aborted. './data' directory does not exists, but that error disappeared after creating directory "data". I even copied the files "G1_1e7_*" there, just in case. Still, some errors are produced in the printout above, and I can't figure out what they mean.

@jangorecki
Contributor

jangorecki commented Jul 29, 2019

Please include some of the out/*.err files.
Note that data files are now named like G1_1e7_1e2_0_0.csv; the old names did not have the two extra zeros, which stand for the NA percentage and whether the data are ordered.
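
A small Python sketch of how such a name could be split. Only the last two fields are described in this thread; reading the second and third fields as row count and group count, and treating `1` as "ordered", are assumptions here, and `parse_data_name` is a hypothetical helper.

```python
def parse_data_name(name):
    """Split a name such as 'G1_1e7_1e2_0_0' into labelled fields."""
    parts = name.split("_")
    if len(parts) != 5:
        raise ValueError(
            f"expected 5 underscore-separated fields, got {len(parts)}: {name!r}"
        )
    prefix, rows, groups, na_pct, is_sorted = parts
    return {
        "prefix": prefix,
        "rows": int(float(rows)),      # assumption: number of rows
        "groups": int(float(groups)),  # assumption: number of groups
        "na_pct": int(na_pct),         # NA percentage (per the thread)
        "sorted": is_sorted == "1",    # ordered flag; '1' meaning ordered is assumed
    }


print(parse_data_name("G1_1e7_1e2_0_1"))
try:
    parse_data_name("G1_1e7_1e2")  # old-style name without the two extra fields
except ValueError as exc:
    print("rejected:", exc)
```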

@st-pasha
Author

Ah, I see. The .err files complain about the missing modules "psutil" and "pandas". After installing those, the script finally runs.
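
A quick pre-flight check along these lines would have surfaced this earlier. This is only a Python sketch; the module list is an assumption for illustration, and `missing_modules` is a hypothetical helper.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


required = ["psutil", "pandas", "datatable"]  # assumed list for illustration
missing = missing_modules(required)
if missing:
    print("missing modules:", ", ".join(missing))
else:
    print("all required modules found")
```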

@jangorecki
Contributor

If there are no other problems here, and you obtained timings from the time.csv file, then we can close this issue.

@st-pasha
Author

sure
