Extension of benchmark parameters and output #19
Conversation
In general, looks good! A few comments:
- in arg parsing, --check-finite needs an action and should be named accordingly
- paramaters -> parameters
On Wed, Jan 22, 2020, 10:22 Alexander Andreev wrote:
Addition of the following benchmark parameters:
- Data format (pandas DataFrame/numpy array)
- Data order (column/row major)
Addition of JSON output format
You can view, comment on, or merge this pull request online at:
#19
Commit Summary
- Add data type and order choice for sklearn benchmarks
- Fix absence of dtype property in pandas.DataFrame for distance
- JSON output and minor additions
- Code cleanup
File Changes
- *M* sklearn/bench.py <https://github.com/IntelPython/scikit-learn_bench/pull/19/files#diff-0> (55)
- *M* sklearn/df_clsf.py <https://github.com/IntelPython/scikit-learn_bench/pull/19/files#diff-1> (49)
- *M* sklearn/df_regr.py <https://github.com/IntelPython/scikit-learn_bench/pull/19/files#diff-2> (51)
- *M* sklearn/distances.py <https://github.com/IntelPython/scikit-learn_bench/pull/19/files#diff-3> (37)
- *M* sklearn/kmeans.py <https://github.com/IntelPython/scikit-learn_bench/pull/19/files#diff-4> (48)
- *M* sklearn/linear.py <https://github.com/IntelPython/scikit-learn_bench/pull/19/files#diff-5> (54)
- *M* sklearn/log_reg.py <https://github.com/IntelPython/scikit-learn_bench/pull/19/files#diff-6> (65)
- *M* sklearn/pca.py <https://github.com/IntelPython/scikit-learn_bench/pull/19/files#diff-7> (47)
- *M* sklearn/ridge.py <https://github.com/IntelPython/scikit-learn_bench/pull/19/files#diff-8> (53)
- *M* sklearn/svm.py <https://github.com/IntelPython/scikit-learn_bench/pull/19/files#diff-9> (45)
Patch Links:
- https://github.com/IntelPython/scikit-learn_bench/pull/19.patch
- https://github.com/IntelPython/scikit-learn_bench/pull/19.diff
Thanks for the PR! In addition to specific comments in the diff, there are a few other things I think need to be addressed:
- daal4py benchmarks must also support JSON output. I guess you must be working on this, since this is still a draft.
- It would probably be worth factoring out the output part of each benchmark into a common function in e.g. bench.py. The JSON/CSV output format differences can be handled there. We should also try to reduce the differences between the output formats. Right now, CSV output omits certain things that the JSON format provides. Ideally, both should be able to communicate the same information.
- I would recommend running flake8 on the files to be style-consistent. There is really no formal style guideline for this repo right now (since contributions were not originally expected 😄), but I have been running flake8 to verify consistency with PEP 8.
Looks good in general. I have a few comments attached. I noticed that the permissions for the files have changed from 644 to 755 at some point. Was there any reason for this change? I don't think it is useful to have Python scripts with the execute bit set unless a shebang is included at the top of the file.
The most important comments to address are about make_datasets.py. It would also be good if the JSON/CSV output could be handled entirely in bench.py instead of in the separate benchmarks, since the way it is right now looks very complicated. This could be done with a single function call from the benchmarks. It is not important to preserve the names, e.g. Linear.fit and Linear.predict.
Since you have updated most of the Python scripts here, their copyright years should also be updated to include 2020 if they don't have it yet.
@@ -119,23 +119,22 @@ ARGS_SKLEARN_ridge = --size "$(REGRESSION_SIZE)"
ARGS_SKLEARN_linear = --size "$(REGRESSION_SIZE)"
ARGS_SKLEARN_pca_daal = --size "$(REGRESSION_SIZE)" --svd-solver daal
ARGS_SKLEARN_pca_full = --size "$(REGRESSION_SIZE)" --svd-solver full
ARGS_SKLEARN_kmeans = --data-multiplier "$(MULTIPLIER)" \
If the runner could handle running everything, including the native benchmarks, we could just remove the top-level Makefile, since it only exists for that purpose. This might not be possible yet because the native benchmarks don't accept exactly the same inputs (some sklearn benchmarks now require a testing dataset in addition to the training dataset, while the corresponding native benchmarks do not). In that case, I can later change the native benchmarks to take exactly the same arguments as the Python ones so we can remove the Makefile completely.
--fileY data/multi/y-$(DFCLF_SIZE).npy
ARGS_SKLEARN_dfreg = --fileX data/reg/X-$(DFREG_SIZE).npy \
    --fileY data/reg/y-$(DFREG_SIZE).npy
ARGS_SKLEARN_kmeans = --file-X-train data/clustering/kmeans_$(KMEANS_SIZE).npy \
It would be good to align some of this indentation if we will not be removing the Makefile in this PR. GitHub has a tab width of 8.
It's unnecessary, at least for the Python part of the Makefile, which will probably be removed.
@bibikar, I applied the major comments for this pull request. Are there other changes that should be included in it?
@Alexsandruss Will take a look, thanks!
Thanks! I've attached some comments. Should be ready to merge soon once those are addressed.
runner.py (Outdated)
cases = cases * n_param_values
for i in range(n_param_values):
    for j in range(prev_lenght):
        cases[prev_lenght * i + j] += ' {}{} {}'.format(
Throughout this file, since we only target Python >= 3.6 with these benchmarks, you could use f-strings, as we do in the bench.py files, instead of str.format.
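For illustration, a minimal sketch of the suggested change; the variable names below are placeholders, not taken from runner.py:

```python
# Illustrative only: prefix/name/value are placeholder variables.
prefix, name, value = '--', 'size', 1000

case_old = ' {}{} {}'.format(prefix, name, value)   # current str.format style
case_new = f' {prefix}{name} {value}'               # equivalent f-string (Python >= 3.6)

assert case_old == case_new == ' --size 1000'
```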
f-strings are now applied in all places.
runner.py (Outdated)
parser = argparse.ArgumentParser()
parser.add_argument('--config', metavar='ConfigPath', type=str,
                    default='config.json',
Could you possibly add your example config to this repo, e.g. as config.example.json?
Also, since you specify a file here, you could use argparse.FileType('r') as the type, and argparse will automatically check for the existence of the file:
$ python zzzzzz.py aaaaaaaaa
usage: zzzzzz.py [-h] file
zzzzzz.py: error: argument file: can't open 'aaaaaaaaa': [Errno 2] No such file or directory: 'aaaaaaaaa'
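A minimal sketch of the suggested argparse usage; the --config flag and its default mirror the diff above, while the surrounding code is illustrative:

```python
import argparse
import json

parser = argparse.ArgumentParser()
# FileType('r') opens the file during argument parsing and exits with a clear
# error message if it cannot be opened, so no manual existence check is needed.
parser.add_argument('--config', metavar='ConfigPath',
                    type=argparse.FileType('r'), default='config.json')
args = parser.parse_args()

config = json.load(args.config)  # args.config is an already-open file object
```

Note that argparse also applies the type conversion to a string default, so the default config.json is opened (and must exist) when --config is omitted.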
Corrected, config added.
runner.py (Outdated)
        x_train_file, y_train_file,
        x_test_file, y_test_file)
else:
    command = 'python make_datasets.py -s {} -f {} classification -c {} -x {} -y {}'.format(
Similar duplication here. If a user doesn't want to generate a testing dataset, you can simply leave out the arguments for testing instead of generating separate arguments for runs with testing and for runs without testing.
Dataset generation was rewritten; code duplication was avoided where possible.
runner.py (Outdated)
if not args.dummy_run:
    r = subprocess.run(
        command.split(' '), stdout=subprocess.PIPE,
        stderr=stderr_file, encoding='utf-8')
I like to see stderr in the terminal in case something went wrong or some diagnostic messages were printed. The user can also specify stderr redirection without our intervention with 2> some_file in the shell, so you can probably leave this redirection out entirely.
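A minimal sketch of what leaving the redirection out could look like; the command here is a placeholder, not the runner's real benchmark invocation:

```python
import subprocess

command = 'python --version'  # placeholder; the runner builds the real benchmark command

# Capture stdout for later parsing, but leave stderr alone so diagnostics
# appear directly in the terminal (or wherever the user redirects 2>).
r = subprocess.run(command.split(' '), stdout=subprocess.PIPE, encoding='utf-8')
log = r.stdout
```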
Changed; benchmark stderr is now forwarded to the runner's stderr in real time.
runner.py (Outdated)
log += r.stdout

# add commas to correct JSON output
while '}\n{' in log:
Why not simply parse the JSON output from each benchmark separately, or better yet, always output a list of maps [{benchmark1...}, {benchmark2...}] from each benchmark? That way, the output of each benchmark is valid JSON and you don't need to worry about fixing it later. You can directly do something like list_of_results += json.loads(r.stdout) for each benchmark, and then simply do result['results'] = list_of_results at the end. No need to mess with the actual text of the JSON representation.
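A small sketch of that approach, assuming each benchmark prints a JSON list of result maps to stdout; the benchmark commands below are placeholders:

```python
import json
import subprocess

# Placeholder commands: the real runner builds these from the config.
benchmark_commands = ['python sklearn/linear.py', 'python sklearn/ridge.py']

list_of_results = []
for command in benchmark_commands:
    r = subprocess.run(command.split(' '), stdout=subprocess.PIPE, encoding='utf-8')
    # Each benchmark emits a valid JSON list of maps, so outputs can be parsed
    # independently -- no text-level fix-ups of the JSON are needed.
    list_of_results += json.loads(r.stdout)

result = {'results': list_of_results}
print(json.dumps(result, indent=4))
```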
Adding results to the output JSON was corrected.
sklearn/bench.py (Outdated)
def print_output(library, algorithm, stages, columns, params, functions,
                 times, accuracy_type, accuracies, data, alg_instance=None):
    if params.output_format == 'csv':
        output_csv(columns, params, functions, times, [None, accuracies[1]])
Since functions appears to be semantically equivalent to algorithm.stages[i], can we generate functions in this function (probably better: in output_csv) instead of having to specify it in the individual scripts? It would also probably be better to change columns to csv_columns for clarity. This will affect the output that I have to parse on my end, but that's ok if it's possible to unify everything.
The accuracies argument should be specified as a keyword argument, since it's declared as one. Also, why not report training accuracy as well in CSV format?
Training accuracy was added to the CSV format.
I think better unification of the code will be possible once backward compatibility is no longer needed (if that happens).
                         'for benchmarks')
parser.add_argument('--dummy-run', default=False, action='store_true',
                    help='Run configuration parser and datasets generation'
                         'without benchmarks running')
If you could also support CSV output here, that would allow us to remove the top-level Makefile entirely. No special processing is needed for CSV, and each CSV document sent to stdout can be separated with some special character (e.g. %) on its own line. Better yet, CSV outputs can be written to result files as is currently done in the Makefile.
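A small sketch of the separator idea, assuming benchmarks emit CSV documents to stdout delimited by a line containing only %; the sample data and file names are illustrative:

```python
# Illustrative combined stdout from several benchmark runs, with '%' on its
# own line separating the individual CSV documents.
combined_output = "algorithm,time\nlinear,0.12\n%\nalgorithm,time\nridge,0.10\n"

# Split on the separator line and drop empty chunks.
csv_documents = [doc.strip() for doc in combined_output.split('\n%\n') if doc.strip()]

for i, doc in enumerate(csv_documents):
    with open(f'result_{i}.csv', 'w') as output_file:
        output_file.write(doc + '\n')
```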
CSV output support was added.
The Makefile also currently has …
@bibikar, I added a box filter option for time measurements and an XGBoost benchmark. Please review it and the other parts as needed. I think no more additions are expected for this PR, so it's time to finish it.
Is it possible to run the benchmarks with a single thread using the runner, i.e. setting OMP_NUM_THREADS=1 and using --num-threads 1, etc. in the benchmarks?
It would also be good if you could look through the previous comments, such as the one about constructing the JSON list.
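One way the runner could do this, sketched under the assumption that benchmarks accept the --num-threads flag mentioned above; the command below is a placeholder:

```python
import os
import subprocess

# Hypothetical invocation; '--num-threads' is the benchmark flag mentioned above.
command = 'python sklearn/linear.py --num-threads 1'

env = os.environ.copy()
env['OMP_NUM_THREADS'] = '1'  # restrict OpenMP threads for the child process only

r = subprocess.run(command.split(' '), stdout=subprocess.PIPE,
                   encoding='utf-8', env=env)
```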
cuml/bench.py (Outdated)
parser.add_argument('--time-method', type=str, default='mean_min',
                    choices=('box_filter', 'mean_min'),
                    help='Method used for time mesurements')
parser.add_argument('--n-meas', type=int, default=100,
I would make this long form: --n-measurements or --box-filter-measurements.
The arg name is changed to box-filter-measurements.
@@ -227,6 +217,44 @@ def prepare_daal(num_threads=-1):
    return num_threads, daal_version


def measure_function_time(func, *args, params, **kwargs):
I noticed that here you don't pass kwargs anywhere. Maybe now it would be better to change this function to have the signature func, args, kwargs, timing_params, explicitly specify args and kwargs as a tuple and a dict, and then actually execute func(*args, **kwargs) in the timing itself. This would make it possible to have verbose=True for both timing and function execution, and would make things clearer. I'm not sure why I didn't implement it this way before, but it would be better.
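A rough sketch of the proposed signature; the timing loop below is deliberately simplified and does not reproduce the box_filter/mean_min logic from bench.py:

```python
import timeit

def measure_function_time(func, args=(), kwargs=None, timing_params=None):
    """Time func(*args, **kwargs); return (best_time_in_seconds, last_result)."""
    kwargs = kwargs or {}
    timing_params = timing_params or {'repeat': 3}

    best_time, result = float('inf'), None
    for _ in range(timing_params['repeat']):
        start = timeit.default_timer()
        result = func(*args, **kwargs)      # kwargs now reach the measured call
        best_time = min(best_time, timeit.default_timer() - start)
    return best_time, result

# Usage: keyword arguments (e.g. verbose=True on an estimator) can be forwarded.
elapsed, output = measure_function_time(sorted, args=([3, 1, 2],),
                                        kwargs={'reverse': True})
```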
There's no pressing need to change it now. Many algorithms don't have a verbose mode, but I agree it would be good to have.
config_example.json (Outdated)
"algorithm": "distances", | ||
"dataset": [ | ||
{ | ||
"training": "synth_clsf_1000_15000_2" |
I think it would ultimately be cleaner to split these strings, e.g. synth_clsf_1000_15000_2, into some dict like
{"origin": "synthetic", "type": "classification", "n_features": 1000, "n_samples": 15000, "n_classes": 2}
I changed the synthetic and other dataset configs to the cleaner form.
@bibikar, are there any remaining blockers or comments for this pull request?
@Alexsandruss, I will test it again and see if it works for my needs, but recent changes look good 👍
Have you added functionality to change the number of threads the benchmarks can use from the runner? It will be required to execute sklearn benchmarks using …
Yes, the number of threads is supported in the runner. Add …
Benchmarks with usage of …
It turns out that using column-major order (which is the default in …
RandomForest isn't patched because of the algorithm difference between sklearn and DAAL, but is implemented as a sklearn class. It's a bit confusing, but without a direct import (…
A pandas DataFrame with column-major order is chosen as the default in the config example since it's widely used as input for ML algorithms.
Looks good to me. @oleksandr-pavlyk, do you have any comments?
sklearn/df_regr.py (Outdated)
min_impurity_decrease=params.min_impurity_decrease,
bootstrap=params.bootstrap,
random_state=params.seed,
n_jobs=params.n_jobs)
Just a small comment: n_jobs is not supported for random forest regression in sklearn:
/localdisk/work/sbibikar/miniconda3/envs/idp2020.0/lib/python3.7/site-packages/daal4py/sklearn/ensemble/decision_forest.py:322: UserWarning: RandomForestRegressor ignores non-default settings of n_jobs
warnings.warn(_class_name + ' ignores non-default settings of n_jobs')
@bibikar, are you still waiting for a comment from Oleksandr? Can the pull request be merged without it?
@Alexsandruss I will ping him and see if he has any comments. Otherwise, it looks ready to me.
Merged, thanks!