
Extension of benchmark parameters and output #19


Merged
merged 78 commits into IntelPython:master on Apr 10, 2020

Conversation

Alexsandruss
Contributor

@Alexsandruss Alexsandruss commented Jan 22, 2020

Addition of the following benchmark parameters:

  • Data format (pandas DataFrame/numpy array)
  • Data order (column/row major)

Other pull request content:

  • JSON output format
  • cuML Python API benchmarks
  • XGBoost benchmarks (gradient boosted trees) with OMP environment variable settings
  • New data passing method
  • Box filter as an additional time measurement method
  • Benchmarks runner with config parser

@bibikar
Contributor

bibikar commented Jan 22, 2020 via email

Contributor

@bibikar bibikar left a comment

Thanks for the PR! In addition to the specific comments in the diff, there are a few other things I think need to be addressed:

  • daal4py benchmarks must also support JSON output. I guess you must be working on this since this is still a draft
  • It would probably be worth it to factor out the output part of each benchmark into a common function in e.g. bench.py (see the sketch after this list). The JSON/CSV output format differences can be handled there. We should also try to reduce the differences between the output formats. Right now, the CSV output omits certain things that the JSON format provides. Ideally, both should be able to communicate the same information.
  • I would recommend running flake8 on the files to be style-consistent. There is really no formal style guideline for this repo right now (since contributions were not originally expected 😄 ), but I have been running flake8 to verify consistency with PEP 8.
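
A minimal sketch of what such a shared output helper in bench.py could look like (the function name and result fields here are illustrative, not the final API):

import json
import sys


def print_output(params, functions, times, accuracies):
    # Collect everything once so that CSV and JSON report the same information.
    results = [{'function': f, 'time': t, 'accuracy': a}
               for f, t, a in zip(functions, times, accuracies)]
    if getattr(params, 'output_format', 'csv') == 'json':
        json.dump(results, sys.stdout, indent=4)
    else:
        print('function,time,accuracy')
        for res in results:
            print(f"{res['function']},{res['time']},{res['accuracy']}")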

@bibikar bibikar self-assigned this Jan 23, 2020
@Alexsandruss Alexsandruss marked this pull request as ready for review January 29, 2020 21:59
Contributor

@bibikar bibikar left a comment

Looks good in general. I have a few comments attached. I noticed that the permissions for the files have changed from 644 to 755 at some point. Was there any reason for this change? I don't think it is useful to have python scripts with the execute bit set unless a shebang is included at the top of the file.

The most important comments to address are about make_datasets.py. It would also be good if the JSON/CSV output could be handled entirely in bench.py instead of in separate benchmarks, since the way it is right now looks very complicated. This could be done with a single function call from benchmarks. It is not important to preserve the names e.g. Linear.fit and Linear.predict.

Since you have updated most of the python scripts here, their copyright years should also be updated to include 2020 if they don't yet have it.

@@ -119,23 +119,22 @@ ARGS_SKLEARN_ridge = --size "$(REGRESSION_SIZE)"
ARGS_SKLEARN_linear = --size "$(REGRESSION_SIZE)"
ARGS_SKLEARN_pca_daal = --size "$(REGRESSION_SIZE)" --svd-solver daal
ARGS_SKLEARN_pca_full = --size "$(REGRESSION_SIZE)" --svd-solver full
ARGS_SKLEARN_kmeans = --data-multiplier "$(MULTIPLIER)" \
Contributor

If the runner could handle running everything, including native benchmarks, we could just remove the top-level Makefile, since that only exists for the same purpose. This might not be possible yet because the native benchmarks don't accept the exact same inputs (now, some sklearn benchmarks require a testing dataset in addition to training datasets, while the same native benchmarks do not). In that case, I can later change the native benchmarks to have the exact same arguments as the python ones so we can completely remove the Makefile.

--fileY data/multi/y-$(DFCLF_SIZE).npy
ARGS_SKLEARN_dfreg = --fileX data/reg/X-$(DFREG_SIZE).npy \
--fileY data/reg/y-$(DFREG_SIZE).npy
ARGS_SKLEARN_kmeans = --file-X-train data/clustering/kmeans_$(KMEANS_SIZE).npy \
Contributor

It would be good to align some of this indentation if we're not removing the Makefile in this PR. GitHub has a tab width of 8.

Contributor Author

It's unnecessary, at least for the Python part of the Makefile, which will probably be removed.

@Alexsandruss
Contributor Author

@bibikar, I have addressed the major comments in this pull request. Are there other changes that should be included?

@bibikar
Contributor

bibikar commented Feb 4, 2020

@Alexsandruss Will take a look, thanks!

Contributor

@bibikar bibikar left a comment

Thanks! I've attached some comments. Should be ready to merge soon once those are addressed.

runner.py Outdated
cases = cases * n_param_values
for i in range(n_param_values):
for j in range(prev_lenght):
cases[prev_lenght * i + j] += ' {}{} {}'.format(
Contributor

Throughout this file, since we only target python>=3.6 with these benchmarks, you could use f-strings, as we do in bench.py files, instead of str.format.
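
For illustration, the str.format call quoted above could be written as an f-string (prefix, name, and value are placeholders for whatever arguments were actually passed to format):

cases[prev_lenght * i + j] += f' {prefix}{name} {value}'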

Contributor Author

f-strings are now used everywhere.

runner.py Outdated

parser = argparse.ArgumentParser()
parser.add_argument('--config', metavar='ConfigPath', type=str,
default='config.json',
Contributor

Could you possibly add your example config to this repo, e.g. as config.example.json?

Also, since you specify a file here, you could use argparse.FileType('r') as the type, and argparse will automatically check for the existence of the file:

$ python zzzzzz.py aaaaaaaaa
usage: zzzzzz.py [-h] file
zzzzzz.py: error: argument file: can't open 'aaaaaaaaa': [Errno 2] No such file or directory: 'aaaaaaaaa'
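
A sketch of the suggested argparse usage (using the config_example.json name that appears later in this thread):

import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument('--config', metavar='ConfigPath',
                    type=argparse.FileType('r'),
                    default='config_example.json',
                    help='Path to the benchmarking configuration file')
args = parser.parse_args()
config = json.load(args.config)  # args.config is already an open file object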

Contributor Author

Corrected, config added.

runner.py Outdated
x_train_file, y_train_file,
x_test_file, y_test_file)
else:
command = 'python make_datasets.py -s {} -f {} classification -c {} -x {} -y {}'.format(
Contributor

Similar duplication here. If a user doesn't want to generate a testing dataset, you can simply leave out the arguments for testing instead of generating separate arguments for runs with testing and for runs without testing.
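
A sketch of how the duplication could be avoided by appending the test-set arguments only when they are needed (the --test-x/--test-y flags are placeholders for whatever options make_datasets.py actually accepts for the testing split):

def build_datasets_command(train_size, n_features, n_classes,
                           x_train_file, y_train_file,
                           x_test_file=None, y_test_file=None):
    command = (f'python make_datasets.py -s {train_size} -f {n_features} '
               f'classification -c {n_classes} -x {x_train_file} -y {y_train_file}')
    if x_test_file is not None and y_test_file is not None:
        # Only generate a testing dataset when the benchmark requires one.
        command += f' --test-x {x_test_file} --test-y {y_test_file}'
    return command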

Contributor Author

Dataset generation was rewritten; code duplication was avoided where possible.

runner.py Outdated
if not args.dummy_run:
r = subprocess.run(
command.split(' '), stdout=subprocess.PIPE,
stderr=stderr_file, encoding='utf-8')
Contributor

I like to see stderr in the terminal in case something went wrong or some diagnostic messages were printed. The user can also specify stderr redirection without our intervention with 2> some_file in the shell, so you can probably leave this redirection out entirely.
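
A minimal sketch of the suggested behavior, where the benchmark's stderr is simply inherited by the runner process (the command string here is a hypothetical stand-in for the one built elsewhere in runner.py):

import subprocess

command = 'python sklearn/bench_example.py'  # hypothetical benchmark command
# No stderr argument: diagnostics go straight to the runner's own stderr,
# and users can still redirect them themselves with `2> some_file`.
r = subprocess.run(command.split(' '), stdout=subprocess.PIPE, encoding='utf-8')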

Contributor Author

Changed; benchmark stderr is now redirected to the runner's stderr in real time.

runner.py Outdated
log += r.stdout

# add commas to correct JSON output
while '}\n{' in log:
Contributor

Why not simply parse the JSON output from each benchmark separately, or better yet, always output a list of maps [{benchmark1...}, {benchmark2...}] from each benchmark? That way, the output of each benchmark is valid JSON and you don't need to worry about fixing it later. You can directly do something like list_of_results += json.loads(r.stdout) for each benchmark, and then simply do result['results'] = list_of_results at the end. No need to mess with the actual text of the JSON representation.
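
A sketch of the aggregation being suggested (benchmark_commands is a placeholder for the command list assembled from the config):

import json
import subprocess

benchmark_commands = []  # filled in from the parsed config
list_of_results = []
for command in benchmark_commands:
    r = subprocess.run(command.split(' '), stdout=subprocess.PIPE,
                       encoding='utf-8')
    # Each benchmark prints a JSON list such as [{...}, {...}], so its
    # stdout is valid JSON on its own and no text patching is needed.
    list_of_results += json.loads(r.stdout)

result = {'results': list_of_results}
print(json.dumps(result, indent=4))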

Contributor Author

The way results are added to the output JSON has been corrected.

sklearn/bench.py Outdated
def print_output(library, algorithm, stages, columns, params, functions,
times, accuracy_type, accuracies, data, alg_instance=None):
if params.output_format == 'csv':
output_csv(columns, params, functions, times, [None, accuracies[1]])
Contributor

Since functions appears to be semantically equivalent to algorithm.stages[i], can we generate functions in this function (probably better: in output_csv) instead of having to specify it in the individual scripts? It would also probably be better to change columns to csv_columns for clarity. This will affect the output that I have to parse on my end, but that's ok if it's possible to unify everything.

accuracies argument should be specified as a keyword argument, since it's declared as one. Also, why not report training accuracy as well in CSV format?
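
For illustration, functions could be derived inside print_output (or output_csv) from the algorithm name and its stages:

algorithm = 'Linear'                 # illustrative values
stages = ['fit', 'predict']
functions = [f'{algorithm}.{stage}' for stage in stages]
# functions == ['Linear.fit', 'Linear.predict']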

Contributor Author

Training accuracy was added to the CSV format.
I think better unification of the code will be possible once backward compatibility is no longer required (if that happens).

'for benchmarks')
parser.add_argument('--dummy-run', default=False, action='store_true',
help='Run configuration parser and datasets generation'
'without benchmarks running')
Contributor

If you could also support CSV output here, that would allow us to remove the top-level Makefile entirely. No special processing is needed for CSV, and each CSV document sent to stdout can be separated with some special character (e.g. %) on its own line. Better yet, CSV outputs can be written to result files as is currently done in the Makefile.
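
A minimal sketch of how the runner could split such combined output back into individual CSV documents, assuming each benchmark ends its CSV block with a line containing only '%':

def split_csv_documents(combined_output):
    # combined_output is the captured stdout of one or more benchmark runs.
    return [doc.strip() for doc in combined_output.split('\n%\n') if doc.strip()]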

Contributor Author

CSV output support was added.

@bibikar
Contributor

bibikar commented Feb 4, 2020

The Makefile also currently has 755 permissions set. If you plan to delete it, that's fine, but if you don't, it should be set back to 644.

@Alexsandruss
Contributor Author

@bibikar, I added the box filter option for time measurements and the XGBoost benchmark. Please review them and the other parts as needed. I don't expect any more additions to this PR, so it's time to finish it.

@bibikar
Contributor

bibikar commented Mar 19, 2020

I tried to run python runner.py --config config_example.json and got the following.

(idp2020.0) sbibikar@ansatlin13 /localdisk/work/sbibikar/repos/scikit-learn_bench$ python runner.py --config config_example.json
Traceback (most recent call last):
  File "runner.py", line 163, in <module>
    'channel': pkg_info[3]
IndexError: list index out of range

It appears that conda list's output sometimes doesn't include the channel:

# packages in environment at /localdisk/work/sbibikar/miniconda3/envs/idp2020.0:
#
# Name                    Version                   Build  Channel
absl-py                   0.9.0                    py37_0
asn1crypto                0.24.0                   py37_3    intel
astor                     0.8.0                    py37_0
backcall                  0.1.0                    py37_0
bzip2                     1.0.8                         0    intel
c-ares                    1.15.0            h7b6447c_1001
certifi                   2019.9.11                py37_0
cffi                      1.12.3                   py37_2    intel
chardet                   3.0.4                    py37_3    intel
conda-package-handling    1.4.0                    py37_0    intel
cryptography              2.7                      py37_0    intel
cycler                    0.10.0                   py37_7    intel
cython                    0.29.13          py37ha68da19_0    intel
daal                      2020.0                intel_166    intel
daal4py                   2020.0           py37ha68da19_5    intel
decorator                 4.4.1                      py_0
freetype                  2.10.1                        1    intel
funcsigs                  1.0.2                    py37_7    intel
gast                      0.3.3                      py_0
google-pasta              0.1.8                      py_0
grpcio                    1.27.2           py37hf8bcb03_0
h5py                      2.8.0            py37h989c5e5_3

It may be cleaner to also run conda list --json, parse its output, and grab what you need (or just grab everything).
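
A sketch of the suggested approach (field access is guarded with .get() in case some package entries do not carry every field):

import json
import subprocess

r = subprocess.run(['conda', 'list', '--json'], stdout=subprocess.PIPE,
                   encoding='utf-8')
packages = json.loads(r.stdout)
environment = [{'name': pkg.get('name'),
                'version': pkg.get('version'),
                'channel': pkg.get('channel')} for pkg in packages]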

Contributor

@bibikar bibikar left a comment

Is it possible to run the benchmarks with a single thread using the runner, i.e. setting OMP_NUM_THREADS=1 and passing --num-threads 1, etc., to the benchmarks?

It would also be good if you could look through previous comments like the comment about constructing the json list.

cuml/bench.py Outdated
parser.add_argument('--time-method', type=str, default='mean_min',
choices=('box_filter', 'mean_min'),
help='Method used for time mesurements')
parser.add_argument('--n-meas', type=int, default=100,
Contributor

I would make this long form --n-measurements or --box-filter-measurements

Contributor Author

The argument name was changed to --box-filter-measurements.

@@ -227,6 +217,44 @@ def prepare_daal(num_threads=-1):
return num_threads, daal_version


def measure_function_time(func, *args, params, **kwargs):
Contributor

I noticed that here you don't pass kwargs anywhere. Maybe it would now be better to change this function to have the signature func, args, kwargs, timing_params, explicitly specify args and kwargs as a tuple and a dict, and then actually execute func(*args, **kwargs) during timing. This would make it possible to have verbose=True for both timing and function execution, and make things clearer. I'm not sure why I didn't implement it this way before, but it would be better.
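
A minimal sketch of the signature being proposed; the timing loop and the n_runs attribute are placeholders rather than the repository's actual implementation:

import time


def measure_function_time(func, args, kwargs, params):
    # args is a tuple and kwargs is a dict, so keyword arguments such as
    # verbose=True can be forwarded to the measured call itself.
    best_time, result = None, None
    for _ in range(getattr(params, 'n_runs', 10)):
        t0 = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - t0
        if best_time is None or elapsed < best_time:
            best_time = elapsed
    return best_time, result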

Contributor Author

There's no real need to change it now. Many algorithms don't have a verbose mode, but I agree it would be good to have.

"algorithm": "distances",
"dataset": [
{
"training": "synth_clsf_1000_15000_2"
Contributor

I think it would be ultimately cleaner to split these strings synth_clsf_1000_15000_2 into some dict like
{"origin": "synthetic", "type": "classification", "n_features": 1000, "n_samples": 15000, "n_classes": 2}

Contributor Author

I changed the synthetic and other dataset configs to the cleaner form.

@Alexsandruss
Contributor Author

Alexsandruss commented Mar 24, 2020

Thanks for conda list --json, I didn't know about it.
OMP_NUM_THREADS is needed for XGBoost and is currently set only for that library. Without it, XGBoost may have lower performance when hyper-threading is on.
I think finer control over the number of threads is needed, but it can be added later.
I looked through all the conversations in this pull request, and it looks like all comments have been addressed. Am I right?

@Alexsandruss
Contributor Author

@bibikar, are there any remaining blockers or comments on this pull request?

@bibikar
Contributor

bibikar commented Mar 26, 2020

@Alexsandruss, I will test it again and see if it works for my needs, but recent changes look good 👍

@bibikar
Contributor

bibikar commented Mar 26, 2020

Have you added functionality to change the number of threads a benchmark can use from the runner? It will be required to execute sklearn benchmarks with --num-threads and also to provide the OMP_NUM_THREADS environment variable to sklearn runs. Currently, I use the Makefile for this functionality and just set NUM_THREADS = 1 at the top when I need to run benchmarks single-threaded. But if you want to move that to a separate PR, I think that's fine.

getFPType = import_fptype_getter() seems somewhat unnecessary. You could just put that code at the top-level of bench.py and then do from bench import getFPType in each benchmark and not have to call the function from each benchmark.
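
In code form, the suggestion is roughly the following sketch (import_fptype_getter is the existing helper in bench.py, and X_train stands for whatever data a benchmark loads):

# bench.py, module level
getFPType = import_fptype_getter()   # resolved once when bench.py is imported

# inside an individual benchmark script
from bench import getFPType
fptype = getFPType(X_train)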

@Alexsandruss
Contributor Author

Alexsandruss commented Mar 27, 2020

Yes, the number of threads is supported in the runner. Add "num-threads": [1] to the common parameters or to a single case's parameters in the config to run benchmarks single-threaded.
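
For example, a config fragment might look like this (the surrounding "common" section name is illustrative; only the "num-threads": [1] entry comes from the comment above):

{
    "common": {
        "num-threads": [1]
    }
}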

@Alexsandruss
Contributor Author

The benchmarks that use getFPType have been corrected.

@bibikar
Contributor

bibikar commented Mar 27, 2020

It turns out that using column-major order (which is the default in config_example.json) causes sklearn's RandomForestClassifier to not call DAAL. Is there any reason that column-major order is the default in the example?

@Alexsandruss
Contributor Author

RandomForest isn't patched because of algorithmic differences between sklearn and DAAL; it's implemented as a separate sklearn-style class. It's a bit confusing, but without the direct import (from daal4py.sklearn.ensemble import RandomForestClassifier), DAAL isn't used. The --use-sklearn-class argument controls which class is used.
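
In code form, the described behavior is roughly this sketch (not the exact benchmark code; params is assumed to come from argparse):

if params.use_sklearn_class:
    from sklearn.ensemble import RandomForestClassifier
else:
    from daal4py.sklearn.ensemble import RandomForestClassifier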

@Alexsandruss
Contributor Author

A pandas DataFrame in column-major order is chosen as the default in the example config because it's a widely used input format for ML algorithms.

Contributor

@bibikar bibikar left a comment

Looks good to me. @oleksandr-pavlyk, do you have any comments?

min_impurity_decrease=params.min_impurity_decrease,
bootstrap=params.bootstrap,
random_state=params.seed,
n_jobs=params.n_jobs)
Contributor

just a small comment: n_jobs is not supported for random forest regression in sklearn

/localdisk/work/sbibikar/miniconda3/envs/idp2020.0/lib/python3.7/site-packages/daal4py/sklearn/ensemble/decision_forest.py:322: UserWarning: RandomForestRegressor ignores non-default settings of n_jobs
  warnings.warn(_class_name + ' ignores non-default settings of n_jobs')

@Alexsandruss
Contributor Author

Alexsandruss commented Apr 9, 2020

@bibikar, are you still waiting for a comment from Oleksandr? Can the pull request be merged without it?

@bibikar
Contributor

bibikar commented Apr 10, 2020

@Alexsandruss I will ping him and see if he has any comments. Otherwise, it looks ready to me

@bibikar bibikar merged commit f7d62c7 into IntelPython:master Apr 10, 2020
@bibikar
Contributor

bibikar commented Apr 10, 2020

Merged, thanks!
