
Extension of benchmark parameters and output #19


Merged
merged 78 commits into IntelPython:master on Apr 10, 2020

Conversation

Alexsandruss
Contributor

@Alexsandruss Alexsandruss commented Jan 22, 2020

Addition of the following benchmark parameters:

  • Data format (pandas DataFrame/numpy array)
  • Data order (column/row major)

Other pull request content:

  • JSON output format
  • cuML Python API benchmarks
  • XGBoost benchmarks (gradient boosted trees) with OMP environment variable settings
  • New data passing method
  • Box filter as an additional time measurement method
  • Benchmarks runner with config parser

@bibikar
Contributor

bibikar commented Jan 22, 2020 via email

Contributor

@bibikar bibikar left a comment

Thanks for the PR! In addition to the specific comments in the diff, there are a few other things I think need to be addressed:

  • daal4py benchmarks must also support JSON output. I guess you must be working on this since this is still a draft
  • It would probably be worth it to factor out the output part of each benchmark into a common function in e.g. bench.py (see the sketch after this list). The JSON/CSV output format differences can be handled there. We should also try to reduce the differences between the output formats. Right now, the CSV output omits certain things that the JSON format provides. Ideally, both should be able to communicate the same information.
  • I would recommend running flake8 on the files to be style-consistent. There is really no formal style guideline for this repo right now (since contributions were not originally expected 😄 ), but I have been running flake8 to verify consistency with PEP 8.
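
A minimal sketch of what such a shared output helper in bench.py could look like (the function name and result fields here are illustrative, not the final API):

import json
import sys


def print_output(params, functions, times, accuracies):
    # Collect everything once so that CSV and JSON report the same information.
    results = [{'function': f, 'time': t, 'accuracy': a}
               for f, t, a in zip(functions, times, accuracies)]
    if getattr(params, 'output_format', 'csv') == 'json':
        json.dump(results, sys.stdout, indent=4)
    else:
        print('function,time,accuracy')
        for res in results:
            print(f"{res['function']},{res['time']},{res['accuracy']}")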

@bibikar bibikar self-assigned this Jan 23, 2020
@Alexsandruss Alexsandruss marked this pull request as ready for review January 29, 2020 21:59
Contributor

@bibikar bibikar left a comment

Looks good in general. I have a few comments attached. I noticed that the permissions for the files have changed from 644 to 755 at some point. Was there any reason for this change? I don't think it is useful to have python scripts with the execute bit set unless a shebang is included at the top of the file.

The most important comments to address are about make_datasets.py. It would also be good if the JSON/CSV output could be handled entirely in bench.py instead of in separate benchmarks, since the way it is right now looks very complicated. This could be done with a single function call from benchmarks. It is not important to preserve the names e.g. Linear.fit and Linear.predict.

Since you have updated most of the python scripts here, their copyright years should also be updated to include 2020 if they don't yet have it.

@@ -119,23 +119,22 @@ ARGS_SKLEARN_ridge = --size "$(REGRESSION_SIZE)"
ARGS_SKLEARN_linear = --size "$(REGRESSION_SIZE)"
ARGS_SKLEARN_pca_daal = --size "$(REGRESSION_SIZE)" --svd-solver daal
ARGS_SKLEARN_pca_full = --size "$(REGRESSION_SIZE)" --svd-solver full
ARGS_SKLEARN_kmeans = --data-multiplier "$(MULTIPLIER)" \
Contributor

If the runner could handle running everything, including native benchmarks, we could just remove the top-level Makefile, since that only exists for the same purpose. This might not be possible yet because the native benchmarks don't accept the exact same inputs (now, some sklearn benchmarks require a testing dataset in addition to training datasets, while the same native benchmarks do not). In that case, I can later change the native benchmarks to have the exact same arguments as the python ones so we can completely remove the Makefile.

--fileY data/multi/y-$(DFCLF_SIZE).npy
ARGS_SKLEARN_dfreg = --fileX data/reg/X-$(DFREG_SIZE).npy \
--fileY data/reg/y-$(DFREG_SIZE).npy
ARGS_SKLEARN_kmeans = --file-X-train data/clustering/kmeans_$(KMEANS_SIZE).npy \
Contributor

It would be good to align some of this indentation if we're not removing the Makefile in this PR. GitHub has a tab width of 8.

Contributor Author

It's unnecessary, at least for the Python part of the Makefile, which will probably be removed.

@Alexsandruss
Contributor Author

@bibikar, I have addressed the major comments in this pull request. Are there other changes that should be included?

@bibikar
Contributor

bibikar commented Feb 4, 2020

@Alexsandruss Will take a look, thanks!

Contributor

@bibikar bibikar left a comment

Thanks! I've attached some comments. Should be ready to merge soon once those are addressed.

runner.py Outdated
cases = cases * n_param_values
for i in range(n_param_values):
for j in range(prev_lenght):
cases[prev_lenght * i + j] += ' {}{} {}'.format(
Contributor

Throughout this file, since we only target python>=3.6 with these benchmarks, you could use f-strings, as we do in bench.py files, instead of str.format.
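
For illustration, the str.format call quoted above could be written as an f-string (prefix, name, and value are placeholders for whatever arguments were actually passed to format):

cases[prev_lenght * i + j] += f' {prefix}{name} {value}'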

Contributor Author

f-strings are now used everywhere.

runner.py Outdated

parser = argparse.ArgumentParser()
parser.add_argument('--config', metavar='ConfigPath', type=str,
default='config.json',
Contributor

Could you possibly add your example config to this repo, e.g. as config.example.json?

Also, since you specify a file here, you could use argparse.FileType('r') as the type, and argparse will automatically check for the existence of the file:

$ python zzzzzz.py aaaaaaaaa
usage: zzzzzz.py [-h] file
zzzzzz.py: error: argument file: can't open 'aaaaaaaaa': [Errno 2] No such file or directory: 'aaaaaaaaa'
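
A sketch of the suggested argparse usage (using the config_example.json name that appears later in this thread):

import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument('--config', metavar='ConfigPath',
                    type=argparse.FileType('r'),
                    default='config_example.json',
                    help='Path to the benchmarking configuration file')
args = parser.parse_args()
config = json.load(args.config)  # args.config is already an open file object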

Contributor Author

Corrected, config added.

runner.py Outdated
x_train_file, y_train_file,
x_test_file, y_test_file)
else:
command = 'python make_datasets.py -s {} -f {} classification -c {} -x {} -y {}'.format(
Contributor

Similar duplication here. If a user doesn't want to generate a testing dataset, you can simply leave out the arguments for testing instead of generating separate arguments for runs with testing and for runs without testing.
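
A sketch of how the duplication could be avoided by appending the test-set arguments only when they are needed (the --test-x/--test-y flags are placeholders for whatever options make_datasets.py actually accepts for the testing split):

def build_datasets_command(train_size, n_features, n_classes,
                           x_train_file, y_train_file,
                           x_test_file=None, y_test_file=None):
    command = (f'python make_datasets.py -s {train_size} -f {n_features} '
               f'classification -c {n_classes} -x {x_train_file} -y {y_train_file}')
    if x_test_file is not None and y_test_file is not None:
        # Only generate a testing dataset when the benchmark requires one.
        command += f' --test-x {x_test_file} --test-y {y_test_file}'
    return command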

Contributor Author

Dataset generation was rewritten; code duplication was avoided where possible.

runner.py Outdated
if not args.dummy_run:
r = subprocess.run(
command.split(' '), stdout=subprocess.PIPE,
stderr=stderr_file, encoding='utf-8')
Contributor

I like to see stderr in the terminal in case something went wrong or some diagnostic messages were printed. The user can also specify stderr redirection without our intervention with 2> some_file in the shell, so you can probably leave this redirection out entirely.
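
A minimal sketch of the suggested behavior, where the benchmark's stderr is simply inherited by the runner process (the command string here is a hypothetical stand-in for the one built elsewhere in runner.py):

import subprocess

command = 'python sklearn/bench_example.py'  # hypothetical benchmark command
# No stderr argument: diagnostics go straight to the runner's own stderr,
# and users can still redirect them themselves with `2> some_file`.
r = subprocess.run(command.split(' '), stdout=subprocess.PIPE, encoding='utf-8')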

Contributor Author

Changed; benchmark stderr is now redirected to the runner's stderr in real time.

runner.py Outdated
log += r.stdout

# add commas to correct JSON output
while '}\n{' in log:
Contributor

Why not simply parse the JSON output from each benchmark separately, or better yet, always output a list of maps [{benchmark1...}, {benchmark2...}] from each benchmark? That way, the output of each benchmark is valid JSON and you don't need to worry about fixing it later. You can directly do something like list_of_results += json.loads(r.stdout) for each benchmark, and then simply do result['results'] = list_of_results at the end. No need to mess with the actual text of the JSON representation.
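
A sketch of the aggregation being suggested (benchmark_commands is a placeholder for the command list assembled from the config):

import json
import subprocess

benchmark_commands = []  # filled in from the parsed config
list_of_results = []
for command in benchmark_commands:
    r = subprocess.run(command.split(' '), stdout=subprocess.PIPE,
                       encoding='utf-8')
    # Each benchmark prints a JSON list such as [{...}, {...}], so its
    # stdout is valid JSON on its own and no text patching is needed.
    list_of_results += json.loads(r.stdout)

result = {'results': list_of_results}
print(json.dumps(result, indent=4))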

Contributor Author

The way results are added to the output JSON has been corrected.

sklearn/bench.py Outdated
def print_output(library, algorithm, stages, columns, params, functions,
times, accuracy_type, accuracies, data, alg_instance=None):
if params.output_format == 'csv':
output_csv(columns, params, functions, times, [None, accuracies[1]])
Contributor

Since functions appears to be semantically equivalent to algorithm.stages[i], can we generate functions in this function (probably better: in output_csv) instead of having to specify it in the individual scripts? It would also probably be better to change columns to csv_columns for clarity. This will affect the output that I have to parse on my end, but that's ok if it's possible to unify everything.

accuracies argument should be specified as a keyword argument, since it's declared as one. Also, why not report training accuracy as well in CSV format?
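
For illustration, functions could be derived inside print_output (or output_csv) from the algorithm name and its stages:

algorithm = 'Linear'                 # illustrative values
stages = ['fit', 'predict']
functions = [f'{algorithm}.{stage}' for stage in stages]
# functions == ['Linear.fit', 'Linear.predict']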

Contributor Author

Training accuracy was added to the CSV format.
I think better unification of the code will be possible once backward compatibility is no longer required (if that happens).

'for benchmarks')
parser.add_argument('--dummy-run', default=False, action='store_true',
help='Run configuration parser and datasets generation'
'without benchmarks running')
Contributor

If you could also support CSV output here, that would allow us to remove the top-level Makefile entirely. No special processing is needed for CSV, and each CSV document sent to stdout can be separated with some special character (e.g. %) on its own line. Better yet, CSV outputs can be written to result files as is currently done in the Makefile.
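
A minimal sketch of how the runner could split such combined output back into individual CSV documents, assuming each benchmark ends its CSV block with a line containing only '%':

def split_csv_documents(combined_output):
    # combined_output is the captured stdout of one or more benchmark runs.
    return [doc.strip() for doc in combined_output.split('\n%\n') if doc.strip()]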

Contributor Author

CSV output support was added.

@bibikar
Contributor

bibikar commented Feb 4, 2020

The Makefile also currently has 755 permissions set. If you plan to delete it, that's fine, but if you don't, it should be set back to 644.

@Alexsandruss
Contributor Author

@bibikar, I added the box filter option for time measurements and the XGBoost benchmark. Please review them and the other parts as needed. I don't expect any more additions to this PR, so it's time to finish it.

@bibikar
Contributor

bibikar commented Mar 19, 2020

I tried to run python runner.py --config config_example.json and got the following.

(idp2020.0) sbibikar@ansatlin13 /localdisk/work/sbibikar/repos/scikit-learn_bench$ python runner.py --config config_example.json
Traceback (most recent call last):
  File "runner.py", line 163, in <module>
    'channel': pkg_info[3]
IndexError: list index out of range

It appears that conda list's output sometimes doesn't include the channel:

# packages in environment at /localdisk/work/sbibikar/miniconda3/envs/idp2020.0:
#
# Name                    Version                   Build  Channel
absl-py                   0.9.0                    py37_0
asn1crypto                0.24.0                   py37_3    intel
astor                     0.8.0                    py37_0
backcall                  0.1.0                    py37_0
bzip2                     1.0.8                         0    intel
c-ares                    1.15.0            h7b6447c_1001
certifi                   2019.9.11                py37_0
cffi                      1.12.3                   py37_2    intel
chardet                   3.0.4                    py37_3    intel
conda-package-handling    1.4.0                    py37_0    intel
cryptography              2.7                      py37_0    intel
cycler                    0.10.0                   py37_7    intel
cython                    0.29.13          py37ha68da19_0    intel
daal                      2020.0                intel_166    intel
daal4py                   2020.0           py37ha68da19_5    intel
decorator                 4.4.1                      py_0
freetype                  2.10.1                        1    intel
funcsigs                  1.0.2                    py37_7    intel
gast                      0.3.3                      py_0
google-pasta              0.1.8                      py_0
grpcio                    1.27.2           py37hf8bcb03_0
h5py                      2.8.0            py37h989c5e5_3

It may be cleaner to also run conda list --json, parse its output, and grab what you need (or just grab everything).
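
A sketch of the suggested approach (field access is guarded with .get() in case some package entries do not carry every field):

import json
import subprocess

r = subprocess.run(['conda', 'list', '--json'], stdout=subprocess.PIPE,
                   encoding='utf-8')
packages = json.loads(r.stdout)
environment = [{'name': pkg.get('name'),
                'version': pkg.get('version'),
                'channel': pkg.get('channel')} for pkg in packages]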

Contributor

@bibikar bibikar left a comment

Is it possible to run the benchmarks with a single thread using the runner, i.e. setting OMP_NUM_THREADS=1 and passing --num-threads 1, etc., to the benchmarks?

It would also be good if you could look through previous comments like the comment about constructing the json list.

cuml/bench.py Outdated
parser.add_argument('--time-method', type=str, default='mean_min',
choices=('box_filter', 'mean_min'),
help='Method used for time mesurements')
parser.add_argument('--n-meas', type=int, default=100,
Contributor

I would make this long form --n-measurements or --box-filter-measurements

Contributor Author

The argument name was changed to --box-filter-measurements.

@@ -227,6 +217,44 @@ def prepare_daal(num_threads=-1):
return num_threads, daal_version


def measure_function_time(func, *args, params, **kwargs):
Contributor

I noticed that here you don't pass kwargs anywhere. Maybe it would now be better to change this function to have the signature func, args, kwargs, timing_params, explicitly specify args and kwargs as a tuple and a dict, and then actually execute func(*args, **kwargs) during timing. This would make it possible to have verbose=True for both timing and function execution, and make things clearer. I'm not sure why I didn't implement it this way before, but it would be better.
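
A minimal sketch of the signature being proposed; the timing loop and the n_runs attribute are placeholders rather than the repository's actual implementation:

import time


def measure_function_time(func, args, kwargs, params):
    # args is a tuple and kwargs is a dict, so keyword arguments such as
    # verbose=True can be forwarded to the measured call itself.
    best_time, result = None, None
    for _ in range(getattr(params, 'n_runs', 10)):
        t0 = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - t0
        if best_time is None or elapsed < best_time:
            best_time = elapsed
    return best_time, result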

Contributor Author

There's no real need to change it now. Many algorithms don't have a verbose mode, but I agree it would be good to have.

"algorithm": "distances",
"dataset": [
{
"training": "synth_clsf_1000_15000_2"
Contributor

I think it would be ultimately cleaner to split these strings synth_clsf_1000_15000_2 into some dict like
{"origin": "synthetic", "type": "classification", "n_features": 1000, "n_samples": 15000, "n_classes": 2}

Contributor Author

I changed the synthetic and other dataset configs to the cleaner form.

@Alexsandruss
Contributor Author

Alexsandruss commented Mar 24, 2020

Thanks for conda list --json, I didn't know about it.
OMP_NUM_THREADS is needed for XGBoost and is currently set only for that library. Without it, XGBoost may have lower performance when hyper-threading is on.
I think finer control over the number of threads is needed, but it can be added later.
I looked through all the conversations in this pull request, and it looks like all comments have been addressed. Am I right?

@Alexsandruss
Contributor Author

@bibikar, are there any remaining blockers or comments on this pull request?

@bibikar
Contributor

bibikar commented Mar 26, 2020

@Alexsandruss, I will test it again and see if it works for my needs, but recent changes look good 👍

@bibikar
Contributor

bibikar commented Mar 26, 2020

Have you added functionality to change the number of threads a benchmark can use from the runner? It will be required to execute sklearn benchmarks with --num-threads and also to provide the OMP_NUM_THREADS environment variable to sklearn runs. Currently, I use the Makefile for this functionality and just set NUM_THREADS = 1 at the top when I need to run benchmarks single-threaded. But if you want to move that to a separate PR, I think that's fine.

getFPType = import_fptype_getter() seems somewhat unnecessary. You could just put that code at the top-level of bench.py and then do from bench import getFPType in each benchmark and not have to call the function from each benchmark.
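
In code form, the suggestion is roughly the following sketch (import_fptype_getter is the existing helper in bench.py, and X_train stands for whatever data a benchmark loads):

# bench.py, module level
getFPType = import_fptype_getter()   # resolved once when bench.py is imported

# inside an individual benchmark script
from bench import getFPType
fptype = getFPType(X_train)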

@Alexsandruss
Contributor Author

Alexsandruss commented Mar 27, 2020

Yes, the number of threads is supported in the runner. Add "num-threads": [1] to the common parameters or to a single case's parameters in the config to run benchmarks single-threaded.
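
For example, a config fragment might look like this (the surrounding "common" section name is illustrative; only the "num-threads": [1] entry comes from the comment above):

{
    "common": {
        "num-threads": [1]
    }
}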

@Alexsandruss
Contributor Author

The benchmarks that use getFPType have been corrected.

@bibikar
Contributor

bibikar commented Mar 27, 2020

It turns out that using column-major order (which is the default in config_example.json) causes sklearn's RandomForestClassifier to not call DAAL. Is there any reason that column-major order is the default in the example?

@Alexsandruss
Contributor Author

RandomForest isn't patched because of algorithmic differences between sklearn and DAAL; it's implemented as a separate sklearn-style class. It's a bit confusing, but without the direct import (from daal4py.sklearn.ensemble import RandomForestClassifier), DAAL isn't used. The --use-sklearn-class argument controls which class is used.
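
In code form, the described behavior is roughly this sketch (not the exact benchmark code; params is assumed to come from argparse):

if params.use_sklearn_class:
    from sklearn.ensemble import RandomForestClassifier
else:
    from daal4py.sklearn.ensemble import RandomForestClassifier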

@Alexsandruss
Contributor Author

A pandas DataFrame in column-major order is chosen as the default in the example config because it's a widely used input format for ML algorithms.

Contributor

@bibikar bibikar left a comment

Looks good to me. @oleksandr-pavlyk, do you have any comments?

min_impurity_decrease=params.min_impurity_decrease,
bootstrap=params.bootstrap,
random_state=params.seed,
n_jobs=params.n_jobs)
Contributor

just a small comment: n_jobs is not supported for random forest regression in sklearn

/localdisk/work/sbibikar/miniconda3/envs/idp2020.0/lib/python3.7/site-packages/daal4py/sklearn/ensemble/decision_forest.py:322: UserWarning: RandomForestRegressor ignores non-default settings of n_jobs
  warnings.warn(_class_name + ' ignores non-default settings of n_jobs')

@Alexsandruss
Contributor Author

Alexsandruss commented Apr 9, 2020

@bibikar, are you still waiting for a comment from Oleksandr? Can the pull request be merged without it?

@bibikar
Contributor

bibikar commented Apr 10, 2020

@Alexsandruss I will ping him and see if he has any comments. Otherwise, it looks ready to me

@bibikar bibikar merged commit f7d62c7 into IntelPython:master Apr 10, 2020
@bibikar
Contributor

bibikar commented Apr 10, 2020

Merged, thanks!
