[WIP - do not merge!] Move sparkdl utilities for conversion between numpy arrays and image schema to ImageSchema by tomasatdatabricks · Pull Request #90 · databricks/spark-deep-learning

tomasatdatabricks · 2017-12-28T23:35:34Z

[WIP] Preparation for moving stuff to Spark.

Moved the utilities for image schema <=> numpy array conversion into the ImageSchema code (copy-pasted from Spark 2.3).

Extended the ImageSchema Scala code with support/information for all OpenCV modes.
Extended the Python toNDArray and toImage utilities to work with all supported data types.
[minor] The sparkdl toImage function included batch size stripping; I had to make a separate call for that.

MrBago

This looks good to me, just a few minor comments.

MrBago · 2017-12-29T18:10:40Z

python/sparkdl/image/image.py

+            dtype=ocvType.nptype,
            buffer=image.data,
-            strides=(width * nChannels, nChannels, 1))
+            strides=(width * nChannels * itemSz, nChannels * itemSz, itemSz))


Will numpy figure out the right strides if we don't pass them explicitly?

Hmm yeah I would think so. The original code from ms folks was like this and I did not want to do more changes than necessary.
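
For reference, a minimal sketch of the strides question, assuming the buffer is laid out row-major (C-contiguous) with no padding between rows. The dimensions and dtype are made up for illustration, and `data` is only a stand-in for `image.data`. Under that assumption, omitting `strides` should give the same result as the explicit tuple in the diff:

```python
import numpy as np

# Hypothetical image dimensions and element type, chosen only for illustration.
height, width, nChannels = 4, 5, 3
dtype = np.dtype(np.float32)
itemSz = dtype.itemsize

# Zero-filled raw bytes standing in for image.data (an assumption, not the real field).
data = bytes(bytearray(height * width * nChannels * itemSz))

# Strides passed explicitly, as in the diff above.
explicit = np.ndarray(
    shape=(height, width, nChannels),
    dtype=dtype,
    buffer=data,
    strides=(width * nChannels * itemSz, nChannels * itemSz, itemSz))

# Same call without strides: numpy derives them for a C-contiguous layout.
implicit = np.ndarray(
    shape=(height, width, nChannels),
    dtype=dtype,
    buffer=data)

strides_match = explicit.strides == implicit.strides
```

If the image rows were ever padded, the explicit form would still be needed, so this only covers the tightly packed case.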

MrBago · 2017-12-29T18:12:10Z

python/sparkdl/image/image.py


        if array.ndim != 3:
-            raise ValueError("Invalid array shape")
+            raise ValueError("Invalid array shape %s" % str(array.shape))


Do we want to reshape 2d arrays to be shape + (1,)?

I agree with their approach. I think it's better to make the caller pass the arguments in the expected format rather than trying to auto-convert, unless the conversion is completely unambiguous.

So in this case, we say images are always 3-dimensional arrays and it's up to the user to make sure they conform to that. Otherwise they might be passing something other than what they think they are passing, and we would mask their bug until later.
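
A small sketch of what this means for a caller holding a 2d grayscale array. `check_image_array` is a hypothetical stand-in for the shape check in the diff, not the actual sparkdl function; the point is only that the caller adds the channel axis explicitly:

```python
import numpy as np

def check_image_array(array):
    """Hypothetical stand-in for the ndim check shown in the diff."""
    if array.ndim != 3:
        raise ValueError("Invalid array shape %s" % str(array.shape))
    return array

# A 2d grayscale array with made-up dimensions.
gray = np.zeros((4, 5), dtype=np.uint8)

# The caller makes the single channel explicit: (4, 5) -> (4, 5, 1).
gray3d = gray[:, :, np.newaxis]
checked = check_image_array(gray3d)

# Passing the 2d array directly is rejected instead of being silently reshaped.
try:
    check_image_array(gray)
    rejected = False
except ValueError as e:
    rejected = True
    message = str(e)
```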

MrBago · 2017-12-29T18:13:09Z

python/sparkdl/image/image.py

+                "Unexpected/unsupported array data type '%s', currently only supported formats are %s" %
+                (str(
+                    array.dtype), str(
+                    self._numpyToOcvMap.keys())))


Can we get this on fewer lines or use some variables? It looks odd.

Yeah it does, I think it's the autopep8 being weird here. I'll reformat that.
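
One possible shape for the reformatted message, as a sketch only. `numpy_to_ocv` is a hypothetical stand-in for `self._numpyToOcvMap` with placeholder values, and `unsupported_dtype_error` is a helper invented for this example:

```python
import numpy as np

# Hypothetical stand-in for self._numpyToOcvMap; the values are placeholders.
numpy_to_ocv = {"uint8": "ocv-type-a", "float32": "ocv-type-b"}

def unsupported_dtype_error(array, supported):
    """Build the error with named variables instead of nested str() calls."""
    dtype_name = str(array.dtype)
    supported_names = sorted(supported.keys())
    return ValueError(
        "Unexpected/unsupported array data type '%s', "
        "currently only supported formats are %s" % (dtype_name, supported_names))

# An int64 array is not in the placeholder mapping, so it triggers the message.
err = unsupported_dtype_error(np.zeros((2, 2, 3), dtype=np.int64), numpy_to_ocv)
```

Sorting the keys also keeps the message stable across Python versions, which the original `str(keys())` would not guarantee.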

MrBago · 2017-12-29T18:23:56Z

src/test/scala/com/databricks/sparkdl/DeepImageFeaturizerSuite.scala

    testDefaultReadWrite(featurizer)
  }
+
+


Extra white space.

MrBago · 2017-12-29T18:35:15Z

python/sparkdl/image/image.py

+                                            dataType=x.dataType(),
+                                            nptype=self._ocvToNumpyMap[x.dataType()])
+                              for x in ctx._jvm.org.apache.spark.ml.image.ImageSchema.javaOcvTypes()]
+        return [x for x in self._ocvTypes]


What does this do? Isn't self._ocvTypes already a list?

The purpose was to return a copy of the list so that the private member cannot be modified.

oic, I usually see myList[:] or list(myList) to make a (shallow) copy.

Thanks, that's nicer :)
The members of the list are tuples, so a shallow copy suffices here.
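
A quick sketch of the shallow-copy point. `OcvType`, its fields, and the `Schema` class are hypothetical stand-ins for the real entries in `self._ocvTypes`; only the `list(...)` copy is the idiom being discussed:

```python
from collections import namedtuple

# Hypothetical stand-in for the tuples stored in self._ocvTypes.
OcvType = namedtuple("OcvType", ["name", "nptype"])

class Schema(object):
    def __init__(self):
        self._ocvTypes = [OcvType("A", "uint8"), OcvType("B", "float32")]

    @property
    def ocvTypes(self):
        # Shallow copy: a new list object holding the same immutable tuples.
        return list(self._ocvTypes)

schema = Schema()
copied = schema.ocvTypes

# Mutating the returned list does not touch the private member.
copied.append(OcvType("C", "int32"))
private_len = len(schema._ocvTypes)
```

The tuples themselves are shared between both lists, which is fine because they cannot be modified in place.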

codecov-io · 2017-12-29T19:35:18Z

Codecov Report

Merging #90 into master will increase coverage by 1.43%.
The diff coverage is 77.46%.

@@            Coverage Diff             @@
##           master      #90      +/-   ##
==========================================
+ Coverage   82.49%   83.92%   +1.43%     
==========================================
  Files          33       33              
  Lines        1879     1866      -13     
  Branches       35       39       +4     
==========================================
+ Hits         1550     1566      +16     
+ Misses        329      300      -29

Impacted Files	Coverage Δ
python/sparkdl/udf/keras_image_model.py	`75.6% <0%> (+1.8%)`	⬆️
...main/scala/com/databricks/sparkdl/ImageUtils.scala	`90.9% <100%> (ø)`	⬆️
...n/sparkdl/estimators/keras_image_file_estimator.py	`74.35% <100%> (ø)`	⬆️
python/sparkdl/transformers/tf_image.py	`94.06% <33.33%> (-0.05%)`	⬇️
python/sparkdl/param/image_params.py	`81.81% <50%> (+6.14%)`	⬆️
.../scala/org/apache/spark/ml/image/ImageSchema.scala	`77.94% <75%> (-1.1%)`	⬇️
python/sparkdl/image/imageIO.py	`73.33% <81.25%> (-4.77%)`	⬇️
python/sparkdl/image/image.py	`78.82% <82.35%> (+40.58%)`	⬆️


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update aeff9c9...ce17c43.

sueann

Are we planning to merge this? It'll be difficult to maintain and resolve the differences in the copied files between DL Pipelines and Spark. It'd be much easier cognitively to merge these changes into Spark then remove the corresponding files in sparkdl. If we need to keep these files in sparkdl until Spark 2.4 is out, it'd be safer to first get the changes merged into Spark then copy the exact changes here; if we merge this first, it could easily get out of sync with whatever revisions get made in Spark.

tomasatdatabricks · 2018-01-09T21:17:07Z

@sueann Yes, I agree. I would merge the Spark version first and merge this one only after Spark 2.4 is released. I made the PR here mostly because this is where we need the changes, so they can be reviewed in context, and also to run the tests.

I'll mark it WIP.

sueann · 2018-01-09T21:50:45Z

ah ok got it. thanks!

tomasatdatabricks requested a review from MrBago December 28, 2017 23:40

tomasatdatabricks force-pushed the tomas/ImageSchemaUpdate2 branch 2 times, most recently from a959b18 to b533e69 Compare December 29, 2017 18:02

MrBago approved these changes Dec 29, 2017


tomasatdatabricks force-pushed the tomas/ImageSchemaUpdate2 branch from b533e69 to 52c6041 Compare December 29, 2017 18:33

MrBago reviewed Dec 29, 2017


tomasatdatabricks force-pushed the tomas/ImageSchemaUpdate2 branch from 52c6041 to f616462 Compare December 29, 2017 18:50

merged with master

73e4c94

Added undefined ocv type to the list of types

c8c90e0

tomasatdatabricks force-pushed the tomas/ImageSchemaUpdate2 branch from f616462 to c8c90e0 Compare December 29, 2017 21:53

added conversion test for all ocv types

ce17c43

sueann reviewed Jan 9, 2018


tomasatdatabricks changed the title ~~Move sparkdl utilities for conversion between numpy arrays and image schema to ImageSchema~~ [WIP - do not merge!] Move sparkdl utilities for conversion between numpy arrays and image schema to ImageSchema Jan 9, 2018
