
Commit 1b91312

Merge branch 'master' into feature_streaming_enhancments
2 parents 2b3060e + 981a5a4 commit 1b91312

17 files changed: 307 additions & 48 deletions

CHANGELOG.md

Lines changed: 25 additions & 1 deletion
@@ -23,6 +23,30 @@ See the contents of the file `python/require.txt` to see the Python package dependencies
 * renamed packaging to `dbldatagen`
 * Releases now available at https://github.com/databrickslabs/dbldatagen/releases
 * code tidy up and rename of options
-* added text generation plugin support for python functions and 3rd party libraries such as Faker
+* added text generation plugin support for python functions and 3rd party libraries
 * Use of data generator to generate static and streaming data sources in Databricks Delta Live Tables
 * added support for install from PyPi
+
+### version 0.3.0
+
+The code for the Databricks Data Generator has the following dependencies:
+
+* Requires Databricks runtime 9.1 LTS or later
+* Requires Spark 3.1.2 or later
+* Requires Python 3.8.10 or later
+
+While the data generator framework does not require all libraries used by the runtimes, where a library from
+the Databricks runtime is used, it will use the version found in the Databricks runtime for 9.1 LTS or later.
+You can use older versions of the Databricks Labs Data Generator by referring to the explicit version.
+
+To target an older Databricks runtime version, you can install a compatibility release in your notebook:
+
+```commandline
+%pip install git+https://github.com/databrickslabs/dbldatagen@dbr_7_3_LTS_compat
+```
+
+See the [Databricks runtime release notes](https://docs.databricks.com/release-notes/runtime/releases.html)
+for the full list of dependencies.
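For example, pinning an explicit earlier release from PyPi in a notebook would look like the line below (illustrative only; check the project's release history for which versions are actually published on PyPi):

> `%pip install dbldatagen==0.2.1`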

CONTRIBUTING.md

Lines changed: 43 additions & 6 deletions
@@ -15,7 +15,10 @@ warrant that you have the legal authority to do so.
 
 ## Python compatibility
 
-The code has been tested with Python 3.7.5 and 3.8
+The code has been tested with Python 3.8.10 and later.
+
+Older releases were tested with Python 3.7.5, but as of this release the code requires the Databricks runtime 9.1 LTS or later,
+which relies on Python 3.8.10.
 
 ## Checking your code for common issues
@@ -77,10 +80,21 @@ Run `make clean dist` from the main project directory.
 
 # Testing
 
-## Creating tests
-Preferred style is to use pytest rather than unittest but some unittest based code is used in compatibility mode.
+## Developing new tests
+New tests should be created using PyTest, with classes combining multiple `pytest` tests.
+
+Existing test code contains tests based on Python's `unittest` framework, but these are
+run under `pytest` rather than `unittest`.
+
+To get a `spark` instance for test purposes, use the following code:
+
+```python
+import dbldatagen as dg
 
-Any new tests should be written as pytest compatible test classes.
+spark = dg.SparkSingleton.getLocalInstance("<name to flag spark instance>")
+```
+
+The name used to flag the spark instance should be the test module or test class name.
 
 ## Running unit / integration tests
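As a concrete illustration of the convention added in the hunk above, a minimal sketch of a new test module follows; the class name, flag string, and assertion are illustrative placeholders, not part of this commit:

```python
import dbldatagen as dg

class TestExampleFeature:
    # Flag the shared local Spark instance with the test class name,
    # as the contribution guide above recommends.
    spark = dg.SparkSingleton.getLocalInstance("TestExampleFeature")

    def test_basic_row_count(self):
        # Any Spark-based assertion works once the session exists.
        df = self.spark.range(100)
        assert df.count() == 100
```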

@@ -100,9 +114,32 @@ To run the tests using a `pipenv` environment:
 - Run `make test-with-html-report` to generate the test coverage report in `htmlcov/index.html`
 
 # Using the Databricks Labs data generator
-To use the project, the generated wheel should be installed in your Python notebook as a wheel based library
+The recommended method for installation is to install from the PyPi package.
+
+You can install the library as a notebook-scoped library when working within the Databricks
+notebook environment through the use of a `%pip` cell in your notebook.
+
+To install as a notebook-scoped library, create and execute a notebook cell with the following text:
+
+> `%pip install dbldatagen`
+
+This installs from the PyPi package.
+
+You can also install from release binaries or directly from the Github sources.
+
+The release binaries can be accessed at:
+- Databricks Labs Github Data Generator releases - https://github.com/databrickslabs/dbldatagen/releases
+
+The `%pip install` method also works on the Databricks Community Edition.
+
+Alternatively, you can download a wheel file and use the Databricks install mechanism to install the wheel-based
+library into your workspace.
+
+The `%pip install` method can also install a specific binary release.
+For example, the following command installs the release v0.2.1:
 
-Once the library has been installed, you can use it to generate a test data frame.
+> `%pip install https://github.com/databrickslabs/dbldatagen/releases/download/v021/dbldatagen-0.2.1-py3-none-any.whl`
 
 # Coding Style

README.md

Lines changed: 5 additions & 4 deletions
@@ -45,7 +45,7 @@ used in other computations
 * Generating values to conform to a schema or independent of an existing schema
 * use of SQL expressions in test data generation
 * plugin mechanism to allow use of 3rd party libraries such as Faker
-* Use of data generator to generate data sources in Databricks Delta Live Tables
+* Use within a Databricks Delta Live Tables pipeline as a synthetic data generation source
 
 Details of these features can be found in the
 [online documentation](https://databrickslabs.github.io/dbldatagen/public_docs/index.html).
@@ -57,7 +57,7 @@ details of use and many examples.
 
 Release notes and details of the latest changes for this specific release
 can be found in the Github repository
-[here](https://github.com/databrickslabs/dbldatagen/blob/release/v0.2.1/CHANGELOG.md)
+[here](https://github.com/databrickslabs/dbldatagen/blob/release/v0.3.0/CHANGELOG.md)
 
 # Installation
 
@@ -75,9 +75,10 @@ The documentation [installation notes](https://databrickslabs.github.io/dbldatag
 contains details of installation using alternative mechanisms.
 
 ## Compatibility
-The Databricks Labs data generator framework can be used with Pyspark 3.x and Python 3.6 or later
+The Databricks Labs data generator framework can be used with Pyspark 3.1.2 and Python 3.8 or later. These are
+compatible with the Databricks runtime 9.1 LTS and later releases.
 
-However prebuilt releases are tested against Pyspark 3.0.1 (compatible with the Databricks runtime 7.3 LTS
+Older prebuilt releases are tested against Pyspark 3.0.1 (compatible with the Databricks runtime 7.3 LTS
 or later) and built with Python 3.7.5
 
 For full library compatibility for a specific Databricks Spark release, see the Databricks

dbldatagen/__init__.py

Lines changed: 13 additions & 5 deletions
@@ -24,7 +24,8 @@
 """
 
 from .data_generator import DataGenerator
-from .datagen_constants import DEFAULT_RANDOM_SEED, RANDOM_SEED_RANDOM, RANDOM_SEED_FIXED, RANDOM_SEED_HASH_FIELD_NAME
+from .datagen_constants import DEFAULT_RANDOM_SEED, RANDOM_SEED_RANDOM, RANDOM_SEED_FIXED, \
+    RANDOM_SEED_HASH_FIELD_NAME, MIN_PYTHON_VERSION, MIN_SPARK_VERSION
 from .utils import ensure, topologicalSort, mkBoundsList, coalesce_values, \
     deprecated, parse_time_interval, DataGenError
 from ._version import __version__
@@ -46,12 +47,19 @@
     "text_generator_plugins"
 ]
 
+def python_version_check(python_version_expected):
+    """Check against the runtime Python version
 
-def python_version_check():
+    Allows the minimum version to be passed in to facilitate unit testing
+
+    :param python_version_expected: minimum version of Python to support, as a tuple, e.g. (3, 6)
+    :return: True if the check passes
+
+    """
     import sys
-    if not sys.version_info >= (3, 6):
-        raise RuntimeError("Minimum version of Python supported is 3.6")
+    return sys.version_info >= python_version_expected
 
 
 # let's check for a correct python version or raise an exception
-python_version_check()
+if not python_version_check(MIN_PYTHON_VERSION):
+    raise RuntimeError(f"Minimum version of Python supported is {MIN_PYTHON_VERSION[0]}.{MIN_PYTHON_VERSION[1]}")
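Since the minimum is now a plain tuple, the revised check reduces to Python's lexicographic tuple comparison against `sys.version_info`. A quick illustrative sketch (the printed values assume the interpreter versions named in the comments):

```python
import sys

# (3, 8) mirrors MIN_PYTHON_VERSION from dbldatagen.datagen_constants
print(sys.version_info >= (3, 8))   # True on Python 3.8.10 or later
print((3, 8, 10) >= (3, 8))         # True  - lexicographic tuple comparison
print((3, 7, 5) >= (3, 8))          # False - would raise the RuntimeError above
```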

dbldatagen/_version.py

Lines changed: 14 additions & 1 deletion
@@ -33,5 +33,18 @@ def get_version(version):
     return version_info
 
 
-__version__ = "0.2.1" # DO NOT EDIT THIS DIRECTLY! It is managed by bumpversion
+__version__ = "0.3.0" # DO NOT EDIT THIS DIRECTLY! It is managed by bumpversion
 __version_info__ = get_version(__version__)
+
+
+def _get_spark_version(sparkVersion):
+    try:
+        r = re.compile(r'(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)(?P<release>.*)')
+        major, minor, patch, release = r.match(sparkVersion).groups()
+        spark_version_info = VersionInfo(int(major), int(minor), int(patch), release, build="0")
+    except (RuntimeError, AttributeError):
+        spark_version_info = VersionInfo(major=3, minor=0, patch=1, release="unknown", build="0")
+        logging.warning("Could not parse spark version - using assumed Spark Version : %s", spark_version_info)
+
+    return spark_version_info
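A self-contained sketch of what this parsing does, re-declaring `VersionInfo` locally for illustration (in the module itself it is presumably the named tuple already used by `get_version`). Note that an unparseable string makes `r.match` return `None`, so the `.groups()` call raises the `AttributeError` that the except clause catches:

```python
import re
from collections import namedtuple

VersionInfo = namedtuple("VersionInfo", ["major", "minor", "patch", "release", "build"])

r = re.compile(r'(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)(?P<release>.*)')

major, minor, patch, release = r.match("3.1.2").groups()
print(VersionInfo(int(major), int(minor), int(patch), release, build="0"))
# VersionInfo(major=3, minor=1, patch=2, release='', build='0')

# A hypothetical decorated version string keeps the trailing text in `release`
print(r.match("3.2.1-databricks").groups())   # ('3', '2', '1', '-databricks')
```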

dbldatagen/data_generator.py

Lines changed: 41 additions & 1 deletion
@@ -12,8 +12,9 @@
 from pyspark.sql.types import LongType, IntegerType, StringType, StructType, StructField, DataType
 from .spark_singleton import SparkSingleton
 from .column_generation_spec import ColumnGenerationSpec
-from .datagen_constants import DEFAULT_RANDOM_SEED, RANDOM_SEED_FIXED, RANDOM_SEED_HASH_FIELD_NAME
+from .datagen_constants import DEFAULT_RANDOM_SEED, RANDOM_SEED_FIXED, RANDOM_SEED_HASH_FIELD_NAME, MIN_SPARK_VERSION
 from .utils import ensure, topologicalSort, DataGenError, deprecated
+from ._version import _get_spark_version
 
 START_TIMESTAMP_OPTION = "startTimestamp"
 ROWS_PER_SECOND_OPTION = "rowsPerSecond"
@@ -141,9 +142,48 @@ def __init__(self, sparkSession=None, name=None, randomSeedMethod=None,
         self.withColumn(ColumnGenerationSpec.SEED_COLUMN, LongType(), nullable=False, implicit=True, omit=True)
         self._batchSize = batchSize
 
+        # set up the spark session
+        self._setupSparkSession(sparkSession)
+
         # set up use of pandas udfs
         self._setupPandas(batchSize)
 
+    @classmethod
+    def _checkSparkVersion(cls, sparkVersion, minSparkVersion):
+        """
+        Check the Spark version
+        :param sparkVersion: spark version string
+        :param minSparkVersion: minimum spark version as a tuple
+        :return: True if the version passes the minimum version check
+
+        Layout of the version string must be compatible with "xx.xx.xx.patch"
+        """
+        sparkVersionInfo = _get_spark_version(sparkVersion)
+
+        if sparkVersionInfo < minSparkVersion:
+            logging.warning("*** Minimum version of Spark supported is %s - found version %s",
+                            minSparkVersion, sparkVersionInfo)
+            return False
+
+        return True
+
+    def _setupSparkSession(self, sparkSession):
+        """
+        Set up the spark session
+        :param sparkSession: spark session to use
+        :return: nothing
+        """
+        if sparkSession is None:
+            sparkSession = SparkSingleton.getInstance()
+
+        assert sparkSession is not None, "Spark session not initialized"
+
+        self.sparkSession = sparkSession
+
+        # check if the spark version meets the minimum requirements and warn if not
+        sparkVersion = sparkSession.version
+        self._checkSparkVersion(sparkVersion, MIN_SPARK_VERSION)
+
     def _setupPandas(self, pandasBatchSize):
         """
         Set up pandas
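The practical effect of `_setupSparkSession` is that a generator can now be constructed without an explicit session. A hedged usage sketch under that assumption (the column names, row counts, and builder options are illustrative, following the project's documented API):

```python
import dbldatagen as dg
from pyspark.sql.types import IntegerType

# No sparkSession argument: the generator obtains one via SparkSingleton.getInstance(),
# then _checkSparkVersion logs a warning (without failing) if Spark < MIN_SPARK_VERSION.
dataspec = (dg.DataGenerator(name="example_data", rows=1000, partitions=4)
            .withColumn("code", IntegerType(), minValue=100, maxValue=200))
df = dataspec.build()
df.show(5)
```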

dbldatagen/datagen_constants.py

Lines changed: 4 additions & 0 deletions
@@ -25,3 +25,7 @@
 RANDOM_SEED_RANDOM_FLOAT = -1.0
 RANDOM_SEED_FIXED = "fixed"
 RANDOM_SEED_HASH_FIELD_NAME = "hash_fieldname"
+
+# minimum versions for version checks
+MIN_PYTHON_VERSION = (3, 8)
+MIN_SPARK_VERSION = (3, 1, 2)

docs/source/conf.py

Lines changed: 1 addition & 1 deletion
@@ -28,7 +28,7 @@
 author = 'Databricks Inc'
 
 # The full version, including alpha/beta/rc tags
-release = "0.2.1" # DO NOT EDIT THIS DIRECTLY! It is managed by bumpversion
+release = "0.3.0" # DO NOT EDIT THIS DIRECTLY! It is managed by bumpversion
 
 
 # -- General configuration ---------------------------------------------------

makefile

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ prepare: clean
 
 create-dev-env:
 	@echo "$(OK_COLOR)=> making conda dev environment$(NO_COLOR)"
-	conda create -n $(ENV_NAME) python=3.8
+	conda create -n $(ENV_NAME) python=3.8.10
 
 create-dev-env-321:
 	@echo "$(OK_COLOR)=> making conda dev environment for Spark 3.2.1$(NO_COLOR)"

python/.bumpversion.cfg

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 [bumpversion]
-current_version = 0.2.1
+current_version = 0.3.0
 commit = False
 tag = False
 parse = (?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)\-{0,1}(?P<release>\D*)(?P<build>\d*)

python/dev_require.txt

Lines changed: 8 additions & 8 deletions
@@ -1,17 +1,17 @@
 # The following packages are used in building the test data generator framework.
 # All packages used are already installed in the Databricks runtime environment for version 6.5 or later
-numpy==1.22.0
-pandas==1.0.1
+numpy==1.19.2
+pandas==1.2.4
 pickleshare==0.7.5
 py4j==0.10.9
-pyarrow==1.0.1
-pyspark>=3.0.1
+pyarrow==4.0.0
+pyspark>=3.1.2
 python-dateutil==2.8.1
-six==1.14.0
+six==1.15.0
 
 # The following packages are required for development only
-wheel==0.34.2
-setuptools==45.2.0
+wheel==0.36.2
+setuptools==52.0.0
 bumpversion
 pytest
 pytest-cov
@@ -25,7 +25,7 @@ sphinx_rtd_theme
 nbsphinx
 numpydoc==0.8
 pypandoc
-ipython==7.16.3
+ipython==7.22.0
 recommonmark
 sphinx-markdown-builder
 rst2pdf==0.98

python/require.txt

Lines changed: 7 additions & 7 deletions
@@ -1,17 +1,17 @@
 # The following packages are used in building the test data generator framework.
 # All packages used are already installed in the Databricks runtime environment for version 6.5 or later
 numpy==1.22.0
-pandas==1.0.1
+pandas==1.2.5
 pickleshare==0.7.5
 py4j==0.10.9
-pyarrow==1.0.1
-pyspark>=3.0.1
+pyarrow==4.0.0
+pyspark>=3.1.2
 python-dateutil==2.8.1
-six==1.14.0
+six==1.15.0
 
 # The following packages are required for development only
-wheel==0.34.2
-setuptools==45.2.0
+wheel==0.36.2
+setuptools==52.0.0
 bumpversion
 pytest
 pytest-cov
@@ -25,7 +25,7 @@ sphinx_rtd_theme
 nbsphinx
 numpydoc==0.8
 pypandoc
-ipython==7.16.3
+ipython==7.22.0
 recommonmark
 sphinx-markdown-builder
 rst2pdf==0.98

setup.py

Lines changed: 3 additions & 3 deletions
@@ -31,13 +31,13 @@
 
 setuptools.setup(
     name="dbldatagen",
-    version="0.2.1",
+    version="0.3.0",
     author="Ronan Stokes, Databricks",
     description="Databricks Labs - PySpark Synthetic Data Generator",
     long_description=long_description,
     long_description_content_type="text/markdown",
     url="https://github.com/databrickslabs/data-generator",
-    project_urls = {
+    project_urls={
         "Databricks Labs": "https://www.databricks.com/learn/labs",
         "Documentation": "https://databrickslabs.github.io/dbldatagen/public_docs/index.html"
     },
@@ -52,5 +52,5 @@
         "Intended Audience :: Developers",
         "Intended Audience :: System Administrators"
     ],
-    python_requires='>=3.7.5',
+    python_requires='>=3.8.10',
 )
