Add code for python visualization service #1651

Merged · 36 commits · Aug 1, 2019
6c8c64d
Setup initial server with roc_curve visualization
ajchili Jul 22, 2019
d3f69d0
Created Dockerfile.visualization
ajchili Jul 22, 2019
b66430a
Fixed import issue
ajchili Jul 22, 2019
b42d993
Changed implementation of generate_html_from_notebook to allow templa…
ajchili Jul 22, 2019
6d51c38
Added tfdv.py
ajchili Jul 22, 2019
eedf87b
Added unit tests for exporter.py
ajchili Jul 22, 2019
0ff333f
Deleted __init__.py
ajchili Jul 22, 2019
45d31ca
visualizations/ -> visualization/
ajchili Jul 22, 2019
bf3f8d5
Added requirements.txt and updated Dockerfile.visualization to use it
ajchili Jul 22, 2019
c091270
Updated .travis.yml to run python visualization unit tests
ajchili Jul 22, 2019
ccdbffa
Fixed travis file path issue
ajchili Jul 22, 2019
51efcf5
Continued testing to fix travis test issues
ajchili Jul 23, 2019
c8d0457
Removed jupyter from pip3 install
ajchili Jul 23, 2019
525c5ad
Updated requirements.txt to include ipykernel
ajchili Jul 23, 2019
4df0647
Removed maxDiff limit for all python tests
ajchili Jul 23, 2019
eaa931c
Sorted keys within args dictionary to ensure tests do not fail due to…
ajchili Jul 23, 2019
12fa85e
Created requirements-test.txt
ajchili Jul 23, 2019
6fe1945
Added input_path argument support for python service
ajchili Jul 23, 2019
676bf84
Updated Copyright in Dockerfile.visualization
ajchili Jul 23, 2019
d991c85
Updated snapshot to include all tests
ajchili Jul 23, 2019
9dea2fd
Added types, additional comments, and TemplateType enum
ajchili Jul 23, 2019
a7afd7b
Formatted template files
ajchili Jul 29, 2019
394d4c3
Addressed most feedback made by @kevinbache
ajchili Jul 29, 2019
0166a3f
Revert "Formatted template files"
ajchili Jul 29, 2019
1d1da83
Fixed comment placement and switched os -> Path
ajchili Jul 30, 2019
930c200
Changed way exporter is implemented to use importlib
ajchili Jul 30, 2019
e3cf115
Reverted to str.format due to python compatibility issue
ajchili Jul 30, 2019
5aaad87
Added unit tests for tornado web server
ajchili Jul 30, 2019
1ba7155
Added license script for open source compliance
ajchili Jul 30, 2019
2fa67c9
Added line between file comment and license to match exporter.py
ajchili Jul 30, 2019
9fec259
Updated server structure
ajchili Jul 30, 2019
f1414c0
Addressed additional feedback from @kevinbache
ajchili Jul 31, 2019
7629618
Fixed snapshot for test_exporter
ajchili Jul 31, 2019
17a2bc2
Comments -> Docstring Comments and other small fixes
ajchili Jul 31, 2019
d09a3d1
Added missing and updated docstring comments in server.py
ajchili Jul 31, 2019
5320a0f
Resolved latency issue with visualization server
ajchili Aug 1, 2019
7 changes: 7 additions & 0 deletions .travis.yml
@@ -73,6 +73,13 @@ matrix:
- python tests/compiler/main.py
- $TRAVIS_BUILD_DIR/sdk/python/tests/run_tests.sh

# Visualization test
- cd $TRAVIS_BUILD_DIR/backend/src/apiserver/visualization
# Visualization test dependencies
- pip3 install -r requirements-test.txt
- python3 test_exporter.py
- python3 test_server.py

# Test loading all component.yaml definitions
- $TRAVIS_BUILD_DIR/components/test_load_all_components.sh

36 changes: 36 additions & 0 deletions backend/Dockerfile.visualization
@@ -0,0 +1,36 @@
# This docker file starts server.py (located at src/apiserver/visualization)
# which accepts a post request that resolves to html that depicts a specified
# visualization. More details about this process can be found in the server.py
# and exporter.py files in the directory specified above.

# Copyright 2019 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

FROM python:3

RUN curl https://dl.google.com/dl/cloudsdk/release/google-cloud-sdk.tar.gz > /tmp/google-cloud-sdk.tar.gz
RUN mkdir -p /usr/local/gcloud
RUN tar -C /usr/local/gcloud -xf /tmp/google-cloud-sdk.tar.gz
RUN /usr/local/gcloud/google-cloud-sdk/install.sh
ENV PATH $PATH:/usr/local/gcloud/google-cloud-sdk/bin

WORKDIR /src

COPY src/apiserver/visualization /src

RUN pip3 install -r requirements.txt

RUN ./license.sh third_party_licenses.csv /usr/licenses

ENTRYPOINT [ "python3", "server.py" ]
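The image above can be exercised locally. A hedged sketch of the build-and-run flow (the image tag `kfp-visualization:dev` and the published port `8888` are assumptions for illustration; `server.py` and the port it binds are not part of this diff):

```shell
# Build from the backend/ directory so the COPY path
# (src/apiserver/visualization) resolves against the build context.
cd backend
docker build -f Dockerfile.visualization -t kfp-visualization:dev .

# Run the service; substitute the port server.py actually listens on.
docker run --rm -p 8888:8888 kfp-visualization:dev
```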
5 changes: 5 additions & 0 deletions backend/src/apiserver/visualization/.gitignore
@@ -0,0 +1,5 @@
*.html
*.csv
*.json
__init__.py
!third_party_licenses.csv
149 changes: 149 additions & 0 deletions backend/src/apiserver/visualization/exporter.py
@@ -0,0 +1,149 @@
"""
exporter.py provides utility functions for generating NotebookNode objects and
converting those objects to HTML.
"""

# Copyright 2019 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
from enum import Enum
import json
from pathlib import Path
from typing import Text
from jupyter_client import KernelManager
from nbconvert import HTMLExporter
from nbconvert.preprocessors import ExecutePreprocessor
from nbformat import NotebookNode
from nbformat.v4 import new_code_cell


# Visualization Template types:
# - Basic: Uses the basic.tpl file within the templates directory to generate
# a visualization that contains no styling and minimal HTML tags. This is ideal
# for testing as it reduces the size of the generated visualization. However,
# it is not ideal for actual visualizations because its lack of javascript and
# styling can limit a visualization's usability.
# - Full: Uses the full.tpl file within the templates directory to generate a
# visualization that can be viewed as a standalone web page. The full.tpl file
# utilizes the basic.tpl file for visualizations, then wraps that output with
# additional tags for javascript and style support. This is ideal for
# generating visualizations that will be displayed via the frontend.
class TemplateType(Enum):
    BASIC = 'basic'
    FULL = 'full'


class Exporter:
    """Handler for interaction with NotebookNodes, including output generation.

    Attributes:
        timeout (int): Amount of time in seconds that a visualization can run
            for before being stopped.
        template_type (TemplateType): Type of template to use when generating
            visualization output.
        km (KernelManager): Custom KernelManager that stays alive between
            visualizations.
        ep (ExecutePreprocessor): Process that is responsible for generating
            outputs from NotebookNodes.

    """

[Review comment — Contributor] nit: extra whitespace

[Reply — ajchili] Did not add due to PEP8 complaining about additional whitespace.

    def __init__(
        self,
        timeout: int = 100,
        template_type: TemplateType = TemplateType.FULL
    ):
        """
        Initializes Exporter with default timeout (100 seconds) and template
        (FULL) and handles instantiation of km and ep variables for usage when
        generating NotebookNodes and their outputs.

        Args:
            timeout (int): Amount of time in seconds that a visualization can
                run for before being stopped.
            template_type (TemplateType): Type of template to use when
                generating visualization output.
        """
        self.timeout = timeout
        self.template_type = template_type
        # Create custom KernelManager.
        # This will circumvent issues where kernel is shutdown after
        # preprocessing. Due to the shutdown, latency would be introduced
        # because a kernel must be started per visualization.
        self.km = KernelManager()
        self.km.start_kernel()
        self.ep = ExecutePreprocessor(
            timeout=self.timeout,
            kernel_name='python3'
        )

    @staticmethod
    def create_cell_from_args(args: Text) -> NotebookNode:
        """Creates a NotebookNode object with provided arguments as variables.

        Args:
            args: JSON string of arguments that need to be injected into a
                NotebookNode.

        Returns:
            NotebookNode with provided arguments as variables.

        """
        variables = ""
        args = json.loads(args)
        for key in sorted(args.keys()):
            # Check type of variable to maintain type when converting from JSON
            # to notebook cell
            if args[key] is None or isinstance(args[key], bool):
                variables += "{} = {}\n".format(key, args[key])
            else:
                variables += '{} = "{}"\n'.format(key, args[key])

        return new_code_cell(variables)

    @staticmethod
    def create_cell_from_file(filepath: Text) -> NotebookNode:
        """Creates a NotebookNode object with provided file as code in node.

        Args:
            filepath: Path to file that should be used.

        Returns:
            NotebookNode with specified file as code within node.

        """
        with open(filepath, 'r') as f:
            code = f.read()

        return new_code_cell(code)

    def generate_html_from_notebook(self, nb: NotebookNode) -> Text:
        """Converts a provided NotebookNode to HTML.

        Args:
            nb: NotebookNode that should be converted to HTML.

        Returns:
            HTML from converted NotebookNode as a string.

        """
        # HTML generator and exporter object
        html_exporter = HTMLExporter()
        template_file = "templates/{}.tpl".format(self.template_type.value)
        html_exporter.template_file = str(Path.cwd() / template_file)
[Review comment — Contributor] check out __file__ rather than cwd.

[Reply — ajchili] That would require the usage of PurePath.parent to determine the parent route of the file, is that ideal for this?

        # Output generator
        self.ep.preprocess(nb, {"metadata": {"path": Path.cwd()}}, self.km)
        # Export all html and outputs
        body, _ = html_exporter.from_notebook_node(nb)
        return body
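The argument-injection step above is the crux of the service: a JSON payload from the POST request becomes a leading code cell of plain assignments. A minimal stdlib sketch of that serialization logic (a standalone illustration, not importable from this PR; note that only None and booleans keep their Python type — every other value, including numbers, arrives in the kernel as a string):

```python
import json

def variables_cell_source(args_json):
    # Mirror Exporter.create_cell_from_args: turn a JSON object into
    # assignment statements, sorted by key for deterministic output.
    variables = ""
    args = json.loads(args_json)
    for key in sorted(args.keys()):
        # None and booleans are emitted bare so they stay typed;
        # everything else is quoted and therefore becomes a string.
        if args[key] is None or isinstance(args[key], bool):
            variables += "{} = {}\n".format(key, args[key])
        else:
            variables += '{} = "{}"\n'.format(key, args[key])
    return variables

print(variables_cell_source(
    '{"input_path": "gs://bucket/data.csv", "is_generated": true, "target_lambda": null}'
))
# input_path = "gs://bucket/data.csv"
# is_generated = True
# target_lambda = None
```

Sorting the keys is what lets the PR's snapshot tests compare generated cells deterministically (see the "Sorted keys within args dictionary" commit).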
63 changes: 63 additions & 0 deletions backend/src/apiserver/visualization/license.sh
@@ -0,0 +1,63 @@
#!/bin/bash -e
# Copyright 2019 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# A script to gather locally-installed python packages, and based on
# a specified license table (3 columns: name,license_link,license_type) csv file,
# download and save all license files into specified directory.
# Usage:
# license.sh third_party_licenses.csv /usr/licenses


# Get the list of python packages installed locally.
IFS=$'\n'
INSTALLED_PACKAGES=($(pip freeze | sed s/=.*//))


# Get the list of python packages tracked in the given CSV file.
REGISTERED_PACKAGES=()
while IFS=, read -r col1 col2 col3
do
  REGISTERED_PACKAGES+=($col1)
done < $1

# Make sure all locally installed packages are covered.
DIFF=()
for i in "${INSTALLED_PACKAGES[@]}"; do
  skip=
  for j in "${REGISTERED_PACKAGES[@]}"; do
    [[ $i == $j ]] && { skip=1; break; }
  done
  [[ -n $skip ]] || DIFF+=("$i")
done

if [ -n "$DIFF" ]; then
  echo "The following packages are not found for licenses tracking."
  echo "Please add an entry in $1 for each of them."
  echo ${DIFF[@]}
  exit 1
fi

# Gather license files for each package. For packages with GPL license we mirror the source code.
mkdir -p $2/source
while IFS=, read -r col1 col2 col3
do
  if [[ " ${INSTALLED_PACKAGES[@]} " =~ " ${col1} " ]]; then
    wget -O $2/$col1.LICENSE $col2
    if [[ "${col3}" == *GPL* ]] || [[ "${col3}" =~ ^MPL ]]; then
      pip install -t "$2/source/${col1}" ${col1}
    fi
  fi
done < $1
6 changes: 6 additions & 0 deletions backend/src/apiserver/visualization/requirements-test.txt
@@ -0,0 +1,6 @@
ipykernel==5.1.1
jupyter_client==5.2.4
nbconvert==5.5.0
nbformat==4.4.0
snapshottest==0.5.1
tornado==6.0.2
13 changes: 13 additions & 0 deletions backend/src/apiserver/visualization/requirements.txt
@@ -0,0 +1,13 @@
bokeh==1.2.0
gcsfs==0.2.3
google-api-python-client==1.7.9
ipykernel==5.1.1
jupyter_client==5.2.4
nbconvert==5.5.0
nbformat==4.4.0
pandas==0.24.2
scikit_learn==0.21.2
tensorflow==1.13.1
tensorflow-data-validation==0.13.1
tensorflow-model-analysis==0.13.1
tornado==6.0.2
74 changes: 74 additions & 0 deletions backend/src/apiserver/visualization/roc_curve.py
@@ -0,0 +1,74 @@
# Copyright 2019 Google LLC
[Review comment — Member] It's ok to start by copying the pipeline component code in here. Sometime in the future, we should generalize this code for better reusability, e.g. expose the same interface as the scikit-learn roc_curve interface.

#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import json
from pathlib import Path
from bokeh.layouts import row
from bokeh.plotting import figure
from bokeh.io import output_notebook, show
from bokeh.models import HoverTool
# gcsfs is required for pandas GCS integration.
import gcsfs
import pandas as pd
from sklearn.metrics import roc_curve
from tensorflow.python.lib.io import file_io

# The following variables are provided through dependency injection. These
# variables come from the specified input path and arguments provided by the
# API post request.
#
# is_generated
# input_path
# target_lambda
# trueclass
# true_score_column

if is_generated is False:
    # Create data from specified csv file(s).
    # The schema file provides column names for the csv file that will be used
    # to generate the roc curve.
    schema_file = Path(input_path) / 'schema.json'
    schema = json.loads(file_io.read_file_to_string(schema_file))
    names = [x['name'] for x in schema]

    dfs = []
    files = file_io.get_matching_files(input_path)
    for f in files:
        dfs.append(pd.read_csv(f, names=names))

    df = pd.concat(dfs)
    if target_lambda:
        df['target'] = df.apply(eval(target_lambda), axis=1)
    else:
        df['target'] = df['target'].apply(lambda x: 1 if x == trueclass else 0)
    fpr, tpr, thresholds = roc_curve(df['target'], df[true_score_column])
    source = pd.DataFrame({'fpr': fpr, 'tpr': tpr, 'thresholds': thresholds})
else:
    # Load data from generated csv file.
    source = pd.read_csv(
        input_path,
        header=None,
        names=['fpr', 'tpr', 'thresholds']
    )

# Create visualization.
output_notebook()

p = figure(tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave")
p.line('fpr', 'tpr', line_width=2, source=source)

hover = p.select(dict(type=HoverTool))
hover.tooltips = [("Threshold", "@thresholds")]

show(row(p, sizing_mode="scale_width"))