
Conversation

@koomen koomen commented Jul 31, 2020

No description provided.

@koomen koomen requested a review from asaschachar July 31, 2020 21:16
@asaschachar left a comment:

Left some comments, but they are mostly questions about where instructions live and why. I'm focusing on this now because these first few labs are an opportunity to set the right precedent for how people will write labs in the future.

I think a good next step is to have people work through these more complex labs (which include running a Jupyter notebook) and see how successful they are.

@@ -0,0 +1,76 @@
# Computing Experiment Datasets #1: Experiment Subjects
@asaschachar (Contributor):
Should we number them? Do they have to be one after the other? Or should each lab be standalone? Part of me feels like, the way Labs are set up, each is more standalone and the tags / title can bring multiple labs together.

@koomen (Author):

There are several labs I want to complete in this series:

  1. Computing experiment subjects
  2. Computing experiment observations
  3. Computing experiment segments
  4. Computing experiment metrics

#2 depends on #1, and #4 depends on #1, #2, and #3. Each lab can be run on its own, but if you want to start with enriched event data you should run them sequentially.


## Running this notebook with Docker

The simplest way to get started with PySpark is to run it in a [Docker](https://www.docker.com/) container. With Docker, you can run PySpark and Jupyter Lab without installing any other dependencies.
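As a rough illustration of the Docker approach described above, an invocation along these lines starts Jupyter with PySpark preinstalled (the `jupyter/pyspark-notebook` image and the mount path are assumptions for this sketch, not the lab's actual instructions):

```sh
# Run Jupyter Lab with PySpark in a throwaway container,
# mounting the current directory so notebooks persist on the host.
docker run -it --rm \
  -p 8888:8888 \
  -v "$PWD":/home/jovyan/work \
  jupyter/pyspark-notebook
```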
@asaschachar (Contributor):

Should we spell out Docker as a prerequisite here?

@koomen (Author):

I was hoping the Docker link was explicit enough. wdyt?


### Running Jupyter Lab

This lab directory contains a handy script for building your conda environment and running Jupyter Lab. To run it, simply use
@asaschachar (Contributor):

These instructions feel like they are centered around the user who is viewing this from GitHub... After syncing with you offline, that sounds intentional.

@koomen (Author):

Yes, or a local directory (possibly from the zipped Lab directory linked on the labs page).

This lab directory contains a handy script for building your conda environment and running Jupyter Lab. To run it, simply use

```sh
$ bash run.sh
```

@asaschachar (Contributor):

nit: I think this belongs in a style guide for the labs repo, but should we prefix all bash commands with $ or not? I can see arguments for both sides, but I lean towards not including $ because it makes copying and pasting easier. Possibly a discussion for later.

@koomen (Author):

Good point. I can remove.

@koomen (Author):

Fixed.

```sh
$ export OPTIMIZELY_DATA_DIR=~/optimizely_data
```

### Building `index.md`
@asaschachar (Contributor):

This seems like an odd set of instructions to include for someone who is just using this lab. Perhaps it should be prefaced with "If you want to update the page on optimizely.com/labs, do the following."

@koomen (Author):

Good point. Will do.

@koomen (Author):

I just removed this.


## Analysis parameters

We'll use the following global variables to parameterize our computation:
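The parameters themselves are defined in the notebook; as a hedged sketch, they might look something like the following (these names and values are illustrative assumptions, not necessarily the notebook's actual globals):

```python
from datetime import date

# Hypothetical analysis parameters -- illustrative only.
DECISIONS_DATA_DIR = "data/type=decisions"  # where downloaded decision data lives
ANALYSIS_START = date(2020, 7, 1)           # first day of decisions to include
ANALYSIS_END = date(2020, 7, 31)            # last day of decisions to include
```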
@asaschachar (Contributor):

I think this kind of instruction makes sense when inside the environment of a jupyter lab, but it may seem odd to see it in a webpage. For example, when I see "We'll use the following...", I think, where do I put this code? Do I need to put it in a file? What filename? How do I run it?

Obviously this is clearer when someone is running the notebook. So I'm wondering if this content should only live in the notebook so that it makes the most sense in the context its designed for.

@koomen (Author):

Hm, I get the point, but I wouldn't want to omit them from the page since they are referenced later in the notebook. I could add a note about this page being a notebook that can actually be executed (see instructions below)

@koomen (Author):

Added a note to the intro.

@@ -0,0 +1,213 @@
# Computing Experiment Datasets #1: Experiment Subjects

This Lab is part of a multi-part series focused on computing useful experiment datasets. In this Lab, we'll use [PySpark](https://spark.apache.org/docs/latest/api/python/index.html) to compute _experiment subjects_ from [Optimizely Enriched Event](https://docs.developers.optimizely.com/optimizely-data/docs/enriched-events-export) ["Decision"](https://docs.developers.optimizely.com/optimizely-data/docs/enriched-events-data-specification#decisions-2) data.
@asaschachar (Contributor):

What does it mean to compute experiment subjects? Is it just to count them up? What's the benefit of doing this? Perhaps this is covered in part 2 of the series, but it seems like each lab should clearly define the value it provides. Otherwise, it seems like it should just be folded into a larger lab.

@asaschachar (Contributor):

If it is to 'count' the experiment subjects, then we should consider changing the title from "compute experiment subjects" to "count experiment subjects".

@koomen (Author):

Ah, good question. This lab actually computes a table of subjects for each experiment for which there are decisions present in the input dataset. The subjects dataset may be joined with other analytics datasets to compute experiment metrics. I'll try adding something like this explicitly.
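The computation described here (the lab itself uses PySpark) can be sketched in plain Python: group decision records by experiment and variation, and count distinct visitors in each group. The record fields below are illustrative assumptions; real Enriched Events decision rows carry many more fields.

```python
from collections import defaultdict

# Hypothetical decision records (field names are assumptions for this sketch).
decisions = [
    {"visitor_id": "v1", "experiment_id": "18156943409", "variation_id": "18112613000"},
    {"visitor_id": "v1", "experiment_id": "18156943409", "variation_id": "18112613000"},
    {"visitor_id": "v2", "experiment_id": "18156943409", "variation_id": "18174970251"},
]

# One subject per distinct visitor within each experiment/variation pair.
visitors = defaultdict(set)
for d in decisions:
    visitors[(d["experiment_id"], d["variation_id"])].add(d["visitor_id"])

subject_counts = {key: len(ids) for key, ids in visitors.items()}
```

Note that the duplicate decision for `v1` contributes only one subject, since subjects are distinct visitors.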

@koomen (Author):

FWIW, #2 in this series works by joining subjects with enriched events to compute observations
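The join mentioned here can be sketched roughly as follows (a simplified illustration with hypothetical field names; lab #2 performs this join in PySpark over real enriched event data):

```python
# Hypothetical subjects table and event records for this sketch.
subjects = [{"visitor_id": "v1", "experiment_id": "e1", "variation_id": "x1"}]
events = [
    {"visitor_id": "v1", "event_name": "purchase", "revenue": 500},
    {"visitor_id": "v2", "event_name": "purchase", "revenue": 100},  # not a subject; dropped
]

# Inner-join events onto subjects by visitor_id to produce observations:
# each observation pairs an event with the experiment/variation its visitor saw.
by_visitor = {s["visitor_id"]: s for s in subjects}
observations = [
    {**by_visitor[e["visitor_id"]], **e}
    for e in events
    if e["visitor_id"] in by_visitor
]
```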

| experiment_id | variation_id | subject_count |
|---|---|---|
| 18156943409 | 18112613000 | 4487 |
| 18156943409 | 18174970251 | 4514 |
@asaschachar (Contributor):

Is this it? Maybe provide a conclusion, e.g. "We've counted the number of experiment subjects across different experiments / variations."

@koomen (Author):

Added a conclusion




<table border='1'>
@asaschachar (Contributor):

As is, this lab seems like it's a "I'll show you this first with an existing data set, so you can do it on your own later", but I wonder if the lab would be more powerful if it were "Follow these steps to compute experiment subjects on your data" (which may have to include instructions on how to get the data to begin with).

@koomen (Author):

I included a note above about using this notebook with a custom data directory. I can add a link to oevents too. The nice thing is that oevents and all of these notebooks will use OPTIMIZELY_DATA_DIR if it is specified, so once you've set that variable you can download data and then analyze it with these notebooks without changing anything.
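The shared-directory convention described here might be implemented on the notebook side roughly like this (the fallback default path is an assumption for this sketch):

```python
import os
from pathlib import Path

# Both oevents and the notebooks read the same environment variable, so data
# downloaded by one tool is found by the other without reconfiguration.
def optimizely_data_dir() -> Path:
    return Path(os.environ.get("OPTIMIZELY_DATA_DIR",
                               str(Path.home() / "optimizely_data")))

# Example: point every tool at the same directory.
os.environ["OPTIMIZELY_DATA_DIR"] = "/tmp/optimizely_data"
```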

@@ -0,0 +1,76 @@
# Computing Experiment Datasets #1: Experiment Subjects

This Lab is part of a multi-part series focused on computing useful experiment datasets. In this Lab, we'll use [PySpark](https://spark.apache.org/docs/latest/api/python/index.html) to compute _experiment subjects_ from [Optimizely Enriched Event](https://docs.developers.optimizely.com/optimizely-data/docs/enriched-events-export) ["Decision"](https://docs.developers.optimizely.com/optimizely-data/docs/enriched-events-data-specification#decisions-2) data.
@asaschachar (Contributor):

Is experiment subjects the term we want to centralize on as a part of our documentation? What happens in a world where someone is getting analytics on feature flags being on / off, should they still be experiment subjects? Maybe worth thinking about.

@koomen (Author):

That is a good point, but for now I'd rather use experiment subjects, since e3 data is not even available for feature flags outside of experiments. Within the context of experiment analysis, the term is the right one. If, later on, we decide to use a different umbrella term we can always update this and other labs accordingly.

cc @loganlinn

@koomen koomen merged commit 96d05c6 into optimizely:master Jul 31, 2020