Koomen/compute experiment subjects #5
Conversation
…n/labs into koomen/compute-experiment-subjects
Left some comments, but they are more questions around where instructions live and why. I'm overly focused on this now because in these first few labs, we have an opportunity to set the right precedent for how people will write them in the future.
I think a good next step is to get people to work through these more complex labs (that include running a jupyter notebook) and see how successful they are.
@@ -0,0 +1,76 @@
# Computing Experiment Datasets #1: Experiment Subjects
Should we number them? Do they have to be one after the other? Or should we have each lab be standalone? Part of me feels like the way Labs are set up, each is more standalone, and the tags / title can bring multiple together.
There are several labs I want to complete in this series:
- Computing experiment subjects
- Computing experiment observations
- Computing experiment segments
- Computing experiment metrics
#2 depends on #1, and #4 depends on #1, #2, and #3. Each lab can be run on its own, but if you want to start with enriched event data you should run them sequentially.
## Running this notebook with Docker

The simplest way to get started with PySpark is to run it in a [Docker](https://www.docker.com/) container. With Docker, you can run PySpark and Jupyter Lab without installing any other dependencies.
Should we spell out Docker as a prerequisite here?
I was hoping that the Docker link was explicit enough. Wdyt?
### Running Jupyter Lab

This lab directory contains a handy script for building your conda environment and running Jupyter Lab. To run it, simply use
These instructions feel like they are centered around the user who is viewing this from GitHub... After syncing with you offline, that sounds intentional.
Yes, or a local directory (possibly from the zipped Lab directory linked on the labs page).
This lab directory contains a handy script for building your conda environment and running Jupyter Lab. To run it, simply use

```sh
$ bash run.sh
```

nit: I think this should be in a type of styleguide around the labs repo, but should we prefix all bash commands with `$` or not? I can see arguments for both sides, but I think I lean towards not including `$`, because then it makes it easier to copy and paste. Possibly a discussion for later.

Good point. I can remove.

Fixed.

```sh
$ export OPTIMIZELY_DATA_DIR=~/optimizely_data
```

### Building `index.md`
This seems like an odd set of instructions to include for someone who is just using this lab. Perhaps this should be prefaced as "If you want to update the page on optimizely.com/labs" then do the following.
Good point. Will do.
I just removed this.
## Analysis parameters

We'll use the following global variables to parameterize our computation:
I think this kind of instruction makes sense when inside the environment of a jupyter lab, but it may seem odd to see it in a webpage. For example, when I see "We'll use the following...", I think, where do I put this code? Do I need to put it in a file? What filename? How do I run it?
Obviously this is clearer when someone is running the notebook. So I'm wondering if this content should only live in the notebook, so that it makes the most sense in the context it's designed for.
Hm, I get the point, but I wouldn't want to omit them from the page, since they are referenced later on in the notebook. I could add a note about this page being a notebook that can actually be executed (see instructions below).
Added a note to the intro.
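As a sketch of what such analysis parameters might look like in a notebook cell, something like the following could work. The variable names and date range here are illustrative assumptions, not taken from the lab itself:

```python
import os

# Hypothetical analysis parameters; names are illustrative, not from the lab.
# Root directory containing downloaded Enriched Event data; falls back to a default.
OPTIMIZELY_DATA_DIR = os.environ.get(
    "OPTIMIZELY_DATA_DIR", os.path.expanduser("~/optimizely_data")
)

# Restrict the computation to a particular experiment (None = all experiments).
EXPERIMENT_ID = None

# Only count decisions in this date range (inclusive).
START_DATE = "2020-07-01"
END_DATE = "2020-07-31"
```

Keeping these in one cell near the top of the notebook makes the computation easy to re-run against a different experiment or date range.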
@@ -0,0 +1,213 @@
# Computing Experiment Datasets #1: Experiment Subjects

This Lab is part of a multi-part series focused on computing useful experiment datasets. In this Lab, we'll use [PySpark](https://spark.apache.org/docs/latest/api/python/index.html) to compute _experiment subjects_ from [Optimizely Enriched Event](https://docs.developers.optimizely.com/optimizely-data/docs/enriched-events-export) ["Decision"](https://docs.developers.optimizely.com/optimizely-data/docs/enriched-events-data-specification#decisions-2) data.
What does it mean to compute experiment subjects? Is it just to count them up? What's the benefit of doing this? Perhaps this is covered in part 2 of the series, but it seems like each lab should clearly define the value it provides. Otherwise, it seems like it should just be folded into a larger lab.
If it is to 'count' the experiment subjects, then we should consider changing the title from "compute experiment subjects" to "count experiment subjects".
Ah, good question. This lab actually computes a table of subjects for each experiment for which there are decisions present in the input dataset. The subjects dataset may be joined with other analytics datasets to compute experiment metrics. I'll try adding something like this in explicitly.
FWIW, #2 in this series works by joining subjects with enriched events to compute observations
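The computation being discussed — collecting the distinct subjects in each experiment/variation so they can later be joined with other data — can be sketched in plain Python. This is a toy stand-in for the lab's PySpark aggregation; the field names are assumptions based loosely on the decision schema:

```python
from collections import defaultdict

# Toy decision records; in the real lab these come from Enriched Event decision data.
decisions = [
    {"experiment_id": "18156943409", "variation_id": "18112613000", "visitor_id": "v1"},
    {"experiment_id": "18156943409", "variation_id": "18112613000", "visitor_id": "v1"},  # duplicate decision
    {"experiment_id": "18156943409", "variation_id": "18174970251", "visitor_id": "v2"},
]

def compute_subjects(decisions):
    """Collect the distinct visitors (subjects) for each experiment/variation pair."""
    subjects = defaultdict(set)
    for d in decisions:
        subjects[(d["experiment_id"], d["variation_id"])].add(d["visitor_id"])
    return subjects

def subject_counts(decisions):
    """Count distinct subjects per experiment/variation (duplicates collapse)."""
    return {key: len(visitors) for key, visitors in compute_subjects(decisions).items()}
```

The intermediate `compute_subjects` table (not just the counts) is what makes the join-based workflow possible: subjects can be matched against other event data before anything is aggregated.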
<table border='1'>
<tr><th>experiment_id</th><th>variation_id</th><th>subject_count</th></tr>
<tr><td>18156943409</td><td>18112613000</td><td>4487</td></tr>
<tr><td>18156943409</td><td>18174970251</td><td>4514</td></tr>
Is this it? Maybe provide a conclusion around "We've counted the number of experiment subjects across different experiments / variations."
Added a conclusion
<table border='1'>
As is, this lab seems like it's a "I'll show you this first with an existing data set, so you can do it on your own later", but I wonder if the lab would be more powerful if it were "Follow these steps to compute experiment subjects on your data" (which may have to include instructions on how to get the data to begin with).
I included a note above about using this notebook with a custom data directory. I can add a link to oevents too. The nice thing is that oevents and all of these notebooks will use the OPTIMIZELY_DATA_DIR if it is specified, so once you've set that variable you can download data and then analyze it with these notebooks without changing anything.
@@ -0,0 +1,76 @@
# Computing Experiment Datasets #1: Experiment Subjects

This Lab is part of a multi-part series focused on computing useful experiment datasets. In this Lab, we'll use [PySpark](https://spark.apache.org/docs/latest/api/python/index.html) to compute _experiment subjects_ from [Optimizely Enriched Event](https://docs.developers.optimizely.com/optimizely-data/docs/enriched-events-export) ["Decision"](https://docs.developers.optimizely.com/optimizely-data/docs/enriched-events-data-specification#decisions-2) data.
Is "experiment subjects" the term we want to centralize on as a part of our documentation? What happens in a world where someone is getting analytics on feature flags being on / off? Should they still be "experiment subjects"? Maybe worth thinking about.
That is a good point, but for now I'd rather use experiment subjects, since e3 data is not even available for feature flags outside of experiments. Within the context of experiment analysis, the term is the right one. If, later on, we decide to use a different umbrella term we can always update this and other labs accordingly.
cc @loganlinn
No description provided.