Koomen/compute experiment subjects #5
Conversation
…n/labs into koomen/compute-experiment-subjects
Left some comments, but they are more questions around where instructions live and why. I'm overly focused on this now because in these first few labs, we have an opportunity to set the right precedent for how people will write them in the future.
I think a good next step is to get people to work through these more complex labs (that include running a jupyter notebook) and see how successful they are.
@@ -0,0 +1,76 @@
# Computing Experiment Datasets #1: Experiment Subjects
Should we number them? Do they have to be one after the other? Or should we have each lab be standalone? Part of me feels like the way Labs are set up, each is more standalone, and the tags / title can bring multiple together.
There are several labs I want to complete in this series:
- Computing experiment subjects
- Computing experiment observations
- Computing experiment segments
- Computing experiment metrics
#2 depends on #1, and #4 depends on #1, #2, and #3. Each lab can be run on its own, but if you want to start with enriched event data you should run them sequentially.
## Running this notebook with Docker

The simplest way to get started with PySpark is to run it in a [Docker](https://www.docker.com/) container. With Docker, you can run PySpark and Jupyter Lab without installing any other dependencies.
Should we spell out Docker as a prerequisite here?
I was hoping that the Docker link was explicit enough. Wdyt?
### Running Jupyter Lab

This lab directory contains a handy script for building your conda environment and running Jupyter Lab. To run it, simply use
These instructions feel like they are centered around the user who is viewing this from GitHub... After syncing with you offline, that sounds intentional.
Yes, or a local directory (possibly from the zipped Lab directory linked on the labs page).
This lab directory contains a handy script for building your conda environment and running Jupyter Lab. To run it, simply use

```sh
$ bash run.sh
```

nit: I think this should be in a type of styleguide around the labs repo, but should we prefix all bash commands with `$` or not? I can see arguments for both sides, but I think I lean towards not including `$`, because then it makes it easier to copy and paste. Possibly a discussion for later.

Good point. I can remove.

Fixed.

```sh
$ export OPTIMIZELY_DATA_DIR=~/optimizely_data
```

### Building `index.md`
This seems like an odd set of instructions to include for someone who is just using this lab. Perhaps this should be prefaced as "If you want to update the page on optimizely.com/labs" then do the following.
Good point. Will do.
I just removed this.
## Analysis parameters

We'll use the following global variables to parameterize our computation:
I think this kind of instruction makes sense when inside the environment of a jupyter lab, but it may seem odd to see it in a webpage. For example, when I see "We'll use the following...", I think, where do I put this code? Do I need to put it in a file? What filename? How do I run it?
Obviously this is clearer when someone is running the notebook. So I'm wondering if this content should only live in the notebook, so that it makes the most sense in the context it's designed for.
Hm, I get the point, but I wouldn't want to omit them from the page, since they are referenced later on in the notebook. I could add a note about this page being a notebook that can actually be executed (see instructions below).
Added a note to the intro.
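As a sketch of what such analysis parameters might look like in a notebook cell, something like the following could work. The variable names and date range here are illustrative assumptions, not taken from the lab itself:

```python
import os

# Hypothetical analysis parameters; names are illustrative, not from the lab.
# Root directory containing downloaded Enriched Event data; falls back to a default.
OPTIMIZELY_DATA_DIR = os.environ.get(
    "OPTIMIZELY_DATA_DIR", os.path.expanduser("~/optimizely_data")
)

# Restrict the computation to a particular experiment (None = all experiments).
EXPERIMENT_ID = None

# Only count decisions in this date range (inclusive).
START_DATE = "2020-07-01"
END_DATE = "2020-07-31"
```

Keeping these in one cell near the top of the notebook makes the computation easy to re-run against a different experiment or date range.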
@@ -0,0 +1,213 @@
# Computing Experiment Datasets #1: Experiment Subjects

This Lab is part of a multi-part series focused on computing useful experiment datasets. In this Lab, we'll use [PySpark](https://spark.apache.org/docs/latest/api/python/index.html) to compute _experiment subjects_ from [Optimizely Enriched Event](https://docs.developers.optimizely.com/optimizely-data/docs/enriched-events-export) ["Decision"](https://docs.developers.optimizely.com/optimizely-data/docs/enriched-events-data-specification#decisions-2) data.
What does it mean to compute experiment subjects? Is it just to count them up? What's the benefit of doing this? Perhaps this is covered in part 2 of the series, but it seems like each lab should clearly define the value it provides. Otherwise, it seems like it should just be folded into a larger lab.
If it is to 'count' the experiment subjects, then we should consider changing the title from "compute experiment subjects" to "count experiment subjects".
Ah, good question. This lab actually computes a table of subjects for each experiment for which there are decisions present in the input dataset. The subjects dataset may be joined with other analytics datasets to compute experiment metrics. I'll try adding something like this in explicitly.
FWIW, #2 in this series works by joining subjects with enriched events to compute observations
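The computation being discussed — collecting the distinct subjects in each experiment/variation so they can later be joined with other data — can be sketched in plain Python. This is a toy stand-in for the lab's PySpark aggregation; the field names are assumptions based loosely on the decision schema:

```python
from collections import defaultdict

# Toy decision records; in the real lab these come from Enriched Event decision data.
decisions = [
    {"experiment_id": "18156943409", "variation_id": "18112613000", "visitor_id": "v1"},
    {"experiment_id": "18156943409", "variation_id": "18112613000", "visitor_id": "v1"},  # duplicate decision
    {"experiment_id": "18156943409", "variation_id": "18174970251", "visitor_id": "v2"},
]

def compute_subjects(decisions):
    """Collect the distinct visitors (subjects) for each experiment/variation pair."""
    subjects = defaultdict(set)
    for d in decisions:
        subjects[(d["experiment_id"], d["variation_id"])].add(d["visitor_id"])
    return subjects

def subject_counts(decisions):
    """Count distinct subjects per experiment/variation (duplicates collapse)."""
    return {key: len(visitors) for key, visitors in compute_subjects(decisions).items()}
```

The intermediate `compute_subjects` table (not just the counts) is what makes the join-based workflow possible: subjects can be matched against other event data before anything is aggregated.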
<table border='1'>
<tr><th>experiment_id</th><th>variation_id</th><th>subject_count</th></tr>
<tr><td>18156943409</td><td>18112613000</td><td>4487</td></tr>
<tr><td>18156943409</td><td>18174970251</td><td>4514</td></tr>
Is this it? Maybe provide a conclusion around "We've counted the number of experiment subjects across different experiments / variations."
Added a conclusion
<table border='1'>
As is, this lab seems like it's a "I'll show you this first with an existing data set, so you can do it on your own later", but I wonder if the lab would be more powerful if it were "Follow these steps to compute experiment subjects on your data" (which may have to include instructions on how to get the data to begin with).
I included a note above about using this notebook with a custom data directory. I can add a link to oevents too. The nice thing is that oevents and all of these notebooks will use the OPTIMIZELY_DATA_DIR if it is specified, so once you've set that variable you can download data and then analyze it with these notebooks without changing anything.
@@ -0,0 +1,76 @@
# Computing Experiment Datasets #1: Experiment Subjects

This Lab is part of a multi-part series focused on computing useful experiment datasets. In this Lab, we'll use [PySpark](https://spark.apache.org/docs/latest/api/python/index.html) to compute _experiment subjects_ from [Optimizely Enriched Event](https://docs.developers.optimizely.com/optimizely-data/docs/enriched-events-export) ["Decision"](https://docs.developers.optimizely.com/optimizely-data/docs/enriched-events-data-specification#decisions-2) data.
Is "experiment subjects" the term we want to centralize on as a part of our documentation? What happens in a world where someone is getting analytics on feature flags being on / off? Should they still be "experiment subjects"? Maybe worth thinking about.
That is a good point, but for now I'd rather use experiment subjects, since e3 data is not even available for feature flags outside of experiments. Within the context of experiment analysis, the term is the right one. If, later on, we decide to use a different umbrella term we can always update this and other labs accordingly.
cc @loganlinn
No description provided.