diff --git a/_episodes/02-numpy.md b/_episodes/02-numpy.md index edb9036b6..0277a6939 100644 --- a/_episodes/02-numpy.md +++ b/_episodes/02-numpy.md @@ -30,7 +30,7 @@ that can be called upon when needed. ## Loading data into Python -To begin processing inflammation data, we need to load it into Python. +To begin processing the clinical trial inflammation data, we need to load it into Python. We can do that using a library called [NumPy](http://docs.scipy.org/doc/numpy/ "NumPy Documentation"), which stands for Numerical Python. In general, you should use this library when you want to do fancy things with lots of numbers, diff --git a/_episodes/03-matplotlib.md b/_episodes/03-matplotlib.md index 117e9bb34..69fcd2c49 100644 --- a/_episodes/03-matplotlib.md +++ b/_episodes/03-matplotlib.md @@ -13,8 +13,8 @@ keypoints: --- ## Visualizing data -The mathematician Richard Hamming once said, "The purpose of computing is insight, not numbers," and -the best way to develop insight is often to visualize data. Visualization deserves an entire +The mathematician Richard Hamming once said, "The purpose of computing is insight, not numbers," +and the best way to develop insight is often to visualize data. Visualization deserves an entire lecture of its own, but we can explore a few features of Python's `matplotlib` library here. While there is no official plotting library, `matplotlib` is the _de facto_ standard. First, we will import the `pyplot` module from `matplotlib` and use two of its functions to create and display a @@ -30,9 +30,19 @@ matplotlib.pyplot.show() ![Heat map representing the `data` variable. Each cell is colored by value along a color gradient from blue to yellow.](../fig/inflammation-01-imshow.svg) -Blue pixels in this heat map represent low values, while yellow pixels represent high values. As we -can see, inflammation rises and falls over a 40-day period. 
Let's take a look at the average -inflammation over time: +Each row in the heat map corresponds to a patient in the clinical trial dataset, and each column +corresponds to a day in the dataset. Blue pixels in this heat map represent low values, while +yellow pixels represent high values. As we can see, the general number of inflammation flare-ups +for the patients rises and falls over a 40-day period. + +So far so good as this is in line with our knowledge of the clinical trial and Dr. Maverick's +claims: + +* the patients take their medication once their inflammation flare-ups begin +* it takes around 3 weeks for the medication to take effect and begin reducing flare-ups +* and flare-ups appear to drop to zero by the end of the clinical trial. + +Now let's take a look at the average inflammation over time: ~~~ ave_inflammation = numpy.mean(data, axis=0) @@ -45,8 +55,9 @@ matplotlib.pyplot.show() Here, we have put the average inflammation per day across all patients in the variable `ave_inflammation`, then asked `matplotlib.pyplot` to create and display a line graph of those -values. The result is a roughly linear rise and fall, which is suspicious: we might instead expect -a sharper rise and slower fall. Let's have a look at two other statistics: +values. The result is a reasonably linear rise and fall, in line with Dr. Maverick's claim that +the medication takes 3 weeks to take effect. But a good data scientist doesn't just consider the +average of a dataset, so let's have a look at two other statistics: ~~~ max_plot = matplotlib.pyplot.plot(numpy.max(data, axis=0)) @@ -64,18 +75,18 @@ matplotlib.pyplot.show() ![A line graph showing the minimum inflammation across all patients over a 40-day period.](../fig/inflammation-01-minimum.svg) -The maximum value rises and falls smoothly, while the minimum seems to be a step function. Neither -trend seems particularly likely, so either there's a mistake in our calculations or something is -wrong with our data. 
This insight would have been difficult to reach by examining the numbers -themselves without visualization tools. +The maximum value rises and falls linearly, while the minimum seems to be a step function. +Neither trend seems particularly likely, so either there's a mistake in our calculations or +something is wrong with our data. This insight would have been difficult to reach by examining +the numbers themselves without visualization tools. ### Grouping plots You can group similar plots in a single figure using subplots. This script below uses a number of new commands. The function `matplotlib.pyplot.figure()` creates a space into which we will place all of our plots. The parameter `figsize` tells Python how big to make this space. Each subplot is placed into the figure using -its `add_subplot` [method]({{ page.root }}/reference.html#method). The `add_subplot` method takes 3 -parameters. The first denotes how many total rows of subplots there are, the second parameter +its `add_subplot` [method]({{ page.root }}/reference.html#method). The `add_subplot` method takes +3 parameters. The first denotes how many total rows of subplots there are, the second parameter refers to the total number of subplot columns, and the final parameter denotes which subplot your variable is referencing (left-to-right, top-to-bottom). Each subplot is stored in a different variable (`axes1`, `axes2`, `axes3`). Once a subplot is created, the axes can diff --git a/_episodes/04-lists.md b/_episodes/04-lists.md index bbac79d7a..053cfd821 100644 --- a/_episodes/04-lists.md +++ b/_episodes/04-lists.md @@ -20,8 +20,14 @@ list[2:9]), in the same way as strings and arrays." - "Strings are immutable (i.e., the characters in them cannot be changed)." --- -In the previous episode, we analyzed a single file with inflammation data. Our goal, however, is to -process all the inflammation data we have, which means that we still have eleven more files to go! 
+In the previous episode, we analyzed a single file of clinical trial inflammation data. However, +after finding some peculiar and potentially suspicious trends in the trial data we ask +Dr. Maverick if they have performed any other clinical trials. Surprisingly, they say that they +have and provide us with 11 more CSV files for a further 11 clinical trials they have undertaken +since the initial trial. + +Our goal now is to process all the inflammation data we have, which means that we still have +eleven more files to go! The natural first step is to collect the names of all the files that we have to process. In Python, a list is a way to store multiple values together. In this episode, we will learn how to store diff --git a/_episodes/05-loop.md b/_episodes/05-loop.md index feab7fd9c..c7aed6bf1 100644 --- a/_episodes/05-loop.md +++ b/_episodes/05-loop.md @@ -19,11 +19,13 @@ In the episode about visualizing data, we wrote Python code that plots values of interest from our first inflammation dataset (`inflammation-01.csv`), which revealed some suspicious features in it. -![Line graphs showing average, maximum and minimum inflammation across all patients over a 40-day period.](../fig/03-loop_2_0.png) +![Line graphs showing average, maximum and minimum inflammation across all patients over a 40-day +period.](../fig/03-loop_2_0.png) -We have a dozen data sets right now, though, and more on the way. -We want to create plots for all of our data sets with a single statement. -To do that, we'll have to teach the computer how to repeat things. +We have a dozen data sets right now and potentially more on the way if Dr. Maverick +can keep up their surprisingly fast clinical trial rate. We want to create plots for all of +our data sets with a single statement. To do that, we'll have to teach the computer how to +repeat things. 
An example task that we might want to repeat is accessing numbers in a list, which we @@ -148,7 +150,8 @@ for variable in collection: Using the odds example above, the loop might look like this: -![Loop variable 'num' being assigned the value of each element in the list `odds` in turn and then being printed](../fig/05-loops_image_num.png) +![Loop variable 'num' being assigned the value of each element in the list `odds` in turn and +then being printed](../fig/05-loops_image_num.png) where each number (`num`) in the variable `odds` is looped through and printed one number after another. The other numbers in the diagram denote which loop cycle the number was printed in (1 diff --git a/_episodes/06-files.md b/_episodes/06-files.md index efb2437a6..05fcb3410 100644 --- a/_episodes/06-files.md +++ b/_episodes/06-files.md @@ -45,6 +45,7 @@ This means we can loop over it to do something with each filename in turn. In our case, the "something" we want to do is generate a set of plots for each file in our inflammation dataset. + If we want to start by analyzing just the first three files in alphabetical order, we can use the `sorted` built-in function to generate a new sorted list from the `glob.glob` output: @@ -107,11 +108,26 @@ inflammation-03.csv maximum and minimum inflammation over a 40-day period for all patients in the third dataset.](../fig/03-loop_49_5.png) -Sure enough, -the maxima of the first two data sets show exactly the same ramp as the first, -and their minima show the same staircase structure; -a different situation has been revealed in the third dataset, -where the maxima are a bit less regular, but the minima are consistently zero. + +The plots generated for the second clinical trial file look very similar to the plots for +the first file: their average plots show similar "noisy" rises and falls; their maxima plots +show exactly the same linear rise and fall; and their minima plots show similar staircase +structures. 
+ +The third dataset shows much noisier average and maxima plots that are far less suspicious than +those of the first two datasets; however, the minima plot shows that the third dataset's minima are +consistently zero across every day of the trial. If we produce a heat map for the third data file, +we see the following: + +![Heat map of the third inflammation dataset. Note that there are sporadic zero values throughout +the entire dataset, and the last patient has only zero values over the 40-day study. +](../fig/inflammation-03-imshow.svg) + +We can see that there are zero values sporadically distributed across all patients and days of the +clinical trial, suggesting that there were potential issues with data collection throughout the +trial. In addition, we can see that the last patient in the study didn't have any inflammation +flare-ups at all throughout the trial, suggesting that they may not even suffer from arthritis! + > ## Plotting Differences > @@ -197,4 +213,34 @@ where the maxima are a bit less regular, but the minima are consistently zero. >{: .solution} {: .challenge} +After spending some time investigating the heat map and statistical plots, as well as +doing the above exercises to plot differences between datasets and to generate composite +patient statistics, we gain some insight into the twelve clinical trial datasets. + +The datasets appear to fall into two categories: + +* seemingly "ideal" datasets that agree closely with Dr. Maverick's claims, + but display suspicious maxima and minima (such as `inflammation-01.csv` and `inflammation-02.csv`) +* "noisy" datasets that somewhat agree with Dr. Maverick's claims, but show concerning + data collection issues such as sporadic missing values and even an unsuitable candidate + making it into the clinical trial. + +In fact, it appears that all three of the "noisy" datasets (`inflammation-03.csv`, +`inflammation-08.csv`, and `inflammation-11.csv`) are identical down to the last value. 
+Armed with this information, we confront Dr. Maverick about the suspicious data and +duplicated files. + +Dr. Maverick confesses that they fabricated the clinical data after they found out +that the initial trial suffered from a number of issues, including unreliable data recording and +poor participant selection. They created fake data to prove their drug worked, and when we asked +for more data they tried to generate more fake datasets, as well as throwing in the original +poor-quality dataset a few times to try to make all the trials seem a bit more "realistic". + +Congratulations! We've investigated the inflammation data and shown that the datasets have been +synthetically generated. + +But it would be a shame to throw away the synthetic datasets that have taught us so much +already, so we'll forgive the imaginary Dr. Maverick and continue to use the data to learn +how to program. + {% include links.md %} diff --git a/_extras/guide.md b/_extras/guide.md index 2feb4fe86..0ae6c28fc 100644 --- a/_extras/guide.md +++ b/_extras/guide.md @@ -18,9 +18,9 @@ We use Python in our lessons because: We are using a dataset with records on inflammation from patients following an arthritis treatment. -We make reference in the lesson that this data is somehow strange. It is strange -because it is fabricated! The script used to generate the inflammation data -is included as [`code/gen_inflammation.py`](../code/gen_inflammation.py). +The lesson notes that this data is suspicious and has been +synthetically generated in Python by the imaginary "Dr. Maverick"! The script used to generate +the inflammation data is included as [`code/gen_inflammation.py`](../code/gen_inflammation.py). 
## Overall diff --git a/fig/inflammation-03-imshow.svg b/fig/inflammation-03-imshow.svg new file mode 100644 index 000000000..5bf8056d1 --- /dev/null +++ b/fig/inflammation-03-imshow.svg @@ -0,0 +1,355 @@ +[SVG markup for the heat map of the third inflammation dataset; 355 added lines not shown] diff --git a/index.md b/index.md index e661f98cd..211ed64a9 100644 --- a/index.md +++ b/index.md @@ -8,13 +8,21 @@ The best way to learn how to program is to do something useful, so this introduction to Python is built around a common scientific task: **data analysis**. -### Arthritis Inflammation -We are studying **inflammation in patients** who have been given a new treatment for arthritis. +### Scenario: A Miracle Arthritis Inflammation Cure -There are 60 patients, who had their inflammation levels recorded for 40 days. -We want to analyze these recordings to study the effect of the new arthritis treatment. +Our imaginary colleague "Dr. Maverick" has invented a new miracle drug that promises to +cure arthritis inflammation flare-ups only 3 weeks after first taking the +medication! Naturally, we wish to see the clinical trial data, and after months of asking, +they have finally provided us with a CSV spreadsheet containing the clinical +trial data. -To see how the treatment is affecting the patients in general, we would like to: +The CSV file contains the number of inflammation flare-ups per day for the 60 patients +in the initial clinical trial, with the trial lasting 40 days. Each row corresponds to a +patient, and each column corresponds to a day in the trial. 
Once a patient has their first +inflammation flare-up, they take the medication and wait a few weeks for it to take effect +and reduce flare-ups. + +To see how effective the treatment is, we would like to: 1. Calculate the average inflammation per day across all patients. 2. Plot the result to discuss and share with colleagues.
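
As a minimal sketch of step 1 above, the per-day average across patients could be computed with `numpy.mean` along axis 0, as the lesson does for `data`. The small array below is a made-up stand-in for a trial CSV (its values are hypothetical, not lesson data), with rows as patients and columns as days:

```python
import numpy

# Hypothetical stand-in for one trial file: 3 patients (rows) x 4 days (columns).
# The real lesson data has 60 patients and 40 days.
data = numpy.array([
    [0.0, 1.0, 2.0, 1.0],
    [0.0, 2.0, 4.0, 2.0],
    [0.0, 3.0, 3.0, 0.0],
])

# axis=0 averages down each column, leaving one value per day.
ave_inflammation = numpy.mean(data, axis=0)
print(ave_inflammation)
```

The result is a one-dimensional array with one entry per day, which can then be handed to `matplotlib.pyplot.plot` for step 2.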