FIX: Update python code to simplify and resolve FutureWarning #540

Merged: 11 commits, Aug 1, 2024
2 changes: 1 addition & 1 deletion environment.yml
@@ -4,7 +4,7 @@ channels:
- conda-forge
dependencies:
- python=3.11
-  - anaconda=2024.02
+  - anaconda=2024.06
- pip
- pip:
- jupyter-book==0.15.1
49 changes: 11 additions & 38 deletions lectures/prob_dist.md
@@ -4,14 +4,13 @@ jupytext:
extension: .md
format_name: myst
format_version: 0.13
-jupytext_version: 1.14.5
+jupytext_version: 1.16.1
kernelspec:
display_name: Python 3 (ipykernel)
language: python
name: python3
---


# Distributions and Probabilities

```{index} single: Distributions and Probabilities
@@ -23,6 +22,7 @@ In this lecture we give a quick introduction to data and probability distributions

```{code-cell} ipython3
:tags: [hide-output]

!pip install --upgrade yfinance
```

@@ -35,7 +35,6 @@ import scipy.stats
import seaborn as sns
```


## Common distributions

In this section we recall the definitions of some well-known distributions and explore how to manipulate them with SciPy.
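Throughout, SciPy follows one pattern: construct a frozen distribution object, then query it for moments and probabilities. Here is a minimal sketch of that interface (the Bernoulli parameter 0.5 is our illustrative choice, not taken from the folded cells below):

```{code-cell} ipython3
# Sketch of the common scipy.stats interface, using a Bernoulli(0.5)
# distribution as a stand-in example (illustrative, not from the lecture).
b = scipy.stats.bernoulli(0.5)
print(b.mean(), b.var())   # first two moments
print(b.pmf(1))            # probability mass at the point 1
print(b.cdf(0))            # cumulative distribution function at 0
```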
@@ -99,7 +98,6 @@ n = 10
u = scipy.stats.randint(1, n+1)
```


Here's the mean and variance:

```{code-cell} ipython3
@@ -195,7 +193,6 @@ u.pmf(0)
u.pmf(1)
```


#### Binomial distribution

Another useful (and more interesting) distribution is the **binomial distribution** on $S=\{0, \ldots, n\}$, which has PMF:
@@ -232,7 +229,6 @@ Let's see if SciPy gives us the same results:
u.mean(), u.var()
```
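As a quick sanity check (a sketch we add here, assuming `u` was defined as `scipy.stats.binom(n, θ)` with the illustrative values $n = 10$ and $\theta = 0.5$), the textbook formulas $n\theta$ and $n\theta(1-\theta)$ can be verified directly:

```{code-cell} ipython3
# Verify the binomial mean and variance formulas against SciPy.
# n = 10 and θ = 0.5 are assumed values matching the elided definition.
n, θ = 10, 0.5
u = scipy.stats.binom(n, θ)
assert abs(u.mean() - n * θ) < 1e-10           # mean is nθ
assert abs(u.var() - n * θ * (1 - θ)) < 1e-10  # variance is nθ(1 - θ)
```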


Here's the PMF:

```{code-cell} ipython3
@@ -250,7 +246,6 @@ ax.set_ylabel('PMF')
plt.show()
```


Here's the CDF:

```{code-cell} ipython3
@@ -264,7 +259,6 @@ ax.set_ylabel('CDF')
plt.show()
```
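One relation worth spelling out (a small sketch we add; `u` is the binomial distribution from the check above): the CDF at $k$ is just the PMF summed over $\{0, \ldots, k\}$.

```{code-cell} ipython3
# The CDF is the running sum of the PMF: F(k) = Σ_{i ≤ k} p(i).
k = 3
total = sum(u.pmf(i) for i in range(k + 1))
print(total, u.cdf(k))  # the two numbers agree
```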


```{exercise}
:label: prob_ex3

@@ -334,7 +328,6 @@ ax.set_ylabel('PMF')
plt.show()
```


#### Poisson distribution

The Poisson distribution on $S = \{0, 1, \ldots\}$ with parameter $\lambda > 0$ has PMF
@@ -372,7 +365,6 @@ ax.set_ylabel('PMF')
plt.show()
```
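The PMF $\lambda^i e^{-\lambda} / i!$ is easy to check by hand against SciPy; the sketch below uses the illustrative value $\lambda = 2$, which may differ from the folded cells above.

```{code-cell} ipython3
from math import exp, factorial

# Check the Poisson PMF formula λ^i e^{-λ} / i! against SciPy.
λ = 2  # illustrative parameter, assumed rather than taken from above
u_poisson = scipy.stats.poisson(λ)
for i in range(5):
    by_hand = λ**i * exp(-λ) / factorial(i)
    print(i, by_hand, u_poisson.pmf(i))
```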


### Continuous distributions


@@ -449,7 +441,6 @@ plt.legend()
plt.show()
```


Here's a plot of the CDF:

```{code-cell} ipython3
@@ -466,7 +457,6 @@ plt.legend()
plt.show()
```


#### Lognormal distribution

The **lognormal distribution** is a distribution on $\left(0, \infty\right)$ with density
@@ -646,7 +636,6 @@ plt.legend()
plt.show()
```
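A useful way to remember the lognormal (a sketch we add, with illustrative parameters): if $Z$ is standard normal, then $X = \exp(\mu + \sigma Z)$ is lognormal with parameters $\mu, \sigma$, which in SciPy's parametrization is `scipy.stats.lognorm(s=σ, scale=np.exp(μ))`.

```{code-cell} ipython3
# If Z is standard normal, X = exp(μ + σZ) is lognormal(μ, σ).
μ, σ = 0.0, 1.0  # illustrative parameters, not from the lecture
u_lognorm = scipy.stats.lognorm(s=σ, scale=np.exp(μ))
z = np.random.default_rng(0).standard_normal(100_000)
x = np.exp(μ + σ * z)              # simulate via the definition
print(x.mean(), u_lognorm.mean())  # sample vs exact mean e^{μ + σ²/2}
```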


#### Gamma distribution

The **gamma distribution** is a distribution on $\left(0, \infty\right)$ with density
@@ -730,7 +719,6 @@ df = pd.DataFrame(data, columns=['name', 'income'])
df
```


In this situation, we might refer to the set of their incomes as the "income distribution."

The terminology is confusing because this set is not a probability distribution
@@ -761,14 +749,10 @@ $$
For the income distribution given above, we can calculate these numbers via

```{code-cell} ipython3
-x = np.asarray(df['income']) # Pull out income as a NumPy array
-```
-
-```{code-cell} ipython3
+x = df['income']
x.mean(), x.var()
```
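One caveat worth flagging (our note, assuming the folded formulas above use the common $1/n$ convention): pandas' default `Series.var()` divides by $n - 1$ rather than $n$; pass `ddof=0` to match the $1/n$ convention.

```{code-cell} ipython3
# The 1/n sample variance versus pandas' default (which uses n - 1).
x = df['income']
n = len(x)
x_bar = sum(x) / n
var_n = sum((x - x_bar)**2) / n       # 1/n convention
print(x_bar, x.mean())                # these agree
print(var_n, x.var(ddof=0), x.var())  # the last divides by n - 1
```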


```{exercise}
:label: prob_ex4

@@ -792,15 +776,13 @@ We will cover
We can histogram the income distribution we just constructed as follows

```{code-cell} ipython3
+x = df['income']
fig, ax = plt.subplots()
ax.hist(x, bins=5, density=True, histtype='bar')
ax.set_xlabel('income')
ax.set_ylabel('density')
plt.show()
```


Let's look at a distribution from real data.

In particular, we will look at the monthly return on Amazon shares between 2000/1/1 and 2024/1/1.
@@ -811,25 +793,21 @@ So we will have one observation for each month.

```{code-cell} ipython3
:tags: [hide-output]
-df = yf.download('AMZN', '2000-1-1', '2024-1-1', interval='1mo' )
+
+df = yf.download('AMZN', '2000-1-1', '2024-1-1', interval='1mo')
prices = df['Adj Close']
-data = prices.pct_change()[1:] * 100
-data.head()
+x_amazon = prices.pct_change()[1:] * 100
+x_amazon.head()
```


The first observation is the monthly return (percent change) over January 2000, which was

```{code-cell} ipython3
-data[0]
+x_amazon.iloc[0]
```
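Since `pct_change` computes $(p_t - p_{t-1}) / p_{t-1}$, this first value can be checked by hand from the first two prices (a small sketch we add):

```{code-cell} ipython3
# Check the first monthly return by hand: (p_1 - p_0) / p_0 * 100.
p0, p1 = prices.iloc[0], prices.iloc[1]
print((p1 - p0) / p0 * 100, x_amazon.iloc[0])  # these agree
```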

-Let's turn the return observations into an array and histogram it.
-
-```{code-cell} ipython3
-x_amazon = np.asarray(data)
-```

```{code-cell} ipython3
fig, ax = plt.subplots()
ax.hist(x_amazon, bins=20)
@@ -838,7 +816,6 @@ ax.set_ylabel('density')
plt.show()
```


#### Kernel density estimates

Kernel density estimates (KDE) provide a simple way to estimate and visualize the density of a distribution.
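As a minimal sketch of the idea (our example; the folded cells above presumably do something similar), seaborn's `kdeplot` draws a smoothed density, and its `bw_adjust` keyword scales the bandwidth, trading smoothness against fidelity:

```{code-cell} ipython3
# Two kernel density estimates of the Amazon returns: the default
# bandwidth, and a rougher fit with half the bandwidth.
fig, ax = plt.subplots()
sns.kdeplot(x_amazon, ax=ax)
sns.kdeplot(x_amazon, bw_adjust=0.5, ax=ax)
ax.set_xlabel('monthly return (percent change)')
plt.show()
```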
@@ -893,10 +870,10 @@ For example, let's compare the monthly returns on Amazon shares with the monthly returns on Costco shares.

```{code-cell} ipython3
:tags: [hide-output]
-df = yf.download('COST', '2000-1-1', '2024-1-1', interval='1mo' )
+
+df = yf.download('COST', '2000-1-1', '2024-1-1', interval='1mo')
prices = df['Adj Close']
-data = prices.pct_change()[1:] * 100
-x_costco = np.asarray(data)
+x_costco = prices.pct_change()[1:] * 100
```

```{code-cell} ipython3
@@ -907,7 +884,6 @@ ax.set_xlabel('KDE')
plt.show()
```


### Connection to probability distributions

Let's discuss the connection between observed distributions and probability distributions.
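The standard move, and presumably what the folded cells below do, is moment matching: fit a normal distribution by setting its mean and standard deviation to the sample values, then lay its density over the histogram. A sketch:

```{code-cell} ipython3
# Fit a normal distribution by matching the sample mean and standard
# deviation, then compare its density with the histogram of returns.
μ_hat, σ_hat = x_amazon.mean(), x_amazon.std()
u_norm = scipy.stats.norm(μ_hat, σ_hat)
x_grid = np.linspace(x_amazon.min(), x_amazon.max(), 200)
fig, ax = plt.subplots()
ax.hist(x_amazon, bins=20, density=True, alpha=0.5)
ax.plot(x_grid, u_norm.pdf(x_grid))
plt.show()
```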
@@ -941,7 +917,6 @@ plt.show()
plt.show()
```


The match between the histogram and the density is not bad but also not very good.

One reason is that the normal distribution is not really a good fit for this observed data --- we will discuss this point again when we talk about {ref}`heavy tailed distributions<heavy_tail>`.
@@ -967,8 +942,6 @@ ax.set_ylabel('density')
plt.show()
```


Note that as you keep increasing $N$, the number of observations, the fit gets better and better.

This convergence is a version of the "law of large numbers", which we will discuss {ref}`later<lln_mr>`.
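A quick simulation makes the claim concrete (our sketch, with arbitrary sample sizes): draw $N$ observations from a known normal distribution and watch the histogram hug the true density as $N$ grows.

```{code-cell} ipython3
# As N grows, the histogram of N draws from N(0, 1) approaches the
# true density, illustrating the convergence discussed above.
u_true = scipy.stats.norm(0, 1)
x_grid = np.linspace(-4, 4, 200)
rng = np.random.default_rng(1)
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, N in zip(axes, (100, 1_000, 100_000)):
    ax.hist(rng.normal(size=N), bins=40, density=True, alpha=0.5)
    ax.plot(x_grid, u_true.pdf(x_grid))
    ax.set_title(f'$N = {N}$')
plt.show()
```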