I've been observing the fluctuations in gas prices for some years, and I wonder whether consumers have ever had the opportunity to enjoy a low price after a hike. The National Propane Gas Association noted that 9% of U.S. households use propane for at least one residential application excluding grilling; I'm curious to see whether these consumers have had any chance to enjoy low prices in the past, or whether there could be one in the future. In this analysis, I looked at the price of propane over the years and also created a model to see if there is any relief for consumers in 2025. I used the New York Average Propane Price data for this analysis.
Data is any record, whether text (structured or unstructured), video, audio, graphs, etc., that can be used for reference purposes. Data analysis is the process of extracting facts from data for human or business decisions. To analyze data in Python, we must first import the necessary libraries. See my write-up on Python Libraries for more details.
First, import Pandas.
import pandas as pd
data = pd.read_csv("file name or file path")
print(data.head()) # use print(data) or display(data) to view the whole table, or data.head() to view the first 5 rows
The data runs from 1997 to 2024; each month within each year has at least two price inputs, i.e., one at the beginning and one at the end of the month. I will not fill in each month with its average price in order to have a full month because: (1) there is no indication of missing values, and (2) this would not change the average price of each month.
print(data.isnull().sum()) # confirm that no values are missing
Oftentimes, not all the columns in a dataframe are useful for one's analysis; so, for easier handling of the data table, one can select only the necessary columns. ".loc" can be used as in the code below, or the double square bracket method can also be used.
needed_cols = data.loc[:, ["Data", "New York Statewide Average ($/gal)"]]
# ":" selects all the rows; "Data" and "New York Statewide Average ($/gal)" are the selected columns.
Next, let's rename the columns. In this case, I will rename them to change all the characters to lower case and also to shorten the names.
needed_cols = needed_cols.rename(columns={"old_name1": "new_name1", "old_name2": "new_name2"}) # general pattern
needed_cols = needed_cols.rename(columns={"Data": "date", "New York Statewide Average ($/gal)": "ny_state_avg_price"})
Let's split the date column into month and year using pandas.to_datetime; additional information is available on Python datetime. Although some IDEs might work with datetime without first importing it, it is better to import datetime to be on the safe side. I first converted the date to Pandas datetime and then created new columns for the months and the years.
from datetime import datetime
needed_cols["date"] = pd.to_datetime(needed_cols["date"])
needed_cols["month"], needed_cols["year"] = needed_cols.date.dt.month, needed_cols.date.dt.monthHere is what the new table look like:
Now that the data is ready, let's start the analysis.
We'll view the shape of the data, but before then, let's import two more libraries for visualization.
import matplotlib.pyplot as m
import seaborn as s
hist_chart = m.figure(figsize=(6,5), layout="constrained")
s.histplot(needed_cols, x="ny_state_avg_price", bins=20, element="step")
m.ylabel("frequncy")
m.show()
The chart shows the price on the x-axis and its frequency on the y-axis. It is right-skewed, i.e., most prices cluster toward the left with a long tail stretching to the right, and it looks bi-modal, i.e., it has two peaks. Let's lay a density curve on the chart to ascertain the bi-modal shape.
A density curve summarises the approximate shape and pattern of a data distribution.
s.histplot(needed_cols, x="ny_state_avg_price", bins=20, element="step", kde=True)
m.show()
For better clarity, I will view the modal density shape without the bars or elements.
s.displot(needed_cols, x="ny_state_avg_price", kind="kde")
m.show()
The distribution appears to be bi-modal, with one peak around $1.50 and another around $3.00.
Let's examine the average price within each year with a box plot.
fig, ax=m.subplots(figsize=(10,8))
s.boxplot(x=needed_cols["year"], y=needed_cols["ny_state_avg_price"])
ax.tick_params(rotation=90)
m.show()
The boxplot summarises the price for each year. The whiskers ("T" extensions) show the minimum (bottom) and maximum (top) prices for the year, excluding outliers. The horizontal line that cuts through the box indicates the median for the year. The diamond-shaped points beyond the whiskers indicate outliers. There seems to have been some relief in the past: prices dropped in 2002 after the increase in 2001, they dropped again in 2009 after a preceding run of consistent hikes, and the most recent hike was in 2022 before a bit of relief in 2023 and 2024.
I would like to summarise the average price per year with a bar chart.
import numpy as np
ave = round(needed_cols.groupby(["year"])["ny_state_avg_price"].mean().reset_index(), 2) # one row per year with the average price, rounded to 2 decimal places
fig, ax=m.subplots(figsize=(6,8))
m.bar(ave["year"], np.log(ave["ny_state_avg_price"])) # "np.log" transforms the prices into logarithmic values
ax.xlabel("years")
ax.ylabel("Log of Average Prices $/gal")
m.show()
The prices are in log form so that the bars can be displayed distinctively. The real values can be estimated by taking the exponential of the displayed y-axis value, e.g., for 1.2, np.exp(1.2) is approximately 3.32.
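As a quick sanity check on that conversion (using the numpy import above):
print(np.exp(1.2)) # ≈ 3.32, the approximate real price in $/gal behind a log value of 1.2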
We can summarise all the exploratory charts into a dashboard for an easier visual overview.
fig, axes = m.subplots(2, 2, figsize=(6,8))
s.histplot(needed_cols, x="ny_state_avg_price", bins=20, element="step", ax=axes[0,0])
axes[0,0].set_title("Shape of Propane Prices from 1997 to 2024")
s.histplot(needed_cols, x="ny_state_avg_price", bins=20, element="step", kde=True, ax=axes[0,1])
axes[0,1].set_title("Density Curve of the Shape of Propane Price from 1997 to 2024")
s.boxplot(x=needed_cols["year"], y=needed_cols["ny_state_avg_price"], ax=axes[1,0])
axes[1,0].tick_params(rotation=90)
axes[1,0].set_xlabel("Years")
axes[1,0].set_ylabel("Propane Price $/gal")
axes[1,1].bar(ave["year"], np.log(ave["ny_state_avg_price"]))
axes[1,1].set_title("Logarithmic Price of Propane from 1997 to 2024")
axes[1,1].set_xlabel("Year")
axes[1,1].set_ylabel("Logarithm of Average Price of Propane")
m.tight_layout()
m.show()
Let's take some samples and carry out statistical analysis on them. Years 2023 and 2024 are very recent; is there any significant difference between their prices? Let's check with a t-test. You can check my write-up on the traditional way of conducting statistical analysis; let's quickly run it with Python here.
Let's state our hypotheses:
Null hypothesis: there is no significant difference between the mean prices of 2023 and 2024.
Alternative hypothesis: there is a significant difference between the mean prices of 2023 and 2024.
Now, let's filter the needed years before we import the necessary library.
filtered_rows = data[data["column"].isin([values_or_strings_to_filter])] # general pattern: keep only rows whose "column" value appears in the list
yr_2023_sample = needed_cols[needed_cols["year"].isin([2023])].sample(n=25, random_state=42)
yr_2024_sample = needed_cols[needed_cols["year"].isin([2024])].sample(n=25, random_state=42)
We just filtered years 2023 and 2024 from the dataframe and took 25 random samples from each year. The sample is sensitive to randomness, hence the use of random_state to foster reproducibility. Using another value for random_state might alter the result, and not using random_state at all might even generate the opposite result to mine.
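Before testing, it can help to glance at the two sample means; a small sketch using the samples above (the exact values depend on the sampled rows):
print(yr_2023_sample["ny_state_avg_price"].mean()) # mean price of the 2023 sample
print(yr_2024_sample["ny_state_avg_price"].mean()) # mean price of the 2024 sample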
import scipy.stats
sig_test = scipy.stats.ttest_ind(yr_2023_sample["ny_state_avg_price"], yr_2024_sample["ny_state_avg_price"])
print(sig_test)
Let's assume we are using a confidence level of 95%, so our level of significance, alpha, is 0.05. If the p-value from the test is less than 0.05, we reject the null hypothesis; otherwise, we fail to reject it.
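A minimal sketch of that decision rule, using the sig_test result above:
alpha = 0.05 # significance level implied by a 95% confidence level
if sig_test.pvalue < alpha:
    print("Reject the null hypothesis: the 2023 and 2024 mean prices differ significantly.")
else:
    print("Fail to reject the null hypothesis: no significant difference detected.")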
Regression analysis is used to predict how a change in variable x might be associated with a change in variable y. I will be using linear regression, which can be expressed mathematically as y = m + cx, where:
y = the dependent variable; in this analysis, the price
x = the independent variable; in this analysis, the day
c = the coefficient of x, i.e., the slope, or the change in y per unit change in x
m = the intercept, i.e., the point at which the regression line meets the y-axis
Before we move forward, it is important to establish that the fact that y changes (correlates) when there is a change in x does not mean that x is the root cause of the change in y; hence the popular saying, correlation does not equal causation.
Linear regression in a simple graphical representation is just a scatter plot combined with a line graph. We'll use regression to see how the change in days might have influenced the change in price from 2000 to 2024.
from_2000 = needed_cols[needed_cols["year"] > 1999].reset_index(drop=True) # filter the needed years and reset the row index
from_2000["day"] = [daz for daz in range(len(from_2000)) # Days in those years that are selected
fig, ax=m.subplots(figsize=(6,8))
m.scatter(from_2000["day"],from_2000["ny_state_avg_price"])
m.xlabel("day")
m.ylabel("price")
Each point on the scatter plot represents the price for its corresponding day. Now, let's fit the regression line to the scatter plot to visualize how many points fall exactly on the line or close to it.
from sklearn.linear_model import LinearRegression
# Reshape necessary columns to 2-D to facilitate analysis
x=from_2000.day.values.reshape(-1,1)
y=from_2000.ny_state_avg_price.values.reshape(-1,1)
regression = LinearRegression()
regression.fit(x,y)
predict_y = regression.predict(x)
m.plot(x, predict_y)
m.show()
The regression line slopes upward from left to right, indicating a positive correlation: as the days progress, the price somewhat increases. Let's confirm the correlation.
correlation = from_2000["day"].corr(from_2000["ny_state_avg_price"])
print("correlation =", correlation)The correlation is positive because it's greater than zero (0) and it's very strong because it's more than 0.5
Since we have our regression chart, we can estimate the intercept and the coefficient from it or calculate them. Let's do the latter.
regression.fit(x,y)
coeffi = regression.coef_
intercept = regression.intercept_
print("coefficient =", coeffi, "intercept =", intercept)
Now that we have our slope and intercept, the regression equation from which our regression model is formed becomes:
price = 1.7929 + 0.0017(day)
This is developed from the regression equation and our calculated variables, the intercept and the coefficient.
So, if we pick a day, say day 200, we can check the price for that particular day, i.e., price = 1.7929 + 0.0017(200) = 1.7929 + 0.34 ≈ $2.13.
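The same estimate can also be read directly from the fitted model rather than by hand; a minimal sketch, assuming the regression object fitted above:
print(regression.predict(np.array([[200]]))) # ≈ $2.13, matching the manual estimate up to rounding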
Our linear regression chart stops where the days end (i.e., the last day of December 2024). We can extend the days as far as we want into 2025 by extending the regression line to the right, as in:
import numpy as np
last_day_in_2024 = from_2000["day"].max() # the value is 931
future_daz_in_2025 = np.append(from_2000["day"], np.arange(932, 1022))
Future days in 2025 were estimated (i.e., extended to the end of March 2025) with "future_daz_in_2025". Now we can estimate the price of propane on the first day of 2025 or the last day of March, as in:
first_day_of_2025 = 1.7929 + 932(0.0017) ≈ $3.38
last_day_of_March_2025 = 1.7929 + 1021(0.0017) ≈ $3.53
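Equivalently, the fitted model can produce these estimates for the whole extension at once; a small sketch, assuming the regression object and the numpy import above:
future_x = np.arange(932, 1022).reshape(-1, 1) # days from the start of 2025 to the end of March 2025
future_prices = regression.predict(future_x) # model-based price estimates for those days
print(future_prices[0], future_prices[-1]) # ≈ $3.38 and ≈ $3.53, matching the manual calculations up to rounding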
It's easy to extend the regression line and extrapolate into the future; but is the future really this predictable? Can we ascertain that the future will follow the extrapolated data? The answer is obvious: no. One way to go about this is to split the historical data into a training set (usually 70%-80% of the data) and a test set (20%-30%). The training set is used to fit the model, while the test set is used to verify how the model would perform on unseen future data.
x_train = from_2000.loc[0:740, ["day"]].values.reshape(-1,1)
y_train = from_2000.loc[0:740, ["ny_state_avg_price"]].values.reshape(-1,1)
x_test = from_2000.loc[741:931, ["day"]].values.reshape(-1,1)
y_test = from_2000.loc[741:931, ["ny_state_avg_price"]].values.reshape(-1,1)
regression.fit(x_train, y_train)
print("coefficient = ", reg.coef_, "intercept =", reg.intercept_)Now that we have the coefficient and intercept from the training set, we can generate a training model and test our test set (assumed future data) on it to guess and prepare our model for the uncertain future.
price = 1.711 + 0.0021(any day in the test set)
Let's try day 783: price_target_future = 1.711 + 0.0021(783) = 1.711 + 1.6443 ≈ $3.36
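Beyond a single day, the model can be scored across the entire test set; a minimal sketch, assuming the x_test and y_test arrays above and scikit-learn's mean_squared_error:
from sklearn.metrics import mean_squared_error

y_pred = regression.predict(x_test) # the model's guesses for the held-out days
print("mean squared error =", mean_squared_error(y_test, y_pred)) # average squared gap between actual and predicted prices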
Yes! Historically, consumers have enjoyed some reduction in price after increases, according to this analysis, and possibly may again in the future. Price is expected to fluctuate as the days pass; the estimates into the future are just a guide to what is expected. There is no guarantee that price will increase linearly, but training our model can help it prepare for any future data.