Testing carpet samples for chemical compounds to determine their age using SAS. Dataset can be found in the README file.

AneesahG/-Chemical-Compounds-Age-SAS

Chemical Compounds and The Age of Carpets Using SAS

Testing carpet samples for chemical compounds to determine their age using SAS. I use linear regression (PROC REG) in SAS Studio with a dataset from "Age Estimation of Old Carpets Based on Cystine and Cysteic Acid Content."

Getting Started

To begin the project, you'll need to download the following dataset: Age Estimation of Old Carpets Based on Cystine and Cysteic Acid Content.

Source: J. Csapo, Z. Csapo-Kiss, T.G. Martin, S. Folestad, O. Orwar, A. Tivesten, and S. Nemethy (1995). "Age Estimation of Old Carpets Based on Cystine and Cysteic Acid Content," Analytica Chimica Acta, Vol. 300, pp. 313-320.
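Once the file is downloaded, it has to be read into SAS. The following is a minimal sketch, assuming the data was saved as a CSV file with a header row. The path is a placeholder, and the column names (age, cys_acid, cys, met, tyr) are taken from the code later in this README rather than from the file itself.

```sas
/* Hypothetical path: replace with the location of your downloaded file. */
proc import datafile="/path/to/carpet.csv"
    out=work.import      /* WORK.IMPORT is the dataset name used later in this README */
    dbms=csv
    replace;
    getnames=yes;        /* assumes the first row holds the column names */
run;

/* Confirm that age, cys_acid, cys, met, and tyr were read as numeric variables */
proc contents data=work.import;
run;
```

If the file is in another format, only the DBMS= value and the path should need to change.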

Prerequisites

You will need to download SAS in order to run the code. More details on how to install SAS on a Windows machine are here.

Creating the Q-Q Plot

Our covariates are the four organic compounds: Cysteic Acid, Cystine, Methionine, and Tyrosine. My first step was to create a Q-Q plot to see whether the residuals of the model follow a normal distribution.

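/* Regress age on the four compounds and request only the Q-Q plot of the residuals */
proc reg data=dg.carpet plots(only)=qqplot;
model age=cys_acid cys met tyr;
ods select QQPlot; /* suppress the other output tables */
run;
quit;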

Screenshot 2023-05-16 211755

Make sure to import the data file correctly before creating your Q-Q plot. The plot should look like this:

Screenshot 2023-05-16 212054
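As a cross-check on the Q-Q plot, the residuals could also be saved and examined with PROC UNIVARIATE, which adds formal normality tests. This is only a sketch: the dataset name work.resid_check and the variable name resid are illustrative and do not appear in the original project.

```sas
/* Save the residuals from the same model; NOPRINT hides the regression tables */
proc reg data=dg.carpet noprint;
    model age=cys_acid cys met tyr;
    output out=work.resid_check r=resid;   /* illustrative names */
run;
quit;

/* NORMAL requests normality tests; the Q-Q plot gets a reference line
   based on the estimated mean and standard deviation */
proc univariate data=work.resid_check normal;
    var resid;
    qqplot resid / normal(mu=est sigma=est);
run;
```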

Plotting the Residuals

Although the points drift slightly from the line at both ends, the residuals appear to be approximately normally distributed, which is what we want. To further verify that the relationship is linear, we can plot the residuals, which are the differences between our observed and predicted values. Ideally, the residual plot should look completely random: a symmetric cloud of points with no visible pattern.

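/* Assumes the dis2 libref has already been assigned with a LIBNAME statement */
options obs=1000; /* process at most 1000 observations */

/* Keep only the rows with a known age */
data subset;
set dis2.carpet;
if age=. then delete;
run;

/* Correlations and scatter-plot matrix for the response and covariates */
proc corr data=subset plots=matrix;
var age cys_acid cys met tyr;
run;

/* Fit the model and save residuals and predicted values to a new dataset,
   so that the source data in dis2.carpet is not overwritten */
proc reg data=subset;
model age=cys_acid cys met tyr;
output out=carpet_resid r=resid p=pred;
run;
quit;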

Please note that there is a typo in line one of the code in the screenshot: the first statement should read 'libname', which associates the chemicals' library with a libref. Sorry!

Screenshot 2023-05-16 212748

Your model should look like the image below:

Screenshot 2023-05-16 212847

Findings

If there were a distinct pattern, outliers, or shape in the residuals, the model would need further improvement. In Figure 2, I've plotted the residuals against each of our four covariates. There doesn't seem to be a distinct pattern, so we can check off these assumptions: the error terms must have a mean of zero, and the variance of the error terms must be constant.
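The per-covariate residual plots described above can be requested directly from PROC REG. A minimal sketch, assuming the subset dataset created earlier is still available in the WORK library:

```sas
/* RESIDUALS gives a panel of residuals against each covariate;
   RESIDUALBYPREDICTED plots residuals against the fitted values */
proc reg data=subset plots(only)=(residuals residualbypredicted);
    model age=cys_acid cys met tyr;
run;
quit;
```

A funnel shape in either plot would point to non-constant variance, and a curve would point to a non-linear relationship.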

Final Data Summary

From the data summary, we can note that cysteic acid has the smallest p-value, so it shows the strongest evidence of a relationship with the age of our carpet samples. All of the compounds have p-values for the F test (Pr > F) below 0.01, so for each of the four covariates you would reject the null hypothesis of no linear relationship at alpha equals 0.01.

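/* List the variables in the dataset */
proc contents data = carpet;
run;

/* Simple regression of age on cystine */
proc reg data = carpet;
model age = cys;
run;
quit;

/* Simple regression of age on methionine */
proc reg data = carpet;
model age = met;
run;
quit;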

Your output should resemble the regression output shown in the screenshot. Make sure each PROC REG statement is accompanied by a MODEL statement, which specifies the regression model to fit.

Screenshot 2023-05-16 213433
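To compare all four covariates in a single run, PROC REG accepts several labeled MODEL statements. This is a sketch rather than the original author's code; the labels (cys_acid_only and so on) are illustrative names chosen here.

```sas
/* Show only the ANOVA table (F test) and the parameter estimates for each model */
ods select ANOVA ParameterEstimates;

proc reg data=carpet;
    cys_acid_only: model age = cys_acid;
    cys_only:      model age = cys;
    met_only:      model age = met;
    tyr_only:      model age = tyr;
run;
quit;
```

Each label appears in the output heading, which makes it easier to line up the Pr > F values side by side.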

Checking With A Log Transformation

Our adjusted coefficient of determination is approximately 0.9946, implying that about 99.46% of the variation in age can be explained by our linear model in Cysteic Acid, Cystine, Methionine, and Tyrosine. Though it isn't quite 1, the regression predictions fit the data almost perfectly, so we're on the right track. I tried a logarithmic transformation on age but didn't really see an improvement (that is, I expected a tighter Q-Q plot but instead got Figure 3). For this reason, I would suggest sticking with the first model, since its coefficient of determination is closest to one and it gives better results overall.

data work.transform;
set WORK.IMPORT;
log_age=log(age);
log_cys_acid=log(cys_acid);
log_cys=log(cys);
run;

Screenshot 2023-05-16 214119

The Q-Q plot for the model with log-transformed age:

Screenshot 2023-05-16 213944

We can assess the quality of the fit with the 'Fit Diagnostics' panel that PROC REG produces.

Screenshot 2023-05-16 213959
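The log-transformed fit and its diagnostics could be reproduced roughly as follows. This is a sketch, assuming the work.transform dataset from the DATA step above exists and that the untransformed covariates are used on the right-hand side; the original README does not show the exact MODEL statement.

```sas
/* Refit with log(age) as the response and request the Q-Q plot
   together with the Fit Diagnostics panel */
proc reg data=work.transform plots(only)=(qqplot diagnostics);
    model log_age=cys_acid cys met tyr;
run;
quit;
```

Comparing the adjusted R-square and Q-Q plot from this run with those of the original model is what supports keeping the untransformed model.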

Thank you for reading!
