Quiz week 1 template and README.md template
mGalarnyk committed Feb 19, 2017
1 parent a124c13 commit 49c1b57
Showing 4 changed files with 131 additions and 1 deletion.
Binary file modified .DS_Store
Binary file added 4_Exploratory_Data_Analysis/.DS_Store
36 changes: 35 additions & 1 deletion 4_Exploratory_Data_Analysis/README.md
@@ -1,4 +1,38 @@
## Data Science Specialization | Johns Hopkins Coursera
# Getting and Cleaning Data Project
Author: Michael Galarnyk <br />
Blog Post: [Getting and Cleaning Data Review](https://medium.com/@GalarnykMichael/review-course-1-the-data-scientists-toolbox-jhu-coursera-4d7459458821#.5jpg133ln "Click to go to Repo") <br />
Data Zip File Location: [UC Irvine Repo](https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip "Clicking will download the data")

## Goal of the Project
1. A tidy data set
2. A link to a GitHub repository with the script for performing the analysis
3. A code book, CodeBook.md, that describes the variables, the data, and any transformations or work performed to clean up the data. The repo should also include a README.md that explains how all of the scripts work and how they are connected.
4. The analysis R script, run_analysis.R

## Review Criteria

Goal | Item | Link to Item
--- | --- | ---
Tidy Data Set | Clean Data Set | [Data Set Link](https://github.com/mGalarnyk/datasciencecoursera/blob/master/3_Getting_and_Cleaning_Data/tidyData.txt "Click to go to Repo")
Github Repo | Repo | [Repo Link](https://github.com/mGalarnyk/datasciencecoursera/tree/master/3_Getting_and_Cleaning_Data "Click to go to Repo")
Code Book | CodeBook.md | [Repo Link](https://github.com/mGalarnyk/datasciencecoursera/blob/master/3_Getting_and_Cleaning_Data/CodeBook.md "CodeBook.md")
README | ReadingItNow | [Repo Link](https://github.com/mGalarnyk/datasciencecoursera/blob/master/3_Getting_and_Cleaning_Data/README.md "README.md")
Analysis R Script | run_analysis.R | [R Script Link](https://github.com/mGalarnyk/datasciencecoursera/blob/master/3_Getting_and_Cleaning_Data/run_analysis.R "run_analysis.R")

## Contributors

FirstName | LastName | Email
--- | --- | ---
Michael | Galarnyk | <mgalarny@gmail.com>
Submit | Pull Request | <youremailhere@gmail.com>

## License

Anyone may contribute after this assignment is turned in and graded.

## Blog Posts on the Specialization | Johns Hopkins Coursera

[Getting and Cleaning Data (JHU Coursera)](https://medium.com/@GalarnykMichael/getting-and-cleaning-data-jhu-coursera-course-3-c3635747858b#.y93kqfa0u "Review + data.table")

[R Programming (JHU Coursera)](https://medium.com/@GalarnykMichael/in-progress-review-course-2-r-programming-jhu-coursera-ad27086d8438#.bzzr29fvo "Review + data.table")

96 changes: 96 additions & 0 deletions 4_Exploratory_Data_Analysis/quiz_week1.R
@@ -0,0 +1,96 @@
# Getting and Cleaning Data, JHU Coursera

#1.
#The American Community Survey distributes downloadable data about United States communities. Download the 2006 microdata survey about housing for the state of Idaho using download.file() from here:

# https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv

# and load the data into R. The code book, describing the variable names, is here:

# https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FPUMSDataDict06.pdf

# Apply strsplit() to split all the names of the data frame on the characters "wgtp". What is the value of the 123rd element of the resulting list?

communities <- data.table::fread("http://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv")
varNamesSplit <- strsplit(names(communities), "wgtp")
varNamesSplit[[123]]
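
# Quick illustration on a single hypothetical name of the form "wgtpNN"
# (not necessarily one of the actual column names): strsplit() drops the
# matched delimiter and leaves an empty string in its place.
strsplit("wgtp15", "wgtp")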

#2.
#Load the Gross Domestic Product data for the 190 ranked countries in this data set:

#https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv

# Remove the commas from the GDP numbers in millions of dollars and average them. What is the average?

#Original data sources:

# http://data.worldbank.org/data-catalog/GDP-ranking-table


# Removed the 's' from 'https' for compatibility with Windows computers.
# Skip the first 5 rows and only read in the relevant columns.
GDPrank <- data.table::fread('http://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv'
, skip=5
, nrows=190
, select = c(1, 2, 4, 5)
, col.names=c("CountryCode", "Rank", "Country", "GDP")
)

# Remove the commas using gsub()
# Convert the cleaned strings to integers
# Take the mean of the GDP column (all three steps happen inside data.table's j expression)
GDPrank[, mean(as.integer(gsub(pattern = ',', replacement = '', x = GDP )))]
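
# A base-R sketch of the same computation (no data.table), assuming the same
# column layout as the fread() call above (i.e. column 5 holds GDP in millions of dollars).
GDPbase <- read.csv('http://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv'
                    , skip = 5, nrows = 190, header = FALSE, stringsAsFactors = FALSE)
mean(as.numeric(gsub(",", "", GDPbase[, 5])))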



#3. In the data set from Question 2,
# what is a regular expression that would allow you to count the number of countries whose name begins with "United"?
# Assume that the variable with the country names in it is named countryNames. How many countries begin with United?

grep("^United",GDPrank[, Country])

# 4.Load the Gross Domestic Product data for the 190 ranked countries in this data set:
# https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv
# Load the educational data from this data set:
# https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FEDSTATS_Country.csv
# Match the data based on the country shortcode.
# Of the countries for which the end of the fiscal year is available, how many end in June?

GDPrank <- data.table::fread('http://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv'
, skip=5
, nrows=190
, select = c(1, 2, 4, 5)
, col.names=c("CountryCode", "Rank", "Country", "GDP")
)

eduDT <- data.table::fread('http://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FEDSTATS_Country.csv')

mergedDT <- merge(GDPrank, eduDT, by = 'CountryCode')

mergedDT[grepl("Fiscal year end: June 30;", `Special Notes`), .N]
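
# The same count without data.table's [i, j] syntax (a sketch reusing the merged object above)
sum(grepl("Fiscal year end: June 30;", mergedDT$`Special Notes`))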


# 5. You can use the quantmod (http://www.quantmod.com/) package
# to get historical stock prices for publicly traded companies on the NASDAQ and NYSE.
# Use the following code to download data on Amazon's stock price and get the times the data was sampled.

# library(quantmod)
# amzn = getSymbols("AMZN",auto.assign=FALSE)
# sampleTimes = index(amzn)


# install.packages("quantmod")
library("quantmod")
amzn <- getSymbols("AMZN",auto.assign=FALSE)
sampleTimes <- index(amzn)
timeDT <- data.table::data.table(timeCol = sampleTimes)

# How many values were collected in 2012?
timeDT[(timeCol >= "2012-01-01") & (timeCol < "2013-01-01"), .N ]

# How many values were collected on Mondays in 2012?
timeDT[((timeCol >= "2012-01-01") & (timeCol < "2013-01-01")) & (weekdays(timeCol) == "Monday"), .N ]
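
# Alternative sketch working on the Date index directly with base R; note that
# weekdays() is locale-dependent, so "Monday" assumes an English locale.
dates2012 <- sampleTimes[format(sampleTimes, "%Y") == "2012"]
length(dates2012)                     # values collected in 2012
sum(weekdays(dates2012) == "Monday")  # of those, how many fell on a Monday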



