Quiz week 1 template and README.md template
mGalarnyk committed Feb 19, 2017
1 parent a124c13 commit 49c1b57
Showing 4 changed files with 131 additions and 1 deletion.
Binary file modified .DS_Store
Binary file added 4_Exploratory_Data_Analysis/.DS_Store
36 changes: 35 additions & 1 deletion 4_Exploratory_Data_Analysis/README.md
@@ -1,4 +1,38 @@
## Data Science Specialization | Johns Hopkins Coursera
# Getting and Cleaning Data Project
Author: Michael Galarnyk <br />
Blog Post: [Getting and Cleaning Data Review](https://medium.com/@GalarnykMichael/review-course-1-the-data-scientists-toolbox-jhu-coursera-4d7459458821#.5jpg133ln "Click to go to Repo") <br />
Data Zip File Location: [UC Irvine Repo](https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip "Clicking will download the data")

## Goal of the Project
1. A tidy data set
2. A link to a GitHub repository with the script for performing the analysis
3. A code book, CodeBook.md, that describes the variables, the data, and any transformations or work performed to clean up the data. The repo should also include a README.md that explains how all of the scripts work and how they are connected.
4. The analysis R script, run_analysis.R

## Review Criteria

Goal | Item | Link to Item
--- | --- | ---
Tidy Data Set | Clean Data Set | [Data Set Link](https://github.com/mGalarnyk/datasciencecoursera/blob/master/3_Getting_and_Cleaning_Data/tidyData.txt "Click to go to Repo")
Github Repo | Repo | [Repo Link](https://github.com/mGalarnyk/datasciencecoursera/tree/master/3_Getting_and_Cleaning_Data "Click to go to Repo")
Code Book | CodeBook.md | [Repo Link](https://github.com/mGalarnyk/datasciencecoursera/blob/master/3_Getting_and_Cleaning_Data/CodeBook.md "CodeBook.md")
README | ReadingItNow | [Repo Link](https://github.com/mGalarnyk/datasciencecoursera/blob/master/3_Getting_and_Cleaning_Data/README.md "README.md")
Analysis R Script | run_analysis.R | [R Script Link](https://github.com/mGalarnyk/datasciencecoursera/blob/master/3_Getting_and_Cleaning_Data/run_analysis.R "run_analysis.R")

## Contributors

FirstName | LastName | Email
--- | --- | ---
Michael | Galarnyk | <mgalarny@gmail.com>
Submit | Pull Request | <youremailhere@gmail.com>

## License

Anyone may contribute after this assignment is turned in and graded.

## Blog Posts on the Specialization | Johns Hopkins Coursera

[Getting and Cleaning Data (JHU Coursera)](https://medium.com/@GalarnykMichael/getting-and-cleaning-data-jhu-coursera-course-3-c3635747858b#.y93kqfa0u "Review + data.table")

[R Programming (JHU Coursera)](https://medium.com/@GalarnykMichael/in-progress-review-course-2-r-programming-jhu-coursera-ad27086d8438#.bzzr29fvo "Review + data.table")

96 changes: 96 additions & 0 deletions 4_Exploratory_Data_Analysis/quiz_week1.R
@@ -0,0 +1,96 @@
# Getting and Cleaning Data, JHU Coursera

#1.
#The American Community Survey distributes downloadable data about United States communities. Download the 2006 microdata survey about housing for the state of Idaho using download.file() from here:

# https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv

# and load the data into R. The code book, describing the variable names, is here:

# https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FPUMSDataDict06.pdf

# Apply strsplit() to split all the names of the data frame on the characters "wgtp". What is the value of the 123rd element of the resulting list?

communities <- data.table::fread("http://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv")
varNamesSplit <- strsplit(names(communities), "wgtp")
varNamesSplit[[123]]
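
# Quick illustration on a single hypothetical name of the form "wgtpNN"
# (not necessarily one of the actual column names): strsplit() drops the
# matched delimiter and leaves an empty string in its place.
strsplit("wgtp15", "wgtp")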

#2.
#Load the Gross Domestic Product data for the 190 ranked countries in this data set:

#https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv

# Remove the commas from the GDP numbers in millions of dollars and average them. What is the average?

#Original data sources:

# http://data.worldbank.org/data-catalog/GDP-ranking-table


# Removed the 's' from 'https' for compatibility with Windows computers.
# Skip the first 5 rows and only read in the relevant columns.
GDPrank <- data.table::fread('http://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv'
, skip=5
, nrows=190
, select = c(1, 2, 4, 5)
, col.names=c("CountryCode", "Rank", "Country", "GDP")
)

# Remove the commas using gsub()
# Convert the cleaned strings to integers
# Take the mean of the GDP column (all three steps happen inside data.table's j expression)
GDPrank[, mean(as.integer(gsub(pattern = ',', replacement = '', x = GDP )))]
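
# A base-R sketch of the same computation (no data.table), assuming the same
# column layout as the fread() call above (i.e. column 5 holds GDP in millions of dollars).
GDPbase <- read.csv('http://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv'
                    , skip = 5, nrows = 190, header = FALSE, stringsAsFactors = FALSE)
mean(as.numeric(gsub(",", "", GDPbase[, 5])))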



#3. In the data set from Question 2,
# what is a regular expression that would allow you to count the number of countries whose name begins with "United"?
# Assume that the variable with the country names in it is named countryNames. How many countries begin with United?

grep("^United",GDPrank[, Country])

# 4.Load the Gross Domestic Product data for the 190 ranked countries in this data set:
# https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv
# Load the educational data from this data set:
# https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FEDSTATS_Country.csv
# Match the data based on the country shortcode.
# Of the countries for which the end of the fiscal year is available, how many end in June?

GDPrank <- data.table::fread('http://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv'
, skip=5
, nrows=190
, select = c(1, 2, 4, 5)
, col.names=c("CountryCode", "Rank", "Country", "GDP")
)

eduDT <- data.table::fread('http://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FEDSTATS_Country.csv')

mergedDT <- merge(GDPrank, eduDT, by = 'CountryCode')

mergedDT[grepl("Fiscal year end: June 30;", `Special Notes`), .N]
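
# The same count without data.table's [i, j] syntax (a sketch reusing the merged object above)
sum(grepl("Fiscal year end: June 30;", mergedDT$`Special Notes`))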


# 5. You can use the quantmod (http://www.quantmod.com/) package
# to get historical stock prices for publicly traded companies on the NASDAQ and NYSE.
# Use the following code to download data on Amazon's stock price and get the times the data was sampled.

# library(quantmod)
# amzn = getSymbols("AMZN",auto.assign=FALSE)
# sampleTimes = index(amzn)


# install.packages("quantmod")
library("quantmod")
amzn <- getSymbols("AMZN",auto.assign=FALSE)
sampleTimes <- index(amzn)
timeDT <- data.table::data.table(timeCol = sampleTimes)

# How many values were collected in 2012?
timeDT[(timeCol >= "2012-01-01") & (timeCol < "2013-01-01"), .N ]

# How many values were collected on Mondays in 2012?
timeDT[((timeCol >= "2012-01-01") & (timeCol < "2013-01-01")) & (weekdays(timeCol) == "Monday"), .N ]
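
# Alternative sketch working on the Date index directly with base R; note that
# weekdays() is locale-dependent, so "Monday" assumes an English locale.
dates2012 <- sampleTimes[format(sampleTimes, "%Y") == "2012"]
length(dates2012)                     # values collected in 2012
sum(weekdays(dates2012) == "Monday")  # of those, how many fell on a Monday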



