|
| 1 | +--- |
| 2 | +title: "Data Scientist’s Toolbox Course Notes" |
| 3 | +author: "Xing Su" |
| 4 | +output: pdf_document |
| 5 | +--- |
| 6 | + |
| 7 | +## CLI (Command Line Interface) |
| 8 | + |
| 9 | +* `/` = root directory |
| 10 | +* `~` = home directory |
| 11 | +* `pwd` = print working directory (current directory) |
| 12 | +* `clear` = clear screen |
| 13 | +* `ls` = list stuff |
| 14 | + * `-a` = see all (hidden) |
| 15 | + * `-l` = details |
| 16 | +* `cd` = change directory |
| 17 | +* `mkdir` = make directory |
| 18 | +* `touch` = creates an empty file |
| 19 | +* `cp` = copy |
| 20 | + * `cp <file> <directory>` = copy a file to a directory |
| 21 | + * `cp -r <directory> <newDirectory>` = copy a directory and all of its contents to a new directory |
| 22 | + * `-r` = recursive |
| 23 | +* `rm` = remove |
| 24 | + * `-r` = remove entire directories (no undo) |
| 25 | +* `mv` = move |
| 26 | + * `mv <file> <directory>` = move file to directory |
| 27 | + * `mv <fileName> <newName>` = rename file |
| 28 | +* `echo` = print arguments you give/variables |
| 29 | +* `date` = print current date |
| 30 | + |
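The commands above can be tried together in a short session. This is a minimal sketch that works in a temporary directory; the names `notes`, `notes_backup`, `draft.txt`, and `final.txt` are made up for illustration.

```shell
# Work inside a throwaway directory so nothing in the real home directory is touched.
workdir="$(mktemp -d)"
cd "$workdir"
pwd                        # print the current working directory

mkdir notes                # make a directory
touch draft.txt            # create an empty file
cp draft.txt notes         # copy the file into the directory
cp -r notes notes_backup   # copy the directory and everything in it
mv draft.txt final.txt     # rename the file
mv final.txt notes         # move the file into the directory
ls -al notes               # list everything, including hidden entries, with details
rm -r notes_backup         # remove a directory and its contents (no undo)
echo "finished on $(date)" # print a message that includes the current date
```

Running the session in a directory created by `mktemp -d` keeps the destructive `rm -r` step away from real files.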
| 31 | + |
| 32 | + |
| 33 | +## GitHub |
| 34 | + |
| 35 | +* **Workflow** |
| 36 | + 1. make edits in workspace |
| 37 | + 2. update index/add files |
| 38 | + 3. commit to local repo |
| 39 | + 4. push to remote repository |
| 40 | +* `git add .` = add all new files to be tracked |
| 41 | +* `git add -u` = updates tracking for files that are renamed or deleted |
| 42 | +* `git add -A` = both of the above |
| 43 | + * ***Note**: `add` is performed before committing* |
| 44 | +* `git commit -m "message"` = commit the changes you want to be saved to the local copy |
| 45 | +* `git checkout -b branchname` = create new branch |
| 46 | +* `git branch` = tells you what branch you are on |
| 47 | +* `git checkout master` = move back to the master branch |
| 48 | +* `git pull` = fetch changes from the remote repository and merge them into your current branch (not the same as a GitHub pull request, which asks the owner of a repo to merge your changes) |
| 49 | +* `git push` = send local commits to the remote (GitHub) |
| 50 | + |
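The four workflow steps can be walked through end to end. This is a minimal sketch, assuming git is installed: a local bare repository stands in for GitHub, and the file name, branch name `experiment`, and commit identity are all placeholders.

```shell
# Throwaway sandbox: a bare repository plays the role of the GitHub remote (illustrative assumption).
sandbox="$(mktemp -d)"
git init -q --bare "$sandbox/remote.git"
git init -q "$sandbox/work"
cd "$sandbox/work"
git remote add origin "$sandbox/remote.git"
main_branch="$(git symbolic-ref --short HEAD)"   # master or main, depending on the git configuration

# 1. make edits in the workspace
echo "first note" > notes.md
# 2. update the index / add files
git add -A
# 3. commit to the local repo (identity passed inline so the sketch runs without global config)
git -c user.name="Example" -c user.email="example@example.com" commit -q -m "Add notes"
# 4. push to the remote repository
git push -q -u origin "$main_branch"

# Create a branch, check which branch is active, then move back
git checkout -q -b experiment
git branch
git checkout -q "$main_branch"
```

Against a real GitHub repository, `origin` would point at the repository URL instead of a local path.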
| 51 | + |
| 52 | + |
| 53 | +## Markdown |
| 54 | + |
| 55 | +* `##` = signifies secondary heading (bold big font) |
| 56 | +* `###` = signifies tertiary heading (slightly smaller font than secondary, not bold) |
| 57 | +* `*` = bullet list item |
| 58 | + |
| 59 | + |
| 60 | + |
| 61 | +## R Packages |
| 62 | + |
| 63 | +* Primary location for R packages --> CRAN |
| 64 | +* `available.packages()` = all packages available |
| 65 | +* `head(rownames(a),3)` = returns the first three package names, where `a <- available.packages()` |
| 66 | +* `install.packages("nameOfPackage")` = install single package |
| 67 | +* `install.packages(c("nameOfPackage", "nameOfPackage", "nameOfPackage"))` = install multiple packages |
| 68 | +* Bioconductor Packages: |
| 69 | + * `source("https://bioconductor.org/biocLite.R")` |
| 70 | + * `biocLite()` = install bioconductor packages |
| 71 | +* `library(packagename)` = load package |
| 72 | +* `search()` = see the packages currently attached (use `ls("package:packagename")` to see all functions in a package after loading) |
| 73 | + |
| 74 | + |
| 75 | + |
| 76 | +## Types of Data Science Questions |
| 77 | + |
| 78 | +* in order of difficulty: ***Descriptive*** --> ***Exploratory*** --> ***Inferential*** --> ***Predictive*** --> ***Causal*** --> ***Mechanistic*** |
| 79 | +* **Descriptive analysis** = describe set of data, interpret what you see (census, Google Ngram) |
| 80 | +* **Exploratory analysis** = discovering connections (correlation does not = causation) |
| 81 | +* **Inferential analysis** = use a relatively small sample of data to draw conclusions about a larger population |
| 82 | +* **Predictive analysis** = use data on one object to predict values for another (if X predicts Y, does not = X cause Y) |
| 83 | +* **Causal analysis** = how does changing one variable affect another, using randomized studies, strong assumptions, gold standard for statistical analysis |
| 84 | +* **Mechanistic analysis** = understand the exact changes in variables that lead to changes in other variables, modeled by empirical equations (engineering/physics) |
| 85 | + |
| 86 | + |
| 87 | + |
| 88 | +## Data |
| 89 | +* **Data** = values of qualitative or quantitative variables, belonging to a set of items (usually population) |
| 90 | +* **Variables** = measurement/characteristic of an item (qualitative vs quantitative) |
| 91 | +* **Data** = not always structured, usually raw file, different formats |
| 92 | +* The most important thing is the question; the data comes second |
| 93 | +* **Big data** = now possible to collect data cheaply, but not necessarily all useful (need the right data) |
| 94 | + |
| 95 | +## Experimental Design |
| 96 | +* Formulate your question in advance |
| 97 | +* **Statistical inference** = select subset, run experiment, calculate descriptive statistics, use inferential statistics to determine if results can be applied broadly |
| 98 | +* ***[Inference]*** **Variability** = lower variability + clearer differences = easier decision |
| 99 | +* ***[Inference]*** **Confounding** = underlying variable might be causing the correlation (sometimes called spurious correlation) |
| 100 | + * dealing with confounding: fix variables, stratify (all options), randomize |
| 101 | +* ***[Prediction]*** collect observations for different variable values, build predictive functions |
| 102 | + * similar problems of probability/sampling and confounding variables |
| 103 | +* ***[Prediction]*** Difficult to tell which of several distributions an observation comes from (size of effects important) |
| 104 | +* ***[Prediction]*** Positive/negative statuses: True positive, false positive, false negative, true negative |
| 105 | + * **Sensitivity** = Pr(positive test | disease) |
| 106 | + * **Specificity** = Pr(negative test | no disease) |
| 107 | + * **Positive Predictive Value** = Pr(disease | positive test) |
| 108 | + * **Negative Predictive Value** = Pr(no disease | negative test) |
| 109 | + * **Accuracy** = Pr(correct outcome) |
| 110 | +* **Data dredging** = searching the data for patterns and then fitting a hypothesis to them |
| 111 | +* **Good experiments** = have replication, measure variability, generalize problem, transparent |
| 112 | +* Prediction is not inference, and beware of data dredging |
| 113 | + |
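The conditional probabilities listed under positive/negative statuses can be written in terms of the four counts. The symbols below are shorthand introduced here, not taken from the notes: $TP$ = true positives, $FP$ = false positives, $FN$ = false negatives, $TN$ = true negatives.

$$
\begin{aligned}
\text{Sensitivity} &= \frac{TP}{TP + FN} \\
\text{Specificity} &= \frac{TN}{TN + FP} \\
\text{Positive Predictive Value} &= \frac{TP}{TP + FP} \\
\text{Negative Predictive Value} &= \frac{TN}{TN + FN} \\
\text{Accuracy} &= \frac{TP + TN}{TP + FP + FN + TN}
\end{aligned}
$$

Sensitivity and specificity condition on the true disease status, while the predictive values condition on the test result, so the predictive values also depend on how common the disease is in the group tested.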