
Statistics with R

Installing required libraries

# library(devtools)
# install.packages("openintro")
# install.packages("tidyverse")
# install.packages("nycflights13")
library(nycflights13)
library(openintro) # for data sets in the text book
## Please visit openintro.org for free statistics materials
## 
## Attaching package: 'openintro'
## The following objects are masked from 'package:datasets':
## 
##     cars, trees
library(tidyverse)
## -- Attaching packages -------------------------------------------------------------- tidyverse 1.2.1 --
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.4
## ✔ tidyr   0.8.0     ✔ stringr 1.3.0
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## -- Conflicts ----------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

1. Introduction to Data

It is helpful to put statistics in the context of a general process of investigation:

  1. Identify a question or problem.
  2. Collect relevant data on the topic.
  3. Analyze the data.
  4. Form a conclusion. (Source: Openintro Statistics)

Loading stent30 & stent365 data and creating table

data("stent30")
data("stent365")
stent30 <- as_tibble(stent30)
stent30
## # A tibble: 451 x 2
##    group     outcome
##    <fct>     <fct>  
##  1 treatment stroke 
##  2 treatment stroke 
##  3 treatment stroke 
##  4 treatment stroke 
##  5 treatment stroke 
##  6 treatment stroke 
##  7 treatment stroke 
##  8 treatment stroke 
##  9 treatment stroke 
## 10 treatment stroke 
## # ... with 441 more rows
stent365 <- as_tibble(stent365)
stent365
## # A tibble: 451 x 2
##    group     outcome
##    <fct>     <fct>  
##  1 treatment stroke 
##  2 treatment stroke 
##  3 treatment stroke 
##  4 treatment stroke 
##  5 treatment stroke 
##  6 treatment stroke 
##  7 treatment stroke 
##  8 treatment stroke 
##  9 treatment stroke 
## 10 treatment stroke 
## # ... with 441 more rows

Creating a table is a very useful way of discovering interesting features of categorical variables. Let’s now view the stent30 and stent365 datasets as tables. Both datasets come from an experiment studying the effectiveness of stents in treating patients at risk of stroke; 451 data points were collected.

# table for stent30 and stent 365
table(stent30)
##            outcome
## group       no event stroke
##   control        214     13
##   treatment      191     33
table(stent365)
##            outcome
## group       no event stroke
##   control        199     28
##   treatment      179     45

We can compute summary statistics from those tables. A summary statistic is a single number that summarizes a large amount of data.

# In stent365, the proportion who had a stroke in the treatment group
45 / (45 + 179)
## [1] 0.2008929
# in control group
28 / (28 + 199)
## [1] 0.123348
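
Rather than typing the counts by hand, the same group proportions can be computed directly from the contingency table. A sketch using base R’s prop.table() (margin = 1 gives row-wise proportions):

```r
# Row proportions of the stent365 table: each row (group) sums to 1,
# so the "stroke" column reproduces the proportions computed above
prop.table(table(stent365), margin = 1)
##            outcome
## group        no event    stroke
##   control   0.8766520 0.1233480
##   treatment 0.7991071 0.2008929
```

This avoids transcription errors and updates automatically if the underlying table changes.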

Comparing these proportions is very useful for spotting differences between the groups. Although doctors might expect fewer strokes in the treatment group, whose patients received stents, the proportion of strokes there was higher than in the control group, contrary to what doctors expected. This is a classic challenge in statistics.

What could be the potential reasons behind this? Do the data show a real difference between the groups?

This second question is subtle. Suppose you flip a coin 100 times. While the chance a coin lands heads in any given coin flip is 50%, we probably won’t observe exactly 50 heads. This type of fluctuation is part of almost any type of data generating process. It is possible that the 8% difference in the stent study is due to this natural variation. However, the larger the difference we observe (for a particular sample size), the less believable it is that the difference is due to chance. So what we are really asking is the following: is the difference so large that we should reject the notion that it was due to chance? (Source: Openintro Statistics)
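
The chance-variation question can be explored with a simple simulation. As a sketch (not the published analysis), we can repeatedly shuffle the group labels in stent365 and see how often a difference in stroke proportions as large as the observed 8% arises by chance alone:

```r
# A sketch of a permutation test on stent365 (illustrative only):
# shuffle the treatment/control labels and record the simulated difference
set.seed(42)                       # arbitrary seed for reproducibility
obs_diff <- 45 / 224 - 28 / 227    # observed difference, about 0.077

sim_diffs <- replicate(10000, {
  shuffled <- sample(stent365$group)          # permute group labels
  tab <- table(shuffled, stent365$outcome)
  tab["treatment", "stroke"] / sum(tab["treatment", ]) -
    tab["control", "stroke"] / sum(tab["control", ])
})

# proportion of shuffles producing a difference at least as large as observed
mean(sim_diffs >= obs_diff)
```

If this proportion is very small, a difference of this size is hard to attribute to natural variation alone, which is the intuition behind the published conclusion.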

“While we don’t yet have our statistical tools to fully address this question on our own, we can comprehend the conclusions of the published analysis: there was compelling evidence of harm by stents in this study of stroke patients.”

“Be careful: do not generalize the results of this study to all patients and all stents. This study looked at patients with very specific characteristics who volunteered to be a part of this study and who may not be representative of all stroke patients. In addition, there are many types of stents and this study only considered the self-expanding Wingspan stent (Boston Scientific). However, this study does leave us with an important lesson: we should keep our eyes open for surprises.(Source: Openintro Statistics)”

Data basics

Data matrices are a convenient way to record and store data. In well-structured data, each row represents an observation and each column represents a variable. Variables come in two formats: numerical and categorical.

Numerical variables are further divided into two categories: discrete and continuous. A discrete variable can take on only certain values, whereas a continuous variable can theoretically take on infinitely many values, e.g., a stock price.

If two variables are not associated, they are said to be independent.
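
As a sketch, a chi-squared test gives one formal way to assess whether two categorical variables, such as group and outcome in stent365, are independent:

```r
# Chi-squared test of independence between group and outcome.
# A small p-value suggests the variables are associated (not independent).
chisq.test(table(stent365))
```

We will not interpret the test in detail here; it is simply base R’s standard tool for this question.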

Let’s now analyze the diamonds dataset from the ggplot2 package.

diamonds <- diamonds
diamonds
## # A tibble: 53,940 x 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1 0.230 Ideal     E     SI2      61.5   55.   326  3.95  3.98  2.43
##  2 0.210 Premium   E     SI1      59.8   61.   326  3.89  3.84  2.31
##  3 0.230 Good      E     VS1      56.9   65.   327  4.05  4.07  2.31
##  4 0.290 Premium   I     VS2      62.4   58.   334  4.20  4.23  2.63
##  5 0.310 Good      J     SI2      63.3   58.   335  4.34  4.35  2.75
##  6 0.240 Very Good J     VVS2     62.8   57.   336  3.94  3.96  2.48
##  7 0.240 Very Good I     VVS1     62.3   57.   336  3.95  3.98  2.47
##  8 0.260 Very Good H     SI1      61.9   55.   337  4.07  4.11  2.53
##  9 0.220 Fair      E     VS2      65.1   61.   337  3.87  3.78  2.49
## 10 0.230 Very Good H     VS1      59.4   61.   338  4.00  4.05  2.39
## # ... with 53,930 more rows
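
To see each variable’s type at a glance (numerical vs. categorical), dplyr’s glimpse() is handy:

```r
# Inspect variable types: carat, depth, table, x, y, z are continuous
# numerical; price is discrete numerical; cut, color, clarity are
# ordered categorical (shown as <ord> in the tibble printout above)
glimpse(diamonds)
```

The `<dbl>`, `<int>`, and `<ord>` column headers in the printed tibble carry the same information in compact form.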