Modeling and Inference in R

Deirdre Fitzpatrick & Dana Udwin

July 26, 2016

Introduction

Modeling in R consists of:

  • formula object
    • Sepal.Length ~ Sepal.Width
    • Use Sepal.Width to predict Sepal.Length
  • model object
    • lm()
    • Put your formula inside of lm()
  • inspecting your model
    • anova(), plot() or summary()
    • Put your model inside of plot() (see the sketch below)
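
Putting the three pieces together with the built-in iris data (a minimal sketch; any data frame with these columns works the same way):

# formula object: use Sepal.Width to predict Sepal.Length
fit <- lm(Sepal.Length ~ Sepal.Width, data = iris)

# inspect the model
summary(fit)  # coefficient table, R-squared, ...
plot(fit)     # diagnostic plots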

Simple Linear Regression (SLR)

  • Linear relationship between variables
    • i.e. as Sepal.Width increases, we think Sepal.Length does, too (in a nice straight line)
  • Response variable is continuous
    • Sepal.Length is numeric (not categories like “short” or “long”)
  • “Simple” = only one explanatory variable

Preview the Data

# ?swiss
head(swiss)
##              Fertility Agriculture Examination Education Catholic
## Courtelary        80.2        17.0          15        12     9.96
## Delemont          83.1        45.1           6         9    84.84
## Franches-Mnt      92.5        39.7           5         5    93.40
## Moutier           85.8        36.5          12         7    33.77
## Neuveville        76.9        43.5          17        15     5.16
## Porrentruy        76.1        35.3           9         7    90.57
##              Infant.Mortality
## Courtelary               22.2
## Delemont                 22.2
## Franches-Mnt             20.2
## Moutier                  20.3
## Neuveville               20.6
## Porrentruy               26.6

Eyeball the Data

plot(swiss)

Fit the SLR Model

mdl <- lm(Agriculture ~ Examination, data=swiss)
  • response or dependent variable = Agriculture
    • the thing we want to predict
  • explanatory or independent variable = Examination

Summarize Model

summary(mdl)
## 
## Call:
## lm(formula = Agriculture ~ Examination, data = swiss)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.570 -10.781   2.575  11.871  29.411 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  82.8869     5.6407  14.694  < 2e-16 ***
## Examination  -1.9544     0.3086  -6.334 9.95e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.7 on 45 degrees of freedom
## Multiple R-squared:  0.4713, Adjusted R-squared:  0.4596 
## F-statistic: 40.12 on 1 and 45 DF,  p-value: 9.952e-08

Fitted model: \(\widehat{\text{Agriculture}}\) = 82.8868674 - 1.9544294 Examination

(You get lots of asterisks next to significant variables.)
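
Want those numbers without reading them off the printout? The coefficient table is stored inside the summary (a small sketch):

coefs <- coef(summary(mdl))  # matrix with Estimate, Std. Error, t value, Pr(>|t|)
coefs[, "Estimate"]          # just the fitted coefficients
coefs[, "Pr(>|t|)"]          # just the p-values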

Analysis of Variance

anova(mdl)
## Analysis of Variance Table
## 
## Response: Agriculture
##             Df Sum Sq Mean Sq F value    Pr(>F)    
## Examination  1  11183 11183.4  40.121 9.952e-08 ***
## Residuals   45  12543   278.7                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Check Out Residuals

par(mfrow=c(2,2))
plot(mdl)
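
plot() on an lm object draws four diagnostic panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage. You can also ask for a single panel with the which argument:

plot(mdl, which = 1)  # residuals vs fitted only
plot(mdl, which = 2)  # normal Q-Q only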

Or, Try It Without an Intercept!

mdl <- lm(Agriculture ~ Examination - 1, data=swiss)  # drop the intercept with "- 1"
mdl <- lm(Agriculture ~ Examination + 0, data=swiss)  # ...or, equivalently, with "+ 0"

summary(mdl)
## 
## Call:
## lm(formula = Agriculture ~ Examination + 0, data = swiss)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -77.806   0.812  20.129  35.688  79.494 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## Examination   2.1353     0.3173    6.73 2.32e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.76 on 46 degrees of freedom
## Multiple R-squared:  0.4961, Adjusted R-squared:  0.4852 
## F-statistic: 45.29 on 1 and 46 DF,  p-value: 2.317e-08

Rewind

What even is an lm()?

About the Model Object

names(mdl)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"
mdl['coefficients']
## $coefficients
## Examination 
##    2.135296

And Plays Well with Others

# sample 5 numbers from the uniform distribution (0, 100)
fake_data <- data.frame(Examination = runif(5, 0, 100))

predict(mdl, newdata=fake_data, interval="confidence")
##         fit       lwr       upr
## 1 174.16358 122.07343 226.25373
## 2 180.32573 126.39256 234.25889
## 3 212.20944 148.74025 275.67862
## 4 164.87190 115.56078 214.18303
## 5  19.93138  13.97016  25.89261
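
interval="confidence" gives an interval for the mean response at each Examination value; for predicting a single new observation you usually want the wider prediction interval:

predict(mdl, newdata=fake_data, interval="prediction")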

More Nice Playing

r <- residuals(mdl)
head(r)
##   Courtelary     Delemont Franches-Mnt      Moutier   Neuveville 
##   -15.029445    32.288222    29.023518    10.876444     7.199962 
##   Porrentruy 
##    16.082333
coef(mdl)
## Examination 
##    2.135296
f <- fitted(mdl)
head(f)
##   Courtelary     Delemont Franches-Mnt      Moutier   Neuveville 
##     32.02945     12.81178     10.67648     25.62356     36.30004 
##   Porrentruy 
##     19.21767

Working Example

# install.packages('ggplot2')
require(ggplot2)

head(diamonds)
##   carat       cut color clarity depth table price    x    y    z
## 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

Lines of Inquiry

  • Use lm() to model price as a function of carat.

  • Find the coefficients. (Are they significant?)

  • Use your model with predict() to predict the price of diamonds with 0.5, 2 and 0.73 carats, respectively.

# hint, use this fresh dataframe
# with the newdata argument inside of predict()
fresh_diamonds <- data.frame(carat = c(0.5, 2, 0.73))

Type ?lm or ?predict in your console for more info!

One Way to Answer

mdl <- lm(price ~ carat, data=diamonds)

summary(mdl)
## 
## Call:
## lm(formula = price ~ carat, data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18585.3   -804.8    -18.9    537.4  12731.7 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2256.36      13.06  -172.8   <2e-16 ***
## carat        7756.43      14.07   551.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1549 on 53938 degrees of freedom
## Multiple R-squared:  0.8493, Adjusted R-squared:  0.8493 
## F-statistic: 3.041e+05 on 1 and 53938 DF,  p-value: < 2.2e-16
fresh_diamonds <- data.frame(carat = c(0.5, 2, 0.73))
predict(mdl, newdata=fresh_diamonds)
##         1         2         3 
##  1621.852 13256.491  3405.830

Multiple Linear Regression

  • Response variable is still continuous
  • Two or more independent explanatory variables
  • Let’s model Fertility using Catholic and Education as predictors

Check Independence Between Predictors

# library(ggplot2)
ggplot(swiss, aes(x = Catholic, y = Education)) + geom_point()

Cover Ass Better

cor(swiss$Catholic, swiss$Education)
## [1] -0.1538589
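
To scan every pair of variables at once (all columns of swiss are numeric), a quick sketch:

round(cor(swiss), 2)  # full pairwise correlation matrix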

Fit Model

mdl <- lm(Fertility ~ Catholic + Education, data=swiss)
summary(mdl)
## 
## Call:
## lm(formula = Fertility ~ Catholic + Education, data = swiss)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.042  -6.578  -1.431   6.122  14.322 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 74.23369    2.35197  31.562  < 2e-16 ***
## Catholic     0.11092    0.02981   3.721  0.00056 ***
## Education   -0.78833    0.12929  -6.097 2.43e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.331 on 44 degrees of freedom
## Multiple R-squared:  0.5745, Adjusted R-squared:  0.5552 
## F-statistic:  29.7 on 2 and 44 DF,  p-value: 6.849e-09

Partial F-Test

Let’s suppose a new model!

Fertility = \(\beta_0\) + \(\beta_1\) Education + \(\beta_2\) Catholic + \(\beta_3\) Infant.Mortality + \(\epsilon\)

Test hypothesis that \(\beta_2 = \beta_3 = 0\).

Partial F-Test

reduced_mdl <- lm(Fertility ~ Education, data=swiss)
full_mdl <- lm(Fertility ~ Education + Catholic + Infant.Mortality, data=swiss)

# you can pass in two models!
anova(reduced_mdl, full_mdl)
## Analysis of Variance Table
## 
## Model 1: Fertility ~ Education
## Model 2: Fertility ~ Education + Catholic + Infant.Mortality
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     45 4015.2                                  
## 2     43 2422.2  2      1593 14.139 1.909e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Teeny-tiny p-value (look at all those asterisks) tells us to reject the hypothesis that \(\beta_2\) and \(\beta_3\) are both zero, i.e. adding those variables was handy!

Diversion…Get More Succinct!

What’s in an anova() ?

anova_magic <- anova(reduced_mdl, full_mdl)

class(anova_magic)
## [1] "anova"      "data.frame"
names(anova_magic)
## [1] "Res.Df"    "RSS"       "Df"        "Sum of Sq" "F"         "Pr(>F)"
anova_magic[["Pr(>F)"]]
## [1]           NA 1.909423e-05
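
So the partial F-test p-value can be pulled out in one line (the NA belongs to the reduced model's row):

anova_magic[["Pr(>F)"]][2]
## [1] 1.909423e-05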

Multiple Linear Regression…with interaction!

Let’s revise the additive model we fit earlier (Fertility = \(\beta_0\) + \(\beta_1\) Catholic + \(\beta_2\) Education + \(\epsilon\)).

We’re curious:

As Education changes, does the effect of Catholic on Fertility change?

As Catholic changes, does the effect of Education on Fertility change?

Fit the Model with Interaction Term

Fertility = \(\beta_0\) + \(\beta_1\) Catholic + \(\beta_2\) Education + \(\beta_3\) Catholic*Education + \(\epsilon\)

mdl <- lm(Fertility ~ Catholic + Education + Catholic*Education, data=swiss) # yay
mdl <- lm(Fertility ~ Catholic + Education + Catholic:Education, data=swiss) # yay
mdl <- lm(Fertility ~ Catholic*Education, data=swiss) # yay
# mdl <- lm(Fertility ~ Catholic:Education, data=swiss) <-- NO, includes no additive terms

Summarize Model

summary(mdl)
## 
## Call:
## lm(formula = Fertility ~ Catholic * Education, data = swiss)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.953  -6.319  -1.368   6.380  14.297 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        70.937553   3.106471  22.835  < 2e-16 ***
## Catholic            0.184003   0.054539   3.374  0.00158 ** 
## Education          -0.427637   0.260176  -1.644  0.10754    
## Catholic:Education -0.009380   0.005904  -1.589  0.11942    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.191 on 43 degrees of freedom
## Multiple R-squared:  0.5981, Adjusted R-squared:  0.5701 
## F-statistic: 21.33 on 3 and 43 DF,  p-value: 1.286e-08

Looks Shoddy…Let’s Explore Further

Recall our plot from earlier:

ggplot(swiss, aes(x = Catholic, y = Education)) + geom_point()

Looks like Catholic is either very small or very big.
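
A quick histogram makes the same point:

hist(swiss$Catholic, breaks = 20, xlab = "% Catholic", main = "Catholic is bimodal")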

A Little Manipulation

Let’s recode Catholic as a factor and see what happens.

Catholic_median <- summary(swiss$Catholic)[["Median"]]

# install.packages('dplyr')
# install.packages('mosaic')
library(dplyr)
library(mosaic)

swiss_new <- swiss %>% mutate(
  Catholic_factor = derivedFactor(low = Catholic <= Catholic_median, 
                           high = Catholic > Catholic_median))

head(swiss_new[c('Catholic', 'Catholic_factor')])
##   Catholic Catholic_factor
## 1     9.96             low
## 2    84.84            high
## 3    93.40            high
## 4    33.77            high
## 5     5.16             low
## 6    90.57            high
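
If you'd rather not load mosaic just for this, a base-R sketch that should give the same low/high coding (swiss_new2 is just an illustrative name):

swiss_new2 <- swiss
swiss_new2$Catholic_factor <- factor(ifelse(swiss_new2$Catholic <= Catholic_median, "low", "high"),
                                     levels = c("low", "high"))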

Fit the New Model with Interaction

mdl <- lm(Fertility ~ Catholic_factor*Education, data=swiss_new)
summary(mdl)
## 
## Call:
## lm(formula = Fertility ~ Catholic_factor * Education, data = swiss_new)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2793  -6.9639  -0.8848   6.7463  14.9044 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    71.2333     3.5081  20.305  < 2e-16 ***
## Catholic_factorhigh            14.3740     4.2755   3.362  0.00163 ** 
## Education                      -0.4226     0.3077  -1.373  0.17679    
## Catholic_factorhigh:Education  -0.5914     0.3390  -1.745  0.08820 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.365 on 43 degrees of freedom
## Multiple R-squared:  0.5808, Adjusted R-squared:  0.5516 
## F-statistic: 19.86 on 3 and 43 DF,  p-value: 3.141e-08

Working Example

# require(ggplot2)

head(diamonds)
## Source: local data frame [6 x 10]
## 
##   carat       cut  color clarity depth table price     x     y     z
##   (dbl)    (fctr) (fctr)  (fctr) (dbl) (dbl) (int) (dbl) (dbl) (dbl)
## 1  0.23     Ideal      E     SI2  61.5    55   326  3.95  3.98  2.43
## 2  0.21   Premium      E     SI1  59.8    61   326  3.89  3.84  2.31
## 3  0.23      Good      E     VS1  56.9    65   327  4.05  4.07  2.31
## 4  0.29   Premium      I     VS2  62.4    58   334  4.20  4.23  2.63
## 5  0.31      Good      J     SI2  63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good      J    VVS2  62.8    57   336  3.94  3.96  2.48
  • Fit a model to predict price using carat and another variable of your choosing.

  • Now add an interaction term between carat and your other variable of choice!

How We Did

# ?diamonds

mdl1 <- lm(price ~ carat + x, data=diamonds)

mdl2 <- lm(price ~ carat*x, data=diamonds)

# anova(mdl1, mdl2)
# summary(mdl1)
# summary(mdl2)

Logistic Regression

  • Response variable is categorical
  • But we are still considering a linear relationship!

Logistic Regression

You can’t have a linear relationship between, say, age and gender.

But you can have a linear relationship between age and the log odds that someone is female.

Enter logit function:

\[ \begin{equation} \log \frac{p}{1-p} = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n \end{equation} \]

\(p\) = probability of “success”

\(\log \frac{p}{1-p}\) = “log odds”
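
R has this function and its inverse built in: qlogis() is the logit, and plogis() goes back from log odds to a probability. A small sketch:

p <- 0.8
qlogis(p)          # log odds: log(0.8 / 0.2), about 1.386
plogis(qlogis(p))  # back to the probability, 0.8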

The Data

# ?UCBAdmissions
df <- as.data.frame(UCBAdmissions)
head(df)
##      Admit Gender Dept Freq
## 1 Admitted   Male    A  512
## 2 Rejected   Male    A  313
## 3 Admitted Female    A   89
## 4 Rejected Female    A   19
## 5 Admitted   Male    B  353
## 6 Rejected   Male    B  207

\(p\) = probability of being admitted

\(X_1\) = Gender

\(X_2\) = Dept

A Little Finagling

# install.packages('reshape')
require(reshape)

# unravel
df <- untable(df[, c('Admit', 'Gender', 'Dept')], num=df[, 'Freq'])
rownames(df) <- seq(length=nrow(df))

# re-order levels of Admit
df <- mutate(df, Admit = relevel(Admit, 'Rejected'))

head(df) # 4526 rows now
##      Admit Gender Dept
## 1 Admitted   Male    A
## 2 Admitted   Male    A
## 3 Admitted   Male    A
## 4 Admitted   Male    A
## 5 Admitted   Male    A
## 6 Admitted   Male    A
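
(If you'd rather skip the reshape dependency, the same unravel can be done in base R by repeating each row Freq times; df2 is just an illustrative name:)

df2 <- as.data.frame(UCBAdmissions)
df2 <- df2[rep(seq_len(nrow(df2)), df2$Freq), c('Admit', 'Gender', 'Dept')]
rownames(df2) <- NULL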

Training

Split the data into a training set (to fit the model) and a testing set (to evaluate it).

df_idx <- 1:nrow(df)
train_size <- floor(.8 * nrow(df))

train_idx <- sample(df_idx, size=train_size)
test_idx <- df_idx[!(df_idx %in% train_idx)]

train_df <- df[train_idx, ]
test_df <- df[test_idx, ]
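
sample() makes the split random, so the exact numbers on the following slides will vary from run to run. For a reproducible split, set a seed before sampling (the seed value is arbitrary):

set.seed(2016)
train_idx <- sample(df_idx, size=train_size)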

Fit the Model

glm() function (generalized linear model)

mdl <- glm(Admit ~ Gender, family=binomial(link='logit'), data=train_df)

summary(mdl)
## 
## Call:
## glm(formula = Admit ~ Gender, family = binomial(link = "logit"), 
##     data = train_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0825  -1.0825  -0.8672   1.2753   1.5233  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.22730    0.04326  -5.254 1.49e-07 ***
## GenderFemale -0.55696    0.07117  -7.826 5.06e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4845.5  on 3619  degrees of freedom
## Residual deviance: 4783.0  on 3618  degrees of freedom
## AIC: 4787
## 
## Number of Fisher Scoring iterations: 4

Interpret Coefficients

Recall: The model fits \(\log \frac{p}{1-p} = \beta_0 + \beta_1 X_1\).

\(\beta_0\) = -0.2273016

\(\beta_1\) = -0.5569604

coef(mdl)
##  (Intercept) GenderFemale 
##   -0.2273016   -0.5569604

When applicant is male, \(X_1\) = 0 and the log odds of admission are equal to the intercept.

We expect the log odds of admission to change by -0.5569604 (that is, to decrease) when the applicant is female.
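
Exponentiating the coefficients turns log odds into odds ratios, which can be easier to talk about (a small sketch):

exp(coef(mdl))
# exp(-0.557) is about 0.57: in this gender-only model, the odds of admission
# for women are roughly 0.57 times the odds for men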

Multiple Logistic Regression

# pro tip: a period includes all the other variables as predictors
mdl <- glm(Admit ~ ., family=binomial(link='logit'), data=train_df)

summary(mdl)
## 
## Call:
## glm(formula = Admit ~ ., family = binomial(link = "logit"), data = train_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5042  -0.9555  -0.3912   0.9627   2.3490  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.58038    0.07748   7.491 6.83e-14 ***
## GenderFemale  0.16151    0.09006   1.793   0.0729 .  
## DeptB        -0.05172    0.12247  -0.422   0.6728    
## DeptC        -1.28925    0.11993 -10.750  < 2e-16 ***
## DeptD        -1.26274    0.11685 -10.807  < 2e-16 ***
## DeptE        -1.79065    0.14141 -12.663  < 2e-16 ***
## DeptF        -3.27378    0.18632 -17.571  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4845.5  on 3619  degrees of freedom
## Residual deviance: 4169.5  on 3613  degrees of freedom
## AIC: 4183.5
## 
## Number of Fisher Scoring iterations: 5

Simpson’s Paradox! Once Dept is in the model, the GenderFemale coefficient flips sign and is no longer clearly significant: the apparent bias against women in the aggregate data comes largely from women applying to the most competitive departments.

Receiver Operating Characteristic Curve

a.k.a. the illustrious ROC Curve

#install.packages('ROCR')
library(ROCR)

prob_admit <- predict(mdl, newdata=test_df, type="response")
truth_admit <- test_df$Admit
pred <- prediction(prob_admit, truth_admit, label.ordering=c('Rejected', 'Admitted'))
perf <- performance(pred, "tpr", "fpr")
plot(perf)

Area Under the Curve (AUC)

auc.tmp <- performance(pred, "auc")
auc <- as.numeric(auc.tmp@y.values)
auc
## [1] 0.7315764

Working Example

Run me:

diamonds_new <- diamonds %>% 
      mutate(bling = derivedFactor(no = price <= 5000, 
          yes = price > 5000))

Use glm() with the diamonds_new data you just defined to fit a logistic regression where:

  • we are predicting the probability of bling
  • using predictor depth

Our Approach

mdl <- glm(bling ~ depth, family=binomial(link='logit'), data=diamonds_new)
summary(mdl)
## 
## Call:
## glm(formula = bling ~ depth, family = binomial(link = "logit"), 
##     data = diamonds_new)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9085  -0.8001  -0.7950   1.5959   1.6777  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  0.037094   0.416643   0.089   0.9291  
## depth       -0.016482   0.006747  -2.443   0.0146 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 63219  on 53939  degrees of freedom
## Residual deviance: 63213  on 53938  degrees of freedom
## AIC: 63217
## 
## Number of Fisher Scoring iterations: 4

Clustering

This is the task of grouping items in a way that:

  1. Minimizes variance within groups

  2. Maximizes distance between groups

Here, we will use the kmeans() function to implement k-means clustering. This is an unsupervised clustering algorithm (i.e. the model doesn’t learn from ground-truth labels) that identifies \(k\) group “centers” and assigns each data point to the group with the nearest center.

Data

# I'm sorry
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Do a little scrubbing.

# keep the four numeric measurement columns
df <- iris[, c('Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width')]
df <- na.omit(df)  # drop rows with missing values
df <- scale(df)    # standardize columns so no variable dominates the distances

Determine Number of Clusters

Elbow Method: plot the within-group sum of squared errors against the number of clusters, and pick the point where adding another cluster stops producing a big drop (the “elbow”).

set.seed(1)
col_variance <- apply(df, 2, var)

# sum squared error for only 1 group
weighted_ss <- (nrow(df)-1)*sum(col_variance)

for (i in 2:15) {
  within_cluster_ss <- kmeans(df, centers=i)$withinss
  weighted_ss[i] <- sum(within_cluster_ss)
}

Voila, Elbow

plot(1:15, weighted_ss, pch=20, type="b", xlab="Number of Clusters", ylab="Within Group SSE")

Implement K-Means

fit <- kmeans(df, centers=3)

aggregate(df, by=list(fit$cluster), FUN=mean)
##   Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1       1  -1.01119138  0.85041372   -1.3006301  -1.2507035
## 2       2   1.13217737  0.08812645    0.9928284   1.0141287
## 3       3  -0.05005221 -0.88042696    0.3465767   0.2805873
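
One caveat: kmeans() starts from random centers, so a single run can land in a poor local optimum. Asking for several random starts and keeping the best is cheap insurance (nstart is a built-in argument):

fit <- kmeans(df, centers=3, nstart=25)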

How Did We Do?

We implemented unsupervised clustering, but…the iris data set has a Species column. Let’s consider that ground truth for our clusters and compare with the k-means results.

df <- data.frame(df, Fit=fit$cluster)
df$Fit <- as.factor(df$Fit)
df["Species"] <- iris$Species

tbl <- table(df$Fit, df$Species)
print(tbl)
##    
##     setosa versicolor virginica
##   1     50          0         0
##   2      0         11        36
##   3      0         39        14

Confusion Matrix

# install.packages('dplyr')
# install.packages('plyr')
# require(dplyr)
require(plyr)
df <- df %>% mutate(Fit = revalue(Fit, 
        c('1' = 'setosa', '2' = 'virginica', '3' = 'versicolor')))

#install.packages('caret')
#install.packages('e1071')
require(caret)
require(e1071)

confusionMatrix(df$Fit, df$Species)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         39        14
##   virginica       0         11        36
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8333          
##                  95% CI : (0.7639, 0.8891)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.75            
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.7800           0.7200
## Specificity                 1.0000            0.8600           0.8900
## Pos Pred Value              1.0000            0.7358           0.7660
## Neg Pred Value              1.0000            0.8866           0.8641
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.2600           0.2400
## Detection Prevalence        0.3333            0.3533           0.3133
## Balanced Accuracy           1.0000            0.8200           0.8050

Thanks!

deirdrefitzpatrick@massmutual.com

dudwin@massmutual.com

(chat with us about MassMutual Data Labs + our data science development program located in Amherst)