Modeling in R consists of:
Writing a formula like Sepal.Length ~ Sepal.Width (here, use Sepal.Width to predict Sepal.Length)
Fitting the model, e.g. with lm()
Inspecting the fitted lm() object with anova(), plot(), or summary()
What is a formula like that assuming? plot() your data and ask:
As Sepal.Width increases, we think Sepal.Length does, too (in a nice straight line)
Sepal.Length is numeric (not categories like “short” or “long”)
# ?swiss
head(swiss)
## Fertility Agriculture Examination Education Catholic
## Courtelary 80.2 17.0 15 12 9.96
## Delemont 83.1 45.1 6 9 84.84
## Franches-Mnt 92.5 39.7 5 5 93.40
## Moutier 85.8 36.5 12 7 33.77
## Neuveville 76.9 43.5 17 15 5.16
## Porrentruy 76.1 35.3 9 7 90.57
## Infant.Mortality
## Courtelary 22.2
## Delemont 22.2
## Franches-Mnt 20.2
## Moutier 20.3
## Neuveville 20.6
## Porrentruy 26.6
plot(swiss)
mdl <- lm(Agriculture ~ Examination, data=swiss)
summary(mdl)
##
## Call:
## lm(formula = Agriculture ~ Examination, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.570 -10.781 2.575 11.871 29.411
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 82.8869 5.6407 14.694 < 2e-16 ***
## Examination -1.9544 0.3086 -6.334 9.95e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.7 on 45 degrees of freedom
## Multiple R-squared: 0.4713, Adjusted R-squared: 0.4596
## F-statistic: 40.12 on 1 and 45 DF, p-value: 9.952e-08
Agriculture = 82.8868674 - 1.9544294 Examination + \(\epsilon\)
(You get lots of asterisks next to significant variables.)
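Those two estimates are all you need to reproduce the fitted line by hand. A quick sanity check (a sketch, using the mdl object fit above):
# rebuild the fitted line from the coefficients and compare against fitted()
b <- coef(mdl)
by_hand <- b["(Intercept)"] + b["Examination"] * swiss$Examination
all.equal(unname(by_hand), unname(fitted(mdl)))  # should be TRUE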
anova(mdl)
## Analysis of Variance Table
##
## Response: Agriculture
## Df Sum Sq Mean Sq F value Pr(>F)
## Examination 1 11183 11183.4 40.121 9.952e-08 ***
## Residuals 45 12543 278.7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow=c(2,2))
plot(mdl)
# two equivalent ways to drop the intercept from the model
mdl <- lm(Agriculture ~ Examination - 1, data=swiss)
mdl <- lm(Agriculture ~ Examination + 0, data=swiss)
summary(mdl)
##
## Call:
## lm(formula = Agriculture ~ Examination + 0, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -77.806 0.812 20.129 35.688 79.494
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Examination 2.1353 0.3173 6.73 2.32e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39.76 on 46 degrees of freedom
## Multiple R-squared: 0.4961, Adjusted R-squared: 0.4852
## F-statistic: 45.29 on 1 and 46 DF, p-value: 2.317e-08
What even is an lm()?
names(mdl)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
mdl['coefficients']
## $coefficients
## Examination
## 2.135296
# sample 5 numbers from the uniform distribution (0, 100)
fake_data <- data.frame(Examination = runif(5, 0, 100))
predict(mdl, newdata=fake_data, interval="confidence")
## fit lwr upr
## 1 174.16358 122.07343 226.25373
## 2 180.32573 126.39256 234.25889
## 3 212.20944 148.74025 275.67862
## 4 164.87190 115.56078 214.18303
## 5 19.93138 13.97016 25.89261
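interval="confidence" gives an interval for the mean response at each new Examination value; if you want an interval for a single new observation instead, predict() also accepts interval="prediction" (a quick sketch):
# prediction intervals are wider: they cover an individual new point, not the mean
predict(mdl, newdata=fake_data, interval="prediction")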
r <- residuals(mdl)
head(r)
## Courtelary Delemont Franches-Mnt Moutier Neuveville
## -15.029445 32.288222 29.023518 10.876444 7.199962
## Porrentruy
## 16.082333
coef(mdl)
## Examination
## 2.135296
f <- fitted(mdl)
head(f)
## Courtelary Delemont Franches-Mnt Moutier Neuveville
## 32.02945 12.81178 10.67648 25.62356 36.30004
## Porrentruy
## 19.21767
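Residuals and fitted values fit back together the way you would hope: fitted plus residual recovers the observed response (a quick check):
# observed = fitted + residual, row by row
all.equal(unname(f + r), swiss$Agriculture)  # should be TRUE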
# install.packages('ggplot2')
require(ggplot2)
head(diamonds)
## carat cut color clarity depth table price x y z
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Use lm() to model price as a function of carat.
Find the coefficients. (Are they significant?)
Use your model with predict() to predict the price of diamonds with 0.5, 2 and 0.73 carats, respectively.
# hint, use this fresh dataframe
# with the newdata argument inside of predict()
fresh_diamonds <- data.frame(carat = c(0.5, 2, 0.73))
Type ?lm or ?predict in your console for more info!
mdl <- lm(price ~ carat, data=diamonds)
summary(mdl)
##
## Call:
## lm(formula = price ~ carat, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18585.3 -804.8 -18.9 537.4 12731.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2256.36 13.06 -172.8 <2e-16 ***
## carat 7756.43 14.07 551.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1549 on 53938 degrees of freedom
## Multiple R-squared: 0.8493, Adjusted R-squared: 0.8493
## F-statistic: 3.041e+05 on 1 and 53938 DF, p-value: < 2.2e-16
fresh_diamonds <- data.frame(carat = c(0.5, 2, 0.73))
predict(mdl, newdata=fresh_diamonds)
## 1 2 3
## 1621.852 13256.491 3405.830
# library(ggplot2)
ggplot(swiss, aes(x = Catholic, y = Education)) + geom_point()
cor(swiss$Catholic, swiss$Education)
## [1] -0.1538589
mdl <- lm(Fertility ~ Catholic + Education, data=swiss)
summary(mdl)
##
## Call:
## lm(formula = Fertility ~ Catholic + Education, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.042 -6.578 -1.431 6.122 14.322
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 74.23369 2.35197 31.562 < 2e-16 ***
## Catholic 0.11092 0.02981 3.721 0.00056 ***
## Education -0.78833 0.12929 -6.097 2.43e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.331 on 44 degrees of freedom
## Multiple R-squared: 0.5745, Adjusted R-squared: 0.5552
## F-statistic: 29.7 on 2 and 44 DF, p-value: 6.849e-09
Let’s suppose a new model!
Fertility = \(\beta_0\) + \(\beta_1\) Education + \(\beta_2\) Catholic + \(\beta_3\) Infant.Mortality + \(\epsilon\)
Test hypothesis that \(\beta_2 = \beta_3 = 0\).
reduced_mdl <- lm(Fertility ~ Education, data=swiss)
full_mdl <- lm(Fertility ~ Education + Catholic + Infant.Mortality, data=swiss)
# you can pass in two models!
anova(reduced_mdl, full_mdl)
## Analysis of Variance Table
##
## Model 1: Fertility ~ Education
## Model 2: Fertility ~ Education + Catholic + Infant.Mortality
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 45 4015.2
## 2 43 2422.2 2 1593 14.139 1.909e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Teeny-tiny p-value (look at all those asterisks) tells us \(\beta_2\) and \(\beta_3\) are not both zero, i.e. adding those variables was handy!
What’s in an anova()?
anova_magic <- anova(reduced_mdl, full_mdl)
class(anova_magic)
## [1] "anova" "data.frame"
names(anova_magic)
## [1] "Res.Df" "RSS" "Df" "Sum of Sq" "F" "Pr(>F)"
anova_magic[["Pr(>F)"]]
## [1] NA 1.909423e-05
Let’s revise the additive model we fit earlier (Fertility = \(\beta_0\) + \(\beta_1\) Catholic + \(\beta_2\) Education + \(\epsilon\)).
We’re curious:
As Education changes, does the effect of Catholic on Fertility change?
As Catholic changes, does the effect of Education on Fertility change?
Fertility = \(\beta_0\) + \(\beta_1\) Catholic + \(\beta_2\) Education + \(\beta_3\) Catholic*Education + \(\epsilon\)
mdl <- lm(Fertility ~ Catholic + Education + Catholic*Education, data=swiss) # yay
mdl <- lm(Fertility ~ Catholic + Education + Catholic:Education, data=swiss) # yay
mdl <- lm(Fertility ~ Catholic*Education, data=swiss) # yay
# mdl <- lm(Fertility ~ Catholic:Education, data=swiss) <-- NO, includes no additive terms
summary(mdl)
##
## Call:
## lm(formula = Fertility ~ Catholic * Education, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.953 -6.319 -1.368 6.380 14.297
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 70.937553 3.106471 22.835 < 2e-16 ***
## Catholic 0.184003 0.054539 3.374 0.00158 **
## Education -0.427637 0.260176 -1.644 0.10754
## Catholic:Education -0.009380 0.005904 -1.589 0.11942
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.191 on 43 degrees of freedom
## Multiple R-squared: 0.5981, Adjusted R-squared: 0.5701
## F-statistic: 21.33 on 3 and 43 DF, p-value: 1.286e-08
Recall our plot from earlier:
ggplot(swiss, aes(x = Catholic, y = Education)) + geom_point()
Looks like Catholic is either very small or very big.
Let’s recode Catholic as a factor and see what happens.
Catholic_median <- summary(swiss$Catholic)[["Median"]]
# install.packages('dplyr')
# install.packages('mosaic')
library(dplyr)
library(mosaic)
swiss_new <- swiss %>% mutate(
Catholic_factor = derivedFactor(low = Catholic <= Catholic_median,
high = Catholic > Catholic_median))
head(swiss_new[c('Catholic', 'Catholic_factor')])
## Catholic Catholic_factor
## 1 9.96 low
## 2 84.84 high
## 3 93.40 high
## 4 33.77 high
## 5 5.16 low
## 6 90.57 high
mdl <- lm(Fertility ~ Catholic_factor*Education, data=swiss_new)
summary(mdl)
##
## Call:
## lm(formula = Fertility ~ Catholic_factor * Education, data = swiss_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2793 -6.9639 -0.8848 6.7463 14.9044
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.2333 3.5081 20.305 < 2e-16 ***
## Catholic_factorhigh 14.3740 4.2755 3.362 0.00163 **
## Education -0.4226 0.3077 -1.373 0.17679
## Catholic_factorhigh:Education -0.5914 0.3390 -1.745 0.08820 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.365 on 43 degrees of freedom
## Multiple R-squared: 0.5808, Adjusted R-squared: 0.5516
## F-statistic: 19.86 on 3 and 43 DF, p-value: 3.141e-08
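One way to eyeball this interaction is to draw a separate regression line for each level of Catholic_factor (a sketch, assuming ggplot2 is still loaded):
# separate fitted lines for low- vs high-Catholic provinces
ggplot(swiss_new, aes(x = Education, y = Fertility, colour = Catholic_factor)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)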
# require(ggplot2)
head(diamonds)
## Source: local data frame [6 x 10]
##
## carat cut color clarity depth table price x y z
## (dbl) (fctr) (fctr) (fctr) (dbl) (dbl) (int) (dbl) (dbl) (dbl)
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Fit a model to predict price using carat and another variable of your choosing.
Now add an interaction term between carat and your other variable of choice!
# ?diamonds
mdl1 <- lm(price ~ carat + x, data=diamonds)
mdl2 <- lm(price ~ carat*x, data=diamonds)
# anova(mdl1, mdl2)
# summary(mdl1)
# summary(mdl2)
You can’t have a linear relationship between, say, age and gender.
But you can have a linear relationship between age and the log odds that someone is female.
Enter logit function:
\[ \begin{equation} \log \frac{p}{1-p} = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n \end{equation} \]
\(p\) = probability of “success”
\(\log\)\(\frac{p}{1-p}\) = “log odds”
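To go back from log odds to a probability, invert the logit (a minimal sketch; inv_logit is just a throwaway helper):
# inverse logit: map a log odds value back onto (0, 1)
inv_logit <- function(log_odds) 1 / (1 + exp(-log_odds))
inv_logit(0)  # log odds of 0 correspond to p = 0.5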
# ?UCBAdmissions
df <- as.data.frame(UCBAdmissions)
head(df)
## Admit Gender Dept Freq
## 1 Admitted Male A 512
## 2 Rejected Male A 313
## 3 Admitted Female A 89
## 4 Rejected Female A 19
## 5 Admitted Male B 353
## 6 Rejected Male B 207
\(p\) = probability of being admitted
\(X_1\) = Gender
\(X_2\) = Dept
# install.packages('reshape')
require(reshape)
# unravel
df <- untable(df[, c('Admit', 'Gender', 'Dept')], num=df[, 'Freq'])
rownames(df) <- seq(length=nrow(df))
# re-order levels of Admit
df <- mutate(df, Admit = relevel(Admit, 'Rejected'))
head(df) # 4526 rows now
## Admit Gender Dept
## 1 Admitted Male A
## 2 Admitted Male A
## 3 Admitted Male A
## 4 Admitted Male A
## 5 Admitted Male A
## 6 Admitted Male A
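A quick way to double-check that untable() did what we wanted is to re-tabulate the expanded data frame; the counts should match the original UCBAdmissions table (a sketch):
# should reproduce the original counts (512 admitted men in Dept A, etc.)
xtabs(~ Admit + Gender + Dept, data = df)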
Split the data into a training set (to fit the model) and a testing set (to test it).
df_idx <- 1:nrow(df)
train_size <- floor(.8 * nrow(df))
train_idx <- sample(df_idx, size=train_size)
test_idx <- df_idx[!(df_idx %in% train_idx)]
train_df <- df[train_idx, ]
test_df <- df[test_idx, ]
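One caveat: sample() makes this split random, so your train_df and test_df (and the numbers below) will differ from run to run. Calling set.seed() first makes the split repeatable (a sketch; the seed value is arbitrary):
# repeatable split: seed the RNG before sampling
set.seed(2016)
train_idx <- sample(df_idx, size=train_size)
test_idx <- df_idx[!(df_idx %in% train_idx)]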
Fit the model with the glm() function (generalized linear model).
mdl <- glm(Admit ~ Gender, family=binomial(link='logit'), data=train_df)
summary(mdl)
##
## Call:
## glm(formula = Admit ~ Gender, family = binomial(link = "logit"),
## data = train_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0825 -1.0825 -0.8672 1.2753 1.5233
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.22730 0.04326 -5.254 1.49e-07 ***
## GenderFemale -0.55696 0.07117 -7.826 5.06e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4845.5 on 3619 degrees of freedom
## Residual deviance: 4783.0 on 3618 degrees of freedom
## AIC: 4787
##
## Number of Fisher Scoring iterations: 4
Recall: The model fits \(\log \frac{p}{1-p} = \beta_0 + \beta_1 X_1\).
\(\beta_0\) = -0.2273016
\(\beta_1\) = -0.5569604
coef(mdl)
## (Intercept) GenderFemale
## -0.2273016 -0.5569604
When the applicant is male, \(X_1\) = 0 and the log odds of admission are equal to the intercept.
When the applicant is female, we expect the log odds of admission to change by -0.5569604 (i.e. to decrease).
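Log odds are hard to think in. Exponentiating the coefficients turns them into odds ratios (a quick sketch using the mdl fit above):
# exp() of a logistic regression coefficient is an odds ratio
exp(coef(mdl))
# exp(-0.557) is roughly 0.57: the odds of admission for women are a bit over
# half the odds for men (ignoring department, for now)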
# pro tip: a period includes all variables as predictors
mdl <- glm(Admit ~ ., family=binomial(link='logit'), data=train_df)
summary(mdl)
##
## Call:
## glm(formula = Admit ~ ., family = binomial(link = "logit"), data = train_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5042 -0.9555 -0.3912 0.9627 2.3490
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.58038 0.07748 7.491 6.83e-14 ***
## GenderFemale 0.16151 0.09006 1.793 0.0729 .
## DeptB -0.05172 0.12247 -0.422 0.6728
## DeptC -1.28925 0.11993 -10.750 < 2e-16 ***
## DeptD -1.26274 0.11685 -10.807 < 2e-16 ***
## DeptE -1.79065 0.14141 -12.663 < 2e-16 ***
## DeptF -3.27378 0.18632 -17.571 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4845.5 on 3619 degrees of freedom
## Residual deviance: 4169.5 on 3613 degrees of freedom
## AIC: 4183.5
##
## Number of Fisher Scoring iterations: 5
Simpson’s Paradox! Once Dept is in the model, the Gender coefficient flips sign: women applied disproportionately to the departments that admit a smaller share of applicants.
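You can see the paradox in the raw admission rates by department (a sketch, assuming dplyr is still attached):
# admission rate for each Dept x Gender combination
df %>%
  group_by(Dept, Gender) %>%
  summarise(admit_rate = mean(Admit == "Admitted"))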
Time to evaluate the model on the test set, a.k.a. the illustrious ROC curve.
#install.packages('ROCR')
library(ROCR)
prob_admit <- predict(mdl, newdata=test_df, type="response")
truth_admit <- test_df$Admit
pred <- prediction(prob_admit, truth_admit, label.ordering=c('Rejected', 'Admitted'))
perf <- performance(pred, "tpr", "fpr")
plot(perf)
auc.tmp <- performance(pred, "auc")
auc <- as.numeric(auc.tmp@y.values)
auc
## [1] 0.7315764
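The ROC curve sweeps over every possible cutoff; if you just want accuracy at one particular threshold, say 0.5, something like this works (a rough sketch):
# classify with a 0.5 cutoff and compare against the test-set labels
pred_class <- ifelse(prob_admit > 0.5, "Admitted", "Rejected")
mean(pred_class == test_df$Admit)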
Run me:
diamonds_new <- diamonds %>%
mutate(bling = derivedFactor(no = price <= 5000,
yes = price > 5000))
Use glm() with the diamonds_new data you just defined to fit a logistic regression where bling is the outcome and depth is the predictor.
mdl <- glm(bling ~ depth, family=binomial(link='logit'), data=diamonds_new)
summary(mdl)
##
## Call:
## glm(formula = bling ~ depth, family = binomial(link = "logit"),
## data = diamonds_new)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9085 -0.8001 -0.7950 1.5959 1.6777
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.037094 0.416643 0.089 0.9291
## depth -0.016482 0.006747 -2.443 0.0146 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 63219 on 53939 degrees of freedom
## Residual deviance: 63213 on 53938 degrees of freedom
## AIC: 63217
##
## Number of Fisher Scoring iterations: 4
This is a clustering task that seeks to group items in a way that:
Minimizes variance within groups
Maximizes distance between groups
Here, we will use the kmeans() function to implement k-means clustering. This is an unsupervised clustering algorithm (a.k.a. the model doesn’t learn from ground truth labels) that identifies \(k\) group “centers” and assigns each data point to the group with the nearest centroid.
# I'm sorry
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Do a little scrubbing.
df <- iris[, c('Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width')]
df <- na.omit(df)
df <- scale(df)
Elbow Method: plot the within-group sum of squared errors against the number of clusters and pick the point where the drop levels off (the “elbow”).
set.seed(1)
col_variance <- apply(df, 2, var)
# sum squared error for only 1 group
weighted_ss <- (nrow(df)-1)*sum(col_variance)
for (i in 2:15) {
within_cluster_ss <- kmeans(df, centers=i)$withinss
weighted_ss[i] <- sum(within_cluster_ss)
}
plot(1:15, weighted_ss, pch=20, type="b", xlab="Number of Clusters", ylab="Within Group SSE")
fit <- kmeans(df, centers=3)
aggregate(df, by=list(fit$cluster), FUN=mean)
## Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 1 -1.01119138 0.85041372 -1.3006301 -1.2507035
## 2 2 1.13217737 0.08812645 0.9928284 1.0141287
## 3 3 -0.05005221 -0.88042696 0.3465767 0.2805873
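The same centers are stored directly on the fitted object, so you can skip the aggregate() step if you like (values are still in scaled units):
# cluster centers straight off the kmeans fit
fit$centers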
We implemented unsupervised clustering, but…the iris data set has a species column. Let’s consider that ground truth for our clusters and compare with the k-means results.
df <- data.frame(df, Fit=fit$cluster)
df$Fit <- as.factor(df$Fit)
df["Species"] <- iris$Species
tbl <- table(df$Fit, df$Species)
print(tbl)
##
## setosa versicolor virginica
## 1 50 0 0
## 2 0 11 36
## 3 0 39 14
# install.packages('dplyr')
# install.packages('plyr')
# require(dplyr)
require(plyr)
df <- df %>% mutate(Fit = revalue(Fit,
c('1' = 'setosa', '2' = 'virginica', '3' = 'versicolor')))
#install.packages('caret')
#install.packages('e1071')
require(caret)
require(e1071)
confusionMatrix(df$Fit, df$Species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 50 0 0
## versicolor 0 39 14
## virginica 0 11 36
##
## Overall Statistics
##
## Accuracy : 0.8333
## 95% CI : (0.7639, 0.8891)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.75
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.7800 0.7200
## Specificity 1.0000 0.8600 0.8900
## Pos Pred Value 1.0000 0.7358 0.7660
## Neg Pred Value 1.0000 0.8866 0.8641
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.2600 0.2400
## Detection Prevalence 0.3333 0.3533 0.3133
## Balanced Accuracy 1.0000 0.8200 0.8050
deirdrefitzpatrick@massmutual.com
(chat with us about MassMutual Data Labs + our data science development program located in Amherst)