Lab 5: Model specification

1 Types of models

Which of the following best describes the goal of a regression model?

1. To classify observations into distinct categories
2. To predict continuous outcomes
3. To find the median of the dataset
4. To determine the probability of each class

In classification tasks, the output variable (label) is typically:

1. Continuous
2. Discrete
3. Ordinal
4. A linear function

Which of the following is an example of a regression problem?

1. Predicting whether an email is spam or not
2. Predicting the price of a house based on its features
3. Identifying the species of a flower
4. Grouping customers into clusters based on purchasing behavior

What is the primary difference between regression and classification?

1. Regression predicts a continuous value, while classification predicts a category
2. Regression is a type of unsupervised learning, while classification is supervised
3. Classification uses linear relationships, while regression uses non-linear relationships
4. Classification focuses on finding patterns in data, while regression doesn’t

Which of the following tasks is a classification problem?

1. Estimating a person’s height based on their age
2. Predicting if a student will pass or fail a course
3. Predicting the temperature next week
4. Estimating the number of sales for the next quarter

True or false, supervised learning requires labeled data to train the model.

True False
True or false, in unsupervised learning, the model attempts to identify patterns or structures in data without any specific target variable.

True False

2 Model specification

Which of the following is the first step in model specification?

1. Fitting the model
2. Defining the response variable
3. Calculating residuals
4. Transforming variables

What does model specification involve?

1. Estimating the parameters
2. Defining the functional form of the model
3. Calculating prediction accuracy
4. Testing the model’s reliability

Which of the following is NOT part of model specification?

1. Choosing which variables to include
2. Defining the relationship between predictors and response
3. Assessing the goodness-of-fit
4. Determining if interaction terms are necessary

Which of the following describes a correctly specified model?

1. A model that includes irrelevant variables
2. A model that excludes important variables
3. A model that represents the true relationship between predictors and response
4. A model that overfits the training data

True or false, adding interaction terms between predictors is part of the model specification process.

True False
True or false, model specification is the final step in the model-building process.

True False

3 Functional form of linear models

Write the equation that expresses the response variable as a weighted sum of regressors (our favorite).

\(y=\sum_{i=1}^{n}w_ix_i\)
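
As a minimal sketch of what this sum means in R, the weights and inputs can be stored as two vectors and multiplied element by element. The values of `w` and `x` here are made up for illustration and do not come from the lab data.

```r
# Hypothetical weights and inputs, chosen only for illustration
w <- c(2, 0.5, -1)
x <- c(1, 4, 3)

# Weighted sum: y = w_1*x_1 + w_2*x_2 + w_3*x_3
y <- sum(w * x)
y
```

With these particular values the three products are 2, 2, and -3, so `y` is 1.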

In the linear regression equation \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon\) , what do the \(\beta\)’s represent?

1. The predicted values
2. The error terms
3. The weights for each regressor
4. The intercept

Question

Write the linear model equation in matrix notation.

Answer

\(y = \mathbf{X}\beta + \epsilon\)

or similar
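
The matrix form can be sketched in R as a matrix product. This is only an illustration: the predictor values and the weights in `beta` are hypothetical, and the error term is left out so that the result is the model's prediction rather than the observed response.

```r
# Hypothetical predictor values and weights, for illustration only
x1 <- c(1, 2, 3)
X <- cbind(intercept = 1, x1 = x1)   # design matrix: a column of 1s plus the predictor
beta <- c(0.5, 2)                    # one weight per column of X

# The matrix product gives one prediction per row of X (error term omitted)
y_hat <- as.vector(X %*% beta)
y_hat
```

Each row of `X` is one observation and each column is one regressor, so the product computes the same weighted sum as the summation form, once per row.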

In matrix notation, what is \(\mathbf{X}\)?

1. A vector of error terms
2. A matrix of predictors (explanatory variables)
3. A vector of residuals
4. The coefficients of the model

Suppose our SwimRecords data includes the year, sex, record time, swimsuit type, and swim cap type. Which of the following variables is most likely to be irrelevant for predicting swim times?

1. suit type
2. year
3. sex
4. swim cap type

What is the potential issue of including too many irrelevant variables in your model?

1. It will improve model accuracy.
2. It can lead to overfitting and increased model complexity.
3. It will simplify the interpretation of results.
4. It has no effect on the model.
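
A small simulation can illustrate the effect of irrelevant variables. This is a sketch on simulated data, not the SwimRecords data: the response depends only on `x`, and the ten `noise` columns are random numbers with no relationship to it. The seed and sample size are arbitrary choices.

```r
set.seed(1)  # arbitrary seed so the simulated example is repeatable

# Simulated (not real) data: y depends on x only; the noise columns are irrelevant
n <- 20
sim <- data.frame(x = rnorm(n))
sim$y <- 1 + 2 * sim$x + rnorm(n)
for (j in 1:10) sim[[paste0("noise", j)]] <- rnorm(n)

small <- lm(y ~ x, data = sim)   # only the relevant predictor
big   <- lm(y ~ ., data = sim)   # relevant predictor plus 10 irrelevant ones

# Training R-squared cannot decrease when predictors are added
r2_small <- summary(small)$r.squared
r2_big   <- summary(big)$r.squared
c(small = r2_small, big = r2_big)
```

Any gain in training R-squared from the noise columns comes from fitting random variation in this sample, which is why it should not be read as better prediction on new data.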

4 Primate brains

Primates have brains of varying sizes, and one possible explanation for this variation is differences in body size. Larger-bodied primates may tend to have heavier brains, but this relationship is not always straightforward. To investigate whether body size can reliably explain differences in brain weight across primate species, let’s fit a model that predicts brain weight based on body size.

The data, in case you want to work with it yourself: primate brains

library(tidyverse)  # provides read_csv(), glimpse(), ggplot(), and %>%

data <- read_csv("https://kschuler.github.io/datasci/assests/csv/primate_brains.csv")
glimpse(data)

Rows: 144
Columns: 5
$ taxon          <chr> "Alouatta_caraya", "Alouatta_palliata", "Alouatta_pigra…
$ body_weight_g  <dbl> 5597, 6359, 8940, 6247, 1073, 870, 871, 239, 6409, 8034…
$ brain_weight_g <dbl> 52.72, 50.91, 52.97, 56.57, 21.41, 16.78, 17.21, 7.17, …
$ diet_category  <chr> "Fol", "Fol", "Fol", "Fol", "Frug/Fol", "Frug", "Frug",…
$ group_size     <dbl> 6.68, 15.55, 5.93, 6.97, 3.00, 3.50, 3.51, 1.25, 16.40,…

ggplot(data, aes(x = body_weight_g, y = brain_weight_g)) +
    geom_point()

4.2 Model specification

Suppose we specify the following model for the primate brains data: \(\log(brain\_weight\_g) = w_1 1 + w_2 \log(body\_weight\_g)\)

What is the response variable?

brain_weight_g body_weight_g log(brain_weight_g) log(body_weight_g)
What is the explanatory variable?

brain_weight_g body_weight_g log(brain_weight_g) log(body_weight_g)
True or false, the functional form of this model can be expressed as a weighted sum of inputs? \(y=\sum_{i=1}^{n}w_ix_i\)

True False
Which of the following model terms are included in the model specification above? Choose all that apply.

Intercept Main Interaction Transformation
Specify the model equation in R notation.

Answer

# like this (explicit intercept)
log(brain_weight_g) ~ 1 + log(body_weight_g)

# or like this (implicit intercept)
log(brain_weight_g) ~ log(body_weight_g)
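
To see that the two notations describe the same model, both can be fit and compared. The sketch here uses only the first eight rows shown in the glimpse() output (the data frame name `primates_head` is made up for this example), so the coefficient values will differ from a fit to all 144 rows.

```r
# First eight rows of the two columns, copied from the glimpse() output above
primates_head <- data.frame(
  body_weight_g  = c(5597, 6359, 8940, 6247, 1073, 870, 871, 239),
  brain_weight_g = c(52.72, 50.91, 52.97, 56.57, 21.41, 16.78, 17.21, 7.17)
)

explicit <- lm(log(brain_weight_g) ~ 1 + log(body_weight_g), data = primates_head)
implicit <- lm(log(brain_weight_g) ~ log(body_weight_g), data = primates_head)

# Both formulas produce the same two coefficients
all.equal(coef(explicit), coef(implicit))
```

R adds the intercept by default, so writing `1 +` only makes the term visible; it does not change the fit.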

4.3 Fitted model

Suppose you fit the model with lm() and it returns the following:


Call:
lm(formula = log(brain_weight_g) ~ 1 + log(body_weight_g), data = data)

Coefficients:
       (Intercept)  log(body_weight_g)  
           -2.4649              0.7752

Which of the following is \(w_1\) in the model specification \(\log(brain\_weight\_g) = w_1 1 + w_2 \log(body\_weight\_g)\)?

1 -2.4649 0.7752 Not enough information to determine this
Which of the following is \(w_2\) in the model specification \(\log(brain\_weight\_g) = w_1 1 + w_2 \log(body\_weight\_g)\)?

1 -2.4649 0.7752 Not enough information to determine this
Suppose a primate has a \(\log(body\_weight\_g)\) equal to 10. Which of the following would the model predict to be the primate’s \(\log(brain\_weight\_g)\)?

25.21 5.29 -10.7752 Not enough information to determine this
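
The prediction is the weighted sum evaluated at the given input, which can be worked out in R. The coefficients are copied from the lm() output above; the variable names `w1`, `w2`, and `log_body` are chosen here for illustration.

```r
# Coefficients copied from the lm() output above
w1 <- -2.4649   # intercept
w2 <- 0.7752    # weight on log(body_weight_g)

log_body <- 10
log_brain_pred <- w1 * 1 + w2 * log_body
log_brain_pred

# Back-transform to grams if a prediction on the original scale is wanted
exp(log_brain_pred)
```

The weighted sum is -2.4649 + 7.752 = 5.2871 on the log scale.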
Which of the following figures could show the fitted model?

1. the blue line
2. the red line
3. the black line
4. Not enough information to determine this

5 Social brain hypothesis

The Social Brain Hypothesis argues that the pressures of navigating increasingly complex social environments were a significant driver in the evolution of brain size and intelligence in humans and other primates.

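Let’s specify and fit a model in R that predicts (log) brain weight from (log) group size.

# Assumption: primate_brains is the same data frame loaded earlier as `data`
primate_brains <- data

model <- lm(log(brain_weight_g) ~ 1 + log(group_size),
    data = primate_brains)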

# add the model's predictions as a new column
primate_brains <- primate_brains %>%
    mutate(y_body_group = predict(model, primate_brains))

# plot the data as points and the fitted model as a blue line
ggplot(primate_brains, aes(
    y = log(brain_weight_g),
    x = log(group_size)
)) +
    geom_point(size = 2) +
    geom_line(color = "blue", aes(y = y_body_group))

Fill in the blank: how many inputs does this model have?

b. Question

Specify the model as an equation.

Answer

\(\log(brain\_weight\_g) = w_1 1 + w_2 \log(group\_size)\)

or, if you created new columns in your data with the log-transformed data, for example:

data <- data %>%
    mutate(log_brain_weight = log(brain_weight_g)) %>%
    mutate(log_group_size = log(group_size))

then you could have written:

\(log\_brain\_weight = w_1 1 + w_2 log\_group\_size\)

Given the figure above, which of the following could be the free parameter estimate for \(w_1\)?

1 0.66 2.25 5 Not enough information to determine this
Given the figure above, which of the following could be the free parameter estimate for \(w_2\)?

1 0.66 2.25 5 Not enough information to determine this
Suppose we encounter a primate in a (log) group size of 4. What could be the model prediction for their (log) brain weight?

3.5 4.1 4.9 6.2 Not enough information to determine this

f. Question

Suppose we wanted to include \(\log(body\_weight\_g)\) back into the model as an additional predictor of \(\log(brain\_weight\_g)\). Specify the model in R.

Answer

log(brain_weight_g) ~ 1 + log(group_size) + log(body_weight_g)
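
A two-predictor formula of this kind can be passed straight to lm(). The sketch here uses the column names as they appear in the data (`brain_weight_g`, `body_weight_g`, `group_size`) and only the first eight rows from the glimpse() output earlier in the lab, stored under the made-up name `primates_head`, so the estimates are not those of the full data set.

```r
# First eight rows copied from the glimpse() output earlier in the lab
primates_head <- data.frame(
  body_weight_g  = c(5597, 6359, 8940, 6247, 1073, 870, 871, 239),
  brain_weight_g = c(52.72, 50.91, 52.97, 56.57, 21.41, 16.78, 17.21, 7.17),
  group_size     = c(6.68, 15.55, 5.93, 6.97, 3.00, 3.50, 3.51, 1.25)
)

fit <- lm(log(brain_weight_g) ~ 1 + log(group_size) + log(body_weight_g),
          data = primates_head)

# One weight per term: intercept, log(group_size), log(body_weight_g)
names(coef(fit))
```

Adding a predictor adds one term to the formula and one free parameter to the fitted model.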

6 Fruit v Leaf eaters

Diet may influence the relationship between brain and body size in primates because the type of food a species consumes can impact its ability to meet the energy demands of a larger brain. Fruit-eating primates have access to energy-rich, easily digestible food, which could support the metabolic costs of both a large body and a larger, more complex brain.

Let’s begin by adding diet_category to our plot mapped to the color aesthetic.

primate_brains %>%
    ggplot(aes(
        y = log(brain_weight_g), 
        x = log(body_weight_g),
        color = diet_category
    )) +
    geom_point(size = 2)

Frugivorous (“Frug”) primates primarily eat fruit, while folivorous (“Fol”) primates primarily consume leaves. The “Frug/Fol” category refers to primates that combine both fruit and leaf consumption in their diet. “Om” stands for omnivores, which we might suspect is similar to “Frug/Fol” with more variation in diet. To simplify things, let’s focus our analysis on just the Fol and Frug categories.

fruit_v_leaves <- primate_brains %>%
    filter(diet_category %in% c("Fol", "Frug")) 

fruit_v_leaves %>%
    ggplot(aes(
        x = log(body_weight_g), 
        y = log(brain_weight_g), 
        color = diet_category
    )) +
    geom_point()

Suppose we specify a model that predicts brain weight by body size and diet category.

model <- lm(log(brain_weight_g) ~ log(body_weight_g) + diet_category, data = fruit_v_leaves) 

model


Call:
lm(formula = log(brain_weight_g) ~ log(body_weight_g) + diet_category, 
    data = fruit_v_leaves)

Coefficients:
       (Intercept)  log(body_weight_g)   diet_categoryFrug  
           -2.8047              0.7778              0.4576

a. Question

Specify the model with a mathematical expression.

Answer

\(\log(brain\_weight\_g) = w_1 1 + w_2\log(body\_weight\_g) + w_3 diet\_category\)

b. Question

Notice we did not include an interaction term between body weight and diet category. Why might a modeler make this decision?

Answer

A modeler might choose not to include an interaction term based on exploratory visualization. The scatter plot shows roughly parallel trends for frugivorous and folivorous primates, which could indicate that body size influences brain weight similarly, regardless of diet.

You could have also said something relevant to model complexity: the modeler may have noticed in exploratory data analysis that body size seems to influence brain weight similarly, and decided to keep the model simpler and easier to interpret by leaving out the interaction.
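
The difference between the two specifications can be seen in the columns R would build for each. This sketch uses the six rows labeled "Fol" or "Frug" among the first seven shown in the glimpse() output; the data frame name `fv` and the object names are made up for this example.

```r
# Rows with diet "Fol" or "Frug" among the first seven shown in the glimpse() output
fv <- data.frame(
  body_weight_g = c(5597, 6359, 8940, 6247, 870, 871),
  diet_category = c("Fol", "Fol", "Fol", "Fol", "Frug", "Frug")
)

# Additive: one slope for body weight, plus a shift in intercept for Frug
additive <- model.matrix(~ log(body_weight_g) + diet_category, data = fv)

# With interaction: additionally lets the body-weight slope differ for Frug
with_interaction <- model.matrix(~ log(body_weight_g) * diet_category, data = fv)

colnames(additive)
colnames(with_interaction)
```

The interaction version has one extra column, and therefore one extra free parameter, which is the added complexity the modeler avoids by leaving it out.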

True or false: the diet_category variable is categorical, so this is a classification problem.

True False

c. Question

Write the fitted model as a mathematical expression.

Answer

\(\log(brain\_weight\_g) = -2.8047\times{1} + 0.7778\times{\log(body\_weight\_g)} + 0.4576\times{diet\_category}\)

Based on the fitted model returned by lm() above, which level of diet_category is the reference level?

Fol Frug Not enough information to determine this
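
How R turns a two-level categorical variable into a numeric input can be inspected with model.matrix(). The four-row data frame here is illustrative and contains only the two diet labels used in this section.

```r
# Illustrative data frame containing only the two diet labels used in this section
diet <- data.frame(diet_category = c("Fol", "Frug", "Fol", "Frug"))

# By default R sorts factor levels alphabetically; the first level is the reference
levels(factor(diet$diet_category))

# The design matrix has an indicator column only for the non-reference level
mm <- model.matrix(~ diet_category, data = diet)
mm
```

The indicator column is named after the level it flags, which is why a coefficient labeled `diet_categoryFrug` appears in the lm() output while no coefficient is labeled for the other level.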

e. Question

What is the model’s prediction for a primate with a (log) body weight of 7 who eats leaves? Write your answer as a mathematical expression without simplifying it.

Answer

\(\log(brain\_weight\_g) = -2.8047\times{1} + 0.7778\times{\mathbf{7}} + 0.4576\times{\mathbf{0}}\)
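
The same expression can be evaluated in R, along with the matching prediction for a fruit eater of the same body weight. The coefficients are copied from the lm() output above; the variable names are chosen here for illustration.

```r
# Coefficients copied from the lm() output above
w1 <- -2.8047   # intercept
w2 <- 0.7778    # weight on log(body_weight_g)
w3 <- 0.4576    # weight on the Frug indicator

log_body <- 7
pred_fol  <- w1 * 1 + w2 * log_body + w3 * 0   # leaf eater: indicator is 0
pred_frug <- w1 * 1 + w2 * log_body + w3 * 1   # fruit eater: indicator is 1
c(Fol = pred_fol, Frug = pred_frug)
```

The two predictions are 2.6399 and 3.0975. They differ by exactly `w3`, the vertical offset between the two parallel lines.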

Fill in the blank. Of the figures below, figure ____ could be the plot of the model we specified.

7 Matching plots to equations

Match the following plots to the equations below. Each plot can be mapped to a unique expression of the linear model equation.

a. \(y = w_11\)
b. \(y = w_11 + w_2x\)
c. \(y = w_11 + w_2z\)
d. \(y = w_11 + w_2x + w_3z\)
e. \(y = w_11 + w_2x + w_3z + w_4x\times{z} + w_5x^2\)
f. \(y = w_11 + w_2x + w_3x^2\)

Which of the equations above has the most inputs? (enter a lowercase letter a-f)
Which of the equations above is the most complex model? (enter a lowercase letter a-f)
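
Equations like these map onto R formulas term by term. As a sketch, the formula for the equation with an interaction and a squared term can be expanded with model.matrix() to list the columns R would build; the values in `toy` are made up and serve only to show the columns.

```r
# Made-up values for x and z, only to inspect the columns R would build
toy <- data.frame(x = c(1, 2, 3, 4), z = c(0, 1, 0, 1))

# R formula for y = w_1 1 + w_2 x + w_3 z + w_4 x*z + w_5 x^2
X <- model.matrix(~ 1 + x + z + x:z + I(x^2), data = toy)
colnames(X)
ncol(X)
```

Each column corresponds to one weight in the equation, so counting columns is one way to compare how many free parameters two specifications have.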

8 Polynomials

What is the purpose of including polynomial terms in a linear model?

1. To improve model interpretability
2. To model nonlinear relationships
3. To reduce overfitting in the model
4. To ensure that residuals are normally distributed

Which of the following is an example of a quadratic polynomial term in a linear model?
1. \(x\)
2. \(x^2\)
3. \(\sqrt{x}\)
4. \(\log{x}\)
Why might higher-degree polynomial terms lead to overfitting in a linear model?

1. Higher-degree terms make the model too simple
2. Higher-degree terms force the model to fit the noise in the data
3. Polynomial terms always reduce the model's flexibility
4. Polynomial terms make the model biased

Which of the following models includes both linear and quadratic terms?
1. \(y = \beta_0 + \beta_1x\)
2. \(y = \beta_0 + \beta_1x^2\)
3. \(y = \beta_0 + \beta_1x + \beta_2x^2\)
4. \(y = \beta_0 + \beta_1x + \beta_2x^3\)
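
A model with linear and quadratic terms is still fit with lm(), because it remains a weighted sum of inputs. The sketch here uses simulated data that follow an exact quadratic with hypothetical weights 1, 2, and 3, so the recovered coefficients are easy to check.

```r
# Simulated data that follow an exact quadratic, so the fit is easy to check
x <- 1:10
y <- 1 + 2 * x + 3 * x^2

# I() protects ^ so that it squares x instead of being read as formula syntax
fit <- lm(y ~ 1 + x + I(x^2))
round(coef(fit), 4)
```

The fit recovers the three weights used to simulate the data. With real, noisy data the estimates would only approximate the underlying curve.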
