R basics

Data Science for Studying Language and the Mind

Katie Schuler

2024-08-28

Syllabus

Follow along on the syllabus!

Paperwork

  • When you arrive, complete this anonymous form: Who’s in class

  • You can also join the waitlist if you are not enrolled

Announcements

  • The course is full and the room is full
  • Ways to join:
    1. Watch for an opening (highest odds of getting in)
    2. Add your name to our waitlist

Course description

Data Sci for Lang & Mind is an entry-level course designed to teach basic principles of statistics and data science to students with little or no background in statistics or computer science. Students will learn to identify patterns in data using visualizations and descriptive statistics; make predictions from data using machine learning and optimization; and quantify the certainty of their predictions using statistical models. This course aims to help students build a foundation of critical thinking and computational skills that will allow them to work with data in all fields related to the study of the mind (e.g. linguistics, psychology, philosophy, cognitive science, neuroscience).

Prerequisites

There are no prerequisites beyond high school algebra. No prior programming or statistics experience is necessary, though you will still enjoy this course if you already have a little. Students who have taken several computer science or statistics classes should look for a more advanced course.

Teaching team

Instructor: Dr. Katie Schuler (she/her)

TAs:

  • Brittany Zykoski
  • Mingyang Bian

About me, your instructor (Katie)

  • You can call me Professor Schuler or Katie, whichever makes you more comfortable

  • I live in Mt Airy with my husband and two kids (Dory, 2 and Joan, 6)

  • At Penn I also have a research lab, the Child Language Lab, and am on the Natural Science and Math Panel (a group focused on improving inclusive teaching in STEM at Penn).

  • I’m a first-generation college student from Western NY. I worked 40 hours a week to put myself through college; I am still paying off my student loans.

My assumptions about you

You are an honest, kind, and hardworking person who wants to do well in and enjoy this class.

You are very busy, and will sometimes have to prioritize other things above this class.

Course overview

Data science

Data science is about making decisions based on incomplete information.

Figure 1: from Kok & de Lange (2014)

This concept is not new. Brains were built for doing this!

But we have new tools and lots more data!

Figure 2: from https://web-assets.domo.com/miyagi/images/product/product-feature-22-data-never-sleeps-10.png

Data science workflow

The folks who wrote R for Data Science proposed the following data science workflow:

Figure 3: from R for Data Science

Overview of course

We will spend the first few weeks getting comfortable programming in R, including some useful skills for data science:

  • R basics
  • Data visualization
  • Data wrangling (import, tidy, and transform)

Overview of course

Then, we will spend the next several weeks building a foundation in basic statistics and model building:

  • Sampling distribution
  • Hypothesis testing
  • Model specification
  • Model fitting
  • Model evaluation (accuracy and reliability)

Overview of course

Finally, if we have time, we will cover a selection of more advanced topics that are often applied in language and mind fields, with a focus on basic understanding:

  • Feature engineering
  • Classification
  • Mixed-effect models

Syllabus, briefly

Each week will include two lectures and a lab:

  • Lectures are on Tuesdays and Thursdays at 12pm and will be a mix of conceptual overviews and R tutorials. It is a good idea to bring your laptop so you can follow along and try stuff in R!
  • Labs are on Thursday or Friday and will consist of (ungraded) practice problems and concept review with TAs. You may attend any lab section that works for your schedule. Lab attendance is required.

Syllabus, briefly

There are 10 graded assessments:

  • 8 Problem sets (20%) in which you will be asked to apply your newly acquired R programming skills.
  • 2 Midterm exams (60%) in which you will be tested on your understanding of lecture concepts.

And 20% is lab attendance.

Syllabus, briefly

There are a few policies to take note of:

  • Missed exams cannot be made up except in cases of genuine conflict or emergency (documentation and course action notice required). You may take the optional final exam to replace a missed or low scoring exam.
  • You may request an extension of up to 3 days on any problem set. Extensions beyond 3 days will not be granted (because delaying solutions will negatively impact other students).
  • You may submit any missed quiz or problem set by the end of the semester for half-credit (50%), even after solutions are posted.
  • We will drop your lowest pset grade, and you can miss up to 2 labs without penalty.

Resources

In addition to our course website, we will use the following:

  • google colab (r kernel) - for computing
  • canvas - for posting grades
  • gradescope - for submitting problem sets
  • ed discussion - for announcements and questions

Wellness resources

Please consider using these Penn resources this semester:

Why R?

With many programming languages available for data science (e.g. R, Python, Julia, MATLAB), why use R?

  • Built for stats, specifically
  • Makes nice visualizations
  • Lots of people are doing it, especially in academia
  • Easier for beginners to understand
  • Free and open source (though so are Python and Julia; MATLAB costs $)

Many ways to use R

Google Colab

  • Google Colab is a cloud-based Jupyter notebook that allows you to write, execute, and share code like a google doc.
  • We use Google Colab because it’s simple and accessible to everyone. You can start programming right away, no setup required!

Secretly, R!

Google Colab officially supports Python, but secretly supports R (and Julia, too!)

  • Update 2024: Google Colab now officially supports R!

  • Update 2025: Google Colab officially supports Julia!

  • colab (r kernel)

Let’s try it!

Google colab demo

Open a new R notebook:

  • colab (r kernel) - use this link to start a new R notebook
  • File > New notebook and then Runtime > Change runtime type to R

Cell types:

  • + Code - write and execute code
  • + Text - write text blocks in markdown

Frequently used menu options:

  • File > Locate in Drive - where in your Google Drive?
  • File > Save - saves
  • File > Revision history - history of changes you made
  • File > Download > Download .ipynb - used to submit assignments!
  • File > Print - prints
  • Runtime > Run all - run all cells
  • Runtime > Run before - run all cells before current active cell
  • Runtime > Restart and run all - restart runtime, then run all

Frequently used keyboard shortcuts:

  • Cmd/Ctrl+S - save
  • Cmd/Ctrl+Enter - run focused cell
  • Cmd/Ctrl+Shift+A - select all cells
  • Cmd/Ctrl+/ - comment/uncomment selection
  • Cmd/Ctrl+] - increase indent
  • Cmd/Ctrl+[ - decrease indent

You are here

Data science with R
  • R basics
  • Data visualization
  • Data wrangling
Stats & Model building
  • Sampling distribution
  • Hypothesis testing
  • Model specification
  • Model fitting
  • Model evaluation
More advanced
  • Feature Engineering
  • Classification
  • Mixed-effect models

R Basics

We begin by defining some basic concepts:

Basic concepts

  • Expressions: fundamental building blocks of programming
  • Objects: allow us to store stuff, created with assignment operator
  • Names: names we give objects must consist of letters, numbers, ., or _ (and cannot start with a number or _)
  • Attributes: allow us to attach arbitrary metadata to objects
  • Functions: take some input, perform some computation, and return some output
  • Environment: collection of all objects we defined in current R session
  • Packages: collections of functions, data, and documentation bundled together in R
  • Comments: notes you leave for yourself, not evaluated
  • Messages: notes R leaves for you (FYI, warning, error)

Expressions

  • Expressions are combinations of values, variables, operators, and functions that can be evaluated to produce a result. Expressions can be as simple as a single value or more complex involving calculations, comparisons, and function calls. They are the fundamental building blocks of programming.
    • 10 - a simple value expression that evaluates to 10.
    • x <- 10 - an expression that assigns the value of 10 to x.
    • x + 10 - an expression that adds the value of x to 10.
    • a <- x + 10 - an expression that adds the value of x to 10 and assigns the result to the variable a
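A minimal sketch of the four expressions above, evaluated in order. The names x and a come from the bullets; the values in the comments are what each line evaluates to.

```r
# a simple value expression
10
# assign the value 10 to x (assignment itself prints nothing)
x <- 10
# add the value of x to 10
x + 10
# add x to 10 and store the result in a
a <- x + 10
a   # 20
```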

Important functions

Objects

  • str(x) - returns summary of object’s structure
  • typeof(x) - returns object’s data type
  • length(x) - returns object’s length
  • attributes(x) - returns list of object’s attributes
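As a quick illustration, the four functions above can be applied to one small vector. The vector here is only an example, not course data.

```r
# a small character vector to inspect
x <- c("hello", "world", "!")
str(x)          # compact summary of the object's structure
typeof(x)       # "character"
length(x)       # 3
attributes(x)   # NULL, because a plain vector carries no attributes
```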

Important functions

Environment

  • ls() - list all variables in environment
  • rm(x) - remove x variable from environment
  • rm(list = ls()) - remove all variables from environment
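A short sketch of how the environment changes as objects are created and removed. The names demo_a and demo_b are hypothetical, chosen only for this illustration.

```r
# hypothetical object names, used only for this example
demo_a <- 1
demo_b <- "two"
before <- ls()   # names defined so far, including "demo_a" and "demo_b"
rm(demo_a)       # remove a single object
after <- ls()    # "demo_a" is gone; "demo_b" remains
before
after
# rm(list = ls()) would clear everything, so it is left commented out here
```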

Thursday Aug 28

Thursday’s class started here.

Important functions

Packages

  • install.packages() to install packages
  • library() to load package into current R session.
  • data() to load data from package into environment
  • sessionInfo() - version info, packages for current R session
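One possible walk-through of these functions, using the datasets package that ships with R so that nothing needs to be downloaded. The install line is commented out because it needs a network connection, and tidyverse is only an example package name.

```r
# install.packages("tidyverse")   # run once per machine (needs internet)
library(datasets)    # load an installed package into the current session
data("mtcars")       # load a data set from the package into the environment
head(mtcars, 3)      # peek at the first three rows
sessionInfo()        # R version and attached packages
```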

Important functions

Help

  • ?mean - get help with a function
  • help.search('mean') or ??mean - search help files for word or phrase
  • help(package='tidyverse') - find help for a package

Vectors

Vectors

are fundamental data structures in R. There are two types:

  • atomic vectors - elements of the same data type
  • lists - elements refer to any object

Atomic vectors

Atomic vectors can be one of six data types. The four most common are shown here (the other two are complex and raw):

typeof(x)    examples
double       3, 3.32
integer      1L, 144L
character    "hello", 'hello, world!'
logical      TRUE, F

atomic because they must contain only one type

Atomic vectors

double

typeof(3.34)
[1] "double"

integer

typeof(3L)
[1] "integer"

character

typeof('hello, world!')
[1] "character"

logical

typeof(TRUE)
[1] "logical"

Create a vector

with c() for concatenate

c(2,4,6)
[1] 2 4 6
c("hello", "world", "!")
[1] "hello" "world" "!"    
c(T, F, T)
[1]  TRUE FALSE  TRUE
c("hello", c(2, 3))
[1] "hello" "2"     "3"    

Create a vector

with sequences seq() or repetitions rep()

# sequences of integers have a special shorthand
6:10
[1]  6  7  8  9 10
# sequence from, to, by 
seq(from=3, to=5, by=0.5)
[1] 3.0 3.5 4.0 4.5 5.0
# rep(x, times = 1, each = 1)
rep(c(1,0), times = 4)
[1] 1 0 1 0 1 0 1 0
# rep(x, times = 1, each = 1)
rep(c(1,0), each = 4)
[1] 1 1 1 1 0 0 0 0

Check data type

with typeof(x) - returns the type of vector x

typeof(3)
[1] "double"
typeof(3L)
[1] "integer"
typeof("three")
[1] "character"
typeof(TRUE)
[1] "logical"

Check data type

with is.*(x) - returns TRUE if x has type *

is.double(3)
[1] TRUE
is.integer(3L)
[1] TRUE
is.character("three")
[1] TRUE
is.logical(TRUE)
[1] TRUE

Coercion, implicit

If you try to include elements of different types, R will coerce them into the same type without warning (implicit coercion)

x <- c(1, 2, "three", 4, 5 )
x
[1] "1"     "2"     "three" "4"     "5"    
typeof(x)
[1] "character"

Coercion, explicit

You can also use explicit coercion to change a vector to another data type with as.*()

x <- c(1, 0 , 1, 0)
as.logical(x)
[1]  TRUE FALSE  TRUE FALSE
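A few more as.*() conversions, sketched here to show what explicit coercion does with values that do not convert cleanly. The inputs are made up for illustration.

```r
i <- as.integer(3.9)             # 3: the decimal part is dropped, not rounded
s <- as.character(c(1, 2))       # "1" "2"
n <- as.numeric("3.5")           # 3.5
# strings that R does not recognize as TRUE/FALSE become NA
l <- as.logical(c("TRUE", "false", "yes"))   # TRUE FALSE NA
i; s; n; l
```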

More complex structures

More complex structures

Some more complex data structures are built from atomic vectors by adding attributes:

Structure     Description
matrix        vector with dim attribute representing 2 dimensions
array         vector with dim attribute representing n dimensions
data.frame    a named list of vectors (of equal length) with attributes for names (column names), row.names, and class="data.frame"
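To make "built from vectors by adding attributes" concrete, this sketch turns a plain vector into a matrix by setting its dim attribute, then looks at what a data.frame is underneath. The values are arbitrary.

```r
x <- 1:6
attributes(x)        # NULL: a plain vector has no attributes
dim(x) <- c(2, 3)    # add a dim attribute: 2 rows, 3 columns
is.matrix(x)         # TRUE
x

df <- data.frame(x = c(1, 2, 3), y = c("a", "b", "c"))
typeof(df)           # "list": a data.frame is a list underneath
attributes(df)       # names, class, and row.names
```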

Create more complex structures

matrix

matrix(0, nrow=2, ncol=3)
     [,1] [,2] [,3]
[1,]    0    0    0
[2,]    0    0    0

data.frame

data.frame(x=c(1,2,3), y=c('a','b','c'))
  x y
1 1 a
2 2 b
3 3 c

array

array(0, dim=c(2,3,2))
, , 1

     [,1] [,2] [,3]
[1,]    0    0    0
[2,]    0    0    0

, , 2

     [,1] [,2] [,3]
[1,]    0    0    0
[2,]    0    0    0

Operations

Basic math operators

Operator    Operation
()          Parentheses
^           Exponent
*           Multiply
/           Divide
+           Add
-           Subtract

Basic math operations

follow the order of operations you expect (PEMDAS)

# multiplication takes precedence
2 + 3 * 10
[1] 32
# we can use parentheses to be explicit
(2 + 3) * 10 
[1] 50
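A few more precedence cases, sketched with arbitrary numbers. The one that most often surprises people is the unary minus, which is applied after the exponent.

```r
p1 <- 2^3 * 2      # 16: the exponent is applied before multiplication
p2 <- -2^2         # -4: the exponent is applied before the unary minus
p3 <- (-2)^2       # 4: parentheses make the intent explicit
p4 <- 10 / 2 * 5   # 25: operators of equal precedence go left to right
c(p1, p2, p3, p4)
```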

Comparison operators

Operator    Comparison
x < y       less than
x > y       greater than
x <= y      less than or equal to
x >= y      greater than or equal to
x != y      not equal to
x == y      equal to

Comparison operators

x <- 2
y <- 3


x < y
[1] TRUE
x > y 
[1] FALSE
x != y
[1] TRUE
x == y
[1] FALSE

Logical operators

Operator    Operation
x | y       or
x & y       and
!x          not
any()       true if any element meets condition
all()       true if all elements meet condition
x %in% y    true for each element of x that is found in vector y

Logical operators

x <- TRUE
y <- FALSE


x | y
[1] TRUE
x & y 
[1] FALSE
!x 
[1] FALSE
any(c(x,y))
[1] TRUE
all(c(x,y))
[1] FALSE
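The %in% operator from the table is not shown in the examples above. Here is a small sketch, assuming a hypothetical vector of letters; it returns one logical value per element on the left-hand side, which can then be passed to any() or all().

```r
letters_seen <- c("a", "b", "z")
known <- c("a", "b", "c")
found <- letters_seen %in% known   # TRUE TRUE FALSE
found
any(found)   # TRUE: at least one element is in known
all(found)   # FALSE: "z" is not in known
```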

Operations are vectorized

Almost all operations (and many functions) are vectorized

math

c(1, 2, 3) + c(4, 5, 6)
[1] 5 7 9
c(1, 2, 3) / c(4, 5, 6)
[1] 0.25 0.40 0.50
c(1, 2, 3) * 10 
[1] 10 20 30
c(1, 2, 30) > 10
[1] FALSE FALSE  TRUE

logical

x <- c(TRUE, FALSE, FALSE)
y <- c(TRUE, TRUE, FALSE)
z <- TRUE
x | y
[1]  TRUE  TRUE FALSE
x & y 
[1]  TRUE FALSE FALSE
x | z 
[1] TRUE TRUE TRUE
x & z 
[1]  TRUE FALSE FALSE
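When the two sides have different lengths, as in x | z above, R recycles the shorter vector to match the longer one. A minimal sketch with made-up numbers:

```r
# the shorter vector c(10, 20) is reused as 10, 20, 10, 20
r1 <- c(1, 2, 3, 4) + c(10, 20)   # 11 22 13 24
# a single value is recycled across every element
r2 <- c(1, 2, 3) * 10             # 10 20 30
r1
r2
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning.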

Operator coercion

Operators and functions will also coerce values when needed (and without warning)

5.6 + 2L
[1] 7.6
10 + FALSE 
[1] 10
log(1)
[1] 0
log(TRUE)
[1] 0

Subsetting

Subsetting is a natural complement to str(). While str() shows you all the pieces of any object (its structure), subsetting allows you to pull out the pieces that you’re interested in. ~ Hadley Wickham, Advanced R

str()

x <- c("hello", "world", "!")
str(x)
 chr [1:3] "hello" "world" "!"
y <- c(1, 2, 3, 4, 5)
str(y)
 num [1:5] 1 2 3 4 5

Subsetting

There are three operators for subsetting objects:

  • [ - subsets (one or more) elements
  • [[ and $ - extract a single element

Subset multiple elements with [

Code                Returns
x[c(1,2)]           positive integers select elements at specified indexes
x[-c(1,2)]          negative integers select all but elements at specified indexes
x[c("x", "y")]      select elements by name, if elements are named
x[]                 nothing returns the original object
x[0]                zero returns a zero-length vector
x[c(TRUE, TRUE)]    select elements where corresponding logical value is TRUE

Subset multiple elements with [

atomic vector

x <- c("hello", "world", "1")
x[c(1,2)]
[1] "hello" "world"
x[-c(1,2)]
[1] "1"
x[]
[1] "hello" "world" "1"    

data.frame

y <- data.frame(
        this = c(1, 2,3), 
        that = c("a", "b", "c"),
        theother = c(4, 5, 6)
        )
y[c(1,2)]
  this that
1    1    a
2    2    b
3    3    c
y[-c(1,2)]
  theother
1        4
2        5
3        6
y[c("this")]
  this
1    1
2    2
3    3
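The examples above cover integer indexes; selecting by name and by logical value works the same way. A sketch with a hypothetical named vector (scores and its names are made up for illustration):

```r
scores <- c(a = 5, b = -1, c = 12)
by_name <- scores[c("a", "c")]             # elements named a and c
by_logical <- scores[c(TRUE, FALSE, TRUE)] # same elements, chosen by TRUE
by_condition <- scores[scores > 0]         # a comparison produces the logical vector
empty <- scores[0]                         # zero-length vector
by_condition
```

Combining a comparison with [ is a common way to filter a vector by a condition.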

3 ways to extract a single element

Code           Returns
[[2]]          a single positive integer (index)
[['name']]     a single string
x$name         the $ operator is a useful shorthand for [['name']]

3 ways to extract a single element

atomic vector

x <- c("hello", "world", "1")
x[[1]]
[1] "hello"
x[[2]]
[1] "world"
x[[3]]
[1] "1"

data.frame

y <- data.frame(
        this = c(1, 2,3), 
        that = c("a", "b", "c"),
        theother = c(4, 5, 6)
        )
y[[1]]
[1] 1 2 3
y[["that"]]
[1] "a" "b" "c"
y$that
[1] "a" "b" "c"

R has many built-in functions

x <- c(1, -2, 3)

Some are vectorized

log(x)
[1] 0.000000      NaN 1.098612
abs(x)
[1] 1 2 3
round(x, 2)
[1]  1 -2  3

Some are not

mean(x)
[1] 0.6666667
max(x)
[1] 3
min(x)
[1] -2

Missing values

NA

  • used to represent missing or unknown elements in vectors
  • Note that NA is contagious: expressions including NA usually return NA
  • Check for NA values with is.na()
x <- c(1, NA, 3)
is.na(x)
[1] FALSE  TRUE FALSE
length(x)
[1] 3
mean(x)
[1] NA
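Because NA is contagious, many summary functions accept an na.rm argument to drop missing values first. A brief sketch using the same vector:

```r
x <- c(1, NA, 3)
plus_one <- x + 1                  # 2 NA 4: the NA carries through
m <- mean(x, na.rm = TRUE)         # 2: NA removed before averaging
n_missing <- sum(is.na(x))         # 1: count of missing values
complete <- x[!is.na(x)]           # 1 3: keep only non-missing elements
plus_one; m; n_missing; complete
```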

NULL

  • used to represent an empty or absent vector of arbitrary type
  • NULL is its own special type and always has length zero and NULL attributes
  • Check for NULL values with is.null()
x <- c()
is.null(x)
[1] TRUE
length(x)
[1] 0
mean(x)
[1] NA

Programming

functions

are reusable pieces of code that take some input, perform some task or computation, and return an output

function(inputs){
    # do something
    return(output)
}
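A concrete instance of the template above, as a sketch. The function name c_to_f and its argument are hypothetical, chosen only to illustrate input, computation, and output.

```r
# hypothetical function: convert degrees Celsius to Fahrenheit
c_to_f <- function(celsius) {
    fahrenheit <- celsius * 9 / 5 + 32   # the computation
    return(fahrenheit)                   # the output
}

boiling <- c_to_f(100)       # 212
temps <- c_to_f(c(0, 37))    # 32.0 98.6 (arithmetic is vectorized)
boiling
temps
```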

control flow

refers to managing the order in which expressions are executed in a program

  • if / else - if something is true, do this; otherwise do that
  • for loops - repeat code a specific number of times
  • while loops - repeat code as long as certain conditions are true
  • break - exit a loop early
  • next - skip to next iteration in a loop
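The constructs listed above can be sketched in a few lines. All variable names and numbers are made up for illustration; the vectorized ifelse() function is included alongside the if / else statement because the two are easy to confuse.

```r
x <- 7
if (x %% 2 == 0) {
    parity <- "even"
} else {
    parity <- "odd"
}

# ifelse() is the vectorized version: one answer per element
sizes <- ifelse(c(1, 2, 3) > 1, "big", "small")   # "small" "big" "big"

total <- 0
for (i in 1:10) {
    if (i %% 2 == 0) next   # skip even numbers
    if (i > 7) break        # exit the loop once i passes 7
    total <- total + i      # adds 1, 3, 5, 7
}

n <- 1
while (n < 100) {
    n <- n * 2              # keep doubling while n is below 100
}
parity; sizes; total; n
```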

Subsetting quirks

If we have time

Notes on [ with higher dim objects

m <- matrix(1:6, nrow=2, ncol=3)
m
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
# separate dimensions by comma 
m[1, 2]
[1] 3
# an omitted dim returns everything from that dim
m[2, ]
[1] 2 4 6
m[ , 2]
[1] 3 4
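One related quirk, sketched here as an aside: selecting a single row or column with [ drops the result to a plain vector. Passing drop = FALSE keeps it as a matrix.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)
row_vec <- m[2, ]                  # plain vector: 2 4 6
row_mat <- m[2, , drop = FALSE]    # still a matrix with 1 row, 3 columns
two_cols <- m[, c(1, 3)]           # selecting several columns stays a matrix
dim(row_vec)    # NULL
dim(row_mat)    # 1 3
dim(two_cols)   # 2 2
```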

Notes on [[ and $:

both [[ and [ work for atomic vectors; prefer [[ when extracting a single element

x <- c(1, -2, 3)
x[[1]]
[1] 1
x[1]
[1] 1

$ does partial matching without warning

df <- data.frame(
        this = c(1, 2,3), 
        that = c("a", "b", "c"),
        theother = c(4, 5, 6)
        )
df[['theo']]
NULL
df$theo
[1] 4 5 6

Questions?

Have a great weekend!