Data Science for Studying Language and the Mind
2024-08-28
Follow along on the syllabus!
When you arrive, complete this anonymous form: Who’s in class
You can also join the waitlist if you are not enrolled
Data Sci for Lang & Mind is an entry-level course designed to teach basic principles of statistics and data science to students with little or no background in statistics or computer science. Students will learn to identify patterns in data using visualizations and descriptive statistics; make predictions from data using machine learning and optimization; and quantify the certainty of their predictions using statistical models. This course aims to help students build a foundation of critical thinking and computational skills that will allow them to work with data in all fields related to the study of the mind (e.g. linguistics, psychology, philosophy, cognitive science, neuroscience).
There are no prerequisites beyond high school algebra. No prior programming or statistics experience is necessary, though you will still enjoy this course if you already have a little. Students who have taken several computer science or statistics classes should look for a more advanced course.
Instructor: Dr. Katie Schuler (she/her)
TAs:
You can call me Professor Schuler or Katie, whichever makes you more comfortable
I live in Mt Airy with my husband and two kids (Dory, 2 and Joan, 6)
At Penn I also have a research lab, the Child Language Lab, and I am on the Natural Science and Math Panel (a group focused on improving inclusive teaching in STEM at Penn).
I’m a first-generation college student from Western NY. I worked 40 hours a week to put myself through college; I am still paying off my student loans.
You are an honest, kind, and hardworking person who wants to do well in and enjoy this class
You are very busy, and will sometimes have to prioritize other things above this class.
Data science is about making decisions based on incomplete information.
This concept is not new. Brains were built for doing this!
The folks who wrote R for Data Science proposed the following data science workflow:
We will spend the first few weeks getting comfortable programming in R, including some useful skills for data science:
Then, we will spend the next several weeks building a foundation in basic statistics and model building:
Finally, if we have time, we will cover a selection of more advanced topics that are often applied in language and mind fields, with a focus on basic understanding:
Each week will include two lectures and a lab:
There are 10 graded assessments:
And 20% is lab attendance.
There are a few policies to take note of:
In addition to our course website, we will use the following:
Please consider using these Penn resources this semester:
With many programming languages available for data science (e.g. R, Python, Julia, MATLAB), why use R?
Google Colab officially supports Python, but secretly supports R (and Julia, too!)
Update 2024: Google Colab now officially supports R!
Update 2025: Google Colab officially supports Julia!
Google Colab demo
File > New notebook and then Runtime > Change runtime type to R

+ Code - write and execute code
+ Text - write text blocks in markdown
Table of contents - outline from text headings
Find and replace - find and/or replace
Files - upload files to cloud session
File > Locate in Drive - where in your Google Drive?
File > Save - saves
File > Revision history - history of changes you made
File > Download > Download .ipynb - used to submit assignments!
File > Print - prints
Runtime > Run all - run all cells
Runtime > Run before - run all cells before current active cell
Runtime > Restart and run all - restart runtime, then run all
Cmd/Ctrl+S - save
Cmd/Ctrl+Enter - run focused cell
Cmd/Ctrl+Shift+A - select all cells
Cmd/Ctrl+/ - comment/uncomment selection
Cmd/Ctrl+] - increase indent
Cmd/Ctrl+[ - decrease indent
R basics
We begin by defining some basic concepts:
Expressions: fundamental building blocks of programming
Objects: allow us to store stuff, created with the assignment operator
Names: names we give objects may contain only letters, numbers, ., or _
Attributes: allow us to attach arbitrary metadata to objects
Functions: take some input, perform some computation, and return some output
Environment: collection of all objects we defined in the current R session
Packages: collections of functions, data, and documentation bundled together in R
Comments: notes you leave for yourself, not evaluated
Messages: notes R leaves for you (FYI, warning, error)

10 - a simple value expression that evaluates to 10.
x <- 10 - an expression that assigns the value of 10 to x.
x + 10 - an expression that adds the value of x to 10.
a <- x + 10 - an expression that adds the value of x to 10 and assigns the result to the variable a.
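The four expressions above can be typed into a code cell as-is. This is just a small sketch of what each one does when run in order:

```r
# A bare value is an expression; R evaluates it and returns 10
10

# Assignment stores the value 10 in an object named x
x <- 10

# Adds the value of x to 10; the result (20) is shown but not stored
x + 10

# Adds the value of x to 10 and stores the result in a
a <- x + 10
a   # 20
```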
str(x) - returns summary of object’s structure
typeof(x) - returns object’s data type
length(x) - returns object’s length
attributes(x) - returns list of object’s attributes
ls() - list all variables in environment
rm(x) - remove x variable from environment
rm(list = ls()) - remove all variables from environment

Thursday’s class started here.
install.packages() to install packages
library() to load package into current R session.
data() to load data from package into environment
sessionInfo() - version info, packages for current R session
?mean - get help with a function
help('mean') - same as ?mean; to search help files for a word or phrase, use ??mean or help.search('mean')
help(package='tidyverse') - find help for a package

Vectors are fundamental data structures in R. There are two types: atomic vectors and lists.
Atomic vectors can be one of six data types. The four most common are listed here; the other two (complex and raw) are rarely needed:
typeof(x) | examples |
---|---|
double | 3, 3.32 |
integer | 1L, 144L |
character | "hello", 'hello, world!' |
logical | TRUE, F |
They are called atomic because they must contain only one type.
Create them with c() (for concatenate), or with sequences seq() or repetitions rep().
Check their type with typeof(x), which returns the type of vector x, or with is.*(x), which returns TRUE if x has type * (for example, is.character(x)).
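Here is a short sketch of creating atomic vectors and checking their types. The variable names are made up for illustration:

```r
# Create atomic vectors with c()
dbl <- c(3, 3.32)                    # double
int <- c(1L, 144L)                   # integer (note the L suffix)
chr <- c("hello", "hello, world!")   # character
lgl <- c(TRUE, FALSE)                # logical

# Create vectors with sequences and repetitions
evens  <- seq(2, 10, by = 2)   # 2 4 6 8 10
threes <- rep(3, times = 4)    # 3 3 3 3

# Check the type
typeof(dbl)         # "double"
typeof(int)         # "integer"
is.character(chr)   # TRUE
is.logical(dbl)     # FALSE
length(evens)       # 5
```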
If you try to include elements of different types, R will coerce them into the same type without warning (implicit coercion)
You can also use explicit coercion to change a vector to another data type with as.*()
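To make the two kinds of coercion concrete, here is a minimal sketch. The values are arbitrary examples:

```r
# Implicit coercion: mixed types are silently converted to one type
mixed <- c(1, "a", TRUE)
typeof(mixed)   # "character"
mixed           # "1" "a" "TRUE"

# Logical combined with integer becomes integer
c(TRUE, 2L)     # 1 2

# Explicit coercion with as.*()
as.numeric(c("1", "2.5"))   # 1.0 2.5
as.integer(TRUE)            # 1
as.character(10)            # "10"
```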
Some more complex data structures are built from atomic vectors by adding attributes:
Structure | Description |
---|---|
matrix | vector with dim attribute representing 2 dimensions |
array | vector with dim attribute representing n dimensions |
data.frame | a named list of vectors (of equal length) with attributes for names (column names), row.names, and class="data.frame" |
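The sketch below shows how these structures relate to plain vectors. The object names and the small data frame are invented for illustration:

```r
# A plain vector has no attributes
v <- 1:6
attributes(v)    # NULL

# Adding a dim attribute turns the same values into a 2 x 3 matrix
m <- v
dim(m) <- c(2, 3)
is.matrix(m)     # TRUE

# A dim attribute with more than 2 entries gives an array
arr <- array(1:24, dim = c(2, 3, 4))
dim(arr)         # 2 3 4

# A data frame is a list of equal-length vectors with extra attributes
df <- data.frame(word = c("cat", "dog"), count = c(3, 5))
typeof(df)       # "list"
attributes(df)   # includes names, row.names, and class
```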
Operator | Operation |
---|---|
() | Parentheses |
^ | Exponent |
* | Multiply |
/ | Divide |
+ | Add |
- | Subtract |
Arithmetic operators follow the order of operations you expect (PEMDAS)
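A few quick checks of operator precedence, as a sketch you can run in a code cell:

```r
2 + 3 * 4     # 14: multiplication before addition
(2 + 3) * 4   # 20: parentheses first
2 * 3^2       # 18: exponent before multiplication
10 / 4 - 1    # 1.5: division before subtraction
-2^2          # -4: the exponent is applied before the minus sign
```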
Operator | Comparison |
---|---|
x < y | less than |
x > y | greater than |
x <= y | less than or equal to |
x >= y | greater than or equal to |
x != y | not equal to |
x == y | equal to |
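Each comparison returns a logical value. A minimal sketch with two example values:

```r
x <- 5
y <- 10

x < y     # TRUE
x > y     # FALSE
x <= y    # TRUE
x >= y    # FALSE
x != y    # TRUE
x == 5    # TRUE (two equals signs compare; a single = would assign)
```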
Operator | Operation |
---|---|
x \| y | or |
x & y | and |
!x | not |
any() | true if any element meets condition |
all() | true if all elements meet condition |
%in% | for each element on the left, true if it appears in the vector on the right |
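Here is a small sketch of the logical operators applied to an example vector:

```r
x <- c(1, 5, 10)

x > 2            # FALSE TRUE TRUE
x > 2 & x < 8    # FALSE TRUE FALSE
x < 2 | x > 8    # TRUE FALSE TRUE
!(x > 2)         # TRUE FALSE FALSE
any(x > 8)       # TRUE
all(x > 8)       # FALSE
c(5, 7) %in% x   # TRUE FALSE
```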
Almost all operations (and many functions) are vectorized
Operators and functions will also coerce values when needed (and without warning)
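A brief sketch of both ideas: operations apply to every element of a vector at once, and logical values are quietly treated as 0 and 1 when arithmetic needs numbers.

```r
x <- c(1, 2, 3)

# Vectorized: the operation is applied element by element
x * 2               # 2 4 6
x + c(10, 20, 30)   # 11 22 33
sqrt(c(4, 9))       # 2 3

# Coercion inside operators and functions: TRUE becomes 1, FALSE becomes 0
TRUE + TRUE                 # 2
sum(c(TRUE, FALSE, TRUE))   # 2
mean(x > 1)                 # proportion of elements greater than 1 (2 of 3)
```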
Subsetting is a natural complement to str(). While str() shows you all the pieces of any object (its structure), subsetting allows you to pull out the pieces that you’re interested in. ~ Hadley Wickham, Advanced R
There are three operators for subsetting objects:
[ - subsets (one or more) elements
[[ and $ - extract a single element
[
Code | Returns |
---|---|
x[c(1,2)] | positive integers select elements at specified indexes |
x[-c(1,2)] | negative integers select all but elements at specified indexes |
x[c("x", "y")] | select elements by name, if elements are named |
x[] | nothing returns the original object |
x[0] | zero returns a zero-length vector |
x[c(TRUE, TRUE)] | select elements where corresponding logical value is TRUE |
[
atomic vector
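For an atomic vector, the subsetting rules for [ can be tried out like this. The named vector is an invented example:

```r
x <- c(a = 10, b = 20, c = 30, d = 40)

x[c(1, 2)]        # elements 1 and 2: 10 20
x[-c(1, 2)]       # everything except elements 1 and 2: 30 40
x[c("a", "d")]    # by name: 10 40
x[]               # the original vector
x[0]              # a zero-length vector
x[c(TRUE, FALSE, TRUE, FALSE)]   # where TRUE: 10 30
x[x > 15]         # logical test as the index: 20 30 40
```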
Code | Returns |
---|---|
x[[2]] | a single positive integer (index) |
x[['name']] | a single string |
x$name | the $ operator is a useful shorthand for [['name']] |
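A minimal sketch of extracting single elements from a list. The list `item` and its contents are made up for illustration:

```r
item <- list(word = "dog", count = 5, tags = c("noun", "animal"))

item[[2]]         # 5, extracted by position
item[["word"]]    # "dog", extracted by name
item$word         # "dog", shorthand for item[["word"]]

# Compare with [, which keeps the result wrapped in a list
typeof(item["count"])     # "list"
typeof(item[["count"]])   # "double"
```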
NA
NA is contagious: expressions including NA usually return NA
Check for NA values with is.na()
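The sketch below illustrates how NA spreads through expressions and how to test for it. The vector is an arbitrary example:

```r
x <- c(1, NA, 3)

# NA is contagious
NA + 1     # NA
NA > 5     # NA
mean(x)    # NA

# "Usually", not always: this is TRUE whatever the missing value is
NA | TRUE  # TRUE

# Comparing with == does not work for finding NA; use is.na()
x == NA    # NA NA NA
is.na(x)   # FALSE TRUE FALSE

# Many functions can drop missing values on request
mean(x, na.rm = TRUE)   # 2
```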
Functions are reusable pieces of code that take some input, perform some task or computation, and return an output
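As a small sketch, here are two hypothetical functions (`add_ten` and `raise` are names invented for this example, not part of R):

```r
# Input: x. Computation: add 10. Output: the result of the last expression.
add_ten <- function(x) {
  x + 10
}

add_ten(5)         # 15
add_ten(c(1, 2))   # 11 12 (works on vectors too)

# Arguments can have default values
raise <- function(x, power = 2) {
  x^power
}

raise(3)              # 9
raise(3, power = 3)   # 27
```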
Control flow refers to managing the order in which expressions are executed in a program
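To give a rough idea of what this looks like in R, the sketch below uses an if/else statement, a for loop that skips one iteration with next, and a while loop that stops with break. All values are arbitrary examples.

```r
# if...else: choose between two branches
x <- 7
if (x > 5) {
  size <- "big"
} else {
  size <- "small"
}
size    # "big"

# for loop: repeat once per element; next skips the rest of an iteration
total <- 0
for (i in 1:5) {
  if (i == 3) next
  total <- total + i
}
total   # 12 (1 + 2 + 4 + 5)

# while loop: repeat while the condition holds; break exits early
n <- 0
while (TRUE) {
  n <- n + 1
  if (n >= 4) break
}
n       # 4
```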
if…else - if something is true, do this; otherwise do that
for loops - repeat code a specific number of times
while loops - repeat code as long as certain conditions are true
break - exit a loop early
next - skip to next iteration in a loop

If we have time
[ with higher dim objects
[[ and $: both [[ and [ work for vectors; use [[
$ does partial matching without warning
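These last points can be sketched as follows. The matrix, data frame, and list are invented examples:

```r
# [ with a matrix: one index per dimension, as [row, column]
m <- matrix(1:6, nrow = 2)   # 2 x 3, filled column by column
m[1, 2]    # 3
m[, 3]     # 5 6 (all rows, column 3)
m[2, ]     # 2 4 6 (row 2, all columns)

# [ and [[ with a data frame
df <- data.frame(word = c("cat", "dog"), count = c(3, 5),
                 stringsAsFactors = FALSE)
df[df$count > 4, "word"]   # "dog"
df[["count"]]              # 3 5

# $ does partial matching without warning; [[ requires the exact name
lst <- list(abc = 1)
lst$a        # 1
lst[["a"]]   # NULL
```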
Have a great weekend!