Data Science for Studying Language and the Mind
2024-09-10
Data wrangling
Data Science Workflow by R4DS
purrr - functional programming
tibble - modern data.frame
readr - reading data
The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
ggplot2 - for data visualization
dplyr - for data wrangling
readr - for reading data
tibble - for modern data frames
stringr - for string manipulation
forcats - for dealing with factors
tidyr - for data tidying
purrr - for functional programming
Already installed on Google Colab’s R kernel:
Returns a message in Google Colab:
Tidyverse makes use of tidy data, a standard way of structuring datasets:
Visual of tidy data rules, from R for Data Science
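As a rough illustration of the tidy data rules in R for Data Science (each variable is a column, each observation is a row, each value is a cell), here is a small sketch. The data, the column names, and the choice of tidyr's pivot_longer() are assumptions made for this example, not part of the original slides.

```r
library(tibble)
library(tidyr)

# Hypothetical wide data: one column per session, so the variable
# "session" is hidden in the column names.
wide <- tibble(
  participant = c("p1", "p2"),
  session1 = c(512, 430),
  session2 = c(498, 441)
)

# Tidy version: one column per variable, one row per observation.
tidy <- pivot_longer(
  wide,
  cols = c(session1, session2),
  names_to = "session",
  values_to = "rt"
)
tidy
```

After reshaping, every row is one participant in one session, which is the shape most tidyverse functions expect.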
Why tidy data?
purrr
Functional programming
to illustrate the joy of tidyverse and tidy data
purrr enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors. If you’ve never heard of FP before, the best place to start is the family of map() functions, which allow you to replace many for loops with code that is both more succinct and easier to read.
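To make the "replace a for loop" idea concrete, here is a minimal sketch that computes the same result both ways. The list scores and the variable names are hypothetical, chosen only for illustration.

```r
library(purrr)

# Hypothetical input: a named list of numeric vectors.
scores <- list(a = c(1, 2, 3), b = c(10, 20), c = c(5, 5, 5, 5))

# for-loop version: pre-allocate the result, then fill it element by element.
means_loop <- numeric(length(scores))
for (i in seq_along(scores)) {
  means_loop[i] <- mean(scores[[i]])
}

# map_dbl() version: apply mean() to each element and return a double vector.
means_map <- map_dbl(scores, mean)

# Other variants fix the output type: map_int() returns integers,
# map_chr() returns character strings, and map() always returns a list.
n_values <- map_int(scores, length)
labels <- map_chr(scores, function(x) paste(length(x), "values"))
as_list <- map(scores, range)
```

The map versions keep the names of the input list, and each variant fails loudly if a result does not match the promised type.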
map_*() functions
We say “functions” because there are 5, one for each type of vector:
map() - list
map_lgl() - logical
map_int() - integer
map_dbl() - double
map_chr() - character
map use case
tibble
modern data frames
A tibble, or tbl_df, is a modern reimagining of the data.frame, keeping what time has proven to be effective, and throwing out what is not. Tibbles are data.frames that are lazy and surly: they do less and complain more.
Tibbles do less than data frames, in a good way:
Coerce an existing object:
# A tibble: 4 × 2
x y
<int> <chr>
1 1 a
2 2 b
3 3 c
4 4 d
Pass a column of vectors:
Check whether an object is a tibble with is_tibble(x) and is.data.frame(x).
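The three steps above (coerce, create, test) can be sketched as follows. The data frame df is a hypothetical stand-in that matches the 4 × 2 output shown earlier; the original code cells are not preserved here.

```r
library(tibble)

# Coerce an existing data.frame (hypothetical example data).
df <- data.frame(x = 1:4, y = c("a", "b", "c", "d"))
tb_from_df <- as_tibble(df)

# Build a tibble directly by passing one vector per column.
tb <- tibble(x = 1:4, y = c("a", "b", "c", "d"))

is_tibble(df)       # FALSE: a plain data.frame
is_tibble(tb)       # TRUE
is.data.frame(tb)   # TRUE: a tibble is still a data.frame
```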
data.frame v tibble
You will encounter 2 main differences:
Printing: tibbles show each column's type (e.g. <dbl>, <chr>)
Subsetting: tibbles are stricter with [[ and $:
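A short sketch of the subsetting difference, using hypothetical column names. It relies on two documented behaviours: $ on a data.frame partially matches column names while $ on a tibble does not, and single-column [ on a tibble returns a tibble instead of dropping to a vector.

```r
library(tibble)

df <- data.frame(abc = 1:3, xyz = c("a", "b", "c"))
tb <- as_tibble(df)

# $ on a data.frame partially matches, so "a" silently finds "abc".
partial_df <- df$a

# $ on a tibble never partially matches: the result is NULL, with a warning.
partial_tb <- suppressWarnings(tb$a)

# Selecting one column with [ drops to a vector for a data.frame,
# but stays a tibble for a tibble.
class(df[, "abc"])
class(tb[, "abc"])

# [[ extracts the column as a plain vector in both cases.
tb[["abc"]]
```

Printing tb at the console also shows the dimensions and a type row under the column names, which a plain data.frame does not.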
readr
reading data
The goal of readr is to provide a fast and friendly way to read rectangular data from delimited files, such as comma-separated values (CSV) and tab-separated values (TSV). It is designed to parse many types of data found in the wild, while providing an informative problem report when parsing leads to unexpected results.
Figure 1: Sample csv file from R for Data Science
read_*()
The read_*() functions have two important arguments:
file - the path to the file
col_types - a list of how each column should be converted to a specific data type
read_*()
read_csv(): comma-separated values (CSV)
read_tsv(): tab-separated values (TSV)
read_csv2(): semicolon-separated values
read_delim(): delimited files (CSV and TSV are important special cases)
read_fwf(): fixed-width files
read_table(): whitespace-separated files
read_log(): web log files
csv files
Path only, readr guesses types:
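Here is a minimal sketch of the two arguments in use: first the path alone, then an explicit col_types specification. The file contents and column names are hypothetical, and the file is written to a temporary path only so that the example is self-contained.

```r
library(readr)

# Hypothetical CSV written to a temporary file.
path <- tempfile(fileext = ".csv")
writeLines(c("id,name,score", "1,Ana,3.5", "2,Ben,4", "3,Cy,2.25"), path)

# Path only: readr guesses the column types and reports its guesses.
guessed <- read_csv(path)

# Explicit column specification through col_types.
typed <- read_csv(
  path,
  col_types = cols(
    id = col_integer(),
    name = col_character(),
    score = col_double()
  )
)
```

With guessing, whole numbers such as id come back as double; the explicit specification is what makes id an integer column.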
col_types column specification
There are 11 column types that can be specified:
col_logical() - reads as boolean TRUE FALSE values
col_integer() - reads as integer
col_double() - reads as double
col_number() - numeric parser that can ignore non-numbers
col_character() - reads as strings
col_factor(levels, ordered = FALSE) - creates factors
col_datetime(format = "") - creates date-times
col_date(format = "") - creates dates
col_time(format = "") - creates times
col_skip() - skips a column
col_guess() - tries to guess the column
Reading more complex file types requires functions outside the tidyverse:
readxl - see Spreadsheets in R for Data Science
googlesheets4 - see Spreadsheets in R for Data Science
DBI - see Databases in R for Data Science
jsonlite - see Hierarchical data in R for Data Science
Write to a .csv file with
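The call itself is missing from the line above. Assuming it refers to readr's write_csv(), a round trip might look like the sketch below; the tibble and the temporary output path are hypothetical.

```r
library(readr)
library(tibble)

# Hypothetical data to write out.
students_demo <- tibble(id = 1:2, name = c("Ana", "Ben"))
out_path <- tempfile(fileext = ".csv")

# Write the tibble to disk, then read it back to check the result.
write_csv(students_demo, out_path)
read_back <- read_csv(out_path)
```

Column types are not stored in a CSV file, so they are guessed again when the file is read back.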
# A tibble: 6 × 5
`Student ID` `Full Name` favourite.food mealPlan AGE
<dbl> <chr> <chr> <chr> <chr>
1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
2 2 Barclay Lynn French fries Lunch only 5
3 3 Jayendra Lyne N/A Breakfast and lunch 7
4 4 Leon Rossini Anchovies Lunch only <NA>
5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
6 6 Güvenç Attila Ice cream Lunch only 6
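Problems in this dataset:
A column contains unexpected values (AGE)
Missing values are not coded as NA (AGE and favourite.food)
Column names contain spaces (Student ID and Full Name)
Your dataset has a column that you expected to be logical or double, but there is a typo somewhere, so R has coerced the column into character.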
Solve by specifying the column type col_double() and then using the problems() function to see where R failed.
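A sketch of that fix, using three rows copied from the students table as a stand-in file. The temporary path is an assumption, since the original file location is not shown.

```r
library(readr)

# Stand-in file with a subset of the students rows shown earlier.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "Student ID,Full Name,favourite.food,mealPlan,AGE",
  "1,Sunil Huffmann,Strawberry yoghurt,Lunch only,4",
  "4,Leon Rossini,Anchovies,Lunch only,",
  "5,Chidiegwu Dunkel,Pizza,Breakfast and lunch,five"
), path)

# Force AGE to double; values that cannot be parsed become NA with a warning.
students <- read_csv(path, col_types = cols(AGE = col_double()))

# problems() lists where parsing failed and what was found there.
age_problems <- problems(students)
age_problems
```

Only the columns named in cols() are forced; the rest are still guessed. The report points at the "five" entry, which can then be corrected in the source file or recoded after reading.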
NA
Your dataset has missing values, but they were not coded as NA as R expects.
Solve by adding an na argument (e.g. na = c("N/A")).
# A tibble: 6 × 5
`Student ID` `Full Name` favourite.food mealPlan AGE
<dbl> <chr> <chr> <chr> <chr>
1 1 Sunil Huffmann Strawberry yoghurt Lunch only "4"
2 2 Barclay Lynn French fries Lunch only "5"
3 3 Jayendra Lyne <NA> Breakfast and lunch "7"
4 4 Leon Rossini Anchovies Lunch only ""
5 5 Chidiegwu Dunkel Pizza Breakfast and lunch "five"
6 6 Güvenç Attila Ice cream Lunch only "6"
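A sketch of the na argument, again on a stand-in file with rows copied from the students table (the temporary path is an assumption). Passing na replaces readr's default set of missing-value strings, which is why the empty AGE entry shows up as "" in the output above. Listing both strings treats both as missing.

```r
library(readr)

# Stand-in file with two of the students rows shown earlier.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "Student ID,Full Name,favourite.food,mealPlan,AGE",
  "3,Jayendra Lyne,N/A,Breakfast and lunch,7",
  "4,Leon Rossini,Anchovies,Lunch only,"
), path)

# Only "N/A" is treated as missing.
students_na <- read_csv(path, na = c("N/A"))

# Treat both "N/A" and empty strings as missing.
students_both <- read_csv(path, na = c("N/A", ""))
```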
Your dataset has column names that include spaces, breaking R’s naming rules. In these cases, R adds backticks (e.g. `brain size`). We can use the rename() function to fix them.
# A tibble: 6 × 5
student_id full_name favourite.food mealPlan AGE
<dbl> <chr> <chr> <chr> <chr>
1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
2 2 Barclay Lynn French fries Lunch only 5
3 3 Jayendra Lyne N/A Breakfast and lunch 7
4 4 Leon Rossini Anchovies Lunch only <NA>
5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
6 6 Güvenç Attila Ice cream Lunch only 6
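The renaming step that produces the output above might look like this sketch. It uses dplyr's rename(), which takes new_name = old_name pairs, with backticks around the old names that contain spaces. The two-row tibble is a stand-in built from the students table.

```r
library(tibble)
library(dplyr)

# Stand-in for the students data, keeping only the two problem columns.
students <- tibble(
  `Student ID` = c(1, 2),
  `Full Name` = c("Sunil Huffmann", "Barclay Lynn")
)

# rename(data, new_name = old_name)
renamed <- rename(
  students,
  student_id = `Student ID`,
  full_name = `Full Name`
)
names(renamed)
```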
If we have a lot to rename and that gets annoying, see janitor::clean_names().