Importing Data in R

Data Science Salon - Session 3

Eve Perry

November 8, 2023

Outline

  • Types of Data
  • Useful Packages
  • Examples

R for Data Science Chapter 8: Data Import

Types of Data

General Purpose Flat Files

File Type Usual File Extensions R Packages
Comma Separated (CSV) .csv readr, base R
Excel Spreadsheets .xls, .xlsx readxl , writexl
Tab Separated .tsv readr, base R
Fixed Width Varies readr, base R

Stats Packages Flat Files

File Type Usual File Extensions R Packages
R .rds readr, base R
Stata .dta haven, labelled
SAS .sas7bdat , .sas7bcat , .xpt haven
SPSS .sav, .por haven

Advanced Data

Examples

CSV: Initial Import

Raw Data

Student ID,Full Name,favourite.food,mealPlan,AGE,DOB
1,Sunil Huffmann,Strawberry yoghurt,Lunch only,4,10/26/2019
2,Barclay Lynn,French fries,Lunch only,5,3/14/2018
3,Jayendra Lyne,N/A,Breakfast and lunch,7,7/15/2016
4,Leon Rossini,Anchovies,Lunch only,,
5,Chidiegwu Dunkel,Pizza,Breakfast and lunch,five,9/14/2018
6,Güvenç Attila,Ice cream,Lunch only,6,1/23/2019
library(readr)
students <- read_csv("data/students.csv") 

students
# A tibble: 6 × 6
  `Student ID` `Full Name`      favourite.food     mealPlan          AGE   DOB  
         <dbl> <chr>            <chr>              <chr>             <chr> <chr>
1            1 Sunil Huffmann   Strawberry yoghurt Lunch only        4     10/2…
2            2 Barclay Lynn     French fries       Lunch only        5     3/14…
3            3 Jayendra Lyne    N/A                Breakfast and lu… 7     7/15…
4            4 Leon Rossini     Anchovies          Lunch only        <NA>  <NA> 
5            5 Chidiegwu Dunkel Pizza              Breakfast and lu… five  9/14…
6            6 Güvenç Attila    Ice cream          Lunch only        6     1/23…

CSV: Fix Variable Names with dplyr::rename()

library(dplyr)

students <- read_csv("data/students.csv") |> 
  rename(
    student_id = `Student ID`,
    full_name = `Full Name`,
    favorite_food = favourite.food,
    meal_plan = mealPlan,
    age = AGE
  )

students
# A tibble: 6 × 6
  student_id full_name        favorite_food      meal_plan           age   DOB  
       <dbl> <chr>            <chr>              <chr>               <chr> <chr>
1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4     10/2…
2          2 Barclay Lynn     French fries       Lunch only          5     3/14…
3          3 Jayendra Lyne    N/A                Breakfast and lunch 7     7/15…
4          4 Leon Rossini     Anchovies          Lunch only          <NA>  <NA> 
5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five  9/14…
6          6 Güvenç Attila    Ice cream          Lunch only          6     1/23…

CSV: Fix Variable Names with janitor::clean_names()

library(janitor)

students <- read_csv("data/students.csv") |> 
  clean_names()

students
# A tibble: 6 × 6
  student_id full_name        favourite_food     meal_plan           age   dob  
       <dbl> <chr>            <chr>              <chr>               <chr> <chr>
1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4     10/2…
2          2 Barclay Lynn     French fries       Lunch only          5     3/14…
3          3 Jayendra Lyne    N/A                Breakfast and lunch 7     7/15…
4          4 Leon Rossini     Anchovies          Lunch only          <NA>  <NA> 
5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five  9/14…
6          6 Güvenç Attila    Ice cream          Lunch only          6     1/23…

CSV: Files Without Variable Names

1,Sunil Huffmann,Strawberry yoghurt,Lunch only,4
2,Barclay Lynn,French fries,Lunch only,5
3,Jayendra Lyne,N/A,Breakfast and lunch,7
4,Leon Rossini,Anchovies,Lunch only,
5,Chidiegwu Dunkel,Pizza,Breakfast and lunch,five
6,Güvenç Attila,Ice cream,Lunch only,6
students <- read_csv(
  "data/students_no_variable_names.csv",
  col_names = c("student_id", "full_name", "favorite_food", "meal_plan", "age", "dob")
)

students
# A tibble: 6 × 6
  student_id full_name        favorite_food      meal_plan           age   dob  
       <dbl> <chr>            <chr>              <chr>               <chr> <chr>
1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4     10/2…
2          2 Barclay Lynn     French fries       Lunch only          5     3/14…
3          3 Jayendra Lyne    N/A                Breakfast and lunch 7     7/15…
4          4 Leon Rossini     Anchovies          Lunch only          <NA>  <NA> 
5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five  9/14…
6          6 Güvenç Attila    Ice cream          Lunch only          6     1/23…

CSV: Missing Data

Raw Data

Student ID,Full Name,favourite.food,mealPlan,AGE
1,Sunil Huffmann,Strawberry yoghurt,Lunch only,4
2,Barclay Lynn,French fries,Lunch only,5
3,Jayendra Lyne,N/A,Breakfast and lunch,7
4,Leon Rossini,Anchovies,Lunch only,
5,Chidiegwu Dunkel,Pizza,Breakfast and lunch,five
6,Güvenç Attila,Ice cream,Lunch only,6
students
# A tibble: 6 × 6
  student_id full_name        favorite_food      meal_plan           age   dob  
       <dbl> <chr>            <chr>              <chr>               <chr> <chr>
1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4     10/2…
2          2 Barclay Lynn     French fries       Lunch only          5     3/14…
3          3 Jayendra Lyne    N/A                Breakfast and lunch 7     7/15…
4          4 Leon Rossini     Anchovies          Lunch only          <NA>  <NA> 
5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five  9/14…
6          6 Güvenç Attila    Ice cream          Lunch only          6     1/23…

CSV: Missing Data

students <- read_csv("data/students.csv", na = c("N/A", "")) |> 
  clean_names()

students
# A tibble: 6 × 6
  student_id full_name        favourite_food     meal_plan           age   dob  
       <dbl> <chr>            <chr>              <chr>               <chr> <chr>
1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4     10/2…
2          2 Barclay Lynn     French fries       Lunch only          5     3/14…
3          3 Jayendra Lyne    <NA>               Breakfast and lunch 7     7/15…
4          4 Leon Rossini     Anchovies          Lunch only          <NA>  <NA> 
5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five  9/14…
6          6 Güvenç Attila    Ice cream          Lunch only          6     1/23…

CSV: Variable Types

# A tibble: 6 × 6
  student_id full_name        favourite_food     meal_plan           age   dob  
       <dbl> <chr>            <chr>              <chr>               <chr> <chr>
1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4     10/2…
2          2 Barclay Lynn     French fries       Lunch only          5     3/14…
3          3 Jayendra Lyne    <NA>               Breakfast and lunch 7     7/15…
4          4 Leon Rossini     Anchovies          Lunch only          <NA>  <NA> 
5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five  9/14…
6          6 Güvenç Attila    Ice cream          Lunch only          6     1/23…

CSV: Variable Types

students <- read_csv(
  "data/students.csv",
  na = c("N/A", ""),
  col_types = cols(mealPlan = col_factor(), DOB = col_date(format = "%m/%d/%Y"))
) |> 
  clean_names()

students
# A tibble: 6 × 6
  student_id full_name        favourite_food     meal_plan      age   dob       
       <dbl> <chr>            <chr>              <fct>          <chr> <date>    
1          1 Sunil Huffmann   Strawberry yoghurt Lunch only     4     2019-10-26
2          2 Barclay Lynn     French fries       Lunch only     5     2018-03-14
3          3 Jayendra Lyne    <NA>               Breakfast and… 7     2016-07-15
4          4 Leon Rossini     Anchovies          Lunch only     <NA>  NA        
5          5 Chidiegwu Dunkel Pizza              Breakfast and… five  2018-09-14
6          6 Güvenç Attila    Ice cream          Lunch only     6     2019-01-23

CSV: Exporting Data

write_csv(students, "data/students_clean.csv")

read_csv("data/students_clean.csv")
# A tibble: 6 × 6
  student_id full_name        favourite_food     meal_plan      age   dob       
       <dbl> <chr>            <chr>              <chr>          <chr> <date>    
1          1 Sunil Huffmann   Strawberry yoghurt Lunch only     4     2019-10-26
2          2 Barclay Lynn     French fries       Lunch only     5     2018-03-14
3          3 Jayendra Lyne    <NA>               Breakfast and… 7     2016-07-15
4          4 Leon Rossini     Anchovies          Lunch only     <NA>  NA        
5          5 Chidiegwu Dunkel Pizza              Breakfast and… five  2018-09-14
6          6 Güvenç Attila    Ice cream          Lunch only     6     2019-01-23

Excel: Initial Import

library(readxl)

read_excel("data/students.xlsx")
# A tibble: 6 × 6
  `Student ID` `Full Name`     favourite.food mealPlan AGE   DOB                
         <dbl> <chr>           <chr>          <chr>    <chr> <dttm>             
1            1 Sunil Huffmann  Strawberry yo… Lunch o… 4     2019-10-26 00:00:00
2            2 Barclay Lynn    French fries   Lunch o… 5     2018-03-14 00:00:00
3            3 Jayendra Lyne   N/A            Breakfa… 7     2016-07-15 00:00:00
4            4 Leon Rossini    Anchovies      Lunch o… <NA>  NA                 
5            5 Chidiegwu Dunk… Pizza          Breakfa… five  2018-09-14 00:00:00
6            6 Güvenç Attila   Ice cream      Lunch o… 6     2019-01-23 00:00:00

Excel: Multiple Tabs

read_excel("data/students.xlsx", sheet = "Student List")
# A tibble: 6 × 6
  `Student ID` `Full Name`     favourite.food mealPlan AGE   DOB                
         <dbl> <chr>           <chr>          <chr>    <chr> <dttm>             
1            1 Sunil Huffmann  Strawberry yo… Lunch o… 4     2019-10-26 00:00:00
2            2 Barclay Lynn    French fries   Lunch o… 5     2018-03-14 00:00:00
3            3 Jayendra Lyne   N/A            Breakfa… 7     2016-07-15 00:00:00
4            4 Leon Rossini    Anchovies      Lunch o… <NA>  NA                 
5            5 Chidiegwu Dunk… Pizza          Breakfa… five  2018-09-14 00:00:00
6            6 Güvenç Attila   Ice cream      Lunch o… 6     2019-01-23 00:00:00

Excel: Specify Column Names

read_excel(
  "data/students.xlsx",
  col_names = c("student_id", "full_name", "favorite_food", "meal_plan", "age", "dob"),
  skip = 1
)
# A tibble: 6 × 6
  student_id full_name        favorite_food  meal_plan age   dob                
       <dbl> <chr>            <chr>          <chr>     <chr> <dttm>             
1          1 Sunil Huffmann   Strawberry yo… Lunch on… 4     2019-10-26 00:00:00
2          2 Barclay Lynn     French fries   Lunch on… 5     2018-03-14 00:00:00
3          3 Jayendra Lyne    N/A            Breakfas… 7     2016-07-15 00:00:00
4          4 Leon Rossini     Anchovies      Lunch on… <NA>  NA                 
5          5 Chidiegwu Dunkel Pizza          Breakfas… five  2018-09-14 00:00:00
6          6 Güvenç Attila    Ice cream      Lunch on… 6     2019-01-23 00:00:00

Excel: Variable Types & Missing Data

excel_students = read_excel(
  "data/students.xlsx",
  col_types = c("text", "guess", "guess", "skip", "guess", "date"),
  na = c("", "N/A")
) |> 
  clean_names()

excel_students
# A tibble: 6 × 5
  student_id full_name        favourite_food     age   dob                
  <chr>      <chr>            <chr>              <chr> <dttm>             
1 1          Sunil Huffmann   Strawberry yoghurt 4     2019-10-26 00:00:00
2 2          Barclay Lynn     French fries       5     2018-03-14 00:00:00
3 3          Jayendra Lyne    <NA>               7     2016-07-15 00:00:00
4 4          Leon Rossini     Anchovies          <NA>  NA                 
5 5          Chidiegwu Dunkel Pizza              five  2018-09-14 00:00:00
6 6          Güvenç Attila    Ice cream          6     2019-01-23 00:00:00

Excel: A Warning About Dates

read_excel(
  "data/students.xlsx",
  col_types = c("text", "guess", "guess", "skip", "guess", "text"),
  na = c("", "N/A")
) |> 
  clean_names()
# A tibble: 6 × 5
  student_id full_name        favourite_food     age   dob  
  <chr>      <chr>            <chr>              <chr> <chr>
1 1          Sunil Huffmann   Strawberry yoghurt 4     43764
2 2          Barclay Lynn     French fries       5     43173
3 3          Jayendra Lyne    <NA>               7     42566
4 4          Leon Rossini     Anchovies          <NA>  <NA> 
5 5          Chidiegwu Dunkel Pizza              five  43357
6 6          Güvenç Attila    Ice cream          6     43488

Excel: Exporting Data

library(writexl)

write_xlsx(excel_students, "data/students_clean.xlsx")

read_excel("data/students_clean.xlsx")
# A tibble: 6 × 5
  student_id full_name        favourite_food     age   dob                
  <chr>      <chr>            <chr>              <chr> <dttm>             
1 1          Sunil Huffmann   Strawberry yoghurt 4     2019-10-26 00:00:00
2 2          Barclay Lynn     French fries       5     2018-03-14 00:00:00
3 3          Jayendra Lyne    <NA>               7     2016-07-15 00:00:00
4 4          Leon Rossini     Anchovies          <NA>  NA                 
5 5          Chidiegwu Dunkel Pizza              five  2018-09-14 00:00:00
6 6          Güvenç Attila    Ice cream          6     2019-01-23 00:00:00

RDS: R’s storage format

write_rds(students, "data/students_clean.rds")

students_rds <- read_rds("data/students_clean.rds")

students_rds
# A tibble: 6 × 6
  student_id full_name        favourite_food     meal_plan      age   dob       
       <dbl> <chr>            <chr>              <fct>          <chr> <date>    
1          1 Sunil Huffmann   Strawberry yoghurt Lunch only     4     2019-10-26
2          2 Barclay Lynn     French fries       Lunch only     5     2018-03-14
3          3 Jayendra Lyne    <NA>               Breakfast and… 7     2016-07-15
4          4 Leon Rossini     Anchovies          Lunch only     <NA>  NA        
5          5 Chidiegwu Dunkel Pizza              Breakfast and… five  2018-09-14
6          6 Güvenç Attila    Ice cream          Lunch only     6     2019-01-23

Stata: Initial Import

--------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
--------------------------------------------------------------------------------
make            str18   %-18s                 Make and model
price           int     %8.0gc                Price
mpg             int     %8.0g                 Mileage (mpg)
rep78           int     %9.0g      repair     Repair record 1978
headroom        float   %6.1f                 Headroom (in.)
review_date     float   %td                   
trunk           int     %8.0g                 Trunk space (cu. ft.)
weight          int     %8.0gc                Weight (lbs.)
length          int     %8.0g                 Length (in.)
turn            int     %8.0g                 Turn circle (ft.)
displacement    int     %8.0g                 Displacement (cu. in.)
gear_ratio      float   %6.2f                 Gear ratio
foreign         byte    %8.0g      origin     Car origin
--------------------------------------------------------------------------------

Stata: Initial Import

library(haven)

auto2 = read_dta("data/auto2.dta")

auto2
# A tibble: 74 × 13
   make      price   mpg rep78    headroom review_date trunk weight length  turn
   <chr>     <dbl> <dbl> <dbl+lb>    <dbl> <date>      <dbl>  <dbl>  <dbl> <dbl>
 1 AMC Conc…  4099    22  3 [Ave…      2.5 1978-04-28     11   2930    186    40
 2 AMC Pacer  4749    17  3 [Ave…      3   1978-03-27     11   3350    173    40
 3 AMC Spir…  3799    22 NA            3   1978-06-10     12   2640    168    35
 4 Buick Ce…  4816    20  3 [Ave…      4.5 1978-02-09     16   3250    196    40
 5 Buick El…  7827    15  4 [Goo…      4   1978-11-21     20   4080    222    43
 6 Buick Le…  5788    18  3 [Ave…      4   1978-01-14     21   3670    218    43
 7 Buick Op…  4453    26 NA            3   1978-04-03     10   2230    170    34
 8 Buick Re…  5189    20  3 [Ave…      2   1978-10-23     16   3280    200    42
 9 Buick Ri… 10372    16  3 [Ave…      3.5 1978-07-28     17   3880    207    43
10 Buick Sk…  4082    19  3 [Ave…      3.5 1978-11-12     13   3400    200    42
# ℹ 64 more rows
# ℹ 3 more variables: displacement <dbl>, gear_ratio <dbl>, foreign <dbl+lbl>

Stata: Variable Labels

head(auto2$rep78)
<labelled<double>[6]>: Repair record 1978
[1]  3  3 NA  3  4  3

Labels:
 value     label
     1      Poor
     2      Fair
     3   Average
     4      Good
     5 Excellent

Stata: Using Categorical Variables

auto2 |> 
  filter(rep78 == 1)
# A tibble: 2 × 13
  make        price   mpg rep78   headroom review_date trunk weight length  turn
  <chr>       <dbl> <dbl> <dbl+l>    <dbl> <date>      <dbl>  <dbl>  <dbl> <dbl>
1 Olds Starf…  4195    24 1 [Poo…      2   1978-12-17     10   2730    180    40
2 Pont. Fire…  4934    18 1 [Poo…      1.5 1978-02-02      7   3470    198    42
# ℹ 3 more variables: displacement <dbl>, gear_ratio <dbl>, foreign <dbl+lbl>

Stata: Using Categorical Variables

auto2 |> 
  count(rep78)
# A tibble: 6 × 2
  rep78              n
  <dbl+lbl>      <int>
1  1 [Poor]          2
2  2 [Fair]          8
3  3 [Average]      30
4  4 [Good]         18
5  5 [Excellent]    11
6 NA                 5
auto2 |>
  mutate(rep78 = as_factor(rep78)) |>
  count(rep78)
# A tibble: 6 × 2
  rep78         n
  <fct>     <int>
1 Poor          2
2 Fair          8
3 Average      30
4 Good         18
5 Excellent    11
6 <NA>          5

Stata: Accessing Variable Labels

library(labelled)

var_label(auto2$make)
[1] "Make and model"
var_label(auto2)
$make
[1] "Make and model"

$price
[1] "Price"

$mpg
[1] "Mileage (mpg)"

$rep78
[1] "Repair record 1978"

$headroom
[1] "Headroom (in.)"

$review_date
NULL

$trunk
[1] "Trunk space (cu. ft.)"

$weight
[1] "Weight (lbs.)"

$length
[1] "Length (in.)"

$turn
[1] "Turn circle (ft.)"

$displacement
[1] "Displacement (cu. in.)"

$gear_ratio
[1] "Gear ratio"

$foreign
[1] "Car origin"

Stata: Exporting Data

write_dta(auto2, "data/auto2_from_r.dta")
use auto2_from_r.dta

des
--------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
--------------------------------------------------------------------------------
make            str17   %-18s                 Make and model
price           double  %8.0gc                Price
mpg             double  %8.0g                 Mileage (mpg)
rep78           double  %9.0g      rep78      Repair record 1978
headroom        double  %6.1f                 Headroom (in.)
review_date     double  %td                   
trunk           double  %8.0g                 Trunk space (cu. ft.)
weight          double  %8.0gc                Weight (lbs.)
length          double  %8.0g                 Length (in.)
turn            double  %8.0g                 Turn circle (ft.)
displacement    double  %8.0g                 Displacement (cu. in.)
gear_ratio      double  %6.2f                 Gear ratio
foreign         double  %8.0g      foreign    Car origin
--------------------------------------------------------------------------------

Stata: Exporting Data

Be careful converting labelled variables to factors and then exporting back to Stata

write_dta(auto2, "data/auto2_from_r.dta")
use auto2_from_r.dta

codebook foreign
-------------------------------------------------------
foreign                                      Car origin
-------------------------------------------------------

                  Type: Numeric (double)
                 Label: foreign

                 Range: [0,1]                         Units: 1
         Unique values: 2                         Missing .: 0/74

            Tabulation: Freq.   Numeric  Label
                           52         0  Domestic
                           22         1  Foreign
auto2 |> 
  mutate(foreign = as_factor(foreign)) |> 
  write_dta("data/auto2_factor_version.dta")
use auto2_factor_version.dta

codebook foreign
-------------------------------------------------------
foreign                                      Car origin
-------------------------------------------------------

                  Type: Numeric (long)
                 Label: foreign

                 Range: [1,2]                         Units: 1
         Unique values: 2                         Missing .: 0/74

            Tabulation: Freq.   Numeric  Label
                           52         1  Domestic
                           22         2  Foreign