CSV: Initial Import
Raw Data
Student ID,Full Name,favourite.food,mealPlan,AGE,DOB
1,Sunil Huffmann,Strawberry yoghurt,Lunch only,4,10/26/2019
2,Barclay Lynn,French fries,Lunch only,5,3/14/2018
3,Jayendra Lyne,N/A,Breakfast and lunch,7,7/15/2016
4,Leon Rossini,Anchovies,Lunch only,,
5,Chidiegwu Dunkel,Pizza,Breakfast and lunch,five,9/14/2018
6,Güvenç Attila,Ice cream,Lunch only,6,1/23/2019
library (readr)
students <- read_csv ("data/students.csv" )
students
# A tibble: 6 × 6
`Student ID` `Full Name` favourite.food mealPlan AGE DOB
<dbl> <chr> <chr> <chr> <chr> <chr>
1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4 10/2…
2 2 Barclay Lynn French fries Lunch only 5 3/14…
3 3 Jayendra Lyne N/A Breakfast and lu… 7 7/15…
4 4 Leon Rossini Anchovies Lunch only <NA> <NA>
5 5 Chidiegwu Dunkel Pizza Breakfast and lu… five 9/14…
6 6 Güvenç Attila Ice cream Lunch only 6 1/23…
CSV: Fix Variable Names with dplyr::rename()
library (dplyr)
students <- read_csv ("data/students.csv" ) |>
rename (
student_id = ` Student ID ` ,
full_name = ` Full Name ` ,
favorite_food = favourite.food,
meal_plan = mealPlan,
age = AGE
)
students
# A tibble: 6 × 6
student_id full_name favorite_food meal_plan age DOB
<dbl> <chr> <chr> <chr> <chr> <chr>
1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4 10/2…
2 2 Barclay Lynn French fries Lunch only 5 3/14…
3 3 Jayendra Lyne N/A Breakfast and lunch 7 7/15…
4 4 Leon Rossini Anchovies Lunch only <NA> <NA>
5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five 9/14…
6 6 Güvenç Attila Ice cream Lunch only 6 1/23…
CSV: Fix Variable Names with janitor::clean_names()
library (janitor)
students <- read_csv ("data/students.csv" ) |>
clean_names ()
students
# A tibble: 6 × 6
student_id full_name favourite_food meal_plan age dob
<dbl> <chr> <chr> <chr> <chr> <chr>
1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4 10/2…
2 2 Barclay Lynn French fries Lunch only 5 3/14…
3 3 Jayendra Lyne N/A Breakfast and lunch 7 7/15…
4 4 Leon Rossini Anchovies Lunch only <NA> <NA>
5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five 9/14…
6 6 Güvenç Attila Ice cream Lunch only 6 1/23…
CSV: Files Without Variable Names
1,Sunil Huffmann,Strawberry yoghurt,Lunch only,4
2,Barclay Lynn,French fries,Lunch only,5
3,Jayendra Lyne,N/A,Breakfast and lunch,7
4,Leon Rossini,Anchovies,Lunch only,
5,Chidiegwu Dunkel,Pizza,Breakfast and lunch,five
6,Güvenç Attila,Ice cream,Lunch only,6
students <- read_csv (
"data/students_no_variable_names.csv" ,
col_names = c ("student_id" , "full_name" , "favorite_food" , "meal_plan" , "age" , "dob" )
)
students
# A tibble: 6 × 6
student_id full_name favorite_food meal_plan age dob
<dbl> <chr> <chr> <chr> <chr> <chr>
1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4 10/2…
2 2 Barclay Lynn French fries Lunch only 5 3/14…
3 3 Jayendra Lyne N/A Breakfast and lunch 7 7/15…
4 4 Leon Rossini Anchovies Lunch only <NA> <NA>
5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five 9/14…
6 6 Güvenç Attila Ice cream Lunch only 6 1/23…
CSV: Missing Data
Raw Data
Student ID,Full Name,favourite.food,mealPlan,AGE
1,Sunil Huffmann,Strawberry yoghurt,Lunch only,4
2,Barclay Lynn,French fries,Lunch only,5
3,Jayendra Lyne,N/A,Breakfast and lunch,7
4,Leon Rossini,Anchovies,Lunch only,
5,Chidiegwu Dunkel,Pizza,Breakfast and lunch,five
6,Güvenç Attila,Ice cream,Lunch only,6
# A tibble: 6 × 6
student_id full_name favorite_food meal_plan age dob
<dbl> <chr> <chr> <chr> <chr> <chr>
1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4 10/2…
2 2 Barclay Lynn French fries Lunch only 5 3/14…
3 3 Jayendra Lyne N/A Breakfast and lunch 7 7/15…
4 4 Leon Rossini Anchovies Lunch only <NA> <NA>
5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five 9/14…
6 6 Güvenç Attila Ice cream Lunch only 6 1/23…
By default, only “” is treated as N/A
CSV: Missing Data
students <- read_csv ("data/students.csv" , na = c ("N/A" , "" )) |>
clean_names ()
students
# A tibble: 6 × 6
student_id full_name favourite_food meal_plan age dob
<dbl> <chr> <chr> <chr> <chr> <chr>
1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4 10/2…
2 2 Barclay Lynn French fries Lunch only 5 3/14…
3 3 Jayendra Lyne <NA> Breakfast and lunch 7 7/15…
4 4 Leon Rossini Anchovies Lunch only <NA> <NA>
5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five 9/14…
6 6 Güvenç Attila Ice cream Lunch only 6 1/23…
If you specify other NA values, then you have to explicitly specify “”
CSV: Variable Types
# A tibble: 6 × 6
student_id full_name favourite_food meal_plan age dob
<dbl> <chr> <chr> <chr> <chr> <chr>
1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4 10/2…
2 2 Barclay Lynn French fries Lunch only 5 3/14…
3 3 Jayendra Lyne <NA> Breakfast and lunch 7 7/15…
4 4 Leon Rossini Anchovies Lunch only <NA> <NA>
5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five 9/14…
6 6 Güvenç Attila Ice cream Lunch only 6 1/23…
CSV: Variable Types
students <- read_csv (
"data/students.csv" ,
na = c ("N/A" , "" ),
col_types = cols (mealPlan = col_factor (), DOB = col_date (format = "%m/%d/%Y" ))
) |>
clean_names ()
students
# A tibble: 6 × 6
student_id full_name favourite_food meal_plan age dob
<dbl> <chr> <chr> <fct> <chr> <date>
1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4 2019-10-26
2 2 Barclay Lynn French fries Lunch only 5 2018-03-14
3 3 Jayendra Lyne <NA> Breakfast and… 7 2016-07-15
4 4 Leon Rossini Anchovies Lunch only <NA> NA
5 5 Chidiegwu Dunkel Pizza Breakfast and… five 2018-09-14
6 6 Güvenç Attila Ice cream Lunch only 6 2019-01-23
CSV: Exporting Data
write_csv (students, "data/students_clean.csv" )
read_csv ("data/students_clean.csv" )
# A tibble: 6 × 6
student_id full_name favourite_food meal_plan age dob
<dbl> <chr> <chr> <chr> <chr> <date>
1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4 2019-10-26
2 2 Barclay Lynn French fries Lunch only 5 2018-03-14
3 3 Jayendra Lyne <NA> Breakfast and… 7 2016-07-15
4 4 Leon Rossini Anchovies Lunch only <NA> NA
5 5 Chidiegwu Dunkel Pizza Breakfast and… five 2018-09-14
6 6 Güvenç Attila Ice cream Lunch only 6 2019-01-23
We lose the factor, but keep the data b/c readr
writes dates in CSVs in the way that it automatically recognizes them during import.
Excel: Initial Import
library (readxl)
read_excel ("data/students.xlsx" )
# A tibble: 6 × 6
`Student ID` `Full Name` favourite.food mealPlan AGE DOB
<dbl> <chr> <chr> <chr> <chr> <dttm>
1 1 Sunil Huffmann Strawberry yo… Lunch o… 4 2019-10-26 00:00:00
2 2 Barclay Lynn French fries Lunch o… 5 2018-03-14 00:00:00
3 3 Jayendra Lyne N/A Breakfa… 7 2016-07-15 00:00:00
4 4 Leon Rossini Anchovies Lunch o… <NA> NA
5 5 Chidiegwu Dunk… Pizza Breakfa… five 2018-09-14 00:00:00
6 6 Güvenç Attila Ice cream Lunch o… 6 2019-01-23 00:00:00
Excel: Multiple Tabs
read_excel ("data/students.xlsx" , sheet = "Student List" )
# A tibble: 6 × 6
`Student ID` `Full Name` favourite.food mealPlan AGE DOB
<dbl> <chr> <chr> <chr> <chr> <dttm>
1 1 Sunil Huffmann Strawberry yo… Lunch o… 4 2019-10-26 00:00:00
2 2 Barclay Lynn French fries Lunch o… 5 2018-03-14 00:00:00
3 3 Jayendra Lyne N/A Breakfa… 7 2016-07-15 00:00:00
4 4 Leon Rossini Anchovies Lunch o… <NA> NA
5 5 Chidiegwu Dunk… Pizza Breakfa… five 2018-09-14 00:00:00
6 6 Güvenç Attila Ice cream Lunch o… 6 2019-01-23 00:00:00
Excel: Specify Column Names
read_excel (
"data/students.xlsx" ,
col_names = c ("student_id" , "full_name" , "favorite_food" , "meal_plan" , "age" , "dob" ),
skip = 1
)
# A tibble: 6 × 6
student_id full_name favorite_food meal_plan age dob
<dbl> <chr> <chr> <chr> <chr> <dttm>
1 1 Sunil Huffmann Strawberry yo… Lunch on… 4 2019-10-26 00:00:00
2 2 Barclay Lynn French fries Lunch on… 5 2018-03-14 00:00:00
3 3 Jayendra Lyne N/A Breakfas… 7 2016-07-15 00:00:00
4 4 Leon Rossini Anchovies Lunch on… <NA> NA
5 5 Chidiegwu Dunkel Pizza Breakfas… five 2018-09-14 00:00:00
6 6 Güvenç Attila Ice cream Lunch on… 6 2019-01-23 00:00:00
Excel: Variable Types & Missing Data
excel_students = read_excel (
"data/students.xlsx" ,
col_types = c ("text" , "guess" , "guess" , "skip" , "guess" , "date" ),
na = c ("" , "N/A" )
) |>
clean_names ()
excel_students
# A tibble: 6 × 5
student_id full_name favourite_food age dob
<chr> <chr> <chr> <chr> <dttm>
1 1 Sunil Huffmann Strawberry yoghurt 4 2019-10-26 00:00:00
2 2 Barclay Lynn French fries 5 2018-03-14 00:00:00
3 3 Jayendra Lyne <NA> 7 2016-07-15 00:00:00
4 4 Leon Rossini Anchovies <NA> NA
5 5 Chidiegwu Dunkel Pizza five 2018-09-14 00:00:00
6 6 Güvenç Attila Ice cream 6 2019-01-23 00:00:00
Unlike CSVs if you want to specify one variable’s type, you have to specify all of them.
Excel: A Warning About Dates
read_excel (
"data/students.xlsx" ,
col_types = c ("text" , "guess" , "guess" , "skip" , "guess" , "text" ),
na = c ("" , "N/A" )
) |>
clean_names ()
# A tibble: 6 × 5
student_id full_name favourite_food age dob
<chr> <chr> <chr> <chr> <chr>
1 1 Sunil Huffmann Strawberry yoghurt 4 43764
2 2 Barclay Lynn French fries 5 43173
3 3 Jayendra Lyne <NA> 7 42566
4 4 Leon Rossini Anchovies <NA> <NA>
5 5 Chidiegwu Dunkel Pizza five 43357
6 6 Güvenç Attila Ice cream 6 43488
Excel: Exporting Data
library (writexl)
write_xlsx (excel_students, "data/students_clean.xlsx" )
read_excel ("data/students_clean.xlsx" )
# A tibble: 6 × 5
student_id full_name favourite_food age dob
<chr> <chr> <chr> <chr> <dttm>
1 1 Sunil Huffmann Strawberry yoghurt 4 2019-10-26 00:00:00
2 2 Barclay Lynn French fries 5 2018-03-14 00:00:00
3 3 Jayendra Lyne <NA> 7 2016-07-15 00:00:00
4 4 Leon Rossini Anchovies <NA> NA
5 5 Chidiegwu Dunkel Pizza five 2018-09-14 00:00:00
6 6 Güvenç Attila Ice cream 6 2019-01-23 00:00:00
Stata: Initial Import
--------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
--------------------------------------------------------------------------------
make str18 %-18s Make and model
price int %8.0gc Price
mpg int %8.0g Mileage (mpg)
rep78 int %9.0g repair Repair record 1978
headroom float %6.1f Headroom (in.)
review_date float %td
trunk int %8.0g Trunk space (cu. ft.)
weight int %8.0gc Weight (lbs.)
length int %8.0g Length (in.)
turn int %8.0g Turn circle (ft.)
displacement int %8.0g Displacement (cu. in.)
gear_ratio float %6.2f Gear ratio
foreign byte %8.0g origin Car origin
--------------------------------------------------------------------------------
Stata: Initial Import
library (haven)
auto2 = read_dta ("data/auto2.dta" )
auto2
# A tibble: 74 × 13
make price mpg rep78 headroom review_date trunk weight length turn
<chr> <dbl> <dbl> <dbl+lb> <dbl> <date> <dbl> <dbl> <dbl> <dbl>
1 AMC Conc… 4099 22 3 [Ave… 2.5 1978-04-28 11 2930 186 40
2 AMC Pacer 4749 17 3 [Ave… 3 1978-03-27 11 3350 173 40
3 AMC Spir… 3799 22 NA 3 1978-06-10 12 2640 168 35
4 Buick Ce… 4816 20 3 [Ave… 4.5 1978-02-09 16 3250 196 40
5 Buick El… 7827 15 4 [Goo… 4 1978-11-21 20 4080 222 43
6 Buick Le… 5788 18 3 [Ave… 4 1978-01-14 21 3670 218 43
7 Buick Op… 4453 26 NA 3 1978-04-03 10 2230 170 34
8 Buick Re… 5189 20 3 [Ave… 2 1978-10-23 16 3280 200 42
9 Buick Ri… 10372 16 3 [Ave… 3.5 1978-07-28 17 3880 207 43
10 Buick Sk… 4082 19 3 [Ave… 3.5 1978-11-12 13 3400 200 42
# ℹ 64 more rows
# ℹ 3 more variables: displacement <dbl>, gear_ratio <dbl>, foreign <dbl+lbl>
Always a good idea to have the haven
package fully loaded when working with data from Stata. Either directly imported or a previously imported Stata file and saved as intermediate .rds
file.
Stata: Variable Labels
<labelled<double>[6]>: Repair record 1978
[1] 3 3 NA 3 4 3
Labels:
value label
1 Poor
2 Fair
3 Average
4 Good
5 Excellent
Stata: Using Categorical Variables
auto2 |>
filter (rep78 == 1 )
# A tibble: 2 × 13
make price mpg rep78 headroom review_date trunk weight length turn
<chr> <dbl> <dbl> <dbl+l> <dbl> <date> <dbl> <dbl> <dbl> <dbl>
1 Olds Starf… 4195 24 1 [Poo… 2 1978-12-17 10 2730 180 40
2 Pont. Fire… 4934 18 1 [Poo… 1.5 1978-02-02 7 3470 198 42
# ℹ 3 more variables: displacement <dbl>, gear_ratio <dbl>, foreign <dbl+lbl>
Stata: Using Categorical Variables
# A tibble: 6 × 2
rep78 n
<dbl+lbl> <int>
1 1 [Poor] 2
2 2 [Fair] 8
3 3 [Average] 30
4 4 [Good] 18
5 5 [Excellent] 11
6 NA 5
auto2 |>
mutate (rep78 = as_factor (rep78)) |>
count (rep78)
# A tibble: 6 × 2
rep78 n
<fct> <int>
1 Poor 2
2 Fair 8
3 Average 30
4 Good 18
5 Excellent 11
6 <NA> 5
Stata: Accessing Variable Labels
library (labelled)
var_label (auto2$ make)
$make
[1] "Make and model"
$price
[1] "Price"
$mpg
[1] "Mileage (mpg)"
$rep78
[1] "Repair record 1978"
$headroom
[1] "Headroom (in.)"
$review_date
NULL
$trunk
[1] "Trunk space (cu. ft.)"
$weight
[1] "Weight (lbs.)"
$length
[1] "Length (in.)"
$turn
[1] "Turn circle (ft.)"
$displacement
[1] "Displacement (cu. in.)"
$gear_ratio
[1] "Gear ratio"
$foreign
[1] "Car origin"
Stata: Exporting Data
write_dta (auto2, "data/auto2_from_r.dta" )
--------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
--------------------------------------------------------------------------------
make str17 %-18s Make and model
price double %8.0gc Price
mpg double %8.0g Mileage (mpg)
rep78 double %9.0g rep78 Repair record 1978
headroom double %6.1f Headroom (in.)
review_date double %td
trunk double %8.0g Trunk space (cu. ft.)
weight double %8.0gc Weight (lbs.)
length double %8.0g Length (in.)
turn double %8.0g Turn circle (ft.)
displacement double %8.0g Displacement (cu. in.)
gear_ratio double %6.2f Gear ratio
foreign double %8.0g foreign Car origin
--------------------------------------------------------------------------------
Stata: Exporting Data
Be careful converting labelled variables to factors and then exporting back to Stata
write_dta (auto2, "data/auto2_from_r.dta" )
use auto2_from_r.dta
codebook foreign
-------------------------------------------------------
foreign Car origin
-------------------------------------------------------
Type: Numeric (double)
Label: foreign
Range: [0,1] Units: 1
Unique values: 2 Missing .: 0/74
Tabulation: Freq. Numeric Label
52 0 Domestic
22 1 Foreign
auto2 |>
mutate (foreign = as_factor (foreign)) |>
write_dta ("data/auto2_factor_version.dta" )
use auto2_factor_version.dta
codebook foreign
-------------------------------------------------------
foreign Car origin
-------------------------------------------------------
Type: Numeric (long)
Label: foreign
Range: [1,2] Units: 1
Unique values: 2 Missing .: 0/74
Tabulation: Freq. Numeric Label
52 1 Domestic
22 2 Foreign