Importing data in R
You can lean R with the dataset it comes with when you install it in your machine. But sometimes you want to use the real data you or someone gathered already. One of critical steps for data processing is to import data with special format into R workspace.Data import refers to read data from the working directory into the workspace. In this chapter you will learn how to import common files into R. We will only focus on two common types of tabular data storage format—The comma-seprated .csv
and excell spreadsheet (.xlsx
). In later chapter we will explain how to read other types of data into R.
Comma-Separated (.csv)
The most commonly format that R like is the comma-separated
files. Although Base R provides various functions like read.table()
, read.csv()
, read.table()
and read.csv2()
to import data from the local directories into R workspace, for this book we use an read_csv()
function from readr
. Before we import the data, we need to load the packages that we will use their functions in this chapeter
require(dplyr)
require(readr)
require(lubridate)
require(readxl)
require(haven)
require(ggplot2)
require(kableExtra)
require(magrittr)
Consider a tabular data stored in my working directory in the .csv
format in figure 1.
We can import it with the read_csv()
functions as:
algoa.ctd = read_csv("algoa_ctd.csv")
-- Column specification --------------------------------------------------------
cols(
station = col_character(),
time = col_datetime(format = ""),
lon = col_double(),
lat = col_double(),
pressure = col_double(),
temperature = col_double(),
salinity = col_double(),
oxygen = col_double(),
fluorescence = col_double(),
spar = col_double(),
par = col_double(),
density = col_double()
)
When read_csv()
has imported the data into R workspace, it prints out the name and type of of data for each variable.
By simply glimpse the dataset, we see the format of the data is as expected. It has six variables(columns) and 177 observations (rows) similar to figure 1. Table 1 show sample of imported dataset.
Station | Time | Lon | Lat | Pressure | Temperature | Salinity | Oxygen | Fluorescence |
---|---|---|---|---|---|---|---|---|
st1 | 2004-08-18 | 40.61 | -10.54 | 5 | 25.17 | 33.92 | 3.93 | 0.56 |
st1 | 2004-08-18 | 40.61 | -10.54 | 10 | 25.13 | 34.86 | 4.49 | 0.60 |
st1 | 2004-08-18 | 40.61 | -10.54 | 15 | 25.11 | 34.86 | 4.50 | 0.65 |
st1 | 2004-08-18 | 40.61 | -10.54 | 20 | 25.04 | 34.86 | 4.51 | 0.68 |
st1 | 2004-08-18 | 40.61 | -10.54 | 25 | 24.95 | 34.86 | 4.51 | 0.76 |
st1 | 2004-08-18 | 40.61 | -10.54 | 30 | 24.91 | 34.86 | 4.50 | 0.73 |
st1 | 2004-08-18 | 40.61 | -10.54 | 35 | 24.88 | 34.87 | 4.49 | 0.74 |
st1 | 2004-08-18 | 40.61 | -10.54 | 40 | 24.85 | 34.87 | 4.48 | 0.69 |
st1 | 2004-08-18 | 40.61 | -10.54 | 45 | 24.80 | 34.88 | 4.46 | 0.70 |
st1 | 2004-08-18 | 40.61 | -10.54 | 50 | 24.61 | 34.89 | 4.44 | 0.75 |
Microsoft Excel(.xlsx)
Commonly our data is stored as a MS Excel file. we can import the file with read_xlsx()
function of readxl
package. The readxl
package provides a function read_exel() that allows us to specify which sheet within the Excel file to read and what character specifies missing data (it assumes a blank cell is missing data if you don’t specifying anything). The function automatically convert the worksheet into a .csv
file and read it. Let’s us import the the data in first sheet of the primary_productivity.xlsx
. The dataset contain primary productivity value. We will use this file to illustrate how to import the excel file into R workspace with readxl
package [@readxl].
sheet1 = readxl::read_xlsx("primary_productivity.xlsx", sheet = 1)
sheet1 %>% sample_n(5)
FALSE # A tibble: 5 x 7
FALSE date year value month day site variable
FALSE <dttm> <dbl> <chr> <dbl> <dbl> <chr> <chr>
FALSE 1 2011-07-15 00:00:00 2011 863.95789473684204 7 15 Pemba pp
FALSE 2 2013-01-15 00:00:00 2013 589.247983870967 1 15 Pemba pp
FALSE 3 2016-03-15 00:00:00 2016 635.401123046875 3 15 Pemba pp
FALSE 4 2010-12-15 00:00:00 2010 575.15543619791595 12 15 Pemba pp
FALSE 5 2003-05-15 00:00:00 2003 720.907401315789 5 15 Pemba pp
By printing the sheet1, we notice that the sheet contains monthly average value of primary productivity from the Pemba channel.
sheet2 = readxl::read_xlsx("primary_productivity.xlsx", sheet = 2)
sheet2 %>% sample_n(5)
FALSE # A tibble: 5 x 7
FALSE date year value month day site variable
FALSE <dttm> <dbl> <chr> <dbl> <dbl> <chr> <chr>
FALSE 1 2015-06-15 00:00:00 2015 822.32483057228899 6 15 Zanzibar pp
FALSE 2 2004-02-15 00:00:00 2004 706.53722734899304 2 15 Zanzibar pp
FALSE 3 2006-03-15 00:00:00 2006 681.77963709677397 3 15 Zanzibar pp
FALSE 4 2014-02-15 00:00:00 2014 618.70003939075605 2 15 Zanzibar pp
FALSE 5 2006-06-15 00:00:00 2006 987.41377314814804 6 15 Zanzibar pp
sheet2 contains monthly average value of primary productivity from the Zanzibar channel.
sheet3 = readxl::read_xlsx("primary_productivity.xlsx", sheet = 3)
sheet3 %>% sample_n(5)
FALSE # A tibble: 5 x 7
FALSE date year value month day site variable
FALSE <dttm> <dbl> <chr> <dbl> <dbl> <chr> <chr>
FALSE 1 2013-06-15 00:00:00 2013 1278.17880794701 6 15 Mafia pp
FALSE 2 2010-01-15 00:00:00 2010 1129.8221318493099 1 15 Mafia pp
FALSE 3 2004-05-15 00:00:00 2004 1216.4666526845599 5 15 Mafia pp
FALSE 4 2004-06-15 00:00:00 2004 1307.9894453642301 6 15 Mafia pp
FALSE 5 2012-09-15 00:00:00 2012 1254.1508246527701 9 15 Mafia pp
sheet3 contains monthly average value of primary productivity from the Mafia channel.
We look on the internal structure of the sheet3
file with the glimpse()
function. You can interact with the table that show all variables and observations (Table ??)
sheet3%>%glimpse()
FALSE Rows: 192
FALSE Columns: 7
FALSE $ date <dttm> 2003-01-15, 2003-02-15, 2003-03-15, 2003-04-15, 2003-05-15, ~
FALSE $ year <dbl> 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2~
FALSE $ value <chr> "1311.50104865771", "1211.3158482142801", "1302.45291940789",~
FALSE $ month <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8~
FALSE $ day <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 1~
FALSE $ site <chr> "Mafia", "Mafia", "Mafia", "Mafia", "Mafia", "Mafia", "Mafia"~
FALSE $ variable <chr> "pp", "pp", "pp", "pp", "pp", "pp", "pp", "pp", "pp", "pp", "~
sheet2 %>% DT::datatable(rownames = FALSE, caption = "An Interactive table of primary productivity in the Zanzibar channel")
Writing t a File
Sometimes you work in the document and you want to export to a file. readr has
write_csv()
and write_tsv()
functions that allows to export data frames from workspace to working directory
write_csv(x = sheet1, path = "./data/Primary_productivity_Pemba.csv")
@r4d recomment the use of write_excel_csv()
function when you want to export a data frame to Excel. readr has other tools that export files to other software like SAS, SPSS and more …
write_excel_csv(x = sheet1, path = "./data/Primary_productivity_Pemba.csv")
Basic Data Manipulation
In this section, we brifely introduce some basic data handling and manipulation techniques, which are mostly associated with data frame. A data frame is a a tabular shaped contains columns and rows of equal length. In general a data frame structure with rows representing observations or measurements and with columns containing variables.
Explore the Data Frame
We can visualize the table by simply run the name of the data flights
octopus = read_csv("octopus_data.csv")
we can use class()
to check if the data is data frame
octopus %>% class()
FALSE [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
We can use names()
to extract the variable names
octopus %>% names()
FALSE [1] "date" "village" "port" "ground" "sex" "dml" "tl"
FALSE [8] "weight" "lat" "lon"
We can explore the internal structure of flights
object with a dplyr()
’s function glimpse()
octopus %>% glimpse()
FALSE Rows: 1,079
FALSE Columns: 10
FALSE $ date <date> 2018-02-12, 2018-01-30, 2018-02-01, 2018-01-21, 2018-03-03, 2~
FALSE $ village <chr> "Somanga", "Bwejuu", "Somanga", "Somanga", "Somanga", "Somanga~
FALSE $ port <chr> "Mbuyuni", "Kusini", "Mbuyuni", "Mbuyuni", "Mbuyuni", "Mbuyuni~
FALSE $ ground <chr> "CHAMBA CHA MACHANGE", "NYAMALILE", "BANIANI", "CHAMBA CHA SEL~
FALSE $ sex <chr> "F", "M", "M", "M", "M", "F", "M", "M", "M", "F", "F", "F", "M~
FALSE $ dml <dbl> 14.0, 14.5, 17.0, 20.0, 12.0, 16.0, 15.0, 17.0, 12.0, 12.0, 9.~
FALSE $ tl <dbl> 110.0, 115.0, 115.0, 130.0, 68.0, 90.0, 96.0, 110.0, 79.0, 84.~
FALSE $ weight <dbl> 1.385, 1.750, 1.000, 2.601, 0.670, 0.870, 1.020, 1.990, 0.730,~
FALSE $ lat <dbl> -8.397838, -7.915809, -8.392644, -8.391614, -8.391146, -8.3881~
FALSE $ lon <dbl> 39.28079, 39.65424, 39.28153, 39.28089, 39.28251, 39.28196, 39~
We can check how rows (observations/measurements) and columns (variables/fields) are in the data
octopus %>% dim()
FALSE [1] 1079 10
The number of rows (observation) can be obtained using nrow()
function
octopus %>% nrow()
FALSE [1] 1079
The number of columns (variables) can be obtained using ncol()
function
octopus %>% ncol()
FALSE [1] 10
The length of the data frame is given by
octopus %>% length()
FALSE [1] 10
Count the number of sample at each sex of octopus
octopus %$% table(sex)
FALSE sex
FALSE F M
FALSE 581 498
Count the number and compute the proportion of sample at each sex of octopus
octopus %$% table(sex) %>% prop.table() %>% round(digits = 2)
FALSE sex
FALSE F M
FALSE 0.54 0.46
simmple summary statistics
The most helpful function for for summarizing rows and columns is summary()
, which gives a collection of basim cummary statistics. The first method is to calculate some basic summary statistics (minimum, 25th, 50th, 75th percentiles, maximum and mean) of each column. If a column is categorical, the summary function will return the number of observations in each category.
octopus %>%
summary()
FALSE date village port ground
FALSE Min. :2017-12-18 Length:1079 Length:1079 Length:1079
FALSE 1st Qu.:2018-01-14 Class :character Class :character Class :character
FALSE Median :2018-01-20 Mode :character Mode :character Mode :character
FALSE Mean :2018-01-26
FALSE 3rd Qu.:2018-02-15
FALSE Max. :2018-03-12
FALSE sex dml tl weight
FALSE Length:1079 Min. : 6.0 Min. : 11.00 Min. :0.055
FALSE Class :character 1st Qu.:10.0 1st Qu.: 68.00 1st Qu.:0.600
FALSE Mode :character Median :12.0 Median : 82.00 Median :0.915
FALSE Mean :12.8 Mean : 86.01 Mean :1.232
FALSE 3rd Qu.:15.0 3rd Qu.:100.00 3rd Qu.:1.577
FALSE Max. :24.0 Max. :180.00 Max. :5.210
FALSE lat lon
FALSE Min. :-8.904 Min. : 0.00
FALSE 1st Qu.:-8.523 1st Qu.:39.28
FALSE Median :-8.392 Median :39.50
FALSE Mean :-8.069 Mean :38.69
FALSE 3rd Qu.:-7.973 3rd Qu.:39.67
FALSE Max. : 0.000 Max. :39.75
You noticed that the summary()
function provide the common metric for central tendency
and measure of dispersion
. We will look at them later. The next post takes us to our favorite package dplyr.