Data types in R

Masumbuko Semba

04.01.2020

Vectors

Often we want to store a set of values in one place. One way to do this in R is with a vector, which holds several values in a single container. Consider the example below:

id = c(1,2,3,4,5)
people = c(158,659,782,659,759)
street = c("Dege", "Mchikichini", "Mwembe Mdogo", "Mwongozo",  "Cheka")

Notice the c() function, short for concatenate, which wraps the list of values. The c() function combines all the values into one container. Notice also that the individual values are separated by commas. The comma is referred to as an item delimiter: it lets R hold each value separately. This is vital because, without the item delimiter, R cannot treat the values as separate elements.
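As a small sketch of how c() behaves, the block below uses made-up values (the names counts, labels, mixed and combined are illustrative only). It shows that a vector holds a single type, so mixing numbers and strings converts every element to character.

```r
# numeric and character vectors built with c()
counts = c(158, 659, 782)
labels = c("Dege", "Mchikichini", "Cheka")

class(counts)   # "numeric"
class(labels)   # "character"

# mixing types: R coerces every element to character
mixed = c(1, "Dege", 3)
class(mixed)    # "character"

# c() also joins an existing vector with further values
combined = c(counts, 578, 750)
length(combined)  # 5
```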

Indexing the element

One advantage of a vector is that you can extract an individual element by indexing, which is done with square brackets as illustrated below.

id[5]

FALSE [1] 5

people[5]

FALSE [1] 759

street[5]

FALSE [1] "Cheka"

Apart from extracting a single element, indexing can extract a range of elements from a vector. This is extremely important because it lets you subset a portion of the data in a vector. The colon operator is used to specify the range:

street[2:4]

FALSE [1] "Mchikichini"  "Mwembe Mdogo" "Mwongozo"
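Square brackets accept more than a single position or a range. The sketch below re-creates two of the vectors from above and shows three other common forms: a vector of positions, a negative index, and a logical test. The result names (picked, dropped, busy, busy.streets) and the threshold of 700 are only illustrative.

```r
street = c("Dege", "Mchikichini", "Mwembe Mdogo", "Mwongozo", "Cheka")
people = c(158, 659, 782, 659, 759)

# several positions at once
picked = street[c(1, 3, 5)]

# a negative index drops that position and keeps the rest
dropped = street[-1]

# a logical test keeps only the elements for which it is TRUE
busy = people[people > 700]

# a test on one vector can subset another vector of the same length
busy.streets = street[people > 700]
```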

Adding and Replacing an element in a vector

It is possible to add an element to an existing vector. Here is an example:

id[6] = 6
people[6] = 578
street[6] = "Mwongozo"

Sometimes you may need to replace an element in a vector. This can also be achieved with indexing:

people[1] = 750
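Base R also offers append() as an alternative to assigning by index. The sketch below, on a fresh copy of the people vector with made-up new values, adds one value at the end and one in the middle, then shows what happens when you assign well past the end of a vector.

```r
people = c(158, 659, 782, 659, 759)

# append() returns a new vector with the value added at the end
people = append(people, 578)

# the 'after' argument sets the position: 400 goes in after element 2
people = append(people, 400, after = 2)

# assigning past the end pads the gap with NA
people[10] = 999
people
length(people)
```

The NA padding is easy to trigger by accident, so it is worth checking length() after growing a vector by index.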

Number of elements in a vector

Sometimes you may have a long vector and want to know the number of elements it contains. R has the length() function, which queries the vector and prints the answer:

length(people)

FALSE [1] 6

Generating sequences of numbers

R has a few operators and functions designed for creating vectors of non-random numbers. They provide multiple ways of generating sequences of numbers.

The colon operator (:) generates a regular sequence of integers between the specified lower and upper bounds. For example, to generate the numbers from 0 to 10, we simply write:

vector.seq = 0:10
vector.seq

FALSE  [1]  0  1  2  3  4  5  6  7  8  9 10

However, if you want a sequence with a specified interval, say the numbers between 0 and 10 at an interval of 2, use the seq() function:

regular.vector = seq(from = 0,to = 10, by = 2)
regular.vector

FALSE [1]  0  2  4  6  8 10
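The seq() function has a few other useful forms. As a brief sketch (the variable names are illustrative), the block below asks for a fixed number of evenly spaced values with length.out, counts downwards with a negative by, and uses seq_len() to count from 1.

```r
# five evenly spaced values between 0 and 1
even.steps = seq(from = 0, to = 1, length.out = 5)

# a descending sequence needs a negative step
countdown = seq(from = 10, to = 0, by = -2)

# seq_len(n) gives the integers 1 to n
positions = seq_len(5)

even.steps; countdown; positions
```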

Unlike the seq() function and the : operator, which work only with numbers, the rep() function builds a vector by repeating numbers or strings:

id = rep(x = 3, each = 4)
station = rep(x = "Station1", each = 4)
id;station

FALSE [1] 3 3 3 3

FALSE [1] "Station1" "Station1" "Station1" "Station1"

The rep() function accepts the each and times arguments. The each argument creates a vector that repeats each element of the input the specified number of times:

sampled.months = c("January", "March", "May")
rep(x = sampled.months, each = 3)

FALSE [1] "January" "January" "January" "March"   "March"   "March"   "May"    
FALSE [8] "May"     "May"

The times argument, in contrast, repeats the whole vector the specified number of times:

rep(x = sampled.months, times = 3)

FALSE [1] "January" "March"   "May"     "January" "March"   "May"     "January"
FALSE [8] "March"   "May"
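The two arguments can also be combined, and times can take one count per element. The following sketch reuses sampled.months; the result names both and uneven are illustrative.

```r
sampled.months = c("January", "March", "May")

# each is applied first, then the result is repeated 'times' times
both = rep(x = sampled.months, each = 2, times = 2)

# a vector of counts repeats each element a different number of times
uneven = rep(x = sampled.months, times = c(1, 2, 3))

both; uneven
```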

Generating vector of normal distribution

The normal distribution is central to statistics: by the central limit theorem, the means of sufficiently large samples tend toward a normal distribution even when the underlying data are not normal. R has the rnorm() function, which creates a vector of normally distributed random values. For example, to generate a vector of 40 sea surface temperature values from a normal distribution with a mean of 25 and a standard deviation of 1.58, we simply type this expression in the console:

sst = rnorm(n = 40, mean = 25,sd = 1.58)
sst

FALSE  [1] 24.03922 23.80315 23.47413 26.27077 23.42572 23.75998 23.61258 23.35309
FALSE  [9] 24.69700 22.76580 24.71677 23.02477 27.00590 23.95165 26.52619 26.29226
FALSE [17] 25.35510 24.18597 27.37901 24.34999 24.38044 26.92839 21.37074 25.54579
FALSE [25] 26.55608 26.61256 25.71027 29.16311 25.19961 24.32504 26.28006 25.61089
FALSE [33] 24.85534 24.80107 25.56271 27.84438 24.39485 27.66808 25.80105 24.16359
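Because rnorm() draws random values, the numbers change every time the code runs. A minimal sketch of making the draw reproducible with set.seed() follows; the seed value 2020 and the names sst.a and sst.b are arbitrary choices for illustration.

```r
# the same seed gives the same "random" values
set.seed(2020)
sst.a = rnorm(n = 40, mean = 25, sd = 1.58)

set.seed(2020)
sst.b = rnorm(n = 40, mean = 25, sd = 1.58)

identical(sst.a, sst.b)   # TRUE

# the sample mean and sd are close to, but not exactly, 25 and 1.58
mean(sst.a); sd(sst.a)
```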

Rounding off numbers

There are several ways to round numbers to the nearest integer or to a specified number of decimal places. The code block below illustrates a common way to round, using round() with the digits argument:

library(magrittr)  # provides the %>% pipe
chl = rnorm(n = 20, mean = .55, sd = .2)
chl %>% round(digits = 2)

FALSE  [1] 0.43 0.58 0.18 0.55 0.85 0.38 0.38 0.40 0.60 0.20 0.74 0.44 0.75 0.37 0.72
FALSE [16] 0.83 0.54 0.53 0.61 0.51
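For comparison, here is a short sketch of the other base R rounding functions applied to a few made-up values (x and the result names are illustrative).

```r
x = c(0.4321, 1.5678, 25.349)

nearest  = round(x)               # nearest integer
one.dec  = round(x, digits = 1)   # one decimal place
down     = floor(x)               # always rounds down
up       = ceiling(x)             # always rounds up
two.sig  = signif(x, digits = 2)  # two significant digits

nearest; one.dec; down; up; two.sig
```

round() counts decimal places, while signif() counts significant digits, so the two differ most for very large or very small numbers.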

Data Frame

A data frame is very much like a simple Excel spreadsheet, where each column represents a variable and each row represents an observation. It is the most common way of storing data in R and, generally, the data structure most often used for data analysis. A data frame is a list of equal-length vectors, with rows as records and columns as variables. What sets data frames apart is that each column can hold a different class of object (numeric, character, factor, logical, etc.). In this section, we will create data frames and add attributes to them.

Creating data frames

Perhaps the easiest way to create a data frame is to pass vectors to the data.frame() function. For instance, here we create a simple data frame dt and inspect its internal structure:

# create vectors
Name  = c('Bob','Jeff','Mary')
Score = c(90, 75, 92)
Grade = c("A", "B", "A")

## use the vectors to make a data frame
dt = data.frame(Name, Score, Grade)

## assess the internal structure
str(dt)

FALSE 'data.frame': 3 obs. of  3 variables:
FALSE  $ Name : chr  "Bob" "Jeff" "Mary"
FALSE  $ Score: num  90 75 92
FALSE  $ Grade: chr  "A" "B" "A"

Note that the character columns Name and Grade are stored as chr. In versions of R before 4.0.0, data.frame() converted character columns to factors by default; since R 4.0.0 the default is stringsAsFactors = FALSE. To get the same behaviour regardless of the R version, set the stringsAsFactors = FALSE argument explicitly:

## use the vectors to make a data frame
df = data.frame(Name, Score, Grade, stringsAsFactors = FALSE)
df %>% str()

FALSE 'data.frame': 3 obs. of  3 variables:
FALSE  $ Name : chr  "Bob" "Jeff" "Mary"
FALSE  $ Score: num  90 75 92
FALSE  $ Grade: chr  "A" "B" "A"

Now the variable Name is of character class in the data frame. The long-standing tendency of data frames to convert character columns into factors is avoided by an enhanced data frame called a tibble, which provides stricter checking and better formatting than the traditional data.frame.

## load tibble, which provides tibble() and glimpse()
library(tibble)
## use the vectors to make a tibble
tb = tibble(Name, Score, Grade) 
## check the internal structure of the tibble
tb %>% glimpse()

FALSE Rows: 3
FALSE Columns: 3
FALSE $ Name  <chr> "Bob", "Jeff", "Mary"
FALSE $ Score <dbl> 90, 75, 92
FALSE $ Grade <chr> "A", "B", "A"
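When a column really is categorical, it can be converted to a factor explicitly rather than relying on a default. A minimal sketch on a plain data frame follows; treating Grade as the categorical variable, with levels A and B, is an assumption made for illustration.

```r
Name  = c("Bob", "Jeff", "Mary")
Score = c(90, 75, 92)
Grade = c("A", "B", "A")

df = data.frame(Name, Score, Grade, stringsAsFactors = FALSE)

# convert only the column that is meant to be categorical
df$Grade = factor(df$Grade, levels = c("A", "B"))

str(df)
levels(df$Grade)   # the categories
table(df$Grade)    # how many records fall in each category
```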

Table 1 shows the data frame created by combining the three vectors.

Table 1: Variables in the data frame
Name	Score	Grade
Bob	90	A
Jeff	75	B
Mary	92	A

Because the columns have meaning and we have given them names, it is desirable to access a column by its name rather than by its number. In large Excel spreadsheets I often get annoyed trying to remember which column held what. The $ sign and [] are used in R to select a variable from a data frame.

dt$Name

FALSE [1] "Bob"  "Jeff" "Mary"

dt[,1]

FALSE [1] "Bob"  "Jeff" "Mary"

dt$Score

FALSE [1] 90 75 92

dt[,2]

FALSE [1] 90 75 92
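The same bracket notation can select by column name and filter rows. The sketch below rebuilds dt so it runs on its own; the cut-off of 80 and the result names are only illustrative.

```r
dt = data.frame(Name  = c("Bob", "Jeff", "Mary"),
                Score = c(90, 75, 92),
                Grade = c("A", "B", "A"))

# double brackets with a name return the column as a vector, like $
scores = dt[["Score"]]

# a character vector of names selects several columns
name.grade = dt[, c("Name", "Grade")]

# a logical test before the comma filters rows
high = dt[dt$Score > 80, ]

# row number and column name together pick a single cell
jeff.score = dt[2, "Score"]
```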

R has built-in datasets that we can use for illustration. For example, the longley dataset (@longley) is a data frame with 7 economic variables observed yearly from 1947 to 1962 (Table 2). We can load the data into the workspace with the data() function:

library(knitr)       # provides kable()
library(kableExtra)  # provides column_spec()

data(longley)

longley %>% 
  kable(caption = "Longley's Economic dataset", align = "c", row.names = FALSE) %>%
  column_spec(1:7, width = "3cm")

Table 2: Longley's Economic dataset
GNP.deflator	GNP	Unemployed	Armed.Forces	Population	Year	Employed
83.0	234.289	235.6	159.0	107.608	1947	60.323
88.5	259.426	232.5	145.6	108.632	1948	61.122
88.2	258.054	368.2	161.6	109.773	1949	60.171
89.5	284.599	335.1	165.0	110.929	1950	61.187
96.2	328.975	209.9	309.9	112.075	1951	63.221
98.1	346.999	193.2	359.4	113.270	1952	63.639
99.0	365.385	187.0	354.7	115.094	1953	64.989
100.0	363.112	357.8	335.0	116.219	1954	63.761
101.2	397.469	290.4	304.8	117.388	1955	66.019
104.6	419.180	282.2	285.7	118.734	1956	67.857
108.4	442.769	293.6	279.8	120.445	1957	68.169
110.8	444.546	468.1	263.7	121.950	1958	66.513
112.6	482.704	381.3	255.2	123.366	1959	68.655
114.2	502.601	393.1	251.4	125.368	1960	69.564
115.7	518.173	480.6	257.2	127.852	1961	69.331
116.9	554.894	400.7	282.7	130.081	1962	70.551

Sometimes you may need to create sets of values, store them in vectors, and then combine the vectors into a data frame. Let us see how this can be done. First create three vectors: the first contains the ids of ten individuals, the second holds the time each individual signed the attendance book, and the third is the distance of each individual's residence from the office. We can concatenate each set of values to make a vector.

library(lubridate)  # provides ymd_hms()

id  = c(1,2,3,4,5,6,7,8,9,10)

time = ymd_hms(c("2018-11-20 06:35:25 EAT", "2018-11-20 06:52:05 EAT", 
                 "2018-11-20 07:08:45 EAT", "2018-11-20 07:25:25 EAT", 
                 "2018-11-20 07:42:05 EAT", "2018-11-20 07:58:45 EAT", 
                 "2018-11-20 08:15:25 EAT", "2018-11-20 08:32:05 EAT", 
                 "2018-11-20 08:48:45 EAT", "2018-11-20 09:05:25 EAT"), tz = "")

distance = c(20, 85, 45, 69, 42,  52, 6, 45, 36, 7)

Once we have vectors of the same length, we can use the data.frame() function to combine the three vectors into one data frame, shown in Table 3:

arrival = data.frame(id, time, distance)

Table 3: The time each employee entered the office and the distance from their residence to the office
IDs	Time	Distance
1	2018-11-20 06:35:25	20
2	2018-11-20 06:52:05	85
3	2018-11-20 07:08:45	45
4	2018-11-20 07:25:25	69
5	2018-11-20 07:42:05	42
6	2018-11-20 07:58:45	52
7	2018-11-20 08:15:25	6
8	2018-11-20 08:32:05	45
9	2018-11-20 08:48:45	36
10	2018-11-20 09:05:25	7

Matrix

A matrix is a collection of data elements arranged in a two-dimensional rectangular layout. R is strict about how a matrix is built: all columns must be the same length. Unlike a data frame or a list, which can hold numeric, character and other types in different columns, all elements of a matrix must be of a single type, for example all numeric or all character.

Creating Matrices

Base R has a matrix() function that constructs matrices column-wise. In other words, elements are entered starting from the upper left corner and running down the columns. Therefore, take care when specifying the values to fill a matrix and the number of rows and columns. For example, in the code block below we create imaginary monthly sst values for five years, which gives an atomic vector of 60 observations.

sst = rnorm(n = 60, mean = 25, sd = 3)

Once we have the atomic vector of sst values, we can convert it to a matrix with the matrix() function. We put the months in rows and the years in columns. We therefore have 12 rows and 5 columns, and their product is 60, the length of the sst atomic vector we just created.

sst.matrix = matrix(data = sst, nrow = 12, ncol = 5)

We then check whether we got a matrix with the is.matrix() function:

is.matrix(sst);is.matrix(sst.matrix)

FALSE [1] FALSE

FALSE [1] TRUE

sst

FALSE  [1] 25.08327 24.09097 26.34485 24.14146 24.64941 26.09739 23.91218 26.91260
FALSE  [9] 22.69753 24.84250 24.89610 21.91391 30.20031 22.91143 26.33853 24.99629
FALSE [17] 23.69466 30.10935 25.78983 25.05814 26.59829 26.77820 26.60457 27.57653
FALSE [25] 24.65990 26.95125 29.67351 20.43879 30.49668 25.97459 24.28109 25.72646
FALSE [33] 21.91946 21.91512 24.87088 27.31100 28.64027 26.58810 27.33460 21.98717
FALSE [41] 29.04679 25.85037 22.84274 22.30848 28.18201 25.51643 29.05196 24.12477
FALSE [49] 19.98520 30.13661 26.37739 25.49994 22.78731 25.80078 23.99339 23.64162
FALSE [57] 26.40151 29.94575 26.80928 23.84312

We can check whether the dimension we just defined while creating this matrix is correct. This is done with the dim() function from base R.

dim(sst.matrix)

FALSE [1] 12  5

If you have a large vector and want the matrix() function to work out the number of columns, you simply define nrow. You can also tell the function that you do not want the elements arranged by rows, i.e. you want them filled column-wise, by passing the argument byrow = FALSE (the default) to the matrix() function.

sst.matrixby = sst %>% matrix(nrow = 12, byrow = FALSE)
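The effect of byrow is easiest to see on a tiny matrix. In this sketch the values 1 to 6 and the names by.col and by.row are illustrative; only nrow is given, so matrix() works out that three columns are needed.

```r
values = 1:6

# default: fill down the columns
by.col = matrix(data = values, nrow = 2, byrow = FALSE)

# fill across the rows instead
by.row = matrix(data = values, nrow = 2, byrow = TRUE)

by.col   # first row is 1 3 5
by.row   # first row is 1 2 3
```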

Adding attributes to Matrices

Often you may need to add attributes to a matrix: observation names, variable names and comments.

We can add column names, which are the years from 2014 to 2018:

years = 2014:2018
colnames(sst.matrix) = years
sst.matrix

FALSE           2014     2015     2016     2017     2018
FALSE  [1,] 25.08327 30.20031 24.65990 28.64027 19.98520
FALSE  [2,] 24.09097 22.91143 26.95125 26.58810 30.13661
FALSE  [3,] 26.34485 26.33853 29.67351 27.33460 26.37739
FALSE  [4,] 24.14146 24.99629 20.43879 21.98717 25.49994
FALSE  [5,] 24.64941 23.69466 30.49668 29.04679 22.78731
FALSE  [6,] 26.09739 30.10935 25.97459 25.85037 25.80078
FALSE  [7,] 23.91218 25.78983 24.28109 22.84274 23.99339
FALSE  [8,] 26.91260 25.05814 25.72646 22.30848 23.64162
FALSE  [9,] 22.69753 26.59829 21.91946 28.18201 26.40151
FALSE [10,] 24.84250 26.77820 21.91512 25.51643 29.94575
FALSE [11,] 24.89610 26.60457 24.87088 29.05196 26.80928
FALSE [12,] 21.91391 27.57653 27.31100 24.12477 23.84312

We then add row names, which are the months from January to December. The matrix now has names for the rows (records) and for the columns (variables).

# dates are passed as strings so the leading zero in "010115" is preserved
months = seq(from = dmy("010115"), to = dmy("311215"), 
             by = "month") %>% month(abbr = TRUE, 
                                     label = TRUE)
rownames(sst.matrix) = months
sst.matrix

FALSE         2014     2015     2016     2017     2018
FALSE Jan 25.08327 30.20031 24.65990 28.64027 19.98520
FALSE Feb 24.09097 22.91143 26.95125 26.58810 30.13661
FALSE Mar 26.34485 26.33853 29.67351 27.33460 26.37739
FALSE Apr 24.14146 24.99629 20.43879 21.98717 25.49994
FALSE May 24.64941 23.69466 30.49668 29.04679 22.78731
FALSE Jun 26.09739 30.10935 25.97459 25.85037 25.80078
FALSE Jul 23.91218 25.78983 24.28109 22.84274 23.99339
FALSE Aug 26.91260 25.05814 25.72646 22.30848 23.64162
FALSE Sep 22.69753 26.59829 21.91946 28.18201 26.40151
FALSE Oct 24.84250 26.77820 21.91512 25.51643 29.94575
FALSE Nov 24.89610 26.60457 24.87088 29.05196 26.80928
FALSE Dec 21.91391 27.57653 27.31100 24.12477 23.84312
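Once a matrix has row and column names, they can be used for indexing and they carry over into summaries. The sketch below builds a stand-in matrix with its own random values (so the numbers differ from those printed above) and uses the built-in month.abb constant for the row names instead of lubridate; the seed and the result names are arbitrary.

```r
set.seed(1)
sst.matrix = matrix(rnorm(n = 60, mean = 25, sd = 3), nrow = 12, ncol = 5)
colnames(sst.matrix) = 2014:2018
rownames(sst.matrix) = month.abb   # "Jan", "Feb", ..., "Dec"

# a single cell by row and column name
jan.2014 = sst.matrix["Jan", "2014"]

# two months across all years
mid.year = sst.matrix[c("Jun", "Jul"), ]

# summaries keep the names
yearly.mean  = colMeans(sst.matrix)
monthly.mean = rowMeans(sst.matrix)
```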

Arrays

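An array extends a matrix to more than two dimensions. Here the 60 sst values are arranged into four slices, each with 3 rows and 5 columns:

array(data = sst, dim = c(3,5,4))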

FALSE , , 1
FALSE 
FALSE          [,1]     [,2]     [,3]     [,4]     [,5]
FALSE [1,] 25.08327 24.14146 23.91218 24.84250 30.20031
FALSE [2,] 24.09097 24.64941 26.91260 24.89610 22.91143
FALSE [3,] 26.34485 26.09739 22.69753 21.91391 26.33853
FALSE 
FALSE , , 2
FALSE 
FALSE          [,1]     [,2]     [,3]     [,4]     [,5]
FALSE [1,] 24.99629 25.78983 26.77820 24.65990 20.43879
FALSE [2,] 23.69466 25.05814 26.60457 26.95125 30.49668
FALSE [3,] 30.10935 26.59829 27.57653 29.67351 25.97459
FALSE 
FALSE , , 3
FALSE 
FALSE          [,1]     [,2]     [,3]     [,4]     [,5]
FALSE [1,] 24.28109 21.91512 28.64027 21.98717 22.84274
FALSE [2,] 25.72646 24.87088 26.58810 29.04679 22.30848
FALSE [3,] 21.91946 27.31100 27.33460 25.85037 28.18201
FALSE 
FALSE , , 4
FALSE 
FALSE          [,1]     [,2]     [,3]     [,4]     [,5]
FALSE [1,] 25.51643 19.98520 25.49994 23.99339 29.94575
FALSE [2,] 29.05196 30.13661 22.78731 23.64162 26.80928
FALSE [3,] 24.12477 26.37739 25.80078 26.40151 23.84312
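Array elements are addressed with one index per dimension. As a minimal sketch, the array below is filled with the integers 1 to 60 rather than sst values, so that the position of each element is easy to follow; the names sst.array, one.value and last.slice are illustrative.

```r
sst.array = array(data = 1:60, dim = c(3, 5, 4))

dim(sst.array)   # 3 rows, 5 columns, 4 slices

# row 2, column 3 of the first slice
one.value = sst.array[2, 3, 1]

# leaving an index empty keeps that whole dimension:
# here the entire fourth slice, which is a 3 x 5 matrix
last.slice = sst.array[, , 4]
```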

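Elements of a matrix can be extracted with indexing. For example, the sst.matrix we just created has twelve rows representing monthly averages and five columns representing years. The expression below selects all twelve rows of the fifth column, which holds the 2018 values. Suppose we then obtain data for a sixth year and want to add it to the matrix: unlike a vector, a matrix cannot be extended by assigning to an index beyond its dimensions, so the new column has to be bound on with cbind().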

sst.matrix[1:12,5]

FALSE      Jan      Feb      Mar      Apr      May      Jun      Jul      Aug 
FALSE 19.98520 30.13661 26.37739 25.49994 22.78731 25.80078 23.99339 23.64162 
FALSE      Sep      Oct      Nov      Dec 
FALSE 26.40151 29.94575 26.80928 23.84312
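Adding a sixth year to the matrix could be sketched with cbind(). The matrix here is a stand-in with its own random values, and the 2019 values are made up for illustration.

```r
set.seed(1)
sst.matrix = matrix(rnorm(n = 60, mean = 25, sd = 3), nrow = 12, ncol = 5,
                    dimnames = list(month.abb, as.character(2014:2018)))

# made-up monthly values for a sixth year
sst.2019 = rnorm(n = 12, mean = 25, sd = 3)

# cbind() returns a new matrix with the extra column;
# the argument name becomes the column name
sst.six = cbind(sst.matrix, "2019" = sst.2019)

dim(sst.six)
colnames(sst.six)
```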

Dealing with Missing Values

Just as we can assign numbers, strings or lists to a variable, we can also assign nothing, or an empty value, to an object. In R, an empty object is defined with NULL. Assigning NULL to an object is one way to reset it to its original, empty state. You might do this when you want to create an object without any value in advance, especially when you iterate a process and want the outputs stored in the initially empty object.

sst.container = NULL

You can check whether an object is empty with the is.null() function, which returns a logical output of TRUE or FALSE:

is.null(sst.container)

FALSE [1] TRUE

You can also check for NULL in an if statement, as highlighted in the following example:

if (is.null(sst.container)){
  print("The object is empty and hence you can use to store looped outputs!!!")
}

FALSE [1] "The object is empty and hence you can use to store looped outputs!!!"
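To illustrate storing looped outputs, the sketch below starts from an empty container and adds one value per iteration. The years and the random values are made up for the example.

```r
sst.container = NULL

for (year in 2014:2016) {
  # made-up monthly values for this year
  yearly = rnorm(n = 12, mean = 25, sd = 3)
  # c() with NULL simply returns the other values, so the first pass works
  sst.container = c(sst.container, mean(yearly))
}

is.null(sst.container)   # FALSE: the container now holds values
length(sst.container)    # 3: one mean per year
```

Growing a vector this way is convenient for short loops; for long ones, creating a vector of the final length up front, for example with numeric(n), is usually faster.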

An empty element (value) in an object is represented by NA in R; it marks the absence of a value in an object or variable.

sst.sample = c(26.78, 25.98,NA, 24.58, NA)
sst.sample

FALSE [1] 26.78 25.98    NA 24.58    NA

To identify missing values in a vector, use the is.na() function, which returns a logical vector with TRUE at the position of each missing value:

is.na(sst.sample)

FALSE [1] FALSE FALSE  TRUE FALSE  TRUE

Computing statistics on a variable that contains NA gives NA as the output:

mean(sst.sample); sd(sst.sample);range(sst.sample)

FALSE [1] NA

FALSE [1] NA

FALSE [1] NA NA

However, we can exclude missing values from these calculations by passing the na.rm = TRUE argument:

mean(sst.sample, na.rm = TRUE);sd(sst.sample, na.rm = TRUE);range(sst.sample, na.rm = TRUE)

FALSE [1] 25.78

FALSE [1] 1.113553

FALSE [1] 24.58 26.78

You can also exclude the elements with NA values using the na.omit() function:

sst.sample %>% na.omit()

FALSE [1] 26.78 25.98 24.58
FALSE attr(,"na.action")
FALSE [1] 3 5
FALSE attr(,"class")
FALSE [1] "omit"
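A few more operations on missing values are often useful. This sketch counts and locates them, and then fills them with the mean of the observed values. Mean filling is shown only as one simple option, not as a recommendation for every dataset.

```r
sst.sample = c(26.78, 25.98, NA, 24.58, NA)

# TRUE counts as 1, so the sum is the number of missing values
n.missing = sum(is.na(sst.sample))

# positions of the missing values
where.missing = which(is.na(sst.sample))

# keep only the observed values, without the extra attributes of na.omit()
observed = sst.sample[!is.na(sst.sample)]

# replace each NA with the mean of the observed values
sst.filled = sst.sample
sst.filled[is.na(sst.filled)] = mean(sst.sample, na.rm = TRUE)
sst.filled
```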

Finally, there is NaN (Not a Number), which is closely related to NA. R returns NaN when a numeric operation has no defined result. For example, suppose we have sea surface temperature anomalies and want to apply the sqrt() function to reduce the variability of the dataset:

sst.anomaly = c(2.3,1.25,.8,.31,0,-.21)
sqrt(sst.anomaly)

FALSE [1] 1.5165751 1.1180340 0.8944272 0.5567764 0.0000000       NaN

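We notice that the square root of -0.21 gives a NaN element, because the square root of a negative number is not defined for real numbers. R also issues a warning that NaNs were produced.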
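NA and NaN can be told apart with is.nan(). In this short sketch the vector x is made up, with 0/0 used as a simple way to produce a NaN.

```r
x = c(1, NA, 0/0)

# is.na() is TRUE for both NA and NaN
na.flags = is.na(x)

# is.nan() is TRUE only for NaN
nan.flags = is.nan(x)

na.flags; nan.flags
```

Because is.na() also catches NaN, the na.rm = TRUE argument shown earlier removes both kinds of value.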