10  R: Objects

10.1 Atomic vectors

An atomic vector in R is a vector containing objects of the same datatype. If the objects are not of the same datatype, then they are coerced to a common datatype. It is created using the function c().

numbers = c(1, 2, 67)
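
To illustrate the coercion rule stated above, the short sketch below mixes datatypes inside c(). The object names num_lgl and num_chr_lgl are made up for this illustration.

```r
# Logical values are coerced to numeric when combined with numbers
num_lgl <- c(1, TRUE, FALSE)
class(num_lgl)    # "numeric"

# Everything is coerced to character when a string is present
num_chr_lgl <- c(1, "apple", TRUE)
num_chr_lgl       # "1" "apple" "TRUE"
```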

The in-built R function length() is used to find the length of an atomic vector.

length(numbers)
[1] 3

10.1.1 Slicing the atomic vector

10.1.1.1 Slicing using indices

An atomic vector can be sliced using the indices of the elements within [] brackets.

For example, consider the vector:

vec <- 1:40

Suppose, we wish to get the \(3^{rd}\) element of the vector. We can get it using the index 3:

vec[3]
[1] 3

A sequence of consecutive elements can be sliced using the indices of the first element and the last element around the : operator. For example, let us slice elements from the \(3^{rd}\) index to the \(10^{th}\) element of the vector vec:

vec[3:10]
[1]  3  4  5  6  7  8  9 10

We can slice elements at different indices by putting the indices in an atomic vector within the [] brackets. Let us slice the \(4^{th}\), \(7^{th}\), and \(18^{th}\) elements of the vector vec:

vec[c(4,7,18)]
[1]  4  7 18

We can slice consecutive elements, and non-consecutive elements simultaneously. Let us slice the elements from the \(4^{th}\) index to the \(9^{th}\) index and the \(30^{th}\) and \(36^{th}\) element.

vec[c(4:9,30,36)]
[1]  4  5  6  7  8  9 30 36

10.1.1.2 Slicing using a logical atomic vector

An atomic vector can be sliced using a logical atomic vector of the same length. The logical atomic vector will have TRUE values corresponding to the indices where the element is to be selected, and FALSE where the element is to be discarded. See the example below.

vec <- 1:5
vec[c(TRUE, FALSE, FALSE, TRUE, FALSE)]
[1] 1 4

10.1.2 Removing elements from atomic vector

Elements can be removed from the vector using the negative sign within [] brackets.

Remove the 2nd element from the vector:

vec <- 1:5
vec[-2]
[1] 1 3 4 5

If multiple elements need to be removed, the indices of the elements to be removed can be given as an atomic vector.

Remove elements 2 to 6 and element 10 from the vector:

vec <- 1:20
vec[-c(2:6, 10)]
 [1]  1  7  8  9 11 12 13 14 15 16 17 18 19 20

Example: USA’s GDP per capita from 1960 to 2021 is given by the vector G in the code chunk below. The values are arranged in ascending order of the year, i.e., the first value is for 1960, the second value is for 1961, and so on. Store the years in which the GDP per capita of the US increased by more than 10%, in a vector.

G = c(3007, 3067, 3244, 3375,3574, 3828, 4146, 4336, 4696, 5032,5234,5609,6094,6726,7226,7801,8592,9453,10565,11674,12575,13976,14434,15544,17121,18237,19071,20039,21417,22857,23889,24342,25419,26387,27695,28691,29968,31459,32854,34515,36330,37134,37998,39490,41725,44123,46302,48050,48570,47195,48651,50066,51784,53291,55124,56763,57867,59915,62805,65095,63028,69288)
years <- c()
for (i in 1:(length(G) - 1)) {
  diff = (G[i+1] - G[i]) / G[i]
  if (diff > 0.1) years <- c(years, 1960 + i)
}
print(years)
[1] 1973 1976 1977 1978 1979 1981 1984

10.1.3 Element-wise operations on atomic vectors

When we use arithmetic operators like +, -, *, etc., or comparison operators like >, >=, ==, etc., between atomic vectors, then these operators are applied element-wise on the elements of the respective atomic vectors with the same index. Consider the examples below.

vec1 <- 1:4
vec2 <- 1:4
vec1 + vec2
[1] 2 4 6 8
vec1 > vec2
[1] FALSE FALSE FALSE FALSE

It is highly recommended that these operators be applied on atomic vectors of the same length. Otherwise, the shorter vector will be recycled, i.e., repeated to match the length of the longer vector (similar to what other languages call broadcasting). A warning is issued if the length of the longer vector is not a multiple of the length of the shorter vector. Recycling may be difficult to interpret, especially when arithmetic operators are applied on more than 2 atomic vectors of different lengths.
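
The repetition of the shorter vector described above can be seen in the sketch below. The object names are made up for this illustration, and tryCatch() is used only to capture the warning text so that it can be inspected.

```r
long_vec <- 1:6
short_vec <- c(10, 20)

# Length 6 is a multiple of length 2: short_vec is silently repeated 3 times
long_vec + short_vec      # 11 22 13 24 15 26

# Length 5 is not a multiple of length 2: R issues a warning,
# which is captured here as a character string
warning_text <- tryCatch(1:5 + short_vec,
                         warning = function(w) conditionMessage(w))
warning_text
```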

If an operator is applied between an atomic vector and a scalar, then the operation is performed on each element of the atomic vector and the scalar. See the examples below.

vec1*4
[1]  4  8 12 16
vec1 > 2
[1] FALSE FALSE  TRUE  TRUE

Suppose we wish to slice all elements from the vector vec1 that are greater than 2. Here is one approach to do it. We will apply the > operator between vec1 and 2 to obtain a logical vector that is TRUE on indices where the condition is satisfied, and FALSE otherwise. We will then use this logical vector to slice vec1. Below is the code.

vec1[vec1 > 2]
[1] 3 4

Now, solve the earlier GDP per capita example (finding the years in which the GDP per capita increased by more than 10%) without using a for loop.
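
One possible approach, sketched below under stated assumptions, combines negative indexing with element-wise operations. The helper years_with_growth_above() and the vector toy_series are hypothetical names and made-up values used only for illustration; they are not part of the original example.

```r
# Hypothetical helper: years in which a yearly series grew by more than
# `threshold` relative to the previous year
years_with_growth_above <- function(values, first_year, threshold = 0.1) {
  n <- length(values)
  # Element-wise growth rate of each value relative to the one before it
  growth <- (values[-1] - values[-n]) / values[-n]
  # Year associated with each value from the second one onwards
  years <- first_year + seq_len(n - 1)
  years[growth > threshold]
}

# Made-up series for the years 2000 to 2004
toy_series <- c(100, 105, 120, 125, 140)
years_with_growth_above(toy_series, first_year = 2000)   # 2002 2004
```

Calling years_with_growth_above(G, 1960) would apply the same logic to the vector G defined in the example.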

10.1.4 The seq() function

The seq() function is used to generate an atomic vector consisting of a sequence of numbers with a constant gap. For example, the code below generates a sequence of integers starting from 20 up to 60 with gaps of 5.

seq(20, 60, 5)
[1] 20 25 30 35 40 45 50 55 60
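
The gap does not have to be a positive integer. The sketch below uses the by and length.out arguments of seq() to show a fractional gap, a fixed number of equally spaced values, and a decreasing sequence.

```r
seq(0, 1, by = 0.25)          # 0.00 0.25 0.50 0.75 1.00
seq(1, 10, length.out = 4)    # 1 4 7 10
seq(10, 1, by = -3)           # 10 7 4 1
```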

10.1.5 The rep() function

The rep() function is used to repeat an object a fixed number of times.

rep(4, 10)
 [1] 4 4 4 4 4 4 4 4 4 4
rep(c(2, 3), 10)
 [1] 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3
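
The rep() function also accepts the arguments each and times, which control how the repetition is done. A brief sketch:

```r
# Repeat every element 3 times before moving to the next element
rep(c(2, 3), each = 3)               # 2 2 2 3 3 3

# Repeat the first element twice and the second element once
rep(c("a", "b"), times = c(2, 1))    # "a" "a" "b"
```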

10.1.6 The which() function

The which() function is used to find the indices of TRUE elements in a logical atomic vector.

vec <- c(8, 3, 4, 7, 9, 7, 5)
which(vec == 8)
[1] 1

In the above code, a logical vector is being created with vec == 8, and the which() function is returning the indices of the TRUE elements.

The index of the maximum and minimum values can be found using which.max() and which.min() respectively. In case of multiple maximum or minimum elements, the smallest index is returned.

which.max(vec)
[1] 5
which.min(vec)
[1] 2
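
The sketch below shows which() returning several indices when the condition holds more than once, and the tie-breaking behaviour of which.max() on a small made-up vector.

```r
vec <- c(8, 3, 4, 7, 9, 7, 5)

# The value 7 occurs twice, so two indices are returned
which(vec == 7)           # 4 6

# Indices of all elements greater than 6
which(vec > 6)            # 1 4 5 6

# The maximum 9 occurs at indices 2 and 3; the smaller index is returned
which.max(c(3, 9, 9))     # 2
```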

10.1.7 Practice exercise 1

Below is a vector consisting of responses to the question: “At what age do you think you will marry?” from students of the STAT303-1 Fall 2022 class.

exp_marriage_age <- c('24','30','28','29','30','27','26','28','30+','26','28','30','30','30','probably never','30','25','25','30','28','30+ ','30','25','28','28','25','25','27','28','30','30','35','26','28','27','27','30','25','30','26','32','27','26','27','26','28','37','28','28','28','35','28','27','28','26','28','26','30','27','30','28','25','26','28','35','29','27','27','30','24','25','29','27','33','30','30','25','26','30','32','26','30','30','I wont','25','27','27','25','27','27','32','26','25','never','28','33','28','35','25','30','29','30','31','28','28','30','40','30','28','30','27','by 30','28','27','28','30-35','35','30','30','never','30','35','28','31','30','27','33','32','27','27','26','N/A','25','26','29','28','34','26','24','28','30','120','25','33','27','28','32','30','26','30','30','28','27','27','27','27','27','27','28','30','30','30','28','30','28','30','30','28','28','30','27','30','28','25','never','69','28','28','33','30','28','28','26','30','26','27','30','25','Never','27','27','25')

10.1.7.1 Cleaning data

Remove the elements that are not integers, such as ‘probably never’, ‘30+’, etc. Convert the remaining elements to integer. What is the length of the new vector?

new_vector <- as.integer(exp_marriage_age)
Warning: NAs introduced by coercion
numeric_values <- new_vector[!is.na(new_vector)]
length(numeric_values)
[1] 181

10.1.7.2 Capping unreasonably high values

Cap the values greater than 80 to 80, in the clean vector obtained above. What is the mean age when people expect to marry in the new vector?

numeric_values[numeric_values > 80] <- 80
mean(numeric_values)
[1] 28.9558

10.1.7.3 People marrying at 30 or more

Determine the proportion of people who expect to marry at an age of 30 or more (multiply the result by 100 to express it as a percentage).

sum(numeric_values >= 30) / length(numeric_values)
[1] 0.3701657

10.1.8 The sapply() function

The sapply() function is used to apply a function on all the elements of a list, atomic vector or matrix.

For example, consider the vector below:

vec <- 1:6
vec
[1] 1 2 3 4 5 6

Suppose, we wish to square each element of the vector. We can use the sapply() function as below:

sapply(vec, FUN = function(x) x**2)
[1]  1  4  9 16 25 36
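
Squaring is already an element-wise operation, so sapply() gives the same result as applying the operator directly to the vector. sapply() becomes more useful when the function works on one value at a time, as in the second part of the sketch below; the function used there is made up for illustration.

```r
vec <- 1:6

# Both approaches give the same squares
squares_sapply <- sapply(vec, function(x) x^2)
squares_vectorized <- vec^2
all.equal(squares_sapply, squares_vectorized)   # TRUE

# if / else handles a single value, so sapply() applies it to each element
parity <- sapply(vec, function(x) if (x %% 2 == 0) "even" else "odd")
parity   # "odd" "even" "odd" "even" "odd" "even"
```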

10.1.9 Practice exercise 2

Write a function that identifies if a word is a palindrome (A palindrome is a word that reads the same both backwards and forwards, for example, peep, rotator, madam, etc.). Apply the function to the vector of words below to count the number of palindrome words.

words_vec <- c('fat', 'civic', 'radar', 'mountain', 'noon','papa')
palindrome <- function(word) {
  for (i in seq_len(nchar(word) %/% 2)) {
    if (substr(word, i, i) != substr(word, nchar(word) - (i-1), nchar(word) - (i-1))) {
      return(FALSE)
    }
  }
  return(TRUE)
}
sum(sapply(words_vec, palindrome))
[1] 3
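
An alternative sketch of the same check is shown below. It splits the word into characters and compares them with their reverse, instead of looping over positions. The function name is_palindrome is made up here; this is one possible approach, not the only solution.

```r
is_palindrome <- function(word) {
  chars <- strsplit(word, "")[[1]]   # split the word into single characters
  identical(chars, rev(chars))       # compare with the reversed characters
}

words_vec <- c('fat', 'civic', 'radar', 'mountain', 'noon', 'papa')
sum(sapply(words_vec, is_palindrome))   # 3
```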

10.2 Matrix

Matrices are two-dimensional arrays. The in-built function matrix() is used to define a matrix. An atomic vector can be organized as a matrix by specifying the number of rows and columns.

For example, let us define a 2x3 matrix (2 rows and 3 columns) consisting of consecutive integers from 1 to 6.

mat <- matrix(1:6, 2, 3)
mat
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

Note that the integers fill up column-wise in the matrix. If we wish to fill-up the matrix by row, we can use the byrow argument.

mat <- matrix(1:6, 2, 3, byrow = TRUE)
mat
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6

The functions nrow() and ncol() can be used to get the number of rows and columns of the matrix respectively.

nrow(mat)
[1] 2
ncol(mat)
[1] 3

Matrices can be sliced using the indices of row and column separated by a , in box brackets. Suppose we wish to get the element in the \(2^{nd}\) row and \(3^{rd}\) column of the matrix:

mat[2, 3]
[1] 6

For selecting all rows or columns of a matrix, the index for the row/column can be left blank. Suppose we wish to get all the elements of the \(1^{st}\) row of the matrix:

mat[1, ]
[1] 1 2 3

Rows and columns of the matrix can be sliced using the : operator. Suppose we want to select a sub-matrix that has elements in the first two rows and columns 2 and 3 of the matrix mat:

mat[1:2, 2:3]
     [,1] [,2]
[1,]    2    3
[2,]    5    6
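
Note that slicing a single row or column returns an atomic vector rather than a matrix. The sketch below shows how the drop argument of the [] operator keeps the result as a matrix, and how non-consecutive columns can be selected with an atomic vector of indices. The object names are made up for this illustration.

```r
mat <- matrix(1:6, 2, 3, byrow = TRUE)

first_row_vec <- mat[1, ]                 # atomic vector of length 3
first_row_mat <- mat[1, , drop = FALSE]   # matrix with 1 row and 3 columns

dim(first_row_vec)   # NULL
dim(first_row_mat)   # 1 3

# Columns 1 and 3 of all rows
mat[, c(1, 3)]
```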

Element-wise arithmetic operations can be performed between 2 matrices of the same shape.

mat1 <- matrix(1:6, 2, 3)
mat2 <- matrix(c(9, 2, 6, 5, 1, 0), 2, 3)
mat1 + mat2
     [,1] [,2] [,3]
[1,]   10    9    6
[2,]    4    9    6
mat1 - mat2
     [,1] [,2] [,3]
[1,]   -8   -3    4
[2,]    0   -1    6

Suppose we need to sum up all the rows of the matrix. We can do it using a for loop as follows:

row_sum <- c(0,0)
for (i in 1:nrow(mat)) {
  for (j in 1:ncol(mat)) {
    row_sum[i] <- row_sum[i] + mat[i, j]
  }
}
row_sum
[1]  6 15

Observe that in the above for loop, elements of each row are added one at a time. We can add all the elements of a row simultaneously using the sum() function. This eliminates the inner for loop from the above code:

row_sum <- c(0,0)
for (i in 1:nrow(mat)){
  row_sum[i] <- sum(mat[i,])
}
row_sum
[1]  6 15

In the above code, we sum up all the elements of a row simultaneously. However, we still need to iterate over the rows one at a time.

10.2.1 The apply() function

The apply() function can be used to apply a function on each row or column of a matrix. Thus, this function saves the user from writing a for() loop in R to iterate over all the rows or columns of the matrix. Note that each row / column of a matrix is an atomic vector. Thus, vectorized computations can be performed within the function, resulting in efficient computations.

Note that the apply functions loop over the elements under-the-hood, and thus the function will be applied sequentially on each row / column of the matrix. The loop inside lapply() and sapply() is implemented in C, while apply() itself is written in R, so apply() is mainly a more compact way to write the loop and is not guaranteed to be faster than an explicit for() loop.

Let us use the apply() function to sum up all the rows of the matrix mat.

apply(mat, 1, sum)
[1]  6 15
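
The second argument of apply() selects the direction: 1 for rows and 2 for columns. For sums in particular, base R also provides the functions rowSums() and colSums(). A brief sketch:

```r
mat <- matrix(1:6, 2, 3, byrow = TRUE)

apply(mat, 2, sum)    # column sums: 5 7 9
rowSums(mat)          # 6 15
colSums(mat)          # 5 7 9
```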

Let us compare the time taken to sum up rows of a matrix using a for loop with the time taken using the apply() function.

options(digits.secs = 6)
start.time <- Sys.time()
row_sum<-c(0, 0)
for (i in 1:nrow(mat)){
  row_sum[i] <- sum(mat[i,])
}
row_sum
[1]  6 15
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
Time difference of 0.005152941 secs
start.time <- Sys.time()
apply(mat, 1, sum)
[1]  6 15
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
Time difference of 0.001438379 secs

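Observe that in this run the apply() function took less time to sum up all the rows of the matrix than the for loop. However, a single timing on a 2x3 matrix measures mostly overhead and noise, so this difference should not be read as a general result.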
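
A somewhat more reliable comparison, sketched below, uses system.time() on a larger matrix and repeats each approach several times. The matrix big_mat, its size, the number of repetitions, and the helper loop_row_sums() are assumptions made for this illustration; the measured times will differ from machine to machine, so no timings are shown.

```r
set.seed(1)
# Made-up test matrix with 2000 rows and 50 columns
big_mat <- matrix(runif(2000 * 50), nrow = 2000, ncol = 50)

# Row sums with an explicit for loop
loop_row_sums <- function(m) {
  out <- numeric(nrow(m))
  for (i in seq_len(nrow(m))) out[i] <- sum(m[i, ])
  out
}

# Repeat each approach 20 times so that the measurement is less noisy
time_loop    <- system.time(for (k in 1:20) loop_row_sums(big_mat))
time_apply   <- system.time(for (k in 1:20) apply(big_mat, 1, sum))
time_rowsums <- system.time(for (k in 1:20) rowSums(big_mat))

# The approaches agree on the result
all.equal(loop_row_sums(big_mat), apply(big_mat, 1, sum))
```

The "elapsed" entry of each object, for example time_loop["elapsed"], holds the measured wall-clock time in seconds.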

Recall the earlier example where we computed the years in which the increase in GDP per capita was more than 10%. Let us use matrices to solve the problem. We’ll also compare the time it takes using a matrix with the time it takes using for loops.

start.time <- Sys.time()

#Let the first column of the matrix be the GDP of all the years except 1960, and the second column be the GDP of all the years except 2021.
GDP_mat <- matrix(c(G[-1], G[-length(G)]), length(G) - 1, 2)

#The percent increase in GDP can be computed by performing computations using the 2 columns of the matrix
inc <- (GDP_mat[,1] - GDP_mat[,2]) / GDP_mat[,2]
years <- 1961:2021
years <- years[inc > 0.1]
years
[1] 1973 1976 1977 1978 1979 1981 1984
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
Time difference of 0.007261515 secs

Without matrices, the time taken to perform the same computation is measured with the code below.

start.time <- Sys.time()
years <- c()
for (i in 1:(length(G) - 1)) {
  diff = (G[i+1] - G[i]) / G[i]
  if (diff > 0.1) years <- c(years, 1960 + i)
}
print(years)
[1] 1973 1976 1977 1978 1979 1981 1984
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
Time difference of 0.007926941 secs

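Observe that the matrix-based code was slightly faster in this run. Operations on whole columns are vectorized, i.e., the loop over the elements runs in compiled code, in contrast to a for loop in R where computations are performed one element at a time. The difference is small here because the vector is short.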

Sometimes, the computations on rows / columns of a matrix are not straightforward and we may need to use the apply() function to apply a function on each row / column of a matrix.

Example: Find the maximum GDP per capita of the US in each of the 5 year periods starting from 1961-1965, and up to 2016-2020.

GDP_5year <- matrix(G[-c(1, length(G))], 12, 5, byrow = TRUE)
GDP_max_5year <- apply(GDP_5year, 1, max)

In the above code, we applied the in-built function max on all the rows. Sometimes, an in-built function may not be available for the computations to be performed. In such a case, we can write our own user-defined function within the apply() function. See the example below.

Example: Find the range (max-min) of GDP per capita of the US in each of the 5 year periods starting from 1961-1965, and up to 2016-2020.

GDP_5year <- matrix(G[-c(1, length(G))], 12, 5, byrow = TRUE)
GDP_range_5year <- apply(GDP_5year, 1, function(x) max(x) - min(x))
GDP_range_5year
 [1]  761 1088 2192 3983 4261 4818 4349 6362 6989 2349 6697 7228

In the code above we applied a user-defined function on each row of the matrix. However, if the function has multiple lines, it may be inconvenient to write the function within the apply() function. In that case, we can define the function outside the apply() function.

Example: Find the five year periods starting from 1961-1965, and up to 2016-2020, during which the GDP per capita decreased in at least one year as compared to the previous year.

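# Returns 1 if the GDP per capita fell in any year of the 5 year period
# relative to the previous year of the same period, and 0 otherwise
GDP_inc <- function (GDP_5yr) {
  dec <- 0
  for (i in 1:4) {
    if(GDP_5yr[i+1] < GDP_5yr[i]) dec <- 1
  }
  return(dec)
}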

GDP_5year_mat <- matrix(G[-c(1,length(G))], 12, 5, byrow = TRUE)
years_inc_dec <- apply(GDP_5year_mat, 1, GDP_inc)
five_year_periods <- seq(1960, 2015, 5)
print("Five year periods in which the GDP per capita decreased are those starting from the years:")
[1] "Five year periods in which the GDP per capita decreased are those starting from the years:"
print(five_year_periods[years_inc_dec == 1] + 1)
[1] 2006 2016

The 5 year periods during which the GDP per capita decreased as compared to the previous year are 2006-2010, and 2016-2020.
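
The same kind of check can also be written without an inner for loop by using diff(), which returns the differences between consecutive elements of a vector. The sketch below uses a small made-up matrix, toy_periods, in place of the GDP data.

```r
# Made-up matrix: each row holds the 5 yearly values of one period
toy_periods <- matrix(c(10, 11, 12, 13, 14,
                        20, 19, 21, 22, 23,
                        30, 31, 32, 33, 34), nrow = 3, byrow = TRUE)

# TRUE for a row if any value is smaller than the value before it
has_decrease <- apply(toy_periods, 1, function(x) any(diff(x) < 0))
has_decrease          # FALSE TRUE FALSE
which(has_decrease)   # 2
```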

10.2.2 Practice exercise 3

Find the 5 year period in which the difference of the maximum GDP per capita and the minimum GDP per capita as a percentage of the minimum GDP per capita was the highest.

Solution:

five_year_periods[which.max(apply(GDP_5year_mat, 1, function(x) (max(x) - min(x)) / min(x)))] + 1
[1] 1976
print("During 1976-1980 the difference of the maximum GDP per capita and the minimum GDP per capita as a percentage of the minimum GDP per capita was the highest.")
[1] "During 1976-1980 the difference of the maximum GDP per capita and the minimum GDP per capita as a percentage of the minimum GDP per capita was the highest."

10.2.3 Practice exercise 4

The object country_names is an atomic vector consisting of country names. The object coordinates_capital_cities is a matrix consisting of the latitude-longitude pair of the capital city of the respective country. The order of countries in country_names is the same as the order in which their capital city coordinates (latitude-longitude) appear in the matrix coordinates_capital_cities.

Download the file capital_cities.csv from here. Make sure the file is in your current working directory. Execute the following code to obtain the objects coordinates_capital_cities and country_names.

capital_cities <- read.csv('capital_cities.csv')
coordinates_capital_cities <- as.matrix(capital_cities[,c(3, 4)])
country_names <- capital_cities[,1]

10.2.3.1 Country with capital closest to DC

Print the name and coordinates of the country with the capital city closest to the US capital - Washington DC.

Note that:

  1. The Country Name for US is given as United States in the data.
  2. The ‘closeness’ of capital cities from the US capital is based on the Euclidean distance of their coordinates to those of the US capital.

Hint:

  1. Get the coordinates of Washington DC from coordinates_capital_cities. The row that contains the coordinates of DC will have the same index as United States has in the vector country_names

  2. Create a matrix that has coordinates of Washington DC in each row, and has the same number of rows as the matrix coordinates_capital_cities.

  3. Subtract coordinates_capital_cities from the matrix created in (2). Element-wise subtraction will occur between the matrices.

  4. Use the apply() function on the matrix obtained above to find the Euclidean distance of Washington DC from the rest of the capital cities.

  5. Using the distances obtained above, find the country that has the closest capital to DC.
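
The steps in the hint can be sketched on a small made-up example as follows. The names and coordinates below are invented purely for illustration and are not taken from capital_cities.csv; "Home" plays the role of the reference city.

```r
# Made-up names and coordinates, for illustration only
toy_names <- c("Home", "A", "B", "C")
toy_coords <- matrix(c( 0, 0,
                        3, 4,
                        1, 1,
                       -6, 8), ncol = 2, byrow = TRUE)

# Steps 1-2: coordinates of the reference city repeated in every row
ref_index <- which(toy_names == "Home")
ref_mat <- matrix(toy_coords[ref_index, ], nrow = nrow(toy_coords),
                  ncol = 2, byrow = TRUE)

# Steps 3-4: element-wise difference, then Euclidean distance for each row
diff_mat <- ref_mat - toy_coords
toy_dist <- apply(diff_mat, 1, function(d) sqrt(sum(d^2)))
toy_dist   # 0, 5, about 1.414, 10

# Step 5: exclude the reference city itself, then take the smallest distance
toy_dist[ref_index] <- Inf
toy_names[which.min(toy_dist)]   # "B"
```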

10.2.3.2 Top 10 countries closest to DC

  1. Print the names of the countries of the top 10 capital cities closest to the US capital - Washington DC.

  2. Create and print a matrix containing the coordinates of the top 10 capital cities closest to Washington DC.

US_index = which(country_names == 'United States')
dc_coord <- coordinates_capital_cities[US_index,]
distances_to_DC <- apply(coordinates_capital_cities, 1, 
                  function(city_coord) sqrt(sum((city_coord - dc_coord)**2)))
num_of_countries <- length(country_names)
distances_to_DC_matrix <- cbind(1:num_of_countries, distances_to_DC)
sorted <- distances_to_DC_matrix[order(distances_to_DC_matrix[,2]),]

Top 10 countries with capitals closest to Washington DC are the following:

country_names[sorted[3:12, 1]]

The coordinates of the top 10 capital cities closest to Washington DC are:

coordinates_capital_cities[sorted[3:12, 1],]

10.3 Lists

Atomic vectors and matrices are quite useful in R. However, a constraint with them is that they can only contain objects of the same datatype. For example, an atomic vector can contain all numeric objects, all character objects, or all logical objects, but not a mixture of multiple types of objects. Thus, there arises a need for a list data structure that can store objects of multiple datatypes.

A list can be defined using the list() function. For example, consider the list below:

list_ex <- list(1, "apple", TRUE, list("another list", TRUE))

The list list_ex consists of objects of multiple datatypes. The length of the list can be obtained using the length() function:

length(list_ex)
[1] 4

A list is an ordered collection of objects. Each object of the list is associated with an index that corresponds to its order of occurrence in the list.

A single element can be sliced from the list by specifying its index within the [[]] operator. Let us slice the \(2^{nd}\) element of the list list_ex:

list_ex[[2]]
[1] "apple"

Multiple elements can be sliced from the list by specifying the indices as an atomic vector within the [] operator. Let us slice the \(1^{st}\) and \(3^{rd}\) elements from the list list_ex:

list_ex[c(1,3)]
[[1]]
[1] 1

[[2]]
[1] TRUE
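
The difference between the two operators is worth noting: [] always returns a list, even for a single index, while [[]] returns the element itself. The sketch below also shows how [[]] can be chained to reach into the nested list.

```r
list_ex <- list(1, "apple", TRUE, list("another list", TRUE))

class(list_ex[2])      # "list": a list of length 1
class(list_ex[[2]])    # "character": the element itself

# First element of the list stored as the 4th element
list_ex[[4]][[1]]      # "another list"
```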

Elements of a list can be named using the names() function. Let us name the elements of list_ex:

names(list_ex) <- c("Name1", "second_name", "3rd_element", "Number 4")

A single element can be sliced from the list using the name of the element with the $ operator. Let us slice the element named as second_name from the list list_ex:

list_ex$second_name
[1] "apple"

Note that if the name of the element does not begin with a letter or has special characters such as a space, then it should be enclosed within backticks after the $ operator. For example, let us slice the element named as 3rd_element from the list list_ex:

list_ex$`3rd_element`
[1] TRUE

Names of elements of a list can also be specified while defining the list, as in the example below:

list_ex_with_names <- list(movie = 'The Dark Knight', IMDB_rating = 9)
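
The names given at definition can be inspected with names(), and the elements can be accessed either with the $ operator or by passing the name as a string to the [[]] operator, as sketched below.

```r
list_ex_with_names <- list(movie = 'The Dark Knight', IMDB_rating = 9)

names(list_ex_with_names)             # "movie" "IMDB_rating"
list_ex_with_names$movie              # "The Dark Knight"
list_ex_with_names[["IMDB_rating"]]   # 9
```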

A list can be converted to an atomic vector using the unlist() function. For example, let us convert the list list_ex to a vector:

unlist(list_ex)
         Name1    second_name    3rd_element      Number 41      Number 42 
           "1"        "apple"         "TRUE" "another list"         "TRUE" 

Since a vector can contain objects of a single datatype, note that all objects have been converted to the character datatype in the vector above.

10.3.1 Practice exercise 5

Download the dataset movies.json. Execute the following code to read the data into the object movies:

library(rjson)
movies<-fromJSON(file = 'movies.json')

10.3.1.1

What is the datatype of the object movies?

class(movies)

The datatype of the object movies is list.

10.3.1.2

Count the number of movies having a negative profit, i.e., their production budget is higher than their worldwide gross.

Ignore the movies that:

  1. Have missing values of production budget or worldwide gross. Use the is.null() function to identify missing or NULL values.

  2. Have a zero worldwide gross (A zero worldwide gross is probably an incorrect value).

count <- 0
for (i in seq_along(movies)) {
  pb <- movies[[i]]$`Production Budget`
  wg <- movies[[i]]$`Worldwide Gross`
  # Skip movies with a missing budget or gross (|| and && work on single values)
  if (!(is.null(pb) || is.null(wg))) {
    if (pb > wg && wg > 0) {
      count <- count + 1
    }
  }
}
print(paste("Number of movies with negative profit =", count))

10.3.2 The lapply() function

The lapply() function is used to apply a function on each element of a list, and returns a list of the same length.

For example, consider the list below:

list_ex <- list(1, "apple", TRUE, list("another list", TRUE))

Let us use the lapply() function to find the class of each element of the list list_ex:

lapply(list_ex, function(x) class(x))
[[1]]
[1] "numeric"

[[2]]
[1] "character"

[[3]]
[1] "logical"

[[4]]
[1] "list"
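
For comparison, the sketch below applies the same function with both lapply() and sapply(). lapply() always returns a list, while sapply() simplifies the result to an atomic vector when every result has length 1. The object names class_list and class_vec are made up for this illustration.

```r
list_ex <- list(1, "apple", TRUE, list("another list", TRUE))

class_list <- lapply(list_ex, class)   # list of length 4
class_vec  <- sapply(list_ex, class)   # character vector of length 4

class_vec   # "numeric" "character" "logical" "list"
```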

10.3.3 Practice exercise 6

Solve practice exercise 5 without using a for loop. Use the lapply() function.

# TRUE for a movie if both values are present, the gross is positive, and the budget exceeds it
negative_profit <- lapply(movies, function(x) {
  pb <- x$`Production Budget`; wg <- x$`Worldwide Gross`
  !is.null(pb) && !is.null(wg) && wg > 0 && pb > wg
})
sum(unlist(negative_profit))