An Introduction to R

Jan 15, 2024


Lecture Outline

  1. RStudio IDE
  2. Arithmetic, Syntax, and Basics
  3. Class Types
    • integer | numeric | logical | character | matrix | data frame
  4. Linear Algebra
  5. Functions
  6. Loops
  7. Simulating and Importing Data
  8. Visualizing Data
    • scatterplot | histogram | boxplot | barplot | pie chart | line graph
  9. Modeling Data - Regression
    • lm() | linear | quadratic | cubic | poly() | mean squared error (MSE)
  10. Modeling Data - Classification
  11. Final Exercises
  12. Appendix


RStudio

RStudio is an integrated development enviornment (IDE) for R. It is one of the most common R programming environments for data scientists. If you don't already have RStudio downloaded on your computer, please download the free desktop version here: RStudio download. We will be actively programming throughout the lesson.

[Command] + [enter] - Runs current line
[Command] + [s] - Saves file
[Command] + [l] - Clears console
Tools -> Global Options -> Appearance -> Editor Theme - To change the overall appearance


Arithmetic, Syntax, and Basics

Basic mathematical operations in R, including the assignment of variables, naming convenctions, and traditional syntax. Notable functions used in this section inlcude:

######## Assigning Variables ########
x <- 2
x = 2     #Note "=" often works but "<-" is the convention for R
y <- 3
b <- 2
z <- 3.12345
s <- "Welcome!"

adding <- x + y
sum(x, y) 
subtracting <- x - y 
multiplying <- x * y 
dividing <- x / y 
exponents <- x^b 
square_root <- sqrt(x)
logarithms <- log(x) 
exponentials <- exp(x) 
mean(x, y)

rounding <- round(z)
rounding1 <- round(z, 1)
rounding2 <- round(z, 2)

######## R Naming Convention: snake_case ########
variable_name <- 523
variable_name

pls.dont.use.dots.even.tho.it.works <- 3049
pls.dont.use.dots.even.tho.it.works


Class Types

We will specify 5 class types in this section: numeric, logical, character, matrix, data frame. Notable functions/operations used in this section inlcude:

######## Numeric ########
a_numeric <- 456
another_numeric <- 12.367
yet_another_numeric <- pi

a_vector <- c(1, 2, 3, 4)
a_sequence <- seq(from = 10, to = 100, by = 10) 	# an ordered list of numbers
another_sequence <- 1:10 				# Integer
yet_another_sequence <- seq_len(10)		# A sequence suitable for loops

######## Logical ########
a_boolean <- 2 == 3
another_boolean <- 1 + 2 == 5 - 2
sum(a_boolean)
sum(another_boolean)

######## Characters ########
a_charater <- "Hello"
another_character <- 'Datafest!'

######## Matrix ########
a_matrix <- matrix(1, nrow = 2, ncol = 5) 		# a Matrix-Array
a_more_interesting_matrix <- matrix(a_sequence, nrow = 2, ncol = 5)

######## Data Frame ########
a_dataframe <- data.frame(a_matrix)

######## Testing Class Types with class() ########
class(a_numeric)
class(another_numeric)
class(yet_another_numeric)
class(a_vector)
class(a_sequence)
class(yet_another_sequence)
class(a_matrix)
class(a_boolean)
class(a_charater)
class(a_dataframe)


Linear Algebra

Matrices are described by two dimensions: row \((m)\) and column \((n)\). If a matrix has 2 rows and 3 columns, we say it is a "two by three matrix," denoted as \( m \times n\). We use capital letters to name matrices. If we name the above matrix, \(A\), we can say \(A\) is a \(8 \times 8\) matrix. Each individual cell within a matrix is called an element. Elements are referred to using their location within the matrix they belong to. Using matrix \(A\) as an example, the overlapped highlighted element is located at the row index = 3 and column index = 4, so this element is denoted as \(A[3, 4]\). If we wanted to refer to all the elements in column 4, we would denote this as \(A[\hspace{0.2cm}, 4]\), leaving a blank space in the row placeholder. Similarly, we would write \(A[3,\hspace{0.2cm}]\) to refer to all the elements in row 3 of matrix \(A\).

Vectors are one dimensional matrices - either the row dimension = 1 or the column dimension = 1.

Notable functions/operations used in this section inlcude:

######## Vectors ########
num_vector <- c(1, 2, 3, 4, 5, 6, 7) 			
char_vector <- c("D", "a", "t", "a", "f", "e", "s", "t", "!")
x <- num_vector[1]
y <- char_vector[4]
sum(x, y)
char_vector
cat(char_vector)
length(char_vector)

######## Matrices ########
a_matrix <- matrix(data = seq(from = 1, to = 20, by = 1),	#note, we are using "=" here
                   nrow = 10,
                   ncol = 2)
a_matrix
row <- 5
col <- 2
a_matrix[row, col]
a_matrix[1, ]              #1st row, all columns
a_matrix[, 1]              #All rows, 1st column
a_matrix[1, 1] <- 75       #rewrite the 1st row 1st column element to = 75

mean(a_matrix) 		   #average
t(a_matrix)    		   #transpose
dim(a_matrix)  		   #dimensions

number_of_rows_displayed <- 3
head(a_matrix, number_of_rows_displayed)  #displays top rows 

View(a_matrix)    #displays matrix in spreadsheet format


Functions

Functions accept an input, perform specified tasks and/or return an output as a response dependent on the input value(s). Notable functions/operations used in this section include:

######## IF/ELSE Conditional Function ########
if(1 > 10){
  print("If 1 > 10 is TRUE, then this will print!")
} else if(80 == 40 * 2) {
  print("Cool, 80 is equal to twice of 40!")
} else{
  print("Darn, they were all false so now you have me!")
}


######## Functions ########
a_summation_function <- function(a, b){
  return(a + b)
}
a_summation_function(5, 11)


######## Pipelines ########
install.packages("dplyr")
library(dplyr)

pipe_ex <- 1:10
pipe_ex |> sum()
pipe_ex %>% mean()
pipe_ex |> order(decreasing = TRUE)

pipe_ex2 <- data.frame(numbers = sample(1:100, 8), 
                       letters = c("A", "B", "B", "A", "C", "A", "B", "A") )

pipe_ex2 |> 
  group_by(letters) |> 
  summarise(sum = sum(numbers))

pipe_ex2 |>
  mutate(new_var = numbers * 10)
Checkpoint 1: Write a function that computes the nth power of base b

Challenge: Write a function that computes the euclidean distance between two points (x1, y1) and (x2, y2)


Loops

Loops describe iterative and/or repetitive functions. Notable functions and operations used in this section include:

######## For Loops ######## 
for(i in seq_len(10)){
  print(2 * i)
}

datafest <- c()
for(i in 1:20){
	datafest[i] <- "datafest"
}
datafest

######## While Loops ########
count <- 1
while(count <= 5){
  print(paste0("This repeats until count equals 6. We're currently on iteration ", count, "."))
  count <- count + 1
}
Checkpoint 2: Write a loop that outputs the first 50 terms of the Fibonacci Sequence

Challenge: Write a loop that outputs the first 100 terms of the following sequence: 4, 9, 16, 25, 36, ...


Simulating and Importing Data

We will begin to work with data in this section, focusing on creating and importing data. We will be importing the following csv file: music.csv 1. Notable functions/operations used in this section include:

######## Simulating Non-Random Data ########
seq(from = 1, to = 10, by = .5)
1:10

######## Simulating Random Data ########
set.seed(1)                         		 	#for reproducibility when using a random generator

runif(n = 10, min = 0, max = 100)   			#random uniform #'s
rnorm(n = 10, mean = 0, sd = 1)    			#random normal #'s
runif(10, 0, 100)                   	    		#you don't always have to specify the function - check defaults

######## Importing Data ######## 
install.packages("dplyr")
imported_data <- read.csv("~/Desktop/music.csv") 			
View(imported_data)

######## Data Frames ########
a_dataframe <- data.frame(x1 = 1:50,
                	  x2 = runif(50, 0, 100))
a_dataframe
View(a_dataframe)

a_dataframe$x1						#calls the vector named x1 from a_dataframe


Visualizing Data

We will produce many different types of plots and graphs in this section, including: scatter plot, line graph, histogram, bar plot, box plot, pie chart.

Scatter Plots and Line Graphs

plot(x = 1:10, y = 1:10, main = "Scatter Plot")

plot(x = 1:10, y = 1:10, type = "l", main = "Line Graph")

plot(x = 1:10, y = 1:10, type = "b", col = "salmon", main = "More Descriptive Line Graph") 
  lines(x = 1:10, y = 1.2*(1:10), col = "skyblue") 
  legend("topleft", 
        legend = c("line 1", "line 2"), 
         fill = c("salmon","skyblue"))

plot(x = 1:60, y = (1:60)^2, 
     type = "b", 
     main = "Intro to R Lecture Knowledge Rate",
     xlab = "Time", ylab = "Knowledge")

plot(a_dataframe, main = "Simulated Data Scatter Plot")

Histograms

hist(x = rnorm(10, mean = 0, sd = 1)) 		#Note: default is mean = 0, sd = 1
hist(x = rnorm(10), 
     main = "An Interesting Main Title",
     xlab = "A Funny X Axis Title",
     ylab = "A Silly Y Axis Title")

######## Central Limit Theorem ########
clt_test <- function(n){
  x <- rnorm(n)
  hist(x, main = paste0("Histogram n = ", n))
}
clt_test(10)
clt_test(100)
clt_test(1000)
clt_test(10000)

Bar Plot

barplot(height = 1:10)
barplot(height = runif(5, 0, 10),
        horiz = TRUE)

barplot(height = c(52, 45, 67, 36, 46),
        names = c("Watermelon Sugar", "Adore You", "Golden", "She", "As It Was"),
        horiz = TRUE,
        las = 1,
        cex.names = 0.5,
        col = c(1, 2, 3, 4, 5),
        main = "What Would You Name The Main Title?")
Checkpoint 2: Come up with a descriptive title for the bar graph above.

Box Plot

bieber_fandom <- rnorm(100, mean = 50, sd = 20)
swift_fandom <- rnorm(100, mean = 80, sd = 15)

boxplot(bieber_fandom, swift_fandom)

boxplot(bieber_fandom, swift_fandom,
        names = c("Beliebers", "Swifties"),
        main = "Beliber vs. Swifty Fandom",
        xlab = "Public Affection Rating",
        horizontal = TRUE,
        col = c("lightblue", "salmon"))

Real Data

######## Music Data ########
music <- imported_data
dim(music)
View(music)

table(music$country)
barplot(height = table(music$country), main = "Artist Count by Country") 
pie(table(music$country), main = "Cluttered Pie Chart")

## So many countries! Let's remove the smaller ones
country_table <- table(music$country)
View(country_table)
country_table <- country_table[country_table >= 20]
country_table <- country_table[order(country_table)]

barplot(country_table, main = "Countries With More Than 20 Artists")
barplot(country_table, horiz = TRUE, las = 2, cex.names = 0.5,
        main = "Horizontal Version")

## Lets compare the top 50 artists
barplot(height = music$listeners[1:50], 
        names = music$artist[1:50],
        horiz = TRUE, 
        las = 2, 
        cex.names = 0.5,
        main = "Top 50 Listened-To Artists")

## Searching for specific artists
which(music$artist == "Taylor Swift")
which(music$artist == "Justin Bieber")
music[138, ]


Modeling Data - Regression

We will fit a few different functions to model data for regression, inlcuding: linear, quadratic, and cubic. We will be importing the following csv file: cookies.csv. Notable functions/operations used in this section include:

######## Create Data ########
data <- data.frame(x = 1:10, y = (1:10)^2) # y = x^2 for 1 < x < 10
plot(data)

######## Linear Regression Model lm() ########
lin_model <- lm(y ~ x, data = data)
predict_lin_model <- predict(lin_model)
summary(lin_model)

## Plotting ##
plot(data, 
     type = "b",
     col = "salmon",
     lwd = 2,
     main = "Fitting a Linear Model to Quadratic Function") 
  lines(predict_lin_model, col = "skyblue", lwd = 2) 
  legend("topleft", 
        legend = c("Observed", "Predicted"), 
         fill = c("salmon","skyblue"))
######## Quadratic Model ########
quad_model <- lm(y ~ I(x^2), data = data)
predict_quad_model <- predict(quad_model)

plot(data, 
     type = "b",
     col = "salmon",
     lwd = 2,
     main = "Fitting a Quadratic Model to Quadratic Function") 
  lines(predict_quad_model, col = "skyblue", lwd = 2) 
  legend(x = 2, y = 90, 
         legend = c("Observed", "Predicted"), 
         fill = c("salmon","skyblue"))
######## Cubic Model ########
cubic_model <- lm(y ~ I(x^3), data = data)
predict_cubic_model <- predict(cubic_model)

plot(data, 
     type = "b",
     col = "salmon",
     lwd = 2,
     main = "Fitting a Cubic Model to Quadratic Function") 
  lines(predict_cubic_model, col = "skyblue", lwd = 2) 
  legend(x = 2, y = 90, 
         legend = c("Observed", "Predicted"), 
         fill = c("salmon","skyblue"))
Checkpoint 3:
  • Read in the cookies.csv file
  • Fit a linear model to the data
  • Predict the number of cookies I will eat on the 7th day
  • Plot the actual data and fitted model on the same graph, make it asthetically descriptive
Challenge: Write a function to compute the MSE


Modeling Data - Classification

We won't cover classification methods here, but check out my article on Breast Cancer Detection for an instructive walkthough. In this article, I use logistic regression to classify breast cancer tumors as malignant or benign.


Final Exercise

  1. Choose one of the following data sets*:
  2. Explore the data set: plot(), summary(), dim(), View()
  3. Fit a model to the data: lm()
  4. Calculate predictions: predict()
  5. Create a visual of the observed data and fitted model: plot()
  6. Calculate accuracy of your model using MSE

*You're welcome to look for other data sets on kaggle, data.gov, etc.


Appendix

Here you can find answers to the checkpoint exercises.

  1. Power Function
  2. power <- function(b, n){     # b^n
      print(b^n)
    }
    power(5, 2)
  3. Euclidean Distance Function
  4. euclidean_distance <- function(point1, point2){     # a^2 + b^2 = c^2
      horiz_vert_dist <- (point1 - point2)^2
      add_them_up <- sum(horiz_vert_dist)
      return(sqrt(add_them_up))
    }
    xy1 <- c(0, 0)
    xy2 <- c(1, 1)
    euclidean_distance(xy1, xy2)
  5. Write a loop that stores the first 100 terms of the following sequence: 4, 9, 16, 25, 36, ...
  6. terms <- c()
    terms[1] <- 4
    for(i in 1:100){
      terms[i + 1] <- terms[i] + 2*i + 3
    }
    terms
  7. Write a loop that outputs the first 50 terms of the Fibonacci Sequence
  8. fib <- c()
    fib[1] <- 0
    fib[2] <- 1
    for(i in 3:50){
      fib[i] <- fib[i - 1] + fib[i - 2]
    }
    fib
  9. Cookies Regression Analysis
  10. # Step 1: read in data
    cookies_df <- read.csv("~/Desktop/cookies.csv")
    
    # Step 2: plot raw data to observe the shape
    plot(cookies_df)
    
    # Step 3: fit linear model
    cookies_model <- lm(Cookies ~ Day, data = cookies_df)
    
    # Step 4: predict day 7 using cookies model
    predict_cookies <- predict(cookies_model, newdata = data.frame(Day = 1:7))
    predict_cookies[7]
    
    # Step 5: Plot 
    plot(cookies_df, 
         type = "b",
         col = "salmon",
         lwd = 2,
         main = "Cookies Linear Regression Model Actual vs Fitted") 
      lines(predict_cookies, col = "skyblue", lwd = 2) 
      legend("topleft", 
            legend = c("Observed", "Predicted"), 
             fill = c("salmon","skyblue")) 
      points(x = 7, y = predict_cookies[7], col = "red", pch = 16) 
      text(x = 7, y = predict_cookies[7]+0.3, paste0(predict_cookies[7]))
  11. MSE Function
  12. mse <- function(y, yhat){
      error <- mean( (y - yhat)^2 )
      return(error)
    }
    
    mse(data$y, predict_lin_model)
    mse(data$y, predict_quad_model)
    mse(data$y, predict_cubic_model)
    
[1] Music data set originates from Kaggle