Lecture 01 · Fundamentals

Introduction to R & Setup

What is R?

R is an open-source language and environment designed for statistical computing, data analysis, and visualization. It was created by statisticians, so tools such as data frames, model formulas, and plotting are part of the core language rather than add-ons, which makes it a popular choice for data exploration.

It is widely used in academia, data science, bioinformatics, and industry for everything from simple linear regressions to complex machine learning models.

Why Choose R?

Best-in-class Visualization: With libraries like ggplot2, R produces publication-quality graphics.
CRAN Ecosystem: Access to the Comprehensive R Archive Network (CRAN), which hosts thousands of specialized packages.
Exploratory Data Analysis (EDA): R allows you to interact with your data in real-time, making it easier to find patterns and outliers.
Tidyverse: A collection of cohesive packages designed for a consistent and intuitive data science workflow.
Strong Integration: Works with Python (for example through the reticulate package) and connects to SQL databases (for example through the DBI package).

Installation & Environment

To get started, you need to install two separate pieces of software:

R (The Engine): This is the actual language that processes the code. Download it from cran.r-project.org.
RStudio (The Cockpit): This is the Integrated Development Environment (IDE). It makes writing, debugging, and visualizing your R code much easier. Download it from posit.co.

Note: Always install R first, then install RStudio.

Understanding the RStudio Interface

Once you open RStudio, you will see four primary panes. Understanding these is key to your productivity:

Source Editor (Top Left): Where you write and save your scripts (.R files).
Console (Bottom Left): Where the code actually runs. You can type commands here for instant results.
Environment/History (Top Right): The Environment tab lists every variable, data frame, or list currently stored in memory; the History tab lists the commands you have run.
Files/Plots/Help (Bottom Right): Where you view your folders, see your generated graphs, and read documentation.

Your First Lines of Code

In R, we use the <- operator (called the assignment operator) to store values in variables.

R Console

> # Printing a simple message
> print("Hello, R Mastery!")
[1] "Hello, R Mastery!"

> # Assigning values to variables
> my_name <- "Alex"
> current_year <- 2024
> result <- 17 * 23

> # Printing the variables
> print(my_name)
[1] "Alex"
> print(result)
[1] 391

🎯 Exercise 1.1: First Steps

1. Install R and RStudio on your machine.

2. Open RStudio and create a new R Script (File → New File → R Script).

3. Write and run code that does the following:

Assign your name to a variable called user_name.
Assign today's date to a variable called today.
Calculate 17 * 23 and store it in a variable called calc_result.
Print all three variables to the console.

Lecture 02 · Fundamentals

Variables & Data Types

Assignment Operator

In R, we typically use <- for assignment.

Assignment means storing a value inside a variable. A variable acts like a container that holds data which can later be used, modified, or printed.

variables.R

x <- 42
name <- "R Student"
is_valid <- TRUE

In the example above:

x stores a number
name stores text
is_valid stores a logical value
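
Because a variable is just a name bound to a value, it can be reassigned and reused in later calculations. The sketch below illustrates this; the names unit_price, quantity, and total are illustrative and not part of the lecture's script.

```r
# Illustrative names; not part of the lecture's script
unit_price <- 20
quantity <- 3

# Use variables in a calculation
total <- unit_price * quantity
print(total)

# Reassigning replaces the old value; total is NOT recalculated automatically
quantity <- 5
print(total)

# Recompute to pick up the new quantity
total <- unit_price * quantity
print(total)
```

The middle print() still shows the old total, which is a common source of confusion: R stores the computed value, not the formula.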

Data Types in R

Every variable in R stores a particular type of data. Understanding data types is extremely important because different operations work on different types of data.

1. Numeric

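Numeric is R's default type for numbers. Values are stored as double-precision floating-point numbers, so both 99.99 and a whole number such as 42 are numeric unless you add the L suffix described below.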

numeric.R

price <- 99.99
temperature <- 36.6

print(price)
print(temperature)

2. Integer

Integers are whole numbers without decimals. In R, integers are written using the L suffix.

integer.R

age <- 25L
year <- 2025L

print(age)
print(year)

3. Character

Character data represents text and must be written inside quotes.

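character.R

first_name <- "Aman"
# Named welcome_message rather than message, which is also the name of a base R function
welcome_message <- "Welcome to R programming"

print(first_name)
print(welcome_message)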

4. Logical

Logical values represent either TRUE or FALSE. They are commonly used in conditions and decision making.

logical.R

is_logged_in <- TRUE
has_permission <- FALSE

print(is_logged_in)
print(has_permission)

Checking the Data Type

We can use the class() function to check the type of a variable.

class_check.R

x <- 10
name <- "R Language"
status <- TRUE

class(x)
class(name)
class(status)
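
To make the results of class() concrete, here is a small sketch that also shows typeof(), which reports how R stores the value internally. The variable names whole_number and true_integer are illustrative.

```r
whole_number <- 10    # no L suffix, so stored as a double
true_integer <- 10L   # L suffix makes it an integer

class(whole_number)   # "numeric"
class(true_integer)   # "integer"

typeof(whole_number)  # "double"
typeof(true_integer)  # "integer"

# Type-specific checks return TRUE or FALSE
is.numeric(true_integer)    # TRUE: integers count as numeric
is.integer(whole_number)    # FALSE
is.character("10")          # TRUE
```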

Type Conversion

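Sometimes we need to convert one data type into another. This process is called type conversion or type casting; R's documentation calls it coercion. The as.*() family of functions (as.numeric(), as.character(), as.integer(), as.logical()) performs the conversion.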

conversion.R

x <- "100"

numeric_x <- as.numeric(x)

print(numeric_x)
class(numeric_x)
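
Conversion does not always succeed. The sketch below shows what happens with text that is not a number, along with a few other common conversions. The name bad_value is illustrative.

```r
# Text that looks like a number converts cleanly
as.numeric("3.14")

# Text that is not a number becomes NA; R would normally print a warning here
bad_value <- suppressWarnings(as.numeric("abc"))
is.na(bad_value)          # TRUE

# as.integer() drops the fractional part rather than rounding
as.integer(7.9)           # 7

# Logical values convert to 1 and 0
as.numeric(TRUE)          # 1
as.numeric(FALSE)         # 0

# Numbers convert to text
as.character(42)          # "42"
```

Checking the result with is.na() after converting user-supplied text is a simple way to catch bad input early.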

Summary

Variables are used to store data
<- is the standard assignment operator in R
R supports numeric, integer, character, and logical data types
Use class() to check data types
Variables can be updated and used in calculations

Lecture 03 · Fundamentals

Operators & Expressions

Arithmetic Operators

Arithmetic operators are used to perform mathematical calculations. R follows the standard order of operations (PEMDAS).

arithmetic.R

# Basic Math
5 + 10    # Addition (15)
15 - 5    # Subtraction (10)
4 * 3     # Multiplication (12)
10 / 2    # Division (5)
2 ^ 3     # Exponentiation/Power (8)

# Advanced Math
13 %% 5   # Modulo: Returns the remainder (3)
13 %/% 5  # Integer Division: Returns how many times it fits (2)
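
The order of operations is easiest to see with a few side-by-side expressions. This sketch also shows how modulo and integer division fit together. The names dividend, divisor, and rebuilt are illustrative.

```r
2 + 3 * 4      # 14: multiplication before addition
(2 + 3) * 4    # 20: parentheses first
2 ^ 3 ^ 2      # 512: exponentiation groups right to left, i.e. 2 ^ (3 ^ 2)
-2 ^ 2         # -4: the power is applied before the minus sign
(-2) ^ 2       # 4

# Integer division and modulo together recover the original number
dividend <- 13
divisor <- 5
rebuilt <- (dividend %/% divisor) * divisor + (dividend %% divisor)
rebuilt        # 13
```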

Comparison Operators

Comparison operators are used to compare two values. The result is a logical value: TRUE or FALSE (or NA if one of the values being compared is missing).

comparisons.R

10 == 10   # Equal to (TRUE)
10 != 10   # Not equal to (FALSE)
5 > 3      # Greater than (TRUE)
2 < 1      # Less than (FALSE)
10 >= 10   # Greater than or equal to (TRUE)
7 <= 5     # Less than or equal to (FALSE)
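
Two behaviours of comparison operators are worth seeing in practice: decimal numbers are stored approximately, so == can surprise you, and comparisons work on whole vectors at once. The names temperatures and has_fever are illustrative.

```r
0.1 + 0.2 == 0.3                    # FALSE because of binary rounding
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE: equal within a small tolerance

# Comparisons are vectorized: each element is compared in turn
temperatures <- c(36.6, 38.2, 37.0)
has_fever <- temperatures > 37.5
has_fever                           # FALSE TRUE FALSE

# Comparing with a missing value gives NA rather than TRUE or FALSE
NA > 1                              # NA
```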

Logical Operators

Logical operators allow you to combine multiple comparisons. This is how you create complex filters.

& (AND): Returns TRUE if both conditions are true.
| (OR): Returns TRUE if at least one condition is true.
! (NOT): Reverses the result (TRUE becomes FALSE).

logical_ops.R

# Example variables
age <- 25
has_license <- TRUE

# AND operator
(age > 18) & (has_license == TRUE)   # TRUE

# OR operator
(age > 30) | (has_license == TRUE)    # TRUE

# NOT operator
!(age == 25)                         # FALSE
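
The same operators work on whole vectors, which is what makes them useful for filtering. The sketch below uses made-up data for three applicants; the names ages, has_license, and eligible are illustrative.

```r
# Hypothetical data for three applicants
ages <- c(17, 25, 42)
has_license <- c(FALSE, TRUE, TRUE)

# Element-wise AND: TRUE only where both conditions hold
eligible <- (ages >= 18) & has_license
eligible                 # FALSE TRUE TRUE

# A logical vector inside [] keeps only the TRUE positions
ages[eligible]           # 25 42

# Combine with NOT to get everyone else
ages[!eligible]          # 17

# has_license is already logical, so "== TRUE" is not needed
identical(has_license == TRUE, has_license)   # TRUE
```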

🎯 Exercise 3.1: Expression Challenge

Create a script that performs the following tasks:

Calculate the remainder of 100 divided by 7 using the modulo operator.
Create two variables, a <- 15 and b <- 20.
Write a comparison expression that checks if a is less than b AND a is greater than 10.
Test the ! (NOT) operator on the result of your previous comparison.

Lecture 04 · Fundamentals

Control Flow

Conditional Branching: if, else if, else

In R, control flow allows your program to make decisions and execute different paths of code based on logical conditions. The condition goes inside parentheses () and the code to run goes inside braces {}.

control_flow.R

score <- 85

if (score >= 90) {
    grade <- "A"
} else if (score >= 80) {
    grade <- "B"
} else {
    grade <- "C"
}

Important syntax note

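At the top level of a script or in the console, else or else if must start on the same line as the closing brace } of the preceding block. If else starts a new line there, R treats the if statement as already complete and reports a syntax error (unexpected 'else'). Inside an enclosing pair of braces, such as a function body, else on a new line is accepted, but keeping it on the closing-brace line works everywhere.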
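
One way to check the placement rule for else without breaking your own script is to hand the code to parse() as text and see whether R accepts it. This is only a sketch; all variable names in it are illustrative.

```r
# The same if/else written two ways, stored as text so we can test how R parses it
else_on_new_line <- "if (TRUE) {\n  x <- 1\n}\nelse {\n  x <- 2\n}"
else_on_same_line <- "if (TRUE) {\n  x <- 1\n} else {\n  x <- 2\n}"

# At the top level, else on its own line fails to parse
top_level <- try(parse(text = else_on_new_line), silent = TRUE)
inherits(top_level, "try-error")          # TRUE

# Keeping else on the closing-brace line parses fine
same_line <- try(parse(text = else_on_same_line), silent = TRUE)
inherits(same_line, "try-error")          # FALSE

# Wrapped in braces (as in a function body), the new-line form is accepted
wrapped <- paste0("{\n", else_on_new_line, "\n}")
in_block <- try(parse(text = wrapped), silent = TRUE)
inherits(in_block, "try-error")           # FALSE
```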

Vectorized Conditionals: ifelse()

R is fundamentally designed to work with vectors, but a plain if statement tests only a single TRUE or FALSE value, so it cannot be applied directly to a whole vector. Looping over thousands of elements with if-else trees is slow and verbose. R provides a vectorized conditional function called ifelse() that evaluates a logical vector element-wise and returns values accordingly.

vectorized_conditional.R

ages <- c(12, 17, 24, 15, 30, 18)

# Syntax: ifelse(test_expression, value_if_true, value_if_false)
membership <- ifelse(ages >= 18, "Adult", "Minor")

print(membership)
# Output: [1] "Minor" "Minor" "Adult" "Minor" "Adult" "Adult"

Logical Evaluation Differences

R supports both element-wise logical operators (& and |) and short-circuit scalar logical operators (&& and ||).

Use & and | when evaluating vectors (e.g. modifying columns in a data frame).
Use && and || inside single-value conditional checks (e.g., control structures inside standard if statements).
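
The difference matters most when the second condition would fail if evaluated. In the sketch below, maybe_value stands in for an input that may be absent; the names maybe_value, status, and values are illustrative.

```r
maybe_value <- NULL   # hypothetical input that might be missing

# && stops at the first FALSE, so "maybe_value > 0" is never evaluated here
if (!is.null(maybe_value) && maybe_value > 0) {
    status <- "positive"
} else {
    status <- "missing or not positive"
}
print(status)

# & evaluates both sides and works element by element
values <- c(-2, 5, 8)
values > 0 & values < 6     # FALSE TRUE FALSE

# any() and all() collapse a logical vector to one value for use in if ()
if (any(values < 0)) {
    print("At least one negative value")
}
```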

🎯 Exercise 4.1: Smart Grades

Write an R script to solve the following:

Create a variable temperature <- 28. Write an if-else structure that prints "Hot" if temperature is > 30, "Warm" if between 15 and 30, and "Cold" otherwise.
Create a vector of test scores: scores <- c(55, 78, 92, 45, 88). Use the ifelse() function to label each score as "Pass" if >= 60, and "Fail" if less than 60.

Lecture 05 · Fundamentals

Loops & Functions

Loops in R

Loops are used to execute a block of code repeatedly. R supports for, while, and repeat loops. However, in R, we often prefer vectorized commands over loops for performance reasons.

1. For Loop

for_loop.R

# Loop through numbers 1 to 5
for (i in 1:5) {
    print(paste("Iteration number:", i))
}
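
For comparison, here is a sketch of the vectorized style mentioned above, next to an equivalent loop. The names messages, total_loop, and total_vectorized are illustrative.

```r
# paste() is vectorized, so one call builds all five messages
messages <- paste("Iteration number:", 1:5)
print(messages)

# Loop version: accumulate a running total of squares
total_loop <- 0
for (i in 1:5) {
    total_loop <- total_loop + i^2
}

# Vectorized version: square every element, then sum
total_vectorized <- sum((1:5)^2)

total_loop == total_vectorized   # TRUE (both are 55)
```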

2. While Loop

while_loop.R

counter <- 1
while (counter <= 3) {
    print(paste("Counter is:", counter))
    counter <- counter + 1
}
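
The repeat loop mentioned earlier has no condition of its own, so it needs an explicit break. The sketch below also shows next, which skips to the following iteration. The names attempt and kept are illustrative.

```r
attempt <- 0
repeat {
    attempt <- attempt + 1
    print(paste("Attempt:", attempt))

    # Without this break the loop would never end
    if (attempt >= 3) {
        break
    }
}

# next skips the rest of the current iteration
kept <- c()
for (i in 1:5) {
    if (i == 3) {
        next
    }
    kept <- c(kept, i)
}
print(kept)   # 1 2 4 5
```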

Defining Custom Functions

Writing functions helps you avoid copying code and makes your scripts modular. A function in R is declared using the function keyword and assigned to a variable name.

functions.R

# Definition: function_name <- function(arg1, arg2 = default_value) { ... }
greet_user <- function(name, greeting = "Hello") {
    full_message <- paste(greeting, name)
    return(full_message)
}

# Calling the function
greet_user("Alex")                  # Returns "Hello Alex"
greet_user("Taylor", greeting = "Hi") # Returns "Hi Taylor"

💡 Implicit Returns

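In R, a function automatically returns the value of the last evaluated expression if an explicit return() statement is omitted. Style guides differ on which form to prefer: some recommend an explicit return() everywhere for readability, while others (the tidyverse style guide, for example) reserve return() for early exits. Pick one convention and apply it consistently.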
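
A short sketch of both forms: add_tax relies on the implicit return, and safe_divide uses return() to leave the function early. Both functions are hypothetical examples, not part of the lecture's script.

```r
# Implicit return: the last evaluated expression is the result
add_tax <- function(amount, rate = 0.25) {
    amount * (1 + rate)
}

# Explicit return() used for an early exit
safe_divide <- function(numerator, denominator) {
    if (denominator == 0) {
        return(NA)   # leave the function immediately
    }
    numerator / denominator
}

add_tax(100)          # 125
safe_divide(10, 2)    # 5
safe_divide(10, 0)    # NA
```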

🎯 Exercise 5.1: Functional Logic

Create a script to define and run the following:

Write a function named is_even that takes a numeric argument and returns TRUE if it is divisible by 2, and FALSE otherwise.
Write a for loop that iterates over a vector of numbers from 1 to 10 and prints whether each number is "Even" or "Odd" using your custom function.

Lecture 06 · Core Concepts

Data Structures & Vectors

Vectors: The Core Data Structure

In R, the vector is the most basic and vital data structure. Every single scalar value (like a single number or string) is actually a vector of length 1. Vectors must contain elements of the same data type (homogeneous).

vectors.R

# Create a vector using c()
numeric_vec <- c(2.4, 5.7, 9.1)
logical_vec <- c(TRUE, FALSE, TRUE)
char_vec <- c("apple", "banana", "cherry")

# Vector indexing is 1-based
char_vec[1]  # Returns "apple"

# Negative indexing excludes items
char_vec[-2] # Returns c("apple", "cherry")
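
Since every element must share one type, R converts mixed input automatically. The sketch below shows that rule and two more ways to index. The name mixed is illustrative.

```r
# A single value is a vector of length 1
length(42)                      # 1

# Mixing types: everything is converted to the most flexible type
mixed <- c(1, "apple", TRUE)
mixed                           # "1" "apple" "TRUE"
class(mixed)                    # "character"

# Logical mixed with numbers becomes numeric (TRUE = 1, FALSE = 0)
c(10, TRUE, FALSE)              # 10 1 0

# Selecting several positions at once
char_vec <- c("apple", "banana", "cherry")
char_vec[c(1, 3)]               # "apple" "cherry"
char_vec[2:3]                   # "banana" "cherry"
```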

Vector Recycling

When you perform operations on two vectors of different lengths, R automatically "recycles" the shorter vector (repeating it) to match the length of the longer vector. If the longer length is not a multiple of the shorter one, R still recycles but issues a warning.

recycling.R

v1 <- c(1, 2, 3, 4)
v2 <- c(10, 20)

# v2 is recycled to c(10, 20, 10, 20)
v1 + v2  # Returns c(11, 22, 13, 24)
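
Two more cases are worth trying: recycling a single number, and lengths that do not divide evenly. The names doubled and uneven are illustrative.

```r
# A single number is recycled across the whole vector
doubled <- c(1, 2, 3, 4) * 2
doubled                             # 2 4 6 8

# Lengths 3 and 2: the shorter vector is reused as 10, 20, 10.
# Without suppressWarnings() R would warn that the longer length
# is not a multiple of the shorter one.
uneven <- suppressWarnings(c(1, 2, 3) + c(10, 20))
uneven                              # 11 22 13
```

In real analyses that warning usually signals a mistake, such as two columns of different lengths, so it is better to investigate it than to silence it.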

Lists, Matrices & Factors

R provides other structures for complex scenarios:

Lists: Heterogeneous collections that can store different types and nested elements.
Matrices: Two-dimensional homogeneous tables.
Factors: Used to handle categorical labels with predefined levels (critical for statistical modeling).

structures.R

# Lists
my_list <- list(id = 101, name = "John", scores = c(90, 85))
my_list$name  # "John"

# Factors
satisfaction <- factor(c("Low", "High", "Medium", "Low"))
levels(satisfaction)  # [1] "High" "Low" "Medium"
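
The sketch below fills in a matrix example, shows how to give factor levels a meaningful order instead of the alphabetical default, and shows a second way to extract list elements. The matrix name m is illustrative.

```r
# Matrices: values fill column by column unless byrow = TRUE
m <- matrix(1:6, nrow = 2, ncol = 3)
m
dim(m)        # 2 3
m[2, 3]       # 6: row 2, column 3
m[, 1]        # 1 2: the whole first column

# Factors: set the level order explicitly instead of relying on alphabetical order
satisfaction <- factor(
    c("Low", "High", "Medium", "Low"),
    levels = c("Low", "Medium", "High"),
    ordered = TRUE
)
levels(satisfaction)          # "Low" "Medium" "High"
satisfaction < "High"         # TRUE FALSE TRUE TRUE
table(satisfaction)           # counts per level

# Lists: [[ ]] extracts an element, just like $
my_list <- list(id = 101, name = "John", scores = c(90, 85))
my_list[["scores"]][1]        # 90
```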

🎯 Exercise 6.1: Vector Arithmetic

Complete the tasks below in your R console:

Create a numeric vector prices <- c(10, 25, 100, 5, 45).
Calculate a new vector discounted_prices where every price is reduced by 10%.
Write a logical expression to find all prices in prices that are greater than 30. Use this expression to filter the vector.

Lecture 07 · Core Concepts

Data Frames & dplyr

Understanding Data Frames

A data frame is R's native representation of a spreadsheet or a table. It is technically a list of vectors where each column represents a variable and all columns are of equal length.

data_frames.R

employees <- data.frame(
    name = c("Alice", "Bob", "Charlie"),
    salary = c(55000, 62000, 48000),
    remote = c(TRUE, FALSE, TRUE),
    stringsAsFactors = FALSE  # the default since R 4.0.0; only needed on older versions
)

print(employees$salary)  # Access column using $ operator
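
Before turning to dplyr, it helps to know the base R tools for inspecting and subsetting a data frame. The sketch below rebuilds the employees table so it runs on its own; high_earners and the monthly column are illustrative additions.

```r
employees <- data.frame(
    name = c("Alice", "Bob", "Charlie"),
    salary = c(55000, 62000, 48000),
    remote = c(TRUE, FALSE, TRUE)
)

nrow(employees)                 # 3 rows
ncol(employees)                 # 3 columns
str(employees)                  # column names and types

# [rows, columns]: keep rows where the condition is TRUE, all columns
high_earners <- employees[employees$salary > 50000, ]
high_earners$name               # "Alice" "Bob"

# Add a column by assigning to a new name
employees$monthly <- employees$salary / 12

# [[ ]] also extracts a single column as a vector
mean(employees[["salary"]])     # 55000
```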

Modern Data Manipulation: dplyr

dplyr is one of the core packages of the tidyverse. It provides a small grammar of data manipulation: simple, readable verbs that replace nested bracket indexing.

Verb	Description
`filter()`	Selects rows matching a logical condition.
`select()`	Selects columns by name.
`mutate()`	Creates new columns or overrides existing ones.
`arrange()`	Sorts dataset rows.
`summarise()`	Aggregates columns to single values (e.g., sum, mean).

The Pipe Operator ( %>% )

The pipe operator chains data operations sequentially, feeding the output of one function as the first argument of the next. This replaces deeply nested code syntax with clean, top-down instruction blocks.

dplyr_pipe.R

library(dplyr)

# Chain operations together
clean_report <- employees %>%
    filter(salary > 50000) %>%
    mutate(bonus = salary * 0.05) %>%
    select(name, bonus) %>%
    arrange(desc(bonus))

print(clean_report)
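
The %>% operator comes from the magrittr package and is made available when dplyr is loaded. R 4.1 and newer also include a native pipe, |>, that needs no package. The sketch below assumes R 4.1 or later and compares nested and piped forms of the same calculation; piped_result is an illustrative name.

```r
# Nested form: read from the inside out
round(mean(sqrt(c(1, 4, 9, 16))), 1)

# Piped form: read top to bottom (native pipe, assumes R 4.1 or newer)
piped_result <- c(1, 4, 9, 16) |>
    sqrt() |>
    mean() |>
    round(1)

piped_result   # 2.5
```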

🎯 Exercise 7.1: dplyr Basics

Complete the task below using the built-in dataset mtcars:

Load dplyr.
Select the columns mpg, cyl, and hp.
Filter the dataset for cars with cyl == 6.
Arrange the filtered table by mpg in descending order.

Lecture 08 · Core Concepts

Data Import & Cleaning

Importing Datasets

In real-world data science, you will load datasets from local files, SQL databases, or APIs. The most common format is CSV. Base R provides read.csv(), but the readr package (part of the tidyverse) is highly recommended for speed and consistent column type parsing.

import.R

library(readr)

# Importing a CSV dataset
my_data <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-10/tuition_cost.csv")

# Quickly inspect layout
head(my_data)
str(my_data)
summary(my_data)

Handling Missing Data (NA)

Missing observations in R are represented by the reserved value NA (Not Available). They propagate through math functions, which means any summary calculations containing NA will return NA by default.

na_handling.R

scores <- c(80, 95, NA, 72)

mean(scores)              # Returns NA
mean(scores, na.rm = TRUE) # Returns 82.33 (na.rm ignores missing values)

# Checking for NA
is.na(scores)  # Returns: FALSE FALSE TRUE FALSE
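
A few more base R patterns for working with NA, shown as a sketch. The names observed and filled are illustrative, and filling with the mean is just one simple choice of replacement.

```r
scores <- c(80, 95, NA, 72)

# NA means "unknown", so comparing with == also gives NA
NA == NA                         # NA
scores == NA                     # NA NA NA NA (not useful)

# Count and locate missing values with is.na()
sum(is.na(scores))               # 1
which(is.na(scores))             # 3

# Keep only the observed values
observed <- scores[!is.na(scores)]
observed                         # 80 95 72

# Replace missing values with the mean of the observed ones
filled <- scores
filled[is.na(filled)] <- mean(scores, na.rm = TRUE)
```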

Cleaning Datasets with tidyr

The tidyr package contains verbs that target missing data, clean structures, and transform tables between wide and long formats.

data_cleaning.R

library(dplyr)
library(tidyr)

# Drop rows where room_and_board is missing (my_data comes from import.R above)
clean_rows <- my_data %>%
    drop_na(room_and_board)

# Replace missing values in a specific column
imputed_data <- my_data %>%
    replace_na(list(degree_length = "Unknown"))

🎯 Exercise 8.1: Missing Values

Write an R script to clean the following table:

Create a sample data frame: df <- data.frame(id = 1:4, score = c(100, NA, 85, 90), level = c("Beginner", "Intermediate", NA, "Advanced"))
Calculate the average score ignoring NAs.
Replace missing levels with the word "Standard".

Lecture 09 · Advanced

ggplot2 Visualization

The Grammar of Graphics

The ggplot2 package implements a structured theory of data visualization called the Grammar of Graphics. Plots are built in layers: start with the data, map columns to aesthetics (axes, color, shape), then add geometric objects (points, lines, bars) with geom_*() functions.

grammar_graphics.R

# Template structure:
# ggplot(data = dataset, mapping = aes(x = col1, y = col2)) + geom_shape()

Building Common Charts

Let's construct a simple scatter plot and a bar chart using the mpg dataset, which ships with ggplot2 and becomes available once the package is loaded.

plots.R

library(ggplot2)

# 1. Scatter Plot (Engine displacement vs Highway MPG)
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
    geom_point(size = 2.5) +
    labs(
        title = "Engine displacement vs Highway MPG",
        x = "Displacement (L)",
        y = "Highway MPG"
    ) +
    theme_minimal()

# 2. Bar Chart (Count of cars by class)
ggplot(data = mpg, mapping = aes(x = class, fill = class)) +
    geom_bar() +
    labs(title = "Car counts by Class") +
    theme_classic()

Saving Visual Plots

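Use the ggsave() function to save the most recently displayed plot to a file. The file extension selects the format (for example .png or .pdf), and width, height (in inches by default), and dpi control the size and resolution.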

save_plot.R

ggsave("car_class_plot.png", width = 6, height = 4, dpi = 300)

🎯 Exercise 9.1: Build a Histogram

Write an R script to build the following visualization:

Use the built-in mtcars dataset.
Build a histogram of horsepower (hp) using geom_histogram().
Set the bin width to 30.
Label the titles clearly and apply a built-in theme of your choice.

Lecture 10 · Advanced

Statistical Analysis

Descriptive Statistics

R excels at numerical statistics. You can generate summary tables and calculate standard statistical indices with single function calls.

stats.R

heights <- c(172, 185, 168, 175, 190, 165, 178)

mean(heights)    # Mean
median(heights)  # Median
sd(heights)      # Standard Deviation
var(heights)     # Variance

summary(heights) # Print 5-number summary + Mean
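
A few related functions, plus a check of what sd() actually computes, are sketched below. The names n and manual_sd are illustrative.

```r
heights <- c(172, 185, 168, 175, 190, 165, 178)

range(heights)                         # 165 190: minimum and maximum
quantile(heights, c(0.25, 0.5, 0.75))  # quartiles
IQR(heights)                           # interquartile range

# sd() uses the sample formula: divide by n - 1, not n
n <- length(heights)
manual_sd <- sqrt(sum((heights - mean(heights))^2) / (n - 1))
all.equal(manual_sd, sd(heights))      # TRUE

# Variance is the square of the standard deviation
all.equal(var(heights), sd(heights)^2) # TRUE
```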

Hypothesis Testing

Testing hypotheses is a common task in statistics. R includes standard functions for t-tests, ANOVA, and correlations.

hypothesis_testing.R

# Two-sample t-test comparing two sample means
# (t.test() runs Welch's version by default; add var.equal = TRUE for Student's)
groupA <- c(22, 25, 21, 24, 28)
groupB <- c(30, 32, 28, 35, 31)

t.test(groupA, groupB)

# Pearson correlation coefficient
cor(groupA, c(21, 23, 20, 25, 27))
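
t.test() returns an object whose parts can be extracted, which is handy when a report needs just the p-value or the confidence interval. The sketch below stores the result; welch_result and student_result are illustrative names.

```r
groupA <- c(22, 25, 21, 24, 28)
groupB <- c(30, 32, 28, 35, 31)

# Store the result instead of only printing it
welch_result <- t.test(groupA, groupB)

welch_result$statistic   # t value
welch_result$p.value     # p-value
welch_result$conf.int    # 95% confidence interval for the difference in means

# A common decision rule at the 5% level
welch_result$p.value < 0.05   # TRUE for these samples

# Classic Student's t-test assumes equal variances
student_result <- t.test(groupA, groupB, var.equal = TRUE)
student_result$parameter      # degrees of freedom: 8
```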

Linear Regression Modeling

To model a numeric outcome as a linear function of one or more predictors, R provides the lm() function (Linear Model). Inspect the fitted model with summary().

regression.R

# Syntax: lm(dependent_variable ~ independent_variable, data = dataset)
car_model <- lm(mpg ~ hp, data = mtcars)

# Summarize statistical significance, coefficient estimates, R-squared values
summary(car_model)
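
Beyond reading the printed summary, you can pull individual numbers out of the model and use it for prediction. In the sketch below, new_cars is a made-up data frame of two hypothetical cars; predicted_mpg is an illustrative name.

```r
car_model <- lm(mpg ~ hp, data = mtcars)

# Intercept and slope as a named vector
coef(car_model)

# A negative slope means mpg falls as horsepower rises
coef(car_model)["hp"] < 0

# Share of variance in mpg explained by hp
summary(car_model)$r.squared

# Predict mpg for hypothetical cars with 100 and 200 horsepower
new_cars <- data.frame(hp = c(100, 200))
predicted_mpg <- predict(car_model, newdata = new_cars)
predicted_mpg
```

The column name in newdata must match the predictor name used in the formula (hp here).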

🎯 Exercise 10.1: Correlation & Regression

Write an R script to solve the following:

Compute the correlation coefficient between displacement (displ) and highway mileage (hwy) in the mpg dataset (load ggplot2 first, since mpg ships with that package).
Fit a linear model predicting highway mileage using engine displacement. Analyze whether engine displacement is a statistically significant predictor.

Lecture 11 · Advanced

Advanced Data Manipulation

Relational Joins with dplyr

When working with relational data, you will need to join data frames using keys. dplyr provides mutating joins (left_join(), inner_join(), right_join(), full_join()) that mirror the standard SQL join types.

joins.R

library(dplyr)

df_left <- data.frame(customer_id = 1:3, name = c("Amy", "Ben", "Carl"))
df_right <- data.frame(customer_id = c(1, 2, 4), balance = c(500, 250, 100))

# Keep every row of df_left and add balance where customer_id matches (NA otherwise)
customer_info <- left_join(df_left, df_right, by = "customer_id")
print(customer_info)

# Keep only rows that have matches in both tables
matching_customers <- inner_join(df_left, df_right, by = "customer_id")
print(matching_customers)
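
To see what each join keeps without installing anything, the same operations can be sketched with base R's merge(). This is an alternative shown for comparison, not the dplyr approach used in this lecture; left_result, inner_result, and full_result are illustrative names.

```r
df_left <- data.frame(customer_id = 1:3, name = c("Amy", "Ben", "Carl"))
df_right <- data.frame(customer_id = c(1, 2, 4), balance = c(500, 250, 100))

# Left join: all.x = TRUE keeps every row of df_left
left_result <- merge(df_left, df_right, by = "customer_id", all.x = TRUE)
left_result$balance        # 500 250 NA (customer 3 has no match)

# Inner join: the default keeps only keys present in both tables
inner_result <- merge(df_left, df_right, by = "customer_id")
inner_result$customer_id   # 1 2

# Full join: all = TRUE keeps unmatched rows from both sides
full_result <- merge(df_left, df_right, by = "customer_id", all = TRUE)
nrow(full_result)          # 4 (customers 1, 2, 3 and 4)
```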

Pivoting Datasets: Long vs Wide formats

Wide datasets (one column per measured variable) are often easier for humans to read. Long datasets (one column holding the variable names, another holding the values) are what most tidyverse functions and ggplot2 expect. The tidyr package provides pivot_longer() and pivot_wider() to convert between the two.

pivoting.R

library(tidyr)

wide_scores <- data.frame(
    student = c("A", "B"),
    math = c(90, 80),
    english = c(95, 88)
)

# Wide to Long
long_scores <- wide_scores %>%
    pivot_longer(
        cols = c(math, english),
        names_to = "subject",
        values_to = "score"
    )

print(long_scores)

🎯 Exercise 11.1: Joins & Reshaping

Write an R script to complete these tasks:

Use pivot_wider() to convert the long_scores table back to its original wide format.
Join a custom table of customer phone numbers: df_phones <- data.frame(customer_id = 1:2, phone = c("555-0199", "555-0122")) to the df_left data frame using a left join.

Lecture 12 · Advanced

Capstone Project: Data Analysis Report

Overview

In this capstone project, you will build an end-to-end data analysis workflow. You will load a real-world dataset, clean missing entries, run exploratory analyses, output visualizations, and perform predictive modeling.

Step 1: Environment Setup

Ensure that the core packages are loaded:

capstone.R

library(tidyverse)

Step 2: Load and Clean the Data

We will use the built-in chickwts dataset, which records the weights of 71 chicks at six weeks old, each raised on one of six feed supplements.

capstone_load.R

# Load built-in data
raw_chicks <- chickwts

# Clean and inspect
# (chickwts has no missing weights; the filter is kept as a habit for real-world data)
chicks_clean <- raw_chicks %>%
    filter(!is.na(weight))

glimpse(chicks_clean)

Step 3: Aggregation & Summary

Calculate the average weight and standard deviation for each feed category.

capstone_summary.R

feed_summary <- chicks_clean %>%
    group_by(feed) %>%
    summarise(
        count = n(),
        mean_weight = mean(weight),
        sd_weight = sd(weight)
    ) %>%
    arrange(desc(mean_weight))

print(feed_summary)

Step 4: Visualize Relationships

Build a boxplot comparing feed types against chick weight.

capstone_plot.R

ggplot(chicks_clean, aes(x = reorder(feed, weight, FUN = median), y = weight, fill = feed)) +
    geom_boxplot(alpha = 0.7) +
    labs(
        title = "Chick Weight Distribution by Feed Type",
        x = "Feed Type",
        y = "Weight (g)"
    ) +
    theme_minimal() +
    theme(legend.position = "none")

Step 5: Run ANOVA Hypothesis Test

Evaluate whether differences in mean weight across feeds are statistically significant.

capstone_anova.R

chick_anova <- aov(weight ~ feed, data = chicks_clean)
summary(chick_anova)
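
A significant ANOVA says that at least one feed differs, not which ones. One common follow-up is Tukey's Honest Significant Differences test, sketched below with base R's TukeyHSD(). The block uses chickwts directly so it runs on its own; anova_table, anova_p, tukey_result, and significant_pairs are illustrative names.

```r
chick_anova <- aov(weight ~ feed, data = chickwts)

# Pull the F-test p-value out of the ANOVA table
anova_table <- summary(chick_anova)[[1]]
anova_p <- anova_table[["Pr(>F)"]][1]
anova_p < 0.05

# Tukey's Honest Significant Differences: all pairwise feed comparisons
tukey_result <- TukeyHSD(chick_anova)
head(tukey_result$feed)

# Pairs whose adjusted p-value is below 0.05
significant_pairs <- tukey_result$feed[tukey_result$feed[, "p adj"] < 0.05, ]
rownames(significant_pairs)
```

Each row of tukey_result$feed is one pair of feeds, with the estimated difference, its confidence bounds, and an adjusted p-value.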

🎯 Capstone Deliverable

Write an R script that performs the same analysis flow (load, clean, summary, boxplot, hypothesis test) using the built-in dataset iris to compare Petal.Length across Species.

Lecture 13 · Advanced

R Markdown & Reporting

Introduction to R Markdown

R Markdown (.Rmd files) merges R code execution, plain markdown text, and HTML/PDF styling into unified, professional data reports. This is a critical skill for reproducibility in data science.

Structure of a .Rmd Document

An R Markdown file consists of three components: a YAML header, markdown commentary, and code chunks.

sample.Rmd

---
title: "My Data Report"
author: "R Mastery Student"
output: html_document
---

# Section Title
This is plain text with **bold markup** and *italics*.

```{r load-libraries, echo=TRUE, message=FALSE}
# This is a code chunk
library(ggplot2)
summary(cars)
```

Formatting Code Chunks

You can configure how code executes and prints by setting chunk options inside the braces of the chunk header, after the chunk name:

echo = TRUE / FALSE: Whether the R code itself appears in the final document.
eval = TRUE / FALSE: Whether R runs the chunk during compilation.
message = TRUE / FALSE: Whether messages (such as package start-up text) appear; FALSE hides them.
warning = TRUE / FALSE: Whether warnings appear; FALSE hides them.

Knitting Reports

In RStudio, compile your document by clicking the **Knit** button in the toolbar of the Source Editor. Knitting reads the YAML header, runs the code chunks in order in a fresh R session, and writes the output file next to the .Rmd file.

🎯 Exercise 13.1: Knit to HTML

Perform the following in RStudio:

Go to File → New File → R Markdown...
Name the document "Tidyverse Analysis".
Add a new code chunk containing a scatter plot of your choice.
Set the chunk option so the R code is hidden (echo=FALSE) but the graphic outputs are rendered.
Knit the document and check the HTML result.

Lecture 14 · Advanced

Final Project — Complete Data Analysis

Project Overview

For your final course project, you will download an independent dataset, clean it, execute data transformations, carry out statistical testing, and create descriptive graphics. Compile everything into a unified R Markdown report.

Project Guidelines

To pass the course, your final script/document must achieve the following milestones:

Data Import: Successfully read local files or web URLs.
Data Cleaning: Use dplyr or tidyr to remove missing (NA) observations and convert column types.
Exploratory Summaries: Print mean, standard deviation, and count aggregates grouped by categories.
Visualization: Create at least two publication-quality ggplot2 figures with customized colors, text labels, and themes.
Modeling/Testing: Run a t-test or fit a linear model, providing a written interpretation of the coefficients and p-values.

Code Template

Use the script structure below as a starting point:

final_project.R

# Load core packages
library(tidyverse)

# 1. Load Data
raw_data <- read_csv("my_data.csv")

# 2. Cleaning
clean_data <- raw_data %>%
    filter(!is.na(target_variable)) %>%
    mutate(category = as.factor(category))

# 3. Explore & Aggregate
summary_stats <- clean_data %>%
    group_by(category) %>%
    summarise(avg = mean(target_variable), count = n())

print(summary_stats)

# 4. Plots
ggplot(clean_data, aes(x = category, y = target_variable)) +
    geom_boxplot() +
    theme_minimal()

Submitting Your Work

Convert your final code into an R Markdown report, knit it to HTML, and submit both the .Rmd source and the HTML file by pushing them to your Git repository or sending them to your course administrator.