Lecture 01 · Fundamentals

Introduction to R & Setup

Beginner ~60 min

What is R?

R is an open-source language and environment designed for statistical computing, data analysis, and visualization. Unlike general-purpose languages, it was built by statisticians for statisticians, which is why tools such as data frames, missing-value handling, and model fitting are part of the core language rather than add-ons.

It is widely used in academia, data science, bioinformatics, and industry for everything from simple linear regressions to complex machine learning models.

Why Choose R?

  • Best-in-class Visualization: With libraries like ggplot2, R produces publication-quality graphics.
  • CRAN Ecosystem: Access to the Comprehensive R Archive Network (CRAN), which hosts thousands of specialized packages for every imaginable data task.
  • Exploratory Data Analysis (EDA): R allows you to interact with your data in real-time, making it easier to find patterns and outliers.
  • Tidyverse: A collection of cohesive packages (like dplyr and tidyr) designed for a consistent and intuitive data science workflow.
  • Strong Integration: Seamlessly works with Python via the reticulate package and connects easily to SQL databases.

Installation & Environment

To get started, you need to install two separate pieces of software:

  • R (The Engine): This is the actual language that processes the code. Download it from cran.r-project.org.
  • RStudio (The Cockpit): This is the Integrated Development Environment (IDE). It makes writing, debugging, and visualizing your R code much easier. Download it from posit.co.

Note: Always install R first, then install RStudio.

Understanding the RStudio Interface

Once you open RStudio, you will see four primary panes. Understanding these is key to your productivity:

  • Source Editor (Top Left): Where you write and save your scripts (.R files).
  • Console (Bottom Left): Where the code actually runs. You can type commands here for instant results.
  • Environment/History (Top Right): Shows every variable, data frame, or list currently stored in your R session's workspace.
  • Files/Plots/Help (Bottom Right): Where you view your folders, see your generated graphs, and read documentation.

Your First Lines of Code

In R, we use the <- operator (called the assignment operator) to store values in variables. While = works, the arrow is the standard convention in the R community.

R Console
> # Printing a simple message
> print("Hello, R Mastery!")
[1] "Hello, R Mastery!"

> # Assigning values to variables
> my_name <- "Alex"
> current_year <- 2024
> result <- 17 * 23

> # Printing the variables
> print(my_name)
[1] "Alex"
> print(result)
[1] 391

> # Checking R version
> version$version.string
[1] "R version 4.4.1 (2024-...)"

Exercise 1.1: First Steps

1. Install R and RStudio on your machine.

2. Open RStudio and create a new R Script (File → New File → R Script).

3. Write and run code that does the following:

  • Assign your name to a variable called user_name.
  • Assign today's date to a variable called today.
  • Calculate 17 * 23 and store it in a variable called calc_result.
  • Print all three variables to the console.
Lecture 02 · Fundamentals

Variables & Data Types

Beginner ~45 min

Assignment Operator

In R, we typically use <- for assignment.

Assignment means storing a value inside a variable. A variable acts like a container that holds data which can later be used, modified, or printed.

x <- 42
name <- "R Student"
is_valid <- TRUE

In the example above:

  • x stores a number
  • name stores text
  • is_valid stores a logical value

The value on the right side is assigned to the variable on the left side.

age <- 21
country <- "India"

print(age)
print(country)

Output:

[1] 21
[1] "India"

Using = for Assignment

R also allows the use of = for assignment. However, most R programmers prefer <- because it is considered clearer and is the traditional R style.

x = 100
print(x)

Both methods work, but throughout this course we will mainly use <-.
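One practical difference between the two operators shows up inside function calls. The following is a small illustrative sketch; the variable names avg_named, avg_assigned, and kept_values are made up for this example.

```r
# Inside a function call, = only names an argument.
# Here x is the first argument of mean(); no variable called x is created.
avg_named <- mean(x = c(2, 4, 6))

# With <-, the assignment is carried out and its value is passed to mean(),
# so a variable called kept_values now exists in the workspace.
avg_assigned <- mean(kept_values <- c(2, 4, 6))

print(avg_named)      # 4
print(avg_assigned)   # 4
print(kept_values)    # 2 4 6
```

This is one reason the R community keeps <- for assignment and reserves = for naming arguments.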

Variable Naming Rules

Variable names should be meaningful and easy to understand.

Rules for naming variables in R:

  • Variable names can contain letters, numbers, dots, and underscores
  • Variable names must start with a letter, or with a dot that is not followed by a number; they cannot start with a number or an underscore
  • Variable names are case-sensitive
  • Variable names cannot contain spaces
# Valid variable names
student_name <- "Rahul"
marks1 <- 90
total.score <- 450

# Invalid variable names
# 1name <- "Error"
# student marks <- 50

Data Types in R

Every variable in R stores a particular type of data. These are called data types.

Understanding data types is extremely important because different operations work on different types of data.

1. Numeric

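Numeric values are stored as double-precision floating-point numbers. This is R's default type for numbers, so a whole number such as 42 is also numeric unless it carries the L suffix described below.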

price <- 99.99
temperature <- 36.6

print(price)
print(temperature)

2. Integer

Integers are whole numbers without decimals. In R, integers are written using the L suffix.

age <- 25L
year <- 2025L

print(age)
print(year)
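To see the difference between numeric and integer values, it can help to compare the same number written both ways. This is a minimal sketch; the variable names are chosen only for illustration.

```r
whole_number <- 25    # no suffix: stored as a double, class "numeric"
int_number <- 25L     # L suffix: stored as an integer

class(whole_number)          # "numeric"
class(int_number)            # "integer"
is.integer(whole_number)     # FALSE
whole_number == int_number   # TRUE: the values are equal even though the types differ
```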

3. Character

Character data represents text and must be written inside quotes.

first_name <- "Aman"
message <- "Welcome to R programming"

print(first_name)
print(message)

4. Logical

Logical values represent either TRUE or FALSE. They are commonly used in conditions and decision making.

is_logged_in <- TRUE
has_permission <- FALSE

print(is_logged_in)
print(has_permission)

Checking the Data Type

We can use the class() function to check the type of a variable.

x <- 10
name <- "R Language"
status <- TRUE

class(x)
class(name)
class(status)

Output:

[1] "numeric"
[1] "character"
[1] "logical"

Type Conversion

Sometimes we need to convert one data type into another. This process is called type conversion or type casting.

x <- "100"

numeric_x <- as.numeric(x)

print(numeric_x)
class(numeric_x)

Output:

[1] 100
[1] "numeric"
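Other conversions follow the same as.*() pattern, and not every conversion succeeds. The sketch below shows a few common cases, including what happens when text cannot be read as a number; the variable name not_a_number is made up for this example.

```r
as.character(42)      # "42"
as.integer("7")       # 7
as.integer(3.9)       # 3 (the decimal part is dropped, not rounded)
as.logical("TRUE")    # TRUE

# Text that is not a number cannot be converted: R returns NA and warns.
# suppressWarnings() is used here only to keep the output quiet.
not_a_number <- suppressWarnings(as.numeric("abc"))
is.na(not_a_number)   # TRUE
```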

Basic Arithmetic Operations

R can perform mathematical calculations using variables.

a <- 10
b <- 5

print(a + b)
print(a - b)
print(a * b)
print(a / b)

Output:

[1] 15
[1] 5
[1] 50
[1] 2

Updating Variables

Variables can be updated by assigning a new value to them.

score <- 50

score <- score + 10

print(score)

Output:

[1] 60

Important Notes

  • R is case-sensitive, so age and Age are different variables
  • Character values must be inside quotes
  • Logical values are written as TRUE and FALSE
  • Use meaningful variable names for better readability
  • Always check data types when debugging errors

Common Beginner Mistakes

# Mistake 1: Missing quotes
# city <- Delhi

# Correct
city <- "Delhi"

# Mistake 2: Using undefined variables
# print(total)

# Mistake 3: Invalid variable name
# 2marks <- 90

Mini Practice

Try running the following code yourself:

student <- "Arjun"
marks <- 95
passed <- TRUE

print(student)
print(marks)
print(passed)

class(student)
class(marks)
class(passed)

Summary

  • Variables are used to store data
  • <- is the standard assignment operator in R
  • R supports numeric, integer, character, and logical data types
  • Use class() to check data types
  • Variables can be updated and used in calculations
  • Meaningful naming improves code readability
Lecture 03 · Fundamentals

Operators & Expressions

Beginner ~50 min

Arithmetic Operators

Arithmetic operators are used to perform mathematical calculations. R follows the standard order of operations (PEMDAS), but you can use parentheses ( ) to prioritize specific calculations.

R Console
# Basic Math
5 + 10    # Addition (15)
15 - 5    # Subtraction (10)
4 * 3     # Multiplication (12)
10 / 2    # Division (5)
2 ^ 3     # Exponentiation/Power (8)

# Advanced Math
13 %% 5   # Modulo: Returns the remainder (3)
13 %/% 5  # Integer Division: Returns how many times it fits (2)
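The effect of operator precedence and parentheses is easiest to see side by side. The lines below are a short sketch of the rules described above; the variable names quotient and remainder are illustrative.

```r
2 + 3 * 4      # 14: multiplication happens before addition
(2 + 3) * 4    # 20: parentheses are evaluated first
-2 ^ 2         # -4: exponentiation happens before the unary minus
(-2) ^ 2       # 4

# Integer division and modulo fit together:
# (quotient * divisor) + remainder gives back the original number
quotient <- 13 %/% 5       # 2
remainder <- 13 %% 5       # 3
quotient * 5 + remainder   # 13
```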

Comparison Operators

Comparison operators are used to compare two values. The result of a comparison is a logical value: TRUE or FALSE (or NA if one of the values is missing). These are critical when you start filtering data frames.

R Console
10 == 10   # Equal to (TRUE)
10 != 10   # Not equal to (FALSE)
5 > 3      # Greater than (TRUE)
2 < 1      # Less than (FALSE)
10 >= 10   # Greater than or equal to (TRUE)
7 <= 5     # Less than or equal to (FALSE)

Logical Operators

Logical operators allow you to combine multiple comparisons. This is how you create complex filters (e.g., "Find all users who are over 18 AND live in New York").

  • & (AND): Returns TRUE if both conditions are true.
  • | (OR): Returns TRUE if at least one condition is true.
  • ! (NOT): Reverses the result (TRUE becomes FALSE).
R Console
# Example variables
age <- 25
has_license <- TRUE

# AND operator
(age > 18) & (has_license == TRUE)   # TRUE

# OR operator
(age > 30) | (has_license == TRUE)    # TRUE

# NOT operator
!(age == 25)                         # FALSE
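The "over 18 AND live in New York" filter mentioned above can be sketched with two small vectors. The data here is hypothetical and only meant to show how & combines comparisons element by element.

```r
# Hypothetical data: one entry per user
ages <- c(16, 25, 40, 33)
cities <- c("New York", "Boston", "New York", "New York")

# Each comparison returns one TRUE/FALSE per user; & combines them pairwise
is_match <- (ages > 18) & (cities == "New York")
print(is_match)         # FALSE FALSE TRUE TRUE

# The logical vector can then pick out the matching ages
print(ages[is_match])   # 40 33
```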
Exercise 1.3: Expression Challenge

Create a script that performs the following tasks:

  • Calculate the remainder of 100 divided by 7 using the modulo operator.
  • Create two variables, a <- 15 and b <- 20.
  • Write a comparison expression that checks if a is less than b AND a is greater than 10.
  • Test the ! (NOT) operator on the result of your previous comparison.
Lecture 04 · Fundamentals

Control Flow

Beginner ~45 min

Introduction to Control Flow

Control flow determines how a program makes decisions and repeats tasks. By default, R executes code line by line from top to bottom.

However, real programs need logic. Sometimes we want certain code to run only when a condition is true. Sometimes we want to repeat a block of code multiple times.

Control flow statements allow us to:

  • Make decisions
  • Repeat operations
  • Control program execution
  • Build dynamic and intelligent programs

If Statements

The if statement is used to execute code only when a condition is true.

x <- 15

if (x > 10) {
    print("Large")
} else {
    print("Small")
}

In the example above:

  • If x is greater than 10, R prints "Large"
  • Otherwise, R prints "Small"

Understanding Conditions

Conditions are expressions that evaluate to either TRUE or FALSE.

x <- 15

print(x > 10)
print(x < 5)
print(x == 15)

Output:

[1] TRUE
[1] FALSE
[1] TRUE

Comparison Operators

Comparison operators are used to compare values.

Operator   Meaning
==         Equal to
!=         Not equal to
>          Greater than
<          Less than
>=         Greater than or equal to
<=         Less than or equal to

Simple If Statement

An if statement can also be used without an else block.

age <- 20

if (age >= 18) {
    print("You are eligible to vote")
}

The code inside the block executes only if the condition is TRUE.

If-Else Statement

The if-else statement is used when there are two possible outcomes.

marks <- 40

if (marks >= 50) {
    print("Pass")
} else {
    print("Fail")
}

Else If Ladder

Multiple conditions can be checked using else if.

score <- 82

if (score >= 90) {
    print("Grade A")
} else if (score >= 75) {
    print("Grade B")
} else if (score >= 50) {
    print("Grade C")
} else {
    print("Fail")
}

Output:

[1] "Grade B"

Logical Operators

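Logical operators are used to combine multiple conditions. Inside an if condition, use && and ||: they work on single TRUE/FALSE values and stop evaluating as soon as the result is known. The & and | operators from Lecture 03 compare vectors element by element and are meant for filtering data.

Operator   Meaning
&&         Logical AND
||         Logical OR
!          Logical NOT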

Using AND Operator

The AND operator returns TRUE only if both conditions are TRUE.

age <- 25
has_id <- TRUE

if (age >= 18 && has_id) {
    print("Entry Allowed")
}

Using OR Operator

The OR operator returns TRUE if at least one condition is TRUE.

is_weekend <- FALSE
is_holiday <- TRUE

if (is_weekend || is_holiday) {
    print("No Office Today")
}
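Because && and || evaluate from left to right and stop as soon as the answer is known, a safety check can be placed first. The sketch below assumes a hypothetical input, user_age, that may be missing (NULL); the names and messages are made up for this example.

```r
# Hypothetical input that may be missing (NULL means "no value supplied")
user_age <- NULL

# && checks the left side first. Because !is.null(user_age) is FALSE,
# the right side (user_age >= 18) is never evaluated.
if (!is.null(user_age) && user_age >= 18) {
    status <- "Entry Allowed"
} else {
    status <- "Age missing or under 18"
}

print(status)   # "Age missing or under 18"
```

Without the first check, comparing NULL to 18 would produce an empty result and the if statement would stop with an error.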

Nested If Statements

We can place one if statement inside another. This is called nesting.

age <- 22
citizen <- TRUE

if (age >= 18) {

    if (citizen) {
        print("Eligible to vote")
    }

}

Introduction to Loops

Loops are used to repeat code multiple times.

Instead of writing the same code repeatedly, loops allow us to automate repetition.

For Loop

A for loop repeats a block of code for a sequence of values.

for (i in 1:5) {
    print(i)
}

Output:

[1] 1
[1] 2
[1] 3
[1] 4
[1] 5

Loop Through a Vector

fruits <- c("Apple", "Banana", "Mango")

for (fruit in fruits) {
    print(fruit)
}

While Loop

A while loop continues running as long as the condition remains TRUE.

count <- 1

while (count <= 5) {

    print(count)

    count <- count + 1
}

Repeat Loop

The repeat loop runs forever until stopped using break.

x <- 1

repeat {

    print(x)

    x <- x + 1

    if (x > 5) {
        break
    }
}

Break Statement

The break statement immediately stops a loop.

for (i in 1:10) {

    if (i == 6) {
        break
    }

    print(i)
}

Output:

[1] 1
[1] 2
[1] 3
[1] 4
[1] 5

Next Statement

The next statement skips the current iteration and moves to the next one.

for (i in 1:5) {

    if (i == 3) {
        next
    }

    print(i)
}

Output:

[1] 1
[1] 2
[1] 4
[1] 5

Real-World Example

Let us create a small program that checks student marks and prints the result.

marks <- 72

if (marks >= 90) {

    print("Excellent")

} else if (marks >= 75) {

    print("Very Good")

} else if (marks >= 50) {

    print("Pass")

} else {

    print("Fail")
}
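The same decision ladder can be combined with a loop to grade several students at once. This is a small sketch with made-up marks; all_marks and remarks are illustrative names.

```r
# Hypothetical marks for three students
all_marks <- c(95, 72, 48)
remarks <- character(length(all_marks))   # empty character vector to fill in

for (i in seq_along(all_marks)) {
    m <- all_marks[i]
    if (m >= 90) {
        remarks[i] <- "Excellent"
    } else if (m >= 75) {
        remarks[i] <- "Very Good"
    } else if (m >= 50) {
        remarks[i] <- "Pass"
    } else {
        remarks[i] <- "Fail"
    }
}

print(remarks)   # "Excellent" "Pass" "Fail"
```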

Important Notes

  • Conditions must evaluate to a single TRUE or FALSE; a condition that is NA causes an error
  • Use indentation for better readability
  • Loops can become infinite if conditions never become FALSE
  • break exits a loop immediately
  • next skips the current iteration
  • Logical operators help combine conditions

Common Beginner Mistakes

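# Mistake 1: Using = instead of ==
# if (x = 5)     # syntax error: = is assignment, not comparison

# Correct
# if (x == 5)

# Mistake 2: Infinite while loop (count never changes, so the condition stays TRUE)
# count <- 1
# while(count <= 5) {
#     print(count)
# }

# Correct: update the counter inside the loop body
# while (count <= 5) {
#     print(count)
#     count <- count + 1
# }

# Mistake 3: Missing curly braces (only the first line belongs to the if)
# if (x > 10)
#     print("Large")
#     print("Done")    # runs even when x is not greater than 10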

Mini Practice

Try the following exercises yourself:

# Exercise 1
number <- 8

if (number %% 2 == 0) {
    print("Even Number")
} else {
    print("Odd Number")
}

# Exercise 2
for (i in 1:3) {
    print(i * 2)
}

Summary

  • Control flow manages program execution
  • if statements help programs make decisions
  • else if allows multiple conditions
  • Loops are used for repetition
  • for, while, and repeat are common loops in R
  • break and next control loop behavior
  • Logical operators combine conditions
Lecture 05 · Fundamentals

Loops & Functions

Beginner ~90 min

1. For Loops: Automating Repetition

Imagine you have to print 100 different reports. You wouldn't write the print command 100 times. A for loop allows you to tell R: "Do this action for every item in this list."

How it works: A loop needs a counter (usually i) and a sequence (like 1:10). Every time the loop runs, i takes the value of the next item in the sequence.

Loop Example
# Example 1: Simple numeric sequence
for (i in 1:5) {
    print(paste("This is iteration number", i))
}

# Example 2: Iterating over a vector of names
students <- c("Alice", "Bob", "Charlie", "Diana")

for (name in students) {
    print(paste("Welcome to the class,", name))
}
Pro Tip: In R, 1:5 is shorthand for a vector containing 1, 2, 3, 4, 5. The loop simply visits each number one by one.
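When you need the position of each item as well as its value, a common approach is to loop over seq_along() instead of writing 1:length(x) by hand. A minimal sketch, reusing a shortened version of the students vector:

```r
students <- c("Alice", "Bob", "Charlie")

# seq_along(students) produces 1, 2, 3: one index per element
for (i in seq_along(students)) {
    print(paste(i, "-", students[i]))
}

# With an empty vector, seq_along() gives nothing to loop over,
# whereas 1:length(x) would count backwards through 1 and 0
empty <- character(0)
length(seq_along(empty))   # 0
length(1:length(empty))    # 2
```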

2. While Loops: Condition-Based Repetition

Unlike a for loop (which runs a set number of times), a while loop runs as long as a condition is TRUE. It stops the moment the condition becomes FALSE.

Warning: If the condition never becomes FALSE, you create an "Infinite Loop," which keeps your R session busy until you interrupt it (in RStudio, press Esc or click the Stop button in the Console).

While Loop Example
# We start with a value
battery_level <- 10

while (battery_level > 0) {
    print(paste("Battery at:", battery_level, "% - System Running..."))
    battery_level <- battery_level - 2 # Decrease battery each time
}

print("Battery dead. System shutting down.")

3. Custom Functions: Your Own Tools

A function is like a recipe. You define the ingredients (inputs) and the steps (the code), and then you can "cook" (run) that recipe whenever you want without rewriting the steps.

The structure of a function:

  • Name: How you call the function (e.g., calculate_tax).
  • Arguments: The inputs the function needs (e.g., price).
  • Body: The actual logic/math.
  • Return Value: The final answer the function gives back.
Custom Function Example
# Define the function
calculate_discount <- function(price, discount_percent) {
    savings <- price * (discount_percent / 100)
    final_price <- price - savings
    return(final_price)
}

# Now we use (call) the function with different values
item1 <- calculate_discount(100, 20) # 100 minus 20%
item2 <- calculate_discount(50, 10)   # 50 minus 10%

print(item1) # Result: 80
print(item2) # Result: 45
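Functions can also give an argument a default value and check their inputs before doing any work. The function below, calculate_discount_safe, is a hypothetical variant written for this example, not part of the lecture's original code.

```r
# Hypothetical variant of calculate_discount with a default and a basic check
calculate_discount_safe <- function(price, discount_percent = 10) {
    # Stop with an error if the discount is outside the 0 to 100 range
    stopifnot(discount_percent >= 0, discount_percent <= 100)
    price - price * (discount_percent / 100)   # the last expression is returned
}

calculate_discount_safe(200)                          # 180 (uses the 10% default)
calculate_discount_safe(200, discount_percent = 25)   # 150
```

Calling calculate_discount_safe(100, 150) stops with an error instead of silently returning a negative price.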

4. The "R Way": Vectorization vs. Loops

This is the most important lesson in R: In many languages, you MUST use loops. But R is a vectorized language. This means R can apply an operation to an entire vector of numbers at once, which is usually shorter to write and much faster to run.

Compare these two ways to double 5 numbers:

Comparison
# THE SLOW WAY (For Loop)
nums <- c(1, 2, 3, 4, 5)
results <- c()
for (n in nums) {
    results <- c(results, n * 2)
}

# THE R WAY (Vectorization)
nums <- c(1, 2, 3, 4, 5)
results <- nums * 2 # R automatically doubles every element!

Always ask yourself: "Can I do this without a loop?" If the answer is yes, your code will be faster and cleaner.
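When a loop really is needed, one common improvement over growing a vector with c() is to create the result at its final length first and fill it in by position. The sketch below compares that approach with the vectorized version; loop_result and vector_result are illustrative names.

```r
nums <- c(1, 2, 3, 4, 5)

# Loop version with preallocation: create the full-length result first,
# then fill it in by position instead of growing it with c()
loop_result <- numeric(length(nums))
for (i in seq_along(nums)) {
    loop_result[i] <- nums[i] * 2
}

# Vectorized version
vector_result <- nums * 2

identical(loop_result, vector_result)   # TRUE: both give 2 4 6 8 10
```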

Exercise 1.5: The Automation Master

Scenario: You are building a tool for a store to handle sales tax.

1. Create a function called apply_tax that takes a price and a tax_rate as arguments and returns the total price.

2. Create a vector of 5 product prices: prices <- c(10, 25, 100, 150, 500).

3. Challenge A: Use a for loop to apply a 15% tax to each price and print the result.

4. Challenge B: Now, try to do the exact same thing using vectorization (no loop). Compare how many lines of code it took!

Lecture 06 · Core Concepts

Data Structures & Vectors

Intermediate ~55 min

Introduction to Data Structures

Data structures are ways of organizing and storing data in R. They help us manage, process, and analyze information efficiently.

R provides multiple built-in data structures such as:

  • Vectors
  • Matrices
  • Arrays
  • Lists
  • Data Frames
  • Factors

In this lecture, we will mainly focus on vectors because they are the foundation of most R programming operations.

What is a Vector?

A vector is a collection of multiple values stored in a single variable.

All elements inside a vector belong to the same data type. If you combine values of different types, R converts them to a common type.

Vectors are one of the most important data structures in R because R is designed to work efficiently with vectorized data.

Creating Vectors

We use the c() function to combine values into a vector.

v <- c(1, 2, 3, 4, 5)
v * 2 # Vectorized operation

Output:

[1]  2  4  6  8 10

In the example above:

  • c() combines multiple values into a vector
  • Each element of the vector is multiplied by 2
  • This happens automatically without using loops

Why Vectors are Important

Vectors make R powerful and fast. Instead of processing one value at a time, R can process entire collections of values together.

This feature is called vectorization.

Types of Vectors

Vectors can store different types of data.

Numeric Vector

numbers <- c(10, 20, 30, 40)

print(numbers)

Character Vector

fruits <- c("Apple", "Banana", "Mango")

print(fruits)

Logical Vector

status <- c(TRUE, FALSE, TRUE)

print(status)

Mixing Data Types

R tries to keep all vector elements of the same type. If different data types are mixed, R automatically converts them.

mixed <- c(1, 2, "Three")

print(mixed)
class(mixed)

Output:

[1] "1"     "2"     "Three"
[1] "character"

Since one element is character data, R converts all elements into characters.
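The conversion follows a fixed order: logical values become numbers before numbers become text. A short sketch of the common combinations (the passed vector is made up for this example):

```r
class(c(TRUE, FALSE, 2L))   # "integer": TRUE becomes 1, FALSE becomes 0
class(c(1L, 2.5))           # "numeric"
class(c(1.5, "a"))          # "character"

# Logical-to-number conversion is handy for counting TRUE values
passed <- c(TRUE, FALSE, TRUE, TRUE)
sum(passed)                 # 3
```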

Accessing Vector Elements

Vector elements are accessed using square brackets.

marks <- c(80, 90, 75, 88)

print(marks[1])
print(marks[3])

Output:

[1] 80
[1] 75

R indexing starts from 1, not 0.

Accessing Multiple Elements

numbers <- c(5, 10, 15, 20, 25)

print(numbers[c(1, 3, 5)])

Output:

[1]  5 15 25
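Square brackets accept more than a list of positions. The sketch below shows ranges, negative indexes, and names; the subject names are made up for illustration.

```r
marks <- c(80, 90, 75, 88)

marks[2:3]             # 90 75: a range of positions
marks[-1]              # 90 75 88: everything except the first element
marks[length(marks)]   # 88: the last element

# Elements can also be named and looked up by name
names(marks) <- c("math", "physics", "art", "music")
marks["art"]           # 75, printed together with its name
```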

Modifying Vector Elements

Individual vector values can be changed.

scores <- c(60, 70, 80)

scores[2] <- 90

print(scores)

Output:

[1] 60 90 80

Adding Elements to a Vector

We can add new elements using the c() function.

values <- c(1, 2, 3)

values <- c(values, 4)

print(values)

Vector Arithmetic

Mathematical operations can be applied directly to vectors.

a <- c(1, 2, 3)
b <- c(4, 5, 6)

print(a + b)
print(a * b)

Output:

[1] 5 7 9
[1]  4 10 18

Operations happen element by element.
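When the two vectors have different lengths, R reuses (recycles) the shorter one to match the longer one. A minimal sketch with made-up vectors:

```r
long_vec <- c(1, 2, 3, 4)
short_vec <- c(10, 20)

# short_vec is reused as 10, 20, 10, 20
long_vec + short_vec   # 11 22 13 24

# A single number is just a vector of length 1, so it is recycled too
long_vec * 10          # 10 20 30 40
```

If the longer length is not a multiple of the shorter one, R still recycles but prints a warning, which is usually a sign of a mistake.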

Vectorized Operations

One of the biggest strengths of R is vectorized computation.

Instead of using loops, R can directly perform operations on entire vectors.

numbers <- c(1, 2, 3, 4, 5)

result <- numbers ^ 2

print(result)

Output:

[1]  1  4  9 16 25

Useful Vector Functions

Length of a Vector

x <- c(2, 4, 6, 8)

length(x)

Sum of Elements

numbers <- c(10, 20, 30)

sum(numbers)

Mean of Elements

marks <- c(70, 80, 90)

mean(marks)

Maximum and Minimum

values <- c(5, 9, 2, 11)

max(values)
min(values)

Sequences in R

R provides shortcuts for creating sequences.

Using :

numbers <- 1:10

print(numbers)

Using seq()

seq(1, 10, by = 2)

Output:

[1] 1 3 5 7 9

Repeating Values

The rep() function repeats values.

rep(5, times = 4)

rep(c(1, 2), times = 3)
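Both functions take a few more arguments that are useful in practice. The calls below are a brief sketch of the most common ones:

```r
seq(0, 1, length.out = 5)           # 0.00 0.25 0.50 0.75 1.00
seq(10, 1, by = -3)                 # 10 7 4 1
rep(c(1, 2), each = 2)              # 1 1 2 2
rep(c("a", "b"), times = c(3, 1))   # "a" "a" "a" "b"
```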

Vector Comparison

Comparisons also work element by element.

x <- c(1, 2, 3, 4)

x > 2

Output:

[1] FALSE FALSE  TRUE  TRUE

Filtering Vectors

Conditions can be used to filter vector values.

numbers <- c(5, 10, 15, 20)

numbers[numbers > 10]

Output:

[1] 15 20
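Filters can combine several conditions, match against a set of values, or return positions instead of values. A short sketch using the same numbers vector:

```r
numbers <- c(5, 10, 15, 20)

numbers[numbers > 5 & numbers < 20]   # 10 15: both conditions must hold
numbers[numbers %in% c(5, 20, 99)]    # 5 20: keep values found in a set
which(numbers > 10)                   # 3 4: positions, not values
```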

Missing Values in Vectors

Missing values in R are represented using NA.

data <- c(10, 20, NA, 40)

print(data)

Checking Missing Values

is.na(data)

Ignoring Missing Values

sum(data, na.rm = TRUE)
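The three steps above (checking, ignoring, and replacing missing values) can be sketched together. Replacing NA with 0 is only an illustration here; whether 0 is a sensible stand-in depends on the data.

```r
data <- c(10, 20, NA, 40)

sum(is.na(data))           # 1: number of missing values
mean(data)                 # NA: a calculation involving NA returns NA
mean(data, na.rm = TRUE)   # 23.33333: average of 10, 20 and 40

# Replace missing values with 0 in a copy of the vector
filled <- data
filled[is.na(filled)] <- 0
print(filled)              # 10 20 0 40
```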

Real-World Example

Let us calculate the average marks of students using vectors.

marks <- c(78, 85, 92, 88, 76)

average <- mean(marks)

print(average)

Important Notes

  • Vectors are one-dimensional collections of data
  • All elements of a vector share the same data type; mixed types are converted automatically
  • R indexing starts from 1
  • Vectorized operations make R efficient and fast
  • Use NA to represent missing values

Common Beginner Mistakes

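# Mistake 1: Using index 0
# x[0]                   # returns an empty vector, not the first element

# Correct
# x[1]

# Mistake 2: Mixing data types without noticing the conversion
# c(1, TRUE, "Hello")    # becomes "1" "TRUE" "Hello"

# Mistake 3: Forgetting na.rm = TRUE
# sum(c(1, 2, NA))       # returns NA; use sum(c(1, 2, NA), na.rm = TRUE)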

Mini Practice

Try the following exercises yourself:

# Exercise 1
numbers <- c(2, 4, 6, 8)

print(numbers * 3)

# Exercise 2
marks <- c(50, 70, 90, 85)

print(mean(marks))

# Exercise 3
values <- c(10, 20, 30, 40)

print(values[values > 20])

Summary

  • Vectors are the most fundamental data structure in R
  • The c() function creates vectors
  • Vectors support fast vectorized operations
  • Elements are accessed using square brackets
  • R provides many built-in vector functions
  • Filtering and comparisons work naturally with vectors
  • Missing values are represented using NA
Lecture 07 · Core Concepts

Data Frames & dplyr

Intermediate ~65 min

Introduction to Data Frames

Data Frames are one of the most important data structures in R. They are used to store data in a tabular format, similar to an Excel spreadsheet or a database table. Each column can contain different types of data such as numbers, text, or logical values.

In real-world data science and analytics projects, most datasets are handled using Data Frames because they are flexible, organized, and easy to manipulate.

students <- data.frame(
    name = c("Ali", "Sara", "John"),
    age = c(20, 22, 21),
    marks = c(85, 90, 88)
)

students

In the example above:

  • name column stores character values
  • age column stores numeric values
  • marks column stores students' scores

Accessing Data in Data Frames

We can access rows and columns of a Data Frame using indexing or column names.

students$name
students$marks

students[1, ]
students[, 2]

Here:

  • students$name accesses the name column
  • students[1, ] returns the first row
  • students[, 2] returns the second column
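Rows and columns can also be selected by name or by condition. The sketch below rebuilds the students data frame so it can be run on its own:

```r
students <- data.frame(
    name = c("Ali", "Sara", "John"),
    age = c(20, 22, 21),
    marks = c(85, 90, 88)
)

nrow(students)                    # 3 rows
ncol(students)                    # 3 columns
students[2, "marks"]              # 90: row 2, column "marks"
students[students$marks > 85, ]   # rows for Sara and John
students[["age"]]                 # 20 22 21: same as students$age
```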

What is dplyr?

dplyr is one of the most popular R packages used for data manipulation and data analysis. It provides simple and readable functions that make working with datasets faster and easier.

The main advantage of dplyr is that it allows developers and data analysts to write clean and understandable code when processing data.

# install.packages("dplyr")   # run once if the package is not installed yet
library(dplyr)

Important dplyr Functions

1. select()

The select() function is used to choose specific columns from a Data Frame.

students %>%
    select(name, marks)

2. filter()

The filter() function is used to extract rows based on conditions.

students %>%
    filter(age > 20)

3. mutate()

The mutate() function creates new columns or modifies existing columns.

students %>%
    mutate(grade = marks + 5)

4. arrange()

The arrange() function sorts rows by one or more columns. The order is ascending by default; wrap a column in desc() to sort it in descending order.

students %>%
    arrange(marks)
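For comparison, here is a rough sketch of how the same four operations look in base R, without dplyr. It is not how dplyr works internally; it is only meant to show what each verb does to the students data frame. The result names (selected, filtered, and so on) are made up for this example.

```r
students <- data.frame(
    name = c("Ali", "Sara", "John"),
    age = c(20, 22, 21),
    marks = c(85, 90, 88)
)

selected <- students[, c("name", "marks")]         # like select(name, marks)
filtered <- students[students$age > 20, ]          # like filter(age > 20)
mutated <- transform(students, grade = marks + 5)  # like mutate(grade = marks + 5)
arranged <- students[order(students$marks), ]      # like arrange(marks)

print(arranged$name)   # "Ali" "John" "Sara"
```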

The Pipe Operator

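The Pipe Operator %>% is one of the most useful tools when working with dplyr. It comes from the magrittr package and becomes available when you load dplyr. It passes the output of one operation directly into the next operation as its first argument. R 4.1 and later also include a built-in pipe, |>, which works the same way for simple chains.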

Without the pipe operator, code becomes deeply nested and difficult to read. Using pipes makes the code cleaner, more readable, and easier to debug.

df %>%
    filter(age > 18) %>%
    select(name, salary)

In this example (df stands for any data frame that has name, age, and salary columns):

  • The dataset is first filtered to include people older than 18
  • Then only the name and salary columns are selected
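The difference between nested calls and a pipeline can be seen with plain numbers. This sketch assumes R 4.1 or later and uses the built-in pipe |> so that no package is required; the same chain reads identically with %>% once dplyr is loaded.

```r
values <- c(1, 4, 9)

# Nested calls are read from the inside out
nested_result <- round(sqrt(sum(values)), 1)

# The same steps with the pipe are read from top to bottom
piped_result <- values |>
    sum() |>
    sqrt() |>
    round(1)

identical(nested_result, piped_result)   # TRUE (both are 3.7)
```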

Chaining Multiple Operations

One of the biggest strengths of dplyr is the ability to combine multiple operations into a single pipeline.

students %>%
    filter(marks > 85) %>%
    arrange(desc(marks)) %>%
    select(name, marks)

This pipeline:

  • Filters students with marks greater than 85
  • Sorts them in descending order
  • Selects only the required columns

Advantages of Using dplyr

  • Easy to read and understand
  • Cleaner syntax compared to base R
  • Efficient for large datasets
  • Supports chaining with the pipe operator
  • Widely used in data science and analytics

Summary

In this lecture, we learned about Data Frames and the dplyr package in R. We explored how to create and access Data Frames, and how dplyr functions such as select(), filter(), mutate(), and arrange() help in data manipulation.

We also understood the importance of the Pipe Operator %>%, which allows multiple operations to be chained together in a clean and readable way.

Lecture 08 · Core Concepts

Data Import & Cleaning

Intermediate ~50 min

Learn how to read CSV and Excel files and clean messy data using tidyr.

Introduction to Data Import

In real-world data analysis projects, data usually comes from external files such as CSV files, Excel spreadsheets, databases, or APIs. Before analyzing data, we first need to import it into R.

R provides powerful functions and libraries that make data importing simple and efficient. Once the data is imported, the next important step is cleaning the data to remove errors, missing values, and inconsistencies.

Reading CSV Files

CSV (Comma-Separated Values) files are one of the most common file formats used in data science and analytics. Each line in a CSV file represents a row, and commas separate the column values.

data <- read.csv("students.csv")

data

The read.csv() function imports the CSV file into a Data Frame.

  • "students.csv" is the file name, looked up relative to the current working directory (check it with getwd())
  • The imported data is stored in the variable data
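If you do not have a students.csv file at hand, you can try read.csv() on a small file created on the spot. The sketch below writes made-up contents to a temporary file and reads them back; csv_path is an illustrative name.

```r
# Create a small CSV file in a temporary location (made-up contents)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,age,marks", "Ali,20,85", "Sara,22,90"), csv_path)

data <- read.csv(csv_path)

print(data)
class(data)    # "data.frame"
data$marks     # 85 90: the header row became the column names
```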

Viewing Imported Data

After importing data, it is important to inspect and understand the dataset structure.

head(data)
str(data)
summary(data)

These functions help us understand the dataset:

  • head() displays the first few rows
  • str() shows the structure and data types
  • summary() provides statistical summaries
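A few related functions are worth knowing alongside these three. The sketch below uses mtcars, a small example dataset that ships with R, instead of the imported file:

```r
# mtcars is a small example dataset bundled with R
dim(mtcars)           # 32 11: rows, then columns
names(mtcars)[1:3]    # "mpg" "cyl" "disp"
head(mtcars, n = 3)   # first 3 rows instead of the default 6
tail(mtcars, n = 2)   # last 2 rows
```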

Reading Excel Files

Excel files are widely used in businesses and organizations for storing data. To read Excel files in R, we commonly use the readxl package.

# install.packages("readxl")   # run once if the package is not installed yet
library(readxl)

data <- read_excel("students.xlsx")

data

The read_excel() function imports Excel spreadsheets directly into R.

Understanding Dirty Data

Dirty data refers to incomplete, inconsistent, or incorrect data that can affect analysis results.

Common problems in datasets include:

  • Missing values
  • Duplicate rows
  • Incorrect formatting
  • Extra spaces in text
  • Empty columns

Data cleaning is an important step because clean data produces more accurate analysis and better machine learning models.

Handling Missing Values

Missing values are represented as NA in R. We can identify and remove missing values using built-in functions.

is.na(data)

na.omit(data)

Here:

  • is.na() checks for missing values
  • na.omit() removes rows containing missing values
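On a whole data frame, is.na() returns a large table of TRUE and FALSE values, so it is usually summarized. The sketch below uses a hypothetical dataset with two gaps; clean_data is an illustrative name.

```r
# Hypothetical dataset with gaps
data <- data.frame(
    name = c("Ali", "Sara", "John", "Mina"),
    age = c(20, NA, 21, 23),
    marks = c(85, 90, NA, 70)
)

colSums(is.na(data))    # missing values per column: name 0, age 1, marks 1
complete.cases(data)    # TRUE FALSE FALSE TRUE

# na.omit() returns a new data frame; assign it to keep the result
clean_data <- na.omit(data)
nrow(clean_data)        # 2
```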

Removing Duplicate Data

Duplicate records can create incorrect analysis results. R provides functions to detect and remove duplicates.

unique(data)

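The unique() function returns a copy of the dataset with duplicate rows removed. The original is not changed, so assign the result to a variable if you want to keep it.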
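Before removing duplicates, it is often useful to see which rows are repeated. A minimal sketch with a made-up data frame, using duplicated() alongside unique():

```r
data <- data.frame(
    name = c("Ali", "Sara", "Ali"),
    marks = c(85, 90, 85)
)

duplicated(data)        # FALSE FALSE TRUE: row 3 repeats row 1
sum(duplicated(data))   # 1 duplicate row

deduplicated <- unique(data)
nrow(deduplicated)      # 2
```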

Introduction to tidyr

The tidyr package is used for cleaning and organizing messy datasets. It helps convert raw data into a tidy format that is easier to analyze.

library(tidyr)

Using drop_na()

The drop_na() function removes rows containing missing values.

data %>%
    drop_na()

Using replace_na()

Instead of removing missing values, we can replace them with meaningful values.

data %>%
    replace_na(list(marks = 0))

In this example, missing values in the marks column are replaced with 0.

Cleaning Text Data

Text data often contains unnecessary spaces or inconsistent capitalization. Cleaning text improves data quality and consistency.

trimws(data$name)

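The trimws() function removes leading and trailing spaces from text values. It returns the cleaned values without changing the dataset, so assign the result back to the column (data$name <- trimws(data$name)) to keep it.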
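Inconsistent capitalization can be handled in the same step. The sketch below uses made-up names and converts everything to lower case with tolower(); whether lower case is the right target depends on your data.

```r
raw_names <- c("  ali ", "SARA", "John  ")

trimmed <- trimws(raw_names)   # "ali" "SARA" "John"
cleaned <- tolower(trimmed)    # "ali" "sara" "john"

print(cleaned)
```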

Renaming Columns

Clear and meaningful column names make datasets easier to understand.

colnames(data) <- c("Name", "Age", "Marks")

This statement replaces all column names of the dataset, so the vector must contain exactly one name per column, in order.

Exporting Cleaned Data

After cleaning the dataset, we can save it back to a CSV file for future use.

write.csv(data, "cleaned_data.csv", row.names = FALSE)   # row.names = FALSE omits the row-number column

The cleaned dataset will be stored in a new CSV file.
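To confirm that an export worked, you can read the file straight back in. This sketch writes a made-up data frame to a temporary file rather than to your project folder; cleaned, out_path, and reloaded are illustrative names.

```r
cleaned <- data.frame(Name = c("Ali", "Sara"), Marks = c(85, 90))
out_path <- tempfile(fileext = ".csv")   # temporary file, used here to avoid clutter

# row.names = FALSE leaves out the extra column of row numbers
write.csv(cleaned, out_path, row.names = FALSE)

readLines(out_path)[1]   # the header line: "Name","Marks"

reloaded <- read.csv(out_path)
names(reloaded)          # "Name" "Marks"
nrow(reloaded)           # 2
```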

Advantages of Data Cleaning

  • Improves data accuracy
  • Removes inconsistencies
  • Prepares data for analysis
  • Enhances machine learning performance
  • Makes reports more reliable

Summary

In this lecture, we learned how to import CSV and Excel files into R using read.csv() and read_excel().

We also explored data cleaning techniques such as handling missing values, removing duplicates, renaming columns, and organizing datasets using the tidyr package.

Clean data is essential for accurate analysis, reporting, and machine learning applications.

Lecture 09 · Advanced

Visualization with ggplot2

Advanced ~70 min

Grammar of Graphics

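library(ggplot2)

# df is assumed to be a data frame with numeric age and height columns
ggplot(data = df, aes(x = age, y = height)) +    # data and axis mappings
    geom_point() +                               # layer 1: one point per row
    geom_smooth(method = "lm")                   # layer 2: fitted straight line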
Lecture 10 · Advanced

Statistical Analysis

Advanced ~60 min

Performing hypothesis testing and correlation analysis in R.

Lecture 11 · Advanced

Advanced Data Manipulation

Advanced ~60 min

Using purrr for functional programming and stringr for text processing.

Lecture 12 · Capstone

Capstone Project: Data Analysis

Advanced ~180 min

Perform an end-to-end data analysis on a real-world dataset, from cleaning and exploration to modeling and visualization.

# Project goals:
# 1. Load and clean a dataset
# 2. Perform exploratory data analysis (EDA)
# 3. Create publication-quality plots
# 4. Build a predictive model
Lecture 13 · Advanced

R Markdown & Reporting

Intermediate ~50 min Requires: Lecture 12

Content coming soon...

Lecture 14 · Professional

Final Project: Complete Data Analysis

Advanced ~90 min Requires: All Previous

Content coming soon...