Introduction to R & Setup
What is R?
R is a powerful open-source language and environment specifically designed for statistical computing, data analysis, and visualization. Unlike general-purpose languages, R was built by statisticians for statisticians, which makes it a popular choice for data exploration.
It is widely used in academia, data science, bioinformatics, and industry for everything from simple linear regressions to complex machine learning models.
Why Choose R?
- Best-in-class Visualization: With libraries like ggplot2, R produces publication-quality graphics.
- CRAN Ecosystem: Access to the Comprehensive R Archive Network (CRAN), which hosts thousands of specialized packages for every imaginable data task.
- Exploratory Data Analysis (EDA): R allows you to interact with your data in real time, making it easier to find patterns and outliers.
- Tidyverse: A collection of cohesive packages (like dplyr and tidyr) designed for a consistent and intuitive data science workflow.
- Strong Integration: Seamlessly works with Python via the reticulate package and connects easily to SQL databases.
Installation & Environment
To get started, you need to install two separate pieces of software:
- R (The Engine): This is the actual language that processes the code. Download it from cran.r-project.org.
- RStudio (The Cockpit): This is the Integrated Development Environment (IDE). It makes writing, debugging, and visualizing your R code much easier. Download it from posit.co.
Note: Always install R first, then install RStudio.
Understanding the RStudio Interface
Once you open RStudio, you will see four primary panes. Understanding these is key to your productivity:
- Source Editor (Top Left): Where you write and save your scripts (.R files).
- Console (Bottom Left): Where the code actually runs. You can type commands here for instant results.
- Environment/History (Top Right): Shows you every variable, data frame, or list currently stored in the system's memory.
- Files/Plots/Help (Bottom Right): Where you view your folders, see your generated graphs, and read documentation.
Your First Lines of Code
In R, we use the <- operator (called the assignment operator) to store values in
variables. While = works, the arrow is the standard convention in the R community.
# Printing a simple message
print("Hello, R Mastery!")
[1] "Hello, R Mastery!"
# Assigning values to variables
my_name <- "Alex"
current_year <- 2024
result <- 17 * 23
# Printing the variables
print(my_name)
[1] "Alex"
print(result)
[1] 391
# Checking R version
version$version.string
[1] "R version 4.4.1 (2024-...)"
1. Install R and RStudio on your machine.
2. Open RStudio and create a new R Script (File → New File → R Script).
3. Write and run code that does the following:
- Assign your name to a variable called user_name.
- Assign today's date to a variable called today.
- Calculate 17 * 23 and store it in a variable called calc_result.
- Print all three variables to the console.
Variables & Data Types
Assignment Operator
In R, we typically use <- for assignment.
Assignment means storing a value inside a variable. A variable acts like a container that holds data which can later be used, modified, or printed.
x <- 42
name <- "R Student"
is_valid <- TRUE
In the example above:
- x stores a number
- name stores text
- is_valid stores a logical value
The value on the right side is assigned to the variable on the left side.
age <- 21
country <- "India"
print(age)
print(country)
Output:
[1] 21
[1] "India"
Using = for Assignment
R also allows the use of = for assignment.
However, most R programmers prefer <- because it is considered clearer and is
the traditional R style.
x = 100
print(x)
Both methods work, but throughout this course we will mainly use <-.
Variable Naming Rules
Variable names should be meaningful and easy to understand.
Rules for naming variables in R:
- Variable names can contain letters, numbers, dots, and underscores
- Variable names cannot start with a number
- Variable names are case-sensitive
- Avoid using spaces in variable names
# Valid variable names
student_name <- "Rahul"
marks1 <- 90
total.score <- 450
# Invalid variable names
# 1name <- "Error"
# student marks <- 50
Data Types in R
Every variable in R stores a particular type of data. These are called data types.
Understanding data types is extremely important because different operations work on different types of data.
1. Numeric
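Numeric is R's default number type. Values are stored as double-precision floating-point numbers, so this type covers decimals as well as whole numbers typed without a suffix.
price <- 99.99
temperature <- 36.6
print(price)
print(temperature)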
2. Integer
Integers are whole numbers without decimals.
In R, integers are written using the L suffix.
age <- 25L
year <- 2025L
print(age)
print(year)
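The difference between numeric and integer values is easy to check in the console. The following is a minimal sketch; the variable names are illustrative and not part of the lesson.

```r
whole_number <- 25    # no suffix: stored as numeric (double)
true_integer <- 25L   # L suffix: stored as integer

class(whole_number)            # "numeric"
class(true_integer)            # "integer"
is.integer(whole_number)       # FALSE
whole_number == true_integer   # TRUE: same value, different storage type
```

In everyday analysis the two behave the same in arithmetic, so the L suffix mostly matters when a function specifically expects an integer.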
3. Character
Character data represents text and must be written inside quotes.
first_name <- "Aman"
message <- "Welcome to R programming"
print(first_name)
print(message)
4. Logical
Logical values represent either TRUE or FALSE. They are commonly used in conditions and decision making.
is_logged_in <- TRUE
has_permission <- FALSE
print(is_logged_in)
print(has_permission)
Checking the Data Type
We can use the class() function to check the type of a variable.
x <- 10
name <- "R Language"
status <- TRUE
class(x)
class(name)
class(status)
Output:
[1] "numeric"
[1] "character"
[1] "logical"
Type Conversion
Sometimes we need to convert one data type into another. This process is called type conversion or type casting.
x <- "100"
numeric_x <- as.numeric(x)
print(numeric_x)
class(numeric_x)
Output:
[1] 100
[1] "numeric"
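R has a matching as.*() function for each basic type. The sketch below shows a few of them, along with what happens when a conversion is not possible; the name bad_input is made up for illustration.

```r
as.character(42)      # "42"
as.integer("7")       # 7
as.integer(3.9)       # 3 (the decimal part is dropped, not rounded)
as.logical("TRUE")    # TRUE
as.numeric(TRUE)      # 1

# Text that does not look like a number becomes NA.
# R normally prints a warning here; it is silenced to keep the output short.
bad_input <- suppressWarnings(as.numeric("ten"))
is.na(bad_input)      # TRUE
```

Checking for NA after a conversion is a simple way to catch values that could not be converted.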
Basic Arithmetic Operations
R can perform mathematical calculations using variables.
a <- 10
b <- 5
print(a + b)
print(a - b)
print(a * b)
print(a / b)
Output:
[1] 15
[1] 5
[1] 50
[1] 2
Updating Variables
Variables can be updated by assigning a new value to them.
score <- 50
score <- score + 10
print(score)
Output:
[1] 60
Important Notes
- R is case-sensitive, so age and Age are different variables
- Character values must be inside quotes
- Logical values are written as TRUE and FALSE
- Use meaningful variable names for better readability
- Always check data types when debugging errors
Common Beginner Mistakes
# Mistake 1: Missing quotes
# city <- Delhi
# Correct
city <- "Delhi"
# Mistake 2: Using undefined variables
# print(total)
# Mistake 3: Invalid variable name
# 2marks <- 90
Mini Practice
Try running the following code yourself:
student <- "Arjun"
marks <- 95
passed <- TRUE
print(student)
print(marks)
print(passed)
class(student)
class(marks)
class(passed)
Summary
- Variables are used to store data
- <- is the standard assignment operator in R
- R supports numeric, integer, character, and logical data types
- Use class() to check data types
- Variables can be updated and used in calculations
- Meaningful naming improves code readability
Operators & Expressions
Arithmetic Operators
Arithmetic operators are used to perform mathematical calculations. R follows the standard order
of operations (PEMDAS), but you can use parentheses ( ) to prioritize specific
calculations.
# Basic Math
5 + 10    # Addition (15)
15 - 5    # Subtraction (10)
4 * 3     # Multiplication (12)
10 / 2    # Division (5)
2 ^ 3     # Exponentiation/Power (8)
# Advanced Math
13 %% 5   # Modulo: Returns the remainder (3)
13 %/% 5  # Integer Division: Returns how many times it fits (2)
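To see the order of operations in action, it helps to compare expressions with and without parentheses. The expressions below are illustrative examples and can be pasted straight into the console.

```r
2 + 3 * 4      # 14: multiplication runs before addition
(2 + 3) * 4    # 20: parentheses force the addition first
2 ^ 3 ^ 2      # 512: powers group right to left, i.e. 2 ^ (3 ^ 2)
-2 ^ 2         # -4: the power is applied before the minus sign
(-2) ^ 2       # 4
10 %% 3 * 2    # 2: %% binds tighter than *, so this is (10 %% 3) * 2
```

When in doubt, adding parentheses costs nothing and makes the intent obvious to the next reader.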
Comparison Operators
Comparison operators are used to compare two values. The result of a comparison is always a
Boolean value: either TRUE or FALSE. These are
critical when you start filtering data frames.
10 == 10  # Equal to (TRUE)
10 != 10  # Not equal to (FALSE)
5 > 3     # Greater than (TRUE)
2 < 1     # Less than (FALSE)
10 >= 10  # Greater than or equal to (TRUE)
7 <= 5    # Less than or equal to (FALSE)
Logical Operators
Logical operators allow you to combine multiple comparisons. This is how you create complex filters (e.g., "Find all users who are over 18 AND live in New York").
- & (AND): Returns TRUE if both conditions are true.
- | (OR): Returns TRUE if at least one condition is true.
- ! (NOT): Reverses the result (TRUE becomes FALSE).
# Example variables
age <- 25
has_license <- TRUE
# AND operator
(age > 18) & (has_license == TRUE)  # TRUE
# OR operator
(age > 30) | (has_license == TRUE)  # TRUE
# NOT operator
!(age == 25)  # FALSE
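The same operators work on whole vectors, which is what makes filters like "over 18 AND lives in New York" possible. Here is a small sketch with made-up data; the vectors ages and cities are assumptions for illustration only.

```r
# Illustrative data: one entry per user
ages   <- c(16, 25, 40, 17)
cities <- c("New York", "New York", "Boston", "Boston")

# Element-wise AND: one TRUE/FALSE per user
adult_new_yorker <- (ages > 18) & (cities == "New York")
adult_new_yorker                       # FALSE TRUE FALSE FALSE

# Use the logical vector to keep only the matching ages
ages[adult_new_yorker]                 # 25

# Element-wise OR and NOT work the same way
(ages > 18) | (cities == "New York")   # TRUE TRUE TRUE FALSE
!(ages > 18)                           # TRUE FALSE FALSE TRUE
```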
Create a script that performs the following tasks:
- Calculate the remainder of 100 divided by 7 using the modulo operator.
- Create two variables, a <- 15 and b <- 20.
- Write a comparison expression that checks if a is less than b AND a is greater than 10.
- Test the ! (NOT) operator on the result of your previous comparison.
Control Flow
Introduction to Control Flow
Control flow determines how a program makes decisions and repeats tasks. By default, R executes code line by line from top to bottom.
However, real programs need logic. Sometimes we want certain code to run only when a condition is true. Sometimes we want to repeat a block of code multiple times.
Control flow statements allow us to:
- Make decisions
- Repeat operations
- Control program execution
- Build dynamic and intelligent programs
If Statements
The if statement is used to execute code only when a condition is true.
x <- 15  # example value so the snippet runs on its own
if (x > 10) {
  print("Large")
} else {
  print("Small")
}
In the example above:
- If x is greater than 10, R prints "Large"
- Otherwise, R prints "Small"
Understanding Conditions
Conditions are expressions that evaluate to either TRUE or FALSE.
x <- 15
print(x > 10)
print(x < 5)
print(x == 15)
Output:
[1] TRUE
[1] FALSE
[1] TRUE
Comparison Operators
Comparison operators are used to compare values.
| Operator | Meaning |
|---|---|
| == | Equal to |
| != | Not equal to |
| > | Greater than |
| < | Less than |
| >= | Greater than or equal to |
| <= | Less than or equal to |
Simple If Statement
An if statement can also be used without an else block.
age <- 20
if (age >= 18) {
  print("You are eligible to vote")
}
The code inside the block executes only if the condition is TRUE.
If-Else Statement
The if-else statement is used when there are two possible outcomes.
marks <- 40
if (marks >= 50) {
  print("Pass")
} else {
  print("Fail")
}
Else If Ladder
Multiple conditions can be checked using else if.
score <- 82
if (score >= 90) {
  print("Grade A")
} else if (score >= 75) {
  print("Grade B")
} else if (score >= 50) {
  print("Grade C")
} else {
  print("Fail")
}
Output:
[1] "Grade B"
Logical Operators
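Logical operators are used to combine multiple conditions. Inside an if statement, use && and ||, which expect a single TRUE or FALSE on each side. The single-character forms & and | compare vectors element by element.
| Operator | Meaning |
|---|---|
| && | Logical AND |
| \|\| | Logical OR |
| ! | Logical NOT |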
Using AND Operator
The AND operator returns TRUE only if both conditions are TRUE.
age <- 25
has_id <- TRUE
if (age >= 18 && has_id) {
  print("Entry Allowed")
}
Using OR Operator
The OR operator returns TRUE if at least one condition is TRUE.
is_weekend <- FALSE
is_holiday <- TRUE
if (is_weekend || is_holiday) {
  print("No Office Today")
}
Nested If Statements
We can place one if statement inside another.
This is called nesting.
age <- 22
citizen <- TRUE
if (age >= 18) {
  if (citizen) {
    print("Eligible to vote")
  }
}
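When the inner if has nothing else to do, a nested check can often be flattened into a single condition with &&. The sketch below is one possible rewrite; the variable message_text and the "Not eligible" branch are additions for illustration.

```r
age <- 22
citizen <- TRUE

# Same decision as the nested version, written as one condition
if (age >= 18 && citizen) {
  message_text <- "Eligible to vote"
} else {
  message_text <- "Not eligible"   # hypothetical fallback, not in the lesson
}
print(message_text)

# && stops at the first FALSE, so the right-hand side is never evaluated here
FALSE && stop("this is never reached")   # FALSE
```

Nesting is still the better choice when each level needs its own else branch or extra steps.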
Introduction to Loops
Loops are used to repeat code multiple times.
Instead of writing the same code repeatedly, loops allow us to automate repetition.
For Loop
A for loop repeats a block of code for a sequence of values.
for (i in 1:5) {
print(i)
}
Output:
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
Loop Through a Vector
fruits <- c("Apple", "Banana", "Mango")
for (fruit in fruits) {
  print(fruit)
}
While Loop
A while loop continues running as long as the condition remains TRUE.
count <- 1
while (count <= 5) {
  print(count)
  count <- count + 1
}
Repeat Loop
The repeat loop runs forever until stopped using break.
x <- 1
repeat {
  print(x)
  x <- x + 1
  if (x > 5) {
    break
  }
}
Break Statement
The break statement immediately stops a loop.
for (i in 1:10) {
  if (i == 6) {
    break
  }
  print(i)
}
Output:
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
Next Statement
The next statement skips the current iteration and moves to the next one.
for (i in 1:5) {
  if (i == 3) {
    next
  }
  print(i)
}
Output:
[1] 1
[1] 2
[1] 4
[1] 5
Real-World Example
Let us create a small program that checks student marks and prints the result.
marks <- 72
if (marks >= 90) {
  print("Excellent")
} else if (marks >= 75) {
  print("Very Good")
} else if (marks >= 50) {
  print("Pass")
} else {
  print("Fail")
}
Important Notes
- Conditions must evaluate to TRUE or FALSE
- Use indentation for better readability
- Loops can become infinite if conditions never become FALSE
- break exits a loop immediately
- next skips the current iteration
- Logical operators help combine conditions
Common Beginner Mistakes
# Mistake 1: Using = instead of ==
# if (x = 5)
# Correct
# if (x == 5)
# Mistake 2: Infinite while loop
# count <- 1
# while (count <= 5) {
#   print(count)
# }
# Correct
# count <- count + 1
# Mistake 3: Missing curly braces
Mini Practice
Try the following exercises yourself:
# Exercise 1
number <- 8
if (number %% 2 == 0) {
  print("Even Number")
} else {
  print("Odd Number")
}
# Exercise 2
for (i in 1:3) {
  print(i * 2)
}
Summary
- Control flow manages program execution
- if statements help programs make decisions
- else if allows multiple conditions
- Loops are used for repetition
- for, while, and repeat are common loops in R
- break and next control loop behavior
- Logical operators combine conditions
Loops & Functions
1. For Loops: Automating Repetition
Imagine you have to print 100 different reports. You wouldn't write the print command 100 times.
A for loop allows you to tell R: "Do this action for every item in
this list."
How it works: A loop needs a counter (usually i)
and a sequence (like 1:10). Every time the loop runs,
i takes the value of the next item in the sequence.
# Example 1: Simple numeric sequence
for (i in 1:5) {
  print(paste("This is iteration number", i))
}
# Example 2: Iterating over a vector of names
students <- c("Alice", "Bob", "Charlie", "Diana")
for (name in students) {
  print(paste("Welcome to the class,", name))
}
1:5 is shorthand for a vector
containing 1, 2, 3, 4, 5. The loop simply visits each number one
by one.
2. While Loops: Condition-Based Repetition
Unlike a for loop (which runs a set number of times), a while loop runs as long as a condition is TRUE. It stops
the moment the condition becomes FALSE.
Warning: If the condition never becomes FALSE, you create an "Infinite Loop" that keeps running until you interrupt it. In RStudio, press Esc or click the Stop button in the Console to end it.
# We start with a value
battery_level <- 10
while (battery_level > 0) {
  print(paste("Battery at:", battery_level, "% - System Running..."))
  battery_level <- battery_level - 2  # Decrease battery each time
}
print("Battery dead. System shutting down.")
3. Custom Functions: Your Own Tools
A function is like a recipe. You define the ingredients (inputs) and the steps (the code), and then you can "cook" (run) that recipe whenever you want without rewriting the steps.
The structure of a function:
- Name: How you call the function (e.g., calculate_tax).
- Arguments: The inputs the function needs (e.g., price).
- Body: The actual logic/math.
- Return Value: The final answer the function gives back.
# Define the function
calculate_discount <- function(price, discount_percent) {
  savings <- price * (discount_percent / 100)
  final_price <- price - savings
  return(final_price)
}
# Now we use (call) the function with different values
item1 <- calculate_discount(100, 20)  # 100 minus 20%
item2 <- calculate_discount(50, 10)   # 50 minus 10%
print(item1)  # Result: 80
print(item2)  # Result: 45
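Functions become more convenient with default argument values and a basic input check. The variation below is only a sketch: the default of 10 percent and the 0 to 100 range check are assumptions added for illustration, not part of the original function.

```r
calculate_discount <- function(price, discount_percent = 10) {
  # Stop early with a clear message if the input makes no sense
  if (discount_percent < 0 || discount_percent > 100) {
    stop("discount_percent must be between 0 and 100")
  }
  # The value of the last expression is returned automatically
  price - price * (discount_percent / 100)
}

calculate_discount(200)                          # 180, uses the default 10%
calculate_discount(200, discount_percent = 25)   # 150
calculate_discount(c(100, 50), 20)               # 80 40, works on a whole vector
```

Naming the argument in the call, as in the second example, makes it clear which value is which without checking the definition.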
4. The "R Way": Vectorization vs. Loops
This is the most important lesson in R: in many languages, you must use a loop to process a collection of values. R, however, is a vectorized language. It can apply an operation to an entire vector of numbers at once, which is usually much faster than an explicit loop.
Compare these two ways to double 5 numbers:
# THE SLOW WAY (For Loop)
nums <- c(1, 2, 3, 4, 5)
results <- c()
for (n in nums) {
  results <- c(results, n * 2)
}
# THE R WAY (Vectorization)
nums <- c(1, 2, 3, 4, 5)
results <- nums * 2  # R automatically doubles every element!
Always ask yourself: "Can I do this without a loop?" If the answer is yes, your code will be faster and cleaner.
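Some tasks do need a loop. In that case, creating the result vector at its full length first avoids growing it with c() on every pass. A minimal sketch, with illustrative variable names, that also confirms both approaches give the same answer:

```r
nums <- c(1, 2, 3, 4, 5)

# Loop version with the result vector created at full length up front
loop_results <- numeric(length(nums))
for (i in seq_along(nums)) {
  loop_results[i] <- nums[i] * 2
}

# Vectorized version
vector_results <- nums * 2

identical(loop_results, vector_results)   # TRUE
```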
Scenario: You are building a tool for a store to handle sales tax.
1. Create a function called apply_tax that takes a price and a tax_rate as
arguments and returns the total price.
2. Create a vector of 5 product prices: prices <- c(10, 25, 100, 150, 500).
3. Challenge A: Use a for loop to apply a 15%
tax to each price and print the result.
4. Challenge B: Now, try to do the exact same thing using vectorization (no loop). Compare how many lines of code it took!
Data Structures & Vectors
Introduction to Data Structures
Data structures are ways of organizing and storing data in R. They help us manage, process, and analyze information efficiently.
R provides multiple built-in data structures such as:
- Vectors
- Matrices
- Arrays
- Lists
- Data Frames
- Factors
In this lecture, we will mainly focus on vectors because they are the foundation of most R programming operations.
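Before narrowing down to vectors, here is a quick look at how each structure in the list above can be created. This is only an orientation sketch; all names and values are made up.

```r
v  <- c(1, 2, 3)                                        # vector
m  <- matrix(1:6, nrow = 2, ncol = 3)                   # matrix: 2 rows, 3 columns
a  <- array(1:24, dim = c(2, 3, 4))                     # array: 3 dimensions
l  <- list(name = "Asha", scores = c(90, 85))           # list: elements of mixed types
df <- data.frame(id = 1:2, city = c("Pune", "Delhi"))   # data frame: table of columns
f  <- factor(c("low", "high", "low"))                   # factor: categorical values

is.matrix(m)   # TRUE
dim(a)         # 2 3 4
levels(f)      # "high" "low"
```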
What is a Vector?
A vector is a collection of multiple values stored in a single variable.
All elements of a vector must have the same data type.
Vectors are one of the most important data structures in R because R is designed to work efficiently with vectorized data.
Creating Vectors
We use the c() function to combine values into a vector.
v <- c(1, 2, 3, 4, 5)
v * 2  # Vectorized operation
Output:
[1] 2 4 6 8 10
In the example above:
- c() combines multiple values into a vector
- Each element of the vector is multiplied by 2
- This happens automatically without using loops
Why Vectors are Important
Vectors make R powerful and fast. Instead of processing one value at a time, R can process entire collections of values together.
This feature is called vectorization.
Types of Vectors
Vectors can store different types of data.
Numeric Vector
numbers <- c(10, 20, 30, 40)
print(numbers)
Character Vector
fruits <- c("Apple", "Banana", "Mango")
print(fruits)
Logical Vector
status <- c(TRUE, FALSE, TRUE)
print(status)
Mixing Data Types
R tries to keep all vector elements of the same type. If different data types are mixed, R automatically converts them.
mixed <- c(1, 2, "Three")
print(mixed)
class(mixed)
Output:
[1] "1" "2" "Three"
[1] "character"
Since one element is character data, R converts all elements into characters.
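The conversion follows a fixed order: logical gives way to integer, integer to numeric, and numeric to character. The short examples below illustrate each step.

```r
class(c(TRUE, 2L))          # "integer":   logical becomes integer
class(c(2L, 3.5))           # "numeric":   integer becomes numeric
class(c(3.5, "text"))       # "character": numeric becomes character
c(1, TRUE, FALSE)           # 1 1 0: TRUE is stored as 1 and FALSE as 0
sum(c(TRUE, FALSE, TRUE))   # 2: a handy way to count TRUE values
```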
Accessing Vector Elements
Vector elements are accessed using square brackets.
marks <- c(80, 90, 75, 88)
print(marks[1])
print(marks[3])
Output:
[1] 80
[1] 75
R indexing starts from 1, not 0.
Accessing Multiple Elements
numbers <- c(5, 10, 15, 20, 25)
print(numbers[c(1, 3, 5)])
Output:
[1] 5 15 25
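Square brackets accept more than positive positions. The sketch below shows ranges, negative indices, and access by name; the subject names attached to the marks are invented for illustration.

```r
# A named vector: each value carries a label
marks <- c(math = 80, physics = 90, art = 75, music = 88)

marks[2:3]             # elements 2 to 3 (physics and art)
marks[-1]              # everything except the first element
marks["art"]           # access by name
marks[length(marks)]   # last element
```

A negative index removes elements rather than counting from the end, which differs from languages such as Python.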
Modifying Vector Elements
Individual vector values can be changed.
scores <- c(60, 70, 80)
scores[2] <- 90
print(scores)
Output:
[1] 60 90 80
Adding Elements to a Vector
We can add new elements using the c() function.
values <- c(1, 2, 3)
values <- c(values, 4)
print(values)
Vector Arithmetic
Mathematical operations can be applied directly to vectors.
a <- c(1, 2, 3)
b <- c(4, 5, 6)
print(a + b)
print(a * b)
Output:
[1] 5 7 9
[1] 4 10 18
Operations happen element by element.
Vectorized Operations
One of the biggest strengths of R is vectorized computation.
Instead of using loops, R can directly perform operations on entire vectors.
numbers <- c(1, 2, 3, 4, 5)
result <- numbers ^ 2
print(result)
Output:
[1] 1 4 9 16 25
Useful Vector Functions
Length of a Vector
x <- c(2, 4, 6, 8)
length(x)
Sum of Elements
numbers <- c(10, 20, 30)
sum(numbers)
Mean of Elements
marks <- c(70, 80, 90)
mean(marks)
Maximum and Minimum
values <- c(5, 9, 2, 11)
max(values)
min(values)
Sequences in R
R provides shortcuts for creating sequences.
Using :
numbers <- 1:10
print(numbers)
Using seq()
seq(1, 10, by = 2)
Output:
[1] 1 3 5 7 9
Repeating Values
The rep() function repeats values.
rep(5, times = 4)
rep(c(1, 2), times = 3)
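Both seq() and rep() take a few more arguments that come up often. The calls below are illustrative examples of length.out, a negative step, and each.

```r
seq(0, 1, length.out = 5)     # 0.00 0.25 0.50 0.75 1.00
seq(10, 1, by = -3)           # 10 7 4 1
rep(c("a", "b"), each = 2)    # "a" "a" "b" "b"
rep(c("a", "b"), times = 2)   # "a" "b" "a" "b"
seq_along(c("x", "y", "z"))   # 1 2 3
```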
Vector Comparison
Comparisons also work element by element.
x <- c(1, 2, 3, 4)
x > 2
Output:
[1] FALSE FALSE TRUE TRUE
Filtering Vectors
Conditions can be used to filter vector values.
numbers <- c(5, 10, 15, 20)
numbers[numbers > 10]
Output:
[1] 15 20
Missing Values in Vectors
Missing values in R are represented using NA.
data <- c(10, 20, NA, 40)
print(data)
Checking Missing Values
is.na(data)
Ignoring Missing Values
sum(data, na.rm = TRUE)
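To make the effect of NA concrete, the sketch below reuses the same example vector and compares results with and without na.rm, then counts and drops the missing entries.

```r
data <- c(10, 20, NA, 40)

sum(data)                  # NA: any calculation involving NA gives NA
sum(is.na(data))           # 1: number of missing values
mean(data, na.rm = TRUE)   # 23.33333: average of 10, 20 and 40
data[!is.na(data)]         # 10 20 40: keep only the known values
```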
Real-World Example
Let us calculate the average marks of students using vectors.
marks <- c(78, 85, 92, 88, 76)
average <- mean(marks)
print(average)
Important Notes
- Vectors are one-dimensional collections of data
- All elements of a vector share the same data type
- R indexing starts from 1
- Vectorized operations make R efficient and fast
- Use NA to represent missing values
Common Beginner Mistakes
# Mistake 1: Using index 0
# x[0]
# Correct
# x[1]
# Mistake 2: Mixing incompatible data types
# c(1, TRUE, "Hello")
# Mistake 3: Forgetting na.rm = TRUE
# sum(c(1, 2, NA))
Mini Practice
Try the following exercises yourself:
# Exercise 1
numbers <- c(2, 4, 6, 8)
print(numbers * 3)
# Exercise 2
marks <- c(50, 70, 90, 85)
print(mean(marks))
# Exercise 3
values <- c(10, 20, 30, 40)
print(values[values > 20])
Summary
- Vectors are the most fundamental data structure in R
- The c() function creates vectors
- Vectors support fast vectorized operations
- Elements are accessed using square brackets
- R provides many built-in vector functions
- Filtering and comparisons work naturally with vectors
- Missing values are represented using NA
Data Frames & dplyr
Introduction to Data Frames
Data Frames are one of the most important data structures in R. They are used to store data in a tabular format, similar to an Excel spreadsheet or a database table. Each column can contain different types of data such as numbers, text, or logical values.
In real-world data science and analytics projects, most datasets are handled using Data Frames because they are flexible, organized, and easy to manipulate.
students <- data.frame(
name = c("Ali", "Sara", "John"),
age = c(20, 22, 21),
marks = c(85, 90, 88)
)
students
In the example above:
- name column stores character values
- age column stores numeric values
- marks column stores students' scores
Accessing Data in Data Frames
We can access rows and columns of a Data Frame using indexing or column names.
students$name
students$marks
students[1, ]
students[, 2]
Here:
- students$name accesses the name column
- students[1, ] returns the first row
- students[, 2] returns the second column
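A few more base R access patterns are worth knowing before moving on to dplyr. The sketch below rebuilds the same students data frame so it runs on its own, then selects rows by condition.

```r
students <- data.frame(
  name  = c("Ali", "Sara", "John"),
  age   = c(20, 22, 21),
  marks = c(85, 90, 88)
)

nrow(students)                        # 3
ncol(students)                        # 3
students[["marks"]]                   # 85 90 88, same as students$marks
students[students$marks > 85, ]       # rows for Sara and John
students[students$age > 20, "name"]   # "Sara" "John"
```

The condition goes before the comma because it selects rows; whatever follows the comma selects columns.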
What is dplyr?
dplyr is one of the most popular R packages used for data manipulation and data analysis. It provides simple and readable functions that make working with datasets faster and easier.
The main advantage of dplyr is that it allows developers and data analysts to write clean and understandable code when processing data.
library(dplyr)
Important dplyr Functions
1. select()
The select() function is used to choose specific columns from a Data Frame.
students %>%
select(name, marks)
2. filter()
The filter() function is used to extract rows based on conditions.
students %>%
filter(age > 20)
3. mutate()
The mutate() function creates new columns or modifies existing columns.
students %>%
mutate(grade = marks + 5)
4. arrange()
The arrange() function sorts data in ascending or descending order.
students %>%
arrange(marks)
The Pipe Operator
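The pipe operator %>% is one of the most useful tools when working with dplyr. It comes from the magrittr package and becomes available when you load dplyr. It passes the output of one operation into the next operation as its first argument. R 4.1 and later also include a built-in pipe, |>, which behaves the same way in simple chains.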
Without the pipe operator, code becomes deeply nested and difficult to read. Using pipes makes the code cleaner, more readable, and easier to debug.
df %>%
filter(age > 18) %>%
select(name, salary)
In this example:
- The dataset is first filtered to include people older than 18
- Then only the name and salary columns are selected
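To see why pipes help readability, compare a nested call with the same steps written as a chain. This sketch uses R's built-in pipe |> so that it runs without any packages (it assumes R 4.1 or later); the marks vector is made-up data.

```r
marks <- c(85, 90, 88, 72)   # illustrative values

# Nested calls are read from the inside out
nested_result <- round(mean(head(sort(marks, decreasing = TRUE), 3)), 1)

# The same steps as a pipeline, read top to bottom
piped_result <- marks |>
  sort(decreasing = TRUE) |>   # highest marks first
  head(3) |>                   # keep the top three
  mean() |>                    # average them
  round(1)                     # one decimal place

nested_result                             # 87.7
identical(nested_result, piped_result)    # TRUE
```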
Chaining Multiple Operations
One of the biggest strengths of dplyr is the ability to combine multiple operations into a single pipeline.
students %>%
filter(marks > 85) %>%
arrange(desc(marks)) %>%
select(name, marks)
This pipeline:
- Filters students with marks greater than 85
- Sorts them in descending order
- Selects only the required columns
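For comparison, the same three steps can be written in base R without dplyr. The following is a rough equivalent rather than a drop-in replacement; the intermediate name top is an assumption for illustration.

```r
students <- data.frame(
  name  = c("Ali", "Sara", "John"),
  age   = c(20, 22, 21),
  marks = c(85, 90, 88)
)

top <- students[students$marks > 85, ]              # like filter(marks > 85)
top <- top[order(top$marks, decreasing = TRUE), ]   # like arrange(desc(marks))
top <- top[, c("name", "marks")]                    # like select(name, marks)
top                                                 # Sara (90), then John (88)
```

Each base R step repeats the data frame name, which is the repetition the dplyr pipeline removes.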
Advantages of Using dplyr
- Easy to read and understand
- Cleaner syntax compared to base R
- Efficient for large datasets
- Supports chaining with the pipe operator
- Widely used in data science and analytics
Summary
In this lecture, we learned about Data Frames and the dplyr package in R. We explored how to create and access Data Frames, and how dplyr functions such as select(), filter(), mutate(), and arrange() help in data manipulation.
We also understood the importance of the Pipe Operator %>%, which allows multiple operations to be chained together in a clean and readable way.
Data Import & Cleaning
Learn how to read CSV and Excel files and clean messy data using tidyr.
Introduction to Data Import
In real-world data analysis projects, data usually comes from external files such as CSV files, Excel spreadsheets, databases, or APIs. Before analyzing data, we first need to import it into R.
R provides powerful functions and libraries that make data importing simple and efficient. Once the data is imported, the next important step is cleaning the data to remove errors, missing values, and inconsistencies.
Reading CSV Files
CSV (Comma-Separated Values) files are one of the most common file formats used in data science and analytics. Each line in a CSV file represents a row, and commas separate the column values.
data <- read.csv("students.csv")
data
The read.csv() function imports the CSV file into a Data Frame.
- "students.csv" is the file name
- The imported data is stored in the variable data
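If you do not have a students.csv file at hand, you can create a small one to experiment with. The sketch below writes a temporary file (a stand-in for students.csv, with made-up rows) and reads it back.

```r
# Create a small CSV file in a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,age,marks", "Ali,20,85", "Sara,22,90"), csv_path)

data <- read.csv(csv_path)

nrow(data)          # 2
names(data)         # "name" "age" "marks"
class(data$marks)   # "integer": read.csv guesses each column's type
```

With a real file, pass its path instead of csv_path. A bare file name such as "students.csv" is looked up in the current working directory, which getwd() reports.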
Viewing Imported Data
After importing data, it is important to inspect and understand the dataset structure.
head(data)
str(data)
summary(data)
These functions help us understand the dataset:
- head() displays the first few rows
- str() shows the structure and data types
- summary() provides statistical summaries
Reading Excel Files
Excel files are widely used in businesses and organizations for storing data. To read Excel files in R, we commonly use the readxl package.
library(readxl)
data <- read_excel("students.xlsx")
data
The read_excel() function imports Excel spreadsheets directly into R.
Understanding Dirty Data
Dirty data refers to incomplete, inconsistent, or incorrect data that can affect analysis results.
Common problems in datasets include:
- Missing values
- Duplicate rows
- Incorrect formatting
- Extra spaces in text
- Empty columns
Data cleaning is an important step because clean data produces more accurate analysis and better machine learning models.
Handling Missing Values
Missing values are represented as NA in R. We can identify and remove missing values using built-in functions.
is.na(data)
na.omit(data)
Here:
- is.na() checks for missing values
- na.omit() removes rows containing missing values
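On a real dataset, is.na() returns one value per cell, which is hard to read. Summaries per column or per row are usually more helpful. A minimal sketch on a made-up data frame:

```r
data <- data.frame(
  name  = c("Ali", "Sara", "John"),
  marks = c(85, NA, 88)
)

colSums(is.na(data))    # name 0, marks 1: missing values per column
complete.cases(data)    # TRUE FALSE TRUE: rows without any NA
clean <- na.omit(data)  # drops the row for Sara
nrow(clean)             # 2
```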
Removing Duplicate Data
Duplicate records can create incorrect analysis results. R provides functions to detect and remove duplicates.
unique(data)
The unique() function removes duplicate rows from the dataset.
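To detect duplicates before removing them, base R offers duplicated(), which flags every row that repeats an earlier one. The data frame below is made up for illustration.

```r
data <- data.frame(
  name  = c("Ali", "Sara", "Ali"),
  marks = c(85, 90, 85)
)

duplicated(data)        # FALSE FALSE TRUE: the third row repeats the first
sum(duplicated(data))   # 1: number of duplicate rows

deduped <- data[!duplicated(data), ]   # same rows that unique(data) keeps
nrow(deduped)           # 2
```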
Introduction to tidyr
The tidyr package is used for cleaning and organizing messy datasets. It helps convert raw data into a tidy format that is easier to analyze.
library(tidyr)
Using drop_na()
The drop_na() function removes rows containing missing values.
data %>%
drop_na()
Using replace_na()
Instead of removing missing values, we can replace them with meaningful values.
data %>%
replace_na(list(marks = 0))
In this example, missing values in the marks column are replaced with 0.
Cleaning Text Data
Text data often contains unnecessary spaces or inconsistent capitalization. Cleaning text improves data quality and consistency.
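data$name <- trimws(data$name)
The trimws() function removes leading and trailing spaces from text values. Assigning the result back to data$name saves the cleaned text in the dataset.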
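Inconsistent capitalization can be handled in the same step with tolower() or toupper(). A short sketch on made-up names:

```r
names_raw <- c("  ali ", "SARA", "John  ")   # illustrative messy input

names_trimmed <- trimws(names_raw)   # "ali" "SARA" "John"
tolower(names_trimmed)               # "ali" "sara" "john"
toupper(names_trimmed)               # "ALI" "SARA" "JOHN"
```

Bringing text to one case before comparing or grouping prevents "SARA" and "Sara" from being counted as two different people.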
Renaming Columns
Clear and meaningful column names make datasets easier to understand.
colnames(data) <- c("Name", "Age", "Marks")
This statement changes the column names of the dataset.
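Replacing all names at once requires listing every column in the right order. To rename a single column, you can target it by its current name instead. In this sketch the data frame and the column name nm are hypothetical.

```r
data <- data.frame(nm = c("Ali", "Sara"), age = c(20, 22))

# Rename only the column currently called "nm"
names(data)[names(data) == "nm"] <- "name"

names(data)   # "name" "age"
```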
Exporting Cleaned Data
After cleaning the dataset, we can save it back to a CSV file for future use.
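write.csv(data, "cleaned_data.csv", row.names = FALSE)
The cleaned dataset will be stored in a new CSV file. Setting row.names = FALSE keeps R from writing its row numbers as an extra first column.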
Advantages of Data Cleaning
- Improves data accuracy
- Removes inconsistencies
- Prepares data for analysis
- Enhances machine learning performance
- Makes reports more reliable
Summary
In this lecture, we learned how to import CSV and Excel files into R using read.csv() and read_excel().
We also explored data cleaning techniques such as handling missing values, removing duplicates, renaming columns, and organizing datasets using the tidyr package.
Clean data is essential for accurate analysis, reporting, and machine learning applications.
Visualization with ggplot2
Grammar of Graphics
library(ggplot2)
# df is assumed to be a data frame with numeric columns age and height
ggplot(data = df, aes(x = age, y = height)) +
geom_point() +
geom_smooth(method = "lm")
Statistical Analysis
Performing hypothesis testing and correlation analysis in R.
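As a taste of what this looks like, the sketch below runs a two-sample t-test and a correlation test with functions from base R's stats package. It uses the built-in mtcars dataset; the choice of dataset and columns is an assumption for illustration.

```r
# mtcars ships with R: mpg = miles per gallon, am = transmission (0 or 1),
# wt = weight of the car

# Hypothesis test: does average mpg differ between the two transmission types?
test_result <- t.test(mpg ~ am, data = mtcars)
test_result$p.value    # a small p-value suggests a real difference

# Correlation analysis: how strongly are weight and mpg related?
cor_result <- cor.test(mtcars$wt, mtcars$mpg)
cor_result$estimate    # negative: heavier cars tend to have lower mpg
```

Both functions return a list, so individual pieces such as the p-value or the confidence interval (conf.int) can be pulled out with $.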
Advanced Data Manipulation
Using purrr for functional programming and stringr for text processing.
Capstone Project: Data Analysis
Perform an end-to-end data analysis on a real-world dataset, from cleaning and exploration to modeling and visualization.
# Project goals:
# 1. Load and clean a dataset
# 2. Perform exploratory data analysis (EDA)
# 3. Create publication-quality plots
# 4. Build a predictive model
R Markdown & Reporting
Content coming soon...
Final Project: Complete Data Analysis
Content coming soon...