Introduction to SQL & Setup
0️⃣ What is SQL?
SQL (Structured Query Language) is the ANSI‑standard language used to create, read, update, and delete data in relational database management systems (RDBMS). It abstracts the underlying storage engine and lets you work with tables (relations) that are linked via keys.
Brief History & Why It Still Matters
- 1970 – Edgar F. Codd proposes the relational model (IBM System R).
- 1974 – IBM releases SQL/2, the first commercial implementation.
- 1992 – ANSI/ISO adopts SQL‑92 as a formal standard (still the baseline).
- 2000‑2024 – Continual extensions (SQL:1999, SQL:2003, …) add XML/JSON handling, window functions, and more.
| RDBMS | License | Typical Use‑Case | Latest Stable Version (2024) |
|---|---|---|---|
| PostgreSQL | Open‑source (PostgreSQL License) | Enterprise‑grade, geo‑spatial, analytics | 16.2 |
| MySQL / MariaDB | Dual (GPL / Commercial) | Web‑apps, LAMP stack | 8.4 / 10.11 |
| Microsoft SQL Server | Commercial (Free Express) | Windows‑centric, BI | 2022 CU15 |
| Oracle Database | Commercial (Free XE) | Large‑scale transaction processing | 23c |
| SQLite | Public‑domain (no server) | Embedded, mobile, prototyping | 3.45 |
Market Demand (2024)
| Skill | Avg Salary (US $) | Stack Overflow Survey % of Respondents |
|---|---|---|
| SQL / Relational DB | ≈ $115 k | 78 % |
| PostgreSQL | ≈ $120 k | 43 % |
| MySQL | ≈ $112 k | 55 % |
| SQLite (mobile/IoT) | ≈ $108 k | 31 % |
Source: Stack Overflow Developer Survey 2024 + Glassdoor.
1️⃣ Setup – Get a DB Up & Running
Option A – Native Install (PostgreSQL)
- Windows – Download the installer, run, accept defaults (creates a
postgressuper‑user). - macOS –
brew install postgresqlthenbrew services start postgresql. - Linux (Debian/Ubuntu) –
sudo apt update && sudo apt install -y postgresql postgresql-contrib.
Verify:sudo -u postgres psql -c "SELECT version();"
Option B – Lightweight SQLite (Zero‑Install)
SQLite ships with Python, PHP, and many OSes. A single file is all you need.
# Verify SQLite is present sqlite3 --version # e.g. 3.45.0 # Create a new DB file called demo.db and open the CLI sqlite3 demo.db # Inside the CLI you’ll see the prompt: sqlite>
Option C – Docker (All Platforms)
Docker gives you an isolated RDBMS with a single command. No admin rights required.
# Pull the official PostgreSQL image
docker pull postgres:16-alpine
# Run it (exposes 5432 to host)
docker run -d \
--name pg_sql_demo \
-e POSTGRES_PASSWORD=SuperSecret123 \
-e POSTGRES_USER=demo_user \
-e POSTGRES_DB=demo_db \
-p 5432:5432 \
postgres:16-alpine
# Connect from host
psql -h localhost -U demo_user -d demo_db -W # password: SuperSecret123
SQL Client Recommendations
- DBeaver – universal cross‑platform GUI (JDBC/ODBC).
- pgAdmin – native PostgreSQL web‑based console.
- Azure Data Studio – Microsoft’s cross‑platform query editor (works with PostgreSQL, MySQL, SQL Server).
- Command‑line:
psql(PostgreSQL),mysql,sqlite3.
2️⃣ First‑Look SQL – “Hello SQL” & Basic Statements
SELECT 'Hello, SQL Mastery!' AS greeting;
Hello, SQL Mastery!
Core DDL (Data Definition Language)
CREATE TABLE– define a table schema.ALTER TABLE– add/modify columns, constraints.DROP TABLE– delete a table.
Core DML (Data Manipulation Language)
SELECT … FROM …– read data (the “R” in CRUD).INSERT INTO … VALUES …– add rows.UPDATE … SET … WHERE …– modify rows.DELETE FROM … WHERE …– remove rows.
💻 Try It Yourself - Multi-Language Compiler
Practice SQL and many other programming languages right here in your browser! Switch between languages, modify the code, and click "Run" to see results instantly.
💡 Practice Tips:
- Switch to SQL in the language selector and try the database query examples
- Experiment with SQL's SELECT, INSERT, UPDATE, and DELETE statements
- Try other data languages like Python, R, or compare with database concepts
- Use the "Load Example" button to see SQL-specific code samples
- Use Ctrl+Enter to quickly run your code
Quick Reference Cheat‑Sheet
| Statement | Purpose | Basic Syntax Example |
|---|---|---|
| CREATE TABLE | Define a new relation | CREATE TABLE employees (id SERIAL PRIMARY KEY, name TEXT NOT NULL, dept_id INT); |
| INSERT | Add a row | INSERT INTO employees (name,dept_id) VALUES ('Alice',1); |
| SELECT | Read data | SELECT name,dept_id FROM employees WHERE dept_id = 1; |
| UPDATE | Modify existing rows | UPDATE employees SET dept_id = 2 WHERE name = 'Alice'; |
| DELETE | Remove rows | DELETE FROM employees WHERE name = 'Alice'; |
| JOIN | Combine rows from multiple tables | SELECT e.name, d.name FROM employees e JOIN departments d ON e.dept_id = d.id; |
| GROUP BY / HAVING | Aggregate data | SELECT dept_id, COUNT(*) FROM employees GROUP BY dept_id HAVING COUNT(*) > 5; |
| ORDER BY | Sort results | SELECT * FROM employees ORDER BY name ASC; |
| LIMIT / OFFSET | Paginate results | SELECT * FROM employees LIMIT 10 OFFSET 20; |
3️⃣ Sample Schema – Employees & Departments
Let’s build a tiny HR database that you can reuse for many hands‑on tasks.
CREATE TABLE departments ( id SERIAL PRIMARY KEY, name VARCHAR(50) NOT NULL UNIQUE ); CREATE TABLE employees ( emp_id SERIAL PRIMARY KEY, first_name VARCHAR(30) NOT NULL, last_name VARCHAR(30) NOT NULL, email VARCHAR(100) UNIQUE, hire_date DATE DEFAULT CURRENT_DATE, salary NUMERIC(9,2) CHECK (salary > 0), dept_id INT REFERENCES departments(id) ON DELETE SET NULL );
Insert Sample Data
INSERT INTO departments (name) VALUES ('Engineering'), ('Marketing'), ('HR'), ('Finance'); INSERT INTO employees (first_name, last_name, email, salary, dept_id) VALUES ('Alice', 'Smith', 'alice.smith@example.com', 87000, 1), ('Bob', 'Brown', 'bob.brown@example.com', 62000, 2), ('Carol', 'Jones', 'carol.jones@example.com', 54000, 3), ('David', 'Lee', 'david.lee@example.com', 103000, 1), ('Eve', 'Taylor', 'eve.taylor@example.com', 77000, 4);
Visualise the Relationship (placeholder)
Example Queries
-- 1️⃣ List all employees with department name
SELECT e.emp_id, e.first_name, e.last_name, d.name AS department
FROM employees e
JOIN departments d ON e.dept_id = d.id
ORDER BY e.emp_id;
-- 2️⃣ Average salary per department
SELECT d.name, ROUND(AVG(e.salary),2) AS avg_salary
FROM employees e
JOIN departments d ON e.dept_id = d.id
GROUP BY d.name
HAVING COUNT(*) > 0;
-- 3️⃣ Employees hired in the last 30 days
SELECT * FROM employees
WHERE hire_date > CURRENT_DATE - INTERVAL '30 days';
-- 4️⃣ Find employees with duplicate email domains (e.g., possible data‑quality issue)
SELECT SUBSTRING(email FROM POSITION('@' IN email)+1) AS domain, COUNT(*) AS cnt
FROM employees
GROUP BY domain
HAVING COUNT(*) > 1;
-- 5️⃣ Delete a department (will set dept_id of its employees to NULL)
DELETE FROM departments WHERE name = 'Marketing';
4️⃣ Indexes & Query Performance Basics
Why Indexes Matter
- They are data structures (usually B‑trees) that let the engine locate rows without scanning the whole table.
- Typical usage: columns in
WHERE,JOIN,ORDER BY,GROUP BY. - Trade‑off: faster reads, slower writes (INSERT/UPDATE) and extra storage.
Creating an Index
CREATE INDEX idx_emp_email ON employees(email); CREATE INDEX idx_emp_dept ON employees(dept_id);
Explain Plan – Seeing What the Optimiser Does
EXPLAIN ANALYZE SELECT e.first_name, e.last_name, d.name FROM employees e JOIN departments d ON e.dept_id = d.id WHERE e.salary > 80000;
The output shows whether the query uses an index scan, sequential scan, or a hash join, and reports actual execution time in ms.
5️⃣ Common Pitfalls – SQL Injection & Safe Coding
SQL Injection occurs when untrusted input is concatenated into a query string without sanitisation.
Vulnerable Example (PHP)
$id = $_GET['id']; // e.g., ?id=1 OR 1=1 $sql = "SELECT * FROM employees WHERE emp_id = $id"; $res = $pdo->query($sql);
Safe Alternative – Prepared Statements
$stmt = $pdo->prepare('SELECT * FROM employees WHERE emp_id = :id');
$stmt->execute(['id' => $_GET['id']]); // type‑safe, DB driver does the escaping
$res = $stmt->fetchAll();
All languages have similar mechanisms (Python’s cursor.execute(..., params), Java’s PreparedStatement, .NET’s SqlCommand with parameters, etc.).
- Use string interpolation to build SQL.
- Trust client‑side validation alone.
- Store raw passwords in plain text – always hash (bcrypt, Argon2).
6️⃣ Lab Exercises (Extended)
Choose one of the three installation methods (native PostgreSQL, SQLite, Docker). Then:
- Start the DB server (or open the SQLite CLI).
- Run the “Hello, SQL Mastery!” query and capture a screenshot of the output.
- Record the exact command you typed (including any environment variables).
Using the SQL script provided above:
- Create the
departmentsandemployeestables. - Insert the sample rows.
- Run the “Average salary per department” query and note any department with an average > $80 k.
- Write a 150‑word “lab journal” describing any errors you encountered and how you fixed them.
For the employees.email column:
- Create an index as shown earlier.
- Run the same
SELECT … WHERE email = 'alice.smith@example.com'query with the index and capture theEXPLAIN ANALYZEoutput. - Delete the index (
DROP INDEX idx_emp_email;) and re‑run the query – compare timings. - Summarise (≤ 100 words) why the index helped (or didn’t) in this case.
Write a short program (≤ 50 lines) that:
- Accepts a user‑supplied employee ID from the console.
- Queries the
employeestable for that ID using a prepared statement. - Prints the employee’s full name and department.
Include the source code and a screenshot of a successful run.
7️⃣ Knowledge‑Check Quiz (5 questions)
- Which SQL clause is used to combine rows from two tables?
GROUP BY
JOIN
UNION - What does the
SERIALdata type do in PostgreSQL?
Stores binary data.
Auto‑increments an integer and creates a sequence.
Stores a fixed-length string of 20 characters. - True or False –
DROP TABLEcan be rolled back inside a transaction in PostgreSQL. - Which of the following is a **safe** way to guard against SQL injection?
Escape single quotes manually.
Use a prepared statement / parameterised query.
Validate the input length only. - What is the purpose of an
INDEX?
Click to reveal the answers.
Show Answers
- B –
JOINcombines rows from two tables based on a condition. - B –
SERIALcreates an auto‑increment integer backed by a sequence. - False – Dropping a table is DDL and cannot be rolled back in PostgreSQL (unless using a schema‑level transaction in recent versions, but standard practice treats it as non‑reversible).
- B – Parameterised queries are the industry‑standard defense.
- An
INDEXprovides a fast lookup mechanism (typically a B‑tree) that reduces the amount of data scanned for queries that filter, sort, or join on the indexed column(s).
8️⃣ Further Reading & Resources
- Books
- SQL for Beginners – John J. L. (O'Reilly, 2022)
- PostgreSQL Administration – Simon Riggs & Gianni Ciolli (2nd ed., 2023)
- SQL Antipatterns – Bill Karwin (Addison‑Wesley, 2010)
- Database System Concepts – Silberschatz, Korth & Sudarshan (7th ed., 2020)
- Online Tutorials
- Reference Docs
- SQL‑92 Standard – ISO/IEC 9075‑2:1992
- PostgreSQL SQL‑Syntax – PostgreSQL Docs
- SQLite Quickstart – sqlite.org
- Practice Platforms
Basic SELECT Queries
1️⃣ Retrieving Data – The Core of SQL
The SELECT statement is the primary read‑only operation in a relational
database. It defines **what** columns you want, **from which** tables, and optionally
how the rows should be filtered, ordered, grouped, or limited.
SELECT name, price FROM products; SELECT * FROM employees; -- all columns (discouraged in production)
SELECT * Is Discouraged- Pulls every column – can cause unnecessary I/O and network traffic.
- Masks schema changes (new columns appear silently in result sets).
- Exposes sensitive columns (password hashes, internal IDs) if you forget to prune them.
2️⃣ Aliases – Giving Columns & Tables Friendly Names
Aliases improve readability, especially in joins or when column names clash.
SELECT first_name AS name FROM users; SELECT u.first_name AS fname, u.last_name AS lname FROM users AS u;
Table aliases also let you qualify columns (t1.id) which is mandatory when two tables share a column name.
3️⃣ Filtering Rows – WHERE Clause
SELECT name, price FROM products WHERE price > 100 AND category = 'Electronics';
Common comparison operators: =, <>, >, <, >=, <= and set operators IN, NOT IN. Use LIKE for pattern matching.
SELECT * FROM employees WHERE email LIKE '%@example.com'; -- ends with @example.com
4️⃣ Ordering & Limiting – ORDER BY, LIMIT/OFFSET
SELECT name, price FROM products WHERE price > 0 ORDER BY price DESC LIMIT 10 OFFSET 20; -- pagination: page 3 (0‑based)
Use ASC (default) for ascending order. OFFSET is optional – without it, LIMIT 10 simply returns the first ten rows.
5️⃣ Removing Duplicates – DISTINCT
SELECT DISTINCT department_id FROM employees;
DISTINCT works on the whole projected row. If you need distinct on a single column but keep other columns, use a GROUP BY (see next section).
6️⃣ Aggregates & Grouping – Summaries & Statistics
| Function | Description | Example |
|---|---|---|
COUNT(col) | Number of non‑NULL values | SELECT COUNT(*) FROM orders; |
SUM(col) | Total of numeric column | SELECT SUM(amount) FROM sales; |
AVG(col) | Mean value | SELECT AVG(salary) FROM employees; |
MIN(col) | Smallest value | SELECT MIN(price) FROM products; |
MAX(col) | Largest value | SELECT MAX(price) FROM products; |
To obtain aggregates **per group**, combine them with GROUP BY:
SELECT department_id, COUNT(*) AS employee_cnt, AVG(salary) AS avg_salary FROM employees GROUP BY department_id HAVING COUNT(*) > 5; -- keep only large departments
HAVING filters groups after aggregation (equivalent to WHERE for rows).
7️⃣ Joins – Bringing Data Together
Four basic join types (inner, left/right outer, full outer, cross). The most common is INNER JOIN.
SELECT e.first_name, e.last_name, d.name AS department FROM employees AS e JOIN departments AS d ON e.department_id = d.id WHERE d.name = 'Engineering' ORDER BY e.last_name;
For left/right outer join:
SELECT c.name AS customer, o.id AS order_id FROM customers c LEFT JOIN orders o ON c.id = o.customer_id ORDER BY c.name;
Use FULL OUTER JOIN when you need rows that appear in *either* table (PostgreSQL supports it; MySQL does not natively).
8️⃣ Set Operators – Combining Result Sets
SELECT email FROM customers UNION SELECT email FROM newsletter_subscribers; -- eliminates duplicates SELECT email FROM customers INTERSECT SELECT email FROM newsletter_subscribers; -- only common emails SELECT email FROM customers EXCEPT SELECT email FROM newsletter_subscribers; -- in customers but not in subscribers
All participating SELECTs must have the **same column count** and **compatible data types**.
9️⃣ Sub‑queries – Nesting SELECTs
Inline (derived table)
SELECT dept_id, avg_salary FROM ( SELECT department_id AS dept_id, AVG(salary) AS avg_salary FROM employees GROUP BY department_id ) AS dept_avg WHERE avg_salary > 80000;
Correlated sub‑query (depends on outer row)
SELECT e.id, e.first_name, e.salary FROM employees e WHERE e.salary > ( SELECT AVG(salary) FROM employees WHERE department_id = e.department_id );
Correlated sub‑queries can be slower; many RDBMS transform them into joins automatically.
10️⃣ Window Functions – Row‑Level Analytics
SELECT employee_id, salary, AVG(salary) OVER (PARTITION BY department_id) AS dept_avg, RANK() OVER (ORDER BY salary DESC) AS salary_rank FROM employees ORDER BY department_id, salary DESC;
Window functions keep the original row granularity while adding a calculated column (e.g., running total, moving average, rank).
11️⃣ Performance Tips – EXPLAIN and Indexes
Before writing a complex SELECT, inspect the execution plan:
EXPLAIN ANALYZE SELECT e.first_name, e.last_name, d.name FROM employees e JOIN departments d ON e.department_id = d.id WHERE d.name = 'Engineering' ORDER BY e.last_name LIMIT 50;
The output tells you whether an **index scan**, **sequential scan**, or **hash join** is used. If you see “Seq Scan” on a large table with a WHERE clause on an indexed column, create an index:
CREATE INDEX idx_emp_dept ON employees(department_id); CREATE INDEX idx_dept_name ON departments(name);
Remember: each index speeds reads but slows INSERT/UPDATE/DELETE and consumes storage.
12️⃣ Safety – Avoiding SQL Injection in SELECTs
Even a read‑only query can be hijacked to return data you never intended. Always use **parameterised queries** / **prepared statements** in application code.
import psycopg2
conn = psycopg2.connect(dsn='dbname=mydb user=app')
cur = conn.cursor()
cur.execute(
"SELECT name, price FROM products WHERE price > %s AND category = %s;",
(100, 'Electronics')
)
rows = cur.fetchall()
for r in rows:
print(r)
Never concatenate user‑input directly into the SQL string.
13️⃣ Cheat‑Sheet – SELECT Syntax Overview
| Clause | Purpose | Typical Syntax |
|---|---|---|
SELECT | Columns / expressions to return | SELECT col1, col2, … |
FROM | Source table(s) / subquery | FROM table_name [AS t] |
WHERE | Row filter | WHERE condition |
GROUP BY | Aggregate grouping | GROUP BY col1, col2 |
HAVING | Filter on aggregated groups | HAVING aggregate_condition |
ORDER BY | Sort result set | ORDER BY col [ASC|DESC] |
LIMIT / OFFSET | Pagination | LIMIT 10 OFFSET 20 |
DISTINCT | Remove duplicate rows | SELECT DISTINCT col FROM … |
JOIN | Combine tables | JOIN other_table ON condition |
UNION / INTERSECT / EXCEPT | Set algebra | SELECT … UNION SELECT … |
WINDOW | Analytic functions | OVER (PARTITION BY … ORDER BY …) |
14️⃣ Sample Schema – Customers, Orders, Products
15️⃣ Lab Exercises (Extended)
Using the diagram above, create three tables in PostgreSQL (or SQLite) with appropriate primary/foreign keys. Then insert at least 10 rows per table (use any fictional data you like).
- List all products cheaper than $50, ordered by price ascending.
- Show the total amount each customer has spent (use
SUMandGROUP BY). - Find customers who placed more than 3 orders (use
HAVING). - Display the top 5 most‑expensive orders (use
ORDER BYandLIMIT).
- Show each order with the customer’s name and the product’s name (inner join across three tables).
- Find products that have never been ordered (left join +
WHERE order_id IS NULL). - Write a correlated sub‑query that returns customers whose total spend is above the **average** spend of all customers.
For each order, compute the running total per customer (order rows sorted by order_date) using SUM() OVER (PARTITION BY … ORDER BY …). Capture the query and a screenshot of the result.
16️⃣ Knowledge‑Check Quiz (7 questions)
- Which clause must appear **first** in a SELECT statement?
FROM
SELECT
WHERE - What does the following query return?
SELECT DISTINCT department_id FROM employees;
All rows fromemployees(including duplicates).
A list of **unique** department IDs.
The number of employees per department. - True or False –
HAVINGcan be used withoutGROUP BY.
True
False - Which of the following retrieves the 3rd page (page size = 20) of ordered results?
LIMIT 20 OFFSET 40
LIMIT 60 OFFSET 20
OFFSET 40 LIMIT 20 - In a
LEFT JOIN, rows from the **left** table are returned even when there is no matching row on the right. What value appears for right‑table columns in those cases?
0
NULL
Empty string - Which window function can be used to assign a **dense rank** (no gaps) to rows ordered by salary descending?
ROW_NUMBER()
DENSE_RANK()
RANK() - What is the primary benefit of using a **prepared statement** for a SELECT that contains user input?
It makes the query run faster.
It automatically addsLIMITto protect the DB.
It separates data from code, preventing SQL injection.
Show Answers
- B –
SELECTis always the first clause. - B – Returns a list of distinct department IDs.
- False –
HAVINGfilters groups, so aGROUP BY(or an implicit grouping) is required. - A –
LIMIT 20 OFFSET 40fetches rows 41‑60 → page 3 (0‑based offset). - B – Missing right‑side rows are filled with
NULL. - B –
DENSE_RANK()gives a consecutive rank without gaps. - C – Parameters are bound separately from the SQL string, eliminating injection vectors.
17️⃣ Further Reading & Resources
- Books
- SQL Fundamentals – John J. L. (O'Reilly, 2022)
- PostgreSQL Up and Running – Regina O. Obe (3rd ed., 2023)
- SQL Antipatterns – Bill Karwin (Addison‑Wesley, 2010)
- Mastering SQL Queries – Mark S. (Packt, 2021)
- Official Docs
- Interactive Practice
Filtering & Sorting
1️⃣ The WHERE Clause – Selecting the Right Rows
WHERE is the gate‑keeper of a SELECT statement. Every row that
satisfies the Boolean expression after WHERE is passed downstream
to ORDER BY, LIMIT, etc.
1.1 Comparison & Logical Operators
| Operator | Description | Example |
|---|---|---|
= | Equality | price = 199.99 |
<> or != | Inequality | category <> 'Books' |
> < >= <= | Numeric / date comparison | age >= 21 |
BETWEEN … AND … | Inclusive range | price BETWEEN 50 AND 150 |
IN ( … ) | Set membership | status IN ('active','pending') |
LIKE / ILIKE | Pattern matching (wildcards % and _) | email LIKE '%@example.com' |
IS NULL / IS NOT NULL | Null test (cannot use = NULL) | last_login IS NULL |
AND / OR / NOT | Boolean combinators | country = 'USA' AND age >= 18 |
EXISTS (sub‑query) | True if sub‑query returns at least one row | WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id) |
ANY / ALL | Compare to a set of values | salary > ANY (SELECT salary FROM employees WHERE department = 'HR') |
1.2 Practical WHERE Samples
SELECT * FROM products WHERE price > 100 AND stock >= 10 AND category IN ('Electronics','Appliances') AND discontinued IS NULL; SELECT * FROM users WHERE country = 'USA' AND age >= 18 AND email LIKE '%@example.com'; SELECT * FROM orders WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31'; SELECT * FROM employees e WHERE EXISTS ( SELECT 1 FROM departments d WHERE d.id = e.department_id AND d.name = 'Engineering' );
**Never** wrap an indexed column in a function in the WHERE clause
(e.g., WHERE UPPER(name) = 'ALICE') – this defeats the index.
Use a functional index or a case‑insensitive collation instead.
2️⃣ Ordering Results – ORDER BY
2.1 Basic Syntax
Sort by one or more columns, each optionally marked ASC (default) or DESC.
SELECT * FROM products ORDER BY price DESC; -- expensive first SELECT * FROM employees ORDER BY department_id ASC, last_name ASC;
2.2 Controlling NULL Placement
Some engines put NULL values first, others last. Use the explicit
NULLS FIRST or NULLS LAST clause.
SELECT * FROM users ORDER BY last_login DESC NULLS LAST;
2.3 Case‑Insensitive / Locale‑Aware Sorting
PostgreSQL lets you attach a collation directly in the ORDER BY clause:
SELECT name FROM customers ORDER BY name COLLATE \"C\"; -- case‑insensitive, language‑aware
2.4 Ordering by an Expression (Custom Sort)
Advanced use‑cases often need a derived value, e.g., sorting by the length of a string:
SELECT product_name, price FROM products ORDER BY LENGTH(product_name) ASC, price DESC;
2.5 Random Ordering (caution!)
For testing or sampling you can randomise rows, but it forces a full‑table sort and is expensive on large tables.
SELECT * FROM quotes ORDER BY RANDOM() LIMIT 1; SELECT * FROM quotes ORDER BY RAND() LIMIT 1;
3️⃣ Pagination – LIMIT & OFFSET
Web applications often need “pages” of data. The canonical pattern is:
SELECT * FROM products ORDER BY product_id LIMIT 20 -- page size OFFSET 40; -- skip first two pages (0‑based)
Tip: Combine pagination with a stable ordering column (e.g., primary key) to get deterministic results.
Keyset Pagination (Seek Method)
When a table grows, OFFSET becomes slower. Use “keyset pagination”:
SELECT * FROM products WHERE product_id > 120 -- last ID from previous page ORDER BY product_id LIMIT 20;
4️⃣ Performance – How Filtering & Sorting Interact with Indexes
4.1 Index Usage Basics
- A B‑tree index can satisfy both
WHEREpredicates andORDER BYif the sort order matches the index definition. - Example:
CREATE INDEX idx_price ON products(price DESC);lets the engine fetch rows already sorted – eliminating a separate sort phase.
4.2 EXPLAIN – Visualising the Plan
EXPLAIN ANALYZE SELECT * FROM products WHERE price > 100 ORDER BY price DESC LIMIT 10;
The output shows whether the planner uses an Index Scan (good) or a Seq Scan + Sort (costly).
4.3 Common Pitfalls
- Applying a function to an indexed column (
WHERE LOWER(email) = 'alice@example.com') disables index usage. Instead create a functional index:CREATE INDEX idx_email_lower ON users (LOWER(email)); - Using
ORacross unrelated columns often prevents the optimizer from using indexes. Split the query withUNION ALLif possible. - Sorting on a calculated column (e.g.,
ORDER BY price * tax_rate) forces a temporary file sort.
5️⃣ Security – Safe Filtering & Sorting
Even read‑only queries can be abused for data exfiltration if user input is concatenated unchecked.
const { Client } = require('pg');
const client = new Client(/* connection params */);
async function getProducts(minPrice, maxPrice, sortCol, sortDir) {
// whitelist column names & sort direction to prevent injection
const allowedCols = ['price', 'name', 'created_at'];
const allowedDir = ['ASC', 'DESC'];
if (!allowedCols.includes(sortCol) || !allowedDir.includes(sortDir)) {
throw new Error('Invalid sort parameters');
}
const sql = `SELECT *
FROM products
WHERE price BETWEEN $1 AND $2
ORDER BY ${sortCol} ${sortDir}
LIMIT 20`;
const res = await client.query(sql, [minPrice, maxPrice]);
return res.rows;
}`;
**Key take‑aways**:
- Never interpolate raw user input directly into SQL (even for
ORDER BY). - Whitelist column names / directions, or map them to an enum.
- Use prepared statements for values (
$1, $2in PostgreSQL,?in MySQL).
6️⃣ Cheat‑Sheet – Filtering & Sorting Summary
| Clause | Purpose | Typical Syntax |
|---|---|---|
WHERE | Row filtering | WHERE col = 5 AND other < 10 |
BETWEEN … AND … | Inclusive range | WHERE price BETWEEN 20 AND 30 |
IN ( … ) | Set membership | WHERE status IN ('open','closed') |
LIKE / ILIKE | Pattern matching | WHERE name LIKE 'A%' |
IS NULL / IS NOT NULL | Null test | WHERE deleted_at IS NULL |
EXISTS (sub‑query) | True if sub‑query returns rows | WHERE EXISTS (SELECT 1 …) |
ORDER BY | Sort result set | ORDER BY col1 DESC, col2 ASC |
NULLS FIRST / LAST | Explicit null ordering | ORDER BY created_at DESC NULLS LAST |
LIMIT / OFFSET | Pagination | LIMIT 20 OFFSET 40 |
CASE … END | Custom sort order | ORDER BY CASE status WHEN 'high' THEN 1 … END |
7️⃣ Sample Schema – Customers, Orders, Products
8️⃣ Lab Exercises (Hands‑On)
Using the sample schema (create customers, orders, products tables with at least 20 rows each):
- Find all orders placed in the last 30 days with a total amount > $200.
- List customers whose email ends with
@example.org**and** who have never placed an order. - Show products that have never been ordered (hint:
LEFT JOIN+WHERE order_id IS NULL). - Retrieve the top 5 most‑expensive products, ordered by price descending.
Capture the SQL statements and a screenshot of each query’s result.
- Write a query that sorts customers by
last_nameascending, thenfirst_namedescending. - Implement keyset pagination for the
orderstable (useorder_idas the cursor). - Compare the execution time of the same query using
OFFSET/LIMITvs. keyset pagination on a table with 10 000 rows. - Explain why the keyset method is faster on large offsets.
On the products table:
- Create a B‑tree index on
price DESC. - Run the query
SELECT * FROM products WHERE price > 500 ORDER BY price DESC LIMIT 20;and capture theEXPLAIN ANALYZEoutput before and after the index. - Summarise (≤ 100 words) how the index changed the plan and execution time.
9️⃣ Knowledge‑Check Quiz (7 questions)
- Which operator correctly checks if a column does not contain a NULL value?
= NULL
IS NOT NULL
!= NULL - What does the following query return?
SELECT * FROM employees WHERE salary BETWEEN 50000 AND 60000;
Employees whose salary is exactly $50,000 or $60,000.
Employees earning **inclusive** of $50k – $60k.
Employees whose salary is **outside** the range. - True or False –
ORDER BY column DESC NULLS FIRSTputs the highest non‑null values first, then nulls.
True
False - Which of the following is **not** a valid way to avoid SQL injection in an
ORDER BYclause?
Whitelist allowed column names in application code.
Use a prepared statement and bind the column name as a parameter.
Validate that the sort direction is either “ASC” or “DESC”. - How would you retrieve the 3rd page of results (page size = 25) ordered by
created_atdescending?
LIMIT 25 OFFSET 50
LIMIT 75 OFFSET 0
OFFSET 75 LIMIT 25 - Which expression will correctly sort rows so that
NULLvalues appear **last** regardless of ASC/DESC?
ORDER BY col ASC NULLS FIRST
ORDER BY col DESC NULLS LAST
ORDER BY ISNULL(col) ASC, col ASC - What is the main performance advantage of a “keyset pagination” query over an
OFFSET/LIMITquery on a large table?
It avoids scanning the entire table up to the offset.
It uses a hash index instead of a B‑tree.
It allows the DB to cache the whole result set.
Show Answers
- B –
IS NOT NULLis the correct null‑check operator. - B –
BETWEENis inclusive of both bounds. - True – “NULLS FIRST” puts nulls before any non‑null values, regardless of ASC/DESC.
- B – Prepared statements cannot bind identifiers (column names); you must whitelist or map them.
- A – 3rd page (0‑based) => OFFSET = 25 × 2 = 50, LIMIT = 25.
- C – The trick
ORDER BY ISNULL(col) ASC, col ASCforces nulls to the end for both directions. - A – Keyset pagination reads from a known cursor, avoiding the cost of skipping rows.
10️⃣ Further Reading & Resources
- Books
- SQL Performance Explained – Markus Winand (2nd ed., 2021)
- SQL Antipatterns – Bill Karwin (Addison‑Wesley, 2010)
- PostgreSQL Up and Running – Regina O. Obe (3rd ed., 2023)
- Official Docs
- Interactive Practice
Joins – Combining Data From Multiple Tables
0️⃣ Why Joins Matter
Real‑world databases are normalized: data is split across related tables to avoid redundancy. A join stitches those pieces back together so a security analyst (or any data‑consumer) can answer questions such as:
- Which user bought the most‑expensive product?
- What is the total revenue per country?
- Are there any “orphaned” orders without a matching user?
1️⃣ Sample Schema (Orders, Users, Products, Order_Items)
Below is the DDL you can paste into any RDBMS (PostgreSQL/MySQL/SQLite) to create the demo data.
CREATE TABLE users ( id SERIAL PRIMARY KEY, name VARCHAR(50) NOT NULL, email VARCHAR(100) UNIQUE, country VARCHAR(50) ); CREATE TABLE products ( id SERIAL PRIMARY KEY, sku VARCHAR(30) UNIQUE, name VARCHAR(100) NOT NULL, price NUMERIC(9,2) NOT NULL ); CREATE TABLE orders ( id SERIAL PRIMARY KEY, user_id INT REFERENCES users(id), order_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP ); CREATE TABLE order_items ( id SERIAL PRIMARY KEY, order_id INT REFERENCES orders(id), product_id INT REFERENCES products(id), quantity INT DEFAULT 1 );
2️⃣ Join Types – Theory (with Venn diagrams)
| Join Type | Result Set | Typical Syntax |
|---|---|---|
| INNER JOIN | Only rows that have matching keys in *both* tables. | SELECT … FROM A INNER JOIN B ON A.id = B.fk |
| LEFT (OUTER) JOIN | All rows from the left table + matching rows from the right; NULL where no match. | SELECT … FROM A LEFT JOIN B ON A.id = B.fk |
| RIGHT (OUTER) JOIN | All rows from the right table + matching rows from the left; NULL where no match. | SELECT … FROM A RIGHT JOIN B ON A.id = B.fk |
| FULL (OUTER) JOIN | All rows from both tables; NULLs where one side is missing. | SELECT … FROM A FULL OUTER JOIN B ON A.id = B.fk |
| CROSS JOIN | Cartesian product (every row of A paired with every row of B). Use with care! | SELECT … FROM A CROSS JOIN B |
| NATURAL JOIN | Implicit join on all columns with the same name. Rarely used in security reporting – can hide bugs. | SELECT … FROM A NATURAL JOIN B |
Below are simple Venn‑diagram placeholders for each join type. Replace them with real SVG/PNG graphics in your LMS.
| Join | Diagram |
|---|---|
| INNER | Venn – overlapping area only |
| LEFT | All of left + overlap |
| RIGHT | All of right + overlap |
| FULL | Entire left & right circles |
| CROSS | Cartesian grid |
3️⃣ Practical Join Syntax – Variations & Shortcuts
3.1 Using USING (when column names match)
SELECT o.id, u.name FROM orders o INNER JOIN users u USING(user_id); -- automatically matches on o.user_id = u.id
3.2 Table Aliases – make queries readable
SELECT o.id AS order_id, u.name AS buyer, p.name AS product, oi.quantity FROM order_items oi JOIN orders o ON oi.order_id = o.id JOIN users u ON o.user_id = u.id JOIN products p ON oi.product_id = p.id WHERE p.price > 1000 ORDER BY o.id;
3.3 Self‑Join – hierarchical data (e.g., employee‑manager)
SELECT e.name AS employee, m.name AS manager FROM employees e LEFT JOIN employees m ON e.manager_id = m.id ORDER BY e.name;
3.4 Joining More Than Two Tables
The same JOIN keyword can be chained; the DB engine builds an internal join tree.
SELECT u.name, SUM(p.price * oi.quantity) AS total_spent FROM users u JOIN orders o ON u.id = o.user_id JOIN order_items oi ON o.id = oi.order_id JOIN products p ON oi.product_id = p.id GROUP BY u.name HAVING total_spent > 5000 ORDER BY total_spent DESC;
4️⃣ Common Pitfalls & Performance Tips
- Cartesian product – forgetting the
ONclause (or usingCROSS JOIN) creates huge result sets that can crash the DB or hide the real problem. - Duplicate rows – when joining a one‑to‑many relationship,
SELECT DISTINCTmay be needed, but better to aggregate (GROUP BY) rather than hide the symptom. - Missing indexes on foreign‑key columns – each
JOINideally uses an index on the joining column (e.g.,CREATE INDEX idx_orders_user_id ON orders(user_id);). - Ordering before joining – give the optimizer the chance to push
ORDER BYdown to the smallest sub‑plan. Usually you order after the final join. - Implicit joins (comma‑separated FROM) – legacy syntax
FROM a, b WHERE a.id=b.idis harder to read and more error‑prone.
4.1 How the Optimizer Builds a Join Tree
Modern engines (PostgreSQL, MySQL, SQL‑Server) evaluate joins in the order that yields the lowest estimated cost.
You can view the plan with EXPLAIN (or EXPLAIN ANALYZE for real timing).
EXPLAIN ANALYZE SELECT u.name, o.id, p.name, oi.quantity FROM users u JOIN orders o ON u.id = o.user_id JOIN order_items oi ON o.id = oi.order_id JOIN products p ON oi.product_id = p.id WHERE p.price > 1000 LIMIT 20;
The output will show whether each join is using an Index Scan, a Hash Join, or a Nested Loop. If you see a Seq Scan on a large table where a foreign key exists, add an index and re‑run.
5️⃣ Hands‑On Lab – Join Challenges
Import schema.sql (provided above) into your favourite RDBMS. Populate it with at least the following rows (feel free to add more):
| Table | Sample Rows |
|---|---|
| users |
INSERT INTO users (name,email,country) VALUES
('Alice','alice@example.com','USA'),
('Bob','bob@example.com','India'),
('Charlie','charlie@example.com','UK');
|
| products |
INSERT INTO products (sku,name,price) VALUES
('SKU-001','Gaming Laptop',2500.00),
('SKU-002','Office Chair',150.00),
('SKU-003','USB‑C Hub',45.99);
|
| orders |
INSERT INTO orders (user_id,order_date) VALUES (1,'2024-04-01'),(2,'2024-04-03'),(1,'2024-04-05'); |
| order_items |
INSERT INTO order_items (order_id,product_id,quantity) VALUES (1,1,1),(1,3,2), (2,2,1), (3,1,1),(3,2,1); |
- Inner join – List every order (order_id) together with the buyer’s name and the total amount for that order (sum of price × quantity).
- Left join – Show all users and the list of orders they placed; include users with zero orders (display
NULLfor order_id). - Self‑join – Assuming the
userstable had areferrer_idcolumn (who invited them), write a query that shows each user and the name of the referrer. (You may add a dummy column to practice.) - Cross join (Cartesian product) – caution – Show the total number of possible
user × productpairs. Then limit the output to the first 5 rows to avoid flooding the console. - Full outer join (PostgreSQL only) – Return every user and every order, even if there is no matching pair (use
FULL OUTER JOIN). Identify “orphaned” orders (no user) and “orphaned” users (no order) in the same result set.
Take a screenshot of each query’s result and include it in a short lab‑report (max 300 words) describing what you learned.
Run the inner‑join revenue query from Exercise 4.2 on a table with at least 10 000 rows (you can generate dummy data with a script). Capture the EXPLAIN ANALYZE output before adding indexes on orders.user_id, order_items.order_id and order_items.product_id. Then add the indexes, re‑run the same EXPLAIN ANALYZE and compare the total execution time.
Write a 150‑word summary explaining the difference.
6️⃣ Cheat‑Sheet – Join Syntax at a Glance
| Syntax | When to Use |
|---|---|
FROM A INNER JOIN B ON A.id = B.fk |
Only records with matching keys (most common). |
FROM A LEFT JOIN B ON A.id = B.fk |
All rows from A; keep B data where it exists (good for “orphan detection”). |
FROM A RIGHT JOIN B ON A.id = B.fk |
All rows from B; rare in practice (swap tables for LEFT JOIN). |
FROM A FULL OUTER JOIN B ON A.id = B.fk |
Show every row from both tables – good for data‑reconciliation. |
FROM A CROSS JOIN B |
Cartesian product; use only for combinatorial generation (e.g., all seat‑price combos). |
FROM A JOIN B USING(col) |
When column names are identical – less typing. |
FROM A NATURAL JOIN B |
Implicit join on all same‑named columns – **avoid** unless you fully control schema. |
7️⃣ Knowledge‑Check Quiz (7 questions)
- Which join returns every row from the left table and NULLs for the right table when there is no match?
INNER JOIN
LEFT JOIN
FULL OUTER JOIN - What is the result of a
CROSS JOINbetween a table with 5 rows and another with 8 rows?
13 rows
40 rows
5 rows - True or False –
NATURAL JOINis safe to use in a security‑report because it guarantees you are joining on the right columns. - Which clause can be used to eliminate duplicate rows after a join?
WHERE
GROUP BY
DISTINCT - You have an
orderstable (order_id, user_id) and auserstable (id, name). Which index will most improve the performance of the following query?
Index onSELECT o.id, u.name FROM orders o JOIN users u ON o.user_id = u.id WHERE u.country = 'USA';orders.id
Index onusers.country
Composite index onorders(user_id) - Which of the following statements is incorrect about LEFT JOIN semantics?
The left table’s rows are never filtered out.
NULLs appear for right‑table columns when there is no match.
LEFT JOIN is equivalent to RIGHT JOIN with the table order swapped. - In PostgreSQL, which command displays the actual join execution plan with timing information?
EXPLAIN
EXPLAIN ANALYZE
SHOW PLAN
Show Answers
- B –
LEFT JOINkeeps all rows from the left side. - B – 5 × 8 = 40 rows (Cartesian product).
- False –
NATURAL JOINimplicitly matches on *all* same‑named columns; a stray column with the same name can cause an unintended join. - C –
DISTINCTremoves duplicate rows after a join. - C – A composite index on
orders(user_id)(or a separate index onusers.id, which is already a primary key) will help the join; the filter onusers.countryis secondary. - C – LEFT JOIN is not the same as RIGHT JOIN simply by swapping order; the side that receives NULLs matters.
- B –
EXPLAIN ANALYZEruns the query and prints real execution times.
8️⃣ Further Reading & Templates
- Books
- SQL Antipatterns – Bill Karwin (covers misuse of joins).
- Learning SQL – Alan Beaulieu (excellent for beginners).
- SQL Performance Explained – Markus Winand (deep dive on join optimization).
- Online Guides
- Cheat‑Sheet PDF (download) SQL Join Cheat‑Sheet (PDF)
Aggregation & GROUP BY
0️⃣ Why Aggregation Is a Core Skill
Aggregation turns rows into business‑level numbers. In a penetration‑test report you’ll often need to say:
- “142 hosts responded on port 22.”
- “The average CVSS score of all critical findings is 9.2.”
- “Total data exfiltrated: 4.3 GB.”
COUNT(), SUM(), AVG() and friends, combined with GROUP BY.
1️⃣ Sample Schema (Orders ↔ Users ↔ Products)
Copy‑paste the DDL below into any RDBMS (PostgreSQL, MySQL or SQLite) and populate it with a few rows – you’ll need it for all the lab exercises.
CREATE TABLE users ( id SERIAL PRIMARY KEY, name VARCHAR(50) NOT NULL, country VARCHAR(30) ); CREATE TABLE products ( id SERIAL PRIMARY KEY, sku VARCHAR(30) UNIQUE, name VARCHAR(100) NOT NULL, price NUMERIC(9,2) NOT NULL, stock INT DEFAULT 0 ); CREATE TABLE orders ( id SERIAL PRIMARY KEY, user_id INT REFERENCES users(id), order_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP ); CREATE TABLE order_items ( id SERIAL PRIMARY KEY, order_id INT REFERENCES orders(id), product_id INT REFERENCES products(id), qty INT DEFAULT 1 );
2️⃣ Core Aggregate (Summary) Functions
| Function | What It Does | Typical Example |
|---|---|---|
COUNT(*) | Counts rows (includes NULLs) | SELECT COUNT(*) FROM products; |
COUNT(col) | Counts non‑NULL values of col | SELECT COUNT(email) FROM users; |
SUM(col) | Adds all numeric values | SELECT SUM(stock) FROM products; |
AVG(col) | Average (arithmetic mean) | SELECT AVG(price) FROM products; |
MIN(col) | Smallest value | SELECT MIN(order_date) FROM orders; |
MAX(col) | Largest value | SELECT MAX(price) FROM products; |
STRING_AGG(col, ', ') (Postgres) / GROUP_CONCAT(col SEPARATOR ', ') (MySQL) | Concatenates strings per group | SELECT country, STRING_AGG(name, ', ') FROM users GROUP BY country; |
JSON_AGG(col) (Postgres) | Returns a JSON array per group | SELECT JSON_AGG(row_to_json(p)) FROM products p GROUP BY category; |
All aggregate functions ignore NULL values automatically (except COUNT(*) which counts rows regardless of NULL).
3️⃣ Basic GROUP BY Syntax
SELECT category, COUNT(*) AS product_cnt, AVG(price) AS avg_price, SUM(stock * price) AS inventory_value FROM products GROUP BY category ORDER BY inventory_value DESC;
Rules to remember:
- Every column in the
SELECTlist must either be an aggregate or appear in theGROUP BYclause. - Aliases (
AS …) are optional but make laterHAVINGorORDER BYclauses clearer. - The grouping columns are evaluated first; aggregates are then calculated per group.
4️⃣ Filtering Groups – HAVING
HAVING works like WHERE but **after** aggregation.
SELECT country, COUNT(*) AS user_cnt FROM users GROUP BY country HAVING COUNT(*) >= 5 ORDER BY user_cnt DESC;
Typical security‑report use‑cases:
- Show only assets that have **≥ 3** high‑severity findings.
- List countries where the *average* CVSS of discovered vulns exceeds 7.
5️⃣ Advanced Grouping – ROLLUP, CUBE, GROUPING SETS
These constructs let you produce **multiple aggregation levels** in a single query.
5.1 ROLLUP – Hierarchical Sub‑Totals
SELECT u.country, EXTRACT(YEAR FROM o.order_date) AS yr, SUM(p.price * oi.qty) AS revenue FROM orders o JOIN users u ON o.user_id = u.id JOIN order_items oi ON o.id = oi.order_id JOIN products p ON oi.product_id = p.id GROUP BY ROLLUP(u.country, EXTRACT(YEAR FROM o.order_date)) ORDER BY u.country, yr;
Result rows include:
- Revenue per (country, year).
- Yearly subtotal per country (country,
NULL). - Grand total (both columns
NULL).
5.2 CUBE – All Possible Combinations
SELECT u.country, EXTRACT(YEAR FROM o.order_date) AS yr, COUNT(DISTINCT o.id) AS orders_cnt FROM orders o JOIN users u ON o.user_id = u.id GROUP BY CUBE(u.country, EXTRACT(YEAR FROM o.order_date)) ORDER BY u.country, yr;
CUBE returns every combination, i.e., per country, per year, per country‑year, plus the grand total.
5.3 GROUPING SETS – Hand‑Picked Mix
SELECT u.country, EXTRACT(MONTH FROM o.order_date) AS month, SUM(p.price * oi.qty) AS rev FROM orders o JOIN users u ON o.user_id = u.id JOIN order_items oi ON o.id = oi.order_id JOIN products p ON oi.product_id = p.id GROUP BY GROUPING SETS( (u.country, EXTRACT(MONTH FROM o.order_date)), (u.country), (EXTRACT(MONTH FROM o.order_date)) ) ORDER BY u.country, month;
Only the three specified groupings appear – no grand total.
6️⃣ Window (Analytic) Functions – Aggregates Without Collapse
Sometimes you need a running total or a rank while still seeing each individual order. Window functions let you do that.
6.1 Syntax Overview
SELECT o.id, u.name, p.name AS product, oi.qty, p.price, SUM(p.price * oi.qty) OVER( PARTITION BY u.id ORDER BY o.order_date ) AS cumulative_spend FROM orders o JOIN users u ON o.user_id = u.id JOIN order_items oi ON o.id = oi.order_id JOIN products p ON oi.product_id = p.id ORDER BY u.id, o.order_date;
6.2 Real‑World Example – Top‑3 Products per Country
SELECT u.country, p.name AS product, SUM(p.price * oi.qty) AS revenue, RANK() OVER(PARTITION BY u.country ORDER BY SUM(p.price * oi.qty) DESC) AS rnk FROM order_items oi JOIN orders o ON oi.order_id = o.id JOIN users u ON o.user_id = u.id JOIN products p ON oi.product_id = p.id GROUP BY u.country, p.name HAVING SUM(p.price * oi.qty) > 0 ORDER BY u.country, rnk LIMIT 30; -- 10 countries × top‑3 each
Window functions are **not** listed in GROUP BY; they are calculated after the grouping step.
7️⃣ Performance Tips for Aggregations
- Index the grouping columns. An index on
products.categoryspeedsGROUP BY category. - Covering indexes (e.g.,
CREATE INDEX idx_products_cat_price ON products(category, price);) let the engine satisfy both the grouping and the aggregation from the index alone. - Materialized views for heavy daily reports (e.g.,
CREATE MATERIALIZED VIEW daily_sales AS …). - Filter early. Put restrictive
WHEREclauses beforeGROUP BYto reduce rows that need to be aggregated. - Avoid
SELECT *in aggregated queries – fetching unnecessary columns inflates I/O. - Watch for data skew – a single value dominating a group can cause sub‑optimal join plans. Consider partitioning or separate statistics.
7.1 Index Impact Demo
EXPLAIN ANALYZE SELECT category, COUNT(*) AS cnt, AVG(price) AS avg_price FROM products GROUP BY category;
If the plan shows a Seq Scan on a million‑row table, create an index:
CREATE INDEX idx_products_category ON products(category);
Rerun the EXPLAIN ANALYZE – you should see an Index Scan and a dramatically lower total cost.
8️⃣ Common Pitfalls & How to Avoid Them
- Missing GROUP BY clause – “column must appear in the GROUP BY clause” error.
- Counting with
COUNT(col)when you need total rows –COUNT(col)skips NULLs. - Using
HAVINGwithoutGROUP BY– it works but usually means you meantWHERE. - Accidentally creating a Cartesian product – forgetting the
ONclause in a join will causeGROUP BYto process massive rowsets. - NULL groups – rows where the grouping column is NULL become a separate group. Use
COALESCE(col,'N/A')if you prefer them merged. - Rounding errors with monetary aggregates – cast to a higher precision type (e.g.,
price::numeric(12,4)) before summing.
9️⃣ Hands‑On Lab – Aggregation Challenges
Run schema.sql (Section 1). Insert at least 20 products spanning 4 categories, 15 users across 3 countries, and 30 orders with associated order_items. Feel free to use any data you like – the point is to have enough rows for meaningful grouping.
- Total inventory value per category –
SUM(price * stock)grouped bycategory. - Average order value per country – join the four tables and
AVG(price * qty)grouped byusers.country. - Top‑3 selling products overall – order by
SUM(qty)descending,LIMIT 3. - Countries with more than 5 orders – use
GROUP BY users.countryandHAVING COUNT(DISTINCT orders.id) > 5. - Running total of revenue per user (window function) – calculate a cumulative sum ordered by
order_date.
Capture a screenshot of each result and write a one‑sentence interpretation (e.g., “Electronics holds the highest inventory value: $125 k”).
ROLLUPRun the ROLLUP query from Section 5.1 on a dataset of at least 10 000 orders. Record the execution time with EXPLAIN ANALYZE **before** adding any indexes, then create indexes on orders.order_date and users.country and repeat. Write a ≤ 100‑word summary of the performance impact.
10️⃣ Cheat‑Sheet – Aggregation & GROUP BY Quick Reference
| Clause / Function | Purpose / Example |
|---|---|
COUNT(*) | Rows in the result set. |
COUNT(col) | Non‑NULL values in col. |
SUM(col) | Additive total. |
AVG(col) | Mean value. |
MIN(col) / MAX(col) | Smallest / largest value. |
GROUP BY col1, col2 … | Define grouping columns. |
HAVING condition | Filter groups (post‑aggregation). |
ORDER BY col ASC|DESC | Sort final rows. |
ROLLUP(col1, col2) | Hierarchical subtotals + grand total. |
CUBE(col1, col2) | All combination totals. |
GROUPING SETS((…), (…)) | Custom mix of groupings. |
SUM(col) OVER (PARTITION BY … ORDER BY …) | Running total / rank without collapsing rows. |
STRING_AGG(col, ', ') (Postgres) / GROUP_CONCAT(col SEPARATOR ', ') (MySQL) | Concatenate strings per group. |
11️⃣ Knowledge‑Check Quiz (7 questions)
- Which query returns the number of distinct product categories?
SELECT COUNT(*) FROM products;
SELECT COUNT(DISTINCT category) FROM products;
SELECT COUNT(category) FROM products; - What does this query do?
A. Returns countries with **exactly** 3 users.SELECT country, COUNT(*) AS users FROM users GROUP BY country HAVING COUNT(*) > 3;
B. Returns countries with **more than** 3 users.
C. Returns every country regardless of user count. - True or False –
GROUP BYandDISTINCTare interchangeable. - Which construct will give you a grand total, per‑country subtotal, per‑year subtotal, and per‑country‑year cells in a single result set?
GROUP BY ROLLUP(country, YEAR(order_date))
GROUP BY CUBE(country, YEAR(order_date))
GROUP BY GROUPING SETS((country, YEAR), (country), (YEAR)) - In the window‑function example below, what is the effect of
PARTITION BY u.id?
A. Groups rows by user before calculating the cumulative sum.SELECT o.id, SUM(p.price * oi.qty) OVER (PARTITION BY u.id ORDER BY o.order_date) AS cum_spend FROM orders o JOIN users u ON o.user_id = u.id JOIN order_items oi ON o.id = oi.order_id JOIN products p ON oi.product_id = p.id;
B. Filters rows to only those whereu.idmatches the current row.
C. Orders the final result set byu.id. - Which index will most improve this query’s performance?
Index onSELECT country, COUNT(*) AS cnt FROM users WHERE country = 'USA' GROUP BY country;users.id.
Composite index on(country, id).
Index onusers.country. - What does the
GROUPING()function return inside aROLLUPquery?
Show Answers
- B –
COUNT(DISTINCT category)gives the number of unique categories. - B –
HAVING COUNT(*) > 3filters for groups with more than three users. - False –
GROUP BYgroups rows, whileDISTINCTremoves duplicate rows; they are not interchangeable. - A –
ROLLUP(country, YEAR(...))yields the grand total, country subtotals, year subtotals, and country‑year cells. - A –
PARTITION BY u.idcreates a separate window (bucket) per user, then computes the cumulative sum inside that bucket ordered by date. - C – An index on the filtered column (
users.country) allows the DB engine to satisfy both theWHEREand theGROUP BYefficiently. - The
GROUPING()function returns1for a column that is being aggregated at a higher level (i.e., when the column’s value is replaced byNULLin a subtotal or grand‑total row) and0otherwise. It is used to differentiate realNULLvalues from the artificialNULLs generated byROLLUP/CUBE.
12️⃣ Further Reading & Templates
- Books
- SQL Performance Explained – Markus Winand (excellent chapter on aggregation).
- SQL Antipatterns – Bill Karwin (covers common aggregation mistakes).
- Learning SQL – Alan Beaulieu (good for beginners).
- Official Docs
- Cheat‑Sheet (PDF) – Aggregation Cheat‑Sheet (PDF)
- Interactive Playground
Subqueries & CTEs
Introduction to Subqueries
A subquery is a query written inside another SQL query. Subqueries allow developers to break complex problems into smaller logical steps.
Subqueries are commonly used for:
- Filtering data
- Comparing values
- Generating temporary result sets
- Performing calculations
- Building advanced reports
Basic Subquery Example
A subquery can be placed inside a WHERE clause.
SELECT name FROM employees WHERE salary > ( SELECT AVG(salary) FROM employees );
The inner query calculates the average salary, while the outer query returns employees earning above that average.
How Subqueries Work
SQL executes the inner query first, then uses the returned result in the outer query.
This allows queries to become dynamic and data-driven.
Subqueries in the SELECT Clause
Subqueries can also appear inside the SELECT statement.
SELECT name, ( SELECT department_name FROM departments WHERE departments.id = employees.department_id ) AS department FROM employees;
This query retrieves employee names along with department names.
Subqueries in the FROM Clause
A subquery inside the FROM clause acts like a temporary table.
SELECT avg_salary.department_id, avg_salary.average FROM ( SELECT department_id, AVG(salary) AS average FROM employees GROUP BY department_id ) AS avg_salary;
This approach improves readability for complex calculations.
Single-Row vs Multi-Row Subqueries
Some subqueries return one value, while others return multiple rows.
Single-row subqueries are often used with operators such as:
=><
Multi-row subqueries are commonly used with:
INANYALL
Using IN with Subqueries
SELECT name FROM employees WHERE department_id IN ( SELECT id FROM departments WHERE location = 'London' );
This query finds employees working in departments located in London.
Correlated Subqueries
A correlated subquery depends on values from the outer query.
SELECT name, salary FROM employees e WHERE salary > ( SELECT AVG(salary) FROM employees WHERE department_id = e.department_id );
The inner query runs once for every row processed by the outer query.
Introduction to CTEs
A Common Table Expression (CTE) is a temporary named result set created using the WITH keyword.
CTEs improve readability and make complex queries easier to manage.
Common Table Expressions (CTE)
WITH high_value_orders AS ( SELECT * FROM orders WHERE total > 1000 ) SELECT * FROM high_value_orders;
The CTE creates a temporary result called high_value_orders.
Why Use CTEs?
CTEs provide several advantages:
- Improve query readability.
- Reduce repeated logic.
- Make debugging easier.
- Organize complex SQL statements.
- Support recursive queries.
Multiple CTEs
You can define multiple CTEs in a single query.
WITH sales_total AS ( SELECT customer_id, SUM(total) AS total_sales FROM orders GROUP BY customer_id ), top_customers AS ( SELECT * FROM sales_total WHERE total_sales > 5000 ) SELECT * FROM top_customers;
This structure makes long queries much easier to understand.
Recursive CTEs
Recursive CTEs are used for hierarchical or tree-structured data.
WITH RECURSIVE numbers AS ( SELECT 1 AS n UNION ALL SELECT n + 1 FROM numbers WHERE n < 5 ) SELECT * FROM numbers;
This query generates numbers from 1 to 5 recursively.
CTEs vs Subqueries
Both CTEs and subqueries solve similar problems, but they differ in readability and structure.
- Subqueries are compact for small operations.
- CTEs are better for complex multi-step logic.
- CTEs improve maintainability.
- Recursive operations require CTEs.
Performance Considerations
In some databases, CTEs may improve readability but not necessarily performance.
Database engines optimize queries differently, so performance testing is important for large datasets.
Real-World Use Cases
Subqueries and CTEs are heavily used in:
- Business reporting
- Analytics dashboards
- Financial systems
- Inventory management
- Data warehousing
- Hierarchical organizational data
Common Beginner Mistakes
- Using subqueries when joins are simpler.
- Writing deeply nested unreadable queries.
- Forgetting aliases for subqueries.
- Creating inefficient correlated subqueries.
- Confusing recursive and non-recursive CTEs.
Best Practices
- Use meaningful CTE names.
- Keep queries readable and modular.
- Use CTEs for multi-step transformations.
- Test performance on large datasets.
- Avoid unnecessary nested queries.
Practice
Create a subquery that returns employees earning more than the average salary.
Then create a CTE that selects products with stock greater than 100 and display the results.
Finally, write a recursive CTE that generates numbers from 1 to 10.
Summary
Subqueries and Common Table Expressions are powerful SQL tools used to build advanced and organized queries. Subqueries help solve problems step-by-step, while CTEs improve readability and support recursive operations.
Mastering these concepts is essential for writing professional SQL queries used in real-world applications and enterprise databases.
Data Modification
Introduction to Data Modification
Databases are not only used for retrieving data — they are also used to create, modify, and remove information.
SQL provides several commands for changing database records safely and efficiently.
INSERT— adds new dataUPDATE— modifies existing dataDELETE— removes data
INSERT, UPDATE, DELETE
INSERT INTO users (name, email) VALUES ('Bob', 'bob@example.com'); UPDATE products SET price = price * 1.1 WHERE category = 'Tech'; DELETE FROM logs WHERE created_at < '2023-01-01';
Understanding INSERT
The INSERT INTO statement adds new rows into a table.
This command is commonly used in:
- User registration systems
- Order management systems
- Inventory applications
- Blog and CMS platforms
Basic INSERT Syntax
INSERT INTO students (name, age) VALUES ('Alice', 21);
Here, a new row is inserted into the students table.
Inserting Multiple Rows
SQL allows inserting multiple records in a single query.
INSERT INTO products (name, price) VALUES ('Mouse', 500), ('Keyboard', 1200), ('Monitor', 9000);
INSERT Without Specifying Columns
If values are provided in exact table order, column names can be omitted.
INSERT INTO users VALUES (1, 'Ahmed', 'ahmed@example.com');
However, specifying column names is considered a better practice.
Understanding UPDATE
The UPDATE statement modifies existing records.
UPDATE employees SET salary = 50000 WHERE id = 1;
This query updates the salary of the employee with ID 1.
Updating Multiple Columns
UPDATE users SET city = 'Delhi', status = 'Active' WHERE id = 5;
Importance of WHERE Clause
The WHERE clause determines which rows are affected.
UPDATE users SET status = 'Inactive';
Without a WHERE clause,
all rows in the table will be updated.
Understanding DELETE
The DELETE statement removes records from a table.
DELETE FROM customers WHERE id = 10;
This removes the customer whose ID equals 10.
Deleting Multiple Records
DELETE FROM orders WHERE status = 'Cancelled';
Deleting All Rows
Removing the WHERE clause deletes all records.
DELETE FROM logs;
Use this carefully because deleted rows may not be recoverable.
TRUNCATE vs DELETE
| DELETE | TRUNCATE |
|---|---|
| Removes selected rows | Removes all rows |
| Supports WHERE clause | No WHERE clause |
| Slower for large tables | Faster operation |
Using Conditions in Queries
Conditions help target specific rows during modification.
UPDATE products SET stock = stock - 1 WHERE product_id = 100;
Working with NULL Values
UPDATE users SET phone = NULL WHERE id = 3;
Using Arithmetic Operations
SQL can perform calculations during updates.
UPDATE employees SET salary = salary + 5000 WHERE department = 'IT';
Transactions and Safety
Transactions help ensure database consistency.
BEGIN; UPDATE accounts SET balance = balance - 1000 WHERE id = 1; UPDATE accounts SET balance = balance + 1000 WHERE id = 2; COMMIT;
Transactions are especially useful in banking and payment systems.
Rollback Example
ROLLBACK;
Rollback cancels changes before they are permanently saved.
Best Practices
- Always use WHERE with UPDATE and DELETE
- Back up important data before modifications
- Test queries on small datasets first
- Use transactions for critical operations
- Validate user input before database changes
Common Mistakes
- Forgetting the WHERE clause
- Inserting duplicate data accidentally
- Deleting records without backup
- Using incorrect column order
- Ignoring transaction handling
Real-World Examples
- User account registration
- Updating product inventory
- Deleting expired logs
- Processing customer orders
- Managing employee databases
Mini Practice
Try writing queries for the following:
- Insert a new student record
- Update a product price
- Delete inactive users
- Increase salaries by 10%
- Use transactions with COMMIT and ROLLBACK
Constraints & Indexes
Introduction to Constraints
Constraints are rules applied to database tables to ensure data accuracy, consistency, and integrity.
They help prevent invalid or duplicate data from being inserted into a database.
Constraints are essential in professional database systems because they enforce business rules directly at the database level.
Why Constraints Matter
Without constraints, databases may contain:
- Duplicate records
- Missing required values
- Invalid relationships
- Incorrect data formats
- Orphaned records
Constraints help maintain reliable and trustworthy data.
Ensuring Data Integrity
CREATE TABLE authors ( id SERIAL PRIMARY KEY, name TEXT NOT NULL, email TEXT UNIQUE );
This table demonstrates multiple important constraints:
PRIMARY KEYensures unique row identification.NOT NULLprevents empty values.UNIQUEprevents duplicate emails.
Primary Key Constraint
A primary key uniquely identifies each row in a table.
CREATE TABLE students ( student_id INT PRIMARY KEY, name TEXT );
Primary keys cannot contain duplicate or NULL values.
Auto-Incrementing Keys
Databases often generate primary key values automatically.
id SERIAL PRIMARY KEY
The SERIAL keyword automatically increases values for each new row.
NOT NULL Constraint
The NOT NULL constraint requires a column to always contain a value.
CREATE TABLE products ( name TEXT NOT NULL, price DECIMAL );
This prevents incomplete records from being inserted.
UNIQUE Constraint
The UNIQUE constraint ensures all values in a column are different.
CREATE TABLE users ( username TEXT UNIQUE );
This is commonly used for usernames, emails, and account IDs.
DEFAULT Constraint
The DEFAULT constraint assigns automatic values when no value is provided.
CREATE TABLE orders ( status TEXT DEFAULT 'Pending' );
If no status is specified, the database automatically inserts Pending.
CHECK Constraint
The CHECK constraint validates data using conditions.
CREATE TABLE employees ( age INT CHECK (age >= 18) );
This prevents invalid age values from being inserted.
Foreign Key Constraint
Foreign keys create relationships between tables.
CREATE TABLE books ( id SERIAL PRIMARY KEY, author_id INT REFERENCES authors(id) );
The author_id must match an existing value in the authors table.
Referential Integrity
Foreign keys help maintain referential integrity by ensuring related records remain valid.
This prevents orphaned records and broken relationships.
Composite Keys
A composite key combines multiple columns into a single primary key.
CREATE TABLE enrollments ( student_id INT, course_id INT, PRIMARY KEY (student_id, course_id) );
This ensures each student-course combination is unique.
Introduction to Indexes
Indexes improve database query performance by allowing faster data retrieval.
Without indexes, databases may need to scan entire tables to locate records.
How Indexes Work
Indexes function similarly to a book index. Instead of reading every page, the database quickly locates the required information.
Indexes significantly improve search speed for large datasets.
Creating an Index
CREATE INDEX idx_author_name ON authors(name);
This creates an index on the name column.
Indexes on Multiple Columns
Indexes can include multiple columns.
CREATE INDEX idx_employee ON employees(last_name, first_name);
Multi-column indexes improve queries involving multiple search conditions.
Unique Indexes
A unique index prevents duplicate values while also improving search performance.
CREATE UNIQUE INDEX idx_email ON users(email);
When to Use Indexes
Indexes are useful for columns frequently used in:
WHEREclausesJOINoperationsORDER BYclausesGROUP BYstatements
When Too Many Indexes Become a Problem
Although indexes improve reading speed, they can slow down insert, update, and delete operations.
Each index must also be updated whenever table data changes.
Clustered vs Non-Clustered Indexes
Some database systems support different index types:
- Clustered Index: Organizes physical table data.
- Non-Clustered Index: Stores separate index structures.
The exact behavior depends on the database system being used.
Viewing Existing Indexes
Most databases provide commands to inspect indexes.
SELECT * FROM pg_indexes WHERE tablename = 'authors';
Dropping an Index
Indexes can be removed if no longer needed.
DROP INDEX idx_author_name;
Real-World Uses
Constraints and indexes are critical in:
- Banking systems
- E-commerce applications
- Inventory systems
- Authentication systems
- Large enterprise databases
- Analytics platforms
Benefits of Constraints & Indexes
- Improve data integrity.
- Prevent invalid records.
- Enforce business rules.
- Increase query performance.
- Support scalable applications.
Common Beginner Mistakes
- Creating too many unnecessary indexes.
- Forgetting foreign key relationships.
- Using NULL values incorrectly.
- Ignoring unique constraints on important fields.
- Adding indexes to very small tables unnecessarily.
Best Practices
- Always define primary keys.
- Use foreign keys for related tables.
- Index frequently searched columns.
- Validate important business rules using constraints.
- Regularly monitor query performance.
Practice
Create a table called customers with:
- A primary key column
- A unique email column
- A NOT NULL name column
- A CHECK constraint for age greater than 18
Then create an index on the email column.
Summary
Constraints and indexes are essential database features used to maintain reliable data and improve performance. Constraints enforce data integrity rules, while indexes optimize query execution speed.
Understanding these concepts is critical for designing scalable, secure, and efficient database systems.
Transactions & ACID
All or Nothing
BEGIN; UPDATE accounts SET balance = balance - 100 WHERE id = 1; UPDATE accounts SET balance = balance + 100 WHERE id = 2; COMMIT;
Stored Procedures
Functions and procedures allow you to save SQL logic on the database server.
Window Functions
Advanced Analytics
SELECT name, salary, RANK() OVER (ORDER BY salary DESC) as ranking FROM employees;
Capstone Project: E-commerce DB
Design and implement a database schema for an e-commerce platform, including products, users, orders, and reviews.
-- Project goals: -- 1. Create tables with proper constraints -- 2. Populate with sample data -- 3. Write complex queries for sales reports -- 4. Optimize queries with indexes
Database Design & Normalization
Content coming soon...
Performance & Query Optimization
Content coming soon...
Security & Access Control
Content coming soon...
Final Project — Enterprise Database
Content coming soon...