The Complete PostgreSQL Database Guide for Developers
This comprehensive guide covers essential PostgreSQL database concepts and practices every developer should master, from junior to senior level. You’ll learn everything from basic SQL operations to advanced database design patterns and optimization techniques.
Table of Contents
- Database Fundamentals
- PostgreSQL Overview
- SQL Fundamentals
- Creating and Managing Tables
- Data Types and Constraints
- Querying Data - SELECT Statement
- Joining Tables
- Data Modification
- Data Modeling & Design
- Normalization
- Advanced SQL Concepts
- Transactions & ACID
- Indexes & Performance
- Triggers & Stored Procedures
- Views and Materialized Views
- Security & Best Practices
- Performance Optimization
- Monitoring & Maintenance
Database Fundamentals
What is a Database?
A database is an organized collection of structured information, or data, typically stored electronically in a computer system. Think of it as a digital filing cabinet where you can store, organize, and retrieve information efficiently.
Key Components:
- Data: The actual information stored (names, numbers, dates, etc.)
- DBMS (Database Management System): Software that manages the database (PostgreSQL, MySQL, etc.)
- Schema: The structure that defines how data is organized
- Tables: Collections of related data organized in rows and columns
Why Use Databases?
- Data Integrity: Ensures data accuracy and consistency
- Concurrency: Multiple users can access data simultaneously
- Security: Control who can access what data
- Backup & Recovery: Protect against data loss
- Scalability: Handle growing amounts of data
- ACID Compliance: Reliable transaction processing
Types of Database Systems
Relational Databases (RDBMS)
- Data stored in tables with predefined relationships
- Use SQL (Structured Query Language) for operations
- ACID compliant for reliable transactions
- Examples: PostgreSQL, MySQL, Oracle, SQL Server
Key Characteristics:
- Structured Data: Fixed schema with defined columns and data types
- Relationships: Tables can be related through foreign keys
- Consistency: Strong data integrity guarantees
- Mature Technology: Decades of development and optimization
PostgreSQL Overview
PostgreSQL is an advanced, open-source relational database system known for its reliability, feature robustness, and performance. It’s often called “the world’s most advanced open source database.”
Why Choose PostgreSQL?
1. ACID Compliance
PostgreSQL ensures that all transactions are:
- Atomic: All operations in a transaction succeed or all fail
- Consistent: Database remains in a valid state
- Isolated: Transactions don’t interfere with each other
- Durable: Committed changes persist even after system failure
2. Advanced Features
- JSON/JSONB Support: Store and query JSON data natively
- Arrays: Column can contain multiple values
- Custom Data Types: Create your own data types
- Full-Text Search: Built-in search capabilities
- Window Functions: Advanced analytical queries
- Common Table Expressions (CTEs): Recursive and non-recursive queries
3. Extensibility
- Custom functions in multiple languages (PL/pgSQL, Python, JavaScript)
- Extensions for additional functionality
- Custom operators and aggregates
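As a small illustration of extensibility, the sketch below enables a bundled contrib extension and defines a simple PL/pgSQL function; the extension pg_trgm and the function name full_name are illustrative choices, not part of this guide's schema.
-- Enable a bundled extension (pg_trgm adds trigram similarity search)
CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- Define a simple custom function in PL/pgSQL
CREATE OR REPLACE FUNCTION full_name(first_name TEXT, last_name TEXT)
RETURNS TEXT AS $$
BEGIN
    RETURN first_name || ' ' || last_name;
END;
$$ LANGUAGE plpgsql;
-- Use it like any built-in function
SELECT full_name('Ada', 'Lovelace');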
4. Concurrency Control
- MVCC (Multi-Version Concurrency Control): Readers don’t block writers
- Multiple isolation levels
- Deadlock detection and resolution
5. Scalability Options
- Read replicas for scaling reads
- Partitioning for large tables
- Connection pooling
- Horizontal scaling with tools like Citus
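Partitioning is worth a quick sketch since it is not covered elsewhere in this guide. The following uses declarative range partitioning (available since PostgreSQL 10) on a hypothetical events table:
-- Parent table, partitioned by a timestamp range
CREATE TABLE events (
    id BIGSERIAL,
    event_time TIMESTAMP NOT NULL,
    payload JSONB,
    PRIMARY KEY (id, event_time)  -- the partition key must be part of the primary key
) PARTITION BY RANGE (event_time);
-- One partition per year; queries filtering on event_time only scan matching partitions
CREATE TABLE events_2024 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
CREATE TABLE events_2025 PARTITION OF events
    FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');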
SQL Fundamentals
SQL (Structured Query Language) is the standard language for interacting with relational databases. It’s divided into several categories:
SQL Command Categories
1. DDL (Data Definition Language)
Commands that define database structure:
- CREATE: Create database objects (tables, indexes, etc.)
- ALTER: Modify existing database objects
- DROP: Delete database objects
- TRUNCATE: Remove all data from a table
2. DML (Data Manipulation Language)
Commands that manipulate data:
- SELECT: Retrieve data from tables
- INSERT: Add new data to tables
- UPDATE: Modify existing data
- DELETE: Remove data from tables
3. DCL (Data Control Language)
Commands that control access to data:
- GRANT: Give privileges to users
- REVOKE: Remove privileges from users
4. TCL (Transaction Control Language)
Commands that manage transactions:
- BEGIN: Start a transaction
- COMMIT: Save transaction changes
- ROLLBACK: Undo transaction changes
- SAVEPOINT: Create a rollback point within a transaction
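To see the categories side by side, here is a minimal sketch using a throwaway notes table and an assumed readonly_user role:
-- DDL: define a structure
CREATE TABLE notes (id SERIAL PRIMARY KEY, body TEXT);
-- DML inside a transaction managed by TCL commands
BEGIN;
INSERT INTO notes (body) VALUES ('first note');
SAVEPOINT after_insert;
UPDATE notes SET body = 'edited note' WHERE id = 1;
ROLLBACK TO SAVEPOINT after_insert;  -- undo only the UPDATE
COMMIT;                              -- keep the INSERT
-- DCL: control access (assumes a role named readonly_user exists)
GRANT SELECT ON notes TO readonly_user;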
Creating and Managing Tables
Database and Schema Creation
First, let’s understand the hierarchy: Database → Schema → Table
-- Create a new database
CREATE DATABASE company_db;
-- Connect to the database
\c company_db
-- Create a schema (namespace for organizing objects)
CREATE SCHEMA hr;
CREATE SCHEMA sales;
-- Set search path to include schemas
SET search_path TO hr, sales, public;
Why use schemas?
- Organization: Group related tables together
- Security: Different permissions for different schemas
- Naming: Same table name can exist in different schemas
- Multi-tenancy: Separate data for different clients
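For example, the same table name can exist in both schemas created above and be addressed with schema-qualified names (a minimal sketch):
-- The same table name in two different schemas
CREATE TABLE hr.contacts (id SERIAL PRIMARY KEY, name VARCHAR(100));
CREATE TABLE sales.contacts (id SERIAL PRIMARY KEY, name VARCHAR(100));
-- Schema-qualified names remove any ambiguity
INSERT INTO hr.contacts (name) VALUES ('Internal HR contact');
SELECT * FROM sales.contacts;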
Creating Tables
The CREATE TABLE statement defines a new table structure:
-- Basic table creation
CREATE TABLE employees (
id SERIAL PRIMARY KEY,
first_name VARCHAR(50) NOT NULL,
last_name VARCHAR(50) NOT NULL,
email VARCHAR(100) UNIQUE NOT NULL,
hire_date DATE DEFAULT CURRENT_DATE,
salary DECIMAL(10,2),
department_id INTEGER,
is_active BOOLEAN DEFAULT true
);
Let’s break this down:
- SERIAL: Auto-incrementing integer (equivalent to INTEGER GENERATED BY DEFAULT AS IDENTITY)
- PRIMARY KEY: Uniquely identifies each row
- VARCHAR(50): Variable-length string up to 50 characters
- NOT NULL: Column cannot be empty
- UNIQUE: No duplicate values allowed
- DEFAULT: Value used when none is provided
- DECIMAL(10,2): Number with 10 total digits, 2 after the decimal point
Table Constraints
Constraints enforce rules on data to maintain integrity:
CREATE TABLE products (
id SERIAL PRIMARY KEY,
name VARCHAR(200) NOT NULL,
price DECIMAL(10,2) NOT NULL,
category_id INTEGER NOT NULL,
stock_quantity INTEGER DEFAULT 0,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
-- Table-level constraints
CONSTRAINT chk_positive_price CHECK (price > 0),
CONSTRAINT chk_non_negative_stock CHECK (stock_quantity >= 0),
CONSTRAINT fk_category FOREIGN KEY (category_id) REFERENCES categories(id)
);
Types of constraints:
- PRIMARY KEY: Unique identifier for each row
- FOREIGN KEY: Links to another table’s primary key
- UNIQUE: Ensures no duplicate values
- CHECK: Custom validation rules
- NOT NULL: Prevents empty values
Modifying Tables with ALTER
The ALTER TABLE statement modifies existing table structure:
-- Add a new column
ALTER TABLE employees
ADD COLUMN phone VARCHAR(20);
-- Add column with constraint
ALTER TABLE employees
ADD COLUMN manager_id INTEGER REFERENCES employees(id);
-- Modify existing column
ALTER TABLE employees
ALTER COLUMN salary SET NOT NULL;
-- Change column data type
ALTER TABLE employees
ALTER COLUMN phone TYPE VARCHAR(25);
-- Rename column
ALTER TABLE employees
RENAME COLUMN phone TO phone_number;
-- Add constraint to existing table
ALTER TABLE employees
ADD CONSTRAINT chk_positive_salary CHECK (salary > 0);
-- Drop constraint
ALTER TABLE employees
DROP CONSTRAINT chk_positive_salary;
-- Drop column
ALTER TABLE employees
DROP COLUMN phone_number;
Dropping Tables
-- Drop table (fails if referenced by foreign keys)
DROP TABLE employees;
-- Drop with cascade (removes dependent objects)
DROP TABLE employees CASCADE;
-- Drop only if exists (no error if table doesn't exist)
DROP TABLE IF EXISTS employees;
Data Types and Constraints
Numeric Data Types
PostgreSQL offers various numeric types for different needs:
CREATE TABLE numeric_examples (
-- Integer types
small_int SMALLINT, -- -32,768 to 32,767
regular_int INTEGER, -- -2,147,483,648 to 2,147,483,647
big_int BIGINT, -- Very large integers
-- Auto-incrementing integers
auto_id SERIAL, -- 1 to 2,147,483,647
big_auto_id BIGSERIAL, -- 1 to 9,223,372,036,854,775,807
-- Decimal types
exact_decimal DECIMAL(10,2), -- Exact precision
money_amount NUMERIC(15,2), -- Same as DECIMAL
-- Floating point
real_number REAL, -- 6 decimal digits precision
double_number DOUBLE PRECISION -- 15 decimal digits precision
);
When to use each:
- SERIAL/BIGSERIAL: Auto-incrementing IDs
- INTEGER: Most common for whole numbers
- DECIMAL/NUMERIC: Financial calculations (exact precision)
- REAL/DOUBLE PRECISION: Scientific calculations (approximate)
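The exact vs. approximate distinction is easy to see with a quick check. (The floating-point display below assumes the PostgreSQL 12+ default float output; older versions may round it away in the display.)
-- Approximate types accumulate tiny binary rounding errors; NUMERIC does not
SELECT
    0.1::DOUBLE PRECISION + 0.2::DOUBLE PRECISION AS approximate_sum,  -- 0.30000000000000004
    0.1::NUMERIC + 0.2::NUMERIC AS exact_sum;                          -- 0.3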
Character Data Types
CREATE TABLE text_examples (
-- Fixed length (padded with spaces)
product_code CHAR(5), -- Exactly 5 characters
-- Variable length with limit
name VARCHAR(100), -- Up to 100 characters
-- Variable length without limit
description TEXT, -- Unlimited length
-- Single character
grade CHAR(1) -- A, B, C, D, F
);
Date and Time Types
CREATE TABLE datetime_examples (
-- Date only
birth_date DATE, -- 'YYYY-MM-DD'
-- Time only
meeting_time TIME, -- 'HH:MM:SS'
meeting_time_tz TIME WITH TIME ZONE, -- Time with timezone
-- Date and time
created_at TIMESTAMP, -- 'YYYY-MM-DD HH:MM:SS'
updated_at TIMESTAMP WITH TIME ZONE, -- With timezone
-- Intervals
session_duration INTERVAL -- '1 day 2 hours 30 minutes'
);
-- Working with dates and times
INSERT INTO datetime_examples (
birth_date,
meeting_time,
created_at,
session_duration
) VALUES (
'1990-05-15',
'14:30:00',
CURRENT_TIMESTAMP,
'2 hours 30 minutes'
);
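A few common date/time operations on the table above, as a quick sketch:
-- Extract parts, compute ages, and do interval arithmetic
SELECT
    birth_date,
    EXTRACT(YEAR FROM birth_date) AS birth_year,
    AGE(CURRENT_DATE, birth_date) AS age,
    created_at + INTERVAL '1 day' AS one_day_later,
    created_at - INTERVAL '2 hours 30 minutes' AS earlier_timestamp
FROM datetime_examples;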
Boolean and Other Types
CREATE TABLE other_types (
-- Boolean
is_active BOOLEAN, -- TRUE, FALSE, NULL
-- JSON
metadata JSON, -- JSON data (stored as text)
settings JSONB, -- Binary JSON (more efficient)
-- Arrays
tags TEXT[], -- Array of text values
scores INTEGER[], -- Array of integers
-- UUID
unique_id UUID DEFAULT gen_random_uuid(),
-- Network addresses
ip_address INET, -- IP address
mac_address MACADDR -- MAC address
);
-- Working with arrays
INSERT INTO other_types (tags, scores)
VALUES (
ARRAY['postgres', 'database', 'sql'],
ARRAY[95, 87, 92]
);
-- Working with JSON
INSERT INTO other_types (settings)
VALUES ('{"theme": "dark", "notifications": true, "language": "en"}');
Querying Data - SELECT Statement
The SELECT statement is the most important SQL command for retrieving data. Let’s explore it comprehensively:
Basic SELECT Syntax
-- Basic syntax
SELECT column1, column2, ...
FROM table_name
WHERE condition
ORDER BY column
LIMIT number;
Simple SELECT Examples
-- Select all columns
SELECT * FROM employees;
-- Select specific columns
SELECT first_name, last_name, email FROM employees;
-- Select with aliases (cleaner column names in output)
SELECT
first_name AS "First Name",
last_name AS "Last Name",
salary AS "Annual Salary"
FROM employees;
-- Select with calculations
SELECT
first_name,
last_name,
salary,
salary * 12 AS annual_salary,
salary / 12 AS monthly_salary
FROM employees;
WHERE Clause - Filtering Data
The WHERE clause filters rows based on conditions:
-- Comparison operators
SELECT * FROM employees WHERE salary > 50000;
SELECT * FROM employees WHERE department_id = 1;
SELECT * FROM employees WHERE hire_date >= '2020-01-01';
-- String comparisons
SELECT * FROM employees WHERE first_name = 'John';
SELECT * FROM employees WHERE last_name LIKE 'Smi%'; -- Starts with 'Smi'
SELECT * FROM employees WHERE email LIKE '%@gmail.com'; -- Ends with '@gmail.com'
SELECT * FROM employees WHERE first_name ILIKE 'JOHN'; -- Case-insensitive
-- Logical operators
SELECT * FROM employees
WHERE salary > 50000 AND department_id = 1;
SELECT * FROM employees
WHERE department_id = 1 OR department_id = 2;
SELECT * FROM employees
WHERE NOT (salary < 30000);
-- Range and list operations
SELECT * FROM employees WHERE salary BETWEEN 40000 AND 80000;
SELECT * FROM employees WHERE department_id IN (1, 2, 3);
SELECT * FROM employees WHERE email IS NOT NULL;
SELECT * FROM employees WHERE manager_id IS NULL;
Pattern Matching with LIKE:
- %: Matches any sequence of characters
- _: Matches any single character
- ILIKE: Case-insensitive version of LIKE
ORDER BY - Sorting Results
-- Single column sorting
SELECT * FROM employees ORDER BY salary; -- Ascending (default)
SELECT * FROM employees ORDER BY salary DESC; -- Descending
-- Multiple column sorting
SELECT * FROM employees
ORDER BY department_id ASC, salary DESC;
-- Sorting by expressions
SELECT first_name, last_name, salary
FROM employees
ORDER BY salary * 12 DESC; -- Order by annual salary
-- Handling NULL values
SELECT * FROM employees ORDER BY manager_id NULLS FIRST;
SELECT * FROM employees ORDER BY manager_id NULLS LAST;
LIMIT and OFFSET - Pagination
-- Get first 10 employees
SELECT * FROM employees LIMIT 10;
-- Get employees 11-20 (pagination)
SELECT * FROM employees LIMIT 10 OFFSET 10;
-- Top 5 highest paid employees
SELECT first_name, last_name, salary
FROM employees
ORDER BY salary DESC
LIMIT 5;
DISTINCT - Removing Duplicates
-- Get unique department IDs
SELECT DISTINCT department_id FROM employees;
-- Get unique combinations
SELECT DISTINCT department_id, job_title FROM employees;
-- Count unique values
SELECT COUNT(DISTINCT department_id) AS unique_departments FROM employees;
Aggregate Functions
Aggregate functions perform calculations on groups of rows:
-- Basic aggregates
SELECT COUNT(*) AS total_employees FROM employees;
SELECT COUNT(manager_id) AS employees_with_managers FROM employees; -- Excludes NULLs
SELECT SUM(salary) AS total_payroll FROM employees;
SELECT AVG(salary) AS average_salary FROM employees;
SELECT MIN(salary) AS lowest_salary FROM employees;
SELECT MAX(salary) AS highest_salary FROM employees;
-- String aggregation
SELECT STRING_AGG(first_name, ', ') AS all_names FROM employees;
GROUP BY - Grouping Data
GROUP BY groups rows with the same values and allows aggregate functions to work on each group:
-- Count employees by department
SELECT
department_id,
COUNT(*) AS employee_count
FROM employees
GROUP BY department_id;
-- Multiple grouping columns
SELECT
department_id,
job_title,
COUNT(*) AS count,
AVG(salary) AS avg_salary
FROM employees
GROUP BY department_id, job_title;
-- Group by expressions
SELECT
EXTRACT(YEAR FROM hire_date) AS hire_year,
COUNT(*) AS employees_hired
FROM employees
GROUP BY EXTRACT(YEAR FROM hire_date)
ORDER BY hire_year;
HAVING - Filtering Groups
HAVING filters groups after GROUP BY (whereas WHERE filters rows before grouping):
-- Departments with more than 5 employees
SELECT
department_id,
COUNT(*) AS employee_count
FROM employees
GROUP BY department_id
HAVING COUNT(*) > 5;
-- Departments with average salary > 60000
SELECT
department_id,
COUNT(*) AS employee_count,
AVG(salary) AS avg_salary
FROM employees
GROUP BY department_id
HAVING AVG(salary) > 60000;
-- Combined WHERE and HAVING
SELECT
department_id,
COUNT(*) AS employee_count
FROM employees
WHERE hire_date >= '2020-01-01' -- Filter rows first
GROUP BY department_id
HAVING COUNT(*) > 2; -- Then filter groups
Subqueries
Subqueries are queries within queries:
-- Scalar subquery (returns single value)
SELECT first_name, last_name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);
-- IN subquery (returns multiple values)
SELECT first_name, last_name
FROM employees
WHERE department_id IN (
SELECT id FROM departments WHERE name IN ('Engineering', 'Sales')
);
-- EXISTS subquery (checks for existence)
SELECT first_name, last_name
FROM employees e
WHERE EXISTS (
SELECT 1 FROM orders o WHERE o.employee_id = e.id
);
-- Correlated subquery (references outer query)
SELECT
first_name,
last_name,
salary,
(SELECT AVG(salary)
FROM employees e2
WHERE e2.department_id = e1.department_id) AS dept_avg_salary
FROM employees e1;
Joining Tables
Joins combine data from multiple tables based on relationships. This is one of the most powerful features of relational databases.
Sample Tables for Join Examples
Let’s create some sample tables to demonstrate joins:
-- Departments table
CREATE TABLE departments (
id SERIAL PRIMARY KEY,
name VARCHAR(100) NOT NULL,
manager_id INTEGER
);
-- Employees table
CREATE TABLE employees (
id SERIAL PRIMARY KEY,
first_name VARCHAR(50) NOT NULL,
last_name VARCHAR(50) NOT NULL,
email VARCHAR(100) UNIQUE NOT NULL,
department_id INTEGER,
manager_id INTEGER, -- used by the self-join examples later in this section
salary DECIMAL(10,2)
);
-- Projects table
CREATE TABLE projects (
id SERIAL PRIMARY KEY,
name VARCHAR(200) NOT NULL,
department_id INTEGER,
budget DECIMAL(12,2)
);
-- Employee_projects junction table (many-to-many)
CREATE TABLE employee_projects (
employee_id INTEGER,
project_id INTEGER,
role VARCHAR(100),
hours_allocated INTEGER,
PRIMARY KEY (employee_id, project_id)
);
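If you want the join examples below to return rows, you could load a few sample rows first; the values here are purely illustrative:
INSERT INTO departments (name) VALUES ('Engineering'), ('Sales'), ('Marketing');
INSERT INTO employees (first_name, last_name, email, department_id, salary) VALUES
    ('John', 'Doe', 'john.doe@company.com', 1, 75000),
    ('Jane', 'Smith', 'jane.smith@company.com', 2, 68000),
    ('Sam', 'Lee', 'sam.lee@company.com', NULL, 52000);  -- not yet assigned to a department
INSERT INTO projects (name, department_id, budget) VALUES
    ('Website Redesign', 1, 150000),
    ('Sales Portal', 2, 90000);
INSERT INTO employee_projects (employee_id, project_id, role, hours_allocated) VALUES
    (1, 1, 'Developer', 30),
    (2, 2, 'Analyst', 25);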
INNER JOIN
INNER JOIN returns only rows that have matching values in both tables:
-- Basic INNER JOIN
SELECT
e.first_name,
e.last_name,
d.name AS department_name
FROM employees e
INNER JOIN departments d ON e.department_id = d.id;
-- Multiple table INNER JOIN
SELECT
e.first_name,
e.last_name,
d.name AS department_name,
p.name AS project_name,
ep.role
FROM employees e
INNER JOIN departments d ON e.department_id = d.id
INNER JOIN employee_projects ep ON e.id = ep.employee_id
INNER JOIN projects p ON ep.project_id = p.id;
When to use INNER JOIN:
- When you only want rows that exist in both tables
- Most common type of join
- Example: Get employees and their departments (only for employees assigned to departments)
LEFT JOIN (LEFT OUTER JOIN)
LEFT JOIN returns all rows from the left table and matching rows from the right table. When no match is found, NULL values are returned for right table columns:
-- Get all employees and their departments (including employees without departments)
SELECT
e.first_name,
e.last_name,
d.name AS department_name
FROM employees e
LEFT JOIN departments d ON e.department_id = d.id;
-- Find employees without departments
SELECT
e.first_name,
e.last_name
FROM employees e
LEFT JOIN departments d ON e.department_id = d.id
WHERE d.id IS NULL;
When to use LEFT JOIN:
- When you want all records from the “main” table (left side)
- Useful for finding missing relationships
- Example: All employees, whether they have departments or not
RIGHT JOIN (RIGHT OUTER JOIN)
RIGHT JOIN returns all rows from the right table and matching rows from the left table:
-- Get all departments and their employees (including departments without employees)
SELECT
d.name AS department_name,
e.first_name,
e.last_name
FROM employees e
RIGHT JOIN departments d ON e.department_id = d.id;
-- Find departments without employees
SELECT
d.name AS department_name
FROM employees e
RIGHT JOIN departments d ON e.department_id = d.id
WHERE e.id IS NULL;
When to use RIGHT JOIN:
- Less common than LEFT JOIN
- Can usually be rewritten as LEFT JOIN by swapping table order
- Example: All departments, whether they have employees or not
FULL OUTER JOIN
FULL OUTER JOIN returns all rows when there’s a match in either table:
-- Get all employees and departments, showing mismatches
SELECT
e.first_name,
e.last_name,
d.name AS department_name
FROM employees e
FULL OUTER JOIN departments d ON e.department_id = d.id;
-- Find mismatched data (employees without departments OR departments without employees)
SELECT
e.first_name,
e.last_name,
d.name AS department_name
FROM employees e
FULL OUTER JOIN departments d ON e.department_id = d.id
WHERE e.id IS NULL OR d.id IS NULL;
CROSS JOIN
CROSS JOIN returns the Cartesian product of both tables (every row from first table combined with every row from second table):
-- Get all possible employee-project combinations
SELECT
e.first_name,
e.last_name,
p.name AS project_name
FROM employees e
CROSS JOIN projects p;
-- Practical example: Create a schedule template
SELECT
e.first_name,
e.last_name,
d.day_name
FROM employees e
CROSS JOIN (
VALUES ('Monday'), ('Tuesday'), ('Wednesday'),
('Thursday'), ('Friday')
) AS d(day_name);
When to use CROSS JOIN:
- Rarely used in practice
- Useful for generating combinations
- Be careful with large tables (can create huge result sets)
Self JOIN
A self join is when a table is joined with itself, typically used for hierarchical data:
-- Find employees and their managers
SELECT
e.first_name || ' ' || e.last_name AS employee_name,
m.first_name || ' ' || m.last_name AS manager_name
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id;
-- Find all employees who earn more than their manager
SELECT
e.first_name || ' ' || e.last_name AS employee_name,
e.salary AS employee_salary,
m.first_name || ' ' || m.last_name AS manager_name,
m.salary AS manager_salary
FROM employees e
INNER JOIN employees m ON e.manager_id = m.id
WHERE e.salary > m.salary;
Advanced Join Techniques
Using Multiple Join Conditions:
-- Join with multiple conditions
SELECT
e.first_name,
e.last_name,
p.name AS project_name
FROM employees e
INNER JOIN employee_projects ep ON e.id = ep.employee_id
INNER JOIN projects p ON ep.project_id = p.id
AND p.budget > 100000 -- Additional condition
AND ep.hours_allocated > 20;
Join with Aggregations:
-- Get department summary with employee count and average salary
SELECT
d.name AS department_name,
COUNT(e.id) AS employee_count,
COALESCE(AVG(e.salary), 0) AS avg_salary,
COALESCE(SUM(e.salary), 0) AS total_payroll
FROM departments d
LEFT JOIN employees e ON d.id = e.department_id
GROUP BY d.id, d.name
ORDER BY employee_count DESC;
Join Performance Tips:
- Use indexes on join columns
- Join on indexed columns (usually primary/foreign keys)
- Filter early with WHERE clauses
- Consider join order for complex queries
- Use EXPLAIN to analyze join performance
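As a quick sketch of the last tip, EXPLAIN (optionally with ANALYZE) shows which join strategy the planner chose and whether indexes are used:
-- Inspect the chosen plan: look for Index Scan vs Seq Scan and the join type
EXPLAIN ANALYZE
SELECT e.first_name, d.name AS department_name
FROM employees e
INNER JOIN departments d ON e.department_id = d.id
WHERE d.name = 'Engineering';
-- An index on the join column often helps on large tables
CREATE INDEX idx_employees_department_id ON employees(department_id);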
Data Modification
INSERT - Adding Data
The INSERT statement adds new rows to tables:
Basic INSERT
-- Insert single row with all columns
INSERT INTO employees (first_name, last_name, email, department_id, salary)
VALUES ('John', 'Doe', 'john.doe@company.com', 1, 75000);
-- Insert single row with selected columns (others get default values)
INSERT INTO employees (first_name, last_name, email)
VALUES ('Jane', 'Smith', 'jane.smith@company.com');
-- Insert multiple rows
INSERT INTO employees (first_name, last_name, email, department_id, salary)
VALUES
('Alice', 'Johnson', 'alice.johnson@company.com', 2, 80000),
('Bob', 'Wilson', 'bob.wilson@company.com', 1, 70000),
('Carol', 'Brown', 'carol.brown@company.com', 3, 85000);
INSERT with SELECT (Copying Data)
-- Insert data from another table
INSERT INTO employees_backup
SELECT * FROM employees WHERE department_id = 1;
-- Insert calculated data
INSERT INTO department_stats (department_id, employee_count, avg_salary, total_payroll)
SELECT
department_id,
COUNT(*),
AVG(salary),
SUM(salary)
FROM employees
GROUP BY department_id;
INSERT with RETURNING
-- Get the inserted data back (useful for auto-generated IDs)
INSERT INTO employees (first_name, last_name, email)
VALUES ('David', 'Miller', 'david.miller@company.com')
RETURNING id, first_name, last_name, hire_date;
-- Insert multiple and return specific columns
INSERT INTO products (name, price, category_id)
VALUES
('Laptop', 999.99, 1),
('Mouse', 29.99, 1),
('Keyboard', 79.99, 1)
RETURNING id, name, price;
INSERT with ON CONFLICT (UPSERT)
-- Insert or update if conflict occurs
INSERT INTO employees (email, first_name, last_name, salary)
VALUES ('john.doe@company.com', 'John', 'Doe', 80000)
ON CONFLICT (email)
DO UPDATE SET
first_name = EXCLUDED.first_name,
last_name = EXCLUDED.last_name,
salary = EXCLUDED.salary,
updated_at = CURRENT_TIMESTAMP;
-- Insert or do nothing if conflict
INSERT INTO employees (email, first_name, last_name)
VALUES ('existing@company.com', 'Test', 'User')
ON CONFLICT (email) DO NOTHING;
UPDATE - Modifying Data
The UPDATE statement modifies existing rows:
Basic UPDATE
-- Update single column
UPDATE employees
SET salary = 85000
WHERE id = 1;
-- Update multiple columns
UPDATE employees
SET
salary = salary * 1.1, -- 10% raise
updated_at = CURRENT_TIMESTAMP
WHERE department_id = 1;
-- Update with calculations
UPDATE employees
SET
annual_bonus = salary * 0.15,
performance_rating = CASE
WHEN salary > 80000 THEN 'Excellent'
WHEN salary > 60000 THEN 'Good'
ELSE 'Average'
END;
UPDATE with JOIN (Using FROM clause)
-- Update employees based on department information
UPDATE employees
SET salary = salary * 1.05
FROM departments
WHERE employees.department_id = departments.id
AND departments.name = 'Engineering';
-- Update with aggregated data
UPDATE departments
SET employee_count = emp_counts.count
FROM (
SELECT department_id, COUNT(*) as count
FROM employees
GROUP BY department_id
) AS emp_counts
WHERE departments.id = emp_counts.department_id;
UPDATE with Subqueries
-- Update based on subquery
UPDATE employees
SET salary = salary * 1.1
WHERE id IN (
SELECT employee_id
FROM employee_projects ep
JOIN projects p ON ep.project_id = p.id
WHERE p.name = 'Critical Project'
);
-- Update with correlated subquery
UPDATE employees
SET performance_rank = (
SELECT COUNT(*) + 1
FROM employees e2
WHERE e2.department_id = employees.department_id
AND e2.salary > employees.salary
);
UPDATE with RETURNING
-- See what was updated
UPDATE employees
SET salary = salary * 1.1
WHERE department_id = 1
RETURNING id, first_name, last_name, salary;
DELETE - Removing Data
The DELETE statement removes rows from tables:
Basic DELETE
-- Delete specific row
DELETE FROM employees WHERE id = 5;
-- Delete multiple rows
DELETE FROM employees WHERE department_id = 3;
-- Delete with conditions
DELETE FROM employees
WHERE hire_date < '2020-01-01'
AND salary < 40000;
DELETE with JOIN (Using USING clause)
-- Delete employees from inactive departments
DELETE FROM employees
USING departments
WHERE employees.department_id = departments.id
AND departments.is_active = false;
-- Delete duplicate records (keep one)
DELETE FROM employees e1
USING employees e2
WHERE e1.id > e2.id
AND e1.email = e2.email;
DELETE with Subqueries
-- Delete employees not assigned to any project
DELETE FROM employees
WHERE id NOT IN (
SELECT DISTINCT employee_id
FROM employee_projects
WHERE employee_id IS NOT NULL
);
-- Delete old records based on related data
DELETE FROM orders
WHERE customer_id IN (
SELECT id FROM customers
WHERE last_login < CURRENT_DATE - INTERVAL '2 years'
);
DELETE with RETURNING
-- See what was deleted
DELETE FROM employees
WHERE salary < 30000
RETURNING id, first_name, last_name, salary;
TRUNCATE - Fast Delete All
-- Remove all rows quickly (faster than DELETE)
TRUNCATE TABLE temp_data;
-- Truncate with restart identity (reset auto-increment)
TRUNCATE TABLE employees RESTART IDENTITY;
-- Truncate multiple tables
TRUNCATE TABLE orders, order_items RESTART IDENTITY;
-- Truncate with cascade (truncate related tables)
TRUNCATE TABLE departments CASCADE;
TRUNCATE vs DELETE:
- TRUNCATE: Faster, can't use a WHERE clause, resets auto-increment only when RESTART IDENTITY is specified
- DELETE: Slower, can use a WHERE clause, doesn't reset auto-increment
Data Modification Best Practices
- Always use WHERE clause with UPDATE/DELETE (unless you really want to affect all rows)
- Test with SELECT first - convert your UPDATE/DELETE to SELECT to see what will be affected
- Use transactions for important modifications
- Use RETURNING to verify changes
- Backup before large modifications
- Batch large DELETE operations to keep transactions and locks short (PostgreSQL's DELETE has no LIMIT clause, so batch via a key subquery; see the sketch after the example below)
-- Good practice: Test before modifying
-- First, see what would be affected:
SELECT * FROM employees WHERE department_id = 3;
-- Then perform the actual update:
BEGIN;
UPDATE employees SET salary = salary * 1.1 WHERE department_id = 3;
-- Review the changes before committing
SELECT * FROM employees WHERE department_id = 3;
COMMIT;
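Because PostgreSQL's DELETE has no LIMIT clause, large deletes are usually batched through a subquery on the key. A sketch (run repeatedly until it affects 0 rows):
-- Delete at most 10,000 matching rows per run to keep transactions and locks short
DELETE FROM employees
WHERE id IN (
    SELECT id
    FROM employees
    WHERE department_id = 3
    ORDER BY id
    LIMIT 10000
);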
Data Modeling & Design
Data modeling is the process of creating a conceptual representation of data and its relationships. Good data modeling is crucial for database performance, maintainability, and scalability.
Entity-Relationship (ER) Modeling
What is an Entity? An entity represents a real-world object or concept that has independent existence. Examples:
- Customer: A person who buys products
- Product: An item for sale
- Order: A purchase transaction
- Employee: A person who works for the company
What are Attributes? Attributes are properties or characteristics of entities:
- Customer: name, email, phone, address
- Product: name, price, description, category
- Order: date, total_amount, status
What are Relationships? Relationships describe how entities are connected:
- Customer places Order
- Order contains Products
- Employee works in Department
Relationship Types and Implementation
One-to-One (1:1) Relationship
Each record in Table A relates to exactly one record in Table B, and vice versa.
-- Example: Person and Passport
CREATE TABLE persons (
id SERIAL PRIMARY KEY,
first_name VARCHAR(50) NOT NULL,
last_name VARCHAR(50) NOT NULL,
birth_date DATE
);
CREATE TABLE passports (
id SERIAL PRIMARY KEY,
passport_number VARCHAR(20) UNIQUE NOT NULL,
issue_date DATE NOT NULL,
expiry_date DATE NOT NULL,
person_id INTEGER UNIQUE NOT NULL REFERENCES persons(id)
);
Note: UNIQUE constraint on foreign key enforces 1:1 relationship
One-to-Many (1:N) Relationship
One record in Table A can relate to many records in Table B, but each record in Table B relates to only one record in Table A.
-- Example: Department and Employees
CREATE TABLE departments (
id SERIAL PRIMARY KEY,
name VARCHAR(100) NOT NULL,
budget DECIMAL(12,2)
);
CREATE TABLE employees (
id SERIAL PRIMARY KEY,
first_name VARCHAR(50) NOT NULL,
last_name VARCHAR(50) NOT NULL,
email VARCHAR(100) UNIQUE NOT NULL,
department_id INTEGER REFERENCES departments(id) -- Foreign key (many side)
);
Note: Foreign key goes on the “many” side
Many-to-Many (M:N) Relationship
Multiple records in Table A can relate to multiple records in Table B.
-- Example: Students and Courses
CREATE TABLE students (
id SERIAL PRIMARY KEY,
first_name VARCHAR(50) NOT NULL,
last_name VARCHAR(50) NOT NULL,
email VARCHAR(100) UNIQUE NOT NULL
);
CREATE TABLE courses (
id SERIAL PRIMARY KEY,
name VARCHAR(100) NOT NULL,
credits INTEGER NOT NULL,
description TEXT
);
-- Junction table (also called bridge table or linking table)
CREATE TABLE student_courses (
student_id INTEGER REFERENCES students(id),
course_id INTEGER REFERENCES courses(id),
enrollment_date DATE DEFAULT CURRENT_DATE,
grade CHAR(2), -- A+, A, B+, B, etc.
PRIMARY KEY (student_id, course_id) -- Composite primary key
);
Note: Many-to-many relationships require a junction table
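Queries across a many-to-many relationship always go through the junction table; a quick sketch:
-- List each student's courses and grades
SELECT
    s.first_name,
    s.last_name,
    c.name AS course_name,
    sc.grade
FROM students s
JOIN student_courses sc ON s.id = sc.student_id
JOIN courses c ON sc.course_id = c.id
ORDER BY s.last_name, c.name;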
Design Process Step-by-Step
Step 1: Identify Entities
Look for nouns in your business requirements:
- “Customers place orders for products”
- Entities: Customer, Order, Product
Step 2: Identify Attributes
What information do you need to store about each entity?
-- Customer attributes
CREATE TABLE customers (
id SERIAL PRIMARY KEY,
first_name VARCHAR(50) NOT NULL,
last_name VARCHAR(50) NOT NULL,
email VARCHAR(100) UNIQUE NOT NULL,
phone VARCHAR(20),
date_of_birth DATE,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
Step 3: Determine Relationships
- Customer places Orders (1:N)
- Order contains Products (M:N)
- Product belongs to Category (N:1)
Step 4: Choose Primary Keys
Every table needs a unique identifier:
-- Natural key (meaningful business data)
CREATE TABLE countries (
country_code CHAR(2) PRIMARY KEY, -- US, CA, MX
name VARCHAR(100) NOT NULL
);
-- Surrogate key (artificial identifier)
CREATE TABLE customers (
id SERIAL PRIMARY KEY, -- Auto-generated number
email VARCHAR(100) UNIQUE NOT NULL -- Natural candidate key
);
Step 5: Implement Foreign Keys
Link related tables:
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
customer_id INTEGER NOT NULL REFERENCES customers(id),
order_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
status VARCHAR(20) DEFAULT 'pending',
total_amount DECIMAL(10,2)
);
Complete E-Commerce Example
-- Categories
CREATE TABLE categories (
id SERIAL PRIMARY KEY,
name VARCHAR(100) NOT NULL,
description TEXT,
parent_category_id INTEGER REFERENCES categories(id), -- Self-referencing for hierarchy
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Products
CREATE TABLE products (
id SERIAL PRIMARY KEY,
name VARCHAR(200) NOT NULL,
description TEXT,
price DECIMAL(10,2) NOT NULL,
category_id INTEGER NOT NULL REFERENCES categories(id),
stock_quantity INTEGER DEFAULT 0,
sku VARCHAR(50) UNIQUE NOT NULL, -- Stock Keeping Unit
is_active BOOLEAN DEFAULT true,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Customers
CREATE TABLE customers (
id SERIAL PRIMARY KEY,
first_name VARCHAR(50) NOT NULL,
last_name VARCHAR(50) NOT NULL,
email VARCHAR(100) UNIQUE NOT NULL,
phone VARCHAR(20),
date_of_birth DATE,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Customer Addresses (1:N relationship)
CREATE TABLE customer_addresses (
id SERIAL PRIMARY KEY,
customer_id INTEGER NOT NULL REFERENCES customers(id),
address_type VARCHAR(20) NOT NULL, -- 'billing', 'shipping'
street_address VARCHAR(200) NOT NULL,
city VARCHAR(100) NOT NULL,
state VARCHAR(100),
postal_code VARCHAR(20),
country VARCHAR(100) NOT NULL,
is_default BOOLEAN DEFAULT false
);
-- Orders
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
customer_id INTEGER NOT NULL REFERENCES customers(id),
order_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
status VARCHAR(20) DEFAULT 'pending', -- pending, processing, shipped, delivered, cancelled
total_amount DECIMAL(10,2) NOT NULL,
shipping_address_id INTEGER REFERENCES customer_addresses(id),
billing_address_id INTEGER REFERENCES customer_addresses(id)
);
-- Order Items (M:N between Orders and Products)
CREATE TABLE order_items (
id SERIAL PRIMARY KEY,
order_id INTEGER NOT NULL REFERENCES orders(id),
product_id INTEGER NOT NULL REFERENCES products(id),
quantity INTEGER NOT NULL CHECK (quantity > 0),
unit_price DECIMAL(10,2) NOT NULL, -- Price at time of order
UNIQUE(order_id, product_id) -- Prevent duplicate products in same order
);
-- Product Reviews (customers can review products)
CREATE TABLE product_reviews (
id SERIAL PRIMARY KEY,
product_id INTEGER NOT NULL REFERENCES products(id),
customer_id INTEGER NOT NULL REFERENCES customers(id),
rating INTEGER NOT NULL CHECK (rating >= 1 AND rating <= 5),
review_text TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
UNIQUE(product_id, customer_id) -- One review per customer per product
);
Design Best Practices
1. Use Meaningful Names
- Table names: plural nouns (customers, orders, products)
- Column names: singular, descriptive (first_name, not fname)
- Be consistent with naming conventions
2. Choose Appropriate Data Types
- Use the smallest data type that fits your needs
- Use DECIMAL for money (not FLOAT)
- Use appropriate VARCHAR lengths
3. Define Constraints
- NOT NULL for required fields
- UNIQUE for natural keys
- CHECK constraints for business rules
- Foreign keys for referential integrity
4. Consider Future Requirements
- Design for extensibility
- Avoid over-normalization if it hurts performance
- Plan for audit trails and soft deletes
-- Adding audit fields
ALTER TABLE products ADD COLUMN created_by INTEGER REFERENCES users(id);
ALTER TABLE products ADD COLUMN updated_by INTEGER REFERENCES users(id);
-- Soft delete pattern
ALTER TABLE products ADD COLUMN deleted_at TIMESTAMP;
-- Instead of DELETE, use: UPDATE products SET deleted_at = CURRENT_TIMESTAMP WHERE id = ?
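A view is often paired with the soft-delete pattern so application queries never see deleted rows (a minimal sketch):
-- Application code queries active_products instead of products
CREATE VIEW active_products AS
SELECT * FROM products
WHERE deleted_at IS NULL;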
Normalization
Normalization is the process of organizing data in a database to reduce redundancy and improve data integrity. It involves decomposing tables into smaller, well-structured tables and defining relationships between them.
Why Normalize?
Problems with Unnormalized Data:
-- Bad: Unnormalized table with redundant data
CREATE TABLE orders_bad (
order_id INTEGER,
customer_name VARCHAR(100),
customer_email VARCHAR(100),
customer_phone VARCHAR(20),
product_name VARCHAR(200),
product_price DECIMAL(10,2),
quantity INTEGER,
order_date DATE
);
-- Sample data showing problems:
INSERT INTO orders_bad VALUES
(1, 'John Doe', 'john@email.com', '123-456-7890', 'Laptop', 999.99, 1, '2024-01-15'),
(1, 'John Doe', 'john@email.com', '123-456-7890', 'Mouse', 29.99, 2, '2024-01-15'),
(2, 'John Doe', 'john@email.com', '123-456-7890', 'Keyboard', 79.99, 1, '2024-01-20');
Problems:
- Data Redundancy: Customer information repeated for each order item
- Update Anomalies: If John changes his email, we need to update multiple rows
- Insert Anomalies: Can’t add a customer without an order
- Delete Anomalies: Deleting an order might lose customer information
- Storage Waste: Redundant data takes extra space
First Normal Form (1NF)
Rules for 1NF:
- Each column contains atomic (indivisible) values
- Each column contains values of the same data type
- Each row is unique
- No repeating groups or arrays
Before 1NF (Violates atomicity):
CREATE TABLE customers_bad (
id INTEGER,
name VARCHAR(100),
phones VARCHAR(200), -- "123-456-7890, 098-765-4321" (multiple values)
skills VARCHAR(500) -- "SQL, Python, JavaScript" (multiple values)
);
After 1NF (Atomic values):
-- Separate table for phone numbers
CREATE TABLE customers (
id SERIAL PRIMARY KEY,
first_name VARCHAR(50) NOT NULL,
last_name VARCHAR(50) NOT NULL
);
CREATE TABLE customer_phones (
id SERIAL PRIMARY KEY,
customer_id INTEGER REFERENCES customers(id),
phone_type VARCHAR(20), -- 'home', 'work', 'mobile'
phone_number VARCHAR(20)
);
-- Separate table for skills
CREATE TABLE skills (
id SERIAL PRIMARY KEY,
name VARCHAR(100) UNIQUE NOT NULL
);
CREATE TABLE customer_skills (
customer_id INTEGER REFERENCES customers(id),
skill_id INTEGER REFERENCES skills(id),
proficiency_level VARCHAR(20), -- 'beginner', 'intermediate', 'advanced'
PRIMARY KEY (customer_id, skill_id)
);
Second Normal Form (2NF)
Rules for 2NF:
- Must be in 1NF
- No partial dependencies (non-key attributes must depend on the entire primary key)
This mainly applies to tables with composite primary keys.
Before 2NF (Has partial dependency):
CREATE TABLE order_items_bad (
order_id INTEGER,
product_id INTEGER,
product_name VARCHAR(200), -- Depends only on product_id, not on (order_id, product_id)
product_price DECIMAL(10,2), -- Depends only on product_id
quantity INTEGER, -- Depends on both order_id and product_id
PRIMARY KEY (order_id, product_id)
);
Problems:
- Product name and price are repeated for each order
- If product price changes, we need to update multiple order_items
- We can’t store product information without an order
After 2NF (Remove partial dependencies):
-- Products table (product info depends only on product_id)
CREATE TABLE products (
id SERIAL PRIMARY KEY,
name VARCHAR(200) NOT NULL,
price DECIMAL(10,2) NOT NULL
);
-- Order items table (only order-specific info)
CREATE TABLE order_items (
order_id INTEGER,
product_id INTEGER REFERENCES products(id),
quantity INTEGER NOT NULL,
unit_price DECIMAL(10,2) NOT NULL, -- Price at time of order (may differ from current price)
PRIMARY KEY (order_id, product_id)
);
Third Normal Form (3NF)
Rules for 3NF:
- Must be in 2NF
- No transitive dependencies (non-key attributes must not depend on other non-key attributes)
Before 3NF (Has transitive dependency):
CREATE TABLE employees_bad (
id SERIAL PRIMARY KEY,
first_name VARCHAR(50),
last_name VARCHAR(50),
department_id INTEGER,
department_name VARCHAR(100), -- Depends on department_id, not employee id
department_location VARCHAR(100) -- Depends on department_id, not employee id
);
Problems:
- Department information is repeated for each employee
- If department name changes, we need to update multiple employee records
- Inconsistent data if department info differs between employees
After 3NF (Remove transitive dependencies):
-- Departments table
CREATE TABLE departments (
id SERIAL PRIMARY KEY,
name VARCHAR(100) NOT NULL,
location VARCHAR(100)
);
-- Employees table (only employee-specific info)
CREATE TABLE employees (
id SERIAL PRIMARY KEY,
first_name VARCHAR(50) NOT NULL,
last_name VARCHAR(50) NOT NULL,
department_id INTEGER REFERENCES departments(id)
);
Higher Normal Forms
Boyce-Codd Normal Form (BCNF)
A stronger version of 3NF that handles certain edge cases with functional dependencies.
Fourth Normal Form (4NF)
Eliminates multi-valued dependencies.
Fifth Normal Form (5NF)
Eliminates join dependencies.
Most practical database designs only need to reach 3NF.
When to Denormalize
Sometimes, breaking normalization rules can improve performance:
Scenarios for Denormalization:
- Read-heavy applications where query performance is more important than update performance
- Reporting systems that need fast aggregations
- Data warehouses optimized for analytics
- Caching frequently accessed calculated values
Example of Controlled Denormalization:
-- Normalized tables
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
customer_id INTEGER REFERENCES customers(id),
order_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
status VARCHAR(20)
);
CREATE TABLE order_items (
order_id INTEGER REFERENCES orders(id),
product_id INTEGER REFERENCES products(id),
quantity INTEGER,
unit_price DECIMAL(10,2),
PRIMARY KEY (order_id, product_id)
);
-- Denormalized: Add calculated total to orders table for performance
ALTER TABLE orders ADD COLUMN total_amount DECIMAL(10,2);
-- Keep it updated with triggers
CREATE OR REPLACE FUNCTION update_order_total()
RETURNS TRIGGER AS $$
BEGIN
UPDATE orders
SET total_amount = (
SELECT COALESCE(SUM(quantity * unit_price), 0)
FROM order_items
WHERE order_id = COALESCE(NEW.order_id, OLD.order_id)
)
WHERE id = COALESCE(NEW.order_id, OLD.order_id);
RETURN COALESCE(NEW, OLD);
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER update_order_total_trigger
AFTER INSERT OR UPDATE OR DELETE ON order_items
FOR EACH ROW EXECUTE FUNCTION update_order_total();
Normalization Best Practices
- Start with 3NF for most applications
- Identify functional dependencies early in design
- Document your design decisions, especially denormalizations
- Use views to present denormalized data without storing it
- Consider read vs write patterns in your application
- Test performance before and after normalization changes
-- Use views to present denormalized data
CREATE VIEW order_summary AS
SELECT
o.id,
o.order_date,
c.first_name || ' ' || c.last_name AS customer_name,
c.email AS customer_email,
COUNT(oi.product_id) AS item_count,
SUM(oi.quantity * oi.unit_price) AS total_amount
FROM orders o
JOIN customers c ON o.customer_id = c.id
LEFT JOIN order_items oi ON o.id = oi.order_id
GROUP BY o.id, o.order_date, c.first_name, c.last_name, c.email;
Advanced SQL Concepts
Now that we’ve covered the fundamentals, let’s explore advanced SQL features that will make you a more effective database developer.
Window Functions
Window functions perform calculations across a set of rows related to the current row, without collapsing the result set like GROUP BY would. They’re incredibly powerful for analytics and reporting.
Basic Window Function Syntax:
function_name() OVER (
[PARTITION BY column(s)]
[ORDER BY column(s)]
[ROWS/RANGE window_frame]
)
Ranking Functions
-- ROW_NUMBER: Assigns unique sequential numbers
SELECT
first_name,
last_name,
salary,
ROW_NUMBER() OVER (ORDER BY salary DESC) as salary_rank
FROM employees;
-- RANK: Same values get same rank, with gaps
SELECT
first_name,
last_name,
salary,
RANK() OVER (ORDER BY salary DESC) as salary_rank
FROM employees;
-- Result: 1, 2, 2, 4, 5 (note the gap after tied rank 2)
-- DENSE_RANK: Same values get same rank, no gaps
SELECT
first_name,
last_name,
salary,
DENSE_RANK() OVER (ORDER BY salary DESC) as salary_rank
FROM employees;
-- Result: 1, 2, 2, 3, 4 (no gap after tied rank 2)
-- NTILE: Divides rows into buckets
SELECT
first_name,
last_name,
salary,
NTILE(4) OVER (ORDER BY salary DESC) as salary_quartile
FROM employees;
-- Divides employees into 4 salary quartiles
Partition By (Group within Windows)
-- Rank employees within each department
SELECT
first_name,
last_name,
department_id,
salary,
ROW_NUMBER() OVER (
PARTITION BY department_id
ORDER BY salary DESC
) as dept_salary_rank
FROM employees;
-- Compare individual salary to department average
SELECT
first_name,
last_name,
department_id,
salary,
AVG(salary) OVER (PARTITION BY department_id) as dept_avg_salary,
salary - AVG(salary) OVER (PARTITION BY department_id) as salary_diff_from_avg
FROM employees;
Aggregate Window Functions
-- Running totals
SELECT
order_date,
total_amount,
SUM(total_amount) OVER (
ORDER BY order_date
ROWS UNBOUNDED PRECEDING
) as running_total
FROM orders
ORDER BY order_date;
-- Moving averages (last 3 orders)
SELECT
order_date,
total_amount,
AVG(total_amount) OVER (
ORDER BY order_date
ROWS 2 PRECEDING
) as moving_avg_3_orders
FROM orders
ORDER BY order_date;
-- Percentage of total
SELECT
department_id,
salary,
salary / SUM(salary) OVER () * 100 as pct_of_total_payroll,
salary / SUM(salary) OVER (PARTITION BY department_id) * 100 as pct_of_dept_payroll
FROM employees;
LAG and LEAD Functions
-- Compare with previous row
SELECT
order_date,
total_amount,
LAG(total_amount) OVER (ORDER BY order_date) as previous_order_amount,
total_amount - LAG(total_amount) OVER (ORDER BY order_date) as amount_difference
FROM orders
ORDER BY order_date;
-- Compare with next row
SELECT
employee_id,
salary,
hire_date,
LEAD(salary) OVER (ORDER BY hire_date) as next_hire_salary
FROM employees
ORDER BY hire_date;
-- Multiple periods back/forward
SELECT
month,
revenue,
LAG(revenue, 1) OVER (ORDER BY month) as prev_month,
LAG(revenue, 12) OVER (ORDER BY month) as same_month_last_year
FROM monthly_revenue
ORDER BY month;
Common Table Expressions (CTEs)
CTEs are temporary named result sets that exist within the scope of a single SQL statement. They make complex queries more readable and maintainable.
Basic CTEs
-- Simple CTE
WITH high_earners AS (
SELECT *
FROM employees
WHERE salary > 75000
)
SELECT
department_id,
COUNT(*) as high_earner_count,
AVG(salary) as avg_high_earner_salary
FROM high_earners
GROUP BY department_id;
-- Multiple CTEs
WITH
department_stats AS (
SELECT
department_id,
COUNT(*) as employee_count,
AVG(salary) as avg_salary,
MAX(salary) as max_salary
FROM employees
GROUP BY department_id
),
department_info AS (
SELECT
d.id,
d.name,
d.budget
FROM departments d
WHERE d.is_active = true
)
SELECT
di.name as department_name,
di.budget,
ds.employee_count,
ds.avg_salary,
ds.max_salary,
di.budget / ds.employee_count as budget_per_employee
FROM department_info di
JOIN department_stats ds ON di.id = ds.department_id;
Recursive CTEs
Recursive CTEs are perfect for hierarchical data like organizational charts, file systems, or category trees.
-- Organizational hierarchy
WITH RECURSIVE employee_hierarchy AS (
-- Base case: Top-level managers (no manager)
SELECT
id,
first_name,
last_name,
manager_id,
1 as level,
first_name || ' ' || last_name as path
FROM employees
WHERE manager_id IS NULL
UNION ALL
-- Recursive case: Employees with managers
SELECT
e.id,
e.first_name,
e.last_name,
e.manager_id,
eh.level + 1,
eh.path || ' -> ' || e.first_name || ' ' || e.last_name
FROM employees e
JOIN employee_hierarchy eh ON e.manager_id = eh.id
WHERE eh.level < 10 -- Prevent infinite recursion
)
SELECT
level,
REPEAT(' ', level - 1) || first_name || ' ' || last_name as indented_name,
path
FROM employee_hierarchy
ORDER BY path;
-- Category tree with totals
WITH RECURSIVE category_tree AS (
-- Root categories
SELECT
id,
name,
parent_category_id,
1 as level
FROM categories
WHERE parent_category_id IS NULL
UNION ALL
-- Child categories
SELECT
c.id,
c.name,
c.parent_category_id,
ct.level + 1
FROM categories c
JOIN category_tree ct ON c.parent_category_id = ct.id
),
category_with_product_counts AS (
SELECT
ct.*,
COALESCE(p.product_count, 0) as direct_product_count
FROM category_tree ct
LEFT JOIN (
SELECT category_id, COUNT(*) as product_count
FROM products
GROUP BY category_id
) p ON ct.id = p.category_id
)
SELECT
level,
REPEAT(' ', level - 1) || name as indented_category,
direct_product_count
FROM category_with_product_counts
ORDER BY level, name;
Advanced Aggregations
GROUPING SETS
GROUPING SETS allows you to compute multiple groupings in a single query:
-- Multiple groupings at once
SELECT
department_id,
job_title,
COUNT(*) as employee_count,
AVG(salary) as avg_salary
FROM employees
GROUP BY GROUPING SETS (
(department_id), -- Group by department only
(job_title), -- Group by job title only
(department_id, job_title), -- Group by both
() -- Grand total (no grouping)
);
-- Use GROUPING() function to identify which columns are grouped
SELECT
CASE WHEN GROUPING(department_id) = 1 THEN 'ALL DEPARTMENTS'
ELSE department_id::TEXT END as department,
CASE WHEN GROUPING(job_title) = 1 THEN 'ALL TITLES'
ELSE job_title END as title,
COUNT(*) as employee_count
FROM employees
GROUP BY GROUPING SETS (
(department_id),
(job_title),
()
);
ROLLUP and CUBE
-- ROLLUP: Hierarchical totals (department -> job_title -> grand total)
SELECT
department_id,
job_title,
COUNT(*) as employee_count,
SUM(salary) as total_salary
FROM employees
GROUP BY ROLLUP (department_id, job_title)
ORDER BY department_id, job_title;
-- CUBE: All possible combinations
SELECT
department_id,
job_title,
COUNT(*) as employee_count
FROM employees
GROUP BY CUBE (department_id, job_title)
ORDER BY department_id, job_title;
Conditional Logic in SQL
CASE Statements
-- Simple CASE
SELECT
first_name,
last_name,
salary,
CASE
WHEN salary >= 80000 THEN 'High'
WHEN salary >= 60000 THEN 'Medium'
ELSE 'Low'
END as salary_category
FROM employees;
-- Searched CASE with multiple conditions
SELECT
first_name,
last_name,
salary,
EXTRACT(YEAR FROM AGE(CURRENT_DATE, hire_date)) as years_experience,
CASE
WHEN salary >= 80000 AND EXTRACT(YEAR FROM AGE(CURRENT_DATE, hire_date)) >= 5
THEN 'Senior High Performer'
WHEN salary >= 80000
THEN 'High Performer'
WHEN EXTRACT(YEAR FROM AGE(CURRENT_DATE, hire_date)) >= 10
THEN 'Veteran'
ELSE 'Standard'
END as employee_category
FROM employees;
-- CASE in aggregate functions
SELECT
department_id,
COUNT(*) as total_employees,
COUNT(CASE WHEN salary >= 70000 THEN 1 END) as high_salary_count,
COUNT(CASE WHEN EXTRACT(YEAR FROM AGE(CURRENT_DATE, hire_date)) >= 5 THEN 1 END) as veteran_count,
AVG(CASE WHEN salary >= 70000 THEN salary END) as avg_high_salary
FROM employees
GROUP BY department_id;
COALESCE and NULLIF
-- COALESCE: Return first non-null value
SELECT
first_name,
last_name,
COALESCE(mobile_phone, work_phone, home_phone, 'No phone') as contact_phone
FROM employees;
-- NULLIF: Return NULL if values are equal
SELECT
product_name,
list_price,
sale_price,
NULLIF(sale_price, list_price) as discount_price -- NULL if no discount
FROM products;
-- Practical example: Avoid division by zero
SELECT
department_id,
total_budget,
employee_count,
total_budget / NULLIF(employee_count, 0) as budget_per_employee
FROM department_summary;
JSON Operations in PostgreSQL
PostgreSQL has excellent JSON support with JSON and JSONB data types:
-- Create table with JSON column
CREATE TABLE user_preferences (
user_id INTEGER PRIMARY KEY,
preferences JSONB -- JSONB is more efficient than JSON
);
-- Insert JSON data
INSERT INTO user_preferences VALUES
(1, '{"theme": "dark", "language": "en", "notifications": {"email": true, "push": false}}'),
(2, '{"theme": "light", "language": "es", "notifications": {"email": false, "push": true}}'),
(3, '{"theme": "dark", "language": "fr", "sidebar_collapsed": true}');
-- JSON operators
SELECT
user_id,
preferences -> 'theme' as theme, -- Returns JSON
preferences ->> 'language' as language, -- Returns text
preferences -> 'notifications' ->> 'email' as email_notifications,
preferences #> '{notifications,push}' as push_notifications
FROM user_preferences;
-- JSON aggregation
SELECT
json_object_agg(user_id, preferences ->> 'theme') as user_themes
FROM user_preferences;
-- JSON path queries
SELECT
user_id,
preferences
FROM user_preferences
WHERE preferences @> '{"theme": "dark"}'; -- Contains
-- Update JSON
UPDATE user_preferences
SET preferences = preferences || '{"last_login": "2024-01-15"}'
WHERE user_id = 1;
-- JSON indexing
CREATE INDEX idx_user_prefs_theme ON user_preferences USING GIN ((preferences -> 'theme'));
Full-Text Search
PostgreSQL includes powerful full-text search capabilities:
-- Add full-text search to products
ALTER TABLE products ADD COLUMN search_vector tsvector;
-- Update search vector
UPDATE products
SET search_vector = to_tsvector('english', name || ' ' || COALESCE(description, ''));
-- Create index for fast searching
CREATE INDEX idx_products_search ON products USING GIN(search_vector);
-- Search products
SELECT
name,
description,
ts_rank(search_vector, query) as rank
FROM products, to_tsquery('english', 'laptop & fast') as query
WHERE search_vector @@ query
ORDER BY rank DESC;
-- Highlight search results
SELECT
name,
ts_headline('english', description, to_tsquery('laptop & fast')) as highlighted_description
FROM products
WHERE search_vector @@ to_tsquery('english', 'laptop & fast');
Transactions & ACID
Transactions are a fundamental concept in database systems that ensure data integrity and consistency. Understanding transactions is crucial for building reliable applications.
What is a Transaction?
A transaction is a sequence of database operations that are treated as a single unit of work. Either all operations succeed (commit) or all operations fail (rollback).
Real-world example: Transferring money between bank accounts
-- This should be atomic - either both operations succeed or both fail
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE account_id = 'A123';
UPDATE accounts SET balance = balance + 100 WHERE account_id = 'B456';
COMMIT;
ACID Properties
PostgreSQL is ACID compliant, meaning it guarantees four critical properties:
Atomicity
“All or Nothing” - A transaction is indivisible. Either all operations within the transaction are completed successfully, or none of them are.
-- Example: Order processing
BEGIN;
-- Deduct inventory
UPDATE products SET stock_quantity = stock_quantity - 5 WHERE id = 1;
-- Create order record
INSERT INTO orders (customer_id, total_amount) VALUES (123, 199.95);
-- Add order items
INSERT INTO order_items (order_id, product_id, quantity, unit_price)
VALUES (currval('orders_id_seq'), 1, 5, 39.99);
-- If any of these fail, ALL changes are rolled back
COMMIT;
Consistency
“Valid State to Valid State” - A transaction brings the database from one valid state to another, maintaining all constraints and rules.
-- Constraints ensure consistency
CREATE TABLE accounts (
id SERIAL PRIMARY KEY,
account_number VARCHAR(20) UNIQUE NOT NULL,
balance DECIMAL(10,2) NOT NULL,
CONSTRAINT chk_positive_balance CHECK (balance >= 0)
);
-- This transaction will fail due to constraint violation
BEGIN;
UPDATE accounts SET balance = balance - 1000 WHERE account_number = 'A123';
-- If this makes balance negative, the constraint prevents commit
COMMIT;
Isolation
“Concurrent transactions don’t interfere” - Transactions appear to run in isolation, even when executed concurrently.
-- Two users trying to book the last seat simultaneously
-- Transaction 1:
BEGIN;
SELECT available_seats FROM flights WHERE flight_id = 'FL123'; -- Returns 1
UPDATE flights SET available_seats = available_seats - 1 WHERE flight_id = 'FL123';
INSERT INTO bookings (flight_id, passenger_id) VALUES ('FL123', 'P001');
COMMIT;
-- Transaction 2 (concurrent):
BEGIN;
SELECT available_seats FROM flights WHERE flight_id = 'FL123';
-- Under READ COMMITTED this sees the last committed value (1 until Transaction 1 commits, 0 afterwards);
-- an UPDATE on the same row would block until Transaction 1 finishes
COMMIT;
Durability
“Committed changes persist” - Once a transaction is committed, the changes are permanent and survive system failures.
-- After COMMIT, this data survives even if server crashes
BEGIN;
INSERT INTO critical_data (id, value) VALUES (1, 'Important information');
COMMIT; -- Data is now permanently stored
Transaction Control Commands
BEGIN/START TRANSACTION
Starts a new transaction:
-- Both are equivalent
BEGIN;
START TRANSACTION;
-- With specific isolation level
BEGIN ISOLATION LEVEL READ COMMITTED;
COMMIT
Makes all changes permanent:
BEGIN;
INSERT INTO customers (name, email) VALUES ('John Doe', 'john@email.com');
UPDATE customers SET last_login = CURRENT_TIMESTAMP WHERE email = 'john@email.com';
COMMIT; -- Changes are now permanent
ROLLBACK
Undoes all changes in the current transaction:
BEGIN;
DELETE FROM customers WHERE id = 1;
-- Oops, that was a mistake!
ROLLBACK; -- Customer record is restored
SAVEPOINT
Creates a point within a transaction to which you can rollback:
BEGIN;
INSERT INTO customers (name, email) VALUES ('Alice', 'alice@email.com');
SAVEPOINT customer_created;
INSERT INTO orders (customer_id, total_amount) VALUES (currval('customers_id_seq'), 100.00);
SAVEPOINT order_created;
-- Something goes wrong with payment processing
INSERT INTO payments (order_id, amount, status) VALUES (currval('orders_id_seq'), 100.00, 'failed');
-- Roll back to before payment, but keep customer and order
ROLLBACK TO SAVEPOINT order_created;
-- Or roll back to before order
ROLLBACK TO SAVEPOINT customer_created;
COMMIT;
Isolation Levels
Different isolation levels provide different guarantees about what a transaction can see:
READ UNCOMMITTED (Lowest Isolation)
In the SQL standard, this level allows dirty reads (seeing uncommitted changes from other transactions). PostgreSQL accepts the syntax but treats it as READ COMMITTED, so dirty reads never actually occur.
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
-- In PostgreSQL this behaves exactly like READ COMMITTED
-- Rarely used in practice; kept mainly for SQL standard compatibility
READ COMMITTED (PostgreSQL Default)
Can only see committed changes. Each statement sees a snapshot of committed data at the time the statement started.
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
-- Transaction 1:
BEGIN;
SELECT balance FROM accounts WHERE id = 1; -- Returns 1000
-- Transaction 2 commits a change here
SELECT balance FROM accounts WHERE id = 1; -- Might return 1100 (sees committed change)
COMMIT;
REPEATABLE READ
Sees a snapshot of the database at the start of the transaction. All reads within the transaction return the same data.
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
-- Transaction 1:
BEGIN;
SELECT balance FROM accounts WHERE id = 1; -- Returns 1000
-- Even if Transaction 2 commits changes, subsequent reads return 1000
SELECT balance FROM accounts WHERE id = 1; -- Still returns 1000
COMMIT;
SERIALIZABLE (Highest Isolation)
Strongest isolation level. Transactions appear to run one after another, even if they run concurrently.
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- May cause serialization failures (SQLSTATE 40001); the application must catch these, roll back, and retry
Concurrency Problems and Solutions
Dirty Read
Reading uncommitted data from another transaction.
Problem: Transaction A reads data that Transaction B has modified but not committed.
Solution: Use READ COMMITTED or higher. (In PostgreSQL, dirty reads cannot occur at any isolation level.)
Non-Repeatable Read
Getting different results when reading the same data twice within a transaction.
-- Transaction 1:
BEGIN;
SELECT salary FROM employees WHERE id = 1; -- Returns 50000
-- Transaction 2 updates and commits here
SELECT salary FROM employees WHERE id = 1; -- Returns 55000 (different!)
COMMIT;
Solution: Use REPEATABLE READ or SERIALIZABLE isolation level.
Phantom Read
New rows appearing in subsequent reads within the same transaction.
-- Transaction 1:
BEGIN;
SELECT COUNT(*) FROM employees WHERE department_id = 1; -- Returns 10
-- Transaction 2 inserts new employee and commits
SELECT COUNT(*) FROM employees WHERE department_id = 1; -- Returns 11 (phantom!)
COMMIT;
Solution: Use REPEATABLE READ or SERIALIZABLE; PostgreSQL's snapshot-based REPEATABLE READ already prevents phantom reads.
Lost Update
Two transactions modify the same data, and one overwrites the other’s changes.
-- Both transactions read balance = 1000
-- Transaction 1: UPDATE accounts SET balance = 1000 + 100 WHERE id = 1; (balance = 1100)
-- Transaction 2: UPDATE accounts SET balance = 1000 + 200 WHERE id = 1; (balance = 1200)
-- Result: Only Transaction 2's change persists, Transaction 1's is lost
Solutions:
- Optimistic Locking: Use version numbers or timestamps
-- Add version column
ALTER TABLE accounts ADD COLUMN version INTEGER DEFAULT 1;
-- Update with version check (the version value comes from the earlier read;
-- in application code it would be passed as a bind parameter)
UPDATE accounts
SET balance = balance + 100, version = version + 1
WHERE id = 1 AND version = 3;
-- If 0 rows are affected, someone else updated the record first
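A minimal sketch of how the zero-rows case can be detected, written as a PL/pgSQL block for illustration (in practice this check usually lives in application code; the literal version 3 stands in for the value read earlier):
DO $$
DECLARE
rows_updated INTEGER;
BEGIN
UPDATE accounts
SET balance = balance + 100, version = version + 1
WHERE id = 1 AND version = 3; -- 3 = version read before calculating the new balance
GET DIAGNOSTICS rows_updated = ROW_COUNT;
IF rows_updated = 0 THEN
RAISE EXCEPTION 'Concurrent update detected; re-read the row and retry';
END IF;
END $$;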
- Pessimistic Locking: Use SELECT FOR UPDATE
BEGIN;
SELECT balance FROM accounts WHERE id = 1 FOR UPDATE; -- Locks the row
-- Now safely calculate and update
UPDATE accounts SET balance = balance + 100 WHERE id = 1;
COMMIT;
Advanced Transaction Patterns
Transaction with Error Handling
DO $$
DECLARE
error_message TEXT;
BEGIN
BEGIN
-- Start main transaction logic
INSERT INTO orders (customer_id, total_amount) VALUES (123, 199.99);
-- This might fail
INSERT INTO inventory_log (product_id, quantity_change)
VALUES (999, -5); -- Product 999 doesn't exist
EXCEPTION WHEN OTHERS THEN
GET STACKED DIAGNOSTICS error_message = MESSAGE_TEXT;
RAISE NOTICE 'Transaction failed: %', error_message;
-- Log the error (note: this insert is also rolled back when we re-raise below;
-- durable error logging needs to happen outside this transaction, e.g. application-side or via the server log)
INSERT INTO error_log (error_message, occurred_at)
VALUES (error_message, CURRENT_TIMESTAMP);
-- Re-raise the exception so the whole transaction rolls back
RAISE;
END;
END $$;
Bulk Operations with Periodic Commits
-- For large data processing, commit periodically to avoid one huge long-running transaction.
-- COMMIT inside a DO block requires PostgreSQL 11+ and must not be wrapped in an outer transaction.
DO $$
DECLARE
batch_size INTEGER := 1000;
last_id INTEGER := 0;
next_last_id INTEGER;
BEGIN
LOOP
-- Find the upper bound of the next batch
SELECT MAX(id) INTO next_last_id
FROM (SELECT id FROM large_source_table
WHERE id > last_id ORDER BY id LIMIT batch_size) AS batch;
EXIT WHEN next_last_id IS NULL; -- nothing left to process
INSERT INTO processed_data (id, processed_value)
SELECT id, expensive_function(data)
FROM large_source_table
WHERE id > last_id AND id <= next_last_id;
last_id := next_last_id;
COMMIT;
RAISE NOTICE 'Processed rows up to id %', last_id;
END LOOP;
END $$;
Transaction Best Practices
- Keep transactions short - Long transactions hold locks longer
- Don’t include user interaction in transactions
- Handle deadlocks gracefully with retry logic (see the retry sketch after the examples below)
- Use appropriate isolation levels - not always SERIALIZABLE
- Avoid nested transactions - use savepoints instead
- Test concurrent scenarios - race conditions are hard to debug
-- Good: Short, focused transaction
BEGIN;
UPDATE inventory SET quantity = quantity - 1 WHERE product_id = 123;
INSERT INTO sales (product_id, quantity, sale_date) VALUES (123, 1, CURRENT_DATE);
COMMIT;
-- Bad: Long transaction with user interaction
BEGIN;
SELECT * FROM products WHERE category = 'electronics';
-- User spends 5 minutes deciding what to buy
INSERT INTO orders (...); -- Other users are blocked during this time
COMMIT;
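For the deadlock-handling point above, here is a minimal retry sketch in PL/pgSQL, assuming the accounts table from earlier; each caught deadlock rolls back only the failed attempt (the inner block runs as a subtransaction) and the work is retried up to three times:
DO $$
DECLARE
attempts INTEGER := 0;
BEGIN
LOOP
BEGIN
-- Touching rows in a consistent order also reduces the chance of deadlocks
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
EXIT; -- success, leave the retry loop
EXCEPTION WHEN deadlock_detected THEN
attempts := attempts + 1;
IF attempts >= 3 THEN
RAISE; -- give up and surface the error
END IF;
RAISE NOTICE 'Deadlock detected, retrying (attempt %)', attempts;
END;
END LOOP;
END $$;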
Indexes & Performance
Indexes are database objects that improve query performance by providing fast access paths to table data. They’re like an index in a book - instead of scanning every page, you can jump directly to the information you need.
How Indexes Work
Without Index (Full Table Scan):
-- This query scans every row
SELECT * FROM employees WHERE email = 'john@company.com';
-- PostgreSQL must check all 100,000 employee records
With Index (Index Scan):
-- Create index
CREATE INDEX idx_employees_email ON employees(email);
-- Same query now uses index
SELECT * FROM employees WHERE email = 'john@company.com';
-- PostgreSQL directly finds the row using the index
Index Types in PostgreSQL
B-Tree Index (Default and Most Common)
B-Tree indexes are perfect for equality and range queries:
-- Single column index
CREATE INDEX idx_employees_last_name ON employees(last_name);
-- Multi-column index (column order matters!)
CREATE INDEX idx_employees_dept_salary ON employees(department_id, salary);
-- Partial index (indexes only rows meeting a condition)
CREATE INDEX idx_active_employees_email ON employees(email)
WHERE is_active = true;
-- Functional index (indexes result of expression)
CREATE INDEX idx_employees_lower_email ON employees(LOWER(email));
-- Examples of when B-Tree indexes are used:
SELECT * FROM employees WHERE last_name = 'Smith'; -- Equality
SELECT * FROM employees WHERE salary > 50000; -- Range
SELECT * FROM employees WHERE salary BETWEEN 40000 AND 80000; -- Range
SELECT * FROM employees WHERE last_name LIKE 'Smi%'; -- Prefix match
SELECT * FROM employees WHERE department_id = 1 ORDER BY salary; -- Order by indexed column
Multi-column Index Column Order:
-- Good for queries filtering by department_id, or department_id + salary
CREATE INDEX idx_emp_dept_salary ON employees(department_id, salary);
-- These queries can use the index:
SELECT * FROM employees WHERE department_id = 1; -- ✓ Uses index
SELECT * FROM employees WHERE department_id = 1 AND salary > 50000; -- ✓ Uses index fully
SELECT * FROM employees WHERE department_id = 1 ORDER BY salary; -- ✓ Uses index
-- This query cannot use the index efficiently:
SELECT * FROM employees WHERE salary > 50000; -- ✗ Salary is second column
Hash Index
Hash indexes are only for equality comparisons, but very fast:
-- Hash index for exact matches only
CREATE INDEX idx_employees_id_hash ON employees USING HASH(id);
-- Good for:
SELECT * FROM employees WHERE id = 123; -- ✓ Equality
-- Cannot be used for:
SELECT * FROM employees WHERE id > 100; -- ✗ Range queries
SELECT * FROM employees WHERE id IN (1, 2, 3); -- ✗ Not equality
GIN Index (Generalized Inverted Index)
Perfect for complex data types like arrays, JSONB, and full-text search:
-- For array columns
CREATE TABLE posts (
id SERIAL PRIMARY KEY,
title VARCHAR(200),
tags TEXT[]
);
CREATE INDEX idx_posts_tags ON posts USING GIN(tags);
-- Efficient queries:
SELECT * FROM posts WHERE tags @> ARRAY['postgresql']; -- Contains
SELECT * FROM posts WHERE tags && ARRAY['database', 'sql']; -- Overlaps
-- For JSONB columns
CREATE TABLE user_profiles (
id SERIAL PRIMARY KEY,
profile JSONB
);
CREATE INDEX idx_user_profiles_gin ON user_profiles USING GIN(profile);
-- Queries the GIN index can accelerate:
SELECT * FROM user_profiles WHERE profile @> '{"city": "New York"}'; -- Containment
SELECT * FROM user_profiles WHERE profile ? 'phone'; -- Has key
-- Path extraction with equality is NOT helped by the default jsonb GIN opclass:
SELECT * FROM user_profiles WHERE profile #> '{address,city}' = '"NYC"';
-- For full-text search
CREATE INDEX idx_posts_fulltext ON posts USING GIN(to_tsvector('english', title));
SELECT * FROM posts WHERE to_tsvector('english', title) @@ to_tsquery('postgresql & database');
GiST Index (Generalized Search Tree)
Good for geometric data, full-text search, and range types:
-- For geometric data
CREATE TABLE stores (
id SERIAL PRIMARY KEY,
name VARCHAR(100),
location POINT
);
CREATE INDEX idx_stores_location ON stores USING GIST(location);
-- Efficient geometric queries (GiST accelerates nearest-neighbor ordering with <->):
SELECT * FROM stores ORDER BY location <-> POINT(0,0) LIMIT 5; -- 5 stores nearest to the origin
-- For range types
CREATE TABLE events (
id SERIAL PRIMARY KEY,
name VARCHAR(100),
date_range DATERANGE
);
CREATE INDEX idx_events_daterange ON events USING GIST(date_range);
-- Efficient range queries:
SELECT * FROM events WHERE date_range @> CURRENT_DATE; -- Contains date
SELECT * FROM events WHERE date_range && '[2024-01-01,2024-12-31)'::daterange; -- Overlaps
Query Performance Analysis
EXPLAIN - Understanding Query Plans
-- Basic explain
EXPLAIN SELECT * FROM employees WHERE department_id = 1;
-- More detailed analysis
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT e.first_name, e.last_name, d.name
FROM employees e
JOIN departments d ON e.department_id = d.id
WHERE e.salary > 50000;
Understanding EXPLAIN output:
- Seq Scan: Table scan (slow for large tables)
- Index Scan: Using an index to find rows
- Index Only Scan: Getting all needed data from index
- Bitmap Heap Scan: Using index to find row locations, then fetching rows
- Nested Loop: Join algorithm for small result sets
- Hash Join: Join algorithm for larger result sets
- Sort: Explicit sorting operation
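To make these node types concrete, here is roughly what a join plan can look like; the cost and row numbers below are illustrative, not from a real run:
EXPLAIN
SELECT e.first_name, d.name
FROM employees e
JOIN departments d ON e.department_id = d.id
WHERE e.salary > 50000;
--                              QUERY PLAN
-- ---------------------------------------------------------------------------
--  Hash Join  (cost=1.09..35.55 rows=340 width=64)
--    Hash Cond: (e.department_id = d.id)
--    ->  Seq Scan on employees e  (cost=0.00..30.40 rows=340 width=36)
--          Filter: (salary > 50000)
--    ->  Hash  (cost=1.04..1.04 rows=4 width=36)
--          ->  Seq Scan on departments d  (cost=0.00..1.04 rows=4 width=36)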
Query Optimization Examples
Before optimization:
-- Slow query - no index on salary
EXPLAIN ANALYZE
SELECT * FROM employees WHERE salary > 50000;
-- Result: Seq Scan on employees (cost=0.00..2500.00 rows=5000 width=50) (actual time=25.123..45.678 rows=5000)
After optimization:
-- Add index
CREATE INDEX idx_employees_salary ON employees(salary);
-- Same query now faster
EXPLAIN ANALYZE
SELECT * FROM employees WHERE salary > 50000;
-- Result: Index Scan using idx_employees_salary (cost=0.42..156.78 rows=5000 width=50) (actual time=0.123..2.345 rows=5000)
Index Maintenance and Best Practices
Monitoring Index Usage
-- Check index usage statistics
SELECT
schemaname,
tablename,
indexname,
idx_scan as times_used,
idx_tup_read as tuples_read,
idx_tup_fetch as tuples_fetched
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC;
-- Find unused indexes
SELECT
schemaname,
tablename,
indexname,
idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY schemaname, tablename;
-- Check index sizes
SELECT
indexname,
pg_size_pretty(pg_relation_size(indexrelid)) as index_size
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC;
Index Maintenance
-- Rebuild index (removes bloat)
REINDEX INDEX idx_employees_email;
-- Rebuild all indexes on a table
REINDEX TABLE employees;
-- Create index without blocking concurrent writes
CREATE INDEX CONCURRENTLY idx_employees_phone ON employees(phone);
-- Drop unused index
DROP INDEX IF EXISTS idx_employees_unused;
Performance Best Practices
1. Index Strategy
-- Good: Composite index for common query patterns
CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date);
-- Supports both of these efficiently:
SELECT * FROM orders WHERE customer_id = 123;
SELECT * FROM orders WHERE customer_id = 123 AND order_date >= '2024-01-01';
2. Covering Indexes
-- Include additional columns to avoid table lookups (INCLUDE requires PostgreSQL 11+)
CREATE INDEX idx_employees_dept_salary_covering
ON employees(department_id, salary)
INCLUDE (first_name, last_name);
-- This query can be satisfied entirely from the index:
SELECT first_name, last_name, salary
FROM employees
WHERE department_id = 1 AND salary > 50000;
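To confirm the covering index is actually avoiding table lookups, check the plan for an Index Only Scan (a hedged sketch; real output varies by data and version):
EXPLAIN (ANALYZE)
SELECT first_name, last_name, salary
FROM employees
WHERE department_id = 1 AND salary > 50000;
-- Look for "Index Only Scan using idx_employees_dept_salary_covering"
-- and "Heap Fetches: 0" (after a recent VACUUM) in the output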
3. Partial Indexes for Sparse Data
-- Only index active employees (saves space if most are inactive)
CREATE INDEX idx_active_employees_email ON employees(email)
WHERE is_active = true;
-- Only index recent orders
CREATE INDEX idx_recent_orders_customer ON orders(customer_id)
WHERE order_date >= CURRENT_DATE - INTERVAL '1 year';
4. Functional Indexes for Expressions
-- Index for case-insensitive searches
CREATE INDEX idx_employees_lower_email ON employees(LOWER(email));
-- Query that uses the index:
SELECT * FROM employees WHERE LOWER(email) = LOWER('John@Company.COM');
-- Index for date part queries
CREATE INDEX idx_orders_year ON orders(EXTRACT(YEAR FROM order_date));
-- Query that uses the index:
SELECT * FROM orders WHERE EXTRACT(YEAR FROM order_date) = 2024;
5. Index Maintenance Considerations
- Too many indexes slow down writes (INSERT, UPDATE, DELETE)
- Each index takes storage space
- Keep statistics updated with ANALYZE
- Monitor for index bloat and rebuild when necessary
- Drop unused indexes regularly
-- Update table statistics (helps query planner)
ANALYZE employees;
-- Schedule regular maintenance
-- In cron or scheduled job:
-- VACUUM ANALYZE employees;
-- REINDEX TABLE employees;
Remember: Indexes are a trade-off between read performance and write performance. Design your indexing strategy based on your application’s specific query patterns and performance requirements.
Triggers & Stored Procedures
Triggers and stored procedures are powerful PostgreSQL features that allow you to implement business logic directly in the database.
Understanding Triggers
A trigger is a special function that automatically executes (or “fires”) in response to certain events in a database table or view. Triggers run automatically - you can’t call them directly.
When do triggers fire?
- BEFORE or AFTER an event
- On INSERT, UPDATE, or DELETE operations
- FOR EACH ROW or FOR EACH STATEMENT
Trigger Functions
First, you need to create a function that returns TRIGGER:
-- Basic trigger function template
CREATE OR REPLACE FUNCTION function_name()
RETURNS TRIGGER AS $$
BEGIN
-- Trigger logic here
-- For BEFORE triggers on INSERT/UPDATE, return NEW
-- For BEFORE triggers on DELETE, return OLD
-- For AFTER triggers, return value is ignored but use NEW or OLD as appropriate
RETURN NEW; -- or OLD for DELETE triggers
END;
$$ LANGUAGE plpgsql;
Common Trigger Examples
1. Automatic Timestamp Updates
-- Function to update 'updated_at' column
CREATE OR REPLACE FUNCTION update_updated_at_column()
RETURNS TRIGGER AS $$
BEGIN
NEW.updated_at = CURRENT_TIMESTAMP;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
-- Add updated_at column to table
ALTER TABLE employees ADD COLUMN updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP;
-- Create trigger
CREATE TRIGGER trigger_employees_updated_at
BEFORE UPDATE ON employees
FOR EACH ROW
EXECUTE FUNCTION update_updated_at_column();
-- Test it
UPDATE employees SET salary = 80000 WHERE id = 1;
-- The updated_at column will automatically be set to current timestamp
2. Audit Trail Trigger
-- Create audit log table
CREATE TABLE audit_log (
id SERIAL PRIMARY KEY,
table_name VARCHAR(100) NOT NULL,
operation VARCHAR(10) NOT NULL,
row_id INTEGER,
old_data JSONB,
new_data JSONB,
changed_by VARCHAR(100) DEFAULT current_user,
changed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Audit trigger function
CREATE OR REPLACE FUNCTION audit_trigger_function()
RETURNS TRIGGER AS $$
BEGIN
-- Handle different operations
IF TG_OP = 'DELETE' THEN
INSERT INTO audit_log (table_name, operation, row_id, old_data)
VALUES (TG_TABLE_NAME, TG_OP, OLD.id, row_to_json(OLD));
RETURN OLD;
ELSIF TG_OP = 'UPDATE' THEN
INSERT INTO audit_log (table_name, operation, row_id, old_data, new_data)
VALUES (TG_TABLE_NAME, TG_OP, NEW.id, row_to_json(OLD), row_to_json(NEW));
RETURN NEW;
ELSIF TG_OP = 'INSERT' THEN
INSERT INTO audit_log (table_name, operation, row_id, new_data)
VALUES (TG_TABLE_NAME, TG_OP, NEW.id, row_to_json(NEW));
RETURN NEW;
END IF;
RETURN NULL;
END;
$$ LANGUAGE plpgsql;
-- Apply audit trigger to tables
CREATE TRIGGER audit_employees
AFTER INSERT OR UPDATE OR DELETE ON employees
FOR EACH ROW EXECUTE FUNCTION audit_trigger_function();
CREATE TRIGGER audit_customers
AFTER INSERT OR UPDATE OR DELETE ON customers
FOR EACH ROW EXECUTE FUNCTION audit_trigger_function();
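A quick way to verify the audit trail is working, using the tables defined above (the UPDATE and the column values are just illustrative):
UPDATE employees SET salary = salary + 1000 WHERE id = 1;
SELECT
operation,
row_id,
old_data->>'salary' AS old_salary,
new_data->>'salary' AS new_salary,
changed_by,
changed_at
FROM audit_log
WHERE table_name = 'employees'
ORDER BY changed_at DESC
LIMIT 5;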
3. Data Validation Trigger
-- Validation trigger function
CREATE OR REPLACE FUNCTION validate_employee_data()
RETURNS TRIGGER AS $$
BEGIN
-- Validate email format
IF NEW.email !~ '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$' THEN
RAISE EXCEPTION 'Invalid email format: %', NEW.email;
END IF;
-- Validate salary range
IF NEW.salary < 0 THEN
RAISE EXCEPTION 'Salary cannot be negative';
END IF;
IF NEW.salary > 1000000 THEN
RAISE EXCEPTION 'Salary exceeds maximum allowed amount';
END IF;
-- Validate phone number format (if provided)
IF NEW.phone IS NOT NULL AND NEW.phone !~ '^\+?[1-9]\d{1,14}$' THEN
RAISE EXCEPTION 'Invalid phone number format: %', NEW.phone;
END IF;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
-- Create validation trigger
CREATE TRIGGER validate_employee_trigger
BEFORE INSERT OR UPDATE ON employees
FOR EACH ROW EXECUTE FUNCTION validate_employee_data();
4. Automatic Counter Updates
-- Update department employee count when employees are added/removed
CREATE OR REPLACE FUNCTION update_department_employee_count()
RETURNS TRIGGER AS $$
BEGIN
-- Handle INSERT
IF TG_OP = 'INSERT' THEN
UPDATE departments
SET employee_count = employee_count + 1
WHERE id = NEW.department_id;
RETURN NEW;
END IF;
-- Handle DELETE
IF TG_OP = 'DELETE' THEN
UPDATE departments
SET employee_count = employee_count - 1
WHERE id = OLD.department_id;
RETURN OLD;
END IF;
-- Handle UPDATE (department change)
IF TG_OP = 'UPDATE' AND OLD.department_id IS DISTINCT FROM NEW.department_id THEN
-- Decrease count in old department
UPDATE departments
SET employee_count = employee_count - 1
WHERE id = OLD.department_id;
-- Increase count in new department
UPDATE departments
SET employee_count = employee_count + 1
WHERE id = NEW.department_id;
END IF;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
-- Add employee_count column to departments
ALTER TABLE departments ADD COLUMN employee_count INTEGER DEFAULT 0;
-- Initialize counts
UPDATE departments
SET employee_count = (
SELECT COUNT(*)
FROM employees
WHERE department_id = departments.id
);
-- Create trigger
CREATE TRIGGER update_dept_employee_count
AFTER INSERT OR UPDATE OR DELETE ON employees
FOR EACH ROW EXECUTE FUNCTION update_department_employee_count();
Stored Procedures and Functions
PostgreSQL functions can return values, while procedures (introduced in PostgreSQL 11) cannot return values but can manage transactions.
Functions (Return Values)
-- Simple function that returns a value
CREATE OR REPLACE FUNCTION get_employee_full_name(emp_id INTEGER)
RETURNS TEXT AS $$
DECLARE
full_name TEXT;
BEGIN
SELECT first_name || ' ' || last_name
INTO full_name
FROM employees
WHERE id = emp_id;
IF full_name IS NULL THEN
RETURN 'Employee not found';
END IF;
RETURN full_name;
END;
$$ LANGUAGE plpgsql;
-- Usage
SELECT get_employee_full_name(1);
Functions with Complex Logic
-- Calculate employee bonus based on performance and tenure
CREATE OR REPLACE FUNCTION calculate_employee_bonus(emp_id INTEGER)
RETURNS DECIMAL(10,2) AS $$
DECLARE
emp_record RECORD;
years_employed INTEGER;
base_bonus DECIMAL(10,2);
performance_multiplier DECIMAL(3,2);
final_bonus DECIMAL(10,2);
BEGIN
-- Get employee information
SELECT * INTO emp_record
FROM employees
WHERE id = emp_id;
IF NOT FOUND THEN
RAISE EXCEPTION 'Employee with ID % not found', emp_id;
END IF;
-- Calculate years employed
years_employed := EXTRACT(YEAR FROM AGE(CURRENT_DATE, emp_record.hire_date));
-- Base bonus calculation (5% of salary)
base_bonus := emp_record.salary * 0.05;
-- Performance multiplier based on performance rating
performance_multiplier := CASE emp_record.performance_rating
WHEN 'Excellent' THEN 1.5
WHEN 'Good' THEN 1.2
WHEN 'Average' THEN 1.0
WHEN 'Poor' THEN 0.5
ELSE 1.0
END;
-- Tenure bonus (additional 2% per year, max 10 years)
final_bonus := base_bonus * performance_multiplier * (1 + LEAST(years_employed, 10) * 0.02);
-- Log bonus calculation
INSERT INTO bonus_calculations (employee_id, base_bonus, performance_multiplier, final_bonus, calculated_at)
VALUES (emp_id, base_bonus, performance_multiplier, final_bonus, CURRENT_TIMESTAMP);
RETURN final_bonus;
END;
$$ LANGUAGE plpgsql;
-- Usage
SELECT
id,
first_name,
last_name,
salary,
calculate_employee_bonus(id) as bonus
FROM employees
WHERE department_id = 1;
Table-Valued Functions
-- Function that returns a table
CREATE OR REPLACE FUNCTION get_department_summary(dept_id INTEGER DEFAULT NULL)
RETURNS TABLE(
department_name VARCHAR(100),
employee_count BIGINT,
avg_salary DECIMAL(10,2),
min_salary DECIMAL(10,2),
max_salary DECIMAL(10,2),
total_payroll DECIMAL(12,2)
) AS $$
BEGIN
RETURN QUERY
SELECT
d.name,
COUNT(e.id)::BIGINT,
COALESCE(AVG(e.salary), 0)::DECIMAL(10,2),
COALESCE(MIN(e.salary), 0)::DECIMAL(10,2),
COALESCE(MAX(e.salary), 0)::DECIMAL(10,2),
COALESCE(SUM(e.salary), 0)::DECIMAL(12,2)
FROM departments d
LEFT JOIN employees e ON d.id = e.department_id
WHERE (dept_id IS NULL OR d.id = dept_id)
GROUP BY d.id, d.name
ORDER BY d.name;
END;
$$ LANGUAGE plpgsql;
-- Usage
SELECT * FROM get_department_summary(); -- All departments
SELECT * FROM get_department_summary(1); -- Specific department
Procedures (PostgreSQL 11+)
-- Procedure for complex business operations
CREATE OR REPLACE PROCEDURE process_monthly_payroll(pay_period_month INTEGER, pay_period_year INTEGER)
LANGUAGE plpgsql AS $$
DECLARE
emp_record RECORD;
bonus_amount DECIMAL(10,2);
tax_amount DECIMAL(10,2);
net_pay DECIMAL(10,2);
BEGIN
-- Create payroll batch
INSERT INTO payroll_batches (month, year, status, created_at)
VALUES (pay_period_month, pay_period_year, 'PROCESSING', CURRENT_TIMESTAMP);
-- Process each active employee
FOR emp_record IN
SELECT * FROM employees WHERE is_active = true
LOOP
-- Calculate bonus
bonus_amount := calculate_employee_bonus(emp_record.id);
-- Calculate tax (simplified - 25% flat rate)
tax_amount := (emp_record.salary + bonus_amount) * 0.25;
-- Calculate net pay
net_pay := emp_record.salary + bonus_amount - tax_amount;
-- Insert payroll record
INSERT INTO payroll_records (
employee_id,
batch_id,
base_salary,
bonus_amount,
tax_amount,
net_pay,
created_at
) VALUES (
emp_record.id,
currval('payroll_batches_id_seq'),
emp_record.salary,
bonus_amount,
tax_amount,
net_pay,
CURRENT_TIMESTAMP
);
-- Commit each employee (for long-running process)
COMMIT;
END LOOP;
-- Update batch status
UPDATE payroll_batches
SET status = 'COMPLETED', completed_at = CURRENT_TIMESTAMP
WHERE id = currval('payroll_batches_id_seq');
COMMIT;
RAISE NOTICE 'Payroll processing completed for %-%', pay_period_month, pay_period_year;
END;
$$;
-- Usage
CALL process_monthly_payroll(1, 2024);
Advanced Trigger Concepts
Conditional Triggers
-- Only log changes to sensitive columns
CREATE OR REPLACE FUNCTION selective_audit_trigger()
RETURNS TRIGGER AS $$
BEGIN
-- Only audit if salary or performance_rating changed
IF TG_OP = 'UPDATE' AND (
OLD.salary IS DISTINCT FROM NEW.salary OR
OLD.performance_rating IS DISTINCT FROM NEW.performance_rating
) THEN
INSERT INTO sensitive_audit_log (
table_name,
operation,
row_id,
column_changed,
old_value,
new_value,
changed_by,
changed_at
) VALUES (
TG_TABLE_NAME,
TG_OP,
NEW.id,
CASE
WHEN OLD.salary IS DISTINCT FROM NEW.salary THEN 'salary'
WHEN OLD.performance_rating IS DISTINCT FROM NEW.performance_rating THEN 'performance_rating'
END,
CASE
WHEN OLD.salary IS DISTINCT FROM NEW.salary THEN OLD.salary::TEXT
WHEN OLD.performance_rating IS DISTINCT FROM NEW.performance_rating THEN OLD.performance_rating
END,
CASE
WHEN OLD.salary IS DISTINCT FROM NEW.salary THEN NEW.salary::TEXT
WHEN OLD.performance_rating IS DISTINCT FROM NEW.performance_rating THEN NEW.performance_rating
END,
current_user,
CURRENT_TIMESTAMP
);
END IF;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
Statement-Level Triggers
-- Statement-level triggers fire once per statement (not per row).
-- Row counts come from transition tables (PostgreSQL 10+), declared with REFERENCING
-- in CREATE TRIGGER; they can only be attached to single-event triggers.
CREATE OR REPLACE FUNCTION log_bulk_operations()
RETURNS TRIGGER AS $$
DECLARE
affected_rows BIGINT;
BEGIN
IF TG_OP = 'DELETE' THEN
SELECT COUNT(*) INTO affected_rows FROM old_rows;
ELSE
SELECT COUNT(*) INTO affected_rows FROM new_rows;
END IF;
INSERT INTO operation_log (
table_name,
operation,
row_count,
executed_by,
executed_at
) VALUES (
TG_TABLE_NAME,
TG_OP,
affected_rows,
current_user,
CURRENT_TIMESTAMP
);
RETURN NULL; -- Return value ignored for statement-level triggers
END;
$$ LANGUAGE plpgsql;
-- Transition tables require one trigger per event
CREATE TRIGGER log_bulk_employee_inserts
AFTER INSERT ON employees
REFERENCING NEW TABLE AS new_rows
FOR EACH STATEMENT EXECUTE FUNCTION log_bulk_operations();
CREATE TRIGGER log_bulk_employee_updates
AFTER UPDATE ON employees
REFERENCING NEW TABLE AS new_rows OLD TABLE AS old_rows
FOR EACH STATEMENT EXECUTE FUNCTION log_bulk_operations();
CREATE TRIGGER log_bulk_employee_deletes
AFTER DELETE ON employees
REFERENCING OLD TABLE AS old_rows
FOR EACH STATEMENT EXECUTE FUNCTION log_bulk_operations();
Trigger and Function Best Practices
- Keep triggers simple and fast - they run for every affected row
- Avoid complex business logic in triggers - use application code instead
- Be careful with recursive triggers - triggers can fire other triggers
- Use BEFORE triggers for validation and data modification
- Use AFTER triggers for logging and notifications
- Handle exceptions properly to avoid data corruption (see the sketch at the end of this section)
- Test trigger performance impact especially on high-volume tables
-- Example: Skip unnecessary work and avoid runaway trigger recursion
CREATE OR REPLACE FUNCTION safe_update_trigger()
RETURNS TRIGGER AS $$
BEGIN
-- Bail out when the watched column did not actually change
-- (this also breaks update loops caused by triggers updating the same table)
IF TG_OP = 'UPDATE' AND OLD.some_column IS NOT DISTINCT FROM NEW.some_column THEN
RETURN NEW; -- No change needed
END IF;
-- pg_trigger_depth() can additionally be checked to stop nested trigger firing
NEW.updated_count := OLD.updated_count + 1;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
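For the "handle exceptions properly" point above, here is a hedged sketch of catching a failure in non-essential trigger work so the main write still succeeds; the notification channel and trigger name are assumptions, not part of the earlier schema:
CREATE OR REPLACE FUNCTION notify_on_change()
RETURNS TRIGGER AS $$
BEGIN
BEGIN
-- Non-critical side effect; if it fails we log a warning and continue
PERFORM pg_notify('employee_changes', NEW.id::TEXT);
EXCEPTION WHEN OTHERS THEN
RAISE WARNING 'Notification failed for employee %: %', NEW.id, SQLERRM;
END;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER notify_employee_changes
AFTER INSERT OR UPDATE ON employees
FOR EACH ROW EXECUTE FUNCTION notify_on_change();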
Views and Materialized Views
Views are virtual tables that provide a way to present data from one or more tables in a specific format. They’re incredibly useful for simplifying complex queries, enhancing security, and creating logical data layers.
Regular Views
Views are stored queries that appear as tables but don’t store data themselves:
-- Simple view
CREATE VIEW active_employees AS
SELECT
id,
first_name,
last_name,
email,
hire_date,
salary
FROM employees
WHERE is_active = true;
-- Use the view like a table
SELECT * FROM active_employees WHERE salary > 50000;
Complex Views with Joins
-- Employee summary view with department information
CREATE VIEW employee_summary AS
SELECT
e.id,
e.first_name || ' ' || e.last_name AS full_name,
e.email,
e.salary,
e.hire_date,
d.name AS department_name,
d.budget AS department_budget,
EXTRACT(YEAR FROM AGE(CURRENT_DATE, e.hire_date)) AS years_employed,
CASE
WHEN e.salary >= 80000 THEN 'Senior'
WHEN e.salary >= 60000 THEN 'Mid-level'
ELSE 'Junior'
END AS seniority_level
FROM employees e
LEFT JOIN departments d ON e.department_id = d.id
WHERE e.is_active = true;
-- Query the view
SELECT
department_name,
seniority_level,
COUNT(*) AS employee_count,
AVG(salary) AS avg_salary
FROM employee_summary
GROUP BY department_name, seniority_level
ORDER BY department_name, seniority_level;
Views for Security and Data Access Control
-- Public employee directory (sensitive data hidden)
CREATE VIEW employee_directory AS
SELECT
first_name,
last_name,
department_id,
email,
phone
FROM employees
WHERE is_active = true;
-- Management view (more detailed information)
CREATE VIEW management_employee_view AS
SELECT
e.*,
d.name AS department_name,
m.first_name || ' ' || m.last_name AS manager_name
FROM employees e
LEFT JOIN departments d ON e.department_id = d.id
LEFT JOIN employees m ON e.manager_id = m.id;
-- Grant different access levels
GRANT SELECT ON employee_directory TO public_users;
GRANT SELECT ON management_employee_view TO managers;
Materialized Views
Materialized views store the query results physically, providing better performance for complex queries:
-- Create materialized view for reporting
CREATE MATERIALIZED VIEW department_statistics AS
SELECT
d.id AS department_id,
d.name AS department_name,
COUNT(e.id) AS employee_count,
AVG(e.salary) AS avg_salary,
MIN(e.salary) AS min_salary,
MAX(e.salary) AS max_salary,
SUM(e.salary) AS total_payroll,
COUNT(CASE WHEN e.hire_date >= CURRENT_DATE - INTERVAL '1 year' THEN 1 END) AS new_hires_last_year,
AVG(EXTRACT(YEAR FROM AGE(CURRENT_DATE, e.hire_date))) AS avg_years_employed
FROM departments d
LEFT JOIN employees e ON d.id = e.department_id AND e.is_active = true
GROUP BY d.id, d.name;
-- Query the materialized view (fast!)
SELECT * FROM department_statistics ORDER BY total_payroll DESC;
Refreshing Materialized Views
-- Refresh the entire materialized view
REFRESH MATERIALIZED VIEW department_statistics;
-- Refresh without blocking concurrent selects (PostgreSQL 9.4+)
REFRESH MATERIALIZED VIEW CONCURRENTLY department_statistics;
-- Note: For CONCURRENTLY, you need a unique index
CREATE UNIQUE INDEX idx_dept_stats_dept_id ON department_statistics(department_id);
Automatic Refresh with Triggers
-- Function to refresh materialized view
CREATE OR REPLACE FUNCTION refresh_department_statistics()
RETURNS TRIGGER AS $$
BEGIN
REFRESH MATERIALIZED VIEW CONCURRENTLY department_statistics;
RETURN NULL;
END;
$$ LANGUAGE plpgsql;
-- Trigger to refresh when employees table changes
CREATE TRIGGER refresh_dept_stats_on_employee_change
AFTER INSERT OR UPDATE OR DELETE ON employees
FOR EACH STATEMENT
EXECUTE FUNCTION refresh_department_statistics();
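Refreshing on every statement can be expensive for frequently written tables. A common alternative, assuming the pg_cron extension used later in this guide is available, is a scheduled refresh:
-- Refresh every 15 minutes instead of on every write
-- SELECT cron.schedule('refresh-dept-stats', '*/15 * * * *',
--     'REFRESH MATERIALIZED VIEW CONCURRENTLY department_statistics');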
Advanced View Patterns
Updatable Views
Simple views are automatically updatable when they select from a single table (or another updatable view) and use no aggregates, DISTINCT, GROUP BY, HAVING, LIMIT/OFFSET, window functions, or set operations:
-- Simple updatable view
CREATE VIEW current_employees AS
SELECT id, first_name, last_name, email, salary, department_id
FROM employees
WHERE is_active = true;
-- This will work (updates underlying table)
UPDATE current_employees SET salary = 75000 WHERE id = 1;
-- Complex view with INSTEAD OF triggers for updates
CREATE VIEW employee_with_department AS
SELECT
e.id,
e.first_name,
e.last_name,
e.email,
e.salary,
d.name AS department_name
FROM employees e
JOIN departments d ON e.department_id = d.id;
-- Make it updatable with INSTEAD OF trigger
CREATE OR REPLACE FUNCTION update_employee_with_department()
RETURNS TRIGGER AS $$
BEGIN
UPDATE employees
SET
first_name = NEW.first_name,
last_name = NEW.last_name,
email = NEW.email,
salary = NEW.salary
WHERE id = NEW.id;
-- Handle department name change
IF OLD.department_name != NEW.department_name THEN
UPDATE employees
SET department_id = (
SELECT id FROM departments WHERE name = NEW.department_name
)
WHERE id = NEW.id;
END IF;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER update_employee_with_department_trigger
INSTEAD OF UPDATE ON employee_with_department
FOR EACH ROW EXECUTE FUNCTION update_employee_with_department();
Recursive Views (PostgreSQL 9.3+)
-- Recursive view for organizational hierarchy
CREATE RECURSIVE VIEW employee_hierarchy (employee_id, employee_name, manager_id, level, path) AS (
-- Base case: top-level employees
SELECT
id,
first_name || ' ' || last_name,
manager_id,
1,
ARRAY[id]
FROM employees
WHERE manager_id IS NULL
UNION ALL
-- Recursive case
SELECT
e.id,
e.first_name || ' ' || e.last_name,
e.manager_id,
eh.level + 1,
eh.path || e.id
FROM employees e
JOIN employee_hierarchy eh ON e.manager_id = eh.employee_id
);
-- Query the recursive view
SELECT
REPEAT(' ', level - 1) || employee_name AS indented_name,
level,
array_length(path, 1) AS depth
FROM employee_hierarchy
ORDER BY path;
View Performance Considerations
Views with Indexes
-- You can't create indexes directly on views, but you can:
-- 1. Create indexes on underlying tables to speed up view queries
CREATE INDEX idx_employees_active_salary ON employees(is_active, salary) WHERE is_active = true;
-- 2. Use materialized views with indexes
CREATE MATERIALIZED VIEW expensive_calculation AS
SELECT
product_id,
SUM(quantity * unit_price) AS total_revenue,
COUNT(*) AS order_count,
AVG(unit_price) AS avg_price
FROM order_items
GROUP BY product_id;
CREATE INDEX idx_expensive_calc_revenue ON expensive_calculation(total_revenue);
CREATE INDEX idx_expensive_calc_product ON expensive_calculation(product_id);
View Best Practices
- Use views for commonly used queries to reduce code duplication
- Implement security layers with views instead of exposing tables directly
- Keep view logic simple for better performance
- Use materialized views for expensive aggregations
- Document view purposes and refresh schedules
- Monitor view performance and optimize underlying queries
-- Example: Layered view approach
-- Base view (simple filter)
CREATE VIEW base_active_employees AS
SELECT * FROM employees WHERE is_active = true;
-- Business logic view (builds on base view)
CREATE VIEW employee_performance_summary AS
SELECT
bae.*,
CASE
WHEN salary >= 80000 THEN 'High Performer'
WHEN salary >= 60000 THEN 'Good Performer'
ELSE 'Standard Performer'
END AS performance_category,
EXTRACT(YEAR FROM AGE(CURRENT_DATE, hire_date)) AS tenure_years
FROM base_active_employees bae;
-- Reporting view (builds on business logic view)
CREATE VIEW department_performance_report AS
SELECT
d.name AS department,
eps.performance_category,
COUNT(*) AS employee_count,
AVG(eps.salary) AS avg_salary,
AVG(eps.tenure_years) AS avg_tenure
FROM employee_performance_summary eps
JOIN departments d ON eps.department_id = d.id
GROUP BY d.name, eps.performance_category
ORDER BY d.name, eps.performance_category;
Security & Best Practices
Database security is critical for protecting sensitive data and ensuring system integrity. PostgreSQL provides robust security features that should be properly configured and maintained.
User Management & Authentication
Creating Users and Roles
-- Create roles (groups of permissions)
CREATE ROLE app_read;
CREATE ROLE app_write;
CREATE ROLE app_admin;
CREATE ROLE hr_staff;
CREATE ROLE finance_team;
-- Create users with passwords
CREATE USER app_user WITH PASSWORD 'secure_random_password_123!';
CREATE USER readonly_user WITH PASSWORD 'another_secure_password_456!';
CREATE USER jane_hr WITH PASSWORD 'hr_secure_pass_789!';
-- Assign roles to users
GRANT app_write TO app_user;
GRANT app_read TO readonly_user;
GRANT hr_staff TO jane_hr;
-- Users can inherit multiple roles
GRANT app_read, hr_staff TO jane_hr;
Role-Based Permissions
-- Schema-level permissions
GRANT USAGE ON SCHEMA public TO app_read;
GRANT USAGE ON SCHEMA hr TO hr_staff;
GRANT CREATE ON SCHEMA public TO app_admin;
-- Table-level permissions
GRANT SELECT ON ALL TABLES IN SCHEMA public TO app_read;
GRANT SELECT, INSERT, UPDATE ON ALL TABLES IN SCHEMA public TO app_write;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO app_admin;
-- Column-level permissions (limit access to sensitive data)
GRANT SELECT (id, first_name, last_name, email) ON employees TO hr_staff;
-- hr_staff cannot see salary, ssn, or other sensitive columns
-- Sequence permissions (needed for SERIAL columns)
GRANT USAGE ON ALL SEQUENCES IN SCHEMA public TO app_write;
-- Function permissions
GRANT EXECUTE ON FUNCTION calculate_employee_bonus(INTEGER) TO hr_staff;
Default Permissions for Future Objects
-- Set default permissions for objects created in the future
ALTER DEFAULT PRIVILEGES IN SCHEMA public
GRANT SELECT ON TABLES TO app_read;
ALTER DEFAULT PRIVILEGES IN SCHEMA public
GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO app_write;
ALTER DEFAULT PRIVILEGES IN SCHEMA public
GRANT USAGE ON SEQUENCES TO app_write;
Row Level Security (RLS)
RLS allows you to control which rows a user can see or modify:
-- Enable RLS on a table. Note: the table owner bypasses RLS unless you also run
-- ALTER TABLE employees FORCE ROW LEVEL SECURITY; superusers and BYPASSRLS roles always bypass it.
ALTER TABLE employees ENABLE ROW LEVEL SECURITY;
-- Policy: Users can only see employees in their department
CREATE POLICY department_employees ON employees
FOR SELECT
TO app_user
USING (department_id = (
SELECT department_id
FROM employees
WHERE email = current_user || '@company.com'
));
-- Policy: HR staff can see all employees
CREATE POLICY hr_full_access ON employees
FOR ALL
TO hr_staff
USING (true);
-- Policy: Employees can only update their own record
CREATE POLICY self_update ON employees
FOR UPDATE
TO app_user
USING (email = current_user || '@company.com')
WITH CHECK (email = current_user || '@company.com');
-- Policy with time-based restrictions
CREATE POLICY business_hours_update ON employees
FOR UPDATE
USING (
EXTRACT(hour FROM CURRENT_TIME) BETWEEN 9 AND 17
AND EXTRACT(dow FROM CURRENT_DATE) BETWEEN 1 AND 5
);
Advanced RLS Examples
-- Multi-tenant application: Users only see their organization's data
CREATE POLICY tenant_isolation ON orders
FOR ALL
USING (
organization_id = (
SELECT organization_id
FROM users
WHERE username = current_user
)
);
-- Hierarchical access: Managers can see subordinates' data
CREATE POLICY manager_access ON performance_reviews
FOR SELECT
USING (
employee_id IN (
-- Recursive query to find all subordinates
WITH RECURSIVE subordinates AS (
SELECT id FROM employees WHERE email = current_user || '@company.com'
UNION ALL
SELECT e.id FROM employees e
JOIN subordinates s ON e.manager_id = s.id
)
SELECT id FROM subordinates
)
);
Data Encryption
Column-Level Encryption
-- Install pgcrypto extension
CREATE EXTENSION IF NOT EXISTS pgcrypto;
-- Encrypt sensitive data
CREATE TABLE customer_sensitive (
id SERIAL PRIMARY KEY,
name VARCHAR(100),
email VARCHAR(100),
ssn_encrypted BYTEA, -- Store encrypted data
credit_card_encrypted BYTEA
);
-- Insert encrypted data
INSERT INTO customer_sensitive (name, email, ssn_encrypted, credit_card_encrypted)
VALUES (
'John Doe',
'john@email.com',
pgp_sym_encrypt('123-45-6789', 'encryption_key_for_ssn'),
pgp_sym_encrypt('4532-1234-5678-9012', 'encryption_key_for_cc')
);
-- Decrypt data (only for authorized users)
SELECT
name,
email,
CASE
WHEN current_user IN ('hr_manager', 'admin')
THEN pgp_sym_decrypt(ssn_encrypted, 'encryption_key_for_ssn')
ELSE '***-**-****'
END AS ssn,
CASE
WHEN current_user = 'finance_manager'
THEN pgp_sym_decrypt(credit_card_encrypted, 'encryption_key_for_cc')
ELSE '****-****-****-****'
END AS credit_card
FROM customer_sensitive;
Transparent Data Encryption (TDE)
Community PostgreSQL does not include built-in transparent data encryption. To encrypt data at rest, rely on filesystem- or volume-level encryption (for example LUKS/dm-crypt or encrypted cloud volumes), or use a vendor distribution that adds TDE. Combine encryption at rest with pgcrypto for sensitive columns and SSL/TLS for data in transit.
Connection Security
SSL/TLS Configuration
-- Check SSL status
SHOW ssl;
-- Require SSL for specific connections in pg_hba.conf
-- (PostgreSQL has no ALTER USER ... REQUIRE SSL; use a hostssl rule instead):
-- hostssl company_db app_user 192.168.1.0/24 scram-sha-256
-- In postgresql.conf:
-- ssl = on
-- ssl_cert_file = 'server.crt'
-- ssl_key_file = 'server.key'
-- ssl_ca_file = 'ca.crt'
Connection Limits and Authentication
-- Set connection limits
ALTER USER app_user CONNECTION LIMIT 50;
-- Set session timeout
ALTER USER app_user SET statement_timeout = '30min';
ALTER USER app_user SET idle_in_transaction_session_timeout = '10min';
-- In pg_hba.conf (PostgreSQL Host-Based Authentication):
-- host mydb app_user 192.168.1.0/24 md5
-- host mydb admin 127.0.0.1/32 cert
-- local all all peer
Audit and Monitoring
Built-in Logging
-- In postgresql.conf:
-- log_destination = 'csvlog'
-- logging_collector = on
-- log_directory = 'pg_log'
-- log_statement = 'all' -- or 'ddl', 'mod', 'none'
-- log_min_duration_statement = 1000 -- Log slow queries (1 second)
-- Log connections and disconnections
-- log_connections = on
-- log_disconnections = on
-- Log lock waits
-- log_lock_waits = on
-- deadlock_timeout = 1s
Custom Audit Logging
-- Create comprehensive audit table
CREATE TABLE audit_trail (
id BIGSERIAL PRIMARY KEY,
username TEXT NOT NULL DEFAULT current_user,
timestamp TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
client_ip INET,
database_name TEXT NOT NULL DEFAULT current_database(),
schema_name TEXT,
table_name TEXT,
operation TEXT NOT NULL,
object_id TEXT,
old_values JSONB,
new_values JSONB,
query_text TEXT,
application_name TEXT
);
-- Audit trigger function
CREATE OR REPLACE FUNCTION comprehensive_audit_trigger()
RETURNS TRIGGER AS $$
DECLARE
client_ip TEXT;
app_name TEXT;
BEGIN
-- Get client information
SELECT inet_client_addr() INTO client_ip;
SELECT application_name FROM pg_stat_activity
WHERE pid = pg_backend_pid() INTO app_name;
IF TG_OP = 'DELETE' THEN
INSERT INTO audit_trail (
schema_name, table_name, operation, object_id,
old_values, client_ip, application_name
) VALUES (
TG_TABLE_SCHEMA, TG_TABLE_NAME, TG_OP, OLD.id::TEXT,
row_to_json(OLD)::JSONB, client_ip::INET, app_name
);
RETURN OLD;
ELSIF TG_OP = 'UPDATE' THEN
INSERT INTO audit_trail (
schema_name, table_name, operation, object_id,
old_values, new_values, client_ip, application_name
) VALUES (
TG_TABLE_SCHEMA, TG_TABLE_NAME, TG_OP, NEW.id::TEXT,
row_to_json(OLD)::JSONB, row_to_json(NEW)::JSONB,
client_ip::INET, app_name
);
RETURN NEW;
ELSIF TG_OP = 'INSERT' THEN
INSERT INTO audit_trail (
schema_name, table_name, operation, object_id,
new_values, client_ip, application_name
) VALUES (
TG_TABLE_SCHEMA, TG_TABLE_NAME, TG_OP, NEW.id::TEXT,
row_to_json(NEW)::JSONB, client_ip::INET, app_name
);
RETURN NEW;
END IF;
RETURN NULL;
END;
$$ LANGUAGE plpgsql;
Database Security Best Practices
1. Principle of Least Privilege
-- Bad: Giving unnecessary permissions
GRANT ALL ON ALL TABLES IN SCHEMA public TO app_user;
-- Good: Only necessary permissions
GRANT SELECT, INSERT, UPDATE ON specific_tables TO app_user;
GRANT SELECT ON read_only_tables TO app_user; -- read-only access
2. Regular Security Maintenance
-- Regular password rotation
ALTER USER app_user WITH PASSWORD 'new_secure_password';
-- Review and cleanup unused accounts
SELECT usename, valuntil FROM pg_user WHERE valuntil < CURRENT_DATE;
DROP USER IF EXISTS old_unused_user;
-- Monitor failed authentication attempts via the server log
-- (with log_connections = on, failed logins appear as FATAL "password authentication failed" entries)
3. Backup Security
-- Encrypt backups
-- pg_dump --format=custom --compress=9 --file=backup.dump database_name
-- gpg --cipher-algo AES256 --compress-algo 1 --symmetric backup.dump
-- Test backup restoration regularly in isolated environment
-- pg_restore --clean --create --dbname=postgres backup.dump
4. Input Validation and SQL Injection Prevention
-- Bad: Dynamic SQL vulnerable to injection
-- query = "SELECT * FROM users WHERE email = '" + user_input + "'";
-- Good: Parameterized queries
-- PREPARE get_user (TEXT) AS SELECT * FROM users WHERE email = $1;
-- EXECUTE get_user('user@email.com');
-- Good: Using functions with proper parameter types
CREATE OR REPLACE FUNCTION get_user_by_email(email_param TEXT)
RETURNS TABLE(id INTEGER, name TEXT, email TEXT) AS $$
BEGIN
RETURN QUERY
SELECT u.id, u.name, u.email
FROM users u
WHERE u.email = email_param; -- Safe parameter binding
END;
$$ LANGUAGE plpgsql;
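Usage (the email value is illustrative):
SELECT * FROM get_user_by_email('user@email.com');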
Performance Optimization
Database performance optimization is an ongoing process that requires understanding your data, queries, and system resources. Here’s a comprehensive guide to PostgreSQL performance tuning.
Query Optimization Strategies
Understanding EXPLAIN Output
-- Basic EXPLAIN
EXPLAIN SELECT * FROM employees WHERE department_id = 1;
-- Detailed analysis with actual execution data
EXPLAIN (ANALYZE, BUFFERS, VERBOSE, COSTS, FORMAT JSON)
SELECT
e.first_name,
e.last_name,
d.name AS department_name
FROM employees e
JOIN departments d ON e.department_id = d.id
WHERE e.salary > 50000
ORDER BY e.salary DESC;
Key metrics to watch:
- Cost: PostgreSQL’s estimate of query expense
- Actual Time: Real execution time
- Rows: Estimated vs actual row counts
- Buffers: Memory usage (shared hit = good, read = disk I/O)
- Loops: How many times each operation repeated
Query Rewriting for Performance
-- Slow: Subquery in SELECT
SELECT
e.first_name,
e.last_name,
(SELECT COUNT(*) FROM orders WHERE employee_id = e.id) AS order_count
FROM employees e;
-- Fast: JOIN with aggregation
SELECT
e.first_name,
e.last_name,
COALESCE(o.order_count, 0) AS order_count
FROM employees e
LEFT JOIN (
SELECT employee_id, COUNT(*) AS order_count
FROM orders
GROUP BY employee_id
) o ON e.id = o.employee_id;
-- Slow: OR conditions
SELECT * FROM employees WHERE department_id = 1 OR department_id = 2;
-- Fast: IN clause
SELECT * FROM employees WHERE department_id IN (1, 2);
-- Slow: Function in WHERE clause (not sargable)
SELECT * FROM employees WHERE UPPER(last_name) = 'SMITH';
-- Fast: Functional index + proper query
CREATE INDEX idx_employees_upper_lastname ON employees(UPPER(last_name));
SELECT * FROM employees WHERE UPPER(last_name) = 'SMITH';
PostgreSQL Configuration Tuning
Memory Configuration
-- Check current settings
SHOW shared_buffers;
SHOW work_mem;
SHOW maintenance_work_mem;
SHOW effective_cache_size;
-- In postgresql.conf (restart required for shared_buffers):
-- shared_buffers = 256MB -- 25% of RAM for dedicated DB server
-- work_mem = 4MB -- Per operation per connection
-- maintenance_work_mem = 64MB -- For maintenance operations
-- effective_cache_size = 1GB -- OS + PostgreSQL cache estimate
Checkpoint and WAL Configuration
-- WAL (Write-Ahead Logging) settings
-- wal_buffers = 16MB
-- checkpoint_completion_target = 0.9
-- checkpoint_timeout = 5min
-- max_wal_size = 1GB
-- min_wal_size = 80MB
-- Check checkpoint frequency
SELECT * FROM pg_stat_bgwriter;
Connection and Resource Limits
-- max_connections = 100 -- Adjust based on your needs
-- shared_preload_libraries = 'pg_stat_statements'
-- Enable query statistics collection
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
Index Optimization Strategies
Composite Index Optimization
-- Analyze query patterns (in PostgreSQL 13+ the timing columns are total_exec_time and mean_exec_time)
SELECT
query,
calls,
total_time,
mean_time,
rows
FROM pg_stat_statements
WHERE query LIKE '%employees%'
ORDER BY total_time DESC
LIMIT 10;
-- Create optimal composite indexes
-- For queries filtering by department and salary range:
CREATE INDEX idx_employees_dept_salary_optimized
ON employees(department_id, salary)
WHERE is_active = true;
-- For queries ordering by salary within departments:
CREATE INDEX idx_employees_dept_salary_ordered
ON employees(department_id, salary DESC);
Index Maintenance
-- Check index bloat
SELECT
schemaname,
tablename,
indexname,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
idx_scan,
idx_tup_read,
idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC;
-- Rebuild bloated indexes
REINDEX INDEX CONCURRENTLY idx_employees_email;
-- Find duplicate indexes (same table and same indexed column list)
SELECT
indrelid::regclass AS table_name,
array_agg(indexrelid::regclass) AS duplicate_indexes,
indkey
FROM pg_index
GROUP BY indrelid, indkey
HAVING COUNT(*) > 1;
Table Optimization
Partitioning for Large Tables
-- Range partitioning by date
CREATE TABLE orders_partitioned (
id SERIAL,
customer_id INTEGER,
order_date DATE NOT NULL,
total_amount DECIMAL(10,2)
) PARTITION BY RANGE (order_date);
-- Create partitions
CREATE TABLE orders_2024_q1 PARTITION OF orders_partitioned
FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');
CREATE TABLE orders_2024_q2 PARTITION OF orders_partitioned
FOR VALUES FROM ('2024-04-01') TO ('2024-07-01');
-- Automatic partition creation function
CREATE OR REPLACE FUNCTION create_quarterly_partition(table_name TEXT, start_date DATE)
RETURNS VOID AS $$
DECLARE
end_date DATE;
partition_name TEXT;
BEGIN
end_date := start_date + INTERVAL '3 months';
partition_name := table_name || '_' || to_char(start_date, 'YYYY_Q"Q"');
EXECUTE format('CREATE TABLE %I PARTITION OF %I FOR VALUES FROM (%L) TO (%L)',
partition_name, table_name, start_date, end_date);
-- Create indexes on the partition
EXECUTE format('CREATE INDEX %I ON %I (customer_id)',
'idx_' || partition_name || '_customer', partition_name);
END;
$$ LANGUAGE plpgsql;
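Usage sketch for the helper above, assuming the partitioned orders table defined earlier:
-- Create the Q3 2024 partition (and its customer_id index)
SELECT create_quarterly_partition('orders_partitioned', DATE '2024-07-01');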
Table Maintenance
-- Regular VACUUM and ANALYZE
VACUUM ANALYZE employees;
-- Check table bloat
SELECT
schemaname,
tablename,
n_tup_ins,
n_tup_upd,
n_tup_del,
n_dead_tup,
last_vacuum,
last_autovacuum,
last_analyze,
last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
-- Aggressive cleanup for heavily updated tables
VACUUM FULL employees; -- Requires exclusive lock, use carefully
Connection Pooling
// Application-level connection pooling (Node.js example)
const { Pool } = require('pg');
const pool = new Pool({
user: 'app_user',
host: 'localhost',
database: 'company_db',
password: 'password',
port: 5432,
// Pool configuration (node-postgres option names)
max: 20,                        // Maximum connections in the pool
idleTimeoutMillis: 10000,       // Close idle clients after 10 seconds
connectionTimeoutMillis: 30000  // Give up acquiring a connection after 30 seconds
});
// Using connection from pool
async function getEmployees() {
const client = await pool.connect();
try {
const result = await client.query('SELECT * FROM employees LIMIT 10');
return result.rows;
} finally {
client.release(); // Return connection to pool
}
}
Monitoring and Maintenance
Performance Monitoring Queries
-- Active queries and their duration
SELECT
pid,
now() - pg_stat_activity.query_start AS duration,
query,
state,
wait_event
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes'
ORDER BY duration DESC;
-- Table sizes and access patterns
SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size,
seq_scan,
seq_tup_read,
idx_scan,
idx_tup_fetch,
n_tup_ins,
n_tup_upd,
n_tup_del
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
-- Lock monitoring
SELECT
t.relname,
l.locktype,
l.mode,
l.granted,
a.usename,
a.query,
a.query_start
FROM pg_locks l
JOIN pg_stat_all_tables t ON l.relation = t.relid
JOIN pg_stat_activity a ON l.pid = a.pid
WHERE NOT l.granted
ORDER BY a.query_start;
Automated Maintenance
-- Function for automated table maintenance
CREATE OR REPLACE FUNCTION perform_maintenance()
RETURNS VOID AS $$
DECLARE
table_record RECORD;
BEGIN
-- Update statistics for all tables
FOR table_record IN
SELECT schemaname, tablename
FROM pg_tables
WHERE schemaname = 'public'
LOOP
EXECUTE 'ANALYZE ' || quote_ident(table_record.schemaname) || '.' || quote_ident(table_record.tablename);
END LOOP;
-- Vacuum tables with high dead tuple ratio
FOR table_record IN
SELECT schemaname, tablename
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
AND (n_dead_tup::FLOAT / GREATEST(n_live_tup, 1)) > 0.1
LOOP
EXECUTE 'VACUUM ' || quote_ident(table_record.schemaname) || '.' || quote_ident(table_record.tablename);
END LOOP;
RAISE NOTICE 'Maintenance completed at %', CURRENT_TIMESTAMP;
END;
$$ LANGUAGE plpgsql;
-- Schedule with pg_cron extension (if available)
-- SELECT cron.schedule('maintenance', '0 2 * * *', 'SELECT perform_maintenance();');
Monitoring & Maintenance
Effective database monitoring and maintenance are crucial for ensuring optimal performance, preventing data loss, and maintaining system reliability.
System Monitoring
Key Metrics to Monitor
-- Database size and growth
SELECT
datname,
pg_size_pretty(pg_database_size(datname)) AS size,
pg_database_size(datname) AS size_bytes
FROM pg_database
ORDER BY pg_database_size(datname) DESC;
-- Connection statistics
SELECT
datname,
numbackends AS current_connections,
xact_commit,
xact_rollback,
blks_read,
blks_hit,
tup_returned,
tup_fetched,
tup_inserted,
tup_updated,
tup_deleted
FROM pg_stat_database
WHERE datname = current_database();
-- Cache hit ratio (should be > 95%)
SELECT
'cache hit ratio' AS metric,
(blks_hit::FLOAT / NULLIF(blks_hit + blks_read, 0)) * 100 AS percentage
FROM pg_stat_database
WHERE datname = current_database();
Performance Monitoring
-- Slowest queries (in PostgreSQL 13+ the columns are total_exec_time / mean_exec_time)
SELECT
query,
calls,
total_time,
total_time / calls AS avg_time,
rows,
100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;
-- Most frequently called queries
SELECT
query,
calls,
total_time,
total_time / calls AS avg_time
FROM pg_stat_statements
ORDER BY calls DESC
LIMIT 10;
-- Index usage efficiency
SELECT
schemaname,
tablename,
indexname,
idx_scan,
idx_tup_read,
idx_tup_fetch,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan < 100 -- Potentially unused indexes
ORDER BY pg_relation_size(indexrelid) DESC;
Backup and Recovery
Logical Backups with pg_dump
# Full database backup
pg_dump -h localhost -U username -d database_name \
--format=custom \
--compress=9 \
--verbose \
--file=backup_$(date +%Y%m%d_%H%M%S).dump
# Schema-only backup
pg_dump -h localhost -U username -d database_name \
--schema-only \
--file=schema_backup.sql
# Data-only backup
pg_dump -h localhost -U username -d database_name \
--data-only \
--format=custom \
--file=data_backup.dump
# Specific tables backup
pg_dump -h localhost -U username -d database_name \
--table=employees \
--table=departments \
--file=critical_tables_backup.sql
# Backup with exclusions
pg_dump -h localhost -U username -d database_name \
--exclude-table=logs \
--exclude-table=temp_* \
--file=backup_without_logs.dump
Physical Backups with pg_basebackup
# Base backup for point-in-time recovery
pg_basebackup -h localhost -U replication_user \
--pgdata=/backup/base \
--format=tar \
--compress=9 \
--progress \
--verbose
# Streaming backup (for large databases)
pg_basebackup -h localhost -U replication_user \
--pgdata=/backup/streaming \
--wal-method=stream \
--progress
Point-in-Time Recovery (PITR) Setup
-- Configure WAL archiving in postgresql.conf:
-- wal_level = replica
-- archive_mode = on
-- archive_command = 'cp %p /archive/wal/%f'
-- max_wal_senders = 3
-- wal_keep_segments = 64 -- (replaced by wal_keep_size in PostgreSQL 13+)
-- Create a recovery scenario
-- 1. Take base backup
-- 2. Archive WAL files continuously
-- 3. If disaster occurs, restore base backup and replay WAL files
-- Recovery command example:
-- restore_command = 'cp /archive/wal/%f %p'
-- recovery_target_time = '2024-01-15 14:30:00'
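A hedged sketch of the restore side (paths follow the examples above; a PostgreSQL 12+ layout with recovery.signal is assumed):
-- 1. Stop the server and replace the data directory with the extracted base backup
-- 2. Create an empty recovery.signal file in the data directory
-- 3. In postgresql.conf set:
--      restore_command = 'cp /archive/wal/%f %p'
--      recovery_target_time = '2024-01-15 14:30:00'
-- 4. Start the server; it replays archived WAL up to the target time,
--    then pauses or promotes according to recovery_target_action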
Automated Backup Script
#!/bin/bash
# automated_backup.sh
# Configuration
DB_NAME="company_db"
DB_USER="backup_user"
DB_HOST="localhost"
BACKUP_DIR="/backups"
RETENTION_DAYS=30
# Create backup directory with timestamp
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_PATH="$BACKUP_DIR/$DB_NAME/$TIMESTAMP"
mkdir -p "$BACKUP_PATH"
# Perform backup
echo "Starting backup of $DB_NAME at $(date)"
pg_dump -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" \
--format=custom \
--compress=9 \
--verbose \
--file="$BACKUP_PATH/database.dump" 2>&1 | tee "$BACKUP_PATH/backup.log"
# Check backup success
if [ ${PIPESTATUS[0]} -eq 0 ]; then
echo "Backup completed successfully"
# Calculate backup size
BACKUP_SIZE=$(du -h "$BACKUP_PATH/database.dump" | cut -f1)
echo "Backup size: $BACKUP_SIZE"
# Test backup integrity
pg_restore --list "$BACKUP_PATH/database.dump" > /dev/null
if [ $? -eq 0 ]; then
echo "Backup integrity verified"
else
echo "WARNING: Backup integrity check failed"
fi
else
echo "ERROR: Backup failed"
exit 1
fi
# Cleanup old backups
find "$BACKUP_DIR/$DB_NAME" -type d -mtime +$RETENTION_DAYS -exec rm -rf {} +
echo "Cleaned up backups older than $RETENTION_DAYS days"
# Send notification (optional)
echo "Backup completed for $DB_NAME on $(date)" | mail -s "Database Backup Status" admin@company.com
Regular Maintenance Tasks
Automated Maintenance Script
-- Comprehensive maintenance function
CREATE OR REPLACE FUNCTION database_maintenance(
vacuum_threshold FLOAT DEFAULT 0.1,
analyze_threshold FLOAT DEFAULT 0.1
)
RETURNS TEXT AS $$
DECLARE
table_record RECORD;
maintenance_log TEXT := '';
start_time TIMESTAMP;
end_time TIMESTAMP;
BEGIN
start_time := CURRENT_TIMESTAMP;
maintenance_log := 'Database maintenance started at ' || start_time || E'\n';
-- Update table statistics
maintenance_log := maintenance_log || 'Updating statistics...' || E'\n';
FOR table_record IN
SELECT schemaname, relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE (n_dead_tup::FLOAT / GREATEST(n_live_tup + n_dead_tup, 1)) >= analyze_threshold
LOOP
EXECUTE 'ANALYZE ' || quote_ident(table_record.schemaname) || '.' || quote_ident(table_record.relname);
maintenance_log := maintenance_log || ' Analyzed: ' || table_record.schemaname || '.' || table_record.relname || E'\n';
END LOOP;
-- Identify tables with a high dead-tuple ratio.
-- Note: VACUUM cannot be executed from inside a function (it is not allowed
-- within a transaction block), so the commands are only recorded here;
-- run them separately, e.g. from psql or as individual pg_cron jobs.
maintenance_log := maintenance_log || 'Tables needing VACUUM:' || E'\n';
FOR table_record IN
SELECT schemaname, relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
AND (n_dead_tup::FLOAT / GREATEST(n_live_tup, 1)) >= vacuum_threshold
LOOP
maintenance_log := maintenance_log || ' VACUUM ' || quote_ident(table_record.schemaname) || '.' || quote_ident(table_record.relname) ||
'; -- dead tuples: ' || table_record.n_dead_tup || E'\n';
END LOOP;
-- Reindex fragmented indexes
maintenance_log := maintenance_log || 'Checking index fragmentation...' || E'\n';
-- This would require additional logic to detect fragmentation
-- Query statistics: pg_stat_statements is a view, so rows cannot be deleted from it;
-- call pg_stat_statements_reset() separately when the statistics should be cleared.
maintenance_log := maintenance_log || 'Skipped pg_stat_statements cleanup (use pg_stat_statements_reset())' || E'\n';
end_time := CURRENT_TIMESTAMP;
maintenance_log := maintenance_log || 'Maintenance completed at ' || end_time || E'\n';
maintenance_log := maintenance_log || 'Total duration: ' || (end_time - start_time) || E'\n';
-- Log maintenance activity
INSERT INTO maintenance_log (log_date, log_message)
VALUES (CURRENT_TIMESTAMP, maintenance_log);
RETURN maintenance_log;
END;
$$ LANGUAGE plpgsql;
-- Create maintenance log table
CREATE TABLE IF NOT EXISTS maintenance_log (
id SERIAL PRIMARY KEY,
log_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
log_message TEXT
);
-- Schedule maintenance (example with pg_cron extension)
-- SELECT cron.schedule('weekly-maintenance', '0 2 * * 0', 'SELECT database_maintenance();');
Health Check Queries
-- Database health check function
CREATE OR REPLACE FUNCTION database_health_check()
RETURNS TABLE(
check_name TEXT,
status TEXT,
value TEXT,
recommendation TEXT
) AS $$
BEGIN
-- Check database size
RETURN QUERY
SELECT
'Database Size'::TEXT,
CASE WHEN pg_database_size(current_database()) > 100 * 1024^3 THEN 'WARNING' ELSE 'OK' END,
pg_size_pretty(pg_database_size(current_database())),
CASE WHEN pg_database_size(current_database()) > 100 * 1024^3
THEN 'Consider partitioning large tables or archiving old data'
ELSE 'Database size is manageable' END;
-- Check cache hit ratio
RETURN QUERY
SELECT
'Cache Hit Ratio'::TEXT,
CASE WHEN (blks_hit::FLOAT / NULLIF(blks_hit + blks_read, 0)) * 100 < 95 THEN 'WARNING' ELSE 'OK' END,
ROUND((blks_hit::FLOAT / NULLIF(blks_hit + blks_read, 0)) * 100, 2)::TEXT || '%',
CASE WHEN (blks_hit::FLOAT / NULLIF(blks_hit + blks_read, 0)) * 100 < 95
THEN 'Consider increasing shared_buffers'
ELSE 'Cache performance is good' END
FROM pg_stat_database WHERE datname = current_database();
-- Check for long-running queries
RETURN QUERY
SELECT
'Long Running Queries'::TEXT,
CASE WHEN COUNT(*) > 0 THEN 'WARNING' ELSE 'OK' END,
COUNT(*)::TEXT || ' queries > 30 minutes',
CASE WHEN COUNT(*) > 0
THEN 'Review long-running queries for optimization'
ELSE 'No problematic long-running queries' END
FROM pg_stat_activity
WHERE state = 'active'
AND query_start < CURRENT_TIMESTAMP - INTERVAL '30 minutes'
AND query NOT LIKE '%pg_stat_activity%';
-- Check for unused indexes
RETURN QUERY
SELECT
'Unused Indexes'::TEXT,
CASE WHEN COUNT(*) > 5 THEN 'WARNING' ELSE 'OK' END,
COUNT(*)::TEXT || ' potentially unused indexes',
CASE WHEN COUNT(*) > 5
THEN 'Review and consider dropping unused indexes'
ELSE 'Index usage looks reasonable' END
FROM pg_stat_user_indexes
WHERE idx_scan < 100;
-- Check table bloat
RETURN QUERY
SELECT
'Table Bloat'::TEXT,
CASE WHEN COUNT(*) > 0 THEN 'WARNING' ELSE 'OK' END,
COUNT(*)::TEXT || ' tables with high dead tuple ratio',
CASE WHEN COUNT(*) > 0
THEN 'Run VACUUM on bloated tables'
ELSE 'Table bloat is under control' END
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
AND (n_dead_tup::FLOAT / GREATEST(n_live_tup, 1)) > 0.1;
END;
$$ LANGUAGE plpgsql;
-- Run health check
SELECT * FROM database_health_check();
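For monitoring scripts, it is often enough to surface only the checks that need attention:
-- Show only failing checks
SELECT check_name, value, recommendation
FROM database_health_check()
WHERE status = 'WARNING';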
Conclusion
This comprehensive PostgreSQL guide covers the essential knowledge every developer needs to work effectively with relational databases. From basic SQL operations to advanced performance optimization, you now have the tools to design, implement, and maintain robust database systems.
Key Takeaways
Database Design:
- Always start with proper data modeling and normalization
- Use appropriate data types and constraints
- Design relationships carefully with foreign keys
- Plan for scalability from the beginning
SQL Mastery:
- Master fundamental operations (SELECT, INSERT, UPDATE, DELETE)
- Understand joins and when to use each type
- Use indexes strategically for performance
- Leverage advanced features like window functions and CTEs
Performance and Security:
- Monitor query performance with EXPLAIN
- Implement proper indexing strategies
- Use transactions appropriately for data consistency
- Follow security best practices with role-based access
Operations and Maintenance:
- Establish regular backup and recovery procedures
- Monitor database health and performance metrics
- Perform routine maintenance tasks
- Plan for disaster recovery scenarios
Advanced Topics to Explore Further
As you continue your PostgreSQL journey, consider diving deeper into:
- Replication and High Availability: Primary/standby setups, streaming replication, failover
- Partitioning Strategies: Range, hash, and list partitioning for large datasets
- Extensions: PostGIS for geographic data, pg_cron for scheduling, TimescaleDB for time-series
- Performance Tuning: Advanced configuration, query optimization, hardware considerations
- Database Administration: Cluster management, monitoring tools, automation
Resources for Continued Learning
- Official PostgreSQL Documentation: Comprehensive and always up-to-date
- PostgreSQL Performance Tuning: Books and online courses on optimization
- Community Forums: Stack Overflow, PostgreSQL mailing lists, Reddit communities
- Monitoring Tools: pgAdmin, DataDog, New Relic for database monitoring
- Practice Platforms: SQLBolt, HackerRank, LeetCode for SQL practice
Remember that database expertise comes with practice and experience. Start with the fundamentals covered in this guide, then gradually tackle more complex scenarios as you encounter them in real-world projects. PostgreSQL’s extensive feature set and active community make it an excellent choice for applications of any scale.
Whether you’re building a small web application or an enterprise data platform, the concepts and techniques in this guide will serve as a solid foundation for your database development journey.