Reliability and Recovery

Overview

How a DBMS keeps the database consistent despite failures.
How transaction states, logging and backups interact.
How Oracle uses REDO, UNDO and Flashback for recovery.

1. Role of the DBMS in Failure Management

Recovery
- After a failure, restore the database to a correct state, as if the failure had not happened.

Example – bank transfer

UPDATE comptes
SET solde = solde - 100
WHERE numero_compte = '12345';

UPDATE comptes
SET solde = solde + 100
WHERE numero_compte = '67890';

COMMIT;

Either both updates must take effect, or none (transaction is atomic).

2. Types of Failures

2.1 Transaction Failures

Transaction failures affect a single transaction.
Causes include:
- Logical errors (divide by zero, constraint violations),
- Locking problems (deadlocks, timeouts),
- Resource exhaustion (memory, disk limits).
The DBMS responds by rolling back the failing transaction.

2.2 System and Media Failures

System failures
DBMS or operating system crashes, power loss, software bugs, overload.
These typically affect many transactions; memory contents may be lost.

Media failures
Disk crash or corruption, possibly causing loss of entire datafiles or even the whole database.

2.3 Network Failures

Network outage or disconnection of clients,
Transactions may be left incomplete from the client’s perspective; the DBMS must treat them consistently.

3. Transaction States and Database State

3.1 Transaction States

Four main states:

Active
- Transaction is executing and modifying/reading data.

Partially committed
- All operations finished, but commit has not yet been made durable.

Committed
- All changes have been permanently recorded.
- Ensures durability.

Aborted
- Transaction was rolled back; its effects must be removed.

3.2 Before and After Images

BFIM (Before Image)
Old value of data before a modification.

AFIM (After Image)
New value after modification.

Key rules:
The database state on disk must always be well‑defined.
Before commit, BFIM must be preserved and AFIM may still be only in memory or in the log.
At commit, AFIM replaces BFIM atomically (“all or nothing”).

3.3 Data Writing Strategies

Immediate (in‑place) updates
Changes are written directly to disk blocks as the transaction runs.
Pros: durability, fast persistence.
Cons: more I/O; recovery relies heavily on logging.

Deferred (out‑of‑place) updates
Changes are staged in memory or separate structures and applied on commit.
Pros: performance, flexibility.
Cons: more complex recovery, risk of losing staged updates on crash.

4. Recovery Process

General idea:
- Use a log to determine which operations must be redone or undone after a failure.

Steps:

Analysis
Scan the log to identify:
- Transactions that committed,
- Transactions that were active at failure time.
Redo
Re‑apply operations of committed transactions that are not yet fully reflected on disk.
Undo
Roll back operations of transactions that were not committed.

These steps preserve atomicity and durability.

5. Recovery Utilities

5.1 Logging

Forward (redo) logging
- Log contains enough information to redo a change.
- Changes are written to the log before being applied to the database.

Backward (undo) logging
- Log contains enough information to undo a change.
- Used to roll back uncommitted transactions.

In practice, DBMSs combine ideas from both.

5.2 Backups

Full backup
- Copy of the entire database at a point in time.
- Easy to restore, but large and time‑consuming.

Incremental backup
- Copies only data changed since the last backup (full or incremental).
- Efficient to take, but restore may require applying a chain of backups.

Differential backup
- Copies data changed since the last full backup.
- Restore usually needs only the last full backup plus the last differential.
- Video primer: Full vs incremental vs differential backups

5.3 Replication

Maintain copies of data on multiple servers.
Transactional replication: propagate committed transactions.
Log‑based replication: ship redo logs to other servers.
Trigger‑based replication: use triggers to capture and forward changes.

Benefits:
- Higher availability and fault tolerance,
- Possibility of load balancing and reporting on secondary systems.

5.4 Checkpoints

A checkpoint is a snapshot of the database at a specific time.
Dirty buffers are flushed to disk,
A position in the redo log is marked.

Benefits:
- On recovery, the DBMS can start from the last checkpoint instead of from the beginning of the log,
- Greatly reduces recovery time.

Checkpoints happen:
- Periodically,
- On clean shutdown,
- On certain configuration events (e.g. log switches).

6. Oracle REDO, UNDO and Flashback

6.1 Redo Logs

Oracle uses redo logs to record all changes made to the database.

Useful queries:

-- Redo log files
SELECT * FROM v$logfile;

-- Log groups and status
SELECT * FROM v$log;

-- Force a log switch
ALTER SYSTEM SWITCH LOGFILE;

6.2 UNDO and Read Consistency

Oracle uses UNDO segments stored in an UNDO tablespace to:

Rollback transactions
ROLLBACK uses UNDO to restore old values.
Recovery after failure
UNDO helps undo incomplete transactions.
Read consistency
Queries see a consistent snapshot, even while other transactions are updating data.
Flashback features
Allow querying or restoring past states of data.

The UNDO_RETENTION parameter controls how long old values are kept.

Examples:

-- Tablespaces and their types
SELECT tablespace_name, contents, status
FROM   dba_tablespaces;

-- Datafiles and sizes
SELECT tablespace_name, file_name, bytes/1024/1024 AS size_mb
FROM   dba_data_files;

-- UNDO configuration
SHOW PARAMETER undo;
ALTER SYSTEM SET UNDO_RETENTION = 3600;

Further reading: Undo and redo in Oracle database

6.3 Flashback

Flashback Query

SELECT *
FROM   produits
AS OF TIMESTAMP (SYSTIMESTAMP - INTERVAL '1' HOUR);

Flashback Table

FLASHBACK TABLE produits
TO TIMESTAMP (SYSTIMESTAMP - INTERVAL '1' HOUR);

Flashback uses UNDO and redo information to reconstruct past versions of data without restoring an entire backup.
- Video deep dive: Oracle Flashback internals and 19c features

7. Key Takeaways

Failures are inevitable; the DBMS must handle them while preserving ACID properties.
Transaction states (active, committed, aborted) and before/after images define how recovery behaves.
Recovery uses logs to decide what to redo and undo.
Backups, replication and checkpoints are essential for robust recovery strategies.
In Oracle, redo logs, UNDO segments and Flashback provide powerful tools to restore data and to analyze or correct past changes.