Database Glossary

Databases involve a variety of important terms and concepts that are crucial for understanding how data is stored, managed, and accessed

Basic Concepts

Database:
- An organized collection of data, generally stored and accessed electronically from a computer system.
DBMS(Database Management System):
- Software that interacts with end users, applications, and the database itself to capture and analyze data. Examples include MySQL, PostgreSQL, Oracle, and MongoDB.
Schema:
- The structure of a database, defined in a formal language supported by the DBMS. It includes the tables, fields, relationships, views, indexes, and other elements

Data Models

Relational Models
- Organizes data into tables(relations) with rows and columns. Each table represents an entity, and each row represents a record.
- SQL(Structured Query Language) is used to interact with relational databases
NoSQL
- A broad class of database systems that are not relational. NoSQL databases can be document-based, key-value stores, wide-column stores, or graph databases
- Examples: MongoDB(document), Redis (key-value), Cassandra(wide-column), Neo4j (graph)

Key Concepts

Table:
- A collection of related data entries in a relational database, consisting of rows and columns
Row(Tuple):
- A single record in a table, representing a single data item.
Column (Attribute):
- A single field in a table, representing a data attribute
Primary Key:
- A unique identifier for a row in a table, ensuring that each record can be uniquely identified
Foreign Key:
- An attribute in a table that creates a link between two tables. It corresponds to the primary key in another table
Index:
- A database object that improves the speed of data retrieval operations on a table. It is created on one or more columns of a table

Transactions and Concurrency

Transaction:
- A unit of work that is performed against a database. It is often a sequence of operations performed as a single logical unit of work
- ACID Properties
  - Atomicity: Ensures that all operations within a transaction are completed; if not the transaction is aborted
  - Consistency: Ensures that the database remains in a consistent state before and after the transaction
  - Isolation: Ensures that transactions are securely and independently processed at the same time without interface
  - Durability: Ensures that the results of a completed transacition are permanently stored in the database.
Locking:
- Mechanism to control concurrent access to data in a database. Locks can be applied to rows, pages or tables to ensure data integrity
Dead Lock:
- A situation where two or more transactions are waiting for each other to release locks, resulting in a standstill for long time

Database Design

Normalization:
- The process of organizing the fields and tables of a relational database to minimize redundancy and dependecy. Normal forms(1NF, 2NF, 3NF, etc..)
Denormalization:
- The process of adding redundancy to a database design to improve read performance at the cost of write performance and data integrity
ER Model(Entity-Relationship Model):
- A data modeling techinque used to visualize the database structure. It involves entities (objects) and relationship between them

Advanced Concepts

Replication:
- The process of copying and maintaining database objects, such as tables in multiple database instances to ensure data availability and reliability
Sharding
- The practice of partioning data across multiple databases to improve perfromance and scalibilty
Partioning:
- The division of a database or its components into smaller, more manageable pieces called partitions
Backup and Recovery
- Techniques for creating copies of the database to proect against data loss and for restoring the database to consistent state after a failure
ETL(Extract, Transform, Load):
- A prcoess in data warehousing that involves extracting data from different sources, transforming it into a suitable format, and loadiing it into a data warehouse

Normalization in Depth

Normalization is a systematic approach to organizing data in a database to reduce redundancy and improve data integrity. It involves dividing large tables into smaller, related tables and defining relationships between them. The process of normalization is guided by various normal forms, each addressing specific issues related to data redundancy and dependency. Here is an in-depth look at normalization, from basics to advanced concepts.

Basics of Normalization

1. First Normal Form (1NF):

Definition: A table is in 1NF if all columns contain only atomic (indivisible) values, and each column contains values of a single type.
Goals: Eliminate repeating groups and ensure that each column contains atomic values.
Example: A table with columns for CustomerID, OrderID, and a list of ProductIDs is not in 1NF. To normalize, split the ProductIDs into separate rows.

Example Table (Not in 1NF):

CustomerID	OrderID	ProductIDs
1	101	1, 2, 3
2	102	2, 3

Normalized Table (1NF):

CustomerID	OrderID	ProductID
1	101	1
1	101	2
1	101	3
2	102	2
2	102	3

2. Second Normal Form (2NF):

Definition: A table is in 2NF if it is in 1NF and all non-key attributes are fully functionally dependent on the primary key.
Goals: Eliminate partial dependency, where a non-key attribute depends only on part of the primary key.
Example: In a table with a composite primary key (OrderID, ProductID), if CustomerName depends only on OrderID, it's not in 2NF.

Example Table (1NF but not 2NF):

OrderID	ProductID	CustomerName
101	1	Alice
101	2	Alice
102	3	Bob

Normalized Tables (2NF):

Orders Table:

| OrderID | CustomerName | | --- | --- | | 101 | Alice | | 102 | Bob |
OrderDetails Table:

| OrderID | ProductID | | --- | --- | | 101 | 1 | | 101 | 2 | | 102 | 3 |

3. Third Normal Form (3NF):

Definition: A table is in 3NF if it is in 2NF and all attributes are functionally dependent only on the primary key.
Goals: Eliminate transitive dependency, where a non-key attribute depends on another non-key attribute.
Example: In a table with OrderID, CustomerID, and CustomerAddress, if CustomerAddress depends on CustomerID, not OrderID, it's not in 3NF.

Example Table (2NF but not 3NF):

OrderID	CustomerID	CustomerAddress
101	1	123 Main St
102	2	456 Elm St

Normalized Tables (3NF):

Orders Table:

| OrderID | CustomerID | | --- | --- | | 101 | 1 | | 102 | 2 |
Customers Table:

| CustomerID | CustomerAddress | | --- | --- | | 1 | 123 Main St | | 2 | 456 Elm St |

Advanced Normal Forms

4. Boyce-Codd Normal Form (BCNF):

Definition: A table is in BCNF if it is in 3NF and every determinant is a candidate key.
Goals: Handle situations where 3NF does not eliminate all anomalies.
Example: In a table with StudentID, CourseID, and InstructorID, if each course has only one instructor but an instructor can teach multiple courses, anomalies can occur.

Example Table (3NF but not BCNF):

StudentID	CourseID	InstructorID
1	CS101	1001
2	CS101	1001
3	CS102	1002

Normalized Tables (BCNF):

Enrollments Table:

| StudentID | CourseID | | --- | --- | | 1 | CS101 | | 2 | CS101 | | 3 | CS102 |
Courses Table:

| CourseID | InstructorID | | --- | --- | | CS101 | 1001 | | CS102 | 1002 |

5. Fourth Normal Form (4NF):

Definition: A table is in 4NF if it is in BCNF and contains no multi-valued dependencies.
Goals: Eliminate multi-valued dependencies, where one attribute depends on another attribute independent of the primary key.
Example: In a table with StudentID, CourseID, and Hobby, if a student can have multiple hobbies and enroll in multiple courses independently, it leads to redundancy.

Example Table (BCNF but not 4NF):

StudentID	CourseID	Hobby
1	CS101	Reading
1	CS101	Swimming
1	CS102	Reading
1	CS102	Swimming

Normalized Tables (4NF):

StudentCourses Table:

| StudentID | CourseID | | --- | --- | | 1 | CS101 | | 1 | CS102 |
StudentHobbies Table:

| StudentID | Hobby | | --- | --- | | 1 | Reading | | 1 | Swimming |

6. Fifth Normal Form (5NF):

Definition: A table is in 5NF if it is in 4NF and all join dependencies are implied by candidate keys.
Goals: Ensure that the decomposition of the table does not lead to loss of information.
Example: In a table with ProjectID, EmployeeID, and TaskID, if projects, employees, and tasks are related independently, it can cause anomalies.

Example Table (4NF but not 5NF):

ProjectID	EmployeeID	TaskID
1	101	A
1	102	A
2	101	B
2	102	B

Normalized Tables (5NF):

ProjectEmployees Table:

| ProjectID | EmployeeID | | --- | --- | | 1 | 101 | | 1 | 102 | | 2 | 101 | | 2 | 102 |
ProjectTasks Table:

| ProjectID | TaskID | | --- | --- | | 1 | A | | 2 | B |
EmployeeTasks Table:

| EmployeeID | TaskID | | --- | --- | | 101 | A | | 102 | A | | 101 | B | | 102 | B |

Summary of Normal Forms

1NF: Eliminate repeating groups; ensure atomicity.
2NF: Eliminate partial dependencies on the primary key.
3NF: Eliminate transitive dependencies.
BCNF: Ensure every determinant is a candidate key.
4NF: Eliminate multi-valued dependencies.
5NF: Ensure that the table can be reconstructed from its decompositions without loss of information.

Normalization is a fundamental concept in database design, ensuring data integrity and reducing redundancy. Understanding and applying the appropriate normal forms can significantly enhance the efficiency and reliability of a database.

Codd's Rule for Optimized Design

Codd's Twelve Rules

Rule 0: Foundation Rule

A system must qualify as a relational database management system by supporting these rules; otherwise, it is not considered relational.

Rule 1: Information Rule

All information in a relational database is represented explicitly at the logical level and in exactly one way: by values in tables.

Rule 2: Guaranteed Access Rule

Each piece of data (atomic value) in a relational database is guaranteed to be accessible by using a combination of table name, primary key value, and column name.

Rule 3: Systematic Treatment of Null Values

Null values (distinct from the empty string or a zero) are supported fully for representing missing information and inapplicable information in a systematic way, independent of data type.

Rule 4: Dynamic Online Catalog Based on the Relational Model

The database description is represented at the logical level as tables and can be queried online via the data manipulation language.

Rule 5: Comprehensive Data Sublanguage Rule

A relational system may support several languages and various modes of terminal use. However, there must be at least one language whose statements can express all of the following:
- Data definition
- View definition
- Data manipulation (interactive and by program)
- Integrity constraints
- Authorization
- Transaction boundaries (begin, commit, and rollback)

Rule 6: View Updating Rule

All views that are theoretically updatable must be updatable by the system.

Rule 7: High-Level Insert, Update, and Delete

The system must support insert, update, and delete operations at a set level (e.g., operating on multiple rows or tables simultaneously).

Rule 8: Physical Data Independence

Changes to the physical level (how data is stored) must not require a change to an application based on the structure.

Rule 9: Logical Data Independence

Changes to the logical level (tables, columns, rows) must not require a change to an application based on the structure.

Rule 10: Integrity Independence

Integrity constraints specific to a particular relational database must be definable in the relational data sublanguage and stored in the catalog, not in the application programs.

Rule 11: Distribution Independence

The distribution of data across various locations should be invisible to the end-users. Users should not be aware of whether the data is centralized or distributed.

Rule 12: Non-Subversion Rule

If a relational system has a low-level (single-record-at-a-time) language, that low-level language cannot be used to subvert or bypass the integrity rules and constraints expressed in the higher-level (set-oriented) relational language.

Summary

Codd's rules are intended to provide a guideline for the features and behavior of relational database management systems, ensuring they adhere to the principles of the relational model. These rules focus on data integrity, data independence, and the systematic treatment of data, which are critical for maintaining a robust and reliable database system. While not all modern RDBMSs fully comply with all of Codd's rules, they remain foundational principles in the design and evaluation of relational databases.