Databases involve a variety of important terms and concepts that are crucial for understanding how data is stored, managed, and accessed
Basic Concepts
Database:
- An organized collection of data, generally stored and accessed electronically from a computer system.
DBMS(Database Management System):
- Software that interacts with end users, applications, and the database itself to capture and analyze data. Examples include MySQL, PostgreSQL, Oracle, and MongoDB.
Schema:
- The structure of a database, defined in a formal language supported by the DBMS. It includes the tables, fields, relationships, views, indexes, and other elements
Data Models
Relational Models
Organizes data into tables(relations) with rows and columns. Each table represents an entity, and each row represents a record.
SQL(Structured Query Language) is used to interact with relational databases
NoSQL
A broad class of database systems that are not relational. NoSQL databases can be document-based, key-value stores, wide-column stores, or graph databases
Examples: MongoDB(document), Redis (key-value), Cassandra(wide-column), Neo4j (graph)
Key Concepts
Table:
- A collection of related data entries in a relational database, consisting of rows and columns
Row(Tuple):
- A single record in a table, representing a single data item.
Column (Attribute):
- A single field in a table, representing a data attribute
Primary Key:
- A unique identifier for a row in a table, ensuring that each record can be uniquely identified
Foreign Key:
- An attribute in a table that creates a link between two tables. It corresponds to the primary key in another table
Index:
- A database object that improves the speed of data retrieval operations on a table. It is created on one or more columns of a table
Transactions and Concurrency
Transaction:
A unit of work that is performed against a database. It is often a sequence of operations performed as a single logical unit of work
ACID Properties
Atomicity: Ensures that all operations within a transaction are completed; if not the transaction is aborted
Consistency: Ensures that the database remains in a consistent state before and after the transaction
Isolation: Ensures that transactions are securely and independently processed at the same time without interface
Durability: Ensures that the results of a completed transacition are permanently stored in the database.
Locking:
- Mechanism to control concurrent access to data in a database. Locks can be applied to rows, pages or tables to ensure data integrity
Dead Lock:
- A situation where two or more transactions are waiting for each other to release locks, resulting in a standstill for long time
Database Design
Normalization:
- The process of organizing the fields and tables of a relational database to minimize redundancy and dependecy. Normal forms(1NF, 2NF, 3NF, etc..)
Denormalization:
- The process of adding redundancy to a database design to improve read performance at the cost of write performance and data integrity
ER Model(Entity-Relationship Model):
A data modeling techinque used to visualize the database structure. It involves entities (objects) and relationship between them
Advanced Concepts
Replication:
The process of copying and maintaining database objects, such as tables in multiple database instances to ensure data availability and reliability
Sharding
The practice of partioning data across multiple databases to improve perfromance and scalibilty
Partioning:
The division of a database or its components into smaller, more manageable pieces called partitions
Backup and Recovery
- Techniques for creating copies of the database to proect against data loss and for restoring the database to consistent state after a failure
ETL(Extract, Transform, Load):
- A prcoess in data warehousing that involves extracting data from different sources, transforming it into a suitable format, and loadiing it into a data warehouse
Normalization in Depth
Normalization is a systematic approach to organizing data in a database to reduce redundancy and improve data integrity. It involves dividing large tables into smaller, related tables and defining relationships between them. The process of normalization is guided by various normal forms, each addressing specific issues related to data redundancy and dependency. Here is an in-depth look at normalization, from basics to advanced concepts.
Basics of Normalization
1. First Normal Form (1NF):
Definition: A table is in 1NF if all columns contain only atomic (indivisible) values, and each column contains values of a single type.
Goals: Eliminate repeating groups and ensure that each column contains atomic values.
Example: A table with columns for
CustomerID
,OrderID
, and a list ofProductIDs
is not in 1NF. To normalize, split theProductIDs
into separate rows.
Example Table (Not in 1NF):
CustomerID | OrderID | ProductIDs |
1 | 101 | 1, 2, 3 |
2 | 102 | 2, 3 |
Normalized Table (1NF):
CustomerID | OrderID | ProductID |
1 | 101 | 1 |
1 | 101 | 2 |
1 | 101 | 3 |
2 | 102 | 2 |
2 | 102 | 3 |
2. Second Normal Form (2NF):
Definition: A table is in 2NF if it is in 1NF and all non-key attributes are fully functionally dependent on the primary key.
Goals: Eliminate partial dependency, where a non-key attribute depends only on part of the primary key.
Example: In a table with a composite primary key (
OrderID
,ProductID
), ifCustomerName
depends only onOrderID
, it's not in 2NF.
Example Table (1NF but not 2NF):
OrderID | ProductID | CustomerName |
101 | 1 | Alice |
101 | 2 | Alice |
102 | 3 | Bob |
Normalized Tables (2NF):
Orders Table:
| OrderID | CustomerName | | --- | --- | | 101 | Alice | | 102 | Bob |
OrderDetails Table:
| OrderID | ProductID | | --- | --- | | 101 | 1 | | 101 | 2 | | 102 | 3 |
3. Third Normal Form (3NF):
Definition: A table is in 3NF if it is in 2NF and all attributes are functionally dependent only on the primary key.
Goals: Eliminate transitive dependency, where a non-key attribute depends on another non-key attribute.
Example: In a table with
OrderID
,CustomerID
, andCustomerAddress
, ifCustomerAddress
depends onCustomerID
, notOrderID
, it's not in 3NF.
Example Table (2NF but not 3NF):
OrderID | CustomerID | CustomerAddress |
101 | 1 | 123 Main St |
102 | 2 | 456 Elm St |
Normalized Tables (3NF):
Orders Table:
| OrderID | CustomerID | | --- | --- | | 101 | 1 | | 102 | 2 |
Customers Table:
| CustomerID | CustomerAddress | | --- | --- | | 1 | 123 Main St | | 2 | 456 Elm St |
Advanced Normal Forms
4. Boyce-Codd Normal Form (BCNF):
Definition: A table is in BCNF if it is in 3NF and every determinant is a candidate key.
Goals: Handle situations where 3NF does not eliminate all anomalies.
Example: In a table with
StudentID
,CourseID
, andInstructorID
, if each course has only one instructor but an instructor can teach multiple courses, anomalies can occur.
Example Table (3NF but not BCNF):
StudentID | CourseID | InstructorID |
1 | CS101 | 1001 |
2 | CS101 | 1001 |
3 | CS102 | 1002 |
Normalized Tables (BCNF):
Enrollments Table:
| StudentID | CourseID | | --- | --- | | 1 | CS101 | | 2 | CS101 | | 3 | CS102 |
Courses Table:
| CourseID | InstructorID | | --- | --- | | CS101 | 1001 | | CS102 | 1002 |
5. Fourth Normal Form (4NF):
Definition: A table is in 4NF if it is in BCNF and contains no multi-valued dependencies.
Goals: Eliminate multi-valued dependencies, where one attribute depends on another attribute independent of the primary key.
Example: In a table with
StudentID
,CourseID
, andHobby
, if a student can have multiple hobbies and enroll in multiple courses independently, it leads to redundancy.
Example Table (BCNF but not 4NF):
StudentID | CourseID | Hobby |
1 | CS101 | Reading |
1 | CS101 | Swimming |
1 | CS102 | Reading |
1 | CS102 | Swimming |
Normalized Tables (4NF):
StudentCourses Table:
| StudentID | CourseID | | --- | --- | | 1 | CS101 | | 1 | CS102 |
StudentHobbies Table:
| StudentID | Hobby | | --- | --- | | 1 | Reading | | 1 | Swimming |
6. Fifth Normal Form (5NF):
Definition: A table is in 5NF if it is in 4NF and all join dependencies are implied by candidate keys.
Goals: Ensure that the decomposition of the table does not lead to loss of information.
Example: In a table with
ProjectID
,EmployeeID
, andTaskID
, if projects, employees, and tasks are related independently, it can cause anomalies.
Example Table (4NF but not 5NF):
ProjectID | EmployeeID | TaskID |
1 | 101 | A |
1 | 102 | A |
2 | 101 | B |
2 | 102 | B |
Normalized Tables (5NF):
ProjectEmployees Table:
| ProjectID | EmployeeID | | --- | --- | | 1 | 101 | | 1 | 102 | | 2 | 101 | | 2 | 102 |
ProjectTasks Table:
| ProjectID | TaskID | | --- | --- | | 1 | A | | 2 | B |
EmployeeTasks Table:
| EmployeeID | TaskID | | --- | --- | | 101 | A | | 102 | A | | 101 | B | | 102 | B |
Summary of Normal Forms
1NF: Eliminate repeating groups; ensure atomicity.
2NF: Eliminate partial dependencies on the primary key.
3NF: Eliminate transitive dependencies.
BCNF: Ensure every determinant is a candidate key.
4NF: Eliminate multi-valued dependencies.
5NF: Ensure that the table can be reconstructed from its decompositions without loss of information.
Normalization is a fundamental concept in database design, ensuring data integrity and reducing redundancy. Understanding and applying the appropriate normal forms can significantly enhance the efficiency and reliability of a database.
Codd's Rule for Optimized Design
Codd's Twelve Rules
Rule 0: Foundation Rule
- A system must qualify as a relational database management system by supporting these rules; otherwise, it is not considered relational.
Rule 1: Information Rule
- All information in a relational database is represented explicitly at the logical level and in exactly one way: by values in tables.
Rule 2: Guaranteed Access Rule
- Each piece of data (atomic value) in a relational database is guaranteed to be accessible by using a combination of table name, primary key value, and column name.
Rule 3: Systematic Treatment of Null Values
- Null values (distinct from the empty string or a zero) are supported fully for representing missing information and inapplicable information in a systematic way, independent of data type.
Rule 4: Dynamic Online Catalog Based on the Relational Model
- The database description is represented at the logical level as tables and can be queried online via the data manipulation language.
Rule 5: Comprehensive Data Sublanguage Rule
A relational system may support several languages and various modes of terminal use. However, there must be at least one language whose statements can express all of the following:
Data definition
View definition
Data manipulation (interactive and by program)
Integrity constraints
Authorization
Transaction boundaries (begin, commit, and rollback)
Rule 6: View Updating Rule
- All views that are theoretically updatable must be updatable by the system.
Rule 7: High-Level Insert, Update, and Delete
- The system must support insert, update, and delete operations at a set level (e.g., operating on multiple rows or tables simultaneously).
Rule 8: Physical Data Independence
- Changes to the physical level (how data is stored) must not require a change to an application based on the structure.
Rule 9: Logical Data Independence
- Changes to the logical level (tables, columns, rows) must not require a change to an application based on the structure.
Rule 10: Integrity Independence
- Integrity constraints specific to a particular relational database must be definable in the relational data sublanguage and stored in the catalog, not in the application programs.
Rule 11: Distribution Independence
- The distribution of data across various locations should be invisible to the end-users. Users should not be aware of whether the data is centralized or distributed.
Rule 12: Non-Subversion Rule
- If a relational system has a low-level (single-record-at-a-time) language, that low-level language cannot be used to subvert or bypass the integrity rules and constraints expressed in the higher-level (set-oriented) relational language.
Summary
Codd's rules are intended to provide a guideline for the features and behavior of relational database management systems, ensuring they adhere to the principles of the relational model. These rules focus on data integrity, data independence, and the systematic treatment of data, which are critical for maintaining a robust and reliable database system. While not all modern RDBMSs fully comply with all of Codd's rules, they remain foundational principles in the design and evaluation of relational databases.