Databases, an introduction to fundamentals

Nov 15, 2024

This article is part of the series:

Today we will talk about databases, which are often the backbone of modern computer systems, powering everything from business applications to social networks and e-commerce systems. Although we take their existence for granted today, database technology has not always been as advanced and accessible as it is now. In fact, the first data storage systems were developed in the 1960s, when the growing needs of organizations began to exceed the capabilities of traditional paper-based filing systems and other physical storage methods. As companies grew and the volume of information they needed to manage increased, the need for a digital solution became evident—one that would allow for the organization, storage, and rapid access to large amounts of data.

Origins

In its early stages, database systems were structured hierarchically or in networks, allowing data to be organized in tree or graph structures. Although these architectures were a significant advancement at the time, they also had important limitations, especially when it came to handling complex data and performing efficient queries. In a hierarchical system, data was stored in a tree structure, where each record had a single “parent,” which limited flexibility. On the other hand, networked database systems allowed for more complex relationships between data but were still difficult to implement and maintain, often requiring specific programming for each application.

This context of innovation and technical limitations motivated Edgar F. Codd, a computer scientist at IBM, to propose a new model for data storage and management in 1970, known as the “Relational Model.” Codd devised a system in which data could be stored in tables, allowing for access and manipulation through a query language that would later evolve into SQL (Structured Query Language). The relational model not only made it possible to efficiently store and access large volumes of information but also introduced unprecedented flexibility for complex queries. This paradigm shift drove the growth of databases as we know them today, laying the foundation for the development of database management systems (DBMS), which are still widely used today.

The introduction of the relational model revolutionized many industries, enabling the creation of more sophisticated business applications and promoting data-driven decision-making. Relational databases quickly became the standard for data storage and management, with their use spreading from banking and finance to logistics and retail. With the advent of database management systems such as IBM DB2, Oracle, and later, MySQL (MariaDB) and PostgreSQL, among others, companies gained access to powerful tools that allowed them to efficiently organize and analyze their data, enabling new capabilities to manage large-scale operations and processes.

As data continues to grow in volume and complexity, relational databases remain a central technology. However, new solutions have also emerged, such as NoSQL databases, which address specific needs that traditional relational systems do not always cover. Additionally, it is common for tools like Excel to be used informally as databases. Although Excel is not a database in the strict sense, its ease of use and tabular structure allow many users to store and organize data in a basic way. This phenomenon also occurs with tools like Access, which, while a more complete database management system, still does not offer the power and scalability of a full-fledged relational database management system.

Types

As data continues to grow in volume and complexity, relational databases remain a key technology. However, more modern solutions have now emerged that meet the demands of artificial intelligence, real-time analysis, and the management of unstructured data, such as NoSQL databases, which address specific needs that traditional relational systems do not always cover.

Relational databases

Relational databases store information in structured tables with rows and columns, allowing for complex relationships between data. Based on the relational model proposed by Edgar F. Codd, their structure relies on the use of primary and foreign keys to link data across tables. This type of database is ideal for applications where data integrity and transactions are essential, such as financial systems, business management, or any application requiring ACID (Atomicity, Consistency, Isolation, Durability) properties.

Examples: Microsoft SQL (MSSQL), MySQL, PostgreSQL, Oracle.

Document databases (NoSQL)

Document databases are a type of NoSQL database that store data in documents, typically in JSON or BSON format. These documents can contain both structured and unstructured data and, unlike relational databases, do not require a fixed schema. This provides great flexibility, especially for handling dynamic data. Document databases are suitable for applications that require fast access to large volumes of unstructured data, such as content applications or e-commerce systems.

Examples: MongoDB, Cassandra, CouchDB, Firebase Realtime Database.

In-memory databases

In-memory databases store data directly in RAM instead of on disks, allowing much faster access. They are ideal for applications that require high processing speed, such as online games, financial transaction systems, and real-time analytics. However, due to their dependence on RAM, they may have storage limitations compared to disk-based databases.

Examples: Redis, Memcached.

Graph databases

Graph databases store information in nodes and edges, making them ideal for representing complex relationships between entities. This type of database is particularly useful in applications where connections between data are as important as the data itself. A typical use of graph databases is in social networks, where relationships between users can be represented, or in recommendation systems and fraud analysis.

Examples: Neo4j, Amazon Neptune, ArangoDB.


There are other types of databases, such as time series databases, search databases, and columnar databases, each designed to meet specific performance needs. However, these are not the most common, so they will not be covered in depth in this series. We will focus on types of relational databases that are widely used today and that pose the most challenges when defining their structure.

Relational databases

Features and peculiarities

Relational databases are probably the most widely used and studied type of database in the world of computing. Developed in the 1970s by Edgar F. Codd, these databases are based on a structured model of interrelated tables, in which each table represents an entity or concept (such as Customers or Products). The strength of relational databases lies in their ability to maintain data integrity and consistency through normalization rules and constraints.

Data structure and organization

In a relational database, data is organized into tables composed of rows and columns. Each row represents a unique record (such as a specific customer), and each column represents an attribute of that record (such as the customer’s name or address). These tables can be related to each other through primary keys and foreign keys, allowing data to be linked and queried across multiple tables efficiently.

For example, an Orders table could be linked to a Customers table through a foreign key that stores the unique identifier of the customer who placed each order. This structure provides great flexibility for performing queries and retrieving complex information in a single operation. We will see a more detailed example below.

Key features

  1. Referential integrity: one of the main peculiarities of relational databases is their ability to maintain data integrity through key constraints. Referential integrity ensures that relationships between tables are valid; that is, each foreign key in a table corresponds to an existing record in the referenced table. This prevents, for example, orders from being assigned to nonexistent customers.

  2. ACID: relational databases usually adhere to ACID properties: atomicity, consistency, isolation, and durability. These properties are essential in applications where it is critical to ensure that transactions (such as payments or records) complete correctly.

    • Atomicity: ensures that a transaction is either completed in full or not executed at all. If one part of the transaction fails, it is completely rolled back.
    • Consistency: guarantees that the database moves from one valid state to another with each transaction.
    • Isolation: prevents transactions from interfering with each other, especially in multi-user environments.
    • Durability: ensures that changes made by a completed transaction persist even in the event of a system failure.
  3. SQL as a query language: relational databases use SQL (Structured Query Language) and its variants to define, manipulate, and query data. This enables complex queries using selection, join, and filter operators, offering a powerful and flexible syntax for data extraction. Additionally, SQL has evolved to support advanced features such as aggregations, subqueries, and stored procedures.

Advantages

  • Data integrity: ensures that data is consistent and accurate through integrity constraints (such as primary keys, foreign keys, and uniqueness rules). For example, in a banking system, financial transactions rely on data integrity to ensure that account balances are accurate and there is no duplication of transactions.

  • Ability to perform complex queries: enables advanced and complex queries using SQL, making it easier to extract valuable information. For example, in an e-commerce company, analysts can use SQL to obtain detailed information, such as the average monthly spending of customers or the inventory of best-selling products. The ability to perform joins between multiple tables allows these systems to respond to complex queries quickly and accurately.

  • Vertical scalability for large volumes of data: designed to handle large amounts of data and scale vertically ( i.e., with more powerful hardware). In large organizations, such as a hospital that stores millions of medical records, a relational database can be managed on a high-performance server that allows for efficient processing, storage, and retrieval of massive amounts of data, ensuring fast access to patient information.

  • Concurrent access and user control: multiple users can access and manipulate data simultaneously without conflicts, thanks to concurrency control and permission mechanisms. For example, in a project management system, several employees can update real-time information on the status of different tasks without data corruption. Additionally, the administrator can set access permissions, ensuring that only authorized users can view or modify sensitive data.

  • Widely adopted and compatible standard: SQL is a widely accepted and compatible standard, which facilitates data migration and interoperability between different relational database systems. For example, a company that decides to switch its data management system from MySQL to PostgreSQL can migrate data and queries relatively easily due to SQL compatibility between these platforms. This standard also enables developers and analysts to adapt quickly to new systems without the need to relearn a different query language.

Limitations

Even with all their many advantages, relational databases also have some limitations. For example:

  • Limited scalability: they are often designed for vertical scaling, which involves increasing the capacity of a single server to handle more data or users. This can be costly and complicated compared to NoSQL systems, which allow for easier horizontal scaling by adding multiple servers.

  • Rigid schema: the data structure is rigid and requires a predefined schema. This means that structural changes, such as adding new columns or modifying data types, can be complex and may require downtime. This structured design limits adaptability in environments where data changes rapidly.

  • Low performance with large volumes of unstructured data: they are not optimized for handling large volumes of unstructured data (such as free text, multimedia, or IoT-generated data), which can significantly impact performance in these types of applications. NoSQL databases are often more suitable for these cases.

  • High maintenance load: they require constant administration to ensure performance, security, and data integrity. Configurations such as indexes, permissions, and backups need regular management, which implies maintenance costs and specialized technical resources.

  • Complexity in distributed queries: performing distributed queries and transactions across multiple relational databases is complicated and can degrade performance. In distributed applications or high-availability systems, effectively managing distributed transactions is more complex and requires advanced configurations or even the use of external tools for synchronization and consistency.

Despite these limitations, relational databases remain a popular and effective choice for many business and mission-critical applications. Their ability to ensure data integrity, query flexibility, and extensive industry support make them a solid option for a wide variety of use cases. At their core, they use the relational model, which is defined by table structure and the relationships between them, organizing data in a way that is consistent and easily accessible.

Main elements of a relational database

Let’s now look at the fundamental concepts that underpin relational databases and how they are applied in practice.

1. Entities

Entities represent the main concepts on which information will be stored in the database. Each entity is defined in a table, which is the fundamental structure for organizing data. A table consists of rows and columns:

  • Rows (or records): each row in a table represents a specific instance of the entity. For example, in a Customer table, each row would contain data for an individual customer.
  • Columns (or fields): each column represents an attribute of the entity, such as name, address, or email in the Customer table.

There is much debate about whether tables should be named in plural or singular form; for example, using Customer or Customers. The choice depends on the convention followed in the project or organization. Personally, I prefer to name tables in singular because each row represents a single entity, not a set of them. This approach also facilitates matching the table name with the class name when using an ORM (Object Relational Mapping), which adds consistency and clarity to the code.

An ORM is a tool that maps database tables to objects in the code, making it easier to interact with the database from the application. Some popular ORMs include Hibernate for Java, Room for Android, Entity Framework for .NET, and Sequelize for Node.js.

It is essential to maintain consistency in naming conventions to avoid confusion, and, whenever possible, use names in English, as it is the “standard” language in programming and facilitates collaboration with developers worldwide if necessary.

2. Relationships between tables

One of the most important features of relational databases is the ability to establish relationships between different tables, which allows data to be organized and associated meaningfully. The most common relationships in a relational database include:

  • One-to-One relationship: occurs when a record in one table is associated with a single record in another table. An example of this is a Person table and a License table, where each person has a single license, and each license is associated with only one person. The foreign key in the referenced table is restricted to be unique, ensuring that the relationship is one-to-one. Only one person can have a license, and only one license can belong to a person.
  • One-to-Many relationship: this is the most common type of relationship and occurs when a record in one table is associated with multiple records in another table. For example, a customer (in the Customer table) can place multiple orders (in the Order table), establishing a one-to-many relationship between Customer and Order.
  • Many-to-Many relationship: this type of relationship occurs when multiple records in one table are associated with multiple records in another table. A classic example would be a Student table and a Course table, where a student can enroll in multiple courses, and a course can have multiple students. To manage these relationships, an intermediate table, such as Student_Course, is created, containing references to the primary keys of Student and Course.

The examples presented here are simplified and will vary depending on the complexity of the application and data requirements. In general, each of the examples presented here would contain a larger number of fields and related tables. It is important to consider that the database structure should faithfully reflect the reality of the problem domain being addressed.

3. Primary and foreign keys

To manage these relationships, relational databases use primary keys and foreign keys:

  • Primary Key: a unique identifier for each record in a table. The primary key ensures that each row in a table is unique. For example, in the Customer table, the id field can serve, and is often used, as the primary key. Recently, the use of UUIDs (Universal Unique Identifiers) as primary keys has become popular, as they are globally unique, do not rely on the database for generation, do not have the size limitations of integers, and offer better privacy, though they may impact performance.

  • Foreign Key: a field in a table that references the primary key of another table, establishing a relationship between the two. For example, in the Order table, the customer_id field can be a foreign key referencing the primary key id in the Customer table, thus linking each order to the corresponding customer.

4. Constraints and integrity rules

To ensure data quality and consistency, relational databases implement constraints and integrity rules:

  • Entity integrity: ensures that each table has a unique primary key, preventing duplicates and allowing for the unequivocal identification of each record.

  • Referential integrity: ensures that relationships between tables are valid; that is, foreign keys correspond to existing records in the referenced table. This rule prevents references to nonexistent records and ensures data coherence.

  • Uniqueness constraints: these constraints ensure that certain fields do not have duplicate values, as in the case of email addresses or identification numbers.


These basic elements make relational databases a robust and versatile tool for managing data in an organized, precise, and consistent manner. By understanding these fundamental concepts, developers can design relational databases that efficiently meet the requirements of their applications.

We have explored an introduction to databases, their evolution over time, and the key concepts underpinning relational databases. In future articles, we will delve into the more technical aspects of relational databases, such as table design and SQL queries, so you can start working with databases in your own projects. See you next time, and Happy Coding!

Related posts

That may interest you