SQL
postgres
database
When we talk about relational databases, the main advantage that comes to mind is data integrity. These databases allow to distribute the information in different tables to group information according to its relevance and to create relationships between tables to associate data. However, to maintain integrity it is necessary to meet certain standards when designing our tables. We need the structure to be relevant to the data we want to store and to guarantee the integrity of the information as well as its consistency and avoid having unnecessary redundant information.
For this there is the database normalization, it is about five normal forms that if you comply with them your database will guarantee the integrity of the information and the optimization when making queries. Forms four and five are generally contemplated for more advanced and complex scenarios, for most of the cases it will be enough to comply with the first three normal forms that we will cover next in this article.
Before talking about the normal forms we must handle some fundamental concepts.
The first concept is entity, it is an abstraction of an object of reality, or of a process of the organization, whose information must be stored in the database in a grouped and correlated way.
The second concept is key or primary key, it is an attribute or column that serves to unequivocally identify an entity within a table.
For a table to comply with the first normal form it must meet the following parameters:
If for example we have a product order table of the following form:
OrderId | Date | Details | Customer |
1 | 01-01-23 | 2 Shoes | Pedro |
1 | 01-01-23 | 3 Pants | Pedro |
2 | 03-02-23 | 1 Bag | Jane |
3 | 02-03-23 | 1 Shirt | Ana |
We can notice that the "Detail" column contains two important pieces of information for the order: the quantity and the product name. In the same way we can notice how although we can identify the order with the OrderId" the same does not happen in the case of the products and the customer since we only have a name that is not necessarily unique for each one and can be repeated. Let's correct that:
OrderId | Date | ProductId | ProductName | Amount | CustomerId | Customer |
1 | 01-01-23 | 1 | Shoes | 2 | 1 | Pedro |
1 | 01-01-23 | 2 | Pants | 6 | 1 | Pedro |
2 | 03-02-23 | 3 | Bolso | 4 | 2 | Jane |
3 | 02-03-23 | 1 | Shoes | 1 | 3 | Ana |
Now we can see how not only the order can be identified with its ID but we also have ids for the product and the customer so that each of these entities can be identified. We also separated the quantity and the product name to separate columns to comply with the first normal form.
For the second normal form the objective is to eliminate repetitive groups so the following criteria must be met:
comply with the first normal form
All attributes must be directly dependent on the primary key.
In other words, all those attributes that correspond to different primary keys must be separated in different Tables with their own primary keys and related attributes. Following the example of the order table would be as follows:
Orders | ||
Id | Date | Customer |
1 | 01-01-23 | 1 |
2 | 03-02-23 | 2 |
3 | 02-03-23 | 3 |
Customer | |
Id | Name |
1 | Pedro |
2 | Jane |
3 | Ana |
Product | ||||
OrderId | Date | ProductId | Name | Amount |
1 | 01-01-23 | 1 | Shoes | 2 |
1 | 01-01-23 | 2 | Pants | 3 |
2 | 03-02-23 | 3 | Bag | 1 |
3 | 02-03-23 | 1 | Shoes | 1 |
By separating the attributes in their respective tables with their own primary keys we can see a reduction of repeated rows as this redundancy becomes unnecessary when the different tables can be related through their primary and foreign keys. For example, I only need one row to record the customer which can appear several times in the order table, instead of placing the customer information several times for each order.
Even so we can notice some redundancy in the products table since we need to repeat the products for each of the orders, this is where it comes the...
For the third normal form what we want is that absolutely all the attributes of a table depend only on its primary key and not on another non-primary key. This is what is known as a transitive dependency, i.e. when an attribute depends indirectly on the primary key of another. For example: Shoes have their own key "id", but they must also be associated with another key, the OrderId", to know which products must be included in the order.
To resolve these dependencies we create tables with each of the entities separately, and create transitional tables that relate the primary keys of both. With this we eliminate any redundancy of data and achieve greater reliability of the information.
In the case of the product table in the example, it would be solved as follows:
Orders | |
OrderId | Date |
1 | 01-01-23 |
2 | 03-02-23 |
3 | 02-03-23 |
Products | |
ProductId | Name |
1 | Shoes |
2 | Pants |
3 | Bag |
Order details | ||
OrderId | ProductId | Amount |
1 | 1 | 2 |
1 | 2 | 3 |
2 | 3 | 1 |
3 | 1 | 1 |
Now you can see how the products in each order are related, along with the quantity of that product for the corresponding order. Here you can see the final model of the database
The cost of this is that we now have to make queries by cross-referencing information from several tables, but in return you have a database without redundancy and with more consistent and reliable data. If, for example, a customer changes his information, all previously placed orders are related only to the customer's updated record, generating greater reliability for system users.
Finally we recommend an online tool that allows you to create these diagrams in a simple way and where you can also work collaboratively. You can see the example at QuickDatabaseDiagrams.com and see how easy it is to create models. To access the documentation and learn how to use the tool select the "DOCS" option in the top bar of the screen.