In part one of this SQL series, we introduced the concepts of the Relational Database Management System (RDBMS), installed Microsoft SQL Server Express, SQL Server Management Studio, and the AdventureWorks database. In part two of this series, we introduced the SELECT statement and various clauses for returning data from the RDBMS; this includes the SELECT, FROM, WHERE, GROUP BY, HAVING and ORDER BY. In part three, we are going to take a step back and look at the theory of data modelling; understanding data modelling is essential and far too often overlooked.
Data Modelling in software engineering is the process of creating a data model for an information system by applying certain formal techniques; a Data Model is an abstract model that organizes elements of data and standardizes how they relate to one another and to properties of the real world entities.
The main aim of data models is to support the development of information systems by providing the definition and format of data. According to West and Fowler (1999) “if this is done consistently across systems then compatibility of data can be achieved. If the same data structures are used to store and access data then different applications can share data.
Data models are commonly represented as Entity-Relationship Models (ERMs), or Entity-Relationship Diagrams (ERDs).
Entity-Relationship Models (ERM)
An Entity–Relationship Model is the result of systematic analysis to define and describe what is important to processes in an area of a business; it does not define the business processes; it only presents a business data schema in graphical form. It is usually drawn in a graphical form as boxes (entities) that are connected by lines (relationships) which express the associations and dependencies between entities. Entities may be characterized not only by relationships but also by additional properties (attributes).
- Entities – Person, Address.
- Relationships – Person has zero or more Addresses.
- Attributes – Person attribute includes first name, last name.
Peter Chen, the father of ER modelling said in his seminal paper:
“The entity-relationship model adopts the more natural view that the real world consists of entities and relationships. It incorporates some of the important semantic information about the real world.”
Types of Entity-Relationship Diagrams
There are three types of diagrams according to ANSI in 1975:
- Conceptual Data Model (CDM): contains the least granular detail but establishes the overall scope of what is to be included within the model set.
- Logical Data Model (LDM): contains the abstract structure of a domain of information.
- Physical Data Model (PDM): contains enough detail to produce a physical database
When joining a new project, the Conceptual Data Model will allow you to see an overview of how data is represented, interconnected and constrained. Once you are given a business area you may want to view the Logical Data Model where you can determine what attribute exists and what the relationships and attributes are. Finally, when you are looking to write SQL SELECT Statements you will want to look at the Physical Data Model to guide your table designs.
Conceptual Data Model (CDM)
A conceptual model’s primary objective is to convey the fundamental principles and basic functionality of the system which it represents. Also, a conceptual model must be developed in such a way as to provide an easily understood system interpretation for the model’s users. A conceptual model, when implemented properly, should satisfy four fundamental objectives.
- Enhance an individual’s understanding of the representative system.
- Facilitate efficient conveyance of system details between stakeholders.
- Provide a point of reference for system designers to extract system specifications.
- Document the system for future reference and provide a means for collaboration
The conceptual model plays an important role in the overall system development life cycle.
Logical Data Model (LDM)
Logical data models represent the abstract structure of a domain of information. They are often diagrammatic in nature and are most typically used in business processes that seek to capture things of importance to an organization and how they relate to one another. Once validated and approved, the logical data model can become the basis of a physical data model and form the design of a database.
Logical data models should be based on the structures identified in a preceding conceptual data model since this describes the semantics of the information context, which the logical model should also reflect. Even so, since the logical data model anticipates implementation on a specific computing system, the content of the logical data model is adjusted to achieve certain efficiencies.
The term ‘Logical Data Model’ is sometimes used as a synonym of ‘domain model’ or as an alternative to the domain model. While the two concepts are closely related and have overlapping goals, a domain model is more focused on capturing the concepts in the problem domain rather than the structure of the data associated with that domain.
Reasons for building a Logical Data Model (LDM) are to:
- Create a common understanding of business data elements and requirements.
- Provides a foundation for designing a database.
- Facilitates avoidance of data redundancy and thus prevent data & business transaction inconsistency.
- Facilitates data re-use and sharing.
- Decreases development and maintenance time and cost.
- Confirms a logical process model and helps impact analysis.
Physical Data Model (PDM)
A physical data model (or database design) is a representation of a data design as implemented, or intended to be implemented, in a database management system. In the lifecycle of a project, it typically derives from a logical data model, though it may be reverse-engineered from a given database implementation. A complete physical data model will include all the database artefacts required to create relationships between tables or to achieve performance goals, such as indexes, constraint definitions, linking tables, partitioned tables or clusters.
Analysts can usually use a physical data model to calculate storage estimates; it may include specific storage allocation details for a given database system.
Once you start looking at Logical and Physical Data Models, the topic of Normalisation starts to appears, but what is this?
Database normalization is the process of restructuring a relational database in accordance with a series of so-called normal forms in order to reduce data redundancy and improve data integrity. It was first proposed by Edgar F. Codd as an integral part of his relational model.
Normalization entails organizing the columns (attributes) and tables (relations) of a database to ensure that their dependencies are properly enforced by database integrity constraints. It is accomplished by applying some formal rules either by a process of synthesis (creating a new database design) or decomposition (improving an existing database design).
A lot of Entity-Relationship Diagrams (ERDs) will be in various states of normalisation, so you will need to understand the basics of this.
Unnormalized Form (UNF) is also known as an unnormalized relation or non-first normal form (NF2), is a simple database data model (organization of data in a database) lacking the efficiency of database normalization. An unnormalized data model will suffer the pitfalls of data redundancy, where multiple values and/or complex data structures may be stored within a single field or attribute, or where fields may be replicated within a single table (a way of subverting the first normal form rule of one value per field or attribute).
When an attempt is made to modify (update, insert into, or delete from) a relation, the following undesirable side-effects may arise in relations that have not been sufficiently normalized:
- Update Anomaly – The same information can be expressed on multiple rows; therefore updates to the relation may result in logical inconsistencies. For example, each record in an “Employees’ Skills” relation might contain an Employee ID, Employee Address, and Skill; thus a change of address for a particular employee may need to be applied to multiple records (one for each skill). If the update is only partially successful – the employee’s address is updated on some records but not others – then the relation is left in an inconsistent state. Specifically, the relation provides conflicting answers to the question of what this particular employee’s address is. This phenomenon is known as an update anomaly.
- Insertion Anomaly – There are circumstances in which certain facts cannot be recorded at all. For example, each record in a “Faculty and Their Courses” relation might contain a Faculty ID, Faculty Name, Faculty Hire Date, and Course Code. Therefore, we can record the details of any faculty member who teaches at least one course, but we cannot record a newly hired faculty member who has not yet been assigned to teach any courses, except by setting the Course Code to null. This phenomenon is known as an insertion anomaly.
- Deletion Anomaly – Under certain circumstances, deletion of data representing certain facts necessitates deletion of data representing completely different facts. The “Faculty and Their Courses” relation described in the previous example suffers from this type of anomaly, for if a faculty member temporarily ceases to be assigned to any courses, we must delete the last of the records on which that faculty member appears, effectively also deleting the faculty member, unless we set the Course Code to null. This phenomenon is known as a deletion anomaly.
First Normal Form (1NF) is a property of a relation in a relational database. A relation is in first normal form if and only if the domain of each attribute contains only atomic (indivisible) values, and the value of each attribute contains only a single value from that domain. The first definition of the term, in a 1971 conference paper by Edgar Codd, defined a relation to be in first normal form when none of its domains has any sets as elements.
First Normal Form, 1NF, is an essential property of a relation in a relational database. Database normalization is the process of representing a database in terms of relations in standard normal forms, where first normal is a minimal requirement.
First normal form enforces these criteria:
- Eliminate repeating groups in individual tables.
- Create a separate table for each set of related data.
- Identify each set of related data with a primary key.
Full Technopedia tutorial on 1NF: https://www.techopedia.com/definition/25955/first-normal-form-1nf
Second Normal Form (2NF) is a normal form used in database normalization. 2NF was originally defined by E.F. Codd in 1971.
A relation that is in first normal form (1NF) must meet additional criteria if it is to qualify for second normal form. Specifically, a relation is in 2NF if:
- The relation is in First Normal Form (1NF).
- No non-prime attribute is dependent on any proper subset of any candidate key of the relation. A non-prime attribute of a relation is an attribute that is not a part of any candidate key of the relation.
Full 2NF tutorial on Technopedia: https://www.techopedia.com/definition/21980/second-normal-form-2nf
Third Normal Form (3NF) is a normal form that is used in normalizing a database design to reduce the duplication of data and ensure referential integrity by ensuring that:
- The entity is in second normal form (2NF).
- It contains only columns that are non-transitively dependent on the primary key.
Full Technopedia tutorial on 3NF: https://www.techopedia.com/definition/22561/third-normal-form-3nf
Most 3NF relations are free of insertion, update, and deletion anomalies.
Additional Normal Forms
It does not stop with 3rd normal form, but there are other normal forms that you can read up about:
- EKNF: Elementary key normal form
- BCNF: Boyce–Codd normal form
- 4NF: Fourth normal form
- ETNF: Essential tuple normal form
- 5NF: Fifth normal form
- DKNF: Domain-key normal form
- 6NF: Sixth normal form
Note: you can keep normalising until you are blue in the face; normalization does have its purposes and is useful for data modification. However, when it comes to reporting, how you structure your data depends on the physical storage layer, sometimes a denormalised data structure is prefered; this all depends on how the physical database system works and the optimal structure.
In this part of the series we have been introduced to some data modelling concepts, do not worry, we will go over this in more details so do not panic, but you should be somewhat familiar with these terms:
- Data Modelling
- Entity-Relationship Modelling / Diagram
- Type of ERDs
- Conceptual Data Model (CDM)
- Logical Data Model (LDM)
- Physical Data Model (PDM)
- Unnormalized Form (UNF)
- First Normal Form (1NF)
- Second Normal Form (2NF)
- Third Normal Form (3NF)
In the next part of this series, we will look at writing queries across multiple tables in our AdventureWorks Database, which is a normalized data model.
As always, please let me know what you think, leave a comment below or reach out to me on Twitter @Tableau_Magic.