Understand Normalization Before Denormalizing

by Krishna on March 7, 2009

[Image: database diagram]

Database denormalization is frequently misunderstood and gets software developers into trouble. Many developers seem to think that denormalization is the antithesis of normalization and that the two have conflicting goals. You frequently read people demonizing normalization as a straitjacket that hurts performance and complicates coding. This is a misunderstanding of the intentions of both techniques.

The primary objective of normalization is data integrity, which means that the data is correct and complete. Normalization requires that you store each piece of data in exactly one place, because multiple copies of the same data can easily drift out of sync through programmer or system errors, making it difficult to identify which copy is correct. As a corollary, having data in one place makes it easier to update, because you don’t have to worry about multiple updates that must happen together.
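To make the "one place" idea concrete, here is a minimal sketch using SQLite; the table and column names (`customers`, `orders`, `email`) are illustrative, not from any particular application:

```python
import sqlite3

# Normalized schema: each customer's email is stored exactly once.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        id    INTEGER PRIMARY KEY,
        email TEXT NOT NULL
    );
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        total       REAL NOT NULL
    );
""")
conn.execute("INSERT INTO customers (id, email) VALUES (1, 'old@example.com')")
conn.executemany("INSERT INTO orders (customer_id, total) VALUES (?, ?)",
                 [(1, 9.99), (1, 24.50)])

# One UPDATE corrects the email everywhere, because it lives in one place.
conn.execute("UPDATE customers SET email = 'new@example.com' WHERE id = 1")

# Every order sees the corrected email through the join.
rows = conn.execute("""
    SELECT o.id, c.email
    FROM orders o JOIN customers c ON c.id = o.customer_id
""").fetchall()
```

If the email had instead been copied into every order row, the single UPDATE above would have left stale copies behind.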

Another goal of normalization is to minimize redesign of the database structure. Consider storing address information directly in the user table. You will frequently encounter a person with multiple addresses (permanent home, temporary home, business) or two people sharing the same address (married couples, parents and children, etc.). That changes the entity relationships and forces significant restructuring of the database. Normalized designs are more extensible and queryable than non-normalized designs. Normalization also reduces your I/O and disk costs, since you don’t write the same data multiple times.
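The address example can be sketched as a many-to-many relationship, so neither "one person, many addresses" nor "many people, one address" requires restructuring later. Again a hypothetical SQLite schema, with invented table names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE addresses (
        id     INTEGER PRIMARY KEY,
        street TEXT NOT NULL
    );
    -- Junction table: a user can have many addresses and vice versa.
    CREATE TABLE user_addresses (
        user_id    INTEGER NOT NULL REFERENCES users(id),
        address_id INTEGER NOT NULL REFERENCES addresses(id),
        kind       TEXT NOT NULL,   -- e.g. 'home', 'business'
        PRIMARY KEY (user_id, address_id, kind)
    );
""")
conn.execute("INSERT INTO users VALUES (1, 'Alice'), (2, 'Bob')")
conn.execute("INSERT INTO addresses VALUES (10, '1 Main St')")

# A married couple sharing one home address: just two junction rows,
# no schema change and no duplicated street data.
conn.executemany("INSERT INTO user_addresses VALUES (?, ?, ?)",
                 [(1, 10, 'home'), (2, 10, 'home')])

shared = conn.execute("""
    SELECT u.name FROM users u
    JOIN user_addresses ua ON ua.user_id = u.id
    WHERE ua.address_id = 10 ORDER BY u.name
""").fetchall()
```

Had the address columns lived in the user table, supporting either case would have meant adding columns or duplicating rows.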

Denormalization is an effort to improve the performance of a database. It does not (and should not) seek to eliminate the goal of data integrity, but instead seeks to optimize performance while maintaining data integrity. Because of this, a denormalized database is not equivalent to a database that has not been normalized. Instead, you only seek to denormalize a data model that has already been normalized. This distinction is important to understand, because you go from normalized to denormalized, not from nothing to denormalized.

The mistake that some software developers make is to build a denormalized database directly, considering only the performance aspect. This optimizes only one part of the equation, database reads: SELECT queries are simpler to write and execute faster. But it puts a tax on database writes in two ways:

  1. You have multiple UPDATE queries, which slow down writes. This may be acceptable if your application is read-heavy, as most web applications are, but not for applications that receive and process heavy volumes of data.
  2. Multiple updates require more coding. You have to ensure, either in your application or at the system software level, that the copies get updated together. If they do not, you have inconsistent data that can cause problems for your users.
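The write tax above can be sketched in SQLite. Here a product's price is copied into an order-lines table so that reports avoid a join; the schema and names are invented for illustration. Every price change now means two UPDATEs, and they must run in one transaction so readers never see the copies disagree:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (
        id    INTEGER PRIMARY KEY,
        price REAL NOT NULL
    );
    -- Denormalized: price is duplicated here so reads skip the join.
    CREATE TABLE order_lines (
        id         INTEGER PRIMARY KEY,
        product_id INTEGER NOT NULL REFERENCES products(id),
        price      REAL NOT NULL
    );
""")
conn.execute("INSERT INTO products VALUES (1, 5.00)")
conn.execute("INSERT INTO order_lines VALUES (100, 1, 5.00)")

# The write tax: one logical change, two UPDATE statements, wrapped in
# a transaction (the `with` block commits both or rolls both back).
with conn:
    conn.execute("UPDATE products SET price = 6.00 WHERE id = 1")
    conn.execute("UPDATE order_lines SET price = 6.00 WHERE product_id = 1")

a = conn.execute("SELECT price FROM products WHERE id = 1").fetchone()[0]
b = conn.execute("SELECT price FROM order_lines WHERE id = 100").fetchone()[0]
```

Forget the second UPDATE, or let it run outside the transaction, and you get exactly the inconsistency described in point 2.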

Some applications can tolerate data being inconsistent for a brief period. For example, a report of your website statistics may be missing the past 5 minutes of data, but that is alright because you are fine with ballpark figures. On the other hand, a report of your bank balance cannot show a stale value even a second after you have withdrawn money.

Some developers try to overcome data integrity problems with scheduled jobs that reconcile the copies. This is not only unworkable for some applications (like the banking example in the previous paragraph), but can also be dangerous. Scheduled jobs may depend on properly maintained audit trails, and if the number of writes grows, they may need to run more frequently than anticipated, consuming more computing resources.

If you are not careful, denormalization becomes an excuse to introduce inefficiencies because you assume it will solve your performance problems. That is premature optimization, and sometimes optimization of something that can never fit within your available resources. When the amount of data is huge and requires enormous processing, you have to look at entirely different solutions, such as parallel processing or other database structures, or even reconsider your requirements.

Denormalization is a design step that builds on normalization, and it should not be treated naively. Framing denormalization against normalization purely in terms of performance is simplistic and can result in major application problems. So spend the time to understand why you need normalization in the first place, how and when to apply denormalization without making worse mistakes, and finally when you should break out of your existing mindset and look at other solutions.

[Photo licensed from meteorix]


Anonymous March 8, 2009 at 12:02 am

"Database denormalization is an area is" ?

Krishna Kumar March 8, 2009 at 5:42 am

Fixed the typo. Thanks for pointing it out.

Asher March 11, 2009 at 9:30 pm

Very good article indeed. We need a certain level of denormalization; this could be one reason why we often stop at third normal form even though there are five normal forms...

Anonymous April 7, 2009 at 7:59 pm

My personal opinion is that a programmer should NOT do explicit denormalization to improve performance, except in very special circumstances. Instead they should use features like Materialized Query tables (DB2) or materialized views (Oracle) that can improve performance transparently. DB2 for example, can suggest best MQTs automatically based on a workload and the performance improvements are easily measurable.



Krish April 8, 2009 at 9:02 am

You are right, Vijil. I think many programmers ignore or underestimate the capability of the systems software to do performance tuning. They should first explore what the DB server and the hardware can do before going down the path of denormalization.

