Database Choice for Data Warehouse – Part I

Frequently a question crops up “What is your database choice for a data warehouse?” 

It is not an easy question to answer and certainly should not be a preconceived one. Here is why.

There are number of things that have be to looked at, in recommending and/or opting a database for a data warehouse.

Here is a small list.

  • Existing Infrastructure
  • Existing Architecture
  • Current/Recent Issues
  • Application Interfaces
  • Current/Recent Data Volume Trends
  • Existing Talent Pool (Database & Non-Database)
  • Expected Growth – Application Model and Complexity
  • Expected Growth – Data Volume
  • Consumption/Usage – Business Intelligence, Analytics etc.
  • Availability/SLA’s
  • (Last but not least) – Budget
I will explore each of them in detail. Also, what I have not covered (but left for a later post) is specifics about the database product, the company behind it, their market position.

Database Choice for Data Warehouse – Part II

Previously, I had given a small list in recommending or opting a database for Data Warehouse. Lets look at them in detail. Bear with me, this is going to be a long one, since I want to address them in one piece.

Existing Infrastructure

It is very important to understand what we have, before we lay out of plan to what we would like to have. Systems such as data warehouse will coexist with existing ones. If you are starting new, you have to make a choice between building a new infrastructure (for data warehouse) and bringing into life, among existing ones. Most often, the latter will give you the time and cost advantage than the former. Also, given your (new) data warehouse database infrastructure would be part of the existing ecosystem, integration is relatively easy.

Existing Architecture
It is very important to understand business flows, as any system that we build has to reflect a purpose, a business purpose.

Why are you building something?

And what do you expect out of it?

Those two questions act as foundations of your architecture. Data architecture does not exist on its own. It is an integral part of an enterprise. That means, your system is going to be deeply integrated with existing systems and processes. Whether it is your transaction systems, CRM database or customer facing applications and portals, your data warehouse architecture will coexist with those.

So a very sound understanding of the existing architecture is critical to start with. From a technical stand-point, it is equally important to understand the current landscape for you to make a choice.

Since, data-integration is a critical component your data warehouse will talk to any or your existing systems to get data. Your goal is not bring something new into life, which is isolated and is very cumbersome to make it talk to other systems.

Current/Recent Issues
You may wonder why is this important?

Think about it – challenges in a data environment are plenty. If you already have a data warehouse and are spending most of the time fire-fighting issues around loads and other technical database issues, you have to zoom out to think about the choice of existing technology solutions you have and whether they serve you well or not.

If you are building a data warehouse anew, think of the requirements you have in front of you as ‘expectations’ from this data warehouse. How much of critically is attached to the data-availability and system response to drive your decisions (among other factors)?

Application Interfaces
As we saw above, your data warehouse is integral part of an ecosystem. It is going to be able to talk to other systems and at the same time, it will its own unique set of processes around it. So the communication channels, leading into and out of this data warehouse is paramount for inflow and outflow of data.

Every incremental decision you make as you build and as your data warehouse matures, these ‘interfaces’ have to be looked into, to make sure they operate and operate well. If your choice limits one or more of these, you would have spend lot of time and money to find workaround (if there is any workaround available).

Current/Recent Data Volume Trends
Data volume is often overlooked. And all too often you are caught by surprise. If you have a data warehouse already, suddenly when your processes starts to fail or slow down and you realize that for various reasons (that would be found very soon), your architecture is unable to cope with the increase in the volume.

If you are in the exercise of calculating or projecting data volume for a new architecture, it is a very important exercise that you should prefer to do carefully. Sticking a big volume of hard-disk is not an ideal solution. Assuming that it would work for the next n months is a recipe for disaster. As your business grows, your decisions may change and as they change, you take decisions (most often incremental) to collect more, to be able to store and analyze in the data warehouse.

These have to be foreseen, as you decide to collect new pieces of information. In other words, such an assessment is a continuous process. You may think that you have a grip on your (existing) database capabilities with volume increase, but that doesn’t come automatic. You have to be on top of it. If you are planning to build a new data warehouse, it is ever more critical for you to do projections that are reasonable and accurate. Dig deep into your profiling exercise to understand that’s flowing into your system.

Existing Talent Pool (Database/Non-Database)
Remember, data warehouse are not just for the architects. It is for the entire organization. You would have Business Analysts, Department Heads, Engineering Community (data and non-data), Department’ data liaison’s etc. using and interacting with these systems.

In an organization where the technology platform has matured, the comfort level with one or two flavor of database also would have grown over years. This may present a tricky challenge in opting for a third choice. Experts may resist introduction of a new platform. Natural concern among all would the learning curve as well as the integration time-line. Time for any training should also be accounted, planned for and addressed. Equally, an objective and thorough analysis on why the new platform is better than exiting ones should be made available to all concerned.

Expected Growth – Application Model and Complexity
Remember your organization is constantly growing. Applications that support your organization grows and along with it, complexity grows. With the knowledge you have and you should have about the current set of applications, it is equally important to understand how they are expected to mature as well.

Just as a data warehouse architect builds and predicts a maturity model, it is important to understand other software architects’ plan as well, as they have a direct and indirect impact on your data warehouse architecture.

Expected Growth – Data Volume
One of the many important exercise that a data architect has to perform is to come with projections – data volume projections. The following questions among others have to be addressed

  1. How much of growth do we expect and when ?
  2. How do you plan to handle it ?
  3. Does your database have any technical limitation to handle it ?
  4. How are they expected to be processed ?
  5. What is the intensity level ?
  6. Where could be the bottleneck ?
  7. How to mitigate such bottlenecks ?
  8. Will the process condense or explode that data which will increase its size ?
  9. Are there any issues with the current architecture to not handle this ?
  10. What is the fail-over strategy ?
  11. What the business case to cause this increase ?
  12. Is it seasonal ?

Consumption/Usage – Business Intelligence, Analytics etc.
Remember your data warehouse is your serve your enterprise. There are going to lot of users (from different departments) using your data warehouse, through different means – Business Intelligence Processes, Analytical Processes, Data Dumps etc.

It is going to incur lot of load on your data warehouse and in turn your database. A clear understanding on the consumption pattern is very important to manage it well.

Expectations are going to be set on data availability on any data warehouse. That also means that your database should perform well for you to be able to meet those time-lines. A database foundation of a data warehouse is always expected to perform well. Look deep into any potential bottlenecks, which will prevent you from meeting those SLA’s. Equally, as your organization grows, as complexity increase, these have to continuously monitored and addressed (when needed).

All too often, you may have budgeting constraints to pick a database flavor, that may not have been your (or your evaluation panel) choice. Database vendor of your choice might have good set of features that looks fit to your need, but it could come at a price.

So you move with your second or third choice. A clear understanding of the feature set between different databases will help you immensely not just in the beginning of your setup, but as your data warehouse grows as well. I will revisit this, as I talk specifically about few databases in a data warehousing environment.