Previously, I had given a small list in recommending or opting a database for Data Warehouse. Lets look at them in detail. Bear with me, this is going to be a long one, since I want to address them in one piece.
It is very important to understand what we have, before we lay out of plan to what we would like to have. Systems such as data warehouse will coexist with existing ones. If you are starting new, you have to make a choice between building a new infrastructure (for data warehouse) and bringing into life, among existing ones. Most often, the latter will give you the time and cost advantage than the former. Also, given your (new) data warehouse database infrastructure would be part of the existing ecosystem, integration is relatively easy.
It is very important to understand business flows, as any system that we build has to reflect a purpose, a business purpose.
Why are you building something?
And what do you expect out of it?
Those two questions act as foundations of your architecture. Data architecture does not exist on its own. It is an integral part of an enterprise. That means, your system is going to be deeply integrated with existing systems and processes. Whether it is your transaction systems, CRM database or customer facing applications and portals, your data warehouse architecture will coexist with those.
So a very sound understanding of the existing architecture is critical to start with. From a technical stand-point, it is equally important to understand the current landscape for you to make a choice.
Since, data-integration is a critical component your data warehouse will talk to any or your existing systems to get data. Your goal is not bring something new into life, which is isolated and is very cumbersome to make it talk to other systems.
You may wonder why is this important?
Think about it – challenges in a data environment are plenty. If you already have a data warehouse and are spending most of the time fire-fighting issues around loads and other technical database issues, you have to zoom out to think about the choice of existing technology solutions you have and whether they serve you well or not.
If you are building a data warehouse anew, think of the requirements you have in front of you as ‘expectations’ from this data warehouse. How much of critically is attached to the data-availability and system response to drive your decisions (among other factors)?
As we saw above, your data warehouse is integral part of an ecosystem. It is going to be able to talk to other systems and at the same time, it will its own unique set of processes around it. So the communication channels, leading into and out of this data warehouse is paramount for inflow and outflow of data.
Every incremental decision you make as you build and as your data warehouse matures, these ‘interfaces’ have to be looked into, to make sure they operate and operate well. If your choice limits one or more of these, you would have spend lot of time and money to find workaround (if there is any workaround available).
Current/Recent Data Volume Trends
Data volume is often overlooked. And all too often you are caught by surprise. If you have a data warehouse already, suddenly when your processes starts to fail or slow down and you realize that for various reasons (that would be found very soon), your architecture is unable to cope with the increase in the volume.
If you are in the exercise of calculating or projecting data volume for a new architecture, it is a very important exercise that you should prefer to do carefully. Sticking a big volume of hard-disk is not an ideal solution. Assuming that it would work for the next n months is a recipe for disaster. As your business grows, your decisions may change and as they change, you take decisions (most often incremental) to collect more, to be able to store and analyze in the data warehouse.
These have to be foreseen, as you decide to collect new pieces of information. In other words, such an assessment is a continuous process. You may think that you have a grip on your (existing) database capabilities with volume increase, but that doesn’t come automatic. You have to be on top of it. If you are planning to build a new data warehouse, it is ever more critical for you to do projections that are reasonable and accurate. Dig deep into your profiling exercise to understand that’s flowing into your system.
Existing Talent Pool (Database/Non-Database)
Remember, data warehouse are not just for the architects. It is for the entire organization. You would have Business Analysts, Department Heads, Engineering Community (data and non-data), Department’ data liaison’s etc. using and interacting with these systems.
In an organization where the technology platform has matured, the comfort level with one or two flavor of database also would have grown over years. This may present a tricky challenge in opting for a third choice. Experts may resist introduction of a new platform. Natural concern among all would the learning curve as well as the integration time-line. Time for any training should also be accounted, planned for and addressed. Equally, an objective and thorough analysis on why the new platform is better than exiting ones should be made available to all concerned.
Expected Growth – Application Model and Complexity
Remember your organization is constantly growing. Applications that support your organization grows and along with it, complexity grows. With the knowledge you have and you should have about the current set of applications, it is equally important to understand how they are expected to mature as well.
Just as a data warehouse architect builds and predicts a maturity model, it is important to understand other software architects’ plan as well, as they have a direct and indirect impact on your data warehouse architecture.
Expected Growth – Data Volume
One of the many important exercise that a data architect has to perform is to come with projections – data volume projections. The following questions among others have to be addressed
- How much of growth do we expect and when ?
- How do you plan to handle it ?
- Does your database have any technical limitation to handle it ?
- How are they expected to be processed ?
- What is the intensity level ?
- Where could be the bottleneck ?
- How to mitigate such bottlenecks ?
- Will the process condense or explode that data which will increase its size ?
- Are there any issues with the current architecture to not handle this ?
- What is the fail-over strategy ?
- What the business case to cause this increase ?
- Is it seasonal ?
Consumption/Usage – Business Intelligence, Analytics etc.
Remember your data warehouse is your serve your enterprise. There are going to lot of users (from different departments) using your data warehouse, through different means – Business Intelligence Processes, Analytical Processes, Data Dumps etc.
It is going to incur lot of load on your data warehouse and in turn your database. A clear understanding on the consumption pattern is very important to manage it well.
Expectations are going to be set on data availability on any data warehouse. That also means that your database should perform well for you to be able to meet those time-lines. A database foundation of a data warehouse is always expected to perform well. Look deep into any potential bottlenecks, which will prevent you from meeting those SLA’s. Equally, as your organization grows, as complexity increase, these have to continuously monitored and addressed (when needed).
All too often, you may have budgeting constraints to pick a database flavor, that may not have been your (or your evaluation panel) choice. Database vendor of your choice might have good set of features that looks fit to your need, but it could come at a price.
So you move with your second or third choice. A clear understanding of the feature set between different databases will help you immensely not just in the beginning of your setup, but as your data warehouse grows as well. I will revisit this, as I talk specifically about few databases in a data warehousing environment.
A good understanding of what a Data Warehouse appliance is, would help understand when to think of getting one.
A data warehouse appliance is specifically tuned and built for a data warehouse, which consists of hardware (Processor, Memory, Storage, Network Array etc.) and software (DBMS – DataBase Management System, Application Interfaces, Administration Interfaces etc.)
Now how is that different from what you may already have – dedicated hardware (servers) and software (database such as Oracle or MySQL)?
Difference is, these commodity hardware and database software are meant for general purpose, which could be used and tweaked to build and maintain a data warehouse as other type of applications. But they are not exclusively meant to host a data warehouse.
The point above begs the question – what would be the need to have a dedicated hardware and software and what could be expected out of it?
If you are a CIO, please look back into the IT organization. Look deep into existing data infrastructure and take stock of the tools and technologies used. Mainly look deep into the challenges you face today.
Most often you would hear from your data architects and administrators that while the applications are tweaked to their maximum and hardware resources have been added (both proactively and on-demand), it just is unable to cope with the ‘demand’.
So what is the ‘demand’ ?
There are two dimensions to ‘demand’. One, what end user expects out of an analytical application and two, how much of data need to stored and analyzed, to cope with the first dimension of the demand.
Organizations today, wants to collect as much data as possible. Equally storage prices [cost per byte (gigabyte, terabyte, petabyte etc.)] are crashing. On the other hands, ends user requirements and expectations demands to analyze as much as that is stored.
I would even argue that storage price and end user requirements/expectations demands that we collect more. And this cycle continues.
So if you are a CIO today, you know that ‘exponential’ data growth is sure thing.
Coming back to the general purpose hardware and software – these machines as software systems were mostly meant for common demands – Web Applications, Light-weight transactional systems, Portals, Small Business registers etc. They do work well but again their goal is different. So the performance expectations don’t often match with what you would expect out of a high-volume data warehouse.
A data warehouse stores data, lot of it and in different forms – Common Identifiers, Long strings, Free form text, Transactional Information, Event driven data etc. etc.
When you have them all, and in big volume, not only efficient storage is important, but being able to analyze them to get actionable intelligence and in an shortest possible time-frame is paramount.
And we realize that existing infrastructure and architecture is not an ideal solution. This is where a Data warehouse Appliance comes to our rescue.
We will see more on this in the next part.
Previously I raised a question, commonly asked on the flavor of database for data warehouse.
Along the lines, I get asked “What is your choice for a Business Intelligence tool?”. To be clear, most often this question is about a ‘reporting’ tool and not an ‘ETL’ tool.
My answer as my position is the same – there is no ready made, predetermined answer to this.
Why again is it difficult to answer that or actually what would be the criteria to select a BI tool?
Here are some
- Underlying data warehouse architecture
- Data volume (today and projections)
- Application Interfaces
- Usage (today and projections)
- User Experience
- Legacy tool knowledge, considerations and applicability
- Availability Expectations
- (Last but not least) Budget
Frequently a question crops up “What is your database choice for a data warehouse?”
It is not an easy question to answer and certainly should not be a preconceived one. Here is why.
There are number of things that have be to looked at, in recommending and/or opting a database for a data warehouse.
Here is a small list.
- Existing Infrastructure
- Existing Architecture
- Current/Recent Issues
- Application Interfaces
- Current/Recent Data Volume Trends
- Existing Talent Pool (Database & Non-Database)
- Expected Growth – Application Model and Complexity
- Expected Growth – Data Volume
- Consumption/Usage – Business Intelligence, Analytics etc.
- (Last but not least) – Budget