Data warehousing can add significant value by facilitating enhanced business intelligence, but success isn't automatic. This article explores critical data-warehousing success factors.
By Henry Martinez
Operational databases are predominantly optimized for transaction processing. They are usually highly normalized and are very efficient at reading and writing a few related records at a time. However, they are typically not optimally architected to serve the needs of executives and business analysts who perform query, reporting, and analysis against vast swaths of data. When these people use operational databases to complete these tasks, the database must often perform large, complex table joins. The necessary disk I/O and processing can bring systems to their knees.
Further complicating matters, users of business intelligence tools must often integrate data from a variety of disparate databases. For example, they may need to merge inventory data from an Oracle database, sales data from a DB2 database, and forecasting data from a Sybase database--all of which run on different hardware and operating system platforms--in order to optimize inventory management.
Figure 1: Enterprise-wide requirements often include disparate databases.
To solve these problems, organizations often create a data warehouse and/or data marts to merge data from operational databases. The resulting data store is optimized for query, reporting, and business intelligence purposes.
It sounds easy. Gather information from operational databases and copy it into a database optimized for online analytical processing (OLAP) instead of online transaction processing (OLTP). It may sound easy, but it isn't. Many organizations' data warehousing efforts have met with qualified success or worse.
Defining Terms
Before I delve into this topic, let me clear up some of the confusion that often surrounds "data" and related terms.
•· Data are numbers, names, labels, and other structured values that can be contained in databases. The temperature is an example of data.
•· Information is synthesized from data by applying logic to one or more instances of data. As such, databases don't contain information directly, but information can be created from the contents of a database. "The temperature in the office is within most people's comfort zone" is an example of information.
•· Combining information and experience generates rational decisions. For example, experience may tell you that people work better when they are comfortable, so you decide to turn on the air conditioning when the office gets too hot.
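The data-to-information-to-decision progression can be sketched in a few lines of Python (the temperature readings and comfort range below are illustrative values, not from the article):

```python
# Data: a raw structured value as it might be stored in a database.
office_temp_f = 81  # degrees Fahrenheit (hypothetical reading)

# Information: synthesized by applying logic to data.
COMFORT_LOW, COMFORT_HIGH = 68, 76  # assumed comfort zone
in_comfort_zone = COMFORT_LOW <= office_temp_f <= COMFORT_HIGH

# Decision: combine information with experience (a rule of thumb).
action = "turn on A/C" if (not in_comfort_zone and office_temp_f > COMFORT_HIGH) else "do nothing"
```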
A Word from Your Sponsor
The nature of query, reporting, and, particularly, business intelligence functions sets them apart from operational functions in another way as well. Operational functions can be precisely defined. In contrast, the ways a data warehouse will be employed after users gain experience with it and the purposes to which business intelligence tools will eventually be put typically cannot be predicted with any precision much beyond their initial implementation.
This is not to say that the requirements analysis phase of an operational application development or acquisition project is easy. The number of projects that fail to fully meet user requirements attests to the fact that it is not. But at least it is possible.
Furthermore, with operational applications it is conceivable, although usually not desirable, for the IT department to take full responsibility for analyzing and defining functional requirements. Theoretically, it is possible to build or buy a system that entirely meets current operational needs by observing users and analyzing existing business processes. That might not be the ideal approach, and it would likely fail to address future requirements, but it could, in theory, work.
In contrast, when it comes to performing business intelligence tasks, each answer that a user finds often leads to new, unexpected questions. Business analysts look for patterns and trends that will help them to increase sales, reduce costs, improve quality, or achieve any number of other corporate objectives. Those patterns and trends are unknown until they are found. Random walks through the data are, therefore, a common element of business intelligence efforts. Thus, the IT department cannot define business intelligence needs by simply observing the user community, because one of the things that business analysts need to determine is whether the processes being performed now are being performed optimally and, indeed, whether they are the right processes to be performing at all. The objective of business intelligence tasks is, as the name suggests, to create new intelligence, not merely to use existing knowledge. That objective is rarely met entirely through well-structured processes.
The upshot is that the user community must drive business intelligence projects, including the building of the data warehouses that support them. A data warehouse that is thrust upon business analysts and executives will likely neither gain their support nor fully serve their needs. Therefore, a successful data-warehousing project requires a strong sponsor from the user community who will champion the project.
Little Bangs
A "big bang" project to create a universal, encyclopedic data warehouse for the entire enterprise in a single go is almost certain to fail. The complexity, heterogeneity, and interrelatedness of the various enterprise-wide data sources and the vagueness of myriad potential purposes to which the data warehouse can be put are normally so great as to result in a data warehouse project that is perpetually in the analysis-and-design phase. It is better to start with a single area of the company for which there is an enthusiastic user who will champion the analysis-and-design process for his or her department or work function.
Starting with a "little bang" rather than a big one offers a number of benefits, including allowing you to do the following:
•· Devote the resources required to find and analyze all data that directly or indirectly impacts the subject department or work function.
•· Avoid the "paralysis by analysis" that often occurs when trying to do too much in a single project.
•· Press the data warehouse into service more quickly, thereby allowing you to generate value sooner. This also allows the IT department and the user community to gain more experience with data warehousing and business intelligence tools and concepts before building the full data warehouse.
When taking this "little bang" approach, be sure to set an appropriate expectation level within the enterprise. If the executive suite expects an all-inclusive, all-singing, all-dancing data warehouse and business intelligence solution set, they will be disappointed with the initial results. On the other hand, if they recognize that the first effort is a pilot project that will be expanded in a prudent, affordable manner, benefiting from learning experiences along the way, disappointment over the magnitude of the "little bang" can be avoided.
Analysis Is Critical
Users who become frustrated with a data warehouse won't use it and will resist attempts to expand its scope and use. Therefore, it is important to get every project right, particularly the first one. To do so, the analysis phase must answer the following questions:
•· What information can add value to the decision-making processes of the subject department or job function? And what data is required to synthesize that information? This is not always obvious, nor is all of the data necessarily within the domain of the subject department. For example, suppose quality-control analysts in the manufacturing group need to uncover the cause of a spike or, worse, a steady increase in product defects.
o Was it a result of a decline in the skill levels of new hires? That information is in the human resources database.
o Did a change in shift schedules result in workers on one shift being more fatigued than they had been on the previous schedule, and, if so, was that shift responsible for the defects? Answering that question may require plumbing data from both the human resources and manufacturing databases.
o Are most of the defective products being built using raw materials or subassemblies from a new supplier? To find out, you may need to consult both the manufacturing and purchasing databases.
o Was there some other cause that, if found, would be surprising? Finding those surprises requires taking random walks through all data that in any way relates to the production, storage, and transportation of the products in question.
•· Where does the data currently reside?
•· How important is data timeliness? A data warehouse that supports solely financial reporting activities needs data that is complete up to the end of the last reporting period. More recent data only gets in the way. On the other hand, managers at a major retail chain who want to stay on top of rapidly changing fashion tastes, while also ensuring that each store has the stock that will best serve local preferences, will likely want data that is no more than one day old. Even better would be near real-time data gathered from point-of-sale terminals.
•· Is data in the source databases in the format that is most easily usable by the people who will use the data warehouse? For instance, telephone area codes can provide useful data for marketers planning telemarketing campaigns. Yet many operational databases contain a single telephone number field. The data warehouse might better serve marketers by separating the telephone number into separate area code and local number fields, allowing the data to be sorted and filtered easily using the area code.
•· Will data from differing sources need to be reconciled? For example, an international company may need to convert the kilograms and centimeters in its European subsidiaries' databases or the pounds and inches in its American databases into common units in an enterprise data warehouse.
•· Are there any data privacy and security concerns? Care must be taken to ensure that such requirements continue to be met when data is copied from production databases into the data warehouse.
•· What demands will be placed on the data warehouse's disks and servers? Because users will not have a good sense of all of the purposes to which a data warehouse can be put until they start using it, a rough estimate is likely the best that can be hoped for during the analysis phase. Consequently, it is important to design a highly scalable data warehouse.
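Two of the transformations implied above, splitting a single telephone-number field into area code and local number and reconciling units across sources, are simple to sketch (the function names and sample number are hypothetical):

```python
def split_phone(phone: str) -> tuple[str, str]:
    """Split a North American number like '312-555-0199' into
    (area_code, local_number) so marketers can sort and filter by area code."""
    digits = "".join(ch for ch in phone if ch.isdigit())
    return digits[:3], digits[3:]

def kg_to_lb(kg: float) -> float:
    """Reconcile European kilograms to pounds for a common warehouse unit."""
    return kg * 2.20462

area, local = split_phone("(312) 555-0199")
# area holds the area code; local holds the remaining digits
```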
If you use a "little bangs" approach to building the data warehouse, keep in mind that each new project adds only a piece of the puzzle. As you expand the data warehouse to meet the needs of other departments, you'll need to consider that the requirements of those departments will differ. Therefore, strive to maintain a design that is as flexible as possible. This might argue for maintaining a highly normalized structure, which is contrary to the desire to denormalize data to improve OLAP performance. A way to reconcile this conflict is suggested in the "Warehouse, Mart, or Both" section below.
Keep It Clean
To optimize the efficiency and effectiveness of a data warehouse, the data in it should be as "clean" as possible. There are a number of aspects to data cleansing, including the following:
•· Standardize data descriptions, abbreviations, and formats. For example, the same person may be listed as Mrs. Mary Jones in a company's Web store database, but as Ms. M. Jones in a database supporting its physical stores. And one database may list her address as 123 Main St., Apt. 456, another as 456 - 123 Main Street, and yet another as 123 Main Street, Suite 456.
•· Consolidate duplicate records. Even a single database may contain redundant records. This can occur because of the differences described in the preceding point. For example, a sales application may create a new customer record when it doesn't recognize that Mrs. Mary Jones at 123 Main St., Apt 456 is the same person as Ms. M. Jones at 456 - 123 Main Street. Moreover, the existence of duplicate records is almost assured when data is merged from multiple databases containing overlapping data.
•· Correct errors where possible. For example, publicly available address-change databases can be matched against company records as data is moved into the data warehouse, correcting addresses even when the customer has not yet reported a move.
•· Detect and flag corrupted records, or, where possible, automatically fix the corruption.
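A minimal sketch of the first two cleansing steps, standardizing names and addresses so duplicates can be consolidated, might look like the following. The normalization rules here are deliberately crude illustrations; commercial cleansing tools handle far more variation than this:

```python
import re

def normalize_name(name: str) -> str:
    """Standardize honorifics and initials so 'Mrs. Mary Jones' and
    'Ms. M. Jones' compare equal: drop the honorific, reduce the
    given name to an initial, and uppercase the result."""
    parts = name.replace(".", "").split()
    if parts and parts[0].lower() in {"mr", "mrs", "ms", "dr"}:
        parts = parts[1:]
    if len(parts) >= 2:
        parts[0] = parts[0][0]  # keep only the initial
    return " ".join(parts).upper()

def normalize_address(addr: str) -> str:
    """Standardize common abbreviations ('St.' vs 'Street', 'Apt' vs 'Suite')."""
    addr = addr.upper().replace(".", "").replace(",", "")
    addr = re.sub(r"\bSTREET\b", "ST", addr)
    addr = re.sub(r"\b(APT|SUITE|UNIT)\b", "APT", addr)
    return " ".join(addr.split())

# Two records from different databases that should consolidate to one:
a = (normalize_name("Mrs. Mary Jones"), normalize_address("123 Main St., Apt. 456"))
b = (normalize_name("Ms. M. Jones"), normalize_address("123 Main Street, Suite 456"))
# a == b, so the records can be merged into a single warehouse record
```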
Warehouse, Mart, or Both
As stated above, the query, reporting, and business intelligence needs of each functional area within a company differ. The data warehouse design that best serves one group may all but thwart the objectives of another. One way to resolve this dilemma is to build a data warehouse that serves primarily to consolidate and reconcile data from the various operational databases but possibly little else.
This data warehouse can then be used to feed specific-purpose data marts that contain just the data needed for the functional areas they will serve and that are structured to best serve the needs of those functional areas.
This data warehouse/data mart topology allows you to schedule the population of each data mart as appropriate. For example, marketing analysts may want data to be as current as possible. Their data mart can be fed updates as soon as they arrive in the data warehouse or possibly in nightly batches. On the other hand, a data mart that serves quarterly or annual financial reporting purposes can be built using a snapshot copy tool that is run immediately after the end of the relevant financial period and that captures data for only that period.
Figure 2: Individual data marts can draw the specific information they need from the data warehouse.
If the users of a particular data mart normally work with aggregated data--say, monthly totals and/or regional totals, rather than individual transactions--the data copied into that data mart can be added into "summary buckets" rather than stored as separate records, thereby eliminating the need to create those summaries every time a user accesses the data mart.
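The "summary buckets" idea can be sketched simply: instead of copying each transaction row into the data mart, add it into a running total keyed by the aggregation level. The month-and-region key and the sample figures below are assumed examples:

```python
from collections import defaultdict

# Incoming transaction rows from the data warehouse: (month, region, amount).
transactions = [
    ("2024-01", "EAST", 120.0),
    ("2024-01", "EAST", 80.0),
    ("2024-01", "WEST", 50.0),
    ("2024-02", "EAST", 200.0),
]

# Summary buckets: accumulate each transaction into its monthly/regional
# total rather than storing it as a separate record in the data mart.
buckets: dict[tuple[str, str], float] = defaultdict(float)
for month, region, amount in transactions:
    buckets[(month, region)] += amount

# buckets now holds the pre-aggregated totals users actually query,
# so the summaries need not be recomputed on every access.
```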
It is not necessary to feed data marts from a data warehouse. Instead, the marts can be populated directly from operational databases. A potential drawback is that data merging and transformation tasks may have to be repeated several times as the data is sent to the different data marts, thereby increasing the processor load required to feed the data marts.
Data Movement Software
A key data warehousing infrastructure component is the software that copies data from the operational databases to the data warehouse and/or data marts. This software falls into two categories: data replication and extract, transform, and load (ETL) products.
Data replication software monitors updates to a source database and copies those changes to one or more target databases. Replication can occur in near real-time, or the replicator might offer the option of collecting updates at the source and transmitting them to the target in batches, possibly taking advantage of network slow periods.
As the name implies, ETL software first extracts data from a source database (possibly as a snapshot copy), transforms it as required, and then loads it onto a target database.
The distinction between data replication and ETL has blurred in recent years. Data replicators have long been able to transform data--the "T" in ETL--on the way from source to target. On the ETL side, many advanced products now recognize which data has changed since the last snapshot was taken and extract only that changed data; some products labeled as ETL have, in effect, morphed into change-based replicators as well.
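A change-based extract of the kind described above can be sketched as a filter on a modification timestamp. The `updated_at` column and the snapshot bookkeeping are assumptions for illustration; real ETL and replication tools typically read database logs or change tables rather than scanning rows:

```python
from datetime import datetime

def extract_changes(rows: list[dict], last_snapshot: datetime) -> list[dict]:
    """Incremental extract: return only rows modified since the last run,
    so the tool avoids re-copying the full table on every cycle."""
    return [r for r in rows if r["updated_at"] > last_snapshot]

rows = [
    {"id": 1, "updated_at": datetime(2024, 3, 1)},
    {"id": 2, "updated_at": datetime(2024, 3, 15)},
]
changed = extract_changes(rows, last_snapshot=datetime(2024, 3, 10))
# Only the row updated after the March 10 snapshot is extracted.
```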
When evaluating data replication and ETL software, consider the following:
•· Does the product support all of your current hardware, operating system, and database platforms? This may include being able to, for example, merge data from Oracle, Sybase, and DB2 databases--all running on different hardware and operating systems--into a Teradata data warehouse.
Keep in mind that your platforms may change in the future. When choosing new application software, you shouldn't have to reject the otherwise best choice because your ETL or data replication software won't support the platform that the application runs on. Therefore, look for ETL or replication software that supports the broadest range of data sources and targets, even though you may not require a range that broad today.
•· Does it support all of the data transformation capabilities that you require to cleanse, reconcile, and restructure data as it is moved into the data warehouse and/or marts? Most replicators and ETL software include some pre-written data transformation routines to meet requirements that are common to many organizations. Because no vendor can provide pre-written routines that will meet all of the unique needs of every organization, the software must allow you to write your own data transformation routines.
•· Does it allow for different database schemas on the source and target databases? For example, you may need to add derived columns such as generating a region code from a ZIP code or a telephone area code or adding a timestamp to indicate the freshness of the data. Or if you want to denormalize data in the data warehouse or marts, you may need to merge data from matching rows in multiple source tables into corresponding rows in a single target table. In short, the product must offer complete flexibility when mapping source and target tables, rows, and columns.
•· Does it support your data timeliness requirements? If you require a real-time data warehouse and/or marts, an ETL product that extracts and loads only snapshot copies at infrequent intervals will not meet your needs. Nonetheless, there are some instances, such as the quarterly or annual financial reporting data mart suggested above, when batch loads are the best option. For optimal flexibility, look for a product that supports both real-time, change-based replication as well as snapshot or batched copying.
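As an illustration of source-to-target mapping with derived columns, the sketch below carries key columns over, derives a region from a ZIP code, and stamps each row with a load time. The ZIP-to-region mapping and field names are hypothetical:

```python
from datetime import datetime, timezone

def region_from_zip(zip_code: str) -> str:
    """Derive a coarse US region code from the first ZIP digit
    (a hypothetical mapping for illustration only)."""
    return {"0": "NORTHEAST", "1": "NORTHEAST", "2": "SOUTHEAST",
            "3": "SOUTHEAST", "6": "MIDWEST", "9": "WEST"}.get(zip_code[0], "OTHER")

def transform_row(source_row: dict) -> dict:
    """Map a source row to the target schema: copy key columns,
    add a derived region column and a load timestamp."""
    return {
        "customer_id": source_row["customer_id"],
        "zip": source_row["zip"],
        "region": region_from_zip(source_row["zip"]),   # derived column
        "loaded_at": datetime.now(timezone.utc),         # data-freshness indicator
    }

row = transform_row({"customer_id": 42, "zip": "60601"})
```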
Putting It Together
Data warehousing, possibly including the maintenance of data marts, offers the opportunity to create significant value by using business intelligence tools to exploit information resources more fully, but success is not automatic. Success requires careful planning, analysis, and design along with the selection and implementation of the data warehousing infrastructure that will best meet your specific requirements.
Among the most critical requirements for success are data warehousing knowledge, skills, and experience. Space limitations do not allow more than a surface examination of data warehousing issues here. Before undertaking your first data warehousing project, make the necessary investment to acquire the required additional expertise, whether you do that by training existing employees, hiring new staff with experience in the area, or contracting data warehousing experts.
MC Press Online