Data warehousing can add significant value by facilitating enhanced business intelligence, but success isn't automatic. This article explores critical data-warehousing success factors.
By Henry Martinez
Operational databases are predominantly optimized for transaction processing. They are usually highly normalized and are very efficient at reading and writing a few related records at a time. However, they are typically not optimally architected to serve the needs of executives and business analysts who perform query, reporting, and analysis against vast swaths of data. When these people use operational databases to complete these tasks, the database must often perform large, complex table joins. The necessary disk I/O and processing can bring systems to their knees.
Further complicating matters, users of business intelligence tools must often integrate data from a variety of disparate databases. For example, they may need to merge inventory data from an Oracle database, sales data from a DB2 database, and forecasting data from a Sybase database--all of which run on different hardware and operating system platforms--in order to optimize inventory management.
Figure 1: Enterprise-wide requirements often include disparate databases.
To solve these problems, organizations often create a data warehouse and/or data marts to merge data from operational databases. The resulting data store is optimized for query, reporting, and business intelligence purposes.
It sounds easy. Gather information from operational databases and copy it into a database optimized for online analytical processing (OLAP) instead of online transaction processing (OLTP). It may sound easy, but it isn't. Many organizations' data warehousing efforts have met with qualified success or worse.
Defining Terms
Before I delve into this topic, let me clear up some of the confusion that often surrounds "data" and related terms.
•· Data are numbers, names, labels, and other structured values that can be contained in databases. The temperature is an example of data.
•· Information is synthesized from data by applying logic to one or more instances of data. As such, databases don't contain information directly, but information can be created from the contents of a database. "The temperature in the office is within most people's comfort zone" is an example of information.
•· Combining information and experience generates rational decisions. For example, experience may tell you that people work better when they are comfortable, so you decide to turn on the air conditioning when the office gets too hot.
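The data-to-information-to-decision progression can be sketched in a few lines of Python (the temperature readings and comfort range below are illustrative values, not from the article):

```python
# Data: a raw structured value as it might be stored in a database.
office_temp_f = 81  # degrees Fahrenheit (hypothetical reading)

# Information: synthesized by applying logic to data.
COMFORT_LOW, COMFORT_HIGH = 68, 76  # assumed comfort zone
in_comfort_zone = COMFORT_LOW <= office_temp_f <= COMFORT_HIGH

# Decision: combine information with experience (a rule of thumb).
action = "turn on A/C" if (not in_comfort_zone and office_temp_f > COMFORT_HIGH) else "do nothing"
```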
A Word from Your Sponsor
The nature of query, reporting, and, particularly, business intelligence functions sets them apart from operational functions in another way as well. Operational functions can be precisely defined. In contrast, the ways a data warehouse will be employed after users gain experience with it and the purposes to which business intelligence tools will eventually be put typically cannot be predicted with any precision much beyond their initial implementation.
This is not to say that the requirements analysis phase of an operational application development or acquisition project is easy. The number of projects that fail to fully meet user requirements attests to the fact that it is not. But at least it is possible.
Furthermore, with operational applications it is conceivable, although usually not desirable, for the IT department to take full responsibility for analyzing and defining functional requirements. Theoretically, it is possible to build or buy a system that entirely meets current operational needs by observing users and analyzing existing business processes. That might not be the ideal approach, and it would likely fail to address future requirements, but it could, in theory, work.
In contrast, when it comes to performing business intelligence tasks, each answer that a user finds often leads to new, unexpected questions. Business analysts look for patterns and trends that will help them to increase sales, reduce costs, improve quality, or achieve any number of other corporate objectives. Those patterns and trends are unknown until they are found. Random walks through the data are, therefore, a common element of business intelligence efforts. Thus, the IT department cannot define business intelligence needs by simply observing the user community, because one of the things that business analysts need to determine is whether the processes being performed now are being performed optimally and, indeed, whether they are the right processes to be performing at all. The objective of business intelligence tasks is, as the name suggests, to create new intelligence, not merely to use existing knowledge. That objective is rarely met entirely through well-structured processes.
The upshot is that the user community must drive business intelligence projects, including the building of the data warehouses that support them. A data warehouse that is thrust upon business analysts and executives will likely neither gain their support nor fully serve their needs. Therefore, a successful data-warehousing project requires a strong sponsor from the user community who will champion the project.
Little Bangs
A "big bang" project to create a universal, encyclopedic data warehouse for the entire enterprise in a single go is almost certain to fail. The complexity, heterogeneity, and interrelatedness of the various enterprise-wide data sources and the vagueness of myriad potential purposes to which the data warehouse can be put are normally so great as to result in a data warehouse project that is perpetually in the analysis-and-design phase. It is better to start with a single area of the company for which there is an enthusiastic user who will champion the analysis-and-design process for his or her department or work function.
Starting with a "little bang" rather than a big one offers a number of benefits, including allowing you to do the following:
•· Devote the resources required to find and analyze all data that directly or indirectly impacts the subject department or work function.
•· Avoid the "paralysis by analysis" that often occurs when trying to do too much in a single project.
•· Press the data warehouse into service more quickly, thereby allowing you to generate value sooner. This also allows the IT department and the user community to gain more experience with data warehousing and business intelligence tools and concepts before building the full data warehouse.
When taking this "little bang" approach, be sure to set an appropriate expectation level within the enterprise. If the executive suite expects an all-inclusive, all-singing, all-dancing data warehouse and business intelligence solution set, they will be disappointed with the initial results. On the other hand, if they recognize that the first effort is a pilot project that will be expanded in a prudent, affordable manner, benefiting from learning experiences along the way, disappointment over the magnitude of the "little bang" can be avoided.
Analysis Is Critical
Users who become frustrated with a data warehouse won't use it and will resist attempts to expand its scope and use. Therefore, it is important to get every project right, particularly the first one. To do so, the analysis phase must answer the following questions:
•· What information can add value to the decision-making processes of the subject department or job function? And what data is required to synthesize that information? This is not always obvious, nor is all of the data necessarily within the domain of the subject department. For example, suppose quality-control analysts in the manufacturing group need to uncover the cause of a spike or, worse, a steady increase in product defects.
o Was it a result of a decline in the skill levels of new hires? That information is in the human resources database.
o Did a change in shift schedules result in workers on one shift being more fatigued than they had been on the previous schedule, and, if so, was that shift responsible for the defects? Answering that question may require plumbing data from both the human resources and manufacturing databases.
o Are most of the defective products being built using raw materials or subassemblies from a new supplier? To find out, you may need to consult both the manufacturing and purchasing databases.
o Was there some other cause that, if found, would be surprising? Finding those surprises requires taking random walks through all data that in any way relates to the production, storage, and transportation of the products in question.
•· Where does the data currently reside?
•· How important is data timeliness? A data warehouse that supports solely financial reporting activities needs data that is complete up to the end of the last reporting period. More recent data only gets in the way. On the other hand, managers at a major retail chain who want to stay on top of rapidly changing fashion tastes, while also ensuring that each store has the stock that will best serve local preferences, will likely want data that is no more than one day old. Even better would be near real-time data gathered from point-of-sale terminals.
•· Is data in the source databases in the format that is most easily usable by the people who will use the data warehouse? For instance, telephone area codes can provide useful data for marketers planning telemarketing campaigns. Yet many operational databases contain a single telephone number field. The data warehouse might better serve marketers by separating the telephone number into separate area code and local number fields, allowing the data to be sorted and filtered easily using the area code.
•· Will data from differing sources need to be reconciled? For example, an international company may need to convert the kilograms and centimeters in its European subsidiaries' databases or the pounds and inches in its American databases into common units in an enterprise data warehouse.
•· Are there any data privacy and security concerns? Care must be taken to ensure that such requirements continue to be met when data is copied from production databases into the data warehouse.
•· What demands will be placed on the data warehouse's disks and servers? Because users will not have a good sense of all of the purposes to which a data warehouse can be put until they start using it, a rough estimate is likely the best that can be hoped for during the analysis phase. Consequently, it is important to design a highly scalable data warehouse.
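Two of the transformations implied above, splitting a single telephone-number field into area code and local number and reconciling units across sources, are simple to sketch (the function names and sample number are hypothetical):

```python
def split_phone(phone: str) -> tuple[str, str]:
    """Split a North American number like '312-555-0199' into
    (area_code, local_number) so marketers can sort and filter by area code."""
    digits = "".join(ch for ch in phone if ch.isdigit())
    return digits[:3], digits[3:]

def kg_to_lb(kg: float) -> float:
    """Reconcile European kilograms to pounds for a common warehouse unit."""
    return kg * 2.20462

area, local = split_phone("(312) 555-0199")
# area holds the area code; local holds the remaining digits
```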
If you use a "little bangs" approach to building the data warehouse, keep in mind that each new project adds only a piece of the puzzle. As you expand the data warehouse to meet the needs of other departments, you'll need to consider that the requirements of those departments will differ. Therefore, strive to maintain a design that is as flexible as possible. This might argue for maintaining a highly normalized structure, which is contrary to the desire to denormalize data to improve OLAP performance. A way to reconcile this conflict is suggested in the "Warehouse, Mart, or Both" section below.
Keep It Clean
To optimize the efficiency and effectiveness of a data warehouse, the data in it should be as "clean" as possible. There are a number of aspects to data cleansing, including the following:
•· Standardize data descriptions, abbreviations, and formats. For example, the same person may be listed as Mrs. Mary Jones in a company's Web store database, but as Ms. M. Jones in a database supporting its physical stores. And one database may list her address as 123 Main St., Apt. 456, another as 456 - 123 Main Street, and yet another as 123 Main Street, Suite 456.
•· Consolidate duplicate records. Even a single database may contain redundant records. This can occur because of the differences described in the preceding point. For example, a sales application may create a new customer record when it doesn't recognize that Mrs. Mary Jones at 123 Main St., Apt 456 is the same person as Ms. M. Jones at 456 - 123 Main Street. Moreover, the existence of duplicate records is almost assured when data is merged from multiple databases containing overlapping data.
•· Correct errors where possible. For example, publicly available address-change databases can be matched against company records as data is moved into the data warehouse, correcting addresses even when the customer has not yet reported a move.
•· Detect and flag corrupted records, or, where possible, automatically fix the corruption.
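A minimal sketch of the first two cleansing steps, standardizing names and addresses so duplicates can be consolidated, might look like the following. The normalization rules here are deliberately crude illustrations; commercial cleansing tools handle far more variation than this:

```python
import re

def normalize_name(name: str) -> str:
    """Standardize honorifics and initials so 'Mrs. Mary Jones' and
    'Ms. M. Jones' compare equal: drop the honorific, reduce the
    given name to an initial, and uppercase the result."""
    parts = name.replace(".", "").split()
    if parts and parts[0].lower() in {"mr", "mrs", "ms", "dr"}:
        parts = parts[1:]
    if len(parts) >= 2:
        parts[0] = parts[0][0]  # keep only the initial
    return " ".join(parts).upper()

def normalize_address(addr: str) -> str:
    """Standardize common abbreviations ('St.' vs 'Street', 'Apt' vs 'Suite')."""
    addr = addr.upper().replace(".", "").replace(",", "")
    addr = re.sub(r"\bSTREET\b", "ST", addr)
    addr = re.sub(r"\b(APT|SUITE|UNIT)\b", "APT", addr)
    return " ".join(addr.split())

# Two records from different databases that should consolidate to one:
a = (normalize_name("Mrs. Mary Jones"), normalize_address("123 Main St., Apt. 456"))
b = (normalize_name("Ms. M. Jones"), normalize_address("123 Main Street, Suite 456"))
# a == b, so the records can be merged into a single warehouse record
```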
Warehouse, Mart, or Both
As stated above, the query, reporting, and business intelligence needs of each functional area within a company differ. The data warehouse design that best serves one group may all but thwart the objectives of another. One way to resolve this dilemma is to build a data warehouse that serves primarily to consolidate and reconcile data from the various operational databases but possibly little else.
This data warehouse can then be used to feed specific-purpose data marts that contain just the data needed for the functional areas they will serve and that are structured to best serve the needs of those functional areas.
This data warehouse/data mart topology allows you to schedule the population of each data mart as appropriate. For example, marketing analysts may want data to be as current as possible. Their data mart can be fed updates as soon as they arrive in the data warehouse or possibly in nightly batches. On the other hand, a data mart that serves quarterly or annual financial reporting purposes can be built using a snapshot copy tool that is run immediately after the end of the relevant financial period and that captures data for only that period.
Figure 2: Individual data marts can draw the specific information they need from the data warehouse.
If the users of a particular data mart normally work with aggregated data--say, monthly totals and/or regional totals, rather than individual transactions--the data copied into that data mart can be added into "summary buckets" rather than stored as separate records, thereby eliminating the need to create those summaries every time a user accesses the data mart.
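The "summary buckets" idea can be sketched simply: instead of copying each transaction row into the data mart, add it into a running total keyed by the aggregation level. The month-and-region key and the sample figures below are assumed examples:

```python
from collections import defaultdict

# Incoming transaction rows from the data warehouse: (month, region, amount).
transactions = [
    ("2024-01", "EAST", 120.0),
    ("2024-01", "EAST", 80.0),
    ("2024-01", "WEST", 50.0),
    ("2024-02", "EAST", 200.0),
]

# Summary buckets: accumulate each transaction into its monthly/regional
# total rather than storing it as a separate record in the data mart.
buckets: dict[tuple[str, str], float] = defaultdict(float)
for month, region, amount in transactions:
    buckets[(month, region)] += amount

# buckets now holds the pre-aggregated totals users actually query,
# so the summaries need not be recomputed on every access.
```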
It is not necessary to feed data marts from a data warehouse. Instead, the marts can be populated directly from operational databases. A potential drawback is that data merging and transformation tasks may have to be repeated several times as the data is sent to the different data marts, thereby increasing the processor load required to feed the data marts.
Data Movement Software
A key data warehousing infrastructure component is the software that copies data from the operational databases to the data warehouse and/or data marts. This software falls into two categories: data replication and extract, transform, and load (ETL) products.
Data replication software monitors updates to a source database and copies those changes to one or more target databases. Replication can occur in near real-time, or the replicator might offer the option of collecting updates at the source and transmitting them to the target in batches, possibly taking advantage of network slow periods.
As the name implies, ETL software first extracts data from a source database (possibly as a snapshot copy), transforms it as required, and then loads it onto a target database.
The distinction between data replication and ETL has blurred in recent years. Data replicators have long been able to transform data--the "T" in ETL--on the way from source to target. On the ETL side, many advanced products now recognize which data has changed since the last snapshot was taken and extract only that changed data; some products labeled as ETL have, in effect, morphed into change-based replicators as well.
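A change-based extract of the kind described above can be sketched as a filter on a modification timestamp. The `updated_at` column and the snapshot bookkeeping are assumptions for illustration; real ETL and replication tools typically read database logs or change tables rather than scanning rows:

```python
from datetime import datetime

def extract_changes(rows: list[dict], last_snapshot: datetime) -> list[dict]:
    """Incremental extract: return only rows modified since the last run,
    so the tool avoids re-copying the full table on every cycle."""
    return [r for r in rows if r["updated_at"] > last_snapshot]

rows = [
    {"id": 1, "updated_at": datetime(2024, 3, 1)},
    {"id": 2, "updated_at": datetime(2024, 3, 15)},
]
changed = extract_changes(rows, last_snapshot=datetime(2024, 3, 10))
# Only the row updated after the March 10 snapshot is extracted.
```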
When evaluating data replication and ETL software, consider the following:
•· Does the product support all of your current hardware, operating system, and database platforms? This may include being able to, for example, merge data from Oracle, Sybase, and DB2 databases--all running on different hardware and operating systems--into a Teradata data warehouse.
Keep in mind that your platforms may change in the future. When choosing new application software, you shouldn't have to reject the otherwise best choice because your ETL or data replication software won't support the platform that the application runs on. Therefore, look for ETL or replication software that supports the broadest range of data sources and targets, even though you may not require a range that broad today.
•· Does it support all of the data transformation capabilities that you require to cleanse, reconcile, and restructure data as it is moved into the data warehouse and/or marts? Most replicators and ETL software include some pre-written data transformation routines to meet requirements that are common to many organizations. Because no vendor can provide pre-written routines that will meet all of the unique needs of every organization, the software must allow you to write your own data transformation routines.
•· Does it allow for different database schemas on the source and target databases? For example, you may need to add derived columns such as generating a region code from a ZIP code or a telephone area code or adding a timestamp to indicate the freshness of the data. Or if you want to denormalize data in the data warehouse or marts, you may need to merge data from matching rows in multiple source tables into corresponding rows in a single target table. In short, the product must offer complete flexibility when mapping source and target tables, rows, and columns.
•· Does it support your data timeliness requirements? If you require a real-time data warehouse and/or marts, an ETL product that extracts and loads only snapshot copies at infrequent intervals will not meet your needs. Nonetheless, there are some instances, such as the quarterly or annual financial reporting data mart suggested above, when batch loads are the best option. For optimal flexibility, look for a product that supports both real-time, change-based replication as well as snapshot or batched copying.
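As an illustration of source-to-target mapping with derived columns, the sketch below carries key columns over, derives a region from a ZIP code, and stamps each row with a load time. The ZIP-to-region mapping and field names are hypothetical:

```python
from datetime import datetime, timezone

def region_from_zip(zip_code: str) -> str:
    """Derive a coarse US region code from the first ZIP digit
    (a hypothetical mapping for illustration only)."""
    return {"0": "NORTHEAST", "1": "NORTHEAST", "2": "SOUTHEAST",
            "3": "SOUTHEAST", "6": "MIDWEST", "9": "WEST"}.get(zip_code[0], "OTHER")

def transform_row(source_row: dict) -> dict:
    """Map a source row to the target schema: copy key columns,
    add a derived region column and a load timestamp."""
    return {
        "customer_id": source_row["customer_id"],
        "zip": source_row["zip"],
        "region": region_from_zip(source_row["zip"]),   # derived column
        "loaded_at": datetime.now(timezone.utc),         # data-freshness indicator
    }

row = transform_row({"customer_id": 42, "zip": "60601"})
```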
Putting It Together
Data warehousing, possibly including the maintenance of data marts, offers the opportunity to create significant value by using business intelligence tools to exploit information resources more fully, but success is not automatic. Success requires careful planning, analysis, and design along with the selection and implementation of the data warehousing infrastructure that will best meet your specific requirements.
Among the most critical requirements for success are data warehousing knowledge, skills, and experience. Space limitations do not allow more than a surface examination of data warehousing issues here. Before undertaking your first data warehousing project, make the necessary investment to acquire the required additional expertise, whether you do that by training existing employees, hiring new staff with experience in the area, or contracting data warehousing experts.
MC Press Online