17
Fri, Jan
2 New Articles

Strategies for Successful Data Warehousing Projects

Typography
  • Smaller Small Medium Big Bigger
  • Default Helvetica Segoe Georgia Times

Data warehousing can add significant value by facilitating enhanced business intelligence, but success isn't automatic. This article explores critical data-warehousing success factors.

By Henry Martinez

Operational databases are predominantly optimized for transaction processing. They are usually highly normalized and are very efficient at reading and writing a few related records at a time. However, they are typically not optimally architected to serve the needs of executives and business analysts who perform query, reporting, and analysis against vast swaths of data. When these people use operational databases to complete these tasks, the database must often perform large, complex table joins. The necessary disk I/O and processing can bring systems to their knees.

 

Further complicating matters, users of business intelligence tools must often integrate data from a variety of disparate databases. For example, they may need to merge inventory data from an Oracle database, sales data from a DB2 database, and forecasting data from a Sybase database--all of which run on different hardware and operating system platforms--in order to optimize inventory management.

 

020909disparatedatabases.png

Figure 1: Enterprise-wide requirements often include disparate databases.

 

To solve these problems, organizations often create a data warehouse and/or data marts to merge data from operational databases. The resulting data store is optimized for query, reporting, and business intelligence purposes.

 

It sounds easy. Gather information from operational databases and copy it into a database optimized for online analytical processing (OLAP) instead of online transaction processing (OLTP). It may sound easy, but it isn't. Many organizations' data warehousing efforts have met with qualified success or worse.

 

Defining Terms

Before I delve into this topic, some of the confusion that often surrounds "data" and related terms needs to be clarified.

•·                     Data are numbers, names, labels, and other structured values that can be contained in databases. The temperature is an example of data.

•·                     Information is synthesized from data by applying logic to one or more instances of data. As such, databases don't contain information directly, but information can be created from the contents of a database. "The temperature in the office is within most people's comfort zone" is an example of information.

•·                     Combining information and experience generates rational decisions. For example, experience may tell you that people are more comfortable when the temperature is within their comfort zone, so you decide to turn on the air conditioning when it gets too hot.

 

A Word from Your Sponsor

The nature of query, reporting, and, particularly, business intelligence functions sets them apart from operational functions in another way as well. Operational functions can be precisely defined. In contrast, the ways a data warehouse will be employed after users gain experience with it and the purposes to which business intelligence tools will eventually be put typically cannot be predicted with any precision much beyond their initial implementation.

 

This is not to say that the requirements analysis phase of an operational application development or acquisition project is easy. The number of projects that fail to fully meet user requirements attests to the fact that it is not. But at least it is possible.

 

Furthermore, with operational applications it is conceivable, although usually not desirable, for the IT department to take full responsibility for analyzing and defining functional requirements. Theoretically, it is possible to build or buy a system that entirely meets current operational needs by observing users and analyzing existing business processes. That might not be the ideal approach, and it would likely fail to address future requirements, but it could, in theory, work.

 

In contrast, when it comes to performing business intelligence tasks, each answer that a user finds often leads to new, unexpected questions. Business analysts look for patterns and trends that will help them to increase sales, reduce costs, improve quality, or achieve any number of other corporate objectives. Those patterns and trends are unknown until they are found. Random walks through the data are, therefore, a common element of business intelligence efforts. Thus, the IT department cannot define business intelligence needs by simply observing the user community, because one of the things that business analysts need to determine is whether the processes that are being performed now are being performed optimally and, indeed, are the processes that need doing. The objective of business intelligence tasks is, as the name suggests, to create new intelligence, not merely to use existing knowledge. That objective is rarely met entirely through well-structured processes.

 

The upshot is that the user community must drive business intelligence projects, including the building of the data warehouses that support them. A data warehouse that is thrust upon business analysts and executives will likely neither gain their support nor fully serve their needs. Therefore, a successful data-warehousing project requires a strong sponsor from the user community who will champion the project.

 

Little Bangs

A "big bang" project to create a universal, encyclopedic data warehouse for the entire enterprise in a single go is almost certain to fail. The complexity, heterogeneity, and interrelatedness of the various enterprise-wide data sources and the vagueness of myriad potential purposes to which the data warehouse can be put are normally so great as to result in a data warehouse project that is perpetually in the analysis-and-design phase. It is better to start with a single area of the company for which there is an enthusiastic user who will champion the analysis-and-design process for his or her department or work function.

 

Starting with a "little bang" rather than a big one offers a number of benefits, including allowing you to do the following:

 

•·                     Devote the resources required to find and analyze all data that directly or indirectly impacts the subject department or work function.

•·                     Avoid the "paralysis by analysis" that often occurs when trying to do too much in a single project.

•·                     Press the data warehouse into service more quickly, thereby allowing you to generate value sooner. This also allows the IT department and the user community to gain more experience with data warehousing and business intelligence tools and concepts before building the full data warehouse.

 

When taking this "little bang" approach, be sure to set an appropriate expectation level within the enterprise. If the executive suite expects an all-inclusive, all-singing, all-dancing data warehouse and business intelligence solution set, they will be disappointed with the initial results. On the other hand, if they recognize that the first effort is a pilot project that will be expanded in a prudent, affordable manner, benefiting from learning experiences along the way, disappointment over the magnitude of the "little bang" can be avoided.

 

Analysis Is Critical

Users who become frustrated with a data warehouse won't use it and will resist attempts to expand its scope and use. Therefore, it is important to get particularly the first but also the subsequent projects right. To do so, the analysis phase must answer the following questions:

 

•·                     What information can add value to the decision-making processes of the subject department or job function? And what data is required to synthesize that information? This is not always obvious, nor is all of the data necessarily within the domain of the subject department. For example, quality control analysts in the manufacturing group need to uncover the causes of any spikes or, worse, steady increases in product defects.

    o        Was it a result of a decline in the skill levels of new-hires? That information is in the human resources database.

    o        Did a change in shift schedules result in workers on one shift being more fatigued than they had been on the previous schedule, and, if so, was that shift responsible for the defects? Answering that question may require plumbing data from both the human resources and manufacturing databases.

    o        Are most of the defective products being built using raw materials or subassemblies from a new supplier? To find out, you may need to consult both the manufacturing and purchasing databases.

    o        Was there some other cause that, if found, would be surprising? Finding those surprises requires taking random walks through all data that in any way relates to the production, storage, and transportation of the products in question.

•·                     Where does the data currently reside?

•·                     How important is data timeliness? A data warehouse that supports solely financial reporting activities needs data that is complete up to the end of the last reporting period. More recent data only gets in the way. On the other hand, managers at a major retail chain who want to stay on top of rapidly changing fashion tastes, while also ensuring that each store has the stock that will best serve local preferences, will likely want data that is no more than one day old. Even better would be near real-time data gathered from point-of-sale terminals.

•·                     Is data in the source databases in the format that is most easily usable by the people who will use the data warehouse? For instance, telephone area codes can provide useful data for marketers planning telemarketing campaigns. Yet many operational databases contain a single telephone number field. The data warehouse might better serve marketers by separating the telephone number into separate area code and local number fields, allowing the data to be sorted and filtered easily using the area code.

•·                     Will data from differing sources need to be reconciled? For example, an international company may need to convert the kilograms and centimeters in its European subsidiaries' databases or the pounds and inches in its American databases into common units in an enterprise data warehouse.

•·                     Are there any data privacy and security concerns? Care must be taken to ensure that such requirements continue to be met when data is copied from production databases into the data warehouse.

•·                     What demands will be placed on the data warehouse's disks and servers? Because users will not have a good sense of all of the purposes to which a data warehouse can be put until they start using it, a rough estimate is likely the best that can be hoped for during the analysis phase. Consequently, it is important to design a highly scalable data warehouse.

 

If you use a "little bangs" approach to building the data warehouse, keep in mind that each new project adds only a piece of the puzzle. As you expand the data warehouse to meet the needs of other departments, you'll need to consider that the requirements of those departments will differ. Therefore, strive to maintain a design that is as flexible as possible. This might argue for maintaining a highly normalized structure, which is contrary to the desire to denormalize data to improve OLAP performance. A way to reconcile this conflict is suggested in the "Warehouse, Mart, or Both" section below.

 

Keep It Clean

To optimize the efficiency and effectiveness of a data warehouse, the data in it should be as "clean" as possible. There are a number of aspects to data cleansing, including the following:

•·                     Standardize data descriptions, abbreviations, and formats. For example, the same person may be listed as Mrs. Mary Jones in a company's Web store database, but as Ms. M. Jones in a database supporting its physical stores. And one database may list her address as 123 Main St., Apt. 456, another as 456 - 123 Main Street, and yet another as 123 Main Street, Suite 456.

•·                     Consolidate duplicate records. Even a single database may contain redundant records. This can occur because of the differences described in the preceding point. For example, a sales application may create a new customer record when it doesn't recognize that Mrs. Mary Jones at 123 Main St., Apt 456 is the same person as Ms. M. Jones at 456 - 123 Main Street. Moreover, the existence of duplicate records is almost assured when data is merged from multiple databases containing overlapping data.

•·                     Correct errors where possible. For example, there are publicly available address-change databases. This information can be matched against company data as it is moved into the data warehouse to make address corrections even when the customer has not yet provided that information.

•·                     Detect and flag corrupted records, or, where possible, automatically fix the corruption.

  

Warehouse, Mart, or Both

As stated above, the query, reporting, and business intelligence needs of each functional area within a company differ. The data warehouse design that best serves one group may all but thwart the objectives of another. One way to resolve this dilemma is to build a data warehouse that serves primarily to consolidate and reconcile data from the various operational databases but possibly little else.

 

This data warehouse can then be used to feed specific-purpose data marts that contain just the data needed for the functional areas they will serve and that are structured to best serve the needs of those functional areas.

 

This data warehouse/data mart topology allows you to schedule the population of each data mart as appropriate. For example, marketing analysts may want data to be as current as possible. Their data mart can be fed updates as soon as they arrive in the data warehouse or possibly in nightly batches. On the other hand, a data mart that serves quarterly or annual financial reporting purposes can be built using a snapshot copy tool that is run immediately after the end of the relevant financial period and that captures data for only that period.

 

020909warehousemarts.png

Figure 2: Individual data marts can draw the specific information they need from the data warehouse.

 

If the users of a particular data mart normally work with aggregated data--say, monthly totals and/or regional totals, rather than individual transactions--the data copied into that data mart can be added into "summary buckets" rather than stored as separate records, thereby eliminating the need to create those summaries every time a user accesses the data mart.

 

It is not necessary to feed data marts from a data warehouse. Instead, the marts can be populated directly from operational databases. A potential drawback is that data merging and transformation tasks may have to be repeated several times as the data is sent to the different data marts, thereby increasing the processor load required to feed the data marts.

 

Data Movement Software

A key data warehousing infrastructure component is the software that copies data from the operational databases to the data warehouse and/or data marts. This software falls into two categories: data replication and extract, transform, and load (ETL) products.

 

Data replication software monitors updates to a source database and copies those changes to one or more target databases. Replication can occur in near real-time, or the replicator might offer the option of collecting updates at the source and transmitting them to the target in batches, possibly taking advantage of network slow periods.

 

As the name implies, ETL software first extracts data from a source database (possibly as a snapshot copy), transforms it as required, and then loads it onto a target database.

 

The distinction between data replication and ETL has blurred over recent years. Data replicators have for some time been able to transform data--the "T" in ETL--on the way from the source to the target. On the ETL side, advanced ETL products now work more like replicators in that they often recognize what data has changed since the last snapshot was taken and then extract only that changed data, allowing the ETL tool to act more like a data replicator. Furthermore, some products that are labeled as being in the ETL category have morphed into products that offer change-based replication as well.

 

When evaluating data replication and ETL software, consider the following:

 

•·                     Does the product support all of your current hardware, operating system, and database platforms? This may include being able to, for example, merge data from Oracle, Sybase, and DB2 databases--all running on different hardware and operating systems--into a Teradata data warehouse.

Keep in mind that your platforms may change in the future. When choosing new application software, you shouldn't have to reject the otherwise best choice because your ETL or data replication software won't support the platform that the application runs on. Therefore, look for ETL or replication software that supports the broadest range of data sources and targets, even though you may not require a range that broad today.

•·                     Does it support all of the data transformation capabilities that you require to cleanse, reconcile, and restructure data as it is moved into the data warehouse and/or marts? Most replicators and ETL software include some pre-written data transformation routines to meet requirements that are common to many organizations. Because no vendor can provide pre-written routines that will meet all of the unique needs of every organization, the software must allow you to write your own data transformation routines.

•·                     Does it allow for different database schemas on the source and target databases? For example, you may need to add derived columns such as generating a region code from a ZIP code or a telephone area code or adding a timestamp to indicate the freshness of the data. Or if you want to denormalize data in the data warehouse or marts, you may need to merge data from matching rows in multiple source tables into corresponding rows in a single target table. In short, the product must offer complete flexibility when mapping source and target tables, rows, and columns.

•·                     Does it support your data timeliness requirements? If you require a real-time data warehouse and/or marts, an ETL product that extracts and loads only snapshot copies at infrequent intervals will not meet your needs. Nonetheless, there are some instances, such as the quarterly or annual financial reporting data mart suggested above, when batch loads are the best option. For optimal flexibility, look for a product that supports both real-time, change-based replication as well as snapshot or batched copying.

 

Putting It Together

Data warehousing, possibly including the maintenance of data marts, offers the opportunity to create significant value by using business intelligence tools to exploit information resources more fully, but success is not automatic. Success requires careful planning, analysis, and design along with the selection and implementation of the data warehousing infrastructure that will best meet your specific requirements.

 

Among the most critical requirements for success are data warehousing knowledge, skills, and experience. Space limitations do not allow more than a surface examination of data warehousing issues here. Before undertaking your first data warehousing project, make the necessary investment to acquire the required additional expertise, whether you do that by training existing employees, hiring new staff with experience in the area, or contracting data warehousing experts.

BLOG COMMENTS POWERED BY DISQUS

LATEST COMMENTS

Support MC Press Online

$

Book Reviews

Resource Center

  • SB Profound WC 5536 Have you been wondering about Node.js? Our free Node.js Webinar Series takes you from total beginner to creating a fully-functional IBM i Node.js business application. You can find Part 1 here. In Part 2 of our free Node.js Webinar Series, Brian May teaches you the different tooling options available for writing code, debugging, and using Git for version control. Brian will briefly discuss the different tools available, and demonstrate his preferred setup for Node development on IBM i or any platform. Attend this webinar to learn:

  • SB Profound WP 5539More than ever, there is a demand for IT to deliver innovation. Your IBM i has been an essential part of your business operations for years. However, your organization may struggle to maintain the current system and implement new projects. The thousands of customers we've worked with and surveyed state that expectations regarding the digital footprint and vision of the company are not aligned with the current IT environment.

  • SB HelpSystems ROBOT Generic IBM announced the E1080 servers using the latest Power10 processor in September 2021. The most powerful processor from IBM to date, Power10 is designed to handle the demands of doing business in today’s high-tech atmosphere, including running cloud applications, supporting big data, and managing AI workloads. But what does Power10 mean for your data center? In this recorded webinar, IBMers Dan Sundt and Dylan Boday join IBM Power Champion Tom Huntington for a discussion on why Power10 technology is the right strategic investment if you run IBM i, AIX, or Linux. In this action-packed hour, Tom will share trends from the IBM i and AIX user communities while Dan and Dylan dive into the tech specs for key hardware, including:

  • Magic MarkTRY the one package that solves all your document design and printing challenges on all your platforms. Produce bar code labels, electronic forms, ad hoc reports, and RFID tags – without programming! MarkMagic is the only document design and print solution that combines report writing, WYSIWYG label and forms design, and conditional printing in one integrated product. Make sure your data survives when catastrophe hits. Request your trial now!  Request Now.

  • SB HelpSystems ROBOT GenericForms of ransomware has been around for over 30 years, and with more and more organizations suffering attacks each year, it continues to endure. What has made ransomware such a durable threat and what is the best way to combat it? In order to prevent ransomware, organizations must first understand how it works.

  • SB HelpSystems ROBOT GenericIT security is a top priority for businesses around the world, but most IBM i pros don’t know where to begin—and most cybersecurity experts don’t know IBM i. In this session, Robin Tatam explores the business impact of lax IBM i security, the top vulnerabilities putting IBM i at risk, and the steps you can take to protect your organization. If you’re looking to avoid unexpected downtime or corrupted data, you don’t want to miss this session.

  • SB HelpSystems ROBOT GenericCan you trust all of your users all of the time? A typical end user receives 16 malicious emails each month, but only 17 percent of these phishing campaigns are reported to IT. Once an attack is underway, most organizations won’t discover the breach until six months later. A staggering amount of damage can occur in that time. Despite these risks, 93 percent of organizations are leaving their IBM i systems vulnerable to cybercrime. In this on-demand webinar, IBM i security experts Robin Tatam and Sandi Moore will reveal:

  • FORTRA Disaster protection is vital to every business. Yet, it often consists of patched together procedures that are prone to error. From automatic backups to data encryption to media management, Robot automates the routine (yet often complex) tasks of iSeries backup and recovery, saving you time and money and making the process safer and more reliable. Automate your backups with the Robot Backup and Recovery Solution. Key features include:

  • FORTRAManaging messages on your IBM i can be more than a full-time job if you have to do it manually. Messages need a response and resources must be monitored—often over multiple systems and across platforms. How can you be sure you won’t miss important system events? Automate your message center with the Robot Message Management Solution. Key features include:

  • FORTRAThe thought of printing, distributing, and storing iSeries reports manually may reduce you to tears. Paper and labor costs associated with report generation can spiral out of control. Mountains of paper threaten to swamp your files. Robot automates report bursting, distribution, bundling, and archiving, and offers secure, selective online report viewing. Manage your reports with the Robot Report Management Solution. Key features include:

  • FORTRAFor over 30 years, Robot has been a leader in systems management for IBM i. With batch job creation and scheduling at its core, the Robot Job Scheduling Solution reduces the opportunity for human error and helps you maintain service levels, automating even the biggest, most complex runbooks. Manage your job schedule with the Robot Job Scheduling Solution. Key features include:

  • LANSA Business users want new applications now. Market and regulatory pressures require faster application updates and delivery into production. Your IBM i developers may be approaching retirement, and you see no sure way to fill their positions with experienced developers. In addition, you may be caught between maintaining your existing applications and the uncertainty of moving to something new.

  • LANSAWhen it comes to creating your business applications, there are hundreds of coding platforms and programming languages to choose from. These options range from very complex traditional programming languages to Low-Code platforms where sometimes no traditional coding experience is needed. Download our whitepaper, The Power of Writing Code in a Low-Code Solution, and:

  • LANSASupply Chain is becoming increasingly complex and unpredictable. From raw materials for manufacturing to food supply chains, the journey from source to production to delivery to consumers is marred with inefficiencies, manual processes, shortages, recalls, counterfeits, and scandals. In this webinar, we discuss how:

  • The MC Resource Centers bring you the widest selection of white papers, trial software, and on-demand webcasts for you to choose from. >> Review the list of White Papers, Trial Software or On-Demand Webcast at the MC Press Resource Center. >> Add the items to yru Cart and complet he checkout process and submit

  • Profound Logic Have you been wondering about Node.js? Our free Node.js Webinar Series takes you from total beginner to creating a fully-functional IBM i Node.js business application.

  • SB Profound WC 5536Join us for this hour-long webcast that will explore:

  • Fortra IT managers hoping to find new IBM i talent are discovering that the pool of experienced RPG programmers and operators or administrators with intimate knowledge of the operating system and the applications that run on it is small. This begs the question: How will you manage the platform that supports such a big part of your business? This guide offers strategies and software suggestions to help you plan IT staffing and resources and smooth the transition after your AS/400 talent retires. Read on to learn: