On April 4, 2003, IBM announced the industry's first blueprint to help customers begin building autonomic computing systems. The autonomic computing initiative is one of the key elements of IBM's larger e-Business on Demand marketing message.
IBM has defined autonomic computing as self-healing, self-configuring combinations of hardware and software that relieve administrators of low-level setup and maintenance tasks. But could the arrival of an autonomic computing architecture change how high availability is delivered to customers?
The Evolution of High Availability Products
These days, the software and hardware combinations that compose high availability computing solutions--solutions designed to keep systems up and running--are evolutionary responses to real experiences of customers and engineers. They're responses to something that broke in the past and required engineers to devise preventative measures.
But if IBM brings forth a real era of autonomic computing--systems that are self-diagnosing, self-aware, self-configuring, and self-healing--the science of high availability will have new challenges and benefits.
The System/38: Precursor to High Availability
If you look back at the history of most high availability solutions in the market today, you will find a thread that stretches back to the days of the System/38 architecture, when the failure of a single disk drive could spell days of downtime. (And even back then, a single day of unexpected downtime was not a pleasant thing to explain to your management.)
At the time, the System/38 had some unique challenges that transformed simple backup and recovery into a serious need for availability services. It was the first truly virtual computer, composed of an operating system and applications that rode on top of a hardware platform that seemed to be constantly struggling to keep up with the system's advanced design. Programmers loved it. But it was the System/38's reputation as a DASD masher that sent fear and trembling into the hearts of IBM systems engineers. Where was the data to recover, these engineers wondered? In fact, it was scattered anywhere and everywhere, shotgun blasted across the platters for better disk drive armature utilization. This meant that systems engineers had to devise a means of resurrecting failed hard drives on systems where file structures were not clearly identified and where volatile memory often held large portions of the database when the system went down.
The High Availability Market Spins Off
Many of the hardware and software technologies that we see in use today came from engineering experiences that were aimed at solving or avoiding disruptions in the virtual information structures of single-level storage. In fact, in order for the AS/400 to succeed in the marketplace (where the System/38 had faltered), IBM had to prove once and for all that it had solved the problems of system availability with this strange computing system that was designed to use single-level virtual storage.
For many years, IBM concentrated simultaneously on both the hardware and software fronts to increase the AS/400's availability. It created drives with comprehensive disk diagnosis and alert messaging software built in. It wrote operating system services--like journaling and mirrored storage pool software--to minimize the damage that a failed piece of hardware might create. And the result? The AS/400 achieved the highest availability rating of any system in the 1990s: 99.999%. (And to this day, the premium that customers pay for their iSeries DASD is a vestige of that era when IBM's focus was on unique, advanced AS/400 disk drives. This price is still charged, even though the iSeries today uses the same drives that are used on other IBM platforms.)
Is it any wonder then that--when high availability became an important issue for larger computing systems and for networked servers in the 1990s--the AS/400 had already spawned a pool of expert IBM engineers who had cut their teeth on its high availability mechanisms? Many of these engineers went off to form separate companies that sold products to meet the needs of a burgeoning high availability marketplace.
This is not to say that the high availability mechanisms designed in the days of the AS/400 are sufficient for the needs of customers today. And that's the point.
New Challenges in Availability Lead to Autonomic Technology
The challenges facing engineers and administrators today are not simple "what-if" scenarios on a single failing machine or device. They are issues borne of complexity--massive, cascading complexity--on networks composed of a wide spectrum of services. These services, by their very nature, become tangled as users thread their way through portals, applications, networks, servers, routers, protocols, and configurations. Unfortunately today, by the time a network administrator has been notified of some problem--with a disk drive, a stuck printer queue, a corrupted user profile, a streaming router, or a bungled data stream--the damage to the information flow has probably already been done. Why? Because most of the high availability services that are out there are based upon external monitoring of discrete conditions within the information system itself--backed by scripts and controls that have been explicitly designed to handle unique error conditions.
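To make that contrast concrete, here is a minimal sketch, in Python, of the kind of externally scripted monitoring described above: each discrete condition gets its own explicit check and its own hand-written fix, and anything nobody anticipated slips through. The service names, commands, and thresholds are hypothetical placeholders, not any vendor's actual product.

```python
# A minimal sketch of traditional, externally scripted high availability monitoring:
# one explicit check and one explicit remediation per anticipated error condition.
import shutil
import subprocess
import time

def disk_has_space(path="/", min_free_gb=5):
    """Discrete condition: is there enough free space on this one filesystem?"""
    return shutil.disk_usage(path).free / 1e9 >= min_free_gb

def spooler_is_running():
    """Discrete condition: is the (hypothetical) print spooler service active?"""
    return subprocess.run(["systemctl", "is-active", "--quiet", "cups"]).returncode == 0

def restart_spooler():
    """Hand-written fix for one anticipated failure mode."""
    subprocess.run(["systemctl", "restart", "cups"], check=False)

def page_administrator(message):
    print(f"ALERT: {message}")  # stand-in for an e-mail or pager call

# Every known failure mode must be enumerated here by hand.
CHECKS = [
    (disk_has_space, lambda: page_administrator("disk nearly full")),
    (spooler_is_running, restart_spooler),
]

if __name__ == "__main__":
    while True:
        for condition_ok, remediate in CHECKS:
            if not condition_ok():
                remediate()      # only failures someone anticipated are handled
        time.sleep(60)           # anything else goes unnoticed until the damage is done
```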
What is amazing is not that our systems sometimes fail, but that we can get as much work done as we do. In fact, our information systems have become so complex that we work in an environment where things are constantly failing, all the time! It's only when they fail catastrophically that we bother to address the problem.
So how does IBM's autonomic computing initiative address this problem? This is where a revolution in creative thinking--backed by some intense engineering--may positively impact the whole industry of high availability.
IBM Shares Project Eliza Insights
IBM's experience with autonomic computing comes, in part, from Project Eliza, an R&D initiative funded several years ago. Project Eliza provided basic research into how systems could be designed to actually learn. The concept was to build systems--composed of a neural network of sensors, logs, and algorithms--that could develop enough self-knowledge to make preemptive, self-actuated decisions and so avoid catastrophes. This research, now reaching some maturity, is forming the basis for developing real systems that are self-configuring and self-monitoring, with built-in decision-making capabilities.
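A toy illustration of that kind of feedback loop--observe, accumulate self-knowledge, and act before a threshold is crossed--might look like the following. This is not Project Eliza's actual design; the sensor and the statistics are deliberately simple stand-ins.

```python
# Observe a metric, learn what "normal" looks like from history, and act
# preemptively when a reading drifts far outside that learned range.
import random
import statistics
import time

history = []                      # the system's accumulated "self-knowledge"

def read_sensor():
    """Stand-in for a real metric such as disk temperature or queue depth."""
    return random.gauss(50, 5)

def act_preemptively(value, mean, stdev):
    print(f"value {value:.1f} is far from normal ({mean:.1f} +/- {stdev:.1f}); "
          "taking corrective action before it becomes an outage")

for _ in range(1000):
    value = read_sensor()
    history.append(value)
    if len(history) > 30:                      # enough observations to reason about
        mean = statistics.mean(history[-200:])
        stdev = statistics.stdev(history[-200:])
        if abs(value - mean) > 3 * stdev:      # anomaly detected: act now
            act_preemptively(value, mean, stdev)
    time.sleep(0.1)
```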
IBM could have taken this basic research and developed its own brand of high availability services for each of its products. However, IBM also understands that its products no longer reside in standalone environments; instead, they are members of larger networks of heterogeneous systems in which the failure of one device can spell disruption for the entire information system. As a result, as part of its long-term e-Business on Demand initiative, IBM has chosen to release a great deal of its research as a blueprint for building autonomic systems: a sort of workbook of what it actually takes to design systems that are truly self-configuring, self-monitoring, and self-healing.
Autonomic Technologies Released
In addition to this blueprint, the company is also providing developers with four specific technologies to help them develop autonomic systems. These technologies give developers and customers actual building blocks for producing self-managed systems that comply with the new blueprint's framework. The four technologies are:
- Log & Trace Tool for Problem Determination--This tool alleviates the manual task of tracking down the cause of a system problem by putting the log data from different system components into a common format, allowing administrators to identify the root cause more quickly. The tool captures and correlates events from end-to-end execution in the distributed stack, allows for more structured analysis of distributed application problems, and facilitates the development of autonomic self-healing and self-optimizing capabilities. (A rough sketch of the log-normalization idea appears after this list.)
- ABLE (Agent Building and Learning Environment) Rules Engine for Complex Analysis--ABLE is a set of fast, scalable, and reusable learning and reasoning components that capture and share the individual and organizational knowledge that surrounds a system. ABLE is designed to minimize the need for developing complex algorithms that are the basis for intelligent, autonomic behavior by a system.
- Monitoring Engine providing Autonomic Monitoring capability--This technology detects resource outages and potential problems before they impact system performance or the end-user experience. The monitoring engine has embedded self-healing technology that allows systems to recover automatically from critical situations. IBM says the engine uses the same advanced resource modeling technology for capturing, analyzing, and correlating metrics that it uses in its Tivoli product line, and that the Tivoli Autonomic Monitoring Engine will be available in beta this summer and is scheduled to ship later in the year.
- Business Workload Management for Heterogeneous Environments--IBM says that the initial delivery of this technology will use the Application Response Management (ARM) standard to help identify the causes of bottlenecks in a system through response time measurement; transactional processing segment reporting; and a neural-network-like, self-learning mechanism that runs through middleware and servers. It's designed to adjust resources automatically to ensure that specified performance objectives are met. This technology will also start to be delivered with the IBM Tivoli Monitoring for Transaction Performance product. (The response-time-measurement idea is sketched below.)
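The first item in the list--normalizing log records from different components into a common format so they can be correlated end to end--boils down to something like the following sketch. The record layout, the per-component parsers, and the correlation window are invented for illustration; they are not the format IBM's tool actually uses.

```python
# Parse heterogeneous log lines into one common event record, then group events
# from different components that occur close together in time.
import re
from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class CommonEvent:
    timestamp: datetime
    component: str
    severity: str
    message: str

def parse_app_server(line: str) -> CommonEvent:
    # e.g. "2003-04-04 10:15:02 ERROR connection pool exhausted"
    date, clock, severity, message = line.split(" ", 3)
    return CommonEvent(datetime.fromisoformat(f"{date} {clock}"),
                       "app-server", severity, message)

def parse_database(line: str) -> CommonEvent:
    # e.g. "[04/Apr/2003:10:15:01] WARN lock wait timeout"
    stamp, severity, message = re.match(r"\[(.+?)\] (\w+) (.*)", line).groups()
    return CommonEvent(datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S"),
                       "database", severity, message)

def correlate(events: List[CommonEvent], window_seconds: int = 5):
    """Group normalized events that fall within a short time window of each other."""
    events = sorted(events, key=lambda e: e.timestamp)
    groups, current = [], [events[0]]
    for event in events[1:]:
        if (event.timestamp - current[-1].timestamp).total_seconds() <= window_seconds:
            current.append(event)
        else:
            groups.append(current)
            current = [event]
    groups.append(current)
    return groups

if __name__ == "__main__":
    events = [
        parse_database("[04/Apr/2003:10:15:01] WARN lock wait timeout"),
        parse_app_server("2003-04-04 10:15:02 ERROR connection pool exhausted"),
    ]
    for group in correlate(events):
        print([f"{e.component}:{e.severity}" for e in group])
```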
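The workload-management item rests on response-time measurement of business transactions. The sketch below shows only the core idea--bracket each transaction, record the elapsed time, and flag misses against a stated objective. It is a simplified illustration, not the ARM API itself, and the transaction names and objectives are assumptions.

```python
# Measure the response time of each named business transaction and flag
# transactions that miss their stated performance objective.
import time
from collections import defaultdict
from contextlib import contextmanager

OBJECTIVE_SECONDS = {"place-order": 2.0, "lookup-account": 0.5}   # assumed targets
measurements = defaultdict(list)

@contextmanager
def measured_transaction(name: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        measurements[name].append(elapsed)
        if elapsed > OBJECTIVE_SECONDS.get(name, float("inf")):
            # A real workload manager would rebalance resources at this point.
            print(f"{name} took {elapsed:.2f}s -- objective missed, adjust resources")

if __name__ == "__main__":
    with measured_transaction("lookup-account"):
        time.sleep(0.1)            # stand-in for the real middleware and server work
    with measured_transaction("place-order"):
        time.sleep(0.3)
    for name, samples in measurements.items():
        print(name, f"avg {sum(samples)/len(samples):.2f}s over {len(samples)} calls")
```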
Additional details about these technologies were made available at IBM's developerWorks Live! conference for software developers that was held last week in New Orleans.
Bringing It All Back Home to iSeries
How these technologies find their way into the iSeries environment will certainly be tied to IBM's Tivoli product efforts, but the manner in which IBM is making them available to the developer community should also ensure that the high availability products offered by third-party vendors improve substantially over time. Implementing these technologies will greatly enhance the interrelationships between IBM eServer high availability and the features and functions of high availability products aimed at routers, non-IBM servers, network switches, and the rest of the components that make up the modern information system.
High availability has come a long way since the time when systems engineers labored over failed System/38 disk drives, leading to the record-breaking availability performance of the AS/400 and iSeries platforms. But if IBM follows through with its autonomic computing plans, it's clear that we're entering a new era in which 99.999% availability for the whole information infrastructure will become the standard for an industry rife with complexity and interruption.
For more information about IBM's announcement on April 4, 2003, visit IBM's announcement.
For more information about IBM's autonomic computing initiative, we recommend the following white paper: "The Dawning of the Autonomic Era" by A.G. Ganek and T.A. Corbi.
A complete look at IBM's autonomic efforts may be found at IBM's Autonomic Web page.
Thomas M. Stockwell is the Editor in Chief of MC Press, LLC. He has written extensively about program development, project management, IT management, and IT consulting and has been a frequent contributor to many midrange periodicals. He has authored numerous white papers for iSeries solutions providers. He welcomes your comments about this or other articles and can be reached at