The first step on the road to becoming ClusterProven is understanding what ClusterProven means. ClusterProven is an IBM registered trademark, not a generic description. IBM grants a license to use the trademark to independent software vendors, or ISVs, whose products conform to rigorous clustering specifications.
ClusterProven and Advanced ClusterProven programs exist for each of the IBM server lines. For the iSeries 400, the PartnerWorld for Developers/iSeries 400 group administers the program.
For ClusterProven branding, iSeries 400 applications must meet the following criteria:
Switch to backup resources automatically if a primary system becomes unavailable
Provide sufficient information to the cluster environment to enable automatic application configuration and resilient resources
Supply an exit program that can coordinate the application's restart using resilient data
IBM also awards a higher-level branding trademark, Advanced ClusterProven, to ISVs whose iSeries 400 products meet all of the ClusterProven criteria and then go even farther toward ensuring continuous operations. One essential criterion of these Advanced ClusterProven apps is that they must use commitment control or some internal application checkpoint process to recover to the last transaction, as well as minimize user disruption in the event of a system failure.
Value Added
The primary value of these two trademarks is the confidence they instill in customers who purchase a high-availability clustering solution. While an application that does not bear either brand may, obviously, still meet all of the same cluster-enabling criteria, the ClusterProven and Advanced ClusterProven marks are proof that clustering capabilities were verified by IBM.
The trademarks also benefit software buyers by removing the need to verify the cluster-readiness of an application under evaluation. The certification tests have already proven it. And, of course, ISVs receive benefits from the brands.
Candidate Applications
Some applications are more ready for clustering than others. The amount of work required to achieve ClusterProven status for an existing application depends on the application's functionality and design. Several issues must be considered.
Do You Use Journaling?
Journaling is a prerequisite for clustering. In traditional environments, many organizations have feared turning journaling on because of the potential impact on system performance. However, the magnitude of that impact has not always been clear, and in addition, IBM continues to make significant improvements in journaling performance.
Because the journal keeps a record of all changes applied to the database, journaling can cause a dramatic increase in the amount of disk I/O. This may or may not impact overall system performance, depending on where bottlenecks occur.
For example, if a system is primarily CPU-bound, adding a significant volume of disk I/O may not affect overall performance. Therefore, before committing to journaling, you should turn it on and measure its impact on overall system performance. If performance is unacceptable, consult IBM's documentation for journaling tips and techniques. If performance is still inadequate, some application tuning may be required before using journaling in a production environment. With proper configuration and tuning, it is possible to journal in virtually any application environment with minimal overhead.
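The bottleneck reasoning can be made concrete with a deliberately simplified model in which a workload's elapsed time is set by whichever resource is busiest. The function names and every number below are invented for illustration; they are not measurements of any real system, and real systems overlap CPU and disk work only partially.

```python
def elapsed_seconds(cpu_seconds: float, disk_seconds: float) -> float:
    """Simplified bottleneck model: CPU and disk work are assumed to overlap
    fully, so the busier resource alone determines elapsed time."""
    return max(cpu_seconds, disk_seconds)


# Assumed extra disk time added by journaling (illustrative, not measured).
JOURNAL_DISK_OVERHEAD = 30.0

# Two hypothetical workloads with the same elapsed time before journaling.
cpu_bound = {"cpu_seconds": 100.0, "disk_seconds": 40.0}
disk_bound = {"cpu_seconds": 40.0, "disk_seconds": 100.0}


def with_journaling(workload: dict) -> dict:
    """Return a copy of the workload with the assumed journaling I/O added."""
    return {
        "cpu_seconds": workload["cpu_seconds"],
        "disk_seconds": workload["disk_seconds"] + JOURNAL_DISK_OVERHEAD,
    }


for name, workload in (("CPU-bound", cpu_bound), ("disk-bound", disk_bound)):
    before = elapsed_seconds(**workload)
    after = elapsed_seconds(**with_journaling(workload))
    print(f"{name}: {before:.0f}s -> {after:.0f}s")
```

In this toy model the CPU-bound workload absorbs the added disk I/O without slowing down, while the disk-bound workload slows by the full amount. That difference is the reason to measure on your own system rather than assume.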
How Does the Application Recover Data?
If replication software from an IBM AS/400 High Availability Business Partner (HABP) is already used to maintain replicated data on a backup system, part of the road to ClusterProven status has already been traveled.
AS/400 clustering uses a shared-nothing architecture. This means nodes in the cluster do not share resources. Instead, the cluster uses HABP replication software to maintain duplicates of all resources (data, applications, user profiles, and other system objects) on all backup nodes.
How Does the Application Recover Users?
A simple litmus test can set the stage for determining how well prepared the application is for clustering. Simply ask yourself, "How well does the application recover itself and its users after an abnormal failure in a single-system environment?" If an on-site visit by a software engineer is required to recover or repair data, then, most likely, a significant amount of work will be required to add robustness to the application design. If, on the other hand, the application comes back online, can quickly determine where it left off, and can automatically deal with lost or damaged data, then the work to cluster-enable the application will be greatly diminished.
At a conceptual level, clustering on the AS/400 is not new. For some time, and to varying degrees, the HABPs have offered system-monitoring capabilities and facilities to fail over automatically or switch over manually to a backup system, where the application is simply restarted. The application is not involved in the process and is said to be "cluster-unaware."
The difference with iSeries clustering for high availability is that some of these functions are now performed by the AS/400 cluster engine, some are performed by the HABP middleware, and some are performed by the application-supplied exit program and related application resiliency changes. The biggest difference in this new clustering architecture is the active participation of a highly resilient cluster-aware application.
Exploited, Leveraged, or Enhanced?
Once an application is enabled to behave properly in a high-availability cluster, the degree to which user impact can be mitigated (or eliminated) is typically a function of how much (if any) application state information is maintained, how that information is used in a recovery/restart scenario, and the use of such techniques as commitment control. Ideally, when these are used collaboratively, the highest level of cluster-enablement, Advanced ClusterProven, can be achieved.
Which One Do You Need?
Early in your clustering journey, you must answer a question: Is ClusterProven sufficient, or does your application require Advanced ClusterProven status?
A ClusterProven application can automatically restart on a backup node when the primary node becomes unavailable. However, it is not necessary to reposition users at the point of failure. Instead, they may be taken back to the last main menu screen they encountered. Users may then have to recreate some of their work.
An Advanced ClusterProven application goes further. All transactions managed by the application should ideally use commitment control. In the event of a failover to a backup system, host-centric applications must reposition users to the last transaction commit boundary or to a checkpoint boundary.
Advanced ClusterProven client/server applications are even more resilient than host-centric applications. Because the client manages the user interface, the user experiences a seamless failover with minimal service interruption when a primary server fails.
The best way to think about this is to view it as an availability journey. It's kind of a quest for the Holy Grail. Just about any application availability enhancement can deliver customer value, so it is quite reasonable to increment your way to a state-of-the-art application, from an availability perspective. Maybe it makes sense to simply enable the application to function properly in the clustering environment as a first step. Next, adding some simple application checkpointing (state information) so that you can reposition users after an application restart might be beneficial, and so on. The best advice is to listen to what your customers are asking for. They probably can't articulate it in terms such as ClusterProven or Advanced ClusterProven, but they can usually tell you how long they can afford to be down.
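As a minimal sketch of the simple application checkpointing mentioned above, the following Python fragment records each user's position at a checkpoint boundary and uses it to reposition the user after a restart. The `CheckpointStore` class, its methods, and the screen names are hypothetical. In a real cluster the saved state would have to live in replicated, journaled data so that it is present on the backup node; the in-memory dictionary here only illustrates the logic.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Checkpoint:
    screen: str              # where the user was working
    last_committed_txn: int  # last transaction known to be committed


@dataclass
class CheckpointStore:
    """Hypothetical stand-in for replicated application state."""
    _by_user: Dict[str, Checkpoint] = field(default_factory=dict)

    def save(self, user: str, screen: str, txn_id: int) -> None:
        # Called at each commit or checkpoint boundary.
        self._by_user[user] = Checkpoint(screen, txn_id)

    def load(self, user: str) -> Optional[Checkpoint]:
        return self._by_user.get(user)


def resume_position(store: CheckpointStore, user: str) -> str:
    """Decide where to place a user after the application restarts."""
    checkpoint = store.load(user)
    if checkpoint is None:
        # No state recorded: fall back to the main menu, which is the
        # minimum behavior described for a ClusterProven application.
        return "MAIN_MENU"
    # State recorded: reposition at the last checkpoint boundary.
    return checkpoint.screen


store = CheckpointStore()
store.save("alice", "ORDER_ENTRY", 1042)
print(resume_position(store, "alice"))
print(resume_position(store, "bob"))
```

Here a user with a saved checkpoint is returned to the screen recorded at that checkpoint, while a user without one lands on the main menu. The two branches correspond roughly to the Advanced ClusterProven and ClusterProven user experiences.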
Batch Applications
Traditionally, batch applications did not use journaling or commitment control, because the easiest and most efficient recovery strategy was to take a backup before the batch job started. If the job failed, the database was restored from the backup and the job was rerun.
This batch recovery strategy is not acceptable in a clustered environment that requires around-the-clock availability. Journaling is a minimum requirement for ClusterProven and Advanced ClusterProven status. Advanced ClusterProven applications should also use commitment control. Batch programs that are write-intensive can present special problems when journaled, so IBM addressed this by providing the Batch Journal Cache PRPQ.
One option for batch programs is to create an arbitrary commit point after every x number of records that the batch job processes. The most appropriate value of x depends on the nature of the batch application and the nature of the other applications that may be running simultaneously.
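The commit-every-x-records approach can be sketched as follows. This is an illustration only: Python's built-in `sqlite3` module stands in for commitment control, and the `ledger` table, the `run_batch` function, and the interval of 4 are assumptions made for the example, not part of any IBM interface.

```python
import sqlite3
from typing import Iterable, Tuple


def run_batch(connection: sqlite3.Connection,
              records: Iterable[Tuple[int, float]],
              commit_interval: int) -> int:
    """Insert records, committing after every `commit_interval` records.

    Returns the number of commits issued. After a failure, only the records
    written since the last commit would need to be reprocessed.
    """
    if commit_interval < 1:
        raise ValueError("commit_interval must be at least 1")
    commits = 0
    pending = 0
    for record_id, amount in records:
        connection.execute(
            "INSERT INTO ledger (id, amount) VALUES (?, ?)",
            (record_id, amount),
        )
        pending += 1
        if pending == commit_interval:
            connection.commit()  # arbitrary commit point
            commits += 1
            pending = 0
    if pending:
        connection.commit()  # commit the final partial group
        commits += 1
    return commits


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE ledger (id INTEGER PRIMARY KEY, amount REAL)")
records = [(i, i * 1.5) for i in range(1, 11)]  # ten hypothetical records
print(run_batch(connection, records, commit_interval=4))
```

With ten records and an interval of 4, commits occur after records 4, 8, and 10. A smaller interval shortens the rework after a failure but issues more commits. A larger one does the reverse, which is the trade-off behind choosing x.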
Replication
Clustering in the AS/400 environment is a cooperative solution that includes replication software from an HABP. Because the backup nodes must be ready to take over at any time, replication and journaling must be active at all times, including when batch jobs are running.
For clustering to succeed, all of the data, programs, security objects, and other system objects used by an application must be replicated. A complete inventory of these items is, therefore, critical. An HABP can provide services and tools to help you identify all relevant resources.
To receive ClusterProven status, you could write all the required software on your own, but the easiest, and probably the safest, solution is to establish a relationship with an HABP. Even if you plan to leave the choice of HABP up to your customers, a replication solution and a cluster management interface from one of the HABPs should still be used in the testing process that leads to ClusterProven certification.
Exit Program
Each application must supply an exit program to be called when a cluster event occurs. Examples of cluster events are a node failure, the addition of a node to a cluster, or the removal of a node.
One generic exit program may serve multiple applications, and it can be written in any ILE programming language. When a cluster event occurs, the exit program is initiated on all nodes in the recovery domain. It must be able to handle all of the relevant action codes shown in Figure 1.
A generic exit program may be sufficient to achieve the lowest level of recoverability required for ClusterProven status. As part of its service to cluster-enable your application, an HABP may use an automated tool to help you create the program. However, a more sophisticated exit program might be required. It may be necessary, for example, to start a sequence of programs rather than just the failed program. Or the exit program may play a role in repositioning the user in the restarted application.
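The overall shape of such an exit program, one entry point that dispatches on the action code and reports a success indicator, can be sketched as follows. This Python fragment only illustrates control flow: a real exit program is written in an ILE language and receives its parameters in the format defined by the cluster services. The action names come from Figure 1, but the enum values, handler names, and node names are hypothetical placeholders.

```python
from enum import Enum
from typing import Callable, Dict


class Action(Enum):
    # Names from Figure 1; the values are placeholders, not the codes
    # that the operating system actually passes to an exit program.
    INITIALIZE = "INITIALIZE"
    START = "START"
    RESTART = "RESTART"
    END = "END"
    DELETE = "DELETE"
    REJOIN = "REJOIN"
    CHANGE = "CHANGE"
    DELETE_COMMAND = "DELETE COMMAND"
    FAILOVER = "FAILOVER"
    END_NODE = "END NODE"


def on_initialize(node: str) -> bool:
    # Hypothetical: verify program objects and required data exist on node.
    return True


def on_start(node: str) -> bool:
    # Hypothetical: verify data areas are available, then start the application.
    return True


def on_end(node: str) -> bool:
    # Hypothetical: end any jobs this exit program started.
    return True


def on_failover(node: str) -> bool:
    # Hypothetical: validate required data on the backup, then restart there.
    return True


def acknowledge(node: str) -> bool:
    # Nothing to do beyond reporting success.
    return True


HANDLERS: Dict[Action, Callable[[str], bool]] = {
    Action.INITIALIZE: on_initialize,
    Action.START: on_start,
    Action.RESTART: on_start,
    Action.END: on_end,
    Action.DELETE: acknowledge,
    Action.REJOIN: acknowledge,
    Action.CHANGE: acknowledge,
    Action.DELETE_COMMAND: acknowledge,
    Action.FAILOVER: on_failover,
    Action.END_NODE: on_failover,
}


def exit_program(action: Action, node: str) -> bool:
    """Dispatch one cluster event and return the success indicator."""
    handler = HANDLERS.get(action)
    if handler is None:
        return False
    try:
        return handler(node)
    except Exception:
        # Any unexpected error is reported as failure, not propagated.
        return False


print(exit_program(Action.FAILOVER, "BACKUP1"))
```

A more sophisticated exit program would replace the stub handlers with real work, such as starting a sequence of programs or repositioning users, without changing the dispatch structure.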
Data for Automated Installation
AS/400 clusters share application information in a standard format through the automated installation data area. You must set up and initialize this area for each application that runs on the cluster. It must exist on every node in the application's recovery domain.
The data for automated installation consists of three components: the input data area (QCSTHAAPPI), the output data area (QCSTHAAPPO), and the object specifier file.
The input data area contains information about the application, the application's resiliency requirements, and its data and object replication requirements. This includes the application name, release level, and identification information; the associated exit program and information required by it; information about the cluster resource group; and information about any associated data areas and journals.
The output data area reflects the results of setting up the application resiliency environment. This includes an IP takeover name (if appropriate), the participating data resource group names, and the various status indicators.
The object specifier file describes the format used to identify objects replicated by the HABP solution. This includes information such as country and language codes and path information for replicated objects. Because this information is stored in a standard format, once set up, it can be used by any of the HABP solutions.
If you work with an HABP to cluster-enable your application, it may employ software tools that can help build the data for automated installation.
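To picture how the three components described above fit together, the sketch below models them as plain Python data classes. It is a conceptual outline only: the actual data areas use a fixed layout defined by IBM that is not reproduced here, and every field name and sample value is a hypothetical label for information the text mentions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class InputDataArea:
    """Conceptual view of QCSTHAAPPI: what the application requires."""
    application_name: str
    release_level: str
    exit_program: str
    cluster_resource_group: str
    journals: List[str] = field(default_factory=list)
    data_areas: List[str] = field(default_factory=list)


@dataclass
class OutputDataArea:
    """Conceptual view of QCSTHAAPPO: results of setting up resiliency."""
    ip_takeover_name: Optional[str] = None  # present only if appropriate
    data_resource_groups: List[str] = field(default_factory=list)
    status: Dict[str, str] = field(default_factory=dict)


@dataclass
class ObjectSpecifier:
    """One entry identifying an object to be replicated."""
    path: str
    country_code: str
    language_code: str


@dataclass
class AutomatedInstallData:
    input_area: InputDataArea
    output_area: OutputDataArea
    object_specifiers: List[ObjectSpecifier] = field(default_factory=list)


def missing_inputs(data: AutomatedInstallData) -> List[str]:
    """Return the names of required input fields that were left blank."""
    required = ("application_name", "release_level",
                "exit_program", "cluster_resource_group")
    return [name for name in required if not getattr(data.input_area, name)]


# Hypothetical sample values, invented for the example.
example = AutomatedInstallData(
    input_area=InputDataArea("ORDERS", "V1R0", "APPLIB/ORDEXIT", "ORDERS_CRG",
                             journals=["APPLIB/ORDJRN"]),
    output_area=OutputDataArea(),
    object_specifiers=[ObjectSpecifier("/QSYS.LIB/APPLIB.LIB", "US", "EN")],
)
print(missing_inputs(example))
```

The input area is filled in before setup, the output area is populated as a result of it, and the object specifier entries tell the replication solution what to copy.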
Certification Process
The project team involved in ClusterProven testing may vary. However, the application vendor, the HABP, or a combination of the two usually manages the project. IBM typically plays a monitoring role.
The certification process begins with the HABP loading the necessary objects on the backup node and populating the data in the automated installation data area. A series of scripted tests is then run to simulate system failures. The failover activity and resulting application processing on the backup node are observed to ensure that they conform to the ClusterProven or Advanced ClusterProven specifications.
The testing process generally lasts less than one week. The paperwork must then pass through the legal process before the ISV is allowed to use the ClusterProven or Advanced ClusterProven trademark.
It's All About Credibility
The IBM ClusterProven and Advanced ClusterProven trademarks help ISVs gain immediate credibility for their products' clustering capabilities, even before they have any reference accounts. The length and difficulty of the journey to ClusterProven status can vary greatly, depending on both the nature of the application and the level of application resiliency required.
When you start the journey, you cannot travel alone. AS/400 clustering is a cooperative venture among the ISV application, OS/400, and replication and cluster management software from an HABP. Since you must prove that your application is cluster-enabled before receiving a license to use the ClusterProven trademark, a relationship with at least one HABP is a mandatory step in the process.
When choosing an HABP to partner with, consider three factors: its replication and cluster management solution, its experience in cluster-enabling applications, and the tools it can bring to bear on the necessary tasks. An HABP that is strong in all three areas can greatly accelerate your trip on the road to ClusterProven or Advanced ClusterProven. Good luck on your journey!
Figure 1: Action codes that the exit program must handle
INITIALIZE: Ensure that the related program objects exist on the relevant nodes in the cluster. Validate the existence of the required data. Set the exit program success indicator.
START, RESTART: Ensure that the required data areas are available on the primary and backup nodes. Initiate handlers for exception and cancel conditions on primary node. Set the exit program success indicator if a failure occurs on primary node. Set the exit program success indicator on the backup node.
END: Set the success indicator on the primary and backup nodes. End any jobs that were started by the exit program on the primary node.
DELETE, REJOIN, CHANGE, DELETE COMMAND: Set the exit program success indicator on the primary and backup nodes.
FAILOVER, END NODE: Initiate handlers for exception and cancel conditions on first backup node. Validate the existence of the required data on first backup node. For a failure condition, set the exit program success indicator on first backup node. Validate the existence of the required data on other backup nodes. For a failure condition, set the exit program success indicator on other backup nodes.