With Version 4 Release 3 of OS/400, IBM announced the availability of Encoded Vector Indexes (EVIs). EVIs are a new kind of index. They may sound like some kind of arcane database feature that only a database administrator could love, but they make network administrators happy, as well, by improving user response time.
All relational database management systems (RDBMSs) have some form of B-tree radix indexes. These indexes are, essentially, tree structures that provide very fast retrieval for a small number of rows. Radix indexes were developed years ago when online transaction processing (OLTP) applications were the only thing running on most production systems, and they are well suited to situations in which a fairly simple request results in one or two rows being retrieved from a table.
But with business intelligence, users are running more-complex queries. These queries sometimes retrieve large result sets, andmore importantlythe users are allowed to define ad hoc queries and run them dynamically. The majority of these users are using workstation tools and accessing the data through the network via connectivity protocols such as ODBC. Increasingly, these users are using Web-based data access tools. No matter how theyre issuing the requests, the result is a bottleneck in the database. This is completely transparent to the users; all they see is that response time is slow.
For many types of queries, EVIs can significantly reduce the database bottleneck, thus improving overall response time. Although it is not necessary for network administrators to understand the structure and intricacies of EVIs, it is important that they understand why EVIs improve database performance and recognize the situations in which they will be most useful. This article will provide some background on indexing technologies and provide examples of when and how EVIs help improve performance.
Radix Indexes
Although radix indexes are effective for static queries, they become less effective for ad hoc queries. In a static, or compiled, query, the database administrator already knows all the elements of the query in advance and can create the perfect index to support the query. But in an ad hoc environment, the user is allowed to select the elements of the query at will. In a datamart, for example, the database administrator might know that the user will always select store number and date, but, beyond that, the user might be able to choose from a
broad spectrum of elements, such as revenue, profit, inventory, or item. Given the users possible choices, it becomes impossible to define the perfect radix index in advance of the query.
Experts throughout the RDBMS industry have recognized this problem, and various vendors have proposed solutions that can be referred to collectively as bitmapped indexes. The basic concept behind a bitmapped index is that the database generates an array of bits in which each bit represents a row in the table. If the bit is on, the row contains the desired value. So, for example, if you have a customer table and you define a bitmapped index over a column called State that contains each state in the United States, an industry- generic bitmapped index will generate 50 bitmaps. If a user creates a query that includes the predicate where state=california, the RDBMS will retrieve the bitmap for California and quickly locate the rows that match the query.
The good news about this kind of index is that, in addition to fast retrieval, bitmaps can be combined using Boolean algebra to further increase ad hoc access. For example, say the database contained bitmapped indexes for the columns State and Month and a users query contained the statement where state=california and month=march.
The RDBMS could and the two bitmaps together to create a bitmap in which each set bit represented the combination of the two predicates. In an ad hoc environment, this becomes an extremely powerful tool for supporting users.
The bad news about the generic industry solution is that, since the RDBMS generates a bitmap for each unique value, bitmaps become an ungainly solution for large tables. First, consider the scenario in which you build a bitmap over a column called Gender, which will have two unique values (M and F). The database will build two bitmaps; each bitmap will have one bit for every row in the table. Each time a row is added to the customer table, both bitmaps are rebuilt. When the customer table contains 100,000 rows, this bitmapped index will not be too large or too difficult to maintain.
Now, consider the scenario in which you have thousands of unique values and millions of rows; the bitmapped index quickly becomes huge, and maintaining it becomes costly. Imagine building and maintaining a bitmapped index over a column called Cust_Num when there are 1 million customers, each with a unique customer number! The index would contain 1 million bitmaps, each with a million bits. Because of this scenario, most RDBMS vendors recommend bitmapped indexes only for small tables and columns with a low number of unique values.
EVI to the Rescue!
Aware of the inherent problems with bitmapped indexes, IBM Research developed EVIs, which store information about bitmaps rather than the bitmaps themselves. When a query requires a bitmap, DB2 UDB for AS/400 uses the EVI to generate a dynamic bitmap. Like the generic industry bitmaps, these dynamic bitmaps are used to retrieve rows from the database and can be combined through the Boolean functions and and or to produce a bitmap that perfectly matches the users request.
Because of the way EVIs are stored internally, their use is far broader than that of a generic bitmapped index. Because DB2 UDB for AS/400 does not have to maintain a bitmap for each unique value, EVIs are a powerful tool for very large tables. And improving performance for very large tables is exactly why network administrators will love EVIs as much as database administrators will. But, not knowing the intricate details of how a database optimizer works, these network administrators may want to know how EVIs improve performance. The simple answer is this: EVIs reduce the number of rows that the database must access. Of course, the simple answer requires more explanation.
As stated previously, the best way for a database to fetch a few rows for a given query is to build an index over the columns used in the query. But when a few rows become a lot of rows, the performance of a radix index begins to fall. Without EVIs, the only option is what database administrators call a full table scan. This is the process
whereby the database must access every row in the table and check to see if it meets the querys criteria.
A full table scan is not a problem for small tables, but for large tablessay, hundreds of millions of rowsa full table scan becomes a tremendously large, time- consuming task. Even on the AS/400, where parallel I/O and symmetric multiprocessing (SMP) work in concert to deliver some of the fastest table scans in the industry, full table scans can drag down the performance of a query.
Without EVIs, a database optimizer must choose between a radix index and a full table scan. If there is no radix index to match the query, the database has no choice but to use a full table scan. EVIs provide a third solution: The optimizer can combine several EVIs to create a dynamic bitmap that tells the system exactly which rows to retrieve and which rows to skip. This kind of data retrieval is called skip sequential processing. It uses the power of the AS/400s parallel I/O subsystem while avoiding the need to access every row in the table.
Performance Testing EVIs
To verify the viability of EVIs, the AS/400 Teraplex Integration Center, a customer testing facility dedicated to high-end business intelligence, set up a test that exercised the three major data access methods: radix indexes, EVIs, and full table scans. At the outset, the hypothesis was that the database would use EVIs and skip sequential processing when an intermediate percentage of rows were being returned. Figure 1 describes how experimenters thought EVIs would be used, relative to other data access methods. As the chart shows, researchers expected that, even when EVIs were available, the database would continue to use a radix index when a small percentage of rows were being retrieved. They also expected to see full table scans used for a large percentage of returned rows.
The purpose of the test was to quantify, at least for one query, where and when the database would stop using one access method and begin using another. To do this, the Teraplex Center researchers created a 512 GB, 2.1 billion-row table representing a distribution business. Figure 2 shows how they defined the query. With the query defined, the team could then build the perfect radix index and a set of EVIs that would match the query. The perfect radix index, which is an index built specifically for the query at hand, was built over the columns Year, ReturnFlag, Shipmode, and CustKey because these were the columns used to select which rows would be retrieved. In addition, the team built single-column EVIs over the same set of columns. That is, it defined indexes in the following way:
CREATE ENCODED VECTOR INDEX STARDB/YEAREVI +
ON STARDB/ITEMFACT (YEAR) +
WITH 333000 DISTINCT VALUES
Similar indexes were created for the other columns. By changing the value of the variable custkey_value in the query, the number of rows returned could be controlled. In the first iteration, 152 rows were returned. The value of custkey_value was systematically increased with each run of the query until all 2.1 billion rows were returned. The table in Figure 3 describes the results.
The results follow the general pattern described in Figure 1. However, the Teraplex team learned a couple of important points from this test. For one, when 0.05 to 25 percent of the rows were retrieved, DB2 UDB for AS/400 chose to use a radix index and skip sequential processing. The database has had this capability since V4R2, and the Teraplex team has seen it used effectively many times. In this case, it was not expecting to see this combination of index and access method used more often than a radix index with key row positioning. These results are a testament to the effectiveness of skip sequential processing.
Second, the system consistently used no more than three EVIs for any query, even though seven EVIs were defined over the table. The optimizer analyzes the cost of using all
the EVIs, and, in this case, it determined that using three of the seven EVIs provided the most efficient access to the required rows.
Finally, the full table scan was not used until 63 percent of the rows were returned. Without EVIs, the optimizer will often choose a full table scan when 20 percent of the rows are returned. On AS/400s with large memory configurations, like the ones in the Teraplex Center, a full table scan may be used when as few as 10 percent of the rows are returned because the system can preload data into memory in parallel.
Although full table scans are very efficient for a high percentage of returned rows, its important to understand the cost of using them for all queries. The Teraplex Center team took the same query described in Figure 1 and ran it three different ways, returning 152 rows each time. First, it ran the query with no indexes present, forcing the system to use a full table scan. Then, it ran the query with the seven EVIs defined over the table. Finally, it dropped all the EVIs and ran the query with only a radix index present. Figure 4 illustrates the results of these tests. As expected, with such a small number of rows returned, a radix index was still the most efficient method for processing this query. But more important than that is the dramatic difference between using an EVI and using a full table scan. The full table scan is much slower than an EVI.
The Results Speak for Themselves
The results of the Teraplex tests illustrate the power of EVIs in an ad hoc query environment. These users demand performance, and EVIs are another tool for meeting the performance demand.
Response Time/ System Resources
Skip sequential with EVIs and dynamic bitmaps
Sequential full table scan Few Many Number of rows accessed
Keyed with "perfect" rapid index
Figure 1: This graph shows how experimenters thought EVIs would be used relative to other data access methods.
SELECT CUSTKEY, LINENUMBER
FROM ITEM_FACT
WHERE YEAR = 1998
AND RETURNFLAG = "R"
AND SHIPMODE = "AIR"
AND CUSTKEY <= custkey_value
Figure 2: The experimenters defined the query for the test with this SQL statement.
Percent Rows Returned Index Used for Access Access Method
rows < 0.05 Radix Key Row Positioning
0.05 < rows < 25 Radix Skip Sequential 25 < rows < 63 3 EVIs Skip Sequential rows > 63 none Full Table Scan The table was 512 GB and contained 2.1 billion rows. The test was run on a model 740, 12-way system with 40 GB of memory. These results are particular to a given query on a given configuration. Your results will vary.
Figure 3: This table describes the results after all 2.1 billion rows were returned.
Time
(in minutes)
60
50
40
30
20
10
0
49.43
Full Table Scan EVIs Perfect Index
3.98 0.33
Figure 4: The team also analyzed the time costs of using just one access method for all queries.
LATEST COMMENTS
MC Press Online