SQL goes from flexible to downright magic when you take advantage of prepared statements.
If you’ve followed me over the years, you know that a lot of what I do centers around data analysis. Even as the industry trends toward transferring our data to a data warehouse (or data lake, nowadays), I still spend a lot of time taking existing data and massaging it into actionable business information for my customers. This involves both transactional data and configuration data (what we used to call master data). And even when we do use external data processing of one kind or another, oftentimes our data needs to be reviewed and cleansed before we can export it. For configuration data, one of the biggest issues is finding outliers, and that can be quite a feat.
Statement of Work
Today’s discussion is about finding unused fields and unusual values. We can take a simple case, the ubiquitous item definition file. Every ERP system has one; many have more than one. But let’s take a minimal case of a system with a single item file with a set of fields. Let’s do a short list: item number, item status, item description, item type, item group, item planner, unit of measure, and one of my favorites, harmonized system (HS) code. Most of those should be pretty self-explanatory, but we can identify a couple of immediate characteristics. Some are unique values (such as item number) while others are codes that belong to a limited list, such as unit of measure and HS code. Of the latter, some codes are externally defined (like HS code) while others are arbitrary values whose meaning is determined by the business.
When preparing for export, often one of the first things we need to do is review the data for consistency. These reviews typically lead to cleanup projects where someone in the business decides what to do with the various outliers. But before they can do that, someone has to build the list for them to review. It’s not hard for one file, but recently I had to do that for a number of files and fields, and that’s what led me to the techniques in this article.
Simple but Not Easy
I said it’s not hard for one file. Even that is a little misleading. Let’s decide what we’re going to do. I started by building a little field analysis file. This file has four fields: file name, field name, field value, and count. The idea was just to be able to store the analysis of the fields. And how would we populate the file? Well, assuming the fields in the analysis file are FAFILE, FAFIELD, FAVALUE, and FACOUNT, it wouldn’t be too hard to add, say, the statistics for the HS code in the item master. Assuming the field is IMHSCD in ITMMAS, the SQL is pretty simple:
INSERT INTO FLDANL
SELECT 'ITMMAS', 'IMHSCD', IMHSCD, COUNT(*)
FROM ITMMAS GROUP BY IMHSCD ORDER BY 3
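For reference, the analysis file itself can be created with SQL DDL along these lines. This is just a sketch; the column lengths are my assumptions, so size FAVALUE to hold the longest value you expect to analyze:
CREATE TABLE FLDANL (
  FAFILE   CHAR(10)    NOT NULL,  -- file (table) analyzed
  FAFIELD  CHAR(10)    NOT NULL,  -- field (column) analyzed
  FAVALUE  VARCHAR(50) NOT NULL,  -- field value, or *DISTINCT
  FACOUNT  INTEGER     NOT NULL   -- number of records with that value
)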
I threw in the ORDER BY clause just because that makes it easier to review the results quickly, even without a key on the file. But I immediately run into a few issues. First, I have to do this for every field in the file. That’s not too difficult when you have a few fields, but it quickly gets prohibitive when dealing with dozens of files with hundreds of fields each, which is typical in a modern ERP system. And if this task needs to be done more than once or twice, there needs to be a way to mechanize it. Second, I have to be careful not to inundate this file with those cases where every record has a different value, such as the item number. I have to figure out how to identify those fields with too many values for distinct analysis.
Serving with Distinction
I decided on a hybrid approach. First, every field would have a special record where the value “*DISTINCT” would indicate that the count was the number of distinct values in the file. Yes, I realize this breaks down for any field whose actual value is *DISTINCT, but that becomes less of an issue with the next change. I then decided to add the detail records to the file only if the number of distinct values fell below an arbitrary threshold, in this case fewer than five distinct entries. So let’s assume that our item file has 1,000 records and that 500 of the items are accounted for in LBS, 300 in FT, and 200 in EA. I start by getting the count of distinct values:
INSERT INTO FLDANL
SELECT 'ITMMAS', 'IMUOM', '*DISTINCT', COUNT(DISTINCT IMUOM) FROM ITMMAS
I run this statement for each selected field. Then, for each field having a count less than five, I run the previous INSERT statement that adds the distinct values for that field. Executing this technique manually for the fields IMITEM and IMUOM (item number and unit of measure), I end up with these entries in the analysis file:
File   | Field  | Value     | Count
ITMMAS | IMITEM | *DISTINCT | 1000
ITMMAS | IMUOM  | *DISTINCT | 3
ITMMAS | IMUOM  | EA        | 200
ITMMAS | IMUOM  | FT        | 300
ITMMAS | IMUOM  | LBS       | 500
Since there are so many distinct values for IMITEM, I don’t make entries for the individual cases.
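If you’re doing this manually, a quick way to see which fields qualify for the detail pass is to query the *DISTINCT records themselves. A minimal sketch, using the same threshold of five:
SELECT FAFILE, FAFIELD FROM FLDANL
WHERE FAVALUE = '*DISTINCT' AND FACOUNT < 5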
Automating the Process
As noted, this is easy with a small number of fields. I can run the three SQL statements pretty easily. I can even put them in a script and run them repeatedly. But things start to get difficult as the number of fields increases. That’s what led me to a programmatic approach, which in turn meant writing some RPG code with embedded SQL, and that’s where things got tricky. The first issue is that embedded SQL won’t let us use variables to define table and column names (file and field names to us RPG folks). This will not work:
EXEC SQL SET :FACOUNT = (SELECT COUNT(DISTINCT :FAFIELD)
  FROM :FAFILE);
So we have to go to the prepared statement approach. I was hoping this would do the trick:
P1 = 'SET ? = (SELECT COUNT(DISTINCT ' + %TRIM(FAFIELD)
+ ') FROM ' + %TRIM(FAFILE) + ')';
EXEC SQL PREPARE S1 FROM :P1;
EXEC SQL EXECUTE S1 INTO :FACOUNT;
But while DB2 supports EXECUTE…INTO, ILE RPG with embedded SQL does not. And that’s where I got stuck for a minute until I came up with the SQL magic for this article. I created a global variable called CDISTINCT and then just finished my code this way:
P1 = 'SET CDISTINCT = (SELECT COUNT(DISTINCT ' + %TRIM(FAFIELD)
+ ') FROM ' + %TRIM(FAFILE) + ')';
EXEC SQL PREPARE S1 FROM :P1;
EXEC SQL EXECUTE S1;
EXEC SQL SET :FACOUNT = CDISTINCT;
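One bit of housekeeping: the global variable has to exist before this code runs. That’s a one-time DDL statement along these lines (BIGINT is my assumption for the type; use whatever matches FACOUNT):
CREATE VARIABLE CDISTINCT BIGINT DEFAULT 0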
The prepared statement puts the number of distinct values into the global variable CDISTINCT, and then the last statement pulls that value into the field FACOUNT. At that point, I can write the record with native I/O or use an INSERT to add it. All that remains is the logic to insert the actual distinct values when the count is below the threshold.
IF FACOUNT < 5;
P2 = 'INSERT INTO FLDANL SELECT '
+ '''' + %TRIM(FAFILE) + ''''
+ ', ''' + %TRIM(FAFIELD) + ''''
+ ', CHAR(' + %TRIM(FAFIELD) + ')'
+ ', COUNT(*) FROM ' + %TRIM(FAFILE)
+ ' GROUP BY CHAR(' + %TRIM(FAFIELD) + ') ORDER BY 3';
EXEC SQL PREPARE S2 FROM :P2;
EXEC SQL EXECUTE S2;
ENDIF;
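To make the concatenation easier to follow, here is the statement P2 resolves to for the IMUOM field in ITMMAS:
INSERT INTO FLDANL SELECT 'ITMMAS', 'IMUOM', CHAR(IMUOM), COUNT(*)
FROM ITMMAS GROUP BY CHAR(IMUOM) ORDER BY 3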
And there you have it! You drive this by reading through a list of fields (maybe from DSPFFD to an output file) and putting the file and field names into FAFILE and FAFIELD. These SQL statements then generate the distinct value analysis.
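If you’d rather not bother with a DSPFFD outfile, the field list is also available from the SQL catalog. Here’s a minimal sketch, assuming the item file lives in a library called MYLIB:
SELECT SYSTEM_TABLE_NAME, SYSTEM_COLUMN_NAME
FROM QSYS2.SYSCOLUMNS
WHERE SYSTEM_TABLE_SCHEMA = 'MYLIB' AND SYSTEM_TABLE_NAME = 'ITMMAS'
Read that result set with a cursor, move each row into FAFILE and FAFIELD, and let the prepared statements do the rest.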
Not a Silver Bullet
While this technique is slick and takes advantage of some advanced SQL capabilities, the most important caveat is that this sort of analysis works best on smaller files, with tens of thousands of records rather than hundreds of millions. That’s because most databases don’t have an index on every field, so the COUNT(DISTINCT) has to read every record, and that means a lot of data access. But that aside, this is a great analysis tool, a good basis for other low-level file analysis, and a way to polish your prepared statement skills. Enjoy!