Thank you, Mr. Smith. The telephone receiver goes down. Wait a minute. Was that Smith or Smyth? Maybe Smythe? Too late now; I'll guess at Smith.
The English language has many strange rules and exceptions. Start dealing with the spellings of names and there can be big trouble. The United States census takers had those problems. Imagine trying to get everyone's name recorded correctly for the entire country. In a task this daunting, there are bound to be many errors.
If, after taking this error-prone information, there is a desire to match data by family names, how is this done? The answer was provided by two people, Margaret Odell and Robert Russel, shortly before World War I. They developed a simple algorithm called Soundex to accomplish this task. Their original algorithm can be found in Appendix I, page 39, of The 1910 Federal Population Census: A Catalog of Microfilm Copies of the Schedules. The two patents that existed regarding the Soundex algorithms expired before World War II.
Odell and Russel examined the usage of letters as they were pronounced. They attempted to categorize letters into groups by their sounds and by their significance to the final pronunciation. Each letter of the alphabet was assigned a number. These numbers were then used to produce a Soundex code for a name. Names with the same Soundex code would be pulled out together in a search on a given name. In the Smith example, all three spellings would generate the same Soundex code.
When I was looking to build a Soundex algorithm for a recent project, I searched the Internet and a large, local BBS. What I discovered was that Soundex coding is an art form, with each person interpreting it somewhat differently and adding flourishes. If you are going to build your own Soundex routine, take a spin around cyberspace and see what others have done before you.
PC and UNIX resources have a plethora of information on Soundex algorithms. I found many Soundex routines, some in languages that I've never heard of, and thoughts from various Soundex authors. The grandfather of most modern Soundex algorithms seems to be from The Art of Computer Programming, Volume 3: Sorting and Searching by Donald Knuth. PC Magazine recently ran a small article on Soundex (May 30, 1995) which included C language source code.
How It Works
My Soundex algorithm will take a 50-character string and convert it into a four-position Soundex code. This code can then be stored in a record and used as a key field for a logical file when you want to search by Soundex.
My Soundex implementation is made up of one letter followed by three digits. This is the same method used by Soundex's originators. The letter in the code is the first letter of the word or name being converted to Soundex. If the name is Smith, the first position of the Soundex code will contain the letter S.
Bringing the first letter directly into the Soundex code has both advantages and disadvantages. The advantage is that it narrowly focuses the search for similar sounding names to names that start with a particular letter. The pool of possibilities should be reduced compared to the alternative method.
The alternative is to encode the first letter just like the rest of the name. So, instead of the original letter-digit-digit-digit Soundex code, a digit-digit-digit-digit Soundex code is produced. The advantage is that it will find matches for similar sounding first letters. For example, "staut" and "ztaut" would have equal Soundex codes under the four-digit method. I have seen implementations using both methods. I have also seen implementations that allow for a longer Soundex code.
By now, you are probably wondering about those mysterious digits in the Soundex code. Remember those old secret codes you used with your best friend when you were a kid (there were no PCs and PGP encryption software back then): you would write all of the letters to the alphabet and equate them to some other letter or symbol. Well, this is kind of like that, except all of the letters have codes based on their sound groups. In Soundex, all of the letters of the alphabet correspond to one of seven digits, 0 through 6. 1 contains the translation table my algorithm uses.
By now, you are probably wondering about those mysterious digits in the Soundex code. Remember those old secret codes you used with your best friend when you were a kid (there were no PCs and PGP encryption software back then): you would write all of the letters to the alphabet and equate them to some other letter or symbol. Well, this is kind of like that, except all of the letters have codes based on their sound groups. In Soundex, all of the letters of the alphabet correspond to one of seven digits, 0 through 6. Figure 1 contains the translation table my algorithm uses.
Group 0 is made up of letters considered to have little contribution to overall word sound. It includes all of the vowels and consonants that are often silent. The other groups create similar sounds when pronouncing a word. The greatest variation that I found in Soundex algorithms was in this translation table. The one used here is the most popular, but many programmers juggled the letters to fine-tune the process.
Let's go back to old Mr. Smith. If we wanted to translate his name into Soundex code, we would bring over the first letter of his name and translate the remaining letters based on the translation table in 1. The interim result would be S5030. The algorithm compresses out any irrelevant sounds (e.g., silent letters) that exist between the relevant sounds and shortens the result to the four positions required. For Smith, the final result is S530. The raw code for Smyth is S5030; for Smythe, it's S50300. When compressed and truncated, all three versions end up as Soundex code S530-a match.
Let's go back to old Mr. Smith. If we wanted to translate his name into Soundex code, we would bring over the first letter of his name and translate the remaining letters based on the translation table in Figure 1. The interim result would be S5030. The algorithm compresses out any irrelevant sounds (e.g., silent letters) that exist between the relevant sounds and shortens the result to the four positions required. For Smith, the final result is S530. The raw code for Smyth is S5030; for Smythe, it's S50300. When compressed and truncated, all three versions end up as Soundex code S530-a match.
If I needed to Soundex code the name Who, I would end up with a raw code of W00. Since the raw code is less than four positions long, it is padded to fill out all four positions by adding a 0. Who's Soundex code is W000 (and yes, he is on first).
There is another oft-changed element to Soundex algorithms: redundant sounds. Some versions strip all redundant, contiguous letters. Some versions go further and strip redundant contiguous sounds (i.e., letters in the same sound group). Other versions go even further and strip redundant, contiguous sounds in the compressed Soundex code. To demonstrate this, assume the contrived name Amemnierr. Ms. Amemnierr's raw code would be A50550066. If an algorithm strips only redundant, contiguous letters, the raw code would be changed to A5055006 (the second R was removed) and the resultant Soundex code would be A555. If an algorithm strips redundant contiguous sounds, the raw code would be changed to A505006 (the second R and the N are removed), and the resultant Soundex code would be A556. Another version strips redundant sounds after the compression. We ended up with four-position code in the last example (A556), but stripping the redundant 5 results in the final code A560. (A 0 is placed at the end to pad out to four positions if we run out of relevant sound codes.)
The algorithm in this article follows the last method, stripping all redundant sounds after compression.
The RPG Code
2 contains the ILE RPG program SDX001RG, which executes my Soundex algorithm. SDX001RG accepts a 50-character string and returns a four-position Soundex code. The *Entry parameters are listed in the initialization subroutine.
Figure 2 contains the ILE RPG program SDX001RG, which executes my Soundex algorithm. SDX001RG accepts a 50-character string and returns a four-position Soundex code. The *Entry parameters are listed in the initialization subroutine.
The input string is immediately converted to all uppercase letters. After this is accomplished, a small flourish of my own is executed. A call to subroutine Odd_Start examines the first two characters in the string. If they match any of the combinations that I am trapping, sound equivalencies are made in the starting letter. I did this in an attempt to compensate for words such as phone. This subroutine is a good one to play with and could easily be expanded or removed for your own purposes. As you can see by examining the Odd_Start subroutine in 2, I change PH to F, NM to M, PT to T, and KN to N.
The input string is immediately converted to all uppercase letters. After this is accomplished, a small flourish of my own is executed. A call to subroutine Odd_Start examines the first two characters in the string. If they match any of the combinations that I am trapping, sound equivalencies are made in the starting letter. I did this in an attempt to compensate for words such as phone. This subroutine is a good one to play with and could easily be expanded or removed for your own purposes. As you can see by examining the Odd_Start subroutine in Figure 2, I change PH to F, NM to M, PT to T, and KN to N.
SDX001RG also strips out all non-alphabetic characters. This is designed to clear any insignificant characters that may come along with a name, such as apostrophes (O'hare). The actual Soundex coding is quite simple and relies upon RPG's XLATE operation code to translate a given uppercase letter into its corresponding Soundex number code.
I've provided display file SDX002DF (see 3) and program SDX002RG (see 4) as a small application to test my Soundex program, SDX001RG. The SDX002RG program presents a pop-up window that accepts a character string and displays the resultant Soundex code for the string. This program can be especially helpful as you tweak the program to fit your own needs.
I've provided display file SDX002DF (see Figure 3) and program SDX002RG (see Figure 4) as a small application to test my Soundex program, SDX001RG. The SDX002RG program presents a pop-up window that accepts a character string and displays the resultant Soundex code for the string. This program can be especially helpful as you tweak the program to fit your own needs.
Many Uses
Soundex need not be used only for matching similar names. It can work for all sorts of words and applications. It is difficult to spell medical procedures or medicine names correctly. A database made up of this subject matter could easily be Soundex-coded. Other obvious usages can be found in human resource, customer, patient, or client databases.
What about street or city names? The opportunities to use this simple, malleable algorithm are endless. Now the next time Mr. Smyth, or Mr. Smythe (or was that Mr. Smith?) calls, you can look up his personal data quickly and easily.
Brian Kautz, an application development manager in a Pennsylvania manufacturing business, has worked in the IBM midrange environment for 10 years. He can be reached on the Internet at
REFERENCES
Knuth, Donald. The Art of Computer Programming, Volume 3: Sorting and Searching. Reading, Mass.: Addison-Wesley Publishing, 1973 (ISBN 0-201-13803-X).
The 1910 Federal Population Census: A Catalog of Microfilm Copies of the Schedules. Washington, D.C.: National Archives Trust Fund Board, 1982 (ISBN 0-911333-15-0).
Sound Off for Soundex
Figure 1: Soundex Code Translation Table
Letters Soundex Code a,e,h,i,o,u,w,y 0 b,f,p,v 1 c,g,j,k,q,s,x,z 2 d,t 3 l 4 m,n 5 r 6
Sound Off for Soundex
Figure 2: RPG Program SDX001RG
*=============================================================== * To compile: * * CRTBNDRPG PGM(XXX/SDX001RG) SRCFILE(XXX/QRPGLESRC) * *=============================================================== * SDX001RG: Convert a given string to a Soundex code. * Input: Accepts a 50 character string. (In_String) * Output: Returns a 4 character soundex code. (Soundex) * Indicator Usage: * 90 - General Work Indicator * Variable Declarations: *. 1 ...+... 2 ...+... 3 ...+... 4 ...+... 5 ...+... 6 ...+... 7 ...+... 8 DInput_Data S 1A Dim(50) DWork_Data S 1A Dim(50) DIn_String S 50A DSoundex S 4A DSave_Last S 1A INZ(*HIVAL) DFirst_Two S 2A DI1 S 2S 0 INZ(0) DI2 S 2S 0 INZ(0) DUpper C CONST('ABCDEFGHIJKLMNOPQRSTUVWXYZ') DLower C CONST('abcdefghijklmnopqrstuvwxyz') DCodes C CONST('01230120022455012623010202') ********************* * Mainline Program: * ********************* * Strip Non-Alphabet Characters from the input data C EXSR Strip_It * Convert input text to soundex coding C EXSR Convert_it * Strip out Repeat Sound Codes and Build Soundex Variable: C EXSR Build_it * End the program C EVAL *inLR = *ON * Subroutine Section: *********************** * Strip It Subroutine * *********************** C Strip_it BEGSR * Strip all non-alpabetic characters from the input data: *. 1 ...+... 2 ...+... 3 ...+... 4 ...+... 5 ...+... 6 ...+... 7 ...+... 8 *. 1 ...+... 2 ...+... 3 ...+... 4 ...+... 5 ...+... 6 ...+... 7 ...+... 8 C DOW I1 < %ELEM(Input_Data) C EVAL I1 = I1 + 1 * Check for non-alpha characters C Upper CHECK Input_Data(I1) 90 * Continue processing this value only if it was alphabetic C IF *in90<>*ON * Check for repeat letters and fill work file C IF Save_Last <> Input_Data(I1) C EVAL I2 = I2 + 1 C EVAL Work_Data(I2) = Input_Data(I1) C EVAL Save_Last = Input_Data(I1) C ENDIF C ENDIF C ENDDO * Move the Work Array Information back to the Input Data Array * and clear the work array. C EVAL Input_Data = Work_Data C CLEAR Work_Data C ENDSR ************************* * Convert it Subroutine * ************************* C Convert_It BEGSR * Convert input data to Soundex Coding * First element is moved without translation: C EVAL Work_Data(1) = Input_Data(1) C EVAL I1 = 1 C DOW I1 < %ELEM(Input_Data) C EVAL I1 = I1 + 1 C Upper:Codes XLATE Input_Data(I1)Work_Data(I1) C ENDDO * Move the Work Array Information back to the Input Data Array * and clear the work array. C EVAL Input_Data = Work_Data C CLEAR Work_Data *. 1 ...+... 2 ...+... 3 ...+... 4 ...+... 5 ...+... 6 ...+... 7 ...+... 8 *. 1 ...+... 2 ...+... 3 ...+... 4 ...+... 5 ...+... 6 ...+... 7 ...+... 8 C ENDSR *********************** * Build It Subroutine * *********************** C Build_it BEGSR * Build the Soundex Variable * First Character comes over with no further processing: C EVAL %SUBST(Soundex:1:1) = Input_Data(1) * Set up variables for processing loop: C EVAL Save_Last = *HIVAL C EVAL I1 = 1 C EVAL I2 = 1 * Do while index is less than number of input elements * and current element is not blank * and last Soundex element is not filled. C DOW I1 < %ELEM(Input_Data) C AND Input_Data(I1) > ' ' C AND I2 < 4 C EVAL I1 = I1 + 1 C IF Input_Data(I1) > ' ' * If code is not a dupe of the previous code, move it to Sound_X: C IF Save_Last <> Input_Data(I1) AND C Input_Data(I1) <> '0' C EVAL Save_Last = Input_Data(I1) C EVAL I2 = I2 +1 C EVAL %SUBST(Soundex:I2:1) = Input_Data(I1) C ENDIF C ENDIF C ENDDO * Convert any remaining blank values to zeros: C ' ':'0' XLATE Soundex Soundex C ENDSR ****************************** * Alter Odd Starting Letters * ****************************** *. 1 ...+... 2 ...+... 3 ...+... 4 ...+... 5 ...+... 6 ...+... 7 ...+... 8 *. 1 ...+... 2 ...+... 3 ...+... 4 ...+... 5 ...+... 6 ...+... 7 ...+... 8 C Odd_Start BEGSR * Try to raise accuracy by substituting common weird two letter * word starting combinations. * Move the first two letters of our word to be converted: C EVAL First_Two = %SUBST(In_String:1:2) C SELECT C WHEN First_Two = 'PH' C EVAL %SUBST(In_String:1:2) = ' F' C WHEN First_Two = 'NM' C EVAL %SUBST(In_String:1:1) = ' ' C WHEN First_Two = 'PT' C EVAL %SUBST(In_String:1:1) = ' ' C WHEN First_Two = 'KN' C EVAL %SUBST(In_String:1:1) = ' ' C ENDSL C ENDSR ****************************** * Initialization Subroutine: * ****************************** C *Inzsr BEGSR * Parameters Expected by this Program: C *Entry PLIST C PARM In_String C PARM Soundex * Translate all alphabetic characters to upper case: C Lower:Upper XLATE In_String In_String * Alter Odd Starting Letter combinations: C EXSR Odd_Start * Strip Leading Blanks C EVAL In_String = %triml(In_String) * Place our Input Data into an array for further processing: C MOVEA In_String Input_Data * Make certain that the Soundex code variable is empty: C CLEAR Soundex C ENDSR *. 1 ...+... 2 ...+... 3 ...+... 4 ...+... 5 ...+... 6 ...+... 7 ...+... 8
Sound Off for Soundex
Figure 3: Display File SDX002DF
A*=============================================================== A* To compile: A* A* CRTDSPF FILE(XXX/SDX002DF) SRCFILE(XXX/QDDSSRC) A* A*=============================================================== A*. 1 ...+... 2 ...+... 3 ...+... 4 ...+... 5 ...+... 6 ...+... 7 A DSPSIZ(24 80 *DS3) A CSRINPONLY A* Main Input Screen: A R SCREEN1 A WINDOW(5 10 9 52 *NOMSGLIN) A CF03 A CF12 A PRINT A WDWBORDER((*COLOR WHT) (*DSPATR HI)) A WDWTITLE((*TEXT 'Soundex Test') (*C- A OLOR WHT)) A 2 1'Input String:' A IN_STRING 50A I 3 2CHECK(LC) A 5 2'Results of Last Run:' A OUT_STRING 50A O 6 2COLOR(WHT) A 7 2'Soundex code --------->' A SOUNDEX 4A O 7 26COLOR(WHT) A 8 1'F3=Exit' A COLOR(BLU) * Dummy Record: A R DUMMY A ASSUME A 1 5' ' A*. 1 ...+... 2 ...+... 3 ...+... 4 ...+... 5 ...+... 6 ...+... 7
Sound Off for Soundex
Figure 4: RPG Program SDX002RG
*=============================================================== * To compile: * * CRTBNDRPG PGM(XXX/SDX002RG) SRCFILE(XXX/QRPGLESRC) * *=============================================================== * SDX002RG: Demonstrate The Soundex Routine *. 1 ...+... 2 ...+... 3 ...+... 4 ...+... 5 ...+... 6 ...+... 7 ...+... 8 FSDX002DF CF E WORKSTN F INFDS(INFDS) * Information Data Structure To Get The Function Key pressed: D INFDS DS D FunctionKy 369 369 * Named Hex Constants For Function Keys D F03 C CONST(X'33') D F12 C CONST(X'3C') D ENTER C CONST(X'F1') **************** * Main Program * **************** C DOW FunctionKy <> F03 C AND FunctionKy <> F12 * Output Entry Screen: C EXFMT Screen1 * Process Screen Based on Key used: C SELECT C WHEN FunctionKy = ENTER C EXSR Process * Verify Input C OTHER * Do Nothing C ENDSL C ENDDO * Terminate the Program: C EVAL *inLR = *ON ***************************************** * Subroutine Section * ***************************************** * Call the Soundex Routine: C Process BEGSR C CALL 'SDX001RG' C PARM In_String C PARM Soundex C EVAL Out_String = In_String C CLEAR In_String C ENDSR *. 1 ...+... 2 ...+... 3 ...+... 4 ...+... 5 ...+... 6 ...+... 7 ...+... 8
LATEST COMMENTS
MC Press Online