Data discovery is a lot easier if you're using the right tools the right way.
Editor's note: This article is an excerpt from the new book Data Governance Tools by Sunil Soares (MC Press 2014).
Data profiling is the process of understanding the data in a system, where it is located, and how it relates to other systems. This process includes developing a statistical analysis of the data such as data type, null percentages, and uniqueness. While there might be some nuances, we will use the terms "data profiling" and "data discovery" synonymously.
In the absence of tools, data analysts have historically resorted to the use of SQL queries to discover and profile data. Data profiling tools can automate a number of tasks associated with data governance.
Conduct Column Analysis
The first step in any data profiling exercise is to conduct an analysis of the columns. In Figure 5.1, IBM InfoSphere Discovery displays a column analysis for the HQ_EMP table. The column analysis displays basic metadata about each column, as discussed below (not all metadata is shown in the screenshot):
- #—The sequence number of the column.
- Column Name—The name of the column as shown in the database table.
- Data Type—The data type, such as NumberString, Varchar, and DateTime. For example, the data type for the EMPLOYEE_ID column in Figure 5.1 is NumberString.
- Length—The defined length of the column. For example, the length of EMPLOYEE_ID is seven characters.
- Precision—The maximum number of digits that can be present in a number. For example, EMPLOYEE_ID can have a maximum of 31 digits.
- Scale—The maximum number of decimals after the decimal point. For example, EMPLOYEE_ID has zero digits after the decimal point.
- Cardinality—The number of unique values in a column. For example, FNAME and LNAME have 228 and 219 unique values, respectively.
- Selectivity—The degree of uniqueness of the values (including nulls) in the column, calculated as Cardinality / (Row Count – Null Count). Selectivity is calculated on each column individually and is not the result of comparison to another column. This value is never greater than one.
- Min—The smallest or lowest value in the column, calculated numerically for numeric columns and alphabetically for other columns.
- Max—The largest or greatest value in the column, calculated numerically for numeric columns and alphabetically for other columns.
- Mode—The most common value in the column, not including null values. This value is calculated only if a particular value is displayed in more than five percent of the rows. In Figure 5.1, the mode for STATE is TX.
- Mode%—The number of times the mode (the most common value) is displayed in this column, as a percentage of all values in the column. For example, TX appears eight percent of the time in STATE.
- Sparse—Indicates whether the column is sparse, based on the Mode%. A sparse column contains mostly the same value except for a few exceptions.
- Null Count—The number of rows where the column value is null.
- Blank Count—The number of rows in the column that are blank (empty).
Figure 5.1: IBM InfoSphere Discovery displays the column analysis for the HQ_EMP table.
(From the IBM Redbook Metadata Management with IBM InfoSphere Information Server, October 2011, Jackie Zhu et al.)
Discover the Values Distribution of a Column
Data discovery tools should also display the most frequent values of a specific column. As shown in Figure 5.2, Trillium TS Discovery displays the Values Distribution, which shows the top five values for the Name column. The names "Michelle," "Dorothy," "Joey," "Royson," and "Sunil" appear 28.571%, 14.286%, 14.286%, 14.286%, and 14.286% of the time, respectively.
Figure 5.2: The Values Distribution for the Name column in Trillium TS Discovery.
Discover the Patterns Distribution of a Column
Data discovery tools should also display the patterns distribution of a specific column. As shown in Figure 5.3, Trillium TS Discovery displays the Patterns Distribution, which shows the top five patterns for the Name column. The most common patterns are alphanumeric six characters and alphanumeric eight characters, each with 28.571% of the records. These are followed by alphanumeric four, alphanumeric five, and alphanumeric seven, each with 14.286% of the records.
Figure 5.3: The Patterns Distribution for the Name column in Trillium TS Discovery.
Discover the Length Frequencies of a Column
Data discovery should also display the length frequencies of columns. In Figure 5.4, IBM InfoSphere Discovery displays the length frequencies for CHECKING.ACCOUNT_BALANCE. For example, values with a length of eight and seven occur 559 and 337 times, respectively. The bottom of the screen shows a preview of the rows where the length of CHECKING.ACCOUNT_BALANCE is seven.
Figure 5.4: Length frequencies in IBM InfoSphere Discovery.
(From the IBM Redbook Metadata Management with IBM InfoSphere Information Server, October 2011, Jackie Zhu et al.)
Discover Hidden Sensitive Data
Data discovery tools can also discover hidden sensitive data, which is a specific form of pattern matching. The sensitive nature of the data might not be reflected in column or table names. For example, U.S. Social Security numbers might be hidden in a field called EMP_NUM. Figure 5.5 shows that credit card numbers have been discovered by the Global IDs Profiler within the CoffeeChainSheet.txt, MedSpan2.5_DataSample.txt, and customer_sample_value.csv data sources.
Figure 5.5: The Global IDs Profiler discovers credit card numbers within multiple data sources.
Discover Values with Similar Sounds in a Column
The data discovery tool should also discover column values with similar sounds. As shown in Figure 5.6, the Soundexes Distribution in Trillium TS Discovery shows the Name values with a similar sound that are grouped together as soundexes when the data is analyzed. The soundex is based on the first four values of every attribute. Trillium TS Discovery also checks for the Metaphone, which is based on the entire attribute value, helping to check for misspellings and data discrepancies.
Figure 5.6: The Soundexes Distribution for the Name column in Trillium TS Discovery.
LATEST COMMENTS
MC Press Online