We are all familiar with the phrase “garbage in, garbage out”. Once data quality goes “off the rails”, it can be exceedingly difficult to get it back on track. This matters even more in the age of “big data”: there has never been a more critical need for data quality, both because modern data volumes preclude manual intervention and because modern business processes and decisions critically depend on that very same data. Unfortunately, you seldom control the data itself, so you need to compensate, and that is what this article is about.
The term “garbage” suggests totally corrupted information. We can of course test for that, but minor formatting errors are much more common, and even these subtle variations can have a similarly negative effect on your processes. For example, if a US state is recorded as “NY”, “New York” or “N Y”, you might not correctly treat all the data as part of a single group, resulting in a flawed analysis. Similar problems can occur with almost any data element. The bottom line: anything that can be standardized probably should be, because you never know how the data might be used. Standardization is the key that allows you to correctly aggregate and link information so you can make sound business decisions; the challenge is to implement this goal easily. Let me explain.
Most analytics products allow you to test your data, for example for maximums, minimums and whether the correct (expected) coding format is used, but an ideal solution would allow you to detect and standardize your data, based on rules, in a manner that doesn’t involve “rocket science”. The problem with “rocket science” is that the harder something is to understand, the less likely it is to be implemented, and the harder it will be to maintain. A simple example demonstrates the point. Assume you have an inventory number field that is supposed to be coded like “L27/726” (i.e., a PIC of “X99/999”), but you know there are errors that need to be corrected. You need an expression that creates the PIC for each transaction, so you can identify, classify and correct the errors. With our main competitor, the expression you would use for this is: REGEXREPLACE(REGEXREPLACE(REGEXREPLACE(invno, "\d", "9"), "[a-z]", "x"), "[A-Z]", "X"). With Analyzer the expression is simply: Format(Invno). Both expressions produce the same result, but I’ll let you choose the one you prefer, and would prefer to maintain.
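For readers who want to experiment with the idea outside either product, the nested-replace logic described above can be sketched in a few lines of Python. The function name `format_pic` is my own illustrative choice; it applies the same three substitutions as the competitor’s expression (digits become “9”, lowercase letters “x”, uppercase letters “X”), which is also what Analyzer’s Format function produces:

```python
import re

def format_pic(value: str) -> str:
    """Derive a picture (PIC) string from a value:
    digits -> 9, lowercase letters -> x, uppercase letters -> X.
    All other characters (e.g. "/" or "-") pass through unchanged."""
    value = re.sub(r"\d", "9", value)       # every digit becomes 9
    value = re.sub(r"[a-z]", "x", value)    # every lowercase letter becomes x
    value = re.sub(r"[A-Z]", "X", value)    # every uppercase letter becomes X
    return value

print(format_pic("L27/726"))  # X99/999  (the expected PIC)
print(format_pic("l27-726"))  # x99-999  (a malformed entry stands out)
```

Grouping or filtering on this derived PIC immediately separates the well-formed “X99/999” records from every variant that needs correction.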
Imagine, if you will, that there were easy-to-understand functions for all your common data testing requirements, so that data quality becomes an easily attainable goal. This is exactly what Arbutus Analyzer offers. The Format function is but one example of many that allow powerful testing with no “rocket science”. Whether you want to identify formatting errors, remove extra blanks, or perform any of a large range of other manipulations, Analyzer will usually let you do it with ease.
By the way, there is an interesting “off-label” use for Analyzer’s Format function in fraud and duplicate-payment scenarios. With accounts payable, each vendor has a different invoice number format, but within each vendor the format is usually consistent. You can summarize on the vendor number and the expression Format(invoice_number) to identify every non-standard vendor invoice format. This allows you to quickly identify higher-risk non-standard transactions.
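The summarize-by-vendor idea can be sketched the same way. This is not Analyzer syntax, just an illustrative Python version: it derives a PIC for each invoice number, finds each vendor’s most common format, and flags invoices that deviate from it (the function and sample data are my own invention for demonstration):

```python
import re
from collections import Counter, defaultdict

def format_pic(value: str) -> str:
    # Same substitutions as before: digits -> 9, lowercase -> x, uppercase -> X.
    return re.sub(r"[A-Z]", "X", re.sub(r"[a-z]", "x", re.sub(r"\d", "9", value)))

def flag_nonstandard(invoices):
    """invoices: list of (vendor_id, invoice_number) pairs.
    Returns the invoices whose PIC differs from their vendor's most common PIC."""
    pics_by_vendor = defaultdict(Counter)
    for vendor, inv in invoices:
        pics_by_vendor[vendor][format_pic(inv)] += 1
    flagged = []
    for vendor, inv in invoices:
        usual_pic, _ = pics_by_vendor[vendor].most_common(1)[0]
        if format_pic(inv) != usual_pic:
            flagged.append((vendor, inv))
    return flagged

sample = [
    ("V001", "INV-10001"),
    ("V001", "INV-10002"),
    ("V001", "10003"),     # breaks V001's usual XXX-99999 pattern
    ("V002", "A-77"),
]
print(flag_nonstandard(sample))  # [('V001', '10003')]
```

The flagged transactions are exactly the higher-risk, non-standard entries the article describes, and the same result falls out of Analyzer’s summarize on vendor number plus Format(invoice_number).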