And on the 8th day the Database Administrator said “All States Will Be Two Letters.” And thanks to data parsing and standardization it was so.
How many employee handbooks have you received over the course of your career? Whether your answer is “one,” “a few,” or “I’ve lost count,” you have a pretty good idea what these documents are all about. Once you get past the mission statement and the org chart, you’ll find page after page of work rules and company policies, all presented with the intent of getting you up to speed and doing things “the company way” as quickly as possible.
That makes sense to me. A company can’t move forward when divisions are working at cross-purposes or employees are operating from different sets of expectations.
Honestly, the situation isn’t that different when thinking about the data your company collects. It needs to follow a known set of rules in order to get anything done. Just as you would expect an employee to follow the guidelines laid out in an employee handbook, you should also expect your data to conform to the standards you select for records in your information systems.
Unfortunately, there’s a lot that can go wrong as you populate your databases. Perhaps customers don’t always have their full nine digit ZIP code handy when opting into your catalog mailing list. Perhaps your global logistics team sometimes deals in miles and sometimes in kilometers. Whatever the specific situation, here’s the point: You need to make sure all this information is in some consistent format before it ever makes it into your data systems. Ensuring that happens is what Data Parsing and Standardization is all about.
A Closer Look at Data Parsing and Standardization
It all begins with a set of rules. If everyone followed the rules, you’d be golden here. If you are setting up a new project, you’ll need to consider the nature of the data you’ll be collecting in system, consider what format would be the most advantageous form for each field. For example, if you’re doing business in the US, you want abbreviations for state names to conform to the list approved by the US Postal Service. So programmatically limiting the data collection of state information to the 2 letter state abbreviation (e.g. CA rather than California, Calif, Cali, or Cal) will insure that your records don’t suffer from data quality issues.
If for whatever reason, your data has already been collected, and it’s not pristine, Data Parsing can be applied to standardize the offending columns in the database. This is basically how data parsing works. A software engine uses a collection of patterns representing the ways that data values can be formatted for that specific data type. However, not every data type has a single pattern. If you take a closer look at telephone numbers, which appear in a number of different patterns, I’m sure you’ve seen the following:
- (123) 456-7890
- 123 456 7890
- 123.456.7890
- Etc.
Parsing
Although someone examining a similar character string would immediately recognize it as a phone number, computers, like data are dumb, and a data parsing application must be used to “decipher” them. The software would have a relatively complete set of patterns to properly assess the data in your dataset. This is the “parsing” side of the equation. If the input passes your rule check, great, into the system it goes.
In short, parsing and standardization rely on two components: a set of data format rules and an engine to process them. The data format rules are composed of a pattern and a set of actions applied to the sub-parts of a recognized pattern, whereas the engine takes these rules as input and applies them in the process.
What distinguishes the different vendor tools available on the market these days is the complexity of the rules and transformations allowed, as well as the additional interface bells and whistles provided. As you consider what to do, it’s tempting to want to create a comprehensive “manual” much like the HR manual that tries to cover every single action and consequence. A noble goal, but is it really practical in your world? How long before you have a usable data set?
The possibilities for data parsing and standardization are vast. But here is where the Pareto principle or the 80-20 rule really comes to bear. Consider defining and using the 20% of the obvious rules that fix 80% of the problems. Great first step. Much better data. With that said, keep in mind if a data point is important to your business, it’s probably important enough to develop some type of rule for it.
Standardization
Engines for standardizing data also vary greatly in their sophistication, but the end goal is always the same — to transform unacceptable data into an acceptable form. Think about that state code example I mentioned earlier. If having a valid mailing state code attached to a record is important to the way you do business, then your system needs to be smart enough to check the data and make the necessary changes.
Data Parsing and Standardization and Your BI Solution
Like the other Data Quality best practices we’ve already considered, Data Parsing and Standardization can add a higher degree of efficiency to your BI system’s operation and an increased level of trust for the information ultimately displayed on your dashboard. And in the metrics and KPI’s you use to drive your business. An even more important benefit is the end users “trust in the data”. As we move from using data only for internal purposes to a broader business ecosystem where we want to share our data with customers, suppliers, and vendors, clean, or valid data becomes the holy grail of user adoption and end user trust in the data. This kind of power is not to be ignored.
Looking Ahead
So far, we’ve mostly looked at Data Quality best practices in the context of a single dataset. But in the real world, the power of combining datasets together to get new answers to new questions is an incredibly powerful advantage. Before datasets can be used together, we need to understand the relationships between them. To help us get a handle on that, we’ll be looking at the practice of Data Matching next.