Little things in your data can sabotage a business intelligence implementation. Data cleansing can help prevent it.
Modern tools such as Business Intelligence, analytic and collaboration systems can help your business reach goals that you once might have thought were impossible. Unfortunately, those same systems can also be derailed easily by the unexpected.
Consider this example from 15th century Holland. Disgruntled textile workers found themselves facing an unfamiliar threat: the first primitively “automated” looms.
Complex by the time’s standards, the new machines raised productivity significantly. Prescient workers could foresee the day when their labor simply wouldn’t be needed. In response, some of these folks developed the unfortunate habit of providing the system with some unexpected input by throwing their wooden shoes – known as sabots – into the loom’s intricate mesh of wooden gears. “Sabotage” was born.
I’m sure your business is not confronting active, malicious attempts to sabotage the core data in the systems, but when it comes to multiple data sets there are inherent issues that can cause the gears to stick. This includes, but is not limited to, things like the same variable being named differently in different databases. For example, one database calls a field “ZIP” and the other calls it “Zip Code”. It’s the same thing, to us, but as I’ve said before data and systems are dumb, humans are intelligent. We can immediately make sense of it, but not so easy for a computer.
Conventional wisdom would have us “fix” the problem at the source, and although this might in fact be the best way to do it, in many cases it is not practical or cost effective, and if attempted would lead to errors elsewhere that relied on a header or metadata to contain a specific name.
The bottom line: Making business decisions based on suspect output is not going to move you closer to your goals. And when you’re missing unable to join data sets effectively, it doesn’t much matter why it happened, it just matters that it did.
To rescue a system compromised by these issues, you need a process for fixing it on the fly and somehow documenting it as part of the overall project flow.
What Else Happens During Data Cleansing?
Another important thing to consider during the data cleansing phase is a thorough evaluation of the records in the set. For instance, what percentage of the records are incomplete, inaccurate, or contain information outside of a standard format? A small data set might lend itself to a simple visual inspection, but at the enterprise level, you’ll need tools to do the check. That check may rely on referencing a validated outside record set or checking each datum against a list of business rules you’ve established.
When suspect records are flagged, they can be rejected entirely from the data set or repaired, replaced, or modified. The latter options are preferable, but as stated earlier it might be better handled on the fly and fixed for the analytic data set. Cleansing the data makes the data actionable, and keeps you from wasting the cost of collecting the data originally, and the cumulative effect of adding layer upon layer of business decisions based on erroneous or incomplete data.
How Will Data Cleansing Affect a BI Implementation?
A proper tool allows you to create on-the-fly data cleansing logic, that’s captured and fully documented as part of the overall project flow. This reduces or eliminates the need to go back to the source system and clean up the data there, which can be very time consuming and error prone.
As we’ve already discussed, Data Cleansing is going to make the output from your BI solution more reliable. It’s also going to help users have more confidence in the data you provide them for their decision making purposes. Here’s an added bonus to think about: A data set that has undergone a careful Data Cleansing can often be useful for other business purposes outside your core BI system, meaning that your investment in the process can enhance the value of a hard-won asset.
What looms ahead?
So far, we’ve explored the value of Data Profiling, which helps you understand what kind of questions your existing data is capable of answering, and Data Cleansing, which gets your data set in proper order to do the job at hand. Next, we’ll look at how Data Parsing and Standardization can enhance the usefulness of the information your business collects.