When it comes to business intelligence, a little data profiling can go a long way toward preventing GIGO (garbage in, garbage out).
Last night I was watching Hell’s Kitchen (mainly to pick up some management tips from Chef Ramsay), but then I started thinking about the meals the teams were asked to produce, and how the show provides them with the finest ingredients on the market.
Ask any chef the secret to preparing great meals and within moments they will agree with Chef Ramsay about the importance of starting with the best quality ingredients. It’s solid advice for working in the kitchen. You simply can’t build a wonderful dish from a poor-quality, tasteless foundation. Garbage in, garbage out. Hmmmmm…
Oddly enough, it’s also great advice for utilizing your business data.
Before you can put data to use, you need to understand exactly what you have and what level of information quality it represents. Allowing crucial business decisions to rest on a foundation of inappropriate or low-quality data is a recipe for disaster. On the other hand, a thorough understanding of the data you have available and the questions you can reasonably ask of it may allow your analysts to tease out business insights you never dreamed were possible, and at the end of the day, increase profitability, optimize the sales pipeline, and improve customer satisfaction.
What Is Data Profiling?
Simply put, Data Profiling, also known as data archeology (that’s cool, isn’t it?), is the statistical analysis and assessment of a dataset to determine its quality for meeting specific business needs and answering specific questions. According to the academic definition:
A company engaged in Data Profiling will perform various statistical analyses on the dataset in question, think deeply about the data types involved in collecting each field in a record, and set business rules to test the fitness of that dataset for answering new questions or interacting with other datasets.
OK, so what does that mean? Basically, we need to do some math to figure out how “clean” or “messed up” the data is before we use it for analytics that will drive our business decisions. We all know our data is not perfect. Knowing how good or bad it is allows us to take appropriate action.
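To make “some math” a little more concrete, here is a minimal sketch in Python of two common measurements: how often a field is missing, and how often the values that are present pass a business rule. The function name `profile_quality`, the sample `orders` records, and the two rules are all hypothetical, invented purely for illustration; real rules would come from your own business requirements.

```python
def profile_quality(records, rules):
    """For each field with a rule, report the share of missing values
    and the share of present values that pass the rule."""
    report = {}
    total = len(records)
    for field, rule in rules.items():
        values = [record.get(field) for record in records]
        # Treat None and the empty string as "missing" (an assumption).
        present = [v for v in values if v not in (None, "")]
        passed = sum(1 for v in present if rule(v))
        report[field] = {
            "missing_rate": (total - len(present)) / total if total else 0.0,
            "pass_rate": passed / len(present) if present else 0.0,
        }
    return report


# Hypothetical sample data and rules, for illustration only.
orders = [
    {"order_id": 1, "quantity": 3, "email": "a@example.com"},
    {"order_id": 2, "quantity": -1, "email": ""},
    {"order_id": 3, "quantity": 5, "email": "not-an-email"},
    {"order_id": 4, "quantity": None, "email": "d@example.com"},
]
rules = {
    "quantity": lambda q: isinstance(q, int) and q > 0,  # must be a positive integer
    "email": lambda e: "@" in e,                         # deliberately crude check
}

report = profile_quality(orders, rules)
for field, stats in report.items():
    print(field, stats)
```

Numbers like these won’t tell you whether the data is right, but they give you a quick read on how much of it is usable for the question at hand.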
What Benefits Can You Expect?
If you have the luxury of starting a new data gathering project with Data Profiling in mind, it will help you to carefully consider both the questions you hope to answer and the most appropriate design for gathering and storing relevant data. Time spent in this kind of planning results in higher data reliability and means less time and money are spent fixing problems later (or in the worst case, discovering there is no way to derive the answers you need from the data you have).
You probably have gigabytes if not terabytes of pre-existing data, and realistically you won’t have the chance to redesign these datasets from the ground up. When you’re dealing with a dataset that’s already in existence, Data Profiling tools can help you quickly determine whether it can reliably be used to handle your business needs. The tools do this by assessing the actual content, structure, and quality of the data, as well as exploring relationships that exist between value collections both within and across datasets.
For example, by examining the frequency distribution of different values for each column in a table, an analyst can gain insight into the type and use of each column. Cross-column analysis can be used to expose embedded value dependencies, and inter-table analysis allows the analyst to discover overlapping value sets that represent “foreign key” relationships between datasets. Data Profiling can often shed light on potential adjustments that could allow you to improve your data collection process or the validity of the existing dataset.
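As a rough sketch of two of these techniques, the Python below builds a frequency distribution for a single column and measures how much one column’s values overlap with another table’s column (a hint that a “foreign key” relationship may exist). The helper names `column_profile` and `value_overlap` and the tiny `customers` and `orders` tables are hypothetical; commercial profiling tools do far more than this.

```python
from collections import Counter


def column_profile(rows, column):
    """Summarize one column: row count, nulls, distinct values, top values."""
    values = [row.get(column) for row in rows]
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)  # the frequency distribution
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(counts),
        "top_values": counts.most_common(3),
    }


def value_overlap(child_rows, child_col, parent_rows, parent_col):
    """Share of distinct child values that also appear in the parent column."""
    child = {r[child_col] for r in child_rows if r.get(child_col) is not None}
    parent = {r[parent_col] for r in parent_rows if r.get(parent_col) is not None}
    return len(child & parent) / len(child) if child else 0.0


# Hypothetical tables, for illustration only.
customers = [
    {"customer_id": 1, "region": "East"},
    {"customer_id": 2, "region": "West"},
    {"customer_id": 3, "region": "East"},
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 1},
    {"order_id": 12, "customer_id": 2},
    {"order_id": 13, "customer_id": 9},     # no matching customer
    {"order_id": 14, "customer_id": None},  # missing value
]

profile = column_profile(orders, "customer_id")
overlap = value_overlap(orders, "customer_id", customers, "customer_id")
print(profile)
print(f"{overlap:.0%} of distinct order customer_ids exist in customers")
```

A high overlap suggests the two columns are related; anything short of 100% points to orphaned records worth investigating before the data is used.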
When you apply Data Profiling principles across multiple datasets, an interesting picture can often emerge. It isn’t unusual to uncover previously unknown relationships, which could lead to new and unexpected opportunities or provide you with enough warning to solve unanticipated problems.
Finally, if you are planning to join several datasets together, Data Profiling is a must. Before you ever execute such a plan, you need to understand all the relationships involved and have a clear picture of what the outcome will look like.
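One way to get that picture before running a join is to count keys on each side first. The sketch below is a simplified illustration in Python, not a substitute for a profiling tool; the function `preview_join` and the `accounts` and `invoices` tables are hypothetical. It reports keys that exist on only one side, keys that are duplicated, and how many rows an inner join would produce.

```python
from collections import Counter


def preview_join(left_rows, right_rows, key):
    """Estimate the outcome of joining two tables on `key` without joining them."""
    left_counts = Counter(row[key] for row in left_rows)
    right_counts = Counter(row[key] for row in right_rows)
    matched = left_counts.keys() & right_counts.keys()
    return {
        "left_only_keys": len(left_counts.keys() - right_counts.keys()),
        "right_only_keys": len(right_counts.keys() - left_counts.keys()),
        "duplicate_keys_left": sum(1 for c in left_counts.values() if c > 1),
        "duplicate_keys_right": sum(1 for c in right_counts.values() if c > 1),
        # Each matched key contributes (left occurrences x right occurrences) rows.
        "inner_join_rows": sum(left_counts[k] * right_counts[k] for k in matched),
    }


# Hypothetical tables, for illustration only.
accounts = [
    {"account_id": "A1"},
    {"account_id": "A2"},
    {"account_id": "A2"},  # duplicated key
    {"account_id": "A3"},
]
invoices = [
    {"account_id": "A2"},
    {"account_id": "A2"},
    {"account_id": "A3"},
    {"account_id": "A4"},  # no matching account
]

preview = preview_join(accounts, invoices, "account_id")
print(preview)
```

In this made-up example, two tables of four rows each would yield five joined rows, because the duplicated key multiplies matches. That kind of surprise is much cheaper to find before the join than after.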
How Will Data Profiling Affect a BI Implementation?
Your BI solution should function as a simple way to look objectively at your performance relative to selected KPIs and take appropriate action. But in order to act on data, you need to trust the data. Data Profiling sets up a strong foundation for building that kind of trust. It also means your BI implementation can be built, launched, and utilized efficiently, saving you money and minimizing lost opportunities.
One Final Note
It’s important to understand that Data Profiling doesn’t evaluate the truth of each datum in a dataset. The practice is intended to answer questions about capability (Can my dataset meet my needs?), not correspondence to reality (Is this data correct?).
In my next post, I’ll examine ways to address that second question a little more closely as we explore the practice of Data Cleansing.