You are looking at a dataset, or a series of datasets, and wondering whether the data is good or bad. What do I mean by that? The Google Data Analytics course on Coursera offers a way to test it. Course three, Prepare Data, includes a module on data credibility. Is the data true and accurate? Is the data “bad”?
ROCCC
Google has an acronym for this: ROCCC, which stands for reliable, original, comprehensive, current, and cited. If the data meets all five criteria, it should be good data.
One example of a business decision made with incomplete data is Coca-Cola dropping its classic formula for New Coke. The data was not comprehensive. The company ran a study and found that people preferred New Coke to Pepsi, so it went ahead and switched to the new formula. People were upset. They may have preferred the new formula to Pepsi, but many preferred the original Coke to the new formula.
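The ROCCC test can be written down as a simple checklist. The sketch below is only an illustration: the class name `RocccCheck` and its methods are made up for this post, not taken from the course, and every answer except `comprehensive=False` is an assumed placeholder rather than a fact about the Coke study.

```python
from dataclasses import dataclass, fields


@dataclass
class RocccCheck:
    """One yes/no answer per ROCCC criterion for a single data source."""
    reliable: bool
    original: bool
    comprehensive: bool
    current: bool
    cited: bool

    def failed(self) -> list[str]:
        """Return the names of the criteria the source does not meet."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def is_good(self) -> bool:
        """A source passes only when all five criteria are met."""
        return not self.failed()


# Hypothetical assessment: only "comprehensive" comes from the story above,
# the other four answers are placeholders for illustration.
taste_test = RocccCheck(
    reliable=True,
    original=True,
    comprehensive=False,
    current=True,
    cited=True,
)

print(taste_test.is_good())  # False
print(taste_test.failed())   # ['comprehensive']
```

Listing the failed criteria by name, instead of returning a single pass/fail flag, tells you which question to go back and ask about the data.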
“Bad” data can’t be trusted because it’s inaccurate, incomplete, or biased. This could be data with sample selection bias, where the sample doesn’t reflect the overall population. Or it could be data visualizations and graphs that are simply misleading. An example of a bad visualization is a chart whose y-axis doesn’t start at zero, which makes small differences look large.
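To see how much a truncated y-axis can distort a chart, here is a small sketch with made-up numbers. The helper `apparent_ratio` is a hypothetical name used only for this example. It compares how tall two bars are drawn when the axis starts somewhere other than zero.

```python
def apparent_ratio(a: float, b: float, axis_start: float = 0.0) -> float:
    """Ratio of the drawn bar heights for values a and b
    when the y-axis starts at axis_start instead of zero."""
    if axis_start >= min(a, b):
        raise ValueError("axis_start must be below both values")
    return (b - axis_start) / (a - axis_start)


# Made-up figures: a 4% increase from one year to the next.
last_year, this_year = 100.0, 104.0

# Axis starts at zero: the second bar is drawn 1.04 times as tall.
print(apparent_ratio(last_year, this_year))

# Axis starts at 96: the second bar is drawn twice as tall.
print(apparent_ratio(last_year, this_year, axis_start=96.0))
```

The underlying numbers are identical in both cases. Only the starting point of the axis changes, and a 4% difference ends up looking like a doubling.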
O is for original. If you can’t locate the original data source and are relying on second- or third-party information, be very careful.
For a “good” data source, have a look at the U.S. Census Bureau, which regularly updates its information. There are many other “good” data sources out there.
Data Governance
In the book Data Governance: The Definitive Guide, data quality is defined by three main characteristics: Accuracy, Completeness and Timeliness.
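As a rough sketch of how these three characteristics could be measured, the example below scores a tiny made-up table. The metric definitions, function names, sample records, and the `reference` lookup are all assumptions for illustration. The book does not prescribe these formulas.

```python
from datetime import date

# Made-up records; None marks a missing value.
records = [
    {"region": "North", "customers": 120, "updated": date(2024, 1, 15)},
    {"region": "South", "customers": None, "updated": date(2023, 9, 1)},
    {"region": "East", "customers": 75, "updated": date(2020, 3, 10)},
]

# Hypothetical trusted reference values to check accuracy against.
reference = {"North": 120, "East": 80}


def completeness(rows, field):
    """Share of rows where the field is present."""
    return sum(r[field] is not None for r in rows) / len(rows)


def accuracy(rows, field, trusted, key="region"):
    """Share of checkable rows whose value matches the trusted reference."""
    checkable = [r for r in rows if r[field] is not None and r[key] in trusted]
    if not checkable:
        raise ValueError("no rows can be checked against the reference")
    return sum(r[field] == trusted[r[key]] for r in checkable) / len(checkable)


def timeliness(rows, as_of, max_age_days=365):
    """Share of rows updated within max_age_days of the as_of date."""
    return sum((as_of - r["updated"]).days <= max_age_days for r in rows) / len(rows)


as_of = date(2024, 6, 1)
print(completeness(records, "customers"))         # 2 of 3 rows have a value
print(accuracy(records, "customers", reference))  # 1 of 2 checkable rows match
print(timeliness(records, as_of))                 # 2 of 3 rows are under a year old
```

Accuracy is the hardest of the three to measure, because it needs something trusted to compare against. Here that role is played by the assumed `reference` lookup.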