Data Quality

Why more data does not better results make.

Three weeks ago I wrote in a margin note “‘Big data’ is unlike other data in that it has low quality (we take the data we have) and high quantity. Intuition learned in statistics classes about p-values, sample sizes, and controls are the wrong tools for big data.” Today I wish to expand on some parts of this statement.

“Big data” is a popular term, and it is sometimes used to refer to any large dataset. However, there is a kind of dataset that can only be described as “big data”: it is incidental rather than intentional, information that was not designed to measure what we will use it to measure; and it is many-dimensional, measuring hundreds or millions of numbers per datum.

Login name is a traditional measure of user identity. Window size, IP address, round-trip network travel time, the set of available browser add-ons, script execution speed, mouse motion patterns, and delays between keystrokes when typing together form a big data estimate of user identity. Exam scores are a traditional measure of student learning. Time spent studying, eye-tracking data when reading tutorials, level of engagement in course forums, popularity with other students, parents’ education and employment, and vocabulary used in online posts together form a big data estimate of student learning.

Data quality is the property of data measuring what we think it is measuring and not something else. If participants opt in to a study, data may measure properties of who opts in instead of properties of the thing being studied. Exam results may measure test-prep skills instead of the skills the exam is designed to measure. School performance may measure cultural match between teacher and student or home environment instead of learning. Study design tries to avoid problems with data quality, and statistical methods try to quantify and compensate for them.

Because big data is incidental, it is all extremely low-quality: it was intentionally measuring something other than what we are using it for. The hope is that the high quantity of distinct measures in a single datum will allow us to extract and accumulate the tiny bit of signal in each and cancel out the extraneous information. Many factors may prevent this hope from being realized: the measures may be biased (i.e., all have the same error); portions of the signal may be missing from all measures; and our analysis may mistakenly identify extraneous information as signal.
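To make the hope and its first failure mode concrete, here is a minimal simulation sketch. It assumes a deliberately simple model that is mine, not a claim about any real dataset: every measure equals the true signal plus a shared error plus independent Gaussian noise. The names `average_of_measures`, `noise_sd`, and `shared_bias` are hypothetical.

```python
import random
import statistics


def average_of_measures(signal, n_measures, noise_sd, shared_bias, rng):
    """Average n noisy measures of one underlying signal.

    Assumed toy model: measure = signal + shared_bias + independent noise.
    """
    measures = [
        signal + shared_bias + rng.gauss(0, noise_sd)
        for _ in range(n_measures)
    ]
    return statistics.fmean(measures)


rng = random.Random(0)
signal = 1.0  # the tiny bit of truth hidden in every measure

# Independent errors: a handful of measures is mostly noise,
# but a very large number of them averages the noise away.
few = average_of_measures(signal, 10, 5.0, 0.0, rng)
many = average_of_measures(signal, 100_000, 5.0, 0.0, rng)

# Shared error: every measure is wrong in the same direction,
# so adding more measures cannot cancel it.
biased = average_of_measures(signal, 100_000, 5.0, 2.0, rng)

print(f"10 measures, independent noise:      {few:.2f}")
print(f"100000 measures, independent noise:  {many:.2f}")
print(f"100000 measures, shared error of 2:  {biased:.2f}")
```

Under these assumptions the large independent-noise average should land very close to the true signal, while the shared-error average should settle near the signal plus the shared error, however many measures are added.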

Statisticians have defined many measures that can help us understand the validity of analyses based on traditional data. The best known of these is the p-value, which measures how probable a difference at least as large as the one we see between two populations would be if luck alone, rather than an underlying cause, were at work. The larger the difference between the populations and the lower the variance within each population, the smaller the p-value becomes and the more confident we can be that the populations are distinct. Likewise, the more data we have, the smaller the p-value for a given real difference becomes, because more data provides a more complete picture of differences and variance.
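One way to see the effect of sample size is a permutation test, which estimates a p-value by shuffling group labels and counting how often luck alone produces a difference at least as large as the observed one. The sketch below is an illustration under assumptions of my own: Gaussian data, a true difference of half a standard deviation, and the hypothetical helper names `permutation_p_value` and `sample_groups`.

```python
import random
import statistics


def permutation_p_value(a, b, n_permutations, rng):
    """Two-sided permutation test on the difference of group means."""
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(
            statistics.fmean(pooled[:len(a)])
            - statistics.fmean(pooled[len(a):])
        )
        if diff >= observed:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


def sample_groups(n, effect, rng):
    """Two groups of size n whose true means differ by `effect`."""
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    b = [rng.gauss(effect, 1.0) for _ in range(n)]
    return a, b


rng = random.Random(1)
small_a, small_b = sample_groups(10, 0.5, rng)
large_a, large_b = sample_groups(500, 0.5, rng)

p_small = permutation_p_value(small_a, small_b, 1000, rng)
p_large = permutation_p_value(large_a, large_b, 1000, rng)

print(f"p-value with 10 per group:  {p_small:.3f}")
print(f"p-value with 500 per group: {p_large:.3f}")
```

The underlying difference is the same in both runs; only the amount of data changes. With 500 points per group the p-value should sit at or near the smallest value the test can report.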

With big data, statistical measures like p-values don’t work as expected. If we apply them to single measures, they simply report that the data is incidental and vast, which we already knew. If we apply them naïvely to entire many-dimensional data, they require axis normalization and norm selection and other tunable parameters that can change the outcome, meaning they still provide no clear insight. Big data variants of statistical measures are still an active area of research and not one I am qualified to comment on, but the few that are commonly deployed today (such as precision and recall) are far from being as versatile or telling as the statistical measures we expect for scientific progress.

Controls are another tool statisticians have developed to assist in learning despite quality problems in data. For example, suppose I wish to use a survey of participant mood as one of the measures in my study, but I hypothesize that mood will vary by the time of day the participant answers the survey. A study design solution would be to ensure every participant gets the survey at the exact same time, but that might be challenging to implement. Instead, I can control for time of day: first I determine what impact time of day appears to have on the data, then I modify the data to remove that impact before performing my study analysis.
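The determine-then-remove procedure can be sketched in a few lines. This is a minimal illustration, assuming the impact of time of day on mood is linear and that a least-squares fit is an acceptable way to estimate it; the simulated survey, the constant `TRUE_SLOPE`, and the helper `fit_slope` are all hypothetical.

```python
import random
import statistics


def fit_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(3)
TRUE_SLOPE = 0.2  # assumed: mood rises 0.2 points per hour of the day

# Simulated survey: each participant answers at a random hour (8 to 20).
hours = [rng.uniform(8, 20) for _ in range(2000)]
moods = [5.0 + TRUE_SLOPE * (h - 12) + rng.gauss(0, 1) for h in hours]

# Step 1: determine what impact time of day appears to have.
slope = fit_slope(hours, moods)

# Step 2: modify the data to remove that impact.
mean_hour = statistics.fmean(hours)
adjusted = [m - slope * (h - mean_hour) for h, m in zip(hours, moods)]

print(f"estimated impact per hour: {slope:.3f}")
print(f"impact left after control: {fit_slope(hours, adjusted):.2e}")
```

The adjusted moods can then be used in the study analysis as if every participant had answered at the same time, provided the linear assumption holds.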

In theory, controls can be applied to big data, but in practice doing so is very challenging. The core of this challenge lies in proxies: measures in a datum that collectively contain all the information of another measure. Simply omitting the measure we want to control for is thus ineffectual: the confounding variable retains its influence through its proxies. A traditional control has a different, though related, problem: the determine-impact-and-remove step is rarely perfect, meaning each of the proxies will retain a small part of its original proxying power, and if there are enough of them the big data methods will still be able to extract the original from them all.
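A small simulation can show how much a crowd of weak proxies retains. The sketch assumes an imperfect control has left each of 1000 proxies with only a fifth of its original dependence on the confounder, buried under much larger independent noise, and it uses a plain average as a stand-in for whatever a real big data method would do. All constants and names here are hypothetical.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(2)
N_DATA, N_PROXIES = 500, 1000
LEFTOVER = 0.2  # assumed share of proxying power left after the control
NOISE_SD = 2.0  # assumed independent noise in each proxy

# The confounder itself has been omitted from the data we analyze.
confounder = [rng.gauss(0, 1) for _ in range(N_DATA)]

first_proxy, combined = [], []
for z in confounder:
    proxies = [LEFTOVER * z + rng.gauss(0, NOISE_SD) for _ in range(N_PROXIES)]
    first_proxy.append(proxies[0])              # one proxy on its own
    combined.append(statistics.fmean(proxies))  # all proxies pooled

corr_single = pearson(confounder, first_proxy)
corr_combined = pearson(confounder, combined)

print(f"one proxy vs. omitted confounder:   {corr_single:.2f}")
print(f"1000 proxies vs. omitted confounder: {corr_combined:.2f}")
```

Any single proxy should look nearly unrelated to the omitted confounder, yet the pooled proxies should track it closely, which is why neither omission nor an imperfect control removes its influence.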

All of this is not to say that big data cannot be useful. Having any measure, even one with errors, of things previously too expensive or challenging to measure has vast potential utility. But that utility comes at a cost: big data applications are full of biases and errors that current methods can neither quantify nor correct.

Big data methods may yet emerge that will bring a measure of certainty or at least transparency to this field. Or proofs may emerge showing such methods cannot exist, bringing confidence and maturity of a different kind. Until that happens, though, utility and cost drive the market and biases and errors will remain.