Data processing is tricky business, full of pitfalls and gotchas. Hopefully the tasks in this guide help with getting started in this process. But you, I, and the entire world will make mistakes. It's natural.
But mistakes in data processing, like all other kinds of mistakes, can be painful. They can result in hours of bug hunting, days of reprocessing, and months of crying. Since we know mistakes happen and will continue to happen, what can we do to take away some of the pain?
In a word, padding. We need some padding to protect us from the bumps and bruises of data processing. And I would suggest that this padding come in the form of simple tests that check the assumptions you have about the shape and contents of your data.
Unless there is an extreme performance need, these tests should run inside the data processing pipeline itself. Ideally, they would be easy to turn on and off, so that you can disable them if you need to once your code is deployed.
These tests can be created with assertions - functions that check the truthiness of a statement in code. Typically, they raise an error when an expected truth is not actually true.
JavaScript doesn't have built-in assertions as part of the language, but we can rectify this deficiency with a simple function.
function assert(isTrue, message) {
  if (!isTrue) {
    // Report the failed assumption, but keep going.
    console.log(message);
    return false;
  }
  return true;
}
This will output the given message if the input is not true. Typically, assertions throw errors, but for the purposes of explanation we will just log the message.
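For comparison, here is a minimal sketch of the more typical behavior, where a failed assertion throws and the checks can be switched off. The names assertOrThrow and ASSERTIONS_ENABLED are hypothetical and only for illustration; the rest of this guide keeps using the logging version.

```javascript
// Hypothetical switch: set to false to skip all checks (for example, once deployed).
var ASSERTIONS_ENABLED = true;

// Like assert above, but throws an Error instead of logging.
function assertOrThrow(isTrue, message) {
  if (!ASSERTIONS_ENABLED) {
    return true; // checks are turned off, so report success
  }
  if (!isTrue) {
    throw new Error("Assertion failed: " + message);
  }
  return true;
}

// Usage: a failed check stops the surrounding code right away.
try {
  assertOrThrow(1 + 1 === 3, "math is broken");
} catch (e) {
  console.log(e.message);
}
// => Assertion failed: math is broken
```

Throwing stops the pipeline at the first bad assumption, while logging lets you see every problem in one run. Which is better depends on how costly it is to continue with bad data.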
Now let's use our assert function to check some assumptions about the details of our data.
We can use lodash's suite of type checking functions to take care of performing the checks, passing the result of each check to assert to produce our errors.
Let's say our data importing process has made some mistakes:
var data = [{"name":"Dan",
"age":23,
"superhuman":false},
{"name":"Sleepwalker",
"age":NaN,
"superhuman":"TRUE"}
];
Our first entry looks OK, but our second entry has some problems. The age parsing for the immortal Sleepwalker has left him with no age. Also, bad input data has left us with a string in superhuman, where we expect a boolean.
A simple assumption checking function that could be run on this data could look something like this:
function checkDataContent(data) {
  data.forEach(function(d) {
    // Include the whole entry in each message so the bad row is easy to find.
    var dString = JSON.stringify(d);
    assert(_.isString(d.name), dString + " has a bad name - should be a string");
    assert(_.isNumber(d.age), dString + " has a bad age - should be a number");
    assert(!_.isNaN(d.age), dString + " has a bad age - should not be NaN");
    assert(_.isBoolean(d.superhuman), dString + " has a bad superhuman - should be boolean");
  });
}
checkDataContent(data);
=> {"name":"Sleepwalker","age":null,"superhuman":"TRUE"} has a bad age - should not be NaN
{"name":"Sleepwalker","age":null,"superhuman":"TRUE"} has a bad superhuman - should be boolean
Again, the focus here is on detection of data problems. You want something quick and simple that will serve as an early warning sign.
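Unfortunately, the JavaScript primitive NaN is indeed a number as far as type checks are concerned (_.isNumber(NaN) returns true), so a separate check is needed to catch it. Note also that JSON.stringify writes NaN as null, which is why the messages above show "age":null. As more data comes in, this function will need to be updated to add more checks. This might get a bit tedious, but a little bit of checking can go a long way towards maintaining sanity.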
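One way to fold the two age checks into a single test is the standard Number.isFinite function, which rejects NaN, Infinity, and anything that is not a number. This is only a sketch: findBadAges is a hypothetical helper name, and the rows are a copy of the sample data from above.

```javascript
// A type check alone cannot catch NaN, because NaN is of type number.
console.log(typeof NaN);
// => number

// Hypothetical helper: returns the entries whose age is not a finite number.
// Number.isFinite does no type coercion, so strings such as "23" fail too.
function findBadAges(data) {
  return data.filter(function(d) {
    return !Number.isFinite(d.age);
  });
}

var sample = [
  {"name": "Dan", "age": 23, "superhuman": false},
  {"name": "Sleepwalker", "age": NaN, "superhuman": "TRUE"}
];

findBadAges(sample).forEach(function(d) {
  console.log(d.name + " has a bad age");
});
// => Sleepwalker has a bad age
```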
Just as you can test your assumptions about the content of your data elements, it can be a good idea to test your assumptions about the shape of your data. Here, shape just refers to the size and structure of your data: its rows and columns.
Something simple to perform this check could look like this:
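function checkDataShape(data) {
  // Rows: is there any data, and is there as much as we expect?
  assert(data.length > 0, "data is empty");
  assert(data.length > 4, "data is too small");
  // Columns: count the keys of the first entry.
  var keys = d3.keys(data[0]);
  assert(keys.length === 4, "wrong number of columns");
}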
checkDataShape(data);
=> data is too small
wrong number of columns
The two assumption functions could easily be combined into one, but it's important to look at both aspects of your data.
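The shape check above only looks at the columns of the first entry. If you also want to confirm that every row has the same columns, something like the following sketch could work. It uses the standard Object.keys rather than d3, findRaggedRows is a hypothetical helper name, and the rows are made up for illustration.

```javascript
// Hypothetical helper: returns the indexes of rows whose set of keys
// differs from the keys of the first row.
function findRaggedRows(data) {
  if (data.length === 0) {
    return [];
  }
  // Sort the keys so that key order does not matter in the comparison.
  var expected = JSON.stringify(Object.keys(data[0]).sort());
  var bad = [];
  data.forEach(function(d, i) {
    if (JSON.stringify(Object.keys(d).sort()) !== expected) {
      bad.push(i);
    }
  });
  return bad;
}

var rows = [
  {"name": "Dan", "age": 23, "superhuman": false},
  {"name": "Sleepwalker", "age": NaN},                 // missing superhuman
  {"superhuman": true, "age": 40, "name": "Example"}   // same keys, different order
];

console.log(findRaggedRows(rows).join(","));
// => 1
```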
Finally, it's often useful to check assumptions about data objects being equal. Lodash comes to the rescue again with the isEqual function:
console.log(_.isEqual({ tea: 'green' }, { tea: 'green' }));
console.log(_.isEqual({ tea: 'earl' }, { tea: 'green' }));
=> true
false
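If your processing runs in Node rather than the browser, the built-in assert module offers a similar deep comparison through deepStrictEqual, which throws when the two values differ. A minimal sketch, assuming a Node environment; checkEqual is a hypothetical wrapper that logs instead of throwing, to match the style used in this guide.

```javascript
// Node's built-in assert module, named nodeAssert here to avoid clashing
// with the assert function defined earlier in this guide.
var nodeAssert = require("assert");

// Hypothetical wrapper: returns true when the two values are deeply equal,
// and logs a message and returns false when they are not.
function checkEqual(actual, expected, label) {
  try {
    nodeAssert.deepStrictEqual(actual, expected);
    return true;
  } catch (e) {
    console.log(label + " does not match the expected value");
    return false;
  }
}

console.log(checkEqual({ tea: "green" }, { tea: "green" }, "first cup"));
// => true
console.log(checkEqual({ tea: "earl" }, { tea: "green" }, "second cup"));
// => second cup does not match the expected value
// => false
```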
If this is an approach that appeals to you, it might be worth exploring more powerful assertion libraries.
One such tool is Chai, which comes with a great collection of assertion helpers. These let you check more complicated things in a more succinct style, such as whether two objects are equal or whether an object has or doesn't have a given property.