Error-detecting data visualizations
Published: Jul 21, 2017
Author: admin_sec
Winter is in full swing and the RISK Team is hard at work gathering, validating, and analyzing data for a variety of projects. One of those is the upcoming quarterly release of the VERIS Community Database (VCDB).
For those not familiar with the VCDB, every quarter the RISK Team releases raw data on information security breaches that have been reported through public channels, like the news or reports to the Attorney General's office. The basic process is to identify news articles, create issues in GitHub, and then code them up using a web-based application. When it comes time for a release, we export the data and use some Python to turn the entries into VERIS schema-compatible JSON objects.
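In spirit, that conversion step looks something like the sketch below. To be clear, this is illustrative, not our actual export code: the input field names (entry["year"], entry["action"], and so on) are made up for the example.

```python
import json
import uuid

def to_veris_json(entry):
    """Turn one coded entry (a dict exported from the web app) into a
    VERIS-style incident object. The input field names here are made up."""
    return {
        "incident_id": entry.get("incident_id", str(uuid.uuid4())),
        "source_id": "vcdb",
        "timeline": {"incident": {"year": int(entry["year"])}},
        "action": entry["action"],        # e.g. {"hacking": {...}}
        "actor": entry["actor"],          # e.g. {"external": {...}}
        "attribute": entry["attribute"],  # confidentiality, integrity, ...
    }

def export_all(entries, out_dir="data/json"):
    """Write one JSON file per incident, ready for validation."""
    for entry in entries:
        incident = to_veris_json(entry)
        with open(f"{out_dir}/{incident['incident_id']}.json", "w") as f:
            json.dump(incident, f, indent=2)
```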
Of course, humans are prone to error, so we run a variety of sanity checks on the data before it gets committed to GitHub. For example, we check each incident to make sure that all of the values are actual enumerations from the standard (no incidents with 'hakcing' instead of 'hacking') and that an integer is present in the data total.
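A minimal version of those checks might look like this. The enumeration list is abbreviated and the field paths are assumptions for the example; the real script runs against the full VERIS schema.

```python
# Abbreviated, illustrative enumeration list for top-level actions.
VALID_ACTIONS = {"malware", "hacking", "social", "misuse",
                 "physical", "error", "environmental", "unknown"}

def check_incident(incident):
    """Return a list of problems found in one incident, or [] if clean."""
    problems = []
    # Every action must be a real VERIS enumeration; this is the check
    # that catches typos like 'hakcing'.
    for action in incident.get("action", {}):
        if action not in VALID_ACTIONS:
            problems.append(f"unknown action: {action!r}")
    # The data total must be an integer when it is present.
    conf = incident.get("attribute", {}).get("confidentiality", {})
    if "data_total" in conf and not isinstance(conf["data_total"], int):
        problems.append(f"data_total is not an integer: {conf['data_total']!r}")
    return problems
```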
Even with these checks in place, bad data can slip through the cracks, and one great way to detect it is to have some dynamic data visualizations that run against your live data. For example, earlier this week I pulled the latest data from our tool, converted it to JSON, validated it, and loaded it into a test database. Now, what data scientist wouldn't want to play around with a shiny new dataset? So I pulled up a graph of the number of incidents in each year of the dataset.
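As an aside, the load step itself is nothing fancy. Something like the following would do it; I'm assuming a local MongoDB test instance here, and the database and collection names are made up.

```python
import glob
import json
from pymongo import MongoClient  # assumes a local MongoDB test instance

# Sketch of the load step; database and collection names are made up.
client = MongoClient("localhost", 27017)
collection = client.vcdb_test.incidents

incidents = []
for path in glob.glob("data/json/*.json"):
    with open(path) as f:
        incidents.append(json.load(f))

collection.delete_many({})  # start from a clean test collection
if incidents:
    collection.insert_many(incidents)
print(f"loaded {collection.count_documents({})} incidents")
```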
Before I updated the data, the graph looked like this:
And after adding the new data I got this:
Well, that's not good. What the heck happened? It turns out that one of the incidents that passed validation had a mistake in the timeline: in that one file, timeline.incident.year was set to 20134. When I generate this particular view of the data, I like to fill in any missing years so that 1971 doesn't show up right next to 1983 (yes, we have some incidents that old in the dataset; I just clipped the graphic so it wouldn't be huge). So when this view came up, it filled in every year from 20134 down to 2014, and all of the years that actually had data were printed practically on top of each other, each one razor-thin.
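Here is a stripped-down reconstruction of that view, with toy data standing in for the real incidents (this is my sketch of the fill logic, not the exact plotting code), showing how a single typoed year blows up the axis:

```python
from collections import Counter
import matplotlib.pyplot as plt

# Toy data standing in for the real incidents; note the 20134 typo.
incidents = [
    {"timeline": {"incident": {"year": 2012}}},
    {"timeline": {"incident": {"year": 2013}}},
    {"timeline": {"incident": {"year": 2013}}},
    {"timeline": {"incident": {"year": 20134}}},  # the bad record
]

counts = Counter(i["timeline"]["incident"]["year"] for i in incidents)
years = range(min(counts), max(counts) + 1)  # fill in the missing years
totals = [counts.get(y, 0) for y in years]

plt.bar(years, totals)
plt.xlabel("Year")
plt.ylabel("Incidents")
plt.show()
# max(counts) is 20134, so the x-axis stretches across thousands of
# empty years and the real bars get squeezed into invisibility.
```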
So even though my data validation script had given the thumbs up to all of the data, a quick glance at some simple visualizations alerted me to the problem. It was an easy fix, and after I re-populated my database I got a picture much more in line with what I had expected.
Another example is seeing enumerations in a graphic that simply don’t belong there. Take a look at this graph of actor varieties in the VCDB.
One of the actor varieties shown is 'motive', which is not a valid enumeration in VERIS for actor variety (motive is a separate variable used to describe an actor). Looking in the dataset, I was able to find an incident that had been coded incorrectly and fix it, even though it had slipped past our other checks.
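One way to catch that class of mistake is to validate each value against the allowed list for that specific field, so a value that is legal elsewhere in VERIS still gets flagged. Here's a rough sketch; the enumeration list is abbreviated and illustrative, not the full VERIS list.

```python
import glob
import json

# Abbreviated, illustrative enumeration for actor.external.variety.
EXTERNAL_VARIETIES = {"Activist", "Organized crime", "Nation-state",
                      "Former employee", "Unaffiliated", "Unknown"}

for path in glob.glob("data/json/*.json"):
    with open(path) as f:
        incident = json.load(f)
    varieties = incident.get("actor", {}).get("external", {}).get("variety", [])
    for v in varieties:
        if v not in EXTERNAL_VARIETIES:
            # A motive like "Financial" is a valid value elsewhere in
            # VERIS, so a global enumeration check wouldn't flag it here.
            print(f"{path}: invalid actor variety {v!r}")
```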