So your boss hands you a database and tells you to come back with dataviz on what it contains. This may be a straightforward task, but what if the database is not formatted the way you expected? What if it contains unnecessary fields?
Sounds like you may need to massage the data. The ZingChart team recently performed some data massaging and has some helpful tips on the process.
What is Data Massaging?
No candles or oils needed for this type of massage.
The term data massaging, also sometimes called “data cleansing” or “data scrubbing,” may sound a little naughty. But it is commonly used to describe the process of extracting data to remove unneeded information or otherwise clean up a dataset to get it into a useable format.
Common tasks include stripping unwanted characters and whitespace, converting number and date values into desired formats, and organizing data into a meaningful structure.
So much raw data.
Unfortunately, there actually is such a thing as naughty data massaging. The actual term “data massaging” is often also associated with the practice of “cherry-picking”, or selectively excluding or altering certain data based on what you want (or don’t want) it to reflect, in turn changing what the final visualization communicates to your audience.
The ZingChart team encourages you to avoid this practice at all costs. Improvements can’t be made or questions answered when data is incomplete or skewed.
Apply Massaging Techniques to Your Dataset
Our example is taken from the treemap we made for our recent review of the book Visual Insights. We needed a substantial dataset to work with and ended up choosing the CDC FOOD tool, which exports data in XML.
It contains 232,062 lines of XML that we need to extract data from to make a ZingChart treemap that shows outbreaks by state, then by year, then by genus. Where do we start?
A good question to ask at the start is, “How can I work smarter, not harder?” And one answer to that question is CodeBeautify. This app will convert just about anything to anything else, including XML to JSON. Simply:
-
Paste your XML in (yes, all 232,062 lines of it)
-
Set it to tree format for easy reading
-
Review your newly formatted JSON, ready to traverse
![JSON](https://blog.zingc![JSON](https://blog.zingc
Sweet, sweet JSON
Now, the nature of this dataset allowed us to set up skeleton JSON that we could populate as we parsed. We knew exactly how many states and years our treemap would hold, so we set up the file with empty children arrays for each year in each state. When we are finished, these children arrays will contain key/value pairs of outbreak numbers per genus.
![data arrays](https://blo![data arrays](https://blo
With this structure in place, we can start parsing. Pseudocode for this operation can be written as follows:
-
Loop through each record in the data set
-
process the genus with regex to match only first full word (no species)
-
traverse the treemap skeleton to find current record’s state
-
traverse that state to find current record’s year
-
if (genus exists in state’s year)
-
increment value
-
else
-
push genus to children array and set value to 1
There are three pieces of data we need from each record: state, year, and genus. The state and year serve as references - we use them to traverse the JSON of the treemap skeleton and find where we should be adding or updating the genus from the current record.
The massaging comes into play with the state and genus values. We only wanted actual states and our data contains records for multistate outbreaks, so we did a simple value check:
if(sState != "Multistate"){
//continue processing record
}
As for genus, we had a couple changes to make. First, there were more than a few records with Genus_Species listed as an empty string. Empty strings aren’t helpful in any dataviz, so we replaced empty genus values with “Undeclared”.
if (currObj["Genus_Species"] == "") {
genus = "Undeclared";
}
Lastly, we only wanted the genus, as using both the genus and the species would take up too much space in our treemap, which is already set to be quite large. For this, we wrote a regex to capture just the first full word up to whitespace or comma, but including periods (for example, E.coli).
var genusReg = /^(\w+\.?\w+)/;
With each piece of information saved, we then traversed our skeleton and inserted or updated the genus as necessary. Here is our full loop:
Helpful Tools for Working With Data
There is another tool at work here besides CodeBeautify. Did you notice something in our for loop? Wonder what _.findIndex() is? That’s a utility function from the library lodash.js. Lodash offers quite a few useful functions for dealing with data in JavaScript, including but not limited to:
-
_.chunk for splitting an array into chunks of a given size
-
_.findKey for finding keys
-
_.inRange for checking whether a number is within a given range
-
_.sortByAll & _.sortByOrder for sorting by multiple properties & orders
If you’re unfamiliar with lodash, you might recognize Underscore.js, its predecessor. Utility libraries like these make traversing and massaging large datasets a much more manageable task.
Share Your Data Massaging Best Practices
Have you found a workflow or online tools that help with data massaging at your organization? Want to comment on the examples here? Share your thoughts with us below.