In Part 1 of Diagramming Data, I talked about the relationship between code simplicity and the underlying data. I touched on some of the issues that organizations face with their data. Now, we’ll categorize those issues according to how easily they can be fixed.
We can take a sampling of data-related issues and place them on a continuum from easily reversible to irreversible. Some data issues can be fixed with relative ease. Others cannot be fixed without re-collecting all the data. There are also cases in the middle where we can fix issues with the data by writing complex code.
An example of a very fixable data-related issue: we want to show an attribute in a list, but that attribute only comes back in the detail call, and we control the server that provides this data. We discussed this situation in Part 1: if we control the server, we can add the desired piece of data to the list call. In fact, this situation is reversible enough that, if your developers are pushing back on putting an attribute in a list because it only comes back in the detail call, the problem is almost certainly that they don’t have control over the server supplying the data.
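To make the fixable case concrete, here is a minimal sketch of a server that builds its list payload from a configurable set of fields. The `Doctor` record, its fields, and the function names are all hypothetical, chosen only for illustration; a real server would differ.

```python
from dataclasses import dataclass, asdict


@dataclass
class Doctor:
    # Hypothetical record; the fields are assumptions for this sketch.
    id: int
    name: str
    specialty: str
    phone: str


# Fields the (hypothetical) list call returns today.
LIST_FIELDS = ("id", "name")


def detail_payload(doctor: Doctor) -> dict:
    """Everything the detail call returns for one doctor."""
    return asdict(doctor)


def list_payload(doctors, fields=LIST_FIELDS) -> list:
    """A trimmed-down payload for each doctor in the list call."""
    return [{f: getattr(d, f) for f in fields} for d in doctors]


doctors = [Doctor(1, "A. Rivera", "Cardiology", "555-0100")]

# Before the change: the list call omits the specialty.
before = list_payload(doctors)

# After the change: one extra field name on the server side.
after = list_payload(doctors, LIST_FIELDS + ("specialty",))
```

When we own the server, the fix is roughly this small. When we don’t, the client is left making a detail call per row or doing without the attribute.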
And then there are data situations that are untenable without re-collecting the data. Say you have a list of auto repair facilities, but none of them have recorded addresses. Without asking them all to enter this information, there’s no way to get it.
Then you have situations in the middle, where the data is spotty. Say you have a list of doctors, but only 70% of them have listed their genders. This information matters to users: some patients, for example, want to make sure they see a female OB-GYN for reasons of personal comfort. You could go back and ask the doctors. Or you could designate a default gender and list everyone else as that (a poor solution). Or you could come up with a way to indicate to users that you don’t know the gender of a given doctor, and perhaps encourage them to call the doctor’s office and ask.
These are examples of solutions you might use to mitigate absent or spotty data. To handle myriad cases like this, code has to be more complex. That’s one of the tradeoffs associated with presenting a data-rich UI when the data doesn’t necessarily provide all the richness we would like.
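As a rough illustration of that extra complexity, the sketch below treats a doctor’s gender as a value that may be missing. It shows an explicit “unknown” label instead of a default, and it keeps doctors with no recorded gender in a separate group when a user filters. The dictionary shape, the label text, and the function names are assumptions made for this example.

```python
from typing import Optional

# Hypothetical wording for the "we don't know" case.
UNKNOWN_LABEL = "Not listed (call the office to ask)"


def gender_label(gender: Optional[str]) -> str:
    """Return display text, never guessing when the value is missing."""
    if gender is None or not gender.strip():
        return UNKNOWN_LABEL
    return gender.strip().capitalize()


def split_by_gender(doctors, wanted: str):
    """Separate doctors who match the filter from those with no gender on file."""
    wanted = wanted.strip().lower()
    matches, unknown = [], []
    for doctor in doctors:
        gender = (doctor.get("gender") or "").strip().lower()
        if not gender:
            unknown.append(doctor)
        elif gender == wanted:
            matches.append(doctor)
    return matches, unknown


doctors = [
    {"name": "Dr. Okafor", "gender": "female"},
    {"name": "Dr. Lindqvist", "gender": None},
    {"name": "Dr. Moreau", "gender": "male"},
]
matches, unknown = split_by_gender(doctors, "Female")
```

Every screen that touches this field now needs both branches, which is exactly the kind of complexity that complete data would have spared us.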
In the most extreme case, we can use data that we do have to try to guess at the data we don’t have. In the doctor example, we could (though I do not recommend this solution) find a way to use the other things we know about the doctor (like their name) to guess their gender.
This is a simplified example. But more complex examples of missing data represent the impetus for an entire industry within software engineering. Predictive analytics and intelligent applications exist precisely so users can glean information from the data we do have to make educated guesses at the data we don’t have.
It’s important to note that a lot of missing information and misinformation in a dataset results from a data collection process that doesn’t line up well with the information it’s trying to collect. Imagine, for example, that the ‘gender’ specifier on the doctor entry form is a radio button that says ‘male’ and nothing else. This, again, is a simplified example, but we see versions of it on forms all the time. The name or address field may be too short, so people condense the information or put the overflow in another field. Or the phone number field only allows six characters (this has absolutely happened). Or the form forces someone to enter a middle name even if they do not have one.
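One way to picture a collection process that lines up better with its data is a validation routine whose rules are deliberately permissive where reality is varied. The sketch below is only an illustration: the field names, the list of gender options, and the 7 to 15 digit phone range are assumptions, not rules from any particular system.

```python
# Hypothetical option list; a real form would choose its own.
GENDER_OPTIONS = ("female", "male", "another", "prefer not to say")


def validate_signup(form: dict) -> list:
    """Return a list of error messages; an empty list means the form is acceptable."""
    errors = []

    if not (form.get("first_name") or "").strip():
        errors.append("first_name is required")

    # Middle name is optional on purpose: not everyone has one,
    # so there is no check for it here.

    # Count digits only, so spaces, dashes, and parentheses are allowed.
    digits = "".join(ch for ch in (form.get("phone") or "") if ch.isdigit())
    if not 7 <= len(digits) <= 15:
        errors.append("phone must contain between 7 and 15 digits")

    if form.get("gender") not in GENDER_OPTIONS:
        errors.append("gender must be one of the listed options")

    return errors


ok = validate_signup(
    {"first_name": "Ana", "phone": "(555) 010-0100", "gender": "female"}
)
bad = validate_signup({"first_name": "", "phone": "555010", "gender": "female"})
```

The point is less the specific rules than the direction of the errors: a form that rejects valid answers pushes people to enter wrong ones.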
Another scenario that arises for enterprises: a user enters data through multiple entry points, and the entries do not agree with one another. A user enters a phone number through a new member portal. Then they receive a welcome email asking them to fill out more profile details, and that form has a phone number field that is not connected to the first one. They enter a different number. Later they sign up for a new plan or product and get asked for their address. They enter their current address, which differs from the address on the new member form because they moved recently. This happens often when different vendors handle different parts of the user engagement process for a given company, and each vendor has its own system.
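Before anyone can resolve inconsistencies like these, something has to notice them. The sketch below gathers the phone numbers a user entered through different entry points, normalizes them so that formatting differences don’t count as disagreements, and reports whether the sources conflict. Preferring the most recent entry is just one possible policy, assumed here for illustration, and the `Entry` record and source names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Entry:
    source: str       # hypothetical entry point, e.g. "member_portal"
    value: str        # the phone number as the user typed it
    entered_on: date  # when the user entered it


def normalize_phone(raw: str) -> str:
    """Keep digits only, so '(555) 010-0100' and '555.010.0100' compare equal."""
    return "".join(ch for ch in raw if ch.isdigit())


def reconcile(entries):
    """Return (preferred_value, has_conflict) for one user's entries."""
    if not entries:
        return None, False
    distinct = {normalize_phone(e.value) for e in entries}
    latest = max(entries, key=lambda e: e.entered_on)
    return normalize_phone(latest.value), len(distinct) > 1


entries = [
    Entry("member_portal", "(555) 010-0100", date(2023, 1, 5)),
    Entry("welcome_email", "555-010-0199", date(2023, 2, 1)),
]
preferred, has_conflict = reconcile(entries)
```

Flagging the conflict rather than silently overwriting leaves room for a person, or the user, to confirm which value is right.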
Now we have talked about what constitutes a data issue, acquainted ourselves with the relative reversibility of a few data issues, and discussed how those data issues might arise. In the next installment, we’ll talk about how to hedge against some of those data issues.