In part 1 of this series, we discussed the relationship between data and code complexity. In part 2, we talked about some of the deficiencies datasets might have and how they happen.
Now we’ll talk about some starting points for building healthy datasets—and nursing deficient datasets to health as much as possible. It’s important to note that these starting points apply chiefly to datasets obtained via human data entry—generally via a form.
1. Keep an open connection with your data sources.
While you’re collecting data from your sources, be sure to collect their contact information, too. That way, if you need more information from them later or you especially need a particular field that they did not fill in, you can reach out to them directly and recover some of the information that the initial data collection missed.
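As a rough illustration, a short script can pair each incomplete record with the contact details collected alongside it. This is only a sketch: the field names (`name`, `organization`, `start_date`, `contact_email`) and the sample records are hypothetical, and a real form would have its own list of fields that matter.

```python
# Hypothetical list of fields we would want to follow up on if left blank.
REQUIRED_FIELDS = ("name", "organization", "start_date")


def follow_ups(records):
    """Return (contact_email, missing_fields) pairs for incomplete records."""
    todo = []
    for record in records:
        # Treat absent, None, and whitespace-only values as missing.
        missing = [
            field
            for field in REQUIRED_FIELDS
            if not str(record.get(field) or "").strip()
        ]
        # We can only follow up if contact information was collected.
        if missing and record.get("contact_email"):
            todo.append((record["contact_email"], missing))
    return todo


# Made-up sample data for illustration only.
records = [
    {"name": "Ada", "organization": "", "start_date": "2020-01-15",
     "contact_email": "ada@example.com"},
    {"name": "Lin", "organization": "Acme", "start_date": "2021-03-02",
     "contact_email": "lin@example.com"},
]

for email, missing in follow_ups(records):
    print(f"Ask {email} about: {', '.join(missing)}")
```

Records with no contact information are skipped here, which is exactly the situation that collecting contact details up front is meant to avoid.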
2. Avoid placing restrictions on form fields.
Often, human-entered data looks wrong or misleading because users were forced to enter it under constraints that did not make sense. An extreme example of this is a phone number field that literally could not collect accurate data because it only allowed users to enter six characters. We often see more sophisticated examples of the same types of mistakes when a homogeneous group of people designs a form to be used by a heterogeneous group of people. For example, Americans assume that it makes sense to require the zip code field on a form. But if it’s on a form for an international audience, that doesn’t make sense because many countries do not use zip codes. Or designers will make a form that demands a first, middle, and last name with minimum length requirements for each, despite the fact that some people don’t have middle names, some people have more than three names, and some people have two-letter last names. These data enterers are forced to put inaccurate information into forms.
Solution: trust humans over forms. Leave free-form text boxes wherever possible and allow people to write in their own names, numbers, and addresses the way that makes sense for them. Correct data, inconsistently formatted, may require some complex software to use. But incorrect data is impossible to use, and correcting it is a much harder problem in software engineering than figuring out data that comes in different formats.
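To give a sense of the "complex software" side of that tradeoff, here is a minimal sketch of cleaning up phone numbers that were typed into a free-form text box. The function name `normalize_phone` and the sample entries are made up for illustration, and the rule it applies (keep the digits, preserve a leading `+`) is an assumption that would not cover every case, such as extensions.

```python
import re
from typing import Optional


def normalize_phone(raw: str) -> Optional[str]:
    """Reduce a free-form phone entry to its digits, keeping a leading '+'.

    Returns None when the entry contains no digits, so the record can be
    flagged for follow-up instead of being guessed at.
    """
    text = raw.strip()
    digits = re.sub(r"\D", "", text)  # drop everything that is not a digit
    if not digits:
        return None
    prefix = "+" if text.startswith("+") else ""
    return prefix + digits


# Made-up entries showing the kinds of formats people naturally use.
entries = ["(555) 010-2368", "+44 20 7946 0958", "555.010.2368", "n/a"]

for entry in entries:
    print(repr(entry), "->", normalize_phone(entry))
```

A reasonable design choice is to store the normalized value next to the original entry rather than in place of it, so that nothing the person actually typed is lost.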
3. Establish relationships between the owners of different datasets.
You want pieces of data A and B, but A is owned by one department and B is owned by another. The two departments rarely speak, and neither is aware of the relationship their data has to the data owned by the other department. It’s a common problem in large organizations, and it’s a problem many organizations attempt to solve with complex data-collating software.
These types of problems can become easier to solve if the owners of different datasets are aware of each other. By forming relationships between different data owners within your company, you open the door for more coordination in the data collection and data analysis process. You make composite datasets easier to create and maintain. And the different owners may even come up with new ways to use their combined data for the benefit of the company.
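Once two owners agree on a shared identifier, building the composite dataset can be fairly simple. The sketch below assumes two hypothetical departments (sales and support) that both record a `customer_id`; the function name `combine` and all of the sample rows are invented for illustration.

```python
def combine(dataset_a, dataset_b, key):
    """Merge two lists of dicts on a shared key.

    Returns the merged rows plus the key values that appear in only one
    dataset, which are good candidates for a conversation between owners.
    Assumes each key value appears at most once per dataset.
    """
    a_by_key = {row[key]: row for row in dataset_a}
    b_by_key = {row[key]: row for row in dataset_b}

    shared = a_by_key.keys() & b_by_key.keys()
    merged = [{**a_by_key[k], **b_by_key[k]} for k in sorted(shared)]

    only_a = sorted(a_by_key.keys() - b_by_key.keys())
    only_b = sorted(b_by_key.keys() - a_by_key.keys())
    return merged, only_a, only_b


# Made-up data owned by two different departments.
sales = [{"customer_id": 1, "total": 250}, {"customer_id": 2, "total": 90}]
support = [{"customer_id": 2, "tickets": 3}, {"customer_id": 3, "tickets": 1}]

merged, only_sales, only_support = combine(sales, support, "customer_id")
print(merged)
print("Only in sales:", only_sales)
print("Only in support:", only_support)
```

The two "only in" lists are the useful part for the people involved: they show where one department holds records the other has never seen.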
These are only starting points for building healthy datasets, but they illustrate an important point: data deficiencies often come from business oversights, usually failures to establish relationships with data entrants, with customer bases, and between teams within an organization. By working to build better communication channels within an organization, we can build and repair datasets that give us strong, rich data that we might even be able to leverage with simpler software.