API Support for World-Changing Datasets

We applaud the arrival of “world-changing” data-manipulation technologies. But where do we see those technologies applied? In sales, maybe . . . but rarely in the changing of worlds.

Companies have APIs all over the place to access data at the snap of a finger. Meanwhile, researchers hoping to access public data—especially in the social sciences—are much more likely to find themselves downloading a slightly out-of-date Excel spreadsheet from a website that, on a nonzero number of occasions, has for some reason been unavailable when they visited it.

A lot of public datasets deserve APIs. The Armed Conflict Location and Event Database is one of them. So I made an API and a webclient. Here’s how I did it and what they’re for:

ACLED documents conflict events from all over the world: where they happen, when they happen, who is involved, et cetera. That data is made available to the public via a set of large Microsoft Excel spreadsheets for download. It’s free, but it’s not updated in real time. What’s more, a researcher who wants the most up-to-date data doesn’t get it automatically; they have to go back and re-download the spreadsheet.

An API fixes this problem. APIs, or application programming interfaces (though nobody calls them that), make data available to programmers to use in their own applications. For example, with an ACLED API, someone could take this map on the homepage:

Screen Shot 2014-07-31 at 10.38.52 AM

and, instead of having to advertise when it last got updated, have it always update automatically by pulling the most recent data via the API. So each time it is viewed, it gets updated to what’s in the database as of that very minute.

Now, a caveat before we continue: the API I created is, for the moment, not the entirety of the data, because when I went to their website to get the data, this is what I saw:

Screen Shot 2014-07-18 at 1.14.08 PM

It was that way for several days while I was building the API. This kind of thing makes researchers not want to use this data, and so it doesn’t get as much publication and exposure as it otherwise would.

Luckily, I happened to have on hand a downloaded Excel file containing 1654 rows of ACLED Africa data from the beginning of this year. So I used that to build my API. I copied the data from the Excel spreadsheet and pasted it into a MySQL database. (PostgreSQL is a more popular option these days, but MySQL works with Sequel Pro, an application that makes it easy to import data from Excel, easier than its Postgres counterpart, PG Commander. Since this dataset was so big, having that application at my disposal made it worthwhile to use MySQL over its trendier counterpart.)
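For anyone who would rather script that import than copy and paste, here is a minimal sketch of the same idea. It is not what I actually ran: it assumes the spreadsheet has been exported to CSV, and it swaps in SQLite (which ships with Python) for MySQL so the example is self-contained. The table layout is an assumption, and the sample rows are invented placeholders rather than real ACLED records.

```python
import csv
import io
import sqlite3

# Stand-in for a CSV export of the spreadsheet. These rows are invented
# placeholders, not real ACLED records; the column set is an assumption.
SAMPLE_CSV = """event_date,event_type,actor1,latitude,longitude
1/2/2014,Example type A,Example actor A,1.5,2.5
1/3/2014,Example type B,Example actor B,3.5,4.5
"""


def load_events(conn, csv_file):
    """Create an events table and insert one row per CSV line; return the row count."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS events (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               event_date TEXT,
               event_type TEXT,
               actor1 TEXT,
               latitude REAL,
               longitude REAL
           )"""
    )
    reader = csv.DictReader(csv_file)
    rows = [
        (r["event_date"], r["event_type"], r["actor1"],
         float(r["latitude"]), float(r["longitude"]))
        for r in reader
    ]
    # Parameterized inserts keep odd characters in the data from breaking the SQL.
    conn.executemany(
        "INSERT INTO events (event_date, event_type, actor1, latitude, longitude) "
        "VALUES (?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
inserted = load_events(conn, io.StringIO(SAMPLE_CSV))
print(inserted, "rows imported")
```

With a real export, the `io.StringIO(SAMPLE_CSV)` argument would be replaced by an open file handle, and the column list would need to match the spreadsheet’s actual headers.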

An API makes all of the data in the database available to developers who make calls for it over the internet. When somebody calls the API’s events index, which covers all 1654 rows in the database, it returns them all in a list like this (only the first 3 shown):

Screen Shot 2014-07-22 at 11.26.06 PM

The format for the data you’re seeing is called JSON. You already know what HTML is: a way of determining the structure of a webpage by using tags on all the individual pieces. We’ll work our way back to the nature of JSON from there. HTML has tags like <h1> for headers, <p> for paragraphs, and <img> for images. At some point, someone decided that we should be able to make up our own tags, and so XML was born: the Extensible Markup Language, which lets people make tags like <event_date>, <event_type>, and <actor1>. So APIs used (and still often use) XML to indicate which columns all of the data belonged in. JSON does the same thing as XML for APIs: it indicates the attribute name (the name of the column it goes in) for each piece of data right next to the data itself. It’s just easier to read than XML. What you see up there, JSON, is much easier to read than <event_date> 1/2/2014 </event_date>.
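To make the comparison concrete, here is a small sketch that writes the same record both ways using Python’s standard library. Only the column names and the date come from this post; the other values are placeholders, and the <event> wrapper tag is my own assumption.

```python
import json
import xml.etree.ElementTree as ET

# One illustrative record. The column names and date appear in the post;
# the remaining values are placeholders.
event = {
    "event_date": "1/2/2014",
    "event_type": "Example type",
    "actor1": "Example actor",
}


def to_json(record):
    """Serialize a record as JSON: each value sits right next to its attribute name."""
    return json.dumps(record)


def to_xml(record, root_tag="event"):
    """Serialize the same record as XML: each value is wrapped in a pair of tags."""
    root = ET.Element(root_tag)
    for name, value in record.items():
        ET.SubElement(root, name).text = str(value)
    return ET.tostring(root, encoding="unicode")


print(to_json(event))
print(to_xml(event))
```

Both strings carry the same information. The JSON version names each attribute once, while the XML version names it twice, in an opening and a closing tag.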

In addition to calling the index, people can request an individual event by referring to its ID number. If someone wanted to see only the first event up there, they could append its ID number to the URL and get just that event.
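The index-versus-single-event behavior can be sketched as a tiny routing function. The URL paths `/events` and `/events/<id>` are assumptions for illustration, not necessarily the routes my API uses, and the in-memory dictionary stands in for the database.

```python
import json

# In-memory stand-in for the database table; the values are placeholders.
EVENTS = {
    1: {"id": 1, "event_date": "1/2/2014", "event_type": "Example type A"},
    2: {"id": 2, "event_date": "1/3/2014", "event_type": "Example type B"},
}


def handle_get(path):
    """Return (status_code, json_body) for a GET request to the given path."""
    parts = [p for p in path.split("/") if p]
    if parts == ["events"]:
        # Index: every event, as a JSON list.
        return 200, json.dumps(list(EVENTS.values()))
    if len(parts) == 2 and parts[0] == "events" and parts[1].isdecimal():
        # Single event: look it up by the ID number at the end of the URL.
        event = EVENTS.get(int(parts[1]))
        if event is not None:
            return 200, json.dumps(event)
    return 404, json.dumps({"error": "not found"})


print(handle_get("/events/1"))
```

A request for an ID that doesn’t exist falls through to the 404 response instead of returning an empty result, which makes mistakes easier for a client to detect.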

Of course, the data output isn’t very pretty. Making data pretty, drawing pictures and maps and graphs and charts with it, is the job of a web client that uses an API. For my fake 1654-row ACLED API, I made a web client, too. It doesn’t make fancy graphs or pictures right now, but it does put all of the data in a table.

Screen Shot 2014-07-31 at 1.13.43 PM

Still not gorgeous, but nonetheless easier to read. (It’s capable of displaying latitude and longitude to a higher degree of precision than that; the data just didn’t include it.)

And every time somebody goes here, the data will be up to date.
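The reason the table stays current is that the client rebuilds it from whatever JSON the API returns at the moment of the request. Here is a rough sketch of that rendering step, not the actual web client code; the function name and the choice of columns are hypothetical.

```python
import html
import json


def render_table(events_json, columns):
    """Turn a JSON list of event objects into an HTML table string."""
    events = json.loads(events_json)
    header = "".join(f"<th>{html.escape(c)}</th>" for c in columns)
    body_rows = []
    for event in events:
        # Escape every value so stray characters in the data can't break the page.
        cells = "".join(
            f"<td>{html.escape(str(event.get(c, '')))}</td>" for c in columns
        )
        body_rows.append(f"<tr>{cells}</tr>")
    return f"<table><tr>{header}</tr>{''.join(body_rows)}</table>"


# Placeholder data shaped like an API response.
sample = json.dumps([{"event_date": "1/2/2014", "latitude": 1.5, "longitude": 2.5}])
table = render_table(sample, ["event_date", "latitude", "longitude"])
print(table)
```

In a live client, `sample` would be the body of a fresh response from the events index, so nothing on the page is ever older than the last request.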

Now, there are restrictions on this data. Anyone can read the index of all the data and anyone can also read the info for a specific event, but not just anyone can add, update, or delete events. The last thing we need is the data becoming bloated with people’s Hatfield-McCoy stories.

Instead, right now, I can provide an API key to individuals who want to make changes to the data. Clearly, those keys should only be given to the people maintaining ACLED. I’m attempting to contact them and put this API in their hands so they can keep it updated by logging into the web client. If they respond, then the next step is to stick in all the datasets they already have, then demonstrate how to add new events via a form on the backend.
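The read-for-everyone, write-with-a-key rule can be sketched in a few lines. This is an illustration under assumptions rather than the API’s actual code: the key value, the set of write methods, and the function name are all hypothetical.

```python
import hmac

# Hypothetical key store. A real deployment would keep keys out of source code.
VALID_API_KEYS = {"example-key-for-maintainers"}
READ_METHODS = {"GET"}
WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


def is_authorized(method, api_key=None):
    """Allow reads for everyone; allow writes only with a known API key."""
    method = method.upper()
    if method in READ_METHODS:
        return True
    if method in WRITE_METHODS and api_key is not None:
        # compare_digest avoids leaking key contents through timing differences.
        return any(
            hmac.compare_digest(api_key.encode(), known.encode())
            for known in VALID_API_KEYS
        )
    return False


print(is_authorized("GET"))
print(is_authorized("DELETE"))
print(is_authorized("DELETE", "example-key-for-maintainers"))
```

Anything that is neither a recognized read nor a recognized write is refused by default, so a new kind of request has to be allowed on purpose.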

Hopefully, though, if ACLED can take over the maintenance of this API, researchers will have a really easy time manipulating, watching, and using this data. And who knows? One of their research findings might just change the world.
