In this series, we’re walking through a data science business problem from start to finish. We introduced the business case in the last post: a list of leads for our sales team at a company that makes continuing education courses for cardiologists.
Now that we have assessed the business value, let’s talk about data.
Where can we get the data about potential leads?
The U.S. Centers for Medicare & Medicaid Services (CMS) publishes information online about medical providers. We could use that to give the sales team a list of every doctor in the U.S.
That’s not going to maximize sales, though: most doctors don’t need this particular kind of continuing education course. We need to identify those doctors who are the best fit for these courses: namely, cardiologists.
We could instead use a list of every doctor in the U.S. whose specialty appears in the National Plan and Provider Enumeration System (NPPES) data as ‘Cardiology.’
That’s a start, but the sales team wants to go further. The Medicare Provider Utilization Data contains procedure records for almost a million providers, each identified by a National Provider Identifier (NPI). But NPPES doesn’t list a specialty for every NPI that appears in the utilization data. We’d like to find any cardiologists among those unlabeled NPIs.
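To make that concrete, here’s a minimal sketch of how we might separate the providers NPPES already labels from the unlabeled ones. The file names and column names (npi, specialty) are placeholders I’m assuming for illustration; the real CMS downloads use their own naming.

```python
import pandas as pd

# Assumed file and column names, for illustration only; the real CMS
# files use their own schemas.
utilization = pd.read_csv("medicare_utilization.csv", dtype={"npi": str})
specialties = pd.read_csv("nppes_specialties.csv", dtype={"npi": str})

# Attach each provider's NPPES specialty (where one exists) to their
# utilization records.
providers = utilization.merge(specialties, on="npi", how="left")

# Providers NPPES already labels as cardiologists.
known_cardiologists = providers[providers["specialty"] == "Cardiology"]

# Providers with utilization records but no NPPES specialty: the ones
# we want to classify from their procedure history.
unlabeled = providers[providers["specialty"].isna()]
```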
We’ll do an exploratory analysis of this data in the next post, but first we need to ask some questions about it.
Are we allowed to use this data?
This is important because we can get sued if we use this data in ways we’re not supposed to.
When we download the data from the CMS website, it comes with this license agreement outlining how to use it. Many companies have a separate legal team that reviews licenses.
If we were really doing this business case for a client, I would check this license more closely with someone who has legal expertise. Since this is a non-commercial demonstration of the data science process for my personal use, we’ll leave this point at “check the license” and move on to the next question.
Does the data have the fields we need?
It does. We have the NPI, name, gender, contact information, and sometimes specialty of these doctors. We also have records of the Medicare-covered procedures they have performed, which we can use to guess the specialty where that information is missing.
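As a rough sketch of what “guess the specialty from procedures” could look like, we might pivot each provider’s procedure codes into count features for a classifier to learn from. The column names here (hcpcs_code, line_srvc_cnt) are assumptions standing in for the real utilization schema.

```python
import pandas as pd

# Illustrative column names; the real utilization file has its own schema.
utilization = pd.read_csv("medicare_utilization.csv", dtype={"npi": str})

# One row per provider, one column per procedure code, each cell counting
# how many times that provider billed that procedure. These counts are the
# kind of features a classifier could use to spot cardiology-like patterns.
procedure_features = utilization.pivot_table(
    index="npi",
    columns="hcpcs_code",
    values="line_srvc_cnt",
    aggfunc="sum",
    fill_value=0,
)
```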
In fact, we’ll need to remove some of these fields to conscientiously use this dataset. That brings us to the next question.
How can we use this data and protect people’s privacy?
The data contains the doctors’ contact information, including address and phone number. We need the phone numbers so that the sales team can call the doctors. However, we will remove the address before we start our analysis. Our customer (the sales team) doesn’t need it, and we won’t disseminate anyone’s physical location unless we absolutely have to.
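Here’s a minimal sketch of that scrubbing step, again assuming illustrative file and column names rather than the actual CMS schema.

```python
import pandas as pd

# The merged provider table from the earlier sketch, saved under an
# assumed file name.
providers = pd.read_csv("providers_merged.csv", dtype={"npi": str})

# Drop physical-location fields before any analysis; the sales team
# doesn't need them and we don't want to pass addresses around.
address_columns = ["street_address_1", "street_address_2", "city", "state"]
providers = providers.drop(columns=address_columns, errors="ignore")

# Phone numbers stay: the sales team needs them to make calls.
```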
What sorts of biases and ethical concerns may arise?
A big risk area is demographic factors. We want to check two things:
1. Do we have demographic information in our data? YES. We have gender, which is a demographic attribute. Let’s remove this.
2. Do we have any PROXIES for demographic information? We have zip codes and names, both of which correlate with race. Zip code also stratifies by income, and cardiology is a high-income specialty, so we might see an effect there if we included it. I’m not going to include it, because I’m picturing this situation: suppose two people are alike in every other way related to their likelihood of being a cardiologist, but they live in different neighborhoods. Do I want the sales team to call one before the other based on that? I don’t. Ideally, the sales team calls both of them. So I want to build a list of prospects that doesn’t prioritize based on zip code.
The sales team needs people’s names in order to call them, so I’ll keep the names in the data set. But I will not use them as features to train our model.
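In code, that separation might look something like this sketch, again with assumed column names: the call list keeps names and phone numbers for the sales team, while the feature table the model trains on drops gender, zip code, and names entirely.

```python
import pandas as pd

# Provider table with addresses already removed, under an assumed file name.
providers = pd.read_csv("providers_scrubbed.csv", dtype={"npi": str})

# Fields the sales team needs to make calls, which the model never sees.
call_list = providers[["npi", "last_name", "first_name", "phone"]]

# Demographic attributes and their proxies stay out of the feature set:
# gender (demographic), zip code and names (proxies for race and income).
excluded = ["gender", "zip_code", "last_name", "first_name", "phone"]
features = providers.drop(columns=excluded, errors="ignore")
```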
We are also classifying on medical specialty here, which correlates with ethnicity—or at least, we know it did before 2010, and it’s highly probable that it still does. Does this mean we have a problem?
That depends on what we’re building. Demographic data poses the highest risk where we are deciding either:
- how people are portrayed (in an ad, article, white paper, or educational material, for example), or
- whether people get access to a resource (loans, insurance, or safety, for example).
We’re marketing a cardiology-related course. That’s a very specific product predicated on cardiologist customers. So maybe we don’t have to contend with the bias baked into who becomes a cardiologist in the context of building this model.
Suppose that, instead, we were marketing home loans aimed at luxury home buyers because we knew cardiologists make a lot of money. In that case it would be ethically dodgy to offer different access to resources based on a variable that we know for a fact correlates with race. Home loans in particular also happen to be legally protected precisely because of historical and contemporary racist lending practices.
Chelsea, when do we get to the actual data analysis?
We get to it in the next post. But before we do, I want to say one thing about why we bothered with all these preemptive assessments.
These are Canadian iron rings.

When Canadian engineering students graduate, they receive iron rings in a ceremony called The Ritual of the Calling of an Engineer. The ceremony started after the original Quebec Bridge collapsed during construction. The collapse, attributed to poor planning by the engineers, killed seventy-five construction workers.
The rings remind Canadian engineers of their ethical obligation to conduct their work with the utmost rigor—that they accept a duty of care to everyone and everything their product impacts. The ring is worn on the pinky finger of the engineer’s drafting hand so that it drags along the paper as they draw—a constant reminder of who their work affects.
In data and software engineering, we have no such ring. We should. Because the things we make impact people’s lives.
Examining bias in my work is a critical part of my commitment to rigor, to my duty of care.
I hope you feel the same way.
Conclusion
We have found a data source! We have asked some questions about that data:
- Are we allowed to use it?
- Do we have the fields we need?
- How can we protect people’s privacy while using this data?
- What biases and ethical concerns arise with this data?
With those questions answered, we next move on to some exploratory analysis of our data set.
If you liked this piece, you might also like:
This piece about the duty of senior technologists to assess the impact of their work
This piece, in which we dive into the internals of the Scipy CSR Matrix
This piece about testing your code based on the biggest risk factors