Applied Data Science Case Study, Part 1: Assessing Business Value


The internet has a lot to say about data science. You can find courses on mathematics, programming, and machine learning. You can find reams of documentation on libraries for data munging and visualization.

What we don’t often find are explanations of how to combine these concepts into business value.  That’s what I’ll cover here, in this series of posts: I’ll walk you through a data science business problem from start to finish.

The Project: Provide a List of Leads for a Sales Campaign.

Let’s say your company makes continuing education courses for cardiologists. The sales team calls doctors to pitch them the courses, and the company earns money when the doctors pay to attend.

How will the sales team find the nation’s cardiologists? The U.S. Centers for Medicare & Medicaid Services publishes information online about medical providers. The team could use that data to contact every doctor in the U.S., but that would be inefficient.

To minimize the time they spend on sales calls (or, turning this around, to maximize their sales potential) they’ve called you, the data scientist, to develop a model to help them find leads.


Before we get too far …

As a data scientist, you may be tempted to take a request and dive straight into creating models. Before you do that, though, you’ll want to assess the situation and ask several questions about the business impact of the project.

Acceptable accuracy happens where the value that the model brings in outweighs the costs of any mistakes it makes (plus the cost of building it). We’d like to ship something when we reach this acceptable accuracy, and then iterate on it to increase accuracy for as long as the additional accuracy captures more value than it costs.

Given that, let’s start by consulting with our customer (the sales team), performing a brief cost/benefit analysis, understanding what happens when the model is wrong, and estimating response rates.


We know that we acquire one customer per 500 calls, and that 4 out of 5 customers are cardiologists. This means that, if the sales team calls 2,500 people, on average we will gain 5 customers, 4 of whom will be cardiologists. Let c be the number of cardiologists we have to call to get one sale to a cardiologist. Let n be the number of non-cardiologists we have to call to get one sale to a non-cardiologist. Those 2,500 calls produce 4 cardiologist sales and 1 non-cardiologist sale, so:

4c + n = 2500

Let’s assume that the distribution of specialties among the doctors our sales team has called matches the distribution of specialties in the U.S. physician population. With about 19,000 cardiologists in about a million physicians, we’re looking at roughly 1.9%. So the 4c calls that reached cardiologists and the n calls that reached non-cardiologists stand in a ratio of 1.9 to 98.1:

4c = 1.9/98.1 * n

When we solve that as a system of equations, we learn that

210.5c = 2500
c = 11.9
n = 2500 - 47.5 = 2452.5

This means that about 1 in 12 cardiologists our sales team calls will purchase at least one course. For non-cardiologists? About 1 in 2,450.
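
To double-check this kind of arithmetic, here is a minimal Python sketch that splits a batch of calls by specialty and divides each group's calls by its sales. The function name calls_per_sale and its parameter names are hypothetical, and the sketch relies on the same assumption as the text: the doctors the sales team called mirror the U.S. physician population.

```python
def calls_per_sale(total_calls, total_sales, cardiologist_sales_share,
                   cardiologist_call_share):
    """Return (calls per cardiologist sale, calls per non-cardiologist sale)."""
    # Split the calls by specialty, assuming the call list mirrors the
    # overall physician population.
    cardiologist_calls = total_calls * cardiologist_call_share
    other_calls = total_calls - cardiologist_calls

    # Split the sales by specialty.
    cardiologist_sales = total_sales * cardiologist_sales_share
    other_sales = total_sales - cardiologist_sales

    return cardiologist_calls / cardiologist_sales, other_calls / other_sales


# Inputs from the post: 2,500 calls yield 5 customers, 4 in 5 customers are
# cardiologists, and roughly 1.9% of U.S. physicians are cardiologists.
c, n = calls_per_sale(2500, 5, 4 / 5, 0.019)
print(f"1 sale per {c:.1f} cardiologists called")
print(f"1 sale per {n:.1f} non-cardiologists called")
```

Keeping the inputs as parameters makes it easy to rerun the estimate if the sales team reports a different close rate or customer mix.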

Remember also that two-year revenue averages $850 for a cardiologist and $500 for any other physician.

These numbers will help us figure out how to judge our model soon.
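
As a rough illustration of how these numbers combine, the sketch below estimates the average two-year revenue from a single call to each group. It assumes the revenue figures are per acquired customer, and it derives calls per sale from the inputs stated earlier (2,500 calls, 4 cardiologist sales, 1 other sale, 1.9% of calls reaching cardiologists). The function and variable names are hypothetical.

```python
def expected_revenue_per_call(revenue_per_customer, calls_per_sale):
    """Average two-year revenue generated by one call to this group."""
    return revenue_per_customer / calls_per_sale


# Calls needed per sale in each group, derived from the post's inputs.
total_calls = 2500
cardiologist_calls_per_sale = total_calls * 0.019 / 4  # 4 cardiologist sales
other_calls_per_sale = total_calls * 0.981 / 1         # 1 non-cardiologist sale

# Two-year revenue per customer, as quoted in the post.
cardiologist_value = expected_revenue_per_call(850, cardiologist_calls_per_sale)
other_value = expected_revenue_per_call(500, other_calls_per_sale)

print(f"Expected revenue per cardiologist call: ${cardiologist_value:.2f}")
print(f"Expected revenue per other-physician call: ${other_value:.2f}")
```

Comparing the two figures gives a first sense of how much a call to a correctly identified cardiologist is worth relative to a call to anyone else, which is the quantity a lead model is trying to improve.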

What is the value of us getting this right?

Getting this right means the sales team should close more deals in less time, which means the company earns more profit (because it spends less money generating revenue). That’s the rose-colored glasses version. But things can also go wrong:

False Positives and False Negatives. Image found on StackOverflow.

In the next installment, we’ll do an audit on the data we’re using before we dive into exploratory data analysis.
