The internet has a lot to say about data science. You can find courses on mathematics, programming, and machine learning. You can find reams of documentation on libraries for data munging and visualization.

What we *don’t* often find are explanations of how to combine these concepts into business value. That’s what I’ll cover here, in this series of posts: I’ll walk you through a data science business problem from start to finish.

### The Project: Provide a List of Leads for a Sales Campaign.

Let’s say your company makes continuing education courses for cardiologists. The sales team calls doctors to pitch them the courses, and the company earns money when the doctors pay to attend.

How will they find the nation’s cardiologists? The U.S. Centers for Medicare & Medicaid Services publishes information online about medical providers. So they *could* use that to contact every doctor in the U.S., but that would be inefficient.

To minimize the time they spend on sales calls (or, turning this around, to maximize their sales potential) they’ve called you, the data scientist, to develop a model to help them find leads.

### Before we get too far …

We need to know which *kinds* of wrong affect our clients, and by how much. Without that, we *cannot* gauge the minimum viable accuracy our model needs to achieve to create value.

Acceptable accuracy happens where the value that the model brings in outweighs the costs of any mistakes it makes (plus the cost of building it). We’d like to ship something when we reach this acceptable accuracy, and then iterate on it to increase accuracy for as long as the additional accuracy captures more value than it costs.
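As a rough sketch of that shipping rule: the function below nets the value of correct leads against the cost of mistakes and the cost of building the model. The function names and every number in the example call are hypothetical placeholders, not figures from the sales team.

```python
def net_value(true_positives, false_positives, false_negatives,
              value_per_tp, cost_per_fp, cost_per_fn, build_cost):
    """Value captured by the model minus the cost of its mistakes and of building it."""
    gained = true_positives * value_per_tp
    lost = false_positives * cost_per_fp + false_negatives * cost_per_fn
    return gained - lost - build_cost


def worth_shipping(**counts_and_costs):
    """The model reaches 'acceptable accuracy' once its net value is positive."""
    return net_value(**counts_and_costs) > 0


# Hypothetical numbers, for illustration only.
example = dict(true_positives=40, false_positives=10, false_negatives=5,
               value_per_tp=70, cost_per_fp=5, cost_per_fn=70, build_cost=1000)
print(net_value(**example), worth_shipping(**example))
```

The same comparison works for each later iteration: keep improving the model only while the extra net value exceeds the extra build cost.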

### What else should the sales team share with us?

Additional questions for the sales team:

**Question:** What proportion of these physicians have we already contacted?

**Why we ask:** If we know the physicians, we might be able to *ask* them their specialty and get a much more complete, correct dataset without investing in software to get the answer. Also, if we have contacted them before, we may not need to overcome customer acquisition costs with them the way we would for a cold contact.

**Answer:** Since we’re building a list for cold-calling, we’ll assume for now that we *do not* already know these physicians and cannot ask them their specialties until we’re on the phone with them.

**Question:** What’s the customer acquisition cost for both cardiologists and non-cardiologists?

**Why we ask:** This tells us how much we spend to acquire each type of customer, so we can determine the cost associated with mis-targeting in either direction.

**Answer:** The sales team will cold-call the prospects, and they have some data on the effectiveness of their work so far. The sales team knows that about 1 out of every 500 physicians they contact ends up purchasing at least one course. They usually don’t narrow their cold calls by specialty: they call any doctor. From preparation to hangup, these calls take an average of 4.8 minutes.

**Question:** What’s the customer lifetime value of both cardiologists and non-cardiologists?

**Why we ask:** This helps us determine how much money we lose if our mis-targeting in either direction loses us a customer.

**Answer:** The company has been around for 2 years. Among the customers we have, the sales team knows that 80% of the doctors who buy at least one course are cardiologists. When we split the customers into cardiologists and non-cardiologists, we see that the non-cardiologists have purchased an average of 1.03 courses over the time we have known them (usually just one, on rare occasions 2, and in a few outstanding cases more than that). The cardiologists have purchased an average of 1.7 courses (it is much more common for them to purchase more than one).

**Question:** What’s the purchase value of a course?

**Why we ask:** The higher this value, the higher the opportunity cost of a false negative.

**Answer:** The courses average $1000 apiece. We don’t have enough information to extrapolate customer lifetime value with high confidence because we don’t know if doctors will purchase our courses for decades or be finished after, at most, a few. For now, though, if we spread the average purchases over the company’s two years, let’s say a non-cardiologist pays us roughly $500 per year and a cardiologist pays us roughly $850 per year.
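Here is a minimal sketch of where those per-year figures come from. It assumes each customer’s purchases are spread evenly over the company’s full two years; that is a simplification, since many customers have been known for less time. The constant and function names are illustrative.

```python
COURSE_PRICE = 1000      # average price per course, in dollars
YEARS_IN_BUSINESS = 2    # upper bound on how long we have known any customer

# Average number of courses purchased per customer, from the sales team.
AVG_COURSES = {"cardiologist": 1.7, "non_cardiologist": 1.03}


def annual_revenue(avg_courses, price=COURSE_PRICE, years=YEARS_IN_BUSINESS):
    """Rough revenue per customer per year, assuming purchases spread evenly over `years`."""
    return avg_courses * price / years


revenue_per_year = {kind: annual_revenue(n) for kind, n in AVG_COURSES.items()}
for kind, dollars in revenue_per_year.items():
    print(f"{kind}: about ${dollars:.0f} per year")
```

The non-cardiologist figure comes out slightly above $500; the post rounds it down for convenience.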

### Calculations

We know that we acquire one customer per 500 calls, and that 4 out of 5 customers are cardiologists. This means that, if the sales team calls 2500 people, on average we will gain 5 customers, 4 of whom will be cardiologists. Let c be the number of cardiologists we have to call to get one sale. Let n be the number of non-cardiologists we have to call to get one sale. Four cardiologist sales plus one non-cardiologist sale account for all 2500 calls:

`4c + n = 2500`

Let’s assume that the distribution of specialties among the doctors our sales team has called matches the distribution of specialties in the U.S. physician population. With about 19,000 cardiologists in about a million physicians, we’re looking at roughly 1.9%. So the 4c cardiologists and the n non-cardiologists among those 2500 calls stand in a ratio of 1.9 to 98.1:

`4c = 1.9/98.1 * n`

When we solve that as a system of equations, we learn that

`(1.9/98.1 + 1) * n = 2500`

`n = 2452.5`

`c = (2500 - 2452.5) / 4 = 11.9`

This means that about 1 in every 12 cardiologists our sales team calls will purchase at least one course. For non-cardiologists? About 1 in 2,453.
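The arithmetic can be checked directly from the three stated inputs: one sale per 500 calls, 80% of buyers being cardiologists, and 1.9% of called physicians being cardiologists. This is a minimal sketch under those assumptions; the function and parameter names are illustrative.

```python
def calls_per_sale(total_calls=2500, total_sales=5,
                   cardio_share_of_sales=0.8, cardio_share_of_calls=0.019):
    """Return (c, n): calls needed per sale for cardiologists and for everyone else.

    Assumes the doctors called mirror the U.S. physician population,
    so about 1.9% of calls reach a cardiologist.
    """
    cardio_calls = total_calls * cardio_share_of_calls
    other_calls = total_calls - cardio_calls
    cardio_sales = total_sales * cardio_share_of_sales
    other_sales = total_sales - cardio_sales
    return cardio_calls / cardio_sales, other_calls / other_sales


c, n = calls_per_sale()
print(f"cardiologists: 1 sale per {c:.1f} calls")
print(f"non-cardiologists: 1 sale per {n:.1f} calls")
```

Changing `cardio_share_of_calls` is a quick way to see how sensitive the result is to the assumption about who the sales team has actually been calling.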

Remember also that annual revenue averages roughly $850 for a cardiologist and $500 for any other physician.

These numbers will help us figure out how to judge our model soon.

### What is the value of us getting this right?

If we correctly target the cardiologists, then we can limit the marketing campaign to the most receptive prospects and maximize our sales team’s return on investment.

This means the sales team should close more deals in a smaller amount of time, which means the company earns more *profit* (because it’s spending less money generating *revenue*). That’s the rose-colored glasses version. But things can also go wrong:

### What is the potential cost of getting this wrong?

**1. False positives: the sales team calls some non-cardiologists about the courses.**

Costs:

- We waste time and resources putting our campaign in front of doctors it isn’t meant for.
- The doctors waste time and energy receiving our irrelevant contact.
- There is a chance that those doctors become desensitized to our contact, making it more costly or less likely for the sales team to sell future products to those doctors.

For our sales team, the customer acquisition cost is much higher (hundreds of times higher, going by the number of calls needed per sale) for a non-cardiologist, and their customer lifetime value is lower. Add in the fact that the opportunity cost of calling a non-cardiologist is the chance to call a cardiologist instead, and the cost of false positives in our list of leads adds up *fast* for our sales team.
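To put rough numbers on that gap, the sketch below converts calls per sale into sales-team time and into expected first-year revenue per call. It recomputes calls per sale from the sales team’s figures (2500 calls, 4 cardiologist sales, 1 other sale, 1.9% of calls reaching cardiologists) and uses the 4.8-minute average call and the rough annual revenue figures. Treat the outputs as estimates under those assumptions; the variable names are illustrative.

```python
MINUTES_PER_CALL = 4.8  # average call length reported by the sales team

# Calls needed per sale, assuming 1.9% of the 2500 calls reach cardiologists.
cardio_calls_per_sale = (2500 * 0.019) / 4        # 4 cardiologist sales
other_calls_per_sale = (2500 * (1 - 0.019)) / 1   # 1 non-cardiologist sale

# Rough annual revenue per customer, in dollars.
CARDIO_REVENUE, OTHER_REVENUE = 850, 500

cardio_minutes = cardio_calls_per_sale * MINUTES_PER_CALL
other_minutes = other_calls_per_sale * MINUTES_PER_CALL
cost_ratio = other_calls_per_sale / cardio_calls_per_sale

# Expected first-year revenue from a single call to each kind of doctor.
cardio_revenue_per_call = CARDIO_REVENUE / cardio_calls_per_sale
other_revenue_per_call = OTHER_REVENUE / other_calls_per_sale

print(f"cardiologist sale: {cardio_minutes:.0f} minutes of calling")
print(f"non-cardiologist sale: {other_minutes / 60:.0f} hours of calling")
print(f"acquisition cost ratio: about {cost_ratio:.0f}x")
print(f"revenue per call: ${cardio_revenue_per_call:.2f} vs ${other_revenue_per_call:.2f}")
```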

**2. False negatives: the sales team doesn’t call some cardiologists about the courses.**

Costs:

- Our sales team misses an opportunity to sell our courses.
- There is a chance that those doctors feel left out and reach out to ask to be included. Not a bad thing, but we’d need to make one-off exceptions to any automated cycles we have in place for these doctors.
- There is a chance that those doctors feel left out and don’t take our sales team’s calls in the future (maybe unlikely, but we’re listing all the possibilities here).
- There is a chance that those doctors feel left out and do something more drastic, like sue us for discrimination. (There’s precedent for this: Facebook’s ad targeting strategy for a time allowed home loan advertisers to only put their ads in front of white home buyers.)

For our sales team, every false negative is a missed opportunity to make a sale. Luckily, we can hedge against this one by taking advantage of the fact that sales calls take time. We can give the sales team the surest leads and then use the time that buys us to find the next tranche of leads while (hopefully) keeping the false positive rate as low as possible.

### Our Plan of Action

So here’s what we’ll do: we’ll first hand the sales team a list of *confirmed* cardiologists whose specialties are listed in the NPPES directory (the National Plan and Provider Enumeration System, which CMS maintains). While they make calls off that list, we’ll work to build a very **specific** model (few false positives). Then, while the sales team works through the leads from the highly specific model, we buy ourselves even more time to build a more **sensitive** model (few false negatives) that captures any remaining cardiologists without exposing the sales team to costly mistakes.
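To make the specific-versus-sensitive distinction concrete, here is a small sketch that scores the same set of predictions at two thresholds. The labels and scores are made up for illustration; they do not come from any real model or from the NPPES data.

```python
def confusion_counts(labels, scores, threshold):
    """Count true/false positives and negatives when we call everyone scoring >= threshold."""
    tp = fp = tn = fn = 0
    for is_cardiologist, score in zip(labels, scores):
        predicted = score >= threshold
        if predicted and is_cardiologist:
            tp += 1
        elif predicted:
            fp += 1
        elif is_cardiologist:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """Share of actual cardiologists that end up on the call list."""
    return tp / (tp + fn) if tp + fn else 0.0


def specificity(tn, fp):
    """Share of non-cardiologists that are correctly kept off the call list."""
    return tn / (tn + fp) if tn + fp else 0.0


# Hypothetical ground truth and model scores for eight doctors.
labels = [True, True, True, False, False, False, False, False]
scores = [0.95, 0.70, 0.40, 0.60, 0.30, 0.20, 0.10, 0.05]

for threshold in (0.9, 0.35):
    tp, fp, tn, fn = confusion_counts(labels, scores, threshold)
    print(f"threshold {threshold}: sensitivity={sensitivity(tp, fn):.2f}, "
          f"specificity={specificity(tn, fp):.2f}")
```

In this toy data, the high threshold produces no false positives but misses most cardiologists, while the lower threshold catches all of them at the cost of one wasted call. That is the trade we plan to make in stages.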

Some technologists call this front-loading value, and some call it (lovingly) procrastination. The idea here is to develop in incremental stages so we continue to give the sales team what they need to do their jobs while iterating on the technology we’re building.

### Conclusion

Now that we’ve asked several questions of the sales team and calculated the cost of the mistakes the model could make, we’re ready to iteratively deliver value to our sales team as we learn more about our data.

In the next installment, we’ll do an audit on the data we’re using before we dive into exploratory data analysis.