Companies accrue ESG metrics that theoretically measure their performance on things like good governance, environmental stewardship, and fair employee treatment. I wanted to look for relationships between these metrics and company stock performance, so I’ve been sharing a series of posts with you about my findings. Here’s the cumulative result so far! It’s long, and it gets into weeds.

I hope you love weeds. I *love* weeds.

# Millenials and Investment: an Ongoing Exploration

In case you haven’t heard, millenials are killing everything from diamonds to department stores to designer crap to grocery chains.

Why? Sure, the recession had an impact. But also, millenials pay more attention to ethics than many multinational corporations bargain for. They cite the blood diamond trade as a major reason to spring for non-traditional engagement rings. They opt for grocery providers that can tell them where their food is coming from and under what conditions it was produced. They’re ditching the fast fashion industry for higher-priced items purchased secondhand on sites like Poshmark and ThreadUp.

And as millenials reach the age where they might accrue some savings, it makes sense that they would care about where that is going, too. In addition to millenial attendance at the NoDAPL protests, we saw thousands of millenials divest from Western Union, Bank of America, and other banks that loaned money to the project. Maybe megacorps won’t change their tunes because a few thousand people stood in a field to get mowed down by water cannons, but they’re more likely to sit up and listen when those same people take their hard-earned doll hairs to another playhouse.

So we see that millenials are surveying their options to spend and save according to their values. What about investing? Any personal finance 101 that isn’t taught by a financial advisor will recommend a low cost index as the place to stick extra money so it can grow with the market. Most index funds, including the most recommended one (Vanguard), decide their investments via index-matching: matching their holdings to the S&P500 by market cap, with no other variables. Thing is, plenty of investors are expressing interest in taking ethical considerations into account. Some portfolios do this by blanket blocking investments in certain industries like tobacco or porn. Other more advanced optsions, like Betterment’s AutoSRI portfolio, use actual ESG data to determine where they invest the money. There isn’t (yet) a fully customizable option to allow folks to automatically invest their funds based on a checklist of their individual values. For a while, I’ve thought about building a toy version of what that might look like.

When I talk about the idea with friends and relatives, I get the following objection: ‘What about the returns?’ Touche. Nobody wants to lose out on their potential earnings. At first, I figured I’d build a tolerance into the system that allowed investors to say ‘These are my values, but please don’t invest in a way that will trail general market performance by more than x percent.’ The algorithm would then predict stock performance for each company, somehow blend that with ESG rating, and come up with a combined weight for divvying up investment money.

Before I build that, though, I need to test the assumption that high ESG ratings *do* correlate negatively with returns. If they don’t, there’s no need for the tolerance measure in the first place.

I’m not the first person to run correlations along these lines. Dorfleitner, Utz, and Wimmer published a paper on this just last year. Their analysis suggests that higher corporate social responsibility ratings *increase* returns over a long period of time (“long” being a 12 year period from 2002-2014). They even identify three specific areas that correlate with higher than average returns: emission and resource reduction, workforce, and society. So in my exploration, I’ll dig into some specific CSR breakdowns with the data I have on S&P 500 companies.

```
import pandas as pd
import numpy as np
```

## Correlating KLD ESG Ratings to Stock Performance, 1990-2005

Let’s determine whether we notice any correlation between companies’ environmental, social, and governmental ratings and their stock performance.

### First, we pull in the ESG data.

These come from KLD and are now distributed by MSGI. I pulled them from an academic database. Don’t rerun this notebook because I didn’t push the actual data to Github, on account of it is large and on account of both data providers ask corporations to pay for the data. So I’m not going to undermine that.

```
y91 = pd.read_excel('../stockproject/12231046.1990-1999.stats/1991 HistoricalSpreadsheet_STATS.xls')
y92 = pd.read_excel('../stockproject/12231046.1990-1999.stats/1992 HistoricalSpreadsheet_STATS.xls')
y93 = pd.read_excel('../stockproject/12231046.1990-1999.stats/1993 HistoricalSpreadsheet_STATS.xls')
y94 = pd.read_excel('../stockproject/12231046.1990-1999.stats/1994 HistoricalSpreadsheet_STATS.xls')
y95 = pd.read_excel('../stockproject/12231046.1990-1999.stats/1995 HistoricalSpreadsheet_STATS.xls')
y96 = pd.read_excel('../stockproject/12231046.1990-1999.stats/1996 HistoricalSpreadsheet_STATS.xls')
y97 = pd.read_excel('../stockproject/12231046.1990-1999.stats/1997 HistoricalSpreadsheet_STATS.xls')
y98 = pd.read_excel('../stockproject/12231046.1990-1999.stats/1998 HistoricalSpreadsheet_STATS.xls')
y99 = pd.read_excel('../stockproject/12231046.1990-1999.stats/1999 HistoricalSpreadsheet_STATS.xls')
nineties = [y91, y92, y93, y94, y95, y96, y97, y98, y99]
```

```
y00 = pd.read_excel('../stockproject/12231046.2000-2009.stats/2000 HistoricalSpreadsheet_STATS.xls')
y01 = pd.read_excel('../stockproject/12231046.2000-2009.stats/2001 HistoricalSpreadsheet_STATS.xls')
y02 = pd.read_excel('../stockproject/12231046.2000-2009.stats/2002 HistoricalSpreadsheet_STATS.xls')
y03 = pd.read_excel('../stockproject/12231046.2000-2009.stats/2003 HistoricalSpreadsheet_STATS.xls')
y04 = pd.read_excel('../stockproject/12231046.2000-2009.stats/2004 HistoricalSpreadsheet_STATS.xls')
y05 = pd.read_excel('../stockproject/12231046.2000-2009.stats/2005 Historical Spreadsheet_STATS.xls') #wth KLD
y06 = pd.read_excel('../stockproject/12231046.2000-2009.stats/2006 Historical Spreadsheet_STATS.xls')
y07 = pd.read_excel('../stockproject/12231046.2000-2009.stats/2007 HistoricalSpreadsheet_STATS.xls')
y08 = pd.read_excel('../stockproject/12231046.2000-2009.stats/2008 HistoricalSpreadsheet_STATS.xls')
y09 = pd.read_excel('../stockproject/12231046.2000-2009.stats/2009 HistoricalSpreadsheet_STATS.xls')
two_thousands = [y00, y01, y02, y03, y04]
```

I wanted to run 1990-2009, but evidently starting in 2005 these spreadsheets no longer represent whether a company was in the S&P500 in the same way. That’s okay: we can do this for a fifteen-year period wrangle in more data later if we would like to see a longer period of time.

Now let’s pull out the companies that belong to the S&P 500. We’ll begin by examining their ESG scores on four metrics: employment policy strengths, employment policy concerns, environmental impact strengths, and environmental impact concerns. These companies get a score of zero (0) or one (1) each year on each of several sub-metrics. For example, employment policy strengths include specific scores for workplace safety, compensation, union management, et cetera.

We’re going to sum up all of the sub-metrics for each metric per company, and then we’re going to sum that company’s total scores in that metric across our fifteen year time span. We’ll end up with a company score of cumulative strengths and concerns in employment and environmental practices over the course of the fifteen years.

```
twenty_years = nineties + two_thousands
def filter_500(df):
return df[df['SP500'] == True]
sp500_90s = []
for data in twenty_years:
sp500_90s.append(filter_500(data))
def aggregate_columns_for(category, dataframe):
relevant_columns = [column_name for column_name in dataframe.columns.values if column_name.startswith(category)]
return dataframe[relevant_columns].sum(axis=1)
aggregate_data = pd.concat(sp500_90s)
aggregate_data['emp_str_sum'] = aggregate_columns_for('EMP-str', aggregate_data)
aggregate_data['emp_con_sum'] = aggregate_columns_for('EMP-con', aggregate_data)
aggregate_data['env_str_sum'] = aggregate_columns_for('ENV-str', aggregate_data)
aggregate_data['env_con_sum'] = aggregate_columns_for('ENV-con', aggregate_data)
aggregate_data['alc_con_sum'] = aggregate_columns_for('ALC-con', aggregate_data)
aggregate_data['cgov_str_sum'] = aggregate_columns_for('CGOV-str', aggregate_data)
aggregate_data['cgov_con_sum'] = aggregate_columns_for('CGOV-con', aggregate_data)
aggregate_data['com_str_sum'] = aggregate_columns_for('COM-str', aggregate_data)
aggregate_data['com_con_sum'] = aggregate_columns_for('COM-con', aggregate_data)
aggregate_data['div_str_sum'] = aggregate_columns_for('DIV-str', aggregate_data)
aggregate_data['div_con_sum'] = aggregate_columns_for('DIV-con', aggregate_data)
aggregate_data.columns.values
```

OK, so here’s our data. Let’s take a look at this data and make sure we’re getting what we want: a sum of the ESG scores on a company-by-company basis.

```
def sum_scores_for(dataframe, esg_marker):
grouping = dataframe.groupby(['Ticker'])[esg_marker].sum()
return pd.DataFrame({esg_marker : grouping}).reset_index()
esg_markers = [
'emp_str_sum',
'emp_con_sum',
'env_str_sum',
'env_con_sum',
'alc_con_sum',
'cgov_str_sum',
'cgov_con_sum',
'com_str_sum',
'com_con_sum',
'div_str_sum',
'div_con_sum'
]
def aggregate_sums_for(esg_markers):
esg_marker_data = sum_scores_for(aggregate_data, esg_markers[0])
for esg_marker in esg_markers[1::]:
esg_marker_data[esg_marker] = sum_scores_for(aggregate_data, esg_marker)[esg_marker]
return esg_marker_data
esg_marker_data = aggregate_sums_for(esg_markers)
```

Ah, these sums look like what we would expect to see!

```
esg_marker_data.sort_values(by=['emp_con_sum'], ascending=False).head()
```

### Second, we pull in stock performance data.

This data contains stock returns by quarter for S&P500 companies dating back to 1979. We’ll pull the columns for the ’90s for now.

```
price_data = pd.read_excel('../stockproject/Cleaned_Researcher_Dataset.xlsx')
```

```
new_header = price_data.iloc[0] #grab the first row for the header
content = price_data[1:] #take the data less the header row
content.columns = new_header #set the header row as the df header
content.head()
tickers = content.iloc[:,0:2]
tickers.columns = list(new_header)[0:2]
dates = content.iloc[:,45:106]
dates.columns = list(new_header)[45:106]
result = pd.concat([tickers, dates], axis=1)
result.head()
```

### Third, we translate these stock prices into returns.

I want a metric that I can use to compare all the companies that a) belonged to the S&P 500 and b) earned some kind of KLD scores during the 1990-2005 period. Some companies only belong to the S&P500 for a subset of the years in question. We want a metric that will not penalize companies based on having spent less time in the S&P500, so a cumulative score won’t work for us. I decided to calculate quarterly returns based on the stock prices. This fairly compares each company’s stock performance during the period that an index-matching ETF would have held it, however long or short that was.

This is also nice because our mean function will only consider, for each company, those cells that have a number. So we don’t have to do as much data skullduggery to get the equation functions to spit out something meaningful.

```
def quarter_return(start, end):
if start == 0 or end == 0:
return 0
return end / start
#WARNING: This has to go column by column because the sequence in time matters.
#Such an iterative operation takes longer than async-per-column pandas operations.
#Expect this block of code to take several seconds to run.
raw_stock_prices = result
returns_df = raw_stock_prices[['Ticker', 'Company Name']]
#We cannot easily index the columsn by name
#because the column names are datetimes rather than strings,
#so we use column index instead.
for column_name in raw_stock_prices.iloc[:,2:]:
loc = raw_stock_prices.columns.get_loc(column_name)
this_column = raw_stock_prices.iloc[:,loc]
next_col = loc + 1
try:
next_column = raw_stock_prices.iloc[:, next_col]
temp_df = pd.concat([this_column, next_column], axis=1)
temp_df.columns = ['a', 'b']
returns_df['quarter_starting_' + column_name.strftime('%m/%d/%Y') + '_roi'] = (
temp_df.apply(lambda row: quarter_return(row['a'], row['b']), axis=1))
except:
print('End of dataframe reached')
returns_df.head()
```

### Fourth, we combine the data into one dataframe.

We find all the companies for which we have both stock price data and ESG data, and we put the information together.

```
tickers = list(esg_marker_data["Ticker"]) #get all the company tickers for which we have esg data
prices_for_esg_companies = returns_df[returns_df["Ticker"].isin(tickers)] #get the stock data from companies in that list
esg_marker_data.head()
```

```
relevant_esgs = esg_marker_data[esg_marker_data["Ticker"].isin(tickers)]
relevant_esgs = relevant_esgs.fillna(0.0) #no esg score to zero esg score
relevant_esgs['emp_str_sum'].unique()
```

```
all_data = pd.concat([relevant_esgs, prices_for_esg_companies.iloc[:,1:]], axis = 1) #put the esg and stock data in one dataframe
all_data = all_data[np.isfinite(all_data['emp_str_sum'])]
```

```
all_data['emp_str_sum'].unique() #making sure the list of EMP-str-sum values is the same before and after concatenation to ensure that all the esg data made it over
```

### Fifth, we look at the data.

```
import matplotlib.pyplot as plt
import random
import seaborn as sns
import math
from scipy.stats import t
import numpy as np
%matplotlib inline
```

Our research question, recall, is this: do companies with low aggregated ESG scores have better stock performance than companies with high aggregated ESG scores?

We have so far calculated the return on investment of every stock, every quarter. We’re not ready to use that data, though. If we use each return individually, we’ll end up comparing each example (that is, each company) *multiple times*. Multiple comparisons in statistics gives you more opportunities to end up with false positives because it gives you more cracks at hitting that 5% probability lottery (if p-0.05) of your results being a fluke.

Instead, we want a way to generalize over all of its quarterly returns for those quarters in which each stock had a return.

Easy! Take an average, right? Well, not quite.

First of all, an average skews toward big outliers. If most quarters showed an ROI of 1.01 and then one quarter has an ROI of 2, the average will skew much higher than 1.01.

Pursuant to that, smaller numbers of measurements skew toward the extremes. So if a company only has three returns, each individual return affects the average a lot. That means, if one of the three is super high, the whole average will be super high. Comparatively, if a company has 200 returns and one of them is an outlier, it won’t drag the average nearly as far. In datasets where examples possess a different number of measurements, the highest and lowest values for the target variable often come from samples with few measurements.

We are in a prime position to encounter that gotcha, because some of these companies are much more measured (many more quarterly returns) than others.

So instead, we’re going to do something that might look a little weird.

#### Confidence Intervals

Instead of going off the *average* of our set of returns, we’ll take the confidence interval for it. A confidence interval attempts to take into account the fact that our data doesn’t quite match the real world: it samples the real world such that we can attempt to represent the real world in studies. So the “true average” of something might not look like the average of our measurements of it, and the fewer measurements we have the less sure we can be of the discrepancy. A confidence interval takes our average and says ‘based on this average, the *true* average lies between this number and this number with this probability.’

So, for our stock return measurements, we’ll get the confidence interval for each company. For companies with few measurements, that confidence interval will be wide. For companies with many measurements, it will be more narrow. We’ll then sample that *pessimistically* and compare the *bottoms* of all the companies’ confidence intervals.

It’s worth noting that this is going to give us a very low number for companies with few returns and *without* a high outlier. We’ll address that as well.

```
# Calculate average roi
returns = all_data[all_data.columns.difference(['emp_str_sum', 'emp_con_sum', 'env_str_sum', 'env_con_sum', 'div_str_sum', 'Ticker','Company Name'])]
returns.head()
```

```
def confidence_interval_for(samples=[], confidence=0.95):
sample_size = len(samples)
degrees_freedom = sample_size - 1
outlier_tails = (1.0 - confidence) / 2.0
t_distribution_number = -1 * t.ppf(outlier_tails, degrees_freedom)
step_1 = np.std(samples)/math.sqrt(sample_size)
step_2 = step_1 * t_distribution_number
low_end = np.mean(samples) - step_2
high_end = np.mean(samples) + step_2
return low_end, high_end
```

```
lower_conf_intervals = []
upper_conf_intervals = []
for (idx, row) in returns.iterrows():
maybe_measurements = row.tolist()
measurements = [float(x) for x in maybe_measurements if (math.isnan(float(x)) == False)]
bottom, top = confidence_interval_for(measurements)
lower_conf_intervals.append(bottom)
upper_conf_intervals.append(top)
```

```
all_data['avg_quarterly_roi'] = returns.mean(axis=1, skipna=True)
all_data['num_measurements'] = returns.count(axis=1)
all_data['lower_conf_interval'] = lower_conf_intervals
all_data['upper_conf_interval'] = upper_conf_intervals
all_data.head(10)
```

Before we go on, let’s talk statistical power.

The statistical power of a dataset describes the likelihood that you will be able to *detect* a meaningful difference with this data given that the difference *is there*. It’s an important precursor step to data analysis because it determines your ability to find the effect you’re looking for.

Calculating statistical power is a little complicated, and it’s generally a good idea to get someone who knows what they’re doing to help you. I’m not confident in my ability to evaluate statistical power calculation strategies yet, but I figured I’d start with the all-around method and used this online calculator.

I performed a one-tail statistical power test. For this we will need the following data:

```
all_data['lower_conf_interval'].describe()
```

I took the mean of the bottom end of confidence intervals on stock returns—0.92. Suppose this were only the mean for companies with favorable ESG scores, and suppose that companies with unfavorable ESG scores had a higher quarterly return of 1 standard deviation higher—0.934. So 0.92 is my sample average and 0.934 is my test value.

In calculating your statistical power, you have two variables under your control: the number of examples in your data and your effect size. The more examples you have, the smaller effect size you can reliably detect. This means that, if you have already collected your data (as we have), to have a statistically powerful study, the effect size you’re seeking needs to be large enough that the number of examples you have could reliably detect it. The bigger the effect size, the less data you need to detect it. For small effect sizes, you may need thousands of examples. But if we had only ten stock examples (5 with a low ESG aggregate score, 5 with a high one) and the 5 with low ESG aggregate scores showed triple the returns of the high ones, that effect size is *huge*, so our tiny study could still reliably detect it. (to check this, I plugged it into the power calculator with an absurdly high standard deviation of 1. My power was 100%).

Anyway, back to our real data. The sample size is how many examples we have…

```
all_data.shape
```

…925.

Our standard deviation for sample is 0.14. so we plug that in.

I left the confidence level at 5% and ran the calculator.

My result: 91.9%

That’s the probability, if there is a 1.4 percentage point difference or more between stock returns for companies with favorable and unfavorable ESG scores, we have a 91.9% chance of detecting it. Statistical power expectations vary, but a rule of thumb (and standard practice at some more rigorous scientific journals) is to expect a statistical power of 80% or higher to take a study seriously.

OK, time to make some pictures! Let’s plot our ESG score sums against average roi and see if we notice any trends.

```
f, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, sharex='col', sharey='row', figsize=(20, 10))
f.suptitle('ESG Sums Plotted Against Returns')
ax1.scatter(x=all_data['emp_str_sum'], y=all_data['lower_conf_interval'])
ax2.scatter(x=all_data['emp_con_sum'], y=all_data['lower_conf_interval'])
ax3.scatter(x=all_data['env_str_sum'], y=all_data['lower_conf_interval'])
ax4.scatter(x=all_data['env_con_sum'], y=all_data['lower_conf_interval'])
all_data.shape
```

We have a couple of really low ones on those charts. Let’s look at them:

```
all_data[['Company Name','lower_conf_interval', 'num_measurements']].sort('lower_conf_interval').head(10)
```

We spoke a little bit earlier about the way our measuring strategy would especially penalize companies with only a few measurements. We’re seeing that here. Out of these bottom 10 scores for the lower end of stock return confidence intervals, only one has more than 7 data points. Compare this to the mean number of measurements:

```
all_data['num_measurements'].mean()
```

But there’s another thing to note about this data: five of the values are *negative*. These scores are low, yes, but a negative value for a stock return does not make sense. The value of a stock can drop to zero, but it does not turn into debt to the company owed by the stockholders. The lowest a stock price can drop is to zero. To accurately represent that, we’ll clip these values at zero.

```
all_data['lower_conf_interval'] = all_data['lower_conf_interval'].clip_lower(0)
```

Let’s check the means too to make sure we don’t have negative values in the means:

```
all_data[['Company Name','avg_quarterly_roi', 'num_measurements']].sort('avg_quarterly_roi').head(10)
```

Cool: we don’t. This means that none of the upper ends of the confidence intervals will have negative values either, as the upper end of the confidence interval is always higher than the mean on which it is based.

Let’s check out our plot again with those adjustments:

```
f, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, sharex='col', sharey='row', figsize=(20, 10))
f.suptitle('ESG Sums Plotted Against Returns')
ax1.scatter(x=all_data['emp_str_sum'], y=all_data['lower_conf_interval'])
ax2.scatter(x=all_data['emp_con_sum'], y=all_data['lower_conf_interval'])
ax3.scatter(x=all_data['env_str_sum'], y=all_data['lower_conf_interval'])
ax4.scatter(x=all_data['env_con_sum'], y=all_data['lower_conf_interval'])
```

Let’s look at the general trends of the returns. They look pretty flat to me: I don’t see an upward or downward trend in quarterly ROI based on any of the 4 ESG metrics.

Let’s dig a little deeper and see if the numbers themselves support that.

```
f, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, sharex='col', sharey='row', figsize=(20, 10))
f.suptitle('Histograms of Aggregated ESG Scores')
ax1.hist(x=all_data['emp_str_sum'])
ax2.hist(x=all_data['emp_con_sum'])
ax3.hist(x=all_data['env_str_sum'])
ax4.hist(x=all_data['env_con_sum'])
all_data.shape
```

The thing about a lot of our data is that it skews heavily toward low ESG sums. For example, there are a lot more companies with low employment policy strength scores than high ones. How do we account for this when we do our aggregations?

Well, we’ll look at the individual ESG scores first. But later we’ll bucket those ESG scores by fives so that we have small groups of scores, and then we’ll look at the confidence intervals around those (which will, of course, be wider for smaller groups).

Here’s a question though—why do buckets at all? Why not stick to a linear regressor?

Good question! A regressor can frequently be a preferable option to drawing arbitrary distinctions in continuous data. This isn’t exaclty continuous data as the ESG aggregations are all natural numbers, *but* that is not the reason for the buckets. The reason for the buckets is to have some kind of aggregate at each stage across our data examples, so our data is not so broken up by a single ESG score with no examples, or just one example. If we run a regressor, the bottom of the regressor represents a ton of data points, and as we go higher it represents fewer and fewer data points such that the expected value up there means very little.

I’d be very concerned to artificially divide this data into only *two* buckets such that points close to the arbitrary division get lumped in with very different means. But breaking it up by fives gives us several buckets such that each bucket of data points contains data points that are roughly similar to one another in their ESG scores.

```
all_data['emp_str_buckets'] = all_data.apply(lambda row: int(row['emp_str_sum'] / 5), axis=1)
all_data['emp_con_buckets'] = all_data.apply(lambda row: int(row['emp_con_sum'] / 5), axis=1)
all_data['env_str_buckets'] = all_data.apply(lambda row: int(row['env_str_sum'] / 5), axis=1)
all_data['env_con_buckets'] = all_data.apply(lambda row: int(row['env_con_sum'] / 5), axis=1)
```

```
def get_grouping_for(esg_metric, dataframe):
return dataframe.groupby(esg_metric).agg({'lower_conf_interval': ['count','mean','std']}).reset_index()
emp_str_sum_group = get_grouping_for('emp_str_sum', all_data)
emp_con_sum_group = get_grouping_for('emp_con_sum', all_data)
env_str_sum_group = get_grouping_for('env_str_sum', all_data)
env_con_sum_group = get_grouping_for('env_con_sum', all_data)
f, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, sharex='col', sharey='row', figsize=(20, 10))
f.suptitle('ESG Sums Plotted Against Returns')
ax1.plot(emp_str_sum_group['emp_str_sum'], emp_str_sum_group['lower_conf_interval']['mean'])
ax2.plot(emp_con_sum_group['emp_con_sum'], emp_con_sum_group['lower_conf_interval']['mean'])
ax3.plot(env_str_sum_group['env_str_sum'], env_str_sum_group['lower_conf_interval']['mean'])
ax4.plot(env_con_sum_group['env_con_sum'], env_con_sum_group['lower_conf_interval']['mean'])
```

Hmm. I’m not sure I see much of a trend here. Let’s see if any of these ESG sums correlate with their mean lower confidence interval return.

```
all_data[['emp_str_sum', 'emp_con_sum', 'env_str_sum', 'env_con_sum', 'lower_conf_interval']].corr()
```

We’re interested in the correlations between the esg sums on the left and the lower_conf_interval (last column). These are all very small correlations. A perfect direct correlation is 1. A perfect inverse correlation is -1. Zero means no correlation whatsoever. These correlation numbers arepretty darn close to zero.

We’ll have to dig deeper into the data.

### Sixth, we determine what our results could mean.

OK friends, it’s time to figure out what our data means, or rather, whether it means anything at all with respect to our question: do companies with higher str ESG scores and lower con ESG scores have lower stock returns than companies that perform more poorly on ESG metrics?

First, let’s calculate our confidence intervals for stock performance and determine if the stock performances for companies low ESG scores fall outside the confidence intervals for the stock performance of companies with high ESG scores.

```
import math
from scipy.stats import t
import numpy as np
```

First, let’s make ourselves a method to get confidence intervals from summary statistics. That way, we can make interpretable groupings for all of our esg scores and get the confidence intervals on those scores based on how many companies had each esg score.

```
def confidence_interval_for_collection(sample_size=[], standard_deviation=[], mean=[], confidence=0.95):
degrees_freedom = [count - 1 for count in sample_size]
outlier_tails = (1.0 - confidence) / 2.0
confidence_collection = [outlier_tails for _ in sample_size]
t_distribution_number = [-1 * t.ppf(tails, df) for tails, df in zip(confidence_collection, degrees_freedom)]
step_1 = [std/math.sqrt(count) for std, count in zip(standard_deviation, sample_size)]
step_2 = [step * t for step, t in zip(step_1, t_distribution_number)]
low_end = [mean_num - step_num for mean_num, step_num in zip(mean, step_2)]
high_end = [mean_num + step_num for mean_num, step_num in zip(mean, step_2)]
return low_end, high_end
```

Now let’s make ourselves some convenience methods to wrhangle our data. One of them will create the grouping we talked about before, aggregating each ESG metric to tell us a) how many companies had each score on the ESG metric, b) the mean return in that group of companies, and c) the standard deviation on returns in that group of companies.

We’ll use that data to get a lower limit and an upper limit on a *confidence interval* around those returns based on how many measurements we have at each score.

Finally, we’ll make a method to plot the results of these groupings so we can eyeball our results.

```
def prepare_confidence_data_for(esg_metric, grouping):
labels=grouping[esg_metric]
sample_size = grouping['lower_conf_interval']['count']
standard_deviation = grouping['lower_conf_interval']['std']
mean = mean = grouping['lower_conf_interval']['mean']
low_end, high_end = confidence_interval_for_collection(sample_size, standard_deviation, mean)
df_with_confidence_limits = pd.concat([labels, sample_size, standard_deviation, mean], axis=1)
df_with_confidence_limits['.95 confidence level min'] = low_end
df_with_confidence_limits['.95 confidence level max'] = high_end
return df_with_confidence_limits
def plot_grouping(group_name, grouping, axes):
mini = axes.plot(grouping['.95 confidence level max'], color='r', label="Max of 0.95 Confidence Interval")
mean = axes.plot(grouping['mean'], color='g', label="Mean")
maxi = axes.plot(grouping['.95 confidence level min'], color='b', label="Min of 0.95 Confidence Interval")
legend = axes.legend(loc='upper right', shadow=True)
```

Let’s eyeball some charts!

```
emp_str_confidence_interval_data = prepare_confidence_data_for('emp_str_sum', emp_str_sum_group)
emp_con_confidence_interval_data = prepare_confidence_data_for('emp_con_sum', emp_con_sum_group)
env_str_confidence_interval_data = prepare_confidence_data_for('env_str_sum', env_str_sum_group)
env_con_confidence_interval_data = prepare_confidence_data_for('env_con_sum', env_con_sum_group)
f, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, sharex='col', sharey='row', figsize=(20, 10))
f.suptitle('ESG Sums Plotted Against Returns')
plot_grouping('Employment Strength Sums', emp_str_confidence_interval_data, ax1)
plot_grouping('Employment Concern Sums', emp_con_confidence_interval_data, ax2)
plot_grouping('Environmental Strength Sums', env_str_confidence_interval_data, ax3)
plot_grouping('Emnvironmental Concern Sums', env_con_confidence_interval_data, ax4)
```

Worth noting: You can tell from the cart the rough relative number of data points that we had for a given ESG score. Where you see no red, blue, or green lines, we had no data points for that ESG score. Where there is a green line but no red or blue lines, we had only one data point for that ESG score. Where there are red and blue lines and they are far apart, we had a few data points, but a small enough number that the confidence interval is still very wide. Where the red and blue lines flank the green line very closely, we had a *lot* of data: the confidence interval for these is very close to the mean.

This is a little hard to read with the skips in it. Let’s bucket the scores and see how that looks:

```
emp_str_bucket_group = get_grouping_for('emp_str_buckets', all_data)
emp_con_bucket_group = get_grouping_for('emp_con_buckets', all_data)
env_str_bucket_group = get_grouping_for('env_str_buckets', all_data)
env_con_bucket_group = get_grouping_for('env_con_buckets', all_data)
emp_str_bucket_data = prepare_confidence_data_for('emp_str_buckets', emp_str_bucket_group)
emp_con_bucket_data = prepare_confidence_data_for('emp_con_buckets', emp_con_bucket_group)
env_str_bucket_data = prepare_confidence_data_for('env_str_buckets', env_str_bucket_group)
env_con_bucket_data = prepare_confidence_data_for('env_con_buckets', env_con_bucket_group)
f, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, sharex='col', sharey='row', figsize=(20, 10))
f.suptitle('ESG Sums Plotted Against Returns')
plot_grouping('Bucketed Employment Strength Sums', emp_str_bucket_data, ax1)
plot_grouping('Bucketed Employment Concern Sums', emp_con_bucket_data, ax2)
plot_grouping('Bucketed Environmental Strength Sums', env_str_bucket_data, ax3)
plot_grouping('BucketedEmnvironmental Concern Sums', env_con_bucket_data, ax4)
```

Check out those widening confidence intervals as we get to the higher point ranges! You can practicelly see the number of exaxmples falling off.

*So what am I looking for here?* I’m looking to see if the highest point on the *bottom* of the confidence intervals on either side of the graph is higher than the *top* of the confidence intervals on the other side. This break in overlap between the confidence intervals would provide a leading indication of a strong and meaningful difference in stock returns for companies with high versus low scores in any of these ESG metrics.

I don’t see that happen in any of these charts. The confidence intervals overlap significantly all the way across. So we can’t be confident at all that stocks from the high ESG score groups will have higher or lower returns than the stocks from the low ESG score groups.

But wait! That doesn’t necessarily mean that there is no meaningful difference. The confidence intervals can overlap for two series of points whose data *still* has a meaningful difference. To determine whether it does, we need to perform additional significance testing—in this case, a *t test*.

```
def t_test_for(num_samples_1, standard_deviatcion_1, mean1, num_samples_2, standard_deviation_2, mean2, confidence=0.95, num_comparisons=1):
alpha = (1 - confidence)/float(num_comparisons)
total_degrees_freedom = num_samples_1 + num_samples_2 - 2
t_distribution_number = -1 * t.ppf(alpha, total_degrees_freedom)
degrees_freedom_1 = num_samples_1 - 1
degrees_freedom_2 = num_samples_2 - 1
sum_of_squares_1 = (standard_deviation_1 ** 2) * degrees_freedom_1
sum_of_squares_2 = (standard_deviation_2 ** 2) * degrees_freedom_2
combined_variance = (sum_of_squares_1 + sum_of_squares_2) / (degrees_freedom_1 + degrees_freedom_2)
first_dividend_addend = combined_variance/float(num_samples_1)
second_dividend_addend = combined_variance/float(num_samples_2)
denominator = math.sqrt(first_dividend_addend + second_dividend_addend)
numerator = mean1 - mean2
t_value = float(numerator)/float(denominator)
accept_null_hypothesis = abs(t_value) < abs(t_distribution_number) #results are not significant
return accept_null_hypothesis, t_value
```

We have written a method that will tell us, firstly, whether to accept or reject the *null hypothesis*, which assumes no meaningful difference between the two sets of data we want to compare. I have named that output ‘accept_null_hypothesis’ because I don’t love the ubiquitous use of the confounding phrase ‘reject the null hypothesis’ in scientifi inquiry. It’s a double negative (*reject* the *absence* of meaningful difference), which adds an unnecessary additional piece of mental acrobatics to the (already frequently herculean) task of determining what, exactly, the scientists are trying to say in their conclusion paragraph.

We are going with *accept* the *absence of* meaningful difference as the variable name for two reasons. First of all, we remove the double negative this way. Second of all, accepting the null hypothesis is (or should be) the outcome of the vast majority of scientific inquiry. Scientists, collectively, test a *whole bunch* of stuff to see what has an effect. Most of the things tried, it turns out, don’t have that effect. So our `accept_null_hypothesis`

value will usually be true. When it’s false, we should sit up and pay attention.

```
mean1 = all_data.sort('emp_str_sum', ascending=False).iloc[0:20, :]['lower_conf_interval'].mean()
std1 = all_data.sort('emp_str_sum', ascending=False).iloc[0:20, :]['lower_conf_interval'].std()
mean2 = all_data.sort('emp_str_sum').iloc[0:20, :]['lower_conf_interval'].mean()
std2 = all_data.sort('emp_str_sum').iloc[0:20, :]['lower_conf_interval'].std()
n1 = 20
n2 = 20
accept_null_hypothesis, t_value = t_test_for(n1, std1, mean1, n2, std2, mean2)
print(accept_null_hypothesis)
print(t_value)
```

```
mean1 = all_data.sort('emp_con_sum', ascending=False).iloc[0:20, :]['lower_conf_interval'].mean()
std1 = all_data.sort('emp_con_sum', ascending=False).iloc[0:20, :]['lower_conf_interval'].std()
mean2 = all_data.sort('emp_con_sum').iloc[0:20, :]['lower_conf_interval'].mean()
std2 = all_data.sort('emp_con_sum').iloc[0:20, :]['lower_conf_interval'].std()
n1 = 20
n2 = 20
accept_null_hypothesis, t_value = t_test_for(n1, std1, mean1, n2, std2, mean2)
print(accept_null_hypothesis)
print(t_value)
```

```
mean1 = all_data.sort('env_str_sum', ascending=False).iloc[0:20, :]['lower_conf_interval'].mean()
std1 = all_data.sort('env_str_sum', ascending=False).iloc[0:20, :]['lower_conf_interval'].std()
mean2 = all_data.sort('env_str_sum').iloc[0:20, :]['lower_conf_interval'].mean()
std2 = all_data.sort('env_str_sum').iloc[0:20, :]['lower_conf_interval'].std()
n1 = 20
n2 = 20
accept_null_hypothesis, t_value = t_test_for(n1, std1, mean1, n2, std2, mean2)
print(accept_null_hypothesis)
print(t_value)
```

```
mean1 = all_data.sort('env_con_sum', ascending=False).iloc[0:20, :]['lower_conf_interval'].mean()
std1 = all_data.sort('env_con_sum', ascending=False).iloc[0:20, :]['lower_conf_interval'].std()
mean2 = all_data.sort('env_con_sum').iloc[0:20, :]['lower_conf_interval'].mean()
std2 = all_data.sort('env_con_sum').iloc[0:20, :]['lower_conf_interval'].std()
n1 = 20
n2 = 20
accept_null_hypothesis, t_value = t_test_for(n1, std1, mean1, n2, std2, mean2)
print(accept_null_hypothesis)
print(t_value)
```

How about that! So far, accepting the null hypothesis everywhere. It appears that we cannot meaninfgully separate the stock performance of companies on either end of the spectrum for each of their employment policy ESG scores.

#### Interim Conclusions

On this pass, we have taken a much more thorough look at our data with respect to how well it approximates the real world and how certain we are about that approximation.

This did not mean tacking on a test for statistical significance at the end: it meant rethinking our analysis from the very beginning, including *what our target variable would be* (previously extrapolated annual roi, now back to mean quarterly roi) and also *how we would represent it* (with the bottom end of the confidence interval instead of the mean itself).

It also meant checking our results more thoroughly along the way, identifying unrealistic values in the data, understanding why those values were what they were, and replacing them with values that better approximate reality.

Finally, it meant conducting a t test rather than eyeballing the results.

The result here provides an excellent example of the difference between how data might be interpreted or perceived versus what it really represents. When we use the bottom of the confidence intervals, we’re looking at a *lot* of *losses* in shareholder value. It’s important to note that this isn’t really what happened. It is, more accurately, a statistically derived worst-case scenario for a company that did everything that a given example company did, based on the metrics we have about how that company actually did. That is *not useful* for predicting what the stock returns for the example company will be in the future (fun fact: trying to predict stock returns in general is not that useful, as straight-up index matching outperforms the vast majority of shrewder, prediction-based investment strategies in the long term. I don’t miss our loss of predictive value because there wasn’t much demonstrated value there to begin with.)

At any rate, what it *is* useful for is accurately representing the fact that there is a lot we do not know and attempting to account for that in a comparison of stock returns relative to ESG ratings. We’re not trying to predict returns by company: we’re trying to determine if companies with favorable ESG scores, in aggregate, underperform on the stock market relative to companies with unfavorable ESG scores. For this, our data is useful.

And, as we’ve established, we don’t see in this analysis a meaningful difference in any of our cases. This bodes well for the *application* question—can I create a socially conscious investment portfolio without sacrificing returns on my investment?

There’s more to look at to get a more definitive answer to this question. But so far, what we’re seeing points to *yes*.

#### Code Setup for Next Part

We will need to calculate p values for this next section of the analysis. P values are not a perfect metric: they can be tough to interpret and vulnerable to truth inflation. That said, we’ll do our best to take those things into account as we move along. Here you’ll see our method for getting a p value from some summary statistics.

```
import scipy.stats as stats
def p_value_for(num_samples_1, standard_deviation_1, mean1, num_samples_2, standard_deviation_2, mean2, confidence=0.95, num_comparisons=1):
alpha = (1 - confidence)/num_comparisons
total_degrees_freedom = num_samples_1 + num_samples_2 - 2
degrees_freedom_1 = num_samples_1 - 1
degrees_freedom_2 = num_samples_2 - 1
combined_degrees_freedom = degrees_freedom_1 + degrees_freedom_2
variance_1 = (standard_deviation_1 ** 2)
variance_2 = (standard_deviation_2 ** 2)
combined_variance_1 = (variance_1 / num_samples_1)
combined_variance_2 = (variance_2 / num_samples_2)
mean_standard_error = math.sqrt(combined_variance_1 + combined_variance_2)
t_statistic = float(mean1 - mean2)/float(mean_standard_error)
p_value = stats.t.sf(np.abs(t_statistic), combined_degrees_freedom) * 2 # two-sided p value = probability abs(t) > tt)
comparison_adjusted_p_value = p_value/float(num_comparisons)
return comparison_adjusted_p_value
```

I also want a better way to save, store, access, and analyze results for each ESG metric. So let’s make an object to store summary statistics, render plots, and run analyses by ESG metric. You’ll see the utility of this object in the following cells:

```
class EsgMetricAnalysis():
def __init__(self, name):
self.name = name
self.low_esg_metric_mean = 0
self.low_esg_metric_std = 0
self.low_esg_metric_num_examples = 0
self.high_esg_metric_mean = 0
self.high_esg_metric_std = 0
self.high_esg_metric_num_examples = 0
self.num_comparisons = 1
self.esg_marker_sum_grouping = {}
self.esg_marker_confidence_interval_data = {}
self.esg_marker_bucket_group = {}
self.esg_marker_bucket_data = {}
def plots(self):
f, ((ax1, ax2), (ax3, ax4), (ax5, ax6)) = plt.subplots(3, 2, sharex='col', sharey='row', figsize=(30, 20))
f.suptitle('ESG Sums Plotted Against Returns')
ax1.scatter(x=all_data[self.name + '_sum'], y=all_data['lower_conf_interval'])
ax2.hist(x=all_data[self.name + '_sum'])
ax3.plot(self.esg_marker_sum_grouping[self.name + '_sum'], self.esg_marker_sum_grouping['lower_conf_interval']['mean'])
plot_grouping('Employment Strength Sums', self.esg_marker_confidence_interval_data, ax4)
plot_grouping('Bucketed Employment Strength Sums', self.esg_marker_bucket_data, ax5)
def t_test(self):
return t_test_for(
analysis.low_esg_metric_num_examples,
analysis.low_esg_metric_std,
analysis.low_esg_metric_mean,
analysis.high_esg_metric_num_examples,
analysis.high_esg_metric_std,
analysis.high_esg_metric_mean,
num_comparisons=self.num_comparisons
)
def p_value(self):
return p_value_for(
analysis.low_esg_metric_num_examples,
analysis.low_esg_metric_std,
analysis.low_esg_metric_mean,
analysis.high_esg_metric_num_examples,
analysis.high_esg_metric_std,
analysis.high_esg_metric_mean,
num_comparisons=self.num_comparisons
)
```

OK, here is where we create all of our EsgMetricAnalysis objects. This method is messy, hard to interpret, and very dependent on the state of the notebook. I’d ultimately like to encapsulate this work in a series of pipelines and possibly extract a separate script for data preparation. That said, I promised to share incremental progress with you, and I have enough progress now that it’s time to share it.

```
def do_the_whole_dance_with(esg_metric=None, num_comparisons=1):
analysis = EsgMetricAnalysis(esg_metric)
esg_marker_data[esg_metric + '_sum'] = sum_scores_for(aggregate_data, esg_metric + '_sum')[esg_metric + '_sum']
all_data[esg_metric + '_buckets'] = all_data.apply(lambda row: int(row[esg_metric + '_sum'] / 5), axis=1)
analysis.esg_marker_sum_grouping = get_grouping_for(esg_metric + '_sum', all_data)
analysis.esg_marker_confidence_interval_data = prepare_confidence_data_for(esg_metric + '_sum', analysis.esg_marker_sum_grouping)
analysis.esg_marker_bucket_group = get_grouping_for(esg_metric + '_buckets', all_data)
analysis.esg_marker_bucket_data = prepare_confidence_data_for(esg_metric + '_buckets', analysis.esg_marker_bucket_group)
analysis.low_esg_metric_mean = all_data.sort_values(by=esg_metric + '_sum', ascending=False).iloc[0:20, :]['lower_conf_interval'].mean()
analysis.low_esg_metric_std = all_data.sort_values(by=esg_metric + '_sum', ascending=False).iloc[0:20, :]['lower_conf_interval'].std()
analysis.high_esg_metric_mean = all_data.sort_values(by=esg_metric + '_sum').iloc[0:20, :]['lower_conf_interval'].mean()
analysis.high_esg_metric_std = all_data.sort_values(by=esg_metric + '_sum').iloc[0:20, :]['lower_conf_interval'].std()
analysis.low_esg_metric_num_examples = 20
analysis.high_esg_metric_num_examples = 20
analysis.num_comparisons = num_comparisons
return analysis
```

Let’s make sure our thing works:

```
analysis = do_the_whole_dance_with('emp_str')
print(analysis.name)
print(analysis.p_value())
```

Sweet! Now that we have created a way to run our analysis quickly on several ESG metrics, let’s try it on several of them at one time.

Ah! But when we do this, there’s something we have to watch out for: **multiple comparisons**! For each individual ESG metric, we are comparing the same examples. Remember that, if we test enough variables about our companies, eventually some of them will demonstrate a meaningful-looking effect by pure chance. A p-value of 0.05 *still means* a 5% probability of just such a chance occurrence, and when you do 100 comparisons the probability of at least one such false positive rises to…99%.

We are not making 99 comparisons, but we are making 11. For 11 comparisons at a p value of 0.05, the probability of a meaningful-looking result occurring by chance is…

```
1-(0.95**11)
```

…0.43. That is, there’s a 43% chance we get a false positive. That’s a fairly high probability.

There are a few ways to adjust our calculations for multiple comparisons. The most hamfisted of these, the Bonferroni Correction, adjusts by dividing our starting p value by the number of comparisons we make and using that as the working p-value.

The Bonferroni correction controls the *familywise error rate*—the probability, assuming all the variables have identical distribution in the two groups, that a significant-looking difference happens purely by chance.

So in this case, our working p value would be…

```
0.05 /11
```

…0.004. Very, very tiny.

Unsurprisingly, when we run our analysis with the Bonferroni correction, we do not find a single statistically significant difference in stock prices among the lot. Check it out:

```
esgs = [
'emp_str',
'emp_con',
'env_str',
'env_con',
'alc_con',
'cgov_str',
'cgov_con',
'com_str',
'com_con',
'div_str',
'div_con'
]
analyses = []
for esg in esgs:
analyses.append(do_the_whole_dance_with(esg, num_comparisons=11))
```

That is a lot of ‘true’values for accepting the null hypothesis.

We can also check our resuts with the Benjamini-Hochberg procedure.

The Benjamini-Hochberg correction controls the False discovery rate—the expected proportion of false positives among the variables for which you claim the existence of a difference. For example, if with FDR controlled to 5% 20 tests are positive, on average one of these tests will be a false positive (because 1 in 20 is 5%).

To run this tesst, we choose a false positive rate (Q) to which we wish to control. Let’s choose 10% to start and see what we come up with.

```
p_values = {}
for esg in esgs:
analysis = do_the_whole_dance_with(esg) #We leave num_comparisons at 1 since we're about to correct for it with a different procedure.
p_values[analysis.name] = analysis.p_value()
p_values
```

How to do this test:

- Rank all your p values lowest to highest.
- Calculate (Q * rank)/number of comparisons.
- Go down the list until the calculated value Qi/n is lower than p.

```
benjamini_hochberg_data = pd.DataFrame({'name': list(p_values.keys()), 'p_value': list(p_values.values())})
benjamini_hochberg_data = benjamini_hochberg_data.sort_values(by='p_value')
benjamini_hochberg_data['rank'] = range(11)
benjamini_hochberg_data['rank'] = benjamini_hochberg_data['rank'] + 1
benjamini_hochberg_data['Qi/n'] = (0.1 * benjamini_hochberg_data['rank']) / float(11)
benjamini_hochberg_data
```

Check it out: even on the very first one, the p value is larger than Qi/n. So when we analyze with this procedure, we’re still not finding a meaningful difference between stock performances based on company ESG score.

```
```