import pandas as pd
import numpy as np
import seaborn as sns
import math
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
%matplotlib inline
import matplotlib.pyplot as plt
world_happiness2016=pd.read_csv(r"C:\datafiles\The_World_Happiness_Report_2016.csv")

For my final project I've selected the World Happiness Report released in 2016. The report ranks 157 countries by their happiness level, and the results can also be viewed by region. Questions that I would like to answer are:

What are the major factors that affect happiness? 
Which countries and regions rank highest and lowest in overall happiness?
What is the relation between happiness and kindness?

Before looking at the data, let's start with a basic question: what are the top 10 happiest countries in the world?

world_happiness2016.head(10)
selectCountryScore=  ['Country','Happiness Rank','Happiness Score']
world_happiness2016[selectCountryScore].head(10)

Understanding the data: The data in this table is straightforward. Each country has a Happiness Score, on a scale from 0 to 10 where 10 is the best possible score, and a Happiness Rank based on that score. The data is complete: it contains no empty fields and no erroneous entries. There are six factors that make up the Happiness Score for each country.

world_happiness2016.head(3)

#Let's take a look at the columns 
world_happiness2016.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157 entries, 0 to 156
Data columns (total 14 columns):
Country                          157 non-null object
Region                           157 non-null object
Happiness Rank                   157 non-null int64
Happiness Score                  157 non-null float64
Lower Confidence Interval        157 non-null float64
Upper Confidence Interval        157 non-null float64
Economy (GDP per Capita)         157 non-null float64
Family                           157 non-null float64
Health (Life Expectancy)         157 non-null float64
Freedom                          157 non-null float64
Trust (Government Corruption)    157 non-null float64
Generosity                       157 non-null float64
Dystopia Residual                157 non-null float64
RegionsMerged                    157 non-null object
dtypes: float64(10), int64(1), object(3)
memory usage: 17.2+ KB
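
The non-null counts above already suggest that no field is empty. A more direct check is to count missing values per column. The sketch below uses a hypothetical three-row sample built from values shown in this notebook's tables; in the notebook itself the same two calls would be run on world_happiness2016.

```python
import pandas as pd

# Hypothetical sample: three rows copied from the tables in this notebook.
sample = pd.DataFrame({
    "Country": ["Denmark", "Switzerland", "Iceland"],
    "Happiness Score": [7.526, 7.509, 7.501],
    "Generosity": [0.36171, 0.28083, 0.47678],
})

# isnull() flags missing cells; sum() counts them per column.
missing_per_column = sample.isnull().sum()
print(missing_per_column)

total_missing = int(missing_per_column.sum())
print("Total missing values:", total_missing)
```

A total of zero confirms there is nothing to impute or drop before modelling.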

What is happiness? In philosophy, happiness is a translation of the Greek concept of eudaimonia ("good spirit"), or the "highest human good", and refers to a person's judgment about their overall well-being, "the good life". Eudaimonia is a central concept in Aristotelian ethics.

The data in the report is classified by region. Let's take a look at how happiness is ranked across the globe using the dataset's Region column:

selectRegion=  world_happiness2016.Region
selectRegion.value_counts()

Sub-Saharan Africa                 38
Central and Eastern Europe         29
Latin America and Caribbean        24
Western Europe                     21
Middle East and Northern Africa    19
Southeastern Asia                   9
Southern Asia                       7
Eastern Asia                        6
North America                       2
Australia and New Zealand           2
Name: Region, dtype: int64
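
The counts above only say how many countries fall in each region. To compare happiness itself, one option is to group by Region and average the Happiness Score. This is a minimal sketch on the top 10 countries only; the scores come from the top 10 table in this notebook, while the Region labels for the rows beyond the first three are filled in as an assumption from the region names listed above.

```python
import pandas as pd

# Top 10 countries with their scores; Region labels are partly assumed (see lead-in).
top10 = pd.DataFrame({
    "Country": ["Denmark", "Switzerland", "Iceland", "Norway", "Finland",
                "Canada", "Netherlands", "New Zealand", "Australia", "Sweden"],
    "Region": ["Western Europe", "Western Europe", "Western Europe",
               "Western Europe", "Western Europe", "North America",
               "Western Europe", "Australia and New Zealand",
               "Australia and New Zealand", "Western Europe"],
    "Happiness Score": [7.526, 7.509, 7.501, 7.498, 7.413,
                        7.404, 7.339, 7.334, 7.313, 7.291],
})

# Number of countries and mean score per region, happiest region first.
region_summary = (top10.groupby("Region")["Happiness Score"]
                  .agg(["count", "mean"])
                  .sort_values("mean", ascending=False))
print(region_summary)
```

On the full dataset the same groupby call on world_happiness2016 would give one row per region.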

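# set the figure size before plotting; calling plt.figure afterwards only creates an empty second figure
plt.figure(figsize=(8,16))
sns.swarmplot(x="Region", y="Happiness Score",  data=world_happiness2016, s=8)
plt.xticks(rotation=90)
plt.title('Regional Happiness Ranking')
plt.show()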


Understanding how the six attributes rank against each other gives us a better point of reference for which values contribute most to the Happiness Score.

selectAttributes=  ['Economy (GDP per Capita)','Family','Health (Life Expectancy)',
                             'Freedom','Trust (Government Corruption)','Generosity']
world_happiness2016[selectAttributes].mean().plot(kind='bar')
plt.xticks(rotation=80)
plt.show()

Now that we understand the ranking of the qualifiers, and given the relationship between the Happiness Score and Region, let's take a look at happiness by region.

selectRegions = ['Sub-Saharan Africa', 'Central and Eastern Europe','Latin America and Caribbean',
                 'Western Europe','Middle East and Northern Africa','Southeastern Asia','Southern Asia',
                 'Eastern Asia','North America','Australia and New Zealand']
# column names as plain strings, so they can be used to index the DataFrame
rank = 'Happiness Rank'
score = 'Happiness Score'
selectRegionScore = ['Region', rank, score]

selectAttributes = world_happiness2016.drop(['Happiness Rank','Lower Confidence Interval','Upper Confidence Interval',
                                             'Dystopia Residual'], axis=1)

fig, axes = plt.subplots(figsize=(10, 8))
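# correlate numeric columns only; the text columns (Country, Region, RegionsMerged) cannot be correlated
corrmat = selectAttributes.select_dtypes(include='number').corr()
with sns.axes_style('white'):
    sns.heatmap(corrmat, linewidths=2, annot=True, vmax=.8, square=True, ax=axes)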
axes.set_title("Correlation Matrix")
plt.xticks(rotation=80)
plt.show()
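
Each cell of the heatmap is a Pearson correlation coefficient between two columns. As a rough illustration of what a single cell contains, here is a minimal pure-Python sketch. It uses only the five Economy and Family values shown in this notebook's X.head(5) table, so the result is illustrative and will not match the full-dataset value in the heatmap.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# First five rows of the Economy and Family columns (top five countries only).
economy = [1.44178, 1.52733, 1.42666, 1.57744, 1.40598]
family = [1.16374, 1.14524, 1.18326, 1.12690, 1.13464]

r = pearson(economy, family)
print("Pearson r on five rows:", round(r, 3))
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.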

data = dict(type = 'choropleth', 
           locations = world_happiness2016['Country'],
           locationmode = 'country names',
           z = world_happiness2016['Happiness Rank'], 
           text = world_happiness2016['Country'],
           colorbar = {'title':'Happiness Rank'})
layout = dict(title = 'World Happiness Report 2016', 
             geo = dict(showframe = False,
                       projection = {'type': 'mercator'}))
choromap3 = go.Figure(data = [data], layout=layout)
iplot(choromap3)

Linear Model to Predict Happiness

y = world_happiness2016['Happiness Score']
X = world_happiness2016.drop(['Happiness Score', 'Happiness Rank', 'Country', 'Region',
                             'Lower Confidence Interval','Upper Confidence Interval','Dystopia Residual', 'RegionsMerged'], axis=1)
X.head(5)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=40)

#Here the predictors and the target are all continuous, so ordinary linear regression is a reasonable starting point.

from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Let's take a look at the intercept and the coefficients. The intercept is the predicted Happiness Score when every predictor is zero, and each coefficient is the change in the predicted score per unit change in one qualifier (Economy, Family, Health, Freedom, Trust, or Generosity) while the others are held fixed.

lm.intercept_

2.1723729132582421

coeffecients = pd.DataFrame(lm.coef_,X.columns)
coeffecients.columns = ['Coeffecient']
coeffecients
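
To make the intercept and coefficients concrete, the sketch below rebuilds one prediction by hand: intercept plus the sum of coefficient times value. The intercept, the coefficients and Denmark's values are copied from the outputs in this notebook; the variable names are my own.

```python
# Fitted values copied from the lm.intercept_ output and the coefficient table.
intercept = 2.1723729132582421
coefficients = {
    "Economy (GDP per Capita)": 0.822006,
    "Family": 1.470933,
    "Health (Life Expectancy)": 1.124369,
    "Freedom": 1.322568,
    "Trust (Government Corruption)": 0.570433,
    "Generosity": 0.291347,
}

# Denmark's values for the six qualifiers (row 0 of the data).
denmark = {
    "Economy (GDP per Capita)": 1.44178,
    "Family": 1.16374,
    "Health (Life Expectancy)": 0.79504,
    "Freedom": 0.57941,
    "Trust (Government Corruption)": 0.44453,
    "Generosity": 0.36171,
}

# Linear model: prediction = intercept + sum(coefficient * value).
predicted = intercept + sum(coefficients[name] * denmark[name] for name in coefficients)
actual = 7.526

print("Predicted score:", round(predicted, 3))
print("Actual score:   ", actual)
print("Residual:       ", round(actual - predicted, 3))
```

This is the same arithmetic lm.predict performs for each row, and the prediction for Denmark comes out below its reported score.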

# R^2 of about 0.80 on the training data (score() returns R^2, not classification accuracy)
lm.score(X_train, y_train)

0.80151768975839421

# R^2 of about 0.74 on the held-out test data
lm.score(X_test, y_test)

0.74344697745571464
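
The value returned by lm.score is R^2, the share of the variance in the Happiness Score that the model explains. As a rough sketch of how it is computed, the function below implements R^2 = 1 - SS_res / SS_tot in plain Python. The four actual and predicted values are hypothetical and only serve to exercise the formula.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical scores, not taken from the dataset.
actual = [7.526, 6.0, 4.5, 3.0]
predicted = [7.1, 6.2, 4.4, 3.5]

r2 = r_squared(actual, predicted)
print("R^2:", round(r2, 4))

# A model that always predicts the mean explains none of the variance.
mean_only = [sum(actual) / len(actual)] * len(actual)
print("R^2 of mean-only model:", round(r_squared(actual, mean_only), 4))
```

An R^2 of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean score.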

# let's compare the actual score vs. the predicted score
plt.scatter(y_test, lm.predict(X_test))
plt.xlabel("Actual Score")
plt.ylabel("Predicted Score")
plt.title("Actual vs. Predicted Score")
plt.show()

from sklearn import metrics
predictions = lm.predict( X_test)
print('MSE:', metrics.mean_squared_error(y_test, predictions))

MSE: 0.341040379342

#if the training error is noticeably lower than the test error, the model may be overfitting the training data

predict_test=lm.predict(X_test)
predict_train=lm.predict(X_train)
msetrain=np.mean((y_train-predict_train)**2)
msetest=np.mean((y_test-predict_test)**2)

print('training MSE:'+str(msetrain))
print('test MSE:'+str(msetest))

training MSE:0.2539161254769463
test MSE:0.3410403793417715

Now that we have a working predictive model, let's remove the Trust and Generosity attributes to see how our model fares without them. In other words, let's compare R^2 without these values to see the difference.

What is the relation or impact between happiness, trust and kindness?

y = world_happiness2016['Happiness Score']
X = world_happiness2016.drop(['Happiness Score', 'Happiness Rank', 'Country', 'Region',
                             'Lower Confidence Interval','Upper Confidence Interval',
                              'Dystopia Residual', 'Trust (Government Corruption)','Generosity', 'RegionsMerged'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=40)
lm.fit(X_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

# R^2 of about 0.80 on the training data (0.797 here vs. 0.802 with all six attributes)
lm.score(X_train, y_train)

0.79745588283477487

# R^2 of about 0.73 on the test data vs. 0.74 with all six attributes
lm.score(X_test, y_test)

0.73455680507765209

# let's compare the actual score vs. the predicted score
plt.scatter(y_test, lm.predict(X_test))
plt.xlabel("Actual Score")
plt.ylabel("Predicted Score")
plt.title("Actual vs. Predicted Score")
plt.show()

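Given that the coefficients for Trust and Generosity are the smallest of the six, and that dropping both lowers R^2 by less than 0.01 on the training and test data, one can conclude that trust and kindness add very little to the predicted happiness score once the other factors are included.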
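
Plain R^2 never decreases when predictors are added, so one hedged way to compare the two models is adjusted R^2, which penalizes extra predictors. The sketch below uses the two training R^2 values printed in this notebook. The training-set size of 109 rows is my assumption (157 rows with test_size=0.3), and the function name is my own.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Assumption: 157 rows with test_size=0.3 leaves 109 rows for training.
n_train = 109

# Training R^2 values copied from the outputs above.
full_model = adjusted_r2(0.80151768975839421, n_train, 6)     # all six attributes
reduced_model = adjusted_r2(0.79745588283477487, n_train, 4)  # without Trust and Generosity

print("adjusted R^2, six attributes: ", round(full_model, 4))
print("adjusted R^2, four attributes:", round(reduced_model, 4))
print("difference:", round(full_model - reduced_model, 4))
```

Under that assumption the two adjusted values are nearly identical, which is consistent with Trust and Generosity contributing little beyond the other four attributes.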

selectRegionsMerged=  world_happiness2016.RegionsMerged
selectRegionsMerged.value_counts()

Africa                     38
Eastern Europe             29
North and South America    26
Asia                       24
Western Europe             21
Middle East                19
Name: RegionsMerged, dtype: int64

Since Generosity appears to have little relation to Happiness in our model, let's take a closer look at how each region ranked for each attribute (Economy, Family, Health, Freedom, Trust, and Generosity).

import warnings
# distplot is deprecated in newer seaborn releases (histplot/kdeplot replace it); silence the resulting warnings.
# UserWarning also covers NumPy's VisibleDeprecationWarning, which is no longer available as np.VisibleDeprecationWarning in NumPy 2.
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
def plot_compare(dataset,regions,compareCols):
    """Overlay each region's distribution for every column in compareCols, two plots per row."""
    n = len(compareCols)
    f, axes = plt.subplots(math.ceil(n/2), 2, figsize=(16, 6*math.ceil(n/2)))
    axes = axes.flatten()
    for axi, col in zip(axes, compareCols):
        for region in regions:
            this_region = dataset[dataset['RegionsMerged']==region]
            sns.distplot(this_region[col], label=region, ax=axi)
        axi.legend()
attributes=  ['Economy (GDP per Capita)','Family','Health (Life Expectancy)',
              'Freedom','Trust (Government Corruption)','Generosity']
selectRegionsMerged = ['Africa','Eastern Europe','North and South America','Asia','Western Europe','Middle East']
plot_compare(world_happiness2016,selectRegionsMerged,attributes)

# mean Happiness Rank per merged region (a lower rank means a happier region)
meanRankByRegion = world_happiness2016.groupby('RegionsMerged')['Happiness Rank'].mean()
# (region name in the data, label used in the printed sentence)
regionLabels = [('Africa','Africa'), ('Eastern Europe','Eastern Europe'), ('Middle East','the Middle East'),
                ('Asia','Asia'), ('North and South America','North and South America'),
                ('Western Europe','Western Europe')]
for region, label in regionLabels:
    print('Average Happiness Ranking for '+label+':'+str(meanRankByRegion[region]))

Average Happiness Ranking for Africa:129.6578947368421
Average Happiness Ranking for Eastern Europe:78.44827586206897
Average Happiness Ranking for the Middle East:78.10526315789474
Average Happiness Ranking for Asia:80.08333333333333
Average Happiness Ranking for North and South America:45.34615384615385
Average Happiness Ranking for Western Europe:29.19047619047619

In conclusion, based on my analysis, generosity shows little relation to happiness, and the most influential factors for the Happiness Score are Health, Economy and Family. In fact, Africa is the one region that ranks highest on Generosity, yet its overall happiness ranks lowest.

	Country	Happiness Rank	Happiness Score
0	Denmark	1	7.526
1	Switzerland	2	7.509
2	Iceland	3	7.501
3	Norway	4	7.498
4	Finland	5	7.413
5	Canada	6	7.404
6	Netherlands	7	7.339
7	New Zealand	8	7.334
8	Australia	9	7.313
9	Sweden	10	7.291

	Country	Region	Happiness Rank	Happiness Score	Lower Confidence Interval	Upper Confidence Interval	Economy (GDP per Capita)	Family	Health (Life Expectancy)	Freedom	Trust (Government Corruption)	Generosity	Dystopia Residual	RegionsMerged
0	Denmark	Western Europe	1	7.526	7.460	7.592	1.44178	1.16374	0.79504	0.57941	0.44453	0.36171	2.73939	Western Europe
1	Switzerland	Western Europe	2	7.509	7.428	7.590	1.52733	1.14524	0.86303	0.58557	0.41203	0.28083	2.69463	Western Europe
2	Iceland	Western Europe	3	7.501	7.333	7.669	1.42666	1.18326	0.86733	0.56624	0.14975	0.47678	2.83137	Western Europe

	Economy (GDP per Capita)	Family	Health (Life Expectancy)	Freedom	Trust (Government Corruption)	Generosity
0	1.44178	1.16374	0.79504	0.57941	0.44453	0.36171
1	1.52733	1.14524	0.86303	0.58557	0.41203	0.28083
2	1.42666	1.18326	0.86733	0.56624	0.14975	0.47678
3	1.57744	1.12690	0.79579	0.59609	0.35776	0.37895
4	1.40598	1.13464	0.81091	0.57104	0.41004	0.25492

	Coeffecient
Economy (GDP per Capita)	0.822006
Family	1.470933
Health (Life Expectancy)	1.124369
Freedom	1.322568
Trust (Government Corruption)	0.570433
Generosity	0.291347