import pandas as pd
import numpy as np
import seaborn as sns
import math
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
%matplotlib inline
import matplotlib.pyplot as plt
world_happiness2016=pd.read_csv(r"C:\datafiles\The_World_Happiness_Report_2016.csv")
For my final project I’ve selected the World Happiness Report released in 2016. The report ranks 156 countries by their happiness level. In addition, the report can be broken down to view happiness in the world by region. The questions that I would like to answer are:
What are the major factors that affect happiness?
Which countries and regions rank highest and lowest in overall happiness?
What is the relation between happiness and kindness?
Before looking at the data, let's start with a basic question: what are the top 10 happiest countries in the world?
world_happiness2016.head(10)
selectCountryScore= ['Country','Happiness Rank','Happiness Score']
world_happiness2016[selectCountryScore].head(10)
Understanding the data: The data in this table is straightforward. It holds two main measures: the Happiness Score, rated from 0 to 10 where 10 is the best possible score, and the Happiness Rank derived from it. The data is complete; it contains no empty fields and no erroneous or mixed-type elements. There are six factors that make up the “Happiness Score” for each country.
world_happiness2016.head(3)
#Let's take a look at the columns
world_happiness2016.info()
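The completeness claim above can be checked directly rather than read off `info()`. The following is a minimal sketch, assuming a small made-up frame in place of the real CSV; the `demo` frame and its values are hypothetical and not taken from the report.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for world_happiness2016; the values are made up.
demo = pd.DataFrame({
    'Country': ['A', 'B', 'C'],
    'Happiness Score': [7.5, 6.1, np.nan],
    'Generosity': [0.3, 0.2, 0.1],
})

# Count missing values per column; all zeros would mean the data is complete.
missing_per_column = demo.isnull().sum()
print(missing_per_column)

# True only if no cell in the whole frame is missing.
is_complete = not demo.isnull().values.any()
print('complete:', is_complete)
```

On the real data, the same two calls would be applied to `world_happiness2016` instead of `demo`.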
What is Happiness? In philosophy, happiness is a translation of the Greek concept of eudaimonia ("good spirit"), the "highest human good", and refers to "the good life": a person's judgment about their overall well-being. Eudaimonia is a central concept in Aristotelian ethics.
The data in the report is classified by region. Let’s take a look at how happiness is ranked across the globe using the Region column:
selectRegion= world_happiness2016.Region
selectRegion.value_counts()
# Create the figure first so the size applies to the swarm plot
plt.figure(figsize=(8,16))
sns.swarmplot(x="Region", y="Happiness Score", data=world_happiness2016, s=8)
plt.xticks(rotation=90)
plt.title('Regional Happiness Ranking')
plt.show()
Understanding how the six attributes rank against each other gives us a better point of reference for which values are most informative in making up the Happiness Score.
selectAttributes= ['Economy (GDP per Capita)','Family','Health (Life Expectancy)',
'Freedom','Trust (Government Corruption)','Generosity']
world_happiness2016[selectAttributes].mean().plot(kind='bar')
plt.xticks(rotation=80)
plt.show()
Now that we understand the ranking of the qualifiers, and given the relationship between the Happiness Score and Region, let’s take a look at happiness by region.
selectRegions = ['Sub-Saharan Africa', 'Central and Eastern Europe','Latin America and Caribbean',
'Western Europe','Middle East and Northern Africa','Southeastern Asia','Southern Asia',
'Eastern Asia','North America','Australia and New Zealand']
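# Select the columns themselves; the quoted names were plain strings, not data
rank = world_happiness2016['Happiness Rank']
score = world_happiness2016['Happiness Score']
selectRegionScore = world_happiness2016[['Region', 'Happiness Rank', 'Happiness Score']]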
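One way to look at happiness by region is to average the score within each region and sort the result. The following is a minimal sketch, assuming a tiny made-up frame; `demo` and its numbers are hypothetical, and only the column names `Region` and `Happiness Score` come from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for world_happiness2016; the values are made up.
demo = pd.DataFrame({
    'Region': ['Western Europe', 'Western Europe',
               'Sub-Saharan Africa', 'Sub-Saharan Africa'],
    'Happiness Score': [7.0, 6.0, 4.0, 3.0],
})

# Mean Happiness Score per region, happiest region first.
region_mean = (demo.groupby('Region')['Happiness Score']
                   .mean()
                   .sort_values(ascending=False))
print(region_mean)
```

The same `groupby` applied to `world_happiness2016` would give one average per region listed in `selectRegions`.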
selectAttributes = world_happiness2016.drop(['Happiness Rank','Lower Confidence Interval','Upper Confidence Interval',
                                             'Dystopia Residual'], axis=1)
fig, axes = plt.subplots(figsize=(10, 8))
# Correlate the numeric columns only; Country and Region are text
corrmat = selectAttributes.select_dtypes(include='number').corr()
with sns.axes_style('white'):
    # Draw on the axes created above so the title lands on the heatmap
    sns.heatmap(corrmat, linewidths=2, annot=True, vmax=.8, square=True, ax=axes)
axes.set_title("Correlation Matrix")
plt.xticks(rotation=80)
plt.show()
data = dict(type = 'choropleth',
            locations = world_happiness2016['Country'],
            locationmode = 'country names',
            z = world_happiness2016['Happiness Rank'],
            text = world_happiness2016['Country'],
            # z holds the rank, so label the colorbar accordingly
            colorbar = {'title':'Happiness Rank'})
layout = dict(title = 'World Happiness Report 2016',
              geo = dict(showframe = False,
                         projection = {'type': 'mercator'}))
choromap3 = go.Figure(data = [data], layout=layout)
iplot(choromap3)
Linear Model to Predict Happiness
y = world_happiness2016['Happiness Score']
X = world_happiness2016.drop(['Happiness Score', 'Happiness Rank', 'Country', 'Region',
'Lower Confidence Interval','Upper Confidence Interval','Dystopia Residual', 'RegionsMerged'], axis=1)
X.head(5)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=40)
#For linear regression, the predictors and the target must both be continuous.
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train,y_train)
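Let's take a look at the intercept and the coefficients. The intercept is the score the model predicts when every factor is zero, that is, how far the fitted line is shifted up or down, and each coefficient shows how much the predicted score changes per unit change in that factor. As described above, the Happiness Score is modeled as a combination of the qualifiers Economy, Family, Health, Freedom, Trust, and Generosity.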
lm.intercept_
coefficients = pd.DataFrame(lm.coef_, X.columns)
coefficients.columns = ['Coefficient']
coefficients
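To make the roles of the intercept and the coefficients concrete, here is a minimal NumPy sketch of an ordinary least squares fit on made-up numbers. The arrays `X_demo` and `y_demo` are hypothetical and stand in for the training data; the sketch uses `np.linalg.lstsq` rather than scikit-learn, so it illustrates the idea and is not the notebook's own model.

```python
import numpy as np

# Made-up feature matrix: rows are countries, columns are two factors.
X_demo = np.array([[1.0, 0.5],
                   [0.8, 0.9],
                   [0.3, 0.4],
                   [0.6, 0.2],
                   [0.9, 0.7]])
# Made-up target built from a known intercept (2.0) and coefficients (1.5, 0.5).
y_demo = 2.0 + 1.5 * X_demo[:, 0] + 0.5 * X_demo[:, 1]

# Prepend a column of ones so the first fitted weight is the intercept.
design = np.column_stack([np.ones(len(X_demo)), X_demo])
weights, *_ = np.linalg.lstsq(design, y_demo, rcond=None)
intercept, coefs = weights[0], weights[1:]

# A prediction is the intercept plus the coefficient-weighted sum of the factors.
manual_prediction = intercept + X_demo @ coefs
print(intercept, coefs)
```

For the fitted `lm`, the same relationship holds: `lm.intercept_ + X_test @ lm.coef_` gives the values returned by `lm.predict(X_test)`.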
# R^2 of about 0.80 on the training data
lm.score(X_train, y_train)
# R^2 of about 0.74 on the held-out test data
lm.score(X_test, y_test)
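For a `LinearRegression`, `score` returns the coefficient of determination R^2, the share of the variance in the target that the model explains, rather than a classification accuracy. A minimal sketch of that formula follows; the helper `r_squared` and the sample values are hypothetical and only illustrate the calculation.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0]
perfect = r_squared(y_true, [3.0, 5.0, 7.0])    # predictions match exactly
baseline = r_squared(y_true, [5.0, 5.0, 5.0])   # always predicts the mean
print(perfect, baseline)
```

A model that always predicts the mean scores 0, a perfect model scores 1, and a model worse than the mean can score below 0.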
# Let's compare the actual scores with the predicted scores
plt.scatter(y_test, lm.predict(X_test))
plt.xlabel("Actual Score")
plt.ylabel("Predicted Score")
plt.title("Actual vs. Predicted Score")
plt.show()
from sklearn import metrics
predictions = lm.predict( X_test)
print('MSE:', metrics.mean_squared_error(y_test, predictions))
# If the training error is much lower than the test error, the model is overfitting the data
predict_test=lm.predict(X_test)
predict_train=lm.predict(X_train)
msetrain=np.mean((y_train-predict_train)**2)
msetest=np.mean((y_test-predict_test)**2)
print('training MSE:'+str(msetrain))
print('test MSE:'+str(msetest))
Now that we have a predictive model that works, let’s remove the attributes Trust and Generosity to see how our model fares without these elements. In other words, let’s compare R^2 without these values to see the difference.
What is the relation or impact between happiness, trust and kindness?
y = world_happiness2016['Happiness Score']
X = world_happiness2016.drop(['Happiness Score', 'Happiness Rank', 'Country', 'Region',
'Lower Confidence Interval','Upper Confidence Interval',
'Dystopia Residual', 'Trust (Government Corruption)','Generosity', 'RegionsMerged'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=40)
lm.fit(X_train,y_train)
# R^2 of about 0.79 on the training data, vs. 0.80 with all six factors
lm.score(X_train, y_train)
# R^2 of about 0.73 on the held-out test data, vs. 0.74 with all six factors
lm.score(X_test, y_test)
# Let's compare the actual scores with the predicted scores
plt.scatter(y_test, lm.predict(X_test))
plt.xlabel("Actual Score")
plt.ylabel("Predicted Score")
plt.title("Actual vs. Predicted Score")
plt.show()
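Removing Trust and Generosity lowers R^2 by only about one point on both the training data (0.80 to 0.79) and the test data (0.74 to 0.73), and their coefficient values in the full model are comparatively small. One can conclude that trust and kindness add very little to the predicted Happiness Score beyond what the other factors already explain.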
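A direct way to cross-check a conclusion like this is to look at the pairwise correlation of each factor with the Happiness Score. The following is a minimal sketch on made-up numbers; the `demo` frame is hypothetical, and its values were chosen only so that one factor tracks the score closely and the other does not.

```python
import pandas as pd

# Hypothetical stand-in for world_happiness2016; the values are made up.
demo = pd.DataFrame({
    'Happiness Score': [7.0, 6.0, 5.0, 4.0, 3.0],
    'Economy (GDP per Capita)': [1.4, 1.2, 1.0, 0.8, 0.6],
    'Generosity': [0.2, 0.4, 0.1, 0.3, 0.2],
})

# Pearson correlation of every factor with the Happiness Score.
correlations = demo.corr()['Happiness Score'].drop('Happiness Score')
print(correlations.sort_values(ascending=False))
```

On the real data, the same expression applied to the six factor columns would show how strongly each one moves with the score on its own, independent of the regression coefficients.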
selectRegionsMerged= world_happiness2016.RegionsMerged
selectRegionsMerged.value_counts()
Since Generosity appears to have little relation to Happiness, let’s take a closer look at how each region ranks on each attribute (Economy, Family, Health, Freedom, Trust, and Generosity).
import warnings
warnings.filterwarnings("ignore", category=np.VisibleDeprecationWarning)
def plot_compare(dataset,regions,compareCols):
    n = len(compareCols)
    f, axes = plt.subplots(math.ceil(n/2), 2, figsize=(16, 6*math.ceil(n/2)))
    axes = axes.flatten()
    for i in range(len(compareCols)):
        col = compareCols[i]
        axi = axes[i]
        for region in regions:
            this_region = dataset[dataset['RegionsMerged']==region]
            sns.distplot(this_region[col], label=region, ax=axi)
        axi.legend()
attributes= ['Economy (GDP per Capita)','Family','Health (Life Expectancy)',
'Freedom','Trust (Government Corruption)','Generosity']
selectRegionsMerged = ['Africa','Eastern Europe','North and South America','Asia','Western Europe','Middle East']
plot_compare(world_happiness2016,selectRegionsMerged,attributes)
# Mean Happiness Rank per merged region (a lower rank means a happier region)
averageRank = world_happiness2016.groupby('RegionsMerged')['Happiness Rank'].mean()
for region in ['Africa', 'Eastern Europe', 'Middle East', 'Asia',
               'North and South America', 'Western Europe']:
    print('Average Happiness Ranking for ' + region + ':' + str(averageRank[region]))
In conclusion, based on my analysis, generosity has little to no relation to happiness, and the factors most influential on the Happiness Score are Health, Economy, and Family. In fact, Africa is the one region that ranks highest on Generosity, yet its overall happiness ranks lowest.