For our analysis, we drew on three main sources of data: the World Health Report, the Comparative Political Data Set (CPDS), and an income inequality data set. The World Health Report was our starting data set and includes variables such as GDP, life expectancy, and Life Ladder score (a measure of happiness). It was a good starting point because it provides solid measures of quality of life, which became our main variables of interest: life expectancy and Life Ladder score. The next data set we added was the CPDS, which includes features such as the number of right- and left-wing members of government, government type, the percentage of women in government, and other government-related features. This data set allowed our analysis to focus on how governmental factors impact quality of life. The last data set we included was an income inequality data set, with features such as the Gini coefficient, median income, and poverty rate, which are all measures of inequality and wealth. This data set allowed us to identify how financial factors affect quality of life.
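As a quick reference, the illustrative summary below lists the three sources alongside a few of the features we used from each. The file names and exact column labels are assumptions for illustration, not the data sets' actual headers.

```python
# Illustrative summary of the three data sources; file names and column
# labels are assumptions, not the data sets' actual headers.
DATA_SOURCES = {
    "world_health_report.csv": [      # starting data set: quality-of-life measures
        "Life Ladder", "Life expectancy", "GDP",
    ],
    "cpds.csv": [                     # Comparative Political Data Set: government features
        "gov_right", "gov_left", "gov_type", "women_in_government",
    ],
    "income_inequality.csv": [        # inequality and wealth measures
        "gini", "median_income", "poverty_rate",
    ],
}
```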
The first step of cleaning the data was converting all values to be numerical; next, I removed all columns that did not contain quantitative data. I then added code to change some common country name spellings to the ones used by our data sets. To see the naming conventions for all countries in our data set, you can refer to the list here. My code uses regular expressions to detect the index columns for country and, if included, year, and then removes all entries missing an index value, since their data is not usable. These cleaning steps are repeated for every data set that is loaded in, and each data set is assigned to either the group with a year column or the group without one.

All data sets with a year are then combined and reformatted to be indexed by country name, with a value for every feature in every year. To remove missing values, any country with no data for any feature is dropped from the reformatted data; if a country has only some missing values, they are predicted with linear regression based on the other feature values for that country. This process leaves the reformatted data with no missing values. Once all the data with years is reformatted, the data without years is added in. Whenever data sets are merged, only countries present in both data sets are kept, so all features are retained with no missing values. After all the inputted data is merged and reformatted, it is written to a CSV file with a name of the user's choosing, inside the final_data folder within the src folder. The output of this cleaning function is in the correct format for the PC algorithm to run properly on it.
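The sketch below, using pandas and scikit-learn, approximates the pipeline described above. The `COUNTRY_FIXES` mapping, the file layout, and all function and column names are assumptions for illustration rather than the project's actual code.

```python
import re
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical spelling fixes mapping common variants onto the names our sources use.
COUNTRY_FIXES = {"United States of America": "United States", "Korea, Rep.": "South Korea"}

def detect_index_cols(df):
    """Use regular expressions to find the country column and, if present, the year column."""
    country_col = next((c for c in df.columns if re.search(r"country", str(c), re.IGNORECASE)), None)
    year_col = next((c for c in df.columns if re.search(r"year", str(c), re.IGNORECASE)), None)
    return country_col, year_col

def basic_clean(df):
    country_col, year_col = detect_index_cols(df)
    index_cols = [c for c in (country_col, year_col) if c is not None]
    # Normalize country spellings, then drop rows missing an index value.
    df[country_col] = df[country_col].replace(COUNTRY_FIXES)
    df = df.dropna(subset=index_cols)
    # Coerce everything else to numbers and drop columns with no quantitative data.
    value_cols = [c for c in df.columns if c not in index_cols]
    df[value_cols] = df[value_cols].apply(pd.to_numeric, errors="coerce")
    df = df.drop(columns=[c for c in value_cols if df[c].isna().all()])
    return df, country_col, year_col

def impute_with_regression(wide):
    """Fill missing cells by regressing each incomplete column on the fully observed ones."""
    complete = [c for c in wide.columns if wide[c].notna().all()]
    for col in wide.columns.difference(complete):
        known = wide[col].notna()
        if known.any() and (~known).any() and complete:
            model = LinearRegression().fit(wide.loc[known, complete], wide.loc[known, col])
            wide.loc[~known, col] = model.predict(wide.loc[~known, complete])
    return wide

def clean_and_merge(paths, out_name):
    yearly, static = [], []
    for path in paths:
        df, country_col, year_col = basic_clean(pd.read_csv(path))
        renames = {country_col: "country"}
        if year_col is not None:
            renames[year_col] = "year"
        # Sort each cleaned data set into the with-year or no-year group.
        (yearly if year_col is not None else static).append(df.rename(columns=renames))

    # Combine all year-indexed sources and reshape so each country is one row
    # with a column for every (feature, year) pair.
    combined = pd.concat(yearly, ignore_index=True)
    wide = combined.pivot_table(index="country", columns="year")
    wide.columns = [f"{feature}_{year}" for feature, year in wide.columns]

    wide = wide.dropna(how="all")        # a country with no data at all is removed
    wide = impute_with_regression(wide)  # remaining gaps predicted with linear regression

    # Inner joins keep only countries present in both sides of every merge,
    # so no missing values are reintroduced when the no-year data is added.
    for df in static:
        wide = wide.join(df.set_index("country"), how="inner")

    wide.to_csv(f"src/final_data/{out_name}.csv")
    return wide
```

The result is one row per country with every feature-year value filled in, which is the shape the PC algorithm expects.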
Return home by clicking here.