Sunday, September 25, 2016

Managing data

For managing data, I wanted to group the variable incomeperperson in the Gapminder dataset into low, middle and high income groups. I created a new variable, 'Incomecategory', with three values: Low income (1), Middle income (2) and High income (3).

My program is as follows:

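LIBNAME mydata "/courses/d1406ae5ba27fe300" access=readonly;

/* Copy the Gapminder data and derive the income category */
DATA new;
    SET mydata.gapminder;
    LABEL incomeperperson="Per capita GDP"
          co2emissions="CO2 emissions (in metric tons)"
          urbanrate="Percentage of people in urban areas";
    /* Note: a missing incomeperperson compares as less than any number, so it also lands in category 1 */
    IF incomeperperson <= 2000 THEN Incomecategory=1;
    ELSE IF incomeperperson <= 14000 THEN Incomecategory=2;
    ELSE Incomecategory=3;
RUN;

PROC SORT DATA=new; BY country; RUN;

/* Frequency tables for the new category and the three original variables */
PROC FREQ DATA=new;
    TABLES Incomecategory incomeperperson co2emissions urbanrate;
RUN;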

This is what the grouped table looks like:



The FREQ Procedure

Incomecategory   Frequency   Percent   Cumulative Frequency   Cumulative Percent
1                      103     48.36                    103                48.36
2                       72     33.80                    175                82.16
3                       38     17.84                    213               100.00
 



This shows that just over 48% (nearly half) of the sample countries fall in the low income category, with per capita GDP of USD 2,000 or less. Another 33.8% are in the middle income category, with per capita GDP above USD 2,000 and up to USD 14,000, and a little under 18% fall in the high income category, with per capita GDP above USD 14,000. One caveat: SAS treats a missing numeric value as smaller than any number, so any country with a missing incomeperperson is also counted in the low income category by the first IF condition.
Categorising this quantitative variable will help me analyse the Gapminder data further and interpret the trends and correlations between income, urbanisation and emission levels.
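
The numeric codes 1, 2 and 3 in the table are not self-explanatory, so one option is to attach labels to them with a format. The following is only a sketch, assuming the dataset new from the program above is still in the WORK library; the format name incomefmt and the label wording are my own choices for illustration and were not part of the original run.

```sas
/* Sketch: readable labels for the Incomecategory codes.            */
/* incomefmt is a hypothetical format name chosen for this example. */
PROC FORMAT;
    VALUE incomefmt
        1 = "Low income (<= 2,000)"
        2 = "Middle income (> 2,000 to 14,000)"
        3 = "High income (> 14,000)";
RUN;

/* Re-run the frequency table with the labels applied */
PROC FREQ DATA=new;
    TABLES Incomecategory;
    FORMAT Incomecategory incomefmt.;
RUN;
```

The counts and percentages are unchanged by this; only the row labels in the PROC FREQ output differ.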

Frequency tables for the three variables urbanrate, co2emissions and incomeperperson are provided on the following pages:

Managing data - 2a

Managing data - 2

Since the Gapminder dataset does not have qualitative or categorical variables, I decided not to run a program for coding out missing data or coding in valid data.
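
Missing values still matter for the income grouping, though, because SAS orders a missing numeric value below every number. A possible way to keep such countries out of the low income group is sketched below. It assumes the dataset new created earlier; the name new_checked is hypothetical, and I have not rerun the analysis with it, so no revised counts are shown.

```sas
/* Sketch: derive Incomecategory while keeping missing incomes missing. */
/* new_checked is a hypothetical dataset name used for illustration.    */
DATA new_checked;
    SET new;
    /* Test for missing first, because a missing value is <= 2000 in SAS */
    IF MISSING(incomeperperson) THEN Incomecategory = .;
    ELSE IF incomeperperson <= 2000 THEN Incomecategory = 1;
    ELSE IF incomeperperson <= 14000 THEN Incomecategory = 2;
    ELSE Incomecategory = 3;
RUN;

/* The MISSING option lists missing values as their own row in the table */
PROC FREQ DATA=new_checked;
    TABLES Incomecategory / MISSING;
RUN;
```

Comparing this table with the one above would show how many countries, if any, were placed in category 1 only because their income was missing.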