Sunday, October 30, 2016

Creating graphs: Bivariate graphs

Looking at the Gapminder dataset, my research question was “How do urbanisation and income levels impact CO2 emissions?”

After doing a literature review on the question, my hypothesis was that while higher levels of urbanisation might lead to greater economic activity and therefore higher incomes, they do not necessarily lead to higher rates of CO2 emissions. 

Since the variables I selected are all quantitative, I wrote a program to create two separate scatter plots, with incomeperperson and urbanrate as the explanatory variables and co2emissions as the response variable in both. I then added the code for a new categorical variable (Incomecategory), which I created by binning incomeperperson into 3 categories, and the code for a bar chart with Incomecategory as the explanatory variable and co2emissions as the response variable. My program was:

LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;

/* Read the Gapminder data, label the variables, bin income into 3 categories */
DATA new; SET mydata.gapminder;
LABEL incomeperperson="Per capita GDP"
      co2emissions="CO2 emissions (in metric tons)"
      urbanrate="Percentage of people in urban areas";
IF incomeperperson <= 2000 THEN Incomecategory=1;
ELSE IF incomeperperson <= 14000 THEN Incomecategory=2;
ELSE Incomecategory=3;

PROC SORT; BY COUNTRY;

/* Frequency table for the binned variable, spread of the quantitative ones */
PROC FREQ; TABLES Incomecategory;
PROC UNIVARIATE; VAR incomeperperson co2emissions urbanrate;

/* Scatter plots with the response variable (co2emissions) on the vertical axis */
PROC GPLOT; PLOT co2emissions*incomeperperson co2emissions*urbanrate;

/* Bar chart of mean co2emissions within each income category */
PROC GCHART; VBAR Incomecategory/discrete TYPE=MEAN SUMVAR=co2emissions;
RUN;
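
One caveat about the binning step: SAS orders a missing numeric value below every number, so a row with no incomeperperson value satisfies incomeperperson <= 2000 and lands in category 1. The output at the end of this post points the same way: PROC FREQ counts 213 observations in Incomecategory, while PROC UNIVARIATE reports N=190 and 23 missing values for incomeperperson. The following is only a sketch of a variant that keeps those rows missing; the dataset name new_binned is made up for illustration and is not part of the program above.

```sas
/* Sketch only: new_binned is a hypothetical dataset name */
DATA new_binned; SET mydata.gapminder;
/* MISSING() is true when incomeperperson has no value */
IF MISSING(incomeperperson) THEN Incomecategory=.;
ELSE IF incomeperperson <= 2000 THEN Incomecategory=1;
ELSE IF incomeperperson <= 14000 THEN Incomecategory=2;
ELSE Incomecategory=3;
RUN;

/* The MISSING option lists missing values as their own row in the table */
PROC FREQ DATA=new_binned; TABLES Incomecategory / MISSING;
RUN;
```

With this variant the category 1 count would be expected to drop by the number of rows with missing income, and the bar chart means would be computed without them.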


The tables for variability and frequency are at the end of this post.

[Scatter plot 1: co2emissions against incomeperperson]

[Scatter plot 2: co2emissions against urbanrate]

As the two scatter plots show, CO2 emissions do not rise clearly with either income or rate of urbanisation. In fact, the second plot suggests a marginally negative relationship between rate of urbanisation and CO2 emissions, which is consistent with the literature I reviewed for my research question.
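
Reading the direction of a relationship off a scatter plot is hard when a few very large values dominate the axes, so it may help to put a number on it. A minimal sketch, assuming the dataset new created by the program above is still available; it was not part of the original assignment:

```sas
/* Pearson correlations of co2emissions with each explanatory variable */
PROC CORR DATA=new;
VAR incomeperperson urbanrate;
WITH co2emissions;
RUN;
```

By default PROC CORR uses, for each pair of variables, every observation where both values are present, so the two coefficients can rest on different numbers of countries.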

To get a better idea of the relationship between income and CO2 emissions, I created a bar chart with the categorical variable Incomecategory.

[Bar chart: mean co2emissions by Incomecategory]

With this quantitative-categorical graph, we see a clear positive association between income and CO2 emissions: mean emissions jump sharply in the high income category.
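
Because the bar chart plots means, and the UNIVARIATE output for co2emissions shows a skewness of about 11, a handful of very large emitters could account for much of the jump. A small sketch of one way to check this, again assuming the dataset new from the program above, is to compare the mean with the median in each category:

```sas
/* Count, mean and median of co2emissions within each income category */
PROC MEANS DATA=new N MEAN MEDIAN;
CLASS Incomecategory;
VAR co2emissions;
RUN;
```

If the medians rise across the categories as well, the pattern is less likely to be driven by a few extreme values alone.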





 

Tables for variability and frequency





The FREQ Procedure

                                         Cumulative    Cumulative
Incomecategory    Frequency    Percent    Frequency       Percent
             1          103      48.36          103         48.36
             2           72      33.80          175         82.16
             3           38      17.84          213        100.00


The UNIVARIATE Procedure
Variable: incomeperperson (Per capita GDP)

Moments
N                         190    Sum Weights                190
Mean               8740.96608    Sum Observations    1660783.55
Std Deviation      14262.8091    Variance             203427723
Skewness           3.25047792    Kurtosis            14.6656757
Uncorrected SS     5.29647E10    Corrected SS        3.84478E10
Coeff Variation    163.171999    Std Error Mean      1034.73292

Basic Statistical Measures
Location                Variability
Mean      8740.966      Std Deviation              14263
Median    2553.496      Variance               203427723
Mode      .             Range                     105044
                        Interquartile Range         8681

Tests for Location: Mu0=0
Test           Statistic        p Value
Student's t    t    8.447558    Pr > |t|     <.0001
Sign           M          95    Pr >= |M|    <.0001
Signed Rank    S      9072.5    Pr >= |S|    <.0001

Quantiles (Definition 5)
Level           Quantile
100% Max      105147.438
99%            81647.100
95%            33945.314
90%            26901.858
75% Q3          9425.326
50% Median      2553.496
25% Q1           744.239
10%              337.318
5%               242.678
1%               115.306
0% Min           103.776

Extreme Observations
      Lowest               Highest
  Value     Obs        Value     Obs
103.776      42      39972.4     145
115.306      30      52301.6     112
131.796      59      62682.1      21
155.033     108      81647.1     110
161.317      80     105147.4     128

Missing Values
                        Percent Of
Missing    Count    All Obs    Missing Obs
Value
.             23      10.80         100.00


The UNIVARIATE Procedure
Variable: co2emissions (CO2 emissions (in metric tons))

Moments
N                         200    Sum Weights                200
Mean               5033261622    Sum Observations    1.00665E12
Std Deviation      2.57381E10    Variance            6.62451E20
Skewness           11.0263976    Kurtosis            136.871113
Uncorrected SS     1.36894E23    Corrected SS        1.31828E23
Coeff Variation    511.360632    Std Error Mean      1819959808

Basic Statistical Measures
Location                Variability
Mean      5.0333E9      Std Deviation          2.57381E10
Median    1.859E8       Variance               6.62451E20
Mode      .             Range                  3.34221E11
                        Interquartile Range    1818721667

Tests for Location: Mu0=0
Test           Statistic        p Value
Student's t    t    2.765589    Pr > |t|     0.0062
Sign           M         100    Pr >= |M|    <.0001
Signed Rank    S       10050    Pr >= |S|    <.0001

Quantiles (Definition 5)
Level            Quantile
100% Max      3.34221E+11
99%           8.69552E+10
95%           2.10270E+10
90%           8.52255E+09
75% Q3        1.85270E+09
50% Median    1.85902E+08
25% Q1        3.39753E+07
10%           7.33517E+06
5%            2.65467E+06
1%            9.47833E+05
0% Min        1.32000E+05

Extreme Observations
      Lowest                 Highest
  Value     Obs          Value     Obs
 132000     144    4.12296E+10      70
 850667     192    4.60922E+10      95
1045000      44    7.25243E+10     203
1111000      99    1.01386E+11      39
1206333     121    3.34221E+11     204

Missing Values
                        Percent Of
Missing    Count    All Obs    Missing Obs
Value
.             13       6.10         100.00


The UNIVARIATE Procedure
Variable: urbanrate (Percentage of people in urban areas)

Moments
N                         203    Sum Weights                203
Mean               56.7693596    Sum Observations      11524.18
Std Deviation      23.8449326    Variance            568.580813
Skewness           -0.0188477    Kurtosis            -0.9952228
Uncorrected SS     769073.643    Corrected SS        114853.324
Coeff Variation    42.0031736    Std Error Mean      1.67358618

Basic Statistical Measures
Location                Variability
Mean       56.7694      Std Deviation           23.84493
Median     57.9400      Variance               568.58081
Mode      100.0000      Range                   89.60000
                        Interquartile Range     37.68000

Tests for Location: Mu0=0
Test           Statistic        p Value
Student's t    t    33.92079    Pr > |t|     <.0001
Sign           M       101.5    Pr >= |M|    <.0001
Signed Rank    S       10353    Pr >= |S|    <.0001

Quantiles (Definition 5)
Level         Quantile
100% Max       1.0E+02
99%            1.0E+02
95%            9.4E+01
90%            8.9E+01
75% Q3         7.5E+01
50% Median     5.8E+01
25% Q1         3.7E+01
10%            2.5E+01
5%             1.8E+01
1%             1.3E+01
0% Min         1.0E+01

Extreme Observations
    Lowest           Highest
Value     Obs    Value     Obs
10.40      30      100      35
12.54     150      100      84
12.98     200      100     113
13.22     195      100     128
14.32     110      100     174

Missing Values
                        Percent Of
Missing    Count    All Obs    Missing Obs
Value
.             10       4.69         100.00