Today, November 2, 2022 RStudio became Posit! (https://posit.co/)
tidyverse
package attaches ggplot2
, purrr
, tibble
, dplyr
, tidyr
, stringr
, readr
, forcats
as well.rmarkdown
package to analyze, share and reproduce.We need both R and RStudio. If you have not installed R and RStudio, please follow the instruction in ModernDive: Chapter 1 Getting Started with Data in R .
R is ‘GNU S’, a freely available language and environment for statistical computing and graphics which provides a wide variety of statistical and graphical techniques: linear and nonlinear modelling, statistical tests, time series analysis, classification, clustering, etc.
The R Project for Statistical Computing: https://www.r-project.org
Update R
Currently, R version 4.2.1 (2022-06-23) is running on this system, recommended to update once a year. If you are Windows user, you can use the following code.
if( !("installr" %in% installed.packages()) ){install.packages("installr")}
installr::updateR(TRUE)
System Language
It is more convenient to set the system language of R to be English. You can copy and paste error message for search solutions.
Sys.setenv(LANG = "en")
RStudio Site: https://www.rstudio.com
Inspired by innovators in science, education, government, and industry, RStudio develops free and open tools for R, and enterprise-ready professional products for teams who use both R and Python, to scale and share their work. (About RStudio: Inspiration)
Update RStudio
Top Menu > Help > Check Updates
R packages are extensions to the R statistical programming language. R packages contain code, data, and documentation in a standardised collection format that can be installed by users of R, typically via a centralised software repository such as CRAN (the Comprehensive R Archive Network).
Install packages
We need two packages, rmarkdown
and tidyverse
. Install these by the following code. You can also copy install.packages(c("rmarkdown", "tidyverse"))
and paste in Console. You can also use Tools > Install Packages in the top menu.
tidyverse
: https://www.tidyverse.org, https://cran.r-project.org/package=tidyversermarkdown
: https://rmarkdown.rstudio.com, https://cran.r-project.org/package=rmarkdownThe standard link to an R package is in the form: https://cran.r-project.org/package=<Package Name>
.
install.packages(c("rmarkdown", "tidyverse"))
The following code does the same. c("rmarkdown", "tidyverse")
is a vector notation in R to concatenate, or combine, two values “rmarkdown”, and “tidyverse”.
install.packages("rmarkdown")
install.packages("tidyverse")
When you use a package you need to attach it by the following code.
library(tidyverse)
Tidyverse is a collection of packages designed with consistency. If you are interested in the undelying philosophy, please take a look at Tidyverse design guide. See the messages of attaching packages and conflicts.
Create an account of RStudio Cloud: https://www.rstudio.com/products/cloud/
GET STARTED FREE
We mainly use one of the formats of R Markdwon called R Notebook.
Let us assign the iris
data in the pre-installed package datasets
to df_iris
. You can give any name starting from an alphabet, though there are some rules.
df_iris <- datasets::iris
class(df_iris)
[1] "data.frame"
The class of data iris
is data.frame
, the basic data class of R. You can assign the same data as a tibble
, the data class of tidyverse
as follows.
tbl_iris <- as_tibble(datasets::iris)
class(tbl_iris)
[1] "tbl_df" "tbl" "data.frame"
df_iris <- iris
can replace df_iris <- datasets::iris
because the package datasets
is installed and attached as default. Since you may have other data called iris
included in a differenct package or you may have changed iris
before, it is safer to specify the name of the package with the name of the data.tf_iris
and tbl_iris
behave differently. It is because of the default settings of R Markdown.The View
command open up a window to show the contents of the data and you can use the filter as well.
View(df_iris)
The following simple command also shows the data.
df_iris
The output within R Notebook is a tibble style. Try the same command in Console.
slice(df_iris, 1:10)
11+2
[1] 13
1:10
[1] 1 2 3 4 5 6 7 8 9 10
seq(1,10, by = 2)
[1] 1 3 5 7 9
Let us look at the structure of the data. You can try str(df_iris)
on Console or by adding a code chunk in R Notebook introducing later.
glimpse(df_iris)
Rows: 150
Columns: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.8, 4…
$ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.4, 3…
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.6, 1…
$ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.2, 0…
$ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, setos…
There are six types of data in R; Double, Integer, Character, Logical, Raw, Complex.
The names after $ are column names. If you call df_iris$Species
, you have the Species column. Species is in the 5th collumn, typeof(df_iris[[5]])
does the same as the next. df_iris[2,4] =
0.2 is the fourth entry of Sepal.Width.
typeof(df_iris$Species)
[1] "integer"
class(df_iris$Species)
[1] "factor"
For factors = fct
see the R Document or an explanation in Factor in R: Categorical Variable & Continuous Variables.
typeof(df_iris$Sepal.Length)
[1] "double"
class(df_iris$Sepal.Length)
[1] "numeric"
Q1. What are the differences ofdf_iris
, slice(df_iris, 1:10)
and glimpse(df_iris)
above?
Q2. What are the differences ofdf_iris
, slice(df_iris, 1:10)
and glimpse(df_iris)
in the console?
The following is very convenient to get the summary information of a data.
summary(df_iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Minimum, 1st Quadrant (25%), Median, Mean, 3rd Quadrant (75%), Maximum, and the count of each factor.
We use ggplot
to draw graphs. The scatter plot is a projection of data with two variables \(x\) and \(y\).
ggplot(data = <data>, aes(x = <column name for x>, y = <column name for y>)) +
geom_point()
ggplot(data = df_iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point()
Add title and labels adding labs()
. See Modify axis, legend, and plot labels.
ggplot(data = <data>, aes(x = <column name for x>, y = <column name for y>)) +
geom_point() +
labs(title = "Title", x = "Label for x", y = "Label for y")
ggplot(data = df_iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point() +
labs(title = "Scatter Plot of Sepal Data of Iris", x = "Sepal Length", y = "Sepal Width")
Add different colors automatically to each spieces. See Colour related aesthetics: colour, fill, and alpha. Can you see each group?
ggplot(data = df_iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
geom_point()
The boxplot compactly displays the distribution of a continuous variable. See A box and whiskers plot (in the style of Tukey).
ggplot(data = df_iris, aes(x = Species, y = Sepal.Length)) +
geom_boxplot()
Visualize the distribution of a single continuous variable by dividing the x axis into bins and counting the number of observations in each bin. Histograms (geom_histogram()) display the counts with bars. See Histograms and frequency polygons.
ggplot(data = df_iris, aes(x = Sepal.Length)) +
geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Change the number of bins by bins =
<number>
.
ggplot(data = df_iris, aes(x = Sepal.Length)) +
geom_histogram(bins = 10)
Three denity plots of the same data.
ggplot(data = df_iris, aes(x = Species, y = Sepal.Length)) +
geom_violin()
ggplot(data = df_iris, aes(x = Species, y = Sepal.Length)) +
geom_jitter(width = 0.2)
ggplot(data = df_iris, aes(x = Sepal.Length, fill = Species)) +
geom_density(alpha = 0.5)
We will not get into the mathematical models and hypothesis testings. The following is a simple linear regression model and the graph of the fitted line.
ggplot(data = df_iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
`geom_smooth()` using formula 'y ~ x'
lm(Sepal.Width ~ Sepal.Length, data = df_iris)
Call:
lm(formula = Sepal.Width ~ Sepal.Length, data = df_iris)
Coefficients:
(Intercept) Sepal.Length
3.41895 -0.06188
The formula of the line in the scattered plot is the following. \[y = (-0.06188)x + 3.41895\] The slope is \(-0.06188\) and the \(y\)-intercept is 3.41895.
It is a little strange and counter intuitive, because it claims that as the sepal width gets smaller as the sepal length gets larger. What is hidden behind?
lm(Sepal.Width ~ Sepal.Length, data = df_iris) %>% summary()
Call:
lm(formula = Sepal.Width ~ Sepal.Length, data = df_iris)
Residuals:
Min 1Q Median 3Q Max
-1.1095 -0.2454 -0.0167 0.2763 1.3338
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.41895 0.25356 13.48 <2e-16 ***
Sepal.Length -0.06188 0.04297 -1.44 0.152
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4343 on 148 degrees of freedom
Multiple R-squared: 0.01382, Adjusted R-squared: 0.007159
F-statistic: 2.074 on 1 and 148 DF, p-value: 0.1519
ggplot(data = df_iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
`geom_smooth()` using formula 'y ~ x'
ggplot(df_iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point(aes(color = Species)) +
geom_smooth(aes(color = Species), formula = y ~ x, method = "lm", se = FALSE) +
geom_smooth(formula = y ~ x, method = "lm", se = FALSE, color = "black", linetype = "twodash", size = 1.5)
The column Species is a confounding variable. See Wikipedia: Confounding, Confounding Variables | Definition, Examples & Controls.
Open RStudio first.
Run the following code by clicking the triangle or Run Current Chunk on Top bar. Alternately, you can use New Folder button in the Files tab in the right bottom pane.
dir.create("data")
Choose R Notebook from the pull down File menu in the top bar.
Default* is as follows
---
title: "R Notebook"
output: html_notebook
---
Template
---
title: "Title of R Notebook"
author: "ID and Your Name"
date: "2022-11-02"
output:
html_notebook:
# number_sections: yes
# toc: true
# toc_float: true
---
number_sections: no
.toc: true
- default is toc: false
.toc_float: true
- default is toc_float: false
Insert Chunk in Code pull down menu in the top bar, or use the C button on top. You can use shortcut keys listed under Tools in the top bar.
library(tidyverse)
There are many sites we can get data. Here we learn how to get two types of data, i.e, a data with comma separated values in *.csv
, and a Microsoft Excel Data in *.xslx
or *.xsl
in three ways. We have used a data in a package of R, e.g. datasets::iris
, and there are many other data packages of R. However, we focus on how we get so called open public data. The folowing is the definition of Open Data by World Bank.
The term ``Open Data’’ has a very precise meaning. Data or content is open if anyone is free to use, re-use or redistribute it, subject at most to measures that preserve provenance and openness.
1. The data must be , which means they must be placed in the public domain or under liberal terms of use with minimal restrictions.
2. The data must be , which means they must be published in electronic formats that are machine readable and non-proprietary, so that anyone can access and use the data using common, freely available software tools. Data must also be publicly available and accessible on a public server, without password or firewall restrictions. To make Open Data easier to find, most organizations create and manage Open Data catalogs.
See a list of open public data below.
data
folderurl_class <- "<URL of the data>"
download.file(url = url_class, destfile = "data/<file name>")
Copy the link by Right Click or Ctrl+Click and paste it in the URL.
url <- "https://data.un.org/_Docs/SYB/CSV/SYB65_1_202209_Population,%20Surface%20Area%20and%20Density.csv" # long file name
(pop <- read_csv(url))
New names:
• `` -> `...3`
• `` -> `...4`
• `` -> `...5`
• `` -> `...6`
• `` -> `...7`
Rows: 7874 Columns: 7
── Column specification ────────────────────────────────────────────────────────────
Delimiter: ","
chr (7): T02, Population, density and surface area, ...3, ...4, ...5, ...6, ...7
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
(pop <- read_csv(url))
is a short hand of the following:
pop <- read_csv(url)
pop
The first row is not the column names, so skip the first row.
url <- "https://data.un.org/_Docs/SYB/CSV/SYB65_1_202209_Population,%20Surface%20Area%20and%20Density.csv"
(pop <- read_csv(url, skip = 1))
New names:
• `` -> `...2`
Rows: 7873 Columns: 7
── Column specification ────────────────────────────────────────────────────────────
Delimiter: ","
chr (4): ...2, Series, Footnotes, Source
dbl (2): Region/Country/Area, Year
num (1): Value
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
It is better to save it in the data folder.
url <- "https://data.un.org/_Docs/SYB/CSV/SYB65_1_202209_Population,%20Surface%20Area%20and%20Density.csv"
download.file(url = url, destfile = "data/pop.csv")
trying URL 'https://data.un.org/_Docs/SYB/CSV/SYB65_1_202209_Population,%20Surface%20Area%20and%20Density.csv'
Content type 'application/octet-stream' length 2290838 bytes (2.2 MB)
==================================================
downloaded 2.2 MB
pop <- read_csv("data/pop.csv", skip = 1)
New names:
• `` -> `...2`
Rows: 7873 Columns: 7
── Column specification ───────────────────────────────────────────────────────────
Delimiter: ","
chr (4): ...2, Series, Footnotes, Source
dbl (2): Region/Country/Area, Year
num (1): Value
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pop
read_excel
, read_xls
, read_xlsx
url_class <- "https://databankfiles.worldbank.org/data/download/site-content/CLASS.xlsx"
download.file(url = url_class, destfile = "data/CLASS.xlsx")
trying URL 'https://databankfiles.worldbank.org/data/download/site-content/CLASS.xlsx'
Content type 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet' length 68155 bytes (66 KB)
==================================================
downloaded 66 KB
Let us look at the first sheet.
library(readxl)
wb_countries_tmp <- read_excel("data/CLASS.xlsx", sheet = 1, skip = 0, n_max =219)
wb_countries <- wb_countries_tmp %>%
select(country = Economy, iso3c = Code, region = Region, income = `Income group`, lending = "Lending category", other = "Other (EMU or HIPC)")
wb_countries
readxl
: https://readxl.tidyverse.orgreadxl
is a part of the tidyverse
family but not automatically attached. So attach it by library(readxl)
.read_excel
, read_xls
, read_xlsx
wb_regions_tmp <- read_excel("data/CLASS.xlsx", sheet = 1, skip = 0, n_max =266) %>%
slice(-(1:220))
wb_regions <- wb_regions_tmp %>%
select(region = Economy, iso3c = Code) %>% drop_na()
wb_regions
data
Folderobject <- read_csv("data/<file name>)
Install WDI
package for the first time by install.packages("WDI")
. See the Setup Section Above.
install.packages("WDI")
library(WDI)
Try ?WDI
in Console or WDI
in the Help tab on the right buttom pane.
WDI(country = "all",
indicator = "NY.GDP.PCAP.KD",
start = 1960,
end = 2025,
extra = FALSE,
cache = NULL)
Arguments - country: Vector of countries (ISO-2 character codes, e.g. “BR”, “US”, “CA”, or “all”) - indicator: If you supply a named vector, the indicators will be automatically renamed: c('women_private_sector' = 'BI.PWK.PRVS.FE.ZS')
WDIsearch(string = "NY.GDP.PCAP.KD",
field = "indicator", cache = NULL)
WDIsearch(string = "NY.GDP.PCAP.KD",
field = "indicator", short = FALSE, cache = NULL)
Default: short = TRUE
WDIsearch(string = "gdp per capita",
field = "name", cache = NULL)
WDIsearch(string = "6.0.GDP_current", field = "indicator", short = FALSE)
Another way to find the indicator
Go to the World Bank Open Data site and select Browse by Indicators and find the data. Then the indicator is included in the URL.
CO2 emissions (metric tons per capita)
Indicator: EN.ATM.CO2E.PC
Check the indicator by WDIsearch
.
WDIsearch(string = "EN.ATM.CO2E.PC", field = "indicator", short = FALSE)
Further Examples
WDIsearch(string = "population, total", field = "name", short = FALSE)
WDIsearch(string = "SP.POP.TOTL", field = "indicator", short = FALSE)
Exercise. Try the following on Console or in a new code chunk, and choose one interesting data with its indicator, name and description.
?WDIsearch
View(WDIsearch(string = "gdp per capita", field = "name", cache = NULL))
View(WDIsearch(string = "gdp per capita", field = "name", short = FALSE, cache = NULL))
View(WDIsearch(string = "gdp", field = "name", short = FALSE, cache = NULL))
View(WDIsearch(string = "gender", field = "name", short = FALSE, cache = NULL))
WDI
Description: GDP per capita is gross domestic product divided by midyear population. GDP is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in constant 2015 U.S. dollars.
gdpcap <- WDI(country = "all", indicator = "NY.GDP.PCAP.KD")
gdpcap
Avoiding downloading the data repeatedly as a CSV file, let us sage it in the data
folder created above.
write_csv(gdpcap, "data/gdpcap.csv")
We can download two data as one data! The GDP per capita data and the CO2 emission per capita data with indicator EN.ATM.CO2E.PC
.
Description: Carbon dioxide emissions are those stemming from the burning of fossil fuels and the manufacture of cement. They include carbon dioxide produced during consumption of solid, liquid, and gas fuels and gas flaring.
gdpcap_co2 <- WDI(country = "all", indicator = c("NY.GDP.PCAP.KD", "EN.ATM.CO2E.PC"), extra = TRUE)
gdpcap_co2
Let us save this data as well. It is very large, i.e., 14 columns and 16,492 rows, 1.9MB.
write_csv(gdpcap_co2, "data/gdpcap_co2.csv")
dplyr
Overviewdplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:
select()
picks variables based on their names.filter()
picks cases based on their values.arrange()
changes the ordering of the rows.mutate()
adds new variables that are functions of existing variablesgroup_by()
takes an existing tbl and converts it into a grouped tbl.summarise()
reduces multiple values down to a single summary.You can learn more about them in vignette(“dplyr”). As well as these single-table verbs, dplyr also provides a variety of two-table verbs, which you can learn about in vignette(“two-table”).
If you are new to dplyr, the best place to start is the data transformation chapter in R for data science.
select
: Subset columns using their names and typesHelper Function | Use | Example |
---|---|---|
- | Columns except | select(babynames, -prop) |
: | Columns between (inclusive) | select(babynames, year:n) |
contains() | Columns that contains a string | select(babynames, contains(“n”)) |
ends_with() | Columns that ends with a string | select(babynames, ends_with(“n”)) |
matches() | Columns that matches a regex | select(babynames, matches(“n”)) |
num_range() | Columns with a numerical suffix in the range | Not applicable with babynames |
one_of() | Columns whose name appear in the given set | select(babynames, one_of(c(“sex”, “gender”))) |
starts_with() | Columns that starts with a string | select(babynames, starts_with(“n”)) |
filter
: Subset rows using column valuesLogical operator | tests | Example |
---|---|---|
> | Is x greater than y? | x > y |
>= | Is x greater than or equal to y? | x >= y |
< | Is x less than y? | x < y |
<= | Is x less than or equal to y? | x <= y |
== | Is x equal to y? | x == y |
!= | Is x not equal to y? | x != y |
is.na() | Is x an NA? | is.na(x) |
!is.na() | Is x not an NA? | !is.na(x) |
arrange
and Pipe %>%
arrange()
orders the rows of a data frame by the values of selected columns.Unlike other dplyr
verbs, arrange()
largely ignores grouping; you need to explicitly mention grouping variables (or use .by_group = TRUE) in order to group by them, and functions of variables are evaluated once per data frame, not once per group. * [
pipes`](https://r4ds.had.co.nz/pipes.html) in R for Data Science.
Examples
arrange(<data>, <varible>)
arrange(<data>, desc(<variable>))
mutate
Create, modify, and delete columns
Useful mutate functions
+, -, log(), etc., for their usual mathematical meanings
lead(), lag()
dense_rank(), min_rank(), percent_rank(), row_number(), cume_dist(), ntile()
cumsum(), cummean(), cummin(), cummax(), cumany(), cumall()
na_if(), coalesce()
if_else(), recode(), case_when()
See https://ggplot2.tidyverse.org/reference/
ggplot(<data>) + geom_point(aes(x = <>, y = <>))
<data> %>% ggplot() + geom_point(aes(x = <>, y = <>))
geom_point()
: Pointsgeom_boxplot()
: A box plotgeom_histogram()
: Histogramsgeom_smooth()
: Smoothed conditional meansgdpcap <- read_csv("data/gdpcap.csv")
Rows: 16492 Columns: 5
── Column specification ────────────────────────────────────────────────────────────
Delimiter: ","
chr (3): country, iso2c, iso3c
dbl (2): year, NY.GDP.PCAP.KD
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
gdpcap
summary(gdpcap)
country iso2c iso3c year
Length:16492 Length:16492 Length:16492 Min. :1960
Class :character Class :character Class :character 1st Qu.:1975
Mode :character Mode :character Mode :character Median :1990
Mean :1990
3rd Qu.:2006
Max. :2021
NY.GDP.PCAP.KD
Min. : 144.2
1st Qu.: 1307.1
Median : 3463.2
Mean : 10674.3
3rd Qu.: 11755.0
Max. :181709.3
NA's :4042
gdpcap %>% distinct(country, iso2c)
gdpcap %>% filter(country == "World") %>%
ggplot(aes(x = year, y = NY.GDP.PCAP.KD)) +
geom_line()
gdpcap %>% filter(iso2c %in% c("BR", "RU", "IN", "CN")) %>%
ggplot(aes(x = year, y = NY.GDP.PCAP.KD, color = country)) +
geom_line() +
labs(title = "GDP per capta of BRICs", y = "GDP per capita (constant 2015 US$)")
wb_countries
above.(low_countries <- wb_countries %>% filter(income == "Low income"))
(gdpcap_low <- semi_join(gdpcap, low_countries))
Joining, by = c("country", "iso3c")
gdpcap_low %>%
ggplot(aes(x = year, y = NY.GDP.PCAP.KD, color = country)) +
geom_line() +
labs(title = "GDP per capta of low income countries", y = "GDP per capita (constant 2015 US$)")
Warning: Removed 439 row(s) containing missing values (geom_path).
Let us use the saved data.
gdpcap_co2 <- read_csv("data/gdpcap_co2.csv")
Rows: 16492 Columns: 14
── Column specification ────────────────────────────────────────────────────────────
Delimiter: ","
chr (7): country, iso2c, iso3c, region, capital, income, lending
dbl (5): year, NY.GDP.PCAP.KD, EN.ATM.CO2E.PC, longitude, latitude
lgl (1): status
date (1): lastupdated
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
gdpcap_co2
co2 <- gdpcap_co2 %>%
select(country, iso2c, iso3c, year, co2 = EN.ATM.CO2E.PC, region)
co2 %>% filter(country %in% c("France", "Germany", "Italy", "Japan", "United Kingdom", "United States", "Canada", "Russian Federation")) %>% distinct(country, iso2c, iso3c, region)
Successfully chosen data of G8 countries.
co2 %>% filter(country %in% c("France", "Germany", "Italy", "Japan", "United Kingdom", "United States", "Canada", "Russian Federation")) %>%
filter(!is.na(co2)) %>%
ggplot(aes(x = year, y = co2, color = country)) + geom_line() +
labs(title = "CO2 Emission Per Capita of G8 Countries")
Let us use the saved data.
gdpcap_co2 <- read_csv("data/gdpcap_co2.csv")
Rows: 16492 Columns: 14
── Column specification ────────────────────────────────────────────────────────────
Delimiter: ","
chr (7): country, iso2c, iso3c, region, capital, income, lending
dbl (5): year, NY.GDP.PCAP.KD, EN.ATM.CO2E.PC, longitude, latitude
lgl (1): status
date (1): lastupdated
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
gdpcap_co2
summary(gdpcap_co2)
country iso2c iso3c year
Length:16492 Length:16492 Length:16492 Min. :1960
Class :character Class :character Class :character 1st Qu.:1975
Mode :character Mode :character Mode :character Median :1990
Mean :1990
3rd Qu.:2006
Max. :2021
status lastupdated NY.GDP.PCAP.KD EN.ATM.CO2E.PC
Mode:logical Min. :2022-09-16 Min. : 144.2 Min. : 0.000
NA's:16492 1st Qu.:2022-09-16 1st Qu.: 1307.1 1st Qu.: 0.674
Median :2022-09-16 Median : 3463.2 Median : 2.460
Mean :2022-09-16 Mean : 10674.3 Mean : 4.218
3rd Qu.:2022-09-16 3rd Qu.: 11755.0 3rd Qu.: 6.209
Max. :2022-09-16 Max. :181709.3 Max. :50.954
NA's :4042 NA's :9348
region capital longitude latitude
Length:16492 Length:16492 Min. :-175.22 Min. :-41.286
Class :character Class :character 1st Qu.: -15.18 1st Qu.: 4.174
Mode :character Mode :character Median : 19.54 Median : 17.277
Mean : 19.16 Mean : 18.740
3rd Qu.: 50.53 3rd Qu.: 39.715
Max. : 179.09 Max. : 64.184
NA's :3472 NA's :3472
income lending
Length:16492 Length:16492
Class :character Class :character
Mode :character Mode :character
Since the most recent year seems to be 2021, let us choose the year.
gdpcap_co2 %>% filter(year == 2021) %>%
ggplot(aes(x = NY.GDP.PCAP.KD, y = EN.ATM.CO2E.PC)) +
geom_point()
Warning: Removed 266 rows containing missing values (geom_point).
What is wrong? There seems to be many missing values. First use filter
to choose countries, i.e., non-aggregates and see how many data are in each year.
gdpcap_co2 %>% filter(region != "Aggregates", year == "2021")
Alternately, we can find the data points in each year except NA, i.e., not available.
gdpcap_co2 %>% filter(region != "Aggregates", !is.na(NY.GDP.PCAP.KD), !is.na(EN.ATM.CO2E.PC)) %>%
group_by(year) %>% summarize(n = n())
gdpcap_co2 %>% filter(region != "Aggregates", !is.na(NY.GDP.PCAP.KD), !is.na(EN.ATM.CO2E.PC), year == 2019) %>%
ggplot(aes(x = NY.GDP.PCAP.KD, y = EN.ATM.CO2E.PC)) +
geom_point()
Most of the points are near origin. Let’s try the log-log scale.
gdpcap_co2 %>% filter(region != "Aggregates", !is.na(NY.GDP.PCAP.KD), !is.na(EN.ATM.CO2E.PC), year == 2019) %>%
ggplot(aes(x = log10(NY.GDP.PCAP.KD), y = log10(EN.ATM.CO2E.PC))) +
geom_point()
gdpco2_2019 <- gdpcap_co2 %>% select(region, year, gdp = NY.GDP.PCAP.KD, co2 = EN.ATM.CO2E.PC) %>%
filter(region != "Aggregates", !is.na(gdp), !is.na(co2), year == 2019)
gdpco2_2019 %>% ggplot(aes(x = log10(gdp), y = log10(co2))) +
geom_point() +
labs(title = bquote(~CO[2]~ "Log10 Scale Plot of "~CO[2]~" Emission Per Capita Against GDP Per Capitain 2019"),
subtitle = bquote(~CO[2]~": metric tons per capita, GDP: constant 2015 US$"),
x = bquote(~CO[2]~ "Emission Per Capita (Log10))"), y = "GDP Per Capita (Log10)") +
geom_smooth(formula = y ~ x, method = "lm", se = FALSE) +
geom_smooth(formula = y ~ x, method = "loess", col = "red")
summary(gdpco2_2019)
region year gdp co2
Length:184 Min. :2019 Min. : 278.2 Min. : 0.03699
Class :character 1st Qu.:2019 1st Qu.: 1979.2 1st Qu.: 0.80237
Mode :character Median :2019 Median : 5676.2 Median : 2.78168
Mean :2019 Mean : 13487.5 Mean : 4.15755
3rd Qu.:2019 3rd Qu.: 16122.0 3rd Qu.: 5.59986
Max. :2019 Max. :108570.0 Max. :32.47447
lm(log10(co2) ~ log10(gdp), data = gdpco2_2019) %>% summary()
Call:
lm(formula = log10(co2) ~ log10(gdp), data = gdpco2_2019)
Residuals:
Min 1Q Median 3Q Max
-0.78798 -0.18503 -0.03235 0.21130 0.67157
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.08818 0.13802 -22.38 <2e-16 ***
log10(gdp) 0.90200 0.03624 24.89 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2929 on 182 degrees of freedom
Multiple R-squared: 0.7729, Adjusted R-squared: 0.7717
F-statistic: 619.5 on 1 and 182 DF, p-value: < 2.2e-16
There are \(n = 184\) data. So gdp = \(x_1, x_2, \ldots, x_{184}\) and co2 = \(y_1, y_2, \ldots, y_{184}\). In log10 scale, log10(gdp) = \(\log_{10}(x_1), \log_{10}(x_2), \ldots, \log_{10}(x_{184})\) and log10(co2) = \(\log_{10}(y_1), \log_{10}(y_2), \ldots, \log_{10}(y_{184})\), and call them \(x_1', x_2', \ldots, x_{184}'\) and \(y_1', y_2', \ldots, y_{184}'\). The scatter plot is the points \((x_1', y_1'), (x_2', y_2'), \ldots, (x_{184}', y_{184}')\). Let \(\mathrm{mean}(x')\) and \(\mathrm{mean}(y')\) be mean of \(x_1', x_2', \ldots, x_{184}'\) and \(y_1', y_2', \ldots, y_{184}'\).
\[\mathrm{SSxx} = \sum_{i=1}^{184} (x_i' - \mathrm{mean}(x'))^2, \quad \mathrm{SSxy} = \sum_{i=1}^{184} (x_i' - \mathrm{mean}(x'))(y_i' - \mathrm{mean}(y')).\] Then, by theory, we know the slope and the y-inetrcept as follows. \[\mathrm{Slope} = \frac{\mathrm{SSxy}}{\mathrm{SSxx}}, \quad \mathrm{Intercept} = \mathrm{mean}(y') - \mathrm{Slope}\cdot \mathrm{mean}(x').\]
gdpco2log_ssxx <- sum((log10(gdpco2_2019$gdp)- mean(log10(gdpco2_2019$gdp)))^2)
gdpco2log_ssxy <- sum((log10(gdpco2_2019$gdp)- mean(log10(gdpco2_2019$gdp)))*(log10(gdpco2_2019$co2)- mean(log10(gdpco2_2019$co2))))
gdpco2log_slope <- gdpco2log_ssxy/gdpco2log_ssxx
gdpco2log_intercept <- mean(log10(gdpco2_2019$co2)) - gdpco2log_slope*mean(log10(gdpco2_2019$gdp))
c(Intercept = gdpco2log_intercept, Slope = gdpco2log_slope)
Intercept Slope
-3.0881849 0.9019973
Now consider, the squared difference at \(\log_{10}(x_i)\) between the actual value \(\log_{10}(y_i)\) and the linear estimation \(m\cdot \log_{10}(x_i) + b\). Since it is a square, it is positive. Now take the sum and want to determine \(m\) and \(b\) to minimize the sum, which is called the errors. \[\mathrm{SSE} = \sum_{i=1}^{184} (m\cdot \log_{10}(x_i) + b - \log_{10}(y_i))^2 = \sum_{i=1}^{184} (m\cdot x_i' + b - y_i')^2.\] Let \(k\) be the number of variables for estimation. In our case we estimate using only one variable, we set \(k=1\). The following value is called the Residual Standard Error. \(n-(k+1) = 182\) is called the degree of freedom. \[\mbox{Residual Standard Error} = \sqrt{\frac{\mathrm{SSE}}{n-(k+1)}} = \sqrt{\frac{\mathrm{SSE}}{182}}\]
gdpco2log_model <- lm(log10(co2) ~ log10(gdp), data = gdpco2_2019)
gdpco2log_sse <- sum((log10(gdpco2_2019$co2) - gdpco2log_model$fitted.values)^2)
(gdpco2log_rse <- sqrt(gdpco2log_sse/182))
[1] 0.2928812
The estimated value of \(b\) appears in the row (Intercept) and \(m\) appears in the row log10(gdp). Roughly speaking it is the best fit line against the data points minimizing the sum of the squared distances.
To normalize the scale, we divide the value \(\mathrm{SSyy}\) and consider the following. \[\mathrm{SSyy} = \sum_{i=1}^{184} (\log_{10}(y_i) - \mathrm{mean})^2 = \sum_{i=1}^{184} (y_i'-\mathrm{mean})^2, \quad \mbox{Multiple R Squared} = 1-\frac{\mathrm{SSE}}{\mathrm{SSyy}}\]
gdpco2log_ssyy <- sum((log10(gdpco2_2019$co2)- mean(log10(gdpco2_2019$co2)))^2)
1-gdpco2log_sse/gdpco2log_ssyy
[1] 0.7729313
Conclusion: Roughly speaking, the linear model fits much better than the estimation by means and the model explains about 77% of the scattered points.
You can read many excellent books online.
learnr
package help you to review and consolidate your understanding of basis of R.