Data science is an interdisciplinary academic field that aims to extract or extrapolate knowledge and insights from noisy, structured, and unstructured data.
Why data science? There are many reasons, but I believe it promotes evidence-based, unbiased thinking.
Exploratory data analysis (EDA) is an iterative cycle that helps you understand what your data says. When you do EDA, you:
Generate questions about your data
Search for answers by visualising, transforming, and/or modeling your data
Use what you learn to refine your questions and/or generate new questions
EDA is an important part of any data analysis. You can use EDA to make discoveries about the world; or you can use EDA to ensure the quality of your data, asking questions about whether the data meets your standards or not. (Posit Primers: EDA)
Macroeconomics: the study of the entire economy of a region, a country, or the entire world
World Development Indicators (WDI) is the World Bank’s premier compilation of international statistics on global development. Drawing from officially recognized sources and including national, regional, and global estimates, the WDI provides access to approximately 1,600 indicators for 217 economies, with some time series extending back more than 50 years. The database helps users find information related to development, both current and historical. The topics covered in the WDI range from poverty, health, and demographics to GDP, trade, and the environment.
The World Development Indicators website provides access to data as well as information about data coverage, curation, and methodologies, and allows users to discover what type of indicators are available, how they are collected, and how they can be visualized to analyze development trends. (Wikipedia: Link)
Datascience and Artificial Intelligence
Computer Resources
Data
Big Data
Open and Public Data
Q. Explain ‘Open Data’ and ‘Public Data’, and their differences.
A. AI Poe Assistant. (Poe.com)
“Open data” and “public data” are related concepts that refer to different aspects of data accessibility and usage.
Open Data: Open data refers to the idea that certain data should be freely available to the public, without restrictions on access, usage, or redistribution. Open data is typically released in a machine-readable format, such as CSV, JSON, or XML, and is accompanied by an open license that allows anyone to use, reuse, and distribute the data for any purpose. The goal of open data is to promote transparency, collaboration, and innovation by enabling individuals, organizations, and communities to analyze and build upon the data.
Open data often includes government data, scientific research data, public records, and other datasets that are deemed to be of public interest. Governments, international organizations, and various institutions actively promote the release of open data to foster accountability, drive economic growth, and empower citizens to make informed decisions.
Public Data: Public data, on the other hand, refers to any data that is accessible or available to the public. It is a broader term that encompasses both open data and other types of data that may have certain restrictions on access or usage. Public data can include information that is freely and openly available, as well as data that is accessible through specific channels, such as public libraries, government agencies, or subscription-based services.
Public data can include a wide range of information, such as government reports, census data, legal documents, financial statements, and more. While public data may be accessible to the public, it does not necessarily guarantee the same level of openness, usability, or freedom as open data. Public data may have limitations on its usage, copyright restrictions, or conditions imposed by the data provider.
In summary, open data specifically refers to data that is freely accessible, usable, and redistributable without restrictions, often accompanied by an open license. Public data, on the other hand, is a broader term that includes both open data and other data that may have varying degrees of accessibility and usage restrictions.
Visit Open Government Data Toolkit.
Posit.cloud: https://posit.cloud/
Posit Primers: interactive tutorials
Cheatsheets: RStudio IDE, Data Visualization, Data Transformation, Data Import, RMarkdown, etc.
Shared Project: https://posit.cloud/content/5539763
Create Your Copy and Download it.
We use two packages, tidyverse and WDI.
R packages are extensions to the R statistical programming language. R packages contain code, data, and documentation in a standardized collection format that can be installed by users of R, typically via a centralized software repository such as CRAN (the Comprehensive R Archive Network). [CRAN Link]
tidyverse: The tidyverse is a collection of open source packages for the R programming language introduced by Hadley Wickham and his team that “share an underlying design philosophy, grammar, and data structures” of tidy data. Characteristic features of tidyverse packages include extensive use of non-standard evaluation and encouraging piping. [CRAN Link]
WDI: Search and download data from over 40 databases hosted by the World Bank, including the World Development Indicators (‘WDI’), International Debt Statistics, Doing Business, Human Capital Index, and Sub-national Poverty indicators. [CRAN Link]
Step 1. Install packages if necessary.
install.packages("tidyverse")
install.packages("WDI")
Step 2. Load packages.
library(tidyverse)
library(WDI)
Step 3. Create a data directory for the first time.
dir.create("data")
Warning: 'data' already exists
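The warning above appears when dir.create() is run although the directory already exists. A minimal sketch that checks first, so the chunk can be rerun quietly. The helper name ensure_dir is made up for illustration, and a temporary location is used only to keep the example self-contained; in the project you would pass "data".

```r
# Hypothetical helper: create a directory only if it does not exist yet.
ensure_dir <- function(path) {
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE)
  }
  invisible(dir.exists(path))
}

# Illustration in a temporary location; in the project, call ensure_dir("data").
demo_dir <- file.path(tempdir(), "data_demo")
ensure_dir(demo_dir)
ensure_dir(demo_dir)  # the second call does nothing and gives no warning
```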
Step 4. Set the system language to English (recommended).
Sys.setenv(LANG = "en")
The following code chunk downloads GDP data using the indicator code below.
WDI indicator: NY.GDP.MKTP.PP.KD
df_gdp <- WDI(indicator = "NY.GDP.MKTP.PP.KD")
N.B. WDI contains many GDP-related indicators, for example, “NY.GDP.MKTP.CD”.
To avoid repeated downloads, save the data and reuse it.
CSV: comma-separated values, a plain-text format for tabular data.
write_csv(df_gdp, "data/gdp.csv")
Run the code above only once to download the data and write it into the data directory.
df_gdp <- read_csv("data/gdp.csv")
Rows: 16758 Columns: 5
── Column specification ──────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (3): country, iso2c, iso3c
dbl (2): year, NY.GDP.MKTP.PP.KD
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
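The download-once, read-afterwards pattern described above can be wrapped in a small helper. This is only a sketch under stated assumptions: read_or_download and fake_download are made-up names, base R read.csv()/write.csv() stand in for read_csv()/write_csv(), and the download is replaced by an offline function with made-up values so the example runs without internet access.

```r
# Hypothetical helper: read `path` if it exists; otherwise call `download()`,
# save the result as CSV, and return it.
read_or_download <- function(path, download) {
  if (file.exists(path)) {
    return(read.csv(path))
  }
  df <- download()
  write.csv(df, path, row.names = FALSE)
  df
}

# Offline stand-in for the real download (made-up values, illustration only).
fake_download <- function() {
  data.frame(country = c("A", "B"), year = c(2020, 2020), value = c(1.5, 2.5))
}

csv_path <- tempfile(fileext = ".csv")
first  <- read_or_download(csv_path, fake_download)  # calls download() and writes the file
second <- read_or_download(csv_path, fake_download)  # reads the saved file instead
```

In the project, the second argument might be something like function() WDI(indicator = "NY.GDP.MKTP.PP.KD"), with "data/gdp.csv" as the path.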
First, look at the data with head, str (structure), and summary.
head: display the first 6 rows by default
head(df_gdp)
head(df_gdp, 50)
2.561800e+12 is in scientific notation, i.e., \(2.561800 \times 10^{12} = 2,561,800,000,000\).
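If you prefer to see such a value written out in full, base R can format it without scientific notation. A small sketch using the value quoted above:

```r
x <- 2.561800e+12

# Write the number out in full, with commas as thousands separators.
full <- format(x, big.mark = ",", scientific = FALSE)
full

# Optionally, make R less eager to switch to scientific notation when printing.
options(scipen = 999)
```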
str: display the structure of an object
str(df_gdp)
spc_tbl_ [16,758 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ country : chr [1:16758] "Africa Eastern and Southern" "Africa Eastern and Southern" "Africa Eastern and Southern" "Africa Eastern and Southern" ...
$ iso2c : chr [1:16758] "ZH" "ZH" "ZH" "ZH" ...
$ iso3c : chr [1:16758] "AFE" "AFE" "AFE" "AFE" ...
$ year : num [1:16758] 2022 2021 2020 2019 2018 ...
$ NY.GDP.MKTP.PP.KD: num [1:16758] 2.56e+12 2.47e+12 2.37e+12 2.43e+12 2.38e+12 ...
- attr(*, "spec")=
.. cols(
.. country = col_character(),
.. iso2c = col_character(),
.. iso3c = col_character(),
.. year = col_double(),
.. NY.GDP.MKTP.PP.KD = col_double()
.. )
- attr(*, "problems")=<externalptr>
summary: display the summary of an object
summary(df_gdp)
country iso2c iso3c year NY.GDP.MKTP.PP.KD
Length:16758 Length:16758 Length:16758 Min. :1960 Min. :2.482e+07
Class :character Class :character Class :character 1st Qu.:1975 1st Qu.:1.824e+10
Mode :character Mode :character Mode :character Median :1991 Median :1.055e+11
Mean :1991 Mean :3.329e+12
3rd Qu.:2007 3rd Qu.:1.083e+12
Max. :2022 Max. :1.390e+14
NA's :9096
In an R Notebook, the following also displays the first 1000 rows of the data in a paged format.
df_gdp
|>, or %>%, is called a pipe operator, and df_gdp |> filter(country == COUNTRY) is the same as filter(df_gdp, country == COUNTRY).
filter: keep rows that match a condition
COUNTRY <- "Japan"
df_gdp |> filter(country == COUNTRY)
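To check that the piped form and the ordinary function call give the same result, here is a small sketch on a made-up table (the values are for illustration only and are not WDI data):

```r
library(dplyr)

# Made-up values, for illustration only.
toy <- tibble(
  country = c("Japan", "Japan", "World"),
  year    = c(2020, 2021, 2021),
  value   = c(1, 2, 3)
)

COUNTRY <- "Japan"
piped  <- toy |> filter(country == COUNTRY)  # pipe form
nested <- filter(toy, country == COUNTRY)    # ordinary function call

identical(piped, nested)
```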
ggplot() + geom_line(): tidyverse (ggplot2) functions to draw a line graph
aes(year, NY.GDP.MKTP.PP.KD): aesthetic mapping that sends year to the x-axis and NY.GDP.MKTP.PP.KD to the y-axis
COUNTRY <- "Japan"
df_gdp |> filter(country == COUNTRY) |>
ggplot(aes(year, NY.GDP.MKTP.PP.KD)) + geom_line()
Let’s delete the rows with missing values using drop_na(NY.GDP.MKTP.PP.KD), a transformation.
COUNTRY <- "Japan"
df_gdp |> filter(country == COUNTRY) |> drop_na(NY.GDP.MKTP.PP.KD) |>
ggplot(aes(year, NY.GDP.MKTP.PP.KD)) + geom_line()
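To see what drop_na() does on its own, here is a minimal sketch with a made-up four-row table. Only the rows whose value is missing are removed; the other columns are left untouched.

```r
library(dplyr)
library(tidyr)

# Made-up values, for illustration only.
toy <- tibble(year = 2018:2021, value = c(NA, 10, NA, 12))

kept <- toy |> drop_na(value)  # keep rows where `value` is not missing
kept
```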
COUNTRY <- "World"
df_gdp |> filter(country == COUNTRY) |>
ggplot(aes(year, NY.GDP.MKTP.PP.KD)) + geom_line()
COUNTRY <- "World"
df_gdp |> filter(country == COUNTRY) |> drop_na(NY.GDP.MKTP.PP.KD) |>
ggplot(aes(year, NY.GDP.MKTP.PP.KD)) + geom_line()
Observations and Questions
e.g. The GDP of the world has been increasing almost continuously since 1990.
There are drops around 2008 and 2020.
By country names
COUNTRIES <- c("Japan", "China", "India", "United Kingdom", "United States", "Germany", "France")
df_gdp |> filter(country %in% COUNTRIES) |> drop_na(NY.GDP.MKTP.PP.KD) |>
ggplot(aes(year, NY.GDP.MKTP.PP.KD, color = country)) + geom_line()
ISO2C <- c("JP", "CN", "IN", "GB", "US", "DE", "FR")
df_gdp |> filter(iso2c %in% ISO2C) |> drop_na(NY.GDP.MKTP.PP.KD) |>
ggplot(aes(year, NY.GDP.MKTP.PP.KD, col = iso2c)) + geom_line()
What happens if you replace the colour mapping (col = iso2c) at the bottom of the code above with colour = iso2c, color = country, or col = country?
ISO2C <- c("JP", "CN", "IN", "GB", "US", "DE", "FR")
df_gdp |> filter(iso2c %in% ISO2C) |> drop_na(NY.GDP.MKTP.PP.KD) |>
ggplot(aes(year, NY.GDP.MKTP.PP.KD, color = country)) + geom_line()
(df_codes <- df_gdp |> distinct(country, iso2c))
Set COUNTRIES and/or ISO2C to draw line graphs of GDP.
COUNTRIES <- c() # surround the country name with quotation marks, and use a comma as a separator
df_gdp |> filter(country %in% COUNTRIES) |> drop_na(NY.GDP.MKTP.PP.KD) |>
ggplot(aes(year, NY.GDP.MKTP.PP.KD, color = country)) + geom_line()
ISO2C <- c() # surround the iso2c code with quotation marks, and use a comma as a separator
df_gdp |> filter(iso2c %in% ISO2C) |> drop_na(NY.GDP.MKTP.PP.KD) |>
ggplot(aes(year, NY.GDP.MKTP.PP.KD, color = iso2c)) + geom_line()
World Bank Home Page
Excel Files
API Search
WDIsearch(string = "gdp", field = "name")
WDIsearch(string = "NY.GDP.MKTP.PP.KD", field = "indicator", short = FALSE)
Find at least one WDI indicator with its name and its code.
Find at least one pair of WDI indicators, with their names and codes, whose relationship you want to study.
GDP, PPP (constant 2017 international $): NY.GDP.MKTP.PP.KD
Population, total: SP.POP.TOTL
Calculate GDP per Capita
GDP, PPP (constant 2017 international $) PPP GDP is gross domestic product converted to international dollars using purchasing power parity rates. An international dollar has the same purchasing power over GDP as the U.S. dollar has in the United States. GDP is the sum of gross value added by all resident producers in the country plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in constant 2017 international dollars. ID: NY.GDP.MKTP.PP.KD
Population, total Total population is based on the de facto definition of population, which counts all residents regardless of legal status or citizenship. The values shown are midyear estimates. ID: SP.POP.TOTL
df_gdppcap <- WDI(indicator = c(gdp = "NY.GDP.MKTP.PP.KD", pop = "SP.POP.TOTL", gdppcap = "NY.GDP.PCAP.PP.KD"), extra = TRUE)
write_csv(df_gdppcap, "data/gdppcap.csv")
df_gdppcap <- read_csv("data/gdppcap.csv")
Rows: 16758 Columns: 15
── Column specification ──────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (7): country, iso2c, iso3c, region, capital, income, lending
dbl (6): year, gdp, pop, gdppcap, longitude, latitude
lgl (1): status
date (1): lastupdated
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(df_gdppcap)
spc_tbl_ [16,758 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ country : chr [1:16758] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
$ iso2c : chr [1:16758] "AF" "AF" "AF" "AF" ...
$ iso3c : chr [1:16758] "AFG" "AFG" "AFG" "AFG" ...
$ year : num [1:16758] 2014 2012 2009 2013 1971 ...
$ status : logi [1:16758] NA NA NA NA NA NA ...
$ lastupdated: Date[1:16758], format: "2023-09-19" "2023-09-19" "2023-09-19" ...
$ gdp : num [1:16758] 7.02e+10 6.47e+10 4.99e+10 6.83e+10 NA ...
$ pop : num [1:16758] 32716210 30466479 27385307 31541209 11015857 ...
$ gdppcap : num [1:16758] 2144 2123 1824 2165 NA ...
$ region : chr [1:16758] "South Asia" "South Asia" "South Asia" "South Asia" ...
$ capital : chr [1:16758] "Kabul" "Kabul" "Kabul" "Kabul" ...
$ longitude : num [1:16758] 69.2 69.2 69.2 69.2 69.2 ...
$ latitude : num [1:16758] 34.5 34.5 34.5 34.5 34.5 ...
$ income : chr [1:16758] "Low income" "Low income" "Low income" "Low income" ...
$ lending : chr [1:16758] "IDA" "IDA" "IDA" "IDA" ...
- attr(*, "spec")=
.. cols(
.. country = col_character(),
.. iso2c = col_character(),
.. iso3c = col_character(),
.. year = col_double(),
.. status = col_logical(),
.. lastupdated = col_date(format = ""),
.. gdp = col_double(),
.. pop = col_double(),
.. gdppcap = col_double(),
.. region = col_character(),
.. capital = col_character(),
.. longitude = col_double(),
.. latitude = col_double(),
.. income = col_character(),
.. lending = col_character()
.. )
- attr(*, "problems")=<externalptr>
df_gdppcap |> select(region, income, lending) |> lapply(unique)
$region
[1] "South Asia" "Aggregates" "Europe & Central Asia"
[4] "Middle East & North Africa" "East Asia & Pacific" "Sub-Saharan Africa"
[7] "Latin America & Caribbean" "North America" NA
$income
[1] "Low income" "Aggregates" "Upper middle income" "Lower middle income"
[5] "High income" NA "Not classified"
$lending
[1] "IDA" "Aggregates" "IBRD" "Not classified" "Blend" NA
COUNTRY <- "World"
df_gdppcap |> filter(country == COUNTRY) |>
ggplot(aes(year, gdppcap)) + geom_line()
COUNTRY <- "World"
df_gdppcap |> filter(country == COUNTRY) |>
ggplot(aes(year, pop)) + geom_line()
Write your observations and questions.
df_gdppcap2 <- df_gdppcap |> drop_na(pop) |>
mutate(PCAP = gdp/pop, .after = gdppcap)
df_gdppcap2
df_gdppcap2 |> drop_na(gdppcap, PCAP) |> mutate(near = near(gdppcap, PCAP)) |>
summarize(numberofdata = n(), sum(near))
df_gdppcap2 |> filter(!near(gdppcap, PCAP))
df_gdppcap2 |> filter(!near(gdppcap, PCAP)) |> distinct(country) |> pull()
[1] "Cyprus" "Morocco" "Russian Federation" "Sudan"
[5] "Tanzania" "Ukraine"
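The comparison above relies on near(), which treats two numbers as equal when they differ only by a tiny floating-point amount. A minimal sketch on made-up values (not WDI data) shows the difference between a rounding-level discrepancy and a real one:

```r
library(dplyr)

# Made-up values, for illustration only.
toy <- tibble(
  country = c("A", "B", "C"),
  gdp     = c(1000, 3000, 500),
  pop     = c(10, 30, 4),
  gdppcap = c(100, 100.00000000001, 120)  # "reported" per-capita values
)

checked <- toy |>
  mutate(PCAP = gdp / pop, .after = gdppcap) |>  # computed per-capita value
  mutate(agrees = near(gdppcap, PCAP))           # equal up to a small tolerance?

checked
```

Country B differs from the computed value only at the rounding level, so near() reports agreement where == would not; country C differs by a real amount.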
Write your observations and questions.
Two useful questions.
What type of variation occurs within my variables?
What type of covariation occurs between my variables?
See Link.
arrange(desc(gdp)) reorders the rows in descending order of gdp; arrange(gdp) reorders them in ascending order.
df_gdppcap |> filter(year == 2022, region != "Aggregates") |>
drop_na(gdp) |> arrange(desc(gdp))
df_gdppcap |> filter(year == 2022, region != "Aggregates") |>
drop_na(gdppcap) |> arrange(desc(gdppcap))
df_gdppcap |> filter(year == 2022, region != "Aggregates") |>
drop_na(gdppcap) |> arrange(gdppcap)
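To keep only the first few rows after sorting, arrange() can be combined with head(), or dplyr's slice_max() can be used directly. A small sketch on a made-up table (values for illustration only):

```r
library(dplyr)

# Made-up values, for illustration only.
toy <- tibble(country = c("A", "B", "C", "D"), gdp = c(5, 20, 1, 12))

largest  <- toy |> arrange(desc(gdp)) |> head(2)  # two largest values
smallest <- toy |> arrange(gdp) |> head(2)        # two smallest values
same     <- toy |> slice_max(gdp, n = 2)          # alternative to arrange(desc()) + head()
```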
Find the 10 countries with the largest population.
Find the 10 countries with the smallest population.
What type of covariation occurs between my variables?
df_gdppcap2 |> filter(year == 2022, region !="Aggregates") |>
drop_na(gdp, pop) |>
ggplot(aes(pop, gdp)) + geom_point()
df_gdppcap2 |> filter(year == 2022, region !="Aggregates") |>
drop_na(gdp, pop) |>
ggplot(aes(pop, gdp)) + geom_point() +
scale_x_log10() + scale_y_log10()
df_gdppcap2 |> filter(year == 2022, region !="Aggregates") |>
drop_na(gdp, pop) |>
ggplot(aes(pop, gdp)) + geom_point() +
geom_smooth(method = "lm", se = FALSE) +
scale_x_log10() + scale_y_log10()
df_gdppcap2 |> filter(year == 2022, region !="Aggregates") |>
drop_na(gdp, pop) |> lm(log10(gdp) ~ log10(pop), data = _) |> summary()
Call:
lm(formula = log10(gdp) ~ log10(pop), data = drop_na(filter(df_gdppcap2,
year == 2022, region != "Aggregates"), gdp, pop))
Residuals:
Min 1Q Median 3Q Max
-1.22646 -0.39512 0.03996 0.42553 0.95842
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.45320 0.27485 16.20 <2e-16 ***
log10(pop) 0.94704 0.03998 23.69 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5117 on 181 degrees of freedom
Multiple R-squared: 0.7561, Adjusted R-squared: 0.7548
F-statistic: 561.2 on 1 and 181 DF, p-value: < 2.2e-16
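Reading the coefficients above: the fitted line is log10(gdp) ≈ 4.45320 + 0.94704 × log10(pop), that is, gdp ≈ 10^4.45320 × pop^0.94704. The sketch below only reuses the two printed estimates to turn the fitted line into numbers; the function name predict_gdp is made up for illustration.

```r
# Estimates copied from the summary() output above.
intercept <- 4.45320
slope     <- 0.94704

# Hypothetical helper: fitted GDP for a given population, on the original scale.
predict_gdp <- function(pop) 10^(intercept + slope * log10(pop))

predict_gdp(1e7)                     # fitted GDP for a population of 10 million
predict_gdp(2e7) / predict_gdp(1e7)  # change in fitted GDP when population doubles
2^slope                              # the same factor, computed directly
```

Because the slope is slightly below 1, doubling the population goes with a little less than a doubling of fitted GDP. This describes the fitted line for 2022, not a causal effect.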
df_gdppcap2 |> filter(year == 2020, region !="Aggregates") |> drop_na(gdp, pop) |>
ggplot(aes(pop, gdp, color = region)) + geom_point() +
scale_x_log10() + scale_y_log10()
df_gdppcap2 |> filter(year == 2020, region !="Aggregates") |>
drop_na(gdp, pop) |>
ggplot(aes(pop, gdp, color = region, shape = income)) + geom_point() +
scale_x_log10() + scale_y_log10()
df_gdppcap2 |> filter(year == 2020, region !="Aggregates") |>
drop_na(gdp, gdppcap, pop) |>
ggplot(aes(gdppcap, gdp, color = region, size = pop)) + geom_point() +
scale_x_log10() + scale_y_log10()
install.packages("plotly")
library(plotly)
test <- df_gdppcap2 |> filter(year == 2020, region !="Aggregates") |> drop_na(gdp, pop) |>
ggplot(aes(color = country, shape = region, pop, gdp)) + geom_point() +
scale_x_log10() + scale_y_log10() + theme(legend.position = "none")
test |> ggplotly()
arrange(desc(gdp)) reorders the rows in descending order of gdp; arrange(gdp) reorders them in ascending order.
df_gdppcap |> filter(year == 2022, region != "Aggregates") |>
drop_na(gdp) |> arrange(desc(gdp))
Find the 10 countries with the highest GDP per capita.
Find the 10 countries with the lowest GDP per capita.
Find the 10 countries with the largest population.
Find the 10 countries with the smallest population.
Histogram Video in Posit Primers: Link
df_gdppcap |> filter(year == 2022, region != "Aggregates") |> drop_na(gdp) |>
ggplot(aes(gdp)) + geom_histogram()
df_gdppcap |> filter(year == 2022, region != "Aggregates") |> drop_na(gdppcap) |>
ggplot(aes(gdppcap)) + geom_histogram()
df_gdppcap |> filter(year == 2022, region != "Aggregates") |> drop_na(gdp) |>
ggplot(aes(gdp)) + geom_histogram() + scale_x_log10()
Change the number of bins with geom_histogram(bins = 20), etc.
df_gdppcap |> filter(year == 2022, region != "Aggregates") |> drop_na(gdp) |>
ggplot(aes(gdp)) + geom_histogram(bins = 20) + scale_x_log10()
df_gdppcap |> filter(year == 2022, region != "Aggregates") |> drop_na(gdppcap) |>
ggplot(aes(gdppcap)) + geom_histogram(binwidth = 10000)
Use scale_x_log10() and adjust the number of bins.
df_gdppcap |> filter(year == 2022, region != "Aggregates") |> drop_na(gdppcap) |>
ggplot(aes(gdppcap)) + geom_histogram(bins = 10) + scale_x_log10()
df_gdppcap |> filter(year == 2022, region != "Aggregates") |> drop_na(pop) |>
ggplot(aes(pop)) + geom_histogram(bins = 20) + scale_x_log10()
df_gdppcap |> filter(year == 2022, region != "Aggregates") |> drop_na(pop) |>
ggplot(aes(pop, fill = region)) + geom_histogram(col = "black", linewidth = 0.2) + scale_x_log10()
df_gdppcap |> filter(year == 2022, region != "Aggregates") |> drop_na(pop) |>
ggplot(aes(pop, fill = income)) + geom_histogram(col = "black", linewidth = 0.2) + scale_x_log10()
World by Income and Region: [Link]
Boxplot Video in Posit Primers: Link
df_gdppcap2 |> filter(year %in% c(1990,2000, 2010, 2020)) |> drop_na(gdppcap) |>
ggplot(aes(gdppcap, factor(year))) + geom_boxplot() + scale_x_log10()
df_gdppcap2 |> filter(year %in% c(1990,2000, 2010, 2020)) |> drop_na(gdppcap) |>
ggplot(aes(gdppcap, factor(year))) + geom_boxplot() + scale_x_log10() +
labs(title = "Distribution of the GDP per Capita of Countries", subtitle = "Year 1990, 2000, 2010, 2020",
y = "Year", x = "GDP per capita in log10 scale")
df_gdppcap2 |> filter(year == 2020) |> drop_na(gdppcap) |>
filter(income != "Aggregates") |>
ggplot(aes(gdppcap, income, fill = income)) + geom_boxplot() + scale_x_log10() +
theme(legend.position = "none")
df_gdppcap2 |> filter(year == 2020) |> drop_na(gdppcap) |>
filter(income != "Aggregates") |>
ggplot(aes(gdppcap, factor(income, levels = c("High income", "Upper middle income", "Lower middle income", "Low income")), fill = income)) + geom_boxplot() + scale_x_log10() +
labs(y = "") +
theme(legend.position = "none")
df_gdppcap2 |> filter(year == 2020) |> drop_na(gdp) |>
filter(income != "Aggregates") |>
ggplot(aes(gdp, region, fill = region)) + geom_boxplot() + scale_x_log10() +
theme(legend.position = "none")
CO2 emissions (metric tons per capita): EN.ATM.CO2E.PC
GDP per capita, PPP (constant 2017 international $): NY.GDP.PCAP.PP.KD
CO2 emissions (metric tons per capita) Carbon dioxide emissions are those stemming from the burning of fossil fuels and the manufacture of cement. They include carbon dioxide produced during consumption of solid, liquid, and gas fuels and gas flaring. ID: EN.ATM.CO2E.PC
GDP per capita, PPP (constant 2017 international $) GDP per capita based on purchasing power parity (PPP). PPP GDP is gross domestic product converted to international dollars using purchasing power parity rates. An international dollar has the same purchasing power over GDP as the U.S. dollar has in the United States. GDP at purchaser’s prices is the sum of gross value added by all resident producers in the country plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in constant 2017 international dollars. ID: NY.GDP.PCAP.PP.KD
df_co2gdp <- WDI(indicator = c(co2pcap = "EN.ATM.CO2E.PC", gdppcap = "NY.GDP.PCAP.PP.KD"), extra = TRUE)
write_csv(df_co2gdp, "data/co2gdp.csv")
df_co2gdp <- read_csv("data/co2gdp.csv")
Rows: 16758 Columns: 14
── Column specification ──────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (7): country, iso2c, iso3c, region, capital, income, lending
dbl (5): year, co2pcap, gdppcap, longitude, latitude
lgl (1): status
date (1): lastupdated
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
COUNTRY <- "World"
df_co2gdp |> filter(country == COUNTRY) |>
ggplot(aes(year, co2pcap)) + geom_line()
ISO2C <- c("JP", "CN", "IN", "GB", "US", "DE", "FR")
df_co2gdp |> filter(iso2c %in% ISO2C) |> drop_na(co2pcap) |>
ggplot(aes(year, co2pcap, linetype = iso2c)) + geom_line()
Change the iso2c codes to those you want to investigate. Use df_codes under the Environment tab.
Change linetype to col.
ISO2C <- c("JP", "CN", "IN", "GB", "US", "DE", "FR")
df_co2gdp |> filter(iso2c %in% ISO2C) |> drop_na(co2pcap) |>
ggplot(aes(year, co2pcap, col = iso2c)) + geom_line()
df_co2gdp |> filter(year == 2020) |> drop_na(co2pcap) |>
ggplot(aes(gdppcap, co2pcap)) + geom_point()
df_co2gdp |> filter(year == 2020) |>
drop_na(gdppcap, co2pcap) |>
ggplot(aes(gdppcap, co2pcap)) + geom_point() +
scale_x_log10() + scale_y_log10()
df_co2gdp |> filter(year == 2020) |>
drop_na(gdppcap, co2pcap) |>
ggplot(aes(gdppcap, co2pcap)) + geom_point() +
geom_smooth(method = "lm", formula = 'y~x', se = FALSE) +
scale_x_log10() + scale_y_log10()
df_co2gdp |> filter(year == 2020) |> drop_na(gdppcap, co2pcap) |>
lm(log10(co2pcap)~log10(gdppcap), data = _) |> summary()
Call:
lm(formula = log10(co2pcap) ~ log10(gdppcap), data = drop_na(filter(df_co2gdp,
year == 2020), gdppcap, co2pcap))
Residuals:
Min 1Q Median 3Q Max
-0.60778 -0.15660 -0.00651 0.16129 0.59437
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.31545 0.13386 -32.24 <2e-16 ***
log10(gdppcap) 1.13831 0.03288 34.62 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2362 on 228 degrees of freedom
Multiple R-squared: 0.8402, Adjusted R-squared: 0.8395
F-statistic: 1199 on 1 and 228 DF, p-value: < 2.2e-16
School enrollment, secondary (% gross): SE.SEC.ENRR
GDP per capita, PPP (constant 2017 international $): NY.GDP.PCAP.PP.KD
School enrollment, secondary (% gross) Gross enrollment ratio is the ratio of total enrollment, regardless of age, to the population of the age group that officially corresponds to the level of education shown. Secondary education completes the provision of basic education that began at the primary level, and aims at laying the foundations for lifelong learning and human development, by offering more subject- or skill-oriented instruction using more specialized teachers. ID: SE.SEC.ENRR
GDP per capita, PPP (constant 2017 international $) GDP per capita based on purchasing power parity (PPP). PPP GDP is gross domestic product converted to international dollars using purchasing power parity rates. An international dollar has the same purchasing power over GDP as the U.S. dollar has in the United States. GDP at purchaser’s prices is the sum of gross value added by all resident producers in the country plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in constant 2017 international dollars. ID: NY.GDP.PCAP.PP.KD
df_secgdp <- WDI(indicator = c(sec = "SE.SEC.ENRR", gdppcap = "NY.GDP.PCAP.PP.KD"), extra = TRUE)
write_csv(df_secgdp, "data/secgdp.csv")
df_secgdp <- read_csv("data/secgdp.csv")
Rows: 16758 Columns: 14
── Column specification ──────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (7): country, iso2c, iso3c, region, capital, income, lending
dbl (5): year, sec, gdppcap, longitude, latitude
lgl (1): status
date (1): lastupdated
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
COUNTRY <- "World"
df_secgdp |> filter(country == COUNTRY) |>
ggplot(aes(year, sec)) + geom_line()
COUNTRIES <- c("Low income", "Low & middle income", "Lower middle income", "Middle income", "Upper middle income", "High income")
df_secgdp |> filter(country %in% COUNTRIES) |> drop_na(sec) |>
ggplot(aes(year, sec, linetype = factor(country, levels = COUNTRIES))) + geom_line() +
labs(linetype = "Income Levels")
Change COUNTRIES to the ISO2C codes of the countries you want to investigate. Use df_codes under the Environment tab.
df_secgdp |> filter(year == 2020) |> drop_na(sec) |>
ggplot(aes(gdppcap, sec)) + geom_point()
df_secgdp |> filter(year == 2020) |> drop_na(gdppcap, sec) |>
ggplot(aes(gdppcap, sec)) + geom_point() +
scale_x_log10()
df_secgdp |> filter(year == 2020) |> drop_na(gdppcap, sec) |>
ggplot(aes(gdppcap, sec)) + geom_point() +
geom_smooth(method = "lm", formula = 'y~x', se = FALSE) +
scale_x_log10()
df_secgdp |> filter(year == 2020) |> drop_na(gdppcap, sec) |>
lm(sec~log10(gdppcap), data = _) |> summary()
Call:
lm(formula = sec ~ log10(gdppcap), data = drop_na(filter(df_secgdp,
year == 2020), gdppcap, sec))
Residuals:
Min 1Q Median 3Q Max
-53.777 -10.846 -1.173 9.006 66.996
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -102.994 11.933 -8.631 6.38e-15 ***
log10(gdppcap) 46.088 2.841 16.222 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 15.64 on 157 degrees of freedom
Multiple R-squared: 0.6263, Adjusted R-squared: 0.624
F-statistic: 263.2 on 1 and 157 DF, p-value: < 2.2e-16
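One way to read this output: the fitted line is sec = -102.994 + 46.088 × log10(gdppcap), so each tenfold increase in GDP per capita is associated with roughly 46 more percentage points of gross secondary enrollment. This describes an association in the 2020 cross-section, not a causal effect. The sketch below only re-uses the two coefficients reported above; the function name `predict_sec` is introduced here for illustration.

```r
# Coefficients copied from the summary() output above
intercept <- -102.994
slope     <- 46.088

# Predicted enrollment (% gross) for a given GDP per capita
predict_sec <- function(gdppcap) intercept + slope * log10(gdppcap)

pred_10k <- predict_sec(10000)                     # log10(10000) = 4
tenfold  <- predict_sec(1e5) - predict_sec(1e4)    # equals the slope
doubling <- slope * log10(2)                       # change when GDP per capita doubles

round(c(pred_10k = pred_10k, tenfold = tenfold, doubling = doubling), 2)
```

Because the residual standard error is about 15.6 points, individual countries can sit far from these predicted values.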
Posit Primers: Link.
Cheat Sheet: Link.
Shared Project: https://posit.cloud/content/5539763
R for Social Scientists: https://datacarpentry.org/r-socialsci/
Old Shared Project: https://rstudio.cloud/content/4858948
Data Analysis for Researchers AY2022: Link.
みんなのデータサイエンス - Data Science for All (in Japanese)
EDA is an iterative cycle that helps you understand what your data says. When you do EDA, you:
Generate questions about your data
Search for answers by visualising, transforming, and/or modeling your data
Use what you learn to refine your questions and/or generate new questions
EDA is an important part of any data analysis. You can use EDA to make discoveries about the world; or you can use EDA to ensure the quality of your data, asking questions about whether the data meets your standards or not. (Posit Primers: EDA)
df_dataframe_name <- WDI(indicator = c(name1 = "Indicator Code 1",
name2 = "Indicator Code 2"), extra = TRUE)
Write and read:
write_csv(df_dataframe_name, "data/dataframe_name.csv")
df_dataframe_name <- read_csv("data/dataframe_name.csv")
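The point of writing the data to a file and reading it back is that later runs do not need to download again. A minimal sketch of that pattern follows. The helper `read_or_fetch()` and the stand-in `fake_download()` are hypothetical names introduced here, and base `read.csv()`/`write.csv()` are used only to keep the sketch free of dependencies; `read_csv()`/`write_csv()` could be substituted.

```r
# Hypothetical helper (not part of WDI or readr): read `path` if it exists,
# otherwise call `fetch()` once, save the result, and return it.
read_or_fetch <- function(path, fetch) {
  if (file.exists(path)) {
    return(read.csv(path))
  }
  # Create the folder first; writing fails if it does not exist
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  df <- fetch()
  write.csv(df, path, row.names = FALSE)
  df
}

# Stand-in for a slow download; in practice this could wrap a WDI() call
n_calls <- 0
fake_download <- function() {
  n_calls <<- n_calls + 1
  data.frame(country = c("A", "B"), value = c(1.5, 2.5))
}

path <- file.path(tempdir(), "data", "example.csv")
unlink(path)  # start from a clean state for this demonstration

first  <- read_or_fetch(path, fake_download)   # downloads and writes
second <- read_or_fetch(path, fake_download)   # reads the saved file
n_calls
```

Note that the `data/` folder used in the code above must exist before `write_csv()` is called.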
head(), str(), summary(), and try df_dataframe_name. See also the Environment tab of RStudio.
df_dataframe_name |> filter(var == "value")
df_dataframe_name |> filter(var %in% c("value_1", ... , "value_n"))
df_dataframe_name |> filter(var != "value")
df_dataframe_name |> drop_na(var)
df_dataframe_name |> mutate(var_new = var1 * var2)
arrange()
df_dataframe_name |> arrange(var)
df_dataframe_name |> arrange(desc(var))
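The transformation verbs listed above can be chained with the pipe. A minimal sketch on an invented table follows; the tibble `toy` and its column names are made up for illustration, and `dplyr` and `tidyr` are assumed to be installed (both are loaded by `library(tidyverse)`).

```r
library(dplyr)
library(tidyr)

# Invented values, for illustration only
toy <- tibble(
  country = c("A", "B", "C", "D"),
  region  = c("X", "X", "Y", "Aggregates"),
  pop     = c(10, 20, NA, 100),
  gdp_pc  = c(5, 3, 8, 4)
)

result <- toy |>
  filter(region != "Aggregates") |>   # drop aggregate rows
  drop_na(pop) |>                     # drop rows with missing pop
  mutate(gdp = pop * gdp_pc) |>       # new column from two existing ones
  arrange(desc(gdp))                  # largest first

result
```

Each step receives the table produced by the previous one, so the order of the steps can be read from top to bottom.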
Visualizing using ggplot() + geom_*()
What type of variation occurs within my variables?
What type of covariation occurs between my variables?
transformed_data |> ggplot(aes(year, name1)) + geom_line()
transformed_data |> ggplot(aes(year, name2)) + geom_line()
transformed_data |> ggplot(aes(name1, name2)) + geom_point()
transformed_data |> ggplot(aes(name1, name2)) + geom_point() + scale_x_log10()
transformed_data |> ggplot(aes(name1, name2)) + geom_point() +
geom_smooth(method = "lm", se = FALSE)
transformed_data |> ggplot(aes(name1, name2)) + geom_point() +
geom_smooth(method = "lm", se = FALSE) + scale_x_log10()
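ggplot2 applies scale transformations before it computes statistics, so the line drawn by `geom_smooth(method = "lm")` together with `scale_x_log10()` corresponds to a fit of y on log10(x), not on x itself. The base R sketch below illustrates the difference with invented data that follow y = 2 + 3 × log10(x) exactly.

```r
# Invented data that follow y = 2 + 3 * log10(x) exactly
x <- c(10, 100, 1000, 10000)
y <- 2 + 3 * log10(x)

fit_log <- lm(y ~ log10(x))   # straight line in log10(x)
fit_raw <- lm(y ~ x)          # straight line in x itself

coef(fit_log)                 # recovers intercept 2 and slope 3
summary(fit_raw)$r.squared    # below 1: the raw-scale line misses the curve
```

This is why the regression that accompanies a log-scale plot is written as `lm(y ~ log10(x))`.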
transformed_data |> ggplot(aes(name1)) + geom_histogram()
categorical_var: factor(year), income, region
transformed_data |> ggplot(aes(categorical_var, name1)) + geom_boxplot()
library(tidyverse)
library(WDI)
We study the relation between CO2 emissions per capita and GDP per capita using the following two World Development Indicators.
df_co2gdp <- WDI(indicator = c(co2pcap = "EN.ATM.CO2E.PC", gdppcap = "NY.GDP.PCAP.PP.KD"),
extra = TRUE)
write_csv(df_co2gdp, "data/co2gdp.csv")
df_co2gdp <- read_csv("data/co2gdp.csv")
Rows: 16758 Columns: 14
── Column specification ──────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (7): country, iso2c, iso3c, region, capital, income, lending
dbl (5): year, co2pcap, gdppcap, longitude, latitude
lgl (1): status
date (1): lastupdated
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
COUNTRY <- "World"
df_co2gdp |> filter(country == COUNTRY) |> drop_na(co2pcap) |>
ggplot(aes(year, co2pcap)) + geom_line() +
labs(title = expression(paste(CO[2], " per capita of the World")),
y = expression(paste(CO[2], " per capita in tons")))
ISO2C <- c("JP", "CN", "IN", "GB", "US", "DE", "FR")  # "IN" is India; "ID" would be Indonesia
df_co2gdp |> filter(iso2c %in% ISO2C) |> drop_na(co2pcap) |>
ggplot(aes(year, co2pcap, col = iso2c)) + geom_line() +
labs(title = expression(paste(CO[2], " per capita of seven countries with large GDP")),
subtitle = "China, Germany, France, United Kingdom, India, Japan, United States",
y = expression(paste(CO[2], " per capita in tons")))
COUNTRY <- "World"
df_co2gdp |> filter(country == COUNTRY) |> drop_na(gdppcap) |>
ggplot(aes(year, gdppcap)) + geom_line() +
labs(title = "GDP per capita of the World")
ISO2C <- c("JP", "CN", "IN", "GB", "US", "DE", "FR")  # "IN" is India; "ID" would be Indonesia
df_co2gdp |> filter(iso2c %in% ISO2C) |> drop_na(gdppcap) |>
ggplot(aes(year, gdppcap, col = iso2c)) + geom_line() +
labs(title = "GDP per capita of seven countries with large GDP",
subtitle = "China, Germany, France, United Kingdom, India, Japan, United States",
y = "GDP per capita PPP",
caption = "constant 2017 international $")
df_co2gdp |> filter(year == 2020) |> filter(region != "Aggregates") |>
drop_na(co2pcap) |> arrange(desc(co2pcap))
df_co2gdp |> filter(year == 2020) |> filter(region != "Aggregates") |>
drop_na(co2pcap) |> arrange(co2pcap)
Observations and Questions:
Top 10 countries of CO2 emission per capita:
Lowest 10 countries of CO2 emission per capita:
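The `arrange()` calls above sort the whole table; to list only the top or bottom entries, the sorted table can be cut with `head()`. A minimal sketch with invented values follows (three rows instead of ten to keep it short; `dplyr` is assumed to be installed).

```r
library(dplyr)

# Invented values, for illustration only
toy <- tibble(
  country = c("A", "B", "C", "D", "E"),
  co2pcap = c(4.2, 0.3, 15.1, 7.8, 1.1)
)

top3    <- toy |> arrange(desc(co2pcap)) |> head(3)   # largest three
bottom3 <- toy |> arrange(co2pcap) |> head(3)         # smallest three

top3$country
bottom3$country
```

`slice_max(co2pcap, n = 10)` and `slice_min(co2pcap, n = 10)` in dplyr are an alternative to the `arrange()` plus `head()` pair.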
df_co2gdp |> filter(year == 2020) |> filter(region != "Aggregates") |>
drop_na(gdppcap) |> arrange(desc(gdppcap))
df_co2gdp |> filter(year == 2020) |> filter(region != "Aggregates") |>
drop_na(gdppcap) |> arrange(gdppcap)
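INCOME <- c("Low income", "Low & middle income", "Lower middle income", "Middle income", "Upper middle income", "High income")  # defined here because it is used below
df_co2gdp |> filter(year == 2020) |> filter(region != "Aggregates") |>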
drop_na(co2pcap) |> filter(income != "Not classified") |>
ggplot(aes(co2pcap, fill = factor(income, levels = INCOME))) + geom_histogram(bins = 15, col = "black", linewidth = 0.1) +
scale_x_log10() +
labs(title = "Histogram of CO2 per capita in 2020", fill = "")
df_co2gdp |> filter(year %in% c(1990, 2000, 2010, 2020)) |> filter(region != "Aggregates") |>
drop_na(co2pcap) |> filter(co2pcap > 0) |> filter(income != "Not classified") |>
ggplot(aes(co2pcap, fill = factor(year))) +
geom_histogram(bins = 15, col = "black", linewidth = 0.1) +
scale_x_log10() + facet_wrap(~year) +
labs(title = "Histogram of CO2 per capita in 1990, 2000, 2010, 2020", fill = "")
df_co2gdp |> filter(year %in% c(1990, 2000, 2010, 2020)) |> filter(region != "Aggregates") |>
drop_na(co2pcap) |> filter(co2pcap > 0) |> filter(income != "Not classified") |>
ggplot(aes(co2pcap, factor(year), fill = factor(year))) +
geom_boxplot() + scale_x_log10() + labs(y = "") + theme(legend.position = "none")
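The `filter(co2pcap > 0)` step matters because the plots use a log scale: the logarithm of zero is not finite, so such rows cannot be placed on the axis. A small base R sketch with invented values:

```r
# Invented values, including a zero and a missing value
co2pcap <- c(0, 0.5, 2, 10, NA)

log10(co2pcap)                 # the zero becomes -Inf, NA stays NA

# Keep strictly positive, non-missing values before taking logs
ok <- !is.na(co2pcap) & co2pcap > 0
log10(co2pcap[ok])
```

In the pipelines above, `drop_na()` and `filter(... > 0)` play the role of `ok`.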
df_co2gdp |> filter(year == 2020) |> filter(region != "Aggregates") |>
drop_na(co2pcap) |> filter(co2pcap > 0) |> filter(income != "Not classified") |>
ggplot(aes(co2pcap, factor(income, levels = INCOME), fill = income)) +
geom_boxplot() + scale_x_log10() +
labs(title = "CO2 per capita by income level", y = "", fill = "") +
theme(legend.position = "none")
df_co2gdp |> filter(year == 2020) |> filter(region != "Aggregates") |>
drop_na(co2pcap) |> filter(co2pcap > 0) |>
ggplot(aes(co2pcap, region, fill = region)) +
geom_boxplot() + scale_x_log10() +
labs(title = "CO2 per capita by region", y = "", fill = "") +
theme(legend.position = "none")
df_co2gdp |> filter(year == 2020) |> filter(region != "Aggregates") |>
drop_na(gdppcap) |> filter(income != "Not classified") |>
ggplot(aes(gdppcap, fill = factor(income, levels = INCOME))) + geom_histogram(bins = 15, col = "black", linewidth = 0.1) +
scale_x_log10() +
labs(title = "Histogram of GDP per capita in 2020", fill = "")
df_co2gdp |> filter(year %in% c(1990, 2000, 2010, 2020)) |> filter(region != "Aggregates") |>
drop_na(gdppcap) |> filter(gdppcap > 0) |> filter(income != "Not classified") |>
ggplot(aes(gdppcap, fill = factor(year))) +
geom_histogram(bins = 15, col = "black", linewidth = 0.1) +
scale_x_log10() + facet_wrap(~year) +
labs(title = "Histogram of GDP per capita in 1990, 2000, 2010, 2020", fill = "") +
theme(legend.position = "none")
df_co2gdp |> filter(year %in% c(1990, 2000, 2010, 2020)) |> filter(region != "Aggregates") |>
drop_na(gdppcap) |> filter(gdppcap > 0) |> filter(income != "Not classified") |>
ggplot(aes(gdppcap, factor(year), fill = factor(year))) +
geom_boxplot() + scale_x_log10() + labs(y = "") + theme(legend.position = "none")
INCOME <- c("Low income", "Low & middle income", "Lower middle income", "Middle income", "Upper middle income", "High income")
df_co2gdp |> filter(year == 2020) |> filter(region != "Aggregates") |>
drop_na(gdppcap) |> filter(gdppcap > 0) |> filter(income != "Not classified") |>
ggplot(aes(gdppcap, factor(income, levels = INCOME), fill = income)) +
geom_boxplot() + scale_x_log10() +
labs(title = "GDP per capita by income level", y = "", fill = "") +
theme(legend.position = "none")
df_co2gdp |> filter(year == 2020) |> filter(region != "Aggregates") |>
drop_na(gdppcap) |> filter(gdppcap > 0) |>
ggplot(aes(gdppcap, region, fill = region)) +
geom_boxplot() + scale_x_log10() +
labs(title = "GDP per capita by region", y = "", fill = "") +
theme(legend.position = "none")
df_co2gdp |> filter(year == 2020) |>
drop_na(gdppcap, co2pcap) |>
ggplot(aes(gdppcap, co2pcap)) + geom_point(aes(col = region)) +
geom_smooth(method = "lm", formula = 'y~x', se = FALSE) +
scale_x_log10() + scale_y_log10() +
labs(title = "GDP per capita vs CO2 per capita",
x = "GDP per capita",
y = expression(paste(CO[2], " per capita in tons")))
df_co2gdp |> filter(year == 2020) |> drop_na(gdppcap, co2pcap) |>
lm(log10(co2pcap)~log10(gdppcap), data = _) |> summary()
Call:
lm(formula = log10(co2pcap) ~ log10(gdppcap), data = drop_na(filter(df_co2gdp,
year == 2020), gdppcap, co2pcap))
Residuals:
Min 1Q Median 3Q Max
-0.60778 -0.15660 -0.00651 0.16129 0.59437
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.31545 0.13386 -32.24 <2e-16 ***
log10(gdppcap) 1.13831 0.03288 34.62 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2362 on 228 degrees of freedom
Multiple R-squared: 0.8402, Adjusted R-squared: 0.8395
F-statistic: 1199 on 1 and 228 DF, p-value: < 2.2e-16
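Because both variables are on a log10 scale, the fitted line is equivalent to a power law: co2pcap = 10^(-4.31545) × gdppcap^1.13831. A slope above 1 means that, in this 2020 cross-section, CO2 per capita rises slightly more than proportionally with GDP per capita (an association, not a causal claim). The sketch below only re-uses the reported coefficients; `predict_co2` is a name introduced here for illustration.

```r
# Coefficients copied from the summary() output above
intercept <- -4.31545
slope     <- 1.13831

# log10(co2) = intercept + slope * log10(gdp)  <=>  co2 = 10^intercept * gdp^slope
predict_co2 <- function(gdppcap) 10^intercept * gdppcap^slope

pred_10k     <- predict_co2(10000)                         # predicted tons per capita
ratio_double <- predict_co2(20000) / predict_co2(10000)    # same as 2^slope

round(c(pred_10k = pred_10k, ratio_double = ratio_double), 2)
```

Note that the regression above does not filter out `region == "Aggregates"`, so regional and income-group aggregates are among the 230 observations.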
WDIsearch(string = "school enrollment.*(% gross)", field = "name", short = FALSE)
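The `string` argument is a regular expression, so the search pattern can be tested on a few names with `grepl()` before running the search. In the sketch below the three names are used only as examples, and `ignore.case = TRUE` is an assumption made here so that "school" matches "School".

```r
# Indicator names used for illustration
indicator_names <- c(
  "School enrollment, secondary (% gross)",
  "School enrollment, tertiary (% gross)",
  "School enrollment, primary (% net)"
)

pattern <- "school enrollment.*(% gross)"

# Unescaped parentheses form a group, so this looks for "% gross"
# anywhere after "school enrollment"
matches_group <- grepl(pattern, indicator_names, ignore.case = TRUE)

# To require literal parentheses, escape them
matches_literal <- grepl("school enrollment.*\\(% gross\\)", indicator_names,
                         ignore.case = TRUE)

matches_group
matches_literal
```

Both patterns select the two "(% gross)" names here; they would differ only for a name that contains "% gross" without the surrounding parentheses.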
df_sec_ter_gdp <- WDI(indicator = c(sec = "SE.SEC.ENRR", ter = "SE.TER.ENRR",
gdppcap = "NY.GDP.PCAP.PP.KD"), extra = TRUE)
write_csv(df_sec_ter_gdp, "data/sec_ter_gdp.csv")  # save the three-indicator data, not df_secgdp
df_sec_ter_gdp <- read_csv("data/sec_ter_gdp.csv")
Rows: 16758 Columns: 14
── Column specification ──────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (7): country, iso2c, iso3c, region, capital, income, lending
dbl (5): year, sec, gdppcap, longitude, latitude
lgl (1): status
date (1): lastupdated
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
COUNTRY <- "World"
df_sec_ter_gdp |> filter(country == COUNTRY) |> drop_na(sec, ter) |>
ggplot() + geom_line(aes(year, sec), col = "blue") + geom_line(aes(year, ter), col = "red") +
labs(title = "School enrollment; Secondary and Tertiary",
subtitle = "secondary in blue and tertiary in red", y = "")
INCOME <- c("Low income", "Low & middle income", "Lower middle income", "Middle income", "Upper middle income", "High income")
df_sec_ter_gdp |> filter(country %in% INCOME) |> drop_na(sec, ter) |>
ggplot(aes(linetype = factor(country, levels = INCOME))) + geom_line(aes(year, sec), col = "blue") + geom_line(aes(year, ter), col = "red") + ylim(c(0,110)) +
labs(title = "School enrollment; Secondary and Tertiary",
subtitle = "secondary in blue and tertiary in red", linetype = "Income Levels", y = "")
df_sec_ter_gdp |> filter(year == 2020) |> drop_na(sec, ter, gdppcap) |>
ggplot() + geom_point(aes(gdppcap, sec), col = "blue") +
geom_point(aes(gdppcap, ter), col = "red") +
labs(title = "School enrollment; Secondary and Tertiary vs GDP per capita",
subtitle = "secondary in blue and tertiary in red", y = "")
df_sec_ter_gdp |> filter(year == 2020) |> drop_na(sec, ter, gdppcap) |>
ggplot() + geom_point(aes(gdppcap, sec), col = "blue") +
geom_point(aes(gdppcap, ter), col = "red") +
scale_x_log10() +
labs(title = "School enrollment; Secondary and Tertiary vs GDP per capita in log10 scale",
subtitle = "secondary in blue and tertiary in red", y = "")
df_sec_ter_gdp |> filter(year == 2020) |> drop_na(sec, ter, gdppcap) |>
ggplot() + geom_point(aes(gdppcap, sec), col = "blue") +
geom_point(aes(gdppcap, ter), col = "red") +
geom_smooth(aes(gdppcap, sec), col = "blue", method = "lm", formula = 'y~x', se = FALSE) +
geom_smooth(aes(gdppcap, ter), col = "red", method = "lm", formula = 'y~x', se = FALSE) +
scale_x_log10() +
labs(title = "School enrollment; Secondary and Tertiary vs GDP per capita in log10 scale",
subtitle = "secondary in blue and tertiary in red with regression lines", y = "")
df_sec_ter_gdp |> filter(year == 2020) |> drop_na(gdppcap, sec) |>
lm(sec~log10(gdppcap), data = _) |> summary()
Call:
lm(formula = sec ~ log10(gdppcap), data = drop_na(filter(df_sec_ter_gdp,
year == 2020), gdppcap, sec))
Residuals:
Min 1Q Median 3Q Max
-53.777 -10.846 -1.173 9.006 66.996
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -102.994 11.933 -8.631 6.38e-15 ***
log10(gdppcap) 46.088 2.841 16.222 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 15.64 on 157 degrees of freedom
Multiple R-squared: 0.6263, Adjusted R-squared: 0.624
F-statistic: 263.2 on 1 and 157 DF, p-value: < 2.2e-16
df_sec_ter_gdp |> filter(year == 2020) |> drop_na(gdppcap, ter) |>
lm(ter~log10(gdppcap), data = _) |> summary()
Call:
lm(formula = ter ~ log10(gdppcap), data = drop_na(filter(df_sec_ter_gdp,
year == 2020), gdppcap, ter))
Residuals:
Min 1Q Median 3Q Max
-72.696 -8.388 -0.808 8.589 89.657
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -159.817 13.877 -11.52 <2e-16 ***
log10(gdppcap) 49.861 3.303 15.09 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 19.18 on 157 degrees of freedom
Multiple R-squared: 0.592, Adjusted R-squared: 0.5894
F-statistic: 227.8 on 1 and 157 DF, p-value: < 2.2e-16
df_sec_ter_gdp |> filter(year == 2020, region != "Aggregates") |> drop_na(sec, region) |>
ggplot(aes(sec, region, fill = region)) + geom_boxplot() +
labs(x = "School enrollment, secondary (% gross)", y = "") + theme(legend.position = "none")
df_sec_ter_gdp |> filter(year == 2020, income !="Aggregates") |> drop_na(sec, income) |>
ggplot(aes(sec, factor(income, levels = INCOME), fill = income)) + geom_boxplot() +
labs(title = "Secondary education: School enrollment by income level", x = "School enrollment, secondary (% gross)", y = "") + theme(legend.position = "none")
df_sec_ter_gdp |> filter(year == 2020, region != "Aggregates") |> drop_na(ter, region) |>
ggplot(aes(ter, region, fill = region)) + geom_boxplot() +
labs(x = "School enrollment, tertiary (% gross)", y = "") + theme(legend.position = "none")
df_sec_ter_gdp |> filter(year == 2020, income != "Aggregates") |> drop_na(ter, income) |>
ggplot(aes(ter, factor(income, levels = INCOME), fill = income)) + geom_boxplot() +
labs(title = "Tertiary education: School enrollment by income level", x = "School enrollment, tertiary (% gross)", y = "") + theme(legend.position = "none")
Observations
We study …..
chosen_indicator_1 <- "SE.SEC.ENRR"
short_name_1 <- "sec"
chosen_indicator_2 <- "NY.GDP.PCAP.PP.KD"
short_name_2 <- "gdppcap"
df_yourdata <- WDI(indicator = c(short_name_1 = chosen_indicator_1, short_name_2 = chosen_indicator_2),
extra = TRUE)
write_csv(df_yourdata, "data/yourdata.csv")
df_yourdata <- read_csv("data/yourdata.csv")
Rows: 16758 Columns: 14
── Column specification ──────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (7): country, iso2c, iso3c, region, capital, income, lending
dbl (5): year, short_name_1, short_name_2, longitude, latitude
lgl (1): status
date (1): lastupdated
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
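As the column list above shows, the two data columns are literally called `short_name_1` and `short_name_2`: inside `c()`, the text to the left of `=` is used as the name, not the value stored in the variable. If the values `"sec"` and `"gdppcap"` are wanted as column names instead, one option is `setNames()`. A minimal base R sketch (no download involved):

```r
chosen_indicator_1 <- "SE.SEC.ENRR"
short_name_1       <- "sec"
chosen_indicator_2 <- "NY.GDP.PCAP.PP.KD"
short_name_2       <- "gdppcap"

# Inside c(), `short_name_1 = ...` uses the literal text "short_name_1" as the name
literal <- c(short_name_1 = chosen_indicator_1, short_name_2 = chosen_indicator_2)
names(literal)

# setNames() takes the names from the variables' values instead
indicators <- setNames(
  c(chosen_indicator_1, chosen_indicator_2),
  c(short_name_1, short_name_2)
)
indicators
```

Passing `indicators` to `WDI(indicator = indicators, extra = TRUE)` would then give columns named `sec` and `gdppcap`, so the template code that follows, which refers to `short_name_1` and `short_name_2`, would have to be edited to match. Keeping the literal names, as the template does, avoids that editing.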
COUNTRY <- "World"
df_yourdata |> filter(country == COUNTRY) |> drop_na(short_name_1) |>
ggplot(aes(year, short_name_1)) + geom_line() +
labs(title = "",
y = "")
Observations and Questions:
ISO2C <- c("JP", "CN", "IN", "GB", "US", "DE", "FR")  # "IN" is India; "ID" would be Indonesia
df_yourdata |> filter(iso2c %in% ISO2C) |> drop_na(short_name_1) |>
ggplot(aes(year, short_name_1, col = iso2c)) + geom_line() +
labs(title = "",
subtitle = "China, Germany, France, United Kingdom, India, Japan, United States",
y = "")
Observations and Questions:
COUNTRY <- "World"
df_yourdata |> filter(country == COUNTRY) |> drop_na(short_name_2) |>
ggplot(aes(year, short_name_2)) + geom_line() +
labs(title = "")
Observations and Questions:
ISO2C <- c("JP", "CN", "IN", "GB", "US", "DE", "FR")  # "IN" is India; "ID" would be Indonesia
df_yourdata |> filter(iso2c %in% ISO2C) |> drop_na(short_name_2) |>
ggplot(aes(year, short_name_2, col = iso2c)) + geom_line() +
labs(title = "",
subtitle = "China, Germany, France, United Kingdom, India, Japan, United States",
y = "",
caption = "")
Observations and Questions:
df_yourdata |> filter(year == 2020) |> filter(region != "Aggregates") |>
drop_na(short_name_1) |> arrange(desc(short_name_1))
df_yourdata |> filter(year == 2020) |> filter(region != "Aggregates") |>
drop_na(short_name_1) |> arrange(short_name_1)
Observations and Questions:
df_yourdata |> filter(year == 2020) |> filter(region != "Aggregates") |>
drop_na(short_name_2) |> arrange(desc(short_name_2))
Observations and Questions:
df_yourdata |> filter(year == 2020) |> filter(region != "Aggregates") |>
drop_na(short_name_2) |> arrange(short_name_2)
Observations and Questions:
df_yourdata |> filter(year == 2020) |> filter(region != "Aggregates") |>
drop_na(short_name_1) |> filter(income != "Not classified") |>
ggplot(aes(short_name_1, fill = factor(income, levels = INCOME))) + geom_histogram(bins = 15, col = "black", linewidth = 0.1) +
scale_x_log10() +
labs(title = "", fill = "")
Observations and Questions:
df_yourdata |> filter(year %in% c(1990, 2000, 2010, 2020)) |> filter(region != "Aggregates") |>
drop_na(short_name_1) |> filter(short_name_1 > 0) |> filter(income != "Not classified") |>
ggplot(aes(short_name_1, fill = factor(year))) +
geom_histogram(bins = 15, col = "black", linewidth = 0.1) +
scale_x_log10() + facet_wrap(~year) +
labs(title = "", fill = "")
Observations and Questions:
df_yourdata |> filter(year %in% c(1990, 2000, 2010, 2020)) |> filter(region != "Aggregates") |>
drop_na(short_name_1) |> filter(short_name_1 > 0) |> filter(income != "Not classified") |>
ggplot(aes(short_name_1, factor(year), fill = factor(year))) +
geom_boxplot() + scale_x_log10() + labs(y = "") + theme(legend.position = "none")
Observations and Questions:
df_yourdata |> filter(year == 2020) |> filter(region != "Aggregates") |>
drop_na(short_name_1) |> filter(short_name_1 > 0) |> filter(income != "Not classified") |>
ggplot(aes(short_name_1, factor(income, levels = INCOME), fill = income)) +
geom_boxplot() + scale_x_log10() +
labs(title = "", y = "", fill = "") +
theme(legend.position = "none")
Observations and Questions:
df_yourdata |> filter(year == 2020) |> filter(region != "Aggregates") |>
drop_na(short_name_1) |> filter(short_name_1 > 0) |>
ggplot(aes(short_name_1, region, fill = region)) +
geom_boxplot() + scale_x_log10() +
labs(title = "", y = "", fill = "") +
theme(legend.position = "none")
Observations and Questions:
df_yourdata |> filter(year == 2020) |> filter(region != "Aggregates") |>
drop_na(short_name_2) |> filter(income != "Not classified") |>
ggplot(aes(short_name_2, fill = factor(income, levels = INCOME))) + geom_histogram(bins = 15, col = "black", linewidth = 0.1) +
scale_x_log10() +
labs(title = "", fill = "")
Edit the title and the year if necessary.
df_yourdata |> filter(year %in% c(1990, 2000, 2010, 2020)) |> filter(region != "Aggregates") |>
drop_na(short_name_2) |> filter(short_name_2 > 0) |> filter(income != "Not classified") |>
ggplot(aes(short_name_2, fill = factor(year))) +
geom_histogram(bins = 15, col = "black", linewidth = 0.1) +
scale_x_log10() + facet_wrap(~year) +
labs(title = "", fill = "") +
theme(legend.position = "none")
Observations and Questions:
df_yourdata |> filter(year %in% c(1990, 2000, 2010, 2020)) |> filter(region != "Aggregates") |>
drop_na(short_name_2) |> filter(short_name_2 > 0) |> filter(income != "Not classified") |>
ggplot(aes(short_name_2, factor(year), fill = factor(year))) +
geom_boxplot() + scale_x_log10() + labs(y = "") + theme(legend.position = "none")
Observations and Questions:
df_yourdata |> filter(year == 2020) |> filter(region != "Aggregates") |>
drop_na(short_name_2) |> filter(short_name_2 > 0) |> filter(income != "Not classified") |>
ggplot(aes(short_name_2, factor(income, levels = INCOME), fill = income)) +
geom_boxplot() + scale_x_log10() +
labs(title = "", y = "", fill = "") +
theme(legend.position = "none")
Observations and Questions:
df_yourdata |> filter(year == 2020) |> filter(region != "Aggregates") |>
drop_na(short_name_2) |> filter(short_name_2 > 0) |>
ggplot(aes(short_name_2, region, fill = region)) +
geom_boxplot() + scale_x_log10() +
labs(title = "", y = "", fill = "") +
theme(legend.position = "none")
df_yourdata |> filter(year == 2020) |>
drop_na(short_name_2, short_name_1) |>
ggplot(aes(short_name_2, short_name_1)) + geom_point(aes(col = region)) +
geom_smooth(method = "lm", formula = 'y~x', se = FALSE) +
scale_x_log10() + scale_y_log10() +
labs(title = "",
x = "",
y = "")
Observations and Questions:
df_yourdata |> filter(year == 2020) |> drop_na(short_name_2, short_name_1) |>
lm(log10(short_name_1)~log10(short_name_2), data = _) |> summary()
Call:
lm(formula = log10(short_name_1) ~ log10(short_name_2), data = drop_na(filter(df_yourdata,
year == 2020), short_name_2, short_name_1))
Residuals:
Min 1Q Median 3Q Max
-0.289057 -0.056975 0.002394 0.057760 0.260181
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.8248 0.0647 12.75 <2e-16 ***
log10(short_name_2) 0.2648 0.0154 17.19 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.08479 on 157 degrees of freedom
Multiple R-squared: 0.653, Adjusted R-squared: 0.6508
F-statistic: 295.5 on 1 and 157 DF, p-value: < 2.2e-16
Observations and Questions: