We will use the list of Presidents of Peru from a Wikipedia page to play a bit with some cool R packages (XML, dplyr, lubridate, ggplot2, and googleVis). We will use them to extract and clean up the data, and later to make some summaries and plots.
For this experiment, we will need the following libraries:
- XML: to parse and extract a table from an HTML page
- dplyr: to do some data manipulation
- lubridate: to do some date operations
- ggplot2: to generate a nice boxplot
- googleVis: to make some interactive tables and plots (I am using the development version from GitHub)
If you don’t have them installed, then you might want to run:
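A minimal sketch of that installation step, assuming the CRAN releases are good enough for you (the development version of googleVis mentioned above is not covered here). The `needed` and `to_install` names and the choice of CRAN mirror are mine, not from the original post:

```r
# Packages used in this post (CRAN releases; an assumption, see the note above)
needed <- c("XML", "dplyr", "lubridate", "ggplot2", "googleVis")

# Only install what is not already available in the local library
to_install <- setdiff(needed, rownames(installed.packages()))

if (length(to_install) > 0) {
  install.packages(to_install, repos = "https://cloud.r-project.org")
}
```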
Getting and mangling the data
First, let’s read the data from the third HTML table in Wikipedia’s page: “List of Presidents of Peru”
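The idea can be sketched offline with a small made-up HTML string instead of the live Wikipedia page, so it runs without a network connection. The table contents and the `html`, `doc`, and `presidents` names are invented for illustration; only `htmlParse()` and `readHTMLTable()` with its `which` argument come from the XML package. For the real page you would parse the downloaded HTML instead (as far as I know, `htmlParse()` cannot fetch `https://` URLs by itself).

```r
library(XML)

# Toy page with three tables; only the third one matters (made-up content)
html <- paste0(
  "<html><body>",
  "<table><tr><th>a</th></tr><tr><td>1</td></tr></table>",
  "<table><tr><th>b</th></tr><tr><td>2</td></tr></table>",
  "<table><tr><th>President</th><th>Form of entry</th></tr>",
  "<tr><td>President A</td><td>Direct Elections</td></tr>",
  "<tr><td>President B</td><td>Interim caretaker</td></tr></table>",
  "</body></html>"
)

doc <- htmlParse(html, asText = TRUE)

# which = 3 selects the third <table> node found in the document
presidents <- readHTMLTable(doc, which = 3, header = TRUE,
                            stringsAsFactors = FALSE)
print(presidents)
```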
Then, we ought to fix some weirdness in the data, and will save it to a CSV just in case we want to do some more processing in the future. As we are keeping the original column names from the HTML table, some code is a bit more cumbersome (because we need to use backticks).
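To show why the backticks are needed, here is a small sketch on made-up data. The actual clean-up steps are not reproduced here, so trimming stray whitespace is only a stand-in for them, and the file name is hypothetical:

```r
library(dplyr)

# Made-up rows; check.names = FALSE keeps the column name with spaces intact
raw <- data.frame(
  President = c(" President A", "President B "),
  `Form of entry` = c("Direct Elections ", " Interim caretaker"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Column names that are not valid R identifiers must be wrapped in backticks
clean <- raw %>%
  mutate(
    President = trimws(President),
    `Form of entry` = trimws(`Form of entry`)
  )

# Save a copy for later processing (written to a temporary directory here)
csv_file <- file.path(tempdir(), "presidents_toy.csv")
write.csv(clean, csv_file, row.names = FALSE)
```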
Displaying the data as a sortable and paginated table
Let’s look at the data we got after scraping Wikipedia and mangling values around. We’ll make an interactive table using the gvisTable function from the googleVis package.
We want to paginate the table, because Peru has had 97 people who held the Presidency at one point or another. The table is a bit wide, so it will look nicer.
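A paginated table along these lines can be sketched as follows. The data frame is made up, and the page size and width are arbitrary choices of mine; `page` and `pageSize` are options of the underlying Google Table chart that `gvisTable()` passes through.

```r
library(googleVis)

# Made-up data: 25 rows, so that pagination has something to do
toy <- data.frame(
  President = paste("President", 1:25),
  Order = 1:25,
  stringsAsFactors = FALSE
)

# page = "enable" turns pagination on; pageSize is the number of rows per page
tbl <- gvisTable(toy,
                 options = list(page = "enable", pageSize = 10, width = 600))

# Opens the table in a browser; skipped when running non-interactively
if (interactive()) plot(tbl)
```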
Creating a timeline chart
Now, let’s visualize the succession of presidents using a timeline chart as implemented in googleVis, coloring each timespan by what the original data calls “Form of entry”, which is how a particular person got into the Presidency. There are 3 records that do not have a value for that field, so we will recode those as “Unknown”.
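The recoding and the chart call could look roughly like this. Names, dates, and column names are invented for the sketch. Putting the form of entry in `barlabel` is my assumption about how to get the coloring, based on my understanding that the Timeline chart uses one color per bar label by default.

```r
library(googleVis)

# Made-up terms; the second one has no form of entry recorded
terms <- data.frame(
  president = c("President A", "President B", "President C"),
  entry = c("Direct Elections", NA, "Interim caretaker"),
  start = as.Date(c("1900-01-01", "1904-01-01", "1904-06-01")),
  end = as.Date(c("1904-01-01", "1904-06-01", "1909-06-01")),
  stringsAsFactors = FALSE
)

# Recode missing or empty values as "Unknown"
terms$entry[is.na(terms$entry) | terms$entry == ""] <- "Unknown"

# One row per president; bars are labelled (and so colored) by form of entry
timeline <- gvisTimeline(terms,
                         rowlabel = "president", barlabel = "entry",
                         start = "start", end = "end",
                         options = list(width = 900, height = 250))

if (interactive()) plot(timeline)
```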
This chart is also a bit wide, because the data spans over 190 years.
You might have noticed that at some points in Peru’s history we had more than one President, and at other times they seem to change rapidly or to swing back and forth among a number of recurring characters. Such was our lot back then, but we have had better luck for some decades now.
Understanding how they got into power
We will make a cumulative frequency chart, using dplyr to manipulate and summarize the data and googleVis to plot it. We could’ve used table() along with other base functions, but dplyr’s syntax is cleaner and more readable.
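The summarizing part can be sketched with dplyr on made-up counts (the categories and numbers below are invented, not the real figures): count each form of entry, sort from most to least frequent, and accumulate the percentages.

```r
library(dplyr)

# Made-up data: 10 records spread over 4 forms of entry
entries <- data.frame(
  form = c(rep("Direct Elections", 4), rep("Coup", 3),
           rep("Interim caretaker", 2), "Other"),
  stringsAsFactors = FALSE
)

freq <- entries %>%
  group_by(form) %>%
  summarise(n = n()) %>%          # frequency of each form of entry
  arrange(desc(n)) %>%            # most common first
  mutate(pct = 100 * n / sum(n),  # share of all records
         cum_pct = cumsum(pct))   # running total of the shares

print(freq)
```

The resulting `freq` data frame is what would then be handed to a googleVis chart function.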
In this chart we can plainly see that the first 4 modes of attaining office (“Direct Elections”, “Coup d’état”, “Interim caretaker”, and “Elected by Congress”) account for the majority (a bit over 81%) of all the times the office of President has been attained in Peru.
Length of time in office
If we want to know the distribution of the lengths of time in office for all presidents, we can do some simple data exploration and create a histogram, with the median and mean overlaid on it:
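A sketch of such a histogram with ggplot2, using made-up lengths in years; the `len_office` name is borrowed from the test output shown later in the post, and the rest is illustrative:

```r
library(ggplot2)

# Made-up, right-skewed lengths of time in office (years)
terms <- data.frame(len_office = c(0.01, 0.3, 0.5, 1, 1.5, 2, 4, 5, 11))

# Mean and median, kept in a small data frame so they get a legend entry
centre <- data.frame(
  stat = c("mean", "median"),
  value = c(mean(terms$len_office), median(terms$len_office))
)

p <- ggplot(terms, aes(x = len_office)) +
  geom_histogram(binwidth = 1) +
  geom_vline(data = centre, aes(xintercept = value, linetype = stat)) +
  labs(x = "Length of time in office (years)", y = "Count")

if (interactive()) print(p)
```

With skewed data like this, the mean line sits well to the right of the median line, which is the pattern the text describes.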
We can see a typical right-skewed distribution, with a great majority of short terms in office (as little as 2 days) and some exceptionally long ones (as much as ~11.15 years). So in this case, the mean (2.04 years) is not very informative, and the median (0.96 years) looks suspiciously short.
Let’s look at these time spans, grouping them by the way each president attained the office.
In this chart we have added a reference line at the official length of a President’s term in office in Peru: 5 years. It would seem that if you got into office by “Direct Elections” you have a better chance of reaching the usual term (median ~ 4 years), but if you got in by another route (let’s say by “Coup d’état”) you are more likely to be there for a short time.
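A boxplot with that kind of reference line can be sketched as below. The data are made up, and `group` and `len_office` are the names that appear in the test output further down; everything else is illustrative.

```r
library(ggplot2)

# Made-up lengths in office (years) for two forms of entry
terms <- data.frame(
  group = c(rep("Direct Elections", 4), rep("Coup", 4)),
  len_office = c(3, 4, 5, 5, 0.2, 0.5, 1, 3),
  stringsAsFactors = FALSE
)

official_term <- 5  # official length of a presidential term, in years

p <- ggplot(terms, aes(x = group, y = len_office)) +
  geom_boxplot() +
  # dashed horizontal line marking the official term length
  geom_hline(yintercept = official_term, linetype = "dashed") +
  labs(x = "Form of entry", y = "Length of time in office (years)")

if (interactive()) print(p)
```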
In the table below, we can see a set of summary statistics per group, which indicate a distinctive difference between them.
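Per-group statistics of this kind can be computed with a `group_by()` plus `summarise()` pipeline. The sketch below uses made-up values, and the set of statistics is my choice rather than necessarily the one in the original table:

```r
library(dplyr)

# Made-up lengths in office (years) for two groups
terms <- data.frame(
  group = c(rep("Direct Elections", 4), rep("Coup", 3)),
  len_office = c(4, 5, 5, 6, 0.5, 1, 3),
  stringsAsFactors = FALSE
)

stats <- terms %>%
  group_by(group) %>%
  summarise(
    n = n(),
    min_years = min(len_office),
    median_years = median(len_office),
    mean_years = mean(len_office),
    max_years = max(len_office)
  )

print(stats)
```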
In fact, a Kruskal-Wallis rank sum test seems to indicate that the groups are indeed different (p < 0.001).
Kruskal-Wallis rank sum test
data: len_office by group
Kruskal-Wallis chi-squared = 38.46, df = 4, p-value = 8.997e-08
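Output of this shape comes from base R’s `kruskal.test()` with a formula. Here is a sketch on made-up data with three clearly separated groups (the real data has five groups, hence df = 4 above):

```r
# Made-up data: three groups with clearly separated lengths in office
toy <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 5)),
  len_office = c(1:5, 11:15, 21:25)
)

# Non-parametric test of whether the groups come from the same distribution
kw <- kruskal.test(len_office ~ group, data = toy)
print(kw)
```

A rank-based test is a reasonable fit here because the lengths in office are strongly skewed, so a test that assumes normality would be harder to justify.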
There might be a moral in this data, but political conclusions run the risk of degenerating into random rants, so I’ll skip that.
The source code for this document is available at https://gist.github.com/jmcastagnetto/11127154
This post was originally published in RPubs at
This version contains syntax changes, because in current versions of dplyr the
“piping” operator is now
R version 3.1.1 (2014-07-10)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
 LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
 LC_PAPER=en_US.UTF-8 LC_NAME=C
 LC_ADDRESS=C LC_TELEPHONE=C
 LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
 methods stats graphics grDevices utils datasets base

other attached packages:
 googleVis_0.5.2 ggplot2_1.0.0 lubridate_1.3.3 dplyr_0.2
 XML_3.98-1.1 knitr_1.6

loaded via a namespace (and not attached):
 assertthat_0.1 colorspace_1.2-2 digest_0.6.4 evaluate_0.5.5
 formatR_0.10 grid_3.1.1 gtable_0.1.2 labeling_0.2
 magrittr_1.0.1 MASS_7.3-33 memoise_0.2.1 munsell_0.4.2
 parallel_3.1.1 plyr_1.8.1 proto_0.3-10 Rcpp_0.11.2
 reshape2_1.4 RJSONIO_1.2-0.2 scales_0.2.4 stringr_0.6.2
 tools_3.1.1