Reproducing a spurious correlation

This is quick and dirty code that tries to reproduce the analysis and graph from the post: "Per capita consumption of mozzarella cheese (US) correlates with Civil engineering doctorates awarded (US)"

# get the data from the article
library(XML)
url <- "http://www.tylervigen.com/view_correlation?id=3890"
tables <- readHTMLTable(url)
# we need table 3, minus a couple of extra columns
tab3 <- tables[[3]][1:2,-c(1,12)]
years <- as.numeric(colnames(tab3))
# per capita consumption of mozzarella
cheese <- as.numeric(t(tab3)[,1])
# civil engineering doctorates
engdoc <-as.numeric(t(tab3)[,2])

Now we will reproduce the plot

# code stolen+adapted from http://robjhyndman.com/hyndsight/r-graph-with-two-y-axes/
par(mar=c(5,4,4,5)+.1)
plot(years, cheese, type="b", lty="dashed", col="blue")
par(new=TRUE)
plot(years, engdoc, type="b", col="red",xaxt="n",yaxt="n",xlab="",ylab="")
axis(4)
mtext("engdoc",side=4,line=3)
legend("topleft",col=c("blue","red"),lty=1,legend=c("cheese","engdoc"))

Per capita consumption of mozzarella cheese (US) correlates with Civil engineering doctorates awarded (US)

And calculate the correlation (which happens to be high)

# calculate the correlation
cor(cheese, engdoc)
## [1] 0.9586

This bit was made to answer a question from a friend (Hi C├ęsar!), and originally was published as a gist: https://gist.github.com/jmcastagnetto/59c26c2624e621a13582

Go Top