The human health and financial impact of storms (1950-2011)

Synopsis

Storms have a (usually negative) effect on human populations, and in this analysis we study the overall impact on public health (in terms of deaths or injuries caused by the events), as well as the financial losses incurred by damage to property and crops.

The data being used was provided in the “Reproducible Research” course (Coursera, August 2014 session), based on the NOAA NCDC Storm Events database. The data was cleaned up, recoding the event types to comply with the official list of event classes mentioned in the accompanying documentation.

In terms of the effects of storms on human health, the results show that tornadoes are the most harmful, causing about 62% of the deaths and injuries registered in the data set: 97,043 people died or were injured over the 1950-2011 period, roughly 1,500 people per year. In fact, 10 of the event types (out of the 50 types considered by NOAA) are responsible for over 92% of human victims. In descending order, these are tornadoes, lightning, excessive heat, flooding (including flash floods), thunderstorms, winter and ice storms, high winds and wildfire.

An analysis of the financial impact of storms indicates that floods are the number one threat to property, representing about 150.2 billion USD over the 1950-2011 period (a rate of loss of ~2.5 billion USD a year).

A similar analysis indicates that drought and floods are responsible for more than half of the losses from damaged crops, for a total of 24.83 billion USD over the 1993-2011 period (~1.3 billion USD a year).

Data Processing

Data Source

The data for this analysis was provided as part of the August 2014 session of the course “Reproducible Research” (Coursera), and comprises records from the NOAA Storm Database and ancillary documentation.

The data set and its documentation were downloaded using the following code:
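The download chunk itself is not reproduced in this rendering of the report. As a minimal sketch of how such a step is commonly written, the helper below downloads a file only when no local copy exists. The function name `fetch_if_missing`, the placeholder URL and the local file name are all assumptions for illustration, not the author's code.

```r
# Hypothetical helper: download `url` to `destfile` unless the file already exists.
fetch_if_missing <- function(url, destfile) {
  if (!file.exists(destfile)) {
    download.file(url, destfile = destfile, mode = "wb")  # binary mode for compressed files
  }
  invisible(destfile)
}

# Usage (placeholder values; replace with the link provided by the course):
# fetch_if_missing("https://example.org/StormData.csv.bz2", "StormData.csv.bz2")
```

Skipping the download when the file is already present keeps repeated runs of the report fast and avoids unnecessary network traffic.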

Data exploration

To get an idea of the structure of the data set, the first 10 lines of the data file were read.

'data.frame':	10 obs. of  37 variables:
 $ STATE__   : num  1 1 1 1 1 1 1 1 1 1
 $ BGN_DATE  : Factor w/ 7 levels "11/15/1951 0:00:00",..: 6 6 5 7 1 1 2 3 4 4
 $ BGN_TIME  : int  130 145 1600 900 1500 2000 100 900 2000 2000
 $ TIME_ZONE : Factor w/ 1 level "CST": 1 1 1 1 1 1 1 1 1 1
 $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57
 $ COUNTYNAME: Factor w/ 9 levels "BALDWIN","BLOUNT",..: 7 1 4 6 3 5 2 8 9 4
 $ STATE     : Factor w/ 1 level "AL": 1 1 1 1 1 1 1 1 1 1
 $ EVTYPE    : Factor w/ 1 level "TORNADO": 1 1 1 1 1 1 1 1 1 1
 $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0
 $ BGN_AZI   : logi  NA NA NA NA NA NA ...
 $ BGN_LOCATI: logi  NA NA NA NA NA NA ...
 $ END_DATE  : logi  NA NA NA NA NA NA ...
 $ END_TIME  : logi  NA NA NA NA NA NA ...
 $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0
 $ COUNTYENDN: logi  NA NA NA NA NA NA ...
 $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0
 $ END_AZI   : logi  NA NA NA NA NA NA ...
 $ END_LOCATI: logi  NA NA NA NA NA NA ...
 $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3
 $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100
 $ F         : int  3 2 2 2 2 2 2 1 3 3
 $ MAG       : num  0 0 0 0 0 0 0 0 0 0
 $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0
 $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0
 $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25
 $ PROPDMGEXP: Factor w/ 1 level "K": 1 1 1 1 1 1 1 1 1 1
 $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0
 $ CROPDMGEXP: logi  NA NA NA NA NA NA ...
 $ WFO       : logi  NA NA NA NA NA NA ...
 $ STATEOFFIC: logi  NA NA NA NA NA NA ...
 $ ZONENAMES : logi  NA NA NA NA NA NA ...
 $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
 $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
 $ LATITUDE_E: num  3051 0 0 0 0 ...
 $ LONGITUDE_: num  8806 0 0 0 0 ...
 $ REMARKS   : logi  NA NA NA NA NA NA ...
 $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10

The data set has 37 columns, several of which are relevant to the analysis at hand: the date the event was reported (BGN_DATE), the US state in which the event occurred (STATE), the event type (EVTYPE), the number of people who died (FATALITIES) or were injured (INJURIES) because of the event, and the economic cost of the damage (PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP).

# removing temporary data frame
rm(tmp1)

Reading, Cleaning and Normalizing the data

To simplify the analysis (and save time and memory), only the relevant columns will be read from the data file:

'data.frame':	902297 obs. of  9 variables:
 $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
 $ STATE     : chr  "AL" "AL" "AL" "AL" ...
 $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
 $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
 $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
 $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
 $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
 $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ CROPDMGEXP: chr  "" "" "" "" ...


The data set, as read, contains 902,297 rows and 9 columns.
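One way to read only the relevant columns in base R is to pass `colClasses` to `read.csv`, marking unwanted columns as `"NULL"`. The sketch below assumes this approach (the report does not show which one was actually used). The helper name `read_selected` and the toy file are hypothetical, and the toy values are made up.

```r
# Columns kept for the analysis (the nine listed in the output above).
keep <- c("BGN_DATE", "STATE", "EVTYPE", "FATALITIES", "INJURIES",
          "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")

# Hypothetical helper: read only the columns named in `keep`.
read_selected <- function(path, keep) {
  header  <- names(read.csv(path, nrows = 1, check.names = FALSE))
  classes <- ifelse(header %in% keep, NA, "NULL")  # NA = let R guess the type; "NULL" = skip
  read.csv(path, colClasses = classes, stringsAsFactors = FALSE)
}

# Toy file standing in for the real data (made-up values).
toy_file <- tempfile(fileext = ".csv")
writeLines(c("STATE__,BGN_DATE,STATE,EVTYPE,FATALITIES,REMARKS",
             "1,4/18/1950 0:00:00,AL,TORNADO,0,none",
             "1,6/8/1951 0:00:00,AL,TORNADO,1,none"), toy_file)
storms_toy <- read_selected(toy_file, keep)
str(storms_toy)
```

Skipped columns are never parsed or stored, which is where the time and memory savings come from on a file of this size.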

According to the original data source, the event type column (EVTYPE) comprises a definite and limited vocabulary of 50 valid event names.

The list of valid events was extracted from the file pd01016005curr.pdf (section 2.1.1 “Storm Data Event Table”), and saved to a CSV file (valid_events.csv).

The crucial EVTYPE column, however, contains far more than the 50 documented values: it has 985 distinct values.

Moreover, only about 70.4% of the records in the data set use the documented vocabulary for the event type.

There is also great diversity in the way some events have been recorded over the years, including misspellings, mixed case, combinations of an event with some sort of numeric value, etc.

 [1] "Summary of July 26"      "HIGH WIND (G40)"
[3] "Erosion/Cstl Flood"      "RECORD SNOW/COLD"
[7] "HEAVY SNOW/WIND"         "THUNERSTORM WINDS"
[9] "DRY PATTERN"             "SNOWMELT FLOODING"
[11] "HEAVY SNOW AND ICE"      "HEAVY SNOW/BLIZZARD"
[15] "THUNDERSTORM WINDS/HAIL" "HIGH WIND/HEAVY SNOW"
[17] "COLD/WINDS"              "Summary of June 16"
[19] "COLD"                    "HAIL 1.75"


Therefore, some serious data cleanup needs to be done on this column.
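The kind of cleanup involved can be sketched with a few regular-expression rules. The rules below are hypothetical and cover only some of the sample values shown above. They are not the recoding actually used in this report, and mapping a compound entry to the first event it names is an assumption made for the example.

```r
# Hypothetical subset of the official vocabulary.
valid <- c("TORNADO", "HAIL", "HIGH WIND", "THUNDERSTORM WIND")

# Illustrative normalization: unify case and spacing, then recode a few patterns.
normalize_evtype <- function(x) {
  x <- toupper(trimws(x))
  x <- gsub("\\s+", " ", x, perl = TRUE)
  # misspellings, plurals and combinations such as "THUNERSTORM WINDS"
  x <- sub("^THUN[A-Z]*STORM WINDS?.*$", "THUNDERSTORM WIND", x, perl = TRUE)
  # trailing magnitudes such as "HIGH WIND (G40)" or "HAIL 1.75"
  x <- sub("^HIGH WINDS?\\b.*$", "HIGH WIND", x, perl = TRUE)
  x <- sub("^HAIL\\b.*$", "HAIL", x, perl = TRUE)
  x
}

raw   <- c("HIGH WIND (G40)", "THUNERSTORM WINDS", "HAIL 1.75",
           "THUNDERSTORM WINDS/HAIL", "Summary of July 26")
clean <- normalize_evtype(raw)

# Share of records that match the vocabulary, before and after cleaning.
100 * mean(raw %in% valid)
100 * mean(clean %in% valid)
```

Entries such as the "Summary of ..." rows do not describe an event at all, so no rule maps them to a valid class; they remain outside the vocabulary.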

To be able to accumulate by year, a variable was created to store the value extracted from the BGN_DATE column
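A minimal sketch of this step, assuming the `month/day/year h:mm:ss` format seen in the BGN_DATE values above; the variable name `year` is an assumption.

```r
bgn_date <- c("4/18/1950 0:00:00", "2/20/1951 0:00:00", "6/8/1951 0:00:00")

# Parse the date part (the trailing time is ignored) and keep the year as an integer.
year <- as.integer(format(as.Date(bgn_date, format = "%m/%d/%Y"), "%Y"))
year
```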

Finally, to be able to estimate the monetary cost due to damage caused by the storms, we have to examine the appropriate columns.

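| Code | (empty) | - | ? | + | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|---|---|---|---|
| Records | 465934 | 1 | 8 | 5 | 216 | 25 | 13 | 4 | 4 | 28 |

Table: Property damage ‘exponents’ (continued below)

| Code | 6 | 7 | 8 | B | h | H | K | m | M |
|---|---|---|---|---|---|---|---|---|---|
| Records | 4 | 5 | 1 | 40 | 1 | 6 | 424665 | 7 | 11330 |

| Code | (empty) | ? | 0 | 2 | B | k | K | m | M |
|---|---|---|---|---|---|---|---|---|---|
| Records | 618413 | 7 | 19 | 1 | 9 | 21 | 281832 | 1 | 1994 |

Table: Crop damage ‘exponents’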

It would seem that there is a mixture of coding standards in these columns. The great majority of the “exponents” (really multipliers) follow this coding:

• H or h: x100
• K or k: x1,000
• M or m: x1,000,000
• B or b: x1,000,000,000

The meaning of the other codes is not clear. Even after checking the documentation on the site that was the source for the data, several incompatible definitions can be found:

• The numbers are base-10 exponents ($10^n$).
• The numbers are really an artifact of conversion to CSV, and are part of the decimals of the PROPDMG column. Even if we accept this interpretation, there is the issue of whether the units are thousands, millions, or billions.
• The numbers indicate categories that represent ranges of values, according to a 1959 document (“STORM DATA”, May 1959, Volume I No. 5):
  • Cat. 1: less than USD 50
  • Cat. 2: USD 50 to USD 500
  • Cat. 3: USD 500 to USD 5,000
  • Cat. 4: USD 5,000 to USD 50,000
  • Cat. 5: USD 50,000 to USD 500,000
  • Cat. 6: USD 500,000 to USD 5,000,000
  • Cat. 7: USD 5,000,000 to USD 50,000,000
  • Cat. 8: USD 50,000,000 to USD 500,000,000
  • Cat. 9: USD 500,000,000 to USD 5,000,000,000

The bottom line is that it is not feasible to apply a single interpretation to the numeric codes in these columns.

To ascertain whether they could be omitted from the analysis, we calculated the percentage of these codes in each column, among the records that have a value for the corresponding damage amount (PROPDMG or CROPDMG).

| Column | Percent of undefined codes |
|---|---|
| PROPDMGEXP | 0.134% |
| CROPDMGEXP | 0.068% |

As can be seen, the fraction of records with uninterpretable codes is very small (<< 1%), thus we can safely drop them from the respective data frames.
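The check and the subsequent filtering can be sketched as follows. Two choices here are assumptions made for illustration: "has a value" is taken to mean a damage amount greater than zero, and an empty code is treated as interpretable. The toy values are made up.

```r
# Codes with a clear meaning (case-insensitive); "" means no multiplier was recorded.
known_codes <- c("", "H", "K", "M", "B")

# Percentage of records with a damage value whose code is not interpretable.
undefined_pct <- function(dmg, exp) {
  has_value <- dmg > 0
  100 * mean(!(toupper(exp[has_value]) %in% known_codes))
}

toy <- data.frame(PROPDMG    = c(25, 2.5, 0, 10, 3),
                  PROPDMGEXP = c("K", "m", "", "5", "?"),
                  stringsAsFactors = FALSE)

undefined_pct(toy$PROPDMG, toy$PROPDMGEXP)

# Drop the records whose code cannot be interpreted.
toy_clean <- toy[toupper(toy$PROPDMGEXP) %in% known_codes, ]
```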

Results

In the cleaned-up storm data set, the reported events are unevenly distributed across event classes over the period under analysis, as can be seen in the table below.

| Event class | Frequency | Percentage of reports |
|---|---|---|
| HAIL | 290400 | 32.18 |
| LIGHTNING | 242116 | 26.83 |
| THUNDERSTORM WIND | 109353 | 12.12 |
| FLASH FLOOD | 55677 | 6.17 |
| FLOOD | 29618 | 3.28 |
| HIGH WIND | 21777 | 2.41 |
| WINTER STORM | 19693 | 2.18 |
| HEAVY SNOW | 16968 | 1.88 |
| HEAVY RAIN | 11981 | 1.33 |

Table: Top 10 events reported in the storm data set

The top 10 events in the data set (see the table above) are responsible for 95.12% of the reports from 1950–2011.

Human health impact

The data contains columns that can help us measure the impact of storms on public health, understood in terms of the number of victims who suffer death or injury as a result of one of these events.

About 2.43% of the records in the storm data indicate that there were human victims.

The table below lists the top 10 storm types (events) that most affected human health in the period under consideration.

| Event type | Deaths/Injuries | Percent | Cumulative percent |
|---|---|---|---|
| LIGHTNING | 13575 | 8.7 | 71.1 |
| EXCESSIVE HEAT | 12453 | 8 | 79.1 |
| FLOOD | 7386 | 4.7 | 83.8 |
| FLASH FLOOD | 2837 | 1.8 | 85.6 |
| THUNDERSTORM WIND | 2646 | 1.7 | 87.3 |
| WINTER STORM | 2247 | 1.4 | 88.8 |
| ICE STORM | 2234 | 1.4 | 90.2 |
| HIGH WIND | 1750 | 1.1 | 91.3 |
| WILDFIRE | 1698 | 1.1 | 92.4 |

Table: Top 10 causes of death or injury due to storms [1950-2011]

We can see that the top 10 causes comprise about 92.4% of all the victims affected in all those years, and that Tornadoes are by far the most important cause of death or injuries to humans.
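A ranking of this kind, with percent and cumulative percent columns, can be computed along the following lines. This is a base R sketch on made-up toy data; the report itself loads dplyr, so the original computation may well differ, and the column names `victims`, `percent` and `cum_percent` are assumptions.

```r
# Made-up event records, for illustration only.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "FLOOD", "LIGHTNING", "FLOOD"),
  FATALITIES = c(1, 0, 2, 1, 0),
  INJURIES   = c(14, 5, 3, 0, 4),
  stringsAsFactors = FALSE
)

# Victims = deaths + injuries, summed per event type and sorted in descending order.
toy$victims <- toy$FATALITIES + toy$INJURIES
by_type <- aggregate(victims ~ EVTYPE, data = toy, FUN = sum)
by_type <- by_type[order(-by_type$victims), ]

# Share of all victims, and running share down the ranking.
total <- sum(by_type$victims)
by_type$percent     <- round(100 * by_type$victims / total, 1)
by_type$cum_percent <- round(100 * cumsum(by_type$victims) / total, 1)

head(by_type, 10)
```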

The impact on humans has not been constant over the years; there have been major events well outside the norm, as can be seen in the graph below.

The graph shows events such as the 1995 Chicago heat wave, which appears as the maximum value in the top chart and which caused many deaths over a period of only five days in July of that year.

• BGN_DATE: 1995-07-12
• STATE: IL
• EVTYPE: EXCESSIVE HEAT
• FATALITIES: 583
• INJURIES: 0

Also of note are the 1998 South Texas floods, which in October of that year caused a great number of injuries and deaths. This event is responsible for the maximum value in the injuries plot.

| Date | State | Event | Total deaths | Total injuries |
|---|---|---|---|---|
| 1998-10-17 | TX | FLOOD | 24 | 4510 |
| 1998-10-18 | TX | FLOOD | 0 | 1520 |

Financial impact

The storms have also had a negative financial impact due to damage produced to property and crops.

About 26.47% of records in the data set include an estimate for the property damage, and 2.45% have data on the cost of damage to crops.

To evaluate the costs, we will add a column that translates the character code into a multiplier, which will allow us to calculate the appropriate amount for each event.
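A minimal sketch of such a translation, using the H/K/M/B coding described in the data processing section. The names `multiplier` and `damage_usd` are hypothetical, and treating an empty code as a multiplier of 1 is an assumption, not something the report states.

```r
# Lookup table for the interpretable codes (matched case-insensitively).
multiplier <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Convert a damage figure and its code into an amount in USD.
damage_usd <- function(dmg, exp) {
  m <- unname(multiplier[toupper(exp)])  # NA for codes that are not in the lookup
  m[exp == ""] <- 1                      # assumption: no code means the value is in USD
  dmg * m
}

damage_usd(c(25, 2.5, 5, 1), c("K", "m", "B", ""))
```

Codes outside the lookup yield `NA`, which makes any record that slipped through the earlier filtering easy to spot before totals are computed.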

The top 10 events in terms of property damage are listed in the table below, with flooding being the number one source of property loss.

| Event | Cost (in billions USD) | Percent of total |
|---|---|---|
| FLOOD | 150.2 | 35.16 |
| HURRICANE (TYPHOON) | 85.36 | 19.97 |
| COASTAL FLOOD | 48.4 | 11.33 |
| HAIL | 17.62 | 4.12 |
| FLASH FLOOD | 16.91 | 3.96 |
| WILDFIRE | 8.5 | 1.99 |
| TROPICAL STORM | 7.71 | 1.81 |
| WINTER STORM | 6.78 | 1.59 |
| LIGHTNING | 6.65 | 1.56 |

The corresponding table for crop damage shows that drought and flooding (two opposite weather conditions) are responsible for more than 50% of the losses to crops.

| Event | Cost (in billions USD) | Percent of total |
|---|---|---|
| DROUGHT | 13.97 | 28.45 |
| FLOOD | 10.86 | 22.11 |
| HURRICANE (TYPHOON) | 5.52 | 11.23 |
| ICE STORM | 5.02 | 10.23 |
| HAIL | 3.11 | 6.34 |
| FROST/FREEZE | 2 | 4.07 |
| FLASH FLOOD | 1.53 | 3.12 |
| EXTREME COLD/WIND CHILL | 1.33 | 2.71 |
| HEAVY RAIN | 0.95 | 1.94 |
| EXCESSIVE HEAT | 0.9 | 1.84 |

When looking at the total losses per year over the study period, we observe a definite upward trend in property damage caused by storms. For crops, in contrast, there has been a decrease, at least since 1993, which is the first year with such losses recorded in the data set.

For illustration purposes (because it might not be the best model for this data), we are superimposing a linear estimate, mainly to drive home the possible underlying trend.
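A linear estimate of this kind can be obtained by summing the losses per year and fitting a straight line with `lm`. The sketch below uses made-up figures and base R only; the report loads ggplot2, so the original plots may have drawn the line differently.

```r
# Made-up event-level losses (billions of USD), for illustration only.
events <- data.frame(
  year = c(2001, 2001, 2002, 2003, 2003, 2004, 2005, 2006),
  loss = c(0.4, 0.6, 1.4, 1.1, 1.0, 2.4, 3.1, 3.4)
)

# Total loss per year, then a simple linear fit of loss against year.
yearly <- aggregate(loss ~ year, data = events, FUN = sum)
fit    <- lm(loss ~ year, data = yearly)

slope        <- unname(coef(fit)["year"])  # estimated change in loss per year
yearly$trend <- fitted(fit)                # values of the fitted line, one per year
yearly
```

The sign of `slope` summarizes the direction of the trend, while the fitted values give the line to superimpose on the yearly totals.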

Some non-exhaustive reasons could be advanced for these trends:

1. There has been a steady increase in the size and density of urban populations, so storms can generate a bigger financial loss over the same area.
2. There has been a decrease in the population of rural areas, and an increase in the yield of crops due to modernization of the agricultural methods.
3. There is better forecasting and/or advanced warning of impending storms, so farmers can take measures to minimize loss.

Reproducibility information

sessionInfo()

R version 3.1.1 (2014-07-10)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] gridExtra_0.9.1 pander_0.3.8    ggplot2_1.0.0   dplyr_0.2
[5] knitr_1.6       setwidth_1.0-3

loaded via a namespace (and not attached):
[1] assertthat_0.1   codetools_0.2-9  colorspace_1.2-2 digest_0.6.4
[5] evaluate_0.5.5   formatR_0.10     gtable_0.1.2     labeling_0.2
[9] magrittr_1.0.1   MASS_7.3-33      munsell_0.4.2    parallel_3.1.1
[13] plyr_1.8.1       proto_0.3-10     Rcpp_0.11.2      reshape2_1.4
[17] scales_0.2.4     stringr_0.6.2    tools_3.1.1


The source code for this document can be found at the URL: https://gist.github.com/jmcastagnetto/c4f9dad8f7b0fc146198

This document was originally published in RPubs at the URL: http://rpubs.com/jesuscastagnetto/storms-impact