For the “Practical Machine Learning” course at Coursera, the class was given a dataset from a Human Activity Recognition (HAR) study [1] that tries to assess the quality of an activity (defined as “… the adherence of the execution of an activity to its specification …”), namely a weight lifting exercise, using data from sensors attached to the individuals and their equipment.
In contrast to other HAR studies, this one [2] does not attempt to distinguish what activity is being done, but rather to assess how well the activity is being performed.
Figure 1: Location of body sensors [3]
The aforementioned study used sensors that “… provide three-axes acceleration, gyroscope and magnetometer data …”, with a Bluetooth module that allowed experimental data capture. These sensors were attached (see Figure 1) to “… six male participants aged between 20-28 years …” who performed one set of ten repetitions of the Unilateral Dumbbell Biceps Curl with a 1.25kg (light) dumbbell, in five different manners (one correct and four incorrect):
- Exactly according to the specification (Class A)
- Throwing the elbows to the front (Class B)
- Lifting the dumbbell only halfway (Class C)
- Lowering the dumbbell only halfway (Class D)
- Throwing the hips to the front (Class E)
Getting and cleaning the data
There were two datasets in CSV format, one to be used for training and another one for testing. The training dataset contained 19622 rows and 160 columns, including the classe variable, which classified the entry according to how well the exercise was performed (vide supra). The testing dataset has only 20 rows and 160 columns; instead of the classe variable there is a problem_id column to be used as an identifier for the prediction results. The latter set was to be used for a different part of the assignment dealing with specific class prediction.
Table 1: First 7 columns of the training dataset
The first seven columns of the training dataset (X, user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp, new_window, num_window) are not related to the sensor measurements, but rather to the identity of the person, and the time stamps and capture windows for the sensor data (see Table 1). Because I am trying to produce a predictive model that relies only on the quantitative sensor measurements, I decided to remove these columns. In a similar fashion, the first seven columns of the testing dataset were also removed. This operation left me with a total of 153 columns in each data frame.
Thus, the data frame has, for each of the four sensors (positioned at the arm, forearm, belt, and dumbbell respectively), 38 different measurements (see Table 2 in Appendix 1). The problem then is to select from these 152 variables the ones relevant to predict a good exercise execution.
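The loading and trimming steps above can be sketched as follows (a minimal sketch; the CSV file names are assumptions, not given in the text):

```r
# Load the two datasets; file names are assumed for illustration
training <- read.csv("pml-training.csv")
testing  <- read.csv("pml-testing.csv")

# Drop the first seven identifier/timestamp/window columns,
# which are unrelated to the sensor measurements
training <- training[, -(1:7)]
testing  <- testing[, -(1:7)]

dim(training)  # should report 19622 rows and 153 columns
```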
The automatic column type assignment of the read.csv() R function was not always correct, in particular because several of the numeric columns contained text data coming from sensor reading errors (e.g. “#DIV/0!”). So, I forced all of the sensor readings to be numeric, and set the classe column as a factor. As a result of the type assignment, some columns contained only missing values (NA), so these were removed from the dataset. Also, by using the nearZeroVar() function of the caret package, I eliminated columns that were considered uninformative (zero or near-zero variance predictors).
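One way to implement this cleaning is sketched below, assuming the trimmed training data frame from the previous step (with classe as its last column):

```r
library(caret)

n <- ncol(training)
# Coerce every sensor column to numeric; text such as "#DIV/0!"
# becomes NA (R will warn about NAs introduced by coercion)
training[, -n] <- lapply(training[, -n],
                         function(x) as.numeric(as.character(x)))
training$classe <- factor(training$classe)

# Drop columns that ended up entirely NA after the coercion
all_na <- sapply(training, function(x) all(is.na(x)))
training <- training[, !all_na]

# Drop zero / near-zero variance predictors
nzv <- nearZeroVar(training)
if (length(nzv) > 0) training <- training[, -nzv]
```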
Table 3: Number of columns by percentage of missing values
|Percentage of missing values|Number of columns|
|---|---|
|0%|53|
|~98%|65|
After that last operation, the training data frame had only 118 variables, including the classification column. I then checked how many of these variables contained too many missing values. Initially I set the threshold to 80%, but soon found that there were only two cases: columns without any missing data, and columns with about 98% missing data (see Table 3). Imputing values in the latter case could be done, but it is unlikely to give anything reasonable or useful as a predictor; thus, those 65 columns were also removed.
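The missing-data filter can be expressed compactly; this is a sketch of the threshold check described above:

```r
# Fraction of NAs per column
na_frac <- colMeans(is.na(training))

# Inspect the distribution: two groups, 0 and roughly 0.98
table(round(na_frac, 2))

# Keep only columns at or below the 80% missing-data threshold
training <- training[, na_frac <= 0.80]
```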
In the end we will use 52 measurements of the x, y, and z axis components of the acceleration, gyroscope, and magnetometer sensors, as well as the overall acceleration, pitch, roll and yaw (see Table 4 in Appendix 2), to predict whether the exercise was done correctly.
Generating and validating a Random Forest predictive model
Because the provided testing dataset could not be used to validate the predictive model, I decided to split the “training” dataset into one part used to perform the random forest model training (75% of the data), and another to validate it (25% of the data). The training also assessed the quality of the model with an “out-of-bag” (OOB) error estimate obtained via cross-validation.
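A 75/25 split along these lines can be done with caret (a sketch; the seed value is an assumption added for reproducibility):

```r
library(caret)

set.seed(2015)  # assumed seed, for reproducibility
in_train  <- createDataPartition(training$classe, p = 0.75, list = FALSE)
train_set <- training[in_train, ]
valid_set <- training[-in_train, ]
```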
The model training used the standard random forest (rf) algorithm [4] available as a method in the caret package, with the default parameters and 10-fold cross-validation. I used the classe variable as the dependent variable and the 52 sensor variables as predictors.
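The training call matching this description would look roughly like the following (a sketch; object names are assumptions):

```r
library(caret)

# 10-fold cross-validation, default rf tuning parameters
ctrl <- trainControl(method = "cv", number = 10)
rf_model <- train(classe ~ ., data = train_set,
                  method = "rf", trControl = ctrl)

rf_model$finalModel  # prints the OOB error estimate
```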
This model gave an OOB error of 0.6%, which indicates a potentially good predictive model.
With the reserved validation set, I calculated the confusion matrix (Table 5) and other relevant statistics using the confusionMatrix() function of the caret package. The confusion matrix shows that the model does a reasonably good job at predicting the exercise quality.
Table 5: Confusion Matrix (Predicted vs Reference) for Random Forest model
Validating the model results in an accuracy of 0.9943 (95% confidence interval: [0.9918, 0.9962]). The estimated accuracy is well above the “no information rate” statistic of 0.2845. The validation also yields a high kappa statistic of 0.9928, which suggests a very good classifier. Overall, this model compares well with the 0.9803 accuracy reported in the original work.
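These validation statistics come from a call along these lines (a sketch, assuming the model and validation set objects from the earlier steps):

```r
library(caret)

# Predict on the held-out 25% and compare with the true labels
pred <- predict(rf_model, newdata = valid_set)
cm <- confusionMatrix(pred, valid_set$classe)

cm$table                            # the confusion matrix (Table 5)
cm$overall[c("Accuracy", "Kappa")]  # accuracy and kappa statistics
```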
The first 20 model predictors can be seen in Figure 2, and the complete list of predictors (ordered by their mean decrease in accuracy) is in Table 6 (Appendix 3).
Figure 2: Variable Importance for Random Forest model (first 20 variables)
This plot indicates that the measurements of the belt sensor (roll, yaw, and pitch), the forearm (pitch), and the dumbbell (magnetic component) are the most important for distinguishing whether this particular exercise is being done correctly or not. This makes sense, as the way the core body moves and the rotation of the forearm are closely related to a correct execution of the biceps curl, and in the case of the metallic dumbbell the position changes are readily detected by the magnetometer.
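A variable importance plot such as Figure 2 is typically produced with caret's varImp(); a minimal sketch assuming the trained model object from above:

```r
library(caret)

# Extract and plot the 20 most important predictors
imp <- varImp(rf_model)
plot(imp, top = 20)
```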
The source code for the R Markdown document and other accessory artifacts is available at the GitHub repository: https://github.com/jmcastagnetto/practical_machine_learning-coursera-june2015, and the assignment was originally published at https://jmcastagnetto.github.io/practical_machine_learning-coursera-june2015/ using a layout inspired by Tufte’s handout recommendations.
```
## R version 3.2.0 (2015-04-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.2 LTS
##
## locale:
##  LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
##  LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  LC_PAPER=en_US.UTF-8       LC_NAME=C
##  LC_ADDRESS=C               LC_TELEPHONE=C
##  LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
##  parallel stats graphics grDevices utils datasets methods
##  base
##
## other attached packages:
##  randomForest_4.6-7 doMC_1.3.3 iterators_1.0.7 foreach_1.4.2
##  captioner_2.2.2 knitr_1.8 caret_6.0-47 ggplot2_1.0.0
##  lattice_0.20-29 sjPlot_1.8.1
##
## loaded via a namespace (and not attached):
##  Rcpp_0.11.5 formatR_1.0 plyr_1.8.1
##  class_7.3-10 tools_3.2.0 digest_0.6.4
##  lme4_1.1-6 evaluate_0.5.5 gtable_0.1.2
##  nlme_3.1-120 mgcv_1.8-6 psych_1.4.5
##  Matrix_1.1-4 DBI_0.3.1 yaml_2.1.13
##  brglm_0.5-9 SparseM_1.6 proto_0.3-10
##  e1071_1.6-4 BradleyTerry2_1.0-5 dplyr_0.4.1
##  stringr_0.6.2 gtools_3.4.1 sjmisc_1.0.2
##  grid_3.2.0 nnet_7.3-9 rmarkdown_0.6.1
##  minqa_1.2.3 reshape2_1.4 tidyr_0.2.0
##  car_2.0-25 magrittr_1.0.1 codetools_0.2-11
##  scales_0.2.4 htmltools_0.2.6 MASS_7.3-33
##  splines_3.2.0 assertthat_0.1 tufterhandout_1.2.1
##  pbkrtest_0.3-8 colorspace_1.2-6 quantreg_5.05
##  munsell_0.4.2 RcppEigen_0.3.2.1.2
```
Appendix 1: Columns related to the sensors in the original training dataset
Table 2: Measurement columns by sensor
Appendix 2: Remaining columns related to the sensors
Table 4: Remaining measurement columns by sensor
Appendix 3: Random Forest Model - Variable Importance
Table 6: Variable importance per class and overall
1. For more details see “The Weight Lifting Exercises Dataset”
2. Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of the 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.
3. Image obtained from http://groupware.les.inf.puc-rio.br/har#weight_lifting_exercises
4. randomForest: Breiman and Cutler’s random forests for classification and regression