R vs Python: Image Classification with Keras
(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)
Even though the libraries for R from Python, or Python from R code execution existed since years and despite of a recent announcement of Ursa Labs foundation by Wes McKinney who is aiming to join forces with RStudio foundation, Hadley Wickham in particularly, (find more here) to improve data scientists workflow and unify libraries to be used not only in Python, but in any programming language used by data scientists, some data professionals are still very strict on the language to be used for ANN models limiting their dev. environment exclusively to Python.
As a continuation of my R vs. Python comparison, I decided to test performance of both languages in terms of time required to train a convolutional neural network based model for image recognition. As the starting point, I took the blog post by Dr. Shirin Glander on how easy it is to build a CNN model in R using Keras.
A few words about Keras. It is a Python library for artificial neural network ML models which provides high level fronted to various deep learning frameworks with Tensorflow being the default one.
Keras has many pros with the following among the others:
- Easy to build complex models just in few lines of code => perfect for dev. cycle to quickly experiment and check your ideas
- Code recycling: one can easily swap the backend framework (let’s say from CNTK to Tensorflow or vice versa) => DRY principle
- Seamless use of GPU => perfect for fast model tuning and experimenting
Since Keras is written in Python, it may be a natural choice for your dev. environment to use Python. And that was the case until about a year ago when RStudio founder J.J.Allaire announced release of the Keras library for R in May’17. I consider this to be a turning point for data scientists; now we can be more flexible with dev. environment and be able to deliver result more efficiently with opportunity to extend existing solutions we may have written in R.
It brings me to the point of this post.
My hypothesis is, when it comes to ANN ML model building with Keras, Python is not a must, and depending on your client’s request, or tech stack, R can be used without limitations and with similar efficiency.
In order to test my hypothesis, I am going to perform image classification using the fruit images data from kaggle and train a CNN model with four hidden layers: two 2D convolutional layers, one pooling layer and one dense layer. RMSProp is being used as the optimizer function.
Tech stackHardware:
CPU: Intel Core i7-7700HQ: 4 cores (8 threads), 2800 – 3800 (Boost) MHz core clock
GPU: Nvidia Geforce GTX 1050 Ti Mobile: 4Gb vRAM, 1493 – 1620 (Boost) MHz core clock
RAM: 16 Gb
Software:
OS: Linux Ubuntu 16.04
R: ver. 3.4.4
Python: ver. 3.6.3
Keras: ver. 2.2
Tensorflow: ver. 1.7.0
CUDA: ver. 9.0 (note that the current tensorflow version supports ver. 9.0 while the up-to-date version of cuda is 9.2)
cuDNN: ver. 7.0.5 (note that the current tensorflow version supports ver. 7.0 while the up-to-date version of cuDNN is 7.1)
The R and Python code snippets used for CNN model building are present bellow. Thanks to fruitful collaboration between F. Chollet and J.J. Allaire, the logic and functions names in R are alike the Python ones.
R ## Courtesy: Dr. Shirin Glander. Code source: https://shirinsplayground.netlify.com/2018/06/keras_fruits/ library(keras) start <- Sys.time() fruit_list <- c("Kiwi", "Banana", "Plum", "Apricot", "Avocado", "Cocos", "Clementine", "Mandarine", "Orange", "Limes", "Lemon", "Peach", "Plum", "Raspberry", "Strawberry", "Pineapple", "Pomegranate") # number of output classes (i.e. fruits) output_n <- length(fruit_list) # image size to scale down to (original images are 100 x 100 px) img_width <- 20 img_height <- 20 target_size <- c(img_width, img_height) # RGB = 3 channels channels <- 3 # path to image folders path <- "path/to/folder/with/data" train_image_files_path <- file.path(path, "fruits-360", "Training") valid_image_files_path <- file.path(path, "fruits-360", "Test") # optional data augmentation train_data_gen %>% image_data_generator( rescale = 1/255 ) # Validation data shouldn't be augmented! But it should also be scaled. valid_data_gen <- image_data_generator( rescale = 1/255 ) # training images train_image_array_gen <- flow_images_from_directory(train_image_files_path, train_data_gen, target_size = target_size, class_mode = 'categorical', classes = fruit_list, seed = 42) # validation images valid_image_array_gen <- flow_images_from_directory(valid_image_files_path, valid_data_gen, target_size = target_size, class_mode = 'categorical', classes = fruit_list, seed = 42) ### model definition # number of training samples train_samples <- train_image_array_gen$n # number of validation samples valid_samples <- valid_image_array_gen$n # define batch size and number of epochs batch_size <- 32 epochs <- 10 # initialise model model <- keras_model_sequential() # add layers model %>% layer_conv_2d(filter = 32, kernel_size = c(3,3), padding = 'same', input_shape = c(img_width, img_height, channels)) %>% layer_activation('relu') %>% # Second hidden layer layer_conv_2d(filter = 16, kernel_size = c(3,3), padding = 'same') %>% layer_activation_leaky_relu(0.5) %>% layer_batch_normalization() %>% # Use max pooling layer_max_pooling_2d(pool_size = c(2,2)) %>% layer_dropout(0.25) %>% # Flatten max filtered output into feature vector # and feed into dense layer layer_flatten() %>% layer_dense(100) %>% layer_activation('relu') %>% layer_dropout(0.5) %>% # Outputs from dense layer are projected onto output layer layer_dense(output_n) %>% layer_activation('softmax') # compile model %>% compile( loss = 'categorical_crossentropy', optimizer = optimizer_rmsprop(lr = 0.0001, decay = 1e-6), metrics = 'accuracy' ) # fit hist <- fit_generator( # training data train_image_array_gen, # epochs steps_per_epoch = as.integer(train_samples / batch_size), epochs = epochs, # validation data validation_data = valid_image_array_gen, validation_steps = as.integer(valid_samples / batch_size), # print progress verbose = 2, callbacks = list( # save best model after every epoch callback_model_checkpoint(file.path(path, "fruits_checkpoints.h5"), save_best_only = TRUE), # only needed for visualising with TensorBoard callback_tensorboard(log_dir = file.path(path, "fruits_logs")) ) ) df_out <- hist$metrics %>% {data.frame(acc = .$acc[epochs], val_acc = .$val_acc[epochs], elapsed_time = as.integer(Sys.time()) - as.integer(start))} Python ## Author: D. Kisler - adoptation of R code from https://shirinsplayground.netlify.com/2018/06/keras_fruits/ from keras.preprocessing.image import ImageDataGenerator from keras.models import Sequential from keras.layers import (Conv2D, Dense, LeakyReLU, BatchNormalization, MaxPooling2D, Dropout, Flatten) from keras.optimizers import RMSprop from keras.callbacks import ModelCheckpoint, TensorBoard import PIL.Image from datetime import datetime as dt start = dt.now() # fruits categories fruit_list = ["Kiwi", "Banana", "Plum", "Apricot", "Avocado", "Cocos", "Clementine", "Mandarine", "Orange", "Limes", "Lemon", "Peach", "Plum", "Raspberry", "Strawberry", "Pineapple", "Pomegranate"] # number of output classes (i.e. fruits) output_n = len(fruit_list) # image size to scale down to (original images are 100 x 100 px) img_width = 20 img_height = 20 target_size = (img_width, img_height) # image RGB channels number channels = 3 # path to image folders path = "path/to/folder/with/data" train_image_files_path = path + "fruits-360/Training" valid_image_files_path = path + "fruits-360/Test" ## input data augmentation/modification # training images train_data_gen = ImageDataGenerator( rescale = 1./255 ) # validation images valid_data_gen = ImageDataGenerator( rescale = 1./255 ) ## getting data # training images train_image_array_gen = train_data_gen.flow_from_directory(train_image_files_path, target_size = target_size, classes = fruit_list, class_mode = 'categorical', seed = 42) # validation images valid_image_array_gen = valid_data_gen.flow_from_directory(valid_image_files_path, target_size = target_size, classes = fruit_list, class_mode = 'categorical', seed = 42) ## model definition # number of training samples train_samples = train_image_array_gen.n # number of validation samples valid_samples = valid_image_array_gen.n # define batch size and number of epochs batch_size = 32 epochs = 10 # initialise model model = Sequential() # add layers # input layer model.add(Conv2D(filters = 32, kernel_size = (3,3), padding = 'same', input_shape = (img_width, img_height, channels), activation = 'relu')) # hiddel conv layer model.add(Conv2D(filters = 16, kernel_size = (3,3), padding = 'same')) model.add(LeakyReLU(.5)) model.add(BatchNormalization()) # using max pooling model.add(MaxPooling2D(pool_size = (2,2))) # randomly switch off 25% of the nodes per epoch step to avoid overfitting model.add(Dropout(.25)) # flatten max filtered output into feature vector model.add(Flatten()) # output features onto a dense layer model.add(Dense(units = 100, activation = 'relu')) # randomly switch off 25% of the nodes per epoch step to avoid overfitting model.add(Dropout(.5)) # output layer with the number of units equal to the number of categories model.add(Dense(units = output_n, activation = 'softmax')) # compile the model model.compile(loss = 'categorical_crossentropy', metrics = ['accuracy'], optimizer = RMSprop(lr = 1e-4, decay = 1e-6)) # train the model hist = model.fit_generator( # training data train_image_array_gen, # epochs steps_per_epoch = train_samples // batch_size, epochs = epochs, # validation data validation_data = valid_image_array_gen, validation_steps = valid_samples // batch_size, # print progress verbose = 2, callbacks = [ # save best model after every epoch ModelCheckpoint("fruits_checkpoints.h5", save_best_only = True), # only needed for visualising with TensorBoard TensorBoard(log_dir = "fruits_logs") ] ) df_out = {'acc': hist.history['acc'][epochs - 1], 'val_acc': hist.history['val_acc'][epochs - 1], 'elapsed_time': (dt.now() - start).seconds} ExperimentThe models above were trained 10 times with R and Pythons on GPU and CPU, the elapsed time and the final accuracy after 10 epochs were measured. The results of the measurements are presented on the plots below (click the plot to be redirected to plotly interactive plots).
From the plots above, one can see that:
- the accuracy of your model doesn’t depend on the language you use to build and train it (the plot shows only train accuracy, but the model doesn’t have high variance and the bias accuracy is around 99% as well).
- even though 10 measurements may be not convincing, but Python would reduce (by up to 15%) the time required to train your CNN model. This is somewhat expected because R uses Python under the hood when executes Keras functions.
Let’s perform unpaired t-test assuming that all our observations are normally distributed.
library(dplyr) library(data.table) # fetch the data used to plot graphs d <- fread('keras_elapsed_time_rvspy.csv') # unpaired t test: # t_score = (mean1 - mean2)/sqrt(stdev1^2/n1+stdev2^2/n2) d %>% group_by(lang, eng) %>% summarise(el_mean = mean(elapsed_time), el_std = sd(elapsed_time), n = n()) %>% data.frame() %>% group_by(eng) %>% summarise(t_score = round(diff(el_mean)/sqrt(sum(el_std^2/n)), 2)) eng t_score cpu 11.38 gpu 9.64T-score reflects a significant difference between the time required to train a CNN model in R compared to Python as we saw on the plot above.
SummaryBuilding and training CNN model in R using Keras is as “easy” as in Python with the same coding logic and functions naming convention. Final accuracy of your Keras model will depend on the neural net architecture, hyperparameters tuning, training duration, train/test data amount etc., but not on the programming language you would use for your DS project. Training a CNN Keras model in Python may be up to 15% faster compared to R
P.S.If you would like to know more about Keras and to be able to build models with this awesome library, I recommend you these books:
- Deep Learning with Python by F. Chollet (one of the Keras creators)
- Deep Learning with R by F. Chollet and J.J. Allaire
As well as this Udemy course to start your journey with Keras.
Thanks a lot for your attention! I hope this post would be helpful for an aspiring data scientist to gain an understanding of use cases for different technologies and to avoid being biased when it comes to the selection of the tools for DS projects accomplishment.
Related Post
- Update: Can we predict flu outcome with Machine Learning in R?
- Evaluation of Topic Modeling: Topic Coherence
- Natural Language Generation with Markovify in Python
- Anomaly Detection in R – The Tidy Way
- Forecasting with ARIMA – Part I
To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Statistics Sunday: Accessing the YouTube API with tuber
(This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)
I haven’t had a lot of time to play with this but yesterday, I discovered the tuber R package, which allows you to interact with the YouTube API.
To use the tuber package, not only do you need to install it in R – you’ll need a Google account and will have to authorize 4 APIs through Developer Console: all 3 YouTube APIs (though the Data API will be doing the heavy lifting) and the Freebase API. Before you authorize the first API, Google will have you create a project to tie the APIs to. Then, you’ll find the APIs in the API library to add to this project. Click on each API and on the next screen, select Enable. You’ll need to create credentials for each of the YouTube APIs. When asked to identify the type of app that will accessing the YouTube API, select “Other”.
The tuber package requires two pieces of information from you, which you’ll get when you set up credentials for the OAuth 2.0 Client: client ID and client secret. Once you set those up, you can download them at any time in JSON format by going to the Credentials dashboard and clicking the download button on the far right.
In R, load the tuber package, then call the yt_oauth function, using the client ID (which should end with something like “apps.googleusercontent.com”) and client secret (a string of letters and numbers). R will launch a browser window to authorize tuber to access the APIs. Once that’s done, you’ll be able to use the tuber package to, for instance, access data about a channel or get information about a video. My plan is to use my Facebook dataset to pull out the head songs I’ve shared and get the video information to generate a dataset of my songs. Look for more on that later. In the meantime, this great post on insightr can give you some details about using the tuber package.
[Apologies for the short and late post – I’ve been sick and haven’t had as much time to write recently. Hopefully I’ll get back to normal over the next week.]
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Effectively scaling Shiny in enterprise
(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)
James Blair, RStudio
Scalability is a hot word these days, and for good reason. As data continues to grow in volume and importance, the ability to reliably access and reason about that data increases in importance. Enterprises expect data analysis and reporting solutions that are robust and allow several hundred, even thousands, of concurrent users while offering up-to-date security options.
Shiny is a highly flexible and widely used framework for creating web applications using R. It enables data scientists and analysts to create dynamic content that provides straightforward access to their work for those with no working knowledge of R. While Shiny has been around for quite some time, recent introductions to the Shiny ecosystem make Shiny simpler and safer to deploy in an enterprise environment where security and scalability are paramount. These new tools in connection with RStudio Connect provide enterprise grade solutions that make Shiny an even more attractive option for data resource creation.
Develop and TestMost Shiny applications are developed either locally on a personal computer or using an instance of RStudio Server. During development it can be helpful to understand application performance, specifically if there are any concerning bottlenecks. The profvis package provides functions for profiling R code and can profile the performance of Shiny applications. profvis provides a breakdown of code performance and can be useful for identifying potential areas for improving application responsiveness.
The recently released promises package provides asynchronous capabilities to Shiny applications. Asynchronous programming can be used to improve application responsiveness when several concurrent users are accessing the same application. While there is some overhead involved in creating asynchronous applications, this method can improve application responsiveness.
Once an application is fully developed and ready to be deployed, it’s useful to establish a set of behavioral expectations. These expectations can be used to ensure that future updates to the application don’t break or unexpectedly change behavior. Traditionally most testing of Shiny applications has been done by hand, which is both time consuming and error prone. The new shinytest package provides a clean interface for testing Shiny applications. Once an application is fully developed, a set of tests can be recorded and stored to compare against future application versions. These tests can be run programatically and can even be used with continuous integration (CI) platforms. Robust testing for Shiny applications is a huge step forward in increasing the deployability and dependability of such applications.
DeployOnce an application has been developed and tested to satisfaction, it must be deployed to a production environment in order to provide other users with application access. Production deployment of data resources within an enterprise centers on control. For example, access control and user authentication are important for controlling who has access to the application. Server resource control and monitoring are important for controlling application performance and server stability. These control points enable trustworthy and performant deployment.
There are a few current solutions for deploying Shiny applications. Shiny Server provides both an open source and professional framework for publishing Shiny applications and making them available to a wide audience. The professional version provides features that are attractive for enterprise deployment, such as user authentication. RStudio Connect is a recent product from RStudio that provides several enhancements to Shiny Server. Specifically, RStudio Connect supports push button deployment and natively handles application dependencies, both of which simplify the deployment process. RStudio Connect also places resource control in the hands of the application developer, which lightens the load on system administrators and allows the developer to tune app performance to align with expectations and company priorities.
ScaleIn order to be properly leveraged, a deployed application must scale to meet user demand. In some instances, applications will have low concurrent users and will not need any additional help to remain responsive. However, it is often the case in large enterprises that applications are widely distributed and concurrently accessed by several hundred or even thousands of users. RStudio Connect provides the ability to set up a cluster of servers to provide high availability (HA) and load balanced configurations in order to scale applications to meet the needs of concurrent users. Shiny itself has been shown to effectively scale to meet the demands of 10,000 concurrent users!
As businesses continue searching for ways to efficiently capture and digest growing stores of data, R in connection with Shiny continues to establish itself as a robust and enterprise ready solution for data analysis and reporting.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Not only LIME
(This article was first published on English – SmarterPoland.pl, and kindly contributed to R-bloggers)
I’ve heard about a number of consulting companies, that decided to use simple linear model instead of a black box model with higher performance, because ,,client wants to understand factors that drive the prediction’’.
And usually the discussion goes as following: ,,We have tried LIME for our black-box model, it is great, but it is not working in our case’’, ,,Have you tried other explainers?’’, ,,What other explainers’’?
So here you have a map of different visual explanations for black-box models. Choose one in (on average) less than three simple steps.
These are available in the DALEX package. Feel free to propose other visual explainers that should be added to this map (and the package).
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: English – SmarterPoland.pl. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
bounceR 0.1.2: Automated Feature Selection
(This article was first published on r-bloggers – STATWORX, and kindly contributed to R-bloggers)
New FeaturesAs promised, we kept on working on our bounceR package. For once, we changed the interface: users now do not have to choose a number of tuning parameters, that – thanks to my somewhat cryptic documentation – sound more complicated than they are. Inspired by H2o.ai feature to let the user set the time he or she wants to wait, instead of a number of cryptic tuning parameters, we added a similar function.
Further, we changed the source code quite a bit. Henrik Bengtsson gave a very inspiring talk on parallization using the genius future package at this year's eRum conference. A couple days later, Davis Vaughan released furrr. An incredibly smart – kudos – wrapper on-top of the no-less genius purrr package. Davis' package combines purrr's maping functions with future's parallization madness. As you can tell, I am a big fan of all these three packages. Thus, inspired by these new inventions, we wanted to make use of them in our package. So, the entire parallization setup of bounceR is now leveraging furrr. This way, the parallization is so much smarter, faster and works seemingless on different operating systems.
Practical ExampleThus, lets see how you can use it now. Let's start by downloading the package.
# you need devtools, cause we are just about to get it to CRAN, though we are not that far library(devtools) # now you are good to go devtools::install_github("STATWORX/bounceR") # now you can source it like every normal package library(bounceR)To show how the feature selection works, we now need some data, so lets simulate some with our sim_data() function.
# simulate some data data <- sim_data(n = 100, modelvars = 10, noisevars = 300)Now you guys can all imagine that with 310 features on 100 observations, building models could be a little challenging. In order to be able to model the target no less, you need to reduce your feature space. There are numerous ways to do so. In my last Blog Post I described our solution. Let's see how to use our algorithm.
# run our algorithm selected_features <- featureSelection(data = data, target = "y", max_time = "30 mins", bootstrap = "regular", early_stopping = TRUE, parallel = TRUE)What can you expect to get out of it? Well, we return a list with of course the optimal formula calculated by our algorithm. Further, you get a stability matrix with it, where you can see a ranking of the features by importance. Additionally we built in some convenient S4 methods, so you can easily access all the information you need.
OutlookI hope I could teaser you a little to check out the package and help us further improve it. Currently, we are developing two new algorithms for feature selection. Thus, in the next iteration we will implement those two as well. I am looking forward to your comments, issues and thoughts on the package.
Cheers Guys!
Über den Autor Lukas StrömsdörferDer Beitrag bounceR 0.1.2: Automated Feature Selection erschien zuerst auf STATWORX.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – STATWORX. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Most Starred R Packages on GitHub
(This article was first published on Blog-rss on stevenmortimer.com, and kindly contributed to R-bloggers)
It seems like all the best R packages proudly use GitHub and have a README adorned with badges across the top. The recent Microsoft acquisition of GitHub got me wondering: What proportion of current R packages use GitHub? Or at least refer to it in the URL of the package description. Also, what is the relationship between the number of CRAN downloads and the number of stars on a repository? My curiosity got the best of me so I hastily wrote a script to pull the data. Click here to go straight to the full script and data included at the bottom of this post. I acknowledge there are more elegant ways to have coded this, but let’s press on.
Pulling List of Packages & their DetailsCRAN provides a list and links to all current packages at https://cran.rstudio.com/src/contrib. By scraping this page I found 12,675 current R packages (non-archived). For each package I pinged its detail page using the canonical form link (e.g. https://cran.r-project.org/package=ggplot2). From there I looked at the package description fields "URL" and "BugReports" to see if either contained “github.com”. It turns out that 3,718 of the packages (29.3% of the total) referenced GitHub. Below is the code snippet for retrieving the list of packages that was taken from Gergely Daróczi’s gist.
# getting list of the packages pkgs <- readHTMLTable(readLines(file.path('https://cran.rstudio.com/src/contrib')), which = 1, stringsAsFactors = FALSE)[,-1] # filter out lines that aren't really packages pkgs <- pkgs %>% filter(Size != "-", grepl('tar.gz$', Name)) %>% mutate(Name = sub('^([a-zA-Z0-9\\.]*).*', '\\1', Name))While retrieving the package metadata I pinged the GitHub API to see if I could get the number of stars for the repository. Currently, GitHub allows 5,000 authenticated requests per hour (link), but out of all the packages only 3,718 referenced GitHub, so I could make all the requests at once. Here is the function I used to take a cleaned up version of the package’s URL then form a request to the GitHub API to get star counts:
# get the star count from a clean version of the package's URL gh_star_count <- function(url){ stars <- tryCatch({ this_url <- gsub("https://github.com/", "https://api.github.com/repos/", url) req <- GET(this_url, gtoken) stop_for_status(req) cont <- content(req) cont$stargazers_count }, error = function(e){ return(NA_integer_) }) return(stars) } Analyzing the DataOnce I had all the package detail data, I found that R packages, on average, have 35.7 GitHub stars, but the median number of stars is only 6! ggplot2 has the most stars with 3,174. In my analysis I removed the xgboost, h2o, and feather packages which point to the repository of their implementations in many languages, not just R.
What I really found interesting was comparing CRAN downloads to GitHub repo stars. Using the cranlogs package I was able to get the total package downloads dating back to January 1, 2014. In contrast with the low star counts, the median downloads for R packages is 8,975. Combining stars and downloads data I found that the median R package has 903 downloads per star. Only 38.7% of packages had more than 10 stars, which shows how hard stars are to get even if you’ve written a great package. I’m not sure what proportion of R users frequently reference and contribute to GitHub, but it would be interesting to compare that with the high ratios of downloads to stars.
There are some real outliers in the data. For example, the Rcpp package, perhaps the most downloaded package of all-time, has 15.8M downloads and only 377 stars. Similarly, Hadley’s scales package has 9.4M downloads and only 115 stars. These support/helper packages just don’t get the same star love as the headliners like ggplot2, shiny, and dplyr.
Of course, I could not help but check out the stats for some of the most prolific package authors. After parsing out individuals with the ["aut", "cre"] roles I came to the not so surprising conclusion that Hadley has the most stars of any author with 12,408 stars. In contrast, Dirk Eddelbuettel had one of the lowest star-to-download ratios. For every ~38K downloads Dirk’s repositories will receive one star. Pay no attention to the man behind the curtain since his Rcpp package underpins a whole host of packages without all the GitHub fanfare. Here is a list of popular R package authors and their stats:
Author Notable Packages Downloads Stars Downloads Per Star Hadley Wickham ggplot2, dplyr, httr 113,160,314 12,408 9,119.9 Dirk Eddelbuettel Rcpp, BH 28,433,586 745 38,165.9 Yihui Xie knitr, rmarkdown, bookdown 42,472,860 6,315 6,725.7 Winston Chang R6, shiny 17,161,005 4,027 4,261.5 Jennifer Bryan readxl, gapminder, googlesheets 6,055,774 1,714 3,533.1 JJ Allaire rstudioapi, reticulate, tensorflow 8,882,553 2,798 3,174.6 Jeroen Ooms jsonlite, curl, openssl 25,907,868 1,483 17,469.9 Scott Chamberlain geojsonio, taxize 1,770,664 2,528 700.4 Jim Hester devtools, memoise, readr 22,867,071 4,332 5,278.6 Kirill Müller tibble, DBI 36,159,009 1,077 33,573.8I’m sure you could create mixed models to determine the unique download to star relationship for individuals. Also, you could use other package attributes to predict stars or downloads, but I’ll leave that to another curious soul. I will include tables below regarding the top 10 most downloaded, most starred, most and least downloaded per star.
CreditsCredit is due since this script borrows a couple different pieces of code and concepts. Retrieving the list of packages is from Gergely Daróczi’s gist. Authenticating to GitHub was taken from one of the httr demos.
Appendix Top 10 Most Starred Packages Name Author Downloads Stars Downloads Per Star ggplot2 Hadley Wickham 13,001,703 3,174 4,096.3 shiny Winston Chang 4,571,794 2,902 1,575.4 dplyr Hadley Wickham 8,276,844 2,408 3,437.2 devtools Jim Hester 5,536,730 1,645 3,365.8 knitr Yihui Xie 7,131,564 1,581 4,510.8 data.table Matt Dowle 6,005,795 1,457 4,122.0 plotly Carson Sievert 1,195,880 1,255 952.9 rmarkdown Yihui Xie 5,432,495 1,160 4,683.2 tensorflow JJ Allaire 94,856 1,033 91.8 bookdown Yihui Xie 126,586 1,009 125.5 Top 10 Most Downloaded Packages with Stars Name Author Downloads Stars Downloads Per Star Rcpp Dirk Eddelbuettel 15,824,781 377 41,975.5 ggplot2 Hadley Wickham 13,001,703 3,174 4,096.3 stringr Hadley Wickham 11,547,828 268 43,088.9 stringi Marek Gagolewski 11,310,113 122 92,705.8 digest Dirk Eddelbuettel with contributions by Antoine Lucas 11,233,244 42 267,458.2 plyr Hadley Wickham 10,340,396 470 22,000.8 R6 Winston Chang 9,993,128 212 47,137.4 reshape2 Hadley Wickham 9,582,245 173 55,388.7 scales Hadley Wickham 9,380,757 115 81,571.8 jsonlite Jeroen Ooms 9,112,790 176 51,777.2 Top 10 Packages by Stars per Download (frequently starred) Name Author Downloads Stars Downloads Per Star r2d3 Javier Luraschi 416 235 1.77 workflowr John Blischak 448 169 2.65 goodpractice Hannah Frick 523 192 2.72 xtensor Johan Mabille 2,057 664 3.10 scico Thomas Lin Pedersen 185 59 3.14 shinytest Winston Chang 418 113 3.70 furrr Davis Vaughan 724 171 4.23 pkgdown Hadley Wickham 1,589 332 4.79 rtika Sasha Goodman 168 32 5.25 mindr Peng Zhao 2,051 368 5.57 Bottom 10 Packages by Stars per Download (infrequently starred) Name Author Downloads Stars Downloads Per Star mime Yihui Xie 7,398,765 12 616,563.8 pkgmaker Renaud Gaujoux 1,228,173 2 614,086.5 rngtools Renaud Gaujoux 1,224,959 2 612,479.5 magic Robin K. S. Hankin 344,741 1 344,741.0 gsubfn G. Grothendieck 675,056 2 337,528.0 bindrcpp Kirill Müller 2,996,452 10 299,645.2 plogr Kirill Müller 3,343,099 12 278,591.6 digest Dirk Eddelbuettel with contributions by Antoine Lucas 11,233,244 42 267,458.2 munsell Charlotte Wickham 7,778,712 31 250,926.2 proto Hadley Wickham 2,593,246 11 235,749.6 Full ScriptBelow and available via gist with data at: https://gist.github.com/StevenMMortimer/1b4b626d3d91240a77f969ae04b37114
# load packages & custom functions --------------------------------------------- library(tidyverse) library(httr) library(cranlogs) library(XML) library(ggrepel) library(scales) library(knitr) library(stringr) gh_from_url <- function(x){ if(!grepl(',', x)){ x <- strsplit(x, " ")[[1]] x <- trimws(x[min(which(grepl(pattern='http://github.com|https://github.com|http://www.github.com', x, ignore.case=TRUE)))]) } else { x <- strsplit(x, ",")[[1]] x <- trimws(x[min(which(grepl(pattern='http://github.com|https://github.com|http://www.github.com', x, ignore.case=TRUE)))]) } x <- gsub("http://", "https://", tolower(x)) x <- gsub("www\\.github\\.com", "github.com", x) x <- gsub("/$", "", x) x <- gsub("^github.com", "https://github.com", x) x <- gsub("/issues", "", x) x <- gsub("\\.git", "", x) return(x) } aut_maintainer_from_details <- function(x){ x <- gsub("'|\"", "", x) if(grepl(',', x)){ x <- strsplit(x, "\\],")[[1]] aut_cre_ind <- grepl(pattern='\\[aut, cre|\\[cre, aut|\\[cre', x, ignore.case=TRUE) if(any(aut_cre_ind)){ x <- x[min(which(aut_cre_ind))] x <- gsub("\\[aut, cre|\\[cre, aut|\\[cre", "", x) } x <- strsplit(x, ",")[[1]][1] x <- trimws(gsub("\\]", "", x)) x <- trimws(gsub(" \\[aut", "", x)) } return(x) } gh_star_count <- function(url){ stars <- tryCatch({ this_url <- gsub("https://github.com/", "https://api.github.com/repos/", url) req <- GET(this_url, gtoken) stop_for_status(req) cont <- content(req) cont$stargazers_count }, error = function(e){ return(NA_integer_) }) return(stars) } # authenticate to github ------------------------------------------------------- # use Hadley's key and secret myapp <- oauth_app("github", key = "56b637a5baffac62cad9", secret = "8e107541ae1791259e9987d544ca568633da2ebf") github_token <- oauth2.0_token(oauth_endpoints("github"), myapp) gtoken <- config(token = github_token) # pull list of packages -------------------------------------------------------- # get list of currently available packages on CRAN pkgs <- readHTMLTable(readLines(file.path('https://cran.rstudio.com/src/contrib')), which = 1, stringsAsFactors = FALSE)[,-1] # filter out lines that aren't really packages pkgs <- pkgs %>% filter(Size != "-", grepl('tar.gz$', Name)) %>% mutate(Name = sub('^([a-zA-Z0-9\\.]*).*', '\\1', Name)) %>% distinct(Name, .keep_all = TRUE) # get details for each package ------------------------------------------------- all_pkg_details <- NULL # old fashioned looping! # WARNING: This takes awhile to complete for(i in 1:nrow(pkgs)){ if(i %% 100 == 0){ message(sprintf("Processing package #%s out of %s", i, nrow(pkgs))) } pkg_details <- readHTMLTable(readLines(file.path(sprintf('https://cran.r-project.org/package=%s', pkgs[i,]$Name))), header=FALSE, which = 1, stringsAsFactors = FALSE) pkg_details <- pkg_details %>% mutate(V1 = gsub(":", "", V1)) %>% spread(V1, V2) this_url <- pkg_details$URL on_github <- FALSE this_github_url <- NA_character_ gh_stars <- NA_integer_ if(!is.null(this_url)){ on_github <- grepl('http://github.com|https://github.com|http://www.github.com', this_url) if(on_github){ this_github_url <- gh_from_url(this_url) gh_stars <- gh_star_count(this_github_url) } else { # check the BugReports URL as a backup (e.g. shiny package references GitHub this way) issues_on_github <- grepl('http://github.com|https://github.com|http://www.github.com', pkg_details$BugReports) if(length(issues_on_github) == 0 || !issues_on_github){ this_github_url <- NA_character_ } else { this_github_url <- gh_from_url(pkg_details$BugReports) gh_stars <- gh_star_count(this_github_url) on_github <- TRUE } } } else { this_url <- NA_character_ } downloads <- cran_downloads(pkgs[i,]$Name, from = "2014-01-01", to = "2018-06-15") all_pkg_details <- rbind(all_pkg_details, tibble(name = pkgs[i,]$Name, published = pkg_details$Published, author = aut_maintainer_from_details(pkg_details$Author), url = this_url, github_ind = on_github, github_url = this_github_url, downloads = sum(downloads$count), stars = gh_stars ) ) } # basic summary stats ---------------------------------------------------------- # remove observations where the GitHub URL refers to a repository that # is not specific to R and therefore might have an inflated star count all_pkg_details_clean <- all_pkg_details %>% filter(!(name %in% c('xgboost', 'h2o', 'feather'))) %>% mutate(downloads_per_star = downloads / stars, downloads_per_star = ifelse(!is.finite(downloads_per_star), NA_real_, downloads_per_star)) # proportion of all packages listing github sum(all_pkg_details$github_ind) mean(all_pkg_details$github_ind) # proportion of packages with stars mean(!is.na(all_pkg_details$stars)) # typical number of stars per package mean(all_pkg_details_clean$stars, na.rm=TRUE) median(all_pkg_details_clean$stars, na.rm=TRUE) max(all_pkg_details_clean$stars, na.rm=TRUE) # typical number of downloads per package mean(all_pkg_details_clean$downloads, na.rm=TRUE) median(all_pkg_details_clean$downloads, na.rm=TRUE) # percent of packages over 10 stars mean(all_pkg_details_clean$stars > 10, na.rm=TRUE) mean(all_pkg_details_clean$downloads_per_star, na.rm=TRUE) median(all_pkg_details_clean$downloads_per_star, na.rm=TRUE) # stars histogram -------------------------------------------------------------- ggplot(data=all_pkg_details_clean, mapping=aes(stars)) + geom_histogram(aes(fill=..count..), bins=60) + scale_x_continuous(trans = "log1p", breaks=c(0,1,2,3,10,100,1000,3000)) + labs(x = "Stars", y = "Count", fill = "Count", caption = "Source: api.github.com as of 6/16/18") + ggtitle("Distribution of GitHub Stars on R Packages") + theme_bw() + theme(panel.grid.minor = element_blank(), plot.caption=element_text(hjust = 0)) # stars to downloads scatterplot ----------------------------------------------- plot_dat <- all_pkg_details_clean idx_label <- which(with(plot_dat, downloads > 10000000 | stars > 1000)) plot_dat$name2 <- plot_dat$name plot_dat$name <- "" plot_dat$name[idx_label] <- plot_dat$name2[idx_label] ggplot(data=plot_dat, aes(stars, downloads, label = name)) + geom_point(color = ifelse(plot_dat$name == "", "grey50", "red")) + geom_text_repel(box.padding = .5) + scale_y_continuous(labels = comma) + scale_x_continuous(labels = comma) + labs(x = "GitHub Stars", y = "CRAN Downloads", caption = "Sources:\napi.github.com as of 6/16/18\ncranlogs as of 1/1/14 - 6/15/18") + ggtitle("Relationship Between CRAN Downloads and GitHub Stars") + theme_bw() + theme(plot.caption=element_text(hjust = 0)) # author stats ----------------------------------------------------------------- # summary by author authors_detail <- all_pkg_details_clean %>% group_by(author) %>% summarize(downloads = sum(downloads, na.rm=TRUE), stars = sum(stars, na.rm=TRUE)) %>% mutate(downloads_per_star = downloads / stars, downloads_per_star = ifelse(!is.finite(downloads_per_star), NA_real_, downloads_per_star)) %>% arrange(desc(downloads)) # popular authors pop_authors <- tibble(author = c('Hadley Wickham', 'Dirk Eddelbuettel', 'Yihui Xie', 'Winston Chang', 'Jennifer Bryan', 'JJ Allaire', 'Jeroen Ooms', 'Scott Chamberlain', 'Jim Hester', 'Kirill Müller'), notable_packages = c('ggplot2, dplyr, httr', 'Rcpp, BH', 'knitr, rmarkdown, bookdown', 'R6, shiny', 'readxl, gapminder, googlesheets', 'rstudioapi, reticulate, tensorflow', 'jsonlite, curl, openssl', 'geojsonio, taxize', 'devtools, memoise, readr', 'tibble, DBI') ) author_stats <- pop_authors %>% inner_join(., authors_detail, by='author') %>% select(author, notable_packages, downloads, stars, downloads_per_star) %>% mutate(downloads_per_star = round(downloads_per_star, 1)) %>% rename_all(. %>% gsub("_", " ", .) %>% str_to_title) # single author #all_pkg_details_clean %>% filter(author == 'Dirk Eddelbuettel') %>% arrange(desc(downloads)) # top 10 lists ----------------------------------------------------------------- # Top 10 Most Starred Packages top_starred <- all_pkg_details_clean %>% select(name, author, downloads, stars, downloads_per_star) %>% arrange(desc(stars)) %>% slice(1:10) %>% mutate(downloads_per_star = round(downloads_per_star, 1)) %>% rename_all(. %>% gsub("_", " ", .) %>% str_to_title) # Top 10 Most Downloaded Packages with stars top_downloaded <- all_pkg_details_clean %>% filter(!is.na(stars)) %>% select(name, author, downloads, stars, downloads_per_star) %>% arrange(desc(downloads)) %>% slice(1:10) %>% mutate(downloads_per_star = round(downloads_per_star, 1)) %>% rename_all(. %>% gsub("_", " ", .) %>% str_to_title) # Bottom 10 Packages by Downloads per Star (frequently starred) frequently_starred <- all_pkg_details_clean %>% filter(downloads > 100) %>% select(name, author, downloads, stars, downloads_per_star) %>% arrange(downloads_per_star) %>% slice(1:10) %>% mutate(downloads_per_star = round(downloads_per_star, 2)) %>% rename_all(. %>% gsub("_", " ", .) %>% str_to_title) # Top 10 Packages by Downloads per Star (infrequently starred) infrequently_starred <- all_pkg_details_clean %>% select(name, author, downloads, stars, downloads_per_star) %>% arrange(desc(downloads_per_star)) %>% slice(1:10) %>% rename_all(. %>% gsub("_", " ", .) %>% str_to_title) var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Blog-rss on stevenmortimer.com. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Sketchnotes from TWiML&AI: Practical Deep Learning with Rachel Thomas
(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)
These are my sketchnotes for Sam Charrington’s podcast This Week in Machine Learning and AI about Practical Deep Learning with Rachel Thomas:
You can listen to the podcast here.
In this episode, i’m joined by Rachel Thomas, founder and researcher at Fast AI. If you’re not familiar with Fast AI, the company offers a series of courses including Practical Deep Learning for Coders, Cutting Edge Deep Learning for Coders and Rachel’s Computational Linear Algebra course. The courses are designed to make deep learning more accessible to those without the extensive math backgrounds some other courses assume. Rachel and I cover a lot of ground in this conversation, starting with the philosophy and goals behind the Fast AI courses. We also cover Fast AI’s recent decision to switch to their courses from Tensorflow to Pytorch, the reasons for this, and the lessons they’ve learned in the process. We discuss the role of the Fast AI deep learning library as well, and how it was recently used to held their team achieve top results on a popular industry benchmark of training time and training cost by a factor of more than ten. https://twimlai.com/twiml-talk-138-practical-deep-learning-with-rachel-thomas/
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
How Much Money Should Machines Earn?
(This article was first published on R – Fronkonstin, and kindly contributed to R-bloggers)
TweetEvery inch of sky’s got a star
Every inch of skin’s got a scar
(Everything Now, Arcade Fire)
I think that a very good way to start with R is doing an interactive visualization of some open data because you will train many important skills of a data scientist: loading, cleaning, transforming and combinig data and performing a suitable visualization. Doing it interactive will give you an idea of the power of R as well, because you will also realise that you are able to handle indirectly other programing languages such as JavaScript.
That’s precisely what I’ve done today. I combined two interesting datasets:
- The probability of computerisation of 702 detailed occupations, obtained by Carl Benedikt Frey and Michael A. Osborne from the University of Oxford, using a Gaussian process classifier and published in this paper in 2013.
- Statistics of jobs from (employments, median annual wages and typical education needed for entry) from the US Bureau of Labor, available here.
Apart from using dplyr to manipulate data and highcharter to do the visualization, I used tabulizer package to extract the table of probabilities of computerisation from the pdf: it makes this task extremely easy.
This is the resulting plot:
If you want to examine it in depth, here you have a full size version.
These are some of my insights (its corresponding figures are obtained directly from the dataset):
- There is a moderate negative correlation between wages and probability of computerisation.
- Around 45% of US employments are threatened by machines (have a computerisation probability higher than 80%): half of them do not require formal education to entry.
- In fact, 78% of jobs which do not require formal education to entry are threatened by machines: 0% which require a master’s degree are.
- Teachers are absolutely irreplaceable (0% are threatened by machines) but they earn a 2.2% less then the average wage (unfortunately, I’m afraid this phenomenon occurs in many other countries as well).
- Don’t study for librarian or archivist: it seems a bad way to invest your time
- Mathematicians will survive to machines
The code of this experiment is available here.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Fronkonstin. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Version 0.6-11 of NIMBLE released
(This article was first published on R – NIMBLE, and kindly contributed to R-bloggers)
We’ve released the newest version of NIMBLE on CRAN and on our website.
Version 0.6-11 has important new features, notably support for Bayesian nonparametric mixture modeling, and more are on the way in the next few months.
New features include:
- support for Bayesian nonparametric mixture modeling using Dirichlet process mixtures, with specialized MCMC samplers automatically assigned in NIMBLE’s default MCMC (See Chapter 10 of the manual for details);
- additional resampling methods available with the auxiliary and bootstrap particle filters;
- user-defined filtering algorithms can be used with NIMBLE’s particle MCMC samplers;
- MCMC thinning intervals can be modified at MCMC run-time;
- both runMCMC() and nimbleMCMC() now drop burn-in samples before thinning, making their behavior consistent with each other;
- increased functionality for the ‘setSeed’ argument in nimbleMCMC() and runMCMC();
- new functionality for specifying the order in which sampler functions are executed in an MCMC; and
- invalid dynamic indexes now result in a warning and NaN values but do not cause execution to error out, allowing MCMC sampling to continue.
Please see the NEWS file in the installed package for more details
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: R – NIMBLE. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Prediction Interval, the wider sister of Confidence Interval
(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)
In this post, I will illustrate the use of prediction intervals for the comparison of measurement methods. In the example, a new spectral method for measuring whole blood hemoglobin is compared with a reference method. But first, let's start with discussing the large difference between a confidence interval and a prediction interval.
Prediction interval versus Confidence intervalVery often a confidence interval is misinterpreted as a prediction interval, leading to unrealistic “precise” predictions. As you will see, prediction intervals (PI) resemble confidence intervals (CI), but the width of the PI is by definition larger than the width of the CI.
Let’s assume that we measure the whole blood hemoglobin concentration in a random sample of 100 persons. We obtain the estimated mean (Est_mean), limits of the confidence interval (CI_Lower and CI_Upper) and limits of the prediction interval (PI_Lower and PI_Upper): (The R-code to do this is in the next paragraph)
Est_mean CI_Lower CI_Upper PI_Lower PI_Upper 140 138 143 113 167A Confidence interval (CI) is an interval of good estimates of the unknown true population parameter. About a 95% confidence interval for the mean, we can state that if we would repeat our sampling process infinitely, 95% of the constructed confidence intervals would contain the true population mean. In other words, there is a 95% chance of selecting a sample such that the 95% confidence interval calculated from that sample contains the true population mean.
Interpretation of the 95% confidence interval in our example:
- The values contained in the interval [138g/L to 143g/L] are good estimates of the unknown mean whole blood hemoglobin concentration in the population. In general, if we would repeat our sampling process infinitely, 95% of such constructed confidence intervals would contain the true mean hemoglobin concentration.
A Prediction interval (PI) is an estimate of an interval in which a future observation will fall, with a certain confidence level, given the observations that were already observed. About a 95% prediction interval we can state that if we would repeat our sampling process infinitely, 95% of the constructed prediction intervals would contain the new observation.
Interpretation of the 95% prediction interval in the above example:
- Given the observed whole blood hemoglobin concentrations, the whole blood hemoglobin concentration of a new sample will be between 113g/L and 167g/L with a confidence of 95%. In general, if we would repeat our sampling process infinitely, 95% of the such constructed prediction intervals would contain the new hemoglobin concentration measurement.
Remark: Very often we will read the interpretation “The whole blood hemogblobin concentration of a new sample will be between 113g/L and 167g/L with a probability of 95%.” (for example on wikipedia). This interpretation is correct in the theoretical situation where the parameters (true mean and standard deviation) are known.
Estimating a prediction interval in RFirst, let's simulate some data. The sample size in the plot above was (n=100). Now, to see the effect of the sample size on the width of the confidence interval and the prediction interval, let's take a “sample” of 400 hemoglobin measurements using the same parameters:
set.seed(123) hemoglobin<-rnorm(400, mean = 139, sd = 14.75) df<-data.frame(hemoglobin)Although we don't need a linear regression yet, I'd like to use the lm() function, which makes it very easy to construct a confidence interval (CI) and a prediction interval (PI). We can estimate the mean by fitting a “regression model” with an intercept only (no slope). The default confidence level is 95%.
Confidence interval: CI<-predict(lm(df$hemoglobin~ 1), interval="confidence") CI[1,] ## fit lwr upr ## 139.2474 137.8425 140.6524The CI object has a length of 400. But since there is no slope in our “model”, each row is exactly the same.
Prediction interval: PI<-predict(lm(df$hemoglobin~ 1), interval="predict") ## Warning in predict.lm(lm(df$hemoglobin ~ 1), interval = "predict"): predictions on current data refer to _future_ responses PI[1,] ## fit lwr upr ## 139.2474 111.1134 167.3815We get a “warning” that “predictions on current data refer to future responses”. That's exactly what we want, so no worries there. As you see, the column names of the objects CI and PI are the same. Now, let's visualize the confidence and the prediction interval.
The code below is not very elegant but I like the result (tips are welcome :-))
library(ggplot2) limits_CI <- aes(x=1.3 , ymin=CI[1,2], ymax =CI[1,3]) limits_PI <- aes(x=0.7 , ymin=PI[1,2], ymax =PI[1,3]) PI_CI<-ggplot(df, aes(x=1, y=hemoglobin)) + geom_jitter(width=0.1, pch=21, fill="grey", alpha=0.5) + geom_errorbar (limits_PI, width=0.1, col="#1A425C") + geom_point (aes(x=0.7, y=PI[1,1]), col="#1A425C", size=2) + geom_errorbar (limits_CI, width=0.1, col="#8AB63F") + geom_point (aes(x=1.3, y=CI[1,1]), col="#8AB63F", size=2) + scale_x_continuous(limits=c(0,2))+ scale_y_continuous(limits=c(95,190))+ theme_bw()+ylab("Hemoglobin concentration (g/L)") + xlab(NULL)+ geom_text(aes(x=0.6, y=160, label="Prediction\ninterval", hjust="right", cex=2), col="#1A425C")+ geom_text(aes(x=1.4, y=140, label="Confidence\ninterval", hjust="left", cex=2), col="#8AB63F")+ theme(legend.position="none", axis.text.x = element_blank(), axis.ticks.x = element_blank()) PI_CIThe width of the confidence interval is very small, now that we have this large sample size (n=400). This is not surprising, as the estimated mean is the only source of uncertainty. In contrast, the width of the prediction interval is still substantial. The prediction interval has two sources of uncertainty: the estimated mean (just like the confidence interval) and the random variance of new observations.
Example: comparing a new with a reference measurement method.A prediction interval can be useful in the case where a new method should replace a standard (or reference) method.
If we can predict well enough what the measurement by the reference method would be, (given the new method) than the two methods give similar information and the new method can be used.
For example in (Tian, 2017) a new spectral method (Near-Infra-Red) to measure hemoglobin is compared with a Golden Standard. In contrast with the Golden Standard method, the new spectral method does not require reagents. Moreover, the new method is faster. We will investigate whether we can predict well enough, based on the measured concentration of the new method, what the measurement by the Golden Standard would be. (note: the measured concentrations presented below are fictive)
Hb<- read.table("http://rforbiostatistics.colmanstatistics.be/wp-content/uploads/2018/06/Hb.txt", header = TRUE) kable(head(Hb)) New Reference 84.96576 87.24013 99.91483 103.38143 111.79984 116.71593 116.95961 116.72065 118.11140 113.51541 118.21411 121.70586 plot(Hb$New, Hb$Reference, xlab="Hemoglobin concentration (g/L) - new method", ylab="Hemoglobin concentration (g/L) - reference method") Prediction Interval based on linear regressionNow, let's fit a linear regression model predicting the hemoglobin concentrations measured by the reference method, based on the concentrations measured with the new method.
fit.lm <- lm(Hb$Reference ~ Hb$New) plot(Hb$New, Hb$Reference, xlab="Hemoglobin concentration (g/L) - new method", ylab="Hemoglobin concentration (g/L) - reference method") cat ("Adding the regression line:")Adding the regression line:
abline (a=fit.lm$coefficients[1], b=fit.lm$coefficients[2] ) cat ("Adding the identity line:")Adding the identity line:
abline (a=0, b=1, lty=2)If both measurement methods would exactly correspond, the intercept would be zero and the slope would be one (=“identity line”, dotted line). Now, let's calculated the confidence interval for this linear regression.
CI_ex <- predict(fit.lm, interval="confidence") colnames(CI_ex)<- c("fit_CI", "lwr_CI", "upr_CI")And the prediction interval:
PI_ex <- predict(fit.lm, interval="prediction") ## Warning in predict.lm(fit.lm, interval = "prediction"): predictions on current data refer to _future_ responses colnames(PI_ex)<- c("fit_PI", "lwr_PI", "upr_PI")We can combine those results in one data frame and plot both the confidence interval and the prediction interval.
Hb_results<-cbind(Hb, CI_ex, PI_ex) kable(head(round(Hb_results),1)) New Reference fit_CI lwr_CI upr_CI fit_PI lwr_PI upr_PI 85 87 91 87 94 91 82 99Visualizing the CI (in green) and the PI (in blue):
plot(Hb$New, Hb$Reference, xlab="Hemoglobin concentration (g/L) - new method", ylab="Hemoglobin concentration (g/L) - reference method") Hb_results_s <- Hb_results[order(Hb_results$New),] lines (x=Hb_results_s$New, y=Hb_results_s$fit_CI) lines (x=Hb_results_s$New, y=Hb_results_s$lwr_CI, col="#8AB63F", lwd=1.2) lines (x=Hb_results_s$New, y=Hb_results_s$upr_CI, col="#8AB63F", lwd=1.2) lines (x=Hb_results_s$New, y=Hb_results_s$lwr_PI, col="#1A425C", lwd=1.2) lines (x=Hb_results_s$New, y=Hb_results_s$upr_PI, col="#1A425C", lwd=1.2) abline (a=0, b=1, lty=2)In (Bland, Altman 2003) it is proposed to calculate the average width of this prediction interval, and see whether this is acceptable. Another approach is to compare the calculated PI with an “acceptance interval”. If the PI is inside the acceptance interval for the measurement range of interest (see Francq, 2016).
In the above example, both methods do have the same measurement scale (g/L), but the linear regression with prediction interval is particularly useful when the two methods of measurement have different units.
However, the method has some disadvantages:
- Predictions intervals are very sensitive to deviations from the normal distribution.
- In “standard” linear regression (or Ordinary Least Squares (OLS) regression),the presence of measurement error is allowed for the Y-variable (here, the reference method) but not for the X-variable (the new method). The absence of errors on the x-axis is one of the assumptions. Since we can expect some measurement error for the new method, this assumption is violated here.
In contrast to Ordinary Least Square (OLS) regression, Bivariate Least Square (BLS) regression takes into account the measurement errors of both methods (the New method and the Reference method). Interestingly, prediction intervals calculated with BLS are not affected when the axes are switched (del Rio, 2001).
In 2017, a new R-package BivRegBLS was released. It offers several methods to assess the agreement in method comparison studies, including Bivariate Least Square (BLS) regression.
If the data are unreplicated but the variance of the measurement error of the methods is known, The BLS() and XY.plot() functions can be used to fit a bivariate Least Square regression line and corresponding confidence and prediction intervals.
library (BivRegBLS) Hb.BLS = BLS (data = Hb, xcol = c("New"), ycol = c("Reference"), var.y=10, var.x=8, conf.level=0.95) XY.plot (Hb.BLS, yname = "Hemoglobin concentration (g/L) - reference method", xname = "Hemoglobin concentration (g/L) - new method", graph.int = c("CI","PI"))Now we would like to decide whether the new method can replace the reference method. We allow the methods to differ up to a given threshold, which is not clinically relevant. Based on this threshold an “acceptance interval” is created. Suppose that differences up to 10 g/L (=threshold) are not clinically relevant, then the acceptance interval can be defined as Y=X±??, with ?? equal to 10. If the PI is inside the acceptance interval for the measurement range of interest then the two measurement methods can be considered to be interchangeable (see Francq, 2016).
The accept.int argument of the XY.plot() function allows for a visualization of this acceptance interval
XY.plot (Hb.BLS, yname = "Hemoglobin concentration (g/L) - reference method", xname = "Hemoglobin concentration (g/L) - new method", graph.int = c("CI","PI"), accept.int=10)For the measurement region 120g/L to 150 g/L, we can conclude that the difference between both methods is acceptable. If the measurement regions below 120g/l and above 150g/L are important, the new method cannot replace the reference method.
Regression on replicated dataIn method comparison studies, it is advised to create replicates (2 or more measurements of the same sample with the same method). An example of such a dataset:
Hb_rep <- read.table("http://rforbiostatistics.colmanstatistics.be/wp-content/uploads/2018/06/Hb_rep.txt", header = TRUE) kable(head(round(Hb_rep),1)) New_rep1 New_rep2 Ref_rep1 Ref_rep2 88 95 90 84When replicates are available, the variance of the measurement errors are calculated for both the new and the reference method and used to estimate the prediction interval. Again, the BLS() function and the XY.plot() function are used to estimate and plot the BLS regression line, the corresponding CI and PI.
Hb_rep.BLS = BLS (data = Hb_rep, xcol = c("New_rep1", "New_rep2"), ycol = c("Ref_rep1", "Ref_rep2"), qx = 1, qy = 1, conf.level=0.95, pred.level=0.95) XY.plot (Hb_rep.BLS, yname = "Hemoglobin concentration (g/L) - reference method", xname = "Hemoglobin concentration (g/L) - new method", graph.int = c("CI","PI"), accept.int=10)It is clear that the prediction interval is not inside the acceptance interval here. The new method cannot replace the reference method. A possible solution is to average the repeats. The BivRegBLS package can create prediction intervals for the mean of (2 or more) future values, too! More information in this presentation (presented at useR!2017).
In the plot above, averages of the two replicates are calculated and plotted. I'd like to see the individual measurements:
plot(x=c(Hb_rep$New_rep1, Hb_rep$New_rep2), y=c(Hb_rep$Ref_rep1, Hb_rep$Ref_rep2), xlab="Hemoglobin concentration (g/L) - new method", ylab="Hemoglobin concentration (g/L) - reference method") lines (x=as.numeric(Hb_rep.BLS$Pred.BLS[,1]), y=as.numeric(Hb_rep.BLS$Pred.BLS[,2]), lwd=2) lines (x=as.numeric(Hb_rep.BLS$Pred.BLS[,1]), y=as.numeric(Hb_rep.BLS$Pred.BLS[,3]), col="#8AB63F", lwd=2) lines (x=as.numeric(Hb_rep.BLS$Pred.BLS[,1]), y=as.numeric(Hb_rep.BLS$Pred.BLS[,4]), col="#8AB63F", lwd=2) lines (x=as.numeric(Hb_rep.BLS$Pred.BLS[,1]), y=as.numeric(Hb_rep.BLS$Pred.BLS[,5]), col="#1A425C", lwd=2) lines (x=as.numeric(Hb_rep.BLS$Pred.BLS[,1]), y=as.numeric(Hb_rep.BLS$Pred.BLS[,6]), col="#1A425C", lwd=2) abline (a=0, b=1, lty=2) Remarks- Although not appropriate in the context of method comparison studies, Pearson correlation is still frequently used. See Bland & Altman (2003) for an explanation on why correlations are not adviced.
- Methods presented in this blogpost are not applicable to time-series
- Confidence interval and prediction interval: Applied Linear Statistical Models, 2005, Michael Kutner, Christopher Nachtsheim, John Neter, William Li. Section 2.5
- Prediction interval for method comparison:
Bland, J. M. and Altman, D. G. (2003), Applying the right statistics: analyses of measurement studies. Ultrasound Obstet Gynecol, 22: 85-93. doi:10.1002/uog.12
section: “Appropriate use of regression”. - Francq, B. G., and Govaerts, B. (2016) How to regress and predict in a Bland-Altman plot? Review and contribution based on tolerance intervals and correlated-errors-in-variables models. Statist. Med., 35: 2328-2358. doi: 10.1002/sim.6872.
- del Río, F. J., Riu, J. and Rius, F. X. (2001), Prediction intervals in linear regression taking into account errors on both axes. J. Chemometrics, 15: 773-788. doi:10.1002/cem.663
- Example of a method comparison study: H. Tian, M. Li, Y. Wang, D. Sheng, J. Liu, and L. Zhang, “Optical wavelength selection for portable hemoglobin determination by near-infrared spectroscopy method,” Infrared Phys. Techn 86, 98-102 (2017). doi.org/10.1016/j.infrared.2017.09.004.
- the predict() and lm() functions of R: Chambers, J. M. and Hastie, T. J. (1992) Statistical Models in S. Wadsworth & Brooks/Cole.
Related Post
- Six Sigma DMAIC Series in R – Part 2
- Six Sigma DMAIC Series in R – Part 1
- Implementation and Interpretation of Control Charts in R
- Time series model of forecasting future power demand
- Tennis Grand Slam Tournaments Champions Basic Analysis
To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Interpreting machine learning models with the lime package for R
(This article was first published on Revolutions, and kindly contributed to R-bloggers)
Many types of machine learning classifiers, not least commonly-used techniques like ensemble models and neural networks, are notoriously difficult to interpret. If the model produces a surprising label for any given case, it's difficult to answer the question, "why that label, and not one of the others?".
One approach to this dilemma is the technique known as LIME (Local Interpretable Model-Agnostic Explanations). The basic idea is that while for highly non-linear models it's impossible to give a simple explanation of the relationship between any one variable and the predicted classes at a global level, it might be possible to asses which variables are most influential on the classification at a local level, near the neighborhood of a particular data point. An procedure for doing so is described in this 2016 paper by Ribeiro et al, and implemented in the R package lime by Thomas Lin Pedersen and Michael Benesty (and a port of the Python package of the same name).
You can read about how the lime package works in the introductory vignette Understanding Lime, but this limerick by Mara Averick sums also things up nicely:
There once was a package called lime,
Whose models were simply sublime,
It gave explanations for their variations,
One observation at a time.
"One observation at a time" is the key there: given a prediction (or a collection of predictions) it will determine the variables that most support (or contradict) the predicted classification.
The lime package also works with text data: for example, you may have a model that classifies a paragraph of text as a sentiment "negative", "neutral" or "positive". In that case, lime will determine the the words in that sentence which are most important to determining (or contradicting) the classification. The package helpfully also provides a shiny app making it easy to test out different sentences and see the local effect of the model.
To learn more about the lime algorithm and how to use the associated R package, a great place to get started is the tutorial Visualizing ML Models with LIME from the University of Cincinnati Business Analytics R Programming Guide. The lime package is available on CRAN now, and you can always find the latest version at the GitHub repository linked below.
GitHub (thomasp): lime (Local Interpretable Model-Agnostic Explanations)
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
R Tip: Be Wary of “…”
(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)
R Tip: be wary of “...“.
The following code example contains an easy error in using the R function unique().
vec1 <- c("a", "b", "c") vec2 <- c("c", "d") unique(vec1, vec2) # [1] "a" "b" "c"Notice none of the novel values from vec2 are present in the result. Our mistake was: we (improperly) tried to use unique() with multiple value arguments, as one would use union(). Also notice no error or warning was signaled. We used unique() incorrectly and nothing pointed this out to us. What compounded our error was R‘s “...” function signature feature.
In this note I will talk a bit about how to defend against this kind of mistake. I am going to apply the principle that a design that makes committing mistakes more difficult (or even impossible) is a good thing, and not a sign of carelessness, laziness, or weakness. I am well aware that every time I admit to making a mistake (I have indeed made the above mistake) those who claim to never make mistakes have a laugh at my expense. Honestly I feel the reason I see more mistakes is I check a lot more.
Data science coding is often done in a rush (deadlines, wanting to see results, and so on). Instead of moving fast, let’s take the time to think a bit about programming theory using a very small concrete issue. This lets us show how one can work a bit safer (saving time in the end), without sacrificing notational power or legibility.
A confounding issue is: unique() failed to alert me of my mistake because, unique()‘s function signature (like so many R functions) includes a “...” argument. I might have been careless or in a rush, but it seems like unique was also in a rush and did not care to supply argument inspection.
In R a function that includes a “...” in its signature will accept arbitrary arguments and values in addition to the ones explicitly declared and documented. There are three primary uses of “...” in R: accepting unknown arguments that are to be passed to later functions, building variadic functions, and forcing later arguments to be bound by name (my favorite use). Unfortunately, “...” is also often used to avoid documenting arguments and turns off a lot of very useful error checking.
An example of the “accepting unknown arguments” use is lapply(). lapply() passes what it finds in “...” to whatever function it is working with. For example:
lapply(c("a", "b"), paste, "!", sep = "") # [[1]] # [1] "a!" # # [[2]] # [1] "b!"Notice the arguments “"!", sep = ""” were passed on to paste(). Since lapply() can not know what function the user will use ahead of time it uses the “...” abstraction to forward arguments. Personally I never use this form and tend to write the somewhat more explicit and verbose style shown below.
lapply(c("a", "b"), function(vi) { paste(vi, "!", sep = "") })I feel this form is more readable as the arguments are seen where they are actually used. (Note: this, is a notional example- in practice we would use “paste0(c("a", "b"), "!")” to directly produce the result as a vector.)
An example of using “...” to supply a variadic interface is paste() itself.
paste("a", "b", "c") # [1] "a b c"Other important examples include list() and c(). In fact I like list() and c() as they only take a “...” and no other arguments. Being variadic is so important to list() and c() is that is essentially all they do. One can often separate out the variadic intent with lists or vectors as in:
paste(c("a", "b", "c"), collapse = " ") # [1] "a b c"Even I don’t write code such as the above (that is too long even for me), unless the values are coming from somewhere else (such as a variable). However with wrapr‘s reduce/expand operator we can completely separate the collection of variadic arguments and their application. The notation looks like the following:
library("wrapr") values <- c("a", "b", "c") values %.|% paste # [1] "a b c"Essentially reduce/expand calls variadic functions with items taken from a vector or list as individual arguments (allowing one to program easily over variadic functions). %.|% is intended to place values in the “...” slot of a function (the variadic term). It is designed for a more perfect world where when a function declares “...” in its signature it is then the only user facing part of the signature. This is hardly ever actually the case in R as common functions such as paste() and sum() have additional optional named arguments (which we are here leaving at their default values), whereas c() and list() are pure (take only “...“).
With a few non-standard (name capturing) and variadic value constructors one does not in fact need other functions to be name capturing or variadic. With such tools one can have these conveniences everywhere. For example we can convert our incorrect use of unique() into correct code using c().
unique(c(vec1, vec2)) # [1] "a" "b" "c" "d"In the above code roles are kept separate: c() is collecting values and unique() is applying a calculation. We don’t need a powerful “super unique” or “super union” function, unique() is good enough if we remember to use c().
In the spirit of our earlier article on function argument names we have defined a convenience function wrapr::uniques() that enforces the use of value carrying arguments. With wrapr::uniques() if one attempts the mistake I have been discussing one immediately gets a signaling error (instead of propagating incorrect calculations forward).
library("wrapr") uniques(c(vec1, vec2)) # [1] "a" "b" "c" "d" uniques(vec1, vec2) # Error: wrapr::uniques unexpected arguments: vec2With uniques() we either get the right answer, or we get immediately stopped at the mistaken step. This is a good way to work.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Why your S3 method isn’t working
(This article was first published on That’s so Random, and kindly contributed to R-bloggers)
Throughout the last years I noticed the following happening with a number of people. One of those people was actually yours truely a few years back. Person is aware of S3 methods in R through regular use of print, plot and summary functions and decides to give it a go in own work. Creates a function that assigns a class to its output and then implements a bunch of methods to work on the class. Strangely, some of these methods appear to be working as expected, while others throw an error. After a confusing and painful debugging session, person throws hands in the air and continues working without S3 methods. Which was working fine in the first place. This is a real pity, because all the person is overlooking is a very small step in the S3 chain: the generic function.
A nonworking methodSo we have a function doing all kinds of complicated stuff. It outputs a list with several elements. We assign a S3 class to it before returning, so we can subsequently implement a number of methods1. Lets just make something up here.
my_func <- function(x) { ret <- list(dataset = x, d = 42, y = rnorm(10), z = c('a', 'b', 'a', 'c')) class(ret) <- "myS3" ret } out <- my_func(mtcars)Perfect, now lets implement a print method. Rather than outputting the whole list, we just want to know the most vital information when printing.
print.myS3 <- function(x) { cat("Original dataset has", nrow(x$dataset), "rows and", ncol(x$dataset), "columns\n", "d is", x$d) } out ## Original dataset has 32 rows and 11 columns ## d is 42Ha, that is working!. Now we do a mean method, that gives us the mean of the y variable.
mean.myS3 <- function(x) { mean(x$y) } mean(out) ## [1] 0.2631094Works too! And finally we do a count_letters method. It takes z from the output and counts how often each letter occurs.
count_letters.myS3 <- function(x) { table(out$z) } count_letters(out) ## Error in count_letters(out): could not find function "count_letters"What do you mean “could not find function”? It is right there! Maybe we made a typo. Mmmm, no it doesn’t seem so. Maybe, mmmm, lets look into this…. Half an hour, a bit of swearing and feelings of stupidity later. Pfff, lets not bother about S3, we were happy with just using functions in the first place.
GenericsNow why are print and mean working just fine, but count_letters isn’t? Lets look under the hood of print and mean.
print ## function (x, ...) ## UseMethod("print") ## ## mean ## function (x, ...) ## UseMethod("mean") ## ##They look exactly the same! They call the UseMethod function on their own function name. Looking into the help file of UseMethod, it all of a sudden starts to make sense.
“When a function calling UseMethod(“fun”) is applied to an object with class attribute c(“first”, “second”), the system searches for a function called fun.first and, if it finds it, applies it to the object. If no such function is found a function called fun.second is tried. If no class name produces a suitable function, the function fun.default is used, if it exists, or an error results.”
So by calling print and mean on the myS3 object we were not calling the method itself. Rather, we call the general functions print and mean (the generics) and they call the function UseMethod. This function then does the method dispatch: lookup the method belonging to the S3 object the function was called on. We were just lucky the print and mean generics were already in place and called our methods. However, the count_letters function indeed doesn’t exist (as the error message tells us). Only the count_letters method exist, for objects of class myS3. We just learned that methods cannot get called directly, but are invoked by generics. All we need to do, thus, is build a generic for count_letters and we are all set.
count_letters <- function(x) { UseMethod("count_letters") } count_letters(out) ## ## a b c ## 2 1 1-
It is actually ill-advised to assign a S3 class directly to an output. Rather use a constructor, see 16.3.1 of Advanced R for the how and why.
To leave a comment for the author, please follow the link and comment on their blog: That’s so Random. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Creating Custom Plotly Charts on Datazar
(This article was first published on R Language in Datazar Blog on Medium, and kindly contributed to R-bloggers)
Have you ever wondered how you create or use Plotly (plotly.js) on Datazar? No? Well here it is anyway.
Datazar’s live chart editor allows you to create custom D3js charts with a JavaScript and CSS editor. Using the DatazarJS library, you can call your datasets without having to worry about file paths or URLs; simply do Datazar.dataset('someDataset.csv',()=>{}).
Is your data being automatically updated via the Datazar REST API? The charts update themselves so you don’t have to worry about taking into account new data.
Including Plotly The “Libraries” button will prompt a pop up that allows you to include external JS libraires and CSS styles.Plotly provides a CDN link so you can use their library without saving your own copy. That’s wonderful; let’s use that and copy it into the popup.
The CodeLet’s use one of the examples Plotly was kind enough to create on their site: https://plot.ly/javascript/histograms/
Copy and paste the JS code to the Datazar JS editor and you’re done.
The JS editor on the left contains the code from the histogram example.Note this example actually generates the data on the fly but it’s the same principle as grabbing your data using the async Dataset function.
The CSS editor is just making sure the container page keeps the same color as the Plotly chart.
Using the Chart in a PaperTo use your newly minted chart on a Paper, Datazar’s new interactive research paper, create a Paper and use the option bar on the right to include it. The chart will be on the “Visualization” tab.
The cool thing about using the Plotly chart along with the interactive Paper is that it will keep its interactivity so you can zoom in and do all the fun things Plotly allows on their charts even AFTER you publish your Paper. This means your readers can play with your charts and get a better understanding; after all, that’s the whole point.
Checkout more features here: https://www.datazar.com/features
Creating Custom Plotly Charts on Datazar was originally published in Datazar Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R Language in Datazar Blog on Medium. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
How to create a Flexdashboard: Exercises
(This article was first published on R-exercises, and kindly contributed to R-bloggers)
INTRODUCTION
With flexdashboard, you can easily create interactive dashboards for R. What is amazing about it is that with R Markdown, you can publish a group of related data visualizations as a dashboard.
Additionally, it supports a wide variety of components, including htmlwidgets; base, lattice, and grid graphics; tabular data; gauges and value boxes and text annotations.
It is flexible and easy to specify rows and column-based layouts. Components are intelligently re-sized to fill the browser and adapted for display on mobile devices.
In combination with Shiny, you can create a high quality dashboard with interactive visualizations.
Before proceeding, please follow our short tutorial.
Look at the examples given and try to understand the logic behind them. Then, try to solve the exercises below by using R without looking at the answers. Then, check the solutions to check your answers.
Exercise 1
Create a new flexdashboard R Markdown file from the R console
Exercise 2
Create the very initial dashboard interface in a single column.
Exercise 3
Add the space in which you will put your first chart
Exercise 4
Add the space in which you will put your second chart. The two charts should be stacked vertically.
Exercise 5
Add a third chart with the same logic.
Exercise 6
Transform your layout to scrolling.
Exercise 7
Displays THE 3 charts split across two columns.
Exercise 8
Change the width of these two columns.
Exercise 9
Defines two rows instead of columns, the first of which has a single chart and the second of which has two charts:
Exercise 10
Change the height of these two columns.
Related exercise sets:- Building Shiny App Exercises (part-8)
- Data visualization with googleVis exercises part 10
- Data Visualization with googleVis exercises part 2
- Explore all our (>1000) R exercises
- Find an R course using our R Course Finder directory
To leave a comment for the author, please follow the link and comment on their blog: R-exercises. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Build httr Functions Automagically from Manual Browser Requests with the middlechild Package
(This article was first published on R – rud.is, and kindly contributed to R-bloggers)
You can catch a bit of the @rOpenSci 2018 Unconference experience at home w with this short-ish ‘splainer video on how to use the new middlechild package (https://github.com/ropenscilabs/middlechild) & mitmproxy to automagically create reusable httr verb functions from manual browser form interactions.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – rud.is. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Spreadsheet Data Manipulation in R
(This article was first published on R tutorial for Spatial Statistics, and kindly contributed to R-bloggers)
Today I decided to create a new repository on GitHub where I am sharing code to do spreadsheet data manipulation in R.
The first version of the repository and R script is available here: SpreadsheetManipulation_inR
As an example I am using a csv freely available from the IRS, the US Internal Revenue Service.
https://www.irs.gov/statistics/soi-tax-stats-individual-income-tax-statistics-2015-zip-code-data-soi
This spreadsheet has around 170’000 rows and 131 columns.
Please feel free to request new functions to be added or add functions and code yourself directly on GitHub.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R tutorial for Spatial Statistics. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Working with Your Facebook Data in R
(This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)
.knitr .inline { background-color: #f7f7f7; border:solid 1px #B0B0B0; } .error { font-weight: bold; color: #FF0000; } .warning { font-weight: bold; } .message { font-style: italic; } .source, .output, .warning, .error, .message { padding: 0 1em; border:solid 1px #F7F7F7; } .source { background-color: #f5f5f5; } .rimage .left { text-align: left; } .rimage .right { text-align: right; } .rimage .center { text-align: center; } .hl.num { color: #AF0F91; } .hl.str { color: #317ECC; } .hl.com { color: #AD95AF; font-style: italic; } .hl.opt { color: #000000; } .hl.std { color: #585858; } .hl.kwa { color: #295F94; font-weight: bold; } .hl.kwb { color: #B05A65; } .hl.kwc { color: #55aa55; } .hl.kwd { color: #BC5A65; font-weight: bold; }
How to Read in and Clean Your Facebook Data – I recently learned that you can download all of your Facebook data, so I decided to check it out and bring it into R. To access your data, go to Facebook, and click on the white down arrow in the upper-right corner. From there, select Settings, then, from the column on the left, “Your Facebook Information.” When you get the Facebook Information screen, select “View” next to “Download Your Information.” On this screen, you’ll be able to select the kind of data you want, a date range, and format. I only wanted my posts, so under “Your Information,” I deselected everything but the first item on the list, “Posts.” (Note that this will still download all photos and videos you posted, so it will be a large file.) To make it easy to bring into R, I selected JSON under Format (the other option is HTML).
After you click “Create File,” it will take a while to compile – you’ll get an email when it’s ready. You’ll need to reenter your password when you go to download the file.
The result is a Zip file, which contains folders for Posts, Photos, and Videos. Posts includes your own posts (on your and others’ timelines) as well as posts from others on your timeline. And, of course, the file needed a bit of cleaning. Here’s what I did.
Since the post data is a JSON file, I need the jsonlite package to read it.
setwd("C:/Users/slocatelli/Downloads/facebook-saralocatelli35/posts")library(jsonlite)
FBposts <- fromJSON("your_posts.json")
This creates a large list object, with my data in a data frame. So as I did with the Taylor Swift albums, I can pull out that data frame.
myposts <- FBposts$status_updatesThe resulting data frame has 5 columns: timestamp, which is in UNIX format; attachments, any photos, videos, URLs, or Facebook events attached to the post; title, which always starts with the author of the post (you or your friend who posted on your timeline) followed by the type of post; data, the text of the post; and tags, the people you tagged in the post.
First, I converted the timestamp to datetime, using the anytime package.
library(anytime)myposts$timestamp <- anytime(myposts$timestamp)
Next, I wanted to pull out post author, so that I could easily filter the data frame to only use my own posts.
library(tidyverse)myposts$author <- word(string = myposts$title, start = 1, end = 2, sep = fixed(" "))
Finally, I was interested in extracting URLs I shared (mostly from YouTube or my own blog) and the text of my posts, which I did with some regular expression functions and some help from Stack Overflow (here and here).
url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"myposts$links <- str_extract(myposts$attachments, url_pattern)
library(qdapRegex)
myposts$posttext <- myposts$data %>%
rm_between('"','"',extract = TRUE)
There’s more cleaning I could do, but this gets me a data frame I could use for some text analysis. Let’s look at my most frequent words.
myposts$posttext <- as.character(myposts$posttext)library(tidytext)
mypost_text <- myposts %>%
unnest_tokens(word, posttext) %>%
anti_join(stop_words)
## Joining, by = "word"
counts <- mypost_text %>%
filter(author == "Sara Locatelli") %>%
drop_na(word) %>%
count(word, sort = TRUE)
counts
## # A tibble: 9,753 x 2
## word n
##
## 1 happy 4702
## 2 birthday 4643
## 3 today's 666
## 4 song 648
## 5 head 636
## 6 day 337
## 7 post 321
## 8 009f 287
## 9 ð 287
## 10 008e 266
## # ... with 9,743 more rows
These data include all my posts, including writing “Happy birthday” on other’s timelines. I also frequently post the song in my head when I wake up in the morning (over 600 times, it seems). If I wanted to remove those, and only include times I said happy or song outside of those posts, I’d need to apply the filter in a previous step. There are also some strange characters that I want to clean from the data before I do anything else with them. I can easily remove these characters and numbers with string detect, but cells that contain numbers and letters, such as “008e” won’t be cut out with that function. So I’ll just filter them out separately.
drop_nums <- c("008a","008e","009a","009c","009f")counts <- counts %>%
filter(str_detect(word, "[a-z]+"),
!word %in% str_detect(word, "[0-9]"),
!word %in% drop_nums)
Now I could, for instance, create a word cloud.
library(wordcloud)counts %>%
with(wordcloud(word, n, max.words = 50))
In addition to posting for birthdays and head songs, I talk a lot about statistics, data, analysis, and my blog. I also post about beer, concerts, friends, books, and Chicago. Let’s see what happens if I mix in some sentiment analysis to my word cloud.
library(reshape2)##
## Attaching package: 'reshape2'
counts %>%
inner_join(get_sentiments("bing")) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("red","blue"), max.words = 100)
## Joining, by = "word"
Once again, a few words are likely being misclassified – regression and plot are both negatively-valenced, but I imagine I’m using them in the statistical sense instead of the negative sense. I also apparently use “died” or “die” but I suspect in the context of, “I died laughing at this.” And “happy” is huge, because it includes birthday wishes as well as instances where I talk about happiness. Some additional cleaning and exploration of the data is certainly needed. But that’s enough to get started with this huge example of “me-search.”
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Quantile Regression (home made)
(This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers)
After my series of post on classification algorithms, it’s time to get back to R codes, this time for quantile regression. Yes, I still want to get a better understanding of optimization routines, in R. Before looking at the quantile regression, let us compute the median, or the quantile, from a sample.
MedianConsider a sample . To compute the median, solvewhich can be solved using linear programming techniques. More precisely, this problem is equivalent towith and , .
To illustrate, consider a sample from a lognormal distribution,
For the optimization problem, use the matrix form, with constraints, and parameters,
1 2 3 4 5 6 7 library(lpSolve) A1 = cbind(diag(2*n),0) A2 = cbind(diag(n), -diag(n), 1) r = lp("min", c(rep(1,2*n),0), rbind(A1, A2),c(rep(">=", 2*n), rep("=", n)), c(rep(0,2*n), y)) tail(r$solution,1) [1] 1.077415It looks like it’s working well…
QuantileOf course, we can adapt our previous code for quantiles
1 2 3 4 tau = .3 quantile(x,tau) 30% 0.6741586The linear program is nowwith and , . The R code is now
1 2 3 4 5 6 A1 = cbind(diag(2*n),0) A2 = cbind(diag(n), -diag(n), 1) r = lp("min", c(rep(tau,n),rep(1-tau,n),0), rbind(A1, A2),c(rep(">=", 2*n), rep("=", n)), c(rep(0,2*n), y)) tail(r$solution,1) [1] 0.6741586So far so good…
Quantile Regression (simple)Consider the following dataset, with rents of flat, in a major German city, as function of the surface, the year of construction, etc.
1 base=read.table("http://freakonometrics.free.fr/rent98_00.txt",header=TRUE)The linear program for the quantile regression is nowwith and . So use here
1 2 3 4 5 6 7 8 9 10 11 12 require(lpSolve) tau = .3 n=nrow(base) X = cbind( 1, base$area) y = base$rent_euro A1 = cbind(diag(2*n), 0,0) A2 = cbind(diag(n), -diag(n), X) r = lp("min", c(rep(tau,n), rep(1-tau,n),0,0), rbind(A1, A2), c(rep(">=", 2*n), rep("=", n)), c(rep(0,2*n), y)) tail(r$solution,2) [1] 148.946864 3.289674Of course, we can use R function to fit that model
1 2 3 4 5 library(quantreg) rq(rent_euro~area, tau=tau, data=base) Coefficients: (Intercept) area 148.946864 3.289674Here again, it seems to work quite well. We can use a different probability level, of course, and get a plot
1 2 3 4 5 6 7 8 9 10 11 12 13 plot(base$area,base$rent_euro,xlab=expression(paste("surface (",m^2,")")), ylab="rent (euros/month)",col=rgb(0,0,1,.4),cex=.5) sf=0:250 yr=r$solution[2*n+1]+r$solution[2*n+2]*sf lines(sf,yr,lwd=2,col="blue") tau = .9 r = lp("min", c(rep(tau,n), rep(1-tau,n),0,0), rbind(A1, A2), c(rep(">=", 2*n), rep("=", n)), c(rep(0,2*n), y)) tail(r$solution,2) [1] 121.815505 7.865536 yr=r$solution[2*n+1]+r$solution[2*n+2]*sf lines(sf,yr,lwd=2,col="blue") Quantile Regression (multiple)Now that we understand how to run the optimization program with one covariate, why not try with two ? For instance, let us see if we can explain the rent of a flat as a (linear) function of the surface and the age of the building.
1 2 3 4 5 6 7 8 9 10 11 12 require(lpSolve) tau = .3 n=nrow(base) X = cbind( 1, base$area, base$yearc ) y = base$rent_euro A1 = cbind(diag(2*n), 0,0,0) A2 = cbind(diag(n), -diag(n), X) r = lp("min", c(rep(tau,n), rep(1-tau,n),0,0,0), rbind(A1, A2), c(rep(">=", 2*n), rep("=", n)), c(rep(0,2*n), y)) tail(r$solution,3) [1] 0.000000 3.257562 0.077501Unfortunately, this time, it is not working well…
1 2 3 4 5 library(quantreg) rq(rent_euro~area+yearc, tau=tau, data=base) Coefficients: (Intercept) area yearc -5542.503252 3.978135 2.887234Results are quite different. And actually, another technique can confirm the later (IRLS – Iteratively Reweighted Least Squares)
1 2 3 4 5 6 7 8 eps = residuals(lm(rent_euro~area+yearc, data=base)) for(s in 1:500){ reg = lm(rent_euro~area+yearc, data=base, weights=(tau*(eps>0)+(1-tau)*(eps<0))/abs(eps)) eps = residuals(reg) } reg$coefficients (Intercept) area yearc -5484.443043 3.955134 2.857943I could not figure out what went wrong with the linear program. Not only coefficients are very different, but also predictions…
1 2 3 yr = r$solution[2*n+1]+r$solution[2*n+2]*base$area+r$solution[2*n+3]*base$yearc plot(predict(reg),yr) abline(a=0,b=1,lty=2,col="red")
It’s now time to investigate….
To leave a comment for the author, please follow the link and comment on their blog: R-english – Freakonometrics. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Detecting unconscious bias in models, with R
(This article was first published on Revolutions, and kindly contributed to R-bloggers)
There's growing awareness that the data we collect, and in particular the variables we include as factors in our predictive models, can lead to unwanted bias in outcomes: from loan applications, to law enforcement, and in many other areas. In some instances, such bias is even directly regulated by laws like the Fair Housing Act in the US. But even if we explicitly remove "obvious" variables like sex, age or ethnicity from predictive models, unconscious bias might still be a factor in our predictions as a result of highly-correlated proxy variables that are included in our model.
As a result, we need to be aware of the biases in our model and take steps to address them. For an excellent general overview of the topic, I highly recommend watching the recent presentation by Rachel Thomas, "Analyzing and Preventing Bias in ML". And for a practical demonstration of one way you can go about detecting proxy bias in R, take a look at the vignette created by my colleague Paige Bailey for the ROpenSci conference, "Ethical Machine Learning: Spotting and Preventing Proxy Bias".
The vignette details general principles you can follow to identify proxy bias in an analysis, in the context of a case study analyzed using R. The case study considers data and a predictive model that might be used by a bank manager to determine the creditworthiness of a loan applicant. Even though race was not explicitly included in the adaptive boosting model (from the C5.0 package), the predictions are still biased by race:
That's because zipcode, a variable highly associated with race, was included in the model. Read the complete vignette linked below to see how Paige modified the model to ameliorate that bias, while still maintaining its predictive power. All of the associated R code is available in the iPython Notebook.
GitHub (ropenscilabs): Ethical Machine Learning: Spotting and Preventing Proxy Bias (Paige Bailey)
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...