
### Join, split, and compress PDF files with pdftools

Wed, 04/24/2019 - 02:00

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

Last month we released a new version of pdftools and a new companion package qpdf for working with pdf files in R. This release introduces the ability to perform pdf transformations, such as splitting and combining pages from multiple files. Moreover, the pdf_data() function which was introduced in pdftools 2.0 is now available on all major systems.

Split and Join PDF files

It is now possible to split, join, and compress pdf files with pdftools. For example the pdf_subset() function creates a new pdf file with a selection of the pages from the input file:

```r
# Load pdftools
library(pdftools)

# extract some pages
pdf_subset('https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf',
           pages = 1:3, output = "subset.pdf")

# Should say 3
pdf_length("subset.pdf")
```

Similarly pdf_combine() is used to join several pdf files into one.

```r
# Generate another pdf
pdf("test.pdf")
plot(mtcars)
dev.off()

# Combine them with the other one
pdf_combine(c("test.pdf", "subset.pdf"), output = "joined.pdf")

# Should say 4
pdf_length("joined.pdf")
```

The split and join features are provided via a new package qpdf which wraps the qpdf C++ library. The main benefit of qpdf is that no external software (such as pdftk) is needed. The qpdf package is entirely self contained and works reliably on all operating systems with zero system dependencies.
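Compression, also mentioned above, works the same way via `pdf_compress()`. A minimal sketch using a locally generated file (the file names here are illustrative):

```r
library(pdftools)

# Generate a small two-page pdf to work with (file names are illustrative)
pdf("plots.pdf")
plot(1:10)
plot(10:1)
dev.off()

# Compress the file without changing its contents; returns the output path
pdf_compress("plots.pdf", output = "plots_small.pdf")
pdf_length("plots_small.pdf")  # still 2 pages
```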

Data extraction now available on Linux too

The pdftools 2.0 announcement post from December introduced the new pdf_data() function for extracting individual text boxes from pdf files. However it was noted that this function was not yet available on most Linux distributions because it requires a recent fix from poppler 0.73.
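For a sense of what pdf_data() returns, here is a small sketch with a locally generated file (file name is illustrative; requires a recent enough poppler, as discussed below):

```r
library(pdftools)

# Create a one-page pdf containing some text (file name is illustrative)
pdf("hello.pdf")
plot.new()
text(0.5, 0.5, "Hello world")
dev.off()

# pdf_data() returns one data frame per page; each row is a text box
# with its size and position on the page
boxes <- pdf_data("hello.pdf")[[1]]
head(boxes)  # columns include width, height, x, y, space, and text
```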

I am happy to say that this should soon work on all major Linux distributions. Ubuntu has upgraded to poppler 0.74 on Ubuntu Disco which was released this week. I also created a PPA for Ubuntu 16.04 (Xenial) and 18.04 (Bionic) with backports of poppler 0.74. This makes it possible to use pdf_data on Ubuntu LTS servers, including Travis:

```sh
sudo add-apt-repository ppa:opencpu/poppler
sudo apt-get update
sudo apt-get install libpoppler-cpp-dev
```

Moreover, the upcoming Fedora 30 will ship with poppler-devel 0.73.

Finally, the upcoming Debian “Buster” release will ship with poppler 0.71, but the Debian maintainers were nice enough to let me backport the required patch from poppler 0.73, so pdf_data() will work on Debian (and hence CRAN) as well!


To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### 77th Tokyo.R Users Meetup Roundup!

Wed, 04/24/2019 - 01:00

(This article was first published on R by R(yo), and kindly contributed to R-bloggers)

As the sakura bloomed in Tokyo, another TokyoR User Meetup was held, this time at SONY City! On April 13th, useRs from all over Tokyo (and some even from further afield) flocked to Osaki, Tokyo for a special session focused on beginner R users, BeginneRs. We’ve also had several special “themed” sessions in the past, such as TokyoR #69: Bayesian statistics in June last year and the TokyoR #70: Hadley Wickham + Joe Rickert Special! last July. Although this session was mainly for BeginneRs, the LTs were applications of R/case studies and may be of more interest to non-beginners, so feel free to skip ahead to that section.

Like my previous round-up posts (for TokyoR #76 and JapanR Conference #9), I will be going over around half of all the talks. Hopefully, my efforts will help spread the vast knowledge of Japanese R users to the wider R community. Throughout, I will also post helpful blog posts and links from other sources if you are interested in learning more about the topic of a certain talk. You can follow Tokyo.R by searching for the #TokyoR hashtag on Twitter.

BeginneR Session #1

These were the same set of beginner-focused talks that happen at the start of every TokyoR meetup:

BeginneR Session #2

Since this was a special BeginneR session, the main talks were all
focused on even more introductory stuff on top of the usual beginner’s
session in the previous section.

kilometer00: Using R for Data Analysis

First up, @kilometer00 talked about doing data analysis with R. As a
brief introduction he talked about what exactly IS data analysis, how we
might go about forming a research problem and hypothesis, and how these
steps are important in figuring out what kind of modeling we might
attempt to do on the data at hand. The modeling techniques
@kilometer00 covered were just the basics such as single/multiple
linear regression and PCA.

In the last part of the talk he covered nested data modelling. By using group_by() and nest() we can create a data frame that has one row per group (for example, a row for each species group in the iris data set), with a new column called data which is itself a data frame. Now, instead of having to repeat the modelling code over and over again for each subset of rows (a model for each species in the iris data set), by using the purrr::map_*() family of functions you can apply your model to each of the groups that you specified earlier.

Filled with great explanatory graphics, plots, and code, the slides are a good example of how to teach the basics of modelling with R.
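A minimal sketch of that nested workflow, assuming dplyr, tidyr, and purrr are installed:

```r
library(dplyr)
library(tidyr)
library(purrr)

# One row per species; the per-species rows go into a list-column called "data"
nested <- iris %>%
  group_by(Species) %>%
  nest()

# Fit the same model to each group with purrr::map()
nested <- nested %>%
  mutate(model = map(data, ~ lm(Sepal.Length ~ Petal.Length, data = .x)))

nested  # 3 rows, one fitted model per species
```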

Some other resources about modelling with R:

aad34210: Become a useR with R Studio

Next, aad34210 talked about R Studio, the IDE that maximizes R’s strengths for programming and data analysis. After a brief spiel on the early days of using R solely from the console, he covered R Studio’s capabilities and the various options that can be customized to suit your R needs. From installing R Studio, configuring the four main panes, and using R Projects to setting Global Options, aad34210 opened up his own R Studio window and pointed out the various menu options in thorough detail to help beginneRs navigate without getting overwhelmed. He rounded off the talk by showing the various cheat sheets (including one for R Studio itself) that can be obtained from the Help tab.

Some other resources one might consider to learn R Studio are:

u_ribo: Version Control and Project Management with R

@u_ribo gave a talk about the benefits of creating a reproducible and enclosed R environment using git and Docker. As an instructor who has run many R workshops, @u_ribo has run into the problem of his learners all having different OSs, versions of packages, and versions of R itself, causing much headache for all involved. This is also the case when working in a team environment where analyses need to be shared and reproducibility of results is essential.

To reduce these problems he gave a live-demo using a variety of R tools
such as the here package, the usethis package, and managing a
project environment with R Studio Projects (.Rproj). To go further in
depth he also talked about using git (made very easy with its seamless
integration with R Studio) and the use of Docker. To run Docker you
need a Docker “image” and a Docker “container”. An image, which is built from a file called a Dockerfile, contains the configured operating system for the environment. The container is the actual running instance of that image.
Going back to the issue of running a workshop, using Docker allows all
of the participants to run R inside a specific container, an enclosed environment
set up by the instructor, so that all the different dependencies and
version differences won’t prevent you from running the code provided in
the workshop.

Other good introductions to Docker and R:

niszet: Reproducible Reports with R Markdown

@niszet talked about reproducible reporting with R Markdown. He was
certainly the right person to give a talk on this topic as he is the
author of the mini-books, “Create a Word document with R Markdown” and
“Create a PowerPoint presentation with R Markdown”. To begin, he talked
about cases where one writes a document, report, or any kind of output
and how it might be a good idea to be able to create it again for
“future me” or anybody else who might want to take a look. Normally, you would run into problems such as “where did my data go?” and “how did I pre-process my data?”, but you can mitigate these problems by using R Markdown reports. Whether it’s importing your raw data, the
pre-processing/modelling/viz code, to the actual report and
documentation write-up R Markdown renders the document in a completely
clean environment each time so the results should be the same, every
time! As noted in the beginning, you can create many different kinds of output, such as Word documents, PowerPoint presentations, HTML documents, presentation slides, and more – even business cards (with the pagedown package)! Following an explanation of what you can do with R Markdown,
@niszet talked about how exactly one would use it to its best
capabilities. In a live-demo he took us through all the different
components of an R Markdown document:

• YAML header: Where one defines the “how” of the RMD such as the
title, template, output directory, output type, etc.
• Code chunks and options: Where all the code you want to run (in R or other supported languages) is located. Chunk options, such as whether to evaluate the code, whether to show it, and many more, are also specified here.
• Markdown text: Regular text using markdown. Can also include inline code using backticks.
• Various buttons/shortcut keys: Such as Ctrl + Shift + K to Knit!
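Putting those components together, a minimal .Rmd file might look like this (title and chunk contents are illustrative):

````markdown
---
title: "My Analysis"
output: word_document
---

Regular markdown text, with inline code like `r nrow(mtcars)`.

```{r summary-chunk, echo=FALSE}
# chunk options control whether code is shown, evaluated, etc.
summary(mtcars)
```
````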

Some other intros to R Markdown:

LTs

GotaMorishita: Finding a New Place to Live with R

It’s only been 3 (three!) days since @GotaMorishita started learning R
yet here he was giving an LT on finding a new place to live in Tokyo
using R! Tired of living with his parents @GotaMorishita decided to
live by himself and started thinking about ways to use R and machine
learning to search for a place with an optimal price and location. After
web-scraping a housing website and pre-processing the data he came
across a problem: if he split the data into a training and test set for
selecting the best predictive model then he would be throwing away a
considerable amount of possible candidates for housing.

If @GotaMorishita took out 90% of the houses/apartments from the training data and kept those as candidates for the test data, it would have meant that the training data had a markedly different distribution from the test data, and the model created from the training set wouldn’t be as accurate. This problem, called covariate shift, occurs when the training data and test data have different distributions but the conditional distribution of the output values given the input data is unchanged. Using standard methods of model selection such as cross-validation or AIC in this situation leads to biased estimates. However, this problem can be mitigated by weighting the loss function by importance (the ratio of test to training input densities). You can find a more detailed description in the
research papers below. @GotaMorishita used xgboost to implement his
models, one with importance weighting and another without, and used
group-wise cross-validation to tune the hyperparameters. The results below compare the overall test scores for all Tokyo districts (left) and just the Sangenjaya district (right): the RMSE was smaller when importance weighting was used.
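A toy base-R sketch of the importance-weighting idea (here the train and test input densities are known by construction; with real data such as the housing set they would have to be estimated):

```r
set.seed(1)

# Training inputs from N(0, 1); suppose test inputs follow a shifted N(1, 1)
x_train <- rnorm(500, mean = 0, sd = 1)
y_train <- sin(x_train) + rnorm(500, sd = 0.1)

# Importance weights: ratio of test to training input densities
w <- dnorm(x_train, mean = 1, sd = 1) / dnorm(x_train, mean = 0, sd = 1)

# The weighted fit emphasizes training points that resemble test inputs
fit_weighted <- lm(y_train ~ x_train, weights = w)
coef(fit_weighted)
```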

sk_bono36: Creating Marketing Personas with R and rtweet

@sk_bono36 gave a presentation on using R for marketing purposes with
the rtweet package. In marketing there is a concept called a “persona”
which is a blueprint of what a certain type of person in your target
audience for your product/service is like. A basic persona template
can consist of their name, job title, demographic details, values, and
hobbies. You create these ideal customers through careful market
research involving both social media analytics and
interviewing/surveying the customers themselves. @sk_bono36 focused on
creating a persona (with “自転車/Bicycle” as the keyword for this case study) by collecting tweets with rtweet and then running Japanese text analysis with RMeCab. Visualizing the data with word clouds and network graphs of
bi-grams he was able to find a lot of information on Twitter users who
have “bicycle” on their profile or tweets and extract the common
characteristics of this type of person.

As a result of his analysis @sk_bono36 was able to create a persona of
people who like bicycles:

• 20~30 Years Old
• Friendly disposition
• Likes Anime/video games
• Does weight lifting

Some other intros to the rtweet package:

igjit: Create a type-checker package for R

@igjit, who also presented at Japan.R back in December on building an R compiler with R, gave a talk about another recent project of his: a package that acts as a type-checking system for R. A problem that he finds in R is
that errors relating to having the wrong data type in the arguments of R
functions are only found when code is executed, not before. Frustrated
by this, @igjit created his own package called typrr that type-checks your code! The underlying checker that typrr runs is Flycheck, a syntax-checking extension for Emacs.

For now, the package only checks the basic data types found in R (integer, double, logical, and character), and it can only check functions with a single argument. He rounded off the talk by saying that he
created this package just for fun and he advised all the beginneRs in
the audience that people learn from doing rather than just
reading so to truly get better it’s best to go out and experiment!

Other Talks

Food, Drinks, and Conclusion

Following all of the talks, those who were staying for the after-party
were served sushi and drinks! With a loud rendition of “kampai!”
(cheers!) R users from all over Tokyo began to talk about their
successes and struggles with R. A fun tradition at TokyoR is a
Rock-Paper-Scissors tournament with the prize being free data
science books (I still haven’t won any nor even gotten close to the last rounds)!

The prizes for this month were:

• A couple of copies of the “Create a Word document with R Markdown” mini-book by niszet.
• 3 copies of the Japanese translation (by Hoxo-m Inc.) of “Feature Engineering for Machine Learning” by Alice Zheng and Amanda Casari, provided by uribo.

TokyoR happens almost monthly and it’s a great way to mingle with
Japanese R users as it’s the largest regular meetup here in Japan. Talks
in English are also welcome so if you’re ever in Tokyo come join us!


To leave a comment for the author, please follow the link and comment on their blog: R by R(yo).

### Probability of winning a best-of-7 series

Tue, 04/23/2019 - 07:50

(This article was first published on R – Statistical Odds & Ends, and kindly contributed to R-bloggers)

The NBA playoffs are in full swing! A total of 16 teams are competing in a playoff-format competition, with the winner of each best-of-7 series moving on to the next round. In each matchup, two teams play 7 basketball games against each other, and the team that wins more games progresses. Of course, we often don’t have to play all 7 games: we can determine the overall winner once either team reaches 4 wins.

During one of the games, a commentator made a remark along the lines of “you have no idea how hard it is for the better team to lose in a best-of-7 series”. In this post, I’m going to try and quantify how hard it is to do so! I will do this not via analytical methods, but by Monte Carlo simulation. (You can see the code all at once here.)

Let’s imagine an idealized set-up where, for any given game, team A has probability p of beating team B. This probability is assumed to be the same from game to game, and the outcome of each game is assumed to be independent of the others. With these assumptions, the number of games that team A wins (out of 7) has a Binomial(7, p) distribution.

In our simulation, for each value of p, we will generate a large number of random Binomial(7, p) values and determine the proportion of them that are at least 4: that is our estimate of the probability that team A wins the 7-game series. How many random values should we draw? We should draw enough so that we are reasonably confident of the accuracy of the proportion estimate. If we draw N samples and the estimated proportion is p_hat, then the plug-in estimate of the standard error is sqrt(p_hat * (1 - p_hat) / N). In what follows, we will plot estimates with error bars indicating one standard error.
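As a sanity check on the simulation, the series-win probability can also be computed exactly from the binomial distribution:

```r
# Exact probability that team A wins a best-of-n series (n odd),
# i.e. P(at least (n + 1) / 2 wins) for per-game win probability p
series_win_exact <- function(p, n = 7) {
  1 - pbinom(n %/% 2, n, p)
}

series_win_exact(0.5)  # 0.5, by symmetry
series_win_exact(0.7)  # about 0.874
```

The Monte Carlo estimates should hover around these exact values.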

First, let’s set up a function that takes in three parameters: the number of simulation runs (simN), the total number of games in a series (n), and the probability that team A wins (p). The function returns the proportion of the simN runs which team A wins.

```r
sim_fn <- function(simN, n, p) {
  if (n %% 2 == 0) {
    stop("n should be an odd number")
  }
  mean(rbinom(simN, n, p) > n / 2)
}
```

Next, we set up a data frame with each row representing a different win probability for team A. Because of the symmetry inherent in the problem, we focus on win probabilities p ≥ 0.5.

```r
n <- 7
df <- data.frame(p = seq(0.5, 1, length.out = 26), n = n)
```

For a start, we set simN to be 100. For each row, we run simN simulation runs to get the win probability estimate and compute the associated standard error estimate. We also compute the values one standard error above and below the probability estimate, to be used when plotting error bars.

```r
set.seed(1043)
simN <- 100
df$win_prob <- apply(df, 1, function(x) sim_fn(simN, x[2], x[1]))

# compute std err & p_hat +/- 1 std err
df$std_err <- with(df, sqrt(win_prob * (1 - win_prob) / simN))
df$lower <- with(df, win_prob - std_err)
df$upper <- with(df, win_prob + std_err)
```

Here is the code for the plot and the plot itself:

```r
library(ggplot2)
ggplot(df, aes(x = p, y = win_prob)) +
  geom_line() +
  geom_linerange(aes(ymin = lower, ymax = upper)) +
  labs(title = paste("Win probability for best of", n, "series"),
       x = "Win prob for single game",
       y = "Win prob for series")
```

We can see that the line is still quite wiggly with large error bars, suggesting that 100 simulation runs is not enough. (Indeed, we expect the graph to be monotonically increasing.) Below, we run 100,000 simulations for each value of p instead (with the same random seed):

We get a much smoother line, and the error bars are so small we can hardly see them. My conclusion here is that while it is harder for the weaker team to win a best-of-7 series than a single game, the odds are not insurmountable. For example, a team that has a 70% chance of winning any one game, which is a huge advantage, still has about a 13% chance of losing a best-of-7 series: not insignificant!

How do these probabilities change if we consider best-of-n series for different values of n? The code below is very similar to that above, except that our data frame contains data for n equal to all the odd numbers from 1 to 15, not just n = 7.

```r
p <- seq(0.5, 1, length.out = 26)
n <- seq(1, 15, length.out = 8)
df <- expand.grid(p, n)
names(df) <- c("p", "n")

set.seed(422)
simN <- 100000
df$win_prob <- apply(df, 1, function(x) sim_fn(simN, x[2], x[1]))

# compute std err & p_hat +/- 1 std err
df$std_err <- with(df, sqrt(win_prob * (1 - win_prob) / simN))
df$lower <- with(df, win_prob - std_err)
df$upper <- with(df, win_prob + std_err)

ggplot(df, aes(x = p, y = win_prob, col = factor(n))) +
  geom_line() +
  geom_linerange(aes(ymin = lower, ymax = upper)) +
  labs(title = "Win probability for best of n series",
       x = "Win prob for single game",
       y = "Win prob for series") +
  theme(legend.position = "bottom")
```

Here is the plot:

As we expect, the series win probabilities increase as n increases. Also, the n = 1 graph corresponds to the y = x line. It looks like the win probabilities don’t change much from best-of-9 to best-of-15.


To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Odds & Ends.

### Comparing Point-and-Click Front Ends for R

Tue, 04/23/2019 - 01:58

(This article was first published on R – r4stats.com, and kindly contributed to R-bloggers)

Now that I’ve completed seven detailed reviews of Graphical User Interfaces (GUIs) for R, let’s compare them. It’s easy enough to count their features and plot them, so let’s start there. I’m basing the counts on the number of menu items in each category. That’s not too hard to get, but it’s far from perfect. Some software has fewer menu choices, depending instead on dialog box choices. Studying every menu and dialog box would be too time-consuming, so be aware of this limitation. I’m putting the details of each measure in the appendix so you can adjust the figures and create your own graphs. If you decide to make your own graphs, I’d love to hear from you in the comments below.

Figure 1 shows the number of analytic methods each software supports on the x-axis and the number of graphics methods on the y-axis. The analytic methods count combines statistical features, machine learning / artificial intelligence ones (ML/AI), and the ability to create R model objects. The graphics features count totals up the number of bar charts, scatterplots, etc. each package can create.

The ideal place to be in this graph is in the upper right corner. We see that BlueSky and R Commander offer quite a lot of both analytic and graphical features. Rattle stands out as having the second greatest number of graphics features. JASP is the lowest on graphics features and 3rd from the bottom on analytic ones.

Next, let’s swap out the y-axis for general usability features. These consist of a variety of features that make your work easier, including data management capabilities (see appendix for details).

Figure 2 shows that BlueSky and R Commander are still in the top two positions overall, but now Deducer has nearly caught up with R Commander on the number of general features. That’s due to its reasonably strong set of data management tools, plus its output is in true word processing tables, saving you the trouble of formatting it yourself. Rattle is much lower in this plot since, while its graphics capabilities are strong (at least in relation to ML/AI tasks), it has minimal data management capabilities.

These plots help show us three main overall feature sets, but each package offers things that the others don’t. Let’s look at a brief overview of each. Remember that each of these has a detailed review that follows my standard template. I’ll start with the two that have come out on top, then follow in alphabetical order.

The R Commander – This is the oldest GUI, having been around since at least 2005. There are an impressive 41 plug-ins developed for it. It is currently the only R GUI that saves R Markdown files, but it does not create word processing tables by default, as some of the others do. The R code it writes is classic, rarely using the newer tidyverse functions. It works as a partner to R; you install R separately, then use it to install and start R Commander. It makes it easy to blend menu-based analysis with coding. If your goal is to learn to code in classic R, this is an excellent choice.

BlueSky Statistics – This software was created by former SPSS employees and it shares many of SPSS’ features. BlueSky is only a few years old, and it converted from commercial to open source just a few months ago. Although BlueSky and R Commander offer many of the same features, they do them in different ways. When using BlueSky, it’s not initially apparent that R is involved at all. Unless you click the “Syntax” button that every dialog box has, you’ll never see the R code or the code editor. Its output is in publication-quality tables which follow the popular style of the American Psychological Association.

Deducer – This has a very nice-looking interface, and it’s probably the first to offer true word processing tables by default. Being able to just cut and paste a table into your word processor saves a lot of time and it’s a feature that has been copied by several others. Deducer was released in 2008, and when I first saw it, I thought it would quickly gain developers. It got a few, but development seems to have halted. Deducer’s installation is quite complex, and it depends on the troublesome Java software. It also used JGR, which never became as popular as the similar RStudio. The main developer, Ian Fellows, has moved on to another very interesting GUI project called Vivid.

jamovi – The developers who form the core of the jamovi project used to be part of the JASP team. Despite the fact that they started a couple of years later, they’re ahead of JASP in several ways at the moment. Its developers decided that the R code it used should be visible and any R code should be executable, something that differentiated it from JASP. jamovi has an extremely interactive interface that shows you the result of every selection in each dialog box. It also saves the settings in every dialog box, and lets you re-use every step on a new dataset by saving a “template.” That’s extremely useful since GUI users often don’t want to learn R code. jamovi’s biggest weakness is its dearth of data management tasks, though there are plans to address that.

JASP – The biggest advantage JASP offers is its emphasis on Bayesian analysis. If that’s your preference, this might be the one for you. At the moment JASP is very different from all the other GUIs reviewed here because it won’t show you the R code it’s writing, and you can’t execute your own R code from within it. Plus the software has not been open to outside developers. The development team plans to address those issues, and their deep pockets should give them an edge.

Rattle – If your work involves ML/AI (a.k.a. data mining) instead of standard statistical methods, Rattle may be the best GUI for you. It’s focused on ML/AI, and its tabbed-based interface makes quick work of it. However, it’s the weakest of them all when it comes to statistical analysis. It also lacks many standard data management features. The only other GUI that offers many ML/AI features is BlueSky.

RKWard – This GUI blends a nice point-and-click interface with an integrated development environment that is the most advanced of all the other GUIs reviewed here. It’s easy to install and start, and it saves all your dialog box settings, allowing you to rerun them. However, that’s done step-by-step, not all at once as jamovi’s templates allow. The code RKWard creates is classic R, with no tidyverse at all.

Conclusion

I hope this brief comparison will help you choose the R GUI that is right for you. Each offers unique features that can make life easier for non-programmers. If one catches your eye, don’t forget to read the full review of it here.

Acknowledgements

Writing this set of reviews has been a monumental undertaking. It would not have been possible without the assistance of Bruno Boutin, Anil Dabral, Ian Fellows, John Fox, Thomas Friedrichsmeier, Rachel Ladd, Jonathan Love, Ruben Ortiz, Christina Peterson, Josh Price, Eric-Jan Wagenmakers, and Graham Williams.

Appendix: Guide to Scoring

In figures 1 and 2, Analytic Features adds up: statistics, machine learning / artificial intelligence, the ability to create R model objects, and the ability to validate models using techniques such as k-fold cross-validation. The Graphics Features is the sum of two rows, the number of graphs the software can create plus one point for small multiples, or facets, if it can do them. Usability is everything else, with each row worth 1 point, except where noted.

| Feature | Definition |
|---|---|
| Simple installation | Is it done in one step? |
| Simple start-up | Does it start on its own without starting R, loading packages, etc.? |
| Import Data Files | How many file types can it import? |
| Import Database | How many databases can it read from? |
| Export Data Files | How many file formats can it write to? |
| Data Editor | Does it have a data editor? |
| Can work on >1 file | Can it work on more than one file at a time? |
| Variable View | Does it show metadata in a variable view, allowing for many fast edits to metadata? |
| Data Management | How many data management tasks can it do? |
| Transform Many | Can it transform many variables at once? |
| Types | How many graph types does it have? |
| Small Multiples | Can it show small multiples (facets)? |
| Model Objects | Can it create R model objects? |
| Statistics | How many statistical methods does it have? |
| ML/AI | How many ML / AI methods does it have? |
| Model Validation | Does it offer model validation (k-fold, etc.)? |
| R Code IDE | Can you edit and execute R code? |
| GUI Reuse | Does it let you re-use work without code? |
| Code Reuse | Does it let you rerun all using code? |
| Package Management | Does it manage packages for you? |
| Table of Contents | Does output have a table of contents? |
| Re-order | Can you re-order output? |
| Publication Quality | Is output in publication quality by default? |
| R Markdown | Can it create R Markdown? |
| Output as Input | Does it save equivalents of broom’s tidy, glance, and augment? (1 point for each) |
| Developer tools | Does it offer developer tools? |

Scores

| Feature | BlueSky | Deducer | JASP | jamovi | Rattle | Rcmdr | RKWard |
|---|---|---|---|---|---|---|---|
| Simple installation | 1 | 0 | 1 | 1 | 0 | 0 | 1 |
| Simple start-up | 1 | 1 | 1 | 1 | 0 | 0 | 1 |
| Import Data Files | 7 | 13 | 4 | 5 | 9 | 7 | 5 |
| Import Database | 5 | 0 | 0 | 0 | 1 | 0 | 0 |
| Export Data Files | 5 | 7 | 1 | 4 | 1 | 3 | 3 |
| Data Editor | 1 | 1 | 0 | 1 | 0 | 1 | 1 |
| Can work on >1 file | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| Variable View | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| Data Management | 30 | 9 | 2 | 3 | 9 | 25 | 4 |
| Transform Many | 1 | 1 | 0 | 1 | 1 | 1 | 0 |
| Types | 25 | 16 | 9 | 12 | 24 | 21 | 14 |
| Small Multiples | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| Model Objects | 1 | 1 | 0 | 0 | 0 | 1 | 1 |
| Statistics | 96 | 37 | 26 | 44 | 8 | 95 | 22 |
| ML/AI | 9 | 0 | 0 | 0 | 12 | 0 | 0 |
| Model Validation | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| R Code IDE | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
| GUI Reuse | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| Code Reuse | 1 | 1 | 0 | 1 | 1 | 1 | 1 |
| Package Management | 1 | 0 | 1 | 1 | 0 | 0 | 0 |
| Output: Table of Contents | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| Output: Re-order | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Output: Publication Quality | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| Output: R Markdown | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| Output: Add comments | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| Group-by / Split File | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| Output as Input | 3 | 1 | 0 | 0 | 0 | 1 | 0 |
| Developer tools | 1 | 1 | 0 | 1 | 0 | 1 | 1 |
| Total | 196 | 94 | 48 | 77 | 67 | 160 | 56 |

To leave a comment for the author, please follow the link and comment on their blog: R – r4stats.com. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Le Monde puzzle [#1094]

Tue, 04/23/2019 - 00:19

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

A rather blah number Le Monde mathematical puzzle:

Find all integer multiples of 11111 with exactly one occurrence of each decimal digit.

Which I solved by brute force, by looking at the possible range of multiples (and borrowing stringr::str_count from Robin!)

```r
> combien = 0
> for (i in 90001:900008) {
+   j = i * 11111
+   combien = combien + (min(stringr::str_count(j, paste(0:9))) == 1)
+ }
> combien
[1] 3456
```
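For readers who prefer to avoid the stringr dependency, an equivalent check (my own variant, not from the original post) compares the sorted digits against 0:9 — every multiplier in that range yields a 10-digit product, so "each digit at least once" is the same as "each digit exactly once":

```r
# a number uses each decimal digit exactly once iff its sorted digits are 0..9
is_pandigital <- function(n) {
  digits <- sort(as.integer(strsplit(as.character(n), "")[[1]]))
  identical(digits, 0:9)
}

# same count over the same range of multipliers as above
sum(sapply(90001:900008, function(i) is_pandigital(i * 11111)))
```

Note that the multiplication stays in double precision (the literal 11111 is a double in R), so no integer overflow occurs even though the products exceed `.Machine$integer.max`.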

And a bonus one:

Find all integers y that can be written both as x³ and as (10z)³+a with 1≤a≤999.

which does not offer much in terms of solutions since, writing v=10z, x³-v³=(x-v)(x²+xv+v²)=a shows that x² is less than 2a/3, meaning x is at most 25. Among such numbers only x=11, 12 lead to a solution, as x³=1331, 1728.
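For a quick sanity check of that argument, a brute-force scan (my own sketch, not from the post) over small x and z recovers exactly those two solutions:

```r
# search for x^3 = (10*z)^3 + a with 1 <= a <= 999
solutions <- NULL
for (x in 1:100) {
  for (z in 1:9) {
    a <- x^3 - (10 * z)^3
    if (a >= 1 && a <= 999) {
      solutions <- rbind(solutions, c(x = x, z = z, a = a))
    }
  }
}
solutions
#       x  z    a
# [1,] 11  1  331
# [2,] 12  1  728
```

For z ≥ 2 the gap (10z+1)³-(10z)³ = 300z²+30z+1 already exceeds 999, which is why only z=1 contributes.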


To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.

### Using R/exams for Written Exams in Finance Classes

Tue, 04/23/2019 - 00:00

(This article was first published on R/exams, and kindly contributed to R-bloggers)

Experiences with using R/exams for written exams in finance classes with a moderate number of students at Texas A&M International University (TAMIU).

Guest post by Nathaniel P. Graham (Texas A&M International University, Division of International Banking and Finance Studies).

Background

While R/exams was originally written for large statistics classes, there is nothing specific to either large courses or statistics. In this article, I will describe how I use R/exams for my “Introduction to Finance (FIN 3310)” course. While occasionally I might have a class of 60 students, I generally have smaller sections of 20–25. These courses are taught and exams are given in-person (some material and homework assignments are delivered via the online learning management system, but I do not currently use R/exams to generate that).

My written exams are generated by exams2nops() and have two components: a multiple-choice portion and a short-answer portion. The former are generated using R/exams’s dynamic exercise format while the latter are included from static PDF documents. Example pages from a typical exam are linked below:

Because I have a moderate number of students, I grade all my exams manually. This blog post outlines how the exercises are set up and which R scripts I use to automate the generation of the PDF exams.

Multiple-choice questions

Shuffling

The actual questions in my exams are not different from many other R/exams examples, except that they are about finance instead of statistics. An example is included below with LaTeX markup, and both the R/LaTeX (.Rnw) and R/Markdown (.Rmd) versions can be downloaded: goalfinance.Rnw, goalfinance.Rmd. I have 7 possible answers, but only 5 will be randomly chosen for a given exam (always including the 1 correct answer). To do so, many of my non-numerical questions take advantage of the exshuffle option. (If you are one of my students, I told you this would be on the exam!)

Periodic updates to individual questions (stored as separate files) keep the exams fresh.

```latex
\begin{question}
The primary goal of financial management is to maximize the \dots
\begin{answerlist}
  \item Present value of future operating cash flows
  \item Net income, according to GAAP/IFRS
  \item Market share
  \item Value of the firm
  \item Number of shares outstanding
  \item Book value of shareholder equity
  \item Firm revenue
\end{answerlist}
\end{question}
\exname{Goal of Financial Management}
\extype{schoice}
\exsolution{0001000}
\exshuffle{5}
```

Numerical exercises

Since this is for a finance class, there are numerical exercises as well. Every semester, I promise my students that this problem will be on the second midterm. The problem is simple: given some cash flows and a discount rate, calculate the net present value (NPV), see npvcalc.Rnw, npvcalc.Rmd. Since the cash flows, the discount rate, and the answers are generated from random numbers, I would not gain much from defining more than five possible answers, the way I did in the qualitative question above. As you might expect, some of the incorrect answers are common mistakes students make when approaching NPV, so I have not lost anything relative to the carefully crafted questions and answers I used in the past to find out who really learned the material.
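Before looking at the exercise template itself, the arithmetic it automates can be seen with fixed numbers (values of my own choosing, not from the exam pool): discount each cash flow by (1+r)^t and sum.

```r
# NPV of -1500 now, then 300 per year for 5 years, at a 10% discount rate
r <- 0.10
cf0 <- -1500
ocf <- rep(300, 5)
discounts <- (1 + r)^(1:5)
npv <- round(sum(ocf / discounts) + cf0, 2)
npv
# [1] -362.76
```

The exercise's distractor answers (notvm, wrongtvm, revtvm) are this same computation with the discounting step done incorrectly — skipped, applied once, or reversed.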

```
<<>>=
discountrate <- round(runif(1, min = 6.0, max = 15.0), 2)
r <- discountrate / 100.0
cf0 <- sample(10:20, 1) * -100
ocf <- sample(seq(200, 500, 25), 5)
discounts <- sapply(1:5, function(i) (1 + r) ** i)
npv <- round(sum(ocf / discounts) + cf0, 2)
notvm <- round(sum(ocf) + cf0, 2)
wrongtvm <- round(sum(ocf / (1.0 + r)) + cf0, 2)
revtvm <- round(sum(ocf * (1.0 + r)) + cf0, 2)
offnpv <- round(npv + sample(c(-200.0, 200.0), 1), 2)
@

\begin{question}
Assuming the discount rate is \Sexpr{discountrate}\%, find the net present
value of a project with the following cash flows, starting at time 0:
\$\Sexpr{cf0}, \Sexpr{ocf[1]}, \Sexpr{ocf[2]}, \Sexpr{ocf[3]},
\Sexpr{ocf[4]}, \Sexpr{ocf[5]}.
\begin{answerlist}
  \item \$\Sexpr{wrongtvm}
  \item \$\Sexpr{notvm}
  \item \$\Sexpr{npv}
  \item \$\Sexpr{revtvm}
  \item \$\Sexpr{offnpv}
\end{answerlist}
\end{question}
\exname{Calculating NPV}
\extype{schoice}
\exsolution{00100}
\exshuffle{5}
```

Leveraging R’s flexibility

While there are other methods of generating randomized exams out there, few are as flexible as R/exams, in large part because we have full access to R. The next example also appears on my second midterm; it asks students to compute the internal rate of return for a set of cash flows, see irrcalc.Rnw, irrcalc.Rmd. Numerically, this is a simple root-finding problem, but systems that do not support more advanced mathematical operations (such as Blackboard’s “Calculated Formula” questions) can make this difficult or impossible to implement directly.
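The exercise poses this as a minimization of the squared NPV; since the NPV is monotone in the rate for these cash-flow patterns, direct root-finding with uniroot() works as well. A sketch with fixed numbers of my own choosing (not from the exam pool):

```r
# IRR: the discount rate at which NPV equals zero
cf0 <- -1500
ocf <- rep(400, 5)
npv <- function(r) sum(ocf / (1 + r)^(1:5)) + cf0

# NPV is positive at r = 0 and negative at r = 1, so a root lies between
res <- uniroot(npv, interval = c(0, 1), tol = 1e-9)
round(res$root * 100, 2)  # IRR in percent
```

Squaring the NPV and using optimize(), as the exercise does, avoids having to bracket a sign change by hand.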

```
<<>>=
cf0 <- sample(10:16, 1) * -100
ocf <- sample(seq(225, 550, 25), 5)
npvfunc <- function(r) {
  discounts <- sapply(1:5, function(i) (1 + r) ** i)
  npv <- (sum(ocf / discounts) + cf0) ** 2
  return(npv)
}
res <- optimize(npvfunc, interval = c(-1, 1))
irr <- round(res$minimum, 4) * 100.0
wrong1 <- irr + sample(c(1.0, -1.0), 1)
wrong2 <- irr + sample(c(0.25, -0.25), 1)
wrong3 <- irr + sample(c(0.5, -0.5), 1)
wrong4 <- irr + sample(c(0.75, -0.75), 1)
@

\begin{question}
Find the internal rate of return of a project with the following cash flows,
starting at time 0: \$\Sexpr{cf0}, \Sexpr{ocf[1]}, \Sexpr{ocf[2]},
\Sexpr{ocf[3]}, \Sexpr{ocf[4]}, \Sexpr{ocf[5]}.
\begin{answerlist}
  \item \Sexpr{wrong1}\%
  \item \Sexpr{wrong2}\%
  \item \Sexpr{irr}\%
  \item \Sexpr{wrong3}\%
  \item \Sexpr{wrong4}\%
\end{answerlist}
\end{question}
\exname{Calculating IRR}
\extype{schoice}
\exsolution{00100}
\exshuffle{5}
```

Short-answer questions

While exams2nops() has some support for open-ended questions that are scored manually, it is very limited. I create and format those questions in the traditional manner (Word or LaTeX) and save the resulting file as a PDF. Alternatively, a custom exams2pdf() could be used for this. Finally, I add this PDF file on to the end of the exam using the pages option of exams2nops() (as illustrated in more detail below). As an example, the short-answer questions from the final exam above are given in the following PDF file:

To facilitate automatic inclusion of these short-answer questions, a naming convention is adopted in the scripts below. The script looks for a file named examNsa.pdf, where N is the exam number (1 for the first midterm exam, 2 for the second, and 3 for the final exam) and sets pages to it. So for the final exam, it looks for the file exam3sa.pdf. If I want to update my short-answer questions, I just replace the old PDF with a new one.

Midterm exams

I keep the questions for each exam in their own directory, which makes the script below that I use to generate exams 1 and 2 straightforward. For the moment, the script uses all the questions in an exam’s directory, but if I want to guarantee that I have exactly (for example) 20 multiple-choice questions, I could set nsamp = 20 instead of however many Rnw (or Rmd) files it finds.

Note that exlist is a list with one element (a vector of file names), so that R/exams will randomize the order of the questions as well. If I wanted to specify the order, I could make exlist a list of N elements, where each element was exactly one file/question. The nsamp option could then be left unset (the default is NULL).

When I generate a new set of exams, I only need to update the first four lines, and the script does the rest. Note that I use a modified version of the English en.dcf language configuration file in order to adapt some terms to TAMIU’s terminology, e.g., “ID Number” instead of the exams2nops() default “Registration Number”. See en_tamiu.dcf for the details. Since the TAMIU student ID numbers have 8 digits, I use the reglength = 8 argument, which sets the number of digits in the ID Number to 8.

```r
library("exams")

## configuration: change these to make a new batch of exams
exid <- 2
exnum <- 19
exdate <- "2019-04-04"
exsemester <- "SP19"
excourse <- "FIN 3310"

## options derived from configuration above
exnam <- paste0("fin3310exam", exid)
extitle <- paste(excourse, exsemester, "Exam", exid, sep = " ")
saquestion <- paste0("SA_questions/exam", exid, "sa.pdf")

## exercise directory (edir) and output directory (odir)
exedir <- paste0("fin3310_exam", exid, "_exercises")
exodir <- "nops"
if(!dir.exists(exodir)) dir.create(exodir) ## in case it was previously deleted
exodir <- paste0(exodir, "/exam", exid, "_", format(Sys.Date(), "%Y-%m-%d"))

## exercise list: one element with a vector of all available file names
exlist <- list(dir(path = exedir, pattern = "\\.Rnw$"))

## generate exam
set.seed(54321) # not the actual random seed
exams2nops(file = exlist, n = exnum, nsamp = length(exlist[[1]]),
  dir = exodir, edir = exedir, name = exnam, date = exdate,
  course = excourse, title = extitle,
  institution = "Texas A\\&M International University",
  language = "en_tamiu.dcf", encoding = "UTF-8",
  pages = saquestion, blank = 1, logo = NULL,
  reglength = 8, samepage = TRUE)
```

Final exam

My final exam is comprehensive, so I would like to include questions from the previous exams. I do not want to keep separate copies of those questions just for the final, in case I update one version and forget to update the other, so I need a script that gathers up questions from the first two midterms and adds in some questions specific to the final. The script below uses a feature that has long been available in R/exams but was undocumented up to version 2.3-2: you can set edir to a directory and all its subdirectories will be included in the search path. I have specified some required questions that I want to appear on every student’s exam; each student will also get a random draw of other questions from the first two midterms in addition to some questions that only appear on the final. How many of each is controlled by exnsamp, which is passed to the nsamp argument. They currently add up to 40, so exams2nops()’s 45-question limit does not affect me.

```r
library("exams")

## configuration: change these to make a new batch of exams
exid <- 3
exnum <- 19
exdate <- "2019-04-30"
exsemester <- "SP19"
excourse <- "FIN 3310"

## options derived from configuration above
extitle <- paste(excourse, exsemester, "Exam", exid, sep = " ")

## exercise directory (edir) and output directory (odir)
exedir <- getwd()
exodir <- "nops"
if(!dir.exists(exodir)) dir.create(exodir)
exodir <- paste0(exodir, "/exam3", "_", format(Sys.Date(), "%Y-%m-%d"))

## exercises: required and from previous midterms
exrequired <- c("goalfinance.Rnw", "pvcalc.Rnw", "cashcoverageortie.Rnw",
  "ocfcalc.Rnw", "discountratecalc.Rnw", "npvcalc.Rnw", "npvcalc2.Rnw",
  "stockpriceisabouttopaycalc.Rnw", "annuitycalc.Rnw", "irrcalc.Rnw")
exlist1 <- dir(path = "fin3310_exam1_exercises", pattern = "\\.Rnw$")
exlist2 <- dir(path = "fin3310_exam2_exercises", pattern = "\\.Rnw$")
exlist3 <- dir(path = "fin3310_exam3_exercises", pattern = "\\.Rnw$")

## final list and corresponding number to be sampled
exlist <- list(
  exrequired,
  setdiff(exlist1, exrequired),
  setdiff(exlist2, exrequired),
  exlist3
)
exnsamp <- c(10, 5, 10, 15)

## generate exam
set.seed(54321) # not the actual random seed
exams2nops(file = exlist, n = exnum, nsamp = exnsamp,
  dir = exodir, edir = exedir, name = "fin3310exam3", date = exdate,
  course = excourse, title = extitle,
  institution = "Texas A\\&M International University",
  language = "en_tamiu.dcf", encoding = "UTF-8",
  pages = "SA_questions/exam3sa.pdf", blank = 1, logo = NULL,
  reglength = 8, samepage = TRUE)
```

Grading

R/exams can read scans of the answer sheet, automating grading for the multiple-choice questions. For a variety of reasons, I do not use any of those features. If I had to teach larger classes I would doubtless find a way to make it convenient, but for the foreseeable future I will continue to use the function below to produce the answer key for each exam, which I grade by hand. This is not especially onerous, since I have to grade the short-answer questions by hand anyway.

Given the serialized R data file (.rds) produced by exams2nops(), the corresponding exams_metainfo() can be extracted. This comes with a print() method that displays all correct answers for a given exam. Starting from R/exams version 2.3-3 (current R-Forge devel version) it is also possible to split the output into blocks of five for easy reading (matching the blocks on the answer sheets). As an example, the first few correct answers of the third exam are extracted:

```r
fin3310exam3 <- readRDS("fin3310exam3.rds")
print(exams_metainfo(fin3310exam3), which = 3, block = 5)
## 19043000003
## 1. Discount Rate: 1
## 2. Calculating IRR: 4
## 3. Present Value: 3
## 4. Calculating stock prices: 5
## 5. Calculating NPV: 3
##
## 6. Annuity PV: 3
## 7. Goal of Financial Management: 1
## 8. Finding OCF: 5
## ...
```

It can be convenient to display the correct answers in a customized format, e.g., with letters instead of numbers and omitting the exercise title text. To do so, the code below sets up a function exam_solutions(), applying it to the same exam as above.

```r
exam_solutions <- function(examdata, examnum) {
  solutions <- LETTERS[sapply(examdata[[examnum]],
    function(x) which(x$metainfo$solution))]
  data.frame(Solution = solutions)
}
split(exam_solutions(fin3310exam3, examnum = 3), rep(1:8, each = 5))
## $`1`
##   Solution
## 1        A
## 2        D
## 3        C
## 4        E
## 5        C
##
## $`2`
##   Solution
## 6        C
## 7        A
## 8        E
## ...
```

The only downside of the manual grading approach is that I do not have students’ responses in an electronic format for easy statistical analysis, but otherwise grading is very fast.

Wrapping up

Using the system above, R/exams works well even for small courses with in-person, paper exams. I do not need to worry about copies of my exams getting out and students memorizing them, or any of the many ways students have found to cheat over the years. By doing all the formatting work for me, R/exams helps me avoid a lot of the finicky aspects of adding, adjusting, or removing questions from an existing exam, and generally keeps exam construction focused on the important part: writing questions that gauge students’ progress.


To leave a comment for the author, please follow the link and comment on their blog: R/exams.

### Practical Data Science with R Book Update (April 2019)

Mon, 04/22/2019 - 18:39

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

I thought I would give a personal update on our book: Practical Data Science with R 2nd edition; Zumel, Mount; Manning 2019.

The second edition should be fully available this fall! Nina and I have finished up through chapter 10 (of 12), and Manning has released previews of up through chapter 7 (with more to come!).

At this point the second edition is very real. So it makes me feel a bit conflicted about the 1st edition. If you need a good guide to data science in R in paper form right now please consider buying the 1st edition. If you are happy with an e-book: please consider buying the 2nd edition MEAP (Manning Early Access Program) now, it gets you previews of the new book, a complete e-copy of the 1st edition, and a complete e-copy of the 2nd edition when that is done! And we expect the fully printed version of the 2nd edition to be available this fall.

The 1st edition remains an excellent book that we are very proud of. I have 12 more copies of it (that I paid for) in my closet for gifting. Just four days ago I achieved a personal 1st edition goal: handing a print copy of the 1st edition to Robert Gentleman in person!

But, the 1st edition’s time is coming to an end.

The 1st edition has been with Nina and me since we started work on it in November of 2012 (the 1st edition came out in April 2014). Earlier drafts were titled “Successful Data Science in R”, and issues around the book title made me so nervous that the entire book production directory was named “DataScienceBook” until last year, about 4 years after the book’s publication!

Well, now for me Practical Data Science with R, 2nd edition is “the book.” Nina and I have put a lot of time into revising and improving the book for the new edition. In fact we started the 2nd edition March of 2018, so we have been working on this new edition for quite some time. If you wonder why you have not seen Nina for the last year, this has been the major reason (she is the leader on both editions of the book).

Practical Data Science with R 2nd edition; Zumel, Mount; Manning 2019 is intended as an improved revision of the 1st edition. We fixed things, streamlined things, incorporated new packages, and updated the content to catch up with over 5 years of evolution of the R ecosystem. We also had the chance to add some new content on boosted trees, model explanation, and data wrangling (including data.table!).

On the technical side, we have a free support site for the book: https://github.com/WinVector/PDSwR2. As with the first edition, we share all code examples, so you don’t have to copy and paste from the e-copy. This is fairly substantial, as Practical Data Science with R 2nd edition; Zumel, Mount; Manning 2019 is an example-driven book. We have also just released R Markdown re-renderings of all of the examples from the book. These confirm the examples run correctly, and they are where you can look if you are hung up trying to run an example (which will almost always be an issue of adding the path from where you have the code to the data, or a small matter of installing packages).

If you have and liked the 1st edition, please consider buying the 2nd edition: it is a substantial improvement. It is the same book, but a better version of it. If you want a book that teaches you how to work as data scientist using R as the example: please buy Practical Data Science with R 2nd edition; Zumel, Mount; Manning 2019.


To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

### Help Your Team Learn R

Mon, 04/22/2019 - 17:04

(This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers)

Over the last few years I’ve helped a number of data teams train their analysts to use R. At each company there was a skilled R user who was leading the team’s effort to adopt R.

Each of these internal R advocates, however, had a similar problem: they did not have a background in training. So while they knew that R could help their company, and they were R experts themselves, they struggled to get their team to actually learn R.

Help Your Team Learn R is designed to help people in this situation by providing an introduction to some of the key concepts in the field of technical training.

What’s in the Course?

In terms of pedagogy, the course focuses on Criterion Referenced Instruction, the branch of training that I have found most useful when teaching R.

The course contains seven lessons that are sent via email over the course of six days. Each email contains:

• One key concept from the field of training
• An example of how I have applied that concept when I teach R


To leave a comment for the author, please follow the link and comment on their blog: R – AriLamstein.com.

### India has 100k records on iNaturalist

Mon, 04/22/2019 - 04:58

(This article was first published on Vijay Barve, and kindly contributed to R-bloggers)

Biodiversity citizen scientists use iNaturalist to post their observations with photographs. The observations are then curated there by crowd-sourcing the identifications and other trait-related aspects too. Once converted to “research grade”, the data is passed on to GBIF as occurrence records.

Exciting news from India in 3rd week of April 2019 is:

Another important milestone in #Biodiversity Citizen Science in #India. This week we crossed 100K verifiable records on @inaturalist this data is about ~10K species by 4500+ observers #CitSci pic.twitter.com/DCF3QxQl1i

— Vijay Barve (@vijaybarve) April 21, 2019

Being interested in biodiversity data visualizations and completeness, I was waiting for 100k records to explore the data. Here is what I did and found out.

Step 1: Download the data from the iNaturalist website, which can be done very easily by visiting the website and choosing the right options.

https://www.inaturalist.org/observations?place_id=6681

I downloaded the file as .zip and extracted the observations-xxxx.csv. [In my case it was observations-51008.csv].

Step 2: Read the data file in R
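The read step itself is a plain CSV import (the file name below is the one from my download; yours will differ):

```r
# read the extracted iNaturalist export into a data frame
observations_51008 <- read.csv("observations-51008.csv",
                               stringsAsFactors = FALSE)
```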

Step 3: Clean up the data and set it up to be used in package bdvis.

```r
library(bdvis)
inatc <- list(
  Latitude = "latitude",
  Longitude = "longitude",
  Date_collected = "observed_on",
  Scientific_name = "scientific_name"
)
inat <- format_bdvis(observations_51008, config = inatc)
```

Step 4: We still need to rename some more columns for ease in visualizations: e.g., rather than ‘taxon_family_name’ it is easier to have a field called ‘Family’.

```r
rename_column <- function(dat, old, new){
  if(old %in% colnames(dat)){
    colnames(dat)[which(names(dat) == old)] <- new
  } else {
    print(paste("Error: Fieldname not found...", old))
  }
  return(dat)
}
inat <- rename_column(inat, 'taxon_kingdom_name', 'Kingdom')
inat <- rename_column(inat, 'taxon_phylum_name', 'Phylum')
inat <- rename_column(inat, 'taxon_class_name', 'Class')
inat <- rename_column(inat, 'taxon_order_name', 'Order_')
inat <- rename_column(inat, 'taxon_family_name', 'Family')
inat <- rename_column(inat, 'taxon_genus_name', 'Genus')

# Remove records in excess of 100k
inat <- inat[1:100000, ]
```

Step 5: Make sure the data is loaded properly

bdsummary(inat)

will produce something like this:

```
Total no of records = 100000
Temporal coverage...
  Date range of the records from 1898-01-01 to 2019-04-19
Taxonomic coverage...
  No of Families : 1345
  No of Genus : 5638
  No of Species : 13377
Spatial coverage ...
  Bounding box of records 6.806092 , 68.532 - 35.0614769085 , 97.050133
  Degree celles covered : 336
  % degree cells covered : 39.9524375743163
```

The data looks good. But we have a small issue: there are some records from the year 1898, which might cause trouble with some of our visualizations. So let us drop the records before the year 2000 for the time being.

```r
inat = inat[which(inat$Date_collected > "2000-01-01"), ]
```

Now we are ready to explore the data. The first thing I always like to see is the geographical coverage of the data. First let us try it at 1 degree (~100km) grid cells. Note that here I have an Admin2.shp file with the map of Indian states.

```r
mapgrid(inat, ptype = "records", bbox = c(60, 100, 5, 40),
        shp = "Admin2.shp")
```

This shows a fairly good geographical coverage of the data at this scale. We have very few degree cells with no data. How about a finer scale though? Say a 0.1 degree (~10km) grid. Let us generate that.

```r
mapgrid(inat, ptype = "records", bbox = c(60, 100, 5, 40),
        shp = "Admin2.shp", gridscale = 0.1)
```

Now the pattern of where the data is coming from is clear.

To be continued…

To leave a comment for the author, please follow the link and comment on their blog: Vijay Barve.

### Reproducible Environments

Mon, 04/22/2019 - 02:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

Great data science work should be reproducible. The ability to repeat experiments is part of the foundation for all science, and reproducible work is also critical for business applications. Team collaboration, project validation, and sustainable products presuppose the ability to reproduce work over time.

In my opinion, mastering just a handful of important tools will make reproducible work in R much easier for data scientists.
R users should be familiar with version control, RStudio projects, and literate programming through R Markdown. Once these tools are mastered, the major remaining challenge is creating a reproducible environment. An environment consists of all the dependencies required to enable your code to run correctly. This includes R itself, R packages, and system dependencies. As with many programming languages, it can be challenging to manage reproducible R environments. Common issues include:

• Code that used to run no longer runs, even though the code has not changed.
• Being afraid to upgrade or install a new package, because it might break your code or someone else’s.
• Typing install.packages in your environment doesn’t do anything, or doesn’t do the right thing.

These challenges can be addressed through a careful combination of tools and strategies. This post describes two use cases for reproducible environments:

1. Safely upgrading packages
2. Collaborating on a team

The sections below each cover a strategy to address the use case, and the necessary tools to implement each strategy. Additional use cases, strategies, and tools are presented at https://environments.rstudio.com. This website is a work in progress, but we look forward to your feedback.

Safely Upgrading Packages

Upgrading packages can be a risky affair. It is not difficult to find serious R users who have been in a situation where upgrading a package had unintended consequences. For example, the upgrade may have broken parts of their current code, or upgrading a package for one project accidentally broke the code in another project. A strategy for safely upgrading packages consists of three steps:

1. Isolate a project
2. Record the current dependencies
3. Upgrade packages

The first step in this strategy ensures one project’s packages and upgrades won’t interfere with any other projects. Isolating projects is accomplished by creating per-project libraries.
A tool that makes this easy is the new renv package. Inside of your R project, simply use:

```r
# inside the project directory
renv::init()
```

The second step is to record the current dependencies. This step is critical because it creates a safety net: if the package upgrade goes poorly, you’ll be able to revert the changes and return to the recorded working state. Again, the renv package makes this process easy.

```r
# record the current dependencies in a file called renv.lock
renv::snapshot()

# commit the lockfile alongside your code in version control,
# and use this function to view the history of your lockfile
renv::history()

# if an upgrade goes astray, revert the lockfile
renv::revert(commit = "abc123")

# and restore the previous environment
renv::restore()
```

With an isolated project and a safety net in place, you can now proceed to upgrade or add new packages, while remaining certain the current functional environment is still reproducible. The pak package can be used to install and upgrade packages in an interactive environment:

```r
# upgrade packages quickly and safely
pak::pkg_install("ggplot2")
```

The safety net provided by the renv package relies on access to older versions of R packages. For public packages, CRAN provides these older versions in the CRAN archive. Organizations can use tools like RStudio Package Manager to make multiple versions of private packages available.

The “snapshot and restore” approach can also be used to promote content to production. In fact, this approach is exactly how RStudio Connect and shinyapps.io deploy thousands of R applications to production each day!

Team Collaboration

A common challenge on teams is sharing and running code. One strategy that administrators and R users can adopt to facilitate collaboration is shared baselines. The basics of the strategy are simple:

1. Administrators set up a common environment for R users by installing RStudio Server.
2. On the server, administrators install multiple versions of R.
3. Each version of R is tied to a frozen repository using an Rprofile.site file.

By using a frozen repository, either administrators or users can install packages while still being sure that everyone will get the same set of packages. A frozen repository also ensures that adding new packages won’t upgrade other shared packages as a side effect. New packages and upgrades are offered to users over time through the addition of new versions of R. Frozen repositories can be created by manually cloning CRAN, accessing a service like MRAN, or utilizing a supported product like RStudio Package Manager.

Adaptable Strategies

The prior sections presented specific strategies for creating reproducible environments in two common cases. The same strategy may not be appropriate for every organization, R user, or situation. If you’re a student reporting an error to your professor, capturing your sessionInfo() may be all you need. In contrast, a statistician working on a clinical trial will need a robust framework for recreating their environment. Reproducibility is not binary!

To help pick between strategies, we’ve developed a strategy map. By answering two questions, you can quickly identify where your team falls on this map and identify the nearest successful strategy. The two questions are represented on the x and y axes of the map:

1. Do I have any restrictions on what packages can be used?
2. Who is responsible for managing installed packages?
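Returning briefly to the shared-baselines strategy: the frozen-repository tie can be sketched as a single option set in Rprofile.site. This is only an illustration, not the post's own configuration, and the repository URL below is a made-up placeholder for a dated CRAN snapshot.

```r
# Hypothetical Rprofile.site entry tying this R installation to a frozen
# repository snapshot. The URL is illustrative only; substitute the address
# of your organization's snapshot (e.g. an RStudio Package Manager instance).
local({
  options(repos = c(CRAN = "https://packages.example.com/cran/2019-04-01"))
})
```

Because Rprofile.site is read at startup for every user of that R installation, everyone installing packages against that version of R draws from the same frozen set.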
[Interactive strategy map: “Reproducing Environments: Strategies and Danger Zones”. The x-axis, “Who is Responsible for Reproducing the Environment?”, runs from Admins to Users; the y-axis, “Package Access”, runs from Locked Down to Open. Successful strategies: Validated (admins test and approve a subset of CRAN), Shared Baseline (all or most of CRAN, updated with R versions, tied to a system library), and Snapshot (open access, user or system records per-project dependencies). Danger zones: Wild West (open access, not reproducible, how we learn), Ticket System (admins involved, no testing, slow updates, high risk of breakage), and Blocked (backdoor package access, offline systems without a strategy).]

For more information on picking and using these strategies, please visit https://environments.rstudio.com. By adopting a strategy for reproducible environments, R users, administrators, and teams can solve a number of important challenges. Ultimately, reproducible work adds credibility, creating a solid foundation for research, business applications, and production systems. We are excited to be working on tools to make reproducible work in R easy and fun. We look forward to your feedback, community discussions, and future posts.
To leave a comment for the author, please follow the link and comment on their blog: R Views (https://rviews.rstudio.com/2019/04/22/reproducible-environments/).

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

### survivalists [a Riddler’s riddle]

Mon, 04/22/2019 - 00:19

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

A neat question from The Riddler on a multi-probability survival rate: Nine processes are running in a loop with fixed survival rates .99, …, .91. What is the probability that the first process is the last one to die? Same question with probabilities .91, …, .99 and the probability that the last process is the last one to die.

The first question means that the realisation of a Geometric G(.99) has to be strictly larger than the largest of eight Geometrics G(.98), …, G(.91). Given that the cdf of a Geometric G(a) is P(G ≤ k) = 1 − a^k [when counting the number of attempts till failure, included, i.e. the Geometric with support on the positive integers], the probability that this happens has the nice (?!) representation

P = Σ_{n≥1} (1 − a₁) a₁^{n−1} ∏_{i=2}^{9} (1 − a_i^{n−1})

which can be evaluated by an easy recursion on the partial products, and gives a value of 0.5207 returned by R (Monte Carlo evaluation of 0.5207 based on 10⁷ replications). The second question is quite similar, with a value of 0.52596 (Monte Carlo evaluation of 0.52581 based on 10⁷ replications).
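The Monte Carlo check is short in R. The sketch below is not the author's original code; it simulates the first question directly, using the convention above that a lifetime counts attempts until failure, inclusive.

```r
# Monte Carlo check of the first question: nine processes with survival
# rates .99, ..., .91; estimate the probability that the first process
# strictly outlives all the others.
set.seed(123)
a    <- seq(0.99, 0.91, by = -0.01)  # per-round survival rates
sims <- 1e5
hits <- replicate(sims, {
  # rgeom() counts survivals before the first death; +1 gives rounds lived
  lives <- rgeom(9, prob = 1 - a) + 1
  lives[1] > max(lives[-1])          # first process is the strict last survivor
})
mean(hits)   # close to 0.5207
```

With 10⁵ replications the standard error is about 0.0016, so the estimate lands comfortably near the exact value of 0.5207.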
To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.

### Binning with Weights

Sun, 04/21/2019 - 23:52

(This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers)

After working on the MOB package, I received requests from multiple users asking whether I could write a binning function that takes a weighting scheme into consideration. It is a legitimate request from the practical standpoint. For instance, in the development of fraud detection models, we often sample down non-fraud cases given the extremely low frequency of fraud instances. After the down-sampling, a weight value > 1 should be assigned to all non-fraud cases to reflect the fraud rate in the pre-sampled data. While accommodating the request for weighting cases is trivial, I’d like to run a simple experiment showing what the impact of weighting might be.

First of all, let’s apply the monotonic binning to a variable named “tot_derog”. In this unweighted binning output, KS = 18.94, IV = 0.21, and WoE values range from -0.38 to 0.64.

In the first trial, a weight value = 5 is assigned to cases with Y = 0 and a weight value = 1 to cases with Y = 1. As expected, frequency, distribution, bad_frequency, and bad_rate changed. However, KS, IV, and WoE remain identical.
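The invariance of WoE and IV under a constant weight on one class can be verified with a small sketch on toy data. The woe_iv() helper below is hypothetical (it is not part of the MOB package): scaling every Y = 0 case by the same factor rescales the good counts identically in every bin, so the good distribution, and hence WoE and IV, is unchanged.

```r
# Toy demonstration (not the MOB package): WoE and IV are unchanged when all
# Y = 0 cases receive a constant weight, because the "good" distribution is
# rescaled by the same factor in every bin.
set.seed(42)
y   <- rbinom(1000, 1, 0.2)               # binary outcome
bin <- sample(1:5, 1000, replace = TRUE)  # stand-in binning assignment

woe_iv <- function(y, bin, w = rep(1, length(y))) {
  bad  <- tapply(w * (y == 1), bin, sum); bad  <- bad / sum(bad)
  good <- tapply(w * (y == 0), bin, sum); good <- good / sum(good)
  woe  <- log(bad / good)
  list(woe = woe, iv = sum((bad - good) * woe))
}

unweighted <- woe_iv(y, bin)
weighted   <- woe_iv(y, bin, w = ifelse(y == 0, 5, 1))
all.equal(unweighted$woe, weighted$woe)  # TRUE
all.equal(unweighted$iv,  weighted$iv)   # TRUE
```

The bin-level frequencies and bad rates do change under the weights, which is exactly what the wts_bin() correction mentioned below is for.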
In the second trial, a weight value = 1 is assigned to cases with Y = 0 and a weight value = 5 to cases with Y = 1. Once again, KS, IV, and WoE are the same as in the unweighted output.

The conclusion from this demonstration is very clear: in the case of two-value weights assigned to the binary Y, the variable importance reflected by IV / KS and WoE values remains identical with or without weights. However, if you are concerned about the binning distribution and the bad rate in each bin, the function wts_bin() performs the correction and is available in the project repository (https://github.com/statcompute/MonotonicBinning).

To leave a comment for the author, please follow the link and comment on their blog: S+/R – Yet Another Blog in Statistical Computing.

### Familiarisation with the Australian Election Study by @ellis2013nz

Sun, 04/21/2019 - 16:00

(This article was first published on free range statistics - R, and kindly contributed to R-bloggers)

The Australian Election Study is an impressive long-term research project that has collected the attitudes and behaviours of a sample of individual voters after each Australian federal election since 1987. All the datasets and documentation are freely available.
An individual survey of this sort is a vital complement to the necessarily aggregate results provided by the Australian Electoral Commission, and is essential for coming closer to understanding political and social change. Many countries have a comparable data source. Many thanks to the original researchers: McAllister, Ian; Makkai, Toni; Bean, Clive; Gibson, Rachel Kay, 2017, “Australian Election Study, 2016”, doi:10.4225/87/7OZCZA, ADA Dataverse, V2, UNF:6:TNnUHDn0ZNSlIM94TQphWw==; 1. Australian Election Study December 2016 Technical Report.pdf.

In this blog post I just scratch the surface of the latest available data, from the time of the 2016 federal election. In later posts I’ll get into modelling it a bit deeper, and hopefully get a chance to look at some changes over time. Some of the code I use in this post for preparing the data might find its way into the ozfedelect R package, but not the data itself.

Getting the data

The data need to be downloaded by hand from the Australian Data Archive via the Dataverse (an international open source research data repository) in order to agree to the terms and conditions of use. The following code requires the SPSS version of the data and the Excel version of the data dictionary to have been saved in an ./aes/ subfolder of the working directory.

The SPSS data has column names truncated at eight characters, which means that some of the variables in the data don’t match those in the data dictionary. The code below makes the necessary manual adjustments to variable names from the dictionary to deal with this, and re-classes the categorical responses as factors with the correct level labels in the correct order.
I do this by importing and joining to the data dictionary rather than extracting the attributes information from the SPSS import, which might have also worked; I prefer my manual approach so I can deal with some anomalies explicitly (for example, many of the values listed as 999 “skipped” in the data dictionary are NA in the SPSS data).

Post continues after code extract

```r
#====================Preparation=============
#----------------------------packages---------------
library(tidyverse)
library(scales)
library(rio)
library(svglite)
library(frs)
library(readxl)
library(testthat)
library(survey)
library(vcd)
library(RColorBrewer)

#---------------------Import and tidy data-----------------
# Download by hand from https://dataverse.ada.edu.au/dataset.xhtml?persistentId=doi:10.4225/87/7OZCZA
aes2016_orig <- import("aes/2. Australian Election Study, 2016.sav")
aes2016_code_orig <- read_excel("aes/1. Australian Election Study, 2016 - Codebook.xlsx",
                                sheet = "Data Dictionary")

aes2016_code <- aes2016_code_orig %>%
  rename(question_label = Label...3,
         answer_label = Label...5) %>%
  rename_all(tolower) %>%
  fill(variable, position, question_label) %>%
  mutate(var_abb = substring(variable, 1, 8)) %>%
  mutate(var_abb = case_when(
    var_abb == "H19_STAT" ~ "H19STATE",
    var_abb == "H19_PCOD" ~ "H19pcoRV",
    var_abb == "H20_othe" ~ "H20_oth",
    var_abb == "H25_othe" ~ "H25_oth",
    var_abb == "Final_ST" ~ "finSTATE",
    var_abb == "Samp_STA" ~ "SamSTATE",
    var_abb == "detailed" ~ "doutcome",
    var_abb == "Responde" ~ "responID",
    var_abb == "Start_Ti" ~ "Sta_time",
    var_abb == "Samp_PCO" ~ "SampcoRV",
    var_abb == "Final_PC" ~ "finpcoRV",
    var_abb == "Total_Du" ~ "totaldur",
    TRUE ~ var_abb
  )) %>%
  rbind(tibble(
    variable = "StateMap",
    position = NA,
    question_label = "Unidentified state variable",
    value = 1:8,
    answer_label = c("New South Wales", "Victoria", "Queensland", "South Australia",
                     "Western Australia", "Tasmania", "Northern Territory",
                     "Australian Capital Territory"),
    var_abb = "StateMap"
  ))

aes2016_questions <- distinct(aes2016_code, var_abb, position, question_label)

aes2016_answers <- aes2016_code %>%
  select(var_abb, variable, value, answer_label) %>%
  filter(!is.na(value))

# Check all the names in the data now match those in the data dictionary:
expect_equal(
  names(aes2016_orig)[!names(aes2016_orig) %in% aes2016_questions$var_abb],
  character(0)
)

# ... and vice versa:
expect_equal(
  unique(aes2016_questions$var_abb)[!unique(aes2016_questions$var_abb) %in% names(aes2016_orig)],
  character(0)
)

aes2016 <- aes2016_orig %>%
  as_tibble()

attributes(aes2016)$question_labels <- names(aes2016)

for(i in 1:ncol(aes2016)){
  this_q <- filter(aes2016_questions, var_abb == names(aes2016)[i])

  # Sometimes a code like 999 'Skipped' is not present in the data but has already
  # been replaced with an NA, so we don't want it in our factor levels. So we need
  # to find out what answers are actually used in the i'th column
  used_a <- unique(pull(aes2016, i))

  # the answer labels for this particular question
  these_a <- aes2016_code %>%
    filter(var_abb == names(aes2016)[i]) %>%
    filter(value %in% used_a) %>%
    arrange(value)

  attributes(aes2016)$question_labels[i] <- pull(this_q, question_label)

  # half a dozen open text questions don't match to the data dictionary so we exclude those:
  if(nrow(these_a) > 0 & !pull(this_q, question_label) %in%
       c("What kind of work do you do? Full job title",
         "What are (or were) the main tasks that you usually perform(ed)?",
         "What kind of business or industry is (or was) that in?",
         "What was your partner's main activity last week? (other specify)",
         "What kind of work does (or did) your partner do? Provide job title",
         "Generally speaking, does your partner usually think of himself or herself as Liberal, Labor, National or what? (other specify)")
     ){
    # for the rest, we turn the response into a factor with the correct labels
    aes2016[ , i] <- factor(pull(aes2016, i), labels = pull(these_a, answer_label))
  }
}

aes2016 <- aes2016 %>%
  mutate(age_2016 = 2016 - H2,
         agegrp_2016 = cut(age_2016,
                           breaks = c(0, 17.5, 24.5, 34.5, 44.5, 54.5, 64.5, Inf),
                           labels = c("0-17", "18-24", "25-34", "35-44", "45-54", "55-64", "65+")),
         agegrp_2016 = as.factor(replace_na(as.character(agegrp_2016), "Age unknown")),
         sex = replace_na(as.character(H1), "Sex unknown"),
         first_pref_hr_grp = case_when(
           B9_1 %in% c("Liberal Party", "National (Country) Party") ~ "Coalition",
           B9_1 == "Labor Party (ALP)" ~ "Labor",
           B9_1 == "Greens" ~ "Greens",
           B9_1 == "Voted informal" | is.na(B9_1) ~ "Did not vote/voted informal",
           TRUE ~ "Other"
         ),
         first_pref_sen_grp = case_when(
           B9_2 %in% c("Liberal Party", "National (Country) Party") ~ "Coalition",
           B9_2 == "Labor Party (ALP)" ~ "Labor",
           B9_2 == "Greens" ~ "Greens",
           B9_2 == "Voted informal" | is.na(B9_2) ~ "Did not vote/voted informal",
           TRUE ~ "Other"
         ),
         first_pref_sen_grp2 = case_when(
           B9_2 == "Liberal Party" ~ "Liberal",
           B9_2 == "National (Country) Party" ~ "National",
           B9_2 == "One Nation" ~ "One Nation",
           B9_2 == "Labor Party (ALP)" ~ "Labor",
           B9_2 == "Greens" ~ "Greens",
           TRUE ~ "Other (including did not vote)"
         )) %>%
  mutate(first_pref_sen_grp2 = toupper(first_pref_sen_grp2),
         first_pref_sen_grp2 = ifelse(grepl("^OTHER", first_pref_sen_grp2),
                                      "OTHER (incl. did not vote)", first_pref_sen_grp2)) %>%
  mutate(first_pref_sen_grp2 = fct_relevel(first_pref_sen_grp2,
                                           toupper(c("Greens", "Labor", "Liberal",
                                                     "National", "One Nation")))) %>%
  mutate(tpp_alp = B11_1 %in% "Labor Party (ALP)" | B9_1 %in% "Labor Party (ALP)")
```

The last few lines of that code also define some summarised variables that lump various parties or non-votes together and will be useful in later analysis.

Survey design and weights

The AES in 2016 used two separate sampling frames: one provided by the Australian Electoral Commission, and one from the Geo-coded National Address File or GNAF, a large, authoritative open data source of every address in Australia. For inference about Australian adults on the election roll (enrolment is required of all resident adult citizens by legislation), the AES team recommend using a combination of the two samples, and weights are provided in the wt_enrol column to do so.

The technical report describes data collection as mixed-mode (hard copy and online versions), with a solid system of managing contacts, reminders and non-returns.

The AEC sample is stratified by state, and the GNAF sample by Greater Capital City (GCC) statistical areas (which divide states into two, e.g. “Greater Sydney” and “Rest of New South Wales”). As far as I can see, no GCC variable is provided for end-use analysts such as myself. Nor are the various postcode variables populated, and I can’t see a variable for electorate/division (presumably for statistical disclosure control purposes), which is a shame as it would be an important variable to use for a variety of multi-level analyses. So we have to use “state” as our best regional variable.

Post-stratification weighting (raking) was used by the AES to match the population proportions by sex, age group, state, and first preference vote in the House of Representatives. This latter calibration is particularly important to ensure that analysis gives appropriate weight to the supporters of the various parties; it is highly likely that non-response is related to political views (or non-views).

The weights have been scaled to add up to the sample size, rather than to population numbers (as would be the approach of, for example, a national statistical office). This is presumably out of deference to SPSS’s behaviour of treating survey weights as frequencies, which leads to appalling results if they add up to anything other than the sample size (and still sub-optimal results even then).
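If population-scaled weights were ever wanted (say, for estimating population totals rather than proportions), rescaling is a one-liner. A sketch with made-up numbers (the population total below is illustrative, not from the AES):

```r
# Rescale weights that sum to the sample size so that they sum to a known
# population total instead. The total of 16e6 is made up for illustration.
rescale_weights <- function(w, pop_total) w * pop_total / sum(w)

w <- c(0.8, 1.2, 1.0)          # toy weights summing to the sample size, 3
sum(rescale_weights(w, 16e6))  # 16000000, up to floating point
```

Weighted proportions are unaffected by this rescaling; only estimated totals change.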

Table 16 of the technical report is meant to provide the weighting benchmarks for the post-stratification weighting, but the state, age and sex totals given are only for the GNAF sample; and the benchmarks given for which party voted for appear to be simply incorrect in a way I cannot determine. They add up to much more than either the AEC or GNAF sample, but much less than their combined total; and I can’t match their results with any of the plausible combinations of filters I’ve tried. I’d be happy for anyone to point out if I’m doing something wrong, but my working hypothesis is that Table 16 simply contains some mistakes, whereas the weights in the actual data are correct.

The code below explores the three sets of weights provided, and defines a survey design object (using Thomas Lumley’s survey package) that treats state as the strata for sampling, and sex, state, age and House of Representatives vote as post-stratification variables. Although I won’t make much use of the survey package today, I’ll almost certainly want it for more sophisticated analysis in a later post.

Post continues after code extract

```r
#--------------------Weighting-----------------------
# The sample has two parts - addresses supplied by the AEC, and addresses from
# the GNAF (Geocoded National Address File). There are separate weights for each,
# so they can be analysed as two separate surveys or you can use the combined set of weights:
table(aes2016$wt_aec > 0, useNA = 'always')   # for comparison to previous waves of AES
table(aes2016$wt_gnaf > 0, useNA = 'always')  # inference about Australian adults
table(aes2016$wt_enrol > 0, useNA = 'always') # inference about enrolled Australian adults

# There are 107 cases from the GNAF sample that have zero weight in the final combined sample:
filter(aes2016, wt_enrol <= 0) %>%
  select(wt_aec, wt_gnaf, wt_enrol)

# wt_enrol had been raked for age group, sex, state and first preference vote. See pages 32-33
# of the technical guide.

# These three state variables seem identical, and are all different from those in Table 16
# of the technical report:
cbind(
  table(aes2016$finSTATE),
  table(aes2016$StateMap),
  table(aes2016$SamSTATE)
)

# In fact, Table 16 has the state, age and sex only of the GNAF sample, not the whole sample:
table(filter(aes2016, wt_gnaf > 0)$finSTATE)
table(filter(aes2016, wt_gnaf > 0)$agegrp_2016)
table(filter(aes2016, wt_gnaf > 0)$H1)

# The first pref party vote in Table 16 looks to be just wrong. The results below match those
# in the code book, but not in Table 16
table(aes2016$first_pref_hr_grp, useNA = 'always')
table(aes2016$first_pref_sen_grp, useNA = 'always')

# Well, we'll assume that wt_enrol has actually been weighted as described, and ignore
# the problems in Table 16.

# Set up survey design:
aes2016_svy <- svydesign(~1, strata = ~SamSTATE, weights = ~wt_enrol, data = aes2016)

# Deduce population totals from the sums of weights (if weights were raked to match pop totals,
# then the sum of the weights should reveal what they were)
age_pop <- aes2016 %>%
  group_by(agegrp_2016) %>%
  summarise(Freq = sum(wt_enrol))

sex_pop <- aes2016 %>%
  group_by(sex) %>%
  summarise(Freq = sum(wt_enrol))

state_pop <- aes2016 %>%
  group_by(SamSTATE) %>%
  summarise(Freq = sum(wt_enrol))

age2016_svy <- rake(aes2016_svy,
                    sample.margins = list(~sex, ~agegrp_2016, ~SamSTATE),
                    population.margins = list(sex_pop, age_pop, state_pop))

# Compare the weighted and unweighted response to a survey question
filter(aes2016_questions, var_abb == "H14_2")
svytable(~H14_2, aes2016_svy)
table(aes2016$H14_2)
# Many more "yes" posted answers to the Internet when weighted. So people active on the
# internet needed extra weight (probably younger)
```

Analysis – vote and one question

Now we’re ready for some exploratory analysis. In Australia, the Senate is elected by proportional representation, a wider range of small parties stand candidates, and the first preference vote for that house is less subject to gaming by voters than is the single-transferable-vote system in the House of Representatives. So I prefer to use Senate vote when examining the range of political opinion, particularly of supporters of the smaller parties.

Here is one of the graphics that I love during exploration but which I despair of using for communication purposes – a mosaic plot, which beautifully visualises a contingency table of counts (or, in this case, weighted counts). This particular graphic shows the views of people who voted for different parties in the Senate on the trade-off between reducing taxes and spending more on social services:

As any casual observer of Australian politics would have predicted, disproportionately large numbers (coloured blue to show the positive discrepancy from the no-interaction null hypothesis) of Greens voters (and to a lesser extent Labor) are mildly or strongly in favour of spending more on social services. Correspondingly, these parties’ supporters have relatively few people counted in the top left corner (coloured red to show a negative discrepancy from the null hypothesis) supporting reducing tax as the priority.

In contrast, there are relatively few Liberal voters in the social services camp, and disproportionately high numbers who favour reducing taxes.

The “other” voters – which includes those who did not vote, could not remember, or did not vote for one of the five most prominent parties – are disproportionately likely to sit on the fence or not have a view on the trade-off between social services and tax cuts, which again is consistent with expectations.

What about a different angle – what influences people when they decide their vote?

We see that Greens voters decided their vote disproportionately on the basis of “the policy issues”. Liberal voters have fewer people in the “policy” box than would be expected under the no-interaction null hypothesis and were more likely to decide based on either “the party leaders” or “the parties taken as a whole”. National voters stand out as the only party with a noticeably disproportionate count (high, in their case) in the “candidates in your electorate” box, emphasising the known spatially-specific nature of National’s support.

Here’s the code for drawing those mosaic plots, using the vcd R package that implements some useful ways of visualising categorical data.

Post continues after code extract

```r
#======================selected bivariate============
#------------------tax/social services-------------
x <- svytable(~E1 + first_pref_sen_grp2, aes2016_svy)
x <- round(x, 2)
colnames(x)[grepl("OTHER", colnames(x))] <- "OTHER"
rownames(x) <- paste0(str_wrap(rownames(x), 20))

mosaic(x, shade = TRUE, border = "grey70", direction = "v",
       legend = legend_resbased(10, fontfamily = "Roboto", pvalue = FALSE),
       xlab = "x", ylab = "y",
       labeling = labeling_values(
         suppress = c(0, 1000),
         gp_labels = gpar(fontfamily = "Roboto", cex = 0.8),
         gp_varnames = gpar(col = "transparent"),
         just_labels = c("center", "right"),
         alternate_labels = c(TRUE, FALSE),
         rot_labels = c(0, 90, 0, 0),
         offset_labels = c(1, 0, 1.5, 0),
         digits = 0
       ))

#--------------how decide vote------------
x <- svytable(~B5 + first_pref_sen_grp2, aes2016_svy)
x <- round(x, 2)
colnames(x)[grepl("OTHER", colnames(x))] <- "OTHER"
rownames(x) <- paste0(str_wrap(rownames(x), 20))

mosaic(x, shade = TRUE, border = "grey70", direction = "v",
       legend = legend_resbased(10, fontfamily = "Roboto", pvalue = FALSE),
       xlab = "x", ylab = "y",
       labeling = labeling_values(
         suppress = c(0, 1000),
         gp_labels = gpar(fontfamily = "Roboto", cex = 0.8),
         gp_varnames = gpar(col = "transparent"),
         just_labels = c("center", "right"),
         alternate_labels = c(TRUE, FALSE),
         rot_labels = c(0, 90, 0, 0),
         offset_labels = c(1, 0, 1.5, 0),
         digits = 0
       ))
```

Analysis – vote and a battery of questions

Like many social science surveys, the AES contains a number of batteries of questions with similar response sets. An advantage of this, both for an analyst and the consumer of their results, is that we can set up a common approach to comparing the responses to those questions. Here is a selection of diverging Likert plots depicting interesting results from these questions:

Above, we see that Greens voters disproportionately think that Australia hasn’t gone far enough in allowing more migrants into the country, and One Nation voters unsurprisingly tend to think the opposite. Views on change in Australia relating to Aboriginal assistance and land rights follow a similar pattern. Interestingly, there is less partisan difference on some other issues such as “equal opportunities for women”. Overall, there is a clear pattern of an increasing tendency to think that “things have gone too far” as we move across the spectrum (Greens-Labor-Liberal-National-One Nation).

When it comes to attitudes on the left-right economic issues, we see subtle but expected party differences (chart above). Liberal supporters are particularly inclined to think trade unions have too much power and should be regulated; but to disagree with the statement that income and wealth should be distributed towards ordinary working people.

In the social issues chart above, we see the Greens’ voters strongly disagreeing with turning back boats of asylum seekers, and reintroducing the death penalty – issues where One Nation voters tend to agree with the statement. Again, the clear sequencing of parties’ supporters (Greens-Labor-Liberal-National-One Nation) is most obvious on these social/cultural touchpoint issues, in contrast to the left-right economic debate.

In the above, we see high levels of confidence in the armed forces, police and universities across all parties’ supporters, and distrust in the press, television (what does it even mean to trust the television?) and political parties. Differences by party in trust in institutions vary along predictable lines, e.g. voters of the Liberal party tend to have less confidence in unions. One Nation voters have particularly low trust in the Australian political system, consistent with the mainstream interpretation in political science of the rise of socially conservative populist parties of this sort.

Attitudes to public spending are fairly consistent across parties, with subtle but unsurprising differences on a few value-based touch points such as defence (Greens favour spending less) and unemployment benefits (Greens favour spending more but the three conservative parties favour spending less; Labor voters are in the middle).

Finally, when it comes to factual knowledge of election law and constitutional issues, there is little to say from a party perspective. Many people don’t know the maximum period between federal elections for the lower house (the correct answer is three years); and doubtless neither know nor care about the need to pay a deposit to stand for election. But there is no obvious difference by party voted for.

In summary, on many of the key social variables of the day, we see the expected greatest contrast between supporters of the Greens on one hand and One Nation on the other, with the ALP, Liberal and National voters strung out between. Economic issues indicate another but closely related dimension, with One Nation closer to the Greens than they are to the other socially conservative parties on statements such as “big business in this country has too much power”, “the government should take measures to reduce differences in income levels” and “trade unions in this country have too much power”.

Here’s the code for drawing those diverging-Likert plots. I’m not particularly proud of it; it’s a bit repetitive (I hadn’t realised I’d be doing six of these), but it gets the job done. There are some specialist R packages for drawing these charts, but I prefer (at this point) to build them with the standard grammar-of-graphics tools in ggplot2 rather than learn a new chart-specific API.

Post continues after code extract
#=================Batteries of similar questions========================
#------------------Exploration---------------------
svytable(~ F7 + first_pref_sen_grp, aes2016_svy, Ntotal = 1000, round = TRUE)    # Immigration
svytable(~ F6_4 + first_pref_sen_grp, aes2016_svy, Ntotal = 1000, round = TRUE)  # war on terror
svytable(~ F2 + first_pref_sen_grp, aes2016_svy, Ntotal = 1000, round = TRUE)    # Republic
svytable(~ E9_13 + first_pref_sen_grp, aes2016_svy, Ntotal = 1000, round = TRUE) # trust universities
svytable(~ E9_14 + first_pref_sen_grp, aes2016_svy, Ntotal = 1000, round = TRUE) # trust political system

#-----------Confidence in institutions----------------------
d1 <- aes2016 %>%
  select(E9_1:E9_14, wt_enrol, first_pref_sen_grp2) %>%
  gather(variable, value, -wt_enrol, -first_pref_sen_grp2) %>%
  group_by(variable, value, first_pref_sen_grp2) %>%
  summarise(freq = sum(wt_enrol)) %>%
  ungroup() %>%
  left_join(aes2016_questions, by = c("variable" = "var_abb")) %>%
  mutate(institution = gsub("How much .+\\? \n", "", question_label)) %>%
  filter(!is.na(value)) %>%
  group_by(variable, first_pref_sen_grp2, institution) %>%
  mutate(negative = value %in% c("Not very much confidence", "None at all"),
         prop = freq / sum(freq) * ifelse(negative, -1, 1)) %>%
  mutate(value = factor(value, levels = c("None at all", "Not very much confidence",
                                          "A great deal of confidence", "Quite a lot of confidence"))) %>%
  ungroup() %>%
  mutate(institution = fct_reorder(institution, prop, .fun = sum))

pal <- brewer.pal(4, "RdYlGn")
names(pal) <- levels(d1$value)[c(1, 2, 4, 3)]

d1 %>%
  ggplot(aes(x = institution, fill = value, weight = prop)) +
  geom_bar(data = filter(d1, negative), position = "stack") +
  geom_bar(data = filter(d1, !negative), position = "stack") +
  facet_wrap(~first_pref_sen_grp2, ncol = 3) +
  coord_flip() +
  scale_fill_manual("", values = pal, guide = guide_legend(reverse = FALSE),
                    breaks = c("None at all", "Not very much confidence",
                               "Quite a lot of confidence", "A great deal of confidence")) +
  scale_y_continuous(breaks = c(-1, -0.5, 0, 0.5, 1),
                     labels = c("100%", "50%", "0", "50%", "100%")) +
  expand_limits(y = c(-1, 1)) +
  theme(panel.spacing = unit(1.5, "lines")) +
  ggtitle("Attitudes to institutions, by first preference Senate vote in 2016",
          "How much confidence do you have in the following organisation? ...") +
  labs(x = "", y = "",
       caption = "Source: Australian Election Study 2016; analysis by freerangestats.info.")

#-----------more or less expenditure than now----------------------
d2 <- aes2016 %>%
  select(D8_1:D8_10, wt_enrol, first_pref_sen_grp2) %>%
  gather(variable, value, -wt_enrol, -first_pref_sen_grp2) %>%
  group_by(variable, value, first_pref_sen_grp2) %>%
  summarise(freq = sum(wt_enrol)) %>%
  ungroup() %>%
  left_join(aes2016_questions, by = c("variable" = "var_abb")) %>%
  mutate(item = gsub("Should there be .+\\? \n", "", question_label)) %>%
  filter(!is.na(value)) %>%
  group_by(variable, first_pref_sen_grp2, item) %>%
  mutate(negative = value %in% c("Much less than now", "Somewhat less than now"),
         positive = value %in% c("Much more than now", "Somewhat more than now"),
         same = !negative & !positive,
         prop = freq / sum(freq) * case_when(
           negative ~ -1,
           positive ~ 1,
           TRUE ~ 0.5)) %>%
  mutate(value = factor(value, levels = c("Much less than now", "Somewhat less than now",
                                          "Much more than now", "Somewhat more than now",
                                          "The same as now"))) %>%
  ungroup() %>%
  mutate(item = fct_reorder(item, prop, .fun = sum))

pal <- brewer.pal(5, "RdYlGn")
names(pal) <- levels(d2$value)[c(1, 2, 5, 4, 3)]

d2a <- d2 %>%
  filter(negative | same) %>%
  mutate(prop = -abs(prop))

d2 %>%
  ggplot(aes(x = item, fill = value, weight = prop)) +
  geom_bar(data = d2a, position = "stack") +
  geom_bar(data = filter(d2, positive | same), position = "stack") +
  facet_wrap(~first_pref_sen_grp2, ncol = 3) +
  coord_flip() +
  scale_fill_manual("", values = pal, guide = guide_legend(reverse = FALSE),
                    breaks = levels(d2$value)[c(1, 2, 5, 4, 3)]) +
  scale_y_continuous(breaks = c(-1, -0.5, 0, 0.5, 1),
                     labels = c("100%", "50%", "0", "50%", "100%")) +
  expand_limits(y = c(-1, 1)) +
  theme(panel.spacing = unit(1.5, "lines")) +
  ggtitle("Attitudes to public spend, by first preference Senate vote in 2016",
          "Should there be more or less public expenditure in the following area? ...") +
  labs(x = "", y = "",
       caption = "Source: Australian Election Study 2016; analysis by freerangestats.info.")

#----------change gone far enough----------
d3 <- aes2016 %>%
  select(E2_1:E2_7, wt_enrol, first_pref_sen_grp2) %>%
  gather(variable, value, -wt_enrol, -first_pref_sen_grp2) %>%
  group_by(variable, value, first_pref_sen_grp2) %>%
  summarise(freq = sum(wt_enrol)) %>%
  ungroup() %>%
  left_join(aes2016_questions, by = c("variable" = "var_abb")) %>%
  mutate(item = gsub("Do you think .+\\? \n", "", question_label)) %>%
  filter(!is.na(value)) %>%
  group_by(variable, first_pref_sen_grp2, item) %>%
  mutate(negative = value %in% c("Gone much too far", "Gone too far"),
         positive = value %in% c("Not gone nearly far enough", "Not gone far enough"),
         same = !negative & !positive,
         prop = freq / sum(freq) * case_when(
           negative ~ -1,
           positive ~ 1,
           TRUE ~ 0.5)) %>%
  mutate(value = factor(value, levels = c("Gone much too far", "Gone too far",
                                          "Not gone nearly far enough", "Not gone far enough",
                                          "About right"))) %>%
  ungroup() %>%
  mutate(item = fct_reorder(item, prop, .fun = sum))

pal <- brewer.pal(5, "RdYlGn")
names(pal) <- levels(d3$value)[c(1, 2, 5, 4, 3)]

d3a <- d3 %>%
  filter(negative | same) %>%
  mutate(prop = -abs(prop))

d3 %>%
  ggplot(aes(x = item, fill = value, weight = prop)) +
  geom_bar(data = d3a, position = "stack") +
  geom_bar(data = filter(d3, positive | same), position = "stack") +
  facet_wrap(~first_pref_sen_grp2, ncol = 3) +
  coord_flip() +
  scale_fill_manual("", values = pal, guide = guide_legend(reverse = FALSE),
                    breaks = levels(d3$value)[c(1, 2, 5, 4, 3)]) +
  scale_y_continuous(breaks = c(-1, -0.5, 0, 0.5, 1),
                     labels = c("100%", "50%", "0", "50%", "100%")) +
  theme(panel.spacing = unit(1.5, "lines")) +
  expand_limits(y = c(-1, 1)) +
  ggtitle("Attitudes to change, by first preference Senate vote in 2016",
          "Do you think the following change that has been happening in Australia over the years has gone...?") +
  labs(x = "", y = "",
       caption = "Source: Australian Election Study 2016; analysis by freerangestats.info.")

#----------agree with various economic statements----------
d4 <- aes2016 %>%
  select(D13_1:D13_6, wt_enrol, first_pref_sen_grp2) %>%
  gather(variable, value, -wt_enrol, -first_pref_sen_grp2) %>%
  group_by(variable, value, first_pref_sen_grp2) %>%
  summarise(freq = sum(wt_enrol)) %>%
  ungroup() %>%
  left_join(aes2016_questions, by = c("variable" = "var_abb")) %>%
  mutate(item = gsub("Do you strongly .+\\? \n", "", question_label)) %>%
  filter(!is.na(value)) %>%
  group_by(variable, first_pref_sen_grp2, item) %>%
  mutate(negative = value %in% c("Disagree", "Strongly disagree"),
         positive = value %in% c("Agree", "Strongly agree"),
         same = !negative & !positive,
         prop = freq / sum(freq) * case_when(
           negative ~ -1,
           positive ~ 1,
           TRUE ~ 0.5)) %>%
  mutate(value = factor(value, levels = c("Strongly disagree", "Disagree",
                                          "Strongly agree", "Agree",
                                          "Neither agree nor disagree"))) %>%
  ungroup() %>%
  mutate(item = fct_reorder(item, prop, .fun = sum))

pal <- brewer.pal(5, "RdYlGn")
names(pal) <- levels(d4$value)[c(1, 2, 5, 4, 3)]

d4a <- d4 %>%
  filter(negative | same) %>%
  mutate(prop = -abs(prop))

d4 %>%
  ggplot(aes(x = item, fill = value, weight = prop)) +
  geom_bar(data = d4a, position = "stack") +
  geom_bar(data = filter(d4, positive | same), position = "stack") +
  facet_wrap(~first_pref_sen_grp2, ncol = 3) +
  coord_flip() +
  scale_fill_manual("", values = pal, guide = guide_legend(reverse = FALSE),
                    breaks = levels(d4$value)[c(1, 2, 5, 4, 3)]) +
  scale_y_continuous(breaks = c(-1, -0.5, 0, 0.5, 1),
                     labels = c("100%", "50%", "0", "50%", "100%")) +
  theme(panel.spacing = unit(1.5, "lines")) +
  expand_limits(y = c(-1, 1)) +
  ggtitle("Attitudes to left-right economic issues, by first preference Senate vote in 2016",
          "Do you strongly agree ... or strongly disagree with the following statement?") +
  labs(x = "", y = "",
       caption = "Source: Australian Election Study 2016; analysis by freerangestats.info.")

#----------agree with various social statements----------
d5 <- aes2016 %>%
  select(E6_1:E6_7, wt_enrol, first_pref_sen_grp2) %>%
  gather(variable, value, -wt_enrol, -first_pref_sen_grp2) %>%
  group_by(variable, value, first_pref_sen_grp2) %>%
  summarise(freq = sum(wt_enrol)) %>%
  ungroup() %>%
  left_join(aes2016_questions, by = c("variable" = "var_abb")) %>%
  mutate(item = gsub("Do you strongly .+\\? \n", "", question_label)) %>%
  filter(!is.na(value)) %>%
  group_by(variable, first_pref_sen_grp2, item) %>%
  mutate(negative = value %in% c("Disagree", "Strongly disagree"),
         positive = value %in% c("Agree", "Strongly agree"),
         same = !negative & !positive,
         prop = freq / sum(freq) * case_when(
           negative ~ -1,
           positive ~ 1,
           TRUE ~ 0.5)) %>%
  mutate(value = factor(value, levels = c("Strongly disagree", "Disagree",
                                          "Strongly agree", "Agree",
                                          "Neither agree nor disagree"))) %>%
  ungroup() %>%
  mutate(item = fct_reorder(item, prop, .fun = sum))

pal <- brewer.pal(5, "RdYlGn")
names(pal) <- levels(d5$value)[c(1, 2, 5, 4, 3)]

d5a <- d5 %>%
  filter(negative | same) %>%
  mutate(prop = -abs(prop))

d5 %>%
  ggplot(aes(x = item, fill = value, weight = prop)) +
  geom_bar(data = d5a, position = "stack") +
  geom_bar(data = filter(d5, positive | same), position = "stack") +
  facet_wrap(~first_pref_sen_grp2, ncol = 3) +
  coord_flip() +
  scale_fill_manual("", values = pal, guide = guide_legend(reverse = FALSE),
                    breaks = levels(d5$value)[c(1, 2, 5, 4, 3)]) +
  scale_y_continuous(breaks = c(-1, -0.5, 0, 0.5, 1),
                     labels = c("100%", "50%", "0", "50%", "100%")) +
  theme(panel.spacing = unit(1.5, "lines")) +
  expand_limits(y = c(-1, 1)) +
  ggtitle("Attitudes to liberal-conservative social issues, by first preference Senate vote in 2016",
          "Do you strongly agree ... or strongly disagree with the following statement?") +
  labs(x = "", y = "",
       caption = "Source: Australian Election Study 2016; analysis by freerangestats.info.")

#----------constitutional knowledge----------
d6 <- aes2016 %>%
  select(F10_1:F10_6, wt_enrol, first_pref_sen_grp2) %>%
  gather(variable, value, -wt_enrol, -first_pref_sen_grp2) %>%
  group_by(variable, value, first_pref_sen_grp2) %>%
  summarise(freq = sum(wt_enrol)) %>%
  ungroup() %>%
  left_join(aes2016_questions, by = c("variable" = "var_abb")) %>%
  mutate(item = gsub("Do you think .+\\? \n", "", question_label),
         item = str_wrap(item, 50)) %>%
  filter(!is.na(value)) %>%
  group_by(variable, first_pref_sen_grp2, item) %>%
  mutate(negative = value %in% c("False"),
         positive = value %in% c("True"),
         same = !negative & !positive,
         prop = freq / sum(freq) * case_when(
           negative ~ -1,
           positive ~ 1,
           TRUE ~ 0.5)) %>%
  mutate(value = factor(value, levels = c("False", "True", "Don't know"))) %>%
  ungroup() %>%
  mutate(item = fct_reorder(item, prop, .fun = sum))

pal <- brewer.pal(3, "RdYlGn")
names(pal) <- levels(d6$value)[c(1, 3, 2)]

d6a <- d6 %>%
  filter(negative | same) %>%
  mutate(prop = -abs(prop))

d6 %>%
  ggplot(aes(x = item, fill = value, weight = prop)) +
  geom_bar(data = d6a, position = "stack") +
  geom_bar(data = filter(d6, positive | same), position = "stack") +
  facet_wrap(~first_pref_sen_grp2, ncol = 3) +
  coord_flip() +
  scale_fill_manual("", values = pal, guide = guide_legend(reverse = FALSE),
                    breaks = names(pal)) +
  scale_y_continuous(breaks = c(-1, -0.5, 0, 0.5, 1),
                     labels = c("100%", "50%", "0", "50%", "100%")) +
  theme(panel.spacing = unit(1.5, "lines")) +
  expand_limits(y = c(-1, 1)) +
  ggtitle("Knowledge of constitutional issues",
          "Do you think the following statement is true or false? (correct answers in order from 'federation in 1901' to '75 members' are True, True, False, False, True, False)") +
  labs(x = "", y = "",
       caption = "Source: Australian Election Study 2016; analysis by freerangestats.info.")

Final word

OK, watch this space for more analysis using the AES (2016 and other years), if and when I get time.

Note – I have no affiliation or contact with the Australian Election Study, and the AES researchers bear no responsibility for my analysis or interpretation of their data. Use of the AES data by me is solely at my risk and I indemnify the Australian Data Archive and its host institution, The Australian National University. The Australian Data Archive and its host institution, The Australian National University, are not responsible for the accuracy and completeness of the material supplied. Similarly, if you use my analysis, it is at your risk. Don’t blame me for anything that goes wrong, even if I made a mistake. But it would be nice if you let me know.

thankr::shoulders() %>%
  mutate(maintainer = str_squish(gsub("<.+>", "", maintainer)),
         maintainer = ifelse(maintainer == "R-core", "R Core Team", maintainer)) %>%
  group_by(maintainer) %>%
  summarise(`Number packages` = sum(no_packages),
            packages = paste(packages, collapse = ", ")) %>%
  arrange(desc(`Number packages`)) %>%
  knitr::kable() %>%
  clipr::write_clip()

| maintainer | Number packages | packages |
|---|---|---|
| Hadley Wickham | 17 | assertthat, dplyr, ellipsis, forcats, ggplot2, gtable, haven, httr, lazyeval, modelr, plyr, rvest, scales, stringr, testthat, tidyr, tidyverse |
| R Core Team | 13 | base, compiler, datasets, graphics, grDevices, grid, methods, splines, stats, tools, utils, nlme, foreign |
| Gábor Csárdi | 4 | cli, crayon, pkgconfig, zip |
| Kirill Müller | 4 | DBI, hms, pillar, tibble |
| Lionel Henry | 4 | purrr, rlang, svglite, tidyselect |
| Winston Chang | 4 | extrafont, extrafontdb, R6, Rttf2pt1 |
| Yihui Xie | 4 | evaluate, knitr, rmarkdown, xfun |
| Achim Zeileis | 3 | colorspace, lmtest, zoo |
| Jim Hester | 3 | glue, withr, readr |
| Yixuan Qiu | 3 | showtext, showtextdb, sysfonts |
| Dirk Eddelbuettel | 2 | digest, Rcpp |
| Jennifer Bryan | 2 | readxl, cellranger |
| Jeroen Ooms | 2 | curl, jsonlite |
| Simon Urbanek | 2 | Cairo, audio |
| “Thomas Lumley” | 1 | survey |
| Alex Hayes | 1 | broom |
| Alexander Walker | 1 | openxlsx |
| Brian Ripley | 1 | MASS |
| Brodie Gaslam | 1 | fansi |
| Charlotte Wickham | 1 | munsell |
| David Gohel | 1 | gdtools |
| David Meyer | 1 | vcd |
| Deepayan Sarkar | 1 | lattice |
| Erich Neuwirth | 1 | RColorBrewer |
| James Hester | 1 | xml2 |
| Jeremy Stephens | 1 | yaml |
| Joe Cheng | 1 | htmltools |
| Justin Talbot | 1 | labeling |
| Kamil Slowikowski | 1 | ggrepel |
| Kevin Ushey | 1 | rstudioapi |
| Marek Gagolewski | 1 | stringi |
| Martin Maechler | 1 | Matrix |
| Matt Dowle | 1 | data.table |
| Max Kuhn | 1 | generics |
| Michel Lang | 1 | backports |
| Patrick O. Perry | 1 | utf8 |
| Peter Ellis | 1 | frs |
| Rasmus Bååth | 1 | beepr |
| Simon Garnier | 1 | viridisLite |
| Stefan Milton Bache | 1 | magrittr |
| Terry M Therneau | 1 | survival |
| Thomas J. Leeper | 1 | rio |
| Vitalie Spinu | 1 | lubridate |

To leave a comment for the author, please follow the link and comment on their blog: free range statistics - R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### FizzBuzz in R and Python

Sun, 04/21/2019 - 08:43

(This article was first published on Method Matters, and kindly contributed to R-bloggers)

In this post, we will solve a simple problem (called “FizzBuzz“) that is asked by some employers in data scientist job interviews. The question seeks to ascertain the applicant’s familiarity with basic programming concepts. We will see two different ways to solve the problem in two languages widely used in data science: R and Python.

The FizzBuzz Question

I came across the FizzBuzz question in this excellent blog post on conducting data scientist interviews and have seen it referenced elsewhere on the web. The intent of the question is to probe the job applicant’s knowledge of basic programming concepts.

The prompt of the question is as follows:

In pseudo-code or whatever language you would like: write a program that prints the numbers from 1 to 100. But for multiples of three print “Fizz” instead of the number and for the multiples of five print “Buzz”. For numbers which are multiples of both three and five print “FizzBuzz”.

The Solution in R

We will first solve the problem in R. We will make use of control structures: statements “which result in a choice being made as to which of two or more paths to follow.” This is necessary for us to A) iterate over a number sequence and B) print the correct output for each number in the sequence. We will solve this problem in two slightly different ways.

Solution 1: Using a “for loop”

The classical way of solving this problem is via a “for loop“: we use the loop to explicitly iterate through a sequence of numbers. For each number, we use if-else statements to evaluate which output to print to the console.

The code* looks like this:

for (i in 1:100){
  if (i %% 3 == 0 & i %% 5 == 0) {
    print('FizzBuzz')
  } else if (i %% 3 == 0) {
    print('Fizz')
  } else if (i %% 5 == 0) {
    print('Buzz')
  } else {
    print(i)
  }
}

Our code first explicitly defines the number sequence from 1 to 100. For each number in this sequence, we evaluate a series of if-else statements, and based on the results of these evaluations, we print the appropriate output (as defined in the question prompt).

The double percent sign (%%) is R's modulo operator, which allows us to determine the remainder following a division. If the remainder is zero (which we test via the double equals sign, ==), the first number is divisible by the second number.

We start with the double evaluation – any other ordering of our if-else statements would cause the program to fail (because numbers which pass the double evaluation by definition also pass the single evaluations). If a given number in the sequence from 1 to 100 is divisible both by 3 and by 5, we print “FizzBuzz” to the console.

We then continue our testing, using “else if” statements. The program checks each one of these statements in turn. If the number fails the double evaluation, we test to see if it is divisible by 3. If this is the case, we print “Fizz” to the console. If the number fails this evaluation, we check to see if it is divisible by 5 (in which case we print “Buzz” to the console). If the number fails each of these evaluations (meaning it is not divisible by 3 or 5), we simply print the number to the console.
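The importance of starting with the combined test is easy to demonstrate. Here is a small sketch (written in Python for brevity, and not part of the original post) showing what happens when the single-divisor checks come first:

```python
# If "divisible by 3" is tested first, a multiple of 15 matches that
# branch immediately and the FizzBuzz branch becomes unreachable.
def wrong_order(i):
    if i % 3 == 0:
        return 'Fizz'
    elif i % 5 == 0:
        return 'Buzz'
    elif i % 3 == 0 and i % 5 == 0:  # never reached
        return 'FizzBuzz'
    else:
        return i

print(wrong_order(15))  # prints: Fizz, not FizzBuzz
```

The same reasoning applies to the R code above: the order of the if-else branches determines which condition wins.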

Solution 2: Using a function

In general, for loops are discouraged in R. In earlier days they were much slower than their functional equivalents (such as the apply family), although this seems to be no longer true. The wonderful R programmer Hadley Wickham advocates for a functional approach based on the fact that functions are generally clearer about what they’re trying to accomplish. As he writes: “A for loop conveys that it’s iterating over something, but doesn’t clearly convey a high level goal… [Functions are] tailored for a specific task, so when you recognise the functional you know immediately why it’s being used.” (There are many interesting and strong opinions about using loops vs. functions in R; I won’t get into the details here, but it’s informative to see the various opinions out there.)

The FizzBuzz solution using a function looks like this:

# define the function
fizz_buzz <- function(number_sequence_f){
  if (number_sequence_f %% 3 == 0 & number_sequence_f %% 5 == 0) {
    print('FizzBuzz')
  } else if (number_sequence_f %% 3 == 0) {
    print('Fizz')
  } else if (number_sequence_f %% 5 == 0) {
    print('Buzz')
  } else {
    print(number_sequence_f)
  }
}

# apply it to the numbers 1 to 100
sapply(seq(from = 1, to = 100, by = 1), fizz_buzz)

The code above defines a function called “fizz_buzz”. This function takes an input (which I have defined in the function as number_sequence_f), and then passes it through the logical sequence we defined using the “if” statements above.

We then use sapply to apply the FizzBuzz function to a sequence of numbers, which we define directly in the apply statement itself. (The seq command allows us to define a sequence of numbers, and we explicitly state that we want to start at 1 and end at 100, advancing in increments of 1).

This approach is arguably more elegant. We define a function, which is quite clear as to what its goal is. And then we use a single-line command to apply that function to the sequence of numbers we generate with the seq function.

However, note that both solutions produce the output that was required by the question prompt!

The Solution in Python

We will now see how to write these two types of programs in Python. The logic will be the same as that used above, so I will only comment on the slight differences between how the languages implement the central ideas.

Solution 1: Using a “for loop”

The code to solve the FizzBuzz problem looks like this:

for i in range(1, 101):
    if (i % 3 == 0) & (i % 5 == 0):
        print('FizzBuzz')
    elif (i % 3 == 0):
        print('Fizz')
    elif (i % 5 == 0):
        print('Buzz')
    else:
        print(i)

Note that the structure of definition of the “for loop” is different. Whereas R requires {} symbols around all the code in the loop, and for the code inside of each if-else statement, Python uses the colon and indentations to indicate the different parts of the control structure.

I personally find the Python implementation cleaner and easier to read. The indentations are simpler and make the code less cluttered. I get the sense that Python is a true programming language, whereas R’s origins in the statistical community have left it with comparatively more complicated ways to accomplish the same programming goal.

Note also that, if we want to stop at 100, we have to specify that the range in our for loop stops at 101. Python has a different way of defining ranges. The first element in the range indicates where to start, but the last number in the range should be 1 greater than the number you actually want to stop at.
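The endpoint behaviour of range is easy to check interactively; a quick sketch (not from the original post):

```python
# range(start, stop) excludes the stop value, so range(1, 101)
# yields the integers 1 through 100 inclusive.
nums = list(range(1, 101))
print(nums[0], nums[-1], len(nums))  # prints: 1 100 100
```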

Solution 2: Using a function

Note that we can also use the functional approach to solve this problem in Python. As we did above, we wrap our control flow in a function (called fizz_buzz), and apply the function to the sequence of numbers from 1 to 100.

There are a number of different ways to do this in Python. Below I use pandas, a tremendously useful Python library for data manipulation and munging, to apply our function to the numbers from 1 to 100. Because the pandas apply function cannot be used on arrays (the object returned by numpy's np.arange), I convert the array to a Series and call the apply function on that.

# import the numpy and pandas libraries
import numpy as np
import pandas as pd

# define the function
def fizz_buzz(number_sequence_f):
    if (number_sequence_f % 3 == 0) & (number_sequence_f % 5 == 0):
        return('FizzBuzz')
    elif (number_sequence_f % 3 == 0):
        return('Fizz')
    elif (number_sequence_f % 5 == 0):
        return('Buzz')
    else:
        return(number_sequence_f)

# apply it to the number series
pd.Series(np.arange(1, 101)).apply(fizz_buzz)

One final bit of programming detail. The for loop defined above calls range, which returns a lazy range object rather than a materialised series of numbers. Long story short, this is appropriate for the for loop, but as we need to apply our function to an actual series of numbers, we can’t use the range command here. Instead, in order to apply our function to a series of numbers, we use the numpy arange command (which returns an array of numbers). As described above, we convert the array to a pandas Series, which is then passed to our function.
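If pandas and numpy feel heavyweight for this task, the same functional idea can be expressed with a plain list comprehension. This is a dependency-free sketch, not part of the original post:

```python
# Apply the same FizzBuzz logic with a list comprehension
# instead of pandas' apply.
def fizz_buzz(n):
    if n % 3 == 0 and n % 5 == 0:
        return 'FizzBuzz'
    elif n % 3 == 0:
        return 'Fizz'
    elif n % 5 == 0:
        return 'Buzz'
    else:
        return n

results = [fizz_buzz(i) for i in range(1, 101)]
print(results[:5])  # prints: [1, 2, 'Fizz', 4, 'Buzz']
```

Because the comprehension consumes the range lazily, no intermediate array or Series is needed.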

Does this make sense as an interview question?

As part of a larger interview, I suppose, this question makes sense. A large part of applied analytical work involves manipulating data programmatically, and so getting some sense of how the candidate can use control structures could be informative.

However, this doesn’t really address the biggest challenges I face as a data scientist, which involve case definition and scoping: e.g. how can a data scientist partner with stakeholders to develop an analytical solution that answers a business problem while avoiding unnecessary complexity? The key issues involve A) analytical maturity to know what numerical solutions are likely to give a reasonable solution with the available data, B) business sense to understand the stakeholder problem and translate it into a form that allows a data science solution, and C) a combination of A and B (analytical wisdom and business sense) to identify the use cases for which analytics will have the largest business impact.**

So data science is a complicated affair, and I’m not sure the FizzBuzz question gives an interviewer much sense of the things I see as being important in a job candidate.*** For another interesting opinion on what’s important in applied analytics, I can recommend this excellent blog post which details a number of other important but under-discussed aspects of the data science profession.

Conclusion

In this post, we saw how to solve a data science interview question called “FizzBuzz.” The question requires the applicant to think through (and potentially code) a solution which involves cycling through a series of numbers, and based on certain conditions, printing a specific outcome.

We solved this problem in two different ways in both R and Python. The solutions allowed us to see different programming approaches (for loops and functions) for implementing control flow logic. The for loops are perhaps the most intuitive approach, but potentially more difficult for others to interpret. The functional approach benefits from clarity of purpose, but requires a slightly higher level of abstract thinking to code. The examples here serve as a good illustration of the differences between R and Python code. R has its origins in the statistical community, and its implementation of control structures seems less straightforward than the comparable implementations in Python.

Finally, we examined the value of the FizzBuzz interview question when assessing qualities in a prospective data science candidate. While the FizzBuzz question definitely provides some information, it misses a lot of what (in my view) comprises the critical skill-set that is needed to deliver data science projects in a business setting.

Coming Up Next

In the next post, we will see how to use R to download data from Fitbit devices, employing a combination of existing R packages and custom API calls.

Stay tuned!

* My go-to html code formatter for R has disappeared! If anyone knows of such a resource, it would be hugely helpful in making the R code on this blog more readable. Please let me know in the comments! (The Python code is converted with tohtml).

** And then, of course, once a proper solution has been developed, it has to be deployed. Can you partner with data engineers and/or other IT stakeholders to integrate your results into business processes? (Notice the importance of collaboration – a critical but seldom-discussed aspect of the data science enterprise).

*** In defense of the above-referenced blog post, the original author is quite nuanced and does not only rely on the FizzBuzz question when interviewing candidates!


To leave a comment for the author, please follow the link and comment on their blog: Method Matters.

### Process Mining (Part 2/3): More on bupaR package

Sun, 04/21/2019 - 02:00

(This article was first published on R on notast, and kindly contributed to R-bloggers)

Recap

In the last post, event logs and the discipline of process mining were defined. The bupaR package was introduced as a tool for doing process mining in R.

Objectives for This Post
1. Visualize workflow
2. Understand the concept of activity reoccurrences

We will use the pre-loaded dataset sepsis from the bupaR package. This event log is based on the real-life management of sepsis patients from the point of admission to discharge.

The dataset has 15214 activity instances and 1050 cases.

library(plyr)
library(tidyverse)
library(bupaR)

sepsis

## Event log consisting of:
## 15214 events
## 846 traces
## 1050 cases
## 16 activities
## 15214 activity instances
##
## # A tibble: 15,214 x 34
##    case_id activity lifecycle resource timestamp            age   crp
##  1 A       ER Regi~ complete  A        2014-10-22 11:15:41   85    NA
##  2 A       Leucocy~ complete  B        2014-10-22 11:27:00   NA    NA
##  3 A       CRP      complete  B        2014-10-22 11:27:00   NA   210
##  4 A       LacticA~ complete  B        2014-10-22 11:27:00   NA    NA
##  5 A       ER Tria~ complete  C        2014-10-22 11:33:37   NA    NA
##  6 A       ER Seps~ complete  A        2014-10-22 11:34:00   NA    NA
##  7 A       IV Liqu~ complete  A        2014-10-22 14:03:47   NA    NA
##  8 A       IV Anti~ complete  A        2014-10-22 14:03:47   NA    NA
##  9 A       Admissi~ complete  D        2014-10-22 14:13:19   NA    NA
## 10 A       CRP      complete  B        2014-10-24 09:00:00   NA  1090
## # ... with 15,204 more rows, and 27 more variables: diagnose,
## #   diagnosticartastrup, diagnosticblood, diagnosticecg, diagnosticic,
## #   diagnosticlacticacid, diagnosticliquor, diagnosticother,
## #   diagnosticsputum, diagnosticurinaryculture, diagnosticurinarysediment,
## #   diagnosticxthorax, disfuncorg, hypotensie, hypoxie, infectionsuspected,
## #   infusion, lacticacid, leucocytes, oligurie, sirscritheartrate,
## #   sirscritleucos, sirscrittachypnea, sirscrittemperature,
## #   sirscriteria2ormore, activity_instance_id, .order

We will subset a smaller and more granular event log to make illustrations more comprehensible.

activity_frequency(sepsis, level = "activity") %>%
  arrange(relative)  # least common activity

## # A tibble: 16 x 3
##    activity         absolute relative
##  1 Release E               6 0.000394
##  2 Release D              24 0.00158
##  3 Release C              25 0.00164
##  4 Release B              56 0.00368
##  5 Admission IC          117 0.00769
##  6 Return ER             294 0.0193
##  7 Release A             671 0.0441
##  8 IV Liquid             753 0.0495
##  9 IV Antibiotics        823 0.0541
## 10 ER Sepsis Triage     1049 0.0689
## 11 ER Registration      1050 0.0690
## 12 ER Triage            1053 0.0692
## 13 Admission NC         1182 0.0777
## 14 LacticAcid           1466 0.0964
## 15 CRP                  3262 0.214
## 16 Leucocytes           3383 0.222

# cases with the least common activity, to obtain a smaller event log
sepsis_subset <- filter_activity_presence(sepsis, "Release E")

Visualizing Workflow

In an event log, the sequence of activities is captured using either timestamps or an activity instance identifier. The sequential order of activities allows us to reconstruct the workflow by which a case was managed. This workflow can be compared against theoretical workflow models to identify deviations. It can also reveal the reoccurrence of activities, which will be covered later. The most intuitive way to examine workflow is with a process map, created with the process_map() function from bupaR.

A process map is created either based on the frequency of activities (i.e. process_map(type = frequency())) or based on the duration of activities (i.e. process_map(type = performance())). I focus here on the frequency aspect, which is the default argument.

There are four values you can supply to frequency() inside process_map(), which determine what is displayed. The values can reflect either the number of activity instances or the number of cases, expressed as either absolute or relative frequencies.
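The four resulting combinations can be sketched as follows (illustrative calls against the sepsis_subset event log created above; each returns an interactive process map):

```r
library(bupaR)

# Activity-instance counts, absolute and relative
sepsis_subset %>% process_map(type = frequency("absolute"))
sepsis_subset %>% process_map(type = frequency("relative"))

# Case counts, absolute and relative
sepsis_subset %>% process_map(type = frequency("absolute_case"))
sepsis_subset %>% process_map(type = frequency("relative_case"))
```

The post walks through the two absolute variants below; the relative variants show the same maps with percentages instead of counts.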

Process mapping based on absolute activity instances

sepsis_subset %>% process_map(type = frequency("absolute"))

[Interactive process map: absolute activity-instance frequencies for sepsis_subset]

Process mapping based on absolute number of cases for the activity

sepsis_subset %>% process_map(type = frequency("absolute_case"))

[Interactive process map: absolute case frequencies for sepsis_subset]

The darker the colour of the boxes, and the darker and thicker the arrows, the higher the value.
From the process map, the activities leading up to “Release E” were “CRP”, “Leucocytes”, and “LacticAcid”. “CRP” accounted for half of the activity instances and cases immediately preceding “Release E”.

Reoccurrence of Activities

In the above process map, the boxes for “CRP”, “Leucocytes”, “Admission NC”, and “LacticAcid” each have an arrow whose head and tail belong to the same activity. These arrows indicate reoccurrence of the activity within the same case. Reoccurring activities can suggest inefficiencies, interruptions, or disruptions, which may warrant further investigation to optimise the workflow.

bupaR has two constructs to define reoccurrence of activities within the same case.

Construct 1 (which resource repeats the activity)

If the same resource repeats the activity for a particular case, it is known as a “repeat” in bupaR’s terminology. If a different resource repeats the activity, it is known as a “redo”.

Construct 2 (how the activity instances reoccur)

When the activity for a specific case reoccurs as consecutive activity instances, it is termed a “self-loop”. This is an example of “1 self-loop”:

Activity    Self Loop
ER Triage   –
CRP         1
CRP         1
Release A   –

When the activity reoccurs as non-consecutive activity instances, it is known as a “repetition”. In other words, other activities occur before the specific activity is repeated. This is an example of “1 repetition”:

Activity    Repetition
ER Triage   –
CRP         1
Leucocytes  –
CRP         1

Combining these constructs results in four types of activity reoccurrence:

1. Redo self-loop
2. Redo repetition
3. Repeat self-loop
4. Repeat repetition

Let’s look at some examples using bupaR.

Which activities reoccurred non-consecutively in a case?

number_of_repetitions(sepsis_subset, level="activity", type="all")

## # Description: activity_metric [12 x 3]
##    activity         absolute relative
##  1 Admission IC            0 0
##  2 Admission NC            2 0.167
##  3 CRP                     6 0.188
##  4 ER Registration         0 0
##  5 ER Sepsis Triage        0 0
##  6 ER Triage               0 0
##  7 IV Antibiotics          0 0
##  8 IV Liquid               0 0
##  9 LacticAcid              3 0.158
## 10 Leucocytes              6 0.136
## 11 Release E               0 0
## 12 Return ER               0 0

Which activities reoccurred non-consecutively in a case, with the same resource repeating the activities?

number_of_repetitions(sepsis_subset, level="activity", type="repeat")

## # Description: activity_metric [12 x 3]
##    activity         absolute relative
##  1 Admission IC            0 0
##  2 Admission NC            0 0
##  3 CRP                     6 0.188
##  4 ER Registration         0 0
##  5 ER Sepsis Triage        0 0
##  6 ER Triage               0 0
##  7 IV Antibiotics          0 0
##  8 IV Liquid               0 0
##  9 LacticAcid              3 0.158
## 10 Leucocytes              6 0.136
## 11 Release E               0 0
## 12 Return ER               0 0

Which cases had activities duplicated consecutively, with the duplicated activities redone by a different resource?

number_of_selfloops(sepsis_subset, level="case", type = "redo")

## # Description: case_metric [6 x 3]
##   case_id absolute relative
## 1 BCA            0 0
## 2 CY             1 0.0303
## 3 JAA            0 0
## 4 JM             1 0.0769
## 5 LG             1 0.0244
## 6 SAA            0 0

Summing Up

In this post, we learnt how to visualize the workflow recorded in event logs with a process map. We also learnt bupaR’s constructs of activity reoccurrence and the resulting types of activity reoccurrences. If you want to learn more about bupaR, head over to DataCamp for more comprehensive and interactive lessons. (P.S. I’m not sponsored by nor affiliated with DataCamp.)

In the next post, we’ll look at more visualizations for process analysis.


To leave a comment for the author, please follow the link and comment on their blog: R on notast.

### Batch Deployment of WoE Transformations

Sat, 04/20/2019 - 23:32

After wrapping up the function batch_woe() today, whose purpose is to let users apply WoE transformations to many independent variables simultaneously, I have completed the development of the major functions in the MOB package needed for model development in a production setting.

The function batch_woe() is basically a wrapper around cal_woe() and has two input parameters. The “data” parameter is the data frame to which we deploy the binning outcomes, and the “slst” parameter is a list of binning specification tables, either taken directly from the output of batch_bin() or created manually by combining outputs from multiple binning functions.

There are also two components in the output of batch_woe(): a list of PSI tables for the transformed variables, and a data frame with a row index and all transformed variables. The default printout is a PSI summary for all transformed input variables. As shown below, all PSI values are below 0.1 and therefore none is concerning.
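As context for the 0.1 rule of thumb: the Population Stability Index between an expected (development) and actual (deployment) distribution is commonly computed as the sum over bins of (a - e) * log(a / e). A minimal sketch in base R (this helper is my own illustration, not part of the MOB package, and does not guard against empty bins):

```r
# Hypothetical helper: PSI between two numeric vectors, using decile bins
# derived from the expected sample. Values below ~0.1 are conventionally
# read as "no meaningful population shift".
psi <- function(expected, actual, breaks = 10) {
  cuts <- quantile(expected, probs = seq(0, 1, length.out = breaks + 1), na.rm = TRUE)
  cuts[1] <- -Inf
  cuts[length(cuts)] <- Inf
  e <- table(cut(expected, unique(cuts))) / length(expected)
  a <- table(cut(actual, unique(cuts))) / length(actual)
  sum((a - e) * log(a / e))
}

set.seed(1)
x <- rnorm(5000)
psi(x, rnorm(2000))       # same distribution: PSI close to 0
psi(x, rnorm(2000, 0.5))  # shifted distribution: noticeably larger PSI
```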

binout <- batch_bin(df, 1)
woeout <- batch_woe(df[sample(seq(nrow(df)), 2000, replace = T), ], binout$BinLst)

woeout
# tot_derog tot_tr age_oldest_tr tot_open_tr tot_rev_tr tot_rev_debt ...
# psi 0.0027 0.0044 0.0144 0.0011 3e-04 0.0013 ...

str(woeout, max.level = 1)
# List of 2
# $ psi:List of 11
# $ df :'data.frame': 2000 obs. of 12 variables:
# - attr(*, "class")= chr "psiSummary"

head(woeout$df, 1)
# idx_ woe.tot_derog woe.tot_tr woe.age_oldest_tr woe.tot_open_tr woe.tot_rev_tr ...
# 1 -0.3811 -0.0215 -0.5356 -0.0722 -0.1012 ...

All source code for the MOB package is available at https://github.com/statcompute/MonotonicBinning and is free (as in free beer) to download and distribute.
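If the repository is laid out as an installable R package (an assumption on my part; otherwise the scripts can simply be downloaded and sourced), it could be pulled directly from GitHub:

```r
# install.packages("remotes")  # if not already installed
remotes::install_github("statcompute/MonotonicBinning")
```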


### Styling DataTables

Sat, 04/20/2019 - 02:00

(This article was first published on Stencilled, and kindly contributed to R-bloggers)

Most Shiny apps have tables as a primary component. Now let’s say you want to prettify your app and style its tables. All you need is to understand how tables are built using HTML. This is how the default datatable looks in the app.
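For reference, a minimal default table might be rendered like this before any styling (mtcars here is just a stand-in for the app's data):

```r
library(shiny)
library(DT)

ui <- fluidPage(DT::dataTableOutput("tbl"))

server <- function(input, output) {
  # Default rendering: no custom container, no custom options
  output$tbl <- DT::renderDataTable(DT::datatable(mtcars))
}

# shinyApp(ui, server)
```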

In order to build the HTML table, I have used a function table_frame() that can be used as a container in DT::renderDataTable().
This function basically uses htmltools. For more on the basics of HTML tables, please refer here.

table_frame <- function() {
  htmltools::withTags(table(
    class = 'display',
    thead(
      tr(
        th(rowspan = 2, 'Latitude'),
        th(rowspan = 2, 'Longitude'),
        th(rowspan = 2, 'Month'),
        th(rowspan = 2, 'Year'),
        th(class = 'dt-center', colspan = 3, 'Cloud'),
        th(rowspan = 2, 'Ozone'),
        th(rowspan = 2, 'Pressure'),
        th(rowspan = 2, 'Surface Temperature'),
        th(rowspan = 2, 'Temperature'),
        tr(lapply(rep(c('High', 'Low', 'Mid'), 1), th))
      )
    )
  ))
}

Tables might have any number of records, and it’s not feasible to display them all at once on a dashboard. But someone might need to see them all at once. So in the table options we add two buttons, Show All and Show Less. Show Less restores the default of 10 records, and Show All displays all the records.

table_options <- function() {
  list(
    dom = 'Bfrtip',
    pageLength = 10,
    buttons = list(
      c('copy', 'csv', 'excel', 'pdf', 'print'),
      list(
        extend = "collection",
        text = 'Show All',
        action = DT::JS(
          "function ( e, dt, node, config ) { dt.page.len(-1); dt.ajax.reload();}"
        )
      ),
      list(
        extend = "collection",
        text = 'Show Less',
        action = DT::JS(
          "function ( e, dt, node, config ) { dt.page.len(10); dt.ajax.reload();}"
        )
      )
    ),
    deferRender = TRUE,
    lengthMenu = list(c(10, 20, -1), c('10', '20', 'All')),
    searching = FALSE,
    editable = TRUE,
    scroller = TRUE,
    lengthChange = FALSE,
    initComplete = JS(
      "function(settings, json) {",
      "$(this.api().table().header()).css({'background-color': '#517fb9', 'color': '#fff'});",
      "}"
    )
  )
}

Below is how the datatable looks once the HTML container and table options are used. So by styling we can not only change the column names but also group them. In the default table there are three columns with the prefix “cloud”; these can be grouped under one column named Cloud.

Code

You can find the code for the app here.

To leave a comment for the author, please follow the link and comment on their blog: Stencilled.
### Quick Example of Latent Profile Analysis in R

Sat, 04/20/2019 - 02:00

(This article was first published on R on Will Hipson, and kindly contributed to R-bloggers)

Latent Profile Analysis (LPA) tries to identify clusters of individuals (i.e., latent profiles) based on responses to a series of continuous variables (i.e., indicators). LPA assumes that there are unobserved latent profiles that generate patterns of responses on indicator items.

Here, I will go through a quick example of LPA to identify groups of people based on their interests/hobbies. The data comes from the Young People Survey, available freely on Kaggle.com.

Here’s a sneak peek at what we’re going for:

Terminology note: People use the terms clusters, profiles, classes, and groups interchangeably, but there are subtle differences. I’ll mostly stick to profile to refer to a grouping of cases, in keeping with LPA terminology.

We should note that LPA is a branch of Gaussian Finite Mixture Modeling, which includes Latent Class Analysis (LCA). The difference between LPA and LCA is conceptual, not computational: LPA uses continuous indicators and LCA uses binary indicators. LPA is a probabilistic model, which means that it models the probability of a case belonging to a profile. This is superior to an approach like K-means that uses distance algorithms.

With that aside, let’s load in the data.

library(tidyverse)

survey <- read_csv("https://raw.githubusercontent.com/whipson/tidytuesday/master/young_people.csv") %>%
  select(History:Pets)

The data is on 32 interests/hobbies. Each item is ranked 1 (not interested) to 5 (very interested). The description on Kaggle suggests there may be careless responding (e.g., participants who selected the same value over and over). We can use the careless package to identify “string responding”. Let’s also look for multivariate outliers with Mahalanobis Distance (see my previous post on Mahalanobis for identifying outliers).
library(careless)
library(psych)

interests <- survey %>%
  mutate(string = longstring(.)) %>%
  mutate(md = outlier(., plot = FALSE))

We’ll cap string responding to a maximum of 10 and use a Mahalanobis D cutoff of alpha = .001.

cutoff <- (qchisq(p = 1 - .001, df = ncol(interests)))

interests_clean <- interests %>%
  filter(string <= 10, md < cutoff) %>%
  select(-string, -md)

The package mclust performs various types of model-based clustering and dimension reduction. Plus, it’s really intuitive to use. It requires complete data (no missing values), so for this example we’ll remove cases with NAs. This is not the preferred approach; we’d be better off imputing. But for illustrative purposes, this works fine. I’m also going to standardize all of the indicators so that when we plot the profiles it’s clearer to see the differences between clusters. Running this code will take a few minutes.

library(mclust)

interests_clustering <- interests_clean %>%
  na.omit() %>%
  mutate_all(list(scale))

BIC <- mclustBIC(interests_clustering)

We’ll start by plotting the Bayesian Information Criterion for all the models with profiles ranging from 1 to 9.

plot(BIC)

It’s not immediately clear which model is best, since the y-axis is so large and many of the models score close together. summary(BIC) shows the top three models based on BIC.

summary(BIC)

## Best BIC values:
## VVE,3 VEE,3 EVE,3
## BIC -75042.7 -75165.1484 -75179.165
## BIC diff 0.0 -122.4442 -136.461

The highest BIC comes from VVE, 3. This says there are 3 clusters with variable volume, variable shape, equal orientation, and ellipsoidal distribution (see Figure 2 from this paper for a visual). However, VEE, 3 is not far behind and may actually be a more theoretically useful model, since it constrains the shape of the distribution to be equal. For this reason, we’ll go with VEE, 3.

If we want to look at this model more closely, we save it as an object and inspect it with summary().
mod1 <- Mclust(interests_clustering, modelNames = "VEE", G = 3, x = BIC)

summary(mod1)

## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VEE (ellipsoidal, equal shape and orientation) model with 3
## components:
##
## log.likelihood n df BIC ICL
## -35455.83 874 628 -75165.15 -75216.14
##
## Clustering table:
## 1 2 3
## 137 527 210

The output describes the geometric characteristics of the profiles and the number of cases classified into each of the three clusters.

BIC is one of the best fit indices, but it’s always recommended to look for more evidence that the solution we’ve chosen is the correct one. We can also compare values of the Integrated Completed Likelihood (ICL) criterion. See this paper for more details. ICL isn’t much different from BIC, except that it adds a penalty for solutions with greater entropy or classification uncertainty.

ICL <- mclustICL(interests_clustering)

plot(ICL)

summary(ICL)

## Best ICL values:
## VVE,3 VEE,3 EVE,3
## ICL -75134.69 -75216.13551 -75272.891
## ICL diff 0.00 -81.44795 -138.203

We see similar results. ICL suggests that model VEE, 3 fits quite well. Finally, we’ll perform the bootstrap likelihood ratio test (BLRT), which compares model fit between k-1 and k cluster models. In other words, it tests whether adding a profile improves fit. Based on simulations by Nylund, Asparouhov, and Muthén (2007), BIC and BLRT are the best indicators of how many profiles there are. This line of code will take a long time to run, so if you’re just following along I suggest skipping it unless you want to step out for a coffee break.
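Before the long-running bootstrap test, a quick sanity check on the numbers in the summary above: mclust reports BIC on the 2·logLik − df·log(n) scale (so larger is better), and plugging in the reported values reproduces the model’s BIC up to rounding:

```r
# mclust's BIC convention: 2*logLik - df*log(n)  (larger is better)
loglik <- -35455.83   # log.likelihood from summary(mod1)
n  <- 874             # number of cases
df <- 628             # number of estimated parameters

bic <- 2 * loglik - df * log(n)
round(bic, 2)   # ~ -75165.16, matching the reported -75165.15 up to rounding of logLik
```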
mclustBootstrapLRT(interests_clustering, modelName = "VEE")

## -------------------------------------------------------------
## Bootstrap sequential LRT for the number of mixture components
## -------------------------------------------------------------
## Model = VEE
## Replications = 999
## LRTS bootstrap p-value
## 1 vs 2 197.0384 0.001
## 2 vs 3 684.8743 0.001
## 3 vs 4 -124.1935 1.000

BLRT also suggests that a 3-profile solution is ideal.

Visualizing LPA

Now that we’re confident in our choice of a 3-profile solution, let’s plot the results. Specifically, we want to see how the profiles differ on the indicators, that is, the items that made up the profiles. If the solution is theoretically meaningful, we should see differences that make sense.

First, we’ll extract the means for each profile (remember, we chose these to be standardized). Then we melt this into long form. Note that I’m trimming values exceeding +1 SD, otherwise we run into plotting issues.

library(reshape2)

means <- data.frame(mod1$parameters$mean, stringsAsFactors = FALSE) %>% rownames_to_column() %>% rename(Interest = rowname) %>% melt(id.vars = "Interest", variable.name = "Profile", value.name = "Mean") %>% mutate(Mean = round(Mean, 2), Mean = ifelse(Mean > 1, 1, Mean))
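To see what the melt() step does, here is the same reshape on a tiny stand-in for mod1$parameters$mean (hypothetical values): the indicators-by-profiles matrix of means becomes one row per (Interest, Profile) pair, which is the shape ggplot needs.

```r
library(reshape2)
library(tibble)

# stand-in for mod1$parameters$mean: indicators in rows, profiles in columns
wide <- data.frame(X1 = c(1.20, -0.35), X2 = c(-0.11, 0.47),
                   row.names = c("Biology", "Writing"))

long <- melt(rownames_to_column(wide, "Interest"),
             id.vars = "Interest", variable.name = "Profile", value.name = "Mean")

nrow(long)   # 4 rows: 2 interests x 2 profiles, ready for group/color aesthetics
```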

Here’s the code for the plot. I’m reordering the indicators so that similar activities are close together.

means %>% ggplot(aes(Interest, Mean, group = Profile, color = Profile)) + geom_point(size = 2.25) + geom_line(size = 1.25) + scale_x_discrete(limits = c("Active sport", "Adrenaline sports", "Passive sport", "Countryside, outdoors", "Gardening", "Cars", "Art exhibitions", "Dancing", "Musical instruments", "Theatre", "Writing", "Reading", "Geography", "History", "Law", "Politics", "Psychology", "Religion", "Foreign languages", "Biology", "Chemistry", "Mathematics", "Medicine", "Physics", "Science and technology", "Internet", "PC", "Celebrities", "Economy Management", "Fun with friends", "Shopping", "Pets")) + labs(x = NULL, y = "Standardized mean interest") + theme_bw(base_size = 14) + theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "top")

We have a lot of indicators (more than typical for LPA), but we see some interesting differences. Clearly the red group is interested in science and the blue group shows greater interest in arts and humanities. The green group seems disinterested in both science and art, but moderately interested in other things.

We can make this plot more informative by plugging in profile names and proportions. I’m also going to save this plot as an object so that we can do something really cool with it!

p <- means %>% mutate(Profile = recode(Profile, X1 = "Science: 16%", X2 = "Disinterest: 60%", X3 = "Arts & Humanities: 24%")) %>% ggplot(aes(Interest, Mean, group = Profile, color = Profile)) + geom_point(size = 2.25) + geom_line(size = 1.25) + scale_x_discrete(limits = c("Active sport", "Adrenaline sports", "Passive sport", "Countryside, outdoors", "Gardening", "Cars", "Art exhibitions", "Dancing", "Musical instruments", "Theatre", "Writing", "Reading", "Geography", "History", "Law", "Politics", "Psychology", "Religion", "Foreign languages", "Biology", "Chemistry", "Mathematics", "Medicine", "Physics", "Science and technology", "Internet", "PC", "Celebrities", "Economy Management", "Fun with friends", "Shopping", "Pets")) + labs(x = NULL, y = "Standardized mean interest") + theme_bw(base_size = 14) + theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "top") p
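The percentages plugged into the profile labels come straight from the clustering table in summary(mod1) — 137, 527, and 210 cases out of 874:

```r
# profile sizes from the clustering table in summary(mod1)
counts <- c(Science = 137, Disinterest = 527, `Arts & Humanities` = 210)
round(100 * counts / sum(counts))   # 16, 60, 24 -- the percentages in the labels
```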

The something really cool that I want to do is make an interactive plot. Why would I want to do this? Well, one of the problems with the static plot is that with so many indicators it’s tough to read the values for each indicator. An interactive plot lets the reader narrow in on specific indicators or profiles of interest. We’ll use plotly to turn our static plot into an interactive one.

library(plotly) ggplotly(p, tooltip = c("Interest", "Mean")) %>% layout(legend = list(orientation = "h", y = 1.2))

(An interactive plotly version of the profile plot is embedded here: hovering over a point shows the Interest and standardized Mean for each profile.)

There’s a quick example of LPA. Overall, I think LPA is a great tool for exploratory analysis, although I question its reproducibility. What’s important is that the statistician considers both fit indices and theory when deciding on the number of profiles.

References & Resources

Bertoletti, M., Friel, N., & Rastelli, R. (2015). Choosing the number of clusters in a finite mixture model using an exact Integrated Completed Likelihood criterion. https://arxiv.org/pdf/1411.4257.pdf.

Nylund, K. L., Asparouhov, T., & Muthén, B. O. (2007). Deciding on the Number of Classes in Latent Class Analysis and Growth Mixture Modeling: A Monte Carlo Simulation Study. Structural Equation Modeling, 14, 535-569.

Scrucca, L., Fop, M., Murphy, T. B., & Raftery, A. E. (2016). mclust5: Clustering, Classification and Density Estimation Using Gaussian Finite Mixture Models. The R Journal, 8, 289-317.


To leave a comment for the author, please follow the link and comment on their blog: R on Will Hipson. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Control Charts Another Package

Sat, 04/20/2019 - 02:00

(This article was first published on Analysis of AFL, and kindly contributed to R-bloggers)

I got an email from Alex Zanidean, who runs the xmrr package

“You might enjoy my package xmrr for similar charts – but mine recalculate the bounds automatically.” And if we go to the vignette: “XMRs combine X-Bar control charts and Moving Range control charts. These functions also will recalculate the reference lines when significant change has occurred.” This seems like a pretty handy thing, so let’s do it.

First, let’s recreate the graphic from our previous post using ggQC.

library(fitzRoy) library(tidyverse) ## ── Attaching packages ───────────────────────────────────────────────────────── tidyverse 1.2.1 ── ## ✔ ggplot2 3.1.1 ✔ purrr 0.3.2 ## ✔ tibble 2.1.1 ✔ dplyr 0.8.0.1 ## ✔ tidyr 0.8.3 ✔ stringr 1.4.0 ## ✔ readr 1.3.1 ✔ forcats 0.4.0 ## ── Conflicts ──────────────────────────────────────────────────────────── tidyverse_conflicts() ── ## ✖ dplyr::filter() masks stats::filter() ## ✖ dplyr::lag() masks stats::lag() library(ggQC) library(xmrr) fitzRoy::match_results%>% mutate(total=Home.Points+Away.Points)%>% group_by(Season,Round)%>% summarise(meantotal=mean(total))%>% filter(Season>1989 & Round=="R1")%>% ggplot(aes(x=Season,y=meantotal))+geom_point()+ geom_line()+stat_QC(method="XmR")+ ylab("Mean Round 1 Total for Each Game") +ggtitle("Stop Freaking OUT over ONE ROUND")

df<-fitzRoy::match_results%>% mutate(total=Home.Points+Away.Points)%>% group_by(Season,Round)%>% summarise(meantotal=mean(total))%>% filter(Season>1989 & Round=="R1")

When using a package for the first time, one of the best things about the R community is that the examples are usually fully reproducible, and that helps here.

From the GitHub page:

Year <- seq(2001, 2009, 1) Measure <- runif(length(Year)) df <- data.frame(Year, Measure) head(df) ## Year Measure ## 1 2001 0.6146880 ## 2 2002 0.2854914 ## 3 2003 0.6081190 ## 4 2004 0.4357665 ## 5 2005 0.1509844 ## 6 2006 0.5935707 xmr(df, "Measure", recalc = T) ## Year Measure Order Central Line Moving Range Average Moving Range ## 1 2001 0.6146880 1 0.419 NA NA ## 2 2002 0.2854914 2 0.419 0.329 0.277 ## 3 2003 0.6081190 3 0.419 0.323 0.277 ## 4 2004 0.4357665 4 0.419 0.172 0.277 ## 5 2005 0.1509844 5 0.419 0.285 0.277 ## 6 2006 0.5935707 6 0.419 0.443 0.277 ## 7 2007 0.5739720 7 0.419 0.020 0.277 ## 8 2008 0.9961513 8 0.419 0.422 0.277 ## 9 2009 0.9091553 9 0.419 0.087 0.277 ## Lower Natural Process Limit Upper Natural Process Limit ## 1 NA NA ## 2 0 1.156 ## 3 0 1.156 ## 4 0 1.156 ## 5 0 1.156 ## 6 0 1.156 ## 7 0 1.156 ## 8 0 1.156 ## 9 0 1.156
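The reference lines in that output can be reproduced by hand: the central line is the mean of the measures, the average moving range is the mean absolute difference between consecutive points, and the natural process limits sit 2.66 average moving ranges (the standard XmR constant) either side of the central line, with the lower limit floored at zero here. Using the rounded values from the output above:

```r
# XmR natural process limits from the rounded values in the output above
central <- 0.419   # central line: mean of the Measure column
avg_mr  <- 0.277   # average moving range: mean(abs(diff(Measure)))

upper <- central + 2.66 * avg_mr         # upper natural process limit
lower <- max(0, central - 2.66 * avg_mr) # would be negative, so floored at 0

round(upper, 3)   # 1.156, matching the Upper Natural Process Limit column
```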

Let’s create a data frame similar to df, but using data from fitzRoy.

df<-fitzRoy::match_results%>% mutate(total=Home.Points+Away.Points)%>% group_by(Season,Round)%>% summarise(meantotal=mean(total))%>% filter(Season>1989 & Round=="R1")%>% select(Season, meantotal) df<-data.frame(df) xmr_data <-xmr(df, "meantotal", recalc = T) xmr_chart(df = xmr_data, time = "Season", measure = "meantotal", line_width = 0.75, text_size = 12, point_size = 2.5)

Does this tell a different story or a very similar one to earlier?


To leave a comment for the author, please follow the link and comment on their blog: Analysis of AFL.

### Happy EasteR! Let’s find some eggs…

Sat, 04/20/2019 - 00:00

(This article was first published on Modelplot[R/py] introduction, and kindly contributed to R-bloggers)

It’s Easter Time! Let’s find some eggs…

Hi there! Yes, it’s the most Easterful time of the year again. For some of us a sacred time, for others mainly an egg-eating period, and some just enjoy the extra day of spare time. In case you have some time available for some good egg-searching business, but no one seems willing to hide them for you this year, here’s an alteRnative: hide them yourself and have a nice easteR holiday! As you go, you also get to know the very nice image processing package magick.

Get some easter scenes, collect some eggs…

Before we can start hiding eggs, we need some scenes…

…and – obviously – we need some eggs to hide!

Hiding Eggs function

Excellent, we have some scenes and some eggs, let’s create a function to make hiding our eggs easy. We’ll take advantage of some other functions from the great image processing package magick.

library(magick)
library(dplyr)

hideEggs <- function(sceneUrl, nEggs, eggPositions, eggSize = 0.04){
  # read scene
  easterScene <- image_read(sceneUrl)
  # resize picture to 800 width (keep aspect ratio)
  easterScene <- image_scale(easterScene, "800x")
  sceneWidth <- as.numeric(image_info(easterScene)['width'])
  sceneHeight <- as.numeric(image_info(easterScene)['height'])
  # collect some eggs (sample with replacement in the basket ;))
  eggUrls <- sample(easterEggs, nEggs, replace = TRUE)
  easterSceneEggs <- easterScene
  for (eggn in 1:nEggs){
    eggUrl <- eggUrls[eggn]
    # get egg, resize, rotate and put on canvas!
    egg <- image_read(eggUrl)
    # resize egg to a fraction (eggSize) of canvas width (keep aspect ratio)
    egg <- image_scale(egg, paste0("x", round(eggSize * sceneWidth)))
    # remove background
    egg <- image_background(egg, "none")
    # rotate egg between -90 and 90 degrees
    eggRotation <- runif(1, -90, 90)
    egg <- image_rotate(egg, eggRotation)
    # set x and y position (as specified in list, or else random)
    if (!missing(eggPositions)){
      xpos <- eggPositions[[eggn]][1] * sceneWidth
      ypos <- eggPositions[[eggn]][2] * sceneHeight
    } else {
      xpos <- runif(1, 0, sceneWidth)
      ypos <- runif(1, 0, sceneHeight)
    }
    # add egg to canvas
    easterSceneEggs <- image_composite(easterSceneEggs, egg, offset = paste0("+", xpos, "+", ypos))
  }
  return(easterSceneEggs)
}
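The key call in the loop is image_composite() with a geometry offset: "+x+y" places the overlay’s top-left corner x pixels from the left edge and y pixels from the top of the scene. A minimal sketch of just that mechanic (hypothetical 200x100 canvas; assumes the magick package is installed):

```r
library(magick)

canvas <- image_blank(width = 200, height = 100, color = "white")
dot    <- image_blank(width = 10, height = 10, color = "red")

# place the red square 150px from the left, 40px from the top
out <- image_composite(canvas, dot, offset = "+150+40")

image_info(out)$width   # still 200: compositing never resizes the base image
```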

Yeah, we’ve hidden 5 eggs near these happy easter bunnies. Can you spot them in a few seconds??

# let's hide 5 eggs among our easter bunnies hideEggs(sceneUrl = easterScenes[2],nEggs = 5)

These guys also seem to enjoy our little game. Finally, let’s make a somewhat more challenging version that you can share with your relatives to wish them a nice Easter holiday.

# think where to hide... eggPositions = list(c(0.1,0.95),c(0.03,0.6),c(0.4,0.65),c(0.5,0.67),c(0.465,0.31), c(0.6,0.4),c(0.7,0.7),c(0.6,0.66),c(0.8,0.94),c(0.97,0.71)) #... and hide! we'll use smaller eggs to make it a bit more challenging. easterCard <- hideEggs(sceneUrl = easterScenes[1],nEggs = length(eggPositions), eggPositions = eggPositions, eggSize = 0.02) # let's add some wishes to our easter card... easterCard %>% image_annotate("Happy Easter!!!", size = 36, color = "yellow", location = "+270+16") %>% image_annotate("Can you spot the 10 hidden eggs?", size = 18, color = "white", location = "+250+60")

In case your audience can’t figure out where you hid the eggs, magick lets you animate your pictures, so:

# specify easterScene with and without the eggs and specify number of frames frames <- image_morph(c(easterCard, image_scale(image_read(easterScenes[1]),"800x")), frames = 10) image_animate(frames)

For much more you can do with magick, see the vignette.

That’s it, have yourself a merry little easter egg!


To leave a comment for the author, please follow the link and comment on their blog: Modelplot[R/py] introduction.