R-bloggers
R news and tutorials contributed by (750) R bloggers

Examining Inter-Rater Reliability in a Reality Baking Show

6 hours 34 min ago

(This article was first published on Mattan S. Ben-Shachar, and kindly contributed to R-bloggers)

Game of Chefs is an Israeli reality cooking competition show, where chefs compete for the title of “Israel’s most talented chef”. The show has four stages: blind auditions, training camp, kitchen battles, and finals. Here I will examine the 4th season’s (Game of Chefs: Confectioner) inter-rater reliability in the blind auditions, using some simple R code.

The Format

In the blind auditions, candidates have 90 minutes to prepare a dessert, confection or other baked good, which is then sent to the show’s four judges for a blind taste test. The judges don’t see the candidate, and know nothing about him or her, until after they deliver their verdict: a “pass” is signified by the judge granting the candidate a kitchen knife; a “fail” is signified by the judge withholding the knife. A candidate who receives a knife from at least three of the judges continues on to the next stage, the training camp.

The 4 judges and host, from left to right: Yossi Shitrit, Erez Komarovsky, Miri Bohadana (host), Assaf Granit, and Moshik Roth

Inter-Rater Reliability

I’ve watched all 4 episodes of the blind auditions (for purely academic reasons!), for a total of 31 candidates. For each dish, I recorded each of the four judges’ verdict (fail / pass).

Let’s load the data.

df <- read.csv('https://github.com/mattansb/blogCode/raw/master/2018_10_18%20Inter-Rater%20Reliability/IJR.csv')
head(df)

## Moshik Assaf Erez Yossi
## 1 1 1 1 1
## 2 1 0 0 0
## 3 1 1 1 0
## 4 1 0 0 0
## 5 0 1 1 1
## 6 1 1 1 1

We will need the following packages:

library(dplyr) # for manipulating the data
library(psych) # for computing Cohen's Kappa
library(corrr) # for manipulating matrices

We can now use the psych package to compute Cohen’s Kappa coefficient for inter-rater agreement for categorical items, such as the fail/pass categories we have here. $\kappa$ ranges from -1 (full disagreement) through 0 (no pattern of agreement) to +1 (full agreement). Normally, $\kappa$ is computed between two raters; for the case of more than two raters, the mean $\kappa$ across all pair-wise raters is used.
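For intuition, Cohen’s $\kappa$ for two raters can also be computed by hand from their cross-tabulation as $\kappa = (p_o - p_e)/(1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ is the agreement expected by chance. A minimal sketch (not part of the original analysis; the psych call below does all of this for us):

kappa2_by_hand <- function(r1, r2) {
  tab <- table(r1, r2)                           # 2x2 agreement table
  n   <- sum(tab)
  p_o <- sum(diag(tab)) / n                      # observed agreement
  p_e <- sum(rowSums(tab) * colSums(tab)) / n^2  # chance agreement from the margins
  (p_o - p_e) / (1 - p_e)
}

# e.g., kappa2_by_hand(df$Moshik, df$Assaf) should be close to the pair-wise value reported below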

cohen.kappa(df)

## Cohen Kappa (below the diagonal) and Weighted Kappa (above the diagonal)
## For confidence intervals and detail print with all=TRUE
## Moshik Assaf Erez Yossi
## Moshik 1.00 0.34 0.17 0.29
## Assaf 0.34 1.00 0.17 0.68
## Erez 0.17 0.17 1.00 -0.16
## Yossi 0.29 0.68 -0.16 1.00
##
## Average Cohen kappa for all raters 0.25
## Average weighted kappa for all raters 0.25

We can see that overall $\kappa = 0.25$ – surprisingly low for what might be expected from a group of pristine, accomplished, professional chefs and bakers in a blind taste test.

When examining the pair-wise coefficients, we can also see that Erez seems to be in lowest agreement with each of the other judges (and even in a slight disagreement with Yossi!). This might be because Erez is new on the show (this is his first season as judge), but it might also be because of the four judges, he is the only one who is actually a confectioner (the other 3 are restaurant chefs).

For curiosity’s sake, let’s also look at the $\kappa$ coefficient between each judge’s rating and the total fail/pass decision, based on whether a dish got a “pass” from at least 3 judges.

df <- mutate(df,
             PASS = as.numeric(rowSums(df) >= 3))
head(df)

## Moshik Assaf Erez Yossi PASS
## 1 1 1 1 1 1
## 2 1 0 0 0 0
## 3 1 1 1 0 1
## 4 1 0 0 0 0
## 5 0 1 1 1 1
## 6 1 1 1 1 1

We can now use the wonderful new corrr package, which is intended for exploring correlations, but can also generally be used to manipulate any symmetric matrix in a tidy fashion.

cohen.kappa(df)$cohen.kappa %>%
as_cordf() %>%
focus(PASS)

## # A tibble: 4 x 2
## rowname PASS
##
## 1 Moshik 0.619
## 2 Assaf 0.746
## 3 Erez 0.159
## 4 Yossi 0.676

Perhaps unsurprisingly (to people familiar with previous seasons of the show), it seems that Assaf’s judgment of a dish is a good indicator of whether or not a candidate will continue on to the next stage. Also, once again we see that Erez is barely aligned with the other judges’ total decision.

Conclusion

Every man to his taste…

Even among the experts there is little agreement on what is fine cuisine and what is not worth a doggy bag. Having said that, if you still have your heart set on competing in Game of Chefs, it seems that you should at least appeal to Assaf’s palate.

Bon Appétit!


To leave a comment for the author, please follow the link and comment on their blog: Mattan S. Ben-Shachar.

Analyzing English Team of the Year Data Since 1973

13 hours 30 min ago

(This article was first published on World Soccer Analytics, and kindly contributed to R-bloggers)

The Professional Footballers’ Association (PFA) Team of the Year is released in England at the end of each season, picking the 11 most influential players in each of Britain’s leagues.

The Team of the Year award was launched in the 1973-1974 season, meaning there were 44 years’ worth of data to web scrape. Using Wikipedia’s PFA Team of the Year pages (filtered by decade) and the rvest package, I was left with a dataframe of 484 soccer players (44 years * 11 players per year).

Here are some visualizations I thought were cool:

  • Liverpool, Arsenal, and (most significantly) Manchester United are represented more than other clubs. United need just three more players to hit the 100 mark.

  • The PFA Team of the Year formation has changed over time from 4-3-3 to 4-4-2. The graph below shows the number of forwards and midfielders who featured in the ranking each year. Up until the late 1980s, the formation included three midfielders and three forwards (the Team of the Year has always had four defenders and one goalkeeper). In the 1990s, both formations were used but since the 2000s, the use of two strikers has been the status quo.

  • Additionally, I thought it was interesting to see the top individual players in each position. Peter Shilton remains the only player to hit 10 First Division PFA Team of the Year Awards.

 

  • The more interesting analysis involved merging Team of the Year data with historical league position. We can now ask questions along the lines of: did the majority of players in the Team of the Year come from that year’s champions or from other teams? Looking at the specific seasons with the most representation in the award, only 14 times did a team have 5+ players on the list; the majority of these teams (12 out of 14) were the champions; only Manchester United in 1998 and Arsenal in 2003 (both runners-up) feature on this list.

 

  • And here are the champions of England with the fewest players in the Team of the Year. There are many from the 1970s and 1980s, which hints at a change in how the Team of the Year has been selected over the years (it may have been purposefully less top-heavy in the past, whereas today the English champions often feature more prominently on the list). We can see that, more recently, Manchester City in 2014 and Chelsea in 2010 were on the lower end in terms of representation.

 

  • While we’re at it, which teams that didn’t finish 1st had the most players represented? It’s clear that there were some pretty high quality teams that didn’t win the league these years (runners-up are in red, third place teams in green, and any from 4th to 20th in grey).

I think that’s enough visualizations for today, but there’s definitely a lot more we can analyze with this data. Let me know if you have any questions or feedback.

R Code Snapshot (full code can be found on Github):

Step 1 – Web Scraping:

# Packages used in this snippet: rvest (scraping), dplyr (binding rows), stringr (text cleanup)
library(rvest)
library(dplyr)
library(stringr)

# initialize data.frame
df <- data.frame(`Pos.` = as.character(),
                 Player = as.character(),
                 Club   = as.character(),
                 `App.` = as.double(),
                 year   = as.character())

div_table_numbers <- c(1, 5, 9, 13, 17, 21, 25, 29, 33, 37)

urls <- c("https://en.wikipedia.org/wiki/PFA_Team_of_the_Year_(2000s)",
          "https://en.wikipedia.org/wiki/PFA_Team_of_the_Year_(1990s)",
          "https://en.wikipedia.org/wiki/PFA_Team_of_the_Year_(1980s)")

for (i in 1:length(urls)) {
  for (j in 1:length(div_table_numbers)) {
    # pull the j-th Team of the Year table from the page
    xpath_base <- '//*[@id="mw-content-text"]/div/table['
    new_data <- urls[i] %>%
      html() %>%
      html_nodes(xpath = paste0(xpath_base, div_table_numbers[j], "]")) %>%
      html_table()
    new_data <- new_data[[1]]

    # pull the matching season heading and strip the "[edit]" suffix
    year_xpath_base <- '//*[@id="mw-content-text"]/div/h3['
    year <- urls[i] %>%
      html() %>%
      html_nodes(xpath = paste0(year_xpath_base, j, "]")) %>%
      html_text()
    year <- year %>% str_remove_all(fixed("[edit]"))

    new_data$year <- year
    df <- bind_rows(df, new_data)
  }
}

Step 2: Exploratory data analysis and visualizations

# Plotting below assumes ggplot2, ggthemes (theme_few) and forcats (fct_relevel) are loaded

# Club representation: count players per club and plot the top 15
club_count <- df %>% count(Club, sort = TRUE)

club_count %>%
  top_n(n = 15) %>%
  ggplot(aes(x = reorder(Club, n), y = n, fill = Club)) +
  geom_col() +
  coord_flip() +
  theme_few() +
  guides(fill = FALSE) +
  labs(y = "# of Players", x = "Club",
       title = "Number of Players in PFA Team of the Year (1973 - 2017)",
       caption = "Data Source: Wikipedia") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_manual(values = c("Manchester United" = "darkred", "Liverpool" = "orange",
                               "Arsenal" = "yellow", "Chelsea" = "blue",
                               "Blackburn Rovers" = "lightblue", "Leeds United" = "grey",
                               "Manchester City" = "dark green", "Derby County" = "black",
                               "Everton" = "gold", "Nottingham Forest" = "red",
                               "Ipswich Town" = "darkblue", "Southampton" = "orange",
                               "Newcastle United" = "darkgrey", "Aston Villa" = "purple",
                               "Tottenham Hotspur" = "navy"))

# Derive the season end year and order the positions
df <- df %>% mutate(short_year = str_sub(year, 1, 4) %>% as.numeric() + 1)
order_positions <- c("GK", "DF", "MF", "FW")
df <- df %>% mutate(Pos. = fct_relevel(Pos., order_positions))

# Formation change: count midfielders and forwards per season
count_position <- df %>%
  filter(Pos. %in% c("MF", "FW")) %>%
  count(Pos., short_year, sort = TRUE)

count_position %>%
  ggplot(aes(x = short_year, y = n, color = Pos., group = Pos.)) +
  geom_point(position = position_jitter(h = 0.005)) +
  geom_smooth(method = "loess") +
  scale_x_continuous(breaks = seq(1973, 2017, 5)) +
  scale_y_continuous(breaks = seq(2, 4, 1)) +
  labs(x = "Year", y = "Number of Players per Position",
       title = "Count of Midfielders and Forwards in English PFA Team of the Year (1973 - 2017)",
       caption = "Data Source: Wikipedia") +
  theme_few() +
  theme(plot.title = element_text(hjust = 0.5))

Step 3: More web scraping and merging datasets

library(reshape2) # assumed source of melt() used below

first_div_top_three_url <- "https://en.wikipedia.org/wiki/List_of_English_football_champions"
first_div_top_three_xpath <- '//*[@id="mw-content-text"]/div/table[2]'
first_div_top_three <- first_div_top_three_url %>%
  html() %>%
  html_nodes(xpath = first_div_top_three_xpath) %>%
  html_table()
first_div_top_three <- first_div_top_three[[1]]

# drop the seasons lost to the two World Wars
first_div_top_three <- first_div_top_three %>%
  filter(!(Year %in% c("1915/16–1918/19", "1939/40–1945/46")))
first_div_top_three$Goals <- first_div_top_three$Goals %>% as.numeric()
first_div_top_three <- first_div_top_three %>%
  rename(`Champions` = `Champions(number of titles)`,
         `Top goalscorer` = `Leading goalscorer`)

epl_top_three_url <- "https://en.wikipedia.org/wiki/List_of_English_football_champions"
epl_top_three_xpath <- '//*[@id="mw-content-text"]/div/table[3]'
epl_top_three <- epl_top_three_url %>%
  html() %>%
  html_nodes(xpath = epl_top_three_xpath) %>%
  html_table()
epl_top_three <- epl_top_three[[1]]
epl_top_three <- epl_top_three %>%
  rename(`Champions` = `Champions (number of titles)`)

english_top_three_total <- bind_rows(first_div_top_three, epl_top_three)
english_top_three_total$Champions <- english_top_three_total$Champions %>%
  str_remove_all(regex("\\([^)]*\\)"))
english_top_three_total$Champions <- english_top_three_total$Champions %>%
  str_remove_all(regex("\\[.*?\\]"))
english_top_three_total <- english_top_three_total %>%
  mutate(short_year = str_sub(Year, 1, 4) %>% as.numeric() + 1)
english_top_three_total <- english_top_three_total %>%
  filter(short_year > 1973) %>%
  select(-c(`Top goalscorer`, Goals))

english_top_three_total_melted <- english_top_three_total %>%
  melt(id.vars = c("Year", "short_year"), value.name = "Club", variable.name = "Team_Ranking")
english_top_three_total_melted$Club <- english_top_three_total_melted$Club %>%
  str_trim(side = c("right"))

df_merged <- df %>%
  left_join(english_top_three_total_melted, by = c("Club", "short_year"))

Step 4: More data visualizations with merged dataset

# Teams with the most representation in a single season
club_count_year <- df_merged %>%
  count(Club, year, Team_Ranking, sort = TRUE) %>%
  mutate(club_year = paste(Club, year))

club_count_year %>%
  top_n(n = 10, wt = n) %>%
  ggplot(aes(x = reorder(club_year, n), y = n)) +
  geom_col(aes(fill = factor(ifelse(Team_Ranking == "Champions", 1, 2)))) +
  coord_flip() +
  theme_few() +
  guides(fill = FALSE) +
  labs(y = "# of Players", x = "Club",
       title = "Teams with Most Representation in PFA Team of the Year (1973 - 2017)",
       caption = "Data Source: Wikipedia") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_solarized()

# English champions with the fewest players in the Team of the Year
club_count_year_champions <- df_merged %>%
  count(Club, year, Team_Ranking, sort = TRUE) %>%
  mutate(club_year = paste(Club, year)) %>%
  filter(Team_Ranking == "Champions")

club_count_year_champions %>%
  top_n(n = -10, wt = n) %>%
  ggplot(aes(x = reorder(club_year, n), y = n)) +
  geom_col(aes(fill = Club)) +
  coord_flip() +
  theme_few() +
  guides(fill = FALSE) +
  labs(y = "# of Players", x = "Club",
       title = "English Champions with Fewest Players in PFA Team of the Year (1973 - 2017)",
       caption = "Data Source: Wikipedia") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_y_continuous(breaks = seq(0, 2, 1)) +
  scale_fill_manual(values = c("Manchester United" = "darkred", "Liverpool" = "orange",
                               "Arsenal" = "yellow", "Chelsea" = "blue",
                               "Blackburn Rovers" = "lightblue", "Leeds United" = "grey",
                               "Manchester City" = "dark green", "Derby County" = "black",
                               "Everton" = "gold", "Nottingham Forest" = "red"))

To leave a comment for the author, please follow the link and comment on their blog: World Soccer Analytics.

Building a data warehouse

Wed, 10/17/2018 - 11:51

(This article was first published on Claude Seidman – The Data Guy, and kindly contributed to R-bloggers)

Are you looking for some more information on building a data warehouse? The Intellixus website is the place you need to be today. A dedicated individual called Claude set up this blog in order to share some of the valuable knowledge he has learned over the years with his readers. He has a wealth of experience, having worked in this industry since 1996, initially with SQL Server version 4.2. He is always happy to help and wanted to share his knowledge with more people, whether or not he has had a chance to meet them.

 

When building a data warehouse, academics and researchers have been practicing statistical and Machine Learning techniques like regression analysis, linear programming, supervised and unsupervised learning for a long time. R and Python are top class tools for Machine Learning and while these languages come with clever and convenient data manipulation tools, it would be a mistake to think that they can be a replacement for platforms that specialise in data management.

 

Claude has recently been exploring all the things one can do to enhance the usefulness of data warehouses, in particular with R and Microsoft’s R Server line of products. Over the years, he has had many friends and ex-colleagues call him every so often because they’re stuck on a problem or they want some general advice on a technical issue.

 

If you need some assistance when building a data warehouse, the Intellixus website is the place you need to visit today. However, if you would like to have a chat with Claude directly and have some unanswered questions, leave a comment on the relevant blog post and I’m sure he will be more than happy to help you.

The post Building a data warehouse appeared first on Claude Seidman – The Data Guy.


To leave a comment for the author, please follow the link and comment on their blog: Claude Seidman – The Data Guy.

SatRday talks recordings

Wed, 10/17/2018 - 08:26

(This article was first published on R – Longhow Lam's Blog, and kindly contributed to R-bloggers)

A couple of weeks ago, on the first of September, we had satRday in Amsterdam (The Netherlands), a fantastic event hosted by GoDataDriven. Now the great talks, including my 10-minute lightning talk on text2vec, are online.

My talk

The satRday channel

Cheers, Longhow

 

 

 


To leave a comment for the author, please follow the link and comment on their blog: R – Longhow Lam's Blog.

Estimating Control Chart Constants with R

Wed, 10/17/2018 - 04:33

(This article was first published on R – R-BAR, and kindly contributed to R-bloggers)

In this post, I will show you how some very basic R code can be used to estimate the quality control constants needed to construct X-Individuals, X-Bar, and R-Bar charts. The value of this approach is that it gives you a mechanical sense of where these constants come from and some reinforcement of their application.

If you work in a production or quality control environment, chances are you've made or seen a control chart. If you're new to control charting or need a refresher, check out Understanding Statistical Process Control by Wheeler et al. If you want to dive in and start making control charts with R, check out these R packages:

  • ggQC: ggplot based QC charting
  • qcc: base graphics QC charting

If you're familiar with control charts, you've likely encountered cryptic alpha-numeric constants like d2, A2, E2, d3, D3, and asked,

“What are they and where do they come from?”

Short (not so satisfying) answer: They are constants you plug into formulas to determine your control limits. Their value depends on how many samplings you do at a time and the type of chart you are making. For example, if you measure the size of 5 widgets per lot of 50, then your subgroup size, n, is 5 and you should be using a set of control chart constants for n = 5.
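For a preview of how the constants get used once you have them (a sketch, not from the post; `subgroups`, `A2`, `D3`, and `D4` are placeholders for your own measurements and a constants table for n = 5):

```r
xbar    <- rowMeans(subgroups)                              # subgroup means
R_i     <- apply(subgroups, 1, function(x) diff(range(x)))  # subgroup ranges
R_bar   <- mean(R_i)                                        # mean range
xbarbar <- mean(xbar)                                       # grand mean

c(LCL = xbarbar - A2 * R_bar, UCL = xbarbar + A2 * R_bar)   # X-Bar chart limits
c(LCL = D3 * R_bar,           UCL = D4 * R_bar)             # R chart limits
```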

So where do they come from and how are they calculated?

Read on.

X-Bar and X-Individuals Constants

Often, control charts represent variability in terms of the mean range, R, observed over several subgroups rather than the mean standard deviation. The table below should make the idea of subgroup range and mean range clearer.

Why range? My guess is that, historically, employees at all levels would have understood the concept of range: it requires no special computation, just (max-min). Speculation aside, we begin our quest to understand where control constants come from with the relationship shown in Eq. 1, which says that the mean subgroup range is proportional to the standard deviation of the individual values.

The proportionality constant between the mean subgroup range, R, and the standard deviation of the individual values, S, is d2, the first constant we'll be estimating. The relationship, expressed in Eq. 2, is R = d2 * S.

To estimate d2 for n = 2 (i.e., a subgroup size of 2), we start by drawing two samples from a normal distribution with mean = 0 and sd = 1. Why, you ask? Because it makes the math really simple: by Eq. 2, if S = 1 then R = d2.

So all we need to do to determine d2 when n = 2 is:

  1. Draw 2 individuals from the normal distribution,
  2. Determine the range of the 2 samples.
  3. Repeat many, many times
  4. Take the average of all the ranges you’ve calculated, R
  5. R = d2 (when the mean = 0 and sd = 1)

The R code for the process is shown below.

```r
require(magrittr) # Bring in the pipe function

reps <- 1E6
set.seed(5555)

replicate(reps,
          rnorm(n = 2, mean = 0, sd = 1) %>% # Draw two from the normal distribution
            range() %>%                      # Determine the range vector = (max, min)
            diff() %>%                       # Determine the difference of the range vector
            abs()                            # Take the absolute value so the result is positive
) %>%                                        # Replicate the above procedure 1,000,000 times
  mean() -> R_BAR -> d2                      # Take the mean of the 1,000,000 ranges

d2
```

```
## [1] 1.12804
```

The pipes make the above code easy to read but slow things down quite a bit. The following code does the same thing about 12.5 times faster.

```r
reps <- 1E6
set.seed(5555)
d2 <- R_BAR <- mean(replicate(reps, abs(diff(range(rnorm(2))))))
d2
```

```
## [1] 1.12804
```

Once you have d2, calculating E2 (3σ for the individuals) and A2 (3σ for the sub-group means) is straightforward, as shown in Eq. 3 – Eq. 6. A2 and E2 are the coefficients to the left of R.
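In symbols, consistent with the code below: E2 = 3/d2 and A2 = 3/(d2*sqrt(n)), so the X-Individuals limits are the mean of the individual values ± E2*R and the X-Bar limits are the grand mean ± A2*R.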

The code below gives the expected results for all the control constants needed to construct X-Bar and X-Individual charts.

```r
c(N  = 2,
  d2 = d2,
  E2 = 3/d2,
  A2 = 3/(d2*sqrt(2)))
```

```
##        N       d2       E2       A2 
## 2.000000 1.128040 2.659480 1.880536
```

R-Bar Constants

The constants for R charts are d3 (1σ around R), D3 (the lower 3σ limit of R), and D4 (the upper 3σ limit of R). To get these constants, we start with the assumption that the standard deviation of R is proportional to the standard deviation of the individual X's. The proportionality constant is d3, shown in Eq. 7. Notice that Eq. 7 has the same form as Eq. 2.

```r
reps <- 1E6
set.seed(5555)
d3 <- sd(replicate(reps, abs(diff(range(rnorm(2))))))
d3
```

```
## [1] 0.8529419
```

Notice in the R code above that the only difference between the calculation of d3 and d2 is that we take the standard deviation of the subgroup ranges rather than their mean. Now we have d3, but we need a little simple algebra to express the standard deviation of the subgroup ranges in terms of R. Remember, historically the employee doesn't need to worry about standard deviation – just ranges. We can define the above expression in terms of R by combining Eq. 2 and Eq. 7, yielding Eq. 8.

OK almost to D3 and D4. The lower 3σ limit of R can be expressed as Eq.8:

Factoring out the R terms on the right-hand side of the expression yields

The expression inside the parentheses is D3. D4, the upper limit of R, is evaluated analogously; the only difference is a “+” sign in the final expression. Final expressions for D3 and D4 are:
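D3 = 1 - 3*d3/d2 (truncated to 0 when negative) and D4 = 1 + 3*d3/d2, so the R chart runs from D3*R (lower limit) to D4*R (upper limit), matching the R code below.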

All done! Here is the R code summarizing the constants for R using n=2.

```r
c(N  = 2,
  d3 = d3,
  D3 = ifelse(1 - 3*d3/d2 < 0, 0, 1 - 3*d3/d2),
  D4 = 1 + 3*d3/d2
)
```

```
##         N        d3        D3        D4 
## 2.0000000 0.8529419 0.0000000 3.2683817
```

Notice that for D3 the value is 0; this is because the calculated value was negative, and such values are rounded up to zero per the R code above.

Summary

In this post, we used R to estimate the control chart constants needed to produce X-Individuals, X-Bar, and R-Bar charts. All the constants together are shown below. In addition, the constants for n = 7 have also been presented.

```r
reps <- 1E6
set.seed(5555)
FUN_d2 <- function(x) {mean(replicate(reps, abs(diff(range(rnorm(x))))))}
FUN_d3 <- function(x) {sd(replicate(reps, abs(diff(range(rnorm(x))))))}

Ns <- c(2,7)
d2 <- sapply(Ns, FUN_d2)
d3 <- sapply(Ns, FUN_d3)

round(data.frame(
  N  = Ns,
  d2 = d2,
  E2 = 3/d2,
  A2 = 3/(d2*sqrt(Ns)),
  d3 = d3,
  D3 = ifelse(1 - 3*d3/d2 < 0, 0, 1 - 3*d3/d2),
  D4 = 1 + 3*d3/d2
), digits = 3)
```

```
##   N    d2    E2    A2    d3    D3    D4
## 1 2 1.128 2.659 1.881 0.854 0.000 3.272
## 2 7 2.704 1.109 0.419 0.833 0.076 1.924
```

To leave a comment for the author, please follow the link and comment on their blog: R – R-BAR.

Use R with Excel: Importing and Exporting Data

Wed, 10/17/2018 - 03:11

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)


    In this article, you learn how to connect R with Excel by importing and exporting data between the two programs.

    Excel: pros and cons

Excel is still very popular among companies and organizations. One of the main advantages of a spreadsheet is the ease with which it gives the user a rapid overview of a dataset, using visualizations and tables. On the other hand, given the incredible growth of data, the spreadsheet lacks efficient and agile solutions for data management and data cleaning.

    Solution: Use R with Excel

Fortunately, it is possible for the data scientist/analyst to combine the advantages of both Excel and R. This can be done with some value-adding and efficient packages for R: what Excel lacks in data management and data cleaning, R handles excellently and efficiently. Furthermore, it is also possible to create the analytics in R and export the results into Excel for reporting. Basically, you have two great solutions: (1) you can import Excel datasets into R for data management, data cleaning and analytics, and (2) you can export analytical results and clean data from R into Excel for further analytics or presentation.

The sections below present these solutions with coding examples in R. First, we look at how to import Excel data into R:

Import Excel spread sheet in R

# Read R library
library(readxl)
# Import xlsx
MyExcelSheet <- read_excel("/ExcelSheet.xlsx") # read sheet

    Now let us use the above import coding with an Excel sheet. We use this demand data:

Import Excel spread sheet in R with a dataset

# Read R library
library(readxl)
# Import xlsx dataset
demand <- read_excel("~/R work/DataScience+/Make R speak Excel/demand.xls")
head(demand)

    The above coding gives us the following table

head(demand)
# A tibble: 6 x 7
     p1    p2    p3     y    q1    q2    q3
1  129.  76.6  59.6  92.8 128.  140.  129.
2  156.  186.  124.   69.4 133.   23.2  37.7
3  111.  100.   36.6 103.   98.3  59.4 295.
4   55.1 240.   34.8 145.  374    55.7 386.
5  156   196.  134.   80.2  50.3 116.   44.8
6   92.2 150.  129.  110   194   126.   66.6

Export Excel spread sheet in R

    Now it is time to Export Excel sheet from R into Excel:

Export spread sheet from R to Excel

# Read R library
library(readxl)
library(xlsx)
# Export xlsx sheet from R to Excel
write.xlsx(MyExcelSheet, "/ExcelSheet.xlsx") # write sheet

    Let us use the above export coding with the demand excel sheet. First we do some data management in R:

Export spread sheet from R to Excel with a dataset

# Read R library
library(readxl)
library(xlsx)
# Import xlsx dataset
demand <- read_excel("~/R work/DataScience+/Make R speak Excel/demand.xls")
# Data management in R
demand$psum <- demand$p1 + demand$p2 + demand$p3
head(demand)
# Export sheet to Excel
write.xlsx(MyExcelSheet, "/ExcelSheet.xlsx") # write sheet

    The above coding gives us the following table

head(demand)
# A tibble: 6 x 8
     p1    p2    p3     y    q1    q2    q3  psum
1  129.  76.6  59.6  92.8 128.  140.  129.  265.
2  156.  186.  124.   69.4 133.   23.2  37.7 465.
3  111.  100.   36.6 103.   98.3  59.4 295.  247.
4   55.1 240.   34.8 145.  374    55.7 386.  330.
5  156   196.  134.   80.2  50.3 116.   44.8 486.
6   92.2 150.  129.  110   194   126.   66.6 371.

    Happy coding!

    References
    1. Using readxl in R – CRAN.R-project.org
    2. Using xlsx in R – CRAN.R-project.org


To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.

    Cannibus Curve with ggplot2

    Wed, 10/17/2018 - 02:00

    (This article was first published on R on Chi's Impe[r]fect Blog, and kindly contributed to R-bloggers)

Starting today, recreational weed is legal in Canada. This news has somehow led me to find the Cannibus Curve, a mathematical equation to draw Cannibus….!!!

So, to celebrate(?) being the 2nd country in the world (the 1st was Uruguay) to legalize the green stuff for fun, I decided I’d try drawing the cannibus curve with ggplot. Here are the final results.

    Cannibus_Final

Here are the steps I took. Because I couldn’t really understand the mathematical equation, I broke it down step by step to sort of understand what each part of the equation is doing.

library(tidyverse)

cannibus <- tibble(
  t  = seq(-pi, pi, length.out = 1000),
  r1 = (1 + .9 * cos(8 * t)),          ## this will draw 8 petals ## this number determines number of leafs!
  r2 = r1 * (1 + .1 * cos(24 * t)),    ## this makes it pointy
  r3 = r2 * (.9 + 0.5 * cos(200 * t)), ## this makes it jaggy
  r4 = r3 * (1 + sin(t)),              ## Hmm.. I think I want to rotate it 90 degrees...
  r4_alt = r3 * (1 + sin(t - pi/2)),   ## one way to do it...
  r  = (1 + .9 * cos(8 * t)) * (1 + .1 * cos(24 * t)) * (.9 + 0.5 * cos(200 * t)) * (1 + sin(t)) ## Put it all in one line!
)

cannibus %>%
  ggplot(aes(x = t, y = r1)) +
  geom_path(color = "#7ABA71", size = 2) +
  coord_polar() +
  theme_void(base_family = "Roboto Condensed") +
  labs(title = "(1+.9*cos(8*t) draws 8 petals")

cannibus %>%
  ggplot(aes(x = t, y = r2)) +
  geom_path(color = "#7ABA71", size = 2) +
  coord_polar() +
  theme_void(base_family = "Roboto Condensed") +
  labs(title = "(1+.9*cos(8*t) * * (1+.1*cos(24*t)) makes the tip pointy")

cannibus %>%
  ggplot(aes(x = t, y = r3)) +
  geom_path(color = "#7ABA71", size = 0.5) +
  coord_polar() +
  theme_void(base_family = "Roboto Condensed") +
  labs(title = "(1+.9*cos(8*t) * * (1+.1*cos(24*t)) * (.9+0.5*cos(200*t)) makes zaggy")

cannibus %>%
  ggplot(aes(x = t, y = r4)) +
  geom_path(color = "#7ABA71", size = 0.5) +
  coord_polar(start = pi/2) +
  theme_void(base_family = "Roboto Condensed") +
  labs(title = "(1+.9*cos(8*t) * * (1+.1*cos(24*t)) * (.9+0.5*cos(200*t)) * (1+sin(t)) - OK Cool, Now 2 leaves are small!",
       subcaption = "Notice I used start=pi/2 to rotate!")

cannibus %>%
  ggplot(aes(x = t, y = r)) +
  geom_polygon(fill = "#499b4a", color = "#74Ba71", size = 0.1) +
  coord_polar(theta = "x", start = pi/2) +
  theme_void(base_family = "Roboto Condensed") +
  labs(title = "Instead of using geom_path, I used geom_polygon")

    I couldn’t figure out how to “crop” the polar coordinate image, so there’s lots of white space on final image, but I like my little cannibus!


To leave a comment for the author, please follow the link and comment on their blog: R on Chi's Impe[r]fect Blog.

    Announcing RStudio Package Manager

    Wed, 10/17/2018 - 02:00

    (This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

    We’re excited to announce the general availability of our newest RStudio professional product, RStudio Package Manager. RStudio Package Manager helps your team, department, or entire organization centralize and organize R packages.

    Get started with the 45 day evaluation today!

    With more than 13,000 packages in the R ecosystem, managing the packages you and your teams need can be challenging. R users naturally want the latest, but everyone benefits from reproducibility, stability, and security in production code.

    RStudio Package Manager is an on-premises server product that allows R users and IT to work together to create a central repository for R packages. RStudio Package Manager supports your team wherever they run R, from bash scripts and Docker containers to RStudio, RStudio Server (Pro), Shiny Server (Pro), and RStudio Connect.

    Administrators set up the server using a scriptable command line interface (CLI), and R users install packages from the server with their existing tools.
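For example, once the server is running, an R user can point an ordinary session at it and install packages with the usual base R tooling (the URL below is a placeholder, not a real server):

# Hypothetical repository URL -- substitute your organization's RStudio Package Manager address
options(repos = c(RSPM = "https://rspm.example.com/cran/latest"))
install.packages("dplyr") # now resolved from the internal repository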

    We’ve spent the last year working with alpha and beta customers to ensure RStudio Package Manager is ready to meet the needs of your development and production use cases.

    CRAN

    If you’re an R user, you know about CRAN. If you’re someone who helps R users get access to CRAN, you probably know about network exceptions on every production node. With RStudio Package Manager, you can enable your R users to access CRAN without requiring a network exception on every production node. You can also automate CRAN updates on your schedule. You can choose to optimize disk usage and only download the packages you need or, alternatively, download everything up-front for completely offline networks.

    Currently, RStudio Package Manager does not serve binary packages from CRAN – only source packages. This limitation won’t affect server-based users, but may impact desktop users. Future versions of RStudio Package Manager will address this limitation.

    Subsets of CRAN

    We know some projects call for even tighter restrictions. RStudio Package Manager helps admins create approved subsets of CRAN packages, and ensures that those subsets stay stable even as packages are added or updated over time.

    Internal Packages and Packages from GitHub

    Sharing internal R code has never been easier. Administrators can add internal packages using the CLI. If your internal packages live in Git, RStudio Package Manager can automatically track your Git repositories and make commits or tagged releases accessible to users. The same tools make it painless to supplement CRAN with R packages from GitHub.

    Optimized for R

Regardless of your use case, RStudio Package Manager provides a seamless experience optimized for R users. Packages are versioned, automatically keeping older versions accessible to users, to tools like Packrat, and to platforms like RStudio Connect.

    RStudio Package Manager also versions the repositories themselves, preserving the ability to always return the same set of R packages or receive the latest versions.

    RStudio Package Manager records usage statistics. These metrics help administrators conduct audits and give R users an easy way to discover popular and useful packages.

    Download Today

    Get started with the 45-day evaluation or check out our demo server. Read the admin guide for answers to more questions and a guide to installation and setup. Contact Sales for more information.


    To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Slides from my talk at the R-Ladies Meetup about Interpretable Deep Learning with R, Keras and LIME

    Wed, 10/17/2018 - 02:00

    (This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

    During my stay in London for the m3 conference, I also gave a talk at the R-Ladies London Meetup on Tuesday, October 16th, about one of my favorite topics: Interpretable Deep Learning with R, Keras and LIME.

    Keras is a high-level open-source deep learning framework that by default works on top of TensorFlow. Keras is minimalistic, efficient and highly flexible because it works with a modular layer system to define, compile and fit neural networks. It has been written in Python but can also be used from within R. Because the underlying backend can be changed from TensorFlow to Theano and CNTK (with more options being developed right now) it is designed to be framework-independent. Models can be trained on CPU or GPU, locally or in the cloud. Shirin will show an example of how to build an image classifier with Keras. We’ll be using a convolutional neural net to classify fruits in images. But that’s not all! We not only want to judge our black-box model based on accuracy and loss measures – we want to get a better understanding of how the model works. We will use an algorithm called LIME (local interpretable model-agnostic explanations) to find out what part of the different test images contributed most strongly to the classification that was made by our model. Shirin will introduce LIME and explain how it works. And finally, she will show how to apply LIME to the image classifier we built before, as well as to a pre-trained Imagenet model.
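For a flavor of what the Keras R interface described above looks like, here is a minimal sketch of a small convolutional classifier (the image size, layer sizes, and number of classes are illustrative, not the ones from the talk):

library(keras)

# A toy convolutional network for 100x100 RGB images and 10 classes (hypothetical dimensions)
model <- keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(100, 100, 3)) %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten() %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 10, activation = "softmax")

# Compile with a standard loss/optimizer pairing for multi-class classification
model %>% compile(
  loss = "categorical_crossentropy",
  optimizer = "adam",
  metrics = "accuracy"
)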



To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound.

    5 Alternatives to the Default R Outputs for GLMs and Linear Models

    Wed, 10/17/2018 - 01:49

    (This article was first published on R – Displayr, and kindly contributed to R-bloggers)

    The standard summary outputs from the glm and lm summary methods are a case in point. If you have been using R for as long as I have (19 or 20 years…) you will no doubt have a certain affection for them, but to a new user they are both ugly and not optimized to aid interpretation.

    The sad old default summary output from glm and lm

    The default output, shown below, is not terrible. By 1960s standards it is pretty good. If you know where to look and are good with numbers it is serviceable. But it can be bettered.
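The post shows that output as an image; a summary of the same general shape comes from code like the following (the model and data here are purely illustrative, not the ones used in the post):

my.glm <- glm(am ~ hp + wt, data = mtcars, family = binomial())
summary(my.glm) # coefficients, standard errors, z values and p-values in the classic console layout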

    1. An HTML table

    The most basic level of improvement is to make an attractive table, as done by the stargazer package. It improves on the 1960s style standard output by creating an HTML table, but in the style of an academic publication (R code: stargazer::stargazer(my.glm, type = “html”); be careful if copying this as you’ll need to replace the quotation marks with R-friendly ones).

    2. A 21st century table

    The output below uses more modern R technology (HTML widgets). It improves on the previous outputs in two ways:

    • The formattable package is used to create an attractive table which redundantly encodes information by color, bolding, cell shading, and relatively extreme rounding.
    • The table uses variable labels, rather than names. These labels are stored as attributes of the variables in the data frame (e.g., attr(x$MonthlyCharges, “label”) = “Monthly Charges ($)”;  again, be careful if copying this to replace the quotation marks ). stargazer also supports such labeling, although they are passed into the function as arguments rather than attributes of variables.

    This output has been created using the Regression function in our flipRegression package. This is running glm in the background. This is preloaded and available from the menus when you use Displayr, but you can also install the package from github.

    3. Importance scores instead of coefficients

    A more extreme approach is to report importance scores instead of coefficients. For example, the table below uses a modification of Johnson’s relative weights as a way of simultaneously addressing the correlation between the predictors and the dependency of coefficients on the scale of the predictors (Johnson, J.W. (2000). A heuristic method for estimating the relative weight of predictor variables in multiple regression. Multivariate behavioral research 35, 1-19.).

    The modification is that I’ve assigned the signs based on the signs from a standard glm. This has been produced using the same Regression function described in the previous section, but with an additional argument of  output = “Relative Importance Analysis”.

    4. Effects plots

    Alternatively, we can go entirely graphical in our presentation of the model. I’ve created the plots below using the effects package. A few points to note:

    • Of the outputs examined in this post, these are the only ones that both show the effects and the distribution of the predictors. If the goal is to understand the model, these plots are extremely informative.
    • By using a common y-axis it is easy to assess importance. (Although note that the mean probabilities that can be read off these plots are biased, as these plots are created under the assumption that the mean function for the model is linear, which is not the case for the logit model).
    • The graphical presentation of the confidence bands is much more informative than the standard errors in the previous outputs.
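A minimal sketch of how such displays can be produced with the effects package (again using an illustrative fitted glm rather than the model from the post):

library(effects)
plot(allEffects(my.glm)) # one panel per predictor, with confidence bands on a common scale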

    5. Simulator

    The last way of presenting the results is to show a simulator, allowing the user to experiment to gain an understanding of the interplay of all the predictors in predicting the outcome categories. Click the image below to go to an online simulator or click the button below to explore and edit the code. You can find out more about creating simulators in “Building Online Interactive Simulators for Predictive Models in R.”

    Explore and edit this simulator


To leave a comment for the author, please follow the link and comment on their blog: R – Displayr.

    New paper – Inside or outside: quantifying extrapolation across river networks

    Wed, 10/17/2018 - 00:43

    (This article was first published on R – Amy Whitehead's Research, and kindly contributed to R-bloggers)

    It’s been pretty quiet over here for a long time. But I have been busy beavering away at many interesting projects at NIWA, including a project where we developed a new method for identifying when your regression model is starting … Continue reading →


To leave a comment for the author, please follow the link and comment on their blog: R – Amy Whitehead's Research.

    Estimating Pi

    Tue, 10/16/2018 - 20:03

    (This article was first published on R – Insights of a PhD, and kindly contributed to R-bloggers)

I came across this post, which gives a method to estimate Pi by using a circle, its circumscribed square, and (lots of) random points within said square. Booth used Stata to estimate Pi, but here's some R code to do the same thing…

x <- 0.5  # center x
y <- 0.5  # center y
n <- 1000 # nr of pts
r <- 0.5  # radius

pts <- seq(0, 2 * pi, length.out = n)
plot(sin(pts), cos(pts), type = 'l', asp = 1) # test

require(sp)
xy <- cbind(x + r * sin(pts), y + r * cos(pts))
sl <- SpatialPolygons(list(Polygons(list(Polygon(xy)), "polygon")))
plot(sl, add = FALSE, col = 'red', axes = T)

# the square
xy <- cbind(c(0, 1, 1, 0), c(0, 0, 1, 1))
sq <- SpatialPolygons(list(Polygons(list(Polygon(xy)), "polygon")))
plot(sq, add = TRUE)

N <- 1e6
x <- runif(N, 0, 1)
y <- runif(N, 0, 1)
sp <- SpatialPoints(cbind(x, y))
plot(sp, add = TRUE, col = "green")

require(rgeos)
(sim_pi <- (sum(gIntersects(sp, sl, byid = TRUE))/N) * 4)
sim_pi - pi

    Note the use of sp and rgeos packages to calculate the intersections.
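The same estimate can also be sketched in base R without the spatial machinery, since a point falls inside the circle of radius 0.5 centred at (0.5, 0.5) exactly when (x - 0.5)^2 + (y - 0.5)^2 <= 0.25 (this snippet is an alternative, not part of the original post):

N <- 1e6
x <- runif(N); y <- runif(N)
inside <- (x - 0.5)^2 + (y - 0.5)^2 <= 0.25
4 * mean(inside) # approximately pi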


To leave a comment for the author, please follow the link and comment on their blog: R – Insights of a PhD.

    Quasiquotation in R via bquote()

    Tue, 10/16/2018 - 19:16

    (This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

    In August of 2003 Thomas Lumley added bquote() to R 1.8.1. This gave R and R users an explicit Lisp-style quasiquotation capability. bquote() and quasiquotation are actually quite powerful. Professor Thomas Lumley should get, and should continue to receive, a lot of credit and thanks for introducing the concept into R.
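As a minimal illustration (not from the article itself): bquote() quotes an expression while evaluating and substituting anything wrapped in .(), and the resulting language object can then be eval()ed.

x_nm <- as.name("speed")
expr <- bquote(mean(.(x_nm), na.rm = TRUE))
expr
## mean(speed, na.rm = TRUE)
eval(expr, cars) # evaluate the constructed call against the built-in cars data
## [1] 15.4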

    In fact bquote() is already powerful enough to build a version of dplyr 0.5.0 with quasiquotation semantics quite close (from a user perspective) to what is now claimed in tidyeval/rlang.

    Let’s take a look at that.

For our example problem, suppose we want to average a ratio of columns by groups, but want to write the code so that which columns are used is specified elsewhere (perhaps coming in as function arguments). You would write code such as the following in dplyr 0.7.*:

suppressPackageStartupMessages(library("dplyr"))

# define our parameters
# pretend these come from far away
# or as function arguments.
group_nm <- as.name("am")
num_nm <- as.name("hp")
den_nm <- as.name("cyl")
derived_nm <- as.name(paste0(num_nm, "_per_", den_nm))
mean_nm <- as.name(paste0("mean_", derived_nm))
count_nm <- as.name("group_count")

    And then you would execute something like the following. We are not running the following block (so there is no result), as we have dplyr 0.5.0 installed, which does not yet accept the quasiquotation notation.

# apply a parameterized pipeline using rlang::!!
mtcars %>%
  group_by(!!group_nm) %>%
  mutate(!!derived_nm := !!num_nm/!!den_nm) %>%
  summarize(
    !!mean_nm := mean(!!derived_nm),
    !!count_nm := n()
  ) %>%
  ungroup() %>%
  arrange(!!group_nm)

    Writing the above code requires two ideas:

    • quasiquotation (marking some symbols for substitution, in this case using !! as the marker). As we said, quasiquotation has been available in R since 2003.
    • Using := as a substitute for = to allow substitution marks on the left-hand sides of assignment like expressions without triggering syntax errors. := is an unused left-over assignment operator that has low-precedence (like assignments do), but not all the syntactic restrictions of =. This is an idea used in data.table for quite some time, and is actually first visible in dplyr when used as part of dplyr‘s data.table adapter in May 2013).

dplyr 0.5.0 (released June 2016) did not yet incorporate quasiquotation, so we cannot run the above code in dplyr 0.5.0 unless we take the following steps:

    • Replacing the !!x substitution markings with bquote()‘s .(x) notation.
    • Changing the := back to =.
    • Placing the whole pipeline in an eval(bquote()) block.
    • Not directly substituting symbols on the left-hand sides of expressions.

    This looks like the following.

# apply a partially parameterized pipeline using bquote
eval(bquote(
  mtcars %>%
    group_by(.(group_nm)) %>%
    mutate(TMP1 = .(num_nm)/.(den_nm)) %>%
    rename_(.dots = setNames("TMP1", as.character(derived_nm))) %>%
    summarize(
      TMP2 = mean(.(derived_nm)),
      TMP3 = n()
    ) %>%
    ungroup() %>%
    rename_(.dots = setNames("TMP2", as.character(mean_nm))) %>%
    rename_(.dots = setNames("TMP3", as.character(count_nm))) %>%
    arrange(.(group_nm))
))

## # A tibble: 2 x 3
##      am mean_hp_per_cyl group_count
##
## 1     0            22.7          19
## 2     1            23.4          13

    That nearly solved the problem (notice we were not able to control the left-hand sides of assignment-like expressions). In December of 2016 we publicly announced and released the let() adapter that addressed almost all of these concerns at the user level (deliberately not modifying dplyr code). The substitutions are specified by name (instead of by position), and the code must still be executed in a let() block to get the desired control.

    This looks like the following.

alias <- c(group_nm = as.character(group_nm),
           num_nm = as.character(num_nm),
           den_nm = as.character(den_nm),
           derived_nm = as.character(derived_nm),
           mean_nm = as.character(mean_nm),
           count_nm = as.character(count_nm))

wrapr::let(alias,
  mtcars %>%
    group_by(group_nm) %>%
    mutate(derived_nm = num_nm/den_nm) %>%
    summarize(
      mean_nm = mean(derived_nm),
      count_nm = n()
    ) %>%
    ungroup() %>%
    arrange(group_nm)
)

## # A tibble: 2 x 3
##      am mean_hp_per_cyl group_count
##
## 1     0            22.7          19
## 2     1            23.4          13

    We felt this was the least intrusive way to give dplyr users useful macro substitution ability over dplyr pipelines. So by December of 2016 users had public access to a complete solution. More on wrapr::let() can be found here.

    In May of 2017 rlang was publicly announced and released. dplyr adapted new rlang based quasiquotation ability directly into many of its functions/methods. This means the user does not need to draw a substitution block around their code, as the package authors have written the block. At first glance this is something only the dplyr package authors could do (as they control the code), but it turns out one can also easily adapt (or wrap) existing code in R.

    A very small effort (about 25 lines of code now found in bquote_call()) is needed to specify argument substitution semantics using bquote(). The bulk of that code is adding the top-level := to = substitution. Then it takes only about 10 lines of code to write a function that re-maps existing dplyr functions to new functions using the bquote() adapter bquote_function(). And a package author could choose to directly integrate bquote() transforms at even smaller cost.

Let's convert dplyr 0.5.0 to bquote()-based quasiquotation. We can do that by running our adapter over the names of 29 common dplyr methods, which places bquote()-enhanced versions in our running environment. The code to do this is below.

# wrap some dplyr methods
nms <- c("all_equal", "anti_join", "arrange", "as.tbl", "bind_cols",
         "bind_rows", "collect", "compute", "copy_to", "count",
         "distinct", "filter", "full_join", "group_by", "group_indices",
         "inner_join", "is.tbl", "left_join", "mutate", "rename",
         "right_join", "select", "semi_join", "summarise", "summarize",
         "tally", "tbl", "transmute", "ungroup")

env <- environment()
for(fname in nms) {
  fn <- getFromNamespace(fname, ns = "dplyr")
  assign(fname,
         # requires wrapr 1.6.4 or newer
         wrapr::bquote_function(fn),
         envir = env)
}

    With these adaptions in place we can write partially substituted or parametric code as follows.

packageVersion("dplyr")
## [1] '0.5.0'

# apply a parameterized pipeline using bquote
mtcars %>%
  group_by(.(group_nm)) %>%
  mutate(.(derived_nm) := .(num_nm)/.(den_nm)) %>%
  summarize(
    .(mean_nm) := mean(.(derived_nm)),
    .(count_nm) := n()
  ) %>%
  ungroup() %>%
  arrange(.(group_nm))

## # A tibble: 2 x 3
##      am mean_hp_per_cyl group_count
##
## 1     0            22.7          19
## 2     1            23.4          13

    This pretty much demonstrates that bquote() was up to the job the whole time. All one had to do is decide to incorporate it into the source code (something only the package author could do) or adapt the functions (a bit invasive, but something any user can do).

    The user visible difference is that bquote() uses .(x) to mark substitution of x. rlang currently uses !!x to mark substitution of x (syntactically reminiscent of Lisp’s back-tick). Earlier versions of rlang used UQ(x), uq(x), and even ((x)) (signs of this can be found here). Other famous substitution notations include %x (from C/printf), {} (from Python PEP 498 — Literal String Interpolation, and used in glue.), $ and ${} (from bash), and many more.

    There are some minor technical differences between the bquote() and rlang solutions. But one can write corner case code that fails either method at will. We are not discussing the rlang !!! "splice" syntax, as the need for it is avoidable by avoiding over-use of "..." tricks. For actual user work both systems are going to look very similar.

    Overall we are discussing avoiding interfaces that are convenient for interactive work, yet difficult to parameterize or program over. Quasiquotation is a tool to overcome package design limitations or design flaws. You don’t see quasiquotation very much in R, as many R developers work hard on their end to make sure the user does not require it.

    For R package and function developers my advice is:

    • If you are using "quoting interfaces" (interfaces that take names by quoting un-evaluated R-code) then design your package to also have sufficiently powerful standard evaluation interfaces so that users can move to these if they run into trouble. This was the advice in the first edition of "Advanced R":

    As a developer, you should always provide an escape hatch: an alternative version of the function that uses standard evaluation.

    • If you wish to introduce quasiquotation into your functions or packages, please consider using bquote() to do it. It has been part of base-R for 15 years, works very well, and is quite stable. bquote() is small enough to be comprehensible. Users who learn bquote() notation on one package will be able to re-use their experience on other bquote() enabled packages. To help: we have some example code called bquote_call_args() where we are prototyping a canonical way to incorporate bquote() and := re-writing in your own functions and packages. It isn’t in the released version of wrapr yet, but it is testing well in the development version.
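As a rough sketch of the "escape hatch" advice from the first bullet above (the function names select_cols() and select_cols_se() are invented for illustration and are not from any package):

# standard-evaluation interface: takes a character vector of column names
select_cols_se <- function(df, cols) {
  df[, cols, drop = FALSE]
}

# quoting interface for interactive convenience, built on the SE interface
select_cols <- function(df, ...) {
  cols <- vapply(as.list(substitute(list(...)))[-1], deparse, character(1))
  select_cols_se(df, cols)
}

select_cols(mtcars, mpg, cyl)            # convenient at the console
select_cols_se(mtcars, c("mpg", "cyl"))  # programmable escape hatch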

    A lot of the technical details (including defining standard and non-standard evaluation) are covered in our long article on macro capabilities in R.


To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

    Exploring college major and income: a live data analysis in R

    Tue, 10/16/2018 - 17:20

    (This article was first published on Variance Explained, and kindly contributed to R-bloggers)

    I recently came up with the idea for a series of screencasts:

    I've thought about recording a screencast of an example data analysis in #rstats. I'd do it on a dataset I'm unfamiliar with so that I can show and narrate my live thought process.

    Any suggestions for interesting datasets to use?

    — David Robinson (@drob) October 6, 2018

    Hadley Wickham had the great suggestion of analyzing a Tidy Tuesday dataset. Tidy Tuesday is a fantastic project run by the R for Data Science online learning community (especially Thomas Mock) that releases an interesting dataset each week.

I’ve now released my first such screencast, exploring this week’s Tidy Tuesday dataset on college major and income (the data behind The Economic Guide to Picking a College Major). You can also find the R Markdown I produced here.

    I produced a handful of figures that I found pretty interesting. I took a look at the distribution of income from graduates within each category of major.

    I spent some time on looking at the differences in gender distribution across majors, which was also included in the data.


    And I ended by setting up an interactive scatterplot with the plotly package that compared the share of women in a field to the median salary.
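To make those descriptions concrete, here is a minimal sketch (not the code from the screencast) of the kinds of figures being described, assuming the fivethirtyeight package's college_recent_grads table, which mirrors this Tidy Tuesday dataset:

library(fivethirtyeight)   # assumed source of college_recent_grads
library(ggplot2)
library(plotly)

# distribution of graduate median income within each major category
ggplot(college_recent_grads,
       aes(x = reorder(major_category, median, FUN = median), y = median)) +
  geom_boxplot() +
  coord_flip() +
  labs(x = NULL, y = "Median salary of graduates")

# interactive scatterplot: share of women in a field vs median salary
p <- ggplot(college_recent_grads,
            aes(x = sharewomen, y = median, label = major)) +
  geom_point()
ggplotly(p)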

    Some notes and observations:

    • This isn’t an R tutorial: If I were teaching R, I’d have prepared in advance and moved more slowly through the material. This is a case study in how I’d dive into a dataset and learn from it, including steps where I think aloud and decide what route to take. If anything, it’s closer to a “speedrun”.
• I enjoyed showing the order I work in: I write blog posts somewhat “inside-out”: I start with a few figures, then figure out what preprocessing I should have started with, and I’m always moving uninteresting figures out of the post or to appendices. It was nice to show how an analysis took shape and ended up looking like an organized final product.
• I ran into fewer bugs than I expected to: Part of the excitement of a live screencast is that “anything can go wrong” (part of the reason I recorded this first one in advance rather than live was to practice with less pressure!). I was pretty well versed in the tools I used in this session (dplyr and ggplot2), so I got stuck on only a handful of bugs (though I did go down a few unproductive routes).

    I had enough fun that I think I’ll do it again (though probably not every week). With that in mind, I’ve learned some lessons that might improve my future screencasts:

• I speak too fast: this is a recurring issue for me. When I’m speaking in front of an audience I can watch people’s faces and pace myself a bit better, but when I’m “by myself” while recording it’s difficult. I’ve learned this is especially difficult for non-native listeners, and I’ll try to be more conscious of it and speak more slowly!
    • I need to keep a better eye on time: The screencast is about 80 minutes long (I’d originally planned on an hour, and I’ll probably aim for that in the future). I’d be interested in feedback about the length, and whether people find the whole session interesting.

I look forward to hearing your feedback, and to recording the next one!


To leave a comment for the author, please follow the link and comment on their blog: Variance Explained.

    Even more images as x-axis labels

    Tue, 10/16/2018 - 14:48

    (This article was first published on R – Irregularly Scheduled Programming, and kindly contributed to R-bloggers)

    This is the last update to this strange saga… I hope.

    Image labels… Photo: http://www.premierpaper.com/shop/custom-labels/

Easily two of the most popular posts on my blog are this one and this one, describing a couple of ways in which I managed to hack together images as category labels in a ggplot.

    There are likely many people who believe one should never do such a thing, but given the popularity, it seems a lot of people aren’t listening to that. Good on you.


One of these posts was recently shared again by the amazing #rstats amplifier Mara Averick (if you’re not following her on Twitter, you’re missing out), and @baptiste_auguie (the saviour of the previous implementation) mentioned that he had written a ‘hack’ to get chemical symbols as categorical axis labels using tikzDevice. That package leverages LaTeX (with which I am very familiar, having written my PhD thesis entirely in LaTeX many moons ago) to treat all of the text in an image as potential LaTeX commands and produce working LaTeX source code which generates the required plot.

    The example code is straightforward enough

options(tikzLatexPackages = c(getOption('tikzLatexPackages'),
                              "\\usepackage{acide-amine}\n"))
d = data.frame(x=1:10, y=1:10, f=factor(sample(letters[1:2], 10, repl=TRUE)))
p <- qplot(x,y,data=d) + theme_bw() +
  opts(plot.margin = unit(c(1, 1, 5, 1), "lines"),
       axis.text.x = theme_text(size = 12 * 0.8, lineheight = 0.9, vjust = 10)) +
  scale_x_continuous(breaks = c(2, 8), labels=c("\\phe{15}", "\\leu{15}"))
tikz("annotation.tex", standAlone=T, width=4, height=4)
print(p)
dev.off()

    and produces this

    annotation.png

This got me curious, though — if it can process arbitrary LaTeX, could it process an \\includegraphics call?

    Efficient! If it's arbitrary LaTeX, could the labels just be \includegraphics calls?

    — Jonathan Carroll (@carroll_jono) October 11, 2018

    Yes, as it turns out.


    A quick test showed that it was indeed possible, which only leaves re-implementing the previous posts’ images using this method.

    I’ve done so, and the code isn’t particularly shorter than the other method.

# image_as_xaxis_latex.R

library(rvest)

## GDP per capita, top 11 countries
n_countries <- 11
url <- "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita"
html <- read_html(url)
gdppc <- html_table(html_nodes(html, "table")[3])[[1]][1:n_countries, ]

## clean up; remove non-ASCII and perform type conversions
gdppc$Country <- gsub("Â ", "", gdppc$Country)
gdppc$Rank <- iconv(gdppc$Rank, "latin1", "ASCII", sub = "")
gdppc$Country <- iconv(gdppc$Country, "latin1", "ASCII", sub = "")
gdppc$`US$` <- as.integer(sub(",", "", gdppc$`US$`))

## flag images (yes, this processing could be done neater, I'm sure)
## get the 200px versions
flags_img <- html_nodes(html_nodes(html, "table")[3][[1]], "img")[1:n_countries]
flags_url <- paste0("http://", sub("[0-9]*px", "200px", sub('\\".*$', "", sub('^.*src=\\"//', "", flags_img))))
flags_name <- sub(".svg.png", ".png", sub(".*(Flag_of)", "\\1", flags_url))

## download each of the flags into a directory
if (!dir.exists("flags")) dir.create("flags")
for (flag in seq_along(flags_url)) {
  switch(Sys.info()[["sysname"]],
         Windows = {
           download.file(flags_url[flag], destfile = file.path("flags", paste0(flag, "_", flags_name[flag])), method = "auto", mode = "wb")
         },
         Linux = {
           download.file(flags_url[flag], destfile = file.path("flags", paste0(flag, "_", flags_name[flag])))
         },
         Darwin = {
           print("Not tested on Mac. Use one of the above and find out?")
         }
  )
}

# /etc/ImageMagick-6/policy.xml
# https://stackoverflow.com/questions/42928765/convertnot-authorized-aaaa-error-constitute-c-readimage-453
library(magick)

## load the images from filenames
npoints <- length(flags_name)
pics <- vector(mode = "list", length = npoints)
image.file <- file.path(getwd(), dir("flags", full.names = TRUE))
image.file <- image.file[order(as.integer(sub("_.*", "", sub("^.*flags/", "", image.file))))]
for (i in 1:npoints) {
  pics[[i]] <- image_read(image.file[i])
}
names(pics) <- sub(".svg.png", "", sub(".*Flag_of_", "", image.file))
names(pics)[8] <- "USA"

## create a dummy dataset
y <- gdppc$`US$`
x <- names(pics)
dat <- data.frame(x = factor(x, levels = names(pics)), y = y)

library(ggplot2)

## create the graph, as per normal now with @baptiste's adapted grob processing
## NB: #85bb65 is the color of money in the USA apparently.
gg <- ggplot(dat, aes(x = x, y = y / 1e3L, group = 1))
gg <- gg + geom_bar(col = "black", fill = "#85bb65", stat = "identity")
gg <- gg + theme_minimal()
gg <- gg + scale_fill_discrete(guide = FALSE)
gg <- gg + theme(plot.background = element_rect(fill = "grey90"))
gg <- gg + labs(title = "GDP per capita",
                subtitle = "Top 11 countries",
                x = "", y = "\\$US/1000", ## escaping for LaTeX
                caption = paste0("\\shortstack{\\vspace{-1em}\\scriptsize{Source: ",
                                 gsub("_", "\\\\_", url), "}}")) ## escaping for LaTeX

## use tikzDevice and include graphicx in the preamble
library(tikzDevice)
options(tikzLatexPackages = c(getOption("tikzLatexPackages"), "\\usepackage{graphicx}\n"))

## rather than typing this out every time, make it a function
make_label <- function(labels, files, scale = 0.2, offset = "2em") {
  paste0("\\shortstack{\\includegraphics[scale=", scale, "]{",
         files, "}\\\\\\hspace{-", offset,
         "}\\rotatebox[origin=rt]{45}{\\small{", labels, "}}}")
}

## make some room for the images and caption
gg <- gg + theme(axis.text.x = element_text(size = 5,
                                            lineheight = 0.9,
                                            vjust = -4),
                 plot.margin = unit(c(5.5, 5.5, 12, 5.5), "points"))

## this is where most of the LaTeX magic is happening - the labels
gg <- gg + scale_x_discrete(breaks = names(pics),
                            labels = make_label(sub(".png", "", names(pics)), image.file))

## process the object with LaTeX and produce a PDF
tikz("xaxis.tex", standAlone = TRUE, width = 4, height = 4)
print(gg)
dev.off()
tools::texi2dvi("xaxis.tex", pdf = TRUE)

## convert the resulting PDF to a png using magick
image_read_pdf("xaxis.pdf") %>%
  image_convert(format = "png") %>%
  image_write("xaxis.png", density = 500)

    Producing nearly the same end result.

    tikzDevice result

    There are a few differences compared to the previous version(s):

    – I had a request for rotating the additional text, which I actually also updated recently, and it seemed to fit better, so I rotated the labels within the command.
– Since all of the text has been rendered via LaTeX, the fonts are a bit different.
    – The rankings have since changed, so I’ve added an 11th country to keep Australia in the list.

The LaTeX component of this also meant that a few changes were necessary in the other labels, such as the dollar sign in the y-axis label, and the underscores throughout (these are considered special characters in LaTeX). Lastly, the result of running the tikz command is that a .tex (LaTeX source code) file is produced. This isn’t quite the plot image file we want. It does however have the commands to generate one. The last steps in the above gist are to process this .tex file with LaTeX. Here I used the tools::texi2dvi function, but one could also use a system command to call their LaTeX installation.

    That still only produced a PDF. The last step was to use the magick package to convert this into an image.

Overall, this is a nice proof of concept, but I don’t think it’s a particularly tidy way of achieving the goal of image axis labels. It does however lay the groundwork for anyone else who decides this might be a useful route to take. Plus I learned a bit more about how tikzDevice works and got to revisit my LaTeX knowledge.


To leave a comment for the author, please follow the link and comment on their blog: R – Irregularly Scheduled Programming.

    A small logical change with big impact

    Tue, 10/16/2018 - 10:00

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    In R, the logical || (OR) and && (AND) operators are unique in that they are designed only to work with scalar arguments. Typically used in statements like

    while(iter < 1000 && eps < 0.0001) continue_optimization()

    the assumption is that the objects on either side (in the example above, iter and eps) are single values (that is, vectors of length 1) — nothing else makes sense for control flow branching like this. If either iter or eps above happened to be vectors with more than one value, R would silently consider only the first elements when making the AND comparison.
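A small illustration (not from the original post) of that silent first-element behaviour:

iter <- c(10, 2000)
eps  <- c(0.01, 0.000001)

# only iter[1] and eps[1] are compared; the remaining elements are ignored
iter < 1000 && eps < 0.0001
#> [1] FALSE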

    In an ideal world that should never happen, of course. But R programmers, like all programmers, do make errors, and using non-scalar values with || or && is a good sign that something, somewhere, has gone wrong. And with an experimental feature in the next version of R, you will be able to set an environment variable, _R_CHECK_LENGTH_1_LOGIC2_, to signal a warning or error in this case. But why make it experimental, and not the default behavior? Surely making such a change is a no-brainer, right?

    It turns out things are more difficult than they seem. The issue is the large ecosystem of R packages, which may rely (wittingly or unwittingly) on the existing behavior. Suddenly throwing an error would throw those packages out of CRAN, and all of their dependencies as well.

We can see the scale of the impact in a recent R Foundation blog post by Tomas Kalibera where he details the effort it took to introduce an even simpler change: making if(cond) { ... } throw an error if cond is a logical vector of length greater than one. Again, it seems like a no-brainer, but when this change was introduced experimentally in March 2017, 154 packages on CRAN started failing because of it … and that rose to 179 packages by November 2017. The next step was to implement a new CRAN check to alert those package authors that, in the future, their package would fail. It wasn't until October 2018 that the number of affected packages had dropped to a level low enough that the error could actually be implemented in R as well.
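For reference, the simpler if() change discussed in that post looks like this; the condition below is deliberately malformed:

if (c(TRUE, FALSE)) "yes"
# historically: a warning that the condition has length > 1 and only the
# first element will be used; under the new behaviour this is an error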

So the lesson is this: with an ecosystem as large as CRAN, even "trivial" changes take a tremendous amount of effort and time. They involve testing the impact on CRAN packages, coding up tests and warnings for maintainers, and allowing enough time for packages to be updated. For an in-depth perspective, read Tomas Kalibera's essay at the link below.

    R Developer blog: Tomas Kalibera: Conditions of Length Greater Than One


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

    pubchunks: extract parts of scholarly XML articles

    Tue, 10/16/2018 - 02:00

    (This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

    pubchunks is a package grown out of the fulltext package. fulltext
    provides a single interface to many sources of full text scholarly articles. As
    part of the user flow in fulltext there is an extraction step where fulltext::chunks()
    pulls parts of articles out of XML format article files.

    As part of making fulltext more maintainable and focused on simply fetching articles,
    and realizing that pulling out bits of structured XML files is a more general problem,
    we broke out pubchunks into a separate package. fulltext::ft_chunks() and
    fulltext::ft_tabularize() will eventually be removed and we’ll point users to
    pubchunks.

    The goal of pubchunks is to fetch sections out of XML format scholarly articles.
    Users do not need to know about XML and all of its warts. They only need to know
    where their files or XML strings are and what sections they want of
    each article. Then the user can combine these sections and do whatever they wish
    downstream; for example, analysis of the text structure or a meta-analysis combining
    p-values or other data.

    The other major format, and more common format, that articles come in is PDF. However,
    PDF has no structure other than perhaps separate pages, so it’s not really possible
    to easily extract specific sections of an article. Some publishers provide absolutely
    no XML versions (cough, Wiley) while others that do a good job of this are almost
    entirely paywalled (cough, Elsevier). There are some open access publishers that
    do provide XML (PLOS, Pensoft, Hindawi) – so you have the best of both worlds with
    those publishers.

    pubchunks is still in early days of development, so we’d love any feedback.

    Functions in pubchunks

    All exported functions are prefixed with pub to help reduce namespace conflicts.

    The two main functions are:

    • pub_chunks(): fetch XML sections
    • pub_tabularize(): coerce output of pub_chunks() into data.frame’s

    Other functions that you may run in to:

    • pub_guess_publisher(): guess publisher from XML file or character string
    • pub_sections(): sections pubchunks knows how to handle
    • pub_providers(): providers (i.e., publishers) pubchunks knows how to handle explicitly

    How it works

When using pub_chunks() we first figure out what publisher the XML comes from. We do this
    because journals from the same publisher often/usually follow the same format/structure, so
    we can be relatively confident of rules for pulling out certain sections.

Once we have the publisher, we go through each section of the article the user requests, and
    use the publisher-specific XPath (an XML query language)
    that we’ve written to extract that section from the XML.
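As a rough illustration of that idea (this is plain xml2, not the package's internal, publisher-specific XPath):

library(xml2)

x <- system.file("examples/pensoft_1.xml", package = "pubchunks")
doc <- read_xml(x)

# a JATS-style query for the article title; pubchunks keeps one such
# query per publisher and section
xml_text(xml_find_first(doc, "//article-title"))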

The output is a named list, where the names are the sections; the output is an S3 class with
    a print method to make it more easily digestible.

    Installation

    The first version of the package has just hit CRAN, so not all binaries are available as of this
    date.

    install.packages("pubchunks")

    You may have to install the dev version

    remotes::install_github("ropensci/pubchunks")

    After sending to CRAN a few days back I noticed a number of things that needed fixing/improving,
    so you can also try out those fixes:

    remotes::install_github("ropensci/pubchunks@fix-pub_chunks")

    Load the package

    library(pubchunks)

    Pub chunks

    pub_chunks accepts a file path, XML as character string, xml_document object, or a list of
    those three things.

    We currently support the following publishers, though not all sections below are allowed
    in every publisher:

    • elife
    • plos
    • elsevier
    • hindawi
    • pensoft
    • peerj
    • copernicus
    • frontiers
    • f1000research

    We currently support the following sections, not all of which are supported, or make sense, for
    each publisher:

    • front – Publisher, journal and article metadata elements
    • body – Body of the article
    • back – Back of the article, acknowledgments, author contributions, references
    • title – Article title
    • doi – Article DOI
    • categories – Publisher’s categories, if any
    • authors – Authors
    • aff – Affiliation (includes author names)
    • keywords – Keywords
    • abstract – Article abstract
    • executive_summary – Article executive summary
    • refs – References
    • refs_dois – References DOIs – if available
    • publisher – Publisher name
    • journal_meta – Journal metadata
    • article_meta – Article metadata
    • acknowledgments – Acknowledgments
    • permissions – Article permissions
• history – Dates, received, published, accepted, etc.

    Get an example XML file from the package

    x <- system.file("examples/pensoft_1.xml", package = "pubchunks")

    Pass to pub_chunks and state which section(s) you want

    pub_chunks(x, sections = "abstract") #> #> from: character #> sections: abstract #> abstract (n=1): AbstractNineteen species of seed-beetles belonging ... pub_chunks(x, sections = "aff") #> #> from: character #> sections: aff #> aff (n=7): nested list pub_chunks(x, sections = "title") #> #> from: character #> sections: title #> title (n=1): Contribution to the knowledge of seed-beetles (Col ... pub_chunks(x, sections = c("abstract", "title", "authors", "refs")) #> #> from: character #> sections: abstract, title, authors, refs #> abstract (n=1): AbstractNineteen species of seed-beetles belonging ... #> title (n=1): Contribution to the knowledge of seed-beetles (Col ... #> authors (n=7): nested list #> refs (n=13): AntonKW (2010) Catalogue of Palaearctic Coleoptera

    You can also pass in a character string of XML, e.g.,

xml_str <- paste0(readLines(x), collapse = "\n")
class(xml_str)
#> [1] "character"

pub_chunks(xml_str, sections = "title")
#>
#> from: character
#> sections: title
#> title (n=1): Contribution to the knowledge of seed-beetles (Col ...

    Or an xml_document object from xml2::read_xml

xml_doc <- xml2::read_xml(x)
class(xml_doc)
#> [1] "xml_document" "xml_node"

pub_chunks(xml_doc, sections = "title")
#>
#> from: xml_document
#> sections: title
#> title (n=1): Contribution to the knowledge of seed-beetles (Col ...

    Or pass in a list of the above objects, e.g., here a list of file paths

pensoft_xml <- system.file("examples/pensoft_1.xml", package = "pubchunks")
peerj_xml <- system.file("examples/peerj_1.xml", package = "pubchunks")
copernicus_xml <- system.file("examples/copernicus_1.xml", package = "pubchunks")
frontiers_xml <- system.file("examples/frontiers_1.xml", package = "pubchunks")

pub_chunks(
  list(pensoft_xml, peerj_xml, copernicus_xml, frontiers_xml),
  sections = c("abstract", "title", "authors", "refs")
)
#> [[1]]
#>
#> from: character
#> sections: abstract, title, authors, refs
#> abstract (n=1): AbstractNineteen species of seed-beetles belonging ...
#> title (n=1): Contribution to the knowledge of seed-beetles (Col ...
#> authors (n=7): nested list
#> refs (n=13): AntonKW (2010) Catalogue of Palaearctic Coleoptera
#>
#> [[2]]
#>
#> from: character
#> sections: abstract, title, authors, refs
#> abstract (n=1): Climate change is predicted to lead to more extrem ...
#> title (n=1): Storm effects on intertidal invertebrates: increas ...
#> authors (n=7): nested list
#> refs (n=60): Alcántara-Carrió et al. (2017)Alcántara-CarrióJSas
#>
#> [[3]]
#>
#> from: character
#> sections: abstract, title, authors, refs
#> abstract (n=1): Soil temperatures at various depths are unique par ...
#> title (n=1): Quality control of 10-min soil temperatures data a ...
#> authors (n=3): nested list
#> refs (n=9): 1Bertrand, C., Gonzalez Sotelino, L., and Journée,
#>
#> [[4]]
#>
#> from: character
#> sections: abstract, title, authors, refs
#> abstract (n=1): Our current understanding of Antarctic soils is de ...
#> title (n=1): Metagenomic Analysis of a Southern Maritime Antarc ...
#> authors (n=8): nested list
#> refs (n=56): AislabieJ.BroadyP.SaulD. (2006). Culturable hetero
#>
#> attr(,"ft_data")
#> [1] FALSE

    Last, since we broke pubchunks out of fulltext package we support fulltext
    here as well.

    library("fulltext") dois <- c('10.1371/journal.pone.0086169', '10.1371/journal.pone.0155491', '10.7554/eLife.03032') x <- fulltext::ft_get(dois) pub_chunks(fulltext::ft_collect(x), sections="authors") #> $plos #> $plos$`10.1371/journal.pone.0086169` #> #> from: xml_document #> sections: authors #> authors (n=4): nested list #> #> $plos$`10.1371/journal.pone.0155491` #> #> from: xml_document #> sections: authors #> authors (n=9): nested list #> #> #> $elife #> $elife$`10.7554/eLife.03032` #> #> from: xml_document #> sections: authors #> authors (n=6): nested list #> #> #> attr(,"ft_data") #> [1] TRUE

    Tabularize

    It’s great to pull out the sections you want, but most people will likely want to
    work with data.frame’s instead of lists. pub_tabularize is the answer:

    x <- system.file("examples/elife_1.xml", package = "pubchunks") res <- pub_chunks(x, c("doi", "title", "keywords")) pub_tabularize(res) #> doi title #> 1 10.7554/eLife.03032 MicroRNA-mediated repression of nonsense mRNAs #> 2 10.7554/eLife.03032 MicroRNA-mediated repression of nonsense mRNAs #> 3 10.7554/eLife.03032 MicroRNA-mediated repression of nonsense mRNAs #> 4 10.7554/eLife.03032 MicroRNA-mediated repression of nonsense mRNAs #> 5 10.7554/eLife.03032 MicroRNA-mediated repression of nonsense mRNAs #> 6 10.7554/eLife.03032 MicroRNA-mediated repression of nonsense mRNAs #> keywords .publisher #> 1 microRNA elife #> 2 nonsense mutation elife #> 3 nonsense-mediated mRNA decay elife #> 4 APC elife #> 5 intron retention elife #> 6 premature termination codon elife

    It handles many inputs as well:

out <- pub_chunks(
  list(pensoft_xml, peerj_xml, copernicus_xml, frontiers_xml),
  sections = c("doi", "title", "keywords")
)
pub_tabularize(out)
#> [[1]]
#>                                                                                                     title
#> 1 Contribution to the knowledge of seed-beetles (Coleoptera, Chrysomelidae, Bruchinae) in Xinjiang, China
#>   .publisher
#> 1    pensoft
#>
#> [[2]]
#>                                                                                                title
#> 1 Storm effects on intertidal invertebrates: increased beta diversity of few individuals and species
#>   .publisher
#> 1      peerj
#>
#> [[3]]
#>                      doi
#> 1 10.5194/asr-12-23-2015
#>                                                      title .publisher
#> 1 Quality control of 10-min soil temperatures data at RMI  copernicus
#>
#> [[4]]
#>                        doi
#> 1 10.3389/fmicb.2012.00403
#>                                                         title .publisher
#> 1 Metagenomic Analysis of a Southern Maritime Antarctic Soil   frontiers

    The output of pub_tabularize is a list of data.frame’s. You can easily combine the output
    with e.g. rbind or data.table::rbindlist or dplyr::bind_rows. Here’s an example with
    data.table::rbindlist:

data.table::rbindlist(pub_tabularize(out), fill = TRUE)
#>                                                                                                        title
#> 1: Contribution to the knowledge of seed-beetles (Coleoptera, Chrysomelidae, Bruchinae) in Xinjiang, China
#> 2: Storm effects on intertidal invertebrates: increased beta diversity of few individuals and species
#> 3: Quality control of 10-min soil temperatures data at RMI
#> 4: Metagenomic Analysis of a Southern Maritime Antarctic Soil
#>    .publisher                      doi
#> 1:    pensoft
#> 2:      peerj
#> 3: copernicus   10.5194/asr-12-23-2015
#> 4:  frontiers 10.3389/fmicb.2012.00403

    TO DO

We’ll be adding support for more publishers (including, right now, work on Pubmed XML format
    articles), more article sections, making sure default section extraction is as smart as possible,
    and more.

    Of course, if you know XPATH and don’t mind doing it, you can do what this package does yourself.
    However, you will have to write different XPATH for different publishers/journals, so leveraging
    this approach still may save some time.


To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.

    Serendipity at R / Medicine

    Tue, 10/16/2018 - 02:00

    (This article was first published on R Views, and kindly contributed to R-bloggers)

    We knew we were on to something important early on in the process of organizing R / Medicine 2018. Even during our initial attempts to articulate the differences between this conference and R / Pharma 2018, it became clear that the focus on the use of R and statistics in clinical settings was going to be a richer topic than just the design of clinical trials. However, it wasn’t until the conference got underway that we realized there was magic in the mix of attendees. R / Medicine attracted quite a few clinicians who were themselves using R in their work, or were in the process of teaching themselves R. This group catalyzed the discussions that continued throughout the conference, enabling high-bandwidth exchanges that would have otherwise suffered from the effort to translate between the two cultures. The small, single-track nature of the conference helped to keep the conversations going, with the questions and answers at the end of a given talk helping to enrich the quality of successive discussions.

    Rob Tibshirani set the collaborative tone for the conference with his opening keynote talk describing the clinical forecasting system he and his collaborators have built to predict platelet usage for the Stanford hospitals. Big-league and big-impact, the system shows the promise of delivering real clinical and financial benefits. Tibshirani’s presentation of the modeling process also set the bar for clarity.

The other keynotes were also “top shelf”. Michael Lawrence spoke about Scientific Software In-the-Large. He laid out three challenges for scientific programming at this scale:

    • Integration of independently developed modules
    • Translation of analyses and prototypes into software
    • Scalability

    and addressed these issues using examples from the Bioconductor project.

    Victoria Stodden’s Keynote, Computational Reproducibility in Medical Research: Toward Open Code and Data, was a meditation on the need to reassess scientific transparency in an age where big data and computational power are driving medical research, and deep intellectual contributions are encoded in software. I was particularly struck by the idea that progress towards computational reproducibility depends on the coordination of stakeholders.

Perhaps the highest-energy talk of the conference (and maybe of all the conferences I have attended this year) was given by Yale’s Dr. Harlan Krumholz. Unfortunately, we have neither video nor slides from this keynote, but to give you some idea of Dr. Krumholz’s iconoclastic work, look at the 2010 Forbes article and this more recent article published in HealthAffairs. The following are some notes I managed to take at the talk between moments of mesmerization. With respect to medicine in general, Dr. Krumholz said that:

    There could not be a more exciting era in medicine. Medicine is emerging as an information science and the clinician’s role is changing to be a guide or interpretor, not a shaman.

    Commenting on evidence-based medicine:

More than half of the guidelines in cardiology are not based on evidence.

    With respect to medical data, he said:

The goal should be to take high-dimensional data and make it low-dimensional. Instead of thinking that everyone should have the same data, we should move towards thinking: How do we use the data that we do have? There should be no missing data.

    I took these statements to mean that teams of clinicians, statisticians, and data scientists should be working towards building predictive models for individual patients based on whatever data is available for them and whatever big data is relevant. This was clearly the music the crowd wanted to dance to.

    The slides for most of the rest of the talks are available on the website. One talk I would like to highlight here is Nathaniel Phillips’ talk on Fast and Frugal Trees.

This talk addressed a recurring theme throughout the conference: the difference in decision making between the two cultures of statisticians and physicians. Probabilistic estimates to characterize risk and inform decision making are central to a statistician’s worldview. Physicians, on the other hand, are in general not comfortable with probabilities, and when push comes to shove, prefer unambiguous guidelines and thresholds, such as blood pressure ranges, to inform treatment decisions. A vexing cultural problem is to identify effective decision models that have a chance of actually being used by clinicians.

    The conference finished with a roundtable discussion with the theme Bridging the Two Cultures, with panelists Beth Atkinson, Joseph Chou, Peter Higgins, Stephan Kadauke, Chinonyerem Madu, and Jack Wasey representing both the statistical and clinical points of view. The moderator (me) began by asking three questions:
    1. How do clinicians engage with statisticians and data scientists?
    2. What are some key ideas you should know about collaborating?
    3. In your experience, what kinds of engagements have been the most successful?

    Panelists were free to respond as they felt inclined to any or all of the questions. As I recall, a consensus emerged around three key ideas: make an effort to empathize with colleagues, meet frequently and go out of your way to interact with colleagues, and carefully select projects and then cultivate them.

    Planning is already underway for R / Medicine 2019. Mark the week of September 23rd, and stay tuned!



To leave a comment for the author, please follow the link and comment on their blog: R Views.

    Modifying Excel Files using openxlsx

    Tue, 10/16/2018 - 02:00

    (This article was first published on R on Abhijit Dasgupta, and kindly contributed to R-bloggers)

    I’ve been creating several output tables for a paper, which I usually store as
    sheets in an Excel file, since my collaborators are entirely in the Microsoft Office ecosystem.

One issue I often run into is having to modify a single sheet in that file with updated data, while keeping the rest of the file intact. This is necessary since I’ve perhaps done some custom formatting in Excel on some of the tables, and I don’t
    want to re-format them every time I modify a single sheet. This problem can be alleviated by creating output functions in R that properly format the output tables in the first place, making the entire process reproducible. Working on this one, stay tuned!

    But for now, how to modify a single Excel sheet in a file? The openxlsx package allows this to be done very easily. As an aside, if you interact with R and Excel and are not using openxlsx, why aren’t you? This doesn’t depend on Java and has several powerful features.

    The following code reads an existing Excel file, checks if a particular sheet exists,
    creates it if it doesn’t, and writes data from a data.frame results to it and then saves it back on disk:

library(openxlsx)

boldHeader <- createStyle(textDecoration = 'bold') # Makes first row bold

wb <- loadWorkbook('Tables.xlsx')
if (!('Supplemental Table 1' %in% names(wb))) addWorksheet(wb, 'Supplemental Table 1')
writeData(wb, 'Supplemental Table 1', results, headerStyle = boldHeader)
setColWidths(wb, 'Supplemental Table 1', cols = 1:ncol(results), widths = 'auto')
saveWorkbook(wb, 'Tables.xlsx', overwrite = T)
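One possible next step (a sketch of my own, not from the post above; the function name replace_sheet is invented) is to wrap those same openxlsx calls into a small helper, so any single sheet can be swapped out in one call:

library(openxlsx)

replace_sheet <- function(path, sheet, data) {
  bold <- createStyle(textDecoration = 'bold')         # bold header row
  wb <- loadWorkbook(path)                             # keeps all other sheets intact
  if (!(sheet %in% names(wb))) addWorksheet(wb, sheet)
  writeData(wb, sheet, data, headerStyle = bold)
  setColWidths(wb, sheet, cols = 1:ncol(data), widths = 'auto')
  saveWorkbook(wb, path, overwrite = TRUE)
}

# replace_sheet('Tables.xlsx', 'Supplemental Table 1', results)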

To leave a comment for the author, please follow the link and comment on their blog: R on Abhijit Dasgupta.

    RStudio 1.2 Preview: Stan

    Tue, 10/16/2018 - 02:00

    (This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

    We previously discussed improved support in RStudio v1.2 for SQL, D3, Python, and C/C++. Today, we’re excited to announce improved support for the Stan programming language. The Stan programming language makes it possible for researchers and analysts to write high-performance and scalable statistical models.

    Stan® is a state-of-the-art platform for statistical modeling and high-performance statistical computation. Thousands of users rely on Stan for statistical modeling, data analysis, and prediction in the social, biological, and physical sciences, engineering, and business.

    With RStudio v1.2, we now provide:

    • Improved, context-aware autocompletion for Stan files and chunks

    • A document outline, which allows for easy navigation between Stan code blocks

    • Inline diagnostics, which help to find issues while you develop your Stan model

    • The ability to interrupt Stan parallel workers launched within the IDE

    Together, these features bring the editing experience in Stan programs in-line with what you’re familiar with in R.

    Autocompletion

    RStudio provides autocompletion results for Stan functions, drawing from the set of pre-defined Stan keywords and functions. The same autocompletion features you might be familiar with in R, like fuzzy matching, are now also available in Stan programs.

    As with R, RStudio will also provide a small tooltip describing the arguments accepted by a particular function.

    Document Outline

    The document outline allows for easy navigation between Stan blocks. This can be especially useful as your model definition grows in size, and you need to quickly reference the different blocks used in your program.

    Diagnostics

    RStudio now uses the Stan parser to provide inline diagnostics, and will report any problems discovered as you prepare your model.

    If the Stan compiler discovers any issues in your model, RStudio’s diagnostics will show you exactly where those issues live so you can fix them quickly and easily.

    Worker Interruption

One aspect that had previously made working with Stan in RStudio frustrating was the inability to interrupt parallel Stan workers. This meant that attempts to fit a computationally expensive model could not be interrupted; the only remedy was to restart the IDE or forcefully shut down the workers through another mechanism.

    We’re very happy to share that this limitation has been lifted with RStudio v1.2. Now, when fitting a Stan model with parallel workers, you can interrupt the workers as you would any regular R code – either use the Escape key, or press the Stop button in the Console pane.
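For context, here is a minimal rstan workflow of the kind these features apply to (the model itself is illustrative only):

library(rstan)
options(mc.cores = parallel::detectCores())  # fit chains on parallel workers

model_code <- "
data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  y ~ normal(mu, sigma);
}
"

# parallel workers launched here can now be interrupted from the IDE
fit <- stan(model_code = model_code,
            data = list(N = 10, y = rnorm(10)),
            chains = 4)
print(fit)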

    Try it Out

    With the improvements to Stan integration coming in RStudio v1.2, getting started with Stan has never been easier. If you’d like to get more familiar with Stan, we think these resources will be helpful:

    You can download the RStudio v1.2 Preview Release at https://www.rstudio.com/products/rstudio/download/preview/. If you have any questions or comments, please get in touch with us on the community forums.


To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.
