New Versions of R GUIs: BlueSky, JASP, jamovi
(This article was first published on R – r4stats.com, and kindly contributed to R-bloggers)
It has been only two months since I summarized my reviews of point-and-click front ends for R, and it’s already out of date! I have converted that post into a regularly-updated article and added a plot of total features, which I repeat below. It shows the total number of features in each package, including the latest versions of BlueSky Statistics, JASP, and jamovi. The reviews, which initially appeared as blog posts, are now regularly-updated pages.
New Features in JASP

Let’s take a look at some of the new features, starting with the version of JASP that was released three hours ago:
 Interface adjustments
 Data panel, analysis input panel and results panel can be manipulated much more intuitively with sliders and show/hide buttons
 Changed the analysis input panel to have an overview of all opened analyses and added the possibility to change titles, to show documentation, and remove analyses
 Enhanced the navigation through the file menu; it is now possible to use arrow keys or simply hover over the buttons
 Added possibility to scale the entire application with Ctrl +, Ctrl – and Ctrl 0
 Added MANOVA
 Added Confirmatory Factor Analysis
 Added Bayesian Multinomial Test
 Included additional menu preferences to customize JASP to your needs
 Added/updated help files for most analyses
 R engine updated from 3.4.4 to 3.5.2
 Added Šidák correction for post-hoc tests (AN(C)OVA)
A complete list of fixes and features is available here. JASP is available for free from their download page. My comparative review of JASP is here.
New Features in jamovi

Two of the usability features added to jamovi recently are templates and multi-file input. Both are described in detail here.
Templates enable you to save all the steps in your work as a template file. Opening that file in jamovi then lets you open a new dataset and the template will recreate all the previous analyses and graphs using the new data. It provides reusability without having to depend on the R code that GUI users are trying to avoid using.
The multifile input lets you select many CSV files at once and jamovi will open and stack them all (they must contain common variable names, of course).
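For readers who prefer to see the idea in code, the same stacking can be sketched in plain R (jamovi does this through its GUI; the directory, file names, and columns below are invented purely for illustration):

```r
# Sketch of jamovi's multi-file input in plain R: read several CSV files
# that share common variable names and stack them into one data frame.
# The two files here are simulated so the example is self-contained.
dir <- file.path(tempdir(), "stackdemo")
dir.create(dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:2, score = c(5, 7)),
          file.path(dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:4, score = c(6, 8)),
          file.path(dir, "b.csv"), row.names = FALSE)

files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
stacked <- do.call(rbind, lapply(files, read.csv))  # stack all files
nrow(stacked)  # 4 rows, combined from the two files
```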
Other new analytic features have been added through a set of modeling modules. They’re described in detail here, and a list of some of their capabilities is below. You can read my full review of jamovi here, and you can download it for free here.
 OLS Regression (GLM)
 OLS ANOVA (GLM)
 OLS ANCOVA (GLM)
 Random coefficients regression (Mixed)
 Random coefficients ANOVA/ANCOVA (Mixed)
 Logistic regression (GZLM)
 Logistic ANOVA-like model (GZLM)
 Probit regression (GZLM)
 Probit ANOVA-like model (GZLM)
 Multinomial regression (GZLM)
 Multinomial ANOVA-like model (GZLM)
 Poisson regression (GZLM)
 Poisson ANOVA-like model (GZLM)
 Overdispersed Poisson regression (GZLM)
 Overdispersed Poisson ANOVA-like model (GZLM)
 Negative binomial regression (GZLM)
 Negative binomial ANOVA-like model (GZLM)
 Continuous and categorical independent variables
 Omnibus tests and parameter estimates
 Confidence intervals
 Simple slopes analysis
 Simple effects
 Post-hoc tests
 Plots for up to three-way interactions for both categorical and continuous independent variables
 Automatic selection of the best estimation method and degrees of freedom
 Type III estimation
New Features in BlueSky Statistics

The BlueSky developers have been working on adding psychometric methods (for a book that is due out soon) and support for probability distributions. My full review is here and you can download BlueSky Statistics for free here.
 Model Fitting: IRT: Simple Rasch Model
 Model Fitting: IRT: Simple Rasch Model (Multi-Faceted)
 Model Fitting: IRT: Partial Credit Model
 Model Fitting: IRT: Partial Credit Model (Multi-Faceted)
 Model Fitting: IRT: Rating Scale Model
 Model Fitting: IRT: Rating Scale Model (Multi-Faceted)
 Model Statistics: IRT: ICC Plots
 Model Statistics: IRT: Item Fit
 Model Statistics: IRT: Plot PI Map
 Model Statistics: IRT: Item and Test Information
 Model Statistics: IRT: Likelihood Ratio and Beta plots
 Model Statistics: IRT: Person Fit
 Distributions: Continuous: Beta Probabilities
 Distributions: Continuous: Beta Quantiles
 Distributions: Continuous: Plot Beta Distribution
 Distributions: Continuous: Sample from Beta Distribution
 Distributions: Continuous: Cauchy Probabilities
 Distributions: Continuous: Plot Cauchy Distribution
 Distributions: Continuous: Cauchy Quantiles
 Distributions: Continuous: Sample from Cauchy Distribution
 Distributions: Continuous: Chi-squared Probabilities
 Distributions: Continuous: Chi-squared Quantiles
 Distributions: Continuous: Plot Chi-squared Distribution
 Distributions: Continuous: Sample from Chi-squared Distribution
 Distributions: Continuous: Exponential Probabilities
 Distributions: Continuous: Exponential Quantiles
 Distributions: Continuous: Plot Exponential Distribution
 Distributions: Continuous: Sample from Exponential Distribution
 Distributions: Continuous: F Probabilities
 Distributions: Continuous: F Quantiles
 Distributions: Continuous: Plot F Distribution
 Distributions: Continuous: Sample from F Distribution
 Distributions: Continuous: Gamma Probabilities
 Distributions: Continuous: Gamma Quantiles
 Distributions: Continuous: Plot Gamma Distribution
 Distributions: Continuous: Sample from Gamma Distribution
 Distributions: Continuous: Gumbel Probabilities
 Distributions: Continuous: Gumbel Quantiles
 Distributions: Continuous: Plot Gumbel Distribution
 Distributions: Continuous: Sample from Gumbel Distribution
 Distributions: Continuous: Logistic Probabilities
 Distributions: Continuous: Logistic Quantiles
 Distributions: Continuous: Plot Logistic Distribution
 Distributions: Continuous: Sample from Logistic Distribution
 Distributions: Continuous: Lognormal Probabilities
 Distributions: Continuous: Lognormal Quantiles
 Distributions: Continuous: Plot Lognormal Distribution
 Distributions: Continuous: Sample from Lognormal Distribution
 Distributions: Continuous: Normal Probabilities
 Distributions: Continuous: Normal Quantiles
 Distributions: Continuous: Plot Normal Distribution
 Distributions: Continuous: Sample from Normal Distribution
 Distributions: Continuous: t Probabilities
 Distributions: Continuous: t Quantiles
 Distributions: Continuous: Plot t Distribution
 Distributions: Continuous: Sample from t Distribution
 Distributions: Continuous: Uniform Probabilities
 Distributions: Continuous: Uniform Quantiles
 Distributions: Continuous: Plot Uniform Distribution
 Distributions: Continuous: Sample from Uniform Distribution
 Distributions: Continuous: Weibull Probabilities
 Distributions: Continuous: Weibull Quantiles
 Distributions: Continuous: Plot Weibull Distribution
 Distributions: Continuous: Sample from Weibull Distribution
 Distributions: Discrete: Binomial Probabilities
 Distributions: Discrete: Binomial Quantiles
 Distributions: Discrete: Binomial Tail Probabilities
 Distributions: Discrete: Plot Binomial Distribution
 Distributions: Discrete: Sample from Binomial Distribution
 Distributions: Discrete: Geometric Probabilities
 Distributions: Discrete: Geometric Quantiles
 Distributions: Discrete: Geometric Tail Probabilities
 Distributions: Discrete: Plot Geometric Distribution
 Distributions: Discrete: Sample from Geometric Distribution
 Distributions: Discrete: Hypergeometric Probabilities
 Distributions: Discrete: Hypergeometric Quantiles
 Distributions: Discrete: Hypergeometric Tail Probabilities
 Distributions: Discrete: Plot Hypergeometric Distribution
 Distributions: Discrete: Sample from Hypergeometric Distribution
 Distributions: Discrete: Negative Binomial Probabilities
 Distributions: Discrete: Negative Binomial Quantiles
 Distributions: Discrete: Negative Binomial Tail Probabilities
 Distributions: Discrete: Plot Negative Binomial Distribution
 Distributions: Discrete: Sample from Negative Binomial Distribution
 Distributions: Discrete: Poisson Probabilities
 Distributions: Discrete: Poisson Quantiles
 Distributions: Discrete: Poisson Tail Probabilities
 Distributions: Discrete: Plot Poisson Distribution
 Distributions: Discrete: Sample from Poisson Distribution
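The Distributions menu items above correspond to R’s standard d/p/q/r function families, which BlueSky’s dialogs presumably wrap. A quick sketch for the normal distribution; the other distributions follow the same naming pattern (e.g. pbeta/qbeta/rbeta for Beta, ppois/qpois/rpois for Poisson):

```r
# Probabilities, quantiles, samples, and plots for the normal distribution
# using base R's p/q/r/d function family.
pnorm(1.96)             # probability: P(Z <= 1.96), about 0.975
qnorm(0.975)            # quantile: about 1.96
set.seed(42)
rnorm(5)                # sample of five random draws
curve(dnorm(x), -4, 4)  # plot of the density
```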
To leave a comment for the author, please follow the link and comment on their blog: R – r4stats.com. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
How to Perform Ordinal Logistic Regression in R
(This article was first published on Rposts.com, and kindly contributed to Rbloggers)
In this article, we discuss the basics of ordinal logistic regression and its implementation in R. Ordinal logistic regression is a widely used classification method with applications in a variety of domains. It is the go-to tool when there is a natural ordering in the dependent variable; for example, a dependent variable with the levels low, medium, and high is a perfect context for ordinal logistic regression.
Because of its wide applicability, ordinal logistic regression is one of the most popular methods in data analytics. It is also known as the proportional odds model because of the transformation used during estimation and the log-odds interpretation of the output. We hope this article helps readers understand the basics and implement the model in R.
The article is organized as follows: Section 1 provides a quick review of ordinal logistic regression, focusing on the theoretical aspects of the technique. Section 2 discusses the steps to perform ordinal logistic regression in R, shares the R script, and covers the basics of interpreting and evaluating the model. In Section 3, we learn a more intuitive way to interpret the model. Section 4 concludes the article.
Basics of ordinal logistic regression
Ordinal logistic regression is an extension of the simple logistic regression model. In simple logistic regression, the dependent variable is categorical and follows a Bernoulli distribution (for a quick reference, check out this article by Perceptive Analytics – https://www.kdnuggets.com/2017/10/learngeneralizedlinearmodelsglmr.html). In ordinal logistic regression, by contrast, the dependent variable is ordinal, i.e., there is an explicit ordering in the categories. For example, during preliminary testing of a pain relief drug, participants are asked to express the amount of relief they feel on a five-point Likert scale. Another common example of an ordinal variable is app ratings: on Google Play, customers are asked to rate apps on a scale from 1 to 5. Ordinal logistic regression comes in handy in these examples because there is a clear order in the categorical dependent variable.
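As a point of comparison, the simple (binary) logistic regression mentioned above can be fit in R with glm(); the data below are simulated purely for illustration:

```r
# Minimal sketch of simple logistic regression: a Bernoulli outcome y
# whose log odds are a linear function of a single predictor x.
# The data are simulated; the true intercept is 0.5 and slope 1.2.
set.seed(1)
x <- rnorm(100)
y <- rbinom(100, size = 1, prob = plogis(0.5 + 1.2 * x))

fit <- glm(y ~ x, family = binomial)  # models log odds of y = 1
coef(fit)                             # estimated intercept and slope
```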
In simple logistic regression, the log of the odds that an event occurs is modeled as a linear combination of the independent variables. But this approach ignores the ordering of the categorical dependent variable. The ordinal logistic regression model overcomes this limitation by using cumulative events in the log-odds computation: unlike simple logistic regression, ordinal logistic models consider the probability of an event together with all the events below it in the ordered hierarchy. The event of interest would therefore be obtaining an app rating equal to or less than X. For example, the log of the odds for an app rating less than or equal to 1 would be computed as follows:
LogOdds(rating <= 1) = log( p(rating <= 1) / p(rating > 1) ) [Eq. 1]
Likewise, the log of odds can be computed for other values of app ratings. The computations for other ratings are below:
LogOdds(rating <= 2) = log( p(rating <= 2) / p(rating > 2) ) [Eq. 2]
LogOdds(rating <= 3) = log( p(rating <= 3) / p(rating > 3) ) [Eq. 3]
LogOdds(rating <= 4) = log( p(rating <= 4) / p(rating > 4) ) [Eq. 4]
Because all the ratings at or below the focal score are included in the computation, the highest app rating of 5 encompasses every rating below it and has no log-odds equation of its own. In general, the ordinal regression model can be represented using the log-odds computation:
LogOdds(Y <= i) = αi + β1X1 + β2X2 + … + βnXn
where,
Y is the ordinal dependent variable
αi are the intercepts, where i runs from 1 to the number of categories minus 1
X1, X2, …, Xn are independent variables, which can be measured on a nominal, ordinal, or continuous scale
β1, β2,… βn are estimated parameters
For a dependent variable with k ordered categories, we obtain k – 1 equations. The proportional odds assumption implies that the effect of the independent variables is identical in each log-odds equation; the intercept, however, takes a different value in each equation. Besides the proportional odds assumption, the ordinal logistic regression model assumes an ordinal dependent variable and the absence of multicollinearity, i.e., that the independent variables are not strongly correlated. These assumptions are important because their violation makes the estimated parameters unreliable.
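The cumulative log odds in Eq. 1–4 are easy to verify numerically. The sketch below uses a made-up distribution of app ratings (the probabilities are invented for illustration):

```r
# Worked example of Eq. 1-4: cumulative log odds from a hypothetical
# distribution of app ratings 1 through 5.
p <- c(0.10, 0.20, 0.30, 0.25, 0.15)  # p(rating = 1), ..., p(rating = 5)

cum_p <- cumsum(p)[1:4]               # p(rating <= j) for j = 1..4
log_odds <- log(cum_p / (1 - cum_p))  # Eq. 1-4; no equation for rating 5
round(log_odds, 2)                    # -2.20 -0.85  0.41  1.73
```

Note that the four values increase monotonically, as cumulative log odds must, and that the top category (rating 5) has no equation because p(rating <= 5) = 1.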
Model building in R
In this section, we describe the dataset and implement ordinal logistic regression in R. We use a simulated dataset for the analysis. The objective is to predict the likelihood of each level of customer purchase. The dependent variable is the likelihood of repeated purchase by customers, measured on an ordinal scale with three levels – low probability, medium probability, and high probability. The independent variables are possession of a coupon by the focal customer, recommendation received from peers, and quality of the product. Possession of a coupon and peer recommendation are categorical variables, while quality is measured on a scale of 1 to 5. We discuss the steps for implementing ordinal logistic regression and share the commented R script for the reader’s benefit. The data is in .csv format and can be downloaded by clicking here.
Before starting the analysis, I will briefly describe the preliminary steps. The first step is to place the data file in the working directory. The next step is to explicitly define the ordering of the levels in the dependent variable and the relevant independent variables. This step is crucial, and ignoring it can lead to a meaningless analysis.
#Read data file from working directory
setwd("C:/Users/You/Desktop")
data <- read.table("data.txt")

#Ordering the dependent variable
data$rpurchase <- factor(data$rpurchase,
                         levels = c("low probability", "medium probability", "high probability"),
                         ordered = TRUE)
data$peers <- factor(data$peers, levels = c("0", "1"), ordered = TRUE)
data$coupon <- factor(data$coupon, levels = c("0", "1"), ordered = TRUE)
Next, it is essential to perform exploratory data analysis, which deals with outliers and missing values that can induce bias in our model. In this article we do basic exploration by looking at summary statistics and a frequency table. Figure 1 shows the summary statistics: we observe the counts for the ordinal variables and the distribution characteristics of the other variables. Table 1 gives the count of rpurchase for each value of coupon. Note that the median level of rpurchase increases with coupon, indicating that coupons positively affect the likelihood of repeated purchase. The R script for the summary statistics and the frequency table is as follows:
#Exploratory data analysis
#Summarizing the data
summary(data)
#Making frequency table
table(data$rpurchase, data$coupon)

Figure 1. Summary statistics
Table 1. Count of rpurchase by coupon
rpurchase / coupon    With coupon    Without coupon
Low probability       200            20
Medium probability    110            30
High probability      27             13

After defining the order of the levels in the dependent variable and performing exploratory data analysis, the data is ready to be partitioned into training and test sets. We will build the model on the training set and validate it on the test set. The partition is random; the random sample is generated using the sample() function along with set.seed(). The R script for creating the training and test sets is as follows:
#Dividing data into training and test set
#Random sampling
samplesize <- 0.60 * nrow(data)
set.seed(100)
index <- sample(seq_len(nrow(data)), size = samplesize)

#Creating training and test set
datatrain <- data[index, ]
datatest <- data[-index, ]

Now, we will build the model using the data in the training set. As discussed in the section above, the dependent variable is modeled on the log-odds scale. Because of this transformation, the coefficients are difficult to interpret; in particular, they cannot be interpreted as marginal effects. They are proportional odds coefficients and are interpreted as increases in the log odds. The interpretation changes for the intercepts as well: unlike simple linear regression, ordinal logistic regression yields n – 1 intercepts, where n is the number of categories in the dependent variable. Each intercept can be interpreted as the expected log odds of falling at or below the corresponding category. Before interpreting the model, we share the relevant R script and the results. In the code, we set Hess = TRUE, a logical argument that tells polr() to return the Hessian matrix; the Hessian is needed to use summary() or to compute the variance-covariance matrix of the fitted model.
#Build ordinal logistic regression model
library(MASS)   #polr() comes from the MASS package
model <- polr(rpurchase ~ coupon + peers + quality, data = datatrain, Hess = TRUE)
summary(model)
Figure 2. Ordinal logistic regression model on training set
The table displays the coefficients and intercepts with their corresponding standard errors and t values. The coefficients are interpreted as follows: holding everything else constant, a one-unit increase in coupon increases the expected value of rpurchase by 0.96 on the log-odds scale. The coefficients of peers and quality are interpreted likewise.
Note that ordinal logistic regression outputs one intercept for each boundary between adjacent levels. The intercepts can be interpreted as the expected log odds when the other variables are zero. For example, the low probability|medium probability intercept is 2.13, indicating that the expected log odds of being in the low probability category, when the other variables are zero, is 2.13. Using the inverse logit transformation, p = exp(α) / (1 + exp(α)), the intercepts can be expressed as expected probabilities: the expected probability of the low probability category, when the other variables are zero, is exp(2.13) / (1 + exp(2.13)) ≈ 0.89.
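That last conversion is easy to check in R, where plogis() is the built-in inverse logit; 2.13 is the intercept value quoted above:

```r
# Converting an intercept from the log-odds scale to a probability
# with the inverse logit transformation.
alpha <- 2.13                  # low|medium intercept from the model output
exp(alpha) / (1 + exp(alpha))  # manual inverse logit, about 0.894
plogis(alpha)                  # same result using R's built-in function
```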
After building the model and interpreting the model, the next step is to evaluate it. The evaluation of the model is conducted on the test dataset. A basic evaluation approach is to compute the confusion matrix and the misclassification error. The R code and the results are as follows:
#Compute confusion table and misclassification error
predictrpurchase <- predict(model, datatest)
table(datatest$rpurchase, predictrpurchase)
mean(as.character(datatest$rpurchase) != as.character(predictrpurchase))

Figure 3. Confusion matrix
The confusion matrix shows the performance of the ordinal logistic regression model on the test dataset: the low probability category is identified correctly 76 times, the medium probability category 10 times, and the high probability category 0 times. The model identifies the high probability category poorly, which happens because that category is underrepresented in the training dataset. From the confusion matrix, we find that the misclassification error of our model is 46%.
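The misclassification error is simply the share of off-diagonal counts in the confusion matrix. A sketch with illustrative counts (these are not the exact values from Figure 3):

```r
# Misclassification error from a confusion matrix: rows are observed
# classes, columns are predicted classes, and correct predictions sit
# on the diagonal. The counts below are hypothetical.
conf <- matrix(c(76, 30, 10,
                 20, 10,  5,
                  8,  5,  0), nrow = 3, byrow = TRUE)

1 - sum(diag(conf)) / sum(conf)  # share of misclassified cases, about 0.476 here
```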
Interpretation using plots
The interpretation of ordinal logistic regression in terms of log odds ratios is not easy to understand, so we offer an alternative approach to interpretation using plots. The R code for plotting the effects of the independent variables is as follows:
#Plotting the effects
library("effects")
Effect(focal.predictors = "quality", model)
plot(Effect(focal.predictors = "coupon", model))
plot(Effect(focal.predictors = c("quality", "coupon"), model))

The plots are intuitive and easy to understand. For example, Figure 4 shows that a coupon increases the likelihood of classification into the high probability and medium probability classes, while decreasing the likelihood of classification into the low probability class.
Figure 4. Effect of coupon on identification
It is also possible to look at the joint effect of two independent variables. Figure 5 shows the joint effect of quality and coupon on the predicted category of the dependent variable. Looking at the top row of Figure 5, we notice that the combination of coupon and quality increases the likelihood of identification in the high probability category.
Figure 5. Joint effect of quality and coupon on identification
Conclusion
The article discusses the fundamentals of ordinal logistic regression, builds the model in R, and ends with interpretation and evaluation. Ordinal logistic regression extends the simple logistic regression model to situations where the dependent variable is ordinal, i.e., can be ordered. It has a variety of applications; for example, it is often used in marketing to increase customer lifetime value, where consumers are categorized into different classes based on their tendency to make repeated purchases. To discuss the model in an applied manner, we developed this article around the case of consumer categorization. The independent variables of interest are the coupon held by the consumer from a previous purchase, the influence of peers, and the quality of the product.
The article has two key takeaways. First, ordinal logistic regression comes in handy when dealing with a dependent variable that can be ordered; using multinomial logistic regression instead ignores the information in the ordering of the dependent variable. Second, the coefficients of ordinal logistic regression cannot be interpreted in the same way as the coefficients of ordinary linear regression; interpreting the coefficients as marginal effects is one of the common mistakes users make when implementing the model. We again emphasize the graphical method of interpretation, which makes it easy to understand the individual and joint effects of the independent variables on the likelihood of classification.
This article can be a go-to reference for understanding the basics of ordinal logistic regression and its implementation in R. We have provided commented R code throughout the article.
Credits: Chaitanya Sagar and Aman Asija of Perceptive Analytics. Perceptive Analytics is a marketing analytics and Tableau consulting company.
To leave a comment for the author, please follow the link and comment on their blog: Rposts.com.
How to Perform Ordinal Logistic Regression in R
(This article was first published on Rposts.com, and kindly contributed to Rbloggers)
In this article, we discuss the basics of ordinal logistic regression and its implementation in R. Ordinal logistic regression is a widely used classification method, with applications in variety of domains. This method is the goto tool when there is a natural ordering in the dependent variable. For example, dependent variable with levels low, medium, high is a perfect context for application of logistic ordinal regression.
Having wide range of applicability, ordinal logistic regression is considered as one of the most admired methods in the field of data analytics. The method is also known as proportional odds model because of the transformations used during estimation and the log odds interpretation of the output. We hope that this article helps our readers to understand the basics and implement the model in R.
The article is organized as follows: focusing on the theoretical aspects of the technique, section 1 provides a quick review of ordinal logistic regression. Section 2 discusses the steps to perform ordinal logistic regression in R and shares R script. In addition, section 2 also covers the basics of interpretation and evaluation of the model on R. In section 3, we learn a more intuitive way to interpret the model. Section 4 concludes the article.
Basics of ordinal logistic regression
Ordinal logistic regression is an extension of simple logistic regression model. In simple logistic regression, the dependent variable is categorical and follows a Bernoulli distribution. (for a quick reference check out this article by perceptive analytics – https://www.kdnuggets.com/2017/10/learngeneralizedlinearmodelsglmr.html). Whereas, in ordinal logistic regression the dependent variable is ordinal i.e. there is an explicit ordering in the categories. For example, during preliminary testing of a pain relief drug, the participants are asked to express the amount of relief they feel on a five point Likert scale. Another common example of an ordinal variable is app ratings. On google play, customers are asked to rate apps on a scale ranging from 1 to 5. Ordinal logistic regression becomes handy in the aforementioned examples as there is a clear order in the categorical dependent variable.
In simple logistic regression, log of odds that an event occurs is modeled as a linear combination of the independent variables. But, the above approach of modeling ignores the ordering of the categorical dependent variable. Ordinal logistic regression model overcomes this limitation by using cumulative events for the log of the odds computation. It means that unlike simple logistic regression, ordinal logistic models consider the probability of an event and all the events that are below the focal event in the ordered hierarchy. For example, the event of interest in ordinal logistic regression would be to obtain an app rating equal to X or less than X. For example, the log of odds for the app rating less than or equal to 1 would be computed as follows:
LogOdds rating<1 = Log (p(rating=1)/p(rating>1) [Eq. 1]
Likewise, the log of odds can be computed for other values of app ratings. The computations for other ratings are below:
LogOdds rating<2 = Log (p(rating<=2)/p(rating>2) [Eq. 2]
LogOdds rating<3 = Log (p(rating<=3)/p(rating>3) [Eq. 3]
LogOdds rating<4 = Log (p(rating=4)/p(rating>4) [Eq. 4]
Because all the ratings below the focal score are considered in computation, the highest app rating of 5 will include all the ratings below it and does not have a log of odds associated with it. In general, the ordinal regression model can be represented using the LogOdds computation.
LogoddsY = αi+ β1X1 +β2X2 +….. +βnXn
where,
Y is the ordinal dependent variable
i is the number of categories minus 1
X1, X2,…. Xn are independent variables. They can be measured on nominal, ordinal or continuous measurement scale.
β1, β2,… βn are estimated parameters
For i ordered categories, we obtain i – 1 equations. The proportional odds assumption implies that the effect of independent variables is identical for each log of odds computation. But, this is not the case for intercept as the intercept takes different values for each computation. Besides the proportional odds assumption, the ordinal logistic regression model assumes an ordinal dependent variable and absence of multicollinearity. Absence of multicollinearity means that the independent variables are not significantly correlated. These assumptions are important as their violation makes the computed parameters unacceptable.
Model building in R
In this section, we describe the dataset and implement ordinal logistic regression in R. We use a simulated dataset for analysis. The details of the variables are as follows. The objective of the analysis is to predict the likelihood of each level of customer purchase. The dependent variable is the likelihood of repeated purchase by customers. The variable is measured in an ordinal scale and can be equal to one of the three levels – low probability, medium probability, and high probability. The independent variables are measures of possession of coupon by the focal customer, recommendation received by the peers and quality of the product. Possession of coupon and peer recommendation are categorical variables, while quality is measured on a scale of 1 to 5. We discuss the steps for implementation of ordinal logistic regression and share the commented R script for better understanding of the reader. The data is in .csv format and can be downloaded by clicking here.
Before starting the analysis, I will describe the preliminary steps in short. The first step is to keep the data file in the working directory. The next step is to explicitly define the ordering of the levels in the dependent variable and the relevant independent variables. This step is crucial and ignoring it can lead to meaningless analysis.
#Read data file from working directory setwd("C:/Users/You/Desktop") data < read.table("data.txt") #Ordering the dependent variable data$rpurchase = factor(data$rpurchase, levels = c("low probability", "medium probability", "high probability"), ordered = TRUE) data$peers = factor(data$peers, levels = c("0", "1"), ordered = TRUE) data$coupon = factor(data$coupon, levels = c("0", "1"), ordered = TRUE)
Next, it is essential to perform the exploratory data analysis. Exploratory data analysis deals with outliers and missing values, which can induce bias in our model. In this article we do basic exploration by looking at summary statistics and frequency table. Figure 1. shows the summary statistics. We observe the count of data for ordinal variables and distribution characteristics for other variables. We compute the count of rpurchase with different values of coupon in Table 1. Note that the median value for rpurchase changes with change in coupon. The median level of rpurchase increases, indicating that coupon positively affects the likelihood of repeated purchase. The R script for summary statistics and the frequency table is as follows:
#Exploratory data analysis #Summarizing the data summary(data) #Making frequency table table(data$rpurchase, data$coupon)Figure 1. Summary statistics
Table 1. Count of rpurchase by coupon
Rpurchase/Coupon With coupon Without coupon Low probability 200 20 Medium probability 110 30 High proabability 27 13After defining the order of the levels in the dependent variable and performing exploratory data analysis, the data is ready to be partitioned into training and testing set. We will build the model using the training set and validate the computed model using the data in test set. The partitioning of the data into training and test set is random. The random sample is generated using sample ( ) function along with set.seed ( ) function. The R script for explicitly stating the order of levels in the dependent variable and creating training and test data set is follows:
#Dividing data into training and test set
#Random sampling
samplesize = 0.60*nrow(data)
set.seed(100)
index = sample(seq_len(nrow(data)), size = samplesize)

#Creating training and test set
datatrain = data[index,]
datatest = data[-index,]

Now, we will build the model using the data in the training set. As discussed in the above section, the dependent variable in the model is in the form of log odds. Because of the log-odds transformation, it is difficult to interpret the coefficients of the model; note that in this case the coefficients of the regression cannot be interpreted in terms of marginal effects. The coefficients are called proportional odds and are interpreted in terms of the increase in log odds. The interpretation changes not only for the coefficients but also for the intercepts. Unlike simple linear regression, in ordinal logistic regression we obtain n-1 intercepts, where n is the number of categories in the dependent variable. Each intercept can be interpreted as the expected log odds of falling in or below the corresponding category. Before interpreting the model, we share the relevant R script and the results. In the R code, we set Hess = TRUE, the logical flag that makes the function return the Hessian matrix. Returning the Hessian matrix is essential for using the summary() function or calculating the variance-covariance matrix of the fitted model.
#Build ordinal logistic regression model
library(MASS)  # polr() is provided by the MASS package
model = polr(rpurchase ~ coupon + peers + quality, data = datatrain, Hess = TRUE)
summary(model)
Figure 2. Ordinal logistic regression model on training set
The table displays the values of the coefficients and intercepts, with corresponding standard errors and t values. The coefficients are interpreted as follows: for example, holding everything else constant, an increase in the value of coupon by one unit increases the expected value of rpurchase by 0.96 in log odds. The coefficients of peers and quality can be interpreted likewise.
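Because the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which many readers find easier to digest. A minimal base-R sketch (the 0.96 value is taken from the model output above):

```r
# Coefficient of coupon from the fitted model (log-odds scale)
b_coupon <- 0.96

# Exponentiating a log-odds coefficient yields an odds ratio:
# a one-unit increase in coupon multiplies the odds of being in a
# higher rpurchase category by roughly this factor
exp(b_coupon)
# ~ 2.61
```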
Note that the ordinal logistic regression outputs multiple intercepts, one per boundary between levels of the dependent variable. The intercepts can be interpreted as the expected log odds when the other variables assume a value of zero. For example, the low probability|medium probability intercept takes a value of 2.13, indicating that the expected log odds of identifying with the low probability category, when the other variables assume a value of zero, is 2.13. Using the inverse logit transformation, p = exp(x) / (1 + exp(x)), the intercepts can be interpreted in terms of expected probabilities. The expected probability of identifying with the low probability category, when the other variables assume a value of zero, is 0.89.
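The inverse-logit arithmetic can be checked directly in base R; plogis() computes exp(x) / (1 + exp(x)) for you (the 2.13 value is the intercept reported above):

```r
# Inverse logit of the low|medium intercept
intercept <- 2.13
exp(intercept) / (1 + exp(intercept))  # manual inverse logit
plogis(intercept)                      # same thing via base R
# both are ~ 0.89
```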
After building and interpreting the model, the next step is to evaluate it. The evaluation is conducted on the test dataset. A basic evaluation approach is to compute the confusion matrix and the misclassification error. The R code and the results are as follows:
#Compute confusion table and misclassification error
predictrpurchase = predict(model, datatest)
table(datatest$rpurchase, predictrpurchase)
mean(as.character(datatest$rpurchase) != as.character(predictrpurchase))

Figure 3. Confusion matrix
The confusion matrix shows the performance of the ordinal logistic regression model. For example, it shows that, in the test dataset, the low probability category is identified correctly 76 times, the medium probability category 10 times, and the high probability category 0 times. The model identifies the high probability category poorly, because of the inadequate representation of that category in the training dataset. Using the confusion matrix, we find that the misclassification error for our model is 46%.
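The misclassification error is simply the share of test rows whose predicted class differs from the observed one. A toy sketch with made-up labels (not the article's data) shows the calculation:

```r
# Hypothetical observed and predicted classes (illustrative only)
actual    <- c("low", "low", "medium", "high")
predicted <- c("low", "medium", "medium", "low")

# Proportion of mismatches = misclassification error
mean(actual != predicted)
# 0.5 here: 2 of the 4 rows are misclassified
```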
Interpretation using plots
The interpretation of ordinal logistic regression in terms of log odds ratios is not easy to understand. We offer an alternative approach to interpretation using plots. The R code for plotting the effects of the independent variables is as follows:
#Plotting the effects
library("effects")
Effect(focal.predictors = "quality", model)
plot(Effect(focal.predictors = "coupon", model))
plot(Effect(focal.predictors = c("quality", "coupon"), model))

The plots are intuitive and easy to understand. For example, Figure 4 shows that coupon increases the likelihood of classification into the high probability and medium probability classes, while decreasing the likelihood of classification in the low probability class.
Figure 4. Effect of coupon on identification
It is also possible to look at the joint effect of two independent variables. Figure 5 shows the joint effect of quality and coupon on the identification of the category of the dependent variable. Observing the top row of Figure 5, we notice that the interaction of coupon and quality increases the likelihood of identification in the high probability category.
Figure 5. Joint effect of quality and coupon on identification
Conclusion
The article discusses the fundamentals of ordinal logistic regression, builds the model in R, and ends with interpretation and evaluation. Ordinal logistic regression extends the simple logistic regression model to situations where the dependent variable is ordinal, i.e. can be ordered. Ordinal logistic regression has a variety of applications; for example, it is often used in marketing to increase customer lifetime value. Consumers can be categorized into different classes based on their tendency to make a repeated purchase decision. In order to discuss the model in an applied manner, we developed this article around the case of consumer categorization. The independent variables of interest were the coupon held by consumers from a previous purchase, the influence of peers, and the quality of the product.
The article has two key takeaways. First, ordinal logistic regression comes in handy when dealing with a dependent variable that can be ordered. Using multinomial logistic regression instead ignores the ordering information in the dependent variable. Second, the coefficients of ordinal logistic regression cannot be interpreted in the same manner as the coefficients of ordinary linear regression. Interpreting the coefficients in terms of marginal effects is one of the common mistakes users make when implementing the ordinal regression model. We again emphasize the graphical method for interpreting the coefficients: it makes it easy to understand the individual and joint effects of the independent variables on the likelihood of classification.
This article can be a go-to reference for understanding the basics of ordinal logistic regression and its implementation in R. We have provided commented R code throughout the article.
Credits: Chaitanya Sagar and Aman Asija of Perceptive Analytics. Perceptive Analytics is a marketing analytics and Tableau consulting company.
anytime 0.3.4
(This article was first published on Thinking inside the box , and kindly contributed to Rbloggers)
A new minor release of the anytime package is arriving on CRAN. This is the fifteenth release, and first since the 0.3.3 release in November.
anytime is a very focused package aiming to do just one thing really well: to convert anything in integer, numeric, character, factor, ordered, … format to either POSIXct or Date objects – and to do so without requiring a format string. See the anytime page, or the GitHub README.md for a few examples.
This release is mostly internal: it switches to the excellent new tinytest package, tweaks the iso8601() format helper, which now uses a 'T' between date and time (a breaking change, with the usual addition of an option to get the old behaviour back), and a little more. The full list of changes follows.
Changes in anytime version 0.3.4 (20190618)
- Documentation was updated about a 'Europe/London' conversion issue (#84, inter alia).
- The package is now compiled under the C++11 standard.
- The package now uses tinytest for unit tests.
- The iso8601() function now places a 'T' between date and time; an option switches to the prior format using a space.
- The vignette is now pre-made and included as-is in a Sweave document, reducing the number of suggested packages.
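The new 'T' separator follows the ISO 8601 date/time format. To see what the change means in practice, the same two shapes can be produced with base R's format() (the timestamp below is arbitrary, chosen only for illustration):

```r
x <- as.POSIXct("2019-06-18 10:20:30", tz = "UTC")
format(x, "%Y-%m-%dT%H:%M:%S")  # new-style: "T" between date and time
format(x, "%Y-%m-%d %H:%M:%S")  # old-style: space between date and time
```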
Courtesy of CRANberries, there is a comparison to the previous release. More information is on the anytime page. The issue tracker off the GitHub repo can be used for questions and comments.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.
Understanding AdaBoost – or how to turn Weakness into Strength
(This article was first published on RBloggers – Learning Machines, and kindly contributed to Rbloggers)
Many of you might have heard of the concept “Wisdom of the Crowd”: when many people independently guess some quantity, e.g. the number of marbles in a glass jar, the average of their guesses is often pretty accurate – even though many of the guesses are totally off.
The same principle is at work in so-called ensemble methods, like bagging and boosting. If you want to know more about boosting and how to turn the pseudocode of a scientific paper into valid R code, read on…
We start from an original paper of one of the authors of the first practical boosting algorithm, i.e. AdaBoost: Robert E. Schapire: Explaining AdaBoost. The first sentence of the introduction gives the big idea:
Boosting is an approach to machine learning based on the idea of creating a highly accurate prediction rule by combining many relatively weak and inaccurate rules.
The second page gives the pseudocode of AdaBoost…:
Given: (x_1, y_1), …, (x_m, y_m) where x_i ∈ X, y_i ∈ {-1, +1}.
Initialize: D_1(i) = 1/m for i = 1, …, m.
For t = 1, …, T:
 Train weak learner using distribution D_t.
 Get weak hypothesis h_t : X → {-1, +1}.
 Aim: select h_t with low weighted error: e_t = Pr_{i~D_t}[h_t(x_i) ≠ y_i].
 Choose a_t = (1/2) ln((1 - e_t) / e_t).
 Update, for i = 1, …, m: D_{t+1}(i) = D_t(i) exp(-a_t y_i h_t(x_i)) / Z_t,
where Z_t is a normalization factor (chosen so that D_{t+1} will be a distribution).
Output the final hypothesis: H(x) = sign(Σ_{t=1}^{T} a_t h_t(x)).
… with some explanation:
[…] we are given labeled training examples (x_1, y_1), …, (x_m, y_m) where the x_i are in some domain X, and the labels y_i ∈ {-1, +1}. On each round t = 1, …, T, a distribution D_t is computed as in the figure over the m training examples, and a given weak learner or weak learning algorithm is applied to find a weak hypothesis h_t : X → {-1, +1}, where the aim of the weak learner is to find a weak hypothesis with low weighted error e_t relative to D_t. The final or combined hypothesis H computes the sign of a weighted combination of weak hypotheses, F(x) = Σ_{t=1}^{T} a_t h_t(x).
This is equivalent to saying that H is computed as a weighted majority vote of the weak hypotheses h_t where each is assigned weight a_t. ([…] we use the terms “hypothesis” and “classifier” interchangeably.)
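To see the reweighting mechanics in isolation, here is a minimal base-R sketch of a single boosting round on four toy examples (all numbers invented for illustration); the misclassified example's weight grows while the others shrink:

```r
D <- rep(1/4, 4)              # uniform initial distribution D_1
y <- c( 1,  1, -1, -1)        # true labels
h <- c( 1, -1, -1, -1)        # weak hypothesis: misclassifies example 2

e <- sum((h != y) * D)        # weighted error: 0.25
a <- 0.5 * log((1 - e) / e)   # alpha ~ 0.55

D2 <- D * exp(-a * y * h)     # up-weight mistakes, down-weight correct ones
D2 <- D2 / sum(D2)            # normalize so D_2 is again a distribution
D2
# example 2 now carries half the total weight; the rest share the other half
```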
So, AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favor of those instances misclassified by previous ones. But to really understand what is going on, my approach has always been that you haven’t really understood something until you have built it yourself…
Perhaps you might want to try to translate the pseudocode into R code before reading on… (To increase your motivation, I frankly admit that I also had some errors in my first implementation… which provides a good example of how strong the R community is, because I posted it on Stack Overflow and got a perfect answer two hours later: What is wrong with my implementation of AdaBoost?)
Anyway, here is my implementation (the data can be found here: http://freakonometrics.free.fr/myocarde.csv):
library(rpart)
library(OneR)

maxdepth <- 1
T <- 100 # number of rounds

# Given: (x_1, y_1),...,(x_m, y_m) where x_i element of X, y_i element of {-1, +1}
myocarde <- read.table("data/myocarde.csv", header = TRUE, sep = ";")
y <- (myocarde[ , "PRONO"] == "SURVIE") * 2 - 1
x <- myocarde[ , 1:7]
m <- nrow(x)
data <- data.frame(x, y)

# Initialize: D_1(i) = 1/m for i = 1,...,m
D <- rep(1/m, m)
H <- replicate(T, list())
a <- vector(mode = "numeric", T)
set.seed(123)

# For t = 1,...,T
for(t in 1:T) {
  # Train weak learner using distribution D_t
  # Get weak hypothesis h_t: X -> {-1, +1}
  H[[t]] <- rpart(y ~ ., data = data, weights = D, maxdepth = maxdepth, method = "class")
  # Aim: select h_t with low weighted error: e_t = Pr_i~D_t[h_t(x_i) != y_i]
  h <- predict(H[[t]], x, type = "class")
  e <- sum((h != y) * D)
  # Choose a_t = 0.5 * log((1-e) / e)
  a[t] <- 0.5 * log((1-e) / e)
  # Update for i = 1,...,m: D_t+1(i) = (D_t(i) * exp(-a_t * y_i * h_t(x_i))) / Z_t
  # where Z_t is a normalization factor (chosen so that D_t+1 will be a distribution)
  D <- D * exp(-a[t] * y * as.numeric(as.character(h)))
  D <- D / sum(D)
}

# Output the final hypothesis: H(x) = sign(sum of a_t * h_t(x) for t = 1 to T)
newdata <- x
H_x <- sapply(H, function(x) as.numeric(as.character(predict(x, newdata = newdata, type = "class"))))
H_x <- t(a * t(H_x))
pred <- sign(rowSums(H_x))
eval_model(pred, y)
##
## Confusion matrix (absolute):
##           Actual
## Prediction -1  1 Sum
##        -1  29  0  29
##        1    0 42  42
##        Sum 29 42  71
##
## Confusion matrix (relative):
##           Actual
## Prediction   -1    1  Sum
##        -1  0.41 0.00 0.41
##        1   0.00 0.59 0.59
##        Sum 0.41 0.59 1.00
##
## Accuracy:
## 1 (71/71)
##
## Error rate:
## 0 (0/71)
##
## Error rate reduction (vs. base rate):
## 1 (p-value < 2.2e-16)

Let’s compare this with the result from the package JOUSBoost (on CRAN):
library(JOUSBoost)
## JOUSBoost 2.1.0

boost <- adaboost(as.matrix(x), y, tree_depth = maxdepth, n_rounds = T)
pred <- predict(boost, x)
eval_model(pred, y)
##
## Confusion matrix (absolute):
##           Actual
## Prediction -1  1 Sum
##        -1  29  0  29
##        1    0 42  42
##        Sum 29 42  71
##
## Confusion matrix (relative):
##           Actual
## Prediction   -1    1  Sum
##        -1  0.41 0.00 0.41
##        1   0.00 0.59 0.59
##        Sum 0.41 0.59 1.00
##
## Accuracy:
## 1 (71/71)
##
## Error rate:
## 0 (0/71)
##
## Error rate reduction (vs. base rate):
## 1 (p-value < 2.2e-16)

As you can see: zero errors, as with my implementation. Two additional remarks are in order:
An accuracy of 100% hints at one of the problems of boosting: it is prone to overfitting (see also Learning Data Science: Modelling Basics).
The second problem is the lack of interpretability: whereas decision trees are normally well interpretable, ensembles of them are not. This is also known as the Accuracy-Interpretability Trade-Off (another often used ensemble method is random forests, see also here: Learning Data Science: Predicting Income Brackets).
Hope that this post was helpful for you to understand the widely used boosting methodology better and to see how you can get from pseudocode to valid R code. If you have any questions or feedback please let me know in the comments – Thank you and stay tuned!
radian: a modern console for R
(This article was first published on From the Bottom of the Heap  R, and kindly contributed to Rbloggers)
Whenever I’m developing R code or writing data wrangling or analysis scripts for research projects that I work on, I use Emacs and its add-on package Emacs Speaks Statistics (ESS). I’ve done so for nigh on a couple of decades now, ever since I switched full time to running Linux as my daily OS. For years this has served me well, though I wouldn’t call myself an Emacs expert; not even close! With a bit of help from some R Core coding standards document I got indentation working how I like it, I learned to contort my fingers in weird and wonderful ways to execute a small set of useful shortcuts, and I even committed some of those shortcuts to memory. More recently, however, my go-to methods for configuring Emacs+ESS were failing; indentation was all over the shop, the smart _ stopped working or didn’t work as it had for over a decade, syntax highlighting of R-related files, like .Rmd, was hit and miss, and polymode was just a mystery to me. Configuring Emacs+ESS was becoming much more of a chore, and rather unhelpfully, my problems coincided with my having less and less time to devote to tinkering with my computer setups. Also, fiddling with this stuff just wasn’t fun any more. So, in a fit of pique following one too many reconfiguration sessions of Emacs+ESS, I went in search of some greener grass. During that search I came across radian, a neat, attractive, simple console for working with R.
Written by Randy Lai, radian is a crossplatform console for R that provides code completion, syntax highlighting, etc in a neat little package that runs in a shell or terminal, such as Bash. I’m someone who fires up multiple terminals every day to run some bit of R code, to show a student how to do something, to quickly check on argument names or such like, or prepare an answer to a question on stackoverflow or crossvalidated. Running R in a terminal after using an IDE/environment like Emacs+ESS or RStudio is an exercise in time travel; all those little helpful editing tools the IDE provides are missing and you’re coding like it was the 1980s all over again. radian changes all that.
radian is a Python application, so to run it you’ll need a Python stack installed. You’ll also need a relatively recent version of R (≥ 3.4.0). Using pip, the Python package installer, installing radian is straightforward. Python v3 is recommended, and on Fedora this meant I had to install using
pip3 install --user radian

The --user flag does a user install, which sets the installation location to be inside your home directory. Once installed, you can start radian by simply typing the application name and hitting enter
radian

A nice configuration tip included in the radian README.md is to alias the radian command to r, so that running R runs the standard R console, while running r starts radian. On Fedora, you configure this alias in your ~/.bashrc file
alias r="radian"

Having started radian you’ll see something like this
radian at startup running in a bash shell on Fedora

radian starts up with a simple statement of the R version running in radian and the platform (OS) it’s running on; so is it just a less-verbose version of the standard R console? The radian prompt hints at the greater capabilities however.
Code completion is a nice addition; yes, you have some form of code completion in the standard R console, but in radian we have a more RStudio- or Emacs+ESS-like experience, with a dropdown menu for object, function, argument, and filename completion. To activate this you start typing and hit Tab, and the relevant completions pop up. Hit Tab again or press the down cursor and you can scroll through the potential completions.
Code completion in radian

We also get nice syntax highlighting of R code using the colour schemes from pygments:
Syntax highlighting in radian using the monokai theme

And, if you’re copying & pasting code into the terminal or piping code in from an editor with an embedded terminal (that’s running radian) then you also get rather handy multiline editing. Pressing the up cursor ↑ will retrieve the previous set of commands pasted or piped into radian, and repeatedly pressing ↑ will scroll back through the history. If you want to edit a set of R calls, instead of pressing ↑ again, press ← to enter the chunk of code and then you can move around among the lines using the cursor keys, editing as you see fit. Hitting enter will run the entire chunk of code for you, edits and all:
Multiline editing a ggplot call in radian

You can configure aspects of the behaviour of radian via options() in your .Rprofile. The options I’m currently using on the computer used for the screenshots in this post are:
options(radian.auto_indentation = FALSE)
options(radian.color_scheme = "monokai")

but on my laptop I’m currently using
# auto match brackets and quotes
options(radian.auto_match = TRUE)
# auto indentation for new line and curly braces
options(radian.auto_indentation = TRUE)
options(radian.tab_size = 4)
# timeout in seconds to cancel completion if it takes too long
# set it to 0 to disable it
options(radian.completion_timeout = 0.05)
# insert new line between prompts
options(radian.insert_new_line = FALSE)

The last option is something I’m not sure about yet; as you can see in the screenshots, there’s a new line between the prompts, which makes it super easy to read the R code you’ve entered, but with the font I’m currently using (Iosevka) things look a bit too spread out. Setting radian.insert_new_line = FALSE, as I have it on the laptop, results in more standard behaviour but it can feel a little cramped. I’ll probably play with both options and see which I like best after a few more weeks of use.
You can also define shortcuts. This is useful for entering the assignment operator <-, which I have bound to Alt + - using
options(radian.escape_key_map = list(
  list(key = "-", value = " <- ")
))

where I’ve added spaces around the operator to mimic how the smart underscore works in Emacs+ESS.
I’m really liking using radian for my throwaway R sessions that I typically do in a terminal. The only issue I’ve noticed is that it is a little slow to print tibbles, and clearly it’s not going to replace my current IDE — that’s not what it is designed for. That said, radian can be run inside any app that can run a terminal and I’ve had this running inside VS Code for example, which was nice.
If you have any comments on radian or other R consoles, let me know what you think below; if you’ve used radian I’m especially interested in your experience with it.
To leave a comment for the author, please follow the link and comment on their blog: From the Bottom of the Heap - R.
Parametric survival modeling
(This article was first published on Devin Incerti’s blog, and kindly contributed to Rbloggers)
Introduction

Survival analysis is used to analyze the time until the occurrence of an event (or multiple events). Cox models—which are often referred to as semiparametric because they do not assume any particular baseline survival distribution—are perhaps the most widely used technique; however, Cox models are not without limitations, and parametric approaches can be advantageous in many contexts. For instance, parametric survival models are essential for extrapolating survival outcomes beyond the available follow-up data. For this reason they are nearly always used in health-economic evaluations, where it is necessary to consider the lifetime health effects (and costs) of medical interventions.
R contains a large number of packages related to biostatistics and its support for parametric survival modeling is no different. Below we will examine a range of parametric survival distributions, their specifications in R, and the hazard shapes they support. We will then show how the flexsurv package can make parametric regression modeling of survival data straightforward.
Survival distributions

The primary quantity of interest in survival analysis is the survivor function, defined as the probability of survival beyond time $t$,
S(t) = \Pr(T > t) = 1 - F(t),
where $T$ is a random variable denoting the time that the event occurs. The survival function is the complement of the cumulative distribution function (CDF), $F(t) = \int_0^t f(u)du$, where $f(t)$ is the probability density function (PDF).
The hazard function, or the instantaneous rate at which an event occurs at time $t$ given survival until time $t$, is given by
\lambda (t) = \frac{f(t)}{S(t)}.
The survivor function can also be expressed in terms of the cumulative hazard function, $\Lambda(t) = \int_0^t \lambda (u)du$,
S(t) = e^{-\Lambda(t)}.
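These identities can be checked numerically with base R alone. A small sketch for the exponential distribution, whose hazard is constant and equal to the rate:

```r
# Numerical check of lambda(t) = f(t)/S(t) and S(t) = exp(-Lambda(t))
# for the exponential distribution with rate 0.5, using only base R.
rate <- 0.5
t <- 2

f <- dexp(t, rate = rate)        # density f(t)
S <- 1 - pexp(t, rate = rate)    # survivor S(t) = 1 - F(t)
haz <- f / S                     # hazard lambda(t) = f(t)/S(t)

# cumulative hazard Lambda(t) by numerical integration of the hazard
Lambda <- integrate(function(u) dexp(u, rate) / (1 - pexp(u, rate)),
                    lower = 0, upper = t)$value

stopifnot(all.equal(haz, rate))                          # constant hazard
stopifnot(all.equal(S, exp(-Lambda), tolerance = 1e-6))  # S(t) = exp(-Lambda(t))
```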
R functions for parametric distributions used for survival analysis are shown in the table below. The default stats package contains functions for the PDF, the CDF, and random number generation for many of the distributions. Additional distributions as well as support for hazard functions are provided by flexsurv.
| Distribution      | PDF                  | CDF                  | Hazard               | Random sampling      |
|-------------------|----------------------|----------------------|----------------------|----------------------|
| Exponential       | stats::dexp          | stats::pexp          | flexsurv::hexp       | stats::rexp          |
| Weibull (AFT)     | stats::dweibull      | stats::pweibull      | flexsurv::hweibull   | stats::rweibull      |
| Weibull (PH)      | flexsurv::dweibullPH | flexsurv::pweibullPH | flexsurv::hweibullPH | flexsurv::rweibullPH |
| Gompertz          | flexsurv::dgompertz  | flexsurv::pgompertz  | flexsurv::hgompertz  | flexsurv::rgompertz  |
| Gamma             | stats::dgamma        | stats::pgamma        | flexsurv::hgamma     | stats::rgamma        |
| Lognormal         | stats::dlnorm        | stats::plnorm        | flexsurv::hlnorm     | stats::rlnorm        |
| Log-logistic      | flexsurv::dllogis    | flexsurv::pllogis    | flexsurv::hllogis    | flexsurv::rllogis    |
| Generalized gamma | flexsurv::dgengamma  | flexsurv::pgengamma  | flexsurv::hgengamma  | flexsurv::rgengamma  |

The parameterizations of these distributions in R are shown in the next table. The parameter of primary interest (in flexsurv) is colored in red—it is known as the location parameter and typically governs the mean or location of each distribution. The other parameters are ancillary parameters that determine the shape, variance, or higher moments of the distribution. These parameters impact the hazard function, which can take a variety of shapes depending on the distribution:
 the exponential distribution only supports a constant hazard;
 the Weibull, Gompertz, and gamma distributions support monotonically increasing and decreasing hazards;
 the log-logistic and lognormal distributions support arc-shaped and monotonically decreasing hazards; and
 the generalized gamma distribution supports arc-shaped, bathtub-shaped, monotonically increasing, and monotonically decreasing hazards.
[Table of parameterizations omitted; only the generalized gamma row and the table footnotes survive extraction.] For the generalized gamma distribution, the CDF is

F(t) = \begin{cases} \frac{\gamma(Q^{-2}, u)}{\Gamma(Q^{-2})} & \text{if } Q \neq 0 \\ \Phi(w) & \text{if } Q = 0 \end{cases}

the hazard is $f(t)/S(t)$, and the parameters are $\color{red}{\text{mu}} = \mu \in (-\infty, \infty)$, $\text{sigma} = \sigma > 0$, and $\text{Q} = Q \in (-\infty, \infty)$; the supported hazard shapes are arc-shaped, bathtub-shaped, and monotonically increasing/decreasing.

a The proportional hazards (PH) model is a reparameterization of the accelerated failure time (AFT) model with $m = b^{-a}$. b $\Gamma(z) = \int_{0}^{\infty} x^{z-1}e^{-x} dx$ is the gamma function and $\gamma(s,x) = \int_{0}^{x} t^{s-1}e^{-t}dt$ is the lower incomplete gamma function. c $w = (\log(t) - \mu)/\sigma; u = Q^{-2}e^{Qw}$.

Shapes of hazard functions
We will now examine the shapes of the hazards in a bit more detail and show how both the location and shape vary with the parameters of each distribution. Readers interested in a more interactive experience can also view my Shiny app here.
To do so we will load some needed packages: we will use flexsurv to compute the hazards, data.table as a fast alternative to data.frame, and ggplot2 for plotting.
library("flexsurv")
library("data.table")
library("ggplot2")
ggplot2::theme_set(theme_minimal())

We can create a general function for computing hazards for any hazard function given combinations of parameter values at different time points. The key to the function is mapply, a multivariate version of sapply. To illustrate, let's compute the hazard from a Weibull distribution given 3 values each of the shape and scale parameters at time points 1 and 2. The output is a matrix where each row corresponds to a time point and each column is a combination of the shape and scale parameters. For example, the second row and third column is the hazard at time point 2 given a shape parameter of 1.5 and a scale parameter of 1.75.
mapply(flexsurv::hweibull,
       shape = c(.5, 1, 1.5),
       scale = c(.25, 1, 1.75),
       MoreArgs = list(x = 1:2))
##           [,1] [,2]      [,3]
## [1,] 1.0000000    1 0.6479391
## [2,] 0.7071068    1 0.9163243

The more general function uses mapply to return a data.table of hazards at all possible combinations of the parameter values and time points. Factor variables and intuitive names are also returned to facilitate plotting with ggplot2.
hazfun <- function(FUN, param_vals, times){
  # Compute hazard for all possible combinations of parameters and times
  values <- do.call("mapply",
                    c(list(FUN = FUN),
                      as.list(expand.grid(param_vals)),
                      list(MoreArgs = list(x = times))))
  x <- data.table(expand.grid(c(param_vals, list(time = times))),
                  value = c(t(values)))

  # Create factor variables and intuitive names for plotting
  param_names <- names(param_vals)
  x[[param_names[1]]] <- factor(x[[param_names[1]]],
                                levels = unique(x[[param_names[1]]]))
  if (length(param_vals) > 1){
    for (j in 2:length(param_vals)){
      ordered_vals <- unique(x[[param_names[j]]])
      x[[param_names[j]]] <- paste0(param_names[j], " = ", x[[param_names[j]]])
      factor_levels <- paste0(param_names[j], " = ", ordered_vals)
      x[[param_names[j]]] <- factor(x[[param_names[j]]],
                                    levels = factor_levels)
    }
  }

  # Return
  return(x)
}

Exponential distribution

The exponential distribution is parameterized by a single rate parameter and only supports a hazard that is constant over time. The hazard is simply equal to the rate parameter.
times <- seq(1, 10, 1)
rate <- seq(1, 5, 1)
haz_exp <- hazfun(flexsurv::hexp,
                  list(rate = rate),
                  times = times)

ggplot(haz_exp, aes(x = time, y = value, col = rate)) +
  geom_line() +
  xlab("Time") + ylab("Hazard") +
  scale_y_continuous(breaks = rate)

Weibull distribution (AFT)

The Weibull distribution can be parameterized as either an accelerated failure time (AFT) model or a proportional hazards (PH) model. The parameterization in the base stats package is an AFT model.
The hazard is decreasing for shape parameter $a < 1$ and increasing for $a > 1$. For $a = 1$, the Weibull distribution is equivalent to an exponential distribution with rate parameter $1/b$ where $b$ is the scale parameter.
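The special case $a = 1$ can be verified directly from the base stats functions, without flexsurv, by computing the hazard as $f(t)/S(t)$:

```r
# Sketch: with shape a = 1 the Weibull hazard reduces to the constant 1/b,
# i.e. an exponential with rate 1/b. Computed from dweibull/pweibull.
b <- 4                       # scale parameter
t <- c(0.5, 1, 2, 5)
haz_wei <- dweibull(t, shape = 1, scale = b) /
  (1 - pweibull(t, shape = 1, scale = b))

stopifnot(all.equal(haz_wei, rep(1 / b, length(t))))  # constant hazard 1/b
```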
wei_shape <- seq(.25, 3, .25)
haz_wei <- hazfun(flexsurv::hweibull,
                  list(scale = seq(2, 5, .5),
                       shape = wei_shape),
                  times = times)

ggplot(haz_wei, aes(x = time, y = value, col = scale)) +
  geom_line() +
  facet_wrap(~shape, scales = "free_y") +
  xlab("Time") + ylab("Hazard")

Weibull distribution (PH)

flexsurv provides an alternative PH parameterization of the Weibull model with the same shape parameter $a$ and a scale parameter $m = b^{-a}$, where $b$ is the scale parameter in the AFT model.
The hazard is again decreasing for $a < 1$, constant for $a = 1$, and increasing for $a > 1$. Note that for $a = 1$, the PH Weibull distribution is equivalent to an exponential distribution with rate parameter $m$.
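The reparameterization can be checked with the closed-form hazards alone: the AFT hazard $(a/b)(t/b)^{a-1}$ equals the PH hazard $a m t^{a-1}$ when $m = b^{-a}$. A plain base-R sketch:

```r
# Check that the AFT and PH Weibull hazards agree under m = b^(-a).
a <- 1.5                     # shape (shared by both parameterizations)
b <- 2                       # AFT scale
m <- b^(-a)                  # PH scale implied by the AFT parameters
t <- c(0.5, 1, 2, 4)

haz_aft <- (a / b) * (t / b)^(a - 1)   # AFT hazard
haz_ph  <- a * m * t^(a - 1)           # PH hazard

stopifnot(all.equal(haz_aft, haz_ph))
```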
haz_weiPH <- hazfun(flexsurv::hweibullPH,
                    list(scale = seq(2, 5, .5),
                         shape = wei_shape),
                    times = times)

ggplot(haz_weiPH, aes(x = time, y = value, col = scale)) +
  geom_line() +
  facet_wrap(~shape, scales = "free_y") +
  xlab("Time") + ylab("Hazard")

Gompertz distribution

The Gompertz distribution is parameterized by a shape parameter $a$ and rate parameter $b$. The hazard is increasing for $a > 0$, constant for $a = 0$, and decreasing for $a < 0$. When $a = 0$, the Gompertz distribution is equivalent to an exponential distribution with rate parameter $b$.
haz_gomp <- hazfun(flexsurv::hgompertz,
                   list(rate = seq(.5, 3, .5),
                        shape = seq(-2, 2, .5)),
                   times = times)

ggplot(haz_gomp, aes(x = time, y = value, col = rate)) +
  geom_line() +
  facet_wrap(~shape, scales = "free_y") +
  xlab("Time") + ylab("Hazard")

Gamma distribution

The gamma distribution is parameterized by a shape parameter $a$ and a rate parameter $b$. Like the Weibull distribution, the hazard is decreasing for $a < 1$, constant for $a = 1$, and increasing for $a > 1$. In the case where $a = 1$, the gamma distribution is an exponential distribution with rate parameter $b$.
haz_gam <- hazfun(flexsurv::hgamma,
                  list(rate = seq(.5, 3, .5),
                       shape = c(1e-5, .5, 1, seq(2, 6))),
                  times = times)

ggplot(haz_gam, aes(x = time, y = value, col = rate)) +
  geom_line() +
  facet_wrap(~shape, scales = "free_y") +
  xlab("Time") + ylab("Hazard")

Lognormal distribution

The lognormal distribution is parameterized by the mean $\mu$ and standard deviation $\sigma$ of survival time on the log scale. The lognormal hazard is either monotonically decreasing or arc-shaped. Note that the shape of the hazard depends on the values of both $\mu$ and $\sigma$.
haz_lnorm <- hazfun(flexsurv::hlnorm,
                    list(meanlog = seq(0, 4, .5),
                         sdlog = seq(.5, 3, .5)),
                    times = seq(0, 20, .1))

ggplot(haz_lnorm, aes(x = time, y = value, col = meanlog)) +
  geom_line() +
  facet_wrap(~sdlog, scales = "free_y") +
  xlab("Time") + ylab("Hazard")

Log-logistic distribution

The log-logistic distribution is parameterized by a shape parameter $a$ and a scale parameter $b$. When $a > 1$, the hazard function is arc-shaped, whereas when $a \leq 1$, the hazard function is monotonically decreasing.
haz_llogis <- hazfun(flexsurv::hllogis,
                     list(scale = seq(.5, 4, .5),
                          shape = seq(.5, 3, .5)),
                     times = seq(0, 20, 1))

ggplot(haz_llogis, aes(x = time, y = value, col = scale)) +
  geom_line() +
  facet_wrap(~shape, scales = "free_y") +
  xlab("Time") + ylab("Hazard")

Generalized gamma distribution

The generalized gamma distribution is parameterized by a location parameter $\mu$, a scale parameter $\sigma$, and a shape parameter $Q$. It is the most flexible distribution reviewed in this post and includes the exponential ($Q = \sigma = 1$), Weibull ($Q = 1$), gamma ($Q = \sigma$), and lognormal ($Q = 0$) distributions as special cases.
The generalized gamma distribution is quite flexible, supporting hazard functions that are monotonically increasing, monotonically decreasing, arc-shaped, and bathtub-shaped. In the figure below, each row corresponds to a unique value of $\sigma$ and each column corresponds to a unique value of $Q$.
haz_gengamma <- hazfun(flexsurv::hgengamma,
                       list(mu = seq(1, 4, .5),
                            sigma = seq(.5, 2, .5),
                            Q = seq(-3, 3, 1)),
                       times = seq(0, 30, 1))

ggplot(haz_gengamma, aes(x = time, y = value, col = mu)) +
  geom_line() +
  facet_wrap(sigma ~ Q,
             ncol = length(levels(haz_gengamma$Q)),
             scales = "free") +
  xlab("Time") + ylab("Hazard") +
  theme(legend.position = "bottom") +
  guides(col = guide_legend(nrow = 2, byrow = TRUE))

Regression

In flexsurv, survival models are fit to the data using maximum likelihood. Each parameter can be modeled as a function of covariates $z$,
\alpha_l = g^{-1}\left(z^T\beta \right),
where $\alpha_l$ is the $l$th parameter and $g^{-1}()$ is the inverse of a link function (the link is typically $\log()$ if the parameter is strictly positive and the identity function if the parameter is defined on the real line).
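To make the maximum-likelihood machinery concrete, here is a sketch of the simplest case: an intercept-only exponential model with right censoring, fit with optim() on made-up toy data (this is an illustration of the principle, not flexsurv's implementation). The rate is modeled on the log scale, so the optimizer works on an unconstrained parameter:

```r
# Toy data (made up): follow-up times and an event indicator.
time   <- c(5, 8, 12, 3, 9, 14, 7, 2)
status <- c(1, 0, 1, 1, 0, 1, 1, 0)   # 1 = event, 0 = right censored

# Negative log-likelihood: events contribute log f(t), censored
# observations contribute log S(t). The log link keeps rate > 0.
negloglik <- function(lograte) {
  rate <- exp(lograte)
  -sum(status * dexp(time, rate, log = TRUE) +
       (1 - status) * pexp(time, rate, lower.tail = FALSE, log.p = TRUE))
}

fit <- optim(0, negloglik, method = "BFGS")
rate_hat <- exp(fit$par)

# The censored-exponential MLE has a closed form: events / total time at risk.
stopifnot(all.equal(rate_hat, sum(status) / sum(time), tolerance = 1e-3))
```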
We will illustrate by modeling survival in a dataset of patients with advanced lung cancer from the survival package. The dataset uses a status indicator where 2 denotes death and 1 denotes alive at the time of last follow-up; we will convert this to the more traditional coding where 1 denotes death and 0 denotes alive (censored).
dat <- data.table(survival::lung)
dat[, status := ifelse(status == 1, 0, 1)]
head(dat)
##    inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
## 1:    3  306      1  74   1       1       90       100     1175      NA
## 2:    3  455      1  68   1       0       90        90     1225      15
## 3:    3 1010      0  56   1       0       90        90       NA      15
## 4:    5  210      1  57   1       1       90        60     1150      11
## 5:    1  883      1  60   1       0      100        90       NA       0
## 6:   12 1022      0  74   1       1       50        80      513       0

Intercept-only model

We will begin by estimating intercept-only parametric regression models (i.e., without covariates). But first, it's helpful to estimate the hazard function (among all patients) using nonparametric techniques. We can do this using the kernel density estimator from the muhaz package.
library("muhaz")
kernel_haz_est <- muhaz(dat$time, dat$status)
kernel_haz <- data.table(time = kernel_haz_est$est.grid,
                         est = kernel_haz_est$haz.est,
                         method = "Kernel density")

Then we can use flexsurv to estimate intercept-only models for a range of probability distributions. The hazard function for each fitted model is returned using summary.flexsurvreg().
dists <- c("exp", "weibull", "gompertz", "gamma",
           "lognormal", "llogis", "gengamma")
dists_long <- c("Exponential", "Weibull (AFT)", "Gompertz", "Gamma",
                "Lognormal", "Log-logistic", "Generalized gamma")

parametric_haz <- vector(mode = "list", length = length(dists))
for (i in 1:length(dists)){
  fit <- flexsurvreg(Surv(time, status) ~ 1, data = dat, dist = dists[i])
  parametric_haz[[i]] <- summary(fit, type = "hazard",
                                 ci = FALSE, tidy = TRUE)
  parametric_haz[[i]]$method <- dists_long[i]
}
parametric_haz <- rbindlist(parametric_haz)

We can plot the hazard functions from the parametric models and compare them to the kernel density estimate.
haz <- rbind(kernel_haz, parametric_haz)
haz[, method := factor(method,
                       levels = c("Kernel density", dists_long))]
n_dists <- length(dists)

ggplot(haz, aes(x = time, y = est, col = method, linetype = method)) +
  geom_line() +
  xlab("Days") + ylab("Hazard") +
  scale_colour_manual(name = "",
                      values = c("black", rainbow(n_dists))) +
  scale_linetype_manual(name = "",
                        values = c(1, rep_len(2:6, n_dists)))

The kernel density estimate is monotonically increasing and its slope increases considerably after around 500 days. The arc-shaped lognormal and log-logistic hazards and the constant exponential hazard do not fit the data well. The best performing models are those that support monotonically increasing hazards (Gompertz, Weibull, gamma, and generalized gamma). The flexible generalized gamma and the Gompertz models perform best, with the Gompertz modeling the increase in the slope of the hazard most closely.
Adding covariates

As mentioned above, each parameter can be modeled as a function of covariates. By default, flexsurv only uses covariates to model the location parameter. To demonstrate, we will let the rate parameter of the Gompertz distribution depend on the ECOG performance score (0 = good, 5 = dead), which describes a patient's level of functioning and has been shown to be a prognostic factor for survival. The model is fit using flexsurvreg().
dat[, ph.ecog := factor(ph.ecog)]

# Covariates on rate parameter only
fit_covs1 <- flexsurvreg(Surv(time, status) ~ ph.ecog,
                         data = dat, dist = "gompertz")
print(fit_covs1)
## Call:
## flexsurvreg(formula = Surv(time, status) ~ ph.ecog, data = dat,
##     dist = "gompertz")
##
## Estimates:
##           data mean  est        L95%       U95%       se        exp(est)
## shape           NA    0.001595   0.000925   0.002265   0.000342        NA
## rate            NA    0.001070   0.000725   0.001579   0.000212        NA
## ph.ecog1  0.497797   0.358883  -0.029676   0.747442   0.198248  1.431729
## ph.ecog2  0.220264   0.910197   0.469959   1.350435   0.224615  2.484812
## ph.ecog3  0.004405   1.973828  -0.020522   3.968179   1.017544  7.198181
##           L95%       U95%
## shape            NA         NA
## rate             NA         NA
## ph.ecog1   0.970760   2.111591
## ph.ecog2   1.599929   3.859102
## ph.ecog3   0.979687  52.888129
##
## N = 227,  Events: 164,  Censored: 63
## Total time at risk: 69522
## Log-likelihood = -1139.984, df = 5
## AIC = 2289.967

Covariates for ancillary parameters can be supplied using the anc argument to flexsurvreg().
# Covariates on rate and shape parameter
fit_covs2 <- flexsurvreg(Surv(time, status) ~ ph.ecog,
                         anc = list(shape = ~ph.ecog),
                         data = dat, dist = "gompertz")
print(fit_covs2)
## Call:
## flexsurvreg(formula = Surv(time, status) ~ ph.ecog, anc = list(shape = ~ph.ecog),
##     data = dat, dist = "gompertz")
##
## Estimates:
##                  data mean  est        L95%       U95%
## shape                  NA    1.84e-03   5.54e-04   3.12e-03
## rate                   NA    9.89e-04   5.75e-04   1.70e-03
## ph.ecog1         4.98e-01   4.18e-01  -2.28e-01   1.06e+00
## ph.ecog2         2.20e-01   1.11e+00   3.99e-01   1.82e+00
## ph.ecog3         4.41e-03  -3.02e+02  -9.48e+02   3.43e+02
## shape(ph.ecog1)  4.98e-01  -1.77e-04  -1.75e-03   1.40e-03
## shape(ph.ecog2)  2.20e-01  -7.61e-04  -2.77e-03   1.25e-03
## shape(ph.ecog3)  4.41e-03   2.63e+00  -2.86e+00   8.11e+00
##                  se         exp(est)   L95%       U95%
## shape            6.54e-04         NA         NA         NA
## rate             2.74e-04         NA         NA         NA
## ph.ecog1         3.30e-01   1.52e+00   7.96e-01   2.90e+00
## ph.ecog2         3.63e-01   3.04e+00   1.49e+00   6.19e+00
## ph.ecog3         3.29e+02  4.78e-132   0.00e+00  1.32e+149
## shape(ph.ecog1)  8.03e-04   1.00e+00   9.98e-01   1.00e+00
## shape(ph.ecog2)  1.02e-03   9.99e-01   9.97e-01   1.00e+00
## shape(ph.ecog3)  2.80e+00   1.38e+01   5.75e-02   3.33e+03
##
## N = 227,  Events: 164,  Censored: 63
## Total time at risk: 69522
## Log-likelihood = -1134.06, df = 8
## AIC = 2284.12

The standard errors and confidence intervals are very large on the shape parameter coefficients, suggesting that they are not reliably estimated and that there is little evidence that the shape parameter depends on the ECOG score. As such, we will use the first model to predict the hazards. In flexsurv, input data for prediction can be specified using the newdata argument in summary.flexsurvreg(). We will first create this "new" dataset for prediction, consisting of each possible value of the ECOG score in the data.
newdat <- data.table(ph.ecog = sort(unique(dat[!is.na(ph.ecog)]$ph.ecog)))
newdat[, ph.ecog := factor(ph.ecog)]
print(newdat)
##    ph.ecog
## 1:       0
## 2:       1
## 3:       2
## 4:       3

We can then predict the hazard for each level of the ECOG score. The hazard increases with the ECOG score, which is expected since higher scores denote higher levels of disability. Note, however, that the shape of the hazard remains the same, since we did not find evidence that the shape parameter of the Gompertz distribution depended on the ECOG score.
haz_covs1 <- summary(fit_covs1, newdata = newdat,
                     type = "hazard", tidy = TRUE)

ggplot(haz_covs1, aes(x = time, y = est, col = ph.ecog)) +
  geom_line() +
  xlab("Days") + ylab("Hazard") +
  scale_color_discrete(name = "ECOG performance score") +
  theme(legend.position = "bottom")

Conclusion

Parametric models are a useful technique for survival analysis, particularly when there is a need to extrapolate survival outcomes beyond the available follow-up data. R provides a wide range of survival distributions, and the flexsurv package provides excellent support for parametric modeling. Parametric distributions can support a wide range of hazard shapes, including monotonically increasing, monotonically decreasing, arc-shaped, and bathtub-shaped hazards. However, in some cases even the most flexible distributions, such as the generalized gamma, may be insufficient. In these cases, flexible parametric models such as splines or fractional polynomials may be needed.
Getting from flat data a world of relationships to visualise with Gephi
(This article was first published on R on Coding Club UC3M, and kindly contributed to Rbloggers)
by Mariluz Congosto
Network analysis offers a perspective of the data that broadens and enriches any investigation. Many times we deal with data in which the elements are related, but we have them in a tabulated format that is difficult to import into network analysis tools.
Relationship data require a definition of nodes and connections. The two parts have different structures, so it is not possible to store them in a single table; at least two are needed. Network analysis tools define different input formats, one of which is GDF, characterized by its simplicity and versatility.
In this session, we will see how we can extract the relationships between elements of a file in CSV format to generate a file in GDF format to work with Gephi.
Introduction

One of the most used network analysis tools is Gephi. It is a desktop application that allows us to analyse a network in depth and visualize it. It provides functions to filter nodes and edges, calculate network parameters, apply different layouts, and give size and colour to nodes depending on different attributes.
Gephi supports a set of input formats, each enabling more or fewer features. The format with the most features is GEXF, but it has an XML structure that generates large files, and when we work with very large graphs, size matters. The GDF format is simple, compact and versatile. It allows us to work with attributes, which complement the analysis of the topology of the network.
As the Gephi documentation says:
GDF is built like a database table or a coma separated file (CSV). It supports attributes to both nodes and edges. A standard file is divided in two sections, one for nodes and one for edges. Each section has a header line, which basically is the column title. Each element (i.e. node or edge) is on a line and values are separated by coma. The GDF format is therefore very easy to read and can be easily converted from CSV.
This is a basic example of a GDF file; the first part defines the nodes and the second the connections:
nodedef>name VARCHAR,label VARCHAR
s1,Site number 1
s2,Site number 2
s3,Site number 3
edgedef>node1 VARCHAR,node2 VARCHAR
s1,s2
s2,s3
s3,s2
s3,s1

The steps to extract relationships from a CSV file and generate a file in GDF format are:
 Define the parameters for the extraction of related data.
 Define the data structures for the transformation.
 Import the data from a file in CSV format and store the data in the structures.
 Sort data according to connections.
 Prepare the data for the GDF format.
 Generate the file in GDF format.
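The final writing step can be sketched in miniature with base R alone, reproducing the toy GDF file shown above (the file name toy.gdf is just for illustration; the full converter below also handles attributes, weights and sorting):

```r
# Write a minimal two-section GDF file: a node section and an edge section,
# each with its own header line, values separated by commas.
nodes <- data.frame(name  = c("s1", "s2", "s3"),
                    label = c("Site number 1", "Site number 2", "Site number 3"))
edges <- data.frame(node1 = c("s1", "s2", "s3", "s3"),
                    node2 = c("s2", "s3", "s2", "s1"))

con <- file("toy.gdf", "w")
writeLines("nodedef>name VARCHAR,label VARCHAR", con)
write.table(nodes, con, sep = ",", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
writeLines("edgedef>node1 VARCHAR,node2 VARCHAR", con)
write.table(edges, con, sep = ",", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
close(con)

# Sanity checks on the written file
gdf <- readLines("toy.gdf")
stopifnot(gdf[1] == "nodedef>name VARCHAR,label VARCHAR")
stopifnot(length(gdf) == 9)   # 2 headers + 3 nodes + 4 edges
```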
Define the parameters for the extraction of related data

A set of parameters is defined so that this converter can be adapted to different cases. We must specify what information will be taken from the CSV file, where we will leave the result, and the graph type.
library("readr")
library("hash")
library("stringr")

source <- 3   # Column number of source entity
target <- 9   # Column number of target entity
source_attribs <- c(5, 6, 11, 12, 13)  # Column numbers of attribs
num_attribs <- length(source_attribs)  # Number of attributes per entity
null_attrib <- rep(NA, 5)  # Default value for entities without attributes
name_attribs <- c("app", "location", "hastag", "lang", "create at")  # Attrib names
name_file_csv <- "rstudio_RTs.csv"  # Name of the input file in CSV format
name_file_gdf <- "rstudio_RTs.gdf"  # Name of the output file in GDF format
directed <- TRUE  # Indicates whether the graph is directed or not

The sample CSV file can be found here.
Define the data structures for the transformation

In order to store the data, dynamic structures are needed to allow data to be added as they appear. Hash tables were chosen because they are the most appropriate for this case.
Hash tables will be used to store the nodes and the connections:

hash_nodes <- hash()               # Node attributes
hash_links <- hash()               # Total links per node
hash_links_in <- hash()            # Inbound links per node
hash_links_out <- hash()           # Outbound links per node
hash_connections <- hash()         # Number of times each pair appears
hash_connections_attrib <- hash()  # Attributes of each pair
Import the data by reading the CSV file and running through it row by row to store the nodes and connections in the hash tables.
Related entities can appear multiple times, as a source or as a target. When an entity appears for the first time, it is stored in the hash_nodes table. Attributes are associated with source entities, and null attributes with target entities; this is a criterion that this algorithm assumes, but there could be others. If an entity appears for the first time as a target, it will be assigned the null attributes, but if it appears later as a source, the null attributes will be replaced by its own.
For each entity, the number of total links (hash_links), the number of inbound links (hash_links_in) and the number of outbound links (hash_links_out) are counted. This is done to allow ordering the nodes from greater to lesser degree when generating the file in GDF format.
For each source-target entity pair, the number of times that the relation appears (hash_connections) and its attributes (hash_connections_attrib) are stored. In the first case we get the weight of the relationship, and in the second we get the associated attributes.
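The pair-counting idea (edge weight = number of times a source-target pair appears) can be illustrated in miniature with base R alone, before the full hash-table version: paste() builds the pair keys and table() accumulates the counts. The toy vectors here are made up for illustration:

```r
# Toy data: each row of a CSV would contribute one source-target pair.
source_col <- c("a", "a", "b", "a", "c")
target_col <- c("b", "b", "c", "c", "b")

pairs <- paste(source_col, target_col)   # pair key, e.g. "a b"
weights <- table(pairs)                  # occurrences per unique pair

stopifnot(weights[["a b"]] == 2)         # "a b" appears twice -> weight 2
stopifnot(weights[["b c"]] == 1)
```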
table_csv <- read_csv2(name_file_csv)
num_rows <- nrow(table_csv)
num_cols <- ncol(table_csv)

for (i in 1:num_rows) {
  node_source <- table_csv[[i, source]]
  node_target <- table_csv[[i, target]]
  print(i)
  node_source_attribs <- null_attrib
  # The attributes are stored and the ',' character is removed
  # to avoid conflict with the GDF format
  for (j in 1:num_attribs) {
    k <- source_attribs[[j]]
    raw_attrib <- table_csv[[i, k]]
    cooked_attrib <- str_replace_all(raw_attrib, ",", "")
    node_source_attribs[[j]] <- cooked_attrib
  }

  # If a source node appears for the first time, store it with attributes
  if (!(has.key(node_source, hash_nodes))) {
    hash_nodes[[node_source]] <- node_source_attribs
    hash_links[[node_source]] <- 0
    hash_links_in[[node_source]] <- 0
    hash_links_out[[node_source]] <- 0
  } else {
    # If the source node exists and has null attributes, store its own
    node_source_attribs_old <- hash_nodes[[node_source]]
    if (identical(node_source_attribs_old, null_attrib))
      hash_nodes[[node_source]] <- node_source_attribs
  }

  # Check that the target node exists
  if (!(is.na(node_target))) {
    # If a target node appears for the first time, store it with null attributes
    if (!(has.key(node_target, hash_nodes))) {
      hash_nodes[[node_target]] <- null_attrib
      hash_links[[node_target]] <- 0
      hash_links_in[[node_target]] <- 0
      hash_links_out[[node_target]] <- 0
    }
    par_nodes <- paste(node_source, node_target)
    # If a pair of nodes appears related for the first time, store the relation
    if (!(has.key(par_nodes, hash_connections))) {
      hash_connections[[par_nodes]] <- 0
      hash_connections_attrib[[par_nodes]] <- node_source_attribs
    }
    # In all cases, increase the number of connections
    hash_connections[[par_nodes]] <- hash_connections[[par_nodes]] + 1
    hash_links[[node_source]] <- hash_links[[node_source]] + 1
    hash_links_out[[node_source]] <- hash_links_out[[node_source]] + 1
    hash_links[[node_target]] <- hash_links[[node_target]] + 1
    hash_links_in[[node_target]] <- hash_links_in[[node_target]] + 1
  }
}

Sort data according to connections

The hash table object
does not have a sort method, but it has one to convert it into a list. Once we have converted hash_links and hash_connections into lists, we sort them in decreasing order of connections.
list_links <- as.list.hash(hash_links)
list_link_order <- list_links[order(unlist(list_links), decreasing = TRUE)]
list_connections <- as.list.hash(hash_connections)
list_connections_order <- list_connections[order(unlist(list_connections), decreasing = TRUE)]

Prepare the data for the GDF format

In this step we put the nodes and links into GDF format, in descending order by number of connections.
In the GDF format, the only data required for the definition of nodes is the name of the node, but attributes can be added. In this case, three fixed attributes are included: the total number of links, the number of inbound links and the number of outbound links. Since the GDF format is readable, these attributes give an idea of the most relevant nodes even before importing them into Gephi. The attributes configured in the parameters are also added.
The information on the nodes is stored in a matrix with one row per node and one column per configured attribute plus four.
For the definition of links, only the source and target nodes are required, but we can also extend them with attributes. In this case we add the weight of the relation, a boolean variable indicating whether the graph is directed (by default it is not), and the attributes configured in the parameters.
The information on the links is stored in a matrix with one row per connected pair and one column per configured attribute plus four.
# Definition of nodes
num_nodes <- length(list_links)
table_nodes <- matrix(nrow = num_nodes, ncol = num_attribs + 4)
num_nodes_connected <- 0
for (i in 1:num_nodes) {
  name_node <- names(list_link_order)[i]
  if (hash_links[[name_node]] > 0) {
    num_nodes_connected <- num_nodes_connected + 1
    table_nodes[i, 1] <- name_node
    table_nodes[i, 2] <- hash_links[[name_node]]
    table_nodes[i, 3] <- hash_links_in[[name_node]]
    table_nodes[i, 4] <- hash_links_out[[name_node]]
    node_attrib <- hash_nodes[[name_node]]
    for (j in 1:num_attribs)
      table_nodes[i, 4 + j] <- node_attrib[[j]]
  }
}
# Only connected nodes are considered
k <- num_attribs + 4
table_nodes <- table_nodes[1:num_nodes_connected, 1:k]

# Definition of links
num_connections <- length(list_connections)
table_connections <- matrix(nrow = num_connections, ncol = num_attribs + 4)
for (i in 1:num_connections) {
  name_conexion <- names(list_connections_order)[i]
  source_target <- strsplit(name_conexion, " ")
  table_connections[i, 1] <- source_target[[1]][1]
  table_connections[i, 2] <- source_target[[1]][2]
  table_connections[i, 3] <- list_connections_order[[i]]
  table_connections[i, 4] <- directed
  connection_attrib <- hash_connections_attrib[[name_conexion]]
  for (j in 1:num_attribs)
    table_connections[i, 4 + j] <- connection_attrib[[j]]
}

Generate the file in GDF format

The last step is to write the file in GDF format. We only have to add the headers before writing the information of the nodes and links.
# Definition of nodes
head_nodes <- "nodedef>name VARCHAR,links INT,links_in INT,links_out INT"
for (j in 1:num_attribs) {
  attrib_type <- paste(name_attribs[[j]], "VARCHAR", sep = " ")
  head_nodes <- paste(head_nodes, attrib_type, sep = ",")
}
write.table(head_nodes, file = name_file_gdf, append = FALSE, quote = FALSE,
            sep = ",", eol = "\n", na = "NA", dec = ".",
            row.names = FALSE, col.names = FALSE,
            qmethod = c("escape", "double"), fileEncoding = "UTF-8")
write.table(table_nodes, file = name_file_gdf, append = TRUE, quote = FALSE,
            sep = ",", eol = "\n", na = "NA", dec = ".",
            row.names = FALSE, col.names = FALSE,
            qmethod = c("escape", "double"), fileEncoding = "UTF-8")

# Definition of links
head_arcs <- "edgedef>node1 VARCHAR,node2 VARCHAR,weight INT,directed BOOLEAN"
for (j in 1:num_attribs) {
  attrib_type <- paste(name_attribs[[j]], "VARCHAR", sep = " ")
  head_arcs <- paste(head_arcs, attrib_type, sep = ",")
}
write.table(head_arcs, file = name_file_gdf, append = TRUE, quote = FALSE,
            sep = ",", eol = "\n", na = "NA", dec = ".",
            row.names = FALSE, col.names = FALSE,
            qmethod = c("escape", "double"), fileEncoding = "UTF-8")
write.table(table_connections, file = name_file_gdf, append = TRUE, quote = FALSE,
            sep = ",", eol = "\n", na = "NA", dec = ".",
            row.names = FALSE, col.names = FALSE,
            qmethod = c("escape", "double"), fileEncoding = "UTF-8")

The resulting GDF file can be found here, and the complete R code is also available in a standalone script.
Example

The data was obtained with the tool t-hoarder_kit, which allows downloading data through the Twitter API. With it, a query was made for the tweets that mentioned the @rstudio profile. From the data downloaded in CSV format, the most relevant columns were selected to keep the tables readable.
Once the CSV file has been converted to a GDF file with the RT relation, this is the visualization of the graph in Gephi (tweets from 2019-05-27 to 2019-06-06).
To leave a comment for the author, please follow the link and comment on their blog: R on Coding Club UC3M.
Visualizing the Copa América: Historical Records, Squad Profiles, and Player Profiles with xG statistics!
(This article was first published on R by R(yo), and kindly contributed to Rbloggers)
Another summer and another edition of the Copa América! Along with the
Africa Cup of Nations, Nations League finals, the Women’s World Cup,
Under-21 European Championship AND the Gold Cup, this is yet another
soccer-filled season after last year’s World Cup and the Asian Cup
earlier this year (I also did a blog post on these last two tournaments
which you can see here (World
Cup) and here
(Asian Cup)).
There is so much football going on at once even I can’t keep up,
especially with the time difference! To not redo all the previous
visualizations with Copa América data I tried to find new sources of
data and other forms of visualizations to give some insight into the
players and teams competing to be the champion of South America. You can
find all the code I used in this blog post here, and you can also find
other soccer-related data viz in my
soccer_ggplot GitHub repo.
The sections will go from a very macro-level view of the historical
records of the tournament, to the squads competing, the teams’
match records in the Copa América, and finally to a micro-level view
of various attacking players using xG statistics.
¡Vámonos!
Packages

```r
library(dplyr)      ## data wrangling
library(tidyr)      ## data wrangling
library(purrr)      ## data wrangling and iteration
library(stringr)    ## data wrangling
library(rvest)      ## webscraping
library(polite)     ## webscraping (GitHub-only pkg)
library(ggplot2)    ## plotting
library(scales)     ## plotting scales
library(ggimage)    ## images for flags
library(ggforce)    ## plotting text labels
library(cowplot)    ## plotting grid
library(glue)       ## text
library(ggrepel)    ## plotting text labels
library(magick)     ## plotting
library(knitr)      ## tables
library(ggtextures) ## soccer ball emoji as geom_col()
library(extrafont)  ## fonts: Roboto Condensed
loadfonts()
```

theme_copaAmerica

I wanted all the plots in this blog post to have a consistent
color theme. As the tournament is going to be held in Brazil, I went
with a color theme based on its flag with blue, yellow, and green being
the primary colors.
For this plot I took the stats from the Spanish version of the Wikipedia
page as it had more content. I used purrr::flatten_df() to squish the
list output into a dataframe then set the names of each column using
purrr::set_names().
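The flatten-and-name step can be sketched with toy data (the post uses purrr::flatten_df(); purrr::map_dfr() is swapped in here as an equivalent way to squish a list of rows into a data frame, and the row values are invented):

```r
library(purrr)
library(tibble)

## Hypothetical scraped output: one list per table row
raw_rows <- list(
  list(country = "Argentina", titles = "14"),
  list(country = "Uruguay",   titles = "15")
)

## Squish the list output into a data frame,
## then label the columns with set_names()
winners <- raw_rows %>%
  map_dfr(as_tibble) %>%
  set_names(c("country", "titles"))
```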
Like in the Asian Cup blogpost I use Claus
Wilke’s
ggtextures package to use
soccer ball emoji as the column image in the plot.
Most of these players aren’t ones you might recognize. The Copa América
used to be held a lot more regularly (and sometimes erratically) until
this century so players had a lot more opportunities to score goals. All
five of the players you see here played in the 1930s-1950s, when there
was a tournament every one or two years. Out of currently active
players, Peruvian legend Paolo Guerrero has 11 goals along with Eduardo
Vargas (from Chile) with 10. (Edit: after the Chile – Japan game, Vargas is on
12…) Another player you might recognize that was actually tied with
Ademir for 5th place, along with three other players, was Gabriel
Batistuta (“Batigol”).
After grabbing the data from the Wikipedia page I used a variety of
functions to clean and reshape the dataset like tidyr::separate() to
split the number of occurrences and the year.
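A minimal sketch of that separate() step (the column name and the fused format are my own assumptions, not the post's actual scraped data):

```r
library(tibble)
library(tidyr)

## Hypothetical raw column fusing the count of titles with the years
df <- tibble(titles_raw = c("15 (1942, 1945)", "14 (1921, 1925)"))

## Split at the first space; extra = "merge" keeps the year list intact
df_clean <- df %>%
  separate(titles_raw, into = c("n_titles", "years"),
           sep = " ", extra = "merge")
```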
I also wanted to add flags to this plot but
cowplot::insert_yaxis_grob() is unfortunately not compatible with
facets. I used stringr::str_wrap() to format the subtitle nicely while
I used glue::glue() to avoid having to use ‘\n’ to create a new line
for the caption.
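Those two helpers look roughly like this (illustrative strings only):

```r
library(stringr)
library(glue)

## str_wrap() inserts line breaks so a long subtitle fits the plot
subtitle_text <- str_wrap(
  "Uruguay and Argentina have won the most Copas, well ahead of Brazil.",
  width = 40)

## glue() preserves line breaks typed inside the string,
## so the caption needs no explicit '\n'
caption_text <- glue("
  Source: Wikipedia
  By: @R_by_Ryo")
```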
What’s surprising to note is that Pele never won a Copa América with
Brazil, although he did get Best Player and Top Scorer in the 1959
edition of the tournament. Even more bizarrely Diego Maradona has never
won it either! He didn’t play in either of the 1991 and 1993 editions
where Argentina won their 13th and 14th Copas.
We just looked at what happened in the past but who are the players
competing in the tournament this year? To take a quick look I
webscraped the squads of each of the competing teams from Wikipedia.
I created a list of the xpaths for each of the squads and, using
purrr::map(), grabbed the data for each participating country. After
I got some meta-information about the country name and the group, I
created a list-column that stores the squad data as a dataframe in its
own column. To explode this out I used tidyr::unnest() to reshape the
entire dataframe to have one row with all the data for each player in
every squad.
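The list-column-then-unnest pattern can be sketched with toy data (player names and ages invented for illustration):

```r
library(dplyr)
library(tibble)
library(tidyr)

## One row per country; the squad is a data frame stored
## inside a list-column
squads_nested <- tibble(
  country = c("Brazil", "Peru"),
  group   = c("A", "A"),
  squad   = list(
    tibble(player = c("Alisson", "Casemiro"), age = c(26, 27)),
    tibble(player = c("Guerrero", "Farfan"),  age = c(35, 34))
  )
)

## unnest() explodes the list-column: one row per player per squad
squads_df <- squads_nested %>% unnest(cols = squad)
```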
To get a clean dataset I use some stringr::str_*() functions to
properly format the character strings, such as the player positions,
ages, and dates of birth.
Using this data I can plot a bunch of histograms:
```r
age_country_plot <- squads_df_clean %>%
  group_by(country) %>%
  mutate(median_age = median(age)) %>%
  ungroup() %>%
  ggplot(aes(x = age)) +
  geom_histogram(fill = "red", binwidth = 1) +
  geom_vline(aes(xintercept = median_age), size = 1.2) +
  geom_label(aes(x = median_age, y = 8,
                 label = glue::glue("Median: {median_age}")),
             nudge_x = 0.5, hjust = 0.1, size = 3,
             family = "Roboto Condensed", color = "black") +
  labs(title = "Age Distribution of Copa América squads",
       subtitle = "Columns ordered Group A to Group C",
       x = "Age", y = NULL,
       caption = glue::glue("
       Source: Wikipedia
       By: @R_by_Ryo")) +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0), breaks = scales::pretty_breaks()) +
  theme_copaAmerica(title.size = 22, subtitle.size = 14, caption.size = 8,
                    axis.text.size = 12, axis.title.size = 16,
                    strip.text.size = 18,
                    panel.grid.minor.x = element_line(color = "white"),
                    panel.grid.minor.y = element_line(color = "white")) +
  facet_wrap(~country, ncol = 3)

age_country_plot
```

In terms of age, Japan have the youngest team with a median of 21, 4
years younger than the next youngest team, Qatar. The rest have a fairly
balanced spread of ages from 20 to the early-to-mid 30s, with most of the
medians hovering around 27 years of age. The reason for Japan’s
extremely young squad is that the full-strength Japan
team has played in both the World Cup and the Asian Cup in the past
year. Along with the fact that the Tokyo Olympics are next year, it was
decided to use the invitation to the Copa América as a trial-by-fire for
the young stars of the future. Much like in a real Olympic squad, the
team contains three “overage” players in World Cup 2010/2014/2018
goalkeeper Eiji Kawashima, Premier League winner Shinji Okazaki, and
Getafe playmaker Gaku Shibasaki.
The oldest player will be Brazil captain Dani Alves at 36 with
Paraguay’s Oscar Cardozo only two weeks younger. On the other hand, the
youngest player is Japan’s 18-year-old prodigy Takefusa Kubo, the
ex-Barcelona youth player who only just recently moved to Real Madrid!
In light of his transfer a lot of eyes will be on him to see if he can
produce some Captain Tsubasa-esque performances for a very inexperienced
Japan team gearing up for the Tokyo Olympics!
When considering the experience of a squad it’s not enough to look at
ages but one needs to look at the caps or appearances for the national
team as well.
The majority of Japan’s squad have 0 (ZERO) caps, with the
aforementioned three “overage” players taking up most of the proportion
of caps on the team. Bolivia are also taking an untested squad, with 8 of
their players having 2 caps or less! Chile, Uruguay, and Argentina bring
their veterans with multiple players over or around 100 caps. From this
data I was surprised that Jefferson Farfan and Paolo Guerrero didn’t
have 100 caps by now…
The player with the most caps is Lionel Messi (130) followed closely by
Diego Godin (126), and Alexis Sanchez (124). On the other hand there are
29 players hopeful of making their first national team appearance at
this tournament with the majority (17 players) coming from Japan.
Next I looked at the distribution of goals scored by the midfielders and
strikers of each team. I found out about using
ggplot2::position_nudge() for slightly adjusting variables on a
discrete scale in similar fashion to the nudge_y = and nudge_x =
arguments most people might be familiar with from other geoms. I also
used ggforce::geom_mark_hull() to do some labelling.
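position_nudge() can be demonstrated in a couple of lines (toy data):

```r
library(ggplot2)

## Toy goal counts on a discrete x-axis
df <- data.frame(country = c("Peru", "Peru", "Chile"),
                 goals   = c(11, 3, 10))

## position_nudge() shifts every point by a fixed offset, useful for
## geoms that lack the nudge_x =/nudge_y = convenience arguments
ggplot(df, aes(x = country, y = goals)) +
  geom_point(position = position_nudge(x = 0.2))
```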
With a lot of these players being more defensively minded or new players
the distribution is heavily skewed but you can see little mounds showing
the top goalscorers for each country and see which countries have their
goalscorers spread out among multiple players such as Brazil, Qatar, and
Peru.
If you know your South American players you can take a good guess at who
are the top goal scorers for each nation. For Colombia the two outlying
mounds are obviously James Rodriguez and Falcao, for example.
Venezuela’s top scorer with 22 is Salomon Rondon and for Brazil, if not
for his injury, a lonesome mound would have appeared for Neymar with 60
goals!
Now let’s check the player contribution to the squads at the Copa
América by league. I’m just going to use the country that the league is
from for simplicity’s sake. Originally I wanted to left_join() it with
a ‘country <-> domestic league’ table but couldn’t find one, and the
league names themselves aren’t very meaningful or have awful sponsor names
that obfuscate the country of origin even further.
The best of the best players from South American countries will move on
to Europe so the Argentinean league (Superliga Argentina) and the
Brazilian league (Brasileirão – Serie A) do not have as many players as
you might think and as a consequence, the top leagues of England, Spain,
and Italy contribute quite a bit! A lot of the better players but not
quite elite South American players might go to Mexico instead of a
lower-mid European league. With the growth of the MLS a fair number of
players ply their trade there as well.
We can take a more detailed look by creating a table of the proportion
of players from each squad coming from either a domestic league or any
other league. I had to do a lot of wrangling to get the proper output
for the table. After calculating the percentage of domestic players from
a country’s domestic league I added the full data back in. Then I had to
make sure that for each country, the country – domestic league country
was the first row in each of the country groups (so Bolivia – Bolivia,
Bolivia – China, Japan – Japan, Japan – England, etc.). By doing this I
can automatically tidyr::fill() in the rest of the rows of that
country with the ‘percentage of players from domestic league’ stat.
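The sort-then-fill trick can be sketched as follows (the column names are my own stand-ins, not the post's):

```r
library(dplyr)
library(tidyr)

## The domestic-league share is only present on the row where the
## league's country matches the squad's country
df <- tribble(
  ~country,  ~league_country, ~pct_domestic,
  "Bolivia", "Bolivia",       "87.0%",
  "Bolivia", "China",         NA,
  "Japan",   "Japan",         "60.9%",
  "Japan",   "England",       NA
)

## Put the domestic row first within each country, then fill() the
## stat down to the remaining rows of that group
df_filled <- df %>%
  arrange(country, league_country != country) %>%
  group_by(country) %>%
  fill(pct_domestic) %>%
  ungroup()
```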
| Country | League (country name) | Number of players from league | Percentage of players from league | Percentage of players from domestic league |
|---|---|---|---|---|
| Qatar | Qatar | 23 | 100.0% | 100.0% |
| Bolivia | Bolivia | 20 | 87.0% | 87.0% |
| Japan | Japan | 14 | 60.9% | 60.9% |
| Ecuador | Ecuador | 9 | 39.1% | 39.1% |
| Paraguay | Paraguay | 8 | 34.8% | 34.8% |
| Brazil | England | 7 | 30.4% | 13.0% |
| Uruguay | Spain | 6 | 26.1% | 4.3% |
| Peru | Peru | 6 | 26.1% | 26.1% |
| Chile | Chile | 6 | 26.1% | 26.1% |
| Argentina | Argentina | 5 | 21.7% | 21.7% |
Three interesting facts I found:
- 30% of players on the Brazil squad play for an English team, the most out of any league–squad combination excluding domestic leagues.
- 100% of the Qatar squad play in their domestic league!
- Only one Uruguayan player (4.3%) plays in Uruguay’s domestic league.
In the final plot for this section, I looked at the top 10 clubs
contributing the most players to the tournament. I used
arrange(desc(n)) %>% slice() instead of top_n() as sometimes top_n() grabs
way too many teams that are tied with the same value. To set the team names inside the bars I
created a midpoint value midval that calculated a value half of the
number of players contributed so the labels were placed neatly.
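The midval placement looks like this (toy counts):

```r
library(dplyr)
library(ggplot2)

## Toy counts of players contributed per club
clubs <- tibble::tibble(team = c("Al-Sadd", "America"),
                        n    = c(8, 7))

## midval = half the bar length, so the label sits centered in the bar
clubs <- clubs %>% mutate(midval = n / 2)

ggplot(clubs, aes(x = team, y = n)) +
  geom_col() +
  geom_text(aes(y = midval, label = team), color = "white")
```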
With 100% of its players coming from the domestic league, it’s no
surprise that the Qatari team, Al-Sadd, is contributing the most players
to the tournament. Tied with another Qatari team, Mexican club America
features 7 players yet none of them are Mexicans (2 Argentineans, 2
Colombians, 1 Ecuadorian, 1 Chilean, and 1 Paraguayan).
At first I thought Barcelona contributed 8 players until I realized the
Ecuadorian players were coming from the Ecuadorian team called
Barcelona…I had to go all the way back up to the beginning of this
section to fix that small peculiarity. As fútbol came to South America
via European colonists and immigrants a lot of teams took up the names
and colors of the teams these Europeans were fond of. Other examples
include Liverpool F.C. (Montevideo, Uruguay), Arsenal de Sarandi (Buenos
Aires, Argentina), and Club Atletico Juventus (Sao Paulo, Brazil –
although they use the colors of Torino F.C.).
If you download the data and type in the code below you can see the
entire club-country list.
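The post's snippet isn't reproduced here; a minimal equivalent (with assumed `club`/`country` columns and toy rows) would be a simple count:

```r
library(dplyr)

## Toy stand-in for the cleaned squad data
squads <- tibble::tibble(
  club    = c("Barcelona", "Barcelona SC", "Al-Sadd", "Al-Sadd"),
  country = c("Uruguay", "Ecuador", "Qatar", "Qatar")
)

## Tabulate every club-country pairing, most common first
squads %>% count(club, country, sort = TRUE)
```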
Now that we got a good look at the composition of the teams, we can take
a look at how they’ve done at every Copa América.
The next code chunk mainly comes from PH Julien and his excellent Kaggle
kernel of “A Journey Through The History of
Soccer”.
Unfortunately, this data does not have penalty results as those
games are all counted as a draw (as technically that is the actual
result). Considering there are a lot of cagey knockout rounds that finish
in a penalty shootout (including the last two finals…) it is
unfortunate but that’s just the data you have sometimes. There is a way
to webscrape all the Copa América results and assign Win/Lose to those
games that went to penalties but I’ll leave that for another time. Also,
there is no info on what stage of the tournament each recorded match
took place in.
Now that it’s nice and cleaned up I created a function that reshapes the
data so that it’s set from a certain team’s perspective with the “team”
argument. You can also set the function to look for only results against
a certain opponent by filling in the versus argument.
The output is either a dataframe of all the games a team has been
involved in as well as the record of the team against other teams in the
Copa América or a message saying that the team you picked has never
played against the opponent you picked.
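A hedged skeleton of such a function, under assumed column names (`home_team`, `away_team`, `home_score`, `away_score`); this is a sketch, not the post's actual copaAmerica_resultados():

```r
library(dplyr)

copa_record <- function(data, team, versus = NULL) {
  ## All games involving `team`, rewritten from that team's perspective
  games <- data %>%
    filter(home_team == team | away_team == team) %>%
    mutate(
      opponent      = if_else(home_team == team, away_team, home_team),
      goals_for     = if_else(home_team == team, home_score, away_score),
      goals_against = if_else(home_team == team, away_score, home_score),
      result = case_when(
        goals_for > goals_against ~ "Win",
        goals_for < goals_against ~ "Loss",
        TRUE                      ~ "Draw"
      )
    )
  if (!is.null(versus)) {
    games <- games %>% filter(opponent == versus)
    if (nrow(games) == 0) {
      return(paste(team, "has never played against", versus, "at the Copa!"))
    }
  }
  ## Per-opponent win/draw/loss record
  record <- games %>%
    group_by(opponent) %>%
    summarise(Win  = sum(result == "Win"),
              Draw = sum(result == "Draw"),
              Loss = sum(result == "Loss"),
              goals_for = sum(goals_for),
              goals_against = sum(goals_against),
              .groups = "drop")
  list(games = games, record = record)
}
```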
Oh… that’s right Japan has never played against Brazil at the Copa…
```r
resultados_japon <- copaAmerica_resultados(data = results_copa_cleaned,
                                           team = "Japan")

resultados_japon[[2]] %>%
  knitr::kable(format = "html",
               caption = "Japan's record in the Copa América") %>%
  kableExtra::kable_styling(full_width = FALSE)
```

Japan’s record in the Copa América

| Date | Tournament | Venue | Opponent | Home / Away | Goals For | Goals Against | Result |
|---|---|---|---|---|---|---|---|
| 1999-06-29 | Copa América | Asunción | Peru | away | 2 | 3 | Loss |
| 1999-07-02 | Copa América | Asunción | Paraguay | away | 0 | 4 | Loss |
| 1999-07-05 | Copa América | Pedro Juan Caballero | Bolivia | away | 1 | 1 | Draw |
Japan’s only previous journey to the Copa América was in the 1999
edition, where they lost two games and drew against Bolivia. They were also invited for the 2011
edition but withdrew due to the Tohoku Earthquake and were replaced by
Costa Rica. Japanese football has come a long way since 1999 but with a
young squad this tournament it will be an uphill battle to get 3 points against any of
their Group C opponents, Uruguay, Chile, and Ecuador.
Colombia’s matches in the 2001 Copa América

| Date | Tournament | Venue | Opponent | Home / Away | Goals For | Goals Against | Result |
|---|---|---|---|---|---|---|---|
| 2001-07-11 | Copa América | Barranquilla | Venezuela | home | 2 | 0 | Win |
| 2001-07-14 | Copa América | Barranquilla | Ecuador | home | 1 | 0 | Win |
| 2001-07-17 | Copa América | Barranquilla | Chile | home | 2 | 0 | Win |
| 2001-07-23 | Copa América | Armenia | Peru | home | 3 | 0 | Win |
| 2001-07-26 | Copa América | Manizales | Honduras | home | 2 | 0 | Win |
| 2001-07-29 | Copa América | Bogotá | Mexico | home | 1 | 0 | Win |
Despite a recent resurgence of the Colombia national team they have not
been able to match the feats of the 2001 side that won the Copa, with
their best finish since then being 3rd place in 2004. The 2001 team
were not only unbeaten but also did not concede a single goal throughout
the tournament!
| Opponent | Win | Draw | Loss | Goals For | Goals Against |
|---|---|---|---|---|---|
| Argentina | 9 | 8 | 15 | 38 | 52 |

```r
resultados_de_brazil[[2]] %>%
  tail(5) %>%
  knitr::kable(format = "html",
               caption = "Brazil vs. Argentina in the Copa América") %>%
  kableExtra::kable_styling(full_width = FALSE)
```

Brazil vs. Argentina in the Copa América

| Date | Tournament | Venue | Opponent | Home / Away | Goals For | Goals Against | Result |
|---|---|---|---|---|---|---|---|
| 1993-06-27 | Copa América | Guayaquil | Argentina | home | 1 | 1 | Draw |
| 1995-07-17 | Copa América | Rivera | Argentina | home | 2 | 2 | Draw |
| 1999-07-11 | Copa América | Ciudad del Este | Argentina | home | 2 | 1 | Win |
| 2004-07-25 | Copa América | Lima | Argentina | away | 2 | 2 | Draw |
| 2007-07-15 | Copa América | Maracaibo | Argentina | home | 3 | 0 | Win |
Brazil does not have a good overall record vs. Argentina but they have
not lost against their rivals at the Copa América since the 1993 edition
where they lost 5-6 on penalties in the Quarter Finals. The “draw” in
1995 was won on penalties, while the “draw” in 2004 was actually the
final, where they won 4-2 on penalties.
What I found odd was that the Copa América seems to have a very low
priority to certain countries, especially Brazil who have repeatedly
sent their B or C teams to the tournament in favor of sending their best
team to other tournaments or resting star players. Funnily enough these
understrength Brazilian squads have actually won the entire tournament a
few times, most notably in 2007 against a full strength Argentina side
containing the likes of Zanetti, Riquelme, Cambiasso, Tevez, a young
Messi/Mascherano, et al!
After looking at the history of the competition and the composition of
the squads I examined the players and their form coming into the Copa
América. In recent years football analytics has really taken off and
there have been many strides made in creating more informative
statistics to assess players’ abilities, the most prominent being the
xG statistics. This is the first time I’ve talked about xG in any
length/depth so this introduction is as much to solidify my
understanding as well as yours!

- xG: a quantitative measure (between 0 and 1) that assigns a probability that a shot taken will result in a goal, based on a variety of variables; it is used for evaluating the quality of chances and predicting players’ and teams’ future performances.
- xA: a quantitative measure (between 0 and 1) that assigns a probability that a given pass will result in an assist for a goal, based on a variety of variables.
Common variables used in the models that output xG statistics are the
distance and angle of a shot, the body part used, rebound, among others.
Similar to how you might assess your favorite striker’s chances of
scoring just as he is lining up to take a shot: Is the shot a header? Is
he trying to score from a cross in regular play or a corner kick? Is
there a crowd of defenders in front of him, or is he one-on-one with the
goalkeeper? And so on. You might think who is taking the shot would be a genuine
factor but in actuality it tells you a lot less about the chances of a
goal compared to the location of the shot.
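As a toy illustration of the idea (nothing like a production xG model, and the shot data is invented), one could fit a logistic regression on distance and angle:

```r
## Invented shot data: distance (m), angle of goal visible (degrees),
## and whether the shot was scored
shots <- data.frame(
  distance = c(6, 11, 8, 5, 20, 18, 25, 30, 10),
  angle    = c(60, 45, 50, 58, 25, 30, 15, 10, 35),
  goal     = c(1, 1, 1, 0, 1, 0, 0, 0, 0)
)

## Logistic regression: P(goal) as a function of shot location
xg_model <- glm(goal ~ distance + angle, data = shots, family = binomial)

## Predicted "xG" for a close, central shot
predict(xg_model, newdata = data.frame(distance = 8, angle = 55),
        type = "response")
```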
Note that there isn’t a SINGLE xG model. You can check out a blog post
comparing different people’s xG models
here.
People and organizations (from Statsbomb to OptaPro) have their own
ideas about what could be the important variables in play and as
such it’s important to report which source you got your data from,
as the stats can differ between models. One thing xG does not factor
in is goalkeeper performance (someone pulling off
incredible saves or letting in a poor shot); one must also consider
a team’s style of play and the quality of a player’s
teammates. When judging players based on these stats it is important to
be aware of contextual factors like the team they play for, their
opponent, and the player’s position/role in the team.
From xG and xA more granular statistics such as xGChain and xGBuildup
were created to dig a little deeper into who is contributing
to chance creation; I’ll introduce these two a bit later. As the
field has grown new statistics have popped up such as Karun
Singh’s “expected threat” or xT. You
can check out an introduction to xT
here.
Of course, these statistics only tell a part of the story and are
definitely not the be-all and end-all. In the context of this current
blog post, these stats only tell the story about how these players did
for their club teams this past season rather than for their national
team. Even still it gives us a good idea of what kind of form these
players are in coming into this tournament.
You might also want to watch these YouTube videos by
TifoFootball
for a quick primer on xG
and xA.
For the data I used the website, understat.com. Their xG models were
created via training a neural network on a dataset consisting of over
100,000 shots using more than 10 different variables. Getting data from
understat has been made easy by Ewen Henderson’s understatr package
available from Github (he’s also
the guy that made the ghibli color
palette!). I tried to pick a wide selection of attacking players but I
was also limited by the fact that understat only has data for
teams/players from six European leagues (Premier League, Bundesliga,
Serie A, La Liga, Ligue 1, and Russian Premier League).
For Peru I would have chosen Paolo Guerrero but as he plays in
Brazil now I went with Jefferson Farfan (who hasn’t played as many games
as the other players used for comparison unfortunately…). For Chile
I would have picked Eduardo Vargas, but as he doesn’t play for a team covered by
understat I went with Alexis Sanchez, who had a woeful season and only
played ~600 minutes despite appearing in ~20 league matches and later
added Arturo Vidal. For Brazil I included Neymar initially but since
he won’t actually be playing I’ll keep him for comparison’s sake but
also include Gabriel Jesus and Roberto Firmino who have been fighting
for the starting striker spot. Note that these two aren’t the ones
replacing Neymar positionally. In Neymar’s left-wing position I can
see David Neres or Phil Coutinho replacing him (Richarlison and Willian
mostly play on the right). (Edit: In the first match vs. Bolivia, David
Neres started off on the left while Richarlison played on the right,
Coutinho played just behind Bobby Firmino)
The other nation’s strikers/attacking midfielders don’t play for the six
European leagues covered by understat or like in Shinji Okazaki’s case
just did not play as many minutes/games during the season. To get the
data I created a list of the player codes and used purrr::map() to
iterate each through the understatr::get_player_seasons_stats()
function.
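That iteration looks roughly like this (the player IDs below are illustrative, not verified, and the download line is commented out since it hits the network):

```r
library(purrr)
# library(understatr)  # remotes::install_github("ewenme/understatr")

## understat numeric player codes (illustrative values only)
player_ids <- c(2097, 619)

## Iterate over each code and row-bind the per-season stats
# understat_data <- map_dfr(player_ids,
#                           understatr::get_player_seasons_stats)
```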
As you can see the data consists of a row for each player and each year
(from the 2014/2015 season to the 2018/2019 season). I tried to mitigate
the fact that some players played a lot more minutes than others by
standardizing everything to a ‘per 90 minutes’ value, but this does have
its own disadvantages. These include the fact that players who play a
lot of minutes (as regular starting members) may not have as high a ‘per
90’ stat even though their production across all those minutes might
suggest that they are consistently performing and producing at a high
level.
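The per-90 arithmetic itself is simple, and the caveat is easy to see with toy numbers:

```r
library(dplyr)

## Invented players: B has far fewer minutes than A
players <- tibble::tibble(
  player = c("A", "B"),
  goals  = c(20, 6),
  time   = c(3000, 600)   # minutes played
)

## stat / minutes * 90: the low-minutes player B (0.9 per 90)
## now looks better than the season-long starter A (0.6 per 90)
players %>% mutate(goals_p90 = goals / time * 90)
```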
It’ll be a bit crowded (kind of like a spilt box of Skittles…) but let’s
check out the key metrics for all the players at once.
Note: npg = non-penalty goals, npxG = non-penalty expected goals (xG)
```r
comparison_data <- understat_data %>%
  filter(year == 2018) %>%
  select(-games, -team_name, -year) %>%
  rename(Shots = shots, KP = key_passes) %>%
  gather(key = "key", value = "value", -player_name, -time) %>%
  mutate(key = forcats::as_factor(key) %>%
           forcats::fct_relevel(., "xG", "goals", "npxG", "npg",
                                "xA", "assists", "xGChain", "xGBuildup",
                                "Shots", "KP"))

comparison_strikers_plot <- comparison_data %>%
  filter(key != "Shots", key != "KP",
         key != "xGBuildup", key != "xGChain") %>%
  mutate(value = value / time * 90) %>%
  ggplot(aes(x = key, y = value, fill = player_name)) +
  geom_jitter(shape = 21, size = 5, color = "black",
              width = 0.25, stroke = 1.1) +
  geom_vline(xintercept = 1.5, size = 2) +
  geom_vline(xintercept = 2.5, size = 2) +
  geom_vline(xintercept = 3.5, size = 2) +
  geom_vline(xintercept = 4.5, size = 2) +
  geom_vline(xintercept = 5.5, size = 2) +
  coord_flip() +
  scale_y_continuous(expand = c(0.01, 0.01), limits = c(0, 1.26)) +
  scale_fill_manual(values = pals::glasbey(16), name = "Player") +
  labs(title = "Comparison: Top attackers at the Copa América",
       subtitle = "For select group of attacking players with data available from understat.com",
       x = NULL, y = "Metric per 90",
       caption = glue::glue("
       data: understat.com
       2018-2019 Season")) +
  theme_copaAmerica(title.size = 18, subtitle.size = 12,
                    panel.grid.minor.x = element_line(color = "white"))

comparison_strikers_plot
```

As usual in these types of charts, Messi is leading a lot of the metrics
here and showing consistency too with having played the third highest
amount of minutes out of the selected players. It’s helpful to have the
xG/xA stats next to the actual goals/assists as it provides an
indication of whether the player in question is scoring shots that he
probabilistically should be scoring. When a player’s actual goal count
is higher than their xG stat this suggests that the player is
“exceeding their xG” or that they are scoring from shots that are
historically hard to score from. It can be seen as a marker of an elite
finisher as they are putting away chances from difficult situations
consistently. In terms of assists and xA Alexis Sanchez, who only played
about 600 minutes, looks a lot better than in reality due to the
aforementioned disadvantage of standardizing everything to a “per 90
minutes” value. Normally you would have a cutoff based on a certain
minimum amount of minutes but as I mentioned I was rather limited in
my choice of players.
A way to take a closer look at xG – Goals and xA – Assists is to use a
simple dot plot with a line going through the 45 degree angle. Those
below the line are underperforming relative to their xG or xA stat,
those over it are overachieving (“exceeding” their xG/xA stat) while
those just on the line are scoring or assisting right around what the
model expects the player to be. I use non-penalty xG below as penalties
have around ~0.75 xG (give or take a few percentage points depending on
the model) and can inflate the stats of those players who take a lot of
penalties and score them, especially if they weren’t the ones who earned
the penalty themselves.
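A sketch of that dot plot (invented npxG/npg values):

```r
library(ggplot2)

## Above the dashed 45-degree line = exceeding xG;
## below it = underperforming the model
df <- data.frame(player = c("A", "B", "C"),
                 npxG   = c(18, 12, 9),
                 npg    = c(24, 11, 4))

ggplot(df, aes(x = npxG, y = npg)) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  geom_point(size = 3) +
  geom_text(aes(label = player), nudge_y = 1)
```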
Gabriel Jesus is quite clearly below the 45 degree line meaning that he
has been very poor at finishing chances (and/or incredibly unlucky).
After a poor World Cup where he scored 0 goals as a starter, he is
really going to have to step up to fill Neymar’s goalscoring boots for
this tournament. However, his buildup play for City has still been good
this past season and he has been scoring for Brazil in the friendlies
leading up to the tournament so it’s going to be a hard decision for
Tite to decide on who starts against Bolivia (edit: Firmino started and
contributed an assist while Jesus replaced him in the 65th minute).
One thing to keep in mind is that xA does not take into account the
recipient of the assist pass. Even if the pass given had a high expected
assist value the receiving player still might not have the quality to
put it away through no fault of the passer. This might explain why most
of the players with a higher xA among this group don’t have the assists
to match. It could also be that these players are the ones playing
lot more minutes and the volume of chances they create just aren’t
translating to goals all the time.
I separated “key passes” and “shots” as well as “xGChain” and
“xGBuildup” from the rest as these two sets were on a very different
scale.

- xGChain: a quantitative measure that is the combined sum of the xG of every possession that ends in a shot that a player is involved in. The same derived value is given to each of the players involved, allowing us to credit players for attacking contributions outside of just shots (xG) and assists (xA).
- xGBuildup: similar to xGChain but excluding shots and assists. This is in response to xGChain values still being dominated by the xG and xA from shots and assists, respectively.
Although Gabriel Jesus has been poor at finishing his chances as seen in
previous graphs, his xGChain and xGBuildup stat makes it clear that he
is still contributing to City’s attack outside of scoring goals himself
(not to mention all the defensive work he does as well).
In the example below, the stats clearly differentiate James, who is
more of a playmaker with his superior xGBuildup, xGChain, and Key
Passes values, from Falcao and Duvan, who are traditional number 9s.
For a more detailed overview of xGChain and xGBuildup
check out Statsbomb’s article
here.
There are lots of different ways to visualize this data, most famously
the radar charts created by Statsbomb. Using
the data you can also compare and evaluate players in many different
ways, and using the understatr package you could add a few more players
like Paulo Dybala, Miguel Almiron, and more! I could probably do a whole
other article using just this data but I’ll leave it here for now.
Throughout this blog post I talked about some of the historical records,
squad compositions, match records, and finally the player profiles of
attacking players at this summer’s Copa América. Using the power of R it
is really easy to webscrape and visualize data in a way that is
informative and aesthetically pleasing. I wanted to finish this before
the tournament started but other life things got in the way as well as
the fact that the amount of content ballooned out of control (especially
the xG section) so I had to cut down a lot.
It’s been fun reading the articles on the Copa América website and
seeing how far my “intermediate-but-very-out-of-practice” level of
Spanish can get me to understand the content, here is one that I
particularly liked reading: 14 Estadisticas de la Copa
América.
With so many tournaments going on right now (and with the African Cup of
Nations starting in a few days) a lot of the news media is spread thin
right now but there are still some quality articles out there to read
about the Copa, like this
article from BBC’s South
American football expert, Tim Vickery.
After the first round of games, a few points of discussion:

- After an extremely lacklustre performance vs. Colombia, how does Argentina bounce back? What tactical changes need to be made?
- Qatar impressed against Paraguay, but can they pull off a major upset vs. Colombia?
- How will Japan line up against Uruguay after losing by a scoreline that didn’t really do their performance justice? How will manager Moriyasu balance experience vs. youth; will he start veteran Okazaki after Ueda’s numerous misses vs. Chile? How will Japan deal with their glaring weaknesses at the full-back position? Will Yuta Nakayama keep his place after an awful performance?
- Can Brazil earn an early ticket to the next round vs. Venezuela after a clinical but not excellent performance vs. Bolivia? Will they be able to keep up their streak of winning the Copa every time they have hosted it?
Thanks for reading and…
¡Buena suerte a todos equipos!
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R by R(yo). Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Visualizing the Copa América: Historical Records, Squad Profiles, and Player Profiles with xG statistics!
(This article was first published on R by R(yo), and kindly contributed to Rbloggers)
Another summer and another edition of the Copa América! Along with the
Africa Cup of Nations, Nations League finals, the Women’s World Cup,
Under-21 European Championship, AND the Gold Cup, this is yet another
soccer-filled season after last year's World Cup and the Asian Cup
earlier this year (I also did a blog post on these last two tournaments
which you can see here (World
Cup) and here
(Asian Cup)).
There is so much football going on at once even I can’t keep up,
especially with the time difference! To avoid redoing all the previous
visualizations with Copa América data, I tried to find new sources of
data and other forms of visualizations to give some insight into the
players and teams competing to be the champion of South America. You can
find all the code I used in this blogpost here, and you can also find
other soccer-related data viz in my
soccer_ggplot Github repo.
The sections will go from a very macro-level view of the historical
records of the tournament, to the squads competing, to the teams'
match record in the Copa América, and finally to a micro-level view
of various attacking players using xG statistics.
¡Vámonos!
Packages

library(dplyr)      ## data wrangling
library(tidyr)      ## data wrangling
library(purrr)      ## data wrangling and iteration
library(stringr)    ## data wrangling
library(rvest)      ## webscraping
library(polite)     ## webscraping (Github only pkg)
library(ggplot2)    ## plotting
library(scales)     ## plotting scales
library(ggimage)    ## images for flags
library(ggforce)    ## plotting text labels
library(cowplot)    ## plotting grid
library(glue)       ## text
library(ggrepel)    ## plotting text labels
library(magick)     ## plotting
library(knitr)      ## tables (kable())
library(kableExtra) ## table styling
library(ggtextures) ## soccer ball emoji as geom_col()
library(extrafont)  ## fonts: Roboto Condensed
loadfonts()

theme_copaAmerica

I wanted all the plots in this blogpost to have a consistent
color theme. As the tournament is going to be held in Brazil, I went
with a color theme based on its flag, with blue, yellow, and green as
the primary colors.
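As a concrete sketch of the idea, such a theme can be built around a small named palette that every plot draws from. The hex codes below are common approximations of the Brazilian flag colors, not necessarily the exact values used in the post:

```r
## Hypothetical palette for a theme_copaAmerica()-style helper; the hex
## codes are standard approximations of the Brazilian flag colors.
copa_pal <- c(
  green  = "#009C3B",  # flag field
  yellow = "#FFDF00",  # flag rhombus
  blue   = "#002776"   # flag disc
)

## e.g. a plot background could use copa_pal[["blue"]] and text
## copa_pal[["yellow"]], keeping every figure visually consistent
copa_pal
```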
For this plot I took the stats from the Spanish version of the Wikipedia
page as it had more content. I used purrr::flatten_df() to squish the
list output into a dataframe then set the names of each column using
purrr::set_names().
Like in the Asian Cup blogpost I use Claus
Wilke’s
ggtextures package to use
soccer ball emoji as the column image in the plot.
Most of these players aren’t ones you might recognize. The Copa América
used to be held a lot more regularly (and sometimes erratically) until
this century so players had a lot more opportunities to score goals. All
five of the players you see here played in the 1930s-1950s when there
was a tournament every one or two years. Out of currently active
players, Peruvian legend Paolo Guerrero has 11 goals along with Eduardo
Vargas (from Chile) with 10. (Edit: after the Chile – Japan game, Vargas is on
12…) Another player you might recognize that was actually tied with
Ademir for 5th place, along with three other players, was Gabriel
Batistuta (“Batigol”).
After grabbing the data from the Wikipedia page I used a variety of
functions to clean and reshape the dataset, like tidyr::separate() to
split the number of occurrences and the year.
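A toy illustration of that separate() step (the column name and values here are invented, not the actual scraped table):

```r
library(tidyr)

## a "titles (most recent year)"-style column, split into count and year
df <- data.frame(pais    = c("Uruguay", "Argentina"),
                 titulos = c("15 (2011)", "14 (1993)"),
                 stringsAsFactors = FALSE)

df_clean <- separate(df, titulos,
                     into = c("n_titulos", "ultimo"),
                     sep  = " \\(")                    # split on the " (" boundary
df_clean$ultimo <- sub("\\)", "", df_clean$ultimo)     # drop the trailing ")"
```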
I also wanted to add flags to this plot but
cowplot::insert_yaxis_grob() is unfortunately not compatible with
facets. I used stringr::str_wrap() to format the subtitle nicely, while
I used glue::glue() to avoid having to use ‘\n’ to create a new line
for the caption.
What’s surprising to note is that Pele never won a Copa América with
Brazil, although he did get Best Player and Top Scorer in the 1959
edition of the tournament. Even more bizarrely Diego Maradona has never
won it either! He didn’t play in either of the 1991 and 1993 editions
where Argentina won their 13th and 14th Copas.
We just looked at what happened in the past but who are the players
competing in the tournament this year? To take a quick look I
webscraped the squads of each of the competing teams from Wikipedia.
I created a list of the xpaths for each of the squads and, using
purrr::map(), I grabbed the data for each participating country. After
I got some meta-information about the country name and the group, I
created a list-column that stores the squad data as a dataframe in its
own column. To explode this out I used tidyr::unnest() to reshape the
entire dataframe to have one row with all the data for each player in
every squad.
To get a clean dataset I used some stringr::str_*() functions to
properly format the character strings such as the player positions,
ages, and dates of birth.
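A minimal sketch of that list-column-then-unnest pattern (the country names and toy squad tibbles are placeholders for the scraped Wikipedia tables):

```r
library(dplyr)
library(tidyr)
library(purrr)

squads_df <- tibble(
  country = c("Brazil", "Japan"),
  group   = c("A", "C"),
  ## squad is a list-column: one tibble of players per country
  squad   = map(country, ~ tibble(player = paste(.x, 1:2), age = c(24L, 28L)))
)

## explode to one row per player per squad
squads_long <- unnest(squads_df, squad)
```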
Using this data I can plot a bunch of histograms:
age_country_plot <- squads_df_clean %>%
  group_by(country) %>%
  mutate(median_age = median(age)) %>%
  ungroup() %>%
  ggplot(aes(x = age)) +
  geom_histogram(fill = "red", binwidth = 1) +
  geom_vline(aes(xintercept = median_age), size = 1.2) +
  geom_label(aes(x = median_age, y = 8,
                 label = glue::glue("Median: {median_age}")),
             nudge_x = 0.5, hjust = 0.1, size = 3,
             family = "Roboto Condensed", color = "black") +
  labs(title = "Age Distribution of Copa América squads",
       subtitle = "Columns ordered Group A to Group C",
       x = "Age", y = NULL,
       caption = glue::glue("Source: Wikipedia
                            By: @R_by_Ryo")) +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0), breaks = scales::pretty_breaks()) +
  theme_copaAmerica(title.size = 22, subtitle.size = 14, caption.size = 8,
                    axis.text.size = 12, axis.title.size = 16,
                    strip.text.size = 18,
                    panel.grid.minor.x = element_line(color = "white"),
                    panel.grid.minor.y = element_line(color = "white")) +
  facet_wrap(~country, ncol = 3)

age_country_plot

In terms of age, Japan have the youngest team with a median of 21, four
years younger than the next-youngest team, Qatar. The rest have a fairly
balanced spread of ages from 20 to the early-to-mid 30s, with most of the
medians hovering around 27 years of age. The reason for Japan's
extremely young squad is that the full-strength Japan team has played in
both the World Cup and the Asian Cup in the past year. Along with the
fact that the Tokyo Olympics are next year, it was decided to use the
invitation to the Copa América as a trial-by-fire for the young stars of
the future. Much like a real Olympic squad, the team contains three
“over-age” players in World Cup 2010/2014/2018 goalkeeper Eiji
Kawashima, Premier League winner Shinji Okazaki, and Getafe playmaker
Gaku Shibasaki.
The oldest player will be Brazil captain Dani Alves at 36, with
Paraguay's Oscar Cardozo only two weeks younger. On the other hand, the
youngest player is Japan's 18-year-old prodigy Takefusa Kubo, the
ex-Barcelona youth player who only just recently moved to Real Madrid!
In light of his transfer a lot of eyes will be on him to see if he can
produce some Captain Tsubasa-esque performances for a very inexperienced
Japan team gearing up for the Tokyo Olympics!
When considering the experience of a squad it’s not enough to look at
ages but one needs to look at the caps or appearances for the national
team as well.
The majority of Japan's squad have 0 (ZERO) caps, with the
aforementioned three “over-age” players taking up most of the proportion
of caps on the team. Bolivia are also taking an untested squad, with 8
of their players having 2 caps or fewer! Chile, Uruguay, and Argentina
bring their veterans, with multiple players over or around 100 caps.
From this data I was surprised that Jefferson Farfan and Paolo Guerrero
didn't have 100 caps by now…
The player with the most caps is Lionel Messi (130) followed closely by
Diego Godin (126), and Alexis Sanchez (124). On the other hand there are
29 players hopeful of making their first national team appearance at
this tournament with the majority (17 players) coming from Japan.
Next I looked at the distribution of goals scored by the midfielders and
strikers of each team. I found out about using
ggplot2::position_nudge() for slightly adjusting variables on a
discrete scale in similar fashion to the nudge_y = and nudge_x =
arguments most people might be familiar with from other geoms. I also
used ggforce::geom_mark_hull() to do some labelling.
With a lot of these players being more defensively minded, or simply new
to the team, the distribution is heavily skewed, but you can see little
mounds showing the top goalscorers for each country, and see which
countries have their goalscorers spread out among multiple players, such
as Brazil, Qatar, and Peru.
If you know your South American players you can take a good guess at who
are the top goal scorers for each nation. For Colombia the two outlying
mounds are obviously James Rodriguez and Falcao, for example.
Venezuela’s top scorer with 22 is Salomon Rondon and for Brazil, if not
for his injury, a lonesome mound would have appeared for Neymar with 60
goals!
Now let’s check the player contribution to the squads at the Copa
América by league. I’m just going to use the country that the league is
from for simplicity's sake. Originally I wanted to left_join() it with
a ‘country <> domestic league’ table but couldn't find one, and the
league names themselves aren't very meaningful or have awful sponsor
names that obfuscate the country of origin even further.
The best of the best players from South American countries will move on
to Europe so the Argentinean league (Superliga Argentina) and the
Brazilian league (Brasileirão – Serie A) do not have as many players as
you might think and as a consequence, the top leagues of England, Spain,
and Italy contribute quite a bit! A lot of the better-but-not-quite-elite
South American players might go to Mexico instead of a
lower-to-mid-tier European league. With the growth of the MLS a fair number of
players ply their trade there as well.
We can take a more detailed look by creating a table of the proportion
of players from each squad coming from either a domestic league or any
other league. I had to do a lot of wrangling to get the proper output
for the table. After calculating the percentage of players from
a country's domestic league, I added the full data back in. Then I had to
make sure that, for each country, the country – domestic-league-country
pairing was the first row in each of the country groups (so Bolivia – Bolivia
before Bolivia – China, Japan – Japan before Japan – England, etc.). By doing
this I can automatically tidyr::fill() in the rest of the rows of that
country with the ‘percentage of players from domestic league’ stat.
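A toy version of that fill() step (the numbers are invented): only the domestic-league row carries the stat at first, and fill() copies it down within each country group.

```r
library(dplyr)
library(tidyr)

league_tbl <- tibble(
  country      = c("Japan", "Japan", "Japan", "Uruguay", "Uruguay"),
  league       = c("Japan", "England", "Spain", "Uruguay", "Spain"),
  pct_domestic = c(60.9, NA, NA, 4.3, NA)   # only the first row per group is filled in
) %>%
  group_by(country) %>%
  fill(pct_domestic, .direction = "down") %>%  # propagate the first row's value
  ungroup()
```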
Country     League (country)   Players from league   % of players from league   % of players from domestic league
Qatar       Qatar              23                    100.0%                     100.0%
Bolivia     Bolivia            20                    87.0%                      87.0%
Japan       Japan              14                    60.9%                      60.9%
Ecuador     Ecuador            9                     39.1%                      39.1%
Paraguay    Paraguay           8                     34.8%                      34.8%
Brazil      England            7                     30.4%                      13.0%
Uruguay     Spain              6                     26.1%                      4.3%
Peru        Peru               6                     26.1%                      26.1%
Chile       Chile              6                     26.1%                      26.1%
Argentina   Argentina          5                     21.7%                      21.7%
Three interesting facts I found:
 30% of players on the Brazil squad play for an English team, the most
out of any league – squad combination excluding domestic leagues.
 100% of the Qatar squad play in their domestic league!
 Only one Uruguayan player (4.3%) plays in Uruguay's domestic league.
In the final plot for this section, I looked at the top 10 clubs
contributing the most players to the tournament. I used
arrange(desc(n)) %>% slice() instead of top_n() as sometimes top_n() grabs
way too many teams that are tied with the same value. To set the team
names inside the bars I created a midpoint value, midval, equal to half
the number of players contributed, so the labels were placed neatly.
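The midpoint trick is just half the bar height; a sketch with made-up club counts:

```r
library(dplyr)

## club names are real, but these counts are invented for illustration
top_clubs <- tibble::tibble(club = c("Al-Sadd", "America", "Barcelona SC"),
                            n    = c(8L, 7L, 5L)) %>%
  arrange(desc(n)) %>%
  mutate(midval = n / 2)   # y-position for a label centered inside each bar

## e.g. geom_text(aes(y = midval, label = club)) sits halfway up the bar
```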
With 100% of its players coming from the domestic league it's no
surprise that the Qatari team, Al-Sadd, is contributing the most players
to the tournament. Tied with another Qatari team, Mexican club America
features 7 players, yet none of them are Mexican (2 Argentineans, 2
Colombians, 1 Ecuadorian, 1 Chilean, and 1 Paraguayan).
At first I thought Barcelona contributed 8 players until I realized the
Ecuadorian players were coming from the Ecuadorian team called
Barcelona…I had to go all the way back up to the beginning of this
section to fix that small peculiarity. As futbol came to South America
via European colonists and immigrants a lot of teams took up the names
and colors of the teams these Europeans were fond of. Other examples
include Liverpool F.C. (Montevideo, Uruguay), Arsenal de Sarandi (Buenos
Aires, Argentina), and Club Atletico Juventus (Sao Paulo, Brazil –
although they use the colors of Torino F.C.).
If you download the data and type in the code below you can see the
entire club–country list.
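A plausible version of that code, assuming the cleaned squad data has club and country columns (both column names are assumptions):

```r
library(dplyr)

club_country_list <- function(squads) {
  squads %>%
    count(club, country, sort = TRUE)   # one row per club-country pairing
}

## toy demonstration of the two-Barcelonas quirk mentioned above
demo <- club_country_list(
  tibble::tibble(club    = c("Barcelona", "Barcelona", "Al-Sadd"),
                 country = c("Spain", "Ecuador", "Qatar"))
)
```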
Now that we got a good look at the composition of the teams, we can take
a look at how they’ve done at every Copa América.
The next code chunk mainly comes from PH Julien and his excellent Kaggle
kernel of “A Journey Through The History of
Soccer”.
Unfortunately, this data does not have penalty results, as those
games are all counted as a draw (as technically that is the actual
result). Considering there are a lot of cagey knockout rounds that finish
in a penalty shootout (including the last two finals…), it is
unfortunate, but that's just the data you have sometimes. There is a way
to webscrape all the Copa América results and assign Win/Loss to those
games that went to penalties but I'll leave that for another time. Also,
there is no info on what stage of the tournament each recorded match is
from.
Now that it’s nice and cleaned up I created a function that reshapes the
data so that it’s set from a certain team’s perspective with the “team”
argument. You can also set the function to look for only results against
a certain opponent by filling in the versus argument.
The output is either a dataframe of all the games a team has been
involved in as well as the record of the team against other teams in the
Copa América or a message saying that the team you picked has never
played against the opponent you picked.
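A condensed sketch of what such a function could look like (the real copaAmerica_resultados() is in the linked code; the column names home_team, away_team, home_score, and away_score here are assumptions):

```r
library(dplyr)

copa_record <- function(data, team, versus = NULL) {
  games <- data %>%
    filter(home_team == team | away_team == team) %>%
    mutate(
      ## reshape each match to the chosen team's perspective
      opponent      = if_else(home_team == team, away_team, home_team),
      goals_for     = if_else(home_team == team, home_score, away_score),
      goals_against = if_else(home_team == team, away_score, home_score),
      result = case_when(
        goals_for > goals_against ~ "Win",
        goals_for < goals_against ~ "Loss",
        TRUE                      ~ "Draw"
      )
    )
  if (!is.null(versus)) {
    games <- filter(games, opponent == versus)
    if (nrow(games) == 0) return(message(team, " has never played ", versus))
  }
  games
}

## toy fixtures to show the reshaping
toy <- tibble::tibble(home_team  = c("Japan", "Peru"),
                      away_team  = c("Bolivia", "Japan"),
                      home_score = c(1, 3),
                      away_score = c(1, 2))
japan_games <- copa_record(toy, "Japan")
```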
Oh… that’s right Japan has never played against Brazil at the Copa…
resultados_japon <- copaAmerica_resultados(data = results_copa_cleaned, team = "Japan")

resultados_japon[[2]] %>%
  knitr::kable(format = "html", caption = "Japan's record in the Copa América") %>%
  kableExtra::kable_styling(full_width = FALSE)

Japan's record in the Copa América

Date         Tournament     Venue                  Opponent   Home / Away   Goals For   Goals Against   Result
1999-06-29   Copa América   Asunción               Peru       away          2           3               Loss
1999-07-02   Copa América   Asunción               Paraguay   away          0           4               Loss
1999-07-05   Copa América   Pedro Juan Caballero   Bolivia    away          1           1               Draw
Japan's only previous journey to the Copa América was in the 1999
edition, where they lost two games and drew against Bolivia. They were also
invited to the 2011 edition but withdrew due to the Tohoku Earthquake and
were replaced by Costa Rica. Japanese football has come a long way since
1999, but with a young squad at this tournament it will be an uphill battle
to get 3 points against any of their Group C opponents: Uruguay, Chile,
and Ecuador.
Date         Tournament     Venue          Opponent    Home / Away   Goals For   Goals Against   Result
2001-07-11   Copa América   Barranquilla   Venezuela   home          2           0               Win
2001-07-14   Copa América   Barranquilla   Ecuador     home          1           0               Win
2001-07-17   Copa América   Barranquilla   Chile       home          2           0               Win
2001-07-23   Copa América   Armenia        Peru        home          3           0               Win
2001-07-26   Copa América   Manizales      Honduras    home          2           0               Win
2001-07-29   Copa América   Bogotá         Mexico      home          1           0               Win
Despite a recent resurgence of the Colombia national team they have not
been able to match the feats of the 2001 side that won the Copa with
their best place finish since then coming 3rd in 2004. The 2001 team
were not only unbeaten but also did not concede a single goal throughout
the tournament!
Opponent    Win   Draw   Loss   Goals For   Goals Against
Argentina   9     8      15     38          52

resultados_de_brazil[[2]] %>%
  tail(5) %>%
  knitr::kable(format = "html", caption = "Brazil vs. Argentina in the Copa América") %>%
  kableExtra::kable_styling(full_width = FALSE)

Brazil vs. Argentina in the Copa América
Date         Tournament     Venue             Opponent    Home / Away   Goals For   Goals Against   Result
1993-06-27   Copa América   Guayaquil         Argentina   home          1           1               Draw
1995-07-17   Copa América   Rivera            Argentina   home          2           2               Draw
1999-07-11   Copa América   Ciudad del Este   Argentina   home          2           1               Win
2004-07-25   Copa América   Lima              Argentina   away          2           2               Draw
2007-07-15   Copa América   Maracaibo         Argentina   home          3           0               Win
Brazil does not have a good overall record vs. Argentina, but they have
not lost against their rivals at the Copa América since the 1993 edition,
where they lost 5-6 on penalties in the Quarter Finals. The “draw” in
1995 was won on penalties, while the “draw” in 2004 was actually in the
final, where they won 4-2 on penalties.
What I found odd was that the Copa América seems to have a very low
priority to certain countries, especially Brazil who have repeatedly
sent their B or C teams to the tournament in favor of sending their best
team to other tournaments or resting star players. Funnily enough, these
under-strength Brazilian squads have actually won the entire tournament a
few times, most notably in 2007 against a full-strength Argentina side
containing the likes of Zanetti, Riquelme, Cambiasso, Tevez, and a young
Messi and Mascherano!
After looking at the history of the competition and the composition of
the squads I examined the players and their form coming into the Copa
América. In recent years football analytics has really taken off and
there have been many strides made in creating more informative
statistics to assess players' abilities, the most prominent being
xG statistics. This is the first time I talk about xG at any
length/depth, so this introduction is as much to solidify my
understanding as it is yours!

xG: Quantitative measure (between 0 and 1) that assigns a
probability that a shot taken will result in a goal, based on a
variety of variables; used for evaluating the quality of
chances and predicting players' and teams' future performances.

xA: Quantitative measure (between 0 and 1) that assigns a
probability that a given pass will result in an assist for a goal,
based on a variety of variables.
Common variables used in the models that output xG statistics are the
distance and angle of a shot, the body part used, rebound, among others.
Similar to how you might assess your favorite striker's chances of
scoring just as he is lining up to take a shot: Is the shot a header? Is
he trying to score from a cross in regular play or from a corner kick? Is
there a crowd of defenders in front of him or is he one-on-one with the
goalkeeper? Etc. You might think who is taking the shot would be a genuine
factor, but in actuality it tells you a lot less about the chances of a
goal than the location of the shot does.
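To make the idea concrete, here is a deliberately tiny, made-up xG model: a logistic function of distance, angle, and whether the shot was a header. The coefficients are invented for illustration and do not come from any real model.

```r
## toy_xg() squashes a linear score onto (0, 1), the way a fitted
## logistic-regression xG model would; coefficients are invented
toy_xg <- function(distance_m, angle_deg, header = FALSE) {
  z <- 1.2 - 0.12 * distance_m + 0.02 * angle_deg - 0.9 * header
  plogis(z)
}

toy_xg(11, 45)                 # a penalty-spot-ish shot: relatively high xG
toy_xg(30, 10, header = TRUE)  # a long-range header: much lower xG
```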
Note that there isn’t a SINGLE xG model. You can check out a blog post
comparing different people’s xG models
here.
People and organizations (from Statsbomb to OptaPro) have their own
ideas about what could be the important variables in play and as
such it's important to report which source you got your data from,
as the stats can differ between models. A few things xG does not factor
in are goalkeeper performance (someone pulling off
incredible saves or letting in a poor shot), a team's style of play, and
the quality of a player's
teammates. When judging players based on these stats it is important to
be aware of contextual factors like the team they play for, their
opponent, and the player’s position/role in the team.
From xG and xA, more granular statistics such as xGChain and xGBuildup
were created to dig a little deeper into who is contributing
to chance creation; I'll introduce the latter two a bit later. As the
field has grown new statistics have popped up such as Karun
Singh’s “expected threat” or xT. You
can check out an introduction to xT from
here.
Of course, these statistics only tell part of the story and are
definitely not the be-all and end-all. In the context of this current
blog post, these stats only tell the story of how these players did
for their club teams this past season rather than for their national
team. Even so, they give us a good idea of what kind of form these
players are in coming into this tournament.
You might also want to watch these Youtube videos by
TifoFootball
for a quick primer on xG
and xA.
For the data I used the website, understat.com. Their xG models were
created via training a neural network on a dataset consisting of over
100,000 shots using more than 10 different variables. Getting data from
understat has been made easy by Ewen Henderson’s understatr package
available from Github (he’s also
the guy that made the ghibli color
palette!). I tried to pick a wide selection of attacking players but I
was also limited by the fact that understat only has data for
teams/players from six European leagues (Premier League, Bundesliga,
Serie A, La Liga, Ligue 1, and Russian Premier League).
For Peru I would have chosen Paolo Guerrero but as he plays in
Brazil now I went with Jefferson Farfan (who hasn’t played as many games
as the other players used for comparison, unfortunately…). For Chile
I would have picked Eduardo Vargas, but as he doesn't play for a team
covered by understat I went with Alexis Sanchez, who had a woeful season and only
played ~600 minutes despite appearing in ~20 league matches and later
added Arturo Vidal. For Brazil I included Neymar initially but since
he won’t actually be playing I’ll keep him for comparison’s sake but
also include Gabriel Jesus and Roberto Firmino who have been fighting
for the starting striker spot. Note that these two aren’t the ones
replacing Neymar positionally. In Neymar’s leftwing position I can
see David Neres or Phil Coutinho replacing him (Richarlison and Willian
mostly play on the right). (Edit: In the first match vs. Bolivia, David
Neres started off on the left while Richarlison played on the right,
Coutinho played just behind Bobby Firmino)
The other nations' strikers/attacking midfielders don't play in the six
European leagues covered by understat or, as in Shinji Okazaki's case,
just did not play many minutes/games during the season. To get the
data I created a list of the player codes and used purrr::map() to
iterate each one through the understatr::get_player_seasons_stats()
function.
As you can see the data consists of a row for each player and each year
(from the 2014/2015 season to the 2018/2019 season). I tried to mitigate
the fact that some players played a lot more minutes than others by
standardizing everything to a ‘per 90 minutes’ value, but this does have
its own disadvantages. Namely, players who play a lot of minutes (as
regular starting members) may not have as high a ‘per 90’ stat even
though their production across all those minutes might suggest that they
are consistently performing and producing at a high level.
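The standardization itself is simple: divide by minutes played and multiply by 90. A quick worked example with invented numbers also shows the low-minutes caveat:

```r
per90 <- function(stat, minutes) stat / minutes * 90

per90(12, 2700)  # 12 goals over 2700 minutes: 0.4 goals per 90
per90(4, 600)    # only 4 goals, but in just 600 minutes: 0.6 per 90
```

The second (hypothetical) player looks better per 90 despite far less total production, which is exactly why a minimum-minutes cutoff is usually applied.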
It’ll be a bit crowded (kind of like a spilt box of Skittles…) but let’s
check out the key metrics for all the players at once.
Note: npg = non-penalty goals, npxG = non-penalty xG
comparison_data <- understat_data %>%
  filter(year == 2018) %>%
  select(-games, -team_name, -year) %>%
  rename(Shots = shots, KP = key_passes) %>%
  gather(key = "key", value = "value", -player_name, -time) %>%
  mutate(key = forcats::as_factor(key) %>%
           forcats::fct_relevel(., "xG", "goals", "npxG", "npg", "xA",
                                "assists", "xGChain", "xGBuildup", "Shots", "KP"))

comparison_strikers_plot <- comparison_data %>%
  filter(key != "Shots", key != "KP", key != "xGBuildup", key != "xGChain") %>%
  mutate(value = value / time * 90) %>%
  ggplot(aes(x = key, y = value, fill = player_name)) +
  geom_jitter(shape = 21, size = 5, color = "black", width = 0.25, stroke = 1.1) +
  geom_vline(xintercept = 1.5, size = 2) +
  geom_vline(xintercept = 2.5, size = 2) +
  geom_vline(xintercept = 3.5, size = 2) +
  geom_vline(xintercept = 4.5, size = 2) +
  geom_vline(xintercept = 5.5, size = 2) +
  coord_flip() +
  scale_y_continuous(expand = c(0.01, 0.01), limits = c(0, 1.26)) +
  scale_fill_manual(values = pals::glasbey(16), name = "Player") +
  labs(title = "Comparison: Top attackers at the Copa América",
       subtitle = "For select group of attacking players with data available from understat.com",
       x = NULL, y = "Metric per 90",
       caption = glue::glue("data: understat.com
                            2018-2019 Season")) +
  theme_copaAmerica(title.size = 18, subtitle.size = 12,
                    panel.grid.minor.x = element_line(color = "white"))

comparison_strikers_plot

As usual in these types of charts, Messi is leading a lot of the metrics
here, and showing consistency too, having played the third-highest
amount of minutes out of the selected players. It's helpful to have the
xG/xA stats next to the actual goals/assists as it provides an
indication of whether the player in question is scoring shots that he
probabilistically should be scoring. When a player’s actual goal count
is higher than their xG stat this suggests that the player is
“exceeding their xG” or that they are scoring from shots that are
historically hard to score from. It can be seen as a marker of an elite
finisher as they are putting away chances from difficult situations
consistently. In terms of assists and xA Alexis Sanchez, who only played
about 600 minutes, looks a lot better than in reality due to the
aforementioned disadvantage of standardizing everything to a “per 90
minutes” value. Normally you would have a cutoff based on a certain
minimum amount of minutes but as I mentioned I was rather limited in
my choice of players.
A way to take a closer look at xG – Goals and xA – Assists is to use a
simple dot plot with a 45-degree reference line. Those
below the line are underperforming relative to their xG or xA stat,
those over it are overachieving (“exceeding” their xG/xA stat), while
those right on the line are scoring or assisting at about the rate the
model expects. I use non-penalty xG below as penalties
have around ~0.75 xG (give or take a few percentage points depending on
the model) and can inflate the stats of those players who take a lot of
penalties and score them, especially if they weren’t the ones who earned
the penalty themselves.
Gabriel Jesus is quite clearly below the 45 degree line meaning that he
has been very poor at finishing chances (and/or incredibly unlucky).
After a poor World Cup where he scored 0 goals as a starter, he is
really going to have to step up to fill Neymar’s goalscoring boots for
this tournament. However, his buildup play for City has still been good
this past season and he has been scoring for Brazil in the friendlies
leading up to the tournament so it’s going to be a hard decision for
Tite to decide on who starts against Bolivia (edit: Firmino started and
contributed an assist while Jesus replaced him in the 65th minute).
One thing to keep in mind is that xA does not take into account the
recipient of the assist pass. Even if the pass given had a high expected
assist value the receiving player still might not have the quality to
put it away through no fault of the passer. This might explain why most
of the players with a higher xA among this group don’t have the assists
to match. It can also be that these players are the ones playing a
lot more minutes, and the volume of chances they create just isn't
translating into goals every time.
I separated “key passes” and “shots” as well as “xGChain” and
“xGBuildup” from the rest as these two sets were on a very different
scale.

xGChain: Quantitative measure that is the combined sum of the xG of
every possession that ends in a shot that a player is involved in.
The same derived value is given to each of the players involved, and
allows us to credit players for attacking contributions outside of
just shots (xG) and assists (xA).

xGBuildup: Similar to xGChain but excluding shots and assists. This
is in response to xGChain values still being dominated by the xG and
xA from shots and assists, respectively.
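A toy hand-computation following those definitions (the possessions and xG values are invented, and this is one plausible reading of "excluding shots and assists", not understat's exact implementation):

```r
## two possessions ending in shots; every involved player gets the full xG
possessions <- list(
  list(xg = 0.30, involved = c("James", "Falcao"), shooter = "Falcao", assister = "James"),
  list(xg = 0.10, involved = c("James", "Duvan"),  shooter = "Duvan",  assister = NA)
)

## xGChain: sum the xG of every shot-ending possession the player touched
xg_chain <- function(player, poss) {
  sum(vapply(poss, function(p) if (player %in% p$involved) p$xg else 0, numeric(1)))
}

## xGBuildup: drop chains where the player's involvement was the shot or assist
xg_buildup <- function(player, poss) {
  sum(vapply(poss, function(p) {
    if (player %in% p$involved &&
        player != p$shooter &&
        !identical(player, p$assister)) p$xg else 0
  }, numeric(1)))
}
```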
Although Gabriel Jesus has been poor at finishing his chances, as seen in
the previous graphs, his xGChain and xGBuildup stats make it clear that he
is still contributing to City's attack outside of scoring goals himself
(not to mention all the defensive work he does as well).
The stats below, for example, clearly differentiate James, who is more
of a playmaker, from Falcao and Duvan, who are traditional number 9s,
through his superior xGBuildup, xGChain, and Key
Passes values. For a more detailed overview of xGChain and xGBuildup
check out Statsbomb’s article
here.
There are lots of different ways to visualize this data, most famously
the radar charts created by Statsbomb. Using
the data you can also compare and evaluate players in many different
ways, and using the understatr package you could add a few more players
like Paulo Dybala, Miguel Almiron, and more! I could probably do a whole
other article using just this data but I'll leave it here for now.
Throughout this blog post I talked about some of the historical records,
squad compositions, match records, and finally the player profiles of
attacking players at this summer’s Copa América. Using the power of R it
is really easy to webscrape and visualize data in a way that is
informative and aesthetically pleasing. I wanted to finish this before
the tournament started but other life things got in the way as well as
the fact that the amount of content ballooned out of control (especially
the xG section) so I had to cut down a lot.
It's been fun reading the articles on the Copa América website and
seeing how far my “intermediate-but-very-out-of-practice” level of
Spanish can get me in understanding the content; here is one that I
particularly liked reading: 14 Estadisticas de la Copa
América.
With so many tournaments going on right now (and with the African Cup of
Nations starting in a few days) a lot of the news media is spread thin,
but there are still some quality articles out there to read
about the Copa, like this
article from the BBC's South
American football expert, Tim Vickery.
After the first round of games, a few points of discussion:

After an extremely lacklustre performance vs. Colombia, how does
Argentina bounce back? What tactical changes need to be made?

Qatar impressed against Paraguay but can they pull off a major upset
vs. Colombia?

How will Japan line up against Uruguay after losing by a scoreline
that didn't really do their performance justice? How will manager
Moriyasu balance experience vs. youth? Will he start veteran
Okazaki after Ueda's numerous misses vs. Chile? How will Japan deal with
their glaring weaknesses at the fullback position? Will Yuta Nakayama
keep his place after an awful performance?

Can Brazil earn an early ticket to the next round vs. Venezuela after
a clinical but not excellent performance vs. Bolivia? Will they be
able to keep up their streak of winning the Copa every time they
have hosted it?
Thanks for reading and…
¡Buena suerte a todos los equipos! (Good luck to all the teams!)
To leave a comment for the author, please follow the link and comment on their blog: R by R(yo). R-bloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...
Le Monde puzzle [#1104]
(This article was first published on R – Xi'an's Og, and kindly contributed to Rbloggers)
A palindromic Le Monde mathematical puzzle:
In a monetary system where all palindromic amounts between 1 and 10⁸ have a coin, find the numbers less than 10³ that cannot be paid with fewer than three coins. Find whether 20,191,104 can be paid with two coins. Similarly, find whether 11,042,019 can be paid with two or three coins.
Which can be solved in a few lines of R code:
coin=sort(c(1:9,(1:9)*11,outer(1:9*101,(0:9)*10,"+")))
amounz=sort(unique(c(coin,as.vector(outer(coin,coin,"+")))))
amounz=amounz[amounz<1e3]
This produces 10 amounts that cannot be paid with one or two coins. It is also easy to check that three coins are enough to cover all amounts below 10³. For the second question, starting the downward search at the palindrome n¹=20,188,102, a simple search over palindromic pairs (n¹,n²) such that n¹+n²=20,191,104 led to n¹=16,755,761 and n²=3,435,343. And starting with 11,033,011, the same search does not produce any solution, while there are three coins such that n¹+n²+n³=11,042,019, for instance n¹=11,022,011, n²=20,002, and n³=6.
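The downward two-coin search described in prose can be written out explicitly. The following is a sketch only (the helper names are mine, not the puzzle author's original code): it generates every palindrome below 10⁸ by mirroring a 1- to 4-digit "half", then scans the palindromic coins downward for a pair summing to the target.

```r
# All palindromes below 10^8: mirror each 1- to 4-digit half,
# once for even-length and once for odd-length palindromes.
halves <- 1:9999
mirror <- function(h, odd) {
  d <- strsplit(as.character(h), "")[[1]]
  rest <- if (odd) rev(d)[-1] else rev(d)   # odd length shares the middle digit
  as.numeric(paste0(c(d, rest), collapse = ""))
}
pals <- sort(unique(c(sapply(halves, mirror, odd = FALSE),
                      sapply(halves, mirror, odd = TRUE))))

is.pal <- function(n) n %in% pals

# Scan palindromic coins downward; return the first pair summing to target.
two.coins <- function(target) {
  for (n1 in rev(pals[pals < target])) {
    if (is.pal(target - n1)) return(c(n1, target - n1))
  }
  NULL
}

two.coins(20191104)  # 16755761 + 3435343, the pair reported above
two.coins(11042019)  # NULL: no two palindromic coins suffice
```

The same `pals` vector also confirms that 11,022,011, 20,002, and 6 are all valid coins for the three-coin decomposition of 11,042,019.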
On my way to Manizales (Colombia)
(This article was first published on Renglish – Freakonometrics, and kindly contributed to Rbloggers)
Next week, I will be in Manizales, Colombia, for the Third International Congress on Actuarial Science and Quantitative Finance. I will be giving a lecture on Wednesday with Jed Frees and Emiliano Valdez.
I will give my course on Algorithms for Predictive Modeling on Thursday morning (after Jed and Emil’s lectures). Unfortunately, my computer locked itself last week, and I could not unlock it (nor could the IT team at the university, who hold the internal EFI password). So I will not be able to work further on the slides, and the course will be based on the version as of now (clearly a work in progress).
Forecasting tools in development
(This article was first published on  R, and kindly contributed to Rbloggers)
As I’ve been writing up a progress report for my NIGMS R35 MIRA award, I’ve been reminded of how much of the work that we’ve been doing is focused on forecasting infrastructure. A common theme in the Reich Lab is making operational forecasts of infectious disease outbreaks. The operational aspect means that we focus on everything from developing and adapting statistical methods to be used in forecasting applications to thinking about the data science toolkit that you need to store, evaluate, and visualize forecasts. To that end, in addition to working closely with the CDC in their FluSight initiative, we’ve been doing a lot of collaborative work on new R packages and data repositories that I hope will be useful beyond the confines of our lab. Some of these projects are fully operational, used in our production flu forecasts for the CDC, and some have even gone through a level of code peer review. Others are in earlier stages of development. My hope is that in putting this list out there (see below the fold) we will generate some interest (and possibly find some new open-source collaborators) for these projects.
Here is a partial list of inprogress software that we’ve been working on:
 sarimaTD is an R package that serves as a wrapper around some of the ARIMA modeling functionality in the forecast R package. We found that we consistently wanted to specify certain transformations (T) and differencing (D) in ways that we have found useful for modeling infectious disease time-series data, so we made it easy for us and others to use these specifications.
 ForecastFramework is an R package that we have collaborated on with our colleagues at the Johns Hopkins Infectious Disease Dynamics lab. We’ve blogged about this before, and we see a lot of potential in this object-oriented framework for both standardizing how datasets are specified/accessed and how models are generated. That said, there is still a long way to go to document and make this infrastructure usable by a wide audience. The most success I’ve had using it so far was having PhD students write forecast models for a seminar I taught this spring. I used a single script that could run and score forecasts from each model, with a very simple plug-and-play interface to the models because they had been specified appropriately.
 Zoltar is a new repository (in alpha-ish release right now) for forecasts that we have been working on over the last year. It was initially designed with our CDC flu forecast use case in mind, although the forecast structure is quite general, and with predx integration on the way (see next bullet) we are hoping that this will broaden the scope of possible use cases for Zoltar. To help facilitate our and others’ use of Zoltar, we are working on two interfaces to the Zoltar API, zoltpy for Python and zoltr for R. Check out the documentation, especially for zoltr. There is quite a bit of data available!
 predx is an R package designed by my colleague and friend Michael Johansson of the US Centers for Disease Control and Prevention and OutbreakScience. Evan Ray, from the Reich Lab team, has contributed to it as well. The goal of predx is to define some general classes of data for both probabilistic and point forecasts, to better standardize the ways that we might want to store and operate on these data.
 d3-foresight is the main engine behind our interactive forecast visualizations for flu in the US. We have also integrated it with Zoltar, so that you can view forecasts stored in Zoltar (note: kind of a long load time for that last link) using some of the basic d3-foresight functionality.
The lion’s share of the credit for all of the above is due to some combination of Matthew Cornell, Abhinav Tushar, Katie House, and Evan Ray.
modelDown is now on CRAN!
(This article was first published on English – SmarterPoland.pl, and kindly contributed to Rbloggers)
The modelDown package turns classification or regression models into static HTML websites.
With one command you can convert one or more models into a website with visual and tabular model summaries: model performance, feature importance, single-feature response profiles, and basic model audits.
modelDown uses DALEX explainers, so it is model-agnostic (feel free to combine a random forest with a glm) and easy to extend and parameterise.
Here you can browse an example website automatically created for 4 classification models (random forest, gradient boosting, support vector machines, k-nearest neighbours). The R code behind this example is here.
Fun facts:
– archivist hooks are generated for every documented object. So you can easily extract R objects from the HTML website. Try
archivist::aread("MI2DataLab/modelDown_example/docs/repository/574defd6a96ecf7e5a4026699971b1d7")
– session info is automatically recorded, so you can check the versions of the packages available at model development time (https://github.com/MI2DataLab/modelDown_example/blob/master/docs/session_info/session_info.txt)
– This package was initially created by Magda Tatarynowicz, Kamil Romaszko, and Mateusz Urbański from Warsaw University of Technology as a student project.