R news and tutorials contributed by (750) R bloggers

New Versions of R GUIs: BlueSky, JASP, jamovi

Tue, 06/18/2019 - 16:38

(This article was first published on R – r4stats.com, and kindly contributed to R-bloggers)

It has been only two months since I summarized my reviews of point-and-click front ends for R, and it’s already out of date! I have converted that post into a regularly-updated article and added a plot of total features, which I repeat below. It shows the total number of features in each package, including the latest versions of BlueSky Statistics, JASP, and jamovi. The reviews which initially appeared as blog posts are now regularly-updated pages.

New Features in JASP

Let’s take a look at some of the new features, starting with the version of JASP that was released three hours ago:

  • Interface adjustments
    • Data panel, analysis input panel and results panel can be manipulated much more intuitively with sliders and show/hide buttons
    • Changed the analysis input panel to have an overview of all opened analyses and added the possibility to change titles, to show documentation, and remove analyses
  • Enhanced the navigation through the file menu; it is now possible to use arrow keys or simply hover over the buttons
  • Added possibility to scale the entire application with Ctrl +, Ctrl – and Ctrl 0
  • Added MANOVA
  • Added Confirmatory Factor Analysis
  • Added Bayesian Multinomial Test
  • Included additional menu preferences to customize JASP to your needs
  • Added/updated help files for most analyses
  • R engine updated from 3.4.4 to 3.5.2
  • Added Šidák correction for post-hoc tests (AN(C)OVA)

A complete list of fixes and features is available here. JASP is available for free from their download page.  My comparative review of JASP is here.

New Features in jamovi

Two of the usability features added to jamovi recently are templates and multi-file input. Both are described in detail here.

Templates enable you to save all the steps in your work as a template file. Opening that file in jamovi then lets you open a new dataset and the template will recreate all the previous analyses and graphs using the new data. It provides reusability without having to depend on the R code that GUI users are trying to avoid using.

The multi-file input lets you select many CSV files at once and jamovi will open and stack them all (they must contain common variable names, of course).
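
For readers curious what that stacking amounts to in R terms, the sketch below reads and row-binds several CSV files with identical columns (my own illustration with hypothetical file names, not jamovi's internal code):

# Stack several CSV files that share the same column names (hypothetical file names)
files <- c("wave1.csv", "wave2.csv", "wave3.csv")
stacked <- do.call(rbind, lapply(files, read.csv))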

Other new analytic features have been added in a set of modeling modules. They’re described in detail here, and a list of some of their capabilities is below (an R sketch of comparable model calls follows the list). You can read my full review of jamovi here, and you can download it for free here.

  • OLS Regression (GLM)
  • OLS ANOVA (GLM)
  • OLS ANCOVA (GLM)
  • Random coefficients regression (Mixed)
  • Random coefficients ANOVA-ANCOVA (Mixed)
  • Logistic regression (GZLM)
  • Logistic ANOVA-like model (GZLM)
  • Probit regression (GZLM)
  • Probit ANOVA-like model (GZLM)
  • Multinomial regression (GZLM)
  • Multinomial ANOVA-like model (GZLM)
  • Poisson regression (GZLM)
  • Poisson ANOVA-like model (GZLM)
  • Overdispersed Poisson regression (GZLM)
  • Overdispersed Poisson ANOVA-like model (GZLM)
  • Negative binomial regression (GZLM)
  • Negative binomial ANOVA-like model (GZLM)
  • Continuous and categorical independent variables
  • Omnibus tests and parameter estimates
  • Confidence intervals
  • Simple slopes analysis
  • Simple effects
  • Post-hoc tests
  • Plots for up to three-way interactions for both categorical and continuous independent variables.
  • Automatic selection of the best estimation methods and degrees of freedom
  • Type III estimation
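
For orientation, the model families listed above map onto familiar R calls. The sketch below fits a few of them with base R, MASS, and lme4 on simulated data; it is my own illustration of comparable models, not the code the jamovi modules actually run, and it assumes the MASS and lme4 packages are installed.

# Simulate a small data set with a grouping factor, subjects, and several outcomes
set.seed(1)
df <- data.frame(
  subject = factor(rep(1:20, each = 5)),
  group   = factor(rep(c("a", "b"), length.out = 100)),
  x       = rnorm(100)
)
df$y       <- 1 + (0.5 + rep(rnorm(20, sd = 0.3), each = 5)) * df$x +
              rep(rnorm(20, sd = 0.5), each = 5) + rnorm(100, sd = 0.5)   # continuous outcome
df$success <- rbinom(100, 1, plogis(0.8 * df$x))                          # binary outcome
df$counts  <- rnbinom(100, mu = exp(0.5 + 0.3 * df$x), size = 2)          # overdispersed counts

lm(y ~ x + group, data = df)                                      # OLS regression / ANCOVA (GLM)
lme4::lmer(y ~ x + (1 + x | subject), data = df)                  # random coefficients (Mixed)
glm(success ~ x + group, family = binomial, data = df)            # logistic regression (GZLM)
glm(success ~ x + group, family = binomial("probit"), data = df)  # probit regression (GZLM)
glm(counts ~ x + group, family = poisson, data = df)              # Poisson regression (GZLM)
glm(counts ~ x + group, family = quasipoisson, data = df)         # overdispersed Poisson (GZLM)
MASS::glm.nb(counts ~ x + group, data = df)                       # negative binomial (GZLM)
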
New Features in BlueSky Statistics

The BlueSky developers have been working on adding psychometric methods (for a book that is due out soon) and support for distributions; a short sketch of the corresponding base-R distribution functions follows the list below. My full review is here and you can download BlueSky Statistics for free here.

  • Model Fitting: IRT: Simple Rasch Model
  • Model Fitting: IRT: Simple Rasch Model (Multi-Faceted)
  • Model Fitting: IRT: Partial Credit Model
  • Model Fitting: IRT: Partial Credit Model (Multi-Faceted)
  • Model Fitting: IRT: Rating Scale Model
  • Model Fitting: IRT: Rating Scale Model (Multi-Faceted)
  • Model Statistics: IRT: ICC Plots
  • Model Statistics: IRT: Item Fit
  • Model Statistics: IRT: Plot PI Map
  • Model Statistics: IRT: Item and Test Information
  • Model Statistics: IRT: Likelihood Ratio and Beta plots
  • Model Statistics: IRT: Personfit
  • Distributions: Continuous: Beta Probabilities
  • Distributions: Continuous: Beta Quantiles
  • Distributions: Continuous: Plot Beta Distribution
  • Distributions: Continuous: Sample from Beta Distribution
  • Distributions: Continuous: Cauchy Probabilities
  • Distributions: Continuous: Plot Cauchy Distribution
  • Distributions: Continuous: Cauchy Quantiles
  • Distributions: Continuous: Sample from Cauchy Distribution
  • Distributions: Continuous: Chi-squared Probabilities
  • Distributions: Continuous: Chi-squared Quantiles
  • Distributions: Continuous: Plot Chi-squared Distribution
  • Distributions: Continuous: Sample from Chi-squared Distribution
  • Distributions: Continuous: Exponential Probabilities
  • Distributions: Continuous: Exponential Quantiles
  • Distributions: Continuous: Plot Exponential Distribution
  • Distributions: Continuous: Sample from Exponential Distribution
  • Distributions: Continuous: F Probabilities
  • Distributions: Continuous: F Quantiles
  • Distributions: Continuous: Plot F Distribution
  • Distributions: Continuous: Sample from F Distribution
  • Distributions: Continuous: Gamma Probabilities
  • Distributions: Continuous: Gamma Quantiles
  • Distributions: Continuous: Plot Gamma Distribution
  • Distributions: Continuous: Sample from Gamma Distribution
  • Distributions: Continuous: Gumbel Probabilities
  • Distributions: Continuous: Gumbel Quantiles
  • Distributions: Continuous: Plot Gumbel Distribution
  • Distributions: Continuous: Sample from Gumbel Distribution
  • Distributions: Continuous: Logistic Probabilities
  • Distributions: Continuous: Logistic Quantiles
  • Distributions: Continuous: Plot Logistic Distribution
  • Distributions: Continuous: Sample from Logistic Distribution
  • Distributions: Continuous: Lognormal Probabilities
  • Distributions: Continuous: Lognormal Quantiles
  • Distributions: Continuous: Plot Lognormal Distribution
  • Distributions: Continuous: Sample from Lognormal Distribution
  • Distributions: Continuous: Normal Probabilities
  • Distributions: Continuous: Normal Quantiles
  • Distributions: Continuous: Plot Normal Distribution
  • Distributions: Continuous: Sample from Normal Distribution
  • Distributions: Continuous: t Probabilities
  • Distributions: Continuous: t Quantiles
  • Distributions: Continuous: Plot t Distribution
  • Distributions: Continuous: Sample from t Distribution
  • Distributions: Continuous: Uniform Probabilities
  • Distributions: Continuous: Uniform Quantiles
  • Distributions: Continuous: Plot Uniform Distribution
  • Distributions: Continuous: Sample from Uniform Distribution
  • Distributions: Continuous: Weibull Probabilities
  • Distributions: Continuous: Weibull Quantiles
  • Distributions: Continuous: Plot Weibull Distribution
  • Distributions: Continuous: Sample from Weibull Distribution
  • Distributions: Discrete: Binomial Probabilities
  • Distributions: Discrete: Binomial Quantiles
  • Distributions: Discrete: Binomial Tail Probabilities
  • Distributions: Discrete: Plot Binomial Distribution
  • Distributions: Discrete: Sample from Binomial Distribution
  • Distributions: Discrete: Geometric Probabilities
  • Distributions: Discrete: Geometric Quantiles
  • Distributions: Discrete: Geometric Tail Probabilities
  • Distributions: Discrete: Plot Geometric Distribution
  • Distributions: Discrete: Sample from Geometric Distribution
  • Distributions: Discrete: Hypergeometric Probabilities
  • Distributions: Discrete: Hypergeometric Quantiles
  • Distributions: Discrete: Hypergeometric Tail Probabilities
  • Distributions: Discrete: Plot Hypergeometric Distribution
  • Distributions: Discrete: Sample from Hypergeometric Distribution
  • Distributions: Discrete: Negative Binomial Probabilities
  • Distributions: Discrete: Negative Binomial Quantiles
  • Distributions: Discrete: Negative Binomial Tail Probabilities
  • Distributions: Discrete: Plot Negative Binomial Distribution
  • Distributions: Discrete: Sample from Negative Binomial Distribution
  • Distributions: Discrete: Poisson Probabilities
  • Distributions: Discrete: Poisson Quantiles
  • Distributions: Discrete: Poisson Tail Probabilities
  • Distributions: Discrete: Plot Poisson Distribution
  • Distributions: Discrete: Sample from Poisson Distribution
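
These probability, quantile, plot, and sample menu items mirror R’s own d/p/q/r family of distribution functions, so the same quantities are one function call away at the console. A small normal-distribution illustration (my own example, not BlueSky output):

pnorm(1.96)                         # probability: P(X <= 1.96) for a standard normal
qnorm(0.975)                        # quantile with 97.5% of the mass below it
curve(dnorm(x), from = -4, to = 4)  # plot the normal density
rnorm(5)                            # draw a sample of five values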

How to Perform Ordinal Logistic Regression in R

Tue, 06/18/2019 - 14:17

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

In this article, we discuss the basics of ordinal logistic regression and its implementation in R. Ordinal logistic regression is a widely used classification method, with applications in a variety of domains. It is the go-to tool when there is a natural ordering in the dependent variable. For example, a dependent variable with levels low, medium, and high is a perfect context for ordinal logistic regression.

Given its wide applicability, ordinal logistic regression is one of the most widely used methods in data analytics. It is also known as the proportional odds model because of the transformation used during estimation and the log odds interpretation of the output. We hope that this article helps our readers understand the basics and implement the model in R.

The article is organized as follows: section 1 provides a quick review of the theory behind ordinal logistic regression. Section 2 discusses the steps to perform ordinal logistic regression in R, shares the R script, and covers the basics of interpreting and evaluating the model in R. In section 3, we learn a more intuitive way to interpret the model. Section 4 concludes the article.

Basics of ordinal logistic regression

Ordinal logistic regression is an extension of the simple logistic regression model. In simple logistic regression, the dependent variable is categorical and follows a Bernoulli distribution (for a quick reference, check out this article by Perceptive Analytics – https://www.kdnuggets.com/2017/10/learn-generalized-linear-models-glm-r.html). In ordinal logistic regression, by contrast, the dependent variable is ordinal, i.e. there is an explicit ordering in the categories. For example, during preliminary testing of a pain relief drug, participants are asked to express the amount of relief they feel on a five-point Likert scale. Another common example of an ordinal variable is app ratings: on Google Play, customers are asked to rate apps on a scale ranging from 1 to 5. Ordinal logistic regression comes in handy in such examples because there is a clear order in the categorical dependent variable.

In simple logistic regression, the log odds that an event occurs are modeled as a linear combination of the independent variables. But this approach ignores the ordering of the categorical dependent variable. The ordinal logistic regression model overcomes this limitation by using cumulative events for the log odds computation. Unlike simple logistic regression, ordinal logistic models consider the probability of an event together with all the events below it in the ordered hierarchy. The event of interest is therefore obtaining a rating less than or equal to some value X. For example, the log odds for an app rating less than or equal to 1 would be computed as follows:

LogOdds(rating <= 1) = log( p(rating <= 1) / p(rating > 1) )  [Eq. 1]

Likewise, the log of odds can be computed for other values of app ratings.  The computations for other ratings are below:

LogOdds(rating <= 2) = log( p(rating <= 2) / p(rating > 2) )  [Eq. 2]

LogOdds(rating <= 3) = log( p(rating <= 3) / p(rating > 3) )  [Eq. 3]

LogOdds(rating <= 4) = log( p(rating <= 4) / p(rating > 4) )  [Eq. 4]

Because all the ratings up to the focal score are included in each computation, the highest app rating of 5 covers every observation and therefore has no log odds equation of its own. In general, the ordinal regression model can be represented using the LogOdds computation:

LogOdds(Y <= i) = α_i + β1*X1 + β2*X2 + … + βn*Xn

where,
Y is the ordinal dependent variable,
i indexes the equations and runs from 1 to the number of categories minus 1,
X1, X2, …, Xn are the independent variables; they can be measured on a nominal, ordinal or continuous scale, and
β1, β2, …, βn are the estimated parameters.

For a dependent variable with k ordered categories, we obtain k - 1 equations. The proportional odds assumption implies that the effect of the independent variables is identical in each log odds equation. This is not the case for the intercept, which takes a different value in each equation. Besides the proportional odds assumption, the ordinal logistic regression model assumes an ordinal dependent variable and the absence of multicollinearity, i.e. that the independent variables are not strongly correlated. These assumptions matter because violating them makes the estimated parameters unreliable.

Model building in R

In this section, we describe the dataset and implement ordinal logistic regression in R. We use a simulated dataset for the analysis. The objective is to predict the likelihood of repeated purchase by customers. The dependent variable is measured on an ordinal scale and takes one of three levels – low probability, medium probability, and high probability. The independent variables are possession of a coupon by the focal customer, recommendation received from peers, and quality of the product. Coupon possession and peer recommendation are categorical variables, while quality is measured on a scale of 1 to 5. We discuss the steps for implementing ordinal logistic regression and share the commented R script for the reader's benefit. The data is in .csv format and can be downloaded by clicking here.

Before starting the analysis, I will briefly describe the preliminary steps. The first step is to place the data file in the working directory. The next step is to explicitly define the ordering of the levels in the dependent variable and in the relevant independent variables. This step is crucial; skipping it can lead to a meaningless analysis.

#Read data file from working directory
setwd("C:/Users/You/Desktop")
data <- read.table("data.txt")

#Ordering the dependent variable
data$rpurchase = factor(data$rpurchase, levels = c("low probability", "medium probability", "high probability"), ordered = TRUE)
data$peers = factor(data$peers, levels = c("0", "1"), ordered = TRUE)
data$coupon = factor(data$coupon, levels = c("0", "1"), ordered = TRUE)

 

Next, it is essential to perform exploratory data analysis. Exploratory data analysis deals with outliers and missing values, which can induce bias in the model. In this article we do a basic exploration by looking at summary statistics and a frequency table. Figure 1 shows the summary statistics: we see the counts for the ordinal variables and the distribution characteristics of the other variables. We compute the counts of rpurchase for the different values of coupon in Table 1. Note that the median level of rpurchase increases with coupon, indicating that a coupon positively affects the likelihood of repeated purchase. The R script for the summary statistics and the frequency table is as follows:

#Exploratory data analysis
#Summarizing the data
summary(data)

#Making frequency table
table(data$rpurchase, data$coupon)

Figure 1. Summary statistics 

 

Table 1. Count of rpurchase by coupon

rpurchase / coupon     With coupon   Without coupon
Low probability            200              20
Medium probability         110              30
High probability            27              13
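
Purely to illustrate the cumulative log odds idea from Eq. 1–3 on the numbers above, the observed cumulative log odds for the with-coupon column of Table 1 can be computed directly (a sketch of mine, not part of the article's script):

# Observed cumulative log odds for the with-coupon column of Table 1
counts <- c(low = 200, medium = 110, high = 27)
cum_p <- cumsum(counts) / sum(counts)
log(cum_p[1:2] / (1 - cum_p[1:2]))   # log odds of rpurchase <= low and <= medium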

After defining the order of the levels in the dependent variable and performing exploratory data analysis, the data is ready to be partitioned into a training and a test set. We build the model using the training set and validate it on the test set. The partitioning is random; the random sample is generated using the sample() function along with set.seed(). The R script for creating the training and test sets is as follows:

#Dividing data into training and test set
#Random sampling
samplesize = floor(0.60 * nrow(data))   # use an integer sample size
set.seed(100)
index = sample(seq_len(nrow(data)), size = samplesize)

#Creating training and test set
datatrain = data[index, ]
datatest = data[-index, ]

Now we build the model using the data in the training set. As discussed above, the dependent variable in the model is on the log odds scale, and because of that transformation the coefficients are not straightforward to interpret. Note that the coefficients cannot be interpreted as marginal effects; they are proportional odds coefficients and are interpreted as increases in log odds. The interpretation changes not only for the coefficients but also for the intercepts. Unlike simple linear regression, ordinal logistic regression gives n - 1 intercepts, where n is the number of categories in the dependent variable. Each intercept can be interpreted as the expected log odds of falling at or below the corresponding category. Before interpreting the model, we share the relevant R script and the results. In the R code, we set Hess = TRUE, a logical argument that asks polr() to return the Hessian matrix; the Hessian is needed to use the summary() function or to compute the variance-covariance matrix of the fitted model.

#Build ordinal logistic regression model
library(MASS)   # polr() comes from the MASS package
model = polr(rpurchase ~ coupon + peers + quality, data = datatrain, Hess = TRUE)
summary(model)

 

Figure 2. Ordinal logistic regression model on training set

The table displays the values of the coefficients and intercepts with the corresponding standard errors and t values. The coefficients are interpreted as follows: holding everything else constant, an increase of one unit in coupon increases the expected value of rpurchase by 0.96 on the log odds scale. The coefficients of peers and quality are interpreted likewise.

Note that ordinal logistic regression outputs multiple intercepts, one for each boundary between adjacent levels of the dependent variable. The intercepts can be interpreted as the expected log odds when the other variables are zero. For example, the low probability | medium probability intercept takes the value 2.13, indicating that the expected log odds of being in the low probability category, when the other variables are zero, is 2.13. Using the inverse logit transformation, p = exp(α) / (1 + exp(α)), the intercepts can also be expressed as probabilities: the expected probability of being in the low probability category, when the other variables are zero, is 0.89.
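
As a quick check of that probability, base R's plogis() implements the inverse logit, so the reported intercept can be converted directly (my own snippet, using the value quoted above):

plogis(2.13)   # exp(2.13) / (1 + exp(2.13)), roughly 0.89 as quoted above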

After building and interpreting the model, the next step is to evaluate it. Evaluation is conducted on the test dataset. A basic approach is to compute the confusion matrix and the misclassification error. The R code and the results are as follows:

#Compute confusion table and misclassification error
predictrpurchase = predict(model, datatest)
table(datatest$rpurchase, predictrpurchase)
mean(as.character(datatest$rpurchase) != as.character(predictrpurchase))

Figure 3. Confusion matrix

The confusion matrix shows the performance of the ordinal logistic regression model. In the test dataset, the low probability category is identified correctly 76 times, the medium probability category 10 times, and the high probability category 0 times. The model identifies the high probability category poorly, which happens because that category is under-represented in the training data. From the confusion matrix, the misclassification error of the model is 46%.

Interpretation using plots

The interpretation of ordinal logistic regression in terms of log odds ratios is not easy to grasp, so we offer an alternative approach based on plots. The R code for plotting the effects of the independent variables is as follows:

#Plotting the effects
library("effects")
Effect(focal.predictors = "quality", model)
plot(Effect(focal.predictors = "coupon", model))
plot(Effect(focal.predictors = c("quality", "coupon"), model))

The plots are intuitive and easy to understand. For example, Figure 4 shows that a coupon increases the likelihood of classification into the high probability and medium probability classes, while decreasing the likelihood of classification into the low probability class.

Figure 4. Effect of coupon on identification


It is also possible to look at the joint effect of two independent variables. Figure 5 shows the joint effect of quality and coupon on the predicted category of the dependent variable. Looking at the top row of Figure 5, we see that the combination of coupon and quality increases the likelihood of classification into the high probability category.

Figure 5. Joint effect of quality and coupon on identification

Conclusion

The article discusses the fundamentals of ordinal logistic regression, builds the model in R, and ends with interpretation and evaluation. Ordinal logistic regression extends the simple logistic regression model to situations where the dependent variable is ordinal, i.e. can be ordered. It has a variety of applications; for example, it is often used in marketing to increase customer lifetime value, where consumers are categorized into classes based on their tendency to make repeated purchases. To discuss the model in an applied manner, we built this article around the case of consumer categorization. The independent variables of interest are a coupon held by the consumer from a previous purchase, the influence of peers, and the quality of the product.

The article has two key takeaways. First, ordinal logistic regression comes in handy when dealing with a dependent variable that can be ordered; using multinomial logistic regression instead throws away the information in that ordering. Second, the coefficients of ordinal logistic regression cannot be interpreted in the same way as the coefficients of ordinary linear regression; interpreting them as marginal effects is one of the common mistakes users make when implementing the model. We again emphasize the graphical method for interpreting the coefficients, which makes it easy to understand the individual and joint effects of the independent variables on the likelihood of classification.

This article can be a go-to reference for understanding the basics of ordinal logistic regression and its implementation in R. We have provided commented R code throughout the article.

Download R-code

Credits: Chaitanya Sagar and Aman Asija of Perceptive Analytics. Perceptive Analytics is a marketing analytics and Tableau consulting company. 

anytime 0.3.4

Tue, 06/18/2019 - 13:28

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new minor release of the anytime package is arriving on CRAN. This is the fifteenth release, and the first since the 0.3.3 release in November.

anytime is a very focused package aiming to do just one thing really well: to convert anything in integer, numeric, character, factor, ordered, … format to either POSIXct or Date objects – and to do so without requiring a format string. See the anytime page, or the GitHub README.md for a few examples.
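
A few minimal calls illustrate that formatless conversion (my own examples, not taken from the release notes):

library(anytime)
anytime("2019-06-18 13:28:00")   # character -> POSIXct without a format string
anydate("2019-06-18")            # character -> Date
anydate(20190618L)               # integer in yyyymmdd form -> Date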

This release is mostly internal: it switches to the excellent new tinytest package, tweaks the iso8601() format helper, which now uses a 'T' between date and time (a breaking change, with the usual addition of an option to get the old behaviour back), and a little more. The full list of changes follows.

Changes in anytime version 0.3.4 (2019-06-18)
  • Documentation was updated about a ‘Europe/London’ conversion issue (#84, inter alia).

  • The package is now compiled under the C++11 standard.

  • The package now uses tinytest for unit tests.

  • The iso8601() function now places a ‘T’ between date and time; an option switches to the prior format using a space (see the short example after this list).

  • The vignette is now pre-made and included as-is in a Sweave document reducing the number of suggested packages.
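
To see the formatting change from the list above in action, the iso8601() helper can be applied to a converted timestamp (example mine; the name of the option that restores the old space-separated output is documented in the package and not shown here):

library(anytime)
iso8601(anytime("2019-06-18 13:28:00"))   # now "2019-06-18T13:28:00" with the new 'T' separator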

Courtesy of CRANberries, there is a comparison to the previous release. More information is on the anytime page. The issue tracker off the GitHub repo can be used for questions and comments.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

Understanding AdaBoost – or how to turn Weakness into Strength

Tue, 06/18/2019 - 08:00

(This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers)


Many of you might have heard of the concept of the “Wisdom of the Crowd”: when many people independently guess some quantity, e.g. the number of marbles in a glass jar, the average of their guesses is often pretty accurate – even though many of the individual guesses are totally off.

The same principle is at work in so-called ensemble methods, like bagging and boosting. If you want to know more about boosting and how to turn the pseudocode of a scientific paper into valid R code, read on…

We start from an original paper by one of the authors of the first practical boosting algorithm, AdaBoost: Robert E. Schapire: Explaining AdaBoost. The first sentence of the introduction gives the big idea:

Boosting is an approach to machine learning based on the idea of creating a highly accurate prediction rule by combining many relatively weak and inaccurate rules.

The second page gives the pseudocode of Adaboost…:

Given: (x_1, y_1), ..., (x_m, y_m) where x_i is an element of X and y_i is an element of {-1, +1}.
Initialize: D_1(i) = 1/m for i = 1, ..., m.
For t = 1, ..., T:

  • Train weak learner using distribution D_t.
  • Get weak hypothesis h_t: X -> {-1, +1}.
  • Aim: select h_t with low weighted error:

       e_t = Pr_{i ~ D_t}[ h_t(x_i) != y_i ]

  • Choose a_t = 0.5 * log((1 - e_t) / e_t).
  • Update, for i = 1, ..., m:

       D_{t+1}(i) = D_t(i) * exp(-a_t * y_i * h_t(x_i)) / Z_t

    where Z_t is a normalization factor (chosen so that D_{t+1} will be a distribution).

Output the final hypothesis:

   H(x) = sign( sum_{t=1}^{T} a_t * h_t(x) )

… with some explanation:

[…] we are given labeled training examples (x_1, y_1), ..., (x_m, y_m) where the x_i are in some domain X, and the labels y_i are in {-1, +1}. On each round t = 1, ..., T, a distribution D_t is computed as in the figure over the m training examples, and a given weak learner or weak learning algorithm is applied to find a weak hypothesis h_t: X -> {-1, +1}, where the aim of the weak learner is to find a weak hypothesis with low weighted error e_t relative to D_t. The final or combined hypothesis H computes the sign of a weighted combination of weak hypotheses

   F(x) = sum_{t=1}^{T} a_t * h_t(x)

This is equivalent to saying that H is computed as a weighted majority vote of the weak hypotheses h_t where each is assigned weight a_t. ([…] we use the terms “hypothesis” and “classifier” interchangeably.)

So, AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favor of those instances misclassified by previous ones. But to really understand what is going on, my approach has always been that you haven’t really understood something until you have built it yourself…

Perhaps you might want to try to translate the pseudocode into R code yourself before reading on… (To increase your motivation, I frankly admit that I also had some errors in my first implementation… which provides a good example of how strong the R community is, because I posted it on Stack Overflow and got a perfect answer two hours later: What is wrong with my implementation of AdaBoost?)

Anyway, here is my implementation (the data can be found here: http://freakonometrics.free.fr/myocarde.csv):

library(rpart)
library(OneR)

maxdepth <- 1
T <- 100 # number of rounds

# Given: (x_1, y_1),...,(x_m, y_m) where x_i element of X, y_i element of {-1, +1}
myocarde <- read.table("data/myocarde.csv", header = TRUE, sep = ";")
y <- (myocarde[ , "PRONO"] == "SURVIE") * 2 - 1
x <- myocarde[ , 1:7]
m <- nrow(x)
data <- data.frame(x, y)

# Initialize: D_1(i) = 1/m for i = 1,...,m
D <- rep(1/m, m)
H <- replicate(T, list())
a <- vector(mode = "numeric", T)
set.seed(123)

# For t = 1,...,T
for(t in 1:T) {
  # Train weak learner using distribution D_t
  # Get weak hypothesis h_t: X -> {-1, +1}
  H[[t]] <- rpart(y ~ ., data = data, weights = D, maxdepth = maxdepth, method = "class")
  # Aim: select h_t with low weighted error: e_t = Pr_i~D_t[h_t(x_i) != y_i]
  h <- predict(H[[t]], x, type = "class")
  e <- sum((h != y) * D)
  # Choose a_t = 0.5 * log((1-e) / e)
  a[t] <- 0.5 * log((1-e) / e)
  # Update for i = 1,...,m: D_t+1(i) = (D_t(i) * exp(-a_t * y_i * h_t(x_i))) / Z_t
  # where Z_t is a normalization factor (chosen so that D_t+1 will be a distribution)
  D <- D * exp(-a[t] * y * as.numeric(as.character(h)))
  D <- D / sum(D)
}

# Output the final hypothesis: H(x) = sign(sum of a_t * h_t(x) for t=1 to T)
newdata <- x
H_x <- sapply(H, function(x) as.numeric(as.character(predict(x, newdata = newdata, type = "class"))))
H_x <- t(a * t(H_x))
pred <- sign(rowSums(H_x))
eval_model(pred, y)
##
## Confusion matrix (absolute):
##           Actual
## Prediction -1  1 Sum
##        -1  29  0  29
##        1    0 42  42
##        Sum 29 42  71
##
## Confusion matrix (relative):
##           Actual
## Prediction   -1    1  Sum
##        -1  0.41 0.00 0.41
##        1   0.00 0.59 0.59
##        Sum 0.41 0.59 1.00
##
## Accuracy:
## 1 (71/71)
##
## Error rate:
## 0 (0/71)
##
## Error rate reduction (vs. base rate):
## 1 (p-value < 2.2e-16)

Let’s compare this with the result from the package JOUSBoost (on CRAN):

library(JOUSBoost)
## JOUSBoost 2.1.0
boost <- adaboost(as.matrix(x), y, tree_depth = maxdepth, n_rounds = T)
pred <- predict(boost, x)
eval_model(pred, y)
##
## Confusion matrix (absolute):
##           Actual
## Prediction -1  1 Sum
##        -1  29  0  29
##        1    0 42  42
##        Sum 29 42  71
##
## Confusion matrix (relative):
##           Actual
## Prediction   -1    1  Sum
##        -1  0.41 0.00 0.41
##        1   0.00 0.59 0.59
##        Sum 0.41 0.59 1.00
##
## Accuracy:
## 1 (71/71)
##
## Error rate:
## 0 (0/71)
##
## Error rate reduction (vs. base rate):
## 1 (p-value < 2.2e-16)

As you can see: zero errors as with my implementation. Two additional remarks are in order:

An accuracy of 100% hints at one of the problems of boosting: it is prone to overfitting (see also Learning Data Science: Modelling Basics).
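
One simple way to probe that would be to refit on a training subset and evaluate on held-out rows; the sketch below is my own addition, reusing the x, y, maxdepth, and T objects defined above with JOUSBoost:

# Hold out ~30% of the rows to check how the boosted model generalizes
set.seed(123)
idx <- sample(nrow(x), size = round(0.7 * nrow(x)))
boost_tr <- adaboost(as.matrix(x[idx, ]), y[idx], tree_depth = maxdepth, n_rounds = T)
eval_model(predict(boost_tr, as.matrix(x[-idx, ])), y[-idx])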

The second problem is the lack of interpretability: whereas single decision trees are normally easy to interpret, ensembles of them are not. This is also known as the accuracy-interpretability trade-off (another frequently used ensemble method is random forests; see also Learning Data Science: Predicting Income Brackets).

I hope this post helped you understand the widely used boosting methodology better and showed how to get from pseudocode to valid R code. If you have any questions or feedback, please let me know in the comments – thank you and stay tuned!

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R-Bloggers – Learning Machines. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Understanding AdaBoost – or how to turn Weakness into Strength

Tue, 06/18/2019 - 08:00

(This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers)


Many of you might have heard of the concept “Wisdom of the Crowd”: when many people independently guess some quantity, e.g. the number of marbles in a jar glass, the average of their guesses is often pretty accurate – even though many of the guesses are totally off.

The same principle is at work in so called ensemble methods, like bagging and boosting. If you want to know more about boosting and how to turn pseudocode of a scientific paper into valid R code read on…

We start from an original paper of one of the authors of the first practical boosting algorithm, i.e. AdaBoost: Robert E. Schapire: Explaining AdaBoost. The first sentence of the introduction gives the big idea:

Boosting is an approach to machine learning based on the idea of creating a highly accurate prediction rule by combining many relatively weak and inaccurate rules.

The second page gives the pseudocode of Adaboost…:

Given: where .
Initialize: for .
For :

  • Train weak learner using distribution .
  • Get weak hypothesis : .
  • Aim: select with low weighted error:

       

  • Choose .
  • Update, for :

       

    where is a normalization factor (chosen so that will be a distribution).

Output the final hypothesis:

   

… with some explanation:

[…] we are given labeled training examples where the are in some domain , and the labels . On each round , a distribution is computed as in the figure over the training examples, and a given weak learner or weak learning algorithm is applied to find a weak hypothesis : , where the aim of the weak learner is to find a weak hypothesis with low weighted error relative to . The final or combined hypothesis computes the sign of a weighted combination of weak hypotheses

   

This is equivalent to saying that is computed as a weighted majority vote of the weak hypotheses where each is assigned weight . ([…] we use the terms “hypothesis” and “classifier” interchangeably.)

So, AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favor of those instances misclassified by previous ones. But to really understand what is going on my approach has always been that you haven’t really understood something before you didn’t build it yourself…

Perhaps you might want to try to translate the pseudocode into R code before reading on… (to increase your motivation I frankly admit that I also had some errors in my first implementation… which provides a good example of how strong the R community is because I posted it on stackoverflow and got a perfect answer two hours later: What is wrong with my implementation of AdaBoost?

Anyway, here is my implementation (the data can be found here: http://freakonometrics.free.fr/myocarde.csv):

library(rpart) library(OneR) maxdepth <- 1 T <- 100 # number of rounds # Given: (x_1, y_1),...,(x_m, y_m) where x_i element of X, y_i element of {-1, +1} myocarde <- read.table("data/myocarde.csv", header = TRUE, sep = ";") y <- (myocarde[ , "PRONO"] == "SURVIE") * 2 - 1 x <- myocarde[ , 1:7] m <- nrow(x) data <- data.frame(x, y) # Initialize: D_1(i) = 1/m for i = 1,...,m D <- rep(1/m, m) H <- replicate(T, list()) a <- vector(mode = "numeric", T) set.seed(123) # For t = 1,...,T for(t in 1:T) { # Train weak learner using distribution D_t # Get weak hypothesis h_t: X -> {-1, +1} H[[t]] <- rpart(y ~., data = data, weights = D, maxdepth = maxdepth, method = "class") # Aim: select h_t with low weighted error: e_t = Pr_i~D_t[h_t(x_i) != y_i] h <- predict(H[[t]], x, type = "class") e <- sum((h!=y) * D) # Choose a_t = 0.5 * log((1-e) / e) a[t] <- 0.5 * log((1-e) / e) # Update for i = 1,...,m: D_t+1(i) = (D_t(i) * exp(-a_t * y_i * h_t(x_i))) / Z_t # where Z_t is a normalization factor (chosen so that Dt+1 will be a distribution) D <- D * exp(-a[t] * y * as.numeric(as.character(h))) D <- D / sum(D) } # Output the final hypothesis: H(x) = sign(sum of a_t * h_t(x) for t=1 to T) newdata <- x H_x <- sapply(H, function(x) as.numeric(as.character(predict(x, newdata = newdata, type = "class")))) H_x <- t(a * t(H_x)) pred <- sign(rowSums(H_x)) eval_model(pred, y) ## ## Confusion matrix (absolute): ## Actual ## Prediction -1 1 Sum ## -1 29 0 29 ## 1 0 42 42 ## Sum 29 42 71 ## ## Confusion matrix (relative): ## Actual ## Prediction -1 1 Sum ## -1 0.41 0.00 0.41 ## 1 0.00 0.59 0.59 ## Sum 0.41 0.59 1.00 ## ## Accuracy: ## 1 (71/71) ## ## Error rate: ## 0 (0/71) ## ## Error rate reduction (vs. base rate): ## 1 (p-value < 2.2e-16)

Let’s compare this with the result from the package JOUSBoost (on CRAN):

library(JOUSBoost)
## JOUSBoost 2.1.0

boost <- adaboost(as.matrix(x), y, tree_depth = maxdepth, n_rounds = T)
pred <- predict(boost, x)
eval_model(pred, y)
## 
## Confusion matrix (absolute):
##           Actual
## Prediction -1  1 Sum
##        -1  29  0  29
##        1    0 42  42
##        Sum 29 42  71
## 
## Confusion matrix (relative):
##           Actual
## Prediction   -1    1  Sum
##        -1  0.41 0.00 0.41
##        1   0.00 0.59 0.59
##        Sum 0.41 0.59 1.00
## 
## Accuracy:
## 1 (71/71)
## 
## Error rate:
## 0 (0/71)
## 
## Error rate reduction (vs. base rate):
## 1 (p-value < 2.2e-16)

As you can see: zero errors as with my implementation. Two additional remarks are in order:

An accuracy of 100% hints at one of the problems of boosting: it is prone to overfitting (see also Learning Data Science: Modelling Basics).

The second problem is the lack of interpretability: whereas decision trees are normally well interpretable, ensembles of them are not. This is also known as the Accuracy-Interpretability Trade-Off (another often used ensemble method is random forests, see also here: Learning Data Science: Predicting Income Brackets).

I hope this post helped you understand the widely used boosting methodology better and see how you can get from pseudocode to valid R code. If you have any questions or feedback please let me know in the comments – thank you and stay tuned!


To leave a comment for the author, please follow the link and comment on their blog: R-Bloggers – Learning Machines.

radian: a modern console for R

Tue, 06/18/2019 - 08:00

(This article was first published on From the Bottom of the Heap - R, and kindly contributed to R-bloggers)

Whenever I’m developing R code or writing data wrangling or analysis scripts for research projects that I work on, I use Emacs and its add-on package Emacs Speaks Statistics (ESS). I’ve done so for nigh on a couple of decades now, ever since I switched full time to running Linux as my daily OS. For years this has served me well, though I wouldn’t call myself an Emacs expert; not even close! With a bit of help from some R Core coding standards document I got indentation working how I like it, I learned to contort my fingers in weird and wonderful ways to execute a small set of useful shortcuts, and I even committed some of those shortcuts to memory. More recently, however, my go-to methods for configuring Emacs+ESS were failing; indentation was all over the shop, the smart _ stopped working or didn’t work as it had for over a decade, syntax highlighting of R-related files, like .Rmd, was hit and miss, and polymode was just a mystery to me. Configuring Emacs+ESS was becoming much more of a chore, and rather unhelpfully, my problems coincided with my having less and less time to devote to tinkering with my computer setups. Also, fiddling with this stuff just wasn’t fun any more. So, in a fit of pique following one too many reconfiguration sessions of Emacs+ESS, I went in search of some greener grass. During that search I came across radian, a neat, attractive, simple console for working with R.

Written by Randy Lai, radian is a cross-platform console for R that provides code completion, syntax highlighting, etc in a neat little package that runs in a shell or terminal, such as Bash. I’m someone who fires up multiple terminals every day to run some bit of R code, to show a student how to do something, to quickly check on argument names or such like, or prepare an answer to a question on stackoverflow or crossvalidated. Running R in a terminal after using an IDE/environment like Emacs+ESS or RStudio is an exercise in time travel; all those little helpful editing tools the IDE provides are missing and you’re coding like it was the 1980s all over again. radian changes all that.

radian is a Python application, so to run it you’ll need a Python stack installed. You’ll also need a relatively recent version of R (≥ 3.4.0). Using pip, the Python package installer, installing radian is straightforward. Python v3 is recommended and on Fedora this meant I had to install using

pip-3 install --user radian

The --user flag does a user install, which sets the installation location to be inside your home directory. Once installed, you can start radian by simply typing the application name and hitting enter

radian

A nice configuration tip included in the radian README.md is to alias the radian command to r, so that running R runs the standard R console, while running r starts radian. On Fedora, you configure this alias in your ~/.bashrc file

alias r="radian"

Having started radian you’ll see something like this

radian at start-up running in a bash shell on Fedora

radian starts up with a simple statement of the R version running in radian and the platform (OS) it’s running on; so is it just a less-verbose version of the standard R console? The radian prompt hints at the greater capabilities however.

Code completion is a nice addition; yes, you have some form of code completion in the standard R console but in radian we have a more RStudio or Emacs+ESS-like experience with a drop-down menu for object, function, argument, and filename completion. To activate this you start typing, hit Tab and the relevant completions pop up. Hit Tab again or press the down cursor and you can scroll through the potential completions.

Code completion in radian

We also get nice syntax highlighting of R code using the colour schemes from pygments:

Syntax highlighting in radian using the monokai theme

And, if you’re copying & pasting code into the terminal or piping code in from an editor with an embedded terminal (that’s running radian) then you also get rather handy multiline editing. Pressing the up cursor ↑ will retrieve the previous set of commands pasted or piped into radian, and repeatedly pressing ↑ will scroll back through the history. If you want to edit a set of R calls, instead of pressing ↑ again, press ← to enter the chunk of code and then you can move around among the lines using the cursor keys, editing as you see fit. Hitting enter will run the entire chunk of code for you, edits and all:

Multiline editing a ggplot call in radian

You can configure aspects of the behaviour of radian via options() in your .Rprofile. The options I’m currently using on the computer used for the screenshots in this post are:

options(radian.auto_indentation = FALSE)
options(radian.color_scheme = "monokai")

but on my laptop I’m currently using

# auto match brackets and quotes
options(radian.auto_match = TRUE)

# auto indentation for new line and curly braces
options(radian.auto_indentation = TRUE)
options(radian.tab_size = 4)

# timeout in seconds to cancel completion if it takes too long
# set it to 0 to disable it
options(radian.completion_timeout = 0.05)

# insert new line between prompts
options(radian.insert_new_line = FALSE)

The last option is something I’m not sure about yet; as you can see in the screenshots, there’s a new line between the prompts, which makes it super easy to read the R code you’ve entered, but with the font I’m currently using (Iosevka) things look a bit too spread out. Setting radian.insert_new_line = FALSE, as I have it on the laptop results in more standard behaviour but it can feel a little cramped. I’ll probably play with both options and see which I like best after a few more weeks of use.

You can also define shortcuts. This is useful for entering the assignment operator <-, which I have bound to Alt + - using

options(radian.escape_key_map = list(
  list(key = "-", value = " <- ")
))

where I’ve added spaces around the operator to mimic how the smart underscore works in Emacs+ESS.

I’m really liking using radian for my throw-away R sessions that I typically do in a terminal. The only issue I’ve noticed is that it is a little slow to print tibbles, and clearly it’s not going to replace my current IDE — that’s not what it is designed for. That said, radian can be run inside any app that can run a terminal and I’ve had this running inside VS Code for example, which was nice.

If you have any comments on radian or other R consoles, let me know what you think below; if you’ve used radian I’m especially interested in your experience with it.


To leave a comment for the author, please follow the link and comment on their blog: From the Bottom of the Heap - R.

Parametric survival modeling

Tue, 06/18/2019 - 02:00

(This article was first published on Devin Incerti’s blog, and kindly contributed to R-bloggers)

Introduction

Survival analysis is used to analyze the time until the occurrence of an event (or multiple events). Cox models—which are often referred to as semiparametric because they do not assume any particular baseline survival distribution—are perhaps the most widely used technique; however, Cox models are not without limitations and parametric approaches can be advantageous in many contexts. For instance, parametric survival models are essential for extrapolating survival outcomes beyond the available follow-up data. For this reason they are nearly always used in health-economic evaluations where it is necessary to consider the lifetime health effects (and costs) of medical interventions.

R contains a large number of packages related to biostatistics and its support for parametric survival modeling is no different. Below we will examine a range of parametric survival distributions, their specifications in R, and the hazard shapes they support. We will then show how the flexsurv package can make parametric regression modeling of survival data straightforward.

Survival distributions

The primary quantity of interest in survival analysis is the survivor function, defined as the probability of survival beyond time $t$,

S(t) = Pr(T > t) = 1 - F(t),

where $T$ is a random variable denoting the time that the event occurs. The survival function is the complement of the cumulative distribution function (CDF), $F(t) = \int_0^t f(u)du$, where $f(t)$ is the probability density function (PDF).

The hazard function, or the instantaneous rate at which an event occurs at time $t$ given survival until time $t$ is given by,

\lambda (t) = \frac{f(t)}{S(t)}.

The survivor function can also be expressed in terms of the cumulative hazard function, $\Lambda(t) = \int_0^t \lambda (u)du$,

S(t) = e^{-\Lambda(t)}.
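As a quick numerical sanity check of these identities (my addition, not part of the original post), consider an exponential distribution with rate 0.5 at $t = 2$; flexsurv::hexp() (introduced in the table below) is used only to cross-check the hazard:

tt <- 2
rate <- 0.5
f <- dexp(tt, rate)                      # density f(t)
S <- pexp(tt, rate, lower.tail = FALSE)  # survivor function S(t)
f / S                                    # hazard f(t)/S(t): equals the rate, 0.5
flexsurv::hexp(tt, rate)                 # same value from flexsurv
exp(-rate * tt)                          # S(t) via the cumulative hazard (for the exponential, Lambda(t) = rate * t)
S                                        # matches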

R functions for parametric distributions used for survival analysis are shown in the table below. The default stats package contains functions for the PDF, the CDF, and random number generation for many of the distributions. Additional distributions as well as support for hazard functions are provided by flexsurv.

Each entry lists the PDF / CDF / Hazard / Random sampling functions:

  • Exponential: stats::dexp / stats::pexp / flexsurv::hexp / flexsurv::rexp
  • Weibull (AFT): stats::dweibull / stats::pweibull / flexsurv::hweibull / stats::rweibull
  • Weibull (PH): flexsurv::dweibullPH / flexsurv::pweibullPH / flexsurv::hweibullPH / flexsurv::rweibullPH
  • Gompertz: flexsurv::dgompertz / flexsurv::pgompertz / flexsurv::hgompertz / flexsurv::rgompertz
  • Gamma: stats::dgamma / stats::pgamma / flexsurv::hgamma / stats::rgamma
  • Lognormal: stats::dlnorm / stats::plnorm / flexsurv::hlnorm / stats::rlnorm
  • Log-logistic: flexsurv::dllogis / flexsurv::pllogis / flexsurv::hllogis / flexsurv::rllogis
  • Generalized gamma: flexsurv::dgengamma / flexsurv::pgengamma / flexsurv::hgengamma / flexsurv::rgengamma

The parameterizations of these distributions in R are shown in the next table. The parameter of primary interest (in flexsurv) is colored in red—it is known as the location parameter and typically governs the mean or location for each distribution. The other parameters are ancillary parameters that determine the shape, variance, or higher moments of the distribution. These parameters impact the hazard function, which can take a variety of shapes depending on the distribution:

  • the exponential distribution only supports a constant hazard;
  • the Weibull, Gompertz, and gamma distributions support monotonically increasing and decreasing hazards;
  • the log-logistic and lognormal distributions support arc-shaped and monotonically decreasing hazards; and
  • the generalized gamma distribution supports arc-shaped, bathtub-shaped, monotonically increasing, and monotonically decreasing hazards.
  • Exponential: PDF $\lambda e^{-\lambda t}$; CDF $1 - e^{-\lambda t}$; hazard $\lambda$; parameters $\color{red}{\text{rate}} = \lambda \gt 0$; hazard shape: constant.
  • Weibull (AFT): PDF $\frac{a}{b}\left(\frac{t}{b}\right)^{a-1}e^{-(t/b)^a}$; CDF $1 - e^{-(t/b)^a}$; hazard $\frac{a}{b}\left(\frac{t}{b}\right)^{a-1}$; parameters $\text{shape} = a \gt 0$, $\color{red}{\text{scale}} = b \gt 0$; hazard shape: constant, monotonically increasing/decreasing.
  • Weibull (PH) (a): PDF $a m t^{a-1} e^{-m t^a}$; CDF $1 - e^{-mt^a}$; hazard $amt^{a-1}$; parameters $\text{shape} = a \gt 0$, $\color{red}{\text{scale}} = m \gt 0$; hazard shape: constant, monotonically increasing/decreasing.
  • Gompertz: PDF $b e^{at} \exp\left[-\frac{b}{a}(e^{at}-1)\right]$; CDF $1 - \exp\left[-\frac{b}{a}(e^{at}-1)\right]$; hazard $b e^{at}$; parameters $\text{shape} = a \in (-\infty, \infty)$, $\color{red}{\text{rate}} = b \gt 0$; hazard shape: constant, monotonically increasing/decreasing.
  • Gamma (b): PDF $\frac{b^a}{\Gamma(a)}t^{a-1}e^{-bt}$; CDF $\frac{\gamma(a, bt)}{\Gamma(a)}$; hazard $f(t)/S(t)$; parameters $\text{shape} = a \gt 0$, $\color{red}{\text{rate}} = b \gt 0$; hazard shape: constant, monotonically increasing/decreasing.
  • Lognormal: PDF $\frac{1}{t\sigma\sqrt{2\pi}}e^{-\frac{(\ln t - \mu)^2}{2\sigma^2}}$; CDF $\Phi\left(\frac{\ln t - \mu}{\sigma}\right)$; hazard $f(t)/S(t)$; parameters $\color{red}{\text{meanlog}} = \mu \in (-\infty, \infty)$, $\text{sdlog} = \sigma \gt 0$; hazard shape: arc-shaped, monotonically decreasing.
  • Log-logistic: PDF $\frac{(a/b)(t/b)^{a-1}}{\left(1 + (t/b)^a\right)^2}$; CDF $1 - \frac{1}{1 + (t/b)^a}$; hazard $\frac{(a/b)(t/b)^{a-1}}{1 + (t/b)^a}$; parameters $\text{shape} = a \gt 0$, $\color{red}{\text{scale}} = b \gt 0$; hazard shape: arc-shaped, monotonically decreasing.
  • Generalized gamma (c): PDF $\frac{|Q|(Q^{-2})^{Q^{-2}}}{\sigma t \Gamma(Q^{-2})} \exp\left[Q^{-2}\left(Qw-e^{Qw}\right)\right]$; CDF $\frac{\gamma(Q^{-2}, u)}{\Gamma(Q^{-2})}$ if $Q \neq 0$, $\Phi(w)$ if $Q = 0$; hazard $f(t)/S(t)$; parameters $\color{red}{\text{mu}} = \mu \in (-\infty, \infty)$, $\text{sigma} = \sigma \gt 0$, $\text{Q} = Q \in (-\infty, \infty)$; hazard shape: arc-shaped, bathtub-shaped, monotonically increasing/decreasing.

(a) The proportional hazards (PH) model is a reparameterization of the accelerated failure time (AFT) model with $m = b^{-a}$.
(b) $\Gamma(z) = \int_{0}^{\infty} x^{z-1}e^{-x} dx$ is the gamma function and $\gamma(s,x) = \int_{0}^{x} t^{s-1}e^{-t}dt$ is the lower incomplete gamma function.
(c) $w = (\log(t) - \mu)/\sigma$; $u = Q^{-2}e^{Qw}$.

Shapes of hazard functions

We will now examine the shapes of the hazards in a bit more detail and show how both the location and shape vary with the parameters of each distribution. Readers interested in a more interactive experience can also view my Shiny app here.

To do so we will load some needed packages: we will use flexsurv to compute the hazards, data.table as a fast alternative to data.frame, and ggplot2 for plotting.

library("flexsurv") library("data.table") library("ggplot2") ggplot2::theme_set(theme_minimal())

We can create a general function for computing hazards for any general hazard function given combinations of parameter values at different time points. The key to the function is mapply, a multivariate version of sapply. To illustrate, let’s compute the hazard from a Weibull distribution given 3 values each of the shape and scale parameters at time points 1 and 2. The output is a matrix where each row corresponds to a time point and each column is a combination of the shape and scale parameters. For example, the second row and third column is the hazard at time point 2 given a shape parameter of 1.5 and a scale parameter of 1.75.

mapply(flexsurv::hweibull,
       shape = c(.5, 1, 1.5),
       scale = c(.25, 1, 1.75),
       MoreArgs = list(x = 1:2))
##           [,1] [,2]      [,3]
## [1,] 1.0000000    1 0.6479391
## [2,] 0.7071068    1 0.9163243

The more general function uses mapply to return a data.table of hazards at all possible combinations of the parameter values and time points. Factor variables and intuitive names are also returned to facilitate plotting with ggplot2.

hazfun <- function(FUN, param_vals, times){
  # Compute hazard for all possible combinations of parameters and times
  values <- do.call("mapply",
                    c(list(FUN = FUN),
                      as.list(expand.grid(param_vals)),
                      list(MoreArgs = list(x = times))))
  x <- data.table(expand.grid(c(param_vals, list(time = times))),
                  value = c(t(values)))

  # Create factor variables and intuitive names for plotting
  param_names <- names(param_vals)
  x[[param_names[1]]] <- factor(x[[param_names[1]]],
                                levels = unique(x[[param_names[1]]]))
  if (length(param_vals) > 1){
    for (j in 2:length(param_vals)){
      ordered_vals <- unique(x[[param_names[j]]])
      x[[param_names[j]]] <- paste0(param_names[j], " = ", x[[param_names[j]]])
      factor_levels <- paste0(param_names[j], " = ", ordered_vals)
      x[[param_names[j]]] <- factor(x[[param_names[j]]], levels = factor_levels)
    }
  }

  # Return
  return(x)
}

Exponential distribution

The exponential distribution is parameterized by a single rate parameter and only supports a hazard that is constant over time. The hazard is simply equal to the rate parameter.

times <- seq(1, 10, 1)
rate <- seq(1, 5, 1)
haz_exp <- hazfun(flexsurv::hexp,
                  list(rate = rate),
                  times = times)

ggplot(haz_exp, aes(x = time, y = value, col = rate)) +
  geom_line() +
  xlab("Time") + ylab("Hazard") +
  scale_y_continuous(breaks = rate)

Weibull distribution (AFT)

The Weibull distribution can be parameterized as both an accelerated failure time (AFT) model or as a proportional hazards (PH) model. The parameterization in the base stats package is an AFT model.

The hazard is decreasing for shape parameter $a < 1$ and increasing for $a > 1$. For $a = 1$, the Weibull distribution is equivalent to an exponential distribution with rate parameter $1/b$ where $b$ is the scale parameter.

wei_shape <- seq(.25, 3, .25)
haz_wei <- hazfun(flexsurv::hweibull,
                  list(scale = seq(2, 5, .5),
                       shape = wei_shape),
                  times = times)

ggplot(haz_wei, aes(x = time, y = value, col = scale)) +
  geom_line() +
  facet_wrap(~shape, scales = "free_y") +
  xlab("Time") + ylab("Hazard")
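As a quick check of the equivalence noted above (my addition, not part of the original post): with shape $a = 1$ the Weibull hazard is the constant $1/b$, matching an exponential distribution with rate $1/b$.

b <- 2
flexsurv::hweibull(x = 1:3, shape = 1, scale = b)  # constant hazard of 0.5
flexsurv::hexp(x = 1:3, rate = 1 / b)              # same values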

Weibull distribution (PH)

flexsurv provides an alternative PH parameterization of the Weibull model with the same shape parameter $a$ and a scale parameter $m = b^{-a}$ where $b$ is the scale parameter in the AFT model.

The hazard is again decreasing for $a < 1$, constant for $a = 1$, and increasing for $a > 1$. Note that for $a = 1$, the PH Weibull distribution is equivalent to an exponential distribution with rate parameter $m$.

haz_weiPH <- hazfun(flexsurv::hweibullPH,
                    list(scale = seq(2, 5, .5),
                         shape = wei_shape),
                    times = times)

ggplot(haz_weiPH, aes(x = time, y = value, col = scale)) +
  geom_line() +
  facet_wrap(~shape, scales = "free_y") +
  xlab("Time") + ylab("Hazard")

Gompertz distribution

The Gompertz distribution is parameterized by a shape parameter $a$ and rate parameter $b$. The hazard is increasing for $a > 0$, constant for $a = 0$, and decreasing for $a < 0$. When $a = 0$, the Gompertz distribution is equivalent to an exponential distribution with rate parameter $b$.

haz_gomp <- hazfun(flexsurv::hgompertz,
                   list(rate = seq(.5, 3, .5),
                        shape = seq(-2, 2, .5)),
                   times = times)

ggplot(haz_gomp, aes(x = time, y = value, col = rate)) +
  geom_line() +
  facet_wrap(~shape, scales = "free_y") +
  xlab("Time") + ylab("Hazard")

Gamma distribution

The gamma distribution is parameterized by a shape parameter $a$ and a rate parameter $b$. Like the Weibull distribution, the hazard is decreasing for $a < 1$, constant for $a = 1$, and increasing for $a >1$. In the case where $a = 1$, the gamma distribution is an exponential distribution with rate parameter $b$.

haz_gam <- hazfun(flexsurv::hgamma,
                  list(rate = seq(.5, 3, .5),
                       shape = c(1e-5, .5, 1, seq(2, 6))),
                  times = times)

ggplot(haz_gam, aes(x = time, y = value, col = rate)) +
  geom_line() +
  facet_wrap(~shape, scales = "free_y") +
  xlab("Time") + ylab("Hazard")

Lognormal distribution

The lognormal distribution is parameterized by the mean $\mu$ and standard deviation $\sigma$ of survival time on the log scale. The lognormal hazard is either monotonically decreasing or arc-shaped. Note that the shape of the hazard depends on the values of both $\mu$ and $\sigma$.

haz_lnorm <- hazfun(flexsurv::hlnorm,
                    list(meanlog = seq(0, 4, .5),
                         sdlog = seq(.5, 3, .5)),
                    times = seq(0, 20, .1))

ggplot(haz_lnorm, aes(x = time, y = value, col = meanlog)) +
  geom_line() +
  facet_wrap(~sdlog, scales = "free_y") +
  xlab("Time") + ylab("Hazard")

Log-logistic distribution

The log-logistic distribution is parameterized by a shape parameter $a$ and a scale parameter $b$. When $a > 1$, the hazard function is arc-shaped whereas when $a \leq 1$, the hazard function is decreasing monotonically.

haz_llogis <- hazfun(flexsurv::hllogis,
                     list(scale = seq(.5, 4, .5),
                          shape = seq(.5, 3, .5)),
                     times = seq(0, 20, 1))

ggplot(haz_llogis, aes(x = time, y = value, col = scale)) +
  geom_line() +
  facet_wrap(~shape, scales = "free_y") +
  xlab("Time") + ylab("Hazard")

Generalized gamma distribution

The generalized gamma distribution is parameterized by a location parameter $\mu$, a scale parameter $\sigma$, and a shape parameter $Q$. It is the most flexible distribution reviewed in this post and includes the exponential ($Q = \sigma = 1$), Weibull ($Q = 1$), gamma ($Q = \sigma$), and lognormal ($Q = 0$) distributions as special cases.

Each row in the figure corresponds to a unique value of $\sigma$ and each column corresponds to a unique value of $Q$. The generalized gamma distribution is quite flexible as it supports hazard functions that are monotonically increasing, monotonically decreasing, arc-shaped, and bathtub shaped.

haz_gengamma <- hazfun(flexsurv::hgengamma,
                       list(mu = seq(1, 4, .5),
                            sigma = seq(.5, 2, .5),
                            Q = seq(-3, 3, 1)),
                       times = seq(0, 30, 1))

ggplot(haz_gengamma, aes(x = time, y = value, col = mu)) +
  geom_line() +
  facet_wrap(sigma ~ Q,
             ncol = length(levels(haz_gengamma$Q)),
             scales = "free") +
  xlab("Time") + ylab("Hazard") +
  theme(legend.position = "bottom") +
  guides(col = guide_legend(nrow = 2, byrow = TRUE))
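As a small, hedged check of one of the special cases listed above (my addition): with $Q = 0$, flexsurv’s generalized gamma should reproduce the lognormal hazard with meanlog $\mu$ and sdlog $\sigma$.

tt <- seq(0.5, 5, by = 0.5)
flexsurv::hgengamma(x = tt, mu = 1, sigma = 0.8, Q = 0)
flexsurv::hlnorm(x = tt, meanlog = 1, sdlog = 0.8)  # should agree with the line above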

Regression

In flexsurv, survival models are fit to the data using maximum likelihood. Each parameter can be modeled as a function of covariates $z$,

\alpha_l = g^{-1}\left(z^t\beta \right),

where $\alpha_l$ is the $l$th parameter and $g^{-1}()$ is a link function (typically $log()$ if the parameter is strictly positive and the identity function if the parameter is defined on the real line).
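As a minimal sketch of what this means in practice (the numbers below are made up for illustration and are not estimates from any model in this post): a strictly positive parameter such as a rate is modeled on the log scale, so the natural-scale value for a given covariate vector is recovered with exp().

beta <- c(intercept = -6.8, ecog1 = 0.36)  # hypothetical coefficients on the log scale
z    <- c(1, 1)                            # intercept plus a hypothetical ECOG = 1 indicator
exp(sum(z * beta))                         # g^{-1}() = exp() returns the rate itself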

We will illustrate by modeling survival in a dataset of patients with advanced lung cancer from the survival package. The dataset uses a status indicator where 2 denotes death and 1 denotes alive at the time of last follow-up; we will convert this to the more traditional coding where 0 denotes alive (censored) and 1 denotes dead.

dat <- data.table(survival::lung)
dat[, status := ifelse(status == 1, 0, 1)]
head(dat)
##    inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
## 1:    3  306      1  74   1       1       90       100     1175      NA
## 2:    3  455      1  68   1       0       90        90     1225      15
## 3:    3 1010      0  56   1       0       90        90       NA      15
## 4:    5  210      1  57   1       1       90        60     1150      11
## 5:    1  883      1  60   1       0      100        90       NA       0
## 6:   12 1022      0  74   1       1       50        80      513       0

Intercept only model

We will begin by estimating intercept only parametric regression models (i.e., without covariates). But first, it’s helpful to estimate the hazard function (among all patients) using nonparametric techniques. We can do this using the kernel density estimator from the muhaz package.

library("muhaz") kernel_haz_est <- muhaz(dat$time, dat$status) kernel_haz <- data.table(time = kernel_haz_est$est.grid, est = kernel_haz_est$haz.est, method = "Kernel density")

Then we can use flexsurv to estimate intercept only models for a range of probability distributions. The hazard function for each fitted model is returned using summary.flexsurvreg().

dists <- c("exp", "weibull", "gompertz", "gamma", "lognormal", "llogis", "gengamma") dists_long <- c("Exponential", "Weibull (AFT)", "Gompertz", "Gamma", "Lognormal", "Log-logistic", "Generalized gamma") parametric_haz <- vector(mode = "list", length = length(dists)) for (i in 1:length(dists)){ fit <- flexsurvreg(Surv(time, status) ~ 1, data = dat, dist = dists[i]) parametric_haz[[i]] <- summary(fit, type = "hazard", ci = FALSE, tidy = TRUE) parametric_haz[[i]]$method <- dists_long[i] } parametric_haz <- rbindlist(parametric_haz)

We can plot the hazard functions from the parametric models and compare them to the kernel density estimate.

haz <- rbind(kernel_haz, parametric_haz)
haz[, method := factor(method, levels = c("Kernel density", dists_long))]
n_dists <- length(dists)

ggplot(haz, aes(x = time, y = est, col = method, linetype = method)) +
  geom_line() +
  xlab("Days") + ylab("Hazard") +
  scale_colour_manual(name = "",
                      values = c("black", rainbow(n_dists))) +
  scale_linetype_manual(name = "",
                        values = c(1, rep_len(2:6, n_dists)))

The kernel density estimate is monotonically increasing and the slope increases considerably after around 500 days. The arc-shaped lognormal and log-logistic hazards and the constant exponential hazard do not fit the data well. The best performing models are those that support monotonically increasing hazards (Gompertz, Weibull, gamma, and generalized gamma). The flexible generalized gamma and the Gompertz models perform the best with the Gompertz modeling the increase in the slope of the hazard the most closely.

Adding covariates

As mentioned above each parameter can be modeled as a function of covariates. By default, flexsurv only uses covariates to model the location parameter. To demonstrate, we will let the rate parameter of the Gompertz distribution depend on the ECOG performance score (0 = good, 5 = dead), which describes a patient’s level of functioning and has been shown to be a prognostic factor for survival. The model is fit using flexsurvreg().

dat[, ph.ecog := factor(ph.ecog)]

# Covariates on rate parameter only
fit_covs1 <- flexsurvreg(Surv(time, status) ~ ph.ecog,
                         data = dat, dist = "gompertz")
print(fit_covs1)
## Call:
## flexsurvreg(formula = Surv(time, status) ~ ph.ecog, data = dat, 
##     dist = "gompertz")
## 
## Estimates:
##           data mean  est        L95%       U95%       se         exp(est)
## shape           NA    0.001595   0.000925   0.002265   0.000342        NA
## rate            NA    0.001070   0.000725   0.001579   0.000212        NA
## ph.ecog1  0.497797    0.358883  -0.029676   0.747442   0.198248  1.431729
## ph.ecog2  0.220264    0.910197   0.469959   1.350435   0.224615  2.484812
## ph.ecog3  0.004405    1.973828  -0.020522   3.968179   1.017544  7.198181
##           L95%       U95%
## shape           NA         NA
## rate            NA         NA
## ph.ecog1  0.970760   2.111591
## ph.ecog2  1.599929   3.859102
## ph.ecog3  0.979687  52.888129
## 
## N = 227,  Events: 164,  Censored: 63
## Total time at risk: 69522
## Log-likelihood = -1139.984, df = 5
## AIC = 2289.967

Covariates for ancillary parameters can be supplied using the anc argument to flexsurvreg().

# Covariates on rate and shape parameter
fit_covs2 <- flexsurvreg(Surv(time, status) ~ ph.ecog,
                         anc = list(shape =~ ph.ecog),
                         data = dat, dist = "gompertz")
print(fit_covs2)
## Call:
## flexsurvreg(formula = Surv(time, status) ~ ph.ecog, anc = list(shape = ~ph.ecog), 
##     data = dat, dist = "gompertz")
## 
## Estimates:
##                  data mean  est        L95%       U95%
## shape                  NA    1.84e-03   5.54e-04   3.12e-03
## rate                   NA    9.89e-04   5.75e-04   1.70e-03
## ph.ecog1         4.98e-01    4.18e-01  -2.28e-01   1.06e+00
## ph.ecog2         2.20e-01    1.11e+00   3.99e-01   1.82e+00
## ph.ecog3         4.41e-03   -3.02e+02  -9.48e+02   3.43e+02
## shape(ph.ecog1)  4.98e-01   -1.77e-04  -1.75e-03   1.40e-03
## shape(ph.ecog2)  2.20e-01   -7.61e-04  -2.77e-03   1.25e-03
## shape(ph.ecog3)  4.41e-03    2.63e+00  -2.86e+00   8.11e+00
##                  se         exp(est)   L95%       U95%
## shape            6.54e-04          NA         NA         NA
## rate             2.74e-04          NA         NA         NA
## ph.ecog1         3.30e-01    1.52e+00   7.96e-01   2.90e+00
## ph.ecog2         3.63e-01    3.04e+00   1.49e+00   6.19e+00
## ph.ecog3         3.29e+02   4.78e-132   0.00e+00  1.32e+149
## shape(ph.ecog1)  8.03e-04    1.00e+00   9.98e-01   1.00e+00
## shape(ph.ecog2)  1.02e-03    9.99e-01   9.97e-01   1.00e+00
## shape(ph.ecog3)  2.80e+00    1.38e+01   5.75e-02   3.33e+03
## 
## N = 227,  Events: 164,  Censored: 63
## Total time at risk: 69522
## Log-likelihood = -1134.06, df = 8
## AIC = 2284.12

The standard errors and confidence intervals are very large on the shape parameter coefficients, suggesting that they are not reliably estimated and that there is little evidence that the shape parameter depends on the ECOG score. As such, we will use the first model to predict the hazards. In flexsurv, input data for prediction can be specified by using the newdata argument in summary.flexsurvreg(). So we will first create this “new” dataset for prediction consisting of each possible value of the ECOG score in the data.

newdat <- data.table(ph.ecog = sort(unique(dat[!is.na(ph.ecog)]$ph.ecog)))
newdat[, ph.ecog := factor(ph.ecog)]
print(newdat)
##    ph.ecog
## 1:       0
## 2:       1
## 3:       2
## 4:       3

We can then predict the hazard for each level of the ECOG score. The hazard increases with the ECOG score which is expected since higher scores denote higher levels of disability. Note, however, that the shape of the hazard remains the same since we did not find evidence that the shape parameter of the Gompertz distribution depended on the ECOG score.

haz_covs1 <- summary(fit_covs1, newdata = newdat,
                     type = "hazard", tidy = TRUE)

ggplot(haz_covs1, aes(x = time, y = est, col = ph.ecog)) +
  geom_line() +
  xlab("Days") + ylab("Hazard") +
  scale_color_discrete(name = "ECOG performance score") +
  theme(legend.position = "bottom")

Conclusion

Parametric models are a useful technique for survival analysis, particularly when there is a need to extrapolate survival outcomes beyond the available follow-up data. R provides a wide range of survival distributions and the flexsurv package provides excellent support for parametric modeling. Parametric distributions can support a wide range of hazard shapes including monotonically increasing, monotonically decreasing, arc-shaped, and bathtub-shaped hazards. However, in some cases, even the most flexible distributions such as the generalized gamma distribution may be insufficient. In these cases, flexible parametric models such as splines or fractional polynomials may be needed.


To leave a comment for the author, please follow the link and comment on their blog: Devin Incerti’s blog.

Getting from flat data a world of relationships to visualise with Gephi

Tue, 06/18/2019 - 02:00

(This article was first published on R on Coding Club UC3M, and kindly contributed to R-bloggers)

by Mariluz Congosto

Network analysis offers a perspective of the data that broadens and enriches any investigation. Many times we deal with data in which the elements are related, but we have them in a tabulated format that is difficult to import into network analysis tools.

Relationship data require a definition of nodes and connections. Both parts have different structures, so it is not possible to store them in a single table; at least two would be needed. Network analysis tools define different input formats, one of them being GDF, which is characterized by its simplicity and versatility.

In this session, we will see how we can extract the relationships between elements of a file in CSV format to generate a file in GDF format to work with Gephi.

Introduction

One of the most used network analysis tools is Gephi. It is a desktop application that allows us to analyse a network in depth and visualize it. It provides a set of functions to filter nodes and edges or calculate network parameters or apply different layouts or give size and colour to nodes depending on different attributes.

Gephi supports a set of formats, each enabling more or fewer functions. The format with the most features is GEXF, but it has an XML structure that generates large files. When we work with very large graphs, size matters. The GDF format is simple, compact and versatile. It also lets us work with attributes, a complement that can be combined with the analysis of the network topology.

As Gephi said:

GDF is built like a database table or a coma separated file (CSV). It supports attributes to both nodes and edges. A standard file is divided in two sections, one for nodes and one for edges. Each section has a header line, which basically is the column title. Each element (i.e. node or edge) is on a line and values are separated by coma. The GDF format is therefore very easy to read and can be easily converted from CSV.

This is a basic example of a GDF file, in a first part it defines the nodes and in a second the connections:

nodedef>name VARCHAR,label VARCHAR
s1,Site number 1
s2,Site number 2
s3,Site number 3
edgedef>node1 VARCHAR,node2 VARCHAR
s1,s2
s2,s3
s3,s2
s3,s1

The steps to extract relationships from a CSV file and generate a file in GDF format are:

  1. Define the parameters for the extraction of related data.
  2. Define the data structures for the transformation.
  3. Import the data from a file in CSV format and store the data in the structures.
  4. Sort data according to connections.
  5. Prepare the data for the GDF format.
  6. Generate the file in GDF format.
Define the parameters for the extraction of related data

A set of parameters is defined so this converter can be adapted to different cases. We must specify what information will be taken from the file in CSV format, where the result will be written, and the type of graph.

library("readr") library("hash") library("stringr") source <- 3 # Column number of source entity target <- 9 # Column number of target entity source_attribs <- c(5, 6, 11, 12, 13) # Column numbers of attribs null_attrib <- rep(NA, 5) # Default value for entities without attributes name_attribs <- c("app", "location", "hastag", "lang", "create at") # Attrib names name_file_csv <- "rstudio_RTs.csv" # Name of the input file in CSV format name_file_gdf <- "rstudio_RTs.gdf" # Name of the output file in GDF format directed <- TRUE # Indicates whether the graph is directed or not

The sample CSV file can be found here.

Define the data structures for the transformation

In order to store the data, dynamic structures are needed to allow data to be added as they appear. The hash tables were chosen because they are the most appropriate for this case.
Hash tables will be used to store nodes and connections.
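For readers unfamiliar with the hash package, here is a minimal illustration (my addition) of the API that the code below relies on:

library(hash)
h <- hash()                  # an empty, dynamically growing hash table
h[["alice"]] <- c("a", "b")  # store a value under a string key
has.key("alice", h)          # TRUE
h[["alice"]]                 # retrieve the stored value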

hash_nodes <- hash()
hash_links <- hash()
hash_links_in <- hash()
hash_links_out <- hash()
hash_connections <- hash()
hash_connections_attrib <- hash()
num_attribs <- length(source_attribs)

Import the data from a file in CSV format and store the data in the structures

Import data reading the CSV file and run it row by row to store the nodes and connections in the hash tables.

Related entities can appear multiple times, as a source or as a target. When an entity appears for the first time, it is stored in the hash_nodes table. Attributes are associated with source entities and null attributes with target entities. This is a criterion assumed by this algorithm, but others are possible. If an entity appears for the first time as a target, it will be assigned the null attributes, but if it appears later as a source, the null attributes will be replaced by its own.

For each entity, the number of total links (hash_links), the number of inbound links (hash_links_in) and the number of outbound links (hash_links_out) are counted. This is done to allow ordering the nodes from greater to lesser degree when generating the file in GDF format.

For each origin-target entity pair, the number of times that the relation appears (hash_connections) and the attributes (hash_connections_attrib) are stored. In the first case, we get the weight of the relationship and, in the second, we get the associated attributes.

table_csv <- read_csv2(name_file_csv)
num_rows <- nrow(table_csv)
num_cols <- ncol(table_csv)

for (i in 1:num_rows) {
  node_source <- table_csv[[i, source]]
  node_target <- table_csv[[i, target]]
  print(i)
  node_source_attribs <- null_attrib
  # The attributes are stored and the ',' character is changed to '-'
  # to avoid conflict in the GDF format
  for (j in 1:num_attribs) {
    k <- source_attribs[[j]]
    raw_attrib <- table_csv[[i, k]]
    cooked_attrib <- str_replace_all(raw_attrib, ",", "-")
    node_source_attribs[[j]] <- cooked_attrib
  }
  # If a source node appears for the first time, store with attributes
  if (!(has.key(node_source, hash_nodes))) {
    hash_nodes[[node_source]] <- node_source_attribs
    hash_links[[node_source]] <- 0
    hash_links_in[[node_source]] <- 0
    hash_links_out[[node_source]] <- 0
  }
  # If the source node exists and has null attributes, store its own
  else {
    node_source_attribs_old <- hash_nodes[[node_source]]
    if (identical(node_source_attribs_old, null_attrib))
      hash_nodes[[node_source]] <- node_source_attribs
  }
  # Check that target node exists
  if (!(is.na(node_target))) {
    # If a target node appears for the first time, store with null attributes
    if (!(has.key(node_target, hash_nodes))) {
      hash_nodes[[node_target]] <- null_attrib
      hash_links[[node_target]] <- 0
      hash_links_in[[node_target]] <- 0
      hash_links_out[[node_target]] <- 0
    }
    par_nodes <- paste(node_source, node_target)
    # If a pair of nodes appears related for the first time, store the relation
    if (!(has.key(par_nodes, hash_connections))) {
      hash_connections[[par_nodes]] <- 0
      hash_connections_attrib[[par_nodes]] <- node_source_attribs
    }
    # In all cases, increase the number of connections
    hash_connections[[par_nodes]] <- hash_connections[[par_nodes]] + 1
    hash_links[[node_source]] <- hash_links[[node_source]] + 1
    hash_links_out[[node_source]] <- hash_links_out[[node_source]] + 1
    hash_links[[node_target]] <- hash_links[[node_target]] + 1
    hash_links_in[[node_target]] <- hash_links_in[[node_target]] + 1
  }
}

Sort data according to connections

The hash table object does not have a sort method, but it has one to convert it into a list. Once we have converted hash_links and hash_connections into lists, we sort them in descending order by number of connections.

list_links <- as.list.hash(hash_links)
list_link_order <- list_links[order(unlist(list_links), decreasing=TRUE)]
list_connections <- as.list.hash(hash_connections)
list_connections_order <- list_connections[order(unlist(list_connections), decreasing=TRUE)]

Prepare the data for the GDF format

In this step we place in GDF format nodes and links in descending order by number of connections.

In the GDF format, the only data required for the definition of nodes is the name of the node, but attributes can be added. In this case, three fixed attributes are included, which are the total number of links, the number of inbound links and the number of outbound links. Since the GDF format is readable, these attributes allow getting an idea of the most relevant nodes even before importing them into Gephi. The attributes configured in the parameters are also added.

The information of the nodes is stored in a matrix sized in rows by the number of nodes and in columns by the number of attributes configured plus four.

For the definition of links only the source and target nodes are required, but we can also expand them with attributes. In this case we add the weight of the relation, a boolean variable to indicate if the graph is directed or not (by default it is not directed) and the attributes configured in the parameters.

The information of the links is stored in a matrix dimensioned in rows by the number of pairs of connections and in columns by the number of attributes configured plus four.

# Definition of nodes
num_nodes <- length(list_links)
table_nodes <- matrix(nrow=num_nodes, ncol=num_attribs+4)
num_nodes_connected <- 0
for(i in 1:num_nodes) {
  name_node <- names(list_link_order)[i]
  if (hash_links[[name_node]] > 0) {
    num_nodes_connected <- num_nodes_connected + 1
    table_nodes[i, 1] <- name_node
    table_nodes[i, 2] <- hash_links[[name_node]]
    table_nodes[i, 3] <- hash_links_in[[name_node]]
    table_nodes[i, 4] <- hash_links_out[[name_node]]
    node_attrib <- hash_nodes[[name_node]]
    for (j in 1:num_attribs)
      table_nodes[i, 4+j] <- node_attrib[[j]]
  }
}
# Only connected nodes are considered
k <- num_attribs + 4
table_nodes <- table_nodes[1:num_nodes_connected, 1:k]

# Definition of links
num_connections <- length(list_connections)
table_connections <- matrix(nrow=num_connections, ncol=num_attribs+4)
for(i in 1:num_connections) {
  name_conexion <- names(list_connections_order)[i]
  source_target <- strsplit(name_conexion, " ")
  table_connections[i, 1] <- source_target[[1]][1]
  table_connections[i, 2] <- source_target[[1]][2]
  table_connections[i, 3] <- list_connections_order[[i]]
  table_connections[i, 4] <- directed
  connection_attrib <- hash_connections_attrib[[name_conexion]]
  for (j in 1:num_attribs)
    table_connections[i, 4+j] <- connection_attrib[[j]]
}

Generate the file in GDF format

The last step is to write the file in GDF format. We will only have to add the headers before writing the information of the nodes and links.

# Definition of nodes
head_nodes <- "nodedef>name VARCHAR,links INT,Links_in INT,links_out INT"
for (j in 1:num_attribs) {
  attrib_type <- paste(name_attribs[[j]], "VARCHAR", sep = " ")
  head_nodes <- paste(head_nodes, attrib_type, sep = ",")
}
write.table(head_nodes, file = name_file_gdf, append = FALSE, quote = FALSE,
            sep = ",", eol = "\n", na = "NA", dec = ".",
            row.names = FALSE, col.names = FALSE,
            qmethod = c("escape", "double"), fileEncoding = "UTF-8")
write.table(table_nodes, file = name_file_gdf, append = TRUE, quote = FALSE,
            sep = ",", eol = "\n", na = "NA", dec = ".",
            row.names = FALSE, col.names = FALSE,
            qmethod = c("escape", "double"), fileEncoding = "UTF-8")

# Definition of links
head_arcs <- "edgedef>node1 VARCHAR,node2 VARCHAR, weight INT, directed BOOLEAN"
for (j in 1:num_attribs) {
  attrib_type <- paste(name_attribs[[j]], "VARCHAR", sep = " ")
  head_arcs <- paste(head_arcs, attrib_type, sep = ",")
}
write.table(head_arcs, file = name_file_gdf, append = TRUE, quote = FALSE,
            sep = ",", eol = "\n", na = "NA", dec = ".",
            row.names = FALSE, col.names = FALSE,
            qmethod = c("escape", "double"), fileEncoding = "UTF-8")
write.table(table_connections, file = name_file_gdf, append = TRUE, quote = FALSE,
            sep = ",", eol = "\n", na = "NA", dec = ".",
            row.names = FALSE, col.names = FALSE,
            qmethod = c("escape", "double"), fileEncoding = "UTF-8")

The resulting GDF file can be found here, and also the complete R code is available in a standalone script.

Example

The data has been obtained with the tool t-hoarder_kit, which allows downloading data through the Twitter API. With it, we queried the tweets that mentioned the @rstudio profile. From the downloaded CSV data, the most relevant columns have been selected to make the tables easier to read.

Once the file in CSV format has been converted to a file in GDF format with the RT relation, this is the visualization of the graph with Gephi. (tweets from 2019-05-27 to 2019-06-06)


To leave a comment for the author, please follow the link and comment on their blog: R on Coding Club UC3M.

Visualizing the Copa América: Historical Records, Squad Profiles, and Player Profiles with xG statistics!

Tue, 06/18/2019 - 01:00

(This article was first published on R by R(yo), and kindly contributed to R-bloggers)

Another summer and another edition of the Copa América! Along with the
Africa Cup of Nations, Nations League finals, the Women’s World Cup,
Under-21 European Championship AND the Gold Cup this is yet another
soccer-filled season after last year’s World Cup and the Asian Cup
earlier this year (I also did a blog post on these last two tournaments
which you can see here (World
Cup)
and here
(Asian Cup)
).
There is so much football going on at once even I can’t keep up,
especially with the time difference! To not redo all the previous
visualizations with Copa América data I tried to find new sources of
data and other forms of visualizations to give some insight into the
players and teams competing to be the champion of South America. You can
find all the code I used in this blogpost here and you can also find
other soccer related data viz in my
soccer_ggplot Github repo.

The sections will go from a very macro-level view of the historical
records
of the tournament, to the squads competing, the teams’
match record in the Copa América, and finally to a micro-level view
of various attacking players using xG statistics.

¡Vámonos!

Packages

library(dplyr)      ## data wrangling
library(tidyr)      ## data wrangling
library(purrr)      ## data wrangling and iteration
library(stringr)    ## data wrangling
library(rvest)      ## webscraping
library(polite)     ## webscraping (Github only pkg)
library(ggplot2)    ## plotting
library(scales)     ## plotting scales
library(ggimage)    ## images for flags
library(ggforce)    ## plotting text labels
library(cowplot)    ## plotting grid
library(glue)       ## text
library(ggrepel)    ## plotting text labels
library(magick)     ## plotting
library(kable)      ## tables
library(ggtextures) ## soccer ball emoji as geom_col()
library(extrafont)  ## fonts: Roboto Condensed
loadfonts()

theme_copaAmerica

I wanted to have all the plots in this blogpost to have a consistent
color theme. As the tournament is going to be held in Brazil, I went
with a color theme based on its flag with blue, yellow, and green being
the primary colors.

theme_copaAmerica <- function(
  title.size = 24, subtitle.size = 14, caption.size = 8,
  axis.text.size = 14, axis.text.x.size = 12, axis.text.y.size = 12,
  axis.title.size = 16, strip.text.size = 18,
  panel.grid.major.x = element_line(size = 0.5, color = "white"),
  panel.grid.major.y = element_line(size = 0.5, color = "white"),
  panel.grid.minor.x = element_blank(),
  panel.grid.minor.y = element_blank(),
  axis.ticks = element_line(color = "white")) {
  ## Theme:
  theme(text = element_text(family = "Roboto Condensed", color = "white"),
        plot.title = element_text(family = "Roboto Condensed", face = "bold",
                                  size = title.size, color = "yellow"),
        plot.subtitle = element_text(size = subtitle.size),
        plot.caption = element_text(size = caption.size),
        panel.background = element_rect(fill = "#009b3a"),
        plot.background = element_rect(fill = "#002776"),
        axis.text = element_text(size = axis.text.size, color = "white"),
        axis.text.x = element_text(size = axis.text.x.size, color = "white"),
        axis.text.y = element_text(size = axis.text.y.size, color = "white"),
        axis.title = element_text(size = axis.title.size),
        axis.line.x = element_blank(),
        axis.line.y = element_blank(),
        panel.grid.major.x = panel.grid.major.x,
        panel.grid.major.y = panel.grid.major.y,
        panel.grid.minor.x = panel.grid.minor.x,
        panel.grid.minor.y = panel.grid.minor.y,
        strip.text = element_text(color = "yellow", face = "bold",
                                  size = strip.text.size,
                                  margin = margin(4.4, 4.4, 4.4, 4.4)),
        strip.background = element_blank(),
        axis.ticks = axis.ticks
  )
}

For this plot I took the stats from the Spanish version of the Wikipedia
page as it had more content. I used purrr::flatten_df() to squish the
list output into a dataframe then set the names of each column using
purrr::set_names().

url <- "https://es.wikipedia.org/wiki/Anexo:Estad%C3%ADsticas_de_la_Copa_Am%C3%A9rica" session <- bow(url) copa_top_scorers <- scrape(session) %>% html_nodes(".mw-parser-output > table:nth-child(95)") %>% html_table() %>% flatten_df() %>% set_names(c("player", "country", "goals")) %>% mutate(image = "https://www.emoji.co.uk/files/microsoft-emojis/activity-windows10/8356-soccer-ball.png") glimpse(copa_top_scorers) ## Observations: 22 ## Variables: 4 ## $ player "Norberto Méndez", "Zizinho", "Lolo Fernández", "Sever... ## $ country "ARG Argentina", "BRA Brasil", "PER Perú", "URU Urugua... ## $ goals 17, 17, 15, 15, 13, 13, 13, 13, 13, 12, 12, 11, 11, 11... ## $ image "https://www.emoji.co.uk/files/microsoft-emojis/activi...

Like in the Asian Cup blogpost I use Claus
Wilke
’s
ggtextures package to use
soccer ball emoji as the column image in the plot.

copa_goleadores_raw_plot <- copa_top_scorers %>%
  head(5) %>%
  ggplot(aes(x = reorder(player, goals), y = goals, image = image)) +
  geom_isotype_col(img_width = grid::unit(1, "native"), img_height = NULL,
                   ncol = NA, nrow = 1, hjust = 0, vjust = 0.5) +
  geom_text(aes(label = goals, family = "Roboto Condensed", fontface = "bold"),
            size = 7.5, color = "yellow", nudge_y = 0.5) +
  coord_flip() +
  scale_y_continuous(breaks = c(0, 2, 4, 6, 8, 10, 12, 14, 16, 18),
                     expand = c(0, 0), limits = c(0, 19)) +
  labs(title = "Top Scorers of the Copa América",
       subtitle = glue(" Most goals in a single tournament: 9 Humberto Maschio (Argentina), Javier Ambrois (Uruguay), Jair (Brazil)"),
       y = "Number of Goals", x = NULL,
       caption = glue(" Source: Wikipedia By @R_by_Ryo")) +
  theme_copaAmerica(title.size = 26, subtitle.size = 16, caption.size = 12,
                    axis.text.size = 18, axis.title.size = 18,
                    panel.grid.major.y = element_blank(),
                    axis.ticks = element_blank())

## Add flags to y-axis:
axis_image <- axis_canvas(copa_goleadores_raw_plot, axis = 'y') +
  draw_image("https://upload.wikimedia.org/wikipedia/en/0/05/Flag_of_Brazil.svg",
             y = 16.5, scale = 1.8) +
  draw_image("https://upload.wikimedia.org/wikipedia/commons/1/1a/Flag_of_Argentina.svg",
             y = 12.5, scale = 1.8) +
  draw_image("https://upload.wikimedia.org/wikipedia/commons/f/fe/Flag_of_Uruguay.svg",
             y = 9, scale = 1.8) +
  draw_image("https://upload.wikimedia.org/wikipedia/commons/d/df/Flag_of_Peru_%28state%29.svg",
             y = 5.25, scale = 1.8) +
  draw_image("https://upload.wikimedia.org/wikipedia/en/0/05/Flag_of_Brazil.svg",
             y = 1.5, scale = 1.8)

copa_goleadores_plot <- ggdraw(insert_yaxis_grob(copa_goleadores_raw_plot,
                                                 axis_image, position = "left"))
copa_goleadores_plot

Most of these players aren’t ones you might recognize. The Copa América
used to be held a lot more regularly (and sometimes erratically) until
this century so players had a lot more opportunities to score goals. All
five of the players you see here played in the 1930s-1950s when there
was a tournament every one or two years. Out of currently active
players, Peruvian legend Paolo Guerrero has 11 goals along with Eduardo
Vargas (from Chile) with 10. (Edit: after the Chile – Japan game, Vargas is on
12…) Another player you might recognize that was actually tied with
Ademir for 5th place, along with three other players, was Gabriel
Batistuta (“Batigol”).

Winners of the Copa América

After grabbing the data from the Wikipedia page I used a variety of
functions to clean and reshape the dataset like tidyr::separate() to
split the number of occurences and the year.

url <- "https://es.wikipedia.org/wiki/Anexo:Estad%C3%ADsticas_de_la_Copa_Am%C3%A9rica" session <- bow(url) copa_campeones <- scrape(session) %>% html_nodes(".mw-parser-output > table:nth-child(10)") %>% html_table() %>% flatten_df() copa_campeones_limpia <- copa_campeones %>% janitor::clean_names() %>% slice(1:8) %>% select(1:4) %>% set_names(c("team", "winners", "runners_up", "third_place")) %>% separate(winners, into = c("Champions", "first_place_year"), sep = " ", extra = "merge") %>% separate(runners_up, into = c("Runners-up", "second_place_year"), sep = " ", extra = "merge") %>% separate(third_place, into = c("Third Place", "third_place_year"), sep = " ", extra = "merge") %>% mutate_all(list(~str_replace_all(., "–", "0"))) %>% mutate_at(vars(contains("num")), funs(as.numeric)) %>% gather(key = "key", value = "value", -team, -first_place_year, -second_place_year, -third_place_year) %>% mutate(key = as.factor(key), value = as.numeric(value), team = team %>% str_replace(., "[A-Z]{3}", "") %>% str_trim(.), team = case_when(team == "Brasil" ~ "Brazil", TRUE ~ team)) %>% mutate(key = forcats::fct_relevel(key, "Champions", "Runners-up", "Third Place")) %>% arrange(key, desc(value)) %>% mutate(team = forcats::as_factor(team), order = row_number())

I also wanted to add flags to this plot but
cowplot::insert_yaxis_grob() is unfortunately not compatible with
facets. I used stringr::str_wrap() to format the subtitle nicely while
I used glue::glue() to avoid having to use ‘\n’ to create a new line
for the caption.

copa_ganadores_plot <- copa_campeones_limpia %>%
  ggplot(aes(value, forcats::fct_rev(team), color = key)) +
  geom_point(size = 10) + # 10
  geom_text(aes(label = value), size = 5, color = "black", # 5
            family = "Roboto Condensed", fontface = "bold") +
  scale_color_manual(values = c("Champions" = "#FFCC33",
                                "Runners-up" = "#999999",
                                "Third Place" = "#CC6600"),
                     guide = FALSE) +
  scale_x_continuous(breaks = c(1, 5, 10, 15),
                     labels = c(1, 5, 10, 15),
                     limits = c(-1, 16)) +
  labs(x = "Number of Occurrence", y = NULL,
       title = "Most Successful Teams of the Copa América!",
       subtitle = str_wrap("Ordered by number of Copa América(s) won. Argentina missed the chance to leapfrog Uruguay after consecutive final losses in the previous two tournaments!", width = 80),
       caption = glue(" Source: Wikipedia By @R_by_Ryo")) +
  facet_wrap(~key) +
  theme_copaAmerica(subtitle.size = 14, caption.size = 10)

copa_ganadores_plot

What’s surprising to note is that Pele never won a Copa América with
Brazil, although he did get Best Player and Top Scorer in the 1959
edition of the tournament. Even more bizarrely Diego Maradona has never
won it either! He didn’t play in either of the 1991 and 1993 editions
where Argentina won their 13th and 14th Copas.

Copa América Squad Profiles

We just looked at what happened in the past but who are the players
competing in the tournament this year? To take a quick look I
web-scraped the squads of each of the competing teams from Wikipedia.

I created a list of the xpaths for each of the squads and using
purrr::map() I grabbed the data for each participating country. After
I got some meta-information about the country name and the group I
created a list-column that stores the squad data as a dataframe in its
own column. To explode this out I used tidyr::unnest() to reshape the
entire dataframe to have one row with all the data for each player in
every squad.
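The scraping step itself is not shown in this post, so here is a hedged sketch of how squads_df_raw might be assembled; the URL, xpaths, two-country list, and column handling below are placeholders for illustration, not the ones actually used.

library(rvest)
library(purrr)
library(dplyr)
library(tidyr)

## Placeholder xpaths: one per squad table on the Wikipedia squads page
squad_xpaths <- list(
  Brazil    = '//*[@id="mw-content-text"]/div/table[2]',
  Argentina = '//*[@id="mw-content-text"]/div/table[4]'
)

url <- "https://en.wikipedia.org/wiki/2019_Copa_Am%C3%A9rica_squads"  # placeholder URL
page <- read_html(url)

squads_df_raw <- imap_dfr(squad_xpaths, function(xpath, country_name) {
  squad_table <- page %>%
    html_node(xpath = xpath) %>%
    html_table(fill = TRUE)
  ## list-column: the whole squad table sits in one cell per country
  tibble(country = country_name, squad = list(squad_table))
}) %>%
  unnest(squad)  ## one row per player in every squad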

To get a clean dataset I use some stringr::str_*() functions to
properly format the character strings such as the player positions,
ages, date of births.

squads_df_clean <- squads_df_raw %>%
  janitor::clean_names() %>%
  select(-delete, squad_num = no, position = pos, birth_age = date_of_birth_age) %>%
  mutate(position = position %>% str_replace_all(., "[1-9]", ""),
         birth_age = birth_age %>% str_extract_all(., pattern = "\\([^()]+\\)")) %>%
  unnest(birth_age) %>%
  group_by(player) %>%
  mutate(colnum = seq_along(player)) %>%
  spread(key = colnum, value = birth_age) %>%
  ungroup() %>%
  select(everything(), dob = `1`, age = `2`) %>%
  mutate(dob = dob %>% str_replace_all(., "[()]", "") %>% lubridate::as_date(),
         age = age %>% str_extract(., "[0-9]+") %>% as.integer,
         country = forcats::fct_relevel(country,
                                        "Brazil", "Argentina", "Uruguay", "Peru",
                                        "Qatar", "Chile", "Venezuela", "Paraguay",
                                        "Japan", "Bolivia", "Colombia", "Ecuador"),
         club = case_when(
           club == "Barcelona" & country == "Ecuador" ~ "Barcelona (Ecuador)",
           TRUE ~ club))

glimpse(squads_df_clean)
## Observations: 276
## Variables: 12
## $ name           3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,...
## $ group          "A", "A", "A", "A", "A", "A", "A", "A", "A", "A...
## $ country        Brazil, Brazil, Brazil, Brazil, Brazil, Brazil,...
## $ squad_num      1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...
## $ position       "GK", "DF", "DF", "DF", "MF", "DF", "FW", "MF",...
## $ player         "Alisson", "Thiago Silva", "Miranda", "Marquinh...
## $ caps           36, 79, 57, 36, 36, 40, 3, 10, 29, 65, 49, 15, ...
## $ goals          0, 7, 3, 1, 0, 2, 1, 0, 16, 8, 14, 1, 7, 0, 0, ...
## $ club           "Liverpool", "Paris Saint-Germain", "Internazio...
## $ country_league "England", "France", "Italy", "France", "Spain"...
## $ dob            1992-10-02, 1984-09-22, 1984-09-07, 1994-05-14...
## $ age            26, 34, 34, 25, 27, 33, 22, 22, 22, 30, 27, 28,...

Age-histogram

Using this data I can plot a bunch of histograms:

age_country_plot <- squads_df_clean %>% group_by(country) %>% mutate(median_age = median(age)) %>% ungroup() %>% ggplot(aes(x = age)) + geom_histogram(fill = "red", binwidth = 1) + geom_vline(aes(xintercept = median_age), size = 1.2) + geom_label(aes(x = median_age, y = 8, label = glue::glue("Median: {median_age}")), nudge_x = 0.5, hjust = 0.1, size = 3, family = "Roboto Condensed", color = "black") + labs(title = "Age Distribution of Copa América squads", subtitle = "Columns ordered Group A to Group C", x = "Age", y = NULL, caption = glue::glue(" Source: Wikipedia By: @R_by_Ryo")) + scale_x_continuous(expand = c(0, 0)) + scale_y_continuous(expand = c(0, 0), breaks = scales::pretty_breaks()) + theme_copaAmerica(title.size = 22, subtitle.size = 14, caption.size = 8, axis.text.size = 12, axis.title.size = 16, strip.text.size = 18, panel.grid.minor.x = element_line(color = "white"), panel.grid.minor.y = element_line(color = "white")) + facet_wrap(~country, ncol = 3) age_country_plot

In terms of age, Japan have the youngest team with a median of 21,
four years younger than the next youngest team, Qatar. The rest have a
fairly balanced spread of ages from 20 to the early-to-mid 30s, with
most of the medians hovering around 27. Japan’s squad is so young
because the full-strength team has already played in both the World
Cup and the Asian Cup in the past year; with the Tokyo Olympics coming
up next year, the federation decided to use the invitation to the Copa
América as a trial-by-fire for the young stars of the future. Much
like a real Olympic squad, the team contains three “overage” players:
World Cup 2010/2014/2018 goalkeeper Eiji Kawashima, Premier League
winner Shinji Okazaki, and Getafe playmaker Gaku Shibasaki.

The oldest player will be Brazil captain Dani Alves at 36, with
Paraguay’s Oscar Cardozo only two weeks younger. On the other hand,
the youngest player is Japan’s 18-year-old prodigy Takefusa Kubo, the
ex-Barcelona youth player who only just recently moved to Real Madrid!
In light of his transfer, a lot of eyes will be on him to see if he
can produce some Captain Tsubasa-esque performances for a very
inexperienced Japan team gearing up for the Tokyo Olympics!

Caps histogram

When considering the experience of a squad it’s not enough to look at
ages; one also needs to look at caps, the appearances made for the
national team.

caps_country_plot <- squads_df_clean %>% group_by(country) %>% mutate(median_cap = median(caps)) %>% ungroup() %>% ggplot(aes(x = caps)) + geom_histogram(fill = "red", binwidth = 5) + geom_vline(aes(xintercept = median_cap), size = 1.25) + geom_label(aes(x = median_cap, y = 15, label = glue::glue("Median: {median_cap}")), nudge_x = 0.5, hjust = 0.05, size = 3, family = "Roboto Condensed", color = "black") + labs(title = "Caps (Appearances) by Country", subtitle = "Columns ordered Group A to Group C", x = "Caps", y = NULL, caption = glue::glue(" Source: Wikipedia By: @R_by_Ryo")) + scale_x_continuous(expand = c(0, 0)) + scale_y_continuous(expand = c(0, 0)) + theme_copaAmerica( title.size = 20, subtitle.size = 14, caption.size = 10, axis.text.size = 10, axis.title.size = 16, strip.text.size = 18, panel.grid.major.x = element_line(color = "white", size = 0.25), panel.grid.major.y = element_line(color = "white", size = 0.25), panel.grid.minor.x = element_line(color = "white", size = 0.25), panel.grid.minor.y = element_line(color = "white", size = 0.25)) + facet_wrap(~country, ncol = 3) caps_country_plot

The majority of Japan’s squad have 0 (ZERO) caps, with the
aforementioned three “overage” players accounting for most of the caps
on the team. Bolivia are also taking an untested squad, with 8 of
their players having 2 caps or fewer! Chile, Uruguay, and Argentina
bring their veterans, with multiple players around or over 100 caps.
From this data I was surprised that Jefferson Farfan and Paolo
Guerrero don’t have 100 caps by now…

The player with the most caps is Lionel Messi (130), followed closely
by Diego Godin (126) and Alexis Sanchez (124). On the other hand,
there are 29 players hoping to make their first national team
appearance at this tournament, with the majority (17 players) coming
from Japan.

Goal distribution

Next I looked at the distribution of goals scored by the midfielders
and strikers of each team. Along the way I found out about
ggplot2::position_nudge(), which slightly adjusts variables on a
discrete scale in a similar fashion to the nudge_y = and nudge_x =
arguments most people will be familiar with from other geoms. I also
used ggforce::geom_mark_hull() to do some labelling.

goals_country_plot <- squads_df_clean %>% filter(position %in% c("MF", "FW")) %>% group_by(country) %>% mutate(median = median(goals)) %>% ungroup() %>% ggplot(aes(x = goals, y = reorder(country, median))) + ggridges::geom_density_ridges(fill = "red", color = "white", scale = 1.1) + geom_point(aes(x = median, y = country), position = position_nudge(y = 0.25), color = "yellow", size = 3) + ggforce::geom_mark_hull(aes(filter = country == "Argentina" & goals == 67, label = "Lionel Messi: 67 goals"), label.buffer = unit(15, "mm"), label.fontsize = 10, label.fill = "red", label.family = "Roboto Condensed", label.colour = "white", con.cap = unit(1, "mm"), con.type = "straight") + ggforce::geom_mark_hull(aes(filter = country == "Uruguay" & goals == 55, label = "Luis Suarez: 55 goals"), label.buffer = unit(5, "mm"), label.fontsize = 10, label.fill = "red", label.family = "Roboto Condensed", label.colour = "white", con.cap = unit(1, "mm"), con.type = "straight") + ggforce::geom_mark_hull(aes(filter = country == "Japan" & goals == 50, label = "Shinji Okazaki: 50 goals"), label.buffer = unit(2, "mm"), label.fontsize = 10, label.fill = "red", label.family = "Roboto Condensed", label.colour = "white", con.cap = unit(1, "mm"), con.type = "straight") + ggforce::geom_mark_hull(aes(filter = country == "Uruguay" & goals == 46, label = "Edinson Cavani: 46 goals"), label.buffer = unit(25, "mm"), label.fontsize = 10, label.fill = "red", label.family = "Roboto Condensed", label.colour = "white", con.cap = unit(1, "mm"), con.type = "straight") + ggforce::geom_mark_hull(aes(filter = country == "Chile" & goals == 41, label = "Alexis Sanchez: 41 goals"), label.buffer = unit(4, "mm"), label.fontsize = 10, label.fill = "red", label.family = "Roboto Condensed", label.colour = "white", con.cap = unit(1, "mm"), con.type = "straight") + scale_x_continuous(limits = c(0, 73), expand = c(0.01, 0.01), breaks = c(0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70), labels = c(0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70)) + expand_limits(y = 13.5) + labs(title = "Distribution of Goals Scored by Midfielders and Strikers", subtitle = "Copa América 2019 squads, Yellow dot = Median goals", x = "Goals", y = NULL, caption = glue::glue(" Source: Wikipedia Data from prior to start of tournament By: @R_by_Ryo")) + theme_copaAmerica(title.size = 18, subtitle.size = 12, caption.size = 8, axis.text.size = 14, axis.title.size = 16, strip.text.size = 18) goals_country_plot

With a lot of these players being more defensively minded or new to
the national team, the distribution is heavily skewed. Still, you can
see little mounds showing the top goalscorers for each country, and
you can see which countries, such as Brazil, Qatar, and Peru, have
their goals spread out among multiple players.

If you know your South American players you can take a good guess at
who the top goal scorers for each nation are. For Colombia, the two
outlying mounds are obviously James Rodriguez and Falcao, for example.
Venezuela’s top scorer, with 22 goals, is Salomon Rondon, and for
Brazil, if not for his injury, a lonesome mound would have appeared
for Neymar with 60 goals!

Player contribution by league

Now let’s check the player contribution to the Copa América squads by
league. For simplicity’s sake I’m just going to use the country that
each league is from. Originally I wanted to left_join() in a
‘country <> domestic league’ lookup table, but I couldn’t find one,
and the league names themselves aren’t very meaningful or carry awful
sponsor names that obfuscate the country of origin even further.

player_contrib_league_plot <- squads_df_clean %>% group_by(country_league) %>% summarize(n = n()) %>% ungroup() %>% ggplot(aes(y = n, x = reorder(country_league, n))) + geom_col(fill = "red") + geom_text(aes(label = n, family = "Roboto Condensed", fontface = "bold"), size = 4.5, color = "yellow", nudge_y = 0.5) + coord_flip() + scale_y_continuous(labels = c(0, 5, 10, 15, 20, 25), breaks = c(0, 5, 10, 15, 20, 25), limits = c(0, 30), expand = c(0, 0)) + labs(title = "Breakdown of Player Contributions by League", subtitle = glue(" Shown as Country Name Mexico (Liga MX) contributed 27 players to South American squads"), x = "League (Country name)", y = "Number of players", caption = glue::glue(" Source: Wikipedia By: @R_by_Ryo")) + theme_copaAmerica(title.size = 18, subtitle.size = 12, caption.size = 10, axis.text.size = 14, axis.text.y.size = 11, axis.title.size = 16, panel.grid.major.y = element_blank(), panel.grid.minor.x = element_line(color = "white")) player_contrib_league_plot

The best of the best players from South American countries move on to
Europe, so the Argentinean league (Superliga Argentina) and the
Brazilian league (Brasileirão – Serie A) do not have as many players
as you might think; as a consequence, the top leagues of England,
Spain, and Italy contribute quite a bit! Many of the
good-but-not-quite-elite South American players go to Mexico instead
of a lower- or mid-tier European league, and with the growth of MLS a
fair number of players ply their trade there as well.

We can take a more detailed look by creating a table of the proportion
of players in each squad coming from either a domestic league or any
other league. I had to do a lot of wrangling to get the proper output
for the table. After calculating the percentage of players coming from
each country’s domestic league, I joined the full data back in. Then I
had to make sure that, for each country, the ‘country – domestic
league’ row came first within that country’s group (so Bolivia –
Bolivia before Bolivia – China, Japan – Japan before Japan – England,
etc.). By doing this I could use tidyr::fill() to automatically fill
the rest of that country’s rows with the ‘percentage of players from
domestic league’ stat.

squads_df_clean %>% group_by(country, country_league) %>% summarize(player_from_league = n()) %>% filter(country == country_league) %>% mutate(perc_from_domestic_league = percent(player_from_league / 23, accuracy = 0.1)) %>% right_join(squads_df_clean %>% group_by(country, country_league) %>% summarize(player_from_league = n()) %>% ungroup()) %>% mutate(first = case_when( country == country_league ~ 1, TRUE ~ 0)) %>% arrange(country, desc(first)) %>% fill(perc_from_domestic_league) %>% group_by(country) %>% mutate(perc_from_league = percent(player_from_league / 23, accuracy = 0.1), country_league = glue::glue("{country_league} - league")) %>% arrange(desc(player_from_league)) %>% select(Country = country, `League (country name)` = country_league, `Number of players from league` = player_from_league, `Percentage of players from league` = perc_from_league, `Percentage of players from domestic league` = perc_from_domestic_league) %>% head(10) %>% knitr::kable(format = "html", caption = "Breakdown of Player Contribution by League") %>% kableExtra::kable_styling(full_width = FALSE) Breakdown of Player Contribution by League
Country | League (country name) | Number of players from league | Percentage of players from league | Percentage of players from domestic league
Qatar | Qatar – league | 23 | 100.0% | 100.0%
Bolivia | Bolivia – league | 20 | 87.0% | 87.0%
Japan | Japan – league | 14 | 60.9% | 60.9%
Ecuador | Ecuador – league | 9 | 39.1% | 39.1%
Paraguay | Paraguay – league | 8 | 34.8% | 34.8%
Brazil | England – league | 7 | 30.4% | 13.0%
Uruguay | Spain – league | 6 | 26.1% | 4.3%
Peru | Peru – league | 6 | 26.1% | 26.1%
Chile | Chile – league | 6 | 26.1% | 26.1%
Argentina | Argentina – league | 5 | 21.7% | 21.7%

Three interesting facts I found:

  • 30% of players on the Brazil squad play for an English team, the
    most of any league–squad combination excluding domestic leagues.
  • 100% of the Qatar squad play in their domestic league!
  • Only one Uruguayan player (4.3%) plays in Uruguay’s domestic league.
Player contribution by club

In the final plot for this section, I looked at the top 15 clubs
contributing the most players to the tournament. I used
arrange(desc(n)) %>% slice() instead of top_n(), as top_n() sometimes
grabs far more rows than requested when clubs are tied on the same
value. To set the team names inside the bars I created a midpoint
value, midval, equal to half the number of players contributed, so the
labels sit neatly in the middle of each bar.
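As a tiny aside, here is a toy illustration (made-up data, not from the post) of the tie behaviour that makes top_n() unreliable when you want a fixed number of clubs:

library(dplyr)

toy <- tibble(club = c("A", "B", "C", "D"), n = c(5, 3, 3, 3))

toy %>% top_n(2, n)                      ## ties at n = 3, so all 4 rows come back
toy %>% arrange(desc(n)) %>% slice(1:2)  ## always exactly 2 rows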

player_contrib_club_plot <- squads_df_clean %>% group_by(club) %>% summarize(n = n()) %>% mutate(club = club %>% forcats::as_factor() %>% forcats::fct_reorder(n), midval = n / 2) %>% arrange(desc(n)) %>% slice(1:15) %>% ggplot(aes(x = club, y = n)) + geom_col(fill = "red") + geom_text(aes(label = n, family = "Roboto Condensed", fontface = "bold"), size = 7.5, color = "yellow", nudge_y = 0.5) + geom_text(aes(y = midval, label = club, family = "Roboto Condensed", fontface = "bold"), size = 5, color = "white") + coord_flip() + scale_y_continuous(breaks = scales::pretty_breaks(), expand = c(0, 0), limits = c(0, 10.5)) + labs(title = "Top 15 Clubs contributing the most players to the Copa América", x = "Club", y = "Number of players", caption = "Source: Wikipedia") + theme_copaAmerica( title.size = 18, subtitle.size = 12, caption.size = 8, axis.text.size = 14, axis.title.size = 16, strip.text.size = 18, panel.grid.major.y = element_blank(), panel.grid.minor.x = element_line(color = "white")) + theme(axis.text.y = element_blank(), axis.ticks.y = element_blank()) player_contrib_club_plot

With 100% of its players coming from the domestic league it’s no
surprise that the Qatari team, Al-Sadd, contributes the most players
to the tournament. Tied with another Qatari team, Mexican club America
features 7 players, yet none of them are Mexican (2 Argentineans, 2
Colombians, 1 Ecuadorian, 1 Chilean, and 1 Paraguayan).

At first I thought Barcelona contributed 8 players, until I realized
the Ecuadorian players were coming from the Ecuadorian team called
Barcelona… I had to go all the way back to the beginning of this
section to fix that small peculiarity. As futbol came to South America
via European colonists and immigrants, a lot of teams took up the
names and colors of the teams those Europeans were fond of. Other
examples include Liverpool F.C. (Montevideo, Uruguay), Arsenal de
Sarandi (Buenos Aires, Argentina), and Club Atletico Juventus (Sao
Paulo, Brazil – although they use the colors of Torino F.C.).

If you download the data and type in the code below you can see the
entire club-country list.

squads_df_clean %>% group_by(club, country) %>% summarize(n = n()) %>% View()

Match Records

Now that we’ve had a good look at the composition of the teams, we can
take a look at how they’ve done at every Copa América.

The next code chunk mainly comes from PH Julien and his excellent
Kaggle kernel, “A Journey Through The History of Soccer”.

## grab football federation affiliations data federation_files <- Sys.glob("../data/federation_affiliations/*") df_federations = data.frame(country = NULL, federation = NULL) for (f in federation_files) { federation = basename(f) content = read.csv(f, header=FALSE) content <- cbind(content,federation=rep(federation, dim(content)[1])) df_federations <- rbind(df_federations, content) } colnames(df_federations) <- c("country", "federation") df_federations <- df_federations %>% mutate(country = as.character(country) %>% str_trim(side = "both")) results_raw <- readr::read_csv("../data/results.csv") results_copa <- results_raw %>% filter(tournament == "Copa América") %>% rename(venue_country = country, venue_city = city) %>% mutate(match_num = row_number()) ## combine with federation affiliations results_copa_home <- results_copa %>% left_join(df_federations, by = c("home_team" = "country")) %>% mutate(federation = as.character(federation)) %>% rename(home_federation = federation) results_copa_away <- results_copa %>% left_join(df_federations, by = c("away_team" = "country")) %>% mutate(federation = as.character(federation)) %>% rename(away_federation = federation) ## combine home-away results_copa_cleaned <- results_copa_home %>% full_join(results_copa_away)

Unfortunately, this data does not include penalty shoot-out results,
as those games are all counted as draws (which, technically, is the
actual result). Considering that a lot of cagey knock-out rounds
finish in a penalty shoot-out (including the last two finals…) this is
unfortunate, but that’s just the data you have sometimes. There is a
way to web-scrape all the Copa América results and assign a win or
loss to the games that went to penalties, but I’ll leave that for
another time. Also, there is no info on what stage of the tournament
each recorded match was played in.

results_copa_cleaned <- results_copa_cleaned %>% mutate( home_federation = case_when( home_team == "USA" ~ "Concacaf", TRUE ~ home_federation), away_federation = case_when( away_team == "USA" ~ "Concacaf", TRUE ~ away_federation)) %>% select(-contains("federation"), -contains("venue"), -neutral, date, home_team, home_score, away_team, away_score, tournament, venue_city) glimpse(results_copa_cleaned) ## Observations: 787 ## Variables: 8 ## $ date 1916-07-02, 1916-07-06, 1916-07-08, 1916-07-10, 19... ## $ home_team "Chile", "Argentina", "Brazil", "Argentina", "Brazi... ## $ away_team "Uruguay", "Chile", "Chile", "Brazil", "Uruguay", "... ## $ home_score 0, 6, 1, 1, 1, 0, 4, 4, 1, 4, 5, 1, 6, 2, 0, 3, 4, ... ## $ away_score 4, 1, 1, 1, 2, 0, 0, 2, 0, 0, 0, 0, 0, 3, 2, 1, 1, ... ## $ tournament "Copa América", "Copa América", "Copa América", "Co... ## $ match_num 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ... ## $ venue_city "Buenos Aires", "Buenos Aires", "Buenos Aires", "Bu...

Now that it’s nice and cleaned up, I created a function that reshapes
the data so that it is framed from a certain team’s perspective via
the “team” argument. You can also have the function look only at
results against a certain opponent by filling in the versus argument.

copaAmerica_resultados <- function(data, team, versus = NA) { ## team of interest: ex. 'Brazil' team_var <- enquo(team) todos_partidos <- data %>% ## filter only for results of team of interest filter(home_team == !!team_var | away_team == !!team_var) %>% ## reshape columns to team vs. opponent mutate( opponent = case_when( away_team != !!team_var ~ away_team, home_team != !!team_var ~ home_team), home_away = case_when( home_team == !!team_var ~ "home", away_team == !!team_var ~ "away"), equipo_goals = case_when( home_team == !!team_var ~ home_score, away_team == !!team_var ~ away_score), opp_goals = case_when( home_team != !!team_var ~ home_score, away_team != !!team_var ~ away_score)) %>% ## label results from team's perspective mutate( result = case_when( equipo_goals > opp_goals ~ "Win", equipo_goals < opp_goals ~ "Loss", equipo_goals == opp_goals ~ "Draw")) %>% mutate(result = result %>% forcats::as_factor() %>% forcats::fct_relevel(c("Win", "Draw", "Loss"))) %>% select(-contains("score"), -contains("team"), -match_num) %>% rename(Date = date, Tournament = tournament, `Venue` = venue_city, Opponent = opponent, `Home / Away` = home_away, `Goals For` = equipo_goals, `Goals Against` = opp_goals, Result = result) if (is.na(versus) | is.null(versus)) { resultados_totalmente <- todos_partidos %>% group_by(Result, Opponent) %>% mutate(n = n()) %>% ungroup() %>% ## sum amount of goals by team and opponent group_by(Result, Opponent) %>% summarize(e_g = sum(`Goals For`), o_g = sum(`Goals Against`), n = n()) %>% ungroup() %>% ## spread results over multiple columns spread(Result, n) %>% mutate_if(is.integer, as.numeric) missing_cols <- c("Win", "Draw", "Loss") %>% map_dfr( ~tibble(!!.x := numeric())) resultados_totalmente <- resultados_totalmente %>% bind_rows(missing_cols) %>% mutate(Win = if_else(is.na(Win), 0, Win), Draw = if_else(is.na(Draw), 0, Draw), Loss = if_else(is.na(Loss), 0, Loss)) %>% group_by(Opponent) %>% summarize(Win = sum(Win, na.rm = TRUE), Draw = sum(Draw, na.rm = TRUE), Loss = sum(Loss, na.rm = TRUE), `Goals For` = sum(e_g), `Goals Against` = sum(o_g)) return(list(resultados_totalmente, todos_partidos)) } else { ## opponent: ex. 'Argentina' todos_partidos <- todos_partidos %>% filter(Opponent == versus) if (nrow(todos_partidos) == 0) { return(glue("{team} has never played {versus} at the Copa América!")) } else { resultados_totalmente <- todos_partidos %>% group_by(Result, Opponent) %>% mutate(n = n()) %>% ungroup() %>% # sum amount of goals by team and opponent group_by(Result, Opponent) %>% summarize(e_g = sum(`Goals For`), o_g = sum(`Goals Against`), n = n()) %>% ungroup() %>% # spread results over multiple columns spread(Result, n) %>% mutate_if(is.integer, as.numeric) %>% group_by(Opponent) %>% summarize(Win = sum(Win, na.rm = TRUE), Draw = sum(Draw, na.rm = TRUE), Loss = sum(Loss, na.rm = TRUE), `Goals For` = sum(e_g), `Goals Against` = sum(o_g)) return(list(resultados_totalmente, todos_partidos)) } } }

The output is either a list containing the team’s record against every
opponent plus a dataframe of all the games the team has played at the
Copa América, or a message saying that the team you picked has never
played against the opponent you picked.

Japan

copaAmerica_resultados(data = results_copa_cleaned, team = "Japan", versus = "Brazil") ## Japan has never played Brazil at the Copa América!

Oh… that’s right, Japan has never played against Brazil at the Copa…

resultados_japon <- copaAmerica_resultados(data = results_copa_cleaned, team = "Japan") resultados_japon[[2]] %>% knitr::kable(format = "html", caption = "Japan's record in the Copa América") %>% kableExtra::kable_styling(full_width = FALSE)

Japan’s record in the Copa América

Date | Tournament | Venue | Opponent | Home / Away | Goals For | Goals Against | Result
1999-06-29 | Copa América | Asunción | Peru | away | 2 | 3 | Loss
1999-07-02 | Copa América | Asunción | Paraguay | away | 0 | 4 | Loss
1999-07-05 | Copa América | Pedro Juan Caballero | Bolivia | away | 1 | 1 | Draw

Japan’s only previous journey to the Copa América was the 1999
edition, where they lost their first two games and drew against
Bolivia. They were also invited to the 2011 edition but withdrew due
to the Tohoku Earthquake and were replaced by Costa Rica. Japanese
football has come a long way since 1999, but with such a young squad
this tournament it will be an uphill battle to get 3 points against
any of their Group C opponents: Uruguay, Chile, and Ecuador.

Colombia

resultados_colombia <- copaAmerica_resultados(data = results_copa_cleaned, team = "Colombia") resultados_colombia[[2]] %>% slice(87:92) %>% knitr::kable(format = "html", caption = "Colombia's record in the Copa América") %>% kableExtra::kable_styling(full_width = FALSE)

Colombia’s record in the Copa América

Date | Tournament | Venue | Opponent | Home / Away | Goals For | Goals Against | Result
2001-07-11 | Copa América | Barranquilla | Venezuela | home | 2 | 0 | Win
2001-07-14 | Copa América | Barranquilla | Ecuador | home | 1 | 0 | Win
2001-07-17 | Copa América | Barranquilla | Chile | home | 2 | 0 | Win
2001-07-23 | Copa América | Armenia | Peru | home | 3 | 0 | Win
2001-07-26 | Copa América | Manizales | Honduras | home | 2 | 0 | Win
2001-07-29 | Copa América | Bogotá | Mexico | home | 1 | 0 | Win

Despite a recent resurgence of the Colombia national team, they have
not been able to match the feats of the 2001 side that won the Copa,
with their best finish since then coming 3rd in 2004. The 2001 team
not only went unbeaten but also did not concede a single goal
throughout the tournament!

Superclásico Sudamericano: Brazil vs. Argentina

resultados_de_brazil <- copaAmerica_resultados(data = results_copa_cleaned, team = "Brazil", versus = "Argentina") resultados_de_brazil[[1]] %>% knitr::kable(format = "html", caption = "Brazil vs. Argentina in the Copa América") %>% kableExtra::kable_styling(full_width = FALSE)

Brazil vs. Argentina in the Copa América

Opponent | Win | Draw | Loss | Goals For | Goals Against
Argentina | 9 | 8 | 15 | 38 | 52

resultados_de_brazil[[2]] %>% tail(5) %>% knitr::kable(format = "html", caption = "Brazil vs. Argentina in the Copa América") %>% kableExtra::kable_styling(full_width = FALSE)

Brazil vs. Argentina in the Copa América

Date | Tournament | Venue | Opponent | Home / Away | Goals For | Goals Against | Result
1993-06-27 | Copa América | Guayaquil | Argentina | home | 1 | 1 | Draw
1995-07-17 | Copa América | Rivera | Argentina | home | 2 | 2 | Draw
1999-07-11 | Copa América | Ciudad del Este | Argentina | home | 2 | 1 | Win
2004-07-25 | Copa América | Lima | Argentina | away | 2 | 2 | Draw
2007-07-15 | Copa América | Maracaibo | Argentina | home | 3 | 0 | Win

Brazil do not have a good overall record vs. Argentina, but they have
not lost to their rivals at the Copa América since the 1993 edition,
where they lost 5-6 on penalties in the Quarter Finals. The “draw” in
1995 was won on penalties, while the “draw” in 2004 was actually the
final, which Brazil won 4-2 on penalties.

What I found odd was that the Copa América seems to have a very low
priority for certain countries, especially Brazil, who have repeatedly
sent their B or C teams to the tournament in favor of sending their
best team to other tournaments or resting star players. Funnily
enough, these understrength Brazilian squads have actually won the
entire tournament a few times, most notably in 2007 against a
full-strength Argentina side containing the likes of Zanetti,
Riquelme, Cambiasso, Tevez, and a young Messi and Mascherano!

Player Profiles

After looking at the history of the competition and the composition of
the squads, I examined the players and their form coming into the Copa
América. In recent years football analytics has really taken off, and
many strides have been made in creating more informative statistics to
assess players’ abilities, the most prominent being the xG family of
statistics. This is the first time I have talked about xG in any
length/depth, so this introduction is as much to solidify my
understanding as yours!

What IS xG?
  • xG: Quantitative measure (between 0 and 1) that assigns a
    probability that a shot taken will result in a goal based on a
    variety of variables and is used for evaluating the quality of
    chances and predicting players’ and teams’ future performances.

  • xA: Quantitative measure (between 0 and 1) that assigns a
    probability that a given pass will result in an assist for a goal
    based on a variety of variables.

Common variables used in the models that output xG statistics include
the distance and angle of the shot, the body part used, and whether
the shot came from a rebound, among others. It is similar to how you
might assess your favorite striker’s chances of scoring just as he is
lining up to take a shot: Is the shot a header? Is he trying to score
from a cross in regular play or from a corner kick? Is there a crowd
of defenders in front of him, or is he one-on-one with the goalkeeper?
And so on. You might think that who is taking the shot would be a
genuine factor, but in actuality it tells you a lot less about the
chances of a goal than the location of the shot does.
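To make the xG definition above concrete, here is a made-up illustration (not from any real data): a player’s xG over a match or season is simply the sum of the per-shot scoring probabilities, which you can then compare against the goals actually scored.

shots <- data.frame(
  shot_xg = c(0.05, 0.12, 0.76, 0.33),  ## 0.76 is roughly a penalty
  goal    = c(0, 0, 1, 0)
)

sum(shots$shot_xg)  ## expected goals: 1.26
sum(shots$goal)     ## actual goals:   1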

Note that there isn’t a SINGLE xG model. You can check out a blog post
comparing different people’s xG models
here.
People and organizations (from Statsbomb to OptaPro) have their own
ideas about which variables are important, so it is important to
report which source your data came from, as the stats can differ
between models. A few things xG does not factor in are goalkeeper
performance (someone pulling off incredible saves or letting in a poor
shot), a team’s style of play, and the quality of a player’s
teammates. When judging players based on these stats it is important
to be aware of contextual factors like the team they play for, their
opponent, and the player’s position/role in the team.

From xG and xA, more granular statistics such as xGChain and xGBuildup
were created to dig a little deeper into who is contributing to chance
creation; I’ll introduce those two a bit later. As the field has
grown, new statistics have popped up, such as Karun Singh’s “expected
threat”, or xT. You can check out an introduction to xT from here.

Of course, these statistics only tell part of the story and are
definitely not the be-all and end-all. In the context of this blog
post, they only tell us how these players did for their club teams
this past season rather than for their national team. Even so, they
give us a good idea of what kind of form these players are in coming
into the tournament.

You might also want to watch these Youtube videos by
TifoFootball
for a quick primer on xG
and xA.

understat data

For the data I used the website understat.com. Their xG models were
created by training a neural network on a dataset of over 100,000
shots using more than 10 different variables. Getting data from
understat has been made easy by Ewen Henderson’s understatr package,
available from GitHub (he’s also the guy that made the ghibli color
palette!). I tried to pick a wide selection of attacking players, but
I was also limited by the fact that understat only has data for
teams/players from six European leagues (Premier League, Bundesliga,
Serie A, La Liga, Ligue 1, and the Russian Premier League).
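A quick note on setup: understatr is not on CRAN, so it needs to be installed from GitHub. The repository location below is an assumption (Ewen Henderson’s account) and may have moved since this was written.

## install.packages("remotes")
remotes::install_github("ewenme/understatr")
library(understatr)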

For Peru I would have chosen Paolo Guerrero, but as he plays in Brazil
now I went with Jefferson Farfan (who, unfortunately, hasn’t played as
many games as the other players used for comparison). For Chile I
would have picked Eduardo Vargas, but as he doesn’t play for a team
covered by understat I went with Alexis Sanchez, who had a woeful
season and only played ~600 minutes despite appearing in ~20 league
matches; I later added Arturo Vidal as well. For Brazil I initially
included Neymar; since he won’t actually be playing I’ll keep him in
for comparison’s sake, but I also include Gabriel Jesus and Roberto
Firmino, who have been fighting for the starting striker spot. Note
that these two aren’t the ones replacing Neymar positionally. In
Neymar’s left-wing position I can see David Neres or Phil Coutinho
replacing him (Richarlison and Willian mostly play on the right).
(Edit: In the first match vs. Bolivia, David Neres started on the left
while Richarlison played on the right, and Coutinho played just behind
Bobby Firmino.)

The other nations’ strikers/attacking midfielders either don’t play in
the six European leagues covered by understat or, like Shinji Okazaki,
just did not play many minutes/games during the season. To get the
data I created a list of the player codes and used purrr::map() to
iterate each one through the understatr::get_player_seasons_stats()
function.

player_codes <- c(2097, 2099, 813, ## Messi, Neymar, Rondon 498, 4299, 696, ## Alexis, Farfan, Falcao 3294, 2098, 5543, ## Cavani, Suarez, G. Jesus 482, 1148, 2249, ## Bobby, Duvan, James 1089, 3553, 488, ## Cuadrado, Di Maria, Coutinho 222) ## Arturo Vidal understat_data <- player_codes %>% map(., ~ understatr::get_player_seasons_stats(.x)) %>% reduce(bind_rows) %>% select(-player_id, -position, -yellow, -red) glimpse(understat_data) ## Observations: 83 ## Variables: 15 ## $ games 34, 36, 34, 33, 38, 17, 20, 30, 34, 33, 32, 36, 38... ## $ goals 36, 34, 37, 26, 43, 15, 19, 13, 24, 22, 11, 7, 8, ... ## $ shots 170, 196, 179, 158, 187, 55, 91, 105, 124, 95, 89,... ## $ time 2704, 2995, 2832, 2726, 3374, 1443, 1797, 2652, 30... ## $ xG 25.997169, 28.946281, 26.885174, 27.101910, 35.891... ## $ assists 13, 12, 9, 16, 18, 7, 13, 11, 12, 7, 7, 3, 2, 2, 0... ## $ xA 15.33516552, 15.10040562, 13.95513140, 15.87127814... ## $ key_passes 93, 87, 79, 77, 95, 43, 70, 91, 102, 52, 32, 23, 2... ## $ year 2018, 2017, 2016, 2015, 2014, 2018, 2017, 2016, 20... ## $ team_name "Barcelona", "Barcelona", "Barcelona", "Barcelona"... ## $ npg 32, 32, 31, 23, 38, 10, 15, 12, 19, 21, 11, 7, 8, ... ## $ npxG 22.280909, 25.973170, 21.682231, 21.899351, 31.432... ## $ xGChain 38.459877, 48.180634, 42.525045, 41.996866, 54.753... ## $ xGBuildup 10.6987990, 21.6344040, 18.1335122, 15.1963644, 19... ## $ player_name "Lionel Messi", "Lionel Messi", "Lionel Messi", "L...

As you can see, the data consists of a row for each player and each
year (from the 2014/2015 season to the 2018/2019 season). I tried to
mitigate the fact that some players played a lot more minutes than
others by standardizing everything to a ‘per 90 minutes’ value, but
this has its own disadvantages. Players who play a lot of minutes (as
regular starters) may not have as high a ‘per 90’ stat even though
their total production across all those minutes suggests they are
consistently performing at a high level.
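For reference, the per-90 standardization is just this small calculation (a sketch, not code from the post): divide a season total by minutes played and multiply by 90.

per90 <- function(stat, minutes) stat / minutes * 90

per90(stat = 12, minutes = 2700)  ## e.g. 12 goals in 2700 minutes -> 0.4 per 90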

It’ll be a bit crowded (kind of like a spilt box of Skittles…) but let’s
check out the key metrics for all the players at once.

Note: npg = non-penalty goals, npxG = non-penalty xG

comparison_data <- understat_data %>% filter(year == 2018) %>% select(-games, -team_name, -year) %>% rename(Shots = shots, KP = key_passes) %>% gather(key = "key", value = "value", -player_name, -time) %>% mutate(key = forcats::as_factor(key) %>% forcats::fct_relevel(., "xG", "goals", "npxG", "npg", "xA", "assists", "xGChain", "xGBuildup", "Shots", "KP")) comparison_strikers_plot <- comparison_data %>% filter(key != "Shots", key != "KP", key != "xGBuildup", key != "xGChain") %>% mutate(value = value / time * 90) %>% ggplot(aes(x = key, y = value, fill = player_name)) + geom_jitter(shape = 21, size = 5, color = "black", width = 0.25, stroke = 1.1) + geom_vline(xintercept = 1.5, size = 2) + geom_vline(xintercept = 2.5, size = 2) + geom_vline(xintercept = 3.5, size = 2) + geom_vline(xintercept = 4.5, size = 2) + geom_vline(xintercept = 5.5, size = 2) + coord_flip() + scale_y_continuous(expand = c(0.01, 0.01), limits = c(0, 1.26)) + scale_fill_manual(values = pals::glasbey(16), name = "Player") + labs(title = "Comparison: Top attackers at the Copa América", subtitle = "For select group of attacking players with data available from understat.com", x = NULL, y = "Metric per 90", caption = glue::glue(" data: understat.com 2018-2019 Season")) + theme_copaAmerica(title.size = 18, subtitle = 12, panel.grid.minor.x = element_line(color = "white")) comparison_strikers_plot

As usual in these types of charts, Messi leads a lot of the metrics
here, and shows consistency too, having played the third-highest
number of minutes of the selected players. It’s helpful to have the
xG/xA stats next to the actual goals/assists, as it indicates whether
the player in question is scoring the shots he probabilistically
should be scoring. When a player’s actual goal count is higher than
their xG stat, this suggests that the player is “exceeding their xG”,
i.e. scoring from shots that are historically hard to score from. It
can be seen as a marker of an elite finisher, as they are consistently
putting away chances from difficult situations. In terms of assists
and xA, Alexis Sanchez, who only played about 600 minutes, looks a lot
better than in reality due to the aforementioned disadvantage of
standardizing everything to a “per 90 minutes” value. Normally you
would have a cut-off based on a certain minimum number of minutes, but
as I mentioned, I was rather limited in my choice of players.

A way to take a closer look at xG vs. goals and xA vs. assists is to
use a simple dot plot with a 45-degree reference line. Those below the
line are underperforming relative to their xG or xA stat, those above
it are overachieving (“exceeding” their xG/xA), while those right on
the line are scoring or assisting at about the rate the model expects.
I use non-penalty xG below, as penalties are worth around ~0.75 xG
(give or take a few percentage points depending on the model) and can
inflate the stats of players who take, and score, a lot of penalties,
especially if they weren’t the ones who earned the penalty in the
first place.

expected_goal_plot <- understat_data %>% filter(year == 2018) %>% select(player_name, time, npxG, xG, goals) %>% mutate_at(c("npxG", "xG", "goals"), ~. / time * 90) %>% ggplot(aes(x = npxG, y = goals, fill = player_name)) + geom_abline(intercept = 0, slope = 1, color = "white", size = 1.1) + geom_point(shape = 21, size = 5, color = "black", stroke = 1.1) + scale_x_continuous(limits = c(0, 1.1), expand = c(0, 0)) + scale_y_continuous(limits = c(0, 1.3), expand = c(0, 0)) + scale_fill_manual(values = pals::glasbey(16), name = "Player") + labs(title = "Expected vs. Actual Goals", subtitle = "For select group of attacking players with data available from understat.com", x = "Non-penalty xG per 90 minutes", y = "Goals per 90 minutes", caption = glue::glue(" data: understat.com 2018-2019 Season")) + theme_copaAmerica(panel.grid.minor.x = element_line(color = "white"), panel.grid.minor.y = element_line(color = "white"), subtitle.size = 11) expected_goal_plot

Gabriel Jesus is quite clearly below the 45-degree line, meaning that
he has been very poor at finishing his chances (and/or incredibly
unlucky). After a poor World Cup where he scored 0 goals as a starter,
he is really going to have to step up to fill Neymar’s goalscoring
boots in this tournament. However, his build-up play for City has
still been good this past season and he has been scoring for Brazil in
the friendlies leading up to the tournament, so it’s going to be a
hard call for Tite on who starts against Bolivia (edit: Firmino
started and contributed an assist, while Jesus replaced him in the
65th minute).

expected_assists_plot <- understat_data %>% filter(year == 2018) %>% select(player_name, time, xA, assists) %>% mutate_at(c("xA", "assists"), ~. / time * 90) %>% ggplot(aes(x = xA, y = assists, fill = player_name)) + geom_abline(intercept = 0, slope = 1, color = "white", size = 1.1) + geom_point(shape = 21, size = 5, color = "black", stroke = 1.1) + labs(title = "Expected vs. Actual Assists", subtitle = "For select group of attacking players with data available from understat.com", x = "xA per 90 minutes", y = "Assists per 90 minutes", caption = glue::glue(" data: understat.com 2018-2019 Season")) + scale_x_continuous(limits = c(0, 0.55), expand = c(0, 0)) + scale_y_continuous(limits = c(0, 0.55), expand = c(0, 0)) + scale_fill_manual(values = pals::glasbey(16), name = "Player") + theme_copaAmerica(panel.grid.minor.x = element_line(color = "white"), panel.grid.minor.y = element_line(color = "white"), subtitle.size = 11) expected_assists_plot

One thing to keep in mind is that xA does not take into account the
recipient of the assist pass. Even if a pass has a high expected
assist value, the receiving player might not have the quality to put
it away, through no fault of the passer. This might explain why most
of the players with a higher xA among this group don’t have the
assists to match. It could also be that these players are the ones
playing a lot more minutes, and the volume of chances they create just
isn’t translating into goals all the time.

Key Passes, Shots, xGChain, and xGBuildup (per 90)

I separated “key passes” and “shots” as well as “xGChain” and
“xGBuildup” from the rest, as these two sets are on very different
scales from the other metrics.

kp_shots_plot <- comparison_data %>% filter(key == "Shots" | key == "KP") %>% mutate(value = value / time * 90) %>% ggplot(aes(x = key, y = value, fill = player_name)) + geom_jitter(shape = 21, size = 5, color = "black", width = 0.25, stroke = 1.1) + coord_flip() + scale_y_continuous(expand = c(0.01, 0.01), limits = c(0, 6), breaks = c(0, 1, 2, 3, 4, 5, 6), labels = c(0, 1, 2, 3, 4, 5, 6)) + scale_fill_manual(values = pals::glasbey(17), name = "Player") + geom_vline(xintercept = 1.5, size = 2) + labs(title = "Comparison: Stars of the Copa América", subtitle = glue(" KP = Key Passes For select group of attacking players with data available from understat.com"), x = NULL, y = "Metric per 90", caption = glue::glue(" data: understat.com 2018-2019 Season")) + theme_copaAmerica(title.size = 18, subtitle.size = 10, panel.grid.minor.x = element_line(color = "white")) kp_shots_plot

  • xGChain: Quantitative measure that sums the xG of every possession
    ending in a shot that a player was involved in. The same value is
    credited to each of the players involved, allowing us to credit
    players for attacking contributions beyond just shots (xG) and
    assists (xA).

  • xGBuildup: Similar to xGChain but excluding shots and assists.
    This is in response to xGChain values still being dominated by the
    xG and xA from shots and assists, respectively. A toy illustration
    of both follows below.
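Here is that toy illustration (a made-up possession, not real data): the possession ends in a shot worth 0.30 xG, player A only contributed a build-up pass, B provided the assist, and C took the shot.

possession <- data.frame(
  player = c("A", "B", "C"),
  role   = c("build-up pass", "assist", "shot"),
  stringsAsFactors = FALSE
)
shot_xg <- 0.30  ## xG of the shot that ended the possession

## xGChain: everyone involved is credited with the full shot xG
possession$xGChain <- shot_xg
## xGBuildup: the shooter and the assister are excluded
possession$xGBuildup <- ifelse(possession$role == "build-up pass", shot_xg, 0)
possession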

xgbuildup_xgchain_plot <- comparison_data %>% filter(key == "xGBuildup" | key == "xGChain") %>% mutate(value = value / time * 90) %>% ggplot(aes(x = key, y = value, fill = player_name)) + geom_jitter(shape = 21, size = 5, color = "black", width = 0.25, stroke = 1.1) + coord_flip() + scale_y_continuous(expand = c(0.01, 0.01), limits = c(0, 1.55), breaks = c(0, 0.25, 0.5, 0.75, 1, 1.25, 1.5), labels = c(0, 0.25, 0.5, 0.75, 1, 1.25, 1.5)) + scale_fill_manual(values = pals::glasbey(17), name = "Player") + geom_vline(xintercept = 1.5, size = 2) + labs(title = "Comparison: Stars of the Copa América", subtitle = "For select group of attacking players with data available from understat.com", x = NULL, y = "Metric per 90", caption = glue::glue(" data: understat.com 2018-2019 Season")) + theme_copaAmerica(title.size = 18, subtitle.size = 10, panel.grid.minor.x = element_line(color = "white")) xgbuildup_xgchain_plot

Although Gabriel Jesus has been poor at finishing his chances, as seen
in the previous graphs, his xGChain and xGBuildup stats make it clear
that he is still contributing to City’s attack outside of scoring
goals himself (not to mention all the defensive work he does as well).

For example, in the plot below the stats clearly differentiate James,
who is more of a playmaker, from Falcao and Duvan, who are traditional
number 9s, through his superior xGBuildup, xGChain, and Key Passes
values. For a more detailed overview of xGChain and xGBuildup check
out Statsbomb’s article here.

## keep colors for Colombians consistent with other plots colombia_pal <- c("#000033", "#005300", "#009FFF", "#00FFBE") comparison_colombia_plot <- comparison_data %>% filter(!key %in% c("xG", "goals", "npxG", "npg", "xA", "assists"), player_name %in% c("James Rodríguez", "Falcao", "Duván Zapata", "Juan Cuadrado")) %>% mutate(value = value / time * 90) %>% ggplot(aes(x = key, y = value, fill = player_name)) + geom_point(shape = 21, size = 5, color = "black", stroke = 1.1) + geom_vline(xintercept = 1.5, size = 2) + geom_vline(xintercept = 2.5, size = 2) + geom_vline(xintercept = 3.5, size = 2) + coord_flip() + scale_y_continuous(expand = c(0.05, 0.05), breaks = c(0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4), labels = c(0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4)) + scale_fill_manual(values = colombia_pal, name = "Player") + labs(title = "Comparison: Stars of Colombia", subtitle = "KP: Key Passes", x = NULL, y = "Metric per 90", caption = glue::glue(" data: understat.com 2018-2019 Season")) + theme_copaAmerica(title.size = 20, subtitle.size = 12, panel.grid.minor.x = element_line(color = "white")) comparison_colombia_plot

There are lots of different ways to visualize this data, most famously
the radar charts created by Statsbomb. Using the data you can also
compare and evaluate players in many other ways, and using the
understatr package you could add a few more players like Paulo Dybala,
Miguel Almiron, and more! I could probably do a whole other article
using just this data, but I’ll leave it here for now.
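For example, adding a few more players is just a matter of appending their understat player codes to the vector used earlier; the codes below are placeholders rather than real IDs, so look them up on the players’ understat pages first.

more_codes <- c(1234, 5678)  ## hypothetical IDs, e.g. for Dybala and Almiron
more_data  <- purrr::map_dfr(more_codes, understatr::get_player_seasons_stats)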

Conclusion

Throughout this blog post I talked about some of the historical
records, squad compositions, match records, and finally the player
profiles of attacking players at this summer’s Copa América. Using the
power of R it is really easy to web-scrape and visualize data in a way
that is informative and aesthetically pleasing. I wanted to finish
this before the tournament started, but other life things got in the
way, and the amount of content ballooned out of control (especially
the xG section), so I had to cut a lot down.

It’s been fun reading the articles on the Copa América website and
seeing how far my “intermediate-but-very-out-of-practice” level of
Spanish can get me; here is one that I particularly liked reading: 14
Estadisticas de la Copa América. With so many tournaments going on
(and with the African Cup of Nations starting in a few days) a lot of
the news media is spread thin right now, but there are still some
quality articles out there to read about the Copa, like this article
from the BBC’s South American football expert, Tim Vickery.

After the first round of games, a few points of discussion:

  • After an extremely lacklustre performance vs. Colombia, how does
    Argentina bounce back? What tactical changes need to be made?

  • Qatar impressed against Paraguay but can they pull off a major upset
    vs. Colombia?

  • How will Japan line up against Uruguay after losing by a scoreline
    that didn’t really do their performance justice? How will manager
    Moriyasu balance experience vs. youth, and will he start veteran
    Okazaki after Ueda’s numerous misses vs. Chile? How will Japan deal
    with their glaring weaknesses at the fullback position? Will Yuta
    Nakayama keep his place after an awful performance?

  • Can Brazil earn an early ticket to the next round vs. Venezuela
    after a clinical but not excellent performance vs. Bolivia? Will
    they be able to keep up their streak of winning the Copa every time
    they have hosted it?

Thanks for reading and…

¡Buena suerte a todos equipos!



copa_ganadores_plot <- copa_campeones_limpia %>% ggplot(aes(value, forcats::fct_rev(team), color = key)) + geom_point(size = 10) + # 10 geom_text(aes(label = value), size = 5, color = "black", # 5 family = "Roboto Condensed", fontface = "bold") + scale_color_manual(values = c("Champions" = "#FFCC33", "Runners-up" = "#999999", "Third Place" = "#CC6600"), guide = FALSE) + scale_x_continuous(breaks = c(1, 5, 10, 15), labels = c(1, 5, 10, 15), limits = c(-1, 16)) + labs(x = "Number of Occurrence", y = NULL, title = "Most Successful Teams of the Copa América!", subtitle = str_wrap("Ordered by number of Copa América(s) won. Argentina missed the chance to leapfrog Uruguay after consecutive final losses in the previous two tournaments!", width = 80), caption = glue(" Source: Wikipedia By @R_by_Ryo")) + facet_wrap(~key) + theme_copaAmerica(subtitle.size = 14, caption.size = 10) copa_ganadores_plot

What’s surprising to note is that Pele never won a Copa América with
Brazil, although he did get Best Player and Top Scorer in the 1959
edition of the tournament. Even more bizarrely, Diego Maradona never
won it either! He didn't play in the 1991 or 1993 editions, where
Argentina won their 13th and 14th Copas.

Copa América Squad Profiles

We just looked at what happened in the past but who are the players
competing in the tournament this year? To take a quick look I
web-scraped the squads of each of the competing teams from Wikipedia.

I created a list of the xpaths for each of the squads and, using
purrr::map(), grabbed the data for each participating country. After
adding some meta-information about the country name and the group, I
created a list-column that stores each squad's data as a dataframe in its
own column. To explode this out I used tidyr::unnest() to reshape the
entire dataframe into one row per player for every squad. A minimal
sketch of that pattern is shown below.
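Here is a minimal sketch of that list-column pattern, with the caveat that the URL, the xpaths, and the two-country list are placeholders rather than the exact ones used for the real scrape (which also recorded each team's group):

library(rvest)
library(dplyr)
library(purrr)
library(tidyr)

## placeholders: swap in the real squads page and one xpath per participating country
squads_page <- read_html("https://en.wikipedia.org/wiki/2019_Copa_Am%C3%A9rica_squads")

squad_xpaths <- list(
  Brazil    = '//*[@id="mw-content-text"]/div/table[1]',
  Argentina = '//*[@id="mw-content-text"]/div/table[2]'
)

squads_sketch <- squad_xpaths %>% 
  imap_dfr(~ tibble(country = .y,
                    squad   = list(squads_page %>% 
                                     html_node(xpath = .x) %>% 
                                     html_table()))) %>%   ## one row per country, squad table in a list-column
  unnest(squad)                                            ## explode: one row per player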

To get a clean dataset I use some stringr::str_*() functions to
properly format the character strings, such as the player positions,
ages, and dates of birth.

squads_df_clean <- squads_df_raw %>% 
  janitor::clean_names() %>% 
  select(-delete, squad_num = no, position = pos, birth_age = date_of_birth_age) %>% 
  mutate(position = position %>% str_replace_all(., "[1-9]", ""),
         birth_age = birth_age %>% str_extract_all(., pattern = "\\([^()]+\\)")) %>% 
  unnest(birth_age) %>% 
  group_by(player) %>% 
  mutate(colnum = seq_along(player)) %>% 
  spread(key = colnum, value = birth_age) %>% 
  ungroup() %>% 
  select(everything(), dob = `1`, age = `2`) %>% 
  mutate(dob = dob %>% str_replace_all(., "[()]", "") %>% lubridate::as_date(),
         age = age %>% str_extract(., "[0-9]+") %>% as.integer,
         country = forcats::fct_relevel(country, 
                                        "Brazil", "Argentina", "Uruguay", "Peru", 
                                        "Qatar", "Chile", "Venezuela", "Paraguay", 
                                        "Japan", "Bolivia", "Colombia", "Ecuador"),
         club = case_when(
           club == "Barcelona" & country == "Ecuador" ~ "Barcelona (Ecuador)",
           TRUE ~ club))

glimpse(squads_df_clean)
## Observations: 276
## Variables: 12
## $ name            3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,...
## $ group           "A", "A", "A", "A", "A", "A", "A", "A", "A", "A...
## $ country         Brazil, Brazil, Brazil, Brazil, Brazil, Brazil,...
## $ squad_num       1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...
## $ position        "GK", "DF", "DF", "DF", "MF", "DF", "FW", "MF",...
## $ player          "Alisson", "Thiago Silva", "Miranda", "Marquinh...
## $ caps            36, 79, 57, 36, 36, 40, 3, 10, 29, 65, 49, 15, ...
## $ goals           0, 7, 3, 1, 0, 2, 1, 0, 16, 8, 14, 1, 7, 0, 0, ...
## $ club            "Liverpool", "Paris Saint-Germain", "Internazio...
## $ country_league  "England", "France", "Italy", "France", "Spain"...
## $ dob             1992-10-02, 1984-09-22, 1984-09-07, 1994-05-14...
## $ age             26, 34, 34, 25, 27, 33, 22, 22, 22, 30, 27, 28,...

Age histogram

Using this data I can plot a bunch of histograms:

age_country_plot <- squads_df_clean %>% group_by(country) %>% mutate(median_age = median(age)) %>% ungroup() %>% ggplot(aes(x = age)) + geom_histogram(fill = "red", binwidth = 1) + geom_vline(aes(xintercept = median_age), size = 1.2) + geom_label(aes(x = median_age, y = 8, label = glue::glue("Median: {median_age}")), nudge_x = 0.5, hjust = 0.1, size = 3, family = "Roboto Condensed", color = "black") + labs(title = "Age Distribution of Copa América squads", subtitle = "Columns ordered Group A to Group C", x = "Age", y = NULL, caption = glue::glue(" Source: Wikipedia By: @R_by_Ryo")) + scale_x_continuous(expand = c(0, 0)) + scale_y_continuous(expand = c(0, 0), breaks = scales::pretty_breaks()) + theme_copaAmerica(title.size = 22, subtitle.size = 14, caption.size = 8, axis.text.size = 12, axis.title.size = 16, strip.text.size = 18, panel.grid.minor.x = element_line(color = "white"), panel.grid.minor.y = element_line(color = "white")) + facet_wrap(~country, ncol = 3) age_country_plot

In terms of age, Japan have the youngest team with a median of 21, 4
years younger than the next youngest team, Qatar. The rest have a fairly
balanced spread of ages from 20 to early-mid 30s with most of the
medians hovering around 27 years of age. Japan's extremely young squad
reflects the fact that the full-strength national team has played in
both the World Cup and the Asian Cup in the past year. With the Tokyo
Olympics coming up next year, it was decided to use the invitation to
the Copa América as a trial-by-fire for
the young stars of the future. Much like in a real Olympic squad, the
team contains three “overage” players in World Cup 2010/2014/2018
goalkeeper Eiji Kawashima, Premier League winner Shinji Okazaki, and
Getafe playmaker Gaku Shibasaki.

The oldest player will be Brazil captain Dani Alves at 36 with
Paraguay’s Oscar Cardozo only two weeks younger. On the other hand, the
youngest player is Japan’s 18-year old prodigy Takefusa Kubo, the
ex-Barcelona youth player who only just recently moved to Real Madrid!
In light of his transfer a lot of eyes will be on him to see if he can
produce some Captain Tsubasa-esque performances for a very inexperienced
Japan team gearing up for the Tokyo Olympics!

Caps histogram

When considering the experience of a squad it's not enough to look at
ages; one also needs to look at caps, i.e. appearances for the national
team.

caps_country_plot <- squads_df_clean %>% group_by(country) %>% mutate(median_cap = median(caps)) %>% ungroup() %>% ggplot(aes(x = caps)) + geom_histogram(fill = "red", binwidth = 5) + geom_vline(aes(xintercept = median_cap), size = 1.25) + geom_label(aes(x = median_cap, y = 15, label = glue::glue("Median: {median_cap}")), nudge_x = 0.5, hjust = 0.05, size = 3, family = "Roboto Condensed", color = "black") + labs(title = "Caps (Appearances) by Country", subtitle = "Columns ordered Group A to Group C", x = "Caps", y = NULL, caption = glue::glue(" Source: Wikipedia By: @R_by_Ryo")) + scale_x_continuous(expand = c(0, 0)) + scale_y_continuous(expand = c(0, 0)) + theme_copaAmerica( title.size = 20, subtitle.size = 14, caption.size = 10, axis.text.size = 10, axis.title.size = 16, strip.text.size = 18, panel.grid.major.x = element_line(color = "white", size = 0.25), panel.grid.major.y = element_line(color = "white", size = 0.25), panel.grid.minor.x = element_line(color = "white", size = 0.25), panel.grid.minor.y = element_line(color = "white", size = 0.25)) + facet_wrap(~country, ncol = 3) caps_country_plot

The majority of Japan’s squad have 0 (ZERO) caps, with the
aforementioned three “overage” players taking up most of the proportion
of caps on the team. Bolivia are also taking a untested squad with 8 of
their players with 2 caps or less! Chile, Uruguay, and Argentina bring
their veterans with multiple players over or around 100 caps. From this
data I was surprised that Jefferson Farfan and Paolo Guerrero didn’t
have 100 caps by now…

The player with the most caps is Lionel Messi (130) followed closely by
Diego Godin (126), and Alexis Sanchez (124). On the other hand there are
29 players hopeful of making their first national team appearance at
this tournament with the majority (17 players) coming from Japan.

Goal distribution

Next I looked at the distribution of goals scored by the midfielders and
strikers of each team. I found out about using
ggplot2::position_nudge() for slightly adjusting variables on a
discrete scale in a similar fashion to the nudge_y = and nudge_x =
arguments most people might be familiar with from other geoms. I also
used ggforce::geom_mark_hull() to do some labelling.

goals_country_plot <- squads_df_clean %>% filter(position %in% c("MF", "FW")) %>% group_by(country) %>% mutate(median = median(goals)) %>% ungroup() %>% ggplot(aes(x = goals, y = reorder(country, median))) + ggridges::geom_density_ridges(fill = "red", color = "white", scale = 1.1) + geom_point(aes(x = median, y = country), position = position_nudge(y = 0.25), color = "yellow", size = 3) + ggforce::geom_mark_hull(aes(filter = country == "Argentina" & goals == 67, label = "Lionel Messi: 67 goals"), label.buffer = unit(15, "mm"), label.fontsize = 10, label.fill = "red", label.family = "Roboto Condensed", label.colour = "white", con.cap = unit(1, "mm"), con.type = "straight") + ggforce::geom_mark_hull(aes(filter = country == "Uruguay" & goals == 55, label = "Luis Suarez: 55 goals"), label.buffer = unit(5, "mm"), label.fontsize = 10, label.fill = "red", label.family = "Roboto Condensed", label.colour = "white", con.cap = unit(1, "mm"), con.type = "straight") + ggforce::geom_mark_hull(aes(filter = country == "Japan" & goals == 50, label = "Shinji Okazaki: 50 goals"), label.buffer = unit(2, "mm"), label.fontsize = 10, label.fill = "red", label.family = "Roboto Condensed", label.colour = "white", con.cap = unit(1, "mm"), con.type = "straight") + ggforce::geom_mark_hull(aes(filter = country == "Uruguay" & goals == 46, label = "Edinson Cavani: 46 goals"), label.buffer = unit(25, "mm"), label.fontsize = 10, label.fill = "red", label.family = "Roboto Condensed", label.colour = "white", con.cap = unit(1, "mm"), con.type = "straight") + ggforce::geom_mark_hull(aes(filter = country == "Chile" & goals == 41, label = "Alexis Sanchez: 41 goals"), label.buffer = unit(4, "mm"), label.fontsize = 10, label.fill = "red", label.family = "Roboto Condensed", label.colour = "white", con.cap = unit(1, "mm"), con.type = "straight") + scale_x_continuous(limits = c(0, 73), expand = c(0.01, 0.01), breaks = c(0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70), labels = c(0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70)) + expand_limits(y = 13.5) + labs(title = "Distribution of Goals Scored by Midfielders and Strikers", subtitle = "Copa América 2019 squads, Yellow dot = Median goals", x = "Goals", y = NULL, caption = glue::glue(" Source: Wikipedia Data from prior to start of tournament By: @R_by_Ryo")) + theme_copaAmerica(title.size = 18, subtitle.size = 12, caption.size = 8, axis.text.size = 14, axis.title.size = 16, strip.text.size = 18) goals_country_plot

With a lot of these players being more defensively minded or new to the
team, the distribution is heavily skewed, but you can see little mounds
showing the top goalscorers for each country, and you can see which
countries, such as Brazil, Qatar, and Peru, have their goals spread out
among multiple players.

If you know your South American players you can take a good guess at who
the top goal scorers are for each nation. For Colombia the two outlying
mounds are obviously James Rodriguez and Falcao, for example.
Venezuela’s top scorer with 22 is Salomon Rondon and for Brazil, if not
for his injury, a lonesome mound would have appeared for Neymar with 60
goals!

Player contribution by league

Now let’s check the player contribution to the squads at the Copa
América by league. I’m just going to use the country that the league is
from for simplicity’s sake. Originally I wanted to left_join() it with
a ‘country <> domestic league’ table but couldn’t find one and the
league names itself aren’t very meaningful or have awful sponsor names
that obfuscate the country of origin even further.
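Just to illustrate what that would have looked like, here is a hedged sketch with a tiny hand-made lookup table; the league strings and the "league" column are hypothetical, since the cleaned squad data above only carries the league's country:

library(dplyr)

## hypothetical lookup: raw league name -> country of that league (illustrative rows only)
league_lookup <- tribble(
  ~league,               ~league_country,
  "Premier League",      "England",
  "La Liga",             "Spain",
  "Serie A",             "Italy",
  "Liga MX",             "Mexico",
  "Superliga Argentina", "Argentina"
)

## if squads_df_clean had a raw 'league' column, the join would simply be:
## squads_df_clean %>% left_join(league_lookup, by = "league")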

player_contrib_league_plot <- squads_df_clean %>% group_by(country_league) %>% summarize(n = n()) %>% ungroup() %>% ggplot(aes(y = n, x = reorder(country_league, n))) + geom_col(fill = "red") + geom_text(aes(label = n, family = "Roboto Condensed", fontface = "bold"), size = 4.5, color = "yellow", nudge_y = 0.5) + coord_flip() + scale_y_continuous(labels = c(0, 5, 10, 15, 20, 25), breaks = c(0, 5, 10, 15, 20, 25), limits = c(0, 30), expand = c(0, 0)) + labs(title = "Breakdown of Player Contributions by League", subtitle = glue(" Shown as Country Name Mexico (Liga MX) contributed 27 players to South American squads"), x = "League (Country name)", y = "Number of players", caption = glue::glue(" Source: Wikipedia By: @R_by_Ryo")) + theme_copaAmerica(title.size = 18, subtitle.size = 12, caption.size = 10, axis.text.size = 14, axis.text.y.size = 11, axis.title.size = 16, panel.grid.major.y = element_blank(), panel.grid.minor.x = element_line(color = "white")) player_contrib_league_plot

The very best players from South American countries move on to Europe,
so the Argentinean league (Superliga Argentina) and the Brazilian league
(Brasileirão – Serie A) do not contribute as many players as you might
think; as a consequence, the top leagues of England, Spain, and Italy
contribute quite a bit! A lot of the better, but not quite elite, South
American players might go to Mexico instead of a lower-mid-table European
league. With the growth of MLS a fair number of players ply their trade
there as well.

We can take a more detailed look by creating a table of the proportion
of players from each squad coming from either a domestic league or any
other league. I had to do a lot of wrangling to get the proper output
for the table. After calculating the percentage of domestic players from
a country’s domestic league I added the full data back in. Then I had to
make sure that for each country, the country – domestic league country
was the first row in each of the country groups (so Bolivia – Bolivia,
Bolivia, China, Japan – Japan, Japan – England, etc.). By doing this I
can automatically tidyr::fill()-in the rest of the rows of that
country with the ‘percentage of players from domestic league stat’.

squads_df_clean %>% 
  group_by(country, country_league) %>% 
  summarize(player_from_league = n()) %>% 
  filter(country == country_league) %>% 
  mutate(perc_from_domestic_league = percent(player_from_league / 23, accuracy = 0.1)) %>% 
  right_join(squads_df_clean %>% 
               group_by(country, country_league) %>% 
               summarize(player_from_league = n()) %>% 
               ungroup()) %>% 
  mutate(first = case_when(
    country == country_league ~ 1,
    TRUE ~ 0)) %>% 
  arrange(country, desc(first)) %>% 
  fill(perc_from_domestic_league) %>% 
  group_by(country) %>% 
  mutate(perc_from_league = percent(player_from_league / 23, accuracy = 0.1),
         country_league = glue::glue("{country_league} - league")) %>% 
  arrange(desc(player_from_league)) %>% 
  select(Country = country, 
         `League (country name)` = country_league,
         `Number of players from league` = player_from_league,
         `Percentage of players from league` = perc_from_league,
         `Percentage of players from domestic league` = perc_from_domestic_league) %>% 
  head(10) %>% 
  knitr::kable(format = "html", caption = "Breakdown of Player Contribution by League") %>% 
  kableExtra::kable_styling(full_width = FALSE)

Breakdown of Player Contribution by League

Country   | League (country name) | Number of players from league | Percentage of players from league | Percentage of players from domestic league
Qatar     | Qatar – league        | 23 | 100.0% | 100.0%
Bolivia   | Bolivia – league      | 20 | 87.0%  | 87.0%
Japan     | Japan – league        | 14 | 60.9%  | 60.9%
Ecuador   | Ecuador – league      | 9  | 39.1%  | 39.1%
Paraguay  | Paraguay – league     | 8  | 34.8%  | 34.8%
Brazil    | England – league      | 7  | 30.4%  | 13.0%
Uruguay   | Spain – league        | 6  | 26.1%  | 4.3%
Peru      | Peru – league         | 6  | 26.1%  | 26.1%
Chile     | Chile – league        | 6  | 26.1%  | 26.1%
Argentina | Argentina – league    | 5  | 21.7%  | 21.7%

Three interesting facts I found:

  • 30% of players on the Brazil squad play for an English team, most
    out of any league – squad combination excluding domestic leagues.
  • 100% of the Qatar squad play in their domestic league!
  • Only one Uruguayan player (4.3%) plays in its domestic league.
Player contribution by club

In the final plot for this section, I looked at the top 15 clubs
contributing the most players to the tournament. I used
arrange(desc(n)) %>% slice() instead of top_n(), as top_n() sometimes grabs
way too many teams that are tied on the same value. To place the team names
inside the bars I created a midpoint value, midval, equal to half the
number of players contributed, so the labels sit neatly.

player_contrib_club_plot <- squads_df_clean %>% group_by(club) %>% summarize(n = n()) %>% mutate(club = club %>% forcats::as_factor() %>% forcats::fct_reorder(n), midval = n / 2) %>% arrange(desc(n)) %>% slice(1:15) %>% ggplot(aes(x = club, y = n)) + geom_col(fill = "red") + geom_text(aes(label = n, family = "Roboto Condensed", fontface = "bold"), size = 7.5, color = "yellow", nudge_y = 0.5) + geom_text(aes(y = midval, label = club, family = "Roboto Condensed", fontface = "bold"), size = 5, color = "white") + coord_flip() + scale_y_continuous(breaks = scales::pretty_breaks(), expand = c(0, 0), limits = c(0, 10.5)) + labs(title = "Top 15 Clubs contributing the most players to the Copa América", x = "Club", y = "Number of players", caption = "Source: Wikipedia") + theme_copaAmerica( title.size = 18, subtitle.size = 12, caption.size = 8, axis.text.size = 14, axis.title.size = 16, strip.text.size = 18, panel.grid.major.y = element_blank(), panel.grid.minor.x = element_line(color = "white")) + theme(axis.text.y = element_blank(), axis.ticks.y = element_blank()) player_contrib_club_plot

With 100% of its players coming from the domestic league it's no
surprise that the Qatari team Al-Sadd contributes the most players
to the tournament. Tied with another Qatari team, Mexican club América
features 7 players, yet none of them are Mexican (2 Argentineans, 2
Colombians, 1 Ecuadorian, 1 Chilean, and 1 Paraguayan).

At first I thought Barcelona contributed 8 players until I realized the
Ecuadorian players were coming from the Ecuadorian team called
Barcelona…I had to go all the way back up to the beginning of this
section to fix that small peculiarity. As fútbol came to South America
via European colonists and immigrants, a lot of teams took up the names
and colors of the teams these Europeans were fond of. Other examples
include Liverpool F.C. (Montevideo, Uruguay), Arsenal de Sarandi (Buenos
Aires, Argentina), and Club Atletico Juventus (Sao Paulo, Brazil –
although they use the colors of Torino F.C.).

If you download the data and type in the code below you can see the
entire club-country list.

squads_df_clean %>% 
  group_by(club, country) %>% 
  summarize(n = n()) %>% 
  View()

Match Records

Now that we've gotten a good look at the composition of the teams, we can
take a look at how they've done at every Copa América.

The next code chunk mainly comes from PH Julien and his excellent Kaggle
kernel, "A Journey Through The History of Soccer".

## grab football federation affiliations data
federation_files <- Sys.glob("../data/federation_affiliations/*")

df_federations = data.frame(country = NULL, federation = NULL)

for (f in federation_files) {
  federation = basename(f)
  content = read.csv(f, header = FALSE)
  content <- cbind(content, federation = rep(federation, dim(content)[1]))
  df_federations <- rbind(df_federations, content)
}

colnames(df_federations) <- c("country", "federation")

df_federations <- df_federations %>% 
  mutate(country = as.character(country) %>% str_trim(side = "both"))

results_raw <- readr::read_csv("../data/results.csv")

results_copa <- results_raw %>% 
  filter(tournament == "Copa América") %>% 
  rename(venue_country = country, venue_city = city) %>% 
  mutate(match_num = row_number())

## combine with federation affiliations
results_copa_home <- results_copa %>% 
  left_join(df_federations, by = c("home_team" = "country")) %>% 
  mutate(federation = as.character(federation)) %>% 
  rename(home_federation = federation)

results_copa_away <- results_copa %>% 
  left_join(df_federations, by = c("away_team" = "country")) %>% 
  mutate(federation = as.character(federation)) %>% 
  rename(away_federation = federation)

## combine home-away
results_copa_cleaned <- results_copa_home %>% 
  full_join(results_copa_away)

Unfortunately, this data does not have penalty shoot-out results, as those
games are all counted as draws (which, technically, is the actual
result). Considering there are a lot of cagey knock-out rounds that finish
in a penalty shoot-out (including the last two finals…) that is
unfortunate, but that's just the data you have sometimes. There is a way
to web-scrape all the Copa América results and assign Win-Loss to those
games that went to penalties, but I'll leave that for another time (a
rough sketch of the idea is below). Also, there is no info on what stage
of the tournament each recorded match is from.
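As a rough sketch of that idea (not something used in the rest of this post), a small hand-curated table of shoot-out winners could be joined back onto the match data; the two rows below are just the Brazil vs. Argentina shoot-outs discussed further down:

library(dplyr)

shootout_winners <- tribble(
  ~date,        ~shootout_winner,
  "1995-07-17", "Brazil",    ## 1995 quarter-final, 2-2, won on penalties
  "2004-07-25", "Brazil"     ## 2004 final, 2-2, won 4-2 on penalties
)

results_copa_cleaned %>% 
  mutate(date = as.character(date)) %>% 
  left_join(shootout_winners, by = "date") %>% 
  mutate(result_note = case_when(
    is.na(shootout_winner)       ~ NA_character_,
    shootout_winner == home_team ~ "Home win on penalties",
    shootout_winner == away_team ~ "Away win on penalties"))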

results_copa_cleaned <- results_copa_cleaned %>% 
  mutate(
    home_federation = case_when(
      home_team == "USA" ~ "Concacaf",
      TRUE ~ home_federation),
    away_federation = case_when(
      away_team == "USA" ~ "Concacaf",
      TRUE ~ away_federation)) %>% 
  select(-contains("federation"), -contains("venue"), -neutral, 
         date, home_team, home_score, away_team, away_score, 
         tournament, venue_city)

glimpse(results_copa_cleaned)
## Observations: 787
## Variables: 8
## $ date        1916-07-02, 1916-07-06, 1916-07-08, 1916-07-10, 19...
## $ home_team   "Chile", "Argentina", "Brazil", "Argentina", "Brazi...
## $ away_team   "Uruguay", "Chile", "Chile", "Brazil", "Uruguay", "...
## $ home_score  0, 6, 1, 1, 1, 0, 4, 4, 1, 4, 5, 1, 6, 2, 0, 3, 4, ...
## $ away_score  4, 1, 1, 1, 2, 0, 0, 2, 0, 0, 0, 0, 0, 3, 2, 1, 1, ...
## $ tournament  "Copa América", "Copa América", "Copa América", "Co...
## $ match_num   1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
## $ venue_city  "Buenos Aires", "Buenos Aires", "Buenos Aires", "Bu...

Now that it’s nice and cleaned up I created a function that reshapes the
data so that it’s set from a certain team’s perspective with the “team”
argument. You can also set the function to look for only results against
a certain opponent by filling in the versus argument.

copaAmerica_resultados <- function(data, team, versus = NA) { ## team of interest: ex. 'Brazil' team_var <- enquo(team) todos_partidos <- data %>% ## filter only for results of team of interest filter(home_team == !!team_var | away_team == !!team_var) %>% ## reshape columns to team vs. opponent mutate( opponent = case_when( away_team != !!team_var ~ away_team, home_team != !!team_var ~ home_team), home_away = case_when( home_team == !!team_var ~ "home", away_team == !!team_var ~ "away"), equipo_goals = case_when( home_team == !!team_var ~ home_score, away_team == !!team_var ~ away_score), opp_goals = case_when( home_team != !!team_var ~ home_score, away_team != !!team_var ~ away_score)) %>% ## label results from team's perspective mutate( result = case_when( equipo_goals > opp_goals ~ "Win", equipo_goals < opp_goals ~ "Loss", equipo_goals == opp_goals ~ "Draw")) %>% mutate(result = result %>% forcats::as_factor() %>% forcats::fct_relevel(c("Win", "Draw", "Loss"))) %>% select(-contains("score"), -contains("team"), -match_num) %>% rename(Date = date, Tournament = tournament, `Venue` = venue_city, Opponent = opponent, `Home / Away` = home_away, `Goals For` = equipo_goals, `Goals Against` = opp_goals, Result = result) if (is.na(versus) | is.null(versus)) { resultados_totalmente <- todos_partidos %>% group_by(Result, Opponent) %>% mutate(n = n()) %>% ungroup() %>% ## sum amount of goals by team and opponent group_by(Result, Opponent) %>% summarize(e_g = sum(`Goals For`), o_g = sum(`Goals Against`), n = n()) %>% ungroup() %>% ## spread results over multiple columns spread(Result, n) %>% mutate_if(is.integer, as.numeric) missing_cols <- c("Win", "Draw", "Loss") %>% map_dfr( ~tibble(!!.x := numeric())) resultados_totalmente <- resultados_totalmente %>% bind_rows(missing_cols) %>% mutate(Win = if_else(is.na(Win), 0, Win), Draw = if_else(is.na(Draw), 0, Draw), Loss = if_else(is.na(Loss), 0, Loss)) %>% group_by(Opponent) %>% summarize(Win = sum(Win, na.rm = TRUE), Draw = sum(Draw, na.rm = TRUE), Loss = sum(Loss, na.rm = TRUE), `Goals For` = sum(e_g), `Goals Against` = sum(o_g)) return(list(resultados_totalmente, todos_partidos)) } else { ## opponent: ex. 'Argentina' todos_partidos <- todos_partidos %>% filter(Opponent == versus) if (nrow(todos_partidos) == 0) { return(glue("{team} has never played {versus} at the Copa América!")) } else { resultados_totalmente <- todos_partidos %>% group_by(Result, Opponent) %>% mutate(n = n()) %>% ungroup() %>% # sum amount of goals by team and opponent group_by(Result, Opponent) %>% summarize(e_g = sum(`Goals For`), o_g = sum(`Goals Against`), n = n()) %>% ungroup() %>% # spread results over multiple columns spread(Result, n) %>% mutate_if(is.integer, as.numeric) %>% group_by(Opponent) %>% summarize(Win = sum(Win, na.rm = TRUE), Draw = sum(Draw, na.rm = TRUE), Loss = sum(Loss, na.rm = TRUE), `Goals For` = sum(e_g), `Goals Against` = sum(o_g)) return(list(resultados_totalmente, todos_partidos)) } } }

The output is either a list containing the team's record against other
teams in the Copa América along with a dataframe of all the games the
team has been involved in, or a message saying that the team you picked
has never played against the opponent you picked.

Japan

copaAmerica_resultados(data = results_copa_cleaned, team = "Japan", versus = "Brazil")
## Japan has never played Brazil at the Copa América!

Oh… that’s right Japan has never played against Brazil at the Copa…

resultados_japon <- copaAmerica_resultados(data = results_copa_cleaned, team = "Japan")

resultados_japon[[2]] %>% 
  knitr::kable(format = "html", caption = "Japan's record in the Copa América") %>% 
  kableExtra::kable_styling(full_width = FALSE)

Japan's record in the Copa América

Date       | Tournament   | Venue                | Opponent | Home / Away | Goals For | Goals Against | Result
1999-06-29 | Copa América | Asunción             | Peru     | away        | 2 | 3 | Loss
1999-07-02 | Copa América | Asunción             | Paraguay | away        | 0 | 4 | Loss
1999-07-05 | Copa América | Pedro Juan Caballero | Bolivia  | away        | 1 | 1 | Draw

Japan’s only previous journey to the Copa América was in the 1999
edition where they lost all 2 games and drew against Bolivia. They were also invited for the 2011
edition but withdrew due to the Tohoku Earthquake and were replaced by
Costa Rica. Japanese football has come a long way since 1999 but with a
young squad this tournament it will be a uphill battle to get 3 points against any of
their Group C opponents, Uruguay, Chile, and Ecuador.

Colombia

resultados_colombia <- copaAmerica_resultados(data = results_copa_cleaned, team = "Colombia")

resultados_colombia[[2]] %>% 
  slice(87:92) %>% 
  knitr::kable(format = "html", caption = "Colombia's record in the Copa América") %>% 
  kableExtra::kable_styling(full_width = FALSE)

Colombia's record in the Copa América

Date       | Tournament   | Venue        | Opponent  | Home / Away | Goals For | Goals Against | Result
2001-07-11 | Copa América | Barranquilla | Venezuela | home        | 2 | 0 | Win
2001-07-14 | Copa América | Barranquilla | Ecuador   | home        | 1 | 0 | Win
2001-07-17 | Copa América | Barranquilla | Chile     | home        | 2 | 0 | Win
2001-07-23 | Copa América | Armenia      | Peru      | home        | 3 | 0 | Win
2001-07-26 | Copa América | Manizales    | Honduras  | home        | 2 | 0 | Win
2001-07-29 | Copa América | Bogotá       | Mexico    | home        | 1 | 0 | Win

Despite a recent resurgence of the Colombia national team, they have not
been able to match the feats of the 2001 side that won the Copa, with
their best finish since then coming 3rd in 2004. The 2001 team were not
only unbeaten but also did not concede a single goal throughout the
tournament!

Superclásico Sudamericano: Brazil vs. Argentina

resultados_de_brazil <- copaAmerica_resultados(data = results_copa_cleaned, 
                                               team = "Brazil", versus = "Argentina")

resultados_de_brazil[[1]] %>% 
  knitr::kable(format = "html", caption = "Brazil vs. Argentina in the Copa América") %>% 
  kableExtra::kable_styling(full_width = FALSE)

Brazil vs. Argentina in the Copa América

Opponent  | Win | Draw | Loss | Goals For | Goals Against
Argentina | 9   | 8    | 15   | 38        | 52

resultados_de_brazil[[2]] %>% 
  tail(5) %>% 
  knitr::kable(format = "html", caption = "Brazil vs. Argentina in the Copa América") %>% 
  kableExtra::kable_styling(full_width = FALSE)

Brazil vs. Argentina in the Copa América

Date       | Tournament   | Venue           | Opponent  | Home / Away | Goals For | Goals Against | Result
1993-06-27 | Copa América | Guayaquil       | Argentina | home        | 1 | 1 | Draw
1995-07-17 | Copa América | Rivera          | Argentina | home        | 2 | 2 | Draw
1999-07-11 | Copa América | Ciudad del Este | Argentina | home        | 2 | 1 | Win
2004-07-25 | Copa América | Lima            | Argentina | away        | 2 | 2 | Draw
2007-07-15 | Copa América | Maracaibo       | Argentina | home        | 3 | 0 | Win

Brazil does not have a good overall record vs. Argentina but they have
not lost against their rivals at the Copa América since the 1993 edition
where they lost 5-6 on penalties in the Quarter Finals. The “draw” in
1995 was won on penalties while the “draw” in 2004 was actually in the
final where they won 4-2 on penalties.

What I found odd was that the Copa América seems to be a very low
priority for certain countries, especially Brazil, who have repeatedly
sent their B or C teams to the tournament in favor of sending their best
team to other competitions or resting star players. Funnily enough, these
understrength Brazilian squads have actually won the entire tournament a
few times, most notably in 2007 against a full-strength Argentina side
containing the likes of Zanetti, Riquelme, Cambiasso, Tevez, and a young
Messi and Mascherano!

Player Profiles

After looking at the history of the competition and the composition of
the squads I examined the players and their form coming into the Copa
América. In recent years football analytics has really taken off, and
many strides have been made in creating more informative statistics to
assess players' abilities, the most prominent being the xG statistics.
This is the first time I talk about xG in any length/depth, so this
introduction is as much to solidify my own understanding as yours!

What IS xG?
  • xG: Quantitative measure (between 0 and 1) that assigns a
    probability that a shot taken will result in a goal based on a
    variety of variables and is used for evaluating the quality of
    chances and predicting players’ and teams’ future performances.

  • xA: Quantitative measure (between 0 and 1) that assigns a
    probability that a given pass will result in an assist for a goal
    based on a variety of variables.

Common variables used in the models that output xG statistics are the
distance and angle of a shot, the body part used, and whether it came
from a rebound, among others. It is similar to how you might assess your
favorite striker's chances of scoring just as he is lining up to take a
shot: Is the shot a header? Is he trying to score from a cross in regular
play or from a corner kick? Is there a crowd of defenders in front of him
or is he one-on-one with the goalkeeper? Etc. You might think who is
taking the shot would be a significant factor, but in actuality it tells
you a lot less about the chances of a goal compared to the location of
the shot. (A toy sketch of the kind of model involved is below.)
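To make that concrete, here is a toy sketch of the kind of model involved: a logistic regression on simulated shots. The data and variables are made up, so the fitted coefficients are meaningless; the point is only the shape of an xG model (shot features in, probability of a goal out), not any provider's actual model:

library(dplyr)

set.seed(1)
toy_shots <- tibble(
  goal     = rbinom(500, 1, 0.11),   ## did the shot result in a goal?
  distance = runif(500, 3, 35),      ## distance from goal (metres)
  angle    = runif(500, 5, 90),      ## shooting angle (degrees)
  header   = rbinom(500, 1, 0.2)     ## headed attempt?
)

xg_model <- glm(goal ~ distance + angle + header,
                data = toy_shots, family = binomial())

## each shot's xG = the model's predicted probability that it becomes a goal
toy_shots <- toy_shots %>% mutate(xG = predict(xg_model, type = "response"))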

Note that there isn’t a SINGLE xG model. You can check out a blog post
comparing different people’s xG models
here.
People and organizations (from Statsbomb to OptaPro) have their own
ideas about which variables are important, and as such it's important to
report which source you got your data from, as the stats can differ
between models. A few things xG does not factor in are goalkeeper
performance (someone pulling off incredible saves or letting in a poor
shot), a team's style of play, and the quality of a player's teammates.
When judging players based on these stats it is important to be aware of
contextual factors like the team they play for, their opponent, and the
player's position/role in the team.

From xG and xA, more granular statistics such as xGChain and xGBuildup
were created to dig a little deeper into who is contributing to chance
creation; I'll introduce those two a bit later. As the field has grown,
new statistics have popped up, such as Karun Singh's "expected threat",
or xT. You can check out an introduction to xT here.

Of course, these statistics only tell part of the story and are
definitely not the be-all-and-end-all. In the context of this blog post,
these stats only describe how these players did for their club teams this
past season rather than for their national team. Even so, they give us a
good idea of what kind of form these players are in coming into the
tournament.

You might also want to watch the YouTube videos by TifoFootball for a
quick primer on xG and xA.

understat data

For the data I used the website understat.com. Their xG models were
created by training a neural network on a dataset consisting of over
100,000 shots, using more than 10 different variables. Getting data from
available from Github (he’s also
the guy that made the ghibli color
palette!). I tried to pick a wide selection of attacking players but I
was also limited by the fact that understat only has data for
teams/players from six European leagues (Premier League, Bundesliga,
Serie A, La Liga, Ligue 1, and Russian Premier League).

For Peru I would have chosen Paolo Guerrero but as he plays in
Brazil now I went with Jefferson Farfan (who hasn’t played as many games
as the other players used for comparison unfortunately…). For Chile
I would pick Eduardo Vargas, but as he doesn't play for a team covered by
understat I went with Alexis Sanchez (who had a woeful season and only
played ~600 minutes despite appearing in ~20 league matches) and later
added Arturo Vidal. For Brazil I included Neymar initially, but since
he won’t actually be playing I’ll keep him for comparison’s sake but
also include Gabriel Jesus and Roberto Firmino who have been fighting
for the starting striker spot. Note that these two aren’t the ones
replacing Neymar positionally. In Neymar’s left-wing position I can
see David Neres or Phil Coutinho replacing him (Richarlison and Willian
mostly play on the right). (Edit: In the first match vs. Bolivia, David
Neres started off on the left while Richarlison played on the right,
Coutinho played just behind Bobby Firmino)

The other nation’s strikers/attacking midfielders don’t play for the six
European leagues covered by understat or like in Shinji Okazaki’s case
just did not play as many minutes/games during the season. To get the
data I created a list of the player codes and use purrr::map() to
iterate each through the understatr::get_player_seasons_stats()
function.

player_codes <- c(2097, 2099, 813,   ## Messi, Neymar, Rondon
                  498, 4299, 696,    ## Alexis, Farfan, Falcao
                  3294, 2098, 5543,  ## Cavani, Suarez, G. Jesus
                  482, 1148, 2249,   ## Bobby, Duvan, James
                  1089, 3553, 488,   ## Cuadrado, Di Maria, Coutinho
                  222)               ## Arturo Vidal

understat_data <- player_codes %>% 
  map(., ~ understatr::get_player_seasons_stats(.x)) %>% 
  reduce(bind_rows) %>% 
  select(-player_id, -position, -yellow, -red)

glimpse(understat_data)
## Observations: 83
## Variables: 15
## $ games        34, 36, 34, 33, 38, 17, 20, 30, 34, 33, 32, 36, 38...
## $ goals        36, 34, 37, 26, 43, 15, 19, 13, 24, 22, 11, 7, 8, ...
## $ shots        170, 196, 179, 158, 187, 55, 91, 105, 124, 95, 89,...
## $ time         2704, 2995, 2832, 2726, 3374, 1443, 1797, 2652, 30...
## $ xG           25.997169, 28.946281, 26.885174, 27.101910, 35.891...
## $ assists      13, 12, 9, 16, 18, 7, 13, 11, 12, 7, 7, 3, 2, 2, 0...
## $ xA           15.33516552, 15.10040562, 13.95513140, 15.87127814...
## $ key_passes   93, 87, 79, 77, 95, 43, 70, 91, 102, 52, 32, 23, 2...
## $ year         2018, 2017, 2016, 2015, 2014, 2018, 2017, 2016, 20...
## $ team_name    "Barcelona", "Barcelona", "Barcelona", "Barcelona"...
## $ npg          32, 32, 31, 23, 38, 10, 15, 12, 19, 21, 11, 7, 8, ...
## $ npxG         22.280909, 25.973170, 21.682231, 21.899351, 31.432...
## $ xGChain      38.459877, 48.180634, 42.525045, 41.996866, 54.753...
## $ xGBuildup    10.6987990, 21.6344040, 18.1335122, 15.1963644, 19...
## $ player_name  "Lionel Messi", "Lionel Messi", "Lionel Messi", "L...

As you can see the data consists of a row for each player and each year
(from the 2014/2015 season to the 2018/2019 season). I tried to mitigate
the fact that some players played a lot more minutes than others by
standardizing everything to a 'per 90 minutes' value, but this has its
own disadvantages. For instance, players who play a lot of minutes (as
regular starting members) may not have as high a 'per 90' stat, even
though their production across all those minutes might suggest that they
are consistently performing at a high level.

It’ll be a bit crowded (kind of like a spilt box of Skittles…) but let’s
check out the key metrics for all the players at once.

Note: npg = non-penalty goals, npxG = non-penalty expected goals (xG)

comparison_data <- understat_data %>% filter(year == 2018) %>% select(-games, -team_name, -year) %>% rename(Shots = shots, KP = key_passes) %>% gather(key = "key", value = "value", -player_name, -time) %>% mutate(key = forcats::as_factor(key) %>% forcats::fct_relevel(., "xG", "goals", "npxG", "npg", "xA", "assists", "xGChain", "xGBuildup", "Shots", "KP")) comparison_strikers_plot <- comparison_data %>% filter(key != "Shots", key != "KP", key != "xGBuildup", key != "xGChain") %>% mutate(value = value / time * 90) %>% ggplot(aes(x = key, y = value, fill = player_name)) + geom_jitter(shape = 21, size = 5, color = "black", width = 0.25, stroke = 1.1) + geom_vline(xintercept = 1.5, size = 2) + geom_vline(xintercept = 2.5, size = 2) + geom_vline(xintercept = 3.5, size = 2) + geom_vline(xintercept = 4.5, size = 2) + geom_vline(xintercept = 5.5, size = 2) + coord_flip() + scale_y_continuous(expand = c(0.01, 0.01), limits = c(0, 1.26)) + scale_fill_manual(values = pals::glasbey(16), name = "Player") + labs(title = "Comparison: Top attackers at the Copa América", subtitle = "For select group of attacking players with data available from understat.com", x = NULL, y = "Metric per 90", caption = glue::glue(" data: understat.com 2018-2019 Season")) + theme_copaAmerica(title.size = 18, subtitle = 12, panel.grid.minor.x = element_line(color = "white")) comparison_strikers_plot

As usual in these types of charts, Messi leads a lot of the metrics here,
and shows consistency too, having played the third-highest number of
minutes among the selected players. It's helpful to have the
xG/xA stats next to the actual goals/assists as it provides an
indication of whether the player in question is scoring shots that he
probabilistically should be scoring. When a player’s actual goal count
is higher than their xG stat this suggests that the player is
“exceeding their xG” or that they are scoring from shots that are
historically hard to score from. It can be seen as a marker of an elite
finisher as they are putting away chances from difficult situations
consistently. In terms of assists and xA Alexis Sanchez, who only played
about 600 minutes, looks a lot better than in reality due to the
aforementioned disadvantage of standardizing everything to a “per 90
minutes” value. Normally you would have a cut-off based on a certain
minimum amount of minutes but as I mentioned I was rather limited in
my choice of players.
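A quick way to see this over/under-performance in table form, using the columns from the glimpse() output above and the same per-90 standardization (a sketch rather than anything definitive, given the minutes caveat just mentioned):

understat_data %>% 
  filter(year == 2018) %>% 
  mutate(goals_p90   = goals / time * 90,
         xG_p90      = xG / time * 90,
         xG_diff_p90 = goals_p90 - xG_p90) %>%   ## positive = scoring above expectation
  select(player_name, time, goals_p90, xG_p90, xG_diff_p90) %>% 
  arrange(desc(xG_diff_p90))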

A way to take a closer look at xG vs. goals and xA vs. assists is to use
a simple dot plot with a 45-degree reference line. Those
below the line are underperforming relative to their xG or xA stat,
those over it are overachieving (“exceeding” their xG/xA stat) while
those just on the line are scoring or assisting right around what the
model expects the player to be. I use non-penalty xG below as penalties
have around ~0.75 xG (give or take a few percentage points depending on
the model) and can inflate the stats of those players who take a lot of
penalties and score them, especially if they weren’t the ones who earned
the penalty themselves.

expected_goal_plot <- understat_data %>% filter(year == 2018) %>% select(player_name, time, npxG, xG, goals) %>% mutate_at(c("npxG", "xG", "goals"), ~. / time * 90) %>% ggplot(aes(x = npxG, y = goals, fill = player_name)) + geom_abline(intercept = 0, slope = 1, color = "white", size = 1.1) + geom_point(shape = 21, size = 5, color = "black", stroke = 1.1) + scale_x_continuous(limits = c(0, 1.1), expand = c(0, 0)) + scale_y_continuous(limits = c(0, 1.3), expand = c(0, 0)) + scale_fill_manual(values = pals::glasbey(16), name = "Player") + labs(title = "Expected vs. Actual Goals", subtitle = "For select group of attacking players with data available from understat.com", x = "Non-penalty xG per 90 minutes", y = "Goals per 90 minutes", caption = glue::glue(" data: understat.com 2018-2019 Season")) + theme_copaAmerica(panel.grid.minor.x = element_line(color = "white"), panel.grid.minor.y = element_line(color = "white"), subtitle.size = 11) expected_goal_plot

Gabriel Jesus is quite clearly below the 45 degree line meaning that he
has been very poor at finishing chances (and/or incredibly unlucky).
After a poor World Cup where he scored 0 goals as a starter, he is
really going to have to step up to fill Neymar’s goalscoring boots for
this tournament. However, his build-up play for City has still been good
this past season and he has been scoring for Brazil in the friendlies
leading up to the tournament, so it's going to be a tough call for Tite
on who starts against Bolivia (edit: Firmino started and
contributed an assist while Jesus replaced him in the 65th minute).

expected_assists_plot <- understat_data %>% filter(year == 2018) %>% select(player_name, time, xA, assists) %>% mutate_at(c("xA", "assists"), ~. / time * 90) %>% ggplot(aes(x = xA, y = assists, fill = player_name)) + geom_abline(intercept = 0, slope = 1, color = "white", size = 1.1) + geom_point(shape = 21, size = 5, color = "black", stroke = 1.1) + labs(title = "Expected vs. Actual Assists", subtitle = "For select group of attacking players with data available from understat.com", x = "xA per 90 minutes", y = "Assists per 90 minutes", caption = glue::glue(" data: understat.com 2018-2019 Season")) + scale_x_continuous(limits = c(0, 0.55), expand = c(0, 0)) + scale_y_continuous(limits = c(0, 0.55), expand = c(0, 0)) + scale_fill_manual(values = pals::glasbey(16), name = "Player") + theme_copaAmerica(panel.grid.minor.x = element_line(color = "white"), panel.grid.minor.y = element_line(color = "white"), subtitle.size = 11) expected_assists_plot

One thing to keep in mind is that xA does not take into account the
recipient of the assist pass. Even if the pass given had a high expected
assist value the receiving player still might not have the quality to
put it away through no fault of the passer. This might explain why most
of the players with a higher xA among this group don’t have the assists
to match. It could also be that these players are the ones playing a
lot more minutes, and the volume of chances they create just isn't
translating into goals all the time.

Key Passes, Shots, xGChain, and xGBuildup (per 90)

I separated “key passes” and “shots” as well as “xGChain” and
“xGBuildup” from the rest as these two sets were on a very different
scale.

kp_shots_plot <- comparison_data %>% filter(key == "Shots" | key == "KP") %>% mutate(value = value / time * 90) %>% ggplot(aes(x = key, y = value, fill = player_name)) + geom_jitter(shape = 21, size = 5, color = "black", width = 0.25, stroke = 1.1) + coord_flip() + scale_y_continuous(expand = c(0.01, 0.01), limits = c(0, 6), breaks = c(0, 1, 2, 3, 4, 5, 6), labels = c(0, 1, 2, 3, 4, 5, 6)) + scale_fill_manual(values = pals::glasbey(17), name = "Player") + geom_vline(xintercept = 1.5, size = 2) + labs(title = "Comparison: Stars of the Copa América", subtitle = glue(" KP = Key Passes For select group of attacking players with data available from understat.com"), x = NULL, y = "Metric per 90", caption = glue::glue(" data: understat.com 2018-2019 Season")) + theme_copaAmerica(title.size = 18, subtitle.size = 10, panel.grid.minor.x = element_line(color = "white")) kp_shots_plot

  • xGChain: Quantitative measure that is the combined sum of the xG of
    every possession that ends in a shot that a player is involved in.
    The same derived value is given to each of the players involved and
    allows us to credit players for attacking contributions outside of
    just shots (xG) and assists (xA).

  • xGBuildup: Similar to xGChain but excluding shots and assists. This
    is in response to xGChain values still being dominated by the xG and
    xA from shots and assists, respectively. (A toy illustration of both
    follows below.)
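As a toy illustration of how those two definitions differ (the possessions, players, and xG values below are made up):

library(dplyr)

toy_chain <- tribble(
  ~possession, ~player,    ~shot_xG, ~shot_or_assist,
  1,           "Player A", 0.25,     TRUE,
  1,           "Player B", 0.25,     FALSE,
  2,           "Player A", 0.10,     FALSE,
  2,           "Player C", 0.10,     TRUE
)

toy_chain %>% 
  group_by(player) %>% 
  summarize(xGChain   = sum(shot_xG),                   ## credited for every possession they were involved in
            xGBuildup = sum(shot_xG[!shot_or_assist]))  ## ...excluding possessions where they shot or assisted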

xgbuildup_xgchain_plot <- comparison_data %>% filter(key == "xGBuildup" | key == "xGChain") %>% mutate(value = value / time * 90) %>% ggplot(aes(x = key, y = value, fill = player_name)) + geom_jitter(shape = 21, size = 5, color = "black", width = 0.25, stroke = 1.1) + coord_flip() + scale_y_continuous(expand = c(0.01, 0.01), limits = c(0, 1.55), breaks = c(0, 0.25, 0.5, 0.75, 1, 1.25, 1.5), labels = c(0, 0.25, 0.5, 0.75, 1, 1.25, 1.5)) + scale_fill_manual(values = pals::glasbey(17), name = "Player") + geom_vline(xintercept = 1.5, size = 2) + labs(title = "Comparison: Stars of the Copa América", subtitle = "For select group of attacking players with data available from understat.com", x = NULL, y = "Metric per 90", caption = glue::glue(" data: understat.com 2018-2019 Season")) + theme_copaAmerica(title.size = 18, subtitle.size = 10, panel.grid.minor.x = element_line(color = "white")) xgbuildup_xgchain_plot

Although Gabriel Jesus has been poor at finishing his chances as seen in
previous graphs, his xGChain and xGBuildup stats make it clear that he
is still contributing to City’s attack outside of scoring goals himself
(not to mention all the defensive work he does as well).

In the example below, the stats clearly differentiate James, who is more
of a playmaker, from Falcao and Duvan, who are traditional number 9s,
through his superior xGBuildup, xGChain, and Key Passes values. For a
more detailed overview of xGChain and xGBuildup
check out Statsbomb’s article
here.

## keep colors for Colombians consistent with other plots colombia_pal <- c("#000033", "#005300", "#009FFF", "#00FFBE") comparison_colombia_plot <- comparison_data %>% filter(!key %in% c("xG", "goals", "npxG", "npg", "xA", "assists"), player_name %in% c("James Rodríguez", "Falcao", "Duván Zapata", "Juan Cuadrado")) %>% mutate(value = value / time * 90) %>% ggplot(aes(x = key, y = value, fill = player_name)) + geom_point(shape = 21, size = 5, color = "black", stroke = 1.1) + geom_vline(xintercept = 1.5, size = 2) + geom_vline(xintercept = 2.5, size = 2) + geom_vline(xintercept = 3.5, size = 2) + coord_flip() + scale_y_continuous(expand = c(0.05, 0.05), breaks = c(0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4), labels = c(0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4)) + scale_fill_manual(values = colombia_pal, name = "Player") + labs(title = "Comparison: Stars of Colombia", subtitle = "KP: Key Passes", x = NULL, y = "Metric per 90", caption = glue::glue(" data: understat.com 2018-2019 Season")) + theme_copaAmerica(title.size = 20, subtitle.size = 12, panel.grid.minor.x = element_line(color = "white")) comparison_colombia_plot

There are lots of different ways to visualize this data, most famously
the radar charts created by Statsbomb. Using
the data you can also compare and evaluate players in many different
ways, and using the understatr package you could add a few more players
like Paulo Dybala, Miguel Almiron, and more! I could probably do a whole
other article using just this data, but I'll leave it here for now.

Conclusion

Throughout this blog post I talked about some of the historical records,
squad compositions, match records, and finally the player profiles of
attacking players at this summer’s Copa América. Using the power of R it
is really easy to web-scrape and visualize data in a way that is
informative and aesthetically pleasing. I wanted to finish this before
the tournament started but other life things got in the way as well as
the fact that the amount of content ballooned out of control (especially
the xG section) so I had to cut down a lot.

It’s been fun reading the articles on the Copa América website and
seeing how far my “intermediate-but-very-out-of-practice”-level of
Spanish can get me to understand the content, here is one that I
particularly liked reading: 14 Estadisticas de la Copa
América
.
With so many tournaments going on right now (and with the African Cup of
Nations starting in a few days) a lot of the news media is spread thin
right now but there are still some quality articles out there to read
about the Copa, like this
article
from BBC’s South
American football expert, Tim Vickery.

After the first round of games, a few points of discussion:

  • After an extremely lacklustre performance vs. Colombia, how does
    Argentina bounce back? What tactical changes need to be made?

  • Qatar impressed against Paraguay but can they pull off a major upset
    vs. Colombia?

  • How will Japan line up against Uruguay after losing by a scoreline
    that didn't really do their performance justice? How will manager
    Moriyasu balance experience vs. youth: will he start veteran Okazaki
    after Ueda's numerous misses vs. Chile? How will Japan deal with their
    glaring weaknesses at the fullback position? Will Yuta Nakayama keep
    his place after an awful performance?

  • Can Brazil earn an early ticket to the next round vs. Venezuela after
    a clinical but not excellent performance vs. Bolivia? Will they be
    able to keep up their streak of winning the Copa every time they
    have hosted it?

Thanks for reading and…

¡Buena suerte a todos los equipos! (Good luck to all the teams!)


To leave a comment for the author, please follow the link and comment on their blog: R by R(yo). R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Le Monde puzzle [#1104]

Tue, 06/18/2019 - 00:19

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

A palindromic Le Monde mathematical puzzle:

In a monetary system where all palindromic amounts between 1 and 10⁸ have a coin, find the numbers less than 10³ that cannot be paid with less than three coins. Find if 20,191,104 can be paid with two coins. Similarly, find if 11,042,019 can be paid with two or three coins.

Which can be solved in a few lines of R code:

coin = sort(c(1:9, (1:9)*11, outer(1:9*101, (0:9)*10, "+")))
amounz = sort(unique(c(coin, as.vector(outer(coin, coin, "+")))))
amounz = amounz[amounz < 1e3]

and produces 10 amounts that cannot be paid with one or two coins. It is also easy to check that three coins are enough to cover all amounts below 10³. For the second question, starting with n¹=20,188,102, a simple downward search of palindromic pairs (n¹,n²) such that n¹+n²=20,191,104 led to n¹=16,755,761 and n²=3,435,343. And starting with 11,033,011, the same search does not produce any solution, while there are three coins such that n¹+n²+n³=11,042,019, for instance n¹=11,022,011, n²=20,002, and n³=6.
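For completeness, a quick way to verify that three coins do cover everything below 10³, reusing the coin vector above:

twosum   <- sort(unique(c(coin, as.vector(outer(coin, coin, "+")))))
threesum <- sort(unique(c(twosum, as.vector(outer(twosum, coin, "+")))))
setdiff(1:999, threesum[threesum < 1e3])  ## integer(0) means no gaps below 10^3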


To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

On my way to Manizales (Colombia)

Mon, 06/17/2019 - 04:31

(This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers)

Next week, I will be in Manizales, Colombia, for the Third International Congress on Actuarial Science and Quantitative Finance. I will be giving a lecture on Wednesday with Jed Fress and Emilianos Valdez.

I will give my course on Algorithms for Predictive Modeling on Thursday morning (after Jed and Emil's lectures). Unfortunately, my computer locked itself last week and I could not unlock it (nor could the IT team at the university, who have the internal EFI password). So I will not be able to work further on the slides, and the course will be based on the version as it stands now (clearly in progress).

 


To leave a comment for the author, please follow the link and comment on their blog: R-english – Freakonometrics. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Forecasting tools in development

Mon, 06/17/2019 - 02:00

(This article was first published on - R, and kindly contributed to R-bloggers)

As I’ve been writing up a progress report for my NIGMS R35 MIRA award, I’ve been reminded of how much of the work that we’ve been doing is focused on forecasting infrastructure. A common theme in the Reich Lab is making operational forecasts of infectious disease outbreaks. The operational aspect means that we focus on everything from developing and adapting statistical methods to be used in forecasting applications to thinking about the data science toolkit that you need to store, evaluate, and visualize forecasts. To that end, in addition to working closely with the CDC in their FluSight initiative, we’ve been doing a lot of collaborative work on new R packages and data repositories that I hope will be useful beyond the confines of our lab. Some of these projects are fully operational, used in our production flu forecasts for CDC, and some have even gone through a level of code peer review. Others are in earlier stages of development. My hope is that in putting this list out there (see below the fold) we will generate some interest (and possibly find some new open-source collaborators) for these projects.



Here is a partial list of in-progress software that we’ve been working on:

  • sarimaTD is an R package that serves as a wrapper to some of the ARIMA modeling functionality in the forecast R package. We found ourselves consistently specifying transformations (T) and differencing (D) in particular ways that work well for modeling infectious disease time-series data, so we made it easy for ourselves and others to reuse those specifications (a rough sketch of this style of wrapper appears after this list).
  • ForecastFramework is an R package that we have collaborated on with our colleagues at the Johns Hopkins Infectious Disease Dynamics lab. We’ve blogged about this before, and we see a lot of potential in this object-oriented framework for standardizing both how datasets are specified/accessed and how models are generated. That said, there is still a long way to go to document this infrastructure and make it usable by a wide audience. The most success I’ve had with it so far was having PhD students write forecast models for a seminar I taught this spring: I used a single script that could run and score forecasts from each model, with a very simple plug-and-play interface to the models because they had been specified appropriately.
  • Zoltar is a new repository (in alpha-ish release right now) for forecasts that we have been working on over the last year. It was initially designed with our CDC flu forecast use case in mind, although the forecast structure is quite general, and with predx integration on the way (see the next bullet) we hope it will broaden the scope of possible use cases for Zoltar. To help facilitate our own and others’ use of Zoltar, we are working on two interfaces to the Zoltar API: zoltpy for Python and zoltr for R (a minimal zoltr connection sketch appears after this list). Check out the documentation, especially for zoltr. There is quite a bit of data available!
  • predx is an R package designed by my colleague and friend Michael Johansson of the US Centers for Disease Control and Prevention and OutbreakScience. Evan Ray, from the Reich Lab team, has contributed to it as well. The goal of predx is to define some general classes of data for both probabilistic and point forecasts, to better standardize the ways we might want to store and operate on these data.
  • d3-foresight is the main engine behind our interactive forecast visualizations for flu in the US. We have also integrated it with Zoltar, so that you can view forecasts stored in Zoltar (note, kind of a long load time for that last link) using some of the basic d3-foresight functionality.
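To make the sarimaTD idea concrete, here is a rough sketch of the kind of transform-then-seasonally-model wrapper it provides, written directly against the forecast package; the function name and arguments below are my own illustration, not the sarimaTD interface itself.

library(forecast)

# Illustrative wrapper in the spirit of sarimaTD (not its actual API):
# optionally transform the series, then let auto.arima() choose the
# (seasonal) ARIMA order, including any differencing.
fit_transformed_sarima <- function(y, frequency = 52,
                                   transformation = c("log", "none")) {
  transformation <- match.arg(transformation)
  y <- ts(y, frequency = frequency)             # e.g. weekly incidence counts
  if (transformation == "log") y <- log(y + 1)  # stabilize variance; +1 guards zeros
  auto.arima(y, seasonal = TRUE, stepwise = TRUE)
}

# Example: a noisy seasonal series, forecast 4 weeks ahead
set.seed(1)
y <- 50 + 30 * sin(2 * pi * (1:260) / 52) + rpois(260, 10)
fit <- fit_transformed_sarima(y)
fc <- forecast(fit, h = 4)
fc$mean   # point forecasts (still on the log(y + 1) scale here)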

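For Zoltar itself, a minimal connection sketch with zoltr is below; the function names and authentication flow follow my reading of the zoltr getting-started documentation and should be treated as assumptions to check against the current package docs (the environment variables are placeholders for your own account credentials).

library(zoltr)

# Connect to the Zoltar server, authenticate, and list available projects.
zoltar_connection <- new_connection()
zoltar_authenticate(zoltar_connection,
                    Sys.getenv("Z_USERNAME"),   # placeholder credentials
                    Sys.getenv("Z_PASSWORD"))
the_projects <- projects(zoltar_connection)     # data frame of projects on the server
head(the_projects)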
The lion’s share of the credit for all of the above is due to some combination of Matthew Cornell, Abhinav Tushar, Katie House, and Evan Ray.


To leave a comment for the author, please follow the link and comment on their blog: - R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


modelDown is now on CRAN!

Sun, 06/16/2019 - 15:33

(This article was first published on English – SmarterPoland.pl, and kindly contributed to R-bloggers)


The modelDown package turns classification or regression models into static HTML websites. With one command you can convert one or more models into a website with visual and tabular model summaries: model performance, feature importance, single-feature response profiles, and basic model audits.

modelDown uses DALEX explainers, so it is model agnostic (feel free to combine a random forest with a glm), easy to extend, and easy to parameterise.

Here you can browse an example website automatically created for four classification models (random forest, gradient boosting, support vector machines, k-nearest neighbours). The R code behind this example is here, and a minimal sketch of the workflow follows.
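As an illustration of the one-command workflow, here is a minimal sketch that builds two DALEX explainers (here for simple regression models on a built-in dataset) and hands them to modelDown; the modelDown() call pattern follows my reading of the package README, so treat it as an assumption to verify against the current documentation.

library(DALEX)
library(randomForest)
library(modelDown)

# Two models of the same outcome, to be compared on one website
model_lm <- lm(mpg ~ ., data = mtcars)
model_rf <- randomForest(mpg ~ ., data = mtcars)

# DALEX explainers are what make the workflow model agnostic
explainer_lm <- explain(model_lm, data = mtcars[, -1], y = mtcars$mpg, label = "lm")
explainer_rf <- explain(model_rf, data = mtcars[, -1], y = mtcars$mpg, label = "random forest")

# One command: generate the static HTML website from the explainers
modelDown(explainer_lm, explainer_rf)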

Fun facts:

  • archivist hooks are generated for every documented object, so you can easily extract R objects from the HTML website. Try:

archivist::aread("MI2DataLab/modelDown_example/docs/repository/574defd6a96ecf7e5a4026699971b1d7")

  • Session info is automatically recorded, so you can check which package versions were available at model development time (https://github.com/MI2DataLab/modelDown_example/blob/master/docs/session_info/session_info.txt). A one-line way to record the same thing yourself is sketched after this list.

  • The package was initially created by Magda Tatarynowicz, Kamil Romaszko, and Mateusz Urbański from Warsaw University of Technology as a student project.
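As mentioned in the session-info item above, capturing the same kind of snapshot yourself takes only base R:

# Write the current R session details (R version, attached packages and
# their versions) to a plain-text file, similar in spirit to the file
# modelDown records automatically.
writeLines(capture.output(sessionInfo()), "session_info.txt")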


To leave a comment for the author, please follow the link and comment on their blog: English – SmarterPoland.pl. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

