R-bloggers: R news and tutorials contributed by (750) R bloggers

linl 0.0.3: Micro release

Sun, 12/16/2018 - 00:12

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Our linl package for writing LaTeX letters with (R)markdown had a fairly minor release today, following up on the previous release well over a year ago. This version contains just one change, which Mark van der Loo provided a few months ago in a clean PR. As another user was recently bitten by the same issue when using an included letterhead – which was fixed but unreleased – we decided it was time for a release. So there it is.

linl makes it easy to write letters in markdown, with some extra bells and whistles thanks to some cleverness chiefly by Aaron.
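
For readers new to the package: a letter is simply an R Markdown document whose YAML header selects the linl output format. A minimal sketch (the names and body text are placeholders; the package skeleton adds further YAML fields for letterhead, addresses and signature):

---
output: linl::linl
---

Dear Recipient,

The body of the letter is written in plain Markdown.

Sincerely,
Jane Doe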

Here is a screenshot of the vignette showing the simple input for some moderately fancy output:

The NEWS entry follows:

Changes in linl version 0.0.3 (2018-12-15)
  • Correct LaTeX double loading of package color with different options (Mark van der Loo in #18 fixing #17).

Courtesy of CRANberries, there is a comparison to the previous release. More information is on the linl page. For questions or comments use the issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


Request for comments on planned features for futile.logger 1.5

Sat, 12/15/2018 - 20:06

(This article was first published on R – Cartesian Faith, and kindly contributed to R-bloggers)

I will be pushing a new version of futile.logger (version 1.5) to CRAN in January. This version introduces a number of enhancements and fixes some bugs. It will also contain at least one breaking change. I am making the release process public, since the package is now used in a number of other packages. If you use futile.logger, this is your opportunity to influence the direction of the package and prepare for changes. Please use the github issue tracker for discussion.

There are two main themes for the enhancements: 1) integration with R's signal (condition) system, and 2) support for multiple appenders with different thresholds.

Hijacking the signal system

Currently, futile.logger is unaware of the signal system in R. The only tie-in is ftry, which catches a warning or error and prints a message. Unfortunately, its behavior differs from that of try. This release will make ftry consistent with try. This is a breaking change: if your code relies on ftry halting processing on an error, it will no longer do so.

In addition to this fix, futile.logger will now have better integration with the signal system. Currently, if a function emits a warning or an error, these are printed to the screen. It would be convenient if futile.logger could capture these signals and associate them with the correct log levels, i.e. warning to WARN and error to ERROR.

My proposal is to create a hijack_signal_handlers function to override the existing signal handlers. This function can be called at the top of a script or within the .onLoad function of a package. Once called, any warnings or errors would be captured and handled by futile.logger. The skeleton would look like this, giving granular control over whether to hijack just warnings, just errors, or both:

hijack_signal_handlers <- function(warning = TRUE, error = TRUE) {
  if (warning) {
    # Override warning handler
  }
  if (error) {
    # Override error handler
  }
}
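
As a purely illustrative sketch (not necessarily how futile.logger will implement it, and assuming R >= 4.0.0 for globalCallingHandlers(), which postdates this post), the overrides could be wired up like so:

# illustrative only: assumes R >= 4.0.0 for globalCallingHandlers()
hijack_signal_handlers <- function(warning = TRUE, error = TRUE) {
  if (warning) {
    globalCallingHandlers(warning = function(w) {
      futile.logger::flog.warn(conditionMessage(w))
      invokeRestart("muffleWarning")  # suppress the default warning output
    })
  }
  if (error) {
    globalCallingHandlers(error = function(e) {
      # log the message, then let the error propagate as usual
      futile.logger::flog.error(conditionMessage(e))
    })
  }
}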

One issue I see with this function is when used in a package’s .onLoad. Suppose a user requires package A and B. These don’t use futile.logger. Now the user requires package C, which calls hijack_signal_handlers in its .onLoad. When this occurs, warnings and errors emitted from packages A and B would also be captured by futile.logger. From my perspective, this is probably a good thing, but I can appreciate why others may not want this behavior. 

Emitting signals

The other half of the signal handler puzzle is being able to emit signals from futile.logger. For this case, we want flog.warn to emit a warning signal and flog.error to emit an error signal. One signature looks like

flog.warn(message, emit=TRUE)

meaning that by default no signal is emitted. This would work for flog.warn, flog.error, and flog.fatal (could map to error as well).
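
As a hypothetical illustration of the proposed behavior (this wrapper is not part of the current futile.logger API):

# sketch of what emit = TRUE could do; flog.warn() itself does not yet take emit
log_warn_emit <- function(msg, emit = FALSE) {
  futile.logger::flog.warn(msg)
  if (emit) warning(msg, call. = FALSE)  # also raise a warning condition
}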

I have mixed feelings about this use case. Part of me says that if futile.logger hijacks the signal system, you can just call stop and futile.logger will catch it. On the other hand, that seems roundabout and less expressive than writing flog.error(message, emit=TRUE) or the like.

My main concern is what happens when these two systems are used together: will they play nicely, or will an error be emitted twice (which seems likely)? If the latter, more logic has to be built in, which ultimately adds complexity and wastes cycles. Any input here is encouraged!


Six Sigma DMAIC Series in R – Part4

Sat, 12/15/2018 - 16:34

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

    Categories

    1. Basic Statistics

    Tags

    1. Data Visualisation
    2. Linear Regression
    3. R Programming

    Hope you liked Part 1, Part 2 and Part 3 of this series. In this Part 4, we will go through the tools used during the Improve phase of the Six Sigma DMAIC cycle. The most representative tool used during the Improve phase is DOE (Design of Experiments). Proper use of DOE can lead to process improvement, but a badly designed experiment can lead to undesired results: inefficiency and higher costs.

    What is an experiment?

    An experiment is a test or series of tests in which purposeful changes are made to input variables of a process or system so that we may observe and identify the reason for the change in output response.
    Consider the example of the simple process model, shown below, where we have controlled input factors X’s, output Y and uncontrolled factors Z’s.

    The objective of the experiment can be:

    1. Process Characterization: To know the relationship between X, Y & Z.
    2. Process Control: Capture changes in X, so that Y is always near the desired nominal value.
    3. Process Optimization: Capture changes in X, so that the variability of Y is minimized.
    4. Robust Design: Capture changes in X, so that effect of uncontrolled Z is minimized.
    Importance of experiments

    Experiments allow us to control the values of the Xs of the process and then measure the values of the Ys to discover which values of the independent variables improve the performance of our process. In contrast, in observational studies we have no influence on the variables we are measuring; we just collect the data and use the appropriate statistical techniques.
    There are some risks when the analysis is based on data gathered directly during the normal operation of the process: inconsistent data, a limited range of variable values (the behavior of the X's outside the observed range is unknown), and correlated variables.

    Characteristics of well-planned experiments

    Some of the Characteristics of well-planned experiments are:

    1. The degree of precision:
      The probability should be high that the experiment can measure differences with the degree of precision the experimenter desires. This implies an appropriate design and sufficient replication.
    2. Simplicity:
      The experiment should be as simple as possible while remaining consistent with its objectives.
    3. The absence of systematic error:
      Units receiving one treatment should not differ in any systematic way from those receiving another treatment.
    4. The range of validity of conclusions:
      Replicating experiments in time and space increases the range of validity of the conclusions.
    5. Calculation of the degree of uncertainty:
      The possibility of obtaining the observed results by chance alone should be quantifiable.
    Three basic principles of experiments

    The three basic principles of experiments are randomization, replication and blocking.

    Let's understand this through an example. A food manufacturer is searching for the best recipe for its main product: a pizza dough. The managers decided to perform an experiment to determine the optimal levels of the three main ingredients in the pizza dough: flour, salt, and baking powder. The other ingredients are fixed, as they do not affect the flavor of the final cooked pizza. The flavor of the product will be determined by a panel of experts who will give a score to each recipe. Therefore, we have three factors that we will call flour, salt, and baking powder (bakPow), with two levels each (− and +).

    pizzaDesign <- expand.grid(flour  = gl(2, 1, labels = c("-", "+")),
                               salt   = gl(2, 1, labels = c("-", "+")),
                               bakPow = gl(2, 1, labels = c("-", "+")),
                               score  = NA)

    Now we have eight different experiments (recipes), covering all the possible combinations of the three factors at two levels.
    When we have more than two factors, combinations of levels of different factors may affect the response. Therefore, to discover the main effects and the interactions, we should vary more than one level at a time, performing experiments in all the possible combinations.
    The reason why two-level factorial experiments are widely used is that, as the number of factor levels increases, the cost of the experiment increases. To study the variation under the same experimental conditions, replication is needed, i.e. more than one trial per factor combination. The number of replications depends on several aspects (e.g. budget).

    Once an experiment has been designed, we will proceed with its randomization.

    pizzaDesign$ord <- sample(1:8, 8)
    pizzaDesign[order(pizzaDesign$ord), ]

    Each time you repeat the command you get a different order due to randomization.

    2^k factorial Designs

    2^k factorial designs are those in which k factors are studied, each at 2 levels. The number of runs needed for one complete replication is precisely 2^k. If we want n replications of the experiment, the total number of runs is n × 2^k.
    ANOVA can be used to estimate the effect of each factor and interaction and to assess which of these effects are significant.
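
    For the pizza example below, k = 3 and n = 2 replications, so the run count works out as:

    k <- 3
    n <- 2
    2^k      # 8 recipes in one complete replication
    n * 2^k  # 16 runs in total, matching the 16 scores recorded below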

    Example contd.: The experiment is carried out by preparing the pizzas at the factory following the package instructions, namely: “bake the pizza for 9 min in an oven at 180 °C.”

    After a blind trial, the experts gave a score to each of the eight (2^3) recipes in each replication of the experiment.

    ss.data.doe1 <- data.frame(repl = rep(1:2, each = 8),
                               rbind(pizzaDesign[, -6], pizzaDesign[, -6]))
    ss.data.doe1$score <- c(5.33, 6.99, 4.23, 6.61, 2.26, 5.75, 3.26, 6.24,
                            5.7, 7.71, 5.13, 6.76, 2.79, 4.57, 2.48, 6.18)

    The average for each recipe can be calculated as below:

    aggregate(score ~ flour + salt + bakPow, FUN = mean, data = ss.data.doe1)

    The best recipe seems to be the one with a high level of flour and a low level of salt and baking powder. Fit a linear model and perform an ANOVA to find the significant effects.

    doe.model1 <- lm(score ~ flour + salt + bakPow +
                       flour * salt + flour * bakPow + salt * bakPow +
                       flour * salt * bakPow,
                     data = ss.data.doe1)
    summary(doe.model1)

    The p-values show that the main effects of the ingredients flour and baking powder are significant, while the effect of salt is not. Neither the two-way nor the three-way interactions among the ingredients are significant. Thus, we can simplify the model by excluding the non-significant effects. The new model with only the significant effects is:

    doe.model2 <- lm(score ~ flour + bakPow, data = ss.data.doe1)
    summary(doe.model2)

    Therefore, the statistical model for our experiment is

    score = 4.8306 + 2.4538 × flour − 1.8662 × bakPow

    Thus, the recipe with a high level of flour and a low level of baking powder will be the best one, regardless of the level of salt (high or low). The estimated score for this recipe is

    `4.8306 + 2.4538 × 1 + (−1.8662) × 0 = 7.284`

    The predict function can be used to get the estimates for all the experimental conditions.

    predict(doe.model2)
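
    To score a single recipe explicitly, newdata can be passed to predict; the factor levels below assume the same "-"/"+" coding used in the design above:

    # estimated score for high flour, low baking powder (salt does not enter this model)
    best <- data.frame(flour  = factor("+", levels = c("-", "+")),
                       bakPow = factor("-", levels = c("-", "+")))
    predict(doe.model2, newdata = best)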

    Visualize Effect Plot and Interaction plot

    The ggplot2 package can be used to visualize the effect plot. The effect of flour is positive, while the effect of baking powder is negative.

    library(ggplot2)
    prinEf <- data.frame(Factor = rep(c("A_Flour", "C_Baking Powder"), each = 2),
                         Level = rep(c(-1, 1), 2),
                         Score = c(aggregate(score ~ flour, FUN = mean, data = ss.data.doe1)[, 2],
                                   aggregate(score ~ bakPow, FUN = mean, data = ss.data.doe1)[, 2]))
    p <- ggplot(prinEf, aes(x = Level, y = Score)) +
      geom_point() +
      geom_line() +
      geom_hline(yintercept = mean(ss.data.doe1$score), linetype = "dashed", color = "blue") +
      scale_x_continuous(breaks = c(-1, 1)) +
      facet_grid(. ~ Factor) +
      ggtitle("Plot of Factor Effects")
    print(p)

    The interaction plot is shown below. The fact that the lines do not cross indicates that there is no interaction between the plotted factors.

    intEf <- aggregate(score ~ flour + bakPow, FUN = mean, data = ss.data.doe1)
    q <- ggplot(intEf, aes(x = flour, y = score, color = bakPow)) +
      geom_point() +
      geom_line(aes(group = bakPow)) +
      geom_hline(yintercept = mean(ss.data.doe1$score), linetype = "dashed", color = "blue") +
      ggtitle("Interaction Plot")
    print(q)

    The normality of the residuals can be checked with the Shapiro-Wilk test. As the p-value is large, we fail to reject the null hypothesis that the residuals are normally distributed.

    shapiro.test(residuals(doe.model2))

    This was a brief introduction to DOE in R.
    In the next part, we will go through the Control phase of the Six Sigma DMAIC process. Please let me know your feedback in the comments section. Make sure to like and share it. Happy learning!


    Day 15 – little helper sci_palette

    Sat, 12/15/2018 - 08:00

    (This article was first published on r-bloggers – STATWORX, and kindly contributed to R-bloggers)

    We at STATWORX work a lot with R and we often use the same little helper functions within our projects. These functions ease our daily work life by reducing repetitive code parts or by creating overviews of our projects. At first, there was no plan to make a package, but I soon realised that it would be much easier to share and improve those functions if they were within a package. Up until the 24th of December I will present one function from helfRlein each day. So, on the 15th day of Christmas my true love gave to me…

    What can it do?

    This little helper returns a set of colors which we often use at STATWORX. So, if – like me – you cannot remember each hex color code you need, this might help. Of course these are our colours, but you could rewrite it with your own palette. The main benefit, though, is the plotting method – so you can see the color instead of only reading the hex code.
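
    A minimal sketch of how one could build a personal palette in the same spirit (the class name and plot method below are made up for illustration and are not helfRlein internals):

    my_palette <- function() {
      pal <- c(main = "#1B9E77", accent = "#D95F02", text = "#666666")
      class(pal) <- "my_pal"
      pal
    }

    # simple plot method: one coloured bar per entry, labelled with its hex code
    plot.my_pal <- function(x, ...) {
      barplot(rep(1, length(x)), col = unclass(x), border = NA, axes = FALSE,
              names.arg = paste(names(x), x, sep = "\n"), ...)
    }

    plot(my_palette())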

    How to use it?

    To see which hex code corresponds to which colour, and what each colour is used for:

    sci_palette()
       main_color accent_color_1 accent_color_2 accent_color_3      highlight          black
        "#013848"      "#0085AF"      "#00A378"      "#09557F"      "#FF8000"      "#000000"
             text         grey_2     light_gray        special
        "#696969"      "#D9D9D9"      "#F8F8F8"      "#C62F4B"
    attr(,"class")
    [1] "sci"

    As mentioned above, there is a plot() method which gives the following picture.

    plot(sci_palette())

    Overview

    To see all the other functions you can either check out our GitHub or you can read about them here.

    Have a merry advent season!

    About the author

    Jakob Gepp

    Numbers were always my passion, and as a data scientist and statistician at STATWORX I can fulfill my nerdy needs. I am also responsible for our blog. So if you have any questions or suggestions, just send me an email!

    ABOUT US

    STATWORX
    is a consulting company for data science, statistics, machine learning and artificial intelligence located in Frankfurt, Zurich and Vienna. Sign up for our NEWSLETTER and receive reads and treats from the world of data science and AI. 


    The post Day 15 – little helper sci_palette appeared first on STATWORX.


    Manipulate dates easily with {lubridate}

    Sat, 12/15/2018 - 01:00

    (This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)


    This blog post is an excerpt of my ebook Modern R with the tidyverse that you can read for
    free here. This is taken from Chapter 5, which presents
    the {tidyverse} packages and how to use them to compute descriptive statistics and manipulate data.
    In the text below, I scrape a table from Wikipedia, which shows when African countries gained
    independence from other countries. Then, using {lubridate} functions, I show you how you can
    answer questions such as Which countries gained independence before 1960?

    Set-up: scraping some data from Wikipedia

    {lubridate} is yet another tidyverse package that makes dealing with dates or duration data
    (and intervals) as painless as possible. I do not use every function contained in the package
    daily, and as such will only focus on some of the functions. However, if you have to deal with
    dates often, you might want to explore the package thoroughly.

    Let’s get some data from a Wikipedia table:

    library(tidyverse)
    library(rvest)

    page <- read_html("https://en.wikipedia.org/wiki/Decolonisation_of_Africa")

    independence <- page %>%
      html_node(".wikitable") %>%
      html_table(fill = TRUE)

    independence <- independence %>%
      select(-Rank) %>%
      map_df(~str_remove_all(., "\\[.*\\]")) %>%
      rename(country = `Country[a]`,
             colonial_name = `Colonial name`,
             colonial_power = `Colonial power[b]`,
             independence_date = `Independence date`,
             first_head_of_state = `First head of state[d]`,
             independence_won_through = `Independence won through`)

    This dataset was scraped from the following Wikipedia table.
    It shows when African countries gained independence from which colonial powers. In Chapter 11, I
    will show you how to scrape Wikipedia pages using R. For now, let’s take a look at the contents
    of the dataset:

    independence
    ## # A tibble: 54 x 6
    ##    country colonial_name colonial_power independence_da… first_head_of_s…
    ##
    ##  1 Liberia Liberia       United States  26 July 1847     Joseph Jenkins …
    ##  2 South … Cape Colony … United Kingdom 31 May 1910      Louis Botha
    ##  3 Egypt   Sultanate of… United Kingdom 28 February 1922 Fuad I
    ##  4 Eritrea Italian Erit… Italy          10 February 1947 Haile Selassie
    ##  5 Libya   British Mili… United Kingdo… 24 December 1951 Idris
    ##  6 Sudan   Anglo-Egypti… United Kingdo… 1 January 1956   Ismail al-Azhari
    ##  7 Tunisia French Prote… France         20 March 1956    Muhammad VIII a…
    ##  8 Morocco French Prote… France Spain   2 March 19567 A… Mohammed V
    ##  9 Ghana   Gold Coast    United Kingdom 6 March 1957     Kwame Nkrumah
    ## 10 Guinea  French West … France         2 October 1958   Ahmed Sékou Tou…
    ## # ... with 44 more rows, and 1 more variable:
    ## #   independence_won_through

    As you can see, the date of independence is in a format that might make it difficult to answer questions
    such as Which African countries gained independence before 1960? for two reasons. First of all,
    the date uses the name of the month instead of the number of the month (well, this is not such a
    big deal, but still), and second of all, the independence date column is of type character and not “date”.
    So our first task is to correctly define the column as being of type date, while making sure that R
    understands that January is supposed to be “01”, and so on.

    Using {lubridate}

    There are several helpful functions included in {lubridate} to convert columns to dates. For instance
    if the column you want to convert is of the form “2012-11-21”, then you would use the function ymd(),
    for “year-month-day”. If, however, the column is of the form “2012-21-11”, then you would use ydm(). There are
    a few of these helper functions, and they can handle a lot of different formats for dates. In our case,
    having the name of the month instead of the number might seem quite problematic, but it turns out
    that this is a case that {lubridate} handles painlessly:

    library(lubridate)
    ##
    ## Attaching package: 'lubridate'
    ## The following object is masked from 'package:base':
    ##
    ##     date

    independence <- independence %>%
      mutate(independence_date = dmy(independence_date))
    ## Warning: 5 failed to parse.

    Some dates failed to parse, for instance for Morocco. This is because these countries have several
    independence dates; this means that the string to convert looks like:

    "2 March 1956 7 April 1956 10 April 1958 4 January 1969"

    which obviously cannot be converted by {lubridate} without further manipulation. I ignore these cases for
    simplicity’s sake.
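
    If one did want to keep such countries, one possible approach (a sketch only, assuming the first listed date is the one of interest) would be to extract the first “day month year” pattern before parsing:

    # extract the first date from the multi-date string, then parse it
    first_date <- stringr::str_extract(
      "2 March 1956 7 April 1956 10 April 1958 4 January 1969",
      "^\\d{1,2} \\w+ \\d{4}"
    )
    lubridate::dmy(first_date)  # should give "1956-03-02"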

    Let’s take a look at the data now:

    independence
    ## # A tibble: 54 x 6
    ##    country colonial_name colonial_power independence_da… first_head_of_s…
    ##
    ##  1 Liberia Liberia       United States  1847-07-26       Joseph Jenkins …
    ##  2 South … Cape Colony … United Kingdom 1910-05-31       Louis Botha
    ##  3 Egypt   Sultanate of… United Kingdom 1922-02-28       Fuad I
    ##  4 Eritrea Italian Erit… Italy          1947-02-10       Haile Selassie
    ##  5 Libya   British Mili… United Kingdo… 1951-12-24       Idris
    ##  6 Sudan   Anglo-Egypti… United Kingdo… 1956-01-01       Ismail al-Azhari
    ##  7 Tunisia French Prote… France         1956-03-20       Muhammad VIII a…
    ##  8 Morocco French Prote… France Spain   NA               Mohammed V
    ##  9 Ghana   Gold Coast    United Kingdom 1957-03-06       Kwame Nkrumah
    ## 10 Guinea  French West … France         1958-10-02       Ahmed Sékou Tou…
    ## # ... with 44 more rows, and 1 more variable:
    ## #   independence_won_through

    As you can see, we now have a date column in the right format. We can now answer questions such as
    Which countries gained independence before 1960? quite easily, by using the functions year(),
    month() and day(). Let’s see which countries gained independence before 1960:

    independence %>%
      filter(year(independence_date) <= 1960) %>%
      pull(country)
    ##  [1] "Liberia"                          "South Africa"
    ##  [3] "Egypt"                            "Eritrea"
    ##  [5] "Libya"                            "Sudan"
    ##  [7] "Tunisia"                          "Ghana"
    ##  [9] "Guinea"                           "Cameroon"
    ## [11] "Togo"                             "Mali"
    ## [13] "Madagascar"                       "Democratic Republic of the Congo"
    ## [15] "Benin"                            "Niger"
    ## [17] "Burkina Faso"                     "Ivory Coast"
    ## [19] "Chad"                             "Central African Republic"
    ## [21] "Republic of the Congo"            "Gabon"
    ## [23] "Mauritania"

    You guessed it, year() extracts the year of the date column and converts it to a numeric so that we can work
    with it. The same goes for month() and day(). Let's try to see if any countries gained their independence on
    Christmas Eve:

    independence %>%
      filter(month(independence_date) == 12,
             day(independence_date) == 24) %>%
      pull(country)
    ## [1] "Libya"

    Seems like Libya was the only one! You can also operate on dates. For instance, let's compute the difference
    between two dates, using the interval() function:

    independence %>%
      mutate(today = lubridate::today()) %>%
      mutate(independent_since = interval(independence_date, today)) %>%
      select(country, independent_since)
    ## # A tibble: 54 x 2
    ##    country      independent_since
    ##
    ##  1 Liberia      1847-07-26 UTC--2018-12-15 UTC
    ##  2 South Africa 1910-05-31 UTC--2018-12-15 UTC
    ##  3 Egypt        1922-02-28 UTC--2018-12-15 UTC
    ##  4 Eritrea      1947-02-10 UTC--2018-12-15 UTC
    ##  5 Libya        1951-12-24 UTC--2018-12-15 UTC
    ##  6 Sudan        1956-01-01 UTC--2018-12-15 UTC
    ##  7 Tunisia      1956-03-20 UTC--2018-12-15 UTC
    ##  8 Morocco      NA--NA
    ##  9 Ghana        1957-03-06 UTC--2018-12-15 UTC
    ## 10 Guinea       1958-10-02 UTC--2018-12-15 UTC
    ## # ... with 44 more rows

    The independent_since column now contains an interval object that we can convert to years:

    independence %>%
      mutate(today = lubridate::today()) %>%
      mutate(independent_since = interval(independence_date, today)) %>%
      select(country, independent_since) %>%
      mutate(years_independent = as.numeric(independent_since, "years"))
    ## # A tibble: 54 x 3
    ##    country      independent_since              years_independent
    ##
    ##  1 Liberia      1847-07-26 UTC--2018-12-15 UTC             171.
    ##  2 South Africa 1910-05-31 UTC--2018-12-15 UTC             109.
    ##  3 Egypt        1922-02-28 UTC--2018-12-15 UTC              96.8
    ##  4 Eritrea      1947-02-10 UTC--2018-12-15 UTC              71.8
    ##  5 Libya        1951-12-24 UTC--2018-12-15 UTC              67.0
    ##  6 Sudan        1956-01-01 UTC--2018-12-15 UTC              63.0
    ##  7 Tunisia      1956-03-20 UTC--2018-12-15 UTC              62.7
    ##  8 Morocco      NA--NA                                      NA
    ##  9 Ghana        1957-03-06 UTC--2018-12-15 UTC              61.8
    ## 10 Guinea       1958-10-02 UTC--2018-12-15 UTC              60.2
    ## # ... with 44 more rows

    We can now see for how long the last country to gain independence has been independent.
    Because the data is not tidy (in some cases, an African country was colonized by two powers,
    see Libya), I will only focus on 4 European colonial powers: Belgium, France, Portugal and the United Kingdom:

    independence %>%
      filter(colonial_power %in% c("Belgium", "France", "Portugal", "United Kingdom")) %>%
      mutate(today = lubridate::today()) %>%
      mutate(independent_since = interval(independence_date, today)) %>%
      mutate(years_independent = as.numeric(independent_since, "years")) %>%
      group_by(colonial_power) %>%
      summarise(last_colony_independent_for = min(years_independent, na.rm = TRUE))
    ## # A tibble: 4 x 2
    ##   colonial_power last_colony_independent_for
    ##
    ## 1 Belgium                               56.5
    ## 2 France                                41.5
    ## 3 Portugal                              43.1
    ## 4 United Kingdom                        42.5

    {lubridate} contains many more functions. If you often work with dates, duration or interval data, {lubridate}
    is a package that you have to master.

    Hope you enjoyed! If you found this blog post useful, you might want to follow
    me on twitter for blog post updates and
    buy me an espresso or paypal.me.



    Learning R: A gentle introduction to higher-order functions

    Fri, 12/14/2018 - 23:02

    (This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers)

    Have you ever thought about why the definition of a function in R is different from many other programming languages? The part that causes the biggest difficulties (especially for beginners of R) is that you state the name of the function at the beginning and use the assignment operator – as if functions were like any other data type, like vectors, matrices or data frames…

    Congratulations! You just encountered one of the big ideas of functional programming: functions are indeed like any other data type, they are not special – or in programming lingo, functions are first-class members. Now, you might ask: So what? Well, there are many ramifications, for example that you could use functions on other functions by using one function as an argument for another function. Sounds complicated?

    In mathematics most of you will be familiar with taking the derivative of a function. When you think about it you could say that you put one function into the derivative function (or operator) and get out another function!
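
    As a small illustration (my own example, not from the original post), here is a numerical “derivative operator” in R – a function that takes a function and returns another function:

    # takes a function f, returns a new function approximating its derivative
    derivative <- function(f, h = 1e-6) {
      function(x) (f(x + h) - f(x - h)) / (2 * h)
    }

    dsin <- derivative(sin)
    dsin(0)      # approximately cos(0) = 1
    dsin(pi / 2) # approximately cos(pi/2) = 0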

    In R there are many applications as well, let us go through a simple example step by step.

    Let’s say I want to apply the mean function on the first four columns of the iris dataset. I could do the following:

    mean(iris[ , 1])
    ## [1] 5.843333
    mean(iris[ , 2])
    ## [1] 3.057333
    mean(iris[ , 3])
    ## [1] 3.758
    mean(iris[ , 4])
    ## [1] 1.199333

    Quite tedious and not very elegant. Of course, we can use a for loop for that:

    for (x in iris[1:4]) {
      print(mean(x))
    }
    ## [1] 5.843333
    ## [1] 3.057333
    ## [1] 3.758
    ## [1] 1.199333

    This works fine but there is an even more intuitive approach. Just look at the original task: “apply the mean function on the first four columns of the iris dataset” – so let us do just that:

    apply(iris[1:4], 2, mean)
    ## Sepal.Length  Sepal.Width Petal.Length  Petal.Width
    ##     5.843333     3.057333     3.758000     1.199333

    Wow, this is very concise and works perfectly (the 2 just stands for “go through the data column wise”, 1 would be for “row wise”). apply is called a “higher-order function” and we could use it with all kinds of other functions:

    apply(iris[1:4], 2, sd)
    ## Sepal.Length  Sepal.Width Petal.Length  Petal.Width
    ##    0.8280661    0.4358663    1.7652982    0.7622377

    apply(iris[1:4], 2, min)
    ## Sepal.Length  Sepal.Width Petal.Length  Petal.Width
    ##          4.3          2.0          1.0          0.1

    apply(iris[1:4], 2, max)
    ## Sepal.Length  Sepal.Width Petal.Length  Petal.Width
    ##          7.9          4.4          6.9          2.5

    You can also use user-defined functions:

    midrange <- function(x) (min(x) + max(x)) / 2

    apply(iris[1:4], 2, midrange)
    ## Sepal.Length  Sepal.Width Petal.Length  Petal.Width
    ##         6.10         3.20         3.95         1.30

    We can even use new functions that are defined “on the fly” (or in functional programming lingo “anonymous functions”):

    apply(iris[1:4], 2, function(x) (min(x) + max(x)) / 2)
    ## Sepal.Length  Sepal.Width Petal.Length  Petal.Width
    ##         6.10         3.20         3.95         1.30

    Let us now switch to another inbuilt data set, the mtcars dataset with 11 different variables of 32 cars (if you want to find out more, please consult the documentation):

    head(mtcars)
    ##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
    ## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
    ## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
    ## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
    ## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
    ## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
    ## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

    To see the power of higher-order functions let us create a (numeric) matrix with minimum, first quartile, median, mean, third quartile and maximum for all 11 columns of the mtcars dataset with just one command!

    apply(mtcars, 2, summary)
    ##              mpg    cyl     disp       hp     drat      wt     qsec     vs      am   gear   carb
    ## Min.    10.40000 4.0000  71.1000  52.0000 2.760000 1.51300 14.50000 0.0000 0.00000 3.0000 1.0000
    ## 1st Qu. 15.42500 4.0000 120.8250  96.5000 3.080000 2.58125 16.89250 0.0000 0.00000 3.0000 2.0000
    ## Median  19.20000 6.0000 196.3000 123.0000 3.695000 3.32500 17.71000 0.0000 0.00000 4.0000 2.0000
    ## Mean    20.09062 6.1875 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625 3.6875 2.8125
    ## 3rd Qu. 22.80000 8.0000 326.0000 180.0000 3.920000 3.61000 18.90000 1.0000 1.00000 4.0000 4.0000
    ## Max.    33.90000 8.0000 472.0000 335.0000 4.930000 5.42400 22.90000 1.0000 1.00000 5.0000 8.0000

    Wow, that was easy and the result is quite impressive, is it not?

    Or if you want to perform a linear regression for all ten variables separately against mpg and want to get a table with all coefficients – there you go:

    sapply(mtcars, function(x) round(coef(lm(mpg ~ x, data = mtcars)), 3))
    ##             mpg    cyl   disp     hp   drat     wt  qsec     vs     am  gear   carb
    ## (Intercept)   0 37.885 29.600 30.099 -7.525 37.285 -5.114 16.617 17.147 5.623 25.872
    ## x             1 -2.876 -0.041 -0.068  7.678 -5.344  1.412  7.940  7.245 3.923 -2.056

    Here we used another higher-order function, sapply, together with an anonymous function. sapply goes through all the columns of a data frame (i.e. the elements of a list) and tries to simplify the result (here you get back a nice matrix).

    Often, you might not even have realised when you were using higher-order functions! I can tell you that it is quite a hassle in many programming languages to program a simple function plotter, i.e. a function which plots another function. In R it has already been done for you: you just use the higher-order function curve and give it the function you want to plot as an argument:

    curve(sin(x) + cos(1/2 * x), -10, 10)

    I want to give you one last example of another very helpful higher-order function (which not too many people know or use): by. It comes in very handy when you want to apply a function on different attributes split by a factor. So let’s say you want to get a summary of all the attributes of iris split by (!) species – here it comes:

    by(iris[1:4], iris$Species, summary)
    ## iris$Species: setosa
    ##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
    ##  Min.   :4.300   Min.   :2.300   Min.   :1.000   Min.   :0.100
    ##  1st Qu.:4.800   1st Qu.:3.200   1st Qu.:1.400   1st Qu.:0.200
    ##  Median :5.000   Median :3.400   Median :1.500   Median :0.200
    ##  Mean   :5.006   Mean   :3.428   Mean   :1.462   Mean   :0.246
    ##  3rd Qu.:5.200   3rd Qu.:3.675   3rd Qu.:1.575   3rd Qu.:0.300
    ##  Max.   :5.800   Max.   :4.400   Max.   :1.900   Max.   :0.600
    ## --------------------------------------------------------------
    ## iris$Species: versicolor
    ##   Sepal.Length    Sepal.Width     Petal.Length   Petal.Width
    ##  Min.   :4.900   Min.   :2.000   Min.   :3.00   Min.   :1.000
    ##  1st Qu.:5.600   1st Qu.:2.525   1st Qu.:4.00   1st Qu.:1.200
    ##  Median :5.900   Median :2.800   Median :4.35   Median :1.300
    ##  Mean   :5.936   Mean   :2.770   Mean   :4.26   Mean   :1.326
    ##  3rd Qu.:6.300   3rd Qu.:3.000   3rd Qu.:4.60   3rd Qu.:1.500
    ##  Max.   :7.000   Max.   :3.400   Max.   :5.10   Max.   :1.800
    ## --------------------------------------------------------------
    ## iris$Species: virginica
    ##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
    ##  Min.   :4.900   Min.   :2.200   Min.   :4.500   Min.   :1.400
    ##  1st Qu.:6.225   1st Qu.:2.800   1st Qu.:5.100   1st Qu.:1.800
    ##  Median :6.500   Median :3.000   Median :5.550   Median :2.000
    ##  Mean   :6.588   Mean   :2.974   Mean   :5.552   Mean   :2.026
    ##  3rd Qu.:6.900   3rd Qu.:3.175   3rd Qu.:5.875   3rd Qu.:2.300
    ##  Max.   :7.900   Max.   :3.800   Max.   :6.900   Max.   :2.500

    This was just a very shy look at this huge topic. There are very powerful higher-order functions in R, like lapply, aggregate, replicate (very handy for numerical simulations) and many more. A good overview can be found in the answers to this question: stackoverflow (my answer there is on the rather elusive switch function: switch).
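
    To give just one taste of those (my own toy example, not from the post), replicate() re-runs an expression, which makes quick simulations one-liners:

    # distribution of the mean of 100 standard normal draws, repeated 1000 times
    sim_means <- replicate(1000, mean(rnorm(100)))
    sd(sim_means)  # close to the theoretical 1 / sqrt(100) = 0.1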

    For some reason people tend to confuse higher-order functions with recursive functions but that is the topic of another post, so stay tuned…


    In case you missed it: November 2018 roundup

    Fri, 12/14/2018 - 16:29

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    In case you missed them, here are some articles from November of particular interest to R users.

    David Gerard assesses the plausibility of a key plot point in 'Jurassic Park' with simulations in R.

    In-database R is available in Azure SQL Database for private preview. 

    Introducing AzureR, a new suite of R packages for managing Azure resources in R.

    The AzureRMR package provides an interface for Resource Manager.

    Roundup of AI, Machine Learning and Data Science news from November 2018.

    You can now use the AI capabilities of Microsoft Cognitive Services within a container you host.

    A look back at some of the R applications presented at the EARL conference in Seattle.

    Slides and notebooks from my ODSC workshop, AI for Good.

    T-Mobile uses AI models implemented with R to streamline customer service.

    A guide to R packages for importing and working with US Census data.

    Azure Machine Learning Studio, the online drag-and-drop data analysis tool, upgrades its R support.

    And some general interest stories (not necessarily related to R):

    As always, thanks for the comments and please send any suggestions to me at davidsmi@microsoft.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here


    Day 14 – little helper print_fs

    Fri, 12/14/2018 - 08:00

    (This article was first published on r-bloggers – STATWORX, and kindly contributed to R-bloggers)

    We at STATWORX work a lot with R and we often use the same little helper functions within our projects. These functions ease our daily work life by reducing repetitive code parts or by creating overviews of our projects. At first, there was no plan to make a package, but I soon realised that it would be much easier to share and improve those functions if they were within a package. Up until the 24th of December I will present one function from helfRlein each day. So, on the 14th day of Christmas my true love gave to me…

    What can it do?

    This little helper returns the folder structure of a given path. With this, one can for example add a nice overview to the documentation of a project or to a git repository. For the sake of automation, this function could run and change parts within a log or news file after a major change.

    How to use it?

    If we take a look at the same example we used for the get_network function on day 5, we get the following:

    print_fs("~/flowchart/", depth = 4) 1 flowchart 2 ¦--create_network.R 3 ¦--getnetwork.R 4 ¦--plots 5 ¦ ¦--example-network-helfRlein.png 6 ¦ °--improved-network.png 7 ¦--R_network_functions 8 ¦ ¦--dataprep 9 ¦ ¦ °--foo_01.R 10 ¦ ¦--method 11 ¦ ¦ °--foo_02.R 12 ¦ ¦--script_01.R 13 ¦ °--script_02.R 14 °--README.md

    With depth we can adjust how deep we want to traverse through our folders.

    Overview

    To see all the other functions you can either check out our GitHub or you can read about them here.

    Have a merry advent season!

    About the author

    Jakob Gepp

    Numbers were always my passion, and as a data scientist and statistician at STATWORX I can fulfill my nerdy needs. I am also responsible for our blog. So if you have any questions or suggestions, just send me an email!

    ABOUT US

    STATWORX
    is a consulting company for data science, statistics, machine learning and artificial intelligence located in Frankfurt, Zurich and Vienna. Sign up for our NEWSLETTER and receive reads and treats from the world of data science and AI. 


    The post Day 14 – little helper print_fs appeared first on STATWORX.


    running plot [and simulated annealing]

    Fri, 12/14/2018 - 07:26

    (This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

    Last weekend, I found out a way to have plots update within a loop in R, when calling plot() inside the loop was never refreshed in real time. The suggestion of including a Sys.sleep(0.25) call worked perfectly on a simulated annealing example for determining the most dispersed points in a unit disc.
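
    For readers who want to try this, here is a minimal stand-alone sketch (my own, not the original simulated annealing code): redraw the plot inside the loop and pause briefly so the graphics device has time to refresh.

    pts <- matrix(runif(20), ncol = 2)  # 10 random points in the unit square
    for (i in 1:100) {
      # perturb the points a little at each iteration
      pts <- pmin(pmax(pts + matrix(rnorm(20, sd = 0.02), ncol = 2), 0), 1)
      plot(pts, xlim = c(0, 1), ylim = c(0, 1), pch = 19,
           xlab = "", ylab = "", main = paste("iteration", i))
      Sys.sleep(0.25)  # give the device time to redraw
    }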


    Easy CI/CD of GPU applications on Google Cloud including bare-metal using Gitlab and Kubernetes

    Fri, 12/14/2018 - 01:00

    (This article was first published on Angel Sevilla Camins' Blog, and kindly contributed to R-bloggers)

    Summary

    Are you a data scientist who only wants to focus on modelling and coding and not on setting up a GPU cluster? Then this blog might be interesting for you. We developed an automated pipeline using GitLab and Kubernetes that is able to run code in two GPU environments, GCP and bare metal; no need to worry about drivers or Kubernetes cluster creation and deletion. The only thing you need to do is push your code, and it runs on a GPU!
    Source code for both the custom Docker images and the Kubernetes objects definitions can be found here and here respectively.
    See here the complete blog post.


    Reusable Pipelines in R

    Thu, 12/13/2018 - 22:46

    (This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

    Pipelines in R are popular, the most popular one being magrittr as used by dplyr.

    This note will discuss the advanced re-usable piping systems: rquery/rqdatatable operator trees and wrapr function object pipelines. In each case we have a set of objects designed to extract extra power from the wrapr dot-arrow pipe %.>%.

    Piping

    Piping is not much more than having a system that lets one treat “x %.>% f(.)” as a near synonym for “f(x)”. For the wrapr dot arrow pipe the semantics are intentionally closer to (x %.>% f(.)) ~ {. <- x; f(.)}.

    The pipe notation may be longer, but it avoids nesting and reversed right to left reading for many-stage operations (such as “x %.>% f1(.) %.>% f2(.) %.>% f3(.)” versus “f3(f2(f1(x)))”).
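
    A tiny illustration of that near-equivalence (my own example, not from the original post):

    library("wrapr")

    sqrt(16)          # 4
    16 %.>% sqrt(.)   # same result: 4
    # many stages read left to right instead of inside out
    16 %.>% sqrt(.) %.>% log(., base = 2)  # log2(sqrt(16)) = 2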

    In addition to allowing users to write operations in this notation, most piping systems allow users to save pipelines for later re-use (though some others have issues serializing or saving such pipelines due to entanglement with the defining environment).

    wrapr and rquery/rqdatatable supply a number of piping tools that are re-usable, serializable, and very powerful (via R S3 and S4 dispatch features). One of the most compelling features is “function objects”, which means objects can be treated like functions (applied to other objects by pipelines). We will discuss some of these features in the context of rquery/rqdatatable and wrapr.

    rquery/rqdatatable

    For quite a while the rquery and rqdatatable packages have supplied a sequence-of-operators abstraction called an “operator tree” or “operator pipeline”.

    These pipelines are (deliberately) fairly strict:

    • They must start with a table description or definition.
    • Each step must be a table to table transform meeting certain column pre-conditions.
    • Each step must advertise what columns it makes available or produces, for later condition checking.

    For a guiding example suppose we want to row-subset some data, get per-group means, and then sort the data by those means.

    # our example data
    d <- data.frame(
      group = c("a", "a", "b", "b"),
      value = c(  1,   2,   2, -10),
      stringsAsFactors = FALSE
    )

    # load our package
    library("rqdatatable")
    ## Loading required package: rquery

    # build an operator tree
    threshold <- 0.0

    ops <-
      # define the data format
      local_td(d) %.>%
      # restrict to rows with value >= threshold
      select_rows_nse(., value >= threshold) %.>%
      # compute per-group aggregations
      project_nse(., groupby = "group", mean_value = mean(value)) %.>%
      # sort rows by mean_value decreasing
      orderby(., cols = "mean_value", reverse = "mean_value")

    # show the tree/pipeline
    cat(format(ops))
    ## table(d;
    ##   group,
    ##   value) %.>%
    ##  select_rows(.,
    ##   value >= 0) %.>%
    ##  project(., mean_value := mean(value),
    ##   g= group) %.>%
    ##  orderby(., desc(mean_value))

    Of course the purpose of such a pipeline is to be able to apply it to data. This is done simply with the wrapr dot arrow pipe:

    d %.>% ops
    ##    group mean_value
    ## 1:     b        2.0
    ## 2:     a        1.5

    rquery pipelines are designed to specify and execute data wrangling tasks. An important feature of rquery pipelines is: they are designed for serialization. This means we can save them and also send them to multiple nodes for parallel processing.

    # save the optree
    saveRDS(ops, "rquery_optree.RDS")

    # simulate a fresh R session
    rm(list = setdiff(ls(), "d"))
    library("rqdatatable")

    # read the optree back in
    ops <- readRDS('rquery_optree.RDS')

    # look at it
    cat(format(ops))
    ## table(d;
    ##   group,
    ##   value) %.>%
    ##  select_rows(.,
    ##   value >= 0) %.>%
    ##  project(., mean_value := mean(value),
    ##   g= group) %.>%
    ##  orderby(., desc(mean_value))

    # use it again
    d %.>% ops
    ##    group mean_value
    ## 1:     b        2.0
    ## 2:     a        1.5

    # clean up
    rm(list = setdiff(ls(), "d"))

    We can also run rqdatatable operations in “immediate mode”, without pre-defining the pipeline or tables:

    threshold <- 0.0

    d %.>%
      select_rows_nse(., value >= threshold) %.>%
      project_nse(., groupby = "group", mean_value = mean(value)) %.>%
      orderby(., cols = "mean_value", reverse = "mean_value")
    ##    group mean_value
    ## 1:     b        2.0
    ## 2:     a        1.5

    wrapr function objects

    A natural question is: given we already have rquery pipelines, why do we need wrapr function object pipelines? The reason is: rquery/rqdatatable pipelines are strict and deliberately restricted to operations that can be hosted both in R (via data.table) and on databases (examples: PostgreSQL and Spark). One might also want a more general pipeline with fewer constraints, optimized for working in R directly.

    The wrapr “function object” pipelines allow treatment of arbitrary objects as items we can pipe into. Their primary purpose is to partially apply functions to convert arbitrary objects and functions into single-argument (or unary) functions. This converted form is perfect for pipelining. This, in a sense, lets us treat these objects as functions. The wrapr function object pipeline also has less constraint checking than rquery pipelines, so is more suitable for “black box” steps that do not publish their column use and production details (in fact wrapr function object pipelines work on arbitrary objects, not just data.frames or tables).

    Let’s adapt our above example into a simple wrapr dot arrow pipeline.

    library("wrapr") threshold <- 0 d %.>% .[.$value >= threshold, , drop = FALSE] %.>% tapply(.$value, .$group, 'mean') %.>% sort(., decreasing = TRUE) ## b a ## 2.0 1.5

    All we have done is replace the rquery steps with typical base-R commands. As we see the wrapr dot arrow can route data through a sequence of such commands to repeat our example.

    Now let’s adapt our above example into a re-usable wrapr function object pipeline.

    library("wrapr") threshold <- 0 pipeline <- srcfn( ".[.$value >= threshold, , drop = FALSE]" ) %.>% srcfn( "tapply(.$value, .$group, 'mean')" ) %.>% pkgfn( "sort", arg_name = "x", args = list(decreasing = TRUE)) cat(format(pipeline)) ## UnaryFnList( ## SrcFunction{ .[.$value >= threshold, , drop = FALSE] }(.=., ), ## SrcFunction{ tapply(.$value, .$group, 'mean') }(.=., ), ## base::sort(x=., decreasing))

    We used two wrapr abstractions to capture the steps for re-use (something built in to rquery, and now also supplied by wrapr). The abstractions are:

    • srcfn() which wraps arbitrary quoted code as a function object.
    • pkgfn() which wraps a package qualified function name as a function object (“base” being the default package).

    This sort of pipeline can be applied to data using pipe notation:

    d %.>% pipeline
    ##   b   a
    ## 2.0 1.5

    The above pipeline has one key inconvenience and one key weakness:

    • For the srcfn() steps we had to place the source code in quotes, which defeats any sort of syntax highlighting and auto-completing in our R integrated development environment (IDE).
    • The above pipeline has a reference to the value of threshold in our current environment, this means the pipeline is not sufficiently self-contained to serialize and share.

    We can quickly address both of these issues with the wrapr::qe() (“quote expression”) function. It uses base::substitute() to quote its arguments, and the IDE doesn’t know the contents are quoted and thus can help us with syntax highlighting and auto-completion. Also we are using base::bquote() .()-style escaping to bind in the value of threshold.

    pipeline <-
      srcfn( qe( .[.$value >= .(threshold), , drop = FALSE] )) %.>%
      srcfn( qe( tapply(.$value, .$group, 'mean') )) %.>%
      pkgfn( "sort", arg_name = "x", args = list(decreasing = TRUE))

    cat(format(pipeline))
    ## UnaryFnList(
    ##    SrcFunction{ .[.$value >= 0, , drop = FALSE] }(.=., ),
    ##    SrcFunction{ tapply(.$value, .$group, "mean") }(.=., ),
    ##    base::sort(x=., decreasing))

    d %.>% pipeline
    ##   b   a
    ## 2.0 1.5

    Notice this pipeline works as before, but no longer refers to the external value threshold. This pipeline can be saved and shared.

    Another recommended way to bind in values is with the args argument, which is a named list of values that are expected to be available when a srcfn() is evaluated, or additional named arguments that will be applied to a pkgfn().

    In this notation the pipeline is written as follows.

    pipeline <-
      srcfn( qe( .[.$value >= threshold, , drop = FALSE] ),
             args = list('threshold' = threshold)) %.>%
      srcfn( qe( tapply(.$value, .$group, 'mean') )) %.>%
      pkgfn( "sort", arg_name = "x", args = list(decreasing = TRUE))

    cat(format(pipeline))
    ## UnaryFnList(
    ##    SrcFunction{ .[.$value >= threshold, , drop = FALSE] }(.=., threshold),
    ##    SrcFunction{ tapply(.$value, .$group, "mean") }(.=., ),
    ##    base::sort(x=., decreasing))

    d %.>% pipeline
    ##   b   a
    ## 2.0 1.5

    We can save this pipeline.

    saveRDS(pipeline, "wrapr_pipeline.RDS")

    And simulate using it in a fresh environment (i.e. simulate sharing it).

    # simulate a fresh environment
    rm(list = setdiff(ls(), "d"))
    library("wrapr")

    pipeline <- readRDS('wrapr_pipeline.RDS')

    cat(format(pipeline))
    ## UnaryFnList(
    ##    SrcFunction{ .[.$value >= threshold, , drop = FALSE] }(.=., threshold),
    ##    SrcFunction{ tapply(.$value, .$group, "mean") }(.=., ),
    ##    base::sort(x=., decreasing))

    d %.>% pipeline
    ##   b   a
    ## 2.0 1.5

    Conclusion

    And that is some of the power of wrapr piping, rquery/rqdatatable, and wrapr function objects. Essentially wrapr function objects are a reference application of the S3/S4 piping abilities discussed in the wrapr pipe formal article.

    The technique is very convenient when each of the steps is substantial (such as non-trivial data preparation and model application steps).

    The above techniques can make reproducing and sharing methods much easier.

    We have some more examples of the technique here and here.

    # clean up after example unlink("rquery_optree.RDS") unlink("wrapr_pipeline.RDS")

    To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    R community update: announcing sessions for useR Delhi December meetup

    Thu, 12/13/2018 - 18:45

    (This article was first published on R – Tech and Mortals, and kindly contributed to R-bloggers)

    As referenced in my last blog post, useR Delhi NCR is all set to host our second meetup on 15th December, i.e. the upcoming Saturday. We've finalized two exciting speaker sessions, as follows:

    • Basics of Shiny and geospatial visualizations by Sean Angiolillo
    • Up in the air: cloud storage based workflows in R by Himanshu Sikaria

    Apart from the above, the meetup will feature lightning talks and a networking session for attendees.
    If you haven’t registered yet, do it now via this form. Event details can be found on useR Delhi’s meetup page.

    Also, if you know of someone who may be interested in speaking at a future event, please let us know via mail, Twitter, Facebook or Meetup.
    Please share and retweet our event and/or this blog. We would be grateful.


    To leave a comment for the author, please follow the link and comment on their blog: R – Tech and Mortals. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Gold-Mining Week 15 (2018)

    Thu, 12/13/2018 - 16:02

    (This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers)

    Week 15 Gold Mining and Fantasy Football Projection Roundup now available. Go get that free agent gold!

    The post Gold-Mining Week 15 (2018) appeared first on Fantasy Football Analytics.


    To leave a comment for the author, please follow the link and comment on their blog: R – Fantasy Football Analytics. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    RTutor: Better Incentive Contracts For Road Construction

    Thu, 12/13/2018 - 09:00

    (This article was first published on Economics and R - R posts, and kindly contributed to R-bloggers)

    For about two weeks now, I have faced a large additional traffic jam every morning due to a construction site on the road. When passing the construction site, often only a few people, or sometimes nobody at all, seem to be working there. Being an economist, I really wonder how much of such traffic jams could be avoided with better contracts that give the construction company proper incentives to account for the social cost of the road blocking, and therefore more often fully staff the construction site and finish earlier.

    Patrick Bajari and Gregory Lewis have collected a detailed sample of 466 road construction projects in Minnesota to study this question in their very interesting article Moral Hazard, Incentive Contracts and Risk: Evidence from Procurement in the Review of Economic Studies, 2014.

    They estimate a structural econometric model and find that changes in contract design could substantially reduce the duration of road blockages and considerably increase total welfare, at only a minor increase in the risk that road construction firms face.

    As part of his master's thesis at Ulm University, Claudius Schmid has generated a nice and detailed RTutor problem set that allows you to replicate the findings in an interactive fashion. You learn a lot about the structure and outcomes of the currently used contracts, the theory behind better contract design, and how the structural model used to assess the quantitative effects can be estimated and simulated. At the same time, you can hone your general data science and R skills.

    Here is a small screenshot:

    Like in previous RTutor problem sets, you can enter free R code in a web-based Shiny app. The code will be automatically checked, and you can get hints on how to proceed. In addition, you are challenged by multiple-choice quizzes.

    You can test the problem set online on rstudio.cloud at https://rstudio.cloud/project/137023. Source the run.R file to start the problem set.

    If you don't want to register at rstudio.cloud, you can also check out the problem set on shinyapps.io: https://clamasch.shinyapps.io/RTutorIncentiveContracts

    The free shinyapps.io account is capped at 25 hours of total usage time per month, so the app may be greyed out when you click on it. Also, unlike on rstudio.cloud, you cannot save your progress on shinyapps.io.

    To locally install the problem set, follow the installation guide at the problem set’s Github repository: https://github.com/ClaMaSch/RTutorIncentiveContracts

    If you want to learn more about RTutor, to try out other problem sets, or to create a problem set yourself, take a look at the RTutor Github page

    https://github.com/skranz/RTutor


    To leave a comment for the author, please follow the link and comment on their blog: Economics and R - R posts. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Day 13 – little helper read_files

    Thu, 12/13/2018 - 08:00

    (This article was first published on r-bloggers – STATWORX, and kindly contributed to R-bloggers)

    We at STATWORX work a lot with R and we often use the same little helper functions within our projects. These functions ease our daily work life by reducing repetitive code parts or by creating overviews of our projects. At first, there was no plan to make a package, but soon I realised that it would be much easier to share and improve those functions if they were within a package. Up till the 24th of December I will present one function each day from helfRlein. So, on the 13th day of Christmas my true love gave to me…

    What can it do?

    This little helper reads in multiple files of the same type and combines them into a data.table. Which "file reading function" should be used is chosen via the FUN argument.

    How to use it?

    If you have a list of files that all need to be loaded in with the same function (e.g. read.csv), instead of using lapply and rbindlist you can now use this:

    read_files(files, FUN = readRDS) read_files(files, FUN = readLines) read_files(files, FUN = read.csv, sep = ";")

    Internally, it just uses lapply and rbindlist, but you don't have to type it all the time. read_files() combines the single files by their column names and returns one data.table; a minimal sketch of this pattern is shown below. Why data.table? Because I like it. But let's not go down the rabbit hole of data.table vs dplyr (to the rabbit hole …).
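    For intuition, here is a minimal sketch of what such a helper could look like (my own illustration of the lapply/rbindlist pattern described above; the actual helfRlein implementation may differ):

    library(data.table)

    # minimal sketch of a read_files-style helper (not the actual helfRlein code)
    read_files_sketch <- function(files, FUN = read.csv, ...) {
      # read each file with the supplied function, passing on any extra arguments
      file_list <- lapply(files, FUN, ...)
      # combine the results by column name into one data.table
      data.table::rbindlist(file_list, use.names = TRUE, fill = TRUE)
    }

    # usage, e.g. for a folder of csv files
    # files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
    # dt <- read_files_sketch(files, FUN = read.csv, sep = ";")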

    Overview

    To see all the other functions you can either check out our GitHub or you can read about them here.

    Have a merry advent season!

    Über den Autor

    Jakob Gepp

    Numbers were always my passion, and as a data scientist and statistician at STATWORX I can fulfill my nerdy needs. I am also responsible for our blog. So if you have any questions or suggestions, just send me an email!

    ABOUT US

    STATWORX
    is a consulting company for data science, statistics, machine learning and artificial intelligence located in Frankfurt, Zurich and Vienna. Sign up for our NEWSLETTER and receive reads and treats from the world of data science and AI. 


    Der Beitrag Day 13 – little helper read_files erschien zuerst auf STATWORX.


    To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – STATWORX. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Recreating the NBA lead tracker graphic

    Thu, 12/13/2018 - 07:38

    (This article was first published on R – Statistical Odds & Ends, and kindly contributed to R-bloggers)

    For each NBA game, nba.com has a really nice graphic which tracks the point differential between the two teams throughout the game. Here is the lead tracker graphic for the game between the LA Clippers and the Phoenix Suns on 10 Dec 2018:

    Taken from https://www.nba.com/games/20181210/LACPHX#/matchup

    I thought it would be cool to try recreating this graphic with R. You might ask: why try to replicate something that exists already? If we are able to pull out the data underlying this graphic, we could do much more than just replicate what is already out there; we have the power to make other visualizations which could be more informative or powerful. (For example, what does this chart look like for all games that the Golden State Warriors played in? Or, what does the chart look like for each quarter of the game?)

    The full R code for this post can be found here. For a self-contained script that accepts a game ID parameter and produces the lead tracker graphic, click here.

    First, we load the packages that we will use:

    library(lubridate) library(rvest) library(stringr) library(tidyverse)

    We can get play-by-play data from Basketball-Reference.com (here is the link for the LAC @ PHX game on 2018-12-10). Here is a snippet of the play-by-play table on that webpage, we would like to extract the columns in red:

    Play-by-play data from basketball-reference.com.

    The code below extracts the webpage, then pulls out rows from the play-by-play table:

    # get webpage url <- paste0("https://www.basketball-reference.com/boxscores/pbp/", current_id, ".html") webpage <- read_html(url) # pull out the events from the play-by-play table events <- webpage %>% html_nodes("#pbp") %>% html_nodes("tr") %>% html_text()

    events is a character vector that looks like this:

    We would really like to pull out the data in the boxes above. Timings are easy enough to pull out with regular expressions (e.g. start of the string: at least 1 digit, then :, then at least one digit, then ., then at least one digit). Pulling out the score is a bit trickier: we can’t just use the regular expression denoting a dash – with a number on each side. An example of why that doesn’t work is in the purple box above. Whenever a team scores, basketball-reference.com puts a “+2” or “+3” on the left or right of the score, depending on which team scored. In events, these 3 columns get smushed together into one string. If the team on the left scores, pulling out number-dash-number will give the wrong value (e.g. the purple box above would give 22-2 instead of 2-2).

    To avoid this issue, we extract the “+”s that may appear on either side of the score. In fact, this has an added advantage: we only need to extract a score if it is different from the previous timestamp. As such, we only have to keep the scores which have a “+” on either side of it. We then post-process the scores.

    # get event times & scores times <- str_extract(events, "^\\d+:\\d+.\\d+") scores <- str_extract(events, "[\\+]*\\d+-\\d+[\\+]*") scores <- ifelse(str_detect(scores, "\\+"), scores, NA) df <- data.frame(time = times, score = scores, stringsAsFactors = FALSE) %>% na.omit() # remove the +'s parseScore <- function(x) { if (startsWith(x, "+")) { return(str_sub(x, 3, str_length(x))) } else if (endsWith(x, "+")) { return(str_sub(x, 1, str_length(x) - 1)) } else { return(x) } } df$score <- sapply(df$score, parseScore)

    Next, we split the score into visitor and home score and compute the point differential (positive means the visitor team is winning):

    # split score into visitor and home score, get home advantage df <- df %>% separate(score, into = c("visitor", "home"), sep = "-") %>% mutate(visitor = as.numeric(visitor), home = as.numeric(home), time = ms(time)) %>% mutate(visitor_adv = visitor - home)

    Next we need to process the timings. Each of the 4 quarters lasts for 12 minutes, while each overtime period (if any) lasts for 5 minutes. The time column shows the amount of time remaining in the current period. We will amend the times so that they show the time elapsed (in seconds) from the start of the game. This notion of time makes it easier for plotting, and works for any number of overtime periods as well.

    # get period of play (e.g. Q1, Q2, ...) df$period <- NA period <- 0 prev_time <- ms("0:00") for (i in 1:nrow(df)) { curr_time <- df[i, "time"] if (prev_time < curr_time) { period <- period + 1 } df[i, "period"] <- period prev_time <- curr_time } # convert time such that it runs upwards. regular quarters are 12M long, OT # periods are 5M long df <- df %>% mutate(time = ifelse(period <= 4, as.duration(12 * 60) - as.duration(time), as.duration(5 * 60) - as.duration(time))) %>% mutate(time = ifelse(period <= 4, time + as.duration(12 * 60 * (period - 1)), time + as.duration(12 * 60 * 4) + as.duration(5 * 60 * (period - 5)) ))

    At this point, we have enough to make crude approximations of the lead tracker graphic:

    ggplot() + geom_line(data = df, aes(x = time, y = visitor_adv)) + labs(title = "LAC @ PHX, 2018-12-10") + theme_minimal() + theme(plot.title = element_text(size = rel(1.5), face = "bold", hjust = 0.5))

    ggplot() + geom_step(data = df, aes(x = time, y = visitor_adv)) + labs(title = "LAC @ PHX, 2018-12-10") + theme_minimal() + theme(plot.title = element_text(size = rel(1.5), face = "bold", hjust = 0.5))

    Getting the fill colors that NBA.com’s lead tracker has requires a bit more work. We need to split the visitor_adv into two columns: the visitor’s lead (0 if they are behind) and the home’s lead (0 if they are behind). We can then draw the chart above and below the x-axis as two geom_ribbons. (It’s a little more complicated than that, see this StackOverflow question and this gist for details.) Colors were obtained using imagecolorpicker.com.

    df$visitor_lead <- pmax(df$visitor_adv, 0) df$home_lead <- pmin(df$visitor_adv, 0) df_extraSteps <- df %>% mutate(visitor_adv = lag(visitor_adv), visitor_lead = lag(visitor_lead), home_lead = lag(home_lead)) df2 <- bind_rows(df_extraSteps, df) %>% arrange(time) ggplot() + geom_ribbon(data = df2, aes(x = time, ymin = 0, ymax = visitor_lead), fill = "#F7174E") + geom_ribbon(data = df2, aes(x = time, ymin = home_lead, ymax = 0), fill = "#F16031") + labs(title = "LAC @ PHX, 2018-12-10") + theme_minimal() + theme(plot.title = element_text(size = rel(1.5), face = "bold", hjust = 0.5))

    Almost there! The code below does some touch up to the figure, giving it the correct limits for the y-axis as well as vertical lines for the end of the periods.

    # get score differential range (round to nearest 5) ymax <- round(max(df$visitor_adv) * 2, digits = -1) / 2 ymin <- round(min(df$visitor_adv) * 2, digits = -1) / 2 # get period positions and labels periods <- unique(df$period) x_value <- ifelse(periods <= 4, 12 * 60 * periods, 12 * 60 * 4 + 5 * 60 * (periods - 4)) x_label <- ifelse(periods <= 4, paste0("Q", periods), paste0("OT", periods - 4)) ggplot() + geom_ribbon(data = df2, aes(x = time, ymin = 0, ymax = visitor_lead), fill = "#F7174E") + geom_ribbon(data = df2, aes(x = time, ymin = home_lead, ymax = 0), fill = "#F16031") + geom_vline(aes(xintercept = x_value), linetype = 2, col = "grey") + scale_y_continuous(limits = c(ymin, ymax)) + labs(title = "LAC @ PHX, 2018-12-10") + scale_x_continuous(breaks = x_value, labels = x_label) + theme_minimal() + theme(plot.title = element_text(size = rel(1.5), face = "bold", hjust = 0.5), axis.title.x = element_blank(), panel.grid.minor.x = element_blank(), panel.grid.minor.y = element_blank())

    The figure above is what we set out to plot. However, since we have the underlying data, we can now make plots of the same data that may reveal other trends (code at the end of this R file). Here are the line and ribbon plots where we look at the absolute score rather than the point differential:

    Here, we add points to the line plot to indicate whether a free throw, 2 pointer or 3 pointer was scored:

    The possibilities are endless!


    To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Odds & Ends. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Twins on the up

    Thu, 12/13/2018 - 01:00

    (This article was first published on HighlandR, and kindly contributed to R-bloggers)

    Are multiple births on the increase?

    My twin boys turned 5 years old today. Wow, time flies. Life is never dull, because twins are still seen as something of a novelty, so wherever we go, we find ourselves in conversation with strangers, who are intrigued by the whole thing.

    In order to save time if we ever meet, here’s some FAQ’s:

    • No, they’re not identical
    • Yes, I’m sure
    • No, they do not have similar personalities
    • They like different things – One likes Hulk and Gekko, the other likes Iron Man and Catboy.

    Recently I’ve been hearing and seeing anecdotal evidence that twins and multiple births are on the increase. I tried to find some data for Scotland, and while there is a lot of information on births in Scotland available, I couldn’t find breakdowns of multiple births.
    However, I did find some information for England and Wales, so let’s look at that.

    In this next bit, the key thing that may be of interest is the use of tidyr::gather.
    There has been some discussion on #rstats Twitter about things people struggle with, and a surprising number of people struggle to remember the syntax for tidyr's gather and spread.

    (I can neither confirm nor deny that I am one of them.)
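    For anyone who needs the refresher, here is a tiny toy example of my own (not part of the birth data) showing the wide-to-long round trip and back:

    library(tidyr)

    # a small wide data frame: one row per year, one column per age group
    toy <- data.frame(Year = c(2015, 2016),
                      Under20 = c(10, 12),
                      `20_to_24` = c(30, 28),
                      check.names = FALSE)

    # wide -> long: every column except Year becomes an age_group / maternities pair
    toy_long <- gather(toy, key = age_group, value = maternities, -Year)

    # long -> wide: spread() reverses the operation
    toy_wide <- spread(toy_long, key = age_group, value = maternities)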

    The data was found here

    library(readxl) library(dplyr) library(tidyr) library(ggplot2) data <- read_xls("birthcharacteristicsworkbook2016.xls", sheet = "Table 11", range = "A10:I87") data <- data %>% rename(Year = X__1, All_ages = `All maternities with multiple births`, Under20 = X__2, `20_to_24` = X__3, `25_to_29` = X__4, `30_to_34` = X__5, `35_to_39` = X__6, `40_to_44` = X__7, `45_and_over` = X__8) # the 1981 data is borked, so ignore that

    Note the use of gather to combine all the age groups into an age_group variable.

    We use the Year column as an index so we have an entry for every age group, for every year, with the value represented as ‘maternities’.

    Back to the code:

    long_data <- data %>% filter(Year != "1981") %>% gather(key = age_group, value = "maternities", -Year) long_data$Year <- as.numeric(long_data$Year) long_data$age_group <- forcats::as_factor(long_data$age_group) long_data$maternities <- as.numeric(long_data$maternities) ggplot(long_data,aes(Year, maternities), group = age_group) + geom_line() + geom_point() + facet_wrap(vars(age_group), scales = "free_y") + ggtitle(label = "England and Wales maternities with multiple births - numbers", subtitle = "By age of mother, 1940 to 2016") + labs(x = NULL, y = "Multiple maternities")

    # Let's do rates rates <- read_xls("birthcharacteristicsworkbook2016.xls", sheet = "Table 11", range = "A89:I166") rates <- rates %>% rename(Year = X__1, All_ages = `All maternities with multiple births per 1,000 all maternities`, Under20 = X__2, `20_to_24` = X__3, `25_to_29` = X__4, `30_to_34` = X__5, `35_to_39` = X__6, `40_to_44` = X__7, `45_and_over` = X__8) long_rates <- rates %>% filter(Year != 1981) %>% gather(key = age_group, value = "multiple_maternities_per_1000", -Year) long_rates$Year <- as.numeric(long_rates$Year) long_rates$age_group <- forcats::as_factor(long_rates$age_group) long_rates$multiple_maternities_per_1000 <- as.numeric(long_rates$multiple_maternities_per_1000) ggplot(long_rates,aes(Year, multiple_maternities_per_1000), group = age_group) + geom_line() + geom_point() + facet_wrap(vars(age_group)) + ggtitle(label = "England and Wales Rate of maternities with multiple births - per 1,000 all maternities ", subtitle = "By age of mother, 1940 to 2016") + labs(x = NULL, y = "Multiple maternities")

    When we look at maternities with multiple births as a rate per 1000 maternities, we see the increase in multiple births among older mothers, especially in the over 45 group.

    Again, with free scales on the y-axis – which helps us see that almost all age groups are exhibiting an increase – compare the 20-24 age group as a rate and as a count, for example.

    Looks to me that overall, the rate of multiple births is increasing.
    What’s driving this?
    Can it continue?
    Will people ever stop asking us if the twins are identical?


    To leave a comment for the author, please follow the link and comment on their blog: HighlandR. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Rsampling Fama French

    Thu, 12/13/2018 - 01:00

    (This article was first published on R Views, and kindly contributed to R-bloggers)

    Today we will continue our work on Fama French factor models, but more as a vehicle to explore some of the awesome stuff happening in the world of tidy models. For new readers who want to get familiar with Fama French before diving into this post, see here where we covered importing and wrangling the data, here where we covered rolling models and visualization, my most recent previous post here where we covered managing many models, and if you’re into Shiny, this flexdashboard.

    Our goal today is to explore k-fold cross-validation via the rsample package, and a bit of model evaluation via the yardstick package. We started the model evaluation theme last time when we used tidy(), glance() and augment() from the broom package. In this post, we will use the rmse() function from yardstick, but our main focus will be on the vfold_cv() function from rsample. We are going to explore these tools in the context of linear regression and Fama French, which might seem weird since these tools would typically be employed in the realms of machine learning, classification, and the like. We’ll stay in the world of explanatory models via linear regression for a few reasons.

    First, and this is a personal preference, when getting to know a new package or methodology, I prefer to do so in a context that’s already familiar. I don’t want to learn about rsample whilst also getting to know a new data set and learning the complexities of some crazy machine learning model. Since Fama French is familiar from our previous work, we can focus on the new tools in rsample and yardstick. Second, factor models are important in finance, despite relying on good old linear regression. We won’t regret time spent on factor models, and we might even find creative new ways to deploy or visualize them.

    The plan for today is to take the same models that we ran in the last post, only this time use k-fold cross-validation and bootstrapping to try to assess the quality of those models.

    For that reason, we’ll be working with the same data as we did previously. I won’t go through the logic again, but in short, we’ll import data for daily prices of five ETFs, convert them to returns (have a look here for a refresher on that code flow), then import the Fama French five-factor data and join it to our five ETF returns data. Here’s the code to make that happen:

    library(tidyverse) library(broom) library(tidyquant) library(timetk) symbols <- c("SPY", "EFA", "IJS", "EEM", "AGG") # The prices object will hold our daily price data. prices <- getSymbols(symbols, src = 'yahoo', from = "2012-12-31", to = "2017-12-31", auto.assign = TRUE, warnings = FALSE) %>% map(~Ad(get(.))) %>% reduce(merge) %>% `colnames<-`(symbols) asset_returns_long <- prices %>% tk_tbl(preserve_index = TRUE, rename_index = "date") %>% gather(asset, returns, -date) %>% group_by(asset) %>% mutate(returns = (log(returns) - log(lag(returns)))) %>% na.omit() factors_data_address <- "http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/Global_5_Factors_Daily_CSV.zip" factors_csv_name <- "Global_5_Factors_Daily.csv" temp <- tempfile() download.file( # location of file to be downloaded factors_data_address, # where we want R to store that file temp, quiet = TRUE) Global_5_Factors <- read_csv(unz(temp, factors_csv_name), skip = 6 ) %>% rename(date = X1, MKT = `Mkt-RF`) %>% mutate(date = ymd(parse_date_time(date, "%Y%m%d")))%>% mutate_if(is.numeric, funs(. / 100)) %>% select(-RF) data_joined_tidy <- asset_returns_long %>% left_join(Global_5_Factors, by = "date") %>% na.omit()

    After running that code, we have an object called data_joined_tidy. It holds daily returns for 5 ETFs and the Fama French factors. Here’s a look at the first row for each ETF.

    data_joined_tidy %>% slice(1) # A tibble: 5 x 8 # Groups: asset [5] date asset returns MKT SMB HML RMW CMA 1 2013-01-02 AGG -0.00117 0.0199 -0.0043 0.0028 -0.0028 -0.0023 2 2013-01-02 EEM 0.0194 0.0199 -0.0043 0.0028 -0.0028 -0.0023 3 2013-01-02 EFA 0.0154 0.0199 -0.0043 0.0028 -0.0028 -0.0023 4 2013-01-02 IJS 0.0271 0.0199 -0.0043 0.0028 -0.0028 -0.0023 5 2013-01-02 SPY 0.0253 0.0199 -0.0043 0.0028 -0.0028 -0.0023

    Let’s work with just one ETF for today and use filter(asset == "AGG") to shrink our data down to just that ETF.

    agg_ff_data <- data_joined_tidy %>% filter(asset == "AGG")

    Okay, we’re going to regress the daily returns of AGG on one factor, then three factors, then five factors, and we want to evaluate how well each model explains AGG’s returns. That means we need a way to test the model. Last time, we looked at the adjusted r-squared values when the model was run on the entirety of AGG returns. Today, we’ll evaluate the model using k-fold cross validation. That’s a pretty jargon-heavy phrase that isn’t part of the typical finance lexicon. Let’s start with the second part, cross-validation. Instead of running our model on the entire data set – all the daily returns of AGG – we’ll run it on just part of the data set, then test the results on the part that we did not use. Those different subsets of our original data are often called the training and the testing sets, though rsample calls them the analysis and assessment sets. We validate the model results by applying them to the assessment data and seeing how the model performed.

    The k-fold bit refers to the fact that we’re not just dividing our data into training and testing subsets, we’re actually going to divide it into a bunch of groups, a k number of groups, or a k number of folds. One of those folds will be used as the validation set; the model will be fit on the other k - 1 sets, and then tested on the validation set. We’re doing this with a linear model to see how well it explains the data; it’s typically used in machine learning to see how well a model predicts data (we’ll get there in 2019).1
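    To make that concrete, here is a purely illustrative sketch (my own toy code, not part of the original analysis) of how a single fold of a 5-fold split could be built by hand; rsample will do all of this for us in a moment:

    # assign each row of the AGG data to one of 5 folds at random
    set.seed(123)
    n <- nrow(agg_ff_data)
    fold_id <- sample(rep(1:5, length.out = n))

    # fold 1 plays the role of the assessment set;
    # the other four folds together form the analysis set
    assessment_rows <- which(fold_id == 1)
    analysis_set    <- agg_ff_data[-assessment_rows, ]
    assessment_set  <- agg_ff_data[assessment_rows, ]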

    If you’re like me, it will take a bit of tinkering to really grasp k-fold cross-validation, but rsample has a great function for dividing our data into k folds. If we wish to use five folds (common practice seems to be either five or ten folds), we call the vfold_cv() function, pass it our data object agg_ff_data, and set v = 5.

    library(rsample) library(yardstick) set.seed(752) cved_ff<- vfold_cv(agg_ff_data, v = 5) cved_ff # 5-fold cross-validation # A tibble: 5 x 2 splits id 1 Fold1 2 Fold2 3 Fold3 4 Fold4 5 Fold5

    We have an object called cved_ff, with a column called splits and a column called id. Let’s peek at the first split.

    cved_ff$splits[[1]] <1007/252/1259>

    Three numbers. The first, 1007, is telling us how many observations are in the analysis set. Since we have five folds, we should have 80% (or 4/5) of our data in the analysis set. The second number, 252, is telling us how many observations are in the assessment set, which is 20% of our original data. The third number, 1259, is the total number of observations in our original data.
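    As a quick sanity check (my own addition), we can pull out the two sets with analysis() and assessment() and confirm those counts:

    # confirm the sizes reported by the split object
    nrow(analysis(cved_ff$splits[[1]]))    # 1007 rows in the analysis set
    nrow(assessment(cved_ff$splits[[1]]))  # 252 rows in the assessment set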

    Next, we want to apply a model to the analysis set of this k-folded data and test the results on the assessment set. Let’s start with one factor and run a simple linear model, lm(returns ~ MKT).

    We want to run it on analysis(cved_ff$splits[[1]]) – the analysis set of our first split.

    ff_model_test <- lm(returns ~ MKT, data = analysis(cved_ff$splits[[1]])) ff_model_test Call: lm(formula = returns ~ MKT, data = analysis(cved_ff$splits[[1]])) Coefficients: (Intercept) MKT 0.0001025 -0.0265516

    Nothing too crazy so far. Now we want to test on our assessment data. The first step is to add the model’s fitted values to the assessment data. We’ll use augment() for that task, and pass it assessment(cved_ff$splits[[1]]).

    ff_model_test %>% augment(newdata = assessment(cved_ff$splits[[1]])) %>% head() %>% select(returns, .fitted) returns .fitted 1 0.0009021065 1.183819e-04 2 0.0011726989 4.934779e-05 3 0.0010815505 1.157267e-04 4 -0.0024385815 -7.544460e-05 5 -0.0021715702 -8.341007e-05 6 0.0028159467 3.865527e-04

    We just added our fitted values to the assessment data, the subset of the data on which the model was not fit. How well did our model do when we compare the fitted values to the actual values in the held-out set?

    We can use the rmse() function from yardstick to measure our model. RMSE stands for root mean-squared error: the square root of the mean of the squared differences between our fitted values and the actual values in the assessment data. A lower RMSE is better!

    ff_model_test %>% augment(newdata = assessment(cved_ff$splits[[1]])) %>% rmse(returns, .fitted) # A tibble: 1 x 3 .metric .estimator .estimate 1 rmse standard 0.00208
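    As a sanity check (my own addition), we can compute the same quantity by hand, which makes the root mean-squared error formula explicit; it should agree with the yardstick result:

    # manual RMSE: square root of the mean squared difference between actual and fitted
    ff_model_test %>%
      augment(newdata = assessment(cved_ff$splits[[1]])) %>%
      summarise(rmse_by_hand = sqrt(mean((returns - .fitted)^2)))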

    Now that we’ve done that piece by piece, let’s wrap the whole operation into one function. This function takes one argument, a split, and we’re going to use pull() so we can extract the raw number, instead of the entire tibble result.

    model_one <- function(split) { split_for_model <- analysis(split) ff_model <- lm(returns ~ MKT, data = split_for_model) holdout <- assessment(split) rmse <- ff_model %>% augment(newdata = holdout) %>% rmse(returns, .fitted) %>% pull(.estimate) }

    Now we pass it our first split.

    model_one(cved_ff$splits[[1]]) %>% head() [1] 0.002080324

    Now we want to apply that function to each of our five folds that are stored in cved_ff. We do that with a combination of mutate() and map_dbl(). We use map_dbl() instead of map() because we are returning a number here and there’s not a good reason to store that number in a list column.

    cved_ff %>% mutate(rmse = map_dbl(cved_ff$splits, model_one)) # 5-fold cross-validation # A tibble: 5 x 3 splits id rmse * 1 Fold1 0.00208 2 Fold2 0.00189 3 Fold3 0.00201 4 Fold4 0.00224 5 Fold5 0.00190

    OK, we have five RMSEs, since we ran the model on five separate analysis fold sets and tested on five separate assessment fold sets. Let’s find the average RMSE by taking the mean() of the rmse column. That can help reduce the noisiness that results from our random creation of those five folds.

    cved_ff %>% mutate(rmse = map_dbl(cved_ff$splits, model_one)) %>% summarise(mean_rse = mean(rmse)) # 5-fold cross-validation # A tibble: 1 x 1 mean_rse 1 0.00202

    We now have the mean RMSE after running our model, lm(returns ~ MKT), on all five of our folds.

    That process for finding the mean RMSE can be applied to other models as well. Let’s suppose we wish to find the mean RMSE for two other models: lm(returns ~ MKT + SMB + HML), the Fama French three-factor model, and lm(returns ~ MKT + SMB + HML + RMW + CMA), the Fama French five-factor model. By comparing the mean RMSEs, we can evaluate which model explained the returns of AGG better. Since we’re just adding more and more factors, the models can be expected to get more and more accurate, but again, we are exploring the rsample machinery and creating a template where we can pop in whatever models we wish to compare.

    First, let’s create two new functions that follow the exact same code pattern as above but house the three-factor and five-factor models.

    model_two <- function(split) { split_for_model <- analysis(split) ff_model <- lm(returns ~ MKT + SMB + HML, data = split_for_model) holdout <- assessment(split) rmse <- ff_model %>% augment(newdata = holdout) %>% rmse(returns, .fitted) %>% pull(.estimate) } model_three <- function(split) { split_for_model <- analysis(split) ff_model <- lm(returns ~ MKT + SMB + HML + RMW + CMA, data = split_for_model) holdout <- assessment(split) rmse <- ff_model %>% augment(newdata = holdout) %>% rmse(returns, .fitted) %>% pull(.estimate) }

    Now we pass those three model functions to the same mutate() with map_dbl() flow that we used with just one model. The result will be three new columns of RMSEs, one for each of our three models applied to our five folds.

    cved_ff %>% mutate( rmse_model_1 = map_dbl( splits, model_one), rmse_model_2 = map_dbl( splits, model_two), rmse_model_3 = map_dbl( splits, model_three)) # 5-fold cross-validation # A tibble: 5 x 5 splits id rmse_model_1 rmse_model_2 rmse_model_3 * 1 Fold1 0.00208 0.00211 0.00201 2 Fold2 0.00189 0.00184 0.00178 3 Fold3 0.00201 0.00195 0.00194 4 Fold4 0.00224 0.00221 0.00213 5 Fold5 0.00190 0.00183 0.00177

    We can also find the mean RMSE for each model.

    cved_ff %>% mutate( rmse_model_1 = map_dbl( splits, model_one), rmse_model_2 = map_dbl( splits, model_two), rmse_model_3 = map_dbl( splits, model_three)) %>% summarise(mean_rmse_model_1 = mean(rmse_model_1), mean_rmse_model_2 = mean(rmse_model_2), mean_rmse_model_3 = mean(rmse_model_3)) # 5-fold cross-validation # A tibble: 1 x 3 mean_rmse_model_1 mean_rmse_model_2 mean_rmse_model_3 1 0.00202 0.00199 0.00192

    That code flow worked just fine, but we had to repeat ourselves when creating the functions for each model. Let’s toggle to a flow where we define three models – the ones that we wish to test via cross-validation and RMSE – then pass those models to one function.

    First, we use as.formula() to define our three models.

    mod_form_1 <- as.formula(returns ~ MKT) mod_form_2 <- as.formula(returns ~ MKT + SMB + HML) mod_form_3 <- as.formula(returns ~ MKT + SMB + HML + RMW + CMA)

    Now we write one function that takes split as an argument, same as above, but also takes formula as an argument, so we can pass it different models. This gives us the flexibility to more easily define new models and pass them to map, so I’ll append _flex to the name of this function.

    ff_rmse_models_flex <- function(split, formula) { split_for_data <- analysis(split) ff_model <- lm(formula, data = split_for_data) holdout <- assessment(split) rmse <- ff_model %>% augment(newdata = holdout) %>% rmse(returns, .fitted) %>% pull(.estimate) }

    Now we use the same code flow as before, except we call map_dbl(), pass it our cved_ff$splits object, our new flex function called ff_rmse_models_flex(), and the model we wish to pass as the formula argument. First we pass it mod_form_1.

    cved_ff %>% mutate(rmse_model_1 = map_dbl(cved_ff$splits, ff_rmse_models_flex, formula = mod_form_1)) # 5-fold cross-validation # A tibble: 5 x 3 splits id rmse_model_1 * 1 Fold1 0.00208 2 Fold2 0.00189 3 Fold3 0.00201 4 Fold4 0.00224 5 Fold5 0.00190

    Now let’s pass it all three models and find the mean RMSE.

    cved_ff %>% mutate(rmse_model_1 = map_dbl(cved_ff$splits, ff_rmse_models_flex, formula = mod_form_1), rmse_model_2 = map_dbl(cved_ff$splits, ff_rmse_models_flex, formula = mod_form_2), rmse_model_3 = map_dbl(cved_ff$splits, ff_rmse_models_flex, formula = mod_form_3)) %>% summarise(mean_rmse_model_1 = mean(rmse_model_1), mean_rmse_model_2 = mean(rmse_model_2), mean_rmse_model_3 = mean(rmse_model_3)) # 5-fold cross-validation # A tibble: 1 x 3 mean_rmse_model_1 mean_rmse_model_2 mean_rmse_model_3 1 0.00202 0.00199 0.00192

    Alright, that code flow seems a bit more flexible than our original method of writing a function to assess each model. We didn’t do much hard thinking about functional form here, but hopefully this provides a template where you could assess more nuanced models. We’ll get into bootstrapping and time series work next week, then head to Shiny to ring in the New Year!

    And, finally, a couple of public service announcements.

    First, thanks to everyone who has checked out my new book! The price just got lowered for the holidays. See on Amazon or on the CRC homepage (okay, that was more of an announcement about my book).

    Second, applications are open for the Battlefin alternative data contest, and RStudio is one of the tools you can use to analyze the data. Check it out here. In January, they’ll announce 25 finalists who will get to compete for a cash prize and connect with some quant hedge funds. Go get ‘em!

    Thanks for reading and see you next time.

    1. For more on cross-validation, see “An Introduction to Statistical Learning”, chapter 5. Available online here: http://www-bcf.usc.edu/~gareth/ISL/.



    To leave a comment for the author, please follow the link and comment on their blog: R Views. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Teaching and Learning Materials for Data Visualization

    Wed, 12/12/2018 - 19:12

    (This article was first published on R on kieranhealy.org, and kindly contributed to R-bloggers)


    Data Visualization: A Practical Introduction will begin shipping next week. I’ve written an R package that contains datasets, functions, and a course packet to go along with the book. The socviz package contains about twenty-five datasets and a number of utility and convenience functions. The datasets range in size from things with just a few rows (used for purely illustrative purposes) to datasets with over 120,000 observations, for practicing with and exploring.

    A course packet is also included in the package. This is a zipped file containing an RStudio project consisting of nine R Markdown documents that parallel the chapters in the book. They contain the code for almost all the figures in the book (and a few more besides). There are also some additional support files, to help demonstrate things like reading in your own data locally in R.

    Installing the package

    To install the package, you can follow the instructions in the Preface to the book. Alternatively, first download and install R for MacOS, Windows or Linux, as appropriate. Then download and install RStudio. Launch RStudio and then type the following code at the Console prompt (>), hitting return at the end of each line:

    my_packages <- c("tidyverse", "fs", "devtools") install.packages(my_packages) devtools::install_github("kjhealy/socviz")

    Once everything has downloaded and been installed (which may take a little while), load the socviz package:

    library(socviz)

    The Course Packet

    The supporting materials are contained in a compressed .zip file. To extract them to your Desktop, make sure the socviz package is loaded as described above. Then do this:

    setup_course_notes()

    This will copy the dataviz_course_notes.zip file to your Desktop, and uncompress it into a folder called dataviz_course_notes. Double-click the file named dataviz.Rproj to launch the project as a new RStudio session. If you want to uncompress the file somewhere other than your Desktop, e.g. your Documents folder, you can do this:

    setup_course_notes(folder = "~/Documents")

    The source code for socviz is available on GitHub. I plan on continuing to update and improve it as I use it myself in my own classes and workshops.


    To leave a comment for the author, please follow the link and comment on their blog: R on kieranhealy.org. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    How to deploy a predictive service to Kubernetes with R and the AzureContainers package

    Wed, 12/12/2018 - 17:30

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    It's easy to create a function in R, but what if you want to call that function from a different application, with the scale to support a large number of simultaneous requests? This article shows how you can deploy an R fitted model as a Plumber web service in Kubernetes, using Azure Container Registry (ACR) and Azure Kubernetes Service (AKS). We use the AzureContainers package to create the necessary resources and deploy the service.

    Fit the model

    We’ll fit a simple model for illustrative purposes, using the Boston housing dataset (which ships with R in the MASS package). To make the deployment process more interesting, the model we fit will be a random forest, using the randomForest package. This is not part of R, so we’ll have to install it from CRAN.

    data(Boston, package="MASS") install.packages("randomForest") library(randomForest) # train a model for median house price bos_rf <- randomForest(medv ~ ., data=Boston, ntree=100) # save the model saveRDS(bos_rf, "bos_rf.rds")

    Scoring script for plumber

    Now that we have the model, we also need a script to obtain predicted values from it given a set of inputs:

    # save as bos_rf_score.R bos_rf <- readRDS("bos_rf.rds") library(randomForest) #* @param df data frame of variables #* @post /score function(req, df) { df <- as.data.frame(df) predict(bos_rf, df) }

    This is fairly straightforward, but the comments may require some explanation. They are plumber annotations that tell it to call the function if the server receives an HTTP POST request with the path /score, and query parameter df. The value of the df parameter is then converted to a data frame, and passed to the randomForest predict method. For a fuller description of how Plumber works, see the Plumber website.
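    Before building the Docker image, it can be worth testing the scoring script locally. This is an optional sketch of my own (assuming the plumber and httr packages are installed, and that bos_rf.rds and bos_rf_score.R are in the working directory):

    # in one R session: plumb the scoring script and start a local server
    library(plumber)
    pr <- plumb("bos_rf_score.R")
    pr$run(host = "0.0.0.0", port = 8000)

    # in a second R session: send a scoring request to the local server
    library(httr)
    response <- POST("http://localhost:8000/score",
                     body = list(df = MASS::Boston[1:5, ]),
                     encode = "json")
    content(response, simplifyVector = TRUE)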

    Create a Dockerfile

    Let’s package up the model and the scoring script into a Docker image. A Dockerfile to do this is shown below. This uses the base image supplied by Plumber (trestletech/plumber), installs randomForest, and then adds the model and the above scoring script. Finally, it runs the code that will start the server and listen on port 8000.

    # example Dockerfile to expose a plumber service FROM trestletech/plumber # install the randomForest package RUN R -e 'install.packages(c("randomForest"))' # copy model and scoring script RUN mkdir /data COPY bos_rf.rds /data COPY bos_rf_score.R /data WORKDIR /data # plumb and run server EXPOSE 8000 ENTRYPOINT ["R", "-e", \ "pr <- plumber::plumb('/data/bos_rf_score.R'); \ pr$run(host='0.0.0.0', port=8000)"]

    Build and upload the image

    The code to store our image on Azure Container Registry is as follows. This calls AzureRMR to log in to Resource Manager, creates an Azure Container Registry resource (a Docker registry hosted in Azure), and then pushes the image to the registry.

    If this is the first time you are using AzureRMR, you’ll have to create a service principal first. For more information on how to do this, see the AzureRMR readme.

    library(AzureContainers) az <- AzureRMR::az_rm$new( tenant="myaadtenant.onmicrosoft.com", app="app_id", password="password") # create a resource group for our deployments deployresgrp <- az$ get_subscription("subscription_id")$ create_resource_group("deployresgrp", location="australiaeast") # create container registry deployreg_svc <- deployresgrp$create_acr("deployreg") # build image 'bos_rf' call_docker("build -t bos_rf .") # upload the image to Azure deployreg <- deployreg_svc$get_docker_registry() deployreg$push("bos_rf")

    If you run this code, you should see a lot of output indicating that R is downloading, compiling and installing randomForest, and finally that the image is being pushed to Azure. (You will see this output even if your machine already has the randomForest package installed. This is because the package is being installed to the R session inside the container, which is distinct from the one running the code shown here.)

    All docker calls in AzureContainers, like the one to build the image, return the actual docker commandline as the cmdline attribute of the (invisible) returned value. In this case, the commandline is docker build -t bos_rf . Similarly, the push() method actually involves two Docker calls, one to retag the image, and the second to do the actual pushing; the returned value in this case will be a 2-component list with the command lines being docker tag bos_rf deployreg.azurecr.io/bos_rf and docker push deployreg.azurecr.io/bos_rf.
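    For example, based on the description above (a small sketch of my own), the commandline attribute could be inspected like so:

    # capture the (invisible) return value of the docker call
    res <- call_docker("build -t bos_rf .")

    # the docker commandline that was actually executed
    attr(res, "cmdline")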

    Deploy to a Kubernetes cluster

    The code to create an AKS resource (a managed Kubernetes cluster in Azure) is quite simple:

    # create a Kubernetes cluster with 2 nodes, running Linux deployclus_svc <- deployresgrp$create_aks("deployclus", agent_pools=aks_pools("pool1", 2))

    Creating a Kubernetes cluster can take several minutes. By default, the create_aks() method will wait until the cluster provisioning is complete before it returns.

    Having created the cluster, we can deploy our model and create a service. We’ll use a YAML configuration file to specify the details for the deployment and service API.

    apiVersion: extensions/v1beta1 kind: Deployment metadata: name: bos-rf spec: replicas: 1 template: metadata: labels: app: bos-rf spec: containers: - name: bos-rf image: deployreg.azurecr.io/bos_rf ports: - containerPort: 8000 resources: requests: cpu: 250m limits: cpu: 500m imagePullSecrets: - name: deployreg.azurecr.io --- apiVersion: v1 kind: Service metadata: name: bos-rf-svc spec: selector: app: bos-rf type: LoadBalancer ports: - protocol: TCP port: 8000

    The following code will obtain the cluster endpoint from the AKS resource and then deploy the image and service to the cluster. The configuration details for the deployclus cluster are stored in a file located in the R temporary directory; all of the cluster’s methods will use this file. Unless told otherwise, AzureContainers does not touch your default Kubernetes configuration (~/.kube/config).

    # get the cluster endpoint deployclus <- deployclus_svc$get_cluster() # pass registry authentication details to the cluster deployclus$create_registry_secret(deployreg, email="me@example.com") # create and start the service deployclus$create("bos_rf.yaml")

    To check on the progress of the deployment, run the get() methods specifying the type and name of the resource to get information on. As with Docker, these correspond to calls to the kubectl commandline tool, and again, the actual commandline is stored as the cmdline attribute of the returned value.

    deployclus$get("deployment bos-rf") #> Kubernetes operation: get deployment bos-rf --kubeconfig=".../kubeconfigxxxx" #> NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE #> bos-rf 1 1 1 1 5m deployclus$get("service bos-rf-svc") #> Kubernetes operation: get service bos-rf-svc --kubeconfig=".../kubeconfigxxxx" #> NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE #> bos-rf-svc LoadBalancer 10.0.8.189 52.187.249.58 8000:32276/TCP 5m

    Once the service is up and running, as indicated by the presence of an external IP in the service details, let’s test it with an HTTP request. The response should look like this:

    response <- httr::POST("http://52.187.249.58:8000/score", body=list(df=MASS::Boston[1:10,]), encode="json") httr::content(response, simplifyVector=TRUE) #> [1] 25.9269 22.0636 34.1876 33.7737 34.8081 27.6394 21.8007 22.3577 16.7812 18.9785

    Finally, once we are done, we can tear down the service and deployment:

    deployclus$delete("service", "bos-rf-svc") deployclus$delete("deployment", "bos-rf")

    And if required, we can also delete all the resources created here, by simply deleting the resource group (AzureContainers will prompt you for confirmation):

    deployresgrp$delete()

    See also

    An alternative to Plumber is the model operationalisation framework found in Microsoft Machine Learning Server. While it is proprietary software, unlike Plumber which is open source, ML Server provides a number of features not available in the latter. These include model management, so that you can easily access multiple versions of a given model; user authentication, so that only authorised users can access your service; and batch (asynchronous) requests. For more information, see the MMLS documentation.


    To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
