Insurance data science: Text
[This article was first published on Renglish – Freakonometrics, and kindly contributed to R-bloggers.]
At the Summer School of the Swiss Association of Actuaries in Lausanne, I will start talking about text-based data and NLP this Thursday. Slides are available online.
Ewen Gallic (AMSE) will present a tutorial on tweets. I may upload a few additional slides on LSTMs (recurrent neural nets).
Prepping data for #rstats #tidyverse and a priori planning
[This article was first published on R – christopher lortie, and kindly contributed to R-bloggers.]
Messy data can be your friend (or frenemy)
Much, if not most, data clean-up, tidying, wrangling, and joining can be done directly in R. There are many advantages to this approach – read in data in whatever format you have (from Excel to JSON to zip) and then do your tidying – including transparency, a record of what you did, reproducibility (if you ever have to do it again for another experiment, or someone else does), and reproducibility again if your data get updated and you must rinse and repeat. Additionally, an all-R workflow forces you to think about your data structure, the class of each vector (what each variable means and represents), and missing data, and it facilitates better QA/QC. Finally, once read in, your data assets are ready for the #tidyverse for joining, mutating in derived variables, or reshaping. That said, I propose that good thinking precedes good #rstats. For me, this ripples backwards from projects where I did not think ahead sufficiently: I certainly got the job done with R, and the tidyverse in particular, but it took time that would have been better spent on the statistical models later in the workflow. Consequently, here are some tips I have been thinking about this season of data collection and experimental design – pre-R preparation that matches how I like to subsequently join and tidy.
Tips
1. Keep related data in separate csv files.
For instance, I have a site with 30 long-term shrubs on which I measure morphology and growth, interactions/associations with the plant and animal community, and microclimate. I keep each set of data in a separate csv (no formatting; it keeps things simple and reads in well): shrub_morphology.csv, associations.csv, and microclimate.csv. This matches how I collect the data in the field, represents the levels of thinking and sampling, and, because I sometimes sample asynchronously, lets me open each file only as needed. You could instead keep a single Excel file with multiple sheets and read those into R, but I find things tend to get lost that way, and it is harder to check sheets in parallel than to open each file side-by-side and do some thinking. Plus, this format ensures and reminds me to write a metadata file for each data stream.
2. Always have a key vector in each data file that represents the unique sampling instance.
I like the #tidyverse dplyr::join family to put together my different data files (here is an explanation of the workflow). In the shrub example, the 30 individual shrubs structure all observations – how much they grow, what other plants and animals associate with them, and what the microclimate looks like under their canopy – so I have a vector entitled shrub_ID that demarcates each instance in space that I sample. I often also have a fourth, descriptive data file for field sampling that uses the same unique ID as the key, in which I add latitude, longitude, qualitative observations, disturbances, or other supporting data.
3. Ensure each vector/column is a single class.
You can resolve this issue later, but I prefer to keep each vector a single class, i.e. all numeric, all character, or all date-and-time.
4. Double-code confusing vectors for simplicity and error checking.
I double-code date and time vectors to simpler vectors just to be safe. I try to use readr functions like read_csv, which makes minimal and more often than not correct assumptions about the class of each vector. However, for vectors that I have struggled with in the past (and fixed in R using tidy tools or others), I now set up secondary columns that match my experimental design and sampling. If I visit my site with 30 shrubs three times in a growing season, I have a date vector that captures this rich and accurate sampling process (i.e. August 14, 2019 as a row), but I also prefer a census column that holds 1, 2, or 3 for each row. This helps me recall how often I sampled when I reinspect the data and also provides a means for quick tallies and other tools. Sometimes, if I know it is long-term data over many years, I also add a simple year column that lists 2017, 2018, and 2019. Yes, I can reverse-engineer this in R, but I like the structure – a backbone or scaffold for my dataframe that supports my thinking about statistics to match design.
5. Keep track of total unique observation instances.
I like tidy data. In each dataframe, I like a vector that provides a total tally of the length of the data as a representation of unique observations. You can wrangle it in later, and this vector does not replace the unique-ID key vector at all. In planning an experiment, I do some math: one site, 30 shrubs, 3 census events per season/year, and a total of 3 years, so 1 x 30 x 3 x 3 should be 270 unique observations or rows. I hard-code that into the data to ensure that I did not miss or forget to collect data. (It is also fulfilling to check them all off.) A double-check using tibble::rowid_to_column should confirm that count. Further to tip #2, you can use a variable or set of variables to join different dataframes, so this becomes fundamentally useful if I measured shrub growth and climate three times each year for three years: I now have a single observation_ID vector I generated that should match my hard-coded collection_ID column, and I can ensure it lines up with the census column per year, too. A tiny bit of redundancy makes it much easier to check for missing data later.
6. Leave blanks blank.
This ensures your data code true and false zeros correctly (for me this means an observed zero, i.e. no plants under the shrub at all, versus missing data) and also sticks to tip #3. My a priori rule, annotated in the metadata for each file, is that data missing altogether are coded as blank (no entry in that row/instance, but the unique ID and row are still there as placeholders), while an observed zero in the field or during the experiment is coded as 0. Do not record 'NA' as characters in a numeric column in the csv, because it flips the entire vector to character; read_csv and other functions sort this out better with blanks anyway. Leaving blanks blank also lets me impute missing values if needed.
7. Never delete data.
Further to tip #1, and other ideas described in R for Data Science, once I plan my experiment and decide on my data-collection and structural rules a priori, the data are sacred and stay intact. I have had many colleagues delete data they did not 'need' once they derived their site-level climate estimates (then lived to regret it), or delete rows because they were blank (not omitted in an #rstats workflow, but opened up in the data file and deleted). Sensu tip #5, I like a tally that matches the designed plan for the experiment, and I prefer to preserve the data structure.
8. Avoid automatic factor assignments.
Further to simple data formats (tips #1 and #4), I prefer to read in data and keep assumptions minimal until I am ready to build my models. Many packages and statistical tools do not need a vector to be factor class, and I prefer to make assignments directly to ensure statistics match the planned design for the purpose of each variable. Sometimes variables can be both: the growth of the shrub in my example is a response to the growing season and climate in some models, but a predictor in others, such as the effect of the shrub canopy on the other plants and animals. The R package forcats can help when you need to work through these decisions and the challenges of levels within factors.
Putting the different pieces together in science and data science is important. The construction of each project element – the design of the experiment, evidence and data collection, and the #rstats workflow for data wrangling, visualisation, and statistical models – suggests that a little thinking and planning beforehand (like visual Lego instruction guides) ensures all the different pieces fit together in the process of project building and writing. Design them so that they connect easily.
Sometimes you can get away without instructions and that is fun, but jamming pieces together that do not really fit and trying to pry them apart later is never really fun.
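To make tips 2 and 5 concrete, here is a minimal base-R sketch (dplyr::left_join would do the same join); the data frames and column names are invented for illustration:

```r
# Hypothetical stand-ins for shrub_morphology.csv and microclimate.csv:
# 30 shrubs x 3 censuses x 3 years = 270 rows each, sharing the same key vectors
keys  <- expand.grid(shrub_ID = 1:30, census = 1:3, year = 2017:2019)
morph <- cbind(keys, growth = rnorm(nrow(keys)))
clim  <- cbind(keys, temp   = runif(nrow(keys), 25, 35))

# Tip 2: join the data streams on the shared key vectors
both <- merge(morph, clim, by = c("shrub_ID", "census", "year"))

# Tip 5: the a priori tally -- 1 site x 30 shrubs x 3 censuses x 3 years
stopifnot(nrow(both) == 1 * 30 * 3 * 3)   # 270 unique observations
```

If the join drops or duplicates rows, the tally check fails immediately, which is exactly the kind of early warning these tips are after.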
Using the lpSolve package in R to optimise an electricity system
[This article was first published on The Manipulative Gerbil: Playing around with Energy Data, and kindly contributed to R-bloggers.]
Reducing carbon emissions is maybe the world's most pressing challenge at the moment. One obvious avenue for action is reducing carbon emissions from electricity generation, which are a significant contributor to global carbon emissions overall. This is particularly true if current trends continue undisturbed, with the world relying on electricity to power its devices, for personal comfort (air conditioners), or even for mobility. What kinds of trade-offs are desirable when deciding how much electricity to generate? How does a diversity of possible generation sources affect the best possible choice? In the following blog post, I'm going to show how lpSolve, a package available through CRAN, can be used to arrive at realistic (enough) scenarios which can inform the decarbonisation debate. Specifically, I will illustrate an example of using linear programming to find the best combination of power stations in a given economy. The data I used is from Cyprus, most of which can be found here and here, but of course any example could work.
Linear optimisation, or linear programming, is a method for finding the best possible (or least-bad) solution that fulfils a strict criterion (or, later, multiple criteria). In this example, I will use the lpSolve package, which implements the simplex algorithm, to determine the best choice given a set of constraints. A good place to read up on the fundamentals is here. In our example, the total cost of operating the entire electricity generation system of Cyprus defines our "objective function" to be minimised, while the "constraints" are, in the first instance, the minimum amount of electricity generated to meet demand on Cyprus. The rest of this post assumes only a rudimentary understanding of linear programming – not much else is needed – and focuses on how to run the model and arrive at some results.
This is a fairly long blog post, so let me try to bring down the word count by making a few things clear in these bullet points:
 The “objective function” is a simple sum of the coefficients of the hourly cost per generator per year. There are thus years × generators elements in this function.
 The constraints exist for each year of the model run – 15 years in this case. The constraints are codified in a matrix which is fed into lpSolve.

Other than the objective function and the constraints, the elements to look at are: the "right-hand side", a vector whose elements define either the maximum or minimum of a given quantity; and the "direction", which defines whether a constraint is a maximum or a minimum. Note that the "direction" of the objective function is distinct: it tells us whether we want to minimise the objective function (true in our case) or maximise it (if, for example, our objective function described the profits of a firm). If we were looking at the prices of two items, a and b, during one year, a maximum constraint might look like: price_a × quantity_a + price_b × quantity_b ≤ budget.
First, we define a linear equation for the quantity whose optimum value – either a maximum or a minimum – we want to find. In this example, the optimum is the minimum amount of money needed to operate the electricity system of the country in question. My model uses a linear expression which sums the hourly operating cost of all the generators in the system over a 15-year period.
We can then add constraints. For example, I want to set a minimum amount of electrical energy generated. This is coded in units of MWh (the global household average is about 3.5 MWh annually, though it is more than three times that in Canada; 1 MWh = 3.6 billion joules of energy). Later on, I will add constraints which place a maximum on the number of hours a given generator can operate in a year. Finally, I will add a maximum constraint on carbon dioxide emissions from the combination of generators in a given year.
From the data available, and a few other resources easily found online, I drew up the following table showing each generator type, its capacity, and its hourly running cost.
Name                    Type                        Capacity, MW   Hourly running cost, €
"Sunny"                 Solar photovoltaic          167.5          8.67
"Windy"                 Wind turbines               148            8.35
Vasilikos               Steam; fuel oil             390            194.74
Dhekelia                Steam; fuel oil             360            89.88
Combustion generator    Internal combustion engine  100            74.90
Vasilikos II            Closed-cycle natural gas    440            141.60
Vasilikos III           Open-cycle natural gas      38             31.87
There are a few things worth pointing out about the above table. Firstly, the solar and wind installations are each grouped into a single "generator", although in fact they comprise multiple generators. This has an impact on the final results, since the system decides which generator to rely on based on price, which has the effect of preferring smaller generators.
Let's move on with the objective function. I have a system with seven (i = 7) generators, each of which will run for x hours during each of j = 15 years. The values of x are determined by the algorithm; what I provide as arguments to the function are the hourly operating costs of each generator by year. These can be placed in a matrix representing i generators and j years:
Let's say these values are contained in a matrix with the variable name "bmat". The matrix can be read into R and then turned into a vector:
# Read the relevant data in (read_xls is from the readxl package)
costcoefficients.matrix <- readxl::read_xls("linearprogramming.xls", sheet = "costcoefficients")

# Turn the matrix into a vector; note the use of the transpose
costcoefficients.vector <- as.vector(t(costcoefficients.matrix))
Each element of the vector above is the hourly operating cost of one generator in one year; the vector is i * j elements long, with each element representing a generator-year. Note that, because of the transpose, after every j (= 15) elements the coefficients move on to the next generator. We eventually feed that vector into the linear programming function.
# Set the direction of the optimisation to "min" -- we want the lowest possible costs --
# and feed in the objective function
electricitysystem.results <- lpSolve::lp(direction = "min", objective.in = costcoefficients.vector)
The results produced by lpSolve are kept in electricitysystem.results$solution. It is a vector with the same length as the objective function; each element gives the number of hours a given generator runs in a given year.
Limiting the problem in this way gives a completely unhelpful answer: 0. (In fact, calling the function as above returns an error, though that sidesteps the more fundamental reason there is no viable answer yet.) All we have done is direct the algorithm to minimise the objective function, and obviously the cheapest option is to generate no electricity at all from any generator.
So in the next step, we set a minimum amount of total electricity generated in a given year – the equivalent of giving the model its next constraints. In this system, we feed these in through the constraint matrix. For example, let's say that in our first year each generator runs at its stated capacity. The minimum total electrical energy generated during that year amounts to 4,274,886 MWh (4.27 TWh), which comes from the electricity generation/consumption data for the benchmark year of 2013. This value goes into the "right-hand side" of the constraints. Meanwhile, to meet this demand, each generator works at its full capacity for a given number of hours – which is what we determine using the system.
The first row of the constraint matrix, which I have called "example1.constraints", holds each generator's capacity in the slots corresponding to year 1, and zeros everywhere else.
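As a sketch of what that first row contains (assuming the generator-major vector layout described above; the variable names are mine, not from the post):

```r
capacities <- c(167.5, 148, 390, 360, 100, 440, 38)  # MW, from the table above
n.gen <- 7
n.yr  <- 15

# In the generator-major layout, the year-1 slots sit at positions 1, 16, 31, ...
year1.row <- numeric(n.gen * n.yr)
year1.row[seq(1, by = n.yr, length.out = n.gen)] <- capacities

# Capacity (MW) x hours run gives MWh, so this row dotted with the solution
# vector is the total energy generated in year 1
```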
Note that each generator-year element that is inactive in year 1 is 0. Each row of the constraint matrix has i × j elements (number of generators times number of years), and we have one row per constraint – for now, one row per year defining the minimum electrical energy generated. Row 5 of the constraint matrix is built the same way, with its non-zero entries in the year-5 slots.
Obviously, the other elements are left as 0, since for a given year we only care about the relevant generator-years. The first 15 rows of the constraint matrix are constructed in the same way. The next task is to define the right-hand side of the constraint – the minimum value of the electricity generated. For the purposes of my model, I allowed the electricity demand to increase by 2% each year, compounding from the 2013 base value.
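That right-hand side can be coded quite simply; a sketch, with variable names of my own:

```r
base.demand <- 4274886                       # MWh in the 2013 benchmark year
demand.rhs  <- base.demand * 1.02^(0:14)     # one minimum per year, growing 2% annually

# Overall growth across the 15 modelled years is about 32%
round(demand.rhs[15] / demand.rhs[1] - 1, 2)
```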
I ought to point out that this is clearly not realistic: it implies demand for electricity on Cyprus growing by roughly 32% over 15 years, which is not on the cards in real life, but I wanted to be a bit reckless and illustrate what we get from silly numbers. So this is our Example 1 case: electricity demand increases by 2% each year for 15 years from a base year of 2013, and our objective function remains unchanged from the previous example.
Feeding this into the console and checking the results is done like this:
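Here is a self-contained sketch of that call; the constraint matrix is rebuilt from the capacities, the lpSolve call is guarded in case the package is not installed, and all names are my own:

```r
capacities <- c(167.5, 148, 390, 360, 100, 440, 38)
n.gen <- 7
n.yr  <- 15

# One minimum-generation row per year: capacity in that year's slots, 0 elsewhere
example1.constraints <- t(sapply(1:n.yr, function(y) {
  row <- numeric(n.gen * n.yr)
  row[seq(y, by = n.yr, length.out = n.gen)] <- capacities
  row
}))
example1.rhs <- 4274886 * 1.02^(0:(n.yr - 1))   # minimum MWh per year
example1.dir <- rep(">=", n.yr)

# Hourly costs from the table, repeated for each of the 15 years per generator
costcoefficients.vector <- rep(c(8.67, 8.35, 194.74, 89.88, 74.90, 141.60, 31.87),
                               each = n.yr)

if (requireNamespace("lpSolve", quietly = TRUE)) {
  example1.results <- lpSolve::lp(direction = "min",
                                  objective.in = costcoefficients.vector,
                                  const.mat   = example1.constraints,
                                  const.dir   = example1.dir,
                                  const.rhs   = example1.rhs)
  head(example1.results$solution)   # hours run, per generator-year
}
```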
There's one part of the syntax I have not yet explained: the "direction" of the constraints, which tells lpSolve how to interpret each constraint relative to the right-hand side – is it ">", "<", ">=" or "<="? For this first case we have only set minimums on electricity generated, so every element of the direction vector is ">=".
So, to recap: we have fed in the objective function; the matrix which defines the constraints; the "direction", i.e. the relationship between the constraints and the right-hand side; and the right-hand-side values themselves. The results live in the variable we created above, "example1.results", and its components can be accessed using the $ sign: for example, "example1.results$solution" gives an array showing how many hours each generator ran, arranged in the same order as the objective function.
Our first set of results is a bit disappointing and, again, meaningless: the only generator to do any work is the solar farm.
What happened here is that the system simply chose the most cost-effective generator, and because we did not limit the number of hours in a year, the solar generator ends up running for more hours than actually exist in a year (8,760). So our next step is to implement additional constraints which limit how much a given generator can run in a particular year.
We can fix this by adding new rows to the constraint matrix; by adding elements to the right-hand-side vector, this time representing the total number of hours a given generator can run in a given year; and by adding elements to the direction vector, which this time are maximums ("<=") instead of the minimums (">=") used for the amount of electricity generated.
This time, we need to add a total of 7 * 15 = 105 rows. Each row has exactly one element equal to 1 and every other element 0, since we define the number of hours for each generator-year separately. For example, the 17th of these additional rows contains the constraint for the second year of operation of the second generator.
Although it looks cumbersome, the additional rows for the constraint matrix can be created in a few short lines of code.
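One way to sketch those 105 rows: since each row has a single 1 marking its generator-year slot, the whole block is just an identity matrix.

```r
n.gen <- 7
n.yr  <- 15

# 105 rows, one per generator-year; row k has a single 1 in column k
hour.limit.rows <- diag(n.gen * n.yr)

# e.g. row 17 constrains generator 2 in its year 2 ((2 - 1) * 15 + 2 = 17)
```

These rows would then be stacked under the yearly generation constraints with rbind() before the lp() call.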
How do we define the maximum number of hours of operation for a given generator in a specific year? For this, I use a predefined capacity factor for each generator – here, the proportion of the 8,760 hours in a year for which a generator can be turned "on". In my own example, I leave the capacity factor unchanged for each generator over the years, although strictly speaking this is not necessarily accurate. In practice, I took the values from a spreadsheet read into R and multiplied each row – which held the capacity factors for a given generator – by 8,760, giving me 105 values holding the total hours of operation per generator-year.
It should be easy enough to see how these are then incorporated into the vector containing the right-hand-side values; for comparison, I filled in the total operating hours in the fourth column below. In the case of the solar farm, I allowed the capacity factor to grow by 1.2% per year, while all other capacity factors are held constant for all years of operation.

Generator                       Capacity factor (maximum), %   Annual increase (%)   Operating hours/year
Solar                           8.6                            1.2                   753.87
Wind                            14.4                           stable                1270
Vasilikos I                     75                             stable                6570
Dhekelia                        75                             stable                6570
Internal combustion generator   75                             stable                6570
Vasilikos II                    75                             stable                6570
Vasilikos III                   75                             stable                6570

In other words, I want to allow the thermal (i.e. gas and oil) generators to run for at most 75% of the hours of the year. To do this, I set the relevant element in the constraint matrix equal to 1, with all other elements in the row 0. This says: for generator i in year j, let the total number of hours running be at most CF * 8760, where CF is the maximum capacity factor of that generator. (The constraint row for, say, the sixth year of the first generator has its single 1 in the corresponding slot.)
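The corresponding right-hand-side values can be sketched like this (only solar's capacity factor grows, by 1.2% per year; the variable names are my own):

```r
cf   <- c(0.086, 0.144, 0.75, 0.75, 0.75, 0.75, 0.75)  # maximum capacity factors
n.yr <- 15

# 105 values in generator-major order, matching the 105 hour-limit rows
hours.rhs <- as.vector(sapply(seq_along(cf), function(g) {
  growth <- if (g == 1) 1.012^(0:(n.yr - 1)) else rep(1, n.yr)
  cf[g] * growth * 8760
}))

# Solar starts at about 753 hours/year; the thermal plants stay at 6,570
```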
The corresponding entry in the right-hand-side vector is then equal to the maximum capacity factor multiplied by 8,760 (which for the solar generator in the second year is ~800 hours). This time the chart of operating hours by generator is a little more interesting. Figure 1: there's plenty wrong with the way this graph is presented that I won't even mention, but the general idea is right.
What happened here is that the system "chose" different generators to varying extents based not only on hourly operating cost but also on how much of the time each could run. Our third and final run forces the model to take into account an upper limit on CO2-equivalent emissions from each generator. I did this by adapting the carbon intensity of each generator – the amount of greenhouse-gas emissions per unit of energy, expressed in kg/MWh – into a value per hour of running. (This adaptation, again, is important.)
If the limit on carbon emissions in year j is M_j megatons, and the hourly emissions from generator i are given by the coefficient a_ij, then the emissions cap can be expressed as Σ_i a_ij · x_ij ≤ M_j, where x_ij is the number of hours generator i runs in year j. The emissions profiles for the generators are in the table below. As before, I add elements to the right-hand-side vector containing the caps on carbon emissions expressed in kg; the direction vector is this time populated with "<="; and there are additional rows in the constraint matrix containing coefficients equal to the per-hour carbon intensities, the same as in the table below. Note that we could change the hourly carbon intensities for different years (but I don't).
Generator                       Hourly carbon intensity, kg/hour of operation
Solar                           —
Wind                            —
Vasilikos I                     285,870
Dhekelia                        263,880
Internal combustion generator   73,300
Vasilikos II                    158,035
Vasilikos III                   19,497
Obviously I have to convert Cyprus' carbon emissions from megatons to kg (1 Mt = 1 × 10^9 kg). The value I start with is based on the carbon emissions from electricity generation on Cyprus in 2013 (assuming electricity generation was responsible for 50% of Cypriot carbon emissions). I then reduce the caps by 2% per year; these values go into the right-hand side. Remember that, here again, the corresponding elements of the direction vector are "<=".
The rows of the constraint matrix containing the relevant coefficients are again defined by generator-year. This time, every generator contributing to carbon emissions in year j is given the full hourly value shown in the carbon-intensity table above, giving 15 additional rows of data.
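Those rows can be sketched in the same way as the yearly generation constraints, swapping capacities for hourly carbon intensities:

```r
intensity <- c(0, 0, 285870, 263880, 73300, 158035, 19497)  # kg per operating hour
n.gen <- 7
n.yr  <- 15

# One row per year: each generator's hourly intensity in that year's slots
carbon.rows <- t(sapply(1:n.yr, function(y) {
  row <- numeric(n.gen * n.yr)
  row[seq(y, by = n.yr, length.out = n.gen)] <- intensity
  row
}))

# Row 3, for example, holds the intensities in every generator's year-3 slot
```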
For this final run, example 3, I will focus on just one of the thermal power plants: the natural gas generator at Vasilikos, or "Vasilikos I". The plot below shows how, with each successive year, the Vasilikos NG generator approaches its capacity ceiling of 75% ever more quickly.
What I wanted to illustrate graphically – because I really should cut down on words by now – is how the load on thermal generators changes when a carbon cap is taken into account. Instead of the stable flat line of example 2 above, this time the system utilises generators to differing extents, and it naturally chose the (less carbon-intense) natural gas generator over the fuel-oil generators.
Epilogue and caveats
 There is no way that the above model is an "accurate" description of the electricity situation in Cyprus, but it is a fun way to start thinking about the choices small island nations must make when dealing with existing power infrastructure.
 Without a doubt, the biggest flaw in the model presented here is that it does not allow for the intermittency of the renewable generators. A more complete system would model all 8,760 hours of each year and set the caps for the renewable generators based on how much wind and solar power is actually available in a given hour. These data are obtainable from the "Typical Meteorological Year" for Cyprus. The resulting model would of course be much larger, but not otherwise more complex. I was supposed to start work on such a model, but I'm also about to take my summer holidays, so…
 Another avenue would be to allow the model to run individual generators only partially. I think this is a bit more tricky and I haven't fully thought it out yet.
 There are a few consequences of defining the solar farm and wind farm as individual stand-alone generators, but this was done out of expediency and can easily be fixed.
What is vtreat?
[This article was first published on R – WinVector Blog, and kindly contributed to R-bloggers.]
vtreat is a DataFrame processor/conditioner that prepares real-world data for supervised machine learning or predictive modeling in a statistically sound manner.
vtreat takes an input DataFrame with a specified column called "the outcome variable" (or "y") – the quantity to be predicted, which must not have missing values. The other input columns are possible explanatory variables (typically numeric or categorical/string-valued; these columns may have missing values) that the user later wants to use to predict "y". In practice, such an input DataFrame may not be immediately suitable for machine learning procedures, which often expect only numeric explanatory variables and may not tolerate missing values.
To solve this, vtreat builds a transformed DataFrame in which all explanatory columns have been converted into numeric columns with no missing values. The vtreat implementation produces derived numeric columns that capture most of the information relating the explanatory columns to the specified "y" (dependent/outcome) column through a number of numeric transforms (indicator variables, impact codes, prevalence codes, and more). This transformed DataFrame is suitable for a wide range of supervised learning methods, from linear regression through gradient-boosted machines.
The idea is: you can take a DataFrame of messy real-world data and easily, faithfully, reliably, and repeatably prepare it for machine learning with documented methods using vtreat. Incorporating vtreat into your machine learning workflow lets you quickly work with very diverse structured data.
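As a minimal sketch of the R workflow (the data frame and column names are invented for illustration, and the calls are guarded in case vtreat is not installed):

```r
# A tiny messy frame: a numeric column and a categorical column, both with NAs
d <- data.frame(x_num = c(1, 2, NA, 4, 5, NA),
                x_cat = c("a", "b", "a", NA, "b", "b"),
                y     = c(1.1, 2.3, 1.9, 4.2, 5.0, 5.5))

if (requireNamespace("vtreat", quietly = TRUE)) {
  # Design a treatment plan for a numeric outcome "y" ...
  plan <- vtreat::designTreatmentsN(d, varlist = c("x_num", "x_cat"),
                                    outcomename = "y")
  # ... then apply it: the result is all-numeric with no missing values
  d.treated <- vtreat::prepare(plan, d)
  stopifnot(!any(is.na(d.treated)), all(sapply(d.treated, is.numeric)))
}
```

The corresponding Python/Pandas package follows the same design/prepare pattern.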
Worked examples can be found here.
For more detail please see here: arXiv:1611.09477 stat.AP (the documentation describes the R version, however all of the examples can be found worked in Python here).
vtreat is available as a Python/Pandas package, and also as an R package.
(logo: Julie Mount, source: “The Harvest” by Boris Kustodiev 1914)
Some operational examples can be found here.
Plotting Pairs (GC4W6HJ)
[This article was first published on geocacheR, and kindly contributed to R-bloggers].
Planning for a trip to Cornwall, I was churning through some mysteries there, largely picking them at random from the map. I worked my way through this one by hand, copying the puzzle into a text editor and separating the pairs manually before finally transferring it into RStudio and plotting the result. I moved on.
Later, I twigged that this would be an ideal demonstration of several powerful aspects of string manipulation in R, so I returned and wrote a program to do the same.
The puzzle is straightforward: just plot the coordinate pairs and you should see the cache coordinates spelled out in your plot.
I chose to use a regular expression to extract the contents of each pair of parentheses, using the str_extract_all function from the stringr package. This “regex” approach is powerful but also makes for a daunting read. If you are new to regular expressions, this website can help you understand them a little better. There is also a cheatsheet section to help remind more experienced users. Regular expressions are definitely worth the trouble.
library(tidyverse)
library(magrittr)
library(stringr)

B10 <- "(3,1)(3,5)(2,3)(1,5)(1,1) (8,9)(10,9)(10,7)(9,7)(8,5)(10,5) (7,5)(7,9)(5,9)(5,5)(7,5) (5,6)(6,8)(6,10)(4,10)(4,8)(6,8) (2,9)(1,7)(1,5)(3,5)(3,7)(1,7) (5,8)(3,8)(3,10)(5,10)(5,6)(4,6)(4,8) (2,1)(0,1)(0,3)(1,3)(2,5)(0,5) (2,6)(0,6)(0,8)(1,8)(2,10)(0,10) (13,5)(13,9)(11,5)(11,9) (12,7)(10,7)(10,5)(12,5)(12,9)(11,9)(11,7) (1,8)(3,8)(3,6)(1,6)(1,8)(2,10) (8,9)(7,7)(7,5)(9,5)(9,7)(7,7) (4,5)(4,9)(6,9)(6,5)(4,5) (2,8)(1,9)(1,5)"

B10 %>%
  str_extract_all("(?<=\\()([^()]+)(?=\\))") %>%
  unlist() %>%
  paste(collapse = ",") %>%
  str_split(",") %>%
  unlist() %>%
  as.numeric() %>%
  matrix(ncol = 2, byrow = TRUE) %>%
  plot(type = "l")

[Plot: part of the coordinates redacted to prevent spoilers]

I chose to collapse the pairs of numbers into one big vector and make them into a matrix, which I then plotted with a line plot. It’s not tidy because of the lines joining different characters, but it’s legible.
I just noticed that there are spaces in the original text, probably to separate individual characters. Therefore I could have made something even more sophisticated, separating by spaces first and producing a tidier plot. But I’ve solved this puzzle twice now; a third time just for perfection would be overkill.
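Still, for the curious, that space-separated variant might look something like this sketch (using a shortened stand-in for the full puzzle string, so the pairs here are only the first two "characters"):

```r
library(stringr)
library(magrittr)

# stand-in for the full puzzle string: two characters separated by a space
B10 <- "(3,1)(3,5)(2,3)(1,5)(1,1) (8,9)(10,9)(10,7)(9,7)(8,5)(10,5)"

# split on whitespace so each character becomes its own group of pairs
chars <- unlist(str_split(B10, "\\s+"))

plot(NULL, xlim = c(0, 14), ylim = c(0, 11), xlab = "", ylab = "")
for (ch in chars) {
  ch %>%
    str_extract_all("(?<=\\()([^()]+)(?=\\))") %>%
    unlist() %>%
    str_split(",") %>%
    unlist() %>%
    as.numeric() %>%
    matrix(ncol = 2, byrow = TRUE) %>%
    lines()  # each character drawn separately, so no joining lines
}
```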
The bulk of the work in this puzzle is preparing the data, which is similar to many data visualisation or analysis workflows. stringr is the main engine of my solution, plucking the pairs from the parentheses, and again splitting them by the separator in order to coerce messy character data into a clean set of numbers that can be plotted. And it’s scalable, unlike the by-hand method I first used. This task involved 156 numbers, but it could just as easily have been 1560, making the manual approach a poor use of solving time.
Using linear models with binary dependent variables, a simulation study
[This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers].
This blog post is an excerpt of my ebook Modern R with the tidyverse that you can read for free
here.
This is taken from Chapter 8, in which I discuss advanced functional programming methods for
modeling.
As written just above (note: as written above in the book), map() simply applies a function
to a list of inputs, and in the previous
section we mapped ggplot() to generate many plots at once. This approach can also be used to
map any modeling functions, for instance lm() to a list of datasets.
For instance, suppose that you wish to perform a Monte Carlo simulation. Suppose that you are
dealing with a binary choice problem; usually, you would use a logistic regression for this.
However, in certain disciplines, especially in the social sciences, the so-called Linear Probability
Model is often used as well. The LPM is a simple linear regression, but unlike the standard setting
of a linear regression, the dependent variable, or target, is a binary variable, and not a continuous
variable. Before you yell “Wait, that’s illegal”, you should know that in practice LPMs do a good
job of estimating marginal effects, which is what social scientists and econometricians are often
interested in. Marginal effects are another way of interpreting models, giving how the outcome
(or the target) changes given a change in an independent variable (or a feature). For instance,
a marginal effect of 0.10 for age would mean that the probability of success increases by 10 percentage points for
each added year of age.
There has been a lot of discussion on logistic regression vs LPMs, and there are pros and cons
of using LPMs. Microeconometricians are still fond of LPMs, even though the pros of LPMs are
not really convincing. However, quoting Angrist and Pischke:
“While a nonlinear model may fit the CEF (population conditional expectation function) for LDVs
(limited dependent variables) more closely than a linear model, when it comes to marginal effects,
this probably matters little” (source: Mostly Harmless Econometrics)
so LPMs are still used for estimating marginal effects.
Let us check this assessment with one example. First, we simulate some data, then
run a logistic regression and compute the marginal effects, and then compare with a LPM:
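The simulation code did not survive in this excerpt; a minimal sketch of a comparable data generating process (the coefficient values and seed are assumptions, not the original chunk) would be:

```r
library(tibble)

set.seed(1234)
n <- 100
coefs <- c(.5, 2, 4)  # assumed structural parameters: intercept, x1, x2

# simulate a binary choice model: latent index pushed through a logistic link
x1 <- rnorm(n)
x2 <- rnorm(n)
z  <- coefs[1] + coefs[2] * x1 + coefs[3] * x2
y  <- rbinom(n, size = 1, prob = 1 / (1 + exp(-z)))

df <- tibble(y, x1, x2)

logistic_regression <- glm(y ~ ., data = df, family = binomial(link = "logit"))
```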
This data generating process generates data from a binary choice model. Fitting the model using a
logistic regression allows us to recover the structural parameters:
Let’s see a summary of the model fit:
summary(logistic_regression)
##
## Call:
## glm(formula = y ~ ., family = binomial(link = "logit"), data = df)
##
## Deviance Residuals:
##      Min        1Q    Median        3Q       Max
## -2.91941  -0.44872   0.00038   0.42843   2.55426
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   0.0960     0.3293   0.292 0.770630
## x1            1.6625     0.4628   3.592 0.000328 ***
## x2            3.6582     0.8059   4.539 5.64e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 138.629 on 99 degrees of freedom
## Residual deviance:  60.576 on 97 degrees of freedom
## AIC: 66.576
##
## Number of Fisher Scoring iterations: 7

We do recover the parameters that generated the data, but what about the marginal effects? We can
get the marginal effects easily using the {margins} package:
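The call itself is not reproduced in this excerpt; with the fitted model it would be along these lines (a sketch, assuming the logistic_regression object fitted above):

```r
library(margins)

# average marginal effects on the probability scale
effects_logit <- margins(logistic_regression)
summary(effects_logit)
```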
Or, even better, we can compute the true marginal effects, since we know the data
generating process:
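That computation is also missing from this excerpt. For a logit model, the true average marginal effect of a regressor is its coefficient times the mean of the logistic density evaluated at the linear index; a self-contained sketch (the coefficient values are assumptions):

```r
set.seed(1234)
coefs <- c(.5, 2, 4)  # assumed structural parameters
x1 <- rnorm(100)
x2 <- rnorm(100)
z <- coefs[1] + coefs[2] * x1 + coefs[3] * x2

# true average marginal effects: beta_j * mean(dlogis(z))
me_x1 <- mean(dlogis(z)) * coefs[2]
me_x2 <- mean(dlogis(z)) * coefs[3]
```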
Ok, so now what about using this infamous Linear Probability Model to estimate the marginal effects?
lpm <- lm(y ~ ., data = df)
summary(lpm)
##
## Call:
## lm(formula = y ~ ., data = df)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.83953 -0.31588  0.02885  0.28774  0.77407
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.51340    0.03587  14.314  < 2e-16 ***
## x1           0.16771    0.03545   4.732 7.58e-06 ***
## x2           0.31250    0.03449   9.060 1.43e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3541 on 97 degrees of freedom
## Multiple R-squared: 0.5135, Adjusted R-squared: 0.5034
## F-statistic: 51.18 on 2 and 97 DF, p-value: 6.693e-16

It’s not too bad, but maybe it could have been better in other circumstances. Perhaps if we had more
observations, or perhaps for a different set of structural parameters the results of the LPM
would have been closer. The LPM estimates the marginal effect of x1 to be
0.1677134 vs 0.1597956
for the logistic regression and for x2, the LPM estimation is 0.3124966
vs 0.351607. The true marginal effects are
0.1750963 and 0.3501926 for x1 and x2 respectively.
Just as data scientists perform cross-validation to assess the accuracy of a model, a Monte Carlo
study can be performed to assess how close the estimation of the marginal effects using a LPM is
to the marginal effects derived from a logistic regression. It will allow us to test with datasets
of different sizes, and generated using different structural parameters.
First, let’s write a function that generates data. The function below generates 10 datasets of size
100 (the code is inspired by this StackExchange answer):
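The function itself is not shown in this excerpt; a sketch that reproduces the structure described below (a one-row tibble whose simulations list-column holds a list of simulated data frames) could look like this, with the defaults mirroring the text and the coefficient values being assumptions:

```r
library(tidyverse)

generate_datasets <- function(coefs = c(.5, 2, 4), sample_size = 100, repeats = 10) {
  # one simulated data set from the binary choice DGP
  generate_one <- function() {
    x1 <- rnorm(sample_size)
    x2 <- rnorm(sample_size)
    z  <- coefs[1] + coefs[2] * x1 + coefs[3] * x2
    tibble(y = rbinom(sample_size, 1, plogis(z)), x1, x2)
  }
  simulations <- map(seq_len(repeats), ~generate_one())
  # the data sets end up nested two levels deep inside a list-column
  tibble(coefs = list(coefs), sample_size = sample_size,
         repeats = repeats, simulations = list(simulations))
}
```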
Let’s first generate one dataset:
one_dataset <- generate_datasets(repeats = 1)

Let’s take a look at one_dataset:

one_dataset
## # A tibble: 1 x 4
##   coefs     sample_size repeats simulations
##   &lt;list&gt;          &lt;dbl&gt;   &lt;dbl&gt; &lt;list&gt;
## 1 &lt;dbl [3]&gt;         100       1 &lt;list [1]&gt;

As you can see, the tibble with the simulated data is inside a list-column called simulations.
Let’s take a closer look:
The structure is quite complex, and it’s important to understand this, because it will have an
impact on the next lines of code; it is a list, containing a list, containing a dataset! No worries
though, we can still map over the datasets directly, by using modify_depth() instead of map().
Now, let’s fit a LPM and compare the estimation of the marginal effects with the true marginal
effects. In order to have some confidence in our results,
we will not simply run a linear regression on that single dataset, but will instead simulate hundreds,
then thousands, and tens of thousands of data sets, get the marginal effects and compare
them to the true ones (but here I won’t simulate more than 500 datasets).
Let’s first generate 10 datasets:
many_datasets <- generate_datasets()

Now comes the tricky part. I have this object, many_datasets, looking like this:

many_datasets
## # A tibble: 1 x 4
##   coefs     sample_size repeats simulations
##   &lt;list&gt;          &lt;dbl&gt;   &lt;dbl&gt; &lt;list&gt;
## 1 &lt;dbl [3]&gt;         100      10 &lt;list [10]&gt;

I would like to fit LPMs to the 10 datasets. For this, I will need to use all the power of functional
programming and the {tidyverse}. I will be adding columns to this data frame using mutate()
and mapping over the simulations list-column using modify_depth(). The list of data frames is
at the second level (remember, it’s a list containing a list containing data frames).
I’ll start by fitting the LPMs, then using broom::tidy() I will get a nice data frame of the
estimated parameters. I will then only select what I need, and then bind the rows of all the
data frames. I will do the same for the true marginal effects.
I highly suggest that you run the following lines one after another. It is complicated to understand
what’s going on if you are not used to such workflows. However, I hope to convince you that once
it clicks, it’ll be much more intuitive than doing all this inside a loop. Here’s the code:
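The code chunk itself did not survive in this excerpt; the following is my own sketch of the pipeline the text describes (the stand-in many_datasets object and the structural coefficients are assumptions; the real object comes from generate_datasets()):

```r
library(tidyverse)
library(broom)

true_coefs <- c(.5, 2, 4)  # assumed structural parameters

# stand-in for many_datasets: one row, with a simulations list-column
# holding a list of simulated data frames (only 2 here, to keep it light)
many_datasets <- tibble(
  coefs = list(true_coefs), sample_size = 100, repeats = 2,
  simulations = list(map(1:2, function(i) {
    x1 <- rnorm(100); x2 <- rnorm(100)
    z <- true_coefs[1] + true_coefs[2] * x1 + true_coefs[3] * x2
    tibble(y = rbinom(100, 1, plogis(z)), x1, x2)
  }))
)

results <- many_datasets %>%
  # fit an LPM to each data frame (they sit at depth 2 of the list-column)
  mutate(lpm = modify_depth(simulations, 2, ~lm(y ~ ., data = .x))) %>%
  # tidy each fit and keep only the slope estimates
  mutate(lpm = modify_depth(lpm, 2, ~filter(tidy(.x), term != "(Intercept)"))) %>%
  mutate(lpm = modify_depth(lpm, 2, ~select(.x, term, estimate))) %>%
  mutate(lpm = map(lpm, bind_rows)) %>%
  # true marginal effects: coefficient times the mean logistic density
  mutate(true_effect = modify_depth(simulations, 2, function(sim) {
    z <- true_coefs[1] + true_coefs[2] * sim$x1 + true_coefs[3] * sim$x2
    tibble(term = c("x1", "x2"),
           true_effect = mean(dlogis(z)) * true_coefs[2:3])
  })) %>%
  mutate(true_effect = map(true_effect, bind_rows))
```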
This is what results looks like:

results
## # A tibble: 1 x 6
##   coefs     sample_size repeats simulations lpm        true_effect
##   &lt;list&gt;          &lt;dbl&gt;   &lt;dbl&gt; &lt;list&gt;      &lt;list&gt;     &lt;list&gt;
## 1 &lt;dbl [3]&gt;         100      10 &lt;list [10]&gt; &lt;list [1]&gt; &lt;list [1]&gt;

Let’s take a closer look at the lpm and true_effect columns:
results$lpm
## [[1]]
## # A tibble: 20 x 2
##    term  estimate
##  1 x1       0.228
##  2 x2       0.353
##  3 x1       0.180
##  4 x2       0.361
##  5 x1       0.165
##  6 x2       0.374
##  7 x1       0.182
##  8 x2       0.358
##  9 x1       0.125
## 10 x2       0.345
## 11 x1       0.171
## 12 x2       0.331
## 13 x1       0.122
## 14 x2       0.309
## 15 x1       0.129
## 16 x2       0.332
## 17 x1       0.102
## 18 x2       0.374
## 19 x1       0.176
## 20 x2       0.410

results$true_effect
## [[1]]
## # A tibble: 20 x 2
##    term  true_effect
##  1 x1          0.183
##  2 x2          0.366
##  3 x1          0.166
##  4 x2          0.331
##  5 x1          0.174
##  6 x2          0.348
##  7 x1          0.169
##  8 x2          0.339
##  9 x1          0.167
## 10 x2          0.335
## 11 x1          0.173
## 12 x2          0.345
## 13 x1          0.157
## 14 x2          0.314
## 15 x1          0.170
## 16 x2          0.340
## 17 x1          0.182
## 18 x2          0.365
## 19 x1          0.161
## 20 x2          0.321

Let’s bind the columns, and compute the difference between the true and estimated marginal
effects:
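The binding code is also missing here; a sketch of one way to do it, assuming a results object carrying lpm and true_effect list-columns shaped as printed just above (this fragment is not standalone):

```r
library(tidyverse)

# bind the LPM estimates and the true effects side by side,
# then compute the estimation error for each simulated data set
simulation_results <- bind_cols(
  results$lpm[[1]],
  results$true_effect[[1]] %>% select(true_effect)
) %>%
  mutate(difference = true_effect - estimate)
```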
Let’s take a look at the simulation results:
simulation_results %>%
  group_by(term) %>%
  summarise(mean = mean(difference), sd = sd(difference))
## # A tibble: 2 x 3
##   term    mean     sd
##   &lt;chr&gt;  &lt;dbl&gt;  &lt;dbl&gt;
## 1 x1    0.0122 0.0370
## 2 x2    0.0141 0.0306

Already with only 10 simulated datasets, the difference in means is not significant. Let’s rerun
the analysis, but for different sizes. In order to make things easier, we can put all the code
into a nifty function:
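The function body is not reproduced in this excerpt; here is a self-contained sketch that folds the simulation, the LPM fitting, and the comparison into one function (my reconstruction, not the original code):

```r
library(tidyverse)
library(broom)

monte_carlo <- function(coefs, sample_size, repeats) {
  # simulate `repeats` data sets from the binary choice DGP
  sims <- map(seq_len(repeats), function(i) {
    x1 <- rnorm(sample_size); x2 <- rnorm(sample_size)
    z <- coefs[1] + coefs[2] * x1 + coefs[3] * x2
    tibble(y = rbinom(sample_size, 1, plogis(z)), x1, x2)
  })

  # for each data set: LPM slope estimates vs. true marginal effects
  map_dfr(sims, function(sim) {
    est <- tidy(lm(y ~ ., data = sim)) %>%
      filter(term != "(Intercept)") %>%
      select(term, estimate)
    z <- coefs[1] + coefs[2] * sim$x1 + coefs[3] * sim$x2
    est %>%
      mutate(true_effect = mean(dlogis(z)) * coefs[2:3],
             difference = true_effect - estimate)
  }) %>%
    group_by(term) %>%
    summarise(mean = mean(difference), sd = sd(difference))
}
```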
And now, let’s run the simulation for different parameters and sizes:
monte_carlo(c(.5, 2, 4), 100, 10)
## # A tibble: 2 x 3
##   term     mean     sd
##   &lt;chr&gt;   &lt;dbl&gt;  &lt;dbl&gt;
## 1 x1    0.00826 0.0291
## 2 x2    0.00732 0.0412

monte_carlo(c(.5, 2, 4), 100, 100)
## # A tibble: 2 x 3
##   term     mean     sd
##   &lt;chr&gt;   &lt;dbl&gt;  &lt;dbl&gt;
## 1 x1    0.00360 0.0392
## 2 x2    0.00517 0.0446

monte_carlo(c(.5, 2, 4), 100, 500)
## # A tibble: 2 x 3
##   term      mean     sd
##   &lt;chr&gt;    &lt;dbl&gt;  &lt;dbl&gt;
## 1 x1    0.00152  0.0371
## 2 x2    0.000701 0.0423

monte_carlo(c(pi, 6, 9), 100, 10)
## # A tibble: 2 x 3
##   term     mean     sd
##   &lt;chr&gt;   &lt;dbl&gt;  &lt;dbl&gt;
## 1 x1    0.00829 0.0546
## 2 x2    0.00178 0.0370

monte_carlo(c(pi, 6, 9), 100, 100)
## # A tibble: 2 x 3
##   term     mean     sd
##   &lt;chr&gt;   &lt;dbl&gt;  &lt;dbl&gt;
## 1 x1    0.0107  0.0608
## 2 x2    0.00831 0.0804

monte_carlo(c(pi, 6, 9), 100, 500)
## # A tibble: 2 x 3
##   term     mean     sd
##   &lt;chr&gt;   &lt;dbl&gt;  &lt;dbl&gt;
## 1 x1    0.00879 0.0522
## 2 x2    0.0113  0.0668

We see that, at least for this set of parameters, the LPM does a good job of estimating marginal
effects.
Now, this study might in itself not be very interesting to you, but I believe the general approach
is quite useful and flexible enough to be adapted to all kinds of usecases.
Hope you enjoyed! If you found this blog post useful, you might want to follow
me on twitter for blog post updates and
buy me an espresso or paypal.me.
(Bootstrapping) Follow-Up Contrasts for Within-Subject ANOVAs (part 2)
[This article was first published on R on I Should Be Writing: Now Sugar Free!, and kindly contributed to R-bloggers].
 1. Fit your repeatedmeasures model with lmer
 2. Define the contrast(s) of interest
 3. Run the bootstrap
 Summary
A while back I wrote a post demonstrating how to bootstrap follow-up contrasts for repeated-measures ANOVAs for cases where
your data violate some (or any) assumptions. Here is a demo of how to conduct the same bootstrap analysis more simply (no need to make your data wide!).
Note that I assume here that the data is aggregated (one value per cell/subject), as it would be in an rm-ANOVA, and so it is sufficient to model only a random intercept.
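The model-fitting code for step 1 is not part of this excerpt. Judging from the factor levels in the output below (treatment, gender, phase, hour), the data resembles the classic O'Brien–Kaiser set shipped with afex, so step 1 could be sketched as follows; the data set, formula, and the rm_levels construction are all my assumptions:

```r
library(lme4)
library(afex)  # ships the obk.long example data

data(obk.long, package = "afex")

# aggregated data (one value per subject per cell), so a random
# intercept per subject is all the random structure we need
fit_mixed <- lmer(value ~ treatment * gender * phase * hour + (1 | id),
                  data = obk.long)

# levels of the within-subject factors, referred to below as rm_levels
# (how exactly ref_grid() consumes these depends on the model setup)
rm_levels <- list(phase = c("pre", "post", "fup"),
                  hour  = as.character(1:5))
```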
2. Define the contrast(s) of interest

For this step we will be using emmeans to get the estimates of the pairwise differences between the treatment groups within each phase of the study:
library(emmeans)

# get the correct reference grid with the correct multivariate levels!
rg <- ref_grid(fit_mixed, mult.levs = rm_levels)
rg
## 'emmGrid' object with variables:
##     treatment = control, A, B
##     gender = F, M
##     phase = fup, post, pre
##     hour = 1, 2, 3, 4, 5

# get the expected means:
em_ <- emmeans(rg, ~ phase * treatment)
## NOTE: Results may be misleading due to involvement in interactions
em_
##  phase treatment emmean    SE   df lower.CL upper.CL
##  fup   control     4.33 0.603 13.2     3.03     5.64
##  post  control     4.08 0.603 13.2     2.78     5.39
##  pre   control     4.25 0.603 13.2     2.95     5.55
##  fup   A           7.25 0.661 13.2     5.82     8.68
##  post  A           6.50 0.661 13.2     5.07     7.93
##  pre   A           5.00 0.661 13.2     3.57     6.43
##  fup   B           7.29 0.505 13.2     6.20     8.38
##  post  B           6.62 0.505 13.2     5.54     7.71
##  pre   B           4.17 0.505 13.2     3.08     5.26
##
## Results are averaged over the levels of: gender, hour
## Degrees-of-freedom method: kenward-roger
## Confidence level used: 0.95

# run pairwise tests between the treatment groups within each phase
c_ <- contrast(em_, "pairwise", by = 'phase')
c_
## phase = fup:
##  contrast    estimate    SE   df t.ratio p.value
##  control - A  -2.9167 0.895 13.2  -3.259  0.0157
##  control - B  -2.9583 0.787 13.2  -3.760  0.0061
##  A - B        -0.0417 0.832 13.2  -0.050  0.9986
##
## phase = post:
##  contrast    estimate    SE   df t.ratio p.value
##  control - A  -2.4167 0.895 13.2  -2.700  0.0445
##  control - B  -2.5417 0.787 13.2  -3.230  0.0166
##  A - B        -0.1250 0.832 13.2  -0.150  0.9876
##
## phase = pre:
##  contrast    estimate    SE   df t.ratio p.value
##  control - A  -0.7500 0.895 13.2  -0.838  0.6869
##  control - B   0.0833 0.787 13.2   0.106  0.9938
##  A - B         0.8333 0.832 13.2   1.002  0.5885
##
## Results are averaged over the levels of: gender, hour
## P value adjustment: tukey method for comparing a family of 3 estimates

# extract the estimates
est_names <- c("fup: control - A", "fup: control - B", "fup: A - B",
               "post: control - A", "post: control - B", "post: A - B",
               "pre: control - A", "pre: control - B", "pre: A - B")
est_values <- summary(c_)$estimate
names(est_values) <- est_names
est_values
##  fup: control - A  fup: control - B        fup: A - B post: control - A
##       -2.91666667       -2.95833333       -0.04166667       -2.41666667
## post: control - B       post: A - B  pre: control - A  pre: control - B
##       -2.54166667       -0.12500000       -0.75000000        0.08333333
##        pre: A - B
##        0.83333333

3. Run the bootstrap

Now let’s wrap this all in a function that accepts the fitted model as an argument:
treatment_phase_contrasts <- function(mod){
  rg <- ref_grid(mod, mult.levs = rm_levels)

  # get the expected means:
  em_ <- emmeans(rg, ~ phase * treatment)

  # run pairwise tests between the treatment groups within each phase
  c_ <- contrast(em_, "pairwise", by = 'phase')

  # extract the estimates
  est_names <- c("fup: control - A", "fup: control - B", "fup: A - B",
                 "post: control - A", "post: control - B", "post: A - B",
                 "pre: control - A", "pre: control - B", "pre: A - B")
  est_values <- summary(c_)$estimate
  names(est_values) <- est_names
  est_values
}

# test it
treatment_phase_contrasts(fit_mixed)
## NOTE: Results may be misleading due to involvement in interactions
##  fup: control - A  fup: control - B        fup: A - B post: control - A
##       -2.91666667       -2.95833333       -0.04166667       -2.41666667
## post: control - B       post: A - B  pre: control - A  pre: control - B
##       -2.54166667       -0.12500000       -0.75000000        0.08333333
##        pre: A - B
##        0.83333333

Finally, we will use lme4::bootMer to get the bootstrapped estimates!
treatment_phase_results <- bootMer(fit_mixed, treatment_phase_contrasts,
                                   nsim = 50) # R = 599 at least
## NOTE: Results may be misleading due to involvement in interactions

summary(treatment_phase_results) # original vs. bootstrapped estimate (bootMed)
##
## Number of bootstrap replications R = 50
##                    original bootBias  bootSE   bootMed
## fup: control - A  -2.916667 0.017263 0.77841 -2.801902
## fup: control - B  -2.958333 0.017880 0.86119 -3.025705
## fup: A - B        -0.041667 0.035143 0.98850  0.066474
## post: control - A -2.416667 0.031072 0.82654 -2.383370
## post: control - B -2.541667 0.024860 0.82351 -2.520263
## post: A - B       -0.125000 0.055932 1.03670  0.216929
## pre: control - A  -0.750000 0.065397 0.73276 -0.851533
## pre: control - B   0.083333 0.024664 0.78592  0.111930
## pre: A - B         0.833333 0.090061 0.95015  0.994195

confint(treatment_phase_results, type = "perc") # does it include zero?
##                        2.5 %     97.5 %
## fup: control - A  -5.062951 -1.2782764
## fup: control - B  -4.985715 -1.0325666
## fup: A - B        -2.348035  2.1660820
## post: control - A -4.451445 -0.5162071
## post: control - B -4.840519 -1.1705024
## post: A - B       -2.349137  2.3025369
## pre: control - A  -2.427992  0.8830127
## pre: control - B  -1.915388  1.7159931
## pre: A - B        -1.530049  2.7527436

Results indicate that the Control group is lower than both treatment groups in the post and fup (follow-up) phases.
If we wanted p-values, we could use this little function (based on this demo):
boot_pvalues <- function(x, side = c(0, -1, 1)) {
  # Based on:
  # https://blogs.sas.com/content/iml/2011/11/02/how-to-compute-p-values-for-a-bootstrap-distribution.html
  side <- side[1]
  x <- as.data.frame(x$t)

  ps <- sapply(x, function(.x) {
    s <- na.omit(.x)
    s0 <- 0
    N <- length(s)

    if (side == 0) {
      min((1 + sum(s >= s0)) / (N + 1),
          (1 + sum(s <= s0)) / (N + 1)) * 2
    } else if (side < 0) {
      (1 + sum(s <= s0)) / (N + 1)
    } else if (side > 0) {
      (1 + sum(s >= s0)) / (N + 1)
    }
  })

  setNames(ps, colnames(x))
}

boot_pvalues(treatment_phase_results)
##  fup: control - A  fup: control - B        fup: A - B post: control - A
##        0.03921569        0.03921569        0.94117647        0.03921569
## post: control - B       post: A - B  pre: control - A  pre: control - B
##        0.03921569        0.74509804        0.23529412        0.94117647
##        pre: A - B
##        0.27450980

These p-values can then be passed to p.adjust() for the p-value adjustment method of your choosing.
Summary

I’ve demonstrated (again!) how to run permutation tests on main effects / interactions, with follow-up analysis using the bootstrap method. Using this code as a basis for any analysis you might have in mind gives you all the flexibility of emmeans, which supports many (many) models!
Insurance data science : Pictures
[This article was first published on Renglish – Freakonometrics, and kindly contributed to R-bloggers].
At the Summer School of the Swiss Association of Actuaries, in Lausanne, following the part of JeanPhilippe Boucher (UQAM) on telematic data, I will start talking about pictures this Wednesday. Slides are available online
Ewen Gallic (AMSE) will present a tutorial on satellite pictures, and a simple classification problem, related to Alzheimer detection.
We will try to identify what is on the following pictures, starting with a car (we will see that the car is indeed identified), then a skier, and a fire. We will also discuss previous pictures from the summer school.
Speaking at BARUG
[This article was first published on R – WinVector Blog, and kindly contributed to R-bloggers].
We will be speaking at the Tuesday, September 3, 2019 BARUG. If you are in the Bay Area, please come see us.
Nina Zumel & John Mount
Practical Data Science with R
Practical Data Science with R (Zumel and Mount) was one of the first, and most widely-read, books on the practice of doing Data Science using R. We have been working hard on an improved and revised 2nd edition of our book (coming out this Fall). The book reflects more experience with data science, teaching, and with R itself. We will talk about what direction we think the R community has been taking, how this affected the book, and what is new in the upcoming edition.
Be careful of NA/NaN/Inf values when using base R’s plotting functions!
[This article was first published on R – Statistical Odds &amp; Ends, and kindly contributed to R-bloggers].
I was recently working on a supervised learning problem (i.e. building a model using some features to predict some response variable) with a fairly large dataset. I used base R’s plot and hist functions for exploratory data analysis and all looked well. However, when I started building my models, I began to run into errors. For example, when trying to fit the lasso using the glmnet package, I encountered this error:
I thought this error message was rather cryptic. However, after some debugging, I realized the error was exactly what it said it was: there were NA/NaN/Inf values in my data matrix! The problem was that I had expected these problematic values to have been flagged during my exploratory data analysis. However, R’s plot and hist functions silently drop these values before giving a plot.
Here’s some code to demonstrate the issue. Let’s create some fake data with NA/NaN/Inf values:
n <- 50  # no. of observations
p <- 2   # no. of features

# create fake data matrix
set.seed(1)
x <- matrix(rnorm(n * p), nrow = n)

# make some entries invalid
x[1:3, 1] <- NA
x[4:5, 2] <- Inf
head(x)
#>            [,1]      [,2]
#> [1,]         NA 0.3981059
#> [2,]         NA 0.6120264
#> [3,]         NA 0.3411197
#> [4,]  1.5952808       Inf
#> [5,]  0.3295078       Inf
#> [6,] -0.8204684 1.9803999

The two lines of code give plots in return, without any warning message to the console that data points have been dropped:
plot(x[, 1], x[, 2])
hist(x[, 1])

The ggplot2 package does a better job of handling such values. While it also makes the plot, it sends a warning to the console that some values have been dropped in the process:
library(ggplot2)
df <- data.frame(x = x[, 1])
ggplot(df, aes(x)) +
  geom_histogram()
Moral(s) of the story:
 Don’t assume that your data is free of NA/NaN/Inf values. Check!
 Base R’s hist and plot functions do not warn about invalid values being removed. Either follow the advice in the previous point or use code that flags such removals (e.g. ggplot2).
Cyclists – London Ride 100 – Analysis for riders and clubs using Shiny/R
[This article was first published on RBloggers – Data Diversions, and kindly contributed to R-bloggers].
Introduction
The Prudential Ride London is an annual summer cycling weekend, and within that I will focus on the Ride London-Surrey 100, a 100-mile route open to the public, starting at the Stratford Olympic Park in East London and finishing in front of Buckingham Palace.
I wrote an initial analysis using R in 2016 and added the 2017 data. In 2018 I added a Shiny interactive web page so that riders could see an analysis of their ride. Below is an image of how that Shiny App looks and here is link to the Shiny page Ride 100 – Shiny Individual Rider Analysis (2019 data has been added).
The Shiny page allows a single rider to be highlighted by entering their “Bib Number”, but works fine without one. It is more interesting if you do enter one though so here is a list of some celebrity* Bib Numbers for 2018 so you can try a few (you will need to change the “Year” selected as the app now also contains the 2019 data). An explanation of how the page works can be found here.
This year I’ve added a variation of the existing Shiny page that focuses on club riders. Here is a link to the Ride 100 – Shiny Club Analysis Page.
General Notes about the new Club Analysis Shiny App
The club analysis was based on the individual analysis page but many of the charts/pages were not appropriate for aggregated data, so were removed.
Club Start Time vs Finish Split
The chart has been modified so that the size of each point reflects the number of riders included. Also a “Minimum Group Size” control, label options and the ability to highlight multiple clubs have been added to the control panel on the left. As before, the colour of each point shows the mean “Group Riding Efficiency” (see previous post about that here).
Total Group Size
This is a new chart that aims to highlight clubs that seem to have been given inappropriately early or late start times in the hope of helping the clubs be more realistic about the start times they request (or maybe even the organisers allocate) for next year.
It calculates how many people each rider overtook (or was overtaken by) over the 100 miles and then sums this figure across all members of the club. If a club has a large number of riders who are doing a lot of overtaking, they should probably be given earlier start times next year. My reasoning is that while a certain amount of overtaking and being overtaken is necessary and healthy, all other things being equal a rider overtaking 5,000* people is more likely to be involved in an accident than the same rider overtaking 500 people. If a club has a lot of riders who did a lot of overtaking the club will be shown as a larger green point in the plot. Similarly clubs which were overtaken by a lot of people are highlighted with a large red point.
Club riders are on average faster than the nonclub riders so the data is somewhat skewed to overtakes, but I think the point still stands. Also worthy of note is that the overtake figure only includes overtakes of people who completed the 100 mile course (without diversion). The actual number would be higher, including people doing shorter distances, people who were diverted and people who didn’t manage to finish.
Note: The mean overtakes for each club are shown in the label text on the previous chart and the individual overtake totals are shown on the companion individual analysis Shiny page (see above link)
*This is not an unrealistic figure; some riders overtake more than 10,000 people!
To leave a comment for the author, please follow the link and comment on their blog: Data Diversions.
How to Automate EDA with DataExplorer in R
[This article was first published on rbloggers on Programming with R, and kindly contributed to R-bloggers].
EDA (Exploratory Data Analysis) is one of the key steps in any Data Science project. The better the EDA, the better the subsequent Feature Engineering can be. From Modelling to Communication, EDA has many more hidden benefits that aren't often emphasised when Data Science is taught to beginners.
The Problem
That said, EDA is also one of the areas of the Data Science pipeline where a lot of manual code gets written for different types of plots and different kinds of inference. Say you'd want to visualize a bar plot of a categorical variable, and a histogram of a continuous variable to understand their distributions. All these things increase the number of lines of code, which can be time-consuming if you're participating in Hackathons or online competitions like Kaggle, where a time-bound response is usually required to move up the leaderboard.
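For a sense of the boilerplate involved, this is the kind of repetitive base-R plotting loop (an illustrative sketch, not code from the post) that automated EDA replaces:

```r
# Illustrative sketch: the manual per-variable plotting loop automated EDA saves
df <- data.frame(
  cyl = factor(c(4, 6, 8, 4, 6)),   # a categorical variable
  mpg = c(22.8, 19.7, 15.0, 32.4, 21.4)  # a continuous variable
)

for (v in names(df)) {
  if (is.factor(df[[v]])) {
    barplot(table(df[[v]]), main = paste("Counts of", v))    # categorical
  } else if (is.numeric(df[[v]])) {
    hist(df[[v]], main = paste("Distribution of", v))        # continuous
  }
}
```

Every new variable type, facet, or dataset means more of this by hand, which is exactly the cost the next section's tool removes.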
The Solution
That's where automated EDA tools come in very handy, and one popular tool for automated EDA in R is DataExplorer by Boxuan Cui.
DataExplorer
The stable version of DataExplorer can be installed from CRAN:
install.packages("DataExplorer")

And if you'd like to try the development version:
if (!require(devtools)) install.packages("devtools")
devtools::install_github("boxuancui/DataExplorer", ref = "develop")

Automating EDA – Get started
Before we start with EDA, we should first get the data that we would like to explore. In this case, we'll use data generated by fakir.
library(fakir)
library(tidyverse)
library(DataExplorer)

web <- fakir::fake_visits()
glimpse(web)
## Observations: 365
## Variables: 8
## $ timestamp 2017-01-01, 2017-01-02, 2017-01-03, 2017-01-04, 2017…
## $ year 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, …
## $ month 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ day 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ home 352, 203, 103, 484, 438, NA, 439, 273, 316, 193, 322, …
## $ about 176, 115, 59, 113, 138, 75, 236, 258, 206, 260, NA, 29…
## $ blog 521, 492, 549, 633, 423, 478, 364, 529, 320, 315, 578,…
## $ contact NA, 89, NA, 331, 227, 289, 220, 202, 367, 369, 241, 28…

# year, month, day to factor
web$year <- as.factor(web$year)
web$month <- as.factor(web$month)
web$day <- as.factor(web$day)

To go with glimpse(), DataExplorer itself has a function called introduce().
introduce(web)
## # A tibble: 1 x 9
##    rows columns discrete_columns continuous_colu… all_missing_col…
## 1   365       8                4                4                0
## # … with 4 more variables: total_missing_values,
## #   complete_rows, total_observations, memory_usage

The same introduce() summary can also be plotted as a pretty graph.
plot_intro(web)

Automating EDA – Missing
Personally, the most useful function of DataExplorer is plot_missing().
plot_missing(web)

That's so handy that I don't have to copy-paste any custom function from SO or my previous code.
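For comparison, the kind of hand-rolled missing-value summary that plot_missing() spares you from writing might look like this (illustrative sketch only):

```r
# Sketch of a hand-rolled missing-value summary that plot_missing() replaces
missing_profile <- function(df) {
  data.frame(
    feature     = names(df),
    n_missing   = colSums(is.na(df)),          # count of NAs per column
    pct_missing = round(100 * colMeans(is.na(df)), 1),  # percentage of NAs
    row.names   = NULL
  )
}

missing_profile(data.frame(a = c(1, NA, 3), b = c(NA, NA, 6)))
#   feature n_missing pct_missing
# 1       a         1        33.3
# 2       b         2        66.7
```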
Automating EDA – Continuous
As with most EDA on continuous variables (numbers), we'll start off with a histogram, which can help us understand the underlying distributions.
And that's just one function: plot_histogram().
DataExplorer::plot_histogram(web)

And there's a similar function for density plots, plot_density().
plot_density(web)

That's all univariate; if we move on to bivariate analysis, we can start off with boxplots with respect to a categorical variable.
plot_boxplot(web, by = 'month', ncol = 2)
## Warning: Removed 111 rows containing non-finite values (stat_boxplot).

And the super-useful correlation plot.
plot_correlation(web, cor_args = list('use' = 'complete.obs'))
## 2 features with more than 20 categories ignored!
## timestamp: 365 categories
## day: 31 categories
## Warning in cor(x = structure(list(home = c(352L, 203L, 103L, 484L, 438L, :
## the standard deviation is zero
## Warning: Removed 32 rows containing missing values (geom_text).

If you want the correlation plot to include only continuous variables:
plot_correlation(web, type = 'c', cor_args = list('use' = 'complete.obs'))

Well, that's how simple it is to make a bunch of plots for continuous variables.
Automating EDA – Categorical
Bar plots can combine a categorical and a continuous variable. By default (with no "with" value), plot_bar() plots the categorical variable against the frequency/count.
plot_bar(web, maxcat = 20, parallel = TRUE)
## 2 columns ignored with more than 20 categories.
## timestamp: 365 categories
## day: 31 categories

Also, we've got an option to specify the name of a continuous variable to be summed up.
plot_bar(web, with = c("home"), maxcat = 20, parallel = TRUE)
## 2 columns ignored with more than 20 categories.
## timestamp: 365 categories
## day: 31 categories

EDA Report
The functions above each produce a specific type of plot (applied across the whole dataset), making EDA a very quick process.
create_report()

create_report() helps us generate an output report combining all the required plots for the different types of variables.
Plot Aesthetics
It's worth mentioning that these ggplots aren't necessarily the final versions: DataExplorer allows us to supply a ggtheme theme name and theme_config to pass on theme parameters. Functions like plot_boxplot() or plot_histogram() also take plot-specific arguments. For more details, check out the relevant help files.
plot_intro(web,
           ggtheme = theme_minimal(),
           title = "Automated EDA with Data Explorer")

Summary
DataExplorer is extremely handy for automating EDA in a lot of use cases, like missing-value reporting in an ETL process or basic EDA in a hackathon. It's definitely another generalist tool that can be customized for better usage.
If you liked this, please subscribe to my Data Science Newsletter and also share it with your friends!
To leave a comment for the author, please follow the link and comment on their blog: rbloggers on Programming with R.
Do you love Data Science? I mean, the Data part in it
[This article was first published on rbloggers on Programming with R, and kindly contributed to R-bloggers].
Last week, we talked all about Artificial Intelligence (and Artificial Stupidity), which led me to think about the foundation of Data Science: the data itself. I think data is the least appreciated entity in the Data Science value chain. You might agree with me if you do Data Science outside competitive platforms like Kaggle, where the data given to you is what most Data Scientists dream about in their jobs.
Data Foundation
"AI godfathers" have a good fan following, but many of us also know Fei-Fei Li, whose contribution (with her team) of building ImageNet for AI is invaluable.

"One thing ImageNet changed in the field of AI is suddenly people realized the thankless work of making a dataset was at the core of AI research. People really recognize the importance the dataset is front and center in the research as much as algorithms." – Fei-Fei Li
Data Startup
Meanwhile, venture capitalists aren't shying away from putting their money where data is created and curated. Recently, the Silicon Valley startup Scale AI hit unicorn status. Scale AI's About Us page reads:
The Data Platform for AI
Scale AI has also open-sourced datasets, and that's sweet.
Build your own Data
Zalando, which open-sourced Fashion-MNIST, published a nice paper listing the steps they took to publish the dataset. There are also free tools like labelImg and makesense.ai to help you annotate images for a typical image dataset. For NLP annotation, BRAT is a nice free open-source tool. And if you are planning a pet project and don't have the required dataset, this tutorial by Mat Kelcey on counting bees on a Raspberry Pi with a conv net would be a tremendous help.
In R, check out this post to learn how to generate meaningful fake data for learning, experimentation and teaching using {fakir}.
That said, if you appreciate Data Science as much as you'd appreciate the beauty of a Ferrari or Lamborghini, then I might also have to remind you that the car is only useful if you've got oil in it: your super-clean, labelled data that's usable for Data Science and Machine Learning.
If you liked this, please subscribe to my Data Science Newsletter and also share it with your friends!
To leave a comment for the author, please follow the link and comment on their blog: rbloggers on Programming with R.
Basic Quantile Regression
[This article was first published on R – insightR, and kindly contributed to R-bloggers].
By Gabriel Vasconcelos

Introduction
Today we are going to talk about quantile regression. When we use the lm command in R we are fitting a linear regression using Ordinary Least Squares (OLS), which has the interpretation of a model for the conditional mean of y given x. However, sometimes we may need to look at more than the conditional mean to understand our data, and quantile regressions may be a good alternative. Instead of looking at the mean, quantile regressions establish models for particular quantiles chosen by the user. The simplest case where quantile regression helps is when you have outliers in your data, because the median (the 0.5 quantile) is much less affected by extreme values than the mean. But there are other cases where quantile regression may be used, for example to identify heterogeneous effects of some variable or to give more robustness to your results.
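As a side note (my addition, not from the original post): under the hood, the tau-th quantile regression minimizes the "check" (or pinball) loss rho_tau(u) = u * (tau - 1{u < 0}) rather than squared error, which is where the robustness comes from. A tiny numerical sketch:

```r
# Sketch: the tau-th quantile minimizes the "check" (pinball) loss.
# For tau = 0.5 this is proportional to the absolute loss, which is why
# median regression resists outliers.
check_loss <- function(u, tau) u * (tau - (u < 0))

set.seed(1)
x   <- rnorm(1000)
tau <- 0.9

# Minimize the average check loss over a grid of candidate constants m
m_grid <- seq(-3, 3, by = 0.001)
loss   <- vapply(m_grid, function(m) mean(check_loss(x - m, tau)), numeric(1))
m_grid[which.min(loss)]   # close to quantile(x, 0.9)
```

rq() does the analogous minimization with a linear predictor in place of the constant m.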
Outliers
The package we will use for quantile regressions is quantreg, which is very easy to use if you are already familiar with the lm function.
library(quantreg)

Our data is going to be very simple. The sample size will be 300 and we will have 10 variables. The betas will be either 1 or -1, depending on the index of the variable. This is our dgp:
y = Xβ + ε, where the errors ε and the covariates in X are all generated from a N(0, 1) distribution.
The first step is to generate the data and run a linear regression by OLS. As you can see in the results below, the estimates are very precise. The signs are always correct and the coefficients are close to 1 or -1.
N = 300 # sample size
P = 10  # number of variables

### Generate Data ###
betas = (-1)^(1:P)
set.seed(1) # Seed for replication
error = rnorm(N)
X = matrix(rnorm(N * P), N, P)
y = X %*% betas + error

summary(lm(y ~ X))
##
## Call:
## lm(formula = y ~ X)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.94301 -0.58806  0.03508  0.61445  2.55389
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.03501    0.05739    0.61    0.542
## X1          -0.98513    0.05534  -17.80   <2e-16 ***
## X2           1.04319    0.05321   19.60   <2e-16 ***
## X3          -1.01520    0.05577  -18.20   <2e-16 ***
## X4           0.92930    0.05705   16.29   <2e-16 ***
## X5          -0.97971    0.05404  -18.13   <2e-16 ***
## X6           1.01255    0.05376   18.83   <2e-16 ***
## X7          -1.02357    0.05622  -18.21   <2e-16 ***
## X8           0.97456    0.05946   16.39   <2e-16 ***
## X9          -0.97097    0.05295  -18.34   <2e-16 ***
## X10          1.03386    0.05312   19.46   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9747 on 289 degrees of freedom
## Multiple R-squared: 0.9274, Adjusted R-squared: 0.9249
## F-statistic: 369 on 10 and 289 DF, p-value: < 2.2e-16

Now we are going to add some outliers to the data. We have 300 observations and only 5 of them are going to be outliers. The code below introduces the outliers and runs OLS again. The estimates are now considerably worse than in the first case with no outliers.
### Introducing outliers ###
set.seed(1)
outliers = rnorm(5, 0, 50)
error_o = error
error_o[1:5] = outliers
y_o = X %*% betas + error_o

### OLS ###
summary(lm(y_o ~ X))
##
## Call:
## lm(formula = y_o ~ X)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -41.216  -1.105   0.027   0.828  77.040
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   0.1951     0.3367   0.579 0.562717
## X1           -1.3165     0.3247  -4.055 6.45e-05 ***
## X2            1.1355     0.3122   3.637 0.000326 ***
## X3           -1.0541     0.3272  -3.222 0.001421 **
## X4            0.9136     0.3347   2.729 0.006733 **
## X5           -1.2849     0.3171  -4.052 6.52e-05 ***
## X6            1.2834     0.3154   4.069 6.10e-05 ***
## X7           -0.7777     0.3298  -2.358 0.019055 *
## X8            1.4859     0.3488   4.260 2.77e-05 ***
## X9           -1.0915     0.3107  -3.513 0.000513 ***
## X10           0.5756     0.3117   1.847 0.065797 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.719 on 289 degrees of freedom
## Multiple R-squared: 0.3223, Adjusted R-squared: 0.2989
## F-statistic: 13.75 on 10 and 289 DF, p-value: < 2.2e-16

If we run a quantile regression for the median, like in the code below, we get good results once again. Note that in the OLS case the coefficients were not as close to 1 or -1 as in the quantile case below.
## Quantile Regression - Median ##
summary(rq(y_o ~ X, tau = 0.5), se = "boot", bsmethod = "xy")
##
## Call: rq(formula = y_o ~ X, tau = 0.5)
##
## tau: [1] 0.5
##
## Coefficients:
##             Value    Std. Error t value   Pr(>|t|)
## (Intercept)  0.03826  0.08202     0.46642 0.64127
## X1          -0.95730  0.07394   -12.94750 0.00000
## X2           0.99394  0.07612    13.05731 0.00000
## X3          -0.97981  0.07995   -12.25523 0.00000
## X4           0.92966  0.09319     9.97573 0.00000
## X5          -0.92781  0.08520   -10.88998 0.00000
## X6           0.98021  0.08114    12.08030 0.00000
## X7          -1.00993  0.06727   -15.01266 0.00000
## X8           1.03572  0.08043    12.87722 0.00000
## X9          -0.96732  0.07502   -12.89374 0.00000
## X10          1.02600  0.08775    11.69171 0.00000

Treatment effects
Suppose now that we have the same model, without outliers, plus a treatment that is positive for lower values of y and negative for bigger values of y. Only half of the sample will be treated, and we want to see what we get when we try to estimate the effects of this treatment. The figure below plots the original y against the treated y. The points on the 45-degree line are the untreated observations.
Our treatment simply adds 5 to the equation if the deterministic part of the equation is negative and subtracts 5 if it is positive. If we run OLS, we can see below that the treatment effects are very poorly estimated and they are also not significant. That is because we are looking at the average effect of the treatment, which is 0, because half of the treated sample had an increase of 5 and the other half had a decrease of 5.
#### More complicated data #####
set.seed(1)
treatment = sample(c(0, 1), N, replace = TRUE)
aux = X %*% betas
treatment_error = rnorm(N) * treatment
y_tr = aux + ifelse(aux < 0, 5, -5) * treatment + error + treatment_error
X_tr = cbind(treatment, X)

## Treated X Untreated
plot(y, y_tr)

Now let's try quantile regression for multiple quantiles (0.1, 0.2, …, 0.8, 0.9). The results are presented below. When we look at the middle quantiles, like 0.5 and 0.6, we find that the treatment is not significant, just like in the OLS case. However, as we move further towards 0.1 or 0.9 we obtain significant results with estimates very close to the real treatment effect, which would be 5 or -5. The model is telling us that bigger values of y are either untreated samples or treated samples that were previously small but became big because of the treatment of +5; that is why the treatment effect is close to 5 in the 0.9 quantile. Similarly, smaller values of y are either untreated samples or treated samples that were previously big and became small because of the treatment of -5. We are modelling the quantiles of the treated y.
## OLS ## summary(lm(y_tr ~ X_tr)) ## ## Call: ## lm(formula = y_tr ~ X_tr) ## ## Residuals: ## Min 1Q Median 3Q Max ## 7.5913 2.0909 0.0168 1.8840 8.8915 ## ## Coefficients: ## Estimate Std. Error t value Pr(>t) ## (Intercept) 0.08713 0.24512 0.355 0.722502 ## X_trtreatment 0.24555 0.35698 0.688 0.492093 ## X_tr 0.40117 0.17283 2.321 0.020973 * ## X_tr 0.55826 0.16594 3.364 0.000872 *** ## X_tr 0.61606 0.17365 3.548 0.000454 *** ## X_tr 0.56658 0.17766 3.189 0.001584 ** ## X_tr 0.37539 0.16829 2.231 0.026477 * ## X_tr 0.43680 0.16788 2.602 0.009752 ** ## X_tr 0.29718 0.17523 1.696 0.090972 . ## X_tr 0.39646 0.18585 2.133 0.033751 * ## X_tr 0.34905 0.16499 2.116 0.035233 * ## X_tr 0.83088 0.16593 5.007 9.64e07 *** ##  ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 3.035 on 288 degrees of freedom ## Multiple Rsquared: 0.2454, Adjusted Rsquared: 0.2166 ## Fstatistic: 8.514 on 11 and 288 DF, pvalue: 5.243e13 # Quantilie Regression for q = 0.1, 0.2, ... , 0.8, 0.9 summary(rq(y_tr ~ X_tr, tau = seq(0.1,0.9,0.1)), se = "boot", bsmethod= "xy") ## ## Call: rq(formula = y_tr ~ X_tr, tau = seq(0.1, 0.9, 0.1)) ## ## tau: [1] 0.1 ## ## Coefficients: ## Value Std. Error t value Pr(>t) ## (Intercept) 1.59418 0.22484 7.09019 0.00000 ## X_trtreatment 4.13074 0.47280 8.73680 0.00000 ## X_tr 0.85152 0.17088 4.98320 0.00000 ## X_tr 1.10326 0.14030 7.86384 0.00000 ## X_tr 0.85055 0.18554 4.58421 0.00001 ## X_tr 0.68881 0.21612 3.18719 0.00159 ## X_tr 0.88047 0.16396 5.36995 0.00000 ## X_tr 0.68033 0.15287 4.45051 0.00001 ## X_tr 1.00234 0.15569 6.43819 0.00000 ## X_tr 0.99443 0.19484 5.10369 0.00000 ## X_tr 0.94634 0.17999 5.25759 0.00000 ## X_tr 0.92124 0.17348 5.31048 0.00000 ## ## Call: rq(formula = y_tr ~ X_tr, tau = seq(0.1, 0.9, 0.1)) ## ## tau: [1] 0.2 ## ## Coefficients: ## Value Std. 
Error t value Pr(>t) ## (Intercept) 0.87790 0.18851 4.65701 0.00000 ## X_trtreatment 4.04359 0.43354 9.32695 0.00000 ## X_tr 0.87484 0.17336 5.04648 0.00000 ## X_tr 0.90953 0.16734 5.43510 0.00000 ## X_tr 0.92386 0.15931 5.79905 0.00000 ## X_tr 0.84438 0.14318 5.89740 0.00000 ## X_tr 0.79882 0.16353 4.88478 0.00000 ## X_tr 0.86351 0.14477 5.96469 0.00000 ## X_tr 0.89046 0.14533 6.12699 0.00000 ## X_tr 0.80481 0.16133 4.98842 0.00000 ## X_tr 0.85986 0.14983 5.73877 0.00000 ## X_tr 1.05532 0.14247 7.40725 0.00000 ## ## Call: rq(formula = y_tr ~ X_tr, tau = seq(0.1, 0.9, 0.1)) ## ## tau: [1] 0.3 ## ## Coefficients: ## Value Std. Error t value Pr(>t) ## (Intercept) 0.48650 0.15254 3.18941 0.00158 ## X_trtreatment 3.55483 0.39081 9.09596 0.00000 ## X_tr 0.81732 0.15516 5.26750 0.00000 ## X_tr 0.97204 0.15176 6.40526 0.00000 ## X_tr 0.88749 0.12320 7.20353 0.00000 ## X_tr 0.80248 0.13892 5.77649 0.00000 ## X_tr 0.63208 0.15928 3.96825 0.00009 ## X_tr 0.85960 0.12409 6.92695 0.00000 ## X_tr 0.73083 0.13347 5.47540 0.00000 ## X_tr 0.80175 0.14581 5.49859 0.00000 ## X_tr 0.73295 0.14156 5.17769 0.00000 ## X_tr 0.96849 0.14074 6.88120 0.00000 ## ## Call: rq(formula = y_tr ~ X_tr, tau = seq(0.1, 0.9, 0.1)) ## ## tau: [1] 0.4 ## ## Coefficients: ## Value Std. Error t value Pr(>t) ## (Intercept) 0.20518 0.17305 1.18566 0.23673 ## X_trtreatment 3.12524 0.82009 3.81083 0.00017 ## X_tr 0.75969 0.17512 4.33802 0.00002 ## X_tr 0.75137 0.18705 4.01697 0.00008 ## X_tr 0.89673 0.15018 5.97081 0.00000 ## X_tr 0.81794 0.15558 5.25741 0.00000 ## X_tr 0.60392 0.18754 3.22019 0.00143 ## X_tr 0.76221 0.16089 4.73733 0.00000 ## X_tr 0.63207 0.24626 2.56665 0.01077 ## X_tr 0.70476 0.24297 2.90064 0.00401 ## X_tr 0.72050 0.21575 3.33951 0.00095 ## X_tr 0.97243 0.14277 6.81125 0.00000 ## ## Call: rq(formula = y_tr ~ X_tr, tau = seq(0.1, 0.9, 0.1)) ## ## tau: [1] 0.5 ## ## Coefficients: ## Value Std. 
Error t value Pr(>t) ## (Intercept) 0.01435 0.17259 0.08317 0.93377 ## X_trtreatment 0.63174 0.97917 0.64518 0.51933 ## X_tr 0.60768 0.19231 3.15992 0.00175 ## X_tr 0.54861 0.19838 2.76543 0.00605 ## X_tr 0.80869 0.16209 4.98908 0.00000 ## X_tr 0.73479 0.18775 3.91377 0.00011 ## X_tr 0.41427 0.21412 1.93477 0.05400 ## X_tr 0.64703 0.21863 2.95946 0.00334 ## X_tr 0.24361 0.22809 1.06801 0.28641 ## X_tr 0.37166 0.23587 1.57566 0.11620 ## X_tr 0.45650 0.19626 2.32605 0.02071 ## X_tr 0.92603 0.17611 5.25825 0.00000 ## ## Call: rq(formula = y_tr ~ X_tr, tau = seq(0.1, 0.9, 0.1)) ## ## tau: [1] 0.6 ## ## Coefficients: ## Value Std. Error t value Pr(>t) ## (Intercept) 0.25546 0.18342 1.39270 0.16478 ## X_trtreatment 1.36457 1.24271 1.09806 0.27310 ## X_tr 0.60693 0.21221 2.86008 0.00455 ## X_tr 0.64917 0.21300 3.04782 0.00252 ## X_tr 0.83220 0.17072 4.87459 0.00000 ## X_tr 0.73617 0.18348 4.01219 0.00008 ## X_tr 0.45447 0.19288 2.35628 0.01913 ## X_tr 0.76270 0.21047 3.62385 0.00034 ## X_tr 0.45186 0.25197 1.79332 0.07397 ## X_tr 0.46886 0.22511 2.08276 0.03816 ## X_tr 0.56117 0.22729 2.46900 0.01413 ## X_tr 0.94549 0.19311 4.89600 0.00000 ## ## Call: rq(formula = y_tr ~ X_tr, tau = seq(0.1, 0.9, 0.1)) ## ## tau: [1] 0.7 ## ## Coefficients: ## Value Std. Error t value Pr(>t) ## (Intercept) 0.52520 0.14537 3.61291 0.00036 ## X_trtreatment 3.30124 0.61742 5.34682 0.00000 ## X_tr 0.67172 0.15957 4.20960 0.00003 ## X_tr 0.74401 0.15205 4.89331 0.00000 ## X_tr 0.93090 0.15068 6.17788 0.00000 ## X_tr 0.78792 0.15054 5.23389 0.00000 ## X_tr 0.72815 0.13243 5.49819 0.00000 ## X_tr 0.84332 0.14158 5.95641 0.00000 ## X_tr 0.71329 0.16296 4.37707 0.00002 ## X_tr 0.59229 0.17063 3.47111 0.00060 ## X_tr 0.76839 0.16055 4.78593 0.00000 ## X_tr 0.98971 0.15318 6.46091 0.00000 ## ## Call: rq(formula = y_tr ~ X_tr, tau = seq(0.1, 0.9, 0.1)) ## ## tau: [1] 0.8 ## ## Coefficients: ## Value Std. 
Error t value Pr(>t) ## (Intercept) 0.79126 0.14003 5.65083 0.00000 ## X_trtreatment 3.70886 0.36971 10.03188 0.00000 ## X_tr 0.73646 0.14104 5.22148 0.00000 ## X_tr 0.71236 0.12350 5.76817 0.00000 ## X_tr 0.86668 0.13143 6.59424 0.00000 ## X_tr 0.75392 0.13379 5.63521 0.00000 ## X_tr 0.76165 0.10571 7.20496 0.00000 ## X_tr 0.86711 0.11852 7.31598 0.00000 ## X_tr 0.73505 0.13053 5.63128 0.00000 ## X_tr 0.80531 0.13801 5.83513 0.00000 ## X_tr 0.79344 0.12981 6.11250 0.00000 ## X_tr 1.00810 0.13979 7.21169 0.00000 ## ## Call: rq(formula = y_tr ~ X_tr, tau = seq(0.1, 0.9, 0.1)) ## ## tau: [1] 0.9 ## ## Coefficients: ## Value Std. Error t value Pr(>t) ## (Intercept) 1.16385 0.16462 7.07006 0.00000 ## X_trtreatment 4.08872 0.37080 11.02661 0.00000 ## X_tr 0.93217 0.15900 5.86256 0.00000 ## X_tr 0.82984 0.12705 6.53140 0.00000 ## X_tr 0.93147 0.13285 7.01143 0.00000 ## X_tr 0.70112 0.12903 5.43359 0.00000 ## X_tr 0.90992 0.13283 6.85018 0.00000 ## X_tr 0.84127 0.14036 5.99349 0.00000 ## X_tr 0.66466 0.15369 4.32462 0.00002 ## X_tr 0.84578 0.17838 4.74134 0.00000 ## X_tr 0.84474 0.13833 6.10682 0.00000 ## X_tr 0.98760 0.16193 6.09902 0.00000
To leave a comment for the author, please follow the link and comment on their blog: R – insightR.
Plumber Logging
[This article was first published on R Views, and kindly contributed to R-bloggers].
The plumber R package is used to expose R functions as API endpoints. Due to plumber’s incredible flexibility, most major API design decisions are left up to the developer. One important consideration to be made when developing APIs is how to log information about API requests and responses. This information can be used to determine how plumber APIs are performing and how they are being utilized.
An example of logging API requests in plumber is included in the package documentation. That example uses a filter to log information about incoming requests before a response has been generated. This is certainly a valid approach, but it means that the log cannot contain details about the response since the response hasn’t been created yet. In this post we will look at an alternative approach to logging plumber APIs that uses preroute and postroute hooks to log information about each API request and its associated response.
Logging
In this example, we will use the logger package to generate the actual log entries. Using this package isn't required, but it does provide some convenient functionality that we will take advantage of.
Since we will be registering hooks for our API, we will need both a plumber.R file and an entrypoint.R file. The plumber.R file contains the following:
# plumber.R
# A simple API to illustrate logging with Plumber

library(plumber)

#* @apiTitle Logging Example
#* @apiDescription Simple example API for implementing logging with Plumber

#* Echo back the input
#* @param msg The message to echo
#* @get /echo
function(msg = "") {
  list(msg = paste0("The message is: '", msg, "'"))
}

#* Plot a histogram
#* @png
#* @get /plot
function() {
  rand <- rnorm(100)
  hist(rand)
}

Now that we've defined two endpoints (/echo and /plot), we can use entrypoint.R to set up logging using preroute and postroute hooks. First, we need to configure the logger package:
# entrypoint.R
library(plumber)

# logging
library(logger)

# Specify how logs are written
log_dir <- "logs"
if (!fs::dir_exists(log_dir)) fs::dir_create(log_dir)
log_appender(appender_tee(tempfile("plumber_", log_dir, ".log")))

The log_appender() function is used to specify which appender method is used for logging. Here we use appender_tee() so that logs will be written to stdout and to a specific file path. We create a directory called logs/ in the current working directory to store the resulting logs. Every log file is assigned a unique name using tempfile(). This prevents errors that can occur if concurrent processes try to write to the same file.
Now, we need to create a helper function that we will use when creating log entries:
convert_empty <- function(string) {
  if (string == "") {
    "-"
  } else {
    string
  }
}

This function takes an empty string and converts it into a dash ("-"). We will use this to ensure that empty values still get recorded, which keeps the log files easy to read. We're now ready to create our plumber router and register the hooks necessary for logging:
pr <- plumb("plumber.R")

pr$registerHooks(
  list(
    preroute = function() {
      # Start timer for log info
      tictoc::tic()
    },
    postroute = function(req, res) {
      end <- tictoc::toc(quiet = TRUE)
      # Log details about the request and the response
      log_info('{convert_empty(req$REMOTE_ADDR)} "{convert_empty(req$HTTP_USER_AGENT)}" {convert_empty(req$HTTP_HOST)} {convert_empty(req$REQUEST_METHOD)} {convert_empty(req$PATH_INFO)} {convert_empty(res$status)} {round(end$toc - end$tic, digits = getOption("digits", 5))}')
    }
  )
)

pr

We use the $registerHooks() method to register both preroute and postroute hooks. The preroute hook uses the tictoc package to start a timer. The postroute hook stops the timer and then writes a log entry using the log_info() function from the logger package. Each log entry contains the following information:
- Log level: a distinction made by the logger package; in this example the value is always INFO
- Timestamp: when the response was generated and sent back to the client
- Remote address: the address of the client making the request
- User agent: the user agent making the request
- HTTP host: the host of the API
- Method: the HTTP method attached to the request
- Path: the specific API endpoint requested
- Status: the HTTP status of the response
- Execution time: the time from when the request was received until the response was generated
This log format is loosely inspired by the NCSA Common log format.
Testing
Now that our API is all set up, it's time to test that logging works as expected. First, we need to start the API. The easiest way to do this is to click the Run API button that appears at the top of the plumber.R file in the RStudio IDE. Once the API is running, you'll see a message in the console similar to the following:
Running plumber API at http://127.0.0.1:5762

Now that we know the API is running, we need to make a request. One of the easiest ways to make a request in this case is to open a web browser (like Google Chrome) and type the API address in the address bar followed by /plot. In this example, I would type http://127.0.0.1:5762/plot into the address bar of my browser. If all goes well, you should see a plot rendered in the browser. The RStudio console will display the log output:
INFO [2019-08-09 12:30:23] 127.0.0.1 "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36" localhost:5762 GET /plot 200 0.158

A new logs/ directory will have been created in the current working directory and it will contain a file with the log entry. You can generate more log entries by refreshing your browser window.
Analyzing
Let's say that we refreshed the browser window 1,000 times. The log file generated will contain an entry for each request. We can analyze this log file to find helpful information about the API. For example, we could plot a histogram of execution time:
library(ggplot2)

plumber_log <- readr::read_log(
  "logs/plumber_fe3daed895d.log",
  col_names = c("log_level", "timestamp", "remote_address", "user_agent",
                "http_host", "method", "path", "status", "execution_time")
)

ggplot(plumber_log, aes(x = execution_time)) +
  geom_histogram() +
  theme_bw() +
  labs(title = "Execution Times", x = "Execution Time")

We could even build a Shiny application to monitor the logs/ directory and provide real-time visibility into API metrics!
The details of this Shiny application go beyond the scope of this post, but the source code is available here.
Plumber APIs published to RStudio Connect can use this pattern to log and monitor API requests. Details on this use case can be found in this repository.
Conclusion
Plumber is an incredibly flexible package for exposing R functions as API endpoints. Logging information about API requests and responses provides visibility into API usage and performance. These log files can be manually inspected or used in connection with other tools (like Shiny) to provide real-time metrics around API use. The code used in this example, along with additional information, is available in this GitHub repository.
If you are interested in learning more about using plumber, logging, Shiny, and RStudio Connect, please visit community.rstudio.com and let us know!
James Blair is a solutions engineer at RStudio who focuses on tools,
technologies, and best practices for using R in the enterprise.
To leave a comment for the author, please follow the link and comment on their blog: R Views.
Local randomness in R
[This article was first published on rstats on QuestionFlow, and kindly contributed to R-bloggers].
Prologue
Let’s say we have a deterministic (non-random) problem for which one of the solutions involves randomness. One very common example of such a problem is function minimization on a certain interval: it can be solved non-randomly (as in most methods of optim()), or randomly (the simplest approach being to generate a random set of points on the interval and to choose the one with the lowest function value).
What is a “clean” way of writing a function to solve the problem?
To leave a comment for the author, please follow the link and comment on their blog: rstats on QuestionFlow.
Synthesizing population time series data from the USA Long Term Ecological Research Network
[This article was first published on rOpenSci – open tools for open science, and kindly contributed to R-bloggers].
Introduction
The availability of large quantities of freely available data is revolutionizing the world of ecological research. Open data maximizes the opportunities to perform comparative analyses and meta-analyses. Such synthesis efforts will increasingly exploit “population data”, which we define here as time series of population abundance. Such population data play a central role in testing ecological theory and guiding management decisions. One of the richest sources of open-access population data is the USA Long Term Ecological Research (LTER) Network. However, LTER data present a drawback common to all ecological time series: extreme heterogeneity, derived from differences in sampling designs.
We experienced this heterogeneity first hand, upon embarking on our own comparative analysis of population data. Specifically, we noticed that heterogeneities in sampling design made datasets hard to compare, and therefore hard to search and analyze.
To alleviate these issues, we created popler (POPulation time series from Long-term Ecological Research): an online PostgreSQL database (henceforth “popler online database”), and an associated R client (henceforth “popler R package”). popler accommodates raw population time-series data using the same structure for all datasets. Using raw data, without aggregation, promotes flexibility in data analysis. Moreover, using a common data structure implies that all datasets share the same variables, or the same types of variables. As a result, popler facilitates comparing, manipulating, and retrieving population data.
The popler R package
We strove to make user-friendliness the hallmark of the popler R package. To facilitate use, we created as few functions as possible: popler relies on essentially two functions: pplr_browse(), which you can use to see what types of data are available in popler, and pplr_get_data(), to download datasets from the popler online database.
pplr_browse() lets users select which of the 305 datasets satisfy user-specified criteria. These criteria depend on the metadata variables used to describe the datasets stored in popler. These metadata variables, and their content, are described by the function pplr_dictionary().
Once a user understands the meaning and content of metadata variables, she/he can use pplr_browse() to identify datasets of interest. For example, one could be interested in experimental datasets that quantify population abundance using cover measurements, and that are at least 20 years long. To see which of these data are available, the user should run:
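The actual call did not survive the formatting of this post; based on the criteria just described, it would look roughly like this (a sketch; the filtering variables are the popler metadata variables named above):

```r
library(popler)

# experimental ('exp') datasets measuring cover, at least 20 years long
pplr_browse(datatype == 'cover' & studytype == 'exp' & duration_years > 20)
```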
The resulting tibble object shows that there are only four datasets with these characteristics:
## # A tibble: 4 x 20
##   title proj_metadata_k~ lterid datatype structured_data studytype duration_years community
## 1 SGS~                71 SGS    cover    no              exp                   42 yes
## 2 Post~              140 AND    cover    no              exp                   22 yes
## 3 Vege~              196 BNZ    cover    no              exp                   27 yes
## 4 e133~              235 CDR    cover    no              exp                   26 yes
## # ... with 12 more variables: studystartyr, studyendyr, structured_type_1,
## #   structured_type_2, structured_type_3, structured_type_4, treatment_type_1,
## #   treatment_type_2, treatment_type_3, lat_lter, lng_lter, taxas

If interested, we can download these directly:
df_obj <- pplr_get_data(
  datatype == 'cover' & studytype == 'exp' & duration_years > 20
)

Managing the popler database and its API
One of the biggest hurdles in creating the popler R package was integrating it with an Application Programming Interface (API) that allows secure access to the popler online database. In principle, the popler R package could access the popler online database directly from R using functionalities of R packages such as RPostgreSQL. However, this poses substantial security threats, and we therefore welcomed the suggestion by Noam Ross, from rOpenSci, to use an API. Subsequently, Scott Chamberlain, also from rOpenSci, was kind enough to develop the API using the Ruby language, the code for which is currently stored on GitHub.
Querying an SQL database using an API improves security, but poses a few limitations on the features of the popler R package, and imposes the additional challenge of managing the API. The API limits the ways in which the R package can query the popler online database. In particular, our API allows querying popler in two ways: downloading the metadata of all 305 datasets, and downloading entire datasets. Fortunately, these two queries reflect the vast majority of operations that users perform when using the popler R package. In particular, the metadata of all 305 datasets is essentially what is used by the pplr_browse() function, and users almost always tend to use pplr_get_data() to download entire datasets. Hence, the limitations posed by the API were of little importance when weighed against the benefit provided by increased security. As a side note, using an API provides a minor, yet somewhat enticing perk: a progress bar for the downloads of pplr_get_data().
The API also needs to be managed in order to avoid crashes which make downloads impossible. The API crashes whenever the online database is stopped and restarted, and whenever the API web service crashes. The former case is predictable, but the API web service can crash at any time. To monitor whether the web service of our API is running correctly, we use Uptime Robot, a free application that quantifies the time during which the web server is available to users.
Conclusion: working with LTER population data
Our work with popler highlighted many opportunities linked to data synthesis using population data from the USA LTER network. Specifically, we found that population time-series data, while containing sometimes daunting heterogeneities, also present conspicuous regularities. By “regularities”, we mean that population abundance can be quantified in just a few ways (e.g. using densities, counts, or area covered), that all datasets have a temporal replication of population censuses, a spatial replication structure (which is oftentimes nested), and so on. Ecologists can use these regularities to format most population time series using the same data model. During our work on popler, we realized that the population data associated with the USA LTER, and with other long-term population sampling efforts, provide untapped potential to foster synthesis work in population, community, and macroecology.
Acknowledgements
We thank reviewers Corinna Gries and Benjamin Bond-Lamberty for helpful comments that improved the package functionality.
To leave a comment for the author, please follow the link and comment on their blog: rOpenSci – open tools for open science.
Can we use a neural network to generate Shiny code?
[This article was first published on r – Appsilon Data Science – End to End Data Science Solutions, and kindly contributed to R-bloggers].
Many news reports scare us with machines taking over our jobs in the not-too-distant future. Common examples of takeover targets include professions like truck drivers, lawyers and accountants. In this article we will explore how far machines are from replacing us (R programmers) in writing Shiny code. Spoiler alert: you should not be worried about your obsolescence right now. You will see in a minute that we’re not quite there yet. I’m just hoping to show you, in an entertaining way, some easy applications of a simple recurrent neural network model implemented in the R version of Keras.
Let’s formulate our problem once again precisely: we want to generate Shiny code character by character with a neural network.
Background
To achieve that we need a recurrent neural network (RNN). By definition, such a network does a pretty good job with time series. Right now you might be asking yourself: what? We defined our problem as a text-mining issue; where is the temporal dependency here?! Well, imagine a programmer typing characters on his/her keyboard, one by one, at every time step. It would also be nice if our network captured long-range dependencies, such as a curly bracket in the 1021st line of code that refers to a “for” loop from line 352 (that would be a long loop, though). Fortunately, RNNs are perfect for that because they can (in theory) memorize the influence of a signal from the distant past on a present data sample.
I will not get into details on how recurrent neural networks work here, as I believe that there are a lot of fantastic resources online elsewhere. Let me just briefly mention that some of the regular recurrent networks suffer from a vanishing gradient problem. As a result, networks with such architectures are notoriously difficult to train. That’s why machine learning researchers started looking for more robust solutions. These are provided by a gating mechanism that helps to teach a network longterm dependencies.
The first such solution was introduced in 1997 as the Long Short-Term Memory (LSTM) neuron. It consists of three gates: input, forget and output, that together prevent the gradient from vanishing in further time steps. A simplified version of LSTM that still achieves good performance is the Gated Recurrent Unit (GRU), introduced in 2014. In this solution, the forget and input gates are merged into one update gate. In our implementation we will use a layer of GRU units.
Most of my code relies on an excellent example from Chapter 8 in Deep Learning with R by François Chollet. I recommend this book wholeheartedly to everyone interested in the practical basics of neural networks. Since I think that François can explain his implementation to you better than I could, I’ll just leave you with it and get to the part I modified or added.
Experiment
Before we get to the model, we need some training data. As we don’t want to generate just any code, but specifically Shiny code, we need to find enough training samples. For that, I scraped the data mainly from this official shiny examples repository and added some of our semantic examples. As a result, I generated 1300 lines of Shiny code.
Second, I played with several network architectures and looked for a balance between speed of training, accuracy and model complexity. After some experiments, I found a suitable network for our purposes:
model <- keras_model_sequential() %>%
  layer_gru(units = 100, input_shape = c(maxlen, length(chars))) %>%
  layer_dense(units = length(chars), activation = "softmax")

(BTW, if you want to find out more about Keras in R, I invite you to take a look at a nice introduction by Michał.)
I trained the above model for 50 epochs with a learning rate of 0.02. I experimented with different values of a temperature parameter too. Temperature is used to control the randomness of a prediction by scaling the logits (output of a last layer) before applying the softmax function. To illustrate, let’s have a look at the output of the network predictions with temperature = 0.07.
nput$n) }) # gen dorre out: t", " ras <- ss) }, # l imat") }) } # tageerl itht oimang = shndabres(h4t 1") }) sses$ypabs viog hthewest onputputpung w do panetatstaserval = 1alin hs <- geo verdpasyex(") }) send tmonammm(asera d ary vall wa g xb =1iomm(dat_ngg( dater( # fu t coo 1ang aoplonoi_dur d"), o tehing 1 ch mout = cor;")o}) <- t <- coan t d i

and with temperature = 1:
filectinput <- ren({ # goith an htmm the oblsctr th verichend dile distr(input$dateretcaption$print_om <- ren({ th cond filen(io outputs # the ion tppet chooww.h vichecheckboartcarp" = show(dy), simptect = select) }) }) # funutput$datetable <- ren({ heag(heig= x(input$obr)) }) ) ) function({ suiphed = simplenter = "opter") ) ) )

I think that both examples are already quite impressive, given the limited training data we had. In the first case, the network is more confident about its choices but also quite prone to repetitions (many spaces follow spaces, letters follow letters and so on). The latter, from a long, loooong distance looks way closer to Shiny code. Obviously, it’s still gibberish, but look! There is a nice function call heag(heig= x(input$obr)), object property input$obr, comment # goith and even variable assignment filectinput <- ren({. Isn’t that cool?
Let’s have a look now at the evolution of training after 5 epochs:
finp <- r ctived = "text", "dachintion <- pepristexplet({ ut <- fendertext({ ftable('checkbs cutpanel changlis input dowcter selecter base bar

10 epochs:

# data <- pasht(brch(null) ] ), # input$c a couten quift to col c( expctfrlren beteracing changatput: pp)) })

20 epochs:

fine asc i) { foutput("apputc"in"), text(teat(input$contrs) # th render a number butt summaryerver pation ion tre # chapte the gendate ion. bhthect.hate whtn hblo

As you can see, after each training the generated text becomes increasingly structured.
Final Thoughts
I appreciate that some of you might not be as impressed as I was. Frankly speaking, I can almost hear all of these Shiny programmers saying: “Phew… my job is secure then!” Yeah, yeah, sure it is… For now! Remember that these models will probably improve over time. I challenge you to play with different architectures and train some better models based on this example.
And for completeness, here’s the code I used to generate the fake Shiny code above:
library(keras)
library(stringr)

path <- "shinyappstextdata.dat"  # input data path
text <- tolower(readChar(path, file.info(path)$size))  # loading data

# ---- Data preprocessing
maxlen <- 20
step <- 3
text_indexes <- seq(1, nchar(text) - maxlen, by = step)
sentences <- str_sub(text, text_indexes, text_indexes + maxlen - 1)
next_chars <- str_sub(text, text_indexes + maxlen, text_indexes + maxlen)
cat("Number of sequences: ", length(sentences), "\n")
chars <- unique(sort(strsplit(text, "")[[1]]))
cat("Unique characters:", length(chars), "\n")
char_indices <- 1:length(chars)
names(char_indices) <- chars
cat("Vectorization...\n")
x <- array(0L, dim = c(length(sentences), maxlen, length(chars)))
y <- array(0L, dim = c(length(sentences), length(chars)))
for (i in 1:length(sentences)) {
  sentence <- strsplit(sentences[[i]], "")[[1]]
  for (t in 1:length(sentence)) {
    char <- sentence[[t]]
    x[i, t, char_indices[[char]]] <- 1
  }
  next_char <- next_chars[[i]]
  y[i, char_indices[[next_char]]] <- 1
}

# ---- RNN model training
model <- keras_model_sequential() %>%
  layer_gru(units = 100, input_shape = c(maxlen, length(chars))) %>%
  layer_dense(units = length(chars), activation = "softmax")
optimizer <- optimizer_rmsprop(lr = 0.02)
model %>% compile(
  loss = "categorical_crossentropy",
  optimizer = optimizer
)
model %>% fit(x, y, batch_size = 128, epochs = 50)

# ---- Predictions evaluation
sample_next_char <- function(preds, temperature = 1.0) {
  preds <- as.numeric(preds)
  preds <- log(preds) / temperature
  exp_preds <- exp(preds)
  preds <- exp_preds / sum(exp_preds)
  which.max(t(rmultinom(1, 1, preds)))
}
nr_of_character_to_generate <- 500
start_index <- sample(1:(nchar(text) - maxlen - 1), 1)
seed_text <- str_sub(text, start_index, start_index + maxlen - 1)
temperature <- 1.0
cat(seed_text, "\n")
generated_text <- seed_text
for (i in 1:nr_of_character_to_generate) {
  sampled <- array(0, dim = c(1, maxlen, length(chars)))
  generated_chars <- strsplit(generated_text, "")[[1]]
  for (t in 1:length(generated_chars)) {
    char <- generated_chars[[t]]
    sampled[1, t, char_indices[[char]]] <- 1
  }
  preds <- model %>% predict(sampled, verbose = 0)
  next_index <- sample_next_char(preds[1, ], temperature)
  next_char <- chars[[next_index]]
  generated_text <- paste0(generated_text, next_char)
  generated_text <- substring(generated_text, 2)
  cat(next_char)
}

You can find me on Twitter @doktaox
Article Can we use a neural network to generate Shiny code? comes from Appsilon Data Science – End to End Data Science Solutions.
To leave a comment for the author, please follow the link and comment on their blog: r – Appsilon Data Science – End to End Data Science Solutions.
Vectors and Functions
[This article was first published on R-exercises, and kindly contributed to R-bloggers].
In the previous set we started with arithmetic operations on vectors. We’ll take this a step further now, by practising functions to summarize, sort and round the elements of a vector.
So far, the functions we have practised (log, sqrt, exp, sin, cos, and acos) always return a vector with the same length as the input vector. In other words, the function is applied element by element to the elements of the input vector. Not all functions behave this way though. For example, the function min(x) returns a single value (the minimum of all values in x), regardless of whether x has length 1, 100 or 100,000.
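To make the distinction concrete (a quick base-R illustration, not part of the exercises):

```r
x <- c(4, 9, 16)

sqrt(x)   # applied element by element: 2 3 4
min(x)    # collapses to a single value: 4

length(sqrt(x))  # 3 -- same length as the input
length(min(x))   # 1 -- regardless of the input's length
```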
Before starting the exercises, please note this is the third set in a series of five: In the first two sets, we practised creating vectors and vector arithmetics. In the fourth set (posted next week) we will practise regular sequences and replications.
You can find all sets right now in our ebook Start Here To Learn R – vol. 1: Vectors, arithmetic, and regular sequences. The book also includes all solutions (carefully explained), and the fifth and final set of the series. This final set focuses on the application of the concepts you learned in the first four sets, to realworld data.
One more thing: I would really appreciate your feedback on these exercises: Which ones did you like? Which ones were too easy or too difficult? Please let me know what you think here!
Exercise 1
Did you know R actually has lots of built-in datasets that we can use to practise? For example, the rivers data “gives the lengths (in miles) of 141 “major” rivers in North America, as compiled by the US Geological Survey” (you can find this description, and additional information, if you enter help(rivers) in R). Also, for an overview of all built-in datasets, enter data().
Have a look at the rivers data by simply entering rivers at the R prompt. Create a vector v with 8 elements, containing the number of elements (length) in rivers, their sum (sum), mean (mean), median (median), variance (var), standard deviation (sd), minimum (min) and maximum (max).
(Solution)
Exercise 2
For many functions, we can tweak their result through additional arguments. For example, the mean function accepts a trim argument, which trims a fraction of observations from both the low and high end of the vector the function is applied to.
- What is the result of mean(c(100, 0, 1, 2, 3, 6, 50, 73), trim = 0.25)? Don’t use R, but try to infer the result from the explanation of the trim argument I just gave. Then check your answer with R.
- Calculate the mean of rivers after trimming the 10 highest and lowest observations. Hint: first calculate the trim fraction, using the length function.
(Solution)
Exercise 3
Some functions accept multiple vectors as inputs. For example, the cor function accepts two vectors and returns their correlation coefficient. The women data “gives the average heights and weights for American women aged 30-39”. It contains two vectors, height and weight, which we can access after entering attach(women) (we’ll discuss the details of attach in a later chapter).
- Using the cor function, show that the average heights and weights of these women are almost perfectly correlated.
- Calculate their covariance, using the cov function.
- The cor function accepts a third argument, method, which allows for three distinct methods (“pearson”, “kendall”, “spearman”) to calculate the correlation. Repeat part (a) of this exercise for each of these methods. Which method is chosen by default (i.e. without specifying the method explicitly)?
(Solution)
Exercise 4
In the previous three exercises, we practised functions that accept one or more vectors of any length as input, but return a single value as output. We’re now returning to functions that return a vector of the same length as their input vector. Specifically, we’ll practise rounding functions. R has several functions for rounding. Let’s start with floor, ceiling, and trunc:
- floor(x) rounds to the largest integer not greater than x
- ceiling(x) rounds to the smallest integer not less than x
- trunc(x) returns the integer part of x
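A quick illustration of how the three differ, in particular for negative numbers:

```r
x <- c(2.7, -2.7)

floor(x)    # 2 -3  (down, towards minus infinity)
ceiling(x)  # 3 -2  (up, towards plus infinity)
trunc(x)    # 2 -2  (towards zero: just drops the decimal part)
```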
To appreciate the difference between the three, I suggest you first play around a bit with them in R. Just pick any number (with or without a decimal point, positive and negative values), and see the result each of these functions gives you. Then make it somewhat closer to the next integer (either above or below), or flip the sign, and see what happens. Then continue with the following exercise:
Below you will find a series of arguments (x), and results (y), that can be obtained by choosing one or more of the 3 functions above (e.g. y <- floor(x)). Which of the above 3 functions could have been used in each case? First, choose your answer without using R, then check with R.
- x <- c(300.99, 1.6, 583, 42.10)
  y <- c(300, 1, 583, 42)
- x <- c(152.34, 1940.63, 1.0001, 2.4, sqrt(26))
  y <- c(152, 1940, 1, 5, 2)
- x <- c(3.2, 444.35, 1/9, 100)
  y <- c(3, 444, 0, 100)
- x <- c(35.6, 670, 5.4, 3^3)
  y <- c(36, 670, 5, 27)
(Solution)
Exercise 5
In addition to trunc, floor, and ceiling, R also has the round and signif rounding functions. The latter two accept a second argument, digits. In case of round, this is the number of decimal places, and in case of signif, the number of significant digits. As with the previous exercise, first play around a little, and see how these functions behave. Then continue with the exercise below:
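To see the difference between decimal places (round) and significant digits (signif) on the same numbers:

```r
round(123.456, digits = 1)   # 123.5   (one decimal place)
signif(123.456, digits = 1)  # 100     (one significant digit)

round(0.00789, digits = 2)   # 0.01
signif(0.00789, digits = 2)  # 0.0079
```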
Below you will find a series of arguments (x), and results (y), that can be obtained by choosing one, or both, of the 2 functions above (e.g. y <- round(x, digits = d)). Which of these functions could have been used in each case, and what should the value of d be? First, choose your answer without using R, then check with R.
- x <- c(35.63, 300.20, 0.39, 57.8)
  y <- c(36, 300, 0, 58)
- x <- c(153, 8642, 10, 39.842)
  y <- c(153.0, 8640.0, 10.0, 39.8)
- x <- c(3.8, 0.983, 23, 7.1)
  y <- c(3.80, 0.98, 23.00, 7.10)
(Solution)
Exercise 6
Ok, let’s continue with a really interesting function: cumsum. This function returns a vector of the same length as its input vector. But contrary to the previous functions, the value of an element in the output vector depends not only on its corresponding element in the input vector, but on all previous elements in the input vector. So, its results are cumulative, hence the cum prefix. Take for example: cumsum(c(0, 1, 2, 3, 4, 5)), which returns: 0, 1, 3, 6, 10, 15. Do you notice the pattern?
Functions that are similar in their behavior to cumsum, are: cumprod, cummax and cummin. From just their naming, you might already have an idea how they work, and I suggest you play around a bit with them in R before continuing with the exercise.
- The nhtemp data contain “the mean annual temperature in degrees Fahrenheit in New Haven, Connecticut, from 1912 to 1971”. (Although nhtemp is not a vector but a time-series object (which we’ll learn the details of later), for the purpose of this exercise this doesn’t really matter.) Use one of the four functions above to calculate the maximum mean annual temperature in New Haven observed since 1912, for each of the years 1912-1971.
- Suppose you put $1,000 in an investment fund that will exhibit the following annual returns in the next 10 years: 9% 18% 10% 7% 2% 17% 8% 5% 9% 33%. Using one of the four functions above, show how much money your investment will be worth at the end of each year for the next 10 years, assuming returns are reinvested every year. Hint: If an investment returns e.g. 4% per year, it will be worth 1.04 times as much after one year, 1.04 * 1.04 times as much after two years, 1.04 * 1.04 * 1.04 times as much after three years, etc.
(Solution)
Exercise 7
R has several functions for sorting data: sort takes a vector as input, and returns the same vector with its elements sorted in increasing order. To reverse the order, you can add a second argument: decreasing = TRUE.
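For example:

```r
sort(c(3, 1, 2))                     # 1 2 3
sort(c(3, 1, 2), decreasing = TRUE)  # 3 2 1
```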
- Use the women data (exercise 3) and create a vector x with the elements of the height vector sorted in decreasing order.
- Let’s look at the rivers data (exercise 1) from another perspective. Looking at the 141 data points in rivers, at first glance it seems quite a lot have zero as their last digit. Let’s examine this a bit closer. Use the modulo operator you practised in exercise 9 of the previous exercise set to isolate the last digit of the rivers vector, sort the digits in increasing order, and look at the sorted vector on your screen. How many are zero?
- What is the total length of the 4 largest rivers combined? Hint: Sort the rivers vector from longest to shortest, and use one of the cum... functions to show their combined length. Read off the appropriate answer from your screen.
(Solution)
Exercise 8
Another sorting function is rank, which returns the ranks of the values of a vector. Have a look at the following output:

x <- c(100465, -300, 67.1, 1, 1, 0)
rank(x)
## [1] 6.0 1.0 5.0 3.5 3.5 2.0

Can you describe in your own words what rank does?
- In exercise 3(c) you estimated the correlation between height and weight using Spearman’s rho statistic. Try to replicate this using the cor function without the method argument (i.e., using its default Pearson method), using rank to first obtain the ranks of height and weight.
(Solution)
Exercise 9
A third sorting function is order. Have a look again at the vector x introduced in the previous exercise, and the output of order applied to this vector:

x <- c(100465, -300, 67.1, 1, 1, 0)
order(x)
## [1] 2 6 4 5 3 1

Can you describe in your own words what order does? Hint: look at the output of sort(x) if you run into trouble.
- Remember the time series of mean annual temperature in New Haven, Connecticut, from exercise 6? Have a look at the output of order(nhtemp):
Given that the starting year for this series is 1912, in which years did the lowest and highest mean annual temperature occur?
- What is the result of order(sort(x)), if x is a vector of length 100, and all of its elements are numbers? Explain your answer.
(Solution)
Exercise 10
In exercise 1 of this set, we practised the max function, followed by the cummax function in exercise 6. In the final exercise of this set, we’re returning to this topic, and will practise yet another function to find a maximum. While the former two functions apply to a single vector, it’s also possible to find a maximum across multiple vectors.
- First let’s see how max deals with multiple vectors. Create two vectors x and y, where x contains the first 5 even numbers greater than zero, and y contains the first 5 uneven numbers greater than zero. Then see what max does, as in max(x, y). Is there a difference with max(y, x)?
- Now, try pmax(x, y), where p stands for “parallel”. Without using R, what do you think, intuitively, it will return? Then check, and perhaps refine, your answer with R.
- Now try to find the parallel minimum of x and y. Again, first try to write down the output you expect. Then check with R (I assume you can guess the appropriate name of the function).
- Let’s move from two to three vectors. In addition to x and y, add x as a third vector. Write down the expected output for the parallel minima and maxima, then check your answer with R.
- Finally, let’s find out how pmax handles vectors of different lengths. Write down the expected output for the following statements, then check your answer with R.
  - pmax(x, 6)
  - pmax(c(x, x), y)
  - pmin(x, c(y, y), 3)
(Solution)
To leave a comment for the author, please follow the link and comment on their blog: R-exercises.
tsbox 0.2: supporting additional time series classes
[This article was first published on R – usefulr, and kindly contributed to R-bloggers].
The tsbox package makes life with time series in R easier. It is built around a set of functions that convert time series of different classes to each other. They are frequency-agnostic, and allow the user to combine time series of multiple non-standard and irregular frequencies. A detailed overview of the package functionality is given on the documentation page (or in a previous blog post).
Version 0.2 is now on CRAN and provides a large number of bug fixes. Non-standard column names are now handled correctly, and non-standard column orders are treated consistently.
New Classes
There are two more time series classes supported: tis time series, from the tis package, and irts time series, from the tseries package.
In order to create an object of these classes, it is sufficient to use the appropriate converter.
E.g., for tis time series:
library(tsbox)
ts_tis(fdeaths)
##       Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
## 1974  901  689  827  677  522  406  441  393  387  582  578  666
## 1975  830  752  785  664  467  438  421  412  343  440  531  771
## 1976  767 1141  896  532  447  420  376  330  357  445  546  764
## 1977  862  660  663  643  502  392  411  348  387  385  411  638
## 1978  796  853  737  546  530  446  431  362  387  430  425  679
## 1979  821  785  727  612  478  429  405  379  393  411  487  574
## class: tis
Or for irts time series:
head(ts_irts(fdeaths))
## 1974-01-01 00:00:00 GMT 901
## 1974-02-01 00:00:00 GMT 689
Conversion works from all classes to all classes, and we can easily convert these objects to any other time series class, or to a data frame:
x.tis <- ts_tis(fdeaths)
head(ts_df(x.tis))
##         time value
## 1 1974-01-01   901
## 2 1974-02-01   689
## 3 1974-03-01   827
## 4 1974-04-01   677
## 5 1974-05-01   522
## 6 1974-06-01   406
Class-agnostic functions
Because coercion works reliably and is well tested, we can use it to make functions class-agnostic. If a class-agnostic function works for one class, it works for all:
ts_pc(ts_tis(fdeaths))
ts_pc(ts_irts(fdeaths))
ts_pc(ts_df(fdeaths))
ts_pc(fdeaths)
ts_pc calculates percentage change rates with respect to the previous period. It works like a ‘generic’ function: you can apply it to any time series object, and it will return an object of the same class as its input.
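To illustrate what such a percentage-change transformation computes, here is a minimal base-R sketch for a plain ts object (this is a hypothetical helper for illustration only, not tsbox’s actual implementation):

```r
# Percentage change vs. the previous period for a plain "ts" object.
# A simple base-R sketch, not tsbox's internal code.
pc <- function(x) {
  stopifnot(stats::is.ts(x))
  # lag(x, -1) shifts the series one period later, so arithmetic between
  # the two aligned series gives x_t / x_{t-1} at each time point
  100 * (x / stats::lag(x, -1) - 1)
}

# First value: change from Jan 1974 (901) to Feb 1974 (689)
head(pc(fdeaths))
```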
So, whether we want to smooth, scale, differentiate, chain-link, forecast, regularize or seasonally adjust a series, we can apply the same commands to all time series classes. tsbox offers a comprehensive toolkit for the basics of time series manipulation. Here are some additional examples:
ts_pcy(fdeaths)                # pc., compared to same period of prev. year
ts_forecast(fdeaths)           # forecast, by exponential smoothing
ts_seas(fdeaths)               # seasonal adjustment, by X-13
ts_frequency(fdeaths, "year")  # convert to annual frequency
ts_span(fdeaths, "-1 year")    # limit time span to final year
There are many more. Because they all start with ts_, you can use auto-complete to see what’s around. Most conveniently, there is a time series plot function that works for all classes and frequencies:
ts_plot(
  `Airline Passengers` = AirPassengers,
  `Lynx trappings` = ts_tis(lynx),
  `Deaths from Lung Diseases` = ts_xts(fdeaths),
  title = "Airlines, trappings, and deaths",
  subtitle = "Monthly passengers, annual trappings, monthly deaths"
)
To leave a comment for the author, please follow the link and comment on their blog: R – usefulr.