Big Data Ignite 2020 Webinar Series
[social4i size="small" align="alignleft"] > [This article was first published on R – Hi! I am Nagdev, and kindly contributed to Rbloggers]. (You can report issue about the content on this page here) Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
Big Data Ignite (BDI) was born out of a shared vision: To foster a local center of excellence in advanced computing technologies and practice. After initial success in organizing local Meetup groups, cofounders Elliott and Tuhin realized that to achieve their goal, the scope and scale of activism would need to grow. So, in 2016, the Big Data Ignite conference was launched! In its first year, the conference attracted over 30 presenters and over 200 participants. The twoday conference was designed to provide handson skills in datarelated tools and techniques, as well as a practical overview of the datatechnology landscape and the state of the art. The first day of the conference was devoted to deepdive workshops by major tool vendors and academics. The second day was filled with visionary keynote addresses, panel discussions, and practical breakout sessions focusing on usecases and strategy in the areas of data analytics, data management, IoT, and cloud computing. It was a combination that worked ‚ bringing together handson learning and strategic guidance from practicing experts.
Big Data Ignite 2017, 2018, and 2019 steadily built the momentum. The conference grew to three days, over 50 presenters, and over 400 participants, offering an even broader variety of workshops, keynotes, and breakout sessions, with a special focus on AI
Big Data Ignite in Uncertain TimesThis year, BDI is hosting a series of webinars instead of a threeday conference. These webinars would be on every Tuesday and Thursday from September 15 – October 15 at 12:00 to 1:00 PM EST. Industry leaders and academics have already scheduled various presentations. BDI 2020 is my second year, and I’m also a key speaker for the September 15 session “Predictive Maintenance Pipeline Using R”.
Webinar SeriesCheck out the below link for all the talks scheduled in this session
https://bigdataignite.org/
This year there is absolutely no cost to attend these sessions. You can register on their website at https://bigdataignite.org/register/
The post Big Data Ignite 2020 Webinar Series appeared first on Hi! I am Nagdev.
var vglnk = {key: '949efb41171ac6ec1bf7f206d57e90b8'}; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; // s.defer = true; // s.src = '//cdn.viglink.com/api/vglnk.js'; s.src = 'https://www.rbloggers.com/wpcontent/uploads/2020/08/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Hi! I am Nagdev. Rbloggers.com offers daily email updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/datascience job. Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
The post Big Data Ignite 2020 Webinar Series first appeared on Rbloggers.
Why R? Webinar – Me, Myself and my Rprofile
[social4i size="small" align="alignleft"] > [This article was first published on http://raddict.com, and kindly contributed to Rbloggers]. (You can report issue about the content on this page here) Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
This Thursday September 17 we are lauching another [whyr.pl/webinars/][http://whyr.pl/webinars/] entitled Me, Myself and my Rprofile
20200917 Me, Myself and my Rprofile speaker: Colin Gillespie
 meetup event
 webinar
 donate: whyr.pl/donate/
 channel: youtube.com/WhyRFoundation
 date: every Thursday 8:00 pm UTC+2
 format: 45 minutes long talk streamed on YouTube + 10 minutes for Q&A
 comments: ask questions on YouTube live chat
 Why R? 2020 Conference (Remote)
 Website: 2020.whyr.pl
Sydeaka Watson Data Science for Social Justice. Video
John Blischak Reproducible research with workflowr: a framework for organizing, versioning, and sharing your data analysis projects. Video
JD Long Taking friction out of R: helping drive data science adoption in organizations. Video
Leon Eyrich Jessen In Silico Immunology: Neural Networks for Modelling Molecular Interactions using Tensorflow via Keras in R. Video
Erin Hodgess Using R with High Performance Tools on a Windows Laptop. Video
Julia Silge Understanding Word Embeddings. Video
Bernd Bischl, Florian Pfisterer and Martin Binder Pipelines and AutoML with mlr3. Video
Mateusz Zawisza + Armin Reinert Uplift modeling for marketing campaigns. Video
Erin LeDell – Scalable Automatic Machine Learning in R with H2O AutoML. Video
Ahmadou Dicko – Humanitarian Data Analysis with R. Video
Dr. Nina Zumel and Dr. John Mount from winvector – Advanced Data Preparation for Supervised Machine Learning. Video
Lorenzo Braschi – rZYPAD: Development pipeline for R production. Video
Robin Lovelace and Jakub Nowosad (authors of Geocomputation with R) – Recent changes in R spatial and how to be ready for them. Video
Heidi Seibold, Department of Statistics (collaboration with LMU Open Science Center) (University of Munich) – Teaching Machine Learning online. Video
Olgun Aydin – PwC Poland – Introduction to shinyMobile. Video
Achim Zeileis from Universität Innsbruck – R/exams: A OneforAll Exams Generator – Online Tests, Live Quizzes, and Written Exams with R. Video
Stay up to date subscribe to YouTube channel youtube.com/c/WhyRFoundation
 join Why R? Slack whyr.pl/slack/
 join Meetup
To leave a comment for the author, please follow the link and comment on their blog: http://raddict.com. Rbloggers.com offers daily email updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/datascience job. Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
The post Why R? Webinar  Me, Myself and my Rprofile first appeared on Rbloggers.
Dataviz Interview
[social4i size="small" align="alignleft"] > [This article was first published on R on kieranhealy.org, and kindly contributed to Rbloggers]. (You can report issue about the content on this page here) Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
I had a very nice chat recently about data visualization with Brian Fannin, a research actuary with the
CAS. We covered a variety of topics from R and ggplot in particular, to how to think about data visualization in general, and what the dataviz community is learning from COVID. You can watch it here:
To leave a comment for the author, please follow the link and comment on their blog: R on kieranhealy.org. Rbloggers.com offers daily email updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/datascience job. Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
The post Dataviz Interview first appeared on Rbloggers.
Bookdown, Not Bookout
[social4i size="small" align="alignleft"] > [This article was first published on R, and kindly contributed to Rbloggers]. (You can report issue about the content on this page here) Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
I first got to grips with the R coding language to do the statistical analysis for my dissertation, but I still did all my actual writing in more traditional word processing apps. 1 This became a pain any time I needed to copypaste an update into a table or, even worse, change a plot and reexport to an image and reimport it into my writing environment. Going forward, I resolved to find a better way to incorporate my analysis code and writing to avoid this problem.
Bookdown is a fantastic R package that compiles Rmarkdown source documents into one big output, with code evaluated and, crucially, the ability to add crossreferences to tables/plots etc. It seems to work best when producing HTML or PDF output, for which it has a whole bunch of great customisation options. This is great, as I always submit my final work in PDF. Unfortunately, due to needing to interact with lecturers and nonRcodeusers for feedback on my work, I also need it to work with the Word docx file format. It can do this, but it does not have nearly as many options for customising the output. And a lot of the things you can build when producing a PDF just flat out don’t work in Word.
This has led to… some work on my part.
Functionally building your bookThe most obvious solution to me seemed to simply change what my code was doing depending on the format I was exporting to. I can tweak the final PDF output to be perfect, but the docx format has to be just good enough to get the point across to whoever I’m seeking feedback from.
You can set Bookdown to export to multiple formats, and set options for each export format, using an _output.yml file. But, weirdly, I don’t actually find it easy to change code execution in your document based on what is currently being rendered by Bookdown. You’d think there’d be a way to declare is(pdf) or is(docx) and execute appropriately, right? Unfortunately not. 2
According to this Stackoverflow thread, you can determine the currently rendering format with rmarkdown::all_output_formats(knitr::current_input())[1]. I think _output.yml is supposed to be ‘shuffled’ by Rstudio so that index 1 is always the currently rendering output, but that wasn’t occurring when I tested it; every time I would just get back is the first format listed in _output.yml, even if it wasn’t the one currently being rendered.So if _output.yml had bookdown::pdf_document and then bookdown::word_document in it, all I would get back is bookdown::pdf_document even when rendering to docx. 3
I could just swap in an out various outputs as I need them, of course, but I ended up taking a step back at this point and instead considering what was going on when rendering my source files. If you’re like me, you’re using Rstudio as your IDE, and this gives you a handy ‘knit’ or ‘build’ button in your toolbar when it detects you’re in a Rmarkdown/Bookdown project. While nice, this abstraction actually ended up masking for me the underlying method. Really all these things are doing is calling bookdown::render_book, itself a wrap around rmarkdown::render. This function takes an argument for the format(s) to render when called, using the options found in _output.yml for each particular output format.
Once I realised that rendering a book is just another function, I decided to just wrap it in my own function. I can then provide different, unique _output.yml files to each call to bookdown::render_book based on an argument I provide, using yaml and ymlthis packages to read and write the yaml files as I need them:
build_book < function(format = "all"){ switch(format, "all" = formats < c("bookdown::pdf_document2", "bookdown::word_document2"), "pdf" = formats < "bookdown::pdf_document2", "word" = formats < "bookdown::word_document2" ) for(fmt in formats) { if(grepl("pdf", fmt)) { out_yml < yaml::read_yaml("_pdf_output.yml") ymlthis::use_yml_file(ymlthis::as_yml(out_yml), "_output.yml") } if(grepl("word", fmt)) { out_yml < yaml::read_yaml("_word_output.yml") ymlthis::use_yml_file(ymlthis::as_yml(out_yml), "_output.yml") } bookdown::render_book(here::here("index.Rmd"), output_format = fmt) fs::file_delete("_output.yml") } }This pulls in specific yaml files from a project folder, and sets it as the _output.yml for Bookdown to find. Now, when rendering, rmarkdown::all_output_formats(knitr::current_input())[1] returns the correct file format for the current render, as it’s the only format in the output yaml file! This allowed me to put this at the start of my bookdown source files:
fmt < rmarkdown::all_output_formats(knitr::current_input())[1]And then test for output with:
if(!grepl("word", fmt)) { my_table %>% kableExtra::kbl(caption = "My full table caption", caption.short = "Short caption for ToC", booktabs = TRUE) %>% kableExtra::kable_styling(latex_options = c("striped", "HOLD_position", "scale_down")) } else { exTab %>% knitr::kable(caption = "My full table caption", booktabs = TRUE)This lets me, for example here, use kableExtra to further style my table for PDF/HTML output, but since kableExtra doesn’t really work for docx, just use regular ol’ knitr::kable to render it there!
Taking it furtherOnce I started writing my own book rendering wrapper function, the opportunities opened up in front of me. For example – one thing I like to do is build my drafts and save them into folders with today’s date, so I can put my feedback from lecturers in that same folder and keep it all together for easy reference. So I added the following to my build_book() function:
build_path < glue::glue("book/builds/{Sys.Date()}") bookdown_yml < ymlthis::yml_bookdown_opts(.yml = ymlthis::yml_empty(), book_filename = "My Project", rmd_subdir = c("src"), delete_merged_file = FALSE, output_dir = build_path )Now by creating a custom _bookdown.yml file for each call to build_book(), each build will save into a book/src/DATE specific folder!
There are now tons of ways to further change how Bookdown builds your book based on arguments you provide to your own build_book, swapping in and out various Bookdown configuration files as required. Is this the most efficient way to customise your Bookdown project? Probably not, with all the creating and changing of existing yaml files, but speed isn’t really my concern here – running build_book and having to step back and wait for a minute or two for my code to run is much more preferable to the stupid amounts of time I spent correcting fiddly tables/plot/cross reference output by hand in Word before!
Word / .docx output tweaksSo now I can customise my output based on where I’m rendering, I need to fix a couple of other things in my docx output. These now need to be done by editing my reference document for Bookdown docx output. For a primer on creating and editing your reference doc, check out this article.
One extra thing I needed was numbered section headings, not just H1, H2 etc rendered as headings. This can’t just be set in Bookdown, but instead can be done in your reference doc by selecting your header, going to Format > Style... > Modify, and then finding this hidden gem of a menu in the lower left of the style formatting menu:
Then clicking Outline Numbered:
You can also click ‘customise’ here for some extra options, such as how much subheadings should indent. Be aware you also need to do this for every heading you want numbered, so H1, H2, etc. I then combined this with build_book to choose whether to number headings in docx output:
build_book < function(format = "word", word_num = TRUE){ switch(format, "all" = formats < c("bookdown::pdf_document2", "bookdown::word_document2"), "pdf" = formats < "bookdown::pdf_document2", "word" = formats < "bookdown::word_document2" ) for(fmt in formats) { if(grepl("pdf", fmt)) { out_yml < yaml::read_yaml("_pdf_output.yml") ymlthis::use_yml_file(ymlthis::as_yml(out_yml), "_output.yml") } if(grepl("word", fmt)) { if(word_num = TRUE) { out_yml < yaml::read_yaml("_word_output_numbered.yml") ymlthis::use_yml_file(ymlthis::as_yml(out_yml), "_output.yml") } else { out_yml < yaml::read_yaml("_word_output.yml") ymlthis::use_yml_file(ymlthis::as_yml(out_yml), "_output.yml") } } } bookdown::render_book(here::here("index.Rmd"), output_format = fmt) fs::file_delete("_output.yml") } }And just refer to the two differently formatted ref docx’s within _word_output_numbered.yml and _word_output.yml respectively. Each one gets swapped in when called upon.
Finally, I needed to put in a table of contents and lists of figures/tables at the start of the document. There really doesn’t seem to be an easy way to create ToCs in docx from Bookdown, but fortunately it’s just a quick click or two in Word to autogenerate them from document headings or existing document styles. 4 But, I wanted some blank pages at the start, with ‘Table of Contents’ etc as a heading on each one, so at least I had space for all these things set aside automatically that I didn’t need to think about.
You can include some basic latex functions in Rmarkdown, one of which is \newpage to start a new page. But, if I tried adding # Table of Contents as a header on a page at the start of the document, when I auto generate the ToC in Word this heading itself gets included in the ToC!
I ended up declaring pandoc custom style fences to give this page the style of ‘Title’ rather than ‘heading’, so it won’t be included in the auto generated ToC. I also used glue and the output detector from above to only include it in docx output:
if(grepl("word", fmt)) { word_toc < glue::glue(' \\newpage ::: {{customstyle="Title"}} Table of Contents ::: \\newpage ::: {{customstyle="Title"}} List of Figures ::: \\newpage ::: {{customstyle="Title"}} List of Tables ::: \\newpage ::: {{customstyle="Title"}} List of Appendices ::: ') } else { word_toc < NULL } word_tocRemember to escape your forward slashes and your curly brackets in glue by doubling up!
Writing environmentJust as an aside, as great as Rstudio is, I wouldn’t really recommend it as the best pure writing environment. You can certainly make it a bit nicer with themes and by editing the font used, but you’re not going to get away from its original purpose as a coding environment. While fine for the coding parts of my projects, there was also going to be quite a lot of just plain ol’ prose writing, and I wanted to find somewhere a bit nicer to do those parts.
Although Rmarkdown’s .Rmd files are theoretically plain text files, I found options to be limited when looking for a writing app that can handle them. Initially I tried using VS Code, as while first and foremost a coding environment, it’s highly customisable. In particular, a solarized colour theme, an R markdown plugin, and discovering VS Code’s zen mode got me quite a long way towards a pleasant writing environment, but it still felt a little janky.
I’ve ended up settling on IA Writer as my main ‘writing’ app of choice. It handles .Rmd files without a problem, feels super smooth, and fullscreen provides a nice, distraction free writing environment. It’s not particularly cheap, but you get what you pay for. As a bonus, I can use Working Copy to sync my git repo for each project to my mobile devices and continue working on them in IA Writer’s mobile apps just as if I was on desktop.
If IA Writer is a bit too steep in cost, Typora is a really nice alternative. It’s currently free while in beta, so not sure how much longer that’s going to be the case, but it also handles .Rmd files really well. I just preferred some of the library/file management options in IA myself.
On mobile, 1Writer or Textastic are also great choices for iOS, and JotterPad on Android is also great for markdown, although I haven’t been able to check it works with .Rmd files myself yet.
ReproducibilityOnce I start cobbling together workflows and hacks like those I’ve detailed in this blog post, I invariably start worrying about stability and reproducibility. 5 What happens when I want to start a new project? Or if I lose or break my laptop and I need to start over again on a new device?
This is where working in code is really handy, and having recently tried my hand at R package development comes in really handy. See, now I know just enough about how R works with packages to put my custom project functions – like build_book() – in an R/ folder within my project, and have the usethis package create me a DESCRIPTION file that lists my project’s dependencies.
I can then also drop all my book building assets – like my custom reference docs for docx files, custom latex preambles 6, bibliography files, the works – into that same project, push it to GitHub, and turn that whole thing into a template repository. 7 Now, anytime I need to create a new Bookdown project, I can just clone that repo, and run devtools::load_all() to have my custom build_book function back in action!
Taking it a step further, I can use the newish renv package to create a renv.lock file in my template repo that I can then use to exactly reinstall the package dependencies I might need for any book project. 8
As a side bonus, while reassuring myself in creating a stable and reproducible project setup, this also follows a lot of the steps recommended to turn your research into a research compendium. This means that someone else can also download the git repositories I create this way, and similarly, use my renv.lock or package dependencies in DESCRIPTION to very quickly get up and running with a similar environment to mine to test out my work for themselves. I could even take it a step further and include a Dockerfile and link to an online Rstudio server using the holepunch package to recreate it exactly!
Wrapping upThis blog post is really the result of a couple weekends of StackOverflow diving and thinking about what I really need for my next big writing project. It can be a bit overwhelming working with something with as many options as Bookdown/Rmarkdown. The sheer amount of configurations and things like the integration with Rstudio can lead you to forget that, underneath it all, is still just code you can pull apart and make do what you want with a little sideways thinking. I found a lot of questions from people trying to make Bookdown do something from within, when really a little metacoding with a little function that just swaps out different _bookdown.yml files as needed would do the trick.
I hope this post helps a little bit for those out there in the same boat as me. Us poor souls that still have to deal with docx and Bookdown need to stick together. If you have your own Bookdown/Rmarkdown hints and tips, tweet them to me, I’d love to hear them!
 I like Scrivener, but man, be ready to spend some time reading the manual and figuring out all the little things you can tweak.
︎  There is knit::is_latex_output() and knit::is_html_output() but no is_word. You could work in the negative, but I wanted a way to be explicitly clear about exactly the format being rendered.
︎  Maybe I was doing something wrong, but it never worked for me.
︎  And nicely, table and figure captions from kable/knitr get given their own styles that can be used to generate separate indexes in Word. See this help article for more on generating lists of figures/tables in Word from document styles.
︎  My dissertation project was such a house of cards by the end that I was scared to change anything in any configuration menu/file incase I broke the whole thing.
︎  Now that’s a whole other rabbit hole to fall down…
︎  For a fun trick, setup this template repo with exactly the folder structure you want in all your projects, and use .keep hidden files to keep otherwise empty directories synced in git. Useful for that ‘data’ folder you want to keep in the template but don’t actually have any data in yet.
︎  renv is great, and near as I can tell is aiming to replicate the functionality of things like npm’s package.json file and python’s virtual environments. Highly recommend checking out, it’s worked very smoothly for me so far.
︎
To leave a comment for the author, please follow the link and comment on their blog: R. Rbloggers.com offers daily email updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/datascience job. Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
The post Bookdown, Not Bookout first appeared on Rbloggers.
Announcing the 2020 RStudio Table Contest
[social4i size="small" align="alignleft"] > [This article was first published on RStudio Blog, and kindly contributed to Rbloggers]. (You can report issue about the content on this page here) Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
Here at RStudio, we love tables. No, not tables like the one pictured above, but data tables.
Did you know that data tables have been used since the 2nd century? They’ve really been around.
Tables are a fantastic way to communicate lists of quantitative and qualitative information. Sometimes, tables can fall very short of their potential for greatness. But that was the past: we now have some excellent R packages at our disposal to generate welldesigned and functional presentation tables. And because of this renaissance of tablemaking in R, we’re announcing a contest: The 2020 RStudio Table Contest. It will run from September 15th to October 31st, 2020.
One thing we love about the R community is how open and generous you are in sharing the code and process you use to solve problems. This lets others learn from your experience and invites feedback to improve your work. We hope this contest encourages more sharing and helps to recognize the many outstanding ways people work with and display data with R.
Contest Judging CriteriaTables will be judged based on technical merit, artistic design, and quality of documentation. We recognize that some tables may excel in only one category and others in more than one or all categories. Honorable mentions will be awarded with this in mind.
We are working with maintainers of many of the R community’s most popular R packages for building tables, including Yihui Xie of DT, Rich Iannone of gt, Greg Lin of
reactable, David Gohel of
flextable, David HughJones of huxtable
, and Hao Zhu of kableExtra. Many of these maintainers will help review submissions built with their packages.
A submission must include all code and data used to replicate your entry. This may be a fully knitted R Markdown document with code (for example published to RPubs or shinyapps.io), a repository, or rstudio.cloud project.
A submission can use any tablemaking package available in R, not just the ones mentioned above.
Submission Types – We are looking for three types of table submissions,
 Single Table Example: This may highlight interesting structuring of content, useful and tricky features – for example, enabling interaction – or serve as an example of a common table popular in a specific field. Be sure to document your code for clarity.
 Tutorial: It’s all about teaching us how to craft an excellent table or understand a package’s features. This may include several tables and narrative.
 Other: For submissions that do not easily fit into one of the types above.
Category – Given that tables have different features and purposes, we’d also like you to further categorize the submission table. There are four categories, staticHTML, interactiveHTML, staticprint, and interactiveShiny. Simply choose the one that best fits your table.
You can submit your entry for the contest by filling the form at rstd.io/tablecontest2020. The form will generate a post on RStudio Community, which you can then edit further at a later date. You may make multiple entries.
The deadline for submissions is October 31st, 2020, at midnight Pacific Time.
PrizesGrand Prize
 One year of shinyapps.io Basic plan or one year of RStudio.Cloud Premium.
 Additionally, any number of RStudio tshirts, books, and mugs (worth up to $200).
Unfortunately, we may not be able to send tshirts, books, or other items larger than stickers to nonUS addresses for which shipping and customs costs are high.
Honorable Mention
 A good helping of hex stickers for RStudio packages plus a side of hexes for tablemaking packages, and other goodies.
Previous Shiny Contests have driven significant improvement to the popular Shiny Gallery and we hope this contest will spur development of a similar Tables Gallery. Winners and other participants may be invited to feature their work on such a resource.
We will announce the winners and their submissions on the RStudio blog, RStudio Community, and also on Twitter.
var vglnk = {key: '949efb41171ac6ec1bf7f206d57e90b8'}; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; // s.defer = true; // s.src = '//cdn.viglink.com/api/vglnk.js'; s.src = 'https://www.rbloggers.com/wpcontent/uploads/2020/08/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog. Rbloggers.com offers daily email updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/datascience job. Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
The post Announcing the 2020 RStudio Table Contest first appeared on Rbloggers.
Generating probabilities for ordinal categorical data
[social4i size="small" align="alignleft"] > [This article was first published on ouR data generation, and kindly contributed to Rbloggers]. (You can report issue about the content on this page here) Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
Over the past couple of months, I’ve been describing various aspects of the simulations that we’ve been doing to get ready for a metaanalysis of convalescent plasma treatment for hospitalized patients with COVID19, most recently here. As I continue to do that, I want to provide motivation and code for a small but important part of the data generating process, which involves creating probabilities for ordinal categorical outcomes using a Dirichlet distribution.
MotivationThe outcome for the analysis that we will be conducting is the WHO 11point ordinal scale for clinical improvement at 14 days, which ranges from 0 (uninfected and out of the hospital) to 10 (dead), with various stages of severity in between. We plan to use a Bayesian proportional odds model to assess the effectiveness of the therapy. Since this is a metaanalysis, we will be including these data from a collection of studies being conducted around the world.
Typically, in a proportional odds model one has to make an assumption about proportionality. In this case, while we are willing to make that assumption within specific studies, we are not willing to make that assumption across the various studies. This means we need to generate a separate set of intercepts for each study that we simulate.
In the proportional odds model, we are modeling the logcumulative odds at a particular level. The simplest model with a single exposure/treatment covariate for a specific study or cluster \(k\) is
\[log \left( \frac{P(\text{score}_{k} < x )}{P(\text{score}_{k} \ge x) } \right) = \alpha_{xk} + \beta A,\]
where \(x\) ranges from 1 to 10, all the levels of the WHO score excluding the lowest level \(x=0\). \(A\) is the treatment indicator, and is \(A=1\) for patients who receive the treatment. \(\alpha_{xk}\) is the intercept for each study/cluster \(k\). \(\beta\) is interpreted as the logodds ratio comparing the odds of the treated with the nontreated within each study. The proportionality assumption kicks in here when we note that \(\beta\) is constant for all levels of \(x\). In addition, in this particular model, we are assuming that the logodds ratio is constant across studies (not something we will assume in a more complete model). We make no assumptions about how the study intercepts relate to each other.
To make clear what it would mean to make a stronger assumption about the odds across studies consider this model:
\[log \left( \frac{P(\text{score}_{k} < x )}{P(\text{score}_{k} \ge x) } \right) = \alpha_{x} + b_k + \beta A,\]
where the intercepts for each study are related, since they are defined as \(\alpha_{x} + b_k\), and share \(\alpha_x\) in common. If we compare the logodds of the treated in one study \(k\) with the logodds of treated in another study \(j\) (so \(A=1\) in both cases), the logodds ratio is \(b_j – b_k\). The ratio is independent of \(x\), which implies a strong proportional odds assumption across studies. In contrast, the same comparison across studies based on the first model is \(\alpha_{xj} – \alpha_{xk}\), which is not necessarily constant across different levels of \(x\).
This is a long way of explaining why we need to generate different sets of intercepts for each study. In short, we would like to make the more relaxed assumption that odds are not proportional across studies or clusters.
The Dirichlet distributionIn order to generate ordinal categorical data I use the genOrdCat function in the simstudy package. This function requires a set of baseline probabilities that sum to one; these probabilities map onto levelspecific intercepts. There will be a distinct set of baseline probabilities for each study and I will create a data set for each study. The challenge is to be able to generate unique baseline probabilities as if I were sampling from a population of studies.
If I want to generate a single probability (i.e. a number between \(0\) and \(1\)), a good solution is to draw a value from a beta distribution, which has two shape parameters \(\alpha\) and \(\beta\).
Here is a single draw from \(beta(3, 3)\):
set.seed(872837) rbeta(1, shape1 = 3, shape2 = 3) ## [1] 0.568The mean of the beta distribution is \(\alpha/(\alpha + \beta)\) and the variance is \(\alpha\beta/(\alpha+\beta)^2(\alpha + \beta + 1)\). We can reduce the variance and maintain the same mean by increasing \(\alpha\) and \(\beta\) by a constant factor (see addendum for a pretty picture):
library(data.table) d1 < data.table(s = 1, value = rbeta(1000, shape1 = 1, shape2 = 2)) d2 < data.table(s = 2, value = rbeta(1000, shape1 = 5, shape2 = 10)) d3 < data.table(s = 3, value = rbeta(1000, shape1 = 100, shape2 = 200)) dd < rbind(d1, d2, d3) dd[, .(mean(value), sd(value)), keyby = s] ## s V1 V2 ## 1: 1 0.338 0.2307 ## 2: 2 0.336 0.1195 ## 3: 3 0.333 0.0283The Dirichlet distribution is a multivariate version of the beta distribution where \(K\) values between \(0\) and \(1\) are generated, with the caveat that they sum to \(1\). Instead of \(\alpha\) and \(\beta\), the Dirichlet is parameterized by a vector of length \(K\)
\[\boldsymbol{\alpha} = \left(\alpha_1,\dots, \alpha_K\right)^T,\]
where there are \(K\) levels of the ordinal outcome. A draw from this distribution returns a vector \(\boldsymbol{p} = ( p_1, \dots, p_K)^T\) where \(\sum_{i=1}^K p_i = 1\) and
\[E(p_k)=\frac{\alpha_k}{\sum_{i=1}^K \alpha_i}.\]
A draw from a Dirichlet distribution with \(K=2\) is actually equivalent to a draw from a beta distribution where \(\boldsymbol{\alpha} = (\alpha, \beta)^T\). Before, I generated data from a \(beta(1, 2)\), and now here is a draw from \(Dirichlet\left(\boldsymbol\alpha = (1,2)\right)\) using rdirichlet from the gtools package:
The first column has the same distribution as the \(beta\) distribution from before; the mean and standard deviation are close to the values estimated above:
c(mean(dir[,1]), sd(dir[,1])) ## [1] 0.332 0.236To ramp things up a bit, say we have \(K = 5\), and the target mean values for each level are \(\boldsymbol{p} = \left(\frac{1}{9}, \frac{2}{9}, \frac{3}{9}, \frac{2}{9}, \frac{1}{9} \right)\), one way to specify this is:
dir_1 < rdirichlet(1000, alpha = c(1, 2, 3, 2, 1)) head(dir_1) ## [,1] [,2] [,3] [,4] [,5] ## [1,] 0.1710 0.6637 0.0676 0.0633 0.0343 ## [2,] 0.1130 0.1150 0.2803 0.4229 0.0689 ## [3,] 0.1434 0.0678 0.3316 0.1721 0.2851 ## [4,] 0.0250 0.1707 0.3841 0.2490 0.1712 ## [5,] 0.0633 0.3465 0.4056 0.0853 0.0993 ## [6,] 0.1291 0.1510 0.3993 0.2612 0.0593Here are the observed means for each \(p_k\), pretty close to the target:
apply(dir_1, 2, mean) ## [1] 0.111 0.221 0.328 0.229 0.112Of course, we could generate data with a similar target \(\boldsymbol{p}\) by multiplying \(\boldsymbol\alpha\) by a constant \(c\). In this case, we use \(c=10\) and see that the average values for each \(p_k\) are also close to the target:
dir_2 < rdirichlet(1000, alpha = c(10, 20, 30, 20, 10)) apply(dir_2, 2, mean) ## [1] 0.113 0.222 0.334 0.220 0.111There is a key difference between specifying \(\boldsymbol{\alpha}\) and \(c\boldsymbol{\alpha}\). Just as in the beta distribution, as \(c\) grows larger, the variation within each \(p_k\) decreases. This will be useful when generating the study specific probabilities if we want explore different levels of variation.
Here’s the standard deviations from the two data sets just generated:
apply(dir_1, 2, sd) ## [1] 0.102 0.131 0.144 0.134 0.098 apply(dir_2, 2, sd) ## [1] 0.0333 0.0425 0.0508 0.0421 0.0333 Generating the baseline probabilitiesA simple function that includes two key arguments – the base probabilities (which are really \(\boldsymbol{\alpha}\)) and a similarity index (which is really just the constant \(c\)) – implements these ideas to generate studyspecific probabilities for each outcome level. As the similarity index increases, the variation across studies or sites decreases. The function includes an additional adjustment to ensure that the row totals sum exactly to \(1\) and not to some value infinitesimally greater than \(1\) as a result of rounding. Such a rounding error could cause problems for the function genOrdCat.
genBaseProbs < function(n, base, similarity, digits = 8) { n_levels < length(base) x < rdirichlet(n, similarity * base) # ensure that each vector of probabilities sums exactly to 1 x < round(floor(x*1e8)/1e8, digits) # round the generated probabilities xpart < x[, 1:(n_levels1)] # delete the base prob of the final level partsum < apply(xpart, 1, sum) # add the values of levels 1 to K1 x[, n_levels] < 1  partsum # the base prob of the level K = 1  sum(1:[K1]) return(x) }In this first example, I am generating 11 values (representing base probabilities) for each of 9 studies using a relatively low similarity index, showing you the first six studies:
basestudy < genBaseProbs( n = 9, base = c(0.05, 0.06, 0.07, 0.11, 0.12, 0.20, 0.12, 0.09, 0.08, 0.05, 0.05), similarity = 15, ) round(head(basestudy), 3) ## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] ## [1,] 0.094 0.022 0.121 0.100 0.061 0.102 0.053 0.309 0.059 0.078 0.000 ## [2,] 0.025 0.079 0.043 0.197 0.083 0.044 0.099 0.148 0.025 0.150 0.107 ## [3,] 0.007 0.042 0.084 0.066 0.049 0.145 0.191 0.323 0.078 0.012 0.003 ## [4,] 0.061 0.021 0.063 0.104 0.092 0.292 0.112 0.110 0.113 0.026 0.008 ## [5,] 0.067 0.023 0.021 0.042 0.063 0.473 0.108 0.127 0.016 0.013 0.046 ## [6,] 0.001 0.018 0.054 0.225 0.150 0.301 0.043 0.081 0.100 0.008 0.020A great way to see the variability is a cumulative probability plot for each individual study. With a relatively low similarity index, you can generate quite a bit of variability across the studies. In order to create the plot, I need to first calculate the cumulative probabilities:
library(ggplot2) library(viridis) cumprobs < data.table(t(apply(basestudy, 1, cumsum))) n_levels < ncol(cumprobs) cumprobs[, id := .I] dm < melt(cumprobs, id.vars = "id", variable.factor = TRUE) dm[, level := factor(variable, labels = c(0:10))] ggplot(data = dm, aes(x=level, y = value)) + geom_line(aes(group = id, color = id)) + scale_color_viridis( option = "D") + theme(panel.grid.minor = element_blank(), panel.grid.major.y = element_blank(), legend.position = "none")Here is a plot of data generated using a similarity index of 150. Variation is reduced pretty dramatically:
Using base probabilities to generate ordinal dataNow that we have these base probabilities, the last step is to use them to generate ordinal outcomes. I am generating the simplest of data sets: 9 “studies” each with 500 subjects, without any covariates or even treatment assignment. Since the genOrdCat requires an adjustment variable, I am adjusting everyone by 0. (This is something I need to fix – there should be no such requirement.)
library(simstudy) d_study < genData(9, id = "study") d_ind < genCluster(d_study, "study", numIndsVar = 500, "id") d_ind[, z := 0] d_ind ## study id z ## 1: 1 1 0 ## 2: 1 2 0 ## 3: 1 3 0 ## 4: 1 4 0 ## 5: 1 5 0 ##  ## 4496: 9 4496 0 ## 4497: 9 4497 0 ## 4498: 9 4498 0 ## 4499: 9 4499 0 ## 4500: 9 4500 0To generate the ordinal categorical outcome, we have to treat each study separately since they have unique baseline probabilities. This can be accomplished using lapply in the following way:
basestudy < genBaseProbs( n = 9, base = c(0.05, 0.06, 0.07, 0.11, 0.12, 0.20, 0.12, 0.09, 0.08, 0.05, 0.05), similarity = 50 ) list_ind < lapply( X = 1:9, function(i) { b < basestudy[i,] d_x < d_ind[study == i] genOrdCat(d_x, adjVar = "z", b, catVar = "ordY") } )The output list_ind is a list of data tables, one for each study. For example, here is the 5th data table in the list:
list_ind[[5]] ## study id z ordY ## 1: 5 2001 0 7 ## 2: 5 2002 0 9 ## 3: 5 2003 0 5 ## 4: 5 2004 0 9 ## 5: 5 2005 0 9 ##  ## 496: 5 2496 0 9 ## 497: 5 2497 0 4 ## 498: 5 2498 0 7 ## 499: 5 2499 0 5 ## 500: 5 2500 0 11And here is a table of proportions for each study that we can compare with the base probabilities:
t(sapply(list_ind, function(x) x[, prop.table(table(ordY))])) ## 1 2 3 4 5 6 7 8 9 10 11 ## [1,] 0.106 0.048 0.086 0.158 0.058 0.162 0.092 0.156 0.084 0.028 0.022 ## [2,] 0.080 0.024 0.092 0.134 0.040 0.314 0.058 0.110 0.028 0.110 0.010 ## [3,] 0.078 0.050 0.028 0.054 0.148 0.172 0.162 0.134 0.058 0.082 0.034 ## [4,] 0.010 0.056 0.116 0.160 0.054 0.184 0.102 0.084 0.156 0.056 0.022 ## [5,] 0.010 0.026 0.036 0.152 0.150 0.234 0.136 0.084 0.120 0.026 0.026 ## [6,] 0.040 0.078 0.100 0.092 0.170 0.168 0.196 0.050 0.038 0.034 0.034 ## [7,] 0.006 0.064 0.058 0.064 0.120 0.318 0.114 0.068 0.082 0.046 0.060 ## [8,] 0.022 0.070 0.038 0.160 0.182 0.190 0.074 0.068 0.070 0.036 0.090 ## [9,] 0.054 0.046 0.052 0.128 0.100 0.290 0.102 0.092 0.080 0.030 0.026Of course, the best way to compare is to plot the data for each study. Here is another cumulative probability plot, this time including the observed (generated) probabilities in black over the baseline probabilities used in the data generation in red:
Sometime soon, I plan to incorporate something like the function genBaseProbs into simstudy to make it easier to incorporate nonproportionality assumptions into simulation studies that use ordinal categorical outcomes.
Addendum
The variance of the beta distribution (and similarly the Dirichlet distribution) decreases as \(\alpha\) and \(\beta\) both increase proportionally (keeping the mean constant). I’ve plotted the variance of the beta distribution for \(\alpha = 1\) and different levels of \(\beta\) and \(C\). It is clear that at any level of \(\beta\) (I’ve drawn a line at \(\beta = 1\)), the variance decreases as \(C\) increases. It is also clear that, holding \(\alpha\) constant, the relationship of \(\beta\) to variance is not strictly monotonic:
var_beta < function(params) { a < params[1] b < params[2] (a * b) / ( (a + b)^2 * (a + b + 1)) } loop_b < function(C, b) { V < sapply(C, function(x) var_beta(x*c(1, b))) data.table(b, V, C) } b < seq(.1, 25, .1) C < c(0.01, 0.1, 0.25, 0.5, 1, 2, 4, 10, 100) d_var < rbindlist(lapply(b, function(x) loop_b(C, x))) ggplot(data = d_var, aes(x = b, y = V, group = C)) + geom_vline(xintercept = 1, size = .5, color = "grey80") + geom_line(aes(color = factor(C))) + scale_y_continuous(name = expression("Var beta"~(alpha==1~","~beta))) + scale_x_continuous(name = expression(beta)) + scale_color_viridis(discrete = TRUE, option = "B", name = "C") + theme(panel.grid = element_blank(), legend.title.align=0.15) var vglnk = {key: '949efb41171ac6ec1bf7f206d57e90b8'}; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; // s.defer = true; // s.src = '//cdn.viglink.com/api/vglnk.js'; s.src = 'https://www.rbloggers.com/wpcontent/uploads/2020/08/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: ouR data generation. Rbloggers.com offers daily email updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/datascience job. Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
The post Generating probabilities for ordinal categorical data first appeared on Rbloggers.
GoldMining Week 1 (2020)
[social4i size="small" align="alignleft"] > [This article was first published on R – Fantasy Football Analytics, and kindly contributed to Rbloggers]. (You can report issue about the content on this page here) Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
Week 1 Gold Mining and Fantasy Football Projection Roundup now available. (Better late than never!)
The post GoldMining Week 1 (2020) appeared first on Fantasy Football Analytics.
var vglnk = {key: '949efb41171ac6ec1bf7f206d57e90b8'}; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; // s.defer = true; // s.src = '//cdn.viglink.com/api/vglnk.js'; s.src = 'https://www.rbloggers.com/wpcontent/uploads/2020/08/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Fantasy Football Analytics. Rbloggers.com offers daily email updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/datascience job. Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
The post GoldMining Week 1 (2020) first appeared on Rbloggers.
COVID19: Exponential Growth in London
[social4i size="small" align="alignleft"] > [This article was first published on R & Decision Making, and kindly contributed to Rbloggers]. (You can report issue about the content on this page here) Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
I predicted on 25th August a secondwave in London in 24 weeks time. It looks like the recent media’s coverage and the government’s policy reaction are consistent with my prediction.
The policy response of ruleofsixinsocialgathering is mild. Hopefully though, it forces people to realise that they cannot afford to relax too much. Nevertheless, from talking to friends & colleagues, I don’t get the sense that people are ready to scale back their activity and social events!
I have fitted an exponential curve to the recent data and I get a whopping 98% R2 and <1% pvalue (OK – R2 is a bit inflated on overlapping data but still.). Here I added to my graph the fitted value and the model’s predicted value for the next four weeks.
#covid19 #secondwave
The full code below. The code is essentially based on my previous post with the additional exponential model in the middle section. library(data.table) library(ggplot2) data_url < “https://c19downloads.azureedge.net/downloads/csv/coronaviruscases_latest.csv” raw_data < fread(data_url, check.names = TRUE) area_name < “London” area_type < “region” area_data < raw_data[ Area.name == area_name & Area.type == area_type,, ][,Specimen.date := as.Date(Specimen.date) ][,c(“Specimen.date”,”Daily.lab.confirmed.cases”)][ order(Specimen.date) ] area_data < merge(area_data, data.table(Specimen.date = seq( min(area_data[,Specimen.date]), max(area_data[,Specimen.date]), by = “1 day” )), all = TRUE, by = “Specimen.date”) setkey(area_data, Specimen.date) setnafill(area_data, type = “const”, fill = 0, cols = c(“Daily.lab.confirmed.cases”)) area_data[,roll_mean := frollmean(Daily.lab.confirmed.cases, n = 7, align = “right”)] ###################################### ###########Exponential model########## ###################################### area_data[,increasing := c(rep(NA,7), roll_mean[(1:7)] roll_mean[((.N6):.N)]>0)] end_date < area_data[order(Specimen.date, decreasing = TRUE)][increasing==TRUE,,][, Specimen.date[1], by=”increasing”]$V1 start_date < area_data[order(Specimen.date, decreasing = TRUE)][ increasing==FALSE & Specimen.date < end_date,,][, Specimen.date[1], by=”increasing”]$V1 exp_lm_data < area_data[Specimen.date > start_date & Specimen.date <= end_date,,] exp_lm_data[, days := 1:.N] exp_lm < lm(log(roll_mean)~ days, data = exp_lm_data) exp_lm_data[,fitted_numbers := exp(fitted.values(exp_lm))] predicted_data < data.table(days=max(exp_lm_data$days)+1:28) predicted_data[,Specimen.date := min(exp_lm_data$Specimen.date)+ lubridate::days(days)] predicted_data[,predicted_numbers := exp(predict.lm(exp_lm, predicted_data))] ##################################### m_area_data < melt(area_data, id.vars=”Specimen.date”, measure.vars = c(“Daily.lab.confirmed.cases”,”roll_mean”)) exp_lm_data < melt(dplyr::bind_rows(exp_lm_data, predicted_data), id.vars=”Specimen.date”, measure.vars = c(“fitted_numbers”,”predicted_numbers”)) m_area_data < rbind(m_area_data, exp_lm_data) area_plot < ggplot(m_area_data, aes(x = Specimen.date, y = value, fill = variable, color = variable))+ geom_bar(data = subset(m_area_data, variable == “Daily.lab.confirmed.cases”), stat = “identity”) + geom_line(data = subset(m_area_data, variable != “Daily.lab.confirmed.cases”)) + labs(x=”Specimen Date”, y=”Number of Confirmed Cases”, fill = “”, color = “”) + scale_fill_manual(values = c(“#ff0000″,”#05d153″,”#cad105″,”#000000”), labels = c(sprintf(“%s # Daily Confirmed cases”,area_name), “fitted”,”predicted”,”7 day average”)) + scale_color_manual(values = c(“#ff0000″,”#05d153″,”#cad105″,”#000000”), labels = c(sprintf(“%s # Daily Confirmed cases”,area_name), “fitted”,”predicted”,”7 day average”)) + scale_x_date(date_breaks = “4 weeks”, date_labels = “%Y%m%d”) + theme_bw() %+replace% theme(legend.position = “top”, legend.justification = “left”) var vglnk = {key: '949efb41171ac6ec1bf7f206d57e90b8'}; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; // s.defer = true; // s.src = '//cdn.viglink.com/api/vglnk.js'; s.src = 'https://www.rbloggers.com/wpcontent/uploads/2020/08/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: R & Decision Making. Rbloggers.com offers daily email updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/datascience job. Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
The post COVID19: Exponential Growth in London first appeared on Rbloggers.
Learning guide: Introduction to R, oneday workshop
[social4i size="small" align="alignleft"] > [This article was first published on George J. Mount, and kindly contributed to Rbloggers]. (You can report issue about the content on this page here) Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
When I was an undergrad, a professor suggested I learn this statistical programming language called R.
I took one look at the interface, panicked, and left.
A lot has changed in the R world since then, not the least of which was the release of the RStudio integrated development environment. While the universe of R packages continues to grow, and the work can now be done from the comfort of RStudio, the fact remains: learning R means learning to code R.
Many of my students have never coded before, although this is a halftruth: they’ve probably used Excel, which requires a decent amount of functions and references. What Excel doesn’t require, though, is naming and manipulating variables.
R is an ideal choice for firsttime data coders: the familiar tabular data frame is a core structure. Operations are designed with data analysis in mind: after all, R is a statistical programming language. (In my opinion, this makes it preferred to Python, which was designed as a generalpurpose scripting language — again, as far as learning to code as a data analyst goes.)
I assume no prior coding language for this workshop. My goals are to equip students to work comfortably from the RStudio environment, ingest and explore data, and make simple graphical representations of data. In particular, students will perform the most common tabular data cleaning and exploration tasks using the dplyr library.
Above all these objectives, however, is my goal to help students not panic over learning R, like I did when I started.
Download the learning guide here.Download 1: Welcome to the R ProjectObjective: Student can install and load an R package
Description:
 What is R and when would I use it?
 R plus RStudio
 Installing and loading packages
Exercise: Install a CRAN task view
Assets needed: None
Time: 35 minutes
Lesson 2: Introduction to RStudioObjective: Student can navigate the RStudio integrated development environment
Description:
 Basic arithmetic and comparison operations
 Saving, closing and loading scripts
 Opening help documentation
 Plotting graphs
 Assigning objects
Exercises: Practice assigning and removing objects
Assets needed: None
Time: 40 minutes
Lesson 3: Working with vectorsObjective: Student can create, inspect and modify vectors
Description:
 Creating vectors
 Vector operations
 Indexing elements of a vector
Exercises: Drills
Assets needed: None
Time: 35 minutes
Lesson 4: Working with data framesObjective: Student can create, inspect and modify data frames
Description:
 Creating a data frame
 Data frame operations
 Indexing data frames
 Column calculations
 Filtering and subsetting a data frame
 Conducting exploratory data analysis on a data frame
Exercises: Drills
Assets needed: Iris dataset
Time: 70 minutes
Lesson 5: Reading, writing and exploring data framesObjective: Student can read, write and analyze tabular external fines
Description:
 Reading and writing csv and txt files
 Reading and writing Excel files
 Exploring a dataset
 Descriptive statistics
Exercises: Drills
Assets needed: Iris dataset
Time: 40 minutes
Lesson 6: Data manipulation with dplyrObjective: Student can perform common data manipulation tasks with dplyr
Description:
 Manipulating rows
 Manipulating columns
 Summarizing data
Exercises: Drills
Assets needed: Airport flight records
Time: 50 minutes
Lesson 7: Data manipulation with dplyr, continuedObjective: Student can perform more advanced data manipulation with dplyr
Description:
 Building a data pipeline
 Joining two datasets
 Reshaping a dataset
Exercises: Drills
Assets needed: Airport flight records
Time: 50 minutes
Lesson 8: R for data visualizationObjective: Student can create graphs in R using visualization best practices
Description:
 Graphics in base R
 Visualizing a variable’s distribution
 Visualizing values across categories
 Visualizing trends over time
 Graphics in ggplot2
Exercises: Drills
Assets needed: Airport flight records
Time: 70 minutes
Lesson 9: CapstoneObjective: Student can complete endtoend data exploration project in R
Assets needed: Baseball records
Time: 40 minutes
This download is part of my resource library. For exclusive free access, subscribe below. var vglnk = {key: '949efb41171ac6ec1bf7f206d57e90b8'}; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; // s.defer = true; // s.src = '//cdn.viglink.com/api/vglnk.js'; s.src = 'https://www.rbloggers.com/wpcontent/uploads/2020/08/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: George J. Mount. Rbloggers.com offers daily email updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/datascience job. Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
The post Learning guide: Introduction to R, oneday workshop first appeared on Rbloggers.
U.S. Avalanche Fatalities – A History and Breakdown in R
[social4i size="small" align="alignleft"] > [This article was first published on Stoltzman Consulting Data Analytics Blog  Stoltzman Consulting, and kindly contributed to Rbloggers]. (You can report issue about the content on this page here) Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
Each winter season, avalanches pose a serious threat to those who adventure in mountainous regions. An avalanche is defined as a mass of snow sliding, flowing, or tumbling down an inclined surface, typically destroying everything in its path. This post showcases the history of various activities that contribute to the total avalanche deaths that occur in the United States every year.
Data – CAIC
The Colorado Avalanche Information Center (CAIC) collected data on avalanches from 1951 to 2020. The dataset includes a date, setting, state, primary activity, and number of deaths for each avalanche event that resulted in at least one death. When it comes to date, year and avalanche year are included. Avalanche years go with winter seasons, where avalanches later in the year are considered to be in the next avalanche year. For example, an avalanche that occurred on 12222018 would be in avalanche year 2019. Avalanche year was chosen for this exploration. There are 24 unique primary activities in the dataset. I grouped similar primary activities into categories to prevent over plotting.
Observations

Backcountry skiers and snowmobilers/motorized make up the majority of deaths across decades

Avalanches do occur inbounds at ski resorts

There was a huge spike from 1980 to 2000 for snowmobiling/motorized. This may be related to advancements in technology
The following code shows how the winter activities were grouped:
avalanche < avalanche %>% mutate(activity_group = case_when( primary_activity %in% c("Backcountry Tourer","Hybrid Tourer", "Hybrid Rider", "Sidecountry Rider") ~ "Backcountry Skier", primary_activity %in% c("Hiker", "Hunter", "Snowplayer", "Humanpowered Guide Client", "Resident") ~ "Hiker", primary_activity %in% c("Snowmobiler", "Snow Biker", "Motorized", "Mechanized Guide", "Machanized Guiding Client") ~ "Snowmobiler/Motorized", primary_activity %in% "Climber" ~ "Climber", primary_activity %in% c("Inbounds Rider", "Ski Patroller") ~ "Inbounds", primary_activity %in% c("Rescuer", "Ranger", "Others at Work", "Highway Personnel") ~ "Worker", TRUE ~ "Other" ))Additional geographical visualizations are provided below.
By states:
Comparing East Coast to West Coast:
There are many possibilities with this dataset and we are excited to explore it further. Stay tuned for more interesting finds. For more information on this dataset, check out: avalanche.state.co.us/accidents/statisticsandreporting/
var vglnk = {key: '949efb41171ac6ec1bf7f206d57e90b8'}; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; // s.defer = true; // s.src = '//cdn.viglink.com/api/vglnk.js'; s.src = 'https://www.rbloggers.com/wpcontent/uploads/2020/08/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Stoltzman Consulting Data Analytics Blog  Stoltzman Consulting. Rbloggers.com offers daily email updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/datascience job. Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
The post U.S. Avalanche Fatalities – A History and Breakdown in R first appeared on Rbloggers.
Why R? Text Mining Hackathon
[social4i size="small" align="alignleft"] > [This article was first published on http://raddict.com, and kindly contributed to Rbloggers]. (You can report issue about the content on this page here) Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
We just announced the theme for Why R? 2020 Hackathon – Text Mining. You can read some more details about the event on this website 2020.whyr.pl/hackathon/
Submit TeamsIf you find the topic interesting, please follow this form whyr.pl/2020/hackathon/register/ to submit your team. We would like participants to gather in teams of 4 or 5. If you don’t yet have a team, please use our whyr.pl/slack/ to find team mates.
Timespan We start the hackathon with Keynote Talk by Julia Silge at 5:30pm UTC 20200923.
 We finish the hackathon with Sponsored Talk by McKinsey at 5:30pm UTC 20200924.
This means there are 24hours for solving challenges published at the beginning of the event!
You can find more scheduled talks on our YouTube channel youtube.com/WhyRFoundation
We will be adding content during upcoming days.
Hackathon promotes Why R? Conference that is planned for 2427 Sept 2020. All combined urls are gathered below.
 website 2020.whyr.pl
 register 2020.whyr.pl/register/
 plan 2020.whyr.pl/plan/
 workshops 2020.whyr.pl/workshops/
 hackathon 2020.whyr.pl/hackathon/
 abstracts 2020.whyr.pl/abstracts/
To leave a comment for the author, please follow the link and comment on their blog: http://raddict.com. Rbloggers.com offers daily email updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/datascience job. Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
The post Why R? Text Mining Hackathon first appeared on Rbloggers.
Lasso and the Methods of Causality
[social4i size="small" align="alignleft"] > [This article was first published on Economics and R  R posts, and kindly contributed to Rbloggers]. (You can report issue about the content on this page here) Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
An exciting, quite new branch of econometrics studies how machine learning techniques can be adapted to consistently estimate causal effects. For example, the widely cited article by Belloni, Chernozhukov and Hansen (2014) introduces a post double selection method where one first runs two lasso regressions to select suitable control variables for a final OLS regression.
When I first heard about the method, it had the alluring promise to provide an objective way to select control variables that may reduce the scope of phacking. Under certain assumption Belloni et. al. establish consistency and valid confidence intervals for causal effect estimation via post double selection.
But assumptions may be violated and there may be plausible settings where the method fails. Unfortunately, I lack the deep econometric background to have a good intuitive assessment of the assumptions. But luckily even without a deep grasp of asymptotic theory, one can always combine some intuition with MonteCarlo simulation to assess an econometric method. This blog post first introduces the concepts on an intuitive level and illustrates cases where post double selection works very well. But in the second part, I intuitively discuss and simulate cases where postdouble selection fails while variable selection with a simple lasso regression works pretty well.
(Note that this post’s title is just a small reference to Harry Potter and the Methods of Rationality, one of my favorite fantasy books. It even has a chapter called “Bayes Theorem”.)
Motivation: Confounders and Control VariablesProbably most empirical economics papers are interested in estimating a causal effect $\alpha$ of some explanatory variable $d$ on an dependent variable $y$. (Note that $d$ is not necessarily a dummy, I just stick to prevalent notation in the literature). If our data does not come from a randomized experiment this is generally a hard task because there are essentially always some confounders. Unless we appropriately control for confounders (or circumvent them e.g. by instrumental variable estimation), we get an inconsistent estimator $\hat \alpha$ of the causal effect.
Consider the following simple example where our OLS estimator $\hat \alpha$ has a positive bias if we don’t control for the confounder xc:
n = 10000 xc = rnorm(n,0,1) # confounder d = xc + rnorm(n,0,1) alpha = 1; beta = 1; y = alpha*d + beta * xc + rnorm(n, 0,1) # OLS regression without controlling for confounder coef(lm(y~d))[2] ## d ## 1.485069If we add the confounder as control variable, we get an unbiased OLS estimator:
coef(lm(y~d+xc))[2] ## d ## 0.9987413In empirical studies, the main problem is typically that we just don’t have data for all relevant confounders. However, there are some applications where we don’t have too many observations but a lot of potential control variables, in particular if we also consider nonlinear effects and interaction terms. An example can be found in the vignette (Chernozhukov, Hansen and Spindler, 2016) of the hdm package (which implements the postdoubleselection from Belloni et. al.). The example is based on the empirical growth literature, which tries to estimate the causal effect of some variable, like initial GDP, on GDP growth. Gross country panel data sets are not very large, but we can observe a lot of variables for a country. If one is optimistic, one may believe that some subset of the potential control variables is able to effectively control for all confounders.
But if we don’t have enough observations, we need some method to select the relevant control variables.
Framework of Monte Carlo SimulationLet us study different approaches for variable selection using a MonteCarlo simulation with the following data generating process for $y$ and $d$. (If Mathjax is not well rendered on a blog aggregator click here.):
\[y = \alpha d + \sum_{k=1}^{K_c} \beta^{cy}_k {x^c_k} + \sum_{k=1}^{K_y} \beta^y_k {x^y_k} + \varepsilon^y\]
\[d = \sum_{k=1}^{K_c} \beta^{cd}_k {x^c_k} + \sum_{k=1}^{K_e} \beta^e_k {x^e_k} + \varepsilon^d\]
We have the following potential control variables that are all independently normally distributed from each other:

$x^c_k$ is one of $K_c$ confounders that directly affect both $d$ and $y$. For a consistent OLS estimation of $\alpha$, we need to control for all confounders.

$x^y_k$ is one of $K_y$ variables that only affect the dependent variable $y$ but not the explanatory variable $d$. Whether we would add it or not in an OLS regression should not affect the bias of our estimator $\hat \alpha$.

$x^e_k$ is one of $K_e$ variables that only affect $d$ but not through any other channel the dependent variable $y$. It constitutes a source of exogenous variation in $d$. We can estimate in an OLS regression $\alpha$ more precisely if we don’t add any $x^e_k$ to the regression. Also, we will see that adding fewer $x^e_k$ can reduce the bias of an OLS estimator that arises if we have not perfectly controlled for all confounders.

We also observe $K_u$ variables $x^u_k$ that neither affect $y$ nor $d$. They are just uncorrelated noise variables that we ideally leave out of our regressions.
In this Github repository you can find scripts that contains the function lasso_sim and other utility functions that help us to run the simulation.
The following code simulates a data set with $n=1000$ observations, $K_c=50$ confounders, $K_y=50$ variables that affect only $y$, just a single observed variable $x^e_k$ that provides a source of exogenous variation and $K_u=1000$ explanatory variables that are uncorrelated with everything else. The causal effect of interest $\alpha$, as well as all other regression coefficients and standard deviations are equal to 1.
source("lasso_tools.R") source("lasso_sim.R") set.seed(1) mat = lasso_sim(alpha=1, n=1000,Kc=50,Ke=1,Ky=50,Ku=1000,return.what = "data") dim(mat) # more variables than observations ## [1] 1000 1103 mat[1:3,1:5] # show excerpt ## y d xc1 xc2 xc3 ## [1,] 1.319487 3.570383 0.6264538 1.1349651 0.8861496 ## [2,] 15.526450 5.809369 0.1836433 1.1119318 1.9222549 ## [3,] 21.525964 1.011645 0.8356286 0.8707776 1.6197007Let us first estimate three regressions via OLS:
y = mat[,1] X = cbind(1,mat[,1]); colnames(X)[1] = "const" # Using all x does give nonsensible result # because we have too many variables coef(lm.fit(y=y,x=X))[2] ## d ## 3.112387 # Short regression y ~ d yields positive bias # due to omitted confounders coef(lm.fit(y=y,x=X[,1:2]))[2] ## d ## 1.94576 # Controlling for all confounders eliminates bias coef(lm.fit(y=y,x=X[,1:52]))[2] ## d ## 1.015087If we know which variables are the confounders, we can easily consistently estimate the causal effect $\alpha=1$. But let’s assume we don’t know which of the 1101 potential control variables are the confounders.
Lasso, Postlasso and Post Double SelectionFrom the machine learning toolbox, lasso regressions seem natural candidates to select relevant control variables. Consider a linear regression with $K$ standardized explanatory variables with corresponding estimators and residuals denoted by $\hat \beta$ and $\hat \varepsilon(\hat \beta)$, respectively. The lasso estimator solves the following optimization problem:
\[\min_{\hat \beta} \sum_{i=1}^n {\hat \varepsilon_i(\hat \beta)^2} + \lambda \sum_{k=1}^K {\hat \beta_k}\]
The first term is just the sum of squared residuals, which the OLS estimator minimizes. The second term penalizes larger absolute values of the estimated coefficients. The penalty parameter $\lambda \gt 0$ will be chosen in an outer loop e.g. by crossvalidation or using a criterion like the corrected AIC. Lasso estimates typically have many coefficients $\hat \beta_k$ equal to zero. In this sense the lasso estimator selects a subset of explanatory variables whose estimated coefficients are nonzero.
But also the coefficients of the selected variables will be typically attenuated towards 0 because of the penalty term. The postlasso estimator avoids this attenuation by simply performing an OLS estimation using all the selected variables from the lasso estimation.
The following code estimates a lasso and postlasso regression for our dependent variable $y$ in our simulated data set and examines the corresponding estimator $\hat \alpha$ of our causal effect of interest. I use the package gamlr developed by Matt Taddy. He uses it extensively in his great new text book Business Data Science. It discusses a lot of modern econometric developments at the intersection with machine learning in a very intuitive and applied fashion.
The gamlr function uses by default the corrected Akaike Information Criterion to select the penalty parameter $\lambda$. This avoids the random fluctuation from cross validation (used in the popular cv.glmnet function) and makes the gamlr function blazingly fast.
# Lasso estimator library(gamlr) lasso = gamlr(x=X,y=y) # lasso_coef is a helper function in lasso_tools.R # because coef(lasso) returns a sparse matrix coefs = lasso_coef(lasso,keep.intercept = FALSE) coefs[1] # still biased ## d ## 1.955174 # Post lasso estimator vars = names(coefs) post.lasso = lm.fit(y=y,x=X[,c("const",vars)]) coef(post.lasso)[2] # postlasso estimator also still biased ## d ## 1.979171We see that both the lasso and postlasso estimators of the causal effect $\alpha$ are still biased. Both are roughly the same size as the estimate in the short OLS regression of $y$ on just $d$. To understand why, let’s have a look at the selected variables:
vars ## [1] "d" "xc3" "xc39" "xe1" "xu314" "xu700" "xu734" "xu763" "xu779" ## [10] "xu831" "xu870" "xy1" "xy2" "xy3" "xy4" "xy5" "xy6" "xy7" ## [19] "xy8" "xy9" "xy10" "xy11" "xy12" "xy13" "xy14" "xy15" "xy16" ## [28] "xy17" "xy18" "xy19" "xy20" "xy21" "xy22" "xy23" "xy24" "xy25" ## [37] "xy26" "xy27" "xy28" "xy29" "xy30" "xy31" "xy32" "xy33" "xy34" ## [46] "xy35" "xy36" "xy37" "xy38" "xy39" "xy40" "xy41" "xy42" "xy43" ## [55] "xy44" "xy45" "xy46" "xy47" "xy48" "xy49" "xy50" # A helper function to count selected variables by type vars_counts(vars) ## num.vars xc xe xu xy ## 60 2 1 7 50While the lasso regression selects all 50 xy variables that only affect $y$, it only selects 2 of the 50 confounders xc. So 48 confounders remain uncontrolled and create massive bias. This quite asymmetric selection may be a bit surprising given that in our simulation the confounders xc affect y as strongly as the xy variables ($\beta_k^{cy} = \beta_k^{y} = 1$). My intuition is that our (also selected) variable of interest d already captures some effect of the confounders on y. Thus the confounders are less important to predict y and are therefore not selected.
The post double selection method by Belloni et. al. (2014) selects the control variables as follows. We run two lasso regressions. The first regresses d on all potential controls. The second regresses y on all potential controls (excluding d). Then we use the union of the selected variables from both lasso regressions for our postlasso OLS regression. An intuition for this approach is that confounders are variables that affect both d and y. To ensure that very few confounders are omitted, it thus seems not implausible to use as a broad control set all variables that relevantly affect d or y.
Let us apply the post double selection method:
# post double selection # 1. run the two lasso regressions d = mat[,2] lasso.dx = gamlr(y=d,x=X[,2]) lasso.yx = gamlr(y=y,x=X[,2]) # 2. compute union of selected variables from # both lasso regressions vars1 = names(lasso_coef(lasso.dx)) vars2 = names(lasso_coef(lasso.yx)) vars = union(vars1,vars2) # 3. Run OLS estimation with d and all selected variables post.lasso = lm.fit(y=y,x=X[,c("const","d",vars)]) coef(post.lasso)[2] # looks fairly unbiased ## d ## 0.9643243Now our estimator looks fairly unbiased. Let us look at the selected variables:
vars_counts(vars1) ## num.vars xc xe xu xy ## 148 50 1 96 1 vars_counts(vars2) ## num.vars xc xe xu xy ## 207 50 1 106 50 vars_counts(vars) ## num.vars xc xe xu xy ## 271 50 1 170 50We see that the post double selection method selects all 50 confounders xc. Interestingly the confounders are found in both the first and second lasso regression. This means if we don’t add the variable of interest d when regressing y on all potential controls, we seem more likely to pick up confounders.
You may think: “But we selected many more variables: 271 instead of only 60 before. In particular, we wrongly picked up 170 unrelated variables. Therefore it is no surprise that we now also select the 50 confounders.”
However, Belloni et. al. (2014) actually use a different method to select the penalty parameter $\lambda$, not the AICc critierion used in gamlr. The method is implemented in the R package hdm authored by Martin Spindler, Victor Chernozhukov and Chris Hansen. Let us repeat the post double selection procedure using the function rlasso from the hdm package:
library(hdm) # post double selection # 1. run the two lasso regressions lasso.dx = rlasso(y=d,x=X[,2]) lasso.yx = rlasso(y=y,x=X[,2]) # 2. compute union of selected variables from # both lasso regressions vars1 = names(lasso_coef(lasso.dx)) vars2 = names(lasso_coef(lasso.yx)) vars = union(vars1,vars2) # 3. Run OLS estimation with d and all selected variables post.lasso = lm.fit(y=y,x=X[,c("const","d",vars)]) coef(post.lasso)[2] # looks unbiased ## d ## 1.013494 # Var counts vars_counts(vars) ## num.vars xc xe xu xy ## 113 50 1 12 50We now only select 12 unrelated variables but still all confounders. Correspondingly, our resulting OLS estimate is pretty close to the true causal effect.
The function rlassoEffect is a convenient wrapper to directly apply the double selection method via rlasso:
rlassoEffect(x=X[,2],d=d,y=y,method="double selection") ## ## Call: ## rlassoEffect(x = X[, 2], y = y, d = d, method = "double selection") ## ## Coefficients: ## d1 ## 1.013The hdm package also implements an alternative “partialling out” method based on an idea related to regression anatomy and the FWL theorem. The method is well explained in Chapter 4 of the hdm vignette. Here we find:
rlassoEffect(x=X[,2],d=d,y=y,method="partialling out") ## ## Call: ## rlassoEffect(x = X[, 2], y = y, d = d, method = "partialling out") ## ## Coefficients: ## [1] 0.9402While the estimated $\hat \alpha$ differs slightly, in all simulations where I checked either both approaches seemed to worked well or both failed.
Cases were post double selection fails but selection with single lasso regression works wellWhile in the previous example post double selection worked very well, I believe there are cases in which selection with a single lasso regression on y works better for selecting variables.
We now simulate data from our model with the main difference that we now also add $K_e=50$ variables that are sources of exogenous variation, i.e. they only affect $d$ but not $y$.
set.seed(1) mat = lasso_sim(alpha=1,n=700,Kc=50,Ke=50,Ky=50,Ku=700,return.what = "data") y = mat[,1] d = mat[,2] X = cbind(1,mat[,1]); colnames(X)[1] = "const" # Short OLS regression y on d: biased coef(lm.fit(y=y,x=X[,1:2]))[2] ## d ## 1.472816Now let’s compare two linear regressions where in both we select 49 of the 50 confounders. In one regression we also control for all 50 observable sources of exogenous variation xe in the other we don’t add any xe as control variable. Make a guess about the biases of $\hat \alpha$ in the two regressions before you look at the results here:
xc.cols = 3:52; xe.cols = 53:102 # Controlling for 49 of 50 of confounders, # not controlling sources of exogenous variation coef(lm.fit(y=y,x=X[,c(1:2, xc.cols[1:49])]))[2] ## d ## 1.028753 # Controlling for 49 of 50 of confounders # and controlling for all sources of exogenous variation coef(lm.fit(y=y,x=X[,c(1:2, xc.cols[1:49], xe.cols)]))[2] ## d ## 1.478361If we don’t add any observable source of exogenous variation, we find an estimate pretty close to the true causal effect $\alpha=1$, but if we control for the exogenous variation our estimate looks substantially biased.
OK, one probably should not infer a bias from a single simulation run. Therefore, I have run a proper MonteCarlo simulation, where I repeated the data generation and both estimations a 1000 times. The code and results can be found on Github. Let’s take a look at the distribution of the simulated estimators $\hat \alpha$ in both cases:
sim = readRDS("control_exo_sim.Rds") library(ggplot2) ggplot(sim, aes(x=alpha.hat, fill=reg)) + geom_density() + facet_wrap(~reg) + geom_vline(xintercept=1, alpha=0.7)
We now properly see that we have a substantial bias in our causal effect estimate $\hat \alpha$ if we control for 49 out of 50 confounders and also add all sources of exogenous variation xe as control variables. The standard error is also high. If we don’t control for the sources of exogenous variation both the bias and standard error of $\hat \alpha$ are very small instead.
The following graph thus summarizes the goal in selecting control variables:
Unless we can manually guarantee that enough relevant sources of exogenous variation are excluded from the pool of candidate control variables, there seems to be a drawback of the postdouble selection procedure: The lasso regression of $d$ on all potential control variables is likely to select the sources of exogenous variation, which we don’t want to use as control variables.
In contrast, if we would just select variables with a single lasso regression of $y$ on $d$ and all potential control variables, we seem much less likely to select sources of exogenous variation, since by definition they only affect $y$ via $d$. Let’s run both approaches on our simulated data set:
# Post double selection double_sel = rlassoEffect(x=X[,2],y=y,d=d, method="double selection") coef(double_sel) ## d1 ## 1.440466 # Simple gamlr postlasso estimator simple_lasso = gamlr(x=X,y=y) post_lasso_coef(simple_lasso,x=X,y=y,keep.intercept = FALSE)[1] ## d ## 1.023805Indeed, the simple lasso approach yields here a much better estimate of the causal effect than double selection. The hdm package also allows to compute confidence intervals:
confint(double_sel,level = 0.999) ## 0.05 % 99.95 % ## d1 1.334566 1.546366The computed 99.9% confidence interval of the post double selection estimator is erroneously far away from the true causal effect $\alpha=1$.
The following code estimates both models again and also returns statistics about the selected variables:
set.seed(1) lasso_sim(alpha=1,n=700,Kc=50,Ke=50,Ky=50,Ku=700, models=c("gamlr_simple","rlasso_double_sel")) ## model coef num.vars xc xe xu xy ## 1 gamlr_simple 1.023805 179 50 28 51 50 ## 2 rlasso_double_sel 1.440466 7 5 2 0 0We see here that the reason why the post double selection algorithm does not work in our example is not that too many xe are selected, but that overall only 7 variables are selected including just 5 of the 50 confounders. The default settings of rlasso seem to select relatively few variables.
Below we estimate two variants of the post double selection where we reduce a parameter c in the determination of the penalty term. This causes rlasso to select more variables:
set.seed(1) models = list( rlasso_double_sel_c106 = list(lasso.fun="rlasso",type="double_sel", args=list(penalty=list(c=1.06))), rlasso_double_sel_c100 = list(lasso.fun="rlasso",type="double_sel", args=list(penalty=list(c=1))) ) lasso_sim(alpha=1,n=700,Kc=50,Ke=50,Ky=50,Ku=700,return.what = "details", models = models) ## model coef num.vars xc xe xu xy ## 1 rlasso_double_sel_c106 1.489470 76 32 34 8 2 ## 2 rlasso_double_sel_c100 1.214481 140 50 50 34 6Now more variables are selected, but never more confounders xc than sources or exogenous variation xe. Let’s load the results of a proper MonteCarlo simulation with 1000 simulation runs:
sim = readRDS("lasso_sel_sim.Rds") sim %>% group_by(model) %>% summarize( bias = mean(coef1), se = sd(coef) ) ## # A tibble: 4 x 3 ## model bias se ## ## 1 gamlr_simple 0.0242 0.00951 ## 2 rlasso_double_sel 0.431 0.101 ## 3 rlasso_double_sel_c100 0.176 0.292 ## 4 rlasso_double_sel_c106 0.365 0.192We see how the simple lasso model yields the lowest bias and also has a lower standard error than the 3 specifications of the post double selection approach.
Are confounders that strongly affect y worse than those that strongly affect d?Finally, let us look at a simulation where both the xe and the xc variables are confounders. Yet, there is a difference. The confounders xe strongly affect the explanatory variable d with coefficients $\beta_k^{ed}=10$ and only weakly affect y with $\beta_k^{ey}=0.5$. The coefficients shall be reversed for the confounders xc that strongly affect y with $\beta_k^{cy}=10$ but only weakly affect d with $\beta_k^{cd}=0.5$.
Let’s compare post double selection with a simple gamlr lasso selection in this setting:
set.seed(1) res = lasso_sim(alpha=1,n=100,Kc=10,Ke=10,Ky=5,Ku=20, beta.ed = 10, beta.ey = 0.5, beta.cd = 0.5, beta.cy = 10, models=c("gamlr_simple","rlasso_double_sel")) select(res,  coef) ## model num.vars xc xe xu xy ## 1 gamlr_simple 16 10 1 0 5 ## 2 rlasso_double_sel 17 7 10 0 0I have not yet shown the estimated coefficients. But we see that the simple lasso regression has selected all confounders that strongly affect y but only 1 confounder that strongly affects d. In contrast, double selection picks 7 out of the 10 confounders that strongly affect y and all 10 that strongly affect d.
Why not make a guess about the two estimates $\hat \alpha$ of the two methods…
Loading…
Ok, let’s look at the complete result:
res ## model coef num.vars xc xe xu xy ## 1 gamlr_simple 1.051267 16 10 1 0 5 ## 2 rlasso_double_sel 8.150202 17 7 10 0 0While the simple lasso estimate is pretty close to the true causal effect, the double selection estimate is pretty far away, even though it selects many more confounders. It seems as if omitting confounders that strongly affect the dependent variable is worse than omitting those that strongly affect the explanatory variable.
If you find this result surprising, it may be because the simple omitted variable bias formula is often presented in a way that could wrongly suggest that an omitted confounder that strongly affects the explanatory variable d is equally bad as one that strongly affects the dependent variable y. However, even with a single omitted variable it is worse if it strongly affects y rather than d (see my previous blog post for more details).
A thousand simulation runs also verify that the single run above is representative:
sim = readRDS("lasso_sim3.Rds") sim %>% group_by(model) %>% summarize( bias = mean(coef1), se = sd(coef), num.vars = mean(num.vars), xe = mean(xe), xc = mean(xc) ) ## # A tibble: 2 x 6 ## model bias se num.vars xe xc ## ## 1 gamlr_simple 0.0489 0.00393 16.6 0.685 10 ## 2 rlasso_double_sel 6.85 4.04 17.8 10 6.65To sum up, this simulation provides another related example, in which control variable selection with a simple lasso regression works better than the post double selection method.
ConclusionWe first have illustrated a setting in which post double selection performed considerably better in selecting control variables than just a simple single lasso regression. Post double selection is probably most save to apply in settings where an exogenous source of variation is known and manually taken care that it is excluded from the set of potential control variables. Intuitively, the procedure seems also reasonable in quasiexperimental settings where d is a dummy variable of a single policy change. Belloni et. al. (2014) indeed mostly refer to such settings.
Yet, we also have illustrated that in situations in which many important sources of exogenous variation are part of the set of potential control variables, a simple lasso regression can outperform post double selection. We also exemplified in a setting without observable sources of exogenous variation that a simple lasso regression was better able to select the worse confounders (i.e. those that affect y strongly).
Perhaps a method will surface that works well across all situations were either double selection or selection with a simple lasso fails. Alas, I imagine that the selection of control variables can be automatized only up to a certain extend and will continue to require considerable subjective assessments. For example, in addition to the issues discussed in this post, one must also ensure that the set of potential control variables does not contain colliders or channel variables (unless one wants to explicitly ignore the mediated effect from certain channels).
ReferencesA. Belloni, V. Chernozhukov, C. Hansen (2014). “Inference on treatment effects after selection among highdimensional controls.” The Review of Economic Studies 81(2), 608650.
Chernozhukov, V., Hansen, C. and Spindler, M., (2016). “hdm: Highdimensional metrics.” arXiv preprint arXiv:1608.00354.
var vglnk = {key: '949efb41171ac6ec1bf7f206d57e90b8'}; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; // s.defer = true; // s.src = '//cdn.viglink.com/api/vglnk.js'; s.src = 'https://www.rbloggers.com/wpcontent/uploads/2020/08/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Economics and R  R posts. Rbloggers.com offers daily email updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/datascience job. Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
The post Lasso and the Methods of Causality first appeared on Rbloggers.
Efficient Grouped Programming in R and/or C/C++ – with the collapse Package
[social4i size="small" align="alignleft"] > [This article was first published on R, Econometrics, High Performance, and kindly contributed to Rbloggers]. (You can report issue about the content on this page here) Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
collapse is a C/C++ based package for data transformation and statistical computing in R. Among other features it introduces an excellent and highly efficient architecture for grouped (and weighted) statistical programming in R. This post briefly explains this architecture and demonstrates:

How to program highly efficient grouped statistical computations and data manipulations in R using the grouped functions supplied by collapse.

How to use the grouping mechanism of collapse with custom C/C++ code to create further efficient grouped functions/operations in R.
collapse uses grouping objects as essential inputs for grouped computations. These objects are created from vectors or lists of vectors (i.e. data frames) using the function GRP():
library(collapse) # A dataset supplied with collapse providing sectoral value added (VA) and employment (EMP) head(GGDC10S, 3) ## Country Regioncode Region Variable Year AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM ## 1 BWA SSA Subsaharan Africa VA 1960 NA NA NA NA NA NA NA NA NA NA NA ## 2 BWA SSA Subsaharan Africa VA 1961 NA NA NA NA NA NA NA NA NA NA NA ## 3 BWA SSA Subsaharan Africa VA 1962 NA NA NA NA NA NA NA NA NA NA NA fdim(GGDC10S) ## [1] 5027 16 # Creating a grouping object (by default return.order = FALSE as the ordering is typically not needed) g < GRP(GGDC10S, c("Country", "Variable"), return.order = TRUE) # Printing it print(g) ## collapse grouping object of length 5027 with 85 ordered groups ## ## Call: GRP.default(X = GGDC10S, by = c("Country", "Variable"), return.order = TRUE), X is unordered ## ## Distribution of group sizes: ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 4.00 53.00 62.00 59.14 63.00 65.00 ## ## Groups with sizes: ## ARG.EMP ARG.VA BOL.EMP BOL.VA BRA.EMP BRA.VA ## 62 62 61 62 62 62 ##  ## VEN.EMP VEN.VA ZAF.EMP ZAF.VA ZMB.EMP ZMB.VA ## 62 63 52 52 52 52 # Plotting it plot(g)
Grouping is done very efficiently using radixbased ordering in C (thanks to data.table source code). The structure of this object is shown below:
The first three slots of this object provide the number of unique groups, a groupid matching each value/row to a group1, and a vector of groupsizes. The fourth slot provides the unique groups (default return.groups = TRUE), followed by the names of the grouping variables, a logical vector showing whether the grouping is ordered (default sort = TRUE), and the ordering vector which can be used to sort the data alphabetically according to the grouping variables (default return.order = FALSE).
Grouped Programming in Rcollapse provides a whole ensemble of C++ based generic statistical functions that can use these ‘GRP’ objects to internally perform (columnwise) grouped (and weighted) computations on vectors, matrices and data frames in R. Their names are contained in the global macro .FAST_FUN:
.FAST_FUN ## [1] "fmean" "fmedian" "fmode" "fsum" "fprod" "fsd" "fvar" ## [8] "fmin" "fmax" "fnth" "ffirst" "flast" "fNobs" "fNdistinct" ## [15] "fscale" "fbetween" "fwithin" "fHDbetween" "fHDwithin" "flag" "fdiff" ## [22] "fgrowth"Additional functions supporting grouping objects are TRA (grouped replacing and sweeping out statistics), BY (splitapplycombine computing) and collap (advanced data aggregation with multiple functions).
To provide a brief example, we can compute a grouped mean of the above data using:
head(fmean(GGDC10S[6:16], g)) ## AGR MIN MAN PU CON WRT TRA ## ARG.EMP 1419.8013 52.08903 1931.7602 101.720936 742.4044 1982.1775 648.5119 ## ARG.VA 14951.2918 6454.94152 36346.5456 2722.762554 9426.0033 26633.1292 14404.6626 ## BOL.EMP 964.2103 56.03295 235.0332 5.346433 122.7827 281.5164 115.4728 ## BOL.VA 3299.7182 2846.83763 3458.2904 664.289574 729.0152 2757.9795 2727.4414 ## BRA.EMP 17191.3529 206.02389 6991.3710 364.573404 3524.7384 8509.4612 2054.3731 ## BRA.VA 76870.1456 30916.64606 223330.4487 43549.277879 70211.4219 178357.8685 89880.9743 ## FIRE GOV OTH SUM ## ARG.EMP 627.79291 2043.471 992.4475 10542.177 ## ARG.VA 8547.37278 25390.774 7656.3565 152533.839 ## BOL.EMP 44.56442 NA 395.5650 2220.524 ## BOL.VA 1752.06208 NA 4383.5425 22619.177 ## BRA.EMP 4413.54448 5307.280 5710.2665 54272.985 ## BRA.VA 183027.46189 249135.452 55282.9748 1200562.671By default (use.g.names = TRUE), group names are added as names (vectors) or rownames (matrices and data frames) to the result. For data frames we can also add the grouping columns again using2:
head(add_vars(g[["groups"]], fmean(get_vars(GGDC10S, 6:16), g, use.g.names = FALSE))) ## Country Variable AGR MIN MAN PU CON WRT TRA ## 1 ARG EMP 1419.8013 52.08903 1931.7602 101.720936 742.4044 1982.1775 648.5119 ## 2 ARG VA 14951.2918 6454.94152 36346.5456 2722.762554 9426.0033 26633.1292 14404.6626 ## 3 BOL EMP 964.2103 56.03295 235.0332 5.346433 122.7827 281.5164 115.4728 ## 4 BOL VA 3299.7182 2846.83763 3458.2904 664.289574 729.0152 2757.9795 2727.4414 ## 5 BRA EMP 17191.3529 206.02389 6991.3710 364.573404 3524.7384 8509.4612 2054.3731 ## 6 BRA VA 76870.1456 30916.64606 223330.4487 43549.277879 70211.4219 178357.8685 89880.9743 ## FIRE GOV OTH SUM ## 1 627.79291 2043.471 992.4475 10542.177 ## 2 8547.37278 25390.774 7656.3565 152533.839 ## 3 44.56442 NA 395.5650 2220.524 ## 4 1752.06208 NA 4383.5425 22619.177 ## 5 4413.54448 5307.280 5710.2665 54272.985 ## 6 183027.46189 249135.452 55282.9748 1200562.671The execution cost of all of these functions is extremely small, so the performance is essentially limited by C++, not by R.
library(microbenchmark) microbenchmark(call = add_vars(g[["groups"]], fmean(get_vars(GGDC10S, 6:16), g, use.g.names = FALSE))) ## Unit: microseconds ## expr min lq mean median uq max neval ## call 257.931 271.765 368.8147 369.2695 384.889 987.545 100We can use these functions to write very efficient grouped code in R. This shows a simple application in panel data econometrics comparing a pooled OLS to a group means, a between and a within estimator computed on the demeaned data3:
Panel_Ests < function(formula, data, pids) { # Get variables as character string, first variable is dependent variable vars < all.vars(formula) # na_omit is a fast replacement for na.omit data_cc < na_omit(get_vars(data, c(vars, pids))) g < GRP(data_cc, pids, return.groups = FALSE, call = FALSE) # qM is a faster as.matrix data_cc < qM(get_vars(data_cc, vars)) # Computing group means mean_data_cc < fmean(data_cc, g, use.g.names = FALSE) # This computes regression coefficients reg < function(x) qr.coef(qr(cbind(Intercept = 1, x[, 1L, drop = FALSE])), x[, 1L]) qM(list(Pooled = reg(data_cc), Means = reg(mean_data_cc), # This replaces data values with the groupmean > betweengroup estimator Between = reg(TRA(data_cc, mean_data_cc, "replace_fill", g)), # This subtracts the groupmeans > withingroup estimator Within = reg(TRA(data_cc, mean_data_cc, "", g)))) } library(magrittr) # Pipe operators # Calculating Value Added Percentage Shares (data is in local currency) VA_shares < fsubset(GGDC10S, Variable == "VA") %>% ftransformv(6:16, `*`, 100/SUM) # Value Added data (regressing Government on Agriculture, Manufactoring and Finance & Real Estate) Panel_Ests(GOV ~ AGR + MAN + FIRE, VA_shares, "Country") %>% round(4) ## Pooled Means Between Within ## Intercept 25.8818 26.6702 26.5828 0.0000 ## AGR 0.3425 0.3962 0.3749 0.2124 ## MAN 0.2339 0.1744 0.2215 0.2680 ## FIRE 0.2083 0.3337 0.2572 0.0742 # Employment data fsubset(GGDC10S, Variable == "EMP") %>% ftransformv(6:16, `*`, 100/SUM) %>% Panel_Ests(formula = GOV ~ AGR + MAN + FIRE, "Country") %>% round(4) ## Pooled Means Between Within ## Intercept 33.2047 34.6626 35.4332 0.0000 ## AGR 0.3543 0.3767 0.3873 0.2762 ## MAN 0.4444 0.4595 0.4790 0.4912 ## FIRE 0.1721 0.3097 0.2892 0.1087It would be easy to add an option for sampling weights as fmean also supports weighted grouped computations. A benchmark below shows that this series of estimators is executed very efficiently and scales nicely to large data (quite a bit faster than using plm to do it).
# Benchmark on VA data microbenchmark(call = Panel_Ests(SUM ~ AGR + MIN + MAN, VA_shares, "Country")) ## Unit: milliseconds ## expr min lq mean median uq max neval ## call 1.643975 2.203792 3.119572 2.72077 3.583589 10.44576 100There are lots and lots of other applications that can be devised in R using the .FAST_FUN and efficient programming with grouping objects.
Creating Grouped Functions in C/C++It is also possible to just use ‘GRP’ objects as input to new grouped functions written in C or C++. Below I use Rcpp to create a generic grouped anyNA function for vectors:
// [[Rcpp::plugins(cpp11)]] #include using namespace Rcpp; // Inputs: // x  A vector of any type // ng  The number of groups  supplied by GRP() in R // g  An integer grouping vector  supplied by GRP() in R // Output: A plain logical vector of size ng template LogicalVector ganyNACppImpl(Vector x, int ng, IntegerVector g) { int l = x.size(); if(l != g.size()) stop("length(x) must match length(g)"); LogicalVector out(ng); // Initializes as false if(RTYPE == REALSXP) { // Numeric vector: all logical operations on NA/NaN evaluate to false, except != which is true. for(int i = 0; i < l; ++i) { if(x[i] != x[i] && !out[g[i]1]) out[g[i]1] = true; } } else { // other vectors for(int i = 0; i < l; ++i) { if(x[i] == Vector::get_na() && !out[g[i]1]) out[g[i]1] = true; } } return out; } // Disabling complex and nonatomic vector types template <> LogicalVector ganyNACppImpl(Vector x, int ng, IntegerVector) { stop("Not supported SEXP type!"); } template <> LogicalVector ganyNACppImpl(Vector x, int ng, IntegerVector) { stop("Not supported SEXP type!"); } template <> LogicalVector ganyNACppImpl(Vector x, int ng, IntegerVector) { stop("Not supported SEXP type!"); } template <> LogicalVector ganyNACppImpl(Vector x, int ng, IntegerVector) { stop("Not supported SEXP type!"); } // [[Rcpp::export]] LogicalVector ganyNACpp(const SEXP& x, int ng = 0, const IntegerVector& g = 0){ RCPP_RETURN_VECTOR(ganyNACppImpl, x, ng, g); }On the R side things are then pretty simple:
library(Rcpp) sourceCpp("ganyNA.cpp") ganyNA < function(x, g, use.g.names = TRUE) { # Option group.sizes = FALSE prevents tabulation of levels if a factor is passed g < GRP(g, return.groups = use.g.names, group.sizes = FALSE, call = FALSE) res < ganyNACpp(x, g[[1L]], g[[2L]]) # GRPnames creates unique group names. For vectors they need not be character typed. if(use.g.names) names(res) < GRPnames(g, force.char = FALSE) res }Strictly speaking there are different options to set this up: GRP() is a S3 generic function with a default method applying to atomic vectors and lists / data frames, but also a ‘factor’ method converting factors to ‘GRP’ objects. Above I have used the generic GRP function with the option group.sizes = FALSE, so factors are efficiently converted without tabulating the levels. This provides more efficiency if a factor is passed to g, but will not drop unused factor levels. The alternative is to use g < GRP.default(g, return.groups = use.g.names, call = FALSE), which will get rid of unused factor levels, but using factors for grouping is just as efficient as any other vector.
GGDC10S %$% ganyNA(SUM, list(Country, Variable)) %>% head ## ARG.EMP ARG.VA BOL.EMP BOL.VA BRA.EMP BRA.VA ## FALSE FALSE FALSE TRUE FALSE TRUE # 10 million obs and 1 million groups, 1% of data missing x < na_insert(rnorm(1e7), prop = 0.01) g < sample.int(1e6, 1e7, TRUE) system.time(ganyNA(x, g)) ## User System verstrichen ## 0.56 0.05 0.61 system.time(ganyNA(x, g, use.g.names = FALSE)) ## User System verstrichen ## 0.42 0.03 0.46 # Using a factor grouping variable: more efficient but does not drop any unused levels f < qF(g, na.exclude = FALSE) # Efficiently creating a factor (qF is faster as.factor) system.time(ganyNA(x, f)) ## User System verstrichen ## 0.02 0.02 0.03 system.time(ganyNA(x, f, use.g.names = FALSE)) ## User System verstrichen ## 0.04 0.01 0.05 # We can also efficiently pass a 'GRP' object: both GRP.GRP and GRP.default simply return it. g < GRP(g) system.time(ganyNA(x, g)) ## User System verstrichen ## 0.01 0.00 0.01 system.time(ganyNA(x, g, use.g.names = FALSE)) ## User System verstrichen ## 0.03 0.00 0.03We could additionally add a TRA argument and then internally call the TRA() function to allow for replacing and sweeping out statistics, but this does not make much sense here.

By default (sort = TRUE) the grouping is ordered, which is equivalent to data.table grouping with keyby.︎

add_vars is a faster alternative to cbind and get_vars is a faster alternative to [.data.frame for subsetting columns.︎

A random effects estimator could easily be added, see the example here.︎
To leave a comment for the author, please follow the link and comment on their blog: R, Econometrics, High Performance. Rbloggers.com offers daily email updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/datascience job. Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
The post Efficient Grouped Programming in R and/or C/C++  with the collapse Package first appeared on Rbloggers.
100 Time Series Data Mining Questions – Part 3
[social4i size="small" align="alignleft"] > [This article was first published on R Bloggers on Francisco Bischoff, and kindly contributed to Rbloggers]. (You can report issue about the content on this page here) Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
In the last post we started looking for repeated patterns in a time series, what we call Motifs.
For the next question, we will still be using the datasets available at https://github.com/matrixprofilefoundation/mpfdatasets so you can try this at home.
The original code (MATLAB) and data are here.
Now let’s start:
 What are the three most unusual days in this threemonthlong dataset?
Now we don’t know what we are looking for, but we want to discover something. Not just one pattern, but patterns that repeat, and may look similar or not.
First, let’s load the tsmp library and import our dataset. In this example, we will use the package xts only because the imported data has a timestamp index and we make more understandable plots, but this is not necessary as we will use basically the values. We will use the window from 20141001 and 20141215 to keep up with the original example from Keogh’s Lab:
# install.packages('tsmp') library(tsmp) library(xts) baseurl < "https://raw.githubusercontent.com/matrixprofilefoundation/mpfdatasets/a3a3c08a490dd0df29e64cb45dbb355855f4bcb2/" dataset < read.csv(paste0(baseurl, "real/nyctaxianomalies.csv"), header = TRUE) dataset < xts(dataset$value, as.POSIXct(dataset$timestamp)) tzone(dataset) < "UTC" dataset < window(dataset, start = "20141001", end = "20141215") events < xts(c("Columbus Day", "Daylight Saving Time", "Thanksgiving"), as.POSIXct(c("20141013", "20141102", "20141127"), tz = "UTC"))And plot it as always:
plot(dataset, main = "NYC Taxi anomalies", lwd = 1)Above we see the taxi load between 2014, October 1st and 2014, December 15th. The data was collected each halfhour.
We can see that this data seems to have a wellrepeated pattern, but in some days there are some differences. The objective here is to locate what is not normal, or better saying, the discord. Discord is a term that defines the most unusual time series subsequence. Discords have become one of the most effective and competitive methods in anomaly detection.
As we can see above, the Matrix Profile contains three peaks that show us that these areas are very different from the other regions, so they are, as said above, unusual subsequences.
Let’s analyze (ok, I’ve already spoiled the answer in the plot). The most unusual subsequence is located near the Thanksgiving date. (remember, we are using a window size of 100, that means about 2 days, so there is a visible ‘lag’).
The second most unusual subsequence is located around November 6th. And what is that? It is the Daylight Saving Time! The clock going backwards one hour gives an apparent doubling of taxi load.
The third most unusual subsequence is around October 13th, what could it be? Columbus Day! Columbus Day is mostly ignored in much of America, but still a big deal in NY, with its large Italian American community.
Now let’s use the function find_discord to find these unusual patterns. Notice that I’ve set n_neighbors to zero since I don’t want to look for possible repetition of those patterns, but new, unusual patterns.
# We already know the Matrix Profile Object, so I'll skip talking about it. mp_obj %>% find_discord(n_discords = 3, n_neighbors = 0) %T>% plot ## Matrix Profile ##  ## Profile size = 3500 ## Window size = 100 ## Exclusion zone = 200 ## Contains 1 set of data with 3599 observations and 1 dimension ## ## Discord ##  ## Discords found = 3 ## Discords indexes = [2686] [1484] [553] ## Discords neighbors = [] [] []And here they are. Exactly where we’d foreseen and in the lower part of the potting are the normalized shape of these patterns.
So we did it again!
This ends our third question of one hundred (let’s hope so!).
Until next time.
var vglnk = {key: '949efb41171ac6ec1bf7f206d57e90b8'}; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; // s.defer = true; // s.src = '//cdn.viglink.com/api/vglnk.js'; s.src = 'https://www.rbloggers.com/wpcontent/uploads/2020/08/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R Bloggers on Francisco Bischoff. Rbloggers.com offers daily email updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/datascience job. Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
The post 100 Time Series Data Mining Questions  Part 3 first appeared on Rbloggers.
Reading JSON file from web and preparing data for analysis
[social4i size="small" align="alignleft"] > [This article was first published on dataENQ, and kindly contributed to Rbloggers]. (You can report issue about the content on this page here) Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
Welcome to the tothepoint article about reading a JSON file from the web and preparing the data for analysis. I have found this data in JSON format here and used it to replicate the table presented here in “Row and Columns” section. The final data prepared in this article is in the data frame format, which can turn into a graph comfortably. DataData used in this article is available here and licensed under the CC BY 4.0 license.
StrategyThe strategy is to read the JSON file using the fromJSON function of the jsonlite package. The output will be presented as a list of lists. Read individual lists, and with the help of rapply and unique functions, extract the value of the labels. Repeat this for all the data that is required to form a data frame.
The value section of the JSON file returns the elements in the form of a numeric vector. Read the vectors by adding three into their indexes and assign them to a new variable. Remember to start from the first, second, and third place to read the right element. Repeat this logic three times to create three variables. Use the same logic and create two more variables, one for the year and another for statistics.
CodeHere is the working copy of the code for your scrutiny. Please comment if you have a better and more optimized way of handling this data. If you are interested, then a copy of this code is available at github repository as well.
################################################################################## www.dataenq.com
## Reading a JSON file and preparing data for analysis
################################################################################
#Using jsonlite to read .json file
library(jsonlite)
#Using function fromJSON from jsonlite package to read the file
djson < fromJSON("https://statbank.cso.ie/StatbankServices/StatbankServices.svc/jsonservice/responseinstance/CIS78")
#Preparing the data frame from the list of lists djson created above
#Reading individual lists and preparing columns
df < data.frame(
#Reading dimension Type of Cooperation Partner
unique(rapply(djson$dataset$dimension$`Type of Cooperation Partner`$category$label, function(lst) head(lst, 1))),
#Reading first and every other third value from there for each observation
V2 = djson$dataset$value[seq(1, length(djson$dataset$value), 3)],
#Reading second and every other third value from there for each observation
V3 = djson$dataset$value[seq(2, length(djson$dataset$value), 3)],
#Reading third and every other third value from there for each observation
V4 = djson$dataset$value[seq(3, length(djson$dataset$value), 3)],
#Reading first and every other third value from there for each observation but for dimension called year
V5 = djson$dataset$value[seq(1, length(djson$dataset$value), 3)],
#Reading first and every other third value from there for each observation but for dimension called Statistic
V6 = djson$dataset$value[seq(2, length(djson$dataset$value), 3)])
#Assigning column names from vectors to match the data presented on the site given below
# https://data.gov.ie/dataset/7b6c5d4c955c4eeba9d0e35fb58bf200/resource/5a856b72f4704c71ab1ffbb0ef3b1e22#&r=Type%20of%20Cooperation%20Partner&c=NACE%20Rev%202%20Sector
colnames(df) = c(djson$dataset$dimension$`Type of Cooperation Partner`$label,
unique(rapply(djson$dataset$dimension$`NACE Rev 2 Sector`$category$label, function(lst) head(lst, 1))),
unique(rapply(djson$dataset$dimension$Year$category$label, function(lst) head(lst, 1))),
unique(rapply(djson$dataset$dimension$Statistic$category$label, function(lst) head(lst, 1))))
#Structure of the data frame
str(df) ## 'data.frame': 12 obs. of 6 variables:
## $ Type of Cooperation Partner : chr "Any type of cooperation" "Cooperation from clients and or customers" "Cooperation from competitors" "Cooperation other enterprises within own enterprise group" ...
## $ Industries (05 to 39) : num 54.7 34.9 21.5 29 30 45.8 43.8 27 15.9 44.6 ...
## $ Industries and selected services (05 to 39,46,49 to 53,58 to 63,64 to 66,71 to 73): num 50.8 32.9 20.2 27.4 25.8 40.1 38.7 23.3 17.3 41.9 ...
## $ Selected Services (46, 4953, 5863, 6466, 7173) : num 47.8 31.5 19.3 26.3 22.7 35.9 34.8 20.5 18.5 39.9 ...
## $ 2018 : num 54.7 34.9 21.5 29 30 45.8 43.8 27 15.9 44.6 ...
## $ Cooperation by Technological Innovative Enterprises (%) : num 50.8 32.9 20.2 27.4 25.8 40.1 38.7 23.3 17.3 41.9 ... #Printing data frame
df ## Type of Cooperation Partner
## 1 Any type of cooperation
## 2 Cooperation from clients and or customers
## 3 Cooperation from competitors
## 4 Cooperation other enterprises within own enterprise group
## 5 Cooperation from Universities and or third level institutions
## 6 Cooperation from suppliers of equipment, materials, components or software
## 7 Cooperation from consultants and or commercial laboratories or private research and development institutes
## 8 Cooperation from Government or public research institutes
## 9 Cooperation from public sector clients or customers
## 10 Cooperation from private business enterprises outside your enterprise group
## 11 Cooperation from other enterprises
## 12 Cooperation from nonprofit organisations
## Industries (05 to 39)
## 1 54.7
## 2 34.9
## 3 21.5
## 4 29.0
## 5 30.0
## 6 45.8
## 7 43.8
## 8 27.0
## 9 15.9
## 10 44.6
## 11 22.7
## 12 11.3
## Industries and selected services (05 to 39,46,49 to 53,58 to 63,64 to 66,71 to 73)
## 1 50.8
## 2 32.9
## 3 20.2
## 4 27.4
## 5 25.8
## 6 40.1
## 7 38.7
## 8 23.3
## 9 17.3
## 10 41.9
## 11 21.5
## 12 12.5
## Selected Services (46, 4953, 5863, 6466, 7173) 2018
## 1 47.8 54.7
## 2 31.5 34.9
## 3 19.3 21.5
## 4 26.3 29.0
## 5 22.7 30.0
## 6 35.9 45.8
## 7 34.8 43.8
## 8 20.5 27.0
## 9 18.5 15.9
## 10 39.9 44.6
## 11 20.6 22.7
## 12 13.4 11.3
## Cooperation by Technological Innovative Enterprises (%)
## 1 50.8
## 2 32.9
## 3 20.2
## 4 27.4
## 5 25.8
## 6 40.1
## 7 38.7
## 8 23.3
## 9 17.3
## 10 41.9
## 11 21.5
## 12 12.5
I hope you would like this short article. Please help dataenq.com by commenting on what you think about this article and by sharing it with your network. Thank you.
Image Credit unsplash.com var vglnk = {key: '949efb41171ac6ec1bf7f206d57e90b8'}; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; // s.defer = true; // s.src = '//cdn.viglink.com/api/vglnk.js'; s.src = 'https://www.rbloggers.com/wpcontent/uploads/2020/08/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: dataENQ. Rbloggers.com offers daily email updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/datascience job. Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
The post Reading JSON file from web and preparing data for analysis first appeared on Rbloggers.
salesforcer 0.2.2 – Relationship Queries and the Reports API
[social4i size="small" align="alignleft"] > [This article was first published on stevenmortimer.com, and kindly contributed to Rbloggers]. (You can report issue about the content on this page here) Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
The latest version of the {salesforcer} package (v0.2.2) is now available on CRAN
and is ready to help you have better access to data in your Salesforce Org. Along
with a host of bug fixes this release has three big features:

Experimental Functions for the Reports and Dashboards REST API (jump
to section) – Now
you have programmatic access to executing and managing reports, dashboards, and
analytics notifications in your Org. You can familiarize yourself with the
Salesforce documentation HERE. Not all functions have been implemented yet, but your questions, comments,
and feedback are welcome! 
Support for Bulk 2.0 API Queries (jump to
section) – In Salesforce version 47.0
(Winter ’20) query jobs in the Bulk API 2.0 were added. Now you can leverage
this resource from the {salesforcer} package in addition to the queries via
the REST, SOAP, and Bulk 1.0 APIs. 
Support for Relationship Queries (jump to
section) – In previous versions of the package the
childtoparent and nested parenttochild queries did not work or returned a
jumbled mess of results that were not parsed correctly. This version fixes those
issues in both the REST and SOAP APIs with better test coverage on a variety of
query types.
For a complete list of updates, please review the release notes from v0.2.0 onwards
listed on the {salesforcer} pkgdown site here: https://stevenmmortimer.github.io/salesforcer/news/index.html.
Salesforce has rich support for Reports in your Salesforce Org. Sometimes
reports are a better way to collaborate with other users because they can create
reports in the GUI or you can create one for them so they always have access to
the most current recordset meeting your report criteria. The challenge comes
when trying to access this data programmatically. Fortunately, Salesforce
provides the Reports
and Dashboards REST API as a means to not only execute Reports, but to also
manage them in your Org.
In Salesforce there is a dedicated page to displaying the list of reports in your
Org. It typically follows the pattern: https://na1.salesforce.com/00O/o
(replace na1 with your server instance). When you click on a report in the GUI
you should see the report’s results. Below is a screenshot of how a report may
look in your Org (note the report Id in the URL bar):
The report Id above ("00O3s000006tE7zEAE") is the only information needed to pull
those same results from an R session, like so:
Currently, all of the report related functionality in the Reports and Dashboards
REST API has been ported into the {salesforcer} package and you can do some pretty
neat stuff like onthefly filtering and sorting:
For more detail on how to take advantage of this new functionality please see
the pkgdown website https://stevenmmortimer.github.io/salesforcer
and, more specifically, the Working
with Reports vignette which provides a soft introduction to these concepts.
Finally, keep an eye out as more dashboard and analytics notifications
functionality is also added.
In Salesforce version 47.0 (Winter ’20) query functionality added to
the Bulk 2.0 API. In the overview of this feature Salesforce emphasizes the
consistency with the REST APIs and the ease of use (e.g. “Automatic File
Batching”), but does not mention any claims in terms of speed compared to the
Bulk 1.0 query functionality. In {salesforcer 0.2.2} the default API when using
sf_run_bulk_query() or sf_query_bulk() is now the Bulk 2.0 API, assuming it
is better than the Bulk 1.0 API. However, You can easily switch between the APIs
just as you did before in previous {salesforcer} releases by specifying it in the
api_type argument. Please note that, because of additional standardization on
the column ordering and arguments to guess types, the queries below will all
return the same exact format of results. For example we prioritize the following
fields in queries alphabetically within this prioritization waterfall:
 First, the sObject field (indicates the record’s object if multiple objects returned in the results)
 Second, the Id field (Id, id, sf__Id)
 Third, record success status (Success, success, sf_Success)
 Fourth, record created status (Created, created, sf__Created)
 Fifth, record error(s) status (Error, error, errors,
errors.statusCode, errors.fields, errors.message, sf__Error)  Sixth, all other fields from the target object (e.g. Name, Phone, etc.)
 Seventh, relationship fields (fields from a parent or child of the target). For example,
anything typically containing a dot like Account.Id, Owner.Name, etc.
In short, Bulk 2.0 now has query functionality and it is consistent with the
other API’s query functionality. I recommend checking to see for yourself which
API works well. Below is a simple example comparing a single run of the REST,
Bulk 1.0, and Bulk 2.0 APIs. Consider using the {microbenchmark} package to run
more precise performance tests.
One upgrade for {salesforcer 0.2.2} is better support for relationship queries,
both childtoparent lookups using the dot notation and parenttochild nested
queries. In prior releases the results were not parsed consistently and presented
themselves in a variety of hard to debug issues on GitHub that were brought up in #19, #35, #38, and #54. This
release finally aims to address some of those bugs through more consistent parsing methods
for both the XML returned by the SOAP API and the JSON returned by the REST API. However,
I would strongly recommend testing in your Org with your own queries to see the
impact before deploying to a production environment. If any unexpected behavior
crops up, then please file an issue on GitHub using the query issue template so we can get it resolved.
Old Nested Query Behavior (v0.1.4 or earlier)
sf_query("SELECT Name, (SELECT LastName FROM Contacts) FROM Account", api_type="SOAP") #> # A tibble: 24 x 4 #> Id Name Contacts LastName #> #> 1 NA GenePoint #> 2 NA Frank #> 3 NA United Oil & Gas, UK #> 4 NA James #> 5 NA United Oil & Gas, Singapore #> # … with 19 more rowsNew Query Behavior (v0.2.2)
sf_query("SELECT Name, (SELECT LastName FROM Contacts) FROM Account", api_type="SOAP") #> # A tibble: 16 x 4 #> Name Contact.LastName #> #> 1 GenePoint Frank #> 2 United Oil & Gas, UK James #> 3 United Oil & Gas, Singapore D'Cruz #> 4 United Oil & Gas, Singapore Ripley #> 5 Edge Communications Forbes #> # … with 11 more rowsA new vignette has been included with this release that covers the types of
queries currently supported by the package and is available here: Supported
Queries. I highly recommend reviewing for guidance and inspiration on how to
what types of queries are possible running against your Org.
For a complete listing of all changes made in recent releases of {salesforcer} please
view the Changelog (aka NEWS.md) file. Bug reports and feature requests are welcome on GitHub in the repository issues section.
Thank you for your continued support!
var vglnk = {key: '949efb41171ac6ec1bf7f206d57e90b8'}; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; // s.defer = true; // s.src = '//cdn.viglink.com/api/vglnk.js'; s.src = 'https://www.rbloggers.com/wpcontent/uploads/2020/08/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: stevenmortimer.com. Rbloggers.com offers daily email updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/datascience job. Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
The post salesforcer 0.2.2  Relationship Queries and the Reports API first appeared on Rbloggers.
crosstalk: Shining without Shiny
[social4i size="small" align="alignleft"] > [This article was first published on rstats  Emily Riederer, and kindly contributed to Rbloggers]. (You can report issue about the content on this page here) Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
This week, I was pleased to become an official
RStudio Certified Instructor after completing Greg Wilson’s training program, based on his excellent book
Teaching Tech Together. Part of the certification process involved teaching a 10 minute lesson. I chose to create a brief introduction to using the crosstalk package for dynamic interaction and filtering of plots, maps, and tables.
The primary intent of this post is simply to point out these materials, which are available on
GitHub. The lesson teaches the skills needed to create an threepanel application like
this (as shown in the screnshot above) with a scatterplot, map, and table comparing ridership at a sample of Chicago train stations in April of 2019 versus 2020 (not long after COVID shutdowns).
To make this more accessible, I converted to previous “lecture” component into a
stepbystep tutorial hosted on the repo’s GitHub Pages.1
Not convinced? Let me tell you just a bit more about crosstalk.
crosstalk is, in my opinion, an exciting and underutilized tool for interactive graphics. Shiny is the R world’s goto tool for creating interactive applications. However, Shiny has a higher learning curve and can be difficult to share with nonR users because it must be deployed on a server. crosstalk, on the other hand, requires minimal additional syntax and enables browserbased interactions which can be shared in a typical HTML R Markdown output.
This convenience comes at the cost of the feature set. crosstalk can handle interactive filtering and highlighting across multiple plots, maps, and tables to feature the same data subsets. However, unlike Shiny, t cannot do operations that would require R to recompute the data (for example, aggregation or refitting a model). Arguably, this limitation can also be a strength of crosstalk; since interaction is limited, the onus still lies firmly on the analyst to find a clear story and think carefully about which relationships to highlight.
The screenshots below demonstrate the types of exploration that can be done with crosstalk‘s dynamic filtering. In the overall scatterplot (above), we see a positive correlation between a higher proportion of weekend (versus weekend) trips in 2019 and lower declines in ridership from 2019 to 2020. This makes sense since business areas have lower weekend travel and would also be more affected by “stay at home” orders. We can use crosstalk to confirm this theory. Highlight the stations with the greatest decline in trips (left), Chicagoans could quickly recognize the mapped stations to be in the Chicago Loop, the heart of the business district. Conversely, those with the greatest weekend travel and lowest ride decline tend to be located in more residential areas on the west and south sides. You can recreate these exact views in the
demo, as linked above.
If this tutorial peaks your curiosity, check our package developer Carson Sievert’s book
Interactive webbased data visualization with R, plotly, and shiny from CRC Press for more information and advanced features.

Initially, I was going to republish the post in it’s entirety, but be warned: hugodown, which I currently use to render Rmd to md, has
known issues with including output from htmlwidgets R packages (such as leaflet and DT) in the Hugo Academic theme that I use. This is not a problem when rendering to normal rmarkdown::html_output or apparently
when using blogdown. ︎
To leave a comment for the author, please follow the link and comment on their blog: rstats  Emily Riederer. Rbloggers.com offers daily email updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/datascience job. Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
The post crosstalk: Shining without Shiny first appeared on Rbloggers.
Predicting pneumonia outcomes: Modelling via DataRobot API
[social4i size="small" align="alignleft"] > [This article was first published on R on notast, and kindly contributed to Rbloggers]. (You can report issue about the content on this page here) Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
This post is a supplementary material for an assignment. The assignment is part of the Augmented Machine Learning unit for a Specialised Diploma in Data Science for Business. The aim of this project is to classify if patients with Community Acquired Pneumonia (CAP) became better after seeing a doctor or became worse despite seeing a doctor.
Previously, EDA (part1, part 2)and feature enginnering was done in R. The dataset was then imported into DataRobot for predictive modelling (the aim of the assigment was to use DataRobot).
The post here will outline basic modelling done in DataRobot and how I used R to conduct advanced feature selection to enhance the performance in subsequent modelling.
When the data is imported into DataRobot, the raw features will go through “a reasonableness” check for useful information (e.g., are not duplicates or reference ids and not a constant value)”. Predictors that pass this screener will form Informative Features which DataRobot deems to be more appropriate for modelling.
DataRobot deemed all variables informative including the case identifier, Pt_CaseNumber. The uniqueness of the case identifier must have been disrupted when the dataset was split up between unseen data and data for DataRobot. A custom Feature List, L1,included all variables except Pt_CaseNumber.
The number of Kfold was increased from the default 5 to 10 folds. The increase in Kfolds helps to reduce biasness and therefore reduces the risk of underfitting. Based on the concept of bias and variance trade off, the decrease of biasness will increase the variance. Nonetheless, the sample size of the training set is large enough (1840 rows) to accommodate this increase in folds without adversely compromising variance. Additionally, 10fold cross validation is the default number of folds for some modelling programs; for example, R’s modern modelling metapackage, Tidymodels.
DataRobot was also permitted to search for Interactions among the variables. There is support in identifying interactions as the combined effect of two or more variables maybe different than the impact of the individual constituent variables.
Downsampling was introduced to address the imbalanced dataset (DataRobot only allows downsampling). The proportion of positive and negative outcome of the binary classification was imbalanced which [could affect model performance as most models work best when the classes are about equal)[https://towardsdatascience.com/methodsfordealingwithimbalanceddata5b761be45a18]. There were 6.25 more Better outcomes than Worse outcomes. DataRobot’s Downsampling was used to reduce the disparity. Undersampling can help to address imbalance dataset. The number of Better outcomes were reduced by 20%. Post downsampling, there was 5.22 Better outcomes than Worse outcomes.
The performance metric for the binary classification was Area Under the Curve (AUC) which is common in the medical literature. In DataRobot, when downsampling is used, the performance metric AUC becomes a weighted AUC.
1st Quick ModeThe experiment ran on Quick Mode (DataRobot has 3 modelling modes: Autopilot, Quick, Manual. The student version which I used only has Quick and Manual Mode).
Modelling (2nd run)The results from the first round of Quick Mode were used for advanced feature selection. A smaller but possibly more influential subset of variables were used in a second round of Quick Mode. The advanced feature selection used results from the first round of Quick Mode to create 2 new Feature Lists. The creation of these 2 new Feature List required RStudio to be connected to DataRobot via its API.
DataRobot API library(datarobot) library(tidyverse) theme_set(theme_light()) Connecting to DataRobotDataRobot was connected to RStudio via datarobot::ConnectToDataRobot. The endpoint for cloud based DataRobot is https://app.datarobot.com/api/v2. The token can be found under Developer Tools option.
ConnectToDataRobot(endpoint = "https://app.datarobot.com/api/v2", token = "API token key") ## [1] "Authentication token saved" Saving the relevant project ID projects_inDR<data.frame(name=ListProjects()$projectName, ID=ListProjects()$projectId) project_ID<projects_inDR %>% filter(name=="Classify CAP") %>% pull(ID) ## # A tibble: 11 x 2 ## name ID ## ## 1 Classify CAP 5f54dce5e5e3f924b99408ee ## 2 CAP Experiment 3 5f3678ff735d1f59f99534ed ## 3 CAP Experiment 2 5f33f9899071ae44ded6de84 ## 4 CAP Experiment 1 5f269f896c6a7a1cefaff780 ## 5 CAP informative ft 5f2694a9d6e13f1dc8ae17d4 ## 6 CleanDR OG cv10 5f2655a8d6e13f1c26ae17c7 ## 7 CleanDR OG cv5 5f263758ce243f1bfac853eb ## 8 P04_Readmission(1).csv 5f16dd3b99c1372d672bdde4 ## 9 P03_Advertising(1).csv 5f0da6861c29b917239402dc ## 10 P02_LendingClubData2(1).csv 5f047ce3c40c123f3487346b ## 11 P01_TitanicData.csv 5efb44cc60882464b616c03b Saving the models List_Model3<ListModels(project_ID) %>% as.data.frame(simple = F) List_Model3 %>% select(modelType, expandedModel, featurelistName, `Weighted AUC.crossValidation`) %>% arrange(`Weighted AUC.crossValidation`)DT::datatable(rownames = F, options = list(searchHighlight = TRUE, paging= T))
{"x":{"filter":"none","data":[["RuleFit Classifier::OneHot Encoding::Missing Values Imputed","Keras Slim Residual Neural Network Classifier using Training Schedule (1 Layer: 64 Units)::OneHot Encoding::Numeric Data Cleansing::Smooth Ridit Transform","RandomForest Classifier (Gini)::Treebased Algorithm Preprocessing v1","eXtreme Gradient Boosted Trees Classifier::Ordinal encoding of categorical variables::Missing Values Imputed","eXtreme Gradient Boosted Trees Classifier::Ordinal encoding of categorical variables::Missing Values Imputed","AVG Blender::Average Blender","eXtreme Gradient Boosted Trees Classifier::Ordinal encoding of categorical variables::Missing Values Imputed","eXtreme Gradient Boosted Trees Classifier::Ordinal encoding of categorical variables::Missing Values Imputed","Generalized Additive2 Model::Ordinal encoding of categorical variables::Missing Values Imputed::Text fit on Residuals (L2 / Binomial Deviance)","Light Gradient Boosted Trees Classifier with Early Stopping::Treebased Algorithm Preprocessing v1","Light Gradient Boosting on ElasticNet Predictions ::OneHot Encoding::Numeric Data Cleansing::Standardize::Ordinal encoding of categorical variables::ElasticNet Classifier (L2 / Binomial Deviance)","ElasticNet Classifier (L2 / Binomial Deviance)::Regularized Linear Model Preprocessing v19","ElasticNet Classifier (mixing alpha=0.5 / Binomial Deviance)::OneHot Encoding::Missing Values Imputed::Standardize"],["L1","L1","L1","L1","DR Reduced Features M8","Multiple featurelists","DR Reduced Features M8","DR Reduced Features M8","L1","L1","L1","L1","L1"],["0.904837","0.90765600000000002","0.91857100000000003","0.92684699999999998","0.92768899999999999","0.92877699999999996","0.92993499999999896","NA","NA","NA","NA","NA","NA"]],"container":"
\n \n \n expandedModel<\/th>\n featurelistName<\/th>\n Weighted AUC.crossValidation<\/th>\n <\/tr>\n <\/thead>\n<\/table>","options":{"searchHighlight":true,"paging":true,"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}In the 1st round of Quick Mode, 13 models were created. The best model was Extreme Gradient Boosted Tree Classifier, with a 10 fold cross validation weighted AUC of 0.9299. 1. Advanced Feature Selection (DR Reduced Features)
During the first round of Quick Mode, DataRobot created other Feature Lists including DR Reduced Features. These feature lists contained curated variables consisting of most important features based on Feature Impact scores from models.
Ft_DR<GetFeaturelist(project_ID, "5f54e0f611a5fc3a03274070") %>% as_tibble() Ft_DR$description %>% unqiue() ## [1] "The most important features based on Feature Impact scores from a particular model."3/13 models used DR Reduced Features in the 1st round of Quick Mode.
List_Model3 %>% count(featurelistName) ## # A tibble: 3 x 2 ## featurelistName n ## ## 1 DR Reduced Features M8 3 ## 2 L1 9 ## 3 Multiple featurelists 1Originally 70 variables were used for Quick Mode. 33 variables were identified for DR Reduced Features M8 (32 predictors + 1 outcome).
n_distinct(Ft_DR$features) ## [1] 33 2. Advanced Feature Selection (Top 32 Mean Feature Impact Variables)The number of predictors were kept the same as the number of predictors in DR Reduced Features to see which technique achieved better performance. The steps for identifying these variables were as follows:
 All the crossvalidation scores for all models were enabled on DataRobot GUI (by default, DataRobot calculates the crossvalidation score for the top few models based on the 1st crossvalidation fold score). The top 80% of the models based on cross validation AUC were selected. These selected models also had to have sufficient sample size of >64%. 9/13 models passed this screener.
{"x":{"filter":"none","data":[["eXtreme Gradient Boosted Trees Classifier::Ordinal encoding of categorical variables::Missing Values Imputed","AVG Blender::Average Blender","eXtreme Gradient Boosted Trees Classifier::Ordinal encoding of categorical variables::Missing Values Imputed","eXtreme Gradient Boosted Trees Classifier::Ordinal encoding of categorical variables::Missing Values Imputed","Light Gradient Boosted Trees Classifier with Early Stopping::Treebased Algorithm Preprocessing v1","Generalized Additive2 Model::Ordinal encoding of categorical variables::Missing Values Imputed::Text fit on Residuals (L2 / Binomial Deviance)","RandomForest Classifier (Gini)::Treebased Algorithm Preprocessing v1","Light Gradient Boosting on ElasticNet Predictions ::OneHot Encoding::Numeric Data Cleansing::Standardize::Ordinal encoding of categorical variables::ElasticNet Classifier (L2 / Binomial Deviance)","Keras Slim Residual Neural Network Classifier using Training Schedule (1 Layer: 64 Units)::OneHot Encoding::Numeric Data Cleansing::Smooth Ridit Transform"],["DR Reduced Features M8","Multiple featurelists","DR Reduced Features M8","L1","L1","L1","L1","L1","L1"],[0.929934999999999,0.928777,0.927689,0.926847,0.92296,0.919872,0.918571,0.912232,0.907656]],"container":"
\n \n \n expandedModel<\/th>\n featurelistName<\/th>\n Weighted AUC.crossValidation<\/th>\n <\/tr>\n <\/thead>\n<\/table>","options":{"searchHighlight":true,"paging":true,"columnDefs":[{"className":"dtright","targets":2}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]} The unnormalized feature impact score for all the variables in these models were averaged out. Variables in the top 32 mean feature impact score were used to create a new set of Feature List.
The calculation can be seen on DataRobot’s GUI.
The top 32 variables based on mean feature impact score and their respective feature impact score for each model is plotted here.
Ft_mean_varOutput[[1]]%>% filter(featureName %in% Ft_mean_varOutput[[2]]) %>% mutate(featureName= fct_reorder(featureName, impactUnnormalized, .fun =mean)) %>% ggplot(aes(featureName, impactUnnormalized, colour=featureName))+ geom_point(alpha=.5) + coord_flip() + theme(legend.position="none") + labs(title = "Top 32 variables based on mean feature impact score", subtitle = "each dot is the score for a specific model")New Feature List using the top 32 mean feature impact variables was created.
Feature_id_percent_rank = CreateFeaturelist(project_ID, listName= "mean ft impact" , featureNames= Ft_mean_varOutput[[2]][1:length(Ft_mean_varOutput[[2]])])$featurelistId ## [1] "Featurelist mean ft impact created" DR Features vs Top 32 mean Feature Impact variablesBoth Feature Lists had these predictors in common
intersect(Ft_mean_varOutput[[2]], Ft_DR$features) ## [1] "Abx_AmoxicillinSulbactamOral" "Abx_Duration" ## [3] "Pt_Age" "PE_AMS" ## [5] "Lab_RBC" "PE_HR" ## [7] "Care_daysUnfit" "Care_admit" ## [9] "Lab_Hb" "Lab_urea" ## [11] "Care_BPSupport" "Lab_Neu" ## [13] "Lab_WBC" "Lab_plt" ## [15] "Care_GP/OutptVisit" "Care_breathingAid" ## [17] "PE_temp" "Lab_Sugar" ## [19] "SS_daysOfRespSymp" "Hx_comorbidities" ## [21] "Lab_Cr" "Abx_LowFreq" ## [23] "PE_RR" "PE_BP_MAP" ## [25] "SS_breathing" "Social_smoke" ## [27] "Abx_Ceftriaxone" "Abx_ClarithromycinIV" ## [29] "HCAP_IVAbx" "Abx_no" ## [31] "Hx_brainMental"The only difference was that the top 32 mean Feature Impact variables contained medical history on the type of heart disease the patient had.
setdiff(Ft_mean_varOutput[[2]], Ft_DR$features) ## [1] "Hx_heart_type" Running 2nd Quick ModeAnother round of Quick Mode was ran with both of the new Feature Lists created with advanced feature selection. Again, cross validation scores for all models was calculated.
8 additional models were generated with DR Reduced features (the other 3 was previously generated in the 1st round of Quick Mode). 10 additional models were generated with the Top 32 mean Feature Impact score variables.
In the next post, the results from all the models from both runs of Quick Mode will be analysed.
var vglnk = {key: '949efb41171ac6ec1bf7f206d57e90b8'}; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; // s.defer = true; // s.src = '//cdn.viglink.com/api/vglnk.js'; s.src = 'https://www.rbloggers.com/wpcontent/uploads/2020/08/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R on notast. Rbloggers.com offers daily email updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/datascience job. Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
The post Predicting pneumonia outcomes: Modelling via DataRobot API first appeared on Rbloggers.
Data Science is a Science (Just Not the One You May Think)
[social4i size="small" align="alignleft"] > [This article was first published on R – Win Vector LLC, and kindly contributed to Rbloggers]. (You can report issue about the content on this page here) Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
I am working on a promising new series of notes: common data science fallacies and pitfalls. (Probably still looking for a good name for the series!)
I thought I would share a few thoughts on it, and hopefully not jinx it too badly.
Data science, is for better or worse, an empirical field. Please consider the following partial definition of science from the Oxford English dictionary:
A branch of knowledge conducted on objective principles involving the systematized observation of and experimentation with phenomena …
My point being: data science is a science. It is the science of studying the empirical phenomena of modeling methodologies. Data science isn’t the study of the areas it is applied to, or even of the development of the tools it uses.
Methodologies that empirically work are further used and studied. This means they tend to evolve or accrete into complex, but not fully correct systems.
I think a lot of statisticians’ issue with data science is the following. Data science, while a producer and consumer of mathematical content, isn’t a mathematical field. Formal correctness isn’t a central tenant of data science. Empirical effectiveness is the central tenant.
Methods that tend to work are used, and reproduce through sharing.
I too find this disappointing. My background is mathematical. I still feel upfront time establishing theoretical correctness pays off in the long term.
So I am working on a small series of practices that are not quite correct in data science. Where I can, I will identify correct alternatives. What I am trying to avoid is the accumulation of not quite correct patches and affordances that introduce more problems are then further patched and so on. This cycle leads to some of the complexity of current machine learning infrastructure.
This, unfortunately, will not often lead to huge improvements. Any examined practice that was very wrong would, in an empirical field, get patched or removed.
This is like the old finance joke:
Two traders see a $100 bill on the street. One starts to bend to pick it up. The other stops him by saying: “Wait, if it were a real $100 somebody would have already picked it up by now.”
So we will go after smaller denomination bills.
We will show small improvements that can stack (many small improvements can be a large improvement), and cut down on complexity (using a correct method lessens the need for patches and the problems they then introduce).
Our first examples in this series are already out:
Some of this work is in collaboration with Dr. Nina Zumel and Professor Norm Matloff.
Many of these issues are “small nicks.” But one can die by a thousand cuts. So I want to help get rid of some of these issues, get rid of the workarounds or patches used to deal with the problems, and get rid of the problems the patches introduce. Break the autocatalytic chain of faulty patches patching faulty patches.
For a lot of these issues I expect the following stage reactions (even all from some of the same respondents):
 Nobody does that.
 You have no right to say that.
 We have always done that, as it isn’t a problem.
 There is no way to fix that.
 We already work around the problem in the following way.
The series will take workarounds as evidence of a problem, but not as solutions. I want to emphasize methods that identify and avoid the core problems.
var vglnk = {key: '949efb41171ac6ec1bf7f206d57e90b8'}; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; // s.defer = true; // s.src = '//cdn.viglink.com/api/vglnk.js'; s.src = 'https://www.rbloggers.com/wpcontent/uploads/2020/08/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Win Vector LLC. Rbloggers.com offers daily email updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/datascience job. Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
The post Data Science is a Science (Just Not the One You May Think) first appeared on Rbloggers.
Video: How To Set Up A Distributed Data Science Team
[social4i size="small" align="alignleft"] > [This article was first published on r – Appsilon Data Science  End to End Data Science Solutions, and kindly contributed to Rbloggers]. (You can report issue about the content on this page here) Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
This presentation was part of a joint virtual webinar with Appsilon and RStudio on July 28, 2020 entitled “Enabling Remote Data Science Teams.” Find a direct link to the presentation here.
In this video, Appsilon Senior Data Scientist Olga MierzwaSulima explains best practices for data science teams – whether your team is lucky enough to be working in the office together or fully remote. Olga outlines the basics of project management (Scrum framework, project boards, and implementation plans), code review (linter, continuous integration, and GitHub Templates), and setting up a development environment (Docker Hub, RStudio Server Pro, and renv lockfiles).
In the video, Olga explains Appsilon’s variation on the Scrum approach to project management, as well as how to set up a project board (in Asana, GitHub, or another workflow tool) to organize backlog. You will learn why code review and version control are essential for commercial projects, and how tools like GitHub Actions can help. Finally, Olga describes how to properly set up a development environment and avoid chaos with Docker Hub or RStudio Server Pro.
Learn More Read Olga’s full article on Data Science Team Best Practices
 shiny.worker: Speed Up R Shiny Apps by Offloading Heavy Calculations
 Read Appsilon Engineer Krystian Igras’s RStudio Guest Article: 4 Tips to Make Your Shiny Dashboard Faster
Appsilon is hiring a Data Storyteller! This is a fulltime role focused on creating compelling content for our blog and social media. We want this person to help us create ambitious articles and interactive experiences around the topics of R Shiny, Machine Learning, and climate change. If you have experience working with R or Python (especially R Shiny) and you have a passion for writing, please apply.
Appsilon is an RStudio Full Service Certified Partner. We specialize in advanced enterprise Shiny apps for Fortune 500 companies. Reach out to us at hello@appsilon.com.
Article Video: How To Set Up A Distributed Data Science Team comes from Appsilon Data Science  End to End Data Science Solutions.
var vglnk = {key: '949efb41171ac6ec1bf7f206d57e90b8'}; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; // s.defer = true; // s.src = '//cdn.viglink.com/api/vglnk.js'; s.src = 'https://www.rbloggers.com/wpcontent/uploads/2020/08/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: r – Appsilon Data Science  End to End Data Science Solutions. Rbloggers.com offers daily email updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/datascience job. Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
The post Video: How To Set Up A Distributed Data Science Team first appeared on Rbloggers.