R news and tutorials contributed by (750) R bloggers

forcats::fct_match

Fri, 02/22/2019 - 14:27

(This article was first published on R – Irregularly Scheduled Programming, and kindly contributed to R-bloggers)

This journey started almost exactly a year ago, but it’s finally been sufficiently worked through and merged! Yay, I’ve officially contributed to the tidyverse (minor as it may be).

It began with a tweet, recalling a surprise I encountered that day during some routine data processing

Source of today's mild heart-attack: I have categories W, X_Y, and Z in some data. Intending to keep only the second two:

data %>% filter(g %in% c("X Y", "Z")

Did you spot that I used a space instead of an underscore? I sure as heck didn't, and filtered excessively to just Z.— Jonathan Carroll (@carroll_jono) March 6, 2018

For those of you not so comfortable with pipes and dplyr, I was trying to subset a data.frame ‘data‘ (with a column g having values "W", "X_Y" and "Z") to only those rows for which the column g had the value "X_Y" or "Z" (not the actual values, of course, but that’s the idea). Without dplyr this might simply be

data[data$g %in% c("X Y", "Z"), ]

To make that more concrete, let’s actually show it in action

data <- data.frame(a = 1:5, g = c("X_Y", "W", "Z", "Z", "W"))
data
#>   a   g
#> 1 1 X_Y
#> 2 2   W
#> 3 3   Z
#> 4 4   Z
#> 5 5   W

data %>% filter(g %in% c("X Y", "Z"))
#>   a g
#> 1 3 Z
#> 2 4 Z

filter isn’t at fault here — the same issue would arise with [ — I have mis-specified the values I wish to match, so I am returned only the matching values. %in% is also performing its job – it returns a logical vector; the result of comparing the values in the column g to the vector c("X Y", "Z"). Both of these functions are behaving as they should, but the logic of what I was trying to achieve (subset to only these values) was lost.

Now, in some instances, that is exactly the behaviour you want — subset this vector to any of these values… where those values may not be present in the vector to begin with

data %>% filter(values %in% all_known_values)

The problem, for me, is that there isn’t a way to say “all of these should be there”. The lack of matching happens silently. If you make a typo, you don’t get that level, and you aren’t told that it’s been skipped

simpsons_characters %>% filter(first_name %in% c("Homer", "Marge", "Bert", "Lisa", "Maggie"))
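
For comparison, a hand-rolled guard against this silent mismatch might look something like the sketch below (check_levels is a hypothetical helper, not part of dplyr or forcats):

check_levels <- function(x, lvls) {
  # error if any requested level is absent, otherwise behave like %in%
  missing_lvls <- setdiff(lvls, unique(as.character(x)))
  if (length(missing_lvls) > 0) {
    stop("Levels not present: ", paste(missing_lvls, collapse = ", "))
  }
  x %in% lvls
}

# With the typo "Bert" absent from the data, this would error instead of silently dropping rows
simpsons_characters %>% filter(check_levels(first_name, c("Homer", "Marge", "Bert", "Lisa", "Maggie")))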

Technically this is a double-post because I also want to sidenote this with something I am amazed I have not known about until now (I was approximately today years old when I learned about this)… I've used regex matching for a while, and have been surprised at how well I've been able to make it work occasionally. I'm familiar with counting patterns ((A){2} to match two occurrences of A) and ranges of counts ((A){2,4} to match between two and four occurrences of A), but I was not aware that you can specify a number of mistakes that can be tolerated and still make a match…

grep("Bart", c("Bart", "Bort", "Brat"), value = TRUE) #> [1] "Bart" grep("(Bart){~1}", c("Bart", "Bort", "Brat"), value = TRUE) #> [1] "Bart" "Bort"

(“Are you matching to me?”… “No, my regex also matches to ‘Bort’”)

Use (pattern){~n} to allow up to n substitutions in the pattern matching. Refer here and here.
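
Base R's agrep() gets at a similar idea via an edit-distance threshold rather than regex syntax; a quick sketch:

# Approximate matching with agrep(): max.distance caps the allowed edits.
# "Bort" is one substitution away from "Bart" so it should match too,
# while "Brat" needs two edits and should not.
agrep("Bart", c("Bart", "Bort", "Brat"), max.distance = 1, value = TRUE)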

Back to the original problem — filter and %in% are doing their jobs, but we aren't getting the result we want because we made a typo, and we aren't told that we've done so.

Enter a new PR to forcats (originally to dplyr, but forcats does make more sense) which implements fct_match(f, lvls). This checks that all of the values in lvls are actually present in f before returning the logical vector of which entries they correspond to. With this, the pattern becomes (after loading the development version of forcats from GitHub)

data %>% filter(fct_match(g, c("X Y", "Z"))) #> Error in filter_impl(.data, quo): Evaluation error: Levels not present in factor: "X Y".

Yay! We're notified that we've made an error. "X Y" isn't actually in our column g. If we don't make the error, we get the result we actually wanted in the first place. We can now use this successfully

data %>% filter(fct_match(g, c("X_Y", "Z"))) #> a g #> 1 1 X_Y #> 2 3 Z #> 3 4 Z

It took a while for the PR to be addressed (the tidyverse crew have plenty of backlog, no doubt) but after some minor requested changes and a very neat cleanup by Hadley himself, it’s been merged.

My original version had a few bells and whistles that the current implementation has put aside. The first was inverting the matching with fct_exclude, making it easier to negate the matching without having to create a new anonymous function, i.e. ~!fct_match(.x). I find this particularly useful since a pipe expects a call/named function rather than a lambda/anonymous function, which is quite painful to construct

data %>%
  pull(g) %>%
  (function(x) !fct_match(x, c("X_Y", "Z")))
#> [1] FALSE TRUE FALSE FALSE TRUE

whereas if we defined

fct_exclude <- function(f, lvls, ...) !fct_match(f, lvls, ...)

we can use

data %>%
  pull(g) %>%
  fct_exclude(c("X_Y", "Z"))
#> [1] FALSE TRUE FALSE FALSE TRUE

The other was specifying whether or not to include missing levels when considering if lvls is a valid value in f since unique(f) and levels(f) can return different answers.
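
To illustrate that distinction, a factor can carry a level that never appears in the data:

f <- factor(c("a", "b"), levels = c("a", "b", "c"))
levels(f)
#> [1] "a" "b" "c"
unique(f)
#> [1] a b
#> Levels: a b c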

The cleanup really made me think about how much 'fluff' some of my code can have. Sure, it's nice to encapsulate some logic in a small additional function, but sometimes you can replace all of that with a one-liner. If you're ever in the mood to see how compact internal code can really be, check out the source of forcats.

Hopefully this pattern of filter(fct_match(f, lvls)) is useful to others. It’s certainly going to save me overlooking some typos.


RVowpalWabbit 0.0.13: Keeping CRAN happy

Fri, 02/22/2019 - 13:42

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Another small RVowpalWabbit package update brings us version 0.0.13. Just like Rblpapi yesterday, this release copes with staged installs, which will be a new feature of R 3.6.0. No other changes were made; no new code or features were added.

We should mention once more that there is a newer package, rvw (not yet on CRAN), thanks to the excellent GSoC 2018 and beyond work by Ivan Pavlov (who was mentored by James and myself). So if you are into Vowpal Wabbit from R, go check it out.

More information is on the RVowpalWabbit page. Issues and bug reports should go to the GitHub issue tracker.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


Finding Economic Articles With Data

Fri, 02/22/2019 - 10:00

(This article was first published on Economics and R - R posts, and kindly contributed to R-bloggers)

In my view, one of the greatest developments during the last decade in economics is that the Journals of the American Economic Association and some other leading journals require authors to upload the replication code and data sets of accepted articles.

I wrote a Shiny app that currently allows you to search among more than 3000 articles that have an accessible data and code supplement. Just click here to use it:

http://econ.mathematik.uni-ulm.de:3200/ejd/

One can perform a keyword search among the abstract and title. The screenshot shows an example:

One gets some information about the size of the data files and the code files used. I also tried to find and extract a README file from each supplement. Most README files explain whether all results can be replicated with the provided data sets or whether some results require confidential or proprietary data sets. The link allows you to look at the README without needing to download the whole data supplement.

The main idea is that such a search function could be helpful for teaching economics and data science. For example, my students can use the app to find an interesting topic for a Bachelor or Master thesis in the form of an interactive analysis with RTutor. You could also generate a topic list for a seminar in which students replicate some key findings of a research article.

While the app performs well for a single user, I have not tested the performance for many simultaneous users. If it is too sluggish or you don't get connected, there are perhaps currently too many users. In that case just try again a bit later.

If you want to analyse the collected data underlying the search app yourself, you can download the zipped SQLite databases using the following links:

I try to update the databases regularly.

Below is an example of a simple analysis based on those databases. First make sure that you download and extract articles.zip into your working directory.

We first open a database connection

library(RSQLite)
db = dbConnect(RSQLite::SQLite(), "articles.sqlite")

Data type conversion between databases and R can sometimes be a bit tedious. For example, SQLite has no native Date or logical type. For this reason, I typically use my package dbmisc when working with SQLite databases. It allows you to specify a database schema as a simple yaml file and has a lot of convenience functions to retrieve or modify data that automatically use the provided schema. The following code sets the database schema that is provided in the package EconJournalData:

library(dbmisc)
db = set.db.schemas(db, schema.file = system.file("schema/articles.yaml", package = "EconJournalData"))

Of course, for a simple analysis like ours below, just using the standard functions in the DBI package without a schema would suffice. But I am just used to working with the dbmisc package.

The main information about articles is stored in the table article

# Get the first 4 entries of articles as data frame
dbGet(db, "article", n = 4)

id            year  date        journ  title                                                     size (MB)
aer_108_11_1  2018  2018-11-01  aer    Firm Sorting and Agglomeration                            0.05339
aer_108_11_2  2018  2018-11-01  aer    Near-Feasible Stable Matchings with Couples               0.07286
aer_108_11_3  2018  2018-11-01  aer    The Costs of Patronage: Evidence from the British Empire  0.44938
aer_108_11_4  2018  2018-11-01  aer    The Logic of Insurgent Electoral Violence                 56

(Selected columns shown; the full table also stores volume, issue, article and data URLs, the abstract and the README path for each article.)

The table files_summary contains information about code, data and archive files for each article

dbGet(db, "files_summary",n = 6) table.data-frame-table { border-collapse: collapse; display: block; overflow-x: auto;} td.data-frame-td {font-family: Verdana,Geneva,sans-serif; margin: 0px 3px 1px 3px; padding: 1px 3px 1px 3px; border-left: solid 1px black; border-right: solid 1px black; text-align: left;font-size: 80%;} td.data-frame-td-bottom {font-family: Verdana,Geneva,sans-serif; margin: 0px 3px 1px 3px; padding: 1px 3px 1px 3px; border-left: solid 1px black; border-right: solid 1px black; text-align: left;font-size: 80%; border-bottom: solid 1px black;} th.data-frame-th {font-weight: bold; margin: 3px; padding: 3px; border: solid 1px black; text-align: center;font-size: 80%;} tbody>tr:last-child>td { border-bottom: solid 1px black; } id file_type num_files mb is_code is_data aejapp_1_1_10 do 9 0.009427 TRUE FALSE aejapp_1_1_10 dta 2 0.100694 FALSE TRUE aejapp_1_1_3 do 19 0.103628 TRUE FALSE aejapp_1_1_4 csv 1 0.024872 FALSE TRUE aejapp_1_1_4 dat 1 7.15491 FALSE TRUE aejapp_1_1_4 do 9 0.121618 TRUE FALSE

Let us now analyse which share of articles uses Stata, R, Python, Matlab or Julia and how the usage has developed over time.

Since our datasets are small, we can just download the two tables and work with dplyr in memory. Alternatively, you could use some SQL commands or work with dplyr on the database connection.

articles = dbGet(db, "article")
fs = dbGet(db, "files_summary")
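
For the on-connection alternative mentioned above, a rough sketch (assuming the dbplyr package is installed) could look like this:

library(dplyr)
library(dbplyr)
# Build a lazy table on the open connection; the count is translated to SQL
# and only the result is pulled into R by collect()
fs_tbl = tbl(db, "files_summary")
fs_tbl %>%
  count(file_type, sort = TRUE) %>%
  collect()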

Let us now compute the shares of articles that have one of the file types we are interested in:

# Number of articles with a data & code supplement
n_art = n_distinct(fs$id)

# Count articles by file type and compute shares
fs %>%
  group_by(file_type) %>%
  summarize(count = n(), share = round((count / n_art) * 100, 2)) %>%
  # note that all file extensions are stored in lower case
  filter(file_type %in% c("do", "r", "py", "jl", "m")) %>%
  arrange(desc(share))

file_type  count  share
do         2606   70.55
m          852    23.06
r          105    2.84
py         32     0.87
jl         2      0.05

Roughly 70% of the articles have Stata do files and almost a quarter have Matlab m files. Using open-source statistical software does not yet seem very popular among economists: less than 3% of articles have R code files, Python is below 1%, and only 2 articles have Julia code.

This dominance of Stata in economics never ceases to surprise me, in particular when for some reason I just happened to open the Stata do file editor and compare it with RStudio… But then, I am not an expert in writing empirical economic research papers – I just like R programming and rather passively consume empirical research. For writing empirical papers it probably is convenient that in Stata you can add a robust or robust cluster option to almost every type of regression in order to quickly get the economists’ standard standard errors…

For teaching empirical economics with R, the dominance of Stata is not necessarily bad news. It means that there are a lot of studies which students can replicate in R. Such replication would be considerably less interesting if the original code of the articles were already written in R.

Let us finish by having a look at the development over time…

sum_dat = fs %>%
  left_join(select(articles, year, id), by = "id") %>%
  group_by(year) %>%
  mutate(n_art_year = n()) %>%
  group_by(year, file_type) %>%
  summarize(
    count = n(),
    share = round((count / first(n_art_year)) * 100, 2)
  ) %>%
  filter(file_type %in% c("do", "r", "py", "jl", "m")) %>%
  arrange(year, desc(share))

head(sum_dat)

year  file_type  count  share
2005  do         25     22.12
2005  m          10     8.85
2006  do         24     20.87
2006  m          13     11.3
2007  do         24     19.35
2007  m          16     12.9

library(ggplot2)
ggplot(sum_dat, aes(x = year, y = share, color = file_type)) +
  geom_line(size = 1.5) +
  scale_y_log10() +
  theme_bw()

Well, maybe there is a little upward trend for the open source languages, but not too much seems to have happened over time so far…


Running your R script in Docker

Fri, 02/22/2019 - 08:00

(This article was first published on r-bloggers – STATWORX, and kindly contributed to R-bloggers)

Since its release in 2014, Docker has become an essential tool for deploying applications. At STATWORX, R is part of our daily toolset. Clearly, many of us were thrilled to learn about RStudio’s Rocker Project, which makes containerizing R code easier than ever.

Containerization is useful in a lot of different situations. To me, it is very helpful when I’m deploying R code in a cloud computing environment, where the coded workflow needs to be run on a regular schedule. Docker is a perfect fit for this task for two reasons: On the one hand, you can simply schedule a container to be started at your desired interval. On the other hand, you always know what behavior and what output to expect, because of the static nature of containers. So if you’re tasked with deploying a machine-learning model that should regularly make predictions, consider doing so with the help of Docker. This blog entry will guide you through the entire process of getting your R script to run in a Docker container one step at a time. For the sake of simplicity, we’ll be working with a local dataset.

I’d like to start off with emphasizing that this blog entry is not a general Docker tutorial. If you don’t really know what images and containers are, I recommend that you take a look at the Docker Curriculum first. If you’re interested in running an RStudio session within a Docker container, then I suggest you pay a visit to the OpenSciLabs Docker Tutorial instead. This blog specifically focuses on containerizing an R script to eventually execute it automatically each time the container is started, without any user interaction – thus eliminating the need for the RStudio IDE. The syntax used in the Dockerfile and the command line will only be treated briefly here, so it’s best to get familiar with the basics of Docker before reading any further.

What we’ll need

For the entire procedure we’ll be needing the following:

  • An R script which we’ll build into an image
  • A base image on top of which we'll build our new image
  • A Dockerfile which we’ll use to build our new image

You can clone all following files and the folder structure I used from the STATWORX GitHub Repository.

The R script

We're working with a very simple R script that imports a dataframe, manipulates it, creates a plot based on the manipulated data and, in the end, exports both the plot and the data it is based on. The dataframe used for this example is the US 500 Records dataset provided by Brian Dunning. If you'd like to work along, I'd recommend copying this dataset into the 01_data folder.

library(readr)
library(dplyr)
library(ggplot2)
library(forcats)

# import dataframe
df <- read_csv("01_data/us-500.csv")

# manipulate data
plot_data <- df %>%
  group_by(state) %>%
  count()

# save manipulated data to output folder
write_csv(plot_data, "03_output/plot_data.csv")

# create plot based on manipulated data
plot <- plot_data %>%
  ggplot() +
  geom_col(aes(fct_reorder(state, n), n, fill = n)) +
  coord_flip() +
  labs(
    title = "Number of people by state",
    subtitle = "From US-500 dataset",
    x = "State",
    y = "Number of people"
  ) +
  theme_bw()

# save plot to output folder
ggsave("03_output/myplot.png", width = 10, height = 8, dpi = 100)

This creates a simple bar plot based on our dataset:

We don't just want to run this R code inside a Docker container; we also want it to read data from outside the container and save the results outside it afterward.

The base image

The DockerHub page of the Rocker project lists all available Rocker repositories. Seeing as we're using tidyverse packages in our script, the rocker/tidyverse image seems an obvious choice. The problem with this repository is that it also includes RStudio, which is not something we want for this specific project. This means that we'll have to work with the r-base repository instead and build our own tidyverse-enabled image. We can pull the rocker/r-base image from DockerHub by executing the following command in the terminal:

docker pull rocker/r-base

This will pull the Base-R image from the Rocker DockerHub repository. We can run a container based on this image by typing the following into the terminal:

docker run -it --rm rocker/r-base

Congratulations! You are now running R inside a Docker container! The terminal was turned into an R console, which we can now interact with thanks to the -it argument. The --rm argument makes sure the container is automatically removed once we stop it. You're free to experiment with your containerized R session, which you can exit by calling q() from the R console. You could, for example, start installing the packages you need for your workflow with install.packages(), but that's usually a tedious and time-consuming task. It is better to build your desired packages into the image, so you don't have to bother with manually installing them every time you start a container. For that, we need a Dockerfile.

The Dockerfile

With a Dockerfile, we tell Docker how to build our new image. A Dockerfile is a text file that must be called "Dockerfile" (no file extension) and by default is assumed to be located in the build-context root directory (which in our case would be the "R-Script in Docker" folder). First, we have to define the image on top of which we'd like to build ours. Depending on how we'd like our image to be set up, we give it a list of instructions so that running containers will be as smooth and efficient as possible. In this case, I'd like to base our new image on the previously discussed rocker/r-base image. Next, we replicate the local folder structure, so we can specify the directories we want in the Dockerfile. After that we copy the files which we want our image to have access to into said directories – this is how you get your R script into the Docker image. Furthermore, this allows us to avoid having to manually install packages after starting a container, as we can prepare a second R script that takes care of the package installation. Simply copying the R script is not enough; we also need to tell Docker to run it when building the image. And that's our first Dockerfile!

# Base image https://hub.docker.com/u/rocker/
FROM rocker/r-base:latest

## create directories
RUN mkdir -p /01_data
RUN mkdir -p /02_code
RUN mkdir -p /03_output

## copy files
COPY /02_code/install_packages.R /02_code/install_packages.R
COPY /02_code/myScript.R /02_code/myScript.R

## install R-packages
RUN Rscript /02_code/install_packages.R

Don't forget to prepare and save the appropriate install_packages.R script, where you specify which R packages you need pre-installed in your image. In our case the file looks like this:

install.packages("readr") install.packages("dplyr") install.packages("ggplot2") install.packages("forcats") Building and running the image

Now we have assembled all necessary parts for our new Docker image. Use the terminal to navigate to the folder where your Dockerfile is located and build the image with

docker build -t myname/myimage .

The process will take a while due to the package installation. Once it’s finished we can test our new image by starting a container with

docker run -it --rm -v ~/"R-Script in Docker"/01_data:/01_data -v ~/"R-Script in Docker"/03_output:/03_output myname/myimage

The -v arguments signal to Docker which local folders to map to the created folders inside the container. This is important because we want both to get our dataframe into the container and to save the output from our workflow locally, so it isn't lost once the container is stopped.

This container can now interact with our dataframe in the 01_data folder and has a copy of our workflow-script inside its own 02_code folder. Telling R to source("02_code/myScript.R") will run the script and save the output into the 03_output folder, from where it will also be copied to our local 03_output folder.

Improving on what we have

Now that we have tested and confirmed that our R script runs as expected when containerized, there are only a few things missing.

  1. We don’t want to manually have to source the script from inside the container, but have it run automatically whenever the container is started.

We can achieve this very easily by simply adding the following command to the end of our Dockerfile:

## run the script CMD Rscript /02_code/myScript.R

This points towards the location of our script within the folder structure of our container, marks it as R code and then tells it to run whenever the container is started. Making changes to our Dockerfile, of course, means that we have to rebuild our image and that in turn means that we have to start the slow process of pre-installing our packages all over again. This is tedious, especially if chances are that there will be further revisions of any of the components of our image down the road. That’s why I suggest we

  2. Create an intermediary Docker image where we install all important packages and dependencies so that we can then build our final, desired image on top of it.

This way we can quickly rebuild our image within seconds, which allows us to freely experiment with our code without having to sit through Docker installing packages over and over again.

Building an intermediary image

The Dockerfile for our intermediary image looks very similar to our previous example. Because I decided to modify my install_packages.R script to include the entire tidyverse for future use, I also needed to install a few Debian packages the tidyverse depends upon. Not all of these are 100% necessary, but all of them should be useful in one way or another.

# Base image https://hub.docker.com/u/rocker/
FROM rocker/r-base:latest

## install debian packages
RUN apt-get update -qq && apt-get -y --no-install-recommends install \
    libxml2-dev \
    libcairo2-dev \
    libsqlite3-dev \
    libmariadbd-dev \
    libpq-dev \
    libssh2-1-dev \
    unixodbc-dev \
    libcurl4-openssl-dev \
    libssl-dev

## copy files
COPY 02_code/install_packages.R /install_packages.R

## install R-packages
RUN Rscript /install_packages.R

I build the image by navigating to the folder where my Dockerfile sits and executing the Docker build command again:

docker build -t oliverstatworx/base-r-tidyverse .

I have also pushed this image to my DockerHub, so if you ever need a base-R image with the tidyverse pre-installed, you can simply build on top of my image without having to go through the hassle of building it yourself.

Now that the intermediary image has been built we can change our original Dockerfile to build on top of it instead of rocker/r-base and remove the package-installation because our intermediary image already takes care of that. We also add the last line that automatically starts running our script whenever the container is started. Our final Dockerfile should look something like this:

# Base image https://hub.docker.com/u/oliverstatworx/
FROM oliverstatworx/base-r-tidyverse:latest

## create directories
RUN mkdir -p /01_data
RUN mkdir -p /02_code
RUN mkdir -p /03_output

## copy files
COPY /02_code/myScript.R /02_code/myScript.R

## run the script
CMD Rscript /02_code/myScript.R

The final touches

Since we built our image on top of an intermediary image with all our needed packages, we can now easily modify parts of our final image to our liking. I like making my R script less verbose by suppressing warnings and messages that are not of interest anymore (since I already tested the image and know that everything works as expected) and adding messages that tell the user which part of the script is currently being executed by the running container.

suppressPackageStartupMessages(library(readr))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(forcats))

options(scipen = 999, readr.num_columns = 0)

print("Starting Workflow")

# import dataframe
print("Importing Dataframe")
df <- read_csv("01_data/us-500.csv")

# manipulate data
print("Manipulating Data")
plot_data <- df %>%
  group_by(state) %>%
  count()

# save manipulated data to output folder
print("Writing manipulated Data to .csv")
write_csv(plot_data, "03_output/plot_data.csv")

# create plot based on manipulated data
print("Creating Plot")
plot <- plot_data %>%
  ggplot() +
  geom_col(aes(fct_reorder(state, n), n, fill = n)) +
  coord_flip() +
  labs(
    title = "Number of people by state",
    subtitle = "From US-500 dataset",
    x = "State",
    y = "Number of people"
  ) +
  theme_bw()

# save plot to output folder
print("Saving Plot")
ggsave("03_output/myplot.png", width = 10, height = 8, dpi = 100)

print("Workflow Finished")

After navigating to the folder where our Dockerfile is located, we rebuild our image once more with docker build -t myname/myimage . Once again we start a container based on our image and map the 01_data and 03_output folders to our local directories. This way we can import our data and save the created output locally:

docker run -it --rm -v ~/"R-Script in Docker"/01_data:/01_data -v ~/"R-Script in Docker"/03_output:/03_output myname/myimage

Congratulations, you now have a clean Docker image that not only automatically runs your R script whenever a container is started, but also tells you exactly which part of the code it is executing via console messages. Happy docking!

About the author

Oliver Guggenbühl

I am a Data Scientist at STATWORX and love telling stories with data – the ShinyR the better!

ABOUT US

STATWORX is a consulting company for data science, statistics, machine learning and artificial intelligence located in Frankfurt, Zurich and Vienna. Sign up for our NEWSLETTER and receive reads and treats from the world of data science and AI.


The post Running your R script in Docker first appeared on STATWORX.


Vivid: Toward A Next Generation Statistical User Interface

Fri, 02/22/2019 - 05:50

(This article was first published on Fells Stats, and kindly contributed to R-bloggers)

We are announcing the development of a new statistical user interface for R. I'm really excited about it, and I hope that you will be too. Vivid is a rich document format deeply integrated with RStudio that mixes user interface elements, code and output. I firmly believe that the next million R users are going to fall in love with R first through the lens of a statistical user interface. Vivid aims to be that interface.

Vivid is in its developmental infancy, but we would love feedback and collaborators. If you are interested in testing and/or helping us build this tool, let us know!

https://github.com/fellstat/vivid


Rblpapi 0.3.9: Keeping CRAN happy

Fri, 02/22/2019 - 03:28

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A minimal maintenance release of Rblpapi, now at version 0.3.9, arrived on CRAN earlier today. Rblpapi provides a direct interface between R and the Bloomberg Terminal via the C++ API provided by Bloomberg (but note that a valid Bloomberg license and installation is required).

This is the ninth release since the package first appeared on CRAN in 2016. It accommodates a request by CRAN / R Core to cope with staged installs, which will be a new feature of R 3.6.0. No other changes were made (besides updating a now-stale URL at Bloomberg in a few spots and other minuscule maintenance). However, a few other changes have been piling up at the GitHub repo, so feel free to try that version too. Details of this release below:

Changes in Rblpapi version 0.3.9 (2019-02-20)
  • Add ‘StagedInstall: no’ to DESCRIPTION to accomodate R 3.6.0.

Courtesy of CRANberries, there is also a diffstat report for this release. As always, more detailed information is on the Rblpapi page. Questions, comments etc. should go to the issue tickets system at the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


Bugfix release for the ssh package

Fri, 02/22/2019 - 01:00

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

The ssh package provides a native ssh client for R. You can connect to a remote server over SSH to transfer files via SCP, set up a secure tunnel, or run a command or script on the host while streaming stdout and stderr directly to the client. The intro vignette provides a brief introduction.
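
A quick sketch of typical usage (the host below is a placeholder):

library(ssh)
# Connect, run a command while streaming its output, copy a file back, then disconnect
session <- ssh_connect("user@example.com")
ssh_exec_wait(session, command = "whoami")
scp_download(session, "/etc/hostname", to = tempdir())
ssh_disconnect(session)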

This week version 0.4 has been released, so you can install it directly from CRAN:

install.packages("ssh")

The NEWS file shows that this is mostly a bugfix release:

ssh 0.4:
  - Fix for security problem with latest openssh-server
  - Windows/Mac: update libssh to 0.8.6
  - Use new 'askpass' package for password and passphrase prompts
  - Use new ssh_session_is_known_server() api on libssh 0.8 and up
  - Fix bug that would write key to known_hosts for each connection
  - Add support for parsing ipv6 ip-address

There are no new features but upgrading is highly recommended.

OpenSSH and libssh Updates

The most significant changes are due to library upgrades. The Windows and MacOS binary packages have been upgraded to the latest libssh 0.8.6. There have been numerous fixes as listed in the libssh changelog.

On the server side, a recent security patch release of openssh (the standard ssh server) had caused a problem in the R client when copying files via SCP. It is pretty unusual that a server upgrade breaks the client in an established protocol like ssh, but apparently the R client was making a call that is no longer permitted, which would cause an error, so this call has been removed.

Authentication and Password Entry

This release also introduces several improvements to the authentication mechanics:

The R package now uses the same ~/.ssh/known_hosts file as the ssh command line utility to store and check server fingerprints. This is an important part of the ssh protocol to protect against MITM attacks. The R client will now automatically add new hosts to the file, and check if a known server fingerprint matches the one from the file.

Finally we now use the askpass package to query the user for a password when needed. This may be needed in two cases: either when you want to log in with username/password authentication, or when reading a private key with a passphrase. With askpass we get secure native password entry programs for various R front-ends, including RStudio, RGui for Windows and R.app for MacOS.

For example, this is what it looks like on MacOS:

And below a screenshot on Windows:

Hopefully this will help make the package more secure and user-friendly.


Developments in AzureR

Thu, 02/21/2019 - 17:30

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Hong Ooi, senior data scientist, Microsoft Azure

The AzureR packages have now been on CRAN for a couple of months, so I thought I'd provide an update on developments in the works.

First, AAD authentication has been moved into a new package, AzureAuth, so that people who just want OAuth tokens can get it without any other baggage. This has many new features:

  • Supports both AAD v1.0 and v2.0
  • Tokens are cached in a user-specific directory using the rappdirs package, typically c:\users\\local\AzureR on Windows and ~/.local/share/AzureR on Linux
  • Supports 4 authentication methods: authorization_code, device_code, client_credentials and resource_grant
  • Supports logging in with a username or with a certificate

In the longer term, the hope is for AzureAuth to be something like the R equivalent of the ADAL client libraries. Things to add include dSTS, federated logins, and more.
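
For example, requesting a token with AzureAuth might look roughly like this (the resource URL, tenant and app ID are placeholders):

library(AzureAuth)
# Obtain an AAD token for a given resource; the token is cached on disk for reuse
token <- get_azure_token(
  resource = "https://management.azure.com/",
  tenant   = "mytenant",
  app      = "11111111-2222-3333-4444-555555555555"
)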

AzureRMR 2.0 has a new login framework and no longer requires you to create a service principal (although you can still provide your own SP if desired). Running create_azure_login() logs you into Azure interactively, and caches your credentials; in subsequent sessions, run get_azure_login() to login without having to reauthenticate.
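
A minimal sketch of that flow:

library(AzureRMR)
# First session: interactive login; credentials are cached
az <- create_azure_login()
# Subsequent sessions: reuse the cached login without reauthenticating
az <- get_azure_login()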

AzureStor 2.0 has several new features mostly for more efficient uploading and downloading:

  • Parallel file transfers, using a pool of background processes. This greatly improves the transfer speed when working with lots of small files.
  • Transfer files to or from a local connection. This lets you transfer in-memory R objects without having to create a temporary file first.
  • Experimental interface to AzCopy version 10. This lets you do essentially anything that AzCopy can do, from within R. (Note: AzCopy 10 is quite a different beast to AzCopy 8.x.)
  • A new framework of generic methods, to organise all the various storage-type-specific functions.

A new AzureKusto package is in the works, for working with Kusto/Azure Data Explorer. This includes:

  • All basic functionality including querying, engine management commands, and ingesting
  • A dplyr interface written by Alex Kyllo.

AzureStor 2.0 is now on CRAN, and the others should also be there over the next few weeks. As always, if you run into any problems using the packages, feel free to contact me.


Shiny App to access NOAA data

Thu, 02/21/2019 - 17:09

(This article was first published on R tutorial for Spatial Statistics, and kindly contributed to R-bloggers)

Now that the US Government shutdown is over, it is time to download NOAA daily weather summaries in bulk and store them somewhere safe, so that at the next shutdown we do not need to worry.

Below is the code to download data for a series of years:

NOAA_BulkDownload <- function(Year, Dir){
  URL <- paste0("ftp://ftp.ncdc.noaa.gov/pub/data/gsod/", Year, "/gsod_", Year, ".tar")
  download.file(URL, destfile = paste0(Dir, "/gsod_", Year, ".tar"),
                method = "auto", mode = "wb")

  if(dir.exists(paste0(Dir, "/NOAA Data")) == FALSE){dir.create(paste0(Dir, "/NOAA Data"))}

  untar(paste0(Dir, "/gsod_", Year, ".tar"),
        exdir = paste0(Dir, "/NOAA Data"))
}

An example on how to use this function is below:

Date <- 1980:2019
lapply(Date, NOAA_BulkDownload, Dir="C:/Users/fabio.veronesi/Desktop/New folder")

Theoretically, the process can be parallelized using parLapply, but I have not tested it.
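
An untested sketch of that parallel variant (cluster size and output directory are placeholders):

library(parallel)
cl <- makeCluster(4)
# export the download function to the worker processes
clusterExport(cl, "NOAA_BulkDownload")
parLapply(cl, 1980:2019, NOAA_BulkDownload, Dir = "C:/NOAA")
stopCluster(cl)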

Once we have all the files in one folder, we can create the Shiny app to query these data.
The app will have a dashboard look with two tabs: one with a Leaflet map showing the location of the weather stations (markers are shown only at a certain zoom level to decrease loading time and RAM usage), below:

The other tab will allow the creation of the time series (each file covers only one year, so we need to bind several files together to get the full period we are interested in) and it will also do some basic data cleaning, e.g. converting temperature from Fahrenheit to Celsius, or snow depth from inches to millimetres. Finally, from this tab users can view the final product and download a cleaned csv.
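
For reference, those unit conversions are simple arithmetic:

# Fahrenheit to Celsius, and inches to millimetres
fahrenheit_to_celsius <- function(f) (f - 32) * 5 / 9
inches_to_mm <- function(x) x * 25.4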

The code for ui and server scripts is on my GitHub:
https://github.com/fveronesi/NOAA_ShinyApp


Encryptr package: easily encrypt and decrypt columns of sensitive data

Thu, 02/21/2019 - 15:59

(This article was first published on R – DataSurg, and kindly contributed to R-bloggers)

A number of existing R packages support data encryption. However, we haven't found one that easily suits our needs: to encrypt one or many columns of a data frame or tibble using a private/public key pair in tidyverse functions. The emphasis is on the easily.

Encrypting and decrypting data securely is important when it comes to healthcare and sociodemographic data. We have developed a simple and secure package, encryptr, which allows non-experts to encrypt and decrypt columns of data.

There is a simple and easy-to-follow vignette available on our GitHub page which guides you through the process of using encryptr:

https://github.com/SurgicalInformatics/encryptr.

Confidential data – security challenges

Data containing columns of disclosive or confidential information such as a postcode or a patient ID (CHI in Scotland) require extreme care. Storing sensitive information as raw values leaves the data vulnerable to confidentiality breaches.

It is best to just remove confidential information from the records whenever possible. However, this can mean the data can never be re-associated with an individual. This may be a problem if, for example, auditors of a clinical trial need to re-identify an individual from the trial data.

One potential solution currently in common use is to generate a study number which is linked to the confidential data in a separate lookup table, but this still leaves the confidential data available in another file.

Encryptr package solution – storing encrypted data

The encryptr package allows users to store confidential data in a pseudonymised form, which is far less likely to result in re-identification.

The package allows users to create a public key and a private key to enable RSA encryption and decryption of the data. The public key allows encryption of the data. The private key is required to decrypt the data. The data cannot be decrypted with the public key. This is the basis of many modern encryption systems.

When creating keys, the user sets a password for the private key using a dialogue box. This means that the password is not stored in an R script. We recommend creating a secure password with a variety of alphanumeric characters and symbols.

As the password is not stored, it is important that you are able to remember it if you need to decrypt the data later.

Once the keys are created it is possible to encrypt one or more columns of data in a data frame or tibble using the public key. Every time RSA encryption is used it will generate a unique output. Even if the same information is encrypted more than once, the output will always be different. It is not possible therefore to match two encrypted values.
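
A minimal sketch of that workflow, loosely following the package's GitHub vignette (the data frame here is made up for illustration):

library(encryptr)
library(dplyr)
# Generate an RSA key pair (public and private key files)
genkeys()

# Encrypt a disclosive column with the public key
gp <- tibble::tibble(id = 1:3, postcode = c("EH1 1AA", "G1 2BB", "AB1 3CC"))
gp_encrypted <- gp %>% encrypt(postcode)

# Decrypting later requires the private key and its password
gp_decrypted <- gp_encrypted %>% decrypt(postcode)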

These outputs are also secure from decryption without the private key. This may allow sharing of data within or between research teams without sharing confidential data.

Caution: data often remains potentially disclosive (or only pseudonymised) even after encryption of identifiable variables, and all of the required permissions for usage and sharing of data must still be in place.

Encryptr package – decrypting the data

Sometimes decrypting data is necessary. For example, participants in a clinical trial may need to be contacted to explain a change or early termination of the trial.

The encryptr package allows users to securely and reliably decrypt the data. The decrypt function will use the private key to decrypt one or more columns. The user will be required to enter the password created when the keys were generated.

As the private key is able to decrypt all of the data, we do not recommend sharing this key.

Blinding and unblinding clinical trials – another encryptr package use

Often when working with clinical trial data, participants are randomised to one of several treatment groups, and teams working on the trial are frequently unaware of the group to which patients were randomised (blinded).

Using the same method of encryption, it is possible to encrypt the participant allocation group, allowing the sharing of data without compromising blinding. If other members of the trial team are permitted to see treatment allocation (unblinded), then the decryption process can be followed to reveal the group allocation.

What this is not

This is a simple set of wrappers around openssl aimed at non-experts. It does not seek to replace the many excellent encryption packages available in R, such as PKI, sodium and safer. We believe, however, that it makes things much easier. Comments and forks welcome.


Finding Economic Articles With Data

Thu, 02/21/2019 - 06:00

(This article was first published on Economics and R - R posts, and kindly contributed to R-bloggers)

In my view, one of the greatest developments during the last decade in economics is that the Journals of the American Economic Association and some other leading journals require authors to upload the replication code and data sets of accepted articles.

I wrote a Shiny app that allows to search currently among more than 3000 articles that have an accessible data and code supplement. Just click here to use it:

http://econ.mathematik.uni-ulm.de:3200/ejd/

One can perform a keyword search among the abstract and title. The screenshot shows an example:

One gets some information about the size of the data files and the used code files. I also tried to find and extract a README file from each supplement. Most README files explain whether all results can be replicated with the provided data sets or whether some results require confidential or proprietary data sets. The link allows you to look at the README without the need to download the whole data set.

The main idea is that such a search function could be helpful for teaching economics and data science. For example, my students can use the app to find an interesting topic for a Bachelor or Master Thesis in form of an interactive analysis with RTutor. You could also generate a topic list for a seminar, in which students shall replicate some key findings of a resarch article.

While the app performs well for a single user, I have not tested the performance for many simultaneous users. If it is too sluggish or you don’t get connected there are perhaps currently too many users. Then just try it out a bit later.

If you want to analyse yourself the collected data underlying the search app, you can download the zipped SQLite databases using the following links:

I try to update the databases regularly.

Below is an example of a simple analysis based on those databases. First, make sure that you download and extract articles.zip into your working directory.

We first open a database connection

library(RSQLite)
db = dbConnect(RSQLite::SQLite(), "articles.sqlite")

File type conversion between databases and R can sometimes be a bit tedious. For example, SQLite has no native Date or logical type. For this reason, I typically use my package dbmisc when working with SQLite databases. It allows you to specify a database schema as a simple YAML file and has a lot of convenience functions to retrieve or modify data that automatically use the provided schema. The following code sets the database schema that is provided in the package EconJournalData:

library(dbmisc)
db = set.db.schemas(db, schema.file = system.file("schema/articles.yaml", package = "EconJournalData"))

Of course, for a simple analysis like ours below, the standard functions in the DBI package without schemas would suffice. But I am just used to working with the dbmisc package.
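For reference, a plain-DBI equivalent of the lookup below would be something along these lines (a sketch only; the column names are taken from the output further down):

library(DBI)
# Same table, queried with a literal SQL statement instead of dbmisc helpers
dbGetQuery(db, "SELECT id, year, journ, title FROM article LIMIT 4")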

The main information about articles is stored in the table article

# Get the first 4 entries of articles as data frame
dbGet(db, "article", n = 4)

The result is a wide data frame with one row per article and columns id, year, date, journ, title, vol, issue, artnum, article_url, has_data, data_url, size, unit, files_txt, downloaded_file, num_authors, file_info_stored, file_info_summarized, abstract and readme_file. The first four entries are:

id            year  date        journ  title                                                     size        readme_file
aer_108_11_1  2018  2018-11-01  aer    Firm Sorting and Agglomeration                            0.05339 MB  aer/2018/aer_108_11_1/READ_ME.pdf
aer_108_11_2  2018  2018-11-01  aer    Near-Feasible Stable Matchings with Couples               0.07286 MB  aer/2018/aer_108_11_2/Readme.pdf
aer_108_11_3  2018  2018-11-01  aer    The Costs of Patronage: Evidence from the British Empire  0.44938 MB  aer/2018/aer_108_11_3/Readme.pdf
aer_108_11_4  2018  2018-11-01  aer    The Logic of Insurgent Electoral Violence                 56 MB       aer/2018/aer_108_11_4/READ_ME.pdf

(The article and data-supplement URLs and the full abstracts are also part of the output but are omitted here for readability.)

The table files_summary contains information about code, data and archive files for each article

dbGet(db, "files_summary", n = 6)

id             file_type  num_files  mb        is_code  is_data
aejapp_1_1_10  do                 9  0.009427  TRUE     FALSE
aejapp_1_1_10  dta                2  0.100694  FALSE    TRUE
aejapp_1_1_3   do                19  0.103628  TRUE     FALSE
aejapp_1_1_4   csv                1  0.024872  FALSE    TRUE
aejapp_1_1_4   dat                1  7.15491   FALSE    TRUE
aejapp_1_1_4   do                 9  0.121618  TRUE     FALSE

Let us now analyse what share of articles use Stata, R, Python, Matlab or Julia, and how this usage has developed over time.

Since our datasets are small, we can just download the two tables and work with dplyr in memory. Alternatively, you could use some SQL commands or work with dplyr on the database connection.
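For completeness, here is a sketch of the on-connection alternative just mentioned (it assumes the dbplyr backend is installed); the counting is then pushed down to SQLite and only the small result is collected into R:

library(dplyr)

tbl(db, "files_summary") %>%
  count(file_type, sort = TRUE) %>%   # translated to SQL and run in SQLite
  collect()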

articles = dbGet(db, "article")
fs = dbGet(db, "files_summary")

Let us now compute the shares of articles that have one of the file types we are interested in

library(dplyr)

# Number of articles with analysed data & code supplement
n_art = n_distinct(fs$id)

# Count articles by file types and compute shares
fs %>%
  group_by(file_type) %>%
  summarize(count = n(), share = round((count / n_art) * 100, 2)) %>%
  # note that all file extensions are stored in lower case
  filter(file_type %in% c("do", "r", "py", "jl", "m")) %>%
  arrange(desc(share))

file_type  count  share
do          2606  70.55
m            852  23.06
r            105   2.84
py            32   0.87
jl             2   0.05

Roughly 70% of the articles have Stata do files and almost a quarter have Matlab m files. Open-source statistical software does not yet seem very popular among economists: less than 3% of articles have R code files, Python is below 1%, and only 2 articles have Julia code.

This dominance of Stata in economics never ceases to surprise me, in particular whenever I happen to open the Stata do-file editor and compare it with RStudio… But then, I am not an expert in writing empirical economic research papers – I just like R programming and rather passively consume empirical research. For writing empirical papers it is probably convenient that in Stata you can add a robust or robust cluster option to almost every type of regression in order to quickly get the economists' standard standard errors…

For teaching empirical economics with R, the dominance of Stata is not necessarily bad news. It means that there are a lot of studies which students can replicate in R. Such replication would be considerably less interesting if the original code of the articles were already written in R.

Let us finish by having a look at the development over time…

sum_dat = fs %>%
  left_join(select(articles, year, id), by = "id") %>%
  group_by(year) %>%
  mutate(n_art_year = n()) %>%
  group_by(year, file_type) %>%
  summarize(
    count = n(),
    share = round((count / first(n_art_year)) * 100, 2)
  ) %>%
  filter(file_type %in% c("do", "r", "py", "jl", "m")) %>%
  arrange(year, desc(share))

head(sum_dat)

year  file_type  count  share
2005  do            25  22.12
2005  m             10   8.85
2006  do            24  20.87
2006  m             13  11.3
2007  do            24  19.35
2007  m             16  12.9

library(ggplot2)
ggplot(sum_dat, aes(x = year, y = share, color = file_type)) +
  geom_line(size = 1.5) +
  scale_y_log10() +
  theme_bw()

Well, maybe there is a little upward trend for the open source languages, but not too much seems to have happened over time so far…


Setting Up Raspberry Pi Temperature/Humidity Sensors for Data Analysis in R

Thu, 02/21/2019 - 01:00

(This article was first published on R on Thomas Roh, and kindly contributed to R-bloggers)

This tutorial is going to cover how to set up a temperature/humidity sensor with
a Raspberry Pi. You will learn how to set up the sensor and a MySQL server, and
how to connect to the database remotely in R. I will also do some exploratory
data analysis in R with the stored readings. A little familiarity with Linux,
MySQL servers, soldering, and R is helpful but not mandatory. The materials
required are:

  • Raspberry Pi with standard setup (SD card, case, etc.)
  • Adafruit AM2302 (wired DHT22) temperature-humidity sensor
  • Soldering Iron
  • Female Pin Headers
  • Small piece of wood

Materials to be used

Final Product

I mostly followed the tutorial found here.
The majority of the work in this post does not use R. Instead of rebuilding
everything, I wanted to build on content that has already been made, and sensor
readings are handled a bit better by a low-level language (C is used here).

Install wiringPi

We are going to start with the assumption that you have already set up the
Raspberry Pi and soldered the sensor to the GPIO. An excellent repository already
exists at http://wiringpi.com/. It provides a C interface
to the GPIO that will save us from having to write any of the low-level code.
I'm going to SSH into my headless Pi and install the wiringPi program with the
following commands. In the examples directory, you can build and run a program to check
that you are getting good readings.

git clone git://git.drogon.net/wiringPi
cd wiringPi
./build
cd examples
make rht03
./rht03

You should see readings from the sensor now. Type CTRL+c to quit the
program.

Set up a MYSQL Server

In this step we are going to install the MySQL server and create some security
around it. When you see 'username' or 'password', those are meant to be replaced
with your own credentials. Instead of only using root to access
the server with elevated privileges, I am going to grant all privileges to
a different user, but only on 'localhost'. Essentially, for now you need to be
on the Pi itself (e.g. via SSH) to access the server with your 'user' credentials. Last,
log back into the server with your new user identity for the next step.

sudo apt-get install mysql-server
sudo apt-get install default-libmysqlclient-dev
sudo mysql_secure_installation
sudo mysql

GRANT ALL PRIVILEGES ON *.* TO 'username'@'localhost' IDENTIFIED BY 'password';
\q

mysql -u 'username' -p

Set up a Database

Now, let’s create a database and a table with the time in UNIX integer time and
two other columns for the sensor readings that we want to record.

create database Monitoring;
use Monitoring;
create table Climate (ComputerTime INTEGER UNSIGNED, Temperature DECIMAL(5,1), Humidity DECIMAL(5,1));

Run the program to read and write data

You will need to download this file:

th.c

Change the 'root' and 'password'
credentials to match the user that you set up earlier. You will need to copy
over the Makefile and change some flags so that the program knows where to
find some of the drivers that it needs.

cp wiringPi/examples/Makefile ~/raspberrypi/monitor/Makefile
sudo nano ~/raspberrypi/monitor/Makefile

Add the following lines to the file:

INCLUDE = -I/usr/local/include,/usr/include/mysql
LDFLAGS = -L/usr/local/lib,/usr/lib/arm-linux-gnueabihf -lmysqlclient -lpthread -lz -lm -lrt -ldl

Compile the program:

make ~/raspberrypi/monitor/th

You will now run the program that you altered and compiled. Use the & to
run the program continuously in the background. The program will write the
temperature and humidity every 60 seconds to the database.

./raspberrypi/monitor/th &

Set up R

This step is optional but good to have for troubleshooting. Later, I will be
connecting remotely from my laptop instead of working in R on the Raspberry Pi.

sudo apt-get install r-base
# install the R packages system-wide
sudo Rscript -e 'install.packages(c("DBI", "RMariaDB"))'

Check Database Connection and Query

After a couple of hours, you should have a good amount of data. I'm going to be
connecting from my laptop, so I'll need to set up my user credentials with
privileges to access the server over my LAN. To cover all IP addresses that
might get assigned, use the % wildcard at the end of 192.168.1 (assuming your
LAN uses this IP address range).

GRANT ALL PRIVILEGES on *.* TO 'user'@'192.168.1.%' IDENTIFIED BY 'password';
FLUSH PRIVILEGES;
\q

Let’s make sure that the server has a port that we can access. If it’s not
already in the my.conf file, open up the file with a text editor.

sudo nano /etc/mysql/my.cnf

Add the following lines to the file, which open up the default port 3306 for
the MySQL server and bind it to all network interfaces.

[mysqld]
port=3306
bind-address=0.0.0.0

Restart the service for the changes to take effect.

sudo service mysql restart

You can log off the machine now. We don’t need to do anything else on the
raspberry pi for now.

Access the Data Remotely

From RStudio on my laptop (while connected to my LAN), we’re going to open
a connection to the database. You can specify the host as a DNS name that
can be set up on your router’s administration portal or you can specify the
IP address. I would recommend making the IP address static if you plan on
using that method going forward. Since we stored the timestamp in UNIX
integer form, we can convert it to POSIXct knowing that the origin of UNIX
time is the start of the year 1970.

library(DBI)
library(ggplot2)
library(trstyles) # optional package for my styling of ggplot2 plots

con <- dbConnect(RMariaDB::MariaDB(),
                 host = 'sensorpi',
                 user = 'pi',
                 password = 'password',
                 dbname = 'Monitoring')

query <- 'SELECT ComputerTime, Temperature, Humidity FROM Climate'
readings <- dbGetQuery(con, query)
readings[['ComputerTime']] <- as.POSIXct(readings[['ComputerTime']],
                                         origin = '1970-01-01 00:00:00')

Plot the Data

Now that we have the data, let’s plot temperature against time to see what has
been going on.

ggplot(readings) +
  geom_line(aes(ComputerTime, Temperature)) +
  scale_y_continuous(name = expression('Temperature ('*degree*'C)'),
                     sec.axis = sec_axis(~.*9/5+32, name = expression('Temperature ('*degree*'F)'))) +
  scale_x_datetime(name = '', date_breaks = '2 days') +
  theme_tr(base_size = 18) +
  theme(axis.text.x = element_text(angle = 90, vjust = .5))

It looks good for the most part, but we definitely have some outlier readings.
I can see probable outliers above 30 degrees Celsius. I'm going to cut those off
and take a second look.

readings[['Temperature']][readings[['Temperature']] > 30] <- NA

ggplot(readings) +
  geom_line(aes(ComputerTime, Temperature)) +
  scale_y_continuous(name = expression('Temperature ('*degree*'C)'),
                     sec.axis = sec_axis(~.*9/5+32, name = expression('Temperature ('*degree*'F)'))) +
  scale_x_datetime(name = '', date_breaks = '2 days') +
  theme_tr(base_size = 18) +
  theme(axis.text.x = element_text(angle = 90, vjust = .5))

You can already see some patterns within the data. Given how the
weather patterns have been, adding in some outside temperature readings would
provide more insight into what is going on. I'll dive into some more
analysis in another post.

We can do the same for the relative humidity.

ggplot(readings) +
  geom_line(aes(ComputerTime, Humidity), color = '#9D5863') +
  scale_y_continuous(name = 'Relative Humidity (Rh)') +
  scale_x_datetime(name = '', date_breaks = '2 days') +
  theme_tr(base_size = 18) +
  theme(axis.text.x = element_text(angle = 90, vjust = .5))

readings[['Humidity']][readings[['Humidity']] > 50] <- NA

ggplot(readings) +
  geom_line(aes(ComputerTime, Humidity), color = '#9D5863') +
  scale_y_continuous(name = 'Relative Humidity (Rh)') +
  scale_x_datetime(name = '', date_breaks = '2 days') +
  theme_tr(base_size = 18) +
  theme(axis.text.x = element_text(angle = 90, vjust = .5))

And that’s it. You have your very own indoor climate monitoring system and
time series data to play around with at home.
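As a small follow-up idea (a sketch, not part of the tutorial above), the minute-level readings could be aggregated to hourly means before further analysis:

# truncate timestamps to the hour and average the readings per hour
# (rows where a reading was set to NA above are dropped by the formula interface)
readings$Hour <- as.POSIXct(trunc(readings$ComputerTime, units = "hours"))
hourly <- aggregate(cbind(Temperature, Humidity) ~ Hour, data = readings, FUN = mean)
head(hourly)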


Post-docs in wind and solar power forecasting

Thu, 02/21/2019 - 01:00

(This article was first published on R on Rob J Hyndman, and kindly contributed to R-bloggers)

We currently have two postdoc opportunities together with an industry partner in the field of wind and solar power forecasting (full time, Level B). They are suitable for recently graduated PhDs who can start between now and June-July.

The opportunities are as follows:

Wind power forecasting:
  • 1 year contract
  • Good programming skills in R and/or Python
  • Solid background in Machine Learning and/or Statistics
  • Background in time series forecasting desirable
Solar power forecasting:
  • 6 months contract
  • Good programming skills in R and/or Python
  • Solid background in Machine Learning and/or Statistics
  • Data will be cloud coverage data from sky cams, so some image processing background is necessary
  • Background in time series forecasting desirable

Please contact Christoph Bergmeir if you are interested.


Getting Started With rquery

Wed, 02/20/2019 - 19:32

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

To make getting started with rquery (an advanced query generator for R) easier, we have re-worked the package README for various data sources (including SparkR!).

Here are our current examples:

For MonetDBLite, the query diagrammer shows a repeated calculation that we decided was best to leave in.

And the RSQLite diagram shows the consequences of replacing window functions with joins.


Announcement: eRum 2020 held in Milano!

Wed, 02/20/2019 - 12:06

(This article was first published on MilanoR, and kindly contributed to R-bloggers)

Hello, R_users!

We are very excited to inform you that the eRum2020 (European R Users Meeting) will be held in Milan in 2020!

About the Conference

The eRum is a conference that takes place every 2 years in Europe, every time in a different country, and is designed to create a community of European R users and to share knowledge and passion within it.

In 2018, eRum2018 took place in Budapest, Hungary. More than 500 attendees and over 90 speakers participated, and we expect an even wider contribution for this edition.

Info and Contacts

We will let you know the dates, the location and the program of eRum2020 as soon as it is all set. Follow us on Twitter at @erum2020_conf to keep updated!

The Support

We will work hard to keep the registration fees as low as possible. If you want to support the success of this event, please get in touch with us and we will provide you with all the information about sponsorship opportunities.

At the moment we wish to thank Quantide, a Milan-based R training and consulting company, which is already proudly supporting this wonderful adventure.

We would like to thank the whole community of R, our MilanoR community and all the organizers of the previous editions.

Thank you so much for your attention; we are eager to meet you in Milan!


Number 6174 or Kaprekar constant in R

Wed, 02/20/2019 - 08:55

(This article was first published on R – TomazTsql, and kindly contributed to R-bloggers)

The answer is not always 42, as explained in the Hitchhiker's Guide. Sometimes it is 6174.

The Kaprekar number is one of those gems that make mathematics fun. The Indian recreational mathematician D. R. Kaprekar found that the number 6174 – also known as the Kaprekar constant – is always reached when you follow these rules:

  1. Take any four-digit number with at least two different digits (1122, 5151, 1001, 4375 and so on).
  2. Sort the digits of the number in descending order and in ascending order.
  3. Subtract the ascending number from the descending number.
  4. Repeat steps 2 and 3 until you get the result 6174.

In practice, e.g. for the number 5462, the steps would be:

6542 - 2456 = 4086
8640 - 468 = 8172
8721 - 1278 = 7443
7443 - 3447 = 3996
9963 - 3699 = 6264
6642 - 2466 = 4176
7641 - 1467 = 6174

or for number 6235:

6532 - 2356 = 4176
7641 - 1467 = 6174

Depending on the starting number, the number of steps varies.

An R function implementing this is:

# str_extract_all() and str_sort() come from stringr
library(stringr)

kap <- function(num){
  # check the length of the number
  if (nchar(num) == 4) {
    kaprekarConstant = 6174
    while (num != kaprekarConstant) {
      nums <- as.integer(str_extract_all(num, "[0-9]")[[1]])
      sortD <- as.integer(str_sort(nums, decreasing = TRUE))
      sortD <- as.integer(paste(sortD, collapse = ""))
      sortA <- as.integer(str_sort(nums, decreasing = FALSE))
      sortA <- as.integer(paste(sortA, collapse = ""))
      num = as.integer(sortD) - as.integer(sortA)
      r <- paste0('Pair is: ', as.integer(sortD), ' and ', as.integer(sortA),
                  ' and result of subtraction is: ', as.integer(num))
      print(r)
    }
  } else {
    print("Number must be 4-digits")
  }
}

The function can be used as:

kap(5462)

and it will return all the intermediate steps until the function converges.

[1] "Pair is: 6542 and 2456 and result of subtraction is: 4086"
[1] "Pair is: 8640 and 468 and result of subtraction is: 8172"
[1] "Pair is: 8721 and 1278 and result of subtraction is: 7443"
[1] "Pair is: 7443 and 3447 and result of subtraction is: 3996"
[1] "Pair is: 9963 and 3699 and result of subtraction is: 6264"
[1] "Pair is: 6642 and 2466 and result of subtraction is: 4176"
[1] "Pair is: 7641 and 1467 and result of subtraction is: 6174"
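As a side note, the digit-rearrangement step can also be done without stringr; here is a minimal base-R sketch (not from the original post) for the first step of 5462, which reproduces the 4086 above:

digits <- function(n) as.integer(strsplit(as.character(n), "")[[1]])
sortD <- as.integer(paste(sort(digits(5462), decreasing = TRUE), collapse = ""))
sortA <- as.integer(paste(sort(digits(5462)), collapse = ""))
sortD - sortA
[1] 4086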

And to make the matter more interesting, let us find the distribution over all valid four-digit numbers, recording the number of steps needed to reach the constant.

First, we will find the solutions for all four-digit numbers and store them in a data frame.

Create the empty dataframe:

# create empty data frame for results
df_result <- data.frame(number = as.numeric(0), steps = as.numeric(0))
i = 1000
korak = 0

And then run the following loop:

# Generate the list of all 4-digit numbers
while (i <= 9999) {
  korak = 0
  num = i
  while ((korak <= 10) & (num != 6174)) {
    nums <- as.integer(str_extract_all(num, "[0-9]")[[1]])
    sortD <- as.integer(str_sort(nums, decreasing = TRUE))
    sortD <- as.integer(paste(sortD, collapse = ""))
    sortA <- as.integer(str_sort(nums, decreasing = FALSE))
    sortA <- as.integer(paste(sortA, collapse = ""))
    num = as.integer(sortD) - as.integer(sortA)
    korak = korak + 1
    if ((num == 6174)) {
      r <- paste0('Number is: ', as.integer(i), ' with steps: ', as.integer(korak))
      print(r)
      df_result <- rbind(df_result, data.frame(number = i, steps = korak))
    }
  }
  i = i + 1
}

Fifteen seconds later, I got the data frame with solutions for all valid four-digit numbers (valid solutions are those that comply with step 1 and converge within 10 steps).

Now we can look at the distribution, to see how the solutions are spread across the numbers. A summary of the solutions shows that on average 4.6 iterations (subtractions) were needed in order to reach the number 6174.
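That average can be checked directly from the data frame (a one-liner not shown in the original post; the dummy first row used to initialise df_result is excluded):

mean(df_result$steps[df_result$number > 0])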

Counting how often each number of steps occurs, we can see the most frequent path lengths:

table(df_result$steps)
hist(df_result$steps)

With some additional visuals, you can see the results as well:

library(ggplot2)
library(gridExtra)

# par(mfrow=c(1,2))
p1 <- ggplot(df_result, aes(x = number, y = steps)) +
  geom_bar(stat = 'identity') +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 8))

p2 <- ggplot(df_result, aes(x = log10(number), y = steps)) +
  geom_point(alpha = 1/50)

grid.arrange(p1, p2, ncol = 2, nrow = 1)

And the graph:

A lot of numbers converge at the third step – roughly every 4th or 5th number. We would need to look into the steps of these solutions to see what the numbers have in common. This will follow, so stay tuned.

Fun fact: at the time of writing this blog post, the number 6174 was not a built-in constant in base R.

As always, code is available at Github.

 

Happy Rrrring


I Just Wanted The Data : Turning Tableau & Tidyverse Tears Into Smiles with Base R (An Encoding Detective Story)

Wed, 02/20/2019 - 06:08

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

Those outside the Colonies may not know that Payless—a national chain that made footwear affordable for millions of ‘Muricans who can’t spare $100.00 USD for a pair of shoes their 7 year old will outgrow in a year— is closing. CNBC also had a story that featured a choropleth with a tiny button at the bottom that indicated one could get the data:

I should have known this would turn out to be a chore since they used Tableau—the platform of choice when you want to take advantage of all the free software libraries they use to power their premier platform which, in turn, locks up all the data for you so others can’t adopt, adapt and improve. Go. Egregious. Predatory. Capitalism.

Anyway.

I wanted the data to do some real analysis vs. producing a fairly unhelpful visualization (TLDR: layer in Census data for areas impacted, estimate job losses, compute nearest similar Payless stores to see impact on transportation-challenged homes, etc. Y'know, citizen data journalism-y things) so I pressed the button and watched for the URL in Chrome (aye, for those that remember I moved to Firefox et al in 2018, I switched back; more on that in March) and copied it to try to make this post actually reproducible (a novel concept for Tableau fanbois):

library(tibble) library(readr) # https://www.cnbc.com/2019/02/19/heres-a-map-of-where-payless-shoesource-is-closing-2500-stores.html tfil <- "~/Data/Sheet_3_data.csv" download.file( "https://public.tableau.com/vizql/w/PAYLESSSTORECLOSINGS/v/Dashboard2/vud/sessions/6A678928620645FF99C7EF6353426CE8-0:0/views/10625182665948828489_7202546092381496425?csv=true&showall=true", tfil ) ## trying URL 'https://public.tableau.com/vizql/w/PAYLESSSTORECLOSINGS/v/Dashboard2/vud/sessions/6A678928620645FF99C7EF6353426CE8-0:0/views/10625182665948828489_7202546092381496425?csv=true&showall=true' ## Error in download.file("https://public.tableau.com/vizql/w/PAYLESSSTORECLOSINGS/v/Dashboard2/vud/sessions/6A678928620645FF99C7EF6353426CE8-0:0/views/10625182665948828489_7202546092381496425?csv=true&showall=true", : ## cannot open URL 'https://public.tableau.com/vizql/w/PAYLESSSTORECLOSINGS/v/Dashboard2/vud/sessions/6A678928620645FF99C7EF6353426CE8-0:0/views/10625182665948828489_7202546092381496425?csv=true&showall=true' ## In addition: Warning message: ## In download.file("https://public.tableau.com/vizql/w/PAYLESSSTORECLOSINGS/v/Dashboard2/vud/sessions/6A678928620645FF99C7EF6353426CE8-0:0/views/10625182665948828489_7202546092381496425?csv=true&showall=true", : ## cannot open URL 'https://public.tableau.com/vizql/w/PAYLESSSTORECLOSINGS/v/Dashboard2/vud/sessions/6A678928620645FF99C7EF6353426CE8-0:0/views/10625182665948828489_7202546092381496425?csv=true&showall=true': HTTP status was '410 Gone'

WAT

Truth be told I expected a time-boxed URL of some sort (prior experience FTW). Selenium or Splash were potential alternatives but I didn’t want to research the legality of more forceful scraping (I just wanted the data) so I manually downloaded the file (*the horror*) and proceeded to read it in. Well, try to read it in:

read_csv(tfil) ## Parsed with column specification: ## cols( ## A = col_logical() ## ) ## Warning: 2092 parsing failures. ## row col expected actual file ## 1 A 1/0/T/F/TRUE/FALSE '~/Data/Sheet_3_data.csv' ## 2 A 1/0/T/F/TRUE/FALSE '~/Data/Sheet_3_data.csv' ## 3 A 1/0/T/F/TRUE/FALSE '~/Data/Sheet_3_data.csv' ## 4 A 1/0/T/F/TRUE/FALSE '~/Data/Sheet_3_data.csv' ## 5 A 1/0/T/F/TRUE/FALSE '~/Data/Sheet_3_data.csv' ## ... ... .................. ...... ......................... ## See problems(...) for more details. ## ## # A tibble: 2,090 x 1 ## A ## ## 1 NA ## 2 NA ## 3 NA ## 4 NA ## 5 NA ## 6 NA ## 7 NA ## 8 NA ## 9 NA ## 10 NA ## # … with 2,080 more rows

WAT

Getting a single column back from readr::read_[ct]sv() is (generally) a tell-tale sign that the file format is amiss. Before donning a deerstalker (I just wanted the data!) I tried to just use good ol’ read.csv():

read.csv(tfil, stringsAsFactors=FALSE)
## Error in make.names(col.names, unique = TRUE) :
##   invalid multibyte string at 'A'
## In addition: Warning messages:
## 1: In read.table(file = file, header = header, sep = sep, quote = quote, :
##   line 1 appears to contain embedded nulls
## 2: In read.table(file = file, header = header, sep = sep, quote = quote, :
##   line 2 appears to contain embedded nulls
## 3: In read.table(file = file, header = header, sep = sep, quote = quote, :
##   line 3 appears to contain embedded nulls
## 4: In read.table(file = file, header = header, sep = sep, quote = quote, :
##   line 4 appears to contain embedded nulls
## 5: In read.table(file = file, header = header, sep = sep, quote = quote, :
##   line 5 appears to contain embedded nulls

WAT

Actually the "WAT" isn't really warranted since read.csv() gave us some super-valuable info via invalid multibyte string at 'A'. FF FE is a big signal that we're working with a file in another encoding, as that's a common "magic" sequence at the start of such files.

But, I didn’t want to delve into my Columbo persona… I. Just. Wanted. The. Data. So, I tried the mind-bendingly fast and flexible helper from data.table:

data.table::fread(tfil)
## Error in data.table::fread(tfil) :
##   File is encoded in UTF-16, this encoding is not supported by fread(). Please recode the file to UTF-8.

AHA. UTF-16 (maybe). Let’s poke at the raw file:

x <- readBin(tfil, "raw", file.size(tfil))
## also: read_file_raw(tfil)

x[1:100]
##  [1] ff fe 41 00 64 00 64 00 72 00 65 00 73 00 73 00 09 00 43 00
## [21] 69 00 74 00 79 00 09 00 43 00 6f 00 75 00 6e 00 74 00 72 00
## [41] 79 00 09 00 49 00 6e 00 64 00 65 00 78 00 09 00 4c 00 61 00
## [61] 62 00 65 00 6c 00 09 00 4c 00 61 00 74 00 69 00 74 00 75 00
## [81] 64 00 65 00 09 00 4c 00 6f 00 6e 00 67 00 69 00 74 00 75 00

There’s our ff fe (which is the beginning of the possibility it’s UTF-16) but that 41 00 harkens back to UTF-16’s older sibling UCS-2. The 0x00‘s are embedded nuls (likely to get bytes aligned). And, there are alot of 09s. Y’know what they are? They’re s. That’s right. Tableau named file full of TSV records in an unnecessary elaborate encoding CSV. Perhaps they broke the “T” on all their keyboards typing their product name so much.

Living A Boy’s [Data] Adventure Tale

At this point we have:

  • no way to support an automated, reproducible workflow
  • an ill-named file for what it contains
  • an overly-encoded file for what it contains
  • many wasted minutes (which is likely by design to have us give up and just use Tableau. No. Way.)

At this point I’m in full-on Rockford Files (pun intended) mode and delved down to the command line to use a old, trusted sidekick enca:

$ enca -L none Sheet_3_data.csv
## Universal character set 2 bytes; UCS-2; BMP
## LF line terminators
## Byte order reversed in pairs (1,2 -> 2,1)

Now, all we have to do is specify the encoding!

read_tsv(tfil, locale = locale(encoding = "UCS-2LE"))
## Error in guess_header_(datasource, tokenizer, locale) :
##   Incomplete multibyte sequence

WAT

Unlike the other 99% of the time (mebbe 99.9%) you use it, the tidyverse doesn’t have your back in this situation (but it does have your backlog in that it’s on the TODO).

Y’know who does have your back? Base R!:

read.csv(tfil, sep="\t", fileEncoding = "UCS-2LE", stringsAsFactors=FALSE) %>%
  as_tibble()
## # A tibble: 2,089 x 14
##    Address City  Country Index Label Latitude Longitude
##
##  1 1627 O… Aubu… United…     1 Payl…     32.6     -85.4
##  2 900 Co… Doth… United…     2 Payl…     31.3     -85.4
##  3 301 Co… Flor… United…     3 Payl…     34.8     -87.6
##  4 304 Ox… Home… United…     4 Payl…     33.5     -86.8
##  5 2000 R… Hoov… United…     5 Payl…     33.4     -86.8
##  6 6140 U… Hunt… United…     6 Payl…     34.7     -86.7
##  7 312 Sc… Mobi… United…     7 Payl…     30.7     -88.2
##  8 3402 B… Mobi… United…     8 Payl…     30.7     -88.1
##  9 5300 H… Mobi… United…     9 Payl…     30.6     -88.2
## 10 6641 A… Mont… United…    10 Payl…     32.4     -86.2
## # … with 2,079 more rows, and 7 more variables:
## #   Number.of.Records , State , Store.Number ,
## #   Store.count , Zip.code , State.Usps ,
## #   statename

WAT WOOT!

Note that read.csv(tfil, sep="\t", fileEncoding = "UTF-16LE", stringsAsFactors=FALSE) would have worked equally as well.
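Yet another base-R route (a sketch, not from the original post, with the caveat that byte-order-mark handling depends on your iconv) is to open an encoding-aware connection yourself and hand it to read.delim():

# "UTF-16" lets the byte-order mark at the start of the file pick the byte order
con <- file(tfil, encoding = "UTF-16")
payless <- read.delim(con, stringsAsFactors = FALSE)
str(payless)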

The Road Not [Originally] Taken

Since this activity decimated productivity, for giggles I turned to another trusted R sidekick, the stringi package, to see what it said:

library(stringi)

stri_enc_detect(x)
## [[1]]
##      Encoding Language Confidence
## 1    UTF-16LE                1.00
## 2  ISO-8859-1       pt       0.61
## 3  ISO-8859-2       cs       0.39
## 4    UTF-16BE                0.10
## 5   Shift_JIS       ja       0.10
## 6     GB18030       zh       0.10
## 7      EUC-JP       ja       0.10
## 8      EUC-KR       ko       0.10
## 9        Big5       zh       0.10
## 10 ISO-8859-9       tr       0.01

And, just so it’s primed in the Google caches for future searchers, another way to get this data (and other data that’s even gnarlier but similar in form) into R would have been:

stri_read_lines(tfil) %>% paste0(collapse="\n") %>% read.csv(text=., sep="\t", stringsAsFactors=FALSE) %>% as_tibble() ## # A tibble: 2,089 x 14 ## Address City Country Index Label Latitude Longitude ## ## 1 1627 O… Aubu… United… 1 Payl… 32.6 -85.4 ## 2 900 Co… Doth… United… 2 Payl… 31.3 -85.4 ## 3 301 Co… Flor… United… 3 Payl… 34.8 -87.6 ## 4 304 Ox… Home… United… 4 Payl… 33.5 -86.8 ## 5 2000 R… Hoov… United… 5 Payl… 33.4 -86.8 ## 6 6140 U… Hunt… United… 6 Payl… 34.7 -86.7 ## 7 312 Sc… Mobi… United… 7 Payl… 30.7 -88.2 ## 8 3402 B… Mobi… United… 8 Payl… 30.7 -88.1 ## 9 5300 H… Mobi… United… 9 Payl… 30.6 -88.2 ## 10 6641 A… Mont… United… 10 Payl… 32.4 -86.2 ## # … with 2,079 more rows, and 7 more variables: `Number of ## # Records` , State , `Store Number` , `Store ## # count` , `Zip code` , `State Usps` , ## # statename

(with similar dances to use read_csv() or fread()).

FIN

The night’s quest to do some real work with the data was DoS’d by what I’ll brazenly call a deliberate attempt to dissuade doing exactly that in anything but a commercial program. But, understanding the impact of yet-another massive retail store closing is super-important and it looks like it may be up to us (since the media is too distracted by incompetent leaders and inexperienced junior NY representatives) to do the work.

Folks who’d like to do the same can grab the UTF-8 encoded actual CSV from this site which has also been run through janitor::clean_names() so there’s proper column types and names to work with.

Speaking of which, here’s the cols spec for that CSV:

cols(
  address = col_character(),
  city = col_character(),
  country = col_character(),
  index = col_double(),
  label = col_character(),
  latitude = col_double(),
  longitude = col_double(),
  number_of_records = col_double(),
  state = col_character(),
  store_number = col_double(),
  store_count = col_double(),
  zip_code = col_character(),
  state_usps = col_character(),
  statename = col_character()
)

If you do anything with the data, blog about it and post a link in the comments so I and others can learn from what you've discovered! It's already kinda scary that one doesn't even need a basemap to see just how much a part of 'Murica Payless was:


A Few New R Books

Wed, 02/20/2019 - 01:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

Greg Wilson is a data scientist and professional educator at RStudio.

As a newcomer to R who prefers to read paper rather than pixels, I’ve been working my way through a more-or-less random selection of relevant books over the past few months. Some have discussed topics that I’m already familiar with in the context of R, while others have introduced me to entirely new subjects. This post describes four of them in brief; I hope to follow up with a second post in a few months as I work through the backlog on my desk.

First up is Sharon Machlis’ Practical R for Mass Communcation and Journalism, which is based on the author’s workshops for journalists. This book dives straight into doing the kinds of things a busy reporter or news analyst needs to do to meet a 5:00 pm deadline: data cleaning, presentation-quality graphics, and maps take precedence over control flow or the niceties of variable scope. I particularly enjoyed the way each chapter starts with a realistic project and works through what’s needed to build it. People who’ve never programmed before will be a little intimidated by how many packages they need to download if they try to work through the material on their own, but the instructions are clear, and the author’s enthusiasm for her material shines through in every example. (If anyone is working on a similar tutorial for sports data, please let me know – I have more than a few friends it would make very happy.)

In contrast, Chris Beeley and Shitalkumar Sukhdeve's Web Application Development with R Using Shiny focuses on a particular tool rather than an industry vertical. It covers exactly what its title promises, step by step from the basics through custom JavaScript functions and animations to persistent storage. Every example I ran was cleanly written and clearly explained, and it's clear that the authors have tested their material with real audiences. I particularly appreciated the chapter on code patterns – while I'm still not sure I fully understand when and how to use isolate() and req(), I'm much less confused than I was.

Functional programming has been the next big thing in computing since I was a graduate student in the 1980s. It does finally seem to be getting some traction outside the craft-beer-and-Emacs community, and Functional Programming in R by Thomas Mailund looks at how these ideas can be used in R. Mailund writes clearly, and readers who don’t have a background in computer science may find this a gentle way into a complex subject. However, despite the subtitle “Advanced Statistical Programming for Data Science, Analysis and Finance”, there’s nothing particularly statistical or financial about the book’s content. Some parts felt rushed, such as the lightning coverage of point-free programming (which should have had either a detailed exposition or no mention at all), but my biggest complaint about the book is its price: I think $34 for 100 pages is more than most people will want to pay.

Finally, we have Stefano Allesina and Madlen Wilmes’ Computing Skills for Biologists. As the subtitle says, this book presents a toolbox that includes Python, Git, LaTeX, and SQL as well as R, and is aimed at graduate students in biology who have just realized that a few hundred megabytes of messy data are standing between them and their thesis. The authors present the basics of each subject clearly and concisely using real-world data analysis examples at every turn. They freely admit in the introduction that coverage will be broad and shallow, but that’s exactly what books like this should aim for, and they hit a bulls eye. The book’s only weakness – unfortunately, a significant one – is an almost complete lack of diagrams. There are only six figures in its 400 pages, and none in the material on visualization. I realize that readers who are coding along with the examples will be able to view some plots and charts as they go, but I would urge the authors to include these in a second edition.

R is growing by leaps and bounds, and so is the literature about it. If you have written or read a book on R recently that you think others would be interested in, please let us know – we’d enjoy checking it out.

Stefano Allesina and Madlen Wilmes: Computing Skills for Biologists: A Toolbox. Princeton University Press, 2019, 978-0691182759.

Chris Beeley and Shitalkumar Sukhdeve: Web Application Development with R Using Shiny (3rd ed.). Packt, 2018, 978-1788993128.

Sharon Machlis: Practical R for Mass Communication and Journalism. Chapman & Hall/CRC, 2018, 978-1138726918.

Thomas Mailund: Functional Programming in R: Advanced Statistical Programming for Data Science, Analysis and Finance. Apress, 2017, 978-1484227459.


Descriptive/Summary Statistics with descriptr

Wed, 02/20/2019 - 01:00

(This article was first published on Rsquared Academy Blog, and kindly contributed to R-bloggers)

We are pleased to introduce the descriptr package, a set of tools for
generating descriptive/summary statistics.

Installation

# Install release version from CRAN
install.packages("descriptr")

# Install development version from GitHub
# install.packages("devtools")
devtools::install_github("rsquaredacademy/descriptr")

Shiny App

descriptr includes a shiny app which can be launched using

ds_launch_shiny_app()

or try the live version here.

Read on to learn more about the features of descriptr, or see the
descriptr website for
detailed documentation on using the package.

Data

We have modified the mtcars data to create a new data set mtcarz. The only
difference between the two data sets is related to the variable types.

str(mtcarz)
## 'data.frame': 32 obs. of 11 variables:
##  $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
##  $ disp: num 160 160 108 258 360 ...
##  $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num 16.5 17 18.6 19.4 17 ...
##  $ vs : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
##  $ am : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
##  $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
##  $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...

Data Screening

The ds_screener() function will screen a data set and return the following:
– Column/Variable Names
– Data Type
– Levels (in case of categorical data)
– Number of missing observations
– % of missing observations

ds_screener(mtcarz)
## -----------------------------------------------------------------------
## | Column Name | Data Type | Levels | Missing | Missing (%) |
## -----------------------------------------------------------------------
## | mpg | numeric | NA | 0 | 0 |
## | cyl | factor | 4 6 8 | 0 | 0 |
## | disp | numeric | NA | 0 | 0 |
## | hp | numeric | NA | 0 | 0 |
## | drat | numeric | NA | 0 | 0 |
## | wt | numeric | NA | 0 | 0 |
## | qsec | numeric | NA | 0 | 0 |
## | vs | factor | 0 1 | 0 | 0 |
## | am | factor | 0 1 | 0 | 0 |
## | gear | factor | 3 4 5 | 0 | 0 |
## | carb | factor |1 2 3 4 6 8| 0 | 0 |
## -----------------------------------------------------------------------
##
## Overall Missing Values 0
## Percentage of Missing Values 0 %
## Rows with Missing Values 0
## Columns With Missing Values 0

Continuous Data Summary Statistics

The ds_summary_stats() function returns a comprehensive set of statistics
including measures of location, variation, symmetry and extreme observations.

ds_summary_stats(mtcarz, mpg) ## ------------------------------ Variable: mpg ------------------------------ ## ## Univariate Analysis ## ## N 32.00 Variance 36.32 ## Missing 0.00 Std Deviation 6.03 ## Mean 20.09 Range 23.50 ## Median 19.20 Interquartile Range 7.38 ## Mode 10.40 Uncorrected SS 14042.31 ## Trimmed Mean 19.95 Corrected SS 1126.05 ## Skewness 0.67 Coeff Variation 30.00 ## Kurtosis -0.02 Std Error Mean 1.07 ## ## Quantiles ## ## Quantile Value ## ## Max 33.90 ## 99% 33.44 ## 95% 31.30 ## 90% 30.09 ## Q3 22.80 ## Median 19.20 ## Q1 15.43 ## 10% 14.34 ## 5% 12.00 ## 1% 10.40 ## Min 10.40 ## ## Extreme Values ## ## Low High ## ## Obs Value Obs Value ## 15 10.4 20 33.9 ## 16 10.4 18 32.4 ## 24 13.3 19 30.4 ## 7 14.3 28 30.4 ## 17 14.7 26 27.3

You can pass multiple variables as shown below:

ds_summary_stats(mtcarz, mpg, disp) ## ------------------------------ Variable: mpg ------------------------------ ## ## Univariate Analysis ## ## N 32.00 Variance 36.32 ## Missing 0.00 Std Deviation 6.03 ## Mean 20.09 Range 23.50 ## Median 19.20 Interquartile Range 7.38 ## Mode 10.40 Uncorrected SS 14042.31 ## Trimmed Mean 19.95 Corrected SS 1126.05 ## Skewness 0.67 Coeff Variation 30.00 ## Kurtosis -0.02 Std Error Mean 1.07 ## ## Quantiles ## ## Quantile Value ## ## Max 33.90 ## 99% 33.44 ## 95% 31.30 ## 90% 30.09 ## Q3 22.80 ## Median 19.20 ## Q1 15.43 ## 10% 14.34 ## 5% 12.00 ## 1% 10.40 ## Min 10.40 ## ## Extreme Values ## ## Low High ## ## Obs Value Obs Value ## 15 10.4 20 33.9 ## 16 10.4 18 32.4 ## 24 13.3 19 30.4 ## 7 14.3 28 30.4 ## 17 14.7 26 27.3 ## ## ## ## ------------------------------ Variable: disp ----------------------------- ## ## Univariate Analysis ## ## N 32.00 Variance 15360.80 ## Missing 0.00 Std Deviation 123.94 ## Mean 230.72 Range 400.90 ## Median 196.30 Interquartile Range 205.18 ## Mode 275.80 Uncorrected SS 2179627.47 ## Trimmed Mean 228.00 Corrected SS 476184.79 ## Skewness 0.42 Coeff Variation 53.72 ## Kurtosis -1.07 Std Error Mean 21.91 ## ## Quantiles ## ## Quantile Value ## ## Max 472.00 ## 99% 468.28 ## 95% 449.00 ## 90% 396.00 ## Q3 326.00 ## Median 196.30 ## Q1 120.83 ## 10% 80.61 ## 5% 77.35 ## 1% 72.53 ## Min 71.10 ## ## Extreme Values ## ## Low High ## ## Obs Value Obs Value ## 20 71.1 15 472 ## 19 75.7 16 460 ## 18 78.7 17 440 ## 26 79 25 400 ## 28 95.1 5 360

If you do not specify any variables, it will detect all the continuous
variables in the data set and return summary statistics for each of them.
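In other words, the call below should summarise every continuous column in mtcarz (output omitted here for brevity):

ds_summary_stats(mtcarz)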

Frequency Distribution

The ds_freq_table() function creates frequency tables for continuous variables.
The default number of intervals is 5.

ds_freq_table(mtcarz, mpg, 4)
## Variable: mpg
## |---------------------------------------------------------------------------|
## | Bins | Frequency | Cum Frequency | Percent | Cum Percent |
## |---------------------------------------------------------------------------|
## | 10.4 - 16.3 | 10 | 10 | 31.25 | 31.25 |
## |---------------------------------------------------------------------------|
## | 16.3 - 22.1 | 13 | 23 | 40.62 | 71.88 |
## |---------------------------------------------------------------------------|
## | 22.1 - 28 | 5 | 28 | 15.62 | 87.5 |
## |---------------------------------------------------------------------------|
## | 28 - 33.9 | 4 | 32 | 12.5 | 100 |
## |---------------------------------------------------------------------------|
## | Total | 32 | - | 100.00 | - |
## |---------------------------------------------------------------------------|

Histogram

A plot() method has been defined which will generate a histogram.

k <- ds_freq_table(mtcarz, mpg, 4)
plot(k)

Auto Summary

If you want to view summary statistics and frequency tables of all or a subset of
variables in a data set, use ds_auto_summary_stats().

ds_auto_summary_stats(mtcarz, disp, mpg)

## ------------------------------ Variable: disp -----------------------------
##
## ---------------------------- Summary Statistics ---------------------------
##
## ------------------------------ Variable: disp -----------------------------
##
##                         Univariate Analysis
##
##  N               32.00     Variance              15360.80
##  Missing          0.00     Std Deviation           123.94
##  Mean           230.72     Range                   400.90
##  Median         196.30     Interquartile Range     205.18
##  Mode           275.80     Uncorrected SS       2179627.47
##  Trimmed Mean   228.00     Corrected SS          476184.79
##  Skewness         0.42     Coeff Variation          53.72
##  Kurtosis        -1.07     Std Error Mean           21.91
##
##                               Quantiles
##
##   Quantile     Value
##
##   Max         472.00
##   99%         468.28
##   95%         449.00
##   90%         396.00
##   Q3          326.00
##   Median      196.30
##   Q1          120.83
##   10%          80.61
##   5%           77.35
##   1%           72.53
##   Min          71.10
##
##                            Extreme Values
##
##      Low                High
##
##   Obs   Value        Obs   Value
##    20    71.1         15     472
##    19    75.7         16     460
##    18    78.7         17     440
##    26    79           25     400
##    28    95.1          5     360
##
## NULL
##
## -------------------------- Frequency Distribution -------------------------
##
##                               Variable: disp
## |---------------------------------------------------------------------------|
## |          Bins |   Frequency | Cum Frequency |     Percent |   Cum Percent |
## |---------------------------------------------------------------------------|
## |  71.1 - 151.3 |          12 |            12 |        37.5 |          37.5 |
## | 151.3 - 231.5 |           5 |            17 |       15.62 |         53.12 |
## | 231.5 - 311.6 |           6 |            23 |       18.75 |         71.88 |
## | 311.6 - 391.8 |           5 |            28 |       15.62 |          87.5 |
## |   391.8 - 472 |           4 |            32 |        12.5 |           100 |
## |---------------------------------------------------------------------------|
## |         Total |          32 |             - |      100.00 |             - |
## |---------------------------------------------------------------------------|
##
## ------------------------------ Variable: mpg ------------------------------
##
## ---------------------------- Summary Statistics ---------------------------
##
## ------------------------------ Variable: mpg ------------------------------
##
##                         Univariate Analysis
##
##  N               32.00     Variance                 36.32
##  Missing          0.00     Std Deviation             6.03
##  Mean            20.09     Range                    23.50
##  Median          19.20     Interquartile Range       7.38
##  Mode            10.40     Uncorrected SS         14042.31
##  Trimmed Mean    19.95     Corrected SS            1126.05
##  Skewness         0.67     Coeff Variation          30.00
##  Kurtosis        -0.02     Std Error Mean            1.07
##
##                               Quantiles
##
##   Quantile     Value
##
##   Max          33.90
##   99%          33.44
##   95%          31.30
##   90%          30.09
##   Q3           22.80
##   Median       19.20
##   Q1           15.43
##   10%          14.34
##   5%           12.00
##   1%           10.40
##   Min          10.40
##
##                            Extreme Values
##
##      Low                High
##
##   Obs   Value        Obs   Value
##    15    10.4         20    33.9
##    16    10.4         18    32.4
##    24    13.3         19    30.4
##     7    14.3         28    30.4
##    17    14.7         26    27.3
##
## NULL
##
## -------------------------- Frequency Distribution -------------------------
##
##                               Variable: mpg
## |-----------------------------------------------------------------------|
## |        Bins |   Frequency | Cum Frequency |    Percent |  Cum Percent |
## |-----------------------------------------------------------------------|
## | 10.4 - 15.1 |           6 |             6 |      18.75 |        18.75 |
## | 15.1 - 19.8 |          12 |            18 |       37.5 |        56.25 |
## | 19.8 - 24.5 |           8 |            26 |         25 |        81.25 |
## | 24.5 - 29.2 |           2 |            28 |       6.25 |         87.5 |
## | 29.2 - 33.9 |           4 |            32 |       12.5 |          100 |
## |-----------------------------------------------------------------------|
## |       Total |          32 |             - |     100.00 |            - |
## |-----------------------------------------------------------------------|

Group Summary

The ds_group_summary() function returns descriptive statistics of a continuous
variable for the different levels of a categorical variable.

k <- ds_group_summary(mtcarz, cyl, mpg)
k

## mpg by cyl
## -----------------------------------------------------------------------------------------
## |     Statistic/Levels|        4|        6|        8|
## -----------------------------------------------------------------------------------------
## |                  Obs|       11|        7|       14|
## |              Minimum|     21.4|     17.8|     10.4|
## |              Maximum|     33.9|     21.4|     19.2|
## |                 Mean|    26.66|    19.74|     15.1|
## |               Median|       26|     19.7|     15.2|
## |                 Mode|     22.8|       21|     10.4|
## |       Std. Deviation|     4.51|     1.45|     2.56|
## |             Variance|    20.34|     2.11|     6.55|
## |             Skewness|     0.35|    -0.26|    -0.46|
## |             Kurtosis|    -1.43|    -1.83|     0.33|
## |       Uncorrected SS|  8023.83|  2741.14|  3277.34|
## |         Corrected SS|   203.39|    12.68|     85.2|
## |      Coeff Variation|    16.91|     7.36|    16.95|
## |      Std. Error Mean|     1.36|     0.55|     0.68|
## |                Range|     12.5|      3.6|      8.8|
## |  Interquartile Range|      7.6|     2.35|     1.85|
## -----------------------------------------------------------------------------------------

ds_group_summary() also returns a tibble, stored in the tidy_stats component of its output, which can be used for further analysis.
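As a quick illustration of that further analysis (the tidy_stats tibble itself is printed in the next block), here is a minimal sketch; it assumes dplyr and ggplot2 are installed alongside descriptr, and uses the column names shown below:

library(dplyr)
library(ggplot2)

k <- ds_group_summary(mtcarz, cyl, mpg)

# plot mean mpg per cylinder group with +/- one standard deviation
k$tidy_stats %>%
  select(cyl, mean, sd) %>%
  ggplot(aes(x = cyl, y = mean)) +
  geom_col(fill = "steelblue") +
  geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd), width = 0.2) +
  labs(x = "Cylinders", y = "Mean mpg")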

k$tidy_stats

## # A tibble: 3 x 15
##   cyl   length   min   max  mean median  mode    sd variance skewness
## 1 4         11  21.4  33.9  26.7   26    22.8  4.51    20.3     0.348
## 2 6          7  17.8  21.4  19.7   19.7  21    1.45     2.11   -0.259
## 3 8         14  10.4  19.2  15.1   15.2  10.4  2.56     6.55   -0.456
## # ... with 5 more variables: kurtosis, coeff_var, std_error, range, iqr

Box Plot

A plot() method has been defined for comparing distributions.

k <- ds_group_summary(mtcarz, cyl, mpg)
plot(k)

Multiple Variables

If you want grouped summary statistics for multiple variables in a data set, use
ds_auto_group_summary().

ds_auto_group_summary(mtcarz, cyl, gear, mpg)

## mpg by cyl
## -----------------------------------------------------------------------------------------
## |     Statistic/Levels|        4|        6|        8|
## -----------------------------------------------------------------------------------------
## |                  Obs|       11|        7|       14|
## |              Minimum|     21.4|     17.8|     10.4|
## |              Maximum|     33.9|     21.4|     19.2|
## |                 Mean|    26.66|    19.74|     15.1|
## |               Median|       26|     19.7|     15.2|
## |                 Mode|     22.8|       21|     10.4|
## |       Std. Deviation|     4.51|     1.45|     2.56|
## |             Variance|    20.34|     2.11|     6.55|
## |             Skewness|     0.35|    -0.26|    -0.46|
## |             Kurtosis|    -1.43|    -1.83|     0.33|
## |       Uncorrected SS|  8023.83|  2741.14|  3277.34|
## |         Corrected SS|   203.39|    12.68|     85.2|
## |      Coeff Variation|    16.91|     7.36|    16.95|
## |      Std. Error Mean|     1.36|     0.55|     0.68|
## |                Range|     12.5|      3.6|      8.8|
## |  Interquartile Range|      7.6|     2.35|     1.85|
## -----------------------------------------------------------------------------------------
##
## mpg by gear
## -----------------------------------------------------------------------------------------
## |     Statistic/Levels|        3|        4|        5|
## -----------------------------------------------------------------------------------------
## |                  Obs|       15|       12|        5|
## |              Minimum|     10.4|     17.8|       15|
## |              Maximum|     21.5|     33.9|     30.4|
## |                 Mean|    16.11|    24.53|    21.38|
## |               Median|     15.5|     22.8|     19.7|
## |                 Mode|     10.4|       21|       15|
## |       Std. Deviation|     3.37|     5.28|     6.66|
## |             Variance|    11.37|    27.84|    44.34|
## |             Skewness|    -0.09|      0.7|     0.56|
## |             Kurtosis|    -0.38|    -0.77|    -1.83|
## |       Uncorrected SS|  4050.52|   7528.9|  2462.89|
## |         Corrected SS|   159.15|   306.29|   177.37|
## |      Coeff Variation|    20.93|    21.51|    31.15|
## |      Std. Error Mean|     0.87|     1.52|     2.98|
## |                Range|     11.1|     16.1|     15.4|
## |  Interquartile Range|      3.9|     7.08|     10.2|
## -----------------------------------------------------------------------------------------

Multiple Variable Statistics

The ds_tidy_stats() function returns summary/descriptive statistics for
variables in a data frame/tibble.

ds_tidy_stats(mtcarz, mpg, disp, hp)

## # A tibble: 3 x 16
##   vars    min   max  mean t_mean median  mode range variance stdev  skew
## 1 disp   71.1   472  231.  228    196.  276.   401.  15361.  124.  0.420
## 2 hp       52   335  147.  144.   123   110    283    4701.   68.6 0.799
## 3 mpg    10.4  33.9  20.1  20.0   19.2  10.4   23.5     36.3   6.03 0.672
## # ... with 5 more variables: kurtosis, coeff_var, q1, q3, iqrange

Measures

If you want to view the measures of location, variation, symmetry, percentiles,
and extreme observations as tibbles, use the functions below. All of them,
except for ds_extreme_obs(), work with single or multiple variables. If you do
not specify any variables, they return results for all the continuous
variables in the data set.
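For example, to restrict a function to particular columns, pass them after the data. This is a minimal sketch following the pattern just described; the full-data calls are shown next:

# location measures for two chosen columns only
ds_measures_location(mtcarz, mpg, wt)

# with no columns specified, every continuous variable is summarised
ds_measures_variation(mtcarz)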

Measures of Location

ds_measures_location(mtcarz)

## # A tibble: 6 x 5
##   var    mean trim_mean median   mode
## 1 disp  231.    228      196.  276.
## 2 drat    3.60    3.58     3.70   3.07
## 3 hp    147.    144.     123   110
## 4 mpg    20.1    20.0     19.2   10.4
## 5 qsec   17.8    17.8     17.7   17.0
## 6 wt      3.22    3.20     3.32   3.44

Measures of Variation

ds_measures_variation(mtcarz)

## # A tibble: 6 x 7
##   var    range    iqr variance     sd coeff_var std_error
## 1 disp  401.   205.   15361.   124.        53.7   21.9
## 2 drat    2.17   0.840     0.286  0.535    14.9    0.0945
## 3 hp    283     83.5    4701.    68.6      46.7   12.1
## 4 mpg    23.5    7.38     36.3    6.03     30.0    1.07
## 5 qsec    8.40   2.01      3.19   1.79     10.0    0.316
## 6 wt      3.91   1.03      0.957  0.978    30.4    0.173

Measures of Symmetry

ds_measures_symmetry(mtcarz)

## # A tibble: 6 x 3
##   var   skewness kurtosis
## 1 disp     0.420  -1.07
## 2 drat     0.293  -0.450
## 3 hp       0.799   0.275
## 4 mpg      0.672  -0.0220
## 5 qsec     0.406   0.865
## 6 wt       0.466   0.417

Percentiles

ds_percentiles(mtcarz)

## # A tibble: 6 x 12
##   var     min  per1  per5 per10    q1 median    q3 per95 per90 per99
## 1 disp   71.1  72.5  77.4  80.6 121.   196.  326   449   396.  468.
## 2 drat    2.76  2.76  2.85  3.01  3.08   3.70   3.92  4.31  4.21  4.78
## 3 hp     52    55.1  63.6  66    96.5  123    180   254.  244.  313.
## 4 mpg    10.4  10.4  12.0  14.3  15.4   19.2   22.8  31.3  30.1  33.4
## 5 qsec   14.5  14.5  15.0  15.5  16.9   17.7   18.9  20.1  20.0  22.1
## 6 wt      1.51  1.54  1.74  1.96  2.58   3.32   3.61  5.29  4.05  5.40
## # ... with 1 more variable: max

Categorical Data

Cross Tabulation

The ds_cross_table() function creates two way tables of categorical variables.

ds_cross_table(mtcarz, cyl, gear)

##    Cell Contents
## |---------------|
## |     Frequency |
## |       Percent |
## |       Row Pct |
## |       Col Pct |
## |---------------|
##
## Total Observations:  32
##
## ----------------------------------------------------------------------------
## |              |                            gear                            |
## ----------------------------------------------------------------------------
## |          cyl |            3 |            4 |            5 |    Row Total |
## ----------------------------------------------------------------------------
## |            4 |            1 |            8 |            2 |           11 |
## |              |        0.031 |         0.25 |        0.062 |              |
## |              |         0.09 |         0.73 |         0.18 |         0.34 |
## |              |         0.07 |         0.67 |          0.4 |              |
## ----------------------------------------------------------------------------
## |            6 |            2 |            4 |            1 |            7 |
## |              |        0.062 |        0.125 |        0.031 |              |
## |              |         0.29 |         0.57 |         0.14 |         0.22 |
## |              |         0.13 |         0.33 |          0.2 |              |
## ----------------------------------------------------------------------------
## |            8 |           12 |            0 |            2 |           14 |
## |              |        0.375 |            0 |        0.062 |              |
## |              |         0.86 |            0 |         0.14 |         0.44 |
## |              |          0.8 |            0 |          0.4 |              |
## ----------------------------------------------------------------------------
## | Column Total |           15 |           12 |            5 |           32 |
## |              |        0.468 |        0.375 |        0.155 |              |
## ----------------------------------------------------------------------------

If you want the above result as a tibble, use ds_twoway_table().

ds_twoway_table(mtcarz, cyl, gear)

## Joining, by = c("cyl", "gear", "count")
## # A tibble: 8 x 6
##   cyl   gear  count percent row_percent col_percent
## 1 4     3         1  0.0312      0.0909      0.0667
## 2 4     4         8  0.25        0.727       0.667
## 3 4     5         2  0.0625      0.182       0.4
## 4 6     3         2  0.0625      0.286       0.133
## 5 6     4         4  0.125       0.571       0.333
## 6 6     5         1  0.0312      0.143       0.2
## 7 8     3        12  0.375       0.857       0.8
## 8 8     5         2  0.0625      0.143       0.4
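Because this is an ordinary tibble, it can be reshaped like any other. The snippet below is a small sketch, not part of descriptr itself; it assumes dplyr and a recent tidyr are installed, and spreads the cell percentages back into a cyl-by-gear layout:

library(dplyr)
library(tidyr)

ds_twoway_table(mtcarz, cyl, gear) %>%
  select(cyl, gear, percent) %>%
  # the 8-cylinder / 4-gear combination never occurs, so fill it with 0
  pivot_wider(names_from = gear, values_from = percent, values_fill = 0)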

A plot() method has been defined which will generate:

Grouped Bar Plots

k <- ds_cross_table(mtcarz, cyl, gear)
plot(k)

Stacked Bar Plots

k <- ds_cross_table(mtcarz, cyl, gear)
plot(k, stacked = TRUE)

Proportional Bar Plots

k <- ds_cross_table(mtcarz, cyl, gear)
plot(k, proportional = TRUE)

Frequency Table

The ds_freq_table() function creates frequency tables.

ds_freq_table(mtcarz, cyl)

##                              Variable: cyl
## -----------------------------------------------------------------------
##   Levels    Frequency    Cum Frequency    Percent    Cum Percent
## -----------------------------------------------------------------------
##      4          11            11           34.38        34.38
##      6           7            18           21.88        56.25
##      8          14            32           43.75          100
## -----------------------------------------------------------------------
##   Total         32             -          100.00            -
## -----------------------------------------------------------------------

A plot() method has been defined which will create a bar plot.

k <- ds_freq_table(mtcarz, cyl)
plot(k)

Multiple One Way Tables

The ds_auto_freq_table() function creates multiple one way tables, building a
frequency table for each categorical variable in a data set. You can also
specify a subset of variables if you do not want all of them to be used.
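For instance, to build tables for only a couple of the categorical columns, pass them after the data (a minimal sketch following the signature just described; the full-data call is shown next):

# one way tables for cyl and gear only
ds_auto_freq_table(mtcarz, cyl, gear)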

ds_auto_freq_table(mtcarz)

##                              Variable: cyl
## -----------------------------------------------------------------------
##   Levels    Frequency    Cum Frequency    Percent    Cum Percent
## -----------------------------------------------------------------------
##      4          11            11           34.38        34.38
##      6           7            18           21.88        56.25
##      8          14            32           43.75          100
## -----------------------------------------------------------------------
##   Total         32             -          100.00            -
## -----------------------------------------------------------------------
##
##                              Variable: vs
## -----------------------------------------------------------------------
##   Levels    Frequency    Cum Frequency    Percent    Cum Percent
## -----------------------------------------------------------------------
##      0          18            18           56.25        56.25
##      1          14            32           43.75          100
## -----------------------------------------------------------------------
##   Total         32             -          100.00            -
## -----------------------------------------------------------------------
##
##                              Variable: am
## -----------------------------------------------------------------------
##   Levels    Frequency    Cum Frequency    Percent    Cum Percent
## -----------------------------------------------------------------------
##      0          19            19           59.38        59.38
##      1          13            32           40.62          100
## -----------------------------------------------------------------------
##   Total         32             -          100.00            -
## -----------------------------------------------------------------------
##
##                              Variable: gear
## -----------------------------------------------------------------------
##   Levels    Frequency    Cum Frequency    Percent    Cum Percent
## -----------------------------------------------------------------------
##      3          15            15           46.88        46.88
##      4          12            27           37.5         84.38
##      5           5            32           15.62          100
## -----------------------------------------------------------------------
##   Total         32             -          100.00            -
## -----------------------------------------------------------------------
##
##                              Variable: carb
## -----------------------------------------------------------------------
##   Levels    Frequency    Cum Frequency    Percent    Cum Percent
## -----------------------------------------------------------------------
##      1           7             7           21.88        21.88
##      2          10            17           31.25        53.12
##      3           3            20            9.38         62.5
##      4          10            30           31.25        93.75
##      6           1            31            3.12        96.88
##      8           1            32            3.12          100
## -----------------------------------------------------------------------
##   Total         32             -          100.00            -
## -----------------------------------------------------------------------

Multiple Two Way Tables

The ds_auto_cross_table() function creates multiple two way tables, building a
cross table for each unique pair of categorical variables in a data set. You
can also specify a subset of variables if you do not want all of them to be
used.

ds_auto_cross_table(mtcarz, cyl, gear, am)

##    Cell Contents
## |---------------|
## |     Frequency |
## |       Percent |
## |       Row Pct |
## |       Col Pct |
## |---------------|
##
## Total Observations:  32
##
## cyl vs gear
## ----------------------------------------------------------------------------
## |              |                            gear                            |
## ----------------------------------------------------------------------------
## |          cyl |            3 |            4 |            5 |    Row Total |
## ----------------------------------------------------------------------------
## |            4 |            1 |            8 |            2 |           11 |
## |              |        0.031 |         0.25 |        0.062 |              |
## |              |         0.09 |         0.73 |         0.18 |         0.34 |
## |              |         0.07 |         0.67 |          0.4 |              |
## ----------------------------------------------------------------------------
## |            6 |            2 |            4 |            1 |            7 |
## |              |        0.062 |        0.125 |        0.031 |              |
## |              |         0.29 |         0.57 |         0.14 |         0.22 |
## |              |         0.13 |         0.33 |          0.2 |              |
## ----------------------------------------------------------------------------
## |            8 |           12 |            0 |            2 |           14 |
## |              |        0.375 |            0 |        0.062 |              |
## |              |         0.86 |            0 |         0.14 |         0.44 |
## |              |          0.8 |            0 |          0.4 |              |
## ----------------------------------------------------------------------------
## | Column Total |           15 |           12 |            5 |           32 |
## |              |        0.468 |        0.375 |        0.155 |              |
## ----------------------------------------------------------------------------
##
## cyl vs am
## -------------------------------------------------------------
## |              |             am              |
## -------------------------------------------------------------
## |          cyl |            0 |            1 |    Row Total |
## -------------------------------------------------------------
## |            4 |            3 |            8 |           11 |
## |              |        0.094 |         0.25 |              |
## |              |         0.27 |         0.73 |         0.34 |
## |              |         0.16 |         0.62 |              |
## -------------------------------------------------------------
## |            6 |            4 |            3 |            7 |
## |              |        0.125 |        0.094 |              |
## |              |         0.57 |         0.43 |         0.22 |
## |              |         0.21 |         0.23 |              |
## -------------------------------------------------------------
## |            8 |           12 |            2 |           14 |
## |              |        0.375 |        0.062 |              |
## |              |         0.86 |         0.14 |         0.44 |
## |              |         0.63 |         0.15 |              |
## -------------------------------------------------------------
## | Column Total |           19 |           13 |           32 |
## |              |        0.594 |        0.406 |              |
## -------------------------------------------------------------
##
## gear vs am
## -------------------------------------------------------------
## |              |             am              |
## -------------------------------------------------------------
## |         gear |            0 |            1 |    Row Total |
## -------------------------------------------------------------
## |            3 |           15 |            0 |           15 |
## |              |        0.469 |            0 |              |
## |              |            1 |            0 |         0.47 |
## |              |         0.79 |            0 |              |
## -------------------------------------------------------------
## |            4 |            4 |            8 |           12 |
## |              |        0.125 |         0.25 |              |
## |              |         0.33 |         0.67 |         0.38 |
## |              |         0.21 |         0.62 |              |
## -------------------------------------------------------------
## |            5 |            0 |            5 |            5 |
## |              |            0 |        0.156 |              |
## |              |            0 |            1 |         0.16 |
## |              |            0 |         0.38 |              |
## -------------------------------------------------------------
## | Column Total |           19 |           13 |           32 |
## |              |        0.594 |        0.406 |              |
## -------------------------------------------------------------

Visualization

descriptr can help visualize multiple variables by automatically
detecting their data types.

Continuous Data

ds_plot_scatter(mtcarz, mpg, disp, hp)

Categorical Data

ds_plot_bar_stacked(mtcarz, cyl, gear, am)

Learning More

The descriptr website includes comprehensive documentation on using the
package, including articles that cover various aspects of using descriptr.

Feedback

All feedback is welcome. Issues (bugs and feature requests) can be posted to
the GitHub issue tracker. For help with code or other related questions, feel
free to reach me at hebbali.aravind@gmail.com.


To leave a comment for the author, please follow the link and comment on their blog: Rsquared Academy Blog.

Floor filler

Wed, 02/20/2019 - 01:00

(This article was first published on R on Gianluca Baio, and kindly contributed to R-bloggers)

As I posted recently, I’m involved in a couple of events later this summer: our annual Summer School and the new(er) tradition of the R for HTA workshop.

I have to say that I’m very happy about how things are proceeding for both of them. The summer school was first advertised a few months back (I posted about it on the blog, but we’ve also tried to reach other relevant mailing lists and groups, such as the HTA agencies in the EUnetHTA Network). And the dancefloor is filling up quickly: there’s been a surge in registrations in the past couple of weeks and we now have only 4 places left. (I’m not expecting to hold dance sessions when we reconvene in Florence in June, although people usually do have lots of fun, whether at the Centro Studi, chilling on the terrace, or rolling down into Florence…).

The R for HTA workshop is even more impressive and pleasing, I think. We’ve nearly filled the 20 places for the short course on using R for Cost-Effectiveness Modelling, with 12 places already reserved! We also already have 16 registrations for the main event.

And we’re also finalising the “hackathon” (or “challenge”, to use the formal terminology), which sounds like an interesting exercise. We’ll publicise this shortly as well, so people can sign up for it too!


To leave a comment for the author, please follow the link and comment on their blog: R on Gianluca Baio.
