R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Decision Making Support Systems #3: Differences between IA and AI

Wed, 10/23/2019 - 14:48

[This article was first published on r – Appsilon Data Science | End-to-End Data Science Solutions, and kindly contributed to R-bloggers.]

The Differences between Artificial Intelligence and Augmented Intelligence

In previous posts, we looked at the definition of Artificial Intelligence (AI) and the definition of Intelligence Augmentation (IA). So, what are the differences between the two? Intelligence Augmentation has always been concerned with aiding human decision making and keeping humans in the loop, whereas the AI endeavor seeks to “build machines that emulate and then exceed the full range of human cognition, with an eye on replacing them.”

Rob Bromage puts it well in “Artificial intelligence vs intelligence augmentation”: “AI is an autonomous system that can be taught to imitate and replace human cognitive functions. To put it simply, the machine completely replaces human intervention and interaction. IA, on the other hand, plays more of an assistive role by leveraging AI to enhance human intelligence, rather than replace it.”

In “Augmented Intelligence, not Artificial Intelligence, is the Future” Aaron Masih writes that “While the underlying technologies powering AI and IA are the same, the goals and applications are fundamentally different: AI aims to create systems that run without humans, whereas IA aims to create systems that make humans better.”  

A beneficial partnership between human and machine is depicted in Interstellar.

IA systems are able to exceed system boundaries (the parts of an environment not covered by an AI model’s data inputs) because of the “unsung heroes of the AI revolution” — humans! AI systems can only work with the datasets that they’re trained on and that feed them. Human subject-matter experts excel in applying context and intuition to problems. Humans can grasp and explain causality. Alex Bates recalls from his experience running an AI/prescriptive maintenance firm that “…what was remarkable about the human process engineers and maintenance engineers at these plants …were all the clues they incorporated into their assessments of equipment failure, allowing them to identify what was failing and what to repair.  Where they struggled, and where the AI helped, was in making sense of all the massive amount of sensor data coming off the equipment.”

Another major difference? To state the obvious, one receives much more attention than the other. Even a casual observer of the industry can see that augmenting human intelligence represents a tiny fraction of the total research and development devoted to Artificial Intelligence. The lack of investment in the area seems like a missed opportunity.

So the AI approach seeks to minimize or replace the role of humans, while an IA solution seeks to amplify the abilities and performance of the humans that participate in a given activity. The IA approach benefits from the human ability to think outside system boundaries. But despite the lack of investment and press attention on Intelligence Augmentation, businesses and other organizations continue to benefit from IA applications…   

Examples of IA at work today 

CEO of Appsilon Filip Stachura at useR! Toulouse 2019

I asked Filip Stachura, CEO of Appsilon Data Science, which specializes in Data Science and Machine Learning: “How much of what you do as a company qualifies as Intelligence Augmentation?”

FS: Most of what we do qualifies as IA. Decision support systems can be more realistic in many business cases. Even a recommendation engine application is a sales-support decision system, not an AI that sells by itself.

FS: For one firm we did the following: Prices for products that our client sells change frequently and depend heavily on negotiations.  When a client requests a product, the salesperson opens the application that we built and fills in the client and product name. They then see the history of deals made with the client and suggested prices/discounts for the product. The prices vary depending on the segment of the business, category/size of the client and the size of the order.  Ultimately, the human salespeople make the pricing decision, but the application provides the most up to date information to optimize decision-making.  It’s sales-support, not salesperson replacement.

FS: Here is another example. What if a manager has to optimize the usage of hundreds of varying models of cleaning machines that are based in different locations in a region? Each location has its own logistical needs and has staff with varying levels of expertise in operating the machines. How do you optimize performance and cost under such diverse conditions? Is the answer to eliminate the manager in charge, or is it to augment the manager’s capabilities with a decision-support system? Obviously it’s the latter, since a fully automated management system would have great difficulty assessing non-standard situations and communicating with the various teams at the various locations.

Appsilon Data Science co-founder Damian Rodziewicz added:

DR: We also worked with Dr. Ken Benoit and his team from the London School of Economics to make their Quanteda text analysis tool available to a much greater number of social scientists and practitioners from other fields, including medicine and law. Now the user doesn’t need to know the R programming language in order to make use of the powerful Quanteda R package. A human researcher can use the Quanteda package to quickly evaluate language and social trends from millions of documents, truly extending the human’s research domain by many orders of magnitude.

The man and the machine: Dr. Ken Benoit and the Quanteda text analysis tool

Here is another example, which I have transcribed and hijacked from Andrew Ng’s presentation at Amazon re:MARS 2019 in Las Vegas, Nevada. He gave it as an example of AI, but I think it’s really an example of IA.

Radiologists do a lot of things. They read x-ray images, they also consult with patients, they do surgical planning, and mentor younger doctors… out of all of these tasks, one of them seems amenable to AI automation or acceleration — that’s why many teams including mine are building systems like these to have AI enter this task of reading x-ray images. 

2016: @GeoffreyHinton says we should stop training radiologists, because radiologists will soon be replaced by deep learning

2019: There is a shortage of radiologists. # replaced ≈ 0

What did Hinton miss?

by @garymarcus & @MaxALittle https://t.co/rxTTNxaeBN

— Gary Marcus (@GaryMarcus) October 23, 2019

Does anyone really want to replace radiologists with machines at this time?  With human lives at stake? Probably not. But a machine-assisted radiologist?  That is interesting. After the first versions are released, radiologists can work with engineers to teach the machines to do a better job in finding problems.  And eventually there can be a feature set that allows the radiologists to teach the machines directly, without the constant participation of the engineers. And training data from all over the world can be shared to further optimize results.   

When is IA superior? 

Another beneficial human and machine partnership depicted in “Lucky 13.”

In short, IA is superior when…

…only a limited amount of labeled data, or only unlabeled data exists for a given task

…a task requires empathy and/or negotiation between humans

…a task requires a notion of causality, not just correlation

…system boundaries exist: crucial data cannot be captured by sensors, and the data inputs don’t cover the entirety of the problem

…a problem exists within a regulated environment in which machines still struggle with decisions, such as in medicine and surgery

…a task is so critical that only a human can make the final decision

…a task requires more than 2-3 seconds of human thought

The above criteria probably describe most business and research problem scenarios. Consider moving away from the approach of “how do we replace our staff with Artificial Intelligence agents,” and instead move towards “what repeatable, routinizable task can we automate in order to free up time for our human teammates? How do we unleash more intuitive and brilliant ideas by increasing their bandwidth? How do we prevent problems in our facilities by partnering machines and humans?”

Thanks for reading. In the next post, we’ll look at “How to Implement an IA Solution.”

Here are the previous posts:

What Is Artificial Intelligence?

What Is Intelligence Augmentation?

Follow me on Twitter @_joecha_

Follow Appsilon Data Science on Social Media

Follow @Appsilon on Twitter!
Follow us on LinkedIn!
Don’t forget to sign up for our newsletter.
And try out our R Shiny open source packages!

 

 

 

Article Decision Making Support Systems #3: Differences between IA and AI comes from Appsilon Data Science | End-to-End Data Science Solutions.



Horizontal scaling of data science applications in the cloud

Wed, 10/23/2019 - 14:42

[This article was first published on R-Bloggers – eoda GmbH, and kindly contributed to R-bloggers.]

Prediction models, machine learning algorithms and scripts for data storage: the modern data science application not only grows ever more complex, but also puts the existing infrastructure to the test with temporary resource peaks. In this article, we show how tools such as the RStudio Job Launcher, in conjunction with a Kubernetes cluster, can be used to outsource the execution of arbitrary analysis scripts to the cloud, scale them, and return the results to the local infrastructure.

A brief introduction to Kubernetes and the Job Launcher

Kubernetes was designed by Google in 2014 and is an open-source container-orchestration system. The focus of such systems is the automated deployment, scaling and management of container applications. A Kubernetes cluster provides so-called (worker) nodes that can be addressed by other applications. Within the nodes, the necessary containers are booted up in pods and made available. In a statistics/analysis context, the outsourcing, or horizontal scaling, of computation-intensive analyses is particularly interesting. In a multi-user environment, the distribution of jobs among the worker nodes ensures that exactly the amount of resources required is made available, depending on the workload. In the analysis context with R, the RStudio Job Launcher, an independent tool of RStudio Server, can play to its strengths and send sessions and scripts directly to a Kubernetes cluster via a plugin.

 

On the one hand this avoids the additional costs of servers sitting idle; on the other hand it prevents the bottlenecks that frequently occur during workload peaks on standard systems. Based on this idea, the RStudio Job Launcher can also be used from local sessions by executing individual R scripts in the Kubernetes cluster and returning their results. Typical data science use cases are resource-intensive scripts, the simultaneous training of different analysis models, or compilation tasks that can be outsourced to external nodes.

Our conclusion

Scalability, combined with on-demand provisioning and use of resources, is an ideal scenario for organizations that need to keep their data in the local data center and cannot move entirely to the cloud. In addition, by outsourcing computationally intensive processes, the local data center does not need to grow unnecessarily. This avoids purchasing additional servers for the local data center that would only be used during temporary resource peaks.

In our opinion, this scenario will be particularly interesting for companies that are not allowed to store their data in the cloud, given constant data growth and ever more complex requirements on the analysis infrastructure.

In addition to the advantage of processing local, on-premise data in a computing cluster, analyses can also be based on different frameworks thanks to Docker images. Furthermore, flexible requirements on the analysis infrastructure, such as the execution of certain analyses on a GPU or CPU cluster or the booting of additional worker nodes, are easily implemented. Scaling computation-intensive processes horizontally can be achieved with little effort because access to a cluster is easier than ever, for example through Amazon’s EKS service, which provides a completely cloud-based Kubernetes cluster.

This approach solves numerous challenges for data scientists and data engineers. For this reason, we are happy to support and advise you in the planning and implementation of an IT infrastructure in your company. Learn more about aicon | analytic infrastructure consulting!



linl 0.0.4: Now with footer

Wed, 10/23/2019 - 13:48

[This article was first published on Thinking inside the box, and kindly contributed to R-bloggers.]

A new release of our linl package for writing LaTeX letters with (R)markdown just arrived on CRAN. linl makes it easy to write letters in markdown, with some extra bells and whistles thanks to some cleverness chiefly by Aaron.
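If you have not used linl before, the quickest way to try it is to create a letter skeleton from the package's RMarkdown template; the following is a minimal sketch (the file name is just an example, and the template name "pdf" follows the package's usage examples):

library(rmarkdown)

# create a new letter from the linl template, then render it to PDF
draft("myletter.Rmd", template = "pdf", package = "linl", edit = FALSE)
render("myletter.Rmd")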

This version now supports a (pdf, png, …) footer along with the already-supported header, thanks to an initial PR by Michal Bojanowski to which Aaron added nice customization for scale and placement (as supported by the LaTeX package wallpaper). I also added support for continuous integration testing at Travis CI via a custom Docker RMarkdown container—which is something I should actually say more about at another point.

Here is a screenshot of the vignette showing the simple input for some moderately fancy output (now with a footer):

The NEWS entry follows:

Changes in linl version 0.0.4 (2019-10-23)
  • Continuous integration tests at Travis are now running via custom Docker container (Dirk in #21).

  • A footer for the letter can now be specified (Michal Bojanowski in #23 fixing #10).

  • The header and footer options can be customized more extensively, and are documented (Aaron in #25 and #26).

Courtesy of CRANberries, there is a comparison to the previous release. More information is on the linl page. For questions or comments use the issue tracker off the GitHub repo.

If you like this or other open-source work I do, you can now sponsor me at GitHub. For the first year, GitHub will match your contributions.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.



pkgKitten 0.1.5: Creating R Packages that purr

Tue, 10/22/2019 - 14:52

[This article was first published on Thinking inside the box, and kindly contributed to R-bloggers.]

Another minor release, 0.1.5, of pkgKitten just hit CRAN today, after a break of almost three years.
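For readers who have not tried it, the package's core job is a single call that creates a new package skeleton which passes R CMD check cleanly; a minimal sketch (the package name is just an example):

library(pkgKitten)

# create a new package skeleton in the current directory
kitten("myNewPackage")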

This release provides a few small changes. The default per-package manual page now benefits from a second refinement (building on what was introduced in the 0.1.4 release) in using the Rd macros referring to the DESCRIPTION file rather than duplicating information. Several pull requests fixed sloppy typos in README.md, NEWS.Rd and the manual page—thanks to all contributors for fixing these. Details below.

Changes in version 0.1.5 (2019-10-22)
  • More extensive use of newer R macros in package-default manual page.

  • Install .Rbuildignore and .gitignore files.

  • Use the updated Travis run script.

  • Use more Rd macros in default ‘stub’ manual page (#8).

  • Several typos were fixed in README.md, NEWS.Rd and the manual page (#9, #10)

More details about the package are at the pkgKitten webpage and the pkgKitten GitHub repo.

Courtesy of CRANberries, there is also a diffstat report for this release.

If you like this or other open-source work I do, you can now sponsor me at GitHub. For the first year, GitHub will match your contributions.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.



Council spending – open data

Tue, 10/22/2019 - 13:06

[This article was first published on R – scottishsnow, and kindly contributed to R-bloggers.]

My local authority recently decided to publish all spending over £500 in an effort to be more transparent. Here’s a post taking an overview of what they’ve published. I’ve used R for the analysis. The dataset doesn’t contain much detail, but if you have analysis suggestions, please add them in the comments!

You can download the spending data here. It’s available in pdf (why?!) and xlsx (plain text would be more open).

First off, some packages:

library(tidyverse)
library(readxl)
library(janitor)
library(lubridate)
library(formattable)

Read in the dataset:

df = read_excel("~/Downloads/midlothian_payments_over_500_01042019_to_15092019.xlsx") %>%
  clean_names()

We’ve got six columns:

  • type
  • date_paid
  • supplier
  • amount
  • our_ref
  • financial_year

 

Busiest day:

df %>%
  mutate(day = weekdays(date_paid)) %>%
  group_by(day) %>%
  summarise(transactions = n(),
            thousands_pounds_spent = sum(amount) / 1000) %>%
  mutate(day = fct_relevel(day, rev(c("Monday", "Tuesday", "Wednesday",
                                      "Thursday", "Friday", "Saturday", "Sunday")))) %>%
  gather(var, value, -day) %>%
  ggplot(aes(day, value)) +
  geom_col() +
  facet_wrap(~var, scales = "free_x") +
  coord_flip() +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Busiest day of the week", x = "", y = "")

Busiest time of year:

df %>%
  mutate(dow = weekdays(date_paid),
         dow = if_else(dow == "Tuesday" | dow == "Friday", "Tue/Fri", "Other")) %>%
  group_by(date_paid, dow) %>%
  summarise(transactions = n(),
            pounds_spent = sum(amount)) %>%
  gather(var, value, -date_paid, -dow) %>%
  ggplot(aes(date_paid, value, colour = dow)) +
  geom_point() +
  facet_wrap(~var, scales = "free_y") +
  scale_y_log10(labels = scales::comma) +
  scale_colour_brewer(type = "qual", palette = "Set2") +
  labs(title = "Busiest day of the year", x = "", y = "")

Top 10 payees by value:

df %>%
  group_by(supplier) %>%
  summarise(pounds_spent = sum(amount),
            transactions = n()) %>%
  arrange(desc(pounds_spent)) %>%
  top_n(n = 10, wt = pounds_spent) %>%
  mutate(pounds_spent = currency(pounds_spent, "£", digits = 0L)) %>%
  formattable(list(`pounds_spent` = color_bar("#FA614B"),
                   `transactions` = color_bar("lightpink")))

In Scotland, local authorities collect water charges on behalf of the water authority, which they then pass on. It’s no surprise that Scottish Water is the biggest supplier.

Top 10 payees by frequency:

df %>%
  group_by(supplier) %>%
  summarise(pounds_spent = sum(amount),
            transactions = n()) %>%
  arrange(desc(transactions)) %>%
  top_n(n = 10, wt = transactions) %>%
  mutate(pounds_spent = currency(pounds_spent, "£", digits = 0L)) %>%
  formattable(list(`pounds_spent` = color_bar("lightpink"),
                   `transactions` = color_bar("#FA614B")))

As a final note, writing this post is reminding me again I should be moving away from wordpress because incorporating code and output would be much easier with mark/blog down! As always, legacy is holding me back.



Super Solutions for Shiny Architecture #5 of 5: Automated Tests

Tue, 10/22/2019 - 12:06

[This article was first published on r – Appsilon Data Science | End-to-End Data Science Solutions, and kindly contributed to R-bloggers.]

TL;DR

This post describes best practices for setting up an automated test architecture for Shiny apps. Automate and test early and often with unit tests, user interface tests, and performance tests.

Best Practices for Testing Your Shiny App

Even your best apps will break down at some point during development or during User Acceptance Tests. I can bet on this. It’s especially true when developing big, productionalized applications with the support of various team members and under a client’s deadline pressure. It’s best to find those bugs early. Automated testing can assure your product’s quality. Investing time and effort in automated tests brings a huge return. It may seem like a burden at the beginning, but imagine the alternative: fixing the same misbehaviour of the app for the third time, e.g. when a certain button is clicked. What is worse, bugs are sometimes spotted after changes are merged to the master branch, and you have no idea which code change left the door open for the bugs, as no one checked that particular functionality for a month or so. Manual testing is a solution to some extent, but I can confidently assume that you would rather spend testing time on improving the user experience than on looking for a missing comma in the code.

How do we approach testing at Appsilon? We aim to organize our test structure according to the “pyramid” best practice:

FYI, there is also an anti-pattern called the “test-cone”. Even such a test architecture is a good sign – after all, the app is (automatically) tested, which is unfortunately often not the case at all. Nevertheless, switching to the “pyramid” makes your tests more reliable and effective, and less time-consuming.

No matter how extensively you are testing or planning to test your app, take this piece of advice: set up your working environment so that automated tests are triggered before merging any pull request (check tools like CircleCI for this). Otherwise you will soon hate finding bugs caused by developers: “Aaaa, yeaaa, it’s on me, I haven’t run the tests, but I thought that the change was so small and not related to anything crucial!” (I assume it goes without saying that no change goes into the ‘master’ or ‘development’ branches without a proper Pull Request procedure and review.)

Let’s now describe in detail the different types of tests:

Unit Tests

… are the simplest to implement and the most low-level kind of tests. The term refers to testing the behaviour of functions by comparing their output against expectations. It’s a case-by-case approach – hence the name. Implementing them will allow you to recognize all the edge cases and understand the logic of your function better. Believe me – you will be surprised what your function can return when given unexpected input. This idea is pushed to its limits with the so-called Test Driven Development (TDD) approach. Whether you’re a fan or a skeptic, at the end of the day you should have good unit tests implemented for your functions.

How to achieve it in practice? The popular and well-known package testthat should be your weapon of choice. Add a tests folder to your source code. Inside it, add another folder testthat and a script testthat.R. The script’s only job is to trigger all of your tests stored in the testthat folder, in which you define scripts for your tests (one script per functionality or single function – names should start with “test_” plus something that reflects the functionality, or even just the name of the function). Start each test script with context() – write some text inside that will help you understand what the included tests are about. Now you can start writing down your tests, one by one. Every test is wrapped in the test_that() function, with a text description of what exactly is tested, followed by the test itself – commonly just calling the function with a set of parameters and comparing the result with the expected output, e.g.

test_that("sum() returns the expected total", {
  result <- sum(2, 2)
  expect_equal(result, 4)
})

Continue adding tests for each function, and scripts for all functions. Once they are ready, we can set up the main testthat.R script. There you can use test_check("yourPackageName") for apps built as packages, or more generally test_results <- test_dir("tests/testthat", reporter = "summary", stop_on_failure = TRUE).
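Putting the pieces together, the main testthat.R runner could look like the following sketch (the path and reporter settings are the common defaults; adjust them to your project):

# tests/testthat.R -- triggers every test script stored in tests/testthat
library(testthat)

test_results <- test_dir("tests/testthat",
                         reporter = "summary",
                         stop_on_failure = TRUE)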

User Interface (UI) Tests

The core of these tests is to compare the actual app behaviour with what is expected to be displayed after various user actions. Usually this is done by comparing screen snapshots with reference images. The crucial part, though, is to set up the architecture to automatically perform human-user-like actions and take snapshots.

Why are User Interface (UI) tests needed? It is common in an app development project that all of the functions work fine, yet the app still crashes. It might be, for example, due to JS code that used to do the job but suddenly stopped working because the object it looks for now appears on the screen with a slight delay. Or a modal ID has been changed and clicking the button no longer triggers anything. The point is this: Shiny apps are much more than R code, with all of the JS, CSS and browser dependencies, and at the end of the day what truly matters is whether users get the expected, bug-free experience.

The great folks at RStudio figured out a way to aid developers in taking snapshots. Check this article to get more information on the shinytest package. It basically allows you to record actions in the app and select when snapshots should be created to be checked during tests. Importantly, shinytest saves the snapshots as JSON files that describe the content. This fixes the usual problem with comparing images, where small differences in colors or fonts across browsers are flagged as errors. An image is also generated to make it easy for the human eye to check that everything is OK.
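In practice the shinytest workflow boils down to two calls, sketched below ("app" stands for your application directory; see the package documentation for the details):

library(shinytest)

recordTest("app")   # opens the app, records user actions and snapshot points
testApp("app")      # replays the recorded script and compares fresh snapshots to the references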

There is also the RSelenium package, which is worth mentioning. It connects R with the Selenium WebDriver API for automating web browsers. It is harder to configure than shinytest, but it does the job.

As shinytest is quite a new solution, at Appsilon we had already developed our own internal architecture for tests. The solution is based on puppeteer and BackstopJS. The test scenarios are written in JavaScript, so they are quite easy to produce. Plus, BackstopJS has very nice-looking reports.

I guess the best strategy would be to start with shinytest and if there are some problems with using it, switch to some other more general solution for web applications.

Performance Tests

Yes, Shiny applications can scale. They just need the appropriate architecture. Check our case study and architecture description blog posts to learn how we build large-scale apps. As a general rule, you should always check how your app performs under extreme usage conditions. The source code should be profiled and optimised. The application’s behaviour under heavy usage can be tested with RStudio’s recent package shinyloadtest. It will help you estimate how many users your application can support and where the bottlenecks are located. This is achieved by recording a “typical” user session and then replaying it in parallel at a huge scale.
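As a rough sketch of that recording step (the URL is a placeholder for a running deployment; the recorded session is then replayed in parallel with the separate shinycannon tool):

library(shinyloadtest)

# record a "typical" user session against a running app;
# the resulting recording can be replayed at scale with shinycannon
record_session("http://localhost:3838/myapp")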

So, please test. Test automatically, early and often.

Smash down all the bugs before they become big, strong and dangerous insects!

Follow Appsilon Data Science on Social Media

Follow @Appsilon on Twitter!
Follow us on LinkedIn!
Don’t forget to sign up for our newsletter.
And try out our R Shiny open source packages!

Article Super Solutions for Shiny Architecture #5 of 5: Automated Tests comes from Appsilon Data Science | End-to-End Data Science Solutions.



Understanding Blockchain Technology by building one in R

Tue, 10/22/2019 - 09:00

[This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers.]

By now you will know that it is a good tradition of this blog to explain stuff by rebuilding toy examples of it in R (see e.g. Understanding the Maths of Computed Tomography (CT) scans, So, what is AI really? or Google’s Eigenvector… or how a Random Surfer finds the most relevant Webpages). This time we will do the same for the hyped Blockchain technology, so read on!

Everybody is talking about blockchains, e.g. applications like the so-called cryptocurrencies (like Bitcoins) or smart contracts, and the big business potential behind them. Alas, not many people know what the technological basis is. The truth is that blockchain technology, like any database technology, can be used for all conceivable content, not only new currencies. Business and governmental transactions as well as research results, data about organ transplants and items you gained in online games can be stored, as can examination results and all kinds of certificates – the possibilities are endless. There are two big advantages:

  • It is very hard to alter the content and
  • you don’t need some centralized trustee.

To understand why let us create a toy example of a blockchain in R. We will use three simple transactions as content:

trnsac1 <- "Peter buys car from Michael" trnsac2 <- "John buys house from Linda" trnsac3 <- "Jane buys car from Peter"

It is called a chain because the transactions are concatenated like so:

To understand this picture we need to know what a hash is. Basically, a hash (or better a cryptographic hash in this case) is just some function to encode messages. For educational purposes let us take the following (admittedly not very sophisticated) hash function:

# very simple (and not very good ;-)) hash function
hash <- function(x, l = 5) {
  hash <- sapply(unlist(strsplit(x, "")), function(x) which(c(LETTERS, letters, 0:9, "-", " ") == x))
  hash <- as.hexmode(hash[quantile(1:length(hash), (0:l)/l)])
  paste(hash, collapse = "")
}

hash(trnsac1)
## [1] "104040200d26"
hash(trnsac2)
## [1] "0a1c2240401b"

We will use this function to hash the respective transaction and (and this is important here!) the header of the transaction before that. In this way, a header is created and the transactions form a chain (have a look at the pic again).

We will create the blockchain via a simple data frame but of course, it can also be distributed across several computers (this is why the technology is also sometimes called distributed ledger). Have a look at the function to add a transaction to an already existing blockchain or create a new one in case you start with NULL.

add_trnsac <- function(bc, trnsac) {
  if (is.null(bc))
    bc <- data.frame(Header = hash(sample(LETTERS, 20, replace = TRUE)),
                     Hash = hash(trnsac),
                     Transaction = trnsac,
                     stringsAsFactors = FALSE)
  else
    bc <- rbind(bc, data.frame(Header = hash(paste0(c(bc[nrow(bc), "Header"]), bc[nrow(bc), "Hash"])),
                               Hash = hash(trnsac),
                               Transaction = trnsac))
  bc
}

We are now ready to create our little blockchain and add the transactions:

# create blockchain
set.seed(1234)
bc <- add_trnsac(NULL, trnsac1)
bc
##         Header         Hash                 Transaction
## 1 10050502060e 104040200d26 Peter buys car from Michael

# add transactions
bc <- add_trnsac(bc, trnsac2)
bc <- add_trnsac(bc, trnsac3)
bc
##         Header         Hash                 Transaction
## 1 10050502060e 104040200d26 Peter buys car from Michael
## 2 36353b35373b 0a1c2240401b  John buys house from Linda
## 3 38383c1b391c 0a404040402c    Jane buys car from Peter

To test the integrity of the blockchain we just recalculate the hash values and stop when they don’t match:

test_bc <- function(bc) {
  integrity <- TRUE
  row <- 2
  while (integrity && row <= nrow(bc)) {
    if (hash(paste0(c(bc[(row-1), "Header"]), hash(bc[(row-1), "Transaction"]))) != bc[row, "Header"]) integrity <- FALSE
    row <- row + 1
  }
  if (integrity) {
    TRUE
  } else {
    warning(paste("blockchain is corrupted at row", (row-2)))
    FALSE
  }
}

# test integrity of blockchain
test_bc(bc)
## [1] TRUE

Let us now manipulate a transaction in the blockchain! Mafia-Joe hacks his way into the blockchain and manipulates the second transaction so that not John but he owns Linda’s house. He even changes the hash value of the transaction so that it is consistent with the manipulated transaction:

# manipulate blockchain, even with consistent hash-value!
bc[2, "Transaction"] <- "Mafia-Joe buys house from Linda"
bc[2, "Hash"] <- hash("Mafia-Joe buys house from Linda")
bc
##         Header         Hash                     Transaction
## 1 10050502060e 104040200d26     Peter buys car from Michael
## 2 36353b35373b 0d0a332d271b Mafia-Joe buys house from Linda
## 3 38383c1b391c 0a404040402c        Jane buys car from Peter

test_bc(bc)
## Warning in test_bc(bc): blockchain is corrupted at row 2
## [1] FALSE

Bingo, the integrity test cries foul! The consistency of the chain is corrupted and Mafia-Joe’s hack doesn’t work!

One last thing: in our toy implementation, verifying a blockchain and creating a new one take the same amount of computing power. This is a gross oversimplification of what goes on in real-world systems: there, creating a new blockchain is computationally much more expensive than verifying an existing one. To create a new one, huge numbers of possible hash values have to be tried out because they must fulfill certain criteria (e.g. a number of leading zeros). This makes the blockchain extremely safe.

In the cryptocurrency world people (so-called miners) get paid (of course also in cryptocurrency) for finding valid hash values (called mining). Indeed big mining farms have been established which consume huge amounts of computing power (and therefore electricity, which is one of the disadvantages of this technology). For more details consult my question on Bitcoin.SE and the answers and references given there: Is verification of a blockchain computationally cheaper than recreating it?

I hope that this post helped you understand the technological basis of this fascinating trend. Please share your thoughts on the technology and its potential in the comments below!



digest 0.6.22: More goodies!

Tue, 10/22/2019 - 03:31

[This article was first published on Thinking inside the box, and kindly contributed to R-bloggers.]

A new version of digest arrived at CRAN earlier today, and I just sent an updated package to Debian too.

digest creates hash digests of arbitrary R objects (using the md5, sha-1, sha-256, sha-512, crc32, xxhash32, xxhash64, murmur32, and spookyhash algorithms) permitting easy comparison of R language objects. It is a fairly widely-used package (currently listed at 868k monthly downloads) as many tasks may involve caching of objects for which it provides convenient general-purpose hash key generation.
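For anyone who has not used it, the heart of the package is a single generic call; a minimal illustration hashing a built-in dataset:

library(digest)

digest(mtcars)                    # md5 by default
digest(mtcars, algo = "sha256")   # or any of the other algorithms listed above
digest(mtcars, algo = "xxhash64")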

This release comes pretty much exactly one month after the very nice 0.6.21 release but contains five new pull requests. Matthew de Queljoe did a little bit of refactoring of the vectorised digest function he added in 0.6.21. Ion Suruceanu added a CFB cipher for AES. Bill Denney both corrected and extended sha1. And Jim Hester made the windows-side treatment of filenames UTF-8 compliant.

CRANberries provides the usual summary of changes to the previous version.

For questions or comments use the issue tracker off the GitHub repo.

If you like this or other open-source work I do, you can now sponsor me at GitHub. For the first year, GitHub will match your contributions.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.



Access the free economic database DBnomics with R

Tue, 10/22/2019 - 00:00

[This article was first published on Macroeconomic Observatory - R, and kindly contributed to R-bloggers.]

DBnomics : the world’s economic database

Explore all the economic data from different providers (national and international statistical institutes, central banks, etc.), for free, following the link db.nomics.world.

You can also retrieve all the economic data through the rdbnomics package here. This blog post describes the different ways to do so.

Fetch time series by ids

First, let’s assume that we know which series we want to download. A series identifier (ids) is defined by three values, formatted like this: provider_code/dataset_code/series_code.

Fetch one series from dataset ‘Unemployment rate’ (ZUTN) of AMECO provider

library(magrittr)
library(dplyr)
library(ggplot2)
library(rdbnomics)

df <- rdb(ids = 'AMECO/ZUTN/EA19.1.0.0.0.ZUTN') %>%
  filter(!is.na(value))

In such a data.frame (data.table or tibble), you will always find at least the following columns:

  • provider_code
  • dataset_code
  • dataset_name
  • series_code
  • series_name
  • original_period (character string)
  • period (date of the first day of original_period)
  • original_value (character string)
  • value
  • @frequency (harmonized frequency generated by DBnomics)

The other columns depend on the provider and on the dataset. They always come in pairs (for the code and the name). In the data.frame df, you have:

  • unit (code) and Unit (name)
  • geo (code) and Country (name)
  • freq (code) and Frequency (name)

ggplot(df, aes(x = period, y = value, color = series_name)) + geom_line(size = 1.2) + geom_point(size = 2) + dbnomics()

In the event that you only use the argument ids, you can drop it and run:

Fetch two series from dataset ‘Unemployment rate’ (ZUTN) of AMECO provider

df <- rdb(ids = c('AMECO/ZUTN/EA19.1.0.0.0.ZUTN',
                  'AMECO/ZUTN/DNK.1.0.0.0.ZUTN')) %>%
  filter(!is.na(value))

ggplot(df, aes(x = period, y = value, color = series_name)) + geom_line(size = 1.2) + geom_point(size = 2) + dbnomics()

Fetch two series from different datasets of different providers

df <- rdb(ids = c('AMECO/ZUTN/EA19.1.0.0.0.ZUTN',
                  'Eurostat/une_rt_q/Q.SA.TOTAL.PC_ACT.T.EA19')) %>%
  filter(!is.na(value))

ggplot(df, aes(x = period, y = value, color = series_name)) + geom_line(size = 1.2) + geom_point(size = 2) + dbnomics() + theme(legend.text = element_text(size=7))

Fetch time series by mask

The code mask notation is a very concise way to select one or many time series at once. It is compatible only with some providers: BIS, ECB, Eurostat, FED, ILO, IMF, INSEE, OECD, WTO.

Fetch one series from dataset ‘Balance of Payments’ (BOP) of IMF

df <- rdb('IMF', 'BOP', mask = 'A.FR.BCA_BP6_EUR') %>%
  filter(!is.na(value))

ggplot(df, aes(x = period, y = value, color = series_name)) + geom_line(size = 1.2) + geom_point(size = 2) + dbnomics()

In the event that you only use the arguments provider_code, dataset_code and mask, you can drop the name mask and run:

Fetch two series from dataset ‘Balance of Payments’ (BOP) of IMF

You just have to add a + between two different values of a dimension.

df <- rdb('IMF', 'BOP', mask = 'A.FR+ES.BCA_BP6_EUR') %>% filter(!is.na(value))

ggplot(df, aes(x = period, y = value, color = series_name)) + geom_line(size = 1.2) + geom_point(size = 2) + dbnomics()

Fetch all series along one dimension from dataset ‘Balance of Payments’ (BOP) of IMF

df <- rdb('IMF', 'BOP', mask = 'A..BCA_BP6_EUR') %>%
  filter(!is.na(value)) %>%
  arrange(desc(period), REF_AREA) %>%
  head(100)

Fetch series along multiple dimensions from dataset ‘Balance of Payments’ (BOP) of IMF

df <- rdb('IMF', 'BOP', mask = 'A.FR.BCA_BP6_EUR+IA_BP6_EUR') %>%
  filter(!is.na(value)) %>%
  group_by(INDICATOR) %>%
  top_n(n = 50, wt = period)

Fetch time series by dimensions

Searching by dimensions is a less concise way to select time series than using the code mask, but it works with all the different providers. You have a “Description of series code” at the bottom of each dataset page on the DBnomics website.

Fetch one value of one dimension from dataset ‘Unemployment rate’ (ZUTN) of AMECO provider

df <- rdb('AMECO', 'ZUTN', dimensions = list(geo = "ea19")) %>%
  filter(!is.na(value))
# or
# df <- rdb('AMECO', 'ZUTN', dimensions = '{"geo": ["ea19"]}') %>%
#   filter(!is.na(value))

ggplot(df, aes(x = period, y = value, color = series_name)) + geom_line(size = 1.2) + geom_point(size = 2) + dbnomics()

Fetch two values of one dimension from dataset ‘Unemployment rate’ (ZUTN) of AMECO provider

df <- rdb('AMECO', 'ZUTN', dimensions = list(geo = c("ea19", "dnk"))) %>%
  filter(!is.na(value))
# or
# df <- rdb('AMECO', 'ZUTN', dimensions = '{"geo": ["ea19", "dnk"]}') %>%
#   filter(!is.na(value))

ggplot(df, aes(x = period, y = value, color = series_name)) + geom_line(size = 1.2) + geom_point(size = 2) + dbnomics()

Fetch several values of several dimensions from dataset ‘Doing business’ (DB) of World Bank

df <- rdb('WB', 'DB', dimensions = list(country = c("DZ", "PE"),
                                        indicator = c("ENF.CONT.COEN.COST.ZS", "IC.REG.COST.PC.FE.ZS"))) %>%
  filter(!is.na(value))
# or
# df <- rdb('WB', 'DB', dimensions = '{"country": ["DZ", "PE"], "indicator": ["ENF.CONT.COEN.COST.ZS", "IC.REG.COST.PC.FE.ZS"]}') %>%
#   filter(!is.na(value))

ggplot(df, aes(x = period, y = value, color = series_name)) + geom_line(size = 1.2) + geom_point(size = 2) + dbnomics()

Fetch time series found on the web site

When you don’t know the codes of the dimensions, provider, dataset or series, you can:

  • go to the page of a dataset on DBnomics website, for example Doing Business,

  • select some dimensions by using the input widgets of the left column,

  • click on “Copy API link” in the menu of the “Download” button,

  • use the rdb_by_api_link function such as below.

df <- rdb_by_api_link("https://api.db.nomics.world/v22/series/WB/DB?dimensions=%7B%22country%22%3A%5B%22FR%22%2C%22IT%22%2C%22ES%22%5D%7D&q=IC.REG.PROC.FE.NO&observations=1&format=json&align_periods=1&offset=0&facets=0") %>% filter(!is.na(value))

ggplot(df, aes(x = period, y = value, color = series_name)) + geom_step(size = 1.2) + geom_point(size = 2) + dbnomics()

Fetch time series from the cart

On the cart page of the DBnomics website, click on “Copy API link” and copy-paste it as an argument of the rdb_by_api_link function. Please note that when you update your cart, you have to copy this link again, because the link itself contains the ids of the series in the cart.


df <- rdb_by_api_link("https://api.db.nomics.world/v22/series?observations=1&series_ids=BOE/6008/RPMTDDC,BOE/6231/RPMTBVE") %>% filter(!is.na(value))

ggplot(df, aes(x = period, y = value, color = series_name)) + geom_line(size = 1.2) + geom_point(size = 2) + dbnomics()

Proxy configuration or connection error Could not resolve host

When using the functions rdb or rdb_..., you may come across the following error:

Error in open.connection(con, "rb") : Could not resolve host: api.db.nomics.world

To get round this situation, you have two options:

  1. configure curl to use a specific and authorized proxy.

  2. use the default R internet connection i.e. the Internet Explorer proxy defined in internet2.dll.

Configure curl to use a specific and authorized proxy

In rdbnomics, by default the function curl_fetch_memory (of the package curl) is used to fetch the data. If a specific proxy must be used, it is possible to define it permanently with the package option rdbnomics.curl_config or on the fly through the argument curl_config. Because the object is a named list, its elements are passed to the connection (the curl_handle object created internally with new_handle()) with handle_setopt() before using curl_fetch_memory.

To see the available parameters, run names(curl_options()) in R or visit the website https://curl.haxx.se/libcurl/c/curl_easy_setopt.html. Once they are chosen, you define the curl object as follows:

h <- list(
  proxy = "",
  proxyport = <port>,
  proxyusername = "",
  proxypassword = ""
)

Set the connection up for a session

The curl connection can be set up for a session by modifying the following package option:

options(rdbnomics.curl_config = h)

When fetching the data, the following command is executed:

hndl <- curl::new_handle()
curl::handle_setopt(hndl, .list = getOption("rdbnomics.curl_config"))
curl::curl_fetch_memory(url = <...>, handle = hndl)

After configuration, just use the standard functions of rdbnomics e.g.:

df1 <- rdb(ids = 'AMECO/ZUTN/EA19.1.0.0.0.ZUTN')

This option of the package can be disabled with:

options(rdbnomics.curl_config = NULL)

Use the connection only for a function call

If a complete configuration is not needed but just an “on the fly” execution, then use the argument curl_config of the functions rdb and rdb_...:

df1 <- rdb(ids = 'AMECO/ZUTN/EA19.1.0.0.0.ZUTN', curl_config = h)

Use the default R internet connection

To retrieve the data with the default R internet connection, rdbnomics will use the base function readLines.

Set the connection up for a session

To activate this feature for a session, you need to enable an option of the package:

options(rdbnomics.use_readLines = TRUE)

And then use the standard function as follows:

df1 <- rdb(ids = 'AMECO/ZUTN/EA19.1.0.0.0.ZUTN')

This configuration can be disabled with:

options(rdbnomics.use_readLines = FALSE)

Use the connection only for a function call

If you just want to do it once, you may use the argument use_readLines of the functions rdb and rdb_...:

df1 <- rdb(ids = 'AMECO/ZUTN/EA19.1.0.0.0.ZUTN', use_readLines = TRUE)


Trends in U.S. Border Crossing Entry since 1996

Mon, 10/21/2019 - 19:54

[This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers.]

Introduction

Since the 2016 election, inland U.S. border security has been a huge topic. Construction of the new border wall has started, and tensions between Mexico and the U.S. have intensified along with it. Many people predicted a decrease not only in illegal border entries but also in legal ones, which could hurt tourism and discourage trade across the borders. Now, at the end of 2019, three years after the 2016 election campaign, what can we learn from the statistics on inland U.S. border entry? Did the political agenda affect the way people come into our country? I would like to answer these questions by visualizing the findings.

Data

I used a dataset from Kaggle.com. The data are originally collected by U.S. Customs and Border Protection (CBP) every quarter and then cleaned, assessed and maintained by the Bureau of Transportation Statistics (BTS). The dataset contains statistics for inbound crossings at the U.S.-Canada and U.S.-Mexico borders at the port level for every month since the beginning of 1996.

The original dataset includes 349,000 rows with the following columns: Port Name, State, Port Code, Border, Date, Measure, Value and Location. Measure is the method of transportation used for border entry and has 12 categories: Bus Passengers, Buses, Pedestrians, Personal Vehicle Passengers, Personal Vehicles, Rail Containers Empty, Rail Containers Full, Train Passengers, Trains, Truck Containers Empty, Truck Containers Full and Trucks. The Value column contains the total number of crossings.

It is important to be aware that this dataset doesn’t count the number of unique vehicles, passengers or pedestrians, but rather the number of crossings. For example, the same truck can go back and forth across the border many times a day, and each crossing is recorded. The data also don’t include the nationality of the passengers or pedestrians, nor the reason for the border crossing.

I used the R package dplyr to clean the data. Border names were shortened to Canada and Mexico, and the Location column was split into two parts: Longitude and Latitude. The year 2019 was excluded from the analysis because its data are not yet complete and would not provide good insight for this project. The exploratory data analysis (EDA) was done mostly with the R packages ggplot2 and leaflet. I used shinydashboard to present the visualizations and shinyapps.io as a server to host the findings.
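The post does not show its preparation code, so here is a hedged sketch of the cleaning steps described above; the object name borders_raw, the exact column values and the "POINT (lon lat)" location format are assumptions about the Kaggle file, not taken from the post:

library(dplyr)
library(stringr)

borders_clean <- borders_raw %>%
  mutate(
    # shorten the border labels to Canada / Mexico
    Border    = recode(Border,
                       "US-Canada Border" = "Canada",
                       "US-Mexico Border" = "Mexico"),
    # split Location, assumed to look like "POINT (-106.5286 31.8100)"
    Longitude = as.numeric(str_extract(Location, "-?[0-9.]+(?= )")),
    Latitude  = as.numeric(str_extract(Location, "-?[0-9.]+(?=\\))"))
  ) %>%
  # drop the incomplete year 2019 (crude text filter; adjust if Date is already parsed)
  filter(!str_detect(Date, "2019"))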

 ShinyApp / Analysis

First I wanted to observe the location of border ports and their distribution across the U.S. to see the big picture.

 

 

There are a total of 116 ports in this dataset. Of those, 89 are on the U.S.-Canada border and 27 are on the U.S.-Mexico border, so there are about three times more ports on the U.S.-Canada border than on the U.S.-Mexico border.

However, the total number of incoming crossings from 1996 to 2018 showed the exact opposite: about 7 billion total border crossings at the U.S.-Mexico border versus about 2.6 billion at the U.S.-Canada border. Even though more ports are available on the northern border of the U.S., fewer people come in through them.

Now moving on, I wanted to find out the methods of border entry and how they differ between the U.S.-Canada and U.S.-Mexico borders.

Overall, the most common method of border entry was the personal vehicle. Here, Personal Vehicles counts the number of personal cars entering at the border, whereas Personal Vehicle Passengers counts the number of people in those vehicles. The next highest value, surprisingly, was Pedestrians. Bus Passengers and Trucks came after.

When the measure was compared between the two borders, I could see a difference: the Mexico border has a significant number of pedestrians, whereas the Canada border does not.

When I looked into the number of entries by state, I found a trend similar to the comparison of the two borders. There are more entries across the southern U.S. border, especially in Texas, California and Arizona. On the northern U.S. border, the largest numbers of entries were in New York and Michigan.

Most of the southern states had a transportation mix similar to Texas, as shown in the left graph above: mostly Personal Vehicles and Pedestrians. The northern states looked similar to New York, as shown in the right graph above. This indicates that the northern and southern borders differ a lot in the number of people walking into the country.

 

Two exceptions to the statement were Alaska and Ohio. Alaska had more significant number of Bus Passengers and Train Passengers suggesting that most of the people coming into the U.S. from Alaska border ports are travelers. Ohio, the state with the lowest number of border entry, was reported with only one method of transportation: Personal Vehicles. 

When I looked into the change in number of truck entering the U.S. in U.S.-Canada and U.S.-Mexico Borders, I saw some patterns. First, both Canada and Mexico sides had a sudden drop of number in certain year such as 2009. That was the year of global financial crisis. Trucks can be used for in-land trades and it is obvious from the graph above that the economy has a big impact on the border entry. Second, I was able to see the increase in number of trucks entering at the Mexico border, possibly suggesting a better trade condition between U.S. and Mexico. 

As I wanted to see the impact of the 2016 election and the increase in border security issue, I looked at the number of Pedestrians and Personal Vehicle Passengers over the years. Surprisingly, unlike what I have guessed that issue of border security and building the border wall would discourage the number of legal border entry in U.S.-Mexico border, the statistics show the increase in border entry. 

The number of incoming buses and bus passengers into the U.S. seem to be decreasing in both Canada and Mexico sides. Usually buses can be used for the tourism and as other methods of traveling such as flight and train have advanced over the years, use of bus as a method for traveling seems to have declined. 

Two methods of transportation used for border entry that look distinctly different were Pedestrian and Train Passengers. As seen in previous graphs, most of the pedestrians are coming into the U.S. using U.S.-Mexico border as it is more accessible to walk across the border in the southern side of the States. Train is used more often in U.S.-Canada border and the number of its use has been increasing. This increase in use of train can be related to the decrease use of bus as a traveling method. 

Different methods of transportation into the U.S. also show unique trend when the data was looked according to the months. Transportation methods that can be used for trade such as Truck, Truck Container Full, Rail Container Full seem to have a steady number of incoming in both Mexico and Canada borers no matter what month it is. This indicates that the business related border entry stays steady in all-year-around. 

However this changes, when it comes to the transportation related to traveling such as Personal Vehicle Passengers, Bus Passengers and Train Passengers. Number of border entry in U.S.-Canada increases significantly in summer months while number of border entry in U.S.-Mexico stays around same all-year-around. As the northern border get a harsh cold winter, it is obvious to have more travelers in summer months. 

Conclusion / Further Research

From the visualization and statistics, the number of border entry depends on the economy and business rather than the politics. However this dataset itself can’t explain the reason behinds the change in number of border entry as there are still many factors that need to be considered. For the further research, I would like to obtain the data regarding the citizenship of people that enter the U.S. borders as well as their intension or the reason for the entry. This can furthermore support how economy or tourism change the trend in the border entry. 

Thank you for reading my findings in U.S. Border Crossing Entry data. If you are interested in looking at the dataset I used, my ShinyApps, and the code, you can follow with the links below. 

Dataset

ShinyApps

GitHub

 

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – NYC Data Science Academy Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Split Intermixed Names into First, Middle, and Last

Mon, 10/21/2019 - 18:36

[This article was first published on RLang.io | R Language Programming, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Data cleaning can be a challenge, so I hope this helps the process for someone out there. This is a tiny, but valuable function for those who deal with data collected from non-ideal forms. As nearly always, this depends on the tidyverse library. You may want to rename the function from fml, but it does best describe dealing with mangled data.

This function retuns the first, middle, and last names for a given name or list of names. Missing data is represented as NA.

Usage on Existing Dataframe

Setting up a dataframe with manged names and missing first, middle, and last names.

df <- data.frame(names = c("John Jacbon Jingle", "Heimer Schmitt", "Cher", "John Jacbon Jingle Heimer Schmitt", "Mr. Anderson", "Sir Patrick Stewart", "Sammy Davis Jr.")) %>% add_column(First = NA) %>% add_column(Middle = NA) %>% add_column(Last = NA)

Row names First Middle Last 1 John Jacob Jingle NA NA NA 2 Heimer Schmitt NA NA NA 3 Cher NA NA NA 4 John Jacob Jingle Heimer Schmitt NA NA NA 5 Mr. Anderson NA NA NA 6 Sir Patrick Stewart NA NA NA 7 Sammy Davis Jr. NA NA NA

Replacing the first, middle, and last name values…

df[,c("First","Middle","Last")] <- df$names %>% fml

Row names First Middle Last 1 John Jacbon Jingle John Jacbon Jingle 2 Heimer Schmitt Heimer NA Schmitt 3 Cher Cher NA NA 4 John Jacbon Jingle Heimer Schmitt John Jacbon-Jingle-Heimer Schmitt 5 Mr. Anderson NA NA Anderson 6 Sir Patrick Stewart Patrick NA Stewart 7 Sammy Davis Jr. Sammy NA Davis

Values Changed

  • In roe 1 All names were found
  • In row 2 the middle name was skipped
  • In row 3 only a first name was found
  • In row 4 the middle names were collapsed
  • In row 5 only a last name was found
  • In row 6 the title Sir was omitted
  • In row 7 the title Jr. was omitted

Using with a single name.

fml("Matt Sandy")

V1 V2 V3 Matt Sandy Matt NA Sandy The Function

fml <- function(mangled_names) { titles <- c("MASTER", "MR", "MISS", "MRS", "MS", "MX", "JR", "SR", "M", "SIR", "GENTLEMAN", "SIRE", "MISTRESS", "MADAM", "DAME", "LORD", "LADY", "ESQ", "EXCELLENCY","EXCELLENCE", "HER", "HIS", "HONOUR", "THE", "HONOURABLE", "HONORABLE", "HON", "JUDGE") mangled_names %>% sapply(function(name) { split <- str_split(name, " ") %>% unlist original_length <- length(split) split <- split[which(!split %>% toupper %>% str_replace_all('[^A-Z]','') %in% titles)] case_when( (length(split) < original_length) & (length(split) == 1) ~ c(NA, NA, split[1]), length(split) == 1 ~ c(split[1],NA,NA), length(split) == 2 ~ c(split[1],NA, split[2]), length(split) == 3 ~ c(split[1], split[2], split[3]), length(split) > 3 ~ c(split[1], paste(split[2:(length(split)-1)], collapse = "-"), split[length(split)]) ) }) %>% t %>% return }

Improvements

I recommend improving upon this if you want to integrate this function (or attributes of this function) into your workflow. Naming the output or using lists so you can just get partial returns fml("John Smith")$Last could come in handy.

Additional cases could also be created, such as when names are entered Last, First M.. Tailoring the function to your project will yield best results.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: RLang.io | R Language Programming. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Gold-Mining Week 7 (2019)

Mon, 10/21/2019 - 17:15

[This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Week 7 Gold Mining and Fantasy Football Projection Roundup now available.

The post Gold-Mining Week 7 (2019) appeared first on Fantasy Football Analytics.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Fantasy Football Analytics. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Widening Multiple Columns Redux

Mon, 10/21/2019 - 17:14

[This article was first published on R on kieranhealy.org, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Last year I wrote about the slightly tedious business of spreading (or widening) multiple value columns in Tidyverse-flavored R. Recent updates to the tidyr package, particularly the introduction of the pivot_wider() and pivot_longer() functions, have made this rather more straightforward to do than before. Here I recapitulate the earlier example with the new tools.

The motivating case is something that happens all the time when working with social science data. We’ll load the tidyverse, and then quickly make up some sample data to work with.

library(tidyverse) gen_cats <- function(x, N = 1000) { sample(x, N, replace = TRUE) } set.seed(101) N <- 1000 income <- rnorm(N, 100, 50) vars <- list(stratum = c(1:8), sex = c("M", "F"), race = c("B", "W"), educ = c("HS", "BA")) df <- as_tibble(map_dfc(vars, gen_cats)) df <- add_column(df, income)

What we have are measures of sex, race, stratum (from a survey, say), education, and income. Of these, everything is categorical except income. Here’s what it looks like:

df ## # A tibble: 1,000 x 5 ## stratum sex race educ income ## ## 1 6 F W HS 83.7 ## 2 5 F W BA 128. ## 3 3 F B HS 66.3 ## 4 3 F W HS 111. ## 5 6 M W BA 116. ## 6 7 M B HS 159. ## 7 8 M W BA 131. ## 8 3 M W BA 94.4 ## 9 7 F B HS 146. ## 10 2 F W BA 88.8 ## # … with 990 more rows

Let’s say we want to transform this to a wider format, specifically by widening the educ column, so we end up with columns for both the HS and BA categories, and as we do so we want to calculate both the mean of income and the total n within each category of educ.

For comparison, one could do this with data.table in the following way:

data.table::setDT(df) df_wide_dt <- data.table::dcast(df, sex + race + stratum ~ educ, fun = list(mean, length), value.var = "income") head(df_wide_dt) ## sex race stratum income_mean_BA income_mean_HS income_length_BA income_length_HS ## 1: F B 1 93.78002 99.25489 19 6 ## 2: F B 2 89.66844 93.04118 11 16 ## 3: F B 3 112.38483 94.99198 13 16 ## 4: F B 4 107.57729 96.06824 14 15 ## 5: F B 5 91.02870 92.56888 11 15 ## 6: F B 6 92.99184 116.06218 15 15

Until recently, widening or spreading on multiple values like this was kind of a pain when working in the tidyverse. You can see how I approached it before in the earlier post. (The code there still works fine.) Previously, you had to put spread() and gather() through a slightly tedious series of steps, best wrapped in a function you’d have to write yourself. No more! Since tidyr v1.0.0 has been released, though, the new function pivot_wider() (and its complement, pivot_longer()) make this common operation more accessible.

Here’s how to do it now. Remember that in the tidyverse approach, we’ll first do the summary calculations, mean and length, respectively, though we’ll use dplyr’s n() for the latter. Then we widen the long result.

tv_pivot <- df %>% group_by(sex, race, stratum, educ) %>% summarize(mean_inc = mean(income), n = n()) %>% pivot_wider(names_from = (educ), values_from = c(mean_inc, n))

This gives us an object that’s equivalent to the df_wide_dt object created by data.table.

tv_pivot ## # A tibble: 32 x 7 ## # Groups: sex, race, stratum [32] ## sex race stratum mean_inc_BA mean_inc_HS n_BA n_HS ## ## 1 F B 1 93.8 99.3 19 6 ## 2 F B 2 89.7 93.0 11 16 ## 3 F B 3 112. 95.0 13 16 ## 4 F B 4 108. 96.1 14 15 ## 5 F B 5 91.0 92.6 11 15 ## 6 F B 6 93.0 116. 15 15 ## 7 F B 7 102. 121. 13 13 ## 8 F B 8 105. 88.3 14 8 ## 9 F W 1 92.6 110. 19 13 ## 10 F W 2 98.5 101. 15 19 ## # … with 22 more rows

And there you have it. Be sure to check out the complement of pivot_wider(), pivot_longer(), also.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R on kieranhealy.org. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Avoiding embarrassment by testing data assumptions with expectdata

Mon, 10/21/2019 - 12:42

[This article was first published on Dan Garmat's Blog -- R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Expectdata is an R package that makes it easy to test assumptions about a data frame before conducting analyses. Below is a concise tour of some of the data assumptions expectdata can test for you. For example,

Note: assertr is an ropensci project that aims to have similar functionality. Pros and cons haven’t been evaluated yet, but ropensci is a big pro for assertR.

Check for unexpected duplication library(expectdata) expect_no_duplicates(mtcars, "cyl") #> [1] "top duplicates..." #> # A tibble: 3 x 2 #> # Groups: cyl [3] #> cyl n #> #> 1 8 14 #> 2 4 11 #> 3 6 7 #> Error: Duplicates detected in column: cyl

The default return_df == TRUE option allows for using these function as part of a dplyr piped expression that is stopped when data assumptions are not kept.

library(dplyr, warn.conflicts = FALSE) library(ggplot2) mtcars %>% filter(cyl == 4) %>% expect_no_duplicates("wt", return_df = TRUE) %>% ggplot(aes(x = wt, y = hp, color = mpg, size = mpg)) + geom_point() #> [1] "no wt duplicates...OK"

If there are no expectations violated, an “OK” message is printed.

After joining two data sets you may want to verify that no unintended duplication occurred. Expectdata allows comparing pre- and post- processing to ensure they have the same number of rows before continuing.

expect_same_number_of_rows(mtcars, mtcars, return_df = FALSE) #> [1] "Same number of rows...OK" expect_same_number_of_rows(mtcars, iris, show_fails = FALSE, stop_if_fail = FALSE, return_df = FALSE) #> Warning: Different number of rows: 32 vs: 150 # can also compare to no df2 to check is zero rows expect_same_number_of_rows(mtcars, show_fails = FALSE, stop_if_fail = FALSE, return_df = FALSE) #> Warning: Different number of rows: 32 vs: 0

Can see how the stop_if_fail = FALSE option will turn failed expectations into warnings instead of errors.

Check for existance of problematic rows

Comparing a data frame to an empty, zero-length data frame can also be done more explicitly. If the expectations fail, cases can be shown to begin the next step of exploring why these showed up.

expect_zero_rows(mtcars[mtcars$cyl == 0, ], return_df = TRUE) #> [1] "No rows found as expected...OK" #> [1] mpg cyl disp hp drat wt qsec vs am gear carb #> <0 rows> (or 0-length row.names) expect_zero_rows(mtcars$cyl[mtcars$cyl == 0]) #> [1] "No rows found as expected...OK" #> numeric(0) expect_zero_rows(mtcars, show_fails = TRUE) #> mpg cyl disp hp drat wt qsec vs am gear carb #> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 #> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 #> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 #> Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 #> Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 #> Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 #> Error: Different number of rows: 32 vs: 0

This works well at the end of a pipeline that starts with a data frame, runs some logic to filter to cases that should not exist, then runs expect_zero_rows() to check no cases exist.

# verify no cars have zero cylindars mtcars %>% filter(cyl == 0) %>% expect_zero_rows(return_df = FALSE) #> [1] "No rows found as expected...OK"

Can also check for NAs in a vector, specific columns of a data frame, or a whole data frame.

expect_no_nas(mtcars, "cyl", return_df = FALSE) #> [1] "Detected 0 NAs...OK" expect_no_nas(mtcars, return_df = FALSE) #> [1] "Detected 0 NAs...OK" expect_no_nas(c(0, 3, 4, 5)) #> [1] "Detected 0 NAs...OK" #> [1] 0 3 4 5 expect_no_nas(c(0, 3, NA, 5)) #> Error: Detected 1 NAs

Several in one dplyr pipe expression:

mtcars %>% expect_no_nas(return_df = TRUE) %>% expect_no_duplicates("wt", stop_if_fail = FALSE) %>% filter(cyl == 4) %>% expect_zero_rows(show_fails = TRUE) #> [1] "Detected 0 NAs...OK" #> [1] "top duplicates..." #> # A tibble: 2 x 2 #> # Groups: wt [2] #> wt n #> #> 1 3.44 3 #> 2 3.57 2 #> Warning: Duplicates detected in column: wt #> mpg cyl disp hp drat wt qsec vs am gear carb #> 1 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 #> 2 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 #> 3 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 #> 4 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 #> 5 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 #> 6 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 #> Error: Different number of rows: 11 vs: 0 var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Dan Garmat's Blog -- R. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

(Much) faster unnesting with data.table

Mon, 10/21/2019 - 02:00

[This article was first published on Johannes B. Gruber on Johannes B. Gruber, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Today I was struggling with a relatively simple operation: unnest() from the tidyr package. What it’s supposed to do is pretty simple. When you have a data.frame where one or multiple columns are lists, you can unlist these columns while duplicating the information in other columns if the length of an element is larger than 1.

library(tibble) df <- tibble( a = LETTERS[1:5], b = LETTERS[6:10], list_column = list(c(LETTERS[1:5]), "F", "G", "H", "I") ) df ## # A tibble: 5 x 3 ## a b list_column ## ## 1 A F ## 2 B G ## 3 C H ## 4 D I ## 5 E J library(tidyr) unnest(df, list_column) ## # A tibble: 9 x 3 ## a b list_column ## ## 1 A F A ## 2 A F B ## 3 A F C ## 4 A F D ## 5 A F E ## 6 B G F ## 7 C H G ## 8 D I H ## 9 E J I

I came across this a lot while working on data from Twitter since individual tweets can contain multiple hashtags, mentions, URLs and so on, which is why they are stored in lists. unnest() is really helpful and very flexible in my experience since it makes creating, for example, a table of top 10 hashtags a piece of cake.

However, on large datasets, unnest() has its limitations (as I found out today). On a set with 1.8 million tweets, I was barely able to unnest the URL column and it would take forever on my laptop or simply crash at some point. In a completely new environment, unnesting the data took half an hour.

So let’s cut this time down to 10 seconds with data.table. In data.table, you would unlist like this1:

library(data.table) dt <- as.data.table(df) dt[, list(list_column = as.character(unlist(list_column))), by = list(a, b)] ## a b list_column ## 1: A F A ## 2: A F B ## 3: A F C ## 4: A F D ## 5: A F E ## 6: B G F ## 7: C H G ## 8: D I H ## 9: E J I

This is quite a bit longer than the tidyr code. So I wrapped it in a short function (note, that most of the code deals with quasiquotation so we can use it the same way as the original unnest()):

library(rlang) unnest_dt <- function(tbl, col) { tbl <- as.data.table(tbl) col <- ensyms(col) clnms <- syms(setdiff(colnames(tbl), as.character(col))) tbl <- as.data.table(tbl) tbl <- eval( expr(tbl[, as.character(unlist(!!!col)), by = list(!!!clnms)]) ) colnames(tbl) <- c(as.character(clnms), as.character(col)) tbl }

On the surface, it does the same as unnest:

unnest_dt(df, list_column) ## a b list_column ## 1: A F A ## 2: A F B ## 3: A F C ## 4: A F D ## 5: A F E ## 6: B G F ## 7: C H G ## 8: D I H ## 9: E J I

But the function is extremely fast and lean. To show this, I do some benchmarking on a larger object. I scale the example ‘data.frame’ up from 5 to 50,000 rows since the overhead of loading a function will influence runtime much stronger on small-n data.

library(bench) df_large <- dplyr::sample_frac(df, 10000, replace = TRUE) res <- mark( tidyr = unnest(df_large, list_column), dt = unnest_dt(df_large, list_column) ) res ## # A tibble: 2 x 6 ## expression min median `itr/sec` mem_alloc `gc/sec` ## ## 1 tidyr 52.4s 52.4s 0.0191 16.77GB 6.38 ## 2 dt 14.3ms 18.5ms 50.0 9.56MB 10.00 summary(res, relative = TRUE) ## # A tibble: 2 x 6 ## expression min median `itr/sec` mem_alloc `gc/sec` ## ## 1 tidyr 3666. 2832. 1 1796. 1 ## 2 dt 1 1 2617. 1 1.57

As you can see, data.table is 3666 times faster. That is pretty insane. But what is often even more important, the memory consumption is negligible with the data.table function compared to tidyr. When trying to unnest my Twitter dataset with 1.8 million tweets, my computer would choke on the memory issue and even throw an error if I had some other large objects loaded.

Admittedly the function is not perfect. It is far less flexible than unnest, especially since it only runs on one variable at the time. However, this covers 95% of my usage of unnest and I would only consider including it in a script if performance is key.

  1. Source: this answer from @akrun: https://stackoverflow.com/a/40420690/5028841, which I think should be added to data.table’s documentation somewhere.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Johannes B. Gruber on Johannes B. Gruber. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

IPO Exploration

Mon, 10/21/2019 - 02:00

[This article was first published on R Views, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R Views. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

rmangal: making ecological networks easily accessible

Mon, 10/21/2019 - 02:00

[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In early September, the version 2.0.0 of rmangal was approved by
rOpenSci, four weeks later it made it to CRAN. Following-up on our experience we
detail below the reasons why we wrote rmangal, why we submitted our package to
rOpenSci and how the peer review improved our package.

Mangal, a database for ecological networks

Ecological networks are defined as a set of species populations (the nodes of
the network) connected through ecological interactions (the edges). Interactions
are ecological processes in which one species affects another. Although
predation is probably the most known and documented interaction, other less
noticeable associations are just as essential to ecosystem functioning. For
instance, a mammal that unintentionally disperses viable seeds attached to its
fur might help plants to thrive. All of these interactions occur simultaneously,
shaping ecosystem functioning and making them as complex as they are
fascinating.

Recording and properly storing these interactions help ecologists to better
understand ecosystems. That is why they are currently compiling datasets to
explore how species associations vary over environmental gradients and how
species lost might affect ecosystem functioning. This fundamental research
question should help us understanding how ecological networks will respond to
global change. To this end, the Mangal project https://mangal.io/#/
standardizes ecological networks and eases their access. Every dataset contains
a collection of networks described in a specific reference (a scientific
publication, a book, etc.). For every network included in the database, Mangal
includes all the species names and several taxonomic identifiers
(gbif, eol,
tsn etc.) as well as all interactions and their types.
Currently, Mangal includes 172 datasets, which represents over 1300 ecological
networks distributed worldwide.

An R client to make ecological networks easily accessible

In 2016, the first paper describing the project was published1. In
2018, a substantial effort was made in order to improve the data structure and
gather new networks from existing publications. In 2019, the web API was
rewritten, a new website launched and hundreds of new interactions were added.


Because of all these modifications, the first version of rmangal was obsolete
and a new version needed. It is worth explaining here why the R client is an
important component of the Mangal project. Even though Mangal has a documented
RESTful API, this web technology is not commonly
used by ecologists. On the contrary, providing a R client ensures that the
scientific community that documents these interactions in the field can access
them, as easily as possible. The same argument holds true for the Julia
client
that Timothée Poisot
wrote because Julia is increasingly popular among
theoreticians, that can test ecological theory with such datasets.

We had two main objectives for rmangal 2.0.0. First, the rmangal
package had to allow users to search for all entries in the database in a very
flexible way. From a technical point this means that we had to write functions
to query all the endpoints of the new web API. The second goal was to
make the package as user friendly as possible. To do so, we used explicit and
consistent names for functions and arguments. We then designed a simple workflow, and
documented how to use other field related packages (such as igraph) to
visualize and analyze networks (see below). You can find further details in the vignette “get
started with rmangal”
.

# Loading dependancies library(rmangal) library(magrittr) library(ggraph) library(tidygraph) # Retrieving all ecological networks documented in Haven, 1992 havens_1992 <- search_references(doi="10.1126/science.257.5073.1107") %>% get_collection() # Coerce and visualize the first network object return by mangal with ggraph ggraph(as_tbl_graph(havens_1992[[1]])) + geom_edge_link(aes(colour = factor(type))) + geom_node_point() + theme_graph(background = "white")
A successful peer review process

After some hard work behind the screen and once we deemed our two objectives
achieved, we decided to submit the rmangal package to rOpenSci for peer review. We
did so because we needed feedback, we needed qualified people to critically
assess whether our two main objectives were achieved. Given the strong expertise
of rOpenSci in software review, and given that our package was in-scope,
submitting rmangal to rOpenSci was an obvious choice.

We had very valuable
feedback
from Anna
Willoughby
and Thomas Lin Pedersen. They carefully assessed
our work and pointed out areas where improvement was required. One good example
of how their review made our package better concerns the dependencies. We
originally listed sf in Imports as we used it to filter networks based
on geographic coordinates. But the reviewers pointed out that this was not an
essential part of the package and that sf has several dependencies. This made us realize that for one extra feature, we were substantially
increasing the number of indirect dependencies. Following the reviewers’
suggestions, we moved sf to Suggests and added a message to warn users
that to use the spatial filtering
feature

requires sf to be installed. Similarly, based on another good comment, we
added a function to convert Mangal networks into tidygraph objects
and we documented how to plot Mangal networks with ggraph (and so we
added those packages in Suggests). Such improvements were very helpful to
properly connect rmangal to the existing R packages. The plethora of R packages
is one of its major strengths, and connecting a package properly to others makes
the entire ecosystem even stronger.

We are now looking for user experience feedback, not only for rmangal
(vignette) but also
for the web API (documentation) and the mangal.io
website
. We welcome suggestions and contributions,
especially for the documentation by opening new issues on GitHub
(mangal-api,
mangal-app,
rmangal). In the future, we
envision that rmangal will integrate functions to format ecological networks for
ecologists willing to add their datasets to Mangal. This will likely be the next
major release of rmangal.

Acknowledgments

We are thankful to all contributors to
rmangal
and to all ecologists
that have spent countless hours in collecting data. We would like to thank Anna
Willoughby
and Thomas Lin Pedersen for thorough reviews as
well as Noam Ross for handling the review
process.

  1. Poisot, T. et al. mangal – making ecological network analysis simple. Ecography 39, 384–390 (2016). https://doi.org/10.1111/ecog.00976
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

RcppGSL 0.3.7: Fixes and updates

Sun, 10/20/2019 - 17:01

[This article was first published on Thinking inside the box , and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A new release 0.3.7 of RcppGSL is now on CRAN. The RcppGSL package provides an interface from R to the GNU GSL using the Rcpp package.

Stephen Wade noticed that we were not actually freeing memory from the GSL vectors and matrices as we set out to do. And he is quite right: a dormant bug, present since the 0.3.0 release, has now been squashed. I had one boolean wrong, and this has now been corrected. I also took the opportunity to switch the vignette to prebuilt mode: Now a pre-made pdf is just included in a Sweave document, which makes the build more robust to tooling changes around the vignette processing. Lastly, the package was converted to the excellent tinytest unit test framework. Detailed changes below.

Changes in version 0.3.7 (2019-10-20)
  • A logic error was corrected in the wrapper class, vector and matrix memory is now properly free()’ed (Dirk in #22 fixing #20).

  • The introductory vignettes is now premade (Dirk in #23), and was updated lightly in its bibliography handling.

  • The unit tests are now run by tinytest (Dirk in #24).

Courtesy of CRANberries, a summary of changes to the most recent release is also available.

More information is on the RcppGSL page. Questions, comments etc should go to the issue tickets at the GitHub repo.

If you like this or other open-source work I do, you can now sponsor me at GitHub. For the first year, GitHub will match your contributions.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Permutation Feature Importance (PFI) of GRNN

Sun, 10/20/2019 - 06:00

[This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In the post https://statcompute.wordpress.com/2019/10/13/assess-variable-importance-in-grnn, it was shown how to assess the variable importance of a GRNN by the decrease in GoF statistics, e.g. AUC, after averaging or dropping the variable of interest. The permutation feature importance evaluates the variable importance in a similar manner by permuting values of the variable, which attempts to break the relationship between the predictor and the response.

Today, I added two functions to calculate PFI in the YAGeR project, e.g. the grnn.x_pfi() function (https://github.com/statcompute/yager/blob/master/code/grnn.x_pfi.R) calculating PFI of an individual variable and the grnn.pfi() function (https://github.com/statcompute/yager/blob/master/code/grnn.pfi.R) calculating PFI for all variables in the GRNN.

Below is an example showing how to use PFI to evaluate the variable importance. It turns out that the outcome looks very similar to the one created by the grnn.imp() function previously discussed.

.gist table { margin-bottom: 0; }

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: S+/R – Yet Another Blog in Statistical Computing. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Building a Corporate R Package for Pleasure and Profit

Sun, 10/20/2019 - 02:00

[This article was first published on R on technistema, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The “Great Restructuring” of our economy is underway. That’s the official name for what we know is happening: the best are rising to the top, and the mediocre are sinking to the bottom. It’s the Matthew Principle in-motion.

In Brynjolfsson and McAfee’s 2011 book Race Against the Machine, they detail how this New Economy will favor those that have the skill set or the capital to interface and invest in new technologies such as deep learning and robotics, which are becoming more ubiquitous every day.

Cal Newport’s Deep Work outlines two core abilities for thriving in this new economy:

  1. Be able to quickly master hard things
  1. Be able to produce at an elite level, in terms of both quality and speed. This need for speed (sorry) will be our focus.

Don’t repeat yourself (DRY) is a well-known maxim in software development, and most R programmers follow this rule and build functions to avoid duplicating code. But how often do you:

  • Reference the same dataset in different analyses
  • Create the same ODBC connection to a database
  • Tinker with the same colors and themes in ggplot
  • Produce markdown docs from the same template

and so on? Notice a pattern? The word “same” is sprinkled in each bullet point. I smell an opportunity to apply DRY!

If you work in a corporate or academic setting like me, you probably do these things pretty often. I’m going to show you how to wrap all of these tasks into a minimalist R package to save you time, which, as we’ve learned, is one of the keys to your success in the New Economy.

Tools

First some groundwork. I’ll assume if you work in R that you are using RStudio, which will be necessary to follow along. I’m using R version 3.5.1 on a Windows 10 machine (ahh, corporate America…). Note that the package we are about to develop is minimalist, which is a way of saying that we’re gonna cut corners to make a minimum viable product. We won’t get deep into documentation and dependencies much, as the packages we’ll require in our new package are more than likely already on your local machine.

Create an empty package project

We’ll be creating a package for the consulting firm Ketchbrook Analytics, a boutique shop from Connecticut who know their way around a %>% better than anyone.

Open RStudio and create a project in a new directory:


Select R Package and give it a name. I’ll call mine ketchR.

RStudio will now start a new session with an example “hello” function. Looks like we’re ready to get down to business.

Custom functions

Let’s start by adding a function to our package. A common task at Ketchbrook is mapping customer data with an outline for market area or footprint. We can easily wrap that into a simple function.

Create a new R file and name it ketchR.R. We’ll put all of our functions in here.

# To generate the footprint_polys data footprint_poly <- function() { #' Returns object of class SpatialPolygons of the AgC footprint. #' Utilizes the Tigris:: package. require(tidyverse) require(tigris) require(sf) # Get County Polygons states.raw <- tigris::states() states <- states.raw[states.raw@data$STUSPS %in% c("CA", "OR", "WA"),] states <- sf::st_as_sfc(states) states <- sf::st_union(states) states <- as(states, 'Spatial') return(states) }

So what we’ve done is create a function that utilizes the tigris package to grab shapefiles for states in our footprint. The function then unions those states into one contiguous polygon so we can easily overlay this using leaflet, ggmap, etc.

Try your new function out:

library(leaflet) leaflet() %>% addTiles() %>% addPolygons(data = footprint_poly())

There is no limit to what kinds of custom functions you can add in your package. Machine learning algs, customer segmentation, whatever you want you can throw in a function with easy access in your package.

Datasets

Let’s stay on our geospatial bent. Branch or store-level analysis is common in companies spread out over a large geographical region. In our example, Ketchbrook’s client has eight branches from Tijuana to Seattle. Instead of manually storing and importing a CSV or R data file each time we need to reference these locations, we can simply save the data set to our package.

In order to add a dataset to our package, we first need to pull it into our local environment either by reading a csv or grabbing it from somewhere else. I simply read in a csv from my local PC:

branches <- read.csv("O:\\exchange\\branches.csv", header = T)

This is what the data set looks like:

Now, we have to put this data in a very specific place, or our package won’t be able to find it. Like when my wife hides the dishwasher so I’m reluctantly forced to place dirty dishes on the counter.

First, create a folder in your current directory called “data.” Your directory should look like this now, btw:

Bonus points: use the terminal feature in RStudio to create the directory easily:

Now we need to save this branches data set into our new folder as an .RData file:

save(branches, file = "data/branches.RData") Now, we build

Let’s test this package out while there’s still a good chance we didn’t mess anything up. When we build the package, we are compiling it into the actual package as we know it. In RStudio, this is super simple. Navigate to the “Build” tab, and click “Install and Restart.”
If you’ve followed along, you shouldn’t see any errors, but if you do see errors, try updating your local packages.

Now, we should be able to call our package directly and use our branches dataset:

Cool, that works. Now let’s plot our branches with Leaflet quick to make sure footprint_poly() worked:

library(leaflet) leaflet() %>% addTiles() %>% addPolygons(data = ketchR::footprint_poly()) %>% addCircles(data = branches, lat = branches$lat, lng = branches$lon, radius = 40000, stroke = F, color = "red")

Niiiice.

Database connections

One of the most common tasks in data science is pulling data from databases. Let’s say that Ketchbrook stores data in a SQL Server. Instead of manually copy and pasting a connection script or relying on the RStudio session to cache the connection string, let’s just make a damn function.

get_db <- function(query = "SELECT TOP 10 * FROM datas.dbo.Customers") { #' Pull data from g23 database #' @param query: enter a SQL query; Microsoft SQL syntax please require(odbc) con <- dbConnect(odbc(), Driver = "SQL Server", Server = "datas", Database = "dataserver", UID = "user", PWD = rstudioapi::askForSecret("password"), Port = 6969) z <- odbc::dbGetQuery(con, query) return(z) odbc::dbDisconnect(con) }

Here, we’re building a function that lets us enter any query we want to bang against this SQL Server. The function creates the connection, prompts us to enter the password each time (we don’t store passwords in code…) and closes the connection when it’s through.

Let’s take it a step further. Many times you may pull a generic SELECT * query in order to leverage dplyr to do your real data munging. In this case, it’s easier to just make a function that does just that.

Let’s make another function that pulls a SELECT * FROM Customers.

get_customers <- function() { #' Pull most recent customer data from G23 - datascience.agc_Customers require(odbc) con <- dbConnect(odbc(), Driver = "SQL Server", Server = "datas", Database = "dataserver", UID = "user", PWD = rstudioapi::askForSecret("password"), Port = 6969) query1 <- "SELECT * FROM datas.dbo.Customers" z <- odbc::dbGetQuery(con, query1) return(z) odbc::dbDisconnect(con) }

Ahh, this alone saved me quarters-of-hours each week once I started using it in my own practice. Think hard about any piece of code that you may copy and paste on a regular basis — that’s a candidate for your packages stable of functions.

Branded ggplot visualizations

Ok now we’re getting to the primo honey, the real time-savers, the analyst-impresser parts of our package. We’re going to make it easy to produce consistent data visualizations which reflect a company’s image with custom colors and themes.

Although I personally believe the viridis palette is the best color scheme of all time, it doesn’t necessarily line up with Ketchbrook’s corporate color palette. So let’s make our own set of functions to use Ketchbrook’s palette is a ‘lazy’ way. (Big thanks to this Simon Jackson’s great article).

Get the colors

Let’s pull the colors directly from their website. We can use the Chrome plugin Colorzilla to pull the colors we need.

Take those hex color codes and paste them into this chunk like so:

# Palette main colors ketch.styles <- c( `salmon` = "#F16876", `light_blue`= "#00A7E6", `light_grey` = "#E8ECF8", `brown` = "#796C68")

This will give us a nice palette that has colors different enough for categorical data, and similar enough for continuous data. We can even split this up into two separate sub-palettes for this very purpose:

# Create separate palettes agc.palettes <- list( `main` = styles('salmon','light_blue', 'brown', 'light_grey'), `cool` = styles('light_blue', 'light_grey') ) Create the functions

I’m not going to go through these functions line by line; if you have questions reach out to me at bradley.lindblad[at]gmail[dot]com, create an issue on the Github repo. Here is the full code snippet:

# Palette main colors ketch.styles <- c( `salmon` = "#F16876", `light_blue`= "#00A7E6", `light_grey` = "#E8ECF8", `brown` = "#796C68") # Fn to extract them by hex codes styles <- function(...) { cols <- c(...) if (is.null(cols)) return (ketch.styles) ketch.styles[cols] } # Create separate palettes ketch.palettes <- list( `main` = styles('salmon','light_blue', 'brown', 'light_grey'), `cool` = styles('light_blue', 'light_grey') ) # Fn to access them ketch_pal <- function(palette = "main", reverse = FALSE, ...) { pal <- ketch.palettes[[palette]] if (reverse) pal <- rev(pal) colorRampPalette(pal, ...) } # Fn for customer scale scale_color_ketch <- function(palette = "main", discrete = TRUE, reverse = FALSE, ...) { pal <- ketch_pal(palette = palette, reverse = reverse) #' Scale color using AgC color palette. #' @param palette: main, greens or greys #' @param discrete: T or F #' @param reverse: reverse the direction of the color scheme if (discrete) { discrete_scale("colour", paste0("ketch_", palette), palette = pal, ...) } else { scale_color_gradientn(colours = pal(256), ...) } } scale_fill_ketch <- function(palette = "main", discrete = TRUE, reverse = FALSE, ...) { #' Scale fill using AgC color palette. #' @param palette: main, greens or greys #' @param discrete: T or F #' @param reverse: reverse the direction of the color scheme pal <- ketch_pal(palette = palette, reverse = reverse) if (discrete) { discrete_scale("fill", paste0("ketch_", palette), palette = pal, ...) } else { scale_fill_gradientn(colours = pal(256), ...) } }

Let’s test it out:

ggplot(mtcars) + geom_point(aes(mpg, disp, color = qsec), alpha = 0.5, size = 6) + ketchR::scale_color_ketch(palette = "main", discrete = F) + theme_minimal()

produces:

Markdown templates

Now that we’ve fetched the data and plotted the data much more quickly, the final step is to communicate the results of our analysis. Again, we want to be able to do this quickly and consistently. A custom markdown template is in order.

I found this part to be the hardest to get right, as everything needs to be in the right place within the file structure, so follow closely. (Most of the credit here goes to this article by Chester Ismay.)

1. Create skeleton directory dir.create("ketchbrookTemplate/inst/rmarkdown/templates/report/skeleton", recursive = TRUE)

This creates a nested directory that will hold our template .Rmd and .yaml files. You should have a new folder in your directory called “ketchbrookTemplate”:

2. Create skeleton.Rmd

Next we create a new RMarkdown file:

This will give us a basic RMarkdown file like this:

At this point let’s modify the template to fit our needs. First I’ll replace the top matter with a theme that I’ve found to work well for me, feel free to rip it off:

--- title: "ketchbrookTemplate" author: Brad Lindblad output: prettydoc::html_pretty: theme: cayman number_sections: yes toc: yes pdf_document: number_sections: yes toc: yes rmarkdown::html_document: theme: cayman html_notebook: number_sections: yes theme: journal toc: yes header-includes: - \setlength{\parindent}{2em} - \setlength{\parskip}{0em} date: February 05, 2018 always_allow_html: yes #bibliography: bibliography.bib abstract: "Your text block here" ---

I like to follow an analysis template, so this is the top matter combined with my basic EDA template:

-- title: "Customer Service Survey EDA" author: Brad Lindblad, MBA output: pdf_document: number_sections: yes toc: yes html_notebook: number_sections: yes theme: journal toc: yes rmarkdown::html_document: theme: cayman prettydoc::html_pretty: theme: cayman number_sections: yes toc: yes header-includes: - \setlength{\parindent}{2em} - \setlength{\parskip}{0em} date: September 20, 2018 always_allow_html: yes bibliography: bibliography.bib abstract: "Your text block here" --- Writing Your Report Now that you've done the necessary preparation, you can begin writing your report. To start, keep in mind there is a simple structure you should follow. Below you'll see the sections in order along with descriptions of each part. Introduction Summarize the purpose of the report and summarize the data / subject. Include important contextual information about the reason for the report. Summarize your analysis questions, your conclusions, and briefly outline the report. Body - Four Sections Data Section - Include written descriptions of data and follow with relevant spreadsheets. Methods Section - Explain how you gathered and analyzed data. Analysis Section - Explain what you analyzed. Include any charts here. Results - Describe the results of your analysis. Conclusions Restate the questions from your introduction. Restate important results. Include any recommendations for additional data as needed. Appendix Include the details of your data and process here. Include any secondary data, including references. # Introduction # Data # Methods # Analysis # Results # Conclusions # Appendix # References

Save this file in the skeleton folder and we’re done here.

3. Create the yaml file

Next we need to create a yaml file. Simply create a new text document called “template.yaml” in RStudio and save it like you see in this picture:

Rebuild the package and open a new RMarkdown document, select “From Template” and you should see your new template available:

Sweet. You can now knit to html pretty and have sweet output like this:

If you run into problems, make sure your file structure matches this:

├───inst │ └───rmarkdown │ └───templates │ └───ketchbrookTemplate │ │ template.yaml │ │ │ └───skeleton │ skeleton.nb.html │ skeleton.Rmd What’s next?

So we’ve essentially made a bomb package that will let you do everything just a little more quickly and a little better: pull data, reference common data, create data viz and communicate results.

From here, you can use the package locally, or push it to a remote Github repository to spread the code among your team.

The full code for this package is available at the Github repo set up for it. Feel free to fork it and make it your own. I’m not good at goodbye’s so I’m just gonna go.

I’m available for data science consulting on a limited basis. Reach me at bradley.lindblad[at]gmail[dot]com

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R on technistema. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Pages