Using R Plumber and Slack to Run Data Science Pipelines

Last year I published a starter guide that explains how we used the R package Plumber to build a Slack chatbot named Tina to help streamline operations at our company. But the real benefits have come from using it to solve a much bigger challenge: running our data science pipelines efficiently.

This blog post covers what our team has achieved by combining Plumber, Slack, and Databricks to deliver faster, more powerful results for our clients. We hope you find it inspiring for your own team as well.

The need behind the change

At Retina, we build and use data science pipelines for a wide range of processes: automated data import, validation, and transformation; running distributed machine learning models; and automatically generating plots and labeled data outputs.

Our biggest motivation for using pipelines is efficiency. We reuse the same pipelines across multiple data sets and generate outputs much faster than if we had to hand-code each analysis. These pipelines can take anywhere from 10 minutes to multiple hours to complete, so we launch them in an environment like Databricks to keep them organized and easy to maintain.

We made a video of the process of launching a Databricks pipeline:

As you can see, the interface is a bit awkward to use: the data scientist first logs into the portal and initiates the job. Then they must keep the window open in their browser to monitor the job’s progress (or manually configure email alerts). If the end result of the job is a file or an image, the data scientist then has to log into their cloud storage and download the asset once it’s ready. It’s a multi-step process that requires shuffling between programs at exactly the right moments to keep a job moving forward and deliver insights to the right stakeholders in a timely fashion.

The solution

Fortunately, this cumbersome process lends itself well to automation. Our team uses Slack as our primary communication tool, so for us, that’s also the most convenient interface for scheduling the jobs we’re already discussing on Slack.

Plumber makes it easy to turn R scripts into deployable APIs. We’ve used it to create a Slack chatbot we call Tina (short for reTina). Tina is a Slack app that interacts with our team in the project-related channels they already use.
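
To give a feel for how little code this takes, here is a minimal sketch of a Plumber endpoint that answers a Slack slash command. The route, parameter names, and reply text are illustrative assumptions, not Tina’s actual source:

```r
# plumber.R -- sketch of a Slack-facing Plumber endpoint (illustrative, not Tina's real code)
library(plumber)

#* Respond to a Slack slash command (Slack POSTs form-encoded fields such as text and user_name)
#* @post /slack/command
#* @serializer unboxedJSON
function(text = "", user_name = "") {
  # Whatever JSON we return here is what Slack shows as the immediate reply in the channel
  list(
    response_type = "in_channel",
    text = paste0("Hi @", user_name, ", you asked Tina to: ", text)
  )
}
```

Running plumber::plumb("plumber.R")$run(port = 8000) exposes the endpoint; pointing a slash command’s request URL at it is enough to get a first reply back in a channel.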

Our latest update to Tina integrates with the Databricks Jobs API to automatically start and monitor long-running or otherwise complex data science jobs. We pass in inputs (configuration and input data paths) and receive the outputs (output data paths, plots, etc.) for each run. While a job is still running, we repeatedly poll the API for the status of that run.
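
To illustrate the shape of that integration, here is a hedged sketch of launching a run and polling its status with httr against the Databricks Jobs API (the /api/2.1/jobs/run-now and /api/2.1/jobs/runs/get endpoints). The environment variable names and helper functions are assumptions, not our production code:

```r
# Sketch of driving the Databricks Jobs API from R; host, token, and job_id are placeholders.
library(httr)

host  <- Sys.getenv("DATABRICKS_HOST")   # e.g. https://<workspace>.cloud.databricks.com
token <- Sys.getenv("DATABRICKS_TOKEN")
auth  <- add_headers(Authorization = paste("Bearer", token))

# Start a job run, optionally passing notebook parameters, and return its run_id
run_job <- function(job_id, params = NULL) {
  body <- list(job_id = job_id)
  if (length(params)) body$notebook_params <- params
  resp <- POST(paste0(host, "/api/2.1/jobs/run-now"), auth, body = body, encode = "json")
  stop_for_status(resp)
  content(resp)$run_id
}

# Poll the run every `interval` seconds until Databricks reports a terminal state
wait_for_run <- function(run_id, interval = 10) {
  repeat {
    resp <- GET(paste0(host, "/api/2.1/jobs/runs/get"), auth, query = list(run_id = run_id))
    stop_for_status(resp)
    state <- content(resp)$state
    if (state$life_cycle_state %in% c("TERMINATED", "SKIPPED", "INTERNAL_ERROR")) return(state)
    Sys.sleep(interval)  # in Tina, this is roughly where status updates get posted to Slack
  }
}
```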

Using Tina, we can now initiate any available job from any Slack channel. The bot notifies the channel about status updates along the way and automates the next steps whenever possible. Check out a demo in the video below.

Tina now enables our team to:

  • View all available jobs on Databricks and select the job to run
  • Link to the source notebooks in Databricks
  • View basic information and default parameters available for the job
  • Initiate the job in any channel and update the status for the user every 10 seconds
  • Modify inputs as needed
  • Return the output of the completed job, posting any file or figure in the output directly to Slack (see the sketch after this list)
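
Posting results back to Slack is itself only a couple of HTTP calls. Here is a rough sketch with httr, assuming a bot token in SLACK_BOT_TOKEN and the legacy files.upload method (newer Slack workspaces use an external-upload flow instead); none of these names come from Tina’s actual code:

```r
# Sketch of posting a finished job's output to Slack; token, channel, and helpers are illustrative.
library(httr)

slack_auth <- add_headers(Authorization = paste("Bearer", Sys.getenv("SLACK_BOT_TOKEN")))

# Post a plain status message to a channel
post_message <- function(channel, text) {
  resp <- POST("https://slack.com/api/chat.postMessage",
               slack_auth,
               body = list(channel = channel, text = text),
               encode = "json")
  stop_for_status(resp)
  invisible(content(resp))
}

# Upload an output file (e.g. a plot the job produced) to the same channel
post_file <- function(channel, path, comment = "Job finished, output attached") {
  resp <- POST("https://slack.com/api/files.upload",
               slack_auth,
               body = list(channels = channel,
                           file = upload_file(path),
                           initial_comment = comment),
               encode = "multipart")
  stop_for_status(resp)
  invisible(content(resp))
}
```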

Use Cases

Bringing Databricks functionality into communication tools our team already uses saves a lot of time and grief on its own, but the use cases of this integration haven’t stopped there. We can also:

  • Share the results of our work in a sleek and professional manner. We now initiate and monitor jobs directly from the Slack channel where we communicate with each of our enterprise clients. This allows us to organize our projects by client, be transparent about progress, and deliver results more efficiently. As a result, everyone can see what jobs were run previously and what the outputs were. 
  • Automate commonly run ad-hoc tasks. We can now perform tasks like data validation on client-provided tables very efficiently and generate reports on the fly.

Interested in this for your organization? Reach out to us at [email protected] and we’d be happy to share some code snippets and chat about how this, or some of the other work that Retina does, can be useful for you.