Scheduling Jupyter Notebooks at Meta

At Meta, Bento is our internal Jupyter notebooks platform, used by many internal teams. Notebooks are also widely used for creating reports and workflows (for example, performing data ETL) that need to be repeated at certain intervals. Users with such notebooks have to remember to run them manually at the required cadence, a process that people can forget and that does not scale with the number of notebooks in use.

To address this problem, we invested in building a scheduled notebooks infrastructure that fits in seamlessly with the rest of the internal tooling available at Meta. Investing in infrastructure helps ensure that privacy is inherent in everything we build. It enables us to continue building innovative, valuable solutions in a privacy-safe way.

The ability to transparently answer questions about how data flows through Meta's systems, for the purposes of data privacy and regulatory compliance, differentiates our scheduled notebooks implementation from the rest of the industry.

In this post, we'll explain how we married Bento with our batch ETL pipeline framework called Dataswarm (think Apache Airflow) in a privacy- and lineage-aware manner.

The challenge around doing scheduled notebooks at Meta

At Meta, we're committed to improving confidence in production by performing static analysis on scheduled artifacts, and to maintaining coherent narratives around dataflows by leveraging transparent Dataswarm Operators and data annotations. Notebooks pose a special challenge because:

  • Due to dynamic code content (think table names created via f-strings, for instance), static analysis won't work, making it harder to understand data lineage.
  • Since notebooks can contain arbitrary code, their execution in production is considered "opaque": data lineage cannot be determined, validated, or recorded.
  • Scheduled notebooks are considered to be on the production side of the production-development barrier. Before anything runs in production, it needs to be reviewed, and reviewing notebook code is non-trivial.
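
To make the first point concrete, here is a minimal sketch (not Meta's actual analyzer) of a static pass over notebook code using Python's standard `ast` module. The class `StringCollector`, the function `classify_strings`, and the sample cell are hypothetical names chosen for illustration. A plain string literal can be read straight from the syntax tree, but an f-string only exposes the names it interpolates; their values, and therefore the table name, are unknown until the code runs.

```python
import ast


class StringCollector(ast.NodeVisitor):
    """Collect string expressions and label them static or dynamic."""

    def __init__(self):
        self.found = []

    def visit_Constant(self, node):
        # A plain literal: its full text is known without executing anything.
        if isinstance(node.value, str):
            self.found.append(("static", node.value))

    def visit_JoinedStr(self, node):
        # An f-string: record which names it depends on. We deliberately do
        # not descend further, so its literal fragments are not reported as
        # if they were complete static strings.
        names = sorted({n.id for n in ast.walk(node) if isinstance(n, ast.Name)})
        self.found.append(("dynamic", ", ".join(names)))


def classify_strings(source):
    collector = StringCollector()
    collector.visit(ast.parse(source))
    return collector.found


# Hypothetical notebook cell; compute_suffix() is never called here because
# the code is only parsed, not executed.
CELL = '''
static_query = "SELECT * FROM daily_metrics"
suffix = compute_suffix()
dynamic_query = f"SELECT * FROM metrics_{suffix}"
'''

for kind, detail in classify_strings(CELL):
    print(kind, "->", detail)
```

The first query can be attributed to a table by inspection alone. For the second, the analyzer can only report that the table name depends on `suffix`, which is exactly the gap in lineage described above.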

These three considerations shaped and influenced our design decisions. In particular, we limited the notebooks that can be scheduled to those primarily performing ETL and those performing data transformations and displaying visualizations. Notebooks with any other side effects are currently out of scope and are not eligible to be scheduled.

How scheduled notebooks work at Meta

There are three main components for supporting scheduled notebooks:

  1. The UI for setting up a schedule and creating a diff (Meta's pull request equivalent) that needs to be reviewed before the notebook and its associated Dataswarm pipeline get checked into source control.
  2. The debugging interface for use once a notebook has been scheduled.
  3. The integration point (a custom Operator) with Meta's internal scheduler that actually runs the notebook. We're calling this BentoOperator.

How BentoOperator works

To address the majority of the concerns highlighted above, we perform the notebook execution step in a container without access to the network. We also leverage input and output data annotations to show the flow of data.

The overall design for BentoOperator.

For ETL, we fetch data and write it out in a novel way:

  • Supported notebooks perform data fetches in a structured manner via custom cells that we've built. An example of this is the SQL cell. When BentoOperator runs, the first step involves parsing the metadata associated with these cells, fetching the data using transparent Dataswarm Operators, and persisting it in local csv files on the ephemeral remote hosts.
  • Instances of these custom cells are then replaced with a call to pandas.read_csv() to load that data in the notebook, unlocking the ability to execute the notebook without any access to the network.
  • Data writes also leverage a custom cell, which we replace with a call to pandas.DataFrame.to_csv() to persist to a local csv file. After the actual notebook execution is complete, we process that file and upload the data to the warehouse using transparent Dataswarm Operators.
  • After this step, the temporary csv files are garbage-collected, the resulting notebook version with outputs is uploaded, and the ephemeral execution host is deallocated.
Custom SQL cell supported for scheduled notebooks.
Structured custom cell for data uploads.
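
The cell-replacement idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not BentoOperator's implementation: the `custom_cell` metadata key, its `kind`, `query`, `output_var`, `input_var`, and `table` fields, and the function `rewrite_for_offline_run` are all hypothetical. The sketch only rewrites cell sources and builds the fetch and upload plans; the actual fetching and uploading (done by transparent Dataswarm Operators in the real system) happens outside the notebook and is not shown.

```python
import copy


def rewrite_for_offline_run(notebook, data_dir):
    """Return (rewritten notebook, fetch plan, upload plan).

    Custom cells are replaced with pandas calls on local csv files so the
    notebook itself never needs network access.
    """
    nb = copy.deepcopy(notebook)
    fetches, uploads = [], []
    for index, cell in enumerate(nb["cells"]):
        spec = cell.get("metadata", {}).get("custom_cell")  # hypothetical key
        if spec is None:
            continue  # ordinary cell: leave untouched
        csv_path = f"{data_dir}/cell_{index}.csv"
        if spec["kind"] == "sql":
            # Data is fetched before execution and read back from disk.
            fetches.append({"query": spec["query"], "csv_path": csv_path})
            cell["source"] = (
                "import pandas as pd\n"
                f"{spec['output_var']} = pd.read_csv({csv_path!r})"
            )
        elif spec["kind"] == "upload":
            # Data is written to disk and uploaded after execution.
            uploads.append({"table": spec["table"], "csv_path": csv_path})
            cell["source"] = f"{spec['input_var']}.to_csv({csv_path!r}, index=False)"
        else:
            raise ValueError(f"unsupported custom cell kind: {spec['kind']!r}")
    return nb, fetches, uploads


# Hypothetical three-cell notebook: fetch, transform, upload.
NOTEBOOK = {
    "cells": [
        {"cell_type": "code", "source": "", "metadata": {"custom_cell": {
            "kind": "sql",
            "query": "SELECT ds, clicks FROM example_table",
            "output_var": "df"}}},
        {"cell_type": "code", "metadata": {},
         "source": "summary = df.groupby('ds').sum().reset_index()"},
        {"cell_type": "code", "source": "", "metadata": {"custom_cell": {
            "kind": "upload",
            "input_var": "summary",
            "table": "example_summary"}}},
    ]
}

offline_nb, fetch_plan, upload_plan = rewrite_for_offline_run(NOTEBOOK, "/tmp/run")
for cell in offline_nb["cells"]:
    print(cell["source"], end="\n\n")
```

Because the inputs and outputs are declared in cell metadata rather than buried in arbitrary code, the fetch and upload plans can be inspected before anything runs, which is what makes the data flow recordable.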

Our approach to privacy with BentoOperator

We have integrated BentoOperator with Meta's data purpose framework to ensure that data is used only for the purpose for which it was intended. This framework ensures that the data usage purpose is respected as data flows and transmutes across Meta's stack. As part of scheduling a notebook, the user supplies a "purpose policy zone," which serves as the integration point with the data purpose framework.

Overall user workflow

Let's now explore the workflow for scheduling a notebook:

We've exposed the scheduling entry point directly in the notebook header, so all users have to do is hit a button to get started.

The first step in the workflow is setting up some parameters that will be used to automatically generate the pipeline for the schedule.

The next step involves previewing the generated pipeline before a Phabricator (Meta's diff review tool) diff is created.

In addition to the pipeline code for running the notebook, the notebook itself is also checked into source control so it can be reviewed. The results of attempting to run the notebook in a scheduled setup are also included in the test plan.

Once the diff has been reviewed and landed, the schedule starts running the next day. If the notebook execution fails for any reason, the schedule owner is automatically notified. We've also built a context pane extension directly in Bento to help with debugging notebook runs.

What's next for scheduled notebooks

While we've addressed the challenge of supporting scheduled notebooks in a privacy-aware manner, the notebooks that are in scope for scheduling are limited to those performing ETL or those performing data analysis with no other side effects. This is only a fraction of the notebooks that users eventually want to schedule. To increase the number of use cases, we'll be investing in supporting other transparent data sources in addition to the SQL cell.

We have also begun work on supporting parameterized notebooks in a scheduled setup. The idea is to support cases where, instead of checking many notebooks that differ by only a few variables into source control, we check in a single notebook and inject the differentiating parameters at runtime.
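
As a rough illustration of runtime parameter injection, the sketch below follows the Papermill-style convention of tagging a defaults cell with `parameters` and inserting an overriding cell tagged `injected-parameters` right after it. It is a standard-library sketch under that assumption, not a description of how Bento implements the feature; the function `inject_parameters` and the sample template are hypothetical.

```python
import copy


def inject_parameters(notebook, parameters):
    """Return a copy of the notebook with an injected parameters cell.

    The new cell goes right after the first cell tagged "parameters" so its
    assignments override the defaults; with no such cell it goes at the top.
    """
    for name in parameters:
        if not name.isidentifier():
            raise ValueError(f"not a valid variable name: {name!r}")

    nb = copy.deepcopy(notebook)
    # repr() yields valid Python literals for simple values (str, int, ...).
    source = "\n".join(f"{name} = {value!r}" for name, value in parameters.items())
    injected = {
        "cell_type": "code",
        "metadata": {"tags": ["injected-parameters"]},
        "source": source,
        "outputs": [],
        "execution_count": None,
    }

    position = 0
    for index, cell in enumerate(nb["cells"]):
        if "parameters" in cell.get("metadata", {}).get("tags", []):
            position = index + 1
            break
    nb["cells"].insert(position, injected)
    return nb


# One checked-in template; each schedule supplies only what differs.
TEMPLATE = {
    "cells": [
        {"cell_type": "code", "metadata": {"tags": ["parameters"]},
         "source": "region = 'us'\nlookback_days = 7"},
        {"cell_type": "code", "metadata": {},
         "source": "print(region, lookback_days)"},
    ]
}

eu_run = inject_parameters(TEMPLATE, {"region": "eu", "lookback_days": 30})
print(eu_run["cells"][1]["source"])
```

The template keeps its defaults, so it still runs interactively, while each scheduled run gets its own values without a separate notebook being checked in.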

Lastly, we'll be working on event-based scheduling (in addition to the time-based approach we have here) so that a scheduled notebook can also wait for predefined events before running. This would include, for example, the ability to wait until all the data sources the notebook depends on have landed before notebook execution begins.
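
Since this feature is still future work, the following is only a sketch of what "wait until all data sources land" could look like as a simple polling gate. Everything here is an assumption made for illustration: the function `wait_for_sources`, the `has_landed` callback, and the table names. A production scheduler would more likely react to events than poll.

```python
import time


def wait_for_sources(sources, has_landed, poll_seconds=60.0,
                     timeout_seconds=3600.0, sleep=time.sleep,
                     clock=time.monotonic):
    """Block until has_landed(source) is true for every source.

    Returns True once all sources have landed, or False if the timeout
    expires first. sleep and clock are injectable to simplify testing.
    """
    pending = set(sources)
    deadline = clock() + timeout_seconds
    while True:
        # Re-check only the sources that were still missing last time.
        pending = {source for source in pending if not has_landed(source)}
        if not pending:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)


# Simulated example: table_b "lands" during the first wait.
landed = {"table_a"}


def fake_sleep(_seconds):
    landed.add("table_b")


ready = wait_for_sources(
    ["table_a", "table_b"],
    has_landed=lambda source: source in landed,
    poll_seconds=0,
    sleep=fake_sleep,
)
print("run notebook" if ready else "timed out")
```

Only when the gate returns True would the notebook execution step start; a False result could instead trigger the same owner notification used for failed runs.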


Some of the approaches we took were directly inspired by the work done on Papermill.