This is the first in a series of posts on TensorFlow Extended. In this post I’ll set the stage without diving into any actual TFX code yet, but bear with me. In subsequent posts, I’ll try to steer clear of the typical “getting started” content (the official docs do a fantastic job at that) and instead dive into some real-world use cases that require going beyond the built-in components and standard approach. I will assume some familiarity with basic machine learning and data science terminology.
As machine learning has increasingly inundated the tech world, it’s been pretty much impossible to keep track of the growing set of tools, frameworks and platforms which all claim to be the next best thing. While those of us who have been “around” for a while might remember (more or less fondly) such frameworks as (the original) Torch, Caffe and Theano, these days it is hard to get by in the data science world without at least a passing knowledge of both TensorFlow and PyTorch. That is not to say that the steady stream of new tools has slowed down, but it does seem that, at least for the time being, large ecosystems of tools have aggregated around these two frameworks.
While they are both excellent, TensorFlow (or TF for friends) has typically had a stronger following in the business world, while PyTorch seems to be more widely adopted by the academic community. If you don’t mind, I will keep a safe distance from the chicken-or-egg question of how and why that came to be. It is however fair to say that, at least until the “recent” major update of TensorFlow, PyTorch has been easier to experiment with, while TensorFlow has had much better support for bringing machine learning pipelines and models to production. Now, with TF 2.x and its wholehearted embrace of the “Keras way,” TF has become a delight (that is, if you are “that kind” of person) to experiment with. On the other hand, I think it’s still fair to say that the ecosystem around TF continues to be stronger and more unified when it comes to “productionising” machine learning compared to what the PyTorch world currently offers. This is in no small part due to the main character in this multipart story: TensorFlow Extended, a.k.a. TFX.
TensorFlow Extended in context
Calling TFX a “character” is perhaps a bit of an understatement: imagining a multi-headed Hydra will probably give you a better sense of what it can do for you. TFX is the continuation of a project at Google called Sibyl, which grew from the realisation that coming up with ingenious machine learning models and writing slick loss-optimisation code is only a small part of what’s needed to put ML to effective use in the scary and chaotic wild west we tend to call reality.
What do you do if suddenly the distribution of the training data changes and retraining your model results in a dramatic drop in performance? Or worse, what if the distribution of the data your production model is making predictions on slowly drifts over time without anyone noticing? Even if you do notice, this means that you’ll (in the easiest scenario) need to adapt your features, and corresponding feature engineering, over time. How much of this can you automate? And what if you need to track and manage this for thousands of models simultaneously? Even if your model count in production is much lower than that, you will quickly realise that finding the right tools to help you with these challenges is more than just a luxury problem.
That’s where TFX comes in. It extends (as you might have guessed) TF with a suite of tools that help you deal with the (MLOps) challenges I just described. In order to understand its power, it’s useful to know a bit about how it works under the covers. Many of the TFX components require processing large amounts of data, preferably in a distributed way. Instead of reinventing the wheel, these components leverage another powerhouse that originated at Google and excels at exactly this: Apache Beam. This will become very clear in an upcoming post in which we’ll write a custom component to ingest data into the pipeline. What we’ll end up creating will essentially be a thin wrapper around Beam code.
Another important aspect of TFX’s architecture is that, very much like Beam, it should be seen as a tool for authoring pipelines. Actually running a TFX pipeline requires a runner. You can, as is the case for Beam, run it locally during development. In fact, TFX’s local runner is essentially Beam’s local runner in that it leverages Beam both for running many of its components and for orchestrating the full pipeline. To run your pipeline at scale, you have three options:
- Use a Beam runner: you can keep leveraging Beam for both processing and orchestration at scale by using a Beam runner such as Google Cloud’s Dataflow. This is the easiest option as it requires minimal extra setup and configuration to promote your local setup to one that runs at scale (similar to promoting a local Beam pipeline to a Dataflow pipeline). On the other hand, it will not be as fully-featured as the two options that follow.
- Use Apache Airflow: a lot of companies are already using Airflow for orchestrating their data pipelines. If that’s the case, running your TFX pipelines on Airflow as well might be the path of least resistance. This is an excellent option and will offer more control than a Beam runner as far as the orchestration goes. Note that even in this case, the data processing that is part of many of the components will still be done by Beam, so it’s perhaps more appropriate to call this option Airflow + Beam, or, if you’re on Google Cloud Platform (GCP) for example, Cloud Composer + Dataflow.
- Use Kubeflow Pipelines: Kubeflow is an open source project that aims to make deployments of ML workloads on Kubernetes simpler. One of its sub-projects is Kubeflow Pipelines, which offers a UI, API and SDK that together enable authoring and orchestrating ML pipelines, running and comparing ML experiments, and artifact lineage tracking through a metadata store. This requires a bit more setup (although managing Airflow on Kubernetes is not that different), but offers the most fully-featured experience and richest UI for a data science team. Furthermore, on GCP you can also find a fully managed version as AI Platform Pipelines.
We will not focus on the runners in this series, although they are an essential part of the MLOps story. In what follows, we will either use the local Beam runner or the “manual” Interactive runner that essentially turns your Jupyter notebook into an orchestrator with your fingertips as the triggers. The latter is very handy while developing pipelines or exploring metadata and artifacts.
TFX comes in two flavours: a set of stand-alone Python libraries on one hand and a collection of components that can be used as building blocks for authoring DAGs (Directed Acyclic Graphs, welcome to Data Engineering) on the other. The components wrap some of the functionality from the stand-alone libraries with the specific purpose of building automated machine learning pipelines. We will focus on the component approach, although, as you’ll see, we will sometimes need to break through this abstraction layer and use some of the library features that are not exposed through the wrappers. In any case, we encourage you to dig into the libraries themselves as well, as they allow you to perform very in-depth analysis of your data, metadata and models in a more exploratory way.
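To make the split between authoring a DAG and running it a bit more tangible, here is a minimal pure-Python sketch. It deliberately does not use the TFX API: the dependency map is my own simplified assumption about how the components discussed in this post feed into each other, and `execution_order` and `run_locally` are hypothetical helpers, not TFX functions.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical, simplified pipeline definition: each component name maps to
# the set of components whose outputs it consumes. The names mirror the TFX
# components, but the wiring is an illustrative assumption.
PIPELINE = {
    "ExampleGen": set(),
    "StatisticsGen": {"ExampleGen"},
    "SchemaGen": {"StatisticsGen"},
    "ExampleValidator": {"StatisticsGen", "SchemaGen"},
    "Transform": {"ExampleGen", "SchemaGen"},
    "Trainer": {"Transform", "SchemaGen"},
    "Evaluator": {"ExampleGen", "Trainer"},
    "Pusher": {"Trainer", "Evaluator"},
}


def execution_order(pipeline):
    """Return one valid order in which a runner could execute the components."""
    return list(TopologicalSorter(pipeline).static_order())


def run_locally(pipeline, execute):
    """Toy 'runner': call execute(name) for every component, dependencies first."""
    for name in execution_order(pipeline):
        execute(name)
```

The point of the sketch is that `PIPELINE` only describes what depends on what. Swapping `run_locally` for something that schedules each step on a cluster would not require touching the definition, which is roughly the relationship between a pipeline and its runner.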
We will get to know most of the components better as we progress through the subsequent posts, but it would be rude to not include the main ones at least briefly in this general round of introductions:
Ingest and validate data: First, data needs to be ingested into the pipeline and ideally turned into a binary format that is optimised for further processing. This is done by the ExampleGen component (TensorFlow tends to call data points Examples), which typically turns external files into binary TFRecord files of protobuf messages. Subsequently, StatisticsGen extracts feature statistics from the data that are then used by SchemaGen to generate a schema for your data. This schema can be adapted to your needs, after which it is used by ExampleValidator to validate future data against. This allows you to automatically detect data drift or training-serving skew and act accordingly. If these terms don’t ring many bells, more on all of this in future posts. Note that ExampleGen, StatisticsGen and ExampleValidator use Beam under the hood.
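If the statistics, schema and validation trio still feels abstract, the following toy version in plain Python may help. It is only a sketch of the idea under my own assumptions: the function names, the dictionary layout and the anomaly messages are all hypothetical and far simpler than what the real components produce.

```python
def compute_statistics(examples):
    """Per-feature statistics: observed type names, numeric range, presence count."""
    features = {}
    for example in examples:
        for name, value in example.items():
            entry = features.setdefault(
                name, {"types": set(), "count": 0, "min": None, "max": None}
            )
            entry["count"] += 1
            entry["types"].add(type(value).__name__)
            if isinstance(value, (int, float)):
                entry["min"] = value if entry["min"] is None else min(entry["min"], value)
                entry["max"] = value if entry["max"] is None else max(entry["max"], value)
    return {"num_examples": len(examples), "features": features}


def infer_schema(statistics):
    """Derive a schema from statistics: allowed types, presence and numeric range."""
    schema = {}
    for name, entry in statistics["features"].items():
        schema[name] = {
            "types": set(entry["types"]),
            "required": entry["count"] == statistics["num_examples"],
            "min": entry["min"],
            "max": entry["max"],
        }
    return schema


def validate_statistics(statistics, schema):
    """Compare statistics of new data against the schema and list anomalies."""
    anomalies = []
    features = statistics["features"]
    for name, spec in schema.items():
        entry = features.get(name)
        if entry is None or entry["count"] < statistics["num_examples"]:
            if spec["required"]:
                anomalies.append((name, "missing values for required feature"))
        if entry is None:
            continue
        if not entry["types"] <= spec["types"]:
            anomalies.append((name, "unexpected type"))
        if spec["min"] is not None and entry["min"] is not None:
            if entry["min"] < spec["min"] or entry["max"] > spec["max"]:
                anomalies.append((name, "value out of range"))
    for name in features.keys() - schema.keys():
        anomalies.append((name, "feature not in schema"))
    return anomalies
```

A short usage example, with made-up data:

```python
train = [{"age": 30, "city": "Ghent"}, {"age": 45, "city": "Leuven"}]
schema = infer_schema(compute_statistics(train))

new_data = [{"age": 130, "city": "Ghent"}, {"age": 40}]
print(validate_statistics(compute_statistics(new_data), schema))

# "Adapting the schema to your needs": widen the allowed age range by hand.
schema["age"]["max"] = 150
print(validate_statistics(compute_statistics(new_data), schema))
```

The first call flags both the out-of-range age and the missing city; after widening the range by hand, only the missing city remains.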
Transform data and train model: After the data has been validated, it needs to be transformed into the input shape required by the training loop. This can be much more elaborate than you might expect. The Transform component leverages Beam to first analyse the full training set (unless the required information already exists in the schema) and then uses the obtained statistics (such as mean or standard deviation) to transform the data (for example, impute missing values or normalise the data). To ensure that the same preprocessing steps are performed during serving/prediction, it then saves the transformation steps into a TF graph which can be saved along with the trained model. The transformed data is then fed into the Trainer component that finally runs that precious bit of code you thought was going to be 90% of the work. The Trainer can run distributed training on, for example, a Kubeflow cluster, or offload this job to a service such as Google’s AI Platform Training. The Trainer works hand in hand with the Tuner, the newest addition to the TFX family responsible for handling hyperparameter tuning.
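The analyse-then-transform idea is worth a tiny illustration. The sketch below is plain Python rather than the Transform component itself, and `analyse` and `make_transform` are hypothetical names of my own. It shows the two phases: one full pass over the training data to obtain constants, followed by a per-example function that has those constants baked in and can therefore be reused unchanged at serving time.

```python
import statistics


def analyse(training_values):
    """Full pass over the training data: compute the constants the transform needs."""
    present = [value for value in training_values if value is not None]
    return {"mean": statistics.fmean(present), "std": statistics.pstdev(present)}


def make_transform(constants):
    """Freeze the constants into a function that is applied one example at a time."""
    mean, std = constants["mean"], constants["std"]

    def transform(value):
        if value is None:
            value = mean  # impute missing values with the training mean
        # Normalise with the training statistics; guard against a zero spread.
        return (value - mean) / std if std else 0.0

    return transform


# Training time: analyse once, then transform every training example.
training_column = [1.0, 2.0, None, 3.0]
transform = make_transform(analyse(training_column))
transformed_training = [transform(value) for value in training_column]

# Serving time: the same function, with the same constants, handles one request.
served = transform(2.5)
```

Because the mean and standard deviation are computed once on the training set and then frozen, a request at serving time is preprocessed exactly the way the training examples were, which is the guarantee the saved transformation graph is meant to give you.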
Evaluate and validate model: The Trainer spits out a SavedModel, but that should not be the end of your pipeline. What if the new model performs worse than the one that’s currently running in production? The Evaluator first evaluates the model on the test set, but then, more importantly, compares its performance against that of the latest “blessed model.” A blessed model is one that was previously deemed good enough by the Evaluator (or your-human-in-the-loop-self) to be representing your company in production. If the new model performs better than the previous blessed model (in ways that you have a lot of control over), it is tagged as the latest blessed model. The aptly named Pusher is then responsible for “pushing” any new blessed model to the right deployment target (ready to be served in the cloud, on the edge or in the browser, depending on your use case).
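The blessing logic can be sketched in a few lines of plain Python. This is an illustration under my own assumptions, not how the Evaluator is actually configured: `should_bless`, `evaluate_and_push`, the single accuracy metric and both thresholds are hypothetical.

```python
def should_bless(candidate_metrics, baseline_metrics, min_accuracy=0.7, min_improvement=0.0):
    """Bless a candidate that clears an absolute bar and does not regress."""
    accuracy = candidate_metrics["accuracy"]
    if accuracy < min_accuracy:
        return False  # not good enough in absolute terms
    if baseline_metrics is None:
        return True  # no blessed model yet, nothing to compare against
    return accuracy - baseline_metrics["accuracy"] >= min_improvement


def evaluate_and_push(candidate, blessed, push):
    """Push the candidate if it is blessed and return the current blessed model."""
    baseline_metrics = blessed["metrics"] if blessed else None
    if should_bless(candidate["metrics"], baseline_metrics):
        push(candidate)
        return candidate
    return blessed
```

With these helpers, a candidate that scores 0.75 against a blessed model at 0.8 is simply not pushed, and the previous model stays in production. The thresholds are where the "ways that you have a lot of control over" would come in.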
Finally, the (not so) secret glue that holds everything together is the ML Metadata (MLMD) library. This layer allows all the components to share metadata and artifacts (even though every component typically runs in its own container or service) and exposes a rich interface for lineage tracking and pipeline debugging. We will cross paths with it many times on our travels “along the pipeline.”
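To give a feel for what lineage tracking means, here is a toy in-memory stand-in written in plain Python. It is not the MLMD API: the `MetadataStore` class, its methods and the artifact names are hypothetical and only meant to illustrate the idea of recording which execution consumed and produced which artifacts.

```python
class MetadataStore:
    """Toy in-memory store recording the inputs and outputs of each component run."""

    def __init__(self):
        self.executions = []  # one record per component run

    def record(self, component, inputs, outputs):
        """Register a component run with the artifacts it consumed and produced."""
        self.executions.append(
            {"component": component, "inputs": list(inputs), "outputs": list(outputs)}
        )

    def lineage(self, artifact):
        """Return every upstream artifact that contributed to `artifact`."""
        upstream = []
        for execution in self.executions:
            if artifact not in execution["outputs"]:
                continue
            for parent in execution["inputs"]:
                # Add the direct parent first, then everything it depends on.
                for ancestor in [parent] + self.lineage(parent):
                    if ancestor not in upstream:
                        upstream.append(ancestor)
        return upstream


store = MetadataStore()
store.record("ExampleGen", [], ["examples"])
store.record("StatisticsGen", ["examples"], ["statistics"])
store.record("SchemaGen", ["statistics"], ["schema"])
store.record("Trainer", ["examples", "schema"], ["model"])

model_lineage = store.lineage("model")
```

Asking for the lineage of the model walks back through the recorded runs, so you can answer questions such as which examples and which schema a given model was trained with.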
Note that this story did not include the actual serving of the model. There too, TFX is your friend, through the TensorFlow Serving library. It should however be seen as a separate application, not as one of the components of the pipeline. You can alternatively use any other framework or service that can serve SavedModels, such as Google’s AI Platform Prediction service.
That’s it from me for now. I realise that we had a bit of an information density explosion going on there towards the end, but we’ll revisit a lot of this at a slower pace later on. Hopefully see you soon in the next post, where we’ll dive deeper into some of the components and how to bend them to your custom needs.