Real-Time Machine Learning at Headspace

Introduction

Data is often most valuable when it can be leveraged to make decisions in the moment, yet traditionally consumer data is ingested, transformed, persisted, and then sits dormant for long stretches before machine learning and analytics teams put it to use. At Headspace, real-time signals can immediately inform content recommendations; examples include:

  • Current session bounce rates for sleep content
  • Semantic embeddings for recent user search terms (if a user recently searched for “preparing for big exam”, the ML model can assign more weight to Focus-themed meditations)
  • Users’ biometric data (i.e., if step counts and heart rate are increasing over the last 10 minutes, we can recommend Move or Exercise content)

Real-Time Inference Requirements

In order to facilitate real-time inference that personalizes our users' content recommendations, we need to:

  • Ingest, process, and forward along relevant events (actions) that our users perform on our client apps (iOS, Android, web).
  • Quickly compute, store, and fetch online features (with millisecond latency) that enrich the feature set used by a real-time inference model.
  • Serve and reload the real-time inference model in a way that synchronizes the served model with online feature stores while minimizing (and ideally avoiding) any downtime.

Challenges Adapting the Traditional ML Model Workflow

The requirements we listed are problems that are often not solved (and don’t need to be solved) with offline models that serve daily batch predictions. ML models that make inferences from records pulled and transformed from an ELT / ETL data pipeline usually have lead times of multiple hours for raw event data. Traditionally, an ML model’s training and serving workflow would involve the following steps, executed via periodic jobs that run every few hours or daily:

  1. Pull relevant raw data from upstream data stores: For Headspace, this involves using Spark SQL to query from the upstream data lake maintained by our Data Engineering team.
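As a rough illustration of this first step, a Spark SQL pull in a Databricks notebook might look like the following sketch (the database, table, column names, and lookback window are assumptions, not Headspace's actual schema):

    # Hypothetical sketch: pull recent raw events from the upstream data lake with Spark SQL.
    # `spark` is the SparkSession provided in a Databricks notebook or job.
    raw_events = spark.sql("""
        SELECT user_id, event_type, content_id, event_ts
        FROM data_lake.user_events
        WHERE event_date >= date_sub(current_date(), 7)
    """)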

High Level Architecture Overview

The Publishing and Serving Layer: Model Training and Lifecycle Deployments

ML models are developed in Databricks notebooks and evaluated via MLflow experiments on core offline metrics (for example, recall at k for recommendation systems). The Headspace ML team has written wrapper classes that extend the base Python Function Model Flavor class in MLflow:
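A minimal sketch of what such a wrapper might look like, assuming the standard mlflow.pyfunc interface (the class name, artifact key, and recommend() method are illustrative, not the team's actual code):

    # Hypothetical wrapper around a trained recommender so it can be logged,
    # registered, and served uniformly through MLflow's Python Function flavor.
    import pickle
    import mlflow.pyfunc

    class HeadspaceRecommenderWrapper(mlflow.pyfunc.PythonModel):

        def load_context(self, context):
            # Artifacts (model weights, ID mappings, etc.) are materialized
            # locally by MLflow when the model is loaded for serving.
            with open(context.artifacts["model"], "rb") as f:
                self._model = pickle.load(f)

        def predict(self, context, model_input):
            # model_input is a pandas DataFrame of enriched features;
            # return the recommended content for each row.
            return self._model.recommend(model_input)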

Updating and Reloading Real-Time Models

We re-train our models frequently, but updating a real-time inference model in production is tricky. AWS offers a variety of deployment patterns (gradual rollout, canary, etc.) that we can leverage to update the actual served Sagemaker model. However, real-time models also require in-sync online feature stores, which, given the size of Headspace's user base, can take up to 30 minutes to fully update. Because we do not want downtime each time we update our model images, we need to carefully synchronize our feature store with our model image.

Blue-Green Architecture

To address this issue, we can adopt a blue-green architecture that follows the DevOps practice of blue-green deployments. The workflow is illustrated below (a minimal routing sketch follows the list):

  • Maintain two parallel pieces of infrastructure (two copies of feature stores, in this case).
  • Designate one as the production environment (let's call it the "green" environment to start) and route requests for features and predictions towards it via our Lambda.
  • Every time we wish to update our model, we use a batch process/script to update the complementary infrastructure (the "blue" environment) with the latest features. Once this update is complete, switch the Lambda to point towards the blue environment for features/predictions, making it the new production environment.
  • Repeat this each time we want to update the model (and its corresponding feature store).
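A minimal sketch of how the Lambda might resolve the active environment, assuming the active color is stored in an SSM parameter that the batch update script flips once the refresh completes (the parameter, table, and endpoint names are hypothetical):

    # Hypothetical sketch: route feature/prediction traffic by reading an "active color" flag.
    import boto3

    ssm = boto3.client("ssm")

    FEATURE_TABLES = {"green": "online-features-green", "blue": "online-features-blue"}
    MODEL_ENDPOINTS = {"green": "realtime-recs-green", "blue": "realtime-recs-blue"}

    def active_environment():
        # The batch update script flips this parameter only after the complementary
        # feature store is fully refreshed, so the switch is atomic from the Lambda's view.
        response = ssm.get_parameter(Name="/realtime-ml/active-color")
        return response["Parameter"]["Value"]  # "green" or "blue"

    def resolve_targets():
        color = active_environment()
        return FEATURE_TABLES[color], MODEL_ENDPOINTS[color]

Because the Lambda resolves the color on each invocation, flipping the parameter switches feature reads and model invocations together, which is the synchronization the blue-green setup is meant to guarantee.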

The Receiver Layer: Event Stream Ingestion with Apache Spark Structured Streaming Scheduled Job

Headspace users’ event actions (logging into the app, playing a specific piece of content, renewing a subscription, searching for content, etc.) are aggregated and forwarded onto Kinesis Streams (Step 1 in diagram). We leverage the Spark Structured Streaming framework on top of Databricks to consume from these Kinesis Streams. Structured Streaming offers several benefits:

  • It leverages the same unified language (Python/Scala) and framework (Spark) shared by data scientists, data engineers, and analytics team members, allowing multiple Headspace teams to reason about user data using the familiar Dataset / DataFrame APIs and abstractions.
  • It allows our teams to implement custom micro-batching logic to meet business requirements (for example, we could trigger and define micro batches based on custom event-time windows and session watermarks logic on a per-user basis).
  • It comes with existing Databricks infrastructure tools (Scheduled Jobs, automatic retries, efficient DBU credit pricing, email notifications for process failure events, built-in Spark Streaming dashboards, ability to quickly auto-scale to meet spikes in user app event activity) that significantly reduce the infrastructure administration burden on ML engineers.
We run this consumer as a Databricks Scheduled Job configured with:

  • Maximum Concurrent Runs set to 1
  • Unlimited retries
  • New Scheduled Job clusters (as opposed to an All-Purpose cluster)
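A minimal sketch of the Structured Streaming consumer itself, assuming the Databricks Kinesis source (the stream name, region, and event schema are illustrative):

    # Hypothetical sketch: consume user events from Kinesis with Structured Streaming.
    # `spark` is the SparkSession provided in a Databricks notebook or Scheduled Job.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    event_schema = StructType([
        StructField("user_id", StringType()),
        StructField("event_type", StringType()),
        StructField("event_ts", TimestampType()),
    ])

    raw_events = (
        spark.readStream
        .format("kinesis")                      # Databricks Kinesis connector
        .option("streamName", "user-events")    # hypothetical stream name
        .option("region", "us-east-1")
        .option("initialPosition", "latest")
        .load()
    )

    # Kinesis records arrive as binary payloads in the `data` column; parse the JSON body.
    events = (
        raw_events
        .select(F.from_json(F.col("data").cast("string"), event_schema).alias("e"))
        .select("e.*")
    )

    # Each micro-batch would then be forwarded to the prediction SQS queue
    # (the Orchestration layer), e.g. via foreachBatch and boto3.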

The Orchestration Layer: Decoupled Event Queues and Lambdas as Feature Transformers

Headspace users produce events that our real-time inference models consume downstream to make fresh recommendations. However, user event activity volume is not uniformly distributed: there are various peaks and valleys, since our users are often most active during specific times of day (early morning, mid-afternoon break, evening before bedtime). To absorb these spikes, our Structured Streaming jobs place event messages onto a decoupled SQS prediction queue, and a Lambda then consumes each message and does the following (a minimal handler sketch follows the list):

  • Unpacks the message and fetches the corresponding online and offline features for the user we want to make a recommendation for (Step 5 in above diagram). For instance, we may augment the event's temporal/session-based features with user tenure level, gender, locale, etc.
  • Performs any final pre-processing business logic (for instance, mapping our Headspace user IDs to user sequence IDs to fit the interface expected by certain collaborative filtering models that we deploy).
  • Selects the appropriate served Sagemaker model and invokes it with the input features (Step 6 in above diagram).
  • Forwards along the recommendation to its downstream destination (Step 7 in above diagram). The actual location depends on whether we want users to pull down content recommendations or have recommendations pushed to them:
    • Pull: Persists the final recommended content to our internal Prediction Service, which is responsible for ultimately supplying users with their updated personalized content for many of the Headspace app's tabs upon client app request. One example experiment uses this real-time inference infrastructure to let users fetch personalized recommendations from the Headspace app's Today tab.
    • Push: Places the recommendation onto another SQS push-notifications queue, resulting in a push notification or an in-app modal content recommendation being delivered to the user; for example, an in-app modal recommendation triggered by a user's recent search for sleep content, or an iOS push notification following a recent content completion.
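A minimal sketch of what such a prediction Lambda handler might look like, assuming SQS-triggered invocation, a DynamoDB online feature store, and a boto3 call to the Sagemaker runtime (all resource names are hypothetical, and the push path is shown for Step 7):

    # Hypothetical sketch of the prediction Lambda; names and resources are illustrative.
    import json
    import boto3

    dynamodb = boto3.resource("dynamodb")
    sagemaker_runtime = boto3.client("sagemaker-runtime")
    sqs = boto3.client("sqs")

    FEATURE_TABLE = "online-features-green"   # assumed active feature store (see blue-green sketch)
    MODEL_ENDPOINT = "realtime-recs-green"    # assumed active Sagemaker endpoint
    PUSH_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/push-notifications"

    def handler(event, context):
        for record in event["Records"]:       # SQS batch of user event messages
            message = json.loads(record["body"])
            user_id = message["user_id"]

            # Step 5: enrich the event with online features for this user.
            features = dynamodb.Table(FEATURE_TABLE).get_item(Key={"user_id": user_id})["Item"]
            payload = {**message, **features}

            # Step 6: invoke the served Sagemaker model with the assembled features.
            response = sagemaker_runtime.invoke_endpoint(
                EndpointName=MODEL_ENDPOINT,
                ContentType="application/json",
                Body=json.dumps(payload),
            )
            recommendation = json.loads(response["Body"].read())

            # Step 7: forward the recommendation downstream (push path shown here).
            sqs.send_message(QueueUrl=PUSH_QUEUE_URL, MessageBody=json.dumps(recommendation))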

Summary

From an infrastructure and architecture perspective, a key lesson is to prioritize designing flexible hand-off points between the different services (in our case, the Publishing, Receiver, Orchestrator, and Serving layers). For instance:

  • What format should the message payload that our Structured Stream jobs send to the prediction SQS queue use?
  • What is in the model signature and HTTP POST payload that each Sagemaker model expects?
  • How do we synchronize the model image and the online feature stores so that we can safely and reliably update retrained models once in production?
