Scaling Production Feature Engineering Pipelines with Feast and Ray

Your models are ready, your notebooks look clean, yet every retrain feels like a gamble. A small schema tweak can sink a release. A slightly larger dataset can crash your laptop. You copy and paste the same feature logic three times, then still discover that training and serving use different values. This guide walks through a concrete way out, so you can turn fragile, one off feature code into a scalable pipeline that you can trust and reuse.

Many teams still build features in fragile notebooks that crash on big data, drift in production, and break with every schema change. Training jobs overrun their windows, joins take hours, and no one trusts that features in serving match features in training. This guide shows how to scale feature engineering pipelines with Feast and Ray so you can move from one-off notebooks to repeatable, production-grade workflows. You will learn what feature engineering pipelines are, how feature stores and distributed compute fit together, and how to design an architecture that grows from a laptop to a full cluster. You will see concrete steps, code snippets, and trade offs so you can decide when Feast and Ray fit your stack and when other tools might work better.

Key Takeaways

  • Feature engineering pipelines need a feature store plus distributed compute once data and models scale.
  • Feast centralizes feature definitions and storage, which improves consistency between training and serving.
  • Ray lets Python teams scale transformations without switching to a full Spark stack.
  • You can grow from a laptop setup to a production Feast and Ray architecture in clear stages.

From Fragile Notebooks to Scalable Feature Pipelines

Picture a small ML team at a growing SaaS company. They have churn and upsell models, all powered by handwritten SQL and Pandas code in notebooks. Each new feature means more joins, more copies of logic, and more hidden assumptions.

Then a single event table changes. Half the notebooks fail. Retraining jobs miss their time windows. The team spends days hunting broken joins instead of improving models. The serving system reads features from a different pipeline, so predictions drift from offline metrics.

Industry surveys report that a large share of ML work goes into data preparation and pipeline maintenance, not modeling. For example, the 2022 Anaconda State of Data Science report noted that data scientists spend much of their time on data cleaning and preparation, rather than modeling, which slows progress on core goals [1]. If you often patch data issues by hand, you are not alone.

Feature stores and distributed compute platforms exist to fix this bottleneck. Feast offers a central place for feature definitions and storage. Ray gives Python teams an easy way to scale workloads across cores and machines. Together they form a practical stack for scalable feature engineering pipelines.

What Is a Feature Engineering Pipeline?

Feature Engineering Pipeline Definition

A feature engineering pipeline is the end to end process that turns raw data into machine learning features in a repeatable, automated way. It covers extraction, cleaning, joining, aggregation, and delivery of features for training and real time inference.

Core Components of a Feature Pipeline

A robust feature pipeline usually includes several building blocks.

  • Data sources. These include OLTP databases, event streams like Kafka, logs, and data warehouses.
  • Transformations. This layer handles joins, filters, aggregations, window functions, and encoding.
  • Storage layers. Offline storage supports training sets, often as Parquet, BigQuery, or Snowflake tables. Online storage serves low latency features, for example Redis or a key value store.
  • Orchestration. Tools like Airflow, Prefect, or Ray Jobs run the pipeline on a schedule or on demand.
  • Monitoring and tests. Checks watch for data quality issues, feature drift, or failed jobs.

Why Scaling Feature Engineering Is Hard

Small teams often start with Pandas and simple scripts. That works for prototypes. It fails once volume and complexity grow.

  • Data volume and velocity. Tables reach billions of rows. Joins and aggregations no longer fit on one machine. Techniques such as reading large DataFrames in chunks can delay this limit, but they do not remove it.
  • Point in time correctness. Features must use only information available at each timestamp. Wrong joins create label leakage.
  • Training and serving consistency. If training and serving code paths differ, models see different feature values.
  • Schema evolution and backfills. Columns get renamed or added. Old features need backfills over months of history.
  • Operational burden. Copy pasted SQL in notebooks becomes impossible to maintain across models and teams.

These challenges push teams to adopt a feature store and a distributed compute layer. Simple Pandas pipelines lack built in point in time joins, feature versioning, and scalable execution. Feast and Ray address these gaps for Python centric ML stacks.
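
To make the point in time problem concrete, here is a minimal sketch of a leakage free training join using pandas merge_asof. The column names and values are illustrative; a feature store automates this logic across many features and entities.

import pandas as pd

# Labels: one row per (user_id, prediction timestamp)
labels = pd.DataFrame({
    "user_id": [1, 1, 2],
    "label_ts": pd.to_datetime(["2025-01-05", "2025-01-20", "2025-01-10"]),
    "churned": [0, 1, 0],
})

# Feature values, stamped with the time at which each value became known
features = pd.DataFrame({
    "user_id": [1, 1, 2],
    "feature_ts": pd.to_datetime(["2025-01-01", "2025-01-15", "2025-01-08"]),
    "num_events_7d": [3, 9, 4],
})

# For each label row, take the latest feature row at or before label_ts,
# so no future information leaks into the training example
train = pd.merge_asof(
    labels.sort_values("label_ts"),
    features.sort_values("feature_ts"),
    left_on="label_ts",
    right_on="feature_ts",
    by="user_id",
)
print(train)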

What Is a Feature Store (and Why You Need One)?

Feature Store Definition

A feature store is a centralized system for storing, managing, and serving machine learning features. It keeps a versioned history of features for offline training and offers low latency access to the same features for online inference.

Google Cloud describes its Vertex AI Feature Store as a managed repository that helps ensure training and serving use consistent features across workflows [2]. That captures the core idea for feature stores in general.

Key Benefits for ML Teams

  • Training and serving consistency. The same feature definitions feed batch training and online prediction.
  • Reuse of features. Teams share feature views across models instead of duplicating logic.
  • Backfills and point in time joins. Feature stores help build time correct training sets without leakage.
  • Governance and discoverability. The registry tracks ownership, schema, and documentation for each feature set.

Vendors like Tecton and Databricks highlight these same benefits in their feature store overviews, which shows broad agreement in the industry [3][4].

When You Should Add a Feature Store

A feature store is not always needed for a single research project. Certain signals suggest the time has come.

  • You run more than one or two models in production.
  • Multiple teams reuse similar features, such as user activity or product stats.
  • You have seen training and serving skew issues or unexplained drops in live performance.
  • Your notebooks contain duplicate ETL code for the same features.

At that point, a feature store like Feast reduces rework and improves reliability.

What Is Feast in Machine Learning?

Feast Definition

Feast is an open source feature store that lets you define, store, and serve machine learning features. It manages a feature registry, offline storage for training data, and online storage for low latency inference so teams can use the same features across development and production.

The official Feast documentation describes it as a platform that bridges the gap between data infrastructure and ML models and standardizes how features are defined and served [5].

Core Concepts and Terminology

  • Entity. The primary key of a feature set, such as user_id or device_id.
  • Feature view. A logical group of features tied to one or more entities and a data source.
  • Offline store. The storage that holds historical feature values across time, for example BigQuery, Snowflake, or file based tables.
  • Online store. A key value database that serves current feature values at low latency, for example Redis.
  • Registry or feature repo. The configuration that describes entities, feature views, and stores, usually kept in a Git repository.

Feast Architecture in Two Minutes

Feast connects to your existing data platform. The main flow looks like this.

  1. You define feature views that point to tables in your offline store, such as a parquet dataset or a warehouse table.
  2. Feast uses these views to build training sets that join feature values with labels using point in time logic.
  3. Materialization jobs read from the offline store and write fresh feature values into an online store.
  4. Online services call Feast to fetch features by entity keys at serving time.

The architecture page in the Feast docs shows how the registry, offline store, and online store fit together in more detail [6].
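
For orientation, the glue between these pieces is the repo configuration. The snippet below is a minimal sketch of a local feature_store.yaml, similar to what feast init generates; the paths and store types are placeholders you would swap for production backends.

project: my_feature_repo
provider: local
registry: data/registry.db        # feature definitions and metadata
offline_store:
  type: file                      # local Parquet files act as the offline store
online_store:
  type: sqlite                    # swap for Redis or DynamoDB in production
  path: data/online_store.db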

What Is Ray in Machine Learning?

Ray Definition

Ray is an open source distributed computing framework for Python that makes it easy to scale workloads like feature engineering, training, and inference across many cores and machines. It offers high level APIs for parallel tasks, distributed datasets, and model serving.

The Ray project page describes it as a unified framework to scale both Python applications and ML workloads from a laptop to a cluster [7].

Ray Concepts Relevant to Feature Engineering

  • Ray cluster. A head node manages the cluster. Worker nodes run tasks and hold data.
  • Tasks and actors. Tasks are stateless functions that run in parallel. Actors are stateful workers that keep data in memory across calls. A short sketch follows this list.
  • Ray Datasets. A high level API for distributed data processing that feels similar to Pandas.
  • Ray job submission. A way to package and submit jobs to a running cluster, often used from CI or orchestration tools.
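
The sketch below shows the task and actor primitives on a local Ray instance. The transformation itself is a trivial placeholder; in a real pipeline it would be your feature logic.

import ray

ray.init()  # starts a local Ray instance when no cluster address is given

# Task: a stateless function that Ray runs in parallel across workers
@ray.remote
def transform_partition(rows):
    return [{**row, "amount": row["amount"] * 2} for row in rows]

# Actor: a stateful worker that keeps data in memory across calls
@ray.remote
class FeatureCache:
    def __init__(self):
        self._cache = {}

    def put(self, key, value):
        self._cache[key] = value

    def get(self, key):
        return self._cache.get(key)

partitions = [[{"user_id": i, "amount": float(i)}] for i in range(4)]
results = ray.get([transform_partition.remote(p) for p in partitions])

cache = FeatureCache.remote()
ray.get(cache.put.remote("user:1", results[1]))
print(ray.get(cache.get.remote("user:1")))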

Why Use Ray for Feature Engineering?

Many teams hit scaling limits with single node Pandas. Spark is very powerful, but its Java and Scala roots create friction for some Python heavy groups.

Ray offers a flexible middle path.

  • It supports native Python functions and libraries.
  • It can run on a laptop, on VMs, or on Kubernetes.
  • Ray Datasets give a familiar API for batch transforms like joins and aggregations.

Anyscale, one of the main contributors to Ray, reports that companies such as OpenAI and Uber use Ray for large scale ML workloads [8]. That track record shows that Ray is mature enough for demanding feature pipelines. For comparison of model complexity choices you can review a short guide on machine learning versus deep learning.

Feast and Ray Architecture for Scalable Feature Pipelines

High Level Architecture Overview

Feast and Ray complement each other. Ray handles compute. Feast handles storage, catalog, and serving.

A common architecture follows this flow.

  1. Raw data lands in a data lake or warehouse such as S3 plus Parquet, BigQuery, or Snowflake.
  2. Ray Datasets read raw tables and perform heavy joins and aggregations in parallel.
  3. Ray jobs write the resulting feature tables into the offline store that Feast uses.
  4. Feast materialization jobs read from the offline store and populate the online store.
  5. Training pipelines request historical features from Feast. Online services request current features by key.

This separation keeps your compute and storage concerns distinct. You can size the Ray cluster for heavy transforms and scale Feast storage independently as data grows.

Batch Versus Real Time Feature Flows

Some features refresh in daily or hourly batches. Others need near real time updates.

  • Batch pipelines. A scheduler triggers a Ray job to compute features for the latest window. That job writes results into the offline store. Feast then materializes deltas into the online store.
  • Real time pipelines. Event streams feed a lightweight stream processor or Ray streaming job. That job updates feature values in the online store directly or via Feast streaming support.

Feast supports both batch and streaming sources, as described in its feature view documentation [9]. Many teams start with batch pipelines, then add real time paths for latency sensitive features.
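
As an example of the real time path, the sketch below assumes the feature repo defines a Feast push source named user_events_push and that a stream consumer pushes freshly computed rows to the online store; the column names are illustrative.

import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# A freshly computed feature row from a stream consumer (illustrative columns)
event_df = pd.DataFrame({
    "user_id": [123],
    "event_ts": [pd.Timestamp.utcnow()],
    "num_events": [12],
    "total_spend": [34.5],
})

# Push the row through the push source so online reads see it immediately
store.push("user_events_push", event_df)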

Local Prototype to Production Roadmap

It helps to think about growth in three stages.

  1. Local prototype. Everything runs on a laptop. Offline store is local Parquet. Online store is in memory or a local Redis container. Ray runs with local mode or a single node.
  2. Small team deployment. You deploy a small Ray cluster on VMs or Kubernetes. Offline store lives in S3 or a cloud warehouse. Feast runs as a feature repo managed in Git, with a shared registry.
  3. Production scale. Ray runs on managed Ray or a robust Kubernetes setup. You use a cloud warehouse such as BigQuery or Snowflake as the offline store and a managed key value store like Redis Enterprise or DynamoDB for the online store. CI and observability surround both systems.

This staged roadmap keeps complexity aligned with team size and traffic.

How to Build a Feature Pipeline with Feast and Ray

Step 1: Set Up a Minimal Local Environment

Start on a single machine to learn the concepts.

  1. Install Python and create a virtual environment.
  2. Install Feast and Ray via pip.
pip install feast "ray[default,data]"
  3. Initialize a Feast feature repository.
feast init my_feature_repo
cd my_feature_repo
  4. Start Ray in local mode.
import ray
ray.init()

Step 2: Define a Simple Feature View in Feast

Suppose you want daily user metrics from a Parquet file.

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# The entity is the join key used to look up feature values
user = Entity(name="user_id", join_keys=["user_id"])

user_events = FileSource(
    path="data/user_events.parquet",
    timestamp_field="event_ts"
)

user_daily_stats = FeatureView(
    name="user_daily_stats",
    entities=[user],
    ttl=None,
    schema=[
        Field(name="user_id", dtype=Int64),
        Field(name="num_events", dtype=Int64),
        Field(name="total_spend", dtype=Float32),
    ],
    source=user_events,
    online=True,
)

Save these definitions in a Python file inside your feature repo, then apply them to register the feature view.

feast apply

Step 3: Use Ray to Compute Features at Scale

Now replace simple local transforms with Ray Datasets. Assume raw events live in a large Parquet dataset.

import ray
import pandas as pd

ray.init()

# Read raw events straight into a Ray Dataset (Ray reads Parquet natively,
# including from S3 when the cluster has credentials)
ray_ds = ray.data.read_parquet("s3://my-bucket/raw_events/")

# Compute daily stats per user. Each group arrives as a pandas DataFrame
# holding all events for one user_id.
def daily_stats(df: pd.DataFrame) -> pd.DataFrame:
    df["event_date"] = pd.to_datetime(df["event_ts"]).dt.date
    out = (
        df.groupby(["user_id", "event_date"])
        .agg(
            num_events=("event_id", "count"),
            total_spend=("amount", "sum"),
        )
        .reset_index()
    )
    # Keep a timestamp column named to match the Feast FileSource (event_ts)
    out["event_ts"] = pd.to_datetime(out["event_date"])
    return out

grouped = ray_ds.groupby("user_id").map_groups(daily_stats, batch_format="pandas")

# Write the feature table to the offline store location as Parquet
grouped.write_parquet("s3://my-bucket/features/user_daily_stats/")

In a production setup, that Parquet location matches the FileSource that Feast uses as its offline store.
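
For example, the FileSource defined earlier could point at the same prefix the Ray job writes to, assuming the file based offline store can read from S3 in your environment; the bucket and path are placeholders.

from feast import FileSource

user_events = FileSource(
    path="s3://my-bucket/features/user_daily_stats/",  # written by the Ray job
    timestamp_field="event_ts",
)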

Step 4: Materialize Features into an Online Store

Once the offline data is ready, run Feast materialization.

feast materialize-incremental "2025-01-01T00:00:00"

This reads new rows from the offline store and writes them into the online store. In local setups, this may be a SQLite or in memory store. In production, you will use Redis or a cloud key value database.
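
If you prefer to drive materialization from Python, for example inside an orchestrated job, the Feast SDK exposes the same operation; the end date below is only an example.

from datetime import datetime
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Materialize rows newer than the last materialized timestamp, up to now
store.materialize_incremental(end_date=datetime.utcnow())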

Step 5: Fetch Features for Training and Serving

Use the Feast SDK to build training sets and query features.

import pandas as pd
from datetime import datetime
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Entity DataFrame: one row per training example with its prediction timestamp.
# Warehouse offline stores such as BigQuery also accept a SQL query string here.
entity_df = pd.DataFrame(
    {
        "user_id": [123, 456],
        "event_timestamp": [datetime(2025, 1, 1), datetime(2025, 1, 2)],
        "label": [0, 1],
    }
)

# Build a training dataset with point in time joins
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "user_daily_stats:num_events",
        "user_daily_stats:total_spend",
    ],
).to_df()

# Fetch online features for inference
online_features = store.get_online_features(
    features=[
        "user_daily_stats:num_events",
        "user_daily_stats:total_spend",
    ],
    entity_rows=[{"user_id": 123}],
).to_dict()

This pattern guarantees that the same feature definitions power both training and serving.

Scaling from Local to Production with Feast and Ray

Phase 1: Local and Single Node

In the first phase, focus on learning and correctness.

  • Use Parquet files on local disk or a small S3 bucket as the offline store.
  • Use the default Feast online store for experiments.
  • Run Ray in local mode with multiple workers on one machine.

At this stage, you confirm that feature definitions, joins, and materialization logic behave as expected.

Phase 2: Small Cluster and Shared Environment

Once multiple people depend on the pipeline, move to a shared environment.

  • Deploy a Ray cluster on a few cloud VMs or a small Kubernetes cluster.
  • Store offline data in S3, GCS, or a cloud warehouse. Feast supports several backends.
  • Run a Redis container as the online store or use a managed Redis service.
  • Manage the Feast feature repo in Git with code review and tests.

Use Ray Jobs or an orchestrator like Airflow to trigger feature computation and materialization on a schedule.
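
A scheduled run can be as simple as submitting the feature script to the cluster with the Ray job CLI and then materializing; the address and script name below are placeholders.

# Submit the feature computation script to a running Ray cluster
ray job submit --address http://ray-head:8265 --working-dir . -- python compute_features.py

# Then load the fresh rows into the online store
feast materialize-incremental "2025-01-01T00:00:00"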

Phase 3: Production Grade Setup

The final phase adds resilience, observability, and cost control.

  • Run Ray on managed infrastructure, such as Anyscale or Ray on Kubernetes with autoscaling.
  • Adopt a cloud warehouse as the main offline store, like BigQuery or Snowflake.
  • Use a highly available online store such as Redis Enterprise or DynamoDB.
  • Instrument pipelines with logging, metrics, and tracing.
  • Integrate data validation tools like Great Expectations or TFDV.

Cloud providers advertise large savings for spot or preemptible VMs relative to on demand pricing. For instance, AWS notes that Spot Instances can reduce costs by up to 90 percent for suitable workloads [10]. Batch feature computation often fits this pattern, so Ray clusters can use cheaper instances for large jobs. You can also automate repeated steps in this lifecycle with Python centric tooling that follows patterns similar to those in guides on streamlining ML workflows.

Feast and Ray versus Alternatives

Managed Feature Stores

Cloud vendors and startups offer managed feature stores.

  • Amazon SageMaker Feature Store. Tightly integrated with SageMaker and other AWS services [11].
  • Vertex AI Feature Store. Integrated with BigQuery and GCP ML services [2].
  • Tecton. A commercial feature platform built by early Feast contributors [3].

Managed stores reduce operational overhead but may lock you into a vendor and charge higher per feature or per read costs.

Alternative Compute Engines

Many teams use other engines for feature engineering.

  • Spark. Very mature, especially on Databricks. Great for SQL heavy pipelines and big batch ETL.
  • Dask. Scales Python across cores and machines, with a Pandas like API.
  • Plain SQL and warehouse features. Some warehouses support feature style patterns directly.

Spark feature stores exist as managed options. Databricks Feature Store integrates with Spark and Delta tables [4]. That stack fits teams already invested in Spark. When you design models on top of these platforms, a short primer on core ML algorithms can help alignment between data and model choices.

Decision Guide by Team Type

Consider three simplified profiles.

  • Student or solo practitioner. Start with Feast and small Ray jobs on a laptop. Managed services are likely overkill.
  • Startup with Python heavy team. Feast plus Ray offers strong flexibility with full control over infrastructure. You avoid an all Spark stack if your team prefers Python.
  • Enterprise already on Spark and Databricks. Managed feature stores on top of Spark and Delta may integrate more smoothly. Ray can still play a role for certain ML workloads, but Spark will likely handle most feature engineering.

The best choice depends on skills, cloud stack, and governance needs. Feast and Ray shine where teams want open source tools, Python centric workflows, and modular architecture.

My Experience

I have worked with teams that moved from notebook based feature pipelines to setups that looked like Feast and Ray, with different tooling names but similar ideas. The biggest shift was cultural. Engineers had to think about features as shared assets, not personal notebook code. That change required a feature registry, code review, and common ownership.

Scaling compute with a system like Ray felt natural for Python users. People could reuse existing Pandas logic with small adjustments instead of rewriting everything in another framework. That eased adoption and shortened the ramp up period for new team members. Simple checks, such as one line Pandas data quality assertions, also helped catch obvious feature issues early, which reduced firefighting later.

The hardest problems were not raw performance, but data quality and operational issues. Enforcing point in time correctness surfaced many historical label leaks. Data validation caught silent schema changes before they corrupted features. Monitoring of both pipeline health and feature distributions helped detect regressions early.

In my view, Feast and Ray fit best when you already embrace infrastructure as code and CI practices. Feature definitions become versioned code. Pipeline runs become part of your deployment story. Teams that invest in those basics get the most from a feature store and distributed compute framework.

Common Failure Modes and How to Avoid Them

Data Skew and Slow Joins

Large joins can skew work across nodes if a few keys dominate the data.

  • Use balanced partitioning keys where possible.
  • Monitor Ray task durations and data volumes per worker.
  • Consider bucketing keys or salting hot keys.

Ray Data includes profiling tools and progress bars that help detect skew during development [12].
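
Salting is easiest to see in a small sketch: spread each hot key across several synthetic sub keys, aggregate partially, then combine the partials. The column names and salt count are illustrative, and the same idea applies inside Ray Data transforms.

import numpy as np
import pandas as pd

NUM_SALTS = 8  # illustrative; tune to the observed skew

# One very hot user_id (42) dominates the data
df = pd.DataFrame({"user_id": [42] * 1000 + [7] * 10, "amount": 1.0})

# Spread each key across NUM_SALTS synthetic partitions
df["salt"] = np.random.randint(0, NUM_SALTS, size=len(df))

# First pass: partial aggregates per (user_id, salt), parallelizable without skew
partial = df.groupby(["user_id", "salt"])["amount"].sum().reset_index()

# Second pass: combine partials back into one row per user_id
final = partial.groupby("user_id")["amount"].sum().reset_index()
print(final)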

Feature Drift and Data Quality Issues

Feature values can drift over time as user behavior or upstream systems change. Data quality vendors such as Monte Carlo report that many incidents in ML systems stem from upstream data issues rather than model bugs [13].

Mitigation steps include:

  • Define expectation tests on input tables and feature outputs (a small sketch follows this list).
  • Monitor summary statistics and distribution changes on key features.
  • Alert on schema changes and missing data rates.
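
Even without a full validation framework, a few cheap assertions on the feature table catch many problems before they reach training. The path, columns, and thresholds below are illustrative.

import pandas as pd

features = pd.read_parquet("data/user_daily_stats.parquet")  # illustrative path

# Schema check: required columns exist
expected = {"user_id", "event_ts", "num_events", "total_spend"}
missing = expected - set(features.columns)
assert not missing, f"missing feature columns: {missing}"

# Missing data rate stays under a threshold
null_rate = features["total_spend"].isna().mean()
assert null_rate < 0.05, f"total_spend null rate too high: {null_rate:.2%}"

# Simple distribution guardrail on a key feature
assert (features["num_events"] >= 0).all(), "num_events contains negative values"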

Training and Serving Skew

Skew appears when training features differ from serving features due to path divergence.

  • Use Feast for both offline and online access instead of separate code paths.
  • Keep transformations in version controlled feature definitions rather than ad hoc scripts.
  • Regularly compare offline and online feature values for a sample of entities.

Uber and Airbnb platform teams have written about training and serving skew and built feature stores to address it [14][15]. Their lessons apply to Feast as well.
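
A lightweight consistency check compares offline and online values for a small sample of entities using the same Feast store from the earlier steps; the sample, feature names, and comparison are illustrative.

import pandas as pd
from datetime import datetime, timezone
from feast import FeatureStore

store = FeatureStore(repo_path=".")
sample_users = [123, 456]
feature_refs = ["user_daily_stats:num_events", "user_daily_stats:total_spend"]

# Online values, as the serving path would see them right now
online = store.get_online_features(
    features=feature_refs,
    entity_rows=[{"user_id": u} for u in sample_users],
).to_df()

# Offline values for the same entities at the current timestamp
entity_df = pd.DataFrame({
    "user_id": sample_users,
    "event_timestamp": [datetime.now(timezone.utc)] * len(sample_users),
})
offline = store.get_historical_features(
    entity_df=entity_df,
    features=feature_refs,
).to_df()

# Compare a key feature; large gaps point to skew between the two paths
merged = offline.merge(online, on="user_id", suffixes=("_offline", "_online"))
gap = (merged["total_spend_offline"] - merged["total_spend_online"]).abs()
print(gap.describe())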

FAQ

What is a feature engineering pipeline?

A feature engineering pipeline is the automated process that turns raw data into machine learning features. It includes extraction, cleaning, joins, aggregations, and storage for both training and inference.

What is Feast in machine learning?

Feast is an open source feature store. It manages feature definitions, stores historical feature data for training, and serves fresh feature values with low latency for online predictions.

What is Ray in machine learning?

Ray is a distributed computing framework for Python. It lets you scale tasks like feature engineering, training, and serving across clusters without switching languages or rewriting code in another framework.

What is a feature store?

A feature store is a centralized system that stores, manages, and serves machine learning features for both offline training and online inference. It ensures that models use consistent features across environments.

How do I scale feature engineering pipelines?

To scale feature engineering pipelines, separate compute and storage concerns. Use a feature store such as Feast for storage and catalog. Use a distributed engine like Ray to handle large joins, aggregations, and complex transforms. Add orchestration, monitoring, and data validation as traffic grows.

Can I use Feast without Ray?

Yes. Feast works with any offline compute stack, including SQL based ETL, Spark, or simple Python scripts. Ray becomes useful when you want to scale Python transformations without fully moving to Spark or a different engine.

Is Ray a replacement for Spark?

Ray is not a direct drop in replacement for Spark. Spark remains strong for SQL heavy ETL and long standing data engineering workflows. Ray focuses on flexible Python workloads and offers strong support for ML training, serving, and feature engineering.

When should a small team adopt a feature store?

A small team should consider a feature store once they operate more than a few models in production, see repeated feature logic across notebooks, or struggle with training and serving consistency. A feature store helps manage that complexity in a structured way.

Conclusion

Scaling feature engineering pipelines requires more than faster notebooks. It needs clear separation between feature definitions, storage, and compute. Feast provides the feature catalog and consistent access path for training and serving. Ray provides a way to scale Python transforms across cores and machines without abandoning familiar tools.

You can start small, with a local Feast repo and Ray in local mode, and grow to a shared cluster and cloud stores as needs increase. Along the way, you gain reproducible training sets, shared feature definitions, and more predictable latency. Alternative stacks like Spark based feature stores or fully managed services may fit other teams better, but Feast and Ray form a powerful open source option for many Python centric ML groups.

If your current pipelines feel fragile, run a pilot. Pick one production model, define its top three features in Feast, run feature computation on Ray, and compare training time, failure rates, and feature reuse to your current approach. Capture what breaks, document what gets simpler, and use that evidence to decide how far you want to take this architecture.

References

  1. Anaconda. “2022 State of Data Science.” 2022. https://www.anaconda.com/state-of-data-science-2022
  2. Google Cloud. “Vertex AI Feature Store overview.” Accessed 2026. https://cloud.google.com/vertex-ai/docs/featurestore
  3. Tecton. “What is a Feature Store?” Accessed 2026. https://www.tecton.ai/blog/what-is-a-feature-store/
  4. Databricks. “What is Databricks Feature Store?” Accessed 2026. https://docs.databricks.com/en/machine-learning/feature-store/index.html
  5. Feast. “What is Feast?” Documentation. Accessed 2026. https://docs.feast.dev
  6. Feast. “Architecture overview.” Documentation. Accessed 2026. https://docs.feast.dev/reference/architecture
  7. Ray Project. “What is Ray?” Documentation. Accessed 2026. https://docs.ray.io/en/latest/ray-overview/index.html
  8. Anyscale. “Who uses Ray?” Accessed 2026. https://www.ray.io/users
  9. Feast. “Streaming feature ingestion.” Documentation. Accessed 2026. https://docs.feast.dev/how-to-guides/streaming
  10. Amazon Web Services. “Amazon EC2 Spot Instances pricing.” Accessed 2026. https://aws.amazon.com/ec2/spot/pricing/
  11. Amazon Web Services. “Amazon SageMaker Feature Store.” Documentation. Accessed 2026. https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store.html
  12. Ray Project. “Ray Data: Scalable Datasets for ML.” Documentation. Accessed 2026. https://docs.ray.io/en/latest/data/data.html
  13. Monte Carlo. “The State of Data Quality: 2023 Edition.” 2023. https://www.montecarlodata.com/resources/state-of-data-quality-2023/
  14. Uber Engineering. “Introducing Michelangelo: Uber’s Machine Learning Platform.” 2017. https://eng.uber.com/michelangelo/
  15. Airbnb Engineering. “Zipline: Airbnb’s Machine Learning Data Management Platform.” 2018.
