// the find

feathr-ai/feathr

★ 1,928 · Scala · Apache-2.0 · updated Apr 2024

Feathr – A scalable, unified data and AI engineering platform for enterprise

Feathr is LinkedIn's internal feature store, open-sourced in 2022. It handles feature definition, materialization, and serving across batch, streaming, and online environments, with Spark as the compute engine. It is aimed at ML teams at companies already running Azure or Databricks who need point-in-time correct training data and a shared feature registry.

Point-in-time join correctness is a first-class concept rather than a bolt-on: the sliding-window aggregation API prevents data leakage at the framework level. The feature registry with lineage tracking is genuinely useful: you can see which upstream sources feed a feature and who consumes it, which matters when dozens of teams share features. The sandbox Docker image gets you a working environment in one command, a real time-saver for evaluation. The Python API is clean: defining anchors, derived features, and window aggregations reads close to how you'd think about the problem.
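To make "point-in-time correct" concrete, here is a minimal standard-library sketch of a windowed aggregation computed as of each label timestamp. It is not Feathr's API; the function `window_sum_as_of`, the sample purchase data, and the choice to exclude events stamped exactly at the label time are all assumptions made for illustration.

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def window_sum_as_of(event_times, event_values, as_of, window):
    """Sum the values of events with as_of - window <= timestamp < as_of.

    event_times must be sorted ascending and aligned with event_values.
    Events at or after as_of are never read, which is what prevents leakage.
    """
    start = bisect_left(event_times, as_of - window)
    end = bisect_left(event_times, as_of)  # index of first event at or after as_of
    return sum(event_values[start:end])


# Hypothetical purchase history for a single user, sorted by time.
t0 = datetime(2024, 1, 1)
purchase_times = [
    t0,
    t0 + timedelta(days=1),
    t0 + timedelta(days=3),
    t0 + timedelta(days=6),
]
purchase_amounts = [10.0, 20.0, 5.0, 40.0]

# Label timestamps: the moments at which each training example was observed.
label_times = [t0 + timedelta(days=2), t0 + timedelta(days=4)]

# One training row per label, each seeing only the 3 days before its own timestamp.
training_rows = [
    (ts, window_sum_as_of(purchase_times, purchase_amounts, ts, timedelta(days=3)))
    for ts in label_times
]

for ts, spend_3d in training_rows:
    print(ts.date(), spend_3d)
```

A naive join that aggregates the whole history would hand both rows the day-6 purchase, which had not happened yet at either label time. Doing this per row, per feature, at Spark scale is the part a feature store takes off your hands.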

The last commit was in April 2024, and the roadmap items (feature versioning, monitoring) are still open checkboxes, so the project looks stalled. The Azure bias is heavy: the deployment story is Synapse or Databricks, and the ARM template path assumes you're all-in on that ecosystem; AWS support exists but feels like an afterthought. Because the core is Scala/Java, debugging a failed Spark job means reading JVM stack traces through the Python client, which is painful. The online serving story is thin: you're expected to bring your own Redis and wire it up yourself, with no built-in consistency guarantees between offline materialization and online reads.
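If you adopt it anyway, the missing offline/online guarantee is something you would likely have to check yourself. A rough sketch of such a parity check follows, with a plain dict standing in for the online store; `find_drift`, the key format, and the tolerance are hypothetical, and a real check would read from your Redis client instead.

```python
def find_drift(offline_rows, online_get, tolerance=1e-9):
    """Return keys whose online value is missing or differs from the offline value.

    offline_rows: dict of key -> expected feature value from the batch job.
    online_get: callable returning the online value for a key, or None if absent.
    """
    drifted = []
    for key, expected in offline_rows.items():
        actual = online_get(key)
        if actual is None or abs(actual - expected) > tolerance:
            drifted.append(key)
    return drifted


# Hypothetical materialized values and a dict standing in for the online store.
offline = {"user:1": 30.0, "user:2": 25.0, "user:3": 7.5}
online = {"user:1": 30.0, "user:2": 24.0}  # user:2 is stale, user:3 was never written

drift = find_drift(offline, online.get)
print(drift)
```

Run on a sample of keys after each materialization job, a check like this at least makes stale or missing online values visible.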
