// the find

OryxProject/oryx

★ 1,783 · Java · Apache-2.0 · updated Aug 2021

Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning

Oryx 2 is a framework for building lambda architecture ML systems on top of Spark and Kafka, with batteries-included apps for ALS collaborative filtering, k-means clustering, and random decision forests. It targets teams running Hadoop/Cloudera clusters who want real-time model updates without building the plumbing from scratch. The last commit landed in August 2021, so treat the project as effectively dead.

The three-layer separation (batch/speed/serving) is cleanly expressed in the API: you implement three interfaces and the framework wires Kafka and Spark together for you. The ALS implementation includes a hyperparameter tuning loop that scores candidate values of k (the number of latent factors) and lambda (the regularization strength) on held-out data, a step most roll-your-own CF systems skip. PMML is the model interchange format between layers, a reasonable choice for portability across JVM ML libraries. Test coverage is solid for a project of this age; the integration tests spin up real mini Spark/Kafka clusters rather than mocking them.
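To make the batch/speed/serving split concrete, here is a toy, in-memory sketch of the pattern. It is not Oryx's Java API: the class names, method names, and the "MODEL"/"UP" message keys are all illustrative assumptions, a plain list stands in for the Kafka update topic, and the "model" is just a per-item count rather than anything ALS-like.

```python
from collections import Counter


class BatchLayer:
    """Hypothetical batch layer: rebuilds a full model from all history."""

    def __init__(self):
        self.history = []

    def run_generation(self, new_events):
        # Recompute the model from scratch over everything seen so far.
        self.history.extend(new_events)
        return ("MODEL", dict(Counter(self.history)))


class SpeedLayer:
    """Hypothetical speed layer: emits small deltas between batch runs."""

    def build_updates(self, new_events):
        return [("UP", (item, n)) for item, n in Counter(new_events).items()]


class ServingLayer:
    """Hypothetical serving layer: keeps the latest model in memory."""

    def __init__(self):
        self.counts = {}

    def consume(self, key, message):
        if key == "MODEL":
            # A new batch model replaces the in-memory state wholesale.
            self.counts = dict(message)
        elif key == "UP":
            # An incremental update is folded into the current state.
            item, n = message
            self.counts[item] = self.counts.get(item, 0) + n

    def top(self, k=1):
        # Highest count first; ties broken alphabetically.
        return sorted(self.counts, key=lambda i: (-self.counts[i], i))[:k]


batch, speed, serving = BatchLayer(), SpeedLayer(), ServingLayer()
update_topic = []  # stand-in for a Kafka topic carrying models and updates

# First batch generation over the initial events.
update_topic.append(batch.run_generation(["a", "b", "a"]))
# New events arrive; the speed layer publishes deltas before the next batch.
update_topic.extend(speed.build_updates(["b", "b"]))

for key, message in update_topic:
    serving.consume(key, message)

print(serving.top(1))
```

The design point the sketch tries to show: the serving layer answers queries from a batch model plus speed-layer deltas, and the next batch generation (which re-reads the same events) replaces that state rather than adding to it, so nothing is double-counted.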

Dead project: last pushed 2021, Travis CI build badge, Cloudera branding throughout. The ecosystem it was built for (Spark MLlib's old Java API, YARN clusters) has moved on. The lambda architecture pattern itself has largely been supplanted by Flink or Spark Structured Streaming, which can handle the batch and speed layers in one unified system. Requires a full Hadoop cluster to run anything meaningful; there's no lightweight local mode that would let you evaluate it without a Cloudera deployment. Documentation lives at oryx.io, which you'll need to verify still resolves, and the linked SourceSpy module diagram comes from a third-party service with no guarantee of accuracy.
