// the find
salesforce/TransmogrifAI
TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning
TransmogrifAI is a Salesforce-built AutoML library for Spark that wraps SparkML with a type-safe Scala DSL. It automates feature engineering, model selection via cross-validated hyperparameter search, and basic feature validation (sanity checks, correlation filtering). Target audience is data engineering teams already running Spark who want to reduce time-to-first-model without writing boilerplate pipeline code.
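To make "sanity checks, correlation filtering" concrete, here is a minimal standalone sketch of that style of feature validation in plain Python. It is not TransmogrifAI's API or implementation: the function names, the thresholds (`max_label_corr`, `min_variance`), and the toy data are all assumptions chosen for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def variance(xs):
    """Population variance of a numeric column."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def sanity_check(columns, label, max_label_corr=0.95, min_variance=1e-5):
    """Split features into kept names and a {name: reason} map of drops.

    Thresholds are illustrative assumptions, not library defaults.
    """
    kept, dropped = [], {}
    for name, values in columns.items():
        if variance(values) < min_variance:
            dropped[name] = "low variance"
        elif abs(pearson(values, label)) > max_label_corr:
            # A feature that nearly reproduces the label is a leakage suspect.
            dropped[name] = "possible label leakage"
        else:
            kept.append(name)
    return kept, dropped

# Hypothetical toy data: one honest feature, one constant, one leaked label.
label = [0, 1, 0, 1, 1, 0]
columns = {
    "age": [22, 38, 26, 35, 54, 2],
    "constant": [1, 1, 1, 1, 1, 1],
    "refund_issued": [0, 1, 0, 1, 1, 0],
}
kept, dropped = sanity_check(columns, label)
print(kept)
print(dropped)
```

The point of automating this is that the checks run on every feature before model selection, so a leaked column is dropped with a recorded reason instead of silently inflating validation metrics.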
Compile-time type safety on features is genuinely useful: passing the wrong feature type to a transformer is a runtime crash in vanilla SparkML, but here it's a compile error. The SanityChecker, which automatically detects label leakage and removes correlated or low-variance features, is a real time saver that most teams implement badly by hand. The LOCO (Leave One Covariate Out) record-level insights are more honest than global feature importances: you get per-prediction explanations, not just dataset-wide averages. The CLI scaffolding tool that generates a working Spark project from your Avro schema is a practical touch that cuts project setup from an afternoon to minutes.
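The LOCO idea is simple enough to sketch in a few lines of plain Python: score a record, then re-score it with each feature replaced by a baseline value, and report the per-feature change. This is a sketch of the general technique under stated assumptions, not TransmogrifAI's implementation. The toy logistic model, its weights, and the zero baseline are all hypothetical.

```python
import math

def make_logistic_scorer(weights, bias):
    """Build a toy logistic scorer standing in for a trained model."""
    def score(row):
        z = bias + sum(weights[name] * value for name, value in row.items())
        return 1.0 / (1.0 + math.exp(-z))
    return score

def loco_insights(score_fn, row, baseline=0.0):
    """Leave One Covariate Out for a single record.

    For each feature, replace its value with `baseline` and record how far
    the score moves. Returns (feature, score_change) pairs sorted by
    absolute impact, largest first.
    """
    full_score = score_fn(row)
    changes = {}
    for name in row:
        perturbed = dict(row)
        perturbed[name] = baseline
        changes[name] = full_score - score_fn(perturbed)
    return sorted(changes.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Hypothetical model and record, for illustration only.
score = make_logistic_scorer(
    weights={"age": -0.03, "fare": 0.02, "is_female": 2.0},
    bias=-0.5,
)
record = {"age": 30.0, "fare": 50.0, "is_female": 1.0}

insights = loco_insights(score, record)
for feature, change in insights:
    print(f"{feature}: {change:+.3f}")
```

Because the changes are computed per record, two rows with the same prediction can have different top drivers, which is exactly what a single global importance ranking hides.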
Pinned to Spark 2.4 and Scala 2.11, both years past EOL. In 2026 this means fighting dependency hell with anything modern, and it's unclear whether Salesforce is actively maintaining it for Spark 3.x, given that the last published stable release is 0.7.0. Deep learning is absent: there are no neural net model selectors, which matters for text-heavy datasets where the SmartTextVectorizer's bag-of-words approach (categorical pivoting or term hashing, no learned embeddings) will underperform. The automation is still structured-data-only; if your features are images, time series, or raw text at scale, you're outside the happy path. Build tooling is Gradle with a large multi-module setup that requires significant JVM memory to compile, and new contributors routinely hit OOM on first build.