// the find
prestodb/presto
The official home of the Presto distributed SQL query engine for big data
Presto is the distributed SQL query engine that originated at Facebook (now Meta), designed to run analytical queries across heterogeneous data sources (Hive, HDFS, S3, Kafka, Accumulo, and more) without first moving the data. It's a serious piece of infrastructure used at petabyte scale, not a weekend project. The target audience is data platform teams who need federated SQL across a data lake.
The connector architecture is well designed: each data source gets its own plugin implementing a clean SPI, so adding a new source doesn't touch the query engine core. The C++ native worker (presto-native-execution, built on Velox) is a real investment: offloading execution to Velox gets you SIMD-accelerated vectorized processing that the JVM execution path can't match. The CI pipeline is thorough, with separate workflow files per connector (Hive, Kudu, SingleStore, Arrow Flight) and product tests that run against real environments rather than unit mocks. The project is under active development at the time of writing, with 5,500+ forks and a clear governance structure under the Linux Foundation.
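To make the plugin idea concrete, here is a minimal, self-contained sketch of the pattern described above. The interfaces `Plugin`, `ConnectorFactory`, and `Connector` are simplified, hypothetical stand-ins written for this example; they are not the real presto-spi types, and `Engine`, `InMemoryPlugin`, and the `memory-demo` name are made up. The point it tries to show is only that the engine core depends on interfaces, so a new source is an additional plugin rather than an engine change.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Entry point: wires a made-up plugin into the toy engine and lists its tables.
final class SpiSketch {
    public static void main(String[] args) {
        Engine engine = new Engine();
        engine.installPlugin(new InMemoryPlugin());
        engine.createCatalog("demo", "memory-demo", Map.of("tables", "orders,lineitem"));
        System.out.println(engine.listTables("demo"));
    }
}

// Hypothetical, simplified interfaces for illustration; NOT the real presto-spi API.
interface Connector {
    List<String> listTables();
}

interface ConnectorFactory {
    String getName();

    Connector create(String catalogName, Map<String, String> config);
}

interface Plugin {
    List<ConnectorFactory> getConnectorFactories();
}

// Toy "engine core": it only knows the interfaces above, never a concrete source.
final class Engine {
    private final Map<String, ConnectorFactory> factories = new LinkedHashMap<>();
    private final Map<String, Connector> catalogs = new LinkedHashMap<>();

    // Registers every connector factory a plugin contributes, keyed by name.
    void installPlugin(Plugin plugin) {
        for (ConnectorFactory factory : plugin.getConnectorFactories()) {
            factories.put(factory.getName(), factory);
        }
    }

    // Binds a catalog name to a connector instance built by the named factory.
    void createCatalog(String catalogName, String connectorName, Map<String, String> config) {
        ConnectorFactory factory = factories.get(connectorName);
        if (factory == null) {
            throw new IllegalArgumentException("No connector registered as: " + connectorName);
        }
        catalogs.put(catalogName, factory.create(catalogName, config));
    }

    List<String> listTables(String catalogName) {
        Connector connector = catalogs.get(catalogName);
        if (connector == null) {
            throw new IllegalArgumentException("Unknown catalog: " + catalogName);
        }
        return connector.listTables();
    }
}

// A made-up data source: table names come from a comma-separated config value.
final class InMemoryPlugin implements Plugin {
    @Override
    public List<ConnectorFactory> getConnectorFactories() {
        return List.of(new ConnectorFactory() {
            @Override
            public String getName() {
                return "memory-demo";
            }

            @Override
            public Connector create(String catalogName, Map<String, String> config) {
                List<String> tables =
                        Arrays.asList(config.getOrDefault("tables", "orders").split(","));
                return () -> tables;
            }
        });
    }
}
```

Adding a second source in this sketch means writing another `Plugin` implementation and calling `installPlugin` again; nothing in `Engine` changes. The real SPI covers far more (metadata, splits, page sources), so treat this as an outline of the shape, not of the actual contract.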
The build is painful to get running locally: it needs a Hive metastore just to do development, the full Maven build is slow, and the JDK 17 setup requires a pile of --add-opens flags, a sign that the codebase is still fighting the module system rather than embracing it. The Java/C++ split (coordinator in Java, optional native workers in C++) means you're running two completely different runtimes and debugging across that boundary when things go wrong. The coordinator is a single point of failure: there is no coordinator HA out of the box, so a coordinator restart kills all running queries. Documentation quality is uneven. Connector docs vary wildly, and operational guidance (tuning memory configs, resource groups, OOM behavior) is scattered and often outdated.
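For readers who haven't run into the --add-opens issue mentioned above, the following small probe illustrates the general mechanism. It is a sketch under stated assumptions: `AddOpensProbe` and `canDeepReflect` are names invented here, and `java.lang.String`'s private `value` field is used purely as a convenient JDK-internal target. Which packages Presto itself needs opened is not something this example claims to know.

```java
import java.lang.reflect.Field;

final class AddOpensProbe {
    // Private field in our own class: reflecting on it never needs --add-opens.
    private final int ownField = 42;

    // Returns true if a private field of the given class can be made accessible.
    static boolean canDeepReflect(Class<?> type, String fieldName) {
        try {
            Field field = type.getDeclaredField(fieldName);
            field.setAccessible(true);
            return true;
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException(
                    "No field " + fieldName + " on " + type.getName(), e);
        } catch (RuntimeException e) {
            // On strongly encapsulated JDKs this is InaccessibleObjectException,
            // caught via its RuntimeException supertype.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("own class: " + canDeepReflect(AddOpensProbe.class, "ownField"));
        // Illustrative target only; not a statement about what Presto reflects on.
        System.out.println("java.lang.String: " + canDeepReflect(String.class, "value"));
    }
}
```

On JDK 17 with no extra flags, the second probe should report false, because JDK internals are strongly encapsulated by default. Running the same class with `--add-opens java.base/java.lang=ALL-UNNAMED` opens that one package to reflection. Any library that reflects into several JDK packages needs one such flag per package, which is how the list grows.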