This document discusses Project Vortex, Actian's effort to build its high-performance vectorized database system, Vectorwise, directly into Hadoop. It gives an overview of vector processing and of SQL-on-Hadoop approaches. Project Vortex uses Actian Vector's vectorized query execution engine and aims to provide the highest SQL query performance on Hadoop. The document also details Project Vortex's architecture, storage format, data ingestion capabilities, and roadmap for further Hadoop integration and optimization. Benchmark results show it outperforming other SQL-on-Hadoop systems such as Impala by over 10x on average.
Execution
Subset of TPC-DS as chosen by Impala
Data size is 3TB (SF3000)
Executed on 5-node “rushcluster” in Austin
Both Impala and Vector numbers are on the same hardware
Comparison with Impala
Verified that Impala plans are sensible
Currently observed average speedup is 11x
Optimal query plans (manually written) give us a 16x speedup
These are real numbers! We executed manual plans directly
Changes in the cost model would get us to this performance
Performance improvements
Cost model changes will get us to 16x speedup
Pipeline of query execution changes
Expected well into H2
Estimated to get us a 2x improvement
So the estimated speedup vs Impala would be ~30x (16x × 2x ≈ 32x; no guarantees)
Planning to run TPC-H SF1000 and SF3000
With all planned improvements (by the end of the year), we should be able to beat the EXASOL cluster numbers.
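The ~30x estimate above can be reproduced with a short calculation. This is only a sketch: it assumes the cost model gain and the pipeline gain are independent and simply multiply, which the slides imply but do not state. The helper name `compound_speedup` is hypothetical; the three input figures are taken from the slides.

```python
def compound_speedup(*factors):
    """Multiply speedup factors, assuming they are independent of each other."""
    total = 1.0
    for factor in factors:
        total *= factor
    return total


# Figures quoted in the slides (speedup relative to Impala).
observed_avg = 11.0      # currently observed average speedup
manual_plans = 16.0      # speedup with manually written optimal plans
pipeline_gain = 2.0      # estimated gain from query execution pipeline changes

# Extra factor the cost model changes would need to deliver on top of today's 11x.
cost_model_gain = manual_plans / observed_avg

# Projected speedup once both improvements land (assumption: gains multiply).
projected = compound_speedup(manual_plans, pipeline_gain)

print(f"cost model gain needed: {cost_model_gain:.2f}x")
print(f"projected speedup vs Impala: {projected:.0f}x")
```

Under this assumption the projection is 32x, which the slides round to ~30x. If the two improvements overlap (for example, if better plans already remove some of the pipeline overhead), the combined gain would be lower.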