Presto@Netflix Presto Meetup 03-19-15

Published in: Internet

  1. Presto @ Netflix: Interactive Queries at Petabyte Scale. Nezih Yigitbasi and Zhenxiao Luo, Big Data Platform
  2. Outline » Big data platform @ Netflix » Why we love Presto? » Our contributions » What are we working on? » What else we need?
  3. Our Data Pipeline (diagram labels): Cloud Apps, Suro, Ursula, Cassandra, SSTables, Aegisthus, S3; event data (15m) and dimension data (daily)
  4. Big Data Platform Architecture (diagram labels): Data Warehouse, Service, Tools, Gateways, Clients, Clusters, VPC, Prod, Query, Test, Bonus
  5. Our Use Cases » Batch jobs (Pig, Hive) » ETL jobs » Reporting and other analysis » Ad-hoc queries » Interactive data exploration » Looked at Impala, Redshift, Spark, and Presto
  6. Presto @ Netflix » Deployment: v 0.86, 1 coordinator (r3.4xlarge), 250 workers (m2.4xlarge) » Numbers: ~2.5K queries/day against our 10PB Hive DW on S3; 230+ Presto users out of 300+ platform users » Tooling: presto-cli, Python, R, BI tools (ODBC/JDBC), etc.; Atlas/Suro for monitoring/logging
  7. Why we love Presto? » Open source » Fast » Scalable » Works well on AWS » Good integration with the Hadoop stack » ANSI SQL
  8. Our Contributions: 24 open PRs, 60+ commits » S3 file system (multipart upload, IAM roles, retries, monitoring, etc.) » Functions for complex types » Parquet (name/index-based access, type coercion, etc.) » Query optimization » Various other bug fixes
  9. What are we Working on? Parquet Optimizations » Vectorized reader*: read based on column vectors » Predicate pushdown: use statistics to skip data » Lazy load: postpone loading the data until needed » Lazy materialization: postpone decoding the data until needed (* PARQUET-)
  10. Netflix Integration » BI tools integration (ODBC driver, Tableau web connector, etc.) » Better monitoring (Ganglia ⟶ Atlas) » Data lineage (Presto ⟶ Suro ⟶ Charlotte)
  11. What else we need? » Graceful cluster shrink » Better resource management » Dynamic type coercion for all file formats » Support for more Hive types (e.g., decimal) » Predictable metastore cache behavior » Big table joins similar to Hive
  12. THANK YOU
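
The "predicate pushdown" and "lazy load" ideas from slide 9 can be illustrated with a minimal Python sketch. This is not Presto's or Parquet's actual reader code. The names `RowGroup`, `make_group`, and `scan_greater_than` are hypothetical, and the data is made up for the example. The sketch only assumes that each chunk of a column carries min/max statistics and that reading the chunk can be deferred.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class RowGroup:
    """Hypothetical chunk of one integer column with precomputed statistics."""
    min_value: int
    max_value: int
    load: Callable[[], List[int]]  # deferred read, called only when needed


def make_group(values: List[int]) -> RowGroup:
    """Build a row group whose data is only materialized when load() is called."""
    return RowGroup(min(values), max(values), lambda: list(values))


def scan_greater_than(groups: List[RowGroup], threshold: int) -> Tuple[List[int], int]:
    """Return the values > threshold and the number of row groups actually loaded."""
    matches: List[int] = []
    loaded = 0
    for group in groups:
        # Predicate pushdown: the statistics prove no value here can match,
        # so the group is skipped without reading its data.
        if group.max_value <= threshold:
            continue
        # Lazy load: the data is read only now that it is known to be needed.
        loaded += 1
        matches.extend(v for v in group.load() if v > threshold)
    return matches, loaded


groups = [make_group([1, 2, 3]), make_group([4, 9, 6]), make_group([10, 12, 11])]
result, loaded = scan_greater_than(groups, 8)
print(result, loaded)
```

In this toy run the first row group is skipped from its statistics alone, so only two of the three groups are read.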
