This document discusses the pros and cons of building an in-house data analytics platform versus using cloud-based services. It notes that in startups it is generally better not to build your own platform and instead use cloud services from AWS, Google, or Treasure Data. However, the options have expanded in recent years to include on-premise or cloud-based platforms from vendors like Cloudera, Hortonworks, or cloud services from various providers. The document does not make a definitive conclusion, but discusses factors to consider around distributed processing, data management, process management, platform management, visualization, and connecting different data sources.
5. At Feb 23, 2015
• To Have Own Data Analytics Platform, Or NOT To,
In Startup Companies:
• "NOT To, in general"
• Data analytics services:
• AWS EMR, Redshift
• Google BigQuery
• Treasure Data
6. Options In 2017
• On Premise
• Cloudera CDH, Hortonworks HDP, ...
• Services
• AWS EMR, Redshift, Athena, Kinesis Analytics, ...
• Google BigQuery, Cloud Dataflow, Cloud
Dataproc, ...
• MS Azure SQL Data Warehouse, Stream Analytics,
Data Lake Analytics, ...
• Treasure Data
12. On Premise Platform In Past
• 2011-2014: On-premise Hadoop&Presto cluster
• w/ Fluentd stream processing cluster
• w/ Norikra stream processing
• w/ Web UI (Shib)
https://www.slideshare.net/tagomoris/lambda-architecture-using-sql-hadoopcon-2014-taiwan
13. To Be Considered
• Distributed Processing Platform
• Data Management
• Process Management
• Platform Management
• Visualization and BI
• Connecting Data
15. Data Management
• How to collect data?
• How to ingest data?
• How to manage schema?
• How to move data from here to there?
16. Process Management
• How to run queries on schedule?
• How to build workflow between queries?
• How to run queries after data ingestion?
• How to move data from the platform to elsewhere
after queries?
17. Platform Management
• How to upgrade software?
• How to add nodes?
• How to manage failures / downtime?
• How to replace hardware?
• How to switch platforms?
• How to provide compatibility for queries?
18. Visualization and BI
• How to show query results graphically?
• How to show relations between data graphically?
• How to query data interactively?
19. Connecting Data
• How to join logs and master data?
• How to join logs and user list?
• How to join logs and CRM data?
• How to push query results to marketing tools/
services?
• How to send notifications using query results?
21. In My Past Case:
• Distributed Processing Platform
• Hadoop & Presto (& Norikra)
• Data Management
• Hive schema & Custom made UI (Shib)
• Managed by engineers of each services
• Process Management
• Custom made query scheduler (ShibUI)
• Platform Management
• By tagomoris
• Visualization, BI: N/A
• Connecting Data: N/A
22. About Treasure Data
• Distributed Processing Platform: Hive, Presto
• Data Management: Fluentd & Schema-less DB
• Process Management: Digdag / Treasure Workflow
• Platform Management: Automatic
• Visualization and BI: Treasure BI
• Connecting Data: Embulk / Data Connector
😝
23. Recent Improvements around Data Analytics
• Improvements of CDH/HDP to manage clusters
• Online Upgrade
• Support many processing frameworks
• Many new data processing software/frameworks
• Apache Flink, Apache Arrow, Apache Beam, ...
• Many new services available
• Stream processing, Machine learning, ...
27. Complexity
• Connecting data / processing with applications
• Connecting data / processing with services
• Connecting data / processing with people
28. Chasing the World
• Many new software / services / platform /
paradigm, day by day
• Data sizes are growing day by day
• Complexity is growing day by day
• A data platform CANNOT live as-is 5 years!
29. Finding Treasure From Data
• "Data Processing" is:
• NOT the purpose
• just a tool to get something great
• Use developers and their time to find treasures!