Hortonworks & IBM – Integration of HDP and DSX
Who Are We
IBM - #1 Data Science solution.
Hortonworks – Largest Open Source Hadoop distribution.
We believe this partnership combines the strengths of our companies and uniquely positions our joint solution in the Data Science market.
What Are We Talking About Today
Integrating HDP and DSX creates a platform for organizations to unlock the potential of their data. Ultimately, it creates a pathway to innovative and valuable Data Science work flows.
Presentation Overview
1. Walk through the Data Science Life Cycle.
2. Discuss challenges in the process.
3. Discuss how DSX & HDP solve these problems.
4. Demonstration of the technology.
Problem Definition
A successful Data Science practice begins with a well-defined business problem. Ideally, the business has specific questions to ask of its data.
ETL – Feature Extraction
Once the problem has been defined, data wrangling, transformation, and cleaning must be completed using various ETL processes.
Once the data corpus has been curated, statistical analysis techniques are utilized to determine which features should be extracted.
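The cleaning and feature-selection steps above can be sketched in a few lines. This is a minimal illustration with hypothetical field names (`age`, `income`, `region_code`) and a simple variance filter standing in for the statistical analysis; real pipelines would use distributed ETL tooling.

```python
# Minimal sketch (pure Python, hypothetical data): drop incomplete records,
# then keep only features whose variance exceeds a threshold.

def clean_rows(rows):
    """Remove records that contain missing (None) values."""
    return [r for r in rows if None not in r.values()]

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def select_features(rows, threshold=0.01):
    """Keep features whose variance across the corpus exceeds the threshold."""
    return [f for f in rows[0]
            if variance([r[f] for r in rows]) > threshold]

raw = [
    {"age": 34, "income": 52.0, "region_code": 1},
    {"age": 29, "income": None, "region_code": 1},  # incomplete -> dropped
    {"age": 41, "income": 77.5, "region_code": 1},
    {"age": 23, "income": 48.0, "region_code": 1},
]

clean = clean_rows(raw)
kept = select_features(clean)  # "region_code" is constant, so it is excluded
```

A constant column like `region_code` carries no signal for the model, so the variance filter removes it before the learning step.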
Learning
After the features are selected, supervised or unsupervised Machine Learning models are created for future prediction or classification.
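As a toy illustration of the supervised case, a 1-nearest-neighbor classifier can be written in a few lines. The feature values and `"churn"`/`"retain"` labels are hypothetical; in practice this step would use a library such as Spark MLlib or scikit-learn.

```python
# Minimal sketch (pure Python, toy data): a 1-nearest-neighbor classifier
# standing in for the supervised "learning" step of the lifecycle.

def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def predict(train, point):
    """Classify a point by the label of its nearest training example."""
    nearest = min(train, key=lambda ex: euclidean(ex[0], point))
    return nearest[1]

train = [((1.0, 1.0), "churn"), ((1.2, 0.9), "churn"),
         ((5.0, 5.0), "retain"), ((4.8, 5.2), "retain")]

label_a = predict(train, (1.1, 1.0))  # near the "churn" cluster
label_b = predict(train, (5.1, 4.9))  # near the "retain" cluster
```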
Model Deployment & Management
In order for this process to be valuable, the organization must deploy these models into their production environment.
Additionally, they must also monitor the performance and health of these models while they are operating.
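One common monitoring check is comparing a deployed model's live prediction rate against the rate observed at validation time. This is a minimal sketch with hypothetical numbers and a hypothetical tolerance, not a specific DSX API.

```python
# Minimal sketch (pure Python, hypothetical thresholds): flag a deployed
# model for review when its live positive-prediction rate drifts too far
# from the baseline rate observed during validation.

def positive_rate(predictions):
    """Fraction of predictions that are positive (1)."""
    return sum(predictions) / len(predictions)

def needs_review(baseline_rate, live_predictions, tolerance=0.15):
    """True when the live positive rate drifts beyond the tolerance."""
    return abs(positive_rate(live_predictions) - baseline_rate) > tolerance

baseline = 0.30                            # rate seen during validation
healthy = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]   # 0.30 -> within tolerance
drifted = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]   # 0.80 -> flag for review
```

Drift like this often signals that the data feeding the model has changed, prompting retraining rather than an immediate failure.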
Data Science teams consist of Data Scientists, Data Engineers, Business Analysts, and Application Developers.
Challenge #1 – Multiple sources of data.
Traditional – Structured (DB, EDW, CRM, etc.)
Big Data – Unstructured (Social media, IOT), Hadoop based data stores
Legacy – Spreadsheets
Problem 1 – 20% of the time in the Data Science Lifecycle (DSLC) is spent by Data Scientists trying to locate the required data.
Problem 2 – 60% of the time in the DSLC is spent by Data Engineers centrally locating the data and preparing it to ensure data quality.
*Key Take Away – Combined, 80% of the time in the DSLC is spent on locating, moving, and preparing the data before machine learning models can be created or deployed.
Challenge #2 – Data Science workflows lack standardization.
Problem 1 – The open source community has created too many tools to expect a single person to know them all.
Problem 2 – Data Science teams are often limited to the tools that their data scientists and application developers are most familiar with, including languages and libraries.
Problem 3 – There are no systems in place where an organization’s Data Science team can build reliable, standardized, and repeatable pipelines for managing models at a Big Data scale.
*Key Take Away – Due to the number of open source tools, there are no standardizations in Data Science practices.
Challenge #3 – Collaboration is difficult
Problem 1 – Without a common framework, Data Scientists have difficulty collaborating with team members. They are unable to share code, results, or models with each other.
Problem 2 – Due to this lack of collaboration, Business Analysts struggle to find visualization tools that can integrate with the data repository.
*Key Take Away – Collaboration across a Data Science team is difficult with existing tools.
Challenge #4 – Deploying Models into Production
Problem 1 – Data Science teams struggle migrating their prototype models to deployment in their production environment.
Problem 2 – Using current tools, it is also difficult to monitor the health and performance of their models while they are operating in production.
*Key Take Away – Creating models in an isolated environment is relatively straightforward. The challenge begins when these models need to be deployed and monitored in production.
Components of the end-to-end real-time insights dataflow platform
MiNiFi : Edge Data Collection w/Provenance and centralized C&C
NiFi: End to end dataflow management w/Provenance and Interactive C&C
Kafka: High throughput durable replayable messaging
Storm: High-scale Data Processing
Right-sized solutions
All optimized for delivery into HDP (HDFS, Hive, Spark, HBase, etc.)
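To make Kafka's role concrete, the "durable replayable messaging" idea can be illustrated with an append-only log plus consumer offsets. This is a conceptual sketch in pure Python, not the Kafka client API; the event strings are hypothetical.

```python
# Minimal sketch (pure Python, not the Kafka API): an append-only log with
# consumer offsets, illustrating Kafka's durable, replayable messaging model.

class Log:
    def __init__(self):
        self.messages = []          # durable, append-only record

    def produce(self, msg):
        self.messages.append(msg)

    def consume(self, offset, max_count=10):
        """Read from an offset; replaying is just re-reading an old offset."""
        batch = self.messages[offset:offset + max_count]
        return batch, offset + len(batch)

log = Log()
for event in ["sensor:42", "sensor:43", "sensor:44"]:
    log.produce(event)

batch, next_offset = log.consume(0)   # first read advances the offset
replayed, _ = log.consume(0)          # replay from the beginning at any time
```

Because consumers track their own offsets rather than deleting messages, downstream processors like Storm can reprocess historical data after a failure.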