Building a Stock Prediction system with Machine Learning using Geode, SpringXD and Spark MLLib
- 1. © 2015 Pivotal Software, Inc. All rights reserved.
Building a Stock Prediction system with
Machine Learning using Geode, Spring XD
and Spark MLLib
William Markito
@william_markito
Fred Melo
@fredmelo_br
- 3.
It's all about DATA
Data Sources
Look for patterns
Prediction
- 10. © Copyright 2014 Pivotal. All rights reserved.
Spring XD: Extensible, Open-Source, Fault-Tolerant, Horizontally Scalable, Cloud-Native
[Architecture diagram. Labels: Real data, Simulator; Transform, Enrich, Filter, Split, Sink; Machine Learning; numbered steps 1 to 3, including Indicators and Predict; Dashboard; /Stocks, /TechIndicators, /Predictions]
- 11.
Apache Geode (incubating)
Introduction
- 12.
Introduction
A distributed, memory-based data management platform for data-oriented apps that need:
High performance, scalability, resiliency and continuous availability
Fast access to critical data sets
Location-aware distributed data processing
Event-driven data architecture
- 13.
Concepts
Cache
In-memory storage and management for your data
Configurable through XML, Spring, Java API or CLI
Collection of Regions
- 14.
Concepts
Region
Distributed java.util.Map on steroids (Key/Value)
Consistent API regardless of where or how data is stored
Observable (reactive)
Highly available, redundant across cache members
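The "observable" property can be illustrated without Geode itself. What follows is a minimal Python sketch, not Geode's API: `ObservableRegion`, `add_listener` and the event tuple layout are hypothetical names chosen for illustration, and the symbol and prices are made-up sample data. It models a region as a key/value map that notifies registered callbacks whenever an entry is created or updated.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical event shape: (operation, key, old_value, new_value)
Event = Tuple[str, Any, Any, Any]


class ObservableRegion:
    """Toy stand-in for a region: a dict that reports changes to listeners."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._data: Dict[Any, Any] = {}
        self._listeners: List[Callable[[Event], None]] = []

    def add_listener(self, listener: Callable[[Event], None]) -> None:
        self._listeners.append(listener)

    def put(self, key: Any, value: Any) -> None:
        # Distinguish a first insert from an update of an existing key.
        op = "update" if key in self._data else "create"
        old = self._data.get(key)
        self._data[key] = value
        for listener in self._listeners:
            listener((op, key, old, value))

    def get(self, key: Any) -> Any:
        return self._data.get(key)


# Sample data below is invented for illustration only.
events: List[Event] = []
stocks = ObservableRegion("/Stocks")
stocks.add_listener(events.append)
stocks.put("AAPL", 110.5)
stocks.put("AAPL", 111.0)
```

After the two `put` calls, `events` holds one "create" and one "update" entry for the same key. A real region would also distribute the data and the notifications across members, which this single-process sketch leaves out.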
- 15.
Concepts
Region
Local, Replicated or Partitioned
In-memory or persistent
Redundant
LRU
Overflow
LOCAL
LOCAL_HEAP_LRU
LOCAL_OVERFLOW
LOCAL_PERSISTENT
LOCAL_PERSISTENT_OVERFLOW
PARTITION
PARTITION_HEAP_LRU
PARTITION_OVERFLOW
PARTITION_PERSISTENT
PARTITION_PERSISTENT_OVERFLOW
PARTITION_PROXY
PARTITION_PROXY_REDUNDANT
PARTITION_REDUNDANT
PARTITION_REDUNDANT_HEAP_LRU
PARTITION_REDUNDANT_OVERFLOW
PARTITION_REDUNDANT_PERSISTENT
PARTITION_REDUNDANT_PERSISTENT_OVERFLOW
REPLICATE
REPLICATE_HEAP_LRU
REPLICATE_OVERFLOW
REPLICATE_PERSISTENT
REPLICATE_PERSISTENT_OVERFLOW
REPLICATE_PROXY
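The shortcut names listed above look like a distribution prefix (LOCAL, PARTITION, REPLICATE) followed by optional feature suffixes. The Python sketch below only decomposes the names as they appear on the slide; `describe_shortcut` is a hypothetical helper, not part of Geode, and the reading of each suffix as an independent flag is an assumption.

```python
from typing import Dict, Union

SCOPES = ("LOCAL", "PARTITION", "REPLICATE")
# HEAP_LRU is matched as a single token because it contains an underscore.
FEATURES = ("HEAP_LRU", "REDUNDANT", "PERSISTENT", "OVERFLOW", "PROXY")


def describe_shortcut(name: str) -> Dict[str, Union[str, bool]]:
    """Split a shortcut name into its prefix and a flag per feature suffix."""
    scope = next(
        (s for s in SCOPES if name == s or name.startswith(s + "_")), None
    )
    if scope is None:
        raise ValueError(f"unknown region shortcut: {name}")

    info: Dict[str, Union[str, bool]] = {"scope": scope}
    for feature in FEATURES:
        info[feature.lower()] = False

    # Consume the remaining suffixes one token at a time.
    rest = name[len(scope):].strip("_")
    while rest:
        feature = next(
            (f for f in FEATURES if rest == f or rest.startswith(f + "_")), None
        )
        if feature is None:
            raise ValueError(f"unrecognised suffix in {name}: {rest}")
        info[feature.lower()] = True
        rest = rest[len(feature):].strip("_")
    return info


example = describe_shortcut("PARTITION_REDUNDANT_PERSISTENT_OVERFLOW")
```

Here `example` reports the PARTITION prefix with the redundant, persistent and overflow flags set. The sketch says nothing about what each flag does at runtime; the Geode documentation is the reference for that.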
- 16.
Concepts
Member
A process that has a connection to the system
A process that has created a cache
Embeddable within your application
Client
Locator
Server
- 17.
Concepts
Client cache
A process connected to the Geode server(s)
Can have a local copy of the data
Can be notified about events on the servers
- 18.
Concepts
Listeners
CacheWriter / CacheListener
AsyncEventListener (queue / batch)
Parallel or Serial
Conflation
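Queueing, batching and conflation can be pictured with a small Python sketch. It is not Geode's AsyncEventListener API: `ConflatingQueue`, `offer` and `drain_batch` are hypothetical names, the sample keys and values are invented, and keeping a conflated key at its original queue position is just one plausible policy assumed here.

```python
from collections import OrderedDict
from typing import Any, List, Tuple


class ConflatingQueue:
    """Toy event queue: pending updates for one key collapse into the latest."""

    def __init__(self, batch_size: int) -> None:
        self.batch_size = batch_size
        self._pending: "OrderedDict[Any, Any]" = OrderedDict()

    def offer(self, key: Any, value: Any) -> None:
        # Conflation: a newer value overwrites the queued one for the same key.
        # Assigning to an existing OrderedDict key keeps its original position.
        self._pending[key] = value

    def drain_batch(self) -> List[Tuple[Any, Any]]:
        # Hand the oldest pending entries to a listener, batch_size at a time.
        batch: List[Tuple[Any, Any]] = []
        while self._pending and len(batch) < self.batch_size:
            batch.append(self._pending.popitem(last=False))
        return batch


queue = ConflatingQueue(batch_size=2)
queue.offer("AAPL", 1)
queue.offer("GOOG", 2)
queue.offer("AAPL", 3)  # replaces the queued value for AAPL
queue.offer("MSFT", 4)
first_batch = queue.drain_batch()
second_batch = queue.drain_batch()
```

Four updates produce only three delivered events, because the two AAPL updates are conflated into one. The trade-off is that intermediate values are never seen by the listener.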
- 19.
Apache Geode (incubating)
Currently under incubation at the Apache Software Foundation
Contributions and contributors are welcome:
Code and Patches
Bugs, feature requests
Documentation and content
Any form of feedback
- 20.
Code
New features
Bug fixes (patches)
Writing tests
Documentation
Wiki
Web site
User guides
Community
Join our mailing lists (Ask or answer)
Become a speaker
Find and report bugs
Testing a release candidate or beta
Apache Geode (incubating)
- 21.
JIRA - https://issues.apache.org/jira/browse/GEODE
GitHub - https://github.com/apache/incubator-geode
Mailing lists:
Development - dev@geode.incubator.apache.org
Users - user@geode.incubator.apache.org
Wiki - cwiki.apache.org/confluence/display/GEODE
StackOverflow - http://stackoverflow.com/questions/tagged/geode+or+gemfire
Apache Geode (incubating)
- 24.
Concepts
A stream is composed from modules. Each module is deployed to a container and its
channels are bound to the transport.
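The idea of composing a stream from modules can be sketched in plain Python with generators. This is an analogy under stated assumptions, not Spring XD's DSL or runtime: `compose`, `parse_price`, `only_positive` and `collect_sink` are hypothetical names, the input lines are invented, and containers and the transport are not modelled at all.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

Record = Dict[str, Any]
Module = Callable[[Iterable[Any]], Iterator[Any]]


def compose(*modules: Module) -> Callable[[Iterable[Any]], Iterator[Any]]:
    """Chain modules so each one's output feeds the next one's input."""
    def stream(source: Iterable[Any]) -> Iterator[Any]:
        data: Any = source
        for module in modules:
            data = module(data)
        return data
    return stream


def parse_price(lines: Iterable[str]) -> Iterator[Record]:
    # Processor: turn "SYMBOL,PRICE" text into a record.
    for line in lines:
        symbol, price = line.split(",")
        yield {"symbol": symbol, "price": float(price)}


def only_positive(records: Iterable[Record]) -> Iterator[Record]:
    # Filter: drop records with a non-positive price.
    for record in records:
        if record["price"] > 0:
            yield record


def collect_sink(out: List[Record]) -> Module:
    # Sink: append every record it sees to the given list.
    def sink(records: Iterable[Record]) -> Iterator[Record]:
        for record in records:
            out.append(record)
            yield record
    return sink


received: List[Record] = []
stream = compose(parse_price, only_positive, collect_sink(received))
processed = list(stream(["AAPL,110.5", "BAD,-1", "GOOG,630.0"]))
```

Each function plays the role of one module, and `compose` stands in for the wiring between them. In the sketch everything runs in one process; the slide's point is that each module can instead be deployed to its own container.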
- 25.
Apache Zeppelin
(incubating)
Introduction
- 26.
Concepts
Web based REPL
Iterative & Exploratory
Support for Data Ingestion
- 27.
Concepts
Multiple interpreters
Markdown
Shell
Spark
Geode
Python…
- 28.
Concepts
Sharing through URLs without Reports
- 29.
Apache Spark
Introduction
- 30.
Concepts
RDD
Dataframe
Driver
Worker
"An RDD in Spark is simply an immutable distributed collection of objects.
Each RDD is split into multiple partitions, which may be computed on different nodes
of the cluster. RDDs can contain any type of Python, Java, or Scala objects,
including user-defined classes."
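The definition quoted above has two properties that are easy to show in miniature: the collection is immutable, and it is split into partitions that can be processed independently. The following Python sketch uses only the standard library, not Spark; `partition` and `map_partitions` are hypothetical helpers and the price list is invented sample data.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def partition(items: Sequence[T], n: int) -> Tuple[Tuple[T, ...], ...]:
    """Split items into n contiguous, immutable partitions of near-equal size."""
    size, extra = divmod(len(items), n)
    parts: List[Tuple[T, ...]] = []
    start = 0
    for i in range(n):
        # The first `extra` partitions take one additional element.
        end = start + size + (1 if i < extra else 0)
        parts.append(tuple(items[start:end]))
        start = end
    return tuple(parts)


def map_partitions(
    parts: Tuple[Tuple[T, ...], ...], fn: Callable[[T], U]
) -> Tuple[Tuple[U, ...], ...]:
    # Each partition is transformed on its own, so on a real cluster the work
    # could run on different nodes. A new collection is returned; the input
    # partitions are never modified.
    return tuple(tuple(fn(x) for x in part) for part in parts)


prices = [10.0, 11.0, 12.0, 13.0, 14.0]
parts = partition(prices, 2)
doubled = map_partitions(parts, lambda p: p * 2)
```

Tuples are used so that a transformation has to build a new collection rather than change the old one, which mirrors the "immutable" part of the quote. Distribution across nodes is only implied here, since everything runs locally.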
- 31.
Concepts
RDD
Dataframe
Driver
Worker
"A dataframe is a distributed collection of rows organized into named columns. An
abstraction for selecting, filtering and plotting structured data (pandas), previously
known as SchemaRDD."
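"Rows organized into named columns" plus selecting and filtering can be mimicked in a few lines of Python. This is a local, single-process sketch and not the Spark DataFrame API: `TinyFrame` and its methods are hypothetical, and the rows are invented sample data.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

Row = Dict[str, Any]


class TinyFrame:
    """Toy table: a list of column names plus rows stored as tuples."""

    def __init__(self, columns: Sequence[str], rows: Sequence[Sequence[Any]]) -> None:
        self.columns: List[str] = list(columns)
        self.rows: List[Tuple[Any, ...]] = [tuple(r) for r in rows]

    def select(self, *names: str) -> "TinyFrame":
        # Keep only the requested columns, in the requested order.
        idx = [self.columns.index(n) for n in names]
        return TinyFrame(names, [tuple(r[i] for i in idx) for r in self.rows])

    def filter(self, predicate: Callable[[Row], bool]) -> "TinyFrame":
        # The predicate sees each row as a dict keyed by column name.
        kept = [r for r in self.rows if predicate(dict(zip(self.columns, r)))]
        return TinyFrame(self.columns, kept)


frame = TinyFrame(
    ["symbol", "price", "volume"],
    [("AAPL", 110.5, 1000), ("GOOG", 630.0, 200), ("MSFT", 47.0, 1500)],
)
result = frame.filter(lambda r: r["volume"] > 500).select("symbol", "price")
```

Both operations return a new frame, so calls can be chained as in the last line. What the sketch omits is the "distributed" part of the quote: the rows here live in one Python list.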
- 32.
Concepts
RDD
Dataframe
Driver
Worker
- 34.
Summary
Spring XD
• Integration
  • Spark, JDBC, Geode
  • HDFS, Twitter, File, Mail…
• Data pipeline orchestration
• Intuitive DSL
• Streaming & Analytics
• Distributed and scalable
Apache Zeppelin
• Web based REPL
• Multiple Interpreters
  • Apache Spark
  • Markdown
  • Flink
  • Python
  • Geode…
• Iterative & Exploratory
- 35.
Summary
Apache Spark
• Fast data processing
• Columnar queries
• RDDs
• Machine Learning
• Analytics & Streaming
Apache Geode
• Fast data store and processing
• In-memory & Persistent
• Highly Consistent
• Transaction processing
• Thousands of concurrent clients
- 36.
Source Code
http://pivotal-open-source-hub.github.io/StockInference-Spark/