13. EC2: virtual private servers using Xen.
EMR (Elastic MapReduce): allows businesses, researchers, data analysts, and developers to easily
and cheaply process vast amounts of data. It uses a hosted Hadoop framework running on the
web-scale infrastructure of EC2 and Amazon S3.
S3: web-based storage.
Redshift: petabyte-scale data warehousing with column-based storage and multi-node compute.
SimpleDB: allows developers to run queries on structured data. It operates in concert with EC2 and S3
to provide "the core functionality of a database".
DynamoDB: scalable, low-latency NoSQL database service backed by SSDs.
RDS: scalable database server with MySQL, Oracle, SQL Server, and PostgreSQL support.
http://en.wikipedia.org/wiki/Amazon_Web_Services
Cloud: Amazon Web Services (AWS)
24. https://storm.apache.org/
● Distributed realtime computation system.
● Storm makes it easy to reliably process unbounded
streams of data, doing for realtime processing what
Hadoop did for batch processing.
● Use cases: realtime analytics, online machine
learning, continuous computation, distributed RPC,
ETL, and more
Realtime: Storm
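The idea of processing an unbounded stream one record at a time can be sketched in plain Python. This is only an illustration of the concept, not Storm's API: the names `sentence_source`, `split_words`, `running_counts`, and `counts_after` are hypothetical, and the source here is a generator that never ends, standing in for a real feed.

```python
from collections import Counter
from itertools import islice
from typing import Dict, Iterable, Iterator


def sentence_source() -> Iterator[str]:
    """Hypothetical unbounded source: cycles through two sentences forever."""
    sentences = ["the cow jumped", "the moon rose"]
    i = 0
    while True:
        yield sentences[i % len(sentences)]
        i += 1


def split_words(stream: Iterable[str]) -> Iterator[str]:
    """First processing step: split each incoming sentence into words."""
    for sentence in stream:
        yield from sentence.split()


def running_counts(words: Iterable[str]) -> Iterator[Dict[str, int]]:
    """Second step: update the counts and emit a snapshot after every word."""
    counts: Counter = Counter()
    for word in words:
        counts[word] += 1
        yield dict(counts)


def counts_after(n_words: int) -> Dict[str, int]:
    """Return the snapshot emitted after the first n_words words."""
    pipeline = running_counts(split_words(sentence_source()))
    return next(islice(pipeline, n_words - 1, None))
```

Because every stage is lazy, results are available after each record instead of after the whole input, which is the contrast with batch processing drawn above.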
25. Batch + Realtime: Spark
https://spark.apache.org/
http://www.slideshare.net/search/slideshow?q=apache+spark
Speed:
Up to 100x faster than Hadoop MapReduce in memory, or 10x
faster on disk.
26. Runs Everywhere:
Spark runs on Hadoop, Mesos, standalone, or in the
cloud. It can access diverse data sources including
HDFS, Cassandra, HBase, S3.
28. Machine learning features (MLlib 1.1):
● linear SVM and logistic regression
● classification and regression trees
● k-means clustering
● recommendation via alternating least squares
● singular value decomposition
● linear regression with L1- and L2-regularization
● multinomial naive Bayes
● basic statistics
● feature transformations
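As one example from the list, k-means clustering can be sketched in a few lines of plain Python. This is a minimal single-machine version of the textbook Lloyd iteration, not MLlib's implementation or API; the function names are hypothetical, and the initial centers are simply the first k points, an assumption made to keep the example deterministic.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def squared_distance(a: Point, b: Point) -> float:
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def mean_point(points: List[Point]) -> Point:
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def kmeans(points: List[Point], k: int, max_iter: int = 20) -> List[Point]:
    """Return k cluster centers found by Lloyd's algorithm."""
    # Assumption: use the first k points as initial centers.
    centers = list(points[:k])
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest center.
        clusters: List[List[Point]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            mean_point(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers
```

Calling `kmeans([(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)], k=2)` separates the points near the origin from the points near (10, 10).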
29. Graph features:
● GraphX unifies ETL, exploratory analysis, and
iterative graph computation within a single system.
● Seamlessly work with both graphs and collections:
You can view the same data as both graphs and
collections, transform and join graphs with RDDs
efficiently, and write custom iterative graph
algorithms using the Pregel API.
● Algorithms: PageRank, Connected components,
Label propagation, SVD++, Strongly connected
components, Triangle count...
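To give a feel for what an iterative graph algorithm looks like, here is a minimal PageRank sketch in plain Python. It is the textbook power iteration on an adjacency dictionary, not GraphX's implementation or its Pregel API; the function name, the damping factor of 0.85, and the fixed iteration count are assumptions chosen for illustration.

```python
from typing import Dict, List


def pagerank(
    edges: Dict[str, List[str]], damping: float = 0.85, iterations: int = 50
) -> Dict[str, float]:
    """Compute PageRank scores that sum to 1 for a directed graph."""
    nodes = sorted(set(edges) | {d for dsts in edges.values() for d in dsts})
    n = len(nodes)
    ranks = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        incoming = {v: 0.0 for v in nodes}
        dangling = 0.0  # rank held by vertices with no outgoing edges
        for src in nodes:
            out = edges.get(src, [])
            if out:
                share = ranks[src] / len(out)
                for dst in out:
                    incoming[dst] += share
            else:
                dangling += ranks[src]
        # Each vertex gets a base share plus damped contributions;
        # dangling rank is spread evenly over all vertices.
        ranks = {
            v: (1 - damping) / n + damping * (incoming[v] + dangling / n)
            for v in nodes
        }
    return ranks
```

For the graph `{"a": ["c"], "b": ["c"], "c": ["a"]}`, vertex `c` receives links from both `a` and `b` and ends up with the highest score, while `b`, which nothing links to, ends up with the lowest.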
30. Streaming features:
Spark Streaming can read data from HDFS, Flume,
Kafka, Twitter and ZeroMQ. You can also define your
own custom data sources.
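The notion of a custom source feeding a streaming job can be sketched in plain Python. This is not the Spark Streaming API: `custom_source`, `micro_batches`, and `word_counts_per_batch` are hypothetical names. The sketch groups records into small batches by count, whereas Spark Streaming groups them by time interval; counting is used here only to keep the example deterministic.

```python
from collections import Counter
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def custom_source() -> Iterator[str]:
    """Hypothetical custom data source; a real one might read from a socket."""
    for line in ["spark streaming", "spark kafka", "flume kafka", "spark"]:
        yield line


def micro_batches(stream: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Cut the stream into consecutive batches of at most batch_size records."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def word_counts_per_batch(
    stream: Iterable[str], batch_size: int
) -> List[Dict[str, int]]:
    """Run a word count over each batch independently."""
    results = []
    for batch in micro_batches(stream, batch_size):
        counts = Counter(word for line in batch for word in line.split())
        results.append(dict(counts))
    return results
```

Any iterable of lines can replace `custom_source()`, which is the point of a pluggable source: the processing code does not change when the input does.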