The Birth of Data Lakes - Companies today are swamped with data, and the classic data warehouse struggles to process it, in both volume and variety. Many have started looking at architectures called Data Lakes, with Hadoop as the reference technology. But is this solution right for everything? Come learn how to operationalize data lakes to build modern data management architectures.
2. Agenda
• The birth of the Data Lake
• MongoDB overview
• A proposed EDM architecture
• Case studies & scenarios
• Data Lake lessons learned
3. How Much Data?
• One thing companies do not lack: data
– Sensor streams
– Social media sentiment
– Server logs
– Mobile apps
• Analysts estimate data volumes are growing 40% per year, 90% of which is unstructured
• Traditional technologies (some designed 40 years ago) are no longer sufficient
4. The Promise of “Big Data”
• Discovering insights by collecting and analyzing data holds the promise of:
– a competitive advantage
– cost savings
• A widespread use of Big Data technology is the “Single View”: aggregating everything known about a customer to improve engagement and revenue
• The traditional EDW creaks under the load, overwhelmed by the volume and variety of the data (and by its high cost)
5. The Birth of Data Lakes
• Many companies have started looking toward an architecture called the Data Lake:
– a platform for managing data flexibly
– aggregating and correlating cross-silo data in a single place
– enabling exploration of all the data
• The most popular platform at the moment is Hadoop:
– scales horizontally on commodity hardware
– accommodates varied data with a read-optimized schema
– includes data-processing layers in SQL and common languages
– big-name references (Yahoo and Google first among them)
6. Why Hadoop?
• The Hadoop Distributed File System is designed to scale for large batch operations
• It provides a write-once, read-many, append-only model
• Optimized for long scans over TBs or PBs of data
• This ability to handle multi-structured data is used for:
– customer segmentation for marketing campaigns and recommendations
– predictive analytics
– risk models
7. But Is It Right for Everything?
• Data Lakes are expected to feed Hadoop's output to online applications, which have requirements including:
– response latency in milliseconds
– random access to an indexed subset of the data
– support for expressive queries and data aggregations
– real-time updates to frequently changing values
8. Is Hadoop the Answer to Everything?
• In our now data-driven world, milliseconds matter
– IBM researchers state that 60% of data loses its value within milliseconds of being generated
– for example, flagging a fraudulent stock trade can be useless a few minutes later
• Gartner predicts that 70% of Hadoop deployments will fail to meet their cost-saving and revenue-growth objectives
9. Enterprise Data Management Pipeline
[Diagram: siloed source databases, external feeds (batch), and streams enter via pub-sub, ETL, file imports, and stream processing; data is stored raw, then transformed, aggregated, and analyzed, and finally served to users and other systems. Stream icon from: https://en.wikipedia.org/wiki/File:Activity_Streams_icon.png]
15. Development – The Past
[Diagram: application code and a relational database, connected through an object-relational mapping layer, XML config, and a DB schema]
16. Development – With MongoDB
[Diagram: the same stack with MongoDB in place of the relational database; the ORM layer, XML config, and separate DB schema drop away as documents map directly to application objects]
17. MongoDB is Full-Featured (each query type is sketched in PyMongo after this list)
• Rich queries – find Paul's cars; find everybody in London with a car built between 1970 and 1980
• Geospatial – find all of the car owners within 5 km of Trafalgar Sq.
• Text search – find all the cars described as having leather seats
• Aggregation – calculate the average value of Paul's car collection
• Map-reduce – what is the ownership pattern of colors by geography over time (is purple trending in China?)
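To make these query types concrete, here is a minimal sketch in Python with PyMongo; the deployment URI, the `cars` collection, and field names like `owner`, `city`, `year`, `location`, `description`, and `value` are all hypothetical stand-ins, not from the deck:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical deployment
cars = client["demo"]["cars"]                      # hypothetical collection

# Rich query: find Paul's cars
pauls_cars = cars.find({"owner": "Paul"})

# Range query: everybody in London with a car built between 1970 and 1980
london = cars.find({"city": "London", "year": {"$gte": 1970, "$lte": 1980}})

# Geospatial: owners within 5 km of Trafalgar Square (needs a 2dsphere index)
cars.create_index([("location", "2dsphere")])
nearby = cars.find({"location": {"$near": {
    "$geometry": {"type": "Point", "coordinates": [-0.1281, 51.5080]},
    "$maxDistance": 5000}}})

# Text search: cars described as having leather seats (needs a text index)
cars.create_index([("description", "text")])
leather = cars.find({"$text": {"$search": "leather seats"}})

# Aggregation: average value of Paul's collection
avg_value = cars.aggregate([
    {"$match": {"owner": "Paul"}},
    {"$group": {"_id": "$owner", "avgValue": {"$avg": "$value"}}},
])
```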
18. Dynamic Lookup
• Combine data from multiple collections with left outer joins
for richer analytics & more flexibility in data modeling
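A minimal PyMongo sketch of `$lookup` (MongoDB 3.2+); the `orders` and `customers` collections and the `customer_id` field are hypothetical:

```python
from pymongo import MongoClient

db = MongoClient()["demo"]  # hypothetical database

# Left outer join: attach each order's matching customer document(s) in-line.
pipeline = [
    {"$lookup": {
        "from": "customers",          # collection to join with
        "localField": "customer_id",  # field in the orders collection
        "foreignField": "_id",        # field in the customers collection
        "as": "customer",             # joined documents land in this array
    }},
]
for order in db["orders"].aggregate(pipeline):
    print(order["customer"])
```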
21. Replica Sets
• Replica set – 2 to 50 copies
• Replica sets make up a self-healing ‘shard’
• Data center awareness
• Replica sets address:
– High availability
– Data durability, consistency
– Maintenance (e.g., HW swaps)
– Disaster recovery
[Diagram: the application's driver writes to the primary, which replicates to two secondaries; a connection sketch follows]
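As a sketch of how an application driver uses a replica set, assuming hypothetical host names `node1`-`node3` and replica set name `rs0`:

```python
from pymongo import MongoClient, ReadPreference

# Hypothetical three-node replica set; the driver discovers the primary
# and fails over automatically when it changes.
client = MongoClient(
    "mongodb://node1:27017,node2:27017,node3:27017/?replicaSet=rs0"
)

# Writes always go to the primary; reads on this handle may be served by a
# secondary, e.g. for workload isolation or data locality.
events = client["demo"].get_collection(
    "events", read_preference=ReadPreference.SECONDARY_PREFERRED
)
events.insert_one({"type": "login"})
print(events.count_documents({"type": "login"}))
```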
22. Query Routing
• Multiple query optimization models
• Each of the sharding options is appropriate for different apps / use cases (see the sketch below)
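A minimal sketch of configuring query routing through a mongos; the `mongos1` host and `demo.events` collection are hypothetical, and both the hashed and range-based forms below are standard `shardCollection` options:

```python
from pymongo import MongoClient

# Connect to a mongos query router (hypothetical host name).
client = MongoClient("mongodb://mongos1:27017")

# Enable sharding on a database, then shard a collection on a key.
client.admin.command("enableSharding", "demo")
client.admin.command("shardCollection", "demo.events",
                     key={"customer_id": "hashed"})  # hashed sharding
# Range-based sharding instead: key={"customer_id": 1}
```

Per the notes later in this deck, sharding stays transparent to the application: query code is identical whether there is one shard or one hundred.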
23. 3.2 Features Relevant for EDM
• WiredTiger operational engine
• In-memory storage engine
• Encryption
• Compass (data viewer & query builder)
• Connector for BI (Visualization)
• Connector for Hadoop
• Connector for Spark
• $lookup (left outer join)
24. Enterprise Data Management Pipeline
[Diagram: the same pipeline as slide 9 — siloed sources, batch feeds, and streams → store raw data → transform → aggregate → analyze → users and other systems]
25. How Do You Choose the Data Management Layer for Each Stage?
Processing layer options:
• MongoDB – when you want: (1) secondary indexes, (2) sub-second latency, (3) aggregations in the DB, (4) updates of data (sketched below)
• HDFS – for: (1) scanning files, (2) when indexes are not needed
• Wide column store (e.g. HBase) – for: (1) primary-key queries, (2) if multiple indexes & slices are not needed, (3) workloads optimized for writing, not reading
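A quick PyMongo sketch of the four capabilities in the MongoDB column, on a hypothetical `readings` collection:

```python
from pymongo import MongoClient, ASCENDING

coll = MongoClient()["demo"]["readings"]  # hypothetical collection

# 1. Secondary indexes on non-key fields
coll.create_index([("sensor_id", ASCENDING), ("ts", ASCENDING)])

# 2. Sub-second latency: indexed point/range lookups
latest = coll.find({"sensor_id": 42}).sort("ts", -1).limit(10)

# 3. Aggregations in the database
daily_avg = coll.aggregate([
    {"$match": {"sensor_id": 42}},
    {"$group": {"_id": "$day", "avg": {"$avg": "$value"}}},
])

# 4. In-place updates of frequently changing values
coll.update_one({"sensor_id": 42}, {"$set": {"status": "online"}})
```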
26. Data Store for Raw Dataset
[Diagram: the EDM pipeline with the “store raw data” and “transform” stages highlighted]
Store raw data:
– Typically just writing record-by-record from source data
– Usually just need high write volumes (see the bulk-write sketch below)
– All 3 options handle that
Transform read requirements:
– Benefits to reading multiple datasets sorted [by index], e.g. to do a merge
– Might want to look up across tables with indexes (and join functionality in MDB v3.2)
– Want high read performance while writes are happening
– Interactive querying on the raw data could use indexes with MongoDB
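A minimal sketch of the raw-landing write path in PyMongo; the `raw_events` collection and the fabricated `source_records` batch are illustrative only:

```python
from pymongo import MongoClient, InsertOne

raw = MongoClient()["edm"]["raw_events"]  # hypothetical raw landing collection

# Stand-in for a batch read from a source feed.
source_records = [{"source": "crm", "seq": i, "payload": {"v": i}}
                  for i in range(1000)]

# Record-by-record landing, batched into one unordered bulk write so the
# server can absorb high write volumes and keep going past individual errors.
raw.bulk_write([InsertOne(rec) for rec in source_records], ordered=False)
```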
27. Data Store for Transformed Dataset
[Diagram: the EDM pipeline with the “transform” and “aggregate” stages highlighted]
Transform:
– Often benefits from updating data when merging multiple datasets (see the upsert sketch below)
Aggregate read requirements:
– Benefits to using indexes for grouping
– Aggregations natively in the DB would help
– With indexes, can do aggregations on slices of data
– Might want to look up across tables with indexes to aggregate
– Dashboards & reports can have sub-second latency with indexes
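A sketch of merging multiple source datasets by upserting into one document per customer; the collection and field names (`customers_merged`, `customer_id`, `crm`, `web`) are hypothetical:

```python
from pymongo import MongoClient, UpdateOne

merged = MongoClient()["edm"]["customers_merged"]  # hypothetical collection
merged.create_index("customer_id", unique=True)    # supports the merge lookups

# Each source dataset upserts into the same per-customer document:
# the first source creates it, later sources enrich it in place.
ops = [
    UpdateOne({"customer_id": "c1"},
              {"$set": {"crm": {"name": "Ada", "tier": "gold"}}}, upsert=True),
    UpdateOne({"customer_id": "c1"},
              {"$set": {"web": {"last_visit": "2016-03-01"}}}, upsert=True),
]
merged.bulk_write(ops)
```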
28. Data Store for Aggregated Dataset
[Diagram: the EDM pipeline with the “aggregate” and “analyze” stages highlighted]
Analytics read requirements:
– For scanning all of the data, any data store could work
– Often want to analyze a slice of data (using indexes); see the sketch below
– Querying on slices is best in MongoDB
– Dashboards & reports can have sub-second latency with indexes
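A sketch of an index-backed aggregation over a slice of the data, with a hypothetical `transactions` collection and `region`/`day` fields:

```python
from pymongo import MongoClient, ASCENDING

txns = MongoClient()["edm"]["transactions"]  # hypothetical collection

# Index the slicing fields so $match touches only the relevant slice.
txns.create_index([("region", ASCENDING), ("day", ASCENDING)])

# Aggregate over one slice of the data rather than scanning everything.
pipeline = [
    {"$match": {"region": "EMEA", "day": {"$gte": "2016-01-01"}}},  # indexed slice
    {"$group": {"_id": "$product", "revenue": {"$sum": "$amount"}}},
    {"$sort": {"revenue": -1}},
]
for row in txns.aggregate(pipeline):
    print(row)
```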
29. Data Store for Last Dataset
[Diagram: the EDM pipeline with the “analyze” stage, users, and other systems highlighted]
– At the last step, there are many consuming systems and users – often digital applications and RESTful services/APIs
– These need expressive querying with secondary indexes, high scale, and JSON (a minimal API sketch follows)
– Dashboards & reports can have sub-second latency with indexes
– MongoDB is the best option for publishing and distributing analytical results and operationalizing data
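As one way this last stage is often consumed, here is a minimal RESTful sketch using Flask on top of PyMongo; the endpoint, the `customer_segments` collection, and the `segment` field are all hypothetical:

```python
from flask import Flask, jsonify
from pymongo import MongoClient

app = Flask(__name__)
segments = MongoClient()["edm"]["customer_segments"]  # hypothetical results
segments.create_index("segment")  # secondary index keeps lookups in ms range

@app.route("/segments/<name>")
def get_segment(name):
    # Project _id away so the documents serialize cleanly to JSON.
    docs = list(segments.find({"segment": name}, {"_id": 0}).limit(100))
    return jsonify({"results": docs})  # documents map naturally to JSON

if __name__ == "__main__":
    app.run()
```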
30. Complete EDM Architecture
[Diagram: the data processing pipeline (pub-sub, ETL, file imports, stream processing) runs over a Data Lake with distributed processing; siloed source databases, external feeds (batch), and streams feed in, and downstream systems consume the results via drivers & stacks: a Single CSR Application, unified digital apps, operational reporting, analytic reporting, customer clustering, churn analysis, and predictive analytics]
– Governance to choose where to load and process data
– Optimal location for providing operational response times & slices
– Can run processing on all data or slices
31. Example Scenarios
1. Single Customer View
   a. Operational
   b. Analytics on customer segments
   c. Analytics on all customers
2. Customer profiles & clustering
3. Presenting churn analytics on high-value customers
32. Single View of Customer – Large Spanish Bank
Spanish bank replaces Teradata and MicroStrategy to increase business and avoid significant cost
Problem:
• Took days to implement new functionality and business policies, inhibiting revenue growth
• Branches needed an app providing a single view of the customer and real-time recommendations for new products and services
• Multi-minute latency for accessing customer data stored in Teradata and MicroStrategy
Solution:
• Built single view of customer on MongoDB – a flexible and scalable app, easy to adapt to new business needs
• Super-fast ad hoc query capabilities (milliseconds) and real-time analytics thanks to MongoDB's Aggregation Framework
• Can now leverage distributed infrastructure and commodity hardware for lower total cost of ownership and greater availability
Results:
• Cost avoidance of $10M+
• Application developed and deployed in less than 6 months; new business policies easily deployed and executed, bringing new revenue to the company
• Current capacity allows branches to load all customer info instantly, in milliseconds, providing a great customer experience
33. Case Study
Insurance leader generates coveted single view of customers in 90 days – “The Wall”
Problem:
• No single view of customer, leading to poor customer experience and churn
• 145 years of policy data, 70+ systems, 15+ apps that are not integrated
• Spent 2 years and $25M trying to build a single view with Oracle – failed
Solution:
• Built “The Wall”, pulling in disparate data and serving a single view to customer service reps in real time
• Flexible data model to aggregate disparate data into a single data store
• Churn analysis done with Hadoop, with relevant results output to MongoDB
Results:
• Prototyped in 2 weeks
• Deployed to production in 90 days
• Decreased churn and improved ability to upsell/cross-sell
34. Kicking Out Oracle – Top 15 Global Bank
Global bank with 48M customers in 50 countries terminates Oracle ULA & makes MongoDB its database of choice
Problem:
• Slow development cycles due to the RDBMS' rigid data model, hindering ability to meet business demands
• High TCO for hardware, licenses, development, and support (>$50M Oracle ULA)
• Poor overall performance of customer-facing and internal applications
Solution:
• Building dozens of apps on MongoDB, both net new and migrations from Oracle – e.g., a significant portion of retail banking, including customer-facing and back-office apps, fraud detection, card activation, and equity research content management
• Flexible data model to develop apps quickly and accommodate diverse data
• Ability to scale infrastructure and costs elastically
Results:
• Able to cancel Oracle ULA; evaluating which apps can be migrated to MongoDB; for new apps, MongoDB is the default choice
• Apps built in weeks instead of months or years – e.g., ebanking app prototyped in 2 weeks and in production in 4 weeks
• 70% TCO reduction
Editor's notes
Stream processing is often a separate processing layer from batch processing, but its output can be stored in the data stores at various stages
Could make more visual
Here we have greatly reduced the relational data model for this application to two tables. In reality no database has just two tables; it is much more common to have hundreds or thousands. And as a developer, where do you begin when you have a complex data model? If you're building an app you're really thinking about just a handful of common things, like products, and these can be represented in a document much more easily than in a complex relational model, where the data is broken up in a way that doesn't really reflect the way you think about the data or write an application.
Document Model Benefits:
• Agility and flexibility – the data model supports business change; rapidly iterate to meet new requirements
• Intuitive, natural data representation – eliminates the ORM layer; developers are more productive
• Performance delivered at scale – reduces the need for joins and disk seeks; programming is simpler
Rich queries, text search, geospatial, aggregation, and map-reduce are the types of things you can build based on the richness of the query model.
Blend data from multiple sources for analysis
Higher performance analytics with less application-side code and less effort from your developers
Executed via the new $lookup operator, a stage in the MongoDB Aggregation Framework pipeline
Start with the original collection; each record (document) contains a number of shapes (keys), each with a particular color (value)
$match filters out documents that don’t contain a red diamond
$project adds a new “square” attribute with a value computed from the value (color) of the snowflake and triangle attributes
$lookup performs a left outer join with another collection, with the star being the comparison key
Finally, the $group stage groups the data by the color of the square and produces statistics for each group
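In PyMongo, the pipeline described in these notes would look roughly like the following; the collection names and the encoding of shapes as fields and colors as values are assumptions made to mirror the demo:

```python
from pymongo import MongoClient

db = MongoClient()["demo"]  # hypothetical database and collections

pipeline = [
    # $match: filter out documents that don't contain a red diamond
    {"$match": {"diamond": "red"}},
    # $project: add a "square" attribute computed from the snowflake
    # and triangle attributes
    {"$project": {
        "star": 1,
        "square": {"$cond": [{"$eq": ["$snowflake", "$triangle"]},
                             "$snowflake", "mixed"]},
    }},
    # $lookup: left outer join with another collection on the star key
    {"$lookup": {"from": "other_shapes", "localField": "star",
                 "foreignField": "star", "as": "joined"}},
    # $group: group by the color of the square and produce per-group stats
    {"$group": {"_id": "$square", "count": {"$sum": 1}}},
]
for group in db["shapes"].aggregate(pipeline):
    print(group)
```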
High Availability – Ensure application availability during many types of failures
Meet stringent SLAs with fast-failover algorithm
Under 2 seconds to detect and recover from replica set primary failure
Disaster Recovery – Address the RTO and RPO goals for business continuity
Maintenance – Perform upgrades and other maintenance operations with no application downtime
Secondaries can be used for a variety of purposes – failover, hot backup, rolling upgrades, data locality and privacy, and workload isolation
Sharding is transparent to applications; whether there is one or one hundred shards, the application code for querying MongoDB is the same. Applications issue queries to a query router that dispatches the query to the appropriate shards.
For key-value queries that are based on the shard key, the query router will dispatch the query to the shard that manages the document with the requested key. When using range-based sharding, queries that specify ranges on the shard key are only dispatched to shards that contain documents with values within the range. For queries that don’t use the shard key, the query router will dispatch the query to all shards and aggregate and sort the results as appropriate. Multiple query routers can be used with a MongoDB system, and the appropriate number is determined based on performance and availability requirements of the application.
Kernel 3.2 Scope tracking: https://docs.google.com/spreadsheets/d/1L1EbbWoshUIHXBzCh5e3sALtAFxm_dJ52SRPR6GzeAY/edit#gid=0
Release notes for 3.1.6: http://docs.mongodb.org/manual/release-notes/3.1-dev-series/
Stream processing is often a separate processing layer from batch processing, but its output can be stored in the data stores at various stages
Just a logical diagram. Processing could run on the same physical servers as the storage nodes to minimize data movement