Joe Caserta, President of award-winning consulting and analytics firm Caserta Concepts, talked about Big Data trends, practical techniques, and new best practices at the BI Leadership Summit in New York.
For more information about the event: http://ow.ly/G3MkH
For more information about the services and solutions Caserta Concepts offers, visit our website at http://casertaconcepts.com/
2. @joe_Caserta@BizAnalyticsTT
Top 20 Big Data Consulting - CIO Review
Joe Caserta Timeline
Timeline years: 1986, 1996, 2001, 2004, 2009, 2010, 2012, 2013, 2014
• Launched Big Data practice
• Co-author, with Ralph Kimball, The Data Warehouse ETL Toolkit (Wiley)
• Dedicated to Data Warehousing, Business Intelligence since 1996
• Began consulting in database programming and data modeling
• 25+ years hands-on experience building database solutions
• Founded Caserta Concepts in NYC
• Web log analytics solution published in Intelligent Enterprise
• Formalized Alliances / Partnerships – System Integrators
• Partnered with Big Data vendors Cloudera, Hortonworks, IBM, Cisco, Datameer, Basho, more…
• Launched Training practice, teaching data concepts world-wide
• Laser focus on extending Data Warehouses with Big Data solutions
• Launched Big Data Warehousing Meetup in NYC ~ 1,500 Members
• Established best practices for big data ecosystem implementation – Healthcare, Finance, Insurance
• Top 20 Most Powerful Big Data consulting firms
• Dedicated to Data Governance Techniques on Big Data (Innovation)
3. @joe_Caserta@BizAnalyticsTT
About Caserta Concepts
• Technology services company with expertise in data analysis:
• Big Data Solutions
• Data Warehousing
• Business Intelligence
• Core focus in the following industries:
• eCommerce / Retail / Digital Marketing
• Financial Services / Insurance
• Healthcare / Higher Education
• Established in 2001:
• Increased growth year-over-year
• Industry recognized work force
• Strategy, Implementation, Analytics
• Writing, Education, Mentoring
• Data Science & Analytics
• Cloud Computing
• Data Interaction & Visualization
4. @joe_Caserta@BizAnalyticsTT
The Evolution of Enterprise Data?
[Diagram] Sources: Sales, Marketing, Finance
Traditional BI: ETL, Enterprise Data Warehouse, Ad-Hoc/Canned Reporting
Big Data Cluster (Horizontally Scalable Environment - Optimized for Analytics): NoSQL Databases; Spark, MapReduce, Pig/Hive, Others…; Hadoop Distributed File System (HDFS) on nodes N1, N2, N3, N4, N5
Big Data Analytics: ETL, Data Exploration, Data Science
7. @joe_Caserta@BizAnalyticsTT
The ones you need to know….
Hadoop Distribution: Cloudera, Hortonworks, MapR, Pivotal-HD, IBM
Tools:
Hive: Map data to structures and use SQL-like queries
Pig: Data transformation language for big data
Sqoop: Extracts data from external sources and loads it into Hadoop
Spark: General-purpose cluster computing framework
Storm: Real-time ETL
NoSQL:
Document: MongoDB, CouchDB
Graph: Neo4j, Titan
Key Value: Riak, Redis
Columnar: Cassandra, HBase
Search: Lucene, Solr, ElasticSearch
Languages: Python, SciPy, Java, R, Scala
8. @joe_Caserta@BizAnalyticsTT
Why are we Changing?
• Advertising: Real-time interactive queries on massive audience datasets in the cloud
• 360° Customer: Cross-channel customer linking to improve the customer experience and increase sales
• Recommendation Engines: “You chose… you might also like…”
• Real-Time: Aggregation, Monitoring & Alerting on events at extremely high message rates… ~1M msgs/sec
• Big Data Warehouse: Extending EDW with Hadoop; governing data from the “lake” to the EDW
Personal/Commercial Banking
Investment/Trading Bank
Quick Service Restaurant (QSR)
Cable Television
Audience-based Advertising
9. @joe_Caserta@BizAnalyticsTT
The Big Data Pyramid
Hadoop has different demands at each tier.
Only the top tier of the pyramid is fully governed and ready for Enterprise BI.
Tiers, from bottom to top:
• Landing Area – Source Data in “Full Fidelity”: Raw machine data collection, collect everything. Metadata Catalog; ILM (who has access, how long do we “manage it”).
• Data Lake – Integrated Sandbox: Data is ready to be turned into information: organized, well defined, complete. Metadata Catalog; ILM (who has access, how long do we “manage it”); Data Quality and Monitoring (monitor completeness of data).
• Data Science Workspace: Agile business insight through data-munging, machine learning, blending with external data, development of to-be BDW facts. Metadata Catalog; ILM (who has access, how long to “manage it”); Data Quality and Monitoring (monitoring of completeness of data).
• Big Data Warehouse: Fully Data Governed (trusted). User community arbitrary queries and reporting.
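The tiered governance idea on this slide can be sketched as a small data structure. This is a minimal illustration, not the author's implementation: the `Tier` class, the constant names, and the assignment of controls to each tier are assumptions read off the slide layout, and the top tier is assumed to carry the lower-tier controls as well.

```python
from dataclasses import dataclass

# Control names come from the slide; which tier carries which control is an
# assumption based on the slide layout.
CATALOG = "Metadata Catalog"
ILM = "ILM"
DQ = "Data Quality and Monitoring"
GOVERNED = "Fully Data Governed"


@dataclass(frozen=True)
class Tier:
    """One layer of the pyramid and the governance controls applied to it."""
    name: str
    controls: frozenset

    def ready_for_enterprise_bi(self) -> bool:
        # Per the slide, only fully governed data is ready for Enterprise BI.
        return GOVERNED in self.controls


# Listed bottom to top.
PYRAMID = [
    Tier("Landing Area", frozenset({CATALOG, ILM})),
    Tier("Data Lake", frozenset({CATALOG, ILM, DQ})),
    Tier("Data Science Workspace", frozenset({CATALOG, ILM, DQ})),
    # Assumption: the top tier keeps the lower-tier controls too.
    Tier("Big Data Warehouse", frozenset({CATALOG, ILM, DQ, GOVERNED})),
]


def bi_ready_tiers(pyramid):
    """Return the names of tiers that may serve Enterprise BI queries."""
    return [tier.name for tier in pyramid if tier.ready_for_enterprise_bi()]
```

Calling `bi_ready_tiers(PYRAMID)` returns only the Big Data Warehouse, which mirrors the point that the lower tiers are for collection and exploration rather than trusted reporting.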
10. @joe_Caserta@BizAnalyticsTT
• The Big Data movement breaks the relational database
barrier and enables analysis on massive amounts of
structured and unstructured data.
• NoSQL puts the value of SQL based relational databases
into question. This disruption is forging a new road for the
progress and advancement of scalable data analytics.
• The value of legacy Business Intelligence comes into
question.
• Rather than forcing data users to become technologists, BI
must make data analysis available to the masses.
BI is About to be Disrupted!
11. @joe_Caserta@BizAnalyticsTT
• The role of the ‘Business Analyst’, the primary user of the
BI tool, is being replaced by two types of data users:
1. Highly technical Data Scientists
2. Non-technical Business Persons
• New analytics (BI) platforms must be created to
accommodate the new users. We see these very discrete
users using very different technologies.
• Perhaps legacy BI tools will not go away, but the market is
absolutely about to be disrupted.
Who Does BI Today?
12. @joe_Caserta@BizAnalyticsTT
• Data Scientists have deep technical knowledge
• They enjoy writing code and mining data
• The best way to serve a data scientist is to provide access
to raw data and then get out of their way.
Empower the Data Scientist
13. @joe_Caserta@BizAnalyticsTT
What does a Data Scientist Do, Anyway?
Writes really cool and sophisticated algorithms that impact the way the business runs. NOT
Much of the time of a Data Scientist is spent:
• Searching for the data they need
• Making sense of the data
• Figuring out why the data looks the way it does and assessing its validity
• Cleaning up all the garbage within the data so it represents true business
• Combining events with reference data to give them context
• Correlating event data with other events
Finally, they write algorithms to perform mining, clustering and predictive analytics – the sexy stuff.
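A minimal sketch of the preparation work described on this slide, covering only the validity check, the clean-up, and the step of combining events with reference data. The event records, the `CUSTOMERS` lookup, and all function names are hypothetical and exist purely for illustration.

```python
from collections import Counter

# Hypothetical raw events; the blank id and the non-numeric amount stand in
# for the "garbage" that has to be cleaned out.
RAW_EVENTS = [
    {"customer_id": "c1", "action": "view", "amount": None},
    {"customer_id": "c1", "action": "purchase", "amount": "19.99"},
    {"customer_id": "", "action": "purchase", "amount": "5.00"},
    {"customer_id": "c2", "action": "purchase", "amount": "oops"},
    {"customer_id": "c2", "action": "view", "amount": None},
]

# Hypothetical reference data keyed by customer id.
CUSTOMERS = {"c1": {"state": "NY"}, "c2": {"state": "NJ"}}


def is_valid(event):
    """Keep events with a known customer and, for purchases, a numeric amount."""
    if event["customer_id"] not in CUSTOMERS:
        return False
    if event["action"] == "purchase":
        try:
            float(event["amount"])
        except (TypeError, ValueError):
            return False
    return True


def enrich(event):
    """Combine an event with reference data to give it context."""
    return {**event, **CUSTOMERS[event["customer_id"]]}


def prepare(events):
    """Drop invalid events, then enrich the survivors."""
    return [enrich(event) for event in events if is_valid(event)]


def purchases_by_state(events):
    """Count purchase events per state on already-enriched events."""
    return Counter(e["state"] for e in events if e["action"] == "purchase")


clean = prepare(RAW_EVENTS)
```

Only after `prepare` has run does a summary such as `purchases_by_state(clean)` mean anything, which is the slide's point about where the time goes.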
14. @joe_Caserta@BizAnalyticsTT
• Business users don’t have, and don’t want to have,
technical wherewithal to interact with ‘data’.
• “We have a business to run! Programming should be done by
people in rooms with no windows.”
• “I need information at my fingertips and I should not need a PhD in
SQL to get it.”
• “It’s a myth that BI tools will solve my problems, I still need IT to get
new reports. This is unacceptable.”
• Every business professional on the planet knows how to
search for needed information via a Google search bar.
• Business people want to be able to ‘Google’ their
corporate data for the information they need.
Empower the Business Person
17. @joe_Caserta@BizAnalyticsTT
• During normal BI implementations, much time is spent/wasted on selecting the best way to graphically represent a set of metrics.
• We can embed algorithms that are statistically proven to best represent information depending on the type of question being asked.
• The user should be able to preview and change from the default infographic as easily as clicking ‘next’ on a Yahoo! Slideshow.
Why do we make it so difficult?
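The idea of a default chart chosen by question type, with a ‘next’ action to cycle through alternatives, might look roughly like the sketch below. The ranking table, the keyword sets, and the `ChartPicker` class are hypothetical placeholders; the slide does not say which algorithm or ranking would actually be used.

```python
# Hypothetical ranking of chart types per question type. The order is
# illustrative only and is not a statistically derived result.
CHART_RANKING = {
    "trend": ["line", "area", "bar"],
    "geography": ["map", "bar", "table"],
    "comparison": ["bar", "table", "pie"],
}

TIME_WORDS = {"year", "month", "week", "day", "date"}
GEO_WORDS = {"state", "region", "country", "city"}


def classify_question(dimensions):
    """Guess the question type from the dimensions the user asked for."""
    dims = {d.lower() for d in dimensions}
    if dims & TIME_WORDS:
        return "trend"
    if dims & GEO_WORDS:
        return "geography"
    return "comparison"


class ChartPicker:
    """Holds the ranked chart options for one question."""

    def __init__(self, dimensions):
        self.options = CHART_RANKING[classify_question(dimensions)]
        self.index = 0

    @property
    def current(self):
        """The chart type currently shown; starts at the default."""
        return self.options[self.index]

    def next(self):
        """Advance to the next option, wrapping around like a slideshow."""
        self.index = (self.index + 1) % len(self.options)
        return self.current
```

A caller would create `ChartPicker(["state", "age"])`, render `current`, and call `next()` each time the user asks for a different view.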
18. @joe_Caserta@BizAnalyticsTT
[Search bar] Lady Gaga sales by state by customer age [Go!]
joe@casertaconcepts.com
Region: Northeast, Midwest, South, West
Product: Records, Perfume, Clothes, Performances
Dates: 2009 to 2013
[Button] DOWNLOAD TO EXCEL
Imagine the Possibilities….
19. @joe_Caserta@BizAnalyticsTT
Building the Future of BI (Hint: it’s Big Data)
Angular
• Modern web application framework
• Developed and supported by Google
• Bootstrap used for Mobile
D3.js
• JavaScript library for data visualization
• Exposes the full capabilities of CSS3, HTML5 and SVG; extremely fast
• Supports large datasets and dynamic behaviors for interaction
Python
• The “glue” that brings other components together
• The ‘engine’ that transforms search strings into queries
• Integrated with the Customer Metadata repository
Solr
• Full-text and faceted-search engine and database
• This is the backbone of the application
Cassandra
• Customer Metadata repository. Stores all business rules (default facets, etc.) and user preferences (default graph types, etc.)
• Cassandra may not be the ultimate selection
AWS
• Amazon Web Services
• Queree will be a zero-footprint cloud-based solution
• User experience is the same as Googling info
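As a rough illustration of the Python ‘engine’ role described here, the sketch below turns a search string of the form "subject by facet by facet" into Solr request parameters. The parsing rule, the `FACET_FIELDS` mapping, and the field names are assumptions made for this example; only `q`, `rows`, `facet`, and `facet.field` are standard Solr query parameters.

```python
from urllib.parse import urlencode

# Hypothetical mapping from business phrases to Solr field names. In the
# described design this would come from the customer metadata repository.
FACET_FIELDS = {
    "state": "state_s",
    "customer age": "customer_age_i",
    "region": "region_s",
}


def parse_search(text):
    """Split 'subject by facet by facet' into a subject and known facet fields."""
    subject, *phrases = [part.strip() for part in text.lower().split(" by ")]
    facets = [FACET_FIELDS[p] for p in phrases if p in FACET_FIELDS]
    return subject, facets


def to_solr_params(text, rows=0):
    """Build Solr parameters; rows=0 asks for facet counts without documents."""
    subject, facets = parse_search(text)
    params = [("q", subject), ("rows", rows)]
    if facets:
        params.append(("facet", "true"))
        params.extend(("facet.field", name) for name in facets)
    return params


def to_query_string(text):
    """URL-encode the parameters for use in a Solr select request."""
    return urlencode(to_solr_params(text))
```

For the search shown on the previous slide, `to_query_string("Lady gaga sales by state by customer age")` yields a query on "lady gaga sales" faceted by the two mapped fields. Phrases that are not in the mapping are silently ignored in this sketch; a real engine would need to report or resolve them.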