Data Science
Crash Course - DataWorks Summit - Munich 2017
Robert Hryniewicz
Developer Advocate
@RobertH8z
rhryniewicz@hortonworks.com
© Hortonworks Inc. 2011 – 2016. All Rights Reserved

What is Data Science?
→ Extracting knowledge/insights from data
– Data: structured or unstructured
→ Continuation of
– statistics
– machine learning
– data mining
– predictive analytics

What is Machine Learning?
Machine Learning
“science of how computers learn without being explicitly programmed”

“AI is the new electricity.”
“AI needs to be company wide strategic decision.”
Andrew Ng
Chief Data Scientist
Co-founder of Coursera
Prof. at Stanford

A Brief History of AI
Antiquity – An Ancient Wish to Forge the Gods
1940 (Digital Computer, scientists discuss electronic brain)
1954 – 73 (Marvin Minsky et al. in Dartmouth College)
1973 – 80
1980 – 87 (Japanese gov.)
1987 – 93
1993 – 2000
2000 → Present

AI in Media & Pop Culture

What is AI?
→ General or Pure AI
→ Narrow or Pragmatic AI

“Big Data”
→ Internet of Anything (IoT)
– Wind Turbines, Oil Rigs
– Beacons, Wearables
– Smart Cars
→ User Generated Content (Social, Web & Mobile)
– Twitter, Facebook, Snapchat
– Clickstream
– Paypal, Venmo
44 ZB in 2020

Visualizing 44 ZB
100 pixels = 1M TB
100 px → 1M TB assumes a 5M-pixel resolution screen
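
The scale on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal (SI) units, which the slide does not state; `pixels_for` is a hypothetical helper name.

```python
# Assumption: decimal (SI) units, i.e. 1 ZB = 10^21 bytes and 1 TB = 10^12 bytes.
TB_PER_ZB = 10**21 // 10**12        # 1 ZB = 10^9 TB
PIXELS_PER_MILLION_TB = 100         # scale stated on the slide


def pixels_for(zettabytes: int) -> int:
    """Number of pixels needed to draw `zettabytes` at the slide's scale."""
    terabytes = zettabytes * TB_PER_ZB
    return terabytes // 1_000_000 * PIXELS_PER_MILLION_TB


print(pixels_for(44))  # 44 ZB lands just under a 5M-pixel screen
```

Under these assumptions 44 ZB needs about 4.4 million pixels, which is consistent with the 5M-pixel screen mentioned on the slide.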

Key drivers behind AI Explosion
→ Exponential data growth
→ Faster distributed systems
→ Smarter algorithms

Major Trends in AI Technologies
→ Knowledge Engineering
→ Machine Learning
→ Deep Learning
→ Image Analysis
→ Natural Language Processing & Generation
→ Robotics & Automation

Creating Value with AI
→ Cognitive insights
→ Cognitive engagement
→ Cognitive automation

Machine Learning Use Cases
Healthcare
– Predict diagnosis
– Prioritize screenings
– Reduce re-admittance rates
Financial services
– Fraud detection/prevention
– Predict underwriting risk
– New account risk screens
Public Sector
– Analyze public sentiment
– Optimize resource allocation
– Law enforcement & security
Retail
– Product recommendation
– Inventory management
– Price optimization
Telco/mobile
– Predict customer churn
– Predict equipment failure
– Customer behavior analysis
Oil & Gas
– Predictive maintenance
– Seismic data management
– Predict well production levels

What Is Apache Spark?
→ Apache open source project originally developed at AMPLab (University of California, Berkeley)
→ Unified data processing engine that operates across varied data workloads and platforms

Why Apache Spark?
→ Elegant Developer APIs
– Single environment for data munging, data wrangling, and Machine Learning (ML)
→ Fast! In-memory computation model
– Effective for iterative computations
→ Machine Learning
– Implementation of distributed ML algorithms

Spark SQL: Structured Data
Spark Streaming: Near Real-time
Spark MLlib: Machine Learning
GraphX: Graph Analysis

More Flexible / Better Storage and Performance

Spark SQL Overview
→ Spark module for structured data processing (e.g. DB tables, JSON files, CSV)
→ Three ways to manipulate data:
– DataFrames API
– SQL queries
– Datasets API

DataFrames
→ Distributed collection of data organized into named columns
→ Conceptually equivalent to a table in a relational DB or a data frame in R/Python
→ API available in Scala, Java, Python, and R
[Diagram: a DataFrame with columns Col1, Col2, …, ColN. Data is described as a DataFrame with rows, columns, and a schema]

DataFrames
[Diagram: CSV, Avro, Hive, and JSON sources connected through Spark SQL to a DataFrame of rows and columns (Col1, Col2, …, ColN)]

Visualizations
Source: commons.wikimedia.org/w/index.php?curid=17857442

Data Visualization: Twitter
Source: https://medium.com/@swainjo/us-presidential-election-2016-twitter-analysis-7596606853e5#.dozwu2bhd

Simple line chart
Horizontal plot of three line charts
Streaming data into a line chart
Plotting Iris data features in one plot
Comparing Iris data distributions

Algorithms

What is an ML Model?
→ Mathematical formula with a number of parameters that need to be learned from the data. Fitting a model to the data is a process known as model training
→ E.g. linear regression
– Goal: fit a line y = mx + c to data points
– After model training: y = 2x + 5
Input → Model → Output
1, 0, 7, 2, … → 7, 5, 19, 9, …
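
As a minimal sketch of what "model training" means for this slide's example, the code below fits y = mx + c to the four input/output pairs shown, using ordinary least squares. The function name `fit_line` is hypothetical, and plain Python stands in for Spark MLlib here.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = m*x + c; returns the learned parameters (m, c)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


# Input and output values taken from the slide.
xs = [1, 0, 7, 2]
ys = [7, 5, 19, 9]

m, c = fit_line(xs, ys)
print(f"y = {m:.2f}x + {c:.2f}")
```

On these four points the fit recovers the slide's trained model, y = 2x + 5.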

START
Classification: Logistic Regression, Support Vector Machines (SVM), Random Forest (RF), Naïve Bayes
Regression: Linear Regression
Collaborative Filtering: Alternating Least Squares (ALS)
Clustering: K-Means, LDA
Dimensionality Reduction: Principal Component Analysis (PCA)

CLASSIFICATION
Identifying which category an object belongs to
Examples: spam detection, diabetes diagnosis, text labeling
Algorithms:
→ Logistic Regression
– Fast training, linear model
– Classes expressed in probabilities
→ Support Vector Machines (SVM)
– “Best” supervised learning algorithm, effective
– More robust to outliers than Logistic Regression
– Handles non-linearity
→ Random Forest
– Fast training
– Handles categorical features
– Does not require feature scaling
– Captures non-linearity and feature interaction
→ Naïve Bayes
– Good for text classification
– Assumes independent variables

CLASSIFICATION
Visual Intro to Decision Trees
→ http://www.r2d3.us/visual-intro-to-machine-learning-part-1

REGRESSION
Predicting a continuous-valued output
Example: Predicting house prices based on number of bedrooms and square footage
Algorithms: Linear Regression

CLUSTERING
Automatic grouping of similar objects into sets (clusters)
Example: market segmentation – automatically group customers into different market segments
Algorithms: K-means, LDA

COLLABORATIVE FILTERING
Fill in the missing entries of a user-item association matrix
Applications: Product/movie recommendation
Algorithms: Alternating Least Squares (ALS)

DIMENSIONALITY REDUCTION
Reducing the number of redundant features/variables
Applications:
→ Removing noise in images by selecting only “important” features
→ Removing redundant features, e.g. MPH & KPH are linearly dependent
Algorithms: Principal Component Analysis (PCA)

START
Categories: Regression, Classification, Deep Learning, Clustering, Dimensionality Reduction, Collaborative Filtering
• XGBoost (Extreme Gradient Boosting)
• Classification and regression trees (CART)
• Recurrent Neural Network (RNN)
• Convolutional Neural Network (CNN)
• Yinyang K-Means
• t-Distributed Stochastic Neighbor Embedding (t-SNE)
• Local Regression (LOESS)
Collaborative Filtering
• Weighted Alternating Least Squares (WALS)

Hyperparameters
→ Define higher-level model properties, e.g. complexity or learning rate
→ Cannot be learned during training → need to be predefined
→ Can be decided by
– setting different values
– training different models
– choosing the values that test better
→ Hyperparameter examples
– Number of leaves or depth of a tree
– Number of latent factors in a matrix factorization
– Learning rate (in many models)
– Number of hidden layers in a deep neural network
– Number of clusters in a k-means clustering
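
The three-step recipe above (set different values, train different models, keep the values that test better) can be sketched as a small grid search. Everything here is illustrative: the data is made up, the model is a one-parameter line through the origin, and the regularization strength `reg` is used as the hyperparameter.

```python
def train(xs, ys, reg):
    """Fit y ≈ w * x with an L2 penalty `reg` (the hyperparameter); returns w."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + reg)


def mse(w, xs, ys):
    """Mean squared error of the model y = w * x on held-out data."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up training and validation data (assumption, not from the slides).
train_x, train_y = [1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8]
val_x, val_y = [5, 6], [10.1, 11.8]

# 1. Set different values. 2. Train one model per value. 3. Score each on validation data.
candidates = [0.0, 1.0, 10.0, 100.0]
scores = {reg: mse(train(train_x, train_y, reg), val_x, val_y) for reg in candidates}

# Keep the value that tests best.
best_reg = min(scores, key=scores.get)
print(best_reg, scores[best_reg])
```

The weight `w` is learned from the training data, while `reg` is fixed before training and chosen only by comparing validation scores, which is the distinction the slide draws.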

Predictive Analytics Pre-requisites

Predictive Analytics Process and Tools

Asking Relevant Questions
→ Specific (can you think of a clear answer?)
→ Measurable (quantifiable? data driven?)
→ Actionable (if you had an answer, could you do something with it?)
→ Realistic (can you get an answer with data you have?)
→ Timely (answer in reasonable timeframe?)

With that in mind…
→ No simple formula for “good questions”, only general guidelines
→ The right data is better than lots of data
→ Understanding relationships matters

Data Preparation
1. Data analysis (audit for anomalies/errors)
2. Creating an intuitive workflow (formulate sequence of prep operations)
3. Validation (correctness evaluated against sample representative dataset)
4. Transformation (actual prep process takes place)
5. Backflow of cleaned data (replace original dirty data)
Approx. 80% of a Data Analyst’s job is Data Preparation!
Example of multiple values used for U.S. States ⇒ California, CA, Cal., Cal
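
A minimal sketch of the transformation step for the U.S. state example: map every spelling seen during the audit to one canonical code. The alias table and the `normalize_state` helper are hypothetical; a real table would come from the data analysis step.

```python
# Hypothetical alias table built from the spellings on the slide.
STATE_ALIASES = {
    "california": "CA",
    "ca": "CA",
    "cal.": "CA",
    "cal": "CA",
}


def normalize_state(raw: str) -> str:
    """Return the canonical state code, or the trimmed input if it is unknown."""
    key = raw.strip().lower()
    return STATE_ALIASES.get(key, raw.strip())


dirty = ["California", "CA", "Cal.", "Cal"]
cleaned = [normalize_state(value) for value in dirty]
print(cleaned)
```

Unknown values are passed through rather than guessed, so they can be caught in the validation step instead of being silently rewritten.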

Detailed Research and Operational Workflows

Training Set → Learning Algorithm → h (hypothesis/model)
input → h → output
Ingest / Enrich Data
Clean / Transform / Filter
Select / Create New Features
Evaluate Accuracy / Score

Building Spark ML pipelines
Pipeline: Feature transform 1 → Feature transform 2 → Combine features → Linear Regression
Train: Input DataFrame → Pipeline → Pipeline Model
Predict: Input DataFrame → Pipeline Model → Output DataFrame
Export Model

Spark ML Pipeline
→ fit() is for training
→ transform() is for prediction
Train: Input DataFrame (TRAIN) → Pipeline fit() → Pipeline Model
Predict: Input DataFrame (TEST) → Pipeline Model transform() → Output DataFrame (PREDICTIONS)
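
The fit()/transform() contract can be illustrated without a cluster. The toy classes below are not Spark classes; `MeanPredictor` and `MeanModel` are hypothetical names, and lists of dicts stand in for DataFrames. They only mirror the shape of the API: fit() consumes training data and returns a model, and transform() returns the input rows with a prediction added.

```python
class MeanModel:
    """Toy fitted model: transform() adds a prediction to every row."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, rows):
        # Return new rows instead of mutating the input.
        return [{**row, "prediction": self.mean} for row in rows]


class MeanPredictor:
    """Toy estimator: fit() learns one parameter and returns a model."""

    def fit(self, train_rows):
        labels = [row["label"] for row in train_rows]
        return MeanModel(sum(labels) / len(labels))


train_rows = [{"label": 1.0}, {"label": 3.0}]   # stands in for the TRAIN DataFrame
test_rows = [{"label": 2.0}]                    # stands in for the TEST DataFrame

model = MeanPredictor().fit(train_rows)         # Train
predictions = model.transform(test_rows)        # Predict
print(predictions)
```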

Sample Spark ML Pipeline
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier

# Pipeline stages (definitions elided on the slide)
indexer = …
parser = …
hashingTF = …
vecAssembler = …
rf = RandomForestClassifier(numTrees=100)
pipe = Pipeline(stages=[indexer, parser, hashingTF, vecAssembler, rf])
model = pipe.fit(trainData)  # Train model
results = model.transform(testData)  # Test model

Exporting ML Models - PMML
→ Predictive Model Markup Language (PMML)
→ Supported models
– K-Means
– Linear Regression
– Ridge Regression
– Lasso
– SVM
– Binary

HDCloud

Hortonworks Cloud Solutions
Providers: Microsoft, AWS, Google
Managed: Azure HDInsight
Non-Managed / Marketplace: Hortonworks Data Cloud for AWS
Cloud IaaS: Hortonworks Data Platform (via Ambari and via Cloudbreak)

Zeppelin
Ambari
Spark History Server
Files View

→ Zeppelin ⇒ Interactive notebook
→ Spark
→ YARN ⇒ Resource Management
→ HDFS ⇒ Distributed Storage Layer
[Diagram: Scala, Java, Python, and R APIs; Spark SQL, Spark Streaming, MLlib, and GraphX; Spark Core Engine; YARN; HDFS nodes 1 … N]

Spark and HDP

Labs / Tutorials

Scatter 2D Data Visualized
scatterData ← DataFrame
+-----+--------+
|label|features|
+-----+--------+
|-12.0|  [-4.9]|
| -6.0|  [-4.5]|
| -7.2|  [-4.1]|
| -5.0|  [-3.2]|
| -2.0|  [-3.0]|
| -3.1|  [-2.1]|
| -4.0|  [-1.5]|
| -2.2|  [-1.2]|
| -2.0|  [-0.7]|
|  1.0|  [-0.5]|
| -0.7|  [-0.2]|
...

Linear Regression Model Training (one feature)
Training Result
Coefficients: 2.81    Intercept: 3.05
y = 2.81x + 3.05

Linear Regression (two features)
Coefficients: [0.464, 0.464]
Intercept: 0.0563

ML Lab
• Residuals
– the residual of an observed value is the difference between the observed value and the estimated value
• R² (R Squared) – Coefficient of Determination
– indicates goodness of fit
– R² of 1 means the regression line perfectly fits the data
• RMSE (Root Mean Square Error)
– measure of the differences between values predicted by a model and values actually observed
– good measure of accuracy, but only for comparing forecasting errors of different models on the same variable, because it is scale-dependent
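
The three quantities above can be computed directly from their definitions. This is a sketch in plain Python rather than the lab's Spark code; `regression_metrics` is a hypothetical helper, and the sample values reuse the y = 2x + 5 example from the earlier model slide.

```python
import math


def regression_metrics(observed, predicted):
    """Return (residuals, R², RMSE) for paired observed and predicted values."""
    residuals = [o - p for o, p in zip(observed, predicted)]
    ss_res = sum(r * r for r in residuals)                 # unexplained variation
    mean_obs = sum(observed) / len(observed)
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)    # total variation
    r2 = 1 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / len(observed))
    return residuals, r2, rmse


observed = [7, 5, 19, 9]
predicted = [2 * x + 5 for x in [1, 0, 7, 2]]   # model y = 2x + 5

residuals, r2, rmse = regression_metrics(observed, predicted)
print(residuals, r2, rmse)
```

Because the model reproduces every observed value here, all residuals are zero, R² is 1, and RMSE is 0. RMSE is expressed in the units of the target variable, which is why it is only comparable across models that predict the same variable.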

Demo: Stock Portfolio Simulation using Monte Carlo method
Monte Carlo Simulation
1. Define a domain of possible inputs
2. Randomly generate inputs from a probability distribution over the domain
3. Perform computation on the inputs
4. Aggregate the results
Approximating the value of π after placing 30K random points. Error < 0.07% of actual value.
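
The four steps map directly onto the π example. The sketch below samples points uniformly in the unit square and counts how many fall inside the quarter circle; the seed and the function name `estimate_pi` are assumptions, and the exact error depends on the random draw, so it will not necessarily match the figure quoted on the slide.

```python
import random


def estimate_pi(num_points: int, seed: int = 0) -> float:
    """Monte Carlo estimate of π from points sampled in the unit square."""
    rng = random.Random(seed)           # fixed seed so the run is repeatable
    inside = 0
    for _ in range(num_points):
        # Steps 1 and 2: the domain is the unit square, sampled uniformly.
        x, y = rng.random(), rng.random()
        # Step 3: test whether the point lies inside the quarter circle.
        if x * x + y * y <= 1.0:
            inside += 1
    # Step 4: the fraction inside approximates π/4.
    return 4 * inside / num_points


estimate = estimate_pi(30_000)
print(estimate)
```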

Demo: Text Classification with Naïve Bayes

Diabetes Dataset – Decision Trees / Random Forest
Labeled set with 8 Features
-1 1:-0.294118 2:0.487437 3:0.180328 4:-0.292929 5:-1 6:0.00149028 7:-0.53117 8:-0.0333333
+1 1:-0.882353 2:-0.145729 3:0.0819672 4:-0.414141 5:-1 6:-0.207153 7:-0.766866 8:-0.666667
-1 1:-0.0588235 2:0.839196 3:0.0491803 4:-1 5:-1 6:-0.305514 7:-0.492741 8:-0.633333
+1 1:-0.882353 2:-0.105528 3:0.0819672 4:-0.535354 5:-0.777778 6:-0.162444 7:-0.923997 8:-1
-1 1:-1 2:0.376884 3:-0.344262 4:-0.292929 5:-0.602837 6:0.28465 7:0.887276 8:-0.6
+1 1:-0.411765 2:0.165829 3:0.213115 4:-1 5:-1 6:-0.23696 7:-0.894962 8:-0.7
-1 1:-0.647059 2:-0.21608 3:-0.180328 4:-0.353535 5:-0.791962 6:-0.0760059 7:-0.854825 8:-0.833333
...
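
The rows above look like the sparse `label index:value` text layout used by LIBSVM, which is an assumption since the slide does not name the format. A minimal parser for one such row, with a hypothetical function name, could look like this:

```python
def parse_libsvm_line(line: str):
    """Split one 'label index:value ...' row into (label, {index: value})."""
    label_text, *pairs = line.split()
    features = {}
    for pair in pairs:
        index, value = pair.split(":")
        features[int(index)] = float(value)   # indices are 1-based in the file
    return float(label_text), features


row = ("-1 1:-0.294118 2:0.487437 3:0.180328 4:-0.292929 "
       "5:-1 6:0.00149028 7:-0.53117 8:-0.0333333")
label, features = parse_libsvm_line(row)
print(label, features)
```

Each row yields the class label (+1 or -1) and the eight feature values keyed by their index.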

TensorFlowOnSpark

Feature Selection

Feature Selection
→ Also known as variable or attribute selection
→ Why important?
– simplification of models ⇒ easier to interpret by researchers/users
– shorter training times
– enhanced generalization by reducing overfitting
→ Dimensionality reduction vs feature selection
– Dimensionality reduction: create new combinations of attributes
– Feature selection: include/exclude attributes in data without changing them
Q: Which features should you use to create a predictive model?

Feature Selection
→ Methods
– Filter
– Wrapper
– Embedded
Goal: Identify and remove unneeded, irrelevant and redundant features from data that do not contribute to or may decrease the accuracy of a predictive model.

Feature Selection Traps
→ Feature selection is another key part of the applied machine learning process, like model selection. You cannot fire and forget.
→ It is important to consider feature selection a part of the model selection process. If you do not, you may inadvertently introduce bias into your models which can result in overfitting.
→ For example, you must include feature selection within the inner-loop when you are using accuracy estimation methods such as cross-validation. This means that feature selection is performed on the prepared fold right before the model is trained. A mistake would be to perform feature selection first to prepare your data, then perform model selection and training on the selected features.
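
A minimal sketch of feature selection inside the cross-validation loop. Everything here is illustrative: the data is made up, the selector is a simple filter method (highest absolute correlation with the label), and the model is a one-feature line fit. The point is only where the selection call sits: inside the loop, using the training fold alone.

```python
def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5 if vx and vy else 0.0


def select_feature(rows, labels):
    """Filter method: index of the feature most correlated with the label."""
    return max(range(len(rows[0])),
               key=lambda j: abs(pearson([r[j] for r in rows], labels)))


def fit_line(xs, ys):
    """Least-squares fit of y = m*x + c."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    m = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return m, my - m * mx


def cross_validate(rows, labels, k):
    """Mean held-out squared error over k folds."""
    fold_errors = []
    for fold in range(k):
        test_idx = [i for i in range(len(rows)) if i % k == fold]
        train_idx = [i for i in range(len(rows)) if i % k != fold]
        train_rows = [rows[i] for i in train_idx]
        train_labels = [labels[i] for i in train_idx]
        # Feature selection happens INSIDE the loop, on the training fold only.
        j = select_feature(train_rows, train_labels)
        m, c = fit_line([r[j] for r in train_rows], train_labels)
        errors = [(m * rows[i][j] + c - labels[i]) ** 2 for i in test_idx]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / k


# Made-up data: feature 0 drives the label, feature 1 is unrelated.
rows = [(0, 5), (1, 3), (2, 8), (3, 1), (4, 9), (5, 2), (6, 7), (7, 4), (8, 6)]
labels = [3 * x + 1 for x, _ in rows]

cv_error = cross_validate(rows, labels, k=3)
print(cv_error)
```

Selecting the feature once on the full dataset before the loop would let information from the held-out rows influence the choice, which is the mistake the slide warns about.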

Feature Selection Checklist
1. Do you have domain knowledge? If yes, construct a better set of “ad hoc” features.
2. Are your features commensurate? If no, consider normalizing them.
3. Do you suspect interdependence of features? If yes, expand your feature set by constructing conjunctive features or products of features, as much as your computer resources allow you.
4. Do you need to prune the input variables (e.g. for cost, speed or data understanding reasons)? If no, construct disjunctive features or weighted sums of features.
5. Do you need to assess features individually (e.g. to understand their influence on the system or because their number is so large that you need to do a first filtering)? If yes, use a variable ranking method; else, do it anyway to get baseline results.
6. Do you need a predictor? If no, stop.
7. Do you suspect your data is “dirty” (has a few meaningless input patterns and/or noisy outputs or wrong class labels)? If yes, detect the outlier examples using the top ranking variables obtained in step 5 as representation; check and/or discard them.
8. Do you know what to try first? If no, use a linear predictor. Use a forward selection method with the “probe” method as a stopping criterion or use the 0-norm embedded method. For comparison, following the ranking of step 5, construct a sequence of predictors of same nature using increasing subsets of features. Can you match or improve performance with a smaller subset? If yes, try a non-linear predictor with that subset.
9. Do you have new ideas, time, computational resources, and enough examples? If yes, compare several feature selection methods, including your new idea, correlation coefficients, backward selection and embedded methods. Use linear and non-linear predictors. Select the best approach with model selection.
10. Do you want a stable solution (to improve performance and/or understanding)? If yes, subsample your data and redo your analysis for several “bootstraps”.

AI Investment Landscape

Only $100k investment needed to start with AI

Report from IDC Analyst firm
Spending on AI
• $12.5B in 2017
• $4.5B on apps for threat detection, fraud analysis, public safety, and pharmaceutical research
• $46B+ by 2020

Closing thoughts on AI

The Future of Cognitive Computing / MI
– Machine
• Deep Learning
• Discovery
• Large-scale math
• Fact checking
– Human
• Compassion
• Intuition
• Design
• Value judgements
• Common Sense

What’s new in HDP 2.6 – Spark & Zeppelin
Spark
→ Spark 1.6.3 GA
→ Spark 2.1 GA
→ REST API (Livy) GA
→ Spark Thrift Server doAs GA
→ SparkSQL – Row/Column Security (GA)
→ Spark Streaming + Kafka over SSL
→ Multi Cluster HBase support for SHC
→ Package support in PySpark & SparkR
Zeppelin 0.7.x
→ Spark 2.x support
→ Improved Livy integration
→ No passwords in clear text
→ JDBC interpreter improvements
→ SmartSense integration
→ Knox proxy Zeppelin UI

Thanks!
Robert Hryniewicz
@RobertH8z
rhryniewicz@hortonworks.com

Contenu connexe

Tendances

Zementis hortonworks-webinar-2014-09
Zementis hortonworks-webinar-2014-09Zementis hortonworks-webinar-2014-09
Zementis hortonworks-webinar-2014-09Hortonworks
 
HPE and Hortonworks join forces to Deliver Healthcare Transformation
HPE and Hortonworks join forces to Deliver Healthcare TransformationHPE and Hortonworks join forces to Deliver Healthcare Transformation
HPE and Hortonworks join forces to Deliver Healthcare TransformationHortonworks
 
Social Media Monitoring with NiFi, Druid and Superset
Social Media Monitoring with NiFi, Druid and SupersetSocial Media Monitoring with NiFi, Druid and Superset
Social Media Monitoring with NiFi, Druid and SupersetThiago Santiago
 
Real-time Analytics in Financial
Real-time Analytics in FinancialReal-time Analytics in Financial
Real-time Analytics in FinancialYifeng Jiang
 
Hortonworks - IBM Cognitive - The Future of Data Science
Hortonworks - IBM Cognitive - The Future of Data ScienceHortonworks - IBM Cognitive - The Future of Data Science
Hortonworks - IBM Cognitive - The Future of Data ScienceThiago Santiago
 
Make Streaming IoT Analytics Work for You
Make Streaming IoT Analytics Work for YouMake Streaming IoT Analytics Work for You
Make Streaming IoT Analytics Work for YouHortonworks
 
Data science workshop
Data science workshopData science workshop
Data science workshopHortonworks
 
The Implacable advance of the data
The Implacable advance of the dataThe Implacable advance of the data
The Implacable advance of the dataDataWorks Summit
 
Global Data Management – a practical framework to rethinking enterprise, oper...
Global Data Management – a practical framework to rethinking enterprise, oper...Global Data Management – a practical framework to rethinking enterprise, oper...
Global Data Management – a practical framework to rethinking enterprise, oper...DataWorks Summit
 
Johns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log EventsJohns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log EventsHortonworks
 
Spark and Hadoop Perfect Togeher by Arun Murthy
Spark and Hadoop Perfect Togeher by Arun MurthySpark and Hadoop Perfect Togeher by Arun Murthy
Spark and Hadoop Perfect Togeher by Arun MurthySpark Summit
 
Connecting Home/Building, Life and Car..The Importance of Insurance Risk Moni...
Connecting Home/Building, Life and Car..The Importance of Insurance Risk Moni...Connecting Home/Building, Life and Car..The Importance of Insurance Risk Moni...
Connecting Home/Building, Life and Car..The Importance of Insurance Risk Moni...DataWorks Summit
 
Enterprise Data Science at Scale Meetup - IBM and Hortonworks - Oct 2017
Enterprise Data Science at Scale Meetup - IBM and Hortonworks - Oct 2017 Enterprise Data Science at Scale Meetup - IBM and Hortonworks - Oct 2017
Enterprise Data Science at Scale Meetup - IBM and Hortonworks - Oct 2017 Hortonworks
 
A Tale of Two Regulations: Cross-Border Data Protection For Big Data Under GD...
A Tale of Two Regulations: Cross-Border Data Protection For Big Data Under GD...A Tale of Two Regulations: Cross-Border Data Protection For Big Data Under GD...
A Tale of Two Regulations: Cross-Border Data Protection For Big Data Under GD...DataWorks Summit/Hadoop Summit
 
Near Real-time Outlier Detection and Interpretation - Part 1 by Robert Thorma...
Near Real-time Outlier Detection and Interpretation - Part 1 by Robert Thorma...Near Real-time Outlier Detection and Interpretation - Part 1 by Robert Thorma...
Near Real-time Outlier Detection and Interpretation - Part 1 by Robert Thorma...DataWorks Summit/Hadoop Summit
 
Accelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at ScaleAccelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at ScaleHortonworks
 
Webinar turbo charging_data_science_hawq_on_hdp_final
Webinar turbo charging_data_science_hawq_on_hdp_finalWebinar turbo charging_data_science_hawq_on_hdp_final
Webinar turbo charging_data_science_hawq_on_hdp_finalHortonworks
 
Hortonworks, Novetta and Noble Energy Webinar
Hortonworks, Novetta and Noble Energy Webinar Hortonworks, Novetta and Noble Energy Webinar
Hortonworks, Novetta and Noble Energy Webinar Hortonworks
 
3 CTOs Discuss the Shift to Next-Gen Analytic Ecosystems
3 CTOs Discuss the Shift to Next-Gen Analytic Ecosystems3 CTOs Discuss the Shift to Next-Gen Analytic Ecosystems
3 CTOs Discuss the Shift to Next-Gen Analytic EcosystemsHortonworks
 

Tendances (20)

Zementis hortonworks-webinar-2014-09
Zementis hortonworks-webinar-2014-09Zementis hortonworks-webinar-2014-09
Zementis hortonworks-webinar-2014-09
 
HPE and Hortonworks join forces to Deliver Healthcare Transformation
HPE and Hortonworks join forces to Deliver Healthcare TransformationHPE and Hortonworks join forces to Deliver Healthcare Transformation
HPE and Hortonworks join forces to Deliver Healthcare Transformation
 
Social Media Monitoring with NiFi, Druid and Superset
Social Media Monitoring with NiFi, Druid and SupersetSocial Media Monitoring with NiFi, Druid and Superset
Social Media Monitoring with NiFi, Druid and Superset
 
Real-time Analytics in Financial
Real-time Analytics in FinancialReal-time Analytics in Financial
Real-time Analytics in Financial
 
Hortonworks - IBM Cognitive - The Future of Data Science
Hortonworks - IBM Cognitive - The Future of Data ScienceHortonworks - IBM Cognitive - The Future of Data Science
Hortonworks - IBM Cognitive - The Future of Data Science
 
Make Streaming IoT Analytics Work for You
Make Streaming IoT Analytics Work for YouMake Streaming IoT Analytics Work for You
Make Streaming IoT Analytics Work for You
 
Data science workshop
Data science workshopData science workshop
Data science workshop
 
The Implacable advance of the data
The Implacable advance of the dataThe Implacable advance of the data
The Implacable advance of the data
 
Global Data Management – a practical framework to rethinking enterprise, oper...
Global Data Management – a practical framework to rethinking enterprise, oper...Global Data Management – a practical framework to rethinking enterprise, oper...
Global Data Management – a practical framework to rethinking enterprise, oper...
 
Johns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log EventsJohns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log Events
 
Spark and Hadoop Perfect Togeher by Arun Murthy
Spark and Hadoop Perfect Togeher by Arun MurthySpark and Hadoop Perfect Togeher by Arun Murthy
Spark and Hadoop Perfect Togeher by Arun Murthy
 
Connecting Home/Building, Life and Car..The Importance of Insurance Risk Moni...
Connecting Home/Building, Life and Car..The Importance of Insurance Risk Moni...Connecting Home/Building, Life and Car..The Importance of Insurance Risk Moni...
Connecting Home/Building, Life and Car..The Importance of Insurance Risk Moni...
 
Enterprise Data Science at Scale Meetup - IBM and Hortonworks - Oct 2017
Enterprise Data Science at Scale Meetup - IBM and Hortonworks - Oct 2017 Enterprise Data Science at Scale Meetup - IBM and Hortonworks - Oct 2017
Enterprise Data Science at Scale Meetup - IBM and Hortonworks - Oct 2017
 
Hybrid Cloud Strategy for Big Data and Analytics
Hybrid Cloud Strategy for Big Data and Analytics Hybrid Cloud Strategy for Big Data and Analytics
Hybrid Cloud Strategy for Big Data and Analytics
 
A Tale of Two Regulations: Cross-Border Data Protection For Big Data Under GD...
A Tale of Two Regulations: Cross-Border Data Protection For Big Data Under GD...A Tale of Two Regulations: Cross-Border Data Protection For Big Data Under GD...
A Tale of Two Regulations: Cross-Border Data Protection For Big Data Under GD...
 
Near Real-time Outlier Detection and Interpretation - Part 1 by Robert Thorma...
Near Real-time Outlier Detection and Interpretation - Part 1 by Robert Thorma...Near Real-time Outlier Detection and Interpretation - Part 1 by Robert Thorma...
Near Real-time Outlier Detection and Interpretation - Part 1 by Robert Thorma...
 
Accelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at ScaleAccelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at Scale
 
Webinar turbo charging_data_science_hawq_on_hdp_final
Webinar turbo charging_data_science_hawq_on_hdp_finalWebinar turbo charging_data_science_hawq_on_hdp_final
Webinar turbo charging_data_science_hawq_on_hdp_final
 
Hortonworks, Novetta and Noble Energy Webinar
Hortonworks, Novetta and Noble Energy Webinar Hortonworks, Novetta and Noble Energy Webinar
Hortonworks, Novetta and Noble Energy Webinar
 
3 CTOs Discuss the Shift to Next-Gen Analytic Ecosystems
3 CTOs Discuss the Shift to Next-Gen Analytic Ecosystems3 CTOs Discuss the Shift to Next-Gen Analytic Ecosystems
3 CTOs Discuss the Shift to Next-Gen Analytic Ecosystems
 
