SlideShare une entreprise Scribd logo
1  sur  106
Télécharger pour lire hors ligne
 Big	
  Data	
  
	
  Ins+tuto	
  Superior	
  Técnico	
  
	
  	
   	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  Technical	
  University	
  of	
  Lisbon	
  
Bruno	
  Lopes	
  
Catarina	
  Moreira	
  
João	
  Pinho	
  
Mo#va#on	
  
2	
  220	
  
2	
  
PetaBytes	
  	
  
Of	
  data	
  that	
  people	
  create	
  every	
  day!	
  
Mo#va#on	
  
3	
  
90	
  %	
  of	
  Data	
  
UNSTRUCTURED	
  
Which	
  is	
  difficult	
  to	
  manage	
  and	
  interpret!	
  
Mo#va#on	
  
4	
  
2	
  hours	
  	
  
per	
  day	
  	
  
Spent	
  searching	
  for	
  the	
  right	
  informa#on	
  
Mo#va#on	
  
5	
  
$5	
  million	
  
	
  per	
  year	
  
Money	
  lost	
  due	
  to	
  data-­‐related	
  problems	
  
Mo#va#on	
  
6	
  
1	
  in	
  3	
  
Business	
  leaders	
  make	
  decisions	
  
based	
  on	
  informa#on	
  they	
  don’t	
  	
  
trust	
  or	
  don’t	
  have	
  
1	
  in	
  2	
  
Business	
  leaders	
  say	
  they	
  don’t	
  
have	
  access	
  to	
  the	
  informa#on	
  
they	
  need	
  to	
  do	
  their	
  jobs	
  
60%	
  
CEO’s	
  need	
  to	
  do	
  a	
  beQer	
  job	
  capturing	
  	
  
and	
  understanding	
  informa+on	
  rapidly	
  in	
  	
  
order	
  to	
  make	
  swiS	
  business	
  decisions	
  
Of	
  world’s	
  data	
  	
  
is	
  unstructured	
  80	
  %	
  
Organiza+ons	
  Need	
  Deeper	
  
Insights	
  …	
  
Informa+on	
  is	
  at	
  the	
  Center	
  
of	
  a	
  New	
  Wave	
  of	
  Opportunity	
  
Source:	
  Big	
  Data	
  by	
  Ami	
  Redwan	
  Haq,	
  Founder	
  at	
  	
  	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  Sen:nel	
  Solu:ons	
  Ltd	
  	
  
44	
  %	
  
as	
  much	
  Data	
  and	
  Content	
  
Over	
  the	
  Coming	
  Decade	
  
	
  	
  	
  	
  	
  	
  	
  2009	
  
800	
  000	
  PetaBytes	
  
	
  	
  	
  	
  	
  	
  	
  2020	
  
35	
  ZeQaBytes	
  
What	
  is	
  Big	
  Data?	
  
7	
  
Big	
  Data	
  (video)	
  
What	
  is	
  Big	
  Data?	
  
Data	
  that	
  exceeds	
  the	
  processing	
  
capacity	
  of	
  a	
  conven+onal	
  database.	
  
The	
  data	
  is	
  too	
  big,	
  moves	
  too	
  fast	
  or	
  
doesn’t	
  fit	
  the	
  structures	
  of	
  your	
  
database	
  architectures.	
  
8	
  Source:	
  J.	
  Hurwitz,	
  A.	
  Nugent,	
  F.	
  Halper	
  &	
  M.	
  Kaufman,	
  Big	
  Data	
  for	
  Dummies,	
  Willey	
  &	
  Sons,	
  2013	
  
Why	
  Big	
  Data?	
  
Big	
  data	
  enables	
  organiza+ons	
  to	
  
store,	
  manage	
  and	
  manipulate	
  vast	
  
amounts	
  of	
  data	
  at	
  the	
  right	
  speed	
  
and	
  at	
  the	
  right	
  #me	
  to	
  gain	
  the	
  right	
  
insights.	
  
9	
  Source:	
  J.	
  Hurwitz,	
  A.	
  Nugent,	
  F.	
  Halper	
  &	
  M.	
  Kaufman,	
  Big	
  Data	
  for	
  Dummies,	
  Willey	
  &	
  Sons,	
  2013	
  
Why	
  Big	
  Data?	
  
•  The	
  value	
  of	
  Big	
  Data	
  falls	
  into	
  2	
  categories:	
  
1.  Analy#cal	
  Use.	
  	
  
•  Reveal	
  new	
  insights	
  that	
  were	
  hidden	
  in	
  too	
  costly	
  to	
  process	
  
massive	
  amounts	
  of	
  data;	
  
•  Example:	
  Peer	
  influence	
  among	
  customers;	
  
	
  
2.  Enabling	
  New	
  Products.	
  
•  Combining	
  large	
  number	
  of	
  signals	
  from	
  user’s	
  ac+ons	
  and	
  friends	
  
•  Example:	
  Facebook	
  
10	
  
The	
  V’s	
  
Big	
  Data	
  is	
  defined	
  as	
  any	
  kind	
  of	
  data	
  source	
  that	
  has	
  
at	
  least	
  three	
  shared	
  characteris+cs:	
  
	
  
	
  
11	
  
Velocity	
   Variety	
  
Volume	
  
Viability	
   Value	
  
Velocity	
  
	
  
Increasing	
  rate	
  at	
  
which	
  data	
  flows	
  into	
  
organiza#ons	
  
12	
  
Velocity	
  
The	
  Large	
  Hadron	
  Collider	
  	
  at	
  CERN	
  generates	
  
	
  
13	
  
Scien+sts	
  are	
  forced	
  to	
  
25	
  PetaBytes	
  per	
  Year!	
  
Discard	
  Massive	
  Data	
  
In	
  order	
  to	
  keep	
   requirements	
  prac+cal	
  storage	
  
Velocity	
  
The	
  way	
  we	
  deliver	
  and	
  consume	
  products	
  and	
  services	
  
is	
  genera+ng	
  a	
  massive	
  data	
  flow	
  to	
  the	
  provider	
  
•  It’s	
  possible	
  to	
  stream	
  fast-­‐moving	
  data	
  into	
  bulk	
  
storage	
  for	
  later	
  batch	
  processing.	
  
•  The	
  importance	
  lies	
  in	
  taking	
  data	
  from	
  input	
  
through	
  decision.	
  	
  
14	
  
Volume	
  
Ability	
  to	
  store	
  
massive	
  amounts	
  of	
  
(un)structured	
  data	
  
15	
  
Volume	
  
16	
  
No	
  longer	
  a	
  problem!	
  
Volume	
  
17	
  
Don’t	
  forget	
  the	
  cloud!	
  
Volume	
  -­‐	
  Challenges	
  
•  How	
  to	
  determine	
  relevance	
  within	
  large	
  data	
  
volumes	
  	
  
•  How	
  to	
  use	
  analy+cs	
  to	
  create	
  value	
  from	
  
relevant	
  data.	
  
18	
  
Variety	
  
Managing,	
  merging	
  
and	
  governing	
  
different	
  varie+es	
  of	
  
data.	
  
19	
  
Variety	
  
•  Source	
  data	
  is	
  diverse	
  and	
  doesn’t	
  fall	
  into	
  neat	
  
rela+onal	
  structures	
  
	
  
•  Data	
  comes	
  in	
  the	
  form	
  of	
  emails,	
  photos,	
  videos,	
  
monitoring	
  devices,	
  PDFs,	
  audio,	
  etc.	
  
•  This	
  variety	
  of	
  (un)structured	
  data	
  creates	
  problems	
  
for	
  storage,	
  mining	
  and	
  analyzing	
  data	
  
20	
  
Viability	
  
Quickly	
  and	
  cost-­‐effec#vely	
  
test	
  and	
  confirm	
  a	
  par+cular	
  
variable’s	
  relevance	
  before	
  
inves+ng	
  in	
  the	
  crea+on	
  of	
  a	
  
fully	
  featured	
  model.	
  
21	
  
Value	
  
Gain	
  new	
  insights	
  
about	
  data	
  by	
  
analyzing	
  variables	
  
that	
  were	
  hidden	
  
22	
  
Big	
  Data	
  
23	
  
	
  
So,	
  how	
  does	
  it	
  
work?	
  
The	
  Big	
  Data	
  Supply	
  Chain	
  
24	
  Source:	
  Edd	
  Dumbill,	
  Planning	
  for	
  Big	
  Data,	
  O’Reilly,	
  2012	
  
Summary	
  
•  Mo+va+on	
  of	
  why	
  companies	
  need	
  
to	
  learn	
  how	
  to	
  deal	
  with	
  Big	
  Data!	
  
	
  
•  Presented	
  a	
  defini+on	
  of	
  Big	
  Data	
  
and	
  analyzed	
  the	
  V’s.	
  
•  Presented	
  an	
  analysis	
  of	
  the	
  Big	
  Data	
  
Supply	
  chain	
  for	
  enterprises.	
  
25	
  
26	
  
Technologies	
  for	
  Big	
  Data	
  
The	
  Reach	
  of	
  Big	
  Data	
  
•  So	
  far,	
  the	
  analy+cal	
  use	
  of	
  Big	
  Data	
  has	
  been	
  in	
  
reach	
  only	
  to	
  leading	
  coopera+ons:	
  
	
  
27	
  
But	
  at	
  a	
  fantas+c	
  cost!	
  
The	
  Reach	
  of	
  Big	
  Data	
  
But	
  now,	
  there	
  are	
  several	
  open	
  source	
  
applica+ons	
  than	
  can	
  tackle	
  Big	
  Data	
  
related	
  challenges!	
  
28	
  
Apache	
  Hadoop	
  
•  Developed	
  and	
  released	
  by	
  Yahoo!	
  
•  Open	
  source	
  framework	
  for	
  storage	
  
and	
  large	
  scale	
  parallel	
  processing	
  of	
  
datasets	
  on	
  clusters.	
  
•  Ability	
  to	
  cheaply	
  process	
  large	
  
amounts	
  of	
  data,	
  regardless	
  of	
  its	
  
structure.	
  
29	
  
Apache	
  Hadoop	
  
30	
  
Apache	
  Hadoop	
  
31	
  
(Video	
  -­‐	
  MapReduce)	
  
•  IBM	
  –	
  MapReduce	
  
	
  
•  Health	
  Care	
  -­‐	
  Real	
  Time	
  Alerts	
  
32	
  
HDFS	
  –	
  Hadoop	
  Distributed	
  File	
  System	
  
33	
  
34	
  
•  Detec+on	
  of	
  faults	
  and	
  quick,	
  automa+c	
  recovery	
  	
  
•  Emphasis	
  on	
  high	
  throughput	
  of	
  data	
  access	
  
•  Tuned	
  to	
  support	
  large	
  datasets	
  
•  Write-­‐once-­‐read-­‐many	
  access	
  model:	
  once	
  a	
  file	
  is	
  
created,	
  wriQen	
  and	
  closed,	
  needs	
  not	
  to	
  be	
  changed	
  
HDFS	
  –	
  Hadoop	
  Distributed	
  File	
  System	
  
35	
  
•  Moving	
  computa+on	
  is	
  cheaper	
  than	
  moving	
  data	
  
–  minimizes	
  network	
  conges+on	
  
–  Increases	
  the	
  overall	
  throughput	
  of	
  the	
  system	
  
•  Portability	
  Across	
  Heterogeneous	
  Hardware	
  and	
  
SoSware	
  Plaqorms	
  
HDFS	
  –	
  Hadoop	
  Distributed	
  File	
  System	
  
HDFS	
  –	
  Hadoop	
  Distributed	
  File	
  System	
  
36	
  
The	
  Core	
  of	
  Hadoop	
  -­‐	
  MapReduce	
  
37	
  
•  Created	
  by	
  Google	
  for	
  web	
  search	
  indexes	
  
•  Ability	
  to	
  take	
  a	
  query	
  over	
  a	
  dataset,	
  divide	
  it	
  and	
  
run	
  it	
  in	
  parallel	
  over	
  mul+ple	
  nodes	
  
•  Consists	
  in	
  three	
  stages:	
  
–  Map	
  
–  Shuffle	
  
–  Reduce	
  
The	
  Core	
  of	
  Hadoop	
  -­‐	
  MapReduce	
  
38	
  
Input	
   Output	
  Reduce	
  Shuffle	
  Map	
  
The	
  Core	
  of	
  Hadoop	
  -­‐	
  MapReduce	
  
39	
  
Input	
   Output	
  Reduce	
  Reduce	
   Shuffle	
  Map	
  
The	
  Core	
  of	
  Hadoop	
  -­‐	
  MapReduce	
  
40	
  
•  Imagine	
  that	
  you	
  want	
  to	
  apply	
  MapReduce	
  to	
  
word	
  coun+ng	
  in	
  books	
  
	
  
The	
  Core	
  of	
  Hadoop	
  -­‐	
  MapReduce	
  
41	
  
Input	
   Output	
  Reduce	
  Reduce	
   Shuffle	
  Map	
  
E	
  …	
  com	
  
a	
  mulher	
  
e	
  o	
  filho…	
  
The	
  Core	
  of	
  Hadoop	
  -­‐	
  MapReduce	
  
42	
  
Input	
   Output	
  Reduce	
  Reduce	
   Shuffle	
  Map	
  
E	
  …	
  com	
  
a	
  mulher	
  
e	
  o	
  filho…	
  
(	
  1,	
  E)	
  
(2,	
  com)	
  
(3,	
  a	
  )	
  
(4,	
  mulher)	
  
(5,	
  e	
  )	
  
(6,	
  o)	
  
(7,	
  filho)	
  
	
  
The	
  Core	
  of	
  Hadoop	
  -­‐	
  MapReduce	
  
43	
  
Input	
   Output	
  Reduce	
  Reduce	
   Shuffle	
  Map	
  
e	
  …	
  com	
  
a	
  mulher	
  
e	
  o	
  filho…	
  
(	
  1,	
  e)	
  
(2,	
  com)	
  
(3,	
  a	
  )	
  
(4,	
  mulher)	
  
(5,	
  e	
  )	
  
(6,	
  o)	
  
(7,	
  filho)	
  
	
  
(1,	
  e)	
  -­‐>	
  2	
  
(2,	
  com)	
  -­‐>	
  1	
  
(3,	
  a	
  )	
  -­‐>	
  1	
  
(4,	
  mulher)	
  -­‐>	
  1	
  
(5,	
  o)	
  -­‐>	
  1	
  
(6,	
  filho)	
  -­‐>	
  1	
  
	
  
The	
  Core	
  of	
  Hadoop	
  -­‐	
  MapReduce	
  
44	
  
Input	
   Output	
  Reduce	
  Reduce	
   Shuffle	
  Map	
  
e	
  …	
  com	
  
a	
  mulher	
  
e	
  o	
  filho…	
  
(	
  1,	
  e)	
  
(3,	
  com)	
  
(2,	
  a	
  )	
  
(6,	
  mulher)	
  
(5,	
  e	
  )	
  
(7,	
  o)	
  
(4,	
  filho)	
  
	
  
(1,	
  e)	
  -­‐>	
  2	
  
(3,	
  com)	
  -­‐>	
  1	
  
(2,	
  a	
  )	
  -­‐>	
  1	
  
(6,	
  mulher)	
  -­‐>	
  1	
  
(7,	
  o)	
  -­‐>	
  1	
  
(4,	
  filho)	
  -­‐>	
  1	
  
	
  
e	
  [4,2,6]	
  
a	
  [3,1,2]	
  
com	
  [5,1,2]	
  
filho	
  [2,1,1]	
  
mulher	
  [3,1,10]	
  
O	
  [7,1,9]	
  
	
  
The	
  Core	
  of	
  Hadoop	
  -­‐	
  MapReduce	
  
45	
  
Input	
   Output	
  Reduce	
  Reduce	
   Shuffle	
  Map	
  
e	
  …	
  com	
  
a	
  mulher	
  
e	
  o	
  filho…	
  
(	
  1,	
  e)	
  
(3,	
  com)	
  
(2,	
  a	
  )	
  
(6,	
  mulher)	
  
(5,	
  e	
  )	
  
(7,	
  o)	
  
(4,	
  filho)	
  
	
  
(1,	
  e)	
  -­‐>	
  2	
  
(3,	
  com)	
  -­‐>	
  1	
  
(2,	
  a	
  )	
  -­‐>	
  1	
  
(6,	
  mulher)	
  -­‐>	
  1	
  
(7,	
  o)	
  -­‐>	
  1	
  
(4,	
  filho)	
  -­‐>	
  1	
  
e	
  [4,2,6]	
  
a	
  [3,1,2]	
  
com	
  [5,1,2]	
  
filho	
  [2,1,1]	
  
mulher	
  [3,1,10]	
  
O	
  [7,1,9]	
  
e	
  -­‐>	
  12	
  
a	
  -­‐>	
  6	
  
com	
  -­‐>	
  8	
  
filho	
  -­‐>	
  4	
  
mulher	
  -­‐>	
  14	
  
o	
  -­‐>	
  17	
  
Who	
  Uses	
  Hadoop?	
  
46	
  
The	
  Social	
  Genome	
  product:	
  	
  
	
  
•  Reach	
  customers,	
  or	
  friends	
  of	
  customers,	
  
who	
  have	
  men+oned	
  product	
  and	
  include	
  a	
  
discount	
  
•  Combines	
  public	
  data	
  from	
  the	
  web,	
  social	
  
data	
  and	
  proprietary	
  data	
  
47	
  
Walmart	
  (Real	
  Time)	
  
The	
  Social	
  Genome	
  product:	
  	
  
	
  
•  Helps	
  Walmart	
  to	
  understand	
  the	
  context	
  of	
  
what	
  their	
  customers	
  are	
  saying	
  online	
  
	
  
•  When	
  a	
  person	
  tweets	
  I	
  love	
  Salt,	
  Walmart	
  
can	
  understand	
  that	
  she	
  is	
  talking	
  about	
  
the	
  movie	
  Salt	
  and	
  not	
  the	
  condiment.	
  
48	
  
Walmart	
  (Real	
  Time)	
  
Apache	
  Hive	
  
49	
  
•  Data	
  warehouse	
  soSware	
  that	
  facilitates	
  
querying	
  and	
  managing	
  large	
  datasets	
  
residing	
  in	
  a	
  distributed	
  storage	
  
•  Uses	
  HiveSQL	
  to	
  analyze	
  data	
  
•  Allows	
  programmers	
  to	
  use	
  their	
  own	
  map	
  
reduce	
  func+ons	
  
	
  
Apache	
  Cassandra	
  
50	
  
•  Database	
  with	
  high	
  scalability	
  and	
  high	
  
availability	
  without	
  compromising	
  
performance.	
  
•  Offers	
  
– Denormaliza+on	
  
– Materialized	
  views	
  
– Powerful	
  built	
  in	
  cache	
  
Google	
  Big	
  Query	
  
51	
  
Built	
  on	
  GoogleFS	
  
Summary	
  
	
  
52	
  
•  Apache	
  Hadoop	
  techhnology	
  
–  HDFS	
  file	
  system	
  
–  MapReduce	
  paradigm	
  
•  Apache	
  Hive	
  
•  Apache	
  Cassandra	
  
•  Google	
  BigQuery	
  
Project	
  HEIDI	
  	
  
54	
  
How	
  to	
  index	
  high	
  dimensional	
  data?	
  
Project	
  HEIDI	
  	
  
55	
  
•  Mul+media	
  datasets	
  are	
  growing!	
  
–  Ex:	
  Audio,	
  video,	
  medical	
  images,	
  photos,	
  etc.	
  
•  We	
  need	
  techniques	
  to	
  efficiently	
  manage	
  and	
  
access	
  informa+on	
  in	
  such	
  large	
  datasets!	
  
•  Mul+media	
  datasets	
  cannot	
  be	
  searched	
  like	
  
tradi+onal	
  databases	
  (e.g.	
  Text	
  search).	
  They	
  are	
  
searched	
  in	
  metric	
  spaces.	
  
Project	
  HEIDI	
  
Given	
  a	
  mul#media	
  
object	
  as	
  input,	
  return	
  
the	
  most	
  similar	
  objects	
  
from	
  a	
  mul+media	
  
database	
  
56	
  
Project	
  HEIDI	
  	
  
57	
  
•  In	
  content-­‐based	
  image	
  retrieval,	
  images	
  are	
  
described	
  by	
  their	
  proper+es:	
  
•  Colors	
  
•  Texture	
  
•  Shape	
  
•  Shadows	
  
•  etc	
  
Project	
  HEIDI	
  	
  
58	
  
•  The	
  process	
  of	
  conver+ng	
  a	
  mul+media	
  object	
  
into	
  a	
  vector	
  with	
  its	
  contents/features	
  is	
  
called	
  feature	
  extrac,on.	
  
59	
  
Project	
  HEIDI	
  	
  
60	
  
The	
  problem	
  in	
  high	
  dimensional	
  indexing	
  is	
  the	
  
Project	
  HEIDI	
  	
  
61	
  
The	
  problem	
  in	
  high	
  dimensional	
  indexing	
  is	
  the	
  
OF	
  DIMENSIONALITY!!!	
  
The	
  Curse	
  of	
  Dimensionality	
  
62	
  
•  A	
  phenomena	
  that	
  arises	
  when	
  analyzing	
  
and	
  organizing	
  data	
  in	
  high	
  dimensional	
  
spaces.	
  
•  It	
  does	
  not	
  occur	
  in	
  lower	
  dimensions!	
  
The	
  Curse	
  of	
  Dimensionality	
  
63	
  
•  When	
  dimensionality	
  increases,	
  the	
  
volume	
  of	
  the	
  space	
  increases	
  so	
  fast	
  
that	
  the	
  data	
  become	
  sparse.	
  
•  Tradi+onal	
  tree	
  indexing	
  structures	
  do	
  
not	
  provide	
  any	
  advantages	
  in	
  the	
  index	
  
process	
  (outperformed	
  by	
  linear	
  scan)…	
  
The	
  Curse	
  of	
  Dimensionality	
  
64	
  
•  	
  Consequences:	
  	
  
– (hyper)	
  cubes	
  
– (hyper)	
  spheres	
  
	
  
•  Volume	
  increases	
  exponen#ally	
  with	
  growing	
  
dimension	
  (while	
  the	
  edge	
  length	
  remains	
  
constant).	
  
The	
  Curse	
  of	
  Dimensionality	
  
65	
  
	
  
How	
  can	
  this	
  curse	
  
affect	
  datasets	
  in	
  
metric	
  spaces?	
  
The	
  Curse	
  of	
  Dimensionality	
  
66	
  
•  	
  In	
  metric	
  spaces,	
  there	
  is	
  the	
  
Nearest	
  Neighbor	
  paradigm.	
  
•  The	
  NN-­‐sphere	
  contains	
  one	
  
point.	
  
•  	
  For	
  higher	
  dimensions,	
  the	
  radius	
  of	
  the	
  
NN-­‐sphere	
  will	
  be	
  larger	
  than	
  the	
  size	
  of	
  
the	
  database.	
  
The	
  Curse	
  of	
  Dimensionality	
  
67	
  
Literature	
  addresses	
  this	
  issue	
  in	
  two	
  ways:	
  
	
  
1.  Approximate	
  Nearest	
  Search	
  
2.  Reduce	
  the	
  dimensionality	
  of	
  the	
  mul+media	
  
objects	
  
Project	
  HEIDI	
  
68	
  
The	
  algorithm	
  that	
  is	
  proposed	
  in	
  this	
  project	
  does	
  
not	
  suffer	
  from	
  the	
  curse	
  of	
  dimensionality!	
  
Project	
  HEIDI	
  
69	
  
The	
  Linear	
  SubSpace	
  Indexing	
  Algorithm	
  
	
  
•  Based	
  on	
  the	
  idea	
  of	
  subspaces!	
  
– Feature	
  extrac+on	
  func+on	
  that	
  maps	
  the	
  high	
  
dimensional	
  objects	
  into	
  low	
  dimensional	
  
equivalent	
  objects.	
  
Project	
  HEIDI	
  
70	
  
The	
  Quick	
  and	
  Dirty	
  Paradigm	
  
	
  
•  Discard	
  data	
  objects	
  in	
  lower	
  dimensional	
  spaces	
  
•  Massive	
  comparison	
  opera+ons	
  are	
  less	
  
expensive	
  in	
  lower	
  dimensions	
  
Project	
  HEIDI	
  
71	
  
	
  
However,	
  this	
  approach	
  has	
  
Project	
  HEIDI	
  
72	
  
The	
  Quick	
  and	
  Dirty	
  Paradigm	
  
	
  
•  Can	
  produce	
  lots	
  of	
  false	
  hits!	
  
•  The	
  mapping	
  to	
  lower	
  dimensions	
  can	
  lead	
  to	
  
closer	
  data	
  points	
  
Project	
  HEIDI	
  
73	
  
Are	
  false	
  hits	
  really	
  a	
  	
  
	
  
	
  
Project	
  HEIDI	
  
74	
  
Project	
  HEIDI	
  
75	
  
•  Just	
  perform	
  a	
  final	
  comparison	
  in	
  the	
  original	
  
space!	
  
•  All	
  false	
  hits	
  will	
  be	
  discarded!	
  
•  The	
  computa+on	
  is	
  quick,	
  because	
  the	
  massive	
  
amount	
  of	
  objects	
  were	
  discarded	
  in	
  the	
  lower	
  
dimensions.	
  
Project	
  HEIDI	
  
76	
  
Project	
  HEIDI	
  
77	
  
	
  
But…	
  How	
  to	
  map	
  high	
  
dimensional	
  objects	
  into	
  
lower	
  dimensional	
  spaces?	
  
	
  
	
  
	
  
Project	
  HEIDI	
  
78	
  
The	
  Quick	
  and	
  Dirty	
  Paradigm	
  
	
  
•  Lower	
  bonding	
  lemma:	
  
– Distances	
  between	
  objects	
  in	
  the	
  feature	
  space	
  are	
  
always	
  smaller	
  or	
  equal	
  that	
  in	
  the	
  original	
  space	
  
	
  
	
  
Project	
  HEIDI	
  
79	
  
The	
  Quick	
  and	
  Dirty	
  Paradigm	
  
	
  
•  The	
  Principal	
  Component	
  Analysis	
  algorithm	
  
maps	
  objects	
  into	
  lower	
  dimensions	
  
•  Sa+sfies	
  the	
  lower	
  bounding	
  lemma	
  
	
  
	
  
Project	
  HEIDI	
  
80	
  
The	
  Principal	
  Component	
  Analysis	
  
	
  
	
  
Project	
  HEIDI	
  
81	
  
The	
  Principal	
  Component	
  Analysis	
  
	
  
1.  Compute	
  Co-­‐Variance	
  Matrix	
  
	
  
	
  
Project	
  HEIDI	
  
82	
  
The	
  Principal	
  Component	
  Analysis	
  
	
  
2.  Compute	
   eigen	
   vectors	
   and	
   eigen	
   values	
   from	
  
co-­‐variance	
  matrix	
  
3.  Get	
   the	
   most	
   significant	
   eigen	
   vectors	
   and	
  
create	
  a	
  projec+on	
  matrix	
  
4.  Project	
  data	
  
	
  
Project	
  HEIDI	
  
83	
  
The	
  Hierarchical	
  Linear	
  SubSpace	
  Indexing	
  Method	
  
	
  
1.  Index	
  Phase	
  (off-­‐line)	
  
– Par++on	
  large	
  dataset	
  into	
  chunks	
  of	
  data	
  
– Itera+vely	
  apply	
  the	
  Principal	
  Component	
  Analysis	
  to	
  
project	
  the	
  dataset	
  into	
  lower	
  dimensions	
  
– This	
  ac+on	
  can	
  be	
  parallelized	
  in	
  many	
  servers 	
  	
  
	
  
Project	
  HEIDI	
  
84	
  
The	
  Hierarchical	
  Linear	
  SubSpace	
  Indexing	
  Method	
  
Project	
  HEIDI	
  
85	
  
The	
  Hierarchical	
  Linear	
  SubSpace	
  Indexing	
  Method	
  
Original	
  Space	
   Lower	
  Space	
  
Project	
  HEIDI	
  
86	
  
The	
  Hierarchical	
  Linear	
  SubSpace	
  Indexing	
  Method	
  
Original	
  Space	
   Lower	
  Space	
  
PCA	
  
Project	
  HEIDI	
  
87	
  
The	
  Hierarchical	
  Linear	
  SubSpace	
  Indexing	
  Method	
  
Original	
  Space	
   Lower	
  Space	
  
PCA	
  
Project	
  HEIDI	
  
88	
  
The	
  Hierarchical	
  Linear	
  SubSpace	
  Indexing	
  Method	
  
Original	
  Space	
   Lower	
  Space	
  
PCA	
  
PCA	
  
Project	
  HEIDI	
  
89	
  
The	
  Hierarchical	
  Linear	
  SubSpace	
  Indexing	
  Method	
  
Original	
  Space	
   Lower	
  Space	
  
PCA	
  
PCA	
  
Project	
  HEIDI	
  
90	
  
The	
  Hierarchical	
  Linear	
  SubSpace	
  Indexing	
  Method	
  
Original	
  Space	
   Lower	
  Space	
  
PCA	
  
PCA	
  
PCA	
  
Project	
  HEIDI	
  
91	
  
The	
  Hierarchical	
  Linear	
  SubSpace	
  Indexing	
  Method	
  
Original	
  Space	
   Lower	
  Space	
  
PCA	
  
PCA	
  
PCA	
  
Project	
  HEIDI	
  
92	
  
The	
  Hierarchical	
  Linear	
  SubSpace	
  Indexing	
  Method	
  
Original	
  Space	
   Lower	
  Space	
  
PCA	
  
PCA	
  
PCA	
  
PCA	
  
Project	
  HEIDI	
  
93	
  
The	
  Hierarchical	
  Linear	
  SubSpace	
  Indexing	
  Method	
  
Original	
  Space	
   Lower	
  Space	
  
PCA	
  
PCA	
  
PCA	
  
PCA	
  
Project	
  HEIDI	
  
94	
  
The	
  Hierarchical	
  Linear	
  SubSpace	
  Indexing	
  Method	
  
Original	
  Space	
   Lower	
  Space	
  
PCA	
  
PCA	
  
PCA	
  
PCA	
  
PCA	
  
Project	
  HEIDI	
  
95	
  
The	
  Hierarchical	
  Linear	
  SubSpace	
  Indexing	
  Method	
  
Original	
  Space	
   Lower	
  Space	
  
PCA	
  
PCA	
  
PCA	
  
PCA	
  
PCA	
  
Project	
  HEIDI	
  
96	
  
The	
  Hierarchical	
  Linear	
  SubSpace	
  Indexing	
  Method	
  
Original	
  Space	
   Lower	
  Space	
  
PCA	
  
PCA	
  
PCA	
  
PCA	
  
PCA	
  
PCA	
  
Project	
  HEIDI	
  
97	
  
The	
  Hierarchical	
  Linear	
  SubSpace	
  Indexing	
  Method	
  
	
  
Project	
  HEIDI	
  
98	
  
The	
  Hierarchical	
  Linear	
  SubSpace	
  Indexing	
  Method	
  
	
  
Project	
  HEIDI	
  
99	
  
The	
  Hierarchical	
  Linear	
  SubSpace	
  Indexing	
  Method	
  
	
  
Project	
  HEIDI	
  
100	
  
The	
  Lower	
  Bounding	
  Lemma	
  is	
  sa#sfied	
  
	
  
Euclidean	
  Distance	
  
Most	
  Similar	
  Images	
  Retrieved	
  
Project	
  HEIDI	
  
101	
  
The	
  Hierarchical	
  Linear	
  SubSpace	
  Indexing	
  Method	
  
	
  
Project	
  HEIDI	
  
102	
  
The	
  Hierarchical	
  Linear	
  SubSpace	
  Indexing	
  Method	
  
	
  
2.  Query	
  Phase	
  (on-­‐line)	
  
– Project	
  queries	
  using	
  projec+on	
  matrices	
  from	
  step	
  1	
  
– Star+ng	
  in	
  the	
  lowest	
  dimensional	
  space,	
  itera+vely	
  
discard	
  all	
  objects	
  that	
  are	
  different	
  from	
  the	
  query	
  
– Keep	
  doing	
  this	
  un+l	
  the	
  original	
  space	
  is	
  reached	
  and	
  
the	
  false	
  hits	
  are	
  discarded.	
  
	
  
Project	
  HEIDI	
  
103	
  
The	
  Hierarchical	
  Linear	
  SubSpace	
  Indexing	
  Method	
  
	
  
Summary	
  
•  Important	
  concepts	
  in	
  Big	
  Data	
  for	
  
High	
  Dimensional	
  Datasets	
  
•  The	
  curse	
  of	
  dimensionality	
  
•  Projec+on	
  techniques	
  
•  The	
  Lower	
  Bounding	
  Lemma	
  
•  The	
  Quick	
  and	
  Dirty	
  paradigm	
  (some	
  
rela+on	
  to	
  the	
  MapReduce	
  paradigm)	
  
	
   104	
  
Summary	
  
	
  
Mathema#cs	
  is	
  more	
  
important	
  than	
  
engineering!	
  
	
  
105	
  
Now…	
  
Time	
  to	
  See	
  This	
  Working	
  
for	
  REAL!!!	
  
106	
  

Contenu connexe

Tendances

Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataIMC Institute
 
Guest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of TechnologyGuest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of TechnologyNishant Gandhi
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataJoey Li
 
Big data analytics, survey r.nabati
Big data analytics, survey r.nabatiBig data analytics, survey r.nabati
Big data analytics, survey r.nabatinabati
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataKristof Jozsa
 
Big Data Course - BigData HUB
Big Data Course - BigData HUBBig Data Course - BigData HUB
Big Data Course - BigData HUBAhmed Salman
 
Introducing Technologies for Handling Big Data by Jaseela
Introducing Technologies for Handling Big Data by JaseelaIntroducing Technologies for Handling Big Data by Jaseela
Introducing Technologies for Handling Big Data by JaseelaStudent
 
Big Data & Hadoop Introduction
Big Data & Hadoop IntroductionBig Data & Hadoop Introduction
Big Data & Hadoop IntroductionJayant Mukherjee
 
Big data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureBig data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureRoman Nikitchenko
 
Big Data - A brief introduction
Big Data - A brief introductionBig Data - A brief introduction
Big Data - A brief introductionFrans van Noort
 
Big Data - An Overview
Big Data -  An OverviewBig Data -  An Overview
Big Data - An OverviewArvind Kalyan
 
Introduction of Big data and Hadoop
Introduction of Big data and Hadoop Introduction of Big data and Hadoop
Introduction of Big data and Hadoop Arohi Khandelwal
 
Big data-analytics-cpe8035
Big data-analytics-cpe8035Big data-analytics-cpe8035
Big data-analytics-cpe8035Neelam Rawat
 
Introduction to Big data with Hadoop & Spark | Big Data Hadoop Spark Tutorial...
Introduction to Big data with Hadoop & Spark | Big Data Hadoop Spark Tutorial...Introduction to Big data with Hadoop & Spark | Big Data Hadoop Spark Tutorial...
Introduction to Big data with Hadoop & Spark | Big Data Hadoop Spark Tutorial...CloudxLab
 

Tendances (20)

Big Data Hadoop Training by Easylearning Guru
Big Data Hadoop Training by Easylearning GuruBig Data Hadoop Training by Easylearning Guru
Big Data Hadoop Training by Easylearning Guru
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Guest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of TechnologyGuest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of Technology
 
BIG DATA
BIG DATABIG DATA
BIG DATA
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
A Big Data Concept
A Big Data ConceptA Big Data Concept
A Big Data Concept
 
Big data analytics, survey r.nabati
Big data analytics, survey r.nabatiBig data analytics, survey r.nabati
Big data analytics, survey r.nabati
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Big Data simplified
Big Data simplifiedBig Data simplified
Big Data simplified
 
Big Data Course - BigData HUB
Big Data Course - BigData HUBBig Data Course - BigData HUB
Big Data Course - BigData HUB
 
Bigdata " new level"
Bigdata " new level"Bigdata " new level"
Bigdata " new level"
 
Introducing Technologies for Handling Big Data by Jaseela
Introducing Technologies for Handling Big Data by JaseelaIntroducing Technologies for Handling Big Data by Jaseela
Introducing Technologies for Handling Big Data by Jaseela
 
Big Data & Hadoop Introduction
Big Data & Hadoop IntroductionBig Data & Hadoop Introduction
Big Data & Hadoop Introduction
 
Big data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureBig data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructure
 
Big Data - A brief introduction
Big Data - A brief introductionBig Data - A brief introduction
Big Data - A brief introduction
 
Big Data - An Overview
Big Data -  An OverviewBig Data -  An Overview
Big Data - An Overview
 
Introduction of Big data and Hadoop
Introduction of Big data and Hadoop Introduction of Big data and Hadoop
Introduction of Big data and Hadoop
 
Big data-analytics-cpe8035
Big data-analytics-cpe8035Big data-analytics-cpe8035
Big data-analytics-cpe8035
 
IJARCCE_49
IJARCCE_49IJARCCE_49
IJARCCE_49
 
Introduction to Big data with Hadoop & Spark | Big Data Hadoop Spark Tutorial...
Introduction to Big data with Hadoop & Spark | Big Data Hadoop Spark Tutorial...Introduction to Big data with Hadoop & Spark | Big Data Hadoop Spark Tutorial...
Introduction to Big data with Hadoop & Spark | Big Data Hadoop Spark Tutorial...
 

En vedette

Quantum Models for Decision and Cognition
Quantum Models for Decision and CognitionQuantum Models for Decision and Cognition
Quantum Models for Decision and CognitionCatarina Moreira
 
Quantum Models of Cognition and Decision
Quantum Models of Cognition and DecisionQuantum Models of Cognition and Decision
Quantum Models of Cognition and DecisionCatarina Moreira
 
The Relation Between Acausality and Interference in Quantum-Like Bayesian Net...
The Relation Between Acausality and Interference in Quantum-Like Bayesian Net...The Relation Between Acausality and Interference in Quantum-Like Bayesian Net...
The Relation Between Acausality and Interference in Quantum-Like Bayesian Net...Catarina Moreira
 
Quantum-Like Bayesian Networks using Feynman's Path Diagram Rules
Quantum-Like Bayesian Networks using Feynman's Path Diagram RulesQuantum-Like Bayesian Networks using Feynman's Path Diagram Rules
Quantum-Like Bayesian Networks using Feynman's Path Diagram RulesCatarina Moreira
 
Pivot Selection Techniques
Pivot Selection TechniquesPivot Selection Techniques
Pivot Selection TechniquesCatarina Moreira
 

En vedette (6)

Quantum Models for Decision and Cognition
Quantum Models for Decision and CognitionQuantum Models for Decision and Cognition
Quantum Models for Decision and Cognition
 
Slides
SlidesSlides
Slides
 
Quantum Models of Cognition and Decision
Quantum Models of Cognition and DecisionQuantum Models of Cognition and Decision
Quantum Models of Cognition and Decision
 
The Relation Between Acausality and Interference in Quantum-Like Bayesian Net...
The Relation Between Acausality and Interference in Quantum-Like Bayesian Net...The Relation Between Acausality and Interference in Quantum-Like Bayesian Net...
The Relation Between Acausality and Interference in Quantum-Like Bayesian Net...
 
Quantum-Like Bayesian Networks using Feynman's Path Diagram Rules
Quantum-Like Bayesian Networks using Feynman's Path Diagram RulesQuantum-Like Bayesian Networks using Feynman's Path Diagram Rules
Quantum-Like Bayesian Networks using Feynman's Path Diagram Rules
 
Pivot Selection Techniques
Pivot Selection TechniquesPivot Selection Techniques
Pivot Selection Techniques
 

Similaire à Big Data

Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataRoi Blanco
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewAbhishek Roy
 
Oh! Session on Introduction to BIG Data
Oh! Session on Introduction to BIG DataOh! Session on Introduction to BIG Data
Oh! Session on Introduction to BIG DataPrakalp Agarwal
 
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...Mihai Criveti
 
Lecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.pptLecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.pptalmaraniabwmalk
 
Introduction to Cloud computing and Big Data-Hadoop
Introduction to Cloud computing and  Big Data-HadoopIntroduction to Cloud computing and  Big Data-Hadoop
Introduction to Cloud computing and Big Data-HadoopNagarjuna D.N
 
Big data with Hadoop - Introduction
Big data with Hadoop - IntroductionBig data with Hadoop - Introduction
Big data with Hadoop - IntroductionTomy Rhymond
 
Level Seven - Expedient Big Data presentation
Level Seven - Expedient Big Data presentationLevel Seven - Expedient Big Data presentation
Level Seven - Expedient Big Data presentationDoug Denton
 
Big Data Everywhere Chicago: Leading a Healthcare Company to the Big Data Pro...
Big Data Everywhere Chicago: Leading a Healthcare Company to the Big Data Pro...Big Data Everywhere Chicago: Leading a Healthcare Company to the Big Data Pro...
Big Data Everywhere Chicago: Leading a Healthcare Company to the Big Data Pro...BigDataEverywhere
 
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...BigMine
 
Big-Data-Seminar-6-Aug-2014-Koenig
Big-Data-Seminar-6-Aug-2014-KoenigBig-Data-Seminar-6-Aug-2014-Koenig
Big-Data-Seminar-6-Aug-2014-KoenigManish Chopra
 
Big data4businessusers
Big data4businessusersBig data4businessusers
Big data4businessusersBob Hardaway
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataSpringPeople
 

Similaire à Big Data (20)

Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overview
 
Oh! Session on Introduction to BIG Data
Oh! Session on Introduction to BIG DataOh! Session on Introduction to BIG Data
Oh! Session on Introduction to BIG Data
 
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
 
Data mining with big data
Data mining with big dataData mining with big data
Data mining with big data
 
Data mining with big data
Data mining with big dataData mining with big data
Data mining with big data
 
Lecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.pptLecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.ppt
 
Introduction to Cloud computing and Big Data-Hadoop
Introduction to Cloud computing and  Big Data-HadoopIntroduction to Cloud computing and  Big Data-Hadoop
Introduction to Cloud computing and Big Data-Hadoop
 
Big data with Hadoop - Introduction
Big data with Hadoop - IntroductionBig data with Hadoop - Introduction
Big data with Hadoop - Introduction
 
Data analytics & its Trends
Data analytics & its TrendsData analytics & its Trends
Data analytics & its Trends
 
Level Seven - Expedient Big Data presentation
Level Seven - Expedient Big Data presentationLevel Seven - Expedient Big Data presentation
Level Seven - Expedient Big Data presentation
 
Big Data Everywhere Chicago: Leading a Healthcare Company to the Big Data Pro...
Big Data Everywhere Chicago: Leading a Healthcare Company to the Big Data Pro...Big Data Everywhere Chicago: Leading a Healthcare Company to the Big Data Pro...
Big Data Everywhere Chicago: Leading a Healthcare Company to the Big Data Pro...
 
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...
 
Big-Data-Seminar-6-Aug-2014-Koenig
Big-Data-Seminar-6-Aug-2014-KoenigBig-Data-Seminar-6-Aug-2014-Koenig
Big-Data-Seminar-6-Aug-2014-Koenig
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Big data4businessusers
Big data4businessusersBig data4businessusers
Big data4businessusers
 
Big Data et eGovernment
Big Data et eGovernmentBig Data et eGovernment
Big Data et eGovernment
 
Big data Question bank.pdf
Big data Question bank.pdfBig data Question bank.pdf
Big data Question bank.pdf
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Intro big data analytics
Intro big data analyticsIntro big data analytics
Intro big data analytics
 

Dernier

Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Lokesh Kothari
 
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...chandars293
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learninglevieagacer
 
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Bookingroncy bisnoi
 
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑Damini Dixit
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxRizalinePalanog2
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPirithiRaju
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and ClassificationsAreesha Ahmad
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Silpa
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
American Type Culture Collection (ATCC).pptx
American Type Culture Collection (ATCC).pptxAmerican Type Culture Collection (ATCC).pptx
American Type Culture Collection (ATCC).pptxabhishekdhamu51
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptxAlMamun560346
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)Areesha Ahmad
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticssakshisoni2385
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bSérgio Sacani
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICEayushi9330
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxgindu3009
 

Dernier (20)

Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learning
 
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
American Type Culture Collection (ATCC).pptx
American Type Culture Collection (ATCC).pptxAmerican Type Culture Collection (ATCC).pptx
American Type Culture Collection (ATCC).pptx
 
Site Acceptance Test .
Site Acceptance Test                    .Site Acceptance Test                    .
Site Acceptance Test .
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptx
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 

Big Data

  • 1.  Big  Data    Ins+tuto  Superior  Técnico                              Technical  University  of  Lisbon   Bruno  Lopes   Catarina  Moreira   João  Pinho  
  • 2. Mo#va#on   2  220   2   PetaBytes     Of  data  that  people  create  every  day!  
  • 3. Mo#va#on   3   90  %  of  Data   UNSTRUCTURED   Which  is  difficult  to  manage  and  interpret!  
  • 4. Mo#va#on   4   2  hours     per  day     Spent  searching  for  the  right  informa#on  
  • 5. Mo#va#on   5   $5  million    per  year   Money  lost  due  to  data-­‐related  problems  
  • 6. Mo#va#on   6   1  in  3   Business  leaders  make  decisions   based  on  informa#on  they  don’t     trust  or  don’t  have   1  in  2   Business  leaders  say  they  don’t   have  access  to  the  informa#on   they  need  to  do  their  jobs   60%   CEO’s  need  to  do  a  beQer  job  capturing     and  understanding  informa+on  rapidly  in     order  to  make  swiS  business  decisions   Of  world’s  data     is  unstructured  80  %   Organiza+ons  Need  Deeper   Insights  …   Informa+on  is  at  the  Center   of  a  New  Wave  of  Opportunity   Source:  Big  Data  by  Ami  Redwan  Haq,  Founder  at                                  Sen:nel  Solu:ons  Ltd     44  %   as  much  Data  and  Content   Over  the  Coming  Decade                2009   800  000  PetaBytes                2020   35  ZeQaBytes  
  • 7. What  is  Big  Data?   7   Big  Data  (video)  
  • 8. What  is  Big  Data?   Data  that  exceeds  the  processing   capacity  of  a  conven+onal  database.   The  data  is  too  big,  moves  too  fast  or   doesn’t  fit  the  structures  of  your   database  architectures.   8  Source:  J.  Hurwitz,  A.  Nugent,  F.  Halper  &  M.  Kaufman,  Big  Data  for  Dummies,  Willey  &  Sons,  2013  
  • 9. Why  Big  Data?   Big  data  enables  organiza+ons  to   store,  manage  and  manipulate  vast   amounts  of  data  at  the  right  speed   and  at  the  right  #me  to  gain  the  right   insights.   9  Source:  J.  Hurwitz,  A.  Nugent,  F.  Halper  &  M.  Kaufman,  Big  Data  for  Dummies,  Willey  &  Sons,  2013  
  • 10. Why  Big  Data?   •  The  value  of  Big  Data  falls  into  2  categories:   1.  Analy#cal  Use.     •  Reveal  new  insights  that  were  hidden  in  too  costly  to  process   massive  amounts  of  data;   •  Example:  Peer  influence  among  customers;     2.  Enabling  New  Products.   •  Combining  large  number  of  signals  from  user’s  ac+ons  and  friends   •  Example:  Facebook   10  
  • 11. The  V’s   Big  Data  is  defined  as  any  kind  of  data  source  that  has   at  least  three  shared  characteris+cs:       11   Velocity   Variety   Volume   Viability   Value  
  • 12. Velocity     Increasing  rate  at   which  data  flows  into   organiza#ons   12  
  • 13. Velocity   The  Large  Hadron  Collider    at  CERN  generates     13   Scien+sts  are  forced  to   25  PetaBytes  per  Year!   Discard  Massive  Data   In  order  to  keep   requirements  prac+cal  storage  
  • 14. Velocity   The  way  we  deliver  and  consume  products  and  services   is  genera+ng  a  massive  data  flow  to  the  provider   •  It’s  possible  to  stream  fast-­‐moving  data  into  bulk   storage  for  later  batch  processing.   •  The  importance  lies  in  taking  data  from  input   through  decision.     14  
  • 15. Volume   Ability  to  store   massive  amounts  of   (un)structured  data   15  
  • 16. Volume   16   No  longer  a  problem!  
  • 17. Volume   17   Don’t  forget  the  cloud!  
  • 18. Volume  -­‐  Challenges   •  How  to  determine  relevance  within  large  data   volumes     •  How  to  use  analy+cs  to  create  value  from   relevant  data.   18  
  • 19. Variety   Managing,  merging   and  governing   different  varie+es  of   data.   19  
  • 20. Variety   •  Source  data  is  diverse  and  doesn’t  fall  into  neat   rela+onal  structures     •  Data  comes  in  the  form  of  emails,  photos,  videos,   monitoring  devices,  PDFs,  audio,  etc.   •  This  variety  of  (un)structured  data  creates  problems   for  storage,  mining  and  analyzing  data   20  
  • 21. Viability   Quickly  and  cost-­‐effec#vely   test  and  confirm  a  par+cular   variable’s  relevance  before   inves+ng  in  the  crea+on  of  a   fully  featured  model.   21  
  • 22. Value   Gain  new  insights   about  data  by   analyzing  variables   that  were  hidden   22  
  • 23. Big  Data   23     So,  how  does  it   work?  
  • 24. The  Big  Data  Supply  Chain   24  Source:  Edd  Dumbill,  Planning  for  Big  Data,  O’Reilly,  2012  
  • 25. Summary   •  Mo+va+on  of  why  companies  need   to  learn  how  to  deal  with  Big  Data!     •  Presented  a  defini+on  of  Big  Data   and  analyzed  the  V’s.   •  Presented  an  analysis  of  the  Big  Data   Supply  chain  for  enterprises.   25  
  • 26. 26   Technologies  for  Big  Data  
  • 27. The  Reach  of  Big  Data   •  So  far,  the  analy+cal  use  of  Big  Data  has  been  in   reach  only  to  leading  coopera+ons:     27   But  at  a  fantas+c  cost!  
  • 28. The  Reach  of  Big  Data   But  now,  there  are  several  open  source   applica+ons  than  can  tackle  Big  Data   related  challenges!   28  
  • 29. Apache  Hadoop   •  Developed  and  released  by  Yahoo!   •  Open  source  framework  for  storage   and  large  scale  parallel  processing  of   datasets  on  clusters.   •  Ability  to  cheaply  process  large   amounts  of  data,  regardless  of  its   structure.   29  
  • 32. (Video  -­‐  MapReduce)   •  IBM  –  MapReduce     •  Health  Care  -­‐  Real  Time  Alerts   32  
  • 33. HDFS  –  Hadoop  Distributed  File  System   33  
  • 34. 34   •  Detec+on  of  faults  and  quick,  automa+c  recovery     •  Emphasis  on  high  throughput  of  data  access   •  Tuned  to  support  large  datasets   •  Write-­‐once-­‐read-­‐many  access  model:  once  a  file  is   created,  wriQen  and  closed,  needs  not  to  be  changed   HDFS  –  Hadoop  Distributed  File  System  
  • 35. 35   •  Moving  computa+on  is  cheaper  than  moving  data   –  minimizes  network  conges+on   –  Increases  the  overall  throughput  of  the  system   •  Portability  Across  Heterogeneous  Hardware  and   SoSware  Plaqorms   HDFS  –  Hadoop  Distributed  File  System  
  • 36. HDFS  –  Hadoop  Distributed  File  System   36  
  • 37. The  Core  of  Hadoop  -­‐  MapReduce   37   •  Created  by  Google  for  web  search  indexes   •  Ability  to  take  a  query  over  a  dataset,  divide  it  and   run  it  in  parallel  over  mul+ple  nodes   •  Consists  in  three  stages:   –  Map   –  Shuffle   –  Reduce  
  • 38. The  Core  of  Hadoop  -­‐  MapReduce   38   Input   Output  Reduce  Shuffle  Map  
  • 39. The  Core  of  Hadoop  -­‐  MapReduce   39   Input   Output  Reduce  Reduce   Shuffle  Map  
  • 40. The  Core  of  Hadoop  -­‐  MapReduce   40   •  Imagine  that  you  want  to  apply  MapReduce  to   word  coun+ng  in  books    
  • 41. The  Core  of  Hadoop  -­‐  MapReduce   41   Input   Output  Reduce  Reduce   Shuffle  Map   E  …  com   a  mulher   e  o  filho…  
  • 42. The  Core  of  Hadoop  -­‐  MapReduce   42   Input   Output  Reduce  Reduce   Shuffle  Map   E  …  com   a  mulher   e  o  filho…   (  1,  E)   (2,  com)   (3,  a  )   (4,  mulher)   (5,  e  )   (6,  o)   (7,  filho)    
  • 43. The  Core  of  Hadoop  -­‐  MapReduce   43   Input   Output  Reduce  Reduce   Shuffle  Map   e  …  com   a  mulher   e  o  filho…   (  1,  e)   (2,  com)   (3,  a  )   (4,  mulher)   (5,  e  )   (6,  o)   (7,  filho)     (1,  e)  -­‐>  2   (2,  com)  -­‐>  1   (3,  a  )  -­‐>  1   (4,  mulher)  -­‐>  1   (5,  o)  -­‐>  1   (6,  filho)  -­‐>  1    
  • 44. The  Core  of  Hadoop  -­‐  MapReduce   44   Input   Output  Reduce  Reduce   Shuffle  Map   e  …  com   a  mulher   e  o  filho…   (  1,  e)   (3,  com)   (2,  a  )   (6,  mulher)   (5,  e  )   (7,  o)   (4,  filho)     (1,  e)  -­‐>  2   (3,  com)  -­‐>  1   (2,  a  )  -­‐>  1   (6,  mulher)  -­‐>  1   (7,  o)  -­‐>  1   (4,  filho)  -­‐>  1     e  [4,2,6]   a  [3,1,2]   com  [5,1,2]   filho  [2,1,1]   mulher  [3,1,10]   O  [7,1,9]    
  • 45. The  Core  of  Hadoop  -­‐  MapReduce   45   Input   Output  Reduce  Reduce   Shuffle  Map   e  …  com   a  mulher   e  o  filho…   (  1,  e)   (3,  com)   (2,  a  )   (6,  mulher)   (5,  e  )   (7,  o)   (4,  filho)     (1,  e)  -­‐>  2   (3,  com)  -­‐>  1   (2,  a  )  -­‐>  1   (6,  mulher)  -­‐>  1   (7,  o)  -­‐>  1   (4,  filho)  -­‐>  1   e  [4,2,6]   a  [3,1,2]   com  [5,1,2]   filho  [2,1,1]   mulher  [3,1,10]   O  [7,1,9]   e  -­‐>  12   a  -­‐>  6   com  -­‐>  8   filho  -­‐>  4   mulher  -­‐>  14   o  -­‐>  17  
  • 47. The  Social  Genome  product:       •  Reach  customers,  or  friends  of  customers,   who  have  men+oned  product  and  include  a   discount   •  Combines  public  data  from  the  web,  social   data  and  proprietary  data   47   Walmart  (Real  Time)  
  • 48. The  Social  Genome  product:       •  Helps  Walmart  to  understand  the  context  of   what  their  customers  are  saying  online     •  When  a  person  tweets  I  love  Salt,  Walmart   can  understand  that  she  is  talking  about   the  movie  Salt  and  not  the  condiment.   48   Walmart  (Real  Time)  
  • 49. Apache  Hive   49   •  Data  warehouse  soSware  that  facilitates   querying  and  managing  large  datasets   residing  in  a  distributed  storage   •  Uses  HiveSQL  to  analyze  data   •  Allows  programmers  to  use  their  own  map   reduce  func+ons    
  • 50. Apache  Cassandra   50   •  Database  with  high  scalability  and  high   availability  without  compromising   performance.   •  Offers   – Denormaliza+on   – Materialized  views   – Powerful  built  in  cache  
  • 51. Google  Big  Query   51   Built  on  GoogleFS  
  • 52. Summary     52   •  Apache  Hadoop  techhnology   –  HDFS  file  system   –  MapReduce  paradigm   •  Apache  Hive   •  Apache  Cassandra   •  Google  BigQuery  
  • 53.
  • 54. Project  HEIDI     54   How  to  index  high  dimensional  data?  
  • 55. Project  HEIDI     55   •  Mul+media  datasets  are  growing!   –  Ex:  Audio,  video,  medical  images,  photos,  etc.   •  We  need  techniques  to  efficiently  manage  and   access  informa+on  in  such  large  datasets!   •  Mul+media  datasets  cannot  be  searched  like   tradi+onal  databases  (e.g.  Text  search).  They  are   searched  in  metric  spaces.  
  • 56. Project  HEIDI   Given  a  mul#media   object  as  input,  return   the  most  similar  objects   from  a  mul+media   database   56  
  • 57. Project  HEIDI     57   •  In  content-­‐based  image  retrieval,  images  are   described  by  their  proper+es:   •  Colors   •  Texture   •  Shape   •  Shadows   •  etc  
  • 58. Project  HEIDI     58   •  The  process  of  conver+ng  a  mul+media  object   into  a  vector  with  its  contents/features  is   called  feature  extrac,on.  
  • 59. 59  
  • 60. Project  HEIDI     60   The  problem  in  high  dimensional  indexing  is  the  
  • 61. Project  HEIDI     61   The  problem  in  high  dimensional  indexing  is  the   OF  DIMENSIONALITY!!!  
  • 62. The  Curse  of  Dimensionality   62   •  A  phenomena  that  arises  when  analyzing   and  organizing  data  in  high  dimensional   spaces.   •  It  does  not  occur  in  lower  dimensions!  
  • 63. The  Curse  of  Dimensionality   63   •  When  dimensionality  increases,  the   volume  of  the  space  increases  so  fast   that  the  data  become  sparse.   •  Tradi+onal  tree  indexing  structures  do   not  provide  any  advantages  in  the  index   process  (outperformed  by  linear  scan)…  
  • 64. The  Curse  of  Dimensionality   64   •   Consequences:     – (hyper)  cubes   – (hyper)  spheres     •  Volume  increases  exponen#ally  with  growing   dimension  (while  the  edge  length  remains   constant).  
  • 65. The  Curse  of  Dimensionality   65     How  can  this  curse   affect  datasets  in   metric  spaces?  
  • 66. The  Curse  of  Dimensionality   66   •   In  metric  spaces,  there  is  the   Nearest  Neighbor  paradigm.   •  The  NN-­‐sphere  contains  one   point.   •   For  higher  dimensions,  the  radius  of  the   NN-­‐sphere  will  be  larger  than  the  size  of   the  database.  
  • 67. The  Curse  of  Dimensionality   67   Literature  addresses  this  issue  in  two  ways:     1.  Approximate  Nearest  Search   2.  Reduce  the  dimensionality  of  the  mul+media   objects  
  • 68. Project  HEIDI   68   The  algorithm  that  is  proposed  in  this  project  does   not  suffer  from  the  curse  of  dimensionality!  
  • 69. Project  HEIDI   69   The  Linear  SubSpace  Indexing  Algorithm     •  Based  on  the  idea  of  subspaces!   – Feature  extrac+on  func+on  that  maps  the  high   dimensional  objects  into  low  dimensional   equivalent  objects.  
  • 70. Project  HEIDI   70   The  Quick  and  Dirty  Paradigm     •  Discard  data  objects  in  lower  dimensional  spaces   •  Massive  comparison  opera+ons  are  less   expensive  in  lower  dimensions  
  • 71. Project  HEIDI   71     However,  this  approach  has  
  • 72. Project  HEIDI   72   The  Quick  and  Dirty  Paradigm     •  Can  produce  lots  of  false  hits!   •  The  mapping  to  lower  dimensions  can  lead  to   closer  data  points  
  • 73. Project  HEIDI   73   Are  false  hits  really  a        
  • 75. Project  HEIDI   75   •  Just  perform  a  final  comparison  in  the  original   space!   •  All  false  hits  will  be  discarded!   •  The  computa+on  is  quick,  because  the  massive   amount  of  objects  were  discarded  in  the  lower   dimensions.  
  • 77. Project  HEIDI   77     But…  How  to  map  high   dimensional  objects  into   lower  dimensional  spaces?        
  • 78. Project  HEIDI   78   The  Quick  and  Dirty  Paradigm     •  Lower  bonding  lemma:   – Distances  between  objects  in  the  feature  space  are   always  smaller  or  equal  that  in  the  original  space      
  • 79. Project  HEIDI   79   The  Quick  and  Dirty  Paradigm     •  The  Principal  Component  Analysis  algorithm   maps  objects  into  lower  dimensions   •  Sa+sfies  the  lower  bounding  lemma      
  • 80. Project  HEIDI   80   The  Principal  Component  Analysis      
  • 81. Project  HEIDI   81   The  Principal  Component  Analysis     1.  Compute  Co-­‐Variance  Matrix      
  • 82. Project  HEIDI   82   The  Principal  Component  Analysis     2.  Compute   eigen   vectors   and   eigen   values   from   co-­‐variance  matrix   3.  Get   the   most   significant   eigen   vectors   and   create  a  projec+on  matrix   4.  Project  data    
  • 83. Project  HEIDI   83   The  Hierarchical  Linear  SubSpace  Indexing  Method     1.  Index  Phase  (off-­‐line)   – Par++on  large  dataset  into  chunks  of  data   – Itera+vely  apply  the  Principal  Component  Analysis  to   project  the  dataset  into  lower  dimensions   – This  ac+on  can  be  parallelized  in  many  servers      
  • 84. Project  HEIDI   84   The  Hierarchical  Linear  SubSpace  Indexing  Method  
  • 85. Project  HEIDI   85   The  Hierarchical  Linear  SubSpace  Indexing  Method   Original  Space   Lower  Space  
  • 86. Project  HEIDI   86   The  Hierarchical  Linear  SubSpace  Indexing  Method   Original  Space   Lower  Space   PCA  
  • 87. Project  HEIDI   87   The  Hierarchical  Linear  SubSpace  Indexing  Method   Original  Space   Lower  Space   PCA  
  • 88. Project  HEIDI   88   The  Hierarchical  Linear  SubSpace  Indexing  Method   Original  Space   Lower  Space   PCA   PCA  
  • 89. Project  HEIDI   89   The  Hierarchical  Linear  SubSpace  Indexing  Method   Original  Space   Lower  Space   PCA   PCA  
  • 90. Project  HEIDI   90   The  Hierarchical  Linear  SubSpace  Indexing  Method   Original  Space   Lower  Space   PCA   PCA   PCA  
  • 91. Project  HEIDI   91   The  Hierarchical  Linear  SubSpace  Indexing  Method   Original  Space   Lower  Space   PCA   PCA   PCA  
  • 92. Project  HEIDI   92   The  Hierarchical  Linear  SubSpace  Indexing  Method   Original  Space   Lower  Space   PCA   PCA   PCA   PCA  
  • 93. Project  HEIDI   93   The  Hierarchical  Linear  SubSpace  Indexing  Method   Original  Space   Lower  Space   PCA   PCA   PCA   PCA  
  • 94. Project  HEIDI   94   The  Hierarchical  Linear  SubSpace  Indexing  Method   Original  Space   Lower  Space   PCA   PCA   PCA   PCA   PCA  
  • 95. Project  HEIDI   95   The  Hierarchical  Linear  SubSpace  Indexing  Method   Original  Space   Lower  Space   PCA   PCA   PCA   PCA   PCA  
  • 96. Project  HEIDI   96   The  Hierarchical  Linear  SubSpace  Indexing  Method   Original  Space   Lower  Space   PCA   PCA   PCA   PCA   PCA   PCA  
  • 97. Project  HEIDI   97   The  Hierarchical  Linear  SubSpace  Indexing  Method    
  • 98. Project  HEIDI   98   The  Hierarchical  Linear  SubSpace  Indexing  Method    
  • 99. Project  HEIDI   99   The  Hierarchical  Linear  SubSpace  Indexing  Method    
  • 100. Project  HEIDI   100   The  Lower  Bounding  Lemma  is  sa#sfied     Euclidean  Distance   Most  Similar  Images  Retrieved  
  • 101. Project  HEIDI   101   The  Hierarchical  Linear  SubSpace  Indexing  Method    
  • 102. Project  HEIDI   102   The  Hierarchical  Linear  SubSpace  Indexing  Method     2.  Query  Phase  (on-­‐line)   – Project  queries  using  projec+on  matrices  from  step  1   – Star+ng  in  the  lowest  dimensional  space,  itera+vely   discard  all  objects  that  are  different  from  the  query   – Keep  doing  this  un+l  the  original  space  is  reached  and   the  false  hits  are  discarded.    
  • 103. Project  HEIDI   103   The  Hierarchical  Linear  SubSpace  Indexing  Method    
  • 104. Summary   •  Important  concepts  in  Big  Data  for   High  Dimensional  Datasets   •  The  curse  of  dimensionality   •  Projec+on  techniques   •  The  Lower  Bounding  Lemma   •  The  Quick  and  Dirty  paradigm  (some   rela+on  to  the  MapReduce  paradigm)     104  
  • 105. Summary     Mathema#cs  is  more   important  than   engineering!     105  
  • 106. Now…   Time  to  See  This  Working   for  REAL!!!   106