
1. Introduction to the Course "Designing Data Bases with Advanced Data Models (NoSQL & Hadoop)"


Information technology has led us into an era in which producing, sharing and using information is part of everyday life, often without our being aware of it: it is now almost impossible not to leave a digital trail of many of the actions we perform every day, for example through digital content such as photos, videos, blog posts and everything that revolves around social networks (Facebook and Twitter in particular). In addition, with the "Internet of Things" we see a growing number of devices such as watches, bracelets, thermostats and many other items that can connect to the network and therefore generate large data streams. This explosion of data explains the birth of the term Big Data: it indicates data produced in large quantities, at remarkable speed and in different formats, which requires processing technologies and resources that go far beyond conventional data management and storage systems. It is immediately clear that 1) data storage models based on the relational model and 2) processing systems based on stored procedures and grid computing are not applicable in these contexts. Regarding point 1, RDBMSs, widely used for a great variety of applications, run into problems when the amount of data grows beyond certain limits. Scalability and implementation cost are only part of the disadvantages: very often, when managing big data, variability (the lack of a fixed structure) is also a significant problem. This has given a boost to the development of NoSQL databases. The NoSQL Databases website defines NoSQL databases as "Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open source and horizontally scalable."
These databases are distributed, open source, horizontally scalable, without a fixed schema (key-value, column-oriented, document-based and graph-based), easily replicated, without ACID guarantees, and able to handle large amounts of data. They are often integrated with processing tools based on the MapReduce paradigm proposed by Google in 2004. MapReduce, together with the open source Hadoop framework, represents the new model for distributed processing of large amounts of data, supplanting techniques based on stored procedures and computational grids (point 2). The relational model taught in basic database design courses has many limitations with respect to the demands of new Big Data applications, which use NoSQL databases to store data and MapReduce to process large amounts of it.
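
To make the MapReduce paradigm mentioned in the description more concrete, here is a minimal single-process sketch of a word count in Python. It is not Hadoop code: the function names `map_phase`, `shuffle` and `reduce_phase` are illustrative assumptions, and a real framework would run the map and reduce steps in parallel on many machines.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group all emitted values by key, as a framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values seen for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data big clusters", "data on clusters"])
```

Because each call to `map_phase` and `reduce_phase` depends only on its own input, the work can in principle be spread over a cluster, which is the property the description relies on.
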

Course Website http://pbdmng.datatoknowledge.it/

Published in: Data & Analytics


  1. Course Introduction
     Designing Data Bases with Advanced Data Models
     Dr. Fabio Fumarola
  2. Enterprise computing evolution
     • We’ve spent several years in the world of enterprise computing.
     • We’ve seen many things change in:
       – Languages, Architectures,
       – Platforms, and Processes.
     • But one thing stayed constant: “relational databases stored the data”.
     • The data storage question for architects was: “which relational database to use?”
  3. The stability of the reign
     • Why?
     • An organization’s data lasts much longer than its programs (COBOL?).
     • It’s valuable to have stable data storage that is accessible from many applications.
     • In the last decades RDBMSs have been successful in solving problems related to storing, serving and processing data.
  4. FROM BUSINESS TO DECISION SUPPORT
     Before going deep into the course topics, let’s understand the evolution of decision support systems.
  5. From business to decision support: ’60s
     • Starting in the ’60s, data were stored on magnetic disks.
     • Supported analyses were static, aggregate-only and pretty limited.
     • For instance, it was possible to extract the total amount of last month’s sales.
  6. From business to decision support: ’80s
     • With relational databases and SQL, data analysis started to become somewhat dynamic.
     • SQL allows us to extract data at both detailed and aggregated levels.
     • Transactional activities are stored in Online Transaction Processing (OLTP) databases.
     • OLTP databases are used in several applications such as orders, salaries, invoices…
  7. From business to decision support: ’80s
     • In the best case, the described modules are included in Enterprise Resource Planning (ERP) software.
     • Examples of such vendors are SAP, Microsoft, HP and Oracle.
     • Normally, what happens is that each module is implemented as ad-hoc software with its own database.
     • Cons: data representation and integration.
  8. OLTP design
     • Such databases are designed to be:
       – Strongly normalized,
       – Fast at inserting data.
     • However, data normalization:
       – Does not favor reading huge quantities of data,
       – Increases the number of tables used to store records.
  9. Foster decision support
     • In order to support “decisions” we need to extract de-normalized data via several JOINs.
     • Moreover, operational databases offer no or limited visibility on historical data.
     Considerations:
     • These factors make data analysis on OLTP databases hard…
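
The point about normalization and JOINs can be sketched with Python's built-in `sqlite3` module. The schema below (`customers`, `products`, `orders`, `order_lines`) and its data are hypothetical, chosen only to show that a simple sales report over a normalized OLTP schema already needs three joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical normalized OLTP schema: each fact lives in exactly one table.
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE products  (id INTEGER PRIMARY KEY, name TEXT, price REAL);
    CREATE TABLE orders    (id INTEGER PRIMARY KEY,
                            customer_id INTEGER REFERENCES customers(id));
    CREATE TABLE order_lines (order_id INTEGER REFERENCES orders(id),
                              product_id INTEGER REFERENCES products(id),
                              quantity INTEGER);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO products  VALUES (1, 'disk', 50.0), (2, 'cable', 5.0);
    INSERT INTO orders    VALUES (10, 1), (11, 2);
    INSERT INTO order_lines VALUES (10, 1, 2), (10, 2, 4), (11, 2, 1);
""")

# The decision-support question "total spent per customer" has to
# de-normalize the data again by joining four tables.
report = conn.execute("""
    SELECT c.name, SUM(p.price * l.quantity) AS total
    FROM order_lines AS l
    JOIN orders    AS o ON o.id = l.order_id
    JOIN customers AS c ON c.id = o.customer_id
    JOIN products  AS p ON p.id = l.product_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
```

On a toy data set this is instantaneous; the slide's concern is what happens when the same joins run over very large tables that are also serving transactional traffic.
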
  10. From business to decision support: ’90s
     • Thus in the ’90s, databases designed to support analysis of OLTP data were born.
     • This is the rise of Data Warehouses (DWs).
     • DWs are central repositories of integrated data from one or more disparate sources.
     • They store current and historical data and are used to create trend reports for management, such as annual and quarterly comparisons.
  11. Types of DW systems
     • Data Mart:
       – is a simple form of data warehouse focused on a single subject (or functional area), such as sales, finance or marketing.
     • Online analytical processing (OLAP):
       – is characterized by a relatively low volume of transactions.
       – Queries are often very complex and involve aggregations.
       – OLAP databases store aggregated, historical data in multi-dimensional schemas.
  12. Types of DW systems
     • Predictive analysis
       – Predictive analysis is about finding and quantifying hidden patterns in the data using complex mathematical models that can be used to predict future outcomes.
       – Predictive analysis differs from OLAP in that OLAP focuses on historical data analysis and is reactive in nature, while predictive analysis focuses on the future.
  13. Data Warehouse
     • Data stored in DWs are the starting point for Business Intelligence (BI).
     • Def. “Business intelligence (BI) is the set of techniques and tools for the transformation of raw data into meaningful and useful information for business analysis purposes”.
     • With the evolution of BI systems we moved from SQL-based analysis to visual instruments.
  14. Example DW Cube
  15. OLAP features
     • OLAP systems support data exploration through drill-down, drill-up, slicing and dicing operations.
     • However, we still have a historical point of view of what happened in the business, but not of what is happening now.
     • We cannot make predictions about the future.
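
The OLAP operations named on this slide can be illustrated on a tiny in-memory fact table. This is only a sketch: the data, the dimension names and the helper functions `aggregate`, `slice_` and `dice` are assumptions made for the example, not the API of any OLAP product.

```python
from collections import defaultdict

# Hypothetical fact table: (year, quarter, region, product, sales amount).
facts = [
    (2013, "Q1", "North", "disk", 100),
    (2013, "Q1", "South", "disk", 80),
    (2013, "Q2", "North", "cable", 30),
    (2014, "Q1", "North", "disk", 120),
    (2014, "Q1", "South", "cable", 20),
]
DIMENSIONS = ("year", "quarter", "region", "product")

def aggregate(rows, group_by):
    """Sum the sales measure, grouped by the given dimension names."""
    idx = [DIMENSIONS.index(d) for d in group_by]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)

def slice_(rows, dimension, value):
    """Slice: keep only the rows where one dimension has a fixed value."""
    i = DIMENSIONS.index(dimension)
    return [row for row in rows if row[i] == value]

def dice(rows, **allowed):
    """Dice: keep only the rows whose dimensions fall in the allowed value sets."""
    return [row for row in rows
            if all(row[DIMENSIONS.index(d)] in vals for d, vals in allowed.items())]

by_year = aggregate(facts, ["year"])                      # drill-up (coarser view)
by_year_quarter = aggregate(facts, ["year", "quarter"])   # drill-down (finer view)
north_by_product = aggregate(slice_(facts, "region", "North"), ["product"])
diced = dice(facts, year={2013}, product={"disk"})
```

All four results are summaries of facts that were already recorded, which matches the slide's remark that OLAP gives a historical view rather than a prediction.
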
  16. From business to decision support: ’00s
     • Starting from 2000, the need for predictive analysis arose.
     • The techniques in this scenario belong to the field of Data Mining.
     • Data Mining is the computational process of discovering patterns in large datasets, involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.
  17. From business to decision support: ’00s
     • The overall goal of the data mining process is to extract novel information from a data set and transform it into understandable knowledge.
     • Aside from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.
  18. From business to decision support: Recap
     • In the last decades RDBMSs have been successful in solving problems related to storing, serving and processing data.
  19. From business to decision support: Recap
     • Vendors such as Oracle, Vertica, Teradata, Microsoft and IBM proposed their solutions based on relational algebra and SQL.
     • With Data Mining we are able to extract knowledge from data, which can be used to support predictive analysis (Data Mining is not only prediction!!!).
  20. Challenges of Scale Differ
  21. But THERE IS SOMETHING THAT DOES NOT WORK!
  22. 1. Scaling Up Databases
     A question I’m often asked about Heroku is: “How do you scale the SQL database?” There’s a lot of things I can say about using caching, sharding, and other techniques to take load off the database. But the actual answer is: we don’t. SQL databases are fundamentally non-scalable, and there is no magical pixie dust that we, or anyone, can sprinkle on them to suddenly make them scale.
     Adam Wiggins, Heroku
     Patterson, David; Fox, Armando (2012-07-11). Engineering Long-Lasting Software: An Agile Approach Using SaaS and Cloud Computing, Alpha Edition (Kindle Locations 1285-1288). Strawberry Canyon LLC. Kindle Edition.
  23. 2. Data Variety
     • RDBMSs have problems with unstructured and semi-structured data (varied data).
  24. 3. Connectivity
  25. 4. P2P Knowledge
  26. 5. Concurrency
  27. 6. Concurrency
  28. 6. Diversity
  29. 7. Cloud
  30. What is the problem with RDBMSs
     Caching, Master/Slave, Master/Master, Cluster, Table Partitioning, Federated Tables, Sharding, Distributed DBs
     http://codefutures.com/database-sharding/
  31. What is the problem with RDBMSs
     • RDBMSs can somehow deal with these aspects, but they have issues related to:
       – expensive licensing,
       – requiring complex application logic,
       – dealing with evolving data models.
     • There was a need for systems that could:
       – work with different kinds of data formats,
       – not require a strict schema,
       – and scale easily.
  32. NOSQL: THE NEW CHALLENGER!
     Help!!!
  33. NoSQL
     • It was born out of a need to handle large data volumes.
     • It forces a fundamental shift to building large hardware platforms through clusters of commodity servers.
     • This need also arises from the difficulties of making application code play well with relational databases.
  34. NoSQL
     • The term “NoSQL” is very ill-defined.
     • It’s generally applied to a number of recent non-relational databases: Cassandra, Mongo, Neo4j, HBase, Redis, …
     • They:
       – embrace schemaless data,
       – run on a cluster,
       – and have the ability to trade off traditional consistency for other useful properties.
  35. Why are NoSQL Databases Interesting
     1. Application development productivity:
       – A lot of application development effort is spent on mapping data between in-memory data structures and relational databases.
       – A NoSQL database may provide a data model that simplifies that interaction, resulting in less code to write, debug, and evolve.
  36. Why are NoSQL Databases Interesting
     2. Large-scale data:
       – Organizations are finding it valuable to capture more data and process it more quickly.
       – They are finding it expensive to do so with relational databases.
       – NoSQL databases are more economical when run on large clusters of many smaller and cheaper machines.
       – Many NoSQL databases are designed to run on clusters, so they are a better fit for Big Data scenarios.
  37. WHY NOSQL
     Connectedness: Internet hypertext, RSS, wikis, blogs, tagging, user-generated content, RDF, ontologies
  38. [Figure-only slide; no text recovered]
  39. The Value of Relational Databases
     • Relational databases have become an embedded part of our computing culture.
     • What are the benefits they provide?
  40. Getting at Persistent Data
     • The most obvious value of a database is keeping large amounts of persistent data.
     • Two areas of memory:
       – Main memory: fast, volatile, limited in space; it loses its data when power is lost.
       – Backing store: larger but slower, commonly a disk.
  42. Getting at Persistent Data
     • The backing store can be organized in all sorts of ways.
     • For many productivity applications (such as word processors) it is a file in the file system.
     • For most enterprise applications, however, the backing store is a database.
     • A database allows more flexibility than a file system.
  43. Concurrency
     • Concurrency is notoriously difficult to get right.
     • Object-oriented programming is not the right model for dealing with concurrency.
     • Since enterprise applications can have a lot of concurrent users, there is a lot of room for bad things to happen.
     • Relational databases have transactions that help mitigate this problem, but…
  44. Concurrency
     • You still have to deal with a transactional error when you try to book a room that has just gone.
     • The reality is that the transactional mechanism has worked well to contain the complexity of concurrency.
     • Transactions with rollback allow us to deal with errors.
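
The room-booking case can be sketched with Python's built-in `sqlite3` module. The tables `rooms` and `payments`, the `RoomTaken` exception and the `book` function are hypothetical names introduced for this example; the sketch shows a transaction being rolled back when the room turns out to be already taken.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rooms (id INTEGER PRIMARY KEY, guest TEXT)")
conn.execute("CREATE TABLE payments (guest TEXT, room_id INTEGER)")
conn.execute("INSERT INTO rooms VALUES (101, NULL)")
conn.commit()

class RoomTaken(Exception):
    """Raised when the room was booked by someone else in the meantime."""

def book(conn, room_id, guest):
    """Record a payment and claim the room atomically; undo both on failure."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("INSERT INTO payments VALUES (?, ?)", (guest, room_id))
            claimed = conn.execute(
                "UPDATE rooms SET guest = ? WHERE id = ? AND guest IS NULL",
                (guest, room_id),
            ).rowcount
            if claimed == 0:
                raise RoomTaken(room_id)
        return True
    except RoomTaken:
        return False

first = book(conn, 101, "Ann")    # room is free, booking commits
second = book(conn, 101, "Bob")   # room is gone, payment is rolled back
payments = conn.execute("SELECT guest FROM payments").fetchall()
```

The application still has to handle the failed booking, as the slide says, but the rollback guarantees that no orphan payment is left behind.
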
  45. Integration
     • Enterprise applications live in a rich ecosystem.
     • Multiple applications written by different teams need to collaborate in order to get things done.
     • This collaboration is done via data sharing.
     • A common way to do this is shared database integration [Hohpe and Woolf], where multiple applications store their data in a single database.
     • Using a single database allows all the applications to share data easily, while the database’s concurrency control handles multiple applications just as it handles multiple users.
  46. A (Mostly) Standard Model
     • Relational databases have succeeded because they have a standard model.
     • As a result, developers and database professionals can apply the same knowledge in several projects.
     • Although there are differences between different RDBMSs, the core mechanisms remain the same.
  47. Impedance Mismatch
     • It is the difference between the relational model and the in-memory data structures.
     • The relational model organizes data into tables and rows (relations and tuples).
       – A tuple is a set of name-value pairs.
       – A relation is a set of tuples.
     • All SQL operations consume and return relations.
  48. Impedance Mismatch: Example
  49. Impedance Mismatch
     • Tuples and relations provide elegance and simplicity, but they also introduce limitations.
     • In particular, the values in a relational tuple have to be simple.
     • They cannot contain any structure, such as a nested record or a list.
     • This limitation does not apply to in-memory data structures.
  50. Impedance Mismatch
     • As a result, if we want to use a richer in-memory data structure, we have to translate it to a relational representation to store it on disk.
     • While object-oriented languages succeeded, object-oriented databases faded into obscurity.
     • The impedance mismatch has been made much easier to deal with by Object-Relational Mapping (ORM) frameworks such as EclipseLink, Hibernate and other JPA implementations.
  51. Impedance Mismatch
     • ORMs remove a lot of work, but can become a problem when people try to ignore the database and query performance suffers.
     • This is where NoSQL databases work well. Why?
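
The impedance mismatch described in the slides above can be made concrete with a small Python sketch. The `order` structure and the helpers `to_relational` and `from_relational` are hypothetical; they show the translation between a nested in-memory aggregate and flat tuples, and, for contrast, how the same aggregate could be stored as a single JSON document.

```python
import json

# Hypothetical in-memory aggregate: an order with a nested list and a nested record.
order = {
    "id": 99,
    "customer": "Ann",
    "lines": [
        {"product": "disk", "quantity": 2},
        {"product": "cable", "quantity": 4},
    ],
    "shipping": {"city": "Bari", "zip": "70125"},
}

def to_relational(order):
    """Flatten the aggregate into tuples that hold only simple values."""
    orders = [(order["id"], order["customer"],
               order["shipping"]["city"], order["shipping"]["zip"])]
    order_lines = [(order["id"], line["product"], line["quantity"])
                   for line in order["lines"]]
    return orders, order_lines

def from_relational(orders, order_lines):
    """Rebuild the aggregate from its rows (the kind of work an ORM automates)."""
    oid, customer, city, zip_code = orders[0]
    return {
        "id": oid,
        "customer": customer,
        "lines": [{"product": p, "quantity": q}
                  for (line_oid, p, q) in order_lines if line_oid == oid],
        "shipping": {"city": city, "zip": zip_code},
    }

orders, order_lines = to_relational(order)
round_trip = from_relational(orders, order_lines)

# A document-oriented store could instead persist the aggregate as one document.
document = json.dumps(order)
```

The two translation functions are the code that the mismatch forces on the application; the single `json.dumps` call hints at why an aggregate-oriented data model can mean less mapping code.
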
  52. Application and Integration DBs
     • This is an event that happens several times in SW projects.
     • In this scenario, the database acts as an integration database.
     • The downsides of sharing a database are:
  53. Application and Integration DBs
     • The downsides of sharing a database are:
       – Its structure tends to be more complex than any single application needs,
       – If an application wants to make changes to its data storage, it needs to coordinate with all the other applications,
       – Performance degradation due to the huge number of accesses,
       – Errors in database usage, since it is accessed by applications written by different teams.
     • This is different from single-application databases.
  54. Application and Integration DBs
     • Interoperability concerns can now shift to the interfaces of the application, allowing interaction over HTTP.
     • This is what happens with micro-services (http://www.tikalk.com/java/micro-services/).
     • Micro-services and Web services in general enable a form of communication based on data.
     • Data is represented as documents, formerly in XML and now in JSON format.
  55. Application and Integration DBs
     • If you are going to use service integration, text over HTTP is the way to go.
     • However, if performance matters, there are binary protocols.
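
The trade-off between text and binary encodings can be sketched with Python's standard `json` and `struct` modules. The `reading` message and its fixed binary layout are assumptions made for this example; real binary protocols define their own formats.

```python
import json
import struct

# Hypothetical sensor reading exchanged between two services.
reading = {"sensor_id": 42, "timestamp": 1400000000, "value": 21.5}

# Text encoding: self-describing and human-readable, as typically sent over HTTP.
as_json = json.dumps(reading).encode("utf-8")

# Binary encoding: a fixed layout that both sides must agree on in advance.
# "<IId" = little-endian unsigned int, unsigned int, double (4 + 4 + 8 bytes).
LAYOUT = struct.Struct("<IId")
as_binary = LAYOUT.pack(reading["sensor_id"], reading["timestamp"], reading["value"])

# Decoding the binary form requires knowing the field order out of band.
decoded = dict(zip(("sensor_id", "timestamp", "value"), LAYOUT.unpack(as_binary)))
```

The binary message is smaller and cheaper to parse, but it carries no field names, so every change to the layout has to be coordinated between the services.
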
  56. Attack of the Clusters
     • In the 2000s several large web properties dramatically increased in scale:
       – Websites started tracking activity and structure in detail (analytics).
       – Large sets of data appeared: links, social networks, activity in logs, mapping data.
       – With this growth in data came a growth in users.
     • Coping with this increase in data and traffic required more computing resources.
  57. Attack of the Clusters
     • There were two choices:
       – Scaling up
       – Scaling out
     • Scaling up implies bigger machines, with more processors, disk storage and memory ($$$).
     • Scaling out was the alternative:
       – Use a lot of small machines in a cluster.
       – It is cheaper and also more resilient (to individual failures).
  58. Attack of the Clusters
     • This revealed a new problem: relational databases are not designed to run on clusters.
     • Clustered relational databases (e.g. Oracle RAC or Microsoft SQL Server) work on the concept of a shared disk subsystem.
     • RDBMSs can also run on separate servers with different sets of data (sharding).
  59. Attack of the Clusters
     • However, sharding needs the application to control the sharded database.
     • We also lose querying, referential integrity, transactions and consistency control across shards.
     • These technical issues are exacerbated by licensing costs.
     • This mismatch between DBs and clusters led some organizations to consider different solutions.
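
What it means for the application to control a sharded database can be sketched in a few lines of Python. The three in-memory SQLite databases stand in for separate servers, and the routing function `shard_for` (a CRC32 hash modulo the number of shards) is one simple assumption among many possible sharding schemes.

```python
import sqlite3
import zlib

# Hypothetical setup: three independent databases act as shards.
shards = [sqlite3.connect(":memory:") for _ in range(3)]
for db in shards:
    db.execute("CREATE TABLE users (name TEXT PRIMARY KEY, country TEXT)")

def shard_for(key):
    """Application-level routing: a stable hash of the key picks the shard."""
    return shards[zlib.crc32(key.encode("utf-8")) % len(shards)]

def insert_user(name, country):
    """Writes go to the single shard that owns the key."""
    db = shard_for(name)
    db.execute("INSERT INTO users VALUES (?, ?)", (name, country))
    db.commit()

def find_user(name):
    """Single-key lookups touch exactly one shard."""
    return shard_for(name).execute(
        "SELECT name, country FROM users WHERE name = ?", (name,)).fetchone()

def count_by_country(country):
    """Cross-shard queries must be scattered and merged by the application."""
    return sum(
        db.execute("SELECT COUNT(*) FROM users WHERE country = ?",
                   (country,)).fetchone()[0]
        for db in shards)

for name, country in [("ann", "IT"), ("bob", "US"), ("carla", "IT"), ("dev", "IN")]:
    insert_user(name, country)
```

Note that `count_by_country` is ordinary application code rather than one SQL query, and that nothing here provides a transaction spanning two shards, which is the loss the slide describes.
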
  60. The emergence of NoSQL
     • Two companies in particular, Google and Amazon, have been very influential.
     • They were capturing large amounts of data, and their business is built on data management.
     • Both companies produced influential papers:
       – BigTable from Google
       – Dynamo from Amazon
  61. The emergence of NoSQL
     • As part of the innovation in data management systems, several new technologies were built:
       – 2003 - Google File System,
       – 2004 - MapReduce,
       – 2006 - BigTable,
       – 2007 - Amazon Dynamo,
       – 2012 - Google Cloud Engine.
     • Each solved different use cases and had a different set of assumptions.
     • All these mark the beginning of a different way of thinking about data management.
  62. The emergence of NoSQL
     • It is ironic that the term “NoSQL” first appeared in the late ’90s as the name of a relational database made by Carlo Strozzi.
     • The name comes from the fact that it did not use SQL as its query language.
     • However, the usage of “NoSQL” as we understand it today comes from a meetup held in 2009 in San Francisco.
     • They wanted a term that could be used as a Twitter hashtag: #NoSQL
  63. NoSQL Characteristics
     1. They don’t use SQL (HBase, Cassandra, Redis…).
     2. They are generally open-source projects.
     3. Most of them are designed to run on clusters.
     4. RDBMSs use ACID transactions to handle consistency across the whole database. NoSQL databases resort to other options (CAP theorem).
     5. Not all are cluster-oriented (graph DBs).
     6. NoSQL databases operate without a schema (schema-free).
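
The schema-free characteristic in the list above can be illustrated with a plain Python dictionary standing in for a key-value or document store. The `store`, the `put` helper and the sample records are hypothetical; the sketch only shows that records under similar keys need not share the same fields.

```python
# Hypothetical in-memory key-value/document store: no schema is declared anywhere.
store = {}

def put(key, document):
    """Save a document under a key without validating its structure."""
    store[key] = document

# Records in the same logical collection may carry different fields.
put("user:1", {"name": "Ann", "email": "ann@example.com"})
put("user:2", {"name": "Bob", "phones": ["555-0100", "555-0101"]})
put("user:3", {"name": "Carla", "email": "carla@example.com", "vip": True})

def emails():
    """The application, not the database, copes with missing fields."""
    return sorted(doc["email"] for doc in store.values() if "email" in doc)

contact_list = emails()
```

The flexibility comes at a price: the knowledge of which fields exist moves from the database schema into the code that reads the data.
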
  64. The emergence of NoSQL
     • NoSQL does not stand for “Not Only SQL”.
     • It is better to think of NoSQL as a movement rather than a technology.
     • RDBMSs are not going away.
     • The change is that relational databases are now just one option.
     • This point of view is often referred to as polyglot persistence.
  65. The emergence of NoSQL
     • Instead of just picking a relational database, we need to understand:
       1. The nature of the data we are storing, and
       2. How we want to manipulate it.
     • In order to deal with this change, most organizations need to shift from integration databases to application databases.
     • In this course we concentrate on Big Data running on clusters.
  66. The emergence of NoSQL
     • Big Data concerns have created an opportunity for people to think freshly about their storage needs.
     • NoSQL helps developer productivity by simplifying database access, even when there is no need to scale beyond a single machine.