"How I Learned to Stop Worrying and Love the Bomb"

WIVE 2010
An unhinged computer scientist, known
as TimBL, has invented the WWW,
plunging the world into an information
vortex…

Now everybody is fighting to prevent
the knowledge apocalypse...

Recently, Dr. StrangeCloud, a former
mainframe virtualization specialist, has
been called to the rescue…




   (story inspired by Kubrick's Dr. Strangelove)
Is Dr. Strangecloud going to save the
planet from the ever increasing danger of
death by information overload ?
Web Intelligence
   within/for VE
  Virtual Organiza

  resource sharing

Cloud based WI for VE
Looking for YACCP ?
   Yet Another
Cloud Computing
  Presentation ?

 You'd better check the specialists instead…
NOT (only)
"Intelligence on Web data"
as a research &
application field

(COMPSAC 2000, Taiwan)
Crossing hot topics

(diagram: WEB INTELLIGENCE at the center, surrounded by web mining, web information retrieval, semantic web and cloud computing (our focus today), framed by Artificial Intelligence and Information Technology)
Hot, mild or cold ?
(based on Wikipedia article popularity)

6 month trends for Wikipedia pages Web Intelligence, Cloud Computing and Lady Gaga
cumulated Wikipedia page views, Jan => June 2010
(source : access statistics from wikipedia's squid cluster as compiled by http://stats.grok.se/)

(chart: monthly page views, jan to jun, logarithmic scale from 100 to 10 000 000)

Lady Gaga            11 344 529
Cloud Computing       1 911 127
Web Intelligence          1 632
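To put the gap in perspective, the ratios between the three totals above can be worked out directly. This is plain arithmetic on the figures quoted on the slide; the variable and function names are chosen here for illustration only.

```python
import math

# Cumulated Wikipedia page views, Jan-Jun 2010, as quoted on the slide.
page_views = {
    "Lady Gaga": 11_344_529,
    "Cloud Computing": 1_911_127,
    "Web Intelligence": 1_632,
}

baseline = page_views["Web Intelligence"]


def ratio_to_baseline(topic: str) -> float:
    """How many times more views a topic got than 'Web Intelligence'."""
    return page_views[topic] / baseline


def orders_of_magnitude(topic: str) -> float:
    """The same gap expressed on a log10 scale, like the chart's axis."""
    return math.log10(ratio_to_baseline(topic))


for topic in page_views:
    print(f"{topic}: x{ratio_to_baseline(topic):,.0f} "
          f"({orders_of_magnitude(topic):.1f} orders of magnitude)")
```

Cloud Computing sits roughly three orders of magnitude above Web Intelligence, and Lady Gaga close to four.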
But hot local
Web Intelligence
    recipe




www.web-intelligence-rhone-alpes.org
cloud based domain knowledge repository enrichment
(use case in FP7 project proposal)

web crawler : millions crawls/month
semantic extractor
triple store : 90M triples initially
2.5M manual triple inputs/year
publication in LOD cloud
UKWA
                                                   by The British Library

                                      crawl, annotate , preserve

                                     visual analysis & navigation


          powered by
         IBM BigSheets
on British Library private cloud
 (demo on www.webarchive.org.uk/analytics/analytics.htm)
(details on news.cnet.com/8301-13846_3-10459507-62.html)
Public Terabyte Dataset
by Bixo Labs

50-200M pages from the 1M top US domains

powered by Bixo on AWS cloud
(Hadoop, Elastic MapReduce, S3, SimpleDB, Cascading, Tika, Avro)

not yet available (09/2010)
big corpus ready for AWS based analysis
(WI research, evaluation, ...)
the Web Intelligence paradox
All the Web data is at hand, ready for WI research and applications

2 simple steps :
pick up the data
process it with all those marvelous ML algorithms...

Wait a minute, it's not that simple ! What about :
scale ?
heterogeneity ? (aka "crappiness")
politeness ?
copyright ?
Use the Semantic Web?
 Looking for semantic annotations in 82k web pages
           (Squido production systems, 01/2010)




less than 3%
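A measurement like the one above could be sketched as follows. This is not the Squido implementation: the slide does not say which annotation formats were checked, so the attribute list below (RDFa and HTML microdata attributes) and all names are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumption: attributes treated as evidence of semantic markup
# (RDFa and HTML microdata). The slide does not name the formats checked.
SEMANTIC_ATTRS = {"property", "typeof", "about",
                  "itemscope", "itemprop", "itemtype"}


class AnnotationDetector(HTMLParser):
    """Flags a page as annotated once any tag carries a semantic attribute."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        if any(name in SEMANTIC_ATTRS for name, _ in attrs):
            self.found = True


def has_semantic_annotations(html: str) -> bool:
    detector = AnnotationDetector()
    detector.feed(html)
    detector.close()
    return detector.found


def annotated_share(pages) -> float:
    """Fraction of pages carrying at least one semantic attribute."""
    pages = list(pages)
    if not pages:
        return 0.0
    return sum(has_semantic_annotations(p) for p in pages) / len(pages)
```

Run over a crawl sample, `annotated_share` gives the kind of percentage quoted on the slide, within the limits of whatever attribute list is chosen.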
kind of real world WI process

millions of pages : crawl, clean (ML, ...), process (ML, ...)
requires : dedicated bandwidth, lots of memory, lots of I/O, lots of threads, lots of CPU
typical load pattern for distributed computing
Load may scale up or down considerably with crawl size

consider Cloud Computing :
when testing/calibrating
in production if no crawl limits
1 .45 automatic.
2 boxes of ammunition.
4 days' concentrated emergency rations.
1 drug issue containing antibiotics, morphine, vitamin pills, pep pills, sleeping pills, tranquilizer
pills.
1 miniature combination Russian phrase book and Bible.
100 dollars in rubles.
100 dollars in gold.
9 packs of chewing gum.
1 issue of prophylactics.
3 lipsticks.
3 pairs of nylon stockings.
Build from other's
Top 10 Lessons Learned from Deploying Hadoop in a Private Cloud
           (Rod Cope, OpenLogic's CTO, CloudSlam'10)
"Cloud computing is a trap"
              warns GNU founder Richard Stallman

                                              "It's stupidity.
                                             It's worse than
                                             stupidity: it's a
                                             marketing hype
                                                campaign."

(www.guardian.co.uk/technology/2008/sep/29/cloud.computing.richard.stallman)


=> we can still consider private cloud+OSS
web-scale
distributed crawl OSS
      not mature
                        (Heritrix Cluster Controller build server exception)




Cloud OSS on the rise

                        (www.blackducksoftware.com/oss/projects/#cloud)



OSS stack for DC/DML
    under active
    development
Compare prices

$330k/year


 $33k/year



              (Rod Cope, OpenLogic's CTO, CloudSlam'10)
Crawling
is the launch pad
 in Web Intelligence
                       Don't take it easy !
                          Get yourself
                        a decent crawler
Crawling by millions is not trivial...

many large objects in memory : transient ? persistent ?
www crappiness means endless ugly special cases
customizable revisit policy ?
politeness is challenging
DDOS is at the corner
with (poor) cloud based crawling
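One way to keep a distributed crawl from hammering a single site is to enforce a minimum delay per host and to make sure each host is handled by exactly one worker. The sketch below is a minimal illustration under those assumptions; `PolitenessGate`, `worker_for` and the 30 second default are hypothetical and do not come from the talk.

```python
import hashlib
from urllib.parse import urlsplit


class PolitenessGate:
    """Tracks, per host, the earliest time the next fetch is allowed."""

    def __init__(self, min_delay_s: float = 30.0):
        # Illustrative default only; real crawlers tune this per site.
        self.min_delay_s = min_delay_s
        self._next_allowed = {}  # host -> earliest timestamp for next request

    def try_acquire(self, url: str, now: float) -> bool:
        """Return True and reserve the host if it may be fetched at `now`."""
        host = urlsplit(url).hostname or ""
        if now < self._next_allowed.get(host, 0.0):
            return False  # too soon: the caller should requeue this URL
        self._next_allowed[host] = now + self.min_delay_s
        return True


def worker_for(url: str, n_workers: int) -> int:
    """Route every URL of a host to the same worker (stable hash of the host).

    Without this, N workers that are each individually polite can still
    hit one site N times as often as intended.
    """
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_workers
```

A worker would call `try_acquire(url, time.monotonic())` before each fetch and push the URL back on its queue when the call returns False.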
Infrastructure is not always key to perfs

Organic effect of politeness on performance : fetch rate drops over time
(ken-blog.krugler.org)

1,264,539 URLs from 41,978 unique domains
10 slaves cluster
4000 active fetch threads max

brute force
opportunity to scale down !
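Why the fetch rate drops can be seen with a toy model: if politeness allows at most one fetch per domain per interval, throughput is capped by the number of domains that still have URLs queued, whatever the cluster size. The function below is a simplified sketch of that effect, not a reproduction of the measurements cited above; its name and parameters are hypothetical.

```python
def fetch_rate_over_time(urls_per_domain, max_fetches_per_step):
    """Fetches completed in each politeness interval until all queues are empty.

    urls_per_domain: number of queued URLs for each domain.
    max_fetches_per_step: cluster capacity per interval (threads, bandwidth).
    """
    remaining = sorted((n for n in urls_per_domain if n > 0), reverse=True)
    rates = []
    while remaining:
        # At most one URL per domain per interval, capped by capacity.
        # Serving the longest queues first is an arbitrary choice here.
        served = min(len(remaining), max_fetches_per_step)
        for i in range(served):
            remaining[i] -= 1
        remaining = sorted((n for n in remaining if n > 0), reverse=True)
        rates.append(served)
    return rates


# Four domains with uneven queues, capacity for three fetches per interval.
print(fetch_rate_over_time([5, 2, 1, 1], 3))
```

Once the small domains are exhausted, only the long-tail domain is left and most of the capacity sits idle, which is where scaling the cluster down becomes an option.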
a.   Cloud Computing is worth considering for WI
b.   Have a cloud survival kit
c.   Consider private cloud & OSS
d.   Compare prices
e.   Get yourself a decent crawler
f.   Don't turn into DDOS
g.   Infrastructure is not always key to perfs
"SaaS intelligence on web data, for professionals"
collect, filter, analyse, monitor, share
www.squido.fr
www.ixxo.fr
www.slideshare.net/fpouilloux
www.linkedin.com/in/fpouilloux
Photos:
1. National Nuclear Security Administration/Nevada Site Office
2. Dr. Strangelove/Original film poster by Tomi Ungerer
3. Dr. Strangelove/movie still
4. Dr. Strangelove/movie still
6. cloudslam10.com/Gartner keynote slide, National Institute of Standards and Technology web site screenshot
7. cia.gov/OHB lobby seal picture
8. amazon.com/Computational Web Intelligence book cover
10. Wikimedia Commons/Lady Gaga by petercruise
12. Wikimedia Commons/Operation Crossroads Baker in color.jpg
13. Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
14. flickr/British Library III/jovike, ibm.com/The_British_Library_and_IBM_Bi.jpg
16. Dr. Strangelove/movie still
21. Wikimedia Commons/Castle Bravo Blast.jpg
22. Dr. Strangelove/movie still
23. cloudslam10.com/OpenLogic slide
24. Dr. Strangelove/movie still
25. Wikimedia Commons/RMS iGNUcius techfest iitb.JPG
27. cloudslam10.com/OpenLogic slide
28. Wikimedia Commons/Peacekeeper_missile_after_silo_launch.jpg
31. kkrugler.files.wordpress.com/2009/05/fetch-performance2.png
32. Dr. Strangelove/movie still

Websites:
wikipedia.org
www.emse.fr/wive/
csrc.nist.gov
cloudslam10.com
www.web-intelligence-rhone-alpes.org
stats.grok.se
www.ibm.com/software/ebusiness/jstart/bigsheets
bixolabs.com/datasets/public-terabyte-dataset-project
www.openlogic.com
www.blackducksoftware.com
crawler.archive.org
www.apache.org
twitter.com
ken-blog.krugler.org

Cloud based Web Intelligence

  • 1. "How I Learned to Stop Worrying and Love the Bomb" WIVE 2010
  • 2. An unhinged computer scientist, known as TimBL, has invented the WWW, plunging the world into an information vortex… Now everybody is fighting to prevent the knowledge apocalypse... Recently, Dr. StrangeCloud, a former mainframe virtualization specialist, has been called to the rescue… (story inspired by Kubrick's Dr. Strangelove)
  • 3. Is Dr. StrangeCloud going to save the planet from the ever-increasing danger of death by information overload?
  • 4.
  • 5. Web Intelligence within/for VE Virtual Organiza resource sharing Cloud based WI for VE
  • 6. Looking for YACCP (Yet Another Cloud Computing Presentation)? You'd better check the specialists instead…
  • 8. as a research & application field (COMPSAC 2000, Taiwan)
  • 9. Crossing hot topics: WEB INTELLIGENCE, between Artificial Intelligence and Information Technology, at the crossing of web mining, web information retrieval, semantic web and cloud computing (our focus today)
  • 10. Hot, mild or cold? (based on Wikipedia article popularity) Cumulated Wikipedia page views, Jan to June 2010: Lady Gaga 11 344 529; Cloud Computing 1 911 127; Web Intelligence 1 632. [Chart: 6-month trends for the Wikipedia pages Web Intelligence, Cloud Computing and Lady Gaga, Jan to Jun, logarithmic scale from 100 to 10 000 000] (source: access statistics from Wikipedia's squid cluster as compiled by http://stats.grok.se/)
  • 11. But hot local Web Intelligence recipe www.web-intelligence-rhone-alpes.org
  • 12.
  • 13. cloud based domain knowledge repository enrichment (use case in FP7 project proposal): web crawler (millions of crawls/month); semantic triple extractor; triple store (90M triples initially; manual input 2.5M triples/year); publication in LOD cloud
  • 14. UKWA by The British Library: crawl, annotate, preserve; visual analysis & navigation; powered by IBM BigSheets on British Library private cloud (demo on www.webarchive.org.uk/analytics/analytics.htm) (details on news.cnet.com/8301-13846_3-10459507-62.html)
  • 15. Public Terabyte Dataset by Bixo Labs: 50-200M pages from the 1M top US domains; powered by Bixo on AWS cloud (Elastic MapReduce, S3, SimpleDB) with Hadoop, Cascading, Tika, Avro; not yet available (09/2010); big corpus ready for AWS based analysis (WI research, evaluation, ...)
  • 16.
  • 17. the Web Intelligence paradox: All the Web data is at hand, ready for WI research and applications. 2 simple steps: pick up the data, process it with all those marvelous ML algorithms... Wait a minute, it's not that simple! What about: politeness? scale? heterogeneity (aka "crappiness")? copyright?
  • 18. Use the Semantic Web? Looking for semantic annotations in 82k web pages (Squido production systems, 01/2010): less than 3%
  • 19. kind of real world WI process: millions of pages; dedicated bandwidth; crawl; lots of memory; lots of I/O; clean (ML, ...); lots of threads; process; lots of CPU (ML, ...)
  • 20. typical load pattern for distributed computing
  • 21. Load may scale up or down considerably with crawl size. Consider Cloud Computing when testing/calibrating, and in production if no crawl limits
  • 22. 1 .45 automatic. 2 boxes of ammunition. 4 days' concentrated emergency rations. 1 drug issue containing antibiotics, morphine, vitamin pills, pep pills, sleeping pills, tranquilizer pills. 1 miniature combination Russian phrase book and Bible. 100 dollars in rubles. 100 dollars in gold. 9 packs of chewing gum. 1 issue of prophylactics. 3 lipsticks. 3 pairs of nylon stockings.
  • 23. Build from others: "Top 10 Lessons Learned from Deploying Hadoop in a Private Cloud" (Rod Cope, OpenLogic's CTO, CloudSlam'10)
  • 24.
  • 25. "Cloud computing is a trap" warns GNU founder Richard Stallman "It's stupidity. It's worse than stupidity: it's a marketing hype campaign." (www.guardian.co.uk/technology/2008/sep/29/cloud.computing.richard.stallman) => we can still consider private cloud+OSS
  • 26. web-scale distributed crawl OSS not mature (Heritrix Cluster Controller build server exception); Cloud OSS on the rise (www.blackducksoftware.com/oss/projects/#cloud); OSS stack for DC/DML under active development
  • 27. Compare prices: $330k/year vs. $33k/year (Rod Cope, OpenLogic's CTO, CloudSlam'10)
  • 28. Crawling is the launch pad in Web Intelligence. Don't take it easy! Get yourself a decent crawler
  • 29. Crawling by millions is not trivial... many large objects in memory: transient? persistent? www crappiness means endless ugly special cases; customizable revisit policy? politeness is challenging
  • 30. DDoS is around the corner with (poor) cloud based crawling
  • 31. Infrastructure is not always key to perfs. Organic effect of politeness on performance: fetch rate drops over time (ken-blog.krugler.org). 1,264,539 URLs from 41,978 unique domains; 10 slaves cluster; 4000 active fetch threads max. Brute force; opportunity to scale down!
  • 32. a. Cloud Computing is worth considering for WI b. Have a cloud survival kit c. Consider private cloud & OSS d. Compare prices e. Get yourself a decent crawler f. Don't turn into DDOS g. Infrastructure is not always key to perfs
  • 33. "SaaS intelligence on web data, for professionals": collect, share, filter, monitor, analyse. www.squido.fr
  • 35. Photos: 1. National Nuclear Security Administration/Nevada Site Office; 2. Dr. Strangelove/Original film poster by Tomi Ungerer; 3. Dr. Strangelove/movie still; 4. Dr. Strangelove/movie still; 6. cloudslam10.com/Gartner keynote slide, National Institute of Standards and Technology web site screenshot; 7. cia.gov/OHB lobby seal picture; 8. amazon.com/Computational Web Intelligence book cover; 10. Wikimedia Commons/Lady Gaga by petercruise; 12. Wikimedia Commons/Operation Crossroads Baker in color.jpg; 13. Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch, http://lod-cloud.net/; 14. flickr/British Library III/jovike, ibm.com/The_British_Library_and_IBM_Bi.jpg; 16. Dr. Strangelove/movie still; 21. Wikimedia Commons/Castle Bravo Blast.jpg; 22. Dr. Strangelove/movie still; 23. cloudslam10.com/OpenLogic slide; 24. Dr. Strangelove/movie still; 25. Wikimedia Commons/RMS iGNUcius techfest iitb.JPG; 27. cloudslam10.com/OpenLogic slide; 28. Wikimedia Commons/Peacekeeper_missile_after_silo_launch.jpg; 31. kkrugler.files.wordpress.com/2009/05/fetch-performance2.png; 32. Dr. Strangelove/movie still. Websites: wikipedia.org, www.emse.fr/wive/, csrc.nist.gov, cloudslam10.com, www.web-intelligence-rhone-alpes.org, stats.grok.se, www.ibm.com/software/ebusiness/jstart/bigsheets, bixolabs.com/datasets/public-terabyte-dataset-project, www.openlogic.com, www.blackducksoftware.com, crawler.archive.org, www.apache.org, twitter.com, ken-blog.krugler.org
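
Slides 29 to 31 argue that politeness is challenging and that the fetch rate drops over time whatever the cluster size. The sketch below is an illustration only, not the crawler used in the talk. It assumes a fixed minimum delay per host (the `per_host_delay` parameter and both function names are hypothetical) and runs on a simulated clock with no network access. It shows why extra threads stop helping once the remaining URLs concentrate on a few hosts.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def schedule_fetches(urls, per_host_delay=1.0):
    """Return (start_time, url) pairs honoring a minimum delay per host.

    Simulated clock: nothing is fetched and nothing sleeps. Each host has
    its own timeline, so different hosts may share the same start time.
    """
    next_free = defaultdict(float)  # host -> earliest allowed fetch time
    plan = []
    for url in urls:
        host = urlsplit(url).hostname or ""
        start = next_free[host]
        plan.append((start, url))
        next_free[host] = start + per_host_delay
    # Stable sort: URLs with equal start times keep their input order.
    return sorted(plan, key=lambda item: item[0])


def crawl_duration(urls, per_host_delay=1.0):
    """Start time of the last scheduled fetch (a lower bound on crawl time)."""
    plan = schedule_fetches(urls, per_host_delay)
    return plan[-1][0] if plan else 0.0


urls = [
    "http://a.example/1",
    "http://a.example/2",
    "http://a.example/3",
    "http://b.example/1",
]
# The busiest host (3 URLs, 2 s apart) sets the duration: 0 s, 2 s, 4 s.
print(crawl_duration(urls, per_host_delay=2.0))  # 4.0
```

With unlimited threads the crawl still cannot finish before the busiest host is drained, which is one way to read "opportunity to scale down" on slide 31.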