The Google Cloud Platform
Accelerator
Release v2020.04.07 16.45
Abdelkrim B., Gabygaël P.
Apr 07, 2020
CONTENTS
1 The Accelerator
1.1 The Hybrid cloud infrastructure
1.2 How does the Accelerator work?
1.3 The design of the BI Architecture
2 Business Intelligence Architecture
2.1 BI Solution Architecture
3 Google Cloud Architecture
4 Technologies used by the Accelerator
4.1 Wrappers for Web Data Extraction
5 What are the possible Web Data Extraction Use Cases?
6 Capabilities of the Accelerator
7 Citations
8 Glossary for the Accelerator
9 License
Bibliography
Index
CHAPTER
ONE
THE ACCELERATOR
The Accelerator is an IT infrastructure that collects and analyzes massive amounts of public data from the World
Wide Web.
The Accelerator leverages the untapped potential of web data with the first solution designed for diverse sectors
that is completely scalable, available on-premise, and cloud-provider agnostic.
1.1 The Hybrid cloud infrastructure
The IT infrastructure supporting the Accelerator is a hybrid cloud solution that includes:
• Several servers deployed on-premise that run the scrapers. The scrapers collect raw data from websites and
push it to the cloud architecture for large-scale analysis.
• Several servers deployed on the public cloud that:
1. store and analyze a large volume of data
2. give third parties access to the raw data and to the data analyzed by the Accelerator through secured
Web APIs
1.2 How does the Accelerator work?
[Figure: The Accelerator, from data identified on the World Wide Web to insightful reports. Stages: 1. choose which
data to scrape; 2-5. Extract (the Accelerator scrapes the web); 6. Transform (data cleansing and data
transformation); 7. Load (insertion of data into a data mart, a data lake and a data warehouse); 8-9. Build
intelligence (reporting).]
1. The user of the web application selects the website that she wants to scrape
2. The scrapers continuously ask the scrap-center which URL they have to scrape
3. The scrap-center tells each scraper which URL it should download data from
4. The scraper performs its task, e.g., downloading the HTML from the URL, taking a screenshot of the
HTML page, or downloading the PDF available on the page
5. The scraper uploads the collected data to the scrap-center
6. The scrap-center starts the ETL (Extract-Transform-Load) process, which includes data cleansing and
data transformation
7. The scrap-center
• stores the raw data in the data lake
• extracts the low-velocity metadata, which does not change often (e.g., the product name, the URL where
the product is available), and stores it in the Cloud Datastore
• stores the screenshot on a file system
• stores the high-velocity data, which changes often, up to near real-time (e.g., the product price, the
promotion price), in the data warehouse
Now the data are available to the end-user.
8. The user selects the dashboard of the report she wants to analyze or print
9. The web application generates the report based on the data stored in the data lake, the data warehouse and the
Cloud Datastore
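Steps 2 to 7 above can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions: the class ScrapCenter, the functions run_scraper, fake_fetch and fake_extract, and the example URLs are hypothetical names, and plain Python containers stand in for the data lake, the Cloud Datastore and the data warehouse. It is not the Accelerator's actual API.

```python
from collections import deque


def fake_extract(html):
    """Stand-in for the ETL step (step 6); a real one would parse the HTML."""
    return {"name": "Demo product", "price": 9.99}


class ScrapCenter:
    """In-memory stand-in for the scrap-center (hypothetical interface)."""

    def __init__(self, urls):
        self.pending_urls = deque(urls)
        self.data_lake = {}   # raw HTML, keyed by URL
        self.datastore = {}   # low-velocity metadata (name, URL)
        self.warehouse = []   # high-velocity rows (prices)

    def next_task(self):
        """Steps 2-3: hand the next URL to a scraper, or None when idle."""
        return self.pending_urls.popleft() if self.pending_urls else None

    def upload(self, url, html):
        """Steps 5-7: receive raw data, run the ETL stub, route the results."""
        record = fake_extract(html)
        self.data_lake[url] = html
        self.datastore[url] = {"name": record["name"], "url": url}
        self.warehouse.append({"url": url, "price": record["price"]})


def fake_fetch(url):
    """Step 4 stub: returns canned HTML instead of downloading the page."""
    return "<html><body>" + url + "</body></html>"


def run_scraper(center, fetch):
    """Poll the scrap-center until no task is left; return the pages scraped."""
    scraped = 0
    while True:
        url = center.next_task()
        if url is None:
            break
        center.upload(url, fetch(url))
        scraped += 1
    return scraped


center = ScrapCenter(["https://retail.example/p/1", "https://retail.example/p/2"])
pages = run_scraper(center, fake_fetch)
```

In a real deployment the scraper would poll over a secured Web API and keep polling when idle instead of stopping; the loop exits here only so that the sketch terminates.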
1.3 The design of the BI Architecture
[Figure: design of the BI architecture, in four columns: internet websites, on-premise architecture, public cloud
architecture, users. Scrapers 1 to N scrape HTML data, screenshots and metadata from internet websites (e.g.
retail.com, clothes.com, shoes.com). The scrapers request tasks from the Scrap-Center and push raw data (HTML,
screenshots) to it. The Scrap-Center stores data in the Data Lake, the Data Warehouse and the Datastore. The Web
application collects and analyzes data from these three stores. From a desktop browser, users choose which
websites to scrape and get reports through the Dashboard.]
CHAPTER
TWO
BUSINESS INTELLIGENCE ARCHITECTURE
2.1 BI Solution Architecture
The chart describes the interactions between the modules of the Accelerator, from scraping the data to
generating a BI report.
[Figure: BI solution architecture. Labels recoverable from the chart:]
• Scrapers (Scraper 1, Scraper 2, ..., Scraper N): request a task from the Scrap-Center and push raw data
(HTML, screenshots)
• Scrap-Center (App Engine): API interface, Command
• Analyzer (App Engine): API interface, Publish (e.g. a change of price)
• Dataflow (Apache Beam): Subscribe
• Data Warehouse (BigQuery)
• Data Lake: stores HTML and images
• Datastore: stores the technical metadata of each HTML page (e.g. when was it scraped?)
• Datastore: stores the structured version of the data extracted from the HTML
• Authorization: Firebase
• Administration interface and Web Application interface: Deliver
The scrapers navigate the web with Selenium.
Plugins available:
- retrieve the content
- cut-request: blacklists of websites we do not want to inform, e.g. facebook, datadome
Proxy VPN available to
- circumvent captcha
- switch IP address to behave like humans
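One plausible reading of the cut-request plugin is a filter that drops outgoing requests to blacklisted domains before the browser sends them. The sketch below illustrates that reading only. The names BLOCKED_DOMAINS, is_blocked and filter_requests are hypothetical, and the domain strings are assumptions chosen to match the two examples named above.

```python
from urllib.parse import urlsplit

# Hypothetical blacklist: the source only names "facebook" and "datadome".
BLOCKED_DOMAINS = {"facebook.com", "datadome.co"}


def is_blocked(url, blocked=BLOCKED_DOMAINS):
    """Return True when the URL's host is a blocked domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain) for domain in blocked)


def filter_requests(urls):
    """Keep only the requests that may leave the scraper."""
    return [url for url in urls if not is_blocked(url)]


requests_seen = [
    "https://shoes.example/product/42",
    "https://connect.facebook.com/sdk.js",
    "https://js.datadome.co/tags.js",
]
allowed = filter_requests(requests_seen)
```

Matching on the parsed host name rather than on a substring of the URL avoids blocking unrelated hosts whose name merely contains a blacklisted word.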
CHAPTER
THREE
GOOGLE CLOUD ARCHITECTURE
The Accelerator uses the following GCP (Google Cloud Platform) services:
• Cloud Firestore
• App Engine
• Cloud Datastore
• Cloud Dataflow
• BigQuery
CHAPTER
FOUR
TECHNOLOGIES USED BY THE ACCELERATOR
4.1 Wrappers for Web Data Extraction
As the quantity and diversity of the information available online increase, more of the typical information access
tasks are performed by programs such as web wrappers.
Wrappers “facilitate access to Web-based information sources by providing a uniform querying and data extraction
capability.” [KO2018]
Why are web scrapers useful?
For example, a Web wrapper for an e-commerce website can take a query for a product and extract its
description in the same way as information is extracted from a database:
• the price or the previous price,
• the promotion price,
• the description,
• the date when the data were collected,
• and the bar code.
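The idea of a wrapper that returns a product as a uniform record can be sketched with the Python standard library. This is a minimal illustration, not the Accelerator's implementation: the Product record, the wrap_product function and the data-field attribute used to mark up the sample HTML are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from html.parser import HTMLParser
from typing import Optional


@dataclass
class Product:
    price: Optional[float]
    promotion_price: Optional[float]
    description: Optional[str]
    collected_at: datetime
    bar_code: Optional[str]


class _FieldParser(HTMLParser):
    """Collect the text of elements carrying a (hypothetical) data-field attribute."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        self._current = dict(attrs).get("data-field")

    def handle_data(self, data):
        if self._current and data.strip():
            self.fields[self._current] = data.strip()

    def handle_endtag(self, tag):
        self._current = None


def wrap_product(html):
    """Query one product page and return it as a uniform record."""
    parser = _FieldParser()
    parser.feed(html)
    fields = parser.fields

    def as_float(key):
        return float(fields[key]) if key in fields else None

    return Product(
        price=as_float("price"),
        promotion_price=as_float("promotion_price"),
        description=fields.get("description"),
        collected_at=datetime.now(timezone.utc),
        bar_code=fields.get("bar_code"),
    )


SAMPLE_HTML = """
<div>
  <span data-field="price">19.90</span>
  <span data-field="promotion_price">14.90</span>
  <p data-field="description">Running shoes</p>
  <span data-field="bar_code">5412345678908</span>
</div>
"""
product = wrap_product(SAMPLE_HTML)
```

Real websites do not share one markup convention, so in practice each source would need its own extraction rules behind the same Product interface.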
CHAPTER
FIVE
WHAT ARE THE POSSIBLE WEB DATA EXTRACTION USE
CASES?
Here is a non-exhaustive list of use cases for which web scraping has proven to be a useful solution.
Table 1: Web Data Extraction Use Cases - What kind of project are you working on?
Market research | Price intelligence | Lead generation
Brand monitoring | An alternative data source for the finance industry | Recruitment
Business automation | MAP violations | Fraud detection
CHAPTER
SIX
CAPABILITIES OF THE ACCELERATOR
Business-minded
All data are stored to enable an ex-post analysis
Cost-effective
The software developer accesses all data using a straightforward interface independently of the storage
Customization
Implement any business rule on the data
Add a new data source at any time
Choose the interfaces of the REST APIs
Extensible
Add new functionalities to the ETL process
Scalable
Scrape as many websites as you want
Add as much data as you want
High-performance
The Accelerator starts servers when necessary to support the load
Secure
Authentication and Identification
All connections between the components are secured
Optimal
Cache the data in memory to speed up the process
Fault-tolerant
Historical data are stored for future analysis
Recovery of lost data
Source code
Built by a Belgian company, owner of the source code
The Accelerator relies on open source code and a proprietary cloud infrastructure
CHAPTER
SEVEN
CITATIONS
CHAPTER
EIGHT
GLOSSARY FOR THE ACCELERATOR
Apache Beam Apache Beam is an open source unified programming model to define and execute data processing
pipelines, including ETL, batch and stream (continuous) processing
Source : Wikipedia contributors. (2020, February 10). Apache Beam. In Wikipedia, The Free Encyclopedia.
Retrieved 15:23, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Apache_Beam&oldid=940068914
App Engine
Google App Engine “Google App Engine is a Platform as a Service and cloud computing platform for developing
and hosting web applications in Google-managed data centers. Applications are sandboxed and
run across multiple servers. App Engine offers automatic scaling for web applications—as the number
of requests increases for an application, App Engine automatically allocates more resources for the web
application to handle the additional demand”
Source : Wikipedia contributors. (2020, March 5). Google App Engine. In Wikipedia, The Free Encyclopedia.
Retrieved 15:22, April 2, 2020, from https://en.wikipedia.org/w/index.php?title=Google_App_Engine&oldid=944010303
Authorship “Copyright law protects authorship intended as the expression of an original work created by an
author. This generally applies to literary, musical, artistic, and other intellectual works.”
Source : http://www.iprhelpdesk.eu/sites/default/files/newsdocuments/Fact-Sheet-Inventorship-Authorship-Ownership.pdf
BI
Business intelligence “Business intelligence (BI) comprises the strategies and technologies used by enterprises
for the data analysis of business information. BI technologies provide historical, current, and predictive
views of business operations. Common functions of business intelligence technologies include reporting,
online analytical processing, analytics, data mining, process mining, complex event processing, business
performance management, benchmarking, text mining, predictive analytics, and prescriptive analytics.”
Source : Wikipedia contributors. (2020, March 16). Business intelligence. In Wikipedia, The Free Encyclopedia.
Retrieved 15:23, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Business_intelligence&oldid=945780339
BigQuery “BigQuery is a fully-managed data warehouse on RESTful web service that enables scalable,
cost-effective and fast analysis of big data working in conjunction with Google Cloud Storage.
It is a serverless Software as a Service (SaaS) that may be used complementarily with MapReduce. It also
has built-in machine learning capabilities.”
Source : Wikipedia contributors. (2020, February 21). BigQuery. In Wikipedia, The Free Encyclopedia.
Retrieved 15:23, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=BigQuery&oldid=941896656
Captcha “A CAPTCHA (an acronym for “Completely Automated Public Turing test to tell Computers and Humans
Apart”) is a type of challenge–response test used in computing to determine whether or not the user is
human...[A] CAPTCHA requires someone to correctly evaluate and enter a sequence of letters or numbers
perceptible in a distorted image displayed on their screen”
Source : Wikipedia contributors. (2020, March 25). CAPTCHA. In Wikipedia, The Free Encyclopedia.
Retrieved 15:24, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=CAPTCHA&oldid=947308972
CCPA “The California Consumer Privacy Act (CCPA) is a state statute intended to enhance privacy rights and
consumer protection for residents of California, United States.
The intentions of the Act are to provide California residents with the right to:”
• Know what personal data is being collected about them.
• Know whether their personal data is sold or disclosed and to whom.
• Say no to the sale of personal data.
• Access their personal data.
• Request a business to delete any personal information about a consumer collected from that consumer.
• Not be discriminated against for exercising their privacy rights.
Source : Wikipedia contributors. (2020, March 25). California Consumer Privacy Act. In Wikipedia, The
Free Encyclopedia. Retrieved 15:24, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=California_Consumer_Privacy_Act&oldid=947332131
See the CCPA GDPR Chart by Thomson Reuters, which compares some of the key requirements of the
California Consumer Privacy Act (CCPA) and the EU General Data Protection Regulation (GDPR).
Cloud “Cloud computing is the on-demand availability of computer system resources, especially data storage
and computing power, without direct active management by the user. The term is generally used to describe
data centers available to many users over the Internet”
Source : Wikipedia contributors. (2020, March 24). Cloud computing. In Wikipedia, The Free Encyclopedia.
Retrieved 15:24, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Cloud_computing&oldid=947158114
Cloud Dataflow “Google Cloud Dataflow is a fully managed service for executing Apache Beam
pipelines within the Google Cloud Platform ecosystem.”
Source : Wikipedia contributors. (2019, November 27). Google Cloud Dataflow. In Wikipedia, The Free
Encyclopedia. Retrieved 15:25, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Google_Cloud_Dataflow&oldid=928227601
Cloud Datastore “Google Cloud Datastore (Cloud Datastore) is a highly scalable, fully managed
NoSQL database service offered by Google on the Google Cloud Platform.”
Source : Wikipedia contributors. (2019, November 27). Google Cloud Datastore. In Wikipedia, The Free
Encyclopedia. Retrieved 15:25, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Google_Cloud_Datastore&oldid=928227557
Cloud Firestore “Cloud Firestore is a flexible, scalable database for mobile, web, and server development from
Firebase and Google Cloud Platform.”
Source : https://firebase.google.com/docs/firestore
Data Cleansing “Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt
or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect,
inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse
data”
Source : Wikipedia contributors. (2020, March 3). Data cleansing. In Wikipedia, The Free Encyclopedia.
Retrieved 15:26, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Data_cleansing&oldid=943697218
Data lake “A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or
files.
A data lake is usually a single store of all enterprise data including raw copies of source system data and
transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning.
A data lake can include structured data from relational databases (rows and columns), semi-structured data
(CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio,
video).”
Source : Wikipedia contributors. (2020, March 3). Data lake. In Wikipedia, The Free Encyclopedia.
Retrieved 15:30, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Data_lake&oldid=943633024
Data mining
Text mining “‘text and data mining’ means any automated analytical technique aimed at analysing text and data
in digital form in order to generate information which includes but is not limited to patterns, trends and
correlations;”
Source : https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32019L0790
Data warehouse “A data warehouse (DW or DWH) is a system used for reporting and data analysis, and is
considered a core component of business intelligence. DWs are central repositories of integrated data from
one or more disparate sources. They store current and historical data in one single place that are used for
creating analytical reports for workers throughout the enterprise.
The data stored in the warehouse is uploaded from the operational systems. The data may pass through
an operational data store and may require data cleansing for additional operations to ensure data quality
before it is used in the DW for reporting.”
Source : Wikipedia contributors. (2020, March 12). Data warehouse. In Wikipedia, The Free Encyclopedia.
Retrieved 15:29, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Data_warehouse&oldid=945155021
DRM
Digital Rights Management “Digital rights management (DRM) tools ... are a set of access control technologies
for restricting the use of proprietary hardware and copyrighted works. DRM technologies try to control
the use, modification, and distribution of copyrighted works (such as software and multimedia content), as
well as systems within devices that enforce these policies”
Source : Wikipedia contributors. (2020, March 19). Digital rights management. In Wikipedia, The Free
Encyclopedia. Retrieved 15:29, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Digital_rights_management&oldid=946249128
ETL
Extract-Transform-Load “Extract, transform, load (ETL) is the general procedure of copying data from one
or more sources into a destination system which represents the data differently from the source(s) or in a
different context than the source(s)
Data extraction involves extracting data from homogeneous or heterogeneous sources;
data transformation processes data by data cleansing and transforming them into a proper storage
format/structure for the purposes of querying and analysis;
finally, data loading describes the insertion of data into the final target database such as an operational
data store, a data mart, data lake or a data warehouse.”
Source : Wikipedia contributors. (2020, March 12). Extract, transform, load. In Wikipedia, The Free
Encyclopedia. Retrieved 15:33, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Extract,_transform,_load&oldid=945159964
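The three ETL stages defined above can be illustrated with a short, self-contained sketch. It is only an example under stated assumptions: the rows in RAW_ROWS are invented sample data, the product table is hypothetical, and an in-memory SQLite database stands in for the target store (the Accelerator itself loads into the data lake, the Cloud Datastore and the data warehouse).

```python
import sqlite3

# Invented sample data standing in for scraped records.
RAW_ROWS = [
    {"name": " Sneaker ", "price": "59,90"},
    {"name": "Boot", "price": "120.00"},
    {"name": "", "price": "n/a"},  # incomplete record, dropped by cleansing
]


def extract():
    """Extract: read the raw records from the source."""
    return list(RAW_ROWS)


def transform(rows):
    """Transform: cleanse names, normalize decimal commas, drop unusable rows."""
    clean = []
    for row in rows:
        name = row["name"].strip()
        try:
            price = float(row["price"].replace(",", "."))
        except ValueError:
            continue
        if name:
            clean.append((name, price))
    return clean


def load(rows, conn):
    """Load: insert the cleansed rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS product (name TEXT, price REAL)")
    conn.executemany("INSERT INTO product VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
loaded = conn.execute("SELECT name, price FROM product ORDER BY name").fetchall()
```

Keeping the three stages as separate functions makes it possible to add a new cleansing rule or a new target store without touching the other stages.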
GDPR
RGPD “The General Data Protection Regulation (EU) 2016/679 (GDPR) is a regulation in EU law on data
protection and privacy in the European Union (EU) and the European Economic Area (EEA).”
“It also addresses the transfer of personal data outside the EU and EEA areas. The GDPR aims primarily
to give control to individuals over their personal data and to simplify the regulatory environment for
international business by unifying the regulation within the EU”
Source : Wikipedia contributors. (2020, March 23). General Data Protection Regulation. In Wikipedia, The
Free Encyclopedia. Retrieved 15:34, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=General_Data_Protection_Regulation&oldid=946999924
Hybrid cloud
Cloud Hybride “[A] hybrid cloud service [is] a cloud computing service that is composed of some combination of
private, public and community cloud services, from different service providers” [BiGartner12]
Source : Wikipedia contributors. (2020, March 24). Cloud computing. In Wikipedia, The Free Encyclopedia.
Retrieved 15:35, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Cloud_computing&oldid=947158114
MAP violations “A minimum advertised price (MAP) is the practice of a manufacturer providing marketing
funds to a retailer contingent on the retailer advertising an end customer price at or above a specified
level. Such agreements can be illegal in some countries when members and terms in the agreement match
predefined legal criteria.
Fixed pricing established between a distributor and seller or between two or more sellers may violate
antitrust laws in the United States.”
Source : Wikipedia contributors. (2020, March 16). List price. In Wikipedia, The Free Encyclopedia.
Retrieved 14:56, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=List_price&oldid=945807027
Netiquette “Netiquette is a combination of the words network and etiquette and is defined as a set of rules for
acceptable online behavior. Similarly, online ethics focuses on the acceptable use of online resources in an
online social environment.”
Source : What is Netiquette? A Guide to Online Ethics and Etiquette (n.d.). Retrieved March 26, 2020,
from https://www.webroot.com/nz/en/resources/tips-articles/netiquette-and-online-ethics-what-are-they
NoSQL A NoSQL (originally referring to “non SQL” or “non relational”) database provides a mechanism for
storage and retrieval of data that is modeled in means other than the tabular relations used in relational
databases.
Source : Wikipedia contributors. (2020, March 14). NoSQL. In Wikipedia, The Free Encyclopedia.
Retrieved 14:56, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=NoSQL&oldid=945474807
On-Premise
On-Premise Software “An On-premises software ... is installed and runs on computers on the premises of the
person or organization using the software, rather than at a remote facility such as a server farm or cloud.”
Source : Wikipedia contributors. (2019, November 28). On-premises software. In Wikipedia, The Free Encyclopedia.
Retrieved 14:57, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=On-premises_software&oldid=928327829
Residential proxy “A residential proxy is an IP address provided by an Internet Service Provider (ISP).”
Source : Buy Residential Proxies: 10M IPs - 99.99% uptime. (2020, January 29). Retrieved March 26,
2020, from https://smartproxy.com/proxies/residential-proxies
RESTful
Representational state transfer “Representational state transfer (REST) is a software architectural style that
defines a set of constraints to be used for creating Web services. Web services that conform to the REST
architectural style, called RESTful Web services, provide interoperability between computer systems on the
Internet. RESTful Web services allow the requesting systems to access and manipulate textual representations
of Web resources by using a uniform and predefined set of stateless operations.”
Source : Wikipedia contributors. (2020, February 19). Representational state transfer. In Wikipedia, The
Free Encyclopedia. Retrieved 14:57, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Representational_state_transfer&oldid=941589430
scrap-center A tool that orchestrates the scraping requests sent to the scrapers.
The scrap-center stores raw data (images, HTML) in the file system and the metadata in the NoSQL
databases.
scraper
scrapers A tool, part of the Accelerator, that scrapes the pages of a website and handles the complexity of
collecting data from websites (captcha, IP blocking...).
The scraper uses tools such as a Web Crawler to browse the World Wide Web.
scraping
web scraping Web scraping means extracting required information from a web page using code.
[WikiWebScraping20]
Reading : Jarell, E. (2018, November 26). Building a Web Scraper from start to finish. Retrieved March 26,
2020, from https://hackernoon.com/building-a-web-scraper-from-start-to-finish-bb6b95388184
Selenium “Selenium is a portable framework for testing web applications. Selenium provides a playback tool
for authoring functional tests without the need to learn a test scripting language (Selenium IDE). It also
provides a test domain-specific language (Selenese) to write tests in a number of popular programming
languages, including C#, Groovy, Java, Perl, PHP, Python, Ruby and Scala.”
Source : Wikipedia contributors. (2020, February 18). Selenium (software). In Wikipedia, The Free Encyclopedia.
Retrieved 14:59, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Selenium_(software)&oldid=941443949
VPN
Virtual Private network “A virtual private network (VPN) extends a private network across a public network,
and enables users to send and receive data across shared or public networks as if their computing devices
were directly connected to the private network”
Source : Wikipedia contributors. (2020, March 25). Virtual private network. In Wikipedia, The Free
Encyclopedia. Retrieved 14:59, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Virtual_private_network&oldid=947375734
Web API
Web APIs “Web APIs are the defined interfaces through which interactions happen between an enterprise and
applications that use its assets, which also is a Service Level Agreement (SLA) to specify the functional
provider and expose the service path or URL for its API users. An API approach is an architectural approach
that revolves around providing a program interface to a set of services to different applications serving
different types of consumers.”
Source : Wikipedia contributors. (2020, March 25). Application programming interface. In Wikipedia, The
Free Encyclopedia. Retrieved 15:00, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Application_programming_interface&oldid=947328151
Web Crawler “A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an
Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing
(web spidering).
Web search engines and some other sites use Web crawling or spidering software to update their web content
or indices of others sites’ web content. Web crawlers copy pages for processing by a search engine which
indexes the downloaded pages so users can search more efficiently.
Crawlers consume resources on visited systems and often visit sites without approval. Issues of schedule,
load, and “politeness” come into play when large collections of pages are accessed. Mechanisms exist for
public sites not wishing to be crawled to make this known to the crawling agent. For example, including a
robots.txt file can request bots to index only parts of a website, or nothing at all.”
Source : https://en.wikipedia.org/wiki/Web_crawler
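The robots.txt mechanism quoted above can be checked from Python with the standard library module urllib.robotparser. The sketch below parses an in-memory robots.txt rather than downloading one. The file content, the user agent name accelerator-bot and the shop.example URLs are assumptions made for the illustration.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content: everything is allowed except /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_checker(robots_txt):
    """Build a parser from robots.txt text instead of fetching it over HTTP."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


checker = make_checker(ROBOTS_TXT)
can_public = checker.can_fetch("accelerator-bot", "https://shop.example/products/1")
can_private = checker.can_fetch("accelerator-bot", "https://shop.example/private/report")
```

A crawler that respects the file would skip any URL for which can_fetch returns False.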
web services “A server running on a computer device, listening for requests at a particular port over a network,
serving web documents (HTML, JSON, XML, images), and creating web applications services, which serve
in solving specific domain problems over the Web (WWW, Internet, HTTP)”
Wikipedia contributors. (2020, March 25). Web crawler. In Wikipedia, The Free Encyclopedia. Retrieved
15:01, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Web_crawler&oldid=947328470
wrapper
wrappers “Wrappers facilitate access to Web-based information sources by providing a uniform querying and
data extraction capability” [KO2018]
CHAPTER
NINE
LICENSE
© 2019-2020 Abdelkrim Boujraf, ALT-F1 SPRL <http://www.alt-f1.be>
All trademarks mentioned belong to their owners. Third-party brands, product names, trade names, corporate names
and company names mentioned may be trademarks of their respective owners or registered trademarks of other
companies and are used for purposes of explanation and to the owners’ benefit, without implying a violation of
copyright law.
BIBLIOGRAPHY
[BiddersEdge100FSupp2d105800] eBay, Inc. v. Bidder’s Edge, Inc., 100 F. Supp. 2d 1058 (N.D. Cal. 2000). (n.d.). Retrieved from https://law.justia.com/cases/federal/district-courts/FSupp2/100/1058/2478126
[BiGartner12] Bittman, T. (2012, September 24). Mind the Gap: Here Comes Hybrid Cloud. Retrieved from https://blogs.gartner.com/thomas_bittman/2012/09/24/mind-the-gap-here-comes-hybrid-cloud
[DmlpNotCopy14] Works Not Covered By Copyright. (n.d.). Retrieved March 21, 2020, from https://www.dmlp.org/legal-guide/works-not-covered-copyright
[KO2018] Knoblock, C. A. (2018, February 3). Modeling Web Sources for Information Integration. Retrieved March 26, 2020, from https://github.com/usc-isi-i2/usc-isi-i2.github.io/blob/master/slides/2018-02-03-AAAI-KG-Tutorial-CK.pptx
[ScrapinghubBestPractices18] Scrapinghub. (n.d.). Guide to Web Scraping Best Practices. Retrieved March 21, 2020, from https://info.scrapinghub.com/web-scraping-guide/web-scraping-best-practices
[WebScrapingforCPI18] Web scraping and online data collection and processing for the consumer price index. (2018, February 8). Retrieved from https://statbel.fgov.be/en/news/web-scraping-and-online-data-collection-and-processing-consumer-price-index
[WikiWebScraping20] Wikipedia contributors. (2020, March 11). Web scraping. In Wikipedia, The Free Encyclopedia. Retrieved 11:16, March 19, 2020, from https://en.wikipedia.org/w/index.php?title=Web_scraping&oldid=945118241
INDEX
A: Apache Beam, App Engine, Authorship
B: BI, BigQuery, Business intelligence
C: Captcha, CCPA, Cloud, Cloud Dataflow, Cloud Datastore, Cloud Firestore, Cloud Hybride
D: Data Cleansing, Data lake, Data mining, Data warehouse, Digital Rights Management, DRM
E: ETL, Extract-Transform-Load
G: GDPR, Google App Engine
H: Hybrid cloud
M: MAP violations
N: Netiquette, NoSQL
O: On-Premise, On-Premise Software
R: Representational state transfer, Residential proxy, RESTful, RGPD
S: scrap-center, scraper, scrapers, scraping, Selenium
T: Text mining
V: Virtual Private network, VPN
W: Web API, Web APIs, Web Crawler, web scraping, web services, wrapper, wrappers

 
A "First Time Right" Start with Data Virtualization by Bart De Groeve, Practi...
A "First Time Right" Start with Data Virtualization by Bart De Groeve, Practi...A "First Time Right" Start with Data Virtualization by Bart De Groeve, Practi...
A "First Time Right" Start with Data Virtualization by Bart De Groeve, Practi...Patrick Van Renterghem
 
Data Virtualization: From Zero to Hero (Middle East)
Data Virtualization: From Zero to Hero (Middle East)Data Virtualization: From Zero to Hero (Middle East)
Data Virtualization: From Zero to Hero (Middle East)Denodo
 

 

ALT-F1.BE : The Accelerator (Google Cloud Platform)

  • 1. The Google Cloud Platform Accelerator Release v2020.04.07 16.45 Abdelkrim B., Gabygaël P. Apr 07, 2020
  • 2. CONTENTS
    1 The Accelerator
      1.1 The Hybrid cloud infrastructure
      1.2 How does the Accelerator work?
      1.3 The design of the BI Architecture
    2 Business Intelligence Architecture
      2.1 BI Solution Architecture
    3 Google Cloud Architecture
    4 Technologies used by the Accelerator
      4.1 Wrappers for Web Data Extraction
    5 What are the possible Web Data Extraction Use Cases?
    6 Capabilities of the Accelerator
    7 Citations
    8 Glossary for the Accelerator
    9 License
    Bibliography
    Index
  • 3. CHAPTER ONE: THE ACCELERATOR
    The Accelerator is an IT infrastructure able to collect and analyze a massive amount of public data on the World Wide Web.
    The Accelerator leverages the untapped potential of web data with the first solution designed for diverse sectors, completely scalable, available on-premise, and cloud-provider agnostic.
    1.1 The Hybrid cloud infrastructure
    The IT infrastructure supporting the objective of the Accelerator is a hybrid cloud solution that includes:
      • Several servers deployed on-premise, running the scrapers that collect the raw data from websites. The scrapers push that data to the cloud architecture for large-scale analysis.
      • Several servers deployed on the public cloud, which:
        1. store and analyze a large volume of data
        2. give third parties access to the raw data and to the data analyzed by the Accelerator through secured Web APIs
  • 4. 1.2 How does the Accelerator work?
    The Accelerator: from data identified on the World Wide Web to insightful reports
    Overview of the phases:
      Step 1: choose which data to scrape
      Steps 2-5 (Extract): the Accelerator scrapes the web
      Step 6 (Transform): data cleansing and data transformation
      Step 7 (Load): insertion of data into a data mart, a data lake and a data warehouse
      Steps 8-9 (Build intelligence): reporting
    Detailed steps:
      1. The user of the web application selects the website that she wants to scrape
      2. The scrapers continuously ask the scrap-center which URL they have to scrape
      3. The scrap-center tells the scraper which URL it should download data from
      4. The scraper performs its duties, e.g., downloads the HTML from the URL, takes a screenshot of the HTML page, downloads the PDF available on the page
      5. The scraper uploads the collected data to the scrap-center
      6. The ETL (Extract-Transform-Load) process starts, including the data cleansing and data transformation performed by the scrap-center
      7. The scrap-center
        • stores the raw data in the data lake
        • extracts the low-velocity metadata, which does not change often (e.g., the product name, the URL where the product is available), and stores it in Cloud Datastore
        • stores the screenshot on a file system
        • stores the high-velocity data, which changes often, up to near real time (e.g., the product price, the promotion price), in the data warehouse
      Now the data are available to the end user.
      8. The user selects the dashboard of the report she wants to analyze or print
      9. The web application generates the report based on the data stored in the data lake, the data warehouse and the Cloud Datastore
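The nine steps above can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions: the class `ScrapCenter`, the functions `fake_scraper` and `run_once`, and the example URL are hypothetical names, and plain Python containers stand in for the data lake, Cloud Datastore and data warehouse. It does not reproduce the Accelerator's actual code.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ScrapCenter:
    """Hypothetical in-memory stand-in for the scrap-center."""
    pending_urls: deque = field(default_factory=deque)
    data_lake: list = field(default_factory=list)       # raw data
    datastore: dict = field(default_factory=dict)       # low-velocity metadata
    data_warehouse: list = field(default_factory=list)  # high-velocity data

    def select_website(self, url):
        """Step 1: the user chooses a URL to scrape."""
        self.pending_urls.append(url)

    def next_task(self):
        """Steps 2-3: give the next URL to a scraper, or None when idle."""
        return self.pending_urls.popleft() if self.pending_urls else None

    def upload(self, url, raw_html, record):
        """Steps 5-7: receive scraped data, cleanse it, route it by velocity."""
        self.data_lake.append({"url": url, "html": raw_html})
        # Low velocity: the product name and URL rarely change.
        self.datastore[url] = {"name": record["name"].strip(), "url": url}
        # High velocity: the price may change up to near real time.
        self.data_warehouse.append({"url": url, "price": float(record["price"])})


def fake_scraper(url):
    """Step 4 placeholder: a real scraper would download the page."""
    raw_html = f"<html><!-- content of {url} --></html>"
    record = {"name": " Example product ", "price": "19.99"}
    return raw_html, record


def run_once(center):
    """One polling cycle of a scraper; returns True if a task was processed."""
    url = center.next_task()
    if url is None:
        return False
    raw_html, record = fake_scraper(url)
    center.upload(url, raw_html, record)
    return True


center = ScrapCenter()
center.select_website("https://retail.example/product/1")
while run_once(center):
    pass
print(center.datastore)
print(center.data_warehouse)
```

Separating the stores by how often the data changes mirrors step 7: the reporting layer (steps 8-9) can then read slow-changing attributes and fast-changing prices from different places.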
  • 6. 1.3 The design of the BI Architecture
    [Architecture diagram. Recoverable labels:]
      Internet websites: retail.com, clothes.com, shoes.com
      On-Premise architecture: Scraper 1, Scraper 2, ..., Scraper N (the scraper scrapes HTML data; the scraper takes a screenshot and metadata)
      Public Cloud architecture: Scrap-Center, Data Lake, Data Warehouse, Datastore, Web application
      Users: Desktop Browser
      Flows: scrapers request a task; scrapers push raw data (HTML, screenshots); store (Data Lake, Data Warehouse, Datastore); collect and analyze; get reports through the Dashboard; choose which websites to scrape
  • 7. CHAPTER TWO: BUSINESS INTELLIGENCE ARCHITECTURE
    2.1 BI Solution Architecture
    The chart describes the interactions between the modules of the Accelerator, from scraping the data to generating a BI report.
    [Architecture diagram. Recoverable labels:]
      Scrapers (Scraper 1, Scraper 2, ..., Scraper N): API interface; request a task; push raw data (HTML, screenshots)
      Scrapers navigate the web with Selenium. Plugins available:
        - retrieve the content
        - cut-request: blacklists of websites we do not want to inform, e.g. facebook, datadome
      Proxy and VPN available to:
        - circumvent captcha
        - switch IP address to behave like humans
      Scrap-Center (App Engine): Command
      Analyzer (App Engine): API interface; Publish, e.g. Change price API
      Dataflow (Apache Beam): Subscribe
      Data Lake: HTML, images (store)
      Datastore: structured version of the data extracted from the HTML (store)
      Datastore: technical metadata of each HTML page, e.g. when was it scraped? (store)
      Data Warehouse: BigQuery
      Authorization: Firebase
      Administration interface; Web application interface (Deliver)
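The cut-request plugin is described only as a blacklist of websites that should not be informed. A minimal sketch of such a check, assuming the blacklist is a set of domain names matched against the host of each outgoing request, could look as follows. The names `BLOCKED_DOMAINS`, `is_blocked` and `filter_requests`, and the exact domain spellings, are illustrative assumptions rather than the plugin's real interface.

```python
from urllib.parse import urlsplit

# Illustrative blacklist; the source only names "facebook" and "datadome".
BLOCKED_DOMAINS = {"facebook.com", "datadome.co"}


def is_blocked(url, blocked=BLOCKED_DOMAINS):
    """Return True if the URL's host is a blocked domain or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain) for domain in blocked)


def filter_requests(urls):
    """Keep only the requests that are allowed to leave the scraper."""
    return [url for url in urls if not is_blocked(url)]


requests_seen = [
    "https://retail.example/shoes",
    "https://www.facebook.com/tr?id=1",
    "https://notfacebook.com/",
]
print(filter_requests(requests_seen))
```

Matching on the parsed host, rather than on a substring of the URL, avoids blocking unrelated sites whose name merely contains a blacklisted word.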
  • 8. CHAPTER THREE: GOOGLE CLOUD ARCHITECTURE
    The Accelerator uses the following GCP (Google Cloud Platform) services:
      • Cloud Firestore
      • App Engine
      • Cloud Datastore
      • Cloud Dataflow
      • BigQuery
  • 9. CHAPTER FOUR: TECHNOLOGIES USED BY THE ACCELERATOR
    4.1 Wrappers for Web Data Extraction
    As the quantity and diversity of the information available online increase, more of the typical information access tasks are done by programs such as web wrappers.
    Wrappers “facilitate access to Web-based information sources by providing a uniform querying and data extraction capability.” [KO2018]
    Why are web scrapers useful?
    For example, a web wrapper for an e-commerce website can take a query for a product and extract its description in the same way as information is extracted from a database:
      • the price or previous price,
      • the promotion price,
      • the description,
      • the date when the data were collected,
      • and the bar code.
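The wrapper idea above can be illustrated with Python's standard `html.parser` module. The sketch assumes, purely for illustration, that the product page marks its fields with a `data-field` attribute; real e-commerce pages differ, and each site would need its own extraction rules. `Product`, `ProductWrapper` and `extract_product` are hypothetical names.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Optional


@dataclass
class Product:
    description: Optional[str] = None
    price: Optional[float] = None
    promotion_price: Optional[float] = None
    bar_code: Optional[str] = None


class ProductWrapper(HTMLParser):
    """Reads fields from elements carrying a hypothetical data-field attribute."""

    def __init__(self):
        super().__init__()
        self.product = Product()
        self._field = None  # name of the field currently being read, if any

    def handle_starttag(self, tag, attrs):
        self._field = dict(attrs).get("data-field")

    def handle_data(self, data):
        text = data.strip()
        if not self._field or not text:
            return
        if self._field in ("price", "promotion_price"):
            setattr(self.product, self._field, float(text))
        elif self._field in ("description", "bar_code"):
            setattr(self.product, self._field, text)

    def handle_endtag(self, tag):
        self._field = None


def extract_product(html):
    """Uniform entry point: HTML in, structured Product record out."""
    wrapper = ProductWrapper()
    wrapper.feed(html)
    wrapper.close()
    return wrapper.product


PAGE = """
<div>
  <h1 data-field="description">Leather boots</h1>
  <span data-field="price">89.90</span>
  <span data-field="promotion_price">69.90</span>
  <span data-field="bar_code">5412345678908</span>
</div>
"""
product = extract_product(PAGE)
print(product)
```

Whatever the source website, callers only see `extract_product` and the `Product` record, which is the "uniform querying and data extraction capability" the quotation refers to.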
  • 10. CHAPTER FIVE: WHAT ARE THE POSSIBLE WEB DATA EXTRACTION USE CASES?
    Here is a non-exhaustive list of use cases for which web scraping has proven to be a useful solution.
    Table 1: Web Data Extraction Use Cases - What kind of project are you working on?
      • Market research
      • Price intelligence
      • Lead generation
      • Brand monitoring
      • An alternative data source for the finance industry
      • Recruitment
      • Business automation
      • MAP violations
      • Fraud detection
  • 11. CHAPTER SIX: CAPABILITIES OF THE ACCELERATOR
    Business-minded
      All data are stored to enable ex-post analysis
    Cost-effective
      The software developer accesses all data through a straightforward interface, independently of the storage
    Customization
      Implement any business rule on the data
      Add a new data source at any time
      Choose the interfaces of the REST APIs
    Extensible
      Add new functionalities to the ETL process
    Scalable
      Scrape as many websites as you want
      Add as much data as you want
    High-performance
      The Accelerator starts servers when necessary to support the load
    Secure
      Authentication and identification
      All connections between the components are secured
    Optimal
      Cache the data in memory to speed up the process
    Fault-tolerant
      Historical data are stored for future analysis
      Recovery of lost data
    Source code
      Built by a Belgian company, owner of the source code
      The Accelerator relies on open source code and a proprietary cloud infrastructure
  • 13. CHAPTER EIGHT: GLOSSARY FOR THE ACCELERATOR
    Apache Beam
      Apache Beam is an open source unified programming model to define and execute data processing pipelines, including ETL, batch and stream (continuous) processing.
      Source: Wikipedia contributors. (2020, February 10). Apache Beam. In Wikipedia, The Free Encyclopedia. Retrieved 15:23, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Apache_Beam&oldid=940068914
    App Engine, Google App Engine
      “Google App Engine is a Platform as a Service and cloud computing platform for developing and hosting web applications in Google-managed data centers. Applications are sandboxed and run across multiple servers. App Engine offers automatic scaling for web applications: as the number of requests increases for an application, App Engine automatically allocates more resources for the web application to handle the additional demand”
      Source: Wikipedia contributors. (2020, March 5). Google App Engine. In Wikipedia, The Free Encyclopedia. Retrieved 15:22, April 2, 2020, from https://en.wikipedia.org/w/index.php?title=Google_App_Engine&oldid=944010303
    Authorship
      “Copyright law protects authorship intended as the expression of an original work created by an author. This generally applies to literary, musical, artistic, and other intellectual works.”
      Source: http://www.iprhelpdesk.eu/sites/default/files/newsdocuments/Fact-Sheet-Inventorship-Authorship-Ownership.pdf
    BI, Business intelligence
      “Business intelligence (BI) comprises the strategies and technologies used by enterprises for the data analysis of business information.[1] BI technologies provide historical, current, and predictive views of business operations. Common functions of business intelligence technologies include reporting, online analytical processing, analytics, data mining, process mining, complex event processing, business performance management, benchmarking, text mining, predictive analytics, and prescriptive analytics.”
      Source: Wikipedia contributors. (2020, March 16). Business intelligence. In Wikipedia, The Free Encyclopedia. Retrieved 15:23, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Business_intelligence&oldid=945780339
    BigQuery
      “BigQuery is a fully-managed data warehouse on RESTful web service that enables scalable, cost-effective and fast analysis of big data working in conjunction with Google Cloud Storage. It is a serverless Software as a Service (SaaS) that may be used complementarily with MapReduce. It also has built-in machine learning capabilities.”
      Source: Wikipedia contributors. (2020, February 21). BigQuery. In Wikipedia, The Free Encyclopedia. Retrieved 15:23, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=BigQuery&oldid=941896656
    Captcha
      “A CAPTCHA (an acronym for “Completely Automated Public Turing test to tell Computers and Humans Apart”) is a type of challenge–response test used in computing to determine whether or not the user is human...[A] CAPTCHA requires someone to correctly evaluate and enter a sequence of letters or numbers perceptible in a distorted image displayed on their screen”
  • 14. Source (Captcha): Wikipedia contributors. (2020, March 25). CAPTCHA. In Wikipedia, The Free Encyclopedia. Retrieved 15:24, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=CAPTCHA&oldid=947308972
    CCPA
      “The California Consumer Privacy Act (CCPA) is a state statute intended to enhance privacy rights and consumer protection for residents of California, United States. The intentions of the Act are to provide California residents with the right to:”
        • Know what personal data is being collected about them.
        • Know whether their personal data is sold or disclosed and to whom.
        • Say no to the sale of personal data.
        • Access their personal data.
        • Request a business to delete any personal information about a consumer collected from that consumer.
        • Not be discriminated against for exercising their privacy rights.
      Source: Wikipedia contributors. (2020, March 25). California Consumer Privacy Act. In Wikipedia, The Free Encyclopedia. Retrieved 15:24, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=California_Consumer_Privacy_Act&oldid=947332131
      See also the CCPA GDPR Chart by Thomson Reuters, which compares some of the key requirements of the California Consumer Privacy Act (CCPA) and the EU General Data Protection Regulation (GDPR).
    Cloud
      “Cloud computing is the on-demand availability of computer system resources, especially data storage and computing power, without direct active management by the user. The term is generally used to describe data centers available to many users over the Internet”
      Source: Wikipedia contributors. (2020, March 24). Cloud computing. In Wikipedia, The Free Encyclopedia. Retrieved 15:24, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Cloud_computing&oldid=947158114
    Cloud Dataflow
      “Google Cloud Dataflow is a fully managed service for executing Apache Beam pipelines within the Google Cloud Platform ecosystem.”
      Source: Wikipedia contributors. (2019, November 27). Google Cloud Dataflow. In Wikipedia, The Free Encyclopedia. Retrieved 15:25, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Google_Cloud_Dataflow&oldid=928227601
    Cloud Datastore
      “Google Cloud Datastore (Cloud Datastore) is a highly scalable, fully managed NoSQL database service offered by Google on the Google Cloud Platform.”
      Source: Wikipedia contributors. (2019, November 27). Google Cloud Datastore. In Wikipedia, The Free Encyclopedia. Retrieved 15:25, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Google_Cloud_Datastore&oldid=928227557
    Cloud Firestore
      “Cloud Firestore is a flexible, scalable database for mobile, web, and server development from Firebase and Google Cloud Platform.”
      Source: https://firebase.google.com/docs/firestore
    Data Cleansing
      “Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data”
      Source: Wikipedia contributors. (2020, March 3). Data cleansing. In Wikipedia, The Free Encyclopedia. Retrieved 15:26, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Data_cleansing&oldid=943697218
    Data lake
      “A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files. A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning.
  • 15. (Data lake, continued) A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video).”
      Source: Wikipedia contributors. (2020, March 3). Data lake. In Wikipedia, The Free Encyclopedia. Retrieved 15:30, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Data_lake&oldid=943633024
    Data mining, Text mining
      “‘text and data mining’ means any automated analytical technique aimed at analysing text and data in digital form in order to generate information which includes but is not limited to patterns, trends and correlations;”
      Source: https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32019L0790
    Data warehouse
      “A data warehouse (DW or DWH) is a system used for reporting and data analysis, and is considered a core component of business intelligence. DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place that are used for creating analytical reports for workers throughout the enterprise. The data stored in the warehouse is uploaded from the operational systems. The data may pass through an operational data store and may require data cleansing for additional operations to ensure data quality before it is used in the DW for reporting.”
      Source: Wikipedia contributors. (2020, March 12). Data warehouse. In Wikipedia, The Free Encyclopedia. Retrieved 15:29, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Data_warehouse&oldid=945155021
    DRM, Digital Rights Management
      “Digital rights management (DRM) tools ... are a set of access control technologies for restricting the use of proprietary hardware and copyrighted works. DRM technologies try to control the use, modification, and distribution of copyrighted works (such as software and multimedia content), as well as systems within devices that enforce these policies”
      Source: Wikipedia contributors. (2020, March 19). Digital rights management. In Wikipedia, The Free Encyclopedia. Retrieved 15:29, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Digital_rights_management&oldid=946249128
    ETL, Extract-Transform-Load
      “Extract, transform, load (ETL) is the general procedure of copying data from one or more sources into a destination system which represents the data differently from the source(s) or in a different context than the source(s). Data extraction involves extracting data from homogeneous or heterogeneous sources; data transformation processes data by data cleansing and transforming them into a proper storage format/structure for the purposes of querying and analysis; finally, data loading describes the insertion of data into the final target database such as an operational data store, a data mart, data lake or a data warehouse.”
      Source: Wikipedia contributors. (2020, March 12). Extract, transform, load. In Wikipedia, The Free Encyclopedia. Retrieved 15:33, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Extract,_transform,_load&oldid=945159964
    GDPR, RGPD
      “The General Data Protection Regulation (EU) 2016/679 (GDPR) is a regulation in EU law on data protection and privacy in the European Union (EU) and the European Economic Area (EEA).”
      “It also addresses the transfer of personal data outside the EU and EEA areas. The GDPR aims primarily to give control to individuals over their personal data and to simplify the regulatory environment for international business by unifying the regulation within the EU”
  • 16. Source (GDPR): Wikipedia contributors. (2020, March 23). General Data Protection Regulation. In Wikipedia, The Free Encyclopedia. Retrieved 15:34, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=General_Data_Protection_Regulation&oldid=946999924
    Hybrid cloud, Cloud Hybride
      “Hybrid cloud service as a cloud computing service that is composed of some combination of private, public and community cloud services, from different service providers” [BiGartner12]
      Source: Wikipedia contributors. (2020, March 24). Cloud computing. In Wikipedia, The Free Encyclopedia. Retrieved 15:35, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Cloud_computing&oldid=947158114
    MAP violations
      “A minimum advertised price (MAP) is the practice of a manufacturer providing marketing funds to a retailer contingent on the retailer advertising an end customer price at or above a specified level. Such agreements can be illegal in some countries when members and terms in the agreement match predefined legal criteria. Fixed pricing established between a distributor and seller or between two or more sellers may violate antitrust laws in the United States.”
      Source: Wikipedia contributors. (2020, March 16). List price. In Wikipedia, The Free Encyclopedia. Retrieved 14:56, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=List_price&oldid=945807027
    Netiquette
      “Netiquette is a combination of the words network and etiquette and is defined as a set of rules for acceptable online behavior. Similarly, online ethics focuses on the acceptable use of online resources in an online social environment.”
      Source: What is Netiquette? A Guide to Online Ethics and Etiquette (n.d.). Retrieved March 26, 2020, from https://www.webroot.com/nz/en/resources/tips-articles/netiquette-and-online-ethics-what-are-they
    NoSQL
      A NoSQL (originally referring to “non SQL” or “non relational”) database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.
      Source: Wikipedia contributors. (2020, March 14). NoSQL. In Wikipedia, The Free Encyclopedia. Retrieved 14:56, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=NoSQL&oldid=945474807
    On-Premise, On-Premise Software
      “An On-premises software ... is installed and runs on computers on the premises of the person or organization using the software, rather than at a remote facility such as a server farm or cloud.”
      Source: Wikipedia contributors. (2019, November 28). On-premises software. In Wikipedia, The Free Encyclopedia. Retrieved 14:57, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=On-premises_software&oldid=928327829
    Residential proxy
      “A residential proxy is an IP address provided by an Internet Service Provider (ISP).”
      Source: Buy Residential Proxies: 10M IPs - 99.99% uptime. (2020, January 29). Retrieved March 26, 2020, from https://smartproxy.com/proxies/residential-proxies
    RESTful, Representational state transfer
      “Representational state transfer (REST) is a software architectural style that defines a set of constraints to be used for creating Web services. Web services that conform to the REST architectural style, called RESTful Web services, provide interoperability between computer systems on the Internet. RESTful Web services allow the requesting systems to access and manipulate textual representations of Web resources by using a uniform and predefined set of stateless operations.”
      Source: Wikipedia contributors. (2020, February 19). Representational state transfer. In Wikipedia, The Free Encyclopedia. Retrieved 14:57, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Representational_state_transfer&oldid=941589430
    scrap-center
      A tool that orchestrates the scraping requests sent to the scrapers.
  • 17. (scrap-center, continued) The scrap-center stores raw data (images, HTML) in the file system and the metadata in the NoSQL databases.
    scraper, scrapers
      A tool, part of the Accelerator, that scrapes the pages of a website and handles the complexity of collecting data from websites (captcha, IP blocking...).
      The scraper uses tools such as a Web Crawler to browse the World Wide Web.
    scraping, web scraping
      Web scraping means extracting required information from a web page using code. [WikiWebScraping20]
      Reading: Jarell, E. (2018, November 26). Building a Web Scraper from start to finish. Retrieved March 26, 2020, from https://hackernoon.com/building-a-web-scraper-from-start-to-finish-bb6b95388184
    Selenium
      “Selenium is a portable framework for testing web applications. Selenium provides a playback tool for authoring functional tests without the need to learn a test scripting language (Selenium IDE). It also provides a test domain-specific language (Selenese) to write tests in a number of popular programming languages, including C#, Groovy, Java, Perl, PHP, Python, Ruby and Scala.”
      Source: Wikipedia contributors. (2020, February 18). Selenium (software). In Wikipedia, The Free Encyclopedia. Retrieved 14:59, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Selenium_(software)&oldid=941443949
    VPN, Virtual Private network
      “A virtual private network (VPN) extends a private network across a public network, and enables users to send and receive data across shared or public networks as if their computing devices were directly connected to the private network”
      Source: Wikipedia contributors. (2020, March 25). Virtual private network. In Wikipedia, The Free Encyclopedia. Retrieved 14:59, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Virtual_private_network&oldid=947375734
    Web API, Web APIs
      “Web APIs are the defined interfaces through which interactions happen between an enterprise and applications that use its assets, which also is a Service Level Agreement (SLA) to specify the functional provider and expose the service path or URL for its API users. An API approach is an architectural approach that revolves around providing a program interface to a set of services to different applications serving different types of consumers.”
      Source: Wikipedia contributors. (2020, March 25). Application programming interface. In Wikipedia, The Free Encyclopedia. Retrieved 15:00, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Application_programming_interface&oldid=947328151
    Web Crawler
      “A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing (web spidering). Web search engines and some other sites use Web crawling or spidering software to update their web content or indices of others sites’ web content. Web crawlers copy pages for processing by a search engine which indexes the downloaded pages so users can search more efficiently. Crawlers consume resources on visited systems and often visit sites without approval. Issues of schedule, load, and “politeness” come into play when large collections of pages are accessed. Mechanisms exist for public sites not wishing to be crawled to make this known to the crawling agent. For example, including a robots.txt file can request bots to index only parts of a website, or nothing at all.”
      Source: https://en.wikipedia.org/wiki/Web_crawler
    web services
      “A server running on a computer device, listening for requests at a particular port over a network, serving web documents (HTML, JSON, XML, images), and creating web applications services, which serve in solving specific domain problems over the Web (WWW, Internet, HTTP)”
  • 18. Source (web services): Wikipedia contributors. (2020, March 25). Web crawler. In Wikipedia, The Free Encyclopedia. Retrieved 15:01, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Web_crawler&oldid=947328470
    wrapper, wrappers
      “Wrappers facilitate access to Web-based information sources by providing a uniform querying and data extraction capability” [KO2018]
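The Web Crawler glossary entry mentions robots.txt as the mechanism by which a site tells bots what they may fetch. As a minimal sketch, Python's standard `urllib.robotparser` module can evaluate such rules before a scraper requests a page. The robots.txt content, the user-agent string `accelerator-scraper` and the `shop.example` URLs are made-up examples, and nothing in the source states that the Accelerator performs this check.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Parse robots.txt rules from a string instead of fetching them over HTTP."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
allowed = parser.can_fetch("accelerator-scraper", "https://shop.example/products/1")
blocked = parser.can_fetch("accelerator-scraper", "https://shop.example/private/report")
print(allowed, blocked)
```

In a live setting the rules would be loaded with `set_url()` and `read()` from the target site rather than from a string.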
  • 19. CHAPTER NINE: LICENSE
    © 2019-2020 Abdelkrim Boujraf, ALT-F1 SPRL <http://www.alt-f1.be>
    All trademarks mentioned belong to their owners. Third-party brands, product names, trade names, corporate names and company names mentioned may be trademarks of their respective owners or registered trademarks of other companies and are used for purposes of explanation and to the owner’s benefit, without implying a violation of copyright law.
  • 20. BIBLIOGRAPHY
    [BiddersEdge100FSupp2d105800] eBay, Inc. v. Bidder’s Edge, Inc., 100 F. Supp. 2d 1058 (N.D. Cal. 2000). (n.d.). Retrieved from https://law.justia.com/cases/federal/district-courts/FSupp2/100/1058/2478126
    [BiGartner12] Bittman, T. (2012, September 24). Mind the Gap: Here Comes Hybrid Cloud. Retrieved from https://blogs.gartner.com/thomas_bittman/2012/09/24/mind-the-gap-here-comes-hybrid-cloud
    [DmlpNotCopy14] Works Not Covered By Copyright. (n.d.). Retrieved March 21, 2020, from https://www.dmlp.org/legal-guide/works-not-covered-copyright
    [KO2018] Knoblock, C. A. (2018, February 3). Modeling Web Sources for Information Integration. Retrieved March 26, 2020, from https://github.com/usc-isi-i2/usc-isi-i2.github.io/blob/master/slides/2018-02-03-AAAI-KG-Tutorial-CK.pptx
    [ScrapinghubBestPractices18] Scrapinghub. (n.d.). Guide to Web Scraping Best Practices. Retrieved March 21, 2020, from https://info.scrapinghub.com/web-scraping-guide/web-scraping-best-practices
    [WebScrapingforCPI18] Web scraping and online data collection and processing for the consumer price index. (2018, February 8). Retrieved from https://statbel.fgov.be/en/news/web-scraping-and-online-data-collection-and-processing-consumer-price-index
    [WikiWebScraping20] Wikipedia contributors. (2020, March 11). Web scraping. In Wikipedia, The Free Encyclopedia. Retrieved 11:16, March 19, 2020, from https://en.wikipedia.org/w/index.php?title=Web_scraping&oldid=945118241