The Accelerator is an IT infrastructure able to collect and analyze massive amounts of public data on the World Wide Web (WWW).
The Accelerator leverages the untapped potential of web data with the first solution designed for diverse sectors:
completely scalable, available on-premise, and cloud-provider agnostic.
ALT-F1.BE: The Accelerator (Google Cloud Platform)

The Google Cloud Platform Accelerator
Release v2020.04.07 16.45
Abdelkrim B., Gabygaël P.
Apr 07, 2020
CONTENTS

1 The Accelerator
1.1 The Hybrid cloud infrastructure
1.2 How does the Accelerator work?
1.3 The design of the BI Architecture
2 Business Intelligence Architecture
2.1 BI Solution Architecture
3 Google Cloud Architecture
4 Technologies used by the Accelerator
4.1 Wrappers for Web Data Extraction
5 What are the possible Web Data Extraction Use Cases?
6 Capabilities of the Accelerator
7 Citations
8 Glossary for the Accelerator
9 License
Bibliography
CHAPTER ONE

THE ACCELERATOR
The Accelerator is an IT infrastructure able to collect and analyze massive amounts of public data on the World Wide Web.
The Accelerator leverages the untapped potential of web data with the first solution designed for diverse sectors:
completely scalable, available on-premise, and cloud-provider agnostic.
1.1 The Hybrid cloud infrastructure
The IT infrastructure supporting the objective of the Accelerator is a hybrid cloud solution, including:

• Several servers deployed on-premise running the scrapers that collect the raw data from websites on the Internet. The scrapers push this data to the cloud architecture for large-scale analysis.
• Several servers deployed on the public cloud that:
1. store and analyze a large volume of data
2. give access to the raw data and the data analyzed by the Accelerator to third parties through secured Web APIs (a minimal sketch of this flow follows the list)
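As a minimal sketch of this flow, assuming a hypothetical ingestion endpoint, token and payload fields (the Accelerator's actual Web API is not documented here), an on-premise scraper could push its collected data to the public cloud like this:

```python
# Hypothetical sketch: an on-premise scraper pushes one scraped page (HTML plus
# screenshot) to the public-cloud ingestion API over HTTPS with a bearer token.
# The endpoint, token and field names are illustrative assumptions.
import requests

INGEST_URL = "https://accelerator.example.com/api/v1/raw-data"   # assumed endpoint
API_TOKEN = "<scraper-token>"                                     # assumed credential

def push_raw_data(html: str, screenshot_path: str, source_url: str) -> None:
    """Upload one scraped page to the cloud architecture for large-scale analysis."""
    with open(screenshot_path, "rb") as screenshot:
        response = requests.post(
            INGEST_URL,
            headers={"Authorization": f"Bearer {API_TOKEN}"},
            data={"source_url": source_url, "html": html},
            files={"screenshot": screenshot},
            timeout=30,
        )
    response.raise_for_status()
```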
1.2 How does the Accelerator work?
Figure: The Accelerator, from data identified on the World Wide Web to insightful reports. (1) Choose which data to scrape; (2-5) the Accelerator scrapes the web (Extract); (6) data cleansing and data transformation (Transform); (7) insertion of the data into a data mart, a data lake and a data warehouse (Load); (8-9) reporting (build intelligence).
1. The user of the web application selects the website that she wants to scrape
2. The scrapers continuously ask the scrap-center which URLs they have to scrape
3. The scrap-center tells each scraper which URL it should download data from
4. The scraper performs its duties, e.g., downloads the HTML from the URL, takes a screenshot of the HTML page, and downloads the PDFs available on the page
5. The scraper uploads the collected data to the scrap-center (a minimal sketch of this loop, steps 2-5, follows the list)
6. The ETL (Extract-Transform-Load) process starts, including the data cleansing and data transformation performed by the scrap-center
7. The scrap-center
• stores the raw data in the data lake
• extracts the low-velocity metadata (data that does not change often, e.g., the product name and the URL where the product is available) and stores it in the Cloud Datastore
• stores the screenshot on a file system
• stores the high-velocity data (data that changes often, up to near real-time, e.g., the product price and the promotion price) in the data warehouse
Now the data are available to the end-user.
8. The user selects the dashboard of the report she wants to analyze or print
9. The Web application generates the report based on the data stored in the data lake, the data warehouse and the Cloud Datastore
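The sketch below illustrates the scraper loop of steps 2-5 under assumed routes, field names and credentials; the scrap-center's actual Web API is not specified in this document.

```python
# Hypothetical sketch of the scraper loop (steps 2-5): request a task from the
# scrap-center, download the assigned page, push the result back.
# The base URL, routes, token and JSON field names are illustrative assumptions.
import time
import requests

SCRAP_CENTER = "https://scrap-center.example.com/api/v1"   # assumed base URL
HEADERS = {"Authorization": "Bearer <scraper-token>"}       # assumed credential

def scraper_loop() -> None:
    while True:
        # Steps 2-3: ask the scrap-center which URL to scrape next.
        task = requests.get(f"{SCRAP_CENTER}/tasks/next", headers=HEADERS, timeout=30).json()
        if not task:
            time.sleep(10)          # no work available, poll again later
            continue

        # Step 4: download the HTML of the assigned page.
        html = requests.get(task["url"], timeout=60).text

        # Step 5: upload the collected data to the scrap-center.
        requests.post(
            f"{SCRAP_CENTER}/tasks/{task['id']}/result",
            headers=HEADERS,
            json={"url": task["url"], "html": html},
            timeout=60,
        ).raise_for_status()
```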
1.3 The design of the BI Architecture
Figure: the design of the BI architecture. On-premise architecture: Scraper 1 ... Scraper N scrape HTML data and take screenshots and metadata from Internet websites (e.g., retail.com, clothes.com, shoes.com); the scrapers request tasks from the Scrap-Center and push the raw data (HTML, screenshots) to it. Public cloud architecture: the Scrap-Center stores the data in the Data Lake, the Data Warehouse and the Datastore, which the Web application collects and analyzes. Users choose which websites to scrape and get reports through the dashboard in a desktop browser.
CHAPTER TWO

BUSINESS INTELLIGENCE ARCHITECTURE
2.1 BI Solution Architecture
The chart describes the interactions between the modules of the Accelerator, from scraping the data to generating a BI report. A minimal Apache Beam sketch of the Dataflow leg follows the figure.
Figure: BI solution architecture. Scraper 1 ... Scraper N request tasks from the Scrap-Center (App Engine, API interface) and push the raw data (HTML, screenshots). The scrapers navigate the web with Selenium; plugins retrieve the content, a cut-request blacklist excludes websites we do not want to contact (e.g., facebook, datadome), and a proxy/VPN is available to circumvent captchas and to switch IP addresses so the scrapers behave like humans. The Scrap-Center stores the raw HTML and images in the Data Lake and the technical metadata of each HTML page (e.g., when it was scraped) in the Datastore. The Analyzer (App Engine, API interface) publishes events (e.g., a price change) to which a Dataflow (Apache Beam) pipeline subscribes; the structured version of the data extracted from the HTML is stored in the Datastore and delivered to the Data Warehouse (BigQuery). Firebase handles authorization for the administration interface and the Web application interface.
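As a minimal sketch of the Dataflow (Apache Beam) leg of this architecture, the streaming pipeline below subscribes to the price-change events published by the Analyzer and delivers rows to the BigQuery data warehouse. The Pub/Sub topic, table name and schema are illustrative assumptions, not the Accelerator's actual configuration.

```python
# Minimal Apache Beam sketch: read Analyzer events from Pub/Sub and append them
# to a BigQuery table.  Topic, table and schema names are assumptions.
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run() -> None:
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadEvents" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/price-changes")
            | "Parse" >> beam.Map(json.loads)
            | "ToWarehouse" >> beam.io.WriteToBigQuery(
                "my-project:accelerator.prices",
                schema="product_name:STRING, url:STRING, price:FLOAT, scraped_at:TIMESTAMP",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )

if __name__ == "__main__":
    run()
```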
CHAPTER THREE

GOOGLE CLOUD ARCHITECTURE
The Accelerator uses the following services of the GCP (Google Cloud Platform) infrastructure (a minimal BigQuery loading sketch follows the list):
• Cloud Firestore
• App Engine
• Cloud Datastore
• Cloud Dataflow
• BigQuery
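The snippet below is a minimal sketch of loading scraped, high-velocity data (e.g., prices) into the BigQuery data warehouse with the google-cloud-bigquery client; the project, dataset, table and field names are assumptions.

```python
# Minimal sketch: stream a scraped price row into BigQuery.
# Project, dataset, table and field names are illustrative assumptions.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")      # assumed project id
table_id = "my-project.accelerator.prices"           # assumed table

rows = [
    {"product_name": "example shoe", "url": "https://shoes.com/p/1",
     "price": 59.99, "scraped_at": "2020-04-07T16:45:00Z"},
]

# insert_rows_json streams the rows into the table and returns a list of errors.
errors = client.insert_rows_json(table_id, rows)
if errors:
    raise RuntimeError(f"BigQuery insert failed: {errors}")
```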
CHAPTER FOUR

TECHNOLOGIES USED BY THE ACCELERATOR
4.1 Wrappers for Web Data Extraction
As the quantity and diversity of the information available online increase, more of the typical information-access tasks are handled by programs such as web wrappers.
Wrappers “facilitate access to Web-based information sources by providing a uniform querying and data extraction capability.” [KO2018]
Why are web scrapers useful?
For example, a Web wrapper for an e-commerce website can take a query for a product and extract its attributes in the same way the information would be extracted from a database (a minimal wrapper sketch follows the list):
• the price or the previous price,
• the promotion price,
• the description,
• the date when the data were collected,
• and the bar code.
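The sketch below shows what such a wrapper could look like for a product page, using BeautifulSoup; the CSS selectors and field names are assumptions about a hypothetical e-commerce site, not a real one.

```python
# Hypothetical sketch of a web wrapper: turn a product page into a database-like
# record.  The selectors (.price, .promo-price, ...) are illustrative assumptions.
from datetime import datetime, timezone
from bs4 import BeautifulSoup

def _text(soup: BeautifulSoup, selector: str) -> str:
    """Return the trimmed text of the first element matching the selector, or ''."""
    node = soup.select_one(selector)
    return node.get_text(strip=True) if node else ""

def extract_product(html: str) -> dict:
    """Extract the product attributes listed above from one page of HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "description": _text(soup, ".product-description"),
        "price": _text(soup, ".price"),
        "promotion_price": _text(soup, ".promo-price"),
        "bar_code": _text(soup, ".ean"),
        "collected_at": datetime.now(timezone.utc).isoformat(),
    }
```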
CHAPTER FIVE

WHAT ARE THE POSSIBLE WEB DATA EXTRACTION USE CASES?
Here is a non-exhaustive list of use cases for which web scraping has proven to be a useful solution.
Table 1: Web Data Extraction Use Cases - What kind of project are you working on?
• Market research
• Price intelligence
• Lead generation
• Brand monitoring
• An alternative data source for the finance industry
• Recruitment
• Business automation
• MAP violations
• Fraud detection
CHAPTER SIX

CAPABILITIES OF THE ACCELERATOR
Business-minded
  All data are stored to enable ex-post analysis
Cost-effective
  The software developer accesses all data through a straightforward interface, independently of the storage
Customization
  Implement any business rule on the data
  Add new data sources at any time
  Choose the interfaces of the REST APIs
Extensible
  Add new functionalities to the ETL process
Scalable
  Scrape as many websites as you want
  Add as much data as you want
High-performance
  The Accelerator starts servers when necessary to support the load
Secure
  Authentication and identification
  All connections between the components are secured
Optimal
  The data are cached in memory to speed up processing
Fault-tolerant
  Historical data are stored for future analysis
  Recovery of lost data
Source code
  Built by a Belgian company that owns the source code
  The Accelerator relies on open-source code and a proprietary cloud infrastructure
CHAPTER EIGHT

GLOSSARY FOR THE ACCELERATOR
Apache Beam Apache Beam is an open-source unified programming model to define and execute data processing pipelines, including ETL, batch and stream (continuous) processing.
Source : Wikipedia contributors. (2020, February 10). Apache Beam. In Wikipedia, The Free Encyclopedia.
Retrieved 15:23, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Apache_Beam&oldid=
940068914
App Engine
Google App Engine “Google App Engine is a Platform as a Service and cloud computing platform for devel-
oping and hosting web applications in Google-managed data centers. Applications are sandboxed and
run across multiple servers. App Engine offers automatic scaling for web applications—as the number
of requests increases for an application, App Engine automatically allocates more resources for the web
application to handle the additional demand”
Wikipedia contributors. (2020, March 5). Google App Engine. In Wikipedia, The Free Encyclopedia. Re-
trieved 15:22, April 2, 2020, from https://en.wikipedia.org/w/index.php?title=Google_App_Engine&oldid=
944010303
Authorship “Copyright law protects authorship intended as the expression of an original work created by an
author. This generally applies to literary, musical, artistic, and other intellectual works.”
Source : http://www.iprhelpdesk.eu/sites/default/files/newsdocuments/
Fact-Sheet-Inventorship-Authorship-Ownership.pdf
BI
Business intelligence “Business intelligence (BI) comprises the strategies and technologies used by enterprises
for the data analysis of business information.[1] BI technologies provide historical, current, and predictive
views of business operations. Common functions of business intelligence technologies include reporting,
online analytical processing, analytics, data mining, process mining, complex event processing, business
performance management, benchmarking, text mining, predictive analytics, and prescriptive analytics.”
Source : Wikipedia contributors. (2020, March 16). Business intelligence. In Wikipedia, The Free En-
cyclopedia. Retrieved 15:23, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Business_
intelligence&oldid=945780339
BigQuery “BigQuery is a fully-managed data warehouse on RESTful web service that enables scalable, cost-
effective and fast analysis of big data working in conjunction with Google Cloud Storage.
It is a serverless Software as a Service (SaaS) that may be used complementarily with MapReduce. It also
has built-in machine learning capabilities.”
Source : Wikipedia contributors. (2020, February 21). BigQuery. In Wikipedia, The Free Encyclope-
dia. Retrieved 15:23, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=BigQuery&oldid=
941896656
Captcha “A CAPTCHA (an acronym for “Completely Automated Public Turing test to tell Computers and Humans Apart”) is a type of challenge–response test used in computing to determine whether or not the user is human... [A] CAPTCHA requires someone to correctly evaluate and enter a sequence of letters or numbers perceptible in a distorted image displayed on their screen”
Source : Wikipedia contributors. (2020, March 25). CAPTCHA. In Wikipedia, The Free Encyclope-
dia. Retrieved 15:24, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=CAPTCHA&oldid=
947308972
CCPA “The California Consumer Privacy Act (CCPA) is a state statute intended to enhance privacy rights and
consumer protection for residents of California, United States.
The intentions of the Act are to provide California residents with the right to:”
• Know what personal data is being collected about them.
• Know whether their personal data is sold or disclosed and to whom.
• Say no to the sale of personal data.
• Access their personal data.
• Request a business to delete any personal information about a consumer collected from that consumer.
• Not be discriminated against for exercising their privacy rights.
Source : Wikipedia contributors. (2020, March 25). California Consumer Privacy Act. In Wikipedia, The
Free Encyclopedia. Retrieved 15:24, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=
California_Consumer_Privacy_Act&oldid=947332131
Look at the CCPA GDPR Chart by Thomson Reuters, which compares some of the key requirements of the California Consumer Privacy Act (CCPA) and the EU General Data Protection Regulation (GDPR).
Cloud “Cloud computing is the on-demand availability of computer system resources, especially data storage
and computing power, without direct active management by the user. The term is generally used to describe
data centers available to many users over the Internet”
Source : Wikipedia contributors. (2020, March 24). Cloud computing. In Wikipedia, The Free Encyclope-
dia. Retrieved 15:24, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Cloud_computing&
oldid=947158114
Cloud Dataflow “Google Cloud Dataflow is a fully managed service for executing Apache Beam pipelines within the Google Cloud Platform ecosystem.”
Source : Wikipedia contributors. (2019, November 27). Google Cloud Dataflow. In Wikipedia, The Free
Encyclopedia. Retrieved 15:25, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Google_
Cloud_Dataflow&oldid=928227601
Cloud Datastore “Google Cloud Datastore (Cloud Datastore) is a highly scalable, fully managed NoSQL database service offered by Google on the Google Cloud Platform.”
Source : Wikipedia contributors. (2019, November 27). Google Cloud Datastore. In Wikipedia, The Free
Encyclopedia. Retrieved 15:25, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Google_
Cloud_Datastore&oldid=928227557
Cloud Firestore “Cloud Firestore is a flexible, scalable database for mobile, web, and server development from
Firebase and Google Cloud Platform.”
Source : https://firebase.google.com/docs/firestore
Data Cleansing “Data cleansing or data cleaning is the process of detecting and correcting (or removing) cor-
rupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incor-
rect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse
data”
Source : Wikipedia contributors. (2020, March 3). Data cleansing. In Wikipedia, The Free Encyclopedia.
Retrieved 15:26, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Data_cleansing&oldid=
943697218
Data lake “A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or
files.
A data lake is usually a single store of all enterprise data including raw copies of source system data and
transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning.
A data lake can include structured data from relational databases (rows and columns), semi-structured data
(CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio,
video).”
Source : Wikipedia contributors. (2020, March 3). Data lake. In Wikipedia, The Free Encyclope-
dia. Retrieved 15:30, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Data_lake&oldid=
943633024
Data mining
Text mining “‘text and data mining’ means any automated analytical technique aimed at analysing text and data
in digital form in order to generate information which includes but is not limited to patterns, trends and
correlations;”
Source : https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32019L0790
Data warehouse “A data warehouse (DW or DWH) is a system used for reporting and data analysis, and is
considered a core component of business intelligence. DWs are central repositories of integrated data from
one or more disparate sources. They store current and historical data in one single place that are used for
creating analytical reports for workers throughout the enterprise.
The data stored in the warehouse is uploaded from the operational systems. The data may pass through
an operational data store and may require data cleansing for additional operations to ensure data quality
before it is used in the DW for reporting.”
Source : Wikipedia contributors. (2020, March 12). Data warehouse. In Wikipedia, The Free Encyclope-
dia. Retrieved 15:29, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Data_warehouse&
oldid=945155021
DRM
Digital Rights Management “Digital rights management (DRM) tools ... are a set of access control technolo-
gies for restricting the use of proprietary hardware and copyrighted works. DRM technologies try to control
the use, modification, and distribution of copyrighted works (such as software and multimedia content), as
well as systems within devices that enforce these policies”
Source : Wikipedia contributors. (2020, March 19). Digital rights management. In Wikipedia, The Free
Encyclopedia. Retrieved 15:29, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Digital_
rights_management&oldid=946249128
ETL
Extract-Transform-Load “Extract, transform, load (ETL) is the general procedure of copying data from one
or more sources into a destination system which represents the data differently from the source(s) or in a
different context than the source(s)
Data extraction involves extracting data from homogeneous or heterogeneous sources;
data transformation processes data by data cleansing and transforming them into a proper storage for-
mat/structure for the purposes of querying and analysis;
finally, data loading describes the insertion of data into the final target database such as an operational
data store, a data mart, data lake or a data warehouse.”
Source : Wikipedia contributors. (2020, March 12). Extract, transform, load. In Wikipedia, The Free
Encyclopedia. Retrieved 15:33, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Extract,
_transform,_load&oldid=945159964
GDPR
RGPD “The General Data Protection Regulation (EU) 2016/679 (GDPR) is a regulation in EU law on data
protection and privacy in the European Union (EU) and the European Economic Area (EEA).”
“It also addresses the transfer of personal data outside the EU and EEA areas. The GDPR aims primar-
ily to give control to individuals over their personal data and to simplify the regulatory environment for
international business by unifying the regulation within the EU”
Source : Wikipedia contributors. (2020, March 23). General Data Protection Regulation. In Wikipedia, The
Free Encyclopedia. Retrieved 15:34, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=
General_Data_Protection_Regulation&oldid=946999924
Hybrid cloud
Cloud Hybride A “hybrid cloud service” is “a cloud computing service that is composed of some combination of private, public and community cloud services, from different service providers” [BiGartner12]
Source : Wikipedia contributors. (2020, March 24). Cloud computing. In Wikipedia, The Free Encyclope-
dia. Retrieved 15:35, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Cloud_computing&
oldid=947158114
MAP violations “A minimum advertised price (MAP) is the practice of a manufacturer providing marketing
funds to a retailer contingent on the retailer advertising an end customer price at or above a specified
level. Such agreements can be illegal in some countries when members and terms in the agreement match
predefined legal criteria.
Fixed pricing established between a distributor and seller or between two or more sellers may violate
antitrust laws in the United States.”
Source : Wikipedia contributors. (2020, March 16). List price. In Wikipedia, The Free Encyclope-
dia. Retrieved 14:56, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=List_price&oldid=
945807027
Netiquette “Netiquette is a combination of the words network and etiquette and is defined as a set of rules for
acceptable online behavior. Similarly, online ethics focuses on the acceptable use of online resources in an
online social environment.”
Source : What is Netiquette? A Guide to Online Ethics and Etiquette (n.d.). Retrieved March 26, 2020,
from https://www.webroot.com/nz/en/resources/tips-articles/netiquette-and-online-ethics-what-are-they
NoSQL A NoSQL (originally referring to “non SQL” or “non relational”) database provides a mechanism for
storage and retrieval of data that is modeled in means other than the tabular relations used in relational
databases.
Source : Wikipedia contributors. (2020, March 14). NoSQL. In Wikipedia, The Free Encyclopedia. Re-
trieved 14:56, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=NoSQL&oldid=945474807
On-Premise
On-Premise Software “On-premises software ... is installed and runs on computers on the premises of the person or organization using the software, rather than at a remote facility such as a server farm or cloud.”
Wikipedia contributors. (2019, November 28). On-premises software. In Wikipedia, The Free Encyclo-
pedia. Retrieved 14:57, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=On-premises_
software&oldid=928327829
Residential proxy “A residential proxy is an IP address provided by an Internet Service Provider (ISP).”
Source : Buy Residential Proxies: 10M IPs - 99.99% uptime. (2020, January 29). Retrieved March 26,
2020, from https://smartproxy.com/proxies/residential-proxies
RESTful
Representational state transfer “Representational state transfer (REST) is a software architectural style that
defines a set of constraints to be used for creating Web services. Web services that conform to the REST
architectural style, called RESTful Web services, provide interoperability between computer systems on the
Internet. RESTful Web services allow the requesting systems to access and manipulate textual representa-
tions of Web resources by using a uniform and predefined set of stateless operations.”
Source : Wikipedia contributors. (2020, February 19). Representational state transfer. In Wikipedia, The
Free Encyclopedia. Retrieved 14:57, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=
Representational_state_transfer&oldid=941589430
scrap-center A tool that orchestrates the scraping requests sent to the scrapers.
The scrap-center stores the raw data (images, HTML) on the file system and the metadata in the NoSQL databases.
scraper
scrapers A tool, part of the Accelerator, able to scrape pages on websites while taking over the complexity of collecting the data (captchas, IP blocking, ...).
The scraper uses tools such as a Web Crawler to browse the World Wide Web.
scraping
web scraping Web scraping means extracting required information from a web page using code.
[WikiWebScraping20]
Reading : Jarell, E. (2018, November 26). Building a Web Scraper from start to finish. Retrieved March 26,
2020, from https://hackernoon.com/building-a-web-scraper-from-start-to-finish-bb6b95388184
Selenium “Selenium is a portable framework for testing web applications. Selenium provides a playback tool
for authoring functional tests without the need to learn a test scripting language (Selenium IDE). It also
provides a test domain-specific language (Selenese) to write tests in a number of popular programming
languages, including C#, Groovy, Java, Perl, PHP, Python, Ruby and Scala.”
Source : Wikipedia contributors. (2020, February 18). Selenium (software). In Wikipedia, The Free En-
cyclopedia. Retrieved 14:59, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Selenium_
(software)&oldid=941443949
VPN
Virtual Private network “A virtual private network (VPN) extends a private network across a public network,
and enables users to send and receive data across shared or public networks as if their computing devices
were directly connected to the private network”
Source : Wikipedia contributors. (2020, March 25). Virtual private network. In Wikipedia, The Free
Encyclopedia. Retrieved 14:59, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Virtual_
private_network&oldid=947375734
Web API
Web APIs “Web APIs are the defined interfaces through which interactions happen between an enterprise and
applications that use its assets, which also is a Service Level Agreement (SLA) to specify the functional
provider and expose the service path or URL for its API users. An API approach is an architectural approach
that revolves around providing a program interface to a set of services to different applications serving
different types of consumers.”
Source: Wikipedia contributors. (2020, March 25). Application programming interface. In Wikipedia, The
Free Encyclopedia. Retrieved 15:00, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=
Application_programming_interface&oldid=947328151
Web Crawler “A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an
Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing
(web spidering).
Web search engines and some other sites use Web crawling or spidering software to update their web content
or indices of others sites’ web content. Web crawlers copy pages for processing by a search engine which
indexes the downloaded pages so users can search more efficiently.
Crawlers consume resources on visited systems and often visit sites without approval. Issues of schedule,
load, and “politeness” come into play when large collections of pages are accessed. Mechanisms exist for
public sites not wishing to be crawled to make this known to the crawling agent. For example, including a
robots.txt file can request bots to index only parts of a website, or nothing at all.”
Source : https://en.wikipedia.org/wiki/Web_crawler
web services “A server running on a computer device, listening for requests at a particular port over a network,
serving web documents (HTML, JSON, XML, images), and creating web applications services, which serve
in solving specific domain problems over the Web (WWW, Internet, HTTP)”
Wikipedia contributors. (2020, March 25). Web crawler. In Wikipedia, The Free Encyclopedia. Retrieved
15:01, March 26, 2020, from https://en.wikipedia.org/w/index.php?title=Web_crawler&oldid=947328470
wrapper
wrappers “Wrappers facilitate access to Web-based information sources by providing a uniform querying and
data extraction capability” [KO2018]
BIBLIOGRAPHY
[BiddersEdge100FSupp2d105800] eBay, Inc. v. Bidder’s Edge, Inc., 100 F. Supp. 2d 1058 (N.D. Cal. 2000). (n.d.). Retrieved from https://law.justia.com/cases/federal/district-courts/FSupp2/100/1058/2478126

[BiGartner12] Bittman, T. (2012, September 24). Mind the Gap: Here Comes Hybrid Cloud. Retrieved from https://blogs.gartner.com/thomas_bittman/2012/09/24/mind-the-gap-here-comes-hybrid-cloud

[DmlpNotCopy14] Works Not Covered By Copyright. (n.d.). Retrieved March 21, 2020, from https://www.dmlp.org/legal-guide/works-not-covered-copyright

[KO2018] Knoblock, C. A. (2018, February 3). Modeling Web Sources for Information Integration. Retrieved March 26, 2020, from https://github.com/usc-isi-i2/usc-isi-i2.github.io/blob/master/slides/2018-02-03-AAAI-KG-Tutorial-CK.pptx

[ScrapinghubBestPractices18] Scrapinghub. (n.d.). Guide to Web Scraping Best Practices. Retrieved March 21, 2020, from https://info.scrapinghub.com/web-scraping-guide/web-scraping-best-practices

[WebScrapingforCPI18] Web scraping and online data collection and processing for the consumer price index. (2018, February 8). Retrieved from https://statbel.fgov.be/en/news/web-scraping-and-online-data-collection-and-processing-consumer-price-index

[WikiWebScraping20] Wikipedia contributors. (2020, March 11). Web scraping. In Wikipedia, The Free Encyclopedia. Retrieved 11:16, March 19, 2020, from https://en.wikipedia.org/w/index.php?title=Web_scraping&oldid=945118241