From an experimental ISTAT survey:
1. Researchers used web scraping and text mining of 85,000 Italian company websites to develop models predicting key metrics like web ordering.
2. They compared these "alternative estimates" to traditional survey estimates for accuracy, finding that the accuracy of the new estimates was not lower.
3. The new methods provided additional granular data like estimates by industry and region not available from surveys alone.
A. Nurra, From ICT survey data to experimental statistics; using IaD source for website functionalities
1. From ICT survey data to experimental statistics: using IaD source for website functionalities
ALESSANDRA NURRA
ISTAT – Researcher
2. Outline
① The ICT Survey
② 3 target variables and official European statistics
③ Importance of web ordering
④ Main goals
⑤ 5 phases of alternative estimate procedure
⑥ Results: estimates comparison and additional information
⑦ Production point of view: conclusions and perspectives
ALESSANDRA NURRA
Researcher, Istat – DIPS-DCSE-SEC
3. The ICT Survey
MAIN INDICATORS FROM THE ICT SURVEY
o The principal aim of this survey is to supply users with indicators on: Internet activities (web site, social media, cloud computing) and connection used (fixed and mobile broadband), e-Business (use of software such as ERP, CRM), e-Commerce, ICT skills, e-Invoice, etc.
MAIN PURPOSES OF THE ICT SURVEY INDICATORS
o The ICT survey is also one of the major sources of data for the Digital Agenda Scoreboard and the Digital Economy and Society Index (DESI), which measure the progress of the European digital economy and track the evolution of EU member states in digital competitiveness.
The survey is part of the European Community statistics on the information society: Community Survey on ICT usage and e-commerce in enterprises
Data for year 2017:
- Pop ent 10+ (from BR updated to 2015): 184,865
- Sampling frame: 32,361
- Respondents: 21,410 (66%)
4. 3 target variables and official European statistics
o Rate of enterprises where the website provides online ordering or booking, e.g. shopping cart (percentage out of tot pop 10+)
o Rate of enterprises where the website provides advertisement of open job positions or job application (percentage out of tot pop 10+)
o Rate of enterprises where the website has links or references to the enterprise's social media profiles (percentage out of tot pop 10+)
Phenomena are slowly growing
Italy is below European values

geo \ time   2012  2013  2014  2015  2016  2017
EU28           15    16    17    17    18    20
IT             11    12    11    13    14    15

geo \ time   2012  2013  2014  2015  2016  2017
EU28           19    21    24  n.d.    27  n.d.
IT              8     8    10    10    10    11

geo \ time   2012  2013  2014  2015  2016  2017
EU28         n.d.  n.d.    22    28    33    35
IT           n.d.  n.d.    21    26    28    31

Rate of enterprises with website (percentage out of tot pop 10+)
geo \ time   2012  2013  2014  2015  2016  2017
EU28           71    73    74    75    77    77
IT             65    67    69    71    71    72
5. Importance of web ordering
E-commerce
  E-sales
    Web sales
      Web site (web ordering)
      App
      E-marketplace
    Electronic automatic sales (EDI)
  E-purchases
Internet data on web ordering would help us to have more control on the evolution of:
• WEB SALES
• E-SALES
• E-COMMERCE
Competitiveness drivers
6. Main goals
WHAT HOW WHY
To replicate a subset of estimates currently produced by the survey
• investigating new IT solutions
• improving/developing new skills
• evaluating and comparing quality of alternative estimates with traditional ones
To produce additional information
• increasing the offer of statistical information
To integrate the information collected with the survey with that collected via Internet
• improving accuracy of traditional estimates
7. 5 phases of alternative estimate procedure
[Flow diagram. Population frame (Asia): 185,000 ent 10+. I: list of URLs (130,000 websites; Big Data: Internet as Data Source). II: web scraping (100,000 websites). III: text processing (doc terms matrix; predictors for 85,000 websites). IV: model fitting (predictors and survey data for a sample of 12,000; predictions). V: estimation (survey data → estimates; predictions → alternative estimates).]
8. First 4 phases of alternative estimate procedure
1 - LIST OF URLs
1. integration and validation of the list of available URLs
2. retrieval of URLs:
• enterprise denomination used as query (10 websites returned per query)
• processing the 10 results to choose 1: matching other information with web content; using a ML approach to associate URLs to enterprises
from potentially 130,000 to about 100,000
2 - WEB SCRAPING
1. reading the homepage and all the other reachable pages
2. doing Optical Character Recognition (OCR) on all types of images (also a screenshot of the homepage)
from 100,000 to 85,000
3 - TEXT MINING
to convert each website into a data record with relevant information:
1. processing text (NLP)
2. computing a Term Evaluation (TE) function to give a measure (score) of relevance to each term (using ML)
3. selecting the first best terms to codify the enterprise website in terms of target variables
4 - MODEL FITTING
1. fitting the model with a supervised ML classifier on a subset of 12,000 (observed and Internet data)
2. choosing the classifier on the basis of performance measures
3. applying the chosen classifier to the whole 85,000 doc terms matrix; the best results have been obtained with Random Forest (RF) (information retrieval for the target variable on links to SM profiles)
4. obtaining predictions on 85,000 enterprises
9. Survey estimates
1. obtained by using the usual design based / model assisted approach, where weights are obtained by a calibration procedure applied to the basic weights (inverse of inclusion probabilities), making use of known totals in the target population ($U$ = 185,000) in order to reduce the bias due to non-response and the variability due to sampling errors
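The weighting idea above can be illustrated with a minimal sketch. It assumes the simplest form of calibration, post-stratification, in which basic weights are rescaled so that weighted counts match known totals in each stratum; the actual calibration procedure and auxiliary totals used by the survey are not described in the slides. The field names (`stratum`, `incl_prob`, `y`) and all figures are hypothetical.

```python
from collections import defaultdict

def poststratify(units, known_totals):
    """Rescale basic weights (1 / inclusion probability) so that the
    weighted count in each stratum equals the known population total."""
    weighted_count = defaultdict(float)
    for u in units:
        weighted_count[u["stratum"]] += 1.0 / u["incl_prob"]
    calibrated = []
    for u in units:
        base_weight = 1.0 / u["incl_prob"]
        factor = known_totals[u["stratum"]] / weighted_count[u["stratum"]]
        calibrated.append(base_weight * factor)
    return calibrated

def estimate_rate(units, weights):
    """Weighted share of units with y == 1 (e.g. website offers web ordering)."""
    positives = sum(w for u, w in zip(units, weights) if u["y"] == 1)
    return positives / sum(weights)

# Toy respondents (hypothetical values, not survey data)
respondents = [
    {"stratum": "small", "incl_prob": 0.1, "y": 1},
    {"stratum": "small", "incl_prob": 0.1, "y": 0},
    {"stratum": "small", "incl_prob": 0.1, "y": 0},
    {"stratum": "large", "incl_prob": 0.5, "y": 1},
    {"stratum": "large", "incl_prob": 0.5, "y": 1},
]
known_totals = {"small": 40.0, "large": 6.0}

weights = poststratify(respondents, known_totals)
rate = estimate_rate(respondents, weights)
print(f"calibrated weights sum to {sum(weights):.1f}, estimated rate {rate:.3f}")
```

After calibration the weights sum to the known population size, which is how known totals compensate for non-response among the sampled units.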
Phase V, estimation: survey data → estimates; predictions → alternative estimates
COMPARISON: three different sets of estimates
Alternative estimates
have been calculated by adopting two different estimators:
2. full model based estimator, where the estimate of the total number of enterprises offering the target variable on their websites is given by the count of the predicted values for all units whose websites could be reached ($U_2$ = 85,000), calibrated in order to make them representative of the whole population having websites ($U_1$ = 130,000);
3. combined estimator, produced by summing three components: the count of predicted values in the sub-population $U_2$; an adjustment based on the differences between the reported values and the predicted values, expanded to sub-population $U_2$; the count of observed values for respondents that declared a website that was neither found nor scraped, expanded to sub-population $U_1 - U_2$.
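The two estimators described above can be sketched as follows. This is only an illustration under stated assumptions: the calibration of the full model based estimator is reduced to a single ratio $N_1 / N_2$, and the "expansion" in the combined estimator is done with survey weights passed in by the caller. The function names and toy figures are hypothetical and do not come from the slides.

```python
def full_model_based(predictions_u2, n_u1):
    """Count predicted positives over the scraped websites (U2) and scale
    the count up to the population with websites (U1).
    Assumption: calibration is a single ratio n_u1 / n_u2."""
    n_u2 = len(predictions_u2)
    return sum(predictions_u2) * n_u1 / n_u2

def combined(predictions_u2, respondents_u2, respondents_not_scraped):
    """Sum of the three components named in the text.
    respondents_u2: (weight, reported, predicted) for respondents in U2.
    respondents_not_scraped: (weight, reported) for respondents that
    declared a website that was neither found nor scraped (U1 - U2)."""
    count_predicted = sum(predictions_u2)
    adjustment = sum(w * (y - p) for w, y, p in respondents_u2)
    not_scraped = sum(w * y for w, y in respondents_not_scraped)
    return count_predicted + adjustment + not_scraped

# Toy data (hypothetical): 8 scraped websites, 12 enterprises with a website
predictions_u2 = [1, 0, 0, 1, 0, 0, 0, 1]
respondents_u2 = [(4.0, 1, 1), (4.0, 1, 0)]      # second one was mispredicted
respondents_not_scraped = [(2.0, 1), (2.0, 0)]

total_model = full_model_based(predictions_u2, n_u1=12)
total_combined = combined(predictions_u2, respondents_u2, respondents_not_scraped)
print(total_model, total_combined)
```

In this sketch the adjustment term is zero whenever predictions agree with reported values, so the combined estimator corrects the model count only where the survey contradicts it.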
10. Results: estimates comparison (1/3)
RATE OF ENTERPRISES WITH WEB ORDERING, JOB ADVERTISING, LINKS TO SOCIAL MEDIA IN THEIR WEBSITES BY SIZE CLASSES - Year 2017
The three sets of estimates are broadly coherent. In many cases, though not all, the alternative estimates lie well inside the confidence interval of the survey estimate; this holds for many values in the different domains for all three target variables.
11. Results: estimates comparison (2/3)
RATE OF ENTERPRISES WITH WEB ORDERING, JOB ADVERTISING, LINKS TO SOCIAL MEDIA IN THEIR WEBSITES BY NACE (24 economic activity groups considered in the survey) - Year 2017
A simulation study with 1,000 iterations, comparing the accuracy of the 3 sets of estimates in terms of the components of the MSE (bias and variance), shows that the accuracy of the new estimates is not lower than that of the estimates already produced by the ICT survey.
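The decomposition of the MSE into bias and variance used in such a simulation can be sketched as below. The synthetic population, the sample size and the simple random sampling design are assumptions made for illustration only; they are not the design of the study described above.

```python
import random
import statistics

def mse_components(estimates, true_value):
    """Return (bias, variance, mse) of a list of simulated estimates.
    With the population variance, mse == bias**2 + variance."""
    mean_estimate = statistics.fmean(estimates)
    bias = mean_estimate - true_value
    variance = statistics.pvariance(estimates, mu=mean_estimate)
    mse = statistics.fmean((e - true_value) ** 2 for e in estimates)
    return bias, variance, mse

# Synthetic population (hypothetical): 1,000 enterprises, 15% with web ordering
rng = random.Random(0)
population = [1] * 150 + [0] * 850
true_rate = sum(population) / len(population)

# 1,000 iterations: draw a simple random sample of 100 and estimate the rate
estimates = [sum(rng.sample(population, 100)) / 100 for _ in range(1000)]

bias, variance, mse = mse_components(estimates, true_rate)
print(f"bias={bias:.5f} variance={variance:.5f} mse={mse:.5f}")
```

Running the same loop once per estimator and comparing the three components side by side is one way to check that a new estimator is not less accurate than an existing one.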
12. Results: additional information (3/3)
MODEL BASED ESTIMATES - RATE OF ENTERPRISES WITH WEB ORDERING, JOB ADVERTISING, LINKS TO SOCIAL MEDIA IN THEIR WEBSITES BY NACE REV. 2 LEVEL 2 (62 divisions) - Year 2017
All data and metadata were published on June 8 on the Istat website dedicated to experimental statistics, in the subsection on Results of experiments on big data.
13. Production point of view: conclusions and perspectives
Full model based (and combined) estimates can be considered acceptable, but we need time series analysis to verify the stability of the procedure and of the results.
For burden reasons, in the ICT survey for year 2019 the entire website section will be 'optional', so it will not be possible to use the combined estimator (which also uses observed survey values); full model based estimators will therefore remain the only alternative for time series.
o Role of ICT survey data: they have been used for fitting the models that predict values; furthermore, prediction errors have a direct impact on one of the three components of the combined estimator.
o In cases where predicted values differed from those reported by respondents, manual checks showed that in about half of the cases the difference was due not to a model fault but to response errors.
o Open question: are these respondent errors, website URL errors, or due to other reasons (for example, web ordering made in a private area of the website)? Urgent (and time-consuming) needs: re-contact respondents, ask for the URL inside the question on web functionalities rather than at the end of the questionnaire, improve definitions. A strong effort is required to ensure good-quality answers and to reduce response errors in the training set (from the survey), even if in the future this should be necessary only every 'n' years; one solution could be to use a (small) subset of data as training set, not necessarily obtained through costly repeated official sample surveys.
o Other open issues for the future: European comparability; with predicted values it will not be possible to combine observed variables and predicted ones for the same respondents.
14. Production point of view: conclusions and perspectives
The work done could be extended and adapted in multiple ways.
Considering the 3 target variables:
• in the case of web ordering, the possibility of using IaD to find other functionalities related to website sales, such as web payment and web delivery tracking, could be evaluated;
• in the case of job advertisement (it will not be included in the European ICT survey for the next 3 years), the possibility of searching for additional job details, such as the characteristics of each single job in terms of skills required, could be evaluated;
• in the case of social media presence, detection could be extended to scrape not only the enterprise website but also the social media directly, in order to investigate in more detail what kind of use is being made of social media.
Considering other aspects linked to ICT usage and eCommerce:
• to investigate web sales of enterprises via other means: e-marketplaces, apps, social shopping (new Instagram 'shopping' feature, launched in March 2018; Facebook Marketplace);
• to investigate further enterprises operating in specific economic activities (for example Nace 47.91, Retail sale via mail order houses or via Internet, including retail sale of any kind of product over the Internet) in order to have information about products/services;
• to reuse and adapt the procedure described to e-government websites or to websites of enterprises with fewer than 10 persons employed.
15. Working team
Istat: G.Barcaroli, G.Bianchi, N.Golini, A. Nurra,
P.Righi, S.Salamone, F.Scalfati, M.Scannapieco,
D.Summa, D.Zardetto
CINECA: M.Scarnò
Univ.Roma Sapienza: R.Bruni
Link to Istat experimental statistics and metadata on website functionalities:
https://www.istat.it/en/archivio/216641
Thanks for your attention
16. Use of e-commerce marketplaces for web sales most popular in Italy, Germany, Austria and Poland
Regarding the number of enterprises that sold their goods or services through an e-commerce
marketplace, the highest shares were recorded in Italy (54%) and Germany (52%), followed by Austria
and Poland (both 47%).
17. Enterprises using social networks, 2017 and 2013 (% of enterprises)
18. E-sales broken down by web and EDI-type sales, 2016 (% enterprises with e-sales)
The percentage of enterprises receiving orders via websites or apps was high in almost all Member States (in Italy, 70% of enterprises with e-sales receive orders via web sales, 20% via EDI, 10% via both).
In Italy, 8 out of 10 enterprises with web sales sell through their own website or app, and almost 5 out of 10 (EU28: 39%) sell via an e-marketplace.
20. Importance of web ordering
YES - WEB ORDERING (referred to year t)
• YES WEB SALES (referred to year t-1) of the respondent
• NO WEB SALES (referred to year t-1) of the respondent [due to a new website or functionality used in year t and not in t-1; due to cases of web sales computed on the turnover of another enterprise of the group, foreign enterprises, enterprises with fewer than 10 persons employed (out of scope)]
NO - WEB ORDERING (referred to year t)
• NO WEB SALES (referred to year t-1) of the respondent
• YES WEB SALES (referred to year t-1) of the respondent (via e-marketplaces, via app)
SEEMINGLY CONTRADICTORY ANSWERS
21. 5 phases of alternative estimates procedure (2/5)
I: from 132,000 websites (Big Data: Internet as Data Source) to 100,000 websites
URL: Uniform Resource Locator corresponding to a statistical unit
• Data integration to have a list of URLs to validate (sources: ICT survey, Consodata)
• URL retrieval:
  • for unavailable URLs, an automated procedure has been set up that uses the enterprise denomination as a search string, runs the query and collects the first 10 links returned;
  • the first ten URLs are processed in order to choose the right one for the given enterprise of the population of interest:
    - matching the enterprise information (denomination, fiscal code/VAT, telephone, address, etc., available from administrative data) against the content of the first ten URLs retrieved;
    - using a ML approach to associate URLs to enterprises: for each link its probability of correctness is evaluated, and links whose probability exceeds a given threshold are accepted as valid.
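The matching and threshold step above can be sketched as follows. The ML model that estimates the probability of correctness is not described in the slides, so `placeholder_probability` is a labelled stand-in (the share of enterprise identifiers found in the page text), and the threshold of 0.5, the field names and the example URLs are all hypothetical.

```python
def match_features(enterprise, page_text):
    """For each known identifier, record whether it appears in the page text."""
    text = page_text.lower()
    return {field: str(value).lower() in text for field, value in enterprise.items()}

def placeholder_probability(features):
    """Stand-in for the ML model (assumption): share of identifiers found."""
    return sum(features.values()) / len(features)

def accept_url(enterprise, candidates, threshold=0.5):
    """Return the best candidate URL whose probability exceeds the
    threshold, or None when no candidate qualifies."""
    scored = [
        (placeholder_probability(match_features(enterprise, text)), url)
        for url, text in candidates
    ]
    valid = [(p, url) for p, url in scored if p > threshold]
    return max(valid)[1] if valid else None

# Toy enterprise record and search results (hypothetical)
enterprise = {"name": "Rossi Arredamenti", "vat": "01234567890", "phone": "06 1234567"}
candidates = [
    ("https://example.com/directory", "Elenco aziende: Rossi Arredamenti, Bianchi Srl"),
    ("https://example.org/", "Rossi Arredamenti - P.IVA 01234567890 - Tel. 06 1234567"),
]

print(accept_url(enterprise, candidates))
```

Returning `None` when no link passes the threshold mirrors the drop from the potential list of websites to the smaller set of validated URLs.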
22. 5 phases of alternative estimates procedure (3/5)
II + III: web scraping + text processing, from 100,000 websites to a doc terms matrix of 85,000 records
Website scraping
• reads the homepage and all the other reachable pages (max of 20 pages; the depth can be selected)
• does Optical Character Recognition (OCR) on all types of images (also a screenshot of the homepage)
The text mining phase, to convert each website into a data record
• processing text using Natural Language Processing techniques
• computing a Term Evaluation (TE) function, which gives a measure (score) of relevance to each term (using supervised ML: given a term of the training set, its frequency in a class is compared to its overall frequency; the term is relevant if it occurs primarily in positive documents and rarely in negative ones, or vice versa)
• summarizing each website into a number of relevant terms and applying dimensionality reduction techniques to obtain a set of data records describing websites (85,000 doc terms matrix)
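The idea behind the term relevance score can be illustrated with a small sketch. The actual TE function is not given in the slides; the score used here, the absolute difference between the share of positive and of negative documents containing a term, is one simple choice assumed for illustration. The documents and labels are invented, and both classes are assumed to be non-empty.

```python
from collections import Counter

def term_scores(documents, labels):
    """Score each term by how unevenly it is spread across the two classes.
    documents: list of token lists; labels: 1 (positive) or 0 (negative)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_df, neg_df = Counter(), Counter()
    for terms, label in zip(documents, labels):
        # Document frequency: count each term once per document
        (pos_df if label else neg_df).update(set(terms))
    vocabulary = set(pos_df) | set(neg_df)
    return {t: abs(pos_df[t] / n_pos - neg_df[t] / n_neg) for t in vocabulary}

def top_terms(scores, k):
    """The k best terms, ties broken alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

# Toy training set (hypothetical): 1 = website offers web ordering
documents = [
    ["cart", "checkout", "contact"],
    ["cart", "shop", "contact"],
    ["about", "contact"],
    ["history", "contact", "about"],
]
labels = [1, 1, 0, 0]

scores = term_scores(documents, labels)
print(top_terms(scores, 2))
```

A term such as "contact", which appears in every document, scores zero, while terms concentrated in one class score high whether they signal a positive or a negative case.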
23. 5 phases of alternative estimates procedure (4/5)
IV: predictors (doc terms matrix) for 12,000 enterprises are used for model fitting; predictors for 85,000 enterprises yield the predictions
Model fitting: supervised ML classifier using a training set (data driven, not a deterministic choice of keywords)
• To fit the model (machine learning) on the subset of enterprises where both Internet data and survey data were available (12,000), considering survey data as the true values (several classification approaches have been applied);
• To apply the classifier to all 85,000 websites, predicting the values of target variables for all the enterprises for which the retrieval and scraping of their websites was successful.
Performance evaluation of classification algorithms - performance measures for classifiers:
1. Accuracy: rate of correct predictions on the total of cases
2. Sensitivity (or recall): rate of true positives on the total number of positives
3. Precision: rate of true positives on the total number of positive predictions
4. F1-score: harmonic mean of Sensitivity and Precision
The best results have been obtained with Random Forest (RF) (information retrieval for the target variable on links to SM profiles)
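The four performance measures listed above can be computed directly from true and predicted labels, as in this minimal sketch. The label vectors are invented for illustration, and the function returns 0.0 for a measure whose denominator is zero, which is a convention assumed here.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity (recall), precision and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}

# Toy labels (hypothetical): survey value vs. classifier prediction
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Because the target phenomena are relatively rare, accuracy alone can look good for a classifier that rarely predicts a positive, which is why recall, precision and F1 are reported alongside it.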