From ICT survey data to
experimental statistics:
using IaD source for
website functionalities
ALESSANDRA NURRA
ISTAT – Researcher
① The ICT Survey
② 3 target variables and official European statistics
③ Importance of web ordering
④ Main goals
⑤ 5 phases of alternative estimate procedure
⑥ Results: estimates comparison and additional information
⑦ Production point of view: conclusions and perspectives
ALESSANDRA NURRA
Researcher, Istat – DIPS-DCSE-SEC
Outline
MAIN INDICATORS FROM THE ICT SURVEY
o The principal aim of this survey is to supply users with indicators on: Internet activities (website, social media, cloud computing) and connection used (fixed and mobile broadband), e-Business (use of software such as ERP, CRM), e-Commerce, ICT skills, e-Invoice, etc.
MAIN PURPOSES OF THE ICT SURVEY INDICATORS
o The ICT survey is also one of the major sources of data for the Digital Agenda Scoreboard and the Digital Economy and Society Index (DESI), which measure the progress of the European digital economy and track the evolution of EU member states in digital competitiveness.
The survey is part of the European Community statistics on the information society: Community Survey on ICT usage and e-commerce in enterprises.
Data for year 2017:
- Population of enterprises with 10+ persons employed (from the Business Register, BR, updated to 2015): 184,865
- Sampling frame: 32,361
- Respondents: 21,410 (66%)
The ICT Survey
o Rate of enterprises where the website provides online ordering or booking, e.g. shopping cart (percentage of the total population of enterprises with 10+ persons employed)
o Rate of enterprises where the website provides advertisement of open job positions or online job application (percentage of the total population of enterprises with 10+ persons employed)
o Rate of enterprises where the website has links or references to the enterprise's social media profiles (percentage of the total population of enterprises with 10+ persons employed)
• Phenomena are slowly growing
• Italy is below European values
3 target variables and official European statistics
geo     2012  2013  2014  2015  2016  2017
EU28      15    16    17    17    18    20
IT        11    12    11    13    14    15

geo     2012  2013  2014  2015  2016  2017
EU28      19    21    24  n.a.    27  n.a.
IT         8     8    10    10    10    11

geo     2012  2013  2014  2015  2016  2017
EU28    n.a.  n.a.    22    28    33    35
IT      n.a.  n.a.    21    26    28    31

Rate of enterprises with website (percentage out of tot pop 10+)
geo     2012  2013  2014  2015  2016  2017
EU28      71    73    74    75    77    77
IT        65    67    69    71    71    72
Importance of web ordering
E-commerce
  E-sales
    Web sales: web site (web ordering), app, e-marketplace
    Electronic automatic sales (EDI)
  E-purchases
Internet data on web ordering would help us to have more control over the evolution of:
• WEB SALES
• E-SALES
• E-COMMERCE
Competitiveness drivers
WHAT HOW WHY
o To replicate a subset of the estimates currently produced by the survey
• investigating new IT solutions
• improving / developing new skills
• evaluating and comparing the quality of alternative estimates with traditional ones
o To produce additional information
• increasing the offer of statistical information
o To integrate the information collected with the survey with that collected via the Internet
• improving the accuracy of traditional estimates
Main goals
5 phases of alternative estimate procedure
[Diagram of the five phases. Population frame (Asia): 185,000 enterprises with 10+ persons employed. Phase I, list of URLs: 130,000 websites (Big Data: Internet as Data Source). Phase II, web scraping: 100,000 websites. Phase III, text processing: document-term matrix, predictors for 85,000 websites. Phase IV, model fitting: predictors and survey data for a sample of 12,000, giving predictions. Phase V, estimation: alternative estimates, alongside the estimates obtained from survey data.]
1 - LIST OF URLs
1. integration and validation of the list of available URLs
2. URL retrieval:
• using the enterprise denomination as a search string (first 10 links returned by the query)
• processing the 10 links to choose 1:
- matching other information with the web content
- using a ML approach to associate URLs to enterprises
from potentially 130,000 to about 100,000 websites
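The matching of enterprise information with web content can be sketched as follows. This is only a minimal illustration, assuming the candidate pages have already been downloaded as plain text; the function names, the field names and the sample enterprise are hypothetical and are not taken from the Istat procedure.

```python
def match_score(enterprise, page_text):
    """Count how many known identifiers of the enterprise appear in the page text."""
    text = page_text.lower()
    fields = ("name", "vat", "phone", "address")  # hypothetical field names
    return sum(1 for f in fields
               if enterprise.get(f) and enterprise[f].lower() in text)


def best_candidate(enterprise, candidates):
    """Return (url, score) for the best-matching page, or None if nothing matches."""
    scored = [(match_score(enterprise, text), url) for url, text in candidates.items()]
    score, url = max(scored) if scored else (0, None)
    return (url, score) if score > 0 else None


# Made-up enterprise and candidate pages, for illustration only.
enterprise = {"name": "Rossi Ceramiche", "vat": "01234567890", "phone": "06 1234567"}
candidates = {
    "http://example.com/a": "Rossi Ceramiche srl - P.IVA 01234567890 - tel 06 1234567",
    "http://example.org/b": "Directory of ceramic shops: Rossi Ceramiche and others",
}
result = best_candidate(enterprise, candidates)
```

A plain substring count is the simplest possible choice; in practice the match indicators would more plausibly feed the ML step that associates URLs to enterprises.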
2 - WEB SCRAPING
1. reading the homepage and all the other reachable pages
2. doing Optical Character Recognition (OCR) on all types of images (also a screenshot of the homepage)
from 100,000 to 85,000 websites
First 4 phases of alternative estimate procedure
3 - TEXT MINING
to convert each website into a data record with relevant information:
1. processing text (NLP)
2. computing a Term Evaluation (TE) function to give a measure (score) of relevance to each term (using ML)
3. selecting the best-scoring terms to codify the enterprise website in terms of the target variables
4 - MODEL FITTING
1. fitting the model with a supervised ML classifier on a subset of 12,000 enterprises (with both observed and Internet data)
2. choosing the classifier on the basis of performance measures
3. applying the chosen classifier to the whole document-term matrix of 85,000 websites; the best results have been obtained with Random Forest (RF) (information retrieval for the target variable on links to SM profiles)
4. obtaining predictions for 85,000 enterprises
Survey estimates
1. obtained by using the usual design-based / model-assisted approach, where weights are obtained by calibrating the basic weights (inverse of the inclusion probabilities) on known totals in the target population ($U$ = 185,000), in order to reduce the bias due to non-response and the variability due to sampling errors
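The idea of calibrating basic weights on known totals can be illustrated with its simplest special case, post-stratification, where weights are rescaled within each stratum so that they add up to the known stratum total. This is a sketch under that assumption: the calibration actually used in the survey may rely on more auxiliary variables, and the strata, weights and totals below are made up.

```python
from collections import defaultdict


def poststratify(units, known_totals):
    """Scale base weights within each stratum so they sum to the known total."""
    weight_sum = defaultdict(float)
    for u in units:
        weight_sum[u["stratum"]] += u["base_weight"]
    return [u["base_weight"] * known_totals[u["stratum"]] / weight_sum[u["stratum"]]
            for u in units]


def weighted_rate(units, weights, variable):
    """Weighted share of units for which the 0/1 variable equals 1."""
    return sum(w * u[variable] for u, w in zip(units, weights)) / sum(weights)


# Hypothetical respondents: base weight = inverse of the inclusion probability.
respondents = [
    {"stratum": "small", "base_weight": 10.0, "web_ordering": 0},
    {"stratum": "small", "base_weight": 10.0, "web_ordering": 1},
    {"stratum": "large", "base_weight": 2.0, "web_ordering": 1},
]
known_totals = {"small": 30, "large": 3}  # hypothetical population counts

weights = poststratify(respondents, known_totals)
rate = weighted_rate(respondents, weights, "web_ordering")
```

After the adjustment the weights reproduce the known totals exactly, which is the property calibration is meant to guarantee.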
Alternative estimates
[Diagram: survey data → estimation → estimates; predictions → estimation (phase V) → alternative estimates. COMPARISON of three different sets of estimates.]
Alternative estimates have been calculated by adopting two different estimators:
2. full model-based estimator, where the estimate of the total number of enterprises offering the target functionality on their websites is given by the count of the predicted values for all units whose websites could be reached ($U_2$ = 85,000), calibrated in order to make them representative of the whole population having websites ($U_1$ = 130,000);
3. combined estimator, produced by summing three components: the count of predicted values in the sub-population $U_2$; an adjustment based on the differences between the reported values and the predicted values, expanded to sub-population $U_2$; the count of observed values for respondents that declared a website that was neither found nor scraped, expanded to sub-population $U_1 - U_2$.
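One possible reading of the two estimators is sketched below on toy data. It assumes that the calibration of the model-based count from $U_2$ to $U_1$ can be approximated by a single ratio factor, and that the combined estimator has the form of a difference estimator with respondent weights expanding to the relevant sub-population. Function names, weights and values are hypothetical and do not reproduce the Istat implementation.

```python
def full_model_based_total(predictions, n_u1):
    """Count predicted positives in U2 and scale the count up to U1 (ratio factor)."""
    n_u2 = len(predictions)
    return sum(predictions) * n_u1 / n_u2


def combined_total(predictions, respondents_scraped, respondents_not_scraped):
    """Sum of the three components of the combined estimator.

    respondents_scraped: (weight, reported, predicted) for respondents in U2
    respondents_not_scraped: (weight, reported) for respondents in U1 - U2
    """
    count_predicted = sum(predictions)
    adjustment = sum(w * (y - y_hat) for w, y, y_hat in respondents_scraped)
    not_scraped = sum(w * y for w, y in respondents_not_scraped)
    return count_predicted + adjustment + not_scraped


# Toy setting: 12 enterprises with a website (U1), 8 of them scraped (U2).
predictions = [1, 0, 0, 1, 0, 0, 0, 1]                 # predicted values in U2
respondents_scraped = [(4.0, 1, 1), (4.0, 1, 0)]       # weights sum to |U2| = 8
respondents_not_scraped = [(2.0, 1), (2.0, 0)]         # weights sum to |U1 - U2| = 4

model_based = full_model_based_total(predictions, n_u1=12)
combined = combined_total(predictions, respondents_scraped, respondents_not_scraped)
```

In the toy data the second respondent reports the functionality while the model predicts its absence, so the adjustment term raises the combined total; this is how prediction errors enter the combined estimator.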
RATE OF ENTERPRISES WITH WEB ORDERING, JOB ADVERTISING, LINKS TO SOCIAL MEDIA IN THEIR
WEBSITES BY SIZE CLASSES - Year 2017
Results: estimates comparison (1/3)
The three sets of estimates are broadly coherent. In many cases, though not all, the alternative estimates fall well inside the confidence interval of the survey estimate; this holds for many values in the different domains and for all three target variables.
Results: estimates comparison (2/3)
RATE OF ENTERPRISES WITH WEB ORDERING, JOB
ADVERTISING, LINKS TO SOCIAL MEDIA IN THEIR
WEBSITES BY NACE (24 economic activities
group considered in the survey) - Year 2017
A simulation study with 1,000 iterations, comparing the accuracy of the 3 estimates in terms of the components of the MSE (bias and variance), shows that the accuracy of these new estimates is not lower than that of the estimates already produced by the ICT survey.
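The decomposition of the MSE into bias and variance used in such a comparison can be illustrated with a small Monte Carlo sketch. The population, the sample size and the estimator (a simple sample mean) are hypothetical and serve only to show how the components are computed over 1,000 iterations; they are not the design of the simulation study.

```python
import random
import statistics


def mse_components(estimates, true_value):
    """Return (bias, variance, mse) of a list of replicated estimates."""
    bias = statistics.fmean(estimates) - true_value
    variance = statistics.pvariance(estimates)
    mse = statistics.fmean([(e - true_value) ** 2 for e in estimates])
    return bias, variance, mse


rng = random.Random(0)                      # fixed seed for reproducibility
population = [1] * 150 + [0] * 850          # hypothetical population, true rate 0.15
true_rate = sum(population) / len(population)

# 1,000 iterations: draw a sample of 100 units and estimate the rate each time.
estimates = [statistics.fmean(rng.sample(population, 100)) for _ in range(1000)]
bias, variance, mse = mse_components(estimates, true_rate)
```

Over the replications the identity MSE = bias² + variance holds exactly, so two estimators can be compared component by component.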
Results: additional information (3/3)
MODEL BASED ESTIMATES - RATE OF ENTERPRISES WITH WEB ORDERING, JOB ADVERTISING, LINKS TO SOCIAL MEDIA IN THEIR WEBSITES BY NACE REV. 2 LEVEL 2 (62 divisions) - Year 2017
All data and metadata were published on June 8 on the Istat website dedicated to experimental statistics, in the subsection on Results of experiments on big data.
To reduce respondent burden, in the ICT survey for year 2019 the entire website section will be ‘optional’, so it will not be possible to use the combined estimator (which also uses the values observed in the survey); full model-based estimators will therefore remain the only alternative for time series.
o Role of ICT survey data: they have been used for fitting the models to predict values; furthermore, prediction errors have a direct impact on one of the three components of the combined estimator.
Production point of view: conclusions and perspectives
Full model-based (and combined) estimates can be considered acceptable, but we need time series analysis to verify the stability of the procedure and of the results.
o Open question: respondent errors, website URL errors or other reasons (for example, web ordering made in a private area of the website)? Urgent (and time-consuming) needs: re-contacting respondents, asking for the URL inside the question on web functionalities and not at the end of the questionnaire, improving definitions. A strong effort is therefore required to assure good quality of answers and to reduce response errors in the training set (from the survey), even if in the future this should be necessary only every ‘n’ years; one solution could be to use a (small) subset of data as training set, not necessarily obtained by costly repeated official sample surveys.
o In cases where predicted values differed from those reported by respondents, manual checks showed that in about half of the cases the difference was due not to a model fault but to response errors.
o Other open issues for the future: European comparability; with predicted values it will not be possible to combine observed variables and predicted ones for the same respondents.
Production point of view: conclusion and perspectives
The work done could be extended and adapted in multiple ways.
Considering the 3 target variables:
• in the case of web ordering, the possibility of using IaD to find other functionalities related to website sales, such as web payment and web delivery tracking, could be evaluated;
• in the case of job advertisement (it will not be included in the European ICT survey for the next 3 years), the possibility of searching for additional job details, such as the characteristics of each single job in terms of skills required, could be evaluated;
• in the case of social media presence, detection could be extended to scrape not only the enterprise website but also the social media directly, in order to investigate in more detail what kind of use is being made of social media.
Considering other aspects linked to ICT usage and eCommerce:
• to investigate web sales of enterprises via other means: e-marketplaces, apps, social shopping (new Instagram ‘shopping’ feature, launched in March 2018; Facebook Marketplace);
• to investigate further the enterprises operating in specific economic activities (for example in Nace 47.91, Retail sale via mail order houses or via Internet, including retail sale of any kind of product over the Internet) in order to have information about products/services;
• to reuse and adapt the procedure described to e-government websites or to websites of enterprises with less than 10 persons employed.
Working team
Istat: G.Barcaroli, G.Bianchi, N.Golini, A. Nurra,
P.Righi, S.Salamone, F.Scalfati, M.Scannapieco,
D.Summa, D.Zardetto
CINECA: M.Scarnò
Univ.Roma Sapienza: R.Bruni
Link to Istat experimental statistics and metadata on website functionalities:
https://www.istat.it/en/archivio/216641
Thanks for your attention
Use of e-commerce marketplaces for web sales most popular in Italy, Germany, Austria and Poland
Regarding the number of enterprises that sold their goods or services through an e-commerce
marketplace, the highest shares were recorded in Italy (54%) and Germany (52%), followed by Austria
and Poland (both 47%).
Enterprises using social networks, 2017 and 2013 (% of enterprises)
E-sales broken down by web and EDI-type sales, 2016 (% enterprises with e-sales)
The percentage of enterprises receiving orders over websites or via apps was high in almost all Member States (in Italy, 70% of enterprises with e-sales receive orders via web sales, 20% via EDI, 10% via both).
In Italy 8 out of 10 enterprises with web sales sell through their own website or app, and almost 5 out of 10 (EU28: 39%) sold via an e-marketplace.
Summarising performance of first URLs retrieval procedure
Importance of web ordering
YES - WEB ORDERING (referred to year t)
• YES WEB SALES (referred to year t-1) of the respondent
• NO WEB SALES (referred to year t-1) of the respondent [due to a new website or functionality used in year t and not in t-1; due to cases of web sales computed on the turnover of another enterprise of the group, of foreign enterprises or of enterprises with less than 10 persons employed (out of scope)]
NO - WEB ORDERING (referred to year t)
• NO WEB SALES (referred to year t-1) of the respondent
• YES WEB SALES (referred to year t-1) of the respondent (via e-marketplaces, via app)
SEEMINGLY CONTRADICTORY ANSWERS
5 phases of alternative estimates procedure (2/5)
[Diagram: 132,000 websites (Big Data: Internet as Data Source) → 100,000 websites]
• Data integration to obtain a list of URLs to validate (sources: ICT survey, Consodata)
• URL retrieval
- For unavailable URLs, an automated procedure has been set up that uses the enterprise denomination as a search string and collects the first 10 links returned by the query
- The first ten URLs are processed in order to choose the right one for the given enterprise of the population of interest:
  - matching the enterprise information (denomination, fiscal code, VAT number, telephone, address, etc., available from administrative data) against the content of the first ten URLs retrieved;
  - using a ML approach to associate URLs to enterprises: for each link its probability of correctness is evaluated, and the links whose probability exceeds a given threshold are accepted as valid.
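The final acceptance rule can be sketched as a simple threshold on the estimated probabilities. The sketch assumes the probabilities have already been produced by a classifier and that at most one URL is kept per enterprise (the most probable one); the threshold value 0.5, the identifiers and the URLs are illustrative assumptions.

```python
def accept_links(link_probabilities, threshold=0.5):
    """Keep, per enterprise, the most probable link if it exceeds the threshold."""
    accepted = {}
    for enterprise_id, candidates in link_probabilities.items():
        if not candidates:
            continue
        url, prob = max(candidates.items(), key=lambda item: item[1])
        if prob > threshold:
            accepted[enterprise_id] = url
    return accepted


# Hypothetical probabilities of correctness for the links returned by the query.
link_probabilities = {
    "ent_1": {"http://example.com": 0.92, "http://example.net": 0.40},
    "ent_2": {"http://example.org": 0.30},
    "ent_3": {},
}
accepted = accept_links(link_probabilities)
```

Enterprises with no link above the threshold are left without a URL, which is one reason why the number of usable websites shrinks from one phase to the next.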
URL: Uniform Resource Locator corresponding to a statistical unit
5 phases of alternative estimates procedure (3/5)
[Diagram: 100,000 websites → web scraping + text processing → document-term matrix for 85,000 websites]
Website scraping
• reads the homepage and all the other reachable pages (max of 20 pages, the depth can be selected)
• does Optical Character Recognition (OCR) on all types of images (also a screenshot of the homepage)
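The page limit and the selectable depth can be pictured as a bounded breadth-first visit starting from the homepage. The sketch works on an in-memory link graph instead of real HTTP requests and leaves out OCR entirely; the limit of 20 pages comes from the slide, while the default depth of 2 and the toy site are assumptions.

```python
from collections import deque


def crawl_order(links, homepage, max_pages=20, max_depth=2):
    """Breadth-first visit of pages reachable from the homepage, bounded in size and depth."""
    visited = [homepage]
    seen = {homepage}
    queue = deque([(homepage, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the selected depth
        for target in links.get(page, []):
            if target not in seen:
                if len(visited) >= max_pages:
                    return visited  # page budget exhausted
                seen.add(target)
                visited.append(target)
                queue.append((target, depth + 1))
    return visited


# Toy site: each page maps to the pages it links to.
site = {"/": ["/shop", "/jobs"], "/shop": ["/cart"], "/cart": ["/checkout"]}
pages = crawl_order(site, "/")
```

With depth 2 the checkout page is never reached, which hints at why functionalities placed deep in a site or in a private area can be missed by the scraper.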
The text mining phase converts each website into a data record:
• processing text using Natural Language Processing techniques
• computing a Term Evaluation (TE) function, which gives a measure (score) of relevance to each term using supervised ML: given a term of the training set, its frequency in a class is compared to its overall frequency; the term is relevant if it occurs primarily in positive documents and rarely in negative ones, or vice versa
• summarizing each website into a number of relevant terms and applying dimensionality reduction techniques to obtain a set of data records describing the websites (document-term matrix for 85,000 websites)
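A very simple term scoring rule in this spirit is sketched below: each term is scored by the absolute difference between the share of positive and the share of negative training documents that contain it, and each website is then reduced to a 0/1 record over the best terms. This is one illustrative choice and not the TE function actually used; the tokens and labels are made up, and the training set is assumed to contain at least one document of each class.

```python
def term_scores(documents, labels):
    """Score terms by |share of positive docs - share of negative docs| containing them."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_count, neg_count = {}, {}
    for tokens, label in zip(documents, labels):
        counts = pos_count if label == 1 else neg_count
        for term in set(tokens):
            counts[term] = counts.get(term, 0) + 1
    terms = set(pos_count) | set(neg_count)
    return {t: abs(pos_count.get(t, 0) / n_pos - neg_count.get(t, 0) / n_neg)
            for t in terms}


def top_terms(scores, k):
    """Return the k best-scoring terms (ties broken alphabetically)."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def to_record(tokens, selected_terms):
    """Encode one website as a 0/1 row of the document-term matrix."""
    present = set(tokens)
    return [1 if t in present else 0 for t in selected_terms]


# Hypothetical tokenised websites; label 1 = web ordering reported in the survey.
documents = [["cart", "checkout", "home"], ["cart", "home", "contact"],
             ["home", "contact"], ["home", "about"]]
labels = [1, 1, 0, 0]

scores = term_scores(documents, labels)
selected = top_terms(scores, 3)
record = to_record(["home", "cart"], selected)
```

Terms such as "home", which occur equally often in both classes, receive a score of zero and are dropped, while "cart" separates the classes and is kept.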
5 phases of alternative estimates procedure (4/5)
Model fitting: supervised ML classifier using a training set (data-driven, not a deterministic choice of keywords)
• The model is fitted (machine learning) on the subset of enterprises for which both Internet data and survey data were available (12,000), considering survey data as the true values (several classification approaches have been applied);
• The classifier is applied to all 85,000 websites, predicting the values of the target variables for all the enterprises whose websites were successfully retrieved and scraped.
Performance measures used to evaluate the classification algorithms:
1. Accuracy: share of correct predictions out of the total number of cases
2. Sensitivity (or recall): share of true positives out of the total number of positives
3. Precision: share of true positives out of the total number of positive predictions
4. F1-score: harmonic mean of Sensitivity and Precision
The best results have been obtained with Random Forest (RF) (information retrieval for the target variable on links to SM profiles)
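The four performance measures listed above can be computed from the confusion counts as in the following sketch. The reported and predicted labels are made up for illustration; returning 0.0 when a denominator is zero is a convention chosen here, not something stated in the slides.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, recall (sensitivity), precision and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}


# Hypothetical survey answers (taken as true values) and classifier predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

When the target functionality is rare, accuracy alone can look good for a classifier that predicts almost no positives, which is why recall, precision and F1 are reported alongside it.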
[Diagram, phase IV: predictors from the document-term matrix for the 12,000 training units and for all 85,000 websites → predictions]
ALESSANDRA NURRA
Researcher, Istat – DIPS-DCSE-SEC

Contenu connexe

Tendances

2010.080 1226
2010.080 12262010.080 1226
2010.080 1226swaipnew
 
EDF2014: Talk of Vassileios Tsetsos, Chief Technical Officer, Mobics Ltd: Pre...
EDF2014: Talk of Vassileios Tsetsos, Chief Technical Officer, Mobics Ltd: Pre...EDF2014: Talk of Vassileios Tsetsos, Chief Technical Officer, Mobics Ltd: Pre...
EDF2014: Talk of Vassileios Tsetsos, Chief Technical Officer, Mobics Ltd: Pre...European Data Forum
 
EDF2014: Talk of Axel Polleres, Full Professor, WU - Vienna University of Eco...
EDF2014: Talk of Axel Polleres, Full Professor, WU - Vienna University of Eco...EDF2014: Talk of Axel Polleres, Full Professor, WU - Vienna University of Eco...
EDF2014: Talk of Axel Polleres, Full Professor, WU - Vienna University of Eco...European Data Forum
 
EDF2014: Ralf-Peter Schaefer, Head of Traffic Product Unit, TomTom, Germany: ...
EDF2014: Ralf-Peter Schaefer, Head of Traffic Product Unit, TomTom, Germany: ...EDF2014: Ralf-Peter Schaefer, Head of Traffic Product Unit, TomTom, Germany: ...
EDF2014: Ralf-Peter Schaefer, Head of Traffic Product Unit, TomTom, Germany: ...European Data Forum
 
DISCOVERY DAY 2017: MAKE IT HAPPEN!
DISCOVERY DAY 2017: MAKE IT HAPPEN!DISCOVERY DAY 2017: MAKE IT HAPPEN!
DISCOVERY DAY 2017: MAKE IT HAPPEN!FAO
 
Site suitability analysis for constructing New ATM in Margao , Goa
Site suitability analysis for constructing New ATM in Margao , GoaSite suitability analysis for constructing New ATM in Margao , Goa
Site suitability analysis for constructing New ATM in Margao , Goasuyog patwardhan
 

Tendances (9)

2010.080 1226
2010.080 12262010.080 1226
2010.080 1226
 
EDF2014: Talk of Vassileios Tsetsos, Chief Technical Officer, Mobics Ltd: Pre...
EDF2014: Talk of Vassileios Tsetsos, Chief Technical Officer, Mobics Ltd: Pre...EDF2014: Talk of Vassileios Tsetsos, Chief Technical Officer, Mobics Ltd: Pre...
EDF2014: Talk of Vassileios Tsetsos, Chief Technical Officer, Mobics Ltd: Pre...
 
EDF2014: Talk of Axel Polleres, Full Professor, WU - Vienna University of Eco...
EDF2014: Talk of Axel Polleres, Full Professor, WU - Vienna University of Eco...EDF2014: Talk of Axel Polleres, Full Professor, WU - Vienna University of Eco...
EDF2014: Talk of Axel Polleres, Full Professor, WU - Vienna University of Eco...
 
EDF2014: Ralf-Peter Schaefer, Head of Traffic Product Unit, TomTom, Germany: ...
EDF2014: Ralf-Peter Schaefer, Head of Traffic Product Unit, TomTom, Germany: ...EDF2014: Ralf-Peter Schaefer, Head of Traffic Product Unit, TomTom, Germany: ...
EDF2014: Ralf-Peter Schaefer, Head of Traffic Product Unit, TomTom, Germany: ...
 
Wireless network planning solutions
Wireless network planning solutions Wireless network planning solutions
Wireless network planning solutions
 
DISCOVERY DAY 2017: MAKE IT HAPPEN!
DISCOVERY DAY 2017: MAKE IT HAPPEN!DISCOVERY DAY 2017: MAKE IT HAPPEN!
DISCOVERY DAY 2017: MAKE IT HAPPEN!
 
14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica 14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica
 
Pasquale Persico, Nuovi strumenti di analisi del turismo
Pasquale Persico, Nuovi strumenti di analisi del turismoPasquale Persico, Nuovi strumenti di analisi del turismo
Pasquale Persico, Nuovi strumenti di analisi del turismo
 
Site suitability analysis for constructing New ATM in Margao , Goa
Site suitability analysis for constructing New ATM in Margao , GoaSite suitability analysis for constructing New ATM in Margao , Goa
Site suitability analysis for constructing New ATM in Margao , Goa
 

Similaire à A. Nurra, From ICT survey data to experimental statistics; using IaD source for website functionalities

Application of DEA in IT & Communication
Application of DEA in IT & CommunicationApplication of DEA in IT & Communication
Application of DEA in IT & CommunicationAbhay_018
 
Analyzing the Impact of Visitors on Page Views with Google Analytics
Analyzing the Impact of Visitors on Page Views with Google Analytics  Analyzing the Impact of Visitors on Page Views with Google Analytics
Analyzing the Impact of Visitors on Page Views with Google Analytics dannyijwest
 
Analyzing the Impact of Visitors on Page Views with Google Analytics
Analyzing the Impact of Visitors on Page Views with Google Analytics  Analyzing the Impact of Visitors on Page Views with Google Analytics
Analyzing the Impact of Visitors on Page Views with Google Analytics dannyijwest
 
Time Series ANN Approach for Weather Forecasting
Time Series ANN Approach for Weather ForecastingTime Series ANN Approach for Weather Forecasting
Time Series ANN Approach for Weather Forecastingijctcm
 
Data analytics to improve home broadband cx & network insight
Data analytics to improve home broadband cx & network insightData analytics to improve home broadband cx & network insight
Data analytics to improve home broadband cx & network insightRavi Sharma
 
RESEARCH CHALLENGES IN WEB ANALYTICS – A STUDY
RESEARCH CHALLENGES IN WEB ANALYTICS – A STUDYRESEARCH CHALLENGES IN WEB ANALYTICS – A STUDY
RESEARCH CHALLENGES IN WEB ANALYTICS – A STUDYIRJET Journal
 
Semantically Enriched Knowledge Extraction With Data Mining
Semantically Enriched Knowledge Extraction With Data MiningSemantically Enriched Knowledge Extraction With Data Mining
Semantically Enriched Knowledge Extraction With Data MiningEditor IJCATR
 
P. Struijs, Toward the Use of Big Data for European Statistics
P. Struijs, Toward the Use of Big Data for European StatisticsP. Struijs, Toward the Use of Big Data for European Statistics
P. Struijs, Toward the Use of Big Data for European StatisticsIstituto nazionale di statistica
 
Chapter 3 • Nature of Data, Statistical Modeling, and Visuali.docx
Chapter 3 • Nature of Data, Statistical Modeling, and Visuali.docxChapter 3 • Nature of Data, Statistical Modeling, and Visuali.docx
Chapter 3 • Nature of Data, Statistical Modeling, and Visuali.docxpoulterbarbara
 
Transport for London - London's Operations Digital Twin
Transport for London - London's Operations Digital TwinTransport for London - London's Operations Digital Twin
Transport for London - London's Operations Digital TwinNeo4j
 
Data Science Salon: Adopting Machine Learning to Drive Revenue and Market Share
Data Science Salon: Adopting Machine Learning to Drive Revenue and Market ShareData Science Salon: Adopting Machine Learning to Drive Revenue and Market Share
Data Science Salon: Adopting Machine Learning to Drive Revenue and Market ShareFormulatedby
 
Open Data Infrastructures Evaluation Framework using Value Modelling
Open Data Infrastructures Evaluation Framework using Value Modelling Open Data Infrastructures Evaluation Framework using Value Modelling
Open Data Infrastructures Evaluation Framework using Value Modelling Yannis Charalabidis
 
Search Engine Scrapper
Search Engine ScrapperSearch Engine Scrapper
Search Engine ScrapperIRJET Journal
 
Effort Estimation Development Model for Web-based Mobile Application Using Fu...
Effort Estimation Development Model for Web-based Mobile Application Using Fu...Effort Estimation Development Model for Web-based Mobile Application Using Fu...
Effort Estimation Development Model for Web-based Mobile Application Using Fu...TELKOMNIKA JOURNAL
 
Determination and visualization of density210409
Determination and visualization of density210409 Determination and visualization of density210409
Determination and visualization of density210409 Kenji Sugihara
 
IRJET- Popularity based Recommender Sytsem for Google Maps
IRJET-  	  Popularity based Recommender Sytsem for Google MapsIRJET-  	  Popularity based Recommender Sytsem for Google Maps
IRJET- Popularity based Recommender Sytsem for Google MapsIRJET Journal
 
CV-Grace-DataAnalytics-UCL
CV-Grace-DataAnalytics-UCLCV-Grace-DataAnalytics-UCL
CV-Grace-DataAnalytics-UCLHan Yang
 
IRJET- Logistics Network Superintendence Based on Knowledge Engineering
IRJET- Logistics Network Superintendence Based on Knowledge EngineeringIRJET- Logistics Network Superintendence Based on Knowledge Engineering
IRJET- Logistics Network Superintendence Based on Knowledge EngineeringIRJET Journal
 
Data Query: Exploration at your fingertips
Data Query: Exploration at your fingertips Data Query: Exploration at your fingertips
Data Query: Exploration at your fingertips AT Internet
 

Similaire à A. Nurra, From ICT survey data to experimental statistics; using IaD source for website functionalities (20)

Application of DEA in IT & Communication
Application of DEA in IT & CommunicationApplication of DEA in IT & Communication
Application of DEA in IT & Communication
 
EW-Shopp: Interoperability Challenges and Solutions
EW-Shopp: Interoperability Challenges and SolutionsEW-Shopp: Interoperability Challenges and Solutions
EW-Shopp: Interoperability Challenges and Solutions
 
Analyzing the Impact of Visitors on Page Views with Google Analytics
Analyzing the Impact of Visitors on Page Views with Google Analytics  Analyzing the Impact of Visitors on Page Views with Google Analytics
Analyzing the Impact of Visitors on Page Views with Google Analytics
 
Analyzing the Impact of Visitors on Page Views with Google Analytics
Analyzing the Impact of Visitors on Page Views with Google Analytics  Analyzing the Impact of Visitors on Page Views with Google Analytics
Analyzing the Impact of Visitors on Page Views with Google Analytics
 
Time Series ANN Approach for Weather Forecasting
Time Series ANN Approach for Weather ForecastingTime Series ANN Approach for Weather Forecasting
Time Series ANN Approach for Weather Forecasting
 
Data analytics to improve home broadband cx & network insight
Data analytics to improve home broadband cx & network insightData analytics to improve home broadband cx & network insight
Data analytics to improve home broadband cx & network insight
 
RESEARCH CHALLENGES IN WEB ANALYTICS – A STUDY
RESEARCH CHALLENGES IN WEB ANALYTICS – A STUDYRESEARCH CHALLENGES IN WEB ANALYTICS – A STUDY
RESEARCH CHALLENGES IN WEB ANALYTICS – A STUDY
 
Semantically Enriched Knowledge Extraction With Data Mining
Semantically Enriched Knowledge Extraction With Data MiningSemantically Enriched Knowledge Extraction With Data Mining
Semantically Enriched Knowledge Extraction With Data Mining
 
P. Struijs, Toward the Use of Big Data for European Statistics
P. Struijs, Toward the Use of Big Data for European StatisticsP. Struijs, Toward the Use of Big Data for European Statistics
P. Struijs, Toward the Use of Big Data for European Statistics
 
Chapter 3 • Nature of Data, Statistical Modeling, and Visuali.docx
Chapter 3 • Nature of Data, Statistical Modeling, and Visuali.docxChapter 3 • Nature of Data, Statistical Modeling, and Visuali.docx
Chapter 3 • Nature of Data, Statistical Modeling, and Visuali.docx
 
Transport for London - London's Operations Digital Twin
Transport for London - London's Operations Digital TwinTransport for London - London's Operations Digital Twin
Transport for London - London's Operations Digital Twin
 
Data Science Salon: Adopting Machine Learning to Drive Revenue and Market Share
Data Science Salon: Adopting Machine Learning to Drive Revenue and Market ShareData Science Salon: Adopting Machine Learning to Drive Revenue and Market Share
Data Science Salon: Adopting Machine Learning to Drive Revenue and Market Share
 
Open Data Infrastructures Evaluation Framework using Value Modelling
Open Data Infrastructures Evaluation Framework using Value Modelling Open Data Infrastructures Evaluation Framework using Value Modelling
Open Data Infrastructures Evaluation Framework using Value Modelling
 
Search Engine Scrapper
Search Engine ScrapperSearch Engine Scrapper
Search Engine Scrapper
 
Effort Estimation Development Model for Web-based Mobile Application Using Fu...
Effort Estimation Development Model for Web-based Mobile Application Using Fu...Effort Estimation Development Model for Web-based Mobile Application Using Fu...
Effort Estimation Development Model for Web-based Mobile Application Using Fu...
 
Determination and visualization of density210409
Determination and visualization of density210409 Determination and visualization of density210409
Determination and visualization of density210409
 
IRJET- Popularity based Recommender Sytsem for Google Maps
IRJET-  	  Popularity based Recommender Sytsem for Google MapsIRJET-  	  Popularity based Recommender Sytsem for Google Maps
IRJET- Popularity based Recommender Sytsem for Google Maps
 
CV-Grace-DataAnalytics-UCL
CV-Grace-DataAnalytics-UCLCV-Grace-DataAnalytics-UCL
CV-Grace-DataAnalytics-UCL
 
IRJET- Logistics Network Superintendence Based on Knowledge Engineering
IRJET- Logistics Network Superintendence Based on Knowledge EngineeringIRJET- Logistics Network Superintendence Based on Knowledge Engineering
IRJET- Logistics Network Superintendence Based on Knowledge Engineering
 
Data Query: Exploration at your fingertips
Data Query: Exploration at your fingertips Data Query: Exploration at your fingertips
Data Query: Exploration at your fingertips
 

Plus de Istituto nazionale di statistica

Plus de Istituto nazionale di statistica (20)

Censimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profitCensimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profit
 
Censimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profitCensimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profit
 
Censimenti Permanenti Istituzioni non profit
A. Nurra, From ICT survey data to experimental statistics: using IaD source for website functionalities

  • 1. From ICT survey data to experimental statistics: using IaD source for website functionalities. ALESSANDRA NURRA, ISTAT – Researcher
  • 2. Outline: ① The ICT Survey ② 3 target variables and official European statistics ③ Importance of web ordering ④ Main goals ⑤ 5 phases of alternative estimate procedure ⑥ Results: estimates comparison and additional information ⑦ Production point of view: conclusions and perspectives. ALESSANDRA NURRA, Researcher, Istat – DIPS-DCSE-SEC
  • 3. The ICT Survey. MAIN INDICATORS FROM THE ICT SURVEY: the principal aim of this survey is to supply users with indicators on Internet activities (website, social media, cloud computing) and connections used (fixed and mobile broadband), e-Business (use of software such as ERP, CRM), e-Commerce, ICT skills, e-Invoice, etc. MAIN PURPOSES OF THE ICT SURVEY INDICATORS: the ICT survey is also one of the major sources of data for the Digital Agenda Scoreboard and the Digital Economy and Society Index (DESI), which measure the progress of the European digital economy and track the evolution of EU member states in digital competitiveness. The survey is part of the European Community statistics on the information society (Community Survey on ICT usage and e-commerce in enterprises). Data for year 2017: population of enterprises 10+ (from BR updated to 2015): 184,865; sampling frame: 32,361; respondents: 21,410 (66%).
  • 4. 3 target variables and official European statistics. Rate of enterprises where the website provides online ordering or booking, e.g. shopping cart; rate of enterprises where the website provides advertisement of open job positions or job application; rate of enterprises where the website has links or references to the enterprise's social media profiles (each as a percentage of the total population of enterprises 10+). The phenomena are slowly growing; Italy is below European values. Tables in slide order:
      geo    2012  2013  2014  2015  2016  2017
      EU28     15    16    17    17    18    20
      IT       11    12    11    13    14    15

      geo    2012  2013  2014  2015  2016  2017
      EU28     19    21    24  n.d.    27  n.d.
      IT        8     8    10    10    10    11

      geo    2012  2013  2014  2015  2016  2017
      EU28   n.d.  n.d.    22    28    33    35
      IT     n.d.  n.d.    21    26    28    31

      Rate of enterprises with website (percentage out of tot pop 10+):
      geo    2012  2013  2014  2015  2016  2017
      EU28     71    73    74    75    77    77
      IT       65    67    69    71    71    72
  • 5. Importance of web ordering. Diagram labels: E-commerce; E-sales; E-purchases; Web sales; Web site (web ordering); App; E-marketplace; Electronic automatic sales (EDI). Internet data on web ordering would help us to have more control over the evolution of WEB SALES, E-SALES and E-COMMERCE (competitiveness drivers).
  • 6. Main goals (WHAT, HOW, WHY). (1) To replicate a subset of estimates currently produced by the survey: investigating new IT solutions; improving/developing new skills; evaluating and comparing the quality of alternative estimates with traditional ones. (2) To produce additional information: increasing the offer of statistical information. (3) To integrate the information collected with the survey with that collected via the Internet: improving the accuracy of traditional estimates.
  • 7. 5 phases of alternative estimate procedure. [Flow diagram] Population frame: Asia, 185,000 enterprises 10+. Big Data (Internet as Data Source): I. list of URLs (130,000 websites); II. web scraping (100,000 websites); III. text processing (doc terms matrix, 85,000 predictors); IV. model fitting (12,000 sample, predictors and survey data); V. estimation (estimates from survey data; alternative estimates from predictions).
  • 8. First 4 phases of alternative estimate procedure.
      1 - LIST OF URLs (from potentially 130,000 to about 100,000): 1. integration and validation of the list of available URLs; 2. URL retrieval: the enterprise denomination is used as a query (10 websites returned per query), and the 10 results are processed to choose 1, by matching other information with the web content and by using a ML approach to associate URLs to enterprises.
      2 - WEB SCRAPING (from 100,000 to 85,000): 1. reading the homepage and all the other reachable pages; 2. doing Optical Character Recognition (OCR) on all types of images (also a screenshot of the homepage).
      3 - TEXT MINING, to convert each website into a data record with relevant information: 1. processing text (NLP); 2. computing a Term Evaluation (TE) function to give a measure (score) of relevance to each term (using ML); 3. selecting the best terms to codify the enterprise website in terms of the target variables.
      4 - MODEL FITTING: 1. fitting a model with a supervised ML classifier on a subset of 12,000 enterprises (observed and Internet data); 2. choosing the classifier on the basis of performance measures; 3. applying the chosen classifier to the whole 85,000 document-term matrix; the best results have been obtained with Random Forest (RF) (information retrieval for the target variable on links to SM profiles); 4. obtaining predictions on 85,000 enterprises.
  • 9. Alternative estimates (phase V). COMPARISON of three different sets of estimates.
      Survey estimates: 1. obtained by using the usual design-based / model-assisted approach, where weights are obtained by a calibration procedure applied to the basic weights (inverse of inclusion probabilities), making use of known totals in the target population (U = 185,000) in order to reduce the bias due to non-response and the variability due to sampling errors.
      Alternative estimates have been calculated by adopting two different estimators:
      2. full model-based estimator, where the estimate of the total number of enterprises offering the target variable on their websites is given by the count of the predicted values for all units whose websites could be reached (U2 = 85,000), calibrated in order to make them representative of the whole population having websites (U1 = 130,000);
      3. combined estimator, produced by summing three components: the count of predicted values in the sub-population U2; an adjustment based on the differences between the reported values and the predicted values, expanded to the sub-population U2; the count of observed values for respondents that declared a website that was neither found nor scraped, expanded to the sub-population U1 − U2.
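The two alternative estimators described on slide 9 can be sketched in a few lines of Python. This is only a minimal illustration under simplifying assumptions: the calibration step is replaced by a plain ratio expansion from U2 to U1, and the function names, the expansion weights and the toy data are hypothetical, not taken from the Istat procedure.

```python
def full_model_based_total(predictions_u2, size_u1):
    """Count predicted positives in U2 and expand the count to U1.

    Assumption: a simple ratio expansion stands in for the calibration
    to known totals described on the slide.
    """
    size_u2 = len(predictions_u2)
    return sum(predictions_u2) * size_u1 / size_u2


def combined_total(predictions_u2, respondents_scraped, respondents_unscraped):
    """Sum the three components of the combined estimator.

    respondents_scraped:   (weight, reported, predicted) for respondents in U2
    respondents_unscraped: (weight, reported) for respondents who declared a
                           website that was neither found nor scraped (U1 - U2)
    The weights are hypothetical expansion weights.
    """
    # Component 1: count of predicted values in U2
    predicted_count = sum(predictions_u2)
    # Component 2: weighted differences between reported and predicted values
    adjustment = sum(w * (y - p) for w, y, p in respondents_scraped)
    # Component 3: weighted reported values for the unscraped respondents
    unscraped_part = sum(w * y for w, y in respondents_unscraped)
    return predicted_count + adjustment + unscraped_part


# Toy data: 4 scraped websites, population with websites of size 6
predictions = [1, 0, 1, 1]
model_total = full_model_based_total(predictions, size_u1=6)
combo_total = combined_total(
    predictions,
    respondents_scraped=[(2.0, 1, 1), (2.0, 0, 1)],
    respondents_unscraped=[(2.0, 1)],
)
```

Dividing either total by the size of the target population gives the corresponding rate. The adjustment term is where survey response errors and prediction errors both enter the combined estimate.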
  • 10. Results: estimates comparison (1/3). RATE OF ENTERPRISES WITH WEB ORDERING, JOB ADVERTISING, LINKS TO SOCIAL MEDIA IN THEIR WEBSITES BY SIZE CLASSES - Year 2017. The three different sets of estimates are not incoherent. In many cases, though not all, the alternative estimates fall well inside the confidence interval of the survey estimate; this holds for many values in the different domains and for all three target variables.
  • 11. Results: estimates comparison (2/3). RATE OF ENTERPRISES WITH WEB ORDERING, JOB ADVERTISING, LINKS TO SOCIAL MEDIA IN THEIR WEBSITES BY NACE (24 economic activity groups considered in the survey) - Year 2017. A simulation study carried out on 1000 iterations, comparing the accuracy of the 3 estimates in terms of the components of the MSE (bias and variance), shows that the accuracy of the new estimates is not lower than that of the estimates already produced by the ICT survey.
  • 12. Results: additional information (3/3). MODEL BASED ESTIMATES - RATE OF ENTERPRISES WITH WEB ORDERING, JOB ADVERTISING, LINKS TO SOCIAL MEDIA IN THEIR WEBSITES BY NACE REV. 2 LEVEL 2 (62 divisions) - Year 2017. All data and metadata were published on June 8 on the Istat website dedicated to experimental statistics, in the subsection on results of experiments on big data.
  • 13. Production point of view: conclusions and perspectives. Full model-based (and combined) estimates can be considered acceptable, but we need time series analysis to verify the stability of the procedure and of the results. For burden reasons, in the ICT survey for year 2019 the entire website section will be 'optional', so it will not be possible to use the combined estimator (which also uses the observed survey values); full model-based estimators will therefore remain the only alternative for time series. Role of ICT survey data: they have been used for fitting the models to predict values; furthermore, prediction errors have a direct impact on one of the three components of the combined estimator. Where predicted values differed from those reported by respondents, manual checks showed that in about half of the cases the difference was due not to a model fault but to response errors. Open question: are these respondent errors, website URL errors, or due to other reasons (for example web ordering made in a private area of the website)? Urgent need (time consuming): re-contact respondents, ask for the URL inside the question on web functionalities and not at the end of the questionnaire, improve definitions. A strong effort is therefore required to assure good quality of answers and to reduce response errors in the training set (from the survey), even if in the future this should be necessary only every 'n' years; one solution could be to use a (small) subset of data as training set, not necessarily obtained by costly repeated official sample surveys. Other open issues for the future: European comparability; with predicted values it will not be possible to combine observed variables and predicted ones for the same respondents.
  • 14. Production point of view: conclusions and perspectives. The work done could be extended and adapted in multiple ways. Considering the 3 target variables: for web ordering, one could evaluate the possibility of using IaD to find other functionalities related to website sales, such as web payment and web delivery tracking; for job advertisement (which will not be included in the European ICT survey for the next 3 years), one could evaluate the possibility of searching for additional job details, such as the characteristics of each single job in terms of skills required; for social media presence, detection could be extended to scrape not only the enterprise website but also the social media directly, in order to investigate in more detail what kind of use is being made of social media. Considering other aspects linked to ICT usage and eCommerce: to investigate web sales of enterprises via other means: e-marketplaces, apps, social shopping (the new Instagram 'shopping' feature, launched in March 2018; Facebook marketplace); to investigate further the enterprises operating in specific economic activities (for example Nace 47.91, Retail sale via mail order houses or via Internet, including retail sale of any kind of product over the Internet) in order to have information about products/services; to reuse and adapt the procedure described to e-government websites or to websites of enterprises with fewer than 10 persons employed.
  • 15. Working team. Istat: G.Barcaroli, G.Bianchi, N.Golini, A.Nurra, P.Righi, S.Salamone, F.Scalfati, M.Scannapieco, D.Summa, D.Zardetto. CINECA: M.Scarnò. Univ. Roma Sapienza: R.Bruni. Link to Istat experimental statistics and metadata on website functionalities: https://www.istat.it/en/archivio/216641 Thanks for your attention
  • 16. Use of e-commerce marketplaces for web sales most popular in Italy, Germany, Austria and Poland. Regarding the number of enterprises that sold their goods or services through an e-commerce marketplace, the highest shares were recorded in Italy (54%) and Germany (52%), followed by Austria and Poland (both 47%).
  • 17. [Figure] Enterprises using social networks, 2017 and 2013 (% of enterprises)
  • 18. [Figure] E-sales broken down by web and EDI-type sales, 2016 (% enterprises with e-sales). The percentage of enterprises receiving orders over websites or via apps was considerably high for almost all Member States (in Italy, 70% of enterprises with e-sales receive orders via web sales, 20% via EDI, 10% via both). In Italy 8 out of 10 enterprises with web sales sell through their own website or app, and almost 5 out of 10 (EU28: 39%) sold via an e-marketplace.
  • 19. [Figure] Summarising performance of first URLs retrieval procedure
  • 20. Importance of web ordering: SEEMINGLY CONTRADICTORY ANSWERS. YES - WEB ORDERING (referred to year t): either YES WEB SALES (referred to year t-1) of the respondent, or NO WEB SALES (referred to year t-1) of the respondent [due to a new website or functionality used in year t and not in t-1; due to cases of web sales computed on the turnover of another enterprise of the group, foreign enterprises, or enterprises with fewer than 10 persons employed (out of scope)]. NO - WEB ORDERING (referred to year t): either NO WEB SALES (referred to year t-1) of the respondent, or YES WEB SALES (referred to year t-1) of the respondent (via e-marketplaces, via app).
  • 21. 5 phases of alternative estimates procedure (2/5): phase I, from 132,000 websites to 100,000 websites (Big Data: Internet as Data Source). URL: Uniform Resource Locator corresponding to a statistical unit. Data integration, to have a list of URLs to validate (sources: ICT survey, Consodata). URL retrieval: for URLs that are not available, an automated procedure has been set up that uses the enterprise denomination as a search string and collects the first 10 links returned by the query. The first ten URLs are then processed in order to choose the right one for the given enterprise of the population of interest, by matching the enterprise information (denomination, fiscal code, VAT, telephone, address, etc., available from administrative data) with the content of the first ten URLs retrieved, and by using a ML approach to associate URLs to enterprises: for each link its probability of correctness is evaluated, and those links whose probability exceeds a given threshold are accepted as valid.
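The acceptance rule at the end of slide 21 can be illustrated with a short Python sketch. It assumes that a model has already assigned a probability of correctness to each candidate link; the function name, the default threshold of 0.5, the example URLs and the choice of keeping only the most probable accepted link are all hypothetical.

```python
def select_url(candidates, threshold=0.5):
    """Return the most probable candidate URL above the threshold, or None.

    candidates: list of (url, probability_of_correctness) pairs, e.g. the
    first ten links returned by a search on the enterprise denomination.
    The threshold value is an illustrative assumption.
    """
    accepted = [(prob, url) for url, prob in candidates if prob > threshold]
    if not accepted:
        return None  # no link is reliable enough: the enterprise stays without a URL
    # Keep the accepted link with the highest probability
    return max(accepted)[1]


links = [
    ("https://a.example", 0.20),
    ("https://b.example", 0.90),
    ("https://c.example", 0.70),
]
best = select_url(links)
none_found = select_url(links, threshold=0.95)
```

Raising the threshold trades coverage (fewer enterprises with an associated URL) for a lower risk of scraping the wrong website.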
  • 22. 5 phases of alternative estimates procedure (3/5): phases II + III, web scraping and text processing, from 100,000 websites to an 85,000 doc terms matrix. Website scraping: reads the homepage and all the other reachable pages (max of 20 pages; the depth can be selected); does Optical Character Recognition (OCR) on all types of images (also a screenshot of the homepage). The text mining phase converts each website into a data record: processing text using Natural Language Processing techniques; computing a Term Evaluation (TE) function that gives a measure (score) of relevance to each term (using supervised ML: given a term of the training set, its frequency in a class is compared to its overall frequency, and the term is relevant if it occurs primarily in positive documents and rarely in negative ones, or vice versa); summarizing each website into a number of relevant terms and applying dimensionality reduction techniques to obtain a set of data records describing websites (85,000 doc terms matrix).
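The idea behind a Term Evaluation score can be sketched as follows. The slides do not give the actual TE function, so the score used here (absolute difference between the share of positive and negative documents containing the term) is only an illustrative stand-in, and the function name and toy documents are hypothetical.

```python
def term_scores(documents, labels):
    """Score each term by how unevenly it is spread across the two classes.

    documents: list of sets of terms (one set per website)
    labels:    list of 0/1 values of the target variable from the training set
    A score near 1 means the term occurs almost only in one class;
    a score near 0 means it is equally frequent in both.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present in the training set")
    vocabulary = set().union(*documents)
    scores = {}
    for term in vocabulary:
        in_pos = sum(1 for doc, y in zip(documents, labels) if y == 1 and term in doc)
        in_neg = sum(1 for doc, y in zip(documents, labels) if y == 0 and term in doc)
        scores[term] = abs(in_pos / n_pos - in_neg / n_neg)
    return scores


docs = [{"cart", "buy"}, {"cart", "contact"}, {"contact"}, {"about", "contact"}]
has_web_ordering = [1, 1, 0, 0]
scores = term_scores(docs, has_web_ordering)
top_term = max(scores, key=scores.get)
```

Keeping only the highest-scoring terms is one simple way to reduce each website to a short record of relevant terms before model fitting.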
  • 23. 5 phases of alternative estimates procedure (4/5): phase IV, model fitting (12,000 predictors for fitting; 85,000 predictors for predictions). Model fitting uses a supervised ML classifier with a training set (data driven, not a deterministic choice of keywords): the model is fitted on the subset of enterprises where both Internet data and survey data were available (12,000), considering survey data as the true values (several classification approaches have been applied); the classifier is then applied to all 85,000 websites, predicting the values of the target variables for all the enterprises for which the retrieval and scraping of their websites was successful. Performance measures for classifiers: 1. Accuracy: rate of correct predictions on the total of cases; 2. Sensitivity (or recall): rate of true positives on the total number of positives; 3. Precision: rate of true positives on the total number of positive predictions; 4. F1-score: harmonic mean of Sensitivity and Precision. The best results have been obtained with Random Forest (RF) (information retrieval for the target variable on links to SM profiles).
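The four performance measures listed on slide 23 follow directly from the confusion matrix. The sketch below computes them in plain Python for binary labels; the function name and the toy labels are hypothetical, and survey answers play the role of the true values as on the slide.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity (recall), precision and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # F1 is the harmonic mean of sensitivity and precision
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "precision": precision,
        "f1": f1,
    }


reported = [1, 1, 1, 0, 0, 0, 0, 0]   # survey answers (treated as true values)
predicted = [1, 1, 0, 1, 0, 0, 0, 0]  # classifier output
metrics = classification_metrics(reported, predicted)
```

For rare phenomena such as web ordering, accuracy alone can look high even for a classifier that rarely predicts a positive, which is why sensitivity, precision and F1 are reported alongside it.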