2. Collecting a Citizen’s
Digital Footprint for Health
Data Mining
Oguzhan Gencoglu, Heidi Simil, Harri Honko,
Minna Isomursu
3. Abstract
This paper describes a case study for collecting digital
footprint data for the purpose of health data mining.
The case study involved 20 subjects residing in Finland who
were instructed to collect data from registries which they
evaluated to be useful for understanding their health or
health behavior, current or past.
11 subjects were active, sending 100 data requests to 49
distinct organizations in total.
Our results indicate that there are still practical challenges in
collecting actionable digital footprint data.
4. Abstract
Out of the received data, 44 datasets (72.1%) were
delivered in paper format,
4 (6.6%) in portable document format (PDF), and
13 (21.3%) in structured digital form.
The time between sending an information request
and receiving a reply was
26.4 days on average.
5. Introduction
A digital footprint, or digital shadow, refers to one's unique set of
traceable digital activities, actions, contributions and communications that are manifested on
the Internet or on digital devices.
There are two main classifications of digital footprints:
Passive digital footprints. A passive digital footprint is created when data is
collected without the owner knowing; it can be stored in many ways depending on the situation. In an
online environment, a footprint may be stored in an online database as a "hit". This footprint may record
the user's IP address, when it was created, and where the user came from, and it can be analyzed
later. In an offline environment, a footprint may be stored in files, which can be accessed
by administrators to view the actions performed on the machine, without being able to see who
performed them.
Active digital footprints. Active digital footprints are created when personal
data is released deliberately by a user for the purpose of sharing information about oneself by means
of websites or social media.
6. Introduction
Digital footprints can tell a lot about the behavior, characteristics and preferences
of an individual [2] [3] [4] [5] [6], provided they are accessible in digitally digestible,
machine-readable form.
Increasingly, data sets, open or closed, are being made available over an
application programming interface (API). Where accessible, the person’s digital
footprint is used today, for example, for personalized recommendation services,
person-, income- and even location-context [7].
It has been proposed that digital footprint data, when properly gathered
and analyzed with modern data analytics, could provide significant opportunities
for new, more personalized and timely health services.
Aggregated and analyzed data can help individuals themselves learn about
their health condition [10] [11].
7. Introduction
Better access to electronic health records can help communication
between carers, health professionals and other service providers [12].
This can create opportunities for entirely new kinds of health and
wellbeing services, which create new business opportunities for
companies and help increase the efficiency of health interventions
through targeted care.
In this paper, we examine the state of the practice of collecting a 2010s
citizen’s personal digital footprint for the purpose of health data mining.
8. Introduction
Our research question is “Can the digital footprint of an individual be collected successfully
today for health data mining?”
For the purpose of the study, we recruited individuals to send information requests to
organizations of their own choice. They tried to maximize the number of responses.
Our results summarize how successful our case subjects were in collecting their digital
footprint data:
Did the organizations provide them access to their personal footprint data?
In what format was the data presented to them?
9. Introduction
And what procedures, roughly, would be needed to make
that data actionable, so that it could be used for
computerized health data mining by anyone attempting to
refine and analyze the data to provide insights and
health-related value?
Our discussion summarizes our experience and suggests
further work on how such data can be examined to reveal
health behavior patterns.
10. METHODOLOGY
A total of 20 volunteer participants were recruited among active researchers for this study.
The participants were instructed to print, sign and mail the information request with
the covering letter to 5-10 target organizations of their own choice.
A preliminary list of candidate sources for digital footprint information was collected
to serve as an example for the participants, although they were instructed to decide
themselves which data sources could be valuable for health data analytics.
In order to follow the process, the participants kept a record of dates when the
information requests were sent, when the replies were received and in which format.
11. METHODOLOGY
The data was requested to be delivered to each participant’s home address or email.
The information request form stated that delivery
via an API, a memory stick or DVD was preferred over printed paper documents.
After receiving the data, the participants were instructed to go through the data
and decide which representative set of the individual registers’ data they were
willing to donate for the research program.
Sensitive personal information was removed or edited when needed. Each
participant signed an informed consent form when handing over the data.
12. RESULTS AND DISCUSSION
The number of voluntary participants, all residing in Finland, was 20 (18 natives, 2 foreigners).
11 (55.0%) individuals were active during a period of five months (11/2014-03/2015), sending 100
information requests (9.09 per person) to 49 distinct data sources (2.04 requests per registry) in total.
With respect to their content, these data sources were classified by the researchers into 15 categories, i.e.,
banking, education, energy, fitness, groceries, healthcare, housing, insurance, library, mobility,
municipality, police, retail, telecommunication and web.
The average number of distinct data sources and of sent requests per category was 3.27 and
6.67, respectively.
The healthcare category had both the maximum number of distinct data sources and the maximum
number of sent requests, with 30 requests to 13 data sources.
For each category, a detailed summary of the number of data sources, number of sent requests, number
of received replies and number of replies resulting in access to data is given in Table I.
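The per-person, per-registry and per-category averages quoted above follow directly from the reported counts. The short sketch below only re-derives them for checking; the variable names are ours and not part of the study.

```python
# Counts as reported in the text.
participants_total = 20
participants_active = 11
requests_sent = 100
data_sources = 49
categories = 15

# Derived figures quoted in the text.
active_share = 100 * participants_active / participants_total
requests_per_person = requests_sent / participants_active
requests_per_source = requests_sent / data_sources
sources_per_category = data_sources / categories
requests_per_category = requests_sent / categories

print(f"active participants: {active_share:.1f}%")
print(f"requests per active participant: {requests_per_person:.2f}")
print(f"requests per data source: {requests_per_source:.2f}")
print(f"data sources per category: {sources_per_category:.2f}")
print(f"requests per category: {requests_per_category:.2f}")
```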
13.
14. RESULTS AND DISCUSSION
The overall response rate and the data response rate of the
study were 75.0% and 61.0%, respectively.
As the main purpose of a digital footprint collection
process is eventually to perform data analysis on
each individual’s data,
the amount of collected data has a great effect on
the analysis performance.
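As a rough illustration of how the two rates relate, the sketch below recomputes them from counts. The reply count of 75 is an assumption inferred from the 75.0% rate over 100 requests, and the 61 datasets are the sum of the format counts given in the abstract (44 + 4 + 13); the last figure, the share of replies that carried data, is derived here and is not stated in the text.

```python
def percentage(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal."""
    return round(100.0 * part / whole, 1)

requests_sent = 100
replies_received = 75            # assumption: implied by the 75.0% response rate
datasets_received = 44 + 4 + 13  # paper + PDF + structured, from the abstract

overall_response_rate = percentage(replies_received, requests_sent)
data_response_rate = percentage(datasets_received, requests_sent)
# Derived here, not reported in the text: replies that actually contained data.
data_share_of_replies = percentage(datasets_received, replies_received)

print(overall_response_rate, data_response_rate, data_share_of_replies)
```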
15. RESULTS AND DISCUSSION
The format of the collected data is also crucial for the analysis to be conducted properly.
Even though more than half of the data sources provided some data to the individuals, in most
cases the format of the returned data was not analysis-friendly, and often not even digitized.
The format of the delivered data can be categorized into three groups: paper format (hard
copy), portable document format (PDF) and spreadsheet/structured format, which includes
formats such as comma-separated values (CSV), Microsoft Excel file formats (XLS/XLSX) and
JavaScript object notation (JSON).
The listed order is from least to most analysis-friendly. A detailed view of the format of the
collected data for different categories can be seen in Table II.
Hard copy, i.e., paper format, accounts for the majority of the collected data with 72.1%. Only
21.3% of the collected data can be considered structured. None of the data sources offered an API
for retrieving the data.
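The three-way grouping can be sketched as a small classifier plus a share calculation. This is only an illustration: the function names, the use of file extensions as the deciding signal, and the "other" fallback are our assumptions, while the counts 44, 4 and 13 come from the abstract.

```python
from pathlib import Path
from typing import Optional

# Groups ordered from least to most analysis-friendly, as in the text.
FORMAT_GROUPS = ("paper", "pdf", "structured")
STRUCTURED_EXTENSIONS = {".csv", ".xls", ".xlsx", ".json"}

def classify_delivery(filename: Optional[str]) -> str:
    """Map a delivered file to a format group; None stands for a paper delivery."""
    if filename is None:
        return "paper"
    suffix = Path(filename).suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    if suffix in STRUCTURED_EXTENSIONS:
        return "structured"
    return "other"  # hypothetical fallback, not a group used in the study

def format_shares(counts: dict) -> dict:
    """Convert per-group dataset counts into percentages with one decimal."""
    total = sum(counts.values())
    return {group: round(100 * n / total, 1) for group, n in counts.items()}

reported_counts = {"paper": 44, "pdf": 4, "structured": 13}
shares = format_shares(reported_counts)
print(shares)
```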
16.
17. RESULTS AND DISCUSSION
When the process of transforming non-analysis-friendly data into analysis-
friendly form is considered, the drawbacks become more obvious.
Data delivered in paper format, first of all, has to be printed and mailed, which
comes at a cost.
As an individual can easily have hundreds of pages of data residing in several
data sources, logistics, security and storage problems arise.
Then, the data has to be digitized by the recipient, for example by scanning.
Such a process is not only burdensome but also error-prone.
After digitization, the data is in the form of PDF files or digital images, which have to be
fed into an optical character recognition (OCR) algorithm.
18. RESULTS AND DISCUSSION
As paper-form data is likely to contain artifacts (lines, logos, bright/dark spots due to
scanning, irrelevant text, folded or torn parts) that act as noise to the OCR system, the
likelihood of error increases.
Furthermore, the OCR system has to be tuned specifically for the structure of the text on the paper;
thus, parsing the relevant information becomes even more demanding.
In addition, as there is no guarantee that the data source will deliver the data on paper in the
same format in the future, such tasks are discouraged with respect to the reproducible research
paradigm.
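The fragility of layout-specific parsing can be illustrated with a minimal sketch. The row layout (date, description, amount with a decimal comma), the sample lines and the function name are all hypothetical and do not come from the collected data; the point is only that a parser tuned to one layout silently drops rows when OCR misreads a character or the source changes its format.

```python
import re
from datetime import date

# Hypothetical layout: "DD.MM.YYYY  description  amount" with a decimal comma.
ROW_PATTERN = re.compile(
    r"^(?P<day>\d{2})\.(?P<month>\d{2})\.(?P<year>\d{4})\s+"
    r"(?P<description>.+?)\s+(?P<amount>\d+,\d{2})$"
)

def parse_row(line: str):
    """Parse one OCR text line; return None if it does not fit the layout."""
    match = ROW_PATTERN.match(line.strip())
    if match is None:
        return None
    return {
        "date": date(int(match["year"]), int(match["month"]), int(match["day"])),
        "description": match["description"],
        "amount": float(match["amount"].replace(",", ".")),
    }

# Made-up lines for illustration only.
ocr_lines = [
    "12.03.2015  Pharmacy purchase  14,50",  # fits the expected layout
    "l2.03.2015  Pharmacy purchase  14,50",  # OCR read the digit 1 as the letter l
    "2015-03-12  Pharmacy purchase  14.50",  # the source changed its layout
]
parsed = [parse_row(line) for line in ocr_lines]
print(parsed)
```

Only the first line is parsed; the other two return None, which is the kind of silent loss that makes results hard to reproduce.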
19.
20. RESULTS AND DISCUSSION
Another interesting aspect of the data collection process is the responsiveness
of the data sources, i.e., how quickly each registry replies to the requests.
56 of the requests have both sending and reply dates recorded.
On average, a reply (providing data or not) took 26.4 days to arrive.
Average reply times for different categories can be seen in Table III.
The average durations for the data registries with a small number of recorded times
are given for the sake of completeness rather than as a basis for conclusions.
The average reply time for requests resulting in data reception was 29.6 days, while
replies failing to do so came in 14.8 days on average.
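A minimal sketch of how such averages can be computed from the participants' date records is given below. The log entries are invented for illustration and are not the study's data; the tuple layout and the function name are our assumptions. Requests without a recorded reply date are excluded, mirroring the restriction to requests with both dates recorded.

```python
from datetime import date
from statistics import mean

# Hypothetical log entries: (date sent, date replied or None, data received?)
request_log = [
    (date(2014, 11, 3), date(2014, 12, 1), True),
    (date(2014, 11, 10), date(2014, 11, 24), False),
    (date(2015, 1, 5), date(2015, 2, 4), True),
    (date(2015, 1, 12), None, False),  # no reply date recorded, so excluded
]

def reply_days(log, data_received=None):
    """Days between sending and reply, optionally filtered by outcome."""
    return [
        (replied - sent).days
        for sent, replied, got_data in log
        if replied is not None
        and (data_received is None or got_data == data_received)
    ]

mean_all = mean(reply_days(request_log))
mean_with_data = mean(reply_days(request_log, data_received=True))
mean_without_data = mean(reply_days(request_log, data_received=False))
print(mean_all, mean_with_data, mean_without_data)
```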
21.
22. CONCLUSION
One’s behavior is reflected in his/her actions, and those actions are recorded in great amounts in
today’s world as a digital footprint.
As advancing data mining algorithms enable efficient harmonization of multi-modal data to
perform inferential, predictive and even causal analysis of people’s behavior, these digital
footprints are of considerable value for health data mining purposes.
An expected rise in the demand for personal data from various data registries is likely to change
the current state of the information retrieval process presented in this paper.
Our results show that utilizing digital footprints in services currently faces practical challenges.
Companies and institutions in control of individuals’ data are not responsive and attentive
to the emerging value of the digital footprint.
Even in the Finnish context, where individuals have the right by law to access their personal data,
many organizations ignored the request or refused access to the data.
Very few provided data in a format that could be easily digested by digital tools.
23. CONCLUSION
Providing high-quality data to cutting-edge data mining and machine
learning systems is essential for high-performance predictive analysis, health
behavioral modeling and personalized services.
In order to achieve this goal, controlled and secure data access via service web
portals or, even better, through machine-readable APIs is needed.
Our work continues with exploration of the collected datasets in terms of validity,
suitability and information value for health data mining, leading to in-depth
analysis of how the digital footprint can be used in health services.
24. REFERENCES
[1] A. Sellen, Y. Rogers, R. Harper, and T. Rodden, “Reflecting human values in the
digital age,” Communications of the ACM, vol. 52, no. 3, pp. 58–66, 2009.
[2] “World economic forum - rethinking personal data: Strengthening trust,”
2012.
[3] D. Zhang, B. Guo, B. Li, and Z. Yu, “Extracting social and community
intelligence from digital footprints: an emerging research area,” in Ubiquitous
Intelligence and Computing. Springer, 2010, pp. 4–18.
25. REFERENCES
[4] C. Moiso and R. Minerva, “Towards a user-centric personal data ecosystem
the role of the bank of individuals’ data,” in Intelligence in Next Generation
Networks (ICIN), 2012 16th International Conference on. IEEE, 2012, pp. 202–209.
[5] A. Malhotra, L. Totti, W. Meira Jr, P. Kumaraguru, and V. Almeida, “Studying
user footprints in different online social networks,” in Proceedings of the 2012
International Conference on Advances in Social Networks Analysis and Mining
(ASONAM 2012). IEEE Computer Society, 2012, pp. 1065–1070.
26. REFERENCES
[6] N. Eagle and A. Pentland, “Reality mining: sensing complex social systems,”
Personal and ubiquitous computing, vol. 10, no. 4, pp. 255– 268, 2006.
[7] M. Venkataramanan, “My identity for sale,”
http://www.wired.co.uk/magazine/archive/2014/11/features/my-identity-for-sale/viewall,
accessed: 2015-27-03.
[8] “Mac basics: Notifications keep you informed,”
https://support.apple.com/en-lb/HT204079, accessed: 2015-27-03.
[9] “Google now,” https://www.google.com/landing/now/, accessed: 2015-27-03.
27. REFERENCES
[10] J. H. Frost and M. P. Massagli, “Social uses of personal health information
within patientslikeme, an online patient community: what can happen when patients
have access to one anothers data,” Journal of Medical Internet Research, vol. 10, no.
3, 2008.
[11] S. Kumar, W. Nilsen, M. Pavel, and M. Srivastava, “Mobile health: Revolutionizing
healthcare through transdisciplinary research,” Computer, no. 1, pp. 28–35, 2013.
[12] C. Pagliari, D. Detmer, and P. Singleton, “Potential of electronic personal health
records,” BMJ: British Medical Journal, vol. 335, no. 7615, p. 330, 2007.
[13] “Finnish legislation - personal data act, 523/1999,” translation completed: 2001-31-03.