People generally think of Big Data as something generated by machines or large communities of people interacting with the digital world. But technological progress means that each individual is currently, or soon will be, generating masses of digital data in their everyday lives. In every interaction with an application, every web page visited, every time your telephone is turned on, you generate information about yourself, Personal Big Data. With the rising adoption of quantified self gadgets, and the foreseeable adoption of intelligent glasses capturing daily life, the quantity of personal Big Data will only grow. In this Personal Big Data, as in other Big Data, a key problem is aligning concepts in the same semantic space. While concept alignment in the public sphere is an understood, though unresolved, problem, what does ontological organization of a personal space look like? Is it idiosyncratic, or something that can be shared between people? We will describe our current approach to this problem of organizing personal data and creating and exploiting a personal semantics.
2. Information is moving from the Web to Apps
Each person generates a lot of data
Two communities use it now
Search in one’s own data is the future
Four ways to search
We need personal facets
3. 2015CLEF 2015 Grefenstette - 3
http://www.statista.com/statistics/263795/number-of-available-apps-in-the-apple-app-store/
Apple announced that 100 billion apps had been downloaded from its App
Store (June 2015)
10. Personal Big Data
Email sent
Email received
Social network posts
IP address location
SMS, chats
Search history
Web pages visited
Media viewed
Credit card purchases
Call data
GPS locations
Vital signs
Activity/inactivity
Lifestyle
Conversations
Reading
People seen
Noises heard
11. Who uses this data today?
Surely, each person should have the same
access to their own data
12. Impediments to using our own data
• Data Silos
• Ownership
• Privacy
• Big Data Problems
• Variety
• Volume
• Merging -- Semantics
13. Supposing we could get all our data back into
our own hands, how could we search it?
Short course on 4 types of search
14. Search Engines – Cranfield/SMART Model
8 Sept 2015, CLEF 2015, Grefenstette
ftp://ftp.cs.cornell.edu/pub/smart/cran
queries:
.I 6
.W
ventricular septal defect
occurring in association
with aortic regurgitation
.I 7
.W
radioisotopes in heart scanning.
mainly used in diagnosis of
pericardial effusions. also used
to study tumors, heart enlargement,
aneurysms and pericardial thickening.
technetium, rihsa, radioactive
hippurate, cholegraffin are used.
.I 8
.W
the effects of drugs on the bone
marrow of man and animals,
…

qrels:
5 332
5 333
6 112
6 115
6 116
6 118
6 122
6 238
6 239
6 242
6 260
6 309
6 320
6 321
6 323
7 92
7 121
7 189
7 389
7 390
7 391
7 392
7 393
8 52
8 60

documents:
conditions .
.I 237
cisternal fluid oxygen ...
using a beckman micro-oxyg..
tension simultaneously in the..
and in arterial blood under..
that the cisternal oxygen..
oxygen tension of the surroun.
the available free oxygen...
duration in the cerebral...
.I 238
ventricular septal defect
obstruction .
a case of ventricular...
lesion and infundibular...
coronary cusp of the aortic..
septal defect, was demonstra..
as a polyp-like mass in the...
catheterization and angiocard
ventricular outflow obstr...
.I 239
functional adaptations of the
congenital heart disease ....
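The three pieces on this slide (queries, documents, and qrels) are enough to score a search engine. The sketch below is only an illustration under stated assumptions: the parser handles just the `.I` and `.W` markers visible on the slide, the document texts are the truncated fragments shown above (with a `.W` marker added for parsing), the qrels are the subset of the slide's pairs that point at those three documents, and `rank` is a toy word-overlap ranking, not the weighting that SMART itself uses.

```python
from collections import defaultdict

def parse_smart(text):
    """Parse '.I <id>' / '.W' records into {id: text}."""
    records, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(".I "):
            current = int(line.split()[1])
            records[current] = []
        elif line == ".W" or not line:
            continue
        elif current is not None:
            records[current].append(line)
    return {rec_id: " ".join(lines) for rec_id, lines in records.items()}

def parse_qrels(text):
    """Parse 'query_id doc_id' lines into {query_id: {doc_id, ...}}."""
    qrels = defaultdict(set)
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            qrels[int(parts[0])].add(int(parts[1]))
    return qrels

def rank(query, docs):
    """Toy ranking (an assumption): most shared words first, ties by id."""
    q_words = set(query.lower().split())
    def overlap(doc_id):
        return len(q_words & set(docs[doc_id].lower().split()))
    return sorted(docs, key=lambda d: (-overlap(d), d))

def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked documents that are judged relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

QUERIES = parse_smart(""".I 6
.W
ventricular septal defect
occurring in association
with aortic regurgitation
""")

# Fragments of the documents shown on the slide; the full texts are truncated there.
DOCS = parse_smart(""".I 237
.W
cisternal fluid oxygen tension
.I 238
.W
ventricular septal defect obstruction
.I 239
.W
functional adaptations of the congenital heart disease
""")

QRELS = parse_qrels("6 238\n6 239")

ranking = rank(QUERIES[6], DOCS)
print(ranking, precision_at_k(ranking, QRELS[6], 1))
```

Any real ranking function can be swapped in for `rank`. The evaluation loop stays the same, which is the point of the Cranfield setup.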
16. Search Engines – Dewey Decimal Faceted Model

Schedules
3 Economics, Education, Society
33 Economics and Management
338 Industries, Products
338.1 – 338.4 Specific kinds of industries
338.4 Secondary Industries and Services
338.47 Goods and Services

Built from Schedules
338.471 – 338.479 Subdivisions for Goods and Services
338.476 Technology
338.4767 Manufacturing
338.47677 Textiles
338.476772 Textiles of Seed hair fibres
338.4767721 Cotton

Built from Table 1
338.47677210 Facet Indicator for Standard Subdivision
338.476772109 Historical, geographic, persons treatment

Built from Table 2
338.4767721094 Europe Western Europe
338.47677210942 England and Wales
338.476772109427 Northwestern England and Isle of Man
338.4767721094276 Lancashire

“The Lancashire cotton industry : a study in economic development”
Assigned DDC Code: 338.4767721094276
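In the example on this slide, each added digit of the assigned code narrows the subject. A faceted lookup can therefore be approximated by prefix matching. The sketch below assumes exactly that simplification; the label table holds only the entries shown on the slide, and the function names are hypothetical. Real DDC number building has more rules than pure prefixing.

```python
# Labels copied from the slide; the two range rows are omitted.
DDC_LABELS = {
    "3": "Economics, Education, Society",
    "33": "Economics and Management",
    "338": "Industries, Products",
    "338.4": "Secondary Industries and Services",
    "338.47": "Goods and Services",
    "338.476": "Technology",
    "338.4767": "Manufacturing",
    "338.47677": "Textiles",
    "338.476772": "Textiles of Seed hair fibres",
    "338.4767721": "Cotton",
    "338.47677210": "Facet Indicator for Standard Subdivision",
    "338.476772109": "Historical, geographic, persons treatment",
    "338.4767721094": "Europe Western Europe",
    "338.47677210942": "England and Wales",
    "338.476772109427": "Northwestern England and Isle of Man",
    "338.4767721094276": "Lancashire",
}

def ddc_prefixes(code):
    """Yield every prefix of a DDC code, re-inserting the dot after 3 digits."""
    digits = code.replace(".", "")
    for i in range(1, len(digits) + 1):
        prefix = digits[:i]
        yield prefix if len(prefix) <= 3 else prefix[:3] + "." + prefix[3:]

def facet_path(code, labels=DDC_LABELS):
    """Return the (prefix, label) chain from broadest to narrowest facet."""
    return [(p, labels[p]) for p in ddc_prefixes(code) if p in labels]

def is_under(code, facet):
    """True if `code` falls under the facet identified by the prefix `facet`."""
    return code.replace(".", "").startswith(facet.replace(".", ""))

BOOK = "338.4767721094276"  # "The Lancashire cotton industry"
for prefix, label in facet_path(BOOK):
    print(prefix, label)
print(is_under(BOOK, "338.47677"))  # is the book filed under Textiles?
```

With this structure, "show me everything about textiles" becomes a prefix filter over the assigned codes, with no keyword matching involved.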
20. MyLifeBits
Gemmell, Jim, Gordon Bell, and Roger
Lueder. "MyLifeBits: a personal database
for everything." Communications of the
ACM 49.1 (2006): 88-95.
"But even with convenient
classifications and labels
ready to apply, we are still
asking the user to become
a filing clerk – manually
annotating every
document, email, photo, or
conversation."
21. LifeLog
"…The user can order the life-log agent
to add retrieval keys (annotation) with
an arbitrary name by simple operations
on his cellular phone while the agent is
capturing a life-log video. This enables
the agent to identify a scene that the
user wants to remember throughout his
life, and thus the user can access easily
to the videos that were captured during
precious experiences"
Aizawa, Kiyoharu, Tetsuro Hori, Shinya
Kawasaki, and Takayuki Ishikawa.
"Capture and efficient retrieval of life log."
In Pervasive 2004 Workshop on Memory
and Sharing Experiences, pp. 15-20. 2004.
22. Stuff I’ve Seen
…Research in cognitive psychology has
found that people remember
information, particularly older
information, not in terms of exact time,
but in terms of key episodes, such as a
child’s birthday, exotic travel,…
Cutrell, Edward, Susan T. Dumais, and Jaime
Teevan. "Searching to eliminate personal
information management." Communications
of the ACM 49.1 (2006): 58-64
23. PERSON
…we define the general category for
user’s activity in advance, such as
ordinary activity and extra-ordinary
activity. In ordinary activity is related to
the activity in home or office. Generally,
the activities occurred outside of those
area, they are classified as
extraordinary activities. In addition to
these pre-defined activities, users can
add their own activity through our
learning based structure… For some
duration, we record whole activities of
user. For the repeated activities at
same time, in same place with similar
objects, our activity engine will register
as user defined activities by asking in
which category those can be included.
Kim, Ig-Jae, et al. "PERSON:
personalized experience recoding
and searching on networked
environment." Proceedings of the
3rd ACM workshop on Continuous
archival and retrieval of personal
experiences. ACM, 2006.
24. Personal Data Prototype
…Landmarks of tags are defined by the
frequency of tags that are assigned to
each item of personal data. A tag that has
been in heavy use during a period of time
is a candidate for a landmark. A tag that
has rarely been used during a long period
of time is also a candidate for a landmark.
Outliers are candidates for landmarks in
time-series data, such as home energy
use, the number of steps walked, and
histories of body weight. Data that
exceed pre-defined or user-defined
thresholds are also candidates.
Other landmarks are public landmarks,
which include shocking public news,
bestsellers, blockbuster films, and annual
rankings of top Web-search words. We
can recall our own experiences on those
days from these landmarks.
Teraoka, Teruhiko. "Organization and
exploration of heterogeneous personal
data collected in daily life." Human-
Centric Computing and Information
Sciences 2.1 (2012): 1-15.
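The time-series part of this description (outliers and threshold crossings as landmark candidates) can be sketched in a few lines. This is not Teraoka's implementation. The z-score test, the cut-off of 2.0, the function names, and the step counts are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev

def threshold_landmarks(series, threshold):
    """Days whose value exceeds a pre-defined or user-defined threshold."""
    return [day for day, value in series if value > threshold]

def outlier_landmarks(series, z_cutoff=2.0):
    """Days whose value lies more than z_cutoff standard deviations from the mean."""
    values = [value for _, value in series]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # a flat series has no outliers
    return [day for day, value in series if abs(value - mu) / sigma > z_cutoff]

# Hypothetical daily step counts; one unusually active day.
STEPS = [
    ("2015-09-01", 5000),
    ("2015-09-02", 5200),
    ("2015-09-03", 4800),
    ("2015-09-04", 5100),
    ("2015-09-05", 4900),
    ("2015-09-06", 21000),
    ("2015-09-07", 5000),
]

print(outlier_landmarks(STEPS))
print(threshold_landmarks(STEPS, threshold=10000))
```

Both functions return candidate dates only. Deciding which candidates are memorable enough to serve as landmarks is left to the user or to a later ranking step.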
25. Dublin City University
Qiu, Zhengwei. "A lifelogging system supporting multimodal
access." PhD diss., Dublin City University, 2013.
Wang, Peng, and Alan F. Smeaton. "Aggregating semantic concepts
for event representation in lifelogging." Proceedings of the
International Workshop on Semantic Web Information Management.
ACM, 2011.
26. Okay,
we’ve seen
-- Apps / QS
-- Personal Big Data
-- Some early attempts
Everyone says
Time is important
Maps are important
String search is important
but…
Facets, what are our personal facets?
How can we automate them?
47. Tweet
Less than 12 hours until I am in the pool
crying... thankful for mirrored goggles
Swimming>pool
Swimming>goggles
facets
I’d want this …
49. Existing taxonomies are for societal
exchanges
Do you want to buy this?
What famous person did this when?
What can we make for this?
We are missing a description of what is
related to us, doing something…
specific vocabularies
loose taxonomies
… facets
50. Something like….
Sports/swimming/backstroke
Sports/swimming/on my back
Sports/swimming/breaststroke
Sports/swimming/fins
Sports/swimming/goggles
Sports/swimming/fast lane
Sports/swimming/slow lane
Sports/swimming/laps
Sports/swimming/lifeguard
Sports/swimming/pool
Sports/swimming/lake
Sports/swimming/ocean
Sports/swimming/Neuilly Nautic Centre
Sports/swimming/South Hills Pool
Sports/swimming/towel
Sports/swimming/25m
Sports/swimming/cap
Sports/swimming/swim suit
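A loose taxonomy like this one is already enough to attach facets automatically to short personal texts such as the tweet on slide 47. The sketch below is a rough illustration only. It assumes that the last segment of each path is the term to look for and that a whole-word, case-insensitive match is good enough. The function name `tag_text` is hypothetical, and only a subset of the paths above is included.

```python
import re

# A subset of the facet paths listed on this slide.
FACET_PATHS = [
    "Sports/swimming/backstroke",
    "Sports/swimming/fins",
    "Sports/swimming/goggles",
    "Sports/swimming/fast lane",
    "Sports/swimming/laps",
    "Sports/swimming/pool",
    "Sports/swimming/South Hills Pool",
    "Sports/swimming/cap",
]

def tag_text(text, facet_paths):
    """Return the facet paths whose last segment occurs as a whole phrase in text."""
    lowered = text.lower()
    hits = []
    for path in facet_paths:
        leaf = path.rsplit("/", 1)[-1].lower()
        if re.search(r"\b" + re.escape(leaf) + r"\b", lowered):
            hits.append(path)
    return sorted(hits)

TWEET = ("Less than 12 hours until I am in the pool "
         "crying... thankful for mirrored goggles")
print(tag_text(TWEET, FACET_PATHS))
```

Exact matching misses inflected forms and synonyms ("lap" versus "laps", "on my back" versus "backstroke"), which is part of why building these vocabularies takes real work.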
52. Conclusion on Personal facets
There is a lot of work to do
• for predictable needs (hobbies, pastimes, sports), we do not
have the basic facets we need
• for personal information (family, friends, familiar places), we
have very little
• And this should be multilingual, too
56. Conclusion: Searching Personal Big Data
• Information is moving from the Web into Apps
• People are generating information in these siloed Apps
• People generate more digital information every day
• Wearable computing will create even more
• At some point, people will want their information back
• When you have too much information, you need facets
• The facets for organizing personal information will be
needed and do not yet exist
• There are billions of cell phone users. They will all
want this. You should start working on it.
58. Gurrin, Cathal and Smeaton, Alan F. and Doherty, Aiden R. (2014) LifeLogging:
personal big data. Foundations and Trends in Information Retrieval, 8 (1). pp. 1-125.
ISSN 1554-0677
Content type              Per day            Volume per day   Volume per year
Video                     16 hours           90 GB            33 TB
Autographer Camera        3000 images        1.3 GB           480 GB
Audio                     16 hours           630 MB           230 GB
Microsoft Sensecam        4500 images        82 MB            30 GB
Accelerometer             58,000 readings    138 KB           50 MB
Locations                 10,000 readings    27 KB            10 MB
Bluetooth Interactions    400 (estimated)    5 MB             2 GB
Words heard or read       100,000            700 KB           255 MB
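The yearly column follows from the daily one by simple multiplication. The sketch below reproduces that arithmetic, assuming 365 days per year and decimal units (1 GB = 1000 MB). Both are assumptions, since the table does not state its conventions, but they agree with the rounded figures shown.

```python
DAYS_PER_YEAR = 365  # assumption: no leap-year adjustment
UNITS = {"KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}  # decimal units

def per_year_bytes(amount, unit):
    """Convert a daily volume such as (90, 'GB') into bytes per year."""
    return amount * UNITS[unit] * DAYS_PER_YEAR

# Daily volumes copied from the table.
PER_DAY = {
    "Video": (90, "GB"),
    "Autographer Camera": (1.3, "GB"),
    "Audio": (630, "MB"),
    "Microsoft Sensecam": (82, "MB"),
    "Accelerometer": (138, "KB"),
    "Locations": (27, "KB"),
    "Bluetooth Interactions": (5, "MB"),
    "Words heard or read": (700, "KB"),
}

for name, (amount, unit) in PER_DAY.items():
    gigabytes = per_year_bytes(amount, unit) / UNITS["GB"]
    print(f"{name}: {gigabytes:.3f} GB per year")
```

For example, video at 90 GB per day gives 90 × 365 = 32,850 GB, which the table rounds to 33 TB.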