The Future of Medicine, Digital Healthcare: Focusing on Psychiatry

Professor, SAHIST, Sungkyunkwan University
Director, Digital Healthcare Institute
Yoon Sup Choi, Ph.D.

“It's in Apple's DNA that technology alone is not enough.  
It's technology married with liberal arts.”

The Convergence of IT, BT and Medicine

Korean Society of Radiology Spring Conference, June 2017

Vinod Khosla
Founder, 1st CEO of Sun Microsystems
Partner of KPCB, CEO of Khosla Ventures
Legendary Venture Capitalist in Silicon Valley

“Technology will replace 80% of doctors”

https://www.youtube.com/watch?time_continue=70&v=2HMPRXstSvQ
“We should stop training radiologists right now.
It is obvious that deep learning will outperform radiologists within five years.”
Hinton on Radiology

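Map of healthcare-related fields (ver 0.3)
Healthcare
The part of health management in the broad sense that involves neither digital technology nor professional medical care
e.g., exercise, nutrition, sleep
Digital healthcare
Health management that uses digital technology
e.g., Internet of Things, artificial intelligence, 3D printers
Mobile healthcare
The part of digital healthcare that uses mobile technology
e.g., smartphones, Internet of Things, social media
Personal genome analysis
e.g., cancer genomes, disease risk, carrier status, drug sensitivity
e.g., wellness, ancestry analysis
Medicine
Professional medical care such as disease prevention, treatment, prescription, and management
Telemedicine
Remote consultation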

What is the most important factor in digital medicine?

“Data! Data! Data!” he cried. “I can’t
make bricks without clay!”
- Sherlock Holmes, “The Adventure of the Copper Beeches”

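New data are measured, stored, integrated, and analyzed
in new ways
by new actors.

New data: types of data; qualitative and quantitative aspects of data
New ways: wearable devices, smartphones, genetic analysis, artificial intelligence, social media
New actors: users/patients, the general public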

Digital Healthcare Industry Landscape
Data Measurement Data Integration Data Interpretation Treatment
Smartphone Gadget/Apps
DNA
Artificial Intelligence
2nd Opinion
Wearables / IoT
(ver. 3)
EMR/EHR 3D Printer
Counseling
Data Platform
Accelerator/early-VC
Telemedicine
Device
On Demand (O2O)
VR
Digital Healthcare Institute
Director, Yoon Sup Choi, Ph.D.
yoonsup.choi@gmail.com


Digital Phenotype
Your smartphone knows if you are depressed

Smartphone: the origin of healthcare innovation

2013?
The election of Pope Benedict
The Election of Pope Francis


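ResearchKit

• Users can share their own medical and health data, measured with iPhone sensors, with the platform

• Uses the accelerometer, microphone, gyroscope, GPS sensor, and so on

• Measures steps, physical activity, memory, voice tremor, and more

• Addresses a long-standing problem of medical research: collecting enough medical data

• Removes the physical and time barriers to enrolling in a study (once every 3 months ➞ once every second)

• Encourages the public to take part in medical research, increasing the number of participants

• Tens of thousands of people signed up as participants within 24 hours of the announcement

• Data are collected only with the user's own consent

• The initial release introduced five apps covering five conditions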

http://www.roche.com/media/store/roche_stories/roche-stories-2015-08-10.htm

pRED app to track Parkinson’s symptoms in drug trial

The digital phenotype
Sachin H Jain, Brian W Powers, Jared B Hawkins & John S Brownstein

In the coming years, patient phenotypes captured to enhance health and wellness will extend to human interactions with digital technology.

In 1982, the evolutionary biologist Richard Dawkins introduced the concept of the “extended phenotype”1, the idea that phenotypes should not be limited just to biological processes, such as protein biosynthesis or tissue growth, but extended to include all effects that a gene has on its environment inside or outside of the body of the individual organism. Dawkins stressed that many delineations of phenotypes are arbitrary. Animals and humans can modify their environments, and these modifications and associated behaviors are expressions of one’s genome and, thus, part of their extended phenotype. In the animal kingdom, he cites dam building by beavers as an example of the beaver’s extended phenotype1.

As personal technology becomes increasingly embedded in human lives, we think there is an important extension of Dawkins’s theory: the notion of a ‘digital phenotype’. Can aspects of our interface with technology be somehow diagnostic and/or prognostic for certain conditions? Can one’s clinical data be linked and analyzed together with online activity and behavior data to create a unified, nuanced view of human disease? Here, we describe the concept of the digital phenotype. Although several disparate studies have touched on this notion, the framework for medicine has yet to be described. We attempt to define digital phenotype and further describe the opportunities and challenges in incorporating these data into healthcare.

[...] the manifestations of disease by providing a more comprehensive and nuanced view of the experience of illness. Through the lens of the digital phenotype, an individual’s interaction [...]
Figure 1 Timeline of insomnia-related tweets from representative individuals. Density distributions (probability density functions) are shown for seven individual users (User 1 to User 7) over a two-year period (x axis: date, Jan. 2013 to July 2014; y axis: density, 0.000 to 0.006). Density on the y axis highlights periods of relative activity for each user. A representative tweet from each user is shown as an example.
http://www.nature.com/nbt/journal/v33/n5/full/nbt.3223.html

genotype vs. phenotype

“Extended Phenotype”

Digital Phenotype:
Your smartphone knows if you are depressed
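Ginger.io
• How often you send text messages

• How long you talk on the phone

• Whom you call

• How far you travel

• How much you move
• UCSF, McLean Hospital: research on mental illness

• Novant Health: research on diabetes and postpartum depression

• UCSF, Duke: monitoring of recovery after surgery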

Digital Phenotype:
J Med Internet Res. 2015 Jul 15;17(7):e175.
The correlation analysis between the features and the PHQ-9 scores revealed that 6 of the 10
features were significantly correlated with the scores:
• strong correlation: circadian movement, normalized entropy, location variance
• correlation: phone usage features, usage duration and usage frequency

Figure 4. Comparison of location and usage feature statistics between participants with no symptoms of depression (blue) and the ones with (red).
Feature values are scaled between 0 and 1 for easier comparison. Boxes extend between 25th and 75th percentiles, and whiskers show the range.
Horizontal solid lines inside the boxes are medians. One, two, and three asterisks show significant differences at P<.05, P<.01, and P<.001 levels,
respectively (ENT, entropy; ENTN, normalized entropy; LV, location variance; HS, home stay; TT, transition time; TD, total distance; CM, circadian
movement; NC, number of clusters; UF, usage frequency; UD, usage duration).
Figure 5. Coefficients of correlation between location features. One, two, and three asterisks indicate significant correlation levels at P<.05, P<.01,
and P<.001, respectively (ENT, entropy; ENTN, normalized entropy; LV, location variance; HS, home stay; TT, transition time; TD, total distance;
CM, circadian movement; NC, number of clusters).
Saeb et al., Journal of Medical Internet Research
Entropy: the variability of the time the participant spent at the location clusters
Circadian movement: to what extent the participants’ sequence of locations followed a circadian rhythm
Home stay
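The entropy feature described above can be illustrated with a minimal sketch. It assumes that GPS samples have already been assigned to location clusters, that entropy is the Shannon entropy of the share of samples in each cluster, and that normalized entropy divides by the log of the number of clusters. The function names and the list-of-labels input are hypothetical and are not taken from the paper's code.

```python
import math
from collections import Counter


def location_entropy(cluster_labels):
    """Shannon entropy of the share of samples spent in each location cluster.

    Assumption: samples are evenly spaced in time, so the share of samples
    in a cluster approximates the share of time spent there.
    """
    counts = Counter(cluster_labels)
    total = len(cluster_labels)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def normalized_entropy(cluster_labels):
    """Entropy divided by its maximum, log(number of clusters)."""
    n_clusters = len(set(cluster_labels))
    if n_clusters < 2:
        return 0.0  # a single cluster has no variability
    return location_entropy(cluster_labels) / math.log(n_clusters)


# Hypothetical day of samples: mostly at home, some time at work and a cafe.
day = ["home"] * 6 + ["work"] * 3 + ["cafe"]
print(location_entropy(day), normalized_entropy(day))
```

Under these assumptions, a participant who stays in one place all day scores 0, and one who splits time evenly across clusters scores the maximum of 1 on the normalized feature.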

The relationship between mobile phone location sensor data and depressive symptom severity
Sohrab Saeb (1,2), Emily G. Lattie (1), Stephen M. Schueller (1), Konrad P. Kording (2) and David C. Mohr (1)
(1) Department of Preventive Medicine, Northwestern University, Chicago, IL, United States
(2) Rehabilitation Institute of Chicago, Department of Physical Medicine and Rehabilitation, Northwestern University, Chicago, IL, United States
Submitted 23 June 2016; accepted 7 September 2016; published 29 September 2016
Corresponding author: David C. Mohr, d-mohr@northwestern.edu
Academic editor: Anthony Jorm
DOI 10.7717/peerj.2537
Copyright 2016 Saeb et al. Distributed under Creative Commons CC-BY 4.0. Open access.
ABSTRACT
Background. Smartphones offer the hope that depression can be detected using
passively collected data from the phone sensors. The aim of this study was to replicate
and extend previous work using geographic location (GPS) sensors to identify depressive
symptom severity.
Methods. We used a dataset collected from 48 college students over a 10-week period,
which included GPS phone sensor data and the Patient Health Questionnaire 9-item
(PHQ-9) to evaluate depressive symptom severity at baseline and end-of-study. GPS
features were calculated over the entire study, for weekdays and weekends, and in 2-week
blocks.
Results. The results of this study replicated our previous findings that a number of
GPS features, including location variance, entropy, and circadian movement, were
significantly correlated with PHQ-9 scores (r’s ranging from 0.43 to 0.46, p-values
< .05). We also found that these relationships were stronger when GPS features were
calculated from weekend, compared to weekday, data. Although the correlation between
baseline PHQ-9 scores with 2-week GPS features diminished as we moved further from
baseline, correlations with the end-of-study scores remained significant regardless of the
time point used to calculate the features.
Discussion. Our findings were consistent with past research demonstrating that GPS
features may be an important and reliable predictor of depressive symptom severity.
The varying strength of these relationships on weekends and weekdays suggests the role
of weekend/weekday as a moderating variable. The finding that GPS features predict
depressive symptom severity up to 10 weeks prior to assessment suggests that GPS
features may have the potential as early warning signals of depression.
Subjects Bioinformatics, Psychiatry and Psychology, Public Health, Computational Science
Keywords Mobile phone, Depression, Depressive symptoms, Geographic locations, Students
INTRODUCTION
Depression is common and debilitating, taking an enormous toll in terms of cost,
morbidity, and mortality (Ferrari et al., 2013; Greenberg et al., 2015). The 12-month
prevalence of major depressive disorder among adults in the US is 6.9% (Kessler et al.,
2005), and an additional 2–5% have subsyndromal symptoms that warrant treatment
Saeb et al. (2016), PeerJ, DOI 10.7717/peerj.2537

The relationship between mobile phone location sensor
data and depressive symptom severity
Table 2 Linear correlation coefficients (r) between individual 10-week features and PHQ-9 scores, and their 95% confidence intervals. Features indicated with stars (*) are replicated from our previous study (Saeb et al., 2015a). Bold values indicate significant correlations.

Feature                 Baseline (n = 46)   Follow-up (n = 38)   Change (n = 38)
Location variance*      0.29 ± 0.008        0.43 ± 0.007         0.34 ± 0.008
Circadian movement*     0.34 ± 0.006        0.48 ± 0.006         0.33 ± 0.009
Speed mean              0.03 ± 0.007        0.06 ± 0.005         0.04 ± 0.008
Speed variance          0.07 ± 0.007        0.06 ± 0.005         0.06 ± 0.007
Total distance*         0.23 ± 0.004        0.18 ± 0.006         0.03 ± 0.006
Number of clusters*     0.38 ± 0.005        0.44 ± 0.004         0.24 ± 0.007
Entropy*                0.31 ± 0.007        0.46 ± 0.005         0.28 ± 0.008
Normalized entropy*     0.26 ± 0.007        0.44 ± 0.005         0.30 ± 0.009
Raw entropy             0.17 ± 0.009        0.22 ± 0.008         0.15 ± 0.010
Home stay*              0.22 ± 0.008        0.43 ± 0.005         0.30 ± 0.009
Transition time*        0.30 ± 0.006        0.32 ± 0.005         0.12 ± 0.009
Data analysis
We evaluated the relationship between each set of features (10-week and 2-week, each for all
days, weekends, or weekdays) and depressive symptoms severity as measured by the PHQ-9.
We used linear correlation coefficient (r) and considered p < 0.05 as the significance level.
In order to reduce the possibility that results were generated by chance, we created 1,000
bootstrap subsamples (Efron & Tibshirani, 1993) to estimate these correlation coefficients
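The bootstrap step mentioned in the data analysis can be sketched as follows. This is a hedged illustration and not the authors' code: it assumes resampling participants with replacement, a plain Pearson correlation, and a percentile interval. The function names, the seed, and the toy data are hypothetical.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson linear correlation; returns NaN if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def bootstrap_r(xs, ys, n_boot=1000, seed=0):
    """Mean correlation and 95% percentile interval over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(xs)
    rs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        r = pearson_r([xs[i] for i in idx], [ys[i] for i in idx])
        if not math.isnan(r):
            rs.append(r)
    rs.sort()
    lower = rs[int(0.025 * len(rs))]
    upper = rs[min(len(rs) - 1, int(0.975 * len(rs)))]
    return sum(rs) / len(rs), (lower, upper)


# Toy data only: a GPS feature and a made-up symptom score for 8 participants.
feature = [0.1, 0.4, 0.35, 0.8, 0.6, 0.9, 0.2, 0.7]
score = [3, 6, 5, 12, 9, 14, 4, 8]
print(bootstrap_r(feature, score))
```

Resampling whole participants keeps each feature paired with its own PHQ-9 score, which is what makes the spread of the resampled correlations a usable estimate of their uncertainty.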

Table 3 Linear correlation coefficients (r) between individual weekend and weekday features and PHQ-9 scores, and their 95% confidence intervals. Bold values indicate significant correlations (see ‘Data Analysis’).

                      Weekday                                            Weekend
Feature               Baseline (n = 46)  Follow-up (n = 38)  Change (n = 38)  Baseline (n = 46)  Follow-up (n = 38)  Change (n = 38)
Location variance     0.15 ± 0.008       0.20 ± 0.008        0.22 ± 0.009     0.31 ± 0.008       0.47 ± 0.007        0.39 ± 0.008
Circadian movement    0.22 ± 0.007       0.28 ± 0.008        0.25 ± 0.009     0.35 ± 0.007       0.51 ± 0.006        0.36 ± 0.008
Speed mean            0.00 ± 0.008       0.06 ± 0.005        0.03 ± 0.008     0.13 ± 0.005       0.06 ± 0.006        0.05 ± 0.009
Speed variance        0.05 ± 0.008       0.07 ± 0.005        0.02 ± 0.007     0.13 ± 0.004       0.05 ± 0.006        0.10 ± 0.008
Total distance        0.20 ± 0.004       0.15 ± 0.005        0.01 ± 0.006     0.25 ± 0.004       0.20 ± 0.005        0.03 ± 0.006
Number of clusters    0.19 ± 0.006       0.25 ± 0.005        0.14 ± 0.008     0.34 ± 0.006       0.46 ± 0.004        0.32 ± 0.007
Entropy               0.21 ± 0.007       0.34 ± 0.006        0.20 ± 0.009     0.30 ± 0.008       0.55 ± 0.004        0.38 ± 0.008
Normalized entropy    0.21 ± 0.008       0.39 ± 0.006        0.24 ± 0.009     0.28 ± 0.008       0.54 ± 0.004        0.41 ± 0.009
Raw entropy           0.05 ± 0.008       0.04 ± 0.008        0.01 ± 0.010     0.04 ± 0.008       0.01 ± 0.008        0.03 ± 0.009
Home stay             0.19 ± 0.008       0.37 ± 0.006        0.23 ± 0.009     0.23 ± 0.007       0.50 ± 0.004        0.35 ± 0.008
Transition time       0.27 ± 0.006       0.29 ± 0.006        0.14 ± 0.010     0.36 ± 0.006       0.32 ± 0.008        0.06 ± 0.009
only normalized entropy was significantly related to the scores as a weekday feature. The
magnitude of the relationship between weekend features and PHQ-9 scores was larger than
the magnitude of the relationship between 10-week features and PHQ-9 scores. However,
given the small sample size, we were not adequately powered to test if these differences were
significant.
2-week features
Finally, we examined how 2-week GPS features obtained at different times during the study
All of those 10-week features that were significantly related to PHQ-9 scores (see Table 2) were also significant when calculated from weekends, whereas only normalized entropy was significantly related to the scores as a weekday feature.

Mean temporal correlations between 2-week location features, calculated at different time points during the study, and
baseline and follow-up PHQ-9 scores.

Reece & Danforth, “Instagram photos reveal predictive markers of depression” (2016)
higher Hue (bluer)
lower Saturation (grayer)
lower Brightness (darker)

Digital Phenotype:
Your Instagram knows if you are depressed
Rao (MVR) (24) .

Results
Both All-data and Pre-diagnosis models were decisively superior to a null model
($K_{\text{All}} = 157.5$; $K_{\text{Pre}} = 149.8$). All-data predictors were significant with 99% probability.
Pre-diagnosis and All-data confidence levels were largely identical, with two exceptions:
Pre-diagnosis Brightness decreased to 90% confidence, and Pre-diagnosis posting frequency
dropped to 30% confidence, suggesting a null predictive value in the latter case.
Increased hue, along with decreased brightness and saturation, predicted depression. This
means that photos posted by depressed individuals tended to be bluer, darker, and grayer (see
Fig. 2). The more comments Instagram posts received, the more likely they were posted by
depressed participants, but the opposite was true for likes received. In the All-data model, higher
posting frequency was also associated with depression. Depressed participants were more likely
to post photos with faces, but had a lower average face count per photograph than healthy
participants. Finally, depressed participants were less likely to apply Instagram filters to their
posted photos.

Fig. 2. Magnitude and direction of regression coefficients in All-data (N=24,713) and Pre-diagnosis (N=18,513)
models. X-axis values represent the adjustment in odds of an observation belonging to depressed individuals, per

Fig. 1. Comparison of HSV values. Right photograph has higher Hue (bluer), lower Saturation (grayer), and lower
Brightness (darker) than left photograph. Instagram photos posted by depressed individuals had HSV values
shifted towards those in the right photograph, compared with photos posted by healthy individuals.
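To make the HSV comparison concrete, here is a small sketch that averages hue, saturation, and value over a photo's pixels using Python's standard `colorsys` module. The pixel lists are hypothetical stand-ins for decoded images, and averaging hue arithmetically is a simplification (hue is circular), so this only illustrates the idea and is not the study's pipeline.

```python
import colorsys


def mean_hsv(pixels):
    """Mean (hue, saturation, value), each in 0..1, for (r, g, b) pixels in 0..255."""
    hsv = [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255) for r, g, b in pixels]
    n = len(hsv)
    return tuple(sum(p[i] for p in hsv) / n for i in range(3))


# Hypothetical two-pixel "photos": a bright warm one and a dark bluish-gray one.
warm_photo = [(255, 200, 120), (250, 180, 90)]
cool_photo = [(60, 70, 90), (50, 60, 85)]

print(mean_hsv(warm_photo))
print(mean_hsv(cool_photo))  # higher hue, lower saturation, lower value
```

On the `colorsys` scale red sits at hue 0 and blue at about 0.67, which is why a bluer photo shows up as a higher hue value.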

Units of observation
In determining the best time span for this analysis, we encountered a difficult question:
When and for how long does depression occur? A diagnosis of depression does not indicate the
persistence of a depressive state for every moment of every day, and to conduct analysis using an
individual’s entire posting history as a single unit of observation is therefore rather specious. At
the other extreme, to take each individual photograph as units of observation runs the risk of
being too granular. De Choudhury et al. (5) looked at all of a given user’s posts in a single day,
and aggregated those data into per-person, per-day units of observation. We adopted this
precedent of “user-days” as a unit of analysis.
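The “user-days” unit described above amounts to grouping posts by user and calendar day and then summarizing each group. The sketch below assumes each post is a `(user_id, timestamp, features)` tuple and takes the mean of each feature. That tuple layout, the feature names, and the use of the mean are illustrative assumptions.

```python
from collections import defaultdict
from datetime import datetime


def aggregate_user_days(posts):
    """Average numeric features over all posts a user made on the same day.

    posts: iterable of (user_id, timestamp, features) with features a dict of numbers.
    Returns {(user_id, date): {feature_name: mean value}}.
    """
    buckets = defaultdict(list)
    for user_id, timestamp, features in posts:
        buckets[(user_id, timestamp.date())].append(features)

    user_days = {}
    for key, rows in buckets.items():
        user_days[key] = {
            name: sum(row[name] for row in rows) / len(rows) for name in rows[0]
        }
    return user_days


# Hypothetical posts: two on one day and one on the next, all by the same user.
posts = [
    ("u1", datetime(2016, 1, 1, 9, 0), {"hue": 0.2, "likes": 10}),
    ("u1", datetime(2016, 1, 1, 21, 0), {"hue": 0.4, "likes": 30}),
    ("u1", datetime(2016, 1, 2, 8, 0), {"hue": 0.6, "likes": 5}),
]
print(aggregate_user_days(posts))
```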

Statistical framework
We used Bayesian logistic regression with uninformative priors to determine the strength
of individual predictors. Two separate models were trained. The All-data model used all
collected data to address Hypothesis 1. The Pre-diagnosis model used all data collected from
($\chi^2_{\text{All}} = 907.84$, $p = 9.17\mathrm{e}{-164}$; $\chi^2_{\text{Pre}} = 813.80$, $p = 2.87\mathrm{e}{-144}$). In particular, depressed
participants were less likely than healthy participants to use any filters at all. When depressed
participants did employ filters, they most disproportionately favored the “Inkwell” filter, which
converts color photographs to black-and-white images. Conversely, healthy participants most
disproportionately favored the Valencia filter, which lightens the tint of photos. Examples of
filtered photographs are provided in SI Appendix VIII.

Fig. 3. Instagram filter usage among depressed and healthy participants. Bars indicate difference between observed
and expected usage frequencies, based on a Chi-squared analysis of independence. Blue bars indicate
disproportionate use of a filter by depressed compared to healthy participants, orange bars indicate the reverse.
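The "observed minus expected" bars in Fig. 3 follow from a standard test of independence on a filter-by-group contingency table. The sketch below shows the arithmetic on made-up counts. The filter counts are invented for illustration and the function name is hypothetical.

```python
def chi_squared_independence(observed):
    """Chi-squared statistic and observed-minus-expected counts.

    observed: {filter_name: (depressed_count, healthy_count)}.
    Expected counts assume filter choice is independent of group.
    """
    row_totals = {name: sum(counts) for name, counts in observed.items()}
    col_totals = [sum(counts[j] for counts in observed.values()) for j in range(2)]
    grand_total = sum(col_totals)

    chi2 = 0.0
    diffs = {}
    for name, counts in observed.items():
        expected = [row_totals[name] * col_totals[j] / grand_total for j in range(2)]
        chi2 += sum((counts[j] - expected[j]) ** 2 / expected[j] for j in range(2))
        diffs[name] = tuple(counts[j] - expected[j] for j in range(2))
    return chi2, diffs


# Invented counts, for illustration only.
usage = {"Inkwell": (40, 20), "Valencia": (25, 55), "No filter": (135, 125)}
chi2, diffs = chi_squared_independence(usage)
print(round(chi2, 2), diffs)
```

A positive first entry in `diffs` means the depressed group used that filter more than independence would predict, which corresponds to a blue bar in the figure.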

Digital Phenotype:

VIII. Instagram filter examples

Fig. S8. Examples of Inkwell and Valencia Instagram filters. Inkwell converts
color photos to black-and-white, Valencia lightens tint. Depressed participants
most favored Inkwell compared to healthy participants, Healthy participants

Your Twitter knows if you cannot sleep
Timeline of insomnia-related tweets from representative individuals.
Nat. Biotech. 2015

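• Distinguishes patients with bipolar disorder from healthy controls based on the content and patterns of their tweets

• Infers emotion by analyzing posting patterns, posting frequency, and word choice

• Phonological features grounded in phonology

• High-energy words: how strongly pronounced or stressed the words a user chooses are
• Tweets from 406 patients diagnosed with bipolar disorder

• Tweets from the year before diagnosis were compared with those of a control group
• The two groups could be told apart with a precision above 90%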

Detection of the Prodromal Phase of Bipolar Disorder from
Psychological and Phonological Aspects in Social Media
Yen-Hao Huang
National Tsing Hua University
Hsinchu, Taiwan
yenhao0218@gmail.com
Lin-Hung Wei
Hsinchu, Taiwan
adeline80916@gmail.com
Yi-Shin Chen
Hsinchu, Taiwan
yishin@gmail.com
ABSTRACT
Seven out of ten people with bipolar disorder are initially misdiagnosed and thirty percent of individuals with bipolar disorder will commit suicide. Identifying the early phases of the disorder is one of the key components for reducing the full development of the disorder. In this study, we aim at leveraging the data from social media to design predictive models, which utilize the psychological and phonological features, to determine the onset period of bipolar disorder and provide insights on its prodrome. This study makes these discoveries possible by employing a novel data collection process, coined as Time-specific Subconscious Crowdsourcing, which helps collect a reliable dataset that supplements diagnosis information from people suffering from bipolar disorder. Our experimental results demonstrate that the proposed models could greatly contribute to the regular assessments of people with bipolar disorder, which is important in the primary care setting.
KEYWORDS
Bipolar Disorder Detection, Mental Disorder, Prodromal
Phrase, Emotion Analysis, Sentiment Analysis, Phonology,
Social Media
1 INTRODUCTION
Bipolar disorder (BD) is a common mental illness characterized by recurrent episodes of mania/hypomania and depression, which is found among all ages, races, ethnic groups and social classes. The regular assessment of people with BD is an important part of its treatment, though it may be very time-consuming [21]. There are many beneficial treatments for the patients, particularly for delaying relapses. The identification of early symptoms is significant for allowing early intervention and reducing the multiple adverse consequences of a full-blown episode. Despite the importance of the detection of prodromal symptoms, there are very few studies that have actually examined the ability of relatives to detect these symptoms in BD patients [20]. For the purpose of early treatment, the challenge leads to: how to identify the prodrome period of BD. Current studies are thus aimed at detecting prodromes and analyzing the prodromal symptoms of manic recurrence in clinics.

With regards to the symptom of social isolation, people are increasingly turning to popular social media, such as Facebook and Twitter, to share their illness experiences or seek advice from others with similar mental health conditions. As the information is being shared in public, people are subconsciously providing rich contents about their states of mind. In this paper, we refer to this sharing and data collection as time-specific subconscious crowdsourcing.

In this study, we carefully look at patients who have been diagnosed with BD and who explicitly indicate the diagnosis and time of diagnosis on Twitter. Our goal is to both predict whether BD rises on a given period of time, and to discover the prodromal period for BD. It’s important to clarify that our goal doesn’t seek to offer a diagnosis but rather to make a prediction of which users are likely to be suffering from the BD. The main contributions of our work are:
• Introducing the concept of time-specific subconscious crowdsourcing, which can aid in locating the social network behavior data of BD patients with the corresponding time of diagnosis.
• A BD assessment mechanism that differentiates between prodromal symptoms and acute symptoms.
• Introducing the phonological features into the assessment mechanism, which allows for the possibility to assess patients through text only.
• An automatic recognition approach that detects the possible prodromal period for BD.
2 RELATED WORK
Social media resources have been widely utilized by researchers to study mental health issues. The following literature emphasizes on data collection and feature engineering, including subject recruitment, manual data collection, data collection applications, keyword matching, and combined approaches. The clinical approach for mental disorders and prodrome studies are also discussed in this section.

Subject recruitment: Based on customized questionnaires and contact with subjects, Park et al. [15] recruited participants for the Center for Epidemiologic Studies Depression scale (CES-D) [17] and provided their Twitter data. By analyzing the information contained in tweets, participants were divided into normal and depressive groups based on their scores on CES-D. An approach like this one requires expensive costs to acquire data and conduct the questionnaire.

Manual and automatic data collecting: Moreno et al. [14] collected data via the Facebook profiles of college students reviewed by two investigators. They aimed at revealing the relationship between demographic factors and depression. Similarly, in our work, we invest on manual efforts to collect and properly annotate our dataset. In addition, there are many applications built on top of social networks that provide free services where users may need to input their credentials

arXiv:1712.09183v1 [cs.IR] 26 Dec 2017

Wordcloud
Table 2: Average Precision of Single Feature Performance

Features (#DIM)                     2 mths   3 mths   6 mths   9 mths   12 mths
AG(2): Age and Gender               0.475    0.503    0.445    0.434    0.383
Pol(5): Mood Polarity Features      0.911    0.893    0.843    0.836    0.803
Emot(8): Emotional Score            0.893    0.895    0.908    0.917    0.896
Soc(4): Social Feature              0.941    0.913    0.845    0.834    0.786
LT(1): Late Tweet Frequency         0.645    0.589    0.554    0.504    0.513
TRD(1): Tweet Rate Difference       0.570    0.638    0.626    0.615    0.654
Phon(8): Phonological Feature       0.889    0.880    0.802    0.838    0.821

Figure 1: Illness Period Modeling (labels: diagnosed time; a window length in months, shown with the example value of 2 months)
features are introduced: (1) Word-level features and BD Pattern of Life features.

3.4.1 Word-level Features. With respect to the linguistic features for BD, the Character n-gram language features (CLF) and LIWC metrics are designed to capture it. The CLF utilizes n-grams to measure the comment words or phrases used by users. The tf-idf is utilized in our score-calculating method, the tf is the frequency of an n-gram and the document d of df is defined as each particular twitter user k. The formula for the tf-idf is thus given as:

$$\mathrm{tfidf}^{(k,\tau,\alpha)}_{v_n} = \mathrm{freq}^{(k,\tau,\alpha)}_{v_n} \times \log \frac{K}{1 + \mathrm{freq}^{(K,\tau,\alpha)}_{v_n}} \qquad (1)$$

The $\mathrm{freq}^{(k)}_{v_n}$ is the frequency of n-gram $v_n$, which is $n \in \{1, 2\}$
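The tf-idf score above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: it treats each user as one document, uses word-level 1-grams and 2-grams, reads the denominator term as the number of users whose tweets contain the n-gram, and ignores the time-window parameters. The function names and the toy token lists are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Contiguous word n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_per_user(user_tokens, n_values=(1, 2)):
    """Score each n-gram per user as freq * log(K / (1 + user_count)).

    user_tokens: {user_id: list of tokens from that user's tweets}.
    K is the number of users; user_count is how many users used the n-gram
    (an assumed reading of the denominator in the formula).
    """
    K = len(user_tokens)
    tf = {
        user: Counter(g for n in n_values for g in ngrams(tokens, n))
        for user, tokens in user_tokens.items()
    }
    user_count = Counter()
    for counts in tf.values():
        user_count.update(counts.keys())  # each user counted once per n-gram
    return {
        user: {g: f * math.log(K / (1 + user_count[g])) for g, f in counts.items()}
        for user, counts in tf.items()
    }


# Toy input: three hypothetical users with two tokens each.
scores = tfidf_per_user({
    "u1": ["sad", "day"],
    "u2": ["happy", "day"],
    "u3": ["happy", "night"],
})
print(scores["u1"])
```

With the `1 +` term in the denominator, an n-gram used by all but one user scores zero and one used by every user scores below zero, so only n-grams concentrated in a few users carry positive weight.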
to represent psychological features, su
terns and the behavioral tendency o
polarity, emotion, and social interacti
full BDPLF, there are five categories:
• Age and Gender: Sit et al. [2
effects on BD, indicating that wom
likely to have Bipolar Disorder
than men. We make use of the ag
proposed by Sap et al. [19], whic
social media.
• Mood Polarity Features: Ow
BD patients experience rapid mo
analysis is firstly adapted to obt
larity portrayed by each user’s t
the sentiment of tweets, the onlin
used, based on Go et al.’s work [
the contents of tweets into thre
positive, negative, and neutral.
those three categories into five
positive ratio, negative ratio, po
combo, and flips ratio.
• Emotional Scores: Beyond th
tion detection tool proposed by
employed to classify the tweets in
gories: joy, surprise, anticipation,
anger, and fear. The emotion cla
further transformed into emotio
esei,
(k)
⌧,↵
=
ei,
(k)
⌧,↵
ecount
Tweets from the year before diagnosis were compared against a control group
Individual features alone achieved high precision in classifying the at-risk group

AI for mental health
Artificial intelligence for psychiatry

No choice but to bring AI into medicine

Martin Duggan,“IBM Watson Health - Integrated Care & the Evolution to Cognitive Computing”

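• Weak AI (Artificial Narrow Intelligence)

• AI that performs well in one specific area

• Chess, quizzes, mail filtering, product recommendation, autonomous driving

• Strong AI (Artificial General Intelligence)

• Human-level AI across all areas

• Reasoning, planning, problem solving, abstraction, learning complex concepts

• Artificial Super Intelligence

• AI that surpasses humans in every domain, including science, technology, and social skills

• "Any sufficiently advanced technology is indistinguishable from magic" - Arthur C. Clarke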

Jeopardy!
In 2011, Watson competed against two human champions in a quiz match and won decisively

600,000 pieces of medical evidence
2 million pages of text from 42 medical journals and clinical trials
69 guidelines, 61,540 clinical trials
IBM Watson on Medicine
Watson learned...
+
1,500 lung cancer cases
physician notes, lab results and clinical research
+
14,700 hours of hands-on training

Annals of Oncology (2016) 27 (suppl_9): ix179-ix180. 10.1093/annonc/mdw601
Validation study to assess performance of IBM cognitive
computing system Watson for oncology with Manipal
multidisciplinary tumour board for 1000 consecutive cases:  
An Indian experience
• MMDT(Manipal multidisciplinary tumour board) treatment recommendation and
data of 1000 cases of 4 different cancers breast (638), colon (126), rectum (124)
and lung (112) which were treated in last 3 years was collected.
• Of the treatment recommendations given by MMDT, WFO provided 50% in REC, 28% in FC, 17% in NREC
• Nearly 80% of the recommendations were in WFO REC and FC group
• 5% of the treatment provided by MMDT was not available with WFO
• The degree of concordance varied depending on the type of cancer
• WFO-REC was high in Rectum (85%) and least in Lung (17.8%)
• high with TNBC (67.9%); HER2 negative (35%) 
• WFO took a median of 40 sec to capture, analyze and give the treatment (vs. a median of 15 min for MMDT)
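The concordance figure on this slide can be reproduced with simple arithmetic. The sketch below assumes concordance means the share of tumor board treatments that WFO labeled "recommended" (REC) or "for consideration" (FC), and uses the percentages listed above; the dictionary keys and function name are hypothetical.

```python
# Shares (%) of MMDT treatments by WFO category, as listed on the slide.
shares = {"REC": 50, "FC": 28, "NREC": 17, "not_available": 5}


def concordance_rate(shares):
    """Percentage of treatments falling in the REC or FC categories."""
    total = sum(shares.values())
    return 100.0 * (shares["REC"] + shares["FC"]) / total


rate = concordance_rate(shares)  # REC + FC = 78, i.e. "nearly 80%"
```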

WFO in ASCO 2017
• Early experience with IBM WFO cognitive computing system for lung and colorectal cancer treatment (Manipal Hospital)
• Past 3 years: lung cancer (112), colon cancer (126), rectum cancer (124)
• lung cancer: localized 88.9%, meta 97.9%
• colon cancer: localized 85.5%, meta 76.6%
• rectum cancer: localized 96.8%, meta 80.6%
Performance of WFO in India
2017 ASCO annual Meeting, J Clin Oncol 35, 2017 (suppl; abstr 8527)

ORIGINAL ARTICLE
Watson for Oncology and breast cancer treatment
recommendations: agreement with an expert
multidisciplinary tumor board
S. P. Somashekhar1*, M.-J. Sepúlveda2, S. Puglielli3, A. D. Norden3, E. H. Shortliffe4, C. Rohit Kumar1, A. Rauthan1, N. Arun Kumar1, P. Patil1, K. Rhee3 & Y. Ramya1
1 Manipal Comprehensive Cancer Centre, Manipal Hospital, Bangalore, India; 2 IBM Research (Retired), Yorktown Heights; 3 Watson Health, IBM Corporation, Cambridge; 4 Department of Surgical Oncology, College of Health Solutions, Arizona State University, Phoenix, USA
*Correspondence to: Prof. Sampige Prasannakumar Somashekhar, Manipal Comprehensive Cancer Centre, Manipal Hospital, Old Airport Road, Bangalore 560017, Karnataka, India. Tel: +91-9845712012; Fax: +91-80-2502-3759; E-mail: somashekhar.sp@manipalhospitals.com
Background: Breast cancer oncologists are challenged to personalize care with rapidly changing scientific evidence, drug
approvals, and treatment guidelines. Artificial intelligence (AI) clinical decision-support systems (CDSSs) have the potential to
help address this challenge. We report here the results of examining the level of agreement (concordance) between treatment
recommendations made by the AI CDSS Watson for Oncology (WFO) and a multidisciplinary tumor board for breast cancer.
Patients and methods: Treatment recommendations were provided for 638 breast cancers between 2014 and 2016 at the
Manipal Comprehensive Cancer Center, Bengaluru, India. WFO provided treatment recommendations for the identical cases in
2016. A blinded second review was carried out by the center’s tumor board in 2016 for all cases in which there was not
agreement, to account for treatments and guidelines not available before 2016. Treatment recommendations were considered
concordant if the tumor board recommendations were designated ‘recommended’ or ‘for consideration’ by WFO.
Results: Treatment concordance between WFO and the multidisciplinary tumor board occurred in 93% of breast cancer cases.
Subgroup analysis found that patients with stage I or IV disease were less likely to be concordant than patients with stage II or III
disease. Increasing age was found to have a major impact on concordance. Concordance declined significantly (P ≤ 0.02;
P < 0.001) in all age groups compared with patients <45 years of age, except for the age group 55–64 years. Receptor status
was not found to affect concordance.
Conclusion: Treatment recommendations made by WFO and the tumor board were highly concordant for breast cancer cases
examined. Breast cancer stage and patient age had significant influence on concordance, while receptor status alone did not.
This study demonstrates that the AI clinical decision-support system WFO may be a helpful tool for breast cancer treatment
decision making, especially at centers where expert breast cancer resources are limited.
Key words: Watson for Oncology, artificial intelligence, cognitive clinical decision-support systems, breast cancer,
concordance, multidisciplinary tumor board
Introduction
Oncologists who treat breast cancer are challenged by a large and
rapidly expanding knowledge base [1, 2]. As of October 2017, for
example, there were 69 FDA-approved drugs for the treatment of
breast cancer, not including combination treatment regimens
[3]. The growth of massive genetic and clinical databases, along
with computing systems to exploit them, will accelerate the speed
of breast cancer treatment advances and shorten the cycle time
for changes to breast cancer treatment guidelines [4, 5]. In add-
ition, these information management challenges in cancer care
are occurring in a practice environment where there is little time
available for tracking and accessing relevant information at the
point of care [6]. For example, a study that surveyed 1117 oncolo-
gists reported that on average 4.6 h per week were spent keeping
© The Author(s) 2018. Published by Oxford University Press on behalf of the European Society for Medical Oncology.
All rights reserved. For permissions, please email: journals.permissions@oup.com.
Annals of Oncology 29: 418–423, 2018
doi:10.1093/annonc/mdx781
Published online 9 January 2018

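Tentative conclusions
• Concordance between Watson for Oncology and physicians:

• Differs by cancer type.

• Differs by stage even within the same cancer type.

• Differs by hospital and by country even for the same cancer type.

• May change over time.

Principles are needed
• For which patients should Watson be consulted?

• How much should Watson be trusted (per cancer type)?

• Should Watson's opinion be disclosed to the patient?

• What should be done when Watson and the medical staff disagree?

• Can Watson be reimbursed by insurance?
Quality of care and treatment outcomes may vary depending on these criteria,

but at present each hospital applies its own criteria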

Bone Age Assessment
• M: 28 Classes
• F: 20 Classes
• Method: G.P.
• Top3-95.28% (F)
• Top3-81.55% (M)
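The "Top3" figures above presumably refer to top-3 accuracy: a prediction counts as correct when the true class is among the three highest-scoring classes. The sketch below shows that computation under this assumption, with invented probabilities and a hypothetical function name.

```python
def top_k_accuracy(prob_rows, true_labels, k=3):
    """Fraction of samples whose true class index is among the k top-scoring classes."""
    hits = 0
    for probs, truth in zip(prob_rows, true_labels):
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_labels)


# Toy example with 4 classes (values invented for illustration).
probs = [
    [0.1, 0.2, 0.3, 0.4],  # true class 0 is ranked last: miss at k=3
    [0.4, 0.3, 0.2, 0.1],  # true class 0 is ranked first: hit
]
labels = [0, 0]
acc = top_k_accuracy(probs, labels, k=3)
```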

AI vs. doctors: accuracy (%)
• AI: 69.5%
• Doctor A (radiology fellow, pediatric imaging subspecialty): 63%
• Doctor B (second-year radiology resident): 49.5%
AJR Am J Roentgenol. 2017 Dec;209(6):1374-1380.
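• Total number of patients: 200

• Doctor A: radiologist with a pediatric imaging subspecialty (more than 500 readings)

• Doctor B: second-year radiology resident (one day of training in the reading method + 20 readings)

• Reference: consensus of two experienced pediatric radiologists (18 and 4 years of experience)

• AI: VUNO's deep learning system for bone age reading
Synergy between human doctors and AI in bone age assessment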

AI vs. doctors, and AI + doctors: accuracy (%)
• AI: 69.5%
• Doctor A (radiology fellow, pediatric imaging subspecialty): 63% alone, 72.5% with AI
• Doctor B (second-year radiology resident): 49.5% alone, 57.5% with AI

Synergy between human doctors and AI in bone age assessment

Total reading time (min), without AI vs. with AI
• Doctor A: 188 min → 154 min (saving 18% of time)
• Doctor B: 180 min → 108 min (saving 40% of time)
Using AI in bone age assessment can also reduce reading time
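The "saving" percentages follow directly from the reading times on the slide (188 to 154 minutes and 180 to 108 minutes). A minimal check, with a hypothetical function name:

```python
def percent_saved(before_min, after_min):
    """Relative reduction in reading time, as a percentage of the original time."""
    return 100.0 * (before_min - after_min) / before_min


saving_small = percent_saved(188, 154)  # about 18%
saving_large = percent_saved(180, 108)  # 40%
```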





Detection of Diabetic Retinopathy

Copyright 2016 American Medical Association. All rights reserved.
Development and Validation of a Deep Learning Algorithm
for Detection of Diabetic Retinopathy
in Retinal Fundus Photographs
Varun Gulshan, PhD; Lily Peng, MD, PhD; Marc Coram, PhD; Martin C. Stumpe, PhD; Derek Wu, BS; Arunachalam Narayanaswamy, PhD;
Subhashini Venugopalan, MS; Kasumi Widner, MS; Tom Madams, MEng; Jorge Cuadros, OD, PhD; Ramasamy Kim, OD, DNB;
Rajiv Raman, MS, DNB; Philip C. Nelson, BS; Jessica L. Mega, MD, MPH; Dale R. Webster, PhD
IMPORTANCE Deep learning is a family of computational methods that allow an algorithm to
program itself by learning from a large set of examples that demonstrate the desired
behavior, removing the need to specify rules explicitly. Application of these methods to
medical imaging requires further assessment and validation.
OBJECTIVE To apply deep learning to create an algorithm for automated detection of diabetic
retinopathy and diabetic macular edema in retinal fundus photographs.
DESIGN AND SETTING A specific type of neural network optimized for image classification
called a deep convolutional neural network was trained using a retrospective development
data set of 128 175 retinal images, which were graded 3 to 7 times for diabetic retinopathy,
diabetic macular edema, and image gradability by a panel of 54 US licensed ophthalmologists
and ophthalmology senior residents between May and December 2015. The resultant
algorithm was validated in January and February 2016 using 2 separate data sets, both
graded by at least 7 US board-certified ophthalmologists with high intragrader consistency.
EXPOSURE Deep learning–trained algorithm.
MAIN OUTCOMES AND MEASURES The sensitivity and specificity of the algorithm for detecting
referable diabetic retinopathy (RDR), defined as moderate and worse diabetic retinopathy,
referable diabetic macular edema, or both, were generated based on the reference standard
of the majority decision of the ophthalmologist panel. The algorithm was evaluated at 2
operating points selected from the development set, one selected for high specificity and
another for high sensitivity.
RESULTS The EyePACS-1 data set consisted of 9963 images from 4997 patients (mean age, 54.4
years; 62.2% women; prevalence of RDR, 683/8878 fully gradable images [7.8%]); the
Messidor-2 data set had 1748 images from 874 patients (mean age, 57.6 years; 42.6% women;
prevalence of RDR, 254/1745 fully gradable images [14.6%]). For detecting RDR, the algorithm
had an area under the receiver operating curve of 0.991 (95% CI, 0.988-0.993) for EyePACS-1 and
0.990 (95% CI, 0.986-0.995) for Messidor-2. Using the first operating cut point with high
specificity, for EyePACS-1, the sensitivity was 90.3% (95% CI, 87.5%-92.7%) and the specificity
was 98.1% (95% CI, 97.8%-98.5%). For Messidor-2, the sensitivity was 87.0% (95% CI, 81.1%-91.0%)
and the specificity was 98.5% (95% CI, 97.7%-99.1%). Using a second operating point
with high sensitivity in the development set, for EyePACS-1 the sensitivity was 97.5% and
specificity was 93.4% and for Messidor-2 the sensitivity was 96.1% and specificity was 93.9%.
CONCLUSIONS AND RELEVANCE In this evaluation of retinal fundus photographs from adults
with diabetes, an algorithm based on deep machine learning had high sensitivity and
specificity for detecting referable diabetic retinopathy. Further research is necessary to
determine the feasibility of applying this algorithm in the clinical setting and to determine
whether use of the algorithm could lead to improved care and outcomes compared with
current ophthalmologic assessment.
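The two operating points in the abstract correspond to two thresholds on the same model score: a stricter threshold favors specificity, a looser one favors sensitivity. The sketch below illustrates that trade-off on invented scores and labels; the thresholds and function name are assumptions for illustration, not values from the paper.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Compute (sensitivity, specificity) when scores >= threshold are called positive."""
    tp = fn = tn = fp = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if label and predicted_positive:
            tp += 1
        elif label and not predicted_positive:
            fn += 1
        elif not label and predicted_positive:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Invented model outputs: 3 diseased (label 1) and 5 healthy (label 0) images.
scores = [0.95, 0.80, 0.40, 0.30, 0.20, 0.10, 0.60, 0.05]
labels = [1, 1, 1, 0, 0, 0, 0, 0]

high_specificity = sensitivity_specificity(scores, labels, threshold=0.7)
high_sensitivity = sensitivity_specificity(scores, labels, threshold=0.35)
```

Which threshold to use depends on the setting: a screening program would typically prefer the high-sensitivity point and accept more false referrals.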
JAMA. doi:10.1001/jama.2016.17216
Published online November 29, 2016.
Author Affiliations: Google Inc,
Mountain View, California (Gulshan,
Peng, Coram, Stumpe, Wu,
Narayanaswamy, Venugopalan,
Widner, Madams, Nelson, Webster);
Department of Computer Science,
University of Texas, Austin
(Venugopalan); EyePACS LLC,
San Jose, California (Cuadros); School
of Optometry, Vision Science
Graduate Group, University of
California, Berkeley (Cuadros);
Aravind Medical Research
Foundation, Aravind Eye Care
System, Madurai, India (Kim); Shri
Bhagwan Mahavir Vitreoretinal
Services, Sankara Nethralaya,
Chennai, Tamil Nadu, India (Raman);
Verily Life Sciences, Mountain View,
California (Mega); Cardiovascular
Division, Department of Medicine,
Brigham and Women’s Hospital and
Harvard Medical School, Boston,
Massachusetts (Mega).
Corresponding Author: Lily Peng,
MD, PhD, Google Research, 1600
Amphitheatre Way, Mountain View,
CA 94043 (lhpeng@google.com).

Training Set / Test Set
• A CNN was trained retrospectively on 128,175 retinal fundus images

• Data graded 3 to 7 times by 54 US ophthalmologists

• The AI's readings were compared with those of 7-8 highly qualified ophthalmologists

• EyePACS-1 (9,963 images), Messidor-2 (1,748 images)
eFigure 2. Screenshot of the Second Screen of the Grading Tool, Which Asks Graders to Assess the
Image for DR, DME and Other Notable Conditions or Findings

• AUC on EyePACS-1 and Messidor-2 = 0.991, 0.990

• Sensitivity and specificity on par with the 7-8 ophthalmologists

• F-score: 0.95 (vs. 0.91 for human physicians)
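The slide does not say which F-score is meant. Assuming it is the usual F1 score (the harmonic mean of precision and recall), it can be computed as in this sketch; the input values are invented, not taken from the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


example = f1_score(0.9, 1.0)  # illustrative inputs only
```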
Additional sensitivity analyses were conducted for several subcategories: (1) detecting moderate or worse diabetic reti-
effects of data set size on algorithm performance were examined and shown to plateau at around 60 000 images (or ap-
Figure 2. Validation Set Performance for Referable Diabetic Retinopathy
[ROC plots of Sensitivity (%) vs. 1 – Specificity (%), with the high-sensitivity and high-specificity operating points marked. A, EyePACS-1: AUC, 99.1%; 95% CI, 98.8%-99.3%. B, Messidor-2: AUC, 99.0%; 95% CI, 98.6%-99.5%.]
Performance of the algorithm (black curve) and ophthalmologists (colored
circles) for the presence of referable diabetic retinopathy (moderate or worse
diabetic retinopathy or referable diabetic macular edema) on A, EyePACS-1
(8788 fully gradable images) and B, Messidor-2 (1745 fully gradable images).
The black diamonds on the graph correspond to the sensitivity and specificity of
the algorithm at the high-sensitivity and high-specificity operating points.
In A, for the high-sensitivity operating point, specificity was 93.4% (95% CI,
92.8%-94.0%) and sensitivity was 97.5% (95% CI, 95.8%-98.7%); for the
high-specificity operating point, specificity was 98.1% (95% CI, 97.8%-98.5%)
and sensitivity was 90.3% (95% CI, 87.5%-92.7%). In B, for the high-sensitivity
operating point, specificity was 93.9% (95% CI, 92.4%-95.3%) and sensitivity
was 96.1% (95% CI, 92.4%-98.3%); for the high-specificity operating point,
specificity was 98.5% (95% CI, 97.7%-99.1%) and sensitivity was 87.0% (95%
CI, 81.1%-91.0%). There were 8 ophthalmologists who graded EyePACS-1 and 7
ophthalmologists who graded Messidor-2. AUC indicates area under the
receiver operating characteristic curve.
Results

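• In April 2018, the FDA granted marketing authorization to an AI that reads fundus photographs to diagnose diabetic retinopathy (DR)

• IDx-DR: cloud-based software that reads images taken with the Topcon NW400

• Reads fundus photographs and determines whether DR is present without physician involvement

• Returns one of two answers

• 1) More than mild DR detected: see a doctor

• 2) No more than mild DR detected: get rescreened in 12 months
• Clinical trial and performance

• Multicenter analysis of data from 900 patients at 10 hospitals

• Sensitivity and specificity of 87.4% and 89.5%, respectively (lower than Google's AI in the JAMA paper)

• Reviewed by the FDA under the de novo premarket review pathway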

LETTER doi:10.1038/nature21056
Dermatologist-level classification of skin cancer
with deep neural networks
Andre Esteva1*, Brett Kuprel1*, Roberto A. Novoa2,3, Justin Ko2, Susan M. Swetter2,4, Helen M. Blau5 & Sebastian Thrun6
Skin cancer, the most common human malignancy1–3, is primarily
diagnosed visually, beginning with an initial clinical screening
and followed potentially by dermoscopic analysis, a biopsy and
histopathological examination. Automated classification of skin
lesions using images is a challenging task owing to the fine-grained
variability in the appearance of skin lesions. Deep convolutional
neural networks (CNNs)4,5 show potential for general and highly
variable tasks across many fine-grained object categories6–11.
Here we demonstrate classification of skin lesions using a single
CNN, trained end-to-end from images directly, using only pixels
and disease labels as inputs. We train a CNN using a dataset of
129,450 clinical images (two orders of magnitude larger than
previous datasets12) consisting of 2,032 different diseases. We
test its performance against 21 board-certified dermatologists on
biopsy-proven clinical images with two critical binary classification
use cases: keratinocyte carcinomas versus benign seborrheic
keratoses; and malignant melanomas versus benign nevi. The first
case represents the identification of the most common cancers, the
second represents the identification of the deadliest skin cancer.
The CNN achieves performance on par with all tested experts
across both tasks, demonstrating an artificial intelligence capable
of classifying skin cancer with a level of competence comparable to
dermatologists. Outfitted with deep neural networks, mobile devices
can potentially extend the reach of dermatologists outside of the
clinic. It is projected that 6.3 billion smartphone subscriptions will
exist by the year 2021 (ref. 13) and can therefore potentially provide
low-cost universal access to vital diagnostic care.
There are 5.4 million new cases of skin cancer in the United States2
every year. One in five Americans will be diagnosed with a cutaneous
malignancy in their lifetime. Although melanomas represent fewer than
5% of all skin cancers in the United States, they account for approxi-
mately 75% of all skin-cancer-related deaths, and are responsible for
over 10,000 deaths annually in the United States alone. Early detection
is critical, as the estimated 5-year survival rate for melanoma drops
from over 99% if detected in its earliest stages to about 14% if detected
in its latest stages. We developed a computational method which may
allow medical practitioners and patients to proactively track skin
lesions and detect cancer earlier. By creating a novel disease taxonomy,
and a disease-partitioning algorithm that maps individual diseases into
training classes, we are able to build a deep learning system for auto-
mated dermatology.
Previous work in dermatological computer-aided classification12,14,15
has lacked the generalization capability of medical practitioners
owing to insufficient data and a focus on standardized tasks such as
dermoscopy16–18 and histological image classification19–22. Dermoscopy
images are acquired via a specialized instrument and histological
images are acquired via invasive biopsy and microscopy; whereby
both modalities yield highly standardized images. Photographic
images (for example, smartphone images) exhibit variability in factors
such as zoom, angle and lighting, making classification substantially
more challenging23,24. We overcome this challenge by using a data-driven
approach: 1.41 million pre-training and training images
make classification robust to photographic variability. Many previous
techniques require extensive preprocessing, lesion segmentation and
extraction of domain-specific visual features before classification. By
contrast, our system requires no hand-crafted features; it is trained
end-to-end directly from image labels and raw pixels, with a single
network for both photographic and dermoscopic images. The existing
body of work uses small datasets of typically less than a thousand
images of skin lesions16,18,19, which, as a result, do not generalize well
to new images. We demonstrate generalizable classification with a new
dermatologist-labelled dataset of 129,450 clinical images, including
3,374 dermoscopy images.
Deep learning algorithms, powered by advances in computation
and very large datasets25, have recently been shown to exceed human
performance in visual tasks such as playing Atari games26, strategic
board games like Go27 and object recognition6. In this paper we
outline the development of a CNN that matches the performance of
dermatologists at three key diagnostic tasks: melanoma classification,
melanoma classification using dermoscopy and carcinoma
classification. We restrict the comparisons to image-based classification.
We utilize a GoogleNet Inception v3 CNN architecture9 that was pre-trained
on approximately 1.28 million images (1,000 object categories)
from the 2014 ImageNet Large Scale Visual Recognition Challenge6,
and train it on our dataset using transfer learning28. Figure 1 shows the
working system. The CNN is trained using 757 disease classes. Our
dataset is composed of dermatologist-labelled images organized in a
tree-structured taxonomy of 2,032 diseases, in which the individual
diseases form the leaf nodes. The images come from 18 different
clinician-curated, open-access online repositories, as well as from
clinical data from Stanford University Medical Center. Figure 2a shows
a subset of the full taxonomy, which has been organized clinically and
visually by medical experts. We split our dataset into 127,463 training
and validation images and 1,942 biopsy-labelled test images.
To take advantage of fine-grained information contained within the
taxonomy structure, we develop an algorithm (Extended Data Table 1)
to partition diseases into fine-grained training classes (for example,
amelanotic melanoma and acrolentiginous melanoma). During
inference, the CNN outputs a probability distribution over these fine
classes. To recover the probabilities for coarser-level classes of interest
(for example, melanoma) we sum the probabilities of their descendants
(see Methods and Extended Data Fig. 1 for more details).
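The step of recovering a coarse-class probability by summing over descendants can be sketched as a small tree traversal. The taxonomy fragment, class names, and probabilities below are invented for illustration and are not the paper's actual taxonomy or outputs; the function name is hypothetical.

```python
def inference_probability(taxonomy, leaf_probs, node):
    """Sum the predicted probabilities of all training-class leaves under `node`.

    taxonomy: dict mapping a parent node to its list of children.
    leaf_probs: dict mapping each fine-grained training class to its probability.
    """
    children = taxonomy.get(node)
    if not children:  # a leaf, i.e. a fine-grained training class
        return leaf_probs.get(node, 0.0)
    return sum(inference_probability(taxonomy, leaf_probs, c) for c in children)


# Toy taxonomy fragment (hypothetical structure).
taxonomy = {
    "melanocytic lesion": ["melanoma", "benign nevus"],
    "melanoma": ["amelanotic melanoma", "acrolentiginous melanoma"],
    "benign nevus": ["blue nevus", "halo nevus"],
}
# Invented CNN output over the fine-grained training classes.
leaf_probs = {
    "amelanotic melanoma": 0.5,
    "acrolentiginous melanoma": 0.25,
    "blue nevus": 0.125,
    "halo nevus": 0.125,
}

p_melanoma = inference_probability(taxonomy, leaf_probs, "melanoma")
```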
We validate the effectiveness of the algorithm in two ways, using
nine-fold cross-validation. First, we validate the algorithm using a
three-class disease partition—the first-level nodes of the taxonomy,
which represent benign lesions, malignant lesions and non-neoplastic
1 Department of Electrical Engineering, Stanford University, Stanford, California, USA. 2 Department of Dermatology, Stanford University, Stanford, California, USA. 3 Department of Pathology, Stanford University, Stanford, California, USA. 4 Dermatology Service, Veterans Affairs Palo Alto Health Care System, Palo Alto, California, USA. 5 Baxter Laboratory for Stem Cell Biology, Department of Microbiology and Immunology, Institute for Stem Cell Biology and Regenerative Medicine, Stanford University, Stanford, California, USA. 6 Department of Computer Science, Stanford University, Stanford, California, USA.
*These authors contributed equally to this work.
© 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved.

his task, the CNN achieves 72.1±0.9% (mean±s.d.) overall
he average of individual inference class accuracies) and two
gists attain 65.56% and 66.0% accuracy on a subset of the
set. Second, we validate the algorithm using a nine-class
rtition—the second-level nodes—so that the diseases of
have similar medical treatment plans. The CNN achieves
two trials, one using standard images and the other using
images, which reflect the two steps that a dermatologist m
to obtain a clinical impression. The same CNN is used for a
Figure 2b shows a few example images, demonstrating th
distinguishing between malignant and benign lesions, whic
visual features. Our comparison metrics are sensitivity an
[Figure 1 diagram: Skin lesion image → Deep convolutional neural network (Inception v3; layer types: Convolution, AvgPool, MaxPool, Concat, Dropout, Fully connected, Softmax) → Training classes (757), e.g. Acral-lentiginous melanoma, Amelanotic melanoma, Lentigo melanoma, Blue nevus, Halo nevus, Mongolian spot → Inference classes (varies by task), e.g. 92% malignant melanocytic lesion, 8% benign melanocytic lesion]
Deep CNN layout. Our classification technique is a
Data flow is from left to right: an image of a skin lesion
e, melanoma) is sequentially warped into a probability
over clinical classes of skin disease using Google Inception
hitecture pretrained on the ImageNet dataset (1.28 million
1,000 generic object classes) and fine-tuned on our own
29,450 skin lesions comprising 2,032 different diseases.
ning classes are defined using a novel taxonomy of skin disease
oning algorithm that maps diseases into training classes
(for example, acrolentiginous melanoma, amelanotic melano
melanoma). Inference classes are more general and are comp
or more training classes (for example, malignant melanocytic
class of melanomas). The probability of an inference class is c
summing the probabilities of the training classes according to
structure (see Methods). Inception v3 CNN architecture repr
from https://research.googleblog.com/2016/03/train-your-ow
classifier-with.html
GoogleNet Inception v3
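• Built an in-house dataset of 129,450 skin lesion images

• Data curated by 18 US dermatologists

• Images learned with a CNN (Inception v3)

• The AI's readings were compared with those of 21 dermatologists

• Distinguishing keratinocyte carcinoma from benign seborrheic keratosis

• Distinguishing malignant melanoma from benign lesions (standard image data)

• Distinguishing malignant melanoma from benign lesions (dermoscopy images)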

Skin cancer classification performance of the CNN and dermatologists.
[Sensitivity vs. specificity plots. Panels: Carcinoma: 135 images; Melanoma: 130 images; Melanoma: 111 dermoscopy images; Carcinoma: 707 images; Melanoma: 225 images; Melanoma: 1,010 dermoscopy images. Legend: Algorithm: AUC = 0.96; Dermatologists (25); Dermatologists (22); Dermatologists (21); Average dermatologist.]
A considerable number of the 21 dermatologists were less accurate than the AI

The dermatologists' average performance was also no better than the AI's

Assisting Pathologists in Detecting
Cancer with Deep Learning
• The localization score (FROC) for the algorithm reached 89%, which significantly
exceeded the score of 73% for a pathologist with no time constraint.
https://research.googleblog.com/2017/03/assisting-pathologists-in-detecting.html

• Algorithms need to be incorporated in a way that complements the pathologist's workflow.
• Algorithms could improve the efficiency and consistency of pathologists.
• For example, pathologists could reduce their false negative rates (percentage of undetected tumors) by reviewing the top ranked predicted tumor regions including up to 8 false positive regions per slide.
https://research.googleblog.com/2017/03/assisting-pathologists-in-detecting.html

Input & model size         Validation                 Test
                           FROC    @8FP    AUC        FROC                 @8FP                 AUC
40X                        98.1    100     99.0       87.3 (83.2, 91.1)    91.1 (87.2, 94.5)    96.7 (92.6, 99.6)
40X-pretrained             99.3    100     100        85.5 (81.0, 89.5)    91.1 (86.8, 94.6)    97.5 (93.8, 99.8)
40X-small                  99.3    100     100        86.4 (82.2, 90.4)    92.4 (88.8, 95.7)    97.1 (93.2, 99.8)
ensemble-of-3              -       -       -          88.5 (84.3, 92.2)    92.4 (88.7, 95.6)    97.7 (93.0, 100)
20X-small                  94.7    100     99.6       85.5 (81.0, 89.7)    91.1 (86.9, 94.8)    98.6 (96.7, 100)
10X-small                  88.7    97.2    97.7       79.3 (74.2, 84.1)    84.9 (80.0, 89.4)    96.5 (91.9, 99.7)
40X+20X-small              94.9    98.6    99.0       85.9 (81.6, 89.9)    92.9 (89.3, 96.1)    97.0 (93.1, 99.9)
40X+10X-small              93.8    98.6    100        82.2 (77.0, 86.7)    87.6 (83.2, 91.7)    98.6 (96.2, 99.9)
Pathologist [1]            -       -       -          73.3*                73.3*                96.6
Camelyon16 winner [1, 23]  -       -       -          80.7                 82.7                 99.4
Table 1. Results on Camelyon16 dataset (95% confidence intervals, CI). Bold indicates
results within the CI of the best model. “Small” models contain 300K parameters per
Inception tower instead of 20M. -: not reported. *A pathologist achieved this sensitivity
(with no FP) using 30 hours.
to 10 20% variance), and can confound evaluation of model improvements
by grouping multiple nearby tumors as one. By contrast, our non-maxima sup-
pression approach is relatively insensitive to r between 4 and 6, although less
accurate models benefited from tuning r using the validation set (e.g., 8). Fi-
The FROC evaluates tumor detection and localization
The FROC is defined as the sensitivity at 0.25,0.5,1,2,4,8 average FPs per tumor-negative slide.
Yun Liu et al. Detecting Cancer Metastases on Gigapixel Pathology Images (2017)
Sensitivity at 8 false positives per image

Yun Liu et al. Detecting Cancer Metastases on Gigapixel Pathology Images (2017)
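• Google's AI showed large improvements in @8FP and FROC (92.9%, 88.5%)

• @8FP: the sensitivity achievable when up to 8 FPs are tolerated

• FROC: the average of the sensitivities when 1/4, 1/2, 1, 2, 4, and 8 FPs per slide are allowed

• In other words, if a few FPs are tolerated, the AI can achieve very high sensitivity

• The human pathologist, by contrast, had a sensitivity of 73% but a specificity of nearly 100%
• Human pathologists and the AI pathologist are good at different things

• If the two collaborate, improvements in reading efficiency, consistency, and sensitivity can be expected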
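The FROC summary described above (mean sensitivity at 1/4, 1/2, 1, 2, 4, and 8 false positives per slide) can be sketched as follows. The curve points are invented, and the rule for reading sensitivity off the curve (best sensitivity reached without exceeding the FP budget) is an assumption made for illustration; the function names are hypothetical.

```python
FP_TARGETS = (0.25, 0.5, 1, 2, 4, 8)


def sensitivity_at(curve, max_fp):
    """Best sensitivity among curve points with at most `max_fp` average FPs per slide."""
    eligible = [sens for fp, sens in curve if fp <= max_fp]
    return max(eligible) if eligible else 0.0


def froc_score(curve):
    """Mean sensitivity over the six FP budgets in FP_TARGETS."""
    return sum(sensitivity_at(curve, t) for t in FP_TARGETS) / len(FP_TARGETS)


# Invented (average FPs per slide, sensitivity) points from sweeping a detection threshold.
curve = [(0.0, 0.5), (0.25, 0.625), (1.0, 0.75), (4.0, 0.875), (8.0, 1.0)]

at_8fp = sensitivity_at(curve, 8)  # the "@8FP" figure for this toy curve
froc = froc_score(curve)
```

A detector with zero false positives and fixed sensitivity, like the pathologist row in the table, gets the same value for both metrics, which is why 73.3 appears twice there.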

https://www.facebook.com/groups/TensorFlowKR/permalink/633902253617503/
Google engineers gave a keynote on medical AI at AACR 2018

AACR 2018: Using AI can reduce total reading time

AACR 2018: Using AI can improve reading accuracy (especially for micro)

BeyondVerbal: Reading emotions from voices

http://www.wsj.com/articles/SB10001424052702303824204579421242295627138

BeyondVerbal
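• What if machines could understand human emotions?

• Highly applicable in healthcare as well: detecting emotions such as sadness, depression, and fatigue

• Insurers are already using it to detect depression among policyholders

• Aetna has used phone voice analysis since 2012 to detect depression in customers

• Identified six times as many patients with depression as the previous approach

• Privacy concerns exist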

• linguistic
• identification and extraction of
word instances (unigrams) and
word-pair instances (bi-grams)
from the transcriptions
• acoustic
• vocal dynamics
• voice quality
• vocal tract resonance frequencies
• pause lengths
A Machine Learning Approach to Identifying the
Thought Markers of Suicidal Subjects: A
Prospective Multicenter Trial
• “Do you have hope?”
• “Do you have any fear?”
• “Do you have any secrets?”
• “Are you angry?”
• “Does it hurt emotionally?”
Pestian, Suicide and Life-Threatening Behavior, 2016

A Machine Learning Approach to Identifying the
Thought Markers of Suicidal Subjects: A
Prospective Multicenter Trial
[ROC plots: Sensitivity vs. Specificity, one panel per comparison]
Figure 1. Receiver operator curve (ROC): suicide versus control (upper), suicide versus mentally ill (middle), and
suicide versus mentally ill with control. The ROC curves for adolescents (blue), adults (red), and a
generated where the nonsuicidal population is controls (top), mentally ill (middle), and mentally
using linguistic and acoustic features. The gray line is the AROC curve for a baseline (random) cla
TABLE 2
The AROC for the Machine Learning Algorithm. The Nonsuicidal Group Comprises of Either Mentally Ill and Control Subjects. Classification
Performances are Shown for Adolescents, Adults, and the Combined Adolescent and Adult Cohorts
ROC (SD)                                     Adolescents   Adults        Adolescents + Adults
Suicidal versus Controls
  Linguistics                                0.87 (0.04)   0.91 (0.02)   0.93 (0.02)
  Acoustics                                  0.74 (0.05)   0.82 (0.03)   0.79 (0.03)
  Linguistics + Acoustics                    0.83 (0.05)   0.93 (0.02)   0.92 (0.02)
Suicidal versus Mentally Ill
  Linguistics                                0.82 (0.05)   0.77 (0.04)   0.79 (0.03)
  Acoustics                                  0.69 (0.06)   0.74 (0.04)   0.76 (0.03)
  Linguistics + Acoustics                    0.80 (0.05)   0.77 (0.04)   0.82 (0.03)
Suicidal versus Mentally Ill and Controls
  Linguistics                                0.82 (0.04)   0.84 (0.03)   0.87 (0.02)
  Acoustics                                  0.74 (0.05)   0.80 (0.03)   0.76 (0.03)
  Linguistics + Acoustics                    0.81 (0.04)   0.84 (0.03)   0.87 (0.02)
Pestian, Suicide and Life-Threatening Behavior, 2016
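The AROC values reported above can be read as the probability that a randomly chosen suicidal subject receives a higher classifier score than a randomly chosen nonsuicidal subject. The sketch below computes that quantity directly by pairwise comparison. The function name and the scores are hypothetical; this is not the evaluation code used in the study.

```python
def auroc(scores_pos, scores_neg):
    """Area under the ROC curve via pairwise comparison (ties count as half a win)."""
    wins = 0.0
    for p in scores_pos:
        for q in scores_neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical classifier scores, for illustration only.
suicidal = [0.9, 0.8, 0.7, 0.4]
control = [0.6, 0.3, 0.2, 0.1]
area = auroc(suicidal, control)  # 15 of 16 pairs are ranked correctly
```

A value of 0.5 corresponds to the random baseline (the gray line in Figure 1), and 1.0 to perfect separation.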

“A lot of Syrian refugees have trauma and maybe
this can help them overcome that.” However, he
points out that there is a stigma around
psychotherapy, saying people feel shame about
seeking out psychologists.
As a result he thinks people might feel more
comfortable knowing they are talking to a “robot”
than to a human.

http://www.newyorker.com/tech/elements/the-chatbot-will-see-you-now
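•About three quarters of Syrian refugees show symptoms such as anxiety, isolation, and insomnia

•Recruiting thousands of Arabic-speaking counselors is not feasible

•Karim, a chatbot from the Silicon Valley startup X2AI

•emotion-recognition algorithm

•Analyzes parameters such as expressions, word choice, typing speed, sentence length,
and grammar (passive vs. active voice)

•In Syria, where expressing anger or sadness is taboo,
patients actually feel more comfortable talking with a chatbot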

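•Prof. David Spiegel, Department of Psychiatry, Stanford

•Diagnostic capability looks promising; the chatbot remembers every session

•Giving patients the feeling of being cared for, or handling transference, will be difficult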

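• Woebot, a mental health counseling chatbot startup

• A chatbot for treating depression (cognitive behavioral therapy), started by mental health experts at Stanford

• Prof. Andrew Ng joined as chairman of the board

• Woebot, a mental health counseling chatbot

• Explains things conversationally and checks on the user's mental health status, as human counselors do

• Mostly no different from a questionnaire (the user picks one of several preset answers), but it can be seen as a UI innovation

• Does not yet use very sophisticated NLP (about once per session)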

Original Paper
Delivering Cognitive Behavior Therapy to Young Adults With
Symptoms of Depression and Anxiety Using a Fully Automated
Conversational Agent (Woebot): A Randomized Controlled Trial
Kathleen Kara Fitzpatrick1*, PhD; Alison Darcy2*, PhD; Molly Vierhile1, BA
1 Stanford School of Medicine, Department of Psychiatry and Behavioral Sciences, Stanford, CA, United States
2 Woebot Labs Inc., San Francisco, CA, United States
* these authors contributed equally
Corresponding Author:
Alison Darcy, PhD
Woebot Labs Inc.
55 Fair Avenue
San Francisco, CA, 94110
United States
Email: alison@woebot.io
Abstract
Background: Web-based cognitive-behavioral therapeutic (CBT) apps have demonstrated efficacy but are characterized by
poor adherence. Conversational agents may offer a convenient, engaging way of getting support at any time.
Objective: The objective of the study was to determine the feasibility, acceptability, and preliminary efficacy of a fully automated
conversational agent to deliver a self-help program for college students who self-identify as having symptoms of anxiety and
depression.
Methods: In an unblinded trial, 70 individuals age 18-28 years were recruited online from a university community social media
site and were randomized to receive either 2 weeks (up to 20 sessions) of self-help content derived from CBT principles in a
conversational format with a text-based conversational agent (Woebot) (n=34) or were directed to the National Institute of Mental
Health ebook, “Depression in College Students,” as an information-only control group (n=36). All participants completed
Web-based versions of the 9-item Patient Health Questionnaire (PHQ-9), the 7-item Generalized Anxiety Disorder scale (GAD-7),
and the Positive and Negative Affect Scale at baseline and 2-3 weeks later (T2).
Results: Participants were on average 22.2 years old (SD 2.33), 67% female (47/70), mostly non-Hispanic (93%, 54/58), and
Caucasian (79%, 46/58). Participants in the Woebot group engaged with the conversational agent an average of 12.14 (SD 2.23)
times over the study period. No significant differences existed between the groups at baseline, and 83% (58/70) of participants
provided data at T2 (17% attrition). Intent-to-treat univariate analysis of covariance revealed a significant group difference on
depression such that those in the Woebot group significantly reduced their symptoms of depression over the study period as
measured by the PHQ-9 (F=6.47; P=.01) while those in the information control group did not. In an analysis of completers,
participants in both groups significantly reduced anxiety as measured by the GAD-7 (F1,54= 9.24; P=.004). Participants’ comments
suggest that process factors were more influential on their acceptability of the program than content factors mirroring traditional
therapy.
Conclusions: Conversational agents appear to be a feasible, engaging, and effective way to deliver CBT.
(JMIR Ment Health 2017;4(2):e19) doi:10.2196/mental.7785
KEYWORDS
conversational agents; mobile mental health; mental health; chatbots; depression; anxiety; college students; digital health
Introduction
Up to 74% of mental health diagnoses have their first onset …
… particularly common among college students, with more than
half reporting symptoms of anxiety and depression in the
previous year that were so severe they had difficulty functioning …
JMIR MENTAL HEALTH, Fitzpatrick et al

depression at baseline as measured by the PHQ-9, while
three-quarters (74%, 52/70) were in the severe range for anxiety
as measured by the GAD-7.
Figure 1. Participant recruitment flow.
Table 1. Demographic and clinical variables of participants at baseline.
                          Woebot          Information control
Scale, mean (SD)
  Depression (PHQ-9)      14.30 (6.65)    13.25 (5.17)
  Anxiety (GAD-7)         18.05 (5.89)    19.02 (4.27)
  Positive affect         25.54 (9.58)    26.19 (8.37)
  Negative affect         24.87 (8.13)    28.74 (8.92)
Age, mean (SD)            22.58 (2.38)    21.83 (2.24)
Gender, n (%)
  Male                    7 (21)          4 (7)
  Female                  27 (79)         20 (55)
Ethnicity, n (%)
  Latino/Hispanic         2 (6)           2 (8)
  Non-Latino/Hispanic     32 (94)         22 (92)
  Caucasian               28 (82)         18 (75)
Delivering Cognitive Behavior Therapy to Young Adults With
Symptoms of Depression and Anxiety Using a Fully Automated
Conversational Agent (Woebot): A Randomized Controlled Trial
•A self-help chatbot for college students who self-identify as having symptoms of anxiety and depression

•Objective: to assess the chatbot's feasibility, acceptability, and preliminary efficacy

•70 college students in total, over 2 weeks

•Intervention group (Woebot): 34

•Control group (information-only): 36

•Outcome: PHQ-9, GAD-7
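Both outcome measures are short self-report questionnaires scored by summing item responses. As a rough sketch of how such totals are computed and interpreted: the PHQ-9 has 9 items and the GAD-7 has 7 items, each answered 0 to 3, and the severity bands below are the commonly published cutoffs. The function and variable names are assumptions for illustration, not part of Woebot or the trial.

```python
# Upper bound of each severity band (inclusive), per the commonly published cutoffs.
PHQ9_BANDS = [(4, "minimal"), (9, "mild"), (14, "moderate"),
              (19, "moderately severe"), (27, "severe")]
GAD7_BANDS = [(4, "minimal"), (9, "mild"), (14, "moderate"), (21, "severe")]


def score_questionnaire(items, n_items, bands):
    """Sum item responses (each 0-3) and map the total to a severity label."""
    if len(items) != n_items or any(i not in (0, 1, 2, 3) for i in items):
        raise ValueError("expected %d responses, each between 0 and 3" % n_items)
    total = sum(items)
    for upper, label in bands:
        if total <= upper:
            return total, label


# Hypothetical responses, for illustration only.
phq9_result = score_questionnaire([2, 2, 1, 2, 1, 2, 1, 2, 1], 9, PHQ9_BANDS)
gad7_result = score_questionnaire([3, 3, 3, 2, 3, 2, 2], 7, GAD7_BANDS)
```

Under these cutoffs, the baseline PHQ-9 means in Table 1 (roughly 13 to 14) fall in the moderate band.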

                        Woebot                        Information-only control
                        T2 (a)         95% CI (b)     T2 (a)         95% CI (b)     F      P      d (c)
PHQ-9                   11.14 (0.71)   9.74-12.32     13.67 (.81)    12.07-15.27    6.03   .017   0.44
GAD-7                   17.35 (0.60)   16.16-18.13    16.84 (.67)    15.52-18.56    0.38   .581   0.14
PANAS positive affect   26.88 (1.29)   24.35-29.41    26.02 (1.45)   23.17-28.86    0.17   .707   0.02
PANAS negative affect   25.98 (1.24)   23.54-28.42    27.53 (1.42)   24.73-30.32    0.91   .912   0.344
(a) Baseline=pooled mean (standard error)
(b) 95% confidence interval.
(c) Cohen d shown for between-subjects effects using means and standard errors at Time 2.
Figure 2. Change in mean depression (PHQ-9) score by group over the study period. Error bars represent standard error.
Preliminary Efficacy
Table 2 shows the results of the primary ITT analyses conducted
on the entire sample. Univariate ANCOVA revealed a significant
treatment effect on depression revealing that those in the Woebot
group significantly reduced PHQ-9 score while those in the
information control group did not (F1,48=6.03; P=.017) (see
Figure 2). This represented a moderate between-groups effect
size (d=0.44). This effect is robust after Bonferroni correction
for multiple comparisons (P=.04). No other significant
between-group differences were observed on anxiety or affect.
Completer Analysis
As a secondary analysis, to explore whether any main effects
existed, 2x2 repeated measures ANOVAs were conducted on
the primary outcome variables (with the exception of PHQ-9)
among completers only. A significant main effect was observed
on GAD-7 (F1,54=9.24; P=.004) suggesting that completers
experienced a significant reduction in symptoms of anxiety
between baseline and T2, regardless of the group to which they
were assigned with a within-subjects effect size of d=0.37. No
main effects were observed for positive (F1,50=.001; P=.951;
d=0.21) or negative affect (F1,50=.06; P=.80; d=0.003) as
measured by the PANAS.
To further elucidate the source and magnitude of change in
depression, repeated measures dependent t tests were conducted
and Cohen d effect sizes were calculated on individual items of
the PHQ-9 among those in the Woebot condition. The analysis
revealed that baseline-T2 changes were observed on the
following items in order of decreasing magnitude: motoric
symptoms (d=2.09), appetite (d=0.65), little interest or pleasure
in things (d=0.44), feeling bad about self (d=0.40), and
concentration (d=0.39), and suicidal thoughts (d=0.30), feeling
down (d=0.14), sleep (d=0.12), and energy (d=0.06).
JMIR Ment Health 2017 | vol. 4 | iss. 2 | e19 | p.6 | http://mental.jmir.org/2017/2/e19/
Change in mean depression (PHQ-9) score
by group over the study period
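•Results

•Participants used the chatbot an average of 12.14 times over the 2 weeks

•Depression: significant group difference

•The Woebot group showed a significant reduction in depression (PHQ-9)

•The control group showed no significant reduction

•Anxiety: both groups showed a significant reduction (GAD-7)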

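•Woebot grew in 2017 while collecting a large amount of data

•Although offered only through Facebook Messenger, it grew by more than 50% every month,

•accumulating more than 2 million messages per week and users in more than 130 countries

•iOS and Android apps launched in early 2018
•March 2018: Woebot raised an $8M Series A

•Led by New Enterprise Associates (NEA), with participation from Prof. Andrew Ng's AI Fund

•AI Fund: a $175M fund for investing in AI startups, formed by Prof. Andrew Ng in January 2018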

G.M. Lucas et al. / Computers in Human Behavior 37 (2014) 94–100
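Who builds rapport better, a human doctor or an AI doctor?

It’s only a computer:
Virtual humans increase willingness to disclose
Frame (what the participant believes)
•Believes an AI is conducting the interview (computer frame)
•Believes a person is conducting the interview remotely (human frame)
Method (what actually happens)
•An AI actually conducts the interview (AI)
•A person actually conducts the interview (Tele-operated)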

‘‘How close are you to your family?’’
‘‘Tell me about a situation that you wish you had handled differently.’’
‘‘Tell me about an event, or something that you wish you could erase from your memory.’’
‘‘Tell me about the hardest decision you’ve ever had to make.’’
‘‘Tell me about the last time you felt really happy.’’
‘‘What are you most proud of in your life?’’
‘‘What’s something you feel guilty about?’’
‘‘When was the last time you argued with someone and what was it about?’’

[Bar charts comparing the computer frame with the human frame on four measures: Fear of Self-disclosure, Impression Management, Sadness Displays, Willingness to Disclose]

‘‘This is way better than talking to a person. I don’t really feel
comfortable talking about personal stuff to other people.’’
‘‘A human being would be judgmental. I shared a lot of
personal things, and it was because of that.’’

Digital Therapeutics
Digital drugs

•Digiceutical = digital + pharmaceutical

•"After chemicals and proteins, digital drugs will become the third class of new drugs"

•Digital drugs fall into two broad types

•Those that replace an existing drug outright

•Those that augment an existing drug

• reSET® was evaluated in a clinical trial of 507 patients with SUD across 10 treatment centers nation-wide over 12 weeks.*
• In patients who were dependent on stimulants, marijuana, cocaine, or alcohol (n=395), 58.1% of patients receiving
reSET®* were abstinent in study weeks 9-12, while 29.8% of patients receiving face-to-face therapy alone were abstinent
during the same time frame (p<0.01).
• Among participants who tested positive for drug use at the start of the study (n=191), 26.7% of patients receiving reSET®* were
abstinent in study weeks 9-12, while 3.2% of patients receiving traditional face-to-face therapy were abstinent during the
same time frame (p<0.01).
Pear Therapeutics
Campbell et al. Am J Psychiatry. 2014.

Campbell et al. Am J Psychiatry. 2014.
Pear Therapeutics
• Patients receiving reSET® showed statistically significant improvement in retention compared to face-to-face therapy alone
(p=0.0316). At the end of 12 weeks of treatment 59% of patients receiving face-to-face therapy were retained in the study
compared to 67% of patients receiving reSET®.
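To put the reported percentages side by side, the following sketch computes the difference in percentage points and the ratio of rates for each comparison quoted above. Only the percentages come from the slides; the function name is an assumption, and because per-arm sample sizes are not given here, no confidence intervals or p-values are derived.

```python
def compare_rates(reset_pct, control_pct):
    """Return (difference in percentage points, ratio of rates) for two percentages."""
    difference = round(reset_pct - control_pct, 1)
    ratio = round(reset_pct / control_pct, 2)
    return difference, ratio


# Percentages as quoted on the slides (reSET vs. face-to-face therapy).
abstinence_all = compare_rates(58.1, 29.8)        # n=395 subgroup, weeks 9-12
abstinence_positive = compare_rates(26.7, 3.2)    # n=191 positive at study start
retention = compare_rates(67.0, 59.0)             # retained at 12 weeks
```

The ratio is largest in the subgroup that tested positive at study start, where the comparison rate is very low, so the absolute difference is worth reading alongside it.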

Pear Therapeutics
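•First smartphone app to receive FDA de novo clearance as a digital therapeutic for treating a disease

(the first time a system consisting of an app alone, with no device, was cleared for the purpose of treating a disease)

•reSET, a system from Pear Therapeutics: an app for treating various addictions

•Treats addiction to and dependence on marijuana, cocaine, and alcohol over 12 weeks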

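© 2017 by HURAYPOSITIVE INC., a Digital Healthcare Service Provider. This information is strictly privileged and confidential. All rights reserved.
key facts
Products & Services
Service targets & roles
Diabetes by type: type 2 diabetes 95%, type 1 diabetes 2%, gestational diabetes 2%, other 1%
Progression stages: healthy → prediabetes → diabetes → diabetes with mild complications → diabetes with severe complications
Ministry of Health and Welfare / National Health Insurance Service
(public health promotion and management)
Hospitals / pharmaceutical companies / insurers
(cost reduction and customer satisfaction)
Huraypositive healthcare solution:
a healthcare solution for actively preventing progression to the next risk stage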

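key facts
Products & Services
Clinical efficacy (clinical trial using Health Switch)
Participants
• N = 148
• Mean age: 52.2 years
• Characteristics: patients with type 2 diabetes
• Period: 2014.10 ~ 2015.12
Design and HbA1c results
• Phase 1 (0M-6M)
Intervention group (intervention): HbA1c decreased by 0.63%p.
Control group (no intervention): no meaningful change
• Phase 2 (6M-12M): intervention and control groups crossed over
Control group (no intervention; the phase 1 intervention group): HbA1c level maintained
Intervention group (intervention; the phase 1 control group): HbA1c decreased by 0.64%p.
[Chart: HbA1c (%), axis from 7 to 8.2, at 0M, 3M, 6M, 9M, 12M; annotated changes ▼0.63%p., ▼0.64%p., ▼0.04%p.]
Results: effects of Health Switch verified through the clinical trial
1 Meaningful blood-glucose reduction from the mobile intervention service
2 Lifestyle changes can potentially be maintained after about 6 months of service
3 A simple service that even older patients can use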

SCIENTIFIC REPORTS | (2018) 8:3642 | DOI:10.1038/s41598-018-22034-0
www.nature.com/scientificreports
The effectiveness, reproducibility,
and durability of tailored mobile
coaching on diabetes management
in policyholders: A randomized,
controlled, open-label study
Da Young Lee¹,², Jeongwoon Park³, Dooah Choi³, Hong-Yup Ahn⁴, Sung-Woo Park¹ & Cheol-Young Park¹
This randomized, controlled, open-label study conducted in Kangbuk Samsung Hospital evaluated
the effectiveness, reproducibility, and durability of tailored mobile coaching (TMC) on diabetes
management. The participants included 148 Korean adult policyholders with type 2 diabetes divided
into the Intervention-Maintenance (I-M) group (n=74) and Control-Intervention (C-I) group (n=74).
Intervention was the addition of TMC to typical diabetes care. In the 6-month phase 1, the I-M group
received TMC, and the C-I group received their usual diabetes care. During the second 6-month phase
2, the C-I group received TMC, and the I-M group received only regular information messages. After
the 6-month phase 1, a significant decrease (0.6%) in HbA1c levels compared with baseline values was
observed in only the I-M group (from 8.1±1.4% to 7.5±1.1%, P<0.001 based on a paired t-test).
At the end of phase 2, HbA1c levels in the C-I group decreased by 0.6% compared with the value at 6
months (from 7.9±1.5 to 7.3±1.0, P<0.001 based on a paired t-test). In the I-M group, no changes
were observed. Both groups showed significant improvements in frequency of blood-glucose testing
and exercise. In conclusion, addition of TMC to conventional treatment for diabetes improved glycemic
control, and this effect was maintained without individualized message feedback.
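Because HbA1c is itself expressed in percent, the "0.6%" decrease in the abstract is a drop in percentage points (the "%p." notation used on the Health Switch slide), not a 0.6% relative change. The sketch below makes the distinction explicit using the group means quoted in the abstract; the function name is an assumption for illustration.

```python
def hba1c_change(before, after):
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute_drop = before - after
    relative_drop = absolute_drop / before * 100
    return round(absolute_drop, 2), round(relative_drop, 1)


# Group means quoted in the abstract.
im_phase1 = hba1c_change(8.1, 7.5)  # I-M group, baseline to 6 months
ci_phase2 = hba1c_change(7.9, 7.3)  # C-I group, 6 months to 12 months
```

Both phases therefore correspond to a drop of about 0.6 percentage points, or roughly 7 to 8 percent of the starting value.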
The incidence and prevalence of type 2 diabetes are increasing rapidly worldwide, and the disease is expected
to affect 439 million adults by 2030¹. Previous large clinical trials indicated that adequate glycemic control
contributed to a reduction in both microvascular and macrovascular complications as well as mortality rates due to
diabetes²,³. Complications from diabetes result in greater expenditure and reduced productivity. Therefore, it is a
socioeconomic concern⁴,⁵. Adequate glycemic control is important not only as an individual health problem, but
also as a challenge to healthcare systems worldwide.
However, approximately 40% of subjects with diabetes in the United States do not meet the recommended
target for glycemic control, low-density lipoprotein cholesterol (LDL-C) level, or blood pressure (BP)⁶. In Korea,
glycated hemoglobin (HbA1c) levels for nearly half of diabetic patients were above 7.0%⁷.
Although successful diabetes care requires therapeutic lifestyle modification in addition to proper
medication⁸⁻¹⁰, only 55% of individuals with type 2 diabetes receive diabetes education from healthcare professionals¹¹,
and 16% report adhering to recommended self-management activities⁹. Multifaceted professional
interventions are needed to support patient efforts for behavior change including healthy lifestyle choices, disease
self-management, and prevention of diabetes complications¹⁰.
¹Division of Endocrinology and Metabolism, Department of Internal Medicine, Kangbuk Samsung Hospital,
Sungkyunkwan University School of Medicine, Seoul, Republic of Korea. ²Division of Endocrinology and Metabolism,
Department of Internal Medicine, Korea University College of Medicine, Seoul, Republic of Korea. ³Huraypositive Inc.
Sinsa-dong, Gangnam-gu, Seoul, Republic of Korea. ⁴Department of Statistics, Dongguk University-Seoul, Seoul,
Republic of Korea. Correspondence and requests for materials should be addressed to C.-Y.P. (email: cydoctor@chol.com)
Received: 29 November 2017
Accepted: 15 February 2018
Published: xx xx xxxx
OPEN
