The vast volumes of medical data being collected offer pharma the opportunity to harness the information in big data sets.
Unlocking the potential of these data sources can ultimately lead to improved patient outcomes.
This presentation describes considerations for maximizing the impact of Big Data:
its methodology, practical challenges, and implications.
2. Disclaimer
Yigal Aviv, June 2017
• The content offered in this presentation is intended to provide information or
to present my opinion, and serves as general information only
• Any third-party solutions presented or information provided are mine alone,
for illustrative purposes, and are not indicative of a current or
past relationship with my workplace, Teva Pharmaceutical Industries Ltd.
• All content was taken from public domain resources
• I hereby declare that I did not receive any compensation
for my participation in this conference
3. Agenda
What is Big Data
•Introduction
•Definition
•Medical BD
•Sources of medical BD
Why & How Big Data
•Implications of BD by pharma
•Advantages & Limitations
•BD for drug development and
clinical trials
•Challenges & Insights
4. Introduction: Why now?
• 2002, the birth of social media
• 2008, the birth of the Internet of Things (IoT/biosensors)
• New technologies to reveal insights from
diverse datasets (ML, DL)
• Increase in genomic knowledge
• Increase in storage capabilities
• Increase in processing power (GPUs)
• Rapid growth in sources of digital health data
5. Medical Research / Knowledge
doubling every 3.5 Years…
Source: National Institutes of Health, U.S. National Library of Medicine; Densen, Peter, “Challenges and Opportunities Facing Medical Education,” Transactions of the American Clinical and Climatological Association (1/10). *Based on cumulative number of published medical citations on PubMed. **Based on a peer-reviewed article on challenges in medical education. Internet Trends 2017, Mary Meeker, May 31, 2017.
19x Increase in genomic
knowledge…
6. Electronic health record (EHR)
adoption is moving fast
Wearables & Wellness data
7. Number of social networks
users worldwide (in billions)
Sharing health data
8. What is Big Data?
• Capturing in real time a huge amount of data of many
types from diverse sources
• Inconsistent data, beyond the ability of “old” database
software tools to manage and analyze
within a tolerable time
• The goal:
transform “just” data into intelligent insights to
optimize outcomes
10. What is Big Data? The 6 Vs
• Capturing in real time (Velocity) a huge amount of data (Volume) of many types
from diverse sources (Variety)
• Data of uncertain quality (Veracity), beyond the ability of “old” database software tools
to manage and analyze within a tolerable time
• The goal: transform (Visualization) “just” data into intelligent insights to
optimize outcomes (Value)
11. Medical Big data sources
• ClinicalTrials.gov
• Electronic health record
• Real world studies
• Gene sequencing
• Search engines
• Medical imaging
• Mobile devices/apps
• Social Media/Blogs
• HCP’s Notes
• Test Results
• Clinical registry
• World Health Organisation health and disease statistics
• IOT/Sensors/Wearables
12. What is special about medical big data
• Relatively small compared to data from
other disciplines
• Frequently hard to access
(patient consent)
• Affected by several sources
of uncertainty, domain knowledge
may be essential
• Subject to legal issues
(Data security, patient privacy)
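The privacy constraint above is often addressed by pseudonymizing direct identifiers before analysis. A minimal sketch, assuming a salted one-way hash is acceptable for the use case (the salt, ID format, and record fields are hypothetical, not from the presentation):

```python
import hashlib

# A secret salt kept separate from the data; without it, the tokens
# cannot be reversed by hashing a dictionary of known patient IDs.
SALT = b"replace-with-a-secret-random-value"

def pseudonymize(patient_id: str) -> str:
    """Replace a direct identifier with a stable one-way token."""
    return hashlib.sha256(SALT + patient_id.encode("utf-8")).hexdigest()[:16]

record = {"patient_id": "IL-000123", "age": 67, "diagnosis": "T2DM"}
safe_record = {**record, "patient_id": pseudonymize(record["patient_id"])}
# The token is stable, so records for the same patient still link up,
# but the original ID never leaves the source system.
```

Because the token is deterministic, longitudinal analysis across visits still works; true anonymization (which also removes quasi-identifiers like age) is a stricter, separate step.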
13. Structured data vs. Unstructured data
Structured data:
•Organized and displayed in a database
with rows and columns
•Straightforward to work with, easy to
sort and categorize
Unstructured data:
•Variable and complex
•Difficult to sort, categorize and analyze
•Emails, images, and any form of human
language in a conversational format
Structured data → classical statistical analysis
Unstructured data → big data analysis
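The contrast can be made concrete: structured data drops straight into rows and columns, while unstructured text needs an extra extraction step before any sorting or aggregation is possible. A toy sketch (the lab values and note text are invented for illustration):

```python
import csv
import io
import re

# Structured: a CSV of lab results aggregates directly.
structured = "patient,hba1c\nA,7.1\nB,6.4\nC,8.0\n"
rows = list(csv.DictReader(io.StringIO(structured)))
mean_hba1c = sum(float(r["hba1c"]) for r in rows) / len(rows)

# Unstructured: a free-text clinical note must first be mined for the
# same value before it can be analyzed at all.
note = "Patient reports fatigue. Labs today: HbA1c 7.1%, stable."
match = re.search(r"HbA1c\s+(\d+\.\d+)", note)
extracted = float(match.group(1)) if match else None
```

Real clinical notes are far messier than this regex suggests, which is exactly why unstructured sources push analysis toward NLP and machine-learning tooling rather than classical row-and-column statistics.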
14. Medical Big Data sources
These sources sit on a spectrum from structured data, through a mix, to unstructured data:
• Test results
• Electronic health record
• World Health Organisation health and disease statistics
• ClinicalTrials.gov
• Medical imaging
• Clinical registry
15. Agenda
What is Big Data
•Introduction
•Definition
•Medical BD
•Sources of medical BD
Why & How Big Data
•Implications of BD by pharma
•Advantages & Limitations
•BD for drug development and
clinical trials
•Challenges & Insights
16. Using Big Data through the commercial life cycle
(Commercial, Marketing, Market access, Legal)
Health economics outcomes research (HEOR)
Regulatory intelligence
Influencer profiling
KOL management
Market access & pricing
Treatment adherence
Multi-channel marketing analysis
Social media listening
Patient journey analytics
Patent mining
Comparative intelligence
19. Social media listening
Active users in Millions
https://www.weforum.org/agenda/2017/03/most-popular-social-networks-mapped?
utm_content=buffera55af&utm_medium=social&utm_source=facebook.com&utm_campaign=buffer
20. Choose wisely from where, how, and what to listen
https://www.weforum.org/agenda/2017/03/most-popular-social-networks-mapped?
utm_content=buffera55af&utm_medium=social&utm_source=facebook.com&utm_campaign=buffer
22. Measuring patient emotions across conditions
https://treato.com/articles/Measuring-Patient-Emotions-Across-Conditions/
23. Understanding patients through community listening
Immunotherapy posts on a particular high-agency community over
time
Immunotherapy posts on all other sites over time
FDA approved the first immunotherapy for X cancer
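A listening pipeline behind a chart like this typically buckets post timestamps by month and compares volume before and after an event date. A minimal sketch with invented data (the post dates and approval date are hypothetical):

```python
from collections import Counter
from datetime import date

# Hypothetical post dates scraped from a patient community.
posts = [date(2016, 11, 5), date(2016, 12, 1), date(2017, 1, 9),
         date(2017, 3, 2), date(2017, 3, 15), date(2017, 3, 28),
         date(2017, 4, 4), date(2017, 4, 11)]
approval = date(2017, 3, 1)  # hypothetical FDA approval date

# Bucket by (year, month), then split around the event.
by_month = Counter((d.year, d.month) for d in posts)
before = sum(n for (y, m), n in by_month.items() if date(y, m, 1) < approval)
after = sum(n for (y, m), n in by_month.items() if date(y, m, 1) >= approval)
# A jump in `after` relative to `before` is the spike the slide describes.
```

In practice the same bucketing feeds a time-series chart, and the before/after ratio is a quick signal of whether the community reacted to the approval.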
24. Understanding patients & caregivers through social media community listening
http://mhealthspot.com/2017/04/patientslikeme-shire-working-study-rare-genetic-diseases/
26. Despite the increased growth rate of clinical trials,
clinical impact lags due to the length of trials
Source: ClinicalTrials.gov database (5/17), FDAReview.org (2016)
Number of registered clinical trials posted on ClinicalTrials.gov.
28. Using Big Data through the clinical development process
(R&D, CRO, Medical Affairs)
Real world data
Patient recruitment
Site-less/remote trials
Site selection & management
Biostatistics
Pharmacovigilance & safety
Data management, statistical programming & SAS
Drug repurposing
Pharmacokinetics/pharmacogenomics/personalized medicine analytics/biomarkers
30. Genentech 2013
• Partnered with universities and firms to recruit
80 data scientists (global team)
• Built a big-data infrastructure to analyze patient records
• Created a database on a historical cohort of real-world cancer patients
• The team analyzed their data to understand the outcomes of different
patient subtypes and treatment regimens
• Helped to learn how different biomarker alterations and different
treatment patterns affect clinical outcomes in the real world
• This information supports critical drug development decisions
31. 4 years later, 2017 Genentech in the Headlines
https://www.gene.com/media/press-releases/14666/2017-05-10/genentech-to-present-new-data-on-persona
32. Could Big Data build new drugs faster than human trials alone?
• The majority of medical data are unstructured (e.g., genomics, curated medical
literature, health care claims, medical device feeds, electronic medical records, etc.)
• The analysis of comprehensive EHR patient data collected in real time during
doctor or hospital visits provides an opportunity to better understand diseases,
treatment patterns, and clinical outcomes in an uncontrolled, real-world setting
• These valuable insights complement those gained from clinical trials and
provide an opportunity to assess a wider spectrum of patients who are traditionally
excluded from clinical trials, e.g., elderly, frail, or immobile patients, as well as
people with rare indications and diseases not yet studied in clinical trials
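At its core, the real-world EHR analysis described above, like the Genentech cohort work, reduces to grouping patient records by subtype and treatment and comparing outcomes. A heavily simplified sketch with fabricated records (all field names and values are illustrative only):

```python
from collections import defaultdict

# Fabricated real-world records: biomarker status, regimen, outcome.
records = [
    {"biomarker": "ALK+", "regimen": "drug_x", "months_survival": 30},
    {"biomarker": "ALK+", "regimen": "chemo",  "months_survival": 18},
    {"biomarker": "ALK-", "regimen": "drug_x", "months_survival": 14},
    {"biomarker": "ALK+", "regimen": "drug_x", "months_survival": 26},
    {"biomarker": "ALK-", "regimen": "chemo",  "months_survival": 15},
]

# Group outcomes by (biomarker, regimen) subgroup.
groups = defaultdict(list)
for r in records:
    groups[(r["biomarker"], r["regimen"])].append(r["months_survival"])

# Mean outcome per subgroup, the raw material for comparing how
# biomarker alterations and treatment patterns affect outcomes.
mean_outcome = {k: sum(v) / len(v) for k, v in groups.items()}
```

A real cohort analysis would add confounder adjustment and survival modeling; the point here is only the group-by-subtype structure of the question.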
33. Challenges
• Cultural and organizational obstacles slow the adoption of Big Data initiatives
• Data silos: internal company data and external data (clinical and real-world patient and
physician data) are spread across disparate silos, making integration difficult
• Difficulty “showing” ROI with tight budgets and short-term solutions
• Lack of benchmarks (cost, value);
a comparative analysis of big data vendors will show who has the most appropriate data sets to meet business
needs
• Language: translation alone is not enough
• Patient privacy & data ownership are barriers prohibiting explosive growth in data sharing;
unlike data sharing in other industries, the pharma industry could potentially expose patient information
• Visualization tools are important to present complex trends and insights in a simple way
• With the Internet of Things (IoT), biosensors, and medical chatbots,
there are still compliance and legal concerns regarding medical responsibility
34. Key Insights
• Analytics using big data mining has a crucial role in drug development
• Collecting and analyzing big data to increase genomic understanding
is becoming increasingly important (a must-do) in personalized medicine development
• The winners in the era of big data will be pharma companies that transform medical affairs
teams into medical value teams, focusing on outcomes instead of products
• Pharma still needs to understand the why & how of big data
(hypothesis generating, rather than testing)
• Internal data sharing and collaboration are critical components in leveraging Big Data;
breaking down silos and building synergy between functions and therapeutic teams is essential
35. Key Insights
• We need data scientists! A CDO (Chief Data Officer) and
data analysts embedded in multidisciplinary teams who can take full advantage of big data analysis,
and who are capable of briefing ML vendors, handling noisy data, and presenting results in a simple,
easy-to-interpret way
• Pharma needs to find ways to share data to advance medical knowledge without
compromising intellectual property
• Using Big Data analysis platforms is necessary but not sufficient, and cannot replace
well-controlled “human” clinical trials.
Asking “Which is better?” is like asking “Is it better to use a knife or a hammer?”
The answer depends on what problem is being solved
- I will focus on pharma's use of big data in the commercial domain in general, and on clinical research and drug development in particular.
If you had to pick one buzzword that perfectly encapsulates the current latest and greatest in innovation, you could do a lot worse than “big data”. The last few years have seen big data and data mining introduced into nearly every field imaginable, often with disruptive results. Now, that same set of tools is being leveraged to improve drug development, with huge potential implications for identifying and correcting issues that arise during clinical trials. While there are no set standards or best practices around marshalling big data to identify and correct failing trials, we wanted to throw out a few ideas and suggestions that could go a long way towards making clinical trials succeed.
We see big data making the biggest impact in the areas of patient recruitment, process monitoring, and safety and data handling. In this post, we will take a quick overview of what big data is, how we look at and define program failures and faltering trials, and do a quick fly-by of the topics we will be covering in the future. Over future blog posts, we will be exploring each of these in depth to identify where and how the promises of big data can best increase the efficiency and success of your clinical trials.
Yigal Aviv: 13 years in the pharma industry in commercial marketing and business development roles.
Currently employed at Teva's global headquarters as digital marketing and business development manager for oncology in emerging markets.
This lecture does not present or imply any information that is not public domain, whether from Teva or from other pharma companies.
Big Data
In the next few minutes we will talk about what big data is;
why pharma should use, or already uses, big data collection and processing technologies;
and, most importantly, how pharma uses these technologies today for drug development and a drug's life cycle.
Big Data is a term referring to a body of information comprising distributed data that is not organized by any particular method, arriving from many sources, in large quantities, in varied formats, and of varying quality.
Big data can be characterized by five attributes (the five Vs):
Volume
Velocity
Variety
Veracity
Volatility
The challenge of managing big data has made the field a central concern in information technology. Existing relational databases are not built to store and analyze large quantities of information, most of which does not arrive formatted according to uniform, predefined templates. The relatively low cost of storage on one hand, and the large amount of information arriving from many sources (websites, social networks, mobile devices, security cameras, sensors, and more) on the other, mean that data accumulates without deletion, enabling the analysis and detection of patterns and correlations needed across many content domains.
2002: the first social network, Friendster, went live in the US as a dating-style site building a community around shared interests; LinkedIn in 2003, Myspace, and Facebook in 2004.
2008: the Internet of Things, a network of physical objects, or "things," embedded with electronics, software, and sensors enabling advanced communication between objects and capabilities for collecting and exchanging information. This network is expected to drive automation in many fields. The Internet of Things includes, among others, the "smart home" and "smart city" domains, and can refer to a wide variety of devices.
New technologies of machine learning (computational learning): the field deals with developing algorithms designed to let a computer learn from examples.
A graphics processing unit (GPU) is a processor found on a graphics card, or alongside the CPU core in a system-on-chip, that allows workloads to be divided between it and the main processor (CPU) for various graphics computations, mainly of three-dimensional scenes.
The GPU can perform complex graphics computations and is programmable, much like other processors. Its appearance allowed computer game developers to create far more realistic graphics thanks to its processing power.
2002 FRIENDSTER, 2003 LINKEDIN, MYSPACE AND 2004 FACEBOOK
In 2002, social networking really hit its stride with the launch of Friendster. Friendster used a degree-of-separation concept similar to that of the now-defunct SixDegrees.com, refined it into a routine dubbed the “Circle of Friends,” and promoted the idea that a rich online community can exist only between people who truly have common bonds. And it ensured there were plenty of ways to discover those bonds. An interface that shared many of the same traits one would find at an online dating site certainly didn’t seem to hurt. Friendster CEO Jonathan Abrams even once referred to his creation as a dating site that isn’t about dating. Within a year after its launch, Friendster boasted more than three million registered users and a ton of investment interest. Unfortunately, the service has since seen more than its fair share of technical difficulties, questionable management decisions, and a resulting drop in its North American fortunes. Although briefly enjoying success in Indonesia and in the Philippines, Friendster has since abandoned social networking and now exists solely as an online gaming site.
SNP (Single Nucleotide Polymorphism) is a variation in the DNA sequence occurring when a single nucleotide (A, T, C, G) in the genome differs between individuals of a biological species, or between homologous chromosomes in a person. For example, two DNA segments from two different individuals, AAGCCTA versus AAGCTTA, differ in a single nucleotide (in the marked position), i.e., an SNP. In this case we say there are two alleles.
Electronic medical records in the US
A gyroscope, sometimes shortened to gyro (from the Greek "gyros" = "circle, rotation" and "skopos" = "seeing"; the name was coined by the French physicist Léon Foucault in 1852), is a scientific device used to measure or maintain orientation, based on the principle of conservation of angular momentum.
Modern camera lenses: prevents shake during exposure to light.
Smartphones and tablets: used to determine the device's orientation in space, mainly in games and in virtual- and augmented-reality applications.
Segway
Big Data
Capturing, storing, and analyzing in real time a large quantity of data arriving in different forms from different sources.
The information is inconsistent and unpredictable, comprising distributed data beyond the ability of existing databases to process, manage, and explore within a given time.
The goal: turn the information from mere data into smart knowledge that generates insight and a business outcome of value.
The term has been in use since the 1990s, with some giving credit to John Mashey for coining or at least making it popular.[13][14] Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time.[15] Big Data philosophy encompasses unstructured, semi-structured and structured data, however the main focus is on unstructured data.[16] Big data "size" is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data.[17] Big data requires a set of techniques and technologies with new forms of integration to reveal insights from datasets that are diverse, complex, and of a massive scale.[18]
In a 2001 research report[19] and related lectures, META Group (now Gartner) defined data growth challenges and opportunities as being three-dimensional, i.e. increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources). Gartner, and now much of the industry, continue to use this "3Vs" model for describing big data.[20] In 2012, Gartner updated its definition as follows: "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization." Gartner's definition of the 3Vs is still widely used, and in agreement with a consensual definition that states that "Big Data represents the Information assets characterized by such a High Volume, Velocity and Variety to require specific Technology and Analytical Methods for its transformation into Value".[21] Additionally, a new V "Veracity" is added by some organizations to describe it,[22] revisionism challenged by some industry authorities.[23] The 3Vs have been expanded to other complementary characteristics of big data:[24][25]
Volume: big data doesn't sample; it just observes and tracks what happens
Velocity: big data is often available in real-time
Variety: big data draws from text, images, audio, video; plus it completes missing pieces through data fusion
Machine learning: big data often doesn't ask why and simply detects patterns[26]
Digital footprint: big data is often a cost-free byproduct of digital interaction[25][27]
The growing maturity of the concept more starkly delineates the difference between big data and Business Intelligence:[28]
Business Intelligence uses descriptive statistics with data with high information density to measure things, detect trends, etc..
Big data uses inductive statistics and concepts from nonlinear system identification[29] to infer laws (regressions, nonlinear relationships, and causal effects) from large sets of data with low information density[30] to reveal relationships and dependencies, or to perform predictions of outcomes and behaviors
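The BI-versus-big-data distinction above can be shown in a few lines: descriptive statistics summarize what is in the data, while inductive statistics fit a model and extrapolate beyond it. A minimal sketch using an ordinary least-squares slope computed by hand (the numbers are invented):

```python
# Descriptive (BI-style): summarize observed values.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.9]
mean_y = sum(ys) / len(ys)

# Inductive (big-data-style): fit y ≈ a + b*x and predict beyond the data.
mean_x = sum(xs) / len(xs)
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x
prediction_at_10 = a + b * 10.0  # extrapolated, not observed
```

The descriptive summary (`mean_y`) only restates the data; the fitted regression turns it into a relationship that can be queried at points never measured, which is the "infer laws ... to perform predictions" step the passage describes.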
From one perspective, Machine Learning (ML) is a tool that makes it possible to solve software problems that cannot be solved the usual way, i.e., with if statements and for loops.
https://www.google.co.il/search?q=big+data+in+pharma&source=lnms&tbm=isch&sa=X&ved=0ahUKEwiSiqr3nKnUAhULDMAKHR6RBZgQ_AUIBigB&biw=1680&bih=895#imgrc=oZewZqY3Srf8cM:
Remote trials
Remote clinical trials may be one way in which many of these ideas come together in the near future. Technology and consulting company eClinicalHealth recently experimented with this model in the VERKKO remote online phase IV clinical trial for diabetes.
The objectives of VERKKO, developed in collaboration with Sanofi, Langland and Mendor, were to study the use of an online clinical trial platform integrated with Mendor's 3G-enabled wireless blood glucose meter in a completely remote setting.
Sixty patients – all recruited through Facebook – participated in the study, which had no site visits. Patients self-registered their interest in eClinicalHealth's cloud-based trial system Clinpal, after which the coordinating study site reviewed their application. Those selected reviewed patient information and signed the informed consent form electronically. Study materials were delivered directly to patients who then connected the smart, wireless glucose meter with their personal Clinpal account.
"The problem that diabetes companies have is that it's difficult to understand how diabetes and its treatments behave in different populations," says eClinicalHealth's patient-centric clinical trial solutions expert Kai Langel, explaining the reasoning behind the trial. "To be able to measure that you need to look at lots of people in lots of countries, which can become very expensive and cumbersome. However, if you can do it remotely, it could be practicable to start trials in many countries, make it easier for patients to take part, keep costs down and be able to do it in a more efficient way.
"It does not require any lab tests other than patients measuring their own blood glucose at home. They were doing normal finger prick tests using a smart glucose meter with a smartcard in it, which not only captures the data but automatically sends it straight away. It also reminds the patient to do the test at the right time, which was a really important part of improving compliance."
To measure whether this model was actually feasible and successful the company looked at a number of factors.
"We wanted to know how much time patients would spend in order to take part in the study," says Langel. "At the end, they completed a satisfaction survey, which asked how much time they spent on the trial. We looked at how happy they were with the materials they were provided, how they found using the meter, and how satisfied they were with study participation – which had very high ratings, 4.62 out of five.
"We also measured site efficiency, asking them how much more efficient our trial was to a sister non-remote trial, and they reported that ours was 66 percent more efficient, which means that our trial only required one third of the effort that the other trial took.
"The platform performed very well and we were able to track the patients throughout. We were able to track how long it took from the first contact with the patient to them becoming involved, how long it took them to complete their glucose profile, and we could see where the bottlenecks were. Patients reported that the platform helped them across the study. The lowest scores were about the digital materials we provided – we basically took the informed consent materials and converted them into electronic format but we didn't create any videos or graphics, so they felt it wasn't very friendly, although it still got good ratings and is easily improved.
"The average age of those that completed the study was 59, so there were some individuals who had some generic technology problems. For example, they had to find the activation email and click a link, and we learned that the patient didn't always click on the link as soon as they received the email and sometimes it had expired by the time they did click it. We just had to send another, but some of the participants were not used to doing email every day."
Langel says that the trial has shown him that this model can definitely have wider applications. "You can take these individual modules – such as online recruitment, electronic informed consent, supporting patients remotely, offering patients a dashboard that they can look at to see what's going on and what to do next – and apply them to traditional trials to make them more patient-centric and improve the data. Also, patient feedback after study participation is very easy to implement, and we learned so much by doing this in our study that I really hope more companies will implement that mechanism."
He adds that many companies have already shown interest in trying out this model. "I hope more will because the technology is not an obstacle, the regulatory acceptance is not an obstacle, so it's more to do with internal change management within these companies in order to change the way they think."
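The compliance mechanism Langel describes, reminders plus automatic upload, is easy to quantify once readings carry timestamps. A sketch computing the share of scheduled tests completed within a tolerance window (the schedule, readings, and window are invented, not from the VERKKO trial):

```python
from datetime import datetime, timedelta

# Hypothetical schedule and the timestamps the meter actually uploaded.
scheduled = [datetime(2017, 5, d, 8, 0) for d in range(1, 6)]
readings = [datetime(2017, 5, 1, 8, 10), datetime(2017, 5, 2, 9, 45),
            datetime(2017, 5, 4, 8, 5), datetime(2017, 5, 5, 7, 55)]
window = timedelta(hours=1)  # how late a test may be and still count

def on_time(slot, readings, window):
    """True if any reading falls within `window` of the scheduled slot."""
    return any(abs(r - slot) <= window for r in readings)

# Fraction of scheduled slots covered by an on-time reading.
compliance = sum(on_time(s, readings, window) for s in scheduled) / len(scheduled)
```

Because the meter uploads automatically, this metric can be watched live during the trial, so a lapsing patient triggers a follow-up rather than a gap discovered at database lock.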
Pharma resistance
However, Cutler believes that mindset changes within pharma companies may be slow to arrive. "There's an element of conservatism that always overlays innovation in a clinical trial setting," he says. "Most of our customers are relatively conservative in the way they approach new things. Their business model is working pretty well and there's an element of 'if it ain't broke don't fix it'.
"When I think about the future I also look back to the past. When I got into the industry we were starting to talk about electronic data capture and it's probably taken 15 to 20 years for that to become a standard part of trials. We're dealing with human subjects, and there's always a requirement – and I think it's an important requirement – to validate the systems and make sure what you're doing is in place and proven before you roll it out.
Nevertheless, Cutler is still optimistic for the future: "The pharma industry is under significant pressure at the moment, so they do recognise that improving and getting better is an important part of what they need. There are a number of people who are very open to these changes and see the opportunity. There's enough momentum in the industry to move this forward, albeit probably not at the pace we'd like to see."
Accenture's report gives a more concrete picture of how pharma is responding to new developments in clinical trials, showing that 55 percent of companies surveyed had adopted digital as a key strategy in R&D, and 42 percent were exploring it.
"I would like to see a higher adoption rate but it seems like people are at least on the journey, and it's a good start," says Julian. "Some of the things I would have expected to rise high on the list of digital adoption examples such as wearables or social media, are still relatively low adoption and low potential in the eyes of heads of R&D. We are on a learning curve, identifying digital as a key driver of the outcomes-based R&D approach that R&D executives are driving to, and there are certainly opportunities for educating the industry on the greater potential that's out there."
3Srf8cM:
https://www.google.co.il/search?q=big+data+in+pharma&source=lnms&tbm=isch&sa=X&ved=0ahUKEwiSiqr3nKnUAhULDMAKHR6RBZgQ_AUIBigB&biw=1680&bih=895#imgrc=oZewZqY
https://www.google.co.il/search?q=big+data+in+pharma&source=lnms&tbm=isch&sa=X&ved=0ahUKEwiSiqr3nKnUAhULDMAKHR6RBZgQ_AUIBigB&biw=1680&bih=895#imgrc=oZewZqY3Srf8cM:
Remote trials
Remote clinical trials may be one way in which many of these ideas come together in the near future. Technology and consulting company eClinicalHealth recently experimented with this model in the VERKKO remote online phase IV clinical trial for diabetes.
The objectives of VERKKO, developed in collaboration with Sanofi, Langland and Mendor, were to study the use of an online clinical trial platform integrated with Mendor's 3G-enabled wireless blood glucose meter in a completely remote setting.
Sixty patients – all recruited through Facebook – participated in the study, which had no site visits. Patients self-registered their interest in eClinicalHealth's cloud-based trial system Clinpal, after which the coordinating study site reviewed their application. Those selected reviewed patient information and signed the informed consent form electronically. Study materials were delivered directly to patients who then connected the smart, wireless glucose meter with their personal Clinpal account.
"The problem that diabetes companies have is that it's difficult to understand how diabetes and its treatments behave in different populations," says eClinicalHealth's patient-centric clinical trial solutions expert Kai Langel, explaining the reasoning behind the trial. "To be able to measure that you need to look at lots of people in lots of countries, which can become very expensive and cumbersome. However, if you can do it remotely, it could be practicable to start trials in many countries, make it easier for patients to take part, keep costs down and be able to do it in a more efficient way.
"It does not require any lab tests other than patients measuring their own blood glucose at home. They were doing normal finger prick tests using a smart glucose meter with a smartcard in it, which not only captures the data but automatically sends it straight away. It also reminds the patient to do the test at the right time, which was a really important part of improving compliance."
To measure whether this model was actually feasible and successful, the company looked at a number of factors.
"We wanted to know how much time patients would spend in order to take part in the study," says Langel. "At the end, they completed a satisfaction survey, which asked how much time they spent on the trial. We looked at how happy they were with the materials they were provided, how they found using the meter, and how satisfied they were with study participation – which had very high ratings, 4.62 out of five.
"We also measured site efficiency, asking the sites how much more efficient our trial was compared with a sister non-remote trial, and they reported that ours was 66 percent more efficient, which means that our trial required only one third of the effort that the other trial took.
"The platform performed very well and we were able to track the patients throughout. We were able to track how long it took from the first contact with the patient to them becoming involved, how long it took them to complete their glucose profile, and we could see where the bottlenecks were. Patients reported that the platform helped them across the study. The lowest scores were about the digital materials we provided – we basically took the informed consent materials and converted them into electronic format but we didn't create any videos or graphics, so they felt it wasn't very friendly, although it still got good ratings and is easily improved.
"The average age of those that completed the study was 59, so there were some individuals who had some generic technology problems. For example, they had to find the activation email and click a link, and we learned that the patient didn't always click on the link as soon as they received the email and sometimes it had expired by the time they did click it. We just had to send another, but some of the participants were not used to doing email every day."
Langel says that the trial has shown him that this model can definitely have wider applications. "You can take these individual modules – such as online recruitment, electronic informed consent, supporting patients remotely, offering patients a dashboard that they can look at to see what's going on and what to do next – and apply them to traditional trials to make them more patient-centric and improve the data. Also, patient feedback after study participation is very easy to implement, and we learned so much by doing this in our study that I really hope more companies will implement that mechanism."
He adds that many companies have already shown interest in trying out this model. "I hope more will because the technology is not an obstacle, the regulatory acceptance is not an obstacle, so it's more to do with internal change management within these companies in order to change the way they think."
Pharma resistance
However, Cutler believes that mindset changes within pharma companies may be slow to arrive. "There's an element of conservatism that always overlays innovation in a clinical trial setting," he says. "Most of our customers are relatively conservative in the way they approach new things. Their business model is working pretty well and there's an element of 'if it ain't broke don't fix it'.
"When I think about the future I also look back to the past. When I got into the industry we were starting to talk about electronic data capture and it's probably taken 15 to 20 years for that to become a standard part of trials. We're dealing with human subjects, and there's always a requirement – and I think it's an important requirement – to validate the systems and make sure what you're doing is in place and proven before you roll it out."
Nevertheless, Cutler is still optimistic for the future: "The pharma industry is under significant pressure at the moment, so they do recognise that improving and getting better is an important part of what they need. There are a number of people who are very open to these changes and see the opportunity. There's enough momentum in the industry to move this forward, albeit probably not at the pace we'd like to see."
Accenture's report gives a more concrete picture of how pharma is responding to new developments in clinical trials, showing that 55 percent of companies surveyed had adopted digital as a key strategy in R&D, and 42 percent were exploring it.
"I would like to see a higher adoption rate but it seems like people are at least on the journey, and it's a good start," says Julian. "Some of the things I would have expected to rise high on the list of digital adoption examples, such as wearables or social media, are still relatively low adoption and low potential in the eyes of heads of R&D. We are on a learning curve, identifying digital as a key driver of the outcomes-based R&D approach that R&D executives are driving to, and there are certainly opportunities for educating the industry on the greater potential that's out there."
The ML Workflow
The ML workflow is a bit more complex than what I have described so far. Since the bulk of the work in ML is the process rather than writing code, I will elaborate a little more on the process. I will focus on a workflow suited to Supervised Learning, such as a regression problem (predicting the value of a real-estate asset).
The stages in the workflow are:
1. Defining the problem -> the direction of the solution
This is the first and "hardest" stage: often there is no methodology that can guide us in how to carry it out. The problem definition is usually produced by the business people, and the ML people are expected to validate, challenge, and refine it.
For example: we may try to increase sales of a certain product by recommending similar products, while in practice every effort in that direction is doomed to fail, and the right way to increase sales is to intelligently reduce the number of offers the customer receives today.
Who can know in advance that six months of work on various recommendation experiments are about to go down the drain? What is mainly needed here is deep, high-quality Domain Knowledge and/or a methodology that identifies failure at an early stage.
2. Obtaining Data
You need to obtain "good" data, not just data "in the vicinity of the problem". This often means starting to collect data that was never collected before, and understanding the precise meaning of each data item. The quantity of data also matters a great deal: many algorithms perform much better when they have more data to learn from.
Sometimes the data already resides in the organization's systems, and you simply need to extract it or start collecting it. At other times the way to obtain the data is through a third party (a vendor, a website), or even to buy the data from another organization that holds it.
This stage is a combined challenge of Domain Knowledge, programming, and good organizational understanding; most of the time you will not be able to obtain the data on your own.
"More data beats clever algorithms, but better data beats more data" (Peter Norvig, co-author of the well-known book "AI: A Modern Approach" and Director of Research at Google).
3. Scrubbing Data
Even once you have the "right" data, it will usually have quality problems (Data Quality):
Filling in missing values (or removing the records, if repair is not possible). A production bug may produce some percentage of erroneous or missing data over a period of time.
Fixing inconsistencies. For example, if my name once appeared in the video library both as "Lior Bar On" and as "Lior Bar-On", I would be represented as two people even though I am one person, which would bias any analysis of the data.
There are various techniques (some of them ML-based) for matching records and identifying that they actually refer to the same person (with some degree of probability).
Normalizing the data, for example: putting all dates in the same format so that comparisons are possible, converting miles to kilometers, reordering data, and so on. There is an ML rule of thumb that says: "the data you receive will always be in the wrong format".
Obtaining the data plus "scrubbing" it are usually considered 50-80% of the total work in the ML workflow, the lion's share. This stage requires programming skills, some statistical understanding (for corrections and approximations), and a bit of Domain Knowledge in order to understand the data better. It is almost always possible to invest more work to improve data quality even further, so it is also important to know when to "stop".
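As an illustration of this stage, here is a minimal, dependency-free Python sketch of a scrubbing pass: it drops unrepairable records, collapses spelling variants of a name into one canonical key, and normalizes date formats and units. All field names and example values are hypothetical.

```python
# All field names and example values below are hypothetical.
from datetime import datetime

def scrub(records):
    cleaned = []
    for rec in records:
        # 1. Missing values: drop records we cannot repair.
        if rec.get("name") is None or rec.get("date") is None:
            continue
        # 2. Inconsistencies: collapse "Lior Bar-On" / "Lior Bar On"
        #    into a single canonical key.
        name = rec["name"].replace("-", " ").strip().lower()
        # 3. Normalization: one date format, one distance unit.
        date = datetime.strptime(rec["date"], rec["date_format"]).date()
        km = rec["distance_miles"] * 1.609344
        cleaned.append({"name": name, "date": date, "distance_km": km})
    return cleaned

rows = [
    {"name": "Lior Bar-On", "date": "2017-06-01",
     "date_format": "%Y-%m-%d", "distance_miles": 10.0},
    {"name": "Lior Bar On", "date": "01/06/2017",
     "date_format": "%d/%m/%Y", "distance_miles": 5.0},
    {"name": None, "date": "2017-06-02",
     "date_format": "%Y-%m-%d", "distance_miles": 1.0},  # dropped
]
clean = scrub(rows)
print(len(clean))  # 2 records survive, under one canonical name
```

In real projects this logic is usually spread over many such rules, and knowing when to stop adding them is exactly the judgment call described above.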
4. Choosing the Algorithm
There are dozens of ML algorithms, several of which are relevant to any given type of problem. You have to understand both the data and the algorithms in order to (try to) pick the best-fitting one.
For example: a Binary Classification problem can be solved with Naive Bayes (a statistical model that does not require large amounts of data and runs fast), with Logistic Regression (which performs a regression over the features using learned weights), or with a decision tree (which tries to estimate the contribution of each feature separately and then combines those estimates into a decision).
The choice of algorithm will affect the accuracy of the result, the running speed, the sensitivity to the quantity and quality of the data, how easy the algorithm's results are to interpret, the fit to the specific problem and the type of data we have, and more.
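To make the comparison concrete, here is a small, self-contained sketch of scoring several candidate models on the same data. The "models" are deliberately tiny stand-ins (a majority-class baseline and a one-feature threshold stump), not real Naive Bayes or Logistic Regression implementations, and the data is invented:

```python
def majority_baseline(train_y):
    # Always predict the most frequent training label.
    label = max(set(train_y), key=train_y.count)
    return lambda x: label

def stump(feature, threshold):
    # Predict 1 when a single feature exceeds a threshold.
    return lambda x: 1 if x[feature] > threshold else 0

def accuracy(model, X, y):
    return sum(model(x) == yi for x, yi in zip(X, y)) / len(y)

# Invented toy data: the label is 1 exactly when feature 0 > 0.5.
X = [[0.1, 5], [0.9, 3], [0.7, 8], [0.2, 1], [0.8, 2], [0.6, 9]]
y = [0, 1, 1, 0, 1, 1]

candidates = {
    "baseline": majority_baseline(y),
    "stump(f0 > 0.5)": stump(feature=0, threshold=0.5),
}
scores = {name: accuracy(m, X, y) for name, m in candidates.items()}
print(scores)  # the stump fits this data perfectly, the baseline does not
```

The same harness shape (a dictionary of candidate models, one shared scoring function) is how real algorithm comparisons are usually structured, just with library implementations plugged in.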
5. Training the Model
This is the stage where the data is fed into the algorithm, and usually an automated process is built to do so.
We will want to try training the system with different sets of features, to check which features work best for solving the problem. This is a kind of fine-tuning of the algorithm, for example:
Should we feed it the mean, or the median?
Perhaps the average after trimming a few outliers?
How can we reduce the number of features (i.e., the types of data)?
Usually, fewer features will lead to fewer errors. Our data will also often be updated continuously, and once we settle on a model that works we will want to retrain the system every week, or even every day.
This stage requires statistical understanding, a bit of programming, and operational capabilities.
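A minimal sketch of this kind of feature-set fine-tuning, assuming invented data and a dependency-free 1-nearest-neighbour regressor: it tries every feature subset and keeps the one with the lowest test error, showing how a noisy feature can make the model worse.

```python
from itertools import combinations

def predict_1nn(train_X, train_y, x, feats):
    # Return the target of the training point nearest to x,
    # measuring distance only over the chosen features.
    dist = lambda a: sum((a[f] - x[f]) ** 2 for f in feats)
    i = min(range(len(train_X)), key=lambda j: dist(train_X[j]))
    return train_y[i]

def mse(train_X, train_y, test_X, test_y, feats):
    return sum((predict_1nn(train_X, train_y, x, feats) - yi) ** 2
               for x, yi in zip(test_X, test_y)) / len(test_y)

# Invented data: feature 0 drives the target, feature 1 is noise.
train_X = [[1, 9], [2, 1], [3, 7], [4, 2]]
train_y = [10, 20, 30, 40]
test_X = [[1.1, 2], [3.9, 8]]
test_y = [10, 40]

results = {}
for k in (1, 2):
    for feats in combinations((0, 1), k):
        results[feats] = mse(train_X, train_y, test_X, test_y, feats)
best = min(results, key=results.get)
print(best, results[best])  # (0,) 0.0 -- the noisy feature only hurts
```

Exhaustive subset search like this only scales to a handful of features; with many features you would use greedy selection or regularization instead, but the principle of comparing feature sets by held-out error is the same.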
6. Evaluating the Model
At this stage we run the hold-out set (or whatever other validation technique we chose) in order to assess how successful the model is. This is a mainly scientific / statistical stage, in which we try to estimate how good the model really is.
At this stage there will usually be new reservations about the data quality, the type of data required, the fit of the algorithm, and so on. From here we go back for several more iterations of improvements, until we reach the desired level. How many iterations? That depends on how good the model needs to be, how much time you have, and how much you estimate additional time will improve the model.
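A minimal sketch of hold-out evaluation, with an invented one-feature threshold "model" fitted on the training split only and then scored on the held-out split:

```python
def train_threshold(X, y):
    # Pick the threshold on a single feature that maximizes
    # accuracy on the *training* data (a deliberately simple model).
    best_t, best_acc = None, -1.0
    for t in sorted(set(X)):
        acc = sum((x > t) == bool(yi) for x, yi in zip(X, y)) / len(y)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

# Invented, cleanly separable data: the label is 1 when x > 0.5.
X = [i / 20 for i in range(20)]
y = [1 if x > 0.5 else 0 for x in X]
train_X, train_y = X[0::2], y[0::2]   # even indices: training split
test_X, test_y = X[1::2], y[1::2]     # odd indices: hold-out split

t = train_threshold(train_X, train_y)
holdout_acc = sum((x > t) == bool(yi)
                  for x, yi in zip(test_X, test_y)) / len(test_y)
print(t, holdout_acc)  # 0.5 1.0
```

The key point is that the hold-out split never influences the fitted threshold, so the reported accuracy is an honest estimate; on messier real data this score is what would send you back for another iteration.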
Cost is one of the largest factors in the slow growth and acceptance of Big Data analytics in the pharmaceutical industry