The Role of Data Lakes in Healthcare

A hybrid approach to data management is emerging in healthcare as organizations recognize the value of an enterprise data warehouse in combination with a data lake.

In this SlideShare, we discuss data lakes in healthcare and we:

Provide an overview of a Hadoop-based data lake architecture and integration platform, and its application in machine learning, predictive modeling, and data discovery

Discuss several key use cases driving the adoption of data lakes for both providers and health plans

Discuss available data storage forms and the required tools for a data lake environment

Detail best practices for conducting data lake assessments and review key implementation considerations for healthcare



  1. The Role of Data Lakes in Healthcare
  2. About Perficient: Perficient is the leading digital transformation consulting firm serving Global 2000 and enterprise customers throughout North America. With unparalleled information technology, management consulting, and creative capabilities, Perficient and its Perficient Digital agency deliver vision, execution, and value with outstanding digital experience, business optimization, and industry solutions.
  3. Perficient Profile: Founded in 1997. Public, NASDAQ: PRFT. 2016 revenue $487 million. Major market locations: Allentown, Atlanta, Ann Arbor, Boston, Charlotte, Chattanooga, Chicago, Cincinnati, Columbus, Dallas, Denver, Detroit, Fairfax, Houston, Indianapolis, Lafayette, Milwaukee, Minneapolis, New York City, Northern California, Oxford (UK), Southern California, St. Louis, Toronto. Global delivery centers in China and India. Nearly 3,000 colleagues. Dedicated solution practices. ~95% repeat business rate. Alliance partnerships with major technology vendors. Multiple vendor/industry technology and growth awards.
  4. (image-only slide)
  5. Speaker Introductions: Juliet Silver, Director, Enterprise Strategy, Healthcare. Juliet provides strategic thought leadership and leverages her more than 20 years of healthcare industry, management consulting, and technology experience to support healthcare clients in the realization of their strategic vision. Jill Corcoran, Senior Technical Architect, Healthcare. Jill has more than 20 years of consulting experience focused on helping clients solve complex business challenges by providing enterprise, data, and business intelligence architectural solutions that transform the way they think about, organize, and leverage their data.
  6. Healthcare Data Lake Concepts
  7. Data Lakes in Healthcare. What: A data lake, as originally coined, is designed to hold raw data assets of varied types as they are received from their sources. Typically the lake is stored in a Hadoop ecosystem with minimal (if any) change to the original format and no content integration or enhancement of the source data. Why: Healthcare organizations are attracted to the concept of a data lake because it allows for in-depth analysis of patient outcomes; fraud, waste, and abuse; R&D for drugs and DME; and clinical trials. How: A data lake offers schema-on-read access to large amounts of widely varied information that can be loaded and accessed rapidly. This allows skilled data scientists to uncover hidden correlations, obscure patterns, disease trends, and more.
  8. The Need for a Data Lake in Healthcare. “Do we need an enterprise data warehouse, a data lake, or both as part of our overall data architecture?” • A data lake provides the ability to manage the fluid data requirements of contemporary healthcare organizations as they attempt to rapidly analyze large volumes of data, in batch or real time, from an extensive range of sources in a variety of formats. • An enterprise data warehouse provides the strategy-driven, non-volatile, transformed data used to run day-to-day operations and make informed business decisions based on known processes and thoroughly vetted data, leveraging more traditional reporting, visualization, and analytics.
  9. Data Lake Traits • Time to value in data delivery is accelerated • Uses various tools which apply “schema-on-read” • Introduces and reuses tools and processes that improve search and general knowledge of the data content • Designed for low-cost storage of large data volumes • Is highly agile and reconfigurable
  10. Healthcare Data Lake Use Cases • Genomic analytics used by health plans • Improved clinical trials • Predictive healthcare costs • Member/patient 360° view • Billing opportunities in unstructured text • Psychographic prescriptive modeling
  11. Use Case: Genomic Analytics Used by Insurers. The Genetic Information Nondiscrimination Act of 2008 (GINA) protects Americans from discrimination based on their genetic information in both health insurance and employment. Meanwhile, researchers now have access to the largest-ever collection of human protein-coding genetic variants (over 10 million variants), from the Exome Aggregation Consortium (ExAC). The challenge for healthcare is not how to use genomic data but how to deal with its massive volume.
  12. Use Case: Improved Clinical Trials. The analysis and design of clinical trials can discover drug combinations with significant improvements in overall survival and toxicity. Using these statistical models, we can develop optimization models that select treatment regimens to be tested in clinical trials, based on the totality of data available on existing treatments. Existing models can be expanded upon by using published research as an external source of data during clinical trials.
  13. Use Case: Predictive Healthcare Costs. The data you thought would be useful … was not: of 113 candidate predictors from structured and unstructured data sources, structured data proved less reliable than unstructured data, which increased the reliance on unstructured content. Unexpected indicators emerged from the unstructured content, increasing the value of the predictive model (18 accurate indicators or predictors). Predictor analysis (% of encounters in which the predictor was found):
      Ejection Fraction (LVEF): structured 2%, unstructured 74%
      Smoking Indicator: structured 35% (65% accurate), unstructured 81% (95% accurate)
      Living Arrangements: structured <1%, unstructured 73% (100% accurate)
      Drug and Alcohol Abuse: structured 16%, unstructured 81%
      Assisted Living: structured 0%, unstructured 13%
  14. Use Case: Member/Patient 360° View • Improve decision making • Enhance the patient experience • Provide a greater opportunity for improved outcomes • Improve profitability for both the provider and the health plan • Reduce unnecessary and inefficient processes and procedures. When applied across a large population of patients you can: • Predict disease outbreaks • Identify preventative care • Develop cures for diseases that touch specific demographics or patient population segments
  15. Use Case: Billing Opportunities in Unstructured Text • The analysis of unstructured data can provide significant opportunities for more complete and fairer billing practices. This information is held by providers and payers but rarely reviewed, as the amount of detail is overwhelming. Using keyword searches across vast amounts of data quickly produces meaningful insight. • Transcripts of physicians' notes show pre- and post-procedure exams, tests, labs, and related minor procedures performed but unbilled. • A large U.S. health plan compensated on a per-patient basis discovered co-morbidities, allowing it to apply risk adjustments to segments of its patient population.
  16. Use Case: Psychographic Prescriptive Modeling. Adding psychographic data from patient health records (PHRs) can provide considerable insight into additional disease risk factors. One example is the Framingham Heart Study; with more than 1,000 published medical papers related to it, it is one of the most widely known evidence-based studies. One of its key discoveries was that heart disease is affected not only by measurable factors (such as blood pressure and cholesterol) but also by demographic factors (age, gender, and race) and psychographic factors (values, attitudes, and lifestyles). [Slide chart: Basic Framingham Analysis, Predictor Importance]
  17. Designing and Developing the Data Lake
  18. Stocking the Data Lake
  19. Provider Data Lake (diagram). Provider sources: Patient Records, Physician Notes, Digital Images, Medical Device, Financials, Health Info Sys. External sources: Health Plan, Gov't Agencies, Accountable Care Orgs, Geo-Political, Wearables, Research. System sources: Security Log Data, Metadata. Web sources: Social Media, Email & Chat, Web Content.
  20. Payer Data Lake (diagram). Provider sources: Provider Network, Financials, Health Plan, Claim, Encounter, Member, Marketing, Rx Claim. External sources: Gov't Agencies, Accountable Care Orgs, Geo-Political, Wearables, Genomic, PHR/PGHD, Research, Survey, Standard Codes. System sources: Security Log Data, Metadata. Web sources: Social Media, Email & Chat, Web Content.
  21. Big Data Landscape Components
  22. The Enterprise Data Landscape
  23. Introducing Hadoop to the Enterprise Data Landscape
  24. Best Practices. Assessment: • Genuine need based on the 4 Vs • Understanding the 'big' picture • Mature metadata procedures in place • Active governance with majority participation. Planning: • Executive suite backing and participation • Fully vetted use cases • Staffing and training plan (Infrastructure Architect, Big Data Architect, Data Scientist). Implementation: • Start slow in digestible portions (usable POC) • Employ technical project management • Maintain strong scope management • Small set of very skilled users for initial deployment • Bring all data that will answer the questions
  25. Summary • Data lakes deliver the power to share data and rapidly explore, discover, and predict patterns of risk, cost, and improved outcomes and engagement • Provide the foundation for research and ad hoc data science across a variety of large-volume data sets • Are integral to evidence-based care and clinical genetics programs, with a need for genomics data, pedigree data, personal health information, geo data sets, and psychographic data • Require advanced data management and data science skill sets • Should be governed through an information and data governance structure that sets use case and data priorities and oversees data risk, security, and compliance
  26. Questions: type your question into the chat box
  27. Next up: [Webinar] Harness the Power of Cloud to Drive Business Innovation – Tuesday, April 25th. [Webinar] Modernize Core Technology to Accelerate Digital Transformation – Tuesday, May 23rd. Follow us online: Perficient.com/SocialMedia • Facebook.com/Perficient • Twitter.com/Perficient_HC • Blogs.perficient.com/healthcare
  28. Thank You

Editor's notes

  • Perficient is the leading digital transformation consulting firm serving Global 2000® and enterprise customers throughout North America. With unparalleled information technology, management consulting and creative capabilities, Perficient and its Perficient Digital agency deliver vision, execution, and value with outstanding digital experience, business optimization and industry solutions.
    We have a broad network of locations across the US, as well as offshore facilities in India and China.
    We deliver digital experience, business optimization and industry solutions that cultivate and captivate customers, drive efficiency and productivity, integrate business processes, improve productivity, reduce costs, and create a more agile enterprise.
  • Founded in 1997, we’re a public company with more than 2,800 employees.
    We’ve formed strategic partnerships with each of the major technology vendors and also have dedicated solution and industry practices.
  • Perficient’s national healthcare practice is recognized as one of the largest healthcare consulting firms in the U.S. We provide strategic technology consulting insights that help our healthcare clients transform with today’s digital consumer experience demands. This strategic guidance is then transformed into pragmatic technology solutions that improve clinical, financial and operational efficiency while dealing with the complexities of regulatory reform and the enablement of innovation. 


  • Using the right information, from the right place in your information eco-system, at the right time is key to successfully implementing Big Data into the Healthcare Data Landscape.

    DATA LAKE vs. DATA WAREHOUSE

    Data:             structured, unstructured, semi-structured, raw  |  structured, processed
    Process time:     schema-on-read                                  |  schema-on-write
    Storage:          low cost for high volume                        |  high cost for high volume
    Queries:          high cost for high volume                       |  low cost for high volume
    Data structures:  highly configurable, agile                      |  fixed structure, requires remodeling
    Security (authentication/authorization):  maturing                |  mature
    User base:        data-savvy users, scientists                    |  all business users

  • A successful data lake has its data organized in such a manner to promote better and more efficient access, and will introduce and reuse tools and processes that improve search and general knowledge of the data content.

    The data lake lacks a formal schema-on-write. Access to information contained within the data lake uses various tools which apply “schema-on-read.”

    Generally, users of the data lake are experienced analysts, accustomed to data wrangling techniques which apply schema upon read or interpret data content from unstructured formats. Less-experienced users will struggle without significant search tools and data extraction automation.

    That doesn’t mean the data lake lacks metadata, nor rules governing its usage, security, or management. It’s quite the opposite.
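    The schema-on-read idea above can be sketched in a few lines of Python. This is a minimal illustration, assuming a hypothetical JSON-lines feed: records land in the lake exactly as received, and each consumer applies its own field projection at read time.

    ```python
    import json
    import io

    # Hypothetical raw feed: records are stored as received, with no
    # up-front schema enforcement (no schema-on-write step).
    raw_landing = io.StringIO(
        '{"mrn": "A100", "bp": "140/90", "note": "pt reports smoking"}\n'
        '{"mrn": "A101", "sbp": 122, "dbp": 78}\n'
    )

    def read_with_schema(stream, projection):
        """Apply a caller-supplied 'schema' (field projection with defaults)
        at read time; each consumer can interpret the same raw data differently."""
        for line in stream:
            record = json.loads(line)
            yield {field: record.get(field, default) for field, default in projection}

    # One analyst's view of the raw records; another team could define a different one.
    schema = [("mrn", None), ("bp", "unknown")]
    rows = list(read_with_schema(raw_landing, schema))
    ```

    Nothing about the stored records changes; only the reader's projection does, which is what lets less-structured sources coexist in one lake.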
  • How can it be used to improve medical outcomes for patients, drive patient and member engagement, bend the cost curve, and foster innovation?

    Genomic Data for Research and Precision Medicine
    Real World Evidence
    Outcomes Research
    Cost Studies
    Socio-Economic Evaluations
    Machine / Device Data
    Unstructured Data in EMR Notes
  • Results Highlights.

    Research from “An Analytics Approach to Designing Clinical Trials for Cancer,” Dimitris Bertsimas, Sloan School and Operations Research Center, Massachusetts Institute of Technology.

    Identifying the best chemotherapy regimen currently available for advanced gastric cancer is a task that has proven challenging for traditional meta-analysis, but it is one that our methods are well suited to address.

    Through the use of regression models, which leverage a large database of clinical trial outcomes, we are able to control for differences in demographics and other factors across different clinical trials, enabling direct comparison of results that were not from the same randomized experiment. To determine the best chemotherapy treatments to date, we first note that selecting a chemotherapy treatment for cancer involves a tradeoff between survival time and toxic effects that affect quality of life.

    Since individual patients will differ in how they value these competing objectives, the notion of trying to find a single “best” regimen is not correct. Instead, we seek the set of treatments that make up the “efficient frontier” of chemotherapy treatments for a given cancer: a particular treatment is included in the efficient frontier only if there are no other available treatments with both higher survival and lower toxicity.

  • Creating a predictive model using data, create a statistical regression model that estimates the outcome (HCU or not) using factors (covariates) that may be influencing the outcome, such as:

    • Demographic variables (e.g., age, sex, RIO score)
    • Clinical variables (e.g., ICD-10 based chapters, with additional splits for diabetes, CHF, COPD)
    • SES variables (e.g., deprivation index (material and social deprivation))
    • Utilization variables for all care types from current year and previous two years, to account for disease progression (e.g., Number of visits, length of stay)
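    The regression setup above can be sketched in plain Python. This is an illustrative toy, not the study's model: the covariates (age decile, prior-year visits, comorbidity flag) and outcome labels are fabricated, and a real analysis would use many more variables and a proper statistics library.

    ```python
    import math

    # Fabricated encounter-level covariates and a binary high-cost-utilizer
    # (HCU) outcome, for illustration only.
    X = [[6, 9, 1], [7, 11, 1], [5, 8, 1], [3, 1, 0], [2, 2, 0], [4, 3, 0]]
    y = [1, 1, 1, 0, 0, 0]

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def fit_logistic(X, y, lr=0.05, epochs=2000):
        """Minimal per-sample gradient descent for logistic regression."""
        w = [0.0] * len(X[0])
        b = 0.0
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
                err = p - yi
                w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
                b -= lr * err
        return w, b

    w, b = fit_logistic(X, y)

    def predict(xi):
        """Estimated probability that an encounter belongs to an HCU."""
        return sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
    ```

    The fitted coefficients play the role of the covariate effects described above: each one estimates how much a variable shifts the odds of the HCU outcome.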
  • A member 360-degree view is essentially a patient index that offers access to enterprise-wide health plan member or patient data, including demographics, socio-economic characteristics, healthcare encounters and claims, and any number of other defining attributes.
  • A considerable amount of information about a patient is contained in a practitioner’s notes, however, after the initial review for billing, this information goes largely untouched
  • Earlier in the Healthcare Costs use case we saw how predictive analytics use the data lake to show what COULD happen
    Then in the Member 360 use case showed how descriptive analytics use the data lake to show what HAS happened
    This use case Psychographic Prescriptive Analytics show what SHOULD happen based on the data found in the lake

    The indicators from the Framingham Study gave us the prescriptive information to tell us what we should do to avoid heart disease including eating right, exercising and using aspirin to reduce arterial blockages.

    Generational segments such as Baby Boomers, Gen Xers, and Millennials combine demographic variables (classifying individuals based on birth years) with psychographic variables (such as beliefs, attitudes, values, and behaviors).

    PHRs or PGHD – health-related data created, recorded, or gathered by or from patients (or family members or other caregivers) to help address a health concern. Patients, not providers, are primarily responsible for capturing or recording these data, and patients decide how to share or distribute these data to health care providers and others.

    PHRs can contain a diverse range of data, including but not limited to:
    allergies and adverse drug reactions
    chronic diseases
    family history
    illnesses and hospitalizations
    imaging reports (e.g. X-ray)
    laboratory test results
    medications and dosing
    prescription record
    surgeries and other procedures
    vaccinations
    and observations of daily living (ODLs)

    Psychographics - is the study of personality, values, opinions, attitudes, interests, and lifestyles

  • Volume
    Deep data – processing contact lenses, there are only a few dozen measurements, on every lens but they occur at every step of the process for every single lens created
    Wide and Deep - Prescriptions have hundreds of attributes and millions of prescriptions are filled everyday
    Medical/Dental Claims – hundreds of attributes and hundreds of claims
    Medical Device Logs
    Velocity
    Pace at which data flows in from sources like business processes, machines, networks and human interaction, also data that needs to be processed and provided in a more controlled environment
    Variety
    Type of data can be structured and unstructured, text, numeric, image, audio, etc.
    Sources include emails, photos, videos, monitoring devices, PDFs, audio, etc,
    Veracity
    conformity to truth or accuracy (outliers in device readings, abnormalities in statistical data such as vitals, biases in surveys)
    valid for the intended use
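    A toy veracity check along these lines, applied to device readings; the values and plausibility thresholds below are illustrative stand-ins, not clinical guidance.

    ```python
    # Hypothetical systolic blood pressure readings from a device feed; the
    # last two values are transcription/device errors a veracity check should flag.
    readings = [118, 124, 131, 109, 127, 122, 910, 12]

    def flag_implausible(values, low=60, high=260):
        """Return readings outside a plausible range.
        Thresholds are illustrative; a real pipeline would use clinically
        defined plausibility rules per measurement type."""
        return [v for v in values if not (low <= v <= high)]

    suspect = flag_implausible(readings)
    ```

    Flagged records would typically be quarantined or annotated in the lake rather than silently dropped, so downstream users can judge fitness for their intended use.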


    Payer, provider, and life sciences use cases are driven by the variety and volume of the data. For example, real-world data (RWD) comes in huge volumes from multiple sources, and genomics requires massive amounts of data, driving storage needs. One major issue for healthcare services will be information integration on a huge scale across a variety of sources. Typically, most EDWs can handle a dozen or so schemas, but not the variety of data found in RWD.

    Patient care and treatment analysis depends on department and usage type: static data found in the EMR or EHR is reviewed along with back-office (ERP) data. Unstructured data is often overlooked, but when analyzed it provides insight into better treatment, or indicators that were missed in the EMR records.

    Prescriptive analytics drives better use of information (e.g., a dashboard that indicates a patient is suffering post-treatment and is at risk for readmission). Often the data is unstructured (notes from care providers that would require NLP analysis).
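    A crude illustration of keyword screening over free-text notes, in the spirit of the smoking-indicator and billing examples above. The note snippets and patterns are fabricated, and a production pipeline would use real NLP with proper negation scoping rather than whole-note heuristics.

    ```python
    import re

    # Fabricated physician-note snippets; in practice these would come from
    # transcripts or EMR free-text fields stored in the lake.
    notes = {
        "enc-001": "Pt is a current smoker, 1 ppd. Pre-procedure EKG performed.",
        "enc-002": "Denies tobacco use. Lives alone, independent in ADLs.",
        "enc-003": "Former smoker, quit 2010. Post-procedure labs drawn.",
    }

    SMOKING = re.compile(r"\b(smoker|tobacco|ppd)\b", re.IGNORECASE)
    NEGATION = re.compile(r"\b(denies|no|quit)\b", re.IGNORECASE)

    def smoking_flag(text):
        """Keyword hit with a crude whole-note negation screen."""
        if not SMOKING.search(text):
            return False
        return not NEGATION.search(text)

    flags = {enc: smoking_flag(t) for enc, t in notes.items()}
    ```

    Even this naive screen shows why unstructured content surfaces indicators the structured fields miss: the flag exists nowhere as a coded field, only in the narrative.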



  • Capturable Data:
    Medical Devices: Surgical device data, DME, Vital Statistics, EKG, medical device efficiency data,

    Survey Data:
    CASPER – Continuous Activity Scheduling Planning, Execution, & Reporting from CMS
    MEPS – Medical Expenditure Panel Survey is the only national data source measuring how Americans use and pay for medical care, health insurance, and out-of-pocket spending
    HCAHPS – Hospital Consumer Assessment of Healthcare Providers and Systems is a patient satisfaction survey required by CMS (the Centers for Medicare and Medicaid Services) for all hospitals in the United States


    Publicly Available Data (ACO/HIE: Accountable Care Organizations / Health Information Exchange):
    CDC – Centers for Disease Control and Prevention health statistics
    AHRQ – Agency for Healthcare Research and Quality; searchable databases on topics such as the use of health care, the costs of care, trends in hospital care, health insurance coverage, out-of-pocket spending, and patient satisfaction
    NHSN – National Healthcare Safety Network; provides patient safety analysis resources and reportable events for infection or safety
    HCUP – the nation's most comprehensive source of hospital data, including information on in-patient care, ambulatory care, and emergency department visits; enables researchers, insurers, policymakers, and others to study health care delivery and patient outcomes over time at the national, regional, state, and community levels
    USHIK – metadata registry of healthcare-related data standards funded and directed by AHRQ with management support in partnership with CMS
    PGHD – patient-generated health data: health-related data created, recorded, or gathered by or from patients (or family members or other caregivers) to help address a health concern (also PHR). Patients, not providers, are primarily responsible for capturing or recording these data, and patients decide how to share or distribute them to health care providers and others.





    Electronic medical records (EMRs) are a digital version of the paper charts in the clinician’s office. An EMR contains the medical and treatment history of the patients in one practice. EMRs have advantages over paper records. For example, EMRs allow clinicians to: 1) Track data over time, 2) Easily identify which patients are due for preventive screenings or checkups, 3) Check how their patients are doing on certain parameters—such as blood pressure readings or vaccinations, 4) Monitor and improve overall quality of care within the practice

    Electronic health records (EHRs) do all those things—and more. EHRs focus on the total health of the patient—going beyond standard clinical data collected in the provider’s office and inclusive of a broader view on a patient’s care. EHRs are designed to reach out beyond the health organization that originally collects and compiles the information. They are built to share information with other health care providers, such as laboratories and specialists, so they contain information from all the clinicians involved in the patient’s care. The National Alliance for Health Information Technology stated that EHR data “can be created, managed, and consulted by authorized clinicians and staff across more than one healthcare organization.”
    The information moves with the patient—to the specialist, the hospital, the nursing home, the next state or even across the country. In comparing the differences between record types, HIMSS Analytics stated that, “The EHR represents the ability to easily share medical information among stakeholders and to have a patient’s information follow him or her through the various modalities of care engaged by that individual.” EHRs are designed to be accessed by all people involved in the patients care—including the patients themselves. 








  • This is a generic architecture for big data; a data lake typically resides on Hadoop.

    Underlying security, authorization, and authentication tools include Ranger, Knox, and HDFS encryption.

    Governance fabric: tools designed to exchange metadata with other tools and processes, both within the Hadoop stack and outside of it, work in concert with the data lifecycle management software (Atlas).

    Data management stack: the Hadoop Distributed File System (HDFS) is the core. It is a distributed, scalable, Java-based file system adept at storing large volumes of structured, unstructured, and semi-structured data; this is where the data lake resides.

    YARN (Yet Another Resource Negotiator) is a cluster management technology that serves as the large-scale, distributed operating system for big data applications.

    Data exchange: tools for rapidly exchanging data with HDFS, such as Sqoop, Flume, Kafka, and NiFi.

    Provisioning, monitoring, and scheduling: includes tools such as ZooKeeper and Oozie.

    Data Access: the list is long and varied; below are just examples
    Batch – MapReduce; developers can write applications in their language of choice, such as Java, C++, or Python
    Script – Pig, running on the Tez framework; developers write in Pig Latin
    Stream – real-time business intelligence engines such as Storm
    SQL – Hive, HAWQ, Impala, and Shark (which is basically Hive on Spark)
    NoSQL – HBase, Cassandra, and Accumulo
    In-Memory – Spark
    Others – a catch-all that includes Independent Software Vendor (ISV) tools
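    To make the batch model concrete, the classic MapReduce word count can be sketched in-process with plain Python. Real MapReduce distributes the map and reduce phases across the cluster and shuffles intermediate pairs between them; this shows only the shape of the computation:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1

def reduce_phase(pairs):
    """Reduce: sum the counts per word (the shuffle step is implicit here)."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["data lake data", "lake on hadoop"]
print(reduce_phase(map_phase(docs)))
# -> {'data': 2, 'lake': 2, 'on': 1, 'hadoop': 1}
```

    The same map/reduce decomposition is what lets the framework parallelize the job: any subset of documents can be mapped independently, and counts for any one word can be reduced on one node.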

    User Tools for Data Science – Zeppelin and Ambari
  • Structural Metadata – entities, attributes, sizes, characteristics, relationships, definitions, common usage, aggregation rules, and performance metrics
    Descriptive Metadata – purpose-driven metadata, e.g., title, abstract, author, keywords
    Administrative Metadata – management data, e.g., security roles, archival/preservation rules, IP rights
    Audit Metadata – operational statistics, creation statistics, and full cradle-to-grave lineage

    Data Curation – activities related to organizing and integrating data collected from various sources, annotating the data, and publishing and presenting the data such that its value is maintained over time and the data remains available for reuse and preservation.
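    The four metadata categories above can be sketched as a simple record that a curation process might maintain for each dataset in the lake. The field names and values here are illustrative assumptions, not a standard schema:

```python
from dataclasses import dataclass, field

@dataclass
class DatasetMetadata:
    # Structural: entities, attributes, sizes, relationships
    entity: str
    attributes: list
    row_count: int
    # Descriptive: purpose-driven fields
    title: str
    keywords: list
    # Administrative: management data
    security_role: str
    retention_years: int
    # Audit: operational statistics and cradle-to-grave lineage
    lineage: list = field(default_factory=list)

md = DatasetMetadata(
    entity="lab_results", attributes=["patient_id", "test", "value"],
    row_count=1_000_000, title="Lab results feed",
    keywords=["labs", "clinical"], security_role="phi_restricted",
    retention_years=7, lineage=["ehr_feed -> raw -> curated"],
)
print(md.entity, md.lineage[0])
```

    Keeping all four categories on one record is what lets tools like Atlas answer both "what is this data?" and "who touched it, and when?" from the same catalog entry.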

    Relational data warehouses are often a good fit for well-understood, frequently accessed queries and reports on high-value data.

    Data Governance and Management – highly important and should be led or co-led by the business, as they are the data owners

    USERS – highly skilled data professionals and data scientists will have considerably more access than business users
    Business Users – the biggest pool of users; includes daily operations staff, BI analysts, call centers, medical and pharmaceutical professionals, and the executive suite
  • TOP – the data lake is used as a data staging and transformation platform, but long-term persistence and analytics are performed solely in the EDW. While this will lower the cost of data capture and provide scalable data refinement, it removes the vast capabilities of the data lake from the business users.
    MIDDLE – the data lake and big data ecosystem are used for long-term, big data persistence and analytics without employing an EDW. While this is a viable solution given all of the tools that have matured in recent years, it has proven to be very expensive; for details, see Winter Corporation’s study on the cost of big data as an EDW.
    BOTTOM – the recommended scenario allows data to be harmonized and analyzed in the data lake, or moved out to an EDW when queries are run more frequently, when more stringent data quality rules are enforced and data enhancements rigorously applied, or when users simply require slice-in-time reporting. This scenario is most likely to support any future data needs, no matter the variety, volume, or velocity of the data.
  • Look at the questions that will be asked of the data and the answers expected from them; ensure that the data needed meets the criteria of the 4 Vs (Volume, Velocity, Variety, Veracity)

    Once an architecture has been designed, reviewed, and approved, ensure that all teams or divisions involved have a good grasp of the big data architecture and tools needed to design, develop, implement, and support the data lake (you should also have a change control process in place for the inevitable requests to add components to the architecture).

    Have mature metadata and governance processes and procedures in place, with a good portion of your organization’s data and data owners adhering to them

    While staffing and training are very important on the whole, there are three positions listed here that will bear heavy burdens and should be hand-picked:
    First, use the infrastructure architects who have been with your organization for a fair bit of time; their knowledge and expertise are critical
    For the Big Data Architect and your data scientists, choose candidates with solid healthcare experience and a decent amount of big data experience over candidates with lots of big data background and little or no healthcare experience

    A big data project is not the time to use a PM who has never worked on a large-scale data project or who doesn’t know an entity from an attribute (this PM will also be responsible for keeping scope creep at bay)!

    Your first few use cases should involve highly skilled, data-savvy users; it will pay off when they extol the virtues of your data lake.
  • James Dixon, CTO and founder of Pentaho, described it best when he used the following analogy to explain the concept of the data lake: “Think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state from which that water was obtained. The contents of the data lake stream in from sources to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
