
Security, ETL, BI & Analytics, and Software Integration


Liberty Mutual Enterprise Data Lake Use Case Study

By building a data lake, Liberty Mutual Insurance Group Enterprise Analytics department has created a platform to implement various big data analytic projects. We will share our journey and how we leveraged Hortonworks Hadoop distribution and other open source technologies to meet our project needs. This session will cover data lake architecture, security, and use cases.


Security, ETL, BI & Analytics, and Software Integration

  1. © Hortonworks Inc. 2011 – 2017. All Rights Reserved. Data & Analytics in Insurance: Can you have one without the other?
  2. P&C insurance trends in big data/analytics. Use of predictive models in P&C: new applications, new methods. Source: Willis Towers Watson 2016 Predictive Modeling Benchmark Survey (U.S.), fielded from September 7 to October 24, 2016; respondents comprise 14% of U.S. personal lines carriers and 20% of commercial lines carriers.
  3. P&C insurance trends in big data/analytics. Uses of big data: "Big data, notably from vehicle telematics and the IoT, are opening up many new potential avenues for investigation and improvement. These opportunities apply as much to carriers that have invested recently in improved policy administration and quote systems as it does to others. Whatever the available level of hardware and software within a business, a lack of accompanying investment in data and analytics is rather like driving a sports car without fully revving up the engine." (Same survey source as above.)
  4. The Liberty Mutual Insurance Data Lake: One small Hadoop footprint … one giant leap to understanding. #TechAtLiberty
  5. Liberty Mutual Insurance. Our North Star, what we strive for: Empower Liberty Mutual to leverage the vast data and amazing talent that we have. Make analytics as easy as it can be. Allow data to be free and secure. Foster a culture of quick iterative experiments, failing and learning as fast as possible. Remove the separation between IT and business.
  6. 75th
  7. Agenda: • How do we think about analytics? • How do we work as a team? • Who/what is a data scientist? • How does a data lake help us?
  8. How we think about analytics and machine learning (ML). GET: obtaining the data from source systems and devices. LAND: storing the data in a format and location so that it can be studied. STUDY: studying the data to gain insight and business value. ML is an extension of STUDY; ML programs need to access data that's in LAND.
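The GET/LAND/STUDY phases can be sketched as a minimal pipeline. This is an illustration only, with hypothetical function names, in-memory source data, and a local JSON-lines file standing in for the lake's HDFS storage:

```python
import json

def get(records):
    """GET: obtain raw records from a source system (an in-memory stand-in here)."""
    return list(records)

def land(records, path):
    """LAND: persist records in a studyable format (JSON lines on local disk,
    standing in for HDFS in this sketch)."""
    with open(path, "w") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")

def study(path):
    """STUDY: read the landed data back and derive a simple insight."""
    with open(path) as f:
        rows = [json.loads(line) for line in f]
    return sum(r["premium"] for r in rows) / len(rows)

# Hypothetical policy records, for illustration only.
source = [{"policy": "A1", "premium": 100.0}, {"policy": "B2", "premium": 300.0}]
land(get(source), "landed.jsonl")
avg = study("landed.jsonl")  # average premium across landed records
```

The point of separating LAND from STUDY is the one the slide makes: once data is landed in an open format, any number of STUDY (or ML) programs can read it without going back to the source system.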
  9. Who/what is a data scientist? How do we work as a team?
  10. What makes up a data scientist? True data scientists are extremely rare because of the unique combination of skills required: business analyst, engineer/developer, and mathematician. We believe in investing in data science teams made up of energized engineers with various roles: software developers, data engineers, data analysts, and data scientists. You don't need a PhD to be a data scientist!
  11. We heard common frustrations: Analytics is hard! Tools are too hard to use and require many types of skills. Security and analytics have competing goals. IT/business collaboration needs to improve.
  12. [Architecture diagram] Today: a wall between information technology and the business. Source systems (MS SQL, Teradata, DB2, MySQL, Oracle, MongoDB, Postgres, Sybase) feed information management (IM), the enterprise data warehouse (EDW), and data marts; data scientists/analysts consume through Cognos, Tableau, SAS, OBIEE, MicroStrategy, SharePoint, and Power BI.
  13. [Architecture diagram] IM evolving into data analytics: the same source systems plus unstructured data, with data scientists/analysts on the business side iterating and learning with Python, R, SAS, H2O, R Shiny, Excel, and Power BI.
  14. Form one team with business and IT together: text analytics, streaming analytics, and predictive analytics, each staffed with data scientists, data engineers, software developers, and IT.
  15. How does a data lake help us?
  16. HORTONWORKS DATA PLATFORM (HDP®)
  17. Enterprise data lake security. Security stack: Centrify / AD / Kerberos / Ranger / HDFS encryption / SSL. The on-premises HDP data lake uses the AD server as the Kerberos KDC. Secured HDFS zones (/Legal, /HR, /Finance) are protected by user:group ownership, HDFS permissions and ACLs, and Ranger policies and plugins. Users (system admins, Power BI users, data scientists, ETL developers) reach the cluster through an HDP edge node (Ambari server, Spark Thrift Server, Zeppelin with Livy server) over Kerberos and SSL/ODBC; Sqoop ingests from on-premises RDBMSs (NAS/local HDD staging). Security options available: (1) Kerberos; (2) SSL enabled in the connection string; (3) Encryption=true on the database. Layers of defense: perimeter-level security with Apache Knox for the REST API; authentication via Kerberos; authorization via Ranger; OS-level security via HDFS permissions and encryption on HDFS.
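The zone-per-department layout on this slide boils down to a simple rule: each secured HDFS zone is owned by a user:group pair, and access is granted only to callers in that zone's group. A minimal sketch of that rule, assuming hypothetical AD group names (the zone paths come from the slide; the real policies live in Ranger and HDFS ACLs, not application code):

```python
# Map each secured zone to the AD group allowed to read it.
# Zone paths are from the slide; group names are hypothetical.
ZONE_GROUPS = {
    "/Legal":   "legal-analysts",
    "/HR":      "hr-analysts",
    "/Finance": "finance-analysts",
}

def can_read(path, user_groups):
    """Return True if the user's AD groups allow reading under the zone."""
    for zone, group in ZONE_GROUPS.items():
        if path == zone or path.startswith(zone + "/"):
            return group in user_groups
    return False  # paths outside a secured zone are denied by default

allowed = can_read("/Finance/claims.csv", {"finance-analysts"})  # permitted
denied = can_read("/HR/salaries.csv", {"finance-analysts"})      # refused
```

Default-deny for anything outside a declared zone mirrors the layered-defense stance described above: an unclassified path should fail closed rather than fall through to open access.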
  18. Security challenges and alternatives: Security implementation requires reconfiguring existing tools, and the security mechanisms must be used in combination. Testing is painful, and something always doesn't work. Not all BI tools' built-in drivers support Kerberos. Spark security: Kerberos for authentication; AD groups for HDFS ACLs; SparkSQL, Ranger, and LLAP via the Spark Thrift Server for authorization.
  19. Data lake BI & analytics example. Sources of cost information (text files and APIs, daily license counts from Office 365, CSV files in S3 read with AWS keys, and other on-premises data sources via Sqoop) are pulled into the on-premises HDP cluster and transformed with Hive. Power BI Desktop on users' desktops/laptops/VDEs pulls data from Hadoop over Kerberos/ODBC and publishes dashboards (with data embedded) to the Power BI service through the Power BI gateway, serving report developers, report consumers, and ETL developers.
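The transformation step in this pipeline is plain aggregation: daily cost records grouped by service before being surfaced in Power BI. A sketch of that step in Python standing in for the Hive transformation, with hypothetical column names and made-up data:

```python
import csv
import io
from collections import defaultdict

# Hypothetical daily cloud-cost extract; the real data lands as CSV in the lake.
raw = """date,service,cost
2017-06-01,storage,10.50
2017-06-01,compute,42.00
2017-06-02,storage,11.25
"""

# Aggregate total cost per service, the shape a Power BI report would consume.
totals = defaultdict(float)
for row in csv.DictReader(io.StringIO(raw)):
    totals[row["service"]] += float(row["cost"])
```

In the actual pipeline this grouping would be a Hive query over the landed files; the point is that the lake holds the raw daily records, so the aggregation can be rerun or reshaped without re-extracting from the source.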
  20. Integrate Elasticsearch and Spark in the data lake. Three HDP edge/Elasticsearch nodes each host ES repositories (the /experian index, shards 1–6) with the Elasticsearch-Hadoop plugin; end users query through the Elasticsearch REST API. Data volume: one data brick, a 100 GB CSV file. Fuzzy match on company name, street address, city, and state; results return a match score and all 500+ attributes. Commands from the slide:

      spark-submit --master yarn --num-executors 4 --executor-memory 1G --executor-cores 1 esspark-assembly-1.0.jar hdfs:///data/BRICK_2016_Q3_masked.csv

      curl -XPOST "http://localhost:9200/gs/_search" -d '{"query": {"match": {"CITY": {"query": "Yiqing", "fuzziness": "AUTO"}}}}'
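The curl example above is an Elasticsearch `match` query with `"fuzziness": "AUTO"`, which tolerates small edit-distance differences in the matched text. A sketch of building the same request body in Python (the index name `gs`, field `CITY`, and query text come from the slide; sending it with `requests` is an assumption, not shown on the slide):

```python
import json

def fuzzy_city_query(city, fuzziness="AUTO"):
    """Build the fuzzy-match request body from the slide's curl example.

    "AUTO" lets Elasticsearch pick an edit distance based on term length,
    which is what makes slightly misspelled city names still match.
    """
    return {"query": {"match": {"CITY": {"query": city, "fuzziness": fuzziness}}}}

body = fuzzy_city_query("Yiqing")
payload = json.dumps(body)
# To execute it you would POST the payload to the cluster, e.g.:
#   requests.post("http://localhost:9200/gs/_search", json=body)
```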
  21. Integrate Elasticsearch and Spark in the data lake (cont.)
  22. Data archiving example. SharePoint logs from three syslog servers flow through Apache Flume and Kafka into the enterprise data lake (5 data nodes, 120 TB total) over Kerberos, feeding SIEM, alerts, and real-time monitoring as well as analytics and trend reporting for IT developers and data analysts. Hot data storage holds one month of logs (about 1 TB uncompressed, 100 GB compressed); warm data storage holds one year (about 60 TB uncompressed, 6 TB compressed). The NiFi MergeContent processor holds data until the flow file reaches a suitable size to be loaded to HDFS.
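The hot/warm figures on this slide imply roughly a 10:1 compression ratio on these logs (1 TB raw to 100 GB, 60 TB raw to 6 TB). A quick sizing check, assuming that ratio holds (note the slide's warm-tier figure of 60 TB/year also implies log volume well above 1 TB/month on average):

```python
# Observed compression on the SharePoint logs from the slide: roughly 10:1.
COMPRESSION_RATIO = 10

def compressed_tb(raw_tb, ratio=COMPRESSION_RATIO):
    """Estimate on-disk size in TB after compression at the given ratio."""
    return raw_tb / ratio

hot = compressed_tb(1)    # one month hot: 1 TB raw -> 0.1 TB (100 GB)
warm = compressed_tb(60)  # one year warm: 60 TB raw -> 6 TB
```

The sizing explains why a modest 120 TB cluster can comfortably hold a year of warm data plus working space for analytics.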
  23. Sample dataflow
  24. Conclusion: Just get started! Don't be afraid to fail! Invite your "business" partners into the process. A small lake is still very beneficial!
  25. Thank you

Editor's notes

  • Two-thirds of P&C insurers surveyed currently use predictive models for underwriting and risk selection, an increase of over 10 percentage points compared to the 2015 survey. The reasons behind such an increase are clear.
    There is unanimous agreement from personal lines insurers about the fundamental importance of using more sophisticated predictive techniques to drive success in today’s market. Equally, many commercial lines carriers are recognizing that the traditional barrier of the relative paucity of homogenous risk data in commercial portfolios can be overcome, enabling models to contribute significantly in more unique underwriting environments. Eighty-six percent of small- to mid-market carriers rate more sophisticated risk selection as essential or very important to future success. Over half (56%) of large account or specialty lines carriers share that view.
  • https://dataworkssummit.com/san-jose-2017/sessions/from-big-data-to-data-discovery-one-small-footprint-one-giant-leap-to-understanding/

  • Most people don’t know that Liberty Mutual has over 4,000 technical employees who create our solutions. In order to keep up with the demands of our customers, we are changing the way our company works. We are moving to a faster paced, customer centric model. We want to offer innovative products and services in order to provide best in class experiences for our customers. We are basically operating like a startup backed by the strength of a Fortune 100 company.
  • Our group is involved in the entire lifecycle of analytics from Get to Study.

    We think about “Analytics” in 3 phases: Get/Land/Study

    All the way from obtaining the data originally to landing it somewhere and then studying it
  • The necessary tools often don't scale or aren't available.
    The majority of people don't have the training or understanding of how to use the tools.
    In some areas we are relying on third-party vendors to solve our problems rather than building expertise. (Is this really an issue outside of Hadoop?)
    But data scientists want better performance from R and Python, the freedom to use downloaded data science libraries, to use Spark, TensorFlow, H2O, etc., to pull data directly from Liberty databases, to deploy models without IT involvement, and to work with large datasets.
    We could certainly create opportunities for people to expand their skills with R and Python, and increase our knowledge and level of support on the IT side. But you need to pick a tool.

    Security and analytics seem to be opposing forces.
    There is a bureaucratic and/or autocratic view controlling data and its flow.
    Data scientists generally don’t need NPPI – they want to analyze inputs and predict outcomes
    No understanding of the risk or lack of risk associated with using business data for analytics
    Unclear how to traverse governance and approval processes
    No resources available to assist with data requests or scrub data to prepare for analytics
    Need a place to persist prepared data, refresh as new data becomes available, make scrubbed data available to multiple projects
  • This works great for operational reporting… but not data analytics

    Some of those frustrations came from environments like this. Way too many data sources… very complex… a wall between IT and Business.


    Why does it take that long?

    For one… the data is everywhere…

    There are operational reasons for these EDWs and data marts. I'm not saying they're "not useful"; however, as an analyst/data scientist, they alone don't meet the need.


    What did the original data look like?

    Who do I talk to?

    Is there more data I don’t see?

    What about using R or Spark?

    Can I use open source?

  • Cleansing and cleaning is now shifting more towards the business side…

    Moving away from hard wall

    UPDATE: IM -> Data Analytics

    Excellent! I’m not bound by data storage or PC capacity.

    I can access/see all the data available to me

    I can “fail” and try again quickly!
  • How we work… we work together as one team with our business partners.

    We have Data Scientists and Engineers on our team, along with the software developers

    Next Yiqing will talk about how this team tackled various “big data” problems and how we used our lake in practice.
  • Remember our frustrations: Security and access for our users

    If you don’t setup security you have a lake that nobody can use!
  • This is an example of our Data Lake in action.


    GET: We're taking usage/billing data from various cloud providers


    LAND:
    and landing it in the LAKE.

    STUDY:
    We're leveraging PowerBI to surface that data to our end users

    Remember the OLD WAY: everyone talks about it forever in meetings, agrees on a schema, then an ETL developer starts the work.


  • Another Example: We leveraged the SAME LAKE to LAND that large amount of Experian data to HDFS.

    Then we used SPARK to perform ETL (convert data) and write text documents to ELASTICSEARCH.

    In this example we used the same lake, but extended our capabilities with Elastic Search.


    REMEMBER THE OLD WAY: We would have loaded the data into a standard RDBMS, with slow performance, and would have had to write our own queries and fuzzy matching. Large table scans. We would only have looked at a subset of the data because of its size.

    LONG time from idea to UNDERSTANDING!!!
  • Another example of how we use the SAME LAKE:

    Streaming Analytics for Security and Operational logs – Splunk cost containment
  • Get started
    Check back to North stars…
    Be mindful of transitions
    SPEAKER: MAKE SURE YOU HAVE TRANSITION STATEMENTS
    - Add more….
