Simply Business' Data Platform

Simply Business is a leading insurance provider for small businesses in the UK, and we are now expanding to the USA. In this presentation, I explain how our data platform is evolving to keep delivering value while adapting to a company that changes really fast.

  1. Simply Business’ Data Platform By Dani Solà
  2. Table of contents: 1. Introductions 2. Some context 3. Data platform evolution 4. Cool stuff we’ve done 5. Lessons learned 6. Peeking into the future 7. References
  3. 1. Introductions: Nice to meet you
  4. Hello! I’m Dani :)
  5. This is Simply Business ● Largest UK business insurance provider ● Over 450,000 policy holders ● Using BML, tech and data to disrupt the business insurance market ● Acquired in 2016 (£120M) and again by Travelers in 2017 (£402M) ● #1 best company to work for in 2015 and 2016, among other awards ● Certified B Corporation since 2017
  6. 2. Context, context, context! It is everything
  7. Mission: To enable Simply Business to create value through data
  8. Data Environment - The 5Vs ● ⏬ Low volume: about 1M events/day ● High variety: nearly 100 event types and growing ● High velocity: sub-second for the apps that need it ● ⏫ High veracity: using strong schemas for most data points ● ⏫ High value: as a data-driven company, all departments use data on a daily basis
  9. Data and Analytics team values ● Simplicity: simple is easier to maintain and understand (it’s hard!) ● Adaptability: data tools and techniques change very fast, don’t fight it ● Empowerment and self-serve: we provide a platform to do the easy things easy ● Pioneering: we push the boundaries of what’s possible with data
  10. Data Platform Capabilities ● KPIs and MI: obviously ● Product Analytics: understand how our products perform ● Customer Analytics: understand how our customers behave ● Experimentation Tools: to test all our assumptions ● Data Integration: bringing all our data into one place ● Customer Comms: it’s very data-intensive ● Machine Learning: because understanding the present is not enough!
  11. Analytics usage
  12. 3. Data platform evolution “Change is the only constant” - A data engineer
  13. The batch days: 2014-2015 Team: 2-3 data platform engineers Tech: ● Vanilla Snowplow Analytics for the event pipeline, which ran on EMR ● Homegrown Change Data Capture (CDC) pipeline to flatten MongoDB collections ● Looker for web and product analytics, SQL Server for top-level KPIs
  14. [Architecture diagram] Sources (Website, MongoDB, Adwords, Email, ...) → Ingest (Event Collector, Change Data Capture, Batch Importer) → Process (Scalding on EMR hourly job; data modelling cron jobs) → Store (S3, Redshift) → Serve (Batch Exporter)
  15. NRT first steps: 2016-2017 Team: 3-4 data platform engineers Changes: ● We added an NRT pipeline in order to expose event data back to transactional apps ● We used Kinesis as the message bus; we didn’t want to manage anything ourselves ● The data is stored in MongoDB for real-time access (see the consumer sketch after the diagram below)
  16. [Architecture diagram] Batch path as before: Sources (Website, MongoDB, Adwords, Email, ...) → Ingest (Event Collector, Change Data Capture, Batch Importer) → Process (Scalding on EMR hourly job; data modelling cron jobs) → Store (S3, Redshift) → Serve (Batch Exporter); new NRT branch: Event Collector → Spark Streaming (4s batches) → MongoDB → API
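For illustration, a minimal sketch of what a Kinesis-to-MongoDB consumer of this shape could look like, using the Spark 2.x PySpark Streaming API (requires the spark-streaming-kinesis-asl package; the stream, app, and collection names are hypothetical):

```python
# Minimal sketch of a Kinesis-backed Spark Streaming job writing to MongoDB.
# Stream, app and collection names are hypothetical.
import json

from pyspark import SparkContext, StorageLevel
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream
from pymongo import MongoClient

sc = SparkContext(appName="nrt-events")
ssc = StreamingContext(sc, batchDuration=4)  # 4-second micro-batches

events = KinesisUtils.createStream(
    ssc, "nrt-events", "events-stream",
    "https://kinesis.eu-west-1.amazonaws.com", "eu-west-1",
    InitialPositionInStream.LATEST, checkpointInterval=4,
    storageLevel=StorageLevel.MEMORY_AND_DISK_2)

def write_partition(records):
    # One Mongo connection per partition; upsert by event id so that
    # replayed batches stay idempotent.
    coll = MongoClient("mongodb://localhost:27017").analytics.events
    for raw in records:
        event = json.loads(raw)
        coll.replace_one({"event_id": event["event_id"]}, event, upsert=True)

events.foreachRDD(lambda rdd: rdd.foreachPartition(write_partition))

ssc.start()
ssc.awaitTermination()
```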
  17. Current pipeline: 2017-2018 Team: 4-5 data platform engineers Tech: ● We have gone NRT by default; there’s no batch layer ● We’ve introduced Airflow for batch job orchestration (see the sample DAG after the diagram below) ● We’ve got rid of S3 to comply with GDPR without having to fiddle with files
  18. [Architecture diagram] Sources (Website, MongoDB, Adwords, Email, ...) → Ingest (Event Collector, Change Data Capture, Batch Importer) → Spark Streaming (3min batches) → Redshift, with data modelling orchestrated by Airflow → Serve (Batch Exporter); NRT branch: Spark Streaming (4s batches) → MongoDB → API
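As an illustration, a minimal Airflow DAG for the data modelling step might look like this (Airflow 1.x style, matching the 2017-2018 era; task ids and commands are hypothetical):

```python
# Minimal sketch of an Airflow DAG for nightly data modelling
# (Airflow 1.x imports; task ids and commands are hypothetical).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "data-platform",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    "data_modelling",
    default_args=default_args,
    start_date=datetime(2018, 1, 1),
    schedule_interval="0 2 * * *",  # nightly at 02:00
)

extract = BashOperator(task_id="export_sources",
                       bash_command="python -m jobs.export_sources", dag=dag)
model = BashOperator(task_id="build_models",
                     bash_command="python -m jobs.build_models", dag=dag)
publish = BashOperator(task_id="refresh_dashboards",
                       bash_command="python -m jobs.refresh_dashboards", dag=dag)

# Declare ordering: export, then model, then publish.
extract >> model >> publish
```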
  19. Potential changes in the near future Migrate from Spark Streaming to Kafka Streams: ● Streaming-native API, much more powerful than Spark’s ● No need for external storage for stateful operations ● No need to have a YARN or Mesos cluster, any JVM app can have a streaming component ● Can expose APIs to other services!
  20. Potential changes in the near future Migrate from Redshift to Snowflake: ● Decoupling storage from processing ● Handles semi-structured data natively ● Lets us isolate workloads much better ● Near-instant scaling, including stopping the cluster when no one is using it ● Infinite storage!
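To show what "handles semi-structured data natively" buys you, here is a hedged sketch using the Snowflake Python connector and its VARIANT path syntax (account, table, and column names are hypothetical; credentials elided):

```python
# Sketch: querying semi-structured event payloads in Snowflake.
# Account, table and column names are hypothetical.
import snowflake.connector

conn = snowflake.connector.connect(
    account="acme-eu-west-1", user="dani", password="...",
    warehouse="ANALYTICS_WH", database="EVENTS", schema="PUBLIC",
)
cur = conn.cursor()
# VARIANT columns let you drill into raw JSON with path syntax,
# with no up-front flattening step required.
cur.execute("""
    SELECT payload:page.url::string AS page_url,
           COUNT(*)                 AS views
    FROM   raw_events
    WHERE  event_type = 'page_view'
    GROUP  BY 1
    ORDER  BY views DESC
    LIMIT  10
""")
for page_url, views in cur.fetchall():
    print(page_url, views)
```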
  21. Potential changes in the near future Migrate from EMR to Databricks for Spark batch jobs: ● Would allow us to have a dedicated cluster per app ● Easier to upgrade to newer Spark versions ● No cluster maintenance required, they’re transient
  22. [Architecture diagram, future state] Sources (Website, MongoDB, Adwords, Email, ...) → Ingest (Event Collector, Change Data Capture, Batch Importer) → Kafka Streams (3min batches) → Snowflake, with data modelling orchestrated by Airflow → Serve (Batch Exporter); NRT branch: Kafka Streams + API → MongoDB
  23. 4. Cool stuff we’ve done Not everything is infrastructure!
  24. Full Contact - A Kafka Streams App Full Contact is the brain behind the decisions related to calling Simply Business customers and prospects. It decides: ● Whether we need to call someone ● The reason to call someone ● The importance of a call (priority) ● When to make the call (scheduling)
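The real Full Contact app is a Kafka Streams (JVM) topology; purely to illustrate the shape of the four decisions listed above, here is a hypothetical Python sketch with invented fields and thresholds:

```python
# Hypothetical sketch of the kind of per-customer decision Full Contact makes.
# The real app is a Kafka Streams topology; all fields and thresholds here
# are invented for illustration.
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class CallDecision:
    should_call: bool                       # whether to call at all
    reason: Optional[str] = None            # why we are calling
    priority: int = 0                       # higher = called sooner
    scheduled_for: Optional[datetime] = None  # when to make the call

def decide(customer: dict) -> CallDecision:
    # Never call people who asked not to be contacted.
    if customer.get("do_not_call"):
        return CallDecision(should_call=False)
    # A quote about to expire is the most urgent reason to call.
    if customer.get("quote_expires_in_days", 99) <= 2:
        return CallDecision(True, "quote_expiring", priority=10,
                            scheduled_for=datetime.utcnow() + timedelta(hours=1))
    # Otherwise fall back on the lead score to prioritise the call queue.
    if customer.get("lead_score", 0.0) > 0.7:
        return CallDecision(True, "high_intent", priority=5,
                            scheduled_for=datetime.utcnow() + timedelta(days=1))
    return CallDecision(should_call=False)
```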
  25. Visualization made with https://zz85.github.io/kafka-streams-viz/
  26. Visitor graphs analysis We used GraphFrames to understand customer behaviour (see the sketch below). We found/understood: ● Cross-device customer behaviour ● How people refer Simply Business to their friends ● That we have some brokers that buy on behalf of customers
  27. Visualization made with gephi.org
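A minimal GraphFrames sketch of the cross-device part of this analysis: each cookie/device id becomes a vertex, shared identifiers become edges, and connected components group the devices belonging to one visitor (requires the graphframes Spark package; column values are hypothetical):

```python
# Sketch: cross-device visitor stitching with GraphFrames.
# Requires the graphframes package; example ids are hypothetical.
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("visitor-graphs").getOrCreate()
# connectedComponents() needs a checkpoint directory.
spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")

# Vertices: one row per cookie/device id.
vertices = spark.createDataFrame(
    [("cookie_a",), ("cookie_b",), ("cookie_c",)], ["id"])
# Edges: shared identifiers, e.g. the same email seen on two devices.
edges = spark.createDataFrame(
    [("cookie_a", "cookie_b", "same_email")], ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)
# Each connected component is one visitor across all of their devices.
g.connectedComponents().show()
```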
  28. Lead scoring ● We developed a lead scoring algorithm using AdaBoost which, based on customer behaviour, predicts how likely a customer is to convert ● This approach notably improved retargeting efficiency ● We are now developing a streaming version using LightGBM to plug it into Full Contact and improve call centre efficiency ● We can tune it so that we don’t bother people who we think aren’t interested in buying at all
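A minimal sketch of the batch approach with scikit-learn's AdaBoostClassifier (the feature names and input file are hypothetical; a streaming LightGBM version would follow the same train-then-score pattern):

```python
# Minimal sketch of batch lead scoring with AdaBoost.
# Feature names and the input file are hypothetical.
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Behavioural features per visitor, with a converted/not-converted label.
df = pd.read_csv("visitor_features.csv")
features = ["pages_viewed", "quotes_started", "days_since_first_visit"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["converted"], test_size=0.2, random_state=42)

model = AdaBoostClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Scores are conversion probabilities; retargeting spend (or call
# priority) can then be concentrated on the high-scoring leads.
scores = model.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, scores))
```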
  29. 5. Lessons learned Remember, these are our lessons
  30. Distributed FS aren’t for everyone Distributed FS have a set of properties that in many cases aren’t unique or that useful: ● Immutability: really cool until you need to mutate data ● Distributed: there are many options for distributed storage ● Schema-less data ingestion: you need to know what you are storing, especially if it contains PII ● Files: do you really want to manage files? ● Other quirks: eventual consistency (S3), managing backups (HDFS), ...
  31. Schemas everywhere! Schemas are key to: ● Enforce data quality across multiple systems, right when the data is created ● Allow multiple groups of people to talk and collaborate around data ● Make the data discoverable
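A small sketch of what enforcing quality "right when the data is created" can look like, using the jsonschema library (the schema and event shapes are hypothetical; Snowplow-style pipelines attach a schema reference to every event):

```python
# Sketch: rejecting a bad event at creation time with a JSON Schema.
# The schema and event are hypothetical.
from jsonschema import validate, ValidationError

QUOTE_STARTED_SCHEMA = {
    "type": "object",
    "properties": {
        "event_id": {"type": "string"},
        "trade": {"type": "string"},
        "annual_turnover": {"type": "number", "minimum": 0},
    },
    "required": ["event_id", "trade"],
    "additionalProperties": False,
}

def track(event: dict) -> None:
    try:
        validate(instance=event, schema=QUOTE_STARTED_SCHEMA)
    except ValidationError as err:
        # Fail loudly at the source instead of polluting downstream tables.
        raise ValueError(f"Invalid quote_started event: {err.message}")
    # ... publish the validated event to the pipeline here ...

track({"event_id": "e-123", "trade": "plumber", "annual_turnover": 85000})
```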
  32. Plan for flexibility and agility Using the right tools, or our love-hate relationship with SQL: ● It’s great for querying, testing stuff and hacking things together quickly ● Not so good for building complex logic: lots of repetition and difficult to test Make your architecture loosely coupled so that you can change one piece at a time: ● Use Kafka to decouple real-time applications ● Use S3/HDFS/DB to decouple batch applications
  33. 6. Peeking into the future We’ll probably get it wrong
  34. Size doesn’t matter, so let’s go big ● Setting up and using “big data” tools is getting easier and easier ● Cloud providers and vendors host them for you ● Most tools are fine with small data volumes and scale horizontally ● CPU, storage and network are getting cheaper faster than (our) data needs are growing ● Examples (see the sketch below): ○ Spark: from a local notebook to processing petabytes ○ Kafka Streams: useful regardless of volume
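To make the Spark example concrete: the same PySpark code runs on a laptop and on a cluster, with only the master URL changing (the input path is hypothetical):

```python
# Sketch: one codebase from laptop to cluster; only the master changes.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("small-data-big-tools")
         .master("local[*]")   # swap for a YARN/Mesos master in production
         .getOrCreate())

events = spark.read.json("events.json")   # same call works at petabyte scale
events.groupBy("event_type").count().show()
```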
  35. Machine learning is commoditized ● Everyone is giving their algorithms away for free: Tensorflow, Keras, MLFlow,… ● Cloud providers even provide infrastructure to train and serve models ● Invest in the things that will make a difference: ○ Skills ○ Data
  36. Data and analytics are transactional ● Long gone are the days when data warehousing was done overnight and isolated from the transactional systems ● Many products require real-time, reliable access to data systems: ○ Visible: Twitter reactions, bank account spending, ... ○ Invisible: marketing warehouses, transportation, recommenders, ...
  37. The best is yet to come ● Data is one of the most effective competitive advantages, everyone will invest in it ● Data will be used to self-optimize pretty much everything that can be optimized ● Data-centric ways of thinking about software engineering: ○ Software changes constantly, but data survives much longer ○ Event-driven architectures and microservices ● Make sure you learn how to teach machines :)
  38. 7. References Learning from the best
  39. References ● The Art of Platform Thinking - ThoughtWorks ● Sharing is Caring: Multi-tenancy in Distributed Data Systems - Jay Kreps ● Machine Learning: The High-Interest Credit Card of Technical Debt - Google ● Ways to think about machine learning - Benedict Evans
  40. Questions?
