Making a Lion Bulletproof: SRE in Banking

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2NeL53R.

Robin van Zijll and Janna Brummel talk about the history, present and future of ING’s SRE team and practices. They touch upon people (hiring, coaching, organizational aspects, culture), process (way of working, education), technology (observability, infrastructure), and share lessons learned that can be applied to any organization starting or growing SRE, financial or not. Filmed at qconnewyork.com.

Janna Brummel is a Site Reliability Engineer at ING Bank, where she helps other teams within the bank understand their services' reliability and respond more efficiently to incidents. Robin van Zijll is a Site Reliability Engineer & Product Owner at ING Bank. He applies his experience to help other engineers with operations-related problems by creating a reliability toolset.

Published in: Technology

Making a Lion Bulletproof: SRE in Banking

  1. Making a Lion Bulletproof: SRE in Banking. Robin van Zijll & Janna Brummel (@jannabrummel). QCon NY, June 26, 2019
  2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month. Watch the video with slide synchronization on InfoQ.com: https://www.infoq.com/presentations/ing-sre-teams-practices/
  3. Presented at QCon New York (www.qconnewyork.com). Purpose of QCon: to empower software development by facilitating the spread of knowledge and innovation. Strategy: a practitioner-driven conference designed for YOU, influencers of change and innovation in your teams; speakers and topics driving the evolution and innovation; connecting and catalyzing the influencers and innovators. Highlights: attended by more than 12,000 delegates since 2007; held in 9 cities worldwide.
  4. ING is a global financial organization, active in 41 countries. This talk is about the retail bank of NL, with 9 million debit cards, 8 million retail customers and 7 million ATM transactions/month.
  5. Mobile banking is used by 4.5 million customers. Together, they log in 6 million times a day (100+ TPS).
  6. Availability figures 2018, prime time (06:30 AM – 01:00 AM): Internet banking 99.77% uptime, 0.22% downtime; mobile banking 99.87% uptime, 0.13% downtime. Regulator target: 99.88%.
  7. [Chart: logins per second for mobile banking by time of day; y-axis 0 to 140 logins per second, x-axis 00:00 to 20:00 in 4-hour steps, repeated over two days]
  8. Availability figures 2018, 24 hours a day: Internet banking 99.63% uptime, 0.37% downtime; mobile banking 99.78% uptime, 0.22% downtime. Customer expectation: 99.999%.
  9. SR-what? Site Reliability Engineering is “what happens when you ask a software engineer to design an operations function” – Ben Treynor (Google)
  10. People
  11. At ING we are organized in tribes with (Biz)DevOps squads responsible for build and run. [Diagram: several tribes, each with a tribe lead, product owners and a number of squads] Our SRE team is a ‘horizontal’ squad, part of a productivity engineering tribe. We support 1,700 engineers across 340 squads.
  12. Our SRE team: 7 engineers (4 dev, 3 ops), with 2 more joining soon; 1 product owner; 1 chapter lead. Most have engineering and on-call experience in ING product engineering.
  13. When we hire SREs, we look for someone who’s: • Passionate about reliability, problems, DevOps and open source • OK with failure • Insensitive to hierarchy • Willing to teach and advise engineers about reliability • Experienced in on-call duties and 1+ language(s) in our stack • Still excited to work with us after meeting half our team and having heard realistic job expectations
  14. Process
  15. Why and how did we start with SRE? We used to have a small team of ops engineers on call for online channels. These engineers were the ones up at night, but they could not structurally improve service reliability because of our DevOps model. An SRE pilot was started and supported: • Team was transformed and given a new purpose • Decided on SRE model, way of working and roadmap • Experiences and proposal were presented to senior management. After knowledge transfer of old tasks, SRE was launched :)
  16. For SRE, we generally see 3 organizational models: (1) product engineering + SRE: service ownership is shared between PE and SRE; (2) SREs are distributed and embedded in PE teams, and service ownership is shared; (3) service ownership is with PE, while SRE consults and creates tools (our model).
  17. What do we do as SREs? Service Reliability Hierarchy, from O’Reilly’s Site Reliability Engineering (2016), top to bottom: Product, Development, Capacity Planning, Testing + Release Procedures, Postmortem/RCA, Incident Response, Monitoring. Curious to learn more about… • Learning from failure? Check out Jason’s and Ryan’s talk • Chaos engineering and graceful degradation? Check out Lorne’s talk • High impact outlier system failures? Check out Laura’s talk
  18. What do we do as SREs? We spend 80% of our time on engineering: • We deliver the Reliability Toolkit: a white-box monitoring and alerting stack • We work on a secure container platform with a service mesh in public cloud. We spread SRE love and best practices: • We reach out to engineers to consult and get feedback • We educate on reliability topics. What we don’t do: • On-call for product engineering • Work on SRE topics already covered by other teams in our organization
  19. We do outreach and we educate on SRE topics. We educate engineers: • Engineering onboarding • Prometheus workshops. We facilitate knowledge sharing: • Cross-domain SRE guild • SRE demo sessions open to all • Guidance via chat and intranet • Prometheus user community • Conference report-out. We reach out to engineers: • Feedback loop for products • We are reliability advocates
  20. When we demo, we sometimes block the hallway
  21. We use these principles in our way of working: • We work with industry standards • We work with open source products and practices • We automate toil wherever and whenever we can
  22. Technology
  23. Why did we develop the Reliability Toolkit? • Mean time to repair is too long: we waste time finding incident owners • Lack of insight into application health for teams • High level of technology diversity makes implementing monitoring difficult
  24. How does the Reliability Toolkit work? [Diagram with these components: Applications, Prometheus, Alert Manager, Model Builder, Grafana, and notifications via e-mail, SMS (Message Bird) and ChatOps (Mattermost)]
  25. How do we provision the Reliability Toolkit? The SRE team: • Together with a team we create a joint config • We maintain and update binaries • We deliver the Reliability Toolkit on 5 instances over 3 environments, and we remain responsible • We deliver client libraries so metrics can be scraped from servers
  26. Before, teams would own and use a full pipeline… [Diagram: version control, combine configurations, build, publish, deploy = Reliability Toolkit; stages marked “done by DevOps team”]
  27. …now they only own and update config [Diagram: version control, combine configurations, build, deploy = Reliability Toolkit; one stage marked “done by DevOps team”]
  28. Increasing and improving usage of the Reliability Toolkit: • Include client libraries in engineering frameworks • Ensure a good feedback loop: in person or in tooling • Educate others during onboarding and workshops • Template team dashboards and make other dashboards accessible to all
  29. And now Reliability Toolkit usage has been increasing
  30. We made onboarding and using our Reliability Toolkit easy, but our 70 onboarded teams still need to ensure that Prometheus can scrape metrics. How can we reach all 340 teams?
  31. Let’s try a service mesh! Curious? Check the Software Defined Infrastructure track
  32. Why use a service mesh to improve reliability? • A service mesh helps us get new/updated functionality to applications fast • We can improve observability for all: metrics, logs, distributed tracing and resilience patterns based on incident learnings that work out of the box • We can introduce/expand A/B testing, canary releasing and staged rollouts • Engineers only need to worry about security at application level: immutable containers, zero trust network and security policies come for free, taking away risk documentation work
  33. What are we working on next? • Scaling in our Reliability Toolkit stack for efficient use of resources, scaling up the number of teams using our stack • Expanding our role as reliability advocates • Completing PoC with service mesh
  34. Takeaways • Hire SREs from your product engineering domain • Never compromise on mindset in SREs • Start with a pilot if you are not sure if SRE works for you • Pick an SRE model that works well for your organization • Try to get senior management support and understanding • Invest in SRE outreach and education • Focus on scalability and ease-of-use in your tooling • Don’t be afraid of redesign if it makes users happier
  35. Questions? Icons used are all from flaticon.com
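The availability percentages on slides 6 and 8 are easier to compare once converted into time. The following is a minimal sketch of that arithmetic, not something from the talk: the function names are hypothetical, a 365-day year is assumed, and the prime-time window is taken as 18.5 hours per day (06:30 AM to 01:00 AM, as stated on slide 6).

```python
from datetime import timedelta

# Assumption: a 365-day year (2018 was not a leap year).
DAYS_PER_YEAR = 365
# Prime-time window from slide 6: 06:30 AM to 01:00 AM the next day.
PRIME_TIME_HOURS_PER_DAY = 18.5


def downtime_per_year(availability_pct: float, hours_per_day: float = 24.0) -> timedelta:
    """Yearly downtime implied by an availability percentage.

    hours_per_day is the length of the daily window in which availability
    is measured (24 for round-the-clock, 18.5 for prime time).
    """
    measured_hours = hours_per_day * DAYS_PER_YEAR
    downtime_hours = measured_hours * (100.0 - availability_pct) / 100.0
    return timedelta(hours=downtime_hours)


def average_rate_per_second(events_per_day: float) -> float:
    """Average events per second, assuming events are spread evenly over a day."""
    return events_per_day / 86_400


if __name__ == "__main__":
    # Regulator target (slide 6) versus customer expectation (slide 8).
    print("99.88% in prime time:", downtime_per_year(99.88, PRIME_TIME_HOURS_PER_DAY))
    print("99.999% around the clock:", downtime_per_year(99.999))
    # Slide 5: 6 million logins a day.
    print("average logins/s:", round(average_rate_per_second(6_000_000), 1))
```

Under these assumptions, the 99.88% prime-time target allows roughly 8.1 hours of downtime per year, while a 99.999% round-the-clock expectation allows only about 5.3 minutes. Six million logins a day average out to about 69 per second, which fits the "100+ TPS" figure on slide 5 if that figure describes peak rather than average load.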
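Slide 25 mentions client libraries that let Prometheus scrape metrics from servers. To illustrate what a scrape target exposes, here is a minimal sketch that uses only the Python standard library instead of an official Prometheus client library. It is not ING's implementation: the metric name `app_logins_total`, the `channel` label, the port and all function names are hypothetical, and label values are assumed to need no escaping.

```python
from collections import Counter
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical metric: login attempts per channel (illustrative names only).
LOGINS = Counter()


def record_login(channel: str) -> None:
    """Increment the login counter for one channel, e.g. "mobile"."""
    LOGINS[channel] += 1


def render_metrics() -> str:
    """Render the counter in the Prometheus text exposition format."""
    lines = [
        "# HELP app_logins_total Total number of login attempts.",
        "# TYPE app_logins_total counter",
    ]
    for channel, count in sorted(LOGINS.items()):
        # Assumption: channel values contain no quotes or backslashes.
        lines.append(f'app_logins_total{{channel="{channel}"}} {count}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the rendered metrics on GET /metrics."""

    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever, answering scrapes on localhost (hypothetical port)."""
    HTTPServer(("127.0.0.1", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts an endpoint that a Prometheus server could be configured to scrape at `/metrics`. In practice an official client library handles escaping, metric types and concurrency, which is presumably why the team ships libraries rather than asking each squad to hand-roll this.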