Ce diaporama a bien été signalé.
Nous utilisons votre profil LinkedIn et vos données d’activité pour vous proposer des publicités personnalisées et pertinentes. Vous pouvez changer vos préférences de publicités à tout moment.
Spotify Lessons:

Learning to Let Go of
Machines
IO Tribe
James Wen, Site Reliability Engineer at Spotify

ALF Squad, Infr...
InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese an...
Presented at QCon New York
www.qconnewyork.com
Purpose of QCon
- to empower software development by facilitating the sprea...
Let’s control how
feature developers think
about what their code is
actually running on.
Takeaways
• Feature developers = happiest with
feature work
• Find out developer machine
concerns and mitigate
• Migrating...
Agenda
• Why?
• Journey
• Hybrid Cloud
• Ops in Squads
• Future
• Learnings
Why?
Why don’t we want feature devs to care too
much about infrastructure and machines?
Why?
Time taken on infrastructure tasks = time
taken away from feature work
Feature devs = focused on features
Spotify Scale Stats
- 140 Million+ Monthly Active Users
- 50 Million+ Subscribers
- 30 Million+ Songs
- 2 Billion+ Playlis...
Spotify Dev Scale Stats
~900 Devs
~100 Tech Teams
~2000 Services
Spotify Machine Scale Stats
~10,000 Bare Metal Hosts

~13,000 Hosts on GCP

46 Hardware/VM Types
Example: Capacity Planning
Avg # devs on a team Capacity Planning
Scale doesn’t really matter
-Smaller companies/teams =
developer time is more valuable
-Larger companies/teams =
wasted in...
Other Infrastructure Tasks
- Machine provisioning

- Failure planning
- Security updates
- Machine maintenance
Dedicated Ops?
Dedicated Ops?
~2000 Services

74 Infrastructure and Operations
Engineers
If all IO engineers → dedicated ops

27:1 servic...
Ops In Squads
Feature teams handle their own ops and
provisioning



Using the services and tooling the
Infrastructure and...
We control the level of
context feature teams
need to operate their
services.
- Developer Happiness

- Developer
effectiveness and
context
Journey
- Ops in Squads
- Hybrid Cloud
(Ephemerality)
Starting Out
StockholmSan Jose
Rack 2Rack 1
Historical: Feature Developer’s Context for Service’s Capacity
lon-1-dlon-1-b
lon-1-clon-1-...
Machine Context
- Packages
- Hostname
- Machine specs (CPU, RAM,
disk, etc.)
- Uptime and service duration
- Location
- Lo...
Feature Developer Concerns
How to get?
How many?Specs?
How long?
How to talk
to it?
Where? Up to date?
How to
track?
What ...
Feature Developer Concerns
How to get?
How many?Specs?
How long?
How to talk
to it?
Where? Up to date?
How to
track?
What ...
ServerDB
Feature Developer Concerns
How to get?
How long?
How to talk
to it?
What tools
on it?
Maintenance?
What to put
on it?
Avai...
ProvGun/ProvCannon
Feature Developer Concerns
How long?
How to talk
to it?
What tools
on it?
Maintenance?
What to put
on it?
Available?
Servi...
DNS
Feature Developer Concerns
How long?
How to talk
to it?
What tools
on it?
Maintenance?
What to put
on it?
Available?
Servi...
Nameless
Feature Developer Concerns
How long?
How to talk
to it?
What tools
on it?
Maintenance?
What to put
on it?
Available?
Servi...
Cortana
Cortana
Feature Developer Concerns
How long?
How to talk
to it?
What tools
on it?
Maintenance?
What to put
on it?
Available?
Servi...
Helios and Containers
Feature Developer Concerns
How long?
How to talk
to it?
What tools
on it?
Maintenance?
What to put
on it?
Available?
Servi...
Google Compute Platform
ash2-cortana-a1.ash2

Zone Service Group Sequential #
gew1-cortana-a-l33t.gew1
Zone Service Pool Random 4 Chars
Cortana Pool Manager
Feature Developer Concerns
How long?
How to talk
to it?
What tools
on it?
Maintenance?
What to put
on it?
Available?
Servi...
Regional Managed
Instance Groups
Feature Developer Concerns
How long?
How to talk
to it?
What tools
on it?
Maintenance?
What to put
on it?
Available?
Servi...
MBMI: Minimal Base Machine Image
Feature Developer Concerns
How to talk
to it?
What tools
on it?
Maintenance?
What to put
on it?
Available?
Service + Busin...
Phoenix
Feature Developer Concerns
How long?
How to talk
to it?
Maintenance?
Available?
Service + Business
How to
track?
How to ge...
Current: Feature Developer’s Context for Service’s Capacity
GCP - europe-west-1
Pool:

2 instances x (n1-standard-32)
Stoc...
Feature Developer Concerns
How long?
How to talk
to it?
Maintenance?
Available?
Service + Business
How to
track?
How to ge...
Future
Gordon (Cloud DNS)
Feature Developer Concerns
How long?
How to talk
to it?
Maintenance?
Available?
Service + Business
How to
track?
How to ge...
Autoscaling
Feature Developer Concerns
How long?
How to talk
to it?
Maintenance?
Available?
Service + Business
How to
track?
How to ge...
Right Sizing
Feature Developer Concerns
How long?
How to talk
to it?
Maintenance?
Available?
Service + Business
How to
track?
How to ge...
Future Feature Developer’s Context for Service’s Capacity
GCP - asia-east-1
Service Pool
GCP - europe-west-1
Service Pool
...
Feature Developer Concerns
How long?
How to talk
to it?
Maintenance?
Available?
Service + Business
How to
track?
How to ge...
Learnings
Why Pets to Cattle was Difficult:
- Manual/tedious setup

- Wait times for machine becoming ready
(packages, DNS)

- Non-a...
- Monitoring

- Logging

- Service Design
- Incidents
Ephemerality Learnings
- Replicate bare metal functionality, then
iterate
- When in doubt, devs provision up and
many
- Migration = great time to...
- Feature devs need carrots,
sledgehammers, and/or limos to change
- Edge Cases: REST API + CLI = provide
enough for featu...
- Decrease necessary
infrastructure context
- Increase reliability
- Save $$$
- Increase dev happiness and
productivity
Re...
Let’s strategically
control and limit how
feature developers
think about
infrastructure.
James Wen

Email: jameswen@spotify.com

Twitter/Github: @rochesterinnyc
LinkedIn: jamesrwen



Spotify is hiring! spotifyj...
Watch the video with slide synchronization on
InfoQ.com!
https://www.infoq.com/presentations/spotify-
infrastructure-deplo...
Spotify Lessons: Learning to Let Go of Machines
Spotify Lessons: Learning to Let Go of Machines
Prochain SlideShare
Chargement dans…5
×

Spotify Lessons: Learning to Let Go of Machines

117 vues

Publié le

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2fmEofl.

James Wen tells the story of how Spotify’s infrastructure evolved from teams owning and doting on groups of long-running servers to a distinctive separation of business code and value from the underlying machines all of Spotify's services actually run on. He examines how this evolution also changed the way that Spotify developers write code and the vast increase in iteration and shipping speed. Filmed at qconnewyork.com.

James Wen is currently a Site Reliability Engineer at Spotify. He's on the ALF squad at Spotify, maintaining and developing the tooling for capacity management + provisioning and internal DNS for 150+ teams. He was formerly the Team Lead (Anchor) of the Cloud Foundry Buildpacks team at Pivotal and a core contributor and maintainer of Bundler.

Publié dans : Technologie
  • Soyez le premier à commenter

Spotify Lessons: Learning to Let Go of Machines

  1. 1. Spotify Lessons:
 Learning to Let Go of Machines IO Tribe James Wen, Site Reliability Engineer at Spotify
 ALF Squad, Infrastructure & Operations Tribe
  2. 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ spotify-infrastructure-deployment
  3. 3. Presented at QCon New York www.qconnewyork.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  4. 4. Let’s control how feature developers think about what their code is actually running on.
  5. 5. Takeaways • Feature developers = happiest with feature work • Find out developer machine concerns and mitigate • Migrating to cloud or hybrid? Start embracing ephemeral service design and infrastructure
  6. 6. Agenda • Why? • Journey • Hybrid Cloud • Ops in Squads • Future • Learnings
  7. 7. Why? Why don’t we want feature devs to care too much about infrastructure and machines?
  8. 8. Why? Time taken on infrastructure tasks = time taken away from feature work Feature devs = focused on features
  9. 9. Spotify Scale Stats - 140 Million+ Monthly Active Users - 50 Million+ Subscribers - 30 Million+ Songs - 2 Billion+ Playlists - Available in 60 markets
  10. 10. Spotify Dev Scale Stats ~900 Devs ~100 Tech Teams ~2000 Services
  11. 11. Spotify Machine Scale Stats ~10,000 Bare Metal Hosts
 ~13,000 Hosts on GCP
 46 Hardware/VM Types
  12. 12. Example: Capacity Planning Avg # devs on a team Capacity Planning
  13. 13. Scale doesn’t really matter -Smaller companies/teams = developer time is more valuable -Larger companies/teams = wasted infra time scales as well
  14. 14. Other Infrastructure Tasks - Machine provisioning
 - Failure planning - Security updates - Machine maintenance
  15. 15. Dedicated Ops?
  16. 16. Dedicated Ops? ~2000 Services
 74 Infrastructure and Operations Engineers If all IO engineers → dedicated ops
 27:1 service:engineer ratio
  17. 17. Ops In Squads Feature teams handle their own ops and provisioning
 
 Using the services and tooling the Infrastructure and Operations tribe has written
  18. 18. We control the level of context feature teams need to operate their services.
  19. 19. - Developer Happiness
 - Developer effectiveness and context
  20. 20. Journey
  21. 21. - Ops in Squads - Hybrid Cloud (Ephemerality)
  22. 22. Starting Out
  23. 23. StockholmSan Jose Rack 2Rack 1 Historical: Feature Developer’s Context for Service’s Capacity lon-1-dlon-1-b lon-1-clon-1-a keys updated Rack 2 lon-1-f lon-1-e updated
  24. 24. Machine Context - Packages - Hostname - Machine specs (CPU, RAM, disk, etc.) - Uptime and service duration - Location - Local state (files on disk, info in memory) Unbound
 v1.6.3 ash2-metadata-a.ash2.spotify.net Openssl v1.0.0f 2 Cores 8 GB RAM Tarred LogsIn Virginia 3 Years
  25. 25. Feature Developer Concerns How to get? How many?Specs? How long? How to talk to it? Where? Up to date? How to track? What tools on it? Maintenance? What to put on it? Available? Service + Business
  26. 26. Feature Developer Concerns How to get? How many?Specs? How long? How to talk to it? Where? Up to date? How to track? What tools on it? Maintenance? What to put on it? Available? Service + Business
  27. 27. ServerDB
  28. 28. Feature Developer Concerns How to get? How long? How to talk to it? What tools on it? Maintenance? What to put on it? Available? Service + Business How to track? Where? Up to date? How many? Specs?
  29. 29. ProvGun/ProvCannon
  30. 30. Feature Developer Concerns How long? How to talk to it? What tools on it? Maintenance? What to put on it? Available? Service + Business How to track? How to get? Where? Up to date? How many? Specs?
  31. 31. DNS
  32. 32. Feature Developer Concerns How long? How to talk to it? What tools on it? Maintenance? What to put on it? Available? Service + Business How to track? How to get? Where? Up to date? How many? Specs?
  33. 33. Nameless
  34. 34. Feature Developer Concerns How long? How to talk to it? What tools on it? Maintenance? What to put on it? Available? Service + Business How to track? How to get? Where? Up to date? How many? Specs?
  35. 35. Cortana
  36. 36. Cortana
  37. 37. Feature Developer Concerns How long? How to talk to it? What tools on it? Maintenance? What to put on it? Available? Service + Business How to track? How to get? Where? Up to date? How many? Specs?
  38. 38. Helios and Containers
  39. 39. Feature Developer Concerns How long? How to talk to it? What tools on it? Maintenance? What to put on it? Available? Service + Business Where? Up to date? How many? Specs? How to track? How to get?
  40. 40. Google Compute Platform
  41. 41. ash2-cortana-a1.ash2
 Zone Service Group Sequential # gew1-cortana-a-l33t.gew1 Zone Service Pool Random 4 Chars
  42. 42. Cortana Pool Manager
  43. 43. Feature Developer Concerns How long? How to talk to it? What tools on it? Maintenance? What to put on it? Available? Service + Business How to track? How to get? Where? Up to date? How many? Specs?
  44. 44. Regional Managed Instance Groups
  45. 45. Feature Developer Concerns How long? How to talk to it? What tools on it? Maintenance? What to put on it? Available? Service + Business How to track? How to get? Where? Up to date? How many? Specs?
  46. 46. MBMI: Minimal Base Machine Image
  47. 47. Feature Developer Concerns How to talk to it? What tools on it? Maintenance? What to put on it? Available? Service + Business How to track? How to get? Where? How long?Up to date? How many? Specs?
  48. 48. Phoenix
  49. 49. Feature Developer Concerns How long? How to talk to it? Maintenance? Available? Service + Business How to track? How to get? Where? Up to date? What tools on it? What to put on it? How many? Specs?
  50. 50. Current: Feature Developer’s Context for Service’s Capacity GCP - europe-west-1 Pool:
 2 instances x (n1-standard-32) Stockholm Pool:
 4 instances x (High Mem)
  51. 51. Feature Developer Concerns How long? How to talk to it? Maintenance? Available? Service + Business How to track? How to get? Where? Up to date? What tools on it? What to put on it? How many? Specs?
  52. 52. Future
  53. 53. Gordon (Cloud DNS)
  54. 54. Feature Developer Concerns How long? How to talk to it? Maintenance? Available? Service + Business How to track? How to get? Where? Up to date? What tools on it? What to put on it? How many? Specs?
  55. 55. Autoscaling
  56. 56. Feature Developer Concerns How long? How to talk to it? Maintenance? Available? Service + Business How to track? How to get? Where? Up to date? What tools on it? What to put on it? How many? Specs?
  57. 57. Right Sizing
  58. 58. Feature Developer Concerns How long? How to talk to it? Maintenance? Available? Service + Business How to track? How to get? Where? Up to date? What tools on it? What to put on it? How many? Specs?
  59. 59. Future Feature Developer’s Context for Service’s Capacity GCP - asia-east-1 Service Pool GCP - europe-west-1 Service Pool GCP - us-central-1 Service Pool
  60. 60. Feature Developer Concerns How long? How to talk to it? Maintenance? Available? Service + Business How to track? How to get? Where? Up to date? What tools on it? What to put on it? How many? Specs?
  61. 61. Learnings
  62. 62. Why Pets to Cattle was Difficult: - Manual/tedious setup
 - Wait times for machine becoming ready (packages, DNS)
 - Non-automatic security updates - A fixed, reliable hostname - SSH Access - Always up/present unless team tears down
  63. 63. - Monitoring
 - Logging
 - Service Design - Incidents Ephemerality Learnings
  64. 64. - Replicate bare metal functionality, then iterate - When in doubt, devs provision up and many - Migration = great time to influence dev paradigms - Don’t need to DIY Hybrid Learnings
  65. 65. - Feature devs need carrots, sledgehammers, and/or limos to change - Edge Cases: REST API + CLI = provide enough for feature teams to handle the edge cases 
 DevEx Learnings
  66. 66. - Decrease necessary infrastructure context - Increase reliability - Save $$$ - Increase dev happiness and productivity Recap
  67. 67. Let’s strategically control and limit how feature developers think about infrastructure.
  68. 68. James Wen
 Email: jameswen@spotify.com
 Twitter/Github: @rochesterinnyc LinkedIn: jamesrwen
 
 Spotify is hiring! spotifyjobs.com IO Tribe
  69. 69. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/spotify- infrastructure-deployment

×