The thredUP team shares key learnings from our post-migration processes: which technologies and solutions worked best for us, and where we spent time troubleshooting and improving. In particular we focus on the development and staging experience, user authentication, cloud-native CI pipelines, application telemetry, and service mesh. We also share our experience with Kubernetes security hardening and autoscaling, and walk through creating a new service within our infrastructure.
1.
Kubernetes Navigation Stories
DevOpsStage 2019
2. Director of Infrastructure Engineering at thredUP
Senior Engineering Manager at Hotwire
Roman Chepurnyi
Staff Software Engineer at thredUP
Senior Software Engineer at Toptal
Oleksii Asiutin
12.
AWS-IAM-Authenticator – kubeconfig generation
dev
dev lead
infra team
kubeconfig generation service
IAM identity: john-smith
Kubeconfig for dev
IAM identity: lara-jones
Kubeconfig for dev-lead
prod
stage
dev
+ group
kubeconfig
IAM user group
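The generated kubeconfig wires kubectl to aws-iam-authenticator through an exec credential plugin. A minimal sketch of the user stanza (the user name, cluster ID, and role ARN below are placeholders, not thredUP's actual values):

```yaml
# Hypothetical user entry in a generated kubeconfig.
# kubectl invokes aws-iam-authenticator, which mints a token
# from the caller's AWS credentials for the given cluster/role.
users:
- name: john-smith
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1alpha1
      command: aws-iam-authenticator
      args:
        - token
        - -i
        - dev-cluster                               # cluster ID (placeholder)
        - -r
        - arn:aws:iam::123456789012:role/k8s-dev    # role ARN (placeholder)
```

With a stanza like this per cluster, switching between prod, stage, and dev is just a matter of changing kubectl contexts.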
22.
Local Development
macbook: Thredup $ git clone git@github.com:thredup/node-proxy.git
Cloning into 'node-proxy'...
...
macbook: Thredup $ cd node-proxy/
macbook: node-proxy (master) $ npm install
added 6 packages from 8 contributors and audited 6 packages in 0.595s
found 0 vulnerabilities
macbook: node-proxy (master) $ npm test
> proxy@1.0.0 test ~/Thredup/node-proxy
...
macbook: node-proxy (master) $ npm start
> proxy@1.0.0 start
> node server.js
23.
Local Development with Docker
macbook: Thredup $ docker run -it -v ${PWD}:/app -w /app -p 3000:3000 \
    node:12-alpine sh
/app $ apk add --no-cache mysql-dev
/app $ npm install
/app $ npm test
/app $ npm start
> proxy@1.0.0 start
> node server.js
24.
Local Development with Docker Compose
version: "3.7"
services:
  web:
    image: node:12-alpine
    working_dir: /app
    volumes:
      - ./:/app
    ports:
      - "3000"
    environment:
      REDIS_HOST: "redis"   # the service name, not 127.0.0.1
  mysql:
    image: ...
    ...
  redis:
    image: ...
25.
Local Development with Docker Compose
macbook: Thredup $ docker-compose up -d
…
macbook: Thredup $ docker-compose exec web sh
/app $ npm install
/app $ npm test
/app $ npm start
> proxy@1.0.0 start
> node server.js
26.
Local Development with Docker Compose
And then you need another service as a dependency ;-)
...and another one
…
docker-compose.yaml ~ 330 lines
MySQL DB ~25Gb
30.
Horizontal Pod Autoscaling (HPA)
● Do not over-provision
● Be ready for traffic spikes
metrics:
- type: External
  external:
    metricName: trace.rack.request.hits
    metricSelector:
      matchLabels:
        env: production
        service: some-service
    targetAverageValue: 10
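For context, here is the same metrics block embedded in a fuller HPA manifest; the replica bounds, object names, and API version are illustrative (autoscaling/v2beta1 was current at the time of the talk), not thredUP's actual configuration:

```yaml
# Sketch: scale a Deployment on an external (APM-provided) request-rate
# metric instead of CPU. Assumes an external metrics adapter is installed.
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: some-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: some-service
  minReplicas: 3          # illustrative floor: do not over-provision
  maxReplicas: 20         # illustrative ceiling: absorb traffic spikes
  metrics:
  - type: External
    external:
      metricName: trace.rack.request.hits
      metricSelector:
        matchLabels:
          env: production
          service: some-service
      targetAverageValue: 10   # target hits per pod
```

The autoscaler adds pods while the average request rate per pod exceeds the target and removes them as traffic subsides.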
34.
Spot instances and AZRebalance
● spot termination handling works: https://github.com/mumoshu/kube-spot-termination-notice-handler
● except when an instance is terminated by Availability Zone rebalancing (AZRebalance)
Terminating EC2 instance: i-0e685dc2a84b65f63
Cause: At 2019-07-18T06:09:59Z instances were launched to balance instances in
zones us-east-1a us-east-1e with other zones resulting in more than desired number of
instances in the group. At 2019-07-18T06:11:30Z an instance was taken out of service
in response to a difference between desired and actual capacity, shrinking the
capacity from 4 to 3. At 2019-07-18T06:11:30Z instance i-0e685dc2a84b65f63 was
selected for termination.
[Roman] Let me introduce Oleksii, a staff engineer at thredUP. Oleksii is an infrastructure enthusiast and a co-organizer of the monthly DevOps digest on dou.ua; he likes sports cars and runs an Instagram account dedicated to cooking.
[Olek] Thank you, Roman. Roman is the Director of our distributed Infrastructure team. I'd say Roman is a leader; he manages us in a way that lets us bring innovation to our company's platform.
Before thredUP, Roman worked at one of the biggest hotel discount aggregators, Hotwire.
He lives in California.
Roman is as confident navigating Kubernetes as he is navigating a sailing boat in San Francisco Bay on weekends. Great to have Roman at the helm! I know this firsthand.
Switching to case studies. Think about how to do it.
[Olek] In: access Mid: danger of shared root key Out: granular permissions
Okay, that was a brief introduction; now come the navigation stories themselves. As Roman told us, one day you wake up and realize you've migrated your infra to k8s, and yeah, it's cool. But during the migration you cut corners, and now it's probably time to review and fill some gaps.
Lots of us have been in this situation: Hey Infra team, I need access to a Kubernetes cluster.
Really? What are you going to do there? When we created our k8s clusters, we used a shared admin certificate inside the team.
And in the early stages we also gave it to engineers who asked for access. Okay, here it is, but please, use it carefully.
Aha. And then, you know... Guys, checkout is down, where is our checkout service? Guys? Oh, I might have deleted it on prod instead of dev, ouch.
So we need to organize users into groups and give them granular permissions per cluster.
[Olek] In: granular access Mid: certs, openid - no Out: aws-auth - yes, review
For authorization we use RBAC; it's the de facto standard for k8s now. With it we can create user groups and separate permissions.
We reviewed multiple authentication mechanisms for users. We started with a shared root certificate, as I said before, and realized it would be hard to create a separate certificate for each user (mainly because k8s does not support a certificate revocation policy).
After that we reviewed the OpenID Connect mechanism. It works fine and it's good, but the downside for us was that our single sign-on provider did not provide user-group support with OpenID: you can authorize a user, but you cannot get their groups, and we need those for our ACLs.
Finally we settled on the tool now known as aws-iam-authenticator. Back when we implemented auth in k8s, it was called heptio-authenticator. Nowadays it's the default auth method in AWS EKS, and GCP and Azure have similar tools for their platforms. Let's briefly review how it works.
[Olek]
Our kubectl auth config uses tokens, which are generated by the client-side aws-iam-authenticator binary. The token is generated based on your AWS credentials and contains a cluster name and a role. For simplicity, let's assume a role represents a user group here. So if you're a cluster admin you specify the admin role; if you're a read-only user you have a different one.
Then you send your API token to the k8s server. The server has a webhook configured that talks to a DaemonSet, which checks whether the user is allowed to use the role from the token. If everything is okay, the user is successfully authenticated and the proper user groups are assigned for their session.
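On the server side, the role-to-group mapping is configuration that the authenticator reads. On EKS this lives in the aws-auth ConfigMap; a self-hosted aws-iam-authenticator uses an equivalent config. A sketch (all role ARNs and group names are placeholders):

```yaml
# Hypothetical mapping of IAM roles to Kubernetes users and groups.
# RBAC rules then bind permissions to these groups.
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    - rolearn: arn:aws:iam::123456789012:role/k8s-admin
      username: admin:{{SessionName}}
      groups:
        - system:masters
    - rolearn: arn:aws:iam::123456789012:role/k8s-read-only
      username: dev:{{SessionName}}
      groups:
        - read-only
```

Whoever can assume the IAM role gets the corresponding Kubernetes groups, so access management stays in IAM.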
[Olek]
So basically, for each of our clusters a user needs the proper IAM role ARN. For example, on the prod cluster a user may have read-only permissions, while on the development cluster the same user has the admin role. Maintaining their local kubeconfig is not something our engineers should have to care about.
[Olek] So we created a service which generates a kubeconfig based on the user's AWS IAM credentials. Now a user executes a one-liner shell script in a terminal, and from then on the engineer has a cronjob installed which generates or regenerates the kubeconfig periodically. Why did we implement this as a cronjob? From time to time we update our group hierarchy, or add or remove users from groups, and with a cronjob these changes are deployed to users' machines automatically.
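The periodic regeneration can be as simple as a crontab entry calling the generator client; the script name and schedule below are hypothetical:

```
# Regenerate kubeconfig every 6 hours so group and permission
# changes propagate to laptops automatically (path is illustrative).
0 */6 * * * /usr/local/bin/kubeconfig-sync >/dev/null 2>&1
```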
[CONCLUSION] What did our engineers get from it? Everyone has kubectl with a kubeconfig fully managed by the infrastructure team. And the infra team has visibility and control in terms of identity and access management. In this way we applied IAM best practices to k8s auth management.
[Olek]
So here is our secrets management evolution path. It looks a little strange at first glance, but let me explain why.
[Olek] We set up HashiCorp Vault; we love it, it's super-cool and gives you everything you need: secrets management, a good security level, infra perks.
Here is how we work with it: we have an init container which grabs all necessary secrets and puts them into a shared volume; then the main service container reads them from the volume and initializes its env vars with the secret values. There is even an open-source project for the init container, called Daytona.
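The init-container pattern can be sketched in a pod spec like this; the image names and mount paths are illustrative, not thredUP's actual manifests:

```yaml
# Sketch: an init container fetches secrets from Vault into a shared
# in-memory volume; the app container reads them at startup.
apiVersion: v1
kind: Pod
metadata:
  name: some-service
spec:
  volumes:
  - name: secrets
    emptyDir:
      medium: Memory          # keep secrets off disk
  initContainers:
  - name: vault-init          # e.g. a Daytona-style fetcher
    image: vault-init:latest  # placeholder image
    volumeMounts:
    - name: secrets
      mountPath: /secrets
  containers:
  - name: app
    image: some-service:latest
    volumeMounts:
    - name: secrets
      mountPath: /secrets
      readOnly: true
```

Because the volume is an in-memory emptyDir, the fetched secrets live only as long as the pod does.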
We have Vault set up in our clusters and it can be used by our engineers, but in fact it didn't get wide adoption. Maybe it's because engineers didn't have enough time to dive into it, maybe it's because of our not-so-good guides. We succeeded in setting it up but failed at spreading it and getting our colleagues to use it. Our engineers did not add secrets to Vault and did not use it. So we started to investigate further.
[Olek] And we settled on the SOPS project, which stands for Secrets OPerationS. It's a simple and flexible tool for managing secrets. What it does is encrypt and decrypt text files, with support for the YAML, JSON, and .env formats. It supports the AWS, GCP, and Azure key management systems, as well as plain old PGP encryption.
Here is an example of a yaml file containing database credentials for a service.
[Olek] And here is how this file looks after encryption: values are encrypted key by key instead of the whole file.
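The slide files aren't reproduced here, so this is an illustrative sketch of what SOPS produces (values and the KMS ARN are placeholders, and the sops metadata block is abridged). Given a plaintext file with `database.username` and `database.password`, running `sops --encrypt --kms <key-arn> secrets.yaml` yields something like:

```yaml
# Keys stay readable; each value is encrypted individually.
database:
  username: ENC[AES256_GCM,data:...,iv:...,tag:...,type:str]
  password: ENC[AES256_GCM,data:...,iv:...,tag:...,type:str]
sops:
  kms:
  - arn: arn:aws:kms:us-east-1:123456789012:key/...   # placeholder
  version: 3.3.1
  # (created_at, mac, and other metadata fields omitted)
```

Because keys remain in the clear, diffs in git reviews still show which secrets changed without revealing their values.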
[Olek] We deploy our services with the Helm package manager. For a Helm release we specify both unencrypted values with generic release configuration and sops-encrypted values which are used to create secrets. You can see an example of a Helm template for secrets creation.
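A Helm template along these lines (the names and values layout are illustrative, not thredUP's actual chart) turns the decrypted values into a Kubernetes Secret:

```yaml
# templates/secrets.yaml -- renders every key under .Values.secrets
# into a Secret; the values arrive from the sops-decrypted values file.
apiVersion: v1
kind: Secret
metadata:
  name: {{ .Release.Name }}-secrets
type: Opaque
data:
{{- range $key, $value := .Values.secrets }}
  {{ $key }}: {{ $value | b64enc | quote }}
{{- end }}
```

A deploy then reduces to something like `sops -d secrets.enc.yaml > secrets.yaml && helm upgrade --install some-service ./chart -f values.yaml -f secrets.yaml` (file names here are hypothetical).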
[Olek] So to deploy a service, all you need to do is decrypt your GitHub-stored secrets first, and then run a Helm release.
This solution got good adoption among our engineers and turned out to be more popular than Vault. It might also be simpler; SOPS proved to be more developer-friendly at thredUP.
That way we moved from fault-intolerant, desynchronized, and unmanageable manual secret creation to a fully predictable and monitored secret management solution, filling one more migration gap.
All Helm charts are available. We can use them to run an on-demand staging setup.
Advantages: 1) always up to date with the latest code and data, 2) scalable.
[Olek] Okay, so we just told you how we manage dynamic stagings so engineers can present the results of their work to coworkers. But where do engineers spend most of their working hours? It's local development: when you write code on your laptop, run the tests, and do debugging.
[Olek] And when we talk about local development, the really basic workflow is just to clone a git repo, install dependencies, and run the service (let's assume it's a web application). Here is an example of doing it that way with Node.js. BUT it's not that simple in the real world, right?
[Olek] When you install a service, it might have native extensions in its dependencies. In that case you might need to install specific libraries on your machine. That's okay if there is a good guide on how to do it, and if its libraries don't conflict with another service's libraries, and another, and, because we have this trendy microservices architecture, yet another service's libraries. It becomes cumbersome to set this up on a local machine and... it's good we have such a thing as Docker. So you create a Docker container from a Node.js image, mapping your codebase and the ports you work with, install all the necessary libraries, and do the same stuff you did locally.
And everything is fine, you are good to go. Not really.
[Olek] So we moved from literally operating-system-native development to containerized development. What's next? It's probably convenient to use docker-compose to set up service dependencies. Usually that's a database, a caching layer, queues, workers.
[Olek] Then you run it, it works, and it's convenient to use locally.
[Olek] Until your docker-compose file becomes 300+ lines long and your local database is 25 GB heavy.
[Olek] Why is it hard and inconvenient? Because you have to keep your local env up to date, and because it consumes a lot of resources (we do have powerful laptops, but even they run into resource problems from time to time). And if you have issues with some service it's hard to