Our company's AWS bill went from $230K down to $25K. This wasn't an accident; it was the result of planned steps toward cost reduction. More importantly, we did it without loss of functionality.
2. Top 6 Costs
EC2, RDS, SQS, S3, Support, Data Transfer
Also in Production
DynamoDB, ElastiCache, EMR, Lambda
On Occasion
Redshift, Aurora, Kinesis, Data Pipeline
Of Course
IAM, CloudFront, Route 53, CloudTrail, SNS, CloudWatch
Planning to Use
EFS, EC2 Container Service
3. Some Useful Services to Gain Visibility
AWS Cost Explorer
Netflix ICE (via Teevity)
Cloudyn, Cloudability, CloudCheckr, CloudHealth Tech
AWS Billing and Detailed Billing CSV Files
Custom
4. Teevity is still building Teevity and welcomes any user who wants to register: go to
http://teevity.com – registration is free. More users provide data that helps make
Teevity better.
Teevity does not compete with the OSS version of Ice. They are building on top of it and
around it (adding things to make it better). The plan is to release large, rich, use-case
oriented documentation on both NetflixOSS/Ice and Teevity in the coming month
(http://docs.teevity.com).
Teevity also plans to release a version on the AWS Marketplace called "Teevity Incognito"
so users can have their own instances.
8. ** Important **
- This bill includes all charges except credits and refunds.
- The first day of the month always has additional costs (support and reservations).
- The time zone is UTC.
- The most recent day is always a partial result (delayed by at least a few hours).
Date Amount Spent Running Total
---------- ------------ -------------
1970.01.01 5940 5940
2014.05.01 13366 19306
2014.05.02 2998 22304
2014.05.03 3152 25456
2014.05.04 2993 28450
2014.05.05 3078 31529
2014.05.06 2377 33907
2014.05.07 2505 36412
2014.05.08 2528 38941
2014.05.09 2572 41514
2014.05.10 2473 43987
2014.05.11 2562 46550
1970: Reservation Purchases (undated charges fall back to the Unix epoch), 5/1: Includes Monthly Reservation Cost
12. Amortized/Not Amortized
New Services not Included
Support Included/Not Included
Delayed Reporting
Report Handling Errors
Consolidation by Time Errors
Refund/Credit Handling
Time Zone
Used Billing Invoice for Accuracy
Used Other Reports for Trends/Comparison
Let Accounting Sort out Amortization
23. 168 Hours in a Week: 60 During the Day M-F (12 x 5), 108 Nights & Weekends
25. We saw the savings in turning off instances
Wrote a script to turn instances off and on daily
Got complaints about instances being unavailable for work:
Work from home
Work late/early
Data loss from instance shutdowns
Went from 14 hours off to 3 hours off
Needed a way to allow developers to start/stop instances themselves (listASGs.py, next slide)
26. usage: listASGs.py [-h] [-v] [-e ENVIRONMENT] [-n NAME] [-r REGION] [-a {suspend,resume,set,start,stop,store}] [-w]
[-c CAPACITY] [--excludes EXCLUDES] [-k]
List Autoscaling Groups and act on them
optional arguments:
-h, --help show this help message and exit
-v, --verbose Up the displayed messages or provide more detail
-e ENVIRONMENT, --environment ENVIRONMENT
Set the environment variable for the filter. You can choose 'all' as well as dev/qa/prd/ops/int/...
-n NAME, --name NAME Set the base stack name for the filter. Default is everything
-r REGION, --region REGION
Set the region. Default is everything
-a {suspend,resume,set,start,stop,store}, --action {suspend,resume,set,start,stop,store}
Determines the action for the script to take
-w, --html Print output in HTML format rather than text
-c CAPACITY, --capacity CAPACITY
Specifies the value for capacity. Enter as '#/#/#' in
min, desired, max order
--excludes EXCLUDES Enter a regular expression to exclude matching names
-k, --kind Display the underlying Instance Type
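The deck shows only the help text, not the implementation. Here is a minimal sketch of what the stop and start actions could look like, assuming boto3; the function and variable names are illustrative, not the actual listASGs.py internals:

# Illustrative sketch only -- not the real listASGs.py code.
# Assumes boto3 and AWS credentials configured in the environment.
import boto3

autoscaling = boto3.client('autoscaling', region_name='us-east-1')

def stop_asg(asg_name):
    # Scale the group to zero; its instances terminate and billing stops.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=asg_name, MinSize=0, DesiredCapacity=0)

def start_asg(asg_name, capacity):
    # capacity is '#/#/#' in min, desired, max order, as in the help text.
    minimum, desired, maximum = (int(n) for n in capacity.split('/'))
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        MinSize=minimum, DesiredCapacity=desired, MaxSize=maximum)

A 'store' action would presumably record the current capacity somewhere (a tag, say) so 'start' can restore it later.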
28. Instances should be tied to an ASG
All instances MUST be tagged
“Invalid” instances should be shut down automatically
Simian Army
Janitor Monkey
Graffiti Monkey
Security Monkey
Conformity Monkey
Doctor Monkey
Chaos Monkey, Chaos Gorilla
Orphan identification script (a sketch follows below)
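A minimal sketch of what an orphan-identification script could look like, assuming boto3; the required tag set here is an example, not our actual tagging schema:

# Sketch: flag running instances that are untagged or not tied to an ASG.
import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')
REQUIRED_TAGS = {'Name', 'environment', 'owner'}  # example policy

result = ec2.describe_instances(
    Filters=[{'Name': 'instance-state-name', 'Values': ['running']}])
for reservation in result['Reservations']:
    for instance in reservation['Instances']:
        tags = {t['Key'] for t in instance.get('Tags', [])}
        # ASG-launched instances carry the aws:autoscaling:groupName tag.
        if 'aws:autoscaling:groupName' not in tags:
            print(instance['InstanceId'], 'is not tied to an ASG')
        missing = REQUIRED_TAGS - tags
        if missing:
            print(instance['InstanceId'], 'is missing tags:', sorted(missing))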
43. Create Separate Accounts for DEV/QA/Prod
Only pay for Support on Prod
46. Unattached volumes can easily grow
You can view unattached volumes by running the AWS CLI command:
aws ec2 describe-volumes --output text | grep available
us-east-1a False 20 snap-5c4b92de available vol-f44096be standard
us-east-1a False 20 snap-5c4b92de available vol-b04a9cfa standard
us-east-1a False 60 20 snap-bf8db125 available vol-baae0c54 gp2
us-east-1a False 1200 400 snap-4629e4de available vol-5360fdbd gp2
us-east-1e False 48 16 snap-e49eb646 available vol-6c918e74 gp2
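The same check can be done from code; a sketch assuming boto3, with the delete call left commented out because a volume should be reviewed (or snapshotted) before removal:

# Sketch: list volumes in the 'available' state, i.e. attached to nothing.
import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')
result = ec2.describe_volumes(
    Filters=[{'Name': 'status', 'Values': ['available']}])
for vol in result['Volumes']:
    print(vol['VolumeId'], vol['Size'], 'GiB', vol['VolumeType'])
    # ec2.delete_volume(VolumeId=vol['VolumeId'])  # only after review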
52. Initially, Multiple ASGs with Minimum On-Demand
Discovered Spots stay up for long periods
Moved all into Spots with On-Demand Backup
Switching to Fleet with On-Demand Backup
On-Demand Backup (Spots):
Two-Minute Warning Flag (a poller sketch follows below)
Separate ASG for On-Demand is updated
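A minimal sketch of the two-minute-warning poller, assuming it runs on the Spot instance itself; the metadata URL is the standard endpoint, everything else is illustrative:

# Sketch: poll the Spot termination-time endpoint. It returns 404 until
# AWS schedules the instance for termination, then returns a timestamp.
import time
import urllib.request
import urllib.error

URL = 'http://169.254.169.254/latest/meta-data/spot/termination-time'

while True:
    try:
        urllib.request.urlopen(URL, timeout=2)
        # Raise the flag here, e.g. scale up the separate On-Demand ASG.
        print('Termination notice received; two minutes to drain')
        break
    except urllib.error.HTTPError:
        pass  # 404: no termination scheduled yet
    except urllib.error.URLError:
        pass  # transient metadata-service error
    time.sleep(5)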
56. What can we do?
Transfer between Availability Zones
Transfer within a Family
Modify Instance Type to match reservation (see the sketch after this list)
Move to Spot or Fleet
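These reservation changes go through the ModifyReservedInstances API; a sketch assuming boto3, with a placeholder reservation ID and example target values (the target footprint must match the original reservation's):

# Sketch: move a reservation to another AZ / instance type in the family.
import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')
ec2.modify_reserved_instances(
    ReservedInstancesIds=['<reservation-id>'],   # placeholder
    TargetConfigurations=[{
        'AvailabilityZone': 'us-east-1b',        # example target AZ
        'InstanceType': 'm3.large',              # same family as original
        'InstanceCount': 4,
        'Platform': 'EC2-VPC',
    }])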
59. Details
Specify how AWS is responsible
Unable to view EMRs
Only site-admin root accounts can see all EMRs
Logged tickets to help resolve, but got no answer
Amazon recommends not using root accounts
Detailed the steps of the process used to discover the problem
Work with your account representative
They credited the full amount requested, $21,560
60. 1. Set up Standards (Multiple Accounts, Tagging, Naming)
2. Gain Visibility – Get a tool to visualize Costs and Assets
3. Tag Assets (Use CloudFormation, Scripts, Graffiti Monkey)
4. Turn off Unused Instances (We started with QA/Dev)
5. Use ASGs to turn off instances when there is less traffic (a
scheduled-action sketch follows this list)
6. Buy EC2 Reservations monthly, not once a year. Try to use
fewer instance families
7. Give Developers a way to Easily Turn On/Off ASGs/Instances
8. Set Rules - must have tags, must be tied to an ASG
9. Use Simian Army (Janitor Monkey) to automatically handle
cleanup
10. Evaluate Price/Time/Need for Failover (Multi-AZ, Instances
across Regions, Geography)
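For item 5, scheduled ASG actions handle the off-hours scale-down automatically; a sketch assuming boto3, with example group names, times, and sizes:

# Sketch: scale a QA group down at night and back up on weekday mornings.
import boto3

autoscaling = boto3.client('autoscaling', region_name='us-east-1')

autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName='qa-web-asg',       # example group name
    ScheduledActionName='nightly-scale-down',
    Recurrence='0 2 * * *',                  # cron syntax, in UTC
    MinSize=0, DesiredCapacity=0, MaxSize=0)

autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName='qa-web-asg',
    ScheduledActionName='weekday-scale-up',
    Recurrence='0 12 * * 1-5',               # weekday mornings
    MinSize=2, DesiredCapacity=2, MaxSize=4)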
61. 11. Take advantage of Amazon's price drops
12. Use the Dynamic DynamoDB script to manage Read/Write
Capacity
13. Understand how you are charged and refactor code as needed
14. Use SQS batch requests
15. Use SQS long polling (a sketch of 14 and 15 follows this list)
16. Buy non-EC2 Reservations - DynamoDB, RDS, ElastiCache,
Redshift
17. Consolidate Instances (RDS, EC2, ElastiCache)
18. Put alarms in place, pay attention to the data
19. Where appropriate, ask Amazon for a Refund
20. Right Size Instances (Low Usage/Memory to Smaller
Instances), Avoid Overprovisioning
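For items 14 and 15, a sketch assuming boto3 and an example queue name: batching packs up to 10 messages into one billed request, and long polling stops you from paying for empty receives:

# Sketch of SQS batch requests and long polling.
import boto3

sqs = boto3.client('sqs', region_name='us-east-1')
queue_url = sqs.get_queue_url(QueueName='example-queue')['QueueUrl']

# Item 14: one batched request is billed once but carries 10 messages.
sqs.send_message_batch(
    QueueUrl=queue_url,
    Entries=[{'Id': str(i), 'MessageBody': 'payload-%d' % i}
             for i in range(10)])

# Item 15: wait up to 20s for messages instead of hammering an empty queue.
response = sqs.receive_message(
    QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=20)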
62. 21. Turn off Detailed CloudWatch Monitoring if Not Needed
22. Consider moving CloudWatch Linux data to a cheaper service
(Librato, self-hosted Graphite, etc.)
23. Look at Trusted Advisor Reports
24. Delete Unattached Volumes
25. Right Size Low Utilization (CPU/Memory) instances, move to
smaller instances
26. Consider moving legacy instances to current instance types
(more powerful and at a lower cost)
27. Modify your setup to eliminate unneeded Load Balancers
28. Convert to Spot and/or Fleet Instances (Bidding Strategies)
29. Monitor Unused Reservations
30. Move CloudWatch alarms/tracking elsewhere
63. 31. Optimize CloudFront (do you need to be close to all of the edges?)
32. Move into VPC
33. Use Placement Groups
34. Use Docker, Consolidate Containers onto fewer instances
35. Pay attention to EIPs
36. Know/Understand your EMR usage and expectations
37. Pay attention to Data Transfer costs
38. Use the Right Storage: S3, Normal or Reduced Redundancy, Glacier,
Auto-Delete Policies, etc.
39. Leverage Services (CloudSearch, DynamoDB, Lambda, ElastiCache,
etc.)
40. Set Termination by ASG to be "Closest to Instance Hour" (Saves 10-
15%; see the sketch after this list)
41. Use "burstable" instances when appropriate (when the fit is good you
can save 20-50% going from m3.medium or c3.large to t2.medium)
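For item 40, the API name for "Closest to Instance Hour" is the ClosestToNextInstanceHour termination policy; a one-call sketch assuming boto3 and an example group name:

# Sketch: terminate the instance nearest the end of its billed hour.
import boto3

autoscaling = boto3.client('autoscaling', region_name='us-east-1')
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName='qa-web-asg',   # example name
    TerminationPolicies=['ClosestToNextInstanceHour'])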
64. Incremental fixes; Rome wasn't built in a day
Review data periodically
Engage developers in the process(es)
Create a culture of cost awareness
Have the users of a resource own some of the
responsibility for its costs
Get some cost-data visibility to stakeholders daily
Customize cost data for each stakeholder's needs
Cost isn't everything; get metrics that compare cost to
subscribers, pageviews, customers, API calls, URLs processed.
Increased usage means increased costs, and if traffic means
revenue, that can be very good.
Editor's notes
The basic goal here is to show some of the things we did to reduce our costs by nearly 90%. We are all in with AWS and so we use quite a few services from AWS. Your mileage may vary.
This is AWS Cost Explorer with Subscription Charges
Here’s the same Time Period using Netflix ICE (Teevity)
You can see we went from a high of around $226K down to around $25K, or $130K to $25K if we don't include reservation costs.
So we are still adding instances at this point but managed to show a decrease of 18%.
Also, some reservations were in the wrong AZ, and we had mixed Spot and On-Demand instances for similar tasks
Also still adding instances
While the number of instances dropped significantly (50%), the cost savings were more like 30%, since the larger instances were in production.
Over time, the result is that the QA/dev instances are now off unless a developer needs them
A cron job shuts down running instances each night; the developer brings them back up on demand
Shutdown means we stop paying for the instance, but since we use ASGs and CloudFormation for setup, we still pay for the load balancers
The "usage" image exactly duplicates the total graph. The reads and writes differ; they are much closer together (writes ~= reads)
This was hard to find because most of the costs were spread across EC2 instances and we had multiple projects going on. We knew there was an increase, but not really how much. In the end, we had no visibility into some of the EMR runs because they are only accessible from the site-admin root account, not from an IAM account. Further, EMR instances were not tagged in a way that made them easy to identify. We are still having issues setting alerts for a volatile system with a large standard deviation; we need to do it by product, or even by Stack/EMR/etc.