5. • Bank with diverse software and hardware landscape
• Cost-driven IT
• Traditional software development: design, build, test, implement
• Software strategy: buy before build
• Middleware strategy: buy
• Hardware strategy: appliance
History up to 2.5 years ago within ING
6. • Bank with diverse software and hardware landscape
• IT and time-to-market are important
• 60 scrum teams internally working on software
• Software strategy: build before buy (a lot of time)
• Middleware strategy: buy but…
• Hardware strategy: standard scalable stacks
From 2.5 years ago up to now
8. • Internet facing reverse proxies (IBM TAM WebSeal)
Authenticating proxy
Content caching and compression
Cookie jar functionality
• Multiple layers of load balancers (F5 BigIP)
Over data centers
Over nodes in different network zones
For all internet facing domains of domestic banking Netherlands
Infrastructure to replace
9. • Investigate open source software: NGINX or Apache vs IBM WebSeal / F5
• Perform a proof of concept with NGINX for Authentication and Event Publishing
• Write a report for deciding architects which concluded after proof of concept:
Replace IBM TAM WebSeal with NGINX using custom modules
Integrate the layers of F5 BigIP’s with NGINX
The result: “GO!” Now we are more in control than ever.
The Plan to Simplify
11. Working towards
[Architecture diagram: external traffic passes an F5 load balancer into tier 1 (DMZ), where an NGINX authenticating proxy provides the authentication interface; a second NGINX tier load-balances to the applications; an inter-connectivity cloud connects the data centers.]
20. Control in…
• Integrate Authentication and Event Messaging module from PoC
• Add missing cookie jar functionality
• Add load balancing persistency over data centers
• Add dynamic service discovery so teams can self-service endpoints
• Integrate existing (Java) Continuous Delivery Pipeline
• Monitor system resource usage and errors to Graphite
• Add Grafana dashboards and mobile alerts for team dashboards
• Monitor and report upstream errors to Tivoli Omnibus (MCR)
• Make performance data and reports available to all scrum teams
[Gauges: Functionality, Time-to-Market, Operational Monitoring, Control]
21. • First step: Integrate into the Continuous Delivery Pipeline
• From Git to production
• Second step: Add additional functionality to NGINX
• Future roadmap of the NGINX authenticating proxy environment
Roll-out planning
22. • Using standard open-source tools like:
Git, Jenkins, Maven, Nexus, Docker, Valgrind, Python
• And closed-source tools like:
Nolio (deployments), Fortify (static source code analysis)
First step: integrate in continuous delivery pipeline
26. By packaging all our own modules
And adding the nginx.org source from our Nexus repository
And 3rd-party source modules from our Nexus repository
As a tar.gz file
32. Using a Python test framework
To easily create test cases for the binary and modules
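The test framework itself is internal; purely as an illustration of the pattern, here is a minimal stand-alone sketch in Python — a declarative test case run against an HTTP endpoint, with a stub server standing in for the NGINX binary under test (the `StubProxy` class and `X-Proxy` header are invented for this sketch):

```python
import threading, urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in upstream: in the real framework the system under test is an
# NGINX build with custom modules; a trivial HTTP server takes its place.
class StubProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("X-Proxy", "nginx-stub")
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):  # keep test output quiet
        pass

def run_case(base_url, path, expect_status, expect_header):
    """One declarative test case: request a path, assert on the reply."""
    with urllib.request.urlopen(base_url + path) as resp:
        assert resp.status == expect_status, resp.status
        name, value = expect_header
        assert resp.headers.get(name) == value, resp.headers.get(name)
    return True

# Start the stub on an ephemeral port and run one case against it.
server = HTTPServer(("127.0.0.1", 0), StubProxy)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"
assert run_case(base, "/health", 200, ("X-Proxy", "nginx-stub"))
```

Expressing cases as data (path, expected status, expected headers) is what makes it "easy to create test cases": adding coverage for a new module directive is one line, not a new script.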
33. The RPMs and test results are uploaded to a Nexus repository
Together with Nolio deployment scripts
After which Jenkins triggers an automatic Nolio deployment in LCM
34. Each commit in “develop” also starts a Jenkins job that
Triggers the Valgrind tests on all modules
And emails the results on failures
35. Each commit in “develop” also starts a nightly Jenkins job that
Starts a Fortify scan for static source code analysis
On all our own modules, the NGINX code and all 3rd-party modules used
36. Releases on “master” trigger a build in Jenkins
Using Apache Maven release profile
Where versioned artifacts are uploaded to Nexus
37. Configuration releases on “master” trigger a build in Jenkins
Where the correct nginx.conf and site information is created
38. And SQL is used to create a list of URL endpoints
And their module directives
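A hedged sketch of that step: the real endpoint administration and its schema are internal, so the table, the `auth_required` directive and the pool names below are all invented; sqlite3 stands in for the actual database. The point is only the shape — rows in, NGINX `location` blocks out:

```python
import sqlite3

# Hypothetical schema standing in for the real endpoint administration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE endpoint (uri TEXT, upstream TEXT, directive TEXT);
INSERT INTO endpoint VALUES
  ('/payments', 'payments_pool', 'auth_required on'),
  ('/public',   'content_pool',  'auth_required off');
""")

def render_locations(conn):
    """Turn endpoint rows into NGINX location blocks."""
    rows = conn.execute(
        "SELECT uri, upstream, directive FROM endpoint ORDER BY uri")
    blocks = []
    for uri, upstream, directive in rows:
        blocks.append(
            f"location {uri} {{\n    {directive};\n"
            f"    proxy_pass http://{upstream};\n}}")
    return "\n".join(blocks)

print(render_locations(conn))
```

Generating the configuration from a single source of record is what lets the Maven plugin on the next slide emit identical, versioned config files for every environment.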
39. Using a Maven plugin to create the correct configuration files
41. So it can be automatically deployed to LCM via Nolio by Jenkins
42. • LCM DEV + TST environment for internal team tests
• DEV + TST for integration tests for all other teams
• ACC for pre-production tests
Daily load tests using LoadRunner & performance reports using Python, LaTeX and gnuplot
Weekly resilience tests
Unplanned Simian Army tests
Run “perf” tests for NGINX profiling (if a change requires it)
Penetration and security tests
• Multiple PRD environments in different data centers
Replaced all IBM WebSeal reverse proxies with NGINX
Starting to replace all F5 BigIP internal load balancers with NGINX load balancer module
The result…
43. • Using “perf” we analyzed the binary under load (~500 URI/sec)
Optimizing the result
Numbers 1, 3, 8 and 11 are GZIP compression
Number 2 is memset => hard to pinpoint due to generic use
Number 4 is the network driver => cannot change
Number 5 is cookie header parsing, triggered by our code
Number 6 is the OS
Number 7 is Kafka CRC32 code
Number 9 is memcpy => hard to pinpoint due to generic use
Number 10 is caused by the audit system => cannot change
Number 20 is the first of our own methods listed
44. • GZIP is expensive on the CPU, use optimized libraries when possible
• Use static linking when replacing the patched library cannot be done on the target machine
• Two patches available, from Intel and Cloudflare
Compression level 5
Source: https://www.snellman.net/blog/archive/2014-08-04-comparison-of-intel-and-cloudflare-zlib-patches.html
Include optimized libraries
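The level trade-off behind choosing compression level 5 can be seen with stock zlib alone — Python's `zlib` module wraps the same library the Intel and Cloudflare patches target (without the patches), and the sample payload below is made up for the sketch:

```python
import time
import zlib

# Repetitive HTTP-like payload, invented for this illustration.
data = b"GET /api/accounts HTTP/1.1\r\nHost: example.test\r\n" * 2000

for level in (1, 5, 9):
    t0 = time.perf_counter()
    out = zlib.compress(data, level)
    dt = (time.perf_counter() - t0) * 1000
    assert zlib.decompress(out) == data  # lossless round trip
    print(f"level {level}: {len(out)} bytes in {dt:.2f} ms")
```

Higher levels buy slightly smaller output for disproportionately more CPU; level 5 is the commonly cited sweet spot, which is why the proxies run it and why the remaining cost is attacked with the patched, statically linked library instead of a higher level.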
45. • Some libraries are not available on the target machine (Kafka, MaxMind, Protobuf)
• Some libraries are too old on the target machine (PCRE3 – for JIT)
• CPU-optimized versions are added in the Docker image and statically linked
Patching libraries for performance
46. • Our five most important home-made modules
Cookie jar module – store Set-Cookie operations in reverse proxy
WebSeal module – Authentication module based on Extended Authentication Interface (EAI)
Kafka module – Send Event Messages from proxy layer to other systems
Load balancing – Rule based upstream use, allow dynamic service discovery
Monitoring module – Monitor application use and system resource usage
Second step: Add additional functionality to NGINX
47. • Uses two levels of RB trees to store state
• Highly configurable
• Uses timers for automatic expiration and cleanup
• Uses shared memory to share state between workers
Cookie jar module
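The module itself is C inside NGINX; as a behavioral sketch only, here is the cookie-jar idea in Python. The real module keeps the two-level structure (session → cookie name) in RB trees in shared memory with timer-driven cleanup; plain dicts and an explicit `now` stand in for both here, and the class and TTL are invented:

```python
import time

class CookieJar:
    """Toy model of the cookie-jar proxy behavior: the proxy strips
    Set-Cookie operations from upstream responses, stores them per
    session, and re-attaches the cookies on later requests, so they
    never reach the client's browser."""

    def __init__(self, ttl_seconds=900):
        self.ttl = ttl_seconds
        # session id -> {cookie name -> (value, expiry)}: two levels,
        # mirroring the module's two levels of RB trees.
        self.sessions = {}

    def store(self, session_id, name, value, now=None):
        now = time.time() if now is None else now
        self.sessions.setdefault(session_id, {})[name] = (value, now + self.ttl)

    def cookies_for(self, session_id, now=None):
        """Return live cookies for a session, dropping expired entries
        (the real module expires them via timers instead)."""
        now = time.time() if now is None else now
        jar = self.sessions.get(session_id, {})
        self.sessions[session_id] = {n: (v, e) for n, (v, e) in jar.items() if e > now}
        return {n: v for n, (v, e) in jar.items() if e > now}
```

Keeping the jar server-side is what makes backend cookies invisible to clients while sessions still behave normally through the proxy.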
48. • Uses an RB tree to store session state
• Allows access based on different policies (fine- or coarse-grained)
• Uses timers for automatic expiration and cleanup
• Uses shared memory to share state between workers
• Implements the EAI interface to allow gradual migration
WebSeal module
49. • Publish Events for monitoring and error analysis
• Highly configurable using a separate json config file
• Fast and asynchronous to avoid processing overhead
Event Publishing (Kafka) module
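"Fast and asynchronous" here means the request path never waits on the broker. A minimal Python sketch of that fire-and-forget pattern — the real module is C publishing to Kafka; the `transport` callable, queue size and load-shedding counter below are all assumptions of the sketch:

```python
import json
import queue
import threading

class EventPublisher:
    """Sketch of non-blocking event publishing: the request path only
    enqueues; a background worker drains the queue and ships events
    (to Kafka, in the real module)."""

    def __init__(self, transport, max_pending=10000):
        self.q = queue.Queue(maxsize=max_pending)
        self.transport = transport  # injected so the sketch stays self-contained
        self.dropped = 0
        threading.Thread(target=self._drain, daemon=True).start()

    def publish(self, event):
        try:
            self.q.put_nowait(json.dumps(event))  # never block the request path
        except queue.Full:
            self.dropped += 1                     # shed load instead of stalling

    def _drain(self):
        while True:
            self.transport(self.q.get())
            self.q.task_done()
```

Bounding the queue and counting drops is the design choice that keeps a slow or unreachable broker from ever adding latency to a banking request.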
50. • Use specific upstream servers based on rules (e.g. confidence test)
• Allow static load balancing over data centers for stateful applications
• Allow TCP connection re-use, using pools
• Integration with monitoring module to allow monitoring via MCR
Load balancing module
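The rule-first-then-hash shape of that module can be sketched in a few lines of Python; the header name, pools, addresses and the md5-based placement below are all invented for the illustration (the real module is C inside NGINX):

```python
import hashlib

# Hypothetical upstream pools per data center.
DATACENTERS = {"dc1": ["10.0.1.1:8080", "10.0.1.2:8080"],
               "dc2": ["10.0.2.1:8080", "10.0.2.2:8080"]}

def pick_upstream(session_id, headers):
    """Rules run first (e.g. a confidence-test opt-in steers traffic to
    a canary pool); otherwise the session id is hashed so a stateful
    session sticks to one data center."""
    if headers.get("X-Confidence-Test") == "1":
        return "canary", "10.0.9.1:8080"  # hypothetical canary node
    # Static persistence over data centers: hash the session id.
    dcs = sorted(DATACENTERS)
    digest = int(hashlib.md5(session_id.encode()).hexdigest(), 16)
    dc = dcs[digest % len(dcs)]
    pool = DATACENTERS[dc]
    return dc, pool[digest % len(pool)]
```

Because placement is a pure function of the session id, every proxy node computes the same answer with no shared routing state — which is what makes static load balancing of stateful applications over data centers possible.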
51. • Read variables from other modules to monitor
• Create and expose variables with system resources to monitor
• Use UDP or TCP to transfer monitor data to Graphite
• Integration with Tivoli Omnibus to allow monitoring via MCR
Monitoring module
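The Graphite side of this is a genuinely simple wire format: one `<path> <value> <timestamp>` line per metric, sent over UDP (or TCP) to port 2003. A small Python sketch — the metric names and prefix are made up; only the line format and the fire-and-forget UDP send reflect how such a module reports:

```python
import socket
import time

def graphite_lines(metrics, prefix="nginx", now=None):
    """Format metrics in Graphite's plaintext protocol:
    '<path> <value> <unix-timestamp>' per line."""
    ts = int(time.time() if now is None else now)
    return "".join(f"{prefix}.{name} {value} {ts}\n"
                   for name, value in metrics.items())

def send_udp(metrics, host="127.0.0.1", port=2003):
    """Fire-and-forget UDP send: low overhead, and a down Graphite
    host never stalls the proxy."""
    payload = graphite_lines(metrics).encode()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.sendto(payload, (host, port))
```

Choosing UDP by default trades delivery guarantees for zero backpressure on the request path, which is usually the right trade for high-frequency operational metrics.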
53. • Add WAF modules
• Fully implement dynamic service discovery to dynamically add/remove URIs and upstream servers
• Implement cross-datacenter persistency for the cookie jar
Future roadmap of the NGINX authenticating proxy environment
54. • Remove manual work in development and testing ASAP
• NGINX has a lot of configuration optimization possibilities
TCP Socket/TCP options, caching, connection re-use, JIT, Threads, upstream zone, buffer settings, timeouts
• In own modules
Use Shared Memory for Session State (if needed), RB Trees, Thread pools, Timers and the event queue
Use atomic reference counters over shared mutex locks if possible
Use variables to pass data between modules
• In NGINX modules
Compression on content is CPU expensive!
Cookie lookups in modules are potentially CPU expensive
CRC32 is potentially CPU expensive
If using symmetric crypto, use types supported by the CPU (AES-NI), like AES-GCM/CTR
Lessons learned so far…
55. • Older stacks require more work to fully use all configuration options
Recompiled a newer GCC C compiler for strong stack-protector and CPU optimization options
Recompiled libz and statically linked the latest version, with the Intel performance patches added
Recompiled libpcre and statically linked the latest version for JIT, with CPU-optimized flags
Recompiled other libs which are not present in RHEL, with CPU-optimized flags
• Make monitoring highly configurable per site and fine-tune over time
• Use good monitoring dashboards
Combination of Graphite and Grafana works very well
Test which log data in error.log is required for good root-cause-analysis if an error occurs
• Take enough time to test
Performance tests under stress load with tools like “perf” give a lot of insight
Invest enough time in resilience tests and what key data is needed to monitor your system
All code that involves shared memory, locks, timers and configuration reloads takes more time to get right
Lessons learned so far…
56. And… NGINX is very fast, very efficiently coded and extremely fun to program for!
Lessons learned so far…
58. The opinions expressed in this publication are based on
information gathered by ING and on sources that ING deems
reliable. This data has been processed with care in our analyses.
Neither ING nor employees of the bank can be held liable for any
inaccuracies in this publication. No rights can be derived from the
information given. ING accepts no liability whatsoever for the
content of the publication or for information offered on or via the
sites. Author rights and data protection rights apply to this
publication. Nothing in this publication may be reproduced,
distributed or published without explicit mention of ING as the
source of this information. The user of this information is obliged
to abide by ING's instructions relating to the use of this
information. Dutch law applies.
www.ing.com
Disclaimer