NTT Resonant Inc., an NTT group company, operates the "goo" Japanese web portal and is a leading provider of Internet services. NTT Resonant has been running OpenStack in production as its service infrastructure since October 2014. The infrastructure started with 400 hypervisors and now accommodates more than 80 services and over 1,700 virtual servers, serving 170 million unique users and 1 billion page views per month.
This deck shares the lessons we learned from that experience.
1. OpenStack at NTT Resonant:
Lessons Learned in Web Infrastructure
Tomoya Hashimoto, Business Platform Division, NTT Resonant Inc.
Kazuhiro Tooriyama, Business Platform Division, NTT Resonant Inc.
Toshikazu Ichikawa, NTT Software Innovation Center, NTT Corporation
2. Presentation Video
This slide deck was presented at OpenStack Summit Tokyo 2015.
A video recording of the presentation is available at
https://www.openstack.org/summit/tokyo-2015/videos/presentation/openstack-at-ntt-resonant-lessons-learned-in-web-infrastructure
3. Speakers
Tomoya Hashimoto
2010 - 2014: NTT Communications, development of ISP network (OCN)
2014 - current: NTT Resonant, engineer of the server platform
Kazuhiro Tooriyama
2001 - 2012: NTT Resonant, development and operation of core services (goo blog, Oshiete! goo Q&A service)
2012 - current: NTT Resonant, architect of the server platform
Toshikazu Ichikawa
2011 - 2014: Verio (NTT America), development of the cloud service "Cloudn" and a managed hosting service
2014 - current: NTT, development of the cloud service platform
4. Agenda
1. About NTT Resonant
2. OpenStack Infrastructure Design
3. VM setup by Puppet with OpenStack
4. Monitoring OpenStack and VMs
5. Current Issues and Future Plan
6. 1. About NTT Resonant
The NTT Group spans the regional communications, long distance and international communications, mobile communications, and data communications businesses, plus R&D:
• $112 billion in total revenue
• 240,000 employees worldwide
• #1 in data center floor space
• #2 in global IP backbone (source: TeleGeography)
All facts and figures accurate as of March 2014.
7. 1. About NTT Resonant
NTT Resonant's business area spans B2C services and platform/B2B2C services:
• Portal site and smartphone applications (goo milk feeder, goo disaster prevention application)
• Healthcare and disaster prevention/response solutions
• Phone Cloud / developer support
• e-commerce site for communications devices
8. 1. About NTT Resonant
Web portal site "goo" (http://www.goo.ne.jp/)
• Launched in 1997; 18 years old
• Providing 60+ services including
 – Web search
 – Blogging
 – News
 – Oshiete! goo Q&A site
• Other services: dictionaries, ZIP codes, laboratory, Bodycloud, housing and real estate, search, baby-care, movies, maps, navigation, horoscopes, rankings, car and bike, news, weather, healthcare, smartphone applications, blogs, job search, love and marriage, online store, travel
9. 1. About NTT Resonant
Scale of web portal "goo": how large is goo?
• The 3rd largest web portal in Japan (behind Yahoo! and Google; ahead of Rakuten and MSN)
• 170 million unique browsers per month
• 1 billion page views per month
Source: NetRatings, February 2015
11. 2. OpenStack Infrastructure Design
What was required of us for the OpenStack deployment
• Migrate to another data center within a limited timeframe
 – The termination date of the existing data center (DC) contract was fixed; we had to migrate our system to another DC by that time.
• Shorten the lead time for service releases
 – Speed up by replacing manual operations for creating and managing VMs
 – Be comparable to public cloud services such as AWS
• Support all workflows needed to provide a service
 – Not just introducing OpenStack, but building an infrastructure for web services
 – Not only VM creation, but also installation and configuration of software inside VMs
12. 2. OpenStack Infrastructure Design
Organization and formation
• NTT Resonant
 – Service teams: 300+ engineers (service developers for the 60+ services)
 – Platform team: ~10 engineers (design team plus platform service operation)
• Partners: operators (outsourced operation)
• NTT R&D: joint experiments with the design team
• OpenStack community: contributions from NTT R&D, distribution to us
13. 2. OpenStack Infrastructure Design
OpenStack deployment timeline with our services
• It was decided to migrate our services to another data center.
 – 2014/03: project started; requirement definition, then design and deployment of the OpenStack installation (about 6 months in total)
 – 2014/10: OpenStack ready, in production
 – 2014/10 - 2015/01 (4 months): 70 services, 1,300 VMs started; migration of services from the old existing environment completed, and the old environment closed
14. 2. OpenStack Infrastructure Design
OpenStack scale at NTT Resonant's main data center
• Using OpenStack as a private cloud
• In production since October 2014
• As of now, it supports
 – 80+ services
 – 1 billion page views per month
• With
 – 400 hypervisors (2 Nova cells)
 – 4,800 physical cores
 – 1,800+ virtual servers
15. 2. OpenStack Infrastructure Design
OpenStack components (Icehouse release) and what we use:
• Horizon (Dashboard)
• Neutron (Network): virtual router and LAN, virtual load balancer
• Nova (Hypervisor): virtual servers running our applications (APP on OS, per VM)
• Glance (Image): VM templates, image snapshots
• Cinder (Block Storage): virtual volumes
• Swift (Object Storage): RESTful file store with replication
• Keystone (Identity)
Shown but outside "what we use": Heat (Orchestration), Trove (Database services), Ceilometer (Telemetry)
16. 2. OpenStack Infrastructure Design
Deployment
• Distribution
 – RDO with CentOS 6
 – Icehouse release
• Automation
 – Puppet for configuration management
 – Thanks to the RDO community for the Puppet manifests
17. 2. OpenStack Infrastructure Design
Networking with Neutron
• Provider network with VLAN
 – No L3+ control by Neutron: routers, NAT, load balancers, and firewalls are managed outside OpenStack
• Using ML2 with the Linux bridge agent
 – We are familiar with it
• Service model
 – An administrator prepares networks and subnets per tenant
 – A tenant is not allowed to create/delete a network
• Close to "Scenario: Provider networks with Linux bridge" in the OpenStack Networking Guide [1]
• What we use: L2 (network, port) only; not L3 (router, NAT) or L4-7 (load balancer, VPN)
[1] http://docs.openstack.org/networking-guide/scenario_provider_lb.html
18. 2. OpenStack Infrastructure Design
Node types and HA (high availability) strategy

Node type             | OpenStack components                     | RabbitMQ (MQ) and MariaDB (database) | HAProxy (LB) and Pacemaker (HA cluster)
Top cell controller   | Nova, Glance, Keystone, Neutron, Horizon | RabbitMQ mirrored queue              | Nova, Keystone, Neutron, Horizon, DB, MQ
Child cell controller | Nova                                     | RabbitMQ mirrored queue              | MQ
Database              | N/A                                      | MariaDB Galera Cluster               | N/A
Swift proxy           | Swift, Glance                            | N/A                                  | Swift, Glance
Swift storage         | Swift                                    | N/A                                  | N/A
Compute               | Nova, Neutron                            | N/A                                  | N/A
19. 2. OpenStack Infrastructure Design
Contributions to the community related to this project
• This bug was a show-stopper for the project until we fixed it
 – Bug fix [1]: the shelve function didn't work in the Icehouse release with a nova-cells deployment
 – We use shelve/unshelve for hypervisor maintenance
• Some bugs we found and fixed
 – Security bug fix [2]: announced recently as OSSA 2015-017
 – 8 bug fixes other than the above
[1] "shelve api does not work in the nova-cell environment"
https://bugs.launchpad.net/nova/+bug/1338451
[2] "Deleting instance while resize instance is running leads to unuseable compute nodes"
https://bugs.launchpad.net/nova/+bug/1392527
20. 2. OpenStack Infrastructure Design
Customization on the dashboard
• We modified code to enforce our operation rules
 – We modified only Horizon
  • Users come through Horizon, not the API
  • What we implemented: server naming restrictions, access limits on the security group function, and about 40 items in total
 – No modification to other components except bug-fix backports
  • Minimizes the maintenance cost
(Screenshot: server creation dialog of Horizon)
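Restrictions like the server naming rule above are typically implemented as form validators in Horizon. As an illustration only (the actual rules, and the other ~40 items, are not shown in the slides), a hypothetical naming check might look like:

```python
import re

# Hypothetical naming rule: lowercase alphanumerics and hyphens,
# 3-30 characters, must start with a letter. The real restriction
# used at NTT Resonant is not disclosed in the slides.
SERVER_NAME_RE = re.compile(r"^[a-z][a-z0-9-]{2,29}$")

def validate_server_name(name):
    """Return True if the name satisfies the (illustrative) naming rule."""
    return bool(SERVER_NAME_RE.match(name))
```

In Horizon such a check would live in the launch-instance form's clean method, so that non-conforming names are rejected before the Nova API is ever called.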
22. 3. VM setup by Puppet with OpenStack
Issue of VM setup (installation and configuration)
• Only 4 months from VM creation to service migration
 – Time for VM setup was limited
 – 1,300 VMs needed to be migrated onto OpenStack
 – Procedures had to be automated as much as possible
• The key was the Puppet manifests used at the existing data center (DC)
 – We used Puppet manifests to set up VMs at the existing DC
• Making a bridge between OpenStack and Puppet
 – The goal was to set up our services on top of OpenStack quickly and easily
We resolved this issue by integrating Puppet with OpenStack.
23. 3. VM setup by Puppet with OpenStack
How we use Puppet with OpenStack
• Our Puppet design
 – An individual Puppet master per tenant
 – Manages Linux accounts, middleware, config files, etc.
 – A single manifest repository (SVN)
• What is required to use Puppet
 – Host names can be resolved with DNS
 – Host groups are defined in LDAP
 – The Puppet manifest has an entry for each host group
(Diagram: tenant A's user and VMs A and B on OpenStack, with the tenant's Puppet master, SVN, and the necessary DNS and LDAP services)
24. 3. VM setup by Puppet with OpenStack
How we use Puppet with OpenStack (continued)
• Synchronization tool
 – Polls the Nova API to detect new VMs
 – Registers each VM in DNS, LDAP and the Puppet manifest
 – Completes these steps every 5 minutes
• An OpenStack user can apply a Puppet manifest easily and quickly, right after creating a VM
(Diagram: the synchronization tool polls the Nova API and adds entries to DNS, LDAP and the manifest for tenant A's new VMs)
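The synchronization tool itself is not shown in the slides; the following is a minimal sketch of its core polling/diff step, with the Nova API listing and the DNS/LDAP/manifest registration stubbed out as plain callables:

```python
"""Minimal sketch of the VM synchronization tool described above.

The real tool polls the Nova API every 5 minutes and registers new VMs
in DNS, LDAP and the Puppet manifest repository. Here those external
systems are stubbed out; only the detection logic is shown.
"""

def find_new_vms(known, current):
    """Return VMs present in the latest Nova listing but not yet registered."""
    return {name: ip for name, ip in current.items() if name not in known}

def sync_once(known, list_nova_vms, register):
    """One polling cycle: detect new VMs and register each of them."""
    for name, ip in find_new_vms(known, list_nova_vms()).items():
        register(name, ip)  # would add a DNS record, LDAP host group, manifest entry
        known[name] = ip
    return known
```

In production, `list_nova_vms` would call `GET /servers/detail` on the Nova API, and a scheduler (cron, or a loop with `time.sleep(300)`) would run `sync_once` every 5 minutes.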
25. 3. VM setup by Puppet with OpenStack
Outcome of the VM setup framework with OpenStack
• Drastically shortened timeline and a more efficient workflow
 – 1,000 VMs deployed for services within 1 month
 – Only 30 minutes from VM creation to service start
  • It took 5 business days without OpenStack
 – Eliminated the workload of two operators by reducing manual operation
• A common process to build service environments
 – Service engineers don't need to worry about the environment and can focus on their business
27. 4. Monitoring OpenStack and VMs
Overview of our monitoring environment
• Two monitoring environments
 1. For the cloud infrastructure
  • Network, physical servers, and OpenStack itself
 2. For web services
  • Providing standard service monitoring methods on the private cloud
• Tools and roles
 – Zabbix
  • Semi-automatic VM monitoring
 – Redmine and Wiki
  • As an issue (ticket) management system
  • Automatically issues 1 ticket per trouble
 – Operation center
  • 24/7 monitoring and escalation by phone
  • First response to simple incidents
(Diagram: the operation center watches 24/7, tickets are issued automatically, and the infra team (us) and web service teams are called in case of trouble, provisioning, or serious situations)
28. 4. Monitoring OpenStack and VMs
1. For the cloud infrastructure (OpenStack monitoring)
• In order of severity
 – API monitoring
  • keystone-api, nova-api, neutron-api, Horizon GUI, glance-api, swift-proxy
  • Failures here are quite serious
 – Process failure detection
  • nova-*, swift-*, keystone-*, rabbitmq-server, mysqld (MariaDB), etc.
 – Process performance monitoring
  • Depends on the middleware
  • e.g. the number of MySQL connections
 – Log monitoring
  • Treat any log message at ERROR level or above as "trouble" from the beginning
   – Lack of knowledge leads to doubt
  • Filter out problem-free logs day by day
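The log-monitoring approach above (treat everything at ERROR or above as trouble, then whitelist benign messages day by day) can be sketched as follows; the benign pattern shown is illustrative, not from the actual deployment:

```python
import re

# Patterns for log lines confirmed harmless, grown day by day as
# operational knowledge accumulates. This entry is an example only.
BENIGN_PATTERNS = [
    re.compile(r"ERROR .*Connection reset by peer"),
]

SEVERITIES = ("ERROR", "CRITICAL")

def is_trouble(line):
    """Alert on ERROR-and-above messages unless whitelisted as benign."""
    if not any(sev in line for sev in SEVERITIES):
        return False
    return not any(p.search(line) for p in BENIGN_PATTERNS)
```

A monitoring agent (Zabbix's log item, in this deployment) would apply a check like this to each new line and raise a ticket only when `is_trouble` returns True.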
29. 4. Monitoring OpenStack and VMs
Problem: complicated logs (Icehouse release)
• What's this?
 – The log messages OpenStack emits when launching one virtual machine:
  223 lines, 119,698 characters (only 24 lines without the DEBUG level)
30. 4. Monitoring OpenStack and VMs
Problem: complicated logs (Icehouse release)
• Analyzing without DEBUG logs
 – In a case of failure to create a new instance:

2015-07-XX 17:00:YY TopCellController INFO nova.osapi_compute.wsgi.server
172.X.X.X "GET <API_URL>/servers/<VM-UUID> HTTP/1.1" status: 200
-> Accepting the request to create a new instance

2015-07-XX 17:00:YY TopCellController INFO nova.scheduler.filter_scheduler Attempting to build 1 instance(s)
-> Just reporting

2015-07-XX 17:00:YY ChildCellController WARNING nova.scheduler.driver [instance:<VM-UUID>] Setting instance to ERROR state.
-> The beginning of a sleepless night

2015-07-XX 17:00:YY ChildCellController INFO nova.filters Filter DiskFilter returned 0 hosts
-> Lack of free disk? Where in the processing sequence did this happen?

For newbies, it's not friendly.
31. 4. Monitoring OpenStack and VMs
Problem: complicated logs (Icehouse release)
• Analyzing DEBUG logs
 – In a case of failure to create a new instance:

2015-07-XX 17:00:YY ChildCellController DEBUG nova.filters Filter RamFilter returned 88 host(s) get_filtered_objects /usr/lib/python2.6/site-packages/nova/filters.py:88
-> report: enough memory

2015-07-XX 17:00:YY ChildCellController DEBUG nova.scheduler.filters.disk_filter
(<hypervisor-name>) ram:46581 disk:731136 io_ops:0 instances:3
does not have 1433600 MB usable disk, it only has 731136.0 MB usable disk.   (* 88 times)
-> report: not enough disk

2015-07-XX 17:00:YY ChildCellController INFO nova.filters Filter DiskFilter returned 0 hosts
-> Lack of free disk space; we need to add more disks quickly.

The DEBUG log shows the internal processing, but it's quite scruffy.
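The filter behavior in these logs (88 hosts pass RamFilter, then DiskFilter returns 0 hosts) can be reproduced with a toy version of the scheduler's filter chain; the host counts and sizes below mirror the log excerpt, but the code is a simplification of nova-scheduler, not its real implementation:

```python
def ram_filter(hosts, ram_mb):
    """Keep hosts with enough free RAM for the requested instance."""
    return [h for h in hosts if h["free_ram_mb"] >= ram_mb]

def disk_filter(hosts, disk_mb):
    """Keep hosts with enough usable disk for the requested instance."""
    return [h for h in hosts if h["usable_disk_mb"] >= disk_mb]

def schedule(hosts, ram_mb, disk_mb):
    """Apply filters in sequence, as the filter scheduler does.

    An empty result means no valid host: nova sets the instance
    to ERROR state, as seen in the WARNING line above.
    """
    for filt, requested in ((ram_filter, ram_mb), (disk_filter, disk_mb)):
        hosts = filt(hosts, requested)
    return hosts
```

With 88 hosts that each have 46,581 MB of free RAM but only 731,136 MB of usable disk, a request for 1,433,600 MB of disk passes RamFilter on all 88 hosts and then fails DiskFilter on every one of them.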
32. 4. Monitoring OpenStack and VMs
Log request ID mapping
• A new function to trace logs easily, even across components
 – Targets: nova, cinder, glance, neutron, keystone, etc.
• Current situation
 – Each component issues its own request ID per request
 – Tracing logs requires mapping request IDs between components
 – Finding the IDs is difficult
 – e.g. creating a new volume from an image (cinder calls the glance API)
• NTT's suggestion
 – Log the request ID mapping in one line in each caller
 – Approved as a cross-project spec, to be implemented
  • https://review.openstack.org/#/c/156508

Proposed caller-side log line (cinder-volume), mapping both IDs in one line:

2015-10-08 16:14:33.498 DEBUG cinder.volume.manager [req-A admin] image download from glance req-B

Today, req-B is buried deep in the DEBUG output:

cinder-volume (glanceclient response log):
2015-10-08 16:14:33.521 DEBUG glanceclient.common.http [req-A admin]
HTTP/1.1 200 OK
content-length: 0
x-image-meta-status: active
x-image-meta-owner: 46e99ee00fd14957b9d75d997cbbbcd8
…
x-openstack-request-id: req-B
…
x-image-meta-disk_format: ami log_http_response /usr/local/lib/python2.7/dist-packages/glanceclient/common/http.py:136
…

glance-api:
2015-10-08 16:14:33.517 11610 DEBUG glance.registry.client.v1.client [req-B 924515e485e846799215a0c9be9789cf 46e99ee00fd14957b9d75d997cbbbcd8 - - -] Registry request GET /images/c95a9731-77c8-4da7-9139-fedd21e9756d HTTP 200 request id req-req-5cb606e5-ea1c-4afc-a626-a4deb83c56a1 do_request /opt/stack/glance/glance/registry/client/v1/client.py:124
2015-10-08 16:14:33.520 11610 INFO eventlet.wsgi.server [req-B 924515e485e846799215a0c9be9789cf 46e99ee00fd14957b9d75d997cbbbcd8 …

Buried deep!
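Until the spec lands, the req-A-to-req-B mapping has to be dug out of the DEBUG logs by hand. A rough sketch of that extraction (the regexes are assumptions based on the log format shown above):

```python
import re

# The caller's request ID appears in brackets ("[req-A admin]"); the
# callee's ID comes back in the x-openstack-request-id response header
# echoed by the HTTP client's DEBUG log.
CALLER_RE = re.compile(r"\[(req-[\w-]+)")
CALLEE_RE = re.compile(r"x-openstack-request-id:\s*(req-[\w-]+)")

def map_request_ids(log_lines):
    """Return {caller_request_id: callee_request_id} pairs found in the log."""
    mapping = {}
    caller = None
    for line in log_lines:
        m = CALLER_RE.search(line)
        if m:
            caller = m.group(1)
        m = CALLEE_RE.search(line)
        if m and caller:
            mapping[caller] = m.group(1)
    return mapping
```

The approved spec makes this scraping unnecessary by having each caller log both IDs on a single line.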
33. 4. Monitoring OpenStack and VMs
2. For web services - changing the operational workflow
• We've been providing a standard monitoring system inside our company
 – A standardized monitoring workflow for internal service developers
  • Standard monitoring item sets and rules
  • Alert thresholds for parameters
 – Monitoring configuration entered into Zabbix (or Nagios) by hand
• Rethinking the monitoring scheme with OpenStack
 – Over 1,000 virtual machines are created, and also suddenly disappear
 – By hand? Not feasible.
 – We gave our Zabbix a new function
  • Detecting new VMs and starting monitoring semi-automatically
• Before getting along with OpenStack...
 – Reconsider your current workflow thoroughly for efficient operation
35. 5. Current Activity and Future Plan
Current activity
• Changing sizing and improving VM density
 – The initial flavors were designed with the migration project in mind
  • Compatibility with the old DC rather than resource efficiency
  • Keeping VM specs identical to the old DC was best for the migration plan
 – Current usage
  • Disk capacity is excessive
   – Design: 37 GB of disk per 1 GB of memory
   – Actual usage: 7 GB of disk per 1 GB of memory
 – Providing new flavors based on actual usage, and asking users to return unused disk capacity
 – Doubling the physical memory of servers
 – Aiming to increase VM density by 1.3 - 2 times
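The effect of the re-sizing can be sanity-checked with simple arithmetic. In this sketch the hypervisor capacities are hypothetical; only the 37 GB and 7 GB disk-per-memory ratios come from the slide:

```python
def vm_capacity(host_ram_gb, host_disk_gb, vm_ram_gb, disk_per_ram_gb):
    """VMs per hypervisor, bounded by whichever of RAM or disk runs out first."""
    vm_disk_gb = vm_ram_gb * disk_per_ram_gb
    return min(host_ram_gb // vm_ram_gb, host_disk_gb // vm_disk_gb)

# Hypothetical hypervisor: 256 GB RAM, 4,000 GB disk, hosting 4 GB-RAM VMs.
old = vm_capacity(256, 4000, 4, 37)  # old flavors: disk-bound
new = vm_capacity(256, 4000, 4, 7)   # new flavors: RAM-bound
```

With these made-up host specs, slimming the flavors moves the bottleneck from disk to RAM, and doubling the hosts' physical memory then doubles the RAM-bound capacity as well, which is in line with the stated 1.3-2x density target.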
36. 5. Current Activity and Future Plan
Future plan
• Upgrade OpenStack
 – Load Balancer as a Service (LBaaS) is desired
  • Current: manual operation of the load balancer
  • LBaaS API v1 is not enough
  • Waiting for our vendor's driver for LBaaS API v2
 – Establish an upgrade procedure
  • We need to apply our patches
  • We need to develop and test those patches
  • These prevent us from upgrading frequently
 – Mitaka release
  • NTT R&D is located in Mitaka
37. Summary
1. About NTT Resonant, operator of the web portal site "goo"
 • 170 million unique browsers and 1 billion page views per month
2. OpenStack infrastructure design
 • It increased our business speed and agility
 • We successfully deployed 400 hypervisors in 6 months
 • Stable in production for more than 1 year
3. VM setup by Puppet with OpenStack
 • We could start 70+ services on 1,300 VMs in 4 months
 • It shortened the time to deploy a service from 5 days to 30 minutes
4. Monitoring both OpenStack and VMs, with Zabbix
5. Current activity and future plans
 • Current: sizing to improve VM density
 • Future: upgrades, LBaaS and more, toward the Mitaka release
38. Appendix: Our monitoring environment
TIP: semi-automatic monitoring setup
The Zabbix server polls VMs, reads each VM's monitoring.conf, and then applies the specified templates.

(1) The Zabbix server polls the IP segments of OpenStack VMs (X.Y.Z.0/24), discovers Zabbix agents on new VMs (auto discovery), and registers each as a monitoring target.
(2) It gets the monitoring definition (monitoring.conf) via the agent.
(3) It applies the monitoring templates corresponding to monitoring.conf.
(4) When trouble is caught, a script is kicked for automatic ticket issuing, sending a request to the Redmine API.

A sample monitoring.conf on a new VM lists: apache_prod, mysql_prod, linux_prod, alert_on

Examples of keywords:
• apache_prod = Apache in production monitor
• apache_dev = Apache in development monitor
• linux_prod = Linux OS in production monitor
• alert_on = sending alerts to the VM users
• alert_off = maintenance (silent) mode
…
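Step (3) above, mapping monitoring.conf keywords to monitoring templates, can be sketched as follows; the keyword-to-template mapping echoes the slide's examples, while the parsing details are assumptions:

```python
# Mapping from monitoring.conf keywords to Zabbix template names.
# The names follow the slide's examples; mysql_prod is assumed by analogy.
TEMPLATES = {
    "apache_prod": "Apache in production monitor",
    "apache_dev": "Apache in development monitor",
    "mysql_prod": "MySQL in production monitor",
    "linux_prod": "Linux OS in production monitor",
}

def parse_monitoring_conf(text):
    """Return (templates to apply, alerting enabled) for one VM's config."""
    keywords = [line.strip() for line in text.splitlines() if line.strip()]
    templates = [TEMPLATES[k] for k in keywords if k in TEMPLATES]
    # alert_off (maintenance/silent mode) overrides alert_on.
    alert = "alert_on" in keywords and "alert_off" not in keywords
    return templates, alert
```

In the real setup, the Zabbix server fetches this file through the Zabbix agent and links the resolved templates to the discovered host, enabling alerting only when `alert_on` is set.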