IT Operations Use Case: Jennifer Green, R&D Scientist, Los Alamos National Security, LLC.
IT Operations Overview: Jon Rooney, Director, Developer Marketing, Splunk
2. Los Alamos National Laboratory
Los Alamos, New Mexico
“Delivering science and technology to protect our nation
and promote world stability” [www.lanl.gov]
Originally Site Y during WWII (DoD installation) for the
sole purpose of building the atomic bomb
Today, the town-site is open to public access, and
the Laboratory pursues a broad range of scientific
endeavors, but it still maintains focus on
– nuclear non-proliferation
– border security
– counter-nuclear and biological threats
3. LANL’s Areas of Scientific Research
• High-Energy and Applied Physics
• High-Performance Computing
• Dynamic and Energetic Materials Science
• Superconductivity
• Quantum Information
• Advanced Materials
• Bioinformatics
• Theoretical and Computational Biology
• Chemistry
• Earth and Environmental Science
• Energy and Infrastructure Security
• Engineering Sciences and Applications
• Nanotechnology
4. High Performance Computing Division
• Provide “super-sized” computers for numerically intensive / data-intensive computations
• Ensure our goals align with the Lab’s mission
• Provide state-of-the-art platforms that satisfy stakeholders’ requirements
National Security Mission: Nuclear Non-Proliferation; National Safety and Security
LANL’s Mission: Apply Scientific Excellence to National Security Missions
HPC’s Mission: Enable Scientific Discovery via World-Class High Performance Computing Resources
5. High Performance Computing @ LANL
The newest Tri-laboratory installation,
Trinity, is projected for 2015–2016. It is
expected to be the first platform large
and fast enough to begin to
accommodate finely resolved 3D
calculations for full-scale, end-to-end
weapons calculations.
To meet cooling requirements for the supercomputers in the SCC, LANL is decreasing its use of
city/well water for cooling towers and instead using water from LANL’s Sanitary Effluent Reclamation
Facility (SERF). From December 2012 through April 2013, LANL reduced its use of city/well water for
supercomputer cooling towers by roughly 50%.
[http://www.lanl.gov/asc/trinity-highlight.php]
6. Managing HPC Systems
We provide the first release
of the newest hardware &
software technologies:
– Specifically customized
– Often one-offs
– Simultaneously required to
be stable computational
resources
– Serve mission-critical
computations for National
Security initiatives
Hosting 15+ HPC Systems
• Three Disparate Security Network Partitions
• Compute / Data Intensive
• Networked File Systems
• Parallel File-systems
• Lustre
• Panasas
• Networking
• Fiber
• Ethernet
• Tape Archive
• Customer Support
• Programming Runtime and Environment
• Consult
• Application Readiness
• Theater + Interactive Visualization Systems
High Availability
High Reliability
High Serviceability
7. High Performance Computing Testing Strategy
Proactive Testing Improves
Reliability
Acceptance Testing
Integration Testing
Correctness Testing
Regression Testing
Performance Testing
Software Testing
Fault Tolerance Testing
Resilience Testing
Parameter Studies
[ omg that’s a lot of
testing ]
[ Computer Life Cycle ]
TEST:
• VALIDATION OF COMPUTATIONAL ACCURACY
• SUSTAINED PERFORMANCE
• IDENTIFY HARDWARE / SOFTWARE PROBLEMS
8. Stream Memory Bandwidth Test (McCalpin et al.)
• Performs 4 computations
• Measures main memory bandwidth per processor
• Triad is the money computation – indicates performance expected with typical scientific computations
• Expect tight performance; variances from baseline indicate a problem
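The four STREAM computations can be sketched as follows. This is an illustrative Python version, not McCalpin’s actual C/Fortran benchmark, so the numbers it reports reflect interpreter overhead rather than true main-memory bandwidth; what it shows is the kernels themselves and the bytes-moved accounting behind the MB/s figure.

```python
import time

def stream_kernels(n=200_000, scalar=3.0):
    """Toy versions of the four STREAM kernels (McCalpin).
    Returns a dict of per-kernel MB/s, computed as
    (arrays touched) * 8 bytes/double * n / elapsed."""
    a = [1.0] * n
    b = [2.0] * n
    c = [0.0] * n
    results = {}

    t0 = time.perf_counter()
    c = [ai for ai in a]                            # Copy:  c[i] = a[i]
    results["copy"] = (2, time.perf_counter() - t0)  # reads a, writes c

    t0 = time.perf_counter()
    b = [scalar * ci for ci in c]                   # Scale: b[i] = q*c[i]
    results["scale"] = (2, time.perf_counter() - t0)

    t0 = time.perf_counter()
    c = [ai + bi for ai, bi in zip(a, b)]           # Add:   c[i] = a[i]+b[i]
    results["add"] = (3, time.perf_counter() - t0)

    t0 = time.perf_counter()
    a = [bi + scalar * ci for bi, ci in zip(b, c)]  # Triad: a[i] = b[i]+q*c[i]
    results["triad"] = (3, time.perf_counter() - t0)

    return {k: arrays * 8 * n / sec / 1e6
            for k, (arrays, sec) in results.items()}
```

In practice the absolute numbers matter less than their stability: a node whose Triad figure drifts from the fleet baseline is a node worth examining.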
9. CPU / GPU Performance Testing
Floating Point Operations Per Second (FLOP/s) is the typical measure of computational performance
HPL – High Performance Linpack (Dongarra et al.)
FLOPs are “free” (theme of SC’09)
Enter HPCG
Scalable Heterogeneous Cluster Benchmark (Spafford)
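The FLOP/s metric itself is easy to illustrate. The sketch below is a toy, not HPL: it counts the 2·n³ floating-point operations of a naive dense matrix multiply and divides by wall time. HPL applies the same counting idea to a dense LU solve at full-machine scale.

```python
import time

def matmul_gflops(n=120):
    """Measure achieved FLOP/s on a naive n x n matrix multiply.
    A dense multiply performs 2*n**3 floating-point operations
    (one multiply and one add per inner-loop step)."""
    A = [[1.0] * n for _ in range(n)]
    B = [[2.0] * n for _ in range(n)]
    C = [[0.0] * n for _ in range(n)]
    t0 = time.perf_counter()
    for i in range(n):
        for k in range(n):
            aik = A[i][k]
            for j in range(n):
                C[i][j] += aik * B[k][j]
    elapsed = time.perf_counter() - t0
    return C, (2 * n**3) / elapsed / 1e9  # (result matrix, GFLOP/s)
```

HPCG instead stresses memory-bound sparse operations, which is why its achieved FLOP/s land far below HPL’s on the same machine.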
10. I/O: Perhaps the Biggest Bottleneck
PARALLEL I/O FOLLOWS PATTERNS
[ N TO N ] OR [ N TO 1] WRITES, READS:
• Hidden in these system calls are file open,
file close and stat operations
• Can add unknown overhead to the
operation
• Can create a burdensome load on file systems and overhead in the application if not
programmed optimally (e.g., open file handles, metadata overhead if too many files
are opened simultaneously)
• File-system testing helps to identify
potential failures, and load impacts on
running jobs
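The two write patterns named above can be sketched in miniature. This is a hypothetical single-process illustration (each “rank” is simulated in a loop), meant only to show why N-to-N multiplies file and metadata counts while N-to-1 funnels all ranks through one shared file.

```python
import os
import tempfile

def write_n_to_n(workdir, nranks, payload=b"x" * 1024):
    """N-to-N: every rank opens its own file -- simple, but N files
    means N opens, N closes, and N inodes of metadata per dump."""
    for rank in range(nranks):
        with open(os.path.join(workdir, f"dump.{rank}"), "wb") as f:
            f.write(payload)

def write_n_to_1(workdir, nranks, payload=b"x" * 1024):
    """N-to-1: all ranks write disjoint offsets of one shared file --
    one inode, but the ranks now contend for the same file."""
    path = os.path.join(workdir, "dump.shared")
    with open(path, "wb") as f:
        for rank in range(nranks):
            f.seek(rank * len(payload))
            f.write(payload)

with tempfile.TemporaryDirectory() as d:
    write_n_to_n(d, 4)
    write_n_to_1(d, 4)
    files = sorted(os.listdir(d))
```

Hidden inside either pattern are the open, close, and stat calls the slide mentions; file-system tests exercise both patterns to expose their different failure and load behaviors.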
[Chart: bursty file-system performance – baseline represented by yellow line]
11. Whole Machine Performance Overview
Wolf, a New Supercomputer, Up and Running at Los Alamos National Lab
http://machinedesign.com/news/wolf-new-supercomputer-and-running-los-alamos-national-lab
15. Other Splunk Uses at LANL
Licensed software product monitoring drives business decisions
System logs from all production resources
– Network server logs
– IB Monitoring Server logs
– Node logs
– Security
– Software and hardware inventory
– Monitoring for known failure signatures in logs
– Resource managers / schedulers
Customized plugin development to handle unique test visualization
GUI for home-grown tools
16. Why Splunk?
A centralized location to store and process large amounts of data
Close to “real-time” to intervene when problems occur
No database schemas to design to accomplish data correlation
Versatile – can be programmed to generate sophisticated graphs, tables, and reports
Splunk searching is easy – lots of functions and capabilities to handle statistical
measurements
Alerting function permits monitoring tasks
App / Dashboard / Reports permits different views tailored to audience
Ability to grow in size as required
20. Escalating IT Complexity…
SERVERS STORAGE NETWORKING
VIRTUALIZATION
INFRASTRUCTURE
APPLICATIONS
PACKAGED
APPLICATIONS
CUSTOM
APPLICATIONS
Identity
VPN
IP Phone
HR
Email
Finance
App Svr
DB
Web Svr SaaS/PaaS
IaaS
21. … Plaguing IT Operations
Complex, silo-based technologies
Disconnected and outdated point solutions
Reactive brute-force problem resolution
Over 80% of time spent maintaining, not innovating
22. Industry Leading Platform for Machine Data
Any machine data: Online Services, Web Services, Servers, Security, GPS Location, Storage,
Desktops, Networks, Packaged Applications, Custom Applications, Messaging, Telecoms,
Online Shopping Cart, Web Clickstreams, Databases, Energy Meters, Call Detail Records,
Smartphones and Devices, RFID, Datacenter, Private Cloud, Public Cloud
Enterprise Scalability
Search and Investigation | Proactive Monitoring | Operational Visibility | Real-time Business Insights
Operational Intelligence
23. Industry Leading Platform for Machine Data
Any amount, any location, any source
Schema-on-the-fly | Universal indexing | No back-end RDBMS | No need to filter data
24. Powerful Cross-Tier Operational Analytics
Harness IT data for mission decision-making
Data-driven decisions across the enterprise:
– Forecasting and planning
– Root cause analysis
– Proactive alerting
– User/usage analytics
– Change monitoring
– Security and forensics
25. INDUSTRY LEADING PRODUCTS
Full-featured platform for real-time operational intelligence
Splunk Enterprise as a cloud service
Explore, analyze, visualize data in Hadoop and NoSQL data stores
27. Splunk : Platform For IT Operational Intelligence
Apps and Add-ons Accelerate Value From Machine Data
API
SDKs UI
Server, Storage,
Network
Server
Virtualization
Operating
Systems
Infrastructure
Applications
Business
Applications
Cloud Services
XenApp
XenDesktop
Other Monitoring
Ticketing/Help
Desk
Web Intelligence
No rigid schemas – add in data from any other source
Custom
Applications
Stream
28. End to End Correlation With Splunk Enterprise
Reduce Costs: Consolidate tools, eliminate silos, find root cause faster!
Exchange Admin | Linux/Win Admin | Network Admin | Applications Admin | Line of Business User |
Application Support | VMware/Linux/Win Admin | Security Admin | Storage Admin | IT Management
29. Splunk For Operating Systems
Proactive Monitoring
Operational Analytics
End-to-End Visibility
Get instant insight into infrastructure health
OS metrics for performance, capacity & resource
allocation analyses
Scale and correlate across all tiers of your technology
stack
30. Splunk For Virtualization and Storage
Proactive Monitoring
Operational Analytics
End-to-End Visibility
Real-time actionable insights into problem spots and
health issues
Real-time & historical insights into performance,
security, capacity, forecasting and change tracking
Scalable Big Data solution for holistic visibility across all
technology tiers
31. Splunk For Infrastructure & Operations
Management
Keep the Agency
Running
Increase Productivity
Access to Intelligence
Proactively monitor mission-critical services that all
other systems actively depend on
Analyze, report & monitor via simple dashboards and
decrease troubleshooting time
Get detailed information on irregular activities
affecting security policies or SLA
33. Self-Managed: Sign Up for the Service
Full app, SDK, API, platform support
1-Click Launch on AWS: Splunk Enterprise and Hunk Amazon Machine Images (AMIs)
Available everywhere*
* Available in USA, Canada
34. Easy to Get Started
http://apps.splunk.com
1. Download for free* 2. Accelerate data collection and correlation 3. Start Splunking
* Free 60-day trial for premium Apps
10,199 employees
36 Square Miles of DOE owned property
2,000+ individual facilities, including 47 technical areas with 8 million square feet under roof
Budget: ~$2.1 billion annual fiscal budget, 55% weapons
My group, HPC-3, provides customer, application, and system support services for the production High Performance Computing resources delivered to us through the U.S. Department of Energy, in order to give National Security initiatives state-of-the-art, secure computing capabilities. The central focus of Los Alamos National Laboratory is to protect the nuclear stockpile of the United States, and it’s serious business.
NNSA maintains and enhances the safety, security, reliability and performance of the U.S. nuclear weapons stockpile without nuclear testing; works to reduce global danger from weapons of mass destruction; provides the U.S. Navy with safe and effective nuclear propulsion; and responds to nuclear and radiological emergencies in the U.S. and abroad.
LANL’s Mission: To solve national security challenges through scientific excellence.
That being said, we have a lot of fun working to optimize, enhance, and protect our nation’s valuable resources. We can boast that our lab provides the national security mission with cutting-edge (albeit sometimes bleeding-edge) computers, many of which are featured in the Top500 list of the fastest computers in the world, and a few of which have made the top ten. Our job involves ensuring that they are first of all accurate (precise), then fast (optimized), and finally adequately provisioned to provide the best possible user experience in computing.
How Fast Is Fast?
Petascale
96 Cabinets
~9,000 Nodes
100,000s of cores
Looking to Exascale!
more than one quadrillion floating point operations per second
The first phase of the Cielo installation at Los Alamos is complete. The 2010 installation consists of 72 cabinets, 6,704 compute nodes, 107,264 compute cores, and 221.5 terabytes of memory. The total hardware takes up approximately 1,500 square feet and uses less than 4 megawatts of power. The 2011 phase-two upgrade will expand the system to 96 cabinets, nearly 9,000 compute nodes, and approximately 300 terabytes of memory.
Define exascale, what it means and what it means in the context of testing supercomputers, resilience of utilities/apps, fault-tolerance requirements.
We have therefore developed a testing strategy, and we constantly tweak the processes (the means to the end) to make our testing as efficient and effective as possible. Testing, and the software that we maintain to accomplish it, is a continual work in progress. As we get new systems with new capabilities, we need new tests to measure the performance of those capabilities. Reproducers are constantly being developed to address issues as they arise, and tests are then produced to verify that regressions don’t recur. It’s an evolutionary process that will stagnate and be rendered ineffective without constant improvements and additions.
Stream is a benchmark created by Dr. John McCalpin to measure memory bandwidth
We continually run a threaded, single-node MPI implementation
Large codes rely heavily on checkpointing as a means of providing fault tolerance to applications as they run at scale on HPC clusters
Long run times, and the potential for failures as time and scale increase, place more dependence on checkpointing (data dumping) to parallel filesystems and restart functionality to achieve successful completion
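The checkpoint/restart pattern described above looks roughly like this in miniature. The file name, state layout, and dump interval are illustrative assumptions, not LANL’s actual scheme; the point is the periodic atomic dump plus resume-from-latest logic.

```python
import json
import os

def checkpoint(state, path):
    """Dump application state atomically, so a crash mid-write cannot
    leave a truncated restart file behind."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename on POSIX filesystems

def restart(path):
    """Resume from the latest checkpoint, or start fresh if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0}

# Simulated run: compute, dump periodically, as a long job would at scale.
state = restart("demo.ckpt")
while state["step"] < 10:
    state["step"] += 1              # one unit of "computation"
    if state["step"] % 5 == 0:      # periodic dump to the (parallel) filesystem
        checkpoint(state, "demo.ckpt")
os.remove("demo.ckpt")              # clean up the demo file
```

At scale the dump is gigabytes per node written to Lustre or Panasas, which is exactly why the N-to-N / N-to-1 I/O patterns above dominate file-system load.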
In order to maintain the state of the test harness on all machines simultaneously, we needed automatic monitoring of filesystems, system states, test harness state, and system utilization, with the results centralized
Alerts are set up to notify the testing team of any failures or issues that need their attention
The test harness runs as a user process on the front-ends of each machine
If not governed, the test harness can potentially fill up filesystems or disable the scheduler
Sometimes the test harness can silently die, and without monitoring, the tests will cease
The test harness is instrumented to “fill up” the machine with test jobs in a low-priority queue
The percentage is specified on the command line, as is the test suite definition file (containing parameters and tests) and the “watermark” (the high-level number of jobs to submit)
A delay is also specified, so the harness will wait a specific number of minutes before rechecking the system
Cron jobs run to ensure that the harness is still running, the filesystems aren’t full, and the system is not in DST (dedicated system time). If any of these conditions exist, the harness doesn’t submit jobs, filesystems are automatically cleaned up, and the checks run again to see if conditions are favorable for the harness to restart.
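Pulling the watermark, delay, and guard conditions together, the governing decision each cycle might look like the sketch below. Function and parameter names are hypothetical stand-ins, not the harness’s real interface.

```python
def harness_cycle(queued_jobs, watermark, fs_usage_pct, dst_active,
                  harness_alive=True, fs_limit_pct=90):
    """Decide how many low-priority test jobs to submit this cycle.

    Mirrors the guard conditions described above: submit nothing if the
    harness is dead, a filesystem is nearly full, or the machine is in
    dedicated system time; otherwise top the queue up to the watermark.
    A wrapper would sleep the configured delay (minutes) between cycles."""
    if not harness_alive or dst_active or fs_usage_pct >= fs_limit_pct:
        return 0
    return max(0, watermark - queued_jobs)
```

Keeping the decision a pure function of observed state is what makes it safe to re-run from cron: every cycle re-checks conditions rather than trusting the last one.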
Sometimes cron daemons die, crontabs are lost, etc. A “checkpoint” of each machine’s crontab is created each time the crontab is turned off, and it is reinstated when the crontab is turned back on.
Consistent testing across all platforms ensures a base from which to accurately compare system performance
If tests and their parameters/build environments aren’t consistent, then results cannot be compared across systems
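The baseline comparison that this consistency enables can be as simple as flagging any result that falls outside a tolerance band. A minimal sketch, assuming results and baselines are keyed by test name:

```python
def flag_regressions(results, baseline, tolerance=0.05):
    """Return the tests whose measured performance falls more than
    `tolerance` (fractional) below the recorded per-platform baseline.
    Tight variances are expected; anything flagged indicates a problem."""
    return {test: value
            for test, value in results.items()
            if test in baseline and value < baseline[test] * (1 - tolerance)}
```

The same comparison works for STREAM bandwidth, HPL FLOP/s, or I/O rates, since only the baseline table changes per platform.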
Drop-down boxes in this view enable zeroing in on specific times and machines.
Credit goes to Dominic Manno, Student Intern, HPC-3 Los Alamos Nat’l Lab
Raw test data is formatted uniquely by the test generating the output (different metrics, different libraries used, different scales)
A common data format needed to be agreed upon in order to have a standard from which to rapidly visualize test data within an analysis engine such as Splunk
Tentatively, the testers agreed that key=value pairs are easiest to use within Splunk, since they don’t require unique field-extraction logic to create graphs
Initially, every test instance created its own “splunkdata.log” file, captured in a file hierarchy specific to the user, machine, year, month, day, test, datestamp, and test parameters; this quickly overwhelmed the indexer with excessive metadata. Instead, we now create one Splunk log file per user per machine, which is the indexed file and is appended with every test output. This is a neater and easier way to house raw test data.
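A per-user, per-machine key=value log of the kind described might be produced like this. The field names and file-naming convention here are illustrative assumptions, not LANL’s actual schema; the point is one appended log file per user per machine, with events Splunk can parse without custom field extractions.

```python
import os
import socket
import time

def log_test_result(test, metrics, logdir="."):
    """Append one key=value event to a single splunkdata log per user
    per machine (illustrative naming, not the real convention)."""
    user = os.environ.get("USER", "unknown")
    host = socket.gethostname()
    path = os.path.join(logdir, f"splunkdata-{user}-{host}.log")
    fields = {"time": int(time.time()), "test": test, **metrics}
    # key=value pairs need no custom field-extraction logic in Splunk
    line = " ".join(f"{k}={v}" for k, v in fields.items())
    with open(path, "a") as f:  # append: one growing file, not one file per run
        f.write(line + "\n")
    return line
```

Because the file is appended rather than re-created per run, the indexer tracks one source per user/machine instead of thousands of tiny files.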
Future test harness developments will include a results database where abundant data can be collected, using the DBConnect mechanism within Splunk to index the data.
Quick introduction:
Who am I?
There has been an explosion of growth in IT data center technologies: IoT, mobile, distributed apps, virtualization. This brought increased efficiency and utilization, but at the same time escalating IT complexity. <click>
Lots of disparate, complex, silo-based solutions. If you need to find the cause of a problem, you may need to get a war room ready, with finger-pointing while trying to debug an issue in a production environment. You may spend hours and hours trying to find a solution. Often it comes down to brute force, such as restarting the system.
So IT is no longer spending time on innovating, but losing valuable time keeping the lights on and fighting fires.
Splunk Enterprise is a fully featured platform for collecting, searching, monitoring, and analyzing machine data and gaining operational intelligence. You can monitor both real-time (as the data is streaming) and historical data. Splunk collects machine data securely, reliably, and scalably from wherever it’s generated, in any format: time-series data, which means log files but also performance metrics. Talk ingestion types: 1) agent, 2) agent calling an API, 3) syslog. It stores and indexes the data in real time in a centralized location and protects it with role-based access controls. By centralizing the data and providing a consistent interface, you can troubleshoot something like a network problem and very easily correlate its impact on your applications, all in a matter of minutes, not days. //Monitor your end-to-end infrastructure to avoid service degradation or outages. Gain real-time visibility and critical insights into customer experience, transactions, and behavior.// As you move up, you go from search and investigation of issues to proactively monitoring problems and catching them before they happen, then to full operational visibility, and finally to OI level 4, where Splunk starts giving you real-time insights into your IT operations.
<click>We don’t require you to understand your data or have a predefined schema and requirements. You don’t need expensive custom connectors to get data into Splunk. We have our own MapReduce-based high-speed data indexing and retrieval mechanism. We can index data from any part of your infrastructure. We scale from a single server to petabytes of data, and you can use commodity x86 hardware. You can store data in the cloud as well if you don’t want to manage your Splunk instance. So you can get straight to the core of the problem: if you have a system without proactive capabilities, you can add them with Splunk Enterprise, and expand from there into security, capacity planning, and application management – truly a big gold mine of use cases for your data. Once our customers start to gain that operational visibility, they evolve to getting deeper insights from their data. There is no database in the backend, as we apply schema on the fly; you need raw data to be able to re-use it. We create intelligence on top of the data, which makes scaling easy.
And what can you use this data for? You have specific individual business needs, and Splunk is flexible. If you have a performance issue, you can move into root cause analysis of that problem, and further into proactive monitoring. If you want to understand how your users interact with your website and which content is popular, you can do that with Splunk. If you want to forecast and plan for enterprise growth, you can do that. Understand insider threats or security breaches – we have an app for that. The key point I would ask you to remember is that Splunk enables you to make informed, analytics-driven decisions across your enterprise.
Let’s take a closer look at a few of the apps we are highlighting here. We will mention a few Splunk-supported apps; we are investing in these apps and provide full support for them.
Over the last couple of years, Splunk has evolved from an engine for machine data to a platform for machine data – and nothing is more a testament to this than our app store, whose apps range from plugins and templates to full-fledged apps that help you collect, analyze, and harness data from every layer of your technology stack. These apps are built by our customers, by technology partners such as Cisco, NetApp, and ExtraHop, and by Splunk employees. We are a platform because it is very easy to get data into and out of Splunk; we complement other solutions in the data center.
Two important things to remember:
If a logo you have doesn’t show up here, Splunk still doesn’t limit you – you can always index data from that technology; Splunk extensions simply help you accelerate the process.
We provide a full-featured REST API and a variety of SDKs that help you build your own custom apps for technologies and insights specific to your business. This helps you create a tailored interface to your data, in the formats and development languages your team is used to.
Lastly, the Splunk extensions are not comparable to point solutions in every silo, simply because your data from each silo is more valuable in the context of data from other technology tiers. Splunk apps simply help you get faster to the point where you can see correlations and comparisons of machine data ACROSS silos.
So now, no matter what your administrative area is, you want cross-tier insights across the environment. How many times have you had complaints from the applications team about big latency on the storage side? As a virtualization admin, you may need to allocate additional resources – more CPU cores – to boost user and application performance. Or as an OS admin, you see the OS reporting correct storage utilization while an application still runs slow. This is because each of these IT professionals is looking at isolated tools; they have no insight into other silos. That is what our apps deliver, and it is core functionality of Splunk as a platform. If you have Exchange running on top of VMware, Windows, Cisco ACI, and attached storage, you can use Splunk as a platform to gain insight into how your business service is performing. It is central and easy for Splunk because we treat all of this as just another data source.
Among our most popular apps are the Windows, Linux, and Unix apps. If you have thousands of servers deployed, we have added functionality in these apps to let you easily monitor infrastructure. With an OS, you primarily want smooth operation and at-a-glance visibility into infrastructure health: proactive alerts, plus understanding CPU and memory consumption for processes running in the environment.
Another set of very popular apps are the virtualization and storage apps. The VMware app is hugely popular and is one of our few premium apps. We support other virtualization apps, such as apps for Hyper-V and XenApp/XenDesktop, and when we talk about storage, the NetApp app is at the top of the list. These apps again give you at-a-glance insight into your infrastructure: operational analytics with insights into performance and capacity forecasting, and finally the ability to view and cross-correlate across technology tiers.
Another set of extremely popular apps are the Splunk apps for Exchange, Active Directory, and Amazon Web Services. Why? Because you can get correlated insights across multiple tiers. What are the benefits to you? First, keep the service running smoothly: for example, how does message volume depend on the health of your network, and do you have enough CPU resources? Since you have all the data in one location, this directly increases productivity. Also, security policies can be enforced, since you have insight into how your users interact with the applications.