IT Operations Use Case: Jennifer Green, R&D Scientist, Los Alamos National Security, LLC.
IT Operations Overview: Jon Rooney, Director, Developer Marketing, Splunk
2. Los Alamos National Laboratory
Los Alamos, New Mexico
“Delivering science and technology to protect our nation
and promote world stability” [www.lanl.gov]
Originally Site Y during WWII (DoD installation) for the
sole purpose of building the atomic bomb
Today, the town-site is open to public access, and
the Laboratory pursues a broad range of scientific
endeavors, but it still maintains focus on
– nuclear non-proliferation
– border security
– counter-nuclear and biological threats
3. LANL’s Areas of Scientific Research
• High-Energy and Applied Physics
• High-Performance Computing
• Dynamic and Energetic Materials Science
• Superconductivity
• Quantum Information
• Advanced Materials
• Bioinformatics
• Theoretical and Computational Biology
• Chemistry
• Earth and Environmental Science
• Energy and Infrastructure Security
• Engineering Sciences and Applications
• Nanotechnology
4. High Performance Computing Division
• Provide “super-sized” computers for numerically intensive / data-intensive computations
• Ensure our goals align with the Lab’s mission
• Provide state-of-the-art platforms that satisfy stakeholders’ requirements
National Security Mission: Nuclear Non-Proliferation; National Safety and Security
LANL’s Mission: Apply Scientific Excellence to National Security Missions
HPC’s Mission: Enable Scientific Discovery via World-Class High Performance Computing Resources
5. High Performance Computing @ LANL
The newest Tri-laboratory installation,
Trinity, is projected for 2015–2016. It is
expected to be the first platform large
and fast enough to begin to
accommodate finely resolved 3D
calculations for full-scale, end-to-end
weapons calculations.
To meet cooling requirements for the supercomputers in the SCC, LANL is decreasing its use of
city/well water for cooling towers and instead using water from LANL’s Sanitary Effluent Reclamation
Facility (SERF). From December 2012 through April 2013, LANL reduced its use of city/well water for
supercomputer cooling towers by roughly 50%.
[http://www.lanl.gov/asc/trinity-highlight.php]
6. Managing HPC Systems
We provide the first release
of the newest hardware &
software technologies:
– Specifically customized
– Often one-offs
– Simultaneously required to
be stable computational
resources
– Serve mission-critical
computations for National
Security initiatives
Hosting 15+ HPC Systems
• Three Disparate Security Network Partitions
• Compute / Data Intensive
• Networked File Systems
• Parallel File-systems
• Lustre
• Panasas
• Networking
• Fiber
• Ethernet
• Tape Archive
• Customer Support
• Programming Runtime and Environment
• Consult
• Application Readiness
• Theater + Interactive Visualization Systems
High Availability
High Reliability
High Serviceability
7. High Performance Computing Testing Strategy
Proactive Testing Improves
Reliability
Acceptance Testing
Integration Testing
Correctness Testing
Regression Testing
Performance Testing
Software Testing
Fault Tolerance Testing
Resilience Testing
Parameter Studies
[ omg that’s a lot of
testing ]
[ Computer Life Cycle ]
TEST:
• VALIDATION OF COMPUTATIONAL ACCURACY
• SUSTAINED PERFORMANCE
• IDENTIFY HARDWARE / SOFTWARE PROBLEMS
8. Stream Memory Bandwidth Test (McCalpin et al.)
• Performs 4 computations
• Measures main memory bandwidth per processor
• Triad is the money computation – indicates performance expected with typical scientific computations
• Expect tight performance; variances from baseline indicate a problem
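The four STREAM computations can be sketched as follows. This is an illustrative Python version, not McCalpin’s actual C/Fortran benchmark, so the numbers it reports reflect interpreter overhead rather than true main-memory bandwidth; what it shows is the kernels themselves and the bytes-moved accounting behind the MB/s figure.

```python
import time

def stream_kernels(n=200_000, scalar=3.0):
    """Toy versions of the four STREAM kernels (McCalpin).
    Returns a dict of per-kernel MB/s, computed as
    (arrays touched) * 8 bytes/double * n / elapsed."""
    a = [1.0] * n
    b = [2.0] * n
    c = [0.0] * n
    results = {}

    t0 = time.perf_counter()
    c = [ai for ai in a]                            # Copy:  c[i] = a[i]
    results["copy"] = (2, time.perf_counter() - t0)  # reads a, writes c

    t0 = time.perf_counter()
    b = [scalar * ci for ci in c]                   # Scale: b[i] = q*c[i]
    results["scale"] = (2, time.perf_counter() - t0)

    t0 = time.perf_counter()
    c = [ai + bi for ai, bi in zip(a, b)]           # Add:   c[i] = a[i]+b[i]
    results["add"] = (3, time.perf_counter() - t0)

    t0 = time.perf_counter()
    a = [bi + scalar * ci for bi, ci in zip(b, c)]  # Triad: a[i] = b[i]+q*c[i]
    results["triad"] = (3, time.perf_counter() - t0)

    return {k: arrays * 8 * n / sec / 1e6
            for k, (arrays, sec) in results.items()}
```

In practice the absolute numbers matter less than their stability: a node whose Triad figure drifts from the fleet baseline is a node worth examining.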
9. CPU / GPU Performance Testing
Floating Point Operations Per Second (FLOP/s) is the typical measure of computational performance
HPL – High Performance Linpack (Dongarra et al.)
FLOPs are “free” (theme of SC’09)
Enter HPCG
Scalable Heterogeneous Cluster Benchmark (Spafford)
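The FLOP/s metric itself is easy to illustrate. The sketch below is a toy, not HPL: it counts the 2·n³ floating-point operations of a naive dense matrix multiply and divides by wall time. HPL applies the same counting idea to a dense LU solve at full-machine scale.

```python
import time

def matmul_gflops(n=120):
    """Measure achieved FLOP/s on a naive n x n matrix multiply.
    A dense multiply performs 2*n**3 floating-point operations
    (one multiply and one add per inner-loop step)."""
    A = [[1.0] * n for _ in range(n)]
    B = [[2.0] * n for _ in range(n)]
    C = [[0.0] * n for _ in range(n)]
    t0 = time.perf_counter()
    for i in range(n):
        for k in range(n):
            aik = A[i][k]
            for j in range(n):
                C[i][j] += aik * B[k][j]
    elapsed = time.perf_counter() - t0
    return C, (2 * n**3) / elapsed / 1e9  # (result matrix, GFLOP/s)
```

HPCG instead stresses memory-bound sparse operations, which is why its achieved FLOP/s land far below HPL’s on the same machine.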
10. I/O: Perhaps the Biggest Bottleneck
PARALLEL I/O FOLLOWS PATTERNS
[ N TO N ] OR [ N TO 1] WRITES, READS:
• Hidden in these system calls are file open,
file close and stat operations
• Can add unknown overhead to the
operation
• Can create a burdensome load on file systems and overhead in the application if not
programmed optimally (e.g., open file handles, metadata overhead if too many files
are opened simultaneously)
• File-system testing helps to identify
potential failures, and load impacts on
running jobs
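The two write patterns named above can be sketched in miniature. This is a hypothetical single-process illustration (each “rank” is simulated in a loop), meant only to show why N-to-N multiplies file and metadata counts while N-to-1 funnels all ranks through one shared file.

```python
import os
import tempfile

def write_n_to_n(workdir, nranks, payload=b"x" * 1024):
    """N-to-N: every rank opens its own file -- simple, but N files
    means N opens, N closes, and N inodes of metadata per dump."""
    for rank in range(nranks):
        with open(os.path.join(workdir, f"dump.{rank}"), "wb") as f:
            f.write(payload)

def write_n_to_1(workdir, nranks, payload=b"x" * 1024):
    """N-to-1: all ranks write disjoint offsets of one shared file --
    one inode, but the ranks now contend for the same file."""
    path = os.path.join(workdir, "dump.shared")
    with open(path, "wb") as f:
        for rank in range(nranks):
            f.seek(rank * len(payload))
            f.write(payload)

with tempfile.TemporaryDirectory() as d:
    write_n_to_n(d, 4)
    write_n_to_1(d, 4)
    files = sorted(os.listdir(d))
```

Hidden inside either pattern are the open, close, and stat calls the slide mentions; file-system tests exercise both patterns to expose their different failure and load behaviors.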
[Chart: bursty file-system performance – baseline represented by yellow line]
11. Whole Machine Performance Overview
Wolf, a New Supercomputer, Up and Running at Los Alamos National Lab
http://machinedesign.com/news/wolf-new-supercomputer-and-running-los-alamos-national-lab
15. Other Splunk Uses at LANL
Licensed software product monitoring drives business decisions
System logs from all production resources
– Network server logs
– IB Monitoring Server logs
– Node logs
– Security
– Software and hardware inventory
– Monitoring for known failure signatures in logs
– Resource managers / schedulers
Customized plugin development to handle unique test visualization
GUI for home-grown tools
16. Why Splunk?
A centralized location to store and process large amounts of data
Close to “real-time” to intervene when problems occur
No database schemas to design to accomplish data correlation
Versatile – can be programmed to generate sophisticated graphs, tables, and reports
Splunk searching is easy – lots of functions and capabilities to handle statistical
measurements
Alerting function permits monitoring tasks
App / Dashboard / Reports permits different views tailored to audience
Ability to grow in size as required
20. Escalating IT Complexity…
SERVERS STORAGE NETWORKING
VIRTUALIZATION
INFRASTRUCTURE
APPLICATIONS
PACKAGED
APPLICATIONS
CUSTOM
APPLICATIONS
Identity
VPN
IP Phone
HR
Email
Finance
App Svr
DB
Web Svr SaaS/PaaS
IaaS
21. … Plaguing IT Operations
Complex, silo-based technologies
Disconnected and outdated point solutions
Reactive brute-force problem resolution
Over 80% of time spent maintaining, not innovating
22. Industry Leading Platform for Machine Data
Any machine data: Online Services, Web Services, Servers, Security, GPS Location, Storage,
Desktops, Networks, Packaged Applications, Custom Applications, Messaging, Telecoms,
Online Shopping Cart, Web Clickstreams, Databases, Energy Meters, Call Detail Records,
Smartphones and Devices, RFID, Datacenter, Private Cloud, Public Cloud
Enterprise Scalability
Search and Investigation | Proactive Monitoring | Operational Visibility | Real-time Business Insights
Operational Intelligence
23. Industry Leading Platform for Machine Data
Any amount, any location, any source
Schema-on-the-fly | Universal indexing | No back-end RDBMS | No need to filter data
24. Powerful Cross-Tier Operational Analytics
Harness IT data for mission decision-making
Data-driven decisions across the enterprise:
– Forecasting and planning
– Root cause analysis
– Proactive alerting
– User/usage analytics
– Change monitoring
– Security and forensics
25. INDUSTRY LEADING PRODUCTS
Full-featured platform for real-time operational intelligence
Splunk Enterprise as a cloud service
Explore, analyze, visualize data in Hadoop and NoSQL data stores
27. Splunk : Platform For IT Operational Intelligence
Apps and Add-ons Accelerate Value From Machine Data
API
SDKs UI
Server, Storage,
Network
Server
Virtualization
Operating
Systems
Infrastructure
Applications
Business
Applications
Cloud Services
XenApp
XenDesktop
Other Monitoring
Ticketing/Help
Desk
Web Intelligence
No rigid schemas – add in data from any other source
Custom
Applications
Stream
28. End to End Correlation With Splunk Enterprise
Reduce Costs: Consolidate tools, eliminate silos, find root cause faster!
Exchange Admin | Linux/Win Admin | Network Admin | Applications Admin | Line of Business User |
Application Support | VMware/Linux/Win Admin | Security Admin | Storage Admin | IT Management
29. Splunk For Operating Systems
Proactive Monitoring
Operational Analytics
End-to-End Visibility
Get instant insight into infrastructure health
OS metrics for performance, capacity & resource
allocation analyses
Scale and correlate across all tiers of your technology
stack
30. Splunk For Virtualization and Storage
Proactive Monitoring
Operational Analytics
End-to-End Visibility
Real-time actionable insights into problem spots and
health issues
Real-time & historical insights into performance,
security, capacity, forecasting and change tracking
Scalable Big Data solution for holistic visibility across all
technology tiers
31. Splunk For Infrastructure & Operations
Management
Keep the Agency
Running
Increase Productivity
Access to Intelligence
Proactively monitor mission-critical services that all
other systems actively depend on
Analyze, report & monitor via simple dashboards and
decrease troubleshooting time
Get detailed information on irregular activities
affecting security policies or SLA
33. Self-Managed: Sign Up for the Service
Full app, SDK, API, platform support
1-Click Launch on AWS: Splunk Enterprise and Hunk Amazon Machine Images (AMIs)
Available everywhere*
* Available in USA, Canada
34. Easy to Get Started
http://apps.splunk.com
1. Download for free* 2. Accelerate data collection and correlation 3. Start Splunking
* Free 60-day trial for premium Apps
10,199 employees
36 Square Miles of DOE owned property
2,000+ individual facilities, including 47 technical areas with 8 million square feet under roof
Budget: ~$2.1 billion annual fiscal budget, 55% weapons
My group, HPC-3, provides customer, application, and system support services for the production High Performance Computing resources delivered to us through the U.S. Department of Energy, in order to give National Security initiatives state-of-the-art, secure computing capabilities. The central focus of Los Alamos National Laboratory is to protect the nuclear stockpile of the United States, and it’s serious business.
NNSA maintains and enhances the safety, security, reliability and performance of the U.S. nuclear weapons stockpile without nuclear testing; works to reduce global danger from weapons of mass destruction; provides the U.S. Navy with safe and effective nuclear propulsion; and responds to nuclear and radiological emergencies in the U.S. and abroad.
LANL’s Mission: To solve national security challenges through scientific excellence.
That being said, we have a lot of fun working to optimize, enhance, and protect our nation’s valuable resources. We can boast that our lab provides the national security mission with cutting-edge (albeit sometimes bleeding-edge) computers, many of which are featured in the Top500 list of the fastest computers in the world, and a few of which have made the top ten. Our job involves ensuring that they are first of all accurate (precise), then fast (optimized), and finally adequately provisioned to provide the best possible user experience in computing.
How Fast Is Fast?
Petascale
96 Cabinets
~9,000 Nodes
100,000s of cores
Looking to Exascale!
more than one quadrillion floating point operations per second
The first phase of the Cielo installation at Los Alamos is complete. The 2010 installation consists of 72 cabinets, 6,704 compute nodes, 107,264 compute cores, and 221.5 terabytes of memory. The total hardware takes up approximately 1,500 square feet and uses less than 4 megawatts of power. The 2011 phase-two upgrade will expand the system to 96 cabinets, nearly 9,000 compute nodes, and approximately 300 terabytes of memory.
Define exascale, what it means and what it means in the context of testing supercomputers, resilience of utilities/apps, fault-tolerance requirements.
We have therefore developed a testing strategy, and we constantly tweak the processes (the means to the end) to make our testing as efficient and effective as possible. Testing, and the software that we maintain to accomplish it, is a continual work in progress. As we get new systems with new capabilities, we need new tests to measure the performance of those capabilities. Reproducers are constantly being developed to address issues as they arise, and tests are then produced to verify that regressions don’t recur. It’s an evolutionary process that will stagnate and be rendered ineffective without constant improvements and additions.
Stream is a benchmark created by Dr. John McCalpin to measure memory bandwidth
We continually run a threaded, single-node MPI implementation
Large codes rely heavily on checkpointing as a means of providing fault tolerance to applications as they run at scale on HPC clusters
Long run times, and the potential for failures as time and scale increase, place more dependence on checkpointing (data dumping) to parallel filesystems and restart functionality to achieve successful completion
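The checkpoint/restart pattern described above looks roughly like this in miniature. The file name, state layout, and dump interval are illustrative assumptions, not LANL’s actual scheme; the point is the periodic atomic dump plus resume-from-latest logic.

```python
import json
import os

def checkpoint(state, path):
    """Dump application state atomically, so a crash mid-write cannot
    leave a truncated restart file behind."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename on POSIX filesystems

def restart(path):
    """Resume from the latest checkpoint, or start fresh if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0}

# Simulated run: compute, dump periodically, as a long job would at scale.
state = restart("demo.ckpt")
while state["step"] < 10:
    state["step"] += 1              # one unit of "computation"
    if state["step"] % 5 == 0:      # periodic dump to the (parallel) filesystem
        checkpoint(state, "demo.ckpt")
os.remove("demo.ckpt")              # clean up the demo file
```

At scale the dump is gigabytes per node written to Lustre or Panasas, which is exactly why the N-to-N / N-to-1 I/O patterns above dominate file-system load.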
In order to maintain the state of the test harness on all machines simultaneously, we needed automatic monitoring of filesystems, system states, test harness state, and system utilization, with the results centralized
Alerts are set up to notify the testing team of any failures or issues that need their attention
The test harness runs as a user process on the front-ends of each machine
If not governed, the test harness can potentially fill up filesystems or disable the scheduler
Sometimes the test harness can silently die, and without monitoring, the tests will cease
The test harness is instrumented to “fill up” the machine with test jobs in a low-priority queue
The percentage is specified on the command line, as is the test suite definition file (containing parameters and tests) and the “watermark” (the high-level number of jobs to submit)
A delay is also specified, so the harness will wait a specific number of minutes before rechecking the system
Cron jobs run to ensure that the harness is still running, the filesystems aren’t full, and the system is not in DST (dedicated system time). If any of these conditions exist, the harness doesn’t submit jobs, filesystems are automatically cleaned up, and the checks run again to see if conditions are favorable for the harness to restart.
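Pulling the watermark, delay, and guard conditions together, the governing decision each cycle might look like the sketch below. Function and parameter names are hypothetical stand-ins, not the harness’s real interface.

```python
def harness_cycle(queued_jobs, watermark, fs_usage_pct, dst_active,
                  harness_alive=True, fs_limit_pct=90):
    """Decide how many low-priority test jobs to submit this cycle.

    Mirrors the guard conditions described above: submit nothing if the
    harness is dead, a filesystem is nearly full, or the machine is in
    dedicated system time; otherwise top the queue up to the watermark.
    A wrapper would sleep the configured delay (minutes) between cycles."""
    if not harness_alive or dst_active or fs_usage_pct >= fs_limit_pct:
        return 0
    return max(0, watermark - queued_jobs)
```

Keeping the decision a pure function of observed state is what makes it safe to re-run from cron: every cycle re-checks conditions rather than trusting the last one.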
Sometimes cron daemons die, crontabs are lost, etc. A “checkpoint” of each machine’s crontab is created each time the crontab is turned off, and it is reinstated when the crontab is turned back on.
Consistent testing across all platforms ensures a base from which to accurately compare system performance
If tests and their parameters/build environments aren’t consistent, then results cannot be compared across systems
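The baseline comparison that this consistency enables can be as simple as flagging any result that falls outside a tolerance band. A minimal sketch, assuming results and baselines are keyed by test name:

```python
def flag_regressions(results, baseline, tolerance=0.05):
    """Return the tests whose measured performance falls more than
    `tolerance` (fractional) below the recorded per-platform baseline.
    Tight variances are expected; anything flagged indicates a problem."""
    return {test: value
            for test, value in results.items()
            if test in baseline and value < baseline[test] * (1 - tolerance)}
```

The same comparison works for STREAM bandwidth, HPL FLOP/s, or I/O rates, since only the baseline table changes per platform.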
Drop-down boxes in this view enable zeroing in on specific times and machines.
Credit goes to Dominic Manno, Student Intern, HPC-3 Los Alamos Nat’l Lab
Raw test data is formatted uniquely by the test generating the output (different metrics, different libraries used, different scales)
A common data format needed to be agreed upon in order to have a standard from which to rapidly visualize test data within an analysis engine such as Splunk
Tentatively, the testers agreed that key=value pairs are easiest to use within Splunk, since they don’t require unique field-extraction logic to create graphs
Initially, every test instance created its own “splunkdata.log” file, captured in a file hierarchy specific to the user, machine, year, month, day, test, datestamp, and test parameters; this quickly overwhelmed the indexer with excessive metadata. Instead, we now create one Splunk log file per user per machine, which is the indexed file and is appended with every test output. This is a neater and easier way to house raw test data.
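A per-user, per-machine key=value log of the kind described might be produced like this. The field names and file-naming convention here are illustrative assumptions, not LANL’s actual schema; the point is one appended log file per user per machine, with events Splunk can parse without custom field extractions.

```python
import os
import socket
import time

def log_test_result(test, metrics, logdir="."):
    """Append one key=value event to a single splunkdata log per user
    per machine (illustrative naming, not the real convention)."""
    user = os.environ.get("USER", "unknown")
    host = socket.gethostname()
    path = os.path.join(logdir, f"splunkdata-{user}-{host}.log")
    fields = {"time": int(time.time()), "test": test, **metrics}
    # key=value pairs need no custom field-extraction logic in Splunk
    line = " ".join(f"{k}={v}" for k, v in fields.items())
    with open(path, "a") as f:  # append: one growing file, not one file per run
        f.write(line + "\n")
    return line
```

Because the file is appended rather than re-created per run, the indexer tracks one source per user/machine instead of thousands of tiny files.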
Future test harness developments will include a results database where abundant data can be collected, using the DBConnect mechanism within Splunk to index the data.
Quick introduction:
Who am I?
There has been an explosion of growth in IT data center technologies: IoT, mobile, distributed apps, virtualization. This brought increased efficiency and utilization, but at the same time escalating IT complexity. <click>
Lots of disparate, complex, silo-based solutions. If you need to find the cause of a problem, you may need to get a war room ready, with finger-pointing while trying to debug an issue in a production environment. You may spend hours and hours trying to find a solution. Often it comes down to brute force, such as restarting the system.
So IT is no longer spending time on innovating, but losing valuable time keeping the lights on and fighting fires.
Splunk Enterprise is a fully featured platform for collecting, searching, monitoring, and analyzing machine data and gaining operational intelligence. You can monitor both real-time (as the data is streaming) and historical data. Splunk collects machine data securely, reliably, and scalably from wherever it’s generated, in any format: time-series data, which means log files but also performance metrics. Talk ingestion types: 1) agent, 2) agent calling an API, 3) syslog. It stores and indexes the data in real time in a centralized location and protects it with role-based access controls. By centralizing the data and providing a consistent interface, you can troubleshoot something like a network problem and very easily correlate its impact on your applications, all in a matter of minutes, not days. //Monitor your end-to-end infrastructure to avoid service degradation or outages. Gain real-time visibility and critical insights into customer experience, transactions, and behavior.// As you move up, you go from search and investigation of issues to proactively monitoring problems and catching them before they happen, then to full operational visibility, and finally to OI level 4, where Splunk starts giving you real-time insights into your IT operations.
<click>We don’t require you to understand your data or have a predefined schema and requirements. You don’t need expensive custom connectors to get data into Splunk. We have our own MapReduce-based high-speed data indexing and retrieval mechanism. We can index data from any part of your infrastructure. We scale from a single server to petabytes of data, and you can use commodity x86 hardware. You can store data in the cloud as well if you don’t want to manage your Splunk instance. So you can get straight to the core of the problem: if you have a system without proactive capabilities, you can add them with Splunk Enterprise, and expand from there into security, capacity planning, and application management – truly a big gold mine of use cases for your data. Once our customers start to gain that operational visibility, they evolve to getting deeper insights from their data. There is no database in the backend, as we apply schema on the fly; you need raw data to be able to re-use it. We create intelligence on top of the data, which makes scaling easy.
And what can you use this data for? You have specific individual business needs, and Splunk is flexible. If you have a performance issue, you can move into root cause analysis of that problem, and further into proactive monitoring. If you want to understand how your users interact with your website and which content is popular, you can do that with Splunk. If you want to forecast and plan for enterprise growth, you can do that. Understand insider threats or security breaches – we have an app for that. The key point I would ask you to remember is that Splunk enables you to make informed, analytics-driven decisions across your enterprise.
Let’s take a closer look at a few of the apps we are highlighting here. We will mention a few Splunk-supported apps; we are investing in these apps and provide full support for them.
Over the last couple of years, Splunk has evolved from an engine for machine data to a platform for machine data – and nothing is more a testament to this than our app store, whose apps range from plugins and templates to full-fledged apps that help you collect, analyze, and harness data from every layer of your technology stack. These apps are built by our customers, by technology partners such as Cisco, NetApp, and ExtraHop, and by Splunk employees. We are a platform because it is very easy to get data into and out of Splunk; we complement other solutions in the data center.
Two important things to remember:
If a logo you have doesn’t show up here, Splunk still doesn’t limit you – you can always index data from that technology; Splunk extensions simply help you accelerate the process.
We provide a full-featured REST API and a variety of SDKs that help you build your own custom apps for technologies and insights specific to your business. This helps you create a tailored interface to your data, in the formats and development languages your team is used to.
Lastly, the Splunk extensions are not comparable to point solutions in every silo, simply because your data from each silo is more valuable in the context of data from other technology tiers. Splunk apps simply help you get faster to the point where you can see correlations and comparisons of machine data ACROSS silos.
So now, no matter what your administrative area is, you want cross-tier insights across the environment. How many times have you had complaints from the applications team about big latency on the storage side? As a virtualization admin, you may need to allocate additional resources – more CPU cores – to boost user and application performance. Or as an OS admin, you see the OS reporting correct storage utilization while an application still runs slow. This is because each of these IT professionals is looking at isolated tools; they have no insight into other silos. That is what our apps deliver, and it is core functionality of Splunk as a platform. If you have Exchange running on top of VMware, Windows, Cisco ACI, and attached storage, you can use Splunk as a platform to gain insight into how your business service is performing. It is central and easy for Splunk because we treat all of this as just another data source.
Among our most popular apps are the Windows, Linux, and Unix apps. If you have thousands of servers deployed, we have added functionality in these apps to let you easily monitor infrastructure. With an OS, you primarily want smooth operation and at-a-glance visibility into infrastructure health: proactive alerts, plus understanding CPU and memory consumption for processes running in the environment.
Another set of very popular apps are the virtualization and storage apps. The VMware app is hugely popular and is one of our few premium apps. We support other virtualization apps, such as apps for Hyper-V and XenApp/XenDesktop, and when we talk about storage, the NetApp app is at the top of the list. These apps again give you at-a-glance insight into your infrastructure: operational analytics with insights into performance and capacity forecasting, and finally the ability to view and cross-correlate across technology tiers.
Another set of extremely popular apps are the Splunk apps for Exchange, Active Directory, and Amazon Web Services. Why? Because you can get correlated insights across multiple tiers. What are the benefits to you? First, keep the service running smoothly: for example, how does message volume depend on the health of your network, and do you have enough CPU resources? Since you have all the data in one location, this directly increases productivity. Also, security policies can be enforced, since you have insight into how your users interact with the applications.