Slides that accompanied a three-hour crash training course on sysadmin survival skills useful for sysadmins of Evergreen open source library software. Session led by Don McMorris, Equinox Software.
1. Evergreen Sysadmin Survival Skills
A presentation for the Evergreen International Conference 2009
Don McMorris, Equinox Software Inc.
2. Target Audience
• Personnel who will administer the ILS servers or will provide upper-tier
support in an Evergreen installation, including roles in:
• Hardware
• Operating System
• Network
• Monitoring
• Deep analysis/troubleshooting
3. Assumptions
• Experience administering a Debian-based system at the console level
• File structure organization
• Configuration file maintenance
• dpkg/apt, CPAN, configure/make/make install, etc.
• Success installing basic Evergreen 1.4 server
• Administration of a medium to large deployment
• A lot of the talk will still be relevant to smaller installs too, though!
4. OpenSRF architecture (handout)
• XMPP Server (via ejabberd)
• Facilitates IPC
• OpenSRF Router
• Tracks what applications are running, and where
• Directs requests to individual service instances
• OpenSRF Service
• Takes requests, performs some type of process, and responds to
the client
• Evergreen consists of several OpenSRF services
• OpenSRF Client
• Sends requests to OpenSRF services
• May be standalone or running as part of service (sub-requests)
8. Core OpenSRF Services
• opensrf.settings
• opensrf.math
• Used to test OpenSRF communication chain
• opensrf.dbmath
• Used to test OpenSRF communication chain and application
database connectivity.
11. Typical Hardware Setup
• Database boxes
• Main RW DB
• RO DBs
• Report server w/RO DB
• N+1 Server bricks, each consisting of:
• One "head" (ejabberd, apache, opensrf.settings)
• One or more "drones" (general application processing)
• A "single server brick" refers to a single server with ejabberd,
apache, and all applicable applications
• Utility servers
• N+1 SIP servers (may be part of server bricks)
• "Batch processing" server (fine generator, hold targeter, etc)
• 2X memcache servers
• 2X firewall/Load balancers
• Logger
• Monitoring/alerts, mail, etc.
12. Typical Hardware Setup
• Redundant Ethernet
• All servers have dual NICs bonded in active/failover mode
• Each NIC wired to independent Ethernet switch
• Redundant power (on Critical servers [DB, Memcache, etc])
• Multiple power supplies (2-4)
• Each power supply independently fed
• Separate electrical panel
• Separate UPS system
• Hardware RAID
• Debian Linux
16. Network Architecture
• Border connection(s) come in to pair of load balancer/firewall boxes
• Active/Hot Standby
• Linux HA/Heartbeat used to perform automatic failover
• LVS/ldirectord used to query brick heads (ldirectorping.txt)
• Brick is "removed from rotation" when ldirectorping.txt isn't
found or has invalid contents
• Incoming connections load-balanced round-robin to brick heads
• Connections NAT'd
• LAN consists of 2 separate physical switches
• Each server has one connection to each
• Switches are linked via cross-over cable or similar
• Packet manipulators (such as caching proxies) tend to cause issues
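The "removed from rotation" decision can be approximated from a shell on the brick head itself. This is a minimal sketch, not the load balancer's actual check: the function name, the file location, and the expected string are all assumptions, and the real expected contents are whatever your load balancer configuration is told to look for.

```shell
# Sketch only: mimics the in/out-of-rotation decision for one brick head.
# ping_file and expected are hypothetical arguments supplied by the caller.
brick_in_rotation() {
    local ping_file=$1 expected=$2
    # A missing or unreadable file takes the brick out of rotation.
    [ -r "$ping_file" ] || return 1
    # So does a file that does not contain the expected string.
    grep -qF -e "$expected" "$ping_file"
}
```

For example, `brick_in_rotation /path/to/ldirectorping.txt pong && echo "in rotation"`, where both the path and the string "pong" are placeholders.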
18. Open inbound ports (border)
• 80 HTTP
• 443 HTTPS
• 22 SSH (possibly restricted by source IP)
• 210 Z39.50 server
• 6001 SIP2 RAW (NOTE: SIP communication is plain-text)
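The port list above could be turned into firewall rules along these lines. This sketch only prints iptables commands so they can be reviewed first. It assumes simple filtering on the INPUT chain; with LVS and NAT in front of the bricks, the chain and table you actually need will differ. The function name and the ssh_source argument are hypothetical.

```shell
# Prints (does not apply) iptables commands for the inbound ports listed above.
# ssh_source is an optional CIDR used to restrict SSH; omit it to leave SSH open.
border_rules() {
    local ssh_source=${1:-} port
    # HTTP, HTTPS, Z39.50 and raw SIP2 are open to everyone.
    for port in 80 443 210 6001; do
        echo "iptables -A INPUT -p tcp --dport $port -j ACCEPT"
    done
    if [ -n "$ssh_source" ]; then
        echo "iptables -A INPUT -p tcp -s $ssh_source --dport 22 -j ACCEPT"
    else
        echo "iptables -A INPUT -p tcp --dport 22 -j ACCEPT"
    fi
}
```

Calling `border_rules 203.0.113.0/24` prints five rules, with SSH limited to that (documentation-only) address range.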
19. Logging
• Syslog-ng
• Central logging server
• All logs, including Evergreen, Apache, Postgres, and system
• Typical directory structure:
• Evergreen: /var/log/remote/prod/%Y/%m/%d/file.%H.log
• System: /var/log/remote/sys/$HOSTNAME/%Y/%m/%d/file.%H.log
• gzip nightly
• Off-system archive as necessary
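With a date-based layout like the one above, the file for a given moment can be computed rather than hunted for. A small sketch, assuming GNU date (standard on Debian) and assuming the hourly files are named like the osrfsys.10.log example used later in these slides; the function name and defaults are illustrative.

```shell
# Builds the per-hour log path for a timestamp, following the directory
# layout above. The base name and log root are assumptions; adjust them to
# match your syslog-ng destination templates.
evergreen_log_path() {
    local when=$1 name=${2:-osrfsys} log_root=${3:-/var/log/remote/prod}
    # date expands %Y, %m, %d and %H for the requested time.
    date -d "$when" "+$log_root/%Y/%m/%d/$name.%H.log"
}
```

For instance, `evergreen_log_path '2009-04-30 10:54:02'` yields /var/log/remote/prod/2009/04/30/osrfsys.10.log.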
20. Checking the logs
• Identify which log file(s) to look at
• What date and time are you looking for?
• Identify thread trace from logs
DEMOSYS-DB2:/var/log/remote/prod/2009/04/30# grep 312345000098765 osrfsys.10.log
...
2009-04-30 10:54:02 DEMOSYS-BRICK0-DRONE1 open-ils.circ: [INFO:5119:ScriptRunner.pm:60:1241097309492131]
script_runner: circ_permit_hold : Patron=98725, Patron_Username=21234000054577,
Patron_Profile_Group=Patron, Patron_Fines=, Patron_OverdueCount=, Patron_Items_Out=,
Patron_Barcode=21234000054577, Patron_Library=Summer City Public Library Requestor=circ1, Copy=3870901,
Copy_Barcode=312345000098765, Copy_status=Available, Circ_Mod=DVD, Circ_Lib=SUMMER,
Copy_location=Stacks, Item_Owning_lib=4, Volume=5834287, Record=1984545, Is_Renewal: no, Is_Precat: no,
Hold_request_lib=SUMMER, Hold_Pickup_Lib=4
...
• Search for thread trace to get the complete thread
DEMOSYS-DB2:/var/log/remote/prod/2009/04/30# grep ':1241097309492131]' osrfsys.10.log
• If logs have been gzip'd, 'zgrep' can be used
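The two grep steps above can be combined. The following is a sketch under the assumption that every log line carries the bracketed header shown in the excerpt, with the thread trace as the last colon-separated field before the closing bracket; the function name is made up for illustration.

```shell
# Prints every log line that shares a thread trace with the first line
# matching a search term (for example an item barcode).
trace_for() {
    local term=$1 logfile=$2 grep_cmd=grep trace
    # Compressed logs need zgrep instead of grep.
    case $logfile in *.gz) grep_cmd=zgrep ;; esac
    # Take the first matching line and pull the trace out of its header.
    trace=$("$grep_cmd" -F -e "$term" "$logfile" | head -n 1 |
            sed -n 's/^[^[]*\[[^]]*:\([0-9][0-9]*\)\].*/\1/p')
    [ -n "$trace" ] || return 1
    # Second pass: everything logged under that trace.
    "$grep_cmd" -F -e ":$trace]" "$logfile"
}
```

Usage would look like `trace_for 312345000098765 osrfsys.10.log`. If the term appears in several threads, only the first one is followed.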
21. Checking the logs (cont.)
• Checking the Postgres logs may also be worthwhile
DEMOSYS-DB2:/var/log/remote/prod/2009/04/30# grep 312345000098765 pg.10.log
...
2009-04-30 10:56:04 DEMOSYS-DB1 postgres[4298]: [607-7] barcode = '312345000098765', call_number =
5834287, circ_as_type = NULL, circ_lib = 4, circ_modifier = 'DVD', circulate = 't'
...
DEMOSYS-DB2:/var/log/remote/prod/2009/04/30# grep -F 'postgres[4298]: [607-' pg.10.log
...
• In this case, note we use the app name and PID in addition to the
thread trace
postgres[4298]: [607-7]
• postgres = app name
• 4298 = pid
• 607 = thread trace
• Postgres also uses a sequence ID
• “-7” = “this is the 7th line of the query”
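Pulling one statement back together from the Postgres log can be sketched as below. It assumes the syslog prefix format shown in the excerpt. The function name is hypothetical. grep -F is used so the square brackets are matched literally rather than being read as a regex bracket expression.

```shell
# Prints the text of one logged Postgres statement, line by line, given the
# backend PID and the number before the dash (607 in the example above).
pg_statement() {
    local pid=$1 seq=$2 logfile=$3
    # Select the lines for this PID and statement, then strip the syslog
    # prefix up to and including the "[607-7] " marker.
    grep -F -e "postgres[$pid]: [$seq-" "$logfile" |
        sed 's/^[^]]*\]: \[[0-9]*-[0-9]*\] //'
}
```

For the example above this would be `pg_statement 4298 607 pg.10.log`.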
22. Monitoring
• ESI uses Nagios, but is looking into additions/alternatives
• mrtg + snmp
• Splunk
• Ping
• Free memory
• System Load Average
• > 1.0 sustained average on an app server often indicates a “hung”
process
• Critical processes (apache, jabber, memcache, clark-kent.pl, etc)
• Disk Free Space
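A sustained-load check in the style of a Nagios plugin might look like the sketch below. It reads the 5-minute average as a rough stand-in for "sustained" and uses the conventional plugin exit codes (0 OK, 1 WARNING, 3 UNKNOWN). The function name is an assumption, and the second argument exists only so the check can be run against a sample file.

```shell
# Warns when the 5-minute load average exceeds a threshold (default 1.0,
# following the rule of thumb above).
check_load() {
    local threshold=${1:-1.0} loadavg_file=${2:-/proc/loadavg} load5
    # /proc/loadavg starts with the 1-, 5- and 15-minute averages.
    load5=$(awk '{ print $2 }' "$loadavg_file") || return 3
    # awk does the floating-point comparison the shell cannot do itself.
    if awk -v l="$load5" -v t="$threshold" 'BEGIN { exit !(l > t) }'; then
        echo "WARNING - 5-minute load average $load5 exceeds $threshold"
        return 1
    fi
    echo "OK - 5-minute load average $load5"
    return 0
}
```

On a multi-core app server the threshold probably needs raising; 1.0 is just the figure quoted on the slide.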
23. Monitoring (continued)
• Lock file age
• Fine generator
• Hold targeter
• Backup status
• File age
• WAL file age
• Sync to external backup
• Service timeouts (“Returning NULL” in the gateway logs)
• Bandwidth Usage
• Motherboard sensors
• Third-party ping service
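Lock file age, backup file age, and WAL file age all reduce to the same question: how old is this file? A minimal Nagios-style sketch follows, assuming GNU stat as found on Debian. The function name and thresholds are illustrative, not taken from any shipped plugin.

```shell
# Age check for a lock file, backup file, or WAL file. Thresholds are in seconds.
# Returns 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN).
check_file_age() {
    local file=$1 warn=$2 crit=$3 now mtime age
    [ -e "$file" ] || { echo "UNKNOWN - $file not found"; return 3; }
    now=$(date +%s)
    # %Y is the modification time in seconds since the epoch.
    mtime=$(stat -c %Y "$file") || return 3
    age=$(( now - mtime ))
    if [ "$age" -ge "$crit" ]; then
        echo "CRITICAL - $file is ${age}s old"
        return 2
    elif [ "$age" -ge "$warn" ]; then
        echo "WARNING - $file is ${age}s old"
        return 1
    fi
    echo "OK - $file is ${age}s old"
    return 0
}
```

How to read the result depends on the file. An old lock file suggests a stuck job, and a missing one usually just means the job is not running. For a backup file, both "old" and "missing" are bad news.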
24. Database schema (handout)
• Action
• Circulations, holds, surveys, etc
• Actor
• Org Units (libraries, branches, bookmobiles, etc), users, cards,
workstations
• Asset
• Call numbers (volume), copies (items), statcats
• Auditor
• Audit data (usually org units, users, copies, bib records)
• Authority
• Authority Record data
25. Database schema (continued)
• Biblio
• Bibliographic record data (not including metabib entries)
• Config
• Configuration tables (audiences, bib levels, item forms, circ rules,
etc)
• Container
• Buckets (biblio, call number, item, user)
• Metabib
• Indexed bibliographic record fields
• Money
• Fines, payments, etc.
26. Database schema (continued)
• Offline
• Tables for the offline staff client
• Permission
• Profile groups, user group maps, permission allocations, etc.
• Reporter
• Reports, Report Templates, Report Folders, and Report Views
• Vandelay
• Bib/Authority import/export
27. Cron Jobs
• All:
• ntpdate
• Utility:
• Hold targeter
• Hold thawer
• Fine generator
• Reshelving completer
• Notices/collections
• Backup
• General backup scripts
• rsync
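Jobs that run from cron on a fixed schedule can overlap if one run takes longer than the interval. One common guard is a lock file, sketched below as a generic wrapper. The function name, lock path, and exit status 75 are assumptions for illustration. Evergreen's own scripts may already manage their own lock files, in which case a wrapper like this is unnecessary.

```shell
# Runs a command under a lock file; refuses to start if the lock exists.
run_locked() {
    local lockfile=$1 rc
    shift
    # With noclobber set, the redirection fails if the file already exists,
    # so testing for the lock and creating it happen in one step.
    if ! ( set -o noclobber; echo "$$" > "$lockfile" ) 2>/dev/null; then
        echo "lock $lockfile exists, skipping: $*" >&2
        return 75
    fi
    "$@"
    rc=$?
    # Always release the lock, then pass the command's status through.
    rm -f "$lockfile"
    return "$rc"
}
```

A crontab entry would then call something like `run_locked /tmp/some_job.lock /path/to/job`, with both paths being placeholders. A lock file left behind by a crashed run is what the "lock file age" check under Monitoring is meant to catch.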
28. Backups
• Central file store
• Large systems may use dedicated server
• SCP/rsync/etc. data to central file store
• Many servers (ex: DB) "push" data to backup store
• Backup server sometimes "pulls" data from servers
• Sync data store to external/removable media
• USB hard drive works great for this
• Encrypting backups (e.g. with cryptfs) is definitely a good idea
• Rotate backups
• Ideally, every morning
• Off-site
• Secure storage (preferably in a safe designed specifically for
backup media)
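The "push" model can be sketched with rsync as follows. The function name, the dated subdirectory, and any host or path you pass in are assumptions. The destination's parent directory is expected to exist already.

```shell
# Pushes a local backup directory to the central file store with rsync.
# dest may be a local path or a remote one in user@host:/path form.
push_backup() {
    local src=$1 dest=$2 stamp
    # A dated subdirectory keeps each day's push separate.
    stamp=$(date +%Y-%m-%d)
    # -a preserves permissions, ownership and timestamps. The trailing slash
    # on the source copies its contents rather than the directory itself.
    rsync -a "${src%/}/" "${dest%/}/$stamp/"
}
```

Run from cron on the database server, this might be `push_backup /var/backups/pg backup@store:/srv/backups/db1`, where every name in that command is a placeholder.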
29. Conify (with demo)
• “System Settings” in staff client
• Moves OU Types, OU Names, permissions, etc. from the backend into the
staff client
• Autogen still required
• New in 1.4
• Improvements in 1.6
30. Circ Policies (with demo)
• Circ matrix defined in /openils/var/circ/
# ls -lh /openils/var/circ/
total 64K
-rwxr-xr-x 1 opensrf opensrf 25K 2009-05-18 21:47 circ_duration.js
-rwxr-xr-x 1 opensrf opensrf 730 2009-05-06 16:46 circ_groups.js
-rwxr-xr-x 1 opensrf opensrf 1.6K 2009-05-18 21:00 circ_item_config.js
-rwxr-xr-x 1 opensrf opensrf 8.1K 2009-05-06 16:46 circ_lib.js
-rwxr-xr-x 1 opensrf opensrf 439 2009-05-06 16:46 circ_permit_copy.js
-rwxr-xr-x 1 opensrf opensrf 840 2009-05-06 16:46 circ_permit_hold.js
-rwxr-xr-x 1 opensrf opensrf 652 2009-05-06 16:46 circ_permit_patron.js
-rwxr-xr-x 1 opensrf opensrf 137 2009-05-06 16:46 circ_permit_renew.js
• Major moves from files to DB in 1.6 (some support in 1.4)
• Will facilitate config via staff client
• Very flexible. Can address several aspects of users, copies, bibs, etc.
• Typically will use combination of item library, library performing action
(circulation, hold, etc), patron group, etc.