SlideShare a Scribd company logo
1 of 29
I Upgraded iRODS And I
Still Have All My Hair
John Constable
john.constable@sanger.ac.uk
Background
Wellcome Trust Sanger Institute has had iRODS since 2011
“Implementing a genomic data management system using iRODS in the Wellcome Trust Sanger Institute”
by Gen-Tao Chiang, .. Clapham, Coates DOI: 10.1186/1471-2105-12-361
~14PB, in five federated Zones, plus another.
Most iCAT’s in modest VMWare (a couple high traffic ones are physical servers) instances
Sanger DC - DDN SFA10k’s x4 and 2U iRES via fibrechannel, some HP SL4540 (inbuilt 60 disks)
(about 36 in total)
Janet Shared Data Centre - HP SL4540 (inbuilt 60 disks) - about 30
Configuration Management using CFEngine 3.1
Background (2)
All running iRODS 3.3.1 on Ubuntu 12.04, with Oracle RAC database backend.
Resource Groups to organise resources at sanger and resources at JSDC
Bash script to move resources where filesystem is > 98% full to a ‘full’ group.
Rules;
Force uploads to go into certain groups (the ‘not full’ ones)
Replicate data that goes into one group is copied to the other
Other Rules;
Disable trash (we want to be able to use the filesystem to directly retrieve files in extremis)
Force checksum generation on ingest
On one zone, don’t replicate files if they are put into a particular collection
Reason to move
● iRODS 3.3.1 end-of-life, not getting security fixes etc
● SSL end-to-end encryption
● Composable Resource Tree’s potential to simplify
rules and scripting needed
● Gen3 Query aka external metadata search
● The 2000’s called and wanted their software back.
Fail to Plan,
Plan to Fail
We were going to do this right!
Also, 14PB of 10 years of scientific research, often where you
couldn't get another sample.
Scientific colleagues looking on with some trepidation.
We had a placement student (Hi Andy!).
We shared our plans with RENCI and Had A Lot Of Conference
Calls
(are you a consortium member? Perhaps consider joining, this kind of thing was
very helpful. I have not been paid by RENCI to write this)
Plan to Fail
Test Plan
Basic plan
Spin up 3.3.1 in a virtual environment.
Run BATS test framework to setup basic configuration & baseline features
Upgrade to 4.1.X
Run bats framework again to verify retained functionality
Rinse, repeat as bugs found
Test Plan
Coverage● Must
○ Functionality we must have
○ Cannot upgrade without this working
○ Issues we have had fixed before
● Should
○ Would make operating the Zones easier
○ Better for the future
○ Functionality we’ve not used but plan to
● Could
○ Nice to have
○ Functionality we don’t plan to use but good to have ‘in the back pocket’
Unit tests!
Unit tests!
Gone BATSChose Bash Automated Testing System - https://github.com/sstephenson/bats
● Surprisingly, not a pun on the iRODS name
● Simple to write
● Everyone knows bash
● No learning curve e.g. server spec
● Some use of shunit previously, so some history
● Helpful guide when getting started: https://blog.engineyard.com/2014/bats-test-command-line-
tools
The BATS-mobile in
action./scripts/v3/icat/setup
./bats_scripts/icommands.bats
✓ Check the output of ils
✓ Check the output of ipwd
✓ make a collection
✓ remove a collection
✓ Check that iput stores a txt document correctly
✓ Check that iget can retrieve the txt document correctly
✓ Add Metadata
✓ List Metadata
✓ remove temporary file using irm
✓ clean up
10 tests, 0 failures
Anatomy
of a BAT(S)
#!/usr/bin/env bats
setup(){
INSERT_FILE=irods_unicode_ɸ_test.txt
dd if=/dev/zero of=$INSERT_FILE bs=1M
count=1
test_value_hex=""
}
@test "iput a file" {
iput -K -f $INSERT_FILE
run ils
for i in $lines[@]; do
if [ i = $INSERT_FILE ]; then
[ true ]
fi
done
[ false ]
Unit tests!
Did our BAT fly?What we found:
● Unit tests fine, but making a functional test out of them
was painful (we’d use server spec for that next time)
● Writing tests while doing testing lead to confusion over
passing tests
● We could have done with using branches and tags, but
we were new at this.
● Idempotence hard to implement
● Learned as much writing the tests as running them
● Orchestration across servers hard, ended up being
manual(ansible? Fabric? serverspec?)
Virtualised
Environments!
Virtualised environment using virtualbox
Can build irods 4.1.8 or 3.3.1 in a few mins, with oracle database, iCAT and two iRES servers.
Continuous
Integration!
Design
Documents!
Waterfall and Gant
Charts!
Agile time boxes not waterfall honest guv
Started on 4.1.7
Working Closely
with RENCI
April 26, 2016 (six months in): 103 Issues Raised, 86 fixed
A few initial teething
issues
● You can’t use v4 client to talk to v3 servers
● icommands packaged in /opt/renci not /software/irods
● Compile libraries will now be in /usr/include/irods
● Oracle not initially supported in 4.x
● When it became supported, client version 11.1 wasn’t so we’d need to upgrade
● No way to mark resources as full in Composite Resource Tree
● Rule ordering is important in server_config.json
● Because irods 3.3.1's Remote Zone SID uses a "-" to delimit server authentication, RENCI have
removed it from the list of allowable characters from the V2/V3 schema. As a result, we have to
use a local schema repo
End Result:
An upgrade plan
● Upgrade Dev zones first, wait a couple of weeks for people to test
● Update Config management
● Update all 1000 users irods environment files
● Snapshot VM’s & backup /usr/local/iRODS
● Packages put config in separate place from 3.3.1, so easy(?) to
revert if needed
● Automate installation of Oracle 11.2 client
● Run unit tests
● Backup database
● Install packages (not done via CFEngine due to phased rollout)
● Run database schema update
● Add specific environment variables for Unicode support for the Oracle plugin,
Oracle paths to irods config files and set checksum to MD5 - all using
RENCI’s update_json.py script (hidden gem!)
● Run unit tests
● Enable ‘high water mark’ (with 4.1.10 this would be set
minimum_free_space_for_create)
● Modify server_config.json to include federation stanza for federation
master and itself
● Modify server_config.json to include rules
● Check you can retrieve files
● Do all of the above for each iRES
Upgrade day!
4.1.8! 4.1.8!
August 2016, nearly a year after we started.
Began with the Federation Master, and icommands, as this would then broker the communication to
the other Zones.
We found in short order;
● we couldn’t list across Zones because acACLPolicy wasn’t disabling Strict ACL’s.
● we couldn’t upload files to Federee zones because msiSetDefaultResc wasn’t honoured across
zones, and so the composite tree didn’t know where to put the files.
● Ils -A run across zones was astonishingly slow (31 mins for 500 objects)
Paused the upgrade of the other zones until fixed
4.1.9
● After upgrading to debug versions we found we couldn’t upload large files ( ~2G)
● Most of our BAM files are at least this big.
● Three weeks later and many debug builds and conf calls with RENCI, 4.1.9!
4.1.10
November 2016, we found that Resources whose file systems were full were being written to, despite
taking advantage of the ‘High Water Mark’ functionality.
After much debug builds and work with RENCI, the ‘minimum free space for create in bytes’ was
introduced in 4.1.10, in conjunction with updating the freespace on a resource (we have a cron script)
4.1.11?
Once the dust from all of that had settled, we checked that we had enough replicas.
We didn’t, in part due to a congestion issue between our data centres.
So.. 4.1.11 might be coming to you soon, with the fixes from the assorted investigations into
replication.
But that’s another talk (it was only 145TB and 70k files...).
Disruption
Total disruption;
● at least a month of unavailability (mostly due to three weeks working out why large files not
uploading)
● Instability every couple of weeks as patch s/w installed
Lessons Learned
● Working with your customers scientists is invaluable.
● Check you have enough disk space
● Terrell and Jason are very patient people (also see above)
● Become a consortium member and discuss things on the iRODS-CHAT mailing list
● Test Frameworks such as BATS
● ‘Infrastructure As Code’ (config management and virtualised images in this case)
● CI helps to make the same mistakes again reliable packages if you must do that kind of thing
● naming resources by server name and LUN/disk combination e.g. irods-seq-i23-bc
● python /var/lib/irods/packaging/update_json.py /etc/irods/server_config.json string environment_variables,"ORACLE_HOME"
"/opt/oracle/instantclient_11_2"
● Izonereport and jq BFF
izonereport | jq '.["zones"][].icat_server.resources[], .["zones"][].resource_servers[].resources[] | select(.host=="irods-g1-dev.internal.sanger.ac.uk") | {name:
The Future
4.1.11 to 4.2.
Must only take us two weeks to test..
Upgrade has to be smooth
Also, turn on SSL everywhere
GenQuery3
Ceph?
Thanks
● RENCI, especially
○ Terrel
○ Jason
○ Ben
○ Antoine
● NPG, especially
○ Keith James
● ISG
○ Andy Perry
○ Pete Clapham

More Related Content

What's hot

Docker and kubernetes
Docker and kubernetesDocker and kubernetes
Docker and kubernetesDongwon Kim
 
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...Chris Fregly
 
Hadoop Query Performance Smackdown
Hadoop Query Performance SmackdownHadoop Query Performance Smackdown
Hadoop Query Performance SmackdownDataWorks Summit
 
Enhancing OpenStack FWaaS for real world application
Enhancing OpenStack FWaaS for real world applicationEnhancing OpenStack FWaaS for real world application
Enhancing OpenStack FWaaS for real world applicationopenstackindia
 
April 2016 HUG: The latest of Apache Hadoop YARN and running your docker apps...
April 2016 HUG: The latest of Apache Hadoop YARN and running your docker apps...April 2016 HUG: The latest of Apache Hadoop YARN and running your docker apps...
April 2016 HUG: The latest of Apache Hadoop YARN and running your docker apps...Yahoo Developer Network
 
Hostvn ceph in production v1.1 dungtq
Hostvn   ceph in production v1.1 dungtqHostvn   ceph in production v1.1 dungtq
Hostvn ceph in production v1.1 dungtqViet Stack
 
Introduction to ZooKeeper - TriHUG May 22, 2012
Introduction to ZooKeeper - TriHUG May 22, 2012Introduction to ZooKeeper - TriHUG May 22, 2012
Introduction to ZooKeeper - TriHUG May 22, 2012mumrah
 
Cloud: From Unmanned Data Center to Algorithmic Economy using Openstack
Cloud: From Unmanned Data Center to Algorithmic Economy using OpenstackCloud: From Unmanned Data Center to Algorithmic Economy using Openstack
Cloud: From Unmanned Data Center to Algorithmic Economy using OpenstackAndrew Yongjoon Kong
 
Introduction to Apache Mesos
Introduction to Apache MesosIntroduction to Apache Mesos
Introduction to Apache MesosJoe Stein
 
Meetup on Apache Zookeeper
Meetup on Apache ZookeeperMeetup on Apache Zookeeper
Meetup on Apache ZookeeperAnshul Patel
 
Tips Tricks and Tactics with Cells and Scaling OpenStack - May, 2015
Tips Tricks and Tactics with Cells and Scaling OpenStack - May, 2015Tips Tricks and Tactics with Cells and Scaling OpenStack - May, 2015
Tips Tricks and Tactics with Cells and Scaling OpenStack - May, 2015Belmiro Moreira
 
OpenStack Nova - Developer Introduction
OpenStack Nova - Developer IntroductionOpenStack Nova - Developer Introduction
OpenStack Nova - Developer IntroductionJohn Garbutt
 
OpenStack High Availability
OpenStack High AvailabilityOpenStack High Availability
OpenStack High AvailabilityJakub Pavlik
 
1 Million Writes per second on 60 nodes with Cassandra and EBS
1 Million Writes per second on 60 nodes with Cassandra and EBS1 Million Writes per second on 60 nodes with Cassandra and EBS
1 Million Writes per second on 60 nodes with Cassandra and EBSJim Plush
 
The Data Mullet: From all SQL to No SQL back to Some SQL
The Data Mullet: From all SQL to No SQL back to Some SQLThe Data Mullet: From all SQL to No SQL back to Some SQL
The Data Mullet: From all SQL to No SQL back to Some SQLDatadog
 
Operating PostgreSQL at Scale with Kubernetes
Operating PostgreSQL at Scale with KubernetesOperating PostgreSQL at Scale with Kubernetes
Operating PostgreSQL at Scale with KubernetesJonathan Katz
 
Introduction to Stacki - World's fastest Linux server provisioning Tool
Introduction to Stacki - World's fastest Linux server provisioning ToolIntroduction to Stacki - World's fastest Linux server provisioning Tool
Introduction to Stacki - World's fastest Linux server provisioning ToolSuresh Paulraj
 
An Introduction to Using PostgreSQL with Docker & Kubernetes
An Introduction to Using PostgreSQL with Docker & KubernetesAn Introduction to Using PostgreSQL with Docker & Kubernetes
An Introduction to Using PostgreSQL with Docker & KubernetesJonathan Katz
 

What's hot (20)

Docker and kubernetes
Docker and kubernetesDocker and kubernetes
Docker and kubernetes
 
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
 
Hadoop Query Performance Smackdown
Hadoop Query Performance SmackdownHadoop Query Performance Smackdown
Hadoop Query Performance Smackdown
 
Enhancing OpenStack FWaaS for real world application
Enhancing OpenStack FWaaS for real world applicationEnhancing OpenStack FWaaS for real world application
Enhancing OpenStack FWaaS for real world application
 
April 2016 HUG: The latest of Apache Hadoop YARN and running your docker apps...
April 2016 HUG: The latest of Apache Hadoop YARN and running your docker apps...April 2016 HUG: The latest of Apache Hadoop YARN and running your docker apps...
April 2016 HUG: The latest of Apache Hadoop YARN and running your docker apps...
 
Hostvn ceph in production v1.1 dungtq
Hostvn   ceph in production v1.1 dungtqHostvn   ceph in production v1.1 dungtq
Hostvn ceph in production v1.1 dungtq
 
Introduction to ZooKeeper - TriHUG May 22, 2012
Introduction to ZooKeeper - TriHUG May 22, 2012Introduction to ZooKeeper - TriHUG May 22, 2012
Introduction to ZooKeeper - TriHUG May 22, 2012
 
Cloud: From Unmanned Data Center to Algorithmic Economy using Openstack
Cloud: From Unmanned Data Center to Algorithmic Economy using OpenstackCloud: From Unmanned Data Center to Algorithmic Economy using Openstack
Cloud: From Unmanned Data Center to Algorithmic Economy using Openstack
 
Cloud data center and openstack
Cloud data center and openstackCloud data center and openstack
Cloud data center and openstack
 
Introduction to Apache Mesos
Introduction to Apache MesosIntroduction to Apache Mesos
Introduction to Apache Mesos
 
Meetup on Apache Zookeeper
Meetup on Apache ZookeeperMeetup on Apache Zookeeper
Meetup on Apache Zookeeper
 
Tips Tricks and Tactics with Cells and Scaling OpenStack - May, 2015
Tips Tricks and Tactics with Cells and Scaling OpenStack - May, 2015Tips Tricks and Tactics with Cells and Scaling OpenStack - May, 2015
Tips Tricks and Tactics with Cells and Scaling OpenStack - May, 2015
 
OpenStack Nova - Developer Introduction
OpenStack Nova - Developer IntroductionOpenStack Nova - Developer Introduction
OpenStack Nova - Developer Introduction
 
Advanced Operations
Advanced OperationsAdvanced Operations
Advanced Operations
 
OpenStack High Availability
OpenStack High AvailabilityOpenStack High Availability
OpenStack High Availability
 
1 Million Writes per second on 60 nodes with Cassandra and EBS
1 Million Writes per second on 60 nodes with Cassandra and EBS1 Million Writes per second on 60 nodes with Cassandra and EBS
1 Million Writes per second on 60 nodes with Cassandra and EBS
 
The Data Mullet: From all SQL to No SQL back to Some SQL
The Data Mullet: From all SQL to No SQL back to Some SQLThe Data Mullet: From all SQL to No SQL back to Some SQL
The Data Mullet: From all SQL to No SQL back to Some SQL
 
Operating PostgreSQL at Scale with Kubernetes
Operating PostgreSQL at Scale with KubernetesOperating PostgreSQL at Scale with Kubernetes
Operating PostgreSQL at Scale with Kubernetes
 
Introduction to Stacki - World's fastest Linux server provisioning Tool
Introduction to Stacki - World's fastest Linux server provisioning ToolIntroduction to Stacki - World's fastest Linux server provisioning Tool
Introduction to Stacki - World's fastest Linux server provisioning Tool
 
An Introduction to Using PostgreSQL with Docker & Kubernetes
An Introduction to Using PostgreSQL with Docker & KubernetesAn Introduction to Using PostgreSQL with Docker & Kubernetes
An Introduction to Using PostgreSQL with Docker & Kubernetes
 

Similar to RENCI User Group Meeting 2017 - I Upgraded iRODS and I still have all my hair

[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...
[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...
[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...Ambassador Labs
 
Provisioning Servers Made Easy
Provisioning Servers Made EasyProvisioning Servers Made Easy
Provisioning Servers Made EasyAll Things Open
 
New Oracle Infrastructure2
New Oracle Infrastructure2New Oracle Infrastructure2
New Oracle Infrastructure2markleeuw
 
Road to sbt 1.0 paved with server
Road to sbt 1.0   paved with serverRoad to sbt 1.0   paved with server
Road to sbt 1.0 paved with serverEugene Yokota
 
Less02 Installation
Less02 InstallationLess02 Installation
Less02 Installationvivaankumar
 
Less02 Installation
Less02  InstallationLess02  Installation
Less02 Installationvivaankumar
 
Migraine Drupal - syncing your staging and live sites
Migraine Drupal - syncing your staging and live sitesMigraine Drupal - syncing your staging and live sites
Migraine Drupal - syncing your staging and live sitesdrupalindia
 
My Sql Performance In A Cloud
My Sql Performance In A CloudMy Sql Performance In A Cloud
My Sql Performance In A CloudSky Jian
 
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...javier ramirez
 
DevOps Fest 2020. immutable infrastructure as code. True story.
DevOps Fest 2020. immutable infrastructure as code. True story.DevOps Fest 2020. immutable infrastructure as code. True story.
DevOps Fest 2020. immutable infrastructure as code. True story.Vlad Fedosov
 
My Sql Performance On Ec2
My Sql Performance On Ec2My Sql Performance On Ec2
My Sql Performance On Ec2MySQLConference
 
Hadoop Meetup Jan 2019 - Overview of Ozone
Hadoop Meetup Jan 2019 - Overview of OzoneHadoop Meetup Jan 2019 - Overview of Ozone
Hadoop Meetup Jan 2019 - Overview of OzoneErik Krogen
 
How Many Slaves (Ukoug)
How Many Slaves (Ukoug)How Many Slaves (Ukoug)
How Many Slaves (Ukoug)Doug Burns
 
Datastage parallell jobs vs datastage server jobs
Datastage parallell jobs vs datastage server jobsDatastage parallell jobs vs datastage server jobs
Datastage parallell jobs vs datastage server jobsshanker_uma
 
Serverless Compose vs hurtownia danych
Serverless Compose vs hurtownia danychServerless Compose vs hurtownia danych
Serverless Compose vs hurtownia danychThe Software House
 
Introducing resinOS: An Operating System Tailored for Containers and Built fo...
Introducing resinOS: An Operating System Tailored for Containers and Built fo...Introducing resinOS: An Operating System Tailored for Containers and Built fo...
Introducing resinOS: An Operating System Tailored for Containers and Built fo...Balena
 

Similar to RENCI User Group Meeting 2017 - I Upgraded iRODS and I still have all my hair (20)

[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...
[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...
[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...
 
Provisioning Servers Made Easy
Provisioning Servers Made EasyProvisioning Servers Made Easy
Provisioning Servers Made Easy
 
New Oracle Infrastructure2
New Oracle Infrastructure2New Oracle Infrastructure2
New Oracle Infrastructure2
 
Road to sbt 1.0 paved with server
Road to sbt 1.0   paved with serverRoad to sbt 1.0   paved with server
Road to sbt 1.0 paved with server
 
Sg1 Cover Page
Sg1 Cover PageSg1 Cover Page
Sg1 Cover Page
 
Less02 Installation
Less02 InstallationLess02 Installation
Less02 Installation
 
Less02 Installation
Less02  InstallationLess02  Installation
Less02 Installation
 
Migraine Drupal - syncing your staging and live sites
Migraine Drupal - syncing your staging and live sitesMigraine Drupal - syncing your staging and live sites
Migraine Drupal - syncing your staging and live sites
 
My Sql Performance In A Cloud
My Sql Performance In A CloudMy Sql Performance In A Cloud
My Sql Performance In A Cloud
 
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
 
DevOps Fest 2020. immutable infrastructure as code. True story.
DevOps Fest 2020. immutable infrastructure as code. True story.DevOps Fest 2020. immutable infrastructure as code. True story.
DevOps Fest 2020. immutable infrastructure as code. True story.
 
RAC - Test
RAC - TestRAC - Test
RAC - Test
 
Pratyusa_Resume
Pratyusa_ResumePratyusa_Resume
Pratyusa_Resume
 
My Sql Performance On Ec2
My Sql Performance On Ec2My Sql Performance On Ec2
My Sql Performance On Ec2
 
amala_storage
amala_storageamala_storage
amala_storage
 
Hadoop Meetup Jan 2019 - Overview of Ozone
Hadoop Meetup Jan 2019 - Overview of OzoneHadoop Meetup Jan 2019 - Overview of Ozone
Hadoop Meetup Jan 2019 - Overview of Ozone
 
How Many Slaves (Ukoug)
How Many Slaves (Ukoug)How Many Slaves (Ukoug)
How Many Slaves (Ukoug)
 
Datastage parallell jobs vs datastage server jobs
Datastage parallell jobs vs datastage server jobsDatastage parallell jobs vs datastage server jobs
Datastage parallell jobs vs datastage server jobs
 
Serverless Compose vs hurtownia danych
Serverless Compose vs hurtownia danychServerless Compose vs hurtownia danych
Serverless Compose vs hurtownia danych
 
Introducing resinOS: An Operating System Tailored for Containers and Built fo...
Introducing resinOS: An Operating System Tailored for Containers and Built fo...Introducing resinOS: An Operating System Tailored for Containers and Built fo...
Introducing resinOS: An Operating System Tailored for Containers and Built fo...
 

Recently uploaded

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 

Recently uploaded (20)

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 

RENCI User Group Meeting 2017 - I Upgraded iRODS and I still have all my hair

  • 1. I Upgraded iRODS And I Still Have All My Hair John Constable john.constable@sanger.ac.uk
  • 2. Background Wellcome Trust Sanger Institute has had iRODS since 2011 “Implementing a genomic data management system using iRODS in the Wellcome Trust Sanger Institute” by Gen-Tao Chiang, .. Clapham, Coates DOI: 10.1186/1471-2105-12-361 ~14PB, in five federated Zones, plus another. Most iCAT’s in modest VMWare (a couple high traffic ones are physical servers) instances Sanger DC - DDN SFA10k’s x4 and 2U iRES via fibrechannel, some HP SL4540 (inbuilt 60 disks) (about 36 in total) Janet Shared Data Centre - HP SL4540 (inbuilt 60 disks) - about 30 Configuration Management using CFEngine 3.1
  • 3. Background (2) All running iRODS 3.3.1 on Ubuntu 12.04, with Oracle RAC database backend. Resource Groups to organise resources at sanger and resources at JSDC Bash script to move resources where filesystem is > 98% full to a ‘full’ group. Rules; Force uploads to go into certain groups (the ‘not full’ ones) Replicate data that goes into one group is copied to the other Other Rules; Disable trash (we want to be able to use the filesystem to directly retrieve files in extremis) Force checksum generation on ingest On one zone, don’t replicate files if they are put into a particular collection
  • 4. Reason to move ● iRODS 3.3.1 end-of-life, not getting security fixes etc ● SSL end-to-end encryption ● Composable Resource Tree’s potential to simplify rules and scripting needed ● Gen3 Query aka external metadata search ● The 2000’s called and wanted their software back.
  • 5. Fail to Plan, Plan to Fail We were going to do this right! Also, 14PB of 10 years of scientific research, often where you couldn't get another sample. Scientific colleagues looking on with some trepidation. We had a placement student (Hi Andy!). We shared our plans with RENCI and Had A Lot Of Conference Calls (are you a consortium member? Perhaps consider joining, this kind of thing was very helpful. I have not been paid by RENCI to write this)
  • 7. Basic plan Spin up 3.3.1 in a virtual environment. Run BATS test framework to setup basic configuration & baseline features Upgrade to 4.1.X Run bats framework again to verify retained functionality Rinse, repeat as bugs found
  • 8. Test Plan Coverage● Must ○ Functionality we must have ○ Cannot upgrade without this working ○ Issues we have had fixed before ● Should ○ Would make operating the Zones easier ○ Better for the future ○ Functionality we’ve not used but plan to ● Could ○ Nice to have ○ Functionality we don’t plan to use but good to have ‘in the back pocket’
  • 10. Unit tests! Gone BATSChose Bash Automated Testing System - https://github.com/sstephenson/bats ● Surprisingly, not a pun on the iRODS name ● Simple to write ● Everyone knows bash ● No learning curve e.g. server spec ● Some use of shunit previously, so some history ● Helpful guide when getting started: https://blog.engineyard.com/2014/bats-test-command-line- tools
  • 11. The BATS-mobile in action./scripts/v3/icat/setup ./bats_scripts/icommands.bats ✓ Check the output of ils ✓ Check the output of ipwd ✓ make a collection ✓ remove a collection ✓ Check that iput stores a txt document correctly ✓ Check that iget can retrieve the txt document correctly ✓ Add Metadata ✓ List Metadata ✓ remove temporary file using irm ✓ clean up 10 tests, 0 failures
  • 12. Anatomy of a BAT(S) #!/usr/bin/env bats setup(){ INSERT_FILE=irods_unicode_ɸ_test.txt dd if=/dev/zero of=$INSERT_FILE bs=1M count=1 test_value_hex="" } @test "iput a file" { iput -K -f $INSERT_FILE run ils for i in $lines[@]; do if [ i = $INSERT_FILE ]; then [ true ] fi done [ false ]
  • 13. Unit tests! Did our BAT fly?What we found: ● Unit tests fine, but making a functional test out of them was painful (we’d use server spec for that next time) ● Writing tests while doing testing lead to confusion over passing tests ● We could have done with using branches and tags, but we were new at this. ● Idempotence hard to implement ● Learned as much writing the tests as running them ● Orchestration across servers hard, ended up being manual(ansible? Fabric? serverspec?)
  • 14. Virtualised Environments! Virtualised environment using virtualbox Can build irods 4.1.8 or 3.3.1 in a few mins, with oracle database, iCAT and two iRES servers.
  • 17. Waterfall and Gant Charts! Agile time boxes not waterfall honest guv
  • 19. Working Closely with RENCI April 26, 2016 (six months in): 103 Issues Raised, 86 fixed
  • 20. A few initial teething issues ● You can’t use v4 client to talk to v3 servers ● icommands packaged in /opt/renci not /software/irods ● Compile libraries will now be in /usr/include/irods ● Oracle not initially supported in 4.x ● When it became supported, client version 11.1 wasn’t so we’d need to upgrade ● No way to mark resources as full in Composite Resource Tree ● Rule ordering is important in server_config.json ● Because irods 3.3.1's Remote Zone SID uses a "-" to delimit server authentication, RENCI have removed it from the list of allowable characters from the V2/V3 schema. As a result, we have to use a local schema repo
  • 21. End Result: An upgrade plan ● Upgrade Dev zones first, wait a couple of weeks for people to test ● Update Config management ● Update all 1000 users irods environment files ● Snapshot VM’s & backup /usr/local/iRODS ● Packages put config in separate place from 3.3.1, so easy(?) to revert if needed ● Automate installation of Oracle 11.2 client ● Run unit tests ● Backup database ● Install packages (not done via CFEngine due to phased rollout) ● Run database schema update ● Add specific environment variables for Unicode support for the Oracle plugin, Oracle paths to irods config files and set checksum to MD5 - all using RENCI’s update_json.py script (hidden gem!) ● Run unit tests ● Enable ‘high water mark’ (with 4.1.10 this would be set minimum_free_space_for_create) ● Modify server_config.json to include federation stanza for federation master and itself ● Modify server_config.json to include rules ● Check you can retrieve files ● Do all of the above for each iRES
  • 22. Upgrade day! 4.1.8! 4.1.8! August 2016, nearly a year after we started. Began with the Federation Master, and icommands, as this would then broker the communication to the other Zones. We found in short order; ● we couldn’t list across Zones because acACLPolicy wasn’t disabling Strict ACL’s. ● we couldn’t upload files to Federee zones because msiSetDefaultResc wasn’t honoured across zones, and so the composite tree didn’t know where to put the files. ● Ils -A run across zones was astonishingly slow (31 mins for 500 objects) Paused the upgrade of the other zones until fixed
  • 23. 4.1.9 ● After upgrading to debug versions we found we couldn’t upload large files ( ~2G) ● Most of our BAM files are at least this big. ● Three weeks later and many debug builds and conf calls with RENCI, 4.1.9!
  • 24. 4.1.10 November 2016, we found that Resources whose file systems were full were being written to, despite taking advantage of the ‘High Water Mark’ functionality. After much debug builds and work with RENCI, the ‘minimum free space for create in bytes’ was introduced in 4.1.10, in conjunction with updating the freespace on a resource (we have a cron script)
  • 25. 4.1.11? Once the dust from all of that had settled, we checked that we had enough replicas. We didn’t, in part due to a congestion issue between our data centres. So.. 4.1.11 might be coming to you soon, with the fixes from the assorted investigations into replication. But that’s another talk (it was only 145TB and 70k files...).
  • 26. Disruption Total disruption; ● at least a month of unavailability (mostly due to three weeks working out why large files not uploading) ● Instability every couple of weeks as patch s/w installed
  • 27. Lessons Learned ● Working with your customers scientists is invaluable. ● Check you have enough disk space ● Terrell and Jason are very patient people (also see above) ● Become a consortium member and discuss things on the iRODS-CHAT mailing list ● Test Frameworks such as BATS ● ‘Infrastructure As Code’ (config management and virtualised images in this case) ● CI helps to make the same mistakes again reliable packages if you must do that kind of thing ● naming resources by server name and LUN/disk combination e.g. irods-seq-i23-bc ● python /var/lib/irods/packaging/update_json.py /etc/irods/server_config.json string environment_variables,"ORACLE_HOME" "/opt/oracle/instantclient_11_2" ● Izonereport and jq BFF izonereport | jq '.["zones"][].icat_server.resources[], .["zones"][].resource_servers[].resources[] | select(.host=="irods-g1-dev.internal.sanger.ac.uk") | {name:
  • 28. The Future 4.1.11 to 4.2. Must only take us two weeks to test.. Upgrade has to be smooth Also, turn on SSL everywhere GenQuery3 Ceph?
  • 29. Thanks ● RENCI, especially ○ Terrel ○ Jason ○ Ben ○ Antoine ● NPG, especially ○ Keith James ● ISG ○ Andy Perry ○ Pete Clapham

Editor's Notes

  1. Been meaning to write this up for a while, apologies I can’t be there in person. How many people running 3.3.1? How many people have *heard* of 3.3.1?
  2. 6 years of irods, I’ve been involved in the past two. Most zones federated, mix of physical and virtual hardware. Storage is physical
  3. Kept the setup as simple as possible. Still able to retrieve files in event of disaster - this has proved very useful Lots of metadata via genomics pipelines (bulk of input)
  4. Consortium membership! Added another ~2 PB since then.
  5. We came up with a lists of tests, shared with RENCI, of all the 4.1.x functionality we wanted or must have Google docs very helpful here
  6. Shared tests out between 3 people, all using own copies of virtual environment
  7. On Github, not maintained for a while. Might be getting an update soon! Written in conjunction with Bioinformaticians who own the data in the zones as they have written tools to work with iRODS
  8. Example output. In colour, sometimes!
  9. Example of a bats test
  10. Repeatable nature of the test infrastructure vital here. Last time vagrant, which limuted size of deployment (only had 8Gb laptops), this time we hope openstack, and share templates with colleages’ in other teams. Also would like to do cross federation testing
  11. We made our own packages for kerberos plugin and icommands. Moving away from that now, but needed for reproducible science & migration
  12. We wrote down ahead of time what we planned to accomplish and why. Do you know the saying ‘if you want to make god laugh, tell him your plans..”?
  13. Plan was very helpful in communicating to people *when* their precious data would be unavailable, and highlighting gaps for e.g. holidays. And...how much we slipped by
  14. Are you a consortium member? Did you know you can influence the direction of iRODS by joining? How about get support?
  15. Found all this out BEFORE we upgraded. Massive win.
  16. Astonished if you can read this, but damned if I was going to let all that effort go to waste.. Slides available later on RENCI’s pages, also drop me an email if you want the gory details. Basically as 4.x stores config in different place, it was forklift upgrade. iCAt then iRES. Lots of testing as we go.
  17. Upgrade! Yay! What do you mean you can't list across federation? Or upload files? Basically federation we did not test.
  18. RENCI turned this round for us very quickly. Would have had to roll back to 3.3.1 if not.
  19. Becoming hard to test at scale - replicating live environment too hard (openstack FTW?), not to mention of of files and resource ful-ness
  20. Replication is my current worry
  21. It was a lot of downtime, but still very little compared to 14PB of data and 1.5 people managing it..
  22. Are you a consortium member? Did you know you can influence the direction of iRODS by joining? How about get support?