SlideShare une entreprise Scribd logo
1  sur  19
Télécharger pour lire hors ligne
Digital Document
Preservation Simulation
Richard Landau
(late of Digital Equipment,
Compaq, Dell)
- Research Affiliate, MIT Libraries
Program on Information Science
- http://informatics.mit.edu/people/
richard-landau
2014-07-22
The Problem
• How do you preserve digital documents long-term?
• Surely not on CDs, tapes, etc., with short lifetimes
• LOCKSS: Lots Of Copies Keep Stuff Safe
• What's the threat model?
– Failures: media, format obsolescence, fires, floods, earthquakes,
institutional failures, mergers, funding cuts, malicious insiders,….
• Some data exists on disk media reliability, RAID, etc.
• Little data on reliability of storage strategies
BPG - Digital Document Preservation Simulation 20140722 RBL 2
The Project
• Digital Document Preservation Simulation
• Part of Program on Information Science, MIT Libraries:
Dr. Micah Altman, Director of Research
• Develop empirical data that real libraries can use to
make decisions about storage strategies and costs
– Simulate a range of situations, let the clients decide what level of
risk they are willing to accept
• I do this for fun: volunteer intern, 1-2 days/week
BPG - Digital Document Preservation Simulation 20140722 RBL 3
Questions to Be Answered
• Start small: 10,000 documents for 10 years, 1-20 copies
• (Short term questions, lots more in the long term)
• Question 0: If I place a number of copies out into the
network, how many will I lose over time?
– For various error rates and copies, what's the risk?
• Question 1: If I audit the remote copies and repair them if
they're broken, how many will I lose over time?
– For various auditing strategies and frequencies, what's the risk?
BPG - Digital Document Preservation Simulation 20140722 RBL 4
Tools
• Windows 7 on largish PCs
• Cygwin for Windows (like Linux on Windows)
• Python 2.7
– SimPy library for discrete event simulations
– argparse module for CLI
– csv module for reading parameter files
– logging module for recording events
– itertools for generating serial numbers
– random to generate exponentials, uniforms, Gaussians
• Random seed values from random.org
BPG - Digital Document Preservation Simulation 20140722 RBL 5
Approach
• Very traditional object and service oriented design
– Client, Document, Collection of documents
– Server, Copy of document, Storage Shelf
– Auditor
• Asynchronous processes managed by SimPy
– Sector failure, shelf failure, at exponential intervals
– Audit a collection on a server, at regular intervals
• SimPy resource, network mutex for auditors
• Slightly perverse code (e.g., Hungarian naming)
BPG - Digital Document Preservation Simulation 20140722 RBL 6
How SimPy Works
• Library for discrete event simulation, V3
• Discrete event queue sorted by time
• Asynchronous processes, resources, events, timers
– Process is a Python generator, yield to wait for event(s)
• Timeouts, discrete events, resource requests
• Very simple to use, very efficient, good tutorial examples
• Time is floating point, choose your units
BPG - Digital Document Preservation Simulation 20140722 RBL 7
Early Results for Q0
BPG - Digital Document Preservation Simulation 20140722 RBL 8
Is Python Fast Enough?
• Yes, if one is careful
– Code it all in C++? No. Get results faster with Python.
– "Premature optimization is the root of all evil" -- Knuth
• Numerous small optimizations on the squeaky wheels
– Just a few reduced CPU and I/O by 98%
• If it's still not fast enough, use a bigger computer
BPG - Digital Document Preservation Simulation 20140722 RBL 9
Burn Those CPUs!
• One test only
2 sec to 8 min
• 2000-4000 tests
• Scheduler runs
5-7 programs at
once
• All 4 cores
• (Poor man's
parallelism)
• 99% CPU use
BPG - Digital Document Preservation Simulation 20140722 RBL 10
Class Instances Should Have Names
• <client.CDocument object at 0x6ffff9f6690> is
imperfectly informative
– D127 is much better for humans: name, not address
– With zillions of instances, identifying one during debugging is
hard
• Suggestion:
– Assign a unique ID to every instance that denotes class and a
serial number
– Always always always pass IDs as arguments rather than
instance pointers
BPG - Digital Document Preservation Simulation 20140722 RBL 11
How I Assign IDs
• Every class begins like this:
• itertools.count(1).next function returns a unique
integer for this class, starting at 1, when you invoke it
• Global dictionary dID2Xxxx translates from an ID string
to an instance of class Xxxx
• Pass ID as argument; when a function needs the pointer,
it can use a single fast dictionary lookup
BPG - Digital Document Preservation Simulation 20140722 RBL 12
class CDocument(object):
getID = itertools.count(1).next
(in __init__)
self.ID = "D" + str(self.getID())
G.dID2Document[self.ID] = self
Lessons Learned
• Python is fast enough
• SimPy is really dandy
• argparse is a great lib for CLIs
• csv.DictReader makes everything a string, beware
• item-in-list is really slow, O(n/2), use dict or set instead
• Lots of comments in the code!
BPG - Digital Document Preservation Simulation 20140722 RBL 13
Code is on github
• On github: MIT-Informatics/PreservationSimulation
• Questions: mailto:landau@ricksoft.com
BPG - Digital Document Preservation Simulation 20140722 RBL 14
Further Details (Not Tonight)
• Many small optimizations add up
– Fast find of victim document on a shelf
– Shorten log file by removing many or all details
– Minimize just-in-time checks during auditing
– Change item-in-list checks to item-in-dict
• Wide variety of scenarios covered
– 1000X span on document size, 100X on storage shelf size,
10000X span on failure rates, 20X span on number of copies
– A test run takes from 2 seconds to 8 minutes (CPU)
• Running external programs
• Post-processing of structured log files
BPG - Digital Document Preservation Simulation 20140722 RBL 15
Forming and Running
External Commands
• Substitute multiple arguments into a string with format()
• Run command and capture output into string (or list)
BPG - Digital Document Preservation Simulation 20140722 RBL 16
TemplateCmd = "ps | grep {Name} | wc -l"
RealCmd = TemplateCmd.format(**dictArgs)
ResultString = ""
for Line in os.popen(RealCmd):
ResultString += Line.strip()
nProcesses = int(ResultString)
if nProcesses < nProcessMax:
. . .
A More Complicated Command
• Read a dict of parameters with csv.DictReader
• And then substitute lots of arguments with format()
• Execute and capture output with os.popen()
BPG - Digital Document Preservation Simulation 20140722 RBL 17
dir,specif,len,seed,cop,ber,berm,extra1
../q1,v3d50,0,919028296,01,0050,0050000,
python main.py {family} {specif} {len} {seed} 
--ncopies={cop} --ber={berm} {extra1} > 
{dir}/{specif}/log{ber}/c{cop}b{ber}s{seed}.log 
2>&1 &
Subprocess Module Instead
• os.popen is being replaced by more general functions
– subprocess.check_output
– subprocess.Popen
– subprocess.Pipe
– subprocess.communicate
BPG - Digital Document Preservation Simulation 20140722 RBL 18
Supersede That Old Function
• Have a complicated function that works, but you want to
replace it with a spiffier version?
– Edit in place?
• Might break the whole thing for a while
– Comment it out with ''' or """?
• Not perfectly reliable, like "if 0" in C
• Supersede it: add the new version after the old
– Last version replaces previous one in module or class dict
BPG - Digital Document Preservation Simulation 20140722 RBL 19
def Foo(…):
(moldy old code)
def Foo(…):
(shiny new code)

Contenu connexe

Tendances

Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-timeChris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Flink Forward
 

Tendances (20)

Chronix Time Series Database - The New Time Series Kid on the Block
Chronix Time Series Database - The New Time Series Kid on the BlockChronix Time Series Database - The New Time Series Kid on the Block
Chronix Time Series Database - The New Time Series Kid on the Block
 
Introduction to InfluxDB
Introduction to InfluxDBIntroduction to InfluxDB
Introduction to InfluxDB
 
A Deep Learning use case for water end use detection by Roberto Díaz and José...
A Deep Learning use case for water end use detection by Roberto Díaz and José...A Deep Learning use case for water end use detection by Roberto Díaz and José...
A Deep Learning use case for water end use detection by Roberto Díaz and José...
 
Real-time driving score service using Flink
Real-time driving score service using FlinkReal-time driving score service using Flink
Real-time driving score service using Flink
 
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-timeChris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
 
Need for Time series Database
Need for Time series DatabaseNeed for Time series Database
Need for Time series Database
 
Reactive Stream Processing Using DDS and Rx
Reactive Stream Processing Using DDS and RxReactive Stream Processing Using DDS and Rx
Reactive Stream Processing Using DDS and Rx
 
Predictive Maintenance with Deep Learning and Apache Flink
Predictive Maintenance with Deep Learning and Apache FlinkPredictive Maintenance with Deep Learning and Apache Flink
Predictive Maintenance with Deep Learning and Apache Flink
 
Intelligent integration with WSO2 ESB & WSO2 CEP
Intelligent integration with WSO2 ESB & WSO2 CEP Intelligent integration with WSO2 ESB & WSO2 CEP
Intelligent integration with WSO2 ESB & WSO2 CEP
 
Scaling event aggregation at twitter
Scaling event aggregation at twitterScaling event aggregation at twitter
Scaling event aggregation at twitter
 
Data Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RData Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and R
 
Scalable real-time processing techniques
Scalable real-time processing techniquesScalable real-time processing techniques
Scalable real-time processing techniques
 
Self driving computers active learning workflows with human interpretable ve...
Self driving computers  active learning workflows with human interpretable ve...Self driving computers  active learning workflows with human interpretable ve...
Self driving computers active learning workflows with human interpretable ve...
 
Time Series Data in a Time Series World
Time Series Data in a Time Series WorldTime Series Data in a Time Series World
Time Series Data in a Time Series World
 
Increasing Security Awareness in Enterprise Using Automated Feature Extractio...
Increasing Security Awareness in Enterprise Using Automated Feature Extractio...Increasing Security Awareness in Enterprise Using Automated Feature Extractio...
Increasing Security Awareness in Enterprise Using Automated Feature Extractio...
 
Complex Event Processing: What?, Why?, How?
Complex Event Processing: What?, Why?, How?Complex Event Processing: What?, Why?, How?
Complex Event Processing: What?, Why?, How?
 
Complex Event Processing - A brief overview
Complex Event Processing - A brief overviewComplex Event Processing - A brief overview
Complex Event Processing - A brief overview
 
New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015
 
Whole Genome Sequencing - Data Processing and QC at SciLifeLab NGI
Whole Genome Sequencing - Data Processing and QC at SciLifeLab NGIWhole Genome Sequencing - Data Processing and QC at SciLifeLab NGI
Whole Genome Sequencing - Data Processing and QC at SciLifeLab NGI
 
Complex Event Processing with Esper
Complex Event Processing with EsperComplex Event Processing with Esper
Complex Event Processing with Esper
 

Similaire à Digital Document Preservation Simulation - Boston Python User's Group

Similaire à Digital Document Preservation Simulation - Boston Python User's Group (20)

Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
 Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
 
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
 
High Performance With Java
High Performance With JavaHigh Performance With Java
High Performance With Java
 
Cloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark AnalyticsCloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark Analytics
 
Building Big Data Streaming Architectures
Building Big Data Streaming ArchitecturesBuilding Big Data Streaming Architectures
Building Big Data Streaming Architectures
 
"Introduction to Kx Technology", James Corcoran, Head of Engineering EMEA at ...
"Introduction to Kx Technology", James Corcoran, Head of Engineering EMEA at ..."Introduction to Kx Technology", James Corcoran, Head of Engineering EMEA at ...
"Introduction to Kx Technology", James Corcoran, Head of Engineering EMEA at ...
 
Webinar: Best Practices for Getting Started with MongoDB
Webinar: Best Practices for Getting Started with MongoDBWebinar: Best Practices for Getting Started with MongoDB
Webinar: Best Practices for Getting Started with MongoDB
 
MongoDB Best Practices
MongoDB Best PracticesMongoDB Best Practices
MongoDB Best Practices
 
Internals of Presto Service
Internals of Presto ServiceInternals of Presto Service
Internals of Presto Service
 
Shikha fdp 62_14july2017
Shikha fdp 62_14july2017Shikha fdp 62_14july2017
Shikha fdp 62_14july2017
 
Server Tips
Server TipsServer Tips
Server Tips
 
Real-Time Streaming: Move IMS Data to Your Cloud Data Warehouse
Real-Time Streaming: Move IMS Data to Your Cloud Data WarehouseReal-Time Streaming: Move IMS Data to Your Cloud Data Warehouse
Real-Time Streaming: Move IMS Data to Your Cloud Data Warehouse
 
James Corcoran, Head of Engineering EMEA, First Derivatives, "Simplifying Bi...
James Corcoran, Head of Engineering EMEA, First Derivatives,  "Simplifying Bi...James Corcoran, Head of Engineering EMEA, First Derivatives,  "Simplifying Bi...
James Corcoran, Head of Engineering EMEA, First Derivatives, "Simplifying Bi...
 
Role of python in hpc
Role of python in hpcRole of python in hpc
Role of python in hpc
 
Understanding and Designing Ultra low latency systems | Low Latency | Ultra L...
Understanding and Designing Ultra low latency systems | Low Latency | Ultra L...Understanding and Designing Ultra low latency systems | Low Latency | Ultra L...
Understanding and Designing Ultra low latency systems | Low Latency | Ultra L...
 
What’s eating python performance
What’s eating python performanceWhat’s eating python performance
What’s eating python performance
 
Intro to open source observability with grafana, prometheus, loki, and tempo(...
Intro to open source observability with grafana, prometheus, loki, and tempo(...Intro to open source observability with grafana, prometheus, loki, and tempo(...
Intro to open source observability with grafana, prometheus, loki, and tempo(...
 
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
 
[DSC Europe 23] Pramod Immaneni - Real-time analytics at IoT scale
[DSC Europe 23] Pramod Immaneni - Real-time analytics at IoT scale[DSC Europe 23] Pramod Immaneni - Real-time analytics at IoT scale
[DSC Europe 23] Pramod Immaneni - Real-time analytics at IoT scale
 
04 open source_tools
04 open source_tools04 open source_tools
04 open source_tools
 

Plus de Micah Altman

SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
Micah Altman
 
Creative Data Literacy: Bridging the Gap Between Data-Haves and Have-Nots
Creative Data Literacy: Bridging the Gap Between Data-Haves and Have-NotsCreative Data Literacy: Bridging the Gap Between Data-Haves and Have-Nots
Creative Data Literacy: Bridging the Gap Between Data-Haves and Have-Nots
Micah Altman
 
SOLARSPELL: THE SOLAR POWERED EDUCATIONAL LEARNING LIBRARY - EXPERIENTIAL LEA...
SOLARSPELL: THE SOLAR POWERED EDUCATIONAL LEARNING LIBRARY - EXPERIENTIAL LEA...SOLARSPELL: THE SOLAR POWERED EDUCATIONAL LEARNING LIBRARY - EXPERIENTIAL LEA...
SOLARSPELL: THE SOLAR POWERED EDUCATIONAL LEARNING LIBRARY - EXPERIENTIAL LEA...
Micah Altman
 
Making Decisions in a World Awash in Data: We’re going to need a different bo...
Making Decisions in a World Awash in Data: We’re going to need a different bo...Making Decisions in a World Awash in Data: We’re going to need a different bo...
Making Decisions in a World Awash in Data: We’re going to need a different bo...
Micah Altman
 

Plus de Micah Altman (20)

Selecting efficient and reliable preservation strategies
Selecting efficient and reliable preservation strategiesSelecting efficient and reliable preservation strategies
Selecting efficient and reliable preservation strategies
 
Well-Being - A Sunset Conversation
Well-Being - A Sunset ConversationWell-Being - A Sunset Conversation
Well-Being - A Sunset Conversation
 
Matching Uses and Protections for Government Data Releases: Presentation at t...
Matching Uses and Protections for Government Data Releases: Presentation at t...Matching Uses and Protections for Government Data Releases: Presentation at t...
Matching Uses and Protections for Government Data Releases: Presentation at t...
 
Privacy Gaps in Mediated Library Services: Presentation at NERCOMP2019
Privacy Gaps in Mediated Library Services: Presentation at NERCOMP2019Privacy Gaps in Mediated Library Services: Presentation at NERCOMP2019
Privacy Gaps in Mediated Library Services: Presentation at NERCOMP2019
 
Well-being A Sunset Conversation
Well-being A Sunset ConversationWell-being A Sunset Conversation
Well-being A Sunset Conversation
 
Can We Fix Peer Review
Can We Fix Peer ReviewCan We Fix Peer Review
Can We Fix Peer Review
 
Academy Owned Peer Review
Academy Owned Peer ReviewAcademy Owned Peer Review
Academy Owned Peer Review
 
Redistricting in the US -- An Overview
Redistricting in the US -- An OverviewRedistricting in the US -- An Overview
Redistricting in the US -- An Overview
 
A Future for Electoral Districting
A Future for Electoral DistrictingA Future for Electoral Districting
A Future for Electoral Districting
 
A History of the Internet :Scott Bradner’s Program on Information Science Talk
A History of the Internet :Scott Bradner’s Program on Information Science Talk  A History of the Internet :Scott Bradner’s Program on Information Science Talk
A History of the Internet :Scott Bradner’s Program on Information Science Talk
 
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
 
Labor And Reward In Science: Commentary on Cassidy Sugimoto’s Program on Info...
Labor And Reward In Science: Commentary on Cassidy Sugimoto’s Program on Info...Labor And Reward In Science: Commentary on Cassidy Sugimoto’s Program on Info...
Labor And Reward In Science: Commentary on Cassidy Sugimoto’s Program on Info...
 
Utilizing VR and AR in the Library Space:
Utilizing VR and AR in the Library Space:Utilizing VR and AR in the Library Space:
Utilizing VR and AR in the Library Space:
 
Creative Data Literacy: Bridging the Gap Between Data-Haves and Have-Nots
Creative Data Literacy: Bridging the Gap Between Data-Haves and Have-NotsCreative Data Literacy: Bridging the Gap Between Data-Haves and Have-Nots
Creative Data Literacy: Bridging the Gap Between Data-Haves and Have-Nots
 
SOLARSPELL: THE SOLAR POWERED EDUCATIONAL LEARNING LIBRARY - EXPERIENTIAL LEA...
SOLARSPELL: THE SOLAR POWERED EDUCATIONAL LEARNING LIBRARY - EXPERIENTIAL LEA...SOLARSPELL: THE SOLAR POWERED EDUCATIONAL LEARNING LIBRARY - EXPERIENTIAL LEA...
SOLARSPELL: THE SOLAR POWERED EDUCATIONAL LEARNING LIBRARY - EXPERIENTIAL LEA...
 
Ndsa 2016 opening plenary
Ndsa 2016 opening plenaryNdsa 2016 opening plenary
Ndsa 2016 opening plenary
 
Making Decisions in a World Awash in Data: We’re going to need a different bo...
Making Decisions in a World Awash in Data: We’re going to need a different bo...Making Decisions in a World Awash in Data: We’re going to need a different bo...
Making Decisions in a World Awash in Data: We’re going to need a different bo...
 
Software Repositories for Research-- An Environmental Scan
Software Repositories for Research-- An Environmental ScanSoftware Repositories for Research-- An Environmental Scan
Software Repositories for Research-- An Environmental Scan
 
The Open Access Network: Rebecca Kennison’s Talk for the MIT Prorgam on Infor...
The Open Access Network: Rebecca Kennison’s Talk for the MIT Prorgam on Infor...The Open Access Network: Rebecca Kennison’s Talk for the MIT Prorgam on Infor...
The Open Access Network: Rebecca Kennison’s Talk for the MIT Prorgam on Infor...
 
Gary Price, MIT Program on Information Science
Gary Price, MIT Program on Information ScienceGary Price, MIT Program on Information Science
Gary Price, MIT Program on Information Science
 

Dernier

Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 

Dernier (20)

WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 

Digital Document Preservation Simulation - Boston Python User's Group

  • 1. Digital Document Preservation Simulation Richard Landau (late of Digital Equipment, Compaq, Dell) - Research Affiliate, MIT Libraries Program on Information Science - http://informatics.mit.edu/people/ richard-landau 2014-07-22
  • 2. The Problem • How do you preserve digital documents long-term? • Surely not on CDs, tapes, etc., with short lifetimes • LOCKSS: Lots Of Copies Keep Stuff Safe • What's the threat model? – Failures: media, format obsolescence, fires, floods, earthquakes, institutional failures, mergers, funding cuts, malicious insiders,…. • Some data exists on disk media reliability, RAID, etc. • Little data on reliability of storage strategies BPG - Digital Document Preservation Simulation 20140722 RBL 2
  • 3. The Project • Digital Document Preservation Simulation • Part of Program on Information Science, MIT Libraries: Dr. Micah Altman, Director of Research • Develop empirical data that real libraries can use to make decisions about storage strategies and costs – Simulate a range of situations, let the clients decide what level of risk they are willing to accept • I do this for fun: volunteer intern, 1-2 days/week BPG - Digital Document Preservation Simulation 20140722 RBL 3
  • 4. Questions to Be Answered • Start small: 10,000 documents for 10 years, 1-20 copies • (Short term questions, lots more in the long term) • Question 0: If I place a number of copies out into the network, how many will I lose over time? – For various error rates and copies, what's the risk? • Question 1: If I audit the remote copies and repair them if they're broken, how many will I lose over time? – For various auditing strategies and frequencies, what's the risk? BPG - Digital Document Preservation Simulation 20140722 RBL 4
  • 5. Tools • Windows 7 on largish PCs • Cygwin for Windows (like Linux on Windows) • Python 2.7 – SimPy library for discrete event simulations – argparse module for CLI – csv module for reading parameter files – logging module for recording events – itertools for generating serial numbers – random to generate exponentials, uniforms, Gaussians • Random seed values from random.org BPG - Digital Document Preservation Simulation 20140722 RBL 5
  • 6. Approach • Very traditional object and service oriented design – Client, Document, Collection of documents – Server, Copy of document, Storage Shelf – Auditor • Asynchronous processes managed by SimPy – Sector failure, shelf failure, at exponential intervals – Audit a collection on a server, at regular intervals • SimPy resource, network mutex for auditors • Slightly perverse code (e.g., Hungarian naming) BPG - Digital Document Preservation Simulation 20140722 RBL 6
  • 7. How SimPy Works • Library for discrete event simulation, V3 • Discrete event queue sorted by time • Asynchronous processes, resources, events, timers – Process is a Python generator, yield to wait for event(s) • Timeouts, discrete events, resource requests • Very simple to use, very efficient, good tutorial examples • Time is floating point, choose your units BPG - Digital Document Preservation Simulation 20140722 RBL 7
  • 8. Early Results for Q0 BPG - Digital Document Preservation Simulation 20140722 RBL 8
  • 9. Is Python Fast Enough? • Yes, if one is careful – Code it all in C++? No. Get results faster with Python. – "Premature optimization is the root of all evil" -- Knuth • Numerous small optimizations on the squeaky wheels – Just a few reduced CPU and I/O by 98% • If it's still not fast enough, use a bigger computer BPG - Digital Document Preservation Simulation 20140722 RBL 9
  • 10. Burn Those CPUs! • One test only 2 sec to 8 min • 2000-4000 tests • Scheduler runs 5-7 programs at once • All 4 cores • (Poor man's parallelism) • 99% CPU use BPG - Digital Document Preservation Simulation 20140722 RBL 10
  • 11. Class Instances Should Have Names • <client.CDocument object at 0x6ffff9f6690> is imperfectly informative – D127 is much better for humans: name, not address – With zillions of instances, identifying one during debugging is hard • Suggestion: – Assign a unique ID to every instance that denotes class and a serial number – Always always always pass IDs as arguments rather than instance pointers BPG - Digital Document Preservation Simulation 20140722 RBL 11
  • 12. How I Assign IDs • Every class begins like this: • itertools.count(1).next function returns a unique integer for this class, starting at 1, when you invoke it • Global dictionary dID2Xxxx translates from an ID string to an instance of class Xxxx • Pass ID as argument; when a function needs the pointer, it can use a single fast dictionary lookup BPG - Digital Document Preservation Simulation 20140722 RBL 12 class CDocument(object): getID = itertools.count(1).next (in __init__) self.ID = "D" + str(self.getID()) G.dID2Document[self.ID] = self
  • 13. Lessons Learned • Python is fast enough • SimPy is really dandy • argparse is a great lib for CLIs • csv.DictReader makes everything a string, beware • item-in-list is really slow, O(n/2), use dict or set instead • Lots of comments in the code! BPG - Digital Document Preservation Simulation 20140722 RBL 13
  • 14. Code is on github • On github: MIT-Informatics/PreservationSimulation • Questions: mailto:landau@ricksoft.com BPG - Digital Document Preservation Simulation 20140722 RBL 14
  • 15. Further Details (Not Tonight) • Many small optimizations add up – Fast find of victim document on a shelf – Shorten log file by removing many or all details – Minimize just-in-time checks during auditing – Change item-in-list checks to item-in-dict • Wide variety of scenarios covered – 1000X span on document size, 100X on storage shelf size, 10000X span on failure rates, 20X span on number of copies – A test run takes from 2 seconds to 8 minutes (CPU) • Running external programs • Post-processing of structured log files BPG - Digital Document Preservation Simulation 20140722 RBL 15
  • 16. Forming and Running External Commands • Substitute multiple arguments into a string with format() • Run command and capture output into string (or list) BPG - Digital Document Preservation Simulation 20140722 RBL 16 TemplateCmd = "ps | grep {Name} | wc -l" RealCmd = TemplateCmd.format(**dictArgs) ResultString = "" for Line in os.popen(RealCmd): ResultString += Line.strip() nProcesses = int(ResultString) if nProcesses < nProcessMax: . . .
  • 17. A More Complicated Command • Read a dict of parameters with csv.DictReader • And then substitute lots of arguments with format() • Execute and capture output with os.popen() BPG - Digital Document Preservation Simulation 20140722 RBL 17 dir,specif,len,seed,cop,ber,berm,extra1 ../q1,v3d50,0,919028296,01,0050,0050000, python main.py {family} {specif} {len} {seed} --ncopies={cop} --ber={berm} {extra1} > {dir}/{specif}/log{ber}/c{cop}b{ber}s{seed}.log 2>&1 &
  • 18. Subprocess Module Instead • os.popen is being replaced by more general functions – subprocess.check_output – subprocess.Popen – subprocess.Pipe – subprocess.communicate BPG - Digital Document Preservation Simulation 20140722 RBL 18
  • 19. Supersede That Old Function • Have a complicated function that works, but you want to replace it with a spiffier version? – Edit in place? • Might break the whole thing for a while – Comment it out with ''' or """? • Not perfectly reliable, like "if 0" in C • Supersede it: add the new version after the old – Last version replaces previous one in module or class dict BPG - Digital Document Preservation Simulation 20140722 RBL 19 def Foo(…): (moldy old code) def Foo(…): (shiny new code)