Have you heard of TDD? Well, many teams struggle with CDD: Chaos-Driven Delivery. That is, teams struggle with how to handle the constant onslaught of overwhelming amounts of work and begin to lose hope. The good news is that if you understand operating systems, you already know a great deal about how to tame the chaos!
Process management is an integral part of an operating system. The OS makes decisions about scheduling, sharing information between jobs, handling interrupts and multi-tasking. It also has to manage the resources of a process and be concerned with process synchronization, just as we mere humans do. This presentation will show you how to apply common concepts from operating system process management to the way teams process work.
12. So, how does this relate to humans?
@everydaykanban
13. Designing scheduling methods for mortals
1. Answer some key questions about what is important
Is our goal to keep people busy or deliver quickly?
16. Designing scheduling methods for mortals
1. Answer some key questions about what is important
Is our goal to keep people busy or deliver quickly?
Does any work demand special treatment?
17. Define your classes of service
Expedite
Intangible
Fixed Date
Standard
a.k.a. multilevel queues
18. Designing scheduling methods for mortals
1. Answer some key questions about what is important
Is our goal to keep people busy or deliver quickly?
Does any work demand special treatment?
Are we concerned about job starvation?
19. Allocate some work to each queue
Expedite
Intangible
Fixed Date
Standard
The slide splits the limit across the four queues as 1, 2, 1 and 6
Only allowing 10 things in progress at once
20. Designing scheduling methods for mortals
1. Answer some key questions about what is important
Is our goal to keep people busy or deliver quickly?
Does any work demand special treatment?
Are we concerned about job starvation?
Does partially-done work provide value?
23. Designing scheduling methods for mortals
1. Answer some key questions about what is important
2. Make policies to optimize for answers
Is our goal to keep people busy or deliver quickly?
Does any work demand special treatment?
Are we concerned about job starvation?
Does partially-done work provide value?
25. Create pull policies for work you will do
Expedite: first come, first served
Fixed Date: priority (due date + size)
Intangible: priority (cost of delay)
Standard: priority (cost of delay)
26. Designing scheduling methods for mortals
1. Answer key questions about what is important
2. Make policies to optimize for answers
3. Determine when to break the rules
Is our goal to keep people busy or deliver quickly?
Does any work demand special treatment?
Are we concerned about job starvation?
Does partially-done work provide value?
27. Great resources for further learning
Queueing Theory, Cost of Delay
Classes of Service, Explicit Policies
Flow vs Resource Efficiency
Effect of Policies on Lead Time
Many teams, especially Ops teams, experience high levels of chaos desperately trying to manage the demands on their time.
How many of you feel like this at least once a week? This may in fact be the number of fire extinguishers you metaphorically use per week at your office!
This chaos is perpetuated by…
The cycle of pain. Without a defined approach to managing work, teams end up catering to the whims of the loudest voices or the HiPPO (the highest paid person’s opinion). That often results in too much work-in-progress, because they start whatever they can to get people off their backs. That leads to starting everything but finishing nothing. This causes more people to escalate their work as if it were an emergency, because that’s the only way anything ever gets completed. It is a vicious, self-perpetuating cycle.
We need to stop the madness. When I was thinking about this, I realized that Ops teams deal with systems programmed to manage similar conditions. I decided to look at one of these systems to see what we can learn from it and apply to human work systems to improve our chaotic conditions. And so this talk was born.
Enter the OS. It gets bombarded with multiple requests at unpredictable intervals too. Yet, it is able to process its work in a seemingly effective manner.
So, how does the OS decide what to work on and when? Well, an OS is coded to follow a set of explicit policies to minimize resource starvation and ensure fairness amongst parties utilizing the resources. There are multiple methods to choose from, each with its pros and cons. Let’s go over a handful of common methods.
https://www.cs.uic.edu/~jbell/CourseNotes/OperatingSystems/5_CPU_Scheduling.html
The first is First Come, First Served. It’s fair in a sense, because she who asks first gets served first. Scheduling overhead is also low: it’s not a complicated system to figure out and maintain.
But it doesn’t account for priority or size, so it’s better for teams that have homogeneous work, which usually isn’t the case for ops or dev teams.
Shortest job first, just as it sounds, keeps short jobs from being stuck behind long ones. The problem is that longer-running jobs may never be started. This is called job starvation.
An equally important problem is that we often lack information about the duration of a task. That is compounded by the tendency of humans to be predictably bad at estimating.
If that’s not enough, in the preemptive variant a shorter job that arrives while another job is executing can interrupt it, causing a context switch. Context switching has a cost: the overhead of going back and forth between tasks.
That’s a lot of reasons not to love SJF.
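To make the trade-off concrete, here is a minimal sketch comparing average waiting time under first come, first served and (non-preemptive) shortest job first. The job durations are made-up numbers for illustration, and the sketch assumes all jobs are already in the queue and that we know their durations, which, as noted above, we often don't.

```python
def average_wait(durations):
    """Average time jobs spend waiting before they start, given the run order."""
    elapsed = 0
    total_wait = 0
    for duration in durations:
        total_wait += elapsed   # this job waited for everything before it
        elapsed += duration
    return total_wait / len(durations)

# Hypothetical job durations (hours), listed in arrival order.
jobs = [8, 1, 2]

fcfs = average_wait(jobs)          # run in arrival order
sjf = average_wait(sorted(jobs))   # run shortest first

print(f"FCFS average wait: {fcfs:.2f}")
print(f"SJF average wait:  {sjf:.2f}")
```

Shortest job first wins on average wait here, but notice that the 8-hour job always goes last. Keep feeding short jobs into the queue and it never starts.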
The third is scheduling by priority. This feels comfortable to us because a lot of work systems use a priority-based system.
Priority scheduling expects you to assign a priority to each job and executes jobs in order, highest priority first.
Let me ask, how many people have ever been faced with multiple priority #1s?
With a constant flow of the highest priority jobs, other important work may never be completed. Again, we have job starvation.
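Starvation under priority scheduling is easy to reproduce. The sketch below is an illustration under assumptions, not how any particular OS does it: one job is completed per tick, a lower number means higher priority, and the job names and arrival rate are invented.

```python
import heapq

def run_priority_queue(initial, urgent_per_tick, ticks):
    """Complete one job per tick, always taking the highest priority first.

    initial: list of (priority, sequence, name); lower priority number wins.
    Returns the names of the completed jobs, in order.
    """
    queue = list(initial)
    heapq.heapify(queue)
    seq = len(queue)   # tie-breaker so equal priorities stay in arrival order
    done = []
    for tick in range(ticks):
        for _ in range(urgent_per_tick):
            heapq.heappush(queue, (1, seq, f"urgent-{tick}"))
            seq += 1
        if queue:
            done.append(heapq.heappop(queue)[2])
    return done

# Hypothetical backlog: one important but not "priority #1" job.
backlog = [(3, 0, "upgrade-database")]

done = run_priority_queue(backlog, urgent_per_tick=1, ticks=5)
print(done)
print("upgrade-database" in done)
```

As long as one new priority #1 arrives every tick, the database upgrade never gets picked up, no matter how long you run the simulation.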
The next type, round robin, is very interesting. In round robin, each job in the queue is processed for a set time and then put aside so the scheduler can move on to the next job. If a job can’t finish in one time unit, it waits for another turn.
Pros: Good average response time, waiting time is dependent on number of processes, and not average process length.
Cons: No work is ignored but it can take a LONG time to finish a job if you have a large queue or long jobs.
Because of high waiting times, deadlines are rarely met in a pure RR system.
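A small simulation shows both sides of round robin. This is a sketch with invented job names and durations, and it ignores the cost of switching between jobs.

```python
from collections import deque

def round_robin(jobs, quantum):
    """Give each job `quantum` units of time in turn until all are finished.

    jobs: dict of name -> total time needed.
    Returns a dict of name -> time at which the job completed.
    """
    queue = deque(jobs.items())
    clock = 0
    finished = {}
    while queue:
        name, remaining = queue.popleft()
        worked = min(quantum, remaining)
        clock += worked
        remaining -= worked
        if remaining > 0:
            queue.append((name, remaining))   # not done: back of the line
        else:
            finished[name] = clock
    return finished

# Hypothetical jobs and time slice.
print(round_robin({"A": 5, "B": 2, "C": 4}, quantum=2))
```

Every job gets attention early, and the short job B finishes sooner than it would behind A. But A, which would have finished at time 5 if worked start to finish, is not done until time 11.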
Some current operating systems, like Mac OS X, use multilevel feedback queues as a scheduling method.
Multiple queues are established and given priorities.
Each queue is processed by its own scheduling method.
Feedback: processes can move between queues. If a process takes too long, it gets moved to a lower-priority queue.
It learns about jobs and acts.
https://en.wikipedia.org/wiki/Multilevel_feedback_queue
http://pages.cs.wisc.edu/~remzi/OSTEP/cpu-sched-mlfq.pdf
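Here is a heavily simplified sketch of the feedback idea. It assumes every queue uses round robin with its own time slice and that a job is demoted one level whenever it uses a full slice without finishing. Real schedulers have more rules (priority boosts, for example), and the job names and slice lengths are invented.

```python
from collections import deque

def mlfq(jobs, quanta):
    """Simplified multilevel feedback queue.

    jobs: dict of name -> total time needed.
    quanta: time slice for each queue, highest-priority queue first.
    Returns (completion order, queue level each job finished in).
    """
    queues = [deque() for _ in quanta]
    for name, remaining in jobs.items():
        queues[0].append((name, remaining))   # every job starts at the top

    order, finished_level = [], {}
    while any(queues):
        # Always serve the highest-priority queue that has work.
        level = next(i for i, q in enumerate(queues) if q)
        name, remaining = queues[level].popleft()
        remaining -= min(quanta[level], remaining)
        if remaining == 0:
            order.append(name)
            finished_level[name] = level
        else:
            # Feedback: the job took too long, so demote it one level.
            lower = min(level + 1, len(queues) - 1)
            queues[lower].append((name, remaining))
    return order, finished_level

order, levels = mlfq({"long": 10, "short": 1, "medium": 3}, quanta=[2, 4, 8])
print(order)
print(levels)
```

Nobody told the scheduler which job was short. It found out by watching how much of each time slice a job used, which is the "learns about jobs and acts" part.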
Well, we learned about some key questions to ask when designing a scheduling method.
I believe that in most businesses, the goal is to deliver value (whatever that looks like) rather than to make it look like people are busy, even though we see the latter far too often.
It is fiscally responsible to be concerned about capacity utilization. In the same vein, it is fiscally irresponsible to focus on capacity utilization to the degree that it causes a bottleneck in a system that enables the company to provide value to its customers. To be responsible corporate citizens, we need to balance capacity utilization with flow efficiency.
Flow efficiency is a measure of how efficiently work moves through your system. It is calculated by dividing the time a piece of work is actively worked on by the total time the request has been alive. According to David J Anderson, who introduced Kanban to software development, a common flow efficiency result is less than 5%. This is because we are working in systems optimized for managing people, not managing work.
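As a quick worked example, with invented numbers: a request that has been alive for 40 days but was only actively worked on for 2 of them has a flow efficiency of 5%.

```python
def flow_efficiency(active_time, total_time):
    """Share of a request's lifetime spent being actively worked on."""
    if total_time <= 0:
        raise ValueError("total_time must be positive")
    return active_time / total_time

# Hypothetical request: 2 days of hands-on work, 40 days from request to done.
print(f"{flow_efficiency(2, 40):.0%}")
```

Both times need to be in the same unit, and the total should include all the waiting: in queues, on approvals, on other teams.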
Once you have a picture of your flow efficiency, look at when your work stops and starts. Capture the things that block movement.
Common reasons are:
Starting too many things
Starting things we aren’t able to finish (not everyone is ready)
Technical barriers
Coordination barriers
Once you’ve identified the reasons, learn what you can do to mitigate or remove those impediments.
Next, ask yourself if all your work can be handled the same – same urgency, same workflow, etc. – or if some work is special.
If not all of your work is the same, do like the operating system and define your buckets of work (or classes of service).
In the first team I managed, we felt like we had so many emergencies that we couldn’t get anything done. One of the first things we did was define our classes of service. We decided to break up our work into different buckets. The priority of each bucket would be based on cost of delay.
EXPLAIN the four!
Add a policies icon in bottom right.
Just like in the multilevel feedback queues used by operating systems, items in human work systems can move from one bucket/queue/class of service to another if you leave them for too long. This is the case with escalations: neglected work can become an emergency, as in the cycle of pain I showed earlier. To avoid this, consider giving an allocation to each class of service. Split the overall amount of work you allow your system to have in progress at any one time across the various queues as needed. Then monitor, experiment, and adjust as needs demand.
Yes, you may hit the limit on those allocations, but those constraints enable you to see issues that are keeping you from doing other classes of work, issues that cause the cycle of emergencies and make you address them. If you have too many expedite/fixed dates, you’ll never get to standard or intangible work. We have to address that problem.
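If it helps to see allocations as a rule you can check, here is a minimal sketch. The per-class numbers are hypothetical and only chosen to add up to 10 things in progress. Your own split should come from monitoring and experimenting.

```python
# Hypothetical allocations per class of service (an assumption for illustration).
WIP_ALLOCATIONS = {"expedite": 1, "fixed_date": 2, "intangible": 1, "standard": 6}

def can_start(class_of_service, in_progress, allocations=WIP_ALLOCATIONS):
    """True if this class of service still has room under its allocation."""
    return in_progress.get(class_of_service, 0) < allocations[class_of_service]

def classes_at_limit(in_progress, allocations=WIP_ALLOCATIONS):
    """Classes that have hit their allocation: the places to go look for issues."""
    return sorted(c for c in allocations if not can_start(c, in_progress, allocations))

# Hypothetical snapshot of the board: items in progress per class.
board = {"expedite": 1, "fixed_date": 2, "standard": 3}

print(can_start("standard", board))
print(classes_at_limit(board))
```

Hitting a limit is not a failure of the policy. It is the signal to stop and ask why that class keeps filling up.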
The usual answer to this is not really.
If I’m trying to get over there, this bridge, in its current state isn’t going to help me. So, we need to
At LeanKit we have the concept of FizzGood which helps to combat that. If things are FSGD, there is less chance for them to be interrupted. Create policies that only allow for interruption of work-in-progress under extreme situations.
Now that we’ve answered all of our questions, we take what we learned from that process and make explicit policies about how to handle our work. It sounds overlordish, but explicit policies help us have a common understanding of how the decisions are made in our work system and often take the emotional drama out of conversations. It becomes less personal if we know what decisions we’ve made and why.
Make active decisions!
, decide the optimal way to manage each one. Remember, each one can be handled differently. You may choose to handle expedites first come, first served because, by definition, each one is the most important thing to do.
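Pull policies like these can be written down very plainly. The sketch below assumes one ordering rule per class of service. The `Item` fields are hypothetical, and reading "due date + size" as "least slack first" (days until due minus days of work) is my interpretation, not the only one.

```python
from dataclasses import dataclass

@dataclass
class Item:
    name: str
    arrived: int                 # arrival order; lower means earlier
    cost_of_delay: float = 0.0   # hypothetical value lost per week of delay
    due_in_days: int = 0
    size_days: int = 0

# One explicit ordering rule per class of service; the lowest key is pulled first.
PULL_POLICIES = {
    "expedite": lambda item: item.arrived,                         # first come, first served
    "fixed_date": lambda item: item.due_in_days - item.size_days,  # least slack first
    "standard": lambda item: -item.cost_of_delay,                  # highest cost of delay first
    "intangible": lambda item: -item.cost_of_delay,
}

def next_item(class_of_service, queue):
    """Pick the next item to pull from a queue according to its class policy."""
    return min(queue, key=PULL_POLICIES[class_of_service])

fixed_date_queue = [
    Item("audit-report", arrived=1, due_in_days=30, size_days=5),
    Item("cert-renewal", arrived=2, due_in_days=10, size_days=8),
]
print(next_item("fixed_date", fixed_date_queue).name)
```

The point is less the code than the fact that the rule is written down. When someone asks why their item was not pulled next, the answer is the policy, not a person.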
Finally, once you’ve made all of those policies, you have one more to make – when to break the rules.
For instance,
You don’t want to reserve a space for expedites in your WIP limits; you want to be able to exceed your WIP limit when one comes in.
You can pull items out of priority order under certain conditions, such as when a specialized skillset is not available.
Just don’t break the rules to the extent that you don’t get the value intended. Don’t keep adding expedites ad infinitum, or you’re back to chaos. If you are always skipping work because you are missing a specialized skillset, consider cross-training or updating your staffing model.
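A rule for breaking the rules can be explicit too. This sketch assumes an overall WIP limit of 10 and a cap of one expedite at a time. The cap is my assumption, added to keep expedites from piling up.

```python
def may_start(class_of_service, total_in_progress, expedites_in_progress,
              wip_limit=10, max_expedites=1):
    """Decide whether a new item may be started.

    Expedites are allowed to push the team over the overall WIP limit,
    but only up to a (hypothetical) cap on simultaneous expedites.
    """
    if class_of_service == "expedite":
        return expedites_in_progress < max_expedites
    return total_in_progress < wip_limit

# The board is full: 10 items in progress, none of them expedites.
print(may_start("standard", total_in_progress=10, expedites_in_progress=0))
print(may_start("expedite", total_in_progress=10, expedites_in_progress=0))
# A second expedite while one is already running is refused.
print(may_start("expedite", total_in_progress=11, expedites_in_progress=1))
```

If the last case keeps coming up, that is a signal to revisit the policy or the causes of the emergencies, not to raise the cap quietly.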
Ultimately: experiment, learn, and update your policies.
Thanks so much for listening today. Do we have time for questions?