Outline
• Collection of guidelines, war stories and techniques learned during the
development of Warframe
• Focus on multiplayer games (many techniques applicable to any game type),
frequent updates, active community
• Mostly PC centric (more diagnostics available to developers, but variety of
hardware)
• What this talk is not about
Warframe
• A cooperative third person online action game
• 3 platforms (PC, PS4 (launch title), XB1) + China (PC+PS4)
• 23+ million accounts
• Typical CCU ~45-50k+ (PC)
• Weekly updates, frequent major updates (usually every 6-8 weeks)
• Own technology stack (everything from low-level socket code to
matchmaking, all 3 platforms)
Debugging multiplayer code
• Debugging games is hard
• Debugging an MP game is like debugging multi-core code, except you have no
idea what the other cores are doing
• Debugging a constantly evolving MP game...
• ...running across N operating systems and hardware configurations
Knowledge is power
• Collecting bugs/crash information to a database of your choice
• In most cases - extension of an internal system (unhandled exception
handler, uploaded to a global server)
• No longer shielded by a publisher; can be overwhelming at first. Higher CCU =
more crashes (even a 1/32k chance looks like crashing all the time)
• Include important details in the bug itself (OS, 32/64-bit, CPU, account…)
for easy filtering
Analyzing data
“...there is no religious demagogue worse than an engineer arguing from intuition instead
of data.”
Robert P. Colwell, The Pentium Chronicles
• Do not guess -- build a system to analyze incoming cases
• Can start small (e.g. “Top 10 crashes in the latest build”), keep adding
details
Our system
• Top X most frequent crashes, can be filtered by category (networking,
system, game, ...)
• Cross referencing (past builds)
• Crashes/hour graph for every build
• Global stats – bug breakdown (intuition vs cold facts)
• Guess how many one-off crashes we get? ~80%! (good... and bad)
• Usually almost 30% of all crashes in the top 5 issues (4/5 of these tend to be
in drivers/external DLLs these days)
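A "top X crashes per build" query can start very small. The sketch below is a minimal illustration, not Warframe's actual system: the `CrashReport` struct, its `signature` field and the `TopCrashes` function are hypothetical names, and a real database would do the grouping server-side.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical crash record; real reports would carry more fields
// (OS, 32/64-bit, CPU, account, ...).
struct CrashReport {
    std::string build;      // build the crash came from
    std::string signature;  // assumed grouping key, e.g. the top callstack frames
};

using CrashCount = std::pair<std::string, std::size_t>;

// Returns up to topN (signature, count) pairs for one build, most frequent first.
std::vector<CrashCount> TopCrashes(const std::vector<CrashReport>& reports,
                                   const std::string& build,
                                   std::size_t topN)
{
    std::map<std::string, std::size_t> counts;
    for (const CrashReport& report : reports) {
        if (report.build == build)
            ++counts[report.signature];
    }

    std::vector<CrashCount> sorted(counts.begin(), counts.end());
    // stable_sort keeps alphabetical order among signatures with equal counts
    std::stable_sort(sorted.begin(), sorted.end(),
                     [](const CrashCount& a, const CrashCount& b) {
                         return a.second > b.second;
                     });
    if (sorted.size() > topN)
        sorted.resize(topN);
    return sorted;
}
```

Running the same query against previous builds gives the cross-referencing mentioned above; entries with a count of 1 are the one-off crashes.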
Crash report – log file
• Given 2 theoretical extremes, it’s better to log too much than not enough
○ Can always filter out noise, can’t materialize information that isn’t there
• Avoid messages carrying only binary info. Rule of thumb: the more unusual
the situation, the more details
• ID all the things (objects, messages, players)
• If something goes wrong, does it provide enough information to figure out
the reason?
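As an illustration of "ID all the things", here is a small sketch of a log helper that always attaches identifiers to a message. The `LogContext` struct, the field names and the line layout are assumptions made for this example, not the format used in Warframe.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical identifiers attached to every log line.
struct LogContext {
    std::uint32_t objectId;
    std::uint32_t playerId;
    std::uint32_t messageId;
};

// Builds a line that names the system and the IDs involved, followed by details.
// Compare "Spawn failed" with "[Spawn] obj=7 player=3 msg=42 | no free slot, 64/64 used".
std::string FormatLogLine(const char* system,
                          const LogContext& context,
                          const std::string& details)
{
    std::ostringstream line;
    line << '[' << system << "] obj=" << context.objectId
         << " player=" << context.playerId
         << " msg=" << context.messageId
         << " | " << details;
    return line.str();
}
```

Consistent `key=value` fields also make the lines easy to pick out with a script later.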
Look for cracks between layers
• State machines (matchmaking, mission flow, UI)
• N systems interacting with each other
• UI vs underlying code (triggering an activity just as something else
starts/ends)
• Clients connecting/disconnecting
• Focus on patterns repeating in multiple log files (the more the better)
• Modify the code locally to trigger the suspected problem
Parsing logs
• Eyeballing is hard, especially for long sessions (megabytes of text, multiple
missions)
• Simple Perl scripts FTW (congestion control, leaking objects). Even easier if
interesting entries are marked clearly
• A graph is worth a thousand words (or numbers…)
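The talk used Perl for this; the same idea is sketched below in C++ to stay in one language with the other examples. It assumes a hypothetical log convention where object creation is logged as `OBJ+ <Type>` and destruction as `OBJ- <Type>`; any clearly marked entries would work the same way.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Scans a log and returns the number of live objects per type at the end.
// Assumed (hypothetical) markers: "OBJ+ <Type>" and "OBJ- <Type>".
std::map<std::string, long> CountLiveObjects(std::istream& log)
{
    std::map<std::string, long> live;
    std::string line;
    while (std::getline(log, line)) {
        std::istringstream fields(line);
        std::string marker;
        std::string type;
        if (!(fields >> marker >> type))
            continue;  // fewer than two tokens, cannot be an object entry
        if (marker == "OBJ+")
            ++live[type];
        else if (marker == "OBJ-")
            --live[type];
        // anything else is unrelated log noise and is ignored
    }
    return live;
}
```

Sampling the counts every N lines instead of only at the end gives the data for an object count vs time graph; a type whose count only ever grows is a leak suspect.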
Object leaks
[Graph: object count (Y) vs time (X)]
Congestion control tests
Crash report - dumps
• Know your assembly (x64 tends to be easier to investigate, more registers).
See [Ruskin11]
• Immediate Window to the rescue (> + command, e.g. >~kb for callstacks for
all threads)
• Spider sense (e.g. reallocating std::vector while iterating)
Instrumenting crash dumps
• In a perfect world, we’d just use full memory dumps. Sadly, they’re way too
big for practical purposes (hundreds of megs).
• “Mini” dumps include registers and the stack memory; helpful, but not
always good enough
• Can try to “exploit” this by storing important data on the stack (see blog post)
• Better way: use mini dump callbacks to request the regions pointed to by registers (if
they look like valid addresses, can even chase pointers recursively)
• Detecting type -- cast register to pointer, VS resolves vftables, e.g.
(void**)eax - {Warframe.exe!const Projectile::`vftable'}
LogBug
• Problem: crash we can work around, but the root cause remains a mystery.
Examples: indexing out of bounds, state desynchronization
• We can add more logging, but ideally we prevent users from crashing
• Enter LogBug aka silent assertion. Logs an error message + a callstack,
game keeps running, the error report dialog pops up when the app exits.
• Result: game doesn’t crash, we get extra info (no dump, just log). 15
minutes of work to add on top of existing error handler.
• Use sparingly, fix quickly (~15 active LogBugs in Warframe as of today)
• Equivalent to fatal error for internal builds
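A silent assertion along these lines could look like the sketch below. It is only an illustration under stated assumptions: `LOG_BUG`, `LogBugImpl`, `GetPendingReports`, `SafeGet` and the `INTERNAL_BUILD` macro are hypothetical names, and the callstack capture and the report dialog at exit are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical storage for reports that get submitted when the app exits.
inline std::vector<std::string>& GetPendingReports()
{
    static std::vector<std::string> reports;
    return reports;
}

inline void LogBugImpl(const char* file, int line, const char* message)
{
    GetPendingReports().push_back(std::string(file) + ":" + std::to_string(line) +
                                  ": " + message);
#ifdef INTERNAL_BUILD  // assumed macro: internal builds treat a LogBug as fatal
    std::abort();
#endif
}

#define LOG_BUG(message) LogBugImpl(__FILE__, __LINE__, (message))

// Example work-around: an out-of-bounds index is reported and replaced by a
// fallback instead of crashing the game.
int SafeGet(const std::vector<int>& values, std::size_t index, int fallback)
{
    if (index >= values.size()) {
        LOG_BUG("SafeGet: index out of bounds");
        return fallback;
    }
    return values[index];
}
```

The player keeps playing, and the report still arrives with the full log, which is usually enough to narrow down the root cause.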
Deploying to public
• Low hanging fruit long gone, this is where the “fun” starts
• Forget about trying to repro the bug just by playing a game (can still alter a
build sometimes)
• Left with good old fashioned police work. Given leads and clues, try to
determine who’s the killer (or the reason for our bug)
• Ability to quickly test various scenarios (unit tests) - invaluable!
Chasing a suspect
• Less stepping through code in a debugger, more
pen & paper
• War story: Warframe goes public, rare reports
where a client crashes because the host assumes the
client had already received some information (it hadn’t)
• It’s tempting to overcomplicate things, start with
simple theories
• Add “dimensions” rather than move along the
same “axis”
Adding dimensions
Moving along 1 axis modifies mostly probability, doesn’t expand our search
space.
Adding a new dimension uncovers new possibilities
In our case: data was dropped once; the chunks it had shared a packet with were
retransmitted, but the new packet had no space for all of the original entries.
Example unit test
// Two test frames in one process: one acts as server, one as client
server = GetTypeMgr().Create<TestFrame>();
client = GetTypeMgr().Create<TestFrame>();
server->Initialize(true, "0.0.0.0:4950", false);
// The client connects to whatever address the server reports
const String& serverAddress = server->mNet->GetListenAddress();
client->Initialize(false, serverAddress, false);
server->Update(0.1f);
TEST(server->mNumConnects == 1);
// 100% packet loss, 0ms latency
client->mNet->SimulateNetwork(1.f, 0.f);
server->mNet->SendData(...);
Controlled repro case, regression, documentation, verification (assumptions).
The aftermath
• Very difficult to verify we got the right guy (no
repro, remember?)
• Publish a fix, hope for the best
• Sometimes, an evil twin/copycat remains. Usually
much easier now that we’ve a system/theory in
place
• Ideally, do not introduce major changes until the fix is verified (hard to
attribute the cause otherwise)
LogBug strikes again
• Eternal problem with networking – what was the other side doing at that time?
We only get reports from the player who crashed or encountered a problem
• Some experiments with recording network messages (space limitations, still
incomplete)
• Solution: try to handle gracefully/convert to a LogBug and send a special
message to the other player – “trigger LogBug”
• Result: we get 2 (or more) reports from 2 machines (full log file with all the info
we could think of). Have to find a matching pair (not everyone submits), but
often any data from both sides is very helpful
Compiler bugs
• “Never happens”... or does it?
• Seemingly innocent code:
m_v = Min(v, MAX_V);
// Overloaded operator= contained _rotr16 intrinsic inside.
• Getting reports of ‘v’ set to values greater than MAX_V
• Generated (pseudo)assembly:
mov r8d,MAX_V
cmp edx,r8d ; edx=v, x86 -> flags as in sub edx,r8d
ror ax,X ; X=1 for consistent repro, was other value
cmovl r8w,dx ; r8 will be later stored in m_v
• cmovl depends on OF & SF (move if OF!=SF)
Compiler bugs… and CPU differences
ROR instruction + the overflow flag:
• Different behavior on different CPUs. i5-3320M laptop and i7-850 are “fine”,
i7-2600K would modify the OF flag.
• Seen with VS 2010, seems to be fixed in VS 2012
“The OF flag is defined only for the 1-bit rotates; it is undefined in all other cases [...] For right rotates, the
OF flag is set to the exclusive OR of the two most-significant bits of the result.” (Intel 64 and IA-32
Architectures Software Developer’s Manual, vol. 2B)
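To make the failure concrete, the sketch below emulates the three instructions in plain C++. It is a software model written for this explanation, not compiler output: the `Flags` struct and the function names are hypothetical, and only the documented 1-bit rotate case is modelled (for other counts OF is undefined, which is where the CPUs differed).

```cpp
#include <cassert>
#include <cstdint>

// The two flags cmovl looks at.
struct Flags {
    bool sf;  // sign flag
    bool of;  // overflow flag
};

// Flags after `cmp a, b` on 32-bit signed operands (computed as for a - b).
Flags Compare(std::int32_t a, std::int32_t b)
{
    const std::uint32_t wrapped = static_cast<std::uint32_t>(a) - static_cast<std::uint32_t>(b);
    const std::int64_t exact = static_cast<std::int64_t>(a) - static_cast<std::int64_t>(b);
    Flags flags;
    flags.sf = (wrapped >> 31) != 0;
    flags.of = exact < INT32_MIN || exact > INT32_MAX;
    return flags;
}

// OF after a 1-bit `ror` of a 16-bit value: XOR of the two most-significant
// bits of the result. SF is not touched by rotates.
bool OverflowAfterRor1(std::uint16_t value)
{
    const std::uint16_t result = static_cast<std::uint16_t>((value >> 1) | (value << 15));
    return (((result >> 15) ^ (result >> 14)) & 1) != 0;
}

// cmovl performs the move when SF != OF.
bool CmovlTaken(const Flags& flags)
{
    return flags.sf != flags.of;
}
```

With v = 20 and MAX_V = 10 the compare leaves SF = OF = 0, so the move must not happen and MAX_V is stored. If the rotate in between sets OF (for example rotating 0x0001 right by one gives 0x8000), SF != OF becomes true, the move is taken and the unclamped v ends up in m_v.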
Freezes/hitches
• Need a way to grab the state of our process (a user friendly way!)
• First approach: Process Hacker (just a callstack, no dump)
• Better: procdump + batch file to make it a one-click operation:
procdump process_name
• Requires community help
• Terrible granularity, but we’ve already diagnosed a few problems this way
Community
• Best community ever (no, really)
• Community Testers (second layer)
• Freezes/hitches
• Network issues
• Forums, IRC, voice chat
Third party libraries/external DLLs
• Mostly beyond our control, but users blame you anyway (not your fault, but
your problem)
• Third party libs: report a problem, wait for a new version
• External app: report a problem, detect and inform (video card software,
various overlays)
• Also: early console SDKs (especially paths less traveled). Staring at assembly
code again
• If you’re feeling brave, other options exhausted and you’re really bored:
patching binaries by hand ([Sinilo13])
Acknowledgments
• Warframe dev team
• Warframe community
References
• [Ruskin11] Ruskin, E. Forensic Debugging (GDC 2011)
• [Sinilo13] Sinilo, M. Patching binaries: http://msinilo.pl/blog/?p=1215
Extra Slides
Spider sense – dynamic arrays
Great container (std::vector & friends), but a few problems
○ push_back while iterating (often a few levels down the call chain) can reallocate
memory and invalidate iterators. Typical symptom: crashing when trying to
access memory pointed to by an iterator
○ indexing out of range (in some cases easy to tell from the dump). In many cases
the access itself succeeds, but the data doesn’t make sense, so it fails later when trying to use it.
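A short sketch of the first problem and one way around it. The function names are made up for this example; the point is that indices survive a reallocation while iterators, pointers and references into the vector do not.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// True if the next push_back has to reallocate, which invalidates every
// iterator, pointer and reference into the vector.
bool PushBackWouldReallocate(const std::vector<int>& values)
{
    return values.size() == values.capacity();
}

// Appends a copy of every even element while walking the vector.
// A range-for or iterator loop here would be undefined behavior as soon as
// push_back reallocates; walking by index over the original size is safe.
void DuplicateEvens(std::vector<int>& values)
{
    const std::size_t originalSize = values.size();
    for (std::size_t i = 0; i < originalSize; ++i) {
        const int current = values[i];  // copy, do not hold a reference across push_back
        if (current % 2 == 0)
            values.push_back(current);
    }
}
```

In a dump, the tell-tale sign of the broken variant is an iterator pointing just outside the vector's current buffer, typically into memory that was freed by the reallocation.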
Spider sense - multithreading
• Crash when reading/writing, data races
• See what other threads are doing
• Investigate all accesses, missing locks
• Sometimes adding Sleeps makes it easier to repro (more likely for jobs to
overlap on your fast computer)
• Fake/failing locks - widen the race window, make it more obvious to detect
concurrent access
“Lock” = assert eq 0, set to 1
“Unlock” = assert eq 1, set to 0
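The fake lock described above fits in a few lines. In this sketch (the class name `FakeLock` is an assumption) the methods return whether the expectation held instead of asserting directly, so the caller can decide between an assert and a softer report.

```cpp
#include <atomic>
#include <cassert>

// Does not block anyone; it only detects that two users overlapped.
class FakeLock {
public:
    // "Lock": expects the flag to be 0 and sets it to 1.
    // Returns false if somebody else was already inside.
    bool Lock() { return mHeld.exchange(1) == 0; }

    // "Unlock": expects the flag to be 1 and sets it to 0.
    // Returns false if the flag had been cleared by someone else.
    bool Unlock() { return mHeld.exchange(0) == 1; }

private:
    std::atomic<int> mHeld{0};
};
```

Typical use is `assert(lock.Lock());` at the start of the suspected section and `assert(lock.Unlock());` at the end. Because the whole section counts as "held", any overlap is caught, not only the rare case where both threads touch the same bytes at the same instant.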
Bugs – a social problem
• No man’s land, “the bystander effect” (probability of help inversely related to
the number of bystanders)
• Situation a little bit better in a typical game team, not a problem with serious
bugs
• Cracks between layers again – problems when multiple modules overlap
• Gamification: top issues lists, bounties
• Systemic tweaks: clear responsibilities, encourage team members to move
outside their comfort zones (especially new employees, one of the best
ways to get familiar with the codebase)

 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 

Debugging multiplayer games

  • 2. Outline • Collection of guidelines, war stories and techniques learned during the development of Warframe • Focus on multiplayer games (many techniques applicable to any game type), frequent updates, active community • Mostly PC centric (more diagnostics available to developers, but variety of hardware) • What this talk is not about
  • 3. Warframe • A cooperative third person online action game • 3 platforms (PC, PS4 (launch title), XB1) + China (PC+PS4) • 23+ million accounts • Typical CCU ~45-50k+ (PC) • Weekly updates, frequent major updates (usually every 6-8 weeks) • Own technology stack (everything from low-level socket code to matchmaking, all 3 platforms)
  • 4. Debugging multiplayer code • Debugging games is hard • Debugging a MP game - like multi core, but no idea what other cores are doing • Debugging a constantly evolving MP game... • ...running across N operating systems and hardware configurations =
  • 5. Knowledge is power • Collecting bugs/crash information to a database of your choice • In most cases - extension of an internal system (unhandled exception handler, uploaded to a global server) • No longer shielded by publisher, can be overwhelming at first. Higher CCU = more crashes (1/32k chance -- seemingly crashing all the time) • Include important details in the bug itself (OS, 32/64-bit, CPU, account…) for easy filtering
  • 6. Analyzing data “...there is no religious demagogue worse than an engineer arguing from intuition instead of data.” Robert P. Colwell, The Pentium Chronicles • Do not guess -- build a system to analyze incoming cases • Can start small (e.g. “Top 10 crashes in the latest build”), keep adding details
  • 9. Our system • Top X most frequent crashes, can be filtered by category (networking, system, game, ...) • Cross referencing (past builds) • Crashes/hour graph for every build • Global stats – bug breakdown (intuition vs cold facts) • Guess how many one-off crashes we get? ~80%! (good... and bad) • Usually almost 30% of all crashes in the top 5 issues (4/5 of these tend to be in drivers/external DLLs these days)
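The kind of aggregation behind a "Top X crashes" view can be sketched in a few lines. This is only an illustration, not the Warframe system: `CrashReport`, `CrashStats` and `Summarize` are hypothetical names, the crash signature is assumed to be a string already derived from the callstack, and "one-off" is assumed to mean a signature that appears exactly once.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical crash record: only the fields this sketch needs.
struct CrashReport {
    std::string signature;  // assumed: derived from the top callstack frames
    std::string category;   // e.g. "networking", "system", "game"
};

struct CrashStats {
    std::vector<std::pair<std::string, std::size_t>> top;  // most frequent first
    double oneOffShare = 0.0;  // share of distinct signatures seen exactly once
    double topShare = 0.0;     // share of counted reports covered by `top`
};

// Counts reports per signature; an empty `category` means "no filter".
CrashStats Summarize(const std::vector<CrashReport>& reports,
                     std::size_t topN,
                     const std::string& category) {
    std::map<std::string, std::size_t> counts;
    std::size_t total = 0;
    for (const CrashReport& r : reports) {
        if (!category.empty() && r.category != category) continue;
        ++counts[r.signature];
        ++total;
    }

    CrashStats stats;
    stats.top.assign(counts.begin(), counts.end());
    // Stable sort keeps alphabetical order among signatures with equal counts.
    std::stable_sort(stats.top.begin(), stats.top.end(),
                     [](const std::pair<std::string, std::size_t>& a,
                        const std::pair<std::string, std::size_t>& b) {
                         return a.second > b.second;
                     });

    std::size_t oneOffs = 0;
    for (const auto& entry : stats.top) {
        if (entry.second == 1) ++oneOffs;
    }
    if (!counts.empty()) {
        stats.oneOffShare = static_cast<double>(oneOffs) / counts.size();
    }

    if (stats.top.size() > topN) stats.top.resize(topN);
    std::size_t covered = 0;
    for (const auto& entry : stats.top) covered += entry.second;
    if (total > 0) stats.topShare = static_cast<double>(covered) / total;
    return stats;
}
```

Running the same summary per build gives the cross-referencing and "intuition vs cold facts" numbers mentioned above.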
  • 10. Crash report – log file • Given 2 theoretical extremes, it’s better to log too much than not enough ○ Can always filter out noise, can’t materialize information that isn’t there • Avoid messages carrying only binary info. Rule of thumb: the more unusual situation – the more details • ID all the things (objects, messages, players) • If something goes wrong, does it provide enough information to figure out the reason?
  • 11. Look for cracks between layers • State machines (matchmaking, mission flow, UI) • N systems interacting with each other • UI vs underlying code (triggering an activity just as something else starts/ends) • Clients connecting/disconnecting • Focus on patterns repeating in multiple log files (the more the better) • Modify the code locally to trigger the suspected problem
  • 12. Parsing logs • Eyeballing is hard, especially for long sessions (megabytes of text, multiple missions) • Simple Perl scripts FTW (congestion control, leaking objects). Even easier if interesting entries are marked clearly • A graph is worth a thousand words (or numbers…)
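A leak-hunting log scan can be very small once the interesting entries are clearly marked. The slide mentions Perl; the sketch below shows the same idea in C++ to match the rest of the code in this talk. The log format (`[OBJ] create <id>` / `[OBJ] destroy <id>`) and the names `LeakScan` and `ScanObjectLog` are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <set>
#include <sstream>
#include <string>
#include <vector>

struct LeakScan {
    std::vector<std::size_t> liveCounts;  // live objects after each [OBJ] entry
    std::set<std::string> stillAlive;     // ids created but never destroyed
};

// Scans a log for hypothetical "[OBJ] create <id>" / "[OBJ] destroy <id>" lines.
LeakScan ScanObjectLog(std::istream& log) {
    LeakScan result;
    std::string line;
    while (std::getline(log, line)) {
        std::istringstream fields(line);
        std::string tag, action, id;
        // Skip lines that are not clearly marked object entries.
        if (!(fields >> tag >> action >> id) || tag != "[OBJ]") continue;
        if (action == "create") {
            result.stillAlive.insert(id);
        } else if (action == "destroy") {
            result.stillAlive.erase(id);
        } else {
            continue;
        }
        result.liveCounts.push_back(result.stillAlive.size());
    }
    return result;
}
```

Plotting `liveCounts` gives an object-count-over-time graph; `stillAlive` lists the ids worth searching for in the full log.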
  • 13. Object leaks [Graph: object count (Y) vs. time (X)]
  • 15. Crash report - dumps • Know your assembly (x64 tends to be easier to investigate, more registers). See [Ruskin11] • Immediate Window to the rescue (> + command, e.g. >~kb for callstacks for all threads) • Spider sense (e.g. reallocating std::vector while iterating)
  • 16. Instrumenting crash dumps • In a perfect world, we’d just use full memory dumps. Sadly, they’re way too big for practical purposes (hundreds of megs). • “Mini” dumps include registers and the stack memory, helpful, but not always good enough • Can try to “exploit”, by storing important data on the stack (see blog post) • Better way: use mini dump callbacks and request the memory regions pointed to by registers (if they look like valid addresses, can even chase pointers recursively) • Detecting type -- cast register to pointer, VS resolves vftables, e.g. (void**)eax - {Warframe.exe!const Projectile::`vftable'}
  • 17. LogBug • Problem: crash we can work around, but the root cause remains a mystery. Examples: indexing out of bounds, state desynchronization • We can add more logging, but ideally we prevent users from crashing • Enter LogBug aka silent assertion. Logs an error message + a callstack, game keeps running, the error report dialog pops up when the app exits. • Result: game doesn’t crash, we get extra info (no dump, just log). 15 minutes of work to add on top of existing error handler. • Use sparingly, fix quickly (~15 active LogBugs in Warframe as of today) • Equivalent to fatal error for internal builds
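A silent assertion along these lines might look roughly like the sketch below. It is not the Warframe implementation: `LogBugCollector`, `GetLogBugs`, `LOG_BUG` and `SafeGet` are hypothetical names, callstack capture is left out because it is platform specific, and the "show the error report dialog on exit" step is reduced to a `HasPending()` query.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical collector: records errors and lets the game keep running.
class LogBugCollector {
public:
    void Report(const char* file, int line, const std::string& message) {
        // A real version would also capture a callstack here.
        mEntries.push_back(std::string(file) + ":" + std::to_string(line) +
                           ": " + message);
    }
    // Checked at shutdown to decide whether to show the report dialog.
    bool HasPending() const { return !mEntries.empty(); }
    const std::vector<std::string>& Entries() const { return mEntries; }

private:
    std::vector<std::string> mEntries;
};

inline LogBugCollector& GetLogBugs() {
    static LogBugCollector collector;
    return collector;
}

// In internal builds this could be routed to a fatal error instead.
#define LOG_BUG(message) GetLogBugs().Report(__FILE__, __LINE__, (message))

// Example: work around an out-of-bounds index, but leave a trail.
int SafeGet(const std::vector<int>& values, std::size_t index, int fallback) {
    if (index >= values.size()) {
        LOG_BUG("index " + std::to_string(index) + " out of bounds (size " +
                std::to_string(values.size()) + ")");
        return fallback;
    }
    return values[index];
}
```

The workaround keeps the user playing, while the recorded entry still reaches the crash database with the full log attached.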
  • 18. Deploying to public • Low hanging fruit long gone, this is where the “fun” starts • Forget about trying to repro the bug just by playing a game (can still alter a build sometimes) • Left with good old fashioned police work. Given leads and clues, try to determine who’s the killer (or the reason for our bug) • Ability to quickly test various scenarios (unit tests) - invaluable!
  • 19. Chasing a suspect • Less stepping through code in a debugger, more pen & paper • War story: Warframe goes public, rare reports where a client crashes because host assumes he had received some information before (he didn’t) • It’s tempting to overcomplicate things, start with simple theories • Add “dimensions” rather than move along the same “axis”
  • 20. Adding dimensions Moving along 1 axis modifies mostly probability, doesn’t expand our search space.
  • 21. Adding dimensions Adding a new dimension uncovers new possibilities In our case - data dropped once, chunks it used to share a packet with retransmitted, but no space in the new packet for all original entries.
  • 22. Example unit test
      server = GetTypeMgr().Create<TestFrame>();
      client = GetTypeMgr().Create<TestFrame>();
      server->Initialize(true, "0.0.0.0:4950", false);
      const String& serverAddress = server->mNet->GetListenAddress();
      client->Initialize(false, serverAddress, false);
      server->Update(0.1f);
      TEST(server->mNumConnects == 1);
      // 100% packet loss, 0ms latency
      client->mNet->SimulateNetwork(1.f, 0.f);
      server->mNet->SendData(...);
    Controlled repro case, regression, documentation, verification (assumptions).
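The test above relies on being able to dial in packet loss and latency. How the engine's `SimulateNetwork` works internally is not shown in the talk; the following is a minimal stand-alone sketch of the idea, with `LossyLink` and all of its members being hypothetical names. It assumes a fixed seed so test runs are repeatable.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <random>
#include <string>

// Hypothetical one-way link with configurable loss and latency.
class LossyLink {
public:
    explicit LossyLink(unsigned seed = 1234) : mRng(seed), mDist(0.f, 1.f) {}

    // lossRate in [0, 1]; latency in seconds (assumed constant while in flight).
    void SimulateNetwork(float lossRate, float latencySeconds) {
        mLossRate = lossRate;
        mLatency = latencySeconds;
    }

    void Send(const std::string& payload, float now) {
        // The explicit >= 1 check makes 100% loss exact regardless of rounding.
        if (mLossRate >= 1.f || mDist(mRng) < mLossRate) {
            ++mDropped;
            return;
        }
        mInFlight.push(Packet{now + mLatency, payload});
    }

    // Returns true and fills `out` if a packet is due at time `now`.
    bool Receive(float now, std::string& out) {
        if (mInFlight.empty() || mInFlight.front().deliverAt > now) return false;
        out = mInFlight.front().payload;
        mInFlight.pop();
        return true;
    }

    std::size_t Dropped() const { return mDropped; }

private:
    struct Packet {
        float deliverAt;
        std::string payload;
    };
    std::mt19937 mRng;
    std::uniform_real_distribution<float> mDist;
    std::queue<Packet> mInFlight;
    float mLossRate = 0.f;
    float mLatency = 0.f;
    std::size_t mDropped = 0;
};
```

With loss set to 0 or 1 the behaviour is fully deterministic, which is what makes a controlled repro case possible.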
  • 23. The aftermath • Very difficult to verify we got the right guy (no repro, remember?) • Publish a fix, hope for the best • Sometimes, an evil twin/copycat remains. Usually much easier now that we’ve a system/theory in place • Ideally, do not introduce major changes until the fix is verified (hard to attribute the cause otherwise)
  • 24. LogBug strikes again • Eternal problem with networking – what was the other side doing at that time? We only get reports from the player who crashed or encountered a problem • Some experiments with recording network messages (space limitations, still incomplete) • Solution: try to handle gracefully/convert to a LogBug and send a special message to the other player – “trigger LogBug” • Result: we get 2 (or more) reports from 2 machines (full log file with all the info we could think about). Have to find a matching pair (not everyone submits), but often any data from both sides is very helpful
  • 26. Compiler bugs • “Never happens”... or does it? • Seemingly innocent code:
      m_v = Min(v, MAX_V); // Overloaded operator= contained _rotr16 intrinsic inside.
    • Getting reports of ‘v’ set to values greater than MAX_V • Generated (pseudo)assembly:
      mov r8d,MAX_V
      cmp edx,r8d     ; edx=v, x86 -> flags as in sub edx,r8d
      ror ax,X        ; X=1 for consistent repro, was other value
      cmovl r8w,dx    ; r8 will be later stored in m_v
    • cmovl depends on OF & SF (move if OF!=SF)
  • 27. Compiler bugs… and CPU differences ROR instruction + the overflow flag: • Different behavior on different CPUs. i5-3320m laptop and i7-850 are “fine”, i7 2600k would modify the OF flag. • VS 2010, seems to be fixed in VS2012 • “The OF flag is defined only for the 1-bit rotates; it is undefined in all other cases [...] For right rotates, the OF flag is set to the exclusive OR of the two most-significant bits of the result.” (Intel 64 and IA-32 Architectures Software Developer’s Manual, vol 2B)
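To see why a rotate scheduled between the compare and the conditional move breaks `Min`, it helps to model the flags by hand. The sketch below is a simplified model, not the generated code: the function names are hypothetical, the compare is modelled on 32-bit values, and the clobbered OF value is passed in explicitly because on real hardware it depends on the CPU and the rotate count.

```cpp
#include <cassert>
#include <cstdint>

// Portable 16-bit rotate right (what the _rotr16 intrinsic computes).
std::uint16_t RotateRight16(std::uint16_t value, unsigned count) {
    count &= 15u;
    if (count == 0) return value;
    return static_cast<std::uint16_t>((value >> count) | (value << (16u - count)));
}

// OF after a 1-bit ROR, per the manual: XOR of the two top bits of the result.
bool OverflowAfterRor1(std::uint16_t value) {
    const std::uint16_t result = RotateRight16(value, 1);
    return (((result >> 15) ^ (result >> 14)) & 1u) != 0;
}

// cmovl moves when SF != OF.
bool CmovlTaken(bool sf, bool of) { return sf != of; }

// Models "cmp v,maxV ; [ror] ; cmovl": returns the value that ends up in m_v.
// If clobberOf is true, OF is overwritten (as the rotate may do) before cmovl.
std::int32_t MinViaCmovl(std::int32_t v, std::int32_t maxV,
                         bool clobberOf, bool clobberedValue) {
    const std::int64_t diff = static_cast<std::int64_t>(v) - maxV;
    bool of = diff < INT32_MIN || diff > INT32_MAX;  // signed overflow of v - maxV
    const bool sf = (diff < 0) != of;                // sign of the wrapped result
    if (clobberOf) of = clobberedValue;
    return CmovlTaken(sf, of) ? v : maxV;
}
```

With untouched flags, `MinViaCmovl(10, 5, false, false)` yields 5. If the rotate sets OF, the move is taken even though `v` is larger, which matches the reports of values greater than `MAX_V`.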
  • 28. Freezes/hitches • Need a way to grab the state of our process (a user friendly way!) • First approach: Process Hacker (just a callstack, no dump) • Better: procdump + batch file to make it a one-click operation: procdump process_name • Requires community help • Terrible granularity, but we’ve already diagnosed a few problems this way
  • 29. Community • Best community ever (no, really) • Community Testers (second layer) • Freezes/hitches • Network issues • Forums, IRC, voice chat
  • 30. Third party libraries/external DLLs • Mostly beyond our control, but users blame you anyway (not your fault, but your problem) • Third party libs: report a problem, wait for a new version • External app: report a problem, detect and inform (video card software, various overlays) • Also: early console SDKs (especially paths less traveled). Staring at assembly code again • If you’re feeling brave, other options exhausted and you’re really bored: patching binaries by hand ([Sinilo13])
  • 31. Acknowledgments • Warframe dev team • Warframe community
  • 34. References • [Ruskin11] Ruskin, E., Forensic Debugging (GDC 2011) • [Sinilo13] Sinilo, M., Patching binaries: http://msinilo.pl/blog/?p=1215
  • 36. Spider sense – dynamic arrays Great container (std::vector & friends), but a few problems ○ push_back while iterating (often few levels down the call chain), can reallocate memory and invalidate iterators. Typical symptoms: crashing when trying to access memory pointed by an iterator ○ indexing out of range (in some cases easy to tell from the dump). In many cases fine to access it, but data doesn’t make sense, so fails later when trying to use it.
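One common way to sidestep the first problem is to iterate by index over the element count captured up front, so nothing holds an iterator or reference across a `push_back`. A minimal sketch, with `AppendCopies` being a made-up example function:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Appends a copy of every element that was present when the call started.
void AppendCopies(std::vector<int>& items) {
    // Capture the size first: the loop must not visit the new elements.
    const std::size_t originalCount = items.size();
    for (std::size_t i = 0; i < originalCount; ++i) {
        // Copy by value; an iterator or reference obtained before push_back
        // would be invalidated if the vector reallocates.
        const int value = items[i];
        items.push_back(value);
    }
}
```

The same loop written with iterators would crash only occasionally, when a reallocation happens to occur, which is what makes this pattern hard to spot in testing.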
  • 37. Spider sense - multithreading • Crash when reading/writing, data races • See what other threads are doing • Investigate all accesses, missing locks • Sometimes adding Sleeps makes it easier to repro (more likely for jobs to overlap on your fast computer) • Fake/failing locks - widen the race window, make it more obvious to detect concurrent access “Lock” = assert eq 0, set to 1 “Unlock” = assert eq 1, set to 0
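The fake lock described above fits in a few lines. In this sketch `FakeLock` is a hypothetical name, and the methods return a bool instead of asserting directly so the caller decides how to react (assert, LogBug, breakpoint); an atomic exchange is used so the check and the set cannot themselves race.

```cpp
#include <atomic>
#include <cassert>

// Does not block anyone: it only detects that two owners overlapped.
class FakeLock {
public:
    // "Lock" = expect 0, set to 1. Returns false if it was already held.
    bool Lock() { return mHeld.exchange(1) == 0; }

    // "Unlock" = expect 1, set to 0. Returns false if it was not held.
    bool Unlock() { return mHeld.exchange(0) == 1; }

private:
    std::atomic<int> mHeld{0};
};
```

Typical use is `assert(lock.Lock());` on entry to the suspected critical section and `assert(lock.Unlock());` on exit; a failing assert then points at the second thread's callstack.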
  • 38. Bugs – a social problem • No man’s land, “the bystander effect” (probability of help inversely related to the number of bystanders) • Situation a little bit better in a typical game team, not a problem with serious bugs • Cracks between layers again – problems when multiple modules overlap • Gamification: top issues lists, bounties • Systemic tweaks: clear responsibilities, encourage team members to move outside their comfort zones (especially new employees, one of the best ways to get familiar with the codebase)

Editor's notes

  1. Not about pre-release, in-house testing (extra diag tools, traffic recording etc)
  2. Brian Kernighan – “Everyone knows that debugging is twice as hard as writing a program in the first place.”
  3. Getting hit by lightning – 1/80k
  4. Where 2 layers meet
  5. “See Test X for details”. Both failing and succeeding tests are valuable. Flow analysis.
  6. Process Hacker has an option to grab a dump, but it’s a full memory dump (considered modifying the source).