.NET Core Summer event 2019 in Brno, CZ - War stories from .NET team -- Karel Zikmund

.NET Core Summer event, 2019 in Brno, CZ - 2019/7/9
Talk: War stories from .NET team by Karel Zikmund

https://www.wug.cz/brno/akce/1152--NET-Core-Summer-Event

Published in: Engineering

  1. 1. War stories from .NET team .NET Core Summer event 2019 – Brno, CZ Karel Zikmund – @ziki_cz
  2. 2. Agenda • Stories • Investigations on .NET team • Not just from me • Lessons learned on the way • Not on agenda – you won’t see any: • Source code • Debugger • Not needed: Deep .NET knowledge
  3. 3. My First Serious Investigation • Build lab for Windows component • Build break 1x per week • AccessViolation dialog hangs machine • Toolset updated to 2.0 RTM • Repro: • Once in ~50 runs • Overnight run: 247 crashes out of 77,006 runs (0.3%)
  4. 4. My First Serious Investigation - quotes • “The actual crash is occurring on some boilerplate stack checking code …” • “Karel is relatively new to the code base so he indicated it might take some time to understand what’s going on”
  5. 5. My First Serious Investigation
       mscorwks!UTSemReadWrite::UnlockRead+0xe [f:\rtm\ndp\clr\src\utilcode\utsem.cpp @ 357]
       mscorwks!CMDSemReadWrite::~CMDSemReadWrite+0x14 [f:\rtm\...\md\enc\rwutil.cpp @ 1299]
       mscorwks!RegMeta::DefineParam+0x196 [f:\rtm\ndp\clr\src\md\compiler\emit.cpp @ 2719]
       cscomp!EMITTER::EmitParamProp
       cscomp!ParamAttrBind::Init
       cscomp!ParamAttrBind::CompileParamList
       cscomp!CLSDREC::compileMethod
       cscomp!CLSDREC::CompileMember
       cscomp!CLSDREC::EnumMembersInEmitOrder
       cscomp!CLSDREC::compileAggregate
       cscomp!CLSDREC::compileNamespace
       cscomp!COMPILER::CompileAll
       cscomp!COMPILER::Compile
       cscomp!CController::RunCompiler
       cscomp!CController::Compile
       csc!main
  6. 6. My First Serious Investigation • Who corrupts stack? • GC? • NO! • Changed value between caller and callee • Single bit changed • Who corrupts it? • GC card table updates? • Of course NOT! • What about HW? • Naw! • Or maybe?
  7. 7. My First Serious Investigation • Does it by any chance reproduce on only one machine? • Answer: How did you know? • But why always the same callstack? • Good question, no good answer … magic • Lesson learned: Debugging HW errors is costly and hard • Always ask: Does it repro on more than 1 machine?
  8. 8. Another MetaData story MetaData format background: • Basically a database – rows and columns • Example – TypeDef table, columns: Flags | TypeName | TypeNamespace | Extends | MethodList • Row: (Public) | “Foo” | “Awesome.Story” | … | Method #10 • Row: (Private) | “Bar” | “Awesome.Story” | … | Method #11 • Indexes into tables/heaps are either 2B or 4B • What happens if last TypeDef has no methods? • MethodList = Number of methods + 1 = max + 1 • What happens if there are 0xffff methods?
  9. 9. Another MetaData story • II.24.2.6 “#~ stream” • If e is a simple index into a table with index i, it is stored using 2 bytes if table i has less than 2^16 rows, otherwise it is stored using 4 bytes. • II.22.37 TypeDef : 0x02 • 21. If MethodList is non-null, it shall index a valid row in the MethodDef table, where valid means 1 <= row <= rowcount+1 [ERROR] • How do you fix it? • “I’m on the fence whether we should (fix it), given it looks like people hit this about once in 17 years” • https://github.com/dotnet/corefx/issues/29554 • Lesson learned: Not all bugs have to be fixed
  10. 10. TypeSystem – Collapsing interfaces • Table of implemented interfaces: class A : I, J {} → [0: I, 1: J] • class B : A, K {} → [0: I, 1: J, 2: K] • With generics: class C<T> : L<T> {} • class D<T> : C<T>, L<string> {} → [0: L<T>, 1: L<string>] • class E : D<string>, I {} → [0: L<string>, 1: I]; [0: L<string>, 1: L<string>, 2: I] • Fix:
  11. 11. Breaking changes – Intro • Everyone wants fix for their bug • But nobody wants to be broken • Observation: 10% of fixes have unintended side-effects • Extreme case: Perf improvement can break app • How many customers? • Lesson learned: Everything has risk of breaking someone
  12. 12. Breaking changes – Last build • Finance app crashing – “last” build of Windows 8 on ARM (Surface RT) • Latent bug (introduced months ago) • Bug triggered by: 1. Method in NGen image has to be across 8KB pages 2. GC has to be triggered at least twice when it’s on stack • Unrelated change caused “unlucky” method order for: • System.Net.Configuration.DefaultProxySectionInternal..ctor • Lesson learned: Anything, really ANYTHING, has risk of breaking
  13. 13. Breaking changes – Huge impact • Patch to .NET Framework broke certain tax SW • Printing tax forms • Update pushed a few days before tax deadline in US • Note: Printing was tested on both sides (Microsoft & tax SW company) • But only into file, not to printer • Lesson learned: Be extra cautious around sensitive dates
  14. 14. Breaking changes – Below you • RavenDB – blue screen after KB4487017 on .NET Core! • dotnet/coreclr#22597 • PrefetchVirtualMemory • Kernel memory management bug
  15. 15. Networking – Security issue • January: Researcher running ML models on Cosmos • Suspicion about buffers – more logging • March: Repro gone • May: Similar report • +2 weeks: It blows up (more teams & impact) • All hands on deck • Small repro (20 min, then 1 min) … yay! • TTD trace (iDNA / TTT) … bonus & life saver
  16. 16. Networking – Security issue • Root-cause: HTTP pipelining under stress • 13-year-old bug (.NET 2.0) • [Diagrams: Request 1 → Server → Response 1; Request 1, Request 2 → Server → Response 1, Response 2]
  17. 17. Networking – Security issue • [Diagram: Request 1, Request 2, Request 3 → Server → Response 1, Response 2]
  18. 18. Networking – Security issue • [Diagram: Request 1, Request 2, Request 3 → Server → Response 1, Response 2]
  19. 19. Networking – Security issue • We have workaround (disable pipelining) – perf impact • Worked fix … • Verifying fix … • Repro fails after 4h • Same symptoms • Repro sensitive to cloud network load (8-17) • TTD (iDNA / TTT) does not work • Suspicion about buffers again
  20. 20. Networking – Security issue • Bad buffer lifetime management – on sending side! • 5-year-old bug (.NET 4.5.2) • Trigger found: • Thanks to Skype team – 24h deployment of experiments • Change in .NET 4.7.1 • Fix around the problematic area • Making the opportunity window SMALLER! • … counter-intuitive • Code review – similar bug on receiving side (5 years old) • Same symptoms as HTTP pipelining
  21. 21. Networking – Security issue • Why so many customers/services hit it at once? • Maybe Spectre & Meltdown fixes roll out? • or just … magic • Lesson learned: Weird coincidences can happen …
  22. 22. Developer’s pride in multi-threading • School project (2000-2003) • Game simulation server – heavily multi-threaded • https://github.com/karelz/WarPlusPlus (nostalgia) • Classic deadlock – 2 threads locking A and B in different order • Deadlock avoidance started to make sense • WinRT binder (2010) • Binder is tricky – GC interaction (NO_GC range) • Type routed to WinMD file, assembly meaningless • Negotiated on namespace only in 1 assembly • Multiple reviews, discussions with architects • Bugs start to come in after shipping (NullReferenceException)
  23. 23. Optimizations • Once upon a time, … there was a service in Microsoft • List vs. array data structure perf • Perspectives: 1. The data structure will have in practice 3-5 items 2. There are 3 hops between servers for each request!!! • Lesson learned: Avoid premature optimizations … at all cost
  24. 24. Lessons learned • Always ask: Does it repro on more than 1 machine? • Debugging HW bugs is costly • Some bugs happen once in 17 years • Spec bugs are hard to fix • MetaData format bug • Anything, really ANYTHING, has risk of breaking someone • Innocent changes can trigger latent bugs elsewhere • Impact may be huge – e.g. during tax season • Always try to create small repro • Make your and everyone’s life easier • TTD (iDNA / TTT) is life saver • Avoid premature optimizations … at all cost, save your time • … sometimes there is just … magic @ziki_cz
  25. 25. Thank you • Feedback welcome • Twitter DM, email, in-person, etc. • Survey • What you liked vs. not? • Too rushed? • Hard to understand? • Boring? • Didn’t meet your expectations? @ziki_cz
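
The MetaData story on slides 8 and 9 comes down to two rules that collide at exactly one table size. The sketch below is an illustration only, not code from the .NET runtime. It assumes the quoted spec text is the whole story: a simple index is 2 bytes when the target table has fewer than 2^16 rows, and a MethodList may legally be rowcount + 1. The function names are hypothetical.

```python
# Illustrative sketch of the index-width rule quoted on slide 9 (II.24.2.6)
# and the "rowcount + 1" end marker allowed by II.22.37.
# All names here are hypothetical, chosen for the example.

INDEX_LIMIT = 2 ** 16  # tables with fewer rows than this use 2-byte indexes


def simple_index_size(row_count: int) -> int:
    """Byte width of a simple index into a table with row_count rows."""
    return 2 if row_count < INDEX_LIMIT else 4


def end_of_list_marker(row_count: int) -> int:
    """MethodList of a trailing TypeDef that owns no methods: rowcount + 1."""
    return row_count + 1


def fits(value: int, size_in_bytes: int) -> bool:
    """True if value can be stored in an unsigned field of that width."""
    return 0 <= value < 2 ** (8 * size_in_bytes)


for rows in (10, 0xFFFE, 0xFFFF, 0x10000):
    size = simple_index_size(rows)
    marker = end_of_list_marker(rows)
    print(f"rows={rows:#x} index_bytes={size} marker={marker:#x} fits={fits(marker, size)}")
```

Only the 0xffff-row case reports that the marker does not fit: the table is still small enough for 2-byte indexes, but rowcount + 1 needs a 17th bit. One row fewer or one row more and both rules agree, which is consistent with the slide's remark that people hit this about once in 17 years.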
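
Slides 16 to 18 name HTTP pipelining as the root cause but do not spell out the exact failure. The toy model below therefore does not reproduce the .NET bug. It only shows, under the assumption that responses are matched to requests purely by arrival order (which is how HTTP/1.1 pipelining works), why a single bookkeeping slip hands every later response to the wrong caller. The helper name and the request and response labels are hypothetical.

```python
# Toy model of order-based matching on a pipelined connection.
# Hypothetical illustration; not the actual .NET networking code.


def match_by_order(pending_requests, responses):
    """Pair each response with the oldest request still considered pending."""
    return list(zip(pending_requests, responses))


server_responses = ["resp1", "resp2", "resp3"]

# Healthy case: the client's pending queue mirrors what the server received.
healthy = match_by_order(["req1", "req2", "req3"], server_responses)

# Assumed slip: the client has dropped req1 from its queue,
# but the server still answers it first.
skewed = match_by_order(["req2", "req3"], server_responses)

print("healthy:", healthy)
print("skewed: ", skewed)
```

In the skewed run req2 receives resp1 and req3 receives resp2, so each caller sees data meant for someone else. That kind of cross-delivery is what turns a reliability bug into a security issue.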
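
The classic deadlock on slide 22 (two threads locking A and B in different order) has a standard remedy: make every thread acquire locks in one global order. The slides do not say which avoidance scheme the school project used, so this is a minimal sketch of consistent lock ordering only. Ordering by `id()` is an assumption made for the example, and the helper names are hypothetical.

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
counter = 0


def acquire_in_order(*locks):
    """Acquire locks in one global order (here by id), whatever order the caller names them in."""
    ordered = sorted(locks, key=id)
    for lock in ordered:
        lock.acquire()
    return ordered


def release_all(ordered):
    """Release in reverse acquisition order."""
    for lock in reversed(ordered):
        lock.release()


def worker(first, second, rounds):
    global counter
    for _ in range(rounds):
        held = acquire_in_order(first, second)
        try:
            counter += 1  # protected: both locks are held here
        finally:
            release_all(held)


# The two threads name the locks in opposite order, as in the slide's scenario.
t1 = threading.Thread(target=worker, args=(lock_a, lock_b, 1000))
t2 = threading.Thread(target=worker, args=(lock_b, lock_a, 1000))
t1.start()
t2.start()
t1.join()
t2.join()
print(counter)
```

Because both workers end up taking the same lock first, neither can hold one lock while waiting for the other, so the circular wait that defines the deadlock cannot form.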
