The curious case of data lake redemption

Reverse aging has long been a subject of ambiguity and curiosity, both in Hollywood and in Fitzgerald's flights of fancy. Hadoop at Verizon Wireless has been an interesting case study in terms of both scale and adoption. Technology adoption typically follows a steadily progressing curve over time, made up of feature additions, bug fixes, upgrades, etc. In this case study we examine a Hadoop adoption that oscillates in a space-time continuum, exhibiting traditional growth patterns as well as reverse aging.

The use case highlights the factors, causes, and impacts that can make such an extraordinary phenomenon commonplace in any environment. The conditions leading to this phenomenon may vary across use cases, industries, and environments. This use case discusses the technical aspects that lead to the ultimate path to technical redemption, which in turn yields a well-designed, performance-tuned infrastructure for continuous productivity.

Speaker: Shivinder Singh, Distinguished Member Technical Staff, Verizon

  1. The Curious Case of Data Lake Redemption. Shivinder Singh, Distinguished Member Technical Staff. © 2017 Verizon. This document is the property of Verizon and may not be used, modified or further distributed without Verizon’s written permission.
  2. About Verizon: The best, most reliable networks in the industry. The largest U.S. wireless company with the largest 4G LTE network. The largest and fastest all-fiber network in the U.S. One of the largest, most reliable and secure global networks. Using technology to address big challenges. Verizon Innovation Center in San Francisco, CA.
  3. Dedicated Corporate Citizen: Creating a platform for long-term growth for our customers, shareowners and society. Using our talent and technology to address society’s biggest challenges. Focusing on finding new ways our technology can improve healthcare, education and energy management. Focusing our philanthropic resources on becoming a channel for innovation and social change. Applying innovative technology to social issues.
  4. Big Data in the Enterprise: As the enterprise masters Big Data, it will become part of the enterprise solution framework.
  5. Shrinking the Interval: Analyzing, Reporting, Predicting, Operationalizing, Activating. WHAT happened? WHY did it happen? WHAT is happening? What WILL happen? MAKING it happen! Batch, Ad Hoc Analysis, Analytics, Continuous Updates / Short Queries, Event-Based Triggering. Understand, Change, Grow, Compete, Lead.
  6. Effective strategies answer three key questions: How will we Deliver value? How will we Create value? How will we Capture value?
  7. Unix Inode Management. Inode fields: mode, owners (2), timestamps (3), size, block count, direct blocks, single indirect, double indirect, triple indirect. [Diagram: the direct and indirect pointers lead to data blocks.]
  8. Block Size Comparison: Data Lake vs Single Client

     DATA LAKE TOP 20

     | DB Size (GB) | DB Name | Total Files | Total Blocks | Average Block Size (bytes) |
     |---|---|---|---|---|
     | 328,807 | /apps/hive/warehouse/prd1.db | 32,461,500 | 30,283,722 | 11,678,898 |
     | 180,361 | /apps/hive/warehouse/prd2.db | 7,030,688 | 6,568,455 | 29,498,992 |
     | 114,237 | /apps/hive/warehouse/prd3db | 7,218,443 | 7,663,817 | 16,004,037 |
     | 113,144 | /apps/hive/warehouse/prd4.db | 2,041,641 | 2,830,226 | 42,925,340 |
     | 42,535 | /apps/hive/warehouse/prd5.db | 169,111 | 504,297 | 90,567,016 |
     | 30,615 | /apps/hive/warehouse/prd6.db | 86,923 | 297,950 | 110,331,894 |
     | 21,433 | /apps/hive/warehouse/prd7.db | 637,283 | 730,173 | 31,520,262 |
     | 21,401 | /apps/hive/warehouse/prd8.db | 29,971 | 188,875 | 121,668,441 |
     | 11,564 | /apps/hive/warehouse/prd9.db | 30,873 | 110,838 | 119,432,578 |
     | 11,184 | /apps/hive/warehouse/prd10.db | 157,975 | 196,467 | 61,127,078 |
     | 10,301 | /apps/hive/warehouse/prd11.db | 9,713,823 | 8,953,109 | 1,236,123 |
     | 8,972 | /apps/hive/warehouse/prd12.db | 20,236 | 80,666 | 119,426,068 |
     | 8,711 | /apps/hive/warehouse/prd13.db | 352,294 | 390,780 | 23,994,662 |
     | 8,359 | /apps/hive/warehouse/prd14.db | 21,175 | 70,756 | 126,829,445 |
     | 7,920 | /apps/hive/warehouse/prd15.db | 1,316,631 | 1,215,234 | 7,017,294 |
     | 5,843 | /apps/hive/warehouse/prd16.db | 1,055,270 | 468,010 | 13,406,724 |
     | 5,829 | /apps/hive/warehouse/prd17.db | 552,918 | 486,693 | 12,881,117 |
     | 5,669 | /apps/hive/warehouse/prd18.db | 1,605 | 46,147 | 131,925,260 |
     | 5,652 | /apps/hive/warehouse/prd19.db | 5,362,238 | 5,360,747 | 1,135,249 |
     | 987 | /apps/hive/warehouse/prd20.db | 565,537 | 571,859 | 1,854,672 |

     SINGLE CLIENT

     | DB Size (GB) | DB Name | Total Files | Total Blocks | Average Block Size (bytes) |
     |---|---|---|---|---|
     | 315,866 | /apps/hive/warehouse/prd.db | 2,245,257 | 2,574,897 | 131,717,734 |
  9. Small File NameNode Impact: High GC pauses. High RPC times running into minutes. Cluster unresponsive. Jobs stalled. Full downtime.
  10. The S-curve Maps Major Transitions. [Chart: Performance vs. Time, with phases Ferment, Takeoff, Maturity, Reverse Aging.]
  11. Analysis: Support engagement. Increase NN heap. Bounce the NN/cluster. 5 bug-fix patches. Root cause still not found.
  12. Root Cause and Fix. Deep dive for 40 data lake clients: review of 456 databases, review of 373,083 tables, review of 5K jobs. Fix: reduce job frequency; block size parameters for Hive and YARN; ZooKeeper tuning.
  13. Run Times. [Chart: run times on a 0 to 400 axis; series Average_2017 and Average_2018.]
  14. Job Counts. [Chart: job count on a 0 to 3,500 axis; series 2017_Job_count and 2018_Job_count.]
  15. Other Considerations: ZooKeeper (ZK) is among the most critical components. Numerous third-party components. Znodes being written outside of HDP components. ZK image size: 10 GB, 5 M znodes. Fix: targeted purge of znodes down to 100 K; znode image size down to 100 MB; ongoing ZK tuning.
  16. Stack Selection: Physical limit? Performance is ultimately constrained by physical limits, e.g. sailing ships and the power of the wind, copper wire and transmission capability, semiconductors and the speed of the electron. [Chart: Performance vs. Time.]
  17. Once Upon a Time There Was an Inode… Redemption… Andy Dufresne: “He's a phantom, an apparition, second cousin to Harvey the Rabbit.” The Unix kernel is fundamental: packaging changes, the basics remain the same. Small files are a technology limitation. Data democracy can be a boon or a bane. Issues are platform agnostic.
  18. Q & A. You can reach us at shivinder.singh@vzw.com. Go to www.verizon.com/about/ for more information and news about our company, social responsibility, investor relations and careers.
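
Slide 7 lists the classic Unix inode layout: a few direct block pointers plus single, double, and triple indirect pointers. The following is a minimal sketch of how far each level reaches. The 4 KiB block size, 4-byte block pointers, and 10 direct pointers are illustrative assumptions, not values taken from the slides, and the function names are hypothetical.

```python
# Sketch of inode block addressing. All default values are assumptions
# chosen for illustration; real filesystems differ.

LEVELS = ["direct", "single indirect", "double indirect", "triple indirect"]


def level_capacities(block_size=4096, pointer_size=4, direct=10):
    """Number of data blocks reachable through each pointer level."""
    ptrs = block_size // pointer_size  # pointers that fit in one block
    return [direct, ptrs, ptrs ** 2, ptrs ** 3]


def max_file_size(block_size=4096, pointer_size=4, direct=10):
    """Largest file, in bytes, that this inode layout can address."""
    return sum(level_capacities(block_size, pointer_size, direct)) * block_size


def indirection_level(offset, block_size=4096, pointer_size=4, direct=10):
    """Name of the pointer level that holds the block containing `offset`."""
    block_index = offset // block_size
    capacities = level_capacities(block_size, pointer_size, direct)
    for name, capacity in zip(LEVELS, capacities):
        if block_index < capacity:
            return name
        block_index -= capacity
    raise ValueError("offset is beyond what this inode layout can address")


if __name__ == "__main__":
    print(indirection_level(0))
    print(indirection_level(10 * 4096))
    print(max_file_size())
```

Under these assumptions a file of up to 40 KiB is reached through direct pointers alone, yet every file, however small, still occupies one inode. That fixed per-file metadata cost is the theme the deck returns to on slide 17.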
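
The block size comparison on slide 8 can be read as a fill ratio: how much of a full HDFS block the average block actually uses. The sketch below uses four of the reported averages from that slide and assumes a 128 MiB block size, which is the stock Hadoop default for `dfs.blocksize`; the value configured on this cluster is not stated in the deck. The 25% threshold and the function names are hypothetical.

```python
# Fill ratio of the average block against an ASSUMED 128 MiB HDFS block size.
ASSUMED_BLOCK_BYTES = 128 * 1024 ** 2

# Average block sizes in bytes, copied from slide 8.
REPORTED_AVG_BLOCK_BYTES = {
    "prd1.db": 11_678_898,
    "prd11.db": 1_236_123,
    "prd18.db": 131_925_260,
    "prd.db (single client)": 131_717_734,
}


def fill_ratio(avg_block_bytes, block_bytes=ASSUMED_BLOCK_BYTES):
    """Fraction of a full block that the average block occupies."""
    return avg_block_bytes / block_bytes


def underfilled(avg_sizes, threshold=0.25, block_bytes=ASSUMED_BLOCK_BYTES):
    """Names whose average block is below `threshold` of a full block."""
    return [
        name
        for name, size in avg_sizes.items()
        if fill_ratio(size, block_bytes) < threshold
    ]


if __name__ == "__main__":
    for name, size in REPORTED_AVG_BLOCK_BYTES.items():
        print(f"{name}: {fill_ratio(size):.1%} of an assumed 128 MiB block")
    print("Underfilled:", underfilled(REPORTED_AVG_BLOCK_BYTES))
```

With these inputs the single-client database and prd18.db sit close to a full block, while prd1.db and prd11.db fall well below the threshold, which matches the small-file pattern the slide is contrasting.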
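
The NameNode symptoms on slide 9 follow from the fact that every file and block is tracked in NameNode memory. A rough sketch of the difference in metadata load, using the file and block counts from slide 8 and assuming the commonly quoted rule of thumb of about 150 bytes of heap per namespace object. That figure is an approximation rather than a measured value for this cluster, directories are ignored, and the function name is hypothetical.

```python
# Back-of-the-envelope NameNode heap estimate.
# ASSUMPTION: ~150 bytes of heap per file or block object (rule of thumb).
ASSUMED_BYTES_PER_OBJECT = 150


def estimate_heap_bytes(files, blocks, bytes_per_object=ASSUMED_BYTES_PER_OBJECT):
    """Approximate heap needed to track `files` + `blocks` namespace objects."""
    return (files + blocks) * bytes_per_object


if __name__ == "__main__":
    # Counts copied from slide 8: two databases of similar size on disk.
    data_lake = estimate_heap_bytes(files=32_461_500, blocks=30_283_722)
    single_client = estimate_heap_bytes(files=2_245_257, blocks=2_574_897)
    print(f"prd1.db (data lake):    {data_lake / 1e9:.2f} GB")
    print(f"prd.db (single client): {single_client / 1e9:.2f} GB")
    print(f"ratio: {data_lake / single_client:.1f}x")
```

The two databases hold a similar volume of data (328,807 GB vs 315,866 GB), but under this estimate the data lake database needs roughly an order of magnitude more NameNode heap, purely because of its file and block counts.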