
Uotm workshop


  1. Big Data in the Cloud (http://smallbitesofbigdata.com)
     Ravi Patel, Business Intelligence Team Manager
     Microsoft Certified Solution Expert (BI) ®
     ravi@nealanalytics.com
  2. About Me
  3. Key Takeaways
     • Basic Big Data and Hadoop terminology
     • What projects fit well with Hadoop
     • Why Hadoop in the cloud is so powerful
     • Sample end-to-end architecture
     • See: Data, Hadoop, Hive, Streaming, Analytics, BI
     • Do: Data, Hadoop, Hive, Streaming, Analytics, BI
     • How this tech solves your business problems
  4. Your Goals
     • What are your backgrounds and needs?
     • What is your Big Data experience?
  5. Pre-Req: Azure Subscription
     • Trial: http://azure.microsoft.com/en-us/pricing/free-trial/
     • MSDN Subscription: http://azure.microsoft.com/en-us/pricing/member-offers/msdn-benefits/
     • Startup BizSpark: http://azure.microsoft.com/en-us/pricing/member-offers/bizspark-startups/
     • Classroom: http://www.microsoftazurepass.com/azureu
     • Pay-As-You-Go or Enterprise Agreement: http://azure.microsoft.com/en-us/pricing/
  6. Pre-Reqs
     • Azure subscription with available HDInsight cores
     • Demo file: http://www.slideshare.net/raviumesh/big-datademo
     • Download the Power Query add-in: http://www.microsoft.com/en-us/download/details.aspx?id=39379&CorrelationId=d8002172-0438-4ef5-b0fa-e635f8f17251
     • Enable PowerPivot and Power View in Excel Options, COM Add-ins
     • HOL labs: http://tinyurl.com/lncd45x (“Clone in Desktop” or “Download ZIP” + unzip)
     • GUI: install CloudXplorer http://clumsyleaf.com/products/downloads
     • (Optional) Command line: install AzCopy http://azure.microsoft.com/en-us/documentation/articles/storage-use-azcopy/
     • Install SQL 2014 SSMS: http://www.microsoft.com/en-gb/download/details.aspx?id=42299
     • Today’s slides: http://tinyurl.com/lxutdd4
  7. What is Big Data?
  8. What do you think Big Data is?
  9. What is Big Data?
     It Is
     • Scale out, distributed processing
     • Enables elasticity
     • Encourages exploration
     • Faster data ingestion
     • Lower TCO
     • Empowers self-service BI and analytics
     • Rapid time to insight
     It Is NOT
     • A well-defined thing
     • About volume, size
     • A replacement for everything
     • The answer to every problem
  10. What is Hadoop? Conceptual View
      It Is
      • A type of Big Data
      • Just another data source
      • A loose collection of open source code
      • Distributed by many
      • Handles loosely structured data
      • Write once, read many
      It Is Not
      • Actually a thing!
      • The only way to do Big Data
      • Only about data
  11. BASE - ACID
      • BASE: Basically Available, Soft State, Eventually Consistent
      • ACID: Atomic, Consistent, Isolated, Durable
  12. What is Hadoop? Tech View (http://hortonworks.com/hdp/)
  13. End to End Architecture
  14. Microsoft Azure Data Services
      • Visualize + decide
      • Transform + analyze
      • Capture + manage data
  15. Demo
      • View the Azure portals
      • HDInsight: elasticity, query
  16. Internet of Things – Business Insights (Microsoft Azure architecture diagram)
      Diagram labels: Source Data; Real Time; Event Hub; Streaming; Azure Storage; HDInsight; SQL Server; Queries; Machine Learning, Analytics, and Business Intelligence; Destination Apps + Data
  17. Architecture – Use Cloud Building Blocks
      Sources: Any Device!, Other
      Blob Storage or In Memory (Landing Zone): optimized for write throughput
      • Many small blobs
      • Raw/binary format
      • Data kept until curated
      • Azure Blob Storage if persisted
      • Azure Queues & Workers for in memory
      Curator: use case specific & general processing
      • Data governance requirements (PII scrub)
      • Aggregate for efficient storage
      • Publish to real-time consumers and long term storage (Hadoop)
      Blob Storage (Persistent Storage): optimized for query efficiency
      • Optimized size (combine blobs)
      • Cleansed/masked
      • Partitioned
      • Well-defined, semi-structured data
      Consumers: HDInsight Clusters (Hive, Pig, etc), REST, Sqoop, Self-Service Analytics, Reporting / DW
  19. When to Use Hadoop
  20. Typical Big Data Use Cases
      • Smart meter monitoring
      • Equipment monitoring
      • Advertising analysis
      • Life sciences research
      • Fraud detection
      • Healthcare outcomes
      • Weather forecasting
      • Natural resource exploration
      • Social network analysis
      • Churn analysis
      • Traffic flow optimization
      • Legal discovery
      • Telemetry
      • IT infrastructure optimization
  21. Hadoop Shines When…
      • Data exploration, analytics and reporting, new data-driven actionable insights
      • Rapid iterating
      • Unknown unknowns
      • Flexible scaling
      • Data driven actions for early competitive advantage or first to market
      • Low number of direct, concurrent users
      • Low cost data archival
  22. Hadoop Anti-Patterns…
      • Replace system whose pain points don’t align with Hadoop’s strengths
      • OLTP needs adequately met by an existing system
      • Known data with a static schema
      • Many end users
      • Interactive response time requirements
      • Your first Hadoop project + mission critical system
  23. Relational Database vs. Hadoop Platform: SCALE (storage & processing)
      • Schema: required on write (relational); required on read (Hadoop)
      • Speed: reads are fast (relational); writes are fast (Hadoop)
      • Governance: standards and structured (relational); loosely structured (Hadoop)
      • Processing: limited, no data processing (relational); processing coupled with data (Hadoop)
      • Data types: structured (relational); multi and unstructured (Hadoop)
      • Best fit use: interactive OLAP analytics, complex ACID transactions, operational data store (relational); data discovery, processing unstructured data, massive storage/processing (Hadoop)
  24. Now You Do It (http://bit.ly/BDApr2015)
      • Cloud Data Camp Lab 2. Create: HDInsight cluster
      • Thanks to Lara Rubbelke for demos!
  25. Why Hadoop in the Cloud
  26. Microsoft Hadoop Options
      Cloud
      • HDInsight Service
      • Windows Azure Storage Blob (WASB)
      • HDP or Cloudera on VMs (Windows or Linux)
      • Any distro on VMs (Windows or Linux)
      Hybrid / On-Premises
      • Parallel Data Warehouse (PDW) with Polybase
      • APS/PDW Hadoop Regions
      • OneBox for Developers
      • Hortonworks Data Platform (HDP for Windows)
  27. Why Hadoop in the Cloud?
  28. Why Hadoop in the Cloud?
      Hadoop
      • It’s easier
      • You can concentrate on the analytics
      • WASB: separation of storage and compute
      • Shared data, globally accessible
      • Lowers the cost of discovery & innovation
      • No commitment as you learn
      Cloud in General
      • Today’s disruptor, tomorrow’s reality
      • Elasticity, capacity
      • Less infrastructure and implementation work
      • Lower TCO
      • Business Continuity
      • Operational Agility
  29. WASB: Separation of Storage & Compute
      • Windows Azure Storage Blob (WASB) = separation of storage and compute
      • Open source code available to any distro
      • Simplified data access
      • Reduced data movement
      • Faster access to new data
      • Enables ETL even when a cluster isn’t up = lower TCO
      • Share data concurrently
  30. Why HDInsight
      • Separation of storage and compute is the default
      • Varied workloads: Query, Streaming, NoSQL
      • Elasticity: node sizes, # of nodes
      • Committed to openness: Hortonworks, Linux, WASB
  32. So Far…
      • Basic Big Data and Hadoop terminology
      • What projects fit well with Hadoop
      • Why Hadoop in the cloud is so powerful
      • Sample end-to-end architecture
      • Hands-On: Storage, data load, SQL database, Service Bus Event Hub, HDInsight, Hive, AzureML, Power Query, Power View
  33. Tie It Together
  34. What’s the Goal?
      • Ask a business question
      • Find and load data
      • Explore the data
      • Iterate
      • Analyze, visualize, and/or move the data
      • Productionalize some, all, or none
  35. Key Takeaways
      • Basic Big Data and Hadoop terminology
      • What projects fit well with Hadoop
      • Why Hadoop in the cloud is so powerful
      • Sample end-to-end architecture
      • See: Data, Hadoop, Hive, Streaming, Analytics, BI
      • Do: Data, Hadoop, Hive, Streaming, Analytics, BI
      • How this tech solves your business problems
  36. Hadoop in the Cloud
      Ravi Patel, Business Intelligence Team Manager
      Microsoft Certified Solution Expert (SQL 2012) ®
      ravi@nealanalytics.com
  37. Big Data References
      • Get started / overview with a free ebook, “Introducing Microsoft Azure HDInsight”: http://blogs.msdn.com/b/microsoft_press/archive/2014/05/27/free-ebook-introducing-microsoft-azure-hdinsight.aspx
      • Architect a solution with the Patterns and Practices guide, “Developing big data solutions on Microsoft Azure HDInsight”: http://blogs.msdn.com/b/masashi_narumoto/archive/2014/06/30/new-release-developing-big-data-solutions-on-microsoft-hdinsight.aspx
      • The Data Science Laboratory Series is Complete: http://blogs.msdn.com/b/buckwoody/archive/2014/03/24/the-data-science-laboratory-series-is-complete.aspx
  38. Big Data References
      • Microsoft Big Data: http://microsoft.com/bigdata
      • HDP for Windows: http://hortonworks.com/products/hdp-windows/
      • Hadoop: The Definitive Guide, by Tom White
      • Programming Hive, by Capriolo, Wampler, Rutherglen
      • Big Data Learning Resources: http://sqlblog.com/blogs/lara_rubbelke/archive/2012/09/10/big-data-learning-resources.aspx
      • Hurricane Sandy Mash-Up: Hive, SQL Server, PowerPivot & Power View: http://blogs.msdn.com/b/cindygross/archive/2013/01/31/mash-up-hive-sql-server-data-in-powerpivot-amp-power-view-hurricane-sandy-2012.aspx
      • Twitter Search: https://twitter.com/#!/search/%23bigdata
      • Hive Reference: http://hive.apache.org
      • HDInsight Tutorials: http://www.windowsazure.com/en-us/documentation/services/hdinsight/?fb=en-us
      • Denny Lee: http://dennyglee.com/category/bigdata/
      • Carl Nolan: http://blogs.msdn.com/b/carlnol/archive/tags/hadoop+streaming/
      • Cindy Gross: http://tinyurl.com/SmallBitesBigData
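
Slide 23 contrasts a schema that is "required on write" (relational database) with one that is "required on read" (Hadoop). The sketch below is one way to illustrate that difference on a single machine. It is not taken from the deck: the sample meter readings, the table layout, and both function names are made up for illustration, and SQLite stands in for a relational database.

```python
import sqlite3

# Hypothetical sample data: three raw lines, two of which do not fit the expected shape.
RAW_EVENTS = [
    "2015-04-01,meter-7,21.5",
    "2015-04-01,meter-9,not-a-number",   # malformed reading
    "2015-04-02,meter-7,22.1,extra",     # unexpected extra field
]

def load_schema_on_write(lines):
    """Schema on write: validate every record against a fixed table at load time."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE readings (day TEXT NOT NULL, meter TEXT NOT NULL, value REAL NOT NULL)"
    )
    rejected = []
    for line in lines:
        try:
            day, meter, value = line.split(",")   # wrong field count raises ValueError
            conn.execute(
                "INSERT INTO readings VALUES (?, ?, ?)", (day, meter, float(value))
            )
        except ValueError:
            rejected.append(line)                 # never reaches the table
    conn.commit()
    return conn, rejected

def query_schema_on_read(lines):
    """Schema on read: keep raw lines as-is and impose structure only per query."""
    total = 0.0
    for line in lines:
        fields = line.split(",")
        try:
            total += float(fields[2])
        except (IndexError, ValueError):
            continue                              # this query skips what it cannot parse
    return total
```

With schema on write, the two odd lines are rejected up front and reads against the table are simple. With schema on read, every line is stored cheaply and each query decides how much of it to interpret, which fits the deck's "writes are fast" and "data discovery" points.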

Editor's notes

  • Azure Subscription: http://youtu.be/lSxMtmRE114
    Create HDInsight Cluster in Azure Portal http://smallbitesofbigdata.com/archive/2015/02/26/create-hdinsight-cluster-in-azure-portal.aspx
  • Big Data may refer to the technology (including tools and processes) that an organization requires to handle large amounts of data and the storage facilities for it.
    TCO – Total Cost of Ownership
  • Hadoop was first created in 2005 to support distribution for the Nutch search engine project
    Initial release was December 10, 2011
    HDInsight released in late 2013
  • Atomic: Everything in a transaction succeeds or the entire transaction is rolled back.
    Consistent: A transaction cannot leave the database in an inconsistent state.
    Isolated: Transactions cannot interfere with each other.
    Durable: Completed transactions persist, even when servers restart etc.
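
The atomicity and consistency properties above can be seen in a few lines of code. This is a minimal sketch, not part of the original notes: it uses Python's built-in sqlite3 module, and the accounts table, the names, and the balances are hypothetical.

```python
import sqlite3

def make_demo_db():
    """Create an in-memory database with a rule that balances never go negative."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "name TEXT PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
    conn.commit()
    return conn

def transfer(conn, src, dst, amount):
    """Credit dst, then debit src, as one transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            # If this debit violates the CHECK constraint, the credit above is undone too.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False

def balances(conn):
    """Return {name: balance} for every account."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))
```

A transfer of 30 from alice to bob succeeds and both rows change. A transfer of 500 fails on the debit, and the rollback also removes the credit that had already been applied, so the database is never left half-updated.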
  • Presenter guidance:
    Share how we think about the data platform in the cloud. Today, we’ll specifically talk about SQL in a VM (briefly), SQL DB, DocumentDB, HBase on HDInsight, and Tables/Blobs. There are lots of other adjacent services such as Redis Cache, Event Hubs, HDInsight, Azure ML, Data Factory, Stream Analytics that will not be addressed in this deck.

    Slide talk track:
    The top row is Power BI – you’re making decisions based on data
    The middle row is ML, Stream Analytics, HDInsight, and Data Factory – processing and making sense of the data
    The bottom row is where you ingest and store data.
    With Azure, organizations have access to a whole range of services that allow them to use the right tool for the right job when developing applications.
    In the cloud, organizations can collect and manage data in the form in which it’s born and store it in the form that best suits an application’s needs.
  • They have a very simple architecture.
    Xbox consoles send raw data to a landing zone (it may spill to disk/blob storage). They process each small file as it lands and keep it until curation finishes.
    They curate the data: scrub out personally identifiable info; aggregate; split as needed (to send subsets of data such as 10 minutes of sliding data or the new users in the last month); combine many small files into a few large files; put the result into Avro format (common, well-known SerDes); and persist it “permanently” to Azure Blob storage.
    The data in the permanent store (WASB) is in a few large files, cleansed/masked, partitioned by day, and semi-structured.
    HDInsight processes the data – analytics, sending to other systems (SQL, RS, PowerPivot, etc.)

    Demo (fake/cleansed data)
    Show RawStats (view in notepad, Cloud Explorer) = raw binary data in a proprietary Xbox format, shown here (cleansed) with comma separators for readability. Each line is a session with a start time, gamerid, IP address, and who they interacted with (gamerids separated by hyphens). This is what is in the landing zone: the raw data.
    Show RawCurator.pig (view in notepad). Compute/worker roles are watching for the raw data files. They pick them up and use Pig (and other MapReduce) to remove PII, aggregate, split, consolidate, remove the last octet of the IP for per-state data…. Data is stored per arrival date, which sets us up for Hive partitions. This is a very simple workflow written by people who didn’t know Hadoop.
    Show gamerstats.xlsx. This is the curated data.
    Show PowerMap on top of sheet 3 (optionally also sheet 2 for marketing campaign data). This is using Hive/Hive ODBC driver to view new users.
    (optional) Show pssnippets: PowerShell to submit jobs
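
The curation step described above (drop PII, mask the IP, group by day for Hive partitions) can be sketched in plain Python. This is only an illustration under stated assumptions: the real workflow is a Pig script on proprietary binary data that is not shown in these notes, so the ISO timestamp format, the output field names, and the functions `mask_ip` and `curate` are all hypothetical. The sketch also partitions by session start date as a stand-in for arrival date.

```python
from collections import defaultdict
from datetime import datetime

def mask_ip(ip):
    """Replace the last octet of an IPv4 address with 0, e.g. 10.1.2.3 -> 10.1.2.0."""
    octets = ip.split(".")
    return ".".join(octets[:3] + ["0"])

def curate(raw_lines):
    """Turn raw session lines into cleansed records grouped by day.

    Each raw line is assumed to be: start_time,gamerid,ip,peer-peer-peer
    """
    partitions = defaultdict(list)
    for line in raw_lines:
        start, gamer_id, ip, peers = [field.strip() for field in line.split(",")]
        day = datetime.strptime(start, "%Y-%m-%dT%H:%M:%S").date().isoformat()
        peer_count = len(peers.split("-")) if peers else 0
        # gamer_id and the individual peer IDs are deliberately not copied to the output.
        partitions[day].append(
            {"start": start, "ip": mask_ip(ip), "peer_count": peer_count}
        )
    return dict(partitions)
```

Each key of the returned dictionary corresponds to one day, which is the shape a date-partitioned Hive table expects. Combining the many small inputs into a few large files per day would happen when these groups are written out.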
  • Businesses using Big Data are “making it big”. They are taking advantage of all this ambient data and they’re moving ahead, gaining a foothold in new markets and gaining market share in existing markets. Think about how Netflix makes movie recommendations or how Google can predict a flu outbreak before the CDC does.

    HDInsight is focused on the volume and variety problems. Our Rx/StreamInsight and BI stack is added in to help address velocity issues.
  • http://blogs.msdn.com/b/cindygross/archive/2015/02/25/master-choosing-the-right-project-for-hadoop.aspx
  • Create HDInsight Cluster in Azure Portal http://smallbitesofbigdata.com/archive/2015/02/26/create-hdinsight-cluster-in-azure-portal.aspx
  • Why big data in the cloud?

    collect data globally
    much is already in the cloud
    share globally
    cross data center HA/DR
    cost of hiring, training, retaining hardware personnel
    highly flexible, scalable
    easily pull in ambient data

    It's partly a question of where to spend your resources and how much control you want.

  • Why Hadoop in the cloud?

    You can deploy Hadoop in a traditional on-site datacenter. Some companies–including Microsoft–also offer Hadoop as a cloud-based service. One obvious question is: why use Hadoop in the cloud? Here's why a growing number of organizations are choosing this option.

    The cloud saves time and money

    Open source doesn't mean free. Deploying Hadoop on-premises still requires servers and skilled Hadoop experts to set up, tune, and maintain them. A cloud service lets you spin up a Hadoop cluster in minutes without up-front costs.

    See how Virginia Tech is using Microsoft's cloud instead of spending millions of dollars to establish their own supercomputing center.

    The cloud is flexible and scales fast

    In the Microsoft Azure cloud, you pay only for the compute and storage you use, when you use it. Spin up a Hadoop cluster, analyze your data, then shut it down to stop the meter.

    We quickly spun up the Azure HDInsight cluster and processed six years' worth of data in just a few hours, and then we shut it down… processing the data in the cloud made it very affordable.

    –Paul Henderson, National Health Service (U.K.)

    The cloud makes you nimble

    Create a Hadoop cluster in minutes–and add nodes on-demand. The cloud offers organizations immediate time to value.

    It was simply so much faster to do this in the cloud with Windows Azure. We were able to implement the solution and start working with data in less than a week.

    –Morten Meldgaard, Chr. Hansen
  • http://blogs.msdn.com/b/cindygross/archive/2015/02/03/why-wasb-makes-hadoop-on-azure-so-very-cool.aspx
  • Create HDInsight Cluster in Azure Portal http://smallbitesofbigdata.com/archive/2015/02/26/create-hdinsight-cluster-in-azure-portal.aspx