unit 1 big data.pptx

  1. WEB DATA  A few years back, web data mining was an entirely manual process, and it took many days for almost all small and medium players in the market. Today, technology has evolved: we are in the era of Big Data, manual data mining is no longer the right method, and the work is mostly done with automation tools, custom scripts, or the Hadoop framework. Web data extraction is the process of collecting data from the World Wide Web using a web scraper, a crawler, manual mining, etc. A web scraper or crawler is a cutting-edge tool for harvesting information available on the internet. In other words, web data extraction is the process of crawling websites and extracting data from their pages using a tool or a program. Web extraction is related to web indexing, which refers to various methods of indexing the contents of web pages using a bot or web crawler. A web crawler is an automated program, script, or tool with which we can 'crawl' web pages to collect information from websites.
  2. In this whole process, the first step is web data extraction. This can be done using the different scraping tools available in the market (both free and paid tools exist) or by creating a custom script, with the help of an expert in a scripting language like Python, Ruby, etc. The second step is to find insight in the data. For this, we first need to process the data using the right tool, based on the size of the data and the availability of expert resources. The Hadoop framework is the most popular and widely used tool for big data processing. Also, if sentiment analysis of the data is needed, we use MapReduce, which is one of the components of the big data (Hadoop) stack. To summarize, for web data extraction we can choose from different automation tools or develop scripts in a programming language. Developing a script often minimizes effort, as it is reusable with minimal modification. Moreover, as the volume of extracted web data is huge, it is advisable to go for the Hadoop framework for quick processing.
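As a minimal sketch of the extraction step described above, the following Python snippet uses only the standard library to pull hyperlinks out of an HTML page, which is the core operation a crawler repeats as it follows links. It is illustrative only: a real scraper would first fetch each page over HTTP (e.g. with urllib) rather than parse an inline string, and the `LinkExtractor` name is our own.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag seen while parsing a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# A tiny inline page stands in for a fetched web page.
page = '<html><body><a href="/about">About</a> <a href="/contact">Contact</a></body></html>'
extractor = LinkExtractor()
extractor.feed(page)
print(extractor.links)  # ['/about', '/contact']
```

A crawler would then enqueue each discovered link, fetch it, and repeat, which is exactly the 'crawl' loop the slide describes.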
  3. Media companies use web scraping to collect recent and popular topics of interest from social media and popular websites. Business directories use web scraping to collect information about business profiles, addresses, phone numbers, locations, zip codes, etc. In the healthcare sector, physicians scrape data from multiple websites to collect information on diseases, medicines, components, etc. When companies decide to go for web data extraction today, they plan for big data from the start, because they know the data will come in bulk, i.e. millions of records, mostly in semi-structured or unstructured format. So we need to treat it as big data and use the Hadoop framework and tools to convert it into a form usable for decision making.
  4. Challenges of Conventional Systems • Analytics has been used in the business intelligence world to provide tools and intelligence to gain insight into the data. • Data mining is used in enterprises to keep pace with the critical monitoring and analysis of mountains of data. • The challenge is how to unearth all the hidden information in this vast amount of data.
  5. Common challenges of conventional systems: • They cannot work on unstructured data efficiently. • They are built to profile the relational data model. • They are batch oriented: we need to wait for nightly ETL (extract, transform, and load) and transformation jobs to complete before the required insight is obtained. • Parallelism in a traditional analytics system is achieved through costly hardware like MPP (Massively Parallel Processing) systems. • They have inadequate support for aggregated summaries of data.
  6. Data Challenges • Volume, Velocity, Variety & Veracity • Data discovery and comprehensiveness • Scalability • Storage issues Process Challenges • Capturing data • Aligning data from different sources • Transforming data into a form suitable for analysis • Modeling data (mathematically, via simulation) • Understanding output, visualizing results, and display issues on mobile devices
  7. Management Challenges • Security • Privacy • Governance • Ethical issues Limitations of Traditional Systems / RDBMS • Designed to handle well-structured data • Traditional storage vendor solutions are very expensive • Shared block-level storage is too slow • Reads data in 8 KB or 16 KB block sizes • Schema-on-write requires data to be validated before it can be written to disk • Software licenses are too expensive • Getting data from disk and loading it into memory requires an application
  8. Solution constraints • Inexpensive storage • A data platform that could handle large volumes of data and be linearly scalable at cost and performance • A highly parallel processing model that was highly distributed to access and compute the data very fast • A data repository that could break down the silos and store structured, semi-structured, and unstructured data to make it easy to correlate and analyze the data together
  9. The Evolution of Analytic Scalability • Scalability: The ability of a system to handle an increasing amount of work required to perform its task • Data storage capacity has grown in recent years to accommodate the need for big data • Measures of Data Size – Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta Basic Definitions • Data: – Known facts that can be recorded and that have an implicit meaning. • Database: – An organized collection of related data. • Database Management System (DBMS) – A software package to facilitate the creation and maintenance of a computerized database. • Relational Database Management System (RDBMS) – A DBMS based on the relational model • A relation is a set of tuples • Enterprise Data Warehouse (EDW) – A central warehouse of all sources of data
  10. Massively Parallel Processing (MPP) Systems – Have many processors – All these processors work in parallel – Big data is split into many parts, and the processors work on each part in parallel – A divide-and-conquer strategy
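The divide-and-conquer strategy above can be sketched in a few lines of Python. This is only an illustration of the idea, not MPP hardware: a process pool stands in for the many processors, the data is split into parts, each worker computes a partial result, and the partial results are combined.

```python
from concurrent.futures import ProcessPoolExecutor

def partial_sum(chunk):
    """Work done by one 'processor' on its part of the data."""
    return sum(chunk)

def parallel_sum(data, parts=4):
    """Divide the data, conquer each part in parallel, combine the results."""
    size = max(1, len(data) // parts)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ProcessPoolExecutor() as pool:
        return sum(pool.map(partial_sum, chunks))

if __name__ == "__main__":
    print(parallel_sum(list(range(1, 101))))  # 5050
```

The same pattern, with much more sophisticated splitting and combining, underlies both MPP databases and MapReduce.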
  11. Data Preparation • Manipulation of data into a form suitable for analysis – Join • Combining columns of different data sources – Aggregation • Combining many rows of data into one summary – e.g., a statistical summary • Combining rows from different data sources – Derivation • Creating new columns of data • e.g., calculating a ratio – Transformation • Converting data into a more useful format • e.g., taking the log, or converting date of birth to age
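The four preparation operations above can be sketched with plain Python records. This is an illustrative toy (a real pipeline would run these steps in SQL or an analytics tool); the customer and order data are made up for the example.

```python
import math

customers = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
orders = [{"cust": 1, "amount": 100.0}, {"cust": 1, "amount": 50.0},
          {"cust": 2, "amount": 200.0}]

# Join: combine columns of two data sources on the customer id.
by_id = {c["id"]: c["name"] for c in customers}
joined = [{"name": by_id[o["cust"]], "amount": o["amount"]} for o in orders]

# Aggregation: a statistical summary (total spend per customer).
totals = {}
for row in joined:
    totals[row["name"]] = totals.get(row["name"], 0.0) + row["amount"]

# Derivation: a new column, each customer's share of overall revenue.
overall = sum(totals.values())
shares = {name: t / overall for name, t in totals.items()}

# Transformation: convert to a more useful scale, e.g. take the log.
log_totals = {name: math.log(t) for name, t in totals.items()}

print(totals)  # {'Ann': 150.0, 'Bob': 200.0}
```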
  12. Ways to do in-database data preparation • SQL • User-defined functions / embedded processes – e.g., SELECT customer, attrition_score – The analytic tool's engine running inside the database • Predictive Model Markup Language (PMML) – Based on XML Cloud Computing • McKinsey definition – Enterprises incur no infrastructure or capital cost; they pay on a pay-per-use basis – Should be scalable – The architectural specifics of the underlying hardware are abstracted from the user • Public clouds and private clouds differ in – Security – Specialized services – Long-term cost
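To make "in-database preparation via SQL" concrete, the sketch below runs a join and an aggregation inside the database engine rather than in client code. SQLite (via Python's standard sqlite3 module) stands in here for a real warehouse database, and the tables and values are invented for the example.

```python
import sqlite3

# An in-memory database stands in for an enterprise data warehouse.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (cust INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO orders VALUES (1, 100.0), (1, 50.0), (2, 200.0);
""")

# The join and aggregation execute inside the database engine; only the
# small prepared result crosses over to the analysis side.
rows = con.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM customers c JOIN orders o ON o.cust = c.id
    GROUP BY c.name ORDER BY c.name
""").fetchall()
print(rows)  # [('Ann', 150.0), ('Bob', 200.0)]
```

Keeping the heavy lifting inside the database is the point of this approach: the data does not have to be extracted before it can be prepared.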
  13. MapReduce • A parallel processing framework • Computational processing can occur on data (even semi-structured and unstructured data) stored in a file system, without loading it into any kind of database Analytic Process and Tools: 1. Deployment 2. Business Understanding 3. Data Exploration 4. Data Preparation 5. Data Modeling 6. Data Evaluation
  14. Step 1: Deployment • Here we need to: – plan the deployment, monitoring, and maintenance, – produce a final report, and review the project. – In this phase, • we deploy the results of the analysis. • This is also known as reviewing the project. Step 2: Business Understanding • Business Understanding – This step consists of business understanding. – Whenever any requirement occurs, we first need to determine the business objective, – assess the situation, – determine the data mining goals, and then – produce the project plan as per the requirement. • Business objectives are defined in this phase. Step 3: Data Exploration • This step consists of data understanding. – For the further process, we need to gather initial data, describe and explore the data, and verify data quality to ensure it contains the data we require.
  15. – Data collected from the various sources is described in terms of its application and the needs of the project in this phase. – This is also known as data exploration. • This is necessary to verify the quality of the data collected. Step 4: Data Preparation • From the data collected in the last step, – we need to select data as per the need, clean it, construct it to get useful information, and – then integrate it all. • Finally, we need to format the data appropriately. • Data is selected, cleaned, and integrated into the format finalized for the analysis in this phase. Step 5: Data Modeling • We need to – select a modeling technique, generate a test design, build a model, and assess the model built. • The data model is built to – analyze relationships between the various selected objects in the data; – test cases are built for assessing the model, and the model is tested and implemented on the data in this phase.
  16. • Where is processing hosted? – Distributed servers / cloud (e.g. Amazon EC2) • Where is data stored? – Distributed storage (e.g. Amazon S3) • What is the programming model? – Distributed processing (e.g. MapReduce) • How is data stored & indexed? – High-performance schema-free databases (e.g. MongoDB) • What operations are performed on the data? – Analytic / semantic processing • Big data tools for HPC and supercomputing – MPI • Big data tools on clouds – MapReduce model – Iterative MapReduce model – DAG model – Graph model – Collective model • Other BDA tools – SAS – R – Hadoop
  17. Analysis vs. Reporting What is Analysis? • The process of exploring data and reports – in order to extract meaningful insights, – which can be used to better understand and improve business performance. Comparing Analysis vs. Reporting: • Reporting is "the process of organizing data into informational summaries in order to monitor how different areas of a business are performing." • Measuring core metrics and presenting them, whether in an email, a slide deck, or an online dashboard, falls under this category. • Analysis is "the process of exploring data and reports in order to extract meaningful insights, which can be used to better understand and improve business performance." • Reporting helps companies monitor their online business and be alerted when data falls outside of expected ranges.
  18. • Good reporting – should raise questions about the business from its end users. • The goal of analysis – is to answer questions by interpreting the data at a deeper level and providing actionable recommendations. • A firm may be focused on the general area of analytics (strategy, implementation, reporting, etc.) – but not necessarily on the specific aspect of analysis. • It's almost as if some organizations run out of gas after the initial set-up-related activities and never make it to the analysis stage.
  19. Analysis vs. Reporting at a glance: • Analysis provides what is needed; reporting provides what is asked for. • Analysis is typically customized; reporting is typically standardized. • Analysis involves a person; reporting does not. • Analysis is extremely flexible; reporting is fairly inflexible. Further contrasts: • Reporting translates raw data into information; analysis transforms data and information into insights. • Reporting shows you what is happening, while analysis focuses on explaining why it is happening and what you can do about it. • Reports are like robots: they monitor and alert you. Analysis is like a parent: it can figure out what is actually going on (hungry, dirty diaper, no pacifier, teething, tired, ear infection, etc.). • Reporting and analysis can go hand in hand: reporting provides limited context about what is happening in the data, and context is critical to good analysis.
  20. History of Hadoop: 1. Hadoop was started by Doug Cutting to support two of his other well-known projects, Lucene and Nutch. 2. Hadoop was inspired by the Google File System (GFS), which was detailed in a paper released by Google in 2003. 3. Hadoop, originally called the Nutch Distributed File System (NDFS), split from Nutch in 2006 to become a sub-project of Lucene. At this point it was renamed Hadoop.
  21. Apache Hadoop: Apache Hadoop is the most important framework for working with Big Data. Hadoop's biggest strength is scalability: it scales from working on a single node to thousands of nodes in a seamless manner, without any issue. The web was generating loads of information on a daily basis, and it was becoming very difficult to manage the data of around one billion pages of content. To address this, Google invented a revolutionary new methodology for processing data, popularly known as MapReduce, and later published a white paper describing it.
  22. Hadoop runs applications on the basis of MapReduce, where the data is processed in parallel, and accomplishes complete statistical analysis on large amounts of data. It is a framework based on Java programming. It is intended to scale from a single server to thousands of machines, each offering local computation and storage. It supports large collections of data sets in a distributed computing environment. The Apache Hadoop software library is a framework that allows huge data sets to be processed in a distributed fashion across clusters of computers using simple programming models.
  23. Analyzing data with Hadoop: To take advantage of the parallel processing that Hadoop provides, we need to express our query as a MapReduce job. MapReduce works by breaking the processing into two phases: the map phase and the reduce phase. Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer. The programmer also specifies two functions: the map function and the reduce function. The input to our map phase is the raw NCDC weather data. To visualize the way the map works, consider the following sample lines of input data (some unused columns have been dropped to fit the page, indicated by ellipses): 0067011990999991950051507004...9999999N9+00001+99999999999... 0043011990999991950051512004...9999999N9+00221+99999999999... 0043011990999991950051518004...9999999N9-00111+99999999999... 0043012650999991949032412004...0500001N9+01111+99999999999... 0043012650999991949032418004...0500001N9+00781+99999999999...
  24. These lines are presented to the map function as key-value pairs: (0, 0067011990999991950051507004...9999999N9+00001+99999999999...) (106, 0043011990999991950051512004...9999999N9+00221+99999999999...) (212, 0043011990999991950051518004...9999999N9-00111+99999999999...) (318, 0043012650999991949032412004...0500001N9+01111+99999999999...) (424, 0043012650999991949032418004...0500001N9+00781+99999999999...) The keys are the line offsets within the file, which we ignore in our map function. The map function merely extracts the year and the air temperature from each record and emits them as its output (the temperature values have been interpreted as integers): (1950, 0) (1950, 22) (1950, −11) (1949, 111) (1949, 78)
  25. The output from the map function is processed by the MapReduce framework before being sent to the reduce function. This processing sorts and groups the key-value pairs by key. So, continuing the example, our reduce function sees the following input: (1949, [111, 78]) (1950, [0, 22, −11]) Each year appears with a list of all its air temperature readings. All the reduce function has to do now is iterate through the list and pick up the maximum reading: (1949, 111) (1950, 22)
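The whole pipeline above can be simulated in a few lines of Python. This is a sketch of the framework's behavior, not Hadoop itself: since the column offsets in the real NCDC records are elided in the sample lines, we start from the (year, temperature) pairs the map function has already emitted, then perform the sort-and-group step and the max-picking reduce.

```python
from collections import defaultdict

# Pairs emitted by the map function in the example above.
map_output = [(1950, 0), (1950, 22), (1950, -11), (1949, 111), (1949, 78)]

def shuffle(pairs):
    """Sort and group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for year, temp in pairs:
        groups[year].append(temp)
    return dict(groups)

def reduce_max(groups):
    """Reduce: iterate through each year's readings and pick the maximum."""
    return {year: max(temps) for year, temps in groups.items()}

grouped = shuffle(map_output)   # {1950: [0, 22, -11], 1949: [111, 78]}
result = reduce_max(grouped)
print(result)  # {1950: 22, 1949: 111}
```

In a real job, the map and reduce functions would run on many machines, with the framework handling the shuffle between them, but the data flow is exactly this.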
  26. Hadoop Streaming: Hadoop Streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer. For example: $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar -input myInputDirs -output myOutputDir -mapper /bin/cat -reducer /bin/wc As the mapper task runs, it collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the mapper. By default, the prefix of a line up to the first tab character is the key, and the rest of the line (excluding the tab character) is the value.
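A streaming mapper and reducer are just programs that read lines and write tab-separated key/value lines. The word-count pair below is a sketch in that style (the function names and sample input are ours, not part of the Hadoop distribution); as scripts they would read sys.stdin line by line, and the framework would sort the mapper's output by key before feeding it to the reducer, which is what the sorted() call simulates here.

```python
def mapper(lines):
    """Streaming-style mapper: emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Streaming-style reducer: sum counts for each run of identical keys."""
    current, count = None, 0
    for line in sorted_lines:
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield f"{current}\t{count}"
            current, count = word, 0
        count += int(n)
    if current is not None:
        yield f"{current}\t{count}"

# Simulate the framework: map, sort by key, then reduce.
mapped = sorted(mapper(["big data big ideas", "data pipelines"]))
print(list(reducer(mapped)))  # ['big\t2', 'data\t2', 'ideas\t1', 'pipelines\t1']
```

Because the reducer only compares adjacent keys, it depends on the sorted input the framework guarantees, which is why streaming reducers can stay this simple.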
  27. Streaming Command Options Streaming supports streaming command options as well as generic command options.