REPORT: BIG DATA
UNIVERSITY OF LEICESTER
Data Analysis for Business Intelligence MSc
BIG DATA

We have heard the term "flood" used for money, people and new technologies, and it has now come to describe data as well. Moore observed in 1965 that the number of transistors on a dense integrated circuit doubles approximately every two years. This has held true, and it has given us technologies that fit in our hands and serve as personal computers. We are surrounded by electronic machines; without realising it, we are monitored at many moments of the day by some electronic device, be it a mobile phone, a CCTV camera, a weighing machine or a computer, and the list never ends. These devices generate data, and its volume is so huge that we have come to call it Big Data.

Big Data is a vast repository of data whose size is beyond the ability of a conventional database to handle. Its size cannot be pinned down, because it grows every second; the social networking site Facebook, for instance, collects more than 500 terabytes of data every day. This data, however, is just a collection of facts and events from day-to-day life. Data neither lies nor tells the truth on its own; we need to understand what it can tell us, and the idea extracted from it is called information. Information tells us what the data means, but it is useless if we cannot act on it, so we must also provide insights into how it can be used to achieve our goals.

Data has swept into every industry and business sector. The McKinsey Global Institute (MGI) estimated that in 2010 enterprises globally stored 7 exabytes of data while consumers stored 6 exabytes. MGI also estimated that if US health care used big data effectively, the potential value from data in that sector could exceed $300 billion every year. European organisations hold approximately 11 exabytes of data, and using it efficiently could generate nearly $149 billion in operational efficiency improvements. In the near term there is also huge potential to leverage big data in developing countries.

Organisations can leverage big data to improve their design and functionality, and big data can create value in several ways. It can create transparency simply by making data available to stakeholders in a timely manner, which in itself creates tremendous value, and making data readily available to all departments can steeply reduce search and processing time. It enables experimentation to discover needs: because most data is stored in digital form, one can discover whether a product needs to be changed. Big data allows populations to be segmented according to their needs so that customised actions can be delivered. It can support decision making with automated algorithms, minimising risk and digging up valuable insight; tax agencies, for instance, can use automated risk engines.
Manufacturers are using data obtained from current products to improve the development of new ones. Big data has also created entirely new categories of companies, such as those that aggregate and analyse industry data and sell useful information and insight to manufacturing or financial companies. The value of big data can be measured by estimating the total value created by taking a particular action with its help.

To capture the full potential of big data, however, several issues have to be addressed. Organisations must consider the legal aspects of handling and analysing data; there is no room for an information breach, which can have serious consequences. Organisations that handle a nation's data must be especially careful, because exposure of such information could cause great harm to the nation, and the same caution applies to health care data, where a wrong prescription might cost a life. Companies also have to hire new personnel who understand big data.

An abundant variety of techniques has been developed that can be applied to big data to extract useful insight from it, and researchers continue to develop new ones. The set of techniques used to extract patterns from large databases by combining methods from statistics, machine learning and database management is called data mining. Association rule learning discovers interesting relationships among different variables in a large database and can, for example, determine which products are frequently bought together. Classification is used to study customers' buying behaviour or to predict the most consumed product; it categorises existing data and uses those categories to predict labels for new data, and it is a form of supervised learning. Cluster analysis is a statistical method that groups similar objects when the characteristics defining their similarity are not known in advance. Crowdsourcing is a technique for collecting large amounts of data from the crowd through open calls. Analysing data from a single source may not be of great use, so it is often more effective to combine multiple sources, which is called data fusion and data integration. Natural language processing can be used to analyse data from social media websites such as Twitter and Facebook. The idea of natural evolution, "survival of the fittest", can be used to optimise the parameters of business or manufacturing models through genetic algorithms. Finally, spatial analysis examines geographic properties and can support decisions such as the selection of manufacturing sites.

Many technologies have been developed to support these data mining techniques. Google developed BigTable to store data in compressed form on the Google File System. Business intelligence (BI) refers to application programs used to report, analyse and present data; a BI tool reads data that has previously been stored in a data warehouse and then operates on it. Cloud computing is the computing paradigm that provides highly scalable computing resources over the network. An open source software framework for processing huge datasets on distributed systems, managed by the Apache Software Foundation, is named Hadoop.
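As an illustration of the association rule learning technique mentioned above, the following is a minimal Python sketch that counts how often pairs of products appear together in a set of transactions and reports their support and confidence. The baskets and product names are invented for the example; a real analysis would use a dedicated library and a far larger dataset.

```python
from collections import Counter
from itertools import combinations

# Toy transactions: each set is one customer's basket (invented example data).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "tea"},
    {"bread", "butter", "tea"},
    {"bread", "milk"},
]

n = len(transactions)
item_counts = Counter()
pair_counts = Counter()

for basket in transactions:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))

# Support: fraction of baskets containing both items.
# Confidence of "a implies b": baskets with both items / baskets with a.
for (a, b), count in pair_counts.most_common():
    support = count / n
    confidence = count / item_counts[a]
    print(f"{a} -> {b}: support {support:.2f}, confidence {confidence:.2f}")
```

A rule such as bread -> butter with high support and confidence is exactly the kind of "frequently bought together" pattern the technique is meant to surface.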
Data can be structured, that is, residing in fixed fields as in a spreadsheet; in contrast, unstructured data includes free-form text and untagged audio, image and video data. All analysis, and the information drawn from it, will be in vain if we cannot present it to people in an easily consumable form, so proper visualisation is a key challenge that must be met if the results of an analysis are to lead to action.

Before the internet revolution, the methods used to mine data were restricted to small data sets with little variability in data types. In the information age, accumulating data has become easier and cheaper, and it is estimated that the amount of stored information doubles every twenty months, so making effective use of this data is a new challenge we must meet. The automatic, exploratory analysis and modelling of large data repositories is known as Knowledge Discovery in Databases (KDD), an approach to identifying understandable patterns in large data sets. The KDD process starts with determining the goal of a particular project and ends with the implementation of the discovered knowledge. It is a nine-step process that begins with a managerial step.

First, an understanding of the application domain is developed; this sets the scene for deciding what should be done with the available decision variables. The people involved in this step need to understand the requirements of the end user and the environment in which the knowledge discovery will take place. Next, the data sets are pre-processed. We determine which data are available and which will be used for the particular application; for the process to succeed, all relevant data should be considered, because if important attributes are missed the whole process may fail. We then clean the data, that is, handle missing values and remove noise and outliers, which enhances data reliability. For example, if a less important attribute is suspected to be unreliable or has many missing values, ignoring it is a smart choice; but if the attribute is dominant for the application, we can make it the target of a supervised data mining algorithm and predict its missing values. Some attributes may not be useful or may not affect the goal of the application, yet this may not be obvious to a person, so cleaning is followed by a data transformation step. This includes data reduction methods such as record sampling and feature selection, and attribute transformations such as the discretisation of numerical attributes and functional transformations. This step is crucial to the success of the entire KDD process but is usually very project-specific.

Having completed these four steps on the data, we turn to the algorithmic aspects of the project, which concern the data mining itself. First we choose the appropriate data mining task, for example regression, classification or clustering. There are two major goals in data mining: prediction and description. Prediction is considered supervised data mining, while descriptive data mining includes unsupervised and visualisation aspects.
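To make the cleaning and transformation steps described above more concrete, here is a minimal sketch using pandas. The column names, the outlier threshold and the age bands are invented for illustration; in a real project they would be chosen from the application domain and the goal defined in the first KDD step.

```python
import pandas as pd

# Invented example records with a missing value and an implausible outlier.
df = pd.DataFrame({
    "age":    [34, 45, None, 29, 41, 230],          # 230 is an obvious outlier
    "income": [32000, 54000, 41000, None, 47000, 39000],
})

# Cleaning: fill missing values with the column median and drop implausible ages.
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].median())
df = df[df["age"].between(0, 120)].copy()

# Transformation: discretise the numerical 'age' attribute into bands.
df["age_band"] = pd.cut(df["age"], bins=[0, 30, 50, 120],
                        labels=["young", "middle", "senior"])

print(df)
```

Even this small example shows why the step is project-specific: whether to fill, drop or predict missing values, and where to place the discretisation boundaries, depends entirely on the goal of the particular application.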
Most data mining techniques are based on inductive learning, where a model is constructed by generalising from a sufficient number of training samples. Next we choose the data mining algorithm, the specific method used to search for patterns. For example, a neural network tends to give better precision, while a decision tree is the better choice when an understanding of the attributes is needed. Each algorithm has parameters and learning tactics, such as cross-validation or the division of the data into training and test sets. The chosen algorithm is then implemented; it may have to be run several times before the results are satisfactory, for instance by tuning its control parameters. The final data mining step is the evaluation and interpretation of the mined patterns with respect to the goal defined in the first step. Here we also assess the pre-processing steps with respect to their effect on the results of the data mining algorithm, and the discovered knowledge is documented for further use. Lastly, we need to make proper use of the discovered knowledge; the success of this step determines the effectiveness of the entire KDD process. Once the results are used in real life, many challenges arise, such as the loss of the laboratory conditions under which we operated: the knowledge was discovered from certain snapshots (samples) of the data, but the live data is dynamic.

Data mining methods fall into two subgroups: verification and discovery. Discovery methods automatically identify hidden patterns in the data and branch into prediction and description. Description methods are oriented towards data interpretation and focus on understanding how the data relate to one another (for example through visualisation). Prediction-oriented methods aim to build, automatically, a behavioural model capable of predicting the values of one or more variables for new, unseen samples; this can also deepen our understanding of the data. Verification methods deal with evaluating hypotheses proposed by an external source such as an expert. They are drawn from traditional statistics and include tests of hypotheses (e.g. the t-test of means), goodness-of-fit tests and analysis of variance (ANOVA). These methods are less central to data mining, because most data mining tasks are concerned with discovering a hypothesis (out of a very large set of hypotheses) rather than testing one that is already known.

Among discovery methods, prediction is also called supervised learning, as opposed to unsupervised learning. Unsupervised learning generally maps high-dimensional data to a reduced dimension and groups data without pre-specified, dependent attributes. It covers a portion of the description methods; for instance, it covers clustering methods (such as K-means, K-medoids and Adaptive Resonance Theory (ART) 2) but not visualisation methods. Supervised methods try to discover the relationship between the input attributes and a target attribute, and it is useful to distinguish between two supervised models: classification models and regression models. Regression maps the input space into a real-valued domain; a regressor can, for example, predict the demand for a certain product given its characteristics. A classifier, on the other hand, maps the input space into predefined classes.
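The distinction between supervised and unsupervised learning drawn above can be illustrated with a short scikit-learn sketch: a decision tree classifier evaluated with cross-validation (supervised), and K-means clustering applied to the same samples without their labels (unsupervised). The use of the bundled iris dataset and the particular parameter values are assumptions made purely for the example.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Supervised: the decision tree learns a mapping from input attributes to known
# class labels, and 5-fold cross-validation estimates how well it generalises.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
scores = cross_val_score(tree, X, y, cv=5)
print("cross-validated accuracy:", scores.mean())

# Unsupervised: K-means groups the same samples into three clusters using only
# the input attributes, with no target attribute at all.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
clusters = kmeans.fit_predict(X)
print("cluster sizes:", [int((clusters == c).sum()) for c in range(3)])
```

Swapping DecisionTreeClassifier for a regression model such as DecisionTreeRegressor, with a real-valued target, would give the regression counterpart described above.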
So we can say that data mining is a new science consisting of techniques and methods drawn from statistics, artificial intelligence, machine learning and database systems.

Reforming the US health care system, in order to reduce the rate at which its costs have been rising and to sustain the strength of the currency, is critical to the United States both as a society and as an economy. The challenges can be addressed by emulating and implementing best practices in health care, which may require analysing large datasets. MGI has identified several levers through which the US health care sector could generate value and bring down its spending. One of them is developing personalised medicine, which would produce value in the R&D arena. The goal of this application is to examine the relationships among genetic variation, predisposition to specific diseases and specific drug responses, and then to account for the genetic variability of individuals in the drug development process. Personalised medicine holds the promise of improving health care in three main ways: earlier detection and diagnosis; more effective therapies, because patients with the same disease often do not respond in the same way to the same therapy; and the adjustment of drug dosages according to a patient's molecular profile to minimise side effects and maximise response. To understand the structure of any disease thoroughly, one needs to consider all the available clinical data, which is massive in volume; this can be done by proper modelling and the efficient application of data mining methods. This lever has already proved successful in the early detection of breast cancer.

Governments in many parts of the world are under increasing pressure to raise their productivity, and big data offers them a powerful arsenal of strategies and techniques for boosting productivity and achieving higher levels of effectiveness. The public sector is challenging because it is very diverse in its functions and budgets. MGI focused on administration in two types of government agency, tax and labour, which collect data on a large scale from many sectors but can face significant performance challenges. Europe's public sector, for instance, accounts for almost half of its GDP, and this high share of economic output puts considerable long-term strain on European budgets. It has been estimated that by 2025 over 30 percent of the population in mature economies across the globe will be aged 60 or over, so social security, health care and pensions will face increasing demand.

As big data and its levers become increasingly valuable assets, their use will become a key basis of competition across sectors, so it is important for organisational leaders to incorporate big data into their business plans. Alongside sufficient skills in back-office analytics, they also need to manage a transition towards the right managerial talent on the front line. Leaders should understand the data assets they hold or could gain access to: organisations should keep an inventory of their own data and systematically catalogue other data they could access, such as government data and internet data.
There may also be third parties who have not yet considered sharing their data, so organisations need to think carefully and present a compelling value proposition to such parties in order to gain access to it. Leaders should consider adopting a process of purposeful experimentation, which can be a powerful path to leveraging big data, rather than specifying a complete plan before any implementation is done; one can start with a few high-potential areas in which to experiment with big data and then scale up to larger domains.

A sophisticated leader will first apply techniques such as "scrubbing" to the data, which generates, structures and organises it and so improves its quality. Next, the data should be made easily accessible to all departments of the organisation through networks. Then basic, simple analytics can be applied, that is, techniques which do not require customised analyses designed by people with deep analytical skills. The fourth and highest level is applying advanced and complex analytics, such as automated algorithms and real-time data analysis, which can create new business models. Leaders should build a team with deep analytics capability that will supply new information to the company and new insight for further business growth, and they will themselves need a baseline understanding of these analytic techniques in order to become effective users of such analyses. The lack of a customer-centric view can limit an organisation's ability to use big data levers to create new value, so organisations may need to invest in IT hardware, software and services to capture, store, organise and analyse large datasets.

Data privacy and security become paramount as data travels across boundaries for various purposes. Privacy not only requires compliance with laws and regulations; it is also fundamental to an organisation's trust relationship with its customers and partners. Organisational leaders will also have to wrestle with legal issues relating to their stance on intellectual property for data. A significant constraint on realising the value of big data will be a shortage of talent, particularly people with expertise in statistics and machine learning. MGI has estimated that by 2018 the demand for people with deep analytical talent in the US could exceed the projected supply by 50 to 60 percent. USA Today has described it as the sexiest job of the 21st century, and current trends indicate that 4,000 new positions are being created annually, perhaps significantly more. This has brought a new wave to the market, as most sectors want to gain more from big data. Hal Varian, the chief economist at Google, is known to have said, "The sexy job in the next 10 years will be statisticians. People think I'm joking, but who would've guessed that computer engineers would've been the sexy job of the 1990s?"
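As an illustration of the "scrubbing" step mentioned above, here is a minimal pandas sketch that normalises text fields, fills a missing value and removes the duplicate records that the normalisation exposes. The customer table and its columns are invented for the example; real scrubbing rules would depend on an organisation's own data.

```python
import pandas as pd

# Invented customer records with inconsistent casing, a missing value and duplicates.
raw = pd.DataFrame({
    "customer": ["Alice Smith", "alice smith", "Bob Jones", "Bob Jones"],
    "city":     ["Leicester", "LEICESTER", None, "London"],
    "spend":    [120.0, 120.0, 85.5, 85.5],
})

clean = raw.copy()
# Structure and organise: normalise text fields so equivalent values match.
clean["customer"] = clean["customer"].str.strip().str.title()
clean["city"] = clean["city"].str.strip().str.title()
# Handle missing values, then drop the exact duplicates revealed by normalisation.
clean["city"] = clean["city"].fillna("Unknown")
clean = clean.drop_duplicates()

print(clean)
```

Even simple rules of this kind raise data quality before any of the heavier analytics described above are applied.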
