Vinod Gupta School of Management, IIT Kharagpur




              Google Refine Analysis
                   A Business Perspective



                                April 08, 2012

                        Sathishwaran.R - 10BM60079
                         Vijaya Prabhu - 10BM60097



This tutorial was created using Google Refine version 2.5 on a Windows 7 platform.
Data Cleansing
• Data cleansing is the process of identifying wrong or
  inaccurate records in a data set and making appropriate
  corrections to them.
• It involves identifying incomplete, inaccurate, and
  incorrect parts of the data and then either replacing them
  with correct data or deleting them.
• Data cleansing results in data that is consistent with
  other standard data sets and is useful for performing
  various analyses.
• Errors in the data can come from data entry mistakes
  by the user, failures during transmission of the data, or
  improper data definitions.

                                                        2
Need for Data Cleansing
• Incorrect or inaccurate data may lead to false
  conclusions and, in finance, can cause investments
  to be misdirected.
• Governments need accurate population and census
  data to direct funds to the deserving areas.
• Many organizations tap into customer
  information. If that data is not accurate, e.g. if
  an address is wrong, the business runs the risk of
  sending wrong information and thus
  losing customers.

                                                     3
Challenges in Data Cleansing
• Loss of information: In many cases a record may be
  incomplete, so the whole record may have to be
  deleted, which leads to loss of information. This can
  become costly if a large number of records is deleted.
• Maintenance of data: Once the data is cleansed,
  any change in the data specification should affect
  only the new values. Hence data management
  solutions should be designed so that the
  processes of data entry and retrieval are altered to
  provide correct data.
• Data cleansing is an iterative process that needs
  significant work in exploring and correcting entries.

                                                          4
About Google Refine
• Google Refine is a powerful tool that can be used
  effectively for data cleansing.
• It helps in working with raw data: cleaning it
  up, transforming it from one format to
  another, extending it with web services, and linking it
  to databases.
• It is very easy to use and has a web interface.
• It is freely available and works well with any browser.
• Google Refine is a desktop application that runs a
  small web server on your system; we point
  our browser to that server to use Refine.
                                                           5
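The last point above can be checked from a script. The following is a minimal sketch, assuming the server listens on Google Refine's usual default address 127.0.0.1:3333; the function names are hypothetical helpers, not part of Google Refine.

```python
import socket

# Assumption: Google Refine's default local address is 127.0.0.1:3333;
# adjust host and port if your installation is configured differently.
DEFAULT_HOST = "127.0.0.1"
DEFAULT_PORT = 3333


def refine_url(host: str = DEFAULT_HOST, port: int = DEFAULT_PORT) -> str:
    """Build the URL the browser should be pointed at."""
    return f"http://{host}:{port}/"


def is_server_listening(host: str = DEFAULT_HOST, port: int = DEFAULT_PORT,
                        timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    status = "running" if is_server_listening() else "not reachable"
    print(f"{refine_url()} is {status}")
```

If the check reports "not reachable", the command window from the installation step is probably closed or still starting up.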
Getting Started - Installation
1. Download the zip file (choose the appropriate
   Windows, Mac, or Linux version) from
   http://code.google.com/p/google-refine/wiki/Downloads?tm=2
2. Uncompress the zip file.
3. Run the “google-refine.exe” file.
4. A command window opens and Google Refine
   starts, taking the user to the home page in the
   default browser.
                                              6
Google Refine Homepage




                         7
Importing Data
• Google Refine supports TSV, CSV, Excel (.xls
  and .xlsx), JSON, XML, and Google Data
  document formats.
• Once imported, the data is stored in Google Refine’s
  own data format.
• For this tutorial we have used TSV data on disasters
  worldwide from 1900-2008, available from
  http://www.infochimps.com/datasets/disasters-worldwide-from-1900-2008

                                             8
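To get a feel for what a TSV import involves, here is a small sketch that parses tab-separated text with Python's standard csv module. The sample rows and column names (Country, Type, Cost) are assumptions made for illustration; they are not the real header of the disasters dataset.

```python
import csv
import io

# Hypothetical sample rows standing in for the downloaded TSV file;
# the column names are assumptions, not the real dataset's header.
SAMPLE_TSV = (
    "Country\tType\tCost\n"
    "India\tFlood\t1200\n"
    "Japan\tEarthquake\t\n"
    "Chile\tflood\t300\n"
)


def read_tsv(text: str) -> list[dict[str, str]]:
    """Parse tab-separated text into a list of row dictionaries."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)


rows = read_tsv(SAMPLE_TSV)
print(len(rows), "rows; columns:", list(rows[0].keys()))
```

Note that every cell arrives as text, including the blank Cost cell, which is why numeric columns need conversion before they can be analysed.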
Creating Project
Data uploaded




                              11
Creating Project
Project created




                             12
Faceting
• Faceting is about seeing the big picture and
  filtering down to the rows you want to
  change in bulk.
• We can create a facet on a column to get
  details about that column, and then
  filter to a subset of rows with a constraint.
• We can create text facets, numeric
  facets, timeline facets, and scatterplot facets.
  Customized facets can also be designed.

                                                  13
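Conceptually, a text facet is a count of each distinct value in a column, and selecting a facet choice filters the rows to that value. The sketch below illustrates the idea in plain Python; the rows and the helper names text_facet and filter_rows are hypothetical and are not how Google Refine implements facets internally.

```python
from collections import Counter

# Hypothetical values of a "Type" column; not taken from the real dataset.
rows = [
    {"Type": "Flood", "Country": "India"},
    {"Type": "Earthquake", "Country": "Japan"},
    {"Type": "Flood", "Country": "Chile"},
    {"Type": "Storm", "Country": "India"},
]


def text_facet(rows, column):
    """Count how often each distinct value occurs in a column."""
    return Counter(row[column] for row in rows)


def filter_rows(rows, column, choice):
    """Keep only the rows whose column equals the selected facet choice."""
    return [row for row in rows if row[column] == choice]


facet = text_facet(rows, "Type")
floods = filter_rows(rows, "Type", "Flood")
print(facet.most_common())
print(len(floods), "rows selected")
```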
Faceting




The column Type has 18 unique options



                         15
Removing Redundancy




Even though they are of the same type, they show as different options due to case


                                          16
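The effect of letter case on the number of facet choices can be sketched as follows. The values are hypothetical, and title case is only one possible normalization; the transcript does not say which transform the missing screenshots used. Google Refine offers comparable case transforms for cells, for example the GREL function toTitlecase().

```python
from collections import Counter

# Hypothetical facet choices that differ only by letter case.
types = ["Flood", "flood", "FLOOD", "Earthquake", "earthquake", "Storm"]


def normalize_case(value: str) -> str:
    """Trim surrounding whitespace and convert the value to title case."""
    return value.strip().title()


before = Counter(types)
after = Counter(normalize_case(t) for t in types)
print(len(before), "choices before,", len(after), "after")
```

Values that differ only by case collapse into one choice, which is the kind of reduction the tutorial reports for the Type column.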
Removing Redundancy




Reduced to 15
unique options




                                       20
Numeric Faceting




                   21
Numeric Faceting




Highly clustered towards low values



                                      22
Numeric Faceting




Cost column is blank and has no value


                                      25
Numeric Faceting




Calamities with low cost



                                     26
Numeric Faceting




Calamities with high cost



                                27
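The numeric facet slides above show three things: a histogram clustered towards low values, a count of blank cells, and range selections for low-cost and high-cost calamities. The sketch below reproduces those three ideas on hypothetical Cost values; the bin width and the 1000 threshold are arbitrary assumptions chosen for illustration.

```python
# Hypothetical "Cost" cells as they might appear after import: text, with blanks.
costs = ["120", "", "35", "9800", "", "60", "15", "40000"]


def to_number(cell: str):
    """Convert a cell to float, or return None for blank/non-numeric cells."""
    try:
        return float(cell)
    except ValueError:
        return None


def numeric_facet(cells, bin_width):
    """Return (histogram, blank_count); histogram maps bin start -> count."""
    histogram = {}
    blanks = 0
    for cell in cells:
        number = to_number(cell.strip())
        if number is None:
            blanks += 1
            continue
        bin_start = int(number // bin_width) * bin_width
        histogram[bin_start] = histogram.get(bin_start, 0) + 1
    return dict(sorted(histogram.items())), blanks


def in_range(cells, low, high):
    """Select numeric cells with low <= value < high, like a facet range."""
    numbers = (to_number(c.strip()) for c in cells)
    return [n for n in numbers if n is not None and low <= n < high]


histogram, blanks = numeric_facet(costs, bin_width=1000)
print(histogram, "blank:", blanks)
print("low cost:", in_range(costs, 0, 1000))
print("high cost:", in_range(costs, 1000, float("inf")))
```

Most values fall into the lowest bin while a few large ones sit far to the right, which is the "highly clustered towards low values" shape noted on the slide.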
Clustering
• Clustering is used to merge choices that look similar.




                                                              28
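One way to find choices that "look similar" is key collision with a fingerprint key, which is among the clustering methods Google Refine offers. The slides do not say which method was used, so the following is only a simplified sketch of the fingerprint idea, with hypothetical sample values.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Key a string so that near-identical spellings collide."""
    # Strip accents by decomposing and dropping non-ASCII marks.
    ascii_text = (
        unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    )
    lowered = ascii_text.strip().lower()
    # Remove punctuation, then sort and de-duplicate the tokens.
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(no_punct.split()))
    return " ".join(tokens)


def cluster(values):
    """Group distinct values sharing a fingerprint; keep groups of 2 or more."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Hypothetical column values, not taken from the tutorial's dataset.
names = ["Mass movement wet", "mass movement, wet", "Wet mass movement", "Storm"]
print(cluster(names))
```

Each returned group is a candidate for merging into a single value; a person should still review the groups, since a shared key does not guarantee the values mean the same thing.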
Clustering




Data Merged




                           30
Using Expressions
• Expressions are used to transform existing data to create new data.




                                                                         31
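In Google Refine an expression is evaluated once per cell, with the current cell available as value. The sketch below mimics that per-cell model in Python; the rows, the column names, and the add_column helper are hypothetical and only illustrate deriving new data from existing data.

```python
# Hypothetical rows; column names are assumptions for illustration.
rows = [
    {"Country": " india ", "Start": "1998"},
    {"Country": "JAPAN", "Start": "2004"},
]


def add_column(rows, source, target, expression):
    """Create `target` by applying `expression` to each cell of `source`."""
    return [{**row, target: expression(row[source])} for row in rows]


# Clean an existing column, then derive a new one from another column.
cleaned = add_column(rows, "Country", "Country clean", lambda v: v.strip().title())
with_decade = add_column(cleaned, "Start", "Decade", lambda v: int(v) // 10 * 10)
print(with_decade)
```

The original rows are left untouched and each step returns a new list, loosely mirroring how a transform in Refine can be undone.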
Data Augmentation
• The Reconciliation option in Google Refine allows
  data to be linked to web pages. Suppose we
  want details on the country where a
  calamity has struck; we can perform the
  following steps.




                                                  34
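At its core, reconciliation matches each cell against a reference source and attaches an identifier that can later be used to pull in more details. The sketch below shows only that matching idea with a local lookup table; the identifiers are made up, and a real reconciliation service returns its own IDs and candidate matches over the web.

```python
# Hypothetical reference table mapping country names to identifiers.
# The identifiers are made up for illustration; a real reconciliation
# service would return its own IDs and candidate scores.
REFERENCE = {
    "india": "country/IN",
    "japan": "country/JP",
    "chile": "country/CL",
}


def reconcile(cell: str, reference=REFERENCE):
    """Return the matching identifier for a cell, or None if unmatched."""
    return reference.get(cell.strip().lower())


cells = ["India", " japan", "Atlantis"]
matches = {cell: reconcile(cell) for cell in cells}
unmatched = [cell for cell, ref in matches.items() if ref is None]
print(matches)
print("needs manual review:", unmatched)
```

Cells without a match are kept aside for manual review instead of being guessed at.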
Reconciliation




                 35
Data Enrichment




                  40
Export




         44
How to Use Twitter Data

Step 1




Step 2

                             45
Step 3




         46
Step 4




Step 5

         47
Step 6




         48
Step 7   Step 8




                  49
Output




         50
Friends Events using Facebook data




                                 51
Friends Events using Facebook data
• After splitting the cell using separator },{




                                                 61
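Splitting on the separator },{ turns a run of JSON objects into one piece per object, but the braces at each cut point are consumed by the split. The sketch below shows the effect on a hypothetical cell and one way to restore the pieces; the field names are assumptions, not the actual Facebook response format.

```python
import json

# Hypothetical cell content: a run of JSON objects as found inside an
# API response's list. The field names are assumptions for illustration.
cell = '{"name":"Picnic","location":"Park"},{"name":"Concert","location":"Hall"}'

# Plain split on the separator: the braces at the cut points are consumed.
pieces = cell.split("},{")
print(pieces)

# Restore the missing braces so each piece is a complete JSON object again.
restored = []
for index, piece in enumerate(pieces):
    if index > 0:
        piece = "{" + piece
    if index < len(pieces) - 1:
        piece = piece + "}"
    restored.append(json.loads(piece))

print([event["name"] for event in restored])
```

A plain split breaks if a text field happens to contain },{ itself. When the data is valid JSON, wrapping the cell in square brackets and parsing it as a list is the safer route.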
Friends Events using Facebook data
•   After updating for other columns and rearranging it we get the events as




                                                                               63
LIKED



DISLIKED

           64
Thank You




            65
