BIG DATA & DATA MINING
LECTURE 3, 7.9.2015
INTRODUCTION TO COMPUTATIONAL SOCIAL SCIENCE (CSS01)
LAURI ELORANTA
• LECTURE 1: Introduction to Computational Social Science [DONE]
• Tuesday 01.09. 16:00 – 18:00, U35, Seminar room 114
• LECTURE 2: Basics of Computation and Modeling [DONE]
• Wednesday 02.09. 16:00 – 18:00, U35, Seminar room 113
• LECTURE 3: Big Data and Information Extraction [TODAY]
• Monday 07.09. 16:00 – 18:00, U35, Seminar room 114
• LECTURE 4: Network Analysis
• Monday 14.09. 16:00 – 18:00, U35, Seminar room 114
• LECTURE 5: Complex Systems
• Tuesday 15.09. 16:00 – 18:00, U35, Seminar room 114
• LECTURE 6: Simulation in Social Science
• Wednesday 16.09. 16:00 – 18:00, U35, Seminar room 113
• LECTURE 7: Ethical and Legal issues in CSS
• Monday 21.09. 16:00 – 18:00, U35, Seminar room 114
• LECTURE 8: Summary
• Tuesday 22.09. 17:00 – 19:00, U35, Seminar room 114
LECTURES SCHEDULE
• PART 1: BIG DATA DEFINED
• PART 2: DATA MINING PROCESS
• PART 3: WHERE TO GET DATA
• PART 4: DATA VISUALIZATION
LECTURE 3 OVERVIEW
BIG DATA DEFINED
• The term big data is used quite loosely, with various definitions depending
on the context
• Big data is often misunderstood as referring only to large volumes of data
• One of the most used definitions in the field of IT is by Gartner:
“Big data is high-volume, high-velocity and high-variety
information assets that demand cost-effective, innovative forms of
information processing for enhanced insight and decision
making.” (Gartner 2014.)
• Gartner analyst Doug Laney introduced the 3Vs concept in a 2001 META Group research publication, 3D Data Management: Controlling Data Volume, Velocity, and Variety.
BIG DATA DEFINED
(Gartner 2014.)
• Known as the three “V”s of big data
1. Volume refers to the large quantities of data
2. Velocity refers to the usually high speed at which data is generated
3. Variety refers to the different kinds and types of data
• Other Vs have been suggested as well: Variability, Veracity
VOLUME, VELOCITY & VARIETY
(Gartner 2014.)
• “Big Data represents the Information assets characterized by such a High Volume, Velocity and Variety to require specific Technology and Analytical Methods for its transformation into Value”.
• (De Mauro, A., Greco, M., Grimaldi, M. 2014. What is big data? A
consensual definition and a review of key research topics. 4th
International Conference on Integrated Information, Madrid)
DE MAURO, GRECO & GRIMALDI 2014, DEFINITION
• Strong instrumental component in relation to how you get “value” out of
big data
• Answering research questions
• Answering business problems
• Rather than one particular technology, big data refers to a large set of different technologies used in various ways
BIG DATA IS ABOUT USING BIG DATA
(Sicular 2013.)
• “Every day, we create 2.5 quintillion bytes of data — so
much that 90% of the data in the world today has been
created in the last two years alone. This data comes from
everywhere: sensors used to gather climate information,
posts to social media sites, digital pictures and videos,
purchase transaction records, and cell phone GPS signals to
name a few. This data is big data.” (IBM 2014a.)
• Underlines the volume component of big data.
IBM’S DEFINITION
IBM’S FOUR VS
(IBM 2014b.)
• E.g. 7 views from Elliot 2013:
• Big Data as
1. Volume, Velocity & Variety (dictionary definition)
2. Set of technologies and tools
3. Set of different categories and types of data
4. Means of predicting the future (big data as signals)
5. New possibilities, that previously were impossible (value)
6. Metaphor for a global neural network (combining all data)
7. A capitalist/neoliberal concept (critical view)
MANY VIEWPOINTS TO BIG DATA
(Elliot 2013)
• Lately in the social sciences, big data has been defined either in quite vague terms or by underlining only the volume component of big data
• “Big Data, that is, data that are too big for standard database software to process, or the more future-proof, ‘capacity to search, aggregate, and cross-reference large data sets’.” (Eynon 2013.)
• “Today, our more-than-ever digital lives leave significant footprints in cyberspace. Large scale collections of these socially generated footprints, often known as big data --” (Yasseri and Bright 2013.)
• "These emitted shadows of ‘big data’ can take a variety of forms, but most
are manifestations or byproducts of human/machine interactions in
code/spaces and coded spaces. We now see hundreds of millions of
connected people, billions of sensors, and trillions of communications,
information transfers, and transactions producing unfathomably large data
shadows --" (Graham 2013.)
TYPICALLY NOT A COMMON DEFINITION IN SOCIAL SCIENCE RESEARCH
DATA MINING PROCESS
• The data mining process aims at answering research questions based on large sets of data (in other words, big data)
• New insights and information are “mined” from the data with automated computation
• Used for a variety of research purposes with many different kinds of data
• Long traditions: quantitative content analysis and register-based research, for example, could be seen as forms of data mining
• NOTE! To be specific, in computer science the term data mining refers only to the pre-processing and analysis part of the whole process
DATA MINING PROCESS IN CSS
1. Formulating research questions
2. Selecting source raw data
3. Gathering source raw data
4. Preprocessing
5. Analysis
6. Communication
(Cioffi-Revilla 2014.)
• Everything starts with a research question
• Three main types of research questions in relation to data
• 1. Inductive = Data-driven. The data tells something new.
• 2. Deductive = Theory-driven. The data tells something about a theory.
E.g. data can be used to test hypotheses.
• 3. Abductive = Mixed model, in between inductive and deductive research
RESEARCH QUESTIONS IN DATA MINING
(Cioffi-Revilla 2014.)
• Main guiding factor: the research question
• Not just text: many different forms of data
• Text / Numeric data
• Images
• Video
• Audio
• Sensor-data
• Register data
• Where to get the data?
• Data and its selection come with many problems: ethics, legal issues, privacy, public vs. private. (These matters will have a lecture of their own.)
SELECTING AND GATHERING RAW DATA
(Cioffi-Revilla 2014.)
• Data needs to be pre-processed before it can be analyzed: this typically takes up a large part of the data mining process
• Cioffi-Revilla (2014) mentions these steps (mainly from a textual content analysis perspective):
• Scanning = generating machine-readable files
• Cleaning = making the data set more concise (removing unnecessary noise)
• Filtering = there may be a need to filter the data based on some rules or categories even before the analysis
• Reformatting = changing the structure of the data, for example dividing the data into smaller parts
• Content proxy extraction = extracting the proxies in the text that denote latent entities
PREPROCESSING DATA
(Cioffi-Revilla 2014.)
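The cleaning, filtering and reformatting steps above could be sketched roughly as follows for text data, using only the Python standard library. The sample documents, the keyword rule and the function names are illustrative assumptions, not something taken from Cioffi-Revilla (2014).

```python
import re

def clean(text):
    """Cleaning: strip HTML-like tags and collapse whitespace (noise removal)."""
    no_tags = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", no_tags).strip()

def keep(text, keyword):
    """Filtering: keep only documents that mention the keyword."""
    return keyword.lower() in text.lower()

def split_sentences(text):
    """Reformatting: divide a document into smaller parts (sentences)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

# Hypothetical raw documents, e.g. as they might come out of a scraper.
raw_docs = [
    "<p>The election   was held in April. Turnout was high.</p>",
    "<div>Weather today: sunny.</div>",
]

cleaned = [clean(d) for d in raw_docs]
filtered = [d for d in cleaned if keep(d, "election")]
sentences = [s for d in filtered for s in split_sentences(d)]
print(sentences)
```

Real pre-processing pipelines are usually much longer than this, but the order (clean, then filter, then reformat) is the same idea in miniature.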
• This is the main automated information extraction part: data is “mined” to
reveal new information
• Many different classes of analysis methods, typically combining techniques from statistics, machine learning, artificial intelligence and database systems.
• Main types of analysis (according to Fayyad et al. 1996): Classification, Clustering, Regression Analysis, Summarization, Dependency Modeling, Anomaly Detection
• There are many others, which can be seen as combinations and mixes of the main types given above
DATA ANALYSIS
(Fayyad et al. 1996)
• Classification maps (classifies) a data item into one or several predefined classes
• Classification algorithms are learning algorithms in the sense that they need a data set that defines how to categorize the data: thus, one needs to teach the classification algorithm what classes to look for
• For example
• Classification of images into different categories
• Classification of news items into different categories
• Classification of email into spam and normal mail
CLASSIFICATION
(Fayyad et al. 1996)
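As a rough illustration of "teaching" a classifier with labeled examples, here is a minimal naive Bayes sketch for the spam vs. normal mail case. Naive Bayes is just one possible classification algorithm, chosen here for brevity; the tiny training set and the function names are made up for the example.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Count words per class from (text, label) training pairs."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, class_counts

def classify(text, word_counts, class_counts):
    """Return the class with the highest log-probability (Laplace smoothing)."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # +1 smoothing so unseen words do not zero out the probability.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled training data (the predefined classes).
training = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting agenda for monday", "normal"),
    ("lecture notes and agenda", "normal"),
]
model = train(training)
print(classify("cheap money", *model))
print(classify("monday lecture", *model))
```

With so little training data the result is only a toy, but it shows the two phases: learning from labeled items, then assigning new items to one of the known classes.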
• Clustering groups a set of data objects in such a way that objects in the
same group (cluster) are more similar to each other than to those in
other groups (clusters).
• Not one specific algorithm, but a general task with many different solutions and algorithms
• Connectivity based clustering (based on distance)
• Centroid based clustering (e.g. K-means clustering)
• Distribution based clustering (objects belonging most likely to the same
distribution)
• Density based clustering
CLUSTERING
(Fayyad et al. 1996)
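Centroid based clustering can be sketched with a plain k-means loop. This is a simplified version under stated assumptions: the points are 2-D, the starting centroids are given by the caller rather than chosen at random, and the data is invented.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points; initial centroids are supplied by the caller."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        centroids = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical data: two visibly separate groups of points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

Note that no labels are given: the groups emerge from the distances alone, which is what separates clustering from classification.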
• Helsingin Sanomat (the largest newspaper in Finland) opened its questionnaire data on the 2015 Finnish parliamentary election to the public
• The data contained the questions and the answers given by candidates running for the Finnish parliament
• The data could be analyzed via clustering and factor analysis to find out which groups (clusters) of thought the candidates actually represent (in comparison to their actual party).
• Try it out: http://users.aalto.fi/~leinona1/vaalit2015/
CLUSTERING EXAMPLE
• Does what it says on the tin! Finding compact descriptions for subsets of data.
• For example, calculating means and standard deviations over different data attributes (dimensions)
• Summarization techniques are often applied to interactive exploratory
data analysis and automated report generation.
SUMMARIZATION
(Fayyad et al. 1996)
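A minimal sketch of summarization: the mean and standard deviation of each attribute, computed with Python's `statistics` module. The rows and attribute names are hypothetical survey data.

```python
import statistics

# Hypothetical survey rows: one dict per respondent.
rows = [
    {"age": 20, "income": 1000},
    {"age": 30, "income": 2000},
    {"age": 40, "income": 3000},
]

def summarize(rows):
    """Return the mean and sample standard deviation of every attribute."""
    summary = {}
    for attribute in rows[0]:
        values = [row[attribute] for row in rows]
        summary[attribute] = {
            "mean": statistics.mean(values),
            "stdev": statistics.stdev(values),
        }
    return summary

summary = summarize(rows)
print(summary)
```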
• Estimating the relationship among variables (with a regression function)
• It includes many techniques for modeling and analyzing
• Focuses on the relationship between a dependent variable and one or
more independent variables.
• The regression function is learned from the data
• Applications in prediction and
REGRESSION ANALYSIS
(Fayyad et al. 1996)
REGRESSION EXAMPLE: LINEAR REGRESSION
(Image is public domain, from Wikipedia 2015, Regression Analysis)
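Simple linear regression with one independent variable can be written out by hand as ordinary least squares. The sketch below assumes invented data in which y is exactly 1 + 2x, so the fitted line is easy to check; the variable names are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: independent variable x, dependent variable y.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]  # exactly y = 1 + 2x, so the fit is perfect

intercept, slope = fit_line(xs, ys)
prediction = intercept + slope * 6  # predict y for an unseen x
print(intercept, slope, prediction)
```

Real data never falls exactly on a line; the same formulas then give the line that minimizes the sum of squared errors.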
• Finds significant dependencies between the data variables
• Two levels
• Structural level defining which variables are dependent (can be in graphical form)
• Quantitative level defining the strength of the dependency in numeric form
• E.g. Correlation analysis
• E.g. Probabilistic dependency networks
DEPENDENCY MODELING
(Fayyad et al. 1996)
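Correlation analysis can illustrate both levels in a small way: the Pearson coefficient gives the strength of a dependency (quantitative level), and listing the strongly correlated pairs gives a crude picture of which variables depend on each other (structural level). The variables, values and the 0.8 cut-off below are assumptions made for the example.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical variables measured on the same five cases.
data = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],  # rises with a
    "c": [5, 4, 3, 2, 1],   # falls as a rises
    "d": [2, 1, 4, 1, 2],   # unrelated to a
}

# Quantitative level: strength of each pairwise dependency.
strengths = {(u, v): pearson(data[u], data[v]) for u, v in combinations(data, 2)}
# Structural level: which pairs count as dependent (threshold is an assumption).
dependent_pairs = [pair for pair, r in strengths.items() if abs(r) > 0.8]
print(dependent_pairs)
```

A strong coefficient only says that two variables move together, not why they do.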
CORRELATION DOES NOT IMPLY CAUSATION
(XKCD: Correlation, http://imgs.xkcd.com/comics/correlation.png)
• Change and deviation detection
• Has the data changed from some previously known stable state or from
some previously measured normative values (“normal range”)
• Time scales matter: a short-term anomaly may actually be normal in the long term.
• Synchronic change (anomalies in stable processes) and diachronic
change (deeper change in generative structures of the process)
• Quite a dynamic category
ANOMALY DETECTION
(Fayyad et al. 1996)
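One simple way to check whether new values deviate from a previously measured "normal range" is a z-score test against baseline measurements. This is only one of many possible techniques; the baseline, the new values and the threshold of three standard deviations are assumptions for the sketch.

```python
import statistics

def find_anomalies(baseline, new_values, threshold=3.0):
    """Flag new values lying more than `threshold` standard deviations from the baseline mean."""
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    return [v for v in new_values if abs(v - mean) / stdev > threshold]

# Hypothetical measurements from a known stable state.
baseline = [100, 102, 98, 101, 99, 100, 103, 97]
# Hypothetical new observations to check against the normal range.
new_values = [101, 99, 130, 100, 72]

anomalies = find_anomalies(baseline, new_values)
print(anomalies)
```

The choice of baseline window is where the time-scale question shows up: a longer baseline may well treat the same value as normal.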
• Cioffi-Revilla (2014) lists, for example, vocabulary analysis, correlation,
lexical analysis, spatial analysis, semantic analysis, sentiment analysis,
similarity analysis, clustering, network analysis, sequence analysis,
intensity analysis, anomaly detection, sonification analysis
• The most important thing is to understand the ins and outs of the analysis model you are using: what it is for and how it behaves under the hood
• The relationship of the model to your research question
AND MANY OTHERS…
• Basically means that a data analysis algorithm is able to “learn” and improve its performance iteratively from the data
• 1. Supervised machine learning
• The algorithm is trained on known labeled data (input/target pairs)
• e.g. Netflix is able to suggest better movies to you based on how you use it: by watching and rating films you are teaching the machine how to suggest better movies to you
• 2. Semi-supervised machine learning
• The algorithm is trained with a small set of labeled data (input/target pairs) and a set of unlabeled data
• 3. Unsupervised machine learning
• No labeled result data is given for the machine to learn from
• The algorithm finds patterns and structures in the data automatically, without any prior training
• 4. Reinforcement machine learning
• The algorithm has a certain goal and interacts with a dynamic environment, which gives it rewards based on its actions
MACHINE LEARNING
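The reinforcement setting (an algorithm acting in an environment and learning from rewards) can be sketched with a very small "multi-armed bandit". To keep the example deterministic, it assumes fixed payouts per action and a greedy agent with optimistic starting estimates; real reinforcement learning problems have noisy rewards and need more careful exploration.

```python
def run_bandit(payouts, steps=10, optimistic_start=1.0):
    """Greedy agent with optimistic initial estimates on a deterministic bandit."""
    estimates = [optimistic_start] * len(payouts)
    counts = [0] * len(payouts)
    for _ in range(steps):
        action = estimates.index(max(estimates))  # act greedily on current estimates
        reward = payouts[action]                  # the environment returns a reward
        counts[action] += 1
        # Learn: move the estimate toward the observed reward (running average).
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts

# Hypothetical environment: three actions with fixed rewards.
estimates, counts = run_bandit(payouts=[0.2, 0.8, 0.5])
print(counts)
```

Because every estimate starts too high, the agent tries each action once and then settles on the one that actually pays best.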
WHERE TO GET DATA
• Ready data sets = Many public data sets provided by different institutions
• Web APIs = Application programming interfaces that give you data in a structured format. For example, Facebook and Twitter have APIs for getting data
• Web scraping = Gathering information automatically from web pages, when it is allowed
• Databases = Querying databases directly with query languages (e.g. SQL)
• Custom data gathering process = the traditional research data gathering (surveys, interviews…)
• Open Data and Open Science are growing trends: governments are providing APIs and data sets for different kinds of public data (e.g. fiscal information, expenses)
DATA SOURCES: MAIN TYPES
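Querying a database directly with SQL might look like the sketch below, which uses Python's built-in `sqlite3` module with an in-memory database. The table, its columns and the figures are invented for illustration; a real research database would be opened by file path or server connection instead.

```python
import sqlite3

# In-memory database standing in for a real data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE expenses (ministry TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO expenses VALUES (?, ?, ?)",
    [
        ("Education", 2014, 120.0),
        ("Education", 2015, 130.0),
        ("Health", 2014, 200.0),
        ("Health", 2015, 210.0),
    ],
)

# The SQL query itself: total amount per ministry.
rows = conn.execute(
    "SELECT ministry, SUM(amount) FROM expenses GROUP BY ministry ORDER BY ministry"
).fetchall()
print(rows)
conn.close()
```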
OLDIE BUT GOLDIE… GOVERNMENTAL REGISTRIES
FINNISH SOCIAL SCIENCE DATA ARCHIVE
CSC.FI: ETSIN & AAVA
STATISTICS FINLAND
HELSINKI REGION INFOSHARE
GAPMINDER DATA
• The Internet is full of open datasets of different kinds!
Some examples:
• Economics
• American Economic Association (AEA): http://www.aeaweb.org/RFE/toc.php?show=complete
• Gapminder: http://www.gapminder.org/data/
• UMD: http://inforumweb.umd.edu/econdata/econdata.html
• World bank: http://data.worldbank.org/indicator
• Finance
• CBOE Futures Exchange: http://cfe.cboe.com/Data/
• Google Finance: https://www.google.com/finance (R)
• Google Trends: http://www.google.com/trends?q=google&ctab=0&geo=all&date=all&sort=0
• St Louis Fed: http://research.stlouisfed.org/fred2/ (R)
• NASDAQ: https://data.nasdaq.com/
• OANDA: http://www.oanda.com/ (R)
• Quandl: http://www.quandl.com/
• Yahoo Finance: http://finance.yahoo.com/ (R)
• Social Sciences
• General Social Survey: http://www3.norc.org/GSS+Website/
• ICPSR: http://www.icpsr.umich.edu/icpsrweb/ICPSR/access/index.jsp
• Pew Research: http://www.pewinternet.org/datasets/pages/2/
• SNAP: http://snap.stanford.edu/data/index.html
• UCLA Social Sciences Archive: http://dataarchives.ss.ucla.edu/Home.DataPortals.htm
• UPJOHN INST: http://www.upjohn.org/erdc/erdc.html
• FROM: http://www.inside-r.org/howto/finding-data-internet
INTERNET IS FULL OF DATA
WEB SCRAPING, APIS & DATABASES
[Diagram: a data provider organisation holds a DATABASE, an API (APPLICATION PROGRAMMING INTERFACE) and a PUBLIC WWW PAGE. Over the Internet, the API is accessed with API calls and the public WWW page with automated web scraping. The database is typically accessed only from inside the organisation and not via the Internet.]
• Web services and applications (such as Twitter, Facebook, …) provide Web APIs so that others can build their own services using functionality or data from the data provider’s Web API / Web service
• Using APIs is the structured and “right” way to get data from a web service
• The use of APIs is controlled by the data provider: they are thus used with the data provider’s permission
• Some APIs charge according to usage, some have other conditions for use
• Connecting to an API requires programming
API (APPLICATION PROGRAMMING INTERFACE)
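The programming needed to use a Web API usually comes down to two things: composing a request URL and parsing the structured (often JSON) response. The sketch below assumes a hypothetical endpoint, hypothetical parameter names and a made-up response layout; it does not describe any real provider's API, and it uses a canned response so that it runs without network access.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://api.example.org/v1/posts"  # hypothetical endpoint

def build_request_url(query, limit, api_key):
    """Compose the URL of an API call from its parameters."""
    return BASE_URL + "?" + urlencode({"q": query, "limit": limit, "key": api_key})

def parse_response(body):
    """Turn the JSON body returned by the API into (author, text) tuples."""
    payload = json.loads(body)
    return [(item["author"], item["text"]) for item in payload["results"]]

url = build_request_url("election", 2, "MY-KEY")

# In real use the body would be fetched, e.g. with urllib.request.urlopen(url).read().
# A canned response (assumed layout) stands in for it here.
sample_body = (
    '{"results": [{"author": "a1", "text": "Vote!"},'
    ' {"author": "b2", "text": "Results soon"}]}'
)
posts = parse_response(sample_body)
print(url)
print(posts)
```

The API key parameter reflects the point above that access is controlled by the data provider; how keys are issued and passed differs from service to service.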
TWITTER REST APIS
FACEBOOK GRAPH API
• Web scraping (web harvesting or web data extraction) is a computer
software technique of extracting information from websites. (Wikipedia
2015, Web Scraping)
• Transforms unstructured data in HTML format into some structured format for further analysis
• Used when you do not have access to the original database or when
there are no APIs
• NOTE! Always make sure that scraping is allowed and legal! This is
not always the case, as some websites and services explicitly forbid web
scraping.
• Numerous tools varying from manual to semi-manual to fully automatic
• High-level scraping services
• Browser plugin tools
• Programming libraries
WEB SCRAPING
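At the programming library end of the scale, scraping means parsing HTML into structured data. The sketch below uses only the Python standard library (`html.parser` and `urllib.robotparser`) and works on an inline HTML string and inline robots.txt rules, both invented for the example. Checking robots.txt is only one part of making sure scraping is allowed; a site's terms of use and the law still need to be checked separately.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

class HeadlineParser(HTMLParser):
    """Collect the text of every <h2> element."""

    def __init__(self):
        super().__init__()
        self.in_headline = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_headline = True
            self.headlines.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_headline = False

    def handle_data(self, data):
        if self.in_headline:
            self.headlines[-1] += data

# Sample robots.txt rules, parsed offline (a real scraper would download them).
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])
allowed = robots.can_fetch("my-research-bot", "https://example.org/news")

# Sample page content standing in for a downloaded web page.
html = "<html><body><h2>Election results</h2><p>...</p><h2>Budget vote</h2></body></html>"
parser = HeadlineParser()
if allowed:
    parser.feed(html)
print(parser.headlines)
```

Libraries such as Scrapy, BeautifulSoup or rvest wrap this kind of parsing in more convenient interfaces.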
SERVICES FOR WEB SCRAPING: IMPORT.IO
https://www.youtube.com/watch?v=ghvsVLkTKLk
SERVICES FOR WEB SCRAPING: KIMONOLABS.COM
SERVICES FOR WEB SCRAPING: WEBHOSE.IO
BROWSER PLUGINS FOR WEB SCRAPING: DATAMINER
• Python
• Scrapy: http://scrapy.org
• BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
• Scrapemark: http://arshaw.com/scrapemark/ (not maintained
anymore)
• R
• rvest: http://cran.r-project.org/web/packages/rvest/index.html
WEB SCRAPING LIBRARIES
• Watch “The Beauty of Data Visualization” by David McCandless: http://www.ted.com/talks/david_mccandless_the_beauty_of_data_visualization?language=en
VISUALIZING DATA
• Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining
to knowledge discovery in databases. AI magazine, 17(3), 37.
• De Mauro, A., Greco, M., Grimaldi, M. 2014. What is big data? A
consensual definition and a review of key research topics. 4th
International Conference on Integrated Information, Madrid
LECTURE 3 READING
• Cioffi-Revilla, C. 2014. Introduction to Computational Social Science. Springer-Verlag, London.
• Elliot, T. 2013. 7 Definitions of Big Data You Should Know About. http://timoelliott.com/blog/2013/07/7-definitions-of-big-data-you-should-know-about.html
• Eynon, R. 2013. The rise of Big Data: what does it mean for education, technology, and media research? Learning, Media and Technology, 38:3, 237-240, DOI: 10.1080/17439884.2013.771783.
• Gartner, 2014. IT Glossary: Big Data. http://www.gartner.com/it-glossary/big-data/
• Graham, M. 2013. The Virtual Dimension. Global City Challenges: Debating a Concept, Improving the Practice, M. Acuto and W. Steele, 2013. London: Palgrave. 117-139.
• De Mauro, A., Greco, M., Grimaldi, M. 2014. What is big data? A consensual definition and a review of key research topics. 4th International Conference on Integrated Information, Madrid.
• Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. 1996. From data mining to knowledge discovery in databases. AI Magazine, 17(3), 37.
• IBM, 2014a. What is big data? http://www-01.ibm.com/software/data/bigdata/what-is-big-data.html
• IBM, 2014b. The Four V’s of Big Data. http://www.ibmbigdatahub.com/sites/default/files/infographic_file/4-Vs-of-big-data.jpg
• Sicular, S. 2013. Gartner's Big Data Definition Consists of Three Parts, Not to Be Confused with Three "V"s. Forbes, 3/27/2013. http://www.forbes.com/sites/gartnergroup/2013/03/27/gartners-big-data-definition-consists-of-three-parts-not-to-be-confused-with-three-vs/
• Yasseri, T.; Bright, J. 2013. Can electoral popularity be predicted using socially generated big data? Oxford Internet Institute, University of Oxford.
REFERENCES
Thank You!
Questions and comments?
twitter: @laurieloranta

办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxAleenaJamil4
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 

Dernier (20)

DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptx
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 

Big Data and Data Mining - Lecture 3 in Introduction to Computational Social Science

  • 1. BIG DATA & DATA MINING LECTURE 3, 7.9.2015 INTRODUCTION TO COMPUTATIONAL SOCIAL SCIENCE (CSS01) LAURI ELORANTA
  • 2. • LECTURE 1: Introduction to Computational Social Science [DONE] • Tuesday 01.09. 16:00 – 18:00, U35, Seminar room 114 • LECTURE 2: Basics of Computation and Modeling [DONE] • Wednesday 02.09. 16:00 – 18:00, U35, Seminar room 113 • LECTURE 3: Big Data and Information Extraction [TODAY] • Monday 07.09. 16:00 – 18:00, U35, Seminar room 114 • LECTURE 4: Network Analysis • Monday 14.09. 16:00 – 18:00, U35, Seminar room 114 • LECTURE 5: Complex Systems • Tuesday 15.09. 16:00 – 18:00, U35, Seminar room 114 • LECTURE 6: Simulation in Social Science • Wednesday 16.09. 16:00 – 18:00, U35, Seminar room 113 • LECTURE 7: Ethical and Legal issues in CSS • Monday 21.09. 16:00 – 18:00, U35, Seminar room 114 • LECTURE 8: Summary • Tuesday 22.09. 17:00 – 19:00, U35, Seminar room 114 LECTURES SCHEDULE
  • 3. • PART 1: BIG DATA DEFINED • PART 2: DATA MINING PROCESS • PART 3: WHERE TO GET DATA • PART 4: DATA VISUALIZATION LECTURE 3 OVERVIEW
  • 5. • The term big data is used quite loosely, with various definitions depending on the context • Big data is often misunderstood as referring only to large volumes of data • One of the most used definitions in the field of IT is by Gartner: “Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” (Gartner 2014.) • Gartner analyst Doug Laney introduced the 3Vs concept in a 2001 META Group research publication, 3D Data Management: Controlling Data Volume, Velocity, and Variety. BIG DATA DEFINED (Gartner 2014.)
  • 6. • Known as the three “V”s of Big Data 1. Volume refers to the large quantities of data 2. Velocity refers to the usually high speed at which data is generated 3. Variety refers to the different kinds and types of data • Other Vs have been suggested as well: Variability, Veracity VOLUME, VELOCITY & VARIETY (Gartner 2014.)
  • 7. • “Big Data represents the Information assets characterized by such a High Volume, Velocity and Variety to require specific Technology and Analytical Methods for its transformation into Value”. • (De Mauro, A., Greco, M., Grimaldi, M. 2014. What is big data? A consensual definition and a review of key research topics. 4th International Conference on Integrated Information, Madrid) DE MAURO, GRECO & GRIMALDI 2014, DEFINITION
  • 8. • Strong instrumental component in relation to how you get “value” out of big data • Answering research questions • Answering business problems • Instead of just one particular technology, big data also refers to a large set of different technologies used in various ways BIG DATA IS ABOUT USING BIG DATA (Sicular 2013.)
  • 9. • “Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is big data.” (IBM 2014a.) • Underlines the volume component of big data. IBM’S DEFINITION
  • 11. • E.g. 7 views from Elliot 2013: • Big Data as 1. Volume, Velocity & Variety (dictionary definition) 2. Set of technologies and tools 3. Set of different categories and types of data 4. Means of predicting the future (big data as signals) 5. New possibilities that previously were impossible (value) 6. Metaphor for a global neural network (combining all data) 7. A capitalist/neoliberal concept (critical view) MANY VIEWPOINTS TO BIG DATA (Elliot 2013)
  • 12. • Lately in the social sciences big data has been defined either in quite vague terms or by underlining only the volume component of big data • “Big Data, that is, data that are too big for standard database software to process, or the more future-proof, ‘capacity to search, aggregate, and cross-reference large data sets’.” (Eynon 2013.) • “Today, our more-than-ever digital lives leave significant footprints in cyberspace. Large scale collections of these socially generated footprints, often known as big data --” (Yasseri and Bright 2013.) • “These emitted shadows of ‘big data’ can take a variety of forms, but most are manifestations or byproducts of human/machine interactions in code/spaces and coded spaces. We now see hundreds of millions of connected people, billions of sensors, and trillions of communications, information transfers, and transactions producing unfathomably large data shadows --” (Graham 2013.) TYPICALLY NOT A COMMON DEFINITION IN SOCIAL SCIENCE RESEARCH
  • 14. • The data mining process aims at answering research questions based on large sets of data (in other words, big data) • New insights and information are “mined” from the data with automated computation • For a variety of research purposes with many different kinds of data • Long traditions: quantitative content analysis and register-based research, for example, could be seen as forms of data mining • NOTE! To be specific, in computer science the term data mining refers only to the pre-processing and analysis part of the whole process DATA MINING PROCESS IN CSS: 1. Formulating research questions 2. Selecting source raw data 3. Gathering source raw data 4. Preprocessing 5. Analysis 6. Communication (Cioffi-Revilla 2014.)
  • 15. • Everything starts with a research question • Three main types of research questions in relation to data • 1. Inductive = Data-driven. The data tells something new. • 2. Deductive = Theory-driven. The data tells something about a theory. E.g. data can be used to test hypotheses. • 3. Abductive = Mixed model, in between inductive and deductive research RESEARCH QUESTIONS IN DATA MINING (Cioffi-Revilla 2014.)
  • 16. • Main guiding factor: the research question • Not just text: many different forms of data • Text / Numeric data • Images • Video • Audio • Sensor data • Register data • Where to get the data? • Data and its selection come with many problems: ethics, legal, privacy, public vs. private. (These matters will have a lecture of their own.) SELECTING AND GATHERING RAW DATA (Cioffi-Revilla 2014.)
  • 17. • Data needs to be pre-processed before it can be analyzed: this typically takes a very big part of the data mining process • Cioffi-Revilla 2014 mentions these (mainly from a textual content analysis perspective): • Scanning = generating machine-readable files • Cleaning = making the data set more concise (removing unnecessary noise) • Filtering = there may be a need to filter the data based on some rules or categories even before the analysis • Reformatting = changing the structure of the data, for example dividing data into smaller parts • Content proxy extraction = extracting the proxies in text that denote latent entities PREPROCESSING DATA (Cioffi-Revilla 2014.)
  • 18. • This is the main automated information extraction part: data is “mined” to reveal new information • Many different analysis method classes, typically combining techniques from statistics, machine learning, artificial intelligence and database systems. • Main types of analysis (according to Fayyad et al. 1996): Classification, Clustering, Regression Analysis, Summarization, Dependency Modeling, Anomaly Detection • There are many others, which can be seen as combining and mixing the main types given above DATA ANALYSIS (Fayyad et al. 1996)
  • 19. • Classification maps (classifies) a data item into one or several predefined classes • Classification algorithms are learning algorithms in the sense that they need a data set that defines how to categorize the data: thus, one needs to teach the classification algorithm what classes to look for • For example • Classification of images into different categories • Classification of news items into different categories • Classification of email into spam and normal mail CLASSIFICATION (Fayyad et al. 1996)
  • 20. • Clustering groups a set of data objects in such a way that objects in the same group (cluster) are more similar to each other than to those in other groups (clusters). • Not one specific algorithm, but a general task with many different solutions and algorithms • Connectivity-based clustering (based on distance) • Centroid-based clustering (e.g. K-means clustering) • Distribution-based clustering (objects belonging most likely to the same distribution) • Density-based clustering CLUSTERING (Fayyad et al. 1996)
  • 21. • Helsingin Sanomat (the biggest news corporation in Finland) opened its 2015 Finnish parliamentary election questionnaire data to the public • The data contained questions and their answers from election candidates for the Finnish parliament • The data could be analyzed via clustering and factor analysis to find out which groups (clusters) of thought the candidates actually represent (in comparison to their actual party). • Try it out: http://users.aalto.fi/~leinona1/vaalit2015/ CLUSTERING EXAMPLE
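The cleaning, filtering and reformatting steps listed above can be illustrated with a small sketch. It uses only the Python standard library; the sample documents, the keyword and the function names (`clean`, `keep`, `split_sentences`) are made up for illustration and do not come from Cioffi-Revilla.

```python
import re

def clean(text):
    """Cleaning: strip HTML-like tags and collapse repeated whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", no_tags).strip()

def keep(text, keyword):
    """Filtering: keep only documents that mention the keyword."""
    return keyword.lower() in text.lower()

def split_sentences(text):
    """Reformatting: divide a document into smaller parts (sentences)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

# Hypothetical raw documents, e.g. as they might come out of a scraper.
raw_docs = [
    "<p>The election   was held in April. Turnout was high!</p>",
    "<div>Weather report: rain expected.</div>",
]

cleaned = [clean(d) for d in raw_docs]
filtered = [d for d in cleaned if keep(d, "election")]
sentences = [s for d in filtered for s in split_sentences(d)]
print(sentences)
```

Real preprocessing pipelines are usually much longer than this, but the shape (clean, then filter, then restructure) tends to stay the same.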
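As a rough sketch of the spam example above, the following code teaches a classifier from a handful of labeled messages and then classifies new ones. It swaps in a small naive Bayes word-count model as one possible classification algorithm; the training messages and function names are invented for illustration.

```python
import math
from collections import Counter

def train(labeled_docs):
    """Count words per class from (text, label) training pairs."""
    word_counts = {}
    class_counts = Counter()
    for text, label in labeled_docs:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, class_counts

def classify(text, word_counts, class_counts):
    """Return the class with the highest naive Bayes log-probability."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(counts.values())
        for word in text.lower().split():
            # Laplace smoothing so unseen words do not zero out the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled training data: this is the "teaching" step.
training = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting agenda for monday", "normal"),
    ("lecture notes for monday", "normal"),
]
model = train(training)
print(classify("cheap money", *model))
print(classify("monday lecture", *model))
```

The predefined classes ("spam", "normal") come entirely from the labels in the training data, which is what makes this a learning algorithm in the sense described above.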
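Centroid-based clustering can be sketched in a few lines. The code below is a minimal k-means on 2D points, assuming the initial centroids are supplied by hand; the points and the function name `kmeans` are illustrative only.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2D points; initial centroids are given explicitly."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical data: two visibly separate groups of points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

Unlike classification, no labels are given here: the groups emerge from the distances between the objects alone.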
  • 22.
  • 23. • Does what it says on the tin! Finding compact descriptions of subsets of data. • For example, calculating means and standard deviations over different data attributes (dimensions) • Summarization techniques are often applied to interactive exploratory data analysis and automated report generation. SUMMARIZATION (Fayyad et al. 1996)
  • 24. • Estimating the relationship among variables (with a regression function) • It includes many techniques for modeling and analyzing • Focuses on the relationship between a dependent variable and one or more independent variables. • The regression function is a learning function based on the data • Applications in prediction and REGRESSION ANALYSIS (Fayyad et al. 1996)
  • 25. REGRESSION EXAMPLE: LINEAR REGRESSION (Image is public domain, from Wikipedia 2015, Regression Analysis)
  • 26. • Finds significant dependencies between the data variables • Two levels • Structural level defining which variables are dependent (can be in graphical form) • Quantitative level defining the strength of the dependency in numeric form • E.g. Correlation analysis • E.g. Probabilistic dependency networks DEPENDENCY MODELING (Fayyad et al. 1996)
  • 27. CORRELATION DOES NOT IMPLY CAUSATION (XKCD: Correlation, http://imgs.xkcd.com/comics/correlation.png)
  • 28. • Change and deviation detection • Has the data changed from some previously known stable state or from some previously measured normative values (“normal range”)? • Time scales matter: a short-term anomaly may actually be normal in the long term. • Synchronic change (anomalies in stable processes) and diachronic change (deeper change in generative structures of the process) • Quite a dynamic category ANOMALY DETECTION (Fayyad et al. 1996)
  • 29. • Cioffi-Revilla (2014) lists, for example, vocabulary analysis, correlation, lexical analysis, spatial analysis, semantic analysis, sentiment analysis, similarity analysis, clustering, network analysis, sequence analysis, intensity analysis, anomaly detection, sonification analysis • The most important thing is to understand the ins and outs of the analysis model you are using: what it is for and how it behaves under the hood • The relationship of the model to your research question AND MANY OTHERS…
  • 30. • Machine learning basically means that a data analysis algorithm is able to “learn” and enhance its performance iteratively from the data • 1. Supervised machine learning • The algorithm is schooled based on some known labeled data (input/target pairs) • e.g. Netflix is able to suggest better movies to you based on how you use it: by watching and rating films you are teaching the machine how to suggest better movies to you • 2. Semi-supervised machine learning • The algorithm is schooled with a small set of labeled data (input/target pairs) and a set of unlabeled data • 3. Unsupervised machine learning • No result-set data is given for the machine to learn from • The algorithm is able to find patterns and structures in the data automatically without any pre-learning • 4. Reinforcement machine learning • The algorithm has a certain goal and it interacts with a dynamic environment, which gives it rewards based on its actions MACHINE LEARNING
  • 32. • Ready Data Sets = Many public data sets provided by different institutions • Web APIs = Application programming interfaces that give you data in a structured format. For example, Facebook and Twitter have APIs for getting data • Web Scraping = Gathering the information automatically from webpages, when it is allowed. • Databases = Querying databases directly with query languages (e.g. SQL) • Custom data gathering process = the traditional research data gathering (surveys, interviews…) • Open Data and Open Science are growing trends: governments are opening up and providing APIs and data sets for different kinds of public data (e.g. fiscal information, expenses) DATA SOURCES: MAIN TYPES
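A compact description of this kind can be produced with the standard library alone. The sketch below computes the mean and sample standard deviation for each attribute of a small, made-up record set; the attribute names and values are assumptions for illustration.

```python
import statistics

# Hypothetical survey records with two numeric attributes.
records = [
    {"age": 30, "income": 2000},
    {"age": 40, "income": 3000},
    {"age": 50, "income": 4000},
]

def summarize(records):
    """Return the mean and sample standard deviation for every attribute."""
    summary = {}
    for attr in records[0]:
        values = [r[attr] for r in records]
        summary[attr] = {
            "mean": statistics.mean(values),
            "stdev": statistics.stdev(values),
        }
    return summary

summary = summarize(records)
for attr, stats in summary.items():
    print(attr, stats)
```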
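The simplest regression function, a straight line fitted by ordinary least squares, can be written out directly. This is a minimal sketch with one independent variable; the data points and the function name `fit_line` are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical observations: independent variable xs, dependent variable ys.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5 + intercept  # predict y for an unseen x = 5
print(slope, intercept, prediction)
```

The fitted slope and intercept are "learned" from the data, and the last line shows the prediction use case: plugging a new value of the independent variable into the fitted function.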
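On the quantitative level, the strength of a linear dependency between two variables is commonly expressed with the Pearson correlation coefficient, which ranges from -1 to 1. A minimal sketch with made-up data:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical variables: one rises with x, the other falls with x.
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]
falling = [10, 8, 6, 4, 2]

print(pearson(x, rising))
print(pearson(x, falling))
```

A coefficient near 1 or -1 only states that the variables move together; it says nothing about which one, if either, causes the other.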
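One simple way to operationalize a "normal range" is to measure how many standard deviations a new value lies from a previously measured baseline. The sketch below assumes a threshold of three standard deviations, which is a common convention rather than anything prescribed by Fayyad et al.; the data and names are invented.

```python
import statistics

def find_anomalies(baseline, new_values, threshold=3.0):
    """Flag values lying more than `threshold` standard deviations from the baseline mean."""
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    return [v for v in new_values if abs(v - mean) > threshold * stdev]

# Hypothetical measurements from a previously known stable state.
baseline = [10, 11, 9, 10, 10, 11, 9, 10]
# New measurements to compare against the normal range.
new_values = [10, 12, 25, 9]

anomalies = find_anomalies(baseline, new_values)
print(anomalies)
```

The choice of baseline window matters for the reason given above: a value that looks anomalous against one week of data may be perfectly ordinary against a year of it.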
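Querying a database directly with SQL can be tried out without any server by using SQLite, which ships with Python. The sketch below builds a small in-memory table of public expenses and aggregates it; the table name, columns and figures are hypothetical.

```python
import sqlite3

# In-memory database standing in for a real data provider's database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE expenses (ministry TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO expenses VALUES (?, ?, ?)",
    [
        ("Education", 2014, 100.0),
        ("Education", 2015, 120.0),
        ("Health", 2015, 200.0),
        ("Health", 2015, 50.0),
    ],
)

# SQL query: total expenses per ministry for one year.
rows = conn.execute(
    "SELECT ministry, SUM(amount) FROM expenses "
    "WHERE year = ? GROUP BY ministry ORDER BY ministry",
    (2015,),
).fetchall()
conn.close()

print(rows)
```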
  • 39. • The Internet is full of open datasets of different kinds! Some examples: • Economics • American Economic Association (AEA): http://www.aeaweb.org/RFE/toc.php?show=complete • Gapminder: http://www.gapminder.org/data/ • UMD: http://inforumweb.umd.edu/econdata/econdata.html • World Bank: http://data.worldbank.org/indicator • Finance • CBOE Futures Exchange: http://cfe.cboe.com/Data/ • Google Finance: https://www.google.com/finance (R) • Google Trends: http://www.google.com/trends?q=google&ctab=0&geo=all&date=all&sort=0 • St Louis Fed: http://research.stlouisfed.org/fred2/ (R) • NASDAQ: https://data.nasdaq.com/ • OANDA: http://www.oanda.com/ (R) • Quandl: http://www.quandl.com/ • Yahoo Finance: http://finance.yahoo.com/ (R) • Social Sciences • General Social Survey: http://www3.norc.org/GSS+Website/ • ICPSR: http://www.icpsr.umich.edu/icpsrweb/ICPSR/access/index.jsp • Pew Research: http://www.pewinternet.org/datasets/pages/2/ • SNAP: http://snap.stanford.edu/data/index.html • UCLA Social Sciences Archive: http://dataarchives.ss.ucla.edu/Home.DataPortals.htm • UPJOHN INST: http://www.upjohn.org/erdc/erdc.html • FROM: http://www.inside-r.org/howto/finding-data-internet INTERNET IS FULL OF DATA
  • 40. WEB SCRAPING, APIS & DATABASES [Diagram: a data provider organisation exposes data via the Internet through a public WWW page (reached by automated web scraping) and through an API (application programming interface, reached by API calls). The database is typically accessed only from inside the organisation and not via the Internet.]
  • 41. • Web services and applications (such as Twitter, Facebook,…) provide Web APIs so that others are able to build their services using some functionality or data based on the data provider’s Web API / Web service • Using APIs is the structured and “right” way to get data from a web service • The use of APIs is controlled by the data provider: they are thus used with the data provider’s permission • Some APIs charge according to usage, some have other conditions for use • Connecting to an API requires programming API (APPLICATION PROGRAMMING INTERFACE)
  • 44. • Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites. (Wikipedia 2015, Web Scraping) • Transforms unstructured data in HTML format into a structured format for further analysis • Used when you do not have access to the original database or when there are no APIs • NOTE! Always make sure that scraping is allowed and legal! This is not always the case, as some websites and services explicitly forbid web scraping. • Numerous tools varying from manual to semi-manual to fully automatic • High-level scraping services • Browser plugin tools • Programming libraries WEB SCRAPING
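The programming needed to use a Web API usually comes down to two steps: building a request URL with the right parameters and parsing the structured response. The sketch below shows both without touching the network. The endpoint `https://api.example.org/v1/posts`, its parameter names and the response layout are all hypothetical; a real provider documents its own.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://api.example.org/v1/posts"  # hypothetical endpoint

def build_request_url(query, limit, api_key):
    """Build the URL for an API call; the key identifies the permitted user."""
    params = {"q": query, "limit": limit, "key": api_key}
    return BASE_URL + "?" + urlencode(params)

def parse_posts(response_body):
    """Turn the JSON response into a list of (author, text) tuples."""
    payload = json.loads(response_body)
    return [(p["author"], p["text"]) for p in payload["posts"]]

url = build_request_url("big data", 2, "MY-KEY")
print(url)

# Canned response standing in for what the hypothetical API might return.
sample_response = (
    '{"posts": [{"author": "alice", "text": "hello"},'
    ' {"author": "bob", "text": "hi"}]}'
)
posts = parse_posts(sample_response)
print(posts)
```

In practice the URL would be fetched with an HTTP client such as `urllib.request.urlopen`, subject to the provider's terms, rate limits and pricing.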
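The core of scraping, turning HTML into structured records, can be sketched with Python's built-in `html.parser` module. The HTML snippet and the `LinkExtractor` class below are made up for illustration; fetching real pages is left out on purpose, since it should only be done where scraping is permitted.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect every link on a page as a {'title', 'url'} record."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append({"title": "".join(self._text).strip(), "url": self._href})
            self._href = None

# Hypothetical page fragment standing in for a downloaded web page.
html = (
    '<ul><li><a href="/news/1">Election results</a></li>'
    '<li><a href="/news/2">Budget <b>2015</b></a></li></ul>'
)

parser = LinkExtractor()
parser.feed(html)
print(parser.links)
```

Libraries such as Scrapy or BeautifulSoup wrap this kind of parsing in a more convenient interface and add crawling, but the underlying transformation is the same.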
  • 49. • Python • Scrapy: http://scrapy.org • BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/ • Scrapemark: http://arshaw.com/scrapemark/ (not maintained anymore) • R • rvest: http://cran.r-project.org/web/packages/rvest/index.html WEB SCRAPING LIBRARIES
  • 50. • Watch “The Beauty of Data Visualization” by David McCandless: http://www.ted.com/talks/david_mccandless_the_beauty_of_data_visualization?language=en VISUALIZING DATA
  • 51. • Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI magazine, 17(3), 37. • De Mauro, A., Greco, M., Grimaldi, M. 2014. What is big data? A consensual definition and a review of key research topics. 4th International Conference on Integrated Information, Madrid LECTURE 3 READING
  • 52. • Cioffi-Revilla, C. 2014. Introduction to Computational Social Science. Springer-Verlag, London • Elliot, T. 2013. 7 Definitions of Big Data You Should Know About. http://timoelliott.com/blog/2013/07/7-definitions-of-big-data-you-should-know-about.html • Eynon, R. 2013. The rise of Big Data: what does it mean for education, technology, and media research? Learning, Media and Technology, 38:3, 237-240, DOI: 10.1080/17439884.2013.771783. • Gartner, 2014. IT Glossary: Big Data. http://www.gartner.com/it-glossary/big-data/ • Graham, M. 2013. The Virtual Dimension. Global City Challenges: Debating a Concept, Improving the Practice, M. Acuto and W. Steele, 2013. London: Palgrave. 117-139. • De Mauro, A., Greco, M., Grimaldi, M. 2014. What is big data? A consensual definition and a review of key research topics. 4th International Conference on Integrated Information, Madrid • Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI magazine, 17(3), 37. • IBM, 2014a. What is big data? http://www-01.ibm.com/software/data/bigdata/what-is-big-data.html • IBM, 2014b. The Four V’s of Big Data. http://www.ibmbigdatahub.com/sites/default/files/infographic_file/4-Vs-of-big-data.jpg • Sicular, S. 2013. Gartner's Big Data Definition Consists of Three Parts, Not to Be Confused with Three "V"s. Forbes, 3/27/2013. http://www.forbes.com/sites/gartnergroup/2013/03/27/gartners-big-data-definition-consists-of-three-parts-not-to-be-confused-with-three-vs/ • Yasseri, T.; Bright, J. 2013. Can electoral popularity be predicted using socially generated big data? Oxford Internet Institute, University of Oxford. 2013. REFERENCES
  • 53. Thank You! Questions and comments? twitter: @laurieloranta