Third lecture of the course CSS01: Introduction to Computational Social Science at the University of Helsinki, Spring 2015.(http://blogs.helsinki.fi/computationalsocialscience/).
Lecturer: Lauri Eloranta
Questions & Comments: https://twitter.com/laurieloranta
5. • The term big data is used quite loosely, with various definitions depending
on the context
• Typically big data is misunderstood to refer only to large volumes of data
• One of the most used definitions in the field of IT is by Gartner:
“Big data is high-volume, high-velocity and high-variety
information assets that demand cost-effective, innovative forms of
information processing for enhanced insight and decision
making.” (Gartner 2014.)
• Gartner analyst Doug Laney introduced the 3Vs concept in a 2001
MetaGroup research publication, 3D data management: Controlling data
volume, variety and velocity.
BIG DATA DEFINED
(Gartner 2014.)
6. • Called the three “V”s of Big Data
1. Volume refers to the big quantities of data
2. Velocity refers to the usually high speed at which data is generated
3. Variety refers to different kinds and types of data
• Other Vs suggested as well: Variability, Veracity
VOLUME, VELOCITY & VARIETY
(Gartner 2014.)
7. •“Big Data represents the Information assets
characterized by such a High Volume,
Velocity and Variety to require specific
Technology and Analytical Methods for its
transformation into Value".
• (De Mauro, A., Greco, M., Grimaldi, M. 2014. What is big data? A
consensual definition and a review of key research topics. 4th
International Conference on Integrated Information, Madrid)
DE MAURO, GRECO & GRIMALDI 2014: DEFINITION
8. • Strong instrumental component in relation to how you get “value” out of
big data
• Answering research questions
• Answering business problems
• Instead of just one particular technology, big data also refers to a large
set of different technologies used in various ways
BIG DATA IS ABOUT USING BIG DATA
(Sicular 2013.)
9. • “Every day, we create 2.5 quintillion bytes of data — so
much that 90% of the data in the world today has been
created in the last two years alone. This data comes from
everywhere: sensors used to gather climate information,
posts to social media sites, digital pictures and videos,
purchase transaction records, and cell phone GPS signals to
name a few. This data is big data.” (IBM 2014a.)
• Underlines the volume component of big data.
IBM’S DEFINITION
11. • E.g. 7 views from Elliot 2013:
• Big Data as
1. Volume, Velocity & Variety (dictionary definition)
2. Set of technologies and tools
3. Set of different categories and types of data
4. Means of predicting the future (big data as signals)
5. New possibilities that previously were impossible (value)
6. Metaphor for a global neural network (combining all data)
7. As a capitalist/neoliberal concept (critical view)
MANY VIEWPOINTS TO BIG DATA
(Elliot 2013)
12. • Lately in the social sciences big data has been defined either in quite
vague terms or by underlining only the volume component of big data
• ”Big Data, that is, data that are too big for standard database software to
process, or the more future-proof, ‘capacity to search, aggregate, and cross-
reference large data sets.” (Eynon 2013.)
• “Today, our more-than-ever digital lives leave significant footprints in
cyberspace. Large scale collections of these socially generated footprints,
often known as big data --“ (Yasseri and Bright 2013.)
• "These emitted shadows of ‘big data’ can take a variety of forms, but most
are manifestations or byproducts of human/machine interactions in
code/spaces and coded spaces. We now see hundreds of millions of
connected people, billions of sensors, and trillions of communications,
information transfers, and transactions producing unfathomably large data
shadows --" (Graham 2013.)
TYPICALLY NOT A COMMON DEFINITION IN SOCIAL SCIENCE RESEARCH
14. • The data mining process aims at answering research questions based on
large sets of data (in other words, big data)
• New insights and information are “mined” from the data with automated
computation
• For variety of research purposes with many different kinds of data
• Long traditions: Quantitative content analysis and register-based
research, for example, could be seen as a form of data mining
• NOTE! To be specific, in computer science the term data mining only
refers to the pre-processing and analysis part of the whole process
DATA MINING PROCESS IN CSS
1. Formulating research questions
2. Selecting source raw data
3. Gathering source raw data
4. Preprocessing
5. Analysis
6. Communication
(Cioffi-Revilla 2014.)
15. • Everything starts with a research question
• Three main types of research questions in relation to data
• 1. Inductive = Data-driven. The data tells something new.
• 2. Deductive = Theory-driven. The data tells something about a theory.
E.g. data can be used to test hypotheses.
• 3. Abductive = Mixed model, in between inductive and deductive research
RESEARCH QUESTIONS IN DATA MINING
(Cioffi-Revilla 2014.)
16. • Main guiding factor: the research question
• Not just text: many different forms of data
• Text / Numeric data
• Images
• Video
• Audio
• Sensor-data
• Register data
• Where to get the data?
• Data and its selection come with many problems: ethics, legality,
privacy, public vs. private. (These matters will have a lecture of their
own.)
SELECTING AND GATHERING RAW DATA
(Cioffi-Revilla 2014.)
17. • Data needs to be pre-processed so that it can be analyzed: typically this
can take a very large part of the data mining process
• Cioffi-Revilla (2014) mentions these (mainly from a textual content
analysis perspective):
• Scanning = generating machine readable files
• Cleaning = making the data set more concise (removing unnecessary
noise)
• Filtering = there may be a need to filter the data based on some rules
or categories even before the analysis
• Reformatting = changing the structure of the data, for example
dividing data in smaller parts
• Content proxy extraction = extracting the proxies in the text that
denote latent entities
PREPROCESSING DATA
(Cioffi-Revilla 2014.)
18. • This is the main automated information extraction part: data is “mined” to
reveal new information
• Many different analysis method classes, typically combining techniques
from statistics, machine learning, artificial intelligence and database
systems.
• Main types of analysis (according to Fayyad et al. 1996):
Classification, Clustering, Regression Analysis, Summarization,
Dependency Modeling, Anomaly detection
• There are many others, which can be seen as combining and
mixing the main types given above
DATA ANALYSIS
(Fayyad et al. 1996)
19. • Classification maps (classifies) data items into one or several predefined
classes
• Classification algorithms are learning algorithms in the sense that they
need a data set that defines how to categorize the data: thus, one needs
to teach the classification algorithm what classes to look for
• For example
• Classification of images in different categories
• Classification of news items in different categories
• Classification of email into spam and normal mail
CLASSIFICATION
(Fayyad et al. 1996)
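The idea on the slide above can be sketched in a few lines. This is a minimal nearest-centroid classifier on hypothetical toy data (two made-up features per email, e.g. link count and message length); the labeled training examples are what "teach" the algorithm its classes.

```python
# Minimal sketch of classification with labeled training data,
# using a nearest-centroid rule on hypothetical toy features.

def centroid(points):
    """Mean point of a list of equal-length feature vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def classify(item, centroids):
    """Assign the item to the class with the closest centroid."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist2(item, centroids[label]))

# "Teaching" the classifier: labeled examples define the classes.
training = {
    "spam":   [[8.0, 1.0], [9.0, 0.5], [7.5, 1.5]],
    "normal": [[1.0, 6.0], [0.5, 7.0], [1.5, 5.5]],
}
centroids = {label: centroid(points) for label, points in training.items()}

print(classify([8.5, 1.0], centroids))  # -> spam
print(classify([1.0, 6.5], centroids))  # -> normal
```

Real classification work would use a library implementation (e.g. decision trees or naive Bayes), but the supervised structure is the same: labeled data in, a decision rule out.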
20. • Clustering groups a set of data objects in such a way that objects in the
same group (cluster) are more similar to each other than to those in
other groups (clusters).
• Not one specific algorithm, but a general task with many different
solutions and algorithms
• Connectivity based clustering (based on distance)
• Centroid based clustering (e.g. K-means clustering)
• Distribution based clustering (objects belonging most likely to the same
distribution)
• Density based clustering
CLUSTERING
(Fayyad et al. 1996)
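The centroid-based variant above can be sketched compactly. This is a toy k-means on one-dimensional made-up values; a real analysis would use a library implementation and multidimensional data.

```python
# Minimal sketch of centroid-based (k-means style) clustering on
# one-dimensional toy data.
import random

def kmeans_1d(values, k, iterations=20, seed=0):
    """Cluster scalar values into k groups by iteratively refining centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(values, k)          # random initial centroids
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid's group.
        groups = {c: [] for c in range(k)}
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            groups[nearest].append(v)
        # Update step: each centroid moves to the mean of its group.
        centroids = [sum(g) / len(g) if g else centroids[c]
                     for c, g in groups.items()]
    return centroids, groups

values = [1.0, 1.2, 0.8, 9.8, 10.1, 10.3]
centroids, groups = kmeans_1d(values, k=2)
print(sorted(centroids))  # one centroid near 1.0, one near 10.1
```

Note that no labels were given: the two groups emerge from the data itself, which is what distinguishes clustering from classification.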
21. • Helsingin Sanomat (the biggest news corporation in Finland) opened
their Finnish parliament election 2015 questionnaire data to public
• The data contained questions and their answers from election
candidates for the Finnish parliament
• The data could be analyzed via clustering and factor analysis to find out
which different groups (clusters) of thought the candidates actually
represent (in comparison to their actual party).
• Try it out: http://users.aalto.fi/~leinona1/vaalit2015/
CLUSTERING EXAMPLE
23. • Does what it says on the tin! Finding compact descriptions for subsets of
data.
• For example, calculating means or standard deviations over different data
attributes (dimensions)
• Summarization techniques are often applied to interactive exploratory
data analysis and automated report generation.
SUMMARIZATION
(Fayyad et al. 1996)
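A minimal sketch of per-attribute summarization, using Python's standard statistics module on a hypothetical toy data set (the attribute names and values are made up):

```python
# Compact per-attribute descriptions (mean and standard deviation)
# over a toy data set.
import statistics

data = {
    "age":    [23, 35, 31, 42, 28],
    "income": [2100, 3400, 2900, 4100, 2500],
}

summary = {
    attr: {"mean": statistics.mean(values),
           "stdev": statistics.stdev(values)}
    for attr, values in data.items()
}

print(summary["age"]["mean"])  # -> 31.8
```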
24. • Estimating the relationship among variables (with a regression function)
• It includes many techniques for modeling and analyzing data
• Focuses on the relationship between a dependent variable and one or
more independent variables.
• The regression function is learned from the data
• Applications in prediction and
REGRESSION ANALYSIS
(Fayyad et al. 1996)
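The simplest case, fitting a line y = a + b·x by ordinary least squares, can be sketched directly. The data points here are hypothetical, chosen to lie close to the line y = 2x:

```python
# Minimal sketch of simple linear regression by ordinary least squares.

def linear_fit(xs, ys):
    """Return intercept a and slope b minimizing squared error."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.0, 8.1, 9.9]   # roughly y = 2x
a, b = linear_fit(xs, ys)
print(round(b, 2))  # -> 1.98, i.e. slope close to 2
```

Once a and b are estimated from the data, the fitted function can be used to predict y for unseen x values, which is the prediction use mentioned above.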
26. • Finds significant dependencies between the data variables
• Two levels
• Structural level defining which variables are dependent (can be in
graphical form)
• Quantitative level defining the strength of the dependency in numeric
form
• E.g. Correlation analysis
• E.g. Probabilistic density networks
DEPENDENCY MODELING
(Fayyad et al. 1996)
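The quantitative level mentioned above can be illustrated with Pearson correlation. The two variables here are hypothetical and constructed to be perfectly linearly dependent:

```python
# Minimal sketch of dependency modeling at the quantitative level:
# Pearson correlation between two variables on toy data.
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

hours_online = [1, 2, 3, 4, 5]
posts_made   = [2, 4, 6, 8, 10]   # perfectly dependent on hours_online
print(pearson(hours_online, posts_made))  # prints a value very close to 1.0
```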
28. • Change and deviation detection
• Has the data changed from some previously known stable state or from
some previously measured normative values (“normal range”)?
• Time scales matter: a short-term anomaly may actually be normal in the
long term.
• Synchronic change (anomalies in stable processes) and diachronic
change (deeper change in generative structures of the process)
• Quite a dynamic category
ANOMALY DETECTION
(Fayyad et al. 1996)
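A minimal sketch of the "normal range" idea above: flag values that deviate more than three standard deviations from a previously measured stable baseline. The baseline readings are hypothetical:

```python
# Deviation detection against a previously known stable state.
import statistics

baseline = [10.1, 9.8, 10.0, 10.2, 9.9, 10.0]   # known stable state
mean = statistics.mean(baseline)
stdev = statistics.stdev(baseline)

def is_anomaly(value, threshold=3.0):
    """True if the value falls outside the baseline's normal range."""
    return abs(value - mean) > threshold * stdev

print(is_anomaly(10.1))  # -> False
print(is_anomaly(17.5))  # -> True
```

As the slide notes, the choice of baseline window matters: a value anomalous against last week's data may be normal against last year's.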
29. • Cioffi-Revilla (2014) lists, for example, vocabulary analysis, correlation,
lexical analysis, spatial analysis, semantic analysis, sentiment analysis,
similarity analysis, clustering, network analysis, sequence analysis,
intensity analysis, anomaly detection, sonification analysis
• Most important thing is to understand the ins and outs of the analysis
model you are using: what is it for and how does it behave under the
hood
• The relationship of the model to your research question
AND MANY OTHERS…
30. • Basically means that the data analysis algorithm is able to “learn” and enhance its
performance iteratively from the data
• 1. Supervised machine learning
• The algorithm is trained on some known labeled data (input/target pairs)
• e.g. Netflix is able to suggest you better movies based on how you use it: By
watching and rating films you are teaching the machine how to suggest better
movies to you
• 2. Semi-supervised machine learning
• The algorithm is trained with a small set of labeled data (input/target pairs) and
a set of unlabeled data
• 3. Unsupervised machine learning
• No result-set data is given for the machine to learn
• The algorithm is able to find patterns and structures from the data automatically
without any pre-learning
• 4. Reinforcement machine learning
• Algorithm has a certain goal and it interacts with a dynamic environment, which
gives it rewards based on actions
MACHINE LEARNING
32. • Ready Data Sets = Many public data sets provided by different institutions
• Web APIs = Application programming interfaces that give you data in a
structured format. For example, Facebook and Twitter have APIs for getting
data
• Web Scraping = Gathering information automatically from webpages,
when it is allowed.
• Databases = Querying databases directly with query languages (e.g. SQL)
• Custom data gathering process = the traditional research data gathering
(surveys, interviews…)
• Open Data and Open Science are growing trends: governments are
providing APIs and data sets for different kinds of public data (e.g. fiscal
information, expenses)
DATA SOURCES: MAIN TYPES
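Querying a database directly can be sketched with Python's built-in sqlite3 module. The table and rows here are hypothetical (loosely inspired by the election questionnaire example earlier in the lecture):

```python
# Minimal sketch of the "query the database directly" data source:
# an in-memory SQLite database queried with SQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE answers (candidate TEXT, party TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO answers VALUES (?, ?, ?)",
    [("A", "Party1", 4), ("B", "Party1", 5), ("C", "Party2", 2)],
)

# Aggregate inside the database instead of loading everything into memory.
rows = conn.execute(
    "SELECT party, AVG(score) FROM answers GROUP BY party ORDER BY party"
).fetchall()
print(rows)  # -> [('Party1', 4.5), ('Party2', 2.0)]
conn.close()
```

The same query language works against large production databases, where pushing aggregation to the database is often the only practical option.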
39. • The Internet is full of open datasets of different kinds!
Some examples:
• Economics
• American Economic Ass. (AEA): http://www.aeaweb.org/RFE/toc.php?show=complete
• Gapminder: http://www.gapminder.org/data/
• UMD: http://inforumweb.umd.edu/econdata/econdata.html
• World bank: http://data.worldbank.org/indicator
• Finance
• CBOE Futures Exchange: http://cfe.cboe.com/Data/
• Google Finance: https://www.google.com/finance (R)
• Google Trends: http://www.google.com/trends?q=google&ctab=0&geo=all&date=all&sort=0
• St Louis Fed: http://research.stlouisfed.org/fred2/ (R)
• NASDAQ: https://data.nasdaq.com/
• OANDA: http://www.oanda.com/ (R)
• Quandl: http://www.quandl.com/
• Yahoo Finance: http://finance.yahoo.com/ (R)
• Social Sciences
• General Social Survey: http://www3.norc.org/GSS+Website/
• ICPSR: http://www.icpsr.umich.edu/icpsrweb/ICPSR/access/index.jsp
• Pew Research: http://www.pewinternet.org/datasets/pages/2/
• SNAP: http://snap.stanford.edu/data/index.html
• UCLA Social Sciences Archive: http://dataarchives.ss.ucla.edu/Home.DataPortals.htm
• UPJOHN INST: http://www.upjohn.org/erdc/erdc.html
• FROM: http://www.inside-r.org/howto/finding-data-internet
INTERNET IS FULL OF DATA
41. • Web services and applications (such as Twitter, Facebook, …) provide
Web APIs so that others are able to build their services using some
functionality or data based on the data provider’s Web API / Web service
• Using APIs is the structured and “right” way to get data from a web
service
• The use of APIs is controlled by the data provider: they are thus used
with the data provider’s permission
• Some APIs cost according to usage, some have other conditions for use
• Connecting to an API requires programming
API (APPLICATION PROGRAMMING INTERFACE)
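A minimal sketch of what that programming looks like. The endpoint and parameters below are hypothetical; every real API defines its own, and most also require an API key or other authentication:

```python
# Building a Web API request URL with the standard library.
import urllib.parse

base = "https://api.example.com/v1/search"   # hypothetical endpoint
params = {"q": "computational social science", "count": 10}
url = base + "?" + urllib.parse.urlencode(params)
print(url)
# -> https://api.example.com/v1/search?q=computational+social+science&count=10

# The request itself could then be sent, e.g. with urllib.request.urlopen(url),
# and the (typically JSON) response parsed with the json module.
```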
44. • Web scraping (web harvesting or web data extraction) is a computer
software technique of extracting information from websites. (Wikipedia
2015, Web Scraping)
• Transforms unstructured data in HTML format into some structured format
for further analysis
• Used when you do not have access to the original database or when
there are no APIs
• NOTE! Always make sure that scraping is allowed and legal! This is
not always the case, as some websites and services explicitly forbid web
scraping.
• Numerous tools varying from manual to semi-manual to fully automatic
• High-level scraping services
• Browser plugin tools
• Programming libraries
WEB SCRAPING
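A minimal sketch of the programming-library approach, using Python's built-in HTMLParser. The HTML is hard-coded here; in practice the page would be fetched over HTTP, and (per the note above) only when scraping it is allowed:

```python
# Extracting structured data (headline texts) from unstructured HTML.
from html.parser import HTMLParser

class HeadlineParser(HTMLParser):
    """Collect the text content of all <h2> elements."""
    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2:
            self.headlines.append(data.strip())

page = "<body><h2>Election results</h2><p>text</p><h2>Data release</h2></body>"
parser = HeadlineParser()
parser.feed(page)
print(parser.headlines)  # -> ['Election results', 'Data release']
```

Higher-level scraping libraries and browser tools automate much of this, but they all reduce to the same step: turning markup into analyzable data.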
50. • Watch “The Beauty of Data Visualization” by David McCandless: http://www.ted.com/talks/david_mccandless_the_beauty_of_data_visualization?language=en
VISUALIZING DATA
51. • Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining
to knowledge discovery in databases. AI magazine, 17(3), 37.
• De Mauro, A., Greco, M., Grimaldi, M. 2014. What is big data? A
consensual definition and a review of key research topics. 4th
International Conference on Integrated Information, Madrid
LECTURE 3 READING
52. • Cioffi-Revilla, C. 2014. Introduction to Computational Social Science. Springer-Verlag, London
• Elliot, T. 2013. 7 Definitions of Big Data You Should Know About. http://timoelliott.com/blog/2013/07/7-definitions-of-big-data-you-should-know-
about.html
• Eynon, R. 2013. The rise of Big Data: what does it mean for education, technology, and media research? Learning, Media and Technology, 38:3,
237-240, DOI: 10.1080/17439884.2013.771783.
• Gartner, 2014. IT Glossary: Big Data. http://www.gartner.com/it-glossary/big-data/
• Graham, M. 2013. The Virtual Dimension. Global City Challenges: Debating a Concept, Improving the Practice, M. Acuto and W. Steele, 2013.
London: Palgrave. 117-139.
• De Mauro, A., Greco, M., Grimaldi, M. 2014. What is big data? A consensual definition and a review of key research topics. 4th International
Conference on Integrated Information, Madrid
• Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI magazine, 17(3), 37.
• IBM, 2014a. What is big data? http://www-01.ibm.com/software/data/bigdata/what-is-big-data.html
• IBM, 2014b. The Four V’s of Big Data. http://www.ibmbigdatahub.com/sites/default/files/infographic_file/4-Vs-of-big-data.jpg
• Sicular, S. 2013. Gartner's Big Data Definition Consists of Three Parts, Not to Be Confused with Three "V"s. Forbes, 3/27/2013.
http://www.forbes.com/sites/gartnergroup/2013/03/27/gartners-big-data-definition-consists-of-three-parts-not-to-be-confused-with-three-vs/
• Yasseri, T.; Bright, J. 2013. Can electoral popularity be predicted using socially generated big data? Oxford Internet Institute, University of Oxford.
2013.
REFERENCES