CSE6339 Computational Journalism
Chengkai Li
University of Texas at Arlington
Spring 2015
Big Data
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
Big Data
The 4 Vs
o Volume
o Variety
o Velocity
o Veracity
Volume: How much data is out there?
http://www.sciencedaily.com/releases/2013/05/130522085217.htm
http://www.storagenewsletter.com/rubriques/market-reportsresearch/ibm-cmo-study/
Variety: Types of Data
Structured Data
o (relational) database tables
o CSV/TSV files
Semi-structured Data
o XML
o JSON
o RDF
Unstructured Data
o text data (documents, Web pages, short texts (e.g., social media))
Multimedia Data (images, video, audio)
Other types of data
o matrices, graphs, sequences, time-series, spatio-temporal
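To make the structured vs. semi-structured distinction concrete, here is a minimal sketch that reads the same record from CSV, JSON, and XML using only the Python standard library. The record itself (a market name and state) is a hypothetical sample, not taken from any dataset in these slides.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical sample record, expressed in three of the formats listed above.
csv_text = "name,state\nDowntown Market,TX\n"
json_text = '{"name": "Downtown Market", "location": {"state": "TX"}}'
xml_text = "<market><name>Downtown Market</name><state>TX</state></market>"

# Structured: every row has the same flat set of columns.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: fields can nest and may vary from record to record.
json_record = json.loads(json_text)
xml_record = ET.fromstring(xml_text)

states = [
    csv_rows[0]["state"],
    json_record["location"]["state"],
    xml_record.findtext("state"),
]
print(states)
```

The CSV reader relies on a fixed header row, while the JSON and XML lookups navigate a nested structure that carries its own field names.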
Velocity: Streaming Data
Stock Trades
Highway Sensors
Weather Data
Social Media
Telephone Calls
Video Streaming
http://mashable.com/2012/06/22/data-created-every-minute/
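Streaming data usually has to be processed one item at a time, without storing the whole stream. A minimal sketch, assuming a hypothetical stream of stock trade prices, keeps only a fixed-size window of recent values and emits a running average:

```python
from collections import deque
from typing import Iterable, Iterator


def moving_average(stream: Iterable[float], window: int) -> Iterator[float]:
    """Yield the average of the most recent `window` values seen so far."""
    recent = deque(maxlen=window)  # older values fall out automatically
    for value in stream:
        recent.append(value)
        yield sum(recent) / len(recent)


# Hypothetical trade prices arriving one at a time.
prices = [10.0, 12.0, 11.0, 13.0]
averages = list(moving_average(prices, window=2))
print(averages)
```

Memory use depends on the window size, not on the length of the stream, which is the property that matters for high-velocity sources.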
Datasets
Amazon Public Data Sets
Data.gov
Linked Open Data
Knowledge Bases, Encyclopedia
Yahoo! Webscope
Bibliography Databases
Network/Graph Datasets
UCI Machine Learning Repository
UCR Time Series Classification/Clustering
Time Series Data Library
KDnuggets Dataset List
KDD Cup Datasets
Amazon Public Data Sets
http://aws.amazon.com/public-data-sets/
o NASA NEX: A collection of Earth science data sets maintained by
NASA, including climate change projections and satellite images of
the Earth's surface
o Common Crawl Corpus: A corpus of web crawl data composed of
over 5 billion web pages
o 1000 Genomes Project: A detailed map of human genetic variation
o Google Books Ngrams: A data set containing Google Books n-gram corpora
o US Census Data: US demographic data from 1980, 1990, and 2000
US Censuses
o Freebase Data Dump: A data dump of all the current facts and
assertions in the Freebase system, an open database covering
millions of topics
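For readers unfamiliar with the term used in the Google Books Ngrams entry: an n-gram is a contiguous run of n tokens. A minimal sketch of counting bigrams (n = 2) over a hypothetical sentence, not the actual Google Books file format:

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical input text, split on whitespace.
tokens = "to be or not to be".split()
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(1))
```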
Data.gov
http://www.data.gov/ (137,608 datasets)
o Consumer Complaint Database
o U.S. International Trade in Goods and Services: Monthly report that provides
national trade data including imports, exports, and balance of payments for
goods and services.
o DTV Reception Maps
o Climate Data Online
o Food Access Research Atlas — presents a spatial overview of food access
indicators for low-income and other census tracts using different measures of
supermarket...
o U.S. Hourly Precipitation Data
o Great Chile Earthquake of May 22, 1960
o Consumer Expenditure Survey
o Campus Security Data
o Farmers Markets Geographic Data: longitude and latitude, state, address, name,
and zip code of Farmers Markets in the United States
o Crimes - 2001 to present (City of Chicago)
Government Data
Government spending
http://www.usaspending.gov/
Campaign finance
http://www.fec.gov/disclosure.shtml
http://www.opensecrets.org/
Congress voting record
http://www.govtrack.us/ Members of Congress, Bills & Resolutions,
Voting Records, Committees
Census
http://www.census.gov/main/www/access.html
Linked Data
http://linkeddata.org/ (hundreds of datasets, billions of RDF triples)
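An RDF triple is a (subject, predicate, object) statement. A minimal sketch, using plain Python tuples instead of an RDF library, shows how a set of triples can be queried by pattern. The `ex:` identifiers are hypothetical placeholders, not real URIs from any linked dataset.

```python
# Each triple is (subject, predicate, object); identifiers are hypothetical.
triples = [
    ("ex:UTA", "rdf:type", "ex:University"),
    ("ex:UTA", "ex:locatedIn", "ex:Arlington"),
    ("ex:Arlington", "ex:locatedIn", "ex:Texas"),
]


def match(triples, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


located = match(triples, p="ex:locatedIn")
print(located)
```

Because the object of one triple can be the subject of another, triples from different datasets can link into one graph, which is the idea behind Linked Data.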
Knowledge Bases, Encyclopedia
o Wikipedia, DBpedia
o Freebase/Google Knowledge Graph
o YAGO
o Probase
o LibraryThing
Yahoo! Webscope Datasets
o Language Data
o Graph and Social Data
o Ratings and Classification Data
o Advertising and Market Data
o Competition Data
o Computing Systems Data
o Image Data
Bibliography Databases
o Google Scholar, Microsoft Academic Search, DBLP,
arXiv.org, CiteSeer, Arnetminer
Drug and Disease Databases
o DrugBank, DailyMed, OMIM, KEGG Drug
Gene and Protein Databases
o UniProt, Protein Data Bank, GenBank
Stanford Large Network Dataset Collection
http://snap.stanford.edu/data/
o Social networks : online social networks, edges represent interactions between
people
o Networks with ground-truth communities : ground-truth network communities
in social and information networks
o Communication networks : email communication networks with edges
representing communication
o Citation networks : nodes represent papers, edges represent citations
o Collaboration networks : nodes represent scientists, edges represent
collaborations (co-authoring a paper)
o Web graphs : nodes represent webpages and edges are hyperlinks
o Amazon networks : nodes represent products and edges link commonly co-purchased products
o Internet networks : nodes represent computers and edges represent communication
o Road networks : nodes represent intersections and edges represent the roads connecting them
o …
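All of the network types above share the same node-and-edge model. A minimal sketch, assuming a plain-text edge list with one "source target" pair per line and "#" comment lines (the exact layout of any given download may differ), loads the edges and computes each node's degree:

```python
from collections import defaultdict

# Hypothetical edge list; real files may use other delimiters or headers.
edge_list_text = """# source target
1 2
1 3
2 3
3 4
"""


def load_edges(text):
    """Parse 'source target' pairs, skipping blank and comment lines."""
    edges = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        source, target = line.split()[:2]
        edges.append((source, target))
    return edges


def adjacency(edges):
    """Build an undirected adjacency map: node -> set of neighbors."""
    neighbors = defaultdict(set)
    for source, target in edges:
        neighbors[source].add(target)
        neighbors[target].add(source)
    return neighbors


edges = load_edges(edge_list_text)
graph = adjacency(edges)
degrees = {node: len(nbrs) for node, nbrs in graph.items()}
print(degrees)
```

This treats every edge as undirected; citation networks and web graphs are directed, so for those only `neighbors[source].add(target)` would apply.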
Time Series Data Library
http://robjhyndman.com/TSDL/
KDnuggets Dataset List
http://www.kdnuggets.com/datasets/index.html
KDD Cup Datasets
http://www.sigkdd.org/kddcup/index.php
Data Mining Software
Free, open-source
o RapidMiner
o Weka: data mining tool in Java
o SCaVis: scientific computation and visualization, Java
o Orange: Python suite
o Scikit-learn: Python machine learning library
o NumPy/SciPy/IPython/mlpy (Python modules for scientific
computing, scientific library, interactive computing, machine
learning)
o R: statistical computing and graphics
o RattleGUI: data mining GUI using R
o Octave: numerical analysis
o Shogun: machine learning toolkit in C++
Text Mining Tools
o NLTK (NLP Toolkit): NLP suite for Python
o SenticNet API: sentiment analysis
o Stanford NLP software
o UIMA
Large-Scale Data Processing, Machine Learning
o Apache Mahout
o GraphLab
o MapReduce/Hadoop
o Spark
o Pregel/Giraph
Commercial
o Matlab
o Oracle Data Mining
o SAS
o IBM SPSS
o Microsoft SQL Server Analysis Services
o HP Vertica
