Data 101- Big Data: What is it and Why Do We Care?

Webinar presentation for the Special Libraries Association on the basics of big data and what it means for information professionals and librarians

BIG DATA
What is it and Why Do We Care?
Elaine M. Lasda Bergman
University at Albany
March 6, 2014
elasdabergman@albany.edu
Webinar Presentation
for the Special Libraries Association

What we’re going to cover today
• What is Big Data
• What is great about Big Data
• What is not so great
• The role of Librarians and Info Pros in the Big
Data landscape
• Tools and Resources

How Big is Big?
http://breadboxes.info/files/2012/01/bread-box.jpg

Big Data Vs Open Data
Based on http://www.opendatanow.com/2013/11/new-big-data-vs-open-data-mapping-it-out/
BIG DATA OPEN GOV’T
OPEN DATA

Is Big Data a Game Changer?
http://bellwethergames.com/images/stories/blog/salvaged%20bits.jpg

Types of Data Scientists
• Statistics
• Mathematics
• Data Engineering
• Machine Learning
• Business
• Software engineering
• Visualization
• GIS
http://www.datasciencecentral.com/profiles/blogs/six-categories-of-data-scientists

Big Data is FANTASTIC!
http://4206e9.medialib.glogster.com/media/6bde80470b0f0ffe3b59b390fcb54a117c65f2406a167bd2589cabc3e9601461/excited-smiley-face.jpg

Applications of Big Data (in general)
http://analytics-arena.blogspot.com/2012/12/the-famous-beer-diaper-planogram.html

BIG Data is TERRIBLE!
http://startupmixology.tech.co/2010-chicago/staff/harper-reed

Caveats and limitations
http://www.guy-sports.com/fun_pictures/no_brain.jpg

False Correlations
http://www.cdc.gov/healthyweight/images/height.jpg
http://www.sbsd.k12.ca.us/cms/lib02/CA01001886/Centricity/Domain/569/kids_reading.jpg

Add Data Literacy!
http://remc12.wikispaces.com/file/view/InformationLit.jpg/32256581/InformationLit.jpg

What We Just Talked About
• The Three V’s
• Amazing Capabilities
• The Human Element
• Our Roles as Information Professionals

Read!
• Big Data: A Revolution that Will Transform
How We Live, Work, and Think, by Viktor
Mayer-Schonberger
http://www.amazon.com/Big-Data-Revolution-Transform-Think/dp/0544002695
• “For Dummies” Books

Read!
• An Introduction to Data Science, by Jeffrey
Stanton http://jsresearch.net/
• Frontiers in Massive Data Analysis
http://www.nap.edu/catalog.php?record_id=18374

General Resource Lists/Training
• Syracuse University Library Guide on Data Science
http://researchguides.library.syr.edu/datascience
• ALA ACRL “Keeping Up With Big Data” page
http://www.ala.org/acrl/publications/keeping_up_with/big_data
• Data Information Literacy at Purdue wiki
http://wiki.lib.purdue.edu/display/ste/Home
• MOOCs

Policy/Best Practices
• Council For Big Data, Ethics and Society
http://www.datasociety.net/initiatives/council-for-big-data-ethics-and-society/
• Research Data Management Principles, Practices, and
Prospects – CLIR
http://www.clir.org/pubs/reports/pub160

Policy/Best Practices
• Rebuilding the Mosaic
http://www.nsf.gov/pubs/2011/nsf11086/nsf11086.pdf
• GovLab
http://thegovlab.org/
• Terminology issues

Keep Current
Newsletters
• Data Science Weekly http://www.datascienceweekly.org/
• Data Science Central http://www.datasciencecentral.com/
• R-Bloggers http://www.r-bloggers.com/

Keep Current
Blogs
• Hilary Mason http://www.hilarymason.com/
• Mathbabe http://mathbabe.org/
• Bits Blog in NY Times http://bits.blogs.nytimes.com/
• No Free Hunch http://blog.kaggle.com/
• What’s the Big Data http://whatsthebigdata.com/

PLAY!
http://brainysmurf1234.files.wordpress.com/2011/10/sand-castle.png

Big, Open Data Sources
http://lightworkersalliance.com/wp-content/uploads/2011/06/Open-Door1.jpg

Google Data Explorer
https://www.google.com/publicdata/directory

Amazon Web Services
http://aws.amazon.com/

Scale Unlimited
http://www.scaleunlimited.com/datasets/public-datasets/

Database Structure/Data Analysis
• R http://cran.us.r-project.org/
• Hive/Hadoop http://hive.apache.org/
• PostgreSQL http://www.postgresql.org/
• Project Bamboo Dirt http://dirt.projectbamboo.org/
• Mlcomp http://mlcomp.org/

Visualization tools
http://us.123rf.com/400wm/400/400/lucadp/lucadp1204/lucadp120400012/13060060-one-crystal-ball-with-a-bar-chart-inside-it-a-concept-of-financial-and-business-forecasts-3d-render.jpg

ManyEyes
http://www-958.ibm.com/software/analytics/manyeyes/

Google Fusion Tables
https://support.google.com/fusiontables/answer/2571232?hl=en

Just Plain Cool!
http://images5.fanpop.com/image/photos/30600000/The-Fonz-arthur-fonzarelli-30631370-621-362.jpg

My Magic Plus
https://disneyworld.disney.go.com/plan/my-disney-experience/my-magic-plus/

Information is Beautiful
http://www.informationisbeautiful.net/

Facebook’s Data Science Page
https://www.facebook.com/data

Google Trends
http://www.google.com/trends/

One Final Note:
Professional Development
SLA Data Caucus initiative!
IASSIST http://www.iassistdata.org/
ASIS&T http://www.asis.org/
LinkedIN Groups see:
http://researchguides.library.syr.edu/content.php?pid=484454&sid=4078160

Contact Me
Elaine Lasda Bergman
elasdabergman@albany.edu
http://www.slideshare.net/librarian68/
@ElaineLibrarian on Twitter

1. BIG DATA What is it and Why Do We Care? Elaine M. Lasda Bergman University at Albany March 6, 2014 elasdabergman@albany.edu Webinar Presentation for the Special Libraries Association

2. What we’re going to cover today • What is Big Data • What is great about Big Data • What is not so great • The role of Librarians and Info Pros in the Big Data landscape • Tools and Resources

4. How Big is Big? http://breadboxes.info/files/2012/01/bread-box.jpg

5. The Three Vs •Variety •Velocity •Volume

6. Big Data Vs Open Data Based on http://www.opendatanow.com/2013/11/new-big-data-vs-open-data-mapping-it-out/ BIG DATA OPEN GOV’T OPEN DATA

7. Is Big Data a Game Changer? http://bellwethergames.com/images/stories/blog/salvaged%20bits.jpg

8. Types of Data Scientists • Statistics • Mathematics • Data Engineering • Machine Learning • Business • Software engineering • Visualization • GIS http://www.datasciencecentral.com/profiles/blogs/six-categories-of-data-scientists

9. Big Data is FANTASTIC! http://4206e9.medialib.glogster.com/media/6bde80470b0f0ffe3b59b390fcb54a117c65f2406a167bd2589cabc3e9601461/excited-smiley-face.jpg

10. Applications of Big Data (in general) http://analytics-arena.blogspot.com/2012/12/the-famous-beer-diaper-planogram.html

11. BIG Data is TERRIBLE! http://startupmixology.tech.co/2010-chicago/staff/harper-reed

12. Caveats and limitations http://www.guy-sports.com/fun_pictures/no_brain.jpg

13. False Correlations http://www.cdc.gov/healthyweight/images/height.jpg http://www.sbsd.k12.ca.us/cms/lib02/CA01001886/Centricity/Domain/569/kids_reading.jpg

14.

15. Competencies for Info Pros/Librarians

16. Add Data Literacy! http://remc12.wikispaces.com/file/view/InformationLit.jpg/32256581/InformationLit.jpg

17.

18. What We Just Talked About • The Three V’s • Amazing Capabilities • The Human Element • Our Roles as Information Professionals

19. Now the Fun Stuff! http://www.whee.com.sg/images/common/logo-whee.png

20. Read! • Big Data: A Revolution that Will Transform How We Live, Work, and Think, by Viktor Mayer-Schonberger http://www.amazon.com/Big-Data-Revolution-Transform-Think/dp/0544002695 • “For Dummies” Books

21. Read! • An Introduction to Data Science, by Jeffrey Stanton http://jsresearch.net/ • Frontiers in Massive Data Analysis http://www.nap.edu/catalog.php?record_id=18374

22. General Resource Lists/Training • Syracuse University Library Guide on Data Science http://researchguides.library.syr.edu/datascience • ALA ACRL “Keeping Up With Big Data” page http://www.ala.org/acrl/publications/keeping_up_with/big_data • Data Information Literacy at Purdue wiki http://wiki.lib.purdue.edu/display/ste/Home • MOOCs

23. Policy/Best Practices • Council For Big Data, Ethics and Society http://www.datasociety.net/initiatives/council-for-big-data-ethics-and-society/ • Research Data Management Principles, Practices, and Prospects – CLIR http://www.clir.org/pubs/reports/pub160

24. Policy/Best Practices • Rebuilding the Mosaic http://www.nsf.gov/pubs/2011/nsf11086/nsf11086.pdf • GovLab http://thegovlab.org/ • Terminology issues

25. Keep Current Newsletters • Data Science Weekly http://www.datascienceweekly.org/ • Data Science Central http://www.datasciencecentral.com/ • R-Bloggers http://www.r-bloggers.com/

26. Keep Current Blogs • Hilary Mason http://www.hilarymason.com/ • Mathbabe http://mathbabe.org/ • Bits Blog in NY Times http://bits.blogs.nytimes.com/ • No Free Hunch http://blog.kaggle.com/ • What’s the Big Data http://whatsthebigdata.com/

27. PLAY! http://brainysmurf1234.files.wordpress.com/2011/10/sand-castle.png

28. Big, Open Data Sources http://lightworkersalliance.com/wp-content/uploads/2011/06/Open-Door1.jpg

29. Google Data Explorer https://www.google.com/publicdata/directory

30. Amazon Web Services http://aws.amazon.com/

31. Scale Unlimited http://www.scaleunlimited.com/datasets/public-datasets/

32.

33. Database Structure/Data Analysis • R http://cran.us.r-project.org/ • Hive/Hadoop http://hive.apache.org/ • PostgreSQL http://www.postgresql.org/ • Project Bamboo Dirt http://dirt.projectbamboo.org/ • Mlcomp http://mlcomp.org/

34. Visualization tools http://us.123rf.com/400wm/400/400/lucadp/lucadp1204/lucadp120400012/13060060-one-crystal-ball-with-a-bar-chart-inside-it-a-concept-of-financial-and-business-forecasts-3d-render.jpg

35. Piktochart http://piktochart.com/

36. Esri http://www.esri.com/

37. Big ML https://bigml.com/

38. ManyEyes http://www-958.ibm.com/software/analytics/manyeyes/

39. Google Fusion Tables https://support.google.com/fusiontables/answer/2571232?hl=en

40. Chartsbin http://chartsbin.com/

41. iCharts http://www.icharts.net/

42. Just Plain Cool! http://images5.fanpop.com/image/photos/30600000/The-Fonz-arthur-fonzarelli-30631370-621-362.jpg

43. CSSeer http://csseer.ist.psu.edu/

44. StreetBump http://streetbump.org/

45. My Magic Plus https://disneyworld.disney.go.com/plan/my-disney-experience/my-magic-plus/

46. Information is Beautiful http://www.informationisbeautiful.net/

47. Facebook’s Data Science Page https://www.facebook.com/data

48. Google Trends http://www.google.com/trends/

49. Flowing Data http://flowingdata.com/

50. GapMinder http://www.gapminder.org/

51. One Final Note: Professional Development SLA Data Caucus initiative! IASSIST http://www.iassistdata.org/ ASIS&T http://www.asis.org/ LinkedIN Groups see: http://researchguides.library.syr.edu/content.php?pid=484454&sid=4078160

52. Contact Me Elaine Lasda Bergman elasdabergman@albany.edu http://www.slideshare.net/librarian68/ @ElaineLibrarian on Twitter

Editor's notes

Thank you for the introduction, Kendra.
Here’s a slide of a slide… Dan Ariely, a behavioral economist at Duke University, has been posting this analogy all over social media and at presentations. He also has a book called Predictably Irrational, which I have not read yet, but it talks about his work in behavioral predictions. He also has a number of TED talks that are very interesting. So what is Big Data, really, anyway?
We may be wondering just “how big is big data?” If you played 20 questions as a kid, you might have asked “is it bigger than a breadbox?” While some are reporting datasets on such unfathomable scales as petabytes, exabytes, and zettabytes, big data is really any data that is too big for traditional technology. In other words, it is too big for our breadbox. A good working definition for most of us is a data file that is too big for Excel to load. In Excel 2013, the maximum worksheet size is 1,048,576 rows by 16,384 columns. But really, there are three features that make Big Data “big.”
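The Excel rule of thumb can be written down as a quick check. This is only a sketch: the two limits are the Excel 2013 figures quoted above, while the function name and the sample table shapes are made up for illustration.

```python
# Worksheet limits quoted in the notes above (Excel 2013).
EXCEL_MAX_ROWS = 1_048_576
EXCEL_MAX_COLS = 16_384


def fits_in_excel(n_rows: int, n_cols: int) -> bool:
    """Return True if a table of this shape fits on a single worksheet."""
    return n_rows <= EXCEL_MAX_ROWS and n_cols <= EXCEL_MAX_COLS


# Hypothetical table shapes, for illustration only.
print(fits_in_excel(50_000, 12))      # a modest usage log
print(fits_in_excel(5_000_000, 12))   # more rows than one sheet can hold
```

By this working definition, the second table would count as "big" for our purposes, even though it is tiny by petabyte standards.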
A common definition of Big Data relies on what are known as the 3 V’s. These are Variety, Velocity, and Volume, a term first coined by Doug Laney at a firm called Gartner. Variety means that we are not just collecting more of the same data we’ve always collected. Instead we are collecting different types of data. Variety also means that we do not have the type of structured datasets we used to have in relational databases: you know, the ones with nice tables with neat and tidy rows and columns. Now data often is in forms that don’t fit in columns, like video and audio, sensor data, documents, Flash, and so forth. You may have heard of “NoSQL” databases, which are an alternative to the traditional relational database models and accommodate this type of dataset. Velocity has to do with the tremendous speed at which we are collecting this data and the rate at which data is being generated. You may have heard stats such as that Facebook generates 500 terabytes of data per day. Many businesses use clickstream analysis on their websites, which generates a great deal of data in a hurry. The IDC Digital Universe study indicates that by the year 2020 society will be generating 50 times the amount of information currently being generated. In 2011 this number was 1.8 zettabytes. A zettabyte is a 1 with 21 zeroes after it, so the rate of growth over that period will truly be staggering. And this gets us to our final V, Volume. As is likely obvious from looking at the first two V’s, the sheer amount of data that can be collected now is really kind of unfathomable. Google, for example, receives over 2 million search queries IN A SINGLE MINUTE. 72 hours of new video are uploaded to YouTube in a minute. 47,000 apps are downloaded from iTunes every minute.
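To get a feel for these magnitudes, the unit arithmetic can be sketched in Python. The two figures are the ones quoted above (1.8 zettabytes in 2011, 500 terabytes per day for Facebook); the decimal unit definitions and the comparison between the two are assumptions added here for illustration.

```python
# Decimal (SI) byte units; binary units would give slightly different numbers.
TERABYTE = 10**12
ZETTABYTE = 10**21

# Figures quoted in the notes above.
digital_universe_2011 = 18 * ZETTABYTE // 10   # 1.8 zettabytes, in bytes
facebook_per_day = 500 * TERABYTE              # 500 terabytes per day

# How many days at Facebook's quoted rate would it take to reach 1.8 ZB?
days = digital_universe_2011 // facebook_per_day
print(f"{days:,} days, or roughly {days / 365.25:,.0f} years")
```

Even at 500 terabytes a day, it would take thousands of years to accumulate the 2011 total, which gives some sense of what "a 1 with 21 zeroes" means in practice.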
Let’s take a moment before we go any further and discuss the differences between big data and open data. You can see from this Venn diagram that there are big data sets that are not open. These are proprietary datasets in business and other settings where security is an issue, but there are also datasets from scientific and government sources of big data that ARE open. Open government data, conversely, is not all “big,” but there is a great deal of public access to it at the federal, state, and local levels. Furthermore, there are open data sources that are not government sources, such as business and scientific data that are not necessarily “big” but are publicly available. So this should give you an idea of how Big Data and Open Data are related.
Is big data a game changer? First and foremost, big data turns the scientific method on its head. Traditionally, any inquiry or decision starts with a hypothesis. We make an educated guess, and then look for the data to support or contradict this hypothesis. In Big Data analytics, we start with the data, and we look for patterns. This data is unstructured, it can be multidisciplinary, and it can be highly predictive. Also, traditionally an inquiry or decision seeks to find the answer as to WHY the hypothesis is confirmed or rejected. In big data analytics, we identify the patterns without necessarily receiving information as to why those patterns exist.
In his Data Science Central blog, Vincent Granville has identified 9 types of data science specializations.
• Statistics: this area deals with testing and modeling, theoretical approaches, and developing new techniques for approaching large datasets
• Mathematics: slightly different in that these people deal with operations research: optimization, quality control, etc.
• Data Engineering: those strong in data engineering deal mainly with the structure and architecture of databases/filesystems/storage
• Software engineering: know several programming languages and work on code development
• Machine Learning: these experts are the ones that program the algorithms and complex computations
• Business: these are subject experts in terms of determining appropriate metrics, ROI, what to include on a dashboard
• Visualization: charts and graphs, making data analysis understandable to the user or decision maker
• GIS: focuses more exclusively on the spatial representation of data
What big data allows us to do:
• “Human insight at machine scale”
• Identify patterns, but also outliers and unique instances
• Behavioral predictions
• Sentiment analysis
• Activity “hotspots”: geographic, such as the Arab Spring, Google’s flu prediction
• For the social sciences, we can get empirical evidence: surveys are subjective, observational studies are not in the “natural habitat,” etc.
Here are some examples of the amazing things that are being done with big data currently:
• Market-basket research: diapers and beer!
• Broccoli cam: sensors determine when the produce department is out of broccoli and send a worker out to refill
• Nate Silver <- Moneyball: turned the scouting profession on its head
• Netflix <- highly specific classifications of movie genres to create recommendations
• Linguamatics: text mining predicted prime minister election using tweets
• NYC fire inspectors: cataloged 60 pieces of metadata about all inspectable buildings, used to prioritize inspections
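Market-basket research of the diapers-and-beer kind boils down to counting which items show up together on the same receipt. Below is a minimal sketch of that counting; the receipts are invented toy data, and the helper names `pair_counts` and `confidence` are hypothetical, not taken from any retailer's system.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy receipts, invented for illustration only.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


def confidence(item_a, item_b, baskets):
    """Share of baskets containing item_a that also contain item_b."""
    with_a = [b for b in baskets if item_a in b]
    return sum(1 for b in with_a if item_b in b) / len(with_a)


counts = pair_counts(baskets)
print(counts.most_common(2))
print(f"diapers -> beer confidence: {confidence('diapers', 'beer', baskets):.2f}")
```

Note that the count only says the two items travel together. It says nothing about why, which is exactly the caveat raised later in the talk.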
Harper Reed, Obama campaign techie, in an October 2013 article in the Chronicle of Higher Ed Wired Campus blog, says Big Data is “bs.” It is used to generate fear in enterprises to spur equipment upgrades, in other words, to spend money on technology. He says: “you can get a lot of this stuff done just in Excel.” So, just having the capacity for scalability in an enterprise does not mean that you are “doing big data.”
Big data requires more treatment and handling. This includes:
• Data cleansing: dirty data, missing data, more outliers, removing duplicates
• Parsing and treating: extracting data from its original source into something resembling a dataset
• Transformation into a usable format is key
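The cleansing steps listed above can be sketched in a few lines of plain Python. The records, field names, and the `clean_records` helper are all hypothetical and exist only to make the steps concrete; real datasets usually need far more care.

```python
# Hypothetical raw records, invented for illustration only.
raw = [
    {"name": " main branch ", "visits": "120"},
    {"name": "Main Branch", "visits": "120"},   # duplicate once tidied
    {"name": "East Branch", "visits": None},    # missing value
    {"name": "", "visits": "15"},               # missing name
    {"name": "west branch", "visits": "87"},
]


def clean_records(rows):
    """Tidy names, drop incomplete rows, convert counts, remove duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        name = (row.get("name") or "").strip().title()
        visits = row.get("visits")
        if not name or visits is None:
            continue                      # skip rows with missing data
        record = (name, int(visits))      # transform text into a number
        if record in seen:
            continue                      # skip exact duplicates
        seen.add(record)
        cleaned.append({"name": name, "visits": record[1]})
    return cleaned


for record in clean_records(raw):
    print(record)
```

Five messy rows come in and two usable ones come out, which is a fair picture of how much of the work in a data project is cleanup.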
Another issue is false patterns, false correlations. For example, Gene Pease, in his Talent Management Blog, notes that the height of an elementary school student is correlated to his or her reading level. In Jeffrey Stanton’s text Introduction to Data Science he says “bigger means weirder.” So we need to be careful with regard to the assumptions and conclusions we derive from the data. Again, big data is not concerned with the “why” of a pattern; it only identifies that the pattern exists. As one author noted, “when looking at the whole haystack, EVERYTHING looks like a needle.”
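The height and reading-level example can be reproduced with made-up numbers. In the sketch below, both height and reading level are driven by age, an assumed hidden factor, and the data is invented so that height and reading are unrelated among children of the same age. The `pearson` helper is a plain implementation of the standard Pearson correlation coefficient.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented data: (age, height in cm, reading grade level).
# Within each age the height and reading offsets are unrelated by construction.
offsets = [(-2, 0.3), (2, -0.3), (-2, -0.3), (2, 0.3)]
children = [
    (age, 100 + 6 * age + dh, (age - 5) + dr)
    for age in range(6, 12)
    for dh, dr in offsets
]


def correlation(rows):
    return pearson([h for _, h, _ in rows], [r for _, _, r in rows])


overall = correlation(children)
within_age = correlation([c for c in children if c[0] == 8])
print(f"all ages:   r = {overall:.2f}")
print(f"age 8 only: r = {within_age:.2f}")
```

Across all ages the correlation is strong, but among eight-year-olds alone it vanishes: taller children read better only because they are older.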
Big data is first and foremost a decision-making tool. This means that for all the technology and fancy processing, storage, and tools available, without competent subject matter experts to identify data flow in an organization or enterprise, identify the areas where data is lacking, and determine how the data can be used, it’s all for naught. The human element is what turns data into information. So where do we, as information professionals, fit into the equation?
There are a number of directions we, as librarians and information professionals, can pursue as we move into more data-driven activities in our organizations, mainly as an outgrowth of existing skill sets we possess. For example:
• Metadata extraction, creation, classification
• Privacy experts/intellectual freedom
• Quality experts: identify reliable and authoritative data sources and analysis
• Policy advisors for our organization
• Curation/selection
• Storage/management
• Access/gatekeepers
• Assuring data can be turned into information
• Knowledge management
• Competitive intelligence
• “Be the link pulling biz and IT together”
Michelle Hudson of Yale: “Some day we’re all going to be data librarians.”
In its article “Big Data’s Impact in the World,” The New York Times cited a report by the McKinsey Global Institute, the research arm of a well-known consulting firm, which projected that the United States needs 140,000 to 190,000 more workers with “deep analytical” expertise and 1.5 million more data-literate managers, whether retrained or hired. All disciplines are becoming increasingly data intensive, whether political science, sociology, transportation, or the traditional sciences and medicine. As information professionals we have the opportunity to flex our Information Literacy muscles and extend them to Data Literacy. Those of us in higher education can add data literacy to our instructional and consultation activities, and librarians in other capacities can bring their own patrons and stakeholders up to speed on key data concepts: how to collect, store, gather, evaluate, and interpret data. As my colleague Kim Silk of the University of Toronto has said to me: “much as we teach people information and media literacy, data literacy – understanding what the data is telling us, understanding (significant or misleading) statistics, outliers, sample size, correlations – is critical for 21st century citizens.”
Our vendor partners are already getting in on the action. For example, Thomson Reuters’ Eikon desktop analysis software for financial offices has Twitter and news sentiment analysis tools. These are primarily aimed at the financial sector; what they do is allow for assessment of news events and predict the effect on changes in the financial markets. Many other partners are using big data internally to assess the usability of their interfaces, frequency of use of resources, and common search terms. As our vendor partners become more data driven, we will need to be data literate ourselves in order to understand the resources they make available to us, as well as how and why these resources work.
So here, in my opinion, are the major takeaways from the first part of this webinar: We know that velocity, variety, and volume are the hallmarks of big data. Big data isn’t just more of the same data, and it isn’t necessarily tons and tons of data (although often it is). A good rule of thumb is that any dataset that is too big to fit in Excel is “big data” for our purposes. Big data holds the promise of amazing capabilities, through identifying both patterns and outliers in the data we have collected. We can identify behavioral patterns in an empirical way, such as through market-basket analysis, or collect and use new types of metadata to improve safety practices. But this cannot be done without the human element. Technology upgrades are only part of the equation and may not even be necessary: it takes subject matter experts to ask the right questions and to interpret, clean, and collect the data. Finally, as information professionals we have the ability to be involved with data and data issues in a variety of capacities, but our main strength may be in Data Literacy initiatives for our patrons and stakeholders. We did not have time for: stats lessons, privacy issues, computer processes, data structure, etc.
At this point I’d like to move on to recommending some resources for learning more about the topic. I am sure you realize that this presentation has only touched the tip of the iceberg on the topic of Big Data. There are many paths to pursue to learn more, and many specializations to focus on.
Big Data: A Revolution That Will Transform How We Live, Work, and Think is a best seller. I am in the middle of it right now, and it gives a layman’s understanding of the concepts and impact of big data.
There are lots of “for Dummies” books on various aspects of Big Data (Big Data for Dummies, etc.), many free in PDF form from various web sources.
An Introduction to Data Science: an open source (free!) textbook with lots of good information; an easy read with short chapters (available on iTunes).
Frontiers in Massive Data Analysis: a report from the National Academies Press that discusses big data mainly in social science disciplines; free on the web.
I will put the URLs in the SlideShare version of this presentation.
SU guide: data sources, a programming guide, news, associations, and LinkedIn groups; many free sources.
ALA list of resources: the focus is academic, but there are many good articles and a good collection of information.
Data Information Literacy wiki at Purdue: documents the development of a standardized curriculum for data literacy and data science; they are also doing research on the level of data literacy and critical instruction.
A number of schools offer Massive Open Online Courses (MOOCs): Syracuse University offers one periodically, and the University of Washington has one that can be done online. Caltech and MIT have the more technical, computing-focused programs.
Another area where librarians might be called upon for their expertise is information policy and best practices as they relate to data issues: use, storage, sharing, privacy, and so forth. Many of these practices are still being developed.
Council for Big Data, Ethics, and Society: it hasn’t launched yet but is supposed to soon. It is a collaboration with the National Science Foundation. Their website says they intend to “address such issues as security, privacy, equality, and access” to “develop frameworks to help researchers, practitioners and the public understand the social, ethical, legal, and policy issues that underpin the big data phenomenon.” They have a newsletter sign-up, but I have yet to receive anything from it.
Research Data Management Services: primarily for academic libraries, this report deals with storage, access, repositories, and data management in an academic environment, but there may be lessons for other types of libraries as well.
Here are a couple of other resources on big data policy and best practices.
Rebuilding the Mosaic: a report from the National Science Foundation’s Social, Behavioral and Economic Sciences directorate on data-driven research in the social sciences related to world development. It identifies population change, disparities, communications, media, and social networking as areas of focus for the future.
GovLab: a blog on governance policies of science and technology. Search “data” in the search box for some good articles related to big data governance and policy.
Terminology: this is another area that is an issue with current Big Data projects. Computer scientists, social scientists, and statisticians all have different language for the same things: for example, case, instance, and observation all mean the “rows” in a dataset. There is an argument that this ISO standard for statistical terminology should be amended to create a standardized language for data analytics.
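To make the terminology point concrete, here is a small sketch in Python using an invented three-row dataset (the column names and values are hypothetical). Whatever a discipline calls them, cases, instances, and observations are the same thing once the data is loaded: the rows.

```python
import csv
import io

# Hypothetical three-row dataset, invented for illustration only.
raw = """patron_id,visits,branch
1,12,Main
2,3,East
3,7,Main
"""

# Each element of `rows` is one row of the table, stored as a dictionary.
rows = list(csv.DictReader(io.StringIO(raw)))

# Three names from three disciplines, all pointing at the very same rows.
cases = instances = observations = rows

print(len(observations))   # how many rows there are, whatever we call them
print(list(rows[0]))       # the column names
```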
These are some newsletters I find useful that can be delivered to your email inbox. There are tons of these, though, and you may find others on the web that are also useful.
Data Science Weekly: free newsletter on a variety of topics; includes jobs.
Data Science Central: nice blog and newsletter with a broad focus on professional development for the data scientist (or aspiring data scientist).
R-Bloggers: tips and tricks for using the statistical software R.
I forgot to mention the O’Reilly mailing lists. O’Reilly, as you may know, is a publisher of IT manuals and provides blogs and other resources related to technology.
Here are my favorite blogs on the topic, in no particular order.
Hilary Mason: she’s a data scientist who posts interesting articles about data analysis, with lots of visualizations, but also professional development topics for data professionals. She was an innovator who had an extensive role in creating bit.ly, which, among other things, is well known for a tool that converts a long URL into something shorter and more manageable. She speaks a lot and hosts a data-related conference in NYC called DataGotham.
Mathbabe: Cathy O’Neil is a mathematician but not an academic. She has some nice introductory posts for those interested in data science, with less visualization than Hilary; she focuses more on opinion and technique.
Bits Blog: technology and business news from the New York Times.
No Free Hunch: a blog with a problem-solving bent (“the sport of data science”) from Kaggle, a consulting company. They identify fun problems and solve them using data science techniques, and they announce many competitions and challenges where data scientists can strut their stuff.
What’s the Big Data: Gil Press, who has a column at Forbes, focuses on the impact of big data on society, business, government, and IT. Lately he’s done a lot about the market for big data and its influence on business.
Next I would like to show you some interesting tools that you can play with if you want to explore big data and its capabilities for yourself. There are a lot of open source resources that are available and user friendly.
The first thing we will cover is finding datasets. There are a surprising number of sources for datasets out there that are free and online. Some are easier to use than others. I am pointing out three well known or interesting resources, but there are many others I could have included. These three that I have chosen will give you an idea of some of the variety of data that is out there.
Google Public Data Explorer provides many datasets, and Google Trends, which we will talk about later, provides visual displays of data. Most of the public datasets available on Google Public Data Explorer are governmental in nature, as you can see by the list of data providers on the left.
Amazon Web Services – a wide variety of datasets on many interesting topics, many of these are also government sources, but not all
Scale Unlimited is a big data consulting firm that makes some big datasets freely available for testing and modeling purposes. They have a wide variety of datatypes including media, graphic, geographic. One of the datasets contains all of the Enron emails.
These are some tools for creating databases and analyzing or querying your dataset. I must confess I am just learning how these work, so I only have brief explanations of them.
R: an open source, command language tool for statistical analysis. I liked the old DIALOG, so I love R. It has many extension packages that allow a lot of flexibility and precision in data analysis.
Hive/Hadoop: both are open source projects of the Apache Software Foundation. Hadoop, which grew out of ideas Google published about its own systems, allows for distributed computing, also known as parallel processing. Hive is the language and infrastructure that allows you to query the data in Hadoop and do analysis. Its query language is very similar to SQL.
PostgreSQL: an object-relational database management system, used by Etsy and Creative Commons, two organizations I think are very popular with librarians! It is queried with SQL.
Project Bamboo DiRT: open source “digital research tools for scholarly use”, a variety of tools for data management, analysis, visualization, and other topics.
MLcomp: compares and evaluates computer algorithms. Evaluate your algorithm on their existing datasets, or evaluate your dataset to see which algorithm is best to use for it.
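Since SQL-style queries come up for both Hive and PostgreSQL, here is a minimal sketch of what such a query looks like. It is written in Python and uses SQLite, which ships with Python, purely as a stand-in: this is not Hive or PostgreSQL, and the `checkouts` table and its contents are invented for illustration. The query itself is plain SQL of the kind those tools are built around.

```python
import sqlite3

# In-memory SQLite database used as a stand-in; this is not Hive or PostgreSQL.
conn = sqlite3.connect(":memory:")

# Hypothetical table of library checkouts, invented for illustration only.
conn.execute("CREATE TABLE checkouts (title TEXT, branch TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO checkouts VALUES (?, ?, ?)",
    [
        ("Book A", "Main", 2013),
        ("Book A", "East", 2014),
        ("Book B", "Main", 2014),
        ("Book A", "Main", 2014),
    ],
)

# Count 2014 checkouts per title, most popular first.
query = """
    SELECT title, COUNT(*) AS n
    FROM checkouts
    WHERE year = 2014
    GROUP BY title
    ORDER BY n DESC, title
"""
results = conn.execute(query).fetchall()
for title, n in results:
    print(title, n)

conn.close()
```

The same SELECT, WHERE, and GROUP BY pattern is the sort of thing you would type into these tools; what changes is where the data lives and how much of it there is.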
Once you have queried and analyzed your data, you will want to display it in a manner that your patrons or stakeholders will understand and be able to use for making decisions. This is known as data visualization. Here are some cool tools that are free on the web.
PiktoChart – very user friendly data visualization design and editing, as you can see mainly “infographics”
Esri makes geospatial tools, which means they are good at visualizing data that is displayed using maps. For example, here is a map related to commuting times across the US.
Big ML – fee for service, but for datasets under 16 MB you can play with their visualization tools
ManyEyes: from IBM – upload your dataset and create a wide variety of visualizations: maps, histograms, graphs, text based analysis
Google Fusion Tables: a way of visualizing big or multiple datasets held in table format as charts, maps, network graphs, etc.
ChartsBin: with this tool you can create interactive (clickable) visualizations that can be embedded in web pages or exported. They also share their own visualizations from various authoritative sources (government, scholarly journals, technical reports).
iCharts – another nice one that allows for interactive widgets that can be embedded, published on the web, etc.
Maybe you don’t want to get into analysis and just want to see what others are doing. Here are some cool sites that give you a glimpse of what various organizations are doing with big data and the results that they are making available to the public:
CSSeer: combines data from CiteSeer, which is a free bibliometric (citation) analysis tool, with Wikipedia to recommend scholarly experts in a field.
Streetbump- crowd-sourced pothole locator
MyMagic+: coming from Disney. You get a wristband that tracks your every move around the park: what you spend, where you go, how long you wait, what you buy, everything.
Information is Beautiful: independent “data journalist” David McCandless creates just gorgeous visual displays, and then the data is available in Google Docs for anyone to use.
Facebook Blog: fascinating articles and visualizations of what is happening with Facebook data
Google Trends: what people are searching for, with visualizations; its “Zeitgeist” shows what the world searched for in 2013.
Flowing Data: fun visualizations on a variety of topics
GapMinder: Educational bent, describes itself as a “museum” on the internet – focus is on world development: factfinding and needs assessment
Professional development opportunities abound for info pros who wish to get their feet wet in big data and data science. In fact, I am working with a group of SLA members to create a Data Caucus. We are currently working on amending our scope to be compatible with other SLA units, and hope to send out a revised petition soon, so be on the lookout for those emails!
IASSIST: the International Association for Social Science Information Services and Technology is an organization for data users in the social sciences. It is a small but international group of library/information professionals and others, with an emphasis on research and teaching.
ASIS&T: the Association for Information Science and Technology; interdisciplinary and focused on technology.
LinkedIn: check the SU library guide for some LinkedIn groups that deal with data issues.
Thank you for your time and attention today. In a few days I will have these slides up on SlideShare, and they will include hotlinks to the resources I’ve been describing. Don’t forget about the Data Caucus, and I hope you now have some starting points for learning more about Big Data. The term “big data” may be a buzzword, and the practices and principles involved with big data issues are still evolving, but our capacity for ever-increasing volume, velocity, and variety of data is not going to disappear any time soon. Do we have time for a few questions?

Editor’s notes