Sailing on the ocean of 1s and 0s

Chris Woodruff
cwoodruff@live.com
Blog: http://chriswoodruff.com
Technical Architect, Perficient
Coordinator, Grand Rapids DevDay
INETA Director
Co-host of the Deep Fried Bytes tech podcast, http://deepfriedbytes.com

Where are we sailing today?
Let's look at data
Go on to making data valuable
Look at ways to share data
Finally, let's talk about making data look good

Science Paradigms
Thousands of years ago: science was empirical, describing
Hundreds of years ago: theoretical, using models
Last few decades: computational, using simulations
Today (eScience): data exploration, unified theory
Data is generated by instruments or simulations
Scientists analyze the data after it has been curated
From The Fourth Paradigm: Data-Intensive Scientific Discovery

Before we get into the water, let's talk about the Digital Ocean: the Internet

Why the Internet Won
Simple architecture - HTML, URI, HTTP
Networked - value grows with data, services, users
Extensible - from Web of documents to ...
Tolerant - even with imperfect markup, data, links, software
Universal - independent of systems and people
Free / cheap - browsers, information, services
Simple / powerful / productive for users - text, graphics, links
Open standards

What is Data?
The term data refers to qualitative or quantitative attributes of a variable or set of variables. Data (plural of "datum") are typically the results of measurements and can be the basis of graphs, images, or observations of a set of variables. Data are often viewed as the lowest level of abstraction from which information and then knowledge are derived. Raw data, i.e. unprocessed data, refers to a collection of numbers, characters, images or other outputs from devices that collect information to convert physical quantities into symbols.

What really is Data? Information that has no meaning or understanding.

How much data is generated on the Internet every year/month/day?

How much data is moved on the Internet every month/day?
21 exabytes per month
Around 675 petabytes per day
The amount of data produced each year would fill 37,000 libraries the size of the Library of Congress (2003).

Exabyte: a quintillion (or a million trillion) bytes, or units of computer data. One exabyte is equivalent to 50,000 years’ worth of DVD-quality video.
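As a quick sanity check on the traffic figures, the monthly number can be converted to a daily one. This is a minimal sketch that assumes decimal (SI) units and a 30- or 31-day month; the slides do not say which convention was used, and the function name is only illustrative.

```python
# Assumption: decimal (SI) units, since the slides do not state a convention.
BYTES_PER_PETABYTE = 10**15
BYTES_PER_EXABYTE = 10**18


def petabytes_per_day(exabytes_per_month: float, days_in_month: int = 31) -> float:
    """Convert a monthly volume in exabytes to a daily volume in petabytes."""
    total_bytes = exabytes_per_month * BYTES_PER_EXABYTE
    return total_bytes / days_in_month / BYTES_PER_PETABYTE


if __name__ == "__main__":
    for days in (30, 31):
        print(f"{days}-day month: {petabytes_per_day(21, days):.1f} PB per day")
```

Under these assumptions, 21 exabytes per month works out to roughly 677 to 700 petabytes per day, which is in line with the "around 675" quoted on the slide.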

How much data does Twitter produce?
Twitter users are averaging 27.3 million tweets per day, an annual run rate of 10 billion tweets, according to data from Pingdom.
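The annual run rate follows directly from the daily average. A small sketch, assuming a 365-day year (the helper name is hypothetical):

```python
def annual_run_rate(per_day: float, days_per_year: int = 365) -> float:
    """Project a daily average over a full year."""
    return per_day * days_per_year


if __name__ == "__main__":
    tweets_per_year = annual_run_rate(27.3e6)
    print(f"{tweets_per_year / 1e9:.2f} billion tweets per year")
```

27.3 million tweets per day comes to just under 10 billion per year, which matches the rounded figure quoted above.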

How much data is Facebook generating?
More than 30 billion pieces of content (web links, news stories, blog posts, notes, photo albums, etc.) are shared each month.
The average user creates 90 pieces of content each month.

Internet users are generating petabytes of data every day

How much Data does your organization produce?

Definition “Data curation is the selection, preservation, maintenance, collection and archiving of digital assets.”

What is involved in Data Curation?
Collecting verifiable digital assets
Providing digital asset search and retrieval
Certification of the trustworthiness and integrity of the collection content
Semantic and ontological continuity and comparability of the collection content
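One common way to support the integrity part of curation is a checksum manifest: record a hash of every asset when it is collected, then re-check it later. The sketch below is an illustration under assumptions, not a method from the talk; it keeps assets in memory as bytes, uses SHA-256, and the function names are hypothetical.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 digest of an asset as a hex string."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(assets: dict[str, bytes]) -> dict[str, str]:
    """Record a checksum for every asset at collection time."""
    return {name: sha256_of(content) for name, content in assets.items()}


def verify(assets: dict[str, bytes], manifest: dict[str, str]) -> list[str]:
    """Return the names of assets that are missing or no longer match."""
    failed = []
    for name, expected in manifest.items():
        content = assets.get(name)
        if content is None or sha256_of(content) != expected:
            failed.append(name)
    return failed


if __name__ == "__main__":
    collection = {"readings.csv": b"depth,temp\n10,4.2\n", "notes.txt": b"calm seas"}
    manifest = build_manifest(collection)
    collection["readings.csv"] = b"depth,temp\n10,9.9\n"  # simulate silent corruption
    print(verify(collection, manifest))
```

In a real collection the bytes would come from files or object storage, and the manifest itself would need to be stored somewhere trustworthy.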

Challenges of Data Curation
Storage format evolution and obsolescence
Rate of creation of new data and data sets
Broad access and searching flexibility and variety
Comparability of semantic and ontological definitions of data sets

Setting up a Curation Process
Identify what data you need to curate
Identify who will curate the data
Define the curation workflow
Identify the most appropriate data-in and data-out formats
Identify the artifacts, tools, and processes needed to support the curation process

Tools to Curate Data
Physical: SQL databases, wikis, SharePoint, data warehouses
Collaborative: DBpedia, Azure DataMarket
Semantics!!

XML Schema is a language for providing and restricting the structure and content of elements contained within XML documents.

RDF is a simple language for expressing data models, which refer to objects ("resources") and their relationships.
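To make the RDF data model concrete, here is a minimal sketch that represents statements as subject/predicate/object triples and writes them out in N-Triples syntax. The `http://example.org/` IRIs and the `presentedBy` property are made up for illustration; the `IRI` and `Triple` classes are helper types invented for this sketch, not part of any RDF library.

```python
from typing import NamedTuple


class IRI(str):
    """A resource identifier; a plain str is treated as a literal instead."""


class Triple(NamedTuple):
    subject: IRI
    predicate: IRI
    obj: object  # an IRI for a resource, or a plain str for a literal value


def to_ntriples(triple: Triple) -> str:
    """Serialize one triple as a line of N-Triples (minimal escaping only)."""
    def term(value) -> str:
        if isinstance(value, IRI):
            return f"<{value}>"
        escaped = (str(value).replace("\\", "\\\\").replace('"', '\\"')
                   .replace("\n", "\\n").replace("\r", "\\r"))
        return f'"{escaped}"'

    return f"{term(triple.subject)} {term(triple.predicate)} {term(triple.obj)} ."


EX = "http://example.org/"  # hypothetical namespace
FOAF_NAME = IRI("http://xmlns.com/foaf/0.1/name")

GRAPH = [
    Triple(IRI(EX + "talk1"), IRI(EX + "presentedBy"), IRI(EX + "chris")),
    Triple(IRI(EX + "chris"), FOAF_NAME, "Chris Woodruff"),
]

if __name__ == "__main__":
    for t in GRAPH:
        print(to_ntriples(t))
```

The first triple links two resources; the second attaches a literal value to a resource. Everything else in the semantic web stack builds on this shape.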

RDF Schema extends RDF and is a vocabulary for describing properties and classes of RDF-based resources, with semantics for generalized-hierarchies of such properties and classes.
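The class hierarchy semantics can be illustrated with a toy inference loop: `rdfs:subClassOf` is transitive, and an instance of a class is also an instance of its superclasses. This is a sketch under assumptions, not a full RDFS reasoner; the `ex:` names are hypothetical and prefixed names are kept as plain strings for brevity.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def rdfs_closure(triples):
    """Apply subclass transitivity and type inheritance until nothing new appears."""
    closure = set(triples)
    while True:
        derived = set()
        for s, p, o in closure:
            if p != SUBCLASS_OF:
                continue
            for s2, p2, o2 in closure:
                if p2 == SUBCLASS_OF and s2 == o:
                    # s subClassOf o, o subClassOf o2  =>  s subClassOf o2
                    derived.add((s, SUBCLASS_OF, o2))
                elif p2 == RDF_TYPE and o2 == s:
                    # s2 type s, s subClassOf o  =>  s2 type o
                    derived.add((s2, RDF_TYPE, o))
        if derived <= closure:
            return closure
        closure |= derived


if __name__ == "__main__":
    facts = [
        ("ex:Sloop", SUBCLASS_OF, "ex:Sailboat"),
        ("ex:Sailboat", SUBCLASS_OF, "ex:Boat"),
        ("ex:myBoat", RDF_TYPE, "ex:Sloop"),
    ]
    for triple in sorted(rdfs_closure(facts)):
        print(triple)
```

From three stated facts the loop derives that `ex:myBoat` is also an `ex:Sailboat` and an `ex:Boat`, which is the kind of generalization hierarchy the description refers to.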

OWL adds more vocabulary for describing properties and classes: among others, relations between classes (e.g. disjointness), cardinality (e.g. "exactly one"), equality, richer typing of properties, characteristics of properties (e.g. symmetry), and enumerated classes.
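Two of the constructs mentioned, symmetric properties and disjoint classes, can be mimicked in a few lines. This is only a toy illustration of the ideas, not an OWL reasoner; the `ex:` names are hypothetical, and the set of symmetric properties and disjoint pairs is passed in by hand rather than read from an ontology.

```python
RDF_TYPE = "rdf:type"


def apply_symmetry(triples, symmetric_properties):
    """For each symmetric property, add the reversed statement."""
    result = set(triples)
    for s, p, o in triples:
        if p in symmetric_properties:
            result.add((o, p, s))
    return result


def disjointness_violations(triples, disjoint_pairs):
    """Return resources typed as members of two classes declared disjoint."""
    types = {}
    for s, p, o in triples:
        if p == RDF_TYPE:
            types.setdefault(s, set()).add(o)
    return sorted(
        resource
        for resource, classes in types.items()
        if any(a in classes and b in classes for a, b in disjoint_pairs)
    )


if __name__ == "__main__":
    facts = [
        ("ex:anna", "ex:crewmateOf", "ex:ben"),
        ("ex:anna", RDF_TYPE, "ex:Person"),
        ("ex:anna", RDF_TYPE, "ex:Boat"),
    ]
    print(sorted(apply_symmetry(facts, {"ex:crewmateOf"})))
    print(disjointness_violations(facts, [("ex:Person", "ex:Boat")]))
```

The symmetry step adds the statement that `ex:ben` is a crewmate of `ex:anna`; the disjointness check flags `ex:anna` because nothing should be both a person and a boat.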

SPARQL is a protocol and query language for semantic web data sources.
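At the heart of a SPARQL query is a set of triple patterns with variables that are matched against the data. The sketch below imitates that matching in plain Python so the idea is visible without a triple store. It is not SPARQL itself and does not use any SPARQL library; the data, the `ex:` names, and the function names are all hypothetical. The equivalent query is shown in the comment.

```python
# Roughly equivalent SPARQL (prefix declarations omitted):
#   SELECT ?talk ?name WHERE {
#     ?talk   ex:presentedBy ?person .
#     ?person foaf:name      ?name .
#   }

def match_pattern(pattern, triple, bindings):
    """Unify one triple pattern with one triple; return new bindings or None."""
    result = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the same value everywhere it appears.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


def query(triples, patterns):
    """Return every variable binding that satisfies all patterns."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for bindings in solutions:
            for triple in triples:
                extended = match_pattern(pattern, triple, bindings)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


TRIPLES = [
    ("ex:talk1", "ex:presentedBy", "ex:chris"),
    ("ex:chris", "foaf:name", "Chris Woodruff"),
    ("ex:talk2", "ex:presentedBy", "ex:unknown"),
]
PATTERNS = [
    ("?talk", "ex:presentedBy", "?person"),
    ("?person", "foaf:name", "?name"),
]

if __name__ == "__main__":
    for row in query(TRIPLES, PATTERNS):
        print(row["?talk"], row["?name"])
```

Only `ex:talk1` is returned, because `ex:talk2` has a presenter with no recorded name. A real SPARQL endpoint adds the protocol side (sending the query over HTTP) plus filters, optional patterns, and much more.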

The Key to “Open Data”?
Shared, agreed-upon protocols
Metadata
Shared vocabularies

Produce Great Graphical Information

Minard's Diagram of Napoleon's March on Moscow

Have Integrity in your Graphical Information
Edward Tufte’s Lie Factor
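Tufte defines the Lie Factor as the size of the effect shown in the graphic divided by the size of the effect in the data, where each effect is a relative change. A value near 1 means the graphic is honest. The sketch below computes it; the function names are illustrative, and the sample numbers are the commonly cited fuel-economy chart from Tufte's book (a line growing from 0.6 to 5.3 inches to show a change from 18 to 27.5 miles per gallon).

```python
def relative_change(first: float, second: float) -> float:
    """Size of an effect as a fraction of the starting value."""
    return abs(second - first) / abs(first)


def lie_factor(graphic_first: float, graphic_second: float,
               data_first: float, data_second: float) -> float:
    """Effect shown in the graphic divided by the effect present in the data."""
    return (relative_change(graphic_first, graphic_second)
            / relative_change(data_first, data_second))


if __name__ == "__main__":
    factor = lie_factor(0.6, 5.3, 18.0, 27.5)
    print(f"Lie factor: {factor:.1f}")
```

For that chart the graphic exaggerates the change by a factor of about 14.8.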

Have Context with your Graphical Information

Wrap Up
Think about your data
Learn more about how your users work with the data you curate
Learn about better ways to share your data
Visualize and show the information in your data in the way that works best for your users
Be a Data Experience Expert
