SlideShare une entreprise Scribd logo
1  sur  14
IASSIST 2013
The <Meta>Data Dance
The <Meta/>Data Dance
The <Meta>Data Dance
 Tools to facilitate statistical/scientific data management
 Complement existing tools (not replace)
 Facilitate adoption of standards and promote best practices
 Alleviate the need to master complexities of metadata standards,
technologies
 Vision: web based services (cloud users), stand alone desktop /server
products (agencies/enclaves), libraries (developers)
 Broad Target audience: Data Producers / Manager / Researchers /
Users
 DataForge Today (lean model):
 Simple desktop based command line tools
 SledgeHammer - Data/metadata transformation
 LavaCore - Java library
 Caelum - Reporting / Publication
 DataForge Tomorrow:
 DataForge Online, Web Services, Desktop w/UI, Java Library
 Data Liberation
 Unlock from proprietary formats, turn into open data
 Produce metadata + convert data into standard ASCII formats
 Open <Meta>Data Integration
 Generating scripts for loading into stats packages, databases, big
data engines, cloud
 Use with open systems / environments
__ _ _
/ _| | ___ __| | __ _ ___ / /__ _ _ __ ___ _ __ ___ ___ _ __
  | |/ _ / _` |/ _` |/ _ / /_/ / _` | '_ ` _ | '_ ` _  / _  '__|
_ | | __/ (_| | (_| | __/ __ / (_| | | | | | | | | | | | __/ |
__/|_|___|__,_|__, |___/ /_/ __,_|_| |_| |_|_| |_| |_|___|_|
|___/
triple-s
syntax
script
 Challenges
 Read proprietary formats
 Data integrity
 ASCII optimization
 Cross package formatting issues
 SQL Database limits (columns, indexes, views)
 Performance (Timings, size, memory usage)
 …
 Key Features
 Input: SPSS, Stata, ASCII+SAS syntax, ASCII+DDI-C / DDI-L,
ASCII+SSS 1.1/2.0, ASCII+StatTransfer
 Data Out: Fixed w/optimization, CSV, Delimited
 Metadata Out: DDI-C 1.2.2/Nesstar/2.1/2.5, DDI-L 3.1-RP/3.1-
SU/Colectica, Triple-S 1.1/2.0
 Summary Stats: Min / Max /Valid/Invalid /Mean, StdDev
/Variance / Frequencies, Weighted *, Save in CSV*.
 Scripts Generators: SAS, SPSS, Stata, R, Mysql, Oracle*, MS-
SQL*, HSQL*, Google BigQuery*, HP/Vertica* +
dimension/lookup tables, indexes/foreign keys*
 See User’s Guide at http://goo.gl/9xukB
 Editions
 Freeware
 Anonymous: 100 variables / 1,000 obs.
 Registered: 500 variables / 5000 obs. (or contact us)
 Pro/Enterprise:
 4Q2013
 No limits, all features, disable usage report
 pricing/licensing t.b.d.
 In private beta (contact us for info)
 LavaCore:
 Currently internal only, planning for OEM
 Future Features (demand driven)
 Desktop UI
 Support additional input/output formats (SAS, cloud)
 Support missing value
 Additional ID generators
 Compute subsets
 Synthetic data generators
 Disclosure control
 ???
 DataForge Online (2014)
 Free and pay as you go services
 DataForge Services (available today)
 Community support in Google groups
 MTNA: tech support, custom developments
 IDMS: data/metadata management / processing
 MTNA Project Stories
 Generated DDI for American National Election Studies (ANES)
 Loaded US General Social Survey (1972-2010) surveys into Big
Query (5K+ variables, 55K records, 2.2Gb, query in seconds)
 Generated MS-SQL and loaded data for KUSP Translating
Research in Elder Care (TREC) surveys (U. Alberta, Canada)
 Integration in NSF / NORC SED Manager (LavaCore)
 Integration in Canada RDC Dataset Builder (LavaCore)
 …
 Caelum: A constellation in the southern sky, “chisel” in latin. Formerly known as
Caelum Scalptorium, "the engraver's chisel”.
 Rationale: I have DDI, then what…. Not everyone is an XML/technology expert.
 Use cases: Produce HTML/PDF reports, quality assurance, XML conversion,
etc..
 How does it work: Command line tool that wraps standards XML and other
transformation technologies. Can be used with any XML
 Comes with demo transforms with objective to foster community contributions
 Download at http://www.openmetadata.org/dataforge/caelum
http://www.openmetadata.org/dataforge
DEMO

Contenu connexe

Tendances

Introduction to Sql on Hadoop
Introduction to Sql on HadoopIntroduction to Sql on Hadoop
Introduction to Sql on HadoopSamuel Yee
 
Basic Concept of Database
Basic Concept of DatabaseBasic Concept of Database
Basic Concept of DatabaseMarlon Jamera
 
Database system concepts
Database system conceptsDatabase system concepts
Database system conceptsKumar
 
Modern Database Systems - Lecture 00
Modern Database Systems - Lecture 00Modern Database Systems - Lecture 00
Modern Database Systems - Lecture 00Michael Mathioudakis
 
Artifacts, Data Dictionary, Data Modeling, Data Wrangling
Artifacts, Data Dictionary, Data Modeling, Data WranglingArtifacts, Data Dictionary, Data Modeling, Data Wrangling
Artifacts, Data Dictionary, Data Modeling, Data WranglingFaisal Akbar
 
Types of databases
Types of databasesTypes of databases
Types of databasesPAQUIAAIZEL
 
DATABASE MANAGEMENT SYSTEM
DATABASE MANAGEMENT SYSTEMDATABASE MANAGEMENT SYSTEM
DATABASE MANAGEMENT SYSTEMswathi chouti
 
Advance Database Management Systems -Object Oriented Principles In Database
Advance Database Management Systems -Object Oriented Principles In DatabaseAdvance Database Management Systems -Object Oriented Principles In Database
Advance Database Management Systems -Object Oriented Principles In DatabaseSonali Parab
 
Arches Getty Brownbag Talk
Arches Getty Brownbag TalkArches Getty Brownbag Talk
Arches Getty Brownbag Talkbenosteen
 
Slide 2 data models
Slide 2 data modelsSlide 2 data models
Slide 2 data modelsVisakh V
 
Introduction to Database Concepts
Introduction to Database ConceptsIntroduction to Database Concepts
Introduction to Database ConceptsRosalyn Lemieux
 
Chapter 5: Database Systems, Data Centers, and Business Intelligence
Chapter 5: Database Systems, Data Centers, and Business IntelligenceChapter 5: Database Systems, Data Centers, and Business Intelligence
Chapter 5: Database Systems, Data Centers, and Business Intelligencephak_09
 
Data standardization process for social sciences and humanities
Data standardization process for social sciences and humanitiesData standardization process for social sciences and humanities
Data standardization process for social sciences and humanitiesvty
 

Tendances (20)

Digital Types
Digital TypesDigital Types
Digital Types
 
Introduction to Sql on Hadoop
Introduction to Sql on HadoopIntroduction to Sql on Hadoop
Introduction to Sql on Hadoop
 
Modern database
Modern databaseModern database
Modern database
 
Basic Concept of Database
Basic Concept of DatabaseBasic Concept of Database
Basic Concept of Database
 
Database system concepts
Database system conceptsDatabase system concepts
Database system concepts
 
Modern Database Systems - Lecture 00
Modern Database Systems - Lecture 00Modern Database Systems - Lecture 00
Modern Database Systems - Lecture 00
 
Artifacts, Data Dictionary, Data Modeling, Data Wrangling
Artifacts, Data Dictionary, Data Modeling, Data WranglingArtifacts, Data Dictionary, Data Modeling, Data Wrangling
Artifacts, Data Dictionary, Data Modeling, Data Wrangling
 
Database
DatabaseDatabase
Database
 
Types of databases
Types of databasesTypes of databases
Types of databases
 
Databases and types of databases
Databases and types of databasesDatabases and types of databases
Databases and types of databases
 
DATABASE MANAGEMENT SYSTEM
DATABASE MANAGEMENT SYSTEMDATABASE MANAGEMENT SYSTEM
DATABASE MANAGEMENT SYSTEM
 
Design approach
Design approachDesign approach
Design approach
 
Advance Database Management Systems -Object Oriented Principles In Database
Advance Database Management Systems -Object Oriented Principles In DatabaseAdvance Database Management Systems -Object Oriented Principles In Database
Advance Database Management Systems -Object Oriented Principles In Database
 
Arches Getty Brownbag Talk
Arches Getty Brownbag TalkArches Getty Brownbag Talk
Arches Getty Brownbag Talk
 
Slide 2 data models
Slide 2 data modelsSlide 2 data models
Slide 2 data models
 
Introduction to Database Concepts
Introduction to Database ConceptsIntroduction to Database Concepts
Introduction to Database Concepts
 
database and database types
database and database typesdatabase and database types
database and database types
 
Chapter 5: Database Systems, Data Centers, and Business Intelligence
Chapter 5: Database Systems, Data Centers, and Business IntelligenceChapter 5: Database Systems, Data Centers, and Business Intelligence
Chapter 5: Database Systems, Data Centers, and Business Intelligence
 
Data standardization process for social sciences and humanities
Data standardization process for social sciences and humanitiesData standardization process for social sciences and humanities
Data standardization process for social sciences and humanities
 
Database Management & Models
Database Management & ModelsDatabase Management & Models
Database Management & Models
 

Similaire à MTNA DataForge

ETL Developer Resume
ETL Developer ResumeETL Developer Resume
ETL Developer ResumeTeferi Tamiru
 
WaterlooHiveTalk
WaterlooHiveTalkWaterlooHiveTalk
WaterlooHiveTalknzhang
 
Sophia wang (2)
Sophia wang (2)Sophia wang (2)
Sophia wang (2)Dibakar27
 
vinay reddy resume 2yrs
vinay reddy resume 2yrsvinay reddy resume 2yrs
vinay reddy resume 2yrsVinay Reddy
 
Akshita_Resume
Akshita_ResumeAkshita_Resume
Akshita_ResumeAkshita .
 
Secrets of Enterprise Data Mining: SQL Saturday Oregon 201411
Secrets of Enterprise Data Mining: SQL Saturday Oregon 201411Secrets of Enterprise Data Mining: SQL Saturday Oregon 201411
Secrets of Enterprise Data Mining: SQL Saturday Oregon 201411Mark Tabladillo
 
Samuel Bayeta
Samuel BayetaSamuel Bayeta
Samuel BayetaSam B
 
JaiHegdeResume
JaiHegdeResumeJaiHegdeResume
JaiHegdeResumeJai Hegde
 
Deepak Sharma_ETL_Programmer_2016
Deepak Sharma_ETL_Programmer_2016Deepak Sharma_ETL_Programmer_2016
Deepak Sharma_ETL_Programmer_2016Deepak Sharma
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkDatabricks
 
DATABASE ADMIN RESUME 2016L MICROSOFT SQL SERVER 2008 MCTS (1)
DATABASE ADMIN  RESUME 2016L   MICROSOFT SQL SERVER 2008 MCTS (1)DATABASE ADMIN  RESUME 2016L   MICROSOFT SQL SERVER 2008 MCTS (1)
DATABASE ADMIN RESUME 2016L MICROSOFT SQL SERVER 2008 MCTS (1)Arthur P. Clay III
 
GCharles_Resume_Summer_2016_SS_Short
GCharles_Resume_Summer_2016_SS_ShortGCharles_Resume_Summer_2016_SS_Short
GCharles_Resume_Summer_2016_SS_Shortsshgc
 
Azhor kramp
Azhor krampAzhor kramp
Azhor krampazhor m
 

Similaire à MTNA DataForge (20)

jagadeesh updated
jagadeesh updatedjagadeesh updated
jagadeesh updated
 
ETL Developer Resume
ETL Developer ResumeETL Developer Resume
ETL Developer Resume
 
WaterlooHiveTalk
WaterlooHiveTalkWaterlooHiveTalk
WaterlooHiveTalk
 
Sophia wang (2)
Sophia wang (2)Sophia wang (2)
Sophia wang (2)
 
vinay reddy resume 2yrs
vinay reddy resume 2yrsvinay reddy resume 2yrs
vinay reddy resume 2yrs
 
Akshita_Resume
Akshita_ResumeAkshita_Resume
Akshita_Resume
 
Bigdata
BigdataBigdata
Bigdata
 
Secrets of Enterprise Data Mining: SQL Saturday Oregon 201411
Secrets of Enterprise Data Mining: SQL Saturday Oregon 201411Secrets of Enterprise Data Mining: SQL Saturday Oregon 201411
Secrets of Enterprise Data Mining: SQL Saturday Oregon 201411
 
Samuel Bayeta
Samuel BayetaSamuel Bayeta
Samuel Bayeta
 
JaiHegdeResume
JaiHegdeResumeJaiHegdeResume
JaiHegdeResume
 
Deepak Sharma_ETL_Programmer_2016
Deepak Sharma_ETL_Programmer_2016Deepak Sharma_ETL_Programmer_2016
Deepak Sharma_ETL_Programmer_2016
 
B040101007012
B040101007012B040101007012
B040101007012
 
BigData
BigDataBigData
BigData
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
 
Azure HDInsight
Azure HDInsightAzure HDInsight
Azure HDInsight
 
DATABASE ADMIN RESUME 2016L MICROSOFT SQL SERVER 2008 MCTS (1)
DATABASE ADMIN  RESUME 2016L   MICROSOFT SQL SERVER 2008 MCTS (1)DATABASE ADMIN  RESUME 2016L   MICROSOFT SQL SERVER 2008 MCTS (1)
DATABASE ADMIN RESUME 2016L MICROSOFT SQL SERVER 2008 MCTS (1)
 
GCharles_Resume_Summer_2016_SS_Short
GCharles_Resume_Summer_2016_SS_ShortGCharles_Resume_Summer_2016_SS_Short
GCharles_Resume_Summer_2016_SS_Short
 
Azhor kramp
Azhor krampAzhor kramp
Azhor kramp
 
KarenResumeDBA
KarenResumeDBAKarenResumeDBA
KarenResumeDBA
 
KarenResumeDBA
KarenResumeDBAKarenResumeDBA
KarenResumeDBA
 

MTNA DataForge

  • 5.  Tools to facilitate statistical/scientific data management  Complement existing tools (not replace)  Facilitate adoption of standards and promote best practices  Alleviate the need to master complexities of metadata standards, technologies  Vision: web based services (cloud users), stand alone desktop /server products (agencies/enclaves), libraries (developers)  Broad Target audience: Data Producers / Manager / Researchers / Users  DataForge Today (lean model):  Simple desktop based command line tools  SledgeHammer - Data/metadata transformation  LavaCore - Java library  Caelum - Reporting / Publication  DataForge Tomorrow:  DataForge Online, Web Services, Desktop w/UI, Java Library
  • 6.  Data Liberation  Unlock from proprietary formats, turn into open data  Produce metadata + convert data into standard ASCII formats  Open <Meta>Data Integration  Generating scripts for loading into stats packages, databases, big data engines, cloud  Use with open systems / environments __ _ _ / _| | ___ __| | __ _ ___ / /__ _ _ __ ___ _ __ ___ ___ _ __ | |/ _ / _` |/ _` |/ _ / /_/ / _` | '_ ` _ | '_ ` _ / _ '__| _ | | __/ (_| | (_| | __/ __ / (_| | | | | | | | | | | | __/ | __/|_|___|__,_|__, |___/ /_/ __,_|_| |_| |_|_| |_| |_|___|_| |___/
  • 8.  Challenges  Read proprietary formats  Data integrity  ASCII optimization  Cross package formatting issues  SQL Database limits (columns, indexes, views)  Performance (Timings, size, memory usage)  …
  • 9.  Key Features  Input: SPSS, Stata, ASCII+SAS syntax, ASCII+DDI-C / DDI-L, ASCII+SSS 1.1/2.0, ASCII+StatTransfer  Data Out: Fixed w/optimization, CSV, Delimited  Metadata Out: DDI-C 1.2.2/Nesstar/2.1/2.5, DDI-L 3.1-RP/3.1- SU/Colectica, Triple-S 1.1/2.0  Summary Stats: Min / Max /Valid/Invalid /Mean, StdDev /Variance / Frequencies, Weighted *, Save in CSV*.  Scripts Generators: SAS, SPSS, Stata, R, Mysql, Oracle*, MS- SQL*, HSQL*, Google BigQuery*, HP/Vertica* + dimension/lookup tables, indexes/foreign keys*  See User’s Guide at http://goo.gl/9xukB
  • 10.  Editions  Freeware  Anonymous: 100 variables / 1,000 obs.  Registered: 500 variables / 5000 obs. (or contact us)  Pro/Enterprise:  4Q2013  No limits, all features, disable usage report  pricing/licensing t.b.d.  In private beta (contact us for info)  LavaCore:  Currently internal only, planning for OEM
  • 11.  Future Features (demand driven)  Desktop UI  Support additional input/output formats (SAS, cloud)  Support missing value  Additional ID generators  Compute subsets  Synthetic data generators  Disclosure control  ???  DataForge Online (2014)  Free and pay as you go services  DataForge Services (available today)  Community support in Google groups  MTNA: tech support, custom developments  IDMS: data/metadata management / processing
  • 12.  MTNA Project Stories  Generated DDI for American National Election Studies (ANES)  Loaded US General Social Survey (1972-2010) surveys into Big Query (5K+ variables, 55K records, 2.2Gb, query in seconds)  Generated MS-SQL and loaded data for KUSP Translating Research in Elder Care (TREC) surveys (U. Alberta, Canada)  Integration in NSF / NORC SED Manager (LavaCore)  Integration in Canada RDC Dataset Builder (LavaCore)  …
  • 13.  Caelum: A constellation in the southern sky, “chisel” in latin. Formerly known as Caelum Scalptorium, "the engraver's chisel”.  Rationale: I have DDI, then what…. Not everyone is an XML/technology expert.  Use cases: Produce HTML/PDF reports, quality assurance, XML conversion, etc..  How does it work: Command line tool that wraps standards XML and other transformation technologies. Can be used with any XML  Comes with demo transforms with objective to foster community contributions  Download at http://www.openmetadata.org/dataforge/caelum