Slides from webinar: Provenance and social science data. Presented on 15 March 2017. Presenter was Prof George Alter, Research Professor, ICPSR, and visiting Professor, ANU
FULL webinar recording: https://youtu.be/elPcKqWoOPg
The C2Metadata Project is producing new tools that will work with common statistical packages (e.g. R and SPSS) to automate the capture of metadata describing variable transformations. Software-independent data transformation descriptions will be added to metadata in two internationally accepted standards: DDI and Ecological Metadata Language (EML). These tools will create efficiencies and reduce the costs of data collection, preparation, and re-use. The project is of special interest to the social sciences, which have strong metadata standards and rely heavily on statistical analysis software.
Documenting Data Transformations
1. “Provenance and Social Science Data”
15 March 2017
Documenting Data Transformations
George Alter, University of Michigan
2. Why Metadata?
• Data are useless without metadata (“data about data”)
• Metadata should:
– Include all information about data creation
– Describe transformations to variables
– Be easy to create
• Our goal: Automated capture of metadata
3. A few words about ICPSR
• World’s largest archive of social science data
• Consortium established 1962
• 760+ member institutions around the world
• Founding member and home office for the DDI Alliance
4. Powered by DDI Metadata
ICPSR is building search tools based upon Data Documentation Initiative (DDI) XML.
Codebooks (pdf and online) are rendered from the DDI.
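As a rough illustration of what "rendered from the DDI" can mean, the sketch below turns one variable description into a plain-text codebook entry. The XML fragment is a simplified, namespace-free example modelled on DDI Codebook element names (var, labl, qstn, qstnLit, universe, catgry, catValu); the variable V1 and its contents are hypothetical, and render_entry is an assumed helper name, not part of any ICPSR tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment: real DDI files use an XML namespace
# and carry many more elements than shown here.
DDI_FRAGMENT = """
<var name="V1">
  <labl>Employment status</labl>
  <qstn><qstnLit>Are you currently employed?</qstnLit></qstn>
  <universe>All respondents</universe>
  <catgry><catValu>1</catValu><labl>Yes</labl></catgry>
  <catgry><catValu>2</catValu><labl>No</labl></catgry>
</var>
"""


def render_entry(xml_text):
    """Render one variable description as a plain-text codebook entry."""
    var = ET.fromstring(xml_text)
    lines = [
        f"{var.get('name')}: {var.findtext('labl')}",
        f"  Question: {var.findtext('qstn/qstnLit')}",
        f"  Universe: {var.findtext('universe')}",
    ]
    # One line per response category: code value, then its label.
    for category in var.findall("catgry"):
        lines.append(f"  {category.findtext('catValu')} = {category.findtext('labl')}")
    return "\n".join(lines)


print(render_entry(DDI_FRAGMENT))
```

Because the entry is generated from the metadata, the same source could in principle feed a pdf codebook, an online codebook, and a search index.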
6. Online codebook shows variable in context of dataset
Link to online crosstab tool
What question was asked?
How was the question coded?
Link to online graph tool
9. Search for datasets with 3 desired variables
Check boxes for variable comparison
10. Crosswalk for American National Election Study (ANES) and General Social Survey (GSS)
Columns link to 70 datasets
134 tags in 8 lists
Variable comparison display
Variables linked to online codebooks
11. Metadata for the American National Election Study
What question was asked?
Who answered this question?
How was the question coded?
12. Metadata for the American National Election Study
Who answered this question?
How do we know who answered the question?
It’s in the pdf.
13. When data arrive at the archive…
• No question text
• No interview flow (question order, skip pattern)
• No variable provenance
• Data transformations are not documented
14. How is research data created?
• Most surveys are conducted with computer-assisted interview (CAI) software
– CATI – Computer-assisted Telephone Interview
– CAPI – Computer-assisted Personal Interview
– CAWI – Computer-assisted Web Interview
• There is no paper questionnaire
• The CAI program is the questionnaire
– i.e. the program is the metadata
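To make "the program is the metadata" concrete, here is a minimal sketch of how a skip pattern held in a CAI-style questionnaire definition already answers "Who answered this question?". The Question class, the ask_if field, and the two example questions are hypothetical and do not come from any real CAI package.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Question:
    name: str
    text: str
    # (earlier question, required answer); None means everyone is asked.
    ask_if: Optional[Tuple[str, str]] = None


# Hypothetical two-question instrument with one skip pattern.
QUESTIONNAIRE = [
    Question("EMPLOYED", "Are you currently employed?"),
    Question("HOURS", "How many hours did you work last week?",
             ask_if=("EMPLOYED", "Yes")),
]


def universe(question):
    """Describe who was asked the question, derived from the skip pattern."""
    if question.ask_if is None:
        return "All respondents"
    earlier, answer = question.ask_if
    return f"Respondents who answered {answer!r} to {earlier}"


def codebook(questions):
    """Build codebook entries straight from the questionnaire definition."""
    return [
        {"variable": q.name, "question": q.text, "universe": universe(q)}
        for q in questions
    ]


for entry in codebook(QUESTIONNAIRE):
    print(entry)
```

The question text, question order, and universe all fall out of the instrument definition, which is the information that is missing when only the data file reaches the archive.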
20. What statistics packages should be covered?
ICPSR Downloads by Format
Format           All downloads   Studies with all formats
Delimited text   43%             29%
SPSS             22%             24%
SAS              10%             12%
Stata            19%             23%
R                5%              12%
Excel            0%              1%
Other            0%              0%
Total            100%            100%
Number           378,007         154,663
21. Why do we need an SDTL?
Input data for all three packages: X = 2, 3, 4, -1
SPSS
MISSING VALUES X(-1).
IF (X > 3) Y=9.
IF (X < 3) Z=8.
Stata
replace X=. if X==-1
generate Y=9 if X>3
generate Z=8 if X<3
SAS
if X=-1 then X=.;
if X>3 then Y=9;
if X<3 then Z=8;
22. Why do we need an SDTL?
SPSS
MISSING VALUES X(-1).
IF (X > 3) Y=9.
IF (X < 3) Z=8.
Input X   Output X   Y   Z
2         2              8
3         3
4         4          9
-1        -1
Stata
replace X=. if X==-1
generate Y=9 if X>3
generate Z=8 if X<3
Input X   Output X   Y   Z
2         2              8
3         3
4         4          9
-1                   9
SAS
if X=-1 then X=.;
if X>3 then Y=9;
if X<3 then Z=8;
Input X   Output X   Y   Z
2         2          .   8
3         3          .   .
4         4          9   .
-1        .          .   8
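One way to picture a software-independent description is sketched below: the conditional commands from the three packages are parsed into a single neutral record. The ConditionalAssign record, its field names, and the regular expressions are hypothetical illustrations covering only the commands on this slide; they are not the project's actual SDTL format.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ConditionalAssign:
    """Hypothetical neutral record: set target to value when the test holds."""
    target: str
    value: int
    source: str
    operator: str
    threshold: int


# Patterns cover only the simple "assign a constant if X > n / X < n" form.
PATTERNS = {
    "spss": re.compile(
        r"IF \((?P<source>\w+) (?P<op>[<>]) (?P<threshold>-?\d+)\) "
        r"(?P<target>\w+)=(?P<value>-?\d+)\."),
    "stata": re.compile(
        r"generate (?P<target>\w+)=(?P<value>-?\d+) "
        r"if (?P<source>\w+)(?P<op>[<>])(?P<threshold>-?\d+)"),
    "sas": re.compile(
        r"if (?P<source>\w+)(?P<op>[<>])(?P<threshold>-?\d+) "
        r"then (?P<target>\w+)=(?P<value>-?\d+);"),
}


def parse(package, command):
    """Translate one package-specific command into the neutral record."""
    match = PATTERNS[package].fullmatch(command)
    if match is None:
        raise ValueError(f"unsupported {package} command: {command!r}")
    g = match.groupdict()
    return ConditionalAssign(g["target"], int(g["value"]), g["source"],
                             g["op"], int(g["threshold"]))


for package, command in [("spss", "IF (X > 3) Y=9."),
                         ("stata", "generate Y=9 if X>3"),
                         ("sas", "if X>3 then Y=9;")]:
    print(package, parse(package, command))
```

All three commands reduce to the same record, yet the output tables differ. A usable description therefore has to record each package's missing-value behaviour as well as the command itself.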
23. What happens when a missing value is in a logical comparison?
• SPSS
– Logical expressions including a missing value are considered “Missing.” Usually, “Missing” is equivalent to “False.”
• Stata
– Missing values are treated as numbers equal to infinity. So, any number is less than a missing value.
• SAS
– Missing values are treated as numbers equal to minus infinity. So, any number is greater than a missing value.
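The three rules above can be simulated in a few lines. This is a sketch of the stated rules only, not of the packages themselves: None stands in for a missing value, and condition_holds and run are hypothetical helper names. The input is X = 2, 3, 4 and a missing value (the -1 after recoding).

```python
import math
import operator

MISSING = None  # stand-in for a missing value in this sketch


def condition_holds(x, op, threshold, package):
    """Evaluate `x op threshold` under one package's missing-value rule."""
    if package not in ("spss", "stata", "sas"):
        raise ValueError(f"unknown package: {package}")
    if x is MISSING:
        if package == "spss":
            return False  # result is "Missing", treated here as False
        # Stata: missing behaves like +infinity; SAS: like -infinity.
        x = math.inf if package == "stata" else -math.inf
    return op(x, threshold)


def run(values, package):
    """Apply 'Y=9 if X>3' and 'Z=8 if X<3' to each X; return (X, Y, Z) rows."""
    rows = []
    for x in values:
        y = 9 if condition_holds(x, operator.gt, 3, package) else MISSING
        z = 8 if condition_holds(x, operator.lt, 3, package) else MISSING
        rows.append((x, y, z))
    return rows


data = [2, 3, 4, MISSING]
for package in ("spss", "stata", "sas"):
    print(package, run(data, package))
```

Only the last row differs: under the SPSS rule neither Y nor Z is set, under the Stata rule Y becomes 9, and under the SAS rule Z becomes 8.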
24. Missing Values in Comparisons
SPSS
MISSING VALUES X(-1).
IF (X > 3) Y=9.
IF (X < 3) Z=8.
Input X   Output X   Y   Z
2         2              8
3         3
4         4          9
-1        NULL
Stata
replace X=. if X==-1
generate Y=9 if X>3
generate Z=8 if X<3
Input X   Output X   Y   Z
2         2              8
3         3
4         4          9
-1        ∞          9
SAS
if X=-1 then X=.;
if X>3 then Y=9;
if X<3 then Z=8;
Input X   Output X   Y   Z
2         2          .   8
3         3          .   .
4         4          9   .
-1        -∞         .   8
25. Benefits of automated metadata capture
• Metadata will be better
– All the information in the CAI can be included
– Variable transformations can be described
• Automation will lower costs
– Metadata will not be discarded and re-created
• All metadata will be standardized and machine readable
– Codebooks with rich information can be rendered at will
• If we make it easy and beneficial, researchers will use it
26. Continuous Capture of Metadata for Statistical Data (NSF ACI-1640575)
Project Partners
• Inter-university Consortium for Political and Social Research (ICPSR), University of Michigan
• Colectica
• Metadata Technology North America
• Norwegian Centre for Research Data
• General Social Survey, NORC, University of Chicago
• American National Election Study, University of Michigan