This presentation draws on my graduate work, extended to the field of education.
Citation of paper from which presentation was derived:
Farrelly, C. M., Schwartz, S. J., Amodeo, A. L., Feaster, D. J., Steinley, D. L., Meca, A., & Picariello, S. (2017). The Analysis of Bridging Constructs with Hierarchical Clustering Methods: An application to identity. Journal of Research in Personality.
2. Creating a New Survey: Psychometrics
Many types of surveys/tests exist for assessing academic achievement,
psychological traits, or sociological constructs; the field that studies the
construction and functioning of tests is called psychometrics.
Sometimes, a new survey must be created to either improve upon a
previous/discontinued one or assess a new idea/context for a given behavior
or trait.
These new surveys pose several statistical challenges:
Consistency within a survey (measuring what a survey is thought to measure)
Cronbach’s alpha, differential item functioning…
Validation across samples (measures the same thing across populations/time)
Exploratory factor analysis followed by confirmatory factor analysis
Subscales for easier computation and interpretation of results (need to figure out
what items on the survey function similarly)
Statistical frameworks exist for assessing these challenges, but they typically
require large sample sizes and assume certain structures underlie the survey
design.
3. Example Survey
a) Red:Rainbow::July:____ (Month, Year, Hot, Cloud)
b) Soothing:Anodyne::____:Esoteric (Eccentric, School, Abstruse, Calming)
c) Pyrrhic:Victory::Potemkin:____ (Village, Battle, Hollow, Achilles)
d) Stegosaurus:Jurassic::Trilobite:____ (Triassic, Dinosaur, Mesozoic, Cambrian)
e) Mice:Men::Cabbages:____ (Women, Lettuce, Salad, Kings)
f) Fill in the following series: 1, 1/8, 1/27, 1/64, ___
g) Fill in the following series: ___, 25, 168, 1229, 9592
h) Fill in the following series: 3, ___, 4, 1, 5
4. Factor Analysis
Creation of new surveys requires internal and external validation, typically
done through factor analysis.
Exploratory factor analysis is used to cluster items measuring similar underlying
processes.
Confirmatory factor analysis can then be applied to validate those clusters, or
subscales, that were found in the exploratory analysis.
Cronbach’s alpha establishes internal consistency.
[Figure: factor diagram. Verbal factor: items a, b, c, d, e. Math factor: items f, g, h.]
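As a rough illustration of the internal-consistency check named on this slide, the sketch below computes Cronbach's alpha from its standard formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The function name `cronbach_alpha` and the response data are hypothetical and are not taken from the presentation.

```python
from statistics import variance

def cronbach_alpha(scores):
    """Cronbach's alpha for a table of item scores.

    scores: list of rows, one row per respondent, one column per item.
    """
    k = len(scores[0])                        # number of items
    items = list(zip(*scores))                # transpose: one tuple per item
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

# Hypothetical responses (illustrative only): 4 respondents x 3 items
# that move together perfectly, so alpha should be at its maximum.
responses = [
    [1, 1, 1],
    [2, 2, 2],
    [3, 3, 3],
    [4, 4, 4],
]
alpha = cronbach_alpha(responses)
```

Items that rise and fall together push alpha toward 1; items that vary independently pull it toward 0.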
5. Potential Pitfalls in Psychometric
Validation with Factor Analysis
Two major problems challenge the assumptions of these methods and
necessitate the development of a new way to analyze and validate the
measure.
Time-wise or context-wise measurement can introduce non-independent, non-
hierarchical components into the model.
Study habits across terms (longitudinal effects on measurement), identity across social
spheres (student perception of intellectual ability when with friends, work, and school)
Factor analysis can be broadened to Bayesian networks and structural equation models, but
these methods come with their own assumptions about the underlying geometry and sample size.
Small sample size can create numerical instability in traditional algorithms for both
factor analysis and structural equation models (guidelines suggest 5-10 participants per item).
If there are 90 items, at least 450 students would be needed to discover subscales, and
another 450 would be needed to validate these findings.
Cost and population size can be prohibitive to the study.
Ex. Bridging constructs, or loosely connected concepts without a defined hierarchy,
typically run into both limitations and require a new method to validate their
surveys.
Many of these issues arise from the dependence on linear mapping from the
survey response space to a lower-dimensional space.
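The sample-size arithmetic on this slide (5-10 participants per item, once for exploration and once for validation) can be written out as a small helper. This is only a sketch of the rule of thumb; the function name `min_participants` and its parameters are hypothetical.

```python
def min_participants(n_items, per_item=5, n_phases=2):
    """Return (participants per phase, total) under a per-item rule of thumb.

    n_phases=2 assumes one exploratory and one confirmatory sample.
    """
    per_phase = n_items * per_item
    return per_phase, per_phase * n_phases

# 90 items at the lower bound of 5 participants per item.
per_phase, total = min_participants(90)
# Same survey at the upper bound of 10 participants per item.
upper_phase, upper_total = min_participants(90, per_item=10)
```

With 90 items this gives 450 participants per phase at the lower bound, matching the figures quoted above, and 900 per phase at the upper bound.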
6. Moving from Euclidean-Based Statistics
to Topologically-Based Statistics
Loss of information with each projection to a lower-dimensional space (errors)
Topological methods work by partitioning existing space into homogeneous components (no maps, no error)
[Figure: 2D example]
7. Algebraic Topology and Topological Spaces
Spaces, such as the one formed by survey response data, can be defined
topologically and decomposed using algebraic topology/geometry.
Data follows discrete versions of many theoretical results in this area of math.
Topology is rubber sheet geometry, with areas analogous to gluing together children’s building
blocks, examining connections on shapes, or hunting for mountain/valley water flows.
Examining how the pieces fit together in a given space allows one to study the topological
space’s defining characteristics and the behavior of functions in that space.
One can define connections between pieces of this space via algebra and examine
structural properties computationally:
Homotopy (shrinking connected paths to a point)
Homology (hole-counting to define topological classification of structure)
[Figure, three panels: (1) homotopy/homology; (2) basins of attraction (Morse theory); (3) Hodge theory]
8. Applied Homology: Filtrations and
Persistence
Filtration
This is an iterative changing of lens with which
to examine data (height, neighbors…).
Topological features appear and disappear as
the lens changes.
This creates a nested sequence of features with
underlying algebraic objects, called a homology
sequence:
Hom_1 ⊂ Hom_2 ⊂ Hom_3 ⊂ Hom_4
Persistence is the length of feature existence in
a homology sequence, which can be visualized.
This information maps back to the data
space’s topology (shape).
The first level of algebraic objects
corresponds to connectedness of the space
(0th Betti numbers), and this is directly
related to a type of clustering analysis.
[Figure: persistence plot over filtration time (axis 0 to 10), with features labeled: vertices, connected space, hole in middle]
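To make the idea of a filtration concrete, the sketch below counts connected components (the 0th Betti number) of a small point set as a distance threshold grows. It is a minimal illustration under stated assumptions: the points and thresholds are hypothetical, two points are linked when their Euclidean distance is at most the threshold, and the helper name `betti0` is not from the presentation.

```python
from itertools import combinations
from math import dist

def betti0(points, threshold):
    """Count connected components when points within `threshold` are linked."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent pointers to the component root (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= threshold:
            parent[find(i)] = find(j)         # merge the two components
    return len({find(i) for i in range(len(points))})

# Hypothetical data: two tight pairs of points that sit far apart.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
thresholds = [0.5, 1.0, 5.0, 9.0]
betti_curve = [betti0(points, t) for t in thresholds]
```

Here the count drops from 4 to 2 to 1 as the threshold grows. The two-component state survives over a wide range of thresholds, and that long lifetime is what persistence measures.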
9. Solution: Use Machine Learning to Exploit
Underlying Topology of Survey Data
Single-linkage hierarchical clustering partitions data space according to
connected components (0th Betti numbers) across filtration levels (i.e. a series
of distance filtrations).
This method has been successfully applied to neuroimaging studies focused on
patterns of brain activity across diseases, neuropsychological tests, and drug states.
This provides a nuanced scanning of topologically-based features within the datasets at
different correlation/similarity thresholds.
These can be summarized in feature plots, called persistence diagrams, that track the birth
and death of a given feature across thresholds, and can be compared through existing
statistical tests, such as a nonparametric Wasserstein metric test.
It has also been used to track gene expression pattern changes across time and/or
disease states in microarray studies.
These studies particularly emphasize the visualization of hierarchical clustering through
dendrograms (tree diagrams of relationships at different filtration levels) and heat maps
(color-coded expression-similarity plots among genes in the microarray).
These visualizations provide a user-friendly way to understand and communicate key
findings of this statistical method.
This method can handle data with fewer observations than predictors (p>>n), and,
thus, does not require large sample sizes.
Internal correlations do not pose issues; in fact, the method excels at separating
data within and across dependencies.
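The link between single-linkage clustering and connected components can be sketched directly: processing pairwise distances from smallest to largest and merging clusters whenever an edge joins two different components yields the single-linkage merge heights, which are also the thresholds at which components disappear in the distance filtration. The item distance matrix below is hypothetical (for instance, one minus an absolute correlation between items), and `single_linkage_merges` is an illustrative name, not part of any library.

```python
def single_linkage_merges(dist_matrix):
    """Return single-linkage merge events as (height, members_a, members_b)."""
    n = len(dist_matrix)
    # All pairwise distances, smallest first.
    edges = sorted(
        (dist_matrix[i][j], i, j) for i in range(n) for j in range(i + 1, n)
    )
    cluster_of = {i: frozenset([i]) for i in range(n)}
    merges = []
    for height, i, j in edges:
        a, b = cluster_of[i], cluster_of[j]
        if a != b:                            # edge joins two components
            merged = a | b
            for member in merged:
                cluster_of[member] = merged
            merges.append((height, sorted(a), sorted(b)))
    return merges

# Hypothetical distances among four survey items (indices 0-3):
# items 0 and 1 are close, items 2 and 3 are close, the pairs are far apart.
D = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
]
merges = single_linkage_merges(D)
heights = [h for h, _, _ in merges]
```

The large gap between the second and third merge heights corresponds to a long bar on the dendrogram, the signature of two well-separated item clusters. In practice a library routine (for example SciPy's `scipy.cluster.hierarchy.linkage` with `method="single"`) would normally replace this hand-rolled version.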
10. Hierarchical Clustering: Example Survey
[Figure: heatmap with dendrogram; items separate into Math and Verbal clusters]
Very distinct separation of items (noted
by sharp color contrast of heatmap and
long height bars on dendrogram)
11. Validation: Dendrograms and Topology
Dendrograms are a special type of graph,
called a tree.
Because graphs have a defined topological
space and dendrograms are a type of
graph, they can be studied or measured
through the tools of topology and metric
geometry.
Hausdorff distance allows two objects of
the same dimension to be compared by a
defined metric.
For each point on one object, find the
closest point on the other; the Hausdorff
distance is the greatest of these
closest-point distances, taken in both
directions, giving a nearness-of-match
type of metric on two objects (top left).
Within a graph framework, it allows one to
calculate the worst best match between two
graphs (as shown at bottom left).
This allows for the development of a
distance-based nonparametric test to test
for dendrogram structural differences in a
statistical framework.
[Figure: Hausdorff distance between two objects (top left) and between two graphs (bottom left)]
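The Hausdorff distance described on this slide can be sketched for finite point sets as follows. This is a minimal illustration of the definition only; how a dendrogram is represented as a set of points before comparison is not specified in these slides, so the two point sets below are purely hypothetical.

```python
from math import dist

def directed_hausdorff(A, B):
    """Largest distance from a point in A to its nearest point in B."""
    return max(min(dist(a, b) for b in B) for a in A)

def hausdorff(A, B):
    """Symmetric Hausdorff distance: the worse of the two directed distances."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))

# Hypothetical point sets (illustrative only).
A = [(0.0, 0.0), (1.0, 0.0)]
B = [(0.0, 0.0), (4.0, 0.0)]
d_ab = directed_hausdorff(A, B)   # every point of A is within 1 of B
d_ba = directed_hausdorff(B, A)   # the point (4, 0) is 3 away from A
d = hausdorff(A, B)
```

The two directed distances differ, which is why the symmetric version takes the larger one: it reports the worst best match between the two sets.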
12. Steps in Exploration and Validation of
Surveys with Hierarchical Clustering
1) Partition sample into training and validation sets/draw a small number of
bootstrap samples from the original dataset.
2) Calculate distance metrics in each sample.
3) Run a single-linkage hierarchical clustering algorithm on the training set to
obtain exploratory clusters of similar survey items (the pvclust R package
statistically tests internal survey structure, playing a role similar to Cronbach’s alpha).
Create heat map and dendrogram.
4) Repeat (3) on validation sets to obtain a set of dendrograms.
5) Calculate Hausdorff distance (a topological metric) between dendrograms to
estimate differences in results (validation step).
6) Obtain p-value through permuting the extant dendrograms or generating
random dendrograms.
7) If the p-value is larger than 0.05/n (Bonferroni correction, with n the number of
dendrogram comparisons) for the dendrograms in (5), no statistically significant
difference in dendrogram structure is detected, which supports the consistency and
validity of the survey clusters.
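Steps 6 and 7 above can be sketched as a permutation p-value followed by a Bonferroni-adjusted decision. The function names, the add-one correction in the p-value, and all numbers below are assumptions made for illustration; the null distances would in practice come from permuted or randomly generated dendrograms as described in step 6.

```python
def permutation_p_value(observed, null_distances):
    """Share of null distances at least as large as the observed distance.

    Uses the common add-one correction so the p-value is never exactly 0
    (an assumption; the slides do not state which estimator was used).
    """
    exceed = sum(1 for d in null_distances if d >= observed)
    return (exceed + 1) / (len(null_distances) + 1)

def consistent_under_bonferroni(p_values, alpha=0.05):
    """True if no comparison falls at or below the threshold alpha / n."""
    threshold = alpha / len(p_values)
    return all(p > threshold for p in p_values), threshold

# Hypothetical values: one observed Hausdorff distance and four null distances.
null = [1.0, 2.0, 3.0, 4.0]
p = permutation_p_value(2.5, null)            # 2 of 4 null values are >= 2.5
# Hypothetical p-values from three dendrogram comparisons.
consistent, threshold = consistent_under_bonferroni([p, 0.3, 0.2])
```

A result of `consistent == True` means no comparison showed a significant structural difference at the corrected level, which is the outcome read as support for stable subscales.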
13. Example Measure: Bridging Constructs
Identity expression across life contexts (ILLCQ Survey):
There are many components to identity in leading theories of identity.
Example: religious identity in school, family, and friends contexts
It was unknown whether identity type or social context plays a greater role in the
expression of identity within an individual.
Identity type as more influential would suggest that identity is a fairly static trait.
Context as more influential would suggest that identity is fluid.
Sample size and survey size
406 participants (FIU students) and 91 distinct survey items.
5 draws of 130 participants each for validation and consistency checks.
Results suggest certain aspects of identity are fluid and others are fixed.
Political and racial/ethnic identity are fairly fixed.
Other types, such as athletic or gender, are fairly fluid.
Bootstrapped samples suggest consistency of measure and validate findings.
Subscales hold over different samples (tests of difference, all p>0.05).
This validates the measure and allows for inference into the psychology of identity.
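The repeated draws used for the consistency checks (5 draws of 130 participants from 406) could be generated along the following lines. This is a sketch under assumptions: the slides do not say whether participants were drawn with or without replacement, so sampling without replacement within each draw is assumed here, and `draw_subsamples` and the seed are hypothetical.

```python
import random

def draw_subsamples(n_participants, n_draws, draw_size, seed=0):
    """Return n_draws lists of participant indices, each of size draw_size.

    Assumption: indices are sampled without replacement within a draw;
    different draws may overlap.
    """
    rng = random.Random(seed)                 # fixed seed for reproducibility
    ids = range(n_participants)
    return [sorted(rng.sample(ids, draw_size)) for _ in range(n_draws)]

draws = draw_subsamples(406, n_draws=5, draw_size=130)
```

Each draw would then be clustered separately, giving the set of dendrograms that are compared in the validation step.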
15. Conclusion
This method offers a robust way to create survey subscales and validate
measures without needing a large sample or a pre-defined measure structure.
Flexible
Deeply rooted in mathematics
Statistically testable
Internal validity by pvclust’s statistical test of cluster hierarchy for cut-points
External validity by Hausdorff nonparametric test on bootstrapped samples
It has been successfully applied to a bridging concept survey (factorial design),
as well as more traditional survey designs.
This offers a general way to extend traditional areas of statistics to a more
general framework through the use of topological theory and tools.
Likely to be useful as data becomes more complex in industry and academia.
May be able to circumvent other problems in modern statistics.
Item response theory (how people in different groups perform on test items)
Network comparison (social networks, covariance networks…) between groups or over time
Structural equation modeling when data does not meet method assumptions
16. Co-authors
The Analysis of bridging constructs with hierarchical clustering methods: An
application to identity (under review Journal of Research in Personality)
Seth Schwartz, University of Miami
Anna Lisa Amodeo, University of Naples
Daniel Feaster, University of Miami
Douglas Steinley, University of Missouri
Alan Meca, University of Miami
Simona Picariello, University of Naples
Santos, J. R. A. (1999). Cronbach’s alpha: A tool for assessing the reliability of scales. Journal of extension, 37(2), 1-5.
Thompson, B. (2004). Exploratory and confirmatory factor analysis: Understanding concepts and applications. American Psychological Association.
Costello, A. B. (2009). Getting the most from your analysis. Pan, 12(2), 131-146.
Rouquette, A., & Falissard, B. (2011). Sample size requirements for the internal validation of psychiatric scales. International Journal of Methods in Psychiatric Research, 20(4), 235-249.
DeCoster, J. (1998). Overview of factor analysis.
Zomorodian, A., & Carlsson, G. (2005). Computing persistent homology. Discrete & Computational Geometry, 33(2), 249-274.
Lee, H., Kang, H., Chung, M. K., Kim, B. N., & Lee, D. S. (2012). Persistent brain network homology from the perspective of dendrogram. IEEE transactions on medical imaging, 31(12), 2267-2277.
Revelle, W. (1979). Hierarchical cluster analysis and the internal structure of tests. Multivariate Behavioral Research, 14(1), 57-74.
Suzuki, R., & Shimodaira, H. (2006). Pvclust: an R package for assessing the uncertainty in hierarchical clustering. Bioinformatics, 22(12), 1540-1542.
Chipman, H., & Tibshirani, R. (2006). Hybrid hierarchical clustering with applications to microarray data. Biostatistics, 7(2), 286-301.
Gross, J. L., & Tucker, T. W. (1987). Topological graph theory. Courier Corporation.