Joint work with Martijn Naaijer (VU University).
With the Hebrew Bible encoded in the Linguistic Annotation Framework (LAF-ISO), and with a new LAF processing tool, we demonstrate how to do practical data analysis. The tool, LAF-Fabric, integrates with the IPython Notebook approach. Our example here is lexeme co-occurrence analysis of Bible books. For now, the road from data to visualization is more important than the exact visualization.
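As a rough illustration of what lexeme co-occurrence between books can mean, here is a minimal sketch in plain Python. It does not use LAF-Fabric; the function name `book_cooccurrence`, the input shape (book name mapped to a list of lexemes), and the toy data are all assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations


def book_cooccurrence(lexemes_by_book):
    """Count, for every pair of books, the distinct lexemes found in both.

    `lexemes_by_book` is a hypothetical input: book name -> list of lexemes.
    """
    # Invert the data: for each lexeme, the set of books it occurs in.
    books_by_lexeme = defaultdict(set)
    for book, lexemes in lexemes_by_book.items():
        for lexeme in lexemes:
            books_by_lexeme[lexeme].add(book)

    # Every lexeme contributes 1 to each pair of books that share it.
    shared = defaultdict(int)
    for books in books_by_lexeme.values():
        for pair in combinations(sorted(books), 2):
            shared[pair] += 1
    return dict(shared)


# Made-up placeholder lexemes, not real corpus data.
toy = {
    "Genesis": ["lex_a", "lex_b", "lex_c"],
    "Kings": ["lex_c", "lex_b"],
    "Esther": ["lex_c"],
}
counts = book_cooccurrence(toy)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

The resulting pair counts form a weighted graph of books, which is the kind of structure a visualization step could take as input.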
4. DISTANT READING
scan large quantities of text
find patterns
signals in the noise
study other aspects than meaning
text transmission
linguistic variation
literary form
5. VARIATION IN BIBLICAL HEBREW
Timespan of Hebrew Bible writing: ~1000 years
Assumption: we can divide the books into two groups
EBH (early biblical Hebrew)
LBH (late biblical Hebrew)
6. "PROOF"
Select some features that differ for EBH and LBH
Risk of circularity
We need data analysis that is
comprehensive (not eclectic)
critical (not everything is a signal)
9. THE HEBREW BIBLE IN LAF
LAF: ISO 24612:2012
SHEBANQ
(github)
2.27 GB
1.5 M nodes
1.5 M edges
40 M features
400 K words
13 M XML ids
10. PROCESSING LAF
it is XML
but not document-like (not as TEI)
and not database-like (not nice for XQuery)
it is graph-like
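To make "graph-like" concrete, here is a minimal sketch of annotation data held as nodes, edges, and features in plain Python. The class `AnnotationGraph`, its methods, the `otype` feature name, and the toy clause/phrase/word data are assumptions for illustration; this is not the LAF data model in full, nor the LAF-Fabric API.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotationGraph:
    """Hypothetical in-memory graph: features hang on nodes, edges link nodes."""

    node_features: dict = field(default_factory=dict)  # node id -> {name: value}
    edges: dict = field(default_factory=dict)          # source id -> [target ids]

    def add_node(self, node_id, **features):
        self.node_features[node_id] = dict(features)
        self.edges.setdefault(node_id, [])

    def add_edge(self, source, target):
        self.edges.setdefault(source, []).append(target)

    def descendants(self, node_id):
        """Return all nodes reachable from `node_id`, depth-first, in edge order."""
        found, seen = [], set()
        stack = list(reversed(self.edges.get(node_id, [])))
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            found.append(current)
            stack.extend(reversed(self.edges.get(current, [])))
        return found


# Toy data: one clause containing two phrases and three words.
g = AnnotationGraph()
g.add_node("c1", otype="clause")
g.add_node("p1", otype="phrase")
g.add_node("p2", otype="phrase")
for word in ("w1", "w2", "w3"):
    g.add_node(word, otype="word")
g.add_edge("c1", "p1")
g.add_edge("c1", "p2")
g.add_edge("p1", "w1")
g.add_edge("p2", "w2")
g.add_edge("p2", "w3")

words = [n for n in g.descendants("c1") if g.node_features[n]["otype"] == "word"]
print(words)
```

A query such as "all words in this clause" is a graph traversal here, which is awkward to express against a document tree or a relational table.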
11. PROCESSING LAF
eXist (>30 min loading time, simple queries >60 min)
indexes needed, but which ones?
tried POIO (>60 min loading time, needs >20 GB RAM)
straightforward object-oriented in Python
scripting language overhead
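One possible reading of the last two bullets is that creating one Python object per node costs far more memory than the data itself. The sketch below illustrates that reading only; the `Node` class, the field names `start` and `end`, and the node count are assumptions, and the object figure is a shallow lower bound as reported by `sys.getsizeof` on CPython.

```python
import sys
from array import array


class Node:
    """Hypothetical per-node object, as a naive object-oriented design would use."""

    def __init__(self, node_id, start, end):
        self.node_id = node_id
        self.start = start
        self.end = end


n = 10_000

# Design 1: one Python object per node.
nodes = [Node(i, i, i + 1) for i in range(n)]

# Design 2: the same data in two compact arrays of unsigned ints,
# with the node id implied by the array index.
starts = array("I", range(n))
ends = array("I", (i + 1 for i in range(n)))

# Shallow size only: excludes attribute dictionaries and the int objects,
# so the real cost of design 1 is higher than this number.
object_bytes = sum(sys.getsizeof(node) for node in nodes)
array_bytes = starts.itemsize * len(starts) + ends.itemsize * len(ends)

print("objects (lower bound):", object_bytes, "bytes")
print("arrays:", array_bytes, "bytes")
```

With roughly 1.5 M nodes and 40 M features, a per-object overhead of this kind adds up quickly, which is consistent with the memory figures reported for the tools above.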