This lexicon is an exploratory corpus linguistics tool. It surfaces distributional patterns (how often words appear, where they cluster, and what they co-occur with) as evidence for researchers, exhibition visitors, and educators to interpret. It does not make historical claims.
Visit Lexicon at temporary URL
Corpus
The lexicon draws on 397 oral history testimonies (~2.73 million words) in two languages: 261 in Greek (~1.92M words) and 136 in Italian (~0.81M words).
The Greek corpus comprises testimonies from three archives: USHMM (53 testimonies, 13 female / 40 male, ~392K words), Memories of the Occupation in Greece (MoG; 27 testimonies, all female, ~355K words), and Istorima (181 testimonies, all female, ~1.17M words).
The Italian corpus consists of CDEC testimonies from Italy (101 testimonies, 48 female / 53 male, ~645K words) and Rhodes (35 testimonies, 30 female / 5 male, ~168K words).
Gender Dimension
A central aim of this lexicon is to examine how female and male survivors narrate the same experiences through different vocabulary. The CDEC testimonies from Italy provide the cleanest gender comparison, with roughly balanced female (48) and male (53) testimonies from the same archive. On the Greek side, USHMM is the only archive with both female and male testimonies (13 female vs 40 male), so it offers the most controlled Greek comparison, but the small female sample means its results should be interpreted with caution.
Gender statistics combine document-level measures (checking how many testimonies use a word, not just how often) with frequency-based measures, to avoid inflating results from a few long testimonies.
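The difference between the two kinds of measure can be illustrated with a minimal Python sketch. It is not the lexicon's actual implementation: the function name gender_profile, the (gender, tokens) input format, and the toy testimonies are all assumptions made for illustration. For one word, it reports the share of testimonies per gender that use the word at least once (document-level) alongside the occurrences per million tokens (frequency-based).

```python
from collections import Counter


def gender_profile(testimonies, word):
    """Per-gender document share and relative frequency for `word`.

    `testimonies` is a list of (gender, tokens) pairs, one per testimony.
    This input format is hypothetical, chosen only for the sketch.
    """
    docs = Counter()          # testimonies per gender
    docs_with = Counter()     # testimonies per gender containing the word
    hits = Counter()          # occurrences of the word per gender
    tokens_total = Counter()  # total tokens per gender

    for gender, tokens in testimonies:
        count = tokens.count(word)
        docs[gender] += 1
        tokens_total[gender] += len(tokens)
        hits[gender] += count
        if count > 0:
            docs_with[gender] += 1

    return {
        g: {
            # Document-level: how many testimonies use the word at all.
            "doc_share": docs_with[g] / docs[g],
            # Frequency-based: occurrences per million tokens.
            "per_million": 1_000_000 * hits[g] / tokens_total[g],
        }
        for g in docs
    }


# Toy data, invented for illustration (not drawn from the corpus).
toy = [
    ("F", ["bread"] * 8 + ["train"] * 2),  # one long testimony repeats "bread"
    ("F", ["train", "camp"]),
    ("F", ["camp", "mother"]),
    ("M", ["bread", "train"]),
    ("M", ["bread", "camp"]),
    ("M", ["train", "camp"]),
]

profile = gender_profile(toy, "bread")
for gender, stats in sorted(profile.items()):
    print(gender, stats)
```

In this toy data, the frequency-based measure ranks "bread" higher for the female group only because a single long testimony repeats it, while the document-level measure shows it appears in more male testimonies. Reading both measures together guards against that kind of inflation.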
Credits
Testimonies are provided by CDEC (Centro di Documentazione Ebraica Contemporanea, Milan), the United States Holocaust Memorial Museum (USHMM), the Memories of the Occupation in Greece project (MoG), and the Istorima oral history project.
References
Burrows, J. (2007). All the Way Through: Testing for Authorship in Different Frequency Strata. Literary and Linguistic Computing, 22(1), 27–47. https://doi.org/10.1093/llc/fqi067
Church, K. W., & Hanks, P. (1990). Word Association Norms, Mutual Information, and Lexicography. Computational Linguistics, 16(1), 22–29.
Dunning, T. (1993). Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, 19(1), 61–74.
Evert, S. (2008). Corpora and collocations. In A. Lüdeling & M. Kytö (Eds.), Corpus Linguistics. An International Handbook (pp. 1212–1248). Mouton de Gruyter. https://doi.org/10.1515/9783110213881.2.1212
Feng, F., Yang, Y., Cer, D., Arivazhagan, N., & Wang, W. (2022). Language-agnostic BERT Sentence Embedding. In S. Muresan, P. Nakov, & A. Villavicencio (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 878–891). Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.acl-long.62
Gries, S. T. (2008). Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics, 13(4), 403–437. https://doi.org/10.1075/ijcl.13.4.02gri
Hardie, A. (2012). CQPweb—Combining power, flexibility and usability in a corpus analysis tool. International Journal of Corpus Linguistics, 17(3), 380–409. https://doi.org/10.1075/ijcl.17.3.04har
Honnibal, M., Montani, I., Van Landeghem, S., & Boyd, A. (2020). spaCy: Industrial-strength Natural Language Processing in Python. https://doi.org/10.5281/zenodo.1212303
Kilgarriff, A. (2001). Comparing Corpora. International Journal of Corpus Linguistics, 6(1), 97–133. https://doi.org/10.1075/ijcl.6.1.05kil
Kilgarriff, A. (2009). Simple maths for keywords. Proceedings of the Corpus Linguistics Conference, 6, 41–55.
Kilgarriff, A., Rychlý, P., Smrž, P., & Tugwell, D. (2004). The Sketch Engine. In G. Williams & S. Vessier (Eds.), Proceedings of the 11th EURALEX International Congress (pp. 105–115). Université de Bretagne-Sud, Faculté des lettres et des sciences humaines.
Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., Michelfeit, J., Rychlý, P., & Suchomel, V. (2014). The Sketch Engine: Ten years on. Lexicography, 1(1), 7–36. https://doi.org/10.1007/s40607-014-0009-9
Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust Speech Recognition via Large-Scale Weak Supervision (arXiv:2212.04356). arXiv. https://doi.org/10.48550/arXiv.2212.04356
Rychlý, P. (2008). A Lexicographer-Friendly Association Score. Raslan, 6–9.
Rychlý, P., & Kilgarriff, A. (2007). An efficient algorithm for building a distributional thesaurus (and other Sketch Engine developments). In S. Ananiadou (Ed.), Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions (pp. 41–44). Association for Computational Linguistics.
Schöch, C., Schlör, D., Zehe, A., Gebhard, H., Becker, M., & Hotho, A. (2018). Burrows’ Zeta: Exploring and Evaluating Variants and Parameters. DH, 274–277.
