Adrian Doyle, John P. McCrae, Clodagh Downey
Refereed Conference Meeting Proceeding
This paper examines difficulties inherent in the tokenization of Early Irish texts and demonstrates that a neural-network-based approach may provide a viable solution for historical texts which contain unconventional spacing and spelling anomalies. Guidelines for tokenizing Old Irish text are presented, the creation of a character-level LSTM network is detailed, its accuracy is assessed, and efforts at optimising its performance are recorded. Based on the results of this research, it is expected that a character-level LSTM model may provide a viable solution for the tokenization of historical texts where the use of scriptio continua, or alternative spacing conventions, makes the automatic separation of tokens difficult.
Celtic Language Technology Workshop 2019
Digital Object Identifier (DOI):
National University of Ireland, Galway (NUIG)
Open access repository: