A Character-Level LSTM Network Model for Tokenizing the Old Irish text of the Würzburg Glosses on the Pauline Epistles
Authors:
Adrian Doyle, John P. McCrae, Clodagh Downey
Publication Type:
Refereed Conference Meeting Proceeding
Abstract:
This paper examines difficulties inherent in tokenization of Early Irish texts and demonstrates that a neural-network-based approach may provide a viable solution for historical texts which contain unconventional spacing and spelling anomalies. Guidelines for tokenizing Old Irish text are presented and the creation of a character-level LSTM network is detailed, its accuracy assessed, and efforts at optimising its performance are recorded. Based on the results of this research it is expected that a character-level LSTM model may provide a viable solution for tokenization of historical texts where the use of Scriptio Continua, or alternative spacing conventions, makes the automatic separation of tokens difficult.
Conference Name:
Celtic Language Technology Workshop 2019
Digital Object Identifier (DOI):
10.18653/v1/w19-6910
Publication Date:
19/08/2019
Conference Location:
Ireland
Research Group:
Institution:
National University of Ireland, Galway (NUIG)
Open access repository:
Yes