A Character-Level LSTM Network Model for Tokenizing the Old Irish text of the Würzburg Glosses on the Pauline Epistles

Authors:

Adrian Doyle, John P. McCrae, Clodagh Downey

Publication Type:

Refereed Conference Meeting Proceeding

Abstract:

This paper examines difficulties inherent in the tokenization of Early Irish texts and demonstrates that a neural-network-based approach may provide a viable solution for historical texts which contain unconventional spacing and spelling anomalies. Guidelines for tokenizing Old Irish text are presented, the creation of a character-level LSTM network is detailed, its accuracy is assessed, and efforts at optimising its performance are recorded. Based on the results of this research, it is expected that a character-level LSTM model may provide a viable solution for tokenization of historical texts where the use of scriptio continua, or alternative spacing conventions, makes the automatic separation of tokens difficult.

Conference Name:

Celtic Language Technology Workshop 2019

Digital Object Identifier (DOI):

10.18653/v1/w19-6910

Publication Date:

19/08/2019

Conference Location:

Ireland

Research Group:

Linked Data

Institution:

National University of Ireland, Galway (NUIG)

Open access repository:

Yes

Publication document:

doyle2019character.pdf