
A Character-Level LSTM Network Model for Tokenizing the Old Irish text of the Würzburg Glosses on the Pauline Epistles

Authors:

Adrian Doyle, John P. McCrae, Clodagh Downey

Publication Type:

Refereed Conference Meeting Proceeding

Abstract:

This paper examines the difficulties inherent in tokenizing Early Irish texts and demonstrates that a neural-network-based approach may provide a viable solution for historical texts that contain unconventional spacing and spelling anomalies. Guidelines for tokenizing Old Irish text are presented, the creation of a character-level LSTM network is detailed, its accuracy is assessed, and efforts to optimise its performance are recorded. Based on the results of this research, it is expected that a character-level LSTM model may provide a viable solution for tokenizing historical texts in which scriptio continua, or alternative spacing conventions, make the automatic separation of tokens difficult.
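
As a rough illustration of what character-level tokenization involves, the sketch below frames it as per-character boundary labelling: spaces are stripped from a correctly spaced text, and each remaining character is labelled 1 if a token boundary follows it and 0 otherwise. This framing, the function names `to_boundary_labels` and `apply_boundary_labels`, and the Latin sample string are assumptions made for illustration only; the abstract does not specify how the paper's LSTM is trained or what its inputs and outputs are.

```python
def to_boundary_labels(text):
    """Turn spaced text into (chars, labels).

    labels[i] == 1 means a token boundary follows chars[i].
    Hypothetical helper: one possible way to prepare training data
    for a character-level sequence model, not the paper's method.
    """
    chars, labels = [], []
    for ch in text.strip():
        if ch.isspace():
            # Mark the previous non-space character as ending a token.
            if labels:
                labels[-1] = 1
        else:
            chars.append(ch)
            labels.append(0)
    return chars, labels


def apply_boundary_labels(chars, labels):
    """Rebuild spaced text from characters and predicted boundary labels."""
    out = []
    for ch, label in zip(chars, labels):
        out.append(ch)
        if label == 1:
            out.append(" ")
    return "".join(out).strip()


# Illustrative sample only (not taken from the Würzburg glosses).
chars, labels = to_boundary_labels("in principio erat")
restored = apply_boundary_labels(chars, labels)
```

Under this framing, a model would receive the unspaced character sequence and predict the label sequence; `apply_boundary_labels` then reinserts the spaces.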

Conference Name:

Celtic Language Technology Workshop 2019

Digital Object Identifier (DOI):

10.18653/v1/w19-6910

Publication Date:

19/08/2019

Conference Location:

Ireland

Research Group:

Linked Data

Institution:

National University of Ireland, Galway (NUIG)

Open access repository:

Yes

Publication document:

doyle2019character.pdf
