
A Character-Level LSTM Network Model for Tokenizing the Old Irish text of the Würzburg Glosses on the Pauline Epistles

Authors:

Adrian Doyle, John P. McCrae, Clodagh Downey

Publication Type:

Refereed Conference Meeting Proceeding

Abstract:

This paper examines difficulties inherent in tokenization of Early Irish texts and demonstrates that a neural-network-based approach may provide a viable solution for historical texts which contain unconventional spacing and spelling anomalies. Guidelines for tokenizing Old Irish text are presented and the creation of a character-level LSTM network is detailed, its accuracy assessed, and efforts at optimising its performance are recorded. Based on the results of this research it is expected that a character-level LSTM model may provide a viable solution for tokenization of historical texts where the use of Scriptio Continua, or alternative spacing conventions, makes the automatic separation of tokens difficult.
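
Illustrative sketch (not taken from the paper): one common way to frame character-level tokenization is as a labelling task in which a model predicts, for each character of unspaced text, whether a token ends there. The snippet below shows only that data framing, using a toy English example. The function names `to_char_labels` and `apply_labels` are hypothetical, and the authors' actual input encoding and network design may differ.

```python
def to_char_labels(tokens):
    """Turn a gold tokenization into (unspaced text, per-character labels).

    Label 1 marks the last character of a token, 0 marks any other character.
    This framing is an assumption for illustration, not the paper's method.
    """
    chars, labels = [], []
    for token in tokens:
        for i, ch in enumerate(token):
            chars.append(ch)
            labels.append(1 if i == len(token) - 1 else 0)
    return "".join(chars), labels


def apply_labels(text, labels):
    """Split unspaced text into tokens wherever a label of 1 occurs."""
    tokens, current = [], []
    for ch, label in zip(text, labels):
        current.append(ch)
        if label == 1:
            tokens.append("".join(current))
            current = []
    if current:  # keep trailing characters that were never closed by a 1
        tokens.append("".join(current))
    return tokens


# Toy example: the labels a character-level model would be trained to predict.
text, labels = to_char_labels(["the", "cat", "sat"])
recovered = apply_labels(text, labels)
```

In such a setup a sequence model (for instance an LSTM reading one character at a time) would be trained to output the label sequence, and `apply_labels` would turn its predictions back into tokens.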

Conference Name:

Celtic Language Technology Workshop 2019

Digital Object Identifier (DOI):

10.18653/v1/w19-6910

Publication Date:

19/08/2019

Conference Location:

Ireland

Research Group:

Linked Data

Institution:

National University of Ireland, Galway (NUIG)

Open access repository:

Yes

Publication document:

doyle2019character.pdf
