Orthography and tokenization

  • explicitly defined orthographic systems
  • classified tokenization

Create a tokenizable corpus from two components:

  • a citable corpus
  • an orthographic system

You need to import the trait as well as an implementation of it:

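// The trait for orthographic systems
import edu.holycross.shot.mid.orthography._
// An implementation for Latin
import edu.holycross.shot.latin._
// `chapter` is the citable corpus, assumed to have been created already
val tokenizable = TokenizableCorpus(chapter, Latin23Alphabet)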
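
To make the idea of an explicitly defined orthographic system more concrete, here is a minimal, self-contained sketch in plain Scala. It does not use the libraries imported above, and it is not how `Latin23Alphabet` is implemented: the object `OrthographySketch`, its character inventory and its method names are all hypothetical, chosen only to illustrate that an orthography can be expressed as an explicit set of allowed characters against which text is validated.

```scala
// Hypothetical illustration only; NOT the API of the mid.orthography or latin libraries.
object OrthographySketch {
  // Assumed 23-letter inventory: no j, v or w.
  val letters: Set[Char] = "abcdefghiklmnopqrstuxyz".toSet
  // Assumed punctuation inventory.
  val punctuation: Set[Char] = ".,;:?!".toSet

  /** True if the character belongs to this toy orthography (case-insensitive for letters). */
  def validChar(c: Char): Boolean =
    letters.contains(c.toLower) || punctuation.contains(c) || c == ' '

  /** True if every character of the string is valid. */
  def validString(s: String): Boolean = s.forall(validChar)

  /** The distinct characters of the string that fall outside the orthography. */
  def invalidChars(s: String): Vector[Char] =
    s.filterNot(validChar).distinct.toVector

  def main(args: Array[String]): Unit = {
    println(validString("arma uirumque cano"))   // true
    println(invalidChars("arma virumque cano"))  // Vector(v)
  }
}
```

Because the inventory is explicit, a text can be rejected before tokenization if it contains a character the orthography does not define, as with the `v` in the second call.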

Two common activities in analyzing a corpus:

  1. Generate a word list
  2. Create a tokenized corpus

// 1. the word list
tokenizable.wordList
// 2. the tokenized corpus
tokenizable.tokenizedCorpus
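
The two activities can also be sketched without the library, to show what classified tokenization and a word list amount to conceptually. Everything in this block is a hypothetical stand-in: `Token`, `TokenCategory`, `TokenizationSketch` and the choice to lower-case and sort the word list are assumptions made for illustration, not the behavior of `TokenizableCorpus`.

```scala
// Hypothetical illustration only; these types are NOT those of the mid.orthography library.
sealed trait TokenCategory
case object Lexical extends TokenCategory
case object Punctuation extends TokenCategory

final case class Token(text: String, category: TokenCategory)

object TokenizationSketch {
  // A run of letters is one token; each punctuation mark is its own token.
  // Anything else (such as whitespace) is skipped.
  private val tokenPattern = "[A-Za-z]+|[.,;:?!]".r

  /** Split a string into tokens, classifying each one. */
  def tokenize(s: String): Vector[Token] =
    tokenPattern.findAllIn(s).toVector.map { t =>
      if (t.head.isLetter) Token(t, Lexical) else Token(t, Punctuation)
    }

  /** Distinct lexical tokens, lower-cased and sorted (an assumed convention). */
  def wordList(s: String): Vector[String] =
    tokenize(s).collect { case Token(t, Lexical) => t.toLowerCase }.distinct.sorted

  def main(args: Array[String]): Unit = {
    val line = "arma uirumque cano, arma."
    tokenize(line).foreach(println)  // one classified token per line
    println(wordList(line))          // Vector(arma, cano, uirumque)
  }
}
```

The tokenized form keeps every token in order together with its category, while the word list discards punctuation and repetition.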

All material developed by Daniel Libatique, Dominic Machado and Neel Smith, and available under the Creative Commons Attribution-ShareAlike license (CC BY-SA 4.0).