Link Search Menu Expand Document

Build a word list for your corpus

Load a citable corpus.

import edu.holycross.shot.ohco2._
import edu.holycross.shot.mid.orthography._
// Load citable corpus
val url = "https://raw.githubusercontent.com/LinguaLatina/texts/master/texts/latin23/hyginus.cex"
val corpus = CorpusSource.fromUrl(url, cexHeader = true)
// tiny subset to practice on
import edu.holycross.shot.cite._
val c108a = corpus ~~ CtsUrn("urn:cts:latinLit:stoa1263.stoa001.hc:108a")

To tokenize, we need to import the MidOrthography trait, and an implementation for our corpus.

import edu.holycross.shot.mid.orthography.MidOrthography
import edu.holycross.shot.latin._
val tokenizable = TokenizableCorpus(c108a, Latin23Alphabet)
val wordList = tokenizable.wordList
// write to a file
import java.io.PrintWriter
new PrintWriter("c108-words.txt"){write(wordList.mkString("\n"));close;}

All material developed by Daniel Libatique, Dominic Machado and Neel Smith, and available under the Creative Commons Attribution Share-Alike license CC BY-SA 4.0