GC: CUNY Linguistics Colloquium

April 7, 2016 @ 4:15 pm - 6:15 pm

Speaker: Richard Sproat – Google Research, New York
(Joint work with Kyle Gorman)

Title: Text normalization: Reading words that are not words

Abstract: The web page for this talk lists it as occurring on “APR 07, 2016 | 4:15 PM TO 6:00 PM” at “The Graduate Center, 365 Fifth Avenue.” Any competent speaker of English knows to read the date and time as “April seventh, twenty sixteen, four fifteen p m to six p m,” and the address as “three sixty five” (rather than “three hundred sixty five”) Fifth Avenue. Most of the written material in these expressions does not consist of ordinary, conventionally spelled words; speakers must mentally translate these “non-standard words” into ordinary words as part of the process of reading.

Text normalization is the problem of building computational algorithms that mimic this process, for example as part of a text-to-speech synthesizer.

Since there are many classes of “non-standard words” (besides dates, times and addresses, there are currency amounts, measures and abbreviations, among many others), building a wide-coverage text normalization system for a language is labor-intensive. For some languages, such as Russian with its complex inflectional morphology, the process can be especially difficult. In this talk I will outline the problem and give specific examples of why it is hard, why it is linguistically interesting, and why large parts of text normalization systems are still constructed by hand rather than trained using machine-learning algorithms.

I will also discuss one of several areas where we are investigating machine-learning alternatives to hand-constructed grammars: a system that learns to verbalize number names from their digit representation, using a small amount of training data and a large amount of domain-specific linguistic knowledge.

The PhD Program in Linguistics


Room 6417
365 Fifth Ave.
NYC, NY 10016 United States
