NAME
    README for gramophone - wrapper scripts for hybrid grapheme-phoneme
    conversion

DESCRIPTION
    gramophone is a package for hybrid grapheme-to-phoneme conversion. It
    combines three components: a set of heuristic mappings that determine
    admissible segmentations, a Conditional Random Field model that labels
    candidate segmentations, and a language model over grapheme+phoneme
    segment-pairs that determines the optimal transcription.

INSTALLATION
  Requirements
    wapiti
        Tested version 1.5.0; available from http://wapiti.limsi.fr/

    OpenFst
        Tested version 1.3.4 (v1.4.x does not currently work due to Python
        wrapper incompatibilities). Available from http://www.openfst.org/

    OpenGrm Ngram library
        Tested version 1.1.0 (versions > 1.1.0 appear to require OpenFst >
        v1.3.4, which breaks the OpenFst Python wrappers). Available from
        http://www.opengrm.org/

    python
        Tested version 2.7.3. Available from http://www.python.org/

    perl
        Tested version 5.14.2. Available from http://www.perl.org/

    pyfst
        Python wrappers for the OpenFst libraries. See
        http://pyfst.github.io/

    c++ compiler
        Tested g++ v4.7.2, see http://gcc.gnu.org/

    a computer running a reasonably sane operating system
        Tested Debian GNU/Linux v7 ("wheezy"), MacOS X.

  Building from SVN
    To build this package from SVN sources, you must first run the shell
    command:

     bash$ sh ./autoreconf.sh

    from the distribution root directory BEFORE running ./configure.
    Building from SVN sources requires additional development tools to be
    present on the build system. Then follow the instructions in "Building
    from Source".

  Building from Source
    To build and install the entire package, issue the following commands
    to the shell:

     bash$ cd gramophone-0.01   # (or wherever you unpacked this distribution)
     bash$ sh ./configure       # configure the package
     bash$ make                 # build the package
     bash$ make install         # install the package on your system

    More details on the top-level installation process can be found in the
    file INSTALL in the distribution root directory.

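    Before running ./configure, it can help to confirm that the command-line
    tools listed under "Requirements" are reachable on your PATH. The
    following is only a minimal sketch and not part of the distribution: the
    helper name check_tools is hypothetical, and the command names it probes
    (wapiti, fstcompile for OpenFst, ngramcount for OpenGrm Ngram, python,
    perl, g++) are assumptions about a typical installation. It checks
    neither versions nor the pyfst Python module.

```shell
#!/bin/sh
# check_tools (hypothetical helper): print one "missing: NAME" line for
# each named command that is not found on PATH, and return non-zero if
# at least one command is missing.
check_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            status=1
        fi
    done
    return $status
}

# ASSUMPTION: these command names correspond to the requirements listed
# above; adjust them if your installation uses different names.
check_tools wapiti fstcompile ngramcount python perl g++ \
    || echo "install the missing tools before running ./configure"
```

    The script prints nothing when every command is found. Tested versions
    are listed under "Requirements"; this check only verifies presence.
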
USAGE
    The perl program gramophone installed from the perl/ distribution
    subdirectory provides a flexible top-level command-line interface to
    the various scripts and utilities included in this distribution. See
    the output of "gramophone --help" for a usage summary.

  File Formats
    Mappings
        Training a gramophone model requires a set of user-defined
        "mappings" from input grapheme segments (character substrings) to
        admissible phonetic transcriptions (phonetic markup substrings).
        Mappings are passed to the gramophone script via the "-map"
        option, which specifies a TAB-separated file with lines of the
        format

         GRAPH "\t" PHON ("\t" COMMENT)?

        where "GRAPH" is a potential grapheme segment, "PHON" is an
        admissible phonetic transcription for "GRAPH", and "COMMENT" is an
        optional comment. Some example mappings for the de-wiktionary
        dataset are:

         a      a
         a      aː
         a      ʔa      # glottal
         aa     aː
         aa     ʔaː

    Training Corpora
        Training corpora are used to train a "gramophone" model. Each line
        of the training corpus maps an entire grapheme string (word) to
        its phonetic transcription, separated by a TAB. By convention, all
        words in a training corpus should be converted to lower-case
        before training.

    Models
        gramophone models as produced by the top-level "gramophone"
        wrapper script are simply directories containing the data needed
        to apply the phonetization function previously learned from a
        training corpus. The training data itself is generally not
        recoverable from a "gramophone" model directory.

    Input Corpora
        Input corpora for runtime application via "gramophone -apply" use
        a line-based format analogous to that used for training corpora:
        one word per line, grapheme strings (words) only. By convention,
        all input strings should be in lower-case.

    Output Corpora
        Corpora output by "gramophone -apply" have the same format as
        training corpora: one word per line, with TAB-separated grapheme
        string and phonetic transcription.

ACKNOWLEDGEMENTS
    We would appreciate gramophone users acknowledging its use in their
    publications.
    You can cite:

     Kay-Michael Würzner and Bryan Jurish. "A hybrid approach to
     grapheme-phoneme conversion." In *Proceedings of the 12th
     International Conference on Finite State Methods and Natural
     Language Processing* (Düsseldorf, Germany, 22nd - 24th June, 2015),
     2015.

    The full paper can be downloaded from
    http://fsmnlp2015.phil.hhu.de/wp-content/uploads/2015/06/wuerzner_jurish-grapheme_phoneme.pdf,
    and a BibTeX entry for the paper can be found at
    http://kaskade.dwds.de/~moocow/mirror/pubs/wj2015gramophone.bib.

SEE ALSO
    http://fsmnlp2015.phil.hhu.de/wp-content/uploads/2015/06/wuerzner_jurish-grapheme_phoneme.pdf,
    python(1), perl(1).

AUTHORS
    Kay-Michael Würzner and Bryan Jurish

COPYRIGHT AND LICENSE
    Copyright (C) 2015 by Kay-Michael Würzner and Bryan Jurish.

    This package is free software. Redistribution and modification of C
    portions of this package are subject to the terms of version 3 or
    later of the GNU Lesser General Public License; see the files COPYING
    and COPYING.GPL-3 which came with the distribution for details.

    Redistribution and/or modification of the Perl portions of this
    package are subject to the same terms as Perl itself, either Perl
    version 5.14.2 or, at your option, any later version of Perl 5 you may
    have available.