gramophone g2p

Description

gramophone is a package for hybrid grapheme-to-phoneme conversion. It uses a set of heuristic mappings to determine admissible segmentations, a Conditional Random Field model to label candidate segmentations, and a language model over (grapheme, phoneme) segment pairs to determine the optimal transcription. The package is implemented using wapiti, OpenFst, OpenGrm, Python, and Perl.
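To make the notion of "admissible segmentations" more concrete, here is a minimal Python sketch that enumerates every way of splitting a word into (grapheme, phoneme) segment pairs licensed by a mapping table. It is an illustration only, not gramophone's implementation (which is built on OpenFst) and not its API: the function name `segmentations`, the `max_len` parameter, and the toy mapping `TOY_MAPPING` are all hypothetical. The CRF labelling and language-model ranking steps are not modelled here.

```python
from typing import Dict, Iterator, List, Tuple

Segment = Tuple[str, str]  # (grapheme substring, phoneme string)

# Hypothetical toy mapping, NOT taken from any gramophone mapping file:
# each grapheme substring lists the phoneme strings it may correspond to.
TOY_MAPPING: Dict[str, List[str]] = {
    "s": ["s"],
    "c": ["k"],
    "h": ["h"],
    "ch": ["x"],
    "sch": ["ʃ"],
    "u": ["uː"],
    "uh": ["uː"],
}


def segmentations(
    word: str, mapping: Dict[str, List[str]], max_len: int = 3
) -> Iterator[List[Segment]]:
    """Yield every segmentation of `word` whose segments all occur in `mapping`.

    `max_len` bounds the length of a single grapheme segment (an assumption
    made for this sketch).
    """
    if not word:
        # The empty word has exactly one segmentation: the empty one.
        yield []
        return
    for n in range(1, min(max_len, len(word)) + 1):
        grapheme = word[:n]
        for phoneme in mapping.get(grapheme, []):
            # Combine this segment with every segmentation of the remainder.
            for rest in segmentations(word[n:], mapping, max_len):
                yield [(grapheme, phoneme)] + rest


if __name__ == "__main__":
    for seg in segmentations("schuh", TOY_MAPPING):
        print(" ".join(f"{g}:{p}" for g, p in seg))
```

In a hybrid system of the kind described above, a candidate set like this would then be narrowed by the sequence labeller and scored by the segment-pair language model to pick a single transcription.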

We would appreciate gramophone users acknowledging its use in their publications. You can cite:

Kay-Michael Würzner & Bryan Jurish. "A hybrid approach to grapheme-phoneme conversion." In Proceedings of the 12th International Conference on Finite State Methods and Natural Language Processing (Düsseldorf, Germany, 22nd - 24th June, 2015), 2015.

The full paper can be downloaded here, and a BibTeX entry can be found here.

License

The gramophone package is distributed under the terms of the GNU Lesser General Public License (LGPL-v3), which itself incorporates the terms and conditions of the GNU General Public License.

Downloads & Links

Sources

github.com/wrznr/gramophone (current)
gramophone-0.0.1.tar.gz (old, stable)

Documentation

README.txt for the current gramophone distribution
Würzner & Jurish (2015) paper delivered at FSMNLP-2015 describing gramophone
Slides for the FSMNLP-2015 presentation

Models (linux-x86, 64-bit)

de-wiki-w5.model.tar.gz (German, Wiktionary, UTF-8/IPA, N=5)
de-lexdb-w5.model.tar.gz (German, LexDB, SAMPA, N=5)
en-celex-w5.model.tar.gz (English, CELEX, DISC, N=5)
en-celex-ipa-w5.model.tar.gz (English, CELEX, DISC->UTF-8/IPA, N=5)

Datasets

de-wiktionary.data.txt (German, Wiktionary, UTF-8/IPA) : use the 1st and 3rd columns.
LexDB aka VM-II-HyprLex (external link: German, SAMPA) : use the initial 2 columns ("Plain Ascii"), convert separators to TABs, and lower-case the 1st column.
CELEX (external link: English, DISC, N=5) : use "Create Lexicon" - "English Wordforms" - "PhonDISC", convert separators to TABs, and lower-case the 1st column.
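
The column selection, TAB conversion, and lower-casing described above could be scripted along the following lines. This is a sketch under assumptions, not a tool shipped with gramophone: the function name `select_columns` and its parameters are hypothetical, and it assumes the raw files are plain text with one entry per line and columns separated by whitespace (pass `sep` explicitly if a file uses a different separator).

```python
from typing import Iterable, Iterator, Optional, Sequence


def select_columns(
    lines: Iterable[str],
    columns: Sequence[int] = (0, 1),
    sep: Optional[str] = None,
    lowercase_first: bool = False,
) -> Iterator[str]:
    """Yield TAB-separated lines containing only the requested columns.

    columns         -- zero-based indices of the columns to keep
    sep             -- input column separator; None means any whitespace
    lowercase_first -- lower-case the first kept column (the word form)
    """
    for raw in lines:
        fields = raw.rstrip("\r\n").split(sep)
        if len(fields) <= max(columns):
            continue  # skip blank lines and lines with too few columns
        picked = [fields[i] for i in columns]
        if lowercase_first:
            picked[0] = picked[0].lower()
        yield "\t".join(picked)


if __name__ == "__main__":
    import sys

    # Example: keep the first two columns and lower-case the word form.
    for out in select_columns(sys.stdin, columns=(0, 1), lowercase_first=True):
        print(out)
```

For a file where the 1st and 3rd columns are wanted, the call would use `columns=(0, 2)`; for the recipes that ask for the first two columns with a lower-cased word form, `columns=(0, 1)` with `lowercase_first=True`.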

Mappings

de-wiktionary.gpk.txt (German, Wiktionary, UTF-8/IPA)
de-lexdb.gpk.txt (German, LexDB, SAMPA)
en-celex.gpk.txt (English, CELEX, DISC)
en-celex-ipa.gpk.txt (English, CELEX, DISC->UTF-8/IPA)

Miscellaneous

de-dlexdb.data.txt (morphological surface segmentation for German, DLexDB, UTF-8; distributed under the terms of the CC-BY-SA 3.0 license)

gramophone /ˈɡɹæməˌfoʊ̯n/

hybrid grapheme-phoneme conversion
