Description
gramophone is a package for hybrid grapheme-to-phoneme conversion using a set of heuristic mappings to determine admissible segmentations, a Conditional Random Field model for labelling candidate segmentations, and a language model over (grapheme,phoneme) segment-pairs to determine the optimal transcription. The package is implemented using wapiti, OpenFst, OpenGrm, Python, and Perl.
We would appreciate gramophone users acknowledging its use in their publications. You can cite:
Kay-Michael Würzner & Bryan Jurish. "A hybrid approach to grapheme-phoneme conversion." In Proceedings of the 12th International Conference on Finite State Methods and Natural Language Processing (Düsseldorf, Germany, 22nd - 24th June, 2015), 2015.
The full paper can be downloaded here, and a BibTeX entry can be found here.
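As a rough illustration of the first stage described above (heuristic mappings determining admissible segmentations), the following toy sketch enumerates every way of splitting a word into grapheme segments that the mapping licenses, pairing each segment with a candidate phoneme string. This is not gramophone's implementation: the function name `admissible_segmentations` and the mapping table `TOY_MAPPING` are hypothetical and made up for this example, and the CRF labelling and language-model ranking stages are not shown.

```python
from typing import Dict, List, Tuple

# Hypothetical toy mapping (NOT taken from any gramophone .gpk file):
# grapheme substring -> phoneme strings it may be realised as.
TOY_MAPPING: Dict[str, List[str]] = {
    "s": ["s"],
    "ch": ["x"],
    "sch": ["ʃ"],
    "u": ["u"],
    "uh": ["uː"],
    "h": ["h"],
}

Segmentation = List[Tuple[str, str]]


def admissible_segmentations(word: str, mapping: Dict[str, List[str]]) -> List[Segmentation]:
    """Return all (grapheme, phoneme) segmentations of `word` licensed by `mapping`."""
    if not word:
        # The empty word has exactly one segmentation: the empty one.
        return [[]]
    results: List[Segmentation] = []
    for end in range(1, len(word) + 1):
        grapheme = word[:end]
        # Only prefixes listed in the mapping may start a segmentation.
        for phoneme in mapping.get(grapheme, []):
            for rest in admissible_segmentations(word[end:], mapping):
                results.append([(grapheme, phoneme)] + rest)
    return results


if __name__ == "__main__":
    for seg in admissible_segmentations("schuh", TOY_MAPPING):
        print(" ".join(f"{g}:{p}" for g, p in seg))
```

In the approach described above, candidates like these would then be labelled and ranked by the statistical models; the sketch stops at enumeration.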
License
The gramophone package is distributed under the terms of the GNU Lesser General Public License (LGPL-v3), which itself incorporates the terms and conditions of the GNU General Public License.
Downloads & Links
- Sources
  - github.com/wrznr/gramophone (current)
  - gramophone-0.0.1.tar.gz (old, stable)
- Documentation
  - README.txt for the current gramophone distribution
  - Würzner & Jurish (2015) paper delivered at FSMNLP-2015 describing gramophone
  - Slides for the FSMNLP-2015 presentation
- Models (linux-x86, 64-bit)
  - de-wiki-w5.model.tar.gz (German, Wiktionary, UTF-8/IPA, N=5)
  - de-lexdb-w5.model.tar.gz (German, LexDB, SAMPA, N=5)
  - en-celex-w5.model.tar.gz (English, CELEX, DISC, N=5)
  - en-celex-ipa-w5.model.tar.gz (English, CELEX, DISC->UTF-8/IPA, N=5)
- Datasets
  - de-wiktionary.data.txt (German, Wiktionary, UTF-8/IPA): use the 1st and 3rd columns.
  - LexDB aka VM-II-HyprLex (external link: German, SAMPA): use the initial 2 columns of "Plain Ascii", convert the separators to TABs, and lower-case the 1st column.
  - CELEX (external link: English, DISC, N=5): use "Create Lexicon" - "English Wordforms" - "PhonDISC", convert the separators to TABs, and lower-case the 1st column.
- Mappings
  - de-wiktionary.gpk.txt (German, Wiktionary, UTF-8/IPA)
  - de-lexdb.gpk.txt (German, LexDB, SAMPA)
  - en-celex.gpk.txt (English, CELEX, DISC)
  - en-celex-ipa.gpk.txt (English, CELEX, DISC->UTF-8/IPA)
- Miscellaneous
  - de-dlexdb.data.txt (morphological surface segmentation for German, DLexDB, UTF-8; distributed under the terms of the CC-BY-SA 3.0 license)
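The dataset preparation notes under "Datasets" above (pick two columns, convert to TABs, lower-case the 1st column) could be sketched roughly as follows. This is a minimal sketch, not a script from the gramophone distribution: the helper name `prepare_rows` is hypothetical, the input separators are assumptions (TAB for the Wiktionary file, arbitrary whitespace for the exported lexica), and the sample rows are invented for illustration.

```python
from typing import Iterable, Iterator, Optional, Sequence


def prepare_rows(
    lines: Iterable[str],
    columns: Sequence[int],
    delimiter: Optional[str] = None,
    lowercase_first: bool = False,
) -> Iterator[str]:
    """Yield TAB-separated rows holding only the selected (0-based) columns.

    delimiter=None splits on any run of whitespace (an assumption about the
    exported files); pass "\t" for input that is already TAB-separated.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split(delimiter)
        if max(columns) >= len(fields):
            continue  # skip rows that lack one of the requested columns
        picked = [fields[i] for i in columns]
        if lowercase_first:
            picked[0] = picked[0].lower()
        yield "\t".join(picked)


if __name__ == "__main__":
    # Invented sample rows; the real files may differ in layout.
    wiktionary_like = ["Schuh\tx\tʃuː\n"]
    lexicon_like = ["Abend a:b@nt extra\n"]
    # 1st and 3rd columns of TAB-separated input.
    print(list(prepare_rows(wiktionary_like, (0, 2), delimiter="\t")))
    # Initial 2 columns, TAB-joined, 1st column lower-cased.
    print(list(prepare_rows(lexicon_like, (0, 1), lowercase_first=True)))
```

Rows lacking a requested column are silently dropped here; a stricter variant might raise an error instead so that malformed exports are noticed early.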