NAME
    README for gramophone - wrapper scripts for hybrid grapheme-phoneme
    conversion

DESCRIPTION
    gramophone is a package for hybrid grapheme-to-phoneme conversion using
    a set of heuristic mappings to determine admissible segmentations, a
    Conditional Random Field model for labelling candidate segmentations,
    and a language model over grapheme+phoneme segment-pairs to determine
    the optimal transcription.
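
    As a rough illustration of the first stage only, the following Python
    sketch enumerates every segmentation of a word that a set of mappings
    admits. It is not taken from the gramophone sources: the function name
    "segmentations" and the toy mapping table are hypothetical, and the CRF
    labelling and language-model ranking stages are not modelled here.

```python
def segmentations(word, mappings):
    """Yield each admissible segmentation of `word` as a list of
    (grapheme_segment, phonetic_string) pairs.

    `mappings` is a dict from grapheme segment to a list of admissible
    phonetic strings (a hypothetical in-memory form of a mappings file).
    """
    if not word:
        yield []  # the empty word has exactly one (empty) segmentation
        return
    for end in range(1, len(word) + 1):
        graph = word[:end]
        for phon in mappings.get(graph, ()):
            for rest in segmentations(word[end:], mappings):
                yield [(graph, phon)] + rest


# Toy mappings for illustration only (assumption, not a real dataset).
toy = {"a": ["a", "aː"], "aa": ["aː"], "l": ["l"]}
candidates = list(segmentations("aal", toy))
for cand in candidates:
    print(cand)
```

    With these toy mappings the word "aal" has five candidate
    segmentations; in the approach described above, choosing among such
    candidates is the job of the later stages.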

INSTALLATION
  Requirements
    wapiti
        Tested version 1.5.0; available from http://wapiti.limsi.fr/

    OpenFst
        Tested version 1.3.4 (versions 1.4.x currently do not work due to
        python wrapper incompatibilities). Available from
        http://www.openfst.org/

    OpenGrm Ngram library
        Tested version 1.1.0 (versions > 1.1.0 appear to require OpenFst >
        v1.3.4, which breaks the OpenFst python wrappers). Available from
        http://www.opengrm.org/

    python
        Tested version 2.7.3. Available from http://www.python.org/

    perl
        Tested version 5.14.2. Available from http://www.perl.org/

    pyfst
        Python wrappers for the OpenFst libraries. See
        http://pyfst.github.io/

    c++ compiler
        Tested g++ v4.7.2, see http://gcc.gnu.org/

    a computer running a reasonably sane operating system
        Tested Debian GNU/Linux v7 ("wheezy"), MacOS X.
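
    Before building, it can help to check that the command-line tools
    listed above are on the PATH. The Python 3 sketch below is only an
    illustration: the executable names chosen to represent each requirement
    ("fstcompile" for OpenFst, "ngramcount" for the OpenGrm Ngram library,
    "g++" for the compiler) are assumptions, it does not check versions,
    and it does not check the pyfst module.

```python
import shutil

# Requirement name -> executable assumed to indicate its presence.
# These executable names are assumptions, not part of this distribution.
REQUIRED_TOOLS = {
    "wapiti": "wapiti",
    "OpenFst": "fstcompile",
    "OpenGrm Ngram library": "ngramcount",
    "python": "python",
    "perl": "perl",
    "c++ compiler": "g++",
}


def missing_tools(which=shutil.which, tools=REQUIRED_TOOLS):
    """Return the sorted names of requirements whose executable
    cannot be found by `which` (default: a PATH lookup)."""
    return sorted(name for name, exe in tools.items() if which(exe) is None)


if __name__ == "__main__":
    for name in missing_tools():
        print("not found on PATH:", name)
```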

  Building from SVN
    To build this package from SVN sources, you must first run the shell
    command:

     bash$ sh ./autoreconf.sh

    from the distribution root directory BEFORE running ./configure. This
    step requires additional development tools to be present on the build
    system. Afterwards, follow the instructions in "Building from Source".

  Building from Source
    To build and install the entire package, issue the following commands to
    the shell:

     bash$ cd gramophone-0.01    # (or wherever you unpacked this distribution)
     bash$ sh ./configure        # configure the package
     bash$ make                  # build the package
     bash$ make install          # install the package on your system

    More details on the top-level installation process can be found in the
    file INSTALL in the distribution root directory.

USAGE
    The perl program gramophone installed from the perl/ distribution
    subdirectory provides a flexible top-level command-line interface to the
    various scripts and utilities included in this distribution. See the
    output of "gramophone --help" for a usage summary.

  File Formats
   Mappings
    Training a gramophone model requires a set of user-defined "mappings"
    from input grapheme segments (character substrings) to admissible
    phonetic transcriptions (phonetic markup substrings). Mappings are
    passed to the gramophone script via the "-map" option, which specifies a
    TAB-separated file with lines of the format

     GRAPH "\t" PHON ("\t" COMMENT)?

    where "GRAPH" is a potential grapheme-segment, "PHON" an admissible
    phonetic transcription for "GRAPH", and "COMMENT" an optional comment.
    Some example mappings for the de-wiktionary dataset are:

     a      a
     a      aː
     a      ʔa      # glottal
     aa     aː
     aa     ʔaː
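
    The line format above can be read with a few lines of Python. The
    sketch below is not part of the distribution: the function name
    "read_mappings" is hypothetical, and skipping blank lines is an
    assumption rather than documented behaviour of the gramophone script.

```python
import io


def read_mappings(lines):
    """Parse lines of the form GRAPH<TAB>PHON[<TAB>COMMENT] into a dict
    mapping each grapheme segment to its list of phonetic transcriptions."""
    mappings = {}
    for lineno, line in enumerate(lines, 1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # assumption: blank lines are ignored
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError("line %d: expected GRAPH<TAB>PHON" % lineno)
        graph, phon = fields[0], fields[1]  # any third field is a comment
        phons = mappings.setdefault(graph, [])
        if phon not in phons:
            phons.append(phon)
    return mappings


# The example mappings shown above, as TAB-separated text.
sample = io.StringIO(
    "a\ta\n"
    "a\taː\n"
    "a\tʔa\t# glottal\n"
    "aa\taː\n"
    "aa\tʔaː\n"
)
mappings = read_mappings(sample)
print(mappings)
```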

   Training Corpora
    Training corpora are used to train a "gramophone" model. Each line of
    the training corpus maps an entire grapheme string (word) to its
    phonetic transcription, separated by TABs. By convention, all words in a
    training corpus should be converted to lower-case before training.
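
    The lower-casing convention can be applied with a small preprocessing
    step such as the Python 3 sketch below. The function name
    "normalize_training_line" and the sample entry are made up for
    illustration, and the sketch assumes exactly one word and one
    transcription per line; only the grapheme string is lower-cased, the
    transcription is left untouched.

```python
def normalize_training_line(line):
    """Lower-case the grapheme string of a WORD<TAB>TRANSCRIPTION line."""
    word, sep, phon = line.rstrip("\r\n").partition("\t")
    if not sep or not word or not phon:
        raise ValueError("expected WORD<TAB>TRANSCRIPTION: %r" % line)
    return word.lower() + "\t" + phon


# Illustrative entry only (assumption, not taken from a real corpus).
print(normalize_training_line("Aal\taːl\n"))
```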

   Models
    gramophone models as produced by the top-level "gramophone" wrapper
    script are simply directories containing the necessary data to apply the
    phonetization function previously learned from a training corpus. The
    training data itself is generally not recoverable from a "gramophone"
    model directory.

   Input Corpora
    Input corpora for runtime application via "gramophone -apply" use a
    line-based format analogous to that used for training corpora: one word
    per line, containing only the grapheme string (word). By convention,
    all input strings should be in lower-case.

   Output Corpora
    Corpora output by "gramophone -apply" have the same format as training
    corpora: one word per line, TAB-separated grapheme string and phonetic
    transcription.
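
    Output in this format is easy to post-process. The following Python 3
    sketch reads such lines into (word, transcription) pairs; the function
    name "read_output" and the sample line are hypothetical, and the sketch
    assumes that each line holds exactly two TAB-separated fields.

```python
def read_output(lines):
    """Collect (grapheme_string, transcription) pairs from lines of the
    form WORD<TAB>TRANSCRIPTION, skipping empty lines."""
    pairs = []
    for lineno, line in enumerate(lines, 1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # assumption: empty lines carry no data
        word, sep, phon = line.partition("\t")
        if not sep:
            raise ValueError("line %d: missing TAB separator" % lineno)
        pairs.append((word, phon))
    return pairs


# Illustrative line only (assumption, not real gramophone output).
pairs = read_output(["aal\taːl\n"])
print(pairs)
```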

ACKNOWLEDGEMENTS
    We would appreciate gramophone users acknowledging its use in their
    publications. You can cite:

    Kay-Michael Würzner and Bryan Jurish. "A hybrid approach to
    grapheme-phoneme conversion." In *Proceedings of the 12th International
    Conference on Finite State Methods and Natural Language Processing*
    (Düsseldorf, Germany, 22nd - 24th June, 2015), 2015.

    The full paper can be downloaded from
    http://fsmnlp2015.phil.hhu.de/wp-content/uploads/2015/06/wuerzner_jurish-grapheme_phoneme.pdf,
    and a BibTeX entry for the paper can be found at
    http://kaskade.dwds.de/~moocow/mirror/pubs/wj2015gramophone.bib.

SEE ALSO
    http://fsmnlp2015.phil.hhu.de/wp-content/uploads/2015/06/wuerzner_jurish-grapheme_phoneme.pdf,
    python(1), perl(1).

AUTHORS
    Kay-Michael Würzner <wuerzner@bbaw.de> and Bryan Jurish <jurish@bbaw.de>

COPYRIGHT AND LICENSE
    Copyright (C) 2015 by Kay-Michael Würzner and Bryan Jurish.

    This package is free software. Redistribution and modification of C
    portions of this package are subject to the terms of version 3 or
    greater of the GNU Lesser General Public License; see the files COPYING
    and COPYING.GPL-3 which came with the distribution for details.

    Redistribution and/or modification of the Perl portions of this package
    are subject to the same terms as Perl itself, either Perl version 5.14.2
    or, at your option, any later version of Perl 5 you may have available.