Basic UTF-8 manipulation routines by Jeff Bezanson. More...
#include <stdint.h>
#include <string.h>
#include <assert.h>
#include <stdarg.h>
#include <string>
#include <vector>
Go to the source code of this file.
Basic UTF-8 manipulation routines by Jeff Bezanson.
(from original utf8.c)
Basic UTF-8 manipulation routines
by Jeff Bezanson
placed in the public domain Fall 2005
This code is designed to provide the utilities you need to manipulate UTF-8 as an internal string encoding. These functions do not perform the error checking normally needed when handling UTF-8 data, so if you happen to be from the Unicode Consortium you will want to flay me alive. I do this because error checking can be performed at the boundaries (I/O), with these routines reserved for higher performance on data known to be valid.
#define u_int32_t uint32_t |
#define UTF8_MAXBYTES 4 |
#define UTF8_MAXBYTES1 5 |
#define isutf | ( | c | ) | (((c)&0xC0)!=0x80) |
is c the start of a utf8 sequence?
Referenced by u8_charnum(), u8_dec(), u8_inc(), u8_nextchar(), u8_nextcharn(), and u8_offset().
typedef uint32_t ucs4 |
typedef unsigned char uchar |
typedef std::string utf8str |
typedef for UTF-8 strings: use standard byte strings
int u8_seqlen | ( | const utf8str & | s, | |
size_t | i | |||
) |
c++ifiy: returns length of utf-8 sequence beginning at position i
of string s
returns length of next utf-8 sequence
References trailingBytesForUTF8.
convert UTF-8 byte string src
to UCS-4 wide character string dst
, without error checking. Data is appended to dst
.
dst | = destination UCS-4 string | |
src | = source UTF-8 byte string |
References offsetsFromUTF8, and trailingBytesForUTF8.
Referenced by u8_toucs().
convience wrapper: UTF-8 string -> UCS-4 string
convert UTF-8 byte string src
to a new UCS-4 string
References u8_toucs().
convert UCS-4 wide character string to UTF-8 byte string.
dst | = destination UTF-8 string | |
src | = source UCS-4 string |
Referenced by u8_toutf8().
convenience wrapper: UCS-4 string -> UTF-8 string
References u8_toutf8().
size_t u8_wc_len | ( | ucs4 | ch | ) |
(moo) get number of bytes required for representing a wide character ch in UTF-8. Returns 0 on error.
Referenced by u8_wc_toutf8(), and u8_ws_len().
size_t u8_ws_len | ( | const ucs4str & | src | ) |
(moo) get number of bytes required for representing a wide character string ws in UTF-8
References u8_wc_len().
append single UCS-4 character to a UTF-8 string
dst | UTF-8 destination buffer | |
ch | UCS-4 character to convert |
Referenced by u8_wc_toutf8(), unescapeCString(), unescapeJsonString(), and unescapeUtf8String().
convience wrapper: UCS-4 char -> UTF-8 string
References u8_wc_len(), and u8_wc_toutf8().
size_t u8_offset | ( | const utf8str & | s, | |
int | charnum | |||
) |
(logical) character number to (physical) byte offset
References isutf.
size_t u8_charnum | ( | const utf8str & | s, | |
int | offset | |||
) |
(physical) byte offset to (logical) character number
References isutf.
read and return next logical character, updating an index variable
References isutf, and offsetsFromUTF8.
Referenced by u8_strlen().
(moo): return next character, updating an index variable which may not exceed length slen
References isutf, and offsetsFromUTF8.
size_t u8_strlen | ( | const utf8str & | s | ) |
count the number of characters in a UTF-8 string
References u8_nextchar().