Basic UTF-8 manipulation routines by Jeff Bezanson. More...

#include <stdint.h>
#include <string.h>
#include <assert.h>
#include <stdarg.h>
#include <string>
#include <vector>

Include dependency graph for utf8xx.h:

This graph shows which files directly or indirectly include this file:

Go to the source code of this file.

Defines

#define u_int32_t uint32_t
#define UTF8_MAXBYTES 4
#define UTF8_MAXBYTES1 5
#define isutf(c) (((c)&0xC0)!=0x80)

Typedefs

typedef uint32_t ucs4
typedef unsigned char uchar
typedef std::vector< ucs4 > ucs4str
typedef std::string utf8str

Functions

int u8_seqlen (const utf8str &s, size_t i)
size_t u8_toucs (ucs4str &dst, const utf8str &src)
ucs4str u8_toucs (const utf8str &src)
size_t u8_toutf8 (utf8str &dst, const ucs4str &src)
utf8str u8_toutf8 (const ucs4str &src)
size_t u8_wc_len (ucs4 ch)
size_t u8_ws_len (const ucs4str &src)
size_t u8_wc_toutf8 (utf8str &dst, ucs4 ch)
utf8str u8_wc_toutf8 (ucs4 ch)
size_t u8_offset (const utf8str &s, int charnum)
size_t u8_charnum (const utf8str &s, int offset)
ucs4 u8_nextchar (const utf8str &s, size_t *i)
ucs4 u8_nextcharn (const utf8str &s, size_t slen, size_t *i)
size_t u8_strlen (const utf8str &s)
void u8_inc (const utf8str &s, size_t *i)
void u8_dec (const utf8str &s, size_t *i)

Detailed Description

Basic UTF-8 manipulation routines by Jeff Bezanson.

(from original utf8.c)

Basic UTF-8 manipulation routines
by Jeff Bezanson
placed in the public domain Fall 2005

This code is designed to provide the utilities you need to manipulate UTF-8 as an internal string encoding. These functions do not perform the error checking normally needed when handling UTF-8 data, so if you happen to be from the Unicode Consortium you will want to flay me alive. I do this because error checking can be performed at the boundaries (I/O), with these routines reserved for higher performance on data known to be valid.

Define Documentation

#define u_int32_t uint32_t

#define UTF8_MAXBYTES 4

#define UTF8_MAXBYTES1 5

#define isutf ( c ) (((c)&0xC0)!=0x80)

is c the start of a utf8 sequence?

Referenced by u8_charnum(), u8_dec(), u8_inc(), u8_nextchar(), u8_nextcharn(), and u8_offset().

Typedef Documentation

typedef uint32_t ucs4

typedef unsigned char uchar

typedef std::vector<ucs4> ucs4str

typedef for UCS-4 strings

typedef std::string utf8str

typedef for UTF-8 strings: use standard byte strings

Function Documentation

int u8_seqlen	(	const utf8str &	s,
		size_t	i
	)

c++ifiy: returns length of utf-8 sequence beginning at position i of string s

returns length of next utf-8 sequence

References trailingBytesForUTF8.

size_t u8_toucs	(	ucs4str &	dst,
		const utf8str &	src
	)

convert UTF-8 byte string src to UCS-4 wide character string dst, without error checking. Data is appended to dst.

Warning:: only works for valid UTF-8, i.e. no 5- or 6-byte sequences

Parameters:

	dst	= destination UCS-4 string
	src	= source UTF-8 byte string

Returns:: number of characters converted

References offsetsFromUTF8, and trailingBytesForUTF8.

Referenced by u8_toucs().

Here is the caller graph for this function:

ucs4str u8_toucs ( const utf8str & src )

convience wrapper: UTF-8 string -> UCS-4 string

convert UTF-8 byte string src to a new UCS-4 string

References u8_toucs().

Here is the call graph for this function:

size_t u8_toutf8	(	utf8str &	dst,
		const ucs4str &	src
	)

convert UCS-4 wide character string to UTF-8 byte string.

Parameters:

	dst	= destination UTF-8 string
	src	= source UCS-4 string

Returns:: number of characters converted

Referenced by u8_toutf8().

Here is the caller graph for this function:

utf8str u8_toutf8 ( const ucs4str & src )

convenience wrapper: UCS-4 string -> UTF-8 string

References u8_toutf8().

Here is the call graph for this function:

size_t u8_wc_len ( ucs4 ch )

(moo) get number of bytes required for representing a wide character ch in UTF-8. Returns 0 on error.

Referenced by u8_wc_toutf8(), and u8_ws_len().

Here is the caller graph for this function:

size_t u8_ws_len ( const ucs4str & src )

(moo) get number of bytes required for representing a wide character string ws in UTF-8

References u8_wc_len().

Here is the call graph for this function:

size_t u8_wc_toutf8	(	utf8str &	dst,
		ucs4	ch
	)

append single UCS-4 character to a UTF-8 string

Parameters:

	dst	UTF-8 destination buffer
	ch	UCS-4 character to convert

Returns:: number of bytes written to dst (0 <= RETVAL <= UTF8_MAXBYTES)

Referenced by u8_wc_toutf8(), unescapeCString(), unescapeJsonString(), and unescapeUtf8String().

Here is the caller graph for this function:

utf8str u8_wc_toutf8 ( ucs4 ch )

convience wrapper: UCS-4 char -> UTF-8 string

References u8_wc_len(), and u8_wc_toutf8().

Here is the call graph for this function:

size_t u8_offset	(	const utf8str &	s,
		int	charnum
	)

(logical) character number to (physical) byte offset

References isutf.

size_t u8_charnum	(	const utf8str &	s,
		int	offset
	)

(physical) byte offset to (logical) character number

References isutf.

ucs4 u8_nextchar	(	const utf8str &	s,
		size_t *	i
	)

read and return next logical character, updating an index variable

References isutf, and offsetsFromUTF8.

Referenced by u8_strlen().

Here is the caller graph for this function:

ucs4 u8_nextcharn	(	const utf8str &	s,
		size_t	slen,
		size_t *	i
	)

(moo): return next character, updating an index variable which may not exceed length slen

References isutf, and offsetsFromUTF8.

size_t u8_strlen ( const utf8str & s )

count the number of characters in a UTF-8 string

References u8_nextchar().

Here is the call graph for this function:

void u8_inc	(	const utf8str &	s,
		size_t *	i
	)

move to next character

References isutf.

void u8_dec	(	const utf8str &	s,
		size_t *	i
	)

move to previous character

References isutf.

utf8xx.h File Reference

Defines

Typedefs

Functions

Detailed Description

Define Documentation

Typedef Documentation

Function Documentation