unicode_canonical, unicode_ccc, unicode_decomposition_init, unicode_decomposition_deinit, unicode_decompose, unicode_decompose_reallocate_size, unicode_compose, unicode_composition_init, unicode_composition_deinit, unicode_composition_apply — unicode canonical normalization and denormalization
#include <courier-unicode.h>
unicode_canonical_t unicode_canonical(char32_t c);
uint8_t unicode_ccc(char32_t c);
void unicode_decomposition_init(unicode_decomposition_t *info, char32_t *string, size_t string_size, void *arg);
int unicode_decompose(unicode_decomposition_t *info);
void unicode_decomposition_deinit(unicode_decomposition_t *info);
size_t unicode_decompose_reallocate_size(unicode_decomposition_t *info, const size_t *sizes, size_t n);
int unicode_compose(char32_t *string, size_t string_size, int flags, size_t *new_size);
int unicode_composition_init(const char32_t *string, size_t string_size, int flags, unicode_composition_t *compositions);
void unicode_composition_deinit(unicode_composition_t *compositions);
size_t unicode_composition_apply(char32_t *string, size_t string_size, unicode_composition_t *compositions);
These functions compose or decompose a Unicode string into a canonical or a compatibility normalized form.
unicode_canonical() looks up the character's canonical and compatibility mapping, and returns a structure with the following fields:
canonical_chars
A pointer to the canonical or equivalent representation of the character.
n_canonical_chars
Number of characters in canonical_chars.
format
A value of UNICODE_CANONICAL_FMT_NONE indicates a canonical mapping; other values indicate a compatibility equivalent mapping.
A NULL canonical_chars (with an n_canonical_chars of 0) indicates that the character has no canonical or compatibility equivalence.
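The following is a minimal sketch of how a caller might interpret those three fields. It does not call the library: sketch_canonical_t, the SKETCH_FMT_* constants and sketch_mapping_kind() are hypothetical stand-ins that only mirror the documented field names, so that the sketch compiles on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>

/* Hypothetical stand-ins for the library's format constants. */
enum { SKETCH_FMT_NONE = 0, SKETCH_FMT_OTHER = 1 };

/* Stand-in with the same three fields the text describes. */
typedef struct {
    const char32_t *canonical_chars;
    size_t n_canonical_chars;
    int format;
} sketch_canonical_t;

enum sketch_kind { SKETCH_NO_MAPPING, SKETCH_CANONICAL, SKETCH_COMPATIBILITY };

/* Classify a lookup result following the rules given above. */
enum sketch_kind sketch_mapping_kind(const sketch_canonical_t *m)
{
    assert(m != NULL);

    /* NULL pointer with a zero count: no equivalence at all. */
    if (m->canonical_chars == NULL || m->n_canonical_chars == 0)
        return SKETCH_NO_MAPPING;

    /* The "none" format marks a canonical mapping, anything else
     * marks a compatibility mapping. */
    return m->format == SKETCH_FMT_NONE ? SKETCH_CANONICAL
                                        : SKETCH_COMPATIBILITY;
}
```

With the real library the structure would come from unicode_canonical() and the comparison would use UNICODE_CANONICAL_FMT_NONE.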
unicode_ccc() returns the character's canonical combining class value.
unicode_decomposition_init(), unicode_decompose() and unicode_decomposition_deinit() implement a complete interface for decomposing a Unicode string:
    unicode_decomposition_t info;

    unicode_decomposition_init(&info, before, (size_t)-1, NULL);
    info.decompose_flags=UNICODE_DECOMPOSE_FLAG_QC;
    unicode_decompose(&info);
    unicode_decomposition_deinit(&info);
unicode_decomposition_init() initializes a new unicode_decomposition_t structure, which gets passed in as its first parameter. The second parameter is a pointer to a Unicode string, with the number of characters in the string in the third parameter. A string size of -1 indicates a \0-terminated string, in which case unicode_decomposition_init() calculates its string_size (which does not include the trailing \0). The last parameter is a void *, an opaque pointer that gets stored in the initialized unicode_decomposition_t object:
    typedef struct unicode_decomposition {
        char32_t *string;
        size_t string_size;
        int decompose_flags;
        int (*reallocate)(struct unicode_decomposition *info,
                          const size_t *offsets,
                          const size_t *sizes,
                          size_t n);
        void *arg;
    } unicode_decomposition_t;
unicode_decompose() decomposes the string and replaces it with its decomposed version.
unicode_decomposition_t's string, string_size and arg are copies of unicode_decomposition_init()'s parameters. unicode_decomposition_init() initializes all other fields to their default values. The decompose_flags bitmask gets initialized to 0, and can have the following bits set:
UNICODE_DECOMPOSE_FLAG_QC
Check each character's appropriate “quick check” property and skip decomposing Unicode characters that would get re-composed by unicode_composition_apply().
UNICODE_DECOMPOSE_FLAG_COMPAT
Perform a compatibility decomposition instead of a canonical decomposition.
reallocate is a pointer to a function that gets called to reallocate a larger string. unicode_decompose() determines which characters in the string need decomposing and calls the reallocate function pointer zero or more times. Each call to reallocate passes information about where new characters will get inserted into the string.
reallocate only needs to grow the buffer that string points to, so that it's big enough to hold a larger, decomposed string, and then update string accordingly. reallocate should not update string_size or make any changes to the existing string contents; that's unicode_decompose()'s job (after reallocate returns).
The reallocate callback function receives the following parameters:
A pointer to the unicode_decomposition_t and, notably, its arg.
A pointer to the array of offset indexes in the string where new characters will get inserted in order to hold the decomposed string.
A pointer to the array that holds the number of characters that get inserted at each corresponding offset.
The size of the two arrays.
reallocate must update the string if necessary to hold at least the number of characters that's the sum of the initial string_size and the sum total of all sizes.
unicode_decomposition_init() initializes the reallocate pointer to a default implementation that uses realloc(3) and updates string with its return value. The application can use its own reallocate to handle this task on its own, and use unicode_decompose_reallocate_size() to compute the minimum string size:
    size_t unicode_decompose_reallocate_size(unicode_decomposition_t *info,
                                             const size_t *sizes,
                                             size_t n)
    {
        size_t i;
        size_t new_size=info->string_size;

        for (i=0; i<n; ++i)
            new_size += sizes[i];

        return new_size;
    }
The reallocate function returns 0 on success and a non-0 error code to report a failure; and unicode_decompose() does the same. The only error condition from unicode_decompose() is a non-0 error code from the reallocate function.
Otherwise: a successful decomposition results in unicode_decompose() returning 0, with the unicode_decomposition_t's string pointing to the decomposed string and string_size giving the number of characters in the decomposed string. string_size does not include the trailing \0 character. The input string also has its string_size specified without counting its \0 character. The default implementation of reallocate allocates an extra char32_t and sets it to a \0.
Therefore:
If the Unicode string before decomposition has a trailing \0 and no decomposition occurs, and no calls to reallocate take place: the string in the unicode_decomposition_t is unchanged and it's still \0-terminated.
The default reallocate allocates an extra char32_t and sets it to a \0; and it takes care of that for the decomposed string.
An application that provides its own replacement reallocate is responsible for doing the same, if it wants the decomposed string to be \0-terminated.
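A replacement reallocate that follows these rules could look like the sketch below. To keep it compilable without the library, it uses a local stand-in, sketch_decomposition_t, that copies the documented layout of unicode_decomposition_t; the sketch_ names are hypothetical. With the real library the callback would take a unicode_decomposition_t * and could call unicode_decompose_reallocate_size() directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <uchar.h>

/* Local stand-in mirroring the documented unicode_decomposition_t. */
typedef struct sketch_decomposition {
    char32_t *string;
    size_t string_size;
    int decompose_flags;
    int (*reallocate)(struct sketch_decomposition *info,
                      const size_t *offsets,
                      const size_t *sizes,
                      size_t n);
    void *arg;
} sketch_decomposition_t;

/* Same arithmetic as unicode_decompose_reallocate_size(). */
size_t sketch_reallocate_size(const sketch_decomposition_t *info,
                              const size_t *sizes, size_t n)
{
    size_t new_size = info->string_size;

    for (size_t i = 0; i < n; ++i)
        new_size += sizes[i];
    return new_size;
}

/* Grow the buffer and update the string pointer. The existing
 * characters and string_size are left alone, as required. */
int sketch_reallocate(sketch_decomposition_t *info,
                      const size_t *offsets,
                      const size_t *sizes,
                      size_t n)
{
    (void)offsets; /* only the sizes matter for the allocation */
    assert(info != NULL);

    size_t new_size = sketch_reallocate_size(info, sizes, n);
    char32_t *p = realloc(info->string, (new_size + 1) * sizeof(char32_t));

    if (p == NULL)
        return -1; /* a non-0 value reports the failure */

    p[new_size] = 0; /* extra element: terminator for the grown string */
    info->string = p;
    return 0;
}
```

The extra element is what keeps the decomposed string \0-terminated; a callback that does not need termination could allocate exactly new_size characters instead.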
Multiple calls to the reallocate callback are possible. Each call to reallocate reflects the prior calls' decompositions. Example: the original string has five characters and the first call to reallocate had two offsets, at positions 1 and 3, with a value of 1 for both of their sizes. This has the effect of transforming an original Unicode string "AAAAA" into "AXAAXAA" (with “A” representing unspecified characters in the original string, and “X” showing the two characters added in the first call to reallocate). A second call to reallocate with an offset at position 4, and a size of 1, results in the updated string of "AXAAYXAA" (with “Y” marking an unspecified character inserted by the second call).
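The bookkeeping in this example can be reproduced with a small simulation. This is only an illustration of how the offsets and sizes relate to the string, not the library's code: sketch_insert_marks() is a hypothetical helper, and it works on plain char rather than char32_t to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Open sizes[i] slots in front of the character at offsets[i] and fill
 * them with `mark`. Offsets index the string as it was before this call
 * and must be in ascending order. buf must have room for the grown
 * string plus its terminator. Returns the new length. */
size_t sketch_insert_marks(char *buf, size_t len,
                           const size_t *offsets, const size_t *sizes,
                           size_t n, char mark)
{
    size_t total = 0;

    for (size_t i = 0; i < n; ++i) {
        assert(offsets[i] <= len);
        assert(i == 0 || offsets[i - 1] <= offsets[i]);
        total += sizes[i];
    }

    size_t src_end = len;         /* end of the tail not yet moved */
    size_t dst_end = len + total; /* where that tail has to end up */

    buf[dst_end] = '\0';

    /* Work backwards so nothing is overwritten before it is moved. */
    for (size_t i = n; i-- > 0; ) {
        size_t tail = src_end - offsets[i];

        memmove(buf + dst_end - tail, buf + offsets[i], tail);
        dst_end -= tail;
        memset(buf + dst_end - sizes[i], mark, sizes[i]);
        dst_end -= sizes[i];
        src_end = offsets[i];
    }
    return len + total;
}
```

Calling it on "AAAAA" with offsets {1, 3}, sizes {1, 1} and the mark 'X', and then on the result with offset {4}, size {1} and the mark 'Y', walks through the two calls described above.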
Unicode string decomposition involves replacing a given Unicode character with one or more other characters. The sizes given to reallocate reflect the net addition to the Unicode string. For example: decomposing one Unicode character into three decomposed characters results in a call to reallocate reporting an insert of two more characters.
offsets actually report the indices of each Unicode character that's getting decomposed. A 1:1 decomposition of a Unicode character gets reported as an additional sizes entry of 0.
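The relationship between decomposition lengths and the reported values can be sketched as follows. The input array decomposed_len and the function sketch_report() are hypothetical; they only illustrate the arithmetic described above and say nothing about how the library computes its arrays internally.

```c
#include <assert.h>
#include <stddef.h>

/* decomposed_len[i] is the number of characters that character i
 * decomposes into, or 0 if character i is left alone (hypothetical
 * input). Fills offsets[] and sizes[] and returns the entry count. */
size_t sketch_report(const size_t *decomposed_len, size_t n_chars,
                     size_t *offsets, size_t *sizes)
{
    size_t n = 0;

    assert(decomposed_len != NULL && offsets != NULL && sizes != NULL);

    for (size_t i = 0; i < n_chars; ++i) {
        if (decomposed_len[i] == 0)
            continue; /* not decomposed, nothing to report */

        offsets[n] = i;                   /* index of the character */
        sizes[n] = decomposed_len[i] - 1; /* net growth, 0 for 1:1 */
        ++n;
    }
    return n;
}
```

A character that decomposes into three characters yields a sizes entry of 2, and a 1:1 replacement still gets an entry, with a size of 0.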
unicode_decomposition_deinit() releases all resources and destroys the unicode_decomposition_t; it is no longer valid. unicode_decomposition_deinit() does not free(3) the string. The original string gets passed in to unicode_decomposition_init() and the decomposed string is left in the string. The default implementation of the reallocate function assumes the string is a malloc(3)-ed string, and reallocs it.
At this time unicode_decomposition_deinit() does nothing. All code should explicitly call it in order to remain forward-compatible (at the source level).
unicode_compose() performs a canonical composition of a decomposed string. Its parameters are:
A pointer to the decomposed Unicode string.
The number of characters in the Unicode string. The Unicode string does not need to be \0-terminated; if it is, this number does not include it.
A flags bitmask, which can have the following values:
UNICODE_COMPOSE_FLAG_REMOVEUNUSED
Remove all combining marks after doing all canonical compositions. Normally any unused combining marks are left in place, in the combined text. This option removes them.
UNICODE_COMPOSE_FLAG_ONESHOT
Perform canonical composition once per character, and do not attempt to combine any resulting combined characters again.
A non-NULL pointer to a size_t.
A successful composition sets this size_t to the number of characters in the combined string, and returns 0. The string gets combined in place: the combined string gets placed back into the string parameter, and the size_t gives its new size.
unicode_compose() returns a non-zero value to indicate an error.
unicode_composition_init(), unicode_composition_apply() and unicode_composition_deinit() implement a detailed interface for canonical composition of a decomposed Unicode string:
    unicode_composition_t compositions;

    if (unicode_composition_init(str, strsize, flags, &compositions) == 0)
    {
        size_t new_size=unicode_composition_apply(str, strsize, &compositions);

        unicode_composition_deinit(&compositions);
    }
The first two parameters to both unicode_composition_init() and unicode_composition_apply() are the same: the Unicode string and the number of characters (not including any trailing \0 character) in the Unicode string.
unicode_composition_init()'s additional parameters are: any optional flags (see unicode_compose() for a list of available flags), and the address of a unicode_composition_t object. A non-0 return from unicode_composition_init() indicates an error. unicode_composition_init() indicates success by returning 0 and initializing the unicode_composition_t object, which contains a pointer to an array of pointers to unicode_compose_info objects, and the number of pointers. unicode_composition_init() does not change the string; the only thing it does is initialize the unicode_composition_t object.
unicode_composition_apply() applies the compositions to the string, in place, and returns the new size of the string (also not including the \0 character; however, it does append one if the composed string is smaller, so the composed string is \0-terminated if the decomposed string was).
It is necessary to call unicode_composition_deinit() to free all memory that was allocated for the unicode_composition_t object:
    struct unicode_compose_info {
        size_t index;
        size_t n_composed;
        char32_t *composition;
        size_t n_composition;
    };

    typedef struct {
        struct unicode_compose_info **compositions;
        size_t n_compositions;
    } unicode_composition_t;
index gives the character index in the string where each composition occurs. n_composed gives the number of characters in the original string that get composed. The composed characters are in composition, and n_composition gives the number of composed characters.
Effectively: at the index position in the original string, n_composed characters get removed and n_composition characters replace them (always n_composed or fewer).
The UNICODE_COMPOSE_FLAG_REMOVEUNUSED flag has the effect of including the combining marks that did not get combined in the n_composed count. It's possible that, in this case, n_composition is 0. This indicates complete removal of the combining marks, without anything getting combined in their place.
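The "remove n_composed characters, put n_composition characters in their place" rule can be illustrated with the sketch below. It is not the library's implementation of unicode_composition_apply(): struct sketch_compose_info is a local stand-in with the documented fields (with composition made const for convenience), sketch_apply() is a hypothetical name, and the entries are assumed to be sorted by index and non-overlapping.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Local stand-in with the documented unicode_compose_info fields. */
struct sketch_compose_info {
    size_t index;
    size_t n_composed;
    const char32_t *composition;
    size_t n_composition;
};

/* Apply the replacements in place, front to back, and return the new
 * number of characters. */
size_t sketch_apply(char32_t *string, size_t string_size,
                    const struct sketch_compose_info *const *compositions,
                    size_t n_compositions)
{
    size_t src = 0, dst = 0;

    for (size_t i = 0; i < n_compositions; ++i) {
        const struct sketch_compose_info *c = compositions[i];

        assert(c->index >= src && c->index + c->n_composed <= string_size);
        assert(c->n_composition <= c->n_composed);

        /* Keep the untouched characters in front of this entry. */
        memmove(string + dst, string + src,
                (c->index - src) * sizeof *string);
        dst += c->index - src;

        /* Write the replacement; it may be empty (pure removal). */
        if (c->n_composition)
            memcpy(string + dst, c->composition,
                   c->n_composition * sizeof *string);
        dst += c->n_composition;
        src = c->index + c->n_composed;
    }

    /* Keep whatever follows the last entry. */
    memmove(string + dst, string + src, (string_size - src) * sizeof *string);
    dst += string_size - src;

    if (dst < string_size)
        string[dst] = 0; /* terminate when the string got shorter */
    return dst;
}
```

Because every replacement is no longer than what it replaces, the write position never overtakes the read position, which is what makes the in-place rewrite safe.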
unicode_composition_init() sets unicode_composition_t's compositions pointer to an array of pointers to unicode_compose_info objects that are sorted according to their index. n_compositions gives the number of pointers in the array, and is 0 if there are no compositions and the array is empty. The empty array gets interpreted accordingly when it gets passed to unicode_composition_apply() and unicode_composition_deinit(): nothing happens. unicode_composition_apply() simply returns the size of the unchanged string, and unicode_composition_deinit() does a pro-forma cleanup.