unicode_wb_init, unicode_wb_next, unicode_wb_next_cnt, unicode_wb_end, unicode_wbscan_init, unicode_wbscan_next, unicode_wbscan_end, unicode_word_break — calculate word breaks
#include <courier-unicode.h>
unicode_wb_info_t unicode_wb_init(int (*cb_func)(int, void *), void *cb_arg);
int unicode_wb_next(unicode_wb_info_t wb, char32_t c);
int unicode_wb_next_cnt(unicode_wb_info_t wb, const char32_t *cptr, size_t cnt);
int unicode_wb_end(unicode_wb_info_t wb);
unicode_wbscan_info_t unicode_wbscan_init(void);
int unicode_wbscan_next(unicode_wbscan_info_t wbs, char32_t c);
size_t unicode_wbscan_end(unicode_wbscan_info_t wbs);
These functions implement the Unicode word breaking algorithm.
Invoke unicode_wb_init() to initialize the word breaking algorithm.
The first parameter is a callback function. The second parameter is
an opaque pointer. The callback function gets invoked with two
parameters. The first reports the word breaking status of one
character. The second is the opaque pointer that was given to
unicode_wb_init(); these functions do not interpret the opaque
pointer in any way.
unicode_wb_init() returns an opaque handle. Repeated invocations of
unicode_wb_next(), each passing the handle and one Unicode character,
define the sequence of Unicode characters over which the word
breaking calculation takes place. unicode_wb_next_cnt() is a shortcut
for invoking unicode_wb_next() repeatedly over an array cptr
containing cnt Unicode characters.
unicode_wb_end() denotes the end of the Unicode character sequence.
After the call to unicode_wb_end() the unicode_wb_info_t word
breaking handle is no longer valid.
Between the call to unicode_wb_init() and the return from
unicode_wb_end(), the callback function gets invoked exactly once for
each Unicode character given to unicode_wb_next() or
unicode_wb_next_cnt(). Usually each call to unicode_wb_next() results
in the callback function getting invoked immediately, but this is not
guaranteed. A call to unicode_wb_next() may return without invoking
the callback function, and a subsequent call to unicode_wb_next() (or
unicode_wb_end()) may then invoke the callback function more than
once to catch up. The contract is that, unless an error occurs, by
the time unicode_wb_end() returns the callback function has been
invoked exactly as many times as there are characters in the Unicode
sequence defined by the intervening calls to unicode_wb_next() and
unicode_wb_next_cnt().
Each call to the callback function reports the calculated word breaking status of the corresponding character in the Unicode character sequence. If the first parameter to the callback function is nonzero, a word break is permitted before the corresponding character. A zero value indicates that a word break is prohibited before the corresponding character.
The callback function should return 0. A nonzero value indicates to
the word breaking algorithm that an error has occurred.
unicode_wb_next() and unicode_wb_next_cnt() return zero either if
they never invoked the callback function, or if each call to the
callback function returned zero. A nonzero return from the callback
function results in unicode_wb_next() and unicode_wb_next_cnt()
immediately returning the same value.
unicode_wb_end() must be invoked to destroy the word breaking handle
even if unicode_wb_next() or unicode_wb_next_cnt() returned an error
indication. Under normal circumstances unicode_wb_end() may itself
invoke the callback function one or more times. The return value from
unicode_wb_end() has the same meaning as the return value from
unicode_wb_next() and unicode_wb_next_cnt(); however, in all cases
the word breaking handle is no longer valid after unicode_wb_end()
returns.
unicode_wbscan_init(), unicode_wbscan_next() and unicode_wbscan_end()
scan for the next word boundary in a Unicode character sequence.
unicode_wbscan_init() obtains a handle, then unicode_wbscan_next()
gets invoked repeatedly to define the Unicode character sequence.
unicode_wbscan_end() deallocates the handle and returns the number of
leading characters in the Unicode character sequence up to the first
word break.
A nonzero return value from unicode_wbscan_next() indicates that the
word boundary is already known, and any further calls to
unicode_wbscan_next() will be ignored. unicode_wbscan_end() must
still be called, both to deallocate the handle and to obtain the
Unicode character count.