NAME
    Unicode::Truncate - Unicode-aware efficient string truncation

SYNOPSIS
        use utf8;
        use Unicode::Truncate;

        truncate_egc("hello world", 7);
        ## returns "hellâ€¦";

        truncate_egc("hello world", 7, '');
        ## returns "hello w"

        truncate_egc('æ·±åœ³', 7);
        ## returns "æ·±â€¦"

        truncate_egc("neÌe Jones", 5)'
        ## returns "nâ€¦" (not "neâ€¦", even in NFD)

        truncate_egc("\xff", 10)
        ## throws exception:
        ##   "input string not valid UTF-8 (detected at byte offset 0 in truncate_egc)"

        my $str = "hello world";
        truncate_egc_inplace($str, 8)
        ## $str is now "helloâ€¦";

DESCRIPTION
    This module is for truncating UTF-8 encoded Unicode text to particular
    byte lengths while inflicting the least amount of data corruption
    possible. The resulting truncated string will be no longer than your
    specified number of bytes (after UTF-8 encoding).

    All truncated strings will continue to be valid UTF-8: it won't cut in
    the middle of a UTF-8 encoded code-point. Furthermore, if your text
    contains combining diacritical marks, this module will not cut in
    between a diacritical mark and the base character. It will in general
    try to preserve what users perceive as whole characters, with as little
    as possible mutilation at the truncation site.

    The "truncate_egc" function truncates only between extended grapheme
    clusters
    <https://en.wikipedia.org/wiki/Universal_Character_Set_characters#Charac
    ters_grapheme_clusters_and_glyphs> (as defined by Unicode TR29
    <http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries>
    version 7.0.0).

    The "truncate_egc_inplace" function is identical to "truncate_egc"
    except that the input string will be modified so that no copying occurs.
    If you pass in a read-only value it will throw an exception.

    Eventually I'd like to support other boundaries such as words and
    sentences. Those functions will be named "truncate_word" and so on.

RATIONALE
    Of course in a perfect world we would only need to worry about the
    amount of space some text takes up on the screen, in the real world we
    often have to or want to make sure things fit within certain byte size
    capacity limits. Many databases, network protocols, and file-formats
    require honouring byte-length restrictions. Even if they automatically
    truncate for you, are they doing it properly and consistently? On many
    file-systems, file and directory names are subject to byte-size limits.
    Many APIs that use C structs have fixed limits as well. You may even
    wish to do things like guarantee that a collection of news headlines
    will fit in a single ethernet packet.

    I knew I had to write this module after I asked Tom Christiansen about
    the best way to truncate unicode to fit in fixed-byte fields and he got
    angry and told me to never do that. :)

    Why not just use "substr" on a string before UTF-8 encoding it? The main
    problem with that is the number of bytes that an encoded string will
    consume is not known until after you encode it. It depends on how many
    "high" code-points are in the string, how "high" those code-points are,
    the normalisation form chosen, and (relatedly) how many combining marks
    are used. Even with perl unicode strings (ie before encoding), using
    "substr" will cut in front of combining marks.

    Truncating post-encoding may result in invalid UTF-8 partials at the end
    of your string, as well as cutting in front of combining marks.

    One interesting aspect of unicode's combining marks is that there is no
    specified limit to the number of combining marks that can be applied. So
    in some interpretations a single character/grapheme/whatever can take up
    an arbitrarily large number of bytes. However, there are various
    recommendations such as the Unicode UAX15-D3
    <http://www.unicode.org/reports/tr15/#UAX15-D3> "stream-safe" limit of
    30. Reportedly the largest known "legitimate" use is a 1 base + 8
    combining marks grapheme used in a Tibetan script.

ELLIPSIS
    When a string is truncated, "truncate_egc" indicates this by appending
    an ellipsis. The length of the truncated content including the ellipsis
    is guaranteed to be no greater than the byte size limit you specified.

    By default the ellipsis is the character U+2026 (â€¦) however you can use
    any other string by passing it in as the third argument. The ellipsis
    string must not contain invalid UTF-8 (it can be encoded or can contain
    perl high-code points, up to you). Note the default ellipsis consumes 3
    bytes in UTF-8 encoding which is the same as 3 periods in a row.

IMPLEMENTATION
    This module uses the ragel <http://www.colm.net/open-source/ragel/>
    state machine compiler to parse/validate UTF-8 and to determine the
    presence of combining characters. Ragel is nice because we can determine
    the truncation location with a single pass through the data in an
    optimised C loop.

    One of the requirements of this module was to additionally validate
    UTF-8 encoding. This is so you can run it against strings with or
    without having decoded them with "Encode::decode" first. This module
    will throw exceptions if the strings to be truncated aren't UTF-8. This
    property lets us minimise the amount of times a user-supplied string is
    "decoded". With this module, you can accept an arbitrary string from a
    web request (say), validate that it is UTF-8, truncate it if necessary,
    and write it out to a DB, all with only a single pass over the data.

    As mentioned, this module will not scan further than it needs to in
    order to determine the truncation location. So creating a short
    truncation of a really long string doesn't require traversing the entire
    string. However, this module won't validate that the bytes beyond its
    truncation location are valid UTF-8.

    Another purpose of this module is to be a "proof of concept" for
    Inline::Module::LeanDist and Inline::Filters::Ragel. This distribution
    concept was of course heavily inspired by Inline::Module.

SEE ALSO
    Unicode-Truncate github repo
    <https://github.com/hoytech/Unicode-Truncate>

    Although efficient, as discussed above, "substr" will not be able to
    give you a guaranteed byte-length output (if done pre-encoding) and will
    corrupt text (pre or post-encoding).

    There are several similar modules such as Text::Truncate,
    String::Truncate, Text::Elide but they are all essentially wrappers
    around "substr" and are subject to its limitations.

    A reasonable "99%" solution is to encode your string as UTF-8, truncate
    at the byte-level with "substr", decode with "Encode::FB_QUIET", and
    then re-encode it to UTF-8. This will ensure that the output is always
    valid UTF-8, but will still risk corrupting unicode text that contains
    combining marks.

    Ricardo Signes suggested an algorithm using Unicode::GCString which
    would also be correct but likely less efficient.

    It may be possible to use the regexp engine's "\X" combined with "(?{})"
    in some way but I haven't been able to figure that out.

BUGS
    Of course I can't test this module on all the writing systems of the
    world so I don't know the severity of the corruption in all situations.
    It's possible that the corruption can be minimised in additional ways
    without sacrificing the simplicity or efficiency of the algorithm. If
    you have any ideas please let me know and I'll try to incorporate them.

    Eventually I'd like to truncate on other boundaries specified by
    unicode, such as word, sentence, and line.

    It would be nice to be able to apply an EGC limit such as 30.

    This module doesn't handle the UTF-16 surrogate range in the grapheme
    properties files because "Encode::encode" isn't encoding them the way
    I'd need them to. That's OK because these aren't valid UTF-8 anyway.

    Perl internally supports characters outside what is officially unicode.
    This module only works with the official UTF-8 range so if you are using
    this perl extension (perhaps for some sort of non-unicode sentinel
    value) this module will throw an exception indicating invalid UTF-8
    encoding (which is more of a feature than a bug given this module's
    primary purpose of validating and truncating untrusted, user-provided
    text).

AUTHOR
    Doug Hoyte, "<doug@hcsw.org>"

COPYRIGHT & LICENSE
    Copyright 2014-2015 Doug Hoyte.

    This module is licensed under the same terms as perl itself.