NAME
    Lingua::ZH::CCDICT - An interface to the CCDICT Chinese dictionary

SYNOPSIS
      use Lingua::ZH::CCDICT;

      my $dict = Lingua::ZH::CCDICT->new( storage => 'InMemory' );

DESCRIPTION
    This module provides a Perl interface to the CCDICT dictionary created
    by Thomas Chin. This dictionary is indexed by Unicode character number
    (traditional character), and contains information about these
    characters.

    CCDICT is released under a Creative Commons Attribution License (version
    2.5).

    The dictionary contains the following information, though not all
    information is available for every character.

    * Radical number
        The number of the radical. Always available.

    * Index
        The total number of strokes minus the number of strokes in the
        radical. Always available.

    * Total stroke count
        The total number of strokes in the character.

    * Cangjie
        The Cangjie Chinese input system code.

    * Four Corner
        The Four Corner Chinese input system code.
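
    As a worked example of the Index definition above, the following
    minimal sketch computes an index from two stroke counts. The helper
    name `residual_index' is hypothetical and not part of this module, and
    the stroke counts used for U+7D20 (10 strokes in total, 6 in radical
    120) are illustrative values from general references, not read from
    the CCDICT data.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Lingua::ZH::CCDICT.
# Index = total number of strokes minus the strokes in the radical.
sub residual_index {
    my ( $total_strokes, $radical_strokes ) = @_;

    die "the radical cannot have more strokes than the whole character\n"
        if $radical_strokes > $total_strokes;

    return $total_strokes - $radical_strokes;
}

# U+7D20: assumed 10 strokes in total, of which 6 belong to radical 120.
print residual_index( 10, 6 ), "\n";    # prints 4
```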

    In addition, the dictionary contains English definitions (often multiple
    definitions), and romanizations for the character in different languages
    and systems. The romanizations available include Pinjim for Hakka,
    Jyutping for Cantonese and (Hanyu) Pinyin for Mandarin.

DICTIONARY BUGS
    The CCDICT dictionary is distributed by Thomas Chin in a simple but
    non-standard, textual ASCII-only format. I've tried to work around
    errors or ambiguities in the dictionary data, although there are
    probably still oddities lurking. Please send bug reports to me so I can
    figure out whether the error is in my code or the dictionary itself.

STORAGE
    This module is capable of parsing the CCDICT format file, and can also
    store the data in other formats (just Berkeley DB for now).

    Each storage system is implemented via a module in the
    `Lingua::ZH::CCDICT::Storage::*' class hierarchy. All of these modules
    are subclasses of the `Lingua::ZH::CCDICT' class, and implement its
    methods for searching the dictionary.

    Some storage classes may also offer additional methods.

  Storage Subclasses
    The following storage subclasses are available:

    * Lingua::ZH::CCDICT::Storage::InMemory
        This class stores the entire parsed dictionary in memory.

    * Lingua::ZH::CCDICT::Storage::BerkeleyDB
        This class can convert the CCDICT source file to a set of BerkeleyDB
        files.

USAGE
    This module allows you to look up information in the dictionary based on
    a number of keys. These include the Unicode character (as a character,
    not its number), stroke count, radical number, and any of the various
    romanization systems.

METHODS
    This class provides the following methods.

  Lingua::ZH::CCDICT->new(...)
    This method always takes at least one parameter, "storage". This
    indicates what storage subclass to use. The current options are
    "InMemory" and "BerkeleyDB".

    Any other parameters given will be passed to the appropriate subclass's
    `new()' method.

  $dict->parse_source_file($filename)
    If you don't specify a file, this method uses the data file
    distributed with this module. This is probably what you want, unless
    you have a local copy of the dictionary that you want to work with.
    Note that the dictionary format has changed a fair bit between
    versions, so this probably won't work with much older or newer
    versions of the CCDICT data.

    This method is what does the real work of creating a dictionary. Note
    that if you are not using the InMemory storage subclass, you only need
    to parse the source file once, and then you can reuse the stored data.

MATCH METHODS
    When doing a lookup based on the romanization of a character, the tone
    is indicated with a number at the end of the syllable, as opposed to
    using the Unicode character combining the Latin letter with the
    diacritic.

    In addition, lookups based on a Pinyin romanization should use the
    u-with-umlaut character (character 252 in Unicode) rather than two "u"
    characters.
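
    The two conventions above can be applied to user input before a
    lookup. The following is a minimal sketch, not part of this module:
    the helper name `pinyin_lookup_key' is hypothetical, and it assumes
    the input is a single syllable given as Perl characters. It turns a
    tone diacritic into a trailing tone number and rewrites "uu" as
    u-with-umlaut. Syllables without a tone mark are returned without a
    number, since how the dictionary writes a neutral tone is not stated
    here.

```perl
use strict;
use warnings;
use Unicode::Normalize qw( NFC NFD );

# Combining tone marks and the tone numbers they stand for.
my %TONE_FOR_MARK = (
    "\x{0304}" => 1,    # macron
    "\x{0301}" => 2,    # acute accent
    "\x{030C}" => 3,    # caron
    "\x{0300}" => 4,    # grave accent
);

# Hypothetical helper, not part of Lingua::ZH::CCDICT.
sub pinyin_lookup_key {
    my ($syllable) = @_;

    # Decompose so that each tone mark becomes a separate character.
    my $decomposed = NFD($syllable);

    # Remove the first tone mark, if any, and remember its number.
    my $tone = '';
    if ( $decomposed =~ s/([\x{0304}\x{0301}\x{030C}\x{0300}])// ) {
        $tone = $TONE_FOR_MARK{$1};
    }

    # Recompose, which keeps u-with-umlaut as the single character 252.
    my $key = NFC($decomposed);

    # Rewrite the two-"u" spelling as u-with-umlaut.
    $key =~ s/uu/\x{FC}/g;

    return $key . $tone;
}

binmode STDOUT, ':encoding(UTF-8)';
print pinyin_lookup_key("zh\x{14D}ng"), "\n";    # prints zhong1
print pinyin_lookup_key("l\x{1DC}"),    "\n";    # l, character 252, then 4
print pinyin_lookup_key('nuu3'),        "\n";    # n, character 252, then 3
```

    The returned string could then be passed to `match_pinyin'.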

    The return value for any lookup will be an object in a
    `Lingua::ZH::CCDICT::ResultSet' subclass.

    Result sets always return matches in ascending Unicode character order.

  $ccdict->match_unicode(@chars)
    This method matches on one or more Unicode characters. Unicode
    characters should be given as Perl characters (i.e. `chr(0x7D20)'), not
    as a number.

    This dictionary index uses *traditional* Chinese characters. Simplified
    character lookups will not work (but you could use `Encode::HanConvert'
    to convert simplified to traditional first).
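
    Because the method expects Perl characters rather than numbers or raw
    bytes, input often needs converting first. The following minimal
    sketch shows a few ways to arrive at the same character for U+7D20
    using only core Perl. The helper `char_from_label' is hypothetical and
    not part of this module, and the final lookup call is left commented
    out because it needs a dictionary object.

```perl
use strict;
use warnings;
use Encode qw( decode );

# The same character, built from a code point number, a string escape,
# and a UTF-8 encoded byte string.
my $from_number = chr(0x7D20);
my $from_escape = "\x{7D20}";
my $from_bytes  = decode( 'UTF-8', "\xE7\xB4\xA0" );

# Hypothetical helper, not part of Lingua::ZH::CCDICT: accepts labels
# such as "U+7D20", "0x7D20" or "7D20" and returns the character.
sub char_from_label {
    my ($label) = @_;

    $label =~ /\A(?:U\+|0x)?([0-9A-Fa-f]{4,6})\z/
        or die "not a code point label: $label\n";

    return chr( hex($1) );
}

my $char = char_from_label('U+7D20');
printf "U+%04X\n", ord $char;    # prints U+7D20

# With a dictionary object in $ccdict, the lookup would then be:
# my $results = $ccdict->match_unicode($char);
```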

  $ccdict->match_radical(@numbers)
    Given a set of numbers, this method returns those characters containing
    the specified radical(s).

  $ccdict->match_index(@numbers)
    Given a set of numbers, this method returns those characters with the
    specified index(es).

  $ccdict->match_stroke_count(@numbers)
    Given a set of numbers, this method returns those characters with the
    specified number(s) of strokes.

  $ccdict->match_cangjie(@codes)
    Given a set of Cangjie codes, this method returns the character(s) for
    those code(s).

  $ccdict->match_four_corner(@codes)
    Given a set of Four Corner codes, this method returns the character(s)
    for those code(s).

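  $ccdict->match_pinjim(@romanizations)
  $ccdict->match_jyutping(@romanizations)
  $ccdict->match_pinyin(@romanizations)
    Given a set of romanizations in the named system, each of these methods
    returns the character(s) with those romanization(s).

  $ccdict->all_characters()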
    Returns a result set containing all of the characters in the dictionary.

  $ccdict->entry_count()
    Returns the number of entries in the dictionary.

ENVIRONMENT VARIABLES
    There are several environment variables you can set to change this
    module's behavior.

    * CCDICT_DEBUG_SOURCE
        Causes a warning when bad data is encountered in the CCDICT
        dictionary source. This is primarily useful if you want to find bugs
        in the dictionary itself.

    * CCDICT_VERBOSE
        Tells the module to give you progress reports when parsing the
        source file. These are sent to STDERR.
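
    These variables can be set in the shell or from within a script. The
    following minimal sketch does the latter. It assumes that any true
    value such as 1 turns a variable on, which the text above does not
    state, and it sets them in a `BEGIN' block so that they are in place
    before the module is loaded, whenever the module happens to read them.

```perl
use strict;
use warnings;

# Assumption: a true value enables each setting.
BEGIN {
    $ENV{CCDICT_VERBOSE}      = 1;    # progress reports on STDERR
    $ENV{CCDICT_DEBUG_SOURCE} = 1;    # warn about bad source data
}

# The module would be loaded and used after this point, for example:
# use Lingua::ZH::CCDICT;
# my $dict = Lingua::ZH::CCDICT->new( storage => 'InMemory' );
```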

AUTHOR
    David Rolsky <autarch@urth.org>

COPYRIGHT
    Copyright (c) 2002-2007 David Rolsky. All rights reserved. This program
    is free software; you can redistribute it and/or modify it under the
    same terms as Perl itself.

    The full text of the license can be found in the LICENSE file included
    with this module.

    CCDICT is copyright (c) 1995-2006 Thomas Chin.

SEE ALSO
    Lingua::ZH::CEDICT - for converting between Chinese and English.

    Encode::HanConvert - for converting between simplified and traditional
    characters in various character sets.

    http://www.chinalanguage.com/dictionaries/CCDICT/ - the home of the
    CCDICT dictionary.