Erik Peterson
	   Computational Linguist, Applications Technology
	       Graduate Student, Georgetown University


	       A Chinese Named Entity Extraction System


			       ABSTRACT

	     For many information applications, being able to identify
       the proper names and other entities in the text is a vital step
       in understanding and using the text.  For example, in a
       Chinese-English machine translation system, if a word is
       identified as a person name, it can be romanized, rather than
       being translated as a regular word.  Other entities, such as
       times, dates, and money amounts, are best translated by modules
       with special knowledge of these domains.  This paper discusses
       the development and testing of a Perl-based named entity
       identification and extraction system for simplified Chinese
       text.  This system can serve as a first stage to other Chinese
       language processing systems.  Entities that are identified
       include locations, person names, organizations, dates, times,
       money amounts, and percentages.  The system uses a segmenter
       and a specially created pattern matching language to identify
       these named entities.  Useful criteria for finding each of
       these entity types, along with the major problems in finding
       them, are discussed.  Scores are reported for the system as run
       on a test corpus and possibilities for improvement are
       proposed.


INTRODUCTION
------------

	For many information applications, being able to identify the
proper names, among other entities, in the text is an important step
in understanding and using the text.  For example, in a
Chinese-English machine translation system, if a word is identified as
a person name, it can be romanized, rather than being directly
translated.  Other entities, such as times, dates, and money amounts,
are best translated by modules with special knowledge of these
domains.

	Fortunately, these entities have common characteristics and
occur in regular contexts that make it possible to write patterns to
identify them.  To utilize these common patterns, I have written a
complete system for processing Chinese text and identifying named
entities.  This system is written entirely in the Perl programming
language, making it portable to a variety of computer systems.  It
runs on texts encoded in the GuoBiao (GB) computer code set for
Chinese, as used in the People's Republic of China and Singapore.

	An important part of any Chinese language processing system is
tokenizing the text into character compounds.  How the text is
segmented is influenced by the needs of the subsequent processing
stages.  The segmenter used by this extraction system uses a simple
Maximal Matching Algorithm, augmented with routines to find numbers,
transliterated names, and Chinese names.  Development of the segmenter
occurred in close coordination with development of the extraction
system.  In particular, compounds that would otherwise be considered
valid words are removed from the segmenter's lexicon if they cause
conflicts in finding a named entity.  For example, the compound 英国人
(England + Person = Englishman) was removed since it interfered with
finding "England" by itself as a location.
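
	As an illustration only, the basic maximal matching idea can be
sketched as follows.  The actual segmenter is written in Perl and is
augmented with number and name finding routines; this sketch uses
Python, and the function name max_match, the max_len limit, and the
four-word toy lexicon are assumptions made for the example.

```python
def max_match(text, lexicon, max_len=4):
    """Greedy forward maximal matching: take the longest lexicon word at each position."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Toy lexicon (hypothetical); a real segmenter lexicon is far larger.
lexicon = {"英国", "英国人", "人", "说"}

with_compound = max_match("英国人说", lexicon)
without_compound = max_match("英国人说", lexicon - {"英国人"})
print(with_compound)     # ['英国人', '说']
print(without_compound)  # ['英国', '人', '说']
```

With the compound removed from the lexicon, 英国 surfaces as its own
word and can then be matched as a location.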

	Decisions on what to consider as a person, location,
organization, etc. are mostly drawn from the tagging guidelines
established for the Message Understanding Conference sponsored by the
United States government [1].  Each extraction type below includes a
summary of what was included and excluded during name finding.

	English text mixed in with the Chinese is not processed at
all, as this is outside the scope of a Chinese named entity extraction
system.  Once a word or phrase is identified as belonging to a given
entity type, its offsets are stored and can be a useful input to
further language processing, machine translation, and information
retrieval systems.

	The rules developed for this system also reflect the target
documents on which the system was run.  For development and testing, I
used simplified Chinese news summaries from the Voice of America
broadcasts to China.  Adapting the system to the writing style of a
mainland China newspaper or a Singapore newspaper would require
modifying existing rules and adding new ones.


PATTERN MATCHING
----------------

	Although the Perl programming language provides a rich set of
pattern matching operators, these operators are designed to work on
the character level rather than on the word level.  To facilitate the
matching of named entities, which are usually one or more words, I
created a simple word level pattern matching language which is the
basis for all of the entity identification done by the system.

	My general pattern matching language allows for the
identification of entities based upon the internal characteristics of
the entity (the entity itself) and the external context of the text in
which the entity occurs.  For each entity type, I wrote several rules.
A rule has four ways of matching a word:

	1.  The word (or phrase) is a member of a set of words.  The
	    system has several word lists that it uses to identify
	    entities and parts of entities.  The name of the set will
	    start with a percent sign.  The different sets currently
	    used by the system are:

	    %geonames:  Names of locations
	    %geotypes:  Suffixes that indicate place names
	    %surnames:  Common Chinese surnames
	    %notname:   Words that do not occur in names
	    %titles:    Person titles
		        e.g. 先生 Mr., 大夫 Dr., 参议员 Senator
	    %persons:   Known person names
	    %orgnames:  Known organization names
	    %orgwords:  Words that are commonly used in organization names
	    %orgtypes:  Suffixes that indicate an organization
		        e.g. 公司 Company, 有限公司 Ltd.
	    %dates:     Known dates; e.g. 今天 today, 去年 last year
	    %times:     Known times; e.g. 夜间 nighttime, 早上 morning
	    %currency:  Currency names; e.g. 美金 dollars, 日圆 yen

	2.  The word meets a particular test or has some desired
            feature.  The system can run tests on words.  The results
            of the test are passed back to the top level rule.  Word
            tests are indicated by an ampersand at the beginning of
            the test name.  Some word tests used by the system
            include:
	    
	    &allforeign: Is the word entirely composed of characters
                         that are commonly used in transliterating
                         foreign names into Chinese?  Useful for
                         finding locations and Western person names.
            &isPossibleChineseName: Could this word be a Chinese name?
                         Looks for a Chinese surname and an acceptable
                         given name.
            &isChineseName: Is this word definitely a Chinese name?
            &allnumbers: Is this word entirely composed of Chinese
                         and/or Arabic numerals?  Useful for finding
                         times, dates, money amounts, and percentages.

	3.  The word is an exact match for a word in the rule.  These
            words are indicated by being enclosed in double quotes.

	    e.g. "地区" (district)

	4.  Using "ANY" in the rule will match any word.
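
	As a rough illustration of the word tests in item 2, the
following sketch shows how tests such as &allnumbers and &allforeign
might be written.  It is in Python rather than the system's Perl, and
the character inventories below are small hypothetical samples, not
the lists actually used by the system.

```python
# Hypothetical character inventories; the system's actual lists are not given here.
CHINESE_DIGITS = set("零〇一二三四五六七八九十百千万亿两")
ARABIC_DIGITS = set("0123456789０１２３４５６７８９")
FOREIGN_CHARS = set("克林顿苏哈托斯尔巴拉夫德尼亚")


def all_numbers(word):
    """True if every character is a Chinese or Arabic numeral (cf. &allnumbers)."""
    return len(word) > 0 and all(
        ch in CHINESE_DIGITS or ch in ARABIC_DIGITS for ch in word
    )


def all_foreign(word):
    """True if every character is commonly used in transliteration (cf. &allforeign)."""
    return len(word) > 0 and all(ch in FOREIGN_CHARS for ch in word)


print(all_numbers("十亿"))      # True
print(all_numbers("1998年"))    # False
print(all_foreign("克林顿"))    # True
print(all_foreign("公司"))      # False
```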


	Each rule has the form:

	   (TYPE (... BEG ... END ...) ADD)

where TYPE is the type of entity to find with this rule (LOCATION,
TIME, etc.)  BEG and END delimit the part of the match that is
actually identified as belonging to TYPE.  Anything outside of BEG and
END is context that still must be matched for the entire rule to
match.  If ADD is included at the end of the rule, then whatever was
matched by the rule will automatically be matched elsewhere in the
text.  Before and after BEG and END can be any sequence of the four
types of word matching mechanisms listed above.  Finally, an element
in the rule can have an asterisk "*" or plus sign "+" appended onto
the end to signify repetition.  In the case of the asterisk, it means
the match can occur as many times in a row as possible, or none at
all.  For the plus sign, the match must occur at least once, but can
match as many times as possible.  For example, '&allforeign+' would
match all the words in a row that are composed only of characters for
transliterating foreign names.
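
	The rule format described above can be sketched in code as
follows.  This is a minimal Python illustration, not the system's Perl
implementation: the function names, the greedy backtracking strategy,
and the toy word sets are all assumptions, and the ADD flag is parsed
but not acted upon.

```python
import re


def parse_rule(rule):
    """Split '(TYPE (... BEG ... END ...) ADD)' into type, elements, and ADD flag."""
    m = re.match(r'\(\s*(\w+)\s*\((.*)\)\s*(ADD)?\s*\)\s*$', rule)
    if m is None:
        raise ValueError("malformed rule: " + rule)
    elements = re.findall(r'"[^"]*"[*+]?|[%&]?\w+[*+]?', m.group(2))
    return m.group(1), elements, m.group(3) is not None


def match_element(element, word, sets, tests):
    """Apply one of the four word matching mechanisms to a single word."""
    if element == "ANY":
        return True
    if element.startswith('"'):
        return word == element[1:-1]
    if element.startswith("%"):
        return word in sets[element[1:]]
    if element.startswith("&"):
        return tests[element[1:]](word)
    raise ValueError("unknown element: " + element)


def match_at(elements, words, pos, sets, tests, span=None):
    """Return (beg, end) word indexes of the BEG..END span if the rule matches at pos."""
    span = span or {}
    if not elements:
        return (span["BEG"], span["END"])
    head, rest = elements[0], elements[1:]
    if head in ("BEG", "END"):
        return match_at(rest, words, pos, sets, tests, {**span, head: pos})
    repeat = head[-1] if head[-1] in "*+" else ""
    base = head[:-1] if repeat else head
    # Count how many consecutive words this element can consume.
    run = 0
    while pos + run < len(words) and match_element(base, words[pos + run], sets, tests):
        run += 1
        if not repeat:
            break
    low = 0 if repeat == "*" else 1
    # Greedy with backtracking: try the longest run first.
    for count in range(run, low - 1, -1):
        result = match_at(rest, words, pos + count, sets, tests, span)
        if result is not None:
            return result
    return None


def find_entities(rule, words, sets, tests):
    """Try the rule at every word position and collect the BEG..END text."""
    etype, elements, _add = parse_rule(rule)
    found = []
    for pos in range(len(words)):
        hit = match_at(elements, words, pos, sets, tests)
        if hit is not None:
            found.append((etype, "".join(words[hit[0]:hit[1]])))
    return found


# Toy data (hypothetical): segmented words, one word set, one word test.
foreign = set("科索沃克林顿")
tests = {"allforeign": lambda w: all(ch in foreign for ch in w)}
sets = {"titles": {"总统"}}
words = ["他", "在", "科索沃", "地区"]
rule = '(LOCATION (BEG &allforeign+ END "地区") ADD )'
print(find_entities(rule, words, sets, tests))  # [('LOCATION', '科索沃')]
```

Note that "地区" must match for the rule to succeed, but only the
words between BEG and END are reported as the entity.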

       Not yet implemented, but potentially very useful, is an
aliasing mechanism for entities identified by the system.  This would
be useful in those cases where an abbreviated form of the entity is
used after the term is first introduced in the text.  For example, it
is common in Chinese to use the first character of a country name as
an abbreviation of the entire name.  A country alias system would take
the first character of the country name and then look for it elsewhere
in the text.  Other possible alias mechanisms are listed under the
extraction types below.
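
	Since the aliasing mechanism is not yet implemented, the
following is only a sketch of how the country alias idea might work.
It is written in Python, the function name alias_countries is
hypothetical, and it ignores the fact that a single character such as
中 is often used with other meanings, so a real implementation would
need additional context checks.

```python
def alias_countries(text, countries):
    """Map each country found in text to offsets where its first character appears alone."""
    aliases = {}
    for name in countries:
        if name not in text:
            continue  # only alias countries that were introduced in full
        short = name[0]
        hits = []
        i = text.find(short)
        while i != -1:
            # Skip occurrences that are simply the start of the full name.
            if not text.startswith(name, i):
                hits.append(i)
            i = text.find(short, i + 1)
        aliases[name] = hits
    return aliases


text = "美国总统访问中国，中美关系"
print(alias_countries(text, ["美国", "中国"]))  # {'美国': [10], '中国': [9]}
```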


EXTRACTION TYPES
----------------

LOCATIONS

	Locations as identified by the extraction system are
countries, states, provinces, cities, towns, bodies of water, islands,
and named geographic features (mountains, valleys, etc.).  Other types
of locations that are tagged include military bases, buildings, and
other immobile structures.  Not tagged are movable structures (planes,
ships) or regions denoted by compass directions ("the southern state",
"the west half of the state").

Examples
 - 美国		America
 - 以色列	Israel
 - 海牙		the Hague

Location Rules
    '(LOCATION (BEG &allforeign+ %geotypes END) ADD )',
    '(LOCATION (BEG &allforeign+ END "地区") ADD )',
    '(LOCATION (BEG %geonames %geotypes END) ADD )',
    '(LOCATION (BEG &allforeign+ END "附近") ADD )',
    '(LOCATION (BEG ANY END ANY "两国") ADD)',
    '(LOCATION (BEG ANY END "两国") ADD)',
    '(LOCATION (BEG %geonames END))'
     
     A productive way of identifying locations involves looking for a
likely foreign name followed by a geographic suffix.  In my system the
majority of the locations were found through look-up in a large
location name lexicon.  However, while this may work well for
international news text where only a limited number of locations are
seen on a regular basis, it would need to be adapted for local areas.
Each time text from a new area is processed, its locations would need
to be added to the lexicon.  In the near future, I also will add rules
that find a location based on its context ("at ...") in the sentence.

     Location names that are a foreign name followed by a geographic
suffix can be aliased by removing the geographic suffix and just
looking for the foreign name.  Countries can be aliased by looking for
just the first character of the country name, as described above.


PERSON NAMES

	Persons are the names of people, either real or fictional.
Person names do not include titles (e.g. "the Pope").

Person Rules
    '(PERSON (BEG &allforeign+ "．" &allforeign+ END) ADD )',
    '(PERSON (BEG &allforeign+ "." &allforeign+ END) ADD )',
    '(PERSON (%titles BEG &allforeign+ END) ADD )',
    '(PERSON (BEG &allforeign+ END %titles) ADD )',
    '(PERSON (%orgtypes "长" BEG &allforeign+ END) ADD )',
    '(PERSON (%geonames "人" BEG &allforeign+ END) ADD )',
    '(PERSON (%titles BEG &isPossibleChineseName END) ADD )',
    '(PERSON (BEG &isChineseName END) ADD )',
    '(PERSON (%titles BEG %surnames ANY END) ADD )',
    '(PERSON (BEG %persons END))'

	Chinese has two main ways of expressing person names.  The
first method is for native Chinese names.  Chinese names have a one
character surname (or rarely, two characters) that comes at the start
of the name.  This is followed by the one or two character given name.
Surnames in Chinese come from a limited set of possibilities but there
is not a limited set of given names.  Complicating name finding is the
fact that surnames can serve other functions in Chinese, as can the
characters used in given names.  My rules try to improve
identification of names by looking at the context in which each
occurs.  If a word has a title next to it and has the form of a
possible Chinese name, it is matched as a Chinese name.
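
	The surname-plus-given-name structure described above suggests
a test along the following lines.  This Python sketch is only an
approximation of what &isPossibleChineseName might check: the surname
and excluded-character lists are tiny hypothetical samples, and
treating %notname as a list of single characters is an assumption.

```python
SURNAMES = {"徐", "李", "林", "王", "张", "欧阳"}  # hypothetical subset of %surnames
NOT_IN_NAMES = set("的了是在")                      # hypothetical subset of %notname


def is_possible_chinese_name(word):
    """Surname of one or two characters followed by a one or two character given name."""
    for surname_len in (2, 1):
        surname, given = word[:surname_len], word[surname_len:]
        if surname in SURNAMES and 1 <= len(given) <= 2:
            if not any(ch in NOT_IN_NAMES for ch in given):
                return True
    return False


print(is_possible_chinese_name("徐文立"))  # True
print(is_possible_chinese_name("林海"))    # True
print(is_possible_chinese_name("李的"))    # False
print(is_possible_chinese_name("公司"))    # False
```

Because surnames and given-name characters also occur in ordinary
words, a test like this only marks candidates; the surrounding
context, such as an adjacent title, is what confirms the match.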

Examples
 - 徐文立		Xu Wenli, Prominent Chinese dissident
 - 李鹏飞		Li Pengfei, Chinese smuggler
 - 林海			Lin Hai, convicted of computer crimes

	The other common type of name is the name of a non-Chinese
person.  In most cases, this is simply a transliteration of the
sounds of the name into like-sounding Chinese characters.
Fortunately, this is usually done with a small set of characters, but
unfortunately, these characters are also commonly used elsewhere in
the language.  Finding a word made up of all foreign characters that
is in the right position in the sentence for a name will identify the
word as a person name.

Examples
 - 苏哈托	former Indonesian president Suharto (sounds like soo ha twoh)
 - 克林顿	President Clinton (sounds like kuh leen dwoon)


     Names from regions around China that have been influenced by
Chinese culture sometimes do not fall into either of the above
categories.  For example, Korean
names are usually three characters, like Chinese names, but can vary
in the surnames used.  Japanese names are often four characters long
and use a different set of (usually 2 character) surnames.  Vietnamese
and other Southeast Asian names all have their own peculiarities that
make identifying them difficult.

Japanese Name Example
 - 小渊惠三     Keizo Obuchi, Prime Minister of Japan


   Person names can be aliased several ways.  For Chinese names, the
system could look for instances when just the surname is used later on
in the text.  For foreign names which include both a surname and a
given name, the system could look for either name separately.


ORGANIZATIONS

	Organizations can include a wide variety of types, and as such
can be one of the hardest types of entities to identify.
Organizations include company names, official governmental bodies,
educational institutions, political parties, and military divisions.

* Examples
 - 美国国会			U.S. Congress
 - 中国民主党			China Democratic Party
 - 联合国			United Nations
 - R-J-R 纳比斯科控股公司       RJR Nabisco Corporation

Organization Rules
    '(ORGS (BEG &allforeign %orgtypes END) ADD )',
    '(ORGS (BEG %geonames %orgtypes END) ADD )',
    '(ORGS (BEG %orgwords+ %orgtypes END) ADD )',
    '(ORGS (BEG %orgnames END))'


	 The main way of finding organizations is to look for a person
or location name followed by an organization suffix.  While imperfect,
this finds many organizations.

	Several problems remain.  It is hard to distinguish between
"American Oil Company" and "American oil companies", since both would
be written the same way in Chinese.  It is also hard to determine
where an organization name starts.  Finally, organizations that are
named after people are hard to distinguish from ordinary person
names.

	Organization names are often aliased by taking one character
from each word used in the full name.  However, which character is
selected from the word is not standardized, making it difficult to
build a generic organization aliasing algorithm.


DATES 

    Dates include specific decades, years, months, dates, weekdays, or
combinations of these.  Dates in Chinese have a standard, accepted way
of ordering the different parts of a date expression where each part
is listed from largest time unit to smallest.  This makes identifying
dates in Chinese an easier task than finding dates in English, which
allows a wide range of variation in constructing dates.  The system
also finds a few simple relative dates, such as "tomorrow", "next
year", "last week", and "March of last year".

* Examples
 - 1998年		1998
 - 1970年9月28号	Sept. 28, 1970
 - 明年			Next year
 - 本周			This week

     Also identified as dates are holidays and special designated days,
such as World AIDS Day.

* Examples
 - 圣诞节		Christmas
 - 世界艾滋病日 	World AIDS Day

	However, not identified as dates are time ranges or relative
dates with no fixed reference.  For example, "3 weeks", "the past 4
days", and "5 years after the election"  would not be found.

Date Rules
   '(DATE (BEG &allnumbers "年" END))',
   '(DATE (BEG &allnumbers "年" &allnumbers "月" END))',
   '(DATE (BEG &allnumbers "年" &allnumbers "月" &allnumbers "日" END))',
   '(DATE (BEG &allnumbers "月" &allnumbers "号" END))',
   '(DATE (BEG &ismonth &allnumbers "号" END))',
   '(DATE (BEG &allnumbers "月" &allnumbers "日" END))',
   '(DATE (BEG &allnumbers "月" END))',
   '(DATE (BEG &allnumbers "月份" END))',
   '(DATE (BEG %dates END))'

   The date rules find various combinations of year, month, and day.
Work needs to be done to identify other dates such as centuries,
decades, and parts of months (China divides a month into three ten day
periods).


TIMES

	Times include specific time references (e.g. 4:17 pm) and time
periods during the day, such as "morning", "afternoon", and "evening".

Time Rules
     '(TIME (BEG &allnumbers "钟" END))',
     '(TIME (BEG &allnumbers "分钟" END))',
     '(TIME (BEG %times END))'

     The first rule finds a number followed by the Chinese version of
"o'clock".  The second rule finds a number followed by the word for
"minutes".  The third rule finds periods of the day, such as
"afternoon" and "night".  The limited number of times mentioned in my
test corpus limited the development of these rules.  Future work needs
to be done to find other shorthand ways of expressing time in Chinese,
such as the equivalents to "a quarter after 5" or "half past nine
o'clock".  Also, the words for evening and morning can serve
other uses in Chinese.  A method for distinguishing when each is used
as a time or as an adjective is needed.


MONEY AMOUNTS
      
      Money amounts are a given amount of a legal currency such as
dollars, yen, pounds, or the soon to exist euro.  Other financial
instruments, such as stocks or bonds, are not included.  Stock market
indexes and points are also not included.

* Example
 - 十亿美元	1 billion U.S. dollars


Money Rules
     '(MONEY (BEG &allnumbers+ %currency END))',
     '(MONEY (BEG "$" &allnumbers END))'

     Finding money amounts is as simple as finding a number followed
by a currency type.  The second rule will also find money amounts
expressed with a dollar sign.  While this finds the majority of money
references in newspapers, the other currency signs of the world (such
as Britain's pound sign or the Singapore dollar sign, S$) must also be
handled for complete coverage.  A further question is how to handle
money ranges (e.g. "5 to 6 dollars"): should the range be tagged as
one unit, or should "5" and "6 dollars" be tagged separately for
future processing?


PERCENTS

	Percentages are expressed in a standard way in Chinese.  A
percent, or more generally a fraction, is a number, usually 100,
followed by the word "分之" (parts of), and then followed by another
number.  A percentage expressed as 100分之30 would literally be "30
parts of 100".  Percents can also be expressed using the percent sign "%".
The system will also find percent amounts expressed as percentage
points.

* Examples
 - 百分之三	3%
 - 百分之19	19%
 - 三分之一     33.3%
 - 两个百分点   Two percentage points

Percent Rules
      '(PERCENT (BEG "百分之" &allnumbers+ END))',
      '(PERCENT (BEG &allnumbers "分之" &allnumbers END))',
      '(PERCENT (BEG "％" &allnumbers+ END))',
      '(PERCENT (BEG &allnumbers+ "％" END))',
      '(PERCENT (BEG &allnumbers+ "百分点" END))',
      '(PERCENT (BEG &isPercent END))'

      The percentage rules closely follow the description listed
above.  No major problems were encountered in finding percentages.
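
      To make the "number + 分之 + number" structure described in this
section concrete, the following Python sketch converts such an
expression to a numeric percentage.  It is not part of the system,
which only identifies the expressions; the helper names to_int and
fraction_to_percent are hypothetical, and only Chinese numerals below
10,000 are handled.

```python
DIGITS = {"零": 0, "〇": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
UNITS = {"十": 10, "百": 100, "千": 1000}


def to_int(numeral):
    """Convert Arabic digits or a Chinese numeral below 10,000 to an int."""
    if numeral.isdecimal():
        return int(numeral)
    total, current = 0, 0
    for ch in numeral:
        if ch in DIGITS:
            current = DIGITS[ch]
        elif ch in UNITS:
            # A bare unit such as 十 or 百 means 1 x unit.
            total += (current or 1) * UNITS[ch]
            current = 0
        else:
            raise ValueError("not a numeral: " + numeral)
    return total + current


def fraction_to_percent(expr):
    """'X分之Y' literally means 'Y parts of X'; return Y / X as a percentage."""
    denominator, numerator = expr.split("分之")
    return 100.0 * to_int(numerator) / to_int(denominator)


print(fraction_to_percent("百分之三"))            # 3.0
print(fraction_to_percent("百分之19"))            # 19.0
print(round(fraction_to_percent("三分之一"), 1))  # 33.3
```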


RESULTS
-------

	Development and testing were done on Chinese text drawn from
news reports on the Voice of America web site [2] during the months of
November and December.  These files are in the public domain, making
them easier to use and distribute.  The files are also scripts for
radio broadcasts and consequently are closer in style to spoken
Chinese.  This works well because written Chinese is often much more
compact than spoken Chinese.  The news reports also provided a rich
store of entity names and introduced each entity as it was used.
These factors made the VOA corpus ideal for development of the system.

        Identified entities in the output of the system were marked
using SGML tags as adopted by the Message Understanding Conference.
These tags are listed below:

       TAG TYPE	      START TAG		           END TAG
       ----------------------------------------------------
       PERSON	      <ENAMEX TYPE="PERSON">	   </ENAMEX>
       LOCATION	      <ENAMEX TYPE="LOCATION">	   </ENAMEX>
       ORGANIZATION   <ENAMEX TYPE="ORGANIZATION"> </ENAMEX>
       DATE	      <TIMEX TYPE="DATE">	   </TIMEX>
       TIME	      <TIMEX TYPE="TIME">	   </TIMEX>
       PERCENT	      <NUMEX TYPE="PERCENT">	   </NUMEX>
       MONEY	      <NUMEX TYPE="MONEY">	   </NUMEX>
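
        As an illustration of how stored offsets can be turned into
the tags in the table above, consider the following Python sketch.
The function name tag_text and the (start, end, type) tuple layout are
assumptions made for the example; entities are assumed not to overlap.

```python
# Entity type -> (SGML element, TYPE attribute), following the table above.
TAGS = {
    "PERSON": ("ENAMEX", "PERSON"),
    "LOCATION": ("ENAMEX", "LOCATION"),
    "ORGANIZATION": ("ENAMEX", "ORGANIZATION"),
    "DATE": ("TIMEX", "DATE"),
    "TIME": ("TIMEX", "TIME"),
    "PERCENT": ("NUMEX", "PERCENT"),
    "MONEY": ("NUMEX", "MONEY"),
}


def tag_text(text, entities):
    """Wrap each (start, end, type) character span in its MUC-style SGML tags."""
    out = []
    last = 0
    for start, end, etype in sorted(entities):
        element, type_attr = TAGS[etype]
        out.append(text[last:start])
        out.append('<%s TYPE="%s">%s</%s>'
                   % (element, type_attr, text[start:end], element))
        last = end
    out.append(text[last:])
    return "".join(out)


tagged = tag_text("克林顿访问以色列", [(0, 3, "PERSON"), (5, 8, "LOCATION")])
print(tagged)
# <ENAMEX TYPE="PERSON">克林顿</ENAMEX>访问<ENAMEX TYPE="LOCATION">以色列</ENAMEX>
```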


        In addition, I also found a small corpus of hand-tagged
sentences made available on the web by some of the organizers of MUC.
This "Little Grove" corpus consists of 100 sentences picked apparently
at random from various mainland Chinese news sources [3].  As such, the
sentences have little context for the entities within.  Also, they use
the more telegraphic Chinese newspaper writing style.  These two
factors made scores on this corpus much lower than the VOA corpus.
The "Little Grove" corpus was used as test data; no reference was made
to it during development.

        The VOA files were split into two sets, test and development.
The 10 test files were hand tagged using MUC-style SGML tags and
then put aside.  No further reference was made to them.  Ideally,
preparation of the test set should not be done by the person doing
development, lest development be biased by knowledge of the
test set, but this was not possible in the time limits of the class.
As part of the system, I also wrote a scorer in Perl to measure the
accuracy of the system.  The two measures of system performance are
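precision and recall.  Recall measures how many of the hand-tagged
entities in the test corpus the system found; precision measures how
many of the entities tagged by the system are correct.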

       Recall and precision scores for the two test data sets are
listed below:

VOA Corpus
  Total Test Corpus Tags:       1117
  Total Machine Corpus Tags:    971
  Total Correct Machine Tags:   514
  RECALL   :  46.02
  PRECISION:  52.94


Little Grove 100 Sentences
  Total Test Corpus Tags:       254
  Total Machine Corpus Tags:    147
  Total Correct Machine Tags:   43
  RECALL   :  16.92
  PRECISION:  29.25
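
       For reference, the arithmetic behind these scores can be
sketched as below.  This assumes the standard definitions (recall is
correct tags divided by hand-tagged tags, precision is correct tags
divided by machine tags); it is a Python illustration with a
hypothetical function name, not the Perl scorer itself.

```python
def score(key_tags, machine_tags, correct_tags):
    """Return (recall, precision) as percentages."""
    recall = 100.0 * correct_tags / key_tags
    precision = 100.0 * correct_tags / machine_tags
    return recall, precision


# VOA corpus counts from the table above.
recall, precision = score(key_tags=1117, machine_tags=971, correct_tags=514)
print("RECALL   :  %.2f" % recall)     # RECALL   :  46.02
print("PRECISION:  %.2f" % precision)  # PRECISION:  52.94
```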


CONCLUSION
----------

	Initial development of the named entity extraction system has
been promising.  Major problems in each extraction category were
identified.  I also wrote a scorer to obtain the recall and precision
scores for the test data.  Finally, I created an infrastructure that
will significantly simplify future system development.  Unfortunately,
while developing my word-level pattern matching language will save a
great deal of time in the long run, it took away time from actually
improving the entity matching rules and hence scores are currently
low.  It is anticipated that two or three more months of part-time
rule development would bring scores for precision and recall into the
mid-80's.  This would be near the scores for the professional systems
tested by MUC.  Other work that could improve scores would include
improving the Chinese text segmenter that the rest of the system
depends on so heavily.

	This system can be obtained from
http://epsilon3.georgetown.edu/~petersee/cgi-bin/chinesene.zip or
chinesene.tar.gz.  After unzipping this file, the system can directly
run on any computer with a Perl interpreter.  The main file is
chinesene.cgi.  The system can also be run via the web at
http://epsilon3.georgetown.edu/~petersee/chinesene.html.  The web
interface allows users to either paste in text to be processed or type
in the URL of a web page that is downloaded and processed.


REFERENCES
----------

1. MUC-7 Tagging Guidelines
    ftp://ftp.muc.saic.com/pub/MUC/MUC7-guidelines/guidelines.NEtaskdef.3.5.ps.gz

2. Voice of America Chinese Radio Scripts
    http://www.voa.gov/chinese/mindex.html

3. Chinese Treebank's 100 with Named-Entity Tags
    http://umiacs.umd.edu/labs/CLIP/nrdnetaA.html