English Corpora and Automated Grammatical Analysis (英语语料库与自动语法分析)

Preface

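From 1993 to 2005 I carried out research and teaching at University College London (UCL). This book records the main findings and results of my years of work in two fields, corpus linguistics and computational linguistics.
The 1990s were a golden age for corpus linguistics in Britain. Professor Randolph Quirk and Professor Sidney Greenbaum in London, Professor Geoffrey Leech at Lancaster and Professor John Sinclair at Birmingham were all engaged in corpus development.
At that time Professor Sidney Greenbaum was Director of the Survey of English Usage at UCL and was working on the creation of the International Corpus of English. One million words of British English had already been collected and grammatically tagged, but syntactic parsing ran into considerable difficulty. First, the parsing system in use was unsuitable: for each input sentence it often produced dozens, hundreds or even thousands of parse trees, from which one then had to be selected by hand, at great cost in time and effort. Second, the formal grammar in use was unsuitable. It had been written for written English, whereas the one-million-word British English corpus contained 600,000 words of speech. As a result, meetings were needed almost every day to discuss how to handle particular sentences, and some parts of the grammar simply had to be rewritten, especially coordination at different levels. Even so, about 30% of the sentences could not be handled by the automatic parsing system at all.
So in 1994 Professor Sidney Greenbaum and I together wrote a project proposal and then met staff of the Engineering and Physical Sciences Research Council, including Mr Nigel Birch and Professor Mark Tatham, to present our research ideas. The proposal eventually passed the Council's review and was awarded a grant of about £500,000, dedicated to developing a new automatic parsing system and writing a new formal grammar that could be used to analyse spoken English.
The main idea of the project was to turn the already analysed corpus into a syntactic knowledge base, extract phrase structure grammar rules from it and, by instance-based means, retrieve from the knowledge base the best parse tree for the sentence to be analysed. Such a parsing mechanism involves several important issues. First, it needs a high-quality automatic word class tagging system that can not only distinguish the major classes but also analyse their subclasses quickly, effectively and precisely, for example verb valency. Next, it needs a phrase analysis system that processes the input sentence into a set of phrase structures, on the basis of which syntactic similarity is computed and the corresponding parse tree is finally generated. This approach to parsing is robust, efficient, accurate and capable of automatic learning, and it has been extensively tested and validated in processing the International Corpus of English and other very large corpora.
This book describes the research on each of these components in detail, gives an in-depth quantitative evaluation of the systems' actual performance, and devotes chapters specifically to the evaluation of parsing. In addition, it discusses the automatic analysis of prepositional phrases, in particular the automatic determination of their syntactic functions, since this research is closely related to the analysis of syntactic similarity. The book also introduces and explains the application of automatic grammatical analysis in speech synthesis and speech recognition, which I hope readers will find helpful.
Many of my friends and colleagues have read the draft or parts of it and made many suggestions, for which I am grateful. I thank in particular Professor John Campbell and Dr Mark Huckvale of University College London, Dr Jonathan Ginzburg of King's College London, Mr Eric Atwell of the University of Leeds, Professor Jan Svartvik of Lund University, Sweden, and Professor Qian Housheng (钱厚生), Director of the Shanghai Information Centre of the Commercial Press. Naturally, I take full responsibility for all errors in the book and sincerely invite readers' criticism and suggestions.
Finally, I dedicate this book to the memory of my late father, who taught me by word and example, and of my mentor Professor Sidney Greenbaum, who nurtured my career, and I thank my family for their care and support.
Chengyu Fang (方称宇)
October 2007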

About the Book

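Corpus linguistics and computational linguistics are two foundational disciplines driving the rapid development of natural language processing technology. English Corpora and Automated Grammatical Analysis (《英语语料库与自动语法分析》) is a monograph in these two fields. Set against the background of the International Corpus of English, it focuses on the grammatical analysis of large corpora, especially the series of problems that spoken English material poses for automatic computer processing. The book covers two major techniques, probability-based automatic word class identification and instance-based automatic parsing. It devotes chapters specifically to the evaluation of parsing and gives an in-depth quantitative evaluation of the actual performance of two software systems, AUTASYS and the Survey Parser. In addition, it discusses the automatic analysis of prepositional phrases, especially the automatic determination of their syntactic functions, and explains the application of automatic grammatical analysis in speech synthesis and speech recognition.
The author, Dr Chengyu Fang (方称宇), was formerly Deputy Director of the Survey of English Usage at University College London, where he assisted the eminent grammarian Professor Sidney Greenbaum in the creation and study of the International Corpus of English. He was subsequently a Senior Research Fellow in the Department of Phonetics and Linguistics at University College London. He now teaches at City University of Hong Kong, where he gives courses in computational linguistics, corpus linguistics and cognitive linguistics in the Department of Chinese, Translation and Linguistics, and is a core member of the Halliday Centre for Intelligent Applications of Language Studies.
The book is written entirely in English and is suitable for language professionals working with English.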
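
As a rough illustration of the idea described in the preface, turning an analysed corpus into a source of phrase structure rules, the following Python sketch reads bracketed trees and counts the rules they contain. It is not the book's implementation. The bracketed notation, the toy labels (S, NP, DT and so on, which are not the ICE tag set) and the function names parse_bracketed and extract_rules are all assumptions made for this example. Words are dropped at the leaves so that rules end at word-class tags, loosely echoing the de-lexicalised terminal nodes listed under section 1.6.3.2 of the contents.

```python
from collections import Counter


def parse_bracketed(text):
    """Parse a bracketed tree such as "(NP (DT the) (NN cat))" into nested lists."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        node = [tokens[pos]]              # node label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read())       # nested constituent
            else:
                node.append(tokens[pos])  # word at a leaf
                pos += 1
        pos += 1                          # skip the closing ")"
        return node

    return read()


def extract_rules(tree, rules):
    """Count parent -> children rules, stopping at word-class tags (words are ignored)."""
    label, children = tree[0], tree[1:]
    child_nodes = [c for c in children if isinstance(c, list)]
    if child_nodes:
        rules[(label, tuple(c[0] for c in child_nodes))] += 1
        for child in child_nodes:
            extract_rules(child, rules)
    return rules


# Toy "analysed corpus" (hypothetical data, not taken from ICE-GB).
treebank = [
    "(S (NP (DT the) (NN cat)) (VP (VBD sat)))",
    "(S (NP (DT a) (NN dog)) (VP (VBD barked)))",
]

rules = Counter()
for line in treebank:
    extract_rules(parse_bracketed(line), rules)

for (parent, children), count in rules.items():
    print(parent, "->", " ".join(children), count)
```

Here `rules` maps each (parent, children) pair to its frequency in the toy treebank. A practical system would need considerably more than this, for example feature constraints and a similarity measure for choosing among stored analyses.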

Table of Contents

Preface
Preface (in Chinese)
List of Figures
List of Tables
Abstract
1. Introduction
1.1. What is Parsing?
1.2. The Introspective View
1.3. The Retrospective View
1.4. Data-Oriented Parsing
1.5. General Problems
1.6. The Proposed Research
　1.6.1. Background to the Proposed Research
　1.6.2. The Basic Approach of the Proposed Research
　1.6.3. The Strengths and Novelties of the Proposed Approach
　　1.6.3.1. Automated Grammar Generation
　　1.6.3.2. De-Lexicalised Terminal Nodes
　　1.6.3.3. Global Parse with Subcategorisation Features
　　1.6.3.4. High-Quality Partial Parse
　　1.6.3.5. Intrinsic Ability to Learn
1.7. The Organisation of the Book
2. The Automatic Analysis of English Word Classes
2.1. An Overview of Word Class Tagging
2.2. Major Word Class Tagging Schemes
　2.2.1. The Lancaster-Oslo/Bergen Tagging Scheme
　　2.2.1.1. The Lancaster-Oslo-Bergen Corpus
　　2.2.1.2. The Lancaster-Oslo-Bergen Tag Set
　　2.2.1.3. Summary
　2.2.2. The International Corpus of English Tagging Scheme
　　2.2.2.1. The International Corpus of English
　　2.2.2.2. The International Corpus of English Tag Set
　2.2.3. A Comparison of LOB and ICE
2.3. Word Class Tagging Methodologies
　2.3.1. The Rule-Based Approach
　2.3.2. The Probabilistic Approach
2.4. AUTASYS: A Hybrid Tagging System
　2.4.1. A Probabilistic Approach Using the LOB Tag Set
　　2.4.1.1. The Tag Assignment Module
　　　2.4.1.1.1. Tokenisation
　　　2.4.1.1.2. The treatment of “.”
　　　2.4.1.1.3. The treatment of “'”
　　　2.4.1.1.4. Sentence boundary markers
　　2.4.1.2. Orthographic Analysis
　　2.4.1.3. Lexicon Lookup
　　　2.4.1.3.1. The lexicon
　　　2.4.1.3.2. The coverage of the lexicon
　　2.4.1.4. Morphological Analysis
　2.4.2. The Idiom Identification Module
　2.4.3. The Probabilistic Tag Selection Module
　　2.4.3.1. The Bigram Probabilistic Matrix
　　2.4.3.2. Implementing Probabilistic Tag Selection
　2.4.4. The Rule-Based Refinement Module
　2.4.5. Empirical Evaluation
　2.4.6. Permissive AUTASYS-LOB Disagreements
　　2.4.6.1. NNP-NPT
　　2.4.6.2. JJ-JJB
　　2.4.6.3. NNP-NPL
　　2.4.6.4. RB-NN
　2.4.7. Summary
2.5. A Rule-Based Approach towards LOB to ICE Translation
　2.5.1. Solutions for Verbs
　　2.5.1.1. Auxiliary vs. Lexical
　　2.5.1.2. Monotransitive vs. Complex Transitive
　　2.5.1.3. Finite vs. Nonfinite
　2.5.2. Closed Sets
　2.5.3. Initial Results
　2.5.4. Problems
　2.5.5. Summary
3. The Automatic Induction of a Formal Grammar
3.1. Introduction
3.2. The ICE Parsing Scheme
3.3. Grammar Generation
　3.3.1. Phrase Structure Rules
　3.3.2. Phrase Structure Cluster Rules
3.4. Evaluating the Coverage
　3.4.1. The Construction of the Training and Test Sets
　3.4.2. The Number of Extracted Rules
　3.4.3. The Coverage of Adjective Phrase Rules
　3.4.4. The Coverage of Adverb Phrase Rules
　3.4.5. The Coverage of Noun Phrase Rules
　3.4.6. The Coverage of Verb Phrase Rules
　3.4.7. The Coverage of Prepositional Phrase Rules
　3.4.8. The Coverage of Phrase Structure Cluster Rules
　3.4.9. Discussion
4. Robust Practical Analogy-Based Parsing
4.1. Introduction
　4.1.1. Analogy-Based Parsing
4.2. An Overview of the Survey Parser
4.3. The Construction of the Syntactic Knowledge Base
　4.3.1. Phrase Structure Rules
　4.3.2. Phrase Structure Cluster Rules
　　4.3.2.1. Feature Constraints
　　4.3.2.2. Nonobligatory Elements
　　4.3.2.3. A Definition of Analogy
4.4. Parsing with Phrase Structure and Phrase Structure Cluster Rules
　4.4.1. The Analysis of Word Classes
　4.4.2. The Analysis of Phrases
　4.4.3. The Analysis of Clauses
　　4.4.3.1. Partial Analysis
4.5. Some Initial Evaluation
　4.5.1. Evaluating the Coverage of Phrase Structure Rules
　4.5.2. Evaluating the Coverage of Phrase Structure Cluster Rules
　4.5.3. Evaluating the Labelling Precision
　4.5.4. Evaluating the Accuracy of Analysis
　4.5.5. Evaluating the Processing Speed
4.6. Concluding Remarks
5. Extensive Evaluations of the Survey Parser
5.1. Introduction
5.2. Commonly Used Metrics
　5.2.1. Labelled Match
　5.2.2. Bracketed Match
　5.2.3. Crossing-Brackets Rate
　5.2.4. Summary
5.3. Evaluations with the NIST Scheme
　5.3.1. Labelling Accuracy
　　5.3.1.1. Methodology
　　5.3.1.2. Evaluating the Scoring Program
　　5.3.1.3. Evaluating the Labelling Accuracy of Parser-Produced Trees
　5.3.2. Bracketing Accuracy
　　5.3.2.1. A Linear Representation of the Hierarchical Structure
　　5.3.2.2. Evaluating the Scoring Program
　　5.3.2.3. Evaluating the Bracketing Accuracy of Parser-Produced Trees
　5.3.3. Labelling and Bracketing Accuracy Scores Combined
　　5.3.3.1. A Description
　　5.3.3.2. Empirical Scores by the Survey Parser
　　5.3.3.3. A Comparison with other Reports
　　　5.3.3.3.1. A Comparison with Keller (2003)
　　　5.3.3.3.2. A Comparison with Plaehn (2004)
　　　5.3.3.3.3. A Comparison with Henderson (2004)
　　　5.3.3.3.4. Summary
5.4. Summary
6. The Resolution of Prepositional Phrases
6.1. Introduction
6.2. PPs in Contemporary British English
　6.2.1. Prepositions
　6.2.2. Prepositional Complement
　6.2.3. The Syntactic Functions of Prepositional Phrases
6.3. Data and Resources Used in the Experiment
6.4. Scope of Experiment
6.5. Test Data
6.6. Lexical Database
6.7. Rules and their Coverage
　6.7.1. Prepositional Phrases as Adjective Phrase Postmodifiers
　　6.7.1.1. Test Data
　　6.7.1.2. Attachment Rules
　　6.7.1.3. Morphological Analysis
　　6.7.1.4. Coverage
　6.7.2. Prepositional Phrases as Noun Phrase Postmodifiers
　　6.7.2.1. Test Data
　　6.7.2.2. Attachment Rules
　　6.7.2.3. Morphological Analysis
　　6.7.2.4. Coverage of the Rules
　6.7.3. Prepositional Phrases as Adverbials
6.8. Overall Success Rate and Discussion
7. Conclusions and Further Work
7.1. Conclusions
7.2. Applications of AUTASYS and the Survey Parser
　7.2.1. AUTASYS
　7.2.2. The Survey Parser
　　7.2.2.1. SpeechMaker
　　7.2.2.2. Correlation between Tone Units and Syntax
7.3. Future Work
　7.3.1. Implementation of Automated Prepositional Phrase Attachment
　7.3.2. Automated Clause Boundary Detection and Attachment
References
Appendix A: A List of LOB Tags
Appendix B: A List of ICE Tags
Appendix C: A List of AUTASYS Idioms
Appendix D: A List of ICE Parsing Symbols
Appendix E: A List of ICE Prepositions in Descending Frequency Order
Appendix F: A Distributional Profile of ICE-GB Prepositions
Index
