CKIP CoreNLP

Introduction

Official CKIP CoreNLP Toolkits

Features

  • Sentence Segmentation

  • Word Segmentation

  • Part-of-Speech Tagging

  • Named-Entity Recognition

  • Sentence Parsing

  • Co-Reference Resolution

Contributors

Installation

Requirements

Driver Requirements

Driver

Built-in

CkipTagger

CkipClassic

Sentence Segmentation

Word Segmentation†

Part-of-Speech Tagging†

Sentence Parsing

Named-Entity Recognition

Co-Reference Resolution‡

  • † These drivers require only one of the two backends.

  • ‡ The co-reference implementation does not require any backend, but requires the results of word segmentation, part-of-speech tagging, sentence parsing, and named-entity recognition.

Installation via Pip

License

CC BY-NC-SA 4.0

Copyright (c) 2018-2020 CKIP Lab under the CC BY-NC-SA 4.0 License.

Usage

Pipelines

Core Pipeline

The CkipPipeline connects the drivers for sentence segmentation, word segmentation, part-of-speech tagging, named-entity recognition, and sentence parsing.

The CkipDocument is the workspace of CkipPipeline, holding its input/output data. Note that CkipPipeline stores its results into the CkipDocument in-place.

The CkipPipeline computes all necessary dependencies. For example, if one calls get_ner() with only raw-text input, the pipeline automatically calls get_text(), get_ws(), and get_pos() first.

_images/pipeline.svg
from ckipnlp.pipeline import CkipPipeline, CkipDocument

pipeline = CkipPipeline()
doc = CkipDocument(raw='中文字喔,啊哈哈哈')

# Word Segmentation
pipeline.get_ws(doc)
print(doc.ws)
for line in doc.ws:
    print(line.to_text())

# Part-of-Speech Tagging
pipeline.get_pos(doc)
print(doc.pos)
for line in doc.pos:
    print(line.to_text())

# Named-Entity Recognition
pipeline.get_ner(doc)
print(doc.ner)

# Sentence Parsing
pipeline.get_parsed(doc)
print(doc.parsed)

################################################################

from ckipnlp.container.util.wspos import WsPosParagraph

# Word Segmentation & Part-of-Speech Tagging
for line in WsPosParagraph.to_text(doc.ws, doc.pos):
    print(line)
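The dependency resolution described above can be illustrated with a small, self-contained sketch. This is not ckipnlp's actual implementation; the DEPS table and placeholder values merely mirror the documented pipeline graph:

```python
# A minimal sketch of dependency-driven computation (not ckipnlp internals):
# each output declares its prerequisites, and a request recursively fills
# in whatever is missing before running its own "driver".
DEPS = {
    'text': ['raw'],          # sentence segmentation
    'ws': ['text'],           # word segmentation
    'pos': ['ws'],            # part-of-speech tagging
    'ner': ['ws', 'pos'],     # named-entity recognition
    'parsed': ['ws', 'pos'],  # sentence parsing
}

def compute(doc, key):
    """Fill doc[key], recursively computing missing prerequisites first."""
    if key not in doc:
        for dep in DEPS.get(key, []):
            compute(doc, dep)
        doc[key] = f'<{key}>'  # placeholder for the real driver call
    return doc[key]

doc = {'raw': '中文字喔,啊哈哈哈'}
compute(doc, 'ner')  # transitively computes text, ws, and pos as well
```

Requesting 'ner' on a document that only contains raw text pulls in the sentence segmentation, word segmentation, and POS tagging steps automatically, just as get_ner() does.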

Co-Reference Pipeline

The CkipCorefPipeline is an extension of CkipPipeline that adds coreference resolution. The pipeline first performs named-entity recognition as CkipPipeline does, then applies alignment algorithms to fix the word-segmentation and part-of-speech tagging outputs, and finally performs coreference resolution based on the sentence parsing result.

The CkipCorefDocument is the workspace of CkipCorefPipeline, holding its input/output data. Note that CkipCorefPipeline stores its results into the CkipCorefDocument.

_images/coref_pipeline.svg
from ckipnlp.pipeline import CkipCorefPipeline, CkipDocument

pipeline = CkipCorefPipeline()
doc = CkipDocument(raw='畢卡索他想,完蛋了')

# Co-Reference
corefdoc = pipeline(doc)
print(corefdoc.coref)
for line in corefdoc.coref:
    print(line.to_text())

Drivers

CkipNLP provides several alternative drivers for the above two pipelines. Here is the list of drivers:

DriverType         | DriverFamily.BUILTIN  | DriverFamily.TAGGER     | DriverFamily.CLASSIC
SENTENCE_SEGMENTER | CkipSentenceSegmenter |                         |
WORD_SEGMENTER     |                       | CkipTaggerWordSegmenter | CkipClassicWordSegmenter
POS_TAGGER         |                       | CkipTaggerPosTagger     | CkipClassicWordSegmenter†
NER_CHUNKER        |                       | CkipTaggerNerChunker    |
SENTENCE_PARSER    |                       |                         | CkipClassicSentenceParser
COREF_CHUNKER      | CkipCorefChunker      |                         |

† Not compatible with CkipCorefPipeline.

Containers

The container objects provide the following methods:

  • from_text(), to_text() for plain-text format conversions;

  • from_dict(), to_dict() for dictionary-like format conversions;

  • from_list(), to_list() for list-like format conversions;

  • from_json(), to_json() for JSON format conversions (based on the dictionary-like format conversions).
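As a sketch of how the JSON conversions can be layered on top of the dictionary conversions, consider a minimal tuple-style container. The Token class below is illustrative only; the real base classes live in ckipnlp.container.base:

```python
import json
from typing import NamedTuple

class Token(NamedTuple):  # a stand-in for a CKIPNLP tuple container
    word: str
    pos: str

    @classmethod
    def from_dict(cls, data):
        return cls(**data)

    def to_dict(self):
        return dict(self._asdict())

    @classmethod
    def from_json(cls, data, **kwargs):
        # The JSON conversion rides on the dict conversion.
        return cls.from_dict(json.loads(data, **kwargs))

    def to_json(self, **kwargs):
        return json.dumps(self.to_dict(), ensure_ascii=False, **kwargs)

tok = Token.from_json('{"word": "中文字", "pos": "Na"}')
```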

The following are the interfaces, where CONTAINER_CLASS refers to the container class.

obj = CONTAINER_CLASS.from_text(plain_text)
plain_text = obj.to_text()

obj = CONTAINER_CLASS.from_dict({ key: value })
dict_obj = obj.to_dict()

obj = CONTAINER_CLASS.from_list([ value1, value2 ])
list_obj = obj.to_list()

obj = CONTAINER_CLASS.from_json(json_str)
json_str = obj.to_json()

Note that not all containers provide all of the above methods. Here is a table of the implemented methods. Please refer to the documentation of each container for detailed formats.

Container       | Item          | from/to text | from/to dict, list, json
TextParagraph   | str           | yes          | yes
SegSentence     | str           | yes          | yes
SegParagraph    | SegSentence   | yes          | yes
NerToken        |               |              | yes
NerSentence     | NerToken      |              | yes
NerParagraph    | NerSentence   |              | yes
ParsedParagraph | str           | yes          | yes
CorefToken      |               | only to      | yes
CorefSentence   | CorefToken    | only to      | yes
CorefParagraph  | CorefSentence | only to      | yes

WS with POS

There are also conversion routines for word-segmentation and POS containers jointly. For example, WsPosToken provides routines for a word (str) with POS-tag (str):

ws_obj, pos_obj = WsPosToken.from_text('中文字(Na)')
plain_text = WsPosToken.to_text(ws_obj, pos_obj)

ws_obj, pos_obj = WsPosToken.from_dict({ 'word': '中文字', 'pos': 'Na', })
dict_obj = WsPosToken.to_dict(ws_obj, pos_obj)

ws_obj, pos_obj = WsPosToken.from_list([ '中文字', 'Na' ])
list_obj = WsPosToken.to_list(ws_obj, pos_obj)

ws_obj, pos_obj = WsPosToken.from_json(json_str)
json_str = WsPosToken.to_json(ws_obj, pos_obj)

Similarly, WsPosSentence/WsPosParagraph provide routines for word-segmented and POS sentences/paragraphs (SegSentence/SegParagraph), respectively.
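A pure-Python sketch of this joint text format, where each token is rendered as word(POS) and tokens are joined by a full-width space. The helper names are illustrative and the format is assumed from the documented examples; this is not ckipnlp's code:

```python
def wspos_to_text(words, tags):
    """Render parallel word/POS lists as '中文字(Na)\u3000喔(T)'."""
    return '\u3000'.join(f'{w}({p})' for w, p in zip(words, tags))

def wspos_from_text(line):
    """Split the joint text format back into parallel word/POS lists."""
    words, tags = [], []
    for token in line.split('\u3000'):
        word, _, pos = token[:-1].rpartition('(')  # drop the trailing ')'
        words.append(word)
        tags.append(pos)
    return words, tags

line = wspos_to_text(['中文字', '喔'], ['Na', 'T'])
```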

Parsed Tree

In addition to ParsedParagraph, we have implemented tree utilities based on TreeLib.

ParsedTree is the tree structure of a parsed sentence. One may use from_text() and to_text() for plain-text conversion; from_dict(), to_dict() for dictionary-like object conversion; and also from_json(), to_json() for JSON string conversion.

The ParsedTree is a TreeLib tree with ParsedNode objects as its nodes. The data of these nodes is stored in a ParsedNodeData (accessed via node.data), which is a tuple of role (semantic role), pos (part-of-speech tag), and word.

ParsedTree provides useful methods: get_heads() finds the head words of the sentence; get_relations() extracts all relations in the sentence; get_subjects() returns the subjects of the sentence.

from ckipnlp.container import ParsedTree

# 我的早餐、午餐和晚餐都在那場比賽中被吃掉了
tree_text = 'S(goal:NP(possessor:N‧的(head:Nhaa:我|Head:DE:的)|Head:Nab(DUMMY1:Nab(DUMMY1:Nab:早餐|Head:Caa:、|DUMMY2:Naa:午餐)|Head:Caa:和|DUMMY2:Nab:晚餐))|quantity:Dab:都|condition:PP(Head:P21:在|DUMMY:GP(DUMMY:NP(Head:Nac:比賽)|Head:Ng:中))|agent:PP(Head:P02:被)|Head:VC31:吃掉|aspect:Di:了)'

tree = ParsedTree.from_text(tree_text, normalize=False)

print('Show Tree')
tree.show()

print('Get Heads of {}'.format(tree[5]))
print('-- Semantic --')
for head in tree.get_heads(5, semantic=True): print(repr(head))
print('-- Syntactic --')
for head in tree.get_heads(5, semantic=False): print(repr(head))
print()

print('Get Relations of {}'.format(tree[0]))
print('-- Semantic --')
for rel in tree.get_relations(0, semantic=True): print(repr(rel))
print('-- Syntactic --')
for rel in tree.get_relations(0, semantic=False): print(repr(rel))
print()

# 我和食物真的都很不開心
tree_text = 'S(theme:NP(DUMMY1:NP(Head:Nhaa:我)|Head:Caa:和|DUMMY2:NP(Head:Naa:食物))|evaluation:Dbb:真的|quantity:Dab:都|degree:Dfa:很|negation:Dc:不|Head:VH21:開心)'

tree = ParsedTree.from_text(tree_text, normalize=False)

print('Show Tree')
tree.show()

print('Get get_subjects of {}'.format(tree[0]))
print('-- Semantic --')
for subject in tree.get_subjects(0, semantic=True): print(repr(subject))
print('-- Syntactic --')
for subject in tree.get_subjects(0, semantic=False): print(repr(subject))
print()
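The bracketed text format consumed by from_text() can itself be parsed with a short recursive routine. The sketch below is not ckipnlp's parser; it simply splits children on top-level '|' characters, tracking parenthesis depth:

```python
def parse_node(text):
    """Parse 'label(child|child|...)' into (label, [children])."""
    i = text.find('(')
    if i < 0:
        return (text, [])                 # leaf: 'role:pos:word'
    label, body = text[:i], text[i + 1:-1]
    children, depth, start = [], 0, 0
    for j, ch in enumerate(body):
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
        elif ch == '|' and depth == 0:    # split only at the top level
            children.append(parse_node(body[start:j]))
            start = j + 1
    children.append(parse_node(body[start:]))
    return (label, children)

tree = parse_node('S(Head:Nab:中文字|particle:Td:耶)')
```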

Tables of Tags

Part-of-Speech Tags

Tag | Description
A | non-predicative adjective (非謂形容詞)
Caa | coordinating conjunction (對等連接詞)
Cab | conjunction, e.g. 等等 (連接詞)
Cba | conjunction, e.g. 的話 (連接詞)
Cbb | correlative conjunction (關聯連接詞)
D | adverb (副詞)
Da | quantitative adverb (數量副詞)
Dfa | pre-verbal adverb of degree (動詞前程度副詞)
Dfb | post-verbal adverb of degree (動詞後程度副詞)
Di | aspectual marker (時態標記)
Dk | sentential adverb (句副詞)
DM | determiner-measure compound (定量式)
I | interjection (感嘆詞)
Na | common noun (普通名詞)
Nb | proper noun (專有名詞)
Nc | place noun (地方詞)
Ncd | localizer (位置詞)
Nd | time noun (時間詞)
Nep | anaphoric determiner (指代定詞)
Neqa | classifying determiner (數量定詞)
Neqb | postposed classifying determiner (後置數量定詞)
Nes | specific determiner (特指定詞)
Neu | numeral determiner (數詞定詞)
Nf | measure word / classifier (量詞)
Ng | postposition (後置詞)
Nh | pronoun (代名詞)
Nv | nominalized verb (名物化動詞)
P | preposition (介詞)
T | particle (語助詞)
VA | active intransitive verb (動作不及物動詞)
VAC | active causative verb (動作使動動詞)
VB | active pseudo-transitive verb (動作類及物動詞)
VC | active transitive verb (動作及物動詞)
VCL | active verb taking a locative object (動作接地方賓語動詞)
VD | ditransitive verb (雙賓動詞)
VF | active verb taking a verbal object (動作謂賓動詞)
VE | active verb taking a sentential object (動作句賓動詞)
VG | classificatory verb (分類動詞)
VH | stative intransitive verb (狀態不及物動詞)
VHC | stative causative verb (狀態使動動詞)
VI | stative pseudo-transitive verb (狀態類及物動詞)
VJ | stative transitive verb (狀態及物動詞)
VK | stative verb taking a sentential object (狀態句賓動詞)
VL | stative verb taking a verbal object (狀態謂賓動詞)
V_2 |
DE | the particles 的/之/得/地
SHI |
FW | foreign word (外文)
COLONCATEGORY | colon (冒號)
COMMACATEGORY | comma (逗號)
DASHCATEGORY | dash (破折號)
DOTCATEGORY | dot (點號)
ETCCATEGORY | ellipsis (刪節號)
EXCLAMATIONCATEGORY | exclamation mark (驚嘆號)
PARENTHESISCATEGORY | parentheses (括號)
PAUSECATEGORY | enumeration comma (頓號)
PERIODCATEGORY | period (句號)
QUESTIONCATEGORY | question mark (問號)
SEMICOLONCATEGORY | semicolon (分號)
SPCHANGECATEGORY | double vertical line (雙直線)
WHITESPACE | whitespace (空白)

Parsing Tree Tags

Tag | Description

S | Indicates that the structure tree is a sentence, with a predicate as its head. In addition, when the subject, or the object or complement of a predicate, takes the form of a sentence or clause, the phrase is labeled S rather than NP.

VP | Predicate phrase; the head is a predicate (V).

NP | Noun phrase; the head is a noun (N).

GP | Locative phrase; the head is a locative word (Ng), and the argument it takes bears the role DUMMY1.

PP | Prepositional phrase; the head is a preposition (P), and the argument it takes likewise bears the role DUMMY.

XP | Conjunctive phrase; the head is a conjunction (C). X stands for a variable: the actual category of an XP is determined by its conjuncts, e.g. if the conjuncts are predicate phrases (VP) the XP is a VP, and if they are noun phrases it is an NP.

DM | Determiner-measure phrase.

Parsing Tree Roles

Role | Description

# Modifiers of object nouns

apposition | An appositive of the object, i.e. referring to the same object.

possessor | The possessor of the object; members, creators, owners, wholes, and the like all count as possessors.

predication | A related event modifying the object; a relative clause of the noun that bears an argument relation with the event head.

property | Features and properties of the object, including related spatio-temporal information; a rather high-level, coarse semantic role.

quantifier | A quantity modifier of the noun, such as classifying determiners and determiner-measure compounds.

# Modifiers of event verbs: participant roles

agent | The initiator of the event; the actor of an action verb.

benefactor | The beneficiary, but not the main object.

causer | The initiator of the event, where the initiator does not actively cause the event to happen.

companion | The party accompanying the subject.

comparison | The object of comparison, appearing mostly in comparative sentences.

experiencer | The experiencer of the described emotional or perceptual state; the subject of psych predicates.

goal | The object affected by the action, or the patient of a mental action; in object-transfer events, the recipient or endpoint.

range | The category of a classification or the extent of a result; the main semantic role of classificatory verbs and comparative sentences.

source | The starting point of an object transfer.

target | The addressee of the predicate content, or the direction of a transfer.

theme | The object described by stative and classificatory predicates, the entity whose existence or movement a dynamic event describes, or the patient brought into existence by the event action.

topic | The topic the event is about.

# Modifiers of event verbs: adjunct roles

aspect | The aspect of the action.

degree | The degree of the state.

deixis | A deictic component attached to the action.

deontics | The speaker's attitude toward whether the event should come true; marked on modal adverbs of this type.

duration | The length of time the event lasts.

evaluation | An evaluative mood component.

epistemics | The speaker's conjecture about whether the event is true; marked on modal adverbs of this type.

frequency | The frequency of the event.

instrument | The instrument used in the action.

interjection | The role of an interjection within the sentence.

location | The place where the event occurs.

manner | The manner of the subject's action.

negation | Negation.

particle | The speaker's sentence-final mood.

quantity | The quantity of things.

standard | The basis or standard relied on.

time | The time the event occurs.

# Modifiers of event verbs: subordinate semantic roles

addition | Addition.

alternative | The alternative mood in a coordinate compound sentence.

avoidance | A situation to be avoided.

complement | A supplementary remark further elaborating the content of the preceding event.

conclusion | The conclusion being introduced.

condition | A sentence or situation in the conditional mood.

concession | A concessive connection.

contrast | An adversative mood.

conversion | The result introduced under changed conditions.

exclusion | The excluded object.

hypothesis | A hypothetical mood.

listing | Listed items.

purpose | A purpose.

reason | The reason for the event.

rejection | The part to be rejected in a selection relation.

result | The result of the event.

restriction | The first half of a progressive construction.

selection | The part to be selected in a selection relation.

uncondition | A hypothesis contrary to the current situation.

whatever | Regardless of the condition.

# Grammatical function markers

DUMMY | An undetermined role, resolved by the head of its superordinate phrase.

DUMMY1 | An undetermined role, resolved by the head of its superordinate phrase.

DUMMY2 | An undetermined role, resolved by the head of its superordinate phrase.

Head | The grammatical head, usually also the semantic core; every sentence or phrase has a Head role.

head | In 的 constructions where the semantic and grammatical heads differ, marks the semantic head component, as distinct from the grammatical head.

nominal | Marks a nominalization structure: the 的 in a noun phrase whose head is a nominalized verb.

ckipnlp package

The Official CKIP CoreNLP Toolkits.

Subpackages

ckipnlp.container package

This module implements specialized container datatypes for CKIPNLP.

Subpackages

ckipnlp.container.util package

This module implements specialized utilities for CKIPNLP containers.

Submodules

ckipnlp.container.util.parsed_tree module

This module provides tree containers for sentence parsing.

class ckipnlp.container.util.parsed_tree.ParsedNodeData[source]

Bases: ckipnlp.container.base.BaseTuple, ckipnlp.container.util.parsed_tree._ParsedNodeData

A parser node.

Variables
  • role (str) – the semantic role.

  • pos (str) – the POS-tag.

  • word (str) – the text term.

Note

This class is a subclass of tuple. To change an attribute, please create a new instance instead.

Data Structure Examples

Text format

Used for from_text() and to_text().

'Head:Na:中文字'  # role / POS-tag / text-term
Dict format

Used for from_dict() and to_dict().

{
    'role': 'Head',   # role
    'pos': 'Na',      # POS-tag
    'word': '中文字',  # text term
}
List format

Not implemented.

classmethod from_text(data)[source]

Construct an instance from text format.

Parameters

data (str) – text such as 'Head:Na:中文字'.

Note

  • 'Head:Na:中文字' -> role = 'Head', pos = 'Na', word = '中文字'

  • 'Head:Na' -> role = 'Head', pos = 'Na', word = None

  • 'Na' -> role = None, pos = 'Na', word = None

class ckipnlp.container.util.parsed_tree.ParsedNode(tag=None, identifier=None, expanded=True, data=None)[source]

Bases: ckipnlp.container.base.Base, treelib.node.Node

A parser node for tree.

Variables

data (ParsedNodeData) –

See also

treelib.node.Node

Please refer to https://treelib.readthedocs.io/ for built-in usage.

Data Structure Examples

Text format

Not implemented.

Dict format

Used for to_dict().

{
    'role': 'Head',   # role
    'pos': 'Na',      # POS-tag
    'word': '中文字',  # text term
}
List format

Not implemented.

data_class

alias of ParsedNodeData

class ckipnlp.container.util.parsed_tree.ParsedRelation[source]

Bases: ckipnlp.container.base.Base, ckipnlp.container.util.parsed_tree._ParsedRelation

A parser relation.

Variables
  • head (ParsedNode) – the head node.

  • tail (ParsedNode) – the tail node.

  • relation (ParsedNode) – the relation node. (the semantic role of this node is the relation.)

Notes

The parent of the relation node is always the common ancestor of the head node and tail node.

Data Structure Examples

Text format

Not implemented.

Dict format

Used for to_dict().

{
    'head': { 'role': 'Head', 'pos': 'Nab', 'word': '中文字' }, # head node
    'tail': { 'role': 'particle', 'pos': 'Td', 'word': '耶' }, # tail node
    'relation': 'particle',  # relation
}
List format

Not implemented.

class ckipnlp.container.util.parsed_tree.ParsedTree(tree=None, deep=False, node_class=None, identifier=None)[source]

Bases: ckipnlp.container.base.Base, treelib.tree.Tree

A parsed tree.

See also

treelib.tree.Tree

Please refer to https://treelib.readthedocs.io/ for built-in usage.

Data Structure Examples

Text format

Used for from_text() and to_text().

'S(Head:Nab:中文字|particle:Td:耶)'
Dict format

Used for from_dict() and to_dict(). A dictionary such as { 'id': 0, 'data': { ... }, 'children': [ ... ] }, where 'data' is a dictionary with the same format as ParsedNodeData.to_dict(), and 'children' is a list of dictionaries of subtrees with the same format as this tree.

{
    'id': 0,
    'data': {
        'role': None,
        'pos': 'S',
        'word': None,
    },
    'children': [
        {
            'id': 1,
            'data': {
                'role': 'Head',
                'pos': 'Nab',
                'word': '中文字',
            },
            'children': [],
        },
        {
            'id': 2,
            'data': {
                'role': 'particle',
                'pos': 'Td',
                'word': '耶',
            },
            'children': [],
        },
    ],
}
List format

Not implemented.

node_class

alias of ParsedNode

static normalize_text(tree_text)[source]

Text normalization.

Remove leading number and trailing #.

classmethod from_text(data, *, normalize=True)[source]

Construct an instance from text format.

Parameters
  • data (str) – A parsed tree in text format.

  • normalize (bool) – Do text normalization using normalize_text().

to_text(node_id=None)[source]

Transform to plain text.

Parameters

node_id (int) – Output the plain text format for the subtree under node_id.

Returns

str

classmethod from_dict(data)[source]

Construct an instance from python built-in containers.

Parameters

data (dict) – A parsed tree in dictionary format.

to_dict(node_id=None)[source]

Transform to python built-in containers.

Parameters

node_id (int) – Output the dictionary format for the subtree under node_id.

Returns

dict

show(*, key=<function ParsedTree.<lambda>>, idhidden=False, **kwargs)[source]

Show pretty tree.

get_children(node_id, *, role)[source]

Get children of a node with given role.

Parameters
  • node_id (int) – ID of target node.

  • role (str) – the target role.

Yields

ParsedNode – the children nodes with given role.

get_heads(root_id=None, *, semantic=True, deep=True)[source]

Get all head nodes of a subtree.

Parameters
  • root_id (int) – ID of the root node of target subtree.

  • semantic (bool) – use the semantic or syntactic policy. In semantic mode, returns DUMMY or head nodes instead of the syntactic Head.

  • deep (bool) – find heads recursively.

Yields

ParsedNode – the head nodes.

get_relations(root_id=None, *, semantic=True)[source]

Get all relations of a subtree.

Parameters
  • root_id (int) – ID of the subtree root node.

  • semantic (bool) – please refer to get_heads() for policy details.

Yields

ParsedRelation – the relations.

get_subjects(root_id=None, *, semantic=True, deep=True)[source]

Get the subject node of a subtree.

Parameters
  • root_id (int) – ID of the root node of target subtree.

  • semantic (bool) – please refer to get_heads() for policy details.

  • deep (bool) – please refer to get_heads() for policy details.

Yields

ParsedNode – the subject node.

Notes

A node can be a subject if it satisfies any of the following:

  1. it is the head of an NP;

  2. it is the head of a subnode (N) of S with a subject role;

  3. it is the head of a subnode (N) of S with a neutral role that precedes the head (V) of S.

ckipnlp.container.util.wspos module

This module provides containers for word-segmented sentences with part-of-speech-tags.

class ckipnlp.container.util.wspos.WsPosToken[source]

Bases: ckipnlp.container.base.BaseTuple, ckipnlp.container.util.wspos._WsPosToken

A word with POS-tag.

Variables
  • word (str) – the word.

  • pos (str) – the POS-tag.

Note

This class is a subclass of tuple. To change an attribute, please create a new instance instead.

Data Structure Examples

Text format

Used for from_text() and to_text().

'中文字(Na)'  # word / POS-tag
Dict format

Used for from_dict() and to_dict().

{
    'word': '中文字', # word
    'pos': 'Na',     # POS-tag
}
List format

Used for from_list() and to_list().

[
    '中文字', # word
    'Na',    # POS-tag
]
classmethod from_text(data)[source]

Construct an instance from text format.

Parameters

data (str) – text such as '中文字(Na)'.

Note

  • '中文字(Na)' -> word = '中文字', pos = 'Na'

  • '中文字' -> word = '中文字', pos = None

class ckipnlp.container.util.wspos.WsPosSentence[source]

Bases: object

A helper class for data conversion of word-segmented and part-of-speech sentences.

classmethod from_text(data)[source]

Convert text format to word-segmented and part-of-speech sentences.

Parameters

data (str) – text such as '中文字(Na)\u3000喔(T)'.

Returns

static to_text(word, pos)[source]

Convert text format to word-segmented and part-of-speech sentences.

Parameters
Returns

str – text such as '中文字(Na)\u3000喔(T)'.

class ckipnlp.container.util.wspos.WsPosParagraph[source]

Bases: object

A helper class for data conversion of word-segmented and part-of-speech sentence lists.

classmethod from_text(data)[source]

Convert text format to word-segmented and part-of-speech sentence lists.

Parameters

data (Sequence[str]) – list of sentences such as '中文字(Na)\u3000喔(T)'.

Returns

static to_text(word, pos)[source]

Convert text format to word-segmented and part-of-speech sentence lists.

Parameters
Returns

List[str] – list of sentences such as '中文字(Na)\u3000喔(T)'.

Submodules

ckipnlp.container.base module

This module provides base containers.

class ckipnlp.container.base.Base[source]

Bases: object

The base CKIPNLP container.

abstract classmethod from_text(data)[source]

Construct an instance from text format.

Parameters

data (str) –

abstract to_text()[source]

Transform to plain text.

Returns

str

abstract classmethod from_dict(data)[source]

Construct an instance from python built-in containers.

abstract to_dict()[source]

Transform to python built-in containers.

abstract classmethod from_list(data)[source]

Construct an instance from python built-in containers.

abstract to_list()[source]

Transform to python built-in containers.

classmethod from_json(data, **kwargs)[source]

Construct an instance from JSON format.

Parameters

data (str) – please refer to from_dict() for format details.

to_json(**kwargs)[source]

Transform to JSON format.

Returns

str

class ckipnlp.container.base.BaseTuple[source]

Bases: ckipnlp.container.base.Base

The base CKIPNLP tuple.

classmethod from_dict(data)[source]

Construct an instance from python built-in containers.

Parameters

data (dict) –

to_dict()[source]

Transform to python built-in containers.

Returns

dict

classmethod from_list(data)[source]

Construct an instance from python built-in containers.

Parameters

data (list) –

to_list()[source]

Transform to python built-in containers.

Returns

list

class ckipnlp.container.base.BaseList(initlist=None)[source]

Bases: ckipnlp.container.base._BaseList, ckipnlp.container.base._InterfaceItem

The base CKIPNLP list.

item_class = Not Implemented

Must be a CKIPNLP container class.

class ckipnlp.container.base.BaseList0(initlist=None)[source]

Bases: ckipnlp.container.base._BaseList, ckipnlp.container.base._InterfaceBuiltInItem

The base CKIPNLP list with built-in item class.

item_class = Not Implemented

Must be a built-in type.

class ckipnlp.container.base.BaseSentence(initlist=None)[source]

Bases: ckipnlp.container.base._BaseSentence, ckipnlp.container.base._InterfaceItem

The base CKIPNLP sentence.

item_class = Not Implemented

Must be a CKIPNLP container class.

class ckipnlp.container.base.BaseSentence0(initlist=None)[source]

Bases: ckipnlp.container.base._BaseSentence, ckipnlp.container.base._InterfaceBuiltInItem

The base CKIPNLP sentence with built-in item class.

item_class = Not Implemented

Must be a built-in type.

ckipnlp.container.coref module

This module provides containers for coreference sentences.

class ckipnlp.container.coref.CorefToken[source]

Bases: ckipnlp.container.base.BaseTuple, ckipnlp.container.coref._CorefToken

A coreference token.

Variables
  • word (str) – the token word.

  • coref (Tuple[int, str]) –

    the coreference ID and type. None if not a coreference source or target.

    • type:
      • 'source': coreference source.

      • 'target': coreference target.

      • 'zero': null-element coreference target.

  • idx (int) – the node index in the parsed tree.

Note

This class is a subclass of tuple. To change an attribute, please create a new instance instead.

Data Structure Examples

Text format

Used for to_text() only.

'畢卡索_0'
Dict format

Used for from_dict() and to_dict().

{
    'word': '畢卡索',        # token word
    'coref': (0, 'source'), # coref ID and type
    'idx': 2,               # node index
}
List format

Used for from_list() and to_list().

[
    '畢卡索',       # token word
    (0, 'source'), # coref ID and type
    2,             # node index
]
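A sketch of the 'word_ID' text rendering shown above, assumed from the documented examples (a None word renders as 'None' for zero-anaphora targets; this helper is illustrative, not ckipnlp's code):

```python
def coref_token_to_text(word, coref):
    """Render a coref token as '畢卡索_0'; plain words stay unchanged."""
    text = word if word is not None else 'None'
    return f'{text}_{coref[0]}' if coref is not None else text

tokens = [('畢卡索', (0, 'source')), ('他', (0, 'target')), ('想', None)]
line = '\u3000'.join(coref_token_to_text(w, c) for w, c in tokens)
```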
class ckipnlp.container.coref.CorefSentence(initlist=None)[source]

Bases: ckipnlp.container.base.BaseSentence

A coreference sentence (a list of coreference tokens).

Data Structure Examples

Text format

Used for to_text() only.

'畢卡索_0 他_0 想' # Token segmented by \u3000 (full-width space)
Dict format

Used for from_dict() and to_dict().

[
    { 'word': '畢卡索', 'coref': (0, 'source'), 'idx': 2, }, # coref-token 1
    { 'word': '他', 'coref': (0, 'target'), 'idx': 3, },    # coref-token 2
    { 'word': '想', 'coref': None, 'idx': 4, },             # coref-token 3
]
List format

Used for from_list() and to_list().

[
    [ '畢卡索', (0, 'source'), 2, ], # coref-token 1
    [ '他', (0, 'target'), 3, ],    # coref-token 2
    [ '想', None, 4, ],             # coref-token 3
]
item_class

alias of CorefToken

class ckipnlp.container.coref.CorefParagraph(initlist=None)[source]

Bases: ckipnlp.container.base.BaseList

A list of coreference sentences.

Data Structure Examples

Text format

Used for to_text() only.

[
    '畢卡索_0 他_0 想', # Sentence 1
    'None_0 完蛋 了',  # Sentence 2
]
Dict format

Used for from_dict() and to_dict().

[
    [ # Sentence 1
        { 'word': '畢卡索', 'coref': (0, 'source'), 'idx': 2, },
        { 'word': '他', 'coref': (0, 'target'), 'idx': 3, },
        { 'word': '想', 'coref': None, 'idx': 4, },
    ],
    [ # Sentence 2
        { 'word': None, 'coref': (0, 'zero'), 'idx': None, },
        { 'word': '完蛋', 'coref': None, 'idx': 1, },
        { 'word': '了', 'coref': None, 'idx': 2, },
    ],
]
List format

Used for from_list() and to_list().

[
    [ # Sentence 1
        [ '畢卡索', (0, 'source'), 2, ],
        [ '他', (0, 'target'), 3, ],
        [ '想', None, 4, ],
    ],
    [ # Sentence 2
        [ None, (0, 'zero'), None, ],
        [ '完蛋', None, 1, ],
        [ '了', None, 2, ],
    ],
]
item_class

alias of CorefSentence

ckipnlp.container.ner module

This module provides containers for NER sentences.

class ckipnlp.container.ner.NerToken[source]

Bases: ckipnlp.container.base.BaseTuple, ckipnlp.container.ner._NerToken

A named-entity recognition token.

Variables
  • word (str) – the token word.

  • ner (str) – the NER-tag.

  • idx (Tuple[int, int]) – the starting / ending index.

Note

This class is a subclass of tuple. To change an attribute, please create a new instance instead.

Data Structure Examples

Text format

Not implemented

Dict format

Used for from_dict() and to_dict().

{
    'word': '中文字',   # token word
    'ner': 'LANGUAGE', # NER-tag
    'idx': (0, 3),     # starting / ending index.
}
List format

Used for from_list() and to_list().

[
    '中文字',    # token word
    'LANGUAGE', # NER-tag
    (0, 3),     # starting / ending index.
]
CkipTagger format

Used for from_tagger() and to_tagger().

(
    0,          # starting index
    3,          # ending index
    'LANGUAGE', # NER-tag
    '中文字',    # token word
)
classmethod from_tagger(data)[source]

Construct an instance from CkipTagger format.

to_tagger()[source]

Transform to CkipTagger format.
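The field reordering between the two formats can be sketched as follows. A plain dict stands in for NerToken, and the tuple layout follows the documented CkipTagger example; this is not ckipnlp's implementation:

```python
def ner_from_tagger(item):
    """CkipTagger (start, end, tag, word) -> NerToken-style fields."""
    start, end, ner, word = item
    return {'word': word, 'ner': ner, 'idx': (start, end)}

def ner_to_tagger(token):
    """NerToken-style fields -> CkipTagger (start, end, tag, word)."""
    return (token['idx'][0], token['idx'][1], token['ner'], token['word'])

tok = ner_from_tagger((0, 3, 'LANGUAGE', '中文字'))
```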

class ckipnlp.container.ner.NerSentence(initlist=None)[source]

Bases: ckipnlp.container.base.BaseSentence

A named-entity recognition sentence.

Data Structure Examples

Text format

Not implemented

Dict format

Used for from_dict() and to_dict().

[
    { 'word': '美國', 'ner': 'GPE', 'idx': (0, 2), },   # named entity 1
    { 'word': '參議院', 'ner': 'ORG', 'idx': (3, 5), }, # named entity 2
]
List format

Used for from_list() and to_list().

[
    [ '美國', 'GPE', (0, 2), ],   # named entity 1
    [ '參議院', 'ORG', (3, 5), ], # named entity 2
]
CkipTagger format

Used for from_tagger() and to_tagger().

[
    ( 0, 2, 'GPE', '美國', ),   # named entity 1
    ( 3, 5, 'ORG', '參議院', ), # named entity 2
]
item_class

alias of NerToken

classmethod from_tagger(data)[source]

Construct an instance from CkipTagger format.

to_tagger()[source]

Transform to CkipTagger format.

class ckipnlp.container.ner.NerParagraph(initlist=None)[source]

Bases: ckipnlp.container.base.BaseList

A list of named-entity recognition sentences.

Data Structure Examples

Text format

Not implemented

Dict format

Used for from_dict() and to_dict().

[
    [ # Sentence 1
        { 'word': '中文字', 'ner': 'LANGUAGE', 'idx': (0, 3), },
    ],
    [ # Sentence 2
        { 'word': '美國', 'ner': 'GPE', 'idx': (0, 2), },
        { 'word': '參議院', 'ner': 'ORG', 'idx': (3, 5), },
    ],
]
List format

Used for from_list() and to_list().

[
    [ # Sentence 1
        [ '中文字', 'LANGUAGE', (0, 3), ],
    ],
    [ # Sentence 2
        [ '美國', 'GPE', (0, 2), ],
        [ '參議院', 'ORG', (3, 5), ],
    ],
]
CkipTagger format

Used for from_tagger() and to_tagger().

[
    [ # Sentence 1
        ( 0, 3, 'LANGUAGE', '中文字', ),
    ],
    [ # Sentence 2
        ( 0, 2, 'GPE', '美國', ),
        ( 3, 5, 'ORG', '參議院', ),
    ],
]
item_class

alias of NerSentence

classmethod from_tagger(data)[source]

Construct an instance from CkipTagger format.

to_tagger()[source]

Transform to CkipTagger format.

ckipnlp.container.parsed module

This module provides containers for parsed sentences.

class ckipnlp.container.parsed.ParsedParagraph(initlist=None)[source]

Bases: ckipnlp.container.base.BaseList0

A list of parsed sentences.

Data Structure Examples

Text/Dict/List format

Used for from_text(), to_text(), from_dict(), to_dict(), from_list(), and to_list().

[
    'S(Head:Nab:中文字|particle:Td:耶)',                     # Sentence 1
    '%(particle:I:啊|manner:Dh:哈|manner:Dh:哈|time:Dh:哈)', # Sentence 2
]
item_class

alias of builtins.str

ckipnlp.container.seg module

This module provides containers for word-segmented sentences.

class ckipnlp.container.seg.SegSentence(initlist=None)[source]

Bases: ckipnlp.container.base.BaseSentence0

A word-segmented sentence.

Data Structure Examples

Text format

Used for from_text() and to_text().

'中文字 喔' # Words segmented by \u3000 (full-width space)
Dict/List format

Used for from_dict(), to_dict(), from_list(), and to_list().

[ '中文字', '喔', ]

Note

This class is also used for part-of-speech tagging.

item_class

alias of builtins.str

class ckipnlp.container.seg.SegParagraph(initlist=None)[source]

Bases: ckipnlp.container.base.BaseList

A list of word-segmented sentences.

Data Structure Examples

Text format

Used for from_text() and to_text().

[
    '中文字 喔',     # Sentence 1
    '啊 哈 哈 哈', # Sentence 2
]
Dict/List format

Used for from_dict(), to_dict(), from_list(), and to_list().

[
    [ '中文字', '喔', ],         # Sentence 1
    [ '啊', '哈', '哈', '哈', ], # Sentence 2
]

Note

This class is also used for part-of-speech tagging.

item_class

alias of SegSentence

ckipnlp.container.text module

This module provides containers for text sentences.

class ckipnlp.container.text.TextParagraph(initlist=None)[source]

Bases: ckipnlp.container.base.BaseList0

A list of text sentences.

Data Structure Examples

Text/Dict/List format

Used for from_text(), to_text(), from_dict(), to_dict(), from_list(), and to_list().

[
    '中文字喔', # Sentence 1
    '啊哈哈哈', # Sentence 2
]
item_class

alias of builtins.str

ckipnlp.driver package

This module implements CKIPNLP drivers.

Submodules

ckipnlp.driver.base module

This module provides base drivers.

class ckipnlp.driver.base.DriverType[source]

Bases: enum.IntEnum

The enumeration of driver types.

SENTENCE_SEGMENTER = 1

Sentence segmentation

WORD_SEGMENTER = 2

Word segmentation

POS_TAGGER = 3

Part-of-speech tagging

NER_CHUNKER = 4

Named-entity recognition

SENTENCE_PARSER = 5

Sentence parsing

COREF_CHUNKER = 6

Coreference resolution

class ckipnlp.driver.base.DriverFamily[source]

Bases: enum.IntEnum

The enumeration of driver backend kinds.

BUILTIN = 1

Built-in Implementation

TAGGER = 2

CkipTagger Backend

CLASSIC = 3

CkipClassic Backend

class ckipnlp.driver.base.DriverRegister[source]

Bases: object

The driver registering utility.

class ckipnlp.driver.base.BaseDriver(*, lazy=False)[source]

Bases: object

The base CKIPNLP driver.

class ckipnlp.driver.base.DummyDriver(*, lazy=False)[source]

Bases: ckipnlp.driver.base.BaseDriver

The dummy driver.

ckipnlp.driver.classic module

This module provides drivers with CkipClassic backend.

class ckipnlp.driver.classic.CkipClassicWordSegmenter(*, lazy=False, do_pos=False, lexicons=None)[source]

Bases: ckipnlp.driver.base.BaseDriver

The CKIP word segmentation driver with CkipClassic backend.

Parameters
  • lazy (bool) – Lazily initialize the underlying object.

  • do_pos (bool) – Whether to also return POS-tags.

  • lexicons (Iterable[Tuple[str, str]]) – A list of lexicon words with their POS-tags.

__call__(*, text)

Apply word segmentation.

Parameters

text (TextParagraph) — The sentences.

Returns
  • ws (TextParagraph) — The word-segmented sentences.

  • pos (TextParagraph) — The part-of-speech sentences. (Returned only if do_pos is set.)

class ckipnlp.driver.classic.CkipClassicSentenceParser(*, lazy=False)[source]

Bases: ckipnlp.driver.base.BaseDriver

The CKIP sentence parsing driver with CkipClassic backend.

Parameters

lazy (bool) – Lazily initialize the underlying object.

__call__(*, ws, pos)

Apply sentence parsing.

Parameters
Returns

parsed (ParsedParagraph) — The parsed sentences.

ckipnlp.driver.coref module

This module provides built-in coreference resolution driver.

class ckipnlp.driver.coref.CkipCorefChunker(*, lazy=False)[source]

Bases: ckipnlp.driver.base.BaseDriver

The CKIP coreference resolution driver.

Parameters

lazy (bool) – Lazily initialize the underlying object.

__call__(*, parsed)

Apply coreference resolution.

Parameters

parsed (ParsedParagraph) — The parsed sentences.

Returns

coref (CorefParagraph) — The coreference results.

static transform_ws(*, text, ws, ner)[source]

Transform word-segmented sentence lists (create a new instance).

static transform_pos(*, ws, pos, ner)[source]

Transform pos-tag sentence lists (modify in-place).

ckipnlp.driver.ss module

This module provides built-in sentence segmentation driver.

class ckipnlp.driver.ss.CkipSentenceSegmenter(*, lazy=False, delims=',,。!!??::;;\n', keep_delims=False)[source]

Bases: ckipnlp.driver.base.BaseDriver

The CKIP sentence segmentation driver.

Parameters
  • lazy (bool) – Lazy-initialize the underlying object.

  • delims (str) – The delimiters.

  • keep_delims (bool) – Keep the delimiters attached to the sentences.

__call__(*, raw, keep_all=True)

Apply sentence segmentation.

Parameters

raw (str) — The raw text.

Returns

text (TextParagraph) — The sentences.
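The built-in segmenter's exact implementation is not reproduced here. A minimal pure-Python sketch of delimiter-based splitting with the same delims/keep_delims options (assumed behavior, not the actual source) could be:

```python
import re

def segment(raw, delims=',,。!!??::;;\n', keep_delims=False):
    """Split raw text on any delimiter character, dropping empty pieces."""
    cls = '[' + re.escape(delims) + ']'
    if keep_delims:
        # Keep each sentence together with its trailing delimiter, if any.
        parts = re.findall('[^' + re.escape(delims) + ']+' + cls + '?', raw)
    else:
        parts = re.split(cls, raw)
    return [part for part in parts if part]

print(segment('中文字喔,啊哈哈哈'))                    # ['中文字喔', '啊哈哈哈']
print(segment('中文字喔,啊哈哈哈', keep_delims=True))  # ['中文字喔,', '啊哈哈哈']
```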

ckipnlp.driver.tagger module

This module provides drivers with CkipTagger backend.

class ckipnlp.driver.tagger.CkipTaggerWordSegmenter(*, lazy=False, disable_cuda=True, recommend_lexicons={}, coerce_lexicons={}, **opts)[source]

Bases: ckipnlp.driver.base.BaseDriver

The CKIP word segmentation driver with CkipTagger backend.

Parameters
  • lazy (bool) – Lazy-initialize the underlying object.

  • disable_cuda (bool) – Disable GPU usage.

  • recommend_lexicons (Mapping[str, float]) – A mapping of lexicon words to their relative weights.

  • coerce_lexicons (Mapping[str, float]) – A mapping of lexicon words to their relative weights; these words are forced into the segmentation.

Other Parameters

**opts – Extra options for ckiptagger.WS.__call__(). (See https://github.com/ckiplab/ckiptagger#4-run-the-ws-pos-ner-pipeline for details.)

__call__(*, text)

Apply word segmentation.

Parameters

text (TextParagraph) — The sentences.

Returns

ws (TextParagraph) — The word-segmented sentences.

class ckipnlp.driver.tagger.CkipTaggerPosTagger(*, lazy=False, disable_cuda=True, **opts)[source]

Bases: ckipnlp.driver.base.BaseDriver

The CKIP part-of-speech tagging driver with CkipTagger backend.

Parameters
  • lazy (bool) – Lazy-initialize the underlying object.

  • disable_cuda (bool) – Disable GPU usage.

Other Parameters

**opts – Extra options for ckiptagger.POS.__call__(). (See https://github.com/ckiplab/ckiptagger#4-run-the-ws-pos-ner-pipeline for details.)

__call__(*, ws)

Apply part-of-speech tagging.

Parameters

ws (TextParagraph) — The word-segmented sentences.

Returns

pos (TextParagraph) — The part-of-speech sentences.

class ckipnlp.driver.tagger.CkipTaggerNerChunker(*, lazy=False, disable_cuda=True, **opts)[source]

Bases: ckipnlp.driver.base.BaseDriver

The CKIP named-entity recognition driver with CkipTagger backend.

Parameters
  • lazy (bool) – Lazy-initialize the underlying object.

  • disable_cuda (bool) – Disable GPU usage.

Other Parameters

**opts – Extra options for ckiptagger.NER.__call__(). (See https://github.com/ckiplab/ckiptagger#4-run-the-ws-pos-ner-pipeline for details.)

__call__(*, ws, pos)

Apply named-entity recognition.

Parameters
  • ws (TextParagraph) — The word-segmented sentences.

  • pos (TextParagraph) — The part-of-speech sentences.

Returns

ner (NerParagraph) — The named-entity recognition results.

ckipnlp.pipeline package

This module implements CKIPNLP pipelines.

Submodules

ckipnlp.pipeline.core module

This module provides core CKIPNLP pipeline.

class ckipnlp.pipeline.core.CkipDocument(*, raw=None, text=None, ws=None, pos=None, ner=None, parsed=None)[source]

Bases: collections.abc.Mapping

The core document.

Variables
  • raw (str) – The raw text.

  • text (TextParagraph) – The sentences.

  • ws (SegParagraph) – The word-segmented sentences.

  • pos (SegParagraph) – The part-of-speech sentences.

  • ner (NerParagraph) – The named-entity recognition results.

  • parsed (ParsedParagraph) – The parsed sentences.

class ckipnlp.pipeline.core.CkipPipeline(*, sentence_segmenter=<DriverFamily.BUILTIN: 1>, word_segmenter=<DriverFamily.TAGGER: 2>, pos_tagger=<DriverFamily.TAGGER: 2>, sentence_parser=<DriverFamily.CLASSIC: 3>, ner_chunker=<DriverFamily.TAGGER: 2>, lazy=True, opts={})[source]

Bases: object

The core pipeline.

Parameters
  • sentence_segmenter (DriverFamily) – The type of sentence segmenter.

  • word_segmenter (DriverFamily) – The type of word segmenter.

  • pos_tagger (DriverFamily) – The type of part-of-speech tagger.

  • ner_chunker (DriverFamily) – The type of named-entity recognition chunker.

  • sentence_parser (DriverFamily) – The type of sentence parser.

Other Parameters
  • lazy (bool) – Lazy initialize the drivers.

  • opts (Dict[str, Dict]) – The driver options. Key: a driver name (e.g. 'sentence_segmenter'); value: a dictionary of options passed to that driver's constructor.
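The opts mapping is keyed by driver name, with each value forwarded as keyword arguments when that driver is constructed. A hedged sketch of this dispatch (make_driver_kwargs is a hypothetical helper for illustration, not part of the API):

```python
def make_driver_kwargs(name, *, lazy=True, opts={}):
    """Merge the shared lazy flag with the per-driver options for `name`."""
    kwargs = {'lazy': lazy}
    kwargs.update(opts.get(name, {}))  # unknown driver names contribute nothing
    return kwargs

opts = {
    'sentence_segmenter': {'keep_delims': True},
    'word_segmenter': {'disable_cuda': False},
}
print(make_driver_kwargs('sentence_segmenter', opts=opts))
# {'lazy': True, 'keep_delims': True}
```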

get_text(doc)[source]

Apply sentence segmentation.

Parameters

doc (CkipDocument) – The input document.

Returns

doc.text (TextParagraph) – The sentences.

Note

This routine modifies doc in-place.

get_ws(doc)[source]

Apply word segmentation.

Parameters

doc (CkipDocument) – The input document.

Returns

doc.ws (SegParagraph) – The word-segmented sentences.

Note

This routine modifies doc in-place.

get_pos(doc)[source]

Apply part-of-speech tagging.

Parameters

doc (CkipDocument) – The input document.

Returns

doc.pos (SegParagraph) – The part-of-speech sentences.

Note

This routine modifies doc in-place.

get_ner(doc)[source]

Apply named-entity recognition.

Parameters

doc (CkipDocument) – The input document.

Returns

doc.ner (NerParagraph) – The named-entity recognition results.

Note

This routine modifies doc in-place.

get_parsed(doc)[source]

Apply sentence parsing.

Parameters

doc (CkipDocument) – The input document.

Returns

doc.parsed (ParsedParagraph) – The parsed sentences.

Note

This routine modifies doc in-place.
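The getters above resolve their own dependencies: calling get_ner() on a raw-text-only document first runs sentence segmentation, word segmentation, and POS tagging. A toy sketch of that caching pattern (with the real drivers replaced by trivial stand-ins) might look like:

```python
class ToyPipeline:
    """Each getter fills in missing dependencies first, caching results in doc."""

    def get_text(self, doc):
        if doc.get('text') is None:
            doc['text'] = doc['raw'].split(',')  # stand-in sentence segmenter
        return doc['text']

    def get_ws(self, doc):
        if doc.get('ws') is None:
            self.get_text(doc)
            doc['ws'] = [list(sent) for sent in doc['text']]  # stand-in: one word per character
        return doc['ws']

    def get_pos(self, doc):
        if doc.get('pos') is None:
            self.get_ws(doc)
            doc['pos'] = [['X'] * len(sent) for sent in doc['ws']]  # dummy tags
        return doc['pos']

    def get_ner(self, doc):
        if doc.get('ner') is None:
            self.get_pos(doc)  # pulls in text and ws as well
            doc['ner'] = [[] for _ in doc['ws']]  # dummy NER results
        return doc['ner']

doc = {'raw': '中文字喔,啊哈哈哈'}
ToyPipeline().get_ner(doc)
print(sorted(doc))  # all intermediate results were computed and cached
```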

ckipnlp.pipeline.coref module

This module provides coreference resolution pipeline.

class ckipnlp.pipeline.coref.CkipCorefDocument(*, ws=None, pos=None, parsed=None, coref=None)[source]

Bases: collections.abc.Mapping

The coreference document.

Variables
  • ws (SegParagraph) – The word-segmented sentences.

  • pos (SegParagraph) – The part-of-speech sentences.

  • parsed (ParsedParagraph) – The parsed sentences.

  • coref (CorefParagraph) – The coreference results.

class ckipnlp.pipeline.coref.CkipCorefPipeline(*, coref_chunker=<DriverFamily.BUILTIN: 1>, lazy=True, opts={}, **kwargs)[source]

Bases: ckipnlp.pipeline.core.CkipPipeline

The coreference resolution pipeline.

Parameters
  • sentence_segmenter (DriverFamily) – The type of sentence segmenter.

  • word_segmenter (DriverFamily) – The type of word segmenter.

  • pos_tagger (DriverFamily) – The type of part-of-speech tagger.

  • ner_chunker (DriverFamily) – The type of named-entity recognition chunker.

  • sentence_parser (DriverFamily) – The type of sentence parser.

  • coref_chunker (DriverFamily) – The type of coreference resolution chunker.

Other Parameters
  • lazy (bool) – Lazy initialize the drivers.

  • opts (Dict[str, Dict]) – The driver options. Key: driver name (e.g. ‘sentence_segmenter’); Value: a dictionary of options.

__call__(doc)[source]

Apply coreference resolution.

Parameters

doc (CkipDocument) – The input document.

Returns

corefdoc (CkipCorefDocument) – The coreference document.

Note

doc is also modified if the necessary dependencies (ws, pos, ner) have not been computed yet.

get_coref(doc, corefdoc)[source]

Apply coreference resolution.

Parameters
  • doc (CkipDocument) – The input document.

  • corefdoc (CkipCorefDocument) – The output coreference document.

Returns

corefdoc.coref (CorefParagraph) – The coreference results.

Note

This routine modifies corefdoc in-place.

doc is also modified if the necessary dependencies (ws, pos, ner) have not been computed yet.

ckipnlp.util package

This module implements extra utilities for CKIPNLP.

Submodules

ckipnlp.util.data module

This module implements data loading utilities for CKIPNLP.

ckipnlp.util.data.get_tagger_data()

Get CkipTagger data directory.

ckipnlp.util.data.install_tagger_data(src_dir, *, copy=False)

Link/Copy CkipTagger data directory.

ckipnlp.util.data.download_tagger_data()

Download CkipTagger data directory.

ckipnlp.util.logger module

This module implements logging utilities for CKIPNLP.

ckipnlp.util.logger.get_logger()[source]

Get the CKIPNLP logger.
