CKIP CoreNLP¶
Introduction¶
Official CKIP CoreNLP Toolkits¶
Features¶
Sentence Segmentation
Word Segmentation
Part-of-Speech Tagging
Named-Entity Recognition
Sentence Parsing
Co-Reference Resolution
External Links¶
Installation¶
Requirements¶
Python 3.6+
TreeLib 1.5+
CkipTagger 0.1.1+ [Optional, Recommended]
CkipClassic 1.0+ [Optional]
Driver Requirements¶
Driver | Built-in | CkipTagger | CkipClassic
---|---|---|---
Sentence Segmentation | ✔ | |
Word Segmentation† | | ✔ | ✔
Part-of-Speech Tagging† | | ✔ | ✔
Sentence Parsing | | | ✔
Named-Entity Recognition | | ✔ |
Co-Reference Resolution‡ | ✔ | ✔ | ✔

† These drivers require only one of the two backends.
‡ The co-reference implementation itself requires no backend, but it depends on the results of word segmentation, part-of-speech tagging, sentence parsing, and named-entity recognition.
Installation via Pip¶
No backend (not recommended):
pip install ckipnlp
With CkipTagger backend (recommended):
pip install ckipnlp[tagger]
With CkipClassic backend: please refer to https://ckip-classic.readthedocs.io/en/latest/main/readme.html#installation for the CkipClassic installation guide.
Usage¶
See https://ckipnlp.readthedocs.io/en/latest/main/usage.html for Usage.
See https://ckipnlp.readthedocs.io/en/latest/_api/ckipnlp.html for API details.
Usage¶
Pipelines¶
Core Pipeline¶
The CkipPipeline connects the drivers for sentence segmentation, word segmentation, part-of-speech tagging, named-entity recognition, and sentence parsing.
The CkipDocument is the workspace of CkipPipeline, holding its input/output data. Note that CkipPipeline stores its results into CkipDocument in-place.
The CkipPipeline computes all necessary dependencies automatically. For example, calling get_ner() with only raw-text input makes the pipeline call get_text(), get_ws(), and get_pos() first.
from ckipnlp.pipeline import CkipPipeline, CkipDocument

pipeline = CkipPipeline()
doc = CkipDocument(raw='中文字喔,啊哈哈哈')

# Word Segmentation
pipeline.get_ws(doc)
print(doc.ws)
for line in doc.ws:
    print(line.to_text())

# Part-of-Speech Tagging
pipeline.get_pos(doc)
print(doc.pos)
for line in doc.pos:
    print(line.to_text())

# Named-Entity Recognition
pipeline.get_ner(doc)
print(doc.ner)

# Sentence Parsing
pipeline.get_parsed(doc)
print(doc.parsed)

################################################################

from ckipnlp.container.util.wspos import WsPosParagraph

# Word Segmentation & Part-of-Speech Tagging
for line in WsPosParagraph.to_text(doc.ws, doc.pos):
    print(line)
Co-Reference Pipeline¶
The CkipCorefPipeline is an extension of CkipPipeline that adds coreference resolution. The pipeline first performs named-entity recognition, as CkipPipeline does, then applies alignment algorithms to fix the word-segmentation and part-of-speech tagging outputs, and finally performs coreference resolution based on the sentence parsing result.
The CkipCorefDocument is the workspace of CkipCorefPipeline, holding its input/output data. Note that CkipCorefPipeline stores its results into CkipCorefDocument.
from ckipnlp.pipeline import CkipCorefPipeline, CkipDocument

pipeline = CkipCorefPipeline()
doc = CkipDocument(raw='畢卡索他想,完蛋了')

# Co-Reference
corefdoc = pipeline(doc)
print(corefdoc.coref)
for line in corefdoc.coref:
    print(line.to_text())
Drivers¶
CkipNLP provides several alternative drivers for the above two pipelines. Here is the list of drivers:

 | Built-in | CkipTagger | CkipClassic
---|---|---|---
SENTENCE_SEGMENTER | CkipSentenceSegmenter | |
WORD_SEGMENTER | | CkipTaggerWordSegmenter | CkipClassicWordSegmenter†
POS_TAGGER | | CkipTaggerPosTagger | CkipClassicWordSegmenter†
NER_CHUNKER | | CkipTaggerNerChunker |
SENTENCE_PARSER | | | CkipClassicSentenceParser
COREF_CHUNKER | CkipCorefChunker | |

† Not compatible with CkipCorefPipeline.
Containers¶
The container objects provide the following methods:

- from_text(), to_text() for plain-text format conversions;
- from_dict(), to_dict() for dictionary-like format conversions;
- from_list(), to_list() for list-like format conversions;
- from_json(), to_json() for JSON format conversions (built on the dictionary-like format conversions).

The following are the interfaces, where CONTAINER_CLASS refers to the container class.
obj = CONTAINER_CLASS.from_text(plain_text)
plain_text = obj.to_text()
obj = CONTAINER_CLASS.from_dict({ key: value })
dict_obj = obj.to_dict()
obj = CONTAINER_CLASS.from_list([ value1, value2 ])
list_obj = obj.to_list()
obj = CONTAINER_CLASS.from_json(json_str)
json_str = obj.to_json()
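As a rough, self-contained sketch of this interface (a hypothetical DemoSentence class, not part of ckipnlp), note in particular that the JSON methods are thin wrappers over the dictionary-like methods:

```python
import json
from typing import List


class DemoSentence:
    """A hypothetical word-list container mimicking the ckipnlp interface."""

    def __init__(self, words: List[str]):
        self.words = list(words)

    # Plain-text format: words joined by U+3000 (full-width space)
    @classmethod
    def from_text(cls, text: str) -> 'DemoSentence':
        return cls(text.split('\u3000'))

    def to_text(self) -> str:
        return '\u3000'.join(self.words)

    # Dict / list formats (identical here, since this container is a plain list)
    @classmethod
    def from_dict(cls, data) -> 'DemoSentence':
        return cls(data)

    def to_dict(self):
        return list(self.words)

    from_list = from_dict
    to_list = to_dict

    # JSON format is built on top of the dict format
    @classmethod
    def from_json(cls, data: str, **kwargs) -> 'DemoSentence':
        return cls.from_dict(json.loads(data, **kwargs))

    def to_json(self, **kwargs) -> str:
        return json.dumps(self.to_dict(), ensure_ascii=False, **kwargs)
```

This mirrors the documented contract only in shape; the real containers add validation and richer item types.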
Note that not all containers provide all of the above methods. Here is the table of implemented methods. Please refer to the documentation of each container for detailed formats.

Container | Item | from/to text | from/to dict, list, json
---|---|---|---
TextParagraph | str | ✔ | ✔
SegSentence | str | ✔ | ✔
SegParagraph | SegSentence | ✔ | ✔
NerToken | | ✘ | ✔
NerSentence | NerToken | ✘ | ✔
NerParagraph | NerSentence | ✘ | ✔
ParsedParagraph | str | ✔ | ✔
CorefToken | | only to | ✔
CorefSentence | CorefToken | only to | ✔
CorefParagraph | CorefSentence | only to | ✔
WS with POS¶
There are also conversion routines for word-segmentation and POS containers jointly. For example, WsPosToken provides routines for a word (str) with its POS-tag (str):
ws_obj, pos_obj = WsPosToken.from_text('中文字(Na)')
plain_text = WsPosToken.to_text(ws_obj, pos_obj)
ws_obj, pos_obj = WsPosToken.from_dict({ 'word': '中文字', 'pos': 'Na', })
dict_obj = WsPosToken.to_dict(ws_obj, pos_obj)
ws_obj, pos_obj = WsPosToken.from_list([ '中文字', 'Na' ])
list_obj = WsPosToken.to_list(ws_obj, pos_obj)
ws_obj, pos_obj = WsPosToken.from_json(json_str)
json_str = WsPosToken.to_json(ws_obj, pos_obj)
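The '中文字(Na)' text convention can be sketched with a small helper (an illustrative stand-in for the text-format rule only, not the library's actual parser; it assumes the word itself contains no parentheses):

```python
import re
from typing import Tuple

# A token in text format: the word followed by its POS-tag in parentheses.
_WSPOS = re.compile(r'^(?P<word>.*)\((?P<pos>[^()]*)\)$')


def parse_wspos(text: str) -> Tuple[str, str]:
    """Split a token such as '中文字(Na)' into ('中文字', 'Na')."""
    match = _WSPOS.match(text)
    if match is None:
        raise ValueError(f'not a word(POS) token: {text!r}')
    return match.group('word'), match.group('pos')


def format_wspos(word: str, pos: str) -> str:
    """Join ('中文字', 'Na') back into '中文字(Na)'."""
    return f'{word}({pos})'
```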
Similarly, WsPosSentence/WsPosParagraph provide routines for word-segmented and POS sentences/paragraphs (SegSentence/SegParagraph) respectively.
Parsed Tree¶
In addition to ParsedParagraph, we have implemented tree utilities based on TreeLib.
ParsedTree is the tree structure of a parsed sentence. One may use from_text() and to_text() for plain-text conversion; from_dict() and to_dict() for dictionary-like object conversion; and from_json() and to_json() for JSON string conversion.
The ParsedTree is a TreeLib tree with ParsedNode as its nodes. The data of these nodes is stored in a ParsedNodeData (accessed by node.data), which is a tuple of role (the semantic role), pos (the POS-tag), and word (the text term).
ParsedTree provides useful methods: get_heads() finds the head words of the sentence; get_relations() extracts all relations in the sentence; get_subjects() returns the subjects of the sentence.
from ckipnlp.container import ParsedTree
# 我的早餐、午餐和晚餐都在那場比賽中被吃掉了
tree_text = 'S(goal:NP(possessor:N‧的(head:Nhaa:我|Head:DE:的)|Head:Nab(DUMMY1:Nab(DUMMY1:Nab:早餐|Head:Caa:、|DUMMY2:Naa:午餐)|Head:Caa:和|DUMMY2:Nab:晚餐))|quantity:Dab:都|condition:PP(Head:P21:在|DUMMY:GP(DUMMY:NP(Head:Nac:比賽)|Head:Ng:中))|agent:PP(Head:P02:被)|Head:VC31:吃掉|aspect:Di:了)'
tree = ParsedTree.from_text(tree_text, normalize=False)
print('Show Tree')
tree.show()
print('Get Heads of {}'.format(tree[5]))
print('-- Semantic --')
for head in tree.get_heads(5, semantic=True): print(repr(head))
print('-- Syntactic --')
for head in tree.get_heads(5, semantic=False): print(repr(head))
print()
print('Get Relations of {}'.format(tree[0]))
print('-- Semantic --')
for rel in tree.get_relations(0, semantic=True): print(repr(rel))
print('-- Syntactic --')
for rel in tree.get_relations(0, semantic=False): print(repr(rel))
print()
# 我和食物真的都很不開心
tree_text = 'S(theme:NP(DUMMY1:NP(Head:Nhaa:我)|Head:Caa:和|DUMMY2:NP(Head:Naa:食物))|evaluation:Dbb:真的|quantity:Dab:都|degree:Dfa:很|negation:Dc:不|Head:VH21:開心)'
tree = ParsedTree.from_text(tree_text, normalize=False)
print('Show Tree')
tree.show()
print('Get Subjects of {}'.format(tree[0]))
print('-- Semantic --')
for subject in tree.get_subjects(0, semantic=True): print(repr(subject))
print('-- Syntactic --')
for subject in tree.get_subjects(0, semantic=False): print(repr(subject))
print()
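To make the parenthesis notation concrete, here is a self-contained sketch that parses it into nested dictionaries. This is illustrative only (ParsedTree.from_text() is the real implementation); it assumes the role:pos:word and `pos(child|child)` conventions shown above and that words contain no parentheses:

```python
def split_top(body):
    """Split children on '|' at parenthesis depth 0."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(body):
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
        elif ch == '|' and depth == 0:
            parts.append(body[start:i])
            start = i + 1
    parts.append(body[start:])
    return parts


def parse_tree(text):
    """Parse 'S(Head:Nab:中文字|particle:Td:耶)' into nested dicts
    with keys 'role', 'pos', 'word', 'children'."""
    if text.endswith(')'):
        # Internal node: head label, then children inside the parentheses.
        head, _, rest = text.partition('(')
        children = [parse_tree(part) for part in split_top(rest[:-1])]
    else:
        head, children = text, []
    fields = head.split(':')
    if children:  # internal node: optional role, then POS; no word
        role = fields[0] if len(fields) > 1 else None
        pos, word = fields[-1], None
    else:         # leaf node: role:POS:word
        role, pos, word = (fields + [None, None, None])[:3]
    return {'role': role, 'pos': pos, 'word': word, 'children': children}
```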
Tables of Tags¶
Part-of-Speech Tags¶
Tag |
Description |
---|---|
A |
非謂形容詞 |
Caa |
對等連接詞 |
Cab |
連接詞,如:等等 |
Cba |
連接詞,如:的話 |
Cbb |
關聯連接詞 |
D |
副詞 |
Da |
數量副詞 |
Dfa |
動詞前程度副詞 |
Dfb |
動詞後程度副詞 |
Di |
時態標記 |
Dk |
句副詞 |
DM |
定量式 |
I |
感嘆詞 |
Na |
普通名詞 |
Nb |
專有名詞 |
Nc |
地方詞 |
Ncd |
位置詞 |
Nd |
時間詞 |
Nep |
指代定詞 |
Neqa |
數量定詞 |
Neqb |
後置數量定詞 |
Nes |
特指定詞 |
Neu |
數詞定詞 |
Nf |
量詞 |
Ng |
後置詞 |
Nh |
代名詞 |
Nv |
名物化動詞 |
P |
介詞 |
T |
語助詞 |
VA |
動作不及物動詞 |
VAC |
動作使動動詞 |
VB |
動作類及物動詞 |
VC |
動作及物動詞 |
VCL |
動作接地方賓語動詞 |
VD |
雙賓動詞 |
VF |
動作謂賓動詞 |
VE |
動作句賓動詞 |
VG |
分類動詞 |
VH |
狀態不及物動詞 |
VHC |
狀態使動動詞 |
VI |
狀態類及物動詞 |
VJ |
狀態及物動詞 |
VK |
狀態句賓動詞 |
VL |
狀態謂賓動詞 |
V_2 |
有 |
DE |
的之得地 |
SHI |
是 |
FW |
外文 |
COLONCATEGORY |
冒號 |
COMMACATEGORY |
逗號 |
DASHCATEGORY |
破折號 |
DOTCATEGORY |
點號 |
ETCCATEGORY |
刪節號 |
EXCLAMATIONCATEGORY |
驚嘆號 |
PARENTHESISCATEGORY |
括號 |
PAUSECATEGORY |
頓號 |
PERIODCATEGORY |
句號 |
QUESTIONCATEGORY |
問號 |
SEMICOLONCATEGORY |
分號 |
SPCHANGECATEGORY |
雙直線 |
WHITESPACE |
空白 |
Parsing Tree Tags¶
Tag |
Description |
---|---|
S |
表示結構樹為句子,以述詞為中心語,此外當主詞和述詞的賓語或補語的型式為句子或子句的時候,詞組結構標記為S,不為NP。 |
VP |
述詞詞組,中心語為述詞(V)。 |
NP |
名詞詞組,中心語為名詞(N)。 |
GP |
方位詞詞組,中心語為方位詞(Ng),所帶論元角色為DUMMY1。 |
PP |
介詞詞組,中心語為介詞(P),所帶論元角色亦為DUMMY。 |
XP |
連接詞詞組,中心語為連接詞(C),X代表一個變數,XP的真正詞類由連接成分決定,例如:連接成分為述詞詞組(VP),則為述詞詞組(VP),連接成分為名詞詞組,則為名詞詞組(NP)。 |
DM |
定量詞詞組。 |
Parsing Tree Roles¶
Role |
Description |
---|---|
#修飾物體名詞 |
|
apposition |
表物體的同位語,即指涉相同的物體。 |
possessor |
表物體的領屬者,包含成員、創造者、擁有者和整體等皆為領屬者。 |
predication |
表修飾物體的相關事件,為名詞的關係子句,與事件中心語有論元關係。 |
property |
表物體的特色和性質,也包含物體相關的時空訊息,是一個較上位而粗略的語意角色。 |
quantifier |
表名詞的數量修飾語,為數量定詞、定量詞等等。 |
#修飾事件動詞–事件參與者角色 |
|
agent |
表事件中的肇始者,動作動詞的行動者。 |
benefactor |
表受益的對象,但非主要賓語。 |
causer |
表事件的肇始者,但肇始者並未主動促使事件發生。 |
companion |
表主語的隨同對象。 |
comparison |
表比較的對象,多在比較句中出現。 |
experiencer |
表感受所敘述的情緒感知狀況的主事者,為心靈類述語的主語。 |
goal |
表動作影響的對象,或者為心靈動作的受事對象,在有物件轉移的事件中則是個接受者或終點。 |
range |
表分類的範疇或結果的幅度。為分類動詞及比較句的主要語意角色。 |
source |
表物件轉移的起點。 |
target |
述詞內容表達的對象或是轉移的方向。 |
theme |
表靜態及分類述詞敘述的對象或動態事件中描述存在或位移的主事者,以及因事件動作造成物體的狀態從無到有的受事者,皆使用這個語意角色。 |
topic |
表事件所論述的主題。 |
#修飾事件動詞–事件附加的角色 |
|
aspect |
表動作的時貌。 |
degree |
表狀態的程度。 |
deixis |
表動作附加的指示成分。 |
deontics |
表說話者對事件是否成真的態度,標示於此類型的法相副詞。 |
duration |
表事件持續的時間長度。 |
evaluation |
表評價的語氣成分。 |
epistemics |
表說話者對事件是否為真的猜測,標示於此類型的法相副詞。 |
frequency |
表事件的頻率。 |
instrument |
表動作時所使用的工具。 |
interjection |
表句中感嘆詞的角色。 |
location |
表事件發生的地點。 |
manner |
表主語的動作方式。 |
negation |
表否定。 |
particle |
表句尾說話者的語氣。 |
quantity |
表事物的數量。 |
standard |
表憑據。 |
time |
表事件發生的時間。 |
#修飾事件動詞–從屬關係的語意角色 |
|
addition |
表附加。 |
alternative |
表聯合複句中選擇的口氣。 |
avoidance |
表應避免的情況。 |
complement |
表補充說明,進一步補充前一事件內容。 |
conclusion |
表引介出的結論。 |
condition |
表條件語氣的句子或是事件狀況。 |
concession |
表讓步語氣的連接。 |
contrast |
表轉折語氣。 |
conversion |
表引出轉變條件下的結果。 |
exclusion |
表屏除的對象。 |
hypothesis |
表假設的語氣。 |
listing |
表條列的項目。 |
purpose |
表目的。 |
reason |
表事件的原因。 |
rejection |
表取捨關係中的應捨部分。 |
result |
表事件的結果。 |
restriction |
表遞進語氣的前半部。 |
selection |
表取捨關係中的應取部分。 |
uncondition |
表與現況不符的假設。 |
whatever |
表不論何種條件。 |
#標記語法功能 |
|
DUMMY |
表未定的角色,需要靠其上位詞組的中心語才能決定。 |
DUMMY1 |
表未定的角色,需要靠其上位詞組的中心語才能決定。 |
DUMMY2 |
表未定的角色,需要靠其上位詞組的中心語才能決定。 |
Head |
表語法的中心語,通常也是語意的中心成分,句子或詞組皆有Head這個角色。 |
head |
在「的」的結構裡,語意和語法的中心語不同時,表示為語意的中心語成分,以別於語法的中心語。 |
nominal |
表名物化結構,用來標示中心語為名物化動詞的名詞短語中的「的」。 |
ckipnlp package¶
The Official CKIP CoreNLP Toolkits.
Subpackages
ckipnlp.container package¶
This module implements specialized container datatypes for CKIPNLP.
Subpackages
ckipnlp.container.util package¶
This module implements specialized utilities for CKIPNLP containers.
Submodules
ckipnlp.container.util.parsed_tree module¶
This module provides tree containers for sentence parsing.
-
class
ckipnlp.container.util.parsed_tree.
ParsedNodeData
[source]¶ Bases:
ckipnlp.container.base.BaseTuple
,ckipnlp.container.util.parsed_tree._ParsedNodeData
A parser node.
- Variables
role (str) – the semantic role.
pos (str) – the POS-tag.
word (str) – the text term.
Note
This class is a subclass of
tuple
. To change an attribute, please create a new instance instead.
Data Structure Examples
- Text format
Used for
from_text()
andto_text()
.'Head:Na:中文字' # role / POS-tag / text-term
- Dict format
Used for
from_dict()
andto_dict()
.{ 'role': 'Head', # role 'pos': 'Na', # POS-tag 'word': '中文字', # text term }
- List format
Not implemented.
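Because the class is a tuple subclass, attribute assignment fails and a new instance must be built instead. A hypothetical stand-in based on typing.NamedTuple (not the actual ParsedNodeData implementation) illustrates the pattern:

```python
from typing import NamedTuple, Optional


class NodeData(NamedTuple):
    """A hypothetical stand-in for ParsedNodeData: role / POS-tag / word."""
    role: Optional[str] = None
    pos: Optional[str] = None
    word: Optional[str] = None


data = NodeData(role='Head', pos='Na', word='中文字')

# Attribute assignment raises AttributeError on a tuple subclass;
# build a modified copy with _replace() instead.
updated = data._replace(word='喔')
```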
-
class
ckipnlp.container.util.parsed_tree.
ParsedNode
(tag=None, identifier=None, expanded=True, data=None)[source]¶ Bases:
ckipnlp.container.base.Base
,treelib.node.Node
A parser node for tree.
- Variables
data (
ParsedNodeData
) –
See also
treelib.node.Node
Please refer to https://treelib.readthedocs.io/ for built-in usage.
Data Structure Examples
- Text format
Not implemented.
- Dict format
Used for
to_dict()
.{ 'role': 'Head', # role 'pos': 'Na', # POS-tag 'word': '中文字', # text term }
- List format
Not implemented.
-
data_class
¶ alias of
ParsedNodeData
-
class
ckipnlp.container.util.parsed_tree.
ParsedRelation
[source]¶ Bases:
ckipnlp.container.base.Base
,ckipnlp.container.util.parsed_tree._ParsedRelation
A parser relation.
- Variables
head (
ParsedNode
) – the head node.tail (
ParsedNode
) – the tail node.relation (
ParsedNode
) – the relation node. (the semantic role of this node is the relation.)
Notes
The parent of the relation node is always the common ancestor of the head node and tail node.
Data Structure Examples
- Text format
Not implemented.
- Dict format
Used for
to_dict()
.{ 'head': { 'role': 'Head', 'pos': 'Nab', 'word': '中文字' }, # head node 'tail': { 'role': 'particle', 'pos': 'Td', 'word': '耶' }, # tail node 'relation': 'particle', # relation }
- List format
Not implemented.
-
class
ckipnlp.container.util.parsed_tree.
ParsedTree
(tree=None, deep=False, node_class=None, identifier=None)[source]¶ Bases:
ckipnlp.container.base.Base
,treelib.tree.Tree
A parsed tree.
See also
treelib.tree.Tree
Please refer to https://treelib.readthedocs.io/ for built-in usage.
Data Structure Examples
- Text format
Used for
from_text()
andto_text()
.'S(Head:Nab:中文字|particle:Td:耶)'
- Dict format
Used for
from_dict()
andto_dict()
. A dictionary such as{ 'id': 0, 'data': { ... }, 'children': [ ... ] }
, where'data'
is a dictionary with the same format asParsedNodeData.to_dict()
, and'children'
is a list of dictionaries of subtrees with the same format as this tree.{ 'id': 0, 'data': { 'role': None, 'pos': 'S', 'word': None, }, 'children': [ { 'id': 1, 'data': { 'role': 'Head', 'pos': 'Nab', 'word': '中文字', }, 'children': [], }, { 'id': 2, 'data': { 'role': 'particle', 'pos': 'Td', 'word': '耶', }, 'children': [], }, ], }
- List format
Not implemented.
-
node_class
¶ alias of
ParsedNode
-
classmethod
from_text
(data, *, normalize=True)[source]¶ Construct an instance from text format.
- Parameters
data (str) – A parsed tree in text format.
normalize (bool) – Do text normalization using
normalize_text()
.
-
to_text
(node_id=None)[source]¶ Transform to plain text.
- Parameters
node_id (int) – Output the plain text format for the subtree under node_id.
- Returns
str
-
classmethod
from_dict
(data)[source]¶ Construct an instance from python built-in containers.
- Parameters
data (dict) – A parsed tree in dictionary format.
-
to_dict
(node_id=None)[source]¶ Transform to python built-in containers.
- Parameters
node_id (int) – Output the dictionary format for the subtree under node_id.
- Returns
dict
-
get_children
(node_id, *, role)[source]¶ Get children of a node with given role.
- Parameters
node_id (int) – ID of target node.
role (str) – the target role.
- Yields
ParsedNode
– the children nodes with given role.
-
get_heads
(root_id=None, *, semantic=True, deep=True)[source]¶ Get all head nodes of a subtree.
- Parameters
root_id (int) – ID of the root node of target subtree.
semantic (bool) – use semantic/syntactic policy. For semantic mode, return
DUMMY
orhead
instead of syntacticHead
.deep (bool) – find heads recursively.
- Yields
ParsedNode
– the head nodes.
-
get_relations
(root_id=None, *, semantic=True)[source]¶ Get all relations of a subtree.
- Parameters
root_id (int) – ID of the subtree root node.
semantic (bool) – please refer to
get_heads()
for policy detail.
- Yields
ParsedRelation
– the relations.
-
get_subjects
(root_id=None, *, semantic=True, deep=True)[source]¶ Get the subject node of a subtree.
- Parameters
root_id (int) – ID of the root node of target subtree.
semantic (bool) – please refer to
get_heads()
for policy detail. deep (bool) – please refer to
get_heads()
for policy detail.
- Yields
ParsedNode
– the subject node.
Notes
A node can be a subject if it satisfies one of the following:
- it is the head of an NP;
- it is the head of a subnode (N) of S with a subject role;
- it is the head of a subnode (N) of S with a neutral role, appearing before the head (V) of S.
ckipnlp.container.util.wspos module¶
This module provides containers for word-segmented sentences with part-of-speech-tags.
-
class
ckipnlp.container.util.wspos.
WsPosToken
[source]¶ Bases:
ckipnlp.container.base.BaseTuple
,ckipnlp.container.util.wspos._WsPosToken
A word with POS-tag.
- Variables
word (str) – the word.
pos (str) – the POS-tag.
Note
This class is a subclass of tuple. To change an attribute, please create a new instance instead.
Data Structure Examples
- Text format
Used for
from_text()
andto_text()
.'中文字(Na)' # word / POS-tag
- Dict format
Used for
from_dict()
andto_dict()
.{ 'word': '中文字', # word 'pos': 'Na', # POS-tag }
- List format
Used for
from_list()
andto_list()
.[ '中文字', # word 'Na', # POS-tag ]
-
class
ckipnlp.container.util.wspos.
WsPosSentence
[source]¶ Bases:
object
A helper class for data conversion of word-segmented and part-of-speech sentences.
-
classmethod
from_text
(data)[source]¶ Convert text format to word-segmented and part-of-speech sentences.
- Parameters
data (str) – text such as
'中文字(Na)\u3000喔(T)'
.- Returns
SegSentence
– the word sentenceSegSentence
– the POS-tag sentence.
-
static
to_text
(word, pos)[source]¶ Convert word-segmented and part-of-speech sentences to text format.
- Parameters
word (
SegSentence
) – the word sentencepos (
SegSentence
) – the POS-tag sentence.
- Returns
str – text such as
'中文字(Na)\u3000喔(T)'
.
-
class
ckipnlp.container.util.wspos.
WsPosParagraph
[source]¶ Bases:
object
A helper class for data conversion of word-segmented and part-of-speech sentence lists.
-
classmethod
from_text
(data)[source]¶ Convert text format to word-segmented and part-of-speech sentence lists.
- Parameters
data (Sequence[str]) – list of sentences such as
'中文字(Na)\u3000喔(T)'
.- Returns
SegParagraph
– the word sentence listSegParagraph
– the POS-tag sentence list.
-
static
to_text
(word, pos)[source]¶ Convert word-segmented and part-of-speech sentence lists to text format.
- Parameters
word (
SegParagraph
) – the word sentence listpos (
SegParagraph
) – the POS-tag sentence list.
- Returns
List[str] – list of sentences such as
'中文字(Na)\u3000喔(T)'
.
Submodules
ckipnlp.container.base module¶
This module provides base containers.
-
class
ckipnlp.container.base.
Base
[source]¶ Bases:
object
The base CKIPNLP container.
-
abstract classmethod
from_text
(data)[source]¶ Construct an instance from text format.
- Parameters
data (str) –
-
abstract classmethod
from_dict
(data)[source]¶ Construct an instance from python built-in containers.
-
abstract classmethod
from_list
(data)[source]¶ Construct an instance from python built-in containers.
-
classmethod
from_json
(data, **kwargs)[source]¶ Construct an instance from JSON format.
- Parameters
data (str) – please refer to
from_dict()
for format details.
-
class
ckipnlp.container.base.
BaseTuple
[source]¶ Bases:
ckipnlp.container.base.Base
The base CKIPNLP tuple.
-
classmethod
from_dict
(data)[source]¶ Construct an instance from python built-in containers.
- Parameters
data (dict) –
-
class
ckipnlp.container.base.
BaseList
(initlist=None)[source]¶ Bases:
ckipnlp.container.base._BaseList
,ckipnlp.container.base._InterfaceItem
The base CKIPNLP list.
-
item_class
= Not Implemented¶ Must be a CKIPNLP container class.
-
-
class
ckipnlp.container.base.
BaseList0
(initlist=None)[source]¶ Bases:
ckipnlp.container.base._BaseList
,ckipnlp.container.base._InterfaceBuiltInItem
The base CKIPNLP list with built-in item class.
-
item_class
= Not Implemented¶ Must be a built-in type.
-
ckipnlp.container.coref module¶
This module provides containers for coreference sentences.
-
class
ckipnlp.container.coref.
CorefToken
[source]¶ Bases:
ckipnlp.container.base.BaseTuple
,ckipnlp.container.coref._CorefToken
A coreference token.
- Variables
word (str) – the token word.
coref (Tuple[int, str]) –
the coreference ID and type. None if not a coreference source or target.
- type:
'source': coreference source.
'target': coreference target.
'zero': null element coreference target.
idx (int) – the node index in parsed tree.
Note
This class is a subclass of
tuple
. To change an attribute, please create a new instance instead.
Data Structure Examples
- Text format
Used for
to_text()
.'畢卡索_0'
- Dict format
Used for
from_dict()
andto_dict()
.{ 'word': '畢卡索', # token word 'coref': (0, 'source'), # coref ID and type 'idx': 2, # node index }
- List format
Used for
from_list()
andto_list()
.[ '畢卡索', # token word (0, 'source'), # coref ID and type 2, # node index ]
-
class
ckipnlp.container.coref.
CorefSentence
(initlist=None)[source]¶ Bases:
ckipnlp.container.base.BaseSentence
A coreference sentence (a list of coreference tokens).
Data Structure Examples
- Text format
Used for
to_text()
.'畢卡索_0 他_0 想' # Token segmented by \u3000 (full-width space)
- Dict format
Used for
from_dict()
andto_dict()
.[ { 'word': '畢卡索', 'coref': (0, 'source'), 'idx': 2, }, # coref-token 1 { 'word': '他', 'coref': (0, 'target'), 'idx': 3, }, # coref-token 2 { 'word': '想', 'coref': None, 'idx': 4, }, # coref-token 3 ]
- List format
Used for
from_list()
andto_list()
.[ [ '畢卡索', (0, 'source'), 2, ], # coref-token 1 [ '他', (0, 'target'), 3, ], # coref-token 2 [ '想', None, 4, ], # coref-token 3 ]
-
item_class
¶ alias of
CorefToken
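The '畢卡索_0' display convention (a word with its coreference ID attached by an underscore, tokens separated by U+3000) can be sketched with a small parser. This is illustrative only; the real containers also carry the coreference type and node index:

```python
from typing import List, Optional, Tuple


def parse_coref_text(line: str) -> List[Tuple[str, Optional[int]]]:
    """Parse '畢卡索_0　他_0　想' into (word, coref_id) pairs;
    tokens without an '_N' suffix get coref_id None."""
    tokens = []
    for token in line.split('\u3000'):
        word, sep, ref = token.rpartition('_')
        if sep and ref.isdigit():
            tokens.append((word, int(ref)))
        else:
            tokens.append((token, None))
    return tokens
```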
-
class
ckipnlp.container.coref.
CorefParagraph
(initlist=None)[source]¶ Bases:
ckipnlp.container.base.BaseList
A list of coreference sentences.
Data Structure Examples
- Text format
Used for
to_text()
.[ '畢卡索_0 他_0 想', # Sentence 1 'None_0 完蛋 了', # Sentence 2 ]
- Dict format
Used for
from_dict()
andto_dict()
.[ [ # Sentence 1 { 'word': '畢卡索', 'coref': (0, 'source'), 'idx': 2, }, { 'word': '他', 'coref': (0, 'target'), 'idx': 3, }, { 'word': '想', 'coref': None, 'idx': 4, }, ], [ # Sentence 2 { 'word': None, 'coref': (0, 'zero'), 'idx': None, }, { 'word': '完蛋', 'coref': None, 'idx': 1, }, { 'word': '了', 'coref': None, 'idx': 2, }, ], ]
- List format
Used for
from_list()
andto_list()
.[ [ # Sentence 1 [ '畢卡索', (0, 'source'), 2, ], [ '他', (0, 'target'), 3, ], [ '想', None, 4, ], ], [ # Sentence 2 [ None, (0, 'zero'), None, ], [ '完蛋', None, 1, ], [ '了', None, 2, ], ], ]
-
item_class
¶ alias of
CorefSentence
ckipnlp.container.ner module¶
This module provides containers for NER sentences.
-
class
ckipnlp.container.ner.
NerToken
[source]¶ Bases:
ckipnlp.container.base.BaseTuple
,ckipnlp.container.ner._NerToken
A named-entity recognition token.
- Variables
word (str) – the token word.
ner (str) – the NER-tag.
idx (Tuple[int, int]) – the starting / ending index.
Note
This class is a subclass of
tuple
. To change an attribute, please create a new instance instead.
Data Structure Examples
- Text format
Not implemented
- Dict format
Used for
from_dict()
andto_dict()
.{ 'word': '中文字', # token word 'ner': 'LANGUAGE', # NER-tag 'idx': (0, 3), # starting / ending index. }
- List format
Used for
from_list()
andto_list()
.[ '中文字' # token word 'LANGUAGE', # NER-tag (0, 3), # starting / ending index. ]
- CkipTagger format
Used for
from_tagger()
andto_tagger()
.( 0, # starting index 3, # ending index 'LANGUAGE', # NER-tag '中文字', # token word )
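The correspondence between the CkipTagger tuple layout and the dict format can be written out directly. These helpers are hypothetical (they are not from_tagger()/to_tagger() themselves) and assume only the field order shown above:

```python
def tagger_to_dict(entity):
    """Convert a CkipTagger-style tuple (start, end, tag, word)
    into the documented dict layout."""
    start, end, ner, word = entity
    return {'word': word, 'ner': ner, 'idx': (start, end)}


def dict_to_tagger(data):
    """Inverse conversion back to the CkipTagger tuple layout."""
    return (*data['idx'], data['ner'], data['word'])
```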
-
class
ckipnlp.container.ner.
NerSentence
(initlist=None)[source]¶ Bases:
ckipnlp.container.base.BaseSentence
A named-entity recognition sentence.
Data Structure Examples
- Text format
Not implemented
- Dict format
Used for
from_dict()
andto_dict()
.[ { 'word': '美國', 'ner': 'GPE', 'idx': (0, 2), }, # named-entity 1 { 'word': '參議院', 'ner': 'ORG', 'idx': (3, 5), }, # named-entity 2 ]
- List format
Used for
from_list()
andto_list()
.[ [ '美國', 'GPE', (0, 2), ], # named-entity 1 [ '參議院', 'ORG', (3, 5), ], # named-entity 2 ]
- CkipTagger format
Used for
from_tagger()
andto_tagger()
.[ ( 0, 2, 'GPE', '美國', ), # named-entity 1 ( 3, 5, 'ORG', '參議院', ), # named-entity 2 ]
-
class
ckipnlp.container.ner.
NerParagraph
(initlist=None)[source]¶ Bases:
ckipnlp.container.base.BaseList
A list of named-entity recognition sentences.
Data Structure Examples
- Text format
Not implemented
- Dict format
Used for
from_dict()
andto_dict()
.[ [ # Sentence 1 { 'word': '中文字', 'ner': 'LANGUAGE', 'idx': (0, 3), }, ], [ # Sentence 2 { 'word': '美國', 'ner': 'GPE', 'idx': (0, 2), }, { 'word': '參議院', 'ner': 'ORG', 'idx': (3, 5), }, ], ]
- List format
Used for
from_list()
andto_list()
.[ [ # Sentence 1 [ '中文字', 'LANGUAGE', (0, 3), ], ], [ # Sentence 2 [ '美國', 'GPE', (0, 2), ], [ '參議院', 'ORG', (3, 5), ], ], ]
- CkipTagger format
Used for
from_tagger()
andto_tagger()
.[ [ # Sentence 1 ( 0, 3, 'LANGUAGE', '中文字', ), ], [ # Sentence 2 ( 0, 2, 'GPE', '美國', ), ( 3, 5, 'ORG', '參議院', ), ], ]
-
item_class
¶ alias of
NerSentence
ckipnlp.container.parsed module¶
This module provides containers for parsed sentences.
-
class
ckipnlp.container.parsed.
ParsedParagraph
(initlist=None)[source]¶ Bases:
ckipnlp.container.base.BaseList0
A list of parsed sentences.
Data Structure Examples
- Text/Dict/List format
Used for
from_text()
,to_text()
,from_dict()
,to_dict()
,from_list()
, andto_list()
.[ 'S(Head:Nab:中文字|particle:Td:耶)', # Sentence 1 '%(particle:I:啊|manner:Dh:哈|manner:Dh:哈|time:Dh:哈)', # Sentence 2 ]
-
item_class
¶ alias of
builtins.str
ckipnlp.container.seg module¶
This module provides containers for word-segmented sentences.
-
class
ckipnlp.container.seg.
SegSentence
(initlist=None)[source]¶ Bases:
ckipnlp.container.base.BaseSentence0
A word-segmented sentence.
Data Structure Examples
- Text format
Used for
from_text()
andto_text()
.'中文字 喔' # Words segmented by \u3000 (full-width space)
- Dict/List format
Used for
from_dict()
,to_dict()
,from_list()
, andto_list()
.[ '中文字', '喔', ]
Note
This class is also used for part-of-speech tagging.
-
item_class
¶ alias of
builtins.str
-
class
ckipnlp.container.seg.
SegParagraph
(initlist=None)[source]¶ Bases:
ckipnlp.container.base.BaseList
A list of word-segmented sentences.
Data Structure Examples
- Text format
Used for
from_text()
andto_text()
.[ '中文字 喔', # Sentence 1 '啊 哈 哈 哈', # Sentence 2 ]
- Dict/List format
Used for
from_dict()
,to_dict()
,from_list()
, andto_list()
.[ [ '中文字', '喔', ], # Sentence 1 [ '啊', '哈', '哈', '哈', ], # Sentence 2 ]
Note
This class is also used for part-of-speech tagging.
-
item_class
¶ alias of
SegSentence
ckipnlp.container.text module¶
This module provides containers for text sentences.
-
class
ckipnlp.container.text.
TextParagraph
(initlist=None)[source]¶ Bases:
ckipnlp.container.base.BaseList0
A list of text sentences.
Data Structure Examples
- Text/Dict/List format
Used for
from_text()
,to_text()
,from_dict()
,to_dict()
,from_list()
, andto_list()
.[ '中文字喔', # Sentence 1 '啊哈哈哈', # Sentence 2 ]
-
item_class
¶ alias of
builtins.str
ckipnlp.driver package¶
This module implements CKIPNLP drivers.
Submodules
ckipnlp.driver.base module¶
This module provides base drivers.
-
class
ckipnlp.driver.base.
DriverType
[source]¶ Bases:
enum.IntEnum
The enumeration of driver types.
-
SENTENCE_SEGMENTER
= 1¶ Sentence segmentation
-
WORD_SEGMENTER
= 2¶ Word segmentation
-
POS_TAGGER
= 3¶ Part-of-speech tagging
-
NER_CHUNKER
= 4¶ Named-entity recognition
-
SENTENCE_PARSER
= 5¶ Sentence parsing
-
COREF_CHUNKER
= 6¶ Coreference resolution
-
-
class
ckipnlp.driver.base.
DriverFamily
[source]¶ Bases:
enum.IntEnum
The enumeration of driver backend kinds.
-
BUILTIN
= 1¶ Built-in Implementation
-
TAGGER
= 2¶ CkipTagger Backend
-
CLASSIC
= 3¶ CkipClassic Backend
-
-
class
ckipnlp.driver.base.
DummyDriver
(*, lazy=False)[source]¶ Bases:
ckipnlp.driver.base.BaseDriver
The dummy driver.
ckipnlp.driver.classic module¶
This module provides drivers with CkipClassic backend.
-
class
ckipnlp.driver.classic.
CkipClassicWordSegmenter
(*, lazy=False, do_pos=False, lexicons=None)[source]¶ Bases:
ckipnlp.driver.base.BaseDriver
The CKIP word segmentation driver with CkipClassic backend.
- Parameters
lazy (bool) – Lazily initialize the underlying object.
do_pos (bool) – Whether to also return POS-tags.
lexicons (Iterable[Tuple[str, str]]) – A list of the lexicon words and their POS-tags.
-
__call__
(*, text)¶ Apply word segmentation.
- Parameters
text (
TextParagraph
) — The sentences.- Returns
ws (
SegParagraph
) — The word-segmented sentences.pos (
SegParagraph
) — The part-of-speech sentences. (Returned only if do_pos is set.)
-
class
ckipnlp.driver.classic.
CkipClassicSentenceParser
(*, lazy=False)[source]¶ Bases:
ckipnlp.driver.base.BaseDriver
The CKIP sentence parsing driver with CkipClassic backend.
- Parameters
lazy (bool) – Lazily initialize the underlying object.
-
__call__
(*, ws, pos)¶ Apply sentence parsing.
- Parameters
ws (
SegParagraph
) — The word-segmented sentences.pos (
SegParagraph
) — The part-of-speech sentences.
- Returns
parsed (
ParsedParagraph
) — The parsed-sentences.
ckipnlp.driver.coref module¶
This module provides built-in coreference resolution driver.
-
class
ckipnlp.driver.coref.
CkipCorefChunker
(*, lazy=False)[source]¶ Bases:
ckipnlp.driver.base.BaseDriver
The CKIP coreference resolution driver.
- Parameters
lazy (bool) – Lazily initialize the underlying object.
-
__call__
(*, parsed)¶ Apply coreference resolution.
- Parameters
parsed (
ParsedParagraph
) — The parsed-sentences.- Returns
coref (
CorefParagraph
) — The coreference results.
ckipnlp.driver.ss module¶
This module provides built-in sentence segmentation driver.
-
class
ckipnlp.driver.ss.
CkipSentenceSegmenter
(*, lazy=False, delims=',,。!!??::;;\n', keep_delims=False)[source]¶ Bases:
ckipnlp.driver.base.BaseDriver
The CKIP sentence segmentation driver.
- Parameters
lazy (bool) – Lazily initialize the underlying object.
delims (str) – The delimiters.
keep_delims (bool) – Keep delimiters.
-
__call__
(*, raw, keep_all=True)¶ Apply sentence segmentation.
- Parameters
raw (str) — The raw text.
- Returns
text (
TextParagraph
) — The sentences.
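The delimiter-splitting behavior described for this driver can be sketched in a few lines. This is a standalone approximation, not the library's implementation, and the default delimiter string here is an assumption modeled on the signature above:

```python
import re


def segment_sentences(raw, delims=',，。!！?？:：;；\n', keep_delims=False):
    """Split raw text into sentences on delimiter characters.
    With keep_delims, each delimiter stays attached to the
    sentence it terminates."""
    pattern = '[' + re.escape(delims) + ']'
    if keep_delims:
        # A capturing group makes re.split keep the delimiters,
        # interleaved with the sentence chunks.
        parts = re.split('(' + pattern + ')', raw)
        sentences = [''.join(pair)
                     for pair in zip(parts[0::2], parts[1::2] + [''])]
    else:
        sentences = re.split(pattern, raw)
    return [s for s in sentences if s]
```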
ckipnlp.driver.tagger module¶
This module provides drivers with CkipTagger backend.
-
class
ckipnlp.driver.tagger.
CkipTaggerWordSegmenter
(*, lazy=False, disable_cuda=True, recommend_lexicons={}, coerce_lexicons={}, **opts)[source]¶ Bases:
ckipnlp.driver.base.BaseDriver
The CKIP word segmentation driver with CkipTagger backend.
- Parameters
lazy (bool) – Lazily initialize the underlying object.
disable_cuda (bool) – Disable GPU usage.
recommend_lexicons (Mapping[str, float]) – A mapping of lexicon words to their relative weights.
coerce_lexicons (Mapping[str, float]) – A mapping of lexicon words to their relative weights.
- Other Parameters
**opts – Extra options for
ckiptagger.WS.__call__()
. (Please refer https://github.com/ckiplab/ckiptagger#4-run-the-ws-pos-ner-pipeline for details.)
-
__call__
(*, text)¶ Apply word segmentation.
- Parameters
text (
TextParagraph
) — The sentences.- Returns
ws (
SegParagraph
) — The word-segmented sentences.
-
class
ckipnlp.driver.tagger.
CkipTaggerPosTagger
(*, lazy=False, disable_cuda=True, **opts)[source]¶ Bases:
ckipnlp.driver.base.BaseDriver
The CKIP part-of-speech tagging driver with CkipTagger backend.
- Parameters
lazy (bool) – Lazily initialize the underlying object.
disable_cuda (bool) – Disable GPU usage.
- Other Parameters
**opts – Extra options for
ckiptagger.POS.__call__()
. (Please refer https://github.com/ckiplab/ckiptagger#4-run-the-ws-pos-ner-pipeline for details.)
-
__call__
(*, ws)¶ Apply part-of-speech tagging.
- Parameters
ws (
SegParagraph
) — The word-segmented sentences.- Returns
pos (
SegParagraph
) — The part-of-speech sentences.
-
class
ckipnlp.driver.tagger.
CkipTaggerNerChunker
(*, lazy=False, disable_cuda=True, **opts)[source]¶ Bases:
ckipnlp.driver.base.BaseDriver
The CKIP named-entity recognition driver with CkipTagger backend.
- Parameters
lazy (bool) – Lazily initialize the underlying object.
disable_cuda (bool) – Disable GPU usage.
- Other Parameters
**opts – Extra options for
ckiptagger.NER.__call__()
. (Please refer https://github.com/ckiplab/ckiptagger#4-run-the-ws-pos-ner-pipeline for details.)
__call__(*, ws, pos)¶
Apply named-entity recognition.
- Parameters
ws (TextParagraph) – The word-segmented sentences.
pos (TextParagraph) – The part-of-speech sentences.
- Returns
ner (NerParagraph) – The named-entity recognition results.
ckipnlp.pipeline package¶
This module implements CKIPNLP pipelines.
Submodules
ckipnlp.pipeline.core module¶
This module provides core CKIPNLP pipeline.
class ckipnlp.pipeline.core.CkipDocument(*, raw=None, text=None, ws=None, pos=None, ner=None, parsed=None)[source]¶
Bases: collections.abc.Mapping
The core document.
- Variables
raw (str) – The unsegmented text input.
text (TextParagraph) – The sentences.
ws (SegParagraph) – The word-segmented sentences.
pos (SegParagraph) – The part-of-speech sentences.
ner (NerParagraph) – The named-entity recognition results.
parsed (ParsedParagraph) – The parsed sentences.
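Because CkipDocument subclasses collections.abc.Mapping, its fields can also be read through the mapping protocol (key access, iteration, `in`, `len`). The following is a minimal hypothetical stand-in, not the real class, illustrating what that base class provides:

```python
from collections.abc import Mapping

class DocSketch(Mapping):
    """Hypothetical stand-in showing the Mapping interface that
    CkipDocument exposes; NOT the actual ckipnlp implementation."""

    _fields = ('raw', 'text', 'ws', 'pos', 'ner', 'parsed')

    def __init__(self, **kwargs):
        self._data = {key: kwargs.get(key) for key in self._fields}

    # The three abstract methods of Mapping:
    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)

doc = DocSketch(raw='中文字喔,啊哈哈哈')
print(doc['raw'])        # key access
print('ws' in doc)       # __contains__ comes for free from Mapping
print(doc.get('pos'))    # so do get(), keys(), items(), values()
```

Defining only `__getitem__`, `__iter__`, and `__len__` is enough: the Mapping ABC derives `get`, `__contains__`, `keys`, `items`, and `values` from them.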
class ckipnlp.pipeline.core.CkipPipeline(*, sentence_segmenter=<DriverFamily.BUILTIN: 1>, word_segmenter=<DriverFamily.TAGGER: 2>, pos_tagger=<DriverFamily.TAGGER: 2>, sentence_parser=<DriverFamily.CLASSIC: 3>, ner_chunker=<DriverFamily.TAGGER: 2>, lazy=True, opts={})[source]¶
Bases: object
The core pipeline.
- Parameters
sentence_segmenter (DriverFamily) – The type of sentence segmenter.
word_segmenter (DriverFamily) – The type of word segmenter.
pos_tagger (DriverFamily) – The type of part-of-speech tagger.
ner_chunker (DriverFamily) – The type of named-entity recognition chunker.
sentence_parser (DriverFamily) – The type of sentence parser.
- Other Parameters
lazy (bool) – Lazily initialize the drivers.
opts (Dict[str, Dict]) – The driver options. Key: driver name (e.g. ‘sentence_segmenter’); value: a dictionary of options.
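The opts mapping is keyed by driver name, with each value forwarded as keyword options to that driver. A sketch of the assumed shape (the lexicon word and weight below are made-up examples):

```python
# Assumed shape of the opts argument: one sub-dict per driver name.
# The lexicon entry and its weight are illustrative only.
opts = {
    'word_segmenter': {
        'recommend_lexicons': {'中文字': 1.0},
    },
    'pos_tagger': {
        'disable_cuda': True,
    },
}
# e.g. CkipPipeline(opts=opts)
```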
get_text(doc)[source]¶
Apply sentence segmentation.
- Parameters
doc (CkipDocument) – The input document.
- Returns
doc.text (TextParagraph) – The sentences.
Note
This routine modifies doc in-place.
get_ws(doc)[source]¶
Apply word segmentation.
- Parameters
doc (CkipDocument) – The input document.
- Returns
doc.ws (SegParagraph) – The word-segmented sentences.
Note
This routine modifies doc in-place.
get_pos(doc)[source]¶
Apply part-of-speech tagging.
- Parameters
doc (CkipDocument) – The input document.
- Returns
doc.pos (SegParagraph) – The part-of-speech sentences.
Note
This routine modifies doc in-place.
get_ner(doc)[source]¶
Apply named-entity recognition.
- Parameters
doc (CkipDocument) – The input document.
- Returns
doc.ner (NerParagraph) – The named-entity recognition results.
Note
This routine modifies doc in-place.
get_parsed(doc)[source]¶
Apply sentence parsing.
- Parameters
doc (CkipDocument) – The input document.
- Returns
doc.parsed (ParsedParagraph) – The parsed sentences.
Note
This routine modifies doc in-place.
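Each getter computes its missing prerequisites before producing its own result, storing everything on the document in-place. A toy sketch of that dependency-resolution pattern, using stand-in functions rather than the real drivers:

```python
# Toy sketch of the pipeline's lazy dependency resolution.
# The splitting and segmentation below are stand-ins, not real drivers.
def get_text(doc):
    if doc.get('text') is None:
        doc['text'] = doc['raw'].split(',')   # stand-in sentence splitter
    return doc['text']

def get_ws(doc):
    if doc.get('ws') is None:
        get_text(doc)                          # compute prerequisite first
        doc['ws'] = [list(sent) for sent in doc['text']]  # stand-in segmenter
    return doc['ws']

def get_pos(doc):
    if doc.get('pos') is None:
        get_ws(doc)                            # pulls in text too, if needed
        doc['pos'] = [['Na'] * len(s) for s in doc['ws']]  # stand-in tagger
    return doc['pos']

doc = {'raw': '中文字喔,啊哈哈哈'}
get_pos(doc)        # fills text and ws along the way, in-place
print(sorted(doc))  # all intermediate results remain on the document
```

Calling the most downstream getter once is therefore enough; earlier stages run only if their slots are still empty, which is why repeated calls are cheap.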
ckipnlp.pipeline.coref module¶
This module provides coreference resolution pipeline.
class ckipnlp.pipeline.coref.CkipCorefDocument(*, ws=None, pos=None, parsed=None, coref=None)[source]¶
Bases: collections.abc.Mapping
The coreference document.
- Variables
ws (SegParagraph) – The word-segmented sentences.
pos (SegParagraph) – The part-of-speech sentences.
parsed (ParsedParagraph) – The parsed sentences.
coref (CorefParagraph) – The coreference resolution results.
class ckipnlp.pipeline.coref.CkipCorefPipeline(*, coref_chunker=<DriverFamily.BUILTIN: 1>, lazy=True, opts={}, **kwargs)[source]¶
Bases: ckipnlp.pipeline.core.CkipPipeline
The coreference resolution pipeline.
- Parameters
sentence_segmenter (DriverFamily) – The type of sentence segmenter.
word_segmenter (DriverFamily) – The type of word segmenter.
pos_tagger (DriverFamily) – The type of part-of-speech tagger.
ner_chunker (DriverFamily) – The type of named-entity recognition chunker.
sentence_parser (DriverFamily) – The type of sentence parser.
coref_chunker (DriverFamily) – The type of coreference resolution chunker.
- Other Parameters
lazy (bool) – Lazily initialize the drivers.
opts (Dict[str, Dict]) – The driver options. Key: driver name (e.g. ‘sentence_segmenter’); value: a dictionary of options.
__call__(doc)[source]¶
Apply coreference resolution.
- Parameters
doc (CkipDocument) – The input document.
- Returns
corefdoc (CkipCorefDocument) – The coreference document.
Note
doc is also modified if the necessary dependencies (ws, pos, ner) are not computed yet.
get_coref(doc, corefdoc)[source]¶
Apply coreference resolution.
- Parameters
doc (CkipDocument) – The input document.
corefdoc (CkipCorefDocument) – The input document for coreference.
- Returns
corefdoc.coref (CorefParagraph) – The coreference results.
Note
This routine modifies corefdoc in-place.
doc is also modified if the necessary dependencies (ws, pos, ner) are not computed yet.
ckipnlp.util package¶
This module implements extra utilities for CKIPNLP.
Submodules
ckipnlp.util.data module¶
This module implements data loading utilities for CKIPNLP.
ckipnlp.util.data.get_tagger_data()¶
Get the CkipTagger data directory.

ckipnlp.util.data.install_tagger_data(src_dir, *, copy=False)¶
Link or copy the CkipTagger data directory.

ckipnlp.util.data.download_tagger_data()¶
Download the CkipTagger data directory.