CKIP CoreNLP¶
Introduction¶
CKIP CoreNLP Toolkit¶
Features¶
Sentence Segmentation
Word Segmentation
Part-of-Speech Tagging
Named-Entity Recognition
Constituency Parsing
Coreference Resolution
Git¶
PyPI¶
Documentation¶
Online Demo¶
Contributors¶
Wei-Yun Ma at CKIP (Maintainer)
Installation¶
Requirements¶
Python 3.6+
TreeLib 1.5+
CkipTagger 0.2.1+ [Optional, Recommended]
CkipClassic 1.0+ [Optional, Recommended]
TensorFlow / TensorFlow-GPU 1.13.1+ [Required by CkipTagger]
Driver Requirements¶
| Driver | Built-in | CkipTagger | CkipClassic |
|---|---|---|---|
| Sentence Segmentation | ✔ | | |
| Word Segmentation† | | ✔ | ✔ |
| Part-of-Speech Tagging† | | ✔ | ✔ |
| Constituency Parsing | | | ✔ |
| Named-Entity Recognition | | ✔ | |
| Coreference Resolution‡ | ✔ | ✔ | ✔ |
† These drivers require only one of the two backends.
‡ Coreference implementation does not require any backend, but requires results from word segmentation, part-of-speech tagging, constituency parsing, and named-entity recognition.
Installation via Pip¶
No backend (not recommended):

pip install ckipnlp

With CkipTagger backend (recommended):

pip install ckipnlp[tagger]

or

pip install ckipnlp[tagger-gpu]

With CkipClassic Parser Client backend (recommended):

pip install ckipnlp[classic]

With CkipClassic offline backend: please refer to https://ckip-classic.readthedocs.io/en/latest/main/readme.html#installation for the CkipClassic installation guide.
Attention

To use the CkipClassic Parser Client backend, please:

Register an account at http://parser.iis.sinica.edu.tw/v1/reg.php
Set the username and password in the pipeline’s options:

pipeline = CkipPipeline(opts={'con_parser': {'username': YOUR_USERNAME, 'password': YOUR_PASSWORD}})
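Alternatively (as noted in the driver documentation below), a sketch of supplying the same credentials through environment variables before constructing the pipeline; YOUR_USERNAME and YOUR_PASSWORD are placeholders for your registered credentials:

import os

os.environ['CKIPPARSER_USERNAME'] = YOUR_USERNAME
os.environ['CKIPPARSER_PASSWORD'] = YOUR_PASSWORD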
Detail¶
See https://ckipnlp.readthedocs.io/ for full documentation.
License¶
Copyright (c) 2018-2023 CKIP Lab under the GPL-3.0 License.
Usage¶
CkipNLP provides a set of human language technology tools, including
Sentence Segmentation
Word Segmentation
Part-of-Speech Tagging
Named-Entity Recognition
Constituency Parsing
Coreference Resolution
The library is built around three types of classes:

Containers such as SegParagraph are the basic data structures for inputs and outputs.
Drivers such as CkipTaggerWordSegmenter apply a specific tool to the inputs.
Pipelines such as CkipPipeline are collections of drivers that automatically handle the dependencies between inputs and outputs.
Containers¶
Containers Prototypes¶
All the container objects can be converted from/to other formats:

from_text(), to_text() for plain-text conversions;
from_list(), to_list() for list-like Python object conversions;
from_dict(), to_dict() for dictionary-like Python object (key-value mapping) conversions;
from_json(), to_json() for JSON format conversions (based on the dictionary-like format conversions).
Here are the interfaces, where CONTAINER_CLASS refers to the container class.
obj = CONTAINER_CLASS.from_text(plain_text)
plain_text = obj.to_text()
obj = CONTAINER_CLASS.from_list([ value1, value2 ])
list_obj = obj.to_list()
obj = CONTAINER_CLASS.from_dict({ key: value })
dict_obj = obj.to_dict()
obj = CONTAINER_CLASS.from_json(json_str)
json_str = obj.to_json()
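For instance, a minimal round-trip sketch using the SegSentence container documented below:

from ckipnlp.container.seg import SegSentence

# In text format, words are separated by U+3000 (the full-width space).
sent = SegSentence.from_text('中文字\u3000耶')
assert sent.to_list() == ['中文字', '耶']

# The JSON conversion is built on the list/dict-like format.
assert SegSentence.from_json(sent.to_json()).to_list() == sent.to_list()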
Note that not all containers provide all of the above conversions. Here is the table of implemented methods. Please refer to the documentation of each container for format details.
| Container | Item | from/to text | from/to list, dict, json |
|---|---|---|---|
| TextParagraph | str | ✔ | ✔ |
| SegSentence | str | ✔ | ✔ |
| SegParagraph | SegSentence | ✔ | ✔ |
| NerToken | | ✘ | ✔ |
| NerSentence | NerToken | ✘ | ✔ |
| NerParagraph | NerSentence | ✘ | ✔ |
| ParseClause | | only to | ✔ |
| ParseSentence | ParseClause | only to | ✔ |
| ParseParagraph | ParseSentence | only to | ✔ |
| CorefToken | | only to | ✔ |
| CorefSentence | CorefToken | only to | ✔ |
| CorefParagraph | CorefSentence | only to | ✔ |
WS with POS¶
There are also conversion routines that handle word-segmentation and part-of-speech containers jointly. For example, WsPosToken provides routines for a word (str) with a POS-tag (str):
ws_obj, pos_obj = WsPosToken.from_text('中文字(Na)')
plain_text = WsPosToken.to_text(ws_obj, pos_obj)
ws_obj, pos_obj = WsPosToken.from_list([ '中文字', 'Na' ])
list_obj = WsPosToken.to_list(ws_obj, pos_obj)
ws_obj, pos_obj = WsPosToken.from_dict({ 'word': '中文字', 'pos': 'Na', })
dict_obj = WsPosToken.to_dict(ws_obj, pos_obj)
ws_obj, pos_obj = WsPosToken.from_json(json_str)
json_str = WsPosToken.to_json(ws_obj, pos_obj)
Similarly, WsPosSentence/WsPosParagraph provide routines for word-segmented and POS sentences/paragraphs (SegSentence/SegParagraph) respectively.
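A short sketch of the sentence-level joint conversion, using the '中文字(Na)\u3000耶(T)' text format documented below:

from ckipnlp.container.util.wspos import WsPosSentence

ws_sent, pos_sent = WsPosSentence.from_text('中文字(Na)\u3000耶(T)')
print(ws_sent.to_list())   # [ '中文字', '耶' ]
print(pos_sent.to_list())  # [ 'Na', 'T' ]
print(WsPosSentence.to_text(ws_sent, pos_sent))  # '中文字(Na)\u3000耶(T)'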
Parse Tree¶
In addition to ParseClause, there are also tree utilities based on TreeLib.

ParseTree is the tree structure of a parse clause. One may use from_text() and to_text() for plain-text conversion; from_dict() and to_dict() for dictionary-like object conversion; and from_json() and to_json() for JSON string conversion.

ParseTree also provides from_penn() and to_penn() methods for Penn Treebank conversion. One may use to_penn() together with SvgLing to generate SVG tree graphs.
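For example, a sketch of drawing a tree with the third-party SvgLing package (pip install svgling; SvgLing is an assumption here, not a CkipNLP dependency), where tree is a ParseTree instance:

import svgling

penn = tree.to_penn()
svgling.draw_tree(penn)  # returns an SVG tree object (rendered inline in Jupyter)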
ParseTree is a TreeLib tree with ParseNode as its nodes. The data of these nodes is stored in a ParseNodeData (accessed by node.data), which is a tuple of role (semantic role), pos (POS-tag), and word (text term).

ParseTree provides useful methods: get_heads() finds the head words of the clause; get_relations() extracts all relations in the clause; get_subjects() returns the subjects of the clause.
from ckipnlp.container import ParseClause, ParseTree
# 我的早餐、午餐和晚餐都在那場比賽中被吃掉了 (My breakfast, lunch, and dinner were all eaten during that match)
clause = ParseClause('S(goal:NP(possessor:N‧的(head:Nhaa:我|Head:DE:的)|Head:Nab(DUMMY1:Nab(DUMMY1:Nab:早餐|Head:Caa:、|DUMMY2:Naa:午餐)|Head:Caa:和|DUMMY2:Nab:晚餐))|quantity:Dab:都|condition:PP(Head:P21:在|DUMMY:GP(DUMMY:NP(Head:Nac:比賽)|Head:Ng:中))|agent:PP(Head:P02:被)|Head:VC31:吃掉|aspect:Di:了)')
tree = clause.to_tree()
print('Show Tree')
tree.show()
print('Get Heads of {}'.format(tree[5]))
print('-- Semantic --')
for head in tree.get_heads(5, semantic=True): print(repr(head))
print('-- Syntactic --')
for head in tree.get_heads(5, semantic=False): print(repr(head))
print()
print('Get Relations of {}'.format(tree[0]))
print('-- Semantic --')
for rel in tree.get_relations(0, semantic=True): print(repr(rel))
print('-- Syntactic --')
for rel in tree.get_relations(0, semantic=False): print(repr(rel))
print()
# 我和食物真的都很不開心 (The food and I are really both very unhappy)
tree_text = 'S(theme:NP(DUMMY1:NP(Head:Nhaa:我)|Head:Caa:和|DUMMY2:NP(Head:Naa:食物))|evaluation:Dbb:真的|quantity:Dab:都|degree:Dfa:很|negation:Dc:不|Head:VH21:開心)'
tree = ParseTree.from_text(tree_text)
print('Show Tree')
tree.show()
print('Get Subjects of {}'.format(tree[0]))
print('-- Semantic --')
for subject in tree.get_subjects(0, semantic=True): print(repr(subject))
print('-- Syntactic --')
for subject in tree.get_subjects(0, semantic=False): print(repr(subject))
print()
Drivers¶
- class Driver(*, lazy=False, ...)
The prototype of CkipNLP drivers.
- Parameters
lazy (bool) – Lazy-initialize the driver. (Call init() at the first call of __call__() instead.)
- driver_type: str¶
The type of this driver.
- driver_family: str¶
The family of this driver.
- driver_inputs: Tuple[str, ...]¶
The inputs of this driver.
- init()¶
Initialize the driver (by calling the _init() function).
- __call__(*, ...)¶
Call the driver (by calling the _call() function).
Here is the list of drivers:

| Driver Type \ Family | 'default' | 'tagger' | 'classic' | 'classic-client' |
|---|---|---|---|---|
| Sentence Segmenter | CkipSentenceSegmenter | | | |
| Word Segmenter | | CkipTaggerWordSegmenter | CkipClassicWordSegmenter† | |
| Pos Tagger | | CkipTaggerPosTagger | CkipClassicWordSegmenter† | |
| Ner Chunker | | CkipTaggerNerChunker | | |
| Constituency Parser | | | CkipClassicConParser | CkipClassicConParserClient‡ |
| Coref Chunker | CkipCorefChunker | | | |

† Not compatible with CkipCorefPipeline.
‡ Please register an account at http://parser.iis.sinica.edu.tw/v1/reg.php and set the environment variables $CKIPPARSER_USERNAME and $CKIPPARSER_PASSWORD.
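Drivers can also be used directly, outside a pipeline. Below is a minimal sketch with the CkipTagger word segmenter; it assumes the CkipTagger backend and its model data are installed:

from ckipnlp.container.text import TextParagraph
from ckipnlp.driver.tagger import CkipTaggerWordSegmenter

# lazy=True defers init() to the first __call__().
word_segmenter = CkipTaggerWordSegmenter(lazy=True)

# Drivers take keyword-only inputs and return containers.
text = TextParagraph.from_list(['中文字耶,啊哈哈哈'])
ws = word_segmenter(text=text)
print(ws.to_list())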
Pipelines¶
Kernel Pipeline¶
The CkipPipeline connects drivers for sentence segmentation, word segmentation, part-of-speech tagging, named-entity recognition, and constituency parsing.

The CkipDocument is the workspace of CkipPipeline with input/output data. Note that CkipPipeline will store the results into CkipDocument in-place.

The CkipPipeline will compute all necessary dependencies. For example, if one calls get_ner() with only raw-text input, the pipeline will automatically call get_text(), get_ws(), and get_pos() first.
from ckipnlp.pipeline import CkipPipeline, CkipDocument
pipeline = CkipPipeline()
doc = CkipDocument(raw='中文字耶,啊哈哈哈')
# Word Segmentation
pipeline.get_ws(doc)
print(doc.ws)
for line in doc.ws:
print(line.to_text())
# Part-of-Speech Tagging
pipeline.get_pos(doc)
print(doc.pos)
for line in doc.pos:
print(line.to_text())
# Named-Entity Recognition
pipeline.get_ner(doc)
print(doc.ner)
# Constituency Parsing
pipeline.get_conparse(doc)
print(doc.conparse)
################################################################
from ckipnlp.container.util.wspos import WsPosParagraph
# Word Segmentation & Part-of-Speech Tagging
for line in WsPosParagraph.to_text(doc.ws, doc.pos):
print(line)
To customize a driver (e.g. disable CUDA in CkipTaggerWordSegmenter), you may pass its options to the pipeline:
pipeline = CkipPipeline(opts = {'word_segmenter': {'disable_cuda': True}})
Please refer to each driver’s documentation for the extra options.
Co-Reference Pipeline¶
The CkipCorefPipeline is an extension of CkipPipeline that adds coreference resolution. The pipeline first performs named-entity recognition as CkipPipeline does, followed by alignment algorithms that fix the word-segmentation and part-of-speech tagging outputs, and then performs coreference resolution based on the constituency-parsing results.

The CkipCorefDocument is the workspace of CkipCorefPipeline with input/output data. Note that CkipCorefPipeline will store the results into CkipCorefDocument.
from ckipnlp.pipeline import CkipCorefPipeline, CkipDocument
pipeline = CkipCorefPipeline()
doc = CkipDocument(raw='畢卡索他想,完蛋了')
# Co-Reference
corefdoc = pipeline(doc)
print(corefdoc.coref)
for line in corefdoc.coref:
print(line.to_text())
ckipnlp package¶
The Official CKIP CoreNLP Toolkit.
Subpackages
ckipnlp.container package¶
This module implements specialized container datatypes for CKIPNLP.
Subpackages
ckipnlp.container.util package¶
This module implements specialized utilities for CKIPNLP containers.
Submodules
ckipnlp.container.util.parse_tree module¶
This module provides tree containers for parsed sentences.
- class ckipnlp.container.util.parse_tree.ParseNodeData(role: Optional[str] = None, pos: Optional[str] = None, word: Optional[str] = None)[source]¶
Bases:
BaseTuple
,_ParseNodeData
A parse node.
- Variables
role (str) – the semantic role.
pos (str) – the POS-tag.
word (str) – the text term.
Note
This class is a subclass of tuple. To change an attribute, please create a new instance instead.

Data Structure Examples
- Text format
Used for from_text() and to_text().

'Head:Na:中文字'   # role / POS-tag / text term
- List format
Not implemented.
- Dict format
Used for from_dict() and to_dict().

{
    'role': 'Head',    # role
    'pos': 'Na',       # POS-tag
    'word': '中文字',   # text term
}
- class ckipnlp.container.util.parse_tree.ParseNode(tag=None, identifier=None, expanded=True, data=None)[source]¶
Bases:
Base
,Node
A parse node for tree.
- Variables
data (ParseNodeData) – the node data.
See also
treelib.tree.Node
Please refer to https://treelib.readthedocs.io/ for built-in usages.
Data Structure Examples
- Text format
Not implemented.
- List format
Not implemented.
- Dict format
Used for to_dict().

{
    'role': 'Head',    # role
    'pos': 'Na',       # POS-tag
    'word': '中文字',   # text term
}
- data_class¶
alias of ParseNodeData
- class ckipnlp.container.util.parse_tree.ParseRelation(head: ParseNode, tail: ParseNode, relation: ParseNode)[source]¶
Bases:
Base
,_ParseRelation
A parse relation.
- Variables
head (ParseNode) – the head node.
tail (ParseNode) – the tail node.
relation (ParseNode) – the relation node.
Notes
The parent of the relation node is always the common ancestor of the head node and tail node.
Data Structure Examples
- Text format
Not implemented.
- List format
Not implemented.
- Dict format
Used for to_dict().

{
    'head': { 'role': 'Head', 'pos': 'Nab', 'word': '中文字' },   # head node
    'tail': { 'role': 'particle', 'pos': 'Td', 'word': '耶' },    # tail node
    'relation': 'particle',                                       # relation
}
- class ckipnlp.container.util.parse_tree.ParseTree(tree=None, deep=False, node_class=None, identifier=None)[source]¶
Bases:
Base
,Tree
A parse tree.
See also
treelib.tree.Tree
Please refer to https://treelib.readthedocs.io/ for built-in usages.
Data Structure Examples
- Text format
Used for from_text() and to_text().

'S(Head:Nab:中文字|particle:Td:耶)'
- List format
Not implemented.
- Dict format
Used for from_dict() and to_dict(). A dictionary such as { 'id': 0, 'data': { ... }, 'children': [ ... ] }, where 'data' is a dictionary with the same format as ParseNodeData.to_dict(), and 'children' is a list of dictionaries of subtrees with the same format as this tree.

{
    'id': 0,
    'data': {
        'role': None,
        'pos': 'S',
        'word': None,
    },
    'children': [
        {
            'id': 1,
            'data': {
                'role': 'Head',
                'pos': 'Nab',
                'word': '中文字',
            },
            'children': [],
        },
        {
            'id': 2,
            'data': {
                'role': 'particle',
                'pos': 'Td',
                'word': '耶',
            },
            'children': [],
        },
    ],
}
- Penn Treebank format
Used for from_penn() and to_penn().

[
    'S',
    [ 'Head:Nab', '中文字', ],
    [ 'particle:Td', '耶', ],
]
- classmethod from_text(data)[source]¶
Construct an instance from text format.
- Parameters
data (str) – A parse tree in text format (ParseClause.clause).
- to_text(node_id=None)[source]¶
Transform to plain text.
- Parameters
node_id (int) – Output the plain text format for the subtree under node_id.
- Returns
str
- classmethod from_dict(data)[source]¶
Construct an instance from python built-in containers.
- Parameters
data (dict) – A parse tree in dictionary format.
- to_dict(node_id=None)[source]¶
Transform to python built-in containers.
- Parameters
node_id (int) – Output the dictionary format for the subtree under node_id.
- Returns
dict
- to_penn(node_id=None, *, with_role=True, with_word=True, sep=':')[source]¶
Transform to Penn Treebank format.
- Parameters
node_id (int) – Output the Penn Treebank format for the subtree under node_id.
with_role (bool) – Contains the role-tag or not.
with_word (bool) – Contains the word or not.
sep (str) – The separator between the role and the POS-tag.
- Returns
list
- get_children(node_id, *, role)[source]¶
Get children of a node with given role.
- Parameters
node_id (int) – ID of target node.
role (str) – the target role.
- Yields
ParseNode
– the children nodes with given role.
- get_heads(root_id=None, *, semantic=True, deep=True)[source]¶
Get all head nodes of a subtree.
- Parameters
root_id (int) – ID of the root node of target subtree.
semantic (bool) – use semantic/syntactic policy. For semantic mode, return DUMMY or head instead of syntactic Head.
deep (bool) – find heads recursively.
- Yields
ParseNode
– the head nodes.
- get_relations(root_id=None, *, semantic=True)[source]¶
Get all relations of a subtree.
- Parameters
root_id (int) – ID of the subtree root node.
semantic (bool) – please refer to get_heads() for policy details.
- Yields
ParseRelation
– the relations.
- get_subjects(root_id=None, *, semantic=True, deep=True)[source]¶
Get the subject node of a subtree.
- Parameters
root_id (int) – ID of the root node of target subtree.
semantic (bool) – please refer to get_heads() for policy details.
deep (bool) – please refer to get_heads() for policy details.
- Yields
ParseNode
– the subject node.
Notes
A node can be a subject if it is:

the head of an NP;
the head of a subnode (N) of S with the subject role;
the head of a subnode (N) of S with a neutral role, appearing before the head (V) of S.
ckipnlp.container.util.wspos module¶
This module provides containers for word-segmented sentences with part-of-speech-tags.
- class ckipnlp.container.util.wspos.WsPosToken(word: Optional[str] = None, pos: Optional[str] = None)[source]¶
Bases:
BaseTuple
,_WsPosToken
A word with POS-tag.
- Variables
word (str) – the word.
pos (str) – the POS-tag.
Note
This class is a subclass of tuple. To change an attribute, please create a new instance instead.
Data Structure Examples
- Text format
Used for from_text() and to_text().

'中文字(Na)'   # word / POS-tag
- List format
Used for from_list() and to_list().

[
    '中文字',   # word
    'Na',      # POS-tag
]
- Dict format
Used for from_dict() and to_dict().

{
    'word': '中文字',   # word
    'pos': 'Na',       # POS-tag
}
- class ckipnlp.container.util.wspos.WsPosSentence[source]¶
Bases:
object
A helper class for data conversion of word-segmented and part-of-speech sentences.
- classmethod from_text(data)[source]¶
Convert text format to word-segmented and part-of-speech sentences.
- Parameters
data (str) – text such as '中文字(Na)\u3000耶(T)'.
- Returns
SegSentence – the word sentence.
SegSentence – the POS-tag sentence.
- static to_text(word, pos)[source]¶
Convert word-segmented and part-of-speech sentences to text format.
- Parameters
word (SegSentence) – the word sentence.
pos (SegSentence) – the POS-tag sentence.
- Returns
str – text such as '中文字(Na)\u3000耶(T)'.
- class ckipnlp.container.util.wspos.WsPosParagraph[source]¶
Bases:
object
A helper class for data conversion of word-segmented and part-of-speech sentence lists.
- classmethod from_text(data)[source]¶
Convert text format to word-segmented and part-of-speech sentence lists.
- Parameters
data (Sequence[str]) – list of sentences such as '中文字(Na)\u3000耶(T)'.
- Returns
SegParagraph – the word sentence list.
SegParagraph – the POS-tag sentence list.
- static to_text(word, pos)[source]¶
Convert word-segmented and part-of-speech sentence lists to text format.
- Parameters
word (SegParagraph) – the word sentence list.
pos (SegParagraph) – the POS-tag sentence list.
- Returns
List[str] – list of sentences such as '中文字(Na)\u3000耶(T)'.
Submodules
ckipnlp.container.base module¶
This module provides base containers.
- class ckipnlp.container.base.Base[source]¶
Bases:
object
The base CKIPNLP container.
- abstract classmethod from_text(data)[source]¶
Construct an instance from text format.
- Parameters
data (str) –
- abstract classmethod from_list(data)[source]¶
Construct an instance from python built-in containers.
- abstract classmethod from_dict(data)[source]¶
Construct an instance from python built-in containers.
- classmethod from_json(data, **kwargs)[source]¶
Construct an instance from JSON format.
- Parameters
data (str) – please refer to from_dict() for format details.
- class ckipnlp.container.base.BaseTuple[source]¶
Bases:
Base
The base CKIPNLP tuple.
- classmethod from_list(data)[source]¶
Construct an instance from python built-in containers.
- Parameters
data (list) –
- class ckipnlp.container.base.BaseList(initlist=None)[source]¶
Bases:
_BaseList
,_InterfaceItem
The base CKIPNLP list.
- item_class = Not Implemented¶
Must be a CKIPNLP container class.
- class ckipnlp.container.base.BaseList0(initlist=None)[source]¶
Bases:
_BaseList
,_InterfaceBuiltInItem
The base CKIPNLP list with built-in item class.
- item_class = Not Implemented¶
Must be a built-in type.
ckipnlp.container.coref module¶
This module provides containers for coreference sentences.
- class ckipnlp.container.coref.CorefToken(word, coref, idx, **kwargs)[source]¶
Bases:
BaseTuple
,_CorefToken
A coreference token.
- Variables
word (str) – the token word.
coref (Tuple[int, str]) – the coreference ID and type; None if this token is not a coreference source or target. The type is one of:
'source': coreference source.
'target': coreference target.
'zero': null-element coreference target.
idx (Tuple[int, int]) – the node indexes (clause index, token index) in the parse tree; idx[1] is None if this node is a null element or punctuation.
Note
This class is a subclass of tuple. To change an attribute, please create a new instance instead.

Data Structure Examples
- Text format
Used for to_text().

'畢卡索_0'
- List format
Used for from_list() and to_list().

[
    '畢卡索',        # token word
    (0, 'source'),  # coref ID and type
    (2, 2),         # node index
]
- Dict format
Used for from_dict() and to_dict().

{
    'word': '畢卡索',         # token word
    'coref': (0, 'source'),  # coref ID and type
    'idx': (2, 2),           # node index
}
- class ckipnlp.container.coref.CorefSentence(initlist=None)[source]¶
Bases:
BaseSentence
A coreference sentence.
Data Structure Examples
- Text format
Used for to_text().

'「 完蛋 了 !」 , 畢卡索_0 他_0 想'   # tokens segmented by \u3000 (full-width space)
- List format
Used for from_list() and to_list().

[
    [ '「', None, (0, None,), ],
    [ '完蛋', None, (1, 0,), ],
    [ '了', None, (1, 1,), ],
    [ '!」', None, (1, None,), ],
    [ '畢卡索', (0, 'source'), (2, 2,), ],
    [ '他', (0, 'target'), (2, 3,), ],
    [ '想', None, (2, 4,), ],
]
- Dict format
Used for from_dict() and to_dict().

[
    { 'word': '「', 'coref': None, 'idx': (0, None,), },
    { 'word': '完蛋', 'coref': None, 'idx': (1, 0,), },
    { 'word': '了', 'coref': None, 'idx': (1, 1,), },
    { 'word': '!」', 'coref': None, 'idx': (1, None,), },
    { 'word': '畢卡索', 'coref': (0, 'source'), 'idx': (2, 2,), },
    { 'word': '他', 'coref': (0, 'target'), 'idx': (2, 3,), },
    { 'word': '想', 'coref': None, 'idx': (2, 4,), },
]
- item_class¶
alias of CorefToken
- class ckipnlp.container.coref.CorefParagraph(initlist=None)[source]¶
Bases:
BaseList
A list of coreference sentences.
Data Structure Examples
- Text format
Used for to_text().

[
    '「 完蛋 了 !」 , 畢卡索_0 他_0 想',   # Sentence 1
    '但是 None_0 也 沒有 辦法',            # Sentence 2
]
- List format
Used for from_list() and to_list().

[
    [   # Sentence 1
        [ '「', None, (0, None,), ],
        [ '完蛋', None, (1, 0,), ],
        [ '了', None, (1, 1,), ],
        [ '!」', None, (1, None,), ],
        [ '畢卡索', (0, 'source'), (2, 2,), ],
        [ '他', (0, 'target'), (2, 3,), ],
        [ '想', None, (2, 4,), ],
    ],
    [   # Sentence 2
        [ '但是', None, (0, 1,), ],
        [ None, (0, 'zero'), (0, None,), ],
        [ '也', None, (0, 2,), ],
        [ '沒有', None, (0, 3,), ],
        [ '辦法', None, (0, 5,), ],
    ],
]
- Dict format
Used for from_dict() and to_dict().

[
    [   # Sentence 1
        { 'word': '「', 'coref': None, 'idx': (0, None,), },
        { 'word': '完蛋', 'coref': None, 'idx': (1, 0,), },
        { 'word': '了', 'coref': None, 'idx': (1, 1,), },
        { 'word': '!」', 'coref': None, 'idx': (1, None,), },
        { 'word': '畢卡索', 'coref': (0, 'source'), 'idx': (2, 2,), },
        { 'word': '他', 'coref': (0, 'target'), 'idx': (2, 3,), },
        { 'word': '想', 'coref': None, 'idx': (2, 4,), },
    ],
    [   # Sentence 2
        { 'word': '但是', 'coref': None, 'idx': (0, 1,), },
        { 'word': None, 'coref': (0, 'zero'), 'idx': (0, None,), },
        { 'word': '也', 'coref': None, 'idx': (0, 2,), },
        { 'word': '沒有', 'coref': None, 'idx': (0, 3,), },
        { 'word': '辦法', 'coref': None, 'idx': (0, 5,), },
    ],
]
- item_class¶
alias of CorefSentence
ckipnlp.container.ner module¶
This module provides containers for NER sentences.
- class ckipnlp.container.ner.NerToken(word, ner, idx, **kwargs)[source]¶
Bases:
BaseTuple
,_NerToken
A named-entity recognition token.
- Variables
word (str) – the token word.
ner (str) – the NER-tag.
idx (Tuple[int, int]) – the starting / ending index.
Note
This class is a subclass of tuple. To change an attribute, please create a new instance instead.

Data Structure Examples
- Text format
Not implemented
- List format
Used for from_list() and to_list().

[
    '中文字',      # token word
    'LANGUAGE',   # NER-tag
    (0, 3),       # starting / ending index
]
- Dict format
Used for from_dict() and to_dict().

{
    'word': '中文字',      # token word
    'ner': 'LANGUAGE',    # NER-tag
    'idx': (0, 3),        # starting / ending index
}
- CkipTagger format
Used for from_tagger() and to_tagger().

(
    0,            # starting index
    3,            # ending index
    'LANGUAGE',   # NER-tag
    '中文字',      # token word
)
- class ckipnlp.container.ner.NerSentence(initlist=None)[source]¶
Bases:
BaseSentence
A named-entity recognition sentence.
Data Structure Examples
- Text format
Not implemented
- List format
Used for from_list() and to_list().

[
    [ '美國', 'GPE', (0, 2), ],    # named-entity 1
    [ '參議院', 'ORG', (3, 5), ],  # named-entity 2
]
- Dict format
Used for from_dict() and to_dict().

[
    { 'word': '美國', 'ner': 'GPE', 'idx': (0, 2), },    # named-entity 1
    { 'word': '參議院', 'ner': 'ORG', 'idx': (3, 5), },  # named-entity 2
]
- CkipTagger format
Used for from_tagger() and to_tagger().

[
    ( 0, 2, 'GPE', '美國', ),    # named-entity 1
    ( 3, 5, 'ORG', '參議院', ),  # named-entity 2
]
- class ckipnlp.container.ner.NerParagraph(initlist=None)[source]¶
Bases:
BaseList
A list of named-entity recognition sentences.
Data Structure Examples
- Text format
Not implemented
- List format
Used for from_list() and to_list().

[
    [   # Sentence 1
        [ '中文字', 'LANGUAGE', (0, 3), ],
    ],
    [   # Sentence 2
        [ '美國', 'GPE', (0, 2), ],
        [ '參議院', 'ORG', (3, 5), ],
    ],
]
- Dict format
Used for from_dict() and to_dict().

[
    [   # Sentence 1
        { 'word': '中文字', 'ner': 'LANGUAGE', 'idx': (0, 3), },
    ],
    [   # Sentence 2
        { 'word': '美國', 'ner': 'GPE', 'idx': (0, 2), },
        { 'word': '參議院', 'ner': 'ORG', 'idx': (3, 5), },
    ],
]
- CkipTagger format
Used for from_tagger() and to_tagger().

[
    [   # Sentence 1
        ( 0, 3, 'LANGUAGE', '中文字', ),
    ],
    [   # Sentence 2
        ( 0, 2, 'GPE', '美國', ),
        ( 3, 5, 'ORG', '參議院', ),
    ],
]
- item_class¶
alias of NerSentence
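A sketch of wrapping raw CkipTagger NER output with from_tagger(), using the tuple format shown above:

from ckipnlp.container.ner import NerParagraph

ner = NerParagraph.from_tagger([
    [ ( 0, 3, 'LANGUAGE', '中文字', ), ],                       # Sentence 1
    [ ( 0, 2, 'GPE', '美國', ), ( 3, 5, 'ORG', '參議院', ), ],  # Sentence 2
])
print(ner.to_list())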
ckipnlp.container.parse module¶
This module provides containers for parsed sentences.
- class ckipnlp.container.parse.ParseClause(clause: Optional[str] = None, delim: str = '')[source]¶
Bases:
BaseTuple
,_ParseClause
A parse clause.
- Variables
clause (str) – the parse clause.
delim (str) – the punctuations after this clause.
Note
This class is a subclass of tuple. To change an attribute, please create a new instance instead.

Data Structure Examples
- Text format
Used for to_text().

'S(Head:Nab:中文字|particle:Td:耶)'   # delim is ignored
- List format
Used for from_list() and to_list().

[
    'S(Head:Nab:中文字|particle:Td:耶)',  # parse clause
    ',',                                 # punctuations
]
- Dict format
Used for from_dict() and to_dict().

{
    'clause': 'S(Head:Nab:中文字|particle:Td:耶)',  # parse clause
    'delim': ',',                                  # punctuations
}
- class ckipnlp.container.parse.ParseSentence(initlist=None)[source]¶
Bases:
BaseList
A parse sentence.
Data Structure Examples
- Text format
Used for to_text().

[   # delim is ignored
    'S(Head:Nab:中文字|particle:Td:耶)',                       # Clause 1
    '%(particle:I:啊|manner:Dh:哈|manner:Dh:哈|time:Dh:哈)',   # Clause 2
]
- List format
Used for from_list() and to_list().

[
    [   # Clause 1
        'S(Head:Nab:中文字|particle:Td:耶)',
        ',',
    ],
    [   # Clause 2
        '%(particle:I:啊|manner:Dh:哈|manner:Dh:哈|time:Dh:哈)',
        '。',
    ],
]
- Dict format
Used for from_dict() and to_dict().

[
    {   # Clause 1
        'clause': 'S(Head:Nab:中文字|particle:Td:耶)',
        'delim': ',',
    },
    {   # Clause 2
        'clause': '%(particle:I:啊|manner:Dh:哈|manner:Dh:哈|time:Dh:哈)',
        'delim': '。',
    },
]
- item_class¶
alias of ParseClause
- class ckipnlp.container.parse.ParseParagraph(initlist=None)[source]¶
Bases:
BaseList
A list of parse sentences.
Data Structure Examples
- Text format
Used for to_text().

[   # delim is ignored
    [   # Sentence 1
        'S(Head:Nab:中文字|particle:Td:耶)',
        '%(particle:I:啊|manner:Dh:哈|manner:Dh:哈|time:Dh:哈)',
    ],
    [   # Sentence 2
        None,
        'VP(Head:VH11:完蛋|particle:Ta:了)',
        'S(agent:NP(apposition:Nba:畢卡索|Head:Nhaa:他)|Head:VE2:想)',
    ],
]
- List format
Used for from_list() and to_list().

[
    [   # Sentence 1
        [ 'S(Head:Nab:中文字|particle:Td:耶)', ',', ],
        [ '%(particle:I:啊|manner:Dh:哈|manner:Dh:哈|time:Dh:哈)', '。', ],
    ],
    [   # Sentence 2
        [ None, '「', ],
        [ 'VP(Head:VH11:完蛋|particle:Ta:了)', '!」', ],
        [ 'S(agent:NP(apposition:Nba:畢卡索|Head:Nhaa:他)|Head:VE2:想)', '', ],
    ],
]
- Dict format
Used for from_dict() and to_dict().

[
    [   # Sentence 1
        { 'clause': 'S(Head:Nab:中文字|particle:Td:耶)', 'delim': ',', },
        { 'clause': '%(particle:I:啊|manner:Dh:哈|manner:Dh:哈|time:Dh:哈)', 'delim': '。', },
    ],
    [   # Sentence 2
        { 'clause': None, 'delim': '「', },
        { 'clause': 'VP(Head:VH11:完蛋|particle:Ta:了)', 'delim': '!」', },
        { 'clause': 'S(agent:NP(apposition:Nba:畢卡索|Head:Nhaa:他)|Head:VE2:想)', 'delim': '', },
    ],
]
- item_class¶
alias of ParseSentence
ckipnlp.container.seg module¶
This module provides containers for word-segmented sentences.
- class ckipnlp.container.seg.SegSentence(initlist=None)[source]¶
Bases:
BaseSentence0
A word-segmented sentence.
Data Structure Examples
- Text format
Used for from_text() and to_text().

'中文字 耶 , 啊 哈 哈哈 。'   # words segmented by \u3000 (full-width space)
- List/Dict format
Used for from_list(), to_list(), from_dict(), and to_dict().

[ '中文字', '耶', ',', '啊', '哈', '哈哈', '。', ]
Note
This class is also used for part-of-speech tagging.
- item_class¶
alias of str
- class ckipnlp.container.seg.SegParagraph(initlist=None)[source]¶
Bases:
BaseList
A list of word-segmented sentences.
Data Structure Examples
- Text format
Used for
from_text()
andto_text()
.[ '中文字 耶 , 啊 哈 哈 。', # Sentence 1 '「 完蛋 了 ! 」 , 畢卡索 他 想', # Sentence 2 ]
- List/Dict format
Used for from_list(), to_list(), from_dict(), and to_dict().

[
    [ '中文字', '耶', ',', '啊', '哈', '哈哈', '。', ],             # Sentence 1
    [ '「', '完蛋', '了', '!', '」', ',', '畢卡索', '他', '想', ],  # Sentence 2
]
Note
This class is also used for part-of-speech tagging.
- item_class¶
alias of SegSentence
ckipnlp.container.text module¶
This module provides containers for text sentences.
- class ckipnlp.container.text.TextParagraph(initlist=None)[source]¶
Bases:
BaseList0
A list of text sentences.
Data Structure Examples
- Text/List/Dict format
Used for from_text(), to_text(), from_list(), to_list(), from_dict(), and to_dict().

[
    '中文字耶,啊哈哈哈。',      # Sentence 1
    '「完蛋了!」畢卡索他想',    # Sentence 2
]
- item_class¶
alias of str
ckipnlp.driver package¶
This module implements CKIPNLP drivers.
Submodules
ckipnlp.driver.base module¶
This module provides base drivers.
- class ckipnlp.driver.base.DummyDriver(*, lazy=False)[source]¶
Bases:
BaseDriver
The dummy driver.
ckipnlp.driver.classic module¶
This module provides drivers with CkipClassic backend.
- class ckipnlp.driver.classic.CkipClassicWordSegmenter(*, lazy=False, do_pos=False, lexicons=None)[source]¶
Bases:
BaseDriver
The CKIP word segmentation driver with CkipClassic backend.
- Parameters
lazy (bool) – Lazy initialize the driver.
do_pos (bool) – Whether to return POS-tags or not.
lexicons (Iterable[Tuple[str, str]]) – A list of the lexicon words and their POS-tags.
- __call__(*, text)¶
Apply word segmentation.
- Parameters
text (TextParagraph) – The sentences.
- Returns
ws (SegParagraph) – The word-segmented sentences.
pos (SegParagraph) – The part-of-speech sentences. (Returned only if do_pos is set.)
- class ckipnlp.driver.classic.CkipClassicConParser(*, lazy=False)[source]¶
Bases:
_CkipClassicConParser
The CKIP constituency parsing driver with CkipClassic backend.
- Parameters
lazy (bool) – Lazy initialize the driver.
- __call__(*, ws, pos)¶
Apply constituency parsing.
- Parameters
ws (SegParagraph) – The word-segmented sentences.
pos (SegParagraph) – The part-of-speech sentences.
- Returns
conparse (ParseParagraph) – The constituency-parsing sentences.
- class ckipnlp.driver.classic.CkipClassicConParserClient(*, lazy=False, **opts)[source]¶
Bases:
_CkipClassicConParser
The CKIP constituency parsing driver with CkipClassic client backend.
- Parameters
lazy (bool) – Lazy initialize the driver.
username (string) – (optional) The username of CkipClassicParserClient.
password (string) – (optional) The password of CkipClassicParserClient.
Notes
Please register an account at http://parser.iis.sinica.edu.tw/v1/reg.php and set the environment variables $CKIPPARSER_USERNAME and $CKIPPARSER_PASSWORD.

- __call__(*, ws, pos)¶
Apply constituency parsing.
- Parameters
ws (SegParagraph) – The word-segmented sentences.
pos (SegParagraph) – The part-of-speech sentences.
- Returns
conparse (ParseParagraph) – The constituency-parsing sentences.
ckipnlp.driver.coref module¶
This module provides built-in coreference resolution driver.
- class ckipnlp.driver.coref.CkipCorefChunker(*, lazy=False)[source]¶
Bases:
BaseDriver
The CKIP coreference resolution driver.
- Parameters
lazy (bool) – Lazy initialize the driver.
- __call__(*, conparse)¶
Apply coreference resolution.
- Parameters
conparse (ParseParagraph) – The constituency-parsing sentences.
- Returns
coref (CorefParagraph) – The coreference results.
ckipnlp.driver.ss module¶
This module provides built-in sentence segmentation driver.
- class ckipnlp.driver.ss.CkipSentenceSegmenter(*, lazy=False, delims='\n', keep_delims=False)[source]¶
Bases:
BaseDriver
The CKIP sentence segmentation driver.
- Parameters
lazy (bool) – Lazy initialize the driver.
delims (str) – The delimiters.
keep_delims (bool) – Keep the delimiters.
- __call__(*, raw, keep_all=True)¶
Apply sentence segmentation.
- Parameters
raw (str) – The raw text.
- Returns
text (TextParagraph) – The sentences.
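A sketch of using the sentence segmenter standalone, splitting on newlines and full-width commas while keeping the delimiters (the delims value here is illustrative):

from ckipnlp.driver.ss import CkipSentenceSegmenter

sentence_segmenter = CkipSentenceSegmenter(delims='\n,', keep_delims=True)
text = sentence_segmenter(raw='中文字耶,啊哈哈哈')
print(text.to_list())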
ckipnlp.driver.tagger module¶
This module provides drivers with CkipTagger backend.
- class ckipnlp.driver.tagger.CkipTaggerWordSegmenter(*, lazy=False, disable_cuda=True, recommend_lexicons={}, coerce_lexicons={}, **opts)[source]¶
Bases:
BaseDriver
The CKIP word segmentation driver with CkipTagger backend.
- Parameters
lazy (bool) – Lazy initialize the driver.
disable_cuda (bool) – Disable GPU usage.
recommend_lexicons (Mapping[str, float]) – A mapping of lexicon words to their relative weights.
coerce_lexicons (Mapping[str, float]) – A mapping of lexicon words to their relative weights.
**opts – Extra options for
ckiptagger.WS.__call__()
. (Please refer https://github.com/ckiplab/ckiptagger#4-run-the-ws-pos-ner-pipeline for details.)
- __call__(*, text)¶
Apply word segmentation.
- Parameters
text (TextParagraph) – The sentences.
- Returns
ws (SegParagraph) – The word-segmented sentences.
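These options can also be forwarded through a pipeline, as shown earlier for disable_cuda. A sketch with an illustrative lexicon weight (the word and weight are assumptions, not recommendations):

from ckipnlp.pipeline import CkipPipeline

pipeline = CkipPipeline(opts={'word_segmenter': {
    'recommend_lexicons': {'中文字': 1.0},  # nudge the segmenter to keep 中文字 as one word
}})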
- class ckipnlp.driver.tagger.CkipTaggerPosTagger(*, lazy=False, disable_cuda=True, **opts)[source]¶
Bases:
BaseDriver
The CKIP part-of-speech tagging driver with CkipTagger backend.
- Parameters
lazy (bool) – Lazy initialize the driver.
disable_cuda (bool) – Disable GPU usage.
**opts – Extra options for
ckiptagger.POS.__call__()
. (Please refer https://github.com/ckiplab/ckiptagger#4-run-the-ws-pos-ner-pipeline for details.)
- __call__(*, ws)¶
Apply part-of-speech tagging.
- Parameters
ws (SegParagraph) – The word-segmented sentences.
- Returns
pos (SegParagraph) – The part-of-speech sentences.
- class ckipnlp.driver.tagger.CkipTaggerNerChunker(*, lazy=False, disable_cuda=True, **opts)[source]¶
Bases:
BaseDriver
The CKIP named-entity recognition driver with CkipTagger backend.
- Parameters
lazy (bool) – Lazy initialize the driver.
disable_cuda (bool) – Disable GPU usage.
**opts – Extra options for
ckiptagger.NER.__call__()
. (Please refer https://github.com/ckiplab/ckiptagger#4-run-the-ws-pos-ner-pipeline for details.)
- __call__(*, ws, pos)¶
Apply named-entity recognition.
- Parameters
ws (SegParagraph) – The word-segmented sentences.
pos (SegParagraph) – The part-of-speech sentences.
- Returns
ner (NerParagraph) – The named-entity recognition results.
ckipnlp.pipeline package¶
This module implements CKIPNLP pipelines.
Submodules
ckipnlp.pipeline.coref module¶
This module provides coreference resolution pipeline.
- class ckipnlp.pipeline.coref.CkipCorefDocument(*, ws=None, pos=None, conparse=None, coref=None)[source]¶
Bases:
Mapping
The coreference document.
- Variables
ws (SegParagraph) – The word-segmented sentences.
pos (SegParagraph) – The part-of-speech sentences.
conparse (ParseParagraph) – The constituency-parsing sentences.
coref (CorefParagraph) – The coreference resolution results.
- class ckipnlp.pipeline.coref.CkipCorefPipeline(*, coref_chunker='default', lazy=True, opts={}, **kwargs)[source]¶
Bases:
CkipPipeline
The coreference resolution pipeline.
- Parameters
sentence_segmenter (str) – The type of sentence segmenter.
word_segmenter (str) – The type of word segmenter.
pos_tagger (str) – The type of part-of-speech tagger.
ner_chunker (str) – The type of named-entity recognition chunker.
con_parser (str) – The type of constituency parser.
coref_chunker (str) – The type of coreference resolution chunker.
lazy (bool) – Lazy initialize the drivers.
opts (Dict[str, Dict]) – The driver options. Key: driver name (e.g. ‘sentence_segmenter’); Value: a dictionary of options.
- __call__(doc)[source]¶
Apply coreference resolution.
- Parameters
doc (CkipDocument) – The input document.
- Returns
corefdoc (CkipCorefDocument) – The coreference document.

Note
doc is also modified if the necessary dependencies (ws, pos, ner) are not computed yet.
- get_coref(doc, corefdoc)[source]¶
Apply coreference resolution.
- Parameters
doc (CkipDocument) – The input document.
corefdoc (CkipCorefDocument) – The input document for coreference.
- Returns
corefdoc.coref (CorefParagraph) – The coreference results.

Note
This routine modifies corefdoc in-place.
doc is also modified if the necessary dependencies (ws, pos, ner) are not computed yet.
ckipnlp.pipeline.kernel module¶
This module provides kernel CKIPNLP pipeline.
- class ckipnlp.pipeline.kernel.CkipDocument(*, raw=None, text=None, ws=None, pos=None, ner=None, conparse=None)[source]¶
Bases:
Mapping
The kernel document.
- Variables
raw (str) – The unsegmented text input.
text (TextParagraph) – The sentences.
ws (SegParagraph) – The word-segmented sentences.
pos (SegParagraph) – The part-of-speech sentences.
ner (NerParagraph) – The named-entity recognition results.
conparse (ParseParagraph) – The constituency-parsing sentences.
- class ckipnlp.pipeline.kernel.CkipPipeline(*, sentence_segmenter='default', word_segmenter='tagger', pos_tagger='tagger', con_parser='classic-client', ner_chunker='tagger', lazy=True, opts={})[source]¶
Bases:
object
The kernel pipeline.
- Parameters
sentence_segmenter (str) – The type of sentence segmenter.
word_segmenter (str) – The type of word segmenter.
pos_tagger (str) – The type of part-of-speech tagger.
ner_chunker (str) – The type of named-entity recognition chunker.
con_parser (str) – The type of constituency parser.
lazy (bool) – Lazy initialize the drivers.
opts (Dict[str, Dict]) – The driver options. Key: driver name (e.g. ‘sentence_segmenter’); Value: a dictionary of options.
- get_text(doc)[source]¶
Apply sentence segmentation.
- Parameters
doc (CkipDocument) – The input document.
- Returns
doc.text (TextParagraph) – The sentences.

Note
This routine modifies doc in-place.
- get_ws(doc)[source]¶
Apply word segmentation.
- Parameters
doc (CkipDocument) – The input document.
- Returns
doc.ws (SegParagraph) – The word-segmented sentences.

Note
This routine modifies doc in-place.
- get_pos(doc)[source]¶
Apply part-of-speech tagging.
- Parameters
doc (CkipDocument) – The input document.
- Returns
doc.pos (SegParagraph) – The part-of-speech sentences.

Note
This routine modifies doc in-place.
- get_ner(doc)[source]¶
Apply named-entity recognition.
- Parameters
doc (CkipDocument) – The input document.
- Returns
doc.ner (NerParagraph) – The named-entity recognition results.

Note
This routine modifies doc in-place.
- get_conparse(doc)[source]¶
Apply constituency parsing.
- Parameters
doc (CkipDocument) – The input document.
- Returns
doc.conparse (ParseParagraph) – The constituency-parsing sentences.

Note
This routine modifies doc in-place.
ckipnlp.util package¶
This module implements extra utilities for CKIPNLP.
Submodules
ckipnlp.util.data module¶
This module implements data loading utilities for CKIPNLP.
- ckipnlp.util.data.get_tagger_data()¶
Get CkipTagger data directory.
- ckipnlp.util.data.install_tagger_data(src_dir, *, copy=False)¶
Link/Copy CkipTagger data directory.
- ckipnlp.util.data.download_tagger_data()¶
Download CkipTagger data directory.
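A sketch of preparing the CkipTagger model data with these utilities (the calls take no arguments per the signatures above; the download location depends on your environment):

from ckipnlp.util.data import download_tagger_data, get_tagger_data

download_tagger_data()    # fetch the CkipTagger model data
print(get_tagger_data())  # path of the installed data directory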
ckipnlp.util.logger module¶
This module implements logging utilities for CKIPNLP.