Containers

Container Prototypes

All container objects can be converted from/to other formats:

from_text(), to_text() for plain-text conversions;
from_list(), to_list() for list-like python object conversions;
from_dict(), to_dict() for dictionary-like python object (key-value mappings) conversions;
from_json(), to_json() for JSON format conversions (based on the dictionary-like format conversions).

Here are the interfaces, where CONTAINER_CLASS refers to the container class.

obj = CONTAINER_CLASS.from_text(plain_text)
plain_text = obj.to_text()

obj = CONTAINER_CLASS.from_list([ value1, value2 ])
list_obj = obj.to_list()

obj = CONTAINER_CLASS.from_dict({ key: value })
dict_obj = obj.to_dict()

obj = CONTAINER_CLASS.from_json(json_str)
json_str = obj.to_json()
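
The layering mentioned above, where the JSON routines sit on top of the dictionary-like routines, can be sketched roughly as follows. This is only an illustration of the pattern, not ckipnlp's implementation: `DemoToken` and its `word`/`pos` fields are hypothetical names chosen for the example.

```python
import json
from typing import NamedTuple


class DemoToken(NamedTuple):
    """Hypothetical container, used only to illustrate the from_*/to_* pattern."""

    word: str
    pos: str

    @classmethod
    def from_list(cls, data):
        # List-like form: values given in field order.
        return cls(*data)

    def to_list(self):
        return list(self)

    @classmethod
    def from_dict(cls, data):
        # Dictionary-like form: field names mapped to values.
        return cls(**data)

    def to_dict(self):
        return dict(self._asdict())

    @classmethod
    def from_json(cls, data, **kwargs):
        # JSON conversion is built on the dictionary-like conversion.
        return cls.from_dict(json.loads(data, **kwargs))

    def to_json(self, **kwargs):
        return json.dumps(self.to_dict(), **kwargs)


token = DemoToken.from_dict({'word': '中文字', 'pos': 'Na'})
json_str = token.to_json(ensure_ascii=False)
print(json_str)
print(DemoToken.from_json(json_str) == token)
```

Because `to_json()` only serializes the result of `to_dict()`, a container that supports the dictionary-like form gets the JSON form almost for free, which matches the table grouping list, dict, and json conversions in one column.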

Note that not every container provides all of the above conversions. The table below lists the implemented methods. Please refer to the documentation of each container for format details.

Container	Item	from/to text	from/to list, dict, json
`TextParagraph`	`str`	✔	✔
`SegSentence`	`str`	✔	✔
`SegParagraph`	`SegSentence`	✔	✔
`NerToken`		✘	✔
`NerSentence`	`NerToken`	✘	✔
`NerParagraph`	`NerSentence`	✘	✔
`ParseClause`		to_text() only	✔
`ParseSentence`	`ParseClause`	to_text() only	✔
`ParseParagraph`	`ParseSentence`	to_text() only	✔
`CorefToken`		to_text() only	✔
`CorefSentence`	`CorefToken`	to_text() only	✔
`CorefParagraph`	`CorefSentence`	to_text() only	✔

WS with POS

Conversion routines are also available for handling word-segmentation (WS) and part-of-speech (POS) containers jointly. For example, WsPosToken provides routines for a word (str) together with its POS tag (str):

ws_obj, pos_obj = WsPosToken.from_text('中文字(Na)')
plain_text = WsPosToken.to_text(ws_obj, pos_obj)

ws_obj, pos_obj = WsPosToken.from_list([ '中文字', 'Na' ])
list_obj = WsPosToken.to_list(ws_obj, pos_obj)

ws_obj, pos_obj = WsPosToken.from_dict({ 'word': '中文字', 'pos': 'Na', })
dict_obj = WsPosToken.to_dict(ws_obj, pos_obj)

ws_obj, pos_obj = WsPosToken.from_json(json_str)
json_str = WsPosToken.to_json(ws_obj, pos_obj)

Similarly, WsPosSentence and WsPosParagraph provide routines for word-segmented and POS-tagged sentences and paragraphs (SegSentence and SegParagraph), respectively.
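
To make the token-level plain-text form concrete, here is a minimal standalone sketch of splitting and rebuilding text such as '中文字(Na)'. The helpers `split_word_pos` and `join_word_pos` are hypothetical and are not part of ckipnlp; the sketch assumes the POS tag is always the final parenthesized group.

```python
def split_word_pos(text):
    """Split 'word(POS)' into (word, pos). Hypothetical helper, not a ckipnlp function."""
    if not text.endswith(')'):
        raise ValueError('expected text of the form word(POS): {!r}'.format(text))
    # Split on the last '(' so that a word which itself contains '(' stays intact.
    word, sep, pos = text[:-1].rpartition('(')
    if not sep:
        raise ValueError('missing "(" in {!r}'.format(text))
    return word, pos


def join_word_pos(word, pos):
    """Inverse of split_word_pos (also hypothetical)."""
    return '{}({})'.format(word, pos)


word, pos = split_word_pos('中文字(Na)')
print(word, pos)
print(join_word_pos(word, pos))
```

Splitting on the last opening parenthesis rather than the first is a deliberate choice here, since punctuation tokens may themselves be parentheses.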

Parse Tree

In addition to ParseClause, there are also tree utilities based on TreeLib.

ParseTree is the tree structure of a parse clause. Use from_text() and to_text() for plain-text conversion; from_dict() and to_dict() for dictionary-like object conversion; and from_json() and to_json() for JSON string conversion.
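
As a rough picture of what the plain-text form encodes, the sketch below parses a bracketed clause into nested (label, children) tuples and writes it back. It does not use ckipnlp or TreeLib: `parse_tree_text` and `tree_to_text` are hypothetical helpers, and the grammar (children wrapped in parentheses and separated by `|`) is an assumption inferred from the sample clauses on this page.

```python
def parse_tree_text(text):
    """Parse bracketed tree text into nested (label, children) tuples (hypothetical helper)."""
    pos = 0

    def peek():
        # Current character, or '' once the end of the text is reached.
        return text[pos] if pos < len(text) else ''

    def parse_node():
        nonlocal pos
        start = pos
        # A label runs until the next structural character.
        while pos < len(text) and text[pos] not in '(|)':
            pos += 1
        label = text[start:pos]
        children = []
        if peek() == '(':
            pos += 1  # skip '('
            children.append(parse_node())
            while peek() == '|':
                pos += 1  # skip '|'
                children.append(parse_node())
            if peek() != ')':
                raise ValueError('unbalanced parentheses at offset {}'.format(pos))
            pos += 1  # skip ')'
        return label, children

    root = parse_node()
    if pos != len(text):
        raise ValueError('unexpected trailing text at offset {}'.format(pos))
    return root


def tree_to_text(node):
    """Write nested (label, children) tuples back to bracketed text."""
    label, children = node
    if not children:
        return label
    return '{}({})'.format(label, '|'.join(tree_to_text(child) for child in children))


tree_text = 'S(theme:NP(DUMMY1:NP(Head:Nhaa:我)|Head:Caa:和|DUMMY2:NP(Head:Naa:食物))|evaluation:Dbb:真的|quantity:Dab:都|degree:Dfa:很|negation:Dc:不|Head:VH21:開心)'
label, children = parse_tree_text(tree_text)
print(label)
print([child[0] for child in children])
print(tree_to_text((label, children)) == tree_text)
```

The nested tuples are only a stand-in for a real tree type; ParseTree itself stores the same information in TreeLib nodes.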

ParseTree also provides from_penn() and to_penn() methods for Penn Treebank conversion. One may use to_penn() together with SvgLing to generate SVG tree graphs.

ParseTree is a TreeLib tree with ParseNode as its nodes. The data of each node is stored in a ParseNodeData (accessed by node.data), which is a tuple of role (semantic role), pos (part-of-speech tag), and word.
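
The (role, pos, word) tuple can be illustrated with a small standalone sketch. `NodeData` and `parse_node_text` below are hypothetical names, not ckipnlp classes, and the rule for labels with fewer than three colon-separated parts (one part is a pos, two parts are role and pos) is an assumption based on labels such as `S`, `goal:NP`, and `Head:Nhaa:我` in the sample clauses.

```python
from typing import NamedTuple, Optional


class NodeData(NamedTuple):
    """Hypothetical stand-in for a (role, pos, word) node payload."""

    role: Optional[str]
    pos: Optional[str]
    word: Optional[str]


def parse_node_text(text):
    """Split a node label into role, pos, and word (assumed layout, see lead-in)."""
    # Split at most twice so that a word containing ':' is kept whole.
    parts = text.split(':', 2)
    if len(parts) == 1:
        return NodeData(None, parts[0], None)      # e.g. 'S'
    if len(parts) == 2:
        return NodeData(parts[0], parts[1], None)  # e.g. 'goal:NP'
    return NodeData(*parts)                        # e.g. 'Head:Nhaa:我'


for label in ('S', 'goal:NP', 'Head:Nhaa:我'):
    print(parse_node_text(label))
```

Under this reading, only leaf nodes carry a word, while inner nodes carry a role and a pos, and the root carries a pos alone.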

ParseTree provides useful methods: get_heads() finds the head words of the clause; get_relations() extracts all relations in the clause; get_subjects() returns the subjects of the clause.

from ckipnlp.container import ParseClause, ParseTree

# 我的早餐、午餐和晚餐都在那場比賽中被吃掉了
clause = ParseClause('S(goal:NP(possessor:N‧的(head:Nhaa:我|Head:DE:的)|Head:Nab(DUMMY1:Nab(DUMMY1:Nab:早餐|Head:Caa:、|DUMMY2:Naa:午餐)|Head:Caa:和|DUMMY2:Nab:晚餐))|quantity:Dab:都|condition:PP(Head:P21:在|DUMMY:GP(DUMMY:NP(Head:Nac:比賽)|Head:Ng:中))|agent:PP(Head:P02:被)|Head:VC31:吃掉|aspect:Di:了)')

tree = clause.to_tree()

print('Show Tree')
tree.show()

print('Get Heads of {}'.format(tree[5]))
print('-- Semantic --')
for head in tree.get_heads(5, semantic=True): print(repr(head))
print('-- Syntactic --')
for head in tree.get_heads(5, semantic=False): print(repr(head))
print()

print('Get Relations of {}'.format(tree[0]))
print('-- Semantic --')
for rel in tree.get_relations(0, semantic=True): print(repr(rel))
print('-- Syntactic --')
for rel in tree.get_relations(0, semantic=False): print(repr(rel))
print()

# 我和食物真的都很不開心
tree_text = 'S(theme:NP(DUMMY1:NP(Head:Nhaa:我)|Head:Caa:和|DUMMY2:NP(Head:Naa:食物))|evaluation:Dbb:真的|quantity:Dab:都|degree:Dfa:很|negation:Dc:不|Head:VH21:開心)'

tree = ParseTree.from_text(tree_text)

print('Show Tree')
tree.show()

print('Get Subjects of {}'.format(tree[0]))
print('-- Semantic --')
for subject in tree.get_subjects(0, semantic=True): print(repr(subject))
print('-- Syntactic --')
for subject in tree.get_subjects(0, semantic=False): print(repr(subject))
print()