Introduction

Official CKIP CoreNLP Toolkits

Features

  • Sentence Segmentation

  • Word Segmentation

  • Part-of-Speech Tagging

  • Sentence Parsing

  • Named-Entity Recognition

  • Co-Reference Detection

Contributors

Installation

Requirements

Tool Requirements

Tool                        Built-in    CkipTagger    CkipClassic
Sentence Segmentation       ✓
Word Segmentation†                      ✓             ✓
Part-of-Speech Tagging†                 ✓             ✓
Sentence Parsing                                      ✓
Named-Entity Recognition                ✓
Co-Reference Detection‡

  • † These tools require only one of the two backends (either CkipTagger or CkipClassic).

  • ‡ The co-reference implementation does not require a backend, but it depends on the results of word segmentation, part-of-speech tagging, sentence parsing, and named-entity recognition.

Installation via Pip
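
The package itself can be installed from PyPI (the CkipTagger and CkipClassic backends are optional and installed separately; see their respective documentation):

pip install -U ckipnlp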

Usage

See https://ckipnlp.readthedocs.io/en/latest/_api/ckipnlp.html for API details.

Pipelines

Core Pipeline

(Core pipeline diagram: _static/image/pipeline.svg)
from ckipnlp.pipeline import CkipPipeline, CkipDocument

pipeline = CkipPipeline()
doc = CkipDocument(raw='中文字喔,啊哈哈哈')

# Word Segmentation
pipeline.get_ws(doc)
print(doc.ws)
for line in doc.ws:
    print(line.to_text())

# Part-of-Speech Tagging
pipeline.get_pos(doc)
print(doc.pos)
for line in doc.pos:
    print(line.to_text())

# Named-Entity Recognition
pipeline.get_ner(doc)
print(doc.ner)

# Sentence Parsing
pipeline.get_parsed(doc)
print(doc.parsed)

################################################################

from ckipnlp.container.util.wspos import WsPosParagraph

# Word Segmentation & Part-of-Speech Tagging
for line in WsPosParagraph.to_text(doc.ws, doc.pos):
    print(line)

Co-Reference Pipeline

(Co-reference pipeline diagram: _static/image/coref_pipeline.svg)
from ckipnlp.pipeline import CkipCorefPipeline, CkipDocument

pipeline = CkipCorefPipeline()
doc = CkipDocument(raw='畢卡索他想,完蛋了')

# Co-Reference
corefdoc = pipeline(doc)
print(corefdoc.coref)
for line in corefdoc.coref:
    print(line.to_text())

Containers

The container objects provide the following methods:

  • from_text(), to_text() for plain-text format conversions;

  • from_dict(), to_dict() for dictionary-like format conversions;

  • from_list(), to_list() for list-like format conversions;

  • from_json(), to_json() for JSON format conversions (based on the dictionary-like format conversions).

The following are the interfaces, where CONTAINER_CLASS refers to the container class.

obj = CONTAINER_CLASS.from_text(plain_text)
plain_text = obj.to_text()

obj = CONTAINER_CLASS.from_dict({ key: value })
dict_obj = obj.to_dict()

obj = CONTAINER_CLASS.from_list([ value1, value2 ])
list_obj = obj.to_list()

obj = CONTAINER_CLASS.from_json(json_str)
json_str = obj.to_json()

Note that not all containers provide all of the above methods. The following table lists the implemented methods. Please refer to the documentation of each container for the detailed formats.

Container          Item             from/to text    from/to dict, list, json
TextParagraph      str              ✓               ✓
SegSentence        str              ✓               ✓
SegParagraph       SegSentence      ✓               ✓
NerToken                                            ✓
NerSentence        NerToken                         ✓
NerParagraph       NerSentence                      ✓
ParsedParagraph    str              ✓               ✓
CorefToken                          only to         ✓
CorefSentence      CorefToken       only to         ✓
CorefParagraph     CorefSentence    only to         ✓

WS with POS

There are also conversion routines that handle word-segmentation and POS containers jointly. For example, WsPosToken provides routines for a word (str) paired with a POS tag (str):

from ckipnlp.container.util.wspos import WsPosToken

ws_obj, pos_obj = WsPosToken.from_text('中文字(Na)')
plain_text = WsPosToken.to_text(ws_obj, pos_obj)

ws_obj, pos_obj = WsPosToken.from_dict({ 'word': '中文字', 'pos': 'Na', })
dict_obj = WsPosToken.to_dict(ws_obj, pos_obj)

ws_obj, pos_obj = WsPosToken.from_list([ '中文字', 'Na' ])
list_obj = WsPosToken.to_list(ws_obj, pos_obj)

ws_obj, pos_obj = WsPosToken.from_json(json_str)
json_str = WsPosToken.to_json(ws_obj, pos_obj)

Similarly, WsPosSentence/WsPosParagraph provide routines for word-segmented and POS sentences/paragraphs (SegSentence/SegParagraph), respectively.
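
For instance, continuing the Core Pipeline example above, a single sentence can be rendered jointly. This is a minimal sketch, assuming doc.ws and doc.pos are already populated and that WsPosSentence.to_text mirrors the WsPosParagraph.to_text signature shown earlier:

from ckipnlp.container.util.wspos import WsPosSentence

# Render the first sentence together with its POS tags, e.g. 中文字(Na) 喔(T) ...
print(WsPosSentence.to_text(doc.ws[0], doc.pos[0]))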

Parsed Tree

In addition to ParsedParagraph, we have implemented tree utilities based on TreeLib.

ParsedTree is the tree structure of a parsed sentence. One may use from_text() and to_text() for plain-text conversion; from_dict(), to_dict() for dictionary-like object conversion; and also from_json(), to_json() for JSON string conversion.

The ParsedTree is a TreeLib tree with ParsedNode as its nodes. The data of these nodes is stored in a ParsedNodeData (accessed via node.data), which is a tuple of role (semantic role), pos (part-of-speech tag), and word.

ParsedTree provides useful methods: get_heads() finds the head words of the sentence; get_relations() extracts all relations in the sentence; get_subjects() returns the subjects of the sentence.

from ckipnlp.container import ParsedTree

# 我的早餐、午餐和晚餐都在那場比賽中被吃掉了
tree_text = 'S(goal:NP(possessor:N‧的(head:Nhaa:我|Head:DE:的)|Head:Nab(DUMMY1:Nab(DUMMY1:Nab:早餐|Head:Caa:、|DUMMY2:Naa:午餐)|Head:Caa:和|DUMMY2:Nab:晚餐))|quantity:Dab:都|condition:PP(Head:P21:在|DUMMY:GP(DUMMY:NP(Head:Nac:比賽)|Head:Ng:中))|agent:PP(Head:P02:被)|Head:VC31:吃掉|aspect:Di:了)'

tree = ParsedTree.from_text(tree_text, normalize=False)

print('Show Tree')
tree.show()

print('Get Heads of {}'.format(tree[5]))
print('-- Semantic --')
for head in tree.get_heads(5, semantic=True): print(repr(head))
print('-- Syntactic --')
for head in tree.get_heads(5, semantic=False): print(repr(head))
print()

print('Get Relations of {}'.format(tree[0]))
print('-- Semantic --')
for rel in tree.get_relations(0, semantic=True): print(repr(rel))
print('-- Syntactic --')
for rel in tree.get_relations(0, semantic=False): print(repr(rel))
print()

# 我和食物真的都很不開心
tree_text = 'S(theme:NP(DUMMY1:NP(Head:Nhaa:我)|Head:Caa:和|DUMMY2:NP(Head:Naa:食物))|evaluation:Dbb:真的|quantity:Dab:都|degree:Dfa:很|negation:Dc:不|Head:VH21:開心)'

tree = ParsedTree.from_text(tree_text, normalize=False)

print('Show Tree')
tree.show()

print('Get Subjects of {}'.format(tree[0]))
print('-- Semantic --')
for subject in tree.get_subjects(0, semantic=True): print(repr(subject))
print('-- Syntactic --')
for subject in tree.get_subjects(0, semantic=False): print(repr(subject))
print()
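
Because ParsedTree is a TreeLib tree, generic TreeLib traversal also works. The following is a minimal sketch, assuming the tree built above and that ParsedNodeData exposes its fields as attributes as described earlier:

# Walk every node and print its identifier together with its role/POS/word data.
for node in tree.all_nodes():
    print(node.identifier, node.data.role, node.data.pos, node.data.word)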

License

CC BY-NC-SA 4.0

Copyright (c) 2018-2020 CKIP Lab under the CC BY-NC-SA 4.0 License.