# CCKS2019 Chinese Clinical NER

A BiLSTM-CRF model with word2vec embeddings for the CCKS2019 Chinese clinical named entity recognition task.

# Dependencies

# Dataset

The dataset is provided by CCKS2019.

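Texts | Disease and diagnosis | Imaging examination | Laboratory test | Operation | Drug | Anatomical site | Total
------------- | ------------- | ------------ | -------------| ------------ | ------------- | ------------ | -------------
1000 | 2116 | 222 | 318 | 765 | 456 | 1486 | 5363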

## Data directory structure

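每 O 个 O 例 O 子 O 空 O 行 O 分 O 隔 O

(The sample reads "each example is separated by a blank line"; every character is followed by its tag.)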

* "untagged_test_data.txt" should be in the following format:

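患 者 罹 患 胃 癌

每 个 例 子 空 行 分 隔

(The samples read "the patient suffers from gastric cancer" and "each example is separated by a blank line".)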
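As an illustration, both file formats could be read with a small helper like the one below. This is a sketch, not the repository's own loader: the function names `split_examples`, `read_tagged` and `read_untagged` are hypothetical, and it assumes whitespace-separated tokens, with characters and tags alternating in the tagged file and examples separated by blank lines.

```python
import re
from typing import List, Tuple


def split_examples(text: str) -> List[str]:
    """Split raw file content into examples on one or more blank lines."""
    return [block for block in re.split(r"\n\s*\n", text.strip()) if block.strip()]


def read_tagged(text: str) -> List[List[Tuple[str, str]]]:
    """Parse tagged data: tokens alternate character, tag, character, tag, ..."""
    examples = []
    for block in split_examples(text):
        tokens = block.split()
        if len(tokens) % 2 != 0:
            raise ValueError(f"odd number of tokens in example: {block!r}")
        # Even positions hold characters, odd positions hold their tags.
        examples.append(list(zip(tokens[0::2], tokens[1::2])))
    return examples


def read_untagged(text: str) -> List[List[str]]:
    """Parse untagged data: every token is a single character."""
    return [block.split() for block in split_examples(text)]


tagged = read_tagged("每 O 个 O 例 O\n\n子 O 空 O")
untagged = read_untagged("患 者 罹 患 胃 癌\n\n每 个 例 子 空 行 分 隔")
print(tagged[0])
print(len(untagged))
```

Because each example is split on any whitespace, the same helper works whether the character and tag pairs sit on one line or on separate lines.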


# Getting Started
## Data configuration
* Download the dataset from CCKS2019 yourself.
* Put the tagged data under the directory "/data/original_data/tagged_data/".
* Put the untagged data under the directory "/data/original_data/untagged_data/".
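The directory layout above could be prepared with a short script such as the following. It is only a sketch: `ensure_data_dirs` is a hypothetical helper, and it assumes the two paths are relative to the repository root rather than the filesystem root.

```python
from pathlib import Path

# Directory names taken from the list above; treating them as relative to the
# repository root is an assumption.
DATA_DIRS = (
    "data/original_data/tagged_data",
    "data/original_data/untagged_data",
)


def ensure_data_dirs(repo_root: str = ".") -> list:
    """Create the expected data directories if missing and return their paths."""
    created = []
    for rel in DATA_DIRS:
        path = Path(repo_root) / rel
        path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        created.append(path)
    return created
```

Calling `ensure_data_dirs()` from the repository root leaves existing files untouched, so it is safe to run before copying the downloaded data in.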

## Preprocess

python preprocess --tagged True

python preprocess --tagged False


## Train the model

python main.py --mode train


## Test the model

python main.py --mode test

The predictions, gold-standard results, and evaluation are saved as "test_results.json", "true_results.json", and "eval_results.csv", respectively.

## Predict

python main.py --mode predict


The predictions are saved as "pred_results.json".

# Performance
| | Disease and diagnosis | Imaging examination | Laboratory test | Operation | Drug | Anatomical site | Overall
------------- | ------------- | ------------ | -------------| ------------ | ------------- | ------------ | -------------
Strict metric |  0.49346 | 0.51851 | 0.41049 | 0.55263 | 0.46835 | 0.49975 | 0.49018
Relaxed metric |  0.58800 | 0.58370 | 0.54920 | 0.67105 | 0.55485 | 0.55902 | 0.56851
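
The two rows differ in how a predicted entity is matched against the gold standard. The sketch below assumes the usual CCKS2019 definitions: a strict match requires the same type and identical boundaries, while a relaxed match requires the same type and overlapping boundaries. Entities are represented as hypothetical `(start, end, type)` tuples with an exclusive `end`. The function names are illustrative rather than the repository's evaluation code, and the use of F1 as the reported score is also an assumption.

```python
from typing import Callable, List, Tuple

Entity = Tuple[int, int, str]  # (start, end_exclusive, type); hypothetical layout


def strict_match(pred: Entity, gold: Entity) -> bool:
    """Same type and exactly the same boundaries."""
    return pred == gold


def relaxed_match(pred: Entity, gold: Entity) -> bool:
    """Same type and at least one overlapping character."""
    return pred[2] == gold[2] and pred[0] < gold[1] and gold[0] < pred[1]


def f1_score(preds: List[Entity], golds: List[Entity],
             match: Callable[[Entity, Entity], bool]) -> float:
    """F1 over one document, matching each gold entity at most once."""
    unused = list(golds)
    hits = 0
    for pred in preds:
        for gold in unused:
            if match(pred, gold):
                unused.remove(gold)  # safe: we leave the inner loop right away
                hits += 1
                break
    if hits == 0:
        return 0.0
    precision = hits / len(preds)
    recall = hits / len(golds)
    return 2 * precision * recall / (precision + recall)


golds = [(0, 2, "Drug"), (5, 9, "Operation")]
preds = [(0, 2, "Drug"), (6, 9, "Operation")]
print(f1_score(preds, golds, strict_match))   # only the first entity matches exactly
print(f1_score(preds, golds, relaxed_match))  # both predictions overlap a gold entity
```

Under these definitions the relaxed score can never be lower than the strict one, which is consistent with every column of the table.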