A word2vec + BiLSTM-CRF model for CCKS2019 Chinese clinical named entity recognition.
The dataset is provided by CCKS2019. The table below lists the number of texts and the number of annotated entities of each type.
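| Texts (文本) | Disease and diagnosis (疾病和诊断) | Imaging examination (影像检查) | Laboratory test (实验室检验) | Operation (手术) | Drug (药物) | Anatomical site (解剖部位) | Total (总数) |
|---|---|---|---|---|---|---|---|
| 1000 | 2116 | 222 | 318 | 765 | 456 | 1486 | 5363 |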
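* The tagged data should be in the following format, with one character and its BIO tag per line and a blank line between examples (the second example below reads "each example is separated by a blank line"):
患 O
者 O
罹 O
患 O
胃 B-疾病和诊断
癌 I-疾病和诊断

每 O
个 O
例 O
子 O
空 O
行 O
分 O
隔 O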
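
The tagged format above (one character and its tag per line, blank line between examples) can be read with a few lines of Python. This is a minimal sketch, not the repository's own preprocessing code: the function names `read_tagged` and `extract_entities` are hypothetical, and it assumes every non-empty line holds exactly one character and one tag separated by whitespace.

```python
from typing import Iterable, List, Tuple

Token = Tuple[str, str]  # (character, tag)


def read_tagged(lines: Iterable[str]) -> List[List[Token]]:
    """Group "char tag" lines into examples; a blank line ends an example."""
    examples: List[List[Token]] = []
    current: List[Token] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            if current:
                examples.append(current)
                current = []
            continue
        # Assumption: exactly two whitespace-separated fields per line.
        char, tag = line.split()
        current.append((char, tag))
    if current:
        examples.append(current)
    return examples


def extract_entities(example: List[Token]) -> List[Tuple[int, int, str, str]]:
    """Return (start, end, type, text) spans from BIO tags; end is exclusive."""
    entities = []
    start, etype = None, None
    # A trailing "O" sentinel closes an entity that runs to the end.
    for i, (_, tag) in enumerate(example + [("", "O")]):
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            text = "".join(ch for ch, _ in example[start:i])
            entities.append((start, i, etype, text))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities


sample = ["患 O", "者 O", "罹 O", "患 O", "胃 B-疾病和诊断", "癌 I-疾病和诊断", "", "每 O", "个 O"]
examples = read_tagged(sample)
print(extract_entities(examples[0]))
```

An `I-` tag that does not continue an open entity of the same type is ignored in this sketch; a stricter reader might raise an error instead.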
* "untagged_test_data.txt" should be in the following format:
患 者 罹 患 胃 癌

每 个 例 子 空 行 分 隔
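
The untagged format above (space-separated characters, one example per line) is simple to read and write. The sketch below is illustrative only: `read_untagged` and `to_untagged_line` are hypothetical helper names, not functions from this repository.

```python
from typing import Iterable, List


def read_untagged(lines: Iterable[str]) -> List[List[str]]:
    """Return one list of characters per non-empty line; blank lines are skipped."""
    return [line.split() for line in lines if line.strip()]


def to_untagged_line(text: str) -> str:
    """Turn a raw sentence into the space-separated character format."""
    # Assumption: whitespace inside the sentence is dropped, not kept as a token.
    return " ".join(ch for ch in text if not ch.isspace())


print(read_untagged(["患 者 罹 患 胃 癌\n", "\n", "每 个 例 子 空 行 分 隔\n"]))
print(to_untagged_line("患者罹患胃癌"))
```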
# Getting Started
## Data configuration
* Please download the dataset from CCKS2019 yourself.
* Put the tagged data under the directory "/data/original_data/tagged_data/".
* Put the untagged data under the directory "/data/original_data/untagged_data/".
## Preprocess
python preprocess --tagged True

python preprocess --tagged False
## Train the model
python main.py --mode train
## Test the model
python main.py --mode test
The predictions, gold-standard results, and evaluation are saved as "test_results.json", "true_results.json", and "eval_results.csv", respectively.
## Predict
python main.py --mode predict
The predictions are saved as "pred_results.json".
# Performance
| | Disease and diagnosis (疾病和诊断) | Imaging examination (影像检查) | Laboratory test (实验室检验) | Operation (手术) | Drug (药物) | Anatomical site (解剖部位) | Overall (综合) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Strict metric (严格指标) | 0.49346 | 0.51851 | 0.41049 | 0.55263 | 0.46835 | 0.49975 | 0.49018 |
| Relaxed metric (松弛指标) | 0.58800 | 0.58370 | 0.54920 | 0.67105 | 0.55485 | 0.55902 | 0.56851 |
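
This README does not define the strict and relaxed metrics. The sketch below shows one common reading, stated here as an assumption: a strict match requires the same entity type and identical boundaries, while a relaxed match requires the same type and any overlap between the predicted and gold spans. The function names and the precision/recall/F1 output are illustrative and may differ from what "eval_results.csv" reports.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def overlaps(a: Span, b: Span) -> bool:
    """True when two spans share a type and at least one character."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def prf(pred: List[Span], gold: List[Span], relaxed: bool = False):
    """Precision, recall, and F1 under the assumed strict or relaxed matching."""
    if relaxed:
        hit_pred = sum(any(overlaps(p, g) for g in gold) for p in pred)
        hit_gold = sum(any(overlaps(g, p) for p in pred) for g in gold)
    else:
        gold_set, pred_set = set(gold), set(pred)
        hit_pred = sum(p in gold_set for p in pred)
        hit_gold = sum(g in pred_set for g in gold)
    precision = hit_pred / len(pred) if pred else 0.0
    recall = hit_gold / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up spans from a single text, for illustration only.
gold = [(4, 6, "疾病和诊断"), (10, 12, "药物")]
pred = [(4, 6, "疾病和诊断"), (9, 12, "药物"), (20, 22, "手术")]
print(prf(pred, gold))                # strict
print(prf(pred, gold, relaxed=True))  # relaxed
```

Because every strict match is also a relaxed match, the relaxed score can never be lower than the strict one, which is consistent with the table above.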