BioNEV (Biomedical Network Embedding Evaluation)

1. Introduction

This repository contains the source code and datasets for the paper "Graph Embedding on Biomedical Networks: Methods, Applications, and Evaluations" (accepted by Bioinformatics). This work systematically evaluates recent graph embedding techniques on biomedical tasks. We compile 5 benchmark datasets for 4 biomedical prediction tasks (see the paper for details) and use them to evaluate 11 representative graph embedding methods drawn from three categories: matrix factorization-based, random walk-based, and neural network-based methods.

The code can also be applied to graphs in other domains (e.g., social networks, citation networks). More experimental details can be found in Supplementary Materials.

Please kindly cite the paper if you use the code, datasets or any results in this repo or in the paper:

@article{yue2020graph,
  title={Graph embedding on biomedical networks: methods, applications and evaluations},
  author={Yue, Xiang and Wang, Zhen and Huang, Jingong and Parthasarathy, Srinivasan and Moosavinasab, Soheil and Huang, Yungui and Lin, Simon M and Zhang, Wen and Zhang, Ping and Sun, Huan},
  journal={Bioinformatics},
  volume={36},
  number={4},
  pages={1241--1251},
  year={2020},
  publisher={Oxford University Press}
}

2. Pipeline


Fig. 1: Pipeline for applying graph embedding methods to biomedical tasks. Low-dimensional node representations are first learned from biomedical networks by graph embedding methods and then used as features to build task-specific classifiers. (a) Matrix factorization-based methods take a data matrix (e.g., the adjacency matrix) as input and learn embeddings by factorizing it. (b) Random walk-based methods first generate sequences of nodes through random walks and then feed the sequences into the word2vec model to learn node representations. (c) Neural network-based methods vary in architecture and input across models.

3. Dataset

Datasets used in the paper:

Statistics:

| Task Type           | Dataset        | #nodes | #edges    | Density | #labels |
|---------------------|----------------|--------|-----------|---------|---------|
| Link Prediction     | CTD DDA        | 12,765 | 92,813    | 0.11%   | -       |
| Link Prediction     | NDFRT DDA      | 13,545 | 56,515    | 0.06%   | -       |
| Link Prediction     | DrugBank DDI   | 2,191  | 242,027   | 10.08%  | -       |
| Link Prediction     | STRING PPI     | 15,131 | 359,776   | 0.31%   | -       |
| Node Classification | Clin Term COOC | 48,651 | 1,659,249 | 0.14%   | 31      |
| Node Classification | node2vec PPI   | 3,890  | 76,584    | 1.01%   | 50      |
| Node Classification | Mashup PPI     | 16,143 | 300,181   | 0.23%   | 28      |
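The Density column is the fraction of possible undirected edges that are present, i.e. 2·#edges / (#nodes·(#nodes − 1)); the table's values can be reproduced from the node and edge counts (up to rounding):

```python
def density(num_nodes, num_edges):
    """Fraction of possible undirected edges that are present."""
    return 2 * num_edges / (num_nodes * (num_nodes - 1))

print(f"CTD DDA:      {density(12765, 92813):.2%}")
print(f"node2vec PPI: {density(3890, 76584):.2%}")
```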

4. Pre-trained Embeddings

We also release the best-performing pre-trained representations of nodes (e.g., drugs, diseases, proteins, UMLS concepts) on each dataset. These pre-trained vectors can be used directly as node features for downstream prediction tasks.

All the pretrained vectors can be downloaded here. The files are formatted as:

node_num, embedding_dimension
index_1, embedding vector 1
index_2, embedding vector 2
...

The mapping from index to node name (or original ID) can be found in each dataset's directory.
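A minimal loader for files in this format might look as follows, assuming whitespace-separated values as in the word2vec text format (the commas above separate fields in the description, not in the file; the helper name is ours, not BioNEV's):

```python
import numpy as np

def load_embeddings(path):
    """Parse an embedding file: first line is 'node_num embedding_dimension',
    then one line per node: index followed by its embedding vector."""
    with open(path) as f:
        num_nodes, dim = map(int, f.readline().split())
        emb = {}
        for line in f:
            parts = line.split()
            emb[parts[0]] = np.asarray(parts[1:], dtype=float)
    # Sanity-check against the declared header
    assert len(emb) == num_nodes
    assert all(v.size == dim for v in emb.values())
    return emb
```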

5. Code

The graph embedding learning for Laplacian Eigenmaps, Graph Factorization, HOPE, GraRep, DeepWalk, node2vec, LINE, and SDNE uses code from OpenNE. The code for struc2vec and GAE comes from their original authors. To ensure the different source codes run successfully in our framework, we modified parts of them.

Installation

Use the following command to install directly from GitHub:

$ pip install git+https://github.com/xiangyue9607/BioNEV.git

Alternatively, use the following commands to install the latest code in development mode (using -e):

$ git clone https://github.com/xiangyue9607/BioNEV.git
$ cd BioNEV
$ pip install -e .

General Options

Specific Options

Running examples

bionev --input ./data/DrugBank_DDI/DrugBank_DDI.edgelist \
       --output ./embeddings/DeepWalk_DrugBank_DDI.txt \
       --method DeepWalk \
       --task link-prediction \
       --eval-result-file eval_result.txt
bionev --input ./data/Clin_Term_COOC/Clin_Term_COOC.edgelist \
       --label-file ./data/Clin_Term_COOC/Clin_Term_COOC_labels.txt \
       --output ./embeddings/LINE_COOC.txt \
       --method LINE \
       --task node-classification \
       --weighted True
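For link prediction, learned node embeddings are typically combined into per-edge features before training a classifier. A common choice is the Hadamard (element-wise) product; the sketch below illustrates the idea and is not BioNEV's exact evaluation code (the function and node names are hypothetical):

```python
import numpy as np

def edge_features(emb, edges, op="hadamard"):
    """Combine the two endpoint embeddings of each edge into one feature vector."""
    feats = []
    for u, v in edges:
        a, b = emb[u], emb[v]
        if op == "hadamard":
            feats.append(a * b)        # element-wise product
        elif op == "average":
            feats.append((a + b) / 2)  # element-wise mean
        else:
            raise ValueError(f"unknown op: {op}")
    return np.vstack(feats)

# Toy embeddings for two nodes (e.g., drugs in a DDI network)
emb = {"drug1": np.array([0.1, 0.5]), "drug2": np.array([0.2, 0.4])}
X = edge_features(emb, [("drug1", "drug2")])
```

`X` can then be fed to any binary classifier (e.g., logistic regression) with positive labels for observed edges and negatives for sampled non-edges.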

6. Contact

Feel free to contact Xiang Yue or Huan Sun with any questions about the paper, datasets, code, or results.