This repository contains the API server, neural models, and UI client for Covidex, a neural search engine for the COVID-19 Open Research Dataset (CORD-19). For a description of our system, check out this paper: Covidex: Neural Ranking Models and Keyword Search Infrastructure for the COVID-19 Open Research Dataset.
We also provide neural search infrastructure for searching domain-specific scholarly literature via Cydex. This paper details the abstractions developed on top of Covidex to facilitate domain-specific search: Cydex: Neural Search Infrastructure for the Scholarly Literature.
sudo apt-get install nvidia-cuda-toolkit
wget https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh
bash Anaconda3-2020.02-Linux-x86_64.sh
sudo apt-get install openjdk-11-jre openjdk-11-jdk maven
conda create -n covidex python=3.7
conda activate covidex
Install the Python dependencies from inside api/:
cd api
pip install -r requirements.txt
Set up the index and environment variables
Build Anserini indices for your dataset. We provide instructions for setting up Covidex with both CORD-19 and the ACL Anthology. Instructions for adding support for new datasets are in docs/adding-datasets.md.
Set up environment variables by copying the defaults from api/.env.sample into a new api/.env file and modifying them as needed. This requires setting the correct index and schema locations, CUDA devices, and enabling/disabling various services (highlighting, related search, neural ranking, etc.). Set DEVELOPMENT=False for production deployments.
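The copy step above can be sketched as a small shell helper. This is only an illustration: `init_env` and `is_production` are hypothetical names, not part of the repository, and the check assumes the setting appears in api/.env as a plain `DEVELOPMENT=False` line.

```shell
# Sketch: create api/.env from the sample without overwriting an existing file.
# init_env and is_production are hypothetical helpers, not repository scripts.
init_env() {
  api_dir="${1:-api}"
  if [ ! -f "$api_dir/.env.sample" ]; then
    echo "missing $api_dir/.env.sample" >&2
    return 1
  fi
  if [ -f "$api_dir/.env" ]; then
    echo "$api_dir/.env already exists; leaving it untouched"
    return 0
  fi
  cp "$api_dir/.env.sample" "$api_dir/.env"
  echo "created $api_dir/.env; edit it before starting the server"
}

# Succeeds only if the file contains a line that is exactly DEVELOPMENT=False
# (assumed format; adjust if your .env uses quoting or spaces).
is_production() {
  grep -q '^DEVELOPMENT=False$' "${1:-api}/.env"
}
```

From the repository root, `init_env api` creates the file, and `is_production api || echo "DEVELOPMENT is not False"` gives a quick check before a production deployment.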
Install Node.js 14+ and Yarn.
Install dependencies from inside /client
yarn install
Serve the UI from inside /client. The client will be running at localhost:3000.
yarn start
Separately, run the API server from inside /api. The server will be running at localhost:8000.
uvicorn app.main:app --reload --port=8000
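If the client fails to start, the Node.js version is worth checking first, since the steps above require Node.js 14+. A minimal sketch, assuming `node --version` prints a string such as `v14.17.0`; `node_major` and `node_ok` are hypothetical helper names.

```shell
# Sketch: check a Node.js version string against the 14+ requirement.
# node_major and node_ok are hypothetical helpers for illustration.
node_major() {
  # Strip the leading "v", then everything from the first dot onward.
  v="${1#v}"
  echo "${v%%.*}"
}

node_ok() {
  # Succeeds when the major version is at least 14.
  [ "$(node_major "$1")" -ge 14 ]
}
```

For example, `node_ok "$(node --version)" || echo "Node.js 14+ is required"`.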
We provide a script under scripts/deploy-prod.sh to start the API server and serve the UI build files. This assumes the environment is set up correctly and api/.env contains DEVELOPMENT=False.
Start the server (deploys to port 8000 by default):
sh scripts/deploy-prod.sh
Optional: set the environment variable PORT to use a different port:
PORT=8080 sh scripts/deploy-prod.sh
Route port 80 to the port we deployed to (8000 unless PORT was set):
sudo iptables -t nat -A PREROUTING -p tcp --dport 80 -j REDIRECT --to-port 8000
If we're having trouble accessing the service, check that there aren't any conflicting rules:
sudo iptables -t nat -L -n -v
If there are conflicting rules, we should delete them:
sudo iptables -t nat -D PREROUTING -p tcp --dport 80 -j REDIRECT --to-port UNWANTED_PORT
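The port handling in the steps above can be sketched as follows. This assumes the deployment script falls back to 8000 when PORT is unset, as described; the actual logic in scripts/deploy-prod.sh may differ, and `resolve_port` and `redirect_rule` are hypothetical helpers that only print the rule rather than applying it.

```shell
# Sketch: resolve the deployment port and build the matching redirect rule.
# resolve_port and redirect_rule are hypothetical helpers for illustration.
resolve_port() {
  # Fall back to 8000 when PORT is unset or empty.
  echo "${PORT:-8000}"
}

redirect_rule() {
  # Print (do not run) the iptables rule that redirects port 80 to port $1.
  echo "sudo iptables -t nat -A PREROUTING -p tcp --dport 80 -j REDIRECT --to-port $1"
}
```

Running `redirect_rule "$(resolve_port)"` prints the command to review before executing it, which keeps the redirect in sync with whatever PORT the server was started with.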
Log files are available under api/logs. New files are created daily based on UTC time. All filenames have the date appended, except for the current one, which will be named search.log or related.log.
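To separate the current log from older ones, a helper along these lines may be useful. It assumes the date is appended after the `.log` suffix (for example `search.log.<date>`); the exact date format is not stated here, so the pattern matches any non-empty suffix. `list_rotated` is a hypothetical name.

```shell
# Sketch: list dated log files for one log type ("search" or "related").
# list_rotated is a hypothetical helper; the suffix layout is an assumption.
list_rotated() {
  log_dir="${2:-api/logs}"
  # The glob requires at least one character after ".log", so the current
  # file (search.log or related.log) is never included.
  for f in "$log_dir/$1.log"?*; do
    if [ -e "$f" ]; then
      echo "$f"
    fi
  done
}
```

For example, `list_rotated search` prints the older search logs, while `tail -f api/logs/search.log` follows the current one.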
Run all API tests:
TESTING=true pytest api
@inproceedings{zhang2020covidex,
title = "Covidex: Neural Ranking Models and Keyword Search Infrastructure for the {COVID}-19 Open Research Dataset",
author = "Zhang, Edwin and
Gupta, Nikhil and
Tang, Raphael and
Han, Xiao and
Pradeep, Ronak and
Lu, Kuang and
Zhang, Yue and
Nogueira, Rodrigo and
Cho, Kyunghyun and
Fang, Hui and
Lin, Jimmy",
booktitle = "Proceedings of the First Workshop on Scholarly Document Processing",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.sdp-1.5",
doi = "10.18653/v1/2020.sdp-1.5",
pages = "31--41",
}