Source Code Repository for the Cognitive Search-based COVID-19 Search App
If you simply want to see this code in a running instance, feel free to use https://covid19search.azurewebsites.net. Otherwise, you can follow the setup instructions below to create your own instance in your Azure subscription.
This repository contains:
- AzureCognitiveSearchService: The components to set up the Cognitive Search service
- Concatenator: An Azure Function to reformat names which is invoked as a custom skill
- InvokeHealthEntityExtraction: An Azure Function to call the Text Analytics for Health container which is invoked as a custom skill
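As a rough illustration of what a custom skill such as Concatenator does, here is a minimal sketch of the request and response shape that Azure Cognitive Search uses for custom Web API skills (a "values" array whose records carry a "recordId" and a "data" object). The input names "first", "middle", and "last" and the output name "fullName" are assumptions made for this example, not necessarily the fields used by the function in this repository.

```python
import json


def concatenate_names(request_body: str) -> str:
    """Handle one custom-skill request and return the JSON response body."""
    records = json.loads(request_body)["values"]
    results = []
    for record in records:
        data = record.get("data", {})
        # "first", "middle" and "last" are hypothetical input names,
        # assumed here to be plain strings.
        parts = [data.get(key) or "" for key in ("first", "middle", "last")]
        full_name = " ".join(part.strip() for part in parts if part.strip())
        results.append({
            "recordId": record["recordId"],   # echo the incoming recordId
            "data": {"fullName": full_name},  # hypothetical output name
            "errors": [],
            "warnings": [],
        })
    return json.dumps({"values": results})
```

Each response record repeats the recordId of the request record it answers, which is how the skillset matches enriched output back to the right document.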
There is an overwhelming amount of information (and misinformation) about COVID-19. How can we use AI to better understand this novel coronavirus? In this code, we take an open dataset of research papers on COVID-19 and apply several machine learning techniques (named entity recognition of medical terms, finding semantically similar words, contextual summarization, and knowledge graphs) which can help first responders and medical professionals better find and make sense of the research they need.
Data is pulled from two folders in the same Azure blob storage container. The main indexer runs data in json format through a skillset which reshapes the data and extracts medical entities, and puts the enriched data in the search index. A second metadata indexer pulls additional metadata into the same search index.
First, you will need an Azure account. If you don't already have one, you can start a free trial of Azure here.
Secondly, our implementation uses the Text Analytics for Health container for medical entity extraction. Once you have received access, you will need to set up the container as instructed in their README. Then, you will need to update the InvokeHealthEntityExtraction Azure function with the location of your running container. You will also need to download a file umls_concept_dict.pickle that is too big to host on GitHub, which will allow lookup of UMLS entities.
Specifically, in the InvokeHealthEntityExtraction\InvokeHealthEntityExtraction folder:
- In the __init__.py file, change the ta_url variable to the URL of your TA Health container. This value should look something like "http://ta-health-container.westus2.azurecontainer.io:5000/text/analytics/v3.0-preview.1/domains/health".
- Download the umls_concept_dict.pickle file and save it to the InvokeHealthEntityExtraction\InvokeHealthEntityExtraction directory (the same directory as __init__.py) so that it deploys with the Azure function.
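The two changes above can be sketched roughly as follows. This is not the code from the function itself: the helper names build_health_request and load_umls_dict are hypothetical, the request body follows the general Text Analytics "documents" format, and the internal structure of the pickle file is not assumed beyond it being loadable with the standard pickle module.

```python
import json
import pickle
from pathlib import Path

# Placeholder value: replace with the URL of your own running container.
ta_url = (
    "http://ta-health-container.westus2.azurecontainer.io:5000"
    "/text/analytics/v3.0-preview.1/domains/health"
)


def build_health_request(texts):
    """Build a JSON body in the Text Analytics 'documents' format."""
    documents = [
        {"id": str(index), "language": "en", "text": text}
        for index, text in enumerate(texts, start=1)
    ]
    return json.dumps({"documents": documents})


def load_umls_dict(directory):
    """Load umls_concept_dict.pickle from the given directory."""
    path = Path(directory) / "umls_concept_dict.pickle"
    with path.open("rb") as handle:
        return pickle.load(handle)
```

Keeping the pickle file next to the function code means the lookup can be loaded from the function's own directory after deployment.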
After these two actions are complete, you can deploy the InvokeHealthEntityExtraction Azure function, as well as the Concatenator Azure function. One easy way to deploy an Azure function is using Visual Studio Code. You can install VS Code and then follow some of the instructions at this link:
- Install the Azure Functions extension for Visual Studio Code
- Sign in to Azure
- Publish the function to Azure
Don't forget to deploy both Azure functions: Concatenator and InvokeHealthEntityExtraction. After each function is deployed, navigate to that service in the Azure portal. Click "Functions" in the left-hand sidebar. Then click on the function name, and then click "Get Function Url" at the top of the page. Copy that value to a text editor for each function; you will need it later.
Next, create a new Azure search service using the Azure portal at https://portal.azure.com/#create/Microsoft.Search. Select your Azure subscription. You may create a new resource group (you can name it something like "covid19-search-rg"). You will need a globally unique name for your search service, since it becomes part of the service URL (try something like "covid19-search-" plus your name, organization, or numbers). Then choose a nearby location to host your search service. Please remember the location that you chose, as your Cognitive Services instance will need to be based in the same location. Click "Review + create" and then (after validation) click "Create" to instantiate and deploy the service.
After deployment is complete, click "Go to resource" to navigate to your new search service. We will need some information about your search service to fill in the "Azure Search variables" section in the SetupAzureCognitiveSearchService.ipynb notebook, which is in the AzureCognitiveSearchService directory. Open the notebook for details on how to do this and copy those values into the first code cell, but don't run the notebook yet (you will need to create an Azure storage account and update skillset.json first).
Next, you will need to create an Azure storage account and upload the COVID-19 data set. The data set can be downloaded from https://www.semanticscholar.org/cord19/download. There are two different sections to download: the metadata and document parses. Then, back on the Azure portal, you can create a new Azure storage account at https://portal.azure.com/#create/Microsoft.StorageAccount. Use the same subscription, resource group, and location that you did for the Azure search service. Choose your own unique storage account name (it must be lowercase letters and numbers only). You can change the replication to LRS. You can use the defaults for everything else, and then create the storage. Once it has been deployed, update the blob_connection_string variable in the SetupAzureCognitiveSearchService.ipynb notebook. Then create a container in your blob storage called "covid19". Inside of that container, create a folder called "json" and upload the document parses data there. Then create a folder called "metadata" in the same blob container, and upload the metadata.csv file to that folder. Finally, create another container in your blob storage account called "knowledgeStore". After you have created it, note down the connection string to that container for later.
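The storage layout described above can be checked before uploading with a small sketch like the one below. The helper names are hypothetical. The name check combines the rule stated above (lowercase letters and numbers only) with Azure's 3 to 24 character length limit for storage account names, and the mapping assumes that every document parse is a .json file and that the metadata file is named metadata.csv.

```python
import re
from pathlib import PurePosixPath


def is_valid_storage_account_name(name):
    """Storage account names: 3-24 characters, lowercase letters and digits."""
    return re.fullmatch(r"[a-z0-9]{3,24}", name) is not None


def blob_name_for(local_file):
    """Map a local file path to its blob name inside the 'covid19' container."""
    # Normalise Windows separators so only the file name is kept.
    file_name = PurePosixPath(local_file.replace("\\", "/")).name
    if file_name == "metadata.csv":
        return f"metadata/{file_name}"
    if file_name.endswith(".json"):
        return f"json/{file_name}"
    raise ValueError(f"Unexpected file for the covid19 container: {local_file}")
```

Blob storage has no real directories, so the "json" and "metadata" folders are simply prefixes in the blob names.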
Before running the notebook, you will also need to change the 4 TODOs in the skillset.json (which is also located in the AzureCognitiveSearchService folder). Open skillset.json, search for "TODO", and replace each instance with the following:
- Name concatenation custom skill URI: this value should be "https://" plus the value from the "Get Function Url" for the Concatenator function that you noted down earlier
- Invoke TA Health Extraction custom skill URI: this value should be "https://" plus the value from the "Get Function Url" for the InvokeHealthEntityExtraction function that you noted down earlier
- Cognitive Services key: create a new Cognitive Services key in the Azure portal using the same subscription, location, and resource group that you did for your Azure search service. Click "Create" and after the resource is ready, click it. Click "Keys and Endpoint" in the left-hand sidebar. Copy the Key 1 value into this TODO.
- Knowledge Store connection string: use the value that you noted down earlier of the connection string to the knowledgeStore container in your Azure blob storage. It should be of the format "DefaultEndpointsProtocol=https;AccountName=YourValueHere;AccountKey=YourValueHere;EndpointSuffix=core.windows.net".
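Before running the notebook, it can help to confirm that no placeholder is left and that the connection string has the expected parts. The following is a minimal sketch under those assumptions; the function names are hypothetical, and the check only looks for the literal text "TODO" and for the four keys shown in the format above.

```python
REQUIRED_KEYS = ("DefaultEndpointsProtocol", "AccountName", "AccountKey", "EndpointSuffix")


def parse_connection_string(connection_string):
    """Split 'Key=Value;Key=Value' into a dict, keeping '=' inside values."""
    pairs = {}
    for segment in connection_string.strip().split(";"):
        if not segment:
            continue
        # Account keys are base64 and may end in '=', so split on the first '=' only.
        key, _, value = segment.partition("=")
        pairs[key] = value
    return pairs


def check_skillset_text(skillset_text, connection_string):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    remaining = skillset_text.count("TODO")
    if remaining:
        problems.append(f"{remaining} TODO placeholder(s) still present")
    parsed = parse_connection_string(connection_string)
    missing = [key for key in REQUIRED_KEYS if not parsed.get(key)]
    if missing:
        problems.append("connection string is missing: " + ", ".join(missing))
    return problems
```

To use it, read skillset.json as text and pass it in together with the connection string you noted down.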
Finally, you are all set to go into the SetupAzureCognitiveSearchService.ipynb notebook and run it. This notebook will call REST endpoints on the search service that you have deployed in Azure to set up the search data sources, index, indexers, and skillset.
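To give a sense of what those REST calls look like, here is a hedged sketch that only builds the requests without sending them. It is not the notebook's code: the resource names in the usage example are made up, the api-version value is an assumption (the notebook may use a different one), and the creation order shown is simply one that creates everything an indexer refers to before the indexer itself.

```python
API_VERSION = "2020-06-30"  # assumption: the notebook may use another version


def build_setup_requests(service_name, admin_key, resources):
    """Return (method, url, headers) tuples for creating search resources.

    `resources` is an ordered list of (collection, name) pairs, where
    collection is one of: datasources, skillsets, indexes, indexers.
    """
    base_url = f"https://{service_name}.search.windows.net"
    headers = {"Content-Type": "application/json", "api-key": admin_key}
    planned = []
    for collection, name in resources:
        url = f"{base_url}/{collection}/{name}?api-version={API_VERSION}"
        planned.append(("PUT", url, headers))
    return planned


# Hypothetical resource names, listed so that indexers come last.
example_resources = [
    ("datasources", "covid19-json"),
    ("datasources", "covid19-metadata"),
    ("skillsets", "covid19-skillset"),
    ("indexes", "covid19-index"),
    ("indexers", "covid19-json-indexer"),
    ("indexers", "covid19-metadata-indexer"),
]
```

Each request would carry the matching JSON definition as its body, for example the contents of skillset.json for the skillset.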