This repository has been archived by the owner on Jan 13, 2023. It is now read-only.
Honghan Wu edited this page Jul 3, 2018
Welcome to the SemEHR wiki!
A typical SemEHR process comprises the following steps:
- query a database or read from an Elasticsearch instance to get the documents for processing
- NLP processing (currently using bio-yodie to annotate UMLS concepts)
- index contextualised concepts into an Elasticsearch instance
- patient-centric indexing to integrate each patient's documents and annotations
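The four steps above can be sketched end to end as follows. This is a minimal, self-contained illustration of the data flow only; all function names, field names and the stubbed annotator are hypothetical, not SemEHR's actual API (the real pipeline uses bio-yodie for step 2):

```python
# Illustrative sketch of the four-step SemEHR flow. Names are hypothetical.

def fetch_documents():
    # Step 1: query a database / Elasticsearch for documents to process.
    return [{"docid": "d1", "patient_id": "p1", "text": "history of asthma"},
            {"docid": "d2", "patient_id": "p1", "text": "no evidence of asthma"}]

def annotate(doc):
    # Step 2: NLP annotation. bio-yodie tags UMLS concepts with context
    # (e.g. negation); here it is stubbed with a keyword check.
    anns = []
    if "asthma" in doc["text"]:
        negated = "no evidence" in doc["text"]
        anns.append({"cui": "C0004096", "negated": negated})  # C0004096 = Asthma
    return anns

def index_concepts(docs):
    # Step 3: index contextualised concepts - one record per concept mention.
    return [{"docid": d["docid"], **a} for d in docs for a in annotate(d)]

def patient_index(docs):
    # Step 4: patient-centric indexing - group all annotations by patient.
    by_patient = {}
    for d in docs:
        by_patient.setdefault(d["patient_id"], []).extend(annotate(d))
    return by_patient

docs = fetch_documents()
concepts = index_concepts(docs)
patients = patient_index(docs)
```

Note that the same mention of "asthma" yields two different contextualised records: an affirmed one in `d1` and a negated one in `d2`, which is what makes the contextualised concept index more useful than plain keyword search.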
The easiest way to run the process is to:
- (only do this ONCE) initialise the SemEHR index using the mapping file (for ES versions < 6.0) or mappings (ES versions > 6.0): patient index mapping and contextualised concept mapping.
- set up the database view from which SemEHR will pull documents.
- edit the process configuration file using this template.
- run the script
python semehr_processor.py PATH_TO_YOUR_CONFIGURATION
The configuration file has the following sections.

env - system variables for running SemEHR
- java_home - path to JRE
- gcp_home - path to GCP (Gate Cloud Processing toolkit)
- gate_home - path to Gate
- yodie_path - path to bio-yodie
- ukb_home - path to UKB (used by bio-yodie to do PageRank computation for disambiguation)
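Put together, the env section of the configuration might look like the fragment below. All paths are illustrative; point them at your own installations:

```json
{
  "env": {
    "java_home": "/usr/lib/jvm/jre1.8.0",
    "gcp_home": "/opt/gcp",
    "gate_home": "/opt/gate",
    "yodie_path": "/opt/bio-yodie",
    "ukb_home": "/opt/ukb"
  }
}
```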
yodie - settings for running the bio-yodie NLP pipeline on documents
- "os" - the type of Operating System; possible values: win, linux
- "gcp_run_path" - bio-yodie working folder
- "input_doc_file_path" - (optional) path to a folder containing a text document that lists all document ids to be processed
- "thread_num" - number of concurrent threads to run bio-yodie
- "memory" - max memory to run bio-yodie, e.g., 30g or 600m
- "config_xml_path" - the full path where the bio-yodie configuration file will be stored (the file is generated automatically)
- "output_file_path" - (optional) path to the folder where bio-yodie's JSON dumps will be saved
- "output_destination" - output type of bio-yodie; either 'sql' (annotations are saved to a database server) or 'json' (annotations are saved as dumps of annotation files in JSON format)
- "output_dbconn_setting_file" - path to a json database configuration for saving annotations to; check this example.
- "output_table" - the table name to save annotations to if using sql output, e.g., [kconnect_annotations];
- "output_concept_filter_file" - (optional) path to a text file containing the concept IDs to keep; all other concepts will be discarded. The format is one UMLS CUI per line.
- "input_source" - where to read documents from; possible values are "sql" and "elasticsearch". The system will use a different input handler for running bio-yodie depending on this value: sql - read from a database; elasticsearch - read from the Elasticsearch server specified in the semehr section of this configuration.
- "input_dbconn_setting_file" - (optional) input document database configuration, only needed when input_source is sql; check this example.
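A yodie section using Elasticsearch input and JSON output might look like the fragment below. All values are illustrative and should be adapted to your environment:

```json
{
  "yodie": {
    "os": "linux",
    "gcp_run_path": "/data/yodie_run",
    "thread_num": 4,
    "memory": "30g",
    "config_xml_path": "/data/yodie_run/config.xml",
    "output_destination": "json",
    "output_file_path": "/data/yodie_output",
    "input_source": "elasticsearch"
  }
}
```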
semehr - Elasticsearch settings for SemEHR
- "es_doc_url" - Elasticsearch host URL for full text documents
- "full_text_doc_id" - doc id field name in the full text document
- "full_text_doc_date" - doc date field name in the full text document
- "full_text_index" - index name of the full text documents
- "full_text_doc_type" - doc type name of the full text documents
- "full_text_patient_field" - patient id field name in the full text document
- "full_text_text_field" - full text field name in the full text document
- "es_host" - Elasticsearch host URL for SemEHR
- "index" - index name for SemEHR patients
- "concept_index" - index name for SemEHR contextualised concepts (remove this if you would like to have everything in the same index for ES < 6.0)
- "concept_doc_type" - document type for contextualised concepts
- "entity_doc_type" - document type for patients
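As a sketch, a semehr section might look like the fragment below. Every value here is illustrative (index names, field names and hosts are placeholders, not SemEHR defaults):

```json
{
  "semehr": {
    "es_doc_url": "http://localhost:9200",
    "full_text_doc_id": "docid",
    "full_text_doc_date": "doc_date",
    "full_text_index": "eprdoc",
    "full_text_doc_type": "doc",
    "full_text_patient_field": "patient_id",
    "full_text_text_field": "fulltext",
    "es_host": "http://localhost:9200",
    "index": "semehr_patients",
    "concept_index": "semehr_ctx_concepts",
    "concept_doc_type": "ctx_concept",
    "entity_doc_type": "patient"
  }
}
```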
new_docs - where to read new document IDs from; only needed if document IDs are read from a database
- "sql_query" - the SQL query to read document IDs, e.g., "select docid from ..."; the query can be a template with two placeholders, "{start_time_point}" and "{end_time_point}", which will be replaced using information stored in SemEHR's progress log: the last successful job time replaces {start_time_point} and the current time replaces {end_time_point}.
- "dbconn_setting_file" - database connection settings for reading document IDs, e.g. "dbconn.json"; check this example.
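A minimal sketch of how the two sql_query placeholders are filled in. The table and column names and the timestamp format are illustrative, not SemEHR's actual schema:

```python
# Hypothetical sql_query template with the two supported placeholders.
sql_template = (
    "select docid from discharge_docs "
    "where updated >= '{start_time_point}' and updated < '{end_time_point}'"
)

start = "2018-07-01 00:00:00"  # last successful job time, read from the progress log
end = "2018-07-03 12:00:00"    # current time when the job starts

# Both placeholders are substituted before the query is sent to the database.
query = sql_template.format(start_time_point=start, end_time_point=end)
print(query)
```

This incremental pattern means each run only picks up documents created or updated since the previous successful job.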
job - the todo list for a SemEHR process. Each of the first three items accepts yes or no; yes means do the respective task.
- "copy_docs" - copy Elasticsearch documents from one index to another; designed for KCH use cases where full text documents have already been indexed in CogStack.
- "yodie" - run the bio-yodie pipeline on documents; MetaMap will be supported soon.
- "semehr-concept" - do SemEHR concept indexing
- "semehr-patients" - do SemEHR patient-centric indexing
- "job_id" - the unique job name
- "job_status_file_path" - path to the folder where the job progress log file is to be stored
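A job section for a run that skips document copying but performs annotation and both indexing steps might look like this (the job_id and path are illustrative):

```json
{
  "job": {
    "copy_docs": "no",
    "yodie": "yes",
    "semehr-concept": "yes",
    "semehr-patients": "yes",
    "job_id": "semehr_daily_job",
    "job_status_file_path": "/data/semehr_jobs"
  }
}
```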
- When you see no concepts indexed for patients, double-check the index mappings to make sure they are correct as defined in the script.