Request support on data input sample and output sample just for prediction #9

Open
Darrshan-Sankar opened this issue Jul 16, 2024 · 11 comments

@Darrshan-Sankar

Darrshan-Sankar commented Jul 16, 2024

I used AIONER output to extract relations, but it didn't work. I went through the issues and found that the example is in the BioRED repo. I'd like to know how to create such input data, and what a sample output (the predict.pubtator file) looks like.

@ptlai
Collaborator

ptlai commented Jul 18, 2024

Hi @Darrshan-Sankar,

The results of AIONER cannot be fed directly to BioREx. BioREx requires that the entities' IDs be normalized, so you have to use our normalization components, such as GNorm2. If you just want to process PubMed abstracts, you can use https://ftp.ncbi.nlm.nih.gov/pub/lu/PubTator3/, where we provide the processed relation results for PubMed. Please let me know if you need further help.

@Darrshan-Sankar
Author

@ptlai Thanks for your support. I actually have to process full texts, so could you please explain how to normalize AIONER results into input for BioREx? A script would be especially helpful.

@ptlai
Collaborator

ptlai commented Jul 18, 2024

Hi @Darrshan-Sankar,

The simplest way is to use the NE/ID annotations in https://ftp.ncbi.nlm.nih.gov/pub/lu/PubTator3/ as well (BioCXML files). We have already processed the NEs/IDs for full text, but relations for abstracts only. You can treat each paragraph as an abstract and then feed it to BioREx. If you still need help using the normalization components, you may contact Dr. Wei (chih-hsuan.wei@nih.gov), who handles the entire backend process of our PubTator.

@Darrshan-Sankar
Author

Darrshan-Sankar commented Jul 18, 2024

@ptlai Yes, I went through the FTP. As you said, it only has relations for abstracts. Thank you for providing Dr. Wei's contact details.

@zy2376

zy2376 commented Oct 28, 2024

@ptlai Hi, could you please provide an example of a normalized AIONER file? I'd like to review the workflow for extracting entities with AIONER and then performing relation extraction with BioREx.

@ptlai
Collaborator

ptlai commented Oct 29, 2024

Hi @zy2376 ,

A normalized example of an AIONER file can be found at [bc8_biored_task1_val.txt](https://github.com/user-attachments/files/17559816/bc8_biored_task1_val.txt). Please note that AIONER NE types must be converted to their corresponding BioRED NE types (e.g., 'Gene' to 'GeneOrGeneProduct') before running BioREx.
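
A minimal sketch of that type conversion, assuming tab-separated PubTator annotation lines. Only the 'Gene' to 'GeneOrGeneProduct' pair comes from this thread; the other entries in the table and the function name `convert_annotation_line` are assumptions for illustration and should be checked against the type names your AIONER output actually contains.

```python
# Hypothetical lookup table from AIONER entity types to BioRED entity types.
# Only "Gene" -> "GeneOrGeneProduct" is confirmed in this thread; every other
# entry is an assumption that should be verified against your own data.
AIONER_TO_BIORED = {
    "Gene": "GeneOrGeneProduct",
    "Chemical": "ChemicalEntity",
    "Disease": "DiseaseOrPhenotypicFeature",
    "Species": "OrganismTaxon",
    "Variant": "SequenceVariant",
    "CellLine": "CellLine",
}


def convert_annotation_line(line: str) -> str:
    """Rewrite the entity-type column of one PubTator annotation line.

    Title/abstract lines (id|t|..., id|a|...) and unknown types are
    returned unchanged.
    """
    cols = line.rstrip("\n").split("\t")
    # Annotation lines look like: id, start, end, mention, type[, identifier]
    if len(cols) >= 5 and cols[1].isdigit() and cols[2].isdigit():
        cols[4] = AIONER_TO_BIORED.get(cols[4], cols[4])
    return "\t".join(cols)
```

Applying it line by line to an AIONER/GNorm2 output file leaves the text lines untouched and only renames the type column.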

@zy2376

zy2376 commented Nov 19, 2024

@ptlai Thank you very much for providing the normalized example. I was finally able to run the AIONER-to-BioREx process successfully for PubMed abstracts. However, the process failed when applied to PMC full text. Could you please provide guidance on resolving this issue?

@ptlai
Collaborator

ptlai commented Nov 19, 2024

Hi @zy2376 ,

To process the full-text data with BioREx, you can treat each paragraph as a separate abstract. For instance, take the article available at https://www.ncbi.nlm.nih.gov/research/pubtator3/publication/33202951.

You can format the content like this:

33202951|t|1. Introduction. Paragraph 1.
33202951|a|In general, N-nitrosamines (NAs) are the products of reactions between a nitrosating agent and a secondary or tertiary amine; NAs are formed preferentially at elevated temperature. Thus, NAs are mainly detected in food and drinks after processing. In foods, nitrous anhydride is the main nitrosating agent formed from nitrite in an acidic aqueous solution. In drinking water, N-nitrosodimethylamine (NDMA) is the most simple and volatile NA that can form during the degradation of dimethylhydrazine (a component of rocket fuel) by chloramination of amine-based precursors or as a byproduct of anion exchange purification of water. NDMA has been shown to be formed in certain foods due to a direct-fire drying process. International Agency for Research on Cancer (IARC) has classified NDMA as a probable carcinogen in humans. NDMA is known to be genotoxic in vivo and in vitro. Several case-control studies and a single cohort study of NDMA in humans supported the assumption that NDMA consumption is positively associated with either gastric or colorectal cancer. Therefore, due to possible contamination of water with NDMA, the World Health Organization (WHO) and U.S. Environmental Protection Agency (EPA) have set the drinking water guideline limits to 100 ng/L and 0.4 ng/L in tap water, respectively. Only in a few foods and countries, limits have been set for NAs. In the United States, a limit of 10 microg/kg has been set for total volatile NAs in cured meat products. In 2005, China introduced a limit of 4 and 7 microg/kg of NDMA in fish and related products, respectively. There are currently no maximum regulatory limits for the level of N-nitroso-compounds in food in the European Union.
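
One way to produce that layout programmatically is sketched below. The function name `paragraphs_to_pubtator` and the `(heading, body)` input shape are assumptions for illustration, not part of BioREx. Whitespace is collapsed so each field stays on one line, which means this step should run before NER so that offsets are computed on the final text.

```python
from typing import List, Tuple


def paragraphs_to_pubtator(doc_id: str, paragraphs: List[Tuple[str, str]]) -> str:
    """Turn (heading, body) pairs into blank-line-separated PubTator documents.

    Each paragraph becomes its own pseudo-abstract: the heading goes on the
    |t| line and the paragraph text on the |a| line.
    """
    docs = []
    for heading, body in paragraphs:
        # Collapse newlines and repeated spaces so each field is a single line.
        heading = " ".join(heading.split())
        body = " ".join(body.split())
        docs.append(f"{doc_id}|t|{heading}\n{doc_id}|a|{body}\n")
    # Each document already ends with a newline, so joining with "\n"
    # leaves exactly one empty line between documents.
    return "\n".join(docs)
```

For example, `paragraphs_to_pubtator("33202951", [("1. Introduction. Paragraph 1.", "In general, ...")])` yields one `|t|` line followed by one `|a|` line.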

@zy2376

zy2376 commented Dec 3, 2024

@ptlai Thanks to your comments, I've converted my full text into the |t| and |a| format, and it works for some paragraphs (see the attached BioREx input file "PMC7611502_t-a-format_239rows.txt" and BioREx output file "PMC7611502_t-a-format_239rows_predict.txt").
However, for the full-text data, I encountered a mapping issue, as indicated by the following warning:
"IFN
15184
3
<annotation.AnnotationInfo object at 0x2af9f2c126d0> cannot be mapped to original text
"
By the way, the full-text data was formatted using the following workflow: paper from BioC API -> AIONER -> GNorm2 -> BioRED PubTator. I checked that the full-text data didn't change between BioC and BioRED PubTator, which led me to wonder whether the issue might be due to a difference between how AIONER and BioREx map offsets. Could you help resolve the mapping issue?

PMC7611502_t-a-format.txt
PMC7611502_t-a-format_239rows.txt
PMC7611502_t-a-format_239rows_predict.txt
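
A warning like this can be investigated independently of BioREx by checking whether each annotation's offsets actually point at its mention. The sketch below is not BioREx code: `parse_documents` and `find_unmappable` are hypothetical helpers, and it assumes offsets are counted over the title, one separator character, and then the abstract.

```python
def parse_documents(pubtator_text):
    """Group lines into documents; a new document starts at every |t| line."""
    docs = []
    for line in pubtator_text.splitlines():
        cols = line.split("\t")
        # Annotation line: id, start, end, mention, type[, identifier]
        if len(cols) >= 5 and cols[1].isdigit() and cols[2].isdigit():
            if docs:
                docs[-1]["annotations"].append((int(cols[1]), int(cols[2]), cols[3]))
            continue
        parts = line.split("|", 2)
        if len(parts) == 3 and parts[1] == "t":
            docs.append({"title": parts[2], "abstract": "", "annotations": []})
        elif len(parts) == 3 and parts[1] == "a" and docs:
            docs[-1]["abstract"] = parts[2]
    return docs


def find_unmappable(pubtator_text):
    """Return (title, start, end, mention) for annotations whose span does not match."""
    problems = []
    for doc in parse_documents(pubtator_text):
        # Assumed convention: offsets index into title + one separator + abstract.
        text = doc["title"] + " " + doc["abstract"]
        for start, end, mention in doc["annotations"]:
            if text[start:end] != mention:
                problems.append((doc["title"], start, end, mention))
    return problems
```

Running `find_unmappable` on the input file lists every annotation whose span does not match its document's text, which helps distinguish a formatting problem from a genuine tokenization mismatch.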

@ptlai
Collaborator

ptlai commented Dec 3, 2024

Hi @zy2376 ,

Thank you for providing the example PubTator files. Upon review, I noticed a few formatting issues that need to be addressed:

  1. Each document in the file should be separated by an empty line. For example:

Incorrect

7611502|t|Downregulation of A20 promotes immune escape of lung Adenocarcinomas
7611502	18	21	A20	GeneOrGeneProduct	7128
7611502|a|Inflammation is a well-known driver of lung tumorigenesis. Tumor cells escape tight homeostatic control by decreasing the expression of the potent anti-inflammatory protein TNFAIP3, also known as A20. Tumor cell intrinsic loss of A20 dramatically enhances lung tumorigenesis and prevents CD8+ T cell mediated immune surveillance in patients and mice. This is completely dependent on increased cellular sensibility to interferon signaling via hyperactivation of TANK-binding kinase 1 (TBK1) and increased expression and activation of STAT1, resulting in elevated PD-L1 expression. Accordingly, immune checkpoint blockade (ICB) is highly efficient in mice harboring A20 deficient lung tumors. Altogether, we have identified A20 as a master immune checkpoint regulating the TBK1-STAT1-PD-L1 axis that may be exploited to improve ICB therapy in lung adenocarcinoma.
7611502	242	249	TNFAIP3	GeneOrGeneProduct	21929,7128
7611502	265	268	A20	GeneOrGeneProduct	7128
7611502	299	302	A20	GeneOrGeneProduct	7128,21929
7611502	357	360	CD8	GeneOrGeneProduct	925
7611502	530	551	TANK-binding kinase 1	GeneOrGeneProduct	29110,56480
7611502	553	557	TBK1	GeneOrGeneProduct	29110,56480
7611502	602	607	STAT1	GeneOrGeneProduct	6772,20846
7611502	631	636	PD-L1	GeneOrGeneProduct	29126
7611502	733	736	A20	GeneOrGeneProduct	7128
7611502	791	794	A20	GeneOrGeneProduct	7128
7611502	840	844	TBK1	GeneOrGeneProduct	29110
7611502	845	850	STAT1	GeneOrGeneProduct	6772
7611502	851	856	PD-L1	GeneOrGeneProduct	29126,60533
7611502	401	409	patients	OrganismTaxon	9606
7611502	414	418	mice	OrganismTaxon	10090
7611502	718	722	mice	OrganismTaxon	10090
7611502	486	496	interferon	GeneOrGeneProduct	3439
7611502|t|Introduction
7611502|a|Cancer cells express immune regulatory factors that remodel the tumor microenvironment (TME) and promote tumor immune escape, a hallmark of cancer progression. Accordingly, TME targeting therapies to break tumor-induced immune tolerance are heavily pursued. The development of immune checkpoint inhibitors blocking negative effectors of T cell function was a major advance, especially in malignancies with poor prognosis. In lung cancer, which is the leading cause of cancer related deaths, the approval of immune checkpoint blockade (ICB) raised high hopes and fundamentally changed therapies. Nevertheless, only around 20% of unselected patients suffering from non-small cell lung cancer (NSCLC) respond to monotherapies targeting Programmed Cell Death Protein 1 (PD-1)/Programmed Death Ligand 1 (PD-L1), and predicting the response of individual patients remains challenging. A better understanding of factors altering the TME is needed in order to avoid exposing non-responders to the unnecessary toxicity of costly ICB therapeutic regimen.
7611502	1677	1708	Programmed Cell Death Protein 1	GeneOrGeneProduct	5133
7611502	1743	1748	PD-L1	GeneOrGeneProduct	29126
7611502	1583	1591	patients	OrganismTaxon	9606
7611502	1793	1801	patients	OrganismTaxon	9606

Correct

7611502|t|Downregulation of A20 promotes immune escape of lung Adenocarcinomas
7611502	18	21	A20	GeneOrGeneProduct	7128
7611502|a|Inflammation is a well-known driver of lung tumorigenesis. Tumor cells escape tight homeostatic control by decreasing the expression of the potent anti-inflammatory protein TNFAIP3, also known as A20. Tumor cell intrinsic loss of A20 dramatically enhances lung tumorigenesis and prevents CD8+ T cell mediated immune surveillance in patients and mice. This is completely dependent on increased cellular sensibility to interferon signaling via hyperactivation of TANK-binding kinase 1 (TBK1) and increased expression and activation of STAT1, resulting in elevated PD-L1 expression. Accordingly, immune checkpoint blockade (ICB) is highly efficient in mice harboring A20 deficient lung tumors. Altogether, we have identified A20 as a master immune checkpoint regulating the TBK1-STAT1-PD-L1 axis that may be exploited to improve ICB therapy in lung adenocarcinoma.
7611502	242	249	TNFAIP3	GeneOrGeneProduct	21929,7128
7611502	265	268	A20	GeneOrGeneProduct	7128
7611502	299	302	A20	GeneOrGeneProduct	7128,21929
7611502	357	360	CD8	GeneOrGeneProduct	925
7611502	530	551	TANK-binding kinase 1	GeneOrGeneProduct	29110,56480
7611502	553	557	TBK1	GeneOrGeneProduct	29110,56480
7611502	602	607	STAT1	GeneOrGeneProduct	6772,20846
7611502	631	636	PD-L1	GeneOrGeneProduct	29126
7611502	733	736	A20	GeneOrGeneProduct	7128
7611502	791	794	A20	GeneOrGeneProduct	7128
7611502	840	844	TBK1	GeneOrGeneProduct	29110
7611502	845	850	STAT1	GeneOrGeneProduct	6772
7611502	851	856	PD-L1	GeneOrGeneProduct	29126,60533
7611502	401	409	patients	OrganismTaxon	9606
7611502	414	418	mice	OrganismTaxon	10090
7611502	718	722	mice	OrganismTaxon	10090
7611502	486	496	interferon	GeneOrGeneProduct	3439

7611502|t|Introduction
7611502|a|Cancer cells express immune regulatory factors that remodel the tumor microenvironment (TME) and promote tumor immune escape, a hallmark of cancer progression. Accordingly, TME targeting therapies to break tumor-induced immune tolerance are heavily pursued. The development of immune checkpoint inhibitors blocking negative effectors of T cell function was a major advance, especially in malignancies with poor prognosis. In lung cancer, which is the leading cause of cancer related deaths, the approval of immune checkpoint blockade (ICB) raised high hopes and fundamentally changed therapies. Nevertheless, only around 20% of unselected patients suffering from non-small cell lung cancer (NSCLC) respond to monotherapies targeting Programmed Cell Death Protein 1 (PD-1)/Programmed Death Ligand 1 (PD-L1), and predicting the response of individual patients remains challenging. A better understanding of factors altering the TME is needed in order to avoid exposing non-responders to the unnecessary toxicity of costly ICB therapeutic regimen.
7611502	1677	1708	Programmed Cell Death Protein 1	GeneOrGeneProduct	5133
7611502	1743	1748	PD-L1	GeneOrGeneProduct	29126
7611502	1583	1591	patients	OrganismTaxon	9606
7611502	1793	1801	patients	OrganismTaxon	9606
  2. The first two lines in each document must begin with |t| (title) and |a| (abstract), respectively. Entity annotations should start from the third line onward.

Incorrect

7611502|t|Downregulation of A20 promotes immune escape of lung Adenocarcinomas
7611502	18	21	A20	GeneOrGeneProduct	7128
7611502|a|Inflammation is a well-known driver of lung tumorigenesis. Tumor cells escape tight homeostatic control by decreasing the expression of the potent anti-inflammatory protein TNFAIP3, also known as A20. Tumor cell intrinsic loss of A20 dramatically enhances lung tumorigenesis and prevents CD8+ T cell mediated immune surveillance in patients and mice. This is completely dependent on increased cellular sensibility to interferon signaling via hyperactivation of TANK-binding kinase 1 (TBK1) and increased expression and activation of STAT1, resulting in elevated PD-L1 expression. Accordingly, immune checkpoint blockade (ICB) is highly efficient in mice harboring A20 deficient lung tumors. Altogether, we have identified A20 as a master immune checkpoint regulating the TBK1-STAT1-PD-L1 axis that may be exploited to improve ICB therapy in lung adenocarcinoma.
7611502	242	249	TNFAIP3	GeneOrGeneProduct	21929,7128

Correct

7611502|t|Downregulation of A20 promotes immune escape of lung Adenocarcinomas
7611502|a|Inflammation is a well-known driver of lung tumorigenesis. Tumor cells escape tight homeostatic control by decreasing the expression of the potent anti-inflammatory protein TNFAIP3, also known as A20. Tumor cell intrinsic loss of A20 dramatically enhances lung tumorigenesis and prevents CD8+ T cell mediated immune surveillance in patients and mice. This is completely dependent on increased cellular sensibility to interferon signaling via hyperactivation of TANK-binding kinase 1 (TBK1) and increased expression and activation of STAT1, resulting in elevated PD-L1 expression. Accordingly, immune checkpoint blockade (ICB) is highly efficient in mice harboring A20 deficient lung tumors. Altogether, we have identified A20 as a master immune checkpoint regulating the TBK1-STAT1-PD-L1 axis that may be exploited to improve ICB therapy in lung adenocarcinoma.
7611502	18	21	A20	GeneOrGeneProduct	7128
7611502	242	249	TNFAIP3	GeneOrGeneProduct	21929,7128
  3. Entity offsets should reset to 0 at the beginning of each document.
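
The three rules above can be combined in a small writer, sketched here under stated assumptions: `format_document`, `format_file`, and the `doc_start` parameter are hypothetical names, and `doc_start` is assumed to be the position in the full text at which the paragraph's title begins (for example, taken from the BioC passage offset).

```python
def format_document(doc_id, title, abstract, annotations, doc_start=0):
    """Emit one PubTator document: |t| line, |a| line, then annotations.

    `annotations` holds (start, end, mention, ne_type, identifier) tuples with
    offsets relative to the full text; subtracting `doc_start` makes them
    relative to this document, so they restart at 0.
    """
    lines = [f"{doc_id}|t|{title}", f"{doc_id}|a|{abstract}"]
    for start, end, mention, ne_type, identifier in annotations:
        if start < doc_start:
            raise ValueError(f"annotation {mention!r} starts before the document")
        lines.append("\t".join([
            doc_id, str(start - doc_start), str(end - doc_start),
            mention, ne_type, identifier,
        ]))
    return "\n".join(lines) + "\n"


def format_file(documents):
    """Join documents (dicts of format_document arguments) with one empty line between them."""
    return "\n".join(format_document(**doc) for doc in documents)
```

Writing the title and abstract first and only then the annotations satisfies rule 2, the join in `format_file` satisfies rule 1, and the `doc_start` subtraction satisfies rule 3.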

@zy2376

zy2376 commented Dec 28, 2024

@ptlai Thanks to your help, relations can now be extracted from the full text using BioREx. However, another issue has arisen: every document yields the same relations 🤦. I've included the input and output files below. Could you please check them?
PMC7611502_input.txt
PMC7611502_predict.txt
