OptoelectronicsLM is a project for developing language models that are specifically aware of optoelectronics concepts. This repository contains the code and resources used to train those models on specialized datasets and to evaluate them on text-classification, question-answering, and embedding tasks.
Training and evaluation scripts for each task are provided in the corresponding directory. Note that you will need to change the relevant file paths and repository locations to suit your own setup, as sketched below.
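For example, the edits expected are typically of the following kind (the variable names here are illustrative placeholders, not the actual names used in the scripts):

```python
# Illustrative only: the actual variable names differ per script.
# Before running a training or evaluation script, point these at
# your own locations.
DATA_PATH = "/path/to/your/corpus.jsonl"      # local copy of the dataset
CHECKPOINT_DIR = "/path/to/save/checkpoints"  # where trained weights are written
HF_REPO = "your-username/your-model"          # Hugging Face repo to load or push to
```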
See the associated paper, along with the models and datasets on Hugging Face, for more details.
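As a minimal sketch, the released models should be loadable with the `transformers` library as below; the repository ID is a hypothetical placeholder, so substitute the actual model name from the Hugging Face page. The snippet also shows one way to obtain a sentence embedding (mean pooling over token states), as one illustration of the embedding tasks mentioned above, not necessarily the exact pooling used in the paper.

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "your-org/optoelectronics-lm"  # hypothetical ID; replace as needed
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()

text = "Perovskite solar cells exhibit tunable band gaps."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state        # (1, seq_len, hidden)

mask = inputs["attention_mask"].unsqueeze(-1).float() # (1, seq_len, 1)
embedding = (hidden * mask).sum(1) / mask.sum(1)      # mean over real tokens
print(embedding.shape)                                # torch.Size([1, hidden])
```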
We welcome contributions to improve OptoelectronicsLM. Please fork the repository and submit a pull request with your changes. Ensure that your code adheres to the project's coding standards and includes appropriate tests.
Please use the following citation if you use any of this codebase in your work.
@article{doi:10.1021/acs.jcim.4c02029,
  author  = {Huang, Dingyun and Cole, Jacqueline M.},
  title   = {Cost-Efficient Domain-Adaptive Pretraining of Language Models for Optoelectronics Applications},
  journal = {Journal of Chemical Information and Modeling},
  doi     = {10.1021/acs.jcim.4c02029},
  note    = {PMID: 39933074},
  url     = {https://doi.org/10.1021/acs.jcim.4c02029},
  eprint  = {https://doi.org/10.1021/acs.jcim.4c02029}
}