Support for LLM backend: implementation plan #21

Open
osma opened this issue Feb 7, 2024 · 3 comments · May be fixed by #29

osma commented Feb 7, 2024

Hi,

Our team at @NatLibFi has been looking at performing metadata extraction using large language models. Since our initial experiments have been promising, we would like to move towards a prototype system that could be used for demonstrations. As Meteor already provides the basic building blocks (API, web UI, backend code that uses extraction heuristics...), we would like to extend the codebase with another backend that uses an LLM for the extraction part. We are going to develop this in a fork of this repo, https://github.com/NatLibFi/meteor, and if you're interested, we could contribute the code back to this upstream repository sometime in the future (for now this would obviously be just a prototype).

This is the initial implementation plan:


Minimal LLM support for Meteor

Configuration:

  • env var LLM_API_URL + optional LLM_API_KEY
  • prompt template set inside LLMExtractor for now but should be configurable in the future
    • could be a template file chosen by LLM_PROMPT_TEMPLATE env var (see the sketch below)
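
A minimal sketch of how this configuration could be read, assuming a helper function and a file-based template (both are just illustrations, not decided yet):

```python
# Hypothetical helper; all names and the file-based template are assumptions.
import os

DEFAULT_PROMPT_TEMPLATE = "Extract the metadata of this document as JSON:\n{text}"  # placeholder


def load_llm_config() -> dict:
    """Read the LLM backend configuration from environment variables."""
    template = DEFAULT_PROMPT_TEMPLATE
    template_path = os.environ.get("LLM_PROMPT_TEMPLATE")  # optional, future extension
    if template_path:
        with open(template_path, encoding="utf-8") as f:
            template = f.read()
    return {
        "api_url": os.environ.get("LLM_API_URL"),  # required to enable the LLM backend
        "api_key": os.environ.get("LLM_API_KEY"),  # optional
        "prompt_template": template,
    }
```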

Changes to Meteor class

  • run method: if LLMExtractor.is_available(), run LLMExtractor instead of Finder (sketched below)
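
A minimal sketch of that dispatch; Meteor's real method signature may differ, and LLMExtractor and Finder refer to the classes discussed in this plan:

```python
# Sketch only: method names and signatures are assumptions about the existing Meteor class.
class Meteor:
    def run(self, document):
        if LLMExtractor.is_available():        # True when LLM_API_URL is set
            return LLMExtractor().extract(document)
        return Finder().extract(document)      # existing heuristics-based extraction
```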

Changes to MeteorDocument class

  • add support for extracting text for use by LLM (sketched below)
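
For illustration, one way the text extraction could look, assuming PyMuPDF is used to read the PDF (the library choice and method name are assumptions, not Meteor's actual implementation):

```python
# Hypothetical method on MeteorDocument; library choice and names are assumptions.
import fitz  # PyMuPDF


class MeteorDocument:
    def __init__(self, file_path: str):
        self.file_path = file_path

    def extract_text_for_llm(self, max_pages: int = 5) -> dict:
        """Collect text from the first pages of the PDF into a simple dict for the LLM prompt."""
        with fitz.open(self.file_path) as pdf:
            n_pages = min(max_pages, pdf.page_count)
            pages = {str(i + 1): pdf[i].get_text() for i in range(n_pages)}
        return {"pages": pages}
```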

New LLMExtractor class

  • similar to Finder
  • is_available() returns True if LLM_API_URL is set
  • essential functionality (sketched below):
    • request text from MeteorDocument
    • build prompt using template + text
    • send request to LLM service (request JSON if possible?)
    • parse response
    • return extracted metadata
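
Putting the pieces together, a rough sketch of what LLMExtractor could look like. It assumes LLM_API_URL points at an OpenAI-compatible chat completions endpoint (as served by e.g. vLLM or llama.cpp); the payload shape, prompt and all names are assumptions, not a final design:

```python
# Sketch of the LLMExtractor described above; not a definitive implementation.
import json
import os

import requests

PROMPT_TEMPLATE = "Extract the metadata of this document as JSON:\n{text}"  # placeholder


class LLMExtractor:
    @staticmethod
    def is_available() -> bool:
        return bool(os.environ.get("LLM_API_URL"))

    def extract(self, document) -> dict:
        # request text from MeteorDocument
        text = document.extract_text_for_llm()
        # build prompt using template + text
        prompt = PROMPT_TEMPLATE.format(text=json.dumps(text, ensure_ascii=False))
        # send request to the LLM service
        headers = {}
        api_key = os.environ.get("LLM_API_KEY")
        if api_key:
            headers["Authorization"] = f"Bearer {api_key}"
        # note: some inference servers also require a "model" field in the payload
        response = requests.post(
            os.environ["LLM_API_URL"],
            headers=headers,
            json={"messages": [{"role": "user", "content": prompt}], "temperature": 0.0},
            timeout=60,
        )
        response.raise_for_status()
        # parse response and return extracted metadata
        answer = response.json()["choices"][0]["message"]["content"]
        metadata = json.loads(answer)
        return metadata  # mapping to Meteor's own output format would happen here
```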

Do you think the above implementation plan sounds reasonable? It would result in an experimental version of Meteor that selects the extraction method (heuristics or LLM) at runtime, based on the presence or absence of an environment variable. This way the "traditional" Meteor approach can be compared head-to-head with the LLM version in terms of extraction quality, response time, resource usage etc. The LLM itself would be hosted as an external inference service (e.g. vLLM) and used via API calls. In the initial version we would stick to the metadata fields currently supported by Meteor, but in the future we would like to expand this to more fields, as the LLM approach makes adding fields quite easy.

We are still working on prototyping the LLM-based extraction techniques and fine-tuning local LLM models in the https://github.com/NatLibFi/FinGreyLit repository, in particular in the experiments directory. It will take a while to set this up and start coding. I opened this issue to make you aware of our plans and potentially get feedback on what would make sense from your perspective.

pierrebeauguitte (Collaborator) commented

Hi,

I think the architecture you suggest is a nice way to integrate an LLM running externally. I just have a few comments:

  • "send request to LLM service (request JSON if possible?)": asking the LLM for JSON format is unlikely to be a strong enough guarantee. I would simply have a validation step in the LLMExtractor class, and return an exception / 500 status when it fails.
  • the env var logic you describe sounds quite rigid: as I read it, if LLM_API_URL is defined, you cannot run the "traditional" approach without changing the environment. Why not add an optional URL parameter to the API to decide which backend should be used?
  • currently, the Meteor backend can be installed as a Python module (https://github.com/NationalLibraryOfNorway/meteor#installing-the-python-module). It is not necessary for running Meteor as a service, but it is a nice feature to have, and ideally your changes should preserve it.

We look forward to seeing the results of your work!

osma commented Feb 16, 2024

Thanks for your comments @pierrebeauguitte !

"send request to LLM service (request JSON if possible?)": asking the LLM for JSON format is unlikely to be a strong enough guarantee. I would simply have a validation step in the LLMExtractor class, and return an exception / 500 status when it fails.

Yes, you are right. In any case, the JSON response from the LLM needs to be parsed within LLMExtractor, because it has to be transformed into the format Meteor expects. If anything goes wrong in that process, returning a 500 status code makes sense.
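
For what it's worth, a small sketch of such a validation step; the exception class is hypothetical, and the API layer would map it to a 500 response:

```python
# Sketch of the parsing/validation step; all names are illustrative.
import json


class LLMExtractionError(Exception):
    """Raised when the LLM response cannot be parsed into Meteor's expected format."""


def parse_llm_response(raw: str) -> dict:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as err:
        raise LLMExtractionError(f"LLM did not return valid JSON: {err}") from err
    if not isinstance(data, dict):
        raise LLMExtractionError("expected a JSON object with metadata fields")
    return data
```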

> the env var logic you describe sounds quite rigid: as I read it, if LLM_API_URL is defined, you cannot run the "traditional" approach without changing the environment. Why not add an optional URL parameter to the API to decide which backend should be used?

Yes, why not. There could be a parameter backend=llm / backend=finder for the / (HTML), /json and /file endpoints. What should the default be? Use LLMExtractor if available and fall back to Finder if not?
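
Something like this default logic, perhaps (the function and parameter names are placeholders, and LLMExtractor refers to the class from the plan above):

```python
# Possible resolution of the backend parameter; names are not decided yet.
from typing import Optional


def choose_backend(requested: Optional[str]) -> str:
    """An explicit request wins; otherwise prefer the LLM backend when it is configured."""
    if requested in ("llm", "finder"):
        return requested
    return "llm" if LLMExtractor.is_available() else "finder"
```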

To make it easy to try this out, there could be an element in the web UI for selecting the backend.

> currently, the Meteor backend can be installed as a Python module (https://github.com/NationalLibraryOfNorway/meteor#installing-the-python-module). It is not necessary for running Meteor as a service, but it is a nice feature to have, and ideally your changes should preserve it.

Sure, no reason to break this. Thanks for the reminder! Also, I think it's important to keep the unit tests working and ideally to add new tests for all the new functionality.

The current status is that we have an initial, very rough but working fine-tuned LLM based on Zephyr-7B, and it has been uploaded to the Hugging Face Hub both in the original HF format and in a GGUF quantized format for CPU-only inference (e.g. for local development). We still need to take a closer look at inference platforms. After setting up one or more of those, we can start development of the Meteor part.

osma commented Jul 31, 2024

Hi again! There has been some progress on this lately: not yet the Meteor implementation part, but we are getting closer. Here is some news from FinGreyLit:

  1. We have refined the data set quite a lot: the document categorization has been adjusted (there are now five types instead of four) and much of the metadata has been verified so that it corresponds more closely to what is actually stated in the documents themselves, making a proper ground-truth data set.
  2. We switched the input and output formats used when talking to the LLM. The input is now JSON (not ad-hoc text), and the metadata schema is also a simple form of JSON that no longer pretends to be Dublin Core. The schema is pretty close to what Meteor produces, although we've added a few fields such as doi, p-isbn, p-issn and type_coar that Meteor doesn't support, and we intend to add more in the future (an illustrative record follows this list).
  3. The evaluation code has been overhauled a bit. It now has field-specific rules (including some reusable generic ones) for the comparisons and metrics. It probably still needs more work, and we need to get back to the discussion on shared metrics, but at least the code is now clean and it's easy to see what the current logic is.
  4. We've continued fine-tuning LLMs, now with the updated data set, and the results are slowly but steadily improving. There are two main LLMs: a larger model based on Mistral-7B and a much smaller one based on Qwen2-0.5B. The larger one needs a GPU to be practical, but the smaller one can be run on a laptop CPU using llama.cpp. The current evaluation scores are roughly 0.92 for the larger LLM, 0.85 for the smaller LLM, 0.64 for Meteor and 0.39 for the null baseline (always predicting nothing, which is surprisingly often the correct answer!).
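
Purely as an illustration of point 2, a record in that simple JSON schema might look roughly like the following; the field names and values are guesses except for the new fields named above, so treat this as a sketch rather than the project's actual schema:

```python
# Illustrative only: not the project's actual schema or data.
example_record = {
    "language": "eng",
    "title": "An example report title",
    "year": "2023",
    "publisher": "Example Publisher",
    "p-isbn": "978-0-00-000000-0",   # printed ISBN, one of the new fields
    "p-issn": "0000-0000",           # printed ISSN, another new field
    "doi": "10.1234/example",        # new field not supported by Meteor
    "type_coar": "report",           # publication type, new field
}
```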

But the most relevant part when it comes to adding LLM support to Meteor is the new Jupyter notebook for metadata extraction using an LLM API service. It needs an API service such as llama.cpp running locally, which is really easy to install and run, at least on Linux; see the instructions at the beginning of the notebook. So you may want to give it a try! The notebook also shows how the text is extracted from the PDF and converted to the simple JSON format that the fine-tuned LLM expects.

The same logic (text extraction, conversion to JSON, making API calls to the LLM, parsing its results) now just needs to be integrated into Meteor itself, as stated in the implementation plan above, and that's what I intend to try next!

osma linked a pull request on Jul 31, 2024 that will close this issue