Replies: 1 comment
The POC is PR'd here: #1665
Describe the feature or potential improvement
What I appreciate about LangFuse is its core abstraction: the Trace. It's versatile enough to be applied to a broad range of use-cases, while also allowing a rich set of observation and evaluation tools to be built for various model implementations. In my past experience, feedback has normally been dispersed across multiple spreadsheets, databases, and Jupyter notebooks. At Rocket Money, we aim to use Traces for many different use-cases, including:
These are not theoretical use-cases; they are real, in-production use-cases. Thanks to the adaptability of a trace, we can visualize and collect manual feedback for all these use-cases using one tool, despite their differing implementations.
Task
Although a trace can effectively observe anything, Langfuse's tools are limited without an understanding of the original task. Essentially, a Task is constrained to a well-defined input and output. If we attach a Task to Datasets and Traces, we could transition from editing raw JSON in textboxes to using validated form fields.
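To make this concrete, here is a rough sketch of what a Task record could look like; the `TaskDefinition` type and its field names are hypothetical, not part of any existing Langfuse API:

```typescript
// Hypothetical shape of a Task record; none of these names exist in Langfuse today.
// Input and output are described as JSON Schemas so the UI can render validated
// form fields instead of raw JSON textboxes.
interface TaskDefinition {
  name: string;                          // unique task identifier
  inputSchema: Record<string, unknown>;  // JSON Schema for the task input
  outputSchema: Record<string, unknown>; // JSON Schema for the task output
}

// Illustrative example: a named-entity-recognition task over transaction descriptions.
const transactionNerTask: TaskDefinition = {
  name: "transaction-ner",
  inputSchema: {
    type: "object",
    properties: { description: { type: "string" } },
    required: ["description"],
  },
  outputSchema: {
    type: "object",
    properties: {
      merchant: { type: "string" },
      category: { type: "string" },
    },
    required: ["merchant"],
  },
};
```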
Given the existing tooling and OpenAI's use of JSON schemas, defining task inputs and outputs as JSON schemas offers great flexibility (see the playground example of using a JSON schema to drive a form). In practice, these tasks are likely to be defined in the main software application, where the contract is already established. If zod is employed in the main application, we can seamlessly relay that contract and register the task within LangFuse through a new API request. I envision this as an idempotent process that runs during continuous integration when the main branch is merged; that way, all of these tasks populate LangFuse exactly as they are defined in the main application.
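As a sketch of what that CI step could look like, assuming the contract lives in zod and is converted with the zod-to-json-schema package: the endpoint, payload, and helper below are the proposed API, invented here for illustration, not something Langfuse exposes today.

```typescript
import { z } from "zod";
import { zodToJsonSchema } from "zod-to-json-schema";

// The zod contract already lives in the main application.
const NerInput = z.object({ description: z.string() });
const NerOutput = z.object({
  merchant: z.string(),
  category: z.string().optional(),
});

// Hypothetical idempotent registration step, run in CI when main is merged.
async function registerTask(baseUrl: string, authHeader: string): Promise<void> {
  const response = await fetch(`${baseUrl}/api/public/tasks`, {
    method: "PUT", // PUT so re-running the pipeline upserts instead of duplicating
    headers: { "Content-Type": "application/json", Authorization: authHeader },
    body: JSON.stringify({
      name: "transaction-ner",
      inputSchema: zodToJsonSchema(NerInput),
      outputSchema: zodToJsonSchema(NerOutput),
    }),
  });
  if (!response.ok) {
    throw new Error(`Task registration failed: ${response.status}`);
  }
}
```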
Bot - more flexible than Prompts
There are many tools and services available for iterating and evaluating prompts on foundational Large Language Models (LLMs). We've used HumanLoop, Vellum, Helicone, Braintrust, and various open-source tools that provide similar support. However, these solutions often fall short when tasks require more than a single string-interpolated LLM completion.
For example, we usually start with a single string-interpolated template for an LLM completion, but we often move away from this because:
These are not hypothetical situations but reflect real scenarios. In the future, I foresee multi-agent solutions like AutoGen or CrewAI and tools like DSPy reducing the relevance of a "prompting iterator". However, evaluations and traces will become even more important.
Iterating on these tasks outside of a software-engineering context can significantly increase velocity, allowing domain/ML experts and software engineers to work concurrently. Like Traces, a Bot is an abstraction that can follow these use-cases even after they move beyond the basic string-template LLM solution. In all of the solutions above, there are aspects that can be tweaked, and in those situations traces, datasets, and feedback remain vital for iteration and improvement. As long as the software contract, i.e. the Task, stays the same, software engineers are not required.
For instance, in the NER example above, we can adjust the model and a confidence threshold without breaking the software contract. We can establish this bot-configuration schema when defining Tasks, alongside the clearly defined input and output schemas. LangFuse would then no longer cater solely to a single implementation; it could display an appropriate interface for any type of implementation of a Task. JSON schemas support unions, enabling us to create Bots with various implementations. Beyond that, a Bot table record can function exactly like the Prompt table record already defined in the LangFuse codebase.
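Here is a rough sketch of such a bot-configuration schema, using a JSON Schema oneOf union over two illustrative implementations; all field names are assumptions, not an existing Langfuse schema:

```typescript
// Hypothetical bot-configuration schema for the NER task: a JSON Schema union
// (oneOf) over two illustrative implementations of the same Task contract.
const nerBotConfigSchema = {
  oneOf: [
    {
      title: "Single LLM completion",
      type: "object",
      properties: {
        implementation: { const: "llm-completion" },
        model: { type: "string", enum: ["gpt-3.5-turbo", "gpt-4"] },
        promptTemplate: { type: "string" },
      },
      required: ["implementation", "model", "promptTemplate"],
    },
    {
      title: "Classifier with LLM fallback",
      type: "object",
      properties: {
        implementation: { const: "classifier-with-fallback" },
        model: { type: "string" },
        confidenceThreshold: { type: "number", minimum: 0, maximum: 1 },
      },
      required: ["implementation", "model", "confidenceThreshold"],
    },
  ],
};
```

The UI could then use the implementation discriminator to pick the right branch of the union and render validated fields for whichever Bot variant is configured, without the software contract ever changing.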
Additional information
No response