feat: measure cost of testset generator #1560

Merged: 6 commits, Oct 23, 2024
2 changes: 1 addition & 1 deletion docs/getstarted/rag_testset_generation.md
@@ -99,7 +99,7 @@ Now we will enrich the knowledge graph with additional information using [Transf
But you can mix and match transforms or build your own as needed.

```python
from ragas.testset.transforms import default_transforms
from ragas.testset.transforms import default_transforms, apply_transforms


# define your LLM and Embedding Model
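# A minimal sketch of how the two imported helpers are typically wired together,
# assuming kg, generator_llm and generator_embeddings are already defined as in
# the getting-started guide. The exact arguments of default_transforms vary
# between ragas versions (newer releases also take the source documents), so
# check the signature for your install.
transforms = default_transforms(llm=generator_llm, embedding_model=generator_embeddings)
apply_transforms(kg, transforms)  # run the transforms over the knowledge graph in place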
71 changes: 65 additions & 6 deletions docs/howtos/applications/_cost.md
@@ -1,4 +1,4 @@
# How to estimate Cost and Usage of evaluations
# How to estimate Cost and Usage of evaluations and testset generation

When using LLMs for evaluation and test set generation, cost will be an important factor. Ragas provides some tools to help you with that.

@@ -34,7 +34,9 @@ get_token_usage_for_openai(llm_result)

You can define your own parsers or import existing ones. If you would like to suggest a parser for an LLM provider or contribute your own, please check out this [issue](https://github.com/explodinggradients/ragas/issues/1151) 🙂.
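As a rough illustration of what such a parser looks like: it is just a callable that maps a LangChain `LLMResult` to a `TokenUsage`. The sketch below is hypothetical; where a provider reports its token counts (`llm_output`, `generation_info`, `usage_metadata`, ...) depends on its LangChain integration, so adapt the lookup accordingly.

```python
from langchain_core.outputs import LLMResult

from ragas.cost import TokenUsage


def my_token_usage_parser(llm_result: LLMResult) -> TokenUsage:
    # Hypothetical example: read the usage block that many chat providers
    # attach to llm_output; adjust the keys for your provider.
    usage = (llm_result.llm_output or {}).get("token_usage", {})
    return TokenUsage(
        input_tokens=usage.get("prompt_tokens", 0),
        output_tokens=usage.get("completion_tokens", 0),
        model="",
    )
```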

You can use it for evaluations as so.
## Token Usage for Evaluations

Let's use the `get_token_usage_for_openai` parser to calculate the token usage for an evaluation.


```python
@@ -43,9 +45,14 @@ from datasets import load_dataset

dataset = load_dataset("explodinggradients/amnesty_qa", "english_v3")

dataset = EvaluationDataset.load_from_hf(dataset["eval"])
eval_dataset = EvaluationDataset.from_hf_dataset(dataset["eval"])
```

Repo card metadata block was not found. Setting CardData to empty.


You can pass in the parser to the `evaluate()` function and the cost will be calculated and returned in the `Result` object.


```python
from ragas import evaluate
@@ -54,15 +61,15 @@ from ragas.metrics import LLMContextRecall
from ragas.cost import get_token_usage_for_openai

result = evaluate(
amnesty_qa["eval"],
eval_dataset,
metrics=[LLMContextRecall()],
llm=gpt4o,
token_usage_parser=get_token_usage_for_openai,
)
```


Evaluating: 0%| | 0/80 [00:00<?, ?it/s]
Evaluating: 0%| | 0/20 [00:00<?, ?it/s]



@@ -73,7 +80,7 @@ result.total_tokens()



TokenUsage(input_tokens=116765, output_tokens=39031, model='')
TokenUsage(input_tokens=25097, output_tokens=3757, model='')



@@ -92,3 +99,55 @@ result.total_cost(cost_per_input_token=5 / 1e6, cost_per_output_token=15 / 1e6)
1.1692900000000002
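The `5 / 1e6` and `15 / 1e6` values are example prices (USD per input and output token), not something Ragas knows about your model. `total_cost()` is just the aggregated token counts multiplied by those prices, which you can verify by hand:

```python
# reproduce total_cost() from the aggregated token usage
usage = result.total_tokens()
manual_cost = usage.input_tokens * (5 / 1e6) + usage.output_tokens * (15 / 1e6)
manual_cost  # should match result.total_cost(...) above
```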



## Token Usage for Testset Generation

You can use the same parser for testset generation, but you need to pass the `token_usage_parser` to the `generate()` function. For now, it only calculates the cost of the generation process, not the cost of the transforms.

As an example, let's load an existing KnowledgeGraph and generate a testset. If you want to know more about how to generate a testset, please check out the [testset generation](../../getstarted/rag_testset_generation.md#a-deeper-look) guide.


```python
from ragas.testset.graph import KnowledgeGraph

# loading an existing KnowledgeGraph
# make sure to change the path to the location of the KnowledgeGraph file
kg = KnowledgeGraph.load("../../../experiments/scratchpad_kg.json")
kg
```




KnowledgeGraph(nodes: 47, relationships: 109)
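The graph loaded above was built and persisted in an earlier run. As a brief sketch of the round trip (placeholder path, and assuming `kg` has already been populated, e.g. via `apply_transforms`):

```python
from ragas.testset.graph import KnowledgeGraph

kg.save("scratchpad_kg.json")                    # persist the graph as JSON
kg = KnowledgeGraph.load("scratchpad_kg.json")   # reload it in a later session
```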



### Choose your LLM

--8<--
choose_generator_llm.md
--8<--


```python
from ragas.testset import TestsetGenerator
from ragas.llms import llm_factory

tg = TestsetGenerator(llm=llm_factory(), knowledge_graph=kg)
# generating a testset
testset = tg.generate(testset_size=10, token_usage_parser=get_token_usage_for_openai)
```


```python
# total cost for the generation process
testset.total_cost(cost_per_input_token=5 / 1e6, cost_per_output_token=15 / 1e6)
```




0.20967000000000002


136 changes: 120 additions & 16 deletions docs/howtos/applications/cost.ipynb
@@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# How to estimate Cost and Usage of evaluations\n",
"# How to estimate Cost and Usage of evaluations and testset generation\n",
"\n",
"When using LLMs for evaluation and test set generation, cost will be an important factor. Ragas provides you some tools to help you with that."
]
@@ -24,7 +24,7 @@
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": 2,
"metadata": {},
"outputs": [
{
@@ -33,7 +33,7 @@
"TokenUsage(input_tokens=9, output_tokens=9, model='')"
]
},
"execution_count": 1,
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
@@ -56,39 +56,61 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"You can define your own or import parsers if they are defined. If you would like to suggest parser for LLM providers or contribute your own ones please check out this [issue](https://github.com/explodinggradients/ragas/issues/1151) 🙂.\n",
"You can define your own or import parsers if they are defined. If you would like to suggest parser for LLM providers or contribute your own ones please check out this [issue](https://github.com/explodinggradients/ragas/issues/1151) 🙂."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Token Usage for Evaluations\n",
"\n",
"You can use it for evaluations as so."
"Let's use the `get_token_usage_for_openai` parser to calculate the token usage for an evaluation."
]
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": 8,
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Repo card metadata block was not found. Setting CardData to empty.\n"
]
}
],
"source": [
"from ragas import EvaluationDataset\n",
"from datasets import load_dataset\n",
"\n",
"dataset = load_dataset(\"explodinggradients/amnesty_qa\", \"english_v3\")\n",
"\n",
"dataset = EvaluationDataset.load_from_hf(dataset[\"eval\"])"
"eval_dataset = EvaluationDataset.from_hf_dataset(dataset[\"eval\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can pass in the parser to the `evaluate()` function and the cost will be calculated and returned in the `Result` object."
]
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "40184591e33e4735b3d9905c2b0d6b9e",
"model_id": "c9cf15f7bae64320b2bc389b98321a37",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Evaluating: 0%| | 0/80 [00:00<?, ?it/s]"
"Evaluating: 0%| | 0/20 [00:00<?, ?it/s]"
]
},
"metadata": {},
@@ -102,7 +124,7 @@
"from ragas.cost import get_token_usage_for_openai\n",
"\n",
"result = evaluate(\n",
" amnesty_qa[\"eval\"],\n",
" eval_dataset,\n",
" metrics=[LLMContextRecall()],\n",
" llm=gpt4o,\n",
" token_usage_parser=get_token_usage_for_openai,\n",
@@ -111,16 +133,16 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"TokenUsage(input_tokens=116765, output_tokens=39031, model='')"
"TokenUsage(input_tokens=25097, output_tokens=3757, model='')"
]
},
"execution_count": 4,
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
@@ -157,6 +179,88 @@
"source": [
"result.total_cost(cost_per_input_token=5 / 1e6, cost_per_output_token=15 / 1e6)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Token Usage for Testset Generation\n",
"\n",
"You can use the same parser for testset generation but you need to pass in the `token_usage_parser` to the `generate()` function. For now it only calculates the cost for the generation process and not the cost for the transforms.\n",
"\n",
"For an example let's load an existing KnowledgeGraph and generate a testset. If you want to know more about how to generate a testset please check out the [testset generation](../../getstarted/rag_testset_generation.md#a-deeper-look)."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"KnowledgeGraph(nodes: 47, relationships: 109)"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from ragas.testset.graph import KnowledgeGraph\n",
"\n",
"# loading an existing KnowledgeGraph\n",
"# make sure to change the path to the location of the KnowledgeGraph file\n",
"kg = KnowledgeGraph.load(\"../../../experiments/scratchpad_kg.json\")\n",
"kg"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Choose your LLM\n",
"\n",
"--8<--\n",
"choose_generator_llm.md\n",
"--8<--"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from ragas.testset import TestsetGenerator\n",
"from ragas.llms import llm_factory\n",
"\n",
"tg = TestsetGenerator(llm=llm_factory(), knowledge_graph=kg)\n",
"# generating a testset\n",
"testset = tg.generate(testset_size=10, token_usage_parser=get_token_usage_for_openai)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.20967000000000002"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# total cost for the generation process\n",
"testset.total_cost(cost_per_input_token=5 / 1e6, cost_per_output_token=15 / 1e6)"
]
}
],
"metadata": {
@@ -175,7 +279,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.8"
"version": "3.10.12"
}
},
"nbformat": 4,
16 changes: 10 additions & 6 deletions src/ragas/dataset_schema.py
@@ -143,9 +143,13 @@ def pretty_repr(self):
T = t.TypeVar("T", bound="RagasDataset")


class RagasDataset(ABC, BaseModel, t.Generic[Sample]):
@dataclass
class RagasDataset(ABC, t.Generic[Sample]):
samples: t.List[Sample]

def __post_init__(self):
self.samples = self.validate_samples(self.samples)

@abstractmethod
def to_list(self) -> t.List[t.Dict]:
"""Converts the samples to a list of dictionaries."""
@@ -154,17 +158,16 @@ def to_list(self) -> t.List[t.Dict]:
@classmethod
@abstractmethod
def from_list(cls: t.Type[T], data: t.List[t.Dict]) -> T:
"""Creates an EvaluationDataset from a list of dictionaries."""
"""Creates an RagasDataset from a list of dictionaries."""
pass

@field_validator("samples")
def validate_samples(cls, samples: t.List[BaseSample]) -> t.List[BaseSample]:
def validate_samples(self, samples: t.List[Sample]) -> t.List[Sample]:
"""Validates that all samples are of the same type."""
if len(samples) == 0:
return samples

first_sample_type = type(samples[0])
if not all(isinstance(sample, first_sample_type) for sample in samples):
first_sample_type = type(self.samples[0])
if not all(isinstance(sample, first_sample_type) for sample in self.samples):
raise ValueError("All samples must be of the same type")

return samples
@@ -263,6 +266,7 @@ def __repr__(self) -> str:
SingleTurnSampleOrMultiTurnSample = t.Union[SingleTurnSample, MultiTurnSample]


@dataclass
class EvaluationDataset(RagasDataset[SingleTurnSampleOrMultiTurnSample]):
"""
Represents a dataset of evaluation samples.
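For readers skimming the diff above: `RagasDataset` moves from a pydantic `BaseModel` with a `@field_validator` to a plain dataclass that validates its samples in `__post_init__`. A standalone sketch of that pattern (simplified, not the actual ragas classes):

```python
import typing as t
from dataclasses import dataclass


@dataclass
class Dataset:
    samples: t.List[object]

    def __post_init__(self) -> None:
        # Runs right after the generated __init__, so invalid input fails
        # at construction time instead of at first use.
        self.samples = self.validate_samples(self.samples)

    def validate_samples(self, samples: t.List[object]) -> t.List[object]:
        """All samples must share a single concrete type."""
        if samples and not all(type(s) is type(samples[0]) for s in samples):
            raise ValueError("All samples must be of the same type")
        return samples


ds = Dataset(samples=[1, 2, 3])   # fine
# Dataset(samples=[1, "a"])       # raises ValueError
```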