Feat: Add accuracy evaluation for LLMs (GPQA, AIME, HLE etc.) #4

@nvzhihanj

Description

Add datasets, post-processing scripts, and an environment (likely Docker) to evaluate accuracy based on the outputs collected from the inference endpoints.
List of accuracy evals to add:

  • GPQA
  • MMLU, MMLU-Pro
  • AIME
  • MATH500
  • HLE
  • Health Bench
  • TBD
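The post-processing step described above could start as something like this minimal sketch for multiple-choice benchmarks such as GPQA or MMLU (the function names and the answer-format regex are assumptions for illustration, not part of this repo):

```python
import re
from typing import Optional

def extract_choice(output: str) -> Optional[str]:
    """Pull a final A-D answer letter from a model response.

    Assumes responses end with something like "Answer: C"; the
    pattern is a guess at common output formats, not a spec.
    """
    match = re.search(r"[Aa]nswer\s*[:\-]?\s*\(?([A-D])\)?", output)
    if match:
        return match.group(1).upper()
    # Fall back to the last standalone A-D letter in the text.
    letters = re.findall(r"\b([A-D])\b", output)
    return letters[-1] if letters else None

def accuracy(responses: list[str], references: list[str]) -> float:
    """Fraction of responses whose extracted choice matches the reference."""
    correct = sum(
        1 for resp, ref in zip(responses, references)
        if extract_choice(resp) == ref.upper()
    )
    return correct / len(references) if references else 0.0
```

Free-form benchmarks (AIME, MATH500, HLE) would need their own extractors and equivalence checks, so the extractor would likely be selected per dataset rather than shared.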

Metadata

Assignees

Labels

  • area: evaluation (Accuracy evaluation, scoring, extractors)
  • priority: ShowStopper (Drop everything: critical blocker, all hands on deck)
  • type: feature (New feature or capability)

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests