ahmedEid1/README.md

Hi, I'm Ahmed 👋 — AI / Agent Engineer

I build LLM agents you can trust — measured with golden evals, per-claim citation checks, and full tracing. Based in Germany. Open to AI / Agent Engineer roles.

🔎 What I'm known for: evaluation & agent-observability rigor — turning "the demo worked" into "here's the eval score, the trace, and the CI gate."

Flagship projects

  • Lumen — open-source agentic AI tutor. Custom multi-agent orchestrator (no LangChain), course-scoped RAG over pgvector, a 9-tool MCP server (stdio + HTTP) on the official registry, a golden eval suite (LLM-as-judge) wired into CI, and observable per-call traces (tokens, cost, latency). Production deploy on AWS Graviton. ⭐ 66 · live demo · public /eval
  • Thoth — open-source agentic systematic literature reviews with per-claim citation verification (cite_check) that scores every cited claim against its source paper and flags hallucinated references. LangGraph agent (planner → assessor → drafter → critic → cite_check), an authenticated MCP server (OAuth 2.1 + PKCE), Langfuse + OpenTelemetry tracing, and a public eval dashboard. 676 tests. live · eval dashboard
  • ForgeJudge — open, always-on leaderboard + CI gate for autonomous coding agents. A hand-rolled single-agent solver fixes real bugs from the issue text alone; every patch is scored by execution-as-judge (verified equivalent to the official SWE-bench grading, and cheat-resistant), every run emits a public OpenTelemetry/Langfuse trace, and a multi-seed CI gate blocks any regression — on a $0 stack against a contamination-resistant, mutation-hardened golden set. A model-swap proves the harness is the deliverable (llama-3.1-8b 42% → llama-3.3-70b 83% pass@1, harness fixed). FastMCP server on the official registry. live leaderboard
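The multi-seed CI gate mentioned for ForgeJudge can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the project's actual code: the function names `pass_at_1` and `gate`, the 0.02 tolerance, and the sample results are all hypothetical and chosen only for illustration.

```python
from statistics import mean


def pass_at_1(results: list[bool]) -> float:
    """Fraction of tasks whose first (and only) patch passed the tests."""
    if not results:
        raise ValueError("no results to score")
    return sum(results) / len(results)


def gate(seed_runs: dict[int, list[bool]], baseline: float,
         tolerance: float = 0.02) -> bool:
    """Return True when mean pass@1 across seeds stays within
    `tolerance` of the recorded baseline (hypothetical policy)."""
    scores = [pass_at_1(results) for results in seed_runs.values()]
    return mean(scores) >= baseline - tolerance


# Hypothetical per-seed outcomes: one boolean per task.
runs = {
    0: [True, True, False, True],
    1: [True, False, False, True],
    2: [True, True, True, True],
}

if __name__ == "__main__":
    # A CI job could exit non-zero when the gate fails.
    raise SystemExit(0 if gate(runs, baseline=0.75) else 1)
```

Averaging over several seeds rather than gating on a single run is one way to keep sampling noise from blocking or passing a change by chance; the real gate may use a different statistic.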

The numbers are public on purpose. Thoth's dashboard shows citation precision 97% and recall 74% on the goldens scored so far, with faithfulness and coverage tracked in the open as the golden set fills out — a regression is a falsifiable public signal, not a hidden one. Lumen's /eval publishes all three suites whole, strong and weak alike (authoring 3.85/5). The point isn't a perfect score — it's that every score is measured, reproducible, and gated in CI.
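For readers unfamiliar with the two citation metrics, here is a small worked sketch of how precision and recall are conventionally computed over (claim, source) pairs. It assumes the standard set-based definitions; how Thoth's cite_check actually matches claims to sources is not described here, and the identifiers below are made up.

```python
from dataclasses import dataclass

# A citation is modelled as a (claim_id, source_id) pair (an assumption).
Citation = tuple[str, str]


@dataclass(frozen=True)
class CitationScore:
    precision: float  # share of emitted citations that are correct
    recall: float     # share of gold citations that were emitted


def score_citations(predicted: set[Citation], gold: set[Citation]) -> CitationScore:
    """Compare predicted citations against a golden set."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return CitationScore(precision, recall)


# Hypothetical data: 4 citations emitted, 3 of them in the 5-item golden set.
gold = {("c1", "p1"), ("c2", "p2"), ("c3", "p3"), ("c4", "p4"), ("c5", "p5")}
predicted = {("c1", "p1"), ("c2", "p2"), ("c3", "p3"), ("c4", "p9")}

score = score_citations(predicted, gold)
print(score)
```

High precision with lower recall, as in the figures above, would mean that the citations produced are mostly right while some expected ones are still missing.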

Stack

  • AI / Agents: LLM agents · RAG / pgvector · multi-agent orchestration · MCP · LangGraph · Claude API · prompt & context engineering
  • Eval & Observability: golden evals · LLM-as-judge · cite_check · Langfuse · OpenTelemetry · CI regression gates
  • Languages & Backend: Python · TypeScript · Java · FastAPI · Flask · Node.js · PostgreSQL · Redis
  • Infra: Docker · Kubernetes · Helm · AWS · CI/CD

More

3+ years shipping production software · BSc top of class (Al-Azhar) · MSc Digital Transformation (FH Dortmund) · currently going deep on LLM internals and agent architectures — transformer mechanics, RAG, evaluation, and LLMOps.

Foundations (proof of depth): GPT from scratch (PyTorch) — a 297-test from-scratch GPT (tokenizer → attention → training loop → LoRA). Not just an API caller.

📫 ahmedhobeishy@gmail.com · 🔗 ahmedhobeishy.tech · 💼 in/ahmedhobeishy

Pinned

  1. lumen (Public)

    Lumen — learner-owned AI education platform. Tell the AI what you want to learn: it builds you a private course in ~a minute, tutors you with course-scoped RAG + citations, and lets you share, clon…

    Python 68 34