openai · OpenAI Platform Docs

Working with evals

A guide to programmatically configuring, running, and analyzing model evaluations using the Evals API to test LLM outputs against specific style and content criteria.

Derived skill

Files assembled from official documentation

When To Use

Use this skill when you need to programmatically validate that LLM outputs meet specific quality standards, or when you need to compare model performance across upgrades using test datasets and grading criteria.
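
As a rough illustration of what grading criteria applied to a test dataset can look like, the sketch below scores two sets of model outputs against reference labels with an exact string match and compares their pass rates. It runs locally and does not call the Evals API. The function name `pass_rate`, the labels, and both output lists are hypothetical placeholders, not values from the source.

```python
from typing import Sequence


def pass_rate(outputs: Sequence[str], references: Sequence[str]) -> float:
    """Return the fraction of outputs that exactly match their reference label."""
    if len(outputs) != len(references):
        raise ValueError("outputs and references must have the same length")
    if not references:
        return 0.0
    # Whitespace is stripped so a trailing newline does not count as a failure.
    passed = sum(
        out.strip() == ref.strip() for out, ref in zip(outputs, references)
    )
    return passed / len(references)


# Hypothetical reference labels and outputs from a current and a candidate model.
references = ["Hardware", "Software", "Other"]
current_model_outputs = ["Hardware", "Other", "Other"]
candidate_model_outputs = ["Hardware", "Software", "Other"]

current_score = pass_rate(current_model_outputs, references)
candidate_score = pass_rate(candidate_model_outputs, references)
print(f"current: {current_score:.2f}, candidate: {candidate_score:.2f}")
```

A hosted eval does the same kind of comparison at scale; this local version is only meant to make the idea concrete before reading the reference files.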

Reference Files

| File | Contains | Use For |
| --- | --- | --- |
| SKILL.md | Entry point: scope, routing table, and workflow. | Start here. |
| docs/working-with-evals-workflow-guide.md | A guide explaining how to create evaluations, test prompts, upload test data, and manage eval runs to ensure model output quality. | Questions about a guide explaining how to create evaluations, test prompts, upload test data, and manage eval runs to ensure model ou... |
| examples/working-with-evals-openai-evals-curl-request.bash | A bash curl command demonstrating how to send a request to the OpenAI API for evaluation purposes. | Exact payloads, commands, or snippets shown in A bash curl command demonstrating how to send a request to the OpenAI API for evaluation purposes. |
| examples/working-with-evals-openai-evals-javascript-categorization.javascript | A JavaScript code example demonstrating how to use the OpenAI client to perform categorization tasks as part of an evaluation workflow. | Exact payloads, commands, or snippets shown in A JavaScript code example demonstrating how to use the OpenAI client to perform categorization tasks as part of an ev... |
| examples/working-with-evals-openai-evals-python-categorization.python | A Python script demonstrating how to use the OpenAI client to categorize IT support tickets as part of an evaluation workflow. | Exact payloads, commands, or snippets shown in A Python script demonstrating how to use the OpenAI client to categorize IT support tickets as part of an evaluation... |
| examples/working-with-evals-openai-evals-api-curl-request.bash | A bash curl command demonstrating how to create a new evaluation via the OpenAI API with a custom datasource configuration. | Exact payloads, commands, or snippets shown in A bash curl command demonstrating how to create a new evaluation via the OpenAI API with a custom datasource configur... |
| examples/working-with-evals-openai-evals-create.javascript | A JavaScript code example demonstrating how to use the OpenAI SDK to create an evaluation object with custom datasource configuration. | Exact payloads, commands, or snippets shown in A JavaScript code example demonstrating how to use the OpenAI SDK to create an evaluation object with custom datasour... |
| examples/working-with-evals-openai-evals-create.python | A Python script demonstrating how to use the OpenAI client to create a custom evaluation with a datasource configuration and item schema. | Exact payloads, commands, or snippets shown in A Python script demonstrating how to use the OpenAI client to create a custom evaluation with a datasource configurat... |
| examples/working-with-evals-openai-evals-custom-eval.json | A JSON object defining a custom evaluation schema including item properties and required fields for an evaluation task. | Exact payloads, commands, or snippets shown in A JSON object defining a custom evaluation schema including item properties and required fields for an evaluation task. |
| examples/working-with-evals-openai-evals.json | A JSON object defining an evaluation schema with string match operations and input/reference mappings. | Exact payloads, commands, or snippets shown in A JSON object defining an evaluation schema with string match operations and input/reference mappings. |
| examples/working-with-evals-openai-evals-custom-datasource.json | A JSON configuration object defining a custom datasource and string-check testing criteria for an evaluation. | Exact payloads, commands, or snippets shown in A JSON configuration object defining a custom datasource and string-check testing criteria for an evaluation. |
| examples/working-with-evals-openai-evals-dataset-item-examples.json | A JSON array of evaluation dataset items containing ticket text and their corresponding correct labels. | Exact payloads, commands, or snippets shown in A JSON array of evaluation dataset items containing ticket text and their corresponding correct labels. |
| examples/working-with-evals-openai-evals-upload-curl.bash | A bash command using curl to upload a JSONL file for evaluation purposes to the OpenAI Files API. | Exact payloads, commands, or snippets shown in A bash command using curl to upload a JSONL file for evaluation purposes to the OpenAI Files API. |
| examples/working-with-evals-openai-evals-javascript-upload.javascript | A JavaScript code example demonstrating how to upload a JSONL file to OpenAI for use with the evals API. | Exact payloads, commands, or snippets shown in A JavaScript code example demonstrating how to upload a JSONL file to OpenAI for use with the evals API. |
| examples/working-with-evals-openai-evals-python-upload.python | A Python script demonstrating how to upload a JSONL file to OpenAI for use with the evals API. | Exact payloads, commands, or snippets shown in A Python script demonstrating how to upload a JSONL file to OpenAI for use with the evals API. |
| examples/working-with-evals-openai-evals-tickets-dataset.json | A JSONL formatted dataset containing ticket objects used for evaluating model performance. | Exact payloads, commands, or snippets shown in A JSONL formatted dataset containing ticket objects used for evaluating model performance. |
| examples/working-with-evals-openai-evals-run-creation-curl.bash | A curl command demonstrating how to create a new run for an existing evaluation using the OpenAI API. | Exact payloads, commands, or snippets shown in A curl command demonstrating how to create a new run for an existing evaluation using the OpenAI API. |
| examples/working-with-evals-openai-evals-runs-create.javascript | A JavaScript code example demonstrating how to create an evaluation run using the OpenAI SDK. | Exact payloads, commands, or snippets shown in A JavaScript code example demonstrating how to create an evaluation run using the OpenAI SDK. |
| examples/working-with-evals-openai-evals-runs-create.python | A Python script demonstrating how to create an evaluation run using the OpenAI client to categorize IT support tickets. | Exact payloads, commands, or snippets shown in A Python script demonstrating how to create an evaluation run using the OpenAI client to categorize IT support tickets. |
| examples/working-with-evals-openai-evals-run-response-object.json | A JSON object representing the response structure of an evaluation run, including run ID, status, and report URL. | Exact payloads, commands, or snippets shown in A JSON object representing the response structure of an evaluation run, including run ID, status, and report URL. |
| examples/working-with-evals-openai-evals-run-status-curl-request.bash | A curl command to retrieve the status of a specific evaluation run using the OpenAI API. | Exact payloads, commands, or snippets shown in A curl command to retrieve the status of a specific evaluation run using the OpenAI API. |
| examples/working-with-evals-openai-evals-runs-retrieve.javascript | A JavaScript code example demonstrating how to retrieve an evaluation run using the OpenAI client library. | Exact payloads, commands, or snippets shown in A JavaScript code example demonstrating how to retrieve an evaluation run using the OpenAI client library. |
| examples/working-with-evals-openai-evals-runs-retrieve.python | A Python script demonstrating how to use the OpenAI client to retrieve the status and details of an evaluation run. | Exact payloads, commands, or snippets shown in A Python script demonstrating how to use the OpenAI client to retrieve the status and details of an evaluation run. |
| examples/working-with-evals-openai-evals-run-result-response.json | A JSON object representing the completed status and result counts of an evaluation run. | Exact payloads, commands, or snippets shown in A JSON object representing the completed status and result counts of an evaluation run. |
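
The run result file listed last is described as holding a completed status and result counts. A minimal sketch of turning such a payload into a one-line summary might look like the following. The key names (`status`, `result_counts`, `total`, `passed`, `failed`, `errored`), the helper `summarize_run`, and the sample values are all assumptions for illustration; check them against the actual JSON file before relying on them.

```python
from typing import Any, Mapping


def summarize_run(run: Mapping[str, Any]) -> str:
    """Build a one-line summary from a run-like dictionary (assumed key names)."""
    status = run.get("status", "unknown")
    counts = run.get("result_counts") or {}
    total = counts.get("total", 0)
    passed = counts.get("passed", 0)
    failed = counts.get("failed", 0)
    errored = counts.get("errored", 0)
    # Guard against division by zero when a run has no graded items yet.
    rate = passed / total if total else 0.0
    return (
        f"status={status} passed={passed} failed={failed} "
        f"errored={errored} total={total} pass_rate={rate:.0%}"
    )


# Hypothetical payload; every value here is made up for illustration.
example_run = {
    "id": "evalrun_example",
    "status": "completed",
    "result_counts": {"total": 3, "passed": 2, "failed": 1, "errored": 0},
}
print(summarize_run(example_run))
```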

What This Skill Covers

  • Evaluations (often called evals) test model outputs to ensure they meet style and content criteria that you specify. Writing evals to understand how your LLM...
  • Main sections: Create an eval for a task, Test a prompt with your eval, Uploading test data, Creating an eval run, Analyze the results.
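
The first and third sections named above (creating an eval, uploading test data) can be sketched offline as plain data. The block below builds an eval definition with a custom data source and one string-check criterion, then serializes test rows as JSONL. It makes no API calls. The key names and template strings are recalled from the OpenAI guide and should be treated as assumptions to verify against the examples/ files; the helper `to_jsonl` and the sample tickets are hypothetical.

```python
import json
from typing import Any

# Assumed shape of an eval definition: a custom data source with an item
# schema, plus one string-check grader. Verify every key before use.
eval_definition: dict[str, Any] = {
    "name": "IT Ticket Categorization",
    "data_source_config": {
        "type": "custom",
        "item_schema": {
            "type": "object",
            "properties": {
                "ticket_text": {"type": "string"},
                "correct_label": {"type": "string"},
            },
            "required": ["ticket_text", "correct_label"],
        },
        "include_sample_schema": True,
    },
    "testing_criteria": [
        {
            "type": "string_check",
            "name": "Match output to human label",
            "input": "{{ sample.output_text }}",
            "operation": "eq",
            "reference": "{{ item.correct_label }}",
        }
    ],
}

# Hypothetical test rows; the ticket text and labels are made up.
rows = [
    {"ticket_text": "My monitor won't turn on", "correct_label": "Hardware"},
    {"ticket_text": "The editor crashes on save", "correct_label": "Software"},
]


def to_jsonl(items: list[dict[str, str]], required: list[str]) -> str:
    """Serialize rows as JSONL, wrapping each one under an assumed "item" key."""
    lines = []
    for item in items:
        missing = [key for key in required if key not in item]
        if missing:
            raise ValueError(f"row is missing required fields: {missing}")
        lines.append(json.dumps({"item": item}))
    return "\n".join(lines) + "\n"


required_fields = eval_definition["data_source_config"]["item_schema"]["required"]
jsonl_text = to_jsonl(rows, required_fields)
print(jsonl_text, end="")
```

Validating rows against the schema's required fields before upload is a design choice of this sketch, not something the source prescribes; it simply catches malformed rows early.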

Workflow

  1. Open the most relevant file under docs/ for the exact documented workflow and wording.
  2. Open schemas/ files, where provided, for exact structured contracts; the Reference Files table lists none for this skill.
  3. Open examples/ files for concrete requests, commands, snippets, and manifests.
  4. Do not add behavior or configuration that is not present in the attached source files.

Canonical source: https://developers.openai.com/api/docs/guides/evals.md