openai · OpenAI Platform Docs
Working with evals
A guide to programmatically configuring, running, and analyzing model evaluations using the Evals API to test LLM outputs against specific style and content criteria.
Derived skill
Files assembled from official documentation
When To Use
Use when you need to programmatically validate that LLM outputs meet specific quality standards or when comparing model performance during upgrades using test datasets and grading criteria.
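As a rough local illustration of what "grading criteria" means here, the sketch below scores a few outputs against expected labels with an exact string match. It does not call the Evals API; the `grade_exact_match` helper, the `GradeSummary` type, and the sample data are all hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass
class GradeSummary:
    total: int
    passed: int
    failed: int


def grade_exact_match(outputs, expected):
    """Count outputs that exactly equal their expected label (hypothetical helper)."""
    if len(outputs) != len(expected):
        raise ValueError("outputs and expected must have the same length")
    # Surrounding whitespace is ignored on the output side only.
    passed = sum(1 for out, ref in zip(outputs, expected) if out.strip() == ref)
    return GradeSummary(total=len(outputs), passed=passed, failed=len(outputs) - passed)


# Made-up sample data for illustration only.
summary = grade_exact_match(
    ["Hardware", "Software", "Other"],
    ["Hardware", "Software", "Hardware"],
)
print(summary)
```

A hosted eval applies the same idea at scale: each dataset item is graded against a criterion, and the run reports how many items passed or failed.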
Reference Files
| File | Contains | Use For |
|---|---|---|
| SKILL.md | Entry point: scope, routing table, and workflow. | Start here. |
| docs/working-with-evals-workflow-guide.md | A guide explaining how to create evaluations, test prompts, upload test data, and manage eval runs to ensure model output quality. | Questions about creating evals, testing prompts, uploading test data, and managing eval runs. |
| examples/working-with-evals-openai-evals-curl-request.bash | A bash curl command demonstrating how to send a request to the OpenAI API for evaluation purposes. | Exact curl command for the request. |
| examples/working-with-evals-openai-evals-javascript-categorization.javascript | A JavaScript code example demonstrating how to use the OpenAI client to perform categorization tasks as part of an evaluation workflow. | Exact JavaScript snippet for the categorization call. |
| examples/working-with-evals-openai-evals-python-categorization.python | A Python script demonstrating how to use the OpenAI client to categorize IT support tickets as part of an evaluation workflow. | Exact Python snippet for the ticket categorization call. |
| examples/working-with-evals-openai-evals-api-curl-request.bash | A bash curl command demonstrating how to create a new evaluation via the OpenAI API with a custom datasource configuration. | Exact curl command and payload for creating an eval. |
| examples/working-with-evals-openai-evals-create.javascript | A JavaScript code example demonstrating how to use the OpenAI SDK to create an evaluation object with custom datasource configuration. | Exact JavaScript snippet for creating an eval. |
| examples/working-with-evals-openai-evals-create.python | A Python script demonstrating how to use the OpenAI client to create a custom evaluation with a datasource configuration and item schema. | Exact Python snippet for creating an eval, including the item schema. |
| examples/working-with-evals-openai-evals-custom-eval.json | A JSON object defining a custom evaluation schema including item properties and required fields for an evaluation task. | Exact item schema for a custom eval. |
| examples/working-with-evals-openai-evals.json | A JSON object defining an evaluation schema with string match operations and input/reference mappings. | Exact string-match criteria and input/reference mappings. |
| examples/working-with-evals-openai-evals-custom-datasource.json | A JSON configuration object defining a custom datasource and string-check testing criteria for an evaluation. | Exact custom datasource and string-check configuration. |
| examples/working-with-evals-openai-evals-dataset-item-examples.json | A JSON array of evaluation dataset items containing ticket text and their corresponding correct labels. | Exact shape of the dataset items. |
| examples/working-with-evals-openai-evals-upload-curl.bash | A bash command using curl to upload a JSONL file for evaluation purposes to the OpenAI Files API. | Exact curl command for uploading a JSONL file. |
| examples/working-with-evals-openai-evals-javascript-upload.javascript | A JavaScript code example demonstrating how to upload a JSONL file to OpenAI for use with the evals API. | Exact JavaScript snippet for the file upload. |
| examples/working-with-evals-openai-evals-python-upload.python | A Python script demonstrating how to upload a JSONL file to OpenAI for use with the evals API. | Exact Python snippet for the file upload. |
| examples/working-with-evals-openai-evals-tickets-dataset.json | A JSONL-formatted dataset containing ticket objects used for evaluating model performance. | Exact JSONL dataset contents. |
| examples/working-with-evals-openai-evals-run-creation-curl.bash | A curl command demonstrating how to create a new run for an existing evaluation using the OpenAI API. | Exact curl command for creating a run. |
| examples/working-with-evals-openai-evals-runs-create.javascript | A JavaScript code example demonstrating how to create an evaluation run using the OpenAI SDK. | Exact JavaScript snippet for creating a run. |
| examples/working-with-evals-openai-evals-runs-create.python | A Python script demonstrating how to create an evaluation run using the OpenAI client to categorize IT support tickets. | Exact Python snippet for creating a run. |
| examples/working-with-evals-openai-evals-run-response-object.json | A JSON object representing the response structure of an evaluation run, including run ID, status, and report URL. | Exact fields of the run response object. |
| examples/working-with-evals-openai-evals-run-status-curl-request.bash | A curl command to retrieve the status of a specific evaluation run using the OpenAI API. | Exact curl command for retrieving a run's status. |
| examples/working-with-evals-openai-evals-runs-retrieve.javascript | A JavaScript code example demonstrating how to retrieve an evaluation run using the OpenAI client library. | Exact JavaScript snippet for retrieving a run. |
| examples/working-with-evals-openai-evals-runs-retrieve.python | A Python script demonstrating how to use the OpenAI client to retrieve the status and details of an evaluation run. | Exact Python snippet for retrieving a run. |
| examples/working-with-evals-openai-evals-run-result-response.json | A JSON object representing the completed status and result counts of an evaluation run. | Exact fields of a completed run's result. |
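
The run response files listed in the table describe a run's ID, status, report URL, and result counts. A minimal sketch of turning such an object into a one-line summary might look like the following. The field names (`id`, `status`, `result_counts`, `total`, `passed`) and the `summarize_run` helper are assumptions for illustration; check the JSON example files for the exact shape.

```python
def summarize_run(run: dict) -> str:
    """Return a one-line summary of a run object (hypothetical helper).

    Field names are assumed, not taken from the example files.
    """
    run_id = run.get("id", "?")
    status = run.get("status", "unknown")
    if status != "completed":
        return f"run {run_id} is {status}"
    counts = run.get("result_counts", {})
    total = counts.get("total", 0)
    passed = counts.get("passed", 0)
    # Guard against an empty run to avoid dividing by zero.
    rate = passed / total if total else 0.0
    return f"run {run_id} completed: {passed}/{total} passed ({rate:.0%})"


# Made-up run object for illustration only.
example_run = {
    "id": "evalrun_example",
    "status": "completed",
    "result_counts": {"total": 3, "passed": 2, "failed": 1, "errored": 0},
}
line = summarize_run(example_run)
print(line)
```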
What This Skill Covers
- Evaluations (often called evals) test model outputs to ensure they meet style and content criteria that you specify. Writing evals to understand how your LLM...
- Main sections: Create an eval for a task, Test a prompt with your eval, Uploading test data, Creating an eval run, Analyze the results.
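
For the "Uploading test data" step, the test data is a JSONL file of ticket items with their correct labels. The sketch below writes and re-reads such a file using only the standard library. The field names `ticket_text` and `correct_label`, the `item` wrapper key, the file name, and the sample rows are assumptions for illustration; the dataset example files hold the exact format.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sample rows; the field names are assumptions.
items = [
    {"ticket_text": "My monitor won't turn on!", "correct_label": "Hardware"},
    {"ticket_text": "I'm in vim and I can't quit!", "correct_label": "Software"},
]


def write_jsonl(path, rows):
    """Write one JSON object per line, each wrapped under an assumed "item" key."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps({"item": row}) + "\n")


def read_jsonl(path):
    """Read the file back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


dataset_path = Path(tempfile.gettempdir()) / "tickets_example.jsonl"
write_jsonl(dataset_path, items)
loaded = read_jsonl(dataset_path)
print(len(loaded), "rows written to", dataset_path)
```

Reading the file back before uploading is a cheap way to catch malformed lines early.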
Workflow
- Open the most relevant file under `docs/` for the exact documented workflow and wording.
- Open `schemas/` files for exact structured contracts.
- Open `examples/` files for concrete requests, commands, snippets, and manifests.
- Do not add behavior or configuration that is not present in the attached source files.
Canonical source: https://developers.openai.com/api/docs/guides/evals.md
