Code to run the Structured Output Benchmark introduced in our article: *LLM Structured Output Benchmarks are Riddled with Mistakes*.
This benchmark contains four high-quality datasets, each formatted so you can easily evaluate the Structured Outputs of different LLMs.
These benchmarks were created after we discovered that public Structured Output datasets contain substantial annotation errors, inconsistencies, and ambiguities in their ground truth. To enable more reliable assessment of models, we provide four rigorously cleaned and validated benchmarks, along with scripts to format tasks, generate LLM responses, and evaluate their correctness.
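As a rough illustration of that generate-and-evaluate flow (a minimal sketch, not the repo's actual scripts), here is what a single-example run might look like using the OpenAI Python client in JSON mode. The model name, prompt wording, and the `text`/`ground_truth` record fields are assumptions for illustration; real field names come from each dataset's schema:

```python
import json
from openai import OpenAI  # assumes OPENAI_API_KEY is set in the environment

client = OpenAI()

def extract(text: str, schema_hint: str) -> dict:
    """Ask the model for a JSON object matching the task's schema."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical choice; any JSON-mode-capable model works
        messages=[
            {"role": "system", "content": f"Extract these fields as JSON: {schema_hint}"},
            {"role": "user", "content": text},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

# Hypothetical record shaped like an extraction task's example.
record = {
    "text": "Claim filed by Jane Doe on 2024-01-05.",
    "ground_truth": {"claimant": "Jane Doe"},
}
prediction = extract(record["text"], "claimant")
print("correct:", prediction == record["ground_truth"])  # naive exact-match scoring
```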
The datasets are hosted on Hugging Face:
| Dataset | Description | Dataset Link | Code Folder |
|---|---|---|---|
| Data Table Analysis | Analyze CSV tables and extract structured metadata. | [Cleanlab/data-table-analysis](https://huggingface.co/datasets/Cleanlab/data-table-analysis) | `data_table_analysis/` |
| Financial Entities Extraction | Extract financial and contextual entities from business and financial text. | [Cleanlab/fire-financial-ner-extraction](https://huggingface.co/datasets/Cleanlab/fire-financial-ner-extraction) | `financial_entities/` |
| Insurance Claims Extraction | Extract structured fields from insurance claim documents. | [Cleanlab/insurance-claims-extraction](https://huggingface.co/datasets/Cleanlab/insurance-claims-extraction) | `insurance_claims/` |
| PII Extraction | Extract and classify different types of personally identifiable information (PII) from text. | [Cleanlab/pii-extraction](https://huggingface.co/datasets/Cleanlab/pii-extraction) | `pii_extraction/` |
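To take a quick look at any of these datasets, you can load it directly from the Hugging Face Hub with the `datasets` library. A minimal sketch; the split and column names are not assumed here, so inspect the printed structure to see what each dataset actually contains:

```python
from datasets import load_dataset

# Load the PII extraction benchmark straight from the Hugging Face Hub.
ds = load_dataset("Cleanlab/pii-extraction")
print(ds)  # shows the available splits and column names

# Peek at the first example of the first available split.
split = next(iter(ds.values()))
print(split[0])
```

Swapping in any of the other dataset IDs from the table above works the same way.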