What structured outputs mean for LLMs


Amber Williams

Last edited - August 31, 2024


As developers working with LLMs*, you've likely experienced the pain of parsing through a wall of unstructured text, trying to extract meaningful data. Enter structured output: it isn't just a feature, it's a shift in how we interact with LLMs, bringing the reliability of strongly-typed systems to the world of NLP.

*LLMs (Large Language Models): you can consider every mention of LLM in this article synonymous with the buzzword AI.

Structured outputs

At its core, structured output is a constraint mechanism for LLMs. It forces the model to adhere to a predefined schema, ensuring that the output follows a strict structure. For this article I'll be using OpenAI's API with a response_format defined via the Python library Pydantic, but the same can be done by supplying a JSON Schema directly (JSON mode) or with the TypeScript library Zod.
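
For comparison, here's a rough sketch of the raw JSON Schema route via OpenAI's json_schema response format; the schema below is a trimmed, illustrative subset of the Pydantic models shown later in this article, not the full project schema.

python

from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Extract the job posting details."},
        {"role": "user", "content": "..."},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "job_posting",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "company_name": {"type": "string"},
                    "job_title": {"type": "string"},
                    "remote": {"type": "string", "enum": ["yes", "no", "hybrid", "unknown"]},
                },
                "required": ["company_name", "job_title", "remote"],
                "additionalProperties": False,
            },
        },
    },
)

# completion.choices[0].message.content is now a JSON string conforming to the schema.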

Clear advantages of structured output:

  • Type safety: No more runtime type errors from unexpected LLM outputs.

  • Efficient prompting: No more verbose instructions in your prompts about output format.

  • Less code overhead: No more complex regex rules to extract data (a quick sketch of the contrast follows this list).
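
To make that last point concrete, here's a hedged before/after sketch: the regex route scrapes a salary out of free text, while the structured route just reads a typed field. The example string, regex pattern, and variable names are illustrative, not from the original project.

python

import re

# Before: brittle regex extraction from a wall of unstructured model output.
raw_reply = "FusionAuth | Senior Java Software Engineer | Salary: $120k-$180k | Denver"
match = re.search(r"Salary:\s*([$\dk\-]+)", raw_reply)
salary = match.group(1) if match else None  # breaks as soon as the model rephrases

# After: with structured output (shown later in this article), the parsed model
# simply exposes a typed field:
#   salary = job_postings.postings[0].salary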

Have a look at the data quality alone between the before and after from a project I refactored from pure prompting to structured outputs + prompting. The project automatically scrapes Hacker News' monthly hiring thread, extracts the data points I care about, and pops them into an easy-to-read markdown file. There's a blog post here on how that was built.

In the before-and-after data you'll notice that not only does the quality go up, but the prompting overhead drops dramatically. The original prompt, with its few-shot examples, was 134 lines of code. After adding structured output, only 7 lines of prompt code were needed.

This large reduction in prompting overhead is possible because the model picks up the context of your schema, its fields, and their relationships from the code's semantic meaning. You no longer need to explain linguistically to the model how those relationships work.

It's worth noting that sometimes structured outputs alone aren't enough for the model to understand fields and/or their relationships. In these cases you can add descriptions to the schema. JSON mode provides a way to add descriptions directly for this. However, Pydantic's built-in field descriptions are not supported. As a workaround, OpenAI appears to read the Pydantic schema's docstring, so I find it more effective to add schema context there rather than putting it in your prompt. TypeScript with JSDoc comments on the Zod model may work similarly.
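
For example, here's a hedged sketch of how that docstring context could look on the JobPosting model defined in the next section; the fields are trimmed and the docstring wording is mine, not from the original project.

python

from typing import Optional

from pydantic import BaseModel


class JobPosting(BaseModel):
    """A single job post extracted from a Hacker News "Who is hiring?" comment.

    salary is the raw salary text exactly as written in the post (e.g. "$120k-$180k").
    remote_rules captures any location or onsite requirements verbatim.
    comment_id is the Hacker News id of the comment the posting came from.
    """

    comment_id: int
    company_name: str
    salary: Optional[str]
    remote_rules: Optional[str]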

Let's see it in action

python

# Pydantic structures file
from enum import Enum
from typing import List, Optional

from pydantic import BaseModel

class RemoteEnum(str, Enum):
    YES = "yes"
    NO = "no"
    HYBRID = "hybrid"
    UNKNOWN = "unknown"


class CurrencyEnum(str, Enum):
    USD = "USD"
    EUR = "EUR"
    GBP = "GBP"
    CAD = "CAD"
    CNY = "CNY"
    AUD = "AUD"
    CHF = "CHF"
    MXN = "MXN"
    UNKNOWN = "unknown"


class EmploymentType(str, Enum):
    FULL_TIME = "full-time"
    PART_TIME = "part-time"
    CONTRACT = "contract"
    INTERN = "intern"
    UNKNOWN = "unknown"


class JobPosting(BaseModel):
    comment_id: int
    company_name: str
    job_title: str
    employment_type: EmploymentType
    currency: CurrencyEnum
    remote: RemoteEnum
    salary: Optional[str]
    remote_rules: Optional[str]
    how_to_apply: Optional[str]
    company_city: Optional[str]
    company_country: Optional[str]
    languages_and_frameworks: Optional[List[str]]


class JobPostings(BaseModel):
    postings: List[JobPosting]

python

from openai import OpenAI

from structures import JobPostings

client = OpenAI()


completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "You are a data extraction expert that extracts job posting data based on a list of job postings you are provided."},
        {"role": "user", "content": "..."}
    ],
    response_format=JobPostings,
)
job_postings = completion.choices[0].message.parsed

The cool thing is that job_postings will be an actual, validated JobPostings Pydantic model rather than raw JSON that merely fits the schema. This is useful if you want to do things such as extend the model, say to add formatting helpers.
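
As a quick illustration, here's a hedged sketch of what such an extension could look like; the FormattedJobPosting subclass and its to_markdown_row helper are hypothetical, not part of the original project.

python

from structures import JobPosting


class FormattedJobPosting(JobPosting):
    """JobPosting extended with presentation helpers."""

    def to_markdown_row(self) -> str:
        # Render one posting as a row for the monthly markdown file.
        salary = self.salary or "n/a"
        return f"| {self.company_name} | {self.job_title} | {salary} | {self.remote.value} |"


# Usage: re-validate an existing posting into the extended model.
# formatted = FormattedJobPosting(**job_postings.postings[0].model_dump())
# print(formatted.to_markdown_row())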

Enabling reliable testing

While the above is already incredibly useful and likely not news to some readers, what it unlocks is this: with structured responses, you can write deterministic tests that validate the structure and content of LLM outputs.

Turning your testing from looking like this...




...into this...


LLM output | Truth  | Eval
apple      | apple  | pass
orange     | banana | fail

Before this, validating an LLM's wall-of-text output meant relying on text similarity matching, or costlier measures such as using another LLM to judge our LLM's output. Just explaining those legacy validation methods to non-technical executives was a burden in itself.
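
For reference, a minimal sketch of what that legacy similarity-matching style of check looked like, assuming a free-text reply and a hand-written expected blurb (both strings here are illustrative):

python

from fuzzywuzzy import fuzz

# Legacy approach: fuzzy-match the whole unstructured reply against an expected blurb.
llm_reply = "FusionAuth is hiring a Senior Java Software Engineer in Denver, $120k-$180k..."
expected = "Senior Java Software Engineer at FusionAuth, Denver, salary $120k-$180k"

similarity = fuzz.ratio(llm_reply, expected)
assert similarity >= 80, f"Only {similarity}% similar - but which field was wrong?"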

The code below shows how this could work in practice. You'll notice there are two similarity-matching checks. These are for fields whose output would be considered a non-structurable data type, which generally means anything longer than a sentence.

python

# conftest.py - fixtures we’ll need for our tests
import json
import os

import pytest

from structures import JobPostings

JOB_POSTING_OUTPUT_FILE = 'job_posting_ouput.json'


@pytest.fixture
def jobs_struct() -> JobPostings:
    json_data = None
    script_dir = os.path.dirname(os.path.abspath(__file__))
    file_path = os.path.join(script_dir, JOB_POSTING_OUTPUT_FILE)

    with open(file_path, 'r') as f:
        json_data = json.load(f)
    if json_data is None:
        raise ValueError("No data found in the job_posting_ouput.json")
    return JobPostings(**json_data)

python

from fuzzywuzzy import fuzz

from structures import CurrencyEnum, EmploymentType, JobPosting, JobPostings, RemoteEnum


def test_response_passes_model_validation(jobs_struct):
    assert isinstance(jobs_struct, JobPostings), "JobPostings structure is invalid"


def test_job_post_sample(jobs_struct):
    SOFT_ASSERT_THRESHOLD = 80  # 80% similarity threshold

    expected_job = JobPosting(
        comment_id=41133198,
        company_name="FusionAuth",
        job_title="Senior Java Software Engineer",
        employment_type=EmploymentType.FULL_TIME,
        currency=CurrencyEnum.USD,
        remote=RemoteEnum.HYBRID,
        salary="$120k-$180k",
        remote_rules="ONSITE or REMOTE in USA (location reqs listed on the job desc)",
        how_to_apply="Learn more, including about benefits and salaries, and apply here: https://fusionauth.io/jobs/",
        company_city="Denver",
        company_country="USA",
        languages_and_frameworks=["Java", "MySQL", "Docker"]
    )
    job = jobs_struct.postings[5]

    # Strict assertions
    assert job.comment_id == expected_job.comment_id, "Comment ID does not match"
    assert job.company_name == expected_job.company_name, "Company name does not match"
    assert job.job_title == expected_job.job_title, "Job title does not match"
    assert job.employment_type == expected_job.employment_type, "Employment type does not match"
    assert job.currency == expected_job.currency, "Currency does not match"
    assert job.remote == expected_job.remote, "Remote does not match"
    assert job.salary == expected_job.salary, "Salary does not match"
    assert job.company_city == expected_job.company_city, "Company city does not match"
    assert job.company_country == expected_job.company_country, "Company country does not match"

    # Soft assertions
    similarity = fuzz.ratio(job.remote_rules, expected_job.remote_rules)
    assert similarity >= SOFT_ASSERT_THRESHOLD, f"Similarity {similarity}% is below {SOFT_ASSERT_THRESHOLD}% for line:\nActual: {job.remote_rules}\nExpected: {expected_job.remote_rules}"

    similarity = fuzz.ratio(job.how_to_apply, expected_job.how_to_apply)
    assert similarity >= SOFT_ASSERT_THRESHOLD, f"Similarity {similarity}% is below {SOFT_ASSERT_THRESHOLD}% for line:\nActual: {job.how_to_apply}\nExpected: {expected_job.how_to_apply}"

Long-term performance monitoring

When integrating LLMs into production systems, it's crucial to track performance over time. This is generally done using multiple test sets, created from snapshots of human-verified successful cases. For this example I've used the FusionAuth job posting from my August 2024 markdown file, which I verified by hand, but in practice I should use multiple job postings from that thread and from prior months. As new jobs are posted I should manually verify other months' threads and add them to the test set, in particular where edge cases arise.
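
As a minimal sketch of that idea, the earlier conftest.py fixture could be parametrized over several human-verified monthly snapshot files; the additional file name below is hypothetical.

python

# conftest.py - hypothetical multi-snapshot version of the earlier fixture
import json
import os

import pytest

from structures import JobPostings

# One verified snapshot per month; only the August 2024 file exists so far.
SNAPSHOT_FILES = [
    'job_posting_ouput.json',            # August 2024 (verified)
    'job_posting_output_2024_09.json',   # hypothetical future snapshot
]


@pytest.fixture(params=SNAPSHOT_FILES)
def jobs_struct(request) -> JobPostings:
    script_dir = os.path.dirname(os.path.abspath(__file__))
    with open(os.path.join(script_dir, request.param), 'r') as f:
        return JobPostings(**json.load(f))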

Here's why this matters and how structured output plays a key role:

  • Detecting drift: LLM performance can drift over time due to changes in the underlying model, updates to prompt engineering, or shifts in user behavior.

  • Seasonal variations: Some applications may have seasonal patterns. By maintaining separate test sets for different time periods (e.g., monthly), you can identify and account for these variations.

  • Impact of updates: When you make changes to your prompts or fine-tuning, you need a way to measure the impact across different scenarios.

  • Comprehensive evaluation: Different test sets can cover various edge cases, ensuring your LLM integration remains robust across a wide range of inputs.

In this example, we're tracking the accuracy of our LLM integration over the course of a year. Each week, we run our test suite and record the overall accuracy. Notice the dramatic dip in week 52. Let's say this was due to a well-intentioned prompt change that accidentally reduced accuracy by 20%. Without this tracking in place, such a regression can go unnoticed, with no way to pinpoint the cause.
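
Here's a hedged sketch of how that weekly recording could work, assuming the suite is run with pytest's --junitxml report; the record_accuracy.py and accuracy_log.csv names are mine, not from the original project.

python

# record_accuracy.py - run weekly: execute the test suite and log overall accuracy.
import csv
import subprocess
import xml.etree.ElementTree as ET
from datetime import date

REPORT_FILE = "results.xml"
LOG_FILE = "accuracy_log.csv"  # hypothetical log behind the year-long accuracy chart

# Run the suite and write a JUnit XML report (pytest exits non-zero on failures).
subprocess.run(["pytest", f"--junitxml={REPORT_FILE}"], check=False)

# JUnit XML exposes <testsuite tests="..." failures="..." errors="...">.
root = ET.parse(REPORT_FILE).getroot()
suite = root if root.tag == "testsuite" else root.find("testsuite")
total = int(suite.get("tests"))
failed = int(suite.get("failures")) + int(suite.get("errors"))
accuracy = 100 * (total - failed) / total

with open(LOG_FILE, "a", newline="") as f:
    csv.writer(f).writerow([date.today().isoformat(), f"{accuracy:.1f}"])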

This allows teams to make data-driven decisions about prompt engineering, model updates, and fine-tuning strategies, ensuring that LLM integration remains robust and effective over the long term.

Wrapping things up

Structured outputs with LLMs unlock deterministic data. Deterministic data means we can start to reliably integrate LLMs into our system automations. That's not to say I recommend wiring an LLM through your entire system as if it were HAL 9000. But for things like cited data extraction, using citations to reduce hallucination as I did with JobPosting's comment_id, backed by performance testing, you should give LLM integration a shot.