"In mathematics, computer science and physics, a deterministic system is a system in which no randomness is involved in the development of future states of the system." — The Internet Encyclopedia of Science
LLMs are capable of producing consistent output when the input and model version both match. However, change a single word in that same prompt and the LLM can output something drastically different. Because of this, most LLMs are limited to features where a degree of randomness is acceptable, such as chatbots. For the same reason, integrating LLMs has historically been impractical for mission-critical operations.
The good news is that with Structured Output, LLMs become far more reliable and less random. Structured Output leverages strongly-typed systems to put guardrails around what an LLM can output. Furthermore, when it's used in combination with performance testing, companies can deliver stable AI-integrated products.
The reader can expect to walk away with an understanding of how to build dependable AI integrations. The same concepts from this article were used to build the HN Who's Hiring Database. Check out the original post on how the HN Who's Hiring Database website was built here. It's a self-maintained open-source project, used by hundreds of monthly active users to query Hacker News job postings. Hacker News is a social news website and online forum for computer science and entrepreneurship, run by the venture capital firm Y Combinator.
Structured outputs
At its core, Structured Output is a constraint mechanism for LLMs. It forces the model to adhere to a predefined schema, ensuring that the model's output follows a strict structure. For this article, I'll be using OpenAI's API in Python with Pydantic. The same can be done in TypeScript with Zod, and in various other languages using JSON mode with JSON-based schemas.
Clear advantages of structured output:
Type safety: No more runtime type errors from unexpected LLM outputs.
Efficient prompting: No more verbose instructions in your prompts about output format.
Less code overhead: No more complex regex rules to extract data.
The example project for this article uses a cron job to scrape website data, hands large blocks of text to an LLM to extract and clean data, then saves and shares the data online. We'll discuss only the portion of the project that uses LLM integration - to read more about the other parts, check out the blog post on how the HN Who's Hiring Database was built.
Let's have a look at the data quality bump we can expect between LLM prompting with Structured Output and without. Not only does the quality go up, but the text-based prompt shrinks dramatically: the original prompt with few-shot examples totaled 134 lines of code; with Structured Output, those 134 lines were reduced to 7.
This reduction in prompting overhead is because the model uses the Structured Output schema as part of its context. When schemas are named and related clearly, the model understands what it needs to do through code-based semantic meaning. You no longer need to explain linguistically to the model how a system's data model and its relationships work.
That's not to say that Structured Output is always enough for the model to understand a system's data modeling. In those cases, we can add descriptions to the schema. JSON mode provides a way to add descriptions directly for this; however, Pydantic's built-in field description is not supported. To work around this, OpenAI's API appears to read the Pydantic schema's docstring, so I find it more effective to add schema context there rather than in the prompt. TypeScript using JSDoc comments in the Zod model may work similarly.
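Here's a quick sketch of that docstring idea (the SalaryRange model below is purely illustrative, not part of the project):

from typing import Optional
from pydantic import BaseModel

class SalaryRange(BaseModel):
    """Annual salary range extracted from a job posting.
    Amounts are in the posting's stated currency; a bound is None
    when the posting gives no figure."""
    min_amount: Optional[float]
    max_amount: Optional[float]

The docstring travels with the schema, so the model receives the extra context without bloating the prompt itself.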
Let's see it in action
# Pydantic structures file
from typing import Optional, List
from enum import Enum

from pydantic import BaseModel

class RemoteEnum(str, Enum):
    YES = "yes"
    NO = "no"
    HYBRID = "hybrid"
    UNKNOWN = "unknown"

class CurrencyEnum(str, Enum):
    USD = "USD"
    EUR = "EUR"
    GBP = "GBP"
    CAD = "CAD"
    CNY = "CNY"
    AUD = "AUD"
    CHF = "CHF"
    MXN = "MXN"
    UNKNOWN = "unknown"

class EmploymentType(str, Enum):
    FULL_TIME = "full-time"
    PART_TIME = "part-time"
    CONTRACT = "contract"
    INTERN = "intern"
    UNKNOWN = "unknown"

class JobPosting(BaseModel):
    comment_id: int
    company_name: str
    job_title: str
    employment_type: EmploymentType
    currency: CurrencyEnum
    remote: RemoteEnum
    salary: Optional[str]
    remote_rules: Optional[str]
    how_to_apply: Optional[str]
    company_city: Optional[str]
    company_country: Optional[str]
    languages_and_frameworks: Optional[List[str]]

class JobPostings(BaseModel):
    postings: List[JobPosting]
# Calling code - parse the LLM response directly into the Pydantic model
from openai import OpenAI

from structures import JobPostings

client = OpenAI()

completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "You are a data extraction expert that extracts job posting data based on a list of job postings you are provided."},
        {"role": "user", "content": "..."}
    ],
    response_format=JobPostings,
)

job_postings = completion.choices[0].message.parsed
Now, instead of JSON like a typical REST request, the response will be a JobPostings Pydantic model. This not only adds a layer of validation, but can also be used to format or clean the data further.
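For example, a Pydantic validator can clean fields as they are parsed. The sketch below is illustrative rather than the project's actual model; it just strips stray whitespace from company names:

from pydantic import BaseModel, field_validator

class JobPosting(BaseModel):
    company_name: str
    # ... remaining fields as defined in the structures file

    @field_validator("company_name")
    @classmethod
    def strip_company_name(cls, value: str) -> str:
        # Remove leading/trailing whitespace the LLM may carry over
        return value.strip()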
Enabling reliable testing
Now that we use Structured Output to enforce the shape of our data, as opposed to receiving a wall of text, we can use test evals to evaluate an LLM's performance. Evals are incredibly useful for validating new ideas and for improving or locking in the performance of AI-integrated features.
Before Structured Output, extracting meaningful data from an LLM's wall of text required excessive lines of regex, and testing an LLM's performance was prone to the same issue. Validating an LLM's wordy output required regex or similarity matching of text, or sometimes even costlier measures, like OpenAI's evals, which use another LLM to score our LLM's output. That always sounded like a snake eating its own tail from my perspective.
Well, this is all much, much easier now that we can limit the LLM to a simple yes/no output if we like.
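Here's a quick sketch of what such a constrained output could look like (the YesNoAnswer model is just an illustration, not something from the project):

from enum import Enum
from pydantic import BaseModel

class YesNo(str, Enum):
    YES = "yes"
    NO = "no"

class YesNoAnswer(BaseModel):
    """A single yes/no judgement."""
    answer: YesNo

Passed as response_format=YesNoAnswer, the model can only ever answer yes or no, which makes the test assertion trivial.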
The code below shows how this could work in practice. You'll notice there are two similarity-matching checks; those fields hold output we'd consider unstructured data. When an output is longer than a sentence, we can generally treat it as unstructured.
# conftest.py - fixtures we'll need for our tests
import json
import os

import pytest

from structures import JobPostings

JOB_POSTING_OUTPUT_FILE = 'job_posting_output.json'

@pytest.fixture
def jobs_struct() -> JobPostings:
    json_data = None
    script_dir = os.path.dirname(os.path.abspath(__file__))
    file_path = os.path.join(script_dir, JOB_POSTING_OUTPUT_FILE)
    with open(file_path, 'r') as f:
        json_data = json.load(f)
    if json_data is None:
        raise ValueError("No data found in job_posting_output.json")
    return JobPostings(**json_data)
# Test file - uses the jobs_struct fixture from conftest.py
from fuzzywuzzy import fuzz

from structures import CurrencyEnum, EmploymentType, JobPosting, JobPostings, RemoteEnum

def test_response_passes_model_validation(jobs_struct):
    assert isinstance(jobs_struct, JobPostings), "JobPostings structure is invalid"

def test_job_post_sample(jobs_struct):
    SOFT_ASSERT_THRESHOLD = 80  # 80% similarity threshold
    expected_job = JobPosting(
        comment_id=41133198,
        company_name="FusionAuth",
        job_title="Senior Java Software Engineer",
        employment_type=EmploymentType.FULL_TIME,
        currency=CurrencyEnum.USD,
        remote=RemoteEnum.HYBRID,
        salary="$120k-$180k",
        remote_rules="ONSITE or REMOTE in USA (location reqs listed on the job desc)",
        how_to_apply="Learn more, including about benefits and salaries, and apply here: https://fusionauth.io/jobs/",
        company_city="Denver",
        company_country="USA",
        languages_and_frameworks=["Java", "MySQL", "Docker"]
    )
    job = jobs_struct.postings[5]

    # Strict assertions
    assert job.comment_id == expected_job.comment_id, "Comment ID does not match"
    assert job.company_name == expected_job.company_name, "Company name does not match"
    assert job.job_title == expected_job.job_title, "Job title does not match"
    assert job.employment_type == expected_job.employment_type, "Employment type does not match"
    assert job.currency == expected_job.currency, "Currency does not match"
    assert job.remote == expected_job.remote, "Remote does not match"
    assert job.salary == expected_job.salary, "Salary does not match"
    assert job.company_city == expected_job.company_city, "Company city does not match"
    assert job.company_country == expected_job.company_country, "Company country does not match"

    # Soft assertions
    similarity = fuzz.ratio(job.remote_rules, expected_job.remote_rules)
    assert similarity >= SOFT_ASSERT_THRESHOLD, f"Similarity {similarity}% is below {SOFT_ASSERT_THRESHOLD}% for line:\nActual: {job.remote_rules}\nExpected: {expected_job.remote_rules}"
    similarity = fuzz.ratio(job.how_to_apply, expected_job.how_to_apply)
    assert similarity >= SOFT_ASSERT_THRESHOLD, f"Similarity {similarity}% is below {SOFT_ASSERT_THRESHOLD}% for line:\nActual: {job.how_to_apply}\nExpected: {expected_job.how_to_apply}"
Long-term performance monitoring
When integrating LLMs into production systems, it's crucial to track performance over time. This is generally done with multiple test sets, created from snapshots of human-verified successful cases. Larger test sets improve reliability, as does extending them over time and adding edge cases as they arise (a sketch of this follows the list below).
Here's why this matters and how structured output plays a key role:
Detecting drift: LLM performance can drift over time due to changes in the underlying model, updates to prompt engineering, or shifts in user behavior.
Seasonal variations: Some applications may have seasonal patterns. By maintaining separate test sets for different time periods (e.g., monthly), you can identify and account for these variations.
Impact of updates: When you make changes to your prompts or fine-tuning, you need a way to measure the impact across different scenarios.
Comprehensive evaluation: Different test sets can cover various edge cases, ensuring your LLM integration remains robust across a wide range of inputs.
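As a rough sketch, the same assertions can be parametrized across several human-verified snapshot test sets; the snapshot file names below are illustrative, not from the project:

# Hypothetical sketch: reuse the same checks across multiple snapshot test sets
import json
import os

import pytest

from structures import JobPostings

SNAPSHOT_FILES = ["snapshot_2024_07.json", "snapshot_2024_08.json", "snapshot_2024_09.json"]

@pytest.fixture(params=SNAPSHOT_FILES)
def snapshot_struct(request) -> JobPostings:
    # Load one human-verified snapshot per parametrized run
    path = os.path.join(os.path.dirname(os.path.abspath(__file__)), request.param)
    with open(path, 'r') as f:
        return JobPostings(**json.load(f))

def test_snapshot_passes_model_validation(snapshot_struct):
    assert isinstance(snapshot_struct, JobPostings), "Snapshot failed model validation"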
To demonstrate LLM evals put into practice, let's imagine we have shipped an AI integration that reads customer emails and, whenever an email mentions an issue with the product, creates a support ticket for our team to look into. We built in test evals and are actively tracking the accuracy of our LLM integration over the course of a year. Each week, we run our test suite and record the overall accuracy.
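Recording that weekly number can be as simple as appending each run's accuracy to a log file. A minimal sketch, assuming a list of pass/fail results per eval case (the helper and file name are hypothetical):

# Hypothetical sketch: log each week's eval accuracy so trends and regressions are visible
import csv
import datetime
from typing import List

def record_accuracy(results: List[bool], log_file: str = "eval_accuracy.csv") -> float:
    # results holds one pass/fail flag per eval case in this week's run
    accuracy = sum(results) / len(results)
    with open(log_file, "a", newline="") as f:
        csv.writer(f).writerow([datetime.date.today().isoformat(), f"{accuracy:.3f}"])
    return accuracy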
We notice a dramatic dip in week 52. Looking into the source code from that day, we can see an engineer upgraded the model version, which led to a 20% reduction in performance. Without this tracking in place, such a regression could not only go unnoticed, we would also lose the ability to pinpoint its cause.
Test evals allow companies to make data-driven decisions around prompt engineering, model providers and versions, and fine-tuning strategies, all while ensuring that the LLM integration remains robust and effective over the long term.
Wrapping things up
We've discussed how using Structured Outputs with LLMs adds stability to AI-integrated products. When Structured Output is combined with the right test evals, companies can start to integrate LLMs into production systems with confidence.
It's still not advisable to integrate LLMs into critical systems, since they cannot guarantee fully accurate data. However, LLMs can deliver greater consistency and efficiency than their human counterparts, who eventually become bored or tired. For projects that demand considerable manual effort, LLM integrations are worth considering.
Feel free to reach out via any of the social links in the footer if this topic caught your interest. For those that enjoyed this article, you may also like to read how HN Who's Hiring Database was built here.