If you want the full, original version of this write-up (with more governance framing and templates), start here: Practical problem definition for AI projects and use cases.
If you like technical posts that treat AI as production infrastructure, not a demo, my main index is here: hernanhuwyler.wordpress.com.
Now the developer version.
I have seen more AI projects die from a bad problem statement than from a bad model.
The code was fine. The embeddings were fine. The training run was fine. The metrics looked “good.” Then the system shipped and nobody used it, or it automated the wrong step, or it created a new failure mode that support had no way to handle.
That failure usually started on day one, when someone wrote: “We need an AI solution.”
Why “we need AI” is not a problem statement
A real problem statement describes a measurable gap in a workflow.
An AI-flavored ambition describes a technology preference.
If your team starts with “use AI,” you will end up fitting AI into whatever pain is nearby. That feels productive until you try to write acceptance tests.
Instead of “we need an AI assistant,” write something a test suite can verify:
“We spend 1,200 hours per quarter answering due diligence questionnaires, with a median turnaround of 9 days and an observed rework rate of 12%. We need median turnaround under 2 days while keeping rework under 5%.”
That is not business theater. That is a spec.
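A spec like that can be checked by code. Here is a minimal sketch, using the numbers from the example above; the function name and signature are illustrative, not a prescribed interface:

```python
def evaluate_spec(median_turnaround_days: float, rework_rate: float) -> bool:
    """Return True only if both targets from the problem statement are met:
    median turnaround under 2 days AND rework rate under 5%."""
    return median_turnaround_days < 2 and rework_rate < 0.05

# The baseline from the example fails; a compliant quarter passes.
baseline_ok = evaluate_spec(9.0, 0.12)   # False
target_ok = evaluate_spec(1.5, 0.04)     # True
```

"We need an AI assistant" cannot be expressed this way. That is the point.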
The goal: turn business pain into an executable spec
A good AI problem definition gives developers five things:
You know what the system will do.
You know what “good” looks like.
You know what “unsafe” looks like.
You know what data you need.
You know how to decide go or no-go without politics.
If you cannot write those down, you do not have a project. You have a conversation.
Step 1: Write the “as-is” workflow like you are debugging it
When teams skip this, they end up automating the wrong step.
Write the current workflow as a sequence diagram or as pseudocode. Keep it brutally literal.
Example (support ticket triage):
```text
1) Ticket arrives in Zendesk
2) Agent reads it
3) Agent searches internal KB + Slack history
4) Agent drafts response
5) Agent checks policy constraints (refunds, privacy, SLA)
6) Agent sends response
7) Escalation occurs if customer replies again
```
Now mark where the real bottleneck is.
Is it step 3 (search)? Step 5 (policy checks)? Step 7 (escalations)?
If you do not identify the actual constraint, you will build a system that makes step 4 faster while the process still waits on step 5.
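Finding the constraint does not require guesswork if you have timestamps. A sketch, assuming your ticketing system can export per-step durations (the log format here is an assumption, not a Zendesk API):

```python
from collections import defaultdict
from statistics import median

def find_bottleneck(step_durations):
    """step_durations: list of (step_name, hours) tuples exported from
    ticket logs. Returns the step with the highest median duration,
    plus the full per-step median table for inspection."""
    by_step = defaultdict(list)
    for step, hours in step_durations:
        by_step[step].append(hours)
    medians = {step: median(vals) for step, vals in by_step.items()}
    return max(medians, key=medians.get), medians

# Toy data: policy checks dominate, not search or drafting.
logs = [
    ("search_kb", 1.5), ("search_kb", 2.0),
    ("draft", 0.5), ("draft", 0.7),
    ("policy_check", 4.0), ("policy_check", 6.0),
]
slowest, medians = find_bottleneck(logs)  # slowest == "policy_check"
```

Run this before anyone picks a model. If the answer is step 5, an AI drafting tool is the wrong project.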
Step 2: Define the output contract before you touch a model
Developers need an output contract, even if the model is probabilistic.
For each AI output, define:
output type (classification, draft text, decision suggestion, extracted fields)
required metadata (sources, confidence, policy flags)
acceptable error modes
required human review conditions
logging requirements
Example: a response drafting system that must cite sources.
```json
{
  "draft_reply": "string",
  "citations": [
    { "doc_id": "string", "section": "string", "quote": "string" }
  ],
  "policy_flags": ["privacy", "refund", "security"],
  "confidence": 0.0,
  "needs_human_review": true
}
```
If your vendor tool cannot produce the fields you need for your workflow, you just learned something early, not after deployment.
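The contract is only real if something enforces it. A minimal validator sketch against the schema above, stdlib only (in a real project you would likely reach for a schema library instead):

```python
def validate_output(payload: dict) -> list[str]:
    """Return a list of contract violations; an empty list means
    the payload conforms to the output contract above."""
    errors = []
    if not isinstance(payload.get("draft_reply"), str):
        errors.append("draft_reply must be a string")
    citations = payload.get("citations")
    if not isinstance(citations, list) or not citations:
        errors.append("at least one citation is required")
    else:
        for i, c in enumerate(citations):
            for field in ("doc_id", "section", "quote"):
                if not isinstance(c.get(field), str):
                    errors.append(f"citations[{i}].{field} must be a string")
    if not isinstance(payload.get("policy_flags"), list):
        errors.append("policy_flags must be a list")
    conf = payload.get("confidence")
    if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        errors.append("confidence must be a number in [0, 1]")
    if not isinstance(payload.get("needs_human_review"), bool):
        errors.append("needs_human_review must be a boolean")
    return errors
```

Run every model output through this before it reaches a human. A vendor tool that cannot pass this validator fails the project on day one, cheaply.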
Step 3: Force the counterfactual: “how do we solve this without AI?”
This single question kills weak projects fast.
If a rules engine, a better search index, a form redesign, or a simple automation tool solves 80% of the pain, AI is not your first move.
You can still use AI later, but you will use it in the right place.
A lot of “AI projects” are really data quality projects or workflow standardization projects. That is not a failure. That is reality.
Step 4: Choose the right tool class before choosing the tool
Engineers waste months when they choose a model family before they classify the task.
A simple filter works:
If the task is deterministic and structured, prefer conventional software.
If the task is prediction, ranking, scoring, or classification on structured data, prefer traditional machine learning.
If the task is understanding or generating unstructured language, then consider large language models.
Most real projects are hybrid. The mistake is making the whole thing “AI” when only one component needs it.
Example hybrid for due diligence automation:
retrieval system to fetch relevant policy sections
language model to draft responses with citations
rules engine to flag regulated claims
human review for high-risk topics
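That hybrid can be wired together before any single component exists. A pipeline skeleton sketch; the three callables are stand-ins for whatever retrieval system, model, and rules engine you actually choose:

```python
# Flags that always route to a human, per the hybrid design above.
HIGH_RISK_FLAGS = {"privacy", "security"}

def answer_questionnaire_item(question, retrieve, draft_with_citations, flag_claims):
    """Hybrid pipeline sketch: retrieval + LLM drafting + rules engine
    + human review gate. Each callable is a placeholder component."""
    passages = retrieve(question)                                 # retrieval system
    draft, citations = draft_with_citations(question, passages)   # language model
    flags = flag_claims(draft)                                    # rules engine
    return {
        "draft_reply": draft,
        "citations": citations,
        "policy_flags": sorted(flags),
        "needs_human_review": bool(HIGH_RISK_FLAGS & set(flags)),  # review gate
    }

# Wiring with trivial stand-ins to show the control flow:
result = answer_questionnaire_item(
    "Do you encrypt data at rest?",
    retrieve=lambda q: ["policy section 4.2"],
    draft_with_citations=lambda q, p: (
        "Yes, per policy 4.2.",
        [{"doc_id": "pol-4", "section": "4.2", "quote": "encryption at rest"}],
    ),
    flag_claims=lambda d: ["privacy"],
)
```

Only one of the four boxes is "AI." The other three are where most of the reliability comes from.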
Step 5: Feasibility check that developers actually care about
This is where optimism goes to die, which is good. You want it to die early.
Data feasibility
Do you have the data? Is it current? Is it consistent? Is it legally usable?
If the answer is “we have PDFs somewhere,” your project is not a model project yet. It is a data engineering project.
Label feasibility (if supervised learning is involved)
If you need labels, ask:
Who produces them?
How long does it take?
How noisy are they?
Can we measure inter-annotator agreement?
If you cannot sustain labeling, you cannot sustain the model.
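Inter-annotator agreement is measurable, not a vibe. Cohen's kappa for two annotators is a few lines of stdlib Python:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators labeling the same items.
    1.0 = perfect agreement, 0.0 = chance-level, negative = worse than chance."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance, from each annotator's label frequencies.
    expected = sum(freq_a[k] * freq_b.get(k, 0) for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)
```

As a rough rule, if kappa on a pilot labeling batch is low, your "ground truth" is noise and the model will learn the noise.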
Operational feasibility
Can you meet latency, cost, and uptime targets?
If inference costs are unbounded, “accuracy” is irrelevant. Your system will be throttled by finance.
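Bounding inference cost is back-of-envelope arithmetic, so do it in the problem definition. A sketch; the token counts and per-token prices below are made-up placeholders, plug in your provider's actual price sheet:

```python
def projected_monthly_cost(cases_per_month, avg_input_tokens, avg_output_tokens,
                           usd_per_1k_input, usd_per_1k_output):
    """Back-of-envelope monthly inference cost.
    All inputs are estimates you must replace with measured values."""
    per_case = (avg_input_tokens / 1000) * usd_per_1k_input \
             + (avg_output_tokens / 1000) * usd_per_1k_output
    return cases_per_month * per_case

# Hypothetical: 10,000 cases/month, 4k input + 1k output tokens per case,
# $0.01 / $0.03 per 1k tokens -> $700/month.
cost = projected_monthly_cost(10_000, 4_000, 1_000, 0.01, 0.03)
```

If that number scales linearly past your budget at projected volume, the feasibility check fails regardless of model quality.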
Safety and abuse feasibility
If the system can take action (send emails, trigger workflows, call APIs), you need explicit constraints.
If you cannot articulate how prompt injection or data exfiltration would be detected, that risk will definitely show up later.
Step 6: Define success metrics that cannot be negotiated later
If success metrics are vague, your project will never finish. It will just… continue.
I use four metric buckets.
Technical quality
Depends on task. Examples:
accuracy, precision, recall, F1
extraction exact match rate
groundedness or citation validity (for retrieval-based systems)
calibration (do probabilities mean anything?)
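Calibration in particular deserves a concrete check, because your `needs_human_review` gate depends on confidence being meaningful. A minimal expected calibration error (ECE) sketch:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: average gap between stated confidence and observed accuracy,
    weighted by bin size. Near zero means the probabilities mean something."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0
        bins[idx].append((conf, ok))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece
```

A model that says 0.9 and is right half the time will route unsafe outputs straight past your review threshold.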
Business impact
median turnaround time reduction
rework rate reduction
cost per case
SLA adherence
Risk and control metrics
policy violation rate
unsafe output rate
number of escalations per 1,000 outputs
audit log completeness
Adoption
percentage of cases processed through the system
override rate (humans rejecting the AI output)
opt-out rate (users routing around it)
If adoption is low, your problem definition was wrong, your UX was wrong, or your trust model was wrong. Pick one and investigate.
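Adoption metrics should fall out of event logs, not surveys. A sketch, assuming each case carries two boolean fields (the field names here are assumptions, map them to your own event schema):

```python
def adoption_metrics(case_log):
    """case_log: list of dicts with 'routed_through_ai' and 'human_overrode'
    booleans. Returns the three adoption metrics defined above."""
    total = len(case_log)
    through_ai = [c for c in case_log if c["routed_through_ai"]]
    overrides = [c for c in through_ai if c["human_overrode"]]
    return {
        "adoption_rate": len(through_ai) / total if total else 0.0,
        "override_rate": len(overrides) / len(through_ai) if through_ai else 0.0,
        "opt_out_rate": (total - len(through_ai)) / total if total else 0.0,
    }

# Toy log: 3 of 4 cases went through the system, 1 was overridden.
log = (
    [{"routed_through_ai": True, "human_overrode": False}] * 2
    + [{"routed_through_ai": True, "human_overrode": True}]
    + [{"routed_through_ai": False, "human_overrode": False}]
)
metrics = adoption_metrics(log)
```

A rising override rate is usually the earliest warning that trust is eroding, long before opt-out shows up.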
Make the problem definition machine-readable (so it becomes a build artifact)
This is the most practical trick I can offer to developers.
Convert the problem definition into a repo artifact. Treat it like code.
Example use_case.yaml:
```yaml
use_case_id: "ddq_auto_response_v1"
owner: "security_ops"
objective:
  baseline:
    median_turnaround_days: 9
    rework_rate: 0.12
  target:
    median_turnaround_days: 2
    rework_rate: 0.05
outputs:
  - name: "draft_answer"
    requires_citations: true
    human_review_required_when:
      - "policy_flags contains 'privacy'"
      - "confidence < 0.75"
data_sources:
  - name: "control_matrix"
    format: "structured"
    freshness_sla_days: 30
  - name: "policies"
    format: "pdf"
    ocr_required: true
constraints:
  pii_allowed: false
  max_latency_ms: 2500
  audit_logging_required: true
pilot:
  duration_weeks: 8
  sample_size: 50
  go_no_go:
    min_pass_rate: 0.90
    min_time_reduction: 0.70
```
Now your engineers can write tests against this. Your PM can’t “reinterpret” it mid-flight. And when an incident occurs, you have a paper trail that matches what was shipped.
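For example, the go/no-go gate becomes a function of the spec plus pilot results. In a real repo you would load the file with `yaml.safe_load(open("use_case.yaml"))`; the spec is inlined here only to keep the sketch dependency-free:

```python
def go_no_go(spec, pilot_results):
    """Decision gate read directly from the use_case spec: scale only if
    pass rate and turnaround-time reduction both clear their thresholds."""
    gate = spec["pilot"]["go_no_go"]
    baseline = spec["objective"]["baseline"]["median_turnaround_days"]
    time_reduction = 1 - pilot_results["median_turnaround_days"] / baseline
    return (pilot_results["pass_rate"] >= gate["min_pass_rate"]
            and time_reduction >= gate["min_time_reduction"])

# Inlined subset of use_case.yaml (normally loaded from the repo file):
spec = {
    "objective": {"baseline": {"median_turnaround_days": 9}},
    "pilot": {"go_no_go": {"min_pass_rate": 0.90, "min_time_reduction": 0.70}},
}
# 9 days -> 2 days is a 77.8% reduction, and 92% pass rate clears 90%.
decision = go_no_go(spec, {"pass_rate": 0.92, "median_turnaround_days": 2.0})
```

If the thresholds live in the YAML and the test lives in CI, nobody can quietly move the goalposts.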
Pilot design that avoids pilot purgatory
Pilots fail when they are not built to produce a decision.
Define:
exact duration
exact sample size
pre-agreed thresholds
decision date
Example:
“The pilot runs for 8 weeks on 50 questionnaires. We scale only if pass rate exceeds 90% and median turnaround improves by 70%. If not, we do a root cause analysis and decide continue, modify, or stop within 2 weeks.”
If you do not write that down, you will extend the pilot forever because nobody wants to be the person who says stop.
Red flags I watch for in problem statements
If I see these, I assume the project will stall unless the team rewrites the spec.
“We want an AI strategy.”
“We want to explore AI.”
“We want to improve customer experience.”
“We want a chatbot.”
Those can be ambitions. They are not problem definitions.
A problem definition has a baseline, a target, constraints, and a decision gate.
A short note on standards (only because they help developers)
If you work in a regulated environment, problem definition is not just best practice. It becomes evidence.
These references map well to developer workflows:
NIST AI Risk Management Framework (especially the Map function)
ISO/IEC 42001 (planning, roles, lifecycle discipline)
ISO/IEC 5338 (AI system lifecycle processes, where available)
You do not need to memorize standards. You need to produce artifacts that prove intent, constraints, and control.
Learn more
Original article: Practical problem definition for AI projects and use cases
Blog index: hernanhuwyler.wordpress.com
Closing question (the one I use to test problem definition quality)
Could someone outside your team read your problem statement and write correct acceptance tests from it in under 15 minutes?