Abstract
Language is often used strategically, particularly in high-stakes, adversarial settings, yet most work on pragmatics and LLMs centers on cooperative settings. This leaves a gap in the systematic understanding of strategic communication in adversarial settings. To address this, we introduce SDA (Strategic Dialogue Assessment), a framework grounded in Gricean and game-theoretic pragmatics to assess the strategic use of language. It adapts the ME Game jury function to make it empirically estimable for analyzing dialogue. Our approach incorporates two key adaptations: a commitment-based taxonomy of discourse moves, which provides a finer-grained account of strategic effects, and the use of estimable proxies grounded in Gricean maxims to operationalize abstract constructs such as credibility. Together, these adaptations build on discourse theory by treating discourse as the strategic management of commitments, enabling systematic evaluation of how conversational moves advance or undermine discourse goals. We further derive three interpretable metrics, Benefit at Turn (BAT), Penalty at Turn (PAT), and Normalized Relative Benefit at Turn (NRBAT), to quantify the perceived strategic effects of discourse moves. We also present CPD (the Crooked Path Dataset), an annotated dataset of real courtroom cross-examinations, to demonstrate the framework's effectiveness. Using these tools, we evaluate a range of LLMs and show that they generally exhibit limited pragmatic understanding of strategic language. While performance on our metrics improves with model size, explicit reasoning does not help and often hurts, introducing overcomplication and internal confusion.
CPD: A dataset of real courtroom cross-examinations
To study the strategic use of language in adversarial settings, we collect a dataset of real courtroom cross-examinations from three prominent U.S. trials: the West Memphis Three Trials (1994), the O.J. Simpson Trial (1995), and the Enron (Lay & Skilling) Trial (2006). In total, the cross-examinations in the three trials consist of 4,452 turns, of which 3,325 are Q/A pairs. A sample turn is shown in the figure below.
SDA: Quantifying how strategic language is
We follow the traditions of Gricean pragmatics and game-theoretic pragmatics to quantify how, or to what extent, a discourse move contributes to the speaker's goal. We introduce Strategic Dialogue Assessment (SDA), heavily grounded in the jury function à la Asher et al. (2017), to score the strategic value of discourse moves. In our proposal, the strategic effects of utterances are determined by what commitments a speaker makes, and whether those commitments are meaningful, i.e., interpretable w.r.t. the prosecutor's questions. Crucially, such meaningful commitments can carry different strategic values, depending both on their content and on how they are realized in discourse. Hence we define SDA as a scorer τ such that, for each turn i, τ assigns a score: τ(i) = f(Commitment(i)) * Credibility(i) * Consistency(i), where f is a function assigning values to commitments. From these scores, we derive three interpretable metrics: Benefit at Turn (BAT), Penalty at Turn (PAT), and Normalized Relative Benefit at Turn (NRBAT).
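To make the scorer concrete, below is a minimal Python sketch. The commitment taxonomy and value function f, the [0, 1] credibility and consistency proxies, and the sign-based readings of BAT, PAT, and a dialogue-level NRBAT are illustrative assumptions, not the paper's exact definitions.

```python
from dataclasses import dataclass

# Hypothetical value function f over commitment types (an assumption).
COMMITMENT_VALUE = {"beneficial": 1.0, "neutral": 0.0, "detrimental": -1.0}

@dataclass
class Turn:
    commitment: str     # a key of COMMITMENT_VALUE
    credibility: float  # Gricean-maxim-based proxy, assumed in [0, 1]
    consistency: float  # consistency with prior commitments, assumed in [0, 1]

def tau(turn: Turn) -> float:
    """tau(i) = f(Commitment(i)) * Credibility(i) * Consistency(i)."""
    return COMMITMENT_VALUE[turn.commitment] * turn.credibility * turn.consistency

def bat(turns):
    """Benefit at Turn: here, the positive part of tau (an assumption)."""
    return [max(tau(t), 0.0) for t in turns]

def pat(turns):
    """Penalty at Turn: here, the magnitude of tau's negative part (an assumption)."""
    return [max(-tau(t), 0.0) for t in turns]

def nrbat(turns):
    """One plausible dialogue-level normalization in [-1, 1]:
    (total benefit - total penalty) over their sum."""
    b, p = sum(bat(turns)), sum(pat(turns))
    return (b - p) / (b + p) if (b + p) > 0 else 0.0

turns = [Turn("beneficial", 0.9, 1.0), Turn("detrimental", 0.8, 0.5)]
print(tau(turns[0]), nrbat(turns))  # 0.9 and roughly 0.38
```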
BAT and PAT are valid and meaningful metrics
To demonstrate that our theory-driven metrics are empirically meaningful and valid, we collected human annotations of the discourse properties that make up SDA on a subset of the CPD dataset (800 turns). We then ran an outcome prediction task and found that our metrics are predictive of turn outcomes (leftmost table) for each annotator, meaning that the discourse properties do reflect how much a move contributes to the speaker's goal (a sketch of this prediction setup follows the list below). We also found:
- Human agreement when annotating our discourse properties is reasonably high (second table), higher than when annotators simply label outcomes.
- Our metrics reflect more objective components in decision making (third figure). We asked annotators to record the reasoning behind their decisions and found that our metrics are more predictive when the reasoning is objective (e.g., logical arguments) than when it is subjective (e.g., personal emotions).
- Our metrics are more consistent across annotators than purely outcome-based judgments, as shown in the fourth figure. When annotating outcomes, annotator C shows a personal bias not shared by the other two annotators; this bias does not appear in our metrics.
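As a rough illustration of the outcome prediction task referenced above, the sketch below fits a logistic regression from per-turn BAT/PAT-style features to binary turn outcomes. The features and labels are synthetic stand-ins; in the actual study, they come from the 800 annotated CPD turns.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 800  # size of the annotated CPD subset

# Stand-in features: per-turn BAT and PAT scores (replace with real values).
X = rng.random((n, 2))
# Stand-in labels: 1 = turn judged favorable to the speaker, 0 = unfavorable.
y = (X[:, 0] - X[:, 1] + 0.3 * rng.standard_normal(n) > 0).astype(int)

# If the metrics carry signal, cross-validated accuracy beats chance.
acc = cross_val_score(LogisticRegression(), X, y, cv=5, scoring="accuracy")
print(f"5-fold accuracy: {acc.mean():.3f} +/- {acc.std():.3f}")
```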
Cooperative vs. Non-cooperative Dialogues
We also use our metrics to study the difference between cooperative (i.e., direct examination) and non-cooperative (i.e., cross-examination) dialogues. We made the following observations:
- Control discourse (the blue bars, i.e., cooperative direct examination) rarely involves detrimental commitments or violations of maxims, whereas non-cooperative discourse (the reddish bars) exhibits a higher frequency of both phenomena.
- Even within non-cooperative discourse, maxim violations occur markedly less often than maxim maintenance.
- Our metrics (i.e., BAT and PAT), like the underlying local discourse properties, also distinguish cooperative from non-cooperative discourse (density plot on the right); a sketch of such a comparison follows this list.
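The sketch below shows one way to run such a comparison: a Mann-Whitney U test on per-turn scores from the two discourse types. The synthetic score distributions stand in for real BAT values and are assumptions for illustration.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
# Stand-in per-turn BAT scores for each discourse type (assumed shapes).
direct_exam = rng.beta(5, 2, size=400)  # cooperative: skewed toward benefit
cross_exam = rng.beta(2, 2, size=400)   # non-cooperative: more spread out

# A significant difference indicates the metric separates the two settings.
stat, p = mannwhitneyu(direct_exam, cross_exam, alternative="two-sided")
print(f"Mann-Whitney U = {stat:.1f}, p = {p:.2e}")
```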
How well do LLMs understand strategic language?
We evaluate a range of cutting-edge LLMs on recognizing discourse properties and on whether their strategic understanding correlates with human judgments (a sketch of such a correlation check follows the list below). We also vary model size and reasoning ability to investigate their impact on strategic understanding. We found:
- Larger models (the darker bars) consistently outperform their smaller counterparts (the lighter bars) on our BAT and PAT metrics and in identifying commitment types.
- Models equipped with explicit reasoning mechanisms (i.e., the bars to the right of the dashed line) do not consistently improve performance and, in some cases, perform worse than their non-reasoning counterparts.
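As referenced above, here is a minimal sketch of the correlation check, assuming per-turn BAT scores from humans and from a model; the arrays are placeholders and the paper's exact evaluation protocol may differ.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(2)
human_bat = rng.random(200)                             # human-derived scores
model_bat = human_bat + 0.4 * rng.standard_normal(200)  # a model's scores

# Rank correlation between model and human strategic assessments.
rho, p = spearmanr(human_bat, model_bat)
print(f"Spearman rho = {rho:.3f} (p = {p:.2e})")
```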
What model behaviors possibly drive this underperformance?
To further investigate the potential reasons for the underperformance of LLMs, we conducted a qualitative analysis of model responses. We found that the responses are often characterized by (1) overcomplication and (2) internal confusion.
The prosecutor asks whether the witness has ever been rejected as an expert in other courts. The response implies an affirmative answer, resulting in a detrimental commitment. While the model initially arrives at the correct judgment, it soon cycles through contradictory interpretations, ultimately settling on an incorrect assessment.
The prosecutor asks whether the defendant has pled guilty; admitting this would be clearly detrimental to the defense. The model's reasoning introduces a contrast not present in the context, misrepresents world information, and contradicts itself across turns.
BibTeX
@article{zheng2026strategic,
  title={Strategic Dialogue Assessment: The Crooked Path to Innocence},
  author={Zheng, Anshun Asher and Li, Junyi Jessy and Beaver, David I.},
  journal={Dialogue \& Discourse},
  volume={17},
  number={1},
  pages={1--53},
  year={2026}
}