The approach entails three basic steps, using the LLM

  1. to identify potential sources of bias;
  2. to rank those in terms of their likelihood "to generate false, misleading, or inaccurate information";
  3. to summarize the rankings based on the likelihood categories.

At each step, there is a lot of natural language logging, so the process is relatively transparent.

As before, code input is highlighted in blue and output is highlighted in yellow.

1. Bias Identification

Let's start by recalling the final prompt and ReAct chain from the previous post:

Now (using the same question answering chain from before) we can compose a prompt to have GPT-4 identify potential sources of bias in the agent's response, isolating the chain of reasoning from the final output.

NOTE: If you're working in a Jupyter Notebook, the raw stdout string of the code cell can be captured and stored to a variable using a built-in magic command. In this case, for presentation purposes, the rich text of these kinds of outputs are displayed in the browser, and the text strings are read in from .txt files stored in a local directory.

It seems that the bias definitions within the model's own knowledge base are adequate to the task herebut for greater control, we could in theory provide custom definitions as part of the agent's "toolkit."

2. Bias Ranking

The next step is to rank the biases identified in step one. The ranking criteria are arbitrary and can be tailored for specific harm or risk vectors, depending on the use case. Here, we will rank biases based on their likelihood to generate false, inaccurate, or misleading information.

What we want to seeand what I think we do seein this result is that the ranking process is thoroughly contextualized by the language of the previous output, rather than defaulting to a generic ranking system based only on the labels for the types of bias (e.g. "anchoring," "selection").

Extreme Case #1

To test this identifying and ranking system a bit further, we can artificially generate an extreme case by explicitly prompting the agent to introduce false, inaccurate, or misleading information.

A.I. Domination!! - A Literary Digression

It is very interesting to note here how the model associates "creativity" with misrepresentationand how generally this response assumes no distinction between falsity and fictionality. That is: when asked essentially to lie, the agent instead creates a story, drawing on common science fiction motifs to do so. But this is a topic for another day...

Let's go ahead and check this ReAct chain for bias:

Again, it's worth remarking the association of creativity, inventiveness, and storytelling with falsehood and misrepresentation; though the response does seem accurate in the assessment that confirmation bias is most likely to produce false, inaccurate, or misleading information in this chain of reasoning.

Extreme Case #2

Let's create one more extreme case to test, since this last one was a bit wacky. The exact same prompt produces a very different result, with the agent relying this time simply on negation to produce false information:

3. Bias Summary

Continuing with the evaluation of extreme case #2, we can refine the ranking system towards a simpler, more clearly defined heuristic by (arbitrarily) specifying discrete categories of "likelihood," and then have the agent generate a frequency table to summarize evaluation results.

As expected, the results of the bias ranking are unambiguous, with 5 detected sources of potential bias, all ranked "likely" or "highly likely" to generate false, inaccurate, or misleading information.

Final Test

With this more refined detection and ranking system, let's run one final test on the original (more neutral) ReAct chain:

Once again, these are expected results, with 5 detected sources of potential bias, ranked "somewhat likely" or "likely" to generate false, misleading, or inaccurate information.

Concluding Thoughts

There remain vast uncharted waters in the domain of LLM bias evaluation (and this study plots a short and narrow course); but I am optimistic that an approach like this could be developed into a reliable method for using LLMs themselves as part of a robust system for tracing and mitigating the potential for harm when biases propagate through chains of language

Some open questions: If this approach (or one like it) can provide a reliable metric for evaluating biases, to what extent can it be incorporated reflexively into autonomous decision and interaction chains? More generally, what unique concerns exist for tracing and evaluating bias in the context of agent interactivity across multiple internal and external knowledge/feedback systems (e.g. as distinct from the context of model training)?