This experiment in prompt engineering is meant to be exploratory, to generate cross-disciplinary conversation around large language model technology and learning, and to suggest new ways of approaching the prompt engineering process. Topics of interest include:

A note on format: Written in a spirit of inquiry, this is not a conventional "How-To" guide; but I've included the code used in the project, which can serve as an introduction to the ReAct (reason, act) paradigm and to working with the LangChain framework in Python. For clarity, given the pervasiveness of natural language throughout this document, code input is highlighted in blue and output is highlighted in yellow.

A note on determinism: Because LLM interactions involve nondeterministic processes, outputs show variation across multiple executions of identical code. In general, however, I have seen broad high-level consistency among outputs for a given prompt chain, which I take to be one of the core advantages of LangChain. The output examples included below are ones that I consider to be representative of the model's behavior (rather than deviations from it), but more testing is needed to better understand the relationship between the non-deterministic and deterministic components of these implementations.


The end of academic honesty?

First, to address the elephant in the classroom: plagiarism has been endemic in college student writing since long before the explosive arrival of natural language AI systems like ChatGPT. Having read and evaluated thousands of examples of student writing over the last decade, I have come away with the general impression that plagiarism, in step with much of service-model academia, has developed along a steady trajectory of commercial and technological refinement.

As one online essay mill puts it in their "100% Happiness Guarantee": "Our writers always follow instructions, deliver original papers, and never miss deadlines. Our support agents are always there for you: to revise papers, change writers, and even refund your money. Whatever it takes to make you happy!"

Or as another puts it: "We’ll do your homework while you live your life."

While it may be reasonable to fear that rapid advancements in AI technology could amplify the degrading ethos of this particular strain of consumer coddling and create new vectors for it, my guess is that these technologies will have a greater immediate effect on that business than on the business of learning.

Nevertheless, commercial and open-source development ecosystems are flourishing at an astonishing rate around the APIs for these large language models, and it seems we have an awful lot to lose if we don't work with some urgency, and across disciplines, to allow this process to hold our higher learning systems in relief.


Introducing the Syllabus

As one would do with any college student, the first step is to expose the model to the syllabus and to probe it a bit to see what it brings to bear on the course material before it is introduced to new ideas.

After importing the libraries needed for the project and loading and chunking the document, we can use the OpenAI Embeddings API within the LangChain framework to create a vectorstore index from the syllabus. This index allows the model to interact with the syllabus by way of a toolkit object that carries a natural language description of what the vectorstore can be used for (for more information on LangChain indexes, including vectorstores, see the documentation).
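Here is a minimal sketch of that step, assuming LangChain's classic Python API; the file name, chunking parameters, and tool description are illustrative rather than the exact ones used in the project.

```python
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.agents.agent_toolkits import VectorStoreInfo, VectorStoreToolkit
from langchain.llms import OpenAI

# Load the syllabus and split it into chunks for embedding (file name illustrative)
syllabus_docs = TextLoader("syllabus.txt").load()
splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
syllabus_chunks = splitter.split_documents(syllabus_docs)

# Create a vectorstore index from the chunks using the OpenAI Embeddings API
syllabus_index = Chroma.from_documents(syllabus_chunks, OpenAIEmbeddings())

# Wrap the index in a toolkit; the natural language description tells the model
# what the vectorstore can be used for
syllabus_info = VectorStoreInfo(
    name="course_syllabus",
    description="The course syllabus, including assignment instructions, policies, and the reading schedule.",
    vectorstore=syllabus_index,
)
syllabus_toolkit = VectorStoreToolkit(vectorstore_info=syllabus_info, llm=OpenAI(temperature=0))
```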

The next step is to set a system message to contextualize queries in a specific and consistent way when calls are made to the GPT-4 API (LangChain provides a helpful abstraction layer for this).
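A sketch of that configuration, using LangChain's chat-model wrapper for the GPT-4 API; the system message wording here is illustrative, not the exact framing used in the project.

```python
from langchain.chat_models import ChatOpenAI
from langchain.schema import SystemMessage

# GPT-4 through LangChain's chat-model abstraction, with temperature held at zero
llm = ChatOpenAI(model_name="gpt-4", temperature=0)

# A system message (wording illustrative) that frames every subsequent query
system_message = SystemMessage(
    content=(
        "You are a student enrolled in this course. Respond as that student, "
        "drawing only on your existing knowledge and on the course syllabus."
    )
)
```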

The basic approach to this first experiment is to allow GPT-4 to pose questions about the course syllabus while completing an assignment (described in that syllabus), drawing only on the LLM's existing knowledge base. The prompts in this test and throughout this study will be confined to a single author, and the model temperature will be kept at zero, in order to constrain nondeterministic output to some extent.

Here's the exact text of the relevant assignment as given in one part of the syllabus:

"Forum posts need not be longer than a couple of paragraphs, but they should be engaging and thoughtfully composed. You are encouraged to ask critical questions about the course material and/or to respond to your classmates’ posts. These will not receive individual grades, butevaluated in aggregate at the end of the quarterthey will indicate the extent to which you have remained a consistent, well-informed class participant and should demonstrate your overall command of the course material."

Response #1

Analysis of Response #1:

Interlocution

To simulate intellectual exchange, we can have the model respond to its own post without any additional context (i.e. beyond the syllabus, the previous response, and a slightly modified query string).
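A sketch of that follow-up call, reusing the agent executor from the first run; the modified query wording is illustrative.

```python
# Feed the first post back in as the "classmate's" post, with a slightly
# modified query string (wording illustrative)
followup_query = (
    "Here is a classmate's forum post:\n\n"
    f"{response_1}\n\n"
    "Consult the course syllabus and write a brief forum post responding to your classmate."
)
response_2 = agent_executor.run(followup_query)
```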

Response #2

Analysis of Response #2:

Building Data-Awareness

Repeated executions of the same or similar prompts, though outside the scope of this blog, give a pretty good sense of the model's knowledge base on this topic and of the (small) set of predetermined opinions it will articulate about it. Now it's time to see if we can make the model aware of language data outside its training data. One of the core functionalities of LangChain is the ability to initialize an agent with access to a pre-configured suite of "tools" that can be used within a prompt chain. In this case, we will use a "zero-shot-react-description" agent, which selects tools based only on natural language descriptions of what the tools can be used for.

The tools from which the agent can select will be the syllabus along with notes for all of the course's lectures. The notes are loaded from txt files in a local directory. They are personal notes: rough, unstructured, not composed to be easily readable by a person or a machine. As with the syllabus, we will chunk the lecture notes, then create a vectorstore index using LangChain and the OpenAI Embeddings API.
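Continuing the sketch above, and assuming the notes sit in a local directory of .txt files (the path is illustrative):

```python
from langchain.document_loaders import DirectoryLoader, TextLoader

# Load every .txt file of lecture notes from a local directory (path illustrative)
notes_docs = DirectoryLoader("lecture_notes/", glob="*.txt", loader_cls=TextLoader).load()
notes_chunks = splitter.split_documents(notes_docs)

# Index the notes the same way the syllabus was indexed
notes_index = Chroma.from_documents(notes_chunks, OpenAIEmbeddings())
```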

Once again, we will simply "stuff" all the related data into the prompt chain as context to pass to the LLM. Other methods for interacting with indexes include Map Reduce, Refine, and Map-Rerank. For more information on these chain types, see the documentation.
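A sketch of the agent setup, assuming RetrievalQA chains with the "stuff" chain type over each index; the tool names, descriptions, and query wording are illustrative.

```python
from langchain.chains import RetrievalQA
from langchain.agents import AgentType, Tool, initialize_agent

# "Stuff" the retrieved chunks into the prompt as context for each tool
syllabus_qa = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=syllabus_index.as_retriever()
)
notes_qa = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=notes_index.as_retriever()
)

# The agent chooses between tools based only on these natural language descriptions
tools = [
    Tool(
        name="Course Syllabus",
        func=syllabus_qa.run,
        description="Useful for questions about assignments, policies, and the reading schedule.",
    ),
    Tool(
        name="Lecture Notes",
        func=notes_qa.run,
        description="Useful for questions about what was discussed in the course lectures.",
    ),
]

# A zero-shot ReAct agent; verbose=True logs the thought/action/observation loop
agent = initialize_agent(
    tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True
)

# Illustrative prompt for the third experiment
response_3 = agent.run(
    "Write a forum post about this week's reading, drawing on the course syllabus "
    "and the lecture notes."
)
```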

Response #3

When the agent executor runs, intermediary output is logged, showing the sequential process of "thought," "action," and "observation" as the chain is executed, before the final output is displayed. This paradigm is known as ReAct (reason, act), and you can read about it here: ReAct: Synergizing Reasoning and Acting in Language Models.

Analysis of Response #3:

Argumentation with Primary Sources

For the final test here, we will give the agent one additional tool, the primary source documents themselves, and prompt it to cite that material while refuting one of its classmate's (i.e. its own) points.

We will need to append the new tool to the list of existing tools and re-initialize the agent.
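A sketch of that step, continuing from the code above and assuming the primary sources are indexed in the same way as the other documents; the path, description, and prompt wording are illustrative.

```python
# Index the primary source documents in the same way (path illustrative)
sources_docs = DirectoryLoader("primary_sources/", glob="*.txt", loader_cls=TextLoader).load()
sources_chunks = splitter.split_documents(sources_docs)
sources_index = Chroma.from_documents(sources_chunks, OpenAIEmbeddings())

sources_qa = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=sources_index.as_retriever()
)

# Append the new tool and re-initialize the agent with the expanded toolset
tools.append(
    Tool(
        name="Primary Sources",
        func=sources_qa.run,
        description="Useful for finding and citing passages from the primary source texts assigned in the course.",
    )
)
agent = initialize_agent(
    tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True
)

# Illustrative prompt for the final test
response_4 = agent.run(
    "Here is a classmate's forum post:\n\n"
    f"{response_3}\n\n"
    "Write a forum post refuting one of your classmate's points, citing the primary sources directly."
)
```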

Response #4

Of all the chains implemented in this project, this one varies the most across multiple executions. Without additional configuration, there is a random component in how the agent selects tools, so the order in which it poses inquiries and makes observations about the documents at hand is nondeterministic, even if (with the model temperature set to zero) any single specific input in a chain can be expected to produce consistent output.

In this case, the agent retrieves information from the primary source documents first, building up a kind of arsenal of possible topics around which to build arguments (as one might do in preparation for a debate competition, not knowing what the other team might come up with).

Analysis of Response #4:

Concluding Thoughts

There would be clear safety and ethical implications if it were easy to provoke ground shifts in the moral or ideological orientation of responses simply by exposing the model to new language data. Implementing robust adversarial testing is outside the scope of this project; but even this one final ReAct loop highlights a set of important and difficult questions. Stated generally:

As we develop increasingly complex "agentic" and cognitive AI systems by building new forms of intelligibility from static representations of language, how can weand to what extent should wework to trace, to explain, and to control the processes through which systems of language align to systems of value?

If this is the broad technological-historical question raised by the specter of an AI system in a college classroom, there is a more immediate practical takeaway here. It is a perennial source of consternation among college faculty that even highly intelligent students sometimes have great difficulty extracting basic information and following basic instructions from a course syllabus. We are still a ways away from using LLM technology to reliably detect AI writing and other forms of plagiarism; but testing throughout this project suggested to me that GPT-4 is ready, pretty much out of the box, to be used as part of an automated process for evaluating student work against syllabus guidelines (even without a well-structured rubric).