[Note: this piece is cross-published as a guest entry on the LangChain blog.]
There's something of a structural irony in the fact that building context-aware LLM applications
typically begins with a systematic process of decontextualization, wherein
- source text is divided into more or less arbitrarily sized pieces before
- each piece undergoes a vector embedding process designed to capture context: the information inherent in the relations between pieces of text.
Not altogether unlike the way human readers interact with natural language, AI applications that rely
on Retrieval Augmented Generation (RAG) must balance the analytic precision of drawing inferences from
short sequences of characters (what your English teacher would call "close reading") against the
comprehension of context-bound structures of meaning that emerge more or less continuously as those
sequences increase in length (what your particularly cool English teacher would call "distant reading").
This post explores a novel approach to striking this balance with HTML content, leveraging important
contextual information inherent in document structure that is typically lost when LLM applications are
built over web-scraped data or other HTML sources. In particular, we will test some methods of combining
self-querying retrieval with LangChain's new
HTML Header Text Splitter, a "structure-aware" chunker that splits text at the element level
and adds metadata for each chunk based on header text.
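As a preview, here is a minimal sketch of how the two pieces fit together. This is illustrative only: the example URL, the metadata field names, and the import paths are assumptions that may vary by LangChain version. The splitter tags each chunk with the header text above it, and a self-query retriever can then translate natural-language questions into filters over that header metadata.

```python
from langchain.text_splitter import HTMLHeaderTextSplitter
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI

# 1. Split an HTML page at the element level, recording header text as metadata.
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]
splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
# Placeholder URL; any HTML page with nested headers works.
docs = splitter.split_text_from_url("https://example.com/article.html")

# 2. Index the header-aware chunks.
vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())

# 3. Describe the header metadata so the self-query retriever (which requires
#    the `lark` package) can turn questions into metadata filters.
metadata_field_info = [
    AttributeInfo(name="Header 1", description="Top-level section title", type="string"),
    AttributeInfo(name="Header 2", description="Subsection title", type="string"),
]
retriever = SelfQueryRetriever.from_llm(
    llm=ChatOpenAI(model="gpt-4", temperature=0),
    vectorstore=vectorstore,
    document_contents="Chunks of an HTML article, tagged with their section headers",
    metadata_field_info=metadata_field_info,
)

retriever.get_relevant_documents("What does the section on evaluation say about bias?")
```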
Click to read on...
The project of using LLM dialogue agents to simulate philosophers conversing with each other
is in one sense just a game—a way of playing with language dolls that are, if not lifelike, rendered in amazing detail.
But the implications are far-reaching, throwing into relief the complex relationship between these static models of language and the more or less
malleable systems of knowledge that they can contain and interact with.
Like the previous project of using LLM agents to simulate the discourse of a university English class,
these experiments are designed in the spirit of exploration. I use LangChain,
powered by GPT-4, first to build agents to role-play as specific philosophers from throughout history (with information extracted from the Stanford Encyclopedia of Philosophy), then to
simulate a series of dialogues in which one agent tries to convince another to adopt a philosophical position that
contradicts its own.
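For a rough sense of the mechanics, the dialogue loop alternates turns between two GPT-4-backed personas whose system prompts are built from material extracted from the Stanford Encyclopedia of Philosophy. This is an illustrative sketch rather than the project's code; the persona summaries, turn count, and temperature are placeholders, and a recent LangChain version is assumed.

```python
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage, SystemMessage

llm = ChatOpenAI(model="gpt-4", temperature=0.7)

def make_persona(name: str, sep_summary: str) -> str:
    # sep_summary stands in for text extracted from the philosopher's
    # Stanford Encyclopedia of Philosophy entry.
    return (
        f"You are {name}. Argue strictly from your own philosophical position, "
        f"summarized here: {sep_summary} "
        "Your interlocutor will try to convince you to adopt a position that contradicts it. "
        "Respond in one short paragraph."
    )

persuader = make_persona("Philosopher A", "[SEP-derived summary of position A]")
target = make_persona("Philosopher B", "[SEP-derived summary of position B]")

transcript = ["Philosopher A: state your position and try to win Philosopher B over."]
for turn in range(6):  # arbitrary number of exchanges
    persona = persuader if turn % 2 == 0 else target
    reply = llm.invoke(
        [SystemMessage(content=persona), HumanMessage(content="\n\n".join(transcript))]
    )
    transcript.append(reply.content)

print("\n\n".join(transcript))
```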
Some open questions: To what extent are these interactions deterministic? What makes an agent more likely to convince or to be convinced?
What does the relative tractability of an agent indicate about the philosophical position it represents?
What does that tractability indicate about how the LLM itself reflects the underlying system of knowledge?
Click to read on...
A quick project demonstrating how to extract structured information about forms of bias from unstructured webpage content. The objectives, scoring criteria, and schema are arbitrarily defined, so this serves as an open-ended demonstration of how to use LangChain with GPT-4 to build structured, evaluative information from language data loaded from URLs.
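The general pattern looks something like the sketch below, with a made-up schema and URL standing in for the post's actual objectives and scoring criteria: load the unstructured page text, then ask GPT-4 to return records conforming to the schema.

```python
from langchain.chains import create_extraction_chain
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import WebBaseLoader

# Hypothetical schema; the post defines its own objectives, criteria, and fields.
schema = {
    "properties": {
        "bias_type": {"type": "string"},
        "excerpt": {"type": "string"},
        "severity_score": {"type": "integer"},
        "rationale": {"type": "string"},
    },
    "required": ["bias_type", "excerpt", "severity_score"],
}

llm = ChatOpenAI(model="gpt-4", temperature=0)
chain = create_extraction_chain(schema, llm)

# Load unstructured page content from a placeholder URL and extract records.
docs = WebBaseLoader("https://example.com/news-article").load()
records = chain.run(docs[0].page_content)
print(records)  # a list of dicts matching the schema
```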
Click to read on...
This entry is a follow-up/supplement to the previous entry, "GPT-4 Goes to College." Read that one first for context and the rest of the code.
Previously, I used LangChain to create an agent to play the role of a student in a university English class. Given a set of "tools" (primary source texts along with a course syllabus and lecture notes), the agent was prompted to compose responses to its own previous responses to a writing assignment, simulating the process of learning through participation in intellectual exchange. This allowed for a glimpse of some potential vectors through which bias can propagate in reasoning chains involving autonomous interactions across multiple internal and external knowledge systems.
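For readers who haven't seen the earlier post, the rough shape of that setup looks like the sketch below; the tool names, file paths, and assignment prompt are placeholders rather than the original code. Each course material is wrapped as a retrieval tool, and a ReAct-style agent decides which to consult while drafting its response.

```python
from langchain.agents import AgentType, Tool, initialize_agent
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import TextLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

llm = ChatOpenAI(model="gpt-4", temperature=0)

def retrieval_tool(name: str, path: str, description: str) -> Tool:
    # Wrap retrieval over a single course material as a tool the agent can call.
    docs = RecursiveCharacterTextSplitter(chunk_size=1000).split_documents(
        TextLoader(path).load()
    )
    qa = RetrievalQA.from_chain_type(
        llm=llm,
        retriever=Chroma.from_documents(docs, OpenAIEmbeddings()).as_retriever(),
    )
    return Tool(name=name, func=qa.run, description=description)

tools = [
    retrieval_tool("primary_source", "texts/primary_source.txt", "The assigned primary text"),
    retrieval_tool("syllabus", "texts/syllabus.txt", "The course syllabus"),
    retrieval_tool("lecture_notes", "texts/lecture_notes.txt", "The instructor's lecture notes"),
]

# A ReAct-style agent chooses which source to consult at each step.
student = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)
student.run("Compose a response to the writing assignment, then revise your previous response.")
```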
Such processes (while relatively benign in this case) pose a unique set of challenges for traceability and evaluation compared with the more familiar, and more deterministic, processes related to biases inherent in LLMs themselves, i.e., biases that emerge during training or that exist in the natural language data used for training.
Here, I test the potential for GPT-4 (or other LLMs) to be used reflexively to trace and to evaluate forms of bias that emerge within the ReAct framework.
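Concretely, the idea is to capture the agent's ReAct trace (for example, by enabling return_intermediate_steps on the AgentExecutor) and hand it back to GPT-4 with an evaluative prompt. A minimal sketch, with a placeholder trace and an assumed evaluation prompt:

```python
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage

# Placeholder ReAct trace; in practice this is assembled from the agent's
# intermediate_steps, a list of (AgentAction, observation) tuples.
trace = """Thought: I should check the lecture notes for the instructor's framing.
Action: lecture_notes
Action Input: framing of the assigned reading
Observation: [retrieved passage]
Thought: I can now draft my response."""

evaluator = ChatOpenAI(model="gpt-4", temperature=0)
critique = evaluator.invoke([HumanMessage(content=(
    "Review this ReAct reasoning trace and identify any step where bias from one "
    "knowledge source (lecture notes, syllabus, the agent's own prior responses) "
    "may have been introduced or amplified in the final answer. Cite the step and explain.\n\n"
    + trace
))])
print(critique.content)
```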
Click to read on...