Noise-cancelling headphones for the internet
Filtering out LLM-generated junk
Introduction
A few weeks ago, I built a very proof-of-concept browser extension that classifies the text of a visited page and sets the transparency of each paragraph to the confidence that it was the output of a generative AI model. A friend and I had been talking about the challenges of doing deep work, and how much noise and distraction there is online already; with the release of ChatGPT and co., it can only get worse. Low-quality journalism outlets have already experimented with replacing human writers with LLMs, and the amount of junk that AI can produce is limitless.
There were two main elements to this, both of which had me learning: writing a browser extension, and the harder part, detecting whether text is likely of human or AI origin. If you just want to jump to the code, it's all available in the GitHub repository.
Writing a browser extension
While I wrote a few Firefox extensions back in the day (~2004), that was a long time ago, and like most of the world, I've since moved to a Chromium-based browser. This tutorial covered almost everything I needed to know, from making a manifest file to using querySelector to iterate through all the paragraphs (<p>) on a page. Initially I started by iterating through <div>s, but found that too many divs were used to contain non-textual content, so I refocused on paragraphs.
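For orientation, a minimal Manifest V3 file for a content script like this might look as follows; the extension name and script path here are my own illustration, not necessarily the repository's layout:

```json
{
  "manifest_version": 3,
  "name": "AI Noise Filter",
  "version": "0.1",
  "description": "Fades out paragraphs that look LLM-generated",
  "content_scripts": [
    {
      "matches": ["<all_urls>"],
      "js": ["scripts/detect.js"]
    }
  ]
}
```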
With my browser in Developer mode, I could load (and reload) the folder containing my manifest and scripts directory without compressing and renaming anything, allowing for a pretty quick development and test cycle. I also wanted a simple page to test it out on, so I wrote a quick About Me paragraph, used ChatGPT to write an About Me page, and put both together onto my testing page.
Pretty quickly (or at least, quickly for someone who doesn't really know or use JS) I had built a loop over all the paragraphs on a page that passes the text of each into a query() function that does the detection itself. Then I had to normalize the confidence score from the detector (0-1) into an opacity value and round it to a few decimal places.
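A minimal sketch of that loop, assuming query() returns the detector's confidence (0-1) that the text is AI-generated; the real code in the repository may differ:

```js
// Content script: fade each paragraph by the detector's confidence
// that it is AI-generated. query() is defined further below.
async function classifyPage() {
  for (const p of document.querySelectorAll("p")) {
    const score = await query(p.innerText); // 0 = human, 1 = AI
    // High AI confidence => low opacity; round to two decimal places.
    p.style.opacity = Math.round((1 - score) * 100) / 100;
  }
}

classifyPage();
```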
After my initial testing, I found that the delays in processing the text caused paragraphs to disappear abruptly, so I learned about the transition CSS property, which allowed for a more pleasing fade-out of the [suspected] AI-generated paragraphs. The last part was adding a sanity check on the length of the text: paragraphs with little to no text, or with many thousands of characters, are skipped, since the detector is either unable to generate a useful score or would be overwhelmed by the input.
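Both tweaks are small; here's a sketch, with thresholds that are illustrative guesses rather than the values the extension actually uses:

```js
const MIN_CHARS = 50;    // too short: the detector can't score it usefully
const MAX_CHARS = 2000;  // too long: would overwhelm the model

for (const p of document.querySelectorAll("p")) {
  const text = p.innerText.trim();
  if (text.length < MIN_CHARS || text.length > MAX_CHARS) continue;
  // Let opacity changes fade in over a second instead of snapping.
  p.style.transition = "opacity 1s ease-in-out";
}
```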
Text classification
This was the meat of the task: figuring out whether a paragraph was likely human- or AI-written. Tools like GPTZero and OpenAI's text classifier indicate that there's a way to get a relatively good indication of text origin. Both of those tools require a fairly considerable amount of text before they'll make a determination, and state that the more text they receive, the more accurate they are. Before I used a tool, I wanted to dig into which features lead a detector to classify text as human- or LLM-generated; most tools score a text on two features: perplexity and burstiness.
Perplexity is the amount of randomness in a text: humans write with more randomness, since each of us has a vocabulary, style, and grammar pattern that is unique to us. At their core, LLMs are trained by "reading" a huge amount of text and then determining, based on the prompt, the most probable text to follow. In the sheer immensity of the training corpus, individual styles and quirks are filed off, leaving the most generic and oft-used.
Burstiness is more about randomness or spikes in structure. Perplexity may be high for a paragraph that has "artisanal" words or references to a sub-culture's memes, whereas burstiness captures a human's natural flow of writing sentences of different lengths. Just. Like. This.
While these are fairly simple concepts, and it makes intuitive sense that an LLM trained on a corpus of all texts will naturally trend toward the mean, calculating them from a text is difficult. Burstiness can be calculated purely from the text, but its accuracy depends on the length of the sample, and since I was working on a per-paragraph basis, there was sometimes very little data to work from.
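As one simple proxy (my own illustration, not necessarily what any particular detector computes), you can measure the spread of sentence lengths within a sample:

```js
// Rough burstiness proxy: standard deviation of sentence lengths in words.
// Human text tends to mix short and long sentences; LLM text is more uniform.
function burstiness(text) {
  const sentences = text.split(/[.!?]+/).map(s => s.trim()).filter(Boolean);
  if (sentences.length < 2) return 0; // too little data on short paragraphs
  const lengths = sentences.map(s => s.split(/\s+/).length);
  const mean = lengths.reduce((a, b) => a + b, 0) / lengths.length;
  const variance =
    lengths.reduce((sum, l) => sum + (l - mean) ** 2, 0) / lengths.length;
  return Math.sqrt(variance);
}
```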
Calculating perplexity is much more difficult: you usually need a model trained like the LLM, so that it essentially encodes the probability of generating each word given the preceding text as context. When LLMs generate text, they start with the prompt, find the set of words most probable to come next, and select one probabilistically (this is where the temperature parameter comes into play); this process is repeated until stopped. If you have a model trained on the same corpus, the same process can be performed, but instead of generating the next word, you record the probability of selecting the word that actually appears. For example, if the text you've analyzed so far is:
The quick...
then words like cheetah, brown [as in fox], or runner would be likely, and contribute to a low perplexity (this prompt on ChatGPT responds to me with: brown [fox jumps over the lazy dog]), whereas snail, slug, or rock would have been encountered much less often during training, contributing to a higher perplexity score and therefore more confidence in human origin.
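Given those per-word probabilities, the perplexity itself is just the exponential of the average negative log-probability. A sketch, where the probability array is assumed to come from such a scoring model:

```js
// Perplexity from per-token probabilities p(word | preceding context).
// Predictable continuations => probabilities near 1 => low perplexity.
function perplexity(tokenProbs) {
  const avgNegLogProb =
    tokenProbs.reduce((sum, p) => sum - Math.log(p), 0) / tokenProbs.length;
  return Math.exp(avgNegLogProb);
}

perplexity([0.9, 0.8, 0.85]); // ≈ 1.18: very predictable, "generic" text
perplexity([0.1, 0.05, 0.2]); // ≈ 10.0: surprising word choices
```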
Putting it all together
In order to get a quality detection, I needed to use a model, and the RoBERTa Base OpenAI Detector on HuggingFace seemed to tick all the boxes: there was no minimum length of text (though there was an undocumented maximum), it was trained on GPT-2 output (GPT-2 was used as a base for newer GPT models), and it was lightweight enough to run via the free API or on a cheap/small dedicated instance. The API for this model returns a confidence score of how fake or real a provided text sample is, which I can easily normalize into an opacity value.
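That gives a query() along the lines of what the loop above assumed. A sketch using the hosted Inference API; the response shape is the standard text-classification format, and the "Fake" label name is from my reading of the model card, so treat both as assumptions:

```js
// Query the detector via Hugging Face's hosted Inference API.
// HF_TOKEN is a placeholder for an API token.
async function query(text) {
  const response = await fetch(
    "https://api-inference.huggingface.co/models/roberta-base-openai-detector",
    {
      method: "POST",
      headers: { Authorization: `Bearer ${HF_TOKEN}` },
      body: JSON.stringify({ inputs: text }),
    }
  );
  const results = await response.json(); // [[{label, score}, ...]]
  // Return the score of the "Fake" label: confidence of AI origin.
  const fake = results[0].find(r => r.label === "Fake");
  return fake ? fake.score : 0;
}
```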
While the extension works pretty well, it was interesting to find that getting ground truth on some sites is a challenge. For example, OpenAI's ChatGPT announcement blog quotes ChatGPT (which is correctly flagged), but the surrounding text is also deemed suspicious, so it's possible that OpenAI had ChatGPT write its "own" press release. The biggest downside is the delay (and the privacy concern): sending all the text to a remote instance to run the model is slow and does not scale well. I'm working on a fast, local way of doing the same detection; check back later for more details!