Will third party tools ever be truly effective at detecting Large Language Models?

If you are reading this, then you are probably well aware of OpenAI’s large language models, such as GPT-3 and its dialogue interface, ChatGPT. You may have also heard of concerns from teachers and employers over heavy use of Large Language Models (LLMs) to generate content for assignments, such as using one to write an essay that was assigned as homework.

In general, LLMs operate on the idea that if you input a string of text, for example:

“The capital of France is”

Then the model will return the most likely continuation of that text, such as:

“The capital of France is Paris.”

Or, depending on how you call the model API, it might just return “Paris”.
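To make this concrete, here’s a minimal sketch of such a call using OpenAI’s openai Python package (the prompt and parameter values are just illustrative choices, not the only way to do this):

```python
import openai

openai.api_key = "YOUR_API_KEY"  # assumes you have an OpenAI API key

# Ask the model to continue the prompt with its most likely next tokens.
response = openai.Completion.create(
    model="text-davinci-003",
    prompt="The capital of France is",
    max_tokens=5,
    temperature=0,  # temperature 0 makes the output (nearly) deterministic
)

print(response["choices"][0]["text"])  # most likely: " Paris."
```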


Input strings are separated out into tokens (which are not exactly words, since longer words can be split into multiple tokens for efficiency reasons, and punctuation and whitespace also count as tokens) and then fed into the model, which produces a probabilistic output for the next tokens from its “vocabulary” of tokens. The model then chooses a token (or tokens) from this discrete probability distribution over possible next tokens. This sometimes leads to criticism that the models are just dumb “stochastic parrots”, which to some degree I believe is a fair criticism, but deeper reflection on the notion that language is simply stochastic probabilities for machines but not for humans will soon have you questioning your own consciousness, so we’ll leave this topic alone for now.
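As a quick illustration of tokenization, OpenAI’s open source tiktoken library lets you inspect how a string splits into tokens (a rough sketch; the exact token boundaries depend on the model’s tokenizer):

```python
import tiktoken

# Load the tokenizer associated with text-davinci-003.
enc = tiktoken.encoding_for_model("text-davinci-003")

text = "The capital of France is"
token_ids = enc.encode(text)

print(token_ids)  # a list of integer token IDs
print([enc.decode([t]) for t in token_ids])
# e.g. ['The', ' capital', ' of', ' France', ' is'] -- note the leading spaces
```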

Coming back to the idea that the token output is probability driven, this implies that for certain strings of input tokens there will be a very high probability for just a single output token (or a few), as in the example above, where we would expect most LLMs to always return “Paris”. However, this also implies we could ask the model to return less probable outputs, to try to make the model more “creative” (this is sometimes referred to as increasing the “temperature”, but there are actually a few different methods of pushing the model to make less likely output choices). It should be noted that we don’t want our models to behave too “creatively” by allowing them to choose very low probability outputs. Part of using the APIs for LLMs is understanding the parameters a user can set to affect this probabilistic token choice; it’s a careful balance between creative LLM output and coherent text.
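For reference, the OpenAI completion API exposes this trade-off through the temperature parameter, and also through top_p (nucleus sampling), one of those other methods for allowing less likely choices. A sketch of two contrasting calls (the prompt and parameter values are illustrative):

```python
import openai

prompt = "Write an opening line for an essay about the Industrial Revolution."

# Conservative: stick to (close to) the highest-probability tokens.
predictable = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    max_tokens=50,
    temperature=0.0,
)

# More "creative": sample from a wider slice of the token distribution.
creative = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    max_tokens=50,
    temperature=0.9,
    # top_p=0.9,  # alternative knob: restrict sampling to the top 90% of
    #             # cumulative probability mass (OpenAI suggests tuning
    #             # temperature OR top_p, not both at once)
)
```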

Third Party Detection Tools

As these large language models have increased in popularity, so have 3rd party tools that claim to be able to detect the output of these models with some high degree of “accuracy”. These types of tools include services like GPTZero and Originality.AI (as well as OpenAI’s own detection service, but more on that later). In my personal opinion, it’s very unclear if these 3rd party services are actually effective to begin with, given how readily they produce false positives on text that we know for a fact was written by a human, as well as how easily their detection methods can be “defeated” with a few parameters in a GPT call. There also seems to be a perverse incentive for these tools to lean towards false positives rather than false negatives, since it’s difficult to definitively prove that a particular recent piece of text was not created by an LLM.

Let’s quickly check how accurate these services are at identifying human text as being written by a human. I’ll choose a totally random example of human text that in no way reflects any of my personal feelings about society and capitalism today. Let’s see… how about Karl Marx’s opening remarks for his Inaugural Address of the International Working Men’s Association?

Let’s pass this into Originality.AI’s detection service:

As we can see, the service gives a 50% probability that the text was written by an AI and a 50% probability that it was human written. Personally, I found this quite surprising, especially given how specific some of the text is: Marx literally refers to the import/export values of Great Britain in 1863. Already this doesn’t bode well for 3rd party services, but what about OpenAI’s own free detection service?

OpenAI recently released their own free classifier, which you can use here: https://openai.com/blog/new-ai-classifier-for-indicating-ai-written-text/

As one might imagine, this represents a serious threat to the business model of 3rd party services, potentially rendering them obsolete. What is also interesting about this release was OpenAI’s own admission that its classifier wasn’t “fully reliable”. From their announcement post:

Our classifier is not fully reliable. In our evaluations on a “challenge set” of English texts, our classifier correctly identifies 26% of AI-written text (true positives) as “likely AI-written,” while incorrectly labeling human-written text as AI-written 9% of the time (false positives). 

Let’s try out Marx’s remarks on the OpenAI Classifier platform (which is free, and is also created by the same company that literally makes the world’s most popular LLM, and again, it’s free):

As we can see, OpenAI’s model appears to perform significantly better than the 3rd party service at detecting human text. At this point the reader may want to pause and reflect on the incentives of 3rd party classifiers versus OpenAI’s incentives.

A 3rd party classifier would naturally be inclined to downplay the effectiveness of this free service, and perhaps even take advantage of the general populace’s misunderstanding of the differences between “accuracy”, “recall”, and “precision” to make carefully worded claims, such as in this Feb 8th email from Originality AI:

It should also be noted that Originality AI currently offers a 25% lifetime affiliate program, so you may want to be skeptical (perhaps even cynical?) of other blog posts that directly link to Originality AI with claims that it is the best detection tool available.

Either way, we can see that neither service can actually claim with 100% certainty via text analysis whether a piece of text was generated by a human; they can only give likelihoods. Let’s now shift our focus to detecting AI generated text, beginning with a discussion of the parameters we can adjust to shape LLM output.

Temperature

The temperature parameter is a non-negative value (between 0 and 2 in the OpenAI API, though in practice it is usually kept between 0 and 1) that controls the sampling strategy used by the language model during text generation. Specifically, the model produces a set of logit values representing the relative likelihood of each possible token that could follow the current sequence in the generated text. The logits are divided by the temperature before the softmax function is applied, resulting in a modified probability distribution over the possible tokens.

In particular, when the temperature is set to a low value, the probability distribution becomes peaked around the most likely words to follow the current context, which results in more predictable and less diverse text output. Conversely, as the temperature value increases, the distribution becomes more spread out, allowing for less probable words to be selected with higher probabilities, which results in more diverse and creative text. However, a high temperature may also result in more errors or nonsensical output.
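To see the effect numerically, here is a small self-contained sketch of temperature applied to a made-up set of logits (the numbers are invented purely for illustration):

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then apply softmax."""
    scaled = np.asarray(logits) / temperature
    exps = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    return exps / exps.sum()

# Made-up logits for four candidate next tokens.
logits = [4.0, 2.0, 1.0, 0.5]

print(softmax_with_temperature(logits, 0.2))  # peaked: nearly all mass on one token
print(softmax_with_temperature(logits, 1.0))  # the model's unmodified distribution
print(softmax_with_temperature(logits, 2.0))  # flatter: unlikely tokens gain probability
```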

The choice of temperature value will depend on the specific context of the text generation task, and may need to be tuned through experimentation. OpenAI also gives users access to the presence penalty and frequency penalty hyperparameters.

Presence Penalty and Frequency Penalty

The presence penalty and frequency penalty are two hyperparameters a user can adjust to keep the LLM output from repeating itself too much, essentially “punishing” the model for reusing tokens too often. If set too high, these penalties can push the model completely off-track, but if used carefully, we can adjust AI output so that it goes undetected by LLM classifiers.

The presence penalty discourages the model from reusing tokens that have already appeared in the generated text. It is applied as a one-time penalty to any token that has occurred at least once so far, regardless of how many times, which nudges the model toward introducing new tokens and new topics rather than circling back to ones it has already used.

The frequency penalty also discourages repetition, but it scales with usage: the more often a token has already appeared in the generated text, the more heavily it is penalized. This reduces verbatim, word-for-word repetition and helps keep the output from sounding mechanical.
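OpenAI’s API documentation describes both penalties as a direct adjustment to the logits before sampling. Roughly, in a simplified sketch with made-up numbers:

```python
def apply_penalties(logits, counts, presence_penalty, frequency_penalty):
    """Penalize tokens based on how often they have already been generated.

    Mirrors the adjustment described in OpenAI's API docs:
    logit -= count * frequency_penalty + (count > 0) * presence_penalty
    """
    adjusted = {}
    for token, logit in logits.items():
        count = counts.get(token, 0)
        adjusted[token] = (
            logit
            - count * frequency_penalty                   # grows with every repetition
            - (1 if count > 0 else 0) * presence_penalty  # one-time hit
        )
    return adjusted

# Made-up logits, plus counts of tokens already present in the output so far.
logits = {"Paris": 4.0, "France": 3.5, "Lyon": 1.0}
counts = {"Paris": 3, "France": 1}  # "Lyon" has not appeared yet

print(apply_penalties(logits, counts, presence_penalty=0.5, frequency_penalty=0.8))
# "Paris" (repeated 3 times) drops below "France"; "Lyon" is untouched.
```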

By combining these two penalties with a slightly higher temperature setting, the OpenAI text completion API can generate completions that are both coherent and natural-sounding, yet go undetected by 3rd party services.

Effectiveness of Commercial LLM Detection Software

Given what we’ve discussed above, let’s test some of the leading commercial software for detecting large language model output and check its effectiveness at recognizing text from OpenAI’s text-davinci-003 model.

Here we explore the output of the base model with a very low temperature setting and the default frequency and presence penalty parameters:
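A call like the one used for this test would look roughly as follows (the prompt here is a placeholder, not the exact one from the run shown; frequency_penalty and presence_penalty already default to 0 in the API):

```python
import openai

# Placeholder prompt -- any essay-style prompt will do for this experiment.
prompt = "Write a short essay on the causes of the French Revolution."

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    max_tokens=256,
    temperature=0.1,        # very low: only highly probable tokens get chosen
    frequency_penalty=0.0,  # API default
    presence_penalty=0.0,   # API default
)

print(response["choices"][0]["text"])
```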

Now let’s see how some commercial LLM detection software performs on this text:

Clearly, the detection model recognizes that this output text is composed of very high probability tokens, so it easily detects that it came from some LLM. But what happens as we begin to increase the penalties for frequency and presence of output tokens? Let’s keep the same prompt, but now explore how adjusting the parameters changes the output while still keeping the same overall “thesis” for the output text:
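The adjusted call might look something like this sketch (same placeholder prompt as before; the specific penalty values are illustrative, not the exact ones from the original run):

```python
response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,          # same placeholder prompt as above
    max_tokens=256,
    temperature=0.7,        # allow less probable tokens into the mix
    frequency_penalty=1.5,  # penalty grows with each repeated use of a token
    presence_penalty=1.0,   # one-time penalty once a token has appeared
)

print(response["choices"][0]["text"])
```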

Notice how even the formatting changes, and since whitespace and punctuation count as tokens, we can assume that also affects LLM detection. If we pass in this LLM-generated text, we can see that the detection capability completely breaks:

You will notice that the LLM detection software still believes there is a 1% chance that this text is AI generated, but clearly these simple adjustments to the frequency and presence penalties have been effective at nullifying the detection model. Now, just for fun, let’s have the LLM sarcastically mock the software in a manner that would be obvious under human review, but not under automated review:

And now for the detection result:

Perhaps the model was right; it is an “almighty force to be reckoned with!”

Conclusion

If you are using LLM detection software, you should be aware that it can only perform well against the default settings of large language models. Currently, resources like ChatGPT do not allow users to edit the temperature or the frequency and presence penalties, but if these parameters become editable by users in the future, most detection software will “break”, in the sense that if it does begin detecting highly penalized outputs, it will also produce many false positives on human written text. In fact, we have already seen this behavior for human written Wikipedia articles, because the training corpus for large language models almost always includes the full text of Wikipedia. In conclusion, make sure you know what you’re paying for if you subscribe to these LLM detection services!
