A mildly fun thing to do when you’re bored
is start the beginning of a text message, and then use only the suggested words to finish
it. “In five years I will see you in the morning
and then you can get it.” The technology behind these text predictions
is called a “language model”: a computer program that uses statistics to guess the
next word in a sentence. And in the past year, other, newer language
models have gotten really, weirdly good at generating text that mimics human writing.
“In five years, I will never return to this place. He felt his eye sting and his throat
tighten.” The program completely made this up. It’s
not taken from anywhere else and it’s not using a template made by humans.
For the first time in history, computers can write stories. The only problem is that it’s
easier for machines to write fiction than to write facts. Language models are useful for a lot of reasons. They help “recognize speech” properly
when sounds are ambiguous in speech-to-text applications.
And they can make translations more fluent when a word in one language maps to multiple
words in another. But if you asked language models to simply
generate passages of text, the results never made much sense.
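The “uses statistics to guess the next word” idea can be sketched as a toy bigram model: count which word follows which in some training text, then sample the next word from those counts. This is purely illustrative, not any model mentioned here; with only one word of context, the output quickly drifts into the kind of word salad early models produced.

```python
import random
from collections import defaultdict, Counter

def train_bigram(text):
    # For each word, count how often each other word follows it.
    words = text.lower().split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts, start, length=10, seed=0):
    # Repeatedly sample the next word in proportion to its observed frequency.
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        followers = counts.get(out[-1])
        if not followers:
            break  # dead end: this word was never followed by anything in training
        words, freqs = zip(*followers.items())
        out.append(rng.choices(words, weights=freqs)[0])
    return " ".join(out)

model = train_bigram("in five years i will see you in the morning and then you can get it")
print(generate(model, "in"))
```

Because each choice depends only on the single previous word, the text is locally plausible but globally incoherent, which is why longer passages from such simple models never made much sense.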
SHANE: And so the kinds of things that made sense to do were like generating single words
or very short phrases. For years, Janelle Shane has been experimenting
with language generation for her blog AI Weirdness. Her algorithms have generated paint colors (“Bull Cream”), Halloween costumes (“Sexy Michael Cera”), and pick-up lines (“You look like a thing and I love you.”). But this is what she got in 2017 when she
asked for longer passages, like the first lines of a novel:
SHANE: The year of the island is discovered the Missouri of the galaxy like a teenage
lying and always discovered the year of her own class-writing bed… It makes no sense.
Compare that to this opening line from a newer language model called GPT-2.
SHANE: It was a rainy, drizzling day in the summer of 1869. And the people of New York,
who had become accustomed to the warm, kissable air of the city, were having another bad one.
JOSS: It’s like it’s getting better at bullsh*tting us.
SHANE: Yes, yes, it is very good at generating scannable, readable bullsh*t.
Going from word salad to pretty passable prose took a new approach in the field of natural
language processing. Typically, language tasks have required carefully
structured data. You need thousands of correct examples to train the program.
For translation you need a bunch of samples of the same document in multiple languages.
For spam filters, you need emails that humans have labeled as spam.
For summarization, you need full documents plus their human-written summaries. Those
data sources are limited and can take a lot of work to collect.
But if the task is to simply guess the next word in a sentence, the problem comes with
its own solution. So the training data can be any human-written
text, no labeling required. This is called “self-supervised learning.” That’s what
makes it easy and inexpensive to gather data, which means you can use a LOT of it.
Like all of Wikipedia, or 11,000 books, or 8 million web pages.
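The self-supervised setup can be made concrete: slide a window over any raw text, and every position yields a (context, next-word) training pair for free. A minimal sketch, using a made-up three-word context window:

```python
def make_training_pairs(text, context_size=3):
    # Every position in raw text becomes a training example:
    # the preceding words are the input, the next word is the label.
    words = text.split()
    pairs = []
    for i in range(context_size, len(words)):
        pairs.append((words[i - context_size:i], words[i]))
    return pairs

for context, target in make_training_pairs("the people of New York were having another bad one"):
    print(context, "->", target)
```

No human labeling step appears anywhere: the text itself supplies both the question and the answer.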
With that amount of data, plus serious computing resources, and a few tweaks to the
architecture and size of the algorithms, these new language models build vast mathematical
maps of how every word correlates with every other word, all without being explicitly told
any of the rules of grammar or syntax. That gives them fluency with whatever language
they’re trained on, but it doesn’t mean they know what’s true or false.
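One way to picture a “map of how every word correlates with every other word” is a simple co-occurrence count: how often each pair of words appears within a few words of each other. This is only a conceptual sketch; models like GPT-2 learn dense vector representations with a neural network rather than keeping raw counts.

```python
from collections import defaultdict, Counter

def cooccurrence(text, window=2):
    # Count, for every word, which other words appear within `window`
    # positions of it, building a crude word-correlation map.
    words = text.lower().split()
    counts = defaultdict(Counter)
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j:
                counts[w][words[j]] += 1
    return counts

counts = cooccurrence("the warm air of the city the warm rain of the summer")
print(counts["warm"].most_common(3))
```

Nothing in such a map encodes whether a statement is true; it only records which words tend to travel together.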
Getting language models to generate true stories, like accurate summaries of documents or correct answers to questions, takes extra training. The simplest thing to do, without much more
work, is just to generate passages of text that are superficially coherent but false.
GEITGEY: So give me any headline that you want a fake news story for.
JOSS: Scientists discover Flying Horse. Adam Geitgey is a software developer who created
a fake news website populated entirely with generated text.
He used a language model called Grover, which was trained on news articles from 5,000 publications.
“More than 1,000 years ago, archaeologists unearthed a mysterious flying animal in France
and hailed it the ‘Winged Horse of Afzel’ or ‘Horse of Wisdom’”
GEITGEY: This is amazing, right? Like this is crazy.
JOSS: So crazy. GEITGEY: “The animal, which is the size of
a horse, was not easy.” If we just Google that. Like there’s nothing.
JOSS: It doesn’t exist anywhere. GEITGEY: And I don’t want to say this is perfect.
But just from a longer term point of view of what people were really excited about three
years ago versus what people can do now, like this is just like a huge, huge leap.
If you read closely, you can see that the model is describing a creature that is somehow
both “mouse-like” and “the size of a horse.”
That’s because it doesn’t actually know what it’s talking about. It’s simply mimicking
the writing style of a news reporter. These models can be trained to write in the
voice of any source, like a Twitter feed, “I’d like to be very clear about one thing.
shrek is not based on any actual biblical characters. not even close.”
Or whole subreddits. “I found a potato on my floor.”
“A lot of people use the word ‘potato’ as an insult to imply they are not really
a potato, they just ‘looked like’ one.” “I don’t mean insult, I mean as in as in
the definition of the word potato.” “Fair enough. The potato has been used in
various ways for a long time.” But we may be entering a time when AI-generated
text isn’t so funny anymore. “Islam has taken the place of Communism
as the chief enemy of the West.” Researchers have shown that these models can
be used to flood government websites with fake public comments about policy proposals,
post tons of fake business reviews, argue with people online, and generate extremist
and racist posts that can make fringe opinions seem more popular than they really are.
GEITGEY: It’s all about like taking something you could do and then just increasing the
scale of it, making it more scalable and cheaper. The good news is that some of the developers
who built these language models also built ways to detect much of the text generated
through their models. But it’s not clear who has the responsibility to fact-check the
internet. And as bots become even better mimics – with
faces like ours, voices like ours, and now our language, those of us made of flesh and blood may find ourselves
increasingly burdened with not only detecting what’s fake, but also proving that we’re