Can AI be used to make the internet safer?
The arrival of new technology is always accompanied by an age-old dilemma: from the written word to the web browser, the same tools that can be used for tremendous good can be incredibly dangerous in the hands of people with bad intentions.
We’re seeing this dilemma repeat itself once more with the introduction of generative AI that makes it infinitely easier to create content. What happens when tools like ChatGPT or Midjourney are used by people to create misleading or harmful content? Could bad actors use these tools at scale to degrade the online experience?
These are questions that my guest, Tom Siegel, the CEO and founder of the online trust and safety company TrustLab, thinks about on a daily basis. Nearly two decades ago, Tom started his career at Google, where he founded and scaled the company’s trust and safety team. Google is also the same place where Tom and I initially met when we worked together on content policies for Google Search.
Recently, I had the pleasure of reconnecting with Tom for a conversation about his work at TrustLab, which uses AI to help companies identify and flag harmful content online. We talked about the incredibly difficult work of online content moderation, what he learned about keeping the internet safe at Google, and why he believes generative AI is ultimately a net good for internet safety.
When I was at Google, I often felt that people had an oversimplified understanding of problems around trust and safety. I remember hearing criticism like “Google should remove hate speech from the internet,” or “Google should make sure that all online content is factual.” But these problems are rarely straightforward. How complicated is the reality of solving issues of trust and safety?
It’s very complicated. We sometimes hear questions from executives like “How many resources and how much time do you need to get rid of all the bad stuff?” But the reality is that it’s fundamentally impossible to get rid of the bad stuff altogether.
It’s impossible to ensure that online content is 100% safe 100% of the time. The only way to guarantee the internet is 100% “safe” is to do away with user-generated content and take away the ability for anyone to freely sign up and post online.
I’ll give you some examples to show how complicated this problem is.
We often hear that we should take content down automatically if it’s considered harmful or hurtful by certain people. But if you’re removing content from the internet, you need to have firm policies in place about what constitutes bad behavior online.
Of course, you want to remove content that’s hateful or harassing. But where do you draw the line when it comes to sexualized content? The answer probably varies based on religion, local regulations, and cultural norms. In fact, there are probably vastly different sentiments about what’s appropriate, even among the people in leadership at a company. So it’s really important to have a firm policy in place for when issues like this arise.
Then there’s the issue of how much content you actually want to remove. When you try to pull down as much inappropriate content as possible, you often end up removing a lot of good content and impacting “good” users. Suppose instead that you want to impact as few good users and as little good content as possible. As your definition of bad content becomes narrower, more of the bad content will stay online.
Finding all the “bad” use cases and simply eliminating them is very difficult to do, which is something we’re seeing play out with ChatGPT today. If a user queries ChatGPT about a derogatory term or something harmful, like “death,” for instance, there are a few things you can do. You can make sure it doesn’t return results, or, if it does, that it returns results with a disclaimer.
But this leads you to another problem, one of over-regulating and over-filtering content. There are a lot of very good questions you could ask about a derogatory term in order to understand it better.
I remember a moment when people were upset that Google would sometimes suggest unflattering words or phrases when people’s names were googled. One recommendation was to have Google simply stop suggesting any additional phrases alongside people’s names. This wasn’t enacted because it would have gotten rid of so much good. For instance, if you’re an author, Google would no longer populate the name of your books when people searched for you. You have to decide what you’re giving up by introducing these broad, sweeping generalizations.
That’s a very good point. A lot of this comes down to who is making these decisions. I certainly don’t envy the product teams and company leaders who have to make these very, very important decisions.
Do you have any specific examples of how hard it is to balance precision and recall?
This is a big issue when you’re dealing with misinformation, both in terms of how you define it and how you deal with its presence on your platform. In an ideal world, you would only take down misinformation when you’ve had a fact-checker perform an in-depth analysis to show that the information is false. But there’s a lot of information online that can't be fact-checked in a reliable way, and fact-checking is generally very slow.
Once you’ve established what constitutes misinformation, you have the secondary issue of how you deal with it. There are many enforcement actions you can take. Obviously, you can remove the piece of content. But you can also leave the content up and decide not to promote it, monetize it, or show ads next to it. However, filtering content like this is incredibly difficult to do, especially in cases where a publisher has thousands of pages but maybe only two or three pages are violating guidelines. So what do you do? Do you ban the publisher? Do you make it so the publisher can’t monetize?
These were the types of decisions we sometimes had to make on the spur of the moment at Google. It’s very frustrating when executives don’t have clear rules on this stuff because, at the end of the day, it comes down to making judgment calls.
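Tom’s point about the tension between removing bad content and sparing good users is, at its heart, a precision/recall tradeoff. Here’s a toy sketch with invented numbers (not data from Google or TrustLab) showing how lowering a removal threshold catches more bad content but also sweeps up more good content:

```python
# Toy illustration (synthetic scores, not real data) of the moderation
# tradeoff: a lower removal threshold raises recall (more bad content caught)
# but lowers precision (more good content removed along with it).

# Each item: (classifier harm score, whether the content is actually bad)
items = [
    (0.95, True), (0.90, True), (0.80, False), (0.75, True),
    (0.60, False), (0.55, True), (0.40, False), (0.30, True),
    (0.20, False), (0.10, False),
]

def moderation_stats(threshold):
    removed = [(score, bad) for score, bad in items if score >= threshold]
    caught = sum(1 for _, bad in removed if bad)
    total_bad = sum(1 for _, bad in items if bad)
    precision = caught / len(removed) if removed else 1.0
    recall = caught / total_bad
    return precision, recall

for threshold in (0.85, 0.50, 0.25):
    p, r = moderation_stats(threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```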
How did you use AI to help you process content when you were at Google?
When I first started at Google, we used manual rules, which worked well for a short time. At a certain point, there was so much content that it overwhelmed the human ability to curate it and apply the rules we had established at scale; there was just no way to solve these issues manually. This was when we introduced machine learning, which went on to become a constant part of our efforts to achieve coverage at scale.
AI and ML are the most effective ways to address problems of trust and safety. At Google, it would be impossible to have humans address these problems as quickly as they would need to in order to keep up with the pace of content generation.
You’re continuing to use AI at TrustLab to monitor and evaluate online content. Can you tell me more about how this works at TrustLab?
Our mission is to create a safer internet for humanity. To do so, we work with technology companies, regulatory agencies, advocacy groups, and academic institutions to help improve the understanding of bad content and bad actors on the open web. Our technical means of doing this is primarily through algorithms.
We deal with a number of different types of harmful content involving terrorism, violence, extreme speech, and misinformation across multiple modalities. Most social media platforms have a function where users can flag content that’s potentially harmful or false. We review this flagged content for social media companies.
The first thing we do is break the content down by separating it into individual claims. Then, we evaluate each claim separately to examine how it might represent misinformation or harmful content. We’re able to check this effectively because we’ve collected a lot of open web information about fact checks from reputable government resources.
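To make the claim-splitting and matching step concrete, here is a minimal sketch of one plausible implementation. It is not TrustLab’s actual pipeline: the fact-check entries are invented, and spaCy and sentence-transformers are used simply as stand-ins for sentence splitting and semantic matching.

```python
# Minimal sketch of claim-level fact-check matching (illustrative only; not
# TrustLab's pipeline). A post is split into individual claims, and each claim
# is matched against a small, invented database of fact-checked claims.
import spacy
from sentence_transformers import SentenceTransformer, util

nlp = spacy.load("en_core_web_sm")                   # sentence splitter
encoder = SentenceTransformer("all-MiniLM-L6-v2")    # embedding model

# Hypothetical database of previously fact-checked claims and their verdicts.
fact_checks = {
    "Drinking bleach cures viral infections.": "false",
    "Seat belts reduce the risk of death in car crashes.": "true",
}
fc_claims = list(fact_checks)
fc_embeddings = encoder.encode(fc_claims, convert_to_tensor=True)

def review(post_text, match_threshold=0.75):
    """Split a post into claims and match each one against known fact checks."""
    results = []
    for sentence in nlp(post_text).sents:
        claim = sentence.text.strip()
        claim_emb = encoder.encode(claim, convert_to_tensor=True)
        scores = util.cos_sim(claim_emb, fc_embeddings)[0]
        best = int(scores.argmax())
        if float(scores[best]) >= match_threshold:
            results.append((claim, fact_checks[fc_claims[best]]))
        else:
            results.append((claim, "unmatched"))  # hand off to other models
    return results
```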
To go back to your example, you’ve got a piece of content and you match it to your database. If it matches, it’s simple. If it doesn't, what do you do with it?
In cases where it doesn’t match, we might run it against other models. This is where having a mix of open-source AI models and generative models is really useful. Having our own models means we can catch the hallucinations that generative AI models are prone to.
We’ve found that these models are completely wrong about 10% of the time, so having classifiers that are built on our own technology allows us to check if these generative AI model scores are right.
Typically, we’ll run content through both a third-party model and our own models in order to compare results and maintain better coverage. While there’s still a lot of gray area, we’re getting better precision and recall than many other approaches out there, including human discernment.
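In practice, the “run it through both and compare” step can be as simple as scoring the same item with two independent models and escalating when they disagree. The sketch below illustrates that idea; it is not TrustLab’s implementation, and both scoring functions are invented stand-ins.

```python
# Illustrative ensemble check (not TrustLab's code): a third-party model and an
# in-house classifier each score the content, and large disagreements are
# escalated to a human rather than trusted blindly.

def third_party_score(text: str) -> float:
    # Stand-in for a commercial or generative AI model's harm score (0-1).
    return 0.95 if "example slur" in text.lower() else 0.05

def in_house_score(text: str) -> float:
    # Stand-in for a proprietary classifier's harm score (0-1).
    return 0.90 if "example slur" in text.lower() else 0.10

def moderate(text: str, remove_at: float = 0.85, disagreement: float = 0.40) -> str:
    a, b = third_party_score(text), in_house_score(text)
    if abs(a - b) >= disagreement:
        return "escalate_to_human"  # models disagree; likely an edge case or hallucination
    combined = (a + b) / 2
    return "remove" if combined >= remove_at else "keep"
```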
That I can believe. Some people are gullible and believe everything while others are cynical and don’t believe anything. So it seems that you’d get a more balanced view using AI. Have you actually compared the models to humans?
To be honest, this data is pretty convoluted. One of the problems with establishing precision and recall for the models is that we need humans to assess the models and their output, but human judgment is also highly fallible. However, we typically find that the models are as good as human review across the vast majority of verticals.
Can you tell me more about the models you built at TrustLab? How do they actually work?
We built our own models on proprietary algorithms and data. These models rely on cloud-based public infrastructure, which means they’re very scalable and can handle large query volumes (QPS) as needed. We also use human moderators and publicly available data. We built a handful of proprietary models that handle especially harmful content because, frankly, there aren’t any on the market that are good enough to handle this type of content.
We also use our own models to handle misinformation detection because this is another area where generative AI models haven’t been particularly useful. When it comes to objectionable content that’s been around for a while, like adult content, hate speech, harassment or violence, we typically use generative AI models because these categories are well-established.
Additionally, we use commercially available models that have been around for a long time, including models from Google, Amazon, and Azure, as well as some of the newer generative AI models like GPT-3.5. In combination with this, we use end-to-end classifier providers for trust and safety like Hive AI, which offers harmful content classifiers in different categories.
With new generative AI content that’s changing the face of the internet, it seems like the amount of content creation online is going to ramp up exponentially. At TrustLab, do you approach gen-AI content the same way you’ve approached user-generated content in the past?
Gen-AI content represents a sea change for the trust and safety industry, and it cuts both ways. Obviously, bad actors can use these technologies to create harmful content more cheaply and at a larger volume than ever before, a problem that could degrade the user experience online.
But this same technology gives us the ability to identify harmful content more efficiently and with greater confidence without relying on humans to the same degree we do now.
The question here is whether generative AI is good or bad for the quality of the web overall. I strongly believe that it’s a net good. Already, there’s a lot of work being done on these models; OpenAI, for instance, is a great example of this. Some of these models allow you to identify harmful content in specific, predetermined categories.
This means that, through techniques like prompt engineering, you can tune these models and create custom classifiers. This is very exciting and provides a lot of upside. In the implementations we’ve seen so far, companies have found right off the bat that 10-30% of content that was formerly labeled manually can now be handled automatically with similar precision and recall.
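As an illustration of what a prompt-engineered custom classifier can look like, here is a brief sketch. The policy text and categories are invented, and `call_llm` is a placeholder for whichever model endpoint is actually used, so treat this as a shape rather than a recipe.

```python
# Sketch of a prompt-engineered classifier: a general-purpose LLM is steered
# into one narrow, predetermined policy category via the prompt. The policy
# wording is invented and `call_llm` is a placeholder, not a real API.

POLICY_PROMPT = """You are a content moderation classifier.
Policy: flag content that harasses or threatens a private individual.
Answer with exactly one word: VIOLATING or OK.

Content:
{content}
"""

def call_llm(prompt: str) -> str:
    """Placeholder: send the prompt to your model provider and return its reply."""
    raise NotImplementedError

def classify(content: str) -> str:
    reply = call_llm(POLICY_PROMPT.format(content=content)).strip().upper()
    return "violating" if reply.startswith("VIOLATING") else "ok"
```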
So the reason you’re an optimist about generative AI is because you feel that it’s helping more than it’s hurting?
Yes. This is still very much a cat-and-mouse game, but this technology gives us much better tools. Historically, trust and safety work has relied primarily on human review. Companies like Google and Meta are outliers here, but for the most part, the industry has handled most trust and safety issues manually. It’s an exciting but challenging time, and ultimately, this new technology really helps the entire industry.