LLMs ‘Would Not Exist’ Without Reddit Data

Reddit CEO Steve Huffman said large language models “wouldn’t exist as we know them” without Reddit’s content. He called the platform’s user-generated data “modern oil” for AI.

Huffman made the comments during an interview at Fast Company’s Most Innovative Companies Summit.

What Huffman Said About Reddit’s Value To AI

Huffman described the place Reddit’s data holds in the AI ecosystem.

Huffman said:

“LLMs wouldn’t exist as we know them without Reddit. Reddit is one of the single largest sources of training data for the LLMs and Reddit continues to be one of the leading sources of both training data and we’re also the most cited, the most cited platform across all models.”

He attributed the citation claim to Profound, a firm that tracks AI citation data.

Huffman explained why AI companies depend on the content.

“There’s no artificial intelligence without actual intelligence. At the end of the day, these models are fairly simple. They’re regurgitating on an absolutely massive scale what they’ve consumed elsewhere and a large portion of that consumption is actually just the human conversation on Reddit because it’s natural and it covers basically every topic imaginable.”

Deals For Some, Lawsuits For Others

Reddit announced data licensing agreements with Google and OpenAI in 2024. Huffman referenced these as Reddit’s original two AI data deals and didn’t announce any additional agreements.

“Since we did the original two deals with Google and OpenAI, that was over two years ago, so we’ve learned a lot. They’ve learned a lot. The whole world’s learned a lot. Specifically how valuable Reddit’s data is and how useful it is. And so we’re being I think very deliberate and selective there. But yeah, we’re open and open for business.”

For companies that haven’t agreed to licensing terms, Reddit has taken legal action. The company sued Anthropic in California Superior Court, alleging unauthorized use of Reddit content and violations of Reddit’s terms. Reddit also filed a federal lawsuit against Perplexity in the Southern District of New York, along with three data-scraping companies, alleging DMCA anti-circumvention violations and related claims.

Huffman drew a line between the two groups.

“Companies like Google and OpenAI where we had good relationships, we can actually do a deal and put some guard rails on use and access to our data on behalf of our users but then collaborate on making products for the next generation of the internet.”

He added that “not every company is willing to be a collaborative partner and so unfortunately we have to go the other way which is lawsuits.”

Huffman told the audience Reddit’s position on commercial use is simple. “Commercial use of our data requires commercial terms,” he said. Reddit began charging for commercial API access in 2023, a move that preceded the current licensing deals.

Huffman said Reddit still provides free data access to researchers and universities and tries to remain flexible for non-commercial use.

What Changed Reddit’s Openness

According to Huffman, Reddit’s willingness to share data freely changed when the AI industry moved away from open research. As SEJ previously reported, Reddit restricted access for many search engine crawlers while Google remained an exception.

“Historically, Reddit has been like we’re born of the open internet and Reddit has been open and very permissive for access to its data. And honestly, I think we’d be in a different place today if the AI companies were still basically open and open source and doing open research.”

Huffman said the issue was that Reddit could no longer track how its data was being used. “People are using our data and we don’t know what it was being used for,” he told the audience.

Beyond commercial terms, Huffman said Reddit wants to prevent its data from being used to identify users, target them with ads, or replace or disintermediate the platform.

Reddit’s Own AI Efforts

Huffman acknowledged what he called a “paradox.” Reddit’s content powers external AI systems, but the company also uses AI across its platform.

The most visible product is Reddit Answers, an LLM-powered search feature. It reads posts and comments, then organizes them into responses built from verbatim user quotes. Huffman noted it’s designed for questions without definitive answers.

“What Reddit Answers does is a couple things that are unique to Reddit. One, it basically only answers in verbatim quotes from actual people. And then the second thing it does is it tries to present multiple perspectives because the whole point if you’re on Reddit, you want the human perspective.”

Behind the scenes, Reddit uses AI for content moderation and classification. LLMs can evaluate whether a comment crosses into bullying, something Huffman described as previously difficult because of the subjectivity involved.

Huffman presented AI moderation as a way to reduce exposure to the worst content, not as a replacement for Reddit’s community moderation model.

“The worst job on the internet was looking at the worst content on the internet and deciding whether it could be online or not,” Huffman said. “That job just goes away.”

The Gray Area Of AI-Written Posts

Huffman also addressed the issue of users writing content with AI tools and pasting it into Reddit. That’s different from automated bot activity, he stressed.

“The most annoying thing that I see not just on Reddit, but all over the internet is somebody who wrote their post or comment with ChatGPT and then pasted it into Reddit. Like, is that a bot? Really looks like a bot, but there’s a human behind the thought.”

Huffman cast the issue as one of intent. “It’s very important to us that there’s a human behind the thought, behind the content, behind the prompt,” Huffman said. But he also noted that “the writing sucks” when users rely on AI to compose their posts.

Rather than creating a policy to address it, Huffman indicated Reddit will let its community handle the issue. Users are already downvoting AI-written content and calling it out in comments. Huffman said Reddit will “empower the users more and the subreddits more to just reject that kind of content altogether.”

He compared the broader question to calculators in math class. “Kids these days are just learning how to write with AI. What are we going to do about it?” he said. “We kind of have to learn, I think, along with everybody else.”

Why This Matters

Huffman’s comments reinforce Reddit’s pitch that its user discussions are a core input for AI systems.

The AI-written content problem Huffman described is one SEJ covered as part of a broader YouTube AI slop investigation. Reddit’s decision to let community voting handle AI-generated posts, rather than building detection tools, is a different path from that of platforms that have deployed automated labeling.

Looking Ahead

Huffman told Fast Company that Reddit is “out there talking to people all the time” about new data deals, though he didn’t hint at a third agreement.

Reddit’s lawsuits against Anthropic and Perplexity are both ongoing. The Anthropic case was the subject of a federal court remand hearing in March.