If you've ever posted to Reddit there's a good chance you're helping train the next generation of AI models with your own words, pictures, and memes, because the company's selling access to its 20 years' worth of content for a reported $60 million a year. I mean, chances are you've already been used to train AIs given that Reddit's already featured pretty heavily in the training data for a bunch of different large language models (LLMs) and image generators, but at least now someone's getting paid for it.
Generative AI models, such as ChatGPT and Stable Diffusion, need to be trained on databases comprising hundreds of millions of images, books, video clips, music, and so on. Sometimes, the source is publicly available and open to use by anyone, and sometimes AI companies simply 'borrow' what's just lying around on the web. But there's seldom any money handed over between the two bodies. Not so with Reddit, as it seems that it's entered into a deal where for a healthy lump of cash each year, an AI model can use the site's content for training.
That's according to a report by Bloomberg, which says that the deal is worth $60 million per year. In the tech world, where transactions run into the billions of dollars, that might not sound like very much but it's pretty much unheard of in AI training. There's no indication as to who the other party in the deal is but it's unlikely to be a small start-up firm in somebody's back bedroom.
Reddit hosts almost 20 years of posted content on its servers, so whoever the AI firm is, it's picked up a veritable bargain. OpenAI, the developer of ChatGPT, has reportedly been entering licensing agreements with multiple media companies and publishers, which doesn't seem all that different from Reddit's deal.
However, such publishers typically pay for content creators' work, or at the very least, directly employ people to make the material that OpenAI wants to use. Reddit, on the other hand, does no such thing, though the site itself is completely free to use. There's no such thing as a free lunch, of course, and Reddit generates revenue through advertising and paid user features.
Assuming the deal goes ahead (and I can't see any reason why it won't), then I do not doubt that there will be another user backlash of some kind, similar to that seen when Reddit changed its API fees. The effects of that response disappeared over time, though, and the site is pretty much back as it was before numerous sections went dark.
While there will be lots of initial noise, the result will be that Reddit looks and works just as it does now. No user is going to be aware that their posts are being actively scraped and used in model training. So this is all just a bit of fuss about nothing, yes?
It might not be if you've ever used Reddit to show off your writing skills, pieces of art, or music. You might think that content is yours and protected under copyright laws, but it all gets very murky when it comes to training for generative AI. You do all the hard work but someone else is getting all the benefits of it and, more importantly, neither acknowledging nor compensating you for it in any way.
All of this will certainly mark the start of a flood of deals between AI companies and other social media sites. I suspect the largest ones already scrape content for training and hide the details of this in the minutiae of their enormous end-user agreements. But maybe it's time to pay a lot more attention to what and where you post your creative output, especially if you're hoping to make a career out of it.