Building AI from the Ground Up—Legally
So, here’s the line we’ve been hearing from major AI companies for a while now: “It’s basically impossible to train large language models without using copyrighted material.” That’s literally the defense they keep throwing out: everyone does it, and licensing everything would be way too hard and expensive.
But a group of researchers just said: challenge accepted.
According to a new Washington Post piece, a team from EleutherAI (along with folks from MIT, Carnegie Mellon, and the University of Toronto) decided to see what would happen if they built a massive dataset—without pulling in any copyrighted content. Just public domain and openly licensed text. The result? An 8-terabyte dataset they used to train a new model called Comma v0.1, and it actually held its own against big names like Meta’s Llama 2-7B.
Here’s where it gets real: building the dataset was, in their words, “a painstaking, arduous process.” Not only did they have to dig up clean, legal data, but they also had to manually check licenses and scrub weird formatting issues. “It was not something you could just automate,” one researcher admitted. It kind of makes you realize why most companies haven’t even tried this route.
Still, the experiment worked. And that alone blows a hole in the excuse that ethical AI training isn’t possible. As the article puts it, the project proves that “you can build a high-performing AI model without hoovering up the internet.”
Meanwhile, on the legal side, things are heating up. Reddit is suing AI company Anthropic for allegedly scraping user comments to train its chatbot, and lawmakers are starting to pay attention. There’s even talk in Congress about putting guardrails around how these models get trained.
Bottom line: this study doesn’t magically settle the AI ethics debate, but it definitely raises the bar. If these researchers could pull it off (struggle and all), shouldn’t the companies with billions in funding be trying a little harder, too?
What do you think—should ethical AI be the expectation now, not the exception? Or is the whole system still too far gone?
More info here: https://www.washingtonpost.com/politics/2025/06/05/tech-brief-ai-copyright-report/