I have been posting to this blog to share my thoughts with a small community of privacy professionals.
So I was a bit surprised when Blogger’s statistics showed that my posts get around 10,000 views, because the privacy expert community is smaller than that.
But how many of those views were bots, in particular AI training bots? Blogger doesn’t give me those statistics.
We all know that AI models are trained on data. Big models, like large language models, are trained on vast amounts of data. In fact, they’re being trained on essentially all the data their builders can get hold of. So, given their hunger for data, in particular for human-generated content, I’m not surprised that they visit my little blog too.
There’s a raging debate about whether AI training bots should be allowed to use other people’s data to train their models. Many voices claim that AI bots shouldn’t be allowed to train on other people’s data if that data is either “personal data” or under copyright. I think they’re wrong.
I think the key distinction is public versus private data. If I make my data public, as I do with this blog, then I should expect (and probably want) it to be read by anyone who wants to read it, human or bot. After all, search engine crawlers have been crawling public data for decades, and almost no one seems to object. If AI training bots are reading my blog, say, to learn about human language, or about privacy, I’m delighted.
On the other hand, private data is private. If I use an email service, I expect that data to be private, as it’s filled with my highly personal and sensitive information. If I use a social networking service, and I set the content I upload to “private”, I expect the platform to respect that choice, including against its own or third-party bots. Failure to respect these privacy choices is a serious privacy breach, maybe even a crime, unless the owner of the data has consented to allowing their data to be used for AI training. (It’s a different discussion whether “consent” can be deduced from some updated clause in some terms of use.)
Thousands of training bots are looking for more data, especially human-generated data. If you make your data public, then realize that the bots will come and read it. You can’t really stop them. And I think that’s fine.
The real issue is what the AI models will do after training on your data. If they’re learning human language (as large language models do), that’s not going to have any impact on your real-world privacy. But if they’re reading your data to impersonate you, to copy your voice or image or your copyrighted content, then you have every reason to object and use the legal resources available. I think it’s fine when bots read public data for training. The real question, and one that is vastly harder to evaluate, is what their trained models should be allowed to do with it afterwards.