California Governor Gavin Newsom vetoed the state's proposed Safe and Secure Innovation for Frontier Artificial Intelligence Models Act in late September, fearing it would stifle "innovation." [SB…
Despite what the tech companies say, there are absolutely techniques for identifying the sources of their data, and there are absolutely techniques for good-faith data removal upon request. I know this because I've worked on such projects at some of the less-major tech companies, the ones that make some effort to abide by European laws.
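To make that concrete, here is a minimal sketch of the kind of provenance bookkeeping I mean. Everything in it is hypothetical (the table layout, the function names; no vendor's real pipeline): the point is only that if you record a hash and a source for every document at ingestion time, a removal request becomes a lookup instead of an impossibility.

```python
# Hypothetical sketch of training-data provenance tracking.
# Nothing exotic here: ordinary bookkeeping, done at ingestion time.
import hashlib
import sqlite3

db = sqlite3.connect("provenance.db")
db.execute("""CREATE TABLE IF NOT EXISTS documents (
    content_hash TEXT PRIMARY KEY,
    source_url   TEXT NOT NULL,
    license      TEXT,
    ingested_at  TEXT DEFAULT CURRENT_TIMESTAMP,
    excluded     INTEGER DEFAULT 0   -- set when a removal request arrives
)""")

def ingest(text: str, source_url: str, license: str | None) -> str:
    """Record where a training document came from before it enters the corpus."""
    h = hashlib.sha256(text.encode("utf-8")).hexdigest()
    db.execute(
        "INSERT OR IGNORE INTO documents (content_hash, source_url, license) "
        "VALUES (?, ?, ?)",
        (h, source_url, license),
    )
    db.commit()
    return h

def handle_removal_request(source_url: str) -> int:
    """Good-faith removal: flag everything from a source so no future
    training run may use it. Returns the number of documents affected."""
    cur = db.execute(
        "UPDATE documents SET excluded = 1 WHERE source_url = ?", (source_url,)
    )
    db.commit()
    return cur.rowcount
```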
The trick is, it costs money, and the economics shift such that one must eventually begin to do things like audit and curate. The shape and size of your business, plus how you address your markets, gains nuance that doesn't work when your entire business model is the smooth, mindless amortizing of other people's data.
But I don't envy these tech companies, or the increasingly absurd stories they must tell to hide the truth. A handsome sword hangs above their heads.
The other reason they don't do it is that many models are trained on a large corpus of pirated texts, and documenting this would be a confession.
Not just in an "I scraped the New York Times without permission" kind of way, but in an "I illegally downloaded a torrent containing the bestsellers of the last 30 years" kind of way.
Exactly. It's not that they can't, or that it's too expensive; it's that doing so will reveal their crimes.
In a sense, to me, it is the same thing. If your business is built upon repurposing everyone else's inputs indiscriminately, to your benefit and their detriment, it is too expensive to reveal that simple truth.
Bestsellers? There used to be torrents of basically all releases. My provider blocks torrent sites and I don't use a VPN, so I'm not sure if people still do this, but it used to be possible to download basically every book (in English) released in a given period, all at once.
Occasionally I see this for music (weekly new tracks).
Wouldn't removing the data's effect on the model require retraining from scratch? A bit too late for all the open-source ones out there.
That's a good question, because there is nuance here! It's interesting because I ran into this same issue while working on similar projects.

First off, it's important to understand what your obligation actually is, and what "data deletion" can mean. No one believes it is necessary to permanently remove all copies of anything, any more than it is necessary to prevent all forms of plagiarism. No one is complaining that it is possible to plagiarize at all; we're complaining that major institutions keep doing it, in ongoing disregard of the law.
Only maximalists fall into the trap of thinking of the world in a binary sense: either all in or nothing at all.
For most of us, it's about economics and risk profiles. Open-source models get trained continuously over time; there won't be one version. Saying that open-source operators have some obligation to curate future training, in good faith, to comply has a long-tail impact on how those models evolve. Previously ingested PII or plagiarized data might still exist, but its value, novelty, and relevance to economic life drop sharply over time. No artist or writer argues that copyright protections need to exist forever; they just need survivable working conditions, and respect for attribution. The same goes for PII: no one claims they must be completely anonymous. They just want cybercrime taken seriously rather than abandoned in favor of one party taking the spoils of their personhood.
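As a concrete sketch of what "curating future training" can mean in practice (hypothetical names again, not anyone's real pipeline): every new training round filters its candidate corpus against the accumulated removal requests, so deleted material stops shaping the model from that point on, even though nobody pretends to surgically excise old checkpoints.

```python
import hashlib

# Accumulated removal requests, keyed by content hash; in practice this
# would come from a provenance store like the one sketched earlier.
EXCLUDED: set[str] = {hashlib.sha256(b"a best-selling novel").hexdigest()}

def corpus_for_next_round(candidates: list[str]) -> list[str]:
    """Assemble the next training corpus, honoring removal requests.
    Old checkpoints keep their residue; future evolution does not."""
    return [
        text for text in candidates
        if hashlib.sha256(text.encode("utf-8")).hexdigest() not in EXCLUDED
    ]

docs = ["a best-selling novel", "a permissively licensed manual"]
print(corpus_for_next_round(docs))  # -> ['a permissively licensed manual']
```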
Also, yes, there are algorithms that can control how further learning promotes or demotes growth and connections relative to various policies. The point is not that any one policy is perfect; what matters is a willingness to adopt policies in good faith. (Most such LLM filters are intentionally weak, so that those with money paying for API access can outright ignore them, while the vendors turn around and claim it can't be solved, too bad, so sad.)
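A toy illustration of one such mechanism (my sketch, not any lab's actual method): weight each example's loss during continued training by a policy score, so flagged content is demoted instead of reinforced.

```python
import torch
import torch.nn.functional as F

def policy_weight(meta: dict) -> float:
    """Hypothetical policy hook: 0.0 demotes an example entirely,
    1.0 trains on it normally. Could encode license or PII flags."""
    return 0.0 if meta.get("flagged") else 1.0

def continued_training_step(model, optimizer, inputs, targets, metas):
    """One continued-training step where the policy decides what gets reinforced."""
    per_example = F.cross_entropy(model(inputs), targets, reduction="none")
    weights = torch.tensor([policy_weight(m) for m in metas])
    loss = (weights * per_example).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy demo: the flagged example contributes no gradient at all.
model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(2, 4), torch.tensor([0, 1])
continued_training_step(model, opt, x, y, [{"flagged": True}, {}])
```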
Yes. It is possible to perturb and influence the evolution of a continuously trained neural network based on external policy, and they're carefully lying through omission when they say they can't 100% control it or 100% remove things. Fine. That's not necessary, in neither copyright nor privacy law. It never has been.
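The machine-unlearning literature has a whole family of techniques along these lines. Here is a toy version of one well-known idea, gradient ascent on a forget set (a sketch under my own assumptions, not a production method): push the loss up on the material to forget while holding it down on a retain set, degrading recall of the targeted data without retraining from scratch. It won't be "100% removal", which, again, the law has never demanded.

```python
import torch
import torch.nn.functional as F

def unlearning_step(model, optimizer, forget, retain, alpha=0.5):
    """Toy unlearning step: gradient *ascent* on the forget set degrades
    recall of targeted data; gradient descent on a retain set preserves
    general capability. Imperfect by design."""
    fx, fy = forget
    rx, ry = retain
    loss = F.cross_entropy(model(rx), ry) - alpha * F.cross_entropy(model(fx), fy)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy demo on a linear probe; the same idea is applied at scale.
model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.05)
forget = (torch.randn(8, 4), torch.randint(0, 2, (8,)))
retain = (torch.randn(8, 4), torch.randint(0, 2, (8,)))
for _ in range(10):
    unlearning_step(model, opt, forget, retain)
```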
Are you sure that meets the letter of the law? GDPR would say "fuck that version of nuance, fix it." Microsoft now tries filtering on Bing Copilot in Germany, with variable results. What does the relevant California law say and mean?
I am not a lawyer. But you wouldn't be surprised to hear that I don't have the inside story of Bing in Germany. It could be that Microsoft either doesn't want to do it well, or hasn't yet done it well enough; I'm not promising either in particular, but it can be done.

Generally, as an engineer, you have a pile of options with trade-offs. You absolutely can build nuanced solutions, and often the law and the lawyers live in nuanced realities. That is the reality of even the best sorts of tech companies, the ones that are trying.
My contention is that maximalism or strict binary assumptions won't work on either end and don't satisfy what anyone truly wants or needs. If we're not careful about what it takes to move the needle, we agree with them by saying "it can't be done, so it won't be done."
What's truly lovely about GDPR is that it is maximalist, strict, and binary. For any "but…" from a corporation, the GDPR answer is "fucks given: 0, this is YOUR problem, comply or perish."
Which makes it so baffling every time a techbro fails to understand it or claims "GDPR doesn't apply to me." Just don't fuck around with PII and don't collect any without explicit permission from the user! How is this difficult?!
I'm referring specifically to this, where they could only put in a shaky bodge.
When you don't know an example, consider looking it up and not just waffling anyway.