A Developer Accidentally Found CSAM in AI Data. Google Banned Him For It

themachinestops@lemmy.dbzer0.com · 1 month ago

Devial@discuss.online · edit-2 1 month ago

The article headline is wildly misleading, bordering on being just a straight up lie.

Google didn’t ban the developer for reporting the material, they didn’t even know he reported it, because he did so anonymously, and to a child protection org, not Google.

Google’s automatic tools, correctly, flagged the CSAM when he unzipped the data and subsequently nuked his account.

Google’s only failure here was to not unban on his first or second appeal. And whilst that is absolutely a big failure on Google’s part, I find it very understandable that the appeals team generally speaking won’t accept “I didn’t know the folder I uploaded contained CSAM” as a valid ban appeal reason.

It’s also kind of insane how this article somehow makes a bigger deal out of this developer being temporarily banned by Google than it does of the fact that hundreds of CSAM images were freely available online and openly sharable by anyone, and to anyone, for god knows how long.

forkDestroyer@infosec.pub · 1 month ago

I’m being a bit extra but…

Your statement:

The article headline is wildly misleading, bordering on being just a straight up lie.

The article headline:

A Developer Accidentally Found CSAM in AI Data. Google Banned Him For It

The general story in reference to the headline:

He found csam in a known AI dataset, a dataset which he stored in his account.
Google banned him for having this data in his account.
The article mentions that he tripped the automated monitoring tools.

The article headline is accurate if you interpret it as

“A Developer Accidentally Found CSAM in AI Data. Google Banned Him For It” (“it” being “csam”).

The article headline is inaccurate if you interpret it as

“A Developer Accidentally Found CSAM in AI Data. Google Banned Him For It” (“it” being “reporting csam”).

I read it as the former, because the action of reporting isn’t listed in the headline at all.

Blubber28@lemmy.world · 1 month ago

This is correct. However, many websites/newspapers/magazines/etc. love to get more clicks with sensational headlines that are technically true, but can be easily interpreted as something much more sinister/exciting. This headline is a great example of it. While you interpreted it correctly, or claim to at least, there will be many people that initially interpret it the second way you described. Me among them, admittedly. And the people deciding on the headlines are very much aware of that. Therefore, the headline can absolutely be deemed misleading, for while it is absolutely a correct statement, there are less ambiguous ways to phrase it.

MangoCats@feddit.it · 1 month ago

can be easily interpreted as something…

This is pretty much the art of sensational journalism, popular song lyric writing and every other “writing for the masses” job out there.

Factual / accurate journalism? More noble, but less compensated.

obsoleteacct@lemmy.zip · 1 month ago

It is a terrible headline. It can be debated whether it’s intentionally misleading, but if the debate is even possible then the writing is awful.

MangoCats@feddit.it · 1 month ago

if the debate is even possible then the writing is awful.

Awfully well compensated in terms of advertising views as compared with “good” writing.

Capitalism in the “free content market” at work.

MangoCats@feddit.it · 1 month ago

Google’s only failure here was to not unban on his first or second appeal.

My experience of Google and the unban process is: it doesn’t exist, never works, doesn’t even escalate to a human evaluator in a 3rd world sweatshop - the algorithm simply ignores appeals inscrutably.

katy ✨@piefed.blahaj.zone · 1 month ago

so they got mad because he reported it to an agency that actually fights csam instead of them so they can sweep it under the rug?

Devial@discuss.online · edit-2 1 month ago

They didn’t get mad, they didn’t even know THAT he reported it, and they have no reason or incentive to sweep it under the rug, because they have no connection to the data set. Did you even read my comment?

I hate Alphabet as much as the next person, but this feels like you’re just trying to find any excuse to hate on them, even if it’s basically a made up reason.

katy ✨@piefed.blahaj.zone · 1 month ago

they obviously did if they banned him for it; and if they’re training on csam and refuse to do anything about it then yeah they have a connection to it.

Devial@discuss.online · edit-2 1 month ago

Also, the data set wasn’t hosted, created, or explicitly used by Google in any way.

It was a common data set used in various academic papers on training nudity detectors.

Did you seriously just read the headline, guess what happened, and are now arguing based on that guess that I, who actually read the article, am wrong about its content? Because that’s sure what it feels like reading your comments…

Devial@discuss.online · edit-2 1 month ago

So you didn’t read my comment then, did you?

He got banned because Google’s automated monitoring system, entirely correctly, detected that the content he unzipped contained CSAM. It wasn’t even a manual decision to ban him.

His ban had literally nothing whatsoever to do with the fact that the CSAM was part of an AI training data set.

MangoCats@feddit.it · 1 month ago

Google doesn’t ban for hate or feels, they ban by algorithm. The algorithms address legal responsibilities and concerns. Are the algorithms perfect? No. Are they good? Debatable. Is it possible to replace those algorithms with “thinking human beings” that do a better job? Also debatable, from a legal standpoint they’re probably much better off arguing from a position of algorithm vs human training.

ulterno@programming.dev · edit-2 1 month ago

Another point is, the reason Google’s AI is able to identify CSAM is because it has that in its training data, flagged as such.

In that case, it would have detected the training material as ~100% match.

I don’t get, though, how it ended up being openly available; if it were properly tagged, they would probably exclude it from the open-sourced data. And now I see it would also not be viable to have an open-source, openly scrutinisable AI deployment for CSAM detection, for the same reason.

And while some governmental body got a lot of backlash for trying to implement such an AI thing on chat stuff, Google gets to do so all it wants because it’s E-Mail/GDrive and all on their servers and you can’t expect privacy.

Considering how many such stories of people having problems due to this system are coming up, is there any statistic of legitimate catches using this model? I suspect not, because why would anyone use Google services for this kind of stuff?

arararagi@ani.social · 1 month ago

You would think, but none of these companies actually make their own dataset, they buy from third parties.

ulterno@programming.dev · 1 month ago

I am not sure which point you are replying to.
Could you please specify?

ayyy@sh.itjust.works · 1 month ago

The article headline is wildly misleading, bordering on being just a straight up lie.

A 404Media headline? The place exclusively staffed by former BuzzFeed/Cracked employees? Noooo, couldn’t be.

Cybersteel@lemmy.world · 1 month ago

We need to block access to the web for certain known actors and tie IP addresses to IDs, names, and passport numbers. For the children.

kylian0087@lemmy.dbzer0.com · 1 month ago

Oh hell no. That’s a privacy nightmare, bound to be abused like hell.

Also, what you describe wouldn’t work at all.

Cybersteel@lemmy.world · 1 month ago

In the current digitized world, trivial information is accumulating every second; preserved in all its triteness, never fading, always accessible; rumors of petty issues, misinterpretations, slander.

All junk data preserved in an unfiltered state, growing at an alarming rate, it will only slow down social progress.

The digital society furthers human flaws and selectively rewards development of convenient half-truths. Just look at the strange juxtaposition of morality around us. Billions spent on new weapons to humanely murder other humans. Rights of criminals are given more respect than the privacy of their own victims. Although there are people in poverty, huge donations are made to protect endangered species; everyone grows up being told what to do.

“Be nice to other people.”

“But beat out the competition.”

“You’re special, believe in yourself and you will succeed”.

But it’s obvious from the start that only a few can succeed.

You exercise your right to freedom and this is the result. All the rhetoric to avoid conflict and protect each other from hurt. The untested truths spun by different interests continue to churn and accumulate in the sandbox of political correctness and value systems.

Everyone withdraws into their own small gated community, afraid of a larger forum; they stay inside their little ponds leaking whatever “truth” suits them into the growing cesspool of society at large.

The different cardinal truths neither clash nor mesh, no one is invalidated but no one is right. Not even natural selection can take place here.

The world is being engulfed in “Truth”. And this is the way the world ends. Not with a BANG, but with a…

zalgotext@sh.itjust.works · 1 month ago

Is this a fresh new copypasta, or are you just a really long-winded, elaborate troll?

tetris11@feddit.uk · 1 month ago

Also, pay me exorbitant amounts of tax-payer money to ineffectually enforce this. For the children.

youmaynotknow@lemmy.zip · 1 month ago

Fuck you, and everything you stand for.

driving_crooner@lemmy.eco.br · 1 month ago

That sounds like sarcasm to me

x0x7@lemmy.world · 1 month ago

People on Lemmy don’t understand sarcasm because they have brain damage.

asudox@lemmy.asudox.dev · 1 month ago

including you

NoForwardslashS@sopuli.xyz · 1 month ago

No need to go that far. If we just require one valid photo ID for TikTok, the children will finally be safe.

bobzer@lemmy.zip · 1 month ago

CSAM images

ATM machine

Goodlucksil@lemmy.dbzer0.com · 1 month ago

CSAM stands for “material”. Adding “image” specifies what kind of material it is.

bobzer@lemmy.zip · edit-2 1 month ago

A “material image” doesn’t make any sense. An image is material. It should be CSAI if you wanna be specific.

I don’t know why this is the second time I’ve had a discussion about CSAM being a stupid acronym on Lemmy, but it’s also the only place I’ve ever seen people use it.

Goodlucksil@lemmy.dbzer0.com · 1 month ago

Material. Type of material: Image

bobzer@lemmy.zip · 1 month ago

Why say sexual abuse material images, which is grammatically incorrect, instead of sexual abuse images, which is what you mean, and shorter?

Devial@discuss.online · 1 month ago

Which of the letters in CSAM stands for images then?

bobzer@lemmy.zip · 1 month ago

Material.

Devial@discuss.online · 1 month ago

Material can be anything. It can be images, videos, theoretically even audio recordings.

Images is a relevant and sensible distinction. And judging by the downvotes you’re collecting, the majority of people disagree with you.

MangoCats@feddit.it · 1 month ago

Material can be anything.

And, if you’re trying to authorize law enforcement to arrest and prosecute, you want the broadest definitions possible.

bobzer@lemmy.zip · 1 month ago

You’re right. It can be images, that’s exactly why saying “this man was found in possession of child abuse material images” does not make grammatical sense. It’s why CP still defines it better as we’re not arresting people for owning copies of Lolita, which you could argue is also CSAM.

the majority of people disagree with you.

The majority of people can be wrong.

Devial@discuss.online · edit-2 1 month ago

Big “Ben Shapiro ranting about renewable energies because of the first law of thermodynamics” energy right here.

And your point is literally the opposite. Lolita could be argued to be child porn, as it’s pornographic material showing (fictional/animated) children. It is objectively NOT CSAM, because it does not contain CSA, because you can’t sexually abuse a fictional animated character.

CP is also a common acronym that can mean many other things.

Porn also implies it’s a work of artistic intent, which is just wrong for CSAM.

The majority of people can be wrong.

No they can’t, not with regards to linguistics. Linguistics is a descriptive science, not a prescriptive one. Words and language, by definition, and convention of every serious linguist in the world, mean what the majority of people think them to mean. That’s how language works.

bobzer@lemmy.zip · 1 month ago

“I’m mad you’re right so let me compare you to a hateful right wing grifter and also by the way, you’re wrong because all my friends say so.”

It may shock you but a handful of Lemmy users doesn’t constitute the linguistic consensus you’re trying to inherit here.

TheJesusaurus@sh.itjust.works · 1 month ago

Why confront the glaring issues with your “revolutionary” new toy when you could just suppress information instead

Ex Nummis@lemmy.world · 1 month ago

This was about sending a message: “stfu or suffer the consequences”. Hence, subsequent people who encounter similar will think twice about reporting anything.

Devial@discuss.online · edit-2 1 month ago

Did you even read the article? The dude reported it anonymously, to a child protection org, not Google, and his account was nuked as soon as he unzipped the data, because the content was automatically flagged.

Google didn’t even know he reported this, and Google has nothing whatsoever to do with this dataset. They didn’t create it, and they don’t own or host it.

Whostosay@sh.itjust.works · 1 month ago

It seems they did react to it though

Devial@discuss.online · edit-2 1 month ago

They didn’t react to anything. The automated system (correctly) flagged and banned the account for CSAM, and as usual, the manual ban appeal sucked ass and didn’t do what it’s supposed to do. (Whilst this is obviously a very unusual case, and the ban should have been overturned on appeal right away, it does make sense that the appeals team, broadly speaking, rejects “I didn’t know this contained CSAM” as a legitimate appeal reason.) This is barely newsworthy. The real headline should be about how hundreds of CSAM images were freely available and sharable from this data set.

Whostosay@sh.itjust.works · 1 month ago

An automatic reaction is a reaction

Devial@discuss.online · edit-2 1 month ago

They reacted to the presence of CSAM. It had nothing whatsoever to do with it being contained in an AI training dataset, as the comment I originally replied to states.

AngryishHumanoid@lemmynsfw.com · edit-2 1 month ago

“Sign up for free access.”

danc4498@lemmy.world · 1 month ago

https://archive.ph/d6LEb

floquant@lemmy.dbzer0.com · edit-2 1 month ago

Nooo I was liking 404 :/ ~~Sucks to see them enshittified too…~~

edit: that was too harsh, I take it back.

StitchInTime@piefed.social · 1 month ago

I think they’ve always been like this for some of their posts, and honestly I’m considering getting a paid subscription to support them. Sucks, but they’ve been putting out quality content in exchange for your email address and some metrics - I’d call it a fair trade.

TJA!@sh.itjust.works · 1 month ago

They are doing it because of AI scrapers. But that has been the case for some time now.

Chozo@fedia.io · 1 month ago

How does this stop AI scrapers?

Axolotl@feddit.it · edit-2 1 month ago

It’s more difficult to crawl a webpage if it’s behind a login wall, so you will have fewer crawlers flooding your site.

Barbecue Cowboy@lemmy.dbzer0.com · 1 month ago

It is legitimately free after you sign up. I get their reasoning, but it is kinda annoying.

killea@lemmy.world · 1 month ago

So in a just world, google would be heavily penalized for not only allowing csam on their servers, but also for violating their own tos with a customer?

shalafi@lemmy.world · 1 month ago

We really don’t want that first part to be law.

Section 230 was enacted as part of the Communications Decency Act of 1996 and is a crucial piece of legislation that protects online service providers and users from being held liable for content created by third parties. It is often cited as a foundational law that has allowed the internet to flourish by enabling platforms to host user-generated content without the fear of legal repercussions for that content.

Though I’m not sure if that applies to scraping other servers’ content. But I wouldn’t say it’s fair to expect the scraper to review everything. If we don’t like that take, then we should outlaw scraping altogether, but I’m betting there are unwanted side effects to that.

mic_check_one_two@lemmy.dbzer0.com · 1 month ago

While I agree with Section 230 in theory, it is often only used in practice to protect megacorps. For example, many Lemmy instances started getting spammed by CSAM after the Reddit API migration. It was very clearly some angry redditors who were trying to shut down instances, to try and keep people on Reddit.

But individual server owners were legitimately concerned that they could be held liable for the CSAM existing on their servers, even if they were not the ones who uploaded it. The concern was that Section 230 would be thrown out the window if the instance owners were just lone devs and not massive megacorps.

Especially since federation caused content to be cached whenever a user scrolled past another instance’s posts. So even if they moderated their own server’s content heavily (which wasn’t even possible with the mod tools that existed at the time), there was still the risk that they’d end up caching CSAM from other instances. It led to a lot of instances moving from federation blacklists to whitelists instead. Basically, default to not federating with an instance, unless that instance owner takes the time to jump through some hoops and promises to moderate their own shit.
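
To make the blacklist-versus-whitelist difference concrete, here is a minimal sketch in Python. It is not how Lemmy actually implements federation; the `FederationPolicy` class, its field names, and the `.example` hostnames are all hypothetical and exist only to show the change in the default decision.

```python
from dataclasses import dataclass, field


@dataclass
class FederationPolicy:
    """Hypothetical policy object, not a real Lemmy API."""
    mode: str = "blocklist"                      # "blocklist" or "allowlist"
    blocked: set = field(default_factory=set)    # instances refused in blocklist mode
    allowed: set = field(default_factory=set)    # instances accepted in allowlist mode

    def accepts(self, instance: str) -> bool:
        host = instance.lower()
        if self.mode == "allowlist":
            # Default deny: only explicitly vetted instances federate.
            return host in self.allowed
        # Default allow: everything federates unless explicitly blocked.
        return host not in self.blocked


# Made-up hostnames for illustration only.
open_policy = FederationPolicy(mode="blocklist", blocked={"spam.example"})
strict_policy = FederationPolicy(mode="allowlist", allowed={"trusted.example"})

print(open_policy.accepts("unknown.example"))    # unknown instance is accepted
print(strict_policy.accepts("unknown.example"))  # unknown instance is refused
```

The only real difference is what happens to an instance nobody has looked at yet: the blocklist lets it in until someone notices a problem, the allowlist keeps it out until someone vouches for it.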

vimmiewimmie@slrpnk.net · 1 month ago

Not to create an argument, which isn’t my intent, as certainly there may be a thought such as, “scraping as it stands is good because of the simplification and ‘benefit’”. Which, sure, it’s easiest to cast a wide net and absorb, to simplify the concept, at least as I’m also understanding it.

Yet, maybe it is the process of scraping, and also absorbing into databases including AI, which is a worthwhile point of conversation. Maybe how we’ve been doing something isn’t the continued ‘best course’ for a situation.

Undeniably, more minutely monitoring what is scraped and stored creates questions and obstacles that are large in both number and scope, but maybe having that conversation is where things should go.

Thoughts?

killea@lemmy.world · 1 month ago

Oh my, yes, you are correct. That was sort of knee jerk, as opposed to it being the reporting party’s burden somehow. I simply cannot understand the legal gymnastics needed to punish your customers for this sort of thing; I’m tired but I feel like this is not exactly an uncommon occurrence. Anyways let us all learn from my mistake and do not be rash and curtail your own freedoms.

dev_null@lemmy.ml · 1 month ago

They were not only not allowing it, they immediately blocked the user’s attempt to put it on their servers and banned the user for even trying. That’s as far from allowing it as possible.

abbiistabbii@lemmy.blahaj.zone · 1 month ago

This. Literally the only reason I could guess is that it is to teach AI to recognise child porn, but if that is the case, why is Google doing it instead of, like, the FBI?

gustofwind@lemmy.world · 1 month ago

Who do you think the FBI would contract to do the work anyway 😬

Maybe not Google but it would sure be some private company. Our government doesn’t do stuff itself almost ever. It hires the private sector

alias_qr_rainmaker@lemmy.world · 1 month ago

guess i gotta get into the private sector, lmao

alias_qr_rainmaker@lemmy.world · edit-2 1 month ago

i know it’s really fucked up, but the FBI needs to train an AI on CSAM if it is to be able to identify it.

i’m trying to help, i have a script that takes control of your computer and opens the folder where all your fucked up shit is downloaded. it’s basically a pedo destroyer. they all just save everything to the downloads folder of their tor browser, so the script just takes control of their computer, opens tor, and presses cmd+j to open up downloads and then it copies the file names and all that.

will it work? dude, how the fuck am i supposed to know, i don’t even do this shit for a living

i’m trying to use steganography to embed the applescript in a png

vimmiewimmie@slrpnk.net · 1 month ago

What’s the ‘applescript’?

alias_qr_rainmaker@lemmy.world · 1 month ago

the applescript opens tor from spotlight search and presses the shortcut to open downloads

i dunno how much y’all know about applescript. it’s used to automate apps on your mac. i know y’all hate mac shit but dude, whatever, if you get osascript -e aliased to o you can run applescript easily from your terminal

alias_qr_rainmaker@lemmy.world · 1 month ago

just pass in a heredoc

forkDestroyer@infosec.pub · 1 month ago

Google isn’t the only service checking for csam. Microsoft (and other file hosting services, likely) also have methods to do this. This doesn’t mean they also host csam to detect it. I believe their checks use hash values to determine if a picture is already clocked as being in that category.

This has existed since 2009 and provides good insight on the topic, used for detecting all sorts of bad category images:

https://technologycoalition.org/news/the-tech-coalition-empowers-industry-to-combat-online-child-sexual-abuse-with-expanded-photodna-licensing/
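
As a rough illustration of the hash-lookup idea described above, here is a minimal Python sketch. It is a deliberate simplification: it uses exact SHA-256 digests from the standard library, whereas PhotoDNA is a proprietary robust (perceptual) hash designed to survive resizing and re-encoding. The function names and the placeholder byte strings are hypothetical, and nothing here reflects Google's or Microsoft's actual pipeline.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(data).hexdigest()


def build_hash_list(known_files) -> set:
    """Build a set of digests; the service stores hashes, not the files."""
    return {sha256_hex(data) for data in known_files}


def is_flagged(upload: bytes, hash_list: set) -> bool:
    """Flag an upload if its digest is already in the known-hash list."""
    return sha256_hex(upload) in hash_list


# Harmless placeholder bytes stand in for previously catalogued files.
known = build_hash_list([b"placeholder-known-file"])

print(is_flagged(b"placeholder-known-file", known))   # exact copy matches
print(is_flagged(b"placeholder-known-file!", known))  # one byte differs, no match
```

The second call shows the weakness of exact hashing: changing a single byte changes the digest completely, which is why production systems rely on perceptual hashes rather than plain cryptographic ones. The point that carries over is that matching needs only a list of hashes, so a service can detect known material without hosting it.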

frongt@lemmy.zip · 1 month ago

Google wants to be able to recognize and remove it. They don’t want the FBI all up in their business.

Allero@lemmy.today · edit-2 1 month ago

So, Google could be allowed to have the tools to collect, store, and process CSAM all over the Web without oversight?

Pretty much everyone else would get straight to jail for attempting that.

finitebanjo@lemmy.world · 1 month ago

My dumb ass sitting here confused for a solid minute thinking CSAM was in reference to a type of artillery.

pigup@lemmy.world · 1 month ago

Combined surface air munitions

llama@lemmy.zip · 1 month ago

Right, I thought it was cyber security something or other, like API keys. Now DuckDuckGo probably thinks I’m a creep.

Hozerkiller@lemmy.ca · 1 month ago

I feel that. I assumed it was something like SCCM.

hummingbird@lemmy.world · 1 month ago

It goes to show: developers should make sure they don’t make their livelihood dependent on access to Google services.

rizzothesmall@sh.itjust.works · 1 month ago

Never heard that acronym before…

TheJesusaurus@sh.itjust.works · 1 month ago

Not sure where it originates, but CSAM is the preferred term in UK policing, and therefore in most media reporting, for what might have been called “CP” on the interweb in the past. Probably because “porn” implies it’s art rather than crime, and also because it’s just a wider umbrella term.

Zikeji@programming.dev · edit-2 1 month ago

It’s also more distinct. CP has many potential definitions. CSAM only has the one I’m aware of.

yesman@lemmy.world · 1 month ago

LOL, You mean the letters C and P can stand for lots of stuff. At first I thought you meant the term “child porn” was ambiguous.

drdiddlybadger@pawb.social · 1 month ago

Weirdly people have also been intentionally diluting the term to expand it to other things which causes a number of legal issues.

QueenHawlSera@sh.itjust.works · 1 month ago

I’ve seen Cyberpunk fans claim to be into “CP”. Usually it’s an innocent mistake and they don’t catch it at first, but they are very embarrassed when they do.

Tollana1234567@lemmy.today · 1 month ago

It’s always been child porn when you’re using CP in the context of media/humans; even on a medical forum, you rarely use it to describe chickenpox.

SereneSadie@quokk.au · edit-2 16 days ago

deleted by creator

finitebanjo@lemmy.world · 1 month ago

Counter Surface to Air Missile equipment or ordnance.

Why not just say Child Porn or Illicit Images of Children? Who are we protecting by refusing to say the words?

rizzothesmall@sh.itjust.works · 1 month ago

Lol why tf people downvoting that? Sorry I learned a new fucking thing jfc.

Deceptichum@quokk.au · 1 month ago

It’s basically the only one anyone uses?

Devolution@lemmy.world · 1 month ago

Gemini likes twins…

…I’ll see myself out.

yeehaw@lemmy.ca · 1 month ago

Child sexual abuse material.

Is it just me or did anyone else know what “CSAM” was already?

chronicledmonocle@lemmy.world · 1 month ago

I had no idea what the acronym was. Guess I’m just sheltered or something.

pipe01@programming.dev · 1 month ago

Yeah it’s pretty common, unfortunately

TipsyMcGee@lemmy.dbzer0.com · edit-2 4 days ago

deleted by creator

Miðvikudagur@lemmy.world · 1 month ago

“Child pornography” is a term NGOs and law enforcement are trying to get phased out. It makes it sound like CSAM is related to porn, when in fact it is simply abuse of a minor.

TipsyMcGee@lemmy.dbzer0.com · edit-2 4 days ago

deleted by creator

ulterno@programming.dev · 1 month ago

The abbreviation sounds like some kind of exotic Surface to Air Missile lol

It does.
Somehow acronyms just end up sounding cool. Guess we should just use the full form. That would be better.

arararagi@ani.social · 1 month ago

“stop noticing things” -Google

DylanMc6 [any, any]@lemmy.ml · 1 month ago

time for guillotines

DylanMc6 [any, any]@lemmy.ml · 1 month ago

gill o’ teens

B-TR3E@feddit.org · 1 month ago

That’s what you get for criticising AI, and rightly so. I, for one, welcome our new electronic overlords!

B-TR3E@feddit.org · 1 month ago

cough: https://knowyourmeme.com/memes/i-for-one-welcome-our-new-insect-overlords

billwashere@lemmy.world · 1 month ago

I imagine most of these models have all kinds of nefarious things in them, sucking up all the info they could find indiscriminately.

√𝛂𝛋𝛆@piefed.world · 1 month ago

deleted by creator

minorkeys@lemmy.world · 1 month ago

Me stupid. Pls dumbsplain.

√𝛂𝛋𝛆@piefed.world · 1 month ago

deleted by creator

minorkeys@lemmy.world · 1 month ago

Mmmhmmm…mhhmmm yes, okay, uh huh, i see…so the conclusion I’ve reached is i’m stupid.