TechTakes @awful.systems David Gerard @awful.systems 6 mo. ago

Proton Mail goes AI, security-focused userbase goes ‘what on earth’

pivot-to-ai.com

If an organization runs a survey in 2024 on whether it should get into AI, then they’ve already bodged an LLM into the system and they’re seeing if they can get away with it. Proton Mail is a priva…

we appear to be the first to write up the outrage coherently too. much thanks to the illustrious @self

184 comments

Mistral isn't trained on copyrighted data. It's based off selective databases that were open use. This article in general is full of false information. But I suppose most people only read the headlines.
- https://huggingface.co/mistralai/Mistral-7B-v0.1/discussions/8#6527a6fca6eaf92e6c26fa59
  
  Unfortunately we're unable to share details about the training and the datasets (extracted from the open Web) due to the highly competitive nature of the field.
  
  The "open web" is full of copyrighted material.
  
  We had a social contract!
  
  but it's apache2 sega! tooooootes freebies!
- was this incorrect? https://www.patronus.ai/blog/introducing-copyright-catcher
  
  Yes, they are incorrect:
  
  https://arstechnica.com/information-technology/2023/07/why-ai-detectors-think-the-us-constitution-was-written-by-ai/
  
  if you're not gonna read the fucken thing then fuck off.
  
  I did read the thing, then provided an article explaining why detecting copyrighted material / determining if something is written by AI is very inaccurate.
  
  Perhaps take your own advice to "read the fucken thing" next time instead of making yourself look like an idiot. Though I doubt you've ever heard of "better to stay silent and let them think you the fool than to speak and remove all doubt".
  
  Btw, I even recall that Ars specifically covered the company you linked to in a separate article as well. I'd be glad to provide it once you've come to your senses and want to discuss things like an adult.
  
  Mistral’s Mixtral-8x7B-Instruct-v0.1 produced copyrighted content on 22% of the prompts.
  
  did you know that a lesser-known side effect of the infinite monkeys approach is that they will produce whole sections of copyright content abso-dupo-lutely by accident? wild, I know! totes coinkeedink!
  
  I’d be glad to provide it once you’ve come to your senses and want to discuss things like an adult
  
  jesus fucking christ you must be a fucking terrible person to work with
  
  I've seen toddlers throw more mature tantrums
  
  she wrote harry potter with an llm, didn't she?
  
  I'm too old to argue against bad-faith arguments.
  
  Especially with people who won't read the information I provide them showing their initial information was wrong.
  
  One is a company that has something to sell, the other an article with citations showing why it's not easy to determine what percentage of a data set is infringing on copyright, or whether exact reproduction via "fishing expedition" prompting is a useful metric to determine if unauthorized copyright was used in training.
  
  The dumbest take though is attacking Mistral of all LLMs, even though it's on an Apache 2.0 license.
  
  I've read the article you've posted: it does not refute the fucking datapoint provided, it literally DOES NOT EVEN MENTION MISTRAL AT ALL.
  
  so all I can tell you is to take your pearlclutching tantrum bullshit and please fuck off already
  
  god these weird little fuckers’ ability to fill a thread with garbage is fucking notable isn’t it? something about loving LLMs makes you act like an LLM. how depressing for them.
  
  high willingness to accept painfully inexact responses
  
  high tendency to side with authority when given no information
  
  low ability to distinguish "how it is" from "how it seems like it should be"
  
  Meta:
  
  default expectation that others are the same way
  
  indignant consent-ignoring gesture if they're not
  
  "I said good day, sir. good day!" [walks through revolving door]
  
  I bet they go to react conferences
  
  snrk
  
  brava.
  
  To think that when sneer club/techtakes migrated to lemmy, I was pretty sure we would not be getting a lot of incidental traffic to the communities. Just about as wrong as you can be.
  
  Where was it prior? @earthquake @self
  
  Yes, clearly I'm the one throwing a tantrum 🙄
  
  Btw, you can just fact check my claim about what Mistral is licenced under. The article talks about copyright and AI detection in general, which to anyone with basic critical thinking skills could then understand would apply to other LLMs like Mistral.
  
  You might want to look up what pearl clutching means as well. You're using it wrong:
  
  https://dictionary.cambridge.org/us/dictionary/english/pearl-clutching
  
  Considering I've done the opposite of a shocked reaction. While at it, maybe also look up "projection"
  
  https://www.psychologytoday.com/us/basics/projection
  
  Anyhow, have a good day.
  
  chatgpt gets it
  
  Well since you want to use computers to continue the discussion, here's also ChatGPT:
  
  Determining the exact percentage of copyrighted data used to train a large language model (LLM) is challenging for several reasons:
  
  Scale and Variety of Data Sources: LLMs are typically trained on vast and diverse datasets collected from the internet, including books, articles, websites, and social media. This data encompasses both copyrighted and non-copyrighted content. The datasets are often so large and varied that it is difficult to precisely categorize each piece of data.
  
  Data Collection and Processing: During the data collection process, the primary focus is on acquiring large volumes of text rather than cataloging the copyright status of each individual piece. While some datasets, like Common Crawl, include metadata about the sources, they do not typically include detailed copyright status information.
  
  Transformation and Use: The data used for training is transformed into numerical representations and used to learn patterns, making it even harder to trace back and identify the copyright status of specific training examples.
  
  Legal and Ethical Considerations: The legal landscape regarding the use of copyrighted materials for AI training is still evolving. Many AI developers rely on fair use arguments, which complicates the assessment of what constitutes a copyright violation.
  
  Efforts are being made within the industry to better understand and address these issues. For example, some organizations are working on creating more transparent and ethically sourced datasets. Projects like RedPajama aim to provide open datasets that include details about data sources, helping to identify and manage the use of copyrighted content more effectively.
  
  Overall, while it is theoretically possible to estimate the proportion of copyrighted content in a training dataset, in practice, it is a complex and resource-intensive task that is rarely undertaken with precision.
  
  you should speak to a physicist, they might be able to find a way your density can contribute to science
  
  "exact percentage"
  
  just fuck right off. wasting my fucken time.
  
  You're the one who linked to an exact percentage, not me. Have a good day.
  
  you're conflating "detecting ai text" with "detecting an ai trained on copyrighted material"
  
  send the relevant article or shut up
  
  Ignoring the logical inconsistency you just spouted for a moment (can't tell if it's written by AI but knows it used copyrighted material? Do you not hear yourself?), you do realize Mistral is released under the Apache 2.0 license, a highly permissive scheme that has no restrictions on use or reproduction beyond attribution, right?
  
  I think it's clear you're arguing in bad faith however with no intention of changing your misinformed opinion at this point. Perhaps you'd enjoy an echo chamber like the "fuckai" Lemmy instance.
  
  wait a minute… there’s another “fuck ai” instance and they’ve already told you to go fuck yourself?
  
  I wonder if they want to be friends
  
  have seen one on lemmy.world. it's kinda the dancing baby version of the stubsack and techtakes tho from what I can tell
  
  aw, it’s only a community? that’s what I get for expecting anything but garbage from Oscar the Grouch I suppose
  
  "I love trash, baka!"
  
  — Asuka the Grouch
  
  holy shit you really are quite dumb. the fuck is wrong with you?
  
  actually don’t answer that
  
  You are quite dumb.
  
  the reading comprehension of an llm and the contextual capacity of a gnat