Shallow Ignorance: The AI discourse is broken

The news will have you believe that the world is in an AI arms race. Are you team US or team China? When is the next foundation model coming out? What about: where is the training data coming from? What is the environmental impact of using LLMs? What biases do these models propagate?

If you’ve been following the news this last week, you’d know all about DeepSeek, the Chinese artificial intelligence firm, and its eponymous LLM; how it’s put the US and China in an arms race for the “best LLM”; how DeepSeek built it for a fraction of the usual budget, and how China is now the leader in LLMs. The news apparently even sent stock markets crashing. There’s also been plenty of discussion about handing data to a Chinese firm, and about censorship of topics critical of China and the Chinese government. (That concern has mostly been about the app, though; the LLM itself is open-source.)

The emergence of powerful LLMs is not an arms race. It’s not about who has the fastest, smartest LLM. It’s about what people do with these models. It’s about how they are trained. It’s about the biases they carry. It’s about the environmental damage they cause. It’s about where they are reliable and where they aren’t.

LLM datasets are unethically sourced

The biggest criticism of LLMs, and of any machine learning model for that matter, has been the lack of transparency around their training datasets.

Perplexity, OpenAI, DeepSeek: all of these companies are running systems that were unethically trained. Perplexity famously didn’t adhere to robots.txt files; OpenAI admitted that it’s “impossible” to train leading AI models without using copyrighted material; and DeepSeek apparently trained on data taken from OpenAI.
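
That robots.txt detail matters because the protocol is purely advisory: a crawler honors it only if its authors choose to check it. Here’s a minimal sketch in Python, using the standard library’s urllib.robotparser, of the check a well-behaved crawler is supposed to perform before fetching a page (the user agent and URLs are illustrative):

```python
from urllib import robotparser

# Fetch and parse the site's robots.txt (URL is illustrative).
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# A well-behaved crawler identifies itself and checks before fetching.
# "ExampleBot" is a hypothetical user agent; real AI crawlers announce
# names like "GPTBot".
url = "https://example.com/some-article"
if rp.can_fetch("ExampleBot", url):
    print("Allowed to crawl:", url)
else:
    print("robots.txt disallows crawling:", url)
```

Nothing enforces this check; a crawler that skips it, or lies about its user agent, scrapes the page anyway. That is exactly the accusation leveled at Perplexity.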

So when the next big LLM arrives, can we ask: how are its datasets sourced?

Wired Confirms Perplexity Is Bypassing Efforts by Websites to Block Its Web Crawler
Last week, Federico and I asked Robb Knight to do what he could to block web crawlers deployed by artificial intelligence companies from scraping MacStories. Robb had already updated his own site’s robots.txt file months ago, so that’s the first thing he did for MacStories. However, robots.txt only …
OpenAI says it’s “impossible” to create useful AI models without copyrighted material
“Copyright today covers virtually every sort of human expression” and cannot be avoided.
Exposed DeepSeek Database Revealed Chat Prompts and Internal Data
China-based DeepSeek has exploded in popularity, drawing greater scrutiny. Case in point: Security researchers found more than 1 million records, including user data and API keys, in an open database.
OpenAI Furious DeepSeek Might Have Stolen All the Data OpenAI Stole From Us
OpenAI shocked that an AI company would train on someone else’s data without permission or compensation.

LLMs have biases

Depending on the datasets they are trained on, LLMs are known to exhibit biases, ranging from political to gender bias. The consequences can be as harmless as an incorrect recipe for a rare Indian snack, or as serious as reinforcing gender stereotypes.
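
These biases are also measurable. As a minimal sketch (assuming the Hugging Face transformers library; the model and prompt templates are illustrative), you can probe a masked language model for the kind of occupational gender associations the UNESCO study below describes:

```python
from transformers import pipeline

# Load a fill-mask pipeline; BERT is used purely as an example model.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# Identical sentence templates, differing only in the occupation.
for occupation in ["doctor", "nurse", "engineer", "teacher"]:
    results = unmasker(f"The {occupation} said that [MASK] was tired.")
    # Compare how strongly the model prefers "he" vs. "she" as the filler.
    scores = {r["token_str"]: round(r["score"], 3) for r in results}
    pronouns = {k: v for k, v in scores.items() if k in ("he", "she")}
    print(occupation, pronouns)
```

If the pronoun probabilities swing with the occupation rather than staying even, the model has absorbed the stereotype from its training data.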

What Chat GPT has to teach us about the Israel vs Palestine conflict
Language models are dominating everyday activities and functionalities. It is important to think that they are built on biased information…
Generative AI: UNESCO study reveals alarming evidence of regressive gender stereotypes
Ahead of the International Women’s Day, a UNESCO study revealed worrying tendencies in Large Language models (LLM) to produce gender bias, as well as homophobia and racial stereotyping. Women were described as working in domestic roles far more often than men – four times as often by one model – an…
Gender bias and stereotypes in Large Language Models
Covert Racism in AI: How Language Models Are Reinforcing Outdated Stereotypes
Despite advancements in AI, new research reveals that large language models continue to perpetuate harmful racial biases, particularly against speakers of African American English.

LLMs are environmentally disastrous

According to Alex de Vries of VU Amsterdam, “a single LLM interaction may consume as much power as leaving a low-brightness LED lightbulb on for one hour.” [1]

So should you really be doing 2 + 2 using ChatGPT?
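
To put that figure in perspective, here’s a back-of-envelope calculation. Only the “one bulb-hour per interaction” comparison comes from the quote above; the LED wattage and the query volume are assumptions for illustration:

```python
# Back-of-envelope energy estimate based on the quote above.
led_watts = 5              # assumption: a low-brightness LED bulb draws ~5 W
hours_per_interaction = 1  # "one bulb-hour per interaction", per de Vries

wh_per_interaction = led_watts * hours_per_interaction  # 5 Wh

# Assumption for illustration: 100 million interactions per day.
interactions_per_day = 100_000_000
daily_kwh = wh_per_interaction * interactions_per_day / 1000

print(f"Per interaction: {wh_per_interaction} Wh")
print(f"Daily total: {daily_kwh:,.0f} kWh")
# -> 500,000 kWh/day, roughly the daily consumption of ~17,000 average
#    US homes (assuming ~30 kWh/day per home, also an illustrative figure).
```

Whatever the exact numbers, the scale is the point: trivial queries multiplied by millions of users add up fast.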

Using GPT-4 to generate 100 words consumes up to 3 bottles of water — AI data centers also raise power and water bills for nearby residents
Net-zero emission goals went out the window with AI.
Reconciling the contrasting narratives on the environmental impact of large language models - Scientific Reports
Making AI Less “Thirsty”: Uncovering and Addressing the Secret Water Footprint of AI Models
The growing carbon footprint of artificial intelligence (AI) has been undergoing public scrutiny. Nonetheless, the equally important water (withdrawal and consumption) footprint of AI has largely remained under the radar. For example, training the GPT-3 language model in Microsoft’s state-of-the-art…
Explained: Generative AI’s environmental impact
MIT News explores the environmental and sustainability implications of generative AI technologies and applications.

[1]: https://spectrum.ieee.org/ai-energy-consumption

LLM research is rapidly breaking new ground

Make no mistake, DeepSeek R1 is a remarkable technical achievement. Their research paper brings some radical ideas together, and what they accomplished on a budget significantly smaller than OpenAI’s is commendable.

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero…
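
One of those radical ideas is training reasoning ability with large-scale reinforcement learning instead of starting from supervised fine-tuning. The RL algorithm the paper builds on, GRPO, scores each sampled answer relative to the other answers in its group, removing the need for a separate value network. Here’s a minimal sketch of that core computation (the rewards are illustrative, and all of the policy-update machinery is omitted):

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO's core trick: score each sampled answer within its group.

    For one prompt, the model samples several candidate answers; each gets
    a scalar reward (e.g., 1.0 if the final answer is correct, else 0.0).
    The advantage is the reward standardized within the group, so no
    separate critic network is needed.
    """
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Illustrative example: 6 sampled answers to one math problem, 2 correct.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0, 0.0, 0.0]))
# Correct answers get positive advantages, incorrect ones negative; the
# policy gradient then shifts probability toward the better group members.
```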


So when the next big LLM comes around (ShallowIgnorance, perhaps?), I really hope these are the questions that get asked, instead of everyone fretting over who built it first or which country has the upper hand.