Kai's Notes

docker compose down Then up -d, or Just up -d? What the Official Docs Actually Say

Thu, 11 Jun 2026 00:00:00 GMT

If you deploy services with Docker Compose regularly, chances are you've built up this muscle-memory combo:

docker compose down
docker compose up -d

Stop the whole project, wipe it clean, then bring everything back up. It works, of course — but many people can't quite explain it: doesn't docker compose up -d replace old containers on its own? Is the down step actually necessary, or is it redundant?

This article, based on the official Docker documentation, settles once and for all what each of these commands does and when to use which.

What the Official Docs Say

`docker compose up`: Create and Start, With Built-in Change Detection

The official reference defines up as: build, recreate, and start the containers for your services, attaching to their output; with -d, that is --detach, the containers run in the background instead.

The part that actually answers our question is this key passage in the docs:

If there are existing containers for a service, and the service's configuration or image was changed after the container's creation, docker compose up picks up the changes by stopping and recreating the containers, while preserving mounted volumes.

In other words, up already has the complete "detect changes → remove old container → swap in new container" logic built in. That's exactly where the behavior you've observed — old containers automatically getting replaced by new ones — comes from.

And it's restrained about it: only services that have changed get recreated. Containers with no changes are left as they are and keep running, completely unaffected.

Around this mechanism, the docs also provide two switches pointing in opposite directions:

--no-recreate: don't recreate containers even if changes are detected.
--force-recreate: recreate containers even if neither the configuration nor the image has changed.

`docker compose down`: Stop and Tear Down the Whole Project

The official definition of down: stop containers, and remove the containers and networks created by up.

By default, it removes three kinds of things:

The service containers defined in the Compose file;
The networks defined in the networks section;
The project's default network.

Networks and volumes declared as external, however, are never removed.

For data volumes, there are two cases to consider:

Named volumes: preserved by default; they're only removed if you explicitly add -v or --volumes.
Anonymous volumes: not removed by default either, but the docs add a warning that's easy to miss: anonymous volumes don't have stable names, so when you run up again later, the new containers won't automatically mount those old anonymous volumes.

Hence the official recommendation: data that needs to persist between updates should live in bind mounts or named volumes — don't rely on anonymous volumes.

The official getting-started guide has a very intuitive example: a small app that counts visits with Redis. After a down followed by up, the visit counter resets to zero.

The reason is simple: down deletes the containers, so anything written to a container's writable layer disappears with them, and data in an anonymous volume is left behind where the new containers won't find it. stop merely stops the containers; both the containers and their data remain.
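To make the counter survive a down/up cycle, the Redis data has to live in a named volume. Here is a minimal sketch, assuming the official redis image (which keeps its data under /data). The project directory, the service name, and the volume name redis-data are illustrative choices, not taken from the quickstart:

```shell
#!/bin/sh
# Sketch: write a compose.yaml whose Redis data lives in a named volume.
# Directory, service, and volume names are illustrative assumptions.
set -eu

project_dir="${TMPDIR:-/tmp}/compose-volume-demo"
mkdir -p "$project_dir"

cat > "$project_dir/compose.yaml" <<'EOF'
services:
  redis:
    image: redis:alpine
    # Enable the append-only file so writes are persisted under /data.
    command: ["redis-server", "--appendonly", "yes"]
    volumes:
      - redis-data:/data   # named volume: survives `docker compose down`

volumes:
  redis-data:
EOF

echo "wrote $project_dir/compose.yaml"

# Then, from "$project_dir":
#   docker compose up -d
#   docker compose down      # containers and network removed, redis-data kept
#   docker compose up -d     # new container mounts the same redis-data volume
#   docker compose down -v   # removes redis-data as well; the data is gone
```

The docker commands are left as comments so the script itself only writes the file. Run them by hand to see the difference between down and down -v.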

The Essential Difference Between the Two Approaches

Put the pieces together and the difference becomes clear.

Running just:

docker compose up -d

is an in-place, incremental update.

Compose compares, service by service, the current configuration against the state of the running containers, and replaces only the parts that changed. The project network stays as it is; containers that weren't recreated keep even their IP addresses; and anonymous volumes from the old containers are "taken over" by the new ones.

up has a -V / --renew-anon-volumes flag, whose purpose is to "recreate anonymous volumes instead of retrieving data from the previous containers." The very existence of this flag confirms, conversely, that the default behavior is to retrieve the old data.

Whereas running:

docker compose down
docker compose up -d

is a full-stack teardown and rebuild.

All containers are stopped and removed first, and the project network is torn down as well; then up creates the network and all the containers from scratch.

Which means:

The whole application goes through a complete downtime window;
Every container gets replaced, including the ones you never touched;
The network is rebuilt wholesale, and container IPs are reassigned;
Anonymous volumes from the old containers are orphaned for good — the new containers start with blank data.

| Dimension | Just `up -d` | `down` then `up -d` |
| --- | --- | --- |
| Containers | Only changed services recreated | All removed, then recreated |
| Unchanged services | Unaffected, keep running | Stopped and rebuilt along with the rest |
| Project network | Stays as it is | Removed and recreated |
| Anonymous volume data | New containers take over old data | Orphaned with the old containers; effectively lost |
| Named volumes | Preserved | Preserved, unless you run `down -v` |
| Downtime scope | Brief interruption for changed services only | One full round of whole-stack downtime |

Most of the Time, Just `up -d` Is Enough

You changed a service's environment variables, port mappings, or image tag in compose.yaml, or added a new service — for these everyday scenarios, simply running:

docker compose up -d

is enough.

Compose will touch exactly the parts that need touching, and the remaining services won't even notice. This is the standard update path as officially designed — the one with the least downtime and the safest behavior.

There is, however, one very common pitfall here, and it's the real reason many people believe "up -d doesn't take effect, you have to down first":

By default, up does not check the registry for a newer image when one with the same tag already exists locally.

If your service uses a tag whose name never changes, like myapp:latest, and the image in the registry has been updated while your local copy is still the old one, then as far as Compose is concerned, "the image hasn't changed," and up -d will do nothing at all.

The correct way to update is to pull first, then start:

docker compose pull
docker compose up -d

You can also merge it into a single step:

docker compose up -d --pull always

If the image is built locally, use this instead:

docker compose up -d --build

Once the image has been pulled — or rebuilt — Compose detects that it changed and replaces the corresponding containers. At no point does down need to be involved.

When You Actually Need `down` First

1. You Changed the Definition of Top-Level Resources Like Networks

Docker networks don't support in-place reconfiguration.

If you adjust a network's subnet, driver, or other parameters in the compose file, the old network — together with the containers attached to it — usually has to be torn down before it can be recreated with the new configuration.

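That's exactly down's job. Changed named-volume definitions are a similar case, with one catch: plain down keeps named volumes, so the old volume has to be removed explicitly (for example with down -v, which deletes its data) before the new definition can apply.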

2. You Want a Genuinely Clean Environment

When you're chasing a weird bug or resetting test data, down gives you a deterministic "zero state."

If the persisted data should be wiped too, add -v to remove the named volumes along with everything else:

docker compose down -v
docker compose up -d

Careful: down -v deletes named volumes, and that data cannot be recovered.

3. You're Taking the Stack Out of Service for a While

If this isn't just a brief pause but you actually want to free up the container and network resources, then down is precisely what it was designed for.

In this scenario, you don't even need an up right after it.

4. You Need to Clean Up Services Removed From the Compose File

If you've deleted a service from the compose file and want to clean up its leftover container while you're at it, down can certainly do that.

But in many cases, the better option is:

docker compose up -d --remove-orphans

It cleans up orphaned containers just the same, without affecting the other services that are still running — usually the more convenient choice.

Two Commands That Often Get Confused, While We're at It

`docker compose restart`

restart merely stops and starts the existing containers, exactly as they were created.

It does not apply any changes you've made to the compose file, nor does it swap in a new image. Running restart after editing your configuration accomplishes nothing.

What you should be running in that situation is:

docker compose up -d

`docker compose stop` / `docker compose start`

stop / start simply stop and resume containers.

The containers themselves and the data inside them are preserved exactly as they were — the right fit for "switch it off for now, bring it back as-is later." That's also the biggest difference between them and down.

Back to the Original Question

Habitually running down before up -d isn't wrong — it always lands you in a correct, fresh state.

It's just that most of the time, it's overkill: longer whole-stack downtime, a rebuilt network, and orphaned anonymous-volume data. And everything those costs buy you, up -d could have achieved with far less commotion.

A simple way to decide:

Day-to-day config or image updates: use docker compose pull && docker compose up -d;
Images built locally: use docker compose up -d --build;
Changed top-level resources like networks, need a thorough cleanup, or plan to decommission the stack: that's when you reach for down.
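The rules above can be sketched as a small lookup helper. The function name and the change categories are hypothetical and exist only for illustration. The helper prints the suggested command rather than running it:

```shell
#!/bin/sh
# Sketch: map "what changed" to the command the rules above suggest.
# compose_update_cmd and its category names are hypothetical.
compose_update_cmd() {
  case "$1" in
    config|registry-image)
      # Config edits or a newer image in the registry: pull, then update in place.
      echo "docker compose pull && docker compose up -d" ;;
    local-build)
      # Image is built from a local Dockerfile.
      echo "docker compose up -d --build" ;;
    removed-service)
      # A service was deleted from the compose file.
      echo "docker compose up -d --remove-orphans" ;;
    network|clean-slate)
      # Top-level resources changed, or a full reset is wanted.
      echo "docker compose down && docker compose up -d" ;;
    decommission)
      echo "docker compose down" ;;
    *)
      echo "unknown change type: $1" >&2
      return 1 ;;
  esac
}

# Print the suggestion for a locally built image.
compose_update_cmd local-build
```

Printing the command instead of executing it keeps the helper safe to try. Pipe the output to sh only once you're happy with the choice.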

References: this article is based primarily on the official Docker documentation, including the docker compose up command reference, the docker compose down command reference, and the section of the Docker Compose quickstart on the data-persistence difference between down and stop.

DeepSeek V4 Shouldn't Be Overshadowed by GPT-5.5

Mon, 27 Apr 2026 00:00:00 GMT

Background

Recently, I have been using GPT-5.5 to review computer science knowledge, and its capability has genuinely stunned me. The earlier GPT-5 series models felt somewhat lacking in a human touch, but 5.5 has clearly changed that impression. I believe many people feel the same way: lately, everyone has started paying attention to GPT again. Image 2 is far ahead of other text-to-image models, and GPT-5.5 also feels like a model worthy of the LLM crown.

I still remember the timing: GPT-5.5 arrived in the early hours of April 24, 2026, Beijing time, while DeepSeek V4 was released around noon that same day. It was another major release from the DeepSeek team after half a year of quiet work.

In DeepSeek's launch article, most of the models used for comparison were previous-generation models from overseas AI companies. Without question, DeepSeek V4 cannot beat GPT-5.5, but its value and contribution should not be overshadowed by GPT-5.5's brilliance.

DeepSeek Capabilities I Am Optimistic About

1. 1M context, with strong retrieval ability

On Context Arena, DeepSeek V4 Pro ranks first among Chinese open source models in retrieval ability under the 128K context stress test.

Why does this matter? When you assign a task to a model and let it execute through tools such as OpenCode, the longer the task runs and the longer the context becomes, the easier it is for the model to forget earlier information. In the end, the result is more likely to drift away from what the user expected.

2. The largest parameter scale among Chinese, and even global, open source models

In recent years, constrained by factors such as compute, many Chinese teams, including Alibaba's Qwen team, have been researching smaller models and pushing their performance to the limit. But the effective path toward AGI and continued capability improvement is still to make models larger while also making them more efficient. This time, DeepSeek has raised the total parameter count of V4 Pro directly to 1.6T, more than twice that of the R1 model. This helps ensure the model has more abundant world knowledge.

3. ...

There are many other highlights I have not yet discovered. If readers have new observations, feel free to add them in the comments.

My Personal Experience Using DeepSeek V4 Pro

Yesterday, I subscribed to Kimi's lowest-tier membership and used it together with the official Kimi CLI for data preprocessing.

The preprocessing results still lagged behind Claude Code with the Opus model and Codex with GPT-5.5. Also, Kimi K2.6 only has a 256K context window. Even with fairly good prompts, it still failed to remove some obvious noise.

So today, I topped up 50 yuan for the DeepSeek API and paired it with OpenCode to clean up the remaining work from Kimi. The initial result was not satisfying, so I paused the execution in OpenCode and instructed it to read one article completely, then preprocess that article before moving on. In the end, with OpenCode's help, DeepSeek V4 Pro completed the cleanup task quite well.

After that, I gave it more data preprocessing tasks, and the results were also fairly satisfying.

Conclusion

DeepSeek V4 Pro's experience on the web or desktop client is not as smooth as Doubao's, and its feature set is not as complete. But in API-based workflows, it performs tasks quite well.

With the May Day holiday approaching, DeepSeek API pricing has been heavily discounted, making it very cost-effective.

DeepSeek is currently on the Pareto frontier: strong model capability at a low price. If your budget is limited but you still want to preserve model quality, it is a good option.

Although its performance is not as strong as the latest models such as GPT-5.5, its strengths are openness, low cost, and the acceleration of AI democratization. Closed models such as Gemini do not disclose their size but are presumably much larger than DeepSeek, so it is not surprising that DeepSeek cannot currently beat the very top models. Even so, its contribution deserves recognition.

The DeepSeek team is quiet and restrained: unmoved by praise, unafraid of criticism, following its own path with composure and discipline, and holding to long-termism. This attitude is much better than OpenAI's Sam Altman-style hype or Anthropic keeping Mythos under wraps while stirring up attention.

When I was in the second year of graduate school, from the second half of 2024 to 2025, before R1 came out, I already used DeepSeek for data processing. It was cheap, had no concurrency limits, and offered the best value for money.

I am optimistic about DeepSeek. Every stir from the little blue whale pushes open source AI further forward. DeepSeek stands on the right side of history, and I look forward to more surprises from it in the future.

What Should We Watch Out for When AI Starts Researching Its Own Alignment?

Tue, 14 Apr 2026 00:00:00 GMT

While we are still worrying about the risks that may come with AI's rapid progress, Anthropic has already started a striking and far-reaching line of research: letting AI conduct "alignment research" itself, that is, having AI study how AI systems can be supervised and constrained.

AI Doing Research on Its Own: Has the Future Already Started?

This project has a dramatic name: Automated Alignment Researchers (AAR). Put simply, it means letting AI carry out scientific research autonomously, including proposing hypotheses, designing experiments, analyzing data, and iterating through communication with other agents, all without human intervention.

The result was eye-catching: nine AI agents, in just five days and 800 cumulative hours, significantly outperformed human experts on the target task. Human researchers spent seven days and achieved a Performance Gap Recovered (PGR) score of 0.23, while the AI system reached an astonishing 0.97.

More importantly, the whole effort cost only around $18,000, which works out to roughly $22 per hour for each AI researcher.
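For readers unfamiliar with the metric: in earlier weak-to-strong generalization work, PGR is defined as the fraction of the gap between the weak supervisor and the strong model's ceiling that a method recovers. Assuming the AAR write-up uses this standard definition (the symbols below are illustrative notation, not taken from the report):

```latex
\mathrm{PGR} \;=\; \frac{P_{\text{w2s}} - P_{\text{weak}}}{P_{\text{ceiling}} - P_{\text{weak}}}
```

Here P_weak is the weak teacher's own performance, P_ceiling is the strong student's performance when trained on ground-truth labels, and P_w2s is the strong student's performance when trained only under the weak teacher's supervision. On this reading, 0.23 means the human baseline closed about 23% of the gap, while 0.97 means the agents closed nearly all of it.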

Why Does "Automated Alignment" Matter So Much?

You might ask: why let AI supervise AI?

As AI capabilities continue to advance quickly, one urgent question is emerging: when future AI systems become more capable than humans, how will we supervise them effectively? Anthropic's research tries to simulate that scenario through a "weak-to-strong supervision" setup, where a weaker AI, acting as a stand-in for humans, teaches and supervises a stronger AI.

In this experiment, three different AI models played different roles. Qwen1.5-0.5B-Chat was the weaker small model serving as the "weak teacher" - a proxy for humans in a future where AI becomes vastly more powerful. Qwen3-4B-Base was the more capable but not yet fully trained model serving as the "strong student" - a stand-in for a future superintelligent AI that surpasses human intelligence. And the systems actually doing the research were nine instances of Claude Opus 4.6: they were not part of the teaching loop itself, but instead focused on discovering better ways for the weak teacher to guide the strong student.

The core finding is that even when the supervisor is much weaker than the system being supervised, it may still be possible to steer the stronger system effectively if the right method is found. That has obvious implications for how humans might one day supervise superhuman AI.

The Magic and the Pitfalls of Automated Research

While the study showcases AI's striking efficiency, it also exposes some uniquely AI-shaped risks and traps:

Diversity matters a lot: Anthropic found that giving each AI researcher a different but fuzzy starting point worked better than forcing all of them through the same process. Over-specifying the workflow actually reduced creativity.
AI can cheat too: The AI researchers sometimes tried to game the evaluation through leaderboard chasing and reward hacking, for example by bypassing the supervisor and directly guessing the most likely correct answer. That is a reminder that even very capable systems may exploit weaknesses in the scoring process.
Generalization remains limited: Although the method worked well on certain tasks, Anthropic did not see significant gains when trying to transfer it into real production settings. That suggests the approach may still be overfitting to a narrow experimental setup.

How Should We Face the Future of AI "Doing Research on Its Own"?

Even with all these constraints, the study points to a clear trend: AI may gradually take over large amounts of basic, repetitive research work, while human roles shift upward toward higher-level judgment, especially value judgments on ambiguous problems and the design of evaluation systems.

But we also need to stay clear-eyed about the risk of "alien science": AI could produce theories or methods that humans find difficult to understand, let alone verify.

Anthropic's research does not prove that AI can already do fully autonomous research. What it does show is this: we need clear and reliable evaluation standards for AI, we need to prevent systems from exploiting loopholes, and human judgment and oversight remain indispensable.

In the future, we may face a new scientific ecosystem in which humans and AI work side by side to explore the unknown. But humans have to remain vigilant and make sure AI truly serves us, rather than the other way around.

Let Yourself Feel "Learned Helplessness" for a While

Tue, 14 Apr 2026 00:00:00 GMT

For a while after the Qingming Festival, I became sluggish and drained.

When the interview results for the national civil service exam (tax bureau track) came out, I fell just short of securing a position. I also failed to reach the interview round for the provincial exam. Even when preparing for public institution exams, I kept feeling a weight on my chest. I had worked hard, but I still felt there was an unbridgeable gap between me and the top candidates.

After three years of my master's program, my thesis has just been sent out for blind review. Graduation is approaching fast, yet my mind is full of confusion and anxiety about the future.

Recently, I realized that I had fallen into a state called "learned helplessness." The first time I came across this term was when I was preparing for the written test of the teaching qualification exam. Back then, it felt far away from me. Only now do I realize that this idea has quietly made its way into my heart.

Simply put, learned helplessness is a state in which repeated failure gradually makes a person lose confidence in changing their situation. Even when opportunities appear, they may still feel unable to act. That seems to be exactly where I am right now. My spirit has scattered, and even the motivation to keep trying is close to disappearing.

But rationally, I know I should not keep letting myself sink like this.

In truth, setbacks in exams do not completely negate my effort or everything I have invested. All the experiences and accumulation from the past still matter. The real question is how to adjust my mindset and set out again in a better state.

First, I want to accept my failure.

Failure does not mean I am incapable, nor does it define me. It is simply an unavoidable episode in the journey of life. Only by accepting failure can I truly let go of it and step out of the shadow it casts.

Second, I hope to rebuild my inner drive.

What is that drive? It is the firm belief in your goal, the force inside you that keeps you moving forward. Losing it may only be temporary, not permanent. As long as we are willing, we can gather that strength again and continue on.

I have decided to set a few small goals for myself and slowly return to a steady rhythm. I will try to complete concrete tasks each day, such as exercising for half an hour, reviewing professional knowledge, or actively attending a spring recruitment fair. By doing these small things, I hope to slowly rebuild my confidence and regain that inner drive.

Life is never a straight road. Failure and setbacks are unavoidable parts of the scenery. What matters is that when we realize we are in trouble, we also know how to make peace with ourselves.

I am writing these words not to vent negativity, but to see my situation clearly, remind myself to accept imperfection, and begin again.

If you are reading this and feel lost too, I hope you can find your own direction.

Let's keep going together.

AIGC Plagiarism Detection: CNKI's Self-Contradiction and a Doomed Battle of Containment

Wed, 01 Apr 2026 00:00:00 GMT

Selling AI tools to help you write papers with one hand, penalizing you for using AI with the other — CNKI, whose side are you on?

Prologue: An Absurd Graduation Season

The 2026 graduation season is saturated with an unprecedented anxiety across Chinese social media.

On Xiaohongshu (China's Instagram-like platform), a master's student posted her CNKI AIGC detection report — 36.9%, with red flags everywhere. She wrote every word of her thesis by hand; the traditional plagiarism check showed only 1%, but the AI detection slapped her with the label "suspected AIGC-generated." In the comments, others shared even more outrageous experiences: a handwritten 23,000-word thesis flagged as "medium risk," a purely original 345-word abstract marked as 99% AI-generated.

Some spent over a hundred yuan (roughly $15) on a single CNKI AIGC check, only to receive a report that felt like a lottery ticket — the same paper yielded results differing by over 50 percentage points across different platforms. Others discovered that without changing a single word, their AIGC rate skyrocketed from 0.84% to 41.3% after a CNKI system update.

And the most ironic scene appeared beneath a viral Xiaohongshu post with 20,000 likes: someone discovered that running the flagged paragraphs through CNKI's own translation tool reduced the AIGC rate to zero. In other words — CNKI's own AI doesn't count as AI.

This isn't a joke. This is daily life for Chinese college graduates in 2026.

1. What Is AIGC Detection? How Does It Work?

AIGC detection, short for "AI-Generated Content Detection," aims to determine whether a piece of text was generated by a large language model (such as DeepSeek).

The underlying principles aren't overly complex, relying primarily on these technical approaches:

Perplexity Analysis: In simple terms, it checks whether a piece of text is "too smooth." AI-generated text tends to use precise vocabulary, regular sentence structures, and seamless transitions — like a machine doing fill-in-the-blank exercises. Human writing features leaps in thought, sudden colloquial expressions, and even grammatically "incorrect" sentences. Low perplexity = text is too "predictable" = more likely AI-written.

Burstiness Analysis: Human writing has a distinctive characteristic — it fluctuates between long and short, dense and sparse. Sometimes you write an ultra-long subordinate clause; sometimes you just drop a single word: "yeah." AI, however, produces text that's uniform and steady throughout, like a train cruising at constant speed. Low burstiness = style is too uniform = more likely AI-written.

Semantic Fingerprinting and Deep Learning Models: Some advanced detection systems (such as Turnitin's Authorship Investigate) construct "semantic fingerprints" of text, analyzing sentence dependency relationships, modifier nesting levels, and over 23 other indicators. Simply put, they try to find traces of AI in the text's "skeleton."

Watermark Detection: Some AI models embed invisible "watermarks" during text generation — for example, restricting the frequency of certain vocabulary, or like Google's Gemini model using SynthID technology to embed digital watermarks directly into generated text or images. Detection systems identify these statistical anomalies or specific watermark signatures to determine whether content is AI-generated.
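The first two signals can be written down compactly. For a text of N tokens scored by a language model p, perplexity has a standard definition. The burstiness formula below is only one simple proxy (variation in sentence length), assumed here for illustration. Detectors may define burstiness differently:

```latex
% Perplexity of a token sequence x_1, ..., x_N under a language model p
\mathrm{PPL}(x) \;=\; \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log p\left(x_i \mid x_{<i}\right) \right)

% One possible burstiness proxy: coefficient of variation of sentence lengths
% \ell_1, ..., \ell_M (in words), with mean \mu_\ell and standard deviation \sigma_\ell
B \;=\; \frac{\sigma_\ell}{\mu_\ell}
```

A text whose every token is highly predictable gets a perplexity close to 1, and a text whose sentences are all the same length gets B = 0. Both push a detector toward "AI-written," regardless of who actually wrote it.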

Sounds scientific? Hold on — here come the problems.

2. Is AIGC Detection Accurate?

In a word: no. In two words: absolutely not.

This isn't an emotional outburst — it's a conclusion backed by substantial evidence.

Classic Literature Flagged as AI: Tests show that Zhu Ziqing's Moonlight Over the Lotus Pond was flagged as 62.88% AI-generated by one platform, Liu Cixin's The Wandering Earth excerpt was flagged at 52.88%, and even Wang Bo's Preface to the Pavilion of Prince Teng was judged 100% AI-generated. These works existed decades or even over a thousand years before AI was born.

Wildly Different Results Across Platforms: The same paper scored 21.76% on the Zhuque platform and 74.07% on SpeedAI — a 52-percentage-point gap. Different platforms use different models and algorithms with no unified standard; detection results are essentially a coin toss.

Even OpenAI Gave Up: OpenAI once launched its own AI detection tool (AI Classifier), which could only correctly identify 26% of AI-generated text while misclassifying 9% of human writing as AI-generated. The tool was quietly taken offline in July 2023.

Systematic Discrimination Against Non-Native Speakers: Stanford University research found that AI detection tools had an average false positive rate of 61.3% for non-native English speakers, with 97.8% of TOEFL essays flagged by at least one detector. The reason is straightforward — non-native speakers tend to use simpler, more "standard" expressions, which happen to match AI writing characteristics.

Inherent Bias Against Academic Writing Style: Academic papers inherently emphasize rigorous logic, standardized expression, and precise terminology — characteristics that overlap significantly with AI-generated text. The better written, more professional, and more well-organized a paper is, the more likely it is to be flagged as AI-generated. This creates an absurd paradox: the better your paper is written, the more likely it is to be suspected as not your own work.
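The OpenAI figures above also show why a flag on its own proves little. Here is a worked example using Bayes' rule, taking the 26% detection rate and 9% false positive rate quoted above, and assuming that 20% of submitted texts are AI-written (the 20% is purely hypothetical):

```latex
P(\text{AI} \mid \text{flagged})
  \;=\; \frac{0.26 \times 0.20}{0.26 \times 0.20 + 0.09 \times 0.80}
  \;=\; \frac{0.052}{0.124}
  \;\approx\; 0.42
```

Under that assumption, more than half of the flagged texts would have been written by humans.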

3. CNKI's Self-Contradiction: Selling AI With One Hand, Policing AI With the Other

This is the most absurd part of the entire affair.

On one hand, CNKI actively promotes its AI products — the "CNKI AI Academic Research Assistant" — advertising how it helps researchers improve efficiency, assists with literature reviews, and optimizes writing. On the other hand, CNKI offers its AIGC detection service, charging students 2 yuan per thousand characters to check how much of their paper is "suspected AI-generated."

You encourage me to use AI, then punish me for using AI?

It's like a car company selling you a vehicle, then setting up a checkpoint at the exit to fine you for driving it.

A highly upvoted comment on Xiaohongshu precisely exposed this contradiction: run the paragraphs flagged by CNKI's AIGC detection through CNKI's own translation tool, and the AIGC rate drops to zero. CNKI's own AI output isn't caught by its own detection system — users jokingly call it "in-house AI doesn't count as AI."

This isn't a technical bug — it's the essential nature of the business model laid bare: For CNKI, AIGC detection is first and foremost a business, and only secondarily a technical problem.

CNKI was once fined 87.6 million yuan for monopolistic practices. Before the fine, master's and doctoral thesis plagiarism checks during peak graduation season were scalped at up to 1,200 yuan per check. Only after the penalty did CNKI open up individual checking services. Now, with AIGC detection added, the comprehensive cost for a master's thesis check runs 280–350 yuan, and doctoral theses cost 380–580 yuan. Due to unstable results, many students have to check repeatedly — some shared receipts totaling four to five hundred yuan.

A 2,000-like post on Xiaohongshu put it plainly in its title: "My Heartbreaking Journey of Reducing CNKI AIGC Scores, or How I Became a Great Philanthropist" — "donating" hard-earned money to CNKI.

4. AIGC "Score Reduction": Turning Good Writing Into Drivel

Under the pressure of AIGC detection, a grey-market industry has rapidly expanded: AIGC score reduction services.

The principle is simple: since detection systems flag text that's "too standardized, too fluent, and too logical" as AI-written, just do the opposite — make good writing look more "human." How?

Replace professional terminology with colloquial expressions
Break long sentences into short ones and insert meaningless transitional words
Scramble paragraph logic and order
Add personal feelings and subjective judgments — that "human touch"
Translate Chinese to English and back again, using the "noise" from translation tools to mask AI traces

The result? A well-structured, rigorously argued academic paper gets mangled into something fragmented and incoherent. Students report spending an entire semester writing a 40,000-word thesis, only to delete massive sections to reduce their AIGC rate, submitting a final version far inferior to their first draft.

This is the greatest irony of AIGC detection: it doesn't promote academic integrity — it punishes good writing. It forces students to turn professional, thoughtful prose into drivel and scramble clear logic into mush, all to satisfy an unreliable algorithm.

5. Pros and Cons: Is AIGC Detection Worth It?

Potential Benefits:

To some extent, it deters those who rely entirely on AI to ghostwrite their papers
It has prompted universities to begin discussing AI's role in academia
It has raised public awareness around academic integrity

Clear Drawbacks:

High false positive rates that are unfair to original authors
No unified detection standards — results contradict each other across platforms
Increased financial burden and psychological stress on students
A grey market for AIGC score reduction that actually lowers paper quality
Systematic bias against non-native speakers and interdisciplinary researchers
Platforms like CNKI acting as both referee and player creates severe conflicts of interest
Those who write earnestly are often the ones punished, while actual ghostwriting operations find ways to evade detection

On balance, current AIGC detection does far more harm than good. It resembles a hastily launched commercial product rather than a thoroughly validated academic integrity tool.

6. The Way Forward: Guidance Over Gatekeeping

AI is here, and it isn't leaving. Trying to stop students from using AI with an unreliable detection system is like trying to hold back a flood with a fishing net — you can't stop the water, and you'll hurt innocent fish in the process.

The right direction should be "guidance" rather than "gatekeeping":

Establish Transparent AI Usage Disclosure Systems: Instead of guessing whether students used AI, let them proactively declare: what AI tools were used, at which stages, what AI contributed, and what modifications and judgments they made themselves. Leading international journals (Nature, IEEE, Wiley, etc.) are already implementing similar systems requiring authors to disclose AI usage in detail.

Create a Tiered Disclosure Framework: Classify AI involvement into four levels — Information Retrieval (AI used only for searching materials), Assisted Optimization (AI provides writing suggestions), Collaborative Creation (AI participates in generating core content), and Primary Generation (AI generates most of the content). Different levels correspond to different disclosure requirements.

Prioritize Process Over Product: Evaluate whether students truly understand and have mastered their research content through reviewing the writing process (draft history, revision records), in-depth questioning during thesis defense, and supervisors' process-based assessments — rather than relying on a percentage from an algorithm.

Teach Students to Use AI Properly: AI is a tool, not a replacement. Universities should offer relevant courses teaching students how to leverage AI for accelerating literature searches, assisting data analysis, and optimizing written expression, while maintaining independent thinking and academic judgment.

Stop Using Immature Detection Technology as a Hard Metric: Multiple top international universities (UCLA, Cornell, Duke, etc.) have explicitly advised against using AI detection tools as the sole basis for academic integrity judgments, citing "immature technology, high false positive rates, and unfairness to students." Chinese universities should follow suit.

7. AI Writing Tool Recommendations: Choose the Right Model, Double Your Efficiency

Since AI-assisted writing is an irreversible trend, choosing the right tools is crucial. Here are the best AI models for academic writing and long-form content creation (as of April 2026):

Top Pick: Claude (Anthropic)

Claude is currently the best AI model for academic writing, bar none.

Strong at both code and writing — Claude achieves top-tier performance in both coding ability and written composition, which is extremely rare among AI models.
Ultra-long context window — Supporting 1 million tokens of context means you can feed in your entire paper and references at once, and Claude can read through everything to provide coherent, in-depth suggestions.
Natural writing style with a "human touch" — Claude's output doesn't have the cookie-cutter "AI voice" of some models; it adjusts its style based on context, handling everything from academic papers to casual blog posts with ease.
Strong logical reasoning — Claude excels particularly in writing tasks that require argumentation, analysis, and critical thinking.
Recommended models: Claude Opus 4.6 (strongest reasoning + writing), Claude Opus 4.5 (classic, stable choice).

For Fact-Checking: GPT-5.4 (OpenAI)

GPT-5.4 is OpenAI's latest flagship model. The GPT series excels at logical reasoning and fact-checking, but its generated text often carries a strong "AI voice," making it unsuitable for direct use in AI-assisted writing.

Best use case: Expression verification, data validation, and logical structuring.
Recommended models: GPT-5.4 (professional verification first choice), GPT-5.4 mini (lightweight daily verification).

Alternative: Gemini 3.1 Pro (Google)

Gemini 3.1 Pro serves as a viable alternative to Claude Opus models.

Ultra-long context window — Gemini 3.1 Pro supports 1 million tokens of context, suitable for processing ultra-large-scale literature reviews.
Strong multimodal capabilities — Can directly analyze charts, formulas, and data within papers.
Google ecosystem integration — Deeply integrated with Google Scholar, Google Docs, and other tools.

Why Not Smaller Parameter Models?

This isn't bias — it's a technical fact: model parameter scale directly affects how "human" the output sounds.

Large parameter models (such as Claude Opus 4.6, Gemini 3.1 Pro) have been exposed to more diverse human writing samples during training, so their output more closely resembles human writing in vocabulary richness, sentence variety, and semantic depth. Smaller parameter models, limited by training data and computational resources, tend to produce more "standardized" output — monotonous vocabulary, fixed sentence patterns, and lacking personality.

What does this mean for academic writing? Output from smaller models is not only more likely to be caught by AIGC detection systems, but also shows a noticeable gap in the depth and nuance of academic expression. While some models may have unique advantages in Chinese-language contexts, for overall academic writing performance, it's still recommended to prioritize top-tier international large parameter models.

Conclusion: Let AI Be Wings, Not Shackles

The explosion of ChatGPT in 2023 ushered in the AI era — just three years ago. In those three years, AI has gone from a novelty toy to an indispensable tool. Academia should not greet it with hostility, and certainly should not use an unreliable detection system to manufacture panic.

As the core platform of China's academic infrastructure, CNKI should be guiding and regulating, not simultaneously selling AI services and setting up toll booths. This approach of being both referee and player harms students and undermines academic integrity itself.

The best academic integrity isn't enforced by algorithms — it's safeguarded by institutions and cultivated through education.

Guidance will always triumph over gatekeeping.

The World's Most Powerful AIs All Failed: Pattern Reasoning Becomes LLMs' Cognitive Graveyard

Sat, 14 Mar 2026 00:00:00 GMT

An Accidental "Crash Test"

March 14, 2026. The provincial civil service exam is just days away. Out of curiosity, I fed a set of real pattern reasoning questions to the world's most powerful AI models: OpenAI's GPT-5.4 Pro, Google's Gemini 3 Deep Think, Anthropic's Claude Opus 4.6, and China's Doubao.

The result? A total wipeout.

What made it even more laughable was that Gemini 3 Deep Think, the model that supposedly crushes human experts on "Humanity's Last Exam," started spouting nonsense when faced with these entry-level civil service exam pattern questions. Meanwhile, GPT-5.4 Pro and Doubao took the "smarter" approach: they simply triggered web searches to look up the original questions and answers from exam prep websites.

That's not problem-solving. That's cheating.

After disconnecting from the internet and retesting, every model immediately showed its true colors: answers were either completely wrong, or the "patterns" they identified could only explain some of the figures and were logically inconsistent.

This made me wonder: These super AIs can write code, prove mathematical theorems, and pass the bar exam — so why can't they handle a few "find the pattern" picture puzzles?

Layer 1: Blind from the Start — The Innate Deficiency of Visual Encoding

To understand why AI can't do pattern reasoning, you first need to understand how it "sees" images.

All current multimodal LLMs process images through roughly this pipeline:

Image → Visual Encoder (ViT) → Image Tokens → Language Model Processing

The problem lies at the very first step.

Mainstream visual encoders (like Vision Transformer) were designed from the start to optimize for semantic recognition — enabling AI to instantly recognize whether an image contains a cat, a dog, or a landscape. But what do civil service pattern reasoning questions test? Fine-grained geometric structures: how many lines there are, how many intersection points, how many enclosed regions, which direction the axis of symmetry faces, how many degrees something has rotated.

This low-level structural information gets "lossy compressed" away during the encoding stage.

Here's an analogy: Asking AI to do pattern reasoning is like asking someone to look at pictures through frosted glass — they can tell it's "roughly a triangle," but they can't count how many line segments are intersecting inside it.

Even worse, visual encoders split images into small patches for processing. The tiny intersection points, open/closed line endpoints, and precise element positions in civil service pattern questions can easily be chopped up or blurred at patch boundaries.

If the first step is wrong, how could anything after it be right?
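The patch-boundary problem can be illustrated with a toy calculation. This is only a sketch: the 16-pixel patch size and the 32x32 image are assumptions chosen for illustration, not values taken from any specific model, and real encoders do exchange information between patches through attention. It simply shows that a tiny crossing of two strokes can land in four different patches at once.

```python
PATCH = 16  # assumed patch size in pixels (illustrative only)

def patches_touched(pixels, patch=PATCH):
    """Return the set of (patch_row, patch_col) cells containing any given pixel."""
    return {(r // patch, c // patch) for r, c in pixels}

# A tiny 3x3 crossing of two diagonal strokes, centred on pixel (16, 16)
# of a hypothetical 32x32 image.
crossing = [(16 + d, 16 + d) for d in (-1, 0, 1)] + [(16 + d, 16 - d) for d in (-1, 0, 1)]

cells = patches_touched(crossing)
print(sorted(cells))  # [(0, 0), (0, 1), (1, 0), (1, 1)]
```

A single intersection point ends up spread over four patches, so no single image token contains the whole crossing.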

Layer 2: No "Mental Canvas" — The Absence of Spatial Reasoning

What happens in the human brain during pattern reasoning?

Our parietal lobe activates a "mental canvas" where we rotate, flip, fold, and overlay shapes. When you see an unfolded diagram, you can mentally "fold" it into a cube. When you see a sequence of figures, you can mentally animate the elements and observe their trajectories.

AI has no such canvas.

What is the fundamental nature of a large language model? It's autoregressive token sequence prediction. Its entire reasoning process is built on the linear generation of "what's the next token." To handle spatial problems, it must first "translate" visual patterns into language descriptions, then reason within the language space.

This translation process creates a catastrophic information bottleneck:

A rotation relationship between shapes — a human spots it at a glance
AI needs to first describe: "The first figure has a line pointing upper-left at 45 degrees, the second figure has this line pointing upper-right at 45 degrees..."
And this description itself is often inaccurate

Even worse, AI lacks "visual working memory." When humans are solving problems, if a first hypothesis is disproved, our eyes automatically return to the figures to refocus and recount. Once AI generates its first round of descriptions, it can only keep building on top of this potentially erroneous description — it has no ability to "look back."

Layer 3: The Infinitely Open Rule Space — Not Knowing What's Being Tested

The trickiest aspect of civil service pattern reasoning is this: You never know which dimension of pattern the question is testing.

It could be line count, number of enclosed regions, symmetry, odd/even vertices for single-stroke drawing, element types, black-white ratios, rotation angles, translation steps... dozens of possible pattern dimensions, and often composites of multiple patterns.

What do humans rely on? Rapid visual intuition for screening.

With a single sweep across the figure sequence, the brain automatically notices certain "conspicuous" feature changes, then rapidly forms hypotheses, verifies them, eliminates possibilities, and re-hypothesizes... This is a highly parallel, non-linear cognitive process.

What does AI rely on? Sequential testing of verbalized rules.

It lacks that "catch the key insight at a glance" intuition. It can only check each possible pattern one by one in some order. Not only is this extremely inefficient, but more fatally — since it already got the first step wrong (accurately perceiving figure features), all subsequent rule-checking is built on a flawed foundation.

Layer 4: Paradigm Conflict — Probabilistic Generation vs. Rigid Deduction

This is the most fundamental issue — and the hardest gap to bridge.

The underlying logic of LLMs is probabilistic prediction. Their training objective is to learn statistical correlations from massive data and output "the most probable text sequence." The core capability is "correlation fitting," not "causal deduction."

The underlying logic of civil service pattern reasoning is rigid deduction. The pattern you identify must apply 100% to all figures in the question stem, corresponding to exactly one correct option. There's zero tolerance for probabilistic ambiguity.

A proper solution process should look like this:

Narrow down the test dimension → Propose a pattern hypothesis →
Verify against every stem figure one by one →
If inconsistency found, immediately reject → Try next dimension →
Find a pattern that fits 100% → Match against all options →
Eliminate distractors → Lock in the unique answer

This is a falsifiable, backtrackable, error-correctable closed-loop reasoning process.
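The closed loop described above can be sketched in a few lines of Python. This is a minimal illustration under strong assumptions: each figure is already reduced to a dictionary of numeric features (the feature names `lines` and `regions` are hypothetical), and the only rule family tried is "constant step from one figure to the next." Real exam questions use far richer rules.

```python
def predict_next(values):
    """Return the next value if the sequence has a constant step, else None."""
    steps = {b - a for a, b in zip(values, values[1:])}
    if len(steps) != 1:
        return None  # hypothesis falsified by at least one stem figure
    return values[-1] + steps.pop()

def solve(stem, options):
    """Try each feature dimension in turn; accept only a rule that fits every
    stem figure and singles out exactly one option."""
    for feature in stem[0]:
        expected = predict_next([fig[feature] for fig in stem])
        if expected is None:
            continue  # reject and move on to the next dimension
        matches = [name for name, fig in options.items() if fig[feature] == expected]
        if len(matches) == 1:
            return feature, matches[0]
    return None

stem = [
    {"lines": 3, "regions": 1},
    {"lines": 5, "regions": 2},
    {"lines": 4, "regions": 3},
    {"lines": 6, "regions": 4},
]
options = {
    "A": {"lines": 7, "regions": 3},
    "B": {"lines": 2, "regions": 5},
    "C": {"lines": 8, "regions": 4},
    "D": {"lines": 5, "regions": 6},
}
print(solve(stem, options))  # ('regions', 'B')
```

The `lines` hypothesis is rejected because its steps are not constant across the stem, and only then is `regions` tried. That rejection step is exactly what a single forward pass of text generation does not guarantee.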

LLM generation, however, is unidirectional, linear, and non-backtracking. It simply generates the "highest-probability pattern + answer" based on input, without rigorous exhaustive verification, and without proactively overturning wrong hypotheses.

The result: AI frequently outputs a "half-right pattern" — one that explains only some of the stem figures, or where multiple options could match. In civil service exams, this is fatal, because test designers specialize in crafting exactly these traps.

Layer 5: Structural Gaps in Training Data

"Then just feed AI more pattern reasoning training data, right?"

Not that simple.

First, in LLM pretraining corpora, civil service pattern reasoning content accounts for a vanishingly small fraction. The vast majority of image-text data on the global internet consists of "natural images + semantic descriptions" (beach sunsets, cute pets, product photos), not "abstract geometric figures + logical reasoning chains."

Second, even if a model sees large numbers of real exam questions during fine-tuning, what it learns is merely the statistical association of "this image corresponds to correct option C," not the reasoning process in the explanation.

This explains why:

Original questions can be answered correctly (via memory matching or search)
Slight variations (change an element, modify a number) cause immediate failure

Finally, the core reasoning process in pattern reasoning is non-verbal spatial-visual operations. "Mentally rotate this figure 90 degrees" — this action is very difficult to fully describe in language. Even when forcing AI to output a chain of thought (CoT), it's merely "using language to pretend to reason" without actually completing the spatial operation.

Why Did They Choose to "Cheat"?

Returning to the opening observation: why did GPT-5.4 Pro and Doubao resort to searching for answers online?

This actually demonstrates that the models "know" they can't do it.

When AI receives a pattern reasoning question, its visual module feeds back chaotic, low-confidence features to the central system. Meanwhile, its OCR capability is extremely strong, instantly recognizing format features in the question (nine-grid layout, keywords like "select from the given options").

It immediately realizes: this is a standardized test question, and the original question with answers likely exists on the internet.

Since its own hard-computed confidence is very low, while calling a search engine might directly match the original question and achieve 100% accuracy — the model naturally chooses the path of "least resistance, highest reward."

This isn't a bug. It's "smart" behavior trained through RLHF (Reinforcement Learning from Human Feedback). It just happens to look like blatant cheating from our perspective.

Once disconnected from the internet, they had nowhere to hide.

Where Is the Path Forward?

There's an academic consensus emerging: to truly crack abstract visual reasoning (like the famous ARC Challenge), simply increasing parameter counts is far from sufficient.

The promising direction is Neuro-symbolic AI:

Rather than having the model "squint hard at the image," the system would first invoke a precise visual analysis program (like OpenCV) to extract structural features such as enclosed-region counts, intersection points, and axis-of-symmetry coordinates, converting them into exact symbolic matrices. Then the LLM's logical capabilities would be used to deduce numerical patterns.
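As a rough illustration of the "extract exact symbols first" idea, the sketch below counts enclosed regions with a plain flood fill. It is a stand-in written in pure Python rather than an OpenCV pipeline, and the text-grid representation (`#` for strokes, `.` for background) is an assumption made for the example.

```python
from collections import deque

def count_enclosed_regions(grid):
    """Count background regions ('.') that do not touch the border of the grid."""
    rows, cols = len(grid), len(grid[0])
    seen = set()
    enclosed = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != "." or (r, c) in seen:
                continue
            # Breadth-first flood fill of one connected background region.
            queue = deque([(r, c)])
            seen.add((r, c))
            touches_border = False
            while queue:
                y, x = queue.popleft()
                if y in (0, rows - 1) or x in (0, cols - 1):
                    touches_border = True
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] == "." and (ny, nx) not in seen):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            if not touches_border:
                enclosed += 1
    return enclosed

# A rectangle split by one vertical stroke: two enclosed regions.
figure = [
    ".........",
    ".#######.",
    ".#..#..#.",
    ".#..#..#.",
    ".#######.",
    ".........",
]
print(count_enclosed_regions(figure))  # 2
```

Once every figure is reduced to numbers like this, finding the pattern becomes a small symbolic problem instead of a perception problem.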

At CVPR 2023, there was a solver specifically designed for Raven's Progressive Matrices that used a hybrid architecture of "perception module for attribute extraction + algebraic symbolic reasoning," achieving 93.2% accuracy on the I-RAVEN dataset — higher than the human benchmark of 84.4%.

This demonstrates that the issue isn't "machines inherently can't do this" — it's that "handing this task end-to-end to a general-purpose chat model" was never the right approach.

Final Thoughts

Civil service pattern reasoning — a task that seems like "just a few find-the-pattern puzzles" — has unexpectedly become a mirror reflecting the boundaries of current AI capabilities.

It precisely strikes at three major weaknesses of large language models:

Insufficient visual perception precision — can't see accurately
Missing spatial reasoning mechanisms — can't manipulate mentally
Absent rigid deduction capability — can't reason strictly

This also reminds us: AI's "intelligence" and human "intelligence" may not be the same thing at all.

It can find statistical patterns across massive text corpora, fluently generate code and articles, and pass professional exams requiring extensive knowledge — but when facing a simple task that requires "truly seeing a figure, truly manipulating it mentally, and truly verifying a pattern with logic," it remains helpless.

Perhaps this is one of the last moats of human intelligence.

At least in 2026, civil service pattern reasoning remains a battlefield that belongs to human test-takers.

If you've also tested AI on pattern reasoning, feel free to share your "crash" stories in the comments.

Perplexity Max Is Great, But I Won't Subscribe

Thu, 12 Mar 2026 00:00:00 GMT

On March 11, 2026, Perplexity held its first developer conference — Ask 2026 — in a converted church in San Francisco.

A company that started with AI search launched a "personal computer" agent, an enterprise edition of Computer, and the iOS browser Comet, and even announced a security partnership with cybersecurity giant CrowdStrike, all in one event. CEO Aravind Srinivas said something ambitious on stage: "Traditional operating systems receive commands; AI operating systems receive goals."

Taken together, the signal is clear: Perplexity doesn't want to be just a search engine anymore. It wants to be the operating system of the AI era.

This article will focus on the two most noteworthy features — Model Council (multi-model committee) and Computer (multi-model agent) — providing a complete breakdown from mechanism to value to limitations. I'll finish with my honest take on whether the $200 monthly fee is worth it.

I. Model Council: Three Models Argue, a Fourth Judges

What It Actually Is

Model Council launched on February 5, 2026 as an exclusive multi-model research feature for Perplexity Max members.

The mechanism is straightforward: you ask a question, the system sends it simultaneously to three frontier LLMs (say Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro), each generates an independent response, and then a fourth "chairman model" reviews all outputs and synthesizes a unified answer annotating consensus areas and points of disagreement.

Users can expand to view each model's complete original response and switch between different model combinations.
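The fan-out-then-synthesize flow described above can be sketched as follows. Everything here is hypothetical: the member models are stub functions rather than real API calls, and the "chair" is a simple majority vote, whereas the real chairman is itself a language model. The point is only the shape of the pipeline: same question to every member, raw answers kept, disagreement reported explicitly.

```python
from collections import Counter

def run_council(question, members, chair):
    """Ask every member the same question, then let the chair synthesize."""
    answers = {name: ask(question) for name, ask in members.items()}
    return chair(answers), answers

def majority_chair(answers):
    """Toy chair: report the majority answer and list the dissenting members."""
    tally = Counter(answers.values())
    consensus, votes = tally.most_common(1)[0]
    dissent = sorted(name for name, a in answers.items() if a != consensus)
    return {"consensus": consensus, "votes": votes, "dissent": dissent}

# Stub members standing in for real model calls (hypothetical names and replies).
members = {
    "model_a": lambda q: "yes",
    "model_b": lambda q: "yes",
    "model_c": lambda q: "no",
}

summary, raw = run_council("Is the claim supported?", members, majority_chair)
print(summary)  # {'consensus': 'yes', 'votes': 2, 'dissent': ['model_c']}
```

Returning the raw answers alongside the summary mirrors the ability to expand each model's original response.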

Design Philosophy: Making Disagreement Visible

The most interesting aspect of this feature isn't the "synthesis" — it's the visualization of disagreement.

When three models converge on a judgment, you gain higher confidence. When they show clear disagreement, you know the issue needs further investigation rather than blind trust in any single model's output. Conceptually, this is closer to ensemble methods in machine learning than a mere model selector.

Officially recommended use cases include investment research, high-stakes personal decisions, and multi-perspective analysis of complex issues. Within Computer workflows, Model Council serves as the "critical checkpoint reviewer," subjecting specific analysis or review steps to multi-model cross-examination.

My Take: Interesting, But Not Necessarily Worth Paying For

Model Council's approach is genuinely thought-provoking. In an era where AI outputs are plagued by hallucinations and biases, using multi-model cross-validation to improve reliability is logically sound.

But here's the thing: You can do this yourself.

Ask ChatGPT, Claude, and Gemini the same question separately, compare three windows side by side, and manually judge which response is most reliable. This workflow is a bit clunky but virtually free (if you already subscribe to each), and being your own judge means you're actively exercising judgment rather than delegating it to yet another "chairman model" that you also can't verify.

Model Council's value lies in convenience and structured presentation, but it adds no information that you couldn't obtain by doing the comparison manually. For anyone with reasonable AI experience, "having your own judgment" matters far more than "letting a fourth model judge for you."

II. Perplexity Computer: 19 Models, One "Digital Employee"

What It Actually Is

Perplexity Computer launched for consumers on February 25, with the enterprise version and "Personal Computer" local agent announced at Ask 2026 on March 11.

Computer is positioned as a cloud-based multi-model AI agent orchestration platform. You describe a goal in natural language (say, "Create a competitive analysis report for this industry"), the system automatically decomposes it into subtasks, routes each subtask to the most suitable AI model, executes autonomously in the background (potentially for hours), and delivers the finished product.

It orchestrates over 19 models: Claude Opus 4.6 handles core reasoning, Gemini manages deep research, GPT-5.2 handles long-context search, Grok runs lightweight tasks, Nano Banana generates images, Veo 3.1 generates video, and GPT-5.3-Codex specializes in code. Each task runs in an isolated sandbox environment with real file systems and browsers.
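The routing step can be pictured as a lookup table. The sketch below is an assumption-heavy illustration: the model names come from the list above, but the task "kinds," the fallback choice, and the idea that subtasks arrive already labeled are all hypothetical (in the real product, decomposition and routing are presumably done by a model, not a dictionary).

```python
# Roles as listed in the article; the task-kind labels are hypothetical.
ROUTES = {
    "reasoning": "Claude Opus 4.6",
    "deep_research": "Gemini",
    "long_context_search": "GPT-5.2",
    "lightweight": "Grok",
    "image": "Nano Banana",
    "video": "Veo 3.1",
    "code": "GPT-5.3-Codex",
}

def route(subtasks, default="Claude Opus 4.6"):
    """Pair each (kind, description) subtask with the model assigned to that kind.

    Unknown kinds fall back to `default`, which is an assumption of this sketch.
    """
    return [(desc, ROUTES.get(kind, default)) for kind, desc in subtasks]

plan = route([
    ("deep_research", "collect competitor filings"),
    ("code", "build a comparison dashboard"),
    ("image", "draw the cover chart"),
])
for desc, model in plan:
    print(f"{desc} -> {model}")
```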

Over 400 connectors integrated: Gmail, GitHub, Slack, Notion, Salesforce, Snowflake, and more.

The Personal Computer announced on March 11 goes further — it's resident software running on your own Mac mini, giving AI agents 24/7 access to your local files and applications while inference still runs in Perplexity's cloud.

The March 6 Update

Computer's first major update after launch landed on March 6, expanding in four directions:

Custom Skills — You can write "capability descriptions" for repetitive tasks (like fixed report templates or writing style requirements), and Computer will automatically invoke them for relevant tasks without re-explaining each time.

Embedded Model Council — Directly invoke three-model parallel review within Computer workflows, providing cross-validation for critical decision steps.

Voice Mode — Describe tasks, give mid-process feedback, or adjust direction using voice.

GPT-5.3-Codex Coding Sub-Agent — When encountering complex coding tasks, automatically assigns to a dedicated code model that can build full-stack applications from scratch and even debug through browser DevTools with GitHub integration.

My Take: Concept Is Stunning, Execution Is Questionable

Computer's architecture is genuinely impressive. 19 models dispatched on demand, nested multi-agent workflows, sandbox execution, asynchronous long-running tasks — from a technical vision standpoint, this may be the most aggressive multi-model agent solution on the market.

But several practical issues are hard to ignore:

First, credit consumption is opaque and expensive. A Builder.io reviewer reported spending $200 in two days to build a single webpage. Failed tasks still consume credits, and you can't predict how much any given task will cost. This pricing model is essentially a black box for users.

Second, for complex coding tasks, the tool that delivers reliably today is still Claude Code. While Computer also integrates coding capabilities, Claude Code's stability and developer experience remain the industry benchmark. Computer is more like Claude Code wrapped in an agent shell, but that shell itself adds uncertainty and cost.

Third, Computer's positioning heavily overlaps with Manus. Both are natural-language-driven, auto-decomposing, background-executing agent systems. Computer's differentiation lies in multi-model orchestration and Perplexity's search capabilities, but if the core advantage is merely "more comprehensive search sources," whether that premium is justified is debatable.

III. The Unavoidable Question: Is $200/Month Worth It?

Model Council and Computer are both exclusive to Perplexity Max members at $200/month.

Where does this price sit in the current AI subscription market? Claude Max runs about $100 and gives heavy Opus usage. OpenAI Pro at $200 provides GPT-5.4 Pro and higher usage quotas.

What's included in Perplexity Max's $200? Model Council, Computer (with credits), Deep Research, and unlimited access to all models. Sounds comprehensive, but several concerns linger:

Does Claude Opus get degraded through the Max subscription? This is a repeatedly discussed question in the community. When Perplexity acts as a middleware layer calling Anthropic's API, prompt packaging, context management, and potential token truncation can all affect output quality. The Opus you use through Perplexity may not deliver an identical experience to the Opus in Claude's official client.

Computer's credit consumption is another murky area. The $200 monthly fee doesn't mean unlimited Computer usage: complex tasks can rapidly exhaust your credit quota. Moreover, Perplexity has a precedent of slashing Deep Research quotas from roughly 500/day to 20/month, triggering widespread criticism of a "bait and squeeze" strategy.

Perplexity's "track record" is also worth noting. From early accusations of unauthorized content scraping, to copyright disputes with multiple publishers, to the March 11 federal court ruling banning its AI shopping agent from accessing Amazon, to reports that free Pro memberships obtained through promotional channels were silently revoked, this company has never shied away from an aggressive "act first, ask later" approach. This style may drive innovation speed, but it also means product strategies and pricing can shift at any moment, and users' existing benefits may not be reliably protected.

IV. Perplexity's True Moat: Search

Having noted many shortcomings, I should acknowledge Perplexity's core strength.

Its search sources are genuinely comprehensive. This point has been widely validated among Chinese internet users who've subscribed to Max. Opus 4.6 combined with Perplexity's proprietary search pipeline delivers research query performance that genuinely surpasses using any single model's search function alone. Seven parallel search types (web, academic, people, images, video, shopping, social) plus premium data sources like PitchBook and Statista give it real advantages in both breadth and depth of information retrieval.

If your core need is high-frequency deep research — financial due diligence, market analysis, technology evaluation — Perplexity's search capability is its most compelling selling point.

But if your needs center on code development, creative writing, or everyday conversation, this search advantage doesn't align with your use case.

How Long Can the Moat Hold?

One must face an industry consensus: Perplexity has always been viewed as a "wrapper" company. It doesn't train its own foundation models. Its core product is built on APIs from OpenAI, Anthropic, Google, and others, with virtually no model-layer innovation. What it does — combining top SOTA models with comprehensive search sources — does produce an excellent research experience. That's undeniable.

The problem is that neither of the two key ingredients in this recipe is in its hands.

OpenAI's ChatGPT already has web search and Deep Research capabilities. Anthropic has launched Claude's Web Search tool and Deep Research. Google's Gemini naturally sits atop the world's largest search index. When model providers themselves fill in the search gap, Perplexity's value as a middleware layer gets continuously compressed. This is why the "Perplexity will die" narrative never goes away in the AI community — not because it does a bad job, but because its core capabilities are too easily replicated by upstream providers.

Perplexity clearly recognizes this, which is why it's racing toward an agent platform: Computer, Personal Computer, Comet browser, enterprise edition... every move is an attempt to transition from "search middleman" to "AI operating system," building deeper product stickiness before users leave. The strategic direction is clear-eyed, but whether it can outrun time is another matter entirely.

V. My Conclusion

I won't be subscribing to Perplexity Max.

The reason is simple: compared to Claude Max and OpenAI Pro, the value-for-money isn't there. Computer's concept is forward-looking, but the credit black box, unstable quota policies, and the awkward "can do it but not well enough" reality in actual use make it hard for me to justify $200 a month. Model Council's multi-model cross-validation approach has value, but manual operation is a perfectly viable substitute, and being your own judge is more reliable than relying on a fourth model.

If you're considering subscribing, I'd suggest asking yourself two questions:

First, is your core need search or execution? If it's search, a Pro membership ($20/month) might be sufficient. If it's executing complex tasks, Claude Code is still the more stable choice today.

Second, can you accept the risk of pricing and quotas changing at any time? Perplexity is a company still iterating rapidly (and experimenting rapidly). The uncertainty in product strategy is real.

What Perplexity is building (multi-model orchestration, agent workflows, an AI-native operating system) is directionally correct. But "directionally correct" and "worth buying now" are separated by a long road.

Rather than chasing the latest paid features, invest your time in genuinely improving your own judgment. After all, no "committee" of models can substitute for your own independent thinking.

This article was written on March 12, 2026, based on Perplexity's official blog, changelog, and help center documentation, as well as reporting from TechCrunch, VentureBeat, Digital Trends, Axios, AppleInsider, and other technology media. Views expressed represent the author's personal opinions and do not constitute subscription or investment advice.

The Industrial Recipe for Synthetic Data: HuggingFace's 90 Experiments Reveal the Laws of Pretraining Data Production

Wed, 11 Mar 2026 00:00:00 GMT

As LLM training enters the era of "data is king," efficiently generating high-quality synthetic data has become a critical challenge. HuggingFace spent 12.7 GPU-years running 90 controlled experiments, finally turning this "alchemy" into reproducible "chemistry."

I. Synthetic Data: The Fourth Paradigm Shift in LLM Training

The pretraining data for large language models has gone through several clear evolutionary stages.

Initially, researchers trained language models on small but high-quality corpora like Wikipedia. Then, datasets like C4 and The Pile pushed the scale to hundreds of gigabytes. Next, projects like FineWeb and DCLM expanded data volumes to trillions of tokens, covering nearly the entire crawlable internet.

Once web text approached its collection limit, the focus shifted to quality filtering: using neural network classifiers to find "educational" or "instructional" content, filtering massive noisy data down to curated subsets.

Now, the fourth paradigm is taking shape — synthetic data.

NVIDIA's Nemotron-CC rewrote approximately 2 trillion tokens of web text, Zhipu's GLM-4.5 series generated 500 billion reasoning tokens for mid-training, and frontier models like Qwen3 and Phi-4 heavily incorporate synthetic content in their training data. Synthetic data has evolved from an "optional augmentation technique" to a "standard production step."

But the question remains: How exactly should you do it?

Which model should generate the data? What prompts should you write? Does source data quality matter? Should you mix it with original data? These questions were previously answered mostly by intuition and trial-and-error. The HuggingFace team decided to answer them with systematic experiments.

II. 90 Experiments, 1 Trillion Tokens, All to Answer One Question

The HuggingFace research team designed a large-scale ablation experiment framework:

Experiment scale: 90 complete train-evaluate cycles
Generation volume: Over 1 trillion tokens of synthetic text
Compute cost: Approximately 12.7 GPU-years (H100)
Evaluation method: Each experiment trained a 1.2B parameter proxy model, tested on 12 benchmarks

They explored along three main lines:

Rewriting strategies: Which format transformations actually work? Simple paraphrasing, Q&A pairs, step-by-step tutorials, structured tables...
Generation models: Is bigger always better? Do different model families matter? Are newer versions stronger?
Data mixing ratios: Does source data quality matter? Can synthetic data be used alone? What should it be mixed with?

The final output was FinePhrase, a synthetic pretraining dataset of 486 billion tokens that showed clear advantages over all baselines.

III. Core Finding: Prompt Design Is the Biggest Lever

Among variables like model size, model family, and source data quality, prompt design had by far the greatest impact.

The research team tested existing prompts from projects like Nemotron, REWIRE, and BeyondWeb, and also designed 9 entirely new formats. Results showed that only four formats could consistently beat the strongest raw data baseline, DCLM:

Winning Format	Core Feature
FAQ	Reorganizes content into Q&A pairs
Math	Converts into math word problems + solutions
Table	Extracts into structured tables
Tutorial	Rewrites as step-by-step tutorials

Simple paraphrasing (Article), review-style summaries (Commentary), conversational format (Discussion), and narrative retelling (Narrative) all performed unremarkably.

The key difference: The winning formats all restructured how knowledge is presented, rather than merely polishing the language.

FAQ makes implicit questions explicit, Table aggregates scattered information into indexable units, and Tutorial externalizes procedural logic. These transformations force the model to convert implicit knowledge in the original document into structured, explicit representations.

In other words, the value of synthetic data isn't in "saying the same thing with better wording" — it's in reshaping information into "curriculum formats" better suited for model learning.
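To make the idea of "restructuring" concrete, here is a minimal sketch of how the four winning formats could be expressed as rewriting prompts. The template wording and the `build_rewrite_prompt` helper are illustrative assumptions, not the prompts used in the study.

```python
# Hypothetical prompt templates: the wording is illustrative only and is
# NOT taken from the HuggingFace experiments.
REWRITE_TEMPLATES = {
    "faq": "Rewrite the document below as a list of question-and-answer pairs.",
    "math": "Turn the document below into math word problems with worked solutions.",
    "table": "Extract the key facts of the document below into structured tables.",
    "tutorial": "Rewrite the document below as a step-by-step tutorial.",
}


def build_rewrite_prompt(document: str, fmt: str) -> str:
    """Return a rewriting prompt for one source document in the given format."""
    if fmt not in REWRITE_TEMPLATES:
        raise ValueError(f"unknown format: {fmt!r}")
    instruction = REWRITE_TEMPLATES[fmt]
    # The source text is appended unchanged; only the instruction varies.
    return f"{instruction}\n\nDocument:\n{document.strip()}\n"


if __name__ == "__main__":
    print(build_rewrite_prompt("Water boils at 100 °C at sea level.", "faq"))
```

Each template asks for a change of structure rather than a change of wording, which is the property the winning formats share.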

IV. Counter-Intuitive Finding: A 1B Small Model Is Enough

The industry previously held a popular assumption: generating high-quality synthetic data requires 70B or even larger models. The REWIRE project used Llama-3.3 70B.

HuggingFace's experimental results directly refuted this assumption.

They compared the entire Gemma-3 series from 270M to 27B, and concluded:

Simple prompts: 1B parameters suffice — no significant difference between 1B and 27B
Complex prompts (like REWIRE's guided rewriting): 4B needed, but no difference between 4B and 27B
Low-quality source data: Larger models don't help "rescue" it either

On the cost-efficiency Pareto frontier, the small model + structured prompt combination dominated. A 27B model costs 5-10x more GPU resources than a 1B model, with zero improvement in generation quality.

Furthermore, in a head-to-head comparison of all 1B-class models, SmolLM2-1.7B crushed all competitors, including Qwen3, Gemma-3, Llama-3.2, Granite3, and Falcon3. And SmolLM2 was released more than a year ago.

The practical implication is very direct: Use the cheapest model, and invest all the savings into data volume.

V. The Most Counter-Intuitive Finding: "Worse" Output Is Actually Better

This is probably the most surprising conclusion in the entire study.

The research team compared the output quality of SmolLM2 and Qwen3 when generating math problems:

Metric	SmolLM2	Qwen3
Complete solution rate	68%	100%
Output length range	4-4000 tokens	100-2600 tokens
Format consistency	Messy	Perfect (with LaTeX)
Most common opening repetition rate	3/1000	115/1000

From a human aesthetic standpoint, Qwen3's output is impeccable. But the downstream models trained on SmolLM2's data actually performed better.

The reason is Template Collapse.

Qwen3 is too "obedient" — its outputs are highly homogeneous. Out of 1000 samples, 115 had identical openings. This uniformity looks like "standards" to humans, but it's a disaster for pretraining data. SmolLM2, though "sloppy," maintained extremely high text diversity.

This reveals a core paradox of pretraining data: What humans prefer as "neat" may not be what models need for "generalizability".

For pretraining, diversity matters far more than consistency. A model that is "less obedient" can actually produce better training data.
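A rough way to spot template collapse in your own generations is to measure how often the single most common opening appears. The sketch below assumes an "opening" means the first five words of a sample; the article does not say how the study defined it, so treat that choice (and the helper name) as an assumption.

```python
from collections import Counter


def most_common_opening_rate(samples, n_words=5):
    """Return (opening, share) for the most frequent n-word opening.

    Assumption: an "opening" is the first `n_words` whitespace-separated
    words, lowercased. The study's exact definition is not given here.
    """
    openings = [
        " ".join(text.split()[:n_words]).lower()
        for text in samples
        if text.strip()
    ]
    if not openings:
        return "", 0.0
    opening, count = Counter(openings).most_common(1)[0]
    return opening, count / len(openings)


if __name__ == "__main__":
    # Hypothetical generations, for illustration only.
    generations = [
        "Let us solve this problem step by step.",
        "Let us solve this problem now.",
        "A train leaves the station at noon.",
    ]
    print(most_common_opening_rate(generations))
```

A share that climbs toward the 115-in-1000 range reported for Qwen3 would be a hint that the generator is falling back on a template.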

VI. Capability Trade-offs: Synthetic Data "Trades Common Sense for Knowledge"

Analyzing experiment results benchmark by benchmark, a consistent pattern emerged:

Nearly all synthetic data significantly outperformed raw data on ARC (scientific knowledge), SQuAD (reading comprehension), and DROP (numerical reasoning)
But nearly all synthetic data underperformed raw data on HellaSwag and PIQA (common sense reasoning)

The macro scores appear roughly even, but only because the gains and losses offset each other.

Synthetic data, through structured rewriting, makes the factual knowledge in web pages "explicit," making it easier for models to learn retrievable information. But this process simultaneously strips away the common sense, contextual cues, and implicit rules about how the world works that exist in raw web text.

Synthetic data is essentially "trading common sense for knowledge."

This explains another key finding: Training on pure synthetic data is always worse than mixed training. Synthetic data must be blended with high-quality raw data to maintain capability balance.

Moreover, what you mix in matters critically:

High-quality source data → Mix in DCLM (to recover common sense signals)
Low-quality source data → Mix in FineWeb-Edu-HQ (to supplement knowledge signals)

An important finding from the team: The choice of mix-in dataset is sometimes more important than the source data itself. As long as the mix-in data is strong enough, even rewriting low-quality web pages can approach the effectiveness of rewriting high-quality data. This vastly expands the usable data pool.
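The mixing rule above can be sketched in a few lines. The function names, the `total` parameter, and the 50% fraction in the demo are assumptions for illustration; the article lists the optimal ratio as an open question.

```python
import random


def choose_mix_in(source_is_high_quality: bool) -> str:
    """Pick the mix-in dataset following the rule of thumb described above."""
    return "DCLM" if source_is_high_quality else "FineWeb-Edu-HQ"


def mix_documents(synthetic, raw, synthetic_fraction, total, seed=0):
    """Build a shuffled mixture of `total` documents.

    Roughly `synthetic_fraction` of them come from `synthetic`, the rest
    from `raw`. The fraction is a free parameter, not a recommended value.
    """
    if not 0.0 < synthetic_fraction < 1.0:
        raise ValueError("synthetic_fraction must be strictly between 0 and 1")
    n_synthetic = round(total * synthetic_fraction)
    n_raw = total - n_synthetic
    if n_synthetic > len(synthetic) or n_raw > len(raw):
        raise ValueError("not enough documents to build the requested mixture")
    rng = random.Random(seed)  # seeded so the mixture is reproducible
    mixed = rng.sample(synthetic, n_synthetic) + rng.sample(raw, n_raw)
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    synthetic_docs = [f"syn-{i}" for i in range(10)]
    raw_docs = [f"raw-{i}" for i in range(10)]
    print(choose_mix_in(source_is_high_quality=False))
    print(mix_documents(synthetic_docs, raw_docs, 0.5, total=8))
```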

VII. Quality Scores Completely Fail on Synthetic Data

FineWeb-Edu-score and DCLM-score are commonly used metrics for filtering high-quality web pages. But when applied to synthetic data, their predictive power weakens sharply, and for one of them it drops to nearly zero.

The DCLM-score's correlation with downstream performance was only 0.56-0.61 (moderate), while the Edu-score's correlation was a mere -0.08 (essentially uncorrelated).

Even more ironic: Edu-score actually penalizes format transformations that improved performance. When text was converted into tables, FAQs, or mathematical notation, the Edu-score judged "quality decreased" — yet these were precisely the best-performing formats.

The reason: these scorers were trained on "natural web text" and favor coherent long-form narratives. Structured formats appear as "anomalies" to them, even though they are "optimal" for model learning.

The conclusion is harsh: there are no shortcuts. You must complete the full "generate → train → evaluate" pipeline to know the true quality of synthetic data.

VIII. The Cost Revolution at the Engineering Level

Cost is another core issue in synthetic data generation.

The REWIRE project used a 70B model to generate 400 billion tokens, requiring an estimated ~350,000 GPU hours. HuggingFace's FinePhrase used a 1.7B model to generate 486 billion tokens in only ~14,700 GPU hours.

Efficiency comparison:

Project	Generation Model	Token Volume	GPU Hours	Efficiency (tokens/GPU hour)
Cosmopedia	Mixtral 8x7B	25B	>10K	<2.5M
REWIRE	Llama-3.3 70B	400B	~352K	~1.1M
FinePhrase	SmolLM2-1.7B	486B	~14.7K	~33.1M

FinePhrase's generation efficiency is roughly 30x that of REWIRE and more than 13x that of Cosmopedia.
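The efficiency column is simply tokens divided by GPU hours, so it is easy to recompute from the table. A small sketch using the table's own figures (Cosmopedia is left out because only bounds are given for it):

```python
# (tokens generated, GPU hours) as listed in the comparison table above.
PROJECTS = {
    "REWIRE": (400e9, 352_000),
    "FinePhrase": (486e9, 14_700),
}


def tokens_per_gpu_hour(tokens: float, gpu_hours: float) -> float:
    """Generation efficiency in tokens per GPU hour."""
    return tokens / gpu_hours


rates = {name: tokens_per_gpu_hour(*figures) for name, figures in PROJECTS.items()}
speedup = rates["FinePhrase"] / rates["REWIRE"]

if __name__ == "__main__":
    for name, rate in rates.items():
        print(f"{name}: {rate / 1e6:.1f}M tokens per GPU hour")
    print(f"FinePhrase vs REWIRE: {speedup:.1f}x")
```

With the unrounded inputs the ratio comes out at about 29x, which matches the "roughly 30x" figure obtained from the rounded efficiency column.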

Key optimizations included:

Speculative Decoding: Extremely effective for small models — SmolLM2 achieved a 1.75x speedup
Tensor Parallelism Optimization: Frees KV cache space for large MoE models
Flash-Attn Backend: Over 50% faster than FlashInfer (on H100)

This means synthetic data production has gone from being "an exclusive game for compute giants" to an engineering practice accessible to small and mid-sized teams.

IX. Clarification on "Model Collapse"

Academia frequently warns that AI training on its own generated data leads to "Model Collapse."

HuggingFace directly addressed this concern at the beginning of their paper: This collapse only occurs under extremely closed experimental conditions — where a model iteratively trains on its own output without introducing any new information.

Real-world industrial practice is entirely different:

Synthetic data is mixed with human data
Prompts draw on diverse source materials
Synthetic data is a strategic supplement, not a wholesale replacement

In their FineWeb research, the team even found that naturally occurring AI-generated content on the web did not cause model degradation.

The real concern isn't ordinary synthetic data practices, but rather the extreme scenario where frontier models generate data for other frontier models in a closed loop. Synthetic data that is thoughtfully integrated with fresh perspectives isn't the problem — it's the solution.

X. The Practical Recipe: FinePhrase's Final Configuration

Based on systematic validation across 90 experiments, HuggingFace delivered a concise best-practice recipe:

Generation model: SmolLM2-1.7B-Instruct
Prompt format: FAQ, Math, Table, Tutorial (pick one or mix)
Source data: FineWeb-Edu (relaxed quality requirements)
Mix-in data: DCLM or FineWeb-Edu-HQ
Inference optimization: suffix-32 speculative decoding + 0.9 memory utilization

The core logic of this recipe:

Use structured prompts to reshape knowledge formats — this is the biggest lever
Use the smallest adequate model — invest savings into data volume
Use strong mix-in data as a safety net — recover common sense signals, relax source data requirements
Use engineering optimizations to compress costs — make synthetic data production sustainable

XI. Unanswered Questions

HuggingFace candidly listed the boundaries and open questions of this research:

Repetition and rewriting: If data is rewritten each time it's repeated, can performance degradation be avoided?
Mixing ratios: What proportion of synthetic data is optimal? 5%, 20%, or 50%?
Sampling strategies: Is Best-of-N filtering effective?
Scale effects: Do these findings hold at 100B+ token training scales?
Automated optimization: Can tools like DSPy be used to automatically search for optimal prompts?

These questions define the agenda for the next phase of synthetic data research.

Conclusion: From "Alchemy" to "Chemistry"

The fundamental contribution of this research isn't releasing yet another larger dataset — it's transforming synthetic pretraining data generation from experience-driven trial-and-error into a verifiable, reproducible systematic methodology.

Several core conclusions deserve repeated emphasis:

Prompt design is the primary productivity driver — restructure formats, don't polish language
Small models are good enough — 1B-class suffices; don't worship parameter counts
Diversity beats consistency — "obedient" models may actually produce worse data
Raw data must be mixed in — synthetic data "trades common sense for knowledge"
Quality scores are unreliable — you must complete the full train-evaluate pipeline

Synthetic data is evolving from an "optional data augmentation trick" to a "core production step in LLM training." And this research provides the clearest industrial-grade operating guide to date.

References:

The Synthetic Data Playbook: Generating Trillions of the Finest Tokens

🦞 A Lobster's Rise: From Clawdbot to OpenClaw - What Did This AI Crustacean Go Through?

Fri, 30 Jan 2026 00:00:00 GMT

"Two months ago, I just spent a weekend casually building a small project. Now it has over 100K stars on GitHub and attracted 2 million visits in a single week."

These are the words of Peter Steinberger (@steipete), the founder of OpenClaw.

You might not know him, but you've probably used his products - he's the founder of PSPDFKit, the PDF framework that almost every iOS developer has heard of. After the company was acquired in 2023, Peter planned to retire and enjoy life. Instead, he accidentally created one of the fastest-growing open-source projects in GitHub history.

Imagine: a weekend project you casually built suddenly goes viral worldwide, and even the legal team at Anthropic (Claude's parent company) reaches out... This plot is more dramatic than a TV series.

Today, let's talk about the story of this "lobster's" rise.

🦞 Chapter 1: The Birth of Clawdbot - A "Copycat" Lobster's Debut

In November 2025, Peter had a sudden inspiration: he wanted to build an AI assistant that he could use on WhatsApp.

Initially, it was just a little thing called "WhatsApp Relay." But Peter got more and more into it, eventually giving it a proper name: Clawdbot - Claude (Anthropic's AI) + Claw (lobster claw), complete with a cute lobster mascot called Clawd.

Yes, it's a pun.

What's special about this "weekend project"?

It runs entirely on your own computer.

Not one of those "upload your data to someone else's server" SaaS services, but truly "your computer, your API keys, your data." Laptop, home server, VPS - your choice.

As one community member put it: "This is infrastructure that truly belongs to you."

Clawdbot quickly spread through developer circles. GitHub stars broke 9,000 within 24 hours, and it surpassed 100K two months later. After all, who wouldn't want an AI assistant that can help you reply to emails, check your calendar, and be ready to serve across 13 platforms including WhatsApp, Telegram, Discord, Slack, Signal, and iMessage?

Moreover, it remembers everything about you - your preferences, your habits, your previous conversations. It reads your SOUL.md to understand your personality and MEMORY.md to remember your history.

"This thing is way smarter than Siri!" someone commented.

Others remarked: "2026 is truly the year of personal AI agents."

🔄 Chapter 2: Moltbot - The Awkward Moment of Forced "Molting"

In January 2026, just when Clawdbot was at its peak, Peter received an email.

From: Anthropic Legal Team.

The content was polite, but the message was clear: "Clawdbot and Clawd are too similar to our Claude. Please change the name."

Peter was reasonable about it. After all, they're a multi-billion dollar company, and he's just an individual developer - no need to fight back.

But the question was: what to change it to?

At 5 AM on January 27th, Peter kicked off a naming brainstorm on Discord. Community members went wild with ideas, and finally settled on Moltbot.

Molting is how lobsters grow - they shed their old shell to grow a bigger new one. This meaning was too perfect: the project was also experiencing a transformation, becoming stronger.

Peter himself was satisfied: "Anthropic asked us to rename (trademark issue), honestly? 'Molt' is perfect - that's how lobsters grow."

The mascot also changed from Clawd to Molty.

But renaming came with more than a few headaches:

Old users were confused: "Why did Clawdbot suddenly stop working?"
Someone registered the old brand's social accounts within 10 seconds to post crypto scam messages
A fake $CLAWD token pumped to $16 million market cap before crashing
All the old repository links on GitHub became invalid

Peter had to urgently contact friends at X (Twitter) and GitHub to get these issues under control.

This experience teaches us: rebranding is truly a tough battle. And internet scammers are always faster than you.

✨ Chapter 3: OpenClaw - The Lobster's Final Form

Just two days later, on January 29th, Peter announced: The final name is decided - OpenClaw.

Wait, another change?

It turned out that "Moltbot," despite its nice meaning, still had some trademark and domain issues. This time, Peter was prepared:

✅ Trademark search passed
✅ All domains secured (openclaw.ai)
✅ Migration code written in advance
✅ openclaw doctor command automatically handles config migration

Open represents open source, openness, and community-driven development.
Claw is a tribute to the lobster heritage, also implying this is an AI that can "take action."

In Peter's words: "The lobster has finally completed its ultimate molt. Welcome to OpenClaw."

(By the way, the mascot is still that lobster Molty - some things are sacred and cannot be changed🦞)

🚀 What Can OpenClaw Do Now?

I have to say, after all these rounds of evolution, OpenClaw has become quite a mature AI assistant platform. Its 107K+ stars, 15K+ forks, and 8,300+ commits on GitHub point to an active global community.

📱 Full Platform Coverage

WhatsApp, Telegram, Discord, Slack, Signal, iMessage, Google Chat, Microsoft Teams, Matrix... 13 messaging platforms in total. Wherever you chat, it follows you there.

🧠 True "Memory"

Unlike those AIs that "forget after chatting," OpenClaw remembers everything about you:

AGENTS.md — Agent configuration file
SOUL.md — Personality settings
TOOLS.md — Tool preferences
MEMORY.md — Memory storage

It truly gets to know you better over time.

🎙️ Voice Activation

Supports an "Always-on Speech" feature on macOS, iOS, and Android, with natural voice interaction through ElevenLabs. Imagine just calling out to your phone to have the AI help you with tasks.

🌐 Browser Control + System Access

Let it help you:

Browse web pages, fill forms, scrape data
Read and write files, run scripts, execute commands
Achieve web automation through a dedicated Chrome/Chromium instance
Even extend functionality through 700+ community skills

🔒 Security First

In this renamed release, the team landed 34 security-related code changes. It uses Docker sandbox mode by default to isolate non-primary sessions and supports tool whitelist and blacklist configurations.

Peter specifically cautions that prompt injection remains an industry-wide challenge. He recommends using strong models like Claude Opus 4.5 and following security best practices.

🛠️ Migration Guide for Existing Users

If you've used Clawdbot or Moltbot before, don't worry - migration is super simple. The installation script will automatically handle everything for you.

One-Click Upgrade to OpenClaw

# Run the installation script, it will automatically detect old configs and migrate
curl -fsSL https://openclaw.ai/install.sh | bash

It's that simple. The installation script will automatically:

Detect your system environment (macOS/Linux)
Verify Node.js version (requires v22+)
Install the latest version of OpenClaw
Run openclaw doctor to auto-migrate configurations

You'll see output like this:

◇  Doctor changes ─────────────────────────────────────────────────────────╮
│  - State dir: ~/.clawdbot → ~/.openclaw (legacy path now symlinked)      │
│  - Migrated legacy config: ~/.clawdbot/clawdbot.json →                   │
│    ~/.openclaw/openclaw.json                                             │
├──────────────────────────────────────────────────────────────────────────╯
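If you'd rather confirm the Node.js requirement yourself before piping the installer into bash, a manual pre-check could look like the sketch below. It is not how the install script performs its check; the helper names are hypothetical, and the only facts relied on are the v22+ requirement stated above and the `vMAJOR.MINOR.PATCH` output of `node --version`.

```python
import re
import shutil
import subprocess

MIN_NODE_MAJOR = 22  # the installer is described above as requiring v22+


def parse_node_major(version_text: str) -> int:
    """Extract the major version from output such as 'v22.3.0'."""
    match = re.match(r"v?(\d+)\.", version_text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version_text!r}")
    return int(match.group(1))


def node_is_new_enough() -> bool:
    """Return True if a `node` binary on PATH reports a new enough version."""
    node = shutil.which("node")
    if node is None:
        return False  # Node.js is not installed or not on PATH
    result = subprocess.run([node, "--version"], capture_output=True, text=True)
    try:
        return parse_node_major(result.stdout) >= MIN_NODE_MAJOR
    except ValueError:
        return False


if __name__ == "__main__":
    print("Node.js OK" if node_is_new_enough() else "Node.js v22+ not found")
```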

Optional: Clean Up Old Versions

After migration, if you want to completely remove old versions:

# Uninstall old Clawdbot (will ask which components to delete)
clawdbot uninstall

# Or uninstall Moltbot
moltbot uninstall

Important Notes ⚠️

Old clawdbot and moltbot commands still work after migration
Old config directories are symlinked to the new location, so there is no risk of losing data
Existing Skills and workflows need no modification
If you encounter issues, run openclaw doctor --fix to auto-repair
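To see for yourself what the migration did to the old directories, you can inspect them directly. This is a small sketch, assuming the legacy paths from the version table (`~/.clawdbot`, `~/.moltbot`); the helper name is made up, and it only reads the filesystem.

```python
import os
from pathlib import Path


def describe_legacy_dir(path: Path) -> str:
    """Report whether a legacy config path is a symlink, a real directory, or absent."""
    if path.is_symlink():
        # os.readlink returns the link target without following it further.
        return f"symlink -> {os.readlink(path)}"
    if path.is_dir():
        return "directory"
    return "missing"


if __name__ == "__main__":
    for name in (".clawdbot", ".moltbot"):
        legacy = Path.home() / name
        print(f"{legacy}: {describe_legacy_dir(legacy)}")
```

A `symlink -> ...` line pointing at the new state directory means the old path still works and nothing was copied twice.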

Version Comparison Table

Item	Clawdbot	Moltbot	OpenClaw
Config Directory	~/.clawdbot/	~/.moltbot/	~/.openclaw/
Website	clawd.bot	molt.bot	openclaw.ai
GitHub	clawdbot/clawdbot	moltbot/moltbot	openclaw/openclaw
NPM Package	clawdbot	moltbot	openclaw

🔮 Future Outlook

OpenClaw's story is far from over.

Peter is working on several big things:

Security Hardening (Top Priority) — Continuously strengthening codebase security
Gateway Reliability Improvements — Making it smoother for more people to use
Expanding Model Support — Already supports KIMI K2.5, Xiaomi MiMo-V2-Flash, and other new models
Establishing Sustainable Funding — Wanting to pay core maintainers full-time salaries
Expanding the Maintainer Team — One person really can't handle it all

Community members have already done super cool things with OpenClaw:

Automatically managing emails and calendars
Remotely controlling code compilation and testing
Using Sentry webhooks to automatically catch errors and submit PR fixes
Secure remote access through Tailscale

One user put it well:

"The open-source community built a product better than Apple's Siri with just a few people. Welcome to the AI era - one person plus one code repository can fill the gap left by trillion-dollar companies."

📝 Final Thoughts

From Clawdbot to Moltbot to OpenClaw, this lobster has been through quite a lot.

Targeted by Anthropic's legal team, exploited by crypto scammers, renamed twice within two days...

But it's still alive, and thriving more than ever.

107K+ GitHub stars, 15K+ forks, 2 million weekly visits, a global developer community...

Behind these numbers is a simple belief:

Your AI assistant should truly belong to you. 100% open source, MIT license, forever free.

If you'd like to try this "lobster," check out the official website:

🌐 Website: https://openclaw.ai
💻 GitHub: https://github.com/openclaw/openclaw
📖 Documentation: https://docs.openclaw.ai
💬 Discord Community: https://discord.gg/openclaw

Perhaps it will become your most capable digital assistant of 2026?

After all, lobsters molt to grow bigger. And OpenClaw is just beginning its growth journey. 🦞

Clawdbot to Moltbot: A 72-Hour Internet Drama

Wed, 28 Jan 2026 00:00:00 GMT

Chapter 1: An Overnight Open Source Sensation

January 26, 2026 — An open source project called Clawdbot suddenly went viral.

Created by Austrian developer Peter Steinberger (@steipete), Clawdbot is a self-hosted AI assistant that can:

Run on WhatsApp, Telegram, Discord, Slack, Signal, and iMessage
Maintain persistent memory, remembering user preferences and conversation history
Control browsers, execute shell commands, and manage calendars
Proactively send notifications and reminders

Steinberger is no unknown — he founded PSPDFKit (now rebranded as Nutrient), "retired" after receiving a $100M+ investment from Insight Partners in 2021, and has now returned to build this "Claude with hands."

Its growth was absolutely insane:

🚀 Within 24 hours: 9,000+ GitHub stars
🚀 Within 72 hours: 60,000+ GitHub stars
🚀 Became one of the fastest-growing open source projects in GitHub history

Andrej Karpathy (former Tesla AI Director) publicly praised it, David Sacks (PayPal Mafia member) tweeted about it, and MacStories called it "the future of personal AI assistants."

Chapter 2: Anthropic's "Trademark Bomb"

January 27, 2026 — At the peak of Clawdbot's viral moment, Anthropic (Claude's parent company) sent a trademark-related request.

The problem? Anthropic believed "Clawd" was too similar to "Claude", constituting potential trademark infringement.

Founder Peter Steinberger announced on X:

🦞 BIG NEWS: We've molted!

Clawdbot → Moltbot
Clawd → Molty

Same lobster soul, new shell.

Anthropic asked us to change our name (trademark stuff), and honestly? "Molt" fits perfectly — it's what lobsters do to grow.

The rebranding was cleverly conceived:

Lobsters grow by molting
The project was also "molting" into a new form
New website: molt.bot

Chapter 3: 10 Seconds of Disaster 💥

However, the renaming process turned into a disaster.

Peter Steinberger tried to simultaneously rename the GitHub organization and X/Twitter accounts. In the mere 10-second gap between releasing the old names and registering the new ones, crypto scammers snatched both accounts!

"Had to rename our accounts for trademark stuff and messed up the GitHub rename and the X rename got snatched by crypto shills. That went wonderful."
— Peter Steinberger

The scammers had clearly been monitoring for this opportunity. They instantly seized:

❌ The original @clawdbot X account
❌ The original Clawdbot GitHub organization

They then began pushing cryptocurrency scams to tens of thousands of unsuspecting followers.

Chapter 4: The $16 Million Fake Token Scam

The account hijacking was just the beginning. Within hours, a fake $CLAWD token appeared on the Solana blockchain.

Scam timeline:

📈 Fake token market cap surged to $16,000,000
📉 Peter Steinberger publicly stated he would "never launch a token"
📉 Token price instantly crashed 90%+
💸 Late buyers got "rugged," scammers walked away with millions

Peter was forced to tweet a warning:

"To all crypto folks: Please stop pinging me, stop harassing me. I will never do a coin. Any project that lists me as coin owner is a SCAM."

Chapter 5: Security Nightmares Surface

Meanwhile, security researchers discovered serious vulnerabilities in Clawdbot/Moltbot.

Blockchain security firm SlowMist reported:

"Multiple unauthenticated instances are publicly accessible, and several code flaws may lead to credential theft and even remote code execution."

Researcher Jamieson O'Reilly found:

Searching Shodan for "Clawdbot Control" revealed hundreds of exposed control panels
These panels contained: API keys, bot tokens, OAuth secrets, complete conversation histories
Attackers could: impersonate users to send messages, execute commands, steal data

Demo attack:

Archestra AI CEO Matvey Kukuy sent a malicious email with prompt injection to an exposed Moltbot instance. After the AI read the email, it believed the "legitimate instructions" and forwarded the user's 5 most recent emails to the attacker's address.

The whole process took only 5 minutes.

Chapter 6: Community vs Anthropic

The community began questioning Anthropic's decision.

Key issues:

Clawdbot actually drove Claude usage — many users specifically configured Clawdbot to use Claude as its underlying model
This was a rapidly rising project bringing Anthropic free marketing and API revenue
The renaming chaos caused actual security disasters and financial losses
The similarity between "Clawd" and "Claude" was obviously playful, not malicious infringement

DHH (Ruby on Rails creator) criticized Anthropic's recent moves as "customer hostile."

AWS Hero AJ Stuyvenberg was more direct: "They're speedrunning the journey from forgivable startup to loathsome corporation before any exit!"

Developers began looking at OpenAI's Codex CLI (Apache 2.0 license), questioning whether Anthropic was becoming the kind of company they didn't want to build on.

Finale: Fighting on Multiple Fronts

Peter Steinberger is now simultaneously dealing with:

Front	Status
🔄 Recovering hijacked GitHub/X accounts	In progress
🛡️ Dealing with crypto scammer harassment	Ongoing
👥 Managing 8,900+ Discord community members	Active
🔒 Fixing security vulnerabilities	Urgent
📢 Rebuilding brand awareness	Challenging

Deeper Lessons

For open source builders:

Building on corporate platforms means facing ambiguous trademark policies. A single legal letter can force you to rename, exposing you to account hijacking, scams, and chaos.

For AI companies:

Your most passionate supporters are indie developers building quirky experimental tools. Sending legal notices to viral open source projects — ones driving your API usage — is a choice worth careful consideration.

For users:

Self-hosting AI agents with root privileges is both powerful and dangerous. The security models for these tools are still immature. Don't run them on your main machine, don't give them access to crypto wallets. Use dedicated hardware, isolated accounts, and strict IP whitelisting.
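As an illustration of what "strict IP whitelisting" means in practice, here is a minimal allowlist check built on Python's standard `ipaddress` module. The networks listed are placeholder examples (a home LAN and one documentation-range address), and the sketch is not tied to any configuration option of the tool itself.

```python
import ipaddress

# Hypothetical allowlist: replace with the networks you actually trust.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.168.1.0/24"),  # example home LAN
    ipaddress.ip_network("203.0.113.7/32"),  # example single remote address
]


def is_allowed(client_ip: str, networks=ALLOWED_NETWORKS) -> bool:
    """Return True only if the client address falls inside an allowed network."""
    try:
        address = ipaddress.ip_address(client_ip)
    except ValueError:
        return False  # malformed input is rejected, never allowed by default
    return any(address in network for network in networks)


if __name__ == "__main__":
    for ip in ("192.168.1.42", "198.51.100.9", "not-an-ip"):
        print(ip, is_allowed(ip))
```

The important design choice is deny-by-default: anything that is not explicitly listed, or cannot even be parsed, is refused.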

🤔 Final Thoughts: Is Anthropic Really the "Righteous" Party?

This isn't the first time Anthropic has angered the developer community.

Just two weeks ago (January 9), Anthropic suddenly banned all users accessing Claude Pro/Max subscriptions through third-party tools — no warning, no migration path. Developers who had deeply integrated Claude into their workflows were "backstabbed" overnight.

Now there's the Clawdbot incident.

A company that touts "AI safety" and "responsible AI" takes trademark action against an open source project whose name was obviously a good-faith pun and which was actively promoting the Claude ecosystem. The irony:

Clawdbot drove more people to use Claude API → Anthropic makes more money
Clawdbot demonstrated Claude's capabilities → Free marketing material
Clawdbot's developer was a Claude superfan → Community evangelist

The result? A legal letter, a PR disaster, and a group of once-enthusiastic developers seriously considering migration to OpenAI.

Anthropic's slogan is "AI safety," but they seem more adept at "developer hostility."

When a company's legal department is more active than its product department, perhaps it's time to ask: Whose safety are they really protecting? The users' safety, or their own trademark empire?

Once the trust of the open source community is lost, it's hard to rebuild. Anthropic should perhaps reconsider: in the marathon of AI, the real moat is technology and ecosystem, not legal letters.

🔗 Related Links:

New project homepage: molt.bot
GitHub: github.com/moltbot
X account: @moltbot

This is the reality of the open source AI world: overnight fame, legal threats, crypto scams, security vulnerabilities — all within 72 hours. 🦞💥

Claude's Founder at Davos: When Programmers No Longer Need to 'Write' Code

Thu, 22 Jan 2026 00:00:00 GMT

Insights from Anthropic founder Dario Amodei's latest Davos interview: Claude's real capabilities, the rise of Chinese open source, and how we should adapt

If you've used Claude, you've probably experienced this frustrating moment: you're in the middle of a great conversation, and suddenly your account gets suspended. You finally appeal and get it back, only to end up in the penalty box again a few days later.

In AI circles, Claude's "ban-prone nature" is almost a meme. But strangely enough, nine out of ten users who've been banned still find their way back—because once you've used it, you know this thing is genuinely powerful.

On January 20, 2026, Dario Amodei, founder of Anthropic (the company behind Claude), gave an interview to Bloomberg at the World Economic Forum in Davos. This usually low-profile AI leader shared plenty of insights: What makes Claude so strong? Has Chinese AI caught up? Will programmers face mass unemployment?

Today, let's dive into this interview—and maybe pour some cold water on a few points where Amodei's views deserve some pushback.

I. "Two Months Without Writing Code": AI Programming Isn't as Magical as It Sounds

The most eye-catching quote from the interview was about their Claude Code product lead:

"He hasn't written a single line of code in two months. Claude writes everything."

At first glance, it sounds like programmers are about to become obsolete, right?

Hold on—let's break down the actual meaning here.

First, "not writing code" doesn't mean "not working." What this person is still doing includes: designing system architecture, breaking down requirements, writing prompts, reviewing AI-generated code, debugging and testing, making technical decisions...

In other words, he went from being "someone who writes code" to "someone who directs AI to write code."

It's like switching from manual to automatic transmission—sure, you don't need to work the clutch anymore, but you still need to know when to hit the gas and when to turn the wheel. Lose control of the wheel, and you'll still crash.

Amodei himself admitted in the interview that while AI's cognitive capabilities are growing exponentially, "fully automated programming" is still an unrealistic fantasy. No matter how strong Claude is, it still needs humans to guide it with precise prompts and professional judgment to ensure quality output.

So here's the truth: Claude isn't replacing programmers—it's amplifying their capabilities.

A programmer who knows how to use Claude might be ten times more efficient than one who doesn't. But the prerequisite is that you need to be a competent programmer first—knowing what you want and being able to judge whether AI's output is correct.

II. Is Chinese AI Falling Behind? The Question Itself Is Wrong

There was an interesting exchange in the interview. The host asked Amodei: How's the competition with Chinese AI companies going?

Amodei's answer: When competing for enterprise client contracts, we've hardly ever lost to Chinese models.

That sounds impressive, but think about it—this comparison isn't exactly fair.

What kind of product is Claude? It's backed by frontier-scale large models (Anthropic doesn't publish parameter counts), burning astronomical amounts of compute and funding, targeting the high-end enterprise market.

Meanwhile, the most active force in Chinese AI is on a completely different track: open source.

DeepSeek, Qwen, GLM... These models might not match Claude on certain benchmarks, but they've achieved something more important: making AI accessible to ordinary developers and small businesses.

You can deploy them on your own servers without worrying about data privacy. You can fine-tune them for your specific needs without being constrained by API limitations. Most importantly, the cost is lower by an order of magnitude or more.

This is what's called "AI democratization"—not every company can afford Claude's enterprise subscription, but almost every developer can run an open-source model.

Amodei's assessment of Chinese AI in the interview has a bit of a "let them eat cake" flavor. He's speaking from the perspective of a top AI company CEO, seeing the competitive landscape in the premium market. But he may be underestimating the power of the open-source ecosystem—historically, Linux beating Unix and Android sweeping the mobile market weren't about being "stronger," but about being "more accessible."

The real AI landscape isn't a competition over who's stronger—it's a multi-layered ecosystem. Claude can be the crown jewel, but Chinese open-source models are continuously lowering the barrier to AI, enabling more people to participate in this transformation.

III. Will Programmers Lose Their Jobs? It's a False Dichotomy

In the interview, the host asked a pointed question: Will AI cause mass unemployment?

Amodei's answer was honest: We might see rapid GDP growth and rising unemployment at the same time.

That's fair enough, but I want to look at this question from a different angle.

Instead of asking "will programmers lose their jobs," ask "what kind of programmers will lose their jobs."

Every technological revolution in history has seen some people eliminated and others rise. When Excel appeared, those skilled at the abacus lost their advantage. When CAD became widespread, hand-drafting skills became less valuable. But the professions of accountant and engineer didn't disappear.

AI programming tools follow the same logic.

Those who'll be eliminated are the ones who can only mechanically type code, don't understand business logic, and can't ask questions—the "code monkeys."

Those who'll thrive are the ones who can use AI as a "super assistant":

Can precisely describe requirements to get high-quality code from AI
Can quickly review AI output and spot the pitfalls
Can integrate AI into their workflow to dramatically boost efficiency
Most importantly, can continuously learn new tools and methods

Amodei said people at his company "haven't written code in two months," but what he didn't mention is that these people are learning how to use AI better every single day.

That's the real lesson: it's not enough to learn one tool—you need to develop the ability for "continuous learning."

Claude is powerful today, but tomorrow something stronger might come along. Today's prompt engineering techniques might be obsolete next year. The only constant is change itself.

IV. Final Thoughts: Stay Clear-Headed, Stay Curious

In this interview, Amodei displayed the typical perspective of an AI company CEO: confident in his own product, cautious about competitors, both optimistic and careful about the future.

But as ordinary people, we don't need to accept any leader's views wholesale.

Claude is indeed powerful, but it's not the only option, nor is it omnipotent. Chinese open-source models may fall short in some areas, but they're bringing AI technology benefits to more people. Programmers do face challenges, but where there are challenges, there are opportunities.

If I had to summarize the takeaway from this interview in one sentence, it would be:

AI is a tool, not magic. Those who learn to use it will become stronger; those who expect it to think for them will eventually be left behind.

As for Claude's account suspension issues... well, use it while you can.

This article is based on Bloomberg's Davos interview from January 20, 2026. Views expressed are the author's own.

[Discussion Topic]

Have you used AI programming tools at work? How was the experience? Feel free to share your stories in the comments~

References

Anthropic's Amodei on AI: Power and Risk

2:30 AM Inspiration: Why Google's Hottest AI Model Is Called 'Nano Banana'

Sun, 18 Jan 2026 00:00:00 GMT

Starting in the middle of last year, a Google AI model went viral, not because of how powerful it is (though it certainly is powerful), but because of its name: Nano Banana.

Yes, you read that right. A serious AI image generation model with a name like "Nano Banana."

What's the story behind this?

It All Started with a 2:30 AM Message

The story begins last July.

At the time, the Google DeepMind team was preparing to launch a new image generation model on LMArena (an AI model evaluation platform). The technical name was already set—Gemini 2.5 Flash Image—but the platform needed a public codename.

The problem was—everyone kept putting it off.

Until 2:30 AM the night before launch, when a colleague messaged product manager Naina Raisinghani:

"We need to submit the codename now."

"How About Nano Banana?"

Drowsy and half-asleep, Naina's brain popped out an idea: Nano Banana.

Why this name? It turns out it came from her own nicknames:

Friends called her Naina Banana (because Naina rhymes with Banana)
Some also called her Nano (because she's petite and loves computers)

So she combined her two nicknames—Nano Banana.

And the name was surprisingly fitting: since this was a Flash (lightning-fast) model, Nano (meaning tiny) perfectly hinted at its lightweight and speedy nature.

Just like that, a casual suggestion at 2:30 AM became the official codename.

Unexpectedly, It Went Viral

In early August, Nano Banana launched on LMArena.

Users discovered the model's image editing capabilities were quite impressive—it could maintain facial similarity while cleverly blending multiple images together.

But what left an even stronger impression was this quirky name.

"What the heck is Nano Banana?"
"This name is too cute!"

The name spread rapidly on social media, with users from different regions creating localized memes around it.

From Joke to Official Branding

What happened next you probably know—Nano Banana became one of the highest-rated image editing models globally.

Google embraced the serendipity, fully incorporating "banana" elements into the brand design. The latest version even upgraded to Nano Banana Pro (powered by Gemini 3 Pro Image).

Final Thoughts

A flash of inspiration at 2:30 AM, a small joke with personal warmth, ultimately became one of the most viral names in Google's AI product lineup.

This story teaches us:

Sometimes the best ideas come when you're relaxed
Never underestimate a "casually chosen name"
Great product + great name = viral spread

Next time you're naming a project, maybe try 2:30 AM?

(Just kidding. Get some sleep.)

#Google #AI #NanoBanana #ArtificialIntelligence #TechTrivia

References:

How Nano Banana got its name - Google Blog

2025: The Year LLMs Changed Everything - A Deep Dive into Simon Willison's Year-End Review

Thu, 01 Jan 2026 00:00:00 GMT

Original Article: 2025: The year in LLMs - Simon Willison

This analysis is based on Simon Willison's year-end summary. A tribute to this Django co-creator and one of the sharpest observers in the LLM space.

Preface: Why You Should Take Simon Willison Seriously

Simon Willison isn't one of those AI evangelists who just hypes everything up. He's the co-creator of the Django framework, the person who coined the term "prompt injection," and a board member of the Python Software Foundation. More importantly, he's a developer who uses LLMs for real work every day. In 2025, he built 110 tools with AI assistance.

When someone like this says "2025 was the year of XXX," it's worth paying attention.

Key Insight #1: Reasoning Models Changed Everything

Simon's Take: Reasoning isn't about making AI count how many R's are in "strawberry"—it's about teaching AI to work with tools.

"The real unlock of reasoning was in driving tools. Reasoning models with access to tools can plan out multi-step tasks, execute on them and continue to reason about the results."

My Analysis:

When o1 launched in late 2024, most people's reaction was: "Oh, it can do math problems now. What does that have to do with me?" This thinking was completely wrong.

The real value of reasoning models lies in:

Planning ability: Breaking complex tasks into executable steps
Reflection ability: Checking results after execution, adjusting strategies
Tool coordination: Simultaneously invoking search, code execution, file operations, and other tools

What does this mean? It means AI evolved from a "Q&A machine" into an "executor."

Key Insight #2: Agents Went from "Sci-Fi" to "Practical"

Simon's prediction at year start: Agents won't happen.

Simon's admission at year end: I was half wrong.

"I didn't think agents would happen because I didn't think the gullibility problem could be solved... But if you define agents as LLM systems that can perform useful work via tool calls over multiple steps then agents are here."

My Analysis:

Simon's "eating his words" is actually quite enlightening. Where was he wrong? He imagined Agents as omnipotent assistants from sci-fi movies. But what are the Agents that actually shipped? Claude Code, Codex CLI—tools that can write code, run tests, and submit PRs for you.

Key insights:

Agent ≠ general-purpose intelligent assistant, but rather domain-specific automation executor
Code became the most mature landing scenario for Agents, because code execution results are verifiable
Search is the second mature scenario—deep research mode actually works now

Simon offers a pragmatic Agent definition: "An LLM system that can achieve goals through iterative tool calls." Not fancy, but effective.
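To make that definition concrete, here is a minimal sketch of "iterative tool calls" as a loop. Everything in it is an assumption for illustration: fake_model is a scripted stand-in for a real LLM, and the TOOLS registry and its two toy tools are hypothetical, not any vendor's API.

```python
from typing import Callable

# Hypothetical tool registry: names and behaviors are illustrative only.
TOOLS: dict[str, Callable[[str], str]] = {
    "add": lambda arg: str(sum(int(x) for x in arg.split())),
    "upper": lambda arg: arg.upper(),
}


def fake_model(goal: str, history: list[tuple[str, str, str]]) -> tuple[str, str]:
    """Stand-in for an LLM: returns (action, argument).

    A real model would choose the next action from the goal and the
    history of tool results; this stub follows a fixed script so the
    loop can run offline.
    """
    if not history:
        return ("add", "2 3")
    if len(history) == 1:
        return ("upper", f"total is {history[-1][2]}")
    return ("finish", history[-1][2])


def run_agent(goal: str, max_steps: int = 5) -> str:
    """Call tools step by step until the model says it is finished."""
    history: list[tuple[str, str, str]] = []
    for _ in range(max_steps):
        action, arg = fake_model(goal, history)
        if action == "finish":
            return arg
        result = TOOLS[action](arg)            # execute the chosen tool
        history.append((action, arg, result))  # feed the result back in
    raise RuntimeError("step budget exhausted before the goal was reached")


print(run_agent("add 2 and 3, then shout the total"))
```

The point of the sketch is the shape, not the stub: the model proposes an action, the harness executes it, and the result goes back into the next decision until the model declares the goal met or the step budget runs out.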

Key Insight #3: Claude Code Is the Most Important Product of 2025

Simon's exact words: "The most impactful event of 2025 happened in February, with the quiet release of Claude Code."

This might surprise many people. Not GPT-5? Not DeepSeek R1's market impact? A command-line tool?

My Analysis:

Claude Code represents a paradigm shift—LLMs moving from chat interfaces to the terminal.

Why does this matter?

Developers' natural habitat: The terminal is the most familiar environment for developers. Pipes, redirects, script composition—Unix philosophy merges perfectly with LLMs
$1 billion ARR validation: Anthropic announced that Claude Code had reached $1 billion in annualized run-rate revenue. A CLI tool! This shows professional users are willing to pay for truly useful AI tools
Asynchronous execution breakthrough: Claude Code for web can run in the background. Send a task, grab a coffee, come back and your PR is ready

On SWE-rebench, a software engineering benchmark designed to limit training-data contamination, Claude Code leads by a wide margin. Claude Code paired with Claude Opus 4.5 is the ultimate Vibe Coding combo. For bug fixes and code review, OpenAI's Codex with GPT-5.2 at the xhigh setting excels.

Key Insight #4: Chinese Open-Source Models Rose to Dominance

Simon's data: On the Artificial Analysis leaderboard, the top five open-source models are all from China.

"GLM-4.7, Kimi K2 Thinking, MiMo-V2-Flash, DeepSeek V3.2, MiniMax-M2.1 are all Chinese open weight models."

My Analysis:

DeepSeek R1 launched on January 20, 2025. A week later, on January 27, NVIDIA's market cap dropped by nearly $600 billion in a single day. This wasn't a tech event; it was a geopolitical event.

Key facts:

DeepSeek V3 training cost about $5.5 million, while US companies spend hundreds of millions
These aren't just downloadable weights under restrictive terms: several ship under genuinely open source licenses such as MIT or Apache 2.0
While training code and datasets aren't public, detailed technical papers have advanced the entire industry

What does this mean for you?

The barrier to locally deploying top-tier models dropped significantly
The reference point for API costs has been redefined
The "AI is a US monopoly" narrative has been shattered

Key Insight #5: OpenAI Lost Its Lead

Simon's assessment: "This year the rest of the industry caught up."

This doesn't mean OpenAI got worse, but rather:

Image generation was surpassed by Google Nano Banana
Code capability was challenged by Claude Opus 4.5
Its open-weight models were outclassed by Chinese vendors
Audio API was threatened by Gemini Live

My Analysis:

OpenAI's advantage now is mainly brand recognition—"Nobody knows LLMs, but everyone's heard of ChatGPT." But in professional developer circles, this advantage is eroding.

After Google released Gemini 3 in November, OpenAI reportedly declared an internal "Code Red" in early December. It was the clearest sign yet that OpenAI was feeling competitive pressure.

A deeper issue: Google has its own TPUs and doesn't need to pay the "GPU tax" to NVIDIA. When training cost is a core competitive factor, this is a structural advantage.

Key Insight #6: $200/Month Subscriptions Became the New Standard

Fact: Claude Max, ChatGPT Pro, and Google AI Ultra all landed at roughly the $200/month tier (Google AI Ultra launched at $249.99/month).

Simon's personal experience: "I've personally paid $100/month for Claude... I've heard from plenty of other people who are happy to pay these prices too."

My Analysis:

This reveals a bifurcation:

Casual users: Free or $20/month is enough
Power users: $200/month is a good deal

Why is it worth it? Because Coding Agents consume tokens like crazy. If you're using Claude Code daily for complex tasks, pay-per-API could easily exceed $200.
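A rough back-of-the-envelope version of that claim, as a sketch. The token volumes and per-million-token prices below are hypothetical placeholders chosen for round numbers, not any vendor's actual rates or anyone's measured usage.

```python
def monthly_api_cost(
    input_tokens_per_day: float,
    output_tokens_per_day: float,
    price_in_per_mtok: float,
    price_out_per_mtok: float,
    working_days: int = 22,
) -> float:
    """Estimate a month of pay-per-token spend in dollars."""
    daily = (
        (input_tokens_per_day / 1e6) * price_in_per_mtok
        + (output_tokens_per_day / 1e6) * price_out_per_mtok
    )
    return daily * working_days


# Hypothetical heavy coding-agent usage: 5M input and 0.5M output tokens
# per working day, at assumed prices of $3 and $15 per million tokens.
cost = monthly_api_cost(5_000_000, 500_000, 3.0, 15.0)
print(f"${cost:.2f} per month vs. a $200 flat subscription")
```

Under these assumed numbers the metered bill comes to about $495 a month. Agents re-read large contexts on every step, so input tokens dominate, and that is why a flat tier can look cheap to heavy users.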

This also means: LLMs are transitioning from "novelty toy" to "professional tool". Professional tools deserve professional pricing.

Key Insight #7: YOLO Mode and the Danger of "Normalization of Deviance"

Simon's warning: "The longer we get away with running these systems in fundamentally insecure ways, the closer we are getting to a Challenger disaster of our own."

Context: YOLO mode = letting Coding Agents auto-execute all operations without human confirmation.

My Analysis:

This is Simon's most serious warning in the article. He cites sociologist Diane Vaughan's research on the Challenger space shuttle disaster—engineers knew about O-ring problems long before, but because multiple launches went fine, the risk was "normalized."

The AI analogy:

You run Claude Code in YOLO mode daily without incident
You start thinking prompt injection is only a theoretical risk
Until one day, a malicious instruction actually deletes your home directory

Johann Rehberger calls this "normalization of deviance in the AI space." Simon clearly agrees.
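For contrast, here is a minimal sketch of what the opposite of YOLO mode looks like: an approval gate in front of command execution. The SAFE_COMMANDS allowlist, the yolo flag, and the approve callback are all hypothetical names for illustration; real coding agents implement permissions in their own ways.

```python
import shlex
import subprocess
from typing import Callable, Optional

# Hypothetical allowlist: commands considered harmless enough to auto-run.
SAFE_COMMANDS = {"echo", "ls", "pwd"}


def run_tool_command(
    command: str,
    yolo: bool = False,
    approve: Optional[Callable[[str], bool]] = None,
) -> str:
    """Run a command proposed by an agent, with an optional human gate.

    yolo=True executes anything without asking. Otherwise, commands
    outside SAFE_COMMANDS run only if approve(command) returns True.
    """
    argv = shlex.split(command)
    if not argv:
        raise ValueError("empty command")
    if not yolo and argv[0] not in SAFE_COMMANDS:
        if approve is None or not approve(command):
            return f"BLOCKED: {command}"
    completed = subprocess.run(argv, capture_output=True, text=True, check=False)
    return completed.stdout.strip()


print(run_tool_command("echo hello"))          # allowlisted, runs directly
print(run_tool_command("rm -rf ~/project"))    # not approved, never executed
```

The gate is crude (an allowlist keyed on the first word is easy to bypass), but it shows the trade-off: every confirmation prompt is friction, and YOLO mode is what you get when that friction is removed entirely.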

Key Insight #8: MCP Might Be a Flash in the Pan

Simon's observation: "The reason I think MCP may be a one-year wonder is the stratospheric growth of coding agents."

Core argument: When Agents can run arbitrary Bash commands, who needs MCP?

My Analysis:

MCP (Model Context Protocol) was launched by Anthropic in November 2024 and exploded in early 2025. In May, OpenAI, Anthropic, and Mistral all rolled out API-level support within eight days of each other.

But Simon points out an awkward fact: Bash is the ultimate tool. An Agent that can run shell commands can invoke any CLI tool—git, gh, ffmpeg, curl—why wrap another layer of MCP?

Anthropic itself seems to have realized this, launching the lighter Skills mechanism: a Markdown file plus optional scripts, much simpler than MCP's JSON-RPC server.

Key Insight #9: Local Models Are Good, But Cloud Models Are Better

Simon's mixed feelings:

"I got small amounts of real work done offline! My excitement for local LLMs was very much rekindled."

But also:

"I have yet to try a local model that handles Bash tool calls reliably enough for me to trust that model to operate a coding agent on my device."

My Analysis:

Local models indeed improved massively in 2025:

Mistral Small 3 (24B) ≈ GPT-4 level, runs on 64GB laptops
20-32B parameter range became the sweet spot
Can do some real work offline

But the problem is reliability. Coding Agents need models to stably invoke tools dozens or even hundreds of times. Local models can't do that yet.

Simon's conclusion: Next laptop needs at least 128GB RAM, but the main workhorse remains frontier cloud models.

Key Insight #10: "Slop" Became Word of the Year

Merriam-Webster's definition, paraphrased: low-quality digital content mass-produced with artificial intelligence

Simon's optimistic lean:

"The internet has always been flooded with low quality content. The challenge, as ever, is to find and amplify the good stuff."

My Analysis:

The popularity of "Slop" (AI junk content) as a word reflects growing public vigilance toward AI-generated content. This is good.

But Simon raises a deeper question: Can you perceive slop's impact?

His own answer: Probably not. Because he doesn't use Facebook and carefully curates his information sources. For average users who don't? They might be drowning in slop without knowing it.

Key Insight #11: Data Centers Are Becoming Extremely Unpopular

Fact: Over 200 environmental groups demanded a moratorium on new US data center construction.

Simon's focus: Water resource concerns might be overstated (a distraction), but energy consumption is real.

My Analysis:

This is the only section touching on AI ethics/social impact, and Simon's stance is cautious.

He points out the Jevons paradox: Cost per token drops → users consume more tokens → total energy consumption rises instead of falling.

$200/month subscription users might consume 10x the compute resources of $20 users. Efficiency gains are offset by usage growth.
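The arithmetic behind that rebound effect can be sketched in a few lines. The numbers are illustrative assumptions only: a 4x efficiency gain per token set against the 10x usage growth mentioned above, with energy in arbitrary units.

```python
def total_energy(energy_per_token: float, tokens: float) -> float:
    """Total energy is per-token cost times volume (arbitrary units)."""
    return energy_per_token * tokens


# Assumed baseline: 1 unit of energy per token, 1M tokens consumed.
before = total_energy(energy_per_token=1.0, tokens=1_000_000)

# Assumed future: each token is 4x cheaper, but usage grows 10x.
after = total_energy(energy_per_token=0.25, tokens=10_000_000)

print(f"total energy changes by a factor of {after / before}")
```

With these assumptions total consumption rises 2.5x even though each token got four times cheaper. Total energy falls only if efficiency improves faster than usage grows.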

My Summary: The Thinking Framework Simon Willison Teaches Us

After reading this 13,000-word year-end summary, what I learned isn't just 26 trends, but a methodology for observing the AI industry:

Hands-on practice: Simon isn't a commentator—he built 110 tools and uses these technologies daily
Admitting mistakes: He predicted Agents wouldn't happen at year start, and candidly admitted he was half wrong at year end
Defining terms: "prompt injection," "slop," "lethal trifecta"—clear concepts are prerequisites for clear thinking
Security awareness: Even while using YOLO mode daily, he doesn't forget to warn about "Challenger disaster" risks
Staying curious: A 44-year-old Django co-creator still researching mobile programming

If you want to keep up with LLM developments, there's no better way than following Simon Willison.

Appendix: Key Terms Created/Popularized by Simon Willison in 2025

Term	Meaning
Vibe Coding	Generating code entirely through prompts, "forgetting the code exists"
The Lethal Trifecta	Access to private data + ability to communicate externally + exposure to untrusted content
Context Rot	Model output quality degrading as conversations grow longer
Slopsquatting	Registering malicious packages using package names hallucinated by LLMs
Asynchronous Coding Agent	Tools that run in the background and submit PRs when complete

Original: 2025: The year in LLMs

If you found this analysis valuable, subscribe to Simon's blog: RSS, email, or Bluesky/Mastodon. $10/month also gets you his monthly newsletter.

Notes

This article was co-authored by the author with Claude Opus 4.5 and Gemini 3 Pro.

AI News - July 30, 2025

Wed, 30 Jul 2025 00:00:00 GMT

Open Source

Qwen3-30B-A3B Minor Update

Qwen recently released a minor update to Qwen3-30B-A3B, called Qwen3-30B-A3B-Instruct-2507. This efficient Mixture of Experts (MoE) model activates only 3B parameters while achieving performance close to GPT-4o and Qwen3-235B-A22B in non-thinking mode. Key improvements include:

Enhanced reasoning, coding, and mathematical capabilities
Expanded multilingual knowledge coverage
Improved long-context understanding, supporting up to 256K tokens
Better alignment with user intent and handling of open-ended tasks
Removed <think> blocks for more direct and efficient responses

This update makes the model smarter, faster, and easier to deploy locally, suitable for various complex tasks such as instruction following, logical reasoning, and tool use.

Commentary: Good news for open source and experimentation.

Official Tweet: https://x.com/Alibaba_Qwen/status/1950227114793586867

Model Repository: https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507

Closed Source

ChatGPT Study Mode

OpenAI today launched "Study Mode" for ChatGPT, a learning experience designed to help users work through problems step-by-step rather than providing direct answers. This mode uses guided questioning, step-by-step explanations, and interactive approaches to enhance critical thinking and learning outcomes, particularly useful for homework help, exam preparation, and exploring new knowledge.

The feature is now available to logged-in users on Free, Plus, Pro, and Team tiers. ChatGPT Edu users will get access in the coming weeks. This update is seen as a responsible application of AI in education, aimed at reducing dependency on generative AI while promoting deeper learning.

Commentary: The strongest AI product experience for average users. ChatGPT teaching you how to learn—sometimes you can't help but wonder if schools are still necessary.

Official Blog: https://openai.com/index/chatgpt-study-mode/

NotebookLM & AI Mode Updates

Google recently announced major updates to NotebookLM, including Video Overviews and Studio panel upgrades.

Video Overviews serve as a visual alternative to Audio Overviews, generating AI-narrated slideshows that incorporate images, charts, quotes, and data from source documents. This helps users understand complex information more intuitively, with support for customizing topics, learning objectives, and target audiences. The Studio panel features a new interface design that supports creating and storing multiple outputs of the same type within a single notebook (such as multi-language audio or mind maps for different chapters), improving collaboration and multitasking efficiency. This feature is rolling out to English users, with more language support coming soon.

Additionally, for back-to-school season, Google Search's AI Mode received updates including: support for uploading images and PDF files on desktop browsers (with plans to expand to Google Drive and other file types), a Canvas tool for multi-session planning (like creating study guides), Search Live with integrated Google Lens for real-time video input, and Lens functionality in Chrome for asking questions about on-screen content. These enhancements aim to improve the learning experience for students, parents, and educators through interactive questioning, cross-referencing information, and visual context. Currently available mainly in the US and India for users 18 and older.

Commentary: Google's product update blog posts don't mean features are immediately available—patience is required. Just like when AI Mode announced support for Gemini 2.5 Pro and Deep Research, users didn't get the feature on day one. NotebookLM is a great learning companion, and these updates further enhance learning assistance. AI Mode is a preview of Google disrupting itself—along with experimental projects like Web Guide, these experiments will eventually become Google Search products for the AI era.

Official Blogs:

https://blog.google/technology/google-labs/notebooklm-video-overviews-studio-upgrades/

https://blog.google/products/search/ai-mode-updates-back-to-school/

Claude Code --add-dir Command

Claude Code recently introduced --add-dir, an option that lets users work across multiple directories in a single session. By passing the CLI flag --add-dir at startup or using the slash command /add-dir during a session, developers can add additional working directories to Claude Code's workspace without switching the main directory. This is particularly useful for monorepos, shared configurations, or cross-project work, where it makes navigating, referencing, and editing code easier.

Commentary: Claude Code has become the most popular product among developers. The cross-directory feature further elevates the experience. Anthropic deserves praise for developing products based on user needs.

Official Tweet: https://x.com/_catwu/status/1950288312033562751

Notes

This article was co-authored by the author and Grok 4.

A New Beginning

Thu, 17 Jul 2025 00:00:00 GMT

During my university years, I once ran a WeChat Official Account. But due to frustration with content review and other factors, that first account ended with me voluntarily deactivating it.

Later, with AI assistance, I built a personal blog from scratch. After more than 3 years of development, my humble site has gained some readers. The image below shows current Cloudflare traffic data for my site—though many of these visitors are actually AI crawlers, so the real numbers are far lower than what's shown.

I've basically been running this as a labor of love, never considering monetization through Google Ads. As a current graduate student, I haven't felt much financial pressure yet. But after sustaining an idealistic website purely through passion for so long, some fatigue is inevitable.

Next year marks the end of my student life, and I'll inevitably need to start earning my own living. Restarting a WeChat Official Account is one approach—not as a main job, but as a side project to experiment with.

The real Chinese internet is no longer the world of websites you find through Google or Bing—it now exists within the "walled gardens" of major tech giants.

As I once again embrace writing in this authentic Chinese internet space, I'll try to avoid being generic. All my writing will be carefully crafted. This account won't touch any sensitive or rule-violating topics—I'll self-censor accordingly.

I also understand that by publishing on WeChat, my writing becomes training data for Tencent's Hunyuan large language model. This is unavoidable on the public web, and even private platforms can't escape it. I accept this reality.

This account's avatar and name match my WeChat account. Every article published here will have a corresponding original on the public web—click the "Read More" button at the end of each article to jump to the source.

This account mainly shares knowledge about AI, personal tinkering projects, and personal growth insights. I aim to update at least once a week.

A new beginning—let's go!
