Everything you need to know about PewDiePie’s AI setup

Tags: llm, llama.cpp, mlx, vllm, gpt-oss, qwen, local-ai, privacy, mac, metal

Author: VB
Published: November 5, 2025

TL;DR
  • Watch Felix Kjellberg’s “STOP. Using AI Right now.” (31 October 2025) to see his 10‑GPU setup and custom UI: https://www.youtube.com/watch?v=qw4fDU18RcU
  • You can recreate the workflow with consumer gear: start with 7B–20B quantized models in llama.cpp or Apple’s MLX, then scale up if you add VRAM.
  • The real goal is autonomy: keep data off third-party APIs, wire in your own search/memory, and iterate until the tooling feels personal.

Felix “PewDiePie” Kjellberg recently shared a video showing his $20K, 10‑GPU rack plus a custom UI he calls Chad OS. What’s interesting isn’t the hardware—it’s how much of his approach works on a single GPU tower or an Apple Silicon laptop. This guide breaks down his setup for people who want to run AI locally without being deep-learning engineers.

Felix emphasizes two points in the video: “I don’t want to API my way out of everything,” and “Delete. Delete. … Oh, so they collect your data even if you deleted it.” Everything below follows that mindset—local-first, privacy-aware experiments you can scale up or down based on your hardware.

Pick the path that matches your hardware, dive deeper with the links, and don’t feel obligated to chase 8×4090 setups.

Quick glossary

  • Quant (quantization): shrink a model so it fits in laptop memory—think of it as zipping weights.
  • RAG (retrieval augmented generation): let the model look up your files before it answers you.
  • Tensor split: share a big model across multiple GPUs so none of them overload.

Models Felix name‑dropped

Use the tables below to pick models that fit your machine before you burn time downloading 200B+ checkpoints.

Model quick picks (llama.cpp + MLX)

| Model (quant) | llama.cpp (bartowski GGUF) | MLX (mlx-community) | Unified memory (Mac) | GPU VRAM (PC/Linux) | Notes |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5-7B Instruct Q4_K_M | bartowski/Qwen2.5-7B-Instruct-GGUF | mlx-community/Qwen2.5-7B-Instruct-4bit-mlx | 12 GB | 8 GB | Fast, friendly council member; run this first. |
| GPT-OSS-20B Q4_K_M | bartowski/GPT-OSS-20B-GGUF | mlx-community/GPT-OSS-20B-4bit-mlx | 24 GB | 12 GB | Keeps 20B quality on laptops with swap. |
| Meta-Llama-3.1-70B Instruct Q4_K_M | bartowski/Meta-Llama-3.1-70B-Instruct-GGUF | mlx-community/Meta-Llama-3.1-70B-Instruct-4bit-mlx | 64 GB | 48 GB | Needs tensor splitting or an M3 Ultra. |
| GPT-OSS-120B Q4_K_M | bartowski/GPT-OSS-120B-GGUF | n/a | 96 GB | 64 GB | Chase this only if you have 4×4090s or better, plus cooling to match. |
| Qwen3-235B A22B Q4_K_M | bartowski/Qwen3-235B-A22B-Instruct-GGUF | n/a | 128 GB+ | 96 GB+ | Demo piece; keep it offline unless you own a rack. |

Day-to-day downshift

  • Spin up bartowski/Qwen2.5-7B-Instruct-GGUF or bartowski/GPT-OSS-20B-GGUF for “council” voices while the big models stay idle.
  • Use quantized mid-size checkpoints in MLX or llama.cpp (Mixtral, Llama-3.1-8B) as stand-ins when 70B/120B VRAM isn’t available.
  • Stash a “high-octane” profile for vLLM remote runs, but default to 7B–20B locally so you can iterate quickly.
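
One cheap way to keep that split straight is a small profile map you can paste launch commands from. A minimal Python sketch; the profile names and the idea of printing a command are my own convention for illustration, not something from Felix’s video, and the commands just mirror the llama.cpp and vLLM invocations shown later in this guide:

    # Hypothetical "profile" map: small local models by default, the big stuff opt-in.
    PROFILES = {
        "council": "llama-server --hf-repo bartowski/Qwen2.5-7B-Instruct-GGUF "
                   "--hf-file qwen2.5-7b-instruct-q4_k_m.gguf --port 8081",
        "daily": "llama-server --hf-repo bartowski/GPT-OSS-20B-GGUF "
                 "--hf-file GPT-OSS-20B-Q4_K_M.gguf --port 8080",
        "high-octane": "vllm serve openai/gpt-oss-120b --tensor-parallel-size 4",  # remote box only
    }

    def launch_command(profile="daily"):
        """Default to the small stuff; reach for 'high-octane' deliberately, not by habit."""
        return PROFILES[profile]

    print(launch_command())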

These memory numbers assume 4-bit quantization with ~25 % headroom for KV cache, runtime buffers, and system processes. Higher precision (Q5/Q8 or FP16) multiplies the requirement.
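
To sanity-check a download before you start it, you can approximate the same math in a few lines. A rough sketch only: the ~0.6 bytes per parameter figure for Q4_K_M and the 25 % headroom factor are ballpark assumptions, so treat the output as a floor rather than a guarantee:

    # Back-of-envelope check: quantized weight size plus ~25% headroom.
    def estimate_memory_gb(params_billions, bytes_per_param=0.6, headroom=1.25):
        """0.6 bytes/param approximates Q4_K_M; headroom covers KV cache, buffers, and the OS."""
        return params_billions * bytes_per_param * headroom

    for name, size_b in [("Qwen2.5-7B", 7), ("GPT-OSS-20B", 20), ("Llama-3.1-70B", 70)]:
        print(f"{name}: ~{estimate_memory_gb(size_b):.0f} GB")  # prints rough totals in GB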

Prefer MLX-native weights or ready-made .gguf files whenever possible. “Converted weights,” in Felix’s terminology, means grabbing a pre-quantized artifact (like the links above) instead of running your own conversion step mid-setup.

1. llama.cpp everywhere (CLI + WebUI)

  1. Install the runtime. Homebrew now ships both CLI and WebUI binaries; Windows/Linux users can download prebuilt archives or build from source:

    brew install llama.cpp
  2. Launch the WebUI with GPT-OSS-20B (fast enough for CPU + GPU hybrids). llama-server downloads the weights once and caches them under ~/.cache/llama.cpp. Jump into the Interface pane and enable the collapsible sidebar + preset buttons to match Felix’s setup:

    llama-server --hf-repo bartowski/GPT-OSS-20B-GGUF \
      --hf-file GPT-OSS-20B-Q4_K_M.gguf \
      --port 8080 --ctx-size 4096 --threads 10

    Open http://localhost:8080 for a clean chat UI, provide a system prompt, and pin your favorite sampling presets. Running headless? Skip the browser and talk to the server’s HTTP API directly.

  3. Add a second route for Qwen2.5-7B so you can flip between “council” members. The simplest pattern is one model per llama-server process, so start a second instance on its own port. Give each endpoint its own system prompt or saved preset in the WebUI to recreate his council voting approach:

    llama-server --hf-repo bartowski/Qwen2.5-7B-Instruct-GGUF \
      --hf-file qwen2.5-7b-instruct-q4_k_m.gguf \
      --port 8081 --ctx-size 8192 --threads 10 \
      --chat-template chatml
  4. Prefer terminals? The CLI flows stay the same; the default --batch-size is already sized for smooth prompt processing, so leave it alone unless you’re benchmarking. Match --ctx-size to the longest prompt you actually need:

    llama-cli --hf-repo bartowski/GPT-OSS-20B-GGUF \
      --hf-file GPT-OSS-20B-Q4_K_M.gguf \
      --prompt "Summarize Felix's council experiment in 80 words." \
      --ctx-size 4096
    llama-cli --hf-repo bartowski/Qwen2.5-7B-Instruct-GGUF \
      --hf-file qwen2.5-7b-instruct-q4_k_m.gguf \
      --conversation --prompt "Create a persona for a skeptical council member." \
      --ctx-size 6144

    Need more throughput? For a single user, --batch-size (the number of prompt tokens processed per step) is already generous by default; when several clients share one llama-server, raise --parallel instead, watching latency and GPU memory. Interactive chats feel fine on the defaults.

  5. Dual GPUs? Split tensors so each card shares the load. Felix’s “council” runs this way: a supervisor script scores responses and “kills” the losers. Start with even splits, then add a simple vote tally in Python to replicate it (see the sketch after this list):

    llama-cli --hf-repo bartowski/GPT-OSS-20B-GGUF \
      --hf-file GPT-OSS-20B-Q4_K_M.gguf \
      --tensor-split 50,50 --ctx-size 4096 --conversation
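
Felix’s supervisor script isn’t public, so here is a minimal stand-in for the vote-tally idea: it asks the two llama-server endpoints from steps 2–3 (their OpenAI-compatible /v1/chat/completions routes) the same question, then lets one member judge the ballot. The judging prompt and the choice of gpt-oss as judge are placeholders, not his method, and the sketch assumes `pip install requests`:

    # Minimal "council" sketch: query both llama-server endpoints, then tally a vote.
    import requests

    COUNCIL = {
        "gpt-oss-20b": "http://localhost:8080/v1/chat/completions",
        "qwen2.5-7b": "http://localhost:8081/v1/chat/completions",
    }

    def ask(url, prompt, system="You are one member of a small AI council."):
        payload = {
            "model": "local",  # llama-server answers with whatever model it has loaded
            "messages": [
                {"role": "system", "content": system},
                {"role": "user", "content": prompt},
            ],
            "temperature": 0.7,
            "max_tokens": 256,
        }
        resp = requests.post(url, json=payload, timeout=300)
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]

    question = "Is Q4_K_M enough for a 20B model on a 16 GB GPU? Answer in two sentences."
    answers = {name: ask(url, question) for name, url in COUNCIL.items()}

    # Placeholder supervisor: one member reads the ballot and picks a winner.
    ballot = "\n\n".join(f"[{name}]\n{text}" for name, text in answers.items())
    print(ask(COUNCIL["gpt-oss-20b"],
              "Vote for the single best answer and say why:\n\n" + ballot,
              system="You are the council supervisor. Pick exactly one winner."))

Swap the judging step for whatever scoring you prefer (keyword checks, a rubric, a third model); the point is that the whole loop stays on localhost.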

The layout below mirrors Felix’s WebUI: pinned sidebar, quick preset buttons, and two council members ready to vote.

Example llama.cpp WebUI layout with GPT-OSS-20B and Qwen2.5-7B loaded side-by-side, similar to Felix's setup.


2. Apple Silicon path: MLX

  1. Install Apple’s MLX tooling with uv (fast, no venv juggling):

    uv pip install --upgrade mlx-lm
  2. Point MLX straight at the Hugging Face repo IDs—the runtime will stream and cache the weights automatically. Swap the --model flag for any entry in the table above:

    mlx_lm.generate --model mlx-community/GPT-OSS-20B-4bit-mlx \
      --prompt "Summarize PewDiePie’s council idea in 3 bullet points."
    mlx_lm.generate --model mlx-community/Qwen2.5-7B-Instruct-4bit-mlx \
      --prompt "List three privacy-first features Felix added to his setup."

    First run pulls the weights into your Hugging Face cache (usually ~/.cache/huggingface/hub). Swap in larger packs like mlx-community/Meta-Llama-3.1-70B-Instruct-4bit-mlx when you have the unified memory, and use mlx_lm.chat with the same --model flag when you want an interactive session instead of one-shot prompts.
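
The same weights are scriptable from Python via mlx-lm’s load/generate helpers, which is the hook you will want once you start bolting search or RAG on top. A minimal sketch using the repo IDs from the table above:

    # Script MLX directly: load a quantized model once, then generate on demand.
    from mlx_lm import load, generate

    model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit-mlx")

    # Apply the chat template so the instruct model sees a proper conversation.
    messages = [{"role": "user",
                 "content": "Summarize PewDiePie's council idea in 3 bullet points."}]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False,
                                           add_generation_prompt=True)

    print(generate(model, tokenizer, prompt=prompt, max_tokens=256))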

Felix emphasizes that “smaller models are amazing” once you bolt search or RAG on top. On Apple Silicon that means using MLX for the core weights, then piping results through Spotlight, mdfind, or a local embeddings DB before handing the snippets back to the model.
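
Here is roughly what that “look it up before answering” loop can look like with nothing but Spotlight. A sketch that assumes macOS (mdfind is Spotlight’s CLI) and reuses the mlx-lm snippet above; a local embeddings DB would replace the keyword search, but the shape of the loop stays the same:

    # Poor-man's RAG on macOS: Spotlight finds candidate files, the model gets excerpts.
    import subprocess
    from pathlib import Path

    def spotlight_snippets(query, folder="~/Documents", k=3, chars=1000):
        """Return short excerpts from the first k Spotlight hits for `query`."""
        hits = subprocess.run(
            ["mdfind", "-onlyin", str(Path(folder).expanduser()), query],
            capture_output=True, text=True,
        ).stdout.splitlines()[:k]
        chunks = []
        for path in hits:
            try:
                chunks.append(f"--- {path} ---\n{Path(path).read_text(errors='ignore')[:chars]}")
            except OSError:
                continue
        return "\n\n".join(chunks)

    question = "What did my notes say about quantization formats?"
    context = spotlight_snippets("quantization")
    augmented = f"Use these local notes if they help:\n{context}\n\nQuestion: {question}"
    # Hand `augmented` to mlx_lm's generate() (previous snippet) or any local endpoint.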

Tip: MLX automatically uses Apple Silicon’s GPU and unified memory, so there is nothing to configure. On an M3 Ultra you can bump --max-tokens 2048 safely; older machines should keep generations shorter or drop to 8–14B models.

3. No GPU? Use HuggingChat

Felix mentions being “allergic to cloud APIs,” but if you’re still experimenting—or waiting on your next hardware upgrade—you can try the same models through HuggingChat. Pick the gpt-oss or Qwen2.5-72B endpoints, drop in his council/automation prompts, and note the responses, then recreate your favorites locally once you have the compute. HuggingChat keeps a transcript history you can export as JSON, which pairs nicely with the RAG workflows above when you’re ready to go offline.

4. Multi-GPU servers: vLLM

If you have a workstation or cloud box with multiple GPUs, vLLM gives you the same rapid token throughput Felix uses.

pip install -U vllm

vllm serve openai/gpt-oss-120b \
  --tensor-parallel-size 4 \
  --max-model-len 65536 \
  --swap-space 16 \
  --dtype auto

For Qwen3-235B on a bigger rig (the flags below assume 8-way tensor parallelism times 2-way pipeline parallelism, i.e. 16 GPUs total):

vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507 \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 2 \
  --max-model-len 100000 \
  --enforce-eager

Add --served-model-name chad-os to keep your OpenAI-compatible clients pointed at custom endpoints. Pair with your own FastAPI or Next.js front-end to recreate his custom UI.
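
Because vLLM speaks the OpenAI protocol, that front-end only needs a stock OpenAI client under the hood. A minimal sketch with the official `openai` Python package, assuming the gpt-oss-120b server above is running locally on vLLM’s default port 8000 and was launched with --served-model-name chad-os:

    # Talk to the local vLLM server through the standard OpenAI client.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1",
                    api_key="not-needed")  # any string works unless you set --api-key

    resp = client.chat.completions.create(
        model="chad-os",  # must match --served-model-name
        messages=[{"role": "user",
                   "content": "Draft a persona for a cautious council member."}],
        max_tokens=200,
    )
    print(resp.choices[0].message.content)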

He admits that the first time 235B came alive, “I wish I never ran this model. Too much power.” If you go this big, keep concurrency conservative (for example --max-num-seqs 1), add GPU memory monitors (a polling sketch follows), and keep a smaller model on standby for day-to-day prompts so your workstation doesn’t melt.
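
A “GPU memory monitor” can be as simple as polling nvidia-smi from a loop. A small sketch; it assumes NVIDIA GPUs with nvidia-smi on the PATH, and the 90 % threshold is an arbitrary starting point:

    # Warn when any GPU creeps toward full memory so you can back off batch sizes.
    import subprocess, time

    def gpu_memory():
        """Return (used_MiB, total_MiB) per GPU from nvidia-smi."""
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.used,memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout.strip().splitlines()
        return [tuple(int(v) for v in line.split(",")) for line in out]

    while True:
        for i, (used, total) in enumerate(gpu_memory()):
            if used / total > 0.9:
                print(f"GPU {i}: {used}/{total} MiB used, consider a smaller model or batch")
        time.sleep(30)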

Common snags (and quick fixes)

  • Model won’t load? Drop to a smaller quant (Q4 → Q3) or close browser tabs to free RAM.
  • Fans roaring? Cap --max-tokens, enable macOS Low Power Mode, or set NVIDIA PowerMizer to “Adaptive.”
  • Weird answers? Clear the chat history and rerun with a lower temperature like --temp 0.6.

Tackle these before you assume your hardware isn’t strong enough.

As Felix puts it: “I realized I like running AI more than using AI with this computer.” Build the parts that sound fun and keep the loop private, just like he does.

That’s it. Swap in models your hardware can handle, keep the tooling modular, and you’ll have a similar setup running locally without needing 10 GPUs.