Four days, eighteen missed sessions, and a private roundtable with Kelsey Hightower: SCALE 23x as it actually happened

The schedule I built two weeks ago was a fiction. A useful fiction — it forced real thinking about tradeoffs — but eighteen of the sessions I marked as “MUST” or “HIGH” are now links in a YouTube folder I won’t open before 2027. The one session that wasn’t on any schedule, wasn’t announced publicly, and had no recording? That one I can still reconstruct line by line.

That’s the gap between the conference you plan and the conference you actually attend.

Thursday — PlanetNix and the unexpected room

Kelsey Hightower’s “Is it time for Nix?” talk opened PlanetNix with the kind of skepticism I wasn’t expecting from someone whose name is on the event. He traced his own career — sysadmin to DevOps to SRE to platform engineer — and made the point that pivoting is the job, not an anomaly. But then he said something about AI that I wrote down immediately: “With this new technology we are going faster, but what are we doing with this new time? Are we spending more time with family? No. Are we getting raises? No. That’s the skepticism. The people who are driving it aren’t driving it for altruistic reasons.”

That’s not a blanket condemnation of AI. It’s a harder question: who captures the productivity gains? It set a tone for the whole conference that I kept returning to.

Sam Fu from Anthropic presented after Kelsey, and the talk was one of those sessions where a single data point rewires your sense of what’s normal. Anthropic gives each developer their own pod on its own dedicated node. Not a shared namespace. Not resource quotas. Their own node. I reacted to this the way you’d expect — that’s absurd, that’s expensive — and then spent the rest of the talk recalibrating. Their rationale is that CI should match developer environments exactly, and you can’t do that with shared nodes. They use Tailscale as a sidecar inside the dev container, and they’ve consolidated their service containers using rootless Docker to avoid the operational overhead of running Docker-in-Docker properly. The philosophical takeaway — things should just work for our users; go to great lengths to ensure it — is easy to agree with in a talk and genuinely hard to act on in an existing platform.

Then the afternoon ended with the thing I couldn’t have put on a schedule: a private roundtable with Kelsey, the Flox CTO and their VP of Engineering, a representative from JPL, and Jesse from my team.

The conversation centred on something Kelsey had been thinking through out loud: Kubernetes started with Docker images as the atomic unit. Layers. Composition through FROM. Flox breaks that model — instead of building images that contain everything, you inject Nix packages directly into containers at runtime. The build artifact is no longer an image. It’s a Nix package. The CI pipeline ends with Flox assembling and packaging the application.

What this means practically: you stop rebuilding your container because jq got a patch. Common tooling lives in the Nix layer, maintained separately, not triggering your application builds. For Spark specifically, this is significant — the entire application doesn’t have to travel through every layer of your image pipeline.

On the walk back afterward, Jesse and I were already mapping this to our own build process. The output would still be an image pushed to ECR, which means it’s a drop-in from the perspective of every downstream system. But the shape of how you get there changes substantially. We don’t have a plan yet, but we have a direction.

The JPL engineer in the room was a useful reminder that reproducibility isn’t just a developer experience problem. When you’re building software for spacecraft, “it worked on my machine” isn’t a philosophy you can entertain.

Friday — The morning was excellent; I skipped the afternoon

John Willis opened Friday and gave me the most quotable framework of the conference. The standard risk model — Risk = Impact x Likelihood — is no longer sufficient for AI-accelerated threat environments. Willis’s revision: Risk = (Likelihood ^ Velocity) x Authority.

The velocity term is doing real work there. An AI agent can go from an announced CVE to active exploitation in under 15 minutes. Humans cannot review, assess, and respond in that window. Willis’s conclusion: “The human can no longer catch the error before it happens. The system architecture needs to protect against it.” If you’re still designing security controls around human review cycles, you’re building for a threat model that’s already obsolete.
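To make the velocity term bite in a toy calculation, I have to pick an interpretation: a fractional likelihood raised to a large power would shrink, so the sketch below (my reading, not Willis’s, and the numbers are illustrative) treats velocity as the count of independent exploitation attempts an attacker fits inside one human review cycle.

```python
def effective_likelihood(p_per_attempt: float, velocity: int) -> float:
    """Chance of at least one successful exploit when an attacker fits
    `velocity` independent attempts inside one human review cycle."""
    return 1 - (1 - p_per_attempt) ** velocity

def willis_risk(p_per_attempt: float, velocity: int, authority: float) -> float:
    """One reading of Risk = (Likelihood ^ Velocity) x Authority:
    exposure compounds with attempt velocity, then scales with the
    authority (blast radius) of whatever the exploited identity can touch."""
    return effective_likelihood(p_per_attempt, velocity) * authority

# A human reviewing once a day sees one attempt; an agent iterating every
# 15 minutes fits 96 attempts into the same window.
human = willis_risk(0.02, velocity=1, authority=3.0)
agent = willis_risk(0.02, velocity=96, authority=3.0)
```

With a 2% per-attempt success rate, compressing the attempt interval from a day to 15 minutes multiplies the effective risk by more than 40x, which is the point: the window, not the per-attempt odds, is what changed.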

Kat Morgan was next with her platform stack live demo — devcontainers, Nix, Docker-in-Docker, K8s-in-containers, KubeVirt, Ceph, Cilium, Dagger, Gitea — which turned out to be genuinely substantive rather than a scope disaster. The key structural idea was the workspace path format: /workspace/{user}/{server}/{namespace}. User can be a human, an AI agent, or a CI runner. Each gets its own subtree with independent group ownership. The consequence of that structure is that you get isolation across all three categories without special-casing any of them.
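A minimal sketch of that path convention, with hypothetical names of my own (Morgan’s demo used her full stack; this only shows the layout and the basic hygiene it needs):

```python
from pathlib import PurePosixPath

def workspace_path(user: str, server: str, namespace: str) -> PurePosixPath:
    """Build a /workspace/{user}/{server}/{namespace} subtree path.

    `user` can be a human login, an AI agent identity, or a CI runner --
    the layout deliberately doesn't special-case any of them. Each
    subtree then gets its own group ownership for isolation.
    """
    for part in (user, server, namespace):
        if not part or "/" in part or part in (".", ".."):
            raise ValueError(f"unsafe path component: {part!r}")
    return PurePosixPath("/workspace") / user / server / namespace
```

The same call shape works for `workspace_path("kat", "prod-1", "web")` and `workspace_path("ci-runner-7", "prod-1", "web")`, which is exactly the no-special-casing property.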

The line I keep coming back to: “To make things reliable for AI we have to start making them reliable for humans first.” This is the thing I want to put on a slide in our next planning cycle. A lot of the AI workload pressure we’re under is pressure to build new infrastructure. Morgan’s argument is that you mostly need to finish the infrastructure you already started.

Maya Singh’s session on conversational K8s debugging used Inspektor Gadget with an MCP integration in Cursor. The demo worked the way demos rarely do — she traced a DNS issue live, identified it as a five-dot FQDN lookup hitting CoreDNS unnecessarily, and resolved it. What made it interesting wasn’t the outcome, it was the framing: IG has many powerful diagnostic tools, and under pressure, teams consistently pick the wrong ones. The LLM selects the right tool based on the problem description. Engineers who know IG well were apparently worse at tool selection during incidents than the LLM with no prior IG experience. The expertise that makes you fast in normal conditions is the same thing that gives you tunnel vision under pressure.
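The five-dot detail reads to me like the classic resolv.conf `ndots:5` behaviour (my assumption; the talk didn’t spell it out): non-rooted names with fewer dots than `ndots` get tried against every search domain before the literal name, and each try is another CoreDNS query. A sketch of that expansion logic:

```python
def queries_sent(name: str, search_domains: list[str], ndots: int = 5) -> list[str]:
    """Simulate glibc-style resolution order for a DNS name.

    A name ending in '.' is fully qualified and queried as-is. Otherwise,
    if it has fewer than `ndots` dots, every search domain is tried
    before the literal name -- each entry here is one CoreDNS query.
    """
    if name.endswith("."):
        return [name]
    expansions = [f"{name}.{d}" for d in search_domains]
    if name.count(".") >= ndots:
        return [name] + expansions   # literal name first
    return expansions + [name]       # search domains first: extra queries

# Typical in-cluster search path (Kubernetes default shape)
search = ["ns.svc.cluster.local", "svc.cluster.local", "cluster.local"]
```

A two-dot external name like `api.example.com` generates four queries under this policy; appending a trailing dot collapses it to one, which is the usual fix.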

Then I went to the vendor hall, had lunch with the AWS team, and skipped the rest of the afternoon.

I should be honest about this. In the scheduling post I spent several paragraphs explaining why Dustin Kirkland’s agentic pipeline supply chain talk was a must-attend. I marked it MUST. And I didn’t go. I had a good lunch conversation and the afternoon slipped away. That’s what actually happened. It’s in the recording folder now.

Saturday — A direct report’s first conference talk, and the room I’ll be in next year

Renovate at 1,300 repos turned out to be an interesting topic, boring delivery. The useful technical details: Grafana runs multiple Renovate configs as CronJobs with shared Redis state, splits jobs alphabetically to avoid deduplication overhead, and uses a webhook-triggered Go application to manage scans against PRs. Renovate PRs include changelogs and CVE descriptions inline, which means the person reviewing the PR has the context to make an informed decision without leaving the GitHub UI. There’s also a paging setup that alerts when Renovate isn’t opening enough PRs — a neat inversion of the usual “too many alerts” problem.
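The alphabetical split is the detail worth stealing. A sketch of the idea (the boundary letters and function name are mine, not Grafana’s config):

```python
from bisect import bisect_right

def shard_for(repo: str, boundaries: list[str]) -> int:
    """Return the shard index for a repo name, split alphabetically.

    boundaries=['g', 'n', 't'] gives four shards: names before 'g',
    'g'..'m', 'n'..'s', and 't' onward. Each shard maps to one Renovate
    CronJob with a disjoint repo list, so jobs never see the same repo
    and there is no cross-job deduplication overhead.
    """
    return bisect_right(boundaries, repo.lower())
```

Sharding by name rather than by hash keeps the assignment stable and human-debuggable: you can tell at a glance which CronJob owns `grafana` without consulting any state.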

The sidebar: I didn’t know you could manually trigger a job from k9s. Learned that in the middle of what was otherwise a slow session. Sometimes conferences work like that.

At 12:30 I stayed in the same room for Vinh Nguyen’s talk on migrating from Logz.io to self-managed Grafana Loki. This is the part I don’t have detailed technical notes on, and that’s intentional. I was there as a manager, not as a content consumer.

Vinh is on my team. This was his first conference talk. My notes from that session are four words: “Doing a good job, integrated the feedback we provide.” That’s it. There’s nothing else I need to write down. Being in that room was the whole point.

At 2:30, the Meta containers-in-containers talk became the technical surprise of the conference. The setup: Meta runs production multi-tenant compute using nested containers. Developers SSH into a login container, which runs Podman so they can start Claude. Then — and this is the part that answered a question my team hadn’t quite articulated yet — an iptables rule restricts that inner container to only communicate with a proxy that limits access to the inference server. Not the open internet. Just the inference endpoint.

The pattern: nested containers as the agent sandboxing primitive, with iptables as the enforcement layer. The developers can run their own builds inside their containers too — a bind mount of /proc to /proc resolves the RUN statement failures that blocked them initially. They moved from Kaniko (now archived) to BuildKit for the image building layer.
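For my own notes I sketched what that egress lockdown looks like as rules. This generates the iptables commands as strings; it is a sketch of the pattern Meta described, not their actual ruleset, and the proxy address is a placeholder:

```python
def egress_lockdown_rules(proxy_ip: str, proxy_port: int) -> list[str]:
    """iptables rules (as strings) restricting a container's egress to a
    single inference proxy. Meant to run inside the inner container's
    network namespace; order matters, since the final rule drops the rest."""
    return [
        # keep loopback open so local tooling still works
        "iptables -A OUTPUT -o lo -j ACCEPT",
        # allow return traffic on connections already established
        "iptables -A OUTPUT -m state --state ESTABLISHED,RELATED -j ACCEPT",
        # new connections are allowed only toward the proxy
        f"iptables -A OUTPUT -p tcp -d {proxy_ip} --dport {proxy_port} -j ACCEPT",
        # everything else: no open internet, just the inference endpoint
        "iptables -A OUTPUT -j DROP",
    ]
```

The important property is the default-deny tail: the agent inside can reach exactly one thing, and adding a second permitted endpoint is an explicit, reviewable rule change.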

This is the architectural answer to the agent isolation question we keep circling. We knew we needed sandboxing for agent workloads. We’d been thinking about it as a separate problem from developer environments. Meta has collapsed those into the same pattern.

The Saturday panel — Kelsey, Stormy Peters, and James Bayer on AI reshaping infra — ran over time and nobody in the room seemed to mind.

Three things earned space in my notes.

First, Kelsey on training data collapse: if people stop contributing to Stack Overflow, stop writing blog posts, stop creating the public corpus that models train on, what do the next generation of models train on? “Models can’t train on their own output.” This is a systems problem with a feedback loop that most of the AI discourse ignores.

Second, the APIs-as-hints argument: “We write our APIs with hints, not instructions. This was never good.” Kelsey’s point was that AI works better with intent-based interfaces, and that those should have been the standard from the start. We gave up on clear specification in favour of “good enough for humans to figure out.” Now we’re paying for it.

Third — and this is the one I kept thinking about on the walk back to the hotel — Kelsey on consent: AI is “built on our prior knowledge without our consent and sold back to us for $20/month.” This isn’t a legal argument or a licensing argument. It’s a community argument. The open source ecosystem that produced the training data operated under norms that didn’t anticipate that use. Whether you think the models are technically in compliance or not, the social contract was violated.

After the panel, I had a two-minute hallway conversation with James Bayer about our Flox adoption plan. His advice was immediate and specific: forget about the Kubernetes integration first. Start with developer workflows. Two sentences. Immediately actionable. Worth more than most of the formal sessions.

Sunday — Shorter than planned, and that was fine

Mark Russinovich’s supply chain keynote was a welcome surprise in how direct it was. Microsoft is shipping Sysinternals for Linux (including jcd). KEDA graduated to CNCF after being incubated inside Microsoft. And Russinovich said something that should be in every security review presentation: “Not looking at the code is not the flex you think it is.”

The practical takeaway from this talk was the OpenSSF Scorecard tool — a CLI that scores repositories for supply chain trustworthiness. The Open Source Security Foundation has 117 members across 16 industries. Running Scorecard on your dependencies is a 20-minute task that gives you a defensible starting position on supply chain posture. We’re going to add it to our onboarding checklist for new dependencies.
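The 20-minute version is just a loop over your dependency list. A sketch, assuming the `scorecard` binary is on PATH (`--repo` and `--format` are documented flags; the function names are mine):

```python
import subprocess

def scorecard_cmd(repo: str) -> list[str]:
    """Command line for an OpenSSF Scorecard run against one repo."""
    return ["scorecard", f"--repo={repo}", "--format=json"]

def scan_dependencies(repos: list[str]) -> dict[str, str]:
    """Run Scorecard over a dependency list; returns repo -> raw JSON.

    Each run fetches repo metadata over the network, so this belongs in
    a scheduled job, not a hot path.
    """
    return {
        r: subprocess.run(scorecard_cmd(r), capture_output=True, text=True).stdout
        for r in repos
    }
```

Parsing the JSON and failing onboarding below a score threshold is the obvious next step, but even eyeballing the raw output per dependency is a real improvement over nothing.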

Engin Diri from Pulumi had the central tension of Sunday’s track in his title: AI platforms without losing engineering principles. The architecture he described uses KServe for model serving, LiteLLM for access management, and agent-sandbox from CNCF for isolation — running on Bottlerocket nodes with skills defined as ConfigMaps. The demo used Open WebUI with specific sandbox skills to handle developer infrastructure queries without requiring any infrastructure setup from the developer. His project code is at dirien/what-is-ai-platform-engineering-and-why-should-you-care if you want to follow along.

The honest note: Engin has leaned heavily into LiteLLM, and I’ve heard that its support and maintenance have been declining. Worth watching before committing to it as a dependency. The architecture makes sense; the specific tool choice is a conversation I need to have with my team before we adopt it.

The Chainguard booth conversation filled in a gap from the week. They’re running Trivy and Grype for image scanning — layered, not redundant, with each catching things the other misses. More usefully, someone there had done the work of integrating GPU utilization metrics into Karpenter dashboards. We’ve wanted this since we started running AI workloads and kept deprioritizing it. I came away with a concrete approach to take back to the team instead of just another item on the list.


What actually changed because of this conference comes down to three things — two technical, one personal.

The container-as-artifact shift is a convergent signal. The Flox roundtable, Kat Morgan’s path structure for CI/AI/human parity, and Meta’s containers-in-containers work all point at the same thing: the container image as the fundamental build artifact is being renegotiated. Not replaced — nothing at this scale replaces things, it accumulates layers — but the assumptions underneath it are shifting.

The AI skepticism is coming from the people who know the most. Kelsey said it twice (the productivity gains question on Thursday, the consent framing on Saturday). Willis gave it a precise technical form (velocity changes the risk calculus, systems need to absorb the consequences). Morgan said it plainest: make things reliable for humans first. These are not people who are anti-AI. They’re people who’ve thought harder about it than most, and they’re all expressing the same category of doubt.

And Vinh. The conference had a lot of good content. The manager moment I’ll actually remember was staying in that room at 12:30.

I wrote in the scheduling post that I have a folder of recordings I haven’t opened since 2023. That folder has eighteen new items in it now. I’m not going to watch them. The Flox roundtable — unrecorded, unscheduled, forty-five minutes in a hallway meeting room — was worth more than any of them would have been, because it ended with a direction and a concrete next step, not just a presentation I could have read on someone’s blog at home.
