<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://schackenberg.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://schackenberg.com/" rel="alternate" type="text/html" /><updated>2026-05-16T22:39:48+02:00</updated><id>https://schackenberg.com/feed.xml</id><title type="html">Benedikt Schackenberg</title><subtitle>Software Engineer · Database Architect · Building tools that make IT operations easier.</subtitle><author><name>Benedikt Schackenberg</name></author><entry><title type="html">Why I keep deleting half my stack and going back to Postgres</title><link href="https://schackenberg.com/2026/05/16/postgres-can-do-almost-everything.html" rel="alternate" type="text/html" title="Why I keep deleting half my stack and going back to Postgres" /><published>2026-05-16T00:00:00+02:00</published><updated>2026-05-16T00:00:00+02:00</updated><id>https://schackenberg.com/2026/05/16/postgres-can-do-almost-everything</id><content type="html" xml:base="https://schackenberg.com/2026/05/16/postgres-can-do-almost-everything.html"><![CDATA[<p>I have a recurring conversation. Someone shows me an architecture diagram — a clean little web app, three or four engineers behind it — and there’s a Redis instance for caching, a separate queue (RabbitMQ or SQS), Elasticsearch for a single search box, a vector DB for the “AI feature” the founder promised, and somewhere in the middle, Postgres, doing the actual work.</p>

<p>Every one of those extra boxes is a subscription, a deployment, a backup story, a monitoring dashboard, an upgrade window, and a 2 a.m. incident waiting to happen. And in maybe 80% of the cases I see, every single one of them could just be Postgres.</p>

<p>This isn’t a “Postgres is magic” post. Postgres has limits, and I’ll get to them. But before you reach for another piece of infrastructure, it’s worth knowing how much the boring old elephant can actually do.</p>

<p><img src="/assets/images/postgres-ate-stack-hero.png" alt="Postgres sitting on top of a pile of retired stack components" /></p>

<hr />

<h2 id="the-thing-people-forget-about-postgres">The thing people forget about Postgres</h2>

<p>Postgres is not just a relational database. It’s an extensible object-relational database, and that word — <em>extensible</em> — is the part that matters. You get the standard ACID-compliant core, but you also get a healthy ecosystem of extensions that bolt on entire new capabilities without leaving the database.</p>

<p>In other words: it’s the kind of system you can keep adding to instead of replacing. That’s a rare property in this industry.</p>

<p>Here’s what I’ve actually replaced with it, in real projects.</p>

<hr />

<h2 id="nosql-jsonb-and-gin">NoSQL: JSONB and GIN</h2>

<p>The classic argument for MongoDB is “I have unstructured data.” Fine. You can store unstructured data in Postgres, too, using the <code class="language-plaintext highlighter-rouge">JSONB</code> type. The <code class="language-plaintext highlighter-rouge">B</code> is the important part — it’s a decomposed binary representation, not a string you have to re-parse on every query.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">events</span> <span class="p">(</span>
    <span class="n">id</span>        <span class="n">bigserial</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">,</span>
    <span class="n">payload</span>   <span class="n">jsonb</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
    <span class="n">received</span>  <span class="n">timestamptz</span> <span class="k">NOT</span> <span class="k">NULL</span> <span class="k">DEFAULT</span> <span class="n">now</span><span class="p">()</span>
<span class="p">);</span>

<span class="k">CREATE</span> <span class="k">INDEX</span> <span class="n">events_payload_gin</span> <span class="k">ON</span> <span class="n">events</span> <span class="k">USING</span> <span class="n">GIN</span> <span class="p">(</span><span class="n">payload</span> <span class="n">jsonb_path_ops</span><span class="p">);</span>
</code></pre></div></div>

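<p>A GIN (Generalized Inverted Index) on a JSONB column behaves like the index at the back of a textbook: keys point directly to the rows that contain them. The <code class="language-plaintext highlighter-rouge">jsonb_path_ops</code> operator class used above gives a smaller, faster index than the default one, at the cost of supporting only containment (<code class="language-plaintext highlighter-rouge">@&gt;</code>) and jsonpath matching, not key-existence checks with <code class="language-plaintext highlighter-rouge">?</code>. So you can do this:</p>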

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">id</span><span class="p">,</span> <span class="n">payload</span><span class="o">-&gt;&gt;</span><span class="s1">'customer_id'</span>
<span class="k">FROM</span> <span class="n">events</span>
<span class="k">WHERE</span> <span class="n">payload</span> <span class="o">@&gt;</span> <span class="s1">'{"type": "checkout_completed"}'</span>
  <span class="k">AND</span> <span class="n">received</span> <span class="o">&gt;</span> <span class="n">now</span><span class="p">()</span> <span class="o">-</span> <span class="n">interval</span> <span class="s1">'7 days'</span><span class="p">;</span>
</code></pre></div></div>

<p>…and join it against your normal relational tables in the same query, in the same transaction. You get the schema flexibility of a document store without giving up referential integrity. Most “we need MongoDB” projects I’ve seen would have been perfectly happy with this.</p>
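
<p>A minimal sketch of such a join, assuming a hypothetical <code class="language-plaintext highlighter-rouge">customers</code> table (not part of the example above) and a numeric <code class="language-plaintext highlighter-rouge">customer_id</code> inside the payload:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Hypothetical relational table; only "events" comes from the example above.
CREATE TABLE customers (
    id    bigint PRIMARY KEY,
    email text NOT NULL
);

-- Checkouts per customer over the last week:
-- document data and relational data in one query.
SELECT c.email, count(*) AS checkouts
FROM events AS e
JOIN customers AS c
  ON c.id = (e.payload-&gt;&gt;'customer_id')::bigint
WHERE e.payload @&gt; '{"type": "checkout_completed"}'
  AND e.received &gt; now() - interval '7 days'
GROUP BY c.email
ORDER BY checkouts DESC;
</code></pre></div></div>

<p>The containment filter is the part the GIN index can serve; the cast on <code class="language-plaintext highlighter-rouge">customer_id</code> will fail on non-numeric values, so a real schema would probably validate that field on write.</p>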

<p><img src="/assets/images/postgres-jsonb-gin.png" alt="JSONB documents being indexed by a GIN tree" /></p>

<hr />

<h2 id="background-jobs-for-update-skip-locked">Background jobs: <code class="language-plaintext highlighter-rouge">FOR UPDATE SKIP LOCKED</code></h2>

<p>This one’s a little hidden gem. People reach for Redis or RabbitMQ for a job queue because “a SQL queue will deadlock.” That was a fair worry 15 years ago. It hasn’t been since PostgreSQL 9.5 introduced <code class="language-plaintext highlighter-rouge">SKIP LOCKED</code>.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">WITH</span> <span class="n">next_job</span> <span class="k">AS</span> <span class="p">(</span>
    <span class="k">SELECT</span> <span class="n">id</span>
    <span class="k">FROM</span> <span class="n">jobs</span>
    <span class="k">WHERE</span> <span class="n">status</span> <span class="o">=</span> <span class="s1">'pending'</span>
    <span class="k">ORDER</span> <span class="k">BY</span> <span class="n">created_at</span>
    <span class="k">LIMIT</span> <span class="mi">1</span>
    <span class="k">FOR</span> <span class="k">UPDATE</span> <span class="n">SKIP</span> <span class="n">LOCKED</span>
<span class="p">)</span>
<span class="k">UPDATE</span> <span class="n">jobs</span>
<span class="k">SET</span> <span class="n">status</span> <span class="o">=</span> <span class="s1">'running'</span><span class="p">,</span> <span class="n">started_at</span> <span class="o">=</span> <span class="n">now</span><span class="p">()</span>
<span class="k">FROM</span> <span class="n">next_job</span>
<span class="k">WHERE</span> <span class="n">jobs</span><span class="p">.</span><span class="n">id</span> <span class="o">=</span> <span class="n">next_job</span><span class="p">.</span><span class="n">id</span>
<span class="n">RETURNING</span> <span class="n">jobs</span><span class="p">.</span><span class="o">*</span><span class="p">;</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">SKIP LOCKED</code> tells Postgres: grab the first available row, lock it, and if you hit a row another worker already has — don’t wait, just skip it. The result is a wait-free worker pool sitting on top of a plain table. I’ve run systems doing thousands of jobs per second this way. Operationally it’s lovely: jobs are just rows you can <code class="language-plaintext highlighter-rouge">SELECT</code>, retry, audit, and back up like anything else.</p>
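
<p>The claim query above assumes a <code class="language-plaintext highlighter-rouge">jobs</code> table that the post never shows. A rough sketch of what it could look like follows; the extra columns, the partial index, and the ten-minute timeout are my assumptions, not a fixed recipe:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Hypothetical schema matching the columns used in the claim query.
CREATE TABLE jobs (
    id          bigserial PRIMARY KEY,
    status      text NOT NULL DEFAULT 'pending',
    payload     jsonb NOT NULL,
    created_at  timestamptz NOT NULL DEFAULT now(),
    started_at  timestamptz,
    finished_at timestamptz
);

-- Partial index: only pending rows are indexed, so the
-- "oldest pending job" lookup stays small as the table grows.
CREATE INDEX jobs_pending_idx ON jobs (created_at) WHERE status = 'pending';

-- Worker is done with the job it claimed (job id bound as $1).
UPDATE jobs
SET status = 'done', finished_at = now()
WHERE id = $1;

-- Put jobs back in the queue if their worker died mid-run.
-- The 10-minute timeout is an assumed value.
UPDATE jobs
SET status = 'pending', started_at = NULL
WHERE status = 'running'
  AND started_at &lt; now() - interval '10 minutes';
</code></pre></div></div>

<p>The requeue step matters because the claim query commits the <code class="language-plaintext highlighter-rouge">running</code> status right away: once the row lock is released, a crashed worker leaves nothing behind except a stale row.</p>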

<p>If you need a queue and you already have Postgres, you probably don’t need another piece of infrastructure.</p>

<hr />

<h2 id="full-text-search">Full-text search</h2>

<p>You don’t need Elasticsearch to power a search bar. You might if you’re indexing the New York Times archive, but for the search field on your SaaS dashboard? <code class="language-plaintext highlighter-rouge">tsvector</code> and <code class="language-plaintext highlighter-rouge">tsquery</code> are right there.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">ALTER</span> <span class="k">TABLE</span> <span class="n">articles</span>
<span class="k">ADD</span> <span class="k">COLUMN</span> <span class="n">search_doc</span> <span class="n">tsvector</span>
<span class="k">GENERATED</span> <span class="n">ALWAYS</span> <span class="k">AS</span> <span class="p">(</span>
    <span class="n">to_tsvector</span><span class="p">(</span><span class="s1">'english'</span><span class="p">,</span> <span class="n">coalesce</span><span class="p">(</span><span class="n">title</span><span class="p">,</span> <span class="s1">''</span><span class="p">)</span> <span class="o">||</span> <span class="s1">' '</span> <span class="o">||</span> <span class="n">coalesce</span><span class="p">(</span><span class="n">body</span><span class="p">,</span> <span class="s1">''</span><span class="p">))</span>
<span class="p">)</span> <span class="n">STORED</span><span class="p">;</span>

<span class="k">CREATE</span> <span class="k">INDEX</span> <span class="n">articles_search_idx</span> <span class="k">ON</span> <span class="n">articles</span> <span class="k">USING</span> <span class="n">GIN</span> <span class="p">(</span><span class="n">search_doc</span><span class="p">);</span>

<span class="k">SELECT</span> <span class="n">id</span><span class="p">,</span> <span class="n">title</span>
<span class="k">FROM</span> <span class="n">articles</span>
<span class="k">WHERE</span> <span class="n">search_doc</span> <span class="o">@@</span> <span class="n">plainto_tsquery</span><span class="p">(</span><span class="s1">'english'</span><span class="p">,</span> <span class="s1">'postgres performance'</span><span class="p">);</span>
</code></pre></div></div>

<p>Postgres handles stemming (so “running” matches “run”), stop words, and ranking. Add the <code class="language-plaintext highlighter-rouge">pg_trgm</code> extension and you get fuzzy matching that survives typos — useful when a user searches for “postgresss” instead of “postgres.” For 90% of in-app search use cases, this is more than enough, and it lives next to your data, in the same transaction, with the same backup story.</p>
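
<p>Ranking and fuzzy matching can be sketched roughly like this. It reuses the <code class="language-plaintext highlighter-rouge">articles</code> table from above; the trigram index name and the result limits are assumptions:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Order full-text matches by relevance.
SELECT a.id, a.title, ts_rank(a.search_doc, q) AS rank
FROM articles AS a,
     plainto_tsquery('english', 'postgres performance') AS q
WHERE a.search_doc @@ q
ORDER BY rank DESC
LIMIT 20;

-- Typo-tolerant title lookup with trigrams.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX articles_title_trgm
ON articles USING GIN (title gin_trgm_ops);

-- % is the pg_trgm similarity operator; it is true when
-- similarity() exceeds pg_trgm.similarity_threshold.
SELECT id, title, similarity(title, 'postgresss') AS sim
FROM articles
WHERE title % 'postgresss'
ORDER BY sim DESC
LIMIT 10;
</code></pre></div></div>

<p>For long titles, whole-string similarity can fall below the threshold even when one word is a near match; <code class="language-plaintext highlighter-rouge">word_similarity</code> is probably the better fit there.</p>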

<hr />

<h2 id="vector-search-for-ai-features">Vector search for AI features</h2>

<p>If you’re shipping an AI feature, the temptation is to drop in a hosted vector database. Fine for prototypes — painful in production, because now you have a “hybrid search problem”: you want documents that are <em>semantically</em> similar to a query, but only the ones owned by the current user, only from the last 30 days, only in this project. Doing that across two systems over the network is slow, expensive, and ugly.</p>

<p><code class="language-plaintext highlighter-rouge">pgvector</code> gives you vectors as a native column type, plus HNSW indexes for approximate nearest neighbor search:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="n">EXTENSION</span> <span class="n">vector</span><span class="p">;</span>

<span class="k">ALTER</span> <span class="k">TABLE</span> <span class="n">documents</span> <span class="k">ADD</span> <span class="k">COLUMN</span> <span class="n">embedding</span> <span class="n">vector</span><span class="p">(</span><span class="mi">1536</span><span class="p">);</span>

<span class="k">CREATE</span> <span class="k">INDEX</span> <span class="n">documents_embedding_hnsw</span>
<span class="k">ON</span> <span class="n">documents</span> <span class="k">USING</span> <span class="n">hnsw</span> <span class="p">(</span><span class="n">embedding</span> <span class="n">vector_cosine_ops</span><span class="p">);</span>

<span class="k">SELECT</span> <span class="n">id</span><span class="p">,</span> <span class="n">title</span>
<span class="k">FROM</span> <span class="n">documents</span>
<span class="k">WHERE</span> <span class="n">owner_id</span> <span class="o">=</span> <span class="err">$</span><span class="mi">1</span>
  <span class="k">AND</span> <span class="n">created_at</span> <span class="o">&gt;</span> <span class="n">now</span><span class="p">()</span> <span class="o">-</span> <span class="n">interval</span> <span class="s1">'30 days'</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="n">embedding</span> <span class="o">&lt;=&gt;</span> <span class="err">$</span><span class="mi">2</span>
<span class="k">LIMIT</span> <span class="mi">10</span><span class="p">;</span>
</code></pre></div></div>

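<p>That’s the whole hybrid search problem solved with a single query. Your vector data sits next to your relational data, and your row-level security policies apply to both at once. One caveat: with an approximate index, pgvector applies the <code class="language-plaintext highlighter-rouge">WHERE</code> filters after the index scan, so a very selective filter can return fewer rows than the <code class="language-plaintext highlighter-rouge">LIMIT</code> asks for; recent pgvector versions offer iterative index scans for that case.</p>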
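
<p>To make the row-level security part concrete, here is a hedged sketch. The policy name, the <code class="language-plaintext highlighter-rouge">app.current_user_id</code> setting, and the assumption that <code class="language-plaintext highlighter-rouge">owner_id</code> is a <code class="language-plaintext highlighter-rouge">bigint</code> are all mine:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ALTER TABLE documents ENABLE ROW LEVEL SECURITY;

-- Rows are visible only to their owner. The second argument "true"
-- makes current_setting return NULL instead of raising an error
-- when the setting is missing, which hides every row.
CREATE POLICY documents_owner_only ON documents
    USING (owner_id = current_setting('app.current_user_id', true)::bigint);

-- Per request: set the caller's id for this transaction, then search.
BEGIN;
SET LOCAL app.current_user_id = '42';

SELECT id, title
FROM documents
ORDER BY embedding &lt;=&gt; $1   -- $1 is the query embedding
LIMIT 10;
COMMIT;
</code></pre></div></div>

<p>Note that the table owner and superusers bypass policies by default, so the application should connect as a separate role (or the table needs <code class="language-plaintext highlighter-rouge">FORCE ROW LEVEL SECURITY</code>).</p>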

<p><img src="/assets/images/postgres-toolbox.png" alt="A toolbox labeled PostgreSQL with various database-shaped tools inside" /></p>

<hr />

<h2 id="geospatial-postgis">Geospatial: PostGIS</h2>

<p>If you do anything with maps or routing, this isn’t even a “Postgres can do it too” situation. PostGIS <em>is</em> the industry standard. It has been for years.</p>

<p>The GiST index is the trick: it draws bounding boxes around your geometries and rejects the obvious non-matches before doing any expensive geometric math. “All coffee shops within this polygon” goes from a server-melting query to milliseconds.</p>

<p>I’ve seen teams adopt commercial GIS systems that, under the hood, were running PostGIS the whole time.</p>
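
<p>The coffee-shop query mentioned above could look roughly like this. The table, the index name, and the polygon coordinates are made up for illustration:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CREATE EXTENSION IF NOT EXISTS postgis;

-- Hypothetical table of points in WGS 84 (SRID 4326).
CREATE TABLE coffee_shops (
    id   bigserial PRIMARY KEY,
    name text NOT NULL,
    geom geometry(Point, 4326) NOT NULL
);

-- GiST index: bounding-box prefilter before the exact geometry test.
CREATE INDEX coffee_shops_geom_gist
ON coffee_shops USING GIST (geom);

-- All coffee shops inside a (made-up) rectangular polygon.
SELECT id, name
FROM coffee_shops
WHERE ST_Within(
    geom,
    ST_GeomFromText(
        'POLYGON((13.37 52.50, 13.43 52.50, 13.43 52.54, 13.37 52.54, 13.37 52.50))',
        4326
    )
);
</code></pre></div></div>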

<hr />

<h2 id="time-series-with-brin-and-partitioning">Time-series with BRIN and partitioning</h2>

<p>Telemetry, logs, IoT events — the usual reflex is “we need a time-series database.” Often you don’t. Postgres has declarative partitioning, and for time-ordered data the BRIN (Block Range Index) is a serious power tool.</p>

<p>A B-tree index stores an entry for every row. A BRIN index stores only the min and max value for each range of consecutive table blocks (128 pages per range by default, which the example below spells out explicitly). For sequentially inserted time-series data, that’s all you need: when you query a range, Postgres skips entire physical chunks of the table that can’t possibly contain matching rows.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">INDEX</span> <span class="n">events_received_brin</span>
<span class="k">ON</span> <span class="n">events</span> <span class="k">USING</span> <span class="n">BRIN</span> <span class="p">(</span><span class="n">received</span><span class="p">)</span>
<span class="k">WITH</span> <span class="p">(</span><span class="n">pages_per_range</span> <span class="o">=</span> <span class="mi">128</span><span class="p">);</span>
</code></pre></div></div>

<p>Tiny index, huge table, very fast range scans. Pair it with monthly or daily partitions and you have something that comfortably handles billions of rows for most “I need a TSDB” workloads. If you cross into TimescaleDB territory, that’s also just an extension on top of Postgres — same engine, same tools.</p>
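
<p>A sketch of the partitioning half, assuming a hypothetical <code class="language-plaintext highlighter-rouge">telemetry</code> table with monthly partitions (names and date ranges are illustrative only):</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Hypothetical parent table, partitioned by time.
CREATE TABLE telemetry (
    device_id bigint NOT NULL,
    received  timestamptz NOT NULL,
    reading   double precision
) PARTITION BY RANGE (received);

-- One partition per month; the upper bound is exclusive.
CREATE TABLE telemetry_2026_05 PARTITION OF telemetry
    FOR VALUES FROM ('2026-05-01') TO ('2026-06-01');
CREATE TABLE telemetry_2026_06 PARTITION OF telemetry
    FOR VALUES FROM ('2026-06-01') TO ('2026-07-01');

-- Defined on the parent, created on every partition (PostgreSQL 11+).
CREATE INDEX telemetry_received_brin
ON telemetry USING BRIN (received);

-- Retention becomes a metadata operation instead of a huge DELETE.
DROP TABLE telemetry_2026_05;
</code></pre></div></div>

<p>Queries that filter on <code class="language-plaintext highlighter-rouge">received</code> only touch the partitions whose range can match, and BRIN narrows things down further inside each one.</p>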

<hr />

<h2 id="dashboards-materialized-views-not-snowflake">Dashboards: materialized views, not Snowflake</h2>

<p>A standard view re-runs its query every time. A materialized view runs the heavy query once and stores the result on disk. For dashboards that aggregate over a lot of data, this is the difference between “snappy” and “the database is on fire.”</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="n">MATERIALIZED</span> <span class="k">VIEW</span> <span class="n">daily_sales</span> <span class="k">AS</span>
<span class="k">SELECT</span> <span class="n">date_trunc</span><span class="p">(</span><span class="s1">'day'</span><span class="p">,</span> <span class="n">occurred_at</span><span class="p">)</span> <span class="k">AS</span> <span class="k">day</span><span class="p">,</span>
       <span class="n">region</span><span class="p">,</span>
       <span class="k">sum</span><span class="p">(</span><span class="n">amount</span><span class="p">)</span> <span class="k">AS</span> <span class="n">total</span>
<span class="k">FROM</span> <span class="n">sales</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">;</span>

<span class="k">CREATE</span> <span class="k">UNIQUE</span> <span class="k">INDEX</span> <span class="n">daily_sales_day_region</span> <span class="k">ON</span> <span class="n">daily_sales</span> <span class="p">(</span><span class="k">day</span><span class="p">,</span> <span class="n">region</span><span class="p">);</span>

<span class="n">REFRESH</span> <span class="n">MATERIALIZED</span> <span class="k">VIEW</span> <span class="n">CONCURRENTLY</span> <span class="n">daily_sales</span><span class="p">;</span>
</code></pre></div></div>

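<p><code class="language-plaintext highlighter-rouge">CONCURRENTLY</code> is the magic word: the refresh rebuilds the result without locking out readers, who keep seeing the old contents until it finishes. It requires at least one unique index on the materialized view, which is why the example creates one first. Most “we need a data warehouse” stories I’ve heard could have started with this and grown into something heavier only if it actually became necessary.</p>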
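
<p>Something still has to trigger the refresh. One option, sketched here under the assumption that the <code class="language-plaintext highlighter-rouge">pg_cron</code> extension is installed and enabled on the server, is to schedule it inside the database; the job name and the 15-minute interval are arbitrary choices:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Assumes pg_cron is available; it is a separate extension,
-- not part of core Postgres.
CREATE EXTENSION IF NOT EXISTS pg_cron;

-- Refresh the dashboard data every 15 minutes.
SELECT cron.schedule(
    'refresh-daily-sales',
    '*/15 * * * *',
    'REFRESH MATERIALIZED VIEW CONCURRENTLY daily_sales'
);

-- The dashboard then reads precomputed rows.
SELECT day, region, total
FROM daily_sales
WHERE day &gt;= now() - interval '30 days'
ORDER BY day, region;
</code></pre></div></div>

<p>A plain cron job or an application scheduler running the same <code class="language-plaintext highlighter-rouge">REFRESH</code> statement works just as well; the point is that readers only ever see data as fresh as the last refresh.</p>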

<hr />

<h2 id="the-api-layer-postgrest-pg_graphql">The API layer: PostgREST, pg_graphql</h2>

<p>This one’s more controversial, but worth a mention. Tools like <a href="https://postgrest.org/">PostgREST</a> and the <code class="language-plaintext highlighter-rouge">pg_graphql</code> extension can generate a fully working REST or GraphQL API directly from your schema. Combine that with row-level security policies and a JWT-based auth layer, and your “backend” can shrink to a handful of SQL files.</p>

<p>I wouldn’t build a full product this way, but for internal tools, admin panels, and prototypes, it removes an absurd amount of boilerplate.</p>
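
<p>For a sense of how little is involved, here is a sketch of the database side of a read-only PostgREST setup. The schema, table, and role names are assumptions, and the password is a placeholder:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Hypothetical schema and table to expose over HTTP.
CREATE SCHEMA api;

CREATE TABLE api.tickets (
    id     bigserial PRIMARY KEY,
    title  text NOT NULL,
    status text NOT NULL DEFAULT 'open'
);

-- Role that unauthenticated requests run as; read-only.
CREATE ROLE web_anon NOLOGIN;
GRANT USAGE ON SCHEMA api TO web_anon;
GRANT SELECT ON api.tickets TO web_anon;

-- Login role the API server connects with; it switches to
-- web_anon (or another granted role) for each request.
CREATE ROLE authenticator NOINHERIT LOGIN PASSWORD 'change-me';
GRANT web_anon TO authenticator;
</code></pre></div></div>

<p>With PostgREST pointed at the <code class="language-plaintext highlighter-rouge">api</code> schema and <code class="language-plaintext highlighter-rouge">web_anon</code> as its anonymous role, a request like <code class="language-plaintext highlighter-rouge">GET /tickets?status=eq.open</code> should return the matching rows as JSON. Everything the API may do is expressed as grants and policies in the database.</p>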

<hr />

<h2 id="when-not-to-do-this">When <em>not</em> to do this</h2>

<p>Here’s the honest part. Postgres scales beautifully vertically — bigger box, faster disks, you’ll get a long way. It scales much less gracefully horizontally. If you genuinely need:</p>

<ul>
  <li>millions of writes per second across a sharded cluster,</li>
  <li>sub-millisecond in-memory caches for huge concurrent loads,</li>
  <li>globally distributed multi-region active-active writes,</li>
</ul>

<p>…then you need the specialized tools. That’s fine. But those requirements apply to a much smaller number of teams than the architecture-Twitter discourse would have you believe. Most products live and die long before they hit those limits.</p>

<p>The other place to be careful: don’t pile every workload onto a single Postgres instance and then act surprised when your OLTP traffic slows down because someone is also running a 12-hour analytical query on it. Split workloads by replica or by database when it matters.</p>

<hr />

<h2 id="the-actual-takeaway">The actual takeaway</h2>

<p>The interesting question isn’t “can Postgres replace everything?” It can’t, and it shouldn’t try.</p>

<p>The interesting question is: <em>for the next piece of infrastructure I’m about to add to my stack, do I actually need it, or am I just out of practice with Postgres?</em> In my experience, the answer is “I’m out of practice” about half the time. The other half, you reach for the specialized tool with a clear conscience and a much simpler core.</p>

<p>Fewer moving parts means fewer incidents, fewer cloud bills, and fewer phone calls at 2 a.m. That’s worth a lot.</p>

<hr />

<p><em>I work with databases for a living — mostly SQL Server in operations work, Postgres in product work. If you’ve got a stack you’re trying to simplify (or a Postgres performance problem that’s haunting you), I’m easy to find on <a href="https://linkedin.com/in/benedikt-schackenberg-b7422338b">LinkedIn</a>.</em></p>]]></content><author><name>Benedikt Schackenberg</name></author><category term="postgresql" /><category term="databases" /><category term="architecture" /><category term="backend" /><category term="dba" /><summary type="html"><![CDATA[I have a recurring conversation. Someone shows me an architecture diagram — a clean little web app, three or four engineers behind it — and there’s a Redis instance for caching, a separate queue (RabbitMQ or SQS), Elasticsearch for a single search box, a vector DB for the “AI feature” the founder promised, and somewhere in the middle, Postgres, doing the actual work.]]></summary></entry><entry><title type="html">Contained Availability Groups in SQL Server 2022: Reducing Failover Drift and Manual Sync Work</title><link href="https://schackenberg.com/2026/04/29/contained-availability-groups-sql-server-2022.html" rel="alternate" type="text/html" title="Contained Availability Groups in SQL Server 2022: Reducing Failover Drift and Manual Sync Work" /><published>2026-04-29T00:00:00+02:00</published><updated>2026-04-29T00:00:00+02:00</updated><id>https://schackenberg.com/2026/04/29/contained-availability-groups-sql-server-2022</id><content type="html" xml:base="https://schackenberg.com/2026/04/29/contained-availability-groups-sql-server-2022.html"><![CDATA[<p>In classic Always On Availability Groups, user databases are replicated — but many instance-level objects are not. 
Logins, SQL Agent jobs, linked servers, Database Mail configuration: all of these live in <code class="language-plaintext highlighter-rouge">master</code> or <code class="language-plaintext highlighter-rouge">msdb</code> at the instance level, outside the AG’s replication scope. In practice, this gap tends to surface at the worst possible moment — during a failover or a DR test, when the database is online but something that depends on those objects isn’t working.</p>

<p>The standard approach has been to keep those objects in sync manually: scripts, scheduled jobs, documentation that someone eventually stops updating. It works until it doesn’t.</p>

<p>Contained Availability Groups in SQL Server 2022 address this directly. Here’s how they work, what they solve, and where the boundaries are.</p>

<hr />

<h2 id="how-it-works">How It Works</h2>

<p>A Contained AG maintains AG-local versions of the system databases. When the AG is created with the <code class="language-plaintext highlighter-rouge">CONTAINED</code> option, SQL Server creates two additional databases that replicate as part of the AG:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">[AG_Name]_master</code> — AG-local master context</li>
  <li><code class="language-plaintext highlighter-rouge">[AG_Name]_msdb</code> — AG-local msdb context</li>
</ul>

<p>For example, an AG named <code class="language-plaintext highlighter-rouge">ContainedAG</code> would have <code class="language-plaintext highlighter-rouge">ContainedAG_master</code> and <code class="language-plaintext highlighter-rouge">ContainedAG_msdb</code>.</p>

<p>Logins, Agent jobs, DB Mail profiles, and linked servers that you create through the AG listener are stored in these AG-local databases — not in the instance-level system databases. They replicate to all replicas and fail over with the AG.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="n">AVAILABILITY</span> <span class="k">GROUP</span> <span class="p">[</span><span class="n">ContainedAG</span><span class="p">]</span>
<span class="k">WITH</span> <span class="p">(</span>
    <span class="n">CLUSTER_TYPE</span> <span class="o">=</span> <span class="n">WSFC</span><span class="p">,</span>
    <span class="n">CONTAINED</span>
<span class="p">)</span>
<span class="k">FOR</span> <span class="k">DATABASE</span> <span class="p">[</span><span class="n">YourDatabase</span><span class="p">]</span>
<span class="n">REPLICA</span> <span class="k">ON</span>
    <span class="n">N</span><span class="s1">'Node1'</span> <span class="k">WITH</span> <span class="p">(</span>
        <span class="n">ENDPOINT_URL</span>      <span class="o">=</span> <span class="n">N</span><span class="s1">'TCP://node1.domain.local:5022'</span><span class="p">,</span>
        <span class="n">AVAILABILITY_MODE</span> <span class="o">=</span> <span class="n">SYNCHRONOUS_COMMIT</span><span class="p">,</span>
        <span class="n">FAILOVER_MODE</span>     <span class="o">=</span> <span class="n">AUTOMATIC</span><span class="p">,</span>
        <span class="n">SEEDING_MODE</span>      <span class="o">=</span> <span class="n">AUTOMATIC</span>
    <span class="p">),</span>
    <span class="n">N</span><span class="s1">'Node2'</span> <span class="k">WITH</span> <span class="p">(</span>
        <span class="n">ENDPOINT_URL</span>      <span class="o">=</span> <span class="n">N</span><span class="s1">'TCP://node2.domain.local:5022'</span><span class="p">,</span>
        <span class="n">AVAILABILITY_MODE</span> <span class="o">=</span> <span class="n">SYNCHRONOUS_COMMIT</span><span class="p">,</span>
        <span class="n">FAILOVER_MODE</span>     <span class="o">=</span> <span class="n">AUTOMATIC</span><span class="p">,</span>
        <span class="n">SEEDING_MODE</span>      <span class="o">=</span> <span class="n">AUTOMATIC</span>
    <span class="p">);</span>
</code></pre></div></div>

<p>To create objects in the contained context, connect through the AG listener — not directly to the instance. Objects created via the listener land in the AG-local system databases:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- Creates the login in ContainedAG_master, not instance-level master</span>
<span class="k">CREATE</span> <span class="n">LOGIN</span> <span class="p">[</span><span class="n">app_svc</span><span class="p">]</span>
    <span class="k">WITH</span> <span class="n">PASSWORD</span>    <span class="o">=</span> <span class="n">N</span><span class="s1">'...'</span><span class="p">,</span>
    <span class="n">DEFAULT_DATABASE</span> <span class="o">=</span> <span class="p">[</span><span class="n">YourDatabase</span><span class="p">];</span>
</code></pre></div></div>
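
<p>A quick sanity check after creating the login, run through the listener. This is a sketch: it assumes the <code class="language-plaintext highlighter-rouge">is_contained</code> column in <code class="language-plaintext highlighter-rouge">sys.availability_groups</code> and that catalog views queried over the listener resolve against the AG-local <code class="language-plaintext highlighter-rouge">master</code>:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Confirm the AG was created as contained.
SELECT name, is_contained
FROM sys.availability_groups
WHERE name = N'ContainedAG';

-- Connected via the listener: the login should show up here.
-- Connected directly to the instance: it should not.
SELECT name, type_desc, create_date
FROM sys.server_principals
WHERE name = N'app_svc';
</code></pre></div></div>

<p>Running the second query both ways is a cheap way to confirm which context a given connection string actually lands in.</p>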

<hr />

<h2 id="what-changes-operationally">What Changes Operationally</h2>

<p><strong>Login and job consistency after failover.</strong> This is the core benefit. Logins and Agent jobs created through the listener exist on all replicas from the start. There’s no post-failover sync step. In environments where failovers have historically required follow-up work on instance-level objects, this removes a significant chunk of that manual overhead.</p>

<p><strong>DR site setup is simpler.</strong> When a new replica joins the AG, AG-local objects come along with replication. You don’t need a separate synchronization process for logins and jobs that belong to the AG’s scope.</p>

<p><strong>Cleaner object isolation in multi-AG environments.</strong> On instances running several AGs — common in shared infrastructure or consulting setups — each Contained AG carries its own login and job context. Objects for one AG don’t need to exist at the instance level, which reduces configuration drift across AGs over time.</p>

<hr />

<h2 id="limitations">Limitations</h2>

<p><strong>SQL Agent jobs run on the primary only.</strong> Agent jobs replicate to all replicas in the contained context, but they execute only on the current primary. If your jobs have any node-specific assumptions, review them before migrating.</p>
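
<p>One way to start that review is to search existing job steps for hard-coded node names. A rough sketch, using the example node names from above as search terms:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Find Agent job steps whose command text mentions a specific node.
-- 'Node1' and 'Node2' are the example names; substitute real host names.
SELECT j.name      AS job_name,
       s.step_id,
       s.step_name,
       s.command
FROM msdb.dbo.sysjobs     AS j
JOIN msdb.dbo.sysjobsteps AS s
  ON s.job_id = j.job_id
WHERE s.command LIKE N'%Node1%'
   OR s.command LIKE N'%Node2%'
ORDER BY j.name, s.step_id;
</code></pre></div></div>

<p>This only catches literal references in step commands; assumptions hidden in local file paths or proxy accounts still need a manual look.</p>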

<p><strong>Contained AGs are not a security boundary.</strong> Microsoft is explicit about this. The AG-local system databases are a configuration consistency mechanism — not an isolation layer. Instance-level sysadmin access still reaches the AG databases. Building a multi-tenant security model around Contained AGs is not supported and not safe.</p>

<p><strong>No in-place conversion from a traditional AG.</strong> There is no <code class="language-plaintext highlighter-rouge">ALTER AVAILABILITY GROUP ... ADD CONTAINED</code>. Converting an existing AG means creating a new Contained AG and migrating databases into it. This is worth planning carefully if you have many databases or complex Agent job configurations.</p>

<p><strong>Instance-level objects remain separate.</strong> The AG-local <code class="language-plaintext highlighter-rouge">[AG_Name]_master</code> and <code class="language-plaintext highlighter-rouge">[AG_Name]_msdb</code> exist alongside the instance-level system databases — they don’t replace them. Startup procedures, server-scoped configuration (<code class="language-plaintext highlighter-rouge">sp_configure</code>), instance-wide linked servers, and certificates not scoped to the AG still need to be managed independently on each replica.</p>

<p><strong>Not all msdb objects are covered.</strong> SSIS packages stored in msdb, certain maintenance plan metadata, and some system-managed jobs do not replicate. Audit your msdb contents before assuming everything will be included.</p>
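
<p>A starting point for that audit, run against the instance-level <code class="language-plaintext highlighter-rouge">msdb</code>. It is a sketch that covers only SSIS packages and maintenance plans, not every object type that might matter in your environment:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- SSIS packages stored in msdb (legacy package store).
SELECT name, createdate
FROM msdb.dbo.sysssispackages
ORDER BY name;

-- Maintenance plans defined on this instance.
SELECT name, create_date
FROM msdb.dbo.sysmaintplan_plans
ORDER BY name;
</code></pre></div></div>

<p>Anything these return needs its own deployment or sync story; it will not follow the AG on failover.</p>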

<hr />

<h2 id="object-replication-whats-covered">Object Replication: What’s Covered</h2>

<table>
  <thead>
    <tr>
      <th>Object</th>
      <th>Traditional AG</th>
      <th>Contained AG</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>User database data</td>
      <td>Replicated</td>
      <td>Replicated</td>
    </tr>
    <tr>
      <td>Logins (AG-scoped)</td>
      <td>Manual sync</td>
      <td>Replicated via listener</td>
    </tr>
    <tr>
      <td>SQL Agent jobs (AG-scoped)</td>
      <td>Manual sync</td>
      <td>Replicated; primary only</td>
    </tr>
    <tr>
      <td>DB Mail profiles</td>
      <td>Manual sync</td>
      <td>Replicated via listener</td>
    </tr>
    <tr>
      <td>Linked servers (AG-scoped)</td>
      <td>Manual sync</td>
      <td>Replicated via listener</td>
    </tr>
    <tr>
      <td>Instance configuration</td>
      <td>Manual sync</td>
      <td>Still manual</td>
    </tr>
    <tr>
      <td>Startup procedures</td>
      <td>Manual sync</td>
      <td>Still manual</td>
    </tr>
    <tr>
      <td>SSIS packages in msdb</td>
      <td>Manual sync</td>
      <td>Not covered</td>
    </tr>
    <tr>
      <td>Security boundary</td>
      <td>No</td>
      <td>No</td>
    </tr>
    <tr>
      <td>Minimum version</td>
      <td>SQL Server 2012</td>
      <td>SQL Server 2022</td>
    </tr>
  </tbody>
</table>

<hr />

<h2 id="when-to-use-it">When to Use It</h2>

<p>Contained AGs make the most sense for new deployments on SQL Server 2022 where you want login and job consistency handled at the infrastructure level rather than through scripts. The benefit is most visible in environments with frequent failovers, regular DR testing, or teams where multiple people manage the secondary replicas.</p>

<p>For existing environments: if your current sync process is reliable and tested, rebuilding the AG as a contained one carries real disruption. It’s not necessarily worth doing mid-cycle. For greenfield setups, there’s little reason not to use it.</p>

<hr />

<p><em>I work as a database architect and consultant across logistics, banking, and healthcare environments — AlwaysOn clusters are a recurring topic. Feedback or questions: <a href="https://linkedin.com/in/benedikt-schackenberg-b7422338b">LinkedIn</a>.</em></p>]]></content><author><name>Benedikt Schackenberg</name></author><category term="sql-server" /><category term="high-availability" /><category term="dba" /><category term="always-on" /><category term="sql-server-2022" /><summary type="html"><![CDATA[In classic Always On Availability Groups, user databases are replicated — but many instance-level objects are not. Logins, SQL Agent jobs, linked servers, Database Mail configuration: all of these live in master or msdb at the instance level, outside the AG’s replication scope. In practice, this gap tends to surface at the worst possible moment — during a failover or a DR test, when the database is online but something that depends on those objects isn’t working.]]></summary></entry><entry><title type="html">NemoClaw: NVIDIA’s Open-Source Stack for Safer AI Agents — and Why I Keep Contributing to It</title><link href="https://schackenberg.com/2026/04/14/contributing-to-nemoclaw.html" rel="alternate" type="text/html" title="NemoClaw: NVIDIA’s Open-Source Stack for Safer AI Agents — and Why I Keep Contributing to It" /><published>2026-04-14T00:00:00+02:00</published><updated>2026-04-14T00:00:00+02:00</updated><id>https://schackenberg.com/2026/04/14/contributing-to-nemoclaw</id><content type="html" xml:base="https://schackenberg.com/2026/04/14/contributing-to-nemoclaw.html"><![CDATA[<p>There’s a specific kind of Saturday afternoon that goes like this: you open your laptop, tell yourself you’ll just “quickly check one PR”, and three hours later you’re neck-deep in Rust serde internals because a field called <code class="language-plaintext highlighter-rouge">tool_policy</code> doesn’t exist in OpenShell’s schema and you’ve just found out the hard way.</p>

<p>That’s NemoClaw. And I kind of love it.</p>

<h2 id="what-even-is-nemoclaw">What Even Is NemoClaw?</h2>

<p>Let me back up — and sort out three terms that get conflated a lot.</p>

<p><strong>OpenShell</strong> is NVIDIA’s secure container runtime for AI workloads. It’s the actual sandbox layer: Landlock-based filesystem isolation, network proxy with policy enforcement, process restrictions. The thing that actually makes sure the agent can’t read your SSH keys or phone home to a random server.</p>

<p><strong>OpenClaw</strong> is the agent framework — the layer that connects a language model to tools, memory, and channels like Slack or Discord.</p>

<p><strong>NemoClaw</strong> is the reference stack that ties both together. It’s the CLI, the onboarding flow, the policy presets, the deployment tooling, and all the hardening decisions that turn “here’s a container with an agent” into something you can actually run in production without lying awake at night.</p>

<p>Think of it less as “Docker, but for AI” and more as the opinionated security wrapper that reduces the blast radius when an agent goes off-script — because they will, eventually.</p>

<p>The core insight: when you give an AI agent access to your terminal, your files, and your APIs, you need guardrails. Not because AI is inherently malicious, but because AI is <em>confidently wrong</em> in ways that can be catastrophic. NemoClaw defines exactly what the agent is allowed to touch — and enforces it at the runtime level, not just by convention.</p>

<p><strong>In concrete terms, NemoClaw lets you:</strong></p>

<ul>
  <li><strong>Run AI coding agents</strong> (Claude, GPT-4, local Ollama/vLLM models) inside an OpenShell sandbox — the agent can write code, run tests, make API calls, and open PRs, but stays within explicitly defined boundaries</li>
  <li><strong>Define network policies</strong> — exactly which endpoints the agent can reach, with which HTTP methods, on which paths. Want the agent to use Slack but not be able to delete channels? One YAML stanza.</li>
  <li><strong>Use local or cloud inference</strong> — NVIDIA NIM, vLLM, Ollama on your own hardware, or cloud providers. <code class="language-plaintext highlighter-rouge">nemoclaw onboard</code> walks you through the whole setup in minutes.</li>
  <li><strong>Isolate filesystem access</strong> — Landlock enforcement means the agent can only read/write what you explicitly allow. Your SSH keys, your production configs, your secrets — not accessible.</li>
  <li><strong>Integrate with messaging</strong> — Telegram, Slack, Discord. The agent can send updates or accept instructions through these channels, with the same policy enforcement in place.</li>
  <li><strong>Automate enterprise workflows</strong> — Jira, Outlook, GitHub. You configure the preset, the agent handles the repetitive work, and the policy layer makes sure it stays in its lane.</li>
</ul>

<p>The CLI is TypeScript, the orchestration is Python, the policy files are YAML, and the hardening is… a lot of bash. A <em>lot</em> of bash.</p>

<h2 id="why-do-i-contribute-in-my-free-time">Why Do I Contribute in My Free Time?</h2>

<p>Honest answer: because the problem is genuinely interesting.</p>

<p>Security for AI agents is not a solved problem. It’s not even a well-framed problem yet. Most people are either in the “just let it rip and hope for the best” camp or the “don’t give it any tools at all” camp. NemoClaw is trying to chart a path down the middle — maximum capability, minimum attack surface. That’s hard engineering.</p>

<p>The second reason is more practical: I use this stuff. I run OpenClaw (the agent framework) on my own infrastructure, and NemoClaw is directly relevant to how those agents operate. When I fix a bug here, I’m also fixing it for myself.</p>

<p>The third reason, and I’ll be honest here, is that contributing to a project with active NVIDIA engineers reviewing your code is a pretty good way to learn. cv (one of the maintainers) writes review comments that are worth more than most blog posts. When he approves something with “Security design is sound — three independent defense layers verified,” that means something.</p>

<h2 id="what-weve-actually-shipped">What We’ve Actually Shipped</h2>

<p>Over the past few weeks, my AI assistant Rainer and I (yes, I use an AI agent to help contribute to an AI agent sandbox project, the irony is not lost on me) have gotten a non-trivial list of PRs merged:</p>

<p><strong>Sandbox hardening:</strong></p>
<ul>
  <li><a href="https://github.com/NVIDIA/NemoClaw/pull/830"><strong>#830</strong></a> — Removed <code class="language-plaintext highlighter-rouge">gcc</code>, <code class="language-plaintext highlighter-rouge">g++</code>, <code class="language-plaintext highlighter-rouge">cpp</code>, <code class="language-plaintext highlighter-rouge">make</code>, and <code class="language-plaintext highlighter-rouge">netcat</code> from the sandbox image. These tools have no business being in an agent’s runtime environment. ulimit hard+soft limits added.</li>
  <li><a href="https://github.com/NVIDIA/NemoClaw/pull/654"><strong>#654</strong></a> — Multi-layer config protection: <code class="language-plaintext highlighter-rouge">openclaw.json</code> moved to <code class="language-plaintext highlighter-rouge">/etc/openclaw/</code> (outside the agent-writable tree), root:root 555 permissions, TOCTOU guard with <code class="language-plaintext highlighter-rouge">rmtree</code> before the runtime copy. Three independent layers. Took several rounds of review to get right — including discovering that <code class="language-plaintext highlighter-rouge">tool_policy</code> is not a valid OpenShell schema field (more on that below).</li>
</ul>

<p><strong>CLI fixes:</strong></p>
<ul>
  <li><a href="https://github.com/NVIDIA/NemoClaw/pull/1245"><strong>#1245</strong></a> — <code class="language-plaintext highlighter-rouge">registry.clearAll()</code> was called unconditionally on <code class="language-plaintext highlighter-rouge">gateway destroy</code>, which wiped your sandbox list even if the destroy failed. Now it only runs on success.</li>
  <li><a href="https://github.com/NVIDIA/NemoClaw/pull/1246"><strong>#1246</strong></a> — Secret patterns (API keys, tokens) are now redacted from CLI log output and error messages before they hit disk or stdout. Simple but important.</li>
  <li><a href="https://github.com/NVIDIA/NemoClaw/pull/1526"><strong>#1526</strong></a> — The <code class="language-plaintext highlighter-rouge">nemoclaw destroy</code> confirmation prompt now explicitly tells you that <em>workspace files will be permanently deleted</em>. Previously it just said “This cannot be undone.” Technically true, but not helpful.</li>
  <li><a href="https://github.com/NVIDIA/NemoClaw/pull/1884"><strong>#1884</strong></a> — Model and provider were never persisted to the sandbox registry after onboarding. <code class="language-plaintext highlighter-rouge">registry.updateSandbox()</code> was called before <code class="language-plaintext highlighter-rouge">registerSandbox()</code>, so it silently no-oped. One line fix, annoying to track down.</li>
</ul>

<p><strong>Policy and network:</strong></p>
<ul>
  <li><a href="https://github.com/NVIDIA/NemoClaw/pull/1540"><strong>#1540</strong></a> — Two policy preset fixes: the HuggingFace inference endpoint had migrated from <code class="language-plaintext highlighter-rouge">api-inference.huggingface.co</code> to <code class="language-plaintext highlighter-rouge">router.huggingface.co</code> (the old one returns HTTP 410), and the Discord preset allowed <code class="language-plaintext highlighter-rouge">DELETE</code> on all paths including channels and roles. Scoped it down to message/reaction paths only.</li>
  <li><a href="https://github.com/NVIDIA/NemoClaw/pull/1700"><strong>#1700</strong></a> — The baseline sandbox policy included <code class="language-plaintext highlighter-rouge">/usr/local/bin/npm</code> and <code class="language-plaintext highlighter-rouge">/usr/local/bin/node</code> in the <code class="language-plaintext highlighter-rouge">npm_registry</code> binaries list, which meant <code class="language-plaintext highlighter-rouge">npm install</code> worked even with the “none” policy preset selected. The entry exists for <code class="language-plaintext highlighter-rouge">openclaw plugins install</code> only. Fixed.</li>
  <li><a href="https://github.com/NVIDIA/NemoClaw/pull/1885"><strong>#1885</strong></a> — Removed the deprecated <code class="language-plaintext highlighter-rouge">tls: terminate</code> field from every policy YAML in the repo — 40 lines gone across presets, the baseline policy, and the Hermes agent additions. Was generating WARN logs on every sandbox start.</li>
</ul>

<p><strong>Runtime improvements:</strong></p>
<ul>
  <li><a href="https://github.com/NVIDIA/NemoClaw/pull/980"><strong>#980</strong></a> — vLLM and NVIDIA NIM backends were broken with the standard chat completions API. Added forced routing to the right API type during onboard.</li>
  <li><a href="https://github.com/NVIDIA/NemoClaw/pull/1359"><strong>#1359</strong></a> — Podman support alongside Docker. Socket detection, CoreDNS patching, the works.</li>
  <li><a href="https://github.com/NVIDIA/NemoClaw/pull/433"><strong>#433</strong></a> — Kubelet conflict detection before gateway start. If you’re running k3s, MicroK8s, or kubeadm on the same host, the cgroup namespace clash will silently break things. Now you get a warning.</li>
</ul>

<h2 id="the-tool_policy-lesson">The <code class="language-plaintext highlighter-rouge">tool_policy</code> Lesson</h2>

<p>I want to tell this story because it’s a good one.</p>

<p><a href="https://github.com/NVIDIA/NemoClaw/pull/654">PR #654</a> — the config hardening one — looked clean. cv approved it. The security design was solid. And then ericksoa dug into the OpenShell Rust source code and found that <code class="language-plaintext highlighter-rouge">PolicyFile</code> uses <code class="language-plaintext highlighter-rouge">#[serde(deny_unknown_fields)]</code>. Which means if you add a field OpenShell doesn’t know about — like, say, a <code class="language-plaintext highlighter-rouge">tool_policy</code> block — it doesn’t ignore it. It <strong>rejects the entire policy file at deserialization time</strong>. Sandboxes fail to start.</p>

<p>We had been so focused on the security logic that nobody had checked the schema constraints. The fix was straightforward (remove the block, add a comment explaining why it’s absent and what the actual enforcement is), but it was a good reminder that “looks good in YAML” and “parses correctly at runtime” are two different things.</p>

<p>This is the kind of thing you only learn by actually submitting code to a project with people who read the source.</p>

<h2 id="the-rebase-loop">The Rebase Loop</h2>

<p>One thing nobody tells you about open-source contribution is the rebasing. You open a PR, it looks good, someone reviews it, you address the feedback, you push — and then three days later there’s a merge conflict because 40 other commits landed on main.</p>

<p>I have a small script at this point that fetches upstream, rebases, force-pushes, and leaves a comment so the maintainer knows it’s been updated. I’ve used it approximately forty-seven times across these PRs. The <code class="language-plaintext highlighter-rouge">fix/kubelet-cgroup-conflict</code> branch has been rebased so many times it knows the codebase better than I do.</p>

<h2 id="whats-next">What’s Next</h2>

<p>A few PRs are still in review — the tamper-evident audit logging (<a href="https://github.com/NVIDIA/NemoClaw/pull/916">#916</a>), the kubelet detection (<a href="https://github.com/NVIDIA/NemoClaw/pull/433">#433</a>), the K8s resource limits (<a href="https://github.com/NVIDIA/NemoClaw/pull/1448">#1448</a>). There are also some newer issues I’m tracking: sandbox image missing <code class="language-plaintext highlighter-rouge">gnupg</code>, the Slack <code class="language-plaintext highlighter-rouge">SLACK_APP_TOKEN</code> not being forwarded, the <code class="language-plaintext highlighter-rouge">nemoclaw list</code> resurrection bug.</p>

<p>The project is moving fast. Between PRs, there was a full TypeScript migration of the CLI (which required rebasing everything again, obviously), new preset infrastructure, and a handful of security audit issues opened in bulk. It’s an active codebase.</p>

<p>If you’re interested in AI agent security, sandboxing, or just want to contribute to something that actually runs NVIDIA inference workloads in production — take a look at <a href="https://github.com/NVIDIA/NemoClaw">github.com/NVIDIA/NemoClaw</a>. The maintainers are responsive, the reviews are thorough, and there are always open issues that need someone to pick them up.</p>

<p>Just don’t put <code class="language-plaintext highlighter-rouge">tool_policy</code> in your YAML.</p>]]></content><author><name>Benedikt Schackenberg</name></author><category term="nemoclaw" /><category term="nvidia" /><category term="open-source" /><category term="ai-safety" /><category term="sandbox" /><category term="contributing" /><category term="security" /><summary type="html"><![CDATA[There’s a specific kind of Saturday afternoon that goes like this: you open your laptop, tell yourself you’ll just “quickly check one PR”, and three hours later you’re neck-deep in Rust serde internals because a field called tool_policy doesn’t exist in OpenShell’s schema and you’ve just found out the hard way.]]></summary></entry><entry><title type="html">Upgrading to SQL Server 2022: Paths, Pitfalls, and Best Practices</title><link href="https://schackenberg.com/2026/04/05/sql-server-2022-upgrade-guide.html" rel="alternate" type="text/html" title="Upgrading to SQL Server 2022: Paths, Pitfalls, and Best Practices" /><published>2026-04-05T00:00:00+02:00</published><updated>2026-04-05T00:00:00+02:00</updated><id>https://schackenberg.com/2026/04/05/sql-server-2022-upgrade-guide</id><content type="html" xml:base="https://schackenberg.com/2026/04/05/sql-server-2022-upgrade-guide.html"><![CDATA[<p>SQL Server 2022 (internal version 16.x) is the most feature-rich release Microsoft has shipped in years — better Azure integration, improved Query Store defaults, ledger tables, and performance improvements that actually matter in production. If you’re still running SQL Server 2014, 2016, or 2019, the upgrade is worth doing. The question is how.</p>

<p>This post covers what you need to know: supported upgrade paths, the three main methods, Always On rolling upgrades, and what to check before you touch anything in production.</p>

<h2 id="supported-source-versions">Supported Source Versions</h2>

<p>Not every version can upgrade directly to SQL Server 2022. According to Microsoft’s official documentation, in-place upgrades are supported from:</p>

<table>
  <thead>
    <tr>
      <th>Source Version</th>
      <th>Minimum Build Required</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>SQL Server 2012</td>
      <td>SP4 or later</td>
    </tr>
    <tr>
      <td>SQL Server 2014</td>
      <td>SP3 or later</td>
    </tr>
    <tr>
      <td>SQL Server 2016</td>
      <td>SP3 or later</td>
    </tr>
    <tr>
      <td>SQL Server 2017</td>
      <td>Any RTM or later</td>
    </tr>
    <tr>
      <td>SQL Server 2019</td>
      <td>Any RTM or later</td>
    </tr>
  </tbody>
</table>

<p>If you’re on SQL Server 2008 or 2008 R2, you <strong>cannot</strong> upgrade directly — you’ll need to go via an intermediate version (e.g., 2014 or 2016) first. Same if you’re on SQL Server 2012 SP3 or lower: get to SP4 before you attempt anything.</p>

<p>Also worth noting: SQL Server 2022 is <strong>64-bit only</strong>. If you’re still running a 32-bit SQL Server instance (unlikely in 2026, but it happens in legacy environments), you can’t do an in-place upgrade. You’ll need to restore databases to a fresh 64-bit installation and transfer the logins separately, because they live in <code class="language-plaintext highlighter-rouge">master</code> and don’t travel with the user databases.</p>
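
<p>To see where an instance stands against the table above, a query along these lines can help. It’s a minimal sketch built on the <code class="language-plaintext highlighter-rouge">SERVERPROPERTY</code> function; the column aliases are only illustrative names.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Version, service pack level, and edition of the current instance
SELECT
    SERVERPROPERTY('ProductVersion') AS product_version, -- 11.x = 2012, 12.x = 2014, 13.x = 2016, 14.x = 2017, 15.x = 2019
    SERVERPROPERTY('ProductLevel')   AS product_level,   -- RTM, SP1, SP2, ...
    SERVERPROPERTY('Edition')        AS edition;         -- 64-bit builds append "(64-bit)"
</code></pre></div></div>

<p>If <code class="language-plaintext highlighter-rouge">product_level</code> is below the minimum in the table, patch the source instance first and only then plan the upgrade.</p>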

<h2 id="before-you-touch-anything-the-checklist">Before You Touch Anything: The Checklist</h2>

<p>Every DBA has a story about an upgrade that went sideways because someone skipped the prep work. Here’s what actually matters:</p>

<p><strong>1. Get sign-off from the application vendor.</strong>
This is the step most people skip and then regret. Before you upgrade the SQL Server, confirm that your application supports SQL Server 2022. Some ISVs are slow to certify new versions, and you don’t want to find out post-upgrade that your ERP system throws errors.</p>

<p><strong>2. Check OS compatibility.</strong>
SQL Server 2022 requires Windows Server 2016 or later, so Windows Server 2016, 2019, and 2022 are all fine. Windows Server 2012 R2? You’ll need to upgrade the OS first, or migrate to new hardware.</p>

<p><strong>3. No pending restarts.</strong>
SQL Server Setup will block the upgrade if Windows has a pending restart. Check with <code class="language-plaintext highlighter-rouge">Test-PendingReboot</code> (from the <code class="language-plaintext highlighter-rouge">PendingReboot</code> module) or just reboot before you start.</p>

<p><strong>4. Full backups — and verify them.</strong>
This sounds obvious, but verify that your backups actually restore. A backup you haven’t tested is not a backup.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- Check last backup dates for all user databases</span>
<span class="k">SELECT</span> 
    <span class="n">d</span><span class="p">.</span><span class="n">name</span><span class="p">,</span>
    <span class="k">MAX</span><span class="p">(</span><span class="n">b</span><span class="p">.</span><span class="n">backup_finish_date</span><span class="p">)</span> <span class="k">AS</span> <span class="n">last_full_backup</span><span class="p">,</span>
    <span class="n">DATEDIFF</span><span class="p">(</span><span class="n">HOUR</span><span class="p">,</span> <span class="k">MAX</span><span class="p">(</span><span class="n">b</span><span class="p">.</span><span class="n">backup_finish_date</span><span class="p">),</span> <span class="n">GETDATE</span><span class="p">())</span> <span class="k">AS</span> <span class="n">hours_since_backup</span>
<span class="k">FROM</span> <span class="n">sys</span><span class="p">.</span><span class="n">databases</span> <span class="n">d</span>
<span class="k">LEFT</span> <span class="k">JOIN</span> <span class="n">msdb</span><span class="p">.</span><span class="n">dbo</span><span class="p">.</span><span class="n">backupset</span> <span class="n">b</span> 
    <span class="k">ON</span> <span class="n">d</span><span class="p">.</span><span class="n">name</span> <span class="o">=</span> <span class="n">b</span><span class="p">.</span><span class="n">database_name</span> <span class="k">AND</span> <span class="n">b</span><span class="p">.</span><span class="k">type</span> <span class="o">=</span> <span class="s1">'D'</span>
<span class="k">WHERE</span> <span class="n">d</span><span class="p">.</span><span class="n">database_id</span> <span class="o">&gt;</span> <span class="mi">4</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="n">d</span><span class="p">.</span><span class="n">name</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="n">last_full_backup</span> <span class="k">ASC</span><span class="p">;</span>
</code></pre></div></div>
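
<p>Knowing that a backup exists is only half of it. As a first sanity check, something like the following can be run against the backup file. This is a sketch: the UNC path and file name are placeholders, and <code class="language-plaintext highlighter-rouge">RESTORE VERIFYONLY</code> only confirms that the backup set is complete and readable. It is not a substitute for a real test restore onto a scratch instance.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Check that the backup set is complete and readable, without restoring it.
-- Placeholder path; adjust to your backup location.
-- WITH CHECKSUM fails if the backup was taken without checksums;
-- drop the option in that case.
RESTORE VERIFYONLY
FROM DISK = N'\\fileserver\backup\YourDatabase_full.bak'
WITH CHECKSUM;
</code></pre></div></div>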

<p><strong>5. Check database compatibility levels.</strong>
Post-upgrade, your databases will keep their existing compatibility level. You should plan to raise them — but not on upgrade day. Raise them in a maintenance window after you’ve confirmed everything works.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- Current compatibility levels</span>
<span class="k">SELECT</span> <span class="n">name</span><span class="p">,</span> <span class="n">compatibility_level</span> <span class="k">FROM</span> <span class="n">sys</span><span class="p">.</span><span class="n">databases</span> <span class="k">ORDER</span> <span class="k">BY</span> <span class="n">name</span><span class="p">;</span>

<span class="c1">-- SQL Server 2022 native level is 160</span>
<span class="c1">-- SQL Server 2019 = 150, 2017 = 140, 2016 = 130</span>
</code></pre></div></div>

<p><strong>6. Run the Database Experimentation Assistant (DEA) or Query Store.</strong>
If you’re going from 2016 or later, enable Query Store before the upgrade, let it collect a workload baseline, then compare after the compatibility level change. DEA does this comparison automatically and flags query regressions.</p>

<p><strong>7. Disable Auto-Failover (if Always On).</strong>
More on this below.</p>

<hr />

<h2 id="the-three-upgrade-methods">The Three Upgrade Methods</h2>

<h3 id="method-1-in-place-upgrade">Method 1: In-Place Upgrade</h3>

<p>The simplest path. You run SQL Server 2022 Setup on the existing server, it replaces the binaries, upgrades the system databases, and you’re done. IP address, instance name, and port don’t change — applications don’t need to be reconfigured.</p>

<p><strong>Pros:</strong></p>
<ul>
  <li>Fast (downtime is typically 15–45 minutes, independent of database size)</li>
  <li>No application reconfiguration</li>
  <li>Logins, jobs, and linked servers are preserved automatically</li>
</ul>

<p><strong>Cons:</strong></p>
<ul>
  <li>If something goes wrong, rollback means restoring from backup (time-consuming)</li>
  <li>You can’t add new features during the upgrade; you have to run Setup again afterward</li>
  <li>Not supported for 32-bit instances or cross-platform upgrades</li>
</ul>

<p><strong>When to use it:</strong> When you have a maintenance window, a solid backup, and a relatively straightforward environment without custom components that might conflict.</p>

<h3 id="method-2-side-by-side-migration-new-server">Method 2: Side-by-Side Migration (New Server)</h3>

<p>You set up a fresh SQL Server 2022 instance on new hardware (or a new VM), then migrate databases using backup/restore or detach/attach, recreate logins, and reconfigure your application to point at the new server.</p>

<div class="language-powershell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># dbatools makes this significantly easier</span><span class="w">
</span><span class="c"># Copy all user databases from old to new server</span><span class="w">
</span><span class="n">Copy-DbaDatabase</span><span class="w"> </span><span class="nt">-Source</span><span class="w"> </span><span class="nx">OLDSQLSERVER</span><span class="w"> </span><span class="nt">-Destination</span><span class="w"> </span><span class="nx">NEWSQLSERVER2022</span><span class="w"> </span><span class="se">`
</span><span class="w">    </span><span class="nt">-Database</span><span class="w"> </span><span class="p">(</span><span class="n">Get-DbaDatabase</span><span class="w"> </span><span class="nt">-SqlInstance</span><span class="w"> </span><span class="nx">OLDSQLSERVER</span><span class="w"> </span><span class="nt">-ExcludeSystem</span><span class="p">)</span><span class="o">.</span><span class="nf">Name</span><span class="w"> </span><span class="err">`</span><span class="w">
    </span><span class="nt">-BackupRestore</span><span class="w"> </span><span class="nt">-SharedPath</span><span class="w"> </span><span class="n">\\fileserver\backup</span><span class="w"> </span><span class="nt">-WithReplace</span><span class="w">

</span><span class="c"># Sync logins (including passwords and SIDs)</span><span class="w">
</span><span class="n">Copy-DbaLogin</span><span class="w"> </span><span class="nt">-Source</span><span class="w"> </span><span class="nx">OLDSQLSERVER</span><span class="w"> </span><span class="nt">-Destination</span><span class="w"> </span><span class="nx">NEWSQLSERVER2022</span><span class="w">

</span><span class="c"># Copy SQL Agent jobs</span><span class="w">
</span><span class="n">Copy-DbaAgentJob</span><span class="w"> </span><span class="nt">-Source</span><span class="w"> </span><span class="nx">OLDSQLSERVER</span><span class="w"> </span><span class="nt">-Destination</span><span class="w"> </span><span class="nx">NEWSQLSERVER2022</span><span class="w">
</span></code></pre></div></div>

<p><strong>Pros:</strong></p>
<ul>
  <li>Clean installation on new hardware</li>
  <li>Easy rollback (old server still exists)</li>
  <li>Opportunity to consolidate or change hardware specs</li>
</ul>

<p><strong>Cons:</strong></p>
<ul>
  <li>Application connection strings must be updated (or DNS aliasing used)</li>
  <li>Takes longer, especially for large databases</li>
  <li>More operational complexity</li>
</ul>

<p><strong>When to use it:</strong> When you’re also replacing hardware, when rollback safety is paramount, or when your environment has grown significantly since the last SQL Server install.</p>
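
<p>After a side-by-side migration it’s worth checking for orphaned users, meaning database users whose SID has no matching login on the new server. <code class="language-plaintext highlighter-rouge">Copy-DbaLogin</code> preserves SIDs, so the result should be empty, but if any logins were recreated by hand they tend to show up here. A minimal sketch, to be run in each migrated database:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- SQL-authenticated users whose SID has no matching server login
SELECT dp.name AS orphaned_user, dp.type_desc
FROM sys.database_principals AS dp
LEFT JOIN sys.server_principals AS sp
    ON dp.sid = sp.sid
WHERE sp.sid IS NULL
  AND dp.type = 'S'                            -- SQL users
  AND dp.authentication_type_desc = 'INSTANCE' -- expected to map to a server login
  AND dp.principal_id &gt; 4;                     -- skip dbo, guest, INFORMATION_SCHEMA, sys
</code></pre></div></div>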

<h3 id="method-3-rolling-upgrade-with-always-on-availability-groups">Method 3: Rolling Upgrade with Always On Availability Groups</h3>

<p>If you’re running Always On AGs, you can upgrade with near-zero downtime. The approach works because SQL Server supports mixed-version AGs during the upgrade window.</p>

<p><strong>The sequence:</strong></p>

<ol>
  <li>
    <p><strong>Disable automatic failover</strong> on all availability groups before you start.</p>
  </li>
  <li>
    <p><strong>Upgrade secondary replicas first</strong> — start with async secondaries, then sync secondaries. During this period, replication from primary to upgraded secondary still works (newer versions can receive from older primaries).</p>
  </li>
  <li>
    <p><strong>Fail over to an upgraded secondary.</strong> Your primary is now on SQL Server 2022. The old primary becomes a secondary.</p>
  </li>
  <li>
    <p><strong>Upgrade the remaining replicas</strong> (including what was the original primary).</p>
  </li>
  <li>
    <p><strong>Re-enable automatic failover.</strong></p>
  </li>
</ol>
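
<p>In T-SQL, the steps above can be sketched roughly as follows. The availability group name <code class="language-plaintext highlighter-rouge">YourAG</code> and the replica names <code class="language-plaintext highlighter-rouge">SQLNODE1</code> and <code class="language-plaintext highlighter-rouge">SQLNODE2</code> are placeholders, and the statements are meant to be run one step at a time at the matching point in the sequence, not as a single script.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Step 1 (on the primary): switch automatic-failover replicas to manual
ALTER AVAILABILITY GROUP [YourAG]
    MODIFY REPLICA ON N'SQLNODE1' WITH (FAILOVER_MODE = MANUAL);
ALTER AVAILABILITY GROUP [YourAG]
    MODIFY REPLICA ON N'SQLNODE2' WITH (FAILOVER_MODE = MANUAL);

-- Step 3 (connected to the upgraded secondary): planned manual failover.
-- The target must be a synchronous-commit replica in SYNCHRONIZED state.
ALTER AVAILABILITY GROUP [YourAG] FAILOVER;

-- Step 5 (on the new primary, once every replica runs 2022)
ALTER AVAILABILITY GROUP [YourAG]
    MODIFY REPLICA ON N'SQLNODE1' WITH (FAILOVER_MODE = AUTOMATIC);
ALTER AVAILABILITY GROUP [YourAG]
    MODIFY REPLICA ON N'SQLNODE2' WITH (FAILOVER_MODE = AUTOMATIC);
</code></pre></div></div>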

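<p>Important caveat: once you fail over to an upgraded replica, the databases are upgraded to the 2022 on-disk version, and log records from the new primary can’t be applied on replicas that still run the old version. Data movement to those replicas stays suspended until they are upgraded, and you can’t fail back to them. Until step 4 is finished you are running with reduced redundancy, so plan your maintenance window accordingly.</p>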

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- Check AG synchronization state before and during upgrade</span>
<span class="k">SELECT</span> 
    <span class="n">ag</span><span class="p">.</span><span class="n">name</span> <span class="k">AS</span> <span class="n">ag_name</span><span class="p">,</span>
    <span class="n">ar</span><span class="p">.</span><span class="n">replica_server_name</span><span class="p">,</span>
    <span class="n">ars</span><span class="p">.</span><span class="n">role_desc</span><span class="p">,</span>
    <span class="n">ars</span><span class="p">.</span><span class="n">synchronization_health_desc</span><span class="p">,</span>
    <span class="n">ars</span><span class="p">.</span><span class="n">connected_state_desc</span>
<span class="k">FROM</span> <span class="n">sys</span><span class="p">.</span><span class="n">availability_groups</span> <span class="n">ag</span>
<span class="k">JOIN</span> <span class="n">sys</span><span class="p">.</span><span class="n">availability_replicas</span> <span class="n">ar</span> <span class="k">ON</span> <span class="n">ag</span><span class="p">.</span><span class="n">group_id</span> <span class="o">=</span> <span class="n">ar</span><span class="p">.</span><span class="n">group_id</span>
<span class="k">JOIN</span> <span class="n">sys</span><span class="p">.</span><span class="n">dm_hadr_availability_replica_states</span> <span class="n">ars</span> <span class="k">ON</span> <span class="n">ar</span><span class="p">.</span><span class="n">replica_id</span> <span class="o">=</span> <span class="n">ars</span><span class="p">.</span><span class="n">replica_id</span><span class="p">;</span>
</code></pre></div></div>
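
<p>The query above reports health per replica. Per-database detail, including whether data movement is suspended and when the last log record was redone, lives in <code class="language-plaintext highlighter-rouge">sys.dm_hadr_database_replica_states</code>. A sketch of a companion query follows; run on the primary it covers all replicas, while on a secondary it only shows the local databases.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Per-database synchronization state during the rolling upgrade
SELECT
    ar.replica_server_name,
    DB_NAME(drs.database_id) AS database_name,
    drs.synchronization_state_desc,
    drs.is_suspended,
    drs.suspend_reason_desc,
    drs.last_redone_time
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
    ON drs.replica_id = ar.replica_id
ORDER BY ar.replica_server_name, database_name;
</code></pre></div></div>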

<hr />

<h2 id="after-the-upgrade-dont-forget-the-compatibility-level">After the Upgrade: Don’t Forget the Compatibility Level</h2>

<p>This is the most commonly skipped step post-upgrade. Your databases are now running on SQL Server 2022 but still operating at their old compatibility level — which means they don’t benefit from new query optimizer improvements, better cardinality estimator behavior, or features like Parameter Sensitive Plan optimization (new in 2022).</p>

<p><strong>Do this in a separate maintenance window, after you’ve confirmed stability:</strong></p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- Step 1: Enable Query Store to capture baseline before changing compat level</span>
<span class="k">ALTER</span> <span class="k">DATABASE</span> <span class="n">YourDatabase</span> <span class="k">SET</span> <span class="n">QUERY_STORE</span> <span class="o">=</span> <span class="k">ON</span><span class="p">;</span>
<span class="k">ALTER</span> <span class="k">DATABASE</span> <span class="n">YourDatabase</span> <span class="k">SET</span> <span class="n">QUERY_STORE</span> <span class="p">(</span><span class="n">OPERATION_MODE</span> <span class="o">=</span> <span class="n">READ_WRITE</span><span class="p">);</span>

<span class="c1">-- Step 2: Change compatibility level</span>
<span class="k">ALTER</span> <span class="k">DATABASE</span> <span class="n">YourDatabase</span> <span class="k">SET</span> <span class="n">COMPATIBILITY_LEVEL</span> <span class="o">=</span> <span class="mi">160</span><span class="p">;</span>

<span class="c1">-- Step 3: Monitor for regressions</span>
<span class="c1">-- If a query regresses, you can force the old plan via Query Store</span>
<span class="c1">-- or temporarily revert compat level while you investigate</span>
</code></pre></div></div>

<p>Give it at least a week of production load at the new compatibility level before declaring the upgrade complete. Query Store will show you whether any plans regressed.</p>
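
<p>One way to look for regressions is to compare the average duration of each query across compatibility levels, since Query Store records the level each plan was compiled under. The following is a minimal sketch, not a full regression report; the IDs in the commented-out call are placeholders.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Weighted average duration per query and compatibility level
SELECT
    p.query_id,
    p.compatibility_level,
    SUM(rs.count_executions) AS executions,
    SUM(rs.avg_duration * rs.count_executions)
        / SUM(rs.count_executions) / 1000.0 AS avg_duration_ms -- avg_duration is in microseconds
FROM sys.query_store_plan AS p
JOIN sys.query_store_runtime_stats AS rs
    ON rs.plan_id = p.plan_id
GROUP BY p.query_id, p.compatibility_level
ORDER BY p.query_id, p.compatibility_level;

-- If a query regressed, pin the plan that performed well (placeholder IDs):
-- EXEC sp_query_store_force_plan @query_id = 42, @plan_id = 17;
</code></pre></div></div>

<p>Queries that show up with two rows and a clearly higher <code class="language-plaintext highlighter-rouge">avg_duration_ms</code> at level 160 are the ones to look at first.</p>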

<hr />

<h2 id="a-note-on-sql-server-2022-features-worth-using-after-upgrade">A Note on SQL Server 2022 Features Worth Using After Upgrade</h2>

<p>Once you’re fully on 2022 with compatibility level 160:</p>

<ul>
  <li><strong>Parameter Sensitive Plan (PSP) optimization</strong> — automatically creates multiple query plans for parameterized queries where a single plan performed poorly. This is a significant improvement for workloads with data skew.</li>
  <li><strong>Improved Query Store defaults</strong> — enabled by default in new databases; includes read replica support.</li>
  <li><strong>Ledger tables</strong> — tamper-evident tables backed by cryptographic hashes. Useful for compliance scenarios.</li>
  <li><strong>Azure Synapse Link</strong> — if you have hybrid or Azure-connected scenarios.</li>
  <li><strong>S3-compatible object storage for backup</strong> — back up directly to any S3-compatible endpoint (MinIO, Cloudflare R2, etc.), not just Azure Blob.</li>
</ul>
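
<p>As an illustration of the S3 backup feature, a backup to an S3-compatible endpoint looks roughly like this. The endpoint, port, bucket, and keys are placeholders. The credential name has to match the URL prefix used in the <code class="language-plaintext highlighter-rouge">BACKUP</code> statement, and the endpoint needs to be reachable over TLS with a certificate the server trusts.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Credential for the bucket; the IDENTITY value is a fixed string
CREATE CREDENTIAL [s3://s3.example.internal:9000/sqlbackups]
WITH IDENTITY = 'S3 Access Key',
     SECRET   = '&lt;AccessKeyID&gt;:&lt;SecretKey&gt;';

-- Back up straight to the bucket
BACKUP DATABASE YourDatabase
TO URL = 's3://s3.example.internal:9000/sqlbackups/YourDatabase_full.bak'
WITH COMPRESSION, CHECKSUM, FORMAT;
</code></pre></div></div>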

<hr />

<h2 id="why-sql-server-upgrades-really-fail-its-not-the-technology">Why SQL Server Upgrades Really Fail (It’s Not the Technology)</h2>

<p>Here’s something most upgrade guides won’t tell you: the technical part is rarely where upgrades fall apart. The blockers are almost always organizational.</p>

<p><strong>Third-party software is the #1 real-world killer.</strong> You’ve done everything right — backups, testing, maintenance window booked — and then you discover your ERP vendor hasn’t certified SQL Server 2022 yet. Or your hospital information system (HIS) has a hardcoded SQL Server version check that throws an error on anything newer than 2016. These things exist in production, and they will surprise you if you don’t check upfront.</p>

<p>Microsoft explicitly calls this out in their upgrade guidance: verify third-party compatibility <em>before</em> you start. This means contacting vendors, checking their compatibility matrices, and getting written confirmation. Don’t rely on “it should work.”</p>

<p><strong>Other common reasons upgrades fail in practice:</strong></p>
<ul>
  <li><strong>Missing vendor sign-off</strong> — upgrade happens, application breaks, vendor says “unsupported configuration, fix it yourself”</li>
  <li><strong>No staging environment</strong> — first test of the upgraded stack is production</li>
  <li><strong>Incorrect expectations about downtime</strong> — stakeholders weren’t told there would be a maintenance window, causing last-minute cancellations</li>
  <li><strong>Forgotten dependencies</strong> — linked servers, SSIS packages, Reporting Services, custom CLR assemblies that need recompilation</li>
  <li><strong>Skipped compatibility level raise</strong> — databases upgraded but running with “handbrake on” at old compat level for months</li>
</ul>

<p>The pattern is always the same: the SQL Server upgrade itself takes 20 minutes. The surrounding project work takes weeks.</p>

<hr />

<h2 id="my-recommendation-from-the-field">My Recommendation from the Field</h2>

<p>If you ask me directly: <strong>default to Side-by-Side migration over In-Place Upgrade</strong>, especially in production environments.</p>

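<p>Yes, in-place is faster and simpler on paper. But side-by-side gives you something that in-place doesn’t: your old server still exists and still works. If something goes wrong post-migration (an application compatibility issue surfaces 48 hours later, a stored procedure behaves differently at the new compatibility level), you can fail back immediately without a restore. One caveat: a database that has run on 2022 cannot be restored onto an older version, so any data written after cutover has to be copied back by other means. The sooner you catch the problem, the cheaper the failback.</p>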

<p>The extra effort of a side-by-side migration (DNS alias switchover, migrating logins with dbatools) pays off the first time something unexpected happens. And in my experience, something unexpected happens on roughly one in three upgrades.</p>

<p>The one exception: if you’re running Always On AGs, use the rolling upgrade method. It’s the only approach that lets you maintain HA throughout the process.</p>

<hr />

<h2 id="quick-reference-checklist">Quick-Reference Checklist</h2>

<p>Print this out or put it in your runbook:</p>

<p><strong>Pre-Upgrade</strong></p>
<ul class="task-list">
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Vendor compatibility confirmed (written, not verbal)</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />OS version supported (Windows Server 2016 or later for SQL 2022)</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />No pending Windows restarts</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />All databases in FULL recovery model</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Full backup taken and restore-tested</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Query Store enabled for baseline collection</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Maintenance window scheduled and communicated</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Rollback plan documented</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Auto-failover disabled (if Always On)</li>
</ul>

<p><strong>During Upgrade</strong></p>
<ul class="task-list">
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />AG sync state verified before each replica upgrade (if rolling)</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />System databases checked post-upgrade</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />SQL Agent running</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Application smoke test completed</li>
</ul>

<p><strong>Post-Upgrade</strong></p>
<ul class="task-list">
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Compatibility level raised (in separate maintenance window)</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Query Store monitored for plan regressions (min. 1 week)</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Auto-failover re-enabled (if Always On)</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Documentation updated</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Old server decommissioned (not immediately — keep it for 2–4 weeks)</li>
</ul>

<hr />

<h2 id="summary">Summary</h2>

<table>
  <thead>
    <tr>
      <th>Method</th>
      <th>Downtime</th>
      <th>Rollback Complexity</th>
      <th>Best For</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>In-Place Upgrade</td>
      <td>15–45 min</td>
      <td>High (restore from backup)</td>
      <td>Simple environments, maintenance windows</td>
    </tr>
    <tr>
      <td>Side-by-Side</td>
      <td>Hours (migration-dependent)</td>
      <td>Low (old server intact)</td>
      <td>Hardware refresh, maximum safety</td>
    </tr>
    <tr>
      <td>Always On Rolling</td>
      <td>Near-zero</td>
      <td>Medium</td>
      <td>HA environments, production-critical systems</td>
    </tr>
  </tbody>
</table>

<p>The upgrade itself is usually not the hard part. The preparation — vendor sign-off, OS compatibility, backup verification, baseline collection — is what determines whether the upgrade is a non-event or a weekend of fire-fighting.</p>

<p>If you’re running SQL Server 2014 or older, the support lifecycle argument alone makes the upgrade mandatory: SQL Server 2014 exited Extended Support in July 2024. At this point you’re running unsupported software in production.</p>

<p>Do the upgrade. Do it properly. And raise the compatibility level afterward.</p>

<hr />

<p><em>Questions or something to add? Open an issue or drop a comment below.</em></p>]]></content><author><name>Benedikt Schackenberg</name></author><category term="sql-server" /><category term="sql-server-2022" /><category term="upgrade" /><category term="migration" /><category term="dba" /><category term="mssql" /><category term="always-on" /><category term="on-premises" /><summary type="html"><![CDATA[SQL Server 2022 (internal version 16.x) is the most feature-rich release Microsoft has shipped in years — better Azure integration, improved Query Store defaults, ledger tables, and performance improvements that actually matter in production. If you’re still running SQL Server 2014, 2016, or 2019, the upgrade is worth doing. The question is how.]]></summary></entry><entry><title type="html">When Your AI Assistant Can’t Run Shell Commands: Debugging an OpenClaw v2026.3.31 Breaking Change</title><link href="https://schackenberg.com/2026/04/01/openclaw-exec-allowlist-fix.html" rel="alternate" type="text/html" title="When Your AI Assistant Can’t Run Shell Commands: Debugging an OpenClaw v2026.3.31 Breaking Change" /><published>2026-04-01T00:00:00+02:00</published><updated>2026-04-01T00:00:00+02:00</updated><id>https://schackenberg.com/2026/04/01/openclaw-exec-allowlist-fix</id><content type="html" xml:base="https://schackenberg.com/2026/04/01/openclaw-exec-allowlist-fix.html"><![CDATA[<p>Today was one of those mornings where everything looks fine until nothing works.</p>

<p>I was knee-deep in NemoClaw pull request reviews — fixing CodeRabbit feedback, guarding <code class="language-plaintext highlighter-rouge">registry.clearAll()</code> on successful gateway destroys, adding JSDoc coverage — when my AI assistant (Rainer, running on OpenClaw) suddenly stopped being able to run any shell commands.</p>

<p>Not some commands. <em>All</em> commands.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>exec denied: allowlist miss
</code></pre></div></div>

<p>Every. Single. Time.</p>

<h2 id="what-openclaw-is">What OpenClaw Is</h2>

<p><a href="https://openclaw.ai">OpenClaw</a> is a self-hosted AI assistant runtime. You run it on your own server, connect it to Discord (or Signal, Telegram, etc.), and your AI assistant lives there — with access to your workspace, your repos, your tools. It’s powerful precisely because it can actually <em>do things</em>, not just talk about doing things.</p>

<p>Which makes it particularly frustrating when it can’t do things.</p>

<h2 id="the-symptom">The Symptom</h2>

<p>Rainer (my OpenClaw assistant) was responding to messages just fine. Memory, reasoning, GitHub API calls via <code class="language-plaintext highlighter-rouge">web_fetch</code> — all working. But anything requiring <code class="language-plaintext highlighter-rouge">exec</code> — git commands, prettier, gh CLI, grep — was blocked:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>exec denied: allowlist miss
</code></pre></div></div>

<p>I tried restarting the gateway. I tried <code class="language-plaintext highlighter-rouge">openclaw gateway restart</code>. I tried stop + start. Nothing helped. The error persisted across session restarts, across new sub-agents, across everything.</p>

<h2 id="the-hunt">The Hunt</h2>

<p>After about two hours of back-and-forth (yes, two hours — the irony of an AI assistant that can’t help you fix its own broken shell access is not lost on me), we found it.</p>

<p>OpenClaw v2026.3.31 introduced a breaking change:</p>

<blockquote>
  <p><strong>“honor per-agent tools.exec defaults when no inline directive or session override is present”</strong></p>
</blockquote>

<p>Before this update, a missing <code class="language-plaintext highlighter-rouge">tools.exec.security</code> key was silently ignored and no default was enforced. After the update, the default is actually enforced. And since I had never set the key explicitly in my config, it fell back to the built-in default: <code class="language-plaintext highlighter-rouge">allowlist</code>, a restrictive mode that blocks most exec calls.</p>

<p>The problem is documented in the <a href="https://github.com/openclaw/openclaw/releases/tag/v2026.3.31">v2026.3.31 release notes</a>, but it’s easy to miss if you’re not actively reading changelogs on every update.</p>

<h2 id="the-fix">The Fix</h2>

<p>One setting in <code class="language-plaintext highlighter-rouge">~/.openclaw/openclaw.json</code>:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">"tools"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="nl">"exec"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"security"</span><span class="p">:</span><span class="w"> </span><span class="s2">"full"</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>That’s it. Add that block, do a full gateway stop + start (not just restart — <code class="language-plaintext highlighter-rouge">restart</code> wasn’t enough to pick up the change), and exec works again.</p>

<p>The full sequence that fixed it:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>openclaw gateway stop
<span class="nb">sleep </span>2
openclaw gateway start
</code></pre></div></div>

<h2 id="why-it-took-so-long">Why It Took So Long</h2>

<p>A few things made this harder to debug than it should have been:</p>

<p><strong>1. The error message doesn’t tell you what to do.</strong><br />
<code class="language-plaintext highlighter-rouge">exec denied: allowlist miss</code> accurately describes what happened, but gives you no hint about <em>why</em> the allowlist is in effect or how to change it. A message like “exec blocked by tools.exec.security=allowlist — set to ‘full’ to allow” would have cut debug time by 90%.</p>

<p><strong>2. The breaking change was behavioral, not syntactic.</strong><br />
The config file didn’t break. OpenClaw still started fine. Everything looked normal. The change was that a previously-ignored config key now had real consequences — which is the hardest class of breaking change to notice.</p>

<p><strong>3. Sub-agents couldn’t help.</strong><br />
When the main session can’t exec, spawning sub-agents doesn’t help — they inherit the same restriction. So you can’t use your AI to debug the problem that’s preventing your AI from working. Classic.</p>

<h2 id="what-i-learned">What I Learned</h2>

<p>If you’re running OpenClaw and update to v2026.3.31 or later, <strong>explicitly set <code class="language-plaintext highlighter-rouge">tools.exec.security</code></strong> in your config. Don’t rely on the default. The recommended setting depends on your threat model:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">"full"</code> — No restrictions. The assistant can run any shell command. Fine if you trust your setup and channel security.</li>
  <li><code class="language-plaintext highlighter-rouge">"allowlist"</code> — Restricts to a pre-approved command list. More secure, but requires maintaining the list.</li>
  <li><code class="language-plaintext highlighter-rouge">"deny"</code> — Blocks all exec. Useful for read-only or untrusted environments.</li>
</ul>
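
<p>A small pre-flight check can catch an unset key before an update changes what it means. The sketch below is a hypothetical helper, not part of OpenClaw: it assumes the contents of <code class="language-plaintext highlighter-rouge">~/.openclaw/openclaw.json</code> have already been parsed into a plain object, and it only knows the key and the three values listed above.</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Hypothetical pre-flight check, not an OpenClaw API.
// Takes the parsed contents of openclaw.json as a plain object.
const KNOWN_MODES = ['full', 'allowlist', 'deny'];

function checkExecSecurity(config) {
  const tools = config.tools || {};
  const exec = tools.exec || {};
  const mode = exec.security;

  if (mode === undefined) {
    return 'tools.exec.security is not set; the built-in default applies';
  }
  if (!KNOWN_MODES.includes(mode)) {
    return 'tools.exec.security has an unexpected value: ' + String(mode);
  }
  return 'tools.exec.security is explicitly set to "' + mode + '"';
}

console.log(checkExecSecurity({ tools: { exec: { security: 'full' } } }));
console.log(checkExecSecurity({}));
</code></pre></div></div>

<p>Running something like this before and after an update would at least turn a silent fallback into a visible line of output.</p>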

<p>The relevant docs are at <a href="https://docs.openclaw.ai">docs.openclaw.ai</a>.</p>

<h2 id="the-silver-lining">The Silver Lining</h2>

<p>Once exec was restored, we knocked out three NemoClaw PR reviews in about 20 minutes:</p>

<ul>
  <li><strong><a href="https://github.com/NVIDIA/NemoClaw/pull/1245">#1245</a></strong> — guarded <code class="language-plaintext highlighter-rouge">registry.clearAll()</code> on successful gateway destroy (CodeRabbit: Major)</li>
  <li><strong><a href="https://github.com/NVIDIA/NemoClaw/pull/1246">#1246</a></strong> — added JSDoc coverage to all <code class="language-plaintext highlighter-rouge">runner.js</code> functions, pushing docstring coverage from 22% to 100%</li>
  <li><strong><a href="https://github.com/NVIDIA/NemoClaw/pull/1247">#1247</a></strong> — fixed a sneaky <code class="language-plaintext highlighter-rouge">"x" * 5000</code> → <code class="language-plaintext highlighter-rouge">NaN</code> bug in a regression test (JS string multiplication returns NaN, not a long string)</li>
</ul>
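
<p>The string bug from #1247 is easy to reproduce in isolation. A minimal sketch (the variable names are illustrative, not taken from the NemoClaw test): the <code class="language-plaintext highlighter-rouge">*</code> operator converts both operands to numbers first, so a non-numeric string turns into <code class="language-plaintext highlighter-rouge">NaN</code>, while <code class="language-plaintext highlighter-rouge">String.prototype.repeat</code> builds the long string the test presumably wanted.</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Multiplying a non-numeric string coerces it to a number first.
const wrong = 'x' * 5000;
console.log(Number.isNaN(wrong)); // true

// String.prototype.repeat builds the long string instead.
const right = 'x'.repeat(5000);
console.log(right.length); // 5000
</code></pre></div></div>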

<p>Not a bad morning’s work — once the tools actually worked.</p>

<hr />

<p><em>Running OpenClaw on a local server and contributing to open source through Discord. If you hit similar issues, check the <a href="https://github.com/openclaw/openclaw/releases">release notes</a> before assuming it’s a config problem on your end.</em></p>]]></content><author><name>Benedikt Schackenberg</name></author><category term="openclaw" /><category term="debugging" /><category term="devops" /><category term="ai-assistant" /><category term="configuration" /><category term="open-source" /><summary type="html"><![CDATA[Today was one of those mornings where everything looks fine until nothing works.]]></summary></entry><entry><title type="html">PAMlab: The Day I Fixed Everything That Wasn’t Actually Broken (Until You Tested It)</title><link href="https://schackenberg.com/2026/03/31/pamlab-auth-realism-code-quality.html" rel="alternate" type="text/html" title="PAMlab: The Day I Fixed Everything That Wasn’t Actually Broken (Until You Tested It)" /><published>2026-03-31T00:00:00+02:00</published><updated>2026-03-31T00:00:00+02:00</updated><id>https://schackenberg.com/2026/03/31/pamlab-auth-realism-code-quality</id><content type="html" xml:base="https://schackenberg.com/2026/03/31/pamlab-auth-realism-code-quality.html"><![CDATA[<p>You know that moment when your test suite is green, your README looks great, and then someone actually <em>uses</em> your project and finds that the login endpoint accepts literally any password?</p>

<p>Yeah. That was my Monday morning.</p>

<h2 id="the-problem-nobody-noticed">The Problem Nobody Noticed</h2>

<p>PAMlab has been running for a while now — six mock APIs simulating Active Directory, Fudo PAM, Matrix42 ESM, ServiceNow, JSM, and BMC Remedy. The whole point is to build and test access management workflows without touching production. And it worked. Technically.</p>

<p>But there was a catch. When I sat down to do a proper integration test — not just “does the endpoint return 200” but “does the <strong>security behavior</strong> make sense” — things fell apart fast.</p>

<p>The Active Directory mock? You could bind with any password:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># This should fail. It didn't.</span>
curl <span class="nt">-X</span> POST http://localhost:8445/api/ad/auth/bind <span class="se">\</span>
  <span class="nt">-H</span> <span class="s2">"Content-Type: application/json"</span> <span class="se">\</span>
  <span class="nt">-d</span> <span class="s1">'{"dn": "CN=Administrator,OU=Users,DC=corp,DC=local", "password": "totallyWrong"}'</span>

<span class="c"># → 200 OK, here's your token. Come on in.</span>
</code></pre></div></div>

<p>Same story with the Remedy mock. <code class="language-plaintext highlighter-rouge">POST /api/jwt/login</code> with garbage credentials? Here’s a valid JWT, no questions asked.</p>

<p>For a PAM sandbox — a project specifically about <strong>access management</strong> — that’s embarrassing. If your mock doesn’t reject bad passwords, every downstream integration test that depends on auth behavior is lying to you.</p>

<h2 id="the-auth-fix">The Auth Fix</h2>

<p>The fix itself was straightforward. AD bind now validates against a credential allowlist:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Wrong password → 401</span>
curl <span class="nt">-s</span> <span class="nt">-w</span> <span class="s2">" [HTTP %{http_code}]"</span> <span class="nt">-X</span> POST http://localhost:8445/api/ad/auth/bind <span class="se">\</span>
  <span class="nt">-H</span> <span class="s2">"Content-Type: application/json"</span> <span class="se">\</span>
  <span class="nt">-d</span> <span class="s1">'{"dn": "CN=Administrator,OU=Users,DC=corp,DC=local", "password": "totallyWrong"}'</span>
<span class="c"># → {"error":"Invalid credentials","message":"LDAP bind failed: wrong password"} [HTTP 401]</span>

<span class="c"># Correct password → 200</span>
curl <span class="nt">-s</span> <span class="nt">-w</span> <span class="s2">" [HTTP %{http_code}]"</span> <span class="nt">-X</span> POST http://localhost:8445/api/ad/auth/bind <span class="se">\</span>
  <span class="nt">-H</span> <span class="s2">"Content-Type: application/json"</span> <span class="se">\</span>
  <span class="nt">-d</span> <span class="s1">'{"dn": "CN=Administrator,OU=Users,DC=corp,DC=local", "password": "admin"}'</span>
<span class="c"># → {"token":"...","bind_dn":"CN=Administrator,...","message":"Bind successful"} [HTTP 200]</span>
</code></pre></div></div>

<p>Basic Auth got the same treatment — both on the AD mock and the Remedy mock. The Remedy JWT endpoint now returns a proper <code class="language-plaintext highlighter-rouge">401 "Authentication failed: invalid password"</code> instead of handing out tokens like candy.</p>
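
<p>The shape of that check can be sketched without Express. This is an illustration under assumptions, not PAMlab’s actual code: the function name, the in-memory credential table, and the token format are made up, and only the status codes and messages mirror the curl output above.</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code>const nodeCrypto = require('node:crypto');

// Hypothetical credential allowlist; the real mock may store this differently.
const CREDENTIALS = new Map([
  ['CN=Administrator,OU=Users,DC=corp,DC=local', 'admin'],
]);

// Returns the HTTP status and JSON body a bind request should produce.
function bind(dn, password) {
  const expected = CREDENTIALS.get(dn);
  if (expected === undefined || expected !== password) {
    return {
      status: 401,
      body: { error: 'Invalid credentials', message: 'LDAP bind failed: wrong password' },
    };
  }
  return {
    status: 200,
    body: {
      token: nodeCrypto.randomBytes(16).toString('hex'), // placeholder token
      bind_dn: dn,
      message: 'Bind successful',
    },
  };
}

console.log(bind('CN=Administrator,OU=Users,DC=corp,DC=local', 'totallyWrong').status);
console.log(bind('CN=Administrator,OU=Users,DC=corp,DC=local', 'admin').status);
</code></pre></div></div>

<p>In this sketch an unknown DN and a wrong password get the same 401, so the response does not reveal which DNs exist. That is a design choice of the example, not a statement about the mock.</p>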

<p>It’s a mock, sure. But mocks that don’t enforce auth boundaries train you to write integrations that don’t handle auth failures. And that’s exactly the kind of bug that shows up at 3 AM in production.</p>

<h2 id="the-missing-webhooks">The Missing Webhooks</h2>

<p>While I was at it, I noticed two more gaps. Matrix42 and JSM both claim to support event-driven workflows in the README — but neither had webhook endpoints.</p>

<p>Both have them now. Matrix42 webhook registration:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl <span class="nt">-s</span> <span class="nt">-X</span> POST http://localhost:8444/m42Services/api/webhooks <span class="se">\</span>
  <span class="nt">-H</span> <span class="s2">"Authorization: Bearer pamlab-dev-token"</span> <span class="se">\</span>
  <span class="nt">-H</span> <span class="s2">"Content-Type: application/json"</span> <span class="se">\</span>
  <span class="nt">-d</span> <span class="s1">'{"url": "http://example.com/hook", "events": ["ticket.created"]}'</span>
<span class="c"># → 201, returns webhook ID + event subscription</span>
</code></pre></div></div>

<p>JSM got a similar treatment with <code class="language-plaintext highlighter-rouge">/rest/webhooks/1.0/webhook</code> — the standard Atlassian webhook path. Now you can actually test event-driven approval flows without pretending the webhook exists.</p>
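
<p>Behind a registration endpoint like these sits little more than a registry. A sketch, assuming an in-memory store: the function name, the UUID-style ID, the response fields, and the 400 for incomplete payloads are all illustrative choices rather than PAMlab’s implementation.</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code>const { randomUUID } = require('node:crypto');

// In-memory webhook registry; a mock does not need persistence.
const webhooks = new Map();

// Validates the payload and returns the status and body to send back.
function registerWebhook(payload) {
  const url = payload ? payload.url : undefined;
  const events = payload ? payload.events : undefined;
  if (typeof url !== 'string' || !Array.isArray(events) || events.length === 0) {
    return { status: 400, body: { error: 'url and events are required' } };
  }
  const id = randomUUID();
  webhooks.set(id, { url, events });
  return { status: 201, body: { id, url, events } };
}

const result = registerWebhook({ url: 'http://example.com/hook', events: ['ticket.created'] });
console.log(result.status, result.body.events);
</code></pre></div></div>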

<h2 id="fudo-account-sync-the-silent-breaker">Fudo Account Sync: The Silent Breaker</h2>

<p>This one was subtle. The Fudo PAM mock required a <code class="language-plaintext highlighter-rouge">server_id</code> when creating accounts. Makes sense in production — you need to know which server the account lives on. But in an onboarding pipeline, the first thing you do is create the account. You don’t necessarily have the server mapping yet.</p>

<p>The fix: auto-assign the first available server when <code class="language-plaintext highlighter-rouge">server_id</code> is omitted. The pipeline keeps flowing, and the mapping can be refined later. Small change, big difference for anyone trying the onboarding demo.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># No server_id → auto-assigns to first available server</span>
curl <span class="nt">-s</span> <span class="nt">-X</span> POST http://localhost:8443/api/v2/accounts <span class="se">\</span>
  <span class="nt">-H</span> <span class="s2">"Authorization: Bearer pamlab-dev-token"</span> <span class="se">\</span>
  <span class="nt">-H</span> <span class="s2">"Content-Type: application/json"</span> <span class="se">\</span>
  <span class="nt">-d</span> <span class="s1">'{"name": "new-account", "login": "testuser"}'</span>
<span class="c"># → 201, server_id: "20000000-0000-0000-0000-000000000001"</span>
</code></pre></div></div>
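
<p>The fallback logic is small enough to show in full. A sketch under assumptions: the function name, the server list, and the 409 for an empty server list are hypothetical, and only the auto-assignment rule and the server ID come from the example above.</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Hypothetical server list; the ID matches the response shown above.
const servers = [{ id: '20000000-0000-0000-0000-000000000001', name: 'server-01' }];

// Builds the account record, falling back to the first server when
// the request carries no server_id.
function createAccount(body, availableServers) {
  let serverId = body.server_id;
  if (serverId === undefined) {
    if (availableServers.length === 0) {
      return { status: 409, body: { error: 'No server available for auto-assignment' } };
    }
    serverId = availableServers[0].id;
  }
  return {
    status: 201,
    body: { name: body.name, login: body.login, server_id: serverId },
  };
}

console.log(createAccount({ name: 'new-account', login: 'testuser' }, servers).body.server_id);
</code></pre></div></div>

<p>An explicit <code class="language-plaintext highlighter-rouge">server_id</code> still wins, so existing callers that already send one see no change.</p>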

<h2 id="the-code-quality-pass">The Code Quality Pass</h2>

<p>Fixing bugs is satisfying. But the codebase had accumulated some debt that was bugging me.</p>

<p><strong>The double exports.</strong> Every single <code class="language-plaintext highlighter-rouge">server.js</code> — all seven of them — ended with:</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">module</span><span class="p">.</span><span class="nx">exports</span> <span class="o">=</span> <span class="nx">app</span><span class="p">;</span>

<span class="nx">module</span><span class="p">.</span><span class="nx">exports</span> <span class="o">=</span> <span class="nx">app</span><span class="p">;</span>
</code></pre></div></div>

<p>Harmless, but it screams “nobody reviewed this.” Gone now.</p>

<p><strong>Compressed server files.</strong> The Express app setup was crammed into as few lines as possible. Health endpoints were one-liners. No section comments. Reading a server.js felt like decoding morse code. I reformatted all of them with consistent sections:</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// --- Middleware ---</span>
<span class="nx">app</span><span class="p">.</span><span class="nx">use</span><span class="p">(</span><span class="nx">cors</span><span class="p">());</span>
<span class="nx">app</span><span class="p">.</span><span class="nx">use</span><span class="p">(</span><span class="nx">express</span><span class="p">.</span><span class="nx">json</span><span class="p">());</span>

<span class="c1">// --- Public Routes ---</span>
<span class="nx">app</span><span class="p">.</span><span class="nx">use</span><span class="p">(</span><span class="dl">'</span><span class="s1">/api/ad/auth</span><span class="dl">'</span><span class="p">,</span> <span class="nx">require</span><span class="p">(</span><span class="dl">'</span><span class="s1">./routes/auth</span><span class="dl">'</span><span class="p">));</span>

<span class="c1">// --- Protected Routes ---</span>
<span class="nx">app</span><span class="p">.</span><span class="nx">use</span><span class="p">(</span><span class="dl">'</span><span class="s1">/api/ad/users</span><span class="dl">'</span><span class="p">,</span> <span class="nx">authMiddleware</span><span class="p">,</span> <span class="nx">require</span><span class="p">(</span><span class="dl">'</span><span class="s1">./routes/users</span><span class="dl">'</span><span class="p">));</span>

<span class="c1">// --- Health &amp; Admin ---</span>
<span class="nx">app</span><span class="p">.</span><span class="kd">get</span><span class="p">(</span><span class="dl">'</span><span class="s1">/health</span><span class="dl">'</span><span class="p">,</span> <span class="p">(</span><span class="nx">req</span><span class="p">,</span> <span class="nx">res</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="p">{</span>
  <span class="nx">res</span><span class="p">.</span><span class="nx">json</span><span class="p">({</span> <span class="na">status</span><span class="p">:</span> <span class="dl">'</span><span class="s1">ok</span><span class="dl">'</span><span class="p">,</span> <span class="na">service</span><span class="p">:</span> <span class="dl">'</span><span class="s1">ad-mock-api</span><span class="dl">'</span><span class="p">,</span> <span class="na">domain</span><span class="p">:</span> <span class="dl">'</span><span class="s1">corp.local</span><span class="dl">'</span> <span class="p">});</span>
<span class="p">});</span>
</code></pre></div></div>

<p><strong>The pipeline engine.</strong> This one was worse. Synchronous file reads everywhere — <code class="language-plaintext highlighter-rouge">fs.readFileSync</code> in async handlers, <code class="language-plaintext highlighter-rouge">fs.existsSync</code> before every read, <code class="language-plaintext highlighter-rouge">fs.readdirSync</code> for listing pipelines. In a web server. Serving API requests.</p>

<p>Converted everything to <code class="language-plaintext highlighter-rouge">fs.promises.*</code> with proper async/await. Added input validation (the pipeline runner would happily try to execute <code class="language-plaintext highlighter-rouge">undefined</code>). Added structured JSON logging instead of bare <code class="language-plaintext highlighter-rouge">console.log</code>.</p>
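
<p>Roughly what that conversion looks like, as a sketch rather than the actual engine code. It assumes pipelines live as <code class="language-plaintext highlighter-rouge">.json</code> files in one directory, and both function names are made up for the example.</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code>const fs = require('node:fs');
const path = require('node:path');

// Lists pipeline definition files without blocking the event loop.
async function listPipelines(dir) {
  const entries = await fs.promises.readdir(dir);
  return entries.filter(function (name) {
    return name.endsWith('.json');
  }).sort();
}

// Reads one pipeline. No existsSync check first: a missing file
// surfaces as ENOENT on the read itself and maps to null.
async function readPipeline(dir, name) {
  if (typeof name !== 'string' || name.length === 0 || name !== path.basename(name)) {
    throw new TypeError('pipeline name must be a plain file name');
  }
  try {
    const raw = await fs.promises.readFile(path.join(dir, name), 'utf8');
    return JSON.parse(raw);
  } catch (err) {
    if (err.code === 'ENOENT') {
      return null;
    }
    throw err;
  }
}
</code></pre></div></div>

<p>Dropping the exists-then-read pair also removes a small race: the file can no longer disappear between the check and the read.</p>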

<p><strong>Prettier + ESLint.</strong> Added <code class="language-plaintext highlighter-rouge">.prettierrc.json</code> and <code class="language-plaintext highlighter-rouge">.eslintrc.json</code> to the repo root. Ran Prettier across all 115 source files. Now <code class="language-plaintext highlighter-rouge">npm run format</code> and <code class="language-plaintext highlighter-rouge">npm run lint</code> work from the project root.</p>

<h2 id="the-matrix42-fragment-api">The Matrix42 Fragment API</h2>

<p>This was a documentation problem that became a code problem. The Matrix42 mock had fragments — data objects organized by data definition name. You could create them, read them by ID, update them. But:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">GET /api/data/fragments/:ddName</code> (without an ID) → 404. No way to list fragments.</li>
  <li><code class="language-plaintext highlighter-rouge">DELETE /api/data/fragments/:ddName/:fragmentId</code> → didn’t exist.</li>
</ul>

<p>So the API could create, read, and update, but it could not list or delete. Added both:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># List all fragments for a data definition</span>
curl <span class="nt">-s</span> http://localhost:8444/m42Services/api/data/fragments/SPSUserClassBase <span class="se">\</span>
  <span class="nt">-H</span> <span class="s2">"Authorization: Bearer pamlab-dev-token"</span> | jq <span class="s1">'.items | length'</span>
<span class="c"># → 10</span>

<span class="c"># Delete a specific fragment</span>
curl <span class="nt">-s</span> <span class="nt">-X</span> DELETE http://localhost:8444/m42Services/api/data/fragments/SPSUserClassBase/some-id <span class="se">\</span>
  <span class="nt">-H</span> <span class="s2">"Authorization: Bearer pamlab-dev-token"</span>
<span class="c"># → 204 No Content</span>
</code></pre></div></div>
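
<p>The two new operations, sketched against an in-memory store. The store layout, the sample fragments, the function names, and the 404 bodies are assumptions for the example; the <code class="language-plaintext highlighter-rouge">items</code> field and the 204 follow the curl output above.</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// In-memory store keyed by data definition name, then by fragment ID.
// The sample data is made up for the sketch.
const fragments = new Map([
  ['SPSUserClassBase', new Map([['frag-1', { name: 'Alice' }], ['frag-2', { name: 'Bob' }]])],
]);

// GET /api/data/fragments/:ddName
function listFragments(ddName) {
  const byId = fragments.get(ddName);
  if (byId === undefined) {
    return { status: 404, body: { error: 'Unknown data definition' } };
  }
  return { status: 200, body: { items: Array.from(byId.values()) } };
}

// DELETE /api/data/fragments/:ddName/:fragmentId
function deleteFragment(ddName, fragmentId) {
  const byId = fragments.get(ddName);
  if (byId === undefined || !byId.delete(fragmentId)) {
    return { status: 404, body: { error: 'Fragment not found' } };
  }
  return { status: 204, body: null };
}

console.log(listFragments('SPSUserClassBase').body.items.length);
console.log(deleteFragment('SPSUserClassBase', 'frag-1').status);
</code></pre></div></div>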

<h2 id="the-smoke-test">The Smoke Test</h2>

<p>I wrote <code class="language-plaintext highlighter-rouge">scripts/smoke-test.sh</code> — one script that tests the complete onboarding path:</p>

<ol>
  <li>Health checks on all 6 services</li>
  <li>AD auth (valid + invalid credentials)</li>
  <li>Remedy auth (valid + invalid)</li>
  <li>Matrix42 ticket creation</li>
  <li>AD user creation + group assignment</li>
  <li>Fudo account creation (without server_id)</li>
  <li>Matrix42 webhook registration</li>
  <li>JSM webhook registration</li>
  <li>Pipeline dry-run</li>
</ol>

<p>Each step prints <code class="language-plaintext highlighter-rouge">[PASS]</code> or <code class="language-plaintext highlighter-rouge">[FAIL]</code>. Exit code 0 means everything works. CI-ready.</p>
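
<p>The real script is a shell script. The same pass/fail pattern, sketched here in JavaScript to match the rest of the codebase, comes down to a loop that never stops at the first failure. The check objects and their names are placeholders, not the contents of <code class="language-plaintext highlighter-rouge">smoke-test.sh</code>.</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Runs named async checks in order and prints [PASS] or [FAIL] for each.
// Resolves to the exit code: 0 if every check passed, 1 otherwise.
async function runChecks(checks) {
  let failures = 0;
  for (const check of checks) {
    try {
      await check.run();
      console.log('[PASS] ' + check.name);
    } catch (err) {
      failures += 1;
      console.log('[FAIL] ' + check.name + ': ' + err.message);
    }
  }
  return failures === 0 ? 0 : 1;
}

// Placeholder checks; real ones would call the mock APIs.
const demoChecks = [
  { name: 'health', run: async function () {} },
  { name: 'invalid credentials rejected', run: async function () { throw new Error('expected 401, got 200'); } },
];

runChecks(demoChecks).then(function (code) {
  console.log('exit code: ' + code);
});
</code></pre></div></div>

<p>Running every step even after a failure is what makes the output useful in CI: one run shows everything that is broken, not just the first thing.</p>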

<h2 id="the-readme-restructure">The README Restructure</h2>

<p>The README got feedback that it was too dense up front. Fair point — the first thing you saw was a 9-row system table, a “Problem” section, and a “Solution” section, before you even got to <code class="language-plaintext highlighter-rouge">docker-compose up</code>.</p>

<p>Restructured it: <strong>TL;DR → Getting Started → Minimal Quickstart → Architecture</strong>. The system table moved below the quick start. Added a “Minimal Quickstart” section that goes from zero to working in three curl commands (Matrix42 ticket → AD user → group assignment). CyberArk is now clearly marked as optional.</p>

<p>Also added a “Mock API Realism” section. Because this is a sandbox, not a production clone, and that should be obvious to anyone evaluating it.</p>

<h2 id="the-numbers">The Numbers</h2>

<p>Before today:</p>
<ul>
  <li>82 tests passing, 8 failing, 8 warnings</li>
  <li>Auth validation: non-existent</li>
  <li>Webhooks: missing</li>
  <li>Code style: inconsistent</li>
</ul>

<p>After today:</p>
<ul>
  <li><strong>124 tests passing, 0 failing, 0 warnings</strong></li>
  <li>Auth rejects invalid credentials across all services</li>
  <li>Webhooks work for Matrix42 and JSM</li>
  <li>Prettier + ESLint across the entire codebase</li>
  <li>One-command smoke test for the full onboarding path</li>
</ul>

<h2 id="what-i-learned">What I Learned</h2>

<p>Mock APIs are deceptively easy to get wrong. The happy path works from day one — but the error paths, the auth validation, the edge cases in CRUD operations? Those are what make a sandbox actually useful for development.</p>

<p>If your mock accepts any password, your integration code will never handle auth failures. If your mock doesn’t have webhooks, your event-driven workflows are built on assumptions. If your test report has warnings that you’ve been ignoring, your quality story has holes.</p>

<p>Today was about closing those holes. Not glamorous, but the kind of work that separates a demo from a tool.</p>

<hr />

<p><em>PAMlab is open source under Apache 2.0: <a href="https://github.com/BenediktSchackenberg/PAMlab">github.com/BenediktSchackenberg/PAMlab</a></em></p>

<p><em>Previous posts: <a href="/2026/03/26/pamlab-dev-sandbox.html">Introducing PAMlab</a> · <a href="/2026/03/28/pamlab-studio-v2.html">PAMlab Studio V2</a></em></p>]]></content><author><name>Benedikt Schackenberg</name></author><category term="pamlab" /><category term="pam" /><category term="fudo-pam" /><category term="active-directory" /><category term="matrix42-automation" /><category term="matrix42-esm" /><category term="servicenow" /><category term="jira-service-management" /><category term="bmc-remedy" /><category term="security-automation" /><category term="itsm-integration" /><category term="access-management" /><category term="privileged-access-management" /><category term="webhook-automation" /><category term="pipeline-engine" /><category term="devops" /><category term="open-source" /><summary type="html"><![CDATA[You know that moment when your test suite is green, your README looks great, and then someone actually uses your project and finds that the login endpoint accepts literally any password?]]></summary></entry><entry><title type="html">PAMlab Studio: From ‘It Works On My Machine’ to ‘It Works Everywhere’ 🔧</title><link href="https://schackenberg.com/2026/03/28/pamlab-studio-v2.html" rel="alternate" type="text/html" title="PAMlab Studio: From ‘It Works On My Machine’ to ‘It Works Everywhere’ 🔧" /><published>2026-03-28T00:00:00+01:00</published><updated>2026-03-28T00:00:00+01:00</updated><id>https://schackenberg.com/2026/03/28/pamlab-studio-v2</id><content type="html" xml:base="https://schackenberg.com/2026/03/28/pamlab-studio-v2.html"><![CDATA[<p>Two days ago I <a href="/2026/03/26/pamlab-dev-sandbox.html">released PAMlab</a> — a sandbox with mock APIs for building PAM integration scripts. Three APIs, some PowerShell templates, a basic web UI. Cool, but kind of bare bones.</p>

<p>Today it’s a completely different beast. And the story of how it got there is worth telling, because it touches on something I think most tooling projects get wrong: <strong>the gap between “technically works” and “someone would actually use this.”</strong></p>

<h2 id="the-problem-with-v1">The Problem With V1</h2>

<p>PAMlab v1 had working mock APIs and a code editor. You could write PowerShell, hit Run, and see results. Mission accomplished, right?</p>

<p>Not really. Here’s what happened when I sat down to actually <em>use</em> it:</p>

<ol>
  <li>I loaded the Onboarding template. It created a user called “Sarah Connor” in the AD mock.</li>
  <li>I tweaked some params. Ran it again. <strong>409 Conflict</strong> — user already exists.</li>
  <li>I opened the Emergency Revoke template. It tried to block a Fudo user by ID. The ID was <code class="language-plaintext highlighter-rouge">FROM_STEP_5</code>. <strong>422 — “Valid user_id required.”</strong></li>
  <li>I clicked the Debug button. Nothing happened. Export? Nothing. Save? You guessed it.</li>
</ol>

<p>The buttons were decorative. The templates had placeholder values that couldn’t resolve. And running the same workflow twice was a guaranteed crash because the first run’s test data polluted the second run.</p>

<p>This is the kind of thing that separates a proof of concept from a tool. And I had a proof of concept.</p>

<h2 id="the-fix-everything">The Fix: Everything</h2>

<p>Instead of patching individual bugs, I decided to rebuild the entire user experience in one session. Here’s what that looked like.</p>

<h3 id="cross-step-references-the-non-obvious-problem">Cross-Step References (The Non-Obvious Problem)</h3>

<p>This was the most interesting engineering challenge. In a real provisioning workflow, steps depend on each other:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Step 1: Create Matrix42 Ticket     → returns ticket_id
Step 5: Create Fudo PAM User       → returns user_id
Step 6: Add User to Fudo Group     → needs Step 5's user_id
Step 8: Close Ticket               → needs Step 1's ticket_id
</code></pre></div></div>

<p>In production PowerShell, you’d write <code class="language-plaintext highlighter-rouge">$step5Result.id</code>. But in the PAMlab runner, the script is parsed into API calls and executed sequentially. There’s no PowerShell runtime — just a JavaScript loop calling <code class="language-plaintext highlighter-rouge">fetch()</code>.</p>

<p>The solution was a step resolver that sits between the parser and the executor:</p>

<div class="language-typescript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// After each step, extract IDs from the response</span>
<span class="nx">stepResults</span><span class="p">[</span><span class="nx">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="nx">extractIds</span><span class="p">(</span><span class="nx">res</span><span class="p">.</span><span class="nx">data</span><span class="p">);</span>

<span class="c1">// Before executing the next step, resolve references</span>
<span class="kd">const</span> <span class="nx">resolvedCall</span> <span class="o">=</span> <span class="nx">resolveStepReferences</span><span class="p">(</span><span class="nx">calls</span><span class="p">[</span><span class="nx">i</span><span class="p">],</span> <span class="nx">stepResults</span><span class="p">);</span>
</code></pre></div></div>

<p>The resolver understands different response formats: Fudo returns <code class="language-plaintext highlighter-rouge">{ id: "..." }</code>, Matrix42 returns <code class="language-plaintext highlighter-rouge">{ ID: "..." }</code>, ServiceNow wraps everything in <code class="language-plaintext highlighter-rouge">{ result: { sys_id: "..." } }</code>, and Jira uses <code class="language-plaintext highlighter-rouge">{ key: "ITSM-12" }</code>. One function, five different ID extraction patterns.</p>

<p>It’s the kind of plumbing that nobody notices when it works, but breaks everything when it doesn’t.</p>
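<p>To make the idea concrete, here is a minimal sketch of what such a resolver could look like. Only the names <code class="language-plaintext highlighter-rouge">extractIds</code> and <code class="language-plaintext highlighter-rouge">resolveStepReferences</code> come from the snippet above; the function bodies, the types, and the assumption that placeholders follow the <code class="language-plaintext highlighter-rouge">FROM_STEP_5</code> pattern are illustrative guesses, not the actual PAMlab implementation.</p>

```typescript
// Hypothetical shapes: these types are assumptions made for this sketch.
type Json = { [key: string]: unknown };
type ApiCall = { method: string; url: string; body: Json };
type StepResults = { [step: number]: string | undefined };

function isObject(value: unknown): value is Json {
  if (value === null || Array.isArray(value)) return false;
  return typeof value === "object";
}

// Try the ID fields mentioned in the post, in a fixed order.
function extractIds(data: Json): string | undefined {
  const result = data["result"];
  const candidates: unknown[] = [
    data["id"], // Fudo: { id: "..." }
    data["ID"], // Matrix42: { ID: "..." }
    isObject(result) ? result["sys_id"] : undefined, // ServiceNow
    data["key"], // Jira: { key: "ITSM-12" }
  ];
  for (const value of candidates) {
    if (typeof value === "string" || typeof value === "number") {
      return String(value);
    }
  }
  return undefined;
}

// Replace every FROM_STEP_n placeholder in the URL and in string body fields.
function resolveStepReferences(call: ApiCall, results: StepResults): ApiCall {
  const substitute = (text: string): string =>
    text.replace(/FROM_STEP_(\d+)/g, (_match: string, step: string) => {
      const id = results[Number(step)];
      if (id === undefined) {
        throw new Error("Step " + step + " produced no ID to reference");
      }
      return id;
    });

  const body: Json = {};
  for (const key of Object.keys(call.body)) {
    const value = call.body[key];
    body[key] = typeof value === "string" ? substitute(value) : value;
  }
  return { method: call.method, url: substitute(call.url), body: body };
}
```

<p>In this sketch, an unresolved reference throws instead of being sent to the API as a literal placeholder. That turns the “Valid user_id required” failure from V1 into an error that names the step that produced no ID.</p>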

<h3 id="the-test-sandbox">The Test Sandbox</h3>

<p>The second big problem: running templates more than once. The Onboarding template creates “Sarah Connor” — run it twice and the AD mock returns 409 because Sarah already exists.</p>

<p>The fix was a test runner that generates random identities:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Template says: s.connor / Sarah Connor
Test run uses: test-a3f8b / Test User A3F8B
</code></pre></div></div>

<p>It replaces every known demo username in the generated script with a random identifier, runs it, and then offers a “Cleanup” button that deletes everything the test created. AD users, Fudo users, group memberships, tickets — all tracked during execution and reversible after.</p>
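<p>A rough sketch of the identity swap and the cleanup bookkeeping, under stated assumptions: the names <code class="language-plaintext highlighter-rouge">DEMO_IDENTITIES</code>, <code class="language-plaintext highlighter-rouge">randomizeIdentities</code> and <code class="language-plaintext highlighter-rouge">cleanupPlan</code> are hypothetical, and deleting in reverse creation order is my guess at a sensible default rather than a description of the real runner.</p>

```typescript
// Hypothetical sketch: identity list and function names are assumptions.
type DemoIdentity = { username: string; displayName: string };
type CreatedResource = { system: string; path: string };

const DEMO_IDENTITIES: DemoIdentity[] = [
  { username: "s.connor", displayName: "Sarah Connor" },
];

// Five random hex characters, similar to the "a3f8b" suffix shown above.
function randomSuffix(): string {
  let suffix = "";
  for (let i = 0; i !== 5; i++) {
    suffix += Math.floor(Math.random() * 16).toString(16);
  }
  return suffix;
}

// Swap every known demo identity in the script for a throwaway one.
// split/join does a literal replacement, so dots in usernames are safe.
function randomizeIdentities(script: string, suffix: string = randomSuffix()): string {
  let result = script;
  for (const demo of DEMO_IDENTITIES) {
    result = result.split(demo.username).join("test-" + suffix);
    result = result.split(demo.displayName).join("Test User " + suffix.toUpperCase());
  }
  return result;
}

// Undo in reverse creation order, so memberships go before the users they point at.
function cleanupPlan(created: CreatedResource[]): CreatedResource[] {
  return created.slice().reverse();
}
```

<p>Passing the suffix as an optional argument keeps the function testable: a test can pin the suffix, while a normal run gets a fresh random one.</p>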

<p>Combined with a “Reset Mock Data” button that reloads all seed data across all 7 APIs, you can now iterate on workflows indefinitely without accumulating garbage state.</p>

<h3 id="stable-uuids-the-subtle-fix">Stable UUIDs (The Subtle Fix)</h3>

<p>Here’s a fun one. The Fudo mock used <code class="language-plaintext highlighter-rouge">uuidv4()</code> to generate IDs for all seed data — users, groups, safes, servers. Every time the API restarted, every ID changed.</p>

<p>This meant the workflow templates couldn’t reference Fudo groups or safes by ID, because the IDs were different every time. I replaced all seed IDs with deterministic UUIDs:</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Before: random every restart</span>
<span class="p">{</span> <span class="nl">id</span><span class="p">:</span> <span class="nx">uuidv4</span><span class="p">(),</span> <span class="nx">name</span><span class="p">:</span> <span class="dl">'</span><span class="s1">RDP-Server-Admins</span><span class="dl">'</span> <span class="p">}</span>

<span class="c1">// After: stable forever</span>
<span class="p">{</span> <span class="na">id</span><span class="p">:</span> <span class="dl">'</span><span class="s1">70000000-0000-0000-0000-000000000001</span><span class="dl">'</span><span class="p">,</span> <span class="na">name</span><span class="p">:</span> <span class="dl">'</span><span class="s1">RDP-Server-Admins</span><span class="dl">'</span> <span class="p">}</span>
</code></pre></div></div>

<p>Simple change, huge impact. Now templates can hard-reference <code class="language-plaintext highlighter-rouge">70000000-0000-0000-0000-000000000001</code> and it’ll always be the RDP-Server-Admins group.</p>
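<p>If you would rather generate such IDs than type them by hand, a small helper is enough. This is a sketch only: the function name <code class="language-plaintext highlighter-rouge">seedId</code> and the idea that the leading two hex digits identify the resource type are my assumptions, inferred from the single ID shown above.</p>

```typescript
// Assumption: a two-digit type prefix plus a zero-padded hex counter.
function seedId(typePrefix: string, index: number): string {
  const counter = ("000000000000" + index.toString(16)).slice(-12);
  return typePrefix + "000000-0000-0000-0000-" + counter;
}

// Hypothetical seed entry built with the helper.
const rdpServerAdmins = { id: seedId("70", 1), name: "RDP-Server-Admins" };
```

<p>Note that these strings are UUID-shaped rather than valid version 4 UUIDs (the version nibble is 0). That is usually fine for a mock, but a client that strictly validates UUID versions might reject them.</p>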

<h3 id="five-real-templates">Five Real Templates</h3>

<p>With stable IDs and cross-step resolution, I could finally build templates that actually <em>work</em>:</p>

<table>
  <thead>
    <tr>
      <th>Template</th>
      <th>Steps</th>
      <th>Systems</th>
      <th>What Happens</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Employee Onboarding</td>
      <td>8</td>
      <td>Matrix42 → AD → Fudo</td>
      <td>Ticket → User → Groups → PAM → Policy → Close</td>
    </tr>
    <tr>
      <td>Temp Server Access</td>
      <td>4</td>
      <td>Matrix42 → AD → Fudo</td>
      <td>Ticket → Group → Time-Limited Policy → Close</td>
    </tr>
    <tr>
      <td>Offboarding</td>
      <td>5</td>
      <td>AD → Fudo → ServiceNow</td>
      <td>Remove → Block → Disable → Incident</td>
    </tr>
    <tr>
      <td>Emergency Revoke</td>
      <td>5</td>
      <td>Fudo → AD → ServiceNow</td>
      <td>🚨 Block → Disable → Remove → Security Incident</td>
    </tr>
    <tr>
      <td>Project Access</td>
      <td>4</td>
      <td>AD → Fudo → Jira</td>
      <td>Group → Web Policy → DB Policy → Jira Issue</td>
    </tr>
  </tbody>
</table>

<p>All five templates verified end-to-end: <strong>26 API calls, 26 successful responses.</strong> You can load any of them, hit Run, and watch the entire flow execute with live status updates in the inline results panel.</p>

<h3 id="the-access-policy-model">The Access Policy Model</h3>

<p>While building templates, I realized the Fudo mock was missing a critical concept: <strong>Access Policies</strong>. In real Fudo, the access chain is:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>AD Group (GRP-RDP-Admins)
  → linked via ad_group_dn to
Fudo Group (RDP-Server-Admins)
  → Access Policy binds to
Fudo Safe (IT-Administration)
  → contains
Servers (DC01, DB-PROD, FileServer01) + Accounts
  → accessed via
Listener (RDP / SSH)
</code></pre></div></div>

<p>A user in the group can access all servers in the safe through the listener. I implemented the full model:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">POST /api/v2/access-policies</code> — create policies with group, safe, listener, time restrictions, and approval requirements</li>
  <li><code class="language-plaintext highlighter-rouge">GET /api/v2/access-policies/check/:user_id/:safe_id</code> — check if a user has access to a specific safe</li>
  <li>Seed data with three pre-configured policies</li>
</ul>

<p>This is the kind of domain modeling that makes mock APIs actually useful for testing — not just “does the API respond” but “does the business logic flow make sense.”</p>
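<p>The access check boils down to a join across groups and policies. Here is a simplified sketch of that logic; the type names, field names and the <code class="language-plaintext highlighter-rouge">validUntil</code> time restriction are assumptions for illustration and not Fudo’s real schema or the mock’s actual code. Listeners and approval requirements are left out.</p>

```typescript
// Hypothetical data model; field names are assumptions.
type Group = { id: string; memberIds: string[] };
type AccessPolicy = { groupId: string; safeId: string; validUntil?: string };

// A user can reach a safe if any unexpired policy binds one of the
// user's groups to that safe.
function hasAccess(
  userId: string,
  safeId: string,
  groups: Group[],
  policies: AccessPolicy[],
  now: Date
): boolean {
  const userGroupIds = groups
    .filter((group) => group.memberIds.indexOf(userId) !== -1)
    .map((group) => group.id);

  return policies.some((policy) => {
    if (policy.safeId !== safeId) return false;
    if (userGroupIds.indexOf(policy.groupId) === -1) return false;
    if (policy.validUntil === undefined) return true; // no time limit
    return Date.parse(policy.validUntil) > now.getTime();
  });
}
```

<p>Passing <code class="language-plaintext highlighter-rouge">now</code> in as a parameter, instead of reading the clock inside the function, makes time-limited policies easy to test.</p>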

<h3 id="fudo-mock-expansion">Fudo Mock Expansion</h3>

<p>While at it, I expanded the infrastructure:</p>

<ul>
  <li><strong>6 servers</strong> (was 3): DC01, DB-PROD, APP-ERP, FileServer01, Web-PROD, Citrix01</li>
  <li><strong>4 safes</strong> (was 2): IT-Administration, Application-Access, File-Server-Access, Web-Server-Deployment</li>
  <li><strong>6 accounts</strong>: One per server with realistic names</li>
  <li>All servers in the Production pool</li>
</ul>

<h3 id="flow-visualization">Flow Visualization</h3>

<p>The workflow builder now shows your steps as a visual flow diagram:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[Matrix42 🎫]──→──[AD 🏢]──→──[AD 🏢]──→──[Fudo 🔐]──→──[Fudo 🔐]──→──[Matrix42 🎫]
 Create Ticket    Create User   Add Group   Create User   Create Policy   Close Ticket
     ✅              ✅            ✅           ⏳            ⏸              ⏸
</code></pre></div></div>

<p>Each node is colored by system (blue=AD, green=Fudo, purple=Matrix42, orange=ServiceNow), shows the step label, and has a live status indicator. When you run the workflow, nodes light up one by one.</p>

<h3 id="production-export">Production Export</h3>

<p>Mock scripts are useless if you can’t deploy them. PAMlab Studio now has a production config system where you configure real system URLs and auth methods:</p>

<table>
  <thead>
    <tr>
      <th>System</th>
      <th>Auth Method</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Fudo PAM</td>
      <td>API Token</td>
    </tr>
    <tr>
      <td>Matrix42</td>
      <td>API Key</td>
    </tr>
    <tr>
      <td>Active Directory</td>
      <td>LDAP Bind</td>
    </tr>
    <tr>
      <td>ServiceNow</td>
      <td>OAuth2 (Client ID/Secret)</td>
    </tr>
    <tr>
      <td>Jira SM</td>
      <td>API Token</td>
    </tr>
    <tr>
      <td>BMC Remedy</td>
      <td>Basic Auth</td>
    </tr>
  </tbody>
</table>

<p>Hit “🏭 Export for Production” and the script is regenerated with proper auth blocks. Same workflow logic, different endpoints and credentials. The config can be exported/imported as JSON (with passwords masked).</p>
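<p>Masking on export can be done with a recursive copy. The following is a sketch under assumptions: the function name <code class="language-plaintext highlighter-rouge">maskSecrets</code>, the config shape, and the key-name pattern used to detect secrets are all hypothetical, and a real implementation may well track secret fields explicitly instead of matching on names.</p>

```typescript
// Hypothetical config shape for this sketch.
type ConfigValue = string | number | boolean | null | ConfigObject;
type ConfigObject = { [key: string]: ConfigValue };

// Assumption: secret fields can be recognized by their key names.
const SECRET_KEY_PATTERN = /password|secret|token|apikey/i;

// Returns a masked copy; the original config object is left untouched.
function maskSecrets(config: ConfigObject): ConfigObject {
  const masked: ConfigObject = {};
  for (const key of Object.keys(config)) {
    const value = config[key];
    if (value === null) {
      masked[key] = null;
    } else if (typeof value === "object") {
      masked[key] = maskSecrets(value); // recurse into nested sections
    } else if (typeof value === "string") {
      masked[key] = SECRET_KEY_PATTERN.test(key) ? "********" : value;
    } else {
      masked[key] = value;
    }
  }
  return masked;
}
```

<p>Matching on key names is a heuristic: it misses a secret stored under an unusual key. An explicit allowlist of exportable fields would be the stricter design.</p>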

<h3 id="the-welcome-screen">The Welcome Screen</h3>

<p>New users no longer land on a dashboard with green dots and no context. There’s now a welcome screen with:</p>

<ul>
  <li>Feature overview (Build → Test → Ship)</li>
  <li>“Start with a Demo” button that loads the Onboarding template</li>
  <li>“Build from Scratch” option</li>
</ul>

<p>It’s skippable and only shows once (localStorage flag). Small thing, but it’s the difference between “what am I looking at” and “oh, I should click this.”</p>

<h3 id="everything-else">Everything Else</h3>

<ul>
  <li><strong>Run History</strong>: All executions saved in localStorage, viewable as a table with expandable details</li>
  <li><strong>Live Dashboard Stats</strong>: Users, servers, groups, sessions, pending requests — fetched from Fudo API in real time</li>
  <li><strong>Quick Actions</strong>: One-click cards for common demos (Onboarding, Emergency Revoke)</li>
  <li><strong>Keyboard Shortcuts</strong>: Ctrl+Enter (Run), Ctrl+S (Save), Ctrl+E (Export)</li>
  <li><strong>Settings Redesign</strong>: Three tabs — Mock APIs, Production Config, Preferences</li>
  <li><strong>Mock Data Reset</strong>: POST /reset on all APIs to reload seed data</li>
</ul>

<h2 id="interactive-demo">Interactive Demo</h2>

<p>Want to see it in action without cloning the repo? Here’s an interactive walkthrough:</p>

<div style="position: relative; width: 100%; max-width: 960px; margin: 30px auto;">
<iframe id="pamlab-demo" src="/assets/demos/pamlab-studio-v2-demo.html" style="width: 100%; height: 640px; border: 2px solid #333; border-radius: 12px; background: #0a0a0f;" allowfullscreen=""></iframe>
</div>

<p style="text-align: center; margin-top: -15px;">
<em>Use arrow keys or click the navigation dots. <a href="/assets/demos/pamlab-studio-v2-demo.html" target="_blank">Open fullscreen ↗</a></em>
</p>

<h2 id="the-numbers">The Numbers</h2>

<p>Over the course of one (long) day:</p>

<ul>
  <li><strong>1,400+ lines</strong> of new TypeScript/React code</li>
  <li><strong>19 files</strong> changed across frontend and all 7 mock APIs</li>
  <li><strong>6 new components and modules</strong>: Welcome, FlowDiagram, RunHistory, testRunner, stepResolver, productionConfig</li>
  <li><strong>5 workflow templates</strong> verified end-to-end (26 API calls in total)</li>
  <li><strong>0 production systems</strong> were harmed in the making of this update</li>
</ul>

<h2 id="what-i-learned">What I Learned</h2>

<p>Building developer tools is different from building user-facing products. The audience is smaller but way more demanding. They’ll find every edge case in the first five minutes because that’s literally what they do for a living.</p>

<p>The biggest lesson: <strong>demo data matters more than features.</strong> The test sandbox and mock data reset took maybe 20% of the development time but solved 80% of the usability problems. Nobody cares about your flow visualization if they can’t run the same template twice.</p>

<p>The second lesson: <strong>cross-system reference resolution is the hard part of workflow automation.</strong> Not the individual API calls — those are straightforward. It’s the <code class="language-plaintext highlighter-rouge">$step5Result.id</code> problem. Every orchestration tool eventually has to solve this, and most do it poorly.</p>

<h2 id="try-it">Try It</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/BenediktSchackenberg/PAMlab.git
<span class="nb">cd </span>PAMlab
docker-compose up
<span class="c"># Open http://localhost:3000</span>
</code></pre></div></div>

<p>Load a template. Hit Run. Watch 8 API calls cascade across three different systems. Then hit “🧪 Test Run” and do it again with random data. Then export the script, change the URLs, and run it against your real environment.</p>

<p>That’s the whole point: <strong>build once, test safely, deploy anywhere.</strong></p>

<p><a href="https://github.com/BenediktSchackenberg/PAMlab">GitHub</a> — Apache 2.0, contributions welcome.</p>]]></content><author><name>Benedikt Schackenberg</name></author><category term="pamlab" /><category term="pam" /><category term="fudo" /><category term="workflow-automation" /><category term="devtools" /><category term="react" /><category term="typescript" /><category term="security" /><category term="active-directory" /><category term="servicenow" /><category term="jira" /><summary type="html"><![CDATA[Two days ago I released PAMlab — a sandbox with mock APIs for building PAM integration scripts. Three APIs, some PowerShell templates, a basic web UI. Cool, but kind of bare bones.]]></summary></entry><entry><title type="html">How I Accidentally Started a Container Security Drama at NVIDIA</title><link href="https://schackenberg.com/2026/03/26/nvidia-nemoclaw-container-security.html" rel="alternate" type="text/html" title="How I Accidentally Started a Container Security Drama at NVIDIA" /><published>2026-03-26T00:00:00+01:00</published><updated>2026-03-26T00:00:00+01:00</updated><id>https://schackenberg.com/2026/03/26/nvidia-nemoclaw-container-security</id><content type="html" xml:base="https://schackenberg.com/2026/03/26/nvidia-nemoclaw-container-security.html"><![CDATA[<h1 id="the-day-i-poked-nvidia-about-their-container-security">The Day I Poked NVIDIA About Their Container Security</h1>

<p>You know that feeling when you leave a comment on a GitHub issue and suddenly you’re in the middle of a security drama? Yeah, that happened.</p>

<h2 id="it-started-with-a-simple-question">It Started With a Simple Question</h2>

<p><a href="https://github.com/NVIDIA/NemoClaw">NemoClaw</a> is NVIDIA’s open-source AI agent orchestration framework. Cool project. But while I was poking around the Dockerfile, something caught my eye: <strong>the container wasn’t dropping any Linux capabilities</strong>.</p>

<p><img src="/assets/images/cap-drop-before.png" alt="Container running with all capabilities" />
<em>The container after startup. Look at all those juicy capabilities. CAP_NET_RAW? Don’t mind if I do!</em></p>

<p>For the uninitiated: Linux capabilities are like fine-grained root permissions. Docker containers get a bunch by default — including fun ones like:</p>

<ul>
  <li><strong>CAP_NET_RAW</strong> — craft raw packets, ARP spoofing, the works</li>
  <li><strong>CAP_DAC_OVERRIDE</strong> — who needs file permissions anyway?</li>
  <li><strong>CAP_SYS_CHROOT</strong> — chroot calls for everyone!</li>
</ul>

<p>This is basically <a href="https://www.cisecurity.org/benchmark/docker">CIS Docker Benchmark 5.3</a>: “Restrict Linux kernel capabilities within containers.” One of those things that everyone agrees on but nobody actually does.</p>

<h2 id="the-docs-only-fix-that-wasnt">The Docs-Only “Fix” That Wasn’t</h2>

<p>The first attempt to address issue <a href="https://github.com/NVIDIA/NemoClaw/issues/797">#797</a> was… a documentation update. Basically: “Hey, just add <code class="language-plaintext highlighter-rouge">--cap-drop=ALL</code> to your docker run command.”</p>

<p>The community wasn’t having it. As one contributor pointedly asked: <em>“How does this cover #797?”</em> — and they were right. Telling users to remember a CLI flag isn’t the same as actually fixing the problem. The PR author gracefully acknowledged:</p>

<blockquote>
  <p>“Fair point — you’re right that our PR doesn’t actually enforce capability dropping at the container level.”</p>
</blockquote>

<p><strong>Lesson learned:</strong> Docs are great. Docs as a security fix? Not so much. 📝 ≠ 🔒</p>

<h2 id="the-real-fix-capsh-to-the-rescue">The Real Fix: capsh to the Rescue</h2>

<p>Here’s where it gets interesting. The NemoClaw container is managed by OpenShell’s sandbox runtime, which means <strong>you can’t just pass <code class="language-plaintext highlighter-rouge">--cap-drop=ALL</code> to docker run</strong>. The runtime doesn’t expose that flag. Classic.</p>

<p>So <a href="https://github.com/ericksoa">@ericksoa</a> came up with an elegant solution: use <code class="language-plaintext highlighter-rouge">capsh</code> (from <code class="language-plaintext highlighter-rouge">libcap2-bin</code>) in the entrypoint script to <strong>self-re-exec with a stripped bounding set</strong>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="o">[</span> <span class="s2">"</span><span class="k">${</span><span class="nv">NEMOCLAW_CAPS_DROPPED</span><span class="k">:-}</span><span class="s2">"</span> <span class="o">!=</span> <span class="s2">"1"</span> <span class="o">]</span> <span class="o">&amp;&amp;</span> <span class="nb">command</span> <span class="nt">-v</span> capsh <span class="o">&gt;</span>/dev/null 2&gt;&amp;1<span class="p">;</span> <span class="k">then
  </span><span class="nb">export </span><span class="nv">NEMOCLAW_CAPS_DROPPED</span><span class="o">=</span>1
  <span class="nb">exec </span>capsh <span class="se">\</span>
    <span class="nt">--drop</span><span class="o">=</span>cap_net_raw,cap_dac_override,cap_sys_chroot,cap_fsetid,... <span class="se">\</span>
    <span class="nt">--</span> <span class="nt">-c</span> <span class="s1">'exec /usr/local/bin/nemoclaw-start "$@"'</span> <span class="nt">--</span> <span class="s2">"</span><span class="nv">$@</span><span class="s2">"</span>
<span class="k">fi</span>
</code></pre></div></div>

<p>The beauty: it drops 9 dangerous capabilities while keeping only the 5 needed for <code class="language-plaintext highlighter-rouge">gosu</code> privilege separation. And the <code class="language-plaintext highlighter-rouge">NEMOCLAW_CAPS_DROPPED</code> guard prevents infinite re-exec loops. Chef’s kiss. 👨‍🍳</p>

<h2 id="my-review-and-the-accidental-empty-approval">My Review (and the Accidental Empty Approval)</h2>

<p>I got to <a href="https://github.com/NVIDIA/NemoClaw/pull/917">review the PR</a> and… well, first I accidentally submitted an empty approval. Automation things. 😅</p>

<p>But then I left actual feedback:</p>

<ol>
  <li>
    <p><strong>The e2e test</strong> used <code class="language-plaintext highlighter-rouge">bash -c</code> with nested arithmetic to decode hex capability masks. I suggested running <code class="language-plaintext highlighter-rouge">capsh --print</code> and grepping for <code class="language-plaintext highlighter-rouge">cap_net_raw</code> in the Bounding set instead. Simpler, less fragile.</p>
  </li>
  <li>
    <p><strong>Spotted a sneaky one:</strong> <code class="language-plaintext highlighter-rouge">cap_dac_read_search</code> wasn’t in the drop list. Intentional? Some hardening guides flag it. Always worth asking.</p>
  </li>
</ol>
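<p>For context on the first point, this is roughly what decoding a capability mask by hand involves. It is an illustrative sketch, not the test from the PR: the function name <code class="language-plaintext highlighter-rouge">hasCapability</code> is made up, and the table covers only the capabilities mentioned in this post, with bit positions taken from <code class="language-plaintext highlighter-rouge">linux/capability.h</code>.</p>

```typescript
// Bit positions from linux/capability.h (only the ones named in this post).
const CAPABILITY_BITS: { [name: string]: number } = {
  cap_dac_override: 1,
  cap_dac_read_search: 2,
  cap_fsetid: 4,
  cap_setpcap: 8,
  cap_net_raw: 13,
  cap_sys_chroot: 18,
};

// hexMask is the value of a CapBnd line in /proc/self/status.
function hasCapability(hexMask: string, name: string): boolean {
  const bit = CAPABILITY_BITS[name];
  if (bit === undefined) {
    throw new Error("Unknown capability: " + name);
  }
  const mask = parseInt(hexMask, 16);
  // Divide instead of shifting: JavaScript bit shifts truncate to 32 bits,
  // and capability masks can be wider than that.
  return Math.floor(mask / Math.pow(2, bit)) % 2 === 1;
}
```

<p>With a mask such as <code class="language-plaintext highlighter-rouge">00000000a80425fb</code>, which is what default Docker containers commonly report, this returns true for <code class="language-plaintext highlighter-rouge">cap_net_raw</code> and false for <code class="language-plaintext highlighter-rouge">cap_dac_read_search</code>. You need a bit table and careful arithmetic just to answer one yes/no question, which is why grepping the output of <code class="language-plaintext highlighter-rouge">capsh --print</code> is the less fragile option.</p>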

<p><img src="/assets/images/cap-drop-after.png" alt="The capability drop in action" />
<em>After the fix: bounding set stripped clean. Only the essentials remain.</em></p>

<h2 id="the-plot-twist-cap_setpcap">The Plot Twist: cap_setpcap</h2>

<p>Of course, no good security PR ships without a follow-up fix. Turns out, <strong>you need <code class="language-plaintext highlighter-rouge">CAP_SETPCAP</code> to drop capabilities from the bounding set</strong>. Dropping it along with the others means <code class="language-plaintext highlighter-rouge">capsh</code> can’t… well, do its job. The sandbox wouldn’t start.</p>

<p><a href="https://github.com/NVIDIA/NemoClaw/pull/929">PR #929</a> landed the same day to keep <code class="language-plaintext highlighter-rouge">cap_setpcap</code> and add a pre-check with <code class="language-plaintext highlighter-rouge">capsh --has-p</code>. If the capability isn’t available (like in OpenShell’s sandbox), the drop is skipped gracefully since the runtime is already restricting things.</p>

<h2 id="takeaways">Takeaways</h2>

<ol>
  <li><strong>Defense in depth actually matters.</strong> Don’t rely on one layer (docs, runtime, or Dockerfile alone).</li>
  <li><strong>Community pressure works.</strong> The docs-only approach got rightfully challenged, leading to a proper fix.</li>
  <li><strong>capsh is underrated.</strong> When you can’t control the runtime, the entrypoint can still harden itself.</li>
  <li><strong>Always read the capability list twice.</strong> Missing one (<code class="language-plaintext highlighter-rouge">cap_setpcap</code>) broke the entire sandbox startup.</li>
  <li><strong>Open source reviews are fun.</strong> You get thanked in NVIDIA PRs for pointing out CIS benchmarks. Not bad for a Tuesday evening.</li>
</ol>

<hr />

<p><em>The PR <a href="https://github.com/NVIDIA/NemoClaw/pull/917">#917</a> was merged on March 25, 2026. Many thanks to <a href="https://github.com/ericksoa">@ericksoa</a> for the implementation, <a href="https://github.com/h-network">@h-network</a> for keeping the bar high, and the NemoClaw maintainers for a smooth review process.</em></p>]]></content><author><name>Benedikt Schackenberg</name></author><category term="security" /><category term="containers" /><category term="open-source" /><category term="NVIDIA" /><category term="NemoClaw" /><summary type="html"><![CDATA[The Day I Poked NVIDIA About Their Container Security]]></summary></entry><entry><title type="html">PAMlab: A Dev Sandbox for Privileged Access Automation 🔐</title><link href="https://schackenberg.com/2026/03/26/pamlab-dev-sandbox.html" rel="alternate" type="text/html" title="PAMlab: A Dev Sandbox for Privileged Access Automation 🔐" /><published>2026-03-26T00:00:00+01:00</published><updated>2026-03-26T00:00:00+01:00</updated><id>https://schackenberg.com/2026/03/26/pamlab-dev-sandbox</id><content type="html" xml:base="https://schackenberg.com/2026/03/26/pamlab-dev-sandbox.html"><![CDATA[<p>Here’s a scenario every sysadmin knows: Your boss asks you to automate the access provisioning workflow. New employee joins, gets added to the right AD groups, Fudo PAM picks up the change, access to production servers is granted — all triggered by a ticket in Matrix42.</p>

<p>Sounds straightforward on a whiteboard. Then you sit down to actually build it and realize: you can’t test against production. Your Fudo appliance is locked down by the security team. The Matrix42 dev instance hasn’t been updated since 2019. And nobody’s going to give you a sandbox Active Directory with realistic data.</p>

<p>So you do what every sysadmin does: you test in production and pray. Or you write the script, send it to someone else, and wait three weeks for feedback.</p>

<p>There has to be a better way.</p>

<h2 id="what-we-built">What We Built</h2>

<p><a href="https://github.com/BenediktSchackenberg/PAMlab">PAMlab</a> is a collection of mock APIs that simulate a complete enterprise access management stack. Three Node.js servers, each pretending to be a different system:</p>

<table>
  <thead>
    <tr>
      <th>System</th>
      <th>What it pretends to be</th>
      <th>Port</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Fudo PAM</td>
      <td>Privileged Access Management (sessions, passwords, JIT access)</td>
      <td>8443</td>
    </tr>
    <tr>
      <td>Matrix42 ESM</td>
      <td>IT Service Management (tickets, assets, provisioning workflows)</td>
      <td>8444</td>
    </tr>
    <tr>
      <td>Active Directory</td>
      <td>Directory services (users, groups, OUs, computers)</td>
      <td>8445</td>
    </tr>
  </tbody>
</table>

<p>Start everything with one command:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/BenediktSchackenberg/PAMlab.git
<span class="nb">cd </span>PAMlab
docker-compose up
</code></pre></div></div>

<p>Three APIs running locally. With realistic data. Ready to be scripted against.</p>

<h2 id="why-this-matters">Why This Matters</h2>

<p>I’ve worked in environments where a single misconfigured provisioning script removed 200 users from their security groups. On a Friday afternoon. The helpdesk queue on Monday was not fun.</p>

<p>The problem wasn’t that the script was wrong — the logic was fine. The problem was that nobody tested the edge cases:</p>

<ul>
  <li>What happens when the AD group doesn’t exist yet?</li>
  <li>What if the Fudo sync fails halfway through?</li>
  <li>What if the Matrix42 approval is denied after the AD change was already made?</li>
</ul>

<p>These aren’t theoretical problems. They’re Tuesday at 2pm problems. And you can’t catch them by reading the script. You need to run it against something.</p>

<h2 id="a-real-example-onboarding-with-temporary-access">A Real Example: Onboarding with Temporary Access</h2>

<p>Let’s say a contractor needs RDP access to three production servers for 30 days. The workflow:</p>

<ol>
  <li>Someone creates an access request in Matrix42</li>
  <li>The manager approves it</li>
  <li>A PowerShell script adds the user to the right AD group</li>
  <li>Fudo syncs from AD and grants access</li>
  <li>After 30 days, the group membership expires</li>
</ol>

<p>Here’s what that looks like against PAMlab:</p>

<div class="language-powershell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Import-Module</span><span class="w"> </span><span class="o">.</span><span class="nx">/examples/powershell/_PAMlab-Module.psm1</span><span class="w">
</span><span class="n">Connect-PAMlab</span><span class="w">

</span><span class="c"># Step 1: Create the access request</span><span class="w">
</span><span class="nv">$request</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Invoke-M42</span><span class="w"> </span><span class="nt">-Method</span><span class="w"> </span><span class="nx">POST</span><span class="w"> </span><span class="nt">-Endpoint</span><span class="w"> </span><span class="s2">"/access-requests"</span><span class="w"> </span><span class="nt">-Body</span><span class="w"> </span><span class="p">@{</span><span class="w">
    </span><span class="nx">user</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"t.developer"</span><span class="w">
    </span><span class="nx">target_type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"server_group"</span><span class="w">
    </span><span class="nx">target</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"GRP-RDP-Admins"</span><span class="w">
    </span><span class="nx">access_type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"rdp"</span><span class="w">
    </span><span class="nx">justification</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Database migration project"</span><span class="w">
    </span><span class="nx">duration</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"30d"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">Write-Host</span><span class="w"> </span><span class="s2">"Access Request: </span><span class="si">$(</span><span class="nv">$request</span><span class="o">.</span><span class="nf">id</span><span class="si">)</span><span class="s2"> — Status: </span><span class="si">$(</span><span class="nv">$request</span><span class="o">.</span><span class="nf">status</span><span class="si">)</span><span class="s2">"</span><span class="w">

</span><span class="c"># Step 2: Approve it</span><span class="w">
</span><span class="n">Invoke-M42</span><span class="w"> </span><span class="nt">-Method</span><span class="w"> </span><span class="nx">POST</span><span class="w"> </span><span class="nt">-Endpoint</span><span class="w"> </span><span class="s2">"/access-requests/</span><span class="si">$(</span><span class="nv">$request</span><span class="o">.</span><span class="nf">id</span><span class="si">)</span><span class="s2">/approve"</span><span class="w"> </span><span class="nt">-Body</span><span class="w"> </span><span class="p">@{</span><span class="w">
    </span><span class="nx">approved_by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"admin"</span><span class="w">
    </span><span class="nx">comment</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Approved for Q2 migration"</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="c"># Step 3: Add to AD group with 30-day expiry</span><span class="w">
</span><span class="n">Invoke-AD</span><span class="w"> </span><span class="nt">-Method</span><span class="w"> </span><span class="nx">POST</span><span class="w"> </span><span class="nt">-Endpoint</span><span class="w"> </span><span class="s2">"/groups/GRP-RDP-Admins/members/timed"</span><span class="w"> </span><span class="nt">-Body</span><span class="w"> </span><span class="p">@{</span><span class="w">
    </span><span class="nx">user</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"t.developer"</span><span class="w">
    </span><span class="nx">expires_at</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="err">(</span><span class="nx">Get</span><span class="err">-</span><span class="nx">Date</span><span class="err">).</span><span class="nx">ToUniversalTime</span><span class="err">().</span><span class="nx">AddDays</span><span class="err">(</span><span class="mi">30</span><span class="err">).</span><span class="nx">ToString</span><span class="err">(</span><span class="s2">"yyyy-MM-ddTHH:mm:ssZ"</span><span class="err">)</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="c"># Step 4: Trigger Fudo sync</span><span class="w">
</span><span class="nv">$sync</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Invoke-Fudo</span><span class="w"> </span><span class="nt">-Method</span><span class="w"> </span><span class="nx">POST</span><span class="w"> </span><span class="nt">-Endpoint</span><span class="w"> </span><span class="s2">"/user-directory/sync"</span><span class="w">
</span><span class="n">Write-Host</span><span class="w"> </span><span class="s2">"Sync complete: </span><span class="si">$(</span><span class="nv">$sync</span><span class="o">.</span><span class="nf">users_added</span><span class="si">)</span><span class="s2"> added, </span><span class="si">$(</span><span class="nv">$sync</span><span class="o">.</span><span class="nf">groups_synced</span><span class="si">)</span><span class="s2"> groups"</span><span class="w">

</span><span class="c"># Step 5: Verify</span><span class="w">
</span><span class="nv">$groups</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Invoke-Fudo</span><span class="w"> </span><span class="nt">-Method</span><span class="w"> </span><span class="nx">GET</span><span class="w"> </span><span class="nt">-Endpoint</span><span class="w"> </span><span class="s2">"/groups"</span><span class="w">
</span><span class="c"># Done. User has access. Auto-revokes in 30 days.</span><span class="w">
</span></code></pre></div></div>

<p>This script runs against localhost. No VPN to the production network. No “can you give me API access to the Fudo dev instance” emails. No waiting.</p>

<p>And the best part: when you’re done testing, <strong>the exact same script works against production</strong>. You just swap the URLs:</p>

<div class="language-powershell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Dev:</span><span class="w">
</span><span class="n">Connect-PAMlab</span><span class="w">  </span><span class="c"># → localhost:8443, :8444, :8445</span><span class="w">

</span><span class="c"># Prod:</span><span class="w">
</span><span class="n">Connect-PAMlab</span><span class="w"> </span><span class="nt">-ConfigFile</span><span class="w"> </span><span class="o">.</span><span class="nx">/config/production.env</span><span class="w">
</span><span class="c"># → fudo.company.com, matrix42.company.com, dc01.company.com</span><span class="w">
</span></code></pre></div></div>

<p>Same script. Same logic. Different targets.</p>

<h2 id="the-rollback-problem">The Rollback Problem</h2>

<p>Here’s something I learned the hard way: you always need a rollback plan.</p>

<p>Your onboarding script adds a user to three AD groups and then triggers a Fudo sync. The first two additions succeed. The third fails because someone renamed the group. Now you have a user with partial access: they can reach the database server but not the application server.</p>

<p>PAMlab’s mock APIs let you simulate these failures. The AD mock returns a 404 when you try to add a member to a non-existent group. Your script should catch that and undo the first two group additions. If it doesn’t, you’ll find out now — not when it happens for real.</p>

<div class="language-powershell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">try</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="n">Invoke-AD</span><span class="w"> </span><span class="nt">-Method</span><span class="w"> </span><span class="nx">POST</span><span class="w"> </span><span class="nt">-Endpoint</span><span class="w"> </span><span class="s2">"/groups/GRP-DOES-NOT-EXIST/members"</span><span class="w"> </span><span class="nt">-Body</span><span class="w"> </span><span class="p">@{</span><span class="w">
        </span><span class="nx">members</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">@(</span><span class="s2">"t.developer"</span><span class="p">)</span><span class="w">
    </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w"> </span><span class="kr">catch</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="n">Write-Host</span><span class="w"> </span><span class="s2">"Failed! Rolling back..."</span><span class="w"> </span><span class="nt">-ForegroundColor</span><span class="w"> </span><span class="nx">Red</span><span class="w">
    </span><span class="c"># Undo previous steps</span><span class="w">
    </span><span class="n">Invoke-AD</span><span class="w"> </span><span class="nt">-Method</span><span class="w"> </span><span class="nx">DELETE</span><span class="w"> </span><span class="nt">-Endpoint</span><span class="w"> </span><span class="s2">"/groups/GRP-RDP-Admins/members/t.developer"</span><span class="w">
    </span><span class="n">Invoke-AD</span><span class="w"> </span><span class="nt">-Method</span><span class="w"> </span><span class="nx">DELETE</span><span class="w"> </span><span class="nt">-Endpoint</span><span class="w"> </span><span class="s2">"/groups/GRP-DB-Operators/members/t.developer"</span><span class="w">
    
    </span><span class="c"># Create incident ticket</span><span class="w">
    </span><span class="n">Invoke-M42</span><span class="w"> </span><span class="nt">-Method</span><span class="w"> </span><span class="nx">POST</span><span class="w"> </span><span class="nt">-Endpoint</span><span class="w"> </span><span class="s2">"/tickets"</span><span class="w"> </span><span class="nt">-Body</span><span class="w"> </span><span class="p">@{</span><span class="w">
        </span><span class="nx">Subject</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"FAILED: Onboarding t.developer"</span><span class="w">
        </span><span class="nx">Priority</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="w">
        </span><span class="nx">Category</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Access Management"</span><span class="w">
    </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>Boring? Yes. Necessary? Ask the guy who had to manually fix 200 user accounts on a Monday morning.</p>
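
<p>The catch block above hard-codes which groups to undo. A more general pattern is to record an undo action for each step as it succeeds and unwind that list in reverse when something fails. The sketch below shows the pattern in Python rather than PowerShell so it runs without the PAMlab module; <code class="language-plaintext highlighter-rouge">run_with_rollback</code> and the in-memory group store are made up for illustration and are not part of PAMlab.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def run_with_rollback(steps):
    """Run (do, undo) pairs in order; on failure, undo completed steps in reverse.

    Returns None on success, or the exception that stopped the run.
    """
    completed = []
    for do, undo in steps:
        try:
            do()
        except Exception as exc:
            # Unwind newest-first so later changes are removed before earlier ones.
            for undo_step in reversed(completed):
                try:
                    undo_step()
                except Exception as undo_exc:
                    # Keep going: one failed undo must not block the rest.
                    print(f"rollback step failed: {undo_exc}")
            return exc
        completed.append(undo)
    return None


# Hypothetical stand-in for the AD mock: group name mapped to a set of members.
groups = {"GRP-RDP-Admins": set(), "GRP-DB-Operators": set()}


def add_member(group, user):
    if group not in groups:
        raise KeyError(f"group not found: {group}")  # plays the role of the 404
    groups[group].add(user)


def remove_member(group, user):
    groups[group].discard(user)


def membership_step(group, user):
    # Bind group and user here so each pair acts on its own group.
    return (lambda: add_member(group, user), lambda: remove_member(group, user))


error = run_with_rollback([
    membership_step("GRP-RDP-Admins", "t.developer"),
    membership_step("GRP-DB-Operators", "t.developer"),
    membership_step("GRP-DOES-NOT-EXIST", "t.developer"),
])
</code></pre></div></div>

<p>After the run, both groups are empty again and <code class="language-plaintext highlighter-rouge">error</code> holds the exception. That is the point where a real script would open the incident ticket.</p>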

<h2 id="whats-in-the-box">What’s In The Box</h2>

<p>The Fudo mock alone has over 70 endpoints. It’s not just CRUD — it simulates the stuff you actually need for integration testing:</p>

<p><strong>Session lifecycle</strong>: open connections, terminate sessions, pause/resume, get AI-generated session summaries. Useful when your monitoring script needs to react to suspicious sessions.</p>

<p><strong>Event stream</strong>: Server-Sent Events endpoint that pushes random events every few seconds. Your SIEM integration can subscribe and test real-time processing.</p>

<p><strong>Password rotation</strong>: Policies with rotation schedules. Trigger a rotation, check the history. See if your credential management workflow handles the new password correctly.</p>

<p><strong>Just-in-Time access</strong>: Request temporary access, approve/deny, automatic expiration. The whole workflow that JIT access vendors love to put in their slides but nobody tests end-to-end.</p>

<p>The Matrix42 mock covers tickets, assets, provisioning workflows, software catalog, and compliance reports. The AD mock has users, groups (including timed membership for JIT), OUs, and computer objects.</p>
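
<p>For the event stream, a consumer mostly needs to split the stream into events and read the <code class="language-plaintext highlighter-rouge">data:</code> lines. Here is a minimal Python sketch of that parsing step, following the standard Server-Sent Events framing (events end at a blank line, the default event name is <code class="language-plaintext highlighter-rouge">message</code>). The sample payloads and their field names are invented, since the mock’s actual event schema isn’t shown in this post.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json


def parse_sse(lines):
    """Yield (event_name, data) for each event in an iterable of SSE lines."""
    event_name, data_lines = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event; events without data are dropped.
            if data_lines:
                yield event_name, "\n".join(data_lines)
            event_name, data_lines = "message", []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # one leading space after the colon is ignored
            if field == "event":
                event_name = value
            elif field == "data":
                data_lines.append(value)


# Invented sample stream; a real client would read lines from the HTTP response.
sample = [
    ": keep-alive\n",
    "event: session_started\n",
    'data: {"user": "t.developer", "server": "db01"}\n',
    "\n",
    'data: {"note": "no event name"}\n',
    "\n",
]

events = [(name, json.loads(data)) for name, data in parse_sse(sample)]
</code></pre></div></div>

<p>Keeping the parser separate from the network code means the same function can be fed recorded lines in a unit test and live lines from the mock.</p>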

<h2 id="where-this-is-going">Where This Is Going</h2>

<p>Right now, PAMlab supports Matrix42, Active Directory, and Fudo PAM. But the architecture is pluggable. We have epics planned for:</p>

<ul>
  <li><strong>Jira Service Management</strong> — for Atlassian shops</li>
  <li><strong>ServiceNow</strong> — the 800-pound gorilla of ITSM</li>
  <li><strong>BMC Remedy</strong> — still very common in healthcare and government</li>
</ul>

<p>The bigger vision is a <strong>pipeline engine</strong> where you define your provisioning workflow as YAML and PAMlab executes it step by step against whatever combination of systems your organization uses. Matrix42 → AD → Fudo, or JSM → Azure AD → CyberArk, or ServiceNow → LDAP → BeyondTrust. Same engine, different connectors.</p>

<p>But even without the fancy stuff, just having three mock APIs on localhost that you can script against — that alone saves a stupid amount of time.</p>

<h2 id="try-it">Try It</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/BenediktSchackenberg/PAMlab.git
<span class="nb">cd </span>PAMlab
docker-compose up
</code></pre></div></div>

<p>The repo has ready-to-use PowerShell scripts for seven common scenarios: onboarding, offboarding, role changes, JIT access, emergency revocation, password rotation, and audit reports.</p>

<p>Fork it, break it, add your own connectors. PRs welcome (signed commits required — we’re a security project after all).</p>

<p>→ <a href="https://github.com/BenediktSchackenberg/PAMlab">github.com/BenediktSchackenberg/PAMlab</a></p>]]></content><author><name>Benedikt Schackenberg</name></author><category term="pamlab" /><category term="pam" /><category term="fudo" /><category term="matrix42" /><category term="active-directory" /><category term="devops" /><category term="security" /><category term="automation" /><summary type="html"><![CDATA[Here’s a scenario every sysadmin knows: Your boss asks you to automate the access provisioning workflow. New employee joins, gets added to the right AD groups, Fudo PAM picks up the change, access to production servers is granted — all triggered by a ticket in Matrix42.]]></summary></entry><entry><title type="html">DBA Dash WebView: Two Weeks In — What Changed</title><link href="https://schackenberg.com/2026/03/13/dba-dash-webview-update.html" rel="alternate" type="text/html" title="DBA Dash WebView: Two Weeks In — What Changed" /><published>2026-03-13T00:00:00+01:00</published><updated>2026-03-13T00:00:00+01:00</updated><id>https://schackenberg.com/2026/03/13/dba-dash-webview-update</id><content type="html" xml:base="https://schackenberg.com/2026/03/13/dba-dash-webview-update.html"><![CDATA[<p>It’s been about a week since I <a href="/2026-03-05-dba-dash-webview">wrote about DBA Dash WebView</a> — the web frontend we built on top of <a href="https://github.com/trimble-oss/dba-dash">DBA Dash</a>. Back then it was already usable with ~40 pages covering most DBA workflows. Since then, things escalated a bit.</p>

<p>Here’s what happened.</p>

<h2 id="the-backup-ampel">The Backup Ampel</h2>

<p>This one came from a real need. We have ~200 SQL Servers and a bunch of AlwaysOn Availability Groups. The question “are all our backups OK?” sounds simple, but it’s surprisingly hard to answer when you have AG secondaries, Simple Recovery databases, and dozens of instances.</p>

<p>So we built a traffic-light report. Every instance gets a status:</p>

<ul>
  <li><strong>Green</strong> — Full backup within 24 hours, log backup within 15 minutes</li>
  <li><strong>Yellow</strong> — Full within 48 hours, log within 30 minutes</li>
  <li><strong>Red</strong> — Everything else</li>
</ul>

<p>Sounds easy enough, right? It wasn’t. The first version showed everything as red. Turns out when you run <code class="language-plaintext highlighter-rouge">MIN(latest_log_backup)</code> across all databases in an instance, you’re also including Simple Recovery databases — which by design never have log backups. NULL minimum = no backup = red. Every single instance.</p>

<p>The fix: a separate CTE that filters <code class="language-plaintext highlighter-rouge">recovery_model IN (1, 2)</code> before evaluating log backup compliance. Simple Recovery databases show “N/A” instead of a false alarm.</p>

<p>Then there was the AG problem. AlwaysOn secondaries don’t run backups — the preferred replica does. So a secondary with no recent backup isn’t a problem, it’s expected behavior. We now JOIN against <code class="language-plaintext highlighter-rouge">dbo.DatabasesHADR</code> with <code class="language-plaintext highlighter-rouge">is_local = 1</code> and exclude <code class="language-plaintext highlighter-rouge">is_primary_replica = 0</code> from the evaluation. Secondaries show “via Primary” in the detail view instead of angry red timestamps.</p>

<p>The result is a page where you can actually trust the colors. Green means green. Red means something needs attention.</p>
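
<p>The rules above fit in a few lines. The real report does this in SQL; the sketch below only restates the same thresholds and exclusions in Python so the logic is easy to check. The function and parameter names are hypothetical, and the recovery model codes are the ones DBA Dash stores (1=FULL, 2=BULK_LOGGED, 3=SIMPLE).</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from bisect import bisect_left
from datetime import timedelta

COLORS = ["green", "yellow", "red"]
# Green and yellow limits from the list above; anything older is red.
FULL_LIMITS = [timedelta(hours=24), timedelta(hours=48)]
LOG_LIMITS = [timedelta(minutes=15), timedelta(minutes=30)]


def level(age, limits):
    """0 = green, 1 = yellow, 2 = red; a missing backup (None) counts as red."""
    if age is None:
        return len(limits)
    # Index of the first limit that is not smaller than the backup age.
    return bisect_left(limits, age)


def backup_status(full_age, log_age, recovery_model, is_ag_secondary=False):
    """Classify one database from the age of its latest full and log backup."""
    if is_ag_secondary:
        return "via Primary"  # secondaries are excluded from the evaluation
    worst = level(full_age, FULL_LIMITS)
    if recovery_model in (1, 2):
        # Only FULL and BULK_LOGGED databases are expected to have log backups.
        worst = max(worst, level(log_age, LOG_LIMITS))
    return COLORS[worst]
</code></pre></div></div>

<p>With this, a SIMPLE database with a three-hour-old full backup and no log backup comes out green, while the same database in FULL recovery comes out red.</p>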

<h2 id="sql-monitor">SQL Monitor</h2>

<p>I’ve always liked the card-based approach that tools like Redgate SQL Monitor use — you see your entire fleet as a grid of cards, each showing the instance name, health status, and current CPU. At a glance you know what’s going on.</p>

<p>Our version pulls data from DBA Dash’s <code class="language-plaintext highlighter-rouge">Summary_Get</code> stored procedure and <code class="language-plaintext highlighter-rouge">dbo.CPU</code> table. Each card shows health indicators based on 7 status keys (full backup, log backup, DBCC, drives, jobs, AG, corruption). Status values from DBA Dash are 1=OK, 2=Warning, 3=N/A, 4=Critical. We learned the hard way that <code class="language-plaintext highlighter-rouge">&gt;= 2</code> catches N/A as a warning — the correct check is <code class="language-plaintext highlighter-rouge">== 2 || == 4</code>.</p>
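
<p>To make the status check concrete, here is a small Python sketch of the explicit test and of rolling several checks up into one card status. The numeric codes are the ones listed above; the key names in the sample card are placeholders, not the actual column names returned by <code class="language-plaintext highlighter-rouge">Summary_Get</code>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Status codes as listed in this post.
OK, WARNING, NOT_APPLICABLE, CRITICAL = 1, 2, 3, 4
PROBLEM_STATUSES = {WARNING, CRITICAL}


def needs_attention(status):
    """Explicit membership test; a range check would also match N/A (3)."""
    return status in PROBLEM_STATUSES


def card_health(statuses):
    """Roll the per-check statuses of one instance up into a single card status."""
    values = set(statuses.values())
    if CRITICAL in values:
        return CRITICAL
    if WARNING in values:
        return WARNING
    return OK  # only OK and N/A remain


# Placeholder keys for illustration.
card = {"FullBackup": 1, "LogBackup": 3, "DBCC": 2, "Drives": 1}
flagged = sorted(name for name, status in card.items() if needs_attention(status))
</code></pre></div></div>

<p>Listing the problem values in a set keeps the intent visible and avoids relying on the numeric order of the codes.</p>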

<p>There’s also an alert sidebar that shows the latest collection errors and failed jobs in real time. Click any card to jump straight to the instance detail page.</p>

<h2 id="instance-detail-rebuilt">Instance Detail, Rebuilt</h2>

<p>The old instance detail page had an Overview tab that was basically a worse version of the summary. We ripped it out and made Performance the default tab instead. When I click on an instance, I want to see the CPU chart — not a list of metadata I already saw in the card.</p>

<p>The header now shows status badges inline (compact, always visible), with N/A statuses hidden entirely. CPU KPI cards sit above the chart: current, 24h average, 24h peak. The Databases tab got AG Role and Sync State columns. The Backups tab is fully AG-aware.</p>

<h2 id="alerts-that-actually-help">Alerts That Actually Help</h2>

<p>The alerts page was… not great. It was dumping raw JSON from <code class="language-plaintext highlighter-rouge">dbo.Alerts</code> into a list. If you’ve ever looked at a wall of <code class="language-plaintext highlighter-rouge">{"InstanceID":42,"ErrorDate":"2026-03-12T...",...}</code> and tried to figure out what went wrong, you know the pain.</p>

<p>The new version combines two data sources: <code class="language-plaintext highlighter-rouge">CollectionErrorLog</code> (actual collection errors) and <code class="language-plaintext highlighter-rouge">JobHistory</code> (failed jobs from the last 48 hours). Each alert shows the instance name, a readable error message, severity (guessed from keywords — not perfect but useful), and a relative timestamp. There’s a detail panel on the right where you can read the full error message and click through to the instance.</p>
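
<p>Keyword-based severity guessing can be as simple as an ordered list of keyword groups. The Python sketch below illustrates the idea; the keywords themselves are invented for this example and are not the list the WebView actually uses.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical keyword lists, checked in order from most to least severe.
SEVERITY_KEYWORDS = [
    ("Critical", ("corrupt", "login failed", "timeout", "deadlock")),
    ("Warning", ("retry", "skipped", "warning")),
]


def guess_severity(message):
    """Return the first severity whose keywords appear in the message."""
    text = message.lower()
    for severity, keywords in SEVERITY_KEYWORDS:
        if any(keyword in text for keyword in keywords):
            return severity
    return "Info"


def severity_counts(messages):
    """Counts per severity, for example to feed a filter strip above the list."""
    counts = {"Critical": 0, "Warning": 0, "Info": 0}
    for message in messages:
        counts[guess_severity(message)] += 1
    return counts
</code></pre></div></div>

<p>Checking the most severe group first means a message that mentions both a timeout and a retry is counted as Critical, which errs on the side of showing too much rather than too little.</p>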

<p>The severity filter strip at the top doubles as a KPI row — you immediately see “12 Critical, 3 Warning, 8 Info” and can click to filter. There’s also a “per server” breakdown in the sidebar showing which instances are generating the most noise.</p>

<h2 id="ag-page-search">AG Page Search</h2>

<p>Small feature, big impact. The AlwaysOn Availability Groups page now has a search box that filters across server names, AG names, and database names. When you have 10+ clusters with dozens of databases each, being able to type “ASES” and immediately see only the matching AG with its databases expanded — that’s just nice to have.</p>
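
<p>The filter itself is a case-insensitive substring match over three fields. A rough Python sketch of the logic, assuming each AG is a record with a server name, an AG name, and a list of database names (the field names and sample data are invented; the real page does this in the frontend):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def filter_groups(groups, query):
    """Keep AGs where the query matches the server, the AG, or any database name.

    Matched groups are flagged as expanded. If only database names matched,
    just those databases are listed.
    """
    needle = query.strip().lower()
    if not needle:
        return [dict(group, expanded=False) for group in groups]
    result = []
    for group in groups:
        db_hits = [db for db in group["databases"] if needle in db.lower()]
        name_hit = needle in group["server"].lower() or needle in group["ag"].lower()
        if name_hit or db_hits:
            shown = group["databases"] if name_hit else db_hits
            result.append(dict(group, databases=shown, expanded=True))
    return result


# Invented sample data.
groups = [
    {"server": "SQLPROD01", "ag": "AG_ASES", "databases": ["ASES_Core", "ASES_Log"]},
    {"server": "SQLPROD02", "ag": "AG_CRM", "databases": ["CRM", "CRM_Archive"]},
]
matches = filter_groups(groups, "ases")
</code></pre></div></div>

<p>An empty query returns every group collapsed, so clearing the search box restores the normal view.</p>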

<h2 id="the-dba-dash-schema">The DBA Dash Schema</h2>

<p>Working directly against the DBA Dash database taught us a lot about the schema. A few things that might save someone else some debugging time:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">dbo.Databases.recovery_model</code> is a TINYINT (1=FULL, 2=BULK_LOGGED, 3=SIMPLE), not the <code class="language-plaintext highlighter-rouge">_desc</code> NVARCHAR variant</li>
  <li><code class="language-plaintext highlighter-rouge">dbo.CPU</code> has <code class="language-plaintext highlighter-rouge">SQLProcessCPU</code> and <code class="language-plaintext highlighter-rouge">SystemIdleCPU</code> but no <code class="language-plaintext highlighter-rouge">SystemCPU</code> column. You calculate the CPU used by everything other than SQL Server as <code class="language-plaintext highlighter-rouge">100 - SystemIdleCPU - SQLProcessCPU</code>; total utilization is <code class="language-plaintext highlighter-rouge">100 - SystemIdleCPU</code></li>
  <li>The table is called <code class="language-plaintext highlighter-rouge">dbo.DBIOStats</code>, not <code class="language-plaintext highlighter-rouge">dbo.IOStats</code></li>
  <li><code class="language-plaintext highlighter-rouge">dbo.AvailabilityGroups</code> uses <code class="language-plaintext highlighter-rouge">name</code>, not <code class="language-plaintext highlighter-rouge">group_name</code></li>
  <li><code class="language-plaintext highlighter-rouge">dbo.AvailabilityReplicas</code> (not <code class="language-plaintext highlighter-rouge">AvailabilityGroupReplicas</code>) doesn’t store <code class="language-plaintext highlighter-rouge">role_desc</code></li>
  <li>Status values across Summary_Get: 1=OK, 2=Warning, 3=N/A, 4=Critical — and 3 should almost never trigger an alert</li>
</ul>

<p>We verified all of this against the <a href="https://github.com/trimble-oss/dba-dash">DBA Dash source on GitHub</a>. When in doubt, read the source — the schema definitions in the repo are the ground truth.</p>

<h2 id="whats-next">What’s Next</h2>

<p>There’s still plenty to do. About 20 pages could benefit from smarter auto-refresh (delta queries instead of full reloads). We want to add CSV/Excel export to every table. SignalR for real-time push updates instead of polling. PDF reports on a schedule. Mobile-optimized views for the on-call DBA checking their phone at 2 AM.</p>

<p>But honestly, even right now it’s genuinely useful. We use it daily to check on our fleet, and the NOC display runs the SQL Monitor page full-time.</p>

<h2 id="thanks">Thanks</h2>

<p>Huge thanks to the <a href="https://github.com/trimble-oss/dba-dash">DBA Dash</a> team and <a href="https://github.com/DavidWiseman">David Wiseman</a> for building such a solid foundation. The data collection, the stored procedures, the schema — it’s all well thought out and consistent. Building a web frontend on top of it was remarkably smooth precisely because the underlying tool is so well engineered.</p>

<p>If you’re running SQL Server at any scale and not using DBA Dash, seriously, go check it out at <a href="https://dbadash.com">dbadash.com</a>.</p>

<p>DBA Dash WebView is open source under MIT: <strong><a href="https://github.com/BenediktSchackenberg/dbadashwebview">github.com/BenediktSchackenberg/dbadashwebview</a></strong></p>

<hr />

<p><em>Previously: <a href="/2026-03-05-dba-dash-webview">We Built a Web UI for DBA Dash — and It’s Free</a></em></p>]]></content><author><name>Benedikt Schackenberg</name></author><category term="sql-server" /><category term="dba-dash" /><category term="monitoring" /><category term="open-source" /><category term="webdev" /><category term="dotnet" /><category term="react" /><summary type="html"><![CDATA[It’s been about a week since I wrote about DBA Dash WebView — the web frontend we built on top of DBA Dash. Back then it was already usable with ~40 pages covering most DBA workflows. Since then, things escalated a bit.]]></summary></entry></feed>