
Ink Cloud

Mar 2026

Cloud purpose-built for agents that need to spin up isolated compute on demand

https://ml.ink

Problem

Building with AI is trivial now, but deployment still requires going to Vercel / Railway / AWS / Supabase to get credentials, adding secrets, connecting GitHub, and updating DNS records manually.

What if agents could do all of this on a cloud designed from first principles to be used by agents? That is Ink.

Motivation

I decided to build Ink because I was shipping multiple new projects a week, each one requiring a backend, frontend, database, and additional services. I wanted an agent to be able to debug, read deployment and runtime logs, check CPU and memory metrics, fix issues, and iterate until everything worked. I also wanted hands-on experience with bare-metal infrastructure, where reducing cloud CPU cost by 80% looked achievable.

The goal was simple: build an MCP server connected to my Ink Cloud, so that after coding a new project I could simply ask my agent to deploy it and get a fully working URL back. It should work with GitHub, container (Docker) images, and databases (Postgres, MySQL, etc.).

Approach

I got the basic implementation with stateless containers up pretty quickly - using Railpack to build Docker images directly from a GitHub repository - but I knew that for Ink to be truly useful it had to handle stateful services with volumes, such as databases. I spoke with a few people whose judgement I trust and ended up picking Ceph for block storage. It's quite complex - usually a Ceph cluster has its own dedicated DevOps team - but I figured that with AI it should be manageable. The setup is three nodes dedicated to Ceph for replication, each with four physical disks. And here we are: databases work, services mount volumes - it's great.

At this point my team started using it, and I realised we needed many more features to round out the platform and give it a good developer experience. I added S3 support (AWS, GCS, Azure, and S3-compatible providers) for bucket mounts and backups, built the API for ephemeral sandboxes, and added a web terminal so you can SSH straight into a container from the browser. Templates followed soon after - one-click (or, for an agent, one CLI call) deployments of common services like Postgres, Temporal, Grafana, MediaWiki, OpenClaw, etc.

In the end Ink worked beyond my expectations, but the learning phase consumed me. I spent weeks talking to Claude and ChatGPT almost non-stop about design decisions. It was the most complex project I've taken on, but incredibly rewarding.

What agents can do

  • configure subdomains on a DNS zone
  • deploy a Postgres database
  • configure env vars (and set env files without exposing their contents to the LLM)
  • wire internal networking between services
  • debug build and runtime logs
  • inspect metrics (CPU / memory / network utilisation)
  • scale resources up or down to optimise cost
  • deploy a full-stack codebase in most languages
  • mount S3 buckets and snapshot volumes for backups
  • spin up ephemeral sandboxes
  • SSH into a container
  • one-shot deploy templates (Postgres, Temporal, Grafana, etc.)

Architecture

Cloudflare DNS + edge

Bare-metal cluster

  control plane · k3s + etcd HA
    ctrl-1
    ctrl-2
    ctrl-3

  ingress · Traefik
    tcp-edge-1
    tcp-edge-2

  compute
    run-1 · customer pods (gVisor)
    run-2 · customer pods (gVisor)
    build-1 · BuildKit + registry
    ops-1 · Grafana, Mimir, ClickHouse, git

  dns
    powerdns-1

  storage · Rook-Ceph · 3× replication
    volume-1 · 4 OSDs · 4 × 7.68 TB NVMe
    volume-2 · 4 OSDs · 4 × 7.68 TB NVMe
    volume-3 · 4 OSDs · 4 × 7.68 TB NVMe
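
As a rough sanity check on the storage tier, the raw and usable capacity follow directly from the numbers above: three nodes, four 7.68 TB disks each, 3× replication. The sketch below is only that arithmetic; the type and field names are made up for illustration, and it ignores Ceph's own metadata overhead and the free-space headroom a real cluster keeps.

```go
package main

import "fmt"

// CephPool describes a replicated storage layout.
// The type and its fields are illustrative, not part of Ceph or Rook.
type CephPool struct {
	Nodes        int     // storage nodes (volume-1..3)
	DisksPerNode int     // one OSD per physical disk
	DiskTB       float64 // capacity of a single disk, in TB
	Replicas     int     // replication factor
}

// RawTB is the total capacity of all disks before replication.
func (p CephPool) RawTB() float64 {
	return float64(p.Nodes*p.DisksPerNode) * p.DiskTB
}

// UsableTB divides raw capacity by the replication factor.
// It ignores metadata overhead and free-space headroom.
func (p CephPool) UsableTB() float64 {
	return p.RawTB() / float64(p.Replicas)
}

func main() {
	pool := CephPool{Nodes: 3, DisksPerNode: 4, DiskTB: 7.68, Replicas: 3}
	fmt.Printf("OSDs:   %d\n", pool.Nodes*pool.DisksPerNode) // 12
	fmt.Printf("raw:    %.2f TB\n", pool.RawTB())            // 92.16 TB
	fmt.Printf("usable: %.2f TB\n", pool.UsableTB())         // 30.72 TB
}
```

With three replicas on three nodes, and assuming replicas are spread one per host, each node ends up holding a full copy of the data, so usable capacity is roughly that of a single node.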

Learnings

  • AI makes bare metal as easy to work with as cloud. I believe a shift from cloud to bare metal is coming, driven by cost optimisation. Ultimately, an agent doesn't care about the provider - it cares about its optimisation function f(cost, time to manage, geography, reliability), which will be heavily weighted by cost but remains problem-dependent.
  • Building a PaaS / deployment platform requires an enormous number of technical and product decisions. From "how do multiple users share GitHub repositories within a Team?" (Render, Railway, and Vercel each handle this slightly differently), to "how much log volume should a user / workspace / service be allowed to store before throttling?", to "should billing be based on reserved capacity or real-time utilisation?", to "what percentage of CPU and memory can we overprovision in a cluster?" - to name just a few.
  • gVisor adds real but manageable overhead. Every customer pod runs sandboxed by gVisor, which trades a bit of performance for a strong syscall-level isolation boundary. On the run nodes I measured ~35-46 Mi of memory overhead per pod (~45-60 Mi for small web services). CPU overhead is workload-dependent: a sandboxed syscall costs 2-10× what it does when hitting the host kernel directly, but for most real services this dilutes down to roughly 15-40% end-to-end, because total time is dominated by TLS, business logic, and downstream I/O. The worst case is fork-heavy workloads (make-driven builds, shell-out-per-file scripts), where fork / execve / wait are some of the slowest syscalls under gVisor. Google Cloud Run makes the same trade-off: its first-generation execution environment is built on gVisor.
  • The hardest part of building Ink was networking. For example, I wanted it to be possible to deploy services across one or more regions. In that case you need to resolve a user's domain to the correct cluster - the one that hosts the service and is closest to the user. There are many ways to approach this, both from a product and a technical perspective. I ended up using Cloudflare, but I was fascinated to learn about anycast networking.
  • Interesting fact - the Public Suffix List. When unrelated users share subdomains of the same parent domain, browsers' default cookie and origin rules can let them interfere with each other. The Public Suffix List (PSL), bundled into browsers, marks such parent domains as "public suffixes" so each subdomain is treated as an independent registrable domain.
  • Interesting fact - TLS runs on trust. Browsers blindly trust a list of certificate authorities (CAs) baked into them, and any trusted CA can technically issue a certificate for any domain - including google.com. A malicious or compromised CA could issue a fraudulent google.com certificate, allowing attackers to impersonate the real site and intercept traffic from users who get redirected to it. To keep CAs honest, browser vendors such as Google and Mozilla monitor CA behaviour (via Certificate Transparency logs, audits, and incident reports), and if serious misissuance or malicious behaviour is observed they distrust the CA in their browsers. Once that happens, certificates issued by that CA stop being trusted (sometimes on a phased schedule, as with Symantec), so its customers' websites start failing in browsers until they switch CAs, and the CA's business is effectively dead. This has happened to DigiNotar (2011, hacked and used to issue fake Google certs - bankrupt within weeks), WoSign / StartCom (2016, back-dated certs and other misconduct - distrusted, with StartCom later shutting down), and Symantec (2017, years of misissuance - distrusted and forced to sell off its CA business).
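
The gVisor numbers in the list above can be sanity-checked with a back-of-the-envelope model: if a fraction of a request's native time is spent in syscalls and each syscall gets N times slower, the end-to-end overhead is that fraction times (N - 1). This is a simplified sketch, not how the measurements were taken; the function names and the syscall fractions are assumptions chosen for illustration. Only the 2-10× slowdown range and the 35-46 Mi per-pod figure come from the text.

```go
package main

import "fmt"

// EndToEndOverhead models the extra time a workload takes under a sandbox.
// syscallFraction is the share of native time spent in syscalls (0..1);
// slowdown is how many times slower each syscall becomes (e.g. 2-10).
// The result is a fraction: 0.15 means 15% slower end-to-end.
func EndToEndOverhead(syscallFraction, slowdown float64) float64 {
	return syscallFraction * (slowdown - 1)
}

// SandboxMemoryMi returns the total sandbox memory overhead on a node, in Mi.
func SandboxMemoryMi(pods, perPodMi int) int {
	return pods * perPodMi
}

func main() {
	// Hypothetical workloads: the fractions are assumptions, not measurements.
	cases := []struct {
		name     string
		fraction float64
		slowdown float64
	}{
		{"web service, cheap syscalls", 0.05, 4},
		{"web service, costly syscalls", 0.05, 9},
		{"fork-heavy build", 0.30, 10},
	}
	for _, c := range cases {
		pct := 100 * EndToEndOverhead(c.fraction, c.slowdown)
		fmt.Printf("%-30s +%.0f%%\n", c.name, pct)
	}

	// 35-46 Mi per pod is from the text; 100 pods per node is assumed.
	fmt.Printf("100 pods: %d-%d Mi of sandbox overhead\n",
		SandboxMemoryMi(100, 35), SandboxMemoryMi(100, 46))
}
```

Under this model a service that spends about 5% of its time in syscalls lands in the 15-40% band across most of the 2-10× range, while a workload that spends a third of its time forking and exec-ing can slow down by several times, which matches the shape of the observations above.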

Technology

Go, gRPC, GraphQL, Postgres, ClickHouse, Temporal, NATS, k3s, FluxCD, Ceph, Traefik, Cloudflare, Ansible, Nix, BuildKit, Railpack, Grafana, Alloy, Prometheus, Mimir, MCP, Stripe, web3 stablecoin payments.
