MailArchive v2 — Self-Hosted Gmail Archiving on Kubernetes

GitHub - jdtate101/MailArchive: A simple offline storage app for GMail on K8S, designed to archive your emails

Sync every Gmail label to .eml files on a persistent volume, search across the lot with full-text indexing, and export the whole archive whenever you need it — all from a three-pane web UI that feels like a real email client.


Google Takeout gives you a one-time dump. IMAP clients give you a live view but nothing persisted. MailArchive sits in between: it continuously syncs your Gmail to plain .eml files on a Kubernetes PVC, builds a searchable Whoosh index across everything, and wraps it in a three-pane web UI with virtual scrolling, attachment downloads, and bulk export. Your email lives on your own storage, in an open format, importable into any client.

Version 2 adds full-text search, attachment downloads, a persistent disk-backed header cache, and bulk zip export — the pieces that turn a sync daemon into something you'd actually use day-to-day.


How it works

The backend is a FastAPI app running an IMAP sync engine, a Whoosh indexer, and an APScheduler task that fires every 24 hours. The frontend is a React SPA served by nginx. Both run as separate deployments on Kubernetes, sharing a 50Gi PVC where everything lands: the .eml files, the search index, the sync state, the header caches, and any export zips in progress.

Browser
  │
  ▼
React SPA (nginx)
  │
  ▼
FastAPI backend
  ├── IMAP sync engine (incremental, per-folder)
  ├── Whoosh full-text index
  └── APScheduler (24-hour cycle)
  │
  ▼
PVC 50Gi (/mail/)
  ├── .sync_state.json
  ├── .index/               ← Whoosh index
  ├── INBOX/
  │     ├── .cache.json     ← persistent header cache
  │     └── *.eml
  ├── Work/
  └── Personal/

The three-pane layout gives you a folder list on the left, an email list top-right (with virtual scrolling for large folders), and the rendered email detail below. HTML emails render in a sandboxed iframe; plaintext falls back gracefully.


Sync behaviour

On first run, the backend detects no .sync_state.json and kicks off a full sync of all Gmail labels automatically. For large mailboxes this takes a while — watch the logs. After that, only emails with UIDs higher than the last seen UID per folder are fetched. Progress is saved after each folder completes, so a pod restart mid-sync doesn't lose work.
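
The incremental logic above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the JSON shape of .sync_state.json (a flat folder-to-UID map) and every function name here are assumptions.

```python
import json
from pathlib import Path

STATE_FILE = ".sync_state.json"  # file name from the article; the JSON shape is assumed


def load_state(mail_root: Path) -> dict:
    """Return {folder: last_seen_uid}; an empty dict means first run, sync everything."""
    path = mail_root / STATE_FILE
    return json.loads(path.read_text()) if path.exists() else {}


def uid_search_range(state: dict, folder: str) -> str:
    """IMAP UID range covering everything newer than the last seen UID."""
    return f"{state.get(folder, 0) + 1}:*"


def new_uids(state: dict, folder: str, server_uids: list) -> list:
    # A range such as "501:*" still matches the newest message when nothing
    # newer exists, so filter explicitly rather than trusting the server reply.
    last = state.get(folder, 0)
    return sorted(uid for uid in server_uids if uid > last)


def save_folder_progress(mail_root: Path, state: dict, folder: str, fetched: list) -> None:
    """Persist progress once a folder completes, so a restart resumes from here."""
    if fetched:
        state[folder] = max(fetched)
    tmp = mail_root / (STATE_FILE + ".tmp")
    tmp.write_text(json.dumps(state, indent=2))
    tmp.replace(mail_root / STATE_FILE)  # rename is atomic on the same filesystem
```

Writing to a temporary file and renaming it means a pod killed mid-write leaves the previous state file intact rather than a truncated one.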

You can also trigger syncs manually: Sync Now in the top bar hits POST /api/sync for a full run, or hover over any folder in the sidebar and click the ⟳ icon to sync just that folder immediately. The UI polls status every 3 seconds and updates folder counts as labels complete, so you can browse normally while a sync runs in the background.
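
The same trigger-and-poll flow can be scripted outside the UI. The sketch below assumes the backend is reachable on localhost:8000 (for example through a port-forward), that both endpoints return JSON, and that the status payload carries a boolean field named "running"; that field name is a guess, so check the real response before relying on it.

```python
import json
import time
import urllib.request
from typing import Callable

BASE_URL = "http://localhost:8000"  # assumes a kubectl port-forward to the backend


def http_json(method: str, path: str) -> dict:
    """Call the backend and decode a JSON response body."""
    req = urllib.request.Request(BASE_URL + path, method=method)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def sync_and_wait(
    call: Callable[[str, str], dict] = http_json,
    interval: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
    max_polls: int = 1200,
) -> dict:
    """Start a full sync, then poll the status endpoint until it reports idle."""
    call("POST", "/api/sync")
    for _ in range(max_polls):
        status = call("GET", "/api/status")
        if not status.get("running"):  # "running" is an assumed field name
            return status
        sleep(interval)
    raise TimeoutError("sync still running after max_polls status checks")
```

The call and sleep parameters exist only so the loop can be exercised without a live backend; in normal use sync_and_wait() needs no arguments.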

To force a full re-sync from scratch, delete /mail/.sync_state.json from the PVC and restart the pod.


Search

Search is powered by a Whoosh index stored at /mail/.index/ on the PVC. On first startup a background thread walks all existing .eml files and indexes anything not already in the index; a progress bar in the top bar shows percentage complete while this runs. New emails are indexed immediately as they're downloaded during sync. The index persists across pod restarts, so only missing entries are indexed on later startups.

Search covers subject (2x boost), from (1.5x boost), to, and the full email body. Whoosh query syntax is supported:

from:amazon receipt
subject:"order confirmation"
holiday OR travel
"exact phrase"

Results appear inline in the email list pane with source folder, date, from, subject, and a body snippet. Hit Escape or the ✕ button to return to normal folder browsing.
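
When calling the search endpoint from a script rather than the UI, the query needs percent-encoding because Whoosh syntax uses colons, quotes, and spaces. A small sketch, assuming the backend is reachable at a base URL of your choosing (localhost:8000 here is just the port-forward address); search_url is a hypothetical helper name.

```python
from urllib.parse import urlencode


def search_url(base_url: str, query: str) -> str:
    """Build a GET /api/search URL with the query safely percent-encoded."""
    params = urlencode({"q": query})
    return base_url.rstrip("/") + "/api/search?" + params


for q in ('from:amazon receipt', 'subject:"order confirmation"', "holiday OR travel"):
    print(search_url("http://localhost:8000", q))
```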


Caching

Each folder keeps a .cache.json alongside its .eml files containing the pre-sorted header list used to populate the email list pane. It's built on first access, then loaded from disk on all subsequent accesses — including after pod restarts. Sync invalidates it automatically when new emails arrive. This means folders with thousands of emails open instantly after the first visit, without scanning .eml headers every time.
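
A build-once, load-from-disk cache of this kind might look like the following. It is an illustrative sketch using only the standard library: the fields stored per email, the newest-first sort, and the function names are assumptions, not the project's actual cache format.

```python
import json
from email import policy
from email.parser import BytesHeaderParser
from email.utils import parsedate_to_datetime
from pathlib import Path

CACHE_NAME = ".cache.json"  # file name from the article; its contents are assumed


def _sort_key(date_header: str) -> float:
    """Timestamp for sorting; unparseable dates sort last."""
    try:
        return parsedate_to_datetime(date_header).timestamp()
    except (TypeError, ValueError):
        return 0.0


def build_headers(folder: Path) -> list:
    """Parse only the headers of every .eml and return them newest first."""
    parser = BytesHeaderParser(policy=policy.default)
    rows = []
    for eml in folder.glob("*.eml"):
        msg = parser.parsebytes(eml.read_bytes())
        rows.append({
            "filename": eml.name,
            "subject": str(msg.get("Subject", "")),
            "from": str(msg.get("From", "")),
            "date": str(msg.get("Date", "")),
        })
    rows.sort(key=lambda r: _sort_key(r["date"]), reverse=True)
    return rows


def load_headers(folder: Path) -> list:
    """Serve the cached list if present; otherwise build it and write it to disk."""
    cache = folder / CACHE_NAME
    if cache.exists():
        return json.loads(cache.read_text())
    rows = build_headers(folder)
    cache.write_text(json.dumps(rows))
    return rows


def invalidate(folder: Path) -> None:
    """Called by sync after new mail lands, so the next access rebuilds the list."""
    (folder / CACHE_NAME).unlink(missing_ok=True)
```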


Bulk export

The export button has three states:

  1. Export All — click to start building the zip on the PVC in the background
  2. Preparing X% — progress bar polls every 2 seconds while the zip builds
  3. Download Ready — click to stream the completed zip to your browser

The zip preserves the folder structure and is stored as a temp file on the PVC until the next export build clears it. You can close the browser and come back — the download link stays valid.

mailarchive-export-20260407-120000.zip
  INBOX/
    1_abc123.eml
  Work/
    ...
  Personal/
    ...
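
Producing a zip with this layout takes little more than the standard zipfile module. The sketch below is one way to do it, not the project's implementation: it assumes hidden entries such as .index/ and .cache.json should be left out, and the on_progress callback is a hypothetical hook standing in for whatever feeds the progress bar.

```python
import zipfile
from pathlib import Path


def build_export(mail_root: Path, zip_path: Path, on_progress=None) -> int:
    """Zip every .eml under mail_root, keeping folder paths. Returns the file count."""
    files = [
        p for p in sorted(mail_root.rglob("*.eml"))
        # Skip anything inside a dot-directory such as .index/
        if not any(part.startswith(".") for part in p.relative_to(mail_root).parts)
    ]
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for done, eml in enumerate(files, start=1):
            # arcname keeps "INBOX/1_abc123.eml" rather than the absolute path
            zf.write(eml, arcname=eml.relative_to(mail_root).as_posix())
            if on_progress is not None:
                on_progress(done, len(files))
    return len(files)
```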

This layout imports directly into Thunderbird (via ImportExportTools NG), Apple Mail, or any other mail client that can import .eml files. For migrating to a new provider, a short Python script using imaplib's IMAP4.append() can push each .eml back into the correct folder on the destination server.
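
A migration script along those lines could look like this. It is a sketch under several assumptions: the export has been unzipped so that each top-level directory is one mailbox, folder names contain no double quotes or backslashes, and the destination server accepts the same names. push_archive and internal_date are hypothetical helper names.

```python
import imaplib
import time
from email.parser import BytesHeaderParser
from email.utils import parsedate_to_datetime
from pathlib import Path


def internal_date(raw: bytes):
    """Use the message's own Date header when it is usable, else the current time."""
    date_header = str(BytesHeaderParser().parsebytes(raw).get("Date", ""))
    try:
        parsed = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        return time.time()
    # imaplib rejects naive datetimes, so fall back when no timezone was given.
    return parsed if parsed.tzinfo is not None else time.time()


def push_archive(conn: imaplib.IMAP4, export_root: Path) -> int:
    """Append every .eml under export_root to a mailbox named after its folder."""
    pushed = 0
    for folder in sorted(p for p in export_root.iterdir() if p.is_dir()):
        mailbox = f'"{folder.name}"'  # quoted so names with spaces survive
        conn.create(mailbox)  # a NO reply (mailbox already exists) is harmless here
        for eml in sorted(folder.glob("*.eml")):
            raw = eml.read_bytes()
            typ, _ = conn.append(mailbox, None, internal_date(raw), raw)
            if typ != "OK":
                raise RuntimeError(f"APPEND failed for {eml}")
            pushed += 1
    return pushed
```

To use it, open an imaplib.IMAP4_SSL connection to the destination host, call login() with that account's credentials, pass the connection and the unzipped export directory to push_archive(), then call logout().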


Gmail setup

You need a Gmail App Password before deploying — standard IMAP with your Google password won't work:

  1. Google Account → Security → enable 2-Step Verification
  2. Security → App passwords → create one named "MailArchive"
  3. Copy the 16-character password

Deployment

1. Build and push images

# Backend
cd backend/
docker build -t your-registry/mailarchive-backend:latest .
docker push your-registry/mailarchive-backend:latest

# Frontend
cd ../frontend/
docker build -t your-registry/mailarchive-frontend:latest .
docker push your-registry/mailarchive-frontend:latest

2. Configure credentials

Edit k8s/secret.yaml:

stringData:
  IMAP_USER: "yourname@gmail.com"
  IMAP_PASS: "xxxx xxxx xxxx xxxx"

Update the image fields in both deployment manifests, and set your ingress hostname in k8s/frontend-deployment.yaml.

3. Deploy

kubectl apply -f k8s/namespace.yaml
kubectl apply -f k8s/secret.yaml
kubectl apply -f k8s/pvc.yaml
kubectl apply -f k8s/backend-deployment.yaml
kubectl apply -f k8s/frontend-deployment.yaml

4. Verify

kubectl get pods -n mailarchive
kubectl logs -n mailarchive -l component=backend -f

# Check sync and search index status
# (port-forward blocks, so run the curl commands from a second terminal)
kubectl port-forward -n mailarchive svc/mailarchive-backend 8000:8000
curl http://localhost:8000/api/status
curl http://localhost:8000/api/search/status

Configuration

Variable             Default         Description
IMAP_HOST            imap.gmail.com  IMAP server hostname
IMAP_PORT            993             IMAP SSL port
IMAP_USER            (from secret)   Gmail address
IMAP_PASS            (from secret)   16-character app password
MAIL_ROOT            /mail           PVC mount path
SYNC_INTERVAL_HOURS  24              Hours between automatic syncs
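
For reference, reading these variables with the listed defaults could be done as below. The Settings class and load_settings function are illustrative names rather than the backend's real code; only the variable names and defaults come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    imap_host: str
    imap_port: int
    imap_user: str
    imap_pass: str
    mail_root: str
    sync_interval_hours: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read configuration from the environment, failing fast on missing credentials."""
    missing = [key for key in ("IMAP_USER", "IMAP_PASS") if not env.get(key)]
    if missing:
        raise RuntimeError("missing required variables: " + ", ".join(missing))
    return Settings(
        imap_host=env.get("IMAP_HOST", "imap.gmail.com"),
        imap_port=int(env.get("IMAP_PORT", "993")),
        imap_user=env["IMAP_USER"],
        imap_pass=env["IMAP_PASS"],
        mail_root=env.get("MAIL_ROOT", "/mail"),
        sync_interval_hours=int(env.get("SYNC_INTERVAL_HOURS", "24")),
    )
```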

Updating without losing state

All state lives on the PVC — sync progress, search index, and header caches all survive pod restarts. To update the image safely:

kubectl scale deployment mailarchive-backend -n mailarchive --replicas=0
# push new image
kubectl scale deployment mailarchive-backend -n mailarchive --replicas=1

The new pod reads .sync_state.json from the PVC and picks up where it left off. No sync is triggered at startup when the state file exists; use Sync Now if you want one immediately.

The deployment uses strategy: Recreate rather than RollingUpdate because ReadWriteOnce volumes can only be mounted by one pod at a time.

Expanding the PVC

If your mailbox outgrows 50Gi and your StorageClass has allowVolumeExpansion: true with a driver that supports online expansion:

kubectl patch pvc mailarchive-pvc -n mailarchive \
  -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'

Local development

# Backend
cd backend/
pip install -r requirements.txt
IMAP_USER=you@gmail.com IMAP_PASS="xxxx xxxx xxxx xxxx" MAIL_ROOT=./mail python main.py

# Frontend (separate terminal)
cd frontend/
npm install
npm run dev   # Vite proxies /api calls to localhost:8000

API reference

Endpoint                                            Method  Description
/api/status                                         GET     Sync status, last run, emails downloaded, index status
/api/sync                                           POST    Trigger a full sync of all folders
/api/sync/folder/{folder}                           POST    Trigger immediate sync of a single folder
/api/folders                                        GET     List all folders with email counts
/api/emails/{folder}                                GET     List emails in a folder
/api/email/{folder}/{filename}                      GET     Full email detail: body, headers, attachments
/api/email/{folder}/{filename}/download             GET     Download the raw .eml file
/api/email/{folder}/{filename}/attachment/{index}   GET     Download a specific attachment by index
/api/search?q={query}                               GET     Full-text search across all folders
/api/search/status                                  GET     Search index document count and status
/api/export/build                                   POST    Start building the export zip in the background
/api/export/status                                  GET     Export build progress
/api/export/download                                GET     Download the completed export zip