
MicroGemLabs Platform Overview


Overview

Welcome to MicroGemLabs. This guide covers what the platform does, how the pieces fit together, and where to find detailed documentation for each product.

What Is MicroGemLabs?

MicroGemLabs is an autonomous operations platform: six specialized monitoring tools, a conversational AI assistant, and an optional autonomous AI agent, all on shared infrastructure. Each product has its own subdomain and pricing, while sharing one database, one on-call engine, one billing system, and one AI surface.

The platform serves developers and small engineering teams (2–20 people) who need professional monitoring without enterprise pricing.

Products

| Product | What It Does | Subdomain |
| --- | --- | --- |
| PulseGuardPlus | Uptime monitoring: HTTP, TCP, DNS, Ping, SSL, Heartbeat from 3 regions | pulse.microgemlabs.ai |
| CronKeeper | Cron job monitoring: dead-man switch with start/complete tracking | cron.microgemlabs.ai |
| LogVault | Log management: ingestion, full-text search, pattern/rate/absence alerts | logs.microgemlabs.ai |
| CertGuard | Certificate monitoring: SSL cert and domain registration expiry | certs.microgemlabs.ai |
| CronRunner | Scheduled HTTP: cron-as-a-service with retry and logging | run.microgemlabs.ai |
| HookRelay | Webhook relay: receive, inspect, forward, and replay webhooks | hooks.microgemlabs.ai |

Enable products individually from the Products page. Each one works independently, but they all share the platform features below.
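To make the dead-man-switch idea concrete, here is a minimal sketch of how a job might wrap itself in CronKeeper-style start/complete pings. The ping URL shape and event names are assumptions for illustration, not the actual CronKeeper API; use the endpoints shown on your check's page.

```javascript
// Dead-man-switch pattern: ping when the job starts, ping again when it
// completes; the platform alerts if the completion ping never arrives.
const PING_BASE = "https://cron.microgemlabs.ai/ping/YOUR_CHECK_ID"; // hypothetical URL shape

async function withCronPings(job, ping = (event) => fetch(`${PING_BASE}/${event}`)) {
  await ping("start"); // "job started" heartbeat
  try {
    const result = await job(); // the actual work
    await ping("complete"); // "job finished" heartbeat
    return result;
  } catch (err) {
    await ping("fail"); // hypothetical failure ping
    throw err;
  }
}
```

The `ping` parameter is injectable so the wrapper can be exercised without network access.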

Shared Platform Features

On-Call & Escalation

One rotation schedule and escalation policy serve all six products. Alerts are delivered through five channels: email, SMS, voice call, Slack DM, and Telegram DM. When any product detects an issue, it routes through the same on-call engine.

MicroGemAI

The AI DevOps assistant that sees across all products simultaneously. Ask questions about your infrastructure, get cross-product root cause analysis, and let it remember your team's knowledge across sessions. Uses your own LLM API key (BYOK: OpenAI, Anthropic, or any compatible provider). Included with every plan.

MicroGemAgent

Premium add-on that turns MicroGemAI from a chat tool into an autonomous engineer. It runs a watchdog scan every 5 minutes plus a deep scan every 60 minutes, investigates anomalies with an LLM, and proposes fixes for your approval from the dashboard. 21 fix actions ship out of the box: docker (restart / pull-restart / compose-up), kubectl (rollout-restart and scale), pm2 (one process or all), systemctl restart, AWS (EC2 / ECS / Lambda alias / RDS reboot), Postgres query kill/cancel, Redis (DEL / FLUSHDB), Cloudflare cache purge, GitHub (rerun, workflow_dispatch, and rollback), and a read-only HTTP health probe. Generic webhook calls live in the runbook skill type at /agent/skills?type=runbook. Three tiers: Starter $29 (MCP only, analysis-only), Pro $79 (all integrations plus fix proposals), Team $149 (multi-env plus shared skills). The agent's key is separate from the MicroGemAI chat key; bring your own for both. See the MicroGemAgent guide for the full setup flow.

Recursive Self-Improvement

The agent gets smarter every time an incident resolves:

  • Each postmortem auto-drafts a knowledge skill at the quarantine tier.
  • Semantic retrieval (pgvector embeddings) surfaces past learnings to the LLM the next time a similar incident fires.
  • Citation outcomes track which memories and skills actually helped (or didn't) and adjust their confidence.
  • A weekly consolidation cron clusters similar incident_learning memories into "recurring pattern" memos.
  • Stale skills (retrieved but never applied) auto-demote out of the validated tier.

Observability for the whole loop lives at /agent/memory: pool stats, top citations with win rate, recent consolidations, stale candidates, and missed recommendations.
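The semantic-retrieval step can be pictured as a similarity ranking: embed the new incident, then surface the stored learnings whose embeddings are closest. A minimal sketch with toy vectors (in the platform the vectors and nearest-neighbor search live in pgvector, not application code):

```javascript
// Cosine similarity between two equal-length embedding vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Rank stored incident learnings by similarity to the new incident's embedding
// and return the top k to include in the LLM's context.
function topLearnings(queryVec, memories, k = 3) {
  return memories
    .map((m) => ({ ...m, score: cosine(queryVec, m.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```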

Internet-Aware Investigation

Beyond the team's own data, MicroGemAgent (and MicroGemAI in read-only mode) ships three external research tools: web_search (Tavily-backed) for unknown error messages and recent CVEs, fetch_url for reading specific docs / changelogs / status pages with SSRF protection, and check_third_party_status for quick up/down checks against 14 provider status pages (GitHub, AWS, Stripe, OpenAI, Anthropic, Cloudflare, Vercel, Supabase, Datadog, Sentry, Slack, Discord, PagerDuty, Atlassian).

Mobile-First UX

Every customer-facing surface is built for phones first. Bottom nav (Home / Agent / Chat / Incidents) appears below the md breakpoint. List pages render as cards on mobile (incidents, feedback, browser checks) instead of overflowing tables. Forms have a Cancel button next to Submit on every page so you can back out from a touch keyboard. The chat panel covers the full viewport on phones; clicking a deep link auto-closes the panel so the destination page is visible. The agent's "thinking" indicator is a pulsating gem (no GIFs: pure CSS, theme-aware, respects prefers-reduced-motion). Right-click → "Filter to / Filter out / Sort" works on every list cell on desktop; long-press does the same on mobile.

Skills Library

A unified library at /agent/skills holding three reusable types of agent capability:

  • Knowledge: markdown playbooks the agent loads as context when a trigger keyword matches.
  • Executable: sandboxed JavaScript that calls helper functions (helpers.postgres.query, helpers.aws.ec2.reboot, etc.).
  • Runbook: trigger-pattern + action bindings for incident response.

Skills accumulate from agent observations after successful fixes, manual authoring (Monaco editor for scripts), or JSON imports from another team. Three trust levels (quarantine → validated → trusted) with auto-promotion; team-wide approval policy at Settings → Skill Execution. Full-text search across name + content + script + pattern. Live SSE-streamed execution with mid-run approval gates for destructive skills.
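As a rough illustration of the executable type, here is what a sandboxed skill might look like. The helpers object and its signatures are assumptions extrapolated from the names the docs mention (helpers.postgres.query etc.), not a verified API; treat this as a shape sketch only.

```javascript
// Hypothetical executable skill: find Postgres queries running longer than a
// cutoff, returning their pids so the agent can propose a kill/cancel action.
async function findLongRunningQueries(helpers, { maxSeconds = 300 } = {}) {
  // Assumed helper signature: helpers.postgres.query(sql) -> rows
  const rows = await helpers.postgres.query(
    `SELECT pid FROM pg_stat_activity
     WHERE state = 'active'
       AND now() - query_start > interval '${maxSeconds} seconds'`
  );
  return rows.map((r) => r.pid); // the agent decides what to do with these
}
```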

Anomaly Detection

Hourly statistical analysis detects when metrics deviate from their 7-day rolling baselines using Z-scores. Tracks response time (PulseGuardPlus), execution duration (CronKeeper), error rate (LogVault), and execution time (CronRunner). Configurable sensitivity per metric.
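The core of a Z-score check is small enough to sketch: compare the latest sample against the mean and standard deviation of its rolling baseline. The threshold of 3 below is an illustrative default, not the platform's actual sensitivity setting.

```javascript
// Z-score of the latest sample relative to a rolling baseline window.
function zScore(baseline, latest) {
  const mean = baseline.reduce((s, v) => s + v, 0) / baseline.length;
  const variance =
    baseline.reduce((s, v) => s + (v - mean) ** 2, 0) / baseline.length;
  const std = Math.sqrt(variance);
  return std === 0 ? 0 : (latest - mean) / std;
}

// A sample is anomalous when it sits more than `threshold` standard
// deviations from the baseline mean, in either direction.
function isAnomalous(baseline, latest, threshold = 3) {
  return Math.abs(zScore(baseline, latest)) > threshold;
}
```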

Predictive Alerting

Extends anomaly detection with trend extrapolation. Uses linear regression over 48 hours of data to predict future threshold breaches. Alerts hours or days before problems occur.
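The extrapolation step amounts to an ordinary least-squares fit plus solving for the crossing time. A sketch under the stated assumption (simple linear regression over recent samples; times in hours):

```javascript
// Least-squares fit of y = a + b*t over points [{ t, y }].
function linearFit(points) {
  const n = points.length;
  const st = points.reduce((s, p) => s + p.t, 0);
  const sy = points.reduce((s, p) => s + p.y, 0);
  const stt = points.reduce((s, p) => s + p.t * p.t, 0);
  const sty = points.reduce((s, p) => s + p.t * p.y, 0);
  const b = (n * sty - st * sy) / (n * stt - st * st); // slope
  const a = (sy - b * st) / n; // intercept
  return { a, b };
}

// Hours from the most recent sample until the fitted line crosses the
// threshold; Infinity when the trend is flat or improving.
function hoursUntilBreach(points, threshold) {
  const { a, b } = linearFit(points);
  if (b <= 0) return Infinity;
  return (threshold - a) / b - points[points.length - 1].t;
}
```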

Maintenance Windows

Suppress alerting during planned maintenance while continuing to monitor. Three scope levels (global, product, resource), independent toggles for alert suppression and baseline exclusion, quick suppress or scheduled windows.
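The scope-matching logic can be sketched as a simple check: an alert is suppressed when any active window with alert suppression enabled matches it at the global, product, or resource level. The window and alert field names here are illustrative, not the stored schema.

```javascript
// Returns true when any active maintenance window suppresses this alert.
function isSuppressed(alert, windows, now = Date.now()) {
  return windows.some((w) => {
    const active = now >= w.start && now < w.end && w.suppressAlerts;
    if (!active) return false;
    if (w.scope === "global") return true;
    if (w.scope === "product") return w.product === alert.product;
    return w.resourceId === alert.resourceId; // scope === "resource"
  });
}
```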

Runbook Actions

Pre-defined HTTP calls and bundle-defined actions that MicroGemAI can suggest or execute during incidents. Lives inside the unified Skills library (/agent/skills?type=runbook); legacy /runbooks redirects there. Three trust levels: manual (human clicks), auto-with-approval (AI triggers, human approves), and full auto (AI handles everything). Cooldown, circuit breaker, and full audit trail.
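The cooldown and circuit-breaker safeguards compose into a single pre-execution gate, sketched below. Field names (consecutiveFailures, breakerThreshold, cooldownMs) are invented for illustration.

```javascript
// Gate an action before execution: refuse while the circuit breaker is open
// (too many consecutive failures) or while the cooldown has not elapsed.
function canExecute(action, now = Date.now()) {
  if (action.consecutiveFailures >= action.breakerThreshold) return false; // breaker open
  if (action.lastRunAt != null && now - action.lastRunAt < action.cooldownMs) {
    return false; // still cooling down from the last run
  }
  return true;
}
```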

Postmortem Generation

When incidents resolve (5+ minutes), MicroGemAI auto-generates a structured postmortem: summary, impact, timeline, root cause, contributing factors, resolution, action items, and lessons learned. Teams review, edit, and publish.

Messaging Gateway

Chat with MicroGemAI from Telegram or Slack. Receive on-call alerts, ask questions, execute runbook actions, and get infrastructure status, all from your phone. Auto-discovery: message the bot from a new chat and it tells you how to connect.

MCP Server

15 tools (11 read, 4 action) exposed via the Model Context Protocol. Connect any AI agent (Claude, Cursor, custom) to query your monitoring data and execute runbook actions programmatically.
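MCP is JSON-RPC 2.0 under the hood, so a tool invocation is just a structured request body. A sketch of building one; the tool name "list_incidents" is hypothetical, so consult the MCP Integration guide for the actual 15-tool reference.

```javascript
// Build a JSON-RPC 2.0 request for the MCP tools/call method.
function mcpToolCall(id, name, args) {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name, arguments: args },
  };
}
```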

Documentation

Searchable product guides powered by PostgreSQL tsvector full-text search. MicroGemAI includes relevant documentation in its context when answering questions. Docs are publicly readable without an account; share any /docs/<slug> link with anyone.

Managing Your Team

At Settings → Members. Three roles:

  • Owner: created the team; has every permission. One per team. Cannot be removed. Ownership transfer is a future capability.
  • Admin: can invite, promote members to admin, remove members, revoke pending invites, and approve agent fix proposals. Cannot demote or remove other admins; only the owner can.
  • Member: can view everything in the team but can't change membership, approve fix proposals, or touch settings that cost money.

What each role can do

| Action | Owner | Admin | Member |
| --- | --- | --- | --- |
| Invite new member | ✅ | ✅ | |
| Revoke pending invite | ✅ | ✅ | |
| Remove a member | ✅ | ✅ (not other admins) | |
| Promote member → admin | ✅ | ✅ | |
| Demote admin → member | ✅ | | |
| Change agent tier | ✅ | | |
| Cancel agent subscription | ✅ | | |
| Approve agent fix proposal | ✅ | ✅ | |
| Leave team | n/a | ✅ | ✅ |

Owners can't use "Leave team": they'd orphan the team. Use transfer-ownership (coming soon) or delete the team first.

Invites

Invite by email from the members card. The recipient gets a link that expires in 7 days. Pending invites are listed below the member roster with a Revoke link owners/admins can use if they invited the wrong person. Used invites are preserved for audit.

How It All Connects

When PulseGuardPlus detects your API is down, here's what happens automatically:

1. Incident created: PulseGuardPlus creates an incident record

2. On-call alerted: the escalation policy sends SMS/voice/email/Slack/Telegram to whoever is on call

3. AI analyzes: MicroGemAI assembles context from all 6 products, finds the expired SSL cert in CertGuard and the missed renewal cron in CronKeeper

4. Runbook matched: the "Restart Cert Renewal" runbook matches the incident pattern with 90% confidence

5. Auto-executed: the runbook fires (trust level: full auto) and triggers certificate renewal

6. Alert enriched: the on-call notification is updated with AI analysis and runbook execution result

7. Incident resolves: PulseGuardPlus detects recovery, auto-resolves the incident

8. Postmortem generated: MicroGemAI drafts a full postmortem with timeline and action items

9. Memory stored: the incident and resolution are saved to persistent memory for future reference

All of this happens in minutes, often before the on-call engineer opens their laptop.

Billing

Per-product pricing starts at $3–12/month, with a 20% bundle discount when using 2+ products. On-call seats are metered per user. MicroGemAI is included; teams bring their own LLM API key.
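As a worked example of the bundle math under one reading of the rule (the 20% discount applies to the product subtotal once 2+ products are enabled; prices below are made up, not the actual price list):

```javascript
// Monthly product subtotal with the 2+ product bundle discount applied.
function monthlyTotal(productPrices) {
  const subtotal = productPrices.reduce((s, p) => s + p, 0);
  return productPrices.length >= 2 ? subtotal * 0.8 : subtotal;
}
```

On-call seat metering and the agent add-on would be billed on top of this subtotal.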

Getting Started

1. Sign up at microgemlabs.ai (free tier, no credit card)

2. Enable a product from the Products page

3. Create your first monitor (60 seconds)

4. Configure on-call at Ops → On-Call

5. Set up MicroGemAI with your LLM API key at Settings → Account

6. Connect messaging at Settings → Integrations (Telegram or Slack)

Documentation Index

  • PulseGuardPlus: uptime monitoring setup, 6 check types, content matching, status pages
  • CronKeeper: cron monitoring, ping endpoints, shell wrapper, status badges
  • LogVault: log ingestion API, search, alert rules, retention
  • CertGuard: SSL and domain monitoring, alert thresholds, WHOIS
  • CronRunner: scheduled HTTP jobs, retry logic, failure alerting
  • HookRelay: webhook endpoints, payload inspector, forwarding, replay
  • MicroGemAI: AI assistant, memory, anomaly detection, runbooks, postmortems, messaging
  • MicroGemAgent: autonomous AI engineer, tiers, fix proposals, credential vault, skills library
  • MCP Integration: connecting external AI agents, 15 tool reference
  • Outbound Webhooks: push MicroGemLabs events to PagerDuty, Zapier, custom endpoints (signed POSTs)