Overview
AI DevOps assistant with cross-product intelligence, persistent memory, and automated incident analysis.
Location: microgemlabs.ai/gemai
MicroGemAI vs MicroGemAgent
Two different products with similar names:
- MicroGemAI - the conversational chat assistant documented below. You ask, it answers. Included with every MicroGemLabs plan (BYOK LLM key).
- MicroGemAgent - an autonomous DevOps engineer that runs on a 5-min loop, investigates issues, and proposes fixes for your approval. Premium add-on with three tiers. Lives at microgemlabs.ai/agent. See microgemagent-guide.md.
Rule of thumb: if you want to ask *why* something happened, open MicroGemAI chat. If you want something *watched, investigated, and fixed* without you in the loop, that's MicroGemAgent.
Getting Started
1. Navigate to Account settings at microgemlabs.ai/account
2. Under MicroGemAI Configuration, enter your LLM API key
3. Open MicroGemAI - full page at microgemlabs.ai/gemai, or click the ✨ chat icon in the top nav of any page for a slide-out panel
4. Start chatting - MicroGemAI can see data from all your enabled products
Configuration (BYOK - Bring Your Own Key)
MicroGemAI uses YOUR LLM API key to process queries. Your data stays on the MicroGemLabs platform and is only sent to the LLM provider you choose. MicroGemLabs never stores or has access to your LLM provider credentials beyond what you configure.
Supported providers:
| Provider | Base URL | Recommended Model |
|---|---|---|
| OpenAI | https://api.openai.com/v1 (default) | gpt-4o-mini (fast, cheap) or gpt-4o (more capable) |
| Anthropic | https://api.anthropic.com/v1 | claude-sonnet-4-20250514 |
| Any OpenAI-compatible API | Custom URL | Varies |
- Max Tokens - Maximum response length (default: 4000)
- Temperature - Creativity vs precision (default: 0.3 - lower is more factual)
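These settings map onto a standard OpenAI-compatible chat request. A minimal sketch, assuming the default OpenAI endpoint (the config object and field names here are illustrative, not the platform's internal schema):

```ts
// Hypothetical sketch of how the BYOK settings translate to an OpenAI-compatible call.
// Values are examples only; configure the real ones under Account settings.
const config = {
  baseUrl: "https://api.openai.com/v1", // or any OpenAI-compatible endpoint
  apiKey: process.env.LLM_API_KEY!,     // your key; MicroGemLabs only forwards requests
  model: "gpt-4o-mini",
  maxTokens: 4000,
  temperature: 0.3,                      // lower = more factual
};

async function askGemAI(question: string) {
  const res = await fetch(`${config.baseUrl}/chat/completions`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${config.apiKey}`,
    },
    body: JSON.stringify({
      model: config.model,
      max_tokens: config.maxTokens,
      temperature: config.temperature,
      messages: [{ role: "user", content: question }],
    }),
  });
  return res.json();
}
```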
What MicroGemAI Can Do
Cross-Product Correlation
MicroGemAI's unique capability is connecting events across all six MicroGemLabs products. When you ask "why did our API go down?", it doesn't just look at PulseGuardPlus; it checks:
- PulseGuardPlus - Which monitors went down, when, and for how long
- CronKeeper - Did any cron jobs stop running around the same time?
- LogVault - What error logs appeared before and during the outage?
- CertGuard - Did any SSL certificates expire recently?
- CronRunner - Did any scheduled jobs fail?
- HookRelay - Did webhook forwarding start failing?
This cross-product view is something no single-product monitoring tool can provide.
Example Questions
Infrastructure health:
- "What's the current health of our infrastructure?"
- "Are there any active issues across all products?"
- "Which monitors have been flapping this week?"
- "Why did our API go down last night?"
- "What happened around 3 AM that caused the outage?"
- "Are the log errors related to the SSL cert expiry?"
- "Are any SSL certificates expiring soon?"
- "Which cron jobs have been running slower than usual?"
- "Show me error trends from the last 7 days"
- "Who is on call right now?"
- "What's our uptime over the last 30 days?"
- "How many incidents did we have this month?"
Persistent Memory
MicroGemAI builds up knowledge about your infrastructure over time. Unlike a stateless chatbot, it remembers what it learns across conversations.
How Memory Works
Explicit memory: Tell MicroGemAI something directly.
- "Remember that our staging database is on port 5433"
- "Note that Alice handles all database-related issues"
- "Keep in mind that deployments on Fridays tend to cause log spikes"
Memory Categories
| Category | Examples |
|---|---|
| Infrastructure | Server addresses, database ports, service architecture |
| Pattern | "API timeouts usually mean Redis is overloaded" |
| Runbook | "When the payment service crashes, restart the worker pods first" |
| Preference | "Team prefers Slack notifications over SMS for non-critical alerts" |
| Incident Learning | Root cause analyses from past incidents |
| Team Context | "Alice owns the payment service, Bob handles the database" |
Memory Lifecycle
Memories have a confidence score that starts at 0.8 and changes over time:
- Referenced frequently → Confidence increases (up to 1.0)
- Not referenced for 90+ days → Confidence decays by 0.1 per day
- Below 0.2 confidence → Moved to archive (searchable but not in default prompt)
- Archived 180+ days → Marked expired, deleted after 30-day grace period
This ensures the most relevant knowledge stays active while stale information fades naturally.
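A minimal sketch of how this lifecycle could be modeled; the thresholds match the documented values, but the data shape and the reinforce/sweep helpers are illustrative assumptions, not the platform's internal code:

```ts
// Illustrative model of the memory confidence lifecycle described above.
interface Memory {
  id: string;
  confidence: number;        // starts at 0.8
  lastReferencedAt: Date;
  status: "active" | "archived" | "expired";
}

const DECAY_AFTER_DAYS = 90;
const DECAY_PER_DAY = 0.1;
const ARCHIVE_BELOW = 0.2;

function daysBetween(a: Date, b: Date): number {
  return Math.floor((b.getTime() - a.getTime()) / 86_400_000);
}

// Called when a memory is cited in an answer: confidence climbs toward 1.0.
function reinforce(m: Memory): void {
  m.confidence = Math.min(1.0, m.confidence + 0.05); // increment size is an assumption
  m.lastReferencedAt = new Date();
}

// Periodic sweep: decay stale memories and archive those that fall below 0.2.
function sweep(m: Memory, now: Date): void {
  const idleDays = daysBetween(m.lastReferencedAt, now);
  if (idleDays > DECAY_AFTER_DAYS) {
    const staleDays = idleDays - DECAY_AFTER_DAYS;
    m.confidence = Math.max(0, m.confidence - staleDays * DECAY_PER_DAY);
  }
  if (m.confidence < ARCHIVE_BELOW) m.status = "archived";
}
```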
Semantic retrieval
Every memory is embedded with text-embedding-3-small (1536-dim) at write time and stored in pgvector. When you ask a question, MicroGemAI computes a query embedding from your message and pulls the top-K most similar memories using cosine similarity (ivfflat index), not just substring matches. Asking "why did the API slow down?" can surface a memory titled "Redis pool exhaustion correlates with latency spikes" even though none of those words appear in your question.
No separate embedding key is required - MicroGemAI uses your configured OpenAI-compatible endpoint, falling back to a hash-based pseudo-vector for providers that don't support embeddings (search degrades to keyword match in that case).
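A sketch of what that retrieval looks like with pgvector; the table and column names are assumptions for illustration:

```ts
// Illustrative pgvector retrieval: embed the question, then pull the top-K
// most similar memories by cosine distance.
import OpenAI from "openai";
import { Pool } from "pg";

const openai = new OpenAI(); // points at your configured OpenAI-compatible endpoint
const db = new Pool();

async function recallMemories(question: string, teamId: string, k = 5) {
  const { data } = await openai.embeddings.create({
    model: "text-embedding-3-small", // 1536-dim, as described above
    input: question,
  });
  const queryVector = JSON.stringify(data[0].embedding);

  // "<=>" is pgvector's cosine-distance operator; the ivfflat index accelerates it.
  const { rows } = await db.query(
    `SELECT id, title, body, 1 - (embedding <=> $1::vector) AS similarity
       FROM memories
      WHERE team_id = $2 AND status = 'active'
      ORDER BY embedding <=> $1::vector
      LIMIT $3`,
    [queryVector, teamId, k]
  );
  return rows;
}
```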
The full closed-loop story - citation tracking, confidence adjustment from outcomes, weekly consolidation, staleness demotion, the /agent/memory observability page - lives in the MicroGemAgent guide under "Recursive Self-Improvement," because that loop is what makes the *agent* better over time. MicroGemAI gemai mode reads from the same pool, so any memory the agent learns is immediately available in chat.
Memory Limits
Each team has a configurable cap on active memories (default: 100). When new knowledge pushes past the cap, the lowest-confidence memories are moved to archive. Archived memories are still searchable โ they're just not included in every conversation by default.
Memory Stats
The MicroGemAI page shows your current memory counts:
- Active - In the default prompt for every conversation
- Archive - Searchable, pulled in when relevant
- Total - All memories across both tiers
Incident Auto-Analysis
When any MicroGemLabs product creates an incident (PulseGuardPlus downtime, CronKeeper missed ping, LogVault alert fire, CertGuard expiry, CronRunner failure, HookRelay forward failure), MicroGemAI can automatically generate a cross-product analysis.
The analysis includes:
1. Summary - What happened in 1-2 sentences
2. Timeline - Key events across products leading up to the incident
3. Probable Root Cause - Best assessment based on all available data
4. Related Issues - Other products affected
5. Recommended Actions - Steps to resolve and prevent recurrence
Auto-analysis is triggered via Inngest events and runs asynchronously. The analysis is attached to the incident record and delivered through the on-call alert.
Weekly Insights
Every Monday at 8 AM, MicroGemAI scans the past 7 days of data across all products and generates proactive insights:
- Trends - Metrics improving or degrading over time
- Anomalies - Unusual patterns that deviate from baseline
- Correlations - Events in one product affecting another
- Recommendations - Preventive actions to take
Insights appear as banners on the MicroGemAI page and can be dismissed or marked as actioned.
Anomaly Detection
MicroGemAI includes statistical anomaly detection that runs hourly across four products. Instead of binary up/down checks, it detects when metrics are trending abnormally compared to a rolling 7-day baseline.
What's Monitored
| Product | Metric | What It Catches |
|---|---|---|
| PulseGuardPlus | Response time | API getting slower before it goes down |
| CronKeeper | Execution duration | Backup job taking 3x longer than normal |
| LogVault | Error rate (per hour) | Error spike even if no threshold rule is set |
| CronRunner | Execution time | Scheduled job slowing down over time |
How It Works
For each metric, MicroGemAI computes a rolling baseline: mean, standard deviation, and percentiles (p50, p95, p99) from 7 days of historical data. Every hour, it compares the current value against this baseline using a Z-score (how many standard deviations from the mean).
When the Z-score exceeds the sensitivity threshold, MicroGemAI creates an anomaly insight. Critical anomalies (1.5x the threshold) are automatically routed through on-call.
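A condensed sketch of that computation; the helper names are illustrative, not the platform's code:

```ts
// Illustrative baseline + Z-score check, following the description above.
// `history` is the last 7 days of hourly values for one metric.
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function stdDev(xs: number[]): number {
  const m = mean(xs);
  return Math.sqrt(mean(xs.map((x) => (x - m) ** 2)));
}

function zScore(current: number, history: number[]): number {
  const sd = stdDev(history);
  if (sd === 0) return 0; // flat baseline: nothing to compare against
  return (current - mean(history)) / sd;
}

// sensitivity defaults to 2.5 (the "Normal" preset); critical at 1.5x the threshold.
function classify(current: number, history: number[], sensitivity = 2.5) {
  const z = Math.abs(zScore(current, history));
  if (z >= sensitivity * 1.5) return "critical"; // routed through on-call
  if (z >= sensitivity) return "anomaly";        // surfaced as an insight
  return "normal";
}
```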
Configuring Sensitivity
Navigate to Ops → Anomalies. Each baseline card has a ⚙ gear icon that opens an inline settings panel with a sensitivity slider and named presets:
| Preset | Z-Score | Best For |
|---|---|---|
| Very Sensitive | 1.5 | Critical services, error rates |
| Sensitive | 2.0 | Error rates, important endpoints |
| Normal | 2.5 | Most metrics (default) |
| Relaxed | 3.0 | Naturally variable metrics |
| Very Relaxed | 4.0 | Only major deviations |
You can also set sensitivity in bulk, per product or globally, using the controls at the top of the Anomalies page.
Maintenance Window Integration
Data collected during maintenance windows is automatically excluded from baseline computation when "Exclude from baselines" is enabled on the window. This prevents deployment spikes from polluting your baselines. See the Maintenance Windows section below.
Maintenance Windows
Planned maintenance windows suppress alerting while continuing to monitor. Your on-call team won't be paged during a scheduled deployment, but all monitoring data is still recorded for postmortem analysis.
Quick Suppress
Navigate to Ops → Maintenance. The Quick Suppress panel lets you start a window immediately with one click. Choose a scope (All Products or a specific product), optionally add a reason, and select a duration (15m, 30m, 1h, 2h, 4h).
Schedule a Window
Use the Schedule Window form to plan maintenance for a future time. You can independently control two behaviors:
- Suppress alerts - Skip incident creation and on-call routing during the window
- Exclude from baselines - Don't include data from this period in anomaly detection baselines
Both default to on, but you can keep baseline exclusion while still allowing alerts for partial maintenance that shouldn't suppress everything.
Scope Levels
| Scope | What's Suppressed | Use Case |
|---|---|---|
| Global | All products, all resources | Full infrastructure maintenance |
| Product | All resources in one product | Deploying a specific service |
| Resource | One specific monitor, check, stream, etc. | Migrating a single database server |
The most specific scope wins - a resource-level window takes precedence over a global window.
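A sketch of how that precedence could be evaluated; the types and names are illustrative assumptions:

```ts
// Illustrative precedence check: resource > product > global.
// Real windows carry more fields (reason, baseline exclusion, start/end times, etc.).
type Scope =
  | { level: "global" }
  | { level: "product"; product: string }
  | { level: "resource"; product: string; resourceId: string };

interface MaintenanceWindow { scope: Scope; suppressAlerts: boolean; }

function activeWindowFor(
  windows: MaintenanceWindow[],
  product: string,
  resourceId: string
): MaintenanceWindow | undefined {
  // Most specific scope wins, so check resource-level windows first.
  return (
    windows.find((w) => w.scope.level === "resource" && w.scope.resourceId === resourceId) ??
    windows.find((w) => w.scope.level === "product" && w.scope.product === product) ??
    windows.find((w) => w.scope.level === "global")
  );
}
```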
Active Window Management
Active windows appear as amber banners at the top of the Maintenance page. You can extend a running window (+30m or +1h buttons) if maintenance is taking longer than expected, or end it immediately with the "End Now" button to resume alerting.
Windows complete automatically when their end time is reached (checked every minute by a background job).
Predictive Alerting
Extends anomaly detection with trend extrapolation. Uses linear regression over 48 hours of hourly-aggregated data to predict where metrics are heading. Example: "Response time for API Gateway increasing at 12ms/hour. Current: 180ms. Predicted: 468ms in 24 hours. Threshold: 450ms. (Trend confidence: 72%)"
Enable at Anomalies → Detection Settings → toggle Prediction on. Configurable horizon from 6 hours to 7 days (default: 24 hours). Requires R² > 0.3 (30% trend reliability) and positive slope (degrading) to generate predictions. Warning alerts at 1.5× baseline p95, critical at 2× baseline p99.
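A sketch of the underlying extrapolation, assuming a simple least-squares fit over hourly points; the helper name and return shape are illustrative:

```ts
// Illustrative least-squares trend fit over 48 hourly points, as described above.
// points[i] = metric value at hour i; horizonHours defaults to 24.
function predictTrend(points: number[], horizonHours = 24) {
  const n = points.length;
  const xs = points.map((_, i) => i);
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = points.reduce((a, b) => a + b, 0) / n;

  let sxy = 0, sxx = 0, syy = 0;
  for (let i = 0; i < n; i++) {
    sxy += (xs[i] - meanX) * (points[i] - meanY);
    sxx += (xs[i] - meanX) ** 2;
    syy += (points[i] - meanY) ** 2;
  }

  const slope = sxy / sxx;                              // change per hour (e.g. +12ms/hour)
  const intercept = meanY - slope * meanX;
  const r2 = syy === 0 ? 0 : (sxy * sxy) / (sxx * syy); // trend reliability

  // Only degrading trends with R² > 0.3 produce a prediction.
  if (r2 <= 0.3 || slope <= 0) return null;

  const current = points[n - 1];
  const predicted = intercept + slope * (n - 1 + horizonHours);
  return { slope, r2, current, predicted };
}
```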
Detection Settings
Four independent team-level toggles on the Anomalies page:
| Toggle | Default | What It Does |
|---|---|---|
| Anomaly Detection | ON | Compute baselines, flag Z-score deviations |
| Anomaly Alerting | ON | Route critical anomalies through on-call |
| Predictive | OFF | Extrapolate trends, forecast breaches |
| Predictive Alerting | OFF | Route critical predictions through on-call |
Detection and alerting are independent - you can detect anomalies without routing them to on-call (they appear as insights in the dashboard instead). Disabling detection auto-disables alerting for that feature.
Runbook Actions
Pre-defined HTTP calls that MicroGemAI can suggest or execute during incidents. Manage at Skills → filter by Runbook (/agent/skills?type=runbook). The legacy /runbooks URL redirects to that filtered view; if you bookmarked the old page, the redirect carries you through.
Creating a Runbook
Define the HTTP target (URL, method, headers, body with {{variable}} support), trust level, trigger pattern, and safety limits. Categories: restart, scale, cache, rollback, DNS, custom.
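A sketch of the {{variable}} substitution idea; the runbook shape and substitution helper are illustrative, and the real template engine may differ:

```ts
// Illustrative {{variable}} substitution for a runbook's HTTP target.
interface RunbookTemplate {
  url: string;
  method: "GET" | "POST" | "PUT" | "DELETE";
  headers: Record<string, string>;
  body?: string; // may contain {{variable}} placeholders
}

function renderTemplate(text: string, vars: Record<string, string>): string {
  // Replace each {{name}} with its value; leave unknown placeholders untouched.
  return text.replace(/\{\{(\w+)\}\}/g, (_, name) => vars[name] ?? `{{${name}}}`);
}

// Example (hypothetical): a "restart worker" runbook parameterized by service name.
const restartWorkers: RunbookTemplate = {
  url: "https://deploy.example.com/api/services/{{service}}/restart",
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ reason: "{{reason}}" }),
};

const renderedUrl = renderTemplate(restartWorkers.url, { service: "payment-worker" });
const renderedBody = renderTemplate(restartWorkers.body!, { reason: "MicroGemAI runbook execution" });
```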
Trust Levels
| Level | Behavior | Use Case |
|---|---|---|
| Manual | Human clicks Execute. AI suggests only. | Starting point for all actions |
| Auto with Approval | AI triggers, approval link sent to on-call. Timeout escalates. | Well-tested remediation |
| Full Auto | AI triggers and executes immediately. No human in the loop. | Low-risk operations (cache clear, worker restart) |
Incident Integration
When MicroGemAI auto-analyzes an incident, it runs findMatchingRunbooks() in parallel with context assembly. Matching runbooks are included in the LLM prompt so the analysis references them by name. Templates with full_auto trust and 70%+ match confidence execute immediately. Auto-approval templates trigger the approval workflow. Manual templates are suggested only.
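As a rough sketch of that dispatch logic (the match shape, threshold handling, and helper functions are assumptions; only the trust rules come from the description above):

```ts
// Illustrative trust-level dispatch after runbook matching.
type Trust = "manual" | "auto_approval" | "full_auto";

interface RunbookMatch {
  runbookId: string;
  trust: Trust;
  confidence: number; // 0..1 match confidence against the incident
}

async function dispatchRunbooks(matches: RunbookMatch[]) {
  for (const m of matches) {
    if (m.trust === "full_auto" && m.confidence >= 0.7) {
      await executeRunbook(m.runbookId);        // executes immediately, no human in the loop
    } else if (m.trust === "auto_approval") {
      await requestOnCallApproval(m.runbookId); // approval link sent to on-call
    }
    // "manual" runbooks are only suggested in the analysis text.
  }
}

// Placeholders for illustration only.
async function executeRunbook(id: string) { /* HTTP call defined by the runbook */ }
async function requestOnCallApproval(id: string) { /* trigger the approval workflow */ }
```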
Safety
Every execution is audited with who/what triggered it, the full HTTP request (auth headers redacted) and response, duration, linked incident, and AI reasoning. Cooldown prevents rapid re-execution (configurable, default 15 minutes). Circuit breaker stops after max daily runs (configurable, default 10).
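A sketch of the cooldown and daily circuit breaker described above; the names and in-memory history are assumptions for illustration:

```ts
// Illustrative safety gate: cooldown between executions plus a daily run cap.
interface ExecutionLog { runbookId: string; executedAt: Date; }

function canExecute(
  runbookId: string,
  history: ExecutionLog[], // assumed to be in chronological order
  now: Date,
  cooldownMinutes = 15,    // default cooldown
  maxDailyRuns = 10        // default circuit breaker
): boolean {
  const runs = history.filter((e) => e.runbookId === runbookId);

  const lastRun = runs.at(-1);
  const cooledDown =
    !lastRun || now.getTime() - lastRun.executedAt.getTime() >= cooldownMinutes * 60_000;

  const runsToday = runs.filter(
    (e) => e.executedAt.toDateString() === now.toDateString()
  ).length;

  return cooledDown && runsToday < maxDailyRuns;
}
```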
MCP Access
External agents discover and execute runbooks via 4 MCP tools: list_runbooks, execute_runbook, check_execution_status, suggest_runbook. Trust levels apply to MCP agents the same as to MicroGemAI.
Postmortem Generation
When incidents resolve (lasting 5+ minutes), MicroGemAI auto-generates a structured postmortem by assembling:
- Incident events and timeline from the source product
- Cross-product context during the incident window (all 6 products)
- GemAnalysis if one was generated during the incident
- Runbook executions that were triggered
- Persistent memories for institutional context
The LLM generates a JSON response containing: executive summary, impact assessment, chronological timeline, root cause, contributing factors, resolution steps, prioritized action items with severity, and lessons learned.
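The shape of that JSON might look roughly like the following; field names are illustrative and the real schema may differ:

```ts
// Illustrative shape of the generated postmortem JSON, mirroring the sections above.
interface Postmortem {
  executiveSummary: string;
  impact: string;
  timeline: { timestamp: string; event: string; product: string }[];
  rootCause: string;
  contributingFactors: string[];
  resolutionSteps: string[];
  actionItems: { description: string; severity: "low" | "medium" | "high" | "critical" }[];
  lessonsLearned: string[];
}
```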
Postmortems are stored as drafts. Every section is editable inline on the detail page. Teams can regenerate (re-run AI analysis from scratch), publish, unpublish, or delete. Action items have priority badges and checkboxes. Runbooks executed during the incident are listed with success/fail status.
The postmortem is also stored as a MicroGemAI memory entry, so future incidents can reference past resolutions.
Messaging Gateway
Chat with MicroGemAI from Telegram or Slack. Configure at Settings → Integrations.
Auto-Discovery
Message the bot from an unconnected chat and it responds with your Chat ID (Telegram) or Channel ID (Slack) plus setup instructions. Copy the ID into the Integrations page to connect.
Quick Commands
| Command | Response |
|---|---|
| status / health | Infrastructure health summary with monitor, cron, log, cert counts |
| incidents | Active and recent incidents across all products (24h) |
| oncall | Who's on call + active alerts |
| run <name> | Execute a runbook action (if the channel has execute permission) |
| remember <fact> | Save to MicroGemAI persistent memory |
Any other message goes through the full MicroGemAI pipeline: product context assembly + memory retrieval + doc search + LLM call. Conversation history is maintained for continuity.
Alert Delivery
Personal DM alerts: Add your Telegram Chat ID to Account → Contact Preferences. On-call alerts are delivered to your Telegram alongside SMS/email/voice/Slack. The alert includes an acknowledge link and "Reply to ask MicroGemAI for analysis."
Team channel broadcast: Enable "Alerts" on any messaging connection in the Integrations page. All incidents for that team are posted to the connected Telegram group or Slack channel automatically.
Conversations
Chat sessions are saved and accessible from the sidebar. You can return to a previous conversation to continue an investigation or reference past analysis. Each conversation maintains its full message history.
Two Ways to Chat
Two surfaces share the same conversation history โ pick whichever fits your context:
| Surface | Where | Best for |
|---|---|---|
| Full page | microgemlabs.ai/gemai | Long sessions, deep investigation, side-by-side with another tab |
| Slide-out panel | ✨ icon in any hub-page header | Quick questions while you're already on /services, /anomalies, /postmortems, etc. - panel state survives navigation |
The slide-out panel is the same component agent-tier users see, with destructive tools (propose_fix, execute_skill, notify_team) hidden. The full read-only investigation surface IS available - gemai mode can do real work, it just can't change anything.
Tools available in gemai mode
| Tool | What it does |
|---|---|
| search_docs | Full-text search across the published docs (tsvector, ranked) |
| search_memory | Semantic search across team memories (pgvector cosine) |
| expand_memory | Pull the full body of a specific memory by ID |
| search_skills | Find knowledge skills, executable skills, and runbooks by name + content |
| gather_health | Cross-product snapshot: monitors up/down, recent incidents, cron pings, log error rate |
| query_mcp | Read-only access to the same 11 MCP read tools the agent uses (list_monitors, recent_incidents, etc.) |
| recall_recent_alerts | Last N on-call alerts with timestamps and channels |
| list_postmortems / get_postmortem | Index of past postmortems and full text of a specific one |
| find_similar_past_incidents | Semantic match against the incident history pool |
| search_past_sessions | Search MicroGemAgent session transcripts (helpful for "what did the agent see at 3 AM?") |
| web_search | Tavily-backed general web search for unknown errors / CVEs |
| fetch_url | Read a specific page, SSRF-guarded against internal hosts |
| check_third_party_status | Fast in/out check against 14 provider status pages (GitHub, AWS, Stripe, OpenAI, Anthropic, Cloudflare, Vercel, Supabase, Datadog, Sentry, Slack, Discord, PagerDuty, Atlassian) |
What's deliberately *not* in gemai mode: propose_fix, execute_skill, notify_team, prefill_form_link, update_skill, create_knowledge_skill, create_runbook. Those require an active MicroGemAgent subscription.
Panel features
- Markdown rendering - fenced code blocks render as styled tiles with copy buttons; inline code gets a colored pill; internal links use the Next router so the panel stays open across hub navigation; external links open in a new tab.
- Voice input - mic button (sends to OpenAI's gpt-4o-mini-transcribe; requires that model to be enabled in your OpenAI project). Click once to record, again to stop, or walk away: it auto-stops on ~1.5s of silence and submits.
- ↑ / ↓ history - terminal-style recall of your prior messages, persisted across conversations on this device (localStorage, capped at 100 entries).
- Expand to full screen + drag-to-resize the panel width - both desktop-only.
- Smart auto-scroll - only follows new tokens when you're already at the bottom; a "↓ Latest" button appears when you scroll up.
On mobile
- The chat panel covers the full viewport on phones - touch keyboards already consume too much screen for a half-overlay to feel right.
- Clicking any internal link (deep link, /postmortems/..., /skills/...) auto-closes the panel so the destination is visible immediately.
- The "thinking" indicator is a pulsating gem rendered in pure CSS - theme-aware, respects prefers-reduced-motion.
- The ✨ trigger and 🔔 bell are present in every product header (PulseGuardPlus, CronKeeper, LogVault, CertGuard, CronRunner, HookRelay) - not just the hub. Ask MicroGemAI a question without leaving the page.
- Bottom nav (Home / Agent / Chat / Incidents) on screens narrower than md keeps the assistant always one tap away.
- Right-click → "Filter to / Filter out / Sort" works on every list cell on desktop; long-press is the equivalent on mobile.
Tips for Better Results
- Be specific: "Why did the API gateway monitor fail at 3:15 AM?" gets better results than "what happened?"
- Reference products by name: "Check LogVault for errors" helps MicroGemAI focus its search
- Tell it what you know: "We deployed version 2.4.1 at 2 AM" gives context that isn't in the monitoring data
- Ask it to remember: Explicitly say "remember this" when sharing infrastructure knowledge
- Ask for correlations: "Is the CronKeeper failure related to the LogVault errors?" prompts cross-product analysis