Overview
AI DevOps assistant with cross-product intelligence, persistent memory, and automated incident analysis.
Location: microgemlabs.ai/gemai
MicroGemAI vs MicroGemAgent
Two different products with similar names:
- MicroGemAI - the conversational chat assistant documented below. You ask, it answers. Included with every MicroGemLabs plan (BYOK LLM key).
- MicroGemAgent - an autonomous DevOps engineer that runs on a 5-min loop, investigates issues, and proposes fixes for your approval. Premium add-on with three tiers. Lives at microgemlabs.ai/agent. See microgemagent-guide.md.
Rule of thumb: if you want to ask *why* something happened, open MicroGemAI chat. If you want something *watched, investigated, and fixed* without you in the loop, that's MicroGemAgent.
Getting Started
1. Navigate to Account settings at microgemlabs.ai/account
2. Under MicroGemAI Configuration, enter your LLM API key
3. Open MicroGemAI - full page at microgemlabs.ai/gemai, or click the ✨ chat icon in the top nav of any page for a slide-out panel
4. Start chatting - MicroGemAI can see data from all your enabled products
Configuration (BYOK - Bring Your Own Key)
MicroGemAI uses YOUR LLM API key to process queries. Your data stays on the MicroGemLabs platform and is only sent to the LLM provider you choose. MicroGemLabs never stores or has access to your LLM provider credentials beyond what you configure.
Supported providers:
| Provider | Base URL | Recommended Model |
|---|---|---|
| OpenAI | https://api.openai.com/v1 (default) | gpt-4o-mini (fast, cheap) or gpt-4o (more capable) |
| Anthropic | https://api.anthropic.com/v1 | claude-sonnet-4-20250514 |
| Any OpenAI-compatible API | Custom URL | Varies |
- Max Tokens - Maximum response length (default: 4000)
- Temperature - Creativity vs precision (default: 0.3 - lower is more factual)
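These settings map onto a standard OpenAI-compatible chat request. A minimal sketch, assuming the default OpenAI endpoint (the config object and field names here are illustrative, not the platform's internal schema):

```ts
// Hypothetical sketch of how the BYOK settings translate to an OpenAI-compatible call.
// Values are examples only; configure the real ones under Account settings.
const config = {
  baseUrl: "https://api.openai.com/v1", // or any OpenAI-compatible endpoint
  apiKey: process.env.LLM_API_KEY!,     // your key; MicroGemLabs only forwards requests
  model: "gpt-4o-mini",
  maxTokens: 4000,
  temperature: 0.3,                      // lower = more factual
};

async function askGemAI(question: string) {
  const res = await fetch(`${config.baseUrl}/chat/completions`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${config.apiKey}`,
    },
    body: JSON.stringify({
      model: config.model,
      max_tokens: config.maxTokens,
      temperature: config.temperature,
      messages: [{ role: "user", content: question }],
    }),
  });
  return res.json();
}
```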
What MicroGemAI Can Do
Cross-Product Correlation
MicroGemAI's unique capability is connecting events across all six MicroGemLabs products. When you ask "why did our API go down?", it doesn't just look at PulseGuardPlus; it checks:
- PulseGuardPlus - Which monitors went down, when, and for how long
- CronKeeper - Did any cron jobs stop running around the same time?
- LogVault - What error logs appeared before and during the outage?
- CertGuard - Did any SSL certificates expire recently?
- CronRunner - Did any scheduled jobs fail?
- HookRelay - Did webhook forwarding start failing?
This cross-product view is something no single-product monitoring tool can provide.
Example Questions
Infrastructure health:
- "What's the current health of our infrastructure?"
- "Are there any active issues across all products?"
- "Which monitors have been flapping this week?"
- "Why did our API go down last night?"
- "What happened around 3 AM that caused the outage?"
- "Are the log errors related to the SSL cert expiry?"
- "Are any SSL certificates expiring soon?"
- "Which cron jobs have been running slower than usual?"
- "Show me error trends from the last 7 days"
- "Who is on call right now?"
- "What's our uptime over the last 30 days?"
- "How many incidents did we have this month?"
Persistent Memory
MicroGemAI builds up knowledge about your infrastructure over time. Unlike a stateless chatbot, it remembers what it learns across conversations.
How Memory Works
Explicit memory: Tell MicroGemAI something directly.
- "Remember that our staging database is on port 5433"
- "Note that Alice handles all database-related issues"
- "Keep in mind that deployments on Fridays tend to cause log spikes"
Memory Categories
| Category | Examples |
|---|---|
| Infrastructure | Server addresses, database ports, service architecture |
| Pattern | "API timeouts usually mean Redis is overloaded" |
| Runbook | "When the payment service crashes, restart the worker pods first" |
| Preference | "Team prefers Slack notifications over SMS for non-critical alerts" |
| Incident Learning | Root cause analyses from past incidents |
| Team Context | "Alice owns the payment service, Bob handles the database" |
Memory Lifecycle
Memories have a confidence score that starts at 0.8 and changes over time:
- Referenced frequently → Confidence increases (up to 1.0)
- Not referenced for 90+ days → Confidence decays by 0.1 per day
- Below 0.2 confidence → Moved to archive (searchable but not in default prompt)
- Archived 180+ days → Marked expired, deleted after 30-day grace period
This ensures the most relevant knowledge stays active while stale information fades naturally.
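A minimal sketch of how this lifecycle could be modeled; the thresholds match the documented values, but the data shape and the reinforce/sweep helpers are illustrative assumptions, not the platform's internal code:

```ts
// Illustrative model of the memory confidence lifecycle described above.
interface Memory {
  id: string;
  confidence: number;        // starts at 0.8
  lastReferencedAt: Date;
  status: "active" | "archived" | "expired";
}

const DECAY_AFTER_DAYS = 90;
const DECAY_PER_DAY = 0.1;
const ARCHIVE_BELOW = 0.2;

function daysBetween(a: Date, b: Date): number {
  return Math.floor((b.getTime() - a.getTime()) / 86_400_000);
}

// Called when a memory is cited in an answer: confidence climbs toward 1.0.
function reinforce(m: Memory): void {
  m.confidence = Math.min(1.0, m.confidence + 0.05); // increment size is an assumption
  m.lastReferencedAt = new Date();
}

// Periodic sweep: decay stale memories and archive those that fall below 0.2.
function sweep(m: Memory, now: Date): void {
  const idleDays = daysBetween(m.lastReferencedAt, now);
  if (idleDays > DECAY_AFTER_DAYS) {
    const staleDays = idleDays - DECAY_AFTER_DAYS;
    m.confidence = Math.max(0, m.confidence - staleDays * DECAY_PER_DAY);
  }
  if (m.confidence < ARCHIVE_BELOW) m.status = "archived";
}
```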
Semantic retrieval
Every memory is embedded with text-embedding-3-small (1536-dim) at write time and stored in pgvector. When you ask a question, MicroGemAI computes a query embedding from your message and pulls the top-K most similar memories using cosine similarity (ivfflat index), not just substring matches. Asking "why did the API slow down?" can surface a memory titled "Redis pool exhaustion correlates with latency spikes" even though none of those words appear in your question.
No separate embedding key is required - MicroGemAI uses your configured OpenAI-compatible endpoint, falling back to a hash-based pseudo-vector for providers that don't support embeddings (search degrades to keyword match in that case).
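A sketch of what that retrieval looks like with pgvector; the table and column names are assumptions for illustration:

```ts
// Illustrative pgvector retrieval: embed the question, then pull the top-K
// most similar memories by cosine distance.
import OpenAI from "openai";
import { Pool } from "pg";

const openai = new OpenAI(); // points at your configured OpenAI-compatible endpoint
const db = new Pool();

async function recallMemories(question: string, teamId: string, k = 5) {
  const { data } = await openai.embeddings.create({
    model: "text-embedding-3-small", // 1536-dim, as described above
    input: question,
  });
  const queryVector = JSON.stringify(data[0].embedding);

  // "<=>" is pgvector's cosine-distance operator; the ivfflat index accelerates it.
  const { rows } = await db.query(
    `SELECT id, title, body, 1 - (embedding <=> $1::vector) AS similarity
       FROM memories
      WHERE team_id = $2 AND status = 'active'
      ORDER BY embedding <=> $1::vector
      LIMIT $3`,
    [queryVector, teamId, k]
  );
  return rows;
}
```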
The full closed-loop story - citation tracking, confidence adjustment from outcomes, weekly consolidation, staleness demotion, the /agent/memory observability page - lives in the MicroGemAgent guide under "Recursive Self-Improvement," because that loop is what makes the *agent* better over time. MicroGemAI gemai mode reads from the same pool, so any memory the agent learns is immediately available in chat.
Memory Limits
Each team has a configurable cap on active memories (default: 100). When new knowledge pushes past the cap, the lowest-confidence memories are moved to archive. Archived memories are still searchable โ they're just not included in every conversation by default.
Memory Stats
The MicroGemAI page shows your current memory counts:
- Active - In the default prompt for every conversation
- Archive - Searchable, pulled in when relevant
- Total - All memories across both tiers
Incident Auto-Analysis
When any MicroGemLabs product creates an incident (PulseGuardPlus downtime, CronKeeper missed ping, LogVault alert fire, CertGuard expiry, CronRunner failure, HookRelay forward failure), MicroGemAI can automatically generate a cross-product analysis.
The analysis includes:
1. Summary - What happened in 1-2 sentences
2. Timeline - Key events across products leading up to the incident
3. Probable Root Cause - Best assessment based on all available data
4. Related Issues - Other products affected
5. Recommended Actions - Steps to resolve and prevent recurrence
Auto-analysis is triggered via Inngest events and runs asynchronously. The analysis is attached to the incident record and delivered through the on-call alert.
Weekly Insights
Every Monday at 8 AM, MicroGemAI scans the past 7 days of data across all products and generates proactive insights:
- Trends - Metrics improving or degrading over time
- Anomalies - Unusual patterns that deviate from baseline
- Correlations - Events in one product affecting another
- Recommendations - Preventive actions to take
Insights appear as banners on the MicroGemAI page and can be dismissed or marked as actioned.
Anomaly Detection
MicroGemAI includes statistical anomaly detection that runs hourly across four products. Instead of binary up/down checks, it detects when metrics are trending abnormally compared to a rolling 7-day baseline.
What's Monitored
| Product | Metric | What It Catches |
|---|---|---|
| PulseGuardPlus | Response time | API getting slower before it goes down |
| CronKeeper | Execution duration | Backup job taking 3x longer than normal |
| LogVault | Error rate (per hour) | Error spike even if no threshold rule is set |
| CronRunner | Execution time | Scheduled job slowing down over time |
How It Works
For each metric, MicroGemAI computes a rolling baseline: mean, standard deviation, and percentiles (p50, p95, p99) from 7 days of historical data. Every hour, it compares the current value against this baseline using a Z-score (how many standard deviations from the mean).
When the Z-score exceeds the sensitivity threshold, MicroGemAI creates an anomaly insight. Critical anomalies (1.5x the threshold) are automatically routed through on-call.
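A condensed sketch of that computation; the helper names are illustrative, not the platform's code:

```ts
// Illustrative baseline + Z-score check, following the description above.
// `history` is the last 7 days of hourly values for one metric.
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function stdDev(xs: number[]): number {
  const m = mean(xs);
  return Math.sqrt(mean(xs.map((x) => (x - m) ** 2)));
}

function zScore(current: number, history: number[]): number {
  const sd = stdDev(history);
  if (sd === 0) return 0; // flat baseline: nothing to compare against
  return (current - mean(history)) / sd;
}

// sensitivity defaults to 2.5 (the "Normal" preset); critical at 1.5x the threshold.
function classify(current: number, history: number[], sensitivity = 2.5) {
  const z = Math.abs(zScore(current, history));
  if (z >= sensitivity * 1.5) return "critical"; // routed through on-call
  if (z >= sensitivity) return "anomaly";        // surfaced as an insight
  return "normal";
}
```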
Configuring Sensitivity
Navigate to Ops → Anomalies. Each baseline card has a ⚙ gear icon that opens an inline settings panel with a sensitivity slider and named presets:
| Preset | Z-Score | Best For |
|---|---|---|
| Very Sensitive | 1.5 | Critical services, error rates |
| Sensitive | 2.0 | Error rates, important endpoints |
| Normal | 2.5 | Most metrics (default) |
| Relaxed | 3.0 | Naturally variable metrics |
| Very Relaxed | 4.0 | Only major deviations |
You can also set sensitivity in bulk, per product or globally, using the controls at the top of the Anomalies page.
Maintenance Window Integration
Data collected during maintenance windows is automatically excluded from baseline computation when "Exclude from baselines" is enabled on the window. This prevents deployment spikes from polluting your baselines. See the Maintenance Windows section below.
Maintenance Windows
Planned maintenance windows suppress alerting while continuing to monitor. Your on-call team won't be paged during a scheduled deployment, but all monitoring data is still recorded for postmortem analysis.
Quick Suppress
Navigate to Ops → Maintenance. The Quick Suppress panel lets you start a window immediately with one click. Choose a scope (All Products or a specific product), optionally add a reason, and select a duration (15m, 30m, 1h, 2h, 4h).
Schedule a Window
Use the Schedule Window form to plan maintenance for a future time. You can independently control two behaviors:
- Suppress alerts - Skip incident creation and on-call routing during the window
- Exclude from baselines - Don't include data from this period in anomaly detection baselines
Both default to on, but you can keep baseline exclusion while still allowing alerts for partial maintenance that shouldn't suppress everything.
Scope Levels
| Scope | What's Suppressed | Use Case |
|---|---|---|
| Global | All products, all resources | Full infrastructure maintenance |
| Product | All resources in one product | Deploying a specific service |
| Resource | One specific monitor, check, stream, etc. | Migrating a single database server |
The most specific scope wins - a resource-level window takes precedence over a global window.
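A sketch of how that precedence could be evaluated; the types and names are illustrative assumptions:

```ts
// Illustrative precedence check: resource > product > global.
// Real windows carry more fields (reason, baseline exclusion, start/end times, etc.).
type Scope =
  | { level: "global" }
  | { level: "product"; product: string }
  | { level: "resource"; product: string; resourceId: string };

interface MaintenanceWindow { scope: Scope; suppressAlerts: boolean; }

function activeWindowFor(
  windows: MaintenanceWindow[],
  product: string,
  resourceId: string
): MaintenanceWindow | undefined {
  // Most specific scope wins, so check resource-level windows first.
  return (
    windows.find((w) => w.scope.level === "resource" && w.scope.resourceId === resourceId) ??
    windows.find((w) => w.scope.level === "product" && w.scope.product === product) ??
    windows.find((w) => w.scope.level === "global")
  );
}
```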
Active Window Management
Active windows appear as amber banners at the top of the Maintenance page. You can extend a running window (+30m or +1h buttons) if maintenance is taking longer than expected, or end it immediately with the "End Now" button to resume alerting.
Windows complete automatically when their end time is reached (checked every minute by a background job).
Predictive Alerting
Extends anomaly detection with trend extrapolation. Uses linear regression over 48 hours of hourly-aggregated data to predict where metrics are heading. Example: "Response time for API Gateway increasing at 12ms/hour. Current: 180ms. Predicted: 468ms in 24 hours. Threshold: 450ms. (Trend confidence: 72%)"
Enable at Anomalies → Detection Settings → toggle Prediction on. Configurable horizon from 6 hours to 7 days (default: 24 hours). Requires R² > 0.3 (30% trend reliability) and positive slope (degrading) to generate predictions. Warning alerts at 1.5× baseline p95, critical at 2× baseline p99.
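A sketch of the underlying extrapolation, assuming a simple least-squares fit over hourly points; the helper name and return shape are illustrative:

```ts
// Illustrative least-squares trend fit over 48 hourly points, as described above.
// points[i] = metric value at hour i; horizonHours defaults to 24.
function predictTrend(points: number[], horizonHours = 24) {
  const n = points.length;
  const xs = points.map((_, i) => i);
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = points.reduce((a, b) => a + b, 0) / n;

  let sxy = 0, sxx = 0, syy = 0;
  for (let i = 0; i < n; i++) {
    sxy += (xs[i] - meanX) * (points[i] - meanY);
    sxx += (xs[i] - meanX) ** 2;
    syy += (points[i] - meanY) ** 2;
  }

  const slope = sxy / sxx;                              // change per hour (e.g. +12ms/hour)
  const intercept = meanY - slope * meanX;
  const r2 = syy === 0 ? 0 : (sxy * sxy) / (sxx * syy); // trend reliability

  // Only degrading trends with R² > 0.3 produce a prediction.
  if (r2 <= 0.3 || slope <= 0) return null;

  const current = points[n - 1];
  const predicted = intercept + slope * (n - 1 + horizonHours);
  return { slope, r2, current, predicted };
}
```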
Detection Settings
Four independent team-level toggles on the Anomalies page:
| Toggle | Default | What It Does |
|---|---|---|
| Anomaly Detection | ON | Compute baselines, flag Z-score deviations |
| Anomaly Alerting | ON | Route critical anomalies through on-call |
| Predictive | OFF | Extrapolate trends, forecast breaches |
| Predictive Alerting | OFF | Route critical predictions through on-call |
Detection and alerting are independent - you can detect anomalies without routing them to on-call (they appear as insights in the dashboard instead). Disabling detection auto-disables alerting for that feature.
Runbook Actions
Pre-defined HTTP calls that MicroGemAI can suggest or execute during incidents. Manage at Skills → filter by Runbook (/agent/skills?type=runbook). The legacy /runbooks URL redirects to that filtered view; if you bookmarked the old page, the redirect carries you through.
Creating a Runbook
Define the HTTP target (URL, method, headers, body with {{variable}} support), trust level, trigger pattern, and safety limits. Categories: restart, scale, cache, rollback, DNS, custom.
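A sketch of the {{variable}} substitution idea; the runbook shape and substitution helper are illustrative, and the real template engine may differ:

```ts
// Illustrative {{variable}} substitution for a runbook's HTTP target.
interface RunbookTemplate {
  url: string;
  method: "GET" | "POST" | "PUT" | "DELETE";
  headers: Record<string, string>;
  body?: string; // may contain {{variable}} placeholders
}

function renderTemplate(text: string, vars: Record<string, string>): string {
  // Replace each {{name}} with its value; leave unknown placeholders untouched.
  return text.replace(/\{\{(\w+)\}\}/g, (_, name) => vars[name] ?? `{{${name}}}`);
}

// Example (hypothetical): a "restart worker" runbook parameterized by service name.
const restartWorkers: RunbookTemplate = {
  url: "https://deploy.example.com/api/services/{{service}}/restart",
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ reason: "{{reason}}" }),
};

const renderedUrl = renderTemplate(restartWorkers.url, { service: "payment-worker" });
const renderedBody = renderTemplate(restartWorkers.body!, { reason: "MicroGemAI runbook execution" });
```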
Trust Levels
| Level | Behavior | Use Case |
|---|---|---|
| Manual | Human clicks Execute. AI suggests only. | Starting point for all actions |
| Auto with Approval | AI triggers, approval link sent to on-call. Timeout escalates. | Well-tested remediation |
| Full Auto | AI triggers and executes immediately. No human in the loop. | Low-risk operations (cache clear, worker restart) |
Incident Integration
When MicroGemAI auto-analyzes an incident, it runs findMatchingRunbooks() in parallel with context assembly. Matching runbooks are included in the LLM prompt so the analysis references them by name. Templates with full_auto trust and 70%+ match confidence execute immediately. Auto-approval templates trigger the approval workflow. Manual templates are suggested only.
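As a rough sketch of that dispatch logic (the match shape, threshold handling, and helper functions are assumptions; only the trust rules come from the description above):

```ts
// Illustrative trust-level dispatch after runbook matching.
type Trust = "manual" | "auto_approval" | "full_auto";

interface RunbookMatch {
  runbookId: string;
  trust: Trust;
  confidence: number; // 0..1 match confidence against the incident
}

async function dispatchRunbooks(matches: RunbookMatch[]) {
  for (const m of matches) {
    if (m.trust === "full_auto" && m.confidence >= 0.7) {
      await executeRunbook(m.runbookId);        // executes immediately, no human in the loop
    } else if (m.trust === "auto_approval") {
      await requestOnCallApproval(m.runbookId); // approval link sent to on-call
    }
    // "manual" runbooks are only suggested in the analysis text.
  }
}

// Placeholders for illustration only.
async function executeRunbook(id: string) { /* HTTP call defined by the runbook */ }
async function requestOnCallApproval(id: string) { /* trigger the approval workflow */ }
```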
Safety
Every execution is audited with who/what triggered it, the full HTTP request (auth headers redacted) and response, duration, linked incident, and AI reasoning. Cooldown prevents rapid re-execution (configurable, default 15 minutes). Circuit breaker stops after max daily runs (configurable, default 10).
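A sketch of the cooldown and daily circuit breaker described above; the names and in-memory history are assumptions for illustration:

```ts
// Illustrative safety gate: cooldown between executions plus a daily run cap.
interface ExecutionLog { runbookId: string; executedAt: Date; }

function canExecute(
  runbookId: string,
  history: ExecutionLog[], // assumed to be in chronological order
  now: Date,
  cooldownMinutes = 15,    // default cooldown
  maxDailyRuns = 10        // default circuit breaker
): boolean {
  const runs = history.filter((e) => e.runbookId === runbookId);

  const lastRun = runs.at(-1);
  const cooledDown =
    !lastRun || now.getTime() - lastRun.executedAt.getTime() >= cooldownMinutes * 60_000;

  const runsToday = runs.filter(
    (e) => e.executedAt.toDateString() === now.toDateString()
  ).length;

  return cooledDown && runsToday < maxDailyRuns;
}
```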
MCP Access
External agents discover and execute runbooks via 4 MCP tools: list_runbooks, execute_runbook, check_execution_status, suggest_runbook. Trust levels apply to MCP agents the same as to MicroGemAI.
Postmortem Generation
When incidents resolve (lasting 5+ minutes), MicroGemAI auto-generates a structured postmortem by assembling:
- Incident events and timeline from the source product
- Cross-product context during the incident window (all 6 products)
- GemAnalysis if one was generated during the incident
- Runbook executions that were triggered
- Persistent memories for institutional context
The LLM generates a JSON response containing: executive summary, impact assessment, chronological timeline, root cause, contributing factors, resolution steps, prioritized action items with severity, and lessons learned.
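The shape of that JSON might look roughly like the following; field names are illustrative and the real schema may differ:

```ts
// Illustrative shape of the generated postmortem JSON, mirroring the sections above.
interface Postmortem {
  executiveSummary: string;
  impact: string;
  timeline: { timestamp: string; event: string; product: string }[];
  rootCause: string;
  contributingFactors: string[];
  resolutionSteps: string[];
  actionItems: { description: string; severity: "low" | "medium" | "high" | "critical" }[];
  lessonsLearned: string[];
}
```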
Postmortems are stored as drafts. Every section is editable inline on the detail page. Teams can regenerate (re-run AI analysis from scratch), publish, unpublish, or delete. Action items have priority badges and checkboxes. Runbooks executed during the incident are listed with success/fail status.
The postmortem is also stored as a MicroGemAI memory entry, so future incidents can reference past resolutions.
Messaging Gateway
Chat with MicroGemAI from Telegram or Slack. Configure at Settings → Integrations.
Auto-Discovery
Message the bot from an unconnected chat and it responds with your Chat ID (Telegram) or Channel ID (Slack) plus setup instructions. Copy the ID into the Integrations page to connect.
Quick Commands
| Command | Response |
|---|---|
| status / health | Infrastructure health summary with monitor, cron, log, cert counts |
| incidents | Active and recent incidents across all products (24h) |
| oncall | Who's on call + active alerts |
| run <name> | Execute a runbook action (if the channel has execute permission) |
| remember <fact> | Save to MicroGemAI persistent memory |
Any other message goes through the full MicroGemAI pipeline: product context assembly + memory retrieval + doc search + LLM call. Conversation history is maintained for continuity.
Alert Delivery
Personal DM alerts: Add your Telegram Chat ID to Account → Contact Preferences. On-call alerts are delivered to your Telegram alongside SMS/email/voice/Slack. The alert includes an acknowledge link and "Reply to ask MicroGemAI for analysis."
Team channel broadcast: Enable "Alerts" on any messaging connection in the Integrations page. All incidents for that team are posted to the connected Telegram group or Slack channel automatically.
Conversations
Chat sessions are saved and accessible from the sidebar. You can return to a previous conversation to continue an investigation or reference past analysis. Each conversation maintains its full message history.
Two Ways to Chat
Two surfaces share the same conversation history โ pick whichever fits your context:
| Surface | Where | Best for |
|---|---|---|
| Full page | microgemlabs.ai/gemai | Long sessions, deep investigation, side-by-side with another tab |
| Slide-out panel | ✨ icon in any hub-page header | Quick questions while you're already on /services, /anomalies, /postmortems, etc. - panel state survives navigation |
The slide-out panel is the same component agent-tier users see, with destructive tools (propose_fix, execute_skill, notify_team) hidden. The full read-only investigation surface IS available - gemai mode can do real work, it just can't change anything.
Tools available in gemai mode
| Tool | What it does |
|---|---|
| search_docs | Full-text search across the published docs (tsvector, ranked) |
| search_memory | Semantic search across team memories (pgvector cosine) |
| expand_memory | Pull the full body of a specific memory by ID |
| search_skills | Find knowledge skills, executable skills, and runbooks by name + content |
| gather_health | Cross-product snapshot: monitors up/down, recent incidents, cron pings, log error rate |
| query_mcp | Read-only access to the same 11 MCP read tools the agent uses (list_monitors, recent_incidents, etc.) |
| recall_recent_alerts | Last N on-call alerts with timestamps and channels |
| list_postmortems / get_postmortem | Index of past postmortems and full text of a specific one |
| find_similar_past_incidents | Semantic match against the incident history pool |
| search_past_sessions | Search MicroGemAgent session transcripts (helpful for "what did the agent see at 3 AM?") |
| web_search | Tavily-backed general web search for unknown errors / CVEs |
| fetch_url | Read a specific page, SSRF-guarded against internal hosts |
| check_third_party_status | Fast in/out check against 14 provider status pages (GitHub, AWS, Stripe, OpenAI, Anthropic, Cloudflare, Vercel, Supabase, Datadog, Sentry, Slack, Discord, PagerDuty, Atlassian) |
What's deliberately *not* in gemai mode: propose_fix, execute_skill, notify_team, prefill_form_link, update_skill, create_knowledge_skill, create_runbook. Those require an active MicroGemAgent subscription.
Panel features
- Markdown rendering - fenced code blocks render as styled tiles with copy buttons; inline code gets a colored pill; internal links use the Next router so the panel stays open across hub navigation; external links open in a new tab.
- Voice input - mic button (sends to OpenAI's gpt-4o-mini-transcribe; requires that model to be enabled in your OpenAI project). Click once to record, again to stop, or walk away: it auto-stops on ~1.5s of silence and submits.
- ↑ / ↓ history - terminal-style recall of your prior messages, persisted across conversations on this device (localStorage, capped at 100 entries).
- Expand to full screen + drag-to-resize the panel width - both desktop-only.
- Smart auto-scroll - only follows new tokens when you're already at the bottom; a "↓ Latest" button appears when you scroll up.
On mobile
- The chat panel covers the full viewport on phones - touch keyboards already consume too much screen for a half-overlay to feel right.
- Clicking any internal link (deep link, /postmortems/..., /skills/...) auto-closes the panel so the destination is visible immediately.
- The "thinking" indicator is a pulsating gem rendered in pure CSS - theme-aware, respects prefers-reduced-motion.
- The ✨ trigger and 🔔 bell are present in every product header (PulseGuardPlus, CronKeeper, LogVault, CertGuard, CronRunner, HookRelay) - not just the hub. Ask MicroGemAI a question without leaving the page.
- Bottom nav (Home / Agent / Chat / Incidents) on screens narrower than md keeps the assistant always one tap away.
- Right-click → "Filter to / Filter out / Sort" works on every list cell on desktop; long-press is the equivalent on mobile.
Tips for Better Results
- Be specific: "Why did the API gateway monitor fail at 3:15 AM?" gets better results than "what happened?"
- Reference products by name: "Check LogVault for errors" helps MicroGemAI focus its search
- Tell it what you know: "We deployed version 2.4.1 at 2 AM" gives context that isn't in the monitoring data
- Ask it to remember: Explicitly say "remember this" when sharing infrastructure knowledge
- Ask for correlations: "Is the CronKeeper failure related to the LogVault errors?" prompts cross-product analysis