MCP Integration Guide

Overview

MicroGemLabs exposes an MCP server that lets any MCP-compatible AI agent query your monitoring, logging, and alerting data. Agents such as Hermes Agent, Claude, Cursor, or your own custom agent can access your infrastructure data through a standard protocol. Data access is read-only; the only tools with side effects are the runbook Action Tools described later in this guide.

The MCP server provides 15 tools covering all six MicroGemLabs products plus the MicroGemAI memory system.

Authentication

All MCP tool calls require a team API key. Generate one in Settings → Account → API Keys.

The API key is passed as a Bearer token in the MCP server connection:

Authorization: Bearer mgl_team_xxxxxxxxxxxxx

API keys are scoped to a single team. All data returned is limited to that team's resources.
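
As a minimal sketch of keeping the key out of source files, the header can be built from an environment variable. The variable name `MGL_API_KEY` and the helper `auth_headers` are assumptions made for this illustration, not part of the MicroGemLabs API.

```python
import os


def auth_headers(env_var: str = "MGL_API_KEY") -> dict[str, str]:
    """Build the Bearer Authorization header from an environment variable.

    MGL_API_KEY is a hypothetical variable name chosen for this example.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set {env_var} to your team API key")
    return {"Authorization": f"Bearer {key}"}
```

The returned dictionary can be passed as the request headers of whichever MCP client you use.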

Connecting Your Agent

Hermes Agent

Add to your Hermes Agent MCP configuration:

# ~/.hermes/mcp_servers.yaml
servers:
  - name: microgemlabs
    url: https://mcp.microgemlabs.ai
    auth:
      type: bearer
      token: mgl_team_xxxxxxxxxxxxx

Claude Desktop / Claude Code

Add to your MCP server configuration:

{
  "mcpServers": {
    "microgemlabs": {
      "type": "url",
      "url": "https://mcp.microgemlabs.ai/sse",
      "headers": {
        "Authorization": "Bearer mgl_team_xxxxxxxxxxxxx"
      }
    }
  }
}

Custom Agents

Any agent that supports the MCP protocol can connect. The server endpoint is:

https://mcp.microgemlabs.ai/sse
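
MCP messages are JSON-RPC 2.0, and a tool invocation uses the `tools/call` method with the tool name and its arguments. The sketch below only shows the shape of such a message; the helper name `build_tool_call` is an assumption for illustration, and in practice an MCP client library would normally handle the initialization handshake and the transport for you.

```python
import json
from typing import Any


def build_tool_call(request_id: int, tool: str, arguments: dict[str, Any]) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message)


# Example: ask for monitors that are currently down.
payload = build_tool_call(1, "query_monitors", {"status": "down"})
```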

Available Tools

query_monitors

Get the current status of all uptime monitors (PulseGuardPlus).

Parameters:

| Param | Type | Description |
| --- | --- | --- |
| status | string | Filter: up, down, or all (default) |
| type | string | Filter: http, tcp, ping, dns, ssl, heartbeat |

Returns: Monitor name, target URL, type, current status, latest response time.

Example prompt: "Check if any of my monitors are down"

query_incidents

Get recent incidents across ALL products: PulseGuardPlus downtime, CronKeeper missed pings, LogVault alerts, CertGuard expiry warnings, CronRunner failures, and HookRelay forward failures.

Parameters:

| Param | Type | Description |
| --- | --- | --- |
| product | string | Filter by product or all (default) |
| status | string | active, resolved, or all (default) |
| hours | number | Look-back period (default: 24) |

Example prompt: "What incidents have occurred in the last 48 hours?"

search_logs

Search log entries in LogVault with full-text search.

Parameters:

| Param | Type | Required | Description |
| --- | --- | --- | --- |
| query | string | Yes | Full-text search query |
| level | string | No | debug, info, warn, error, fatal |
| stream | string | No | Stream name filter |
| hours | number | No | Look-back period (default: 1) |
| limit | number | No | Max entries (default: 20) |

Example prompt: "Search logs for database timeout errors in the last 6 hours"
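
For a custom agent, it can help to validate arguments before sending them. The following is a minimal sketch that mirrors the parameter table above; the helper name `search_logs_args` is hypothetical, and how the server treats invalid values is not specified here.

```python
from typing import Any, Optional

# Levels listed in the search_logs parameter table.
LOG_LEVELS = ("debug", "info", "warn", "error", "fatal")


def search_logs_args(
    query: str,
    level: Optional[str] = None,
    stream: Optional[str] = None,
    hours: float = 1,
    limit: int = 20,
) -> dict[str, Any]:
    """Build the arguments object for search_logs using the documented defaults."""
    if not query.strip():
        raise ValueError("query is required")
    if level is not None and level not in LOG_LEVELS:
        raise ValueError(f"level must be one of {LOG_LEVELS}")
    args: dict[str, Any] = {"query": query, "hours": hours, "limit": limit}
    # Optional filters are only sent when the caller sets them.
    if level is not None:
        args["level"] = level
    if stream is not None:
        args["stream"] = stream
    return args
```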

query_cron_checks

Get cron job monitor status from CronKeeper.

Parameters:

| Param | Type | Description |
| --- | --- | --- |
| status | string | up, late, down, or all (default) |

Example prompt: "Are any cron jobs late or down?"

query_certs

Get SSL certificate and domain registration status from CertGuard.

Parameters:

| Param | Type | Description |
| --- | --- | --- |
| status | string | healthy, warning, critical, expired, or all |
| expiring_within_days | number | Only return certs expiring within N days |

Example prompt: "Which SSL certificates expire within the next 14 days?"

query_scheduled_jobs

Get scheduled HTTP job status from CronRunner.

Parameters:

| Param | Type | Description |
| --- | --- | --- |
| status | string | success, failed, or all (default) |

Example prompt: "Are any scheduled jobs failing?"

query_webhooks

Get webhook endpoint status from HookRelay.

Parameters:

| Param | Type | Description |
| --- | --- | --- |
| with_failures | boolean | Only return endpoints with failures (default: false) |

Example prompt: "Which webhook endpoints have forwarding failures?"

get_oncall_status

Get current on-call schedule status.

Parameters: None

Returns: Who is currently on call, active alerts, escalation policy status.

Example prompt: "Who is on call right now and are there any active alerts?"

get_platform_health

Comprehensive health overview of ALL products in one call. This is the most useful tool for general "how are things?" queries.

Parameters:

| Param | Type | Description |
| --- | --- | --- |
| hours | number | Look-back period (default: 24) |

Returns: Summary counts, active issues, and key metrics across all six products.

Example prompt: "Give me a full health check of our infrastructure"

search_memory

Search MicroGemAI's accumulated knowledge base.

Parameters:

| Param | Type | Required | Description |
| --- | --- | --- | --- |
| query | string | Yes | Search query |
| category | string | No | Filter: infrastructure, pattern, runbook, preference, incident_learning, team_context, or all |

Returns: Matching knowledge entries with category, content, and confidence score.

Example prompt: "What do we know about our database architecture?"

correlate_events

Find events across ALL products that occurred around a specific time. This is the most powerful tool for incident investigation.

Parameters:

| Param | Type | Required | Description |
| --- | --- | --- | --- |
| timestamp | ISO 8601 | Yes | Center of the search window |
| window_minutes | number | No | Minutes before and after (default: 30) |

Returns: Events from all products within the time window, sorted chronologically.

Example prompt: "What happened across all our monitoring between 2:00 AM and 3:00 AM today?"
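
Because the tool takes a center timestamp and a symmetric window rather than a start and an end, a prompt such as "between 2:00 AM and 3:00 AM" has to be converted. The sketch below shows one way to do that conversion; the helper name `range_to_correlate_args` is hypothetical, and rounding the window up to whole minutes is an assumption about what the server accepts.

```python
import math
from datetime import datetime, timezone


def range_to_correlate_args(start: datetime, end: datetime) -> dict:
    """Convert a [start, end] range into timestamp + window_minutes."""
    if start.tzinfo is None or end.tzinfo is None:
        raise ValueError("use timezone-aware datetimes")
    if end <= start:
        raise ValueError("end must be after start")
    half = (end - start) / 2
    center = (start + half).astimezone(timezone.utc)
    return {
        # ISO 8601 in UTC with a trailing Z, as in the session examples.
        "timestamp": center.isoformat().replace("+00:00", "Z"),
        # Assumption: round up so the window always covers the full range.
        "window_minutes": math.ceil(half.total_seconds() / 60),
    }


args = range_to_correlate_args(
    datetime(2026, 4, 20, 2, 0, tzinfo=timezone.utc),
    datetime(2026, 4, 20, 3, 0, tzinfo=timezone.utc),
)
```

For the 2:00 AM to 3:00 AM range this centers the search at 2:30 AM with a 30-minute window on each side.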

Example Agent Sessions

Incident Investigation with Hermes Agent

You: What caused the API outage at 3 AM?

Hermes Agent:
  → [calls get_platform_health]
  → [calls query_incidents with hours=12]
  → [calls correlate_events with timestamp=2026-04-20T03:00:00Z]
  → [calls search_logs with query="error" hours=2]

"Based on the data, here's what happened:

The API gateway monitor went down at 2:58 AM. LogVault shows a spike of 
'SSL handshake failed' errors starting at 2:45 AM. CertGuard reports that 
the api.example.com certificate expired at 2:30 AM. CronKeeper shows the 
cert-renewal cron job has been in 'late' status for 3 days.

Root cause: The SSL certificate renewal cron job stopped running, the cert 
expired, and the API started rejecting HTTPS connections.

Fix: Renew the cert manually, then investigate why the renewal cron failed."

Proactive Check with Claude

You: @microgemlabs Are there any issues I should know about?

Claude:
  → [calls get_platform_health]
  → [calls query_certs with expiring_within_days=14]

"Your infrastructure looks mostly healthy. Two items to note:

1. The staging.example.com SSL cert expires in 9 days. Consider 
   renewing early.
2. CronRunner shows the 'analytics-sync' job has 2 consecutive 
   failures. The last error was 'Connection refused'; your analytics
   endpoint may be down."

Action Tools

These tools allow agents to interact with the runbook system (now part of the unified Skills library; see /agent/skills?type=runbook). Trust levels apply; see Runbook Actions in the MicroGemAI guide.

list_runbooks

List available runbook action templates for this team.

Parameters:
  • category - Filter: restart, scale, cache, rollback, dns, custom, or all (default)
  • active_only - Boolean, default true

execute_runbook

Trigger a runbook action. Trust levels apply: manual templates execute immediately when called by an authenticated agent; auto_approval templates return a pending status with an approval link; full_auto templates execute immediately.

Parameters:
  • template_id (required) - ActionTemplate ID from list_runbooks
  • incident_id - Link execution to an incident
  • incident_source - Product that created the incident
  • variables - Key-value pairs for {{variable}} substitution
  • reasoning - Why you're executing this (stored in audit trail)
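
To illustrate what {{variable}} substitution means, here is a minimal sketch of filling placeholders from a key-value map. The server's actual templating rules (whitespace handling, behavior on missing variables) are not documented in this guide, so the function `render_template` and its strictness are assumptions for illustration only.

```python
import re

# Matches {{name}} and tolerates inner whitespace, e.g. {{ name }} (assumed).
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_template(template: str, variables: dict[str, str]) -> str:
    """Replace each {{name}} placeholder with its value from variables."""

    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            # Assumption: fail loudly rather than leave a placeholder behind.
            raise KeyError(f"missing variable: {name}")
        return str(variables[name])

    return _PLACEHOLDER.sub(replace, template)
```

Checking locally that every placeholder has a value can catch a missing variable before the runbook is triggered.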

check_execution_status

Poll for the result of a runbook execution.

Parameters:
  • execution_id (required) - ActionExecution ID from execute_runbook

Returns: Status (pending/running/success/failed/approval_pending/cancelled), HTTP response status, duration, error message.
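
A polling loop around this tool might look like the sketch below. It assumes that success, failed, and cancelled are the terminal statuses and that approval_pending keeps waiting for a human decision; neither is confirmed by this guide. The `fetch_status` callable is a hypothetical stand-in for whatever your client uses to call check_execution_status.

```python
import time
from typing import Callable

# Assumption: these statuses are final; the others may still change.
TERMINAL_STATUSES = {"success", "failed", "cancelled"}


def wait_for_execution(
    fetch_status: Callable[[], str],
    interval: float = 2.0,
    max_polls: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until a terminal status is seen or max_polls is reached.

    Returns the last status observed, which may be non-terminal on timeout.
    """
    status = fetch_status()
    for _ in range(max_polls - 1):
        if status in TERMINAL_STATUSES:
            return status
        sleep(interval)
        status = fetch_status()
    return status
```

With the default 2-second interval, 30 polls also stays well under the rate limit of 60 requests per minute per team.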

suggest_runbook

Find matching runbooks for an incident description. Uses trigger pattern matching and past execution history.

Parameters:
  • product (required) - Where the incident occurred
  • incident_title (required) - Description of the incident
  • error_message - Error message for better matching

Returns: Ranked list of matching templates with confidence scores.

Example: Full Incident Investigation + Auto-Fix

You: My API is down, check MicroGemLabs and fix it.

Agent:
  → [calls get_platform_health]
  → [calls query_incidents with product=pulseguardplus, status=active]
  → [calls search_logs with query="error" hours=1]
  → [calls correlate_events with timestamp of the incident]
  → [calls suggest_runbook with product=pulseguardplus, incident_title="API Gateway DOWN"]
  → [calls execute_runbook with template_id of "Restart API Workers"]
  → [calls check_execution_status to verify success]
  → [calls query_monitors to confirm recovery]

"The API Gateway went down at 3:02 AM due to connection pool exhaustion.
LogVault showed 'max connections reached' errors starting at 2:45 AM. I
executed the 'Restart API Workers' runbook; workers restarted in 3.2s
and the API is now responding normally (avg 142ms)."

Rate Limits

MCP tool calls are rate-limited to 60 requests per minute per team. For bulk queries, use get_platform_health (one call for everything) instead of querying each product individually.
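
A custom agent can also throttle itself on the client side so it does not exceed the limit. The sketch below is a simple sliding-window limiter; whether the server counts requests in a sliding or a fixed window is not stated here, so treat this as an approximation. The class name `MinuteRateLimiter` is hypothetical.

```python
import time
from collections import deque
from typing import Callable


class MinuteRateLimiter:
    """Allow at most max_calls within any rolling period of seconds."""

    def __init__(
        self,
        max_calls: int = 60,
        period: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_calls = max_calls
        self._period = period
        self._clock = clock
        self._calls: deque = deque()  # timestamps of recent accepted calls

    def try_acquire(self) -> bool:
        """Record a call and return True if it fits in the budget."""
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._calls and now - self._calls[0] >= self._period:
            self._calls.popleft()
        if len(self._calls) < self._max_calls:
            self._calls.append(now)
            return True
        return False
```

Call `try_acquire()` before each tool call and back off when it returns False.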

Data Scope

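All tools return data scoped to the authenticated team. You cannot access other teams' data. The query tools are read-only: none of them can create, modify, or delete resources. The only tools with side effects are the runbook Action Tools (such as execute_runbook), which are governed by trust levels. Use the MicroGemLabs dashboard for all other write operations.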