neural-thinker committed on
Commit 3c90182 · 1 Parent(s): da991b7

feat(monitoring): implement comprehensive Prometheus metrics system


* Add complete Prometheus metrics instrumentation for agents
* Implement investigation lifecycle tracking metrics
* Add anomaly detection metrics with severity classification
* Track transparency API usage and data processing metrics
* Create agent task execution metrics with retry/failure tracking
* Add /health/metrics endpoint for Prometheus scraping
* Add /health/metrics/json endpoint for JSON format debugging
* Update documentation with monitoring capabilities

Technical Implementation:
- BaseAgent: Universal metrics for all agent executions
- ZumbiAgent: Specialized investigation and anomaly detection metrics
- Health Routes: Prometheus and JSON metrics endpoints
- Core Monitoring: Extended metrics collection infrastructure

Metrics Categories:
- API Performance: Request duration, throughput, status codes
- Agent Operations: Task execution, retries, failures, duration
- Investigations: Lifecycle tracking, success/failure rates
- Anomaly Detection: Types, severity levels, detection rates
- Data Processing: Records processed, API calls, external data

Production Ready:
- Prometheus format compliance
- Grafana dashboard compatibility
- Performance-optimized metric collection
- Comprehensive observability coverage
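
The Grafana compatibility claim rests on standard PromQL over the metric names introduced below. As an illustrative sketch only, dashboard-style queries could be issued against a Prometheus server that scrapes `/health/metrics`; the server address and the specific queries are assumptions, while the metric names come from this commit:

```python
# Hypothetical PromQL queries over the metrics added in this commit, run
# against the Prometheus HTTP API (the localhost address is an assumption).
import requests

PROMETHEUS_URL = "http://localhost:9090/api/v1/query"

queries = {
    "anomalies_per_minute": 'rate(cidadao_ai_anomalies_detected_total[5m]) * 60',
    "failed_investigations": 'sum(cidadao_ai_investigations_total{status="failed"})',
    "p95_investigation_seconds": (
        'histogram_quantile(0.95, '
        'rate(cidadao_ai_investigation_duration_seconds_bucket[5m]))'
    ),
}

for name, query in queries.items():
    payload = requests.get(PROMETHEUS_URL, params={"query": query}, timeout=10).json()
    print(name, payload["data"]["result"])
```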

README.md CHANGED
@@ -87,8 +87,16 @@ curl -X POST "https://your-space-url.hf.space/api/agents/zumbi/investigate" \
     "data_source": "contracts",
     "max_results": 100
   }'
+
+# Access Prometheus metrics
+curl -X GET "https://your-space-url.hf.space/health/metrics"
 ```
 
+### Monitoring Endpoints
+- `GET /health/metrics` - Prometheus metrics for monitoring
+- `GET /health/metrics/json` - System metrics in JSON format
+- `GET /health/detailed` - Detailed health check with system information
+
 ## 🛡️ Enterprise Features
 
 ### 🏗️ **Architecture**
@@ -105,9 +113,14 @@ curl -X POST "https://your-space-url.hf.space/api/agents/zumbi/investigate" \
 - **Secrets management** integrated with HashiCorp Vault
 
 ### 📊 **Complete Observability**
-- **Custom Prometheus metrics** for transparency analyses
+- **Custom Prometheus metrics** implemented in production:
+  - API metrics (requests, duration, throughput)
+  - Agent metrics (executions, failures, retries, duration)
+  - Investigation metrics (anomalies detected, types, severity)
+  - Data metrics (records processed, transparency API)
+- **/health/metrics endpoint** exposing metrics to Grafana/Prometheus
 - **Structured JSON logging** with correlation IDs
-- **Detailed health checks** for all components
+- **Detailed health checks** for all components
 - **Automatic documentation** with OpenAPI/Swagger
 
 ### ⚡ **Performance & Scalability**
@@ -147,6 +160,8 @@ curl -X POST "https://your-space-url.hf.space/api/agents/zumbi/investigate" \
 - ✅ **Multi-Agent System**: 16 agents implemented
 - ✅ **REST API**: 100% of endpoints functional, with documentation
 - ✅ **ML Pipeline**: State-of-the-art anomaly detection
+- ✅ **Observability**: Prometheus metrics implemented in production
+- ✅ **Monitoring**: Full instrumentation of agents and investigations
 - ✅ **Containerization**: Docker ready to deploy
 - ✅ **Documentation**: Exceptional technical quality
 
 
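For a quick manual check of the endpoints documented above, a minimal Python sketch (the base URL is the same placeholder used in the README):

```python
# Fetch the new monitoring endpoints; requests usage is illustrative.
import requests

BASE_URL = "https://your-space-url.hf.space"

# Prometheus text exposition format
prom_text = requests.get(f"{BASE_URL}/health/metrics", timeout=10).text
print("\n".join(prom_text.splitlines()[:10]))

# System metrics in JSON format for debugging
metrics_json = requests.get(f"{BASE_URL}/health/metrics/json", timeout=10).json()
print(metrics_json)
```
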
src/agents/deodoro.py CHANGED
@@ -17,6 +17,8 @@ from pydantic import BaseModel, Field as PydanticField
 
 from src.core import AgentStatus, get_logger
 from src.core.exceptions import AgentError, AgentExecutionError
+from src.core.monitoring import AGENT_TASK_COUNT, AGENT_TASK_DURATION
+import time
 
 
 @dataclass
@@ -165,9 +167,17 @@ class BaseAgent(ABC):
         )
 
         start_time = datetime.utcnow()
+        perf_start_time = time.time()
         retries = 0
         last_error = None
 
+        # Increment task counter
+        AGENT_TASK_COUNT.labels(
+            agent_type=self.name,
+            task_type=action,
+            status="started"
+        ).inc()
+
         while retries <= self.max_retries:
             try:
                 self.status = AgentStatus.THINKING
@@ -178,13 +188,24 @@
                     retry=retries,
                 )
 
-                # Process the message
-                response = await self.process(message, context)
+                # Process the message with timing
+                with AGENT_TASK_DURATION.labels(
+                    agent_type=self.name,
+                    task_type=action
+                ).time():
+                    response = await self.process(message, context)
 
                 # Calculate processing time
                 processing_time = (datetime.utcnow() - start_time).total_seconds() * 1000
                 response.processing_time_ms = processing_time
 
+                # Record successful execution
+                AGENT_TASK_COUNT.labels(
+                    agent_type=self.name,
+                    task_type=action,
+                    status="completed"
+                ).inc()
+
                 # Update status
                 self.status = AgentStatus.COMPLETED
 
@@ -211,12 +232,25 @@
                     retry=retries,
                 )
 
+                # Record retry attempt
+                AGENT_TASK_COUNT.labels(
+                    agent_type=self.name,
+                    task_type=action,
+                    status="retry"
+                ).inc()
+
                 retries += 1
                 if retries <= self.max_retries:
                     # Exponential backoff
                     await self._wait(2 ** retries)
 
-        # All retries exhausted
+        # All retries exhausted - record failure
+        AGENT_TASK_COUNT.labels(
+            agent_type=self.name,
+            task_type=action,
+            status="failed"
+        ).inc()
+
         self.status = AgentStatus.ERROR
 
         error_response = AgentResponse(
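
This hunk imports `AGENT_TASK_COUNT` and `AGENT_TASK_DURATION` from `src.core.monitoring`, but their definitions are not part of the monitoring.py hunk in this commit. A plausible shape, following the naming scheme of the other metrics (the names, help strings, and labels below are assumptions, not the actual definitions):

```python
# Assumed definitions for the two agent-level metrics used in deodoro.py.
from prometheus_client import Counter, Histogram

AGENT_TASK_COUNT = Counter(
    'cidadao_ai_agent_tasks_total',             # hypothetical metric name
    'Agent task executions by status',
    ['agent_type', 'task_type', 'status']
)

AGENT_TASK_DURATION = Histogram(
    'cidadao_ai_agent_task_duration_seconds',   # hypothetical metric name
    'Time spent processing agent tasks',
    ['agent_type', 'task_type']
)
```
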
src/agents/zumbi.py CHANGED
@@ -19,9 +19,14 @@ from pydantic import BaseModel, Field as PydanticField
 from src.agents.deodoro import BaseAgent, AgentContext, AgentMessage
 from src.core import get_logger
 from src.core.exceptions import AgentExecutionError, DataAnalysisError
+from src.core.monitoring import (
+    INVESTIGATIONS_TOTAL, ANOMALIES_DETECTED, INVESTIGATION_DURATION,
+    DATA_RECORDS_PROCESSED, TRANSPARENCY_API_DATA_FETCHED
+)
 from src.tools.transparency_api import TransparencyAPIClient, TransparencyAPIFilter
 from src.tools.models_client import ModelsClient, get_models_client
 from src.ml.spectral_analyzer import SpectralAnalyzer, SpectralAnomaly
+import time
 
 
 @dataclass
@@ -128,6 +133,8 @@ class InvestigatorAgent(BaseAgent):
         Returns:
             Investigation results with detected anomalies
         """
+        investigation_start_time = time.time()
+
         try:
             self.logger.info(
                 "investigation_started",
@@ -139,6 +146,14 @@ class InvestigatorAgent(BaseAgent):
             # Parse investigation request
             if message.message_type == "investigation_request":
                 request = InvestigationRequest(**message.content)
+
+                # Record investigation start
+                INVESTIGATIONS_TOTAL.labels(
+                    agent_type="zumbi",
+                    investigation_type=request.anomaly_types[0] if request.anomaly_types else "general",
+                    status="started"
+                ).inc()
+
             else:
                 raise AgentExecutionError(
                     f"Unsupported message type: {message.message_type}",
@@ -148,6 +163,13 @@ class InvestigatorAgent(BaseAgent):
             # Fetch data for investigation
             contracts_data = await self._fetch_investigation_data(request, context)
 
+            # Record data processed
+            DATA_RECORDS_PROCESSED.labels(
+                data_source="transparency_api",
+                agent="zumbi",
+                operation="fetch"
+            ).inc(len(contracts_data) if contracts_data else 0)
+
             if not contracts_data:
                 return AgentMessage(
                     message_type="investigation_result",
@@ -167,6 +189,14 @@ class InvestigatorAgent(BaseAgent):
                 context
             )
 
+            # Record anomalies detected
+            for anomaly in anomalies:
+                ANOMALIES_DETECTED.labels(
+                    anomaly_type=anomaly.anomaly_type,
+                    severity="high" if anomaly.severity > 0.7 else "medium" if anomaly.severity > 0.4 else "low",
+                    agent="zumbi"
+                ).inc()
+
             # Generate investigation summary
             summary = self._generate_investigation_summary(contracts_data, anomalies)
 
@@ -185,11 +215,25 @@
                 }
             }
 
+            # Record investigation completion and duration
+            investigation_duration = time.time() - investigation_start_time
+            INVESTIGATION_DURATION.labels(
+                agent_type="zumbi",
+                investigation_type=request.anomaly_types[0] if request.anomaly_types else "general"
+            ).observe(investigation_duration)
+
+            INVESTIGATIONS_TOTAL.labels(
+                agent_type="zumbi",
+                investigation_type=request.anomaly_types[0] if request.anomaly_types else "general",
+                status="completed"
+            ).inc()
+
             self.logger.info(
                 "investigation_completed",
                 investigation_id=context.investigation_id,
                 records_analyzed=len(contracts_data),
                 anomalies_found=len(anomalies),
+                duration_seconds=investigation_duration,
             )
 
             return AgentMessage(
@@ -199,6 +243,13 @@
             )
 
         except Exception as e:
+            # Record investigation failure
+            INVESTIGATIONS_TOTAL.labels(
+                agent_type="zumbi",
+                investigation_type="general",  # Fallback for failed investigations
+                status="failed"
+            ).inc()
+
             self.logger.error(
                 "investigation_failed",
                 investigation_id=context.investigation_id,
@@ -259,6 +310,13 @@
                 # Fetch contracts
                 response = await client.get_contracts(filters)
 
+                # Record API data fetched
+                TRANSPARENCY_API_DATA_FETCHED.labels(
+                    endpoint="contracts",
+                    organization=org_code,
+                    status="success"
+                ).inc(len(response.data))
+
                 # Add organization code to each contract
                 for contract in response.data:
                     contract["_org_code"] = org_code
@@ -273,6 +331,13 @@
                 )
 
             except Exception as e:
+                # Record API fetch failure
+                TRANSPARENCY_API_DATA_FETCHED.labels(
+                    endpoint="contracts",
+                    organization=org_code,
+                    status="failed"
+                ).inc()
+
                 self.logger.warning(
                     "data_fetch_failed",
                     org_code=org_code,
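
The severity label above is derived inline from the anomaly score. The same bucketing as a standalone helper, handy as a reference or unit test (thresholds copied from the hunk; the function name is hypothetical):

```python
def severity_label(severity: float) -> str:
    """Mirror of the inline bucketing in zumbi.py: >0.7 high, >0.4 medium, else low."""
    if severity > 0.7:
        return "high"
    if severity > 0.4:
        return "medium"
    return "low"


assert severity_label(0.9) == "high"
assert severity_label(0.5) == "medium"
assert severity_label(0.2) == "low"
```
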
src/api/routes/health.py CHANGED
@@ -15,7 +15,10 @@ from fastapi import APIRouter, HTTPException, Depends
 from pydantic import BaseModel
 
 from src.core import get_logger, settings
+from src.core.monitoring import get_metrics_data, collect_system_metrics, performance_metrics, health_monitor
 from src.tools import TransparencyAPIClient
+from prometheus_client import CONTENT_TYPE_LATEST
+from fastapi.responses import Response
 
 
 logger = get_logger(__name__)
@@ -171,6 +174,45 @@
     return {"status": "alive", "timestamp": datetime.utcnow()}
 
 
+@router.get("/metrics")
+async def prometheus_metrics():
+    """
+    Prometheus metrics endpoint.
+
+    Returns metrics in Prometheus format for monitoring.
+    """
+    try:
+        metrics_data = get_metrics_data()
+        return Response(
+            content=metrics_data,
+            media_type=CONTENT_TYPE_LATEST
+        )
+    except Exception as e:
+        logger.error("metrics_endpoint_error", error=str(e))
+        raise HTTPException(
+            status_code=500,
+            detail=f"Failed to collect metrics: {str(e)}"
+        )
+
+
+@router.get("/metrics/json")
+async def system_metrics_json():
+    """
+    System metrics in JSON format for debugging.
+
+    Returns comprehensive system metrics in JSON format.
+    """
+    try:
+        metrics = await collect_system_metrics()
+        return metrics
+    except Exception as e:
+        logger.error("system_metrics_error", error=str(e))
+        raise HTTPException(
+            status_code=500,
+            detail=f"Failed to collect system metrics: {str(e)}"
+        )
+
+
 @router.get("/ready")
 async def readiness_probe():
     """
src/core/monitoring.py CHANGED
@@ -99,6 +99,37 @@ ACTIVE_CONNECTIONS = Gauge(
     ['connection_type']
 )
 
+# Investigation and Anomaly Detection Metrics
+INVESTIGATIONS_TOTAL = Counter(
+    'cidadao_ai_investigations_total',
+    'Total number of investigations started',
+    ['agent_type', 'investigation_type', 'status']
+)
+
+ANOMALIES_DETECTED = Counter(
+    'cidadao_ai_anomalies_detected_total',
+    'Total number of anomalies detected',
+    ['anomaly_type', 'severity', 'agent']
+)
+
+INVESTIGATION_DURATION = Histogram(
+    'cidadao_ai_investigation_duration_seconds',
+    'Time taken for investigations',
+    ['agent_type', 'investigation_type']
+)
+
+DATA_RECORDS_PROCESSED = Counter(
+    'cidadao_ai_data_records_processed_total',
+    'Total number of data records processed',
+    ['data_source', 'agent', 'operation']
+)
+
+TRANSPARENCY_API_DATA_FETCHED = Counter(
+    'cidadao_ai_transparency_data_fetched_total',
+    'Total data fetched from transparency API',
+    ['endpoint', 'organization', 'status']
+)
+
 
 class PerformanceMetrics:
     """System performance metrics collector."""