# Troubleshooting & Common Issues

Per-component FAQ and quick fixes. Saves hours of debugging.
## How to Use This Guide

1. Problem occurs → find the component below.
2. Read "Symptoms" → do they match your issue?
3. Try the quick fixes in order (most likely first).
4. If still broken → follow "Debug Steps".
5. Still stuck? → Escalate (ask for pair programming).
## OpenShift Cluster

### Issue: Provisioning Fails (Stuck at 50%)
Symptoms:

- Terraform error: "Timeout waiting for cluster", OR
- OpenShift installer hangs on "Waiting for bootstrap"
Quick Fixes (try in order):

1. Check cloud quotas:

   ```bash
   aws ec2 describe-account-attributes --attribute-names vpc-max-security-groups-per-interface
   ```

   If the value is < 3: request a quota increase.

2. Check the network: can you ping the VMs?

   If NO: a firewall is blocking traffic. Check the security groups.

3. Check the pull secret: is it valid?

   ```bash
   jq . pull-secret.json > /dev/null  # Silent if valid; prints a parse error if not
   ```

   If jq errors: the pull secret is corrupted; download a fresh one.

4. Restart provisioning:

   ```bash
   terraform destroy -auto-approve  # Clear state
   terraform apply                  # Start fresh
   ```
Debug Steps:

```bash
# Check Terraform logs
terraform apply -var-file=terraform.tfvars 2>&1 | tail -50

# SSH to a VM and check install logs
ssh core@bootstrap.example.com
journalctl -u bootkube -f

# If still stuck, search the OpenShift issue tracker:
# https://github.com/openshift/installer/issues
```

When to Escalate:

- If stuck > 2 hours, escalate to the Infrastructure Lead.
- May need manual VM setup (fallback plan).
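The "wait, recheck, escalate after a time budget" pattern above can be made explicit with a small poll loop. A minimal sketch, where the probe command, timeout, and interval are placeholders (on a real cluster the probe would be something like `oc get nodes | grep -qv NotReady`):

```shell
# Sketch: generic poll-until-ready loop used throughout this guide.
# check_cmd, timeout_s, and interval_s are illustrative parameters,
# not part of any installer CLI.
wait_until_ready() {
  check_cmd=$1; timeout_s=$2; interval_s=$3
  elapsed=0
  while [ "$elapsed" -lt "$timeout_s" ]; do
    if sh -c "$check_cmd" >/dev/null 2>&1; then
      echo "READY after ${elapsed}s"
      return 0
    fi
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
  echo "TIMEOUT after ${timeout_s}s: escalate" >&2
  return 1
}

wait_until_ready "true" 5 1   # demo: a check that passes immediately
```

The explicit timeout keeps the 2-hour escalation budget honest instead of relying on someone remembering to check the clock.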
### Issue: "Nodes Not Ready" (Red X)

Symptoms:

```bash
$ oc get nodes
NAME     STATUS     ROLES    AGE
node-0   NotReady   master   10m
node-1   NotReady   worker   10m
```
Quick Fixes:

1. Wait 5 minutes (the cluster is initializing):

   ```bash
   watch oc get nodes  # Watch until Ready
   ```

2. Check node logs:

   ```bash
   oc describe node node-0
   oc logs -n openshift-node-admission ds/node-admission-agent
   ```

3. Common cause: CNI (networking) not deployed:

   ```bash
   oc get pods -n openshift-network-operator
   # If none: the network operator is stuck, restart it
   oc rollout restart deployment -n openshift-network-operator
   ```

4. Check cluster operators:

   ```bash
   oc get clusteroperators
   # If any are Degraded: check which one
   oc describe co/network  # Example
   ```
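When watching a large cluster, a one-liner that counts NotReady nodes is faster to scan than the full table. A sketch, with a hard-coded sample standing in for the live `oc get nodes` output:

```shell
# Sketch: count NotReady nodes from `oc get nodes`-style output.
# On a cluster, pipe the real command output in instead of $sample.
sample='NAME STATUS ROLES AGE
node-0 NotReady master 10m
node-1 Ready worker 10m'
not_ready=$(printf '%s\n' "$sample" | awk 'NR>1 && $2=="NotReady" {n++} END {print n+0}')
echo "$not_ready node(s) NotReady"
```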
### Issue: "No space left on device"

Symptoms:

Pods evicted, OOM errors, disk usage at 100%.

Quick Fix:

```bash
# Check per-node resource usage
oc top nodes

# Clean up unused images (--confirm actually performs the prune)
oc adm prune images --all --confirm

# Delete unused PVCs
oc delete pvc --all -n default  # Only safe if no data is needed
```
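Before pruning, it helps to know which filesystem is actually full. A sketch that flags filesystems over a usage threshold; the sample lines are hard-coded stand-ins for real `df -h` output (on OpenShift you would get that via something like `oc debug node/<node>`):

```shell
# Sketch: flag any filesystem over a usage threshold in df-style output.
# The sample data and 90% threshold are illustrative assumptions.
threshold=90
sample='Filesystem Size Used Avail Use% Mounted
/dev/sda1 100G 98G 2G 98% /var/lib/containers
/dev/sdb1 50G 10G 40G 20% /data'
full=$(printf '%s\n' "$sample" |
  awk -v t="$threshold" 'NR>1 {gsub(/%/,"",$5); if ($5+0 > t) print "FULL:", $1, $5"%"}')
echo "$full"
```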
## Kafka KRaft Cluster

### Issue: Brokers Keep Crashing (CrashLoopBackOff)

Symptoms:

```bash
$ kubectl get pods -n kafka
NAME      READY   STATUS             RESTARTS
kafka-0   0/1     CrashLoopBackOff   5
kafka-1   0/1     CrashLoopBackOff   4
```
Quick Fixes:

1. Check logs:

   ```bash
   kubectl logs kafka-0 -n kafka --tail=50
   ```

   Common causes:
   - Insufficient disk space (Kafka needs > 5GB)
   - JVM heap too small (set -Xmx2G)
   - Port already in use (9092, 29092)

2. Fix & restart:

   ```bash
   # Update the Kafka StatefulSet with more memory
   kubectl patch statefulset kafka -n kafka -p \
     '{"spec":{"template":{"spec":{"containers":[{"name":"kafka","resources":{"requests":{"memory":"4Gi"}}}]}}}}'

   # Delete pods to force a restart
   kubectl delete pod kafka-0 kafka-1 kafka-2 -n kafka
   ```
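When raising the memory request, the JVM heap should move with it. A sketch that derives a `KAFKA_HEAP_OPTS` value from the container memory; the 50% rule of thumb (leave the rest for the OS page cache) is our assumption, not an official Kafka default:

```shell
# Sketch: derive a JVM heap setting from the container memory request.
# The halving heuristic is an assumption for illustration.
heap_for_container() {
  mem_mi=$1                       # container memory in MiB
  echo "-Xms$((mem_mi / 2))m -Xmx$((mem_mi / 2))m"
}

heap_for_container 4096           # matches the 4Gi request in the patch above
```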
### Issue: Topics Not Receiving Messages

Symptoms:

```bash
$ kafka-console-consumer --topic telemetry.logs --from-beginning
[No messages for 30 seconds]
```

Quick Fixes:

1. Check that the topics exist:

   ```bash
   kafka-topics.sh --bootstrap-server kafka:9092 --list
   ```

   If the list is empty, create them:

   ```bash
   kafka-topics.sh --bootstrap-server kafka:9092 --create \
     --topic telemetry.logs --partitions 3 --replication-factor 2
   ```

2. Check producer connectivity:

   ```bash
   kafka-console-producer.sh --bootstrap-server kafka:9092 --topic test
   > hello
   # (press Ctrl+C to exit)
   ```

3. Check the Schema Registry (if using one):

   ```bash
   curl http://schema-registry:8081/subjects
   ```
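Topic creation fails outright if the replication factor exceeds the broker count, so it is worth sanity-checking the numbers before running `--create`. A sketch, assuming our 3-broker cluster:

```shell
# Sketch: replication factor must not exceed the number of brokers.
# The broker count (3) is an assumption about this cluster.
brokers=3; partitions=3; rf=2
if [ "$rf" -gt "$brokers" ]; then
  verdict="ERROR: replication-factor $rf exceeds $brokers brokers"
else
  verdict="OK: $partitions partitions, rf=$rf on $brokers brokers"
fi
echo "$verdict"
```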
## OTel & SigNoz

### Issue: No Traces Appearing in SigNoz

Symptoms:

- SigNoz web UI shows 0 traces
- Sample app sends requests but nothing appears in the dashboard
Quick Fixes (check in order):

1. Is the OTel Collector receiving data?

   ```bash
   kubectl logs -n observability -l app=otel-collector | grep -i "exported\|received"
   # Should show: "Exported N spans" every few seconds
   ```

2. Is the sample app instrumented?

   ```bash
   # Check if the sample app is sending OTLP
   kubectl exec -it [pod] -c app -- curl localhost:4317
   # Should time out (4317 is the OTLP gRPC port, not HTTP)
   ```

3. Is the SigNoz backend receiving?

   ```bash
   kubectl logs -n observability -l app=signoz-backend | tail -50
   # Check for errors or "received spans"
   ```

4. Is ClickHouse storing data?

   ```bash
   kubectl exec -it clickhouse-0 -n observability -- \
     clickhouse-client -q "SELECT count() FROM traces"
   # Should show > 0
   ```

5. Is data flowing through Kafka? (check if it is buffering)

   ```bash
   kafka-console-consumer --bootstrap-server kafka:9092 \
     --topic telemetry.traces --max-messages 5
   # Should show trace data (JSON)
   ```
### Issue: OTel Collector Memory Growing

Symptoms:

```bash
$ kubectl top pods -n observability
NAME                  CPU    MEMORY
otel-collector-xxxx   500m   800Mi   ← growing!
```
Quick Fixes:

1. Increase the memory limit:

   ```bash
   kubectl set resources ds otel-collector -n observability \
     --limits=memory=2Gi --requests=memory=1Gi
   ```

2. Reduce the batch size:

   ```bash
   # Edit the OTel config
   kubectl edit cm otel-collector-config -n observability
   # Change the batch processor:
   #   send_batch_size: 100   # Lower than the default
   ```

3. Enable sampling (drop 90% of traces):

   ```yaml
   # In the OTel config, add:
   processors:
     probabilistic_sampler:
       sampling_percentage: 10  # Keep 10%, drop 90%
   ```
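Before enabling sampling, it is worth estimating the effect on stored volume. A back-of-envelope sketch; the 2000 spans/s input rate is an assumed figure, not a measurement from our cluster:

```shell
# Sketch: stored volume after probabilistic sampling is simply
# incoming_rate * sampling_percentage / 100. Input rate is assumed.
incoming=2000                     # spans per second (illustrative)
pct=10                            # sampling_percentage from the config
stored=$((incoming * pct / 100))
echo "$stored spans/s stored (was $incoming)"
```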
## WSO2 APIM

### Issue: "502 Bad Gateway" from API Gateway

Symptoms:

```bash
$ curl http://apim-gateway/api/
502 Bad Gateway
```
Quick Fixes:

1. Check APIM pod status:

   ```bash
   kubectl get pods -n wso2 -l app=apim
   # Should be Running
   ```

2. Check logs:

   ```bash
   kubectl logs -n wso2 -l app=apim --tail=50
   # Look for errors, connection refused, etc.
   ```

3. API not deployed?

   Log into the APIM console (port 9443) and check Publisher → APIs → Your API → Deployed? If not, publish it.

4. Backend service not responding?

   ```bash
   # Check if the sample app is running
   kubectl get pods -n default | grep sample-app
   ```
### Issue: SAML SSO Redirects to Login Loop

Symptoms:

1. Click "SSO" → redirected to the Identity Server
2. Log in with valid credentials
3. Redirected back to the APIM console
4. Repeats (login loop)
Quick Fixes:

1. Check SAML metadata:

   ```bash
   # Get the Identity Server URL
   IS_URL=$(kubectl get svc wso2-is -n wso2 -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
   curl $IS_URL/saml/metadata
   # Should return XML metadata
   ```

2. Validate the SAML assertion:

   In Browser DevTools → Network tab, look for the SAMLResponse parameter. Decode it and check Subject, Audience, and NotOnOrAfter.

3. Common cause: clock skew (server times don't match):

   ```bash
   # Check the server time
   kubectl exec wso2-is-0 -n wso2 -- date
   # Compare with local time:
   date
   # Difference > 5 min? Fix clock sync
   ```

4. Fallback: skip SSO and use an API key instead. Generate a key in the APIM console and use it in place of SSO.
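The clock-skew check in step 3 can be scripted by comparing epoch timestamps. A sketch with hard-coded values standing in for `kubectl exec wso2-is-0 -n wso2 -- date +%s` and the local `date +%s`; the 5-minute tolerance matches the common SAML NotOnOrAfter window:

```shell
# Sketch: warn when clock skew exceeds the typical 5-minute SAML
# tolerance. Timestamps are hard-coded for illustration.
server_epoch=1714000000
local_epoch=1714000400            # 400 s apart
skew=$((server_epoch - local_epoch))
if [ "$skew" -lt 0 ]; then skew=$((0 - skew)); fi
if [ "$skew" -gt 300 ]; then
  echo "SKEW ${skew}s > 300s: fix clock sync before retrying SSO"
else
  echo "skew ${skew}s is within tolerance"
fi
```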
## Redis Sentinel

### Issue: Failover Not Triggering

Symptoms:

1. Kill the Redis master pod
2. Wait 30 seconds
3. Master still down, no failover happened
Quick Fixes:

1. Check Sentinel status:

   ```bash
   kubectl exec redis-sentinel-0 -n redis -- \
     redis-cli -p 26379 sentinel masters
   # Should show 1 master
   ```

2. Check the Sentinel quorum:

   ```bash
   kubectl exec redis-sentinel-0 -n redis -- \
     redis-cli -p 26379 info
   # Look for "sentinels=3" (need 3 in total)
   ```

3. Manually trigger a failover:

   ```bash
   kubectl exec redis-sentinel-0 -n redis -- \
     redis-cli -p 26379 sentinel failover mymaster
   ```

4. Check that replication is working:

   ```bash
   kubectl exec redis-0 -n redis -- redis-cli info replication
   # Should show role:slave (or role:master after failover)
   ```
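The reason a lone surviving Sentinel never fails over is arithmetic: failover requires a majority of Sentinels to agree. A sketch of the quorum math for our 3-Sentinel setup:

```shell
# Sketch: Sentinel failover needs a majority vote. With 3 sentinels
# the quorum is 2, so a single surviving sentinel cannot fail over.
sentinels=3
quorum=$((sentinels / 2 + 1))
echo "with $sentinels sentinels, failover needs $quorum votes"
```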
### Issue: Redis Data Lost After Failover

Symptoms:

1. Set a key: SET mykey "hello"
2. Kill the master
3. Failover happens
4. Get the key: GET mykey
5. Result: (nil) ← data lost!
Causes:

- Replication lag (data not synced before failover)
- AOF (append-only file) disabled

Quick Fix:

```bash
# Enable AOF persistence
kubectl exec redis-0 -n redis -- \
  redis-cli CONFIG SET appendonly yes

# Check replication lag
kubectl exec redis-master -n redis -- redis-cli info replication
# Compare master_repl_offset with each replica's offset;
# the difference should be small (< 100 bytes)
```
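Replication lag is just the difference between the master's offset and the replica's acknowledged offset. A sketch with hard-coded values; on the cluster you would parse both numbers out of `redis-cli info replication` on each side:

```shell
# Sketch: lag = master_repl_offset - replica offset. Offsets are
# hard-coded stand-ins for values parsed from `info replication`.
master_repl_offset=105250
slave_repl_offset=105200
lag=$((master_repl_offset - slave_repl_offset))
echo "replication lag: $lag bytes"
```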
## GitLab CI/CD

### Issue: Pipeline Fails - "Container Image Pull Failed"

Symptoms:

CI/CD pipeline error:

```
Error response from daemon: pull access denied for myimage
```
Quick Fixes:

1. Check the container registry credentials:

   In the GitLab CI/CD settings, add REGISTRY_URL, REGISTRY_USER, and REGISTRY_PASSWORD variables.

2. Check that the image exists:

   ```bash
   docker pull myregistry.com/myimage:latest
   # If the pull fails: the image isn't there, or the tag is wrong
   ```

3. Check the Nexus/Docker Hub quota:

   - Nexus: check that storage is not full.
   - Docker Hub: check rate limits (100 pulls per 6 hours for anonymous users).
### Issue: ".gitlab-ci.yml" Not Triggering

Symptoms:

1. Push a commit containing .gitlab-ci.yml
2. No pipeline starts
Quick Fixes:

1. Check the .gitlab-ci.yml syntax:

   In the GitLab UI: CI/CD → Editor → Validate.

2. Check that a GitLab Runner is registered:

   ```bash
   gitlab-runner list
   # Should show at least 1 runner
   ```

3. Check the GitLab webhook:

   In the project: Settings → Integrations → Webhooks. The push event should be enabled.
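A common root cause worth checking before any of the above is the file simply not sitting at the repository root (the standard GitLab location). A sketch of a pre-push check; the function name is ours, not a GitLab tool:

```shell
# Sketch: GitLab only picks up .gitlab-ci.yml at the repo root.
# check_ci_file is an illustrative helper, not a gitlab-runner command.
check_ci_file() {
  if [ -f "$1/.gitlab-ci.yml" ]; then
    echo "FOUND: .gitlab-ci.yml"
  else
    echo "MISSING: .gitlab-ci.yml not at repo root"
    return 1
  fi
}

check_ci_file "$(mktemp -d)"   # demo against an empty dir prints MISSING
```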
## Compliance & PCI-DSS

### Issue: Compliance Scan Reports No Findings

Symptoms:

```bash
$ oc get compliancescans pci-dss-scan -n openshift-compliance
STATUS=DONE
RESULT=NOT-APPLICABLE
```
Quick Fixes:

1. Check that the nodes are labeled:

   ```bash
   oc get nodes --show-labels
   # Look for the scan=true label
   # If missing, add it:
   oc label node node-0 scan=true
   ```

2. Run the scan again:

   ```bash
   oc delete compliancescans pci-dss-scan -n openshift-compliance
   oc apply -f pci-dss-scan.yaml
   watch oc get compliancescans -n openshift-compliance
   ```
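The scan only covers labeled nodes, so every node you expect findings from needs the label. A dry-run sketch that echoes the commands instead of executing them; the node names are assumptions about this cluster, and `--overwrite` makes the loop safe to re-run:

```shell
# Sketch: label every node that should be scanned. Echoed as a dry
# run; drop the `echo` to execute. Node names are illustrative.
labeled=0
for node in node-0 node-1 node-2; do
  echo "oc label node $node scan=true --overwrite"
  labeled=$((labeled + 1))
done
echo "$labeled node(s) labeled"
```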
## Generic Debugging

### Pod Stuck in Pending State

```bash
# Check why the pod won't start
kubectl describe pod [podname]

# Common causes:
# - ImagePullBackOff: image not found
# - CrashLoopBackOff: app exits immediately
# - Pending: resource quota exceeded, or no node available

# Get more details
kubectl logs [podname]
kubectl get events -n [namespace]
```
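The status-to-next-command mapping above can be captured as a tiny triage helper. A sketch; the mapping is this guide's heuristic, not a kubectl feature:

```shell
# Sketch: map a pod status to the first command worth running.
# next_step is an illustrative helper for this guide's triage rules.
next_step() {
  case "$1" in
    ImagePullBackOff) echo "check image name and registry credentials" ;;
    CrashLoopBackOff) echo "kubectl logs [podname] --previous" ;;
    Pending)          echo "kubectl get events -n [namespace]" ;;
    *)                echo "kubectl describe pod [podname]" ;;
  esac
}

next_step CrashLoopBackOff
```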
### High Memory Usage

```bash
# Find the culprit
kubectl top pods -A | sort -k4 -n

# If a single pod is using too much
kubectl set resources deployment [name] \
  --limits=memory=4Gi --requests=memory=2Gi

# If cluster-wide: scale down non-critical deployments
kubectl delete deployment [non-critical] -n [ns]
```
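One caveat with `sort -k4 -n`: it mis-orders mixed Mi/Gi values (900Mi sorts above 2Gi), so normalize the units first. A sketch, with sample lines standing in for `kubectl top pods -A` output:

```shell
# Sketch: normalize Gi to Mi before sorting memory usage.
# $sample is illustrative; pipe the real `kubectl top pods -A` in.
sample='ns1 api 250m 1Gi
ns2 worker 100m 900Mi
ns3 cache 50m 2Gi'
top=$(printf '%s\n' "$sample" |
  awk '{v=$4; if (v ~ /Gi$/) {sub(/Gi/,"",v); v*=1024} else sub(/Mi/,"",v); print v"Mi", $1"/"$2}' |
  sort -rn | head -1)
echo "top consumer: $top"
```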
### Network Connectivity Issues

```bash
# Test between pods
kubectl run -it --image=busybox debug -- sh
/ # wget -O- http://[service]:[port]

# If it times out: check Network Policies
kubectl get networkpolicies -A

# If there is no connection: the service may not be running
kubectl get svc [service]
kubectl get endpoints [service]
```
## Escalation Path

If none of these quick fixes work:

1. Pair program (30 min)
   - Post the problem in Slack/standup
   - Another person looks with fresh eyes
   - Often finds the issue immediately
2. Search GitHub Issues
   - Projects: openshift/installer, openshift/ocs-operator, etc.
   - Search for your error message
   - Often someone else hit the same issue and posted a solution
3. Post to the community
   - OpenShift forums
   - Kubernetes Slack
   - Answers often arrive within 2 hours
4. Skip & document
   - If unresolvable, skip the component
   - Document it as a "Known Limitation"
   - Move on to the next issue
   - Revisit after other work is done
Troubleshooting Guide Created: 2026-04-24
Status: Active (keep updating as you find issues)
Owner: All team members
Next Step: Add issues you find during execution