
Troubleshooting & Common Issues

Per-component FAQ and quick fixes. Saves hours of debugging.


How to Use This Guide

  1. Problem occurs → Find component below
  2. Read "Symptoms" → Do they match your issue?
  3. Try quick fixes in order (most likely first)
  4. If still broken → Follow "Debug Steps"
  5. Still stuck? → Escalate (call for pair programming)

OpenShift Cluster

Issue: Provisioning Fails (Stuck at 50%)

Symptoms: Terraform error: "Timeout waiting for cluster" OR OpenShift installer hangs on "Waiting for bootstrap"

Quick Fixes (try in order):

  1. Check cloud quotas:

     ```bash
     aws ec2 describe-account-attributes --attribute-names vpc-max-security-groups-per-interface
     ```

     If the limit is < 3, request a quota increase.

  2. Check network: can you ping the VMs? If not, the firewall is blocking traffic. Check security groups.

  3. Check the pull secret is valid:

     ```bash
     cat pull-secret.json | jq . > /dev/null   # silent on success; an error means bad JSON
     ```

     If jq reports an error, the pull secret is corrupted. Download a fresh one.

  4. Restart provisioning:

     ```bash
     terraform destroy -auto-approve   # clear state
     terraform apply                   # start fresh
     ```
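The pull-secret check can be wrapped in a small helper so provisioning fails fast with a clear message. This is a sketch: `check_pull_secret` is a hypothetical helper name, and the `"auths"` grep assumes the standard pull-secret layout.

```shell
# check_pull_secret FILE: cheap sanity checks before starting the installer.
# Not full JSON validation (use `jq .` for that); this catches the common
# failure modes: missing file, truncated download, missing "auths" section.
check_pull_secret() {
  f="$1"
  [ -s "$f" ] || { echo "FAIL: $f is missing or empty"; return 1; }
  grep -q '"auths"' "$f" || { echo "FAIL: $f has no \"auths\" section"; return 1; }
  echo "OK: $f passes basic checks"
}
```

Run it as `check_pull_secret pull-secret.json || exit 1` before `terraform apply`.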

Debug Steps:

```bash
# Check Terraform logs
terraform apply -var-file=terraform.tfvars 2>&1 | tail -50

# SSH to a VM and check install logs
ssh core@bootstrap.example.com journalctl -u bootkube -f

# If still stuck, check the OpenShift installer issue tracker:
# https://github.com/openshift/installer/issues
```

When to Escalate:
- If stuck > 2 hours, escalate to the Infrastructure Lead
- May need manual VM setup (fallback plan)


Issue: "Nodes Not Ready" (Red X)

Symptoms:

```bash
$ oc get nodes
NAME     STATUS     ROLES    AGE
node-0   NotReady   master   10m
node-1   NotReady   worker   10m
```

Quick Fixes:

  1. Wait 5 minutes (the cluster may still be initializing):

     ```bash
     watch oc get nodes   # watch until Ready
     ```

  2. Check node logs:

     ```bash
     oc describe node node-0
     oc logs -n openshift-node-admission ds/node-admission-agent
     ```

  3. Common cause: CNI (networking) not deployed:

     ```bash
     oc get pods -n openshift-network-operator
     # If none: the network operator is stuck, restart it
     oc rollout restart deployment -n openshift-network-operator
     ```

  4. Check cluster operators:

     ```bash
     oc get clusteroperators
     # If any are Degraded, inspect that operator
     oc describe co/network   # example
     ```
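To script the wait instead of eyeballing `watch`, a small parser over `oc get nodes --no-headers` works. This is a sketch: `all_nodes_ready` is a hypothetical helper, and it only checks for a STATUS column of exactly `Ready`.

```shell
# all_nodes_ready: read `oc get nodes --no-headers` output on stdin and
# succeed only when every node reports a STATUS of exactly "Ready".
all_nodes_ready() {
  awk '$2 != "Ready" { bad++ } END { exit (bad > 0) }'
}

# Example poll loop:
# until oc get nodes --no-headers | all_nodes_ready; do sleep 10; done
```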


Issue: "No space left on device"

Symptoms: Pods evicted, OOM errors, disk usage 100%

Quick Fix:

```bash
# Check disk usage per node
oc top nodes

# Clean up unused images
oc adm prune images --all --force

# Delete unused PVCs
oc delete pvc --all -n default   # only safe if no data is needed
```
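For alerting before evictions start, the `df` use% on a node can be extracted as a bare number. A sketch: `disk_pct_used` is a hypothetical helper, fed e.g. from `df -P /var/lib` inside an `oc debug node/...` session.

```shell
# disk_pct_used: read `df -P <path>` output on stdin and print the Use%
# column as a bare number, so scripts can compare it against a threshold.
disk_pct_used() {
  awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# Example: [ "$(df -P /var | disk_pct_used)" -lt 90 ] || echo "disk nearly full"
```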


Kafka KRaft Cluster

Issue: Brokers Keep Crashing (CrashLoopBackOff)

Symptoms:

```bash
$ kubectl get pods -n kafka
NAME      READY   STATUS             RESTARTS
kafka-0   0/1     CrashLoopBackOff   5
kafka-1   0/1     CrashLoopBackOff   4
```

Quick Fixes:

  1. Check logs:

     ```bash
     kubectl logs kafka-0 -n kafka --tail=50
     ```

  2. Common causes:
     - Insufficient disk space (Kafka needs > 5 GB)
     - JVM heap too small (set -Xmx2G)
     - Port already in use (9092, 29092)

  3. Fix & restart:

     ```bash
     # Update the Kafka StatefulSet with more memory
     kubectl patch statefulset kafka -n kafka -p \
       '{"spec":{"template":{"spec":{"containers":[{"name":"kafka","resources":{"requests":{"memory":"4Gi"}}}]}}}}'

     # Delete pods to force a restart
     kubectl delete pod kafka-0 kafka-1 kafka-2 -n kafka
     ```
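The restart counts in the symptoms can be checked mechanically instead of by eye. A sketch: `crashlooping_pods` is a hypothetical helper over `kubectl get pods --no-headers`, whose 4th column is RESTARTS.

```shell
# crashlooping_pods N: read `kubectl get pods --no-headers` output on stdin
# and print the names of pods whose RESTARTS count (4th column) exceeds N.
crashlooping_pods() {
  awk -v n="$1" '$4 + 0 > n { print $1 }'
}

# Example: kubectl get pods -n kafka --no-headers | crashlooping_pods 3
```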


Issue: Topics Not Receiving Messages

Symptoms:

```bash
$ kafka-console-consumer --topic telemetry.logs --from-beginning
[No messages for 30 seconds]
```

Quick Fixes:

  1. Check topics exist:

     ```bash
     kafka-topics.sh --bootstrap-server kafka:9092 --list
     ```

     If the list is empty, create the topics:

     ```bash
     kafka-topics.sh --bootstrap-server kafka:9092 --create \
       --topic telemetry.logs --partitions 3 --replication-factor 2
     ```

  2. Check producer connectivity:

     ```bash
     kafka-console-producer.sh --bootstrap-server kafka:9092 --topic test
     > hello
     # (press Ctrl+C to exit)
     ```

  3. Check Schema Registry (if using one):

     ```bash
     curl http://schema-registry:8081/subjects
     ```
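The topic-existence check can be automated by diffing the expected topics against the `--list` output. A sketch: `missing_topics` is a hypothetical helper name.

```shell
# missing_topics TOPIC...: given the desired topic names as arguments and the
# output of `kafka-topics.sh --list` on stdin, print topics that still need
# to be created. -F treats names as fixed strings (dots are not wildcards).
missing_topics() {
  existing=$(cat)
  for t in "$@"; do
    printf '%s\n' "$existing" | grep -qxF "$t" || echo "$t"
  done
}

# Example:
# kafka-topics.sh --bootstrap-server kafka:9092 --list \
#   | missing_topics telemetry.logs telemetry.traces telemetry.metrics
```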


OTel & SigNoz

Issue: No Traces Appearing in SigNoz

Symptoms: SigNoz web UI shows 0 traces. The sample app sends requests, but nothing appears in the dashboard.

Quick Fixes (check in order):

  1. Is the OTel Collector receiving data?

     ```bash
     kubectl logs -n observability -l app=otel-collector | grep -i "exported\|received"
     # Should show "Exported N spans" every few seconds
     ```

  2. Is the sample app instrumented?

     ```bash
     # Check if the sample app is sending OTLP
     kubectl exec -it [pod] -c app -- curl localhost:4317
     # Should time out (4317 is the OTLP gRPC port, not HTTP)
     ```

  3. Is the SigNoz backend receiving?

     ```bash
     kubectl logs -n observability -l app=signoz-backend | tail -50
     # Check for errors or "received spans"
     ```

  4. Is ClickHouse storing data?

     ```bash
     kubectl exec -it clickhouse-0 -n observability -- \
       clickhouse-client -q "SELECT count() FROM traces"
     # Should show > 0
     ```

  5. Is data flowing through Kafka? (check whether it is buffering)

     ```bash
     kafka-console-consumer --bootstrap-server kafka:9092 \
       --topic telemetry.traces --max-messages 5
     # Should show trace data (JSON)
     ```
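The collector check can run as a scripted health probe. A sketch: `exporter_alive` is a hypothetical helper, and the `Exported ... spans` pattern assumes the log line format quoted in the check above, which may vary by collector version.

```shell
# exporter_alive: read recent collector log lines on stdin and succeed if
# the collector has reported exporting spans at least once. Assumes log
# lines like "Exported 128 spans".
exporter_alive() {
  grep -qiE 'exported [0-9]+ span'
}

# Example:
# kubectl logs -n observability -l app=otel-collector --tail=200 \
#   | exporter_alive || echo "collector is not exporting"
```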


Issue: OTel Collector Memory Growing

Symptoms:

```bash
$ kubectl top pods -n observability
NAME                  CPU    MEMORY
otel-collector-xxxx   500m   800Mi   ← growing!
```

Quick Fixes:

  1. Increase the memory limit:

     ```bash
     kubectl set resources ds otel-collector -n observability \
       --limits=memory=2Gi --requests=memory=1Gi
     ```

  2. Reduce the batch size:

     ```bash
     # Edit the OTel config
     kubectl edit cm otel-collector-config -n observability
     # Change the batch processor:
     #   send_batch_size: 100   # lower than the default
     ```

  3. Enable sampling (drop 90% of traces):

     ```yaml
     # In the OTel config, add:
     processors:
       probabilistic_sampler:
         sampling_percentage: 10   # keep 10%, drop 90%
     ```

WSO2 APIM

Issue: "502 Bad Gateway" from API Gateway

Symptoms:

```bash
$ curl http://apim-gateway/api/
502 Bad Gateway
```

Quick Fixes:

  1. Check APIM pod status:

     ```bash
     kubectl get pods -n wso2 -l app=apim   # should be Running
     ```

  2. Check logs:

     ```bash
     kubectl logs -n wso2 -l app=apim --tail=50
     # Look for errors, connection refused, etc.
     ```

  3. API not deployed?
     - Log into the APIM console (port 9443)
     - Check: Publisher → APIs → Your API → Deployed?
     - If not: publish it

  4. Backend service not responding?

     ```bash
     # Check if the sample app is running
     kubectl get pods -n default | grep sample-app
     ```


Issue: SAML SSO Redirects to Login Loop

Symptoms:
  1. Click "SSO" → redirected to the Identity Server
  2. Log in with credentials
  3. Redirected back to the APIM console
  4. Repeats (login loop)

Quick Fixes:

  1. Check SAML metadata:

     ```bash
     # Get the Identity Server URL
     IS_URL=$(kubectl get svc wso2-is -n wso2 -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
     curl $IS_URL/saml/metadata   # should return XML metadata
     ```

  2. Check SAML assertion validation:
     - Browser DevTools → Network tab
     - Look for the SAMLResponse form parameter
     - Base64-decode it and check: Subject, Audience, NotOnOrAfter

  3. Common cause: clock skew (server times don't match):

     ```bash
     # Check the server time
     kubectl exec wso2-is-0 -n wso2 -- date
     # Compare with local time
     date
     # Difference > 5 min? Fix clock sync (NTP)
     ```

  4. Fallback: skip SSO and use an API key instead:
     - Generate an API key in the APIM console
     - Use it instead of SSO
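The clock-skew check can be made exact by comparing epoch seconds instead of eyeballing two `date` outputs. A sketch: `clock_skew_ok` is a hypothetical helper, and 300 s is a commonly used SAML tolerance, not a WSO2-specific value.

```shell
# clock_skew_ok A B: compare two epoch-second timestamps and succeed only
# when they differ by less than 300 s (a typical SAML time tolerance).
clock_skew_ok() {
  d=$(( $1 - $2 ))
  if [ "$d" -lt 0 ]; then d=$(( -d )); fi
  [ "$d" -lt 300 ]
}

# Example:
# remote=$(kubectl exec wso2-is-0 -n wso2 -- date +%s)
# clock_skew_ok "$remote" "$(date +%s)" || echo "fix clock sync (NTP)"
```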


Redis Sentinel

Issue: Failover Not Triggering

Symptoms:
  1. Kill the Redis master pod
  2. Wait 30 seconds
  3. Master still down, no failover happened

Quick Fixes:

  1. Check Sentinel status:

     ```bash
     kubectl exec redis-sentinel-0 -n redis -- \
       redis-cli -p 26379 sentinel masters
     # Should show 1 master
     ```

  2. Check Sentinel quorum:

     ```bash
     kubectl exec redis-sentinel-0 -n redis -- \
       redis-cli -p 26379 info
     # Look for "sentinels=3" (need 3 total)
     ```

  3. Manually trigger a failover:

     ```bash
     kubectl exec redis-sentinel-0 -n redis -- \
       redis-cli -p 26379 sentinel failover mymaster
     ```

  4. Check replication is working:

     ```bash
     kubectl exec redis-0 -n redis -- redis-cli info replication
     # Should show role:slave (or role:master after failover)
     ```


Issue: Redis Data Lost After Failover

Symptoms:
  1. Set a key: `SET mykey "hello"`
  2. Kill the master
  3. Failover happens
  4. Get the key: `GET mykey`
  5. Result: `(nil)` ← data lost!

Causes:
- Replication lag (data not synced before failover)
- AOF (append-only file) persistence disabled

Quick Fix:

```bash
# Enable AOF persistence
kubectl exec redis-0 -n redis -- \
  redis-cli CONFIG SET appendonly yes

# Check replication lag
kubectl exec redis-master -n redis -- redis-cli info replication
# Compare "master_repl_offset" with the replica's offset in the
# "slave0:..." line; the difference should be low (< 100 bytes)
```
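The lag can be computed rather than read by eye. A sketch: `repl_lag` is a hypothetical helper; the field names `master_repl_offset` and `slave0:` follow the `INFO replication` output format, and `tr -d '\r'` strips the CRLF line endings `redis-cli` emits.

```shell
# repl_lag: read `redis-cli info replication` output (from the master) on
# stdin and print the byte lag between master_repl_offset and the first
# replica's offset in the "slave0:" line.
repl_lag() {
  tr -d '\r' | awk -F'[:,=]' '
    /^master_repl_offset/ { m = $2 }
    /^slave0:/ { for (i = 1; i <= NF; i++) if ($i == "offset") s = $(i + 1) }
    END { print m - s }
  '
}

# Example:
# kubectl exec redis-master -n redis -- redis-cli info replication | repl_lag
```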


GitLab CI/CD

Issue: Pipeline Fails - "Container Image Pull Failed"

Symptoms: CI/CD pipeline fails with:

```
Error response from daemon: pull access denied for myimage
```

Quick Fixes:

  1. Check container registry credentials:
     - In the GitLab CI/CD settings, add REGISTRY_URL, REGISTRY_USER, REGISTRY_PASSWORD

  2. Check the image exists:

     ```bash
     docker pull myregistry.com/myimage:latest
     # If the pull fails: the image is not there, or the tag is wrong
     ```

  3. Check Nexus/Docker Hub quota:
     - If using Nexus: check that storage is not full
     - If using Docker Hub: check rate limits (100 pulls per 6 h for anonymous users)


Issue: ".gitlab-ci.yml" Not Triggering

Symptoms: You push a commit that includes a .gitlab-ci.yml, but no pipeline starts.

Quick Fixes:

  1. Check the .gitlab-ci.yml syntax:
     - In the GitLab UI: CI/CD → Editor → Validate

  2. Check a GitLab Runner is registered:

     ```bash
     gitlab-runner list   # should show at least 1 runner
     ```

  3. Check the GitLab webhook:
     - In the project: Settings → Integrations → Webhooks
     - The push event should be enabled


Compliance & PCI-DSS

Issue: Compliance Scan Reports No Findings

Symptoms:

```bash
$ oc get compliancescans pci-dss-scan -n openshift-compliance
STATUS=DONE   RESULT=NOTAPPLICABLE
```

Quick Fixes:

  1. Check the nodes are labeled:

     ```bash
     oc get nodes --show-labels   # look for the scan=true label
     # If missing, add it:
     oc label node node-0 scan=true
     ```

  2. Run the scan again:

     ```bash
     oc delete compliancescans pci-dss-scan -n openshift-compliance
     oc apply -f pci-dss-scan.yaml
     watch oc get compliancescans -n openshift-compliance
     ```

Generic Debugging

Pod Stuck in Pending State

```bash
# Check why the pod won't start
kubectl describe pod [podname]

# Common causes:
# - ImagePullBackOff: image not found
# - CrashLoopBackOff: app exits immediately
# - Pending: resource quota exceeded, or no node available

# Get more details
kubectl logs [podname]
kubectl get events -n [namespace]
```

High Memory Usage

```bash
# Find the culprit
kubectl top pods -A | sort -k4 -n

# If a single pod is using too much
kubectl set resources deployment [name] \
  --limits=memory=4Gi --requests=memory=2Gi

# If cluster-wide: scale down non-critical deployments
kubectl delete deployment [non-critical] -n [ns]
```
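Note that `sort -k4 -n` treats `2Gi` as smaller than `800Mi` because it ignores the unit suffix. Normalizing to MiB first makes the sort honest; a sketch, where `mem_mi` is a hypothetical helper that handles only the Mi/Gi suffixes `kubectl top` prints:

```shell
# mem_mi: read memory values such as "800Mi" or "2Gi" (one per line, as in
# the MEMORY column of `kubectl top pods`) and print each as plain MiB so
# numeric sorting works across mixed units.
mem_mi() {
  awk '{ u = 1
         if ($0 ~ /Gi$/) u = 1024
         v = $0
         sub(/[GM]i$/, "", v)
         print v * u }'
}

# Example:
# kubectl top pods -A --no-headers | awk '{print $4}' | mem_mi | sort -n
```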

Network Connectivity Issues

```bash
# Test between pods
kubectl run -it --image=busybox debug -- sh
/ # wget -O- http://[service]:[port]

# If it times out: check NetworkPolicies
kubectl get networkpolicies -A

# If there is no connection: check the Service exists and has endpoints
kubectl get svc [service]
kubectl get endpoints [service]
```


Escalation Path

If none of these quick fixes work:

  1. Pair Program (30 min)
     - Post the problem in Slack/standup
     - Another person looks with fresh eyes
     - Often finds the issue immediately

  2. Search GitHub Issues
     - Projects: openshift/installer, openshift/ocs-operator, etc.
     - Search for your error message
     - Often someone else has hit the same issue and posted a solution

  3. Post to the Community
     - OpenShift forums
     - Kubernetes Slack
     - Answers often arrive within 2 hours

  4. Skip & Document
     - If unresolvable, skip the component
     - Document it as a "Known Limitation"
     - Move on to the next issue
     - Revisit after other work is done
Troubleshooting Guide Created: 2026-04-24
Status: Active (keep updating as you find issues)
Owner: All team members
Next Step: Add issues you find during execution