
Troubleshooting & Common Issues

Per-component FAQ and quick fixes. Saves hours of debugging.


How to Use This Guide

  1. Problem occurs → Find component below
  2. Read "Symptoms" → Do they match your issue?
  3. Try quick fixes in order (most likely first)
  4. If still broken → Follow "Debug Steps"
  5. Still stuck? → Escalate (call for pair programming)

OpenShift Cluster

Issue: Provisioning Fails (Stuck at 50%)

Symptoms: Terraform error: "Timeout waiting for cluster" OR OpenShift installer hangs on "Waiting for bootstrap"

Quick Fixes (try in order):

  1. Check cloud quotas:

     ```bash
     aws ec2 describe-account-attributes --attribute-names vpc-max-security-groups-per-interface
     ```

     If the limit is < 3, request a quota increase.

  2. Check network: can you ping the VMs? If not, the firewall is blocking traffic. Check security groups.

  3. Check the pull secret is valid:

     ```bash
     cat pull-secret.json | jq . > /dev/null   # silent on success; an error means bad JSON
     ```

     If jq reports an error, the pull secret is corrupted. Download a fresh one.

  4. Restart provisioning:

     ```bash
     terraform destroy -auto-approve   # clear state
     terraform apply                   # start fresh
     ```
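The pull-secret check can be wrapped in a small helper so provisioning fails fast with a clear message. This is a sketch: `check_pull_secret` is a hypothetical helper name, and the `"auths"` grep assumes the standard pull-secret layout.

```shell
# check_pull_secret FILE: cheap sanity checks before starting the installer.
# Not full JSON validation (use `jq .` for that); this catches the common
# failure modes: missing file, truncated download, missing "auths" section.
check_pull_secret() {
  f="$1"
  [ -s "$f" ] || { echo "FAIL: $f is missing or empty"; return 1; }
  grep -q '"auths"' "$f" || { echo "FAIL: $f has no \"auths\" section"; return 1; }
  echo "OK: $f passes basic checks"
}
```

Run it as `check_pull_secret pull-secret.json || exit 1` before `terraform apply`.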

Debug Steps:

```bash
# Check Terraform logs
terraform apply -var-file=terraform.tfvars 2>&1 | tail -50

# SSH to a VM and check install logs
ssh core@bootstrap.example.com journalctl -u bootkube -f

# If still stuck, check the OpenShift installer issue tracker:
# https://github.com/openshift/installer/issues
```

When to Escalate:
- If stuck > 2 hours, escalate to the Infrastructure Lead
- May need manual VM setup (fallback plan)


Issue: "Nodes Not Ready" (Red X)

Symptoms:

```bash
$ oc get nodes
NAME     STATUS     ROLES    AGE
node-0   NotReady   master   10m
node-1   NotReady   worker   10m
```

Quick Fixes:

  1. Wait 5 minutes (the cluster may still be initializing):

     ```bash
     watch oc get nodes   # watch until Ready
     ```

  2. Check node logs:

     ```bash
     oc describe node node-0
     oc logs -n openshift-node-admission ds/node-admission-agent
     ```

  3. Common cause: CNI (networking) not deployed:

     ```bash
     oc get pods -n openshift-network-operator
     # If none: the network operator is stuck, restart it
     oc rollout restart deployment -n openshift-network-operator
     ```

  4. Check cluster operators:

     ```bash
     oc get clusteroperators
     # If any are Degraded, inspect that operator
     oc describe co/network   # example
     ```
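To script the wait instead of eyeballing `watch`, a small parser over `oc get nodes --no-headers` works. This is a sketch: `all_nodes_ready` is a hypothetical helper, and it only checks for a STATUS column of exactly `Ready`.

```shell
# all_nodes_ready: read `oc get nodes --no-headers` output on stdin and
# succeed only when every node reports a STATUS of exactly "Ready".
all_nodes_ready() {
  awk '$2 != "Ready" { bad++ } END { exit (bad > 0) }'
}

# Example poll loop:
# until oc get nodes --no-headers | all_nodes_ready; do sleep 10; done
```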


Issue: "No space left on device"

Symptoms: Pods evicted, OOM errors, disk usage 100%

Quick Fix:

```bash
# Check disk usage per node
oc top nodes

# Clean up unused images
oc adm prune images --all --force

# Delete unused PVCs
oc delete pvc --all -n default   # only safe if no data is needed
```
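For alerting before evictions start, the `df` use% on a node can be extracted as a bare number. A sketch: `disk_pct_used` is a hypothetical helper, fed e.g. from `df -P /var/lib` inside an `oc debug node/...` session.

```shell
# disk_pct_used: read `df -P <path>` output on stdin and print the Use%
# column as a bare number, so scripts can compare it against a threshold.
disk_pct_used() {
  awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# Example: [ "$(df -P /var | disk_pct_used)" -lt 90 ] || echo "disk nearly full"
```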


Kafka KRaft Cluster

Issue: Brokers Keep Crashing (CrashLoopBackOff)

Symptoms:

```bash
$ kubectl get pods -n kafka
NAME      READY   STATUS             RESTARTS
kafka-0   0/1     CrashLoopBackOff   5
kafka-1   0/1     CrashLoopBackOff   4
```

Quick Fixes:

  1. Check logs:

     ```bash
     kubectl logs kafka-0 -n kafka --tail=50
     ```

  2. Common causes:
     - Insufficient disk space (Kafka needs > 5 GB)
     - JVM heap too small (set -Xmx2G)
     - Port already in use (9092, 29092)

  3. Fix & restart:

     ```bash
     # Update the Kafka StatefulSet with more memory
     kubectl patch statefulset kafka -n kafka -p \
       '{"spec":{"template":{"spec":{"containers":[{"name":"kafka","resources":{"requests":{"memory":"4Gi"}}}]}}}}'

     # Delete pods to force a restart
     kubectl delete pod kafka-0 kafka-1 kafka-2 -n kafka
     ```
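The restart counts in the symptoms can be checked mechanically instead of by eye. A sketch: `crashlooping_pods` is a hypothetical helper over `kubectl get pods --no-headers`, whose 4th column is RESTARTS.

```shell
# crashlooping_pods N: read `kubectl get pods --no-headers` output on stdin
# and print the names of pods whose RESTARTS count (4th column) exceeds N.
crashlooping_pods() {
  awk -v n="$1" '$4 + 0 > n { print $1 }'
}

# Example: kubectl get pods -n kafka --no-headers | crashlooping_pods 3
```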


Issue: Topics Not Receiving Messages

Symptoms:

```bash
$ kafka-console-consumer --topic telemetry.logs --from-beginning
[No messages for 30 seconds]
```

Quick Fixes:

  1. Check topics exist:

     ```bash
     kafka-topics.sh --bootstrap-server kafka:9092 --list
     ```

     If the list is empty, create the topics:

     ```bash
     kafka-topics.sh --bootstrap-server kafka:9092 --create \
       --topic telemetry.logs --partitions 3 --replication-factor 2
     ```

  2. Check producer connectivity:

     ```bash
     kafka-console-producer.sh --bootstrap-server kafka:9092 --topic test
     > hello
     # (press Ctrl+C to exit)
     ```

  3. Check Schema Registry (if using one):

     ```bash
     curl http://schema-registry:8081/subjects
     ```
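The topic-existence check can be automated by diffing the expected topics against the `--list` output. A sketch: `missing_topics` is a hypothetical helper name.

```shell
# missing_topics TOPIC...: given the desired topic names as arguments and the
# output of `kafka-topics.sh --list` on stdin, print topics that still need
# to be created. -F treats names as fixed strings (dots are not wildcards).
missing_topics() {
  existing=$(cat)
  for t in "$@"; do
    printf '%s\n' "$existing" | grep -qxF "$t" || echo "$t"
  done
}

# Example:
# kafka-topics.sh --bootstrap-server kafka:9092 --list \
#   | missing_topics telemetry.logs telemetry.traces telemetry.metrics
```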


OTel & SigNoz

Issue: No Traces Appearing in SigNoz

Symptoms: SigNoz web UI shows 0 traces. The sample app sends requests, but nothing appears in the dashboard.

Quick Fixes (check in order):

  1. Is the OTel Collector receiving data?

     ```bash
     kubectl logs -n observability -l app=otel-collector | grep -i "exported\|received"
     # Should show "Exported N spans" every few seconds
     ```

  2. Is the sample app instrumented?

     ```bash
     # Check if the sample app is sending OTLP
     kubectl exec -it [pod] -c app -- curl localhost:4317
     # Should time out (4317 is the OTLP gRPC port, not HTTP)
     ```

  3. Is the SigNoz backend receiving?

     ```bash
     kubectl logs -n observability -l app=signoz-backend | tail -50
     # Check for errors or "received spans"
     ```

  4. Is ClickHouse storing data?

     ```bash
     kubectl exec -it clickhouse-0 -n observability -- \
       clickhouse-client -q "SELECT count() FROM traces"
     # Should show > 0
     ```

  5. Is data flowing through Kafka? (check whether it is buffering)

     ```bash
     kafka-console-consumer --bootstrap-server kafka:9092 \
       --topic telemetry.traces --max-messages 5
     # Should show trace data (JSON)
     ```
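The collector check can run as a scripted health probe. A sketch: `exporter_alive` is a hypothetical helper, and the `Exported ... spans` pattern assumes the log line format quoted in the check above, which may vary by collector version.

```shell
# exporter_alive: read recent collector log lines on stdin and succeed if
# the collector has reported exporting spans at least once. Assumes log
# lines like "Exported 128 spans".
exporter_alive() {
  grep -qiE 'exported [0-9]+ span'
}

# Example:
# kubectl logs -n observability -l app=otel-collector --tail=200 \
#   | exporter_alive || echo "collector is not exporting"
```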


Issue: OTel Collector Memory Growing

Symptoms:

```bash
$ kubectl top pods -n observability
NAME                  CPU    MEMORY
otel-collector-xxxx   500m   800Mi   ← growing!
```

Quick Fixes:

  1. Increase the memory limit:

     ```bash
     kubectl set resources ds otel-collector -n observability \
       --limits=memory=2Gi --requests=memory=1Gi
     ```

  2. Reduce the batch size:

     ```bash
     # Edit the OTel config
     kubectl edit cm otel-collector-config -n observability
     # Change the batch processor:
     #   send_batch_size: 100   # lower than the default
     ```

  3. Enable sampling (drop 90% of traces):

     ```yaml
     # In the OTel config, add:
     processors:
       probabilistic_sampler:
         sampling_percentage: 10   # keep 10%, drop 90%
     ```

WSO2 APIM

Issue: "502 Bad Gateway" from API Gateway

Symptoms:

```bash
$ curl http://apim-gateway/api/
502 Bad Gateway
```

Quick Fixes:

  1. Check APIM pod status:

     ```bash
     kubectl get pods -n wso2 -l app=apim   # should be Running
     ```

  2. Check logs:

     ```bash
     kubectl logs -n wso2 -l app=apim --tail=50
     # Look for errors, connection refused, etc.
     ```

  3. API not deployed?
     - Log into the APIM console (port 9443)
     - Check: Publisher → APIs → Your API → Deployed?
     - If not: publish it

  4. Backend service not responding?

     ```bash
     # Check if the sample app is running
     kubectl get pods -n default | grep sample-app
     ```


Issue: SAML SSO Redirects to Login Loop

Symptoms:
  1. Click "SSO" → redirected to the Identity Server
  2. Log in with credentials
  3. Redirected back to the APIM console
  4. Repeats (login loop)

Quick Fixes:

  1. Check SAML metadata:

     ```bash
     # Get the Identity Server URL
     IS_URL=$(kubectl get svc wso2-is -n wso2 -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
     curl $IS_URL/saml/metadata   # should return XML metadata
     ```

  2. Check SAML assertion validation:
     - Browser DevTools → Network tab
     - Look for the SAMLResponse form parameter
     - Base64-decode it and check: Subject, Audience, NotOnOrAfter

  3. Common cause: clock skew (server times don't match):

     ```bash
     # Check the server time
     kubectl exec wso2-is-0 -n wso2 -- date
     # Compare with local time
     date
     # Difference > 5 min? Fix clock sync (NTP)
     ```

  4. Fallback: skip SSO and use an API key instead:
     - Generate an API key in the APIM console
     - Use it instead of SSO
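The clock-skew check can be made exact by comparing epoch seconds instead of eyeballing two `date` outputs. A sketch: `clock_skew_ok` is a hypothetical helper, and 300 s is a commonly used SAML tolerance, not a WSO2-specific value.

```shell
# clock_skew_ok A B: compare two epoch-second timestamps and succeed only
# when they differ by less than 300 s (a typical SAML time tolerance).
clock_skew_ok() {
  d=$(( $1 - $2 ))
  if [ "$d" -lt 0 ]; then d=$(( -d )); fi
  [ "$d" -lt 300 ]
}

# Example:
# remote=$(kubectl exec wso2-is-0 -n wso2 -- date +%s)
# clock_skew_ok "$remote" "$(date +%s)" || echo "fix clock sync (NTP)"
```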


Redis Sentinel

Issue: Failover Not Triggering

Symptoms:
  1. Kill the Redis master pod
  2. Wait 30 seconds
  3. Master still down, no failover happened

Quick Fixes:

  1. Check Sentinel status:

     ```bash
     kubectl exec redis-sentinel-0 -n redis -- \
       redis-cli -p 26379 sentinel masters
     # Should show 1 master
     ```

  2. Check Sentinel quorum:

     ```bash
     kubectl exec redis-sentinel-0 -n redis -- \
       redis-cli -p 26379 info
     # Look for "sentinels=3" (need 3 total)
     ```

  3. Manually trigger a failover:

     ```bash
     kubectl exec redis-sentinel-0 -n redis -- \
       redis-cli -p 26379 sentinel failover mymaster
     ```

  4. Check replication is working:

     ```bash
     kubectl exec redis-0 -n redis -- redis-cli info replication
     # Should show role:slave (or role:master after failover)
     ```


Issue: Redis Data Lost After Failover

Symptoms:
  1. Set a key: `SET mykey "hello"`
  2. Kill the master
  3. Failover happens
  4. Get the key: `GET mykey`
  5. Result: `(nil)` ← data lost!

Causes:
- Replication lag (data not synced before failover)
- AOF (append-only file) persistence disabled

Quick Fix:

```bash
# Enable AOF persistence
kubectl exec redis-0 -n redis -- \
  redis-cli CONFIG SET appendonly yes

# Check replication lag
kubectl exec redis-master -n redis -- redis-cli info replication
# Compare "master_repl_offset" with the replica's offset in the
# "slave0:..." line; the difference should be low (< 100 bytes)
```
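The lag can be computed rather than read by eye. A sketch: `repl_lag` is a hypothetical helper; the field names `master_repl_offset` and `slave0:` follow the `INFO replication` output format, and `tr -d '\r'` strips the CRLF line endings `redis-cli` emits.

```shell
# repl_lag: read `redis-cli info replication` output (from the master) on
# stdin and print the byte lag between master_repl_offset and the first
# replica's offset in the "slave0:" line.
repl_lag() {
  tr -d '\r' | awk -F'[:,=]' '
    /^master_repl_offset/ { m = $2 }
    /^slave0:/ { for (i = 1; i <= NF; i++) if ($i == "offset") s = $(i + 1) }
    END { print m - s }
  '
}

# Example:
# kubectl exec redis-master -n redis -- redis-cli info replication | repl_lag
```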


GitLab CI/CD

Issue: Pipeline Fails - "Container Image Pull Failed"

Symptoms: CI/CD pipeline fails with:

```
Error response from daemon: pull access denied for myimage
```

Quick Fixes:

  1. Check container registry credentials:
     - In the GitLab CI/CD settings, add REGISTRY_URL, REGISTRY_USER, REGISTRY_PASSWORD

  2. Check the image exists:

     ```bash
     docker pull myregistry.com/myimage:latest
     # If the pull fails: the image is not there, or the tag is wrong
     ```

  3. Check Nexus/Docker Hub quota:
     - If using Nexus: check that storage is not full
     - If using Docker Hub: check rate limits (100 pulls per 6 h for anonymous users)


Issue: ".gitlab-ci.yml" Not Triggering

Symptoms: You push a commit that includes a .gitlab-ci.yml, but no pipeline starts.

Quick Fixes:

  1. Check the .gitlab-ci.yml syntax:
     - In the GitLab UI: CI/CD → Editor → Validate

  2. Check a GitLab Runner is registered:

     ```bash
     gitlab-runner list   # should show at least 1 runner
     ```

  3. Check the GitLab webhook:
     - In the project: Settings → Integrations → Webhooks
     - The push event should be enabled


Compliance & PCI-DSS

Issue: Compliance Scan Reports No Findings

Symptoms:

```bash
$ oc get compliancescans pci-dss-scan -n openshift-compliance
STATUS=DONE   RESULT=NOTAPPLICABLE
```

Quick Fixes:

  1. Check the nodes are labeled:

     ```bash
     oc get nodes --show-labels   # look for the scan=true label
     # If missing, add it:
     oc label node node-0 scan=true
     ```

  2. Run the scan again:

     ```bash
     oc delete compliancescans pci-dss-scan -n openshift-compliance
     oc apply -f pci-dss-scan.yaml
     watch oc get compliancescans -n openshift-compliance
     ```

Generic Debugging

Pod Stuck in Pending State

```bash
# Check why the pod won't start
kubectl describe pod [podname]

# Common causes:
# - ImagePullBackOff: image not found
# - CrashLoopBackOff: app exits immediately
# - Pending: resource quota exceeded, or no node available

# Get more details
kubectl logs [podname]
kubectl get events -n [namespace]
```

High Memory Usage

```bash
# Find the culprit
kubectl top pods -A | sort -k4 -n

# If a single pod is using too much
kubectl set resources deployment [name] \
  --limits=memory=4Gi --requests=memory=2Gi

# If cluster-wide: scale down non-critical deployments
kubectl delete deployment [non-critical] -n [ns]
```
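Note that `sort -k4 -n` treats `2Gi` as smaller than `800Mi` because it ignores the unit suffix. Normalizing to MiB first makes the sort honest; a sketch, where `mem_mi` is a hypothetical helper that handles only the Mi/Gi suffixes `kubectl top` prints:

```shell
# mem_mi: read memory values such as "800Mi" or "2Gi" (one per line, as in
# the MEMORY column of `kubectl top pods`) and print each as plain MiB so
# numeric sorting works across mixed units.
mem_mi() {
  awk '{ u = 1
         if ($0 ~ /Gi$/) u = 1024
         v = $0
         sub(/[GM]i$/, "", v)
         print v * u }'
}

# Example:
# kubectl top pods -A --no-headers | awk '{print $4}' | mem_mi | sort -n
```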

Network Connectivity Issues

```bash
# Test between pods
kubectl run -it --image=busybox debug -- sh
/ # wget -O- http://[service]:[port]

# If it times out: check NetworkPolicies
kubectl get networkpolicies -A

# If there is no connection: check the Service exists and has endpoints
kubectl get svc [service]
kubectl get endpoints [service]
```


Escalation Path

If none of these quick fixes work:

  1. Pair Program (30 min)
     - Post the problem in Slack/standup
     - Another person looks with fresh eyes
     - Often finds the issue immediately

  2. Search GitHub Issues
     - Projects: openshift/installer, openshift/ocs-operator, etc.
     - Search for your error message
     - Often someone else has hit the same issue and posted a solution

  3. Post to the Community
     - OpenShift forums
     - Kubernetes Slack
     - Answers often arrive within 2 hours

  4. Skip & Document
     - If unresolvable, skip the component
     - Document it as a "Known Limitation"
     - Move on to the next issue
     - Revisit after other work is done
Troubleshooting Guide Created: 2026-04-24
Status: Active (keep updating as you find issues)
Owner: All team members
Next Step: Add issues you find during execution