
APiGen Chaos - Chaos Engineering Module

Comprehensive chaos engineering and resilience testing module for APiGen. Verify system behavior under failure conditions, network issues, resource constraints, and service degradation.

Features

🐒 Chaos Monkey Integration

  • Latency Injection: Add random delays to method executions
  • Exception Throwing: Randomly throw exceptions to simulate failures
  • Application Kill: Terminate the application to test recovery
  • Resource Stress: Simulate memory and CPU pressure

🌐 Network Chaos (Toxiproxy)

  • Latency Injection: Add network latency with configurable jitter
  • Bandwidth Limiting: Throttle network throughput
  • Connection Cutting: Simulate network partitions
  • Timeout Simulation: Introduce delays exceeding timeout thresholds
  • Packet Loss: Drop packets to simulate unreliable networks

🔧 Service Failure Simulation (WireMock)

  • HTTP Errors: Return specific status codes (500, 404, 503, etc.)
  • Timeouts: Delay responses beyond timeout thresholds
  • Malformed Responses: Return invalid JSON or corrupted data
  • Random Failures: Probabilistic failure injection
  • Circuit Breaker Testing: Simulate circuit breaker patterns
  • Variable Latency: Random response times within a range

🗄️ Database Chaos

  • Connection Failures: Simulate database connection drops
  • Transient Failures: Limited consecutive failures followed by recovery
  • Slow Connections: Add delays to connection establishment
  • Partial Failures: Probabilistic connection failures

💾 Resource Stress Testing

  • CPU Stress: Saturate CPU cores with configurable thread count
  • Memory Stress: Allocate large memory blocks
  • Memory Leak Simulation: Continuous memory allocation
  • Resource Monitoring: Track memory and CPU usage

🎯 Test Orchestration

  • Scenario Builder: Fluent API for complex chaos scenarios
  • Parallel Execution: Run multiple chaos scenarios concurrently
  • Custom Actions: Define custom chaos behaviors
  • Result Tracking: Monitor scenario execution and outcomes

Installation

Add to your build.gradle:

groovy
dependencies {
    // No version is given here; supply one or manage it through a BOM/platform
    testImplementation 'com.jnzader:apigen-chaos'
}

Quick Start

1. Chaos Monkey Configuration

yaml
# application-chaos.yml
chaos:
  monkey:
    enabled: true
    latency-enabled: true
    latency-min: 100
    latency-max: 5000
    exceptions-enabled: true
    level: method
    attack-probability: 0.1
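
To make the interplay of `attack-probability`, `latency-min`, and `latency-max` concrete, the following is a minimal sketch of how a single call might be selected for a latency attack. The class `LatencyAttackSketch` and its method are hypothetical illustrations, not APiGen's or Chaos Monkey's actual implementation; it assumes the delay is drawn uniformly from the configured range.

```java
import java.util.OptionalLong;
import java.util.Random;

/** Hypothetical sketch of a latency attack decision; not APiGen's actual code. */
final class LatencyAttackSketch {
    private final double attackProbability; // 0.1 means roughly 10% of calls
    private final long latencyMinMs;
    private final long latencyMaxMs;
    private final Random random;

    LatencyAttackSketch(double attackProbability, long latencyMinMs,
                        long latencyMaxMs, Random random) {
        if (attackProbability < 0.0 || attackProbability > 1.0) {
            throw new IllegalArgumentException("attackProbability must be in [0.0, 1.0]");
        }
        if (latencyMinMs < 0 || latencyMinMs > latencyMaxMs) {
            throw new IllegalArgumentException("require 0 <= latencyMin <= latencyMax");
        }
        this.attackProbability = attackProbability;
        this.latencyMinMs = latencyMinMs;
        this.latencyMaxMs = latencyMaxMs;
        this.random = random;
    }

    /** Returns the delay to inject for one call, or empty if the call is left alone. */
    OptionalLong nextDelayMs() {
        if (random.nextDouble() >= attackProbability) {
            return OptionalLong.empty();
        }
        // Uniform draw from the inclusive range [latencyMinMs, latencyMaxMs]
        long span = latencyMaxMs - latencyMinMs + 1;
        return OptionalLong.of(latencyMinMs + (long) (random.nextDouble() * span));
    }
}
```

With the values from the YAML above, about one call in ten would be delayed by 100 to 5000 ms. Passing a seeded `Random` keeps such a sketch reproducible in tests.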

2. Network Chaos Testing

java
@Autowired
private NetworkChaosSimulator networkChaos;

@Test
void testNetworkLatency() throws Exception {
    // Add 500ms latency with 50ms jitter
    networkChaos.addLatency("database-proxy", 500, 50);

    try {
        // Execute your code while the latency is in effect
        performDatabaseOperation();
    } finally {
        // Restore normal behavior even if the operation fails
        networkChaos.restore("database-proxy");
    }
}
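
As an illustration of what "500ms latency with 50ms jitter" can mean, here is a small sketch that computes a per-request delay. It assumes the jitter is applied as a uniform offset of plus or minus the jitter value, which matches how Toxiproxy describes its latency toxic; the class and method names are hypothetical.

```java
import java.util.Random;

/** Hypothetical sketch of latency-with-jitter arithmetic. */
final class JitterSketch {
    /** Delay drawn uniformly from [latencyMs - jitterMs, latencyMs + jitterMs], floored at 0. */
    static long delayWithJitter(int latencyMs, int jitterMs, Random random) {
        if (latencyMs < 0 || jitterMs < 0) {
            throw new IllegalArgumentException("latency and jitter must be non-negative");
        }
        // nextInt(2 * jitter + 1) yields 0..2*jitter; shifting gives -jitter..+jitter
        int offset = random.nextInt(2 * jitterMs + 1) - jitterMs;
        return Math.max(0L, (long) latencyMs + offset);
    }
}
```

For the test above, this means an operation through `database-proxy` should take at least roughly 450 ms, which is a reasonable lower bound to assert on when measuring elapsed time.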

3. Service Failure Simulation

java
@Autowired
private ServiceFailureSimulator serviceFailure;

@Test
void testServiceTimeout() {
    serviceFailure.start(8080);

    try {
        // Delay responses by 10 seconds, longer than the client timeout
        serviceFailure.simulateTimeout("/api/users", 10000);

        // Your test code
        assertThatThrownBy(() -> callExternalService())
            .hasMessageContaining("timeout");
    } finally {
        // Stop the mock server even if the assertion fails
        serviceFailure.stop();
    }
}

4. Database Chaos

java
@Autowired
private DatabaseChaosSimulator dbChaos;

@Test
void testDatabaseFailover() {
    // Enable transient failures (3 consecutive failures)
    dbChaos.enableTransientFailures(3);

    // First 3 attempts should fail
    for (int i = 0; i < 3; i++) {
        assertThatThrownBy(() -> repository.findAll())
            .isInstanceOf(SQLException.class);
    }

    // 4th attempt should succeed (automatic recovery)
    assertThat(repository.findAll()).isNotEmpty();
}
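
The "fail N times, then recover" behavior can be pictured with a simple counter. The sketch below is a hypothetical stand-in for what `enableTransientFailures(3)` might do internally; it is not the module's implementation, and the class and method names are made up for illustration.

```java
import java.sql.SQLException;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of a transient-failure gate in front of a connection. */
final class TransientFailureSketch {
    private final AtomicInteger remainingFailures;

    TransientFailureSketch(int consecutiveFailures) {
        if (consecutiveFailures < 0) {
            throw new IllegalArgumentException("consecutiveFailures must be >= 0");
        }
        this.remainingFailures = new AtomicInteger(consecutiveFailures);
    }

    /** Throws for the first N calls, then lets every later call through. */
    void beforeConnection() throws SQLException {
        // Decrement only while failures remain, so the counter never goes negative
        int before = remainingFailures.getAndUpdate(n -> n > 0 ? n - 1 : 0);
        if (before > 0) {
            throw new SQLException("Simulated transient failure, " + (before - 1) + " more to come");
        }
    }
}
```

Using an atomic counter keeps the count exact when several threads request connections at once, so exactly N calls fail in total.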

5. Resource Stress Testing

java
@Autowired
private ResourceStressSimulator resourceStress;

@Test
void testUnderMemoryPressure() {
    // Allocate 500MB in 50MB blocks
    resourceStress.startMemoryStress(500, 50);

    try {
        // Verify application still functions
        assertThat(service.processLargeDataset()).isTrue();
    } finally {
        // Release the allocated memory even if the assertion fails
        resourceStress.stopMemoryStress();
    }
}

@Test
void testUnderCpuLoad() {
    // Stress 4 CPU cores for 10 seconds
    resourceStress.startCpuStress(4, 10);

    // Verify performance degradation handling
    long responseTime = measureResponseTime();
    assertThat(responseTime).isLessThan(5000);
}
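
To show what "allocate 500MB in 50MB blocks" amounts to, here is a minimal sketch of block-wise allocation. It is an assumption about the general technique, not the code behind `startMemoryStress`; the class name and methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of block-wise memory stress. */
final class MemoryStressSketch {
    private static final int BYTES_PER_MB = 1024 * 1024;
    private final List<byte[]> blocks = new ArrayList<>();

    /** Allocates totalMb in chunks of at most blockMb and keeps them reachable. */
    void allocate(int totalMb, int blockMb) {
        if (totalMb <= 0 || blockMb <= 0 || blockMb > 2047) {
            throw new IllegalArgumentException("require totalMb > 0 and 0 < blockMb <= 2047");
        }
        for (int allocated = 0; allocated < totalMb; allocated += blockMb) {
            int sizeMb = Math.min(blockMb, totalMb - allocated);
            byte[] block = new byte[sizeMb * BYTES_PER_MB];
            // Touch one byte per 4 KiB so the pages are actually committed
            for (int i = 0; i < block.length; i += 4096) {
                block[i] = 1;
            }
            blocks.add(block);
        }
    }

    int blockCount() {
        return blocks.size();
    }

    /** Drops the references so the garbage collector can reclaim the blocks. */
    void release() {
        blocks.clear();
    }
}
```

Holding the blocks in a list is what creates the pressure: as long as they stay reachable, the garbage collector cannot free them. Releasing them in a `finally` block keeps one test from starving the next.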

6. Orchestrated Chaos Scenarios

java
@Autowired
private ChaosTestOrchestrator orchestrator;

@Test
void testComplexFailureScenario() {
    orchestrator.scenario("Multi-component failure")
        .withNetworkLatency("api-proxy", 1000, 100, 5000)
        .withServiceFailure("/api/orders", 503, 3000)
        .withDatabaseFailure(dbChaos, 0.3, 5000)
        .withCpuStress(2, 10)
        .run()
        .thenAccept(result -> {
            assertThat(result.isSuccess()).isTrue();
        })
        // Block until the scenario finishes; otherwise the test can pass before
        // the assertion above runs (assumes run() returns a CompletableFuture)
        .join();
}
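
The fluent, asynchronous style used above can be sketched with plain `CompletableFuture`. This is a hypothetical, much reduced stand-in for the orchestrator (the names `ScenarioSketch` and `withAction` are invented here); it only shows how chained actions might run concurrently and be reduced to one result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Hypothetical sketch of a fluent scenario builder with parallel execution. */
final class ScenarioSketch {
    private final String name;
    private final List<Runnable> actions = new ArrayList<>();

    private ScenarioSketch(String name) {
        this.name = name;
    }

    static ScenarioSketch scenario(String name) {
        return new ScenarioSketch(name);
    }

    String name() {
        return name;
    }

    /** Registers a custom chaos action and returns this builder for chaining. */
    ScenarioSketch withAction(Runnable action) {
        actions.add(action);
        return this;
    }

    /** Runs all actions concurrently; completes with true only if none of them threw. */
    CompletableFuture<Boolean> run() {
        List<CompletableFuture<Void>> running = new ArrayList<>();
        for (Runnable action : actions) {
            // runAsync uses the common ForkJoinPool unless an executor is supplied
            running.add(CompletableFuture.runAsync(action));
        }
        return CompletableFuture.allOf(running.toArray(new CompletableFuture[0]))
                .handle((ignored, error) -> error == null);
    }
}
```

Because `run()` returns immediately, a test has to wait for the future (for example with `join()`) before asserting on the outcome; assertions placed only inside a callback can otherwise be skipped silently.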

Usage Scenarios

Testing Resilience Patterns

java
@Test
void testCircuitBreakerOpens() {
    // Simulate service failures
    serviceFailure.simulateCircuitBreakerOpen("/api/payment", 3);

    // Make requests until circuit opens
    for (int i = 0; i < 5; i++) {
        try {
            paymentService.processPayment(order);
        } catch (CircuitBreakerOpenException e) {
            // Circuit should open after 3 failures
            assertThat(i).isGreaterThanOrEqualTo(3);
        }
    }
}
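
For readers unfamiliar with the pattern under test, the following is a deliberately small circuit breaker sketch: it counts consecutive failures and rejects calls once a threshold is reached. It is hypothetical and omits the half-open state, timers, and thread safety that a production library provides.

```java
import java.util.function.Supplier;

/** Hypothetical, single-threaded circuit breaker sketch without a half-open state. */
final class CircuitBreakerSketch {
    static final class OpenException extends RuntimeException {
        OpenException() {
            super("circuit is open");
        }
    }

    private final int failureThreshold;
    private int consecutiveFailures;

    CircuitBreakerSketch(int failureThreshold) {
        if (failureThreshold < 1) {
            throw new IllegalArgumentException("failureThreshold must be >= 1");
        }
        this.failureThreshold = failureThreshold;
    }

    /** Runs the call while closed; fails fast once the threshold has been reached. */
    <T> T call(Supplier<T> supplier) {
        if (consecutiveFailures >= failureThreshold) {
            throw new OpenException(); // do not touch the downstream service
        }
        try {
            T value = supplier.get();
            consecutiveFailures = 0; // a success ends the failure streak
            return value;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            throw e;
        }
    }
}
```

With a threshold of 3, the first three failing calls surface the downstream error and the fourth is rejected with `OpenException`, which is the transition the test above looks for.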

Testing Retry Logic

java
@Test
void testRetryOnTransientFailure() {
    // Enable 2 consecutive failures
    dbChaos.enableTransientFailures(2);

    // Service should retry and succeed on 3rd attempt
    List<User> users = userService.findAllWithRetry();

    assertThat(users).isNotEmpty();
}
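
A method such as `findAllWithRetry()` needs some retry loop behind it. The sketch below shows one simple way to write that loop with a fixed pause between attempts; the helper `RetrySketch.withRetry` is hypothetical and not part of APiGen.

```java
import java.util.concurrent.Callable;

/** Hypothetical retry helper with a fixed backoff between attempts. */
final class RetrySketch {
    /** Calls the task up to maxAttempts times and rethrows the last failure if all fail. */
    static <T> T withRetry(Callable<T> task, int maxAttempts, long backoffMs) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMs); // pause before the next attempt
                }
            }
        }
        throw last;
    }
}
```

With two injected failures, `maxAttempts` must be at least 3 for the call to succeed, which is worth keeping in mind when choosing the argument to `enableTransientFailures`.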

Testing Graceful Degradation

java
@Test
void testDegradationUnderLoad() {
    resourceStress.startMemoryStress(1000, 100);
    resourceStress.startCpuStress(8, 30);

    try {
        // Verify service returns cached data instead of failing
        Response response = apiClient.getData();

        assertThat(response.getStatus()).isEqualTo(200);
        assertThat(response.isFromCache()).isTrue();
    } finally {
        // Stop both stressors even if an assertion fails
        resourceStress.stopMemoryStress();
        resourceStress.stopCpuStress();
    }
}
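
The degradation behavior this test expects, serving cached data when the live call fails, can be sketched as a last-known-good cache. The class below is hypothetical and assumes non-null values; it is meant to illustrate the pattern, not the `apiClient` used above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical last-known-good cache used as a fallback when a live call fails. */
final class CachedFallbackSketch<K, V> {
    static final class Response<T> {
        final T body;
        final boolean fromCache;

        Response(T body, boolean fromCache) {
            this.body = body;
            this.fromCache = fromCache;
        }
    }

    private final Map<K, V> lastKnownGood = new ConcurrentHashMap<>();

    /** Tries the live call first and falls back to the cached value on failure. */
    Response<V> fetch(K key, Supplier<V> liveCall) {
        try {
            V fresh = liveCall.get();
            lastKnownGood.put(key, fresh); // remember the latest successful value
            return new Response<>(fresh, false);
        } catch (RuntimeException e) {
            V cached = lastKnownGood.get(key);
            if (cached == null) {
                throw e; // nothing to degrade to
            }
            return new Response<>(cached, true);
        }
    }
}
```

A test for this pattern needs one successful call before the stress starts, otherwise the cache is empty and there is nothing to fall back on.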

Testing Timeout Handling

java
@Test
void testTimeoutHandling() {
    // Simulate slow downstream service
    serviceFailure.simulateVariableLatency("/api/external", 5000, 10000);

    // Should timeout and use fallback
    CompletableFuture<Data> future = service.fetchDataAsync();

    assertThatThrownBy(() -> future.get(3, TimeUnit.SECONDS))
        .isInstanceOf(TimeoutException.class);

    // Verify fallback was used
    Data fallbackData = service.getFallbackData();
    assertThat(fallbackData).isNotNull();
}
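
One way a service can implement "time out and use a fallback" is `CompletableFuture.completeOnTimeout` (Java 9+). The helper below is a hypothetical sketch of that approach and is not necessarily how `fetchDataAsync()` behaves.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of a timeout with a fallback value. */
final class TimeoutFallbackSketch {
    /**
     * Completes the given future with the fallback if it has not finished within timeoutMs.
     * Note that completeOnTimeout acts on the passed future itself and returns it.
     */
    static <T> CompletableFuture<T> withFallback(CompletableFuture<T> call,
                                                 long timeoutMs, T fallback) {
        return call.completeOnTimeout(fallback, timeoutMs, TimeUnit.MILLISECONDS);
    }
}
```

With this style the caller never sees a `TimeoutException`; it receives the fallback value instead. The test above takes the other route: it asserts that `future.get(3, TimeUnit.SECONDS)` times out and then checks the fallback separately.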

Configuration

Chaos Monkey Properties

yaml
# Example values; the allowed values for each key are listed in the comments
chaos:
  monkey:
    enabled: true                 # true | false
    latency-enabled: true         # true | false
    latency-min: 100              # ms
    latency-max: 5000             # ms
    exceptions-enabled: true      # true | false
    exception-message: "Custom error message"
    kill-enabled: false           # WARNING: Terminates application
    memory-stress-enabled: false
    cpu-stress-enabled: false
    level: method                 # method | service | repository | component | restController
    attack-probability: 0.1       # 0.0-1.0 (0.1 = 10%)
    watcher-enabled: true         # true | false

Actuator Endpoints

Monitor and control chaos experiments via Spring Boot Actuator:

bash
# Enable/disable chaos monkey
POST /actuator/chaosmonkey/enable
POST /actuator/chaosmonkey/disable

# Get current configuration
GET /actuator/chaosmonkey

# Get watcher status
GET /actuator/chaosmonkey/watchers
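
From a test or a small tool, these endpoints can be called with the JDK's built-in HTTP client (Java 11+). The sketch below only builds the requests; the base URL is an assumption (for example `http://localhost:8080`), the class name is hypothetical, and the endpoints must be exposed by your actuator configuration.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical helper that builds requests for the Chaos Monkey actuator endpoints. */
final class ChaosMonkeyActuatorSketch {
    /** Builds POST {baseUrl}/actuator/chaosmonkey/enable or .../disable. */
    static HttpRequest toggle(String baseUrl, boolean enable) {
        String action = enable ? "enable" : "disable";
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/actuator/chaosmonkey/" + action))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    /** Builds GET {baseUrl}/actuator/chaosmonkey to read the current configuration. */
    static HttpRequest configuration(String baseUrl) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/actuator/chaosmonkey"))
                .GET()
                .build();
    }
}
```

To send a request, pass it to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` and check the returned status code.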

Best Practices

  1. Start Small: Begin with single failure types before combining scenarios
  2. Use Profiles: Enable chaos only in test/staging environments
  3. Monitor Metrics: Track application metrics during chaos tests
  4. Set Timeouts: Always configure test timeouts to prevent hanging
  5. Clean Up: Always restore normal behavior after tests
  6. Gradual Increase: Start with low failure probabilities and increase gradually
  7. Document Scenarios: Maintain a catalog of tested failure scenarios
  8. Automate: Include chaos tests in CI/CD pipelines

Safety

  • Never run in production without proper controls
  • Use @Profile("!prod") to prevent accidental production usage
  • Set chaos.monkey.enabled=false by default
  • Implement kill switches for chaos experiments
  • Monitor resource usage during stress tests
  • Set conservative timeouts
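
A kill switch, as recommended in the list above, can be as simple as a shared flag that every chaos action checks before running. The following is a minimal hypothetical sketch; the class and method names are not part of APiGen.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical kill switch that gates chaos actions; off by default. */
final class ChaosKillSwitchSketch {
    private final AtomicBoolean active = new AtomicBoolean(false);

    void enable() {
        active.set(true);
    }

    /** Stops all further chaos actions that go through this switch. */
    void kill() {
        active.set(false);
    }

    /** Runs the chaos action only while the switch is on; returns whether it ran. */
    boolean runIfActive(Runnable chaosAction) {
        if (!active.get()) {
            return false;
        }
        chaosAction.run();
        return true;
    }
}
```

Starting in the off state mirrors the advice to keep `chaos.monkey.enabled=false` by default: chaos has to be switched on deliberately, and a single call switches it off again.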

Dependencies

  • Chaos Monkey for Spring Boot 3.2.0
  • Toxiproxy Java 2.1.7
  • WireMock 3.10.0
  • Testcontainers 1.21.4
  • Awaitility 4.2.2

Examples

See src/test/java for comprehensive examples:

  • NetworkChaosIntegrationTest.java - Network chaos scenarios
  • ServiceFailureIntegrationTest.java - Service failure patterns
  • DatabaseChaosIntegrationTest.java - Database resilience testing
  • ResourceStressIntegrationTest.java - Resource stress scenarios

Contributing

Contributions welcome! Please ensure:

  • All chaos scenarios have corresponding tests
  • Safety mechanisms are in place
  • Documentation is updated
  • Examples are provided

License

MIT License - see LICENSE file for details
