Chaos Engineering is a field that always draws my attention. I first came across it when I heard about the Netflix Simian Army toolkit https://github.com/Netflix/SimianArmy . At first glance, it is hard to believe that anyone would run a tool in production that randomly shuts down production servers (Chaos Monkey). Later on, I watched Tammy Bryant Butow's video on YouTube and learned about Gremlin, a hosted service that lets you run chaos experiments. After one week of study, I am now a certified Gremlin Chaos Engineering Practitioner.
I used only the two resources below to prepare for the exam.
- Gremlin Tutorial: https://www.gremlin.com/community/tutorials/?ref=nav
- Gremlin YouTube Channel: https://www.youtube.com/channel/UC6PAoCqf2LSw6Hth-4M4yEQ
- If you need more practice and hands-on experience, you can attend Gremlin Bootcamp https://www.gremlin.com/bootcamps/?ref=nav
- Number of Questions: 20
- Question Type: Single and Multiple Choice, Drag and Drop
- If you still have any doubts about the exam format, please watch this video https://www.youtube.com/watch?v=TL1j2MJBE0A&t=1248s.
NOTE: The exam is free of cost; you can register via the link below: https://gremlin.coassemble.com/unlock/7Jan8Su
- To prepare for the exam, the first thing you can do is to create a free account on the Gremlin website https://app.gremlin.com/?ref=nav
- Get familiar with how to install the Gremlin agent
- To attack a host, the Gremlin agent must be installed on that host. Gremlin supports various operating systems (Ubuntu, CentOS, RHEL, Windows); you can also download the Docker image https://hub.docker.com/r/gremlin/gremlin or add the Helm repo:
helm repo add gremlin https://helm.gremlin.com
- This is what the architecture looks like
- In the case of Ubuntu, these are the steps you need to follow, as shown in the above diagram.
    echo "deb https://deb.gremlin.com/ release non-free" | sudo tee /etc/apt/sources.list.d/gremlin.list
    sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys XXXX
    sudo apt-get update && sudo apt-get install -y gremlin gremlind
- Once these steps are done, register the installed agent with the Gremlin Control Plane using your Team ID and Secret Key. To find them, go to the Team Settings page and make a note of the Team ID and Secret Key (if you don’t know the Secret Key, click the Reset button).
- Run the gremlin init command and enter the Team ID and Secret Key you noted in the previous step
    $ gremlin init
    Metadata set for [ gremlin-client-version: 2.20.0 ]
    Metadata set for [ os-type: Linux ]
    Metadata set for [ os-name: Ubuntu ]
    AWS metadata may be present
    Metadata set for [ instance-id: i-0550fdb260931639b ]
    Metadata set for [ local-hostname: ip-172-31-28-103.ec2.internal ]
    Metadata set for [ local-ip: 172.31.28.103 ]
    Metadata set for [ public-hostname: ec2-184-73-139-79.compute-1.amazonaws.com ]
    Metadata set for [ public-ip: 220.127.116.11 ]
    Metadata set for [ azid: use1-az4 ]
    Metadata set for [ cloud: AWS ]
    Metadata set for [ image-id: ami-09e67e426f25ce0d7 ]
    Metadata set for [ instance-type: t2.micro ]
    Metadata set for [ region: us-east-1 ]
    Metadata set for [ zone: us-east-1c ]
    Unable to describe AWS tags. The error message is: No such file or directory (os error 2)
    Azure metadata may be present
    Please input your Team ID: <-------- XXXXXXXX
    Please input your Team Secret: <--------
    Using XXXXXX for Team Id
    Using 172.31.28.103 for Gremlin identifier
- Go to the Gremlin dashboard, and you will see your newly added host.
- You are now all set to perform various attacks by clicking the Attack button.
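The Ubuntu install and registration steps above can be collected into one small script. This is only a sketch: the `run` helper and the `DRY_RUN` switch are my own additions (hypothetical, not part of Gremlin), and `XXXX` is the same placeholder key ID used above, which you would replace with the value from the Gremlin docs. By default the script only prints the commands, so you can review them before anything changes on the host.

```shell
#!/usr/bin/env bash
# Sketch: print (or run) the Ubuntu install and registration steps in order.
# DRY_RUN defaults to 1, so nothing is executed unless you opt in with DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"

run() {
  # Hypothetical helper: echo the command in dry-run mode, execute it otherwise.
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

install_gremlin_ubuntu() {
  # 1. Add the Gremlin apt repository.
  run sh -c 'echo "deb https://deb.gremlin.com/ release non-free" | sudo tee /etc/apt/sources.list.d/gremlin.list'
  # 2. Import the repository key (XXXX is a placeholder, not a real key ID).
  run sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys XXXX
  # 3. Install the client and the daemon.
  run sudo apt-get update
  run sudo apt-get install -y gremlin gremlind
  # 4. Register the agent; this prompts for the Team ID and Secret Key.
  run gremlin init
}

install_gremlin_ubuntu
```

Running it with `DRY_RUN=0` would execute the same commands for real; I would only do that on a test host.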
Get familiar with the various types of attacks you can perform via Gremlin
Using Gremlin, you can trigger various attacks depending on the infrastructure you target (hosts, containers, or Kubernetes).
Resource: Test against sudden changes in consumption of computing resources.
- CPU: Test that your application behaves as expected even when CPU capacity is limited or exhausted
- Disk: Test system and application behavior when storage space is limited or unavailable, and validate dynamic storage provisioning systems
- IO: Test against heavy IO operations to understand their effect on your applications
- Memory: Test your systems against memory consumption to ensure they can tolerate and perform given a sudden increase in usage
State: Test against unexpected changes in your environment, such as power outages, node failures, clock drift, or application crashes.
- Process Killer: Test against application crashes and similar events by terminating specific sets of processes
- Shutdown: Test resilience to host failures by rebooting or shutting down targeted host operating systems
- Time Travel: Test for scenarios such as Daylight Saving Time (DST), clock drift between hosts, and expiring SSL/TLS certificates
Network: Test against unreliable network conditions.
- Blackhole: Test against unreachable dependencies by dropping network traffic between services
- DNS: Test against DNS outages, and validate both fallback DNS servers and DNS resolver configurations
- Latency: Test your system’s responsiveness under varying network conditions by injecting a controlled delay into outbound network traffic
- Packet Loss: Test your system’s end user experience when a percentage of outbound network packets are dropped or corrupted
Try to perform some of these attacks before the exam. For example, to test a shutdown, go to State and click Shutdown; you have the option to add a delay and to reboot the host after the shutdown.
- You can go to the host and see what command it’s executing.
    $ ps aux|grep -i gremlin
    gremlin 2142 0.0 0.9 23420 9328 ? Ssl 04:42 0:00 /usr/sbin/gremlind
    gremlin 2362 0.0 0.8 23612 8516 ? Sl 05:07 0:00 gremlin attack shutdown -d 1 -r
- Gremlin also provides a friendly UI, where you can view this.
- Similarly, you can perform other kinds of attacks, such as a CPU attack. In this scenario, the attack runs for 60 seconds at 50% CPU utilization on all cores.
- You can go back to the host and check the CPU utilization using the top command.
- To use Gremlin with EKS, please check this blog https://www.gremlin.com/community/tutorials/how-to-install-and-use-gremlin-with-eks/
- To use Gremlin with RDS, see https://www.gremlin.com/community/tutorials/how-to-use-gremlin-with-amazon-rds/
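If you check the host often while an attack is running, the `ps aux | grep` step from earlier can be wrapped in a small filter. The sketch below is my own helper (`list_gremlin_attacks` is a hypothetical name, not a Gremlin command). It assumes the attack process appears in `ps aux` as `gremlin attack ...`, exactly as in the output captured earlier in this post; if your agent shows a full path instead, the match needs adjusting.

```shell
#!/usr/bin/env bash
# Print "PID: command" for every running Gremlin attack found in `ps aux` output.
# Reads the ps output on stdin so it can also be tried against saved output.
list_gremlin_attacks() {
  # In `ps aux`, field 2 is the PID and the command line starts at field 11.
  awk '$11 == "gremlin" && $12 == "attack" {
    cmd = $11
    for (i = 12; i <= NF; i++) cmd = cmd " " $i
    print $2 ": " cmd
  }'
}

# Sample input copied from the ps output shown earlier in this post.
sample='gremlin 2142 0.0 0.9 23420 9328 ? Ssl 04:42 0:00 /usr/sbin/gremlind
gremlin 2362 0.0 0.8 23612 8516 ? Sl 05:07 0:00 gremlin attack shutdown -d 1 -r'

printf '%s\n' "$sample" | list_gremlin_attacks
# prints: 2362: gremlin attack shutdown -d 1 -r
```

On a live host you would run `ps aux | list_gremlin_attacks` instead of feeding it the sample. The daemon line (`/usr/sbin/gremlind`) is skipped on purpose, so only active attacks are listed.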
3. Get familiar with the gremlin command line.
    $ gremlin -h
    gremlin

    USAGE:
        gremlin <SUBCOMMAND>

    FLAGS:
        -h, --help    Prints help information

    SUBCOMMANDS:
        attack                Run a new gremlin attack against this host
        attack-container      Run a new gremlin attack against the specified container
        check                 Show runtime troubleshooting data
        help                  Prints this message or the help of the given subcommand(s)
        init                  Initialize a new client session with the Gremlin service
        logout                Remove this client from the Gremlin service
        measure               Measure then report dynamic system data
        rollback              Interrupt an active attack, or revert the last impact
        rollback-container    Interrupt an active attack against a Docker container
        status                Show the status of all gremlins or a specific attack
        syscheck              System check was a feature in Gremlin 2.8.x and is no longer supported
        validate              Validate a gremlin
        version               Show version information for the gremlin binary
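To connect the CLI with the UI, here is a rough sketch of how the CPU scenario from earlier (60 seconds, 50% utilization, all cores) might look on the command line. The `attack`, `status`, and `rollback` subcommands come from the help output above. The `--length`, `--percent`, and `--all-cores` flag names are my assumption from memory, so please confirm them with `gremlin help attack` on your agent version. The script only builds and prints the command; it does not start an attack.

```shell
#!/usr/bin/env bash
# Build the CLI form of a CPU attack as a string (nothing is executed).
# ASSUMPTION: --length, --percent and --all-cores are the flag names on your
# agent version; check with `gremlin help attack` before relying on them.
cpu_attack_cmd() {
  local length="$1" percent="$2"
  # Both arguments must be plain non-negative integers.
  case "$length" in ''|*[!0-9]*) echo "length must be an integer" >&2; return 1 ;; esac
  case "$percent" in ''|*[!0-9]*) echo "percent must be an integer" >&2; return 1 ;; esac
  if [ "$percent" -lt 1 ] || [ "$percent" -gt 100 ]; then
    echo "percent must be between 1 and 100" >&2
    return 1
  fi
  echo "gremlin attack cpu --length ${length} --percent ${percent} --all-cores"
}

# Scenario from this post: 60 seconds, 50% utilization, all cores.
cmd="$(cpu_attack_cmd 60 50)"
echo "Would run:    ${cmd}"
echo "Check with:   gremlin status"
echo "Stop early:   gremlin rollback"
```

Keeping the command as a printed string makes it easy to review before pasting it into a terminal on a test host.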
Overall, this exam is straightforward. Go through the Gremlin documentation and YouTube channel (bonus: attend their Bootcamp if you can), and you should be good to go.
The best way to connect with me is through any of the channels below
- Website: https://101daysofdevops.com/
- Linkedin: https://www.linkedin.com/in/prashant-lakhera-696119b/
- Twitter: @100daysofdevops OR @lakhera2015
- Facebook: https://www.facebook.com/groups/795382630808645/
- Medium: https://medium.com/@devopslearning
- GitHub: https://github.com/100daysofdevops/100daysofdevops
- YouTube Channel: https://www.youtube.com/user/laprashant/videos
- Slack: https://join.slack.com/t/100daysofdevops/shared_invite/zt-au03logz-YfDUp_FJF4rAUeDEbgWmsg
- Reddit: r/101DaysofDevops
- Meetup: https://www.meetup.com/100daysofdevops/