Design and maintain reliable, scalable, and secure systems.
Monitor performance, troubleshoot issues, and drive incident resolution.
Automate workflows and improve CI/CD pipelines.
Optimize system capacity and ensure infrastructure resilience.
Collaborate with teams to embed reliability into development practices.

Strong Linux/Unix, scripting (Python, Bash, Go), and cloud platform experience (AWS, Azure, or GCP).
Proficiency in Kubernetes, Docker, and IaC tools (Terraform/CloudFormation).
Hands-on experience with monitoring tools (Prometheus, Grafana, Datadog).
Strong analytical and problem-solving skills, plus on-call experience.