Search Freelance Talent on Upwork

Site Reliability Engineering

Vietnam

$12/hr

100% Job Success

$500+ earned

I'm an SRE at Asia Commercial Bank Vietnam. I maintain 24/7 service with 99.9% server uptime, and I handle backups and CVE patching for all servers running managed applications. In addition, my work includes:
- Deploying applications to pre-production and production environments.
- Building a monitoring program.
- Monitoring the rollout of new software applications to ensure there are no problems.
- Troubleshooting and resolving any problems with application software.

Pablo A.

Principal DevOps & Site Reliability Engineer

Argentina

$65/hr

$500+ earned

Azure | AWS | GCP | Terraform | CI/CD Architecture | SRE | DevOps

I am a Principal-level DevOps and Site Reliability Engineer with nearly two decades of experience designing, operating, and scaling mission-critical cloud platforms. I specialize in multi-region Azure architecture, distributed systems, CI/CD modernization, performance engineering, and reliability automation.

Core Competencies
- Multi-Region Cloud Architecture (Azure, AWS, GCP): Design and operation of highly available, compliant platforms (PROD, DEV, QA1, QA2, Sandbox) supporting Web Apps, Function Apps, and data pipelines.
- Site Reliability Engineering (SRE): Incident reduction, automated failover and recovery, health modeling, and elimination of recurring failures.
- Performance Engineering: Diagnosis of CPU/memory saturation, cold starts, auto-heal loops, unstable restarts, and latency improvements via App Service Plan tuning.
- Observability & Monitoring: Azure Monitor, Application Insights, KQL dashboards, Logstash, and Grafana for complete visibility into applications and infrastructure.
- Security & Secrets Automation: Secret rotation with Key Vault + Python + Terraform, RBAC hardening, identity flow governance, and compliance across environments.
- CI/CD & Infrastructure as Code: Terraform (AzureRM), GitHub Actions pipelines, automated DNS/SSL workflows, and secure reproducible IaC for enterprise workloads.

Why Clients Work With Me
- Architect-level decision-making with deep implementation skills
- Proven ability to stabilize and optimize high-scale systems
- Strong diagnostic capabilities for critical production issues
- Modernization of CI/CD, observability, security, and IaC at scale
- Clear communication and high-quality documentation
- Fluent in English, Spanish, and Portuguese

Selected Experience
- Senior DevOps / SRE Consultant, Freelance: Architect for multi-region Azure platforms; implemented SRE practices, automated failover, observability stacks, secret automation, RBAC hardening, and Terraform IaC.
- DevOps Engineer, Globant (BBVA Global Team): CI/CD modernization for 40+ enterprise apps; SonarQube rollout to ~600 developers; compliance improvements; EKS modernization; Jenkins Groovy pipelines.
- Cloud DevOps Engineer, Santander (OnLab): Azure VNets, VPNs, Load Balancers, Bastion, AD/AAD, B2C identity, AKS autoscaling, GitHub Actions/Azure DevOps pipelines, ArgoCD GitOps, observability, Terraform automation.
- QA DevOps Engineer, Web.com: Kubernetes operations, CI/CD pipelines, test automation, and observability with Prometheus/Grafana.
- Unix/Linux Engineering, IBM, Cetelem, RSA, DC Solutions: Solaris/AIX/HP-UX, clustering, virtualization, DTrace debugging, SAN/NAS, CIS hardening, Oracle/JBoss hosting, production troubleshooting.

Godstime J.

Senior Site Reliability Engineer | DevOps & Observability Specialist

Nigeria

$20/hr

80% Job Success

$9K+ earned

6+ Years Engineering High-Availability Cloud Infrastructure, Proactive Observability, and Bulletproof CI/CD Pipelines.

In production, downtime isn’t just an inconvenience; it’s a revenue killer. As a Senior SRE and DevOps Engineer, I bridge the gap between development and operations to ensure your applications are highly scalable, resilient, and transparent. My core philosophy relies on proactive reliability: designing system observability, synthetic monitoring, and robust SLO/SLI frameworks to catch and resolve anomalies before they ever impact your end users.

Whether you need to containerize legacy workloads, automate infrastructure from scratch via Infrastructure as Code (IaC), or build comprehensive, multi-layered dashboards for enterprise-level visibility, I engineer stable environments that empower development teams to deploy with confidence.

Core Areas of Expertise:
- Observability & Monitoring: Deep experience designing end-to-end telemetry across enterprise stacks. Proficient in Datadog, Dynatrace, ELK, Grafana, Prometheus, AppDynamics, Nagios, and Kibana. Specialized in alert tuning, synthetic monitoring, and SLO/SLI mapping.
- Cloud & Container Orchestration: Advanced provisioning and management across AWS and Azure. Expert-level containerization with Docker and Kubernetes (EKS/AKS cluster management, scaling, and security).
- Infrastructure as Code (IaC) & Configuration: Building modular, reusable infrastructure using Terraform, alongside automated configuration management with Ansible and Chef.
- CI/CD & Automation: Architecting reliable continuous integration and continuous deployment pipelines that accelerate release cycles while maintaining strict quality gates.
- Operations & Support: Extensive background in L3 Application Support, web development, hands-on incident response, and executing robust disaster recovery strategies.

Why Work With Me?
- Enterprise-Ready Communication: I translate complex technical infrastructure into high-impact, accessible insights for stakeholders and business owners.
- Self-Driven Execution: Comfortable operating autonomously, managing workflows, writing rigorous technical documentation/runbooks, and collaborating across timezone-agnostic teams.
- Culture of Collaboration: I do not just fix servers; I foster a true DevOps culture of shared responsibility, automation, and continuous improvement.

Let’s ensure your infrastructure is built to scale. Click "Invite to Job" or "Message" to discuss how we can optimize your environment today.

Nikita E.

DevOps / SRE / Infrastructure

Serbia

$35/hr

100% Job Success

$2K+ earned

Offers consultations

Results-driven DevOps/SRE Engineer with a passion for cloud technologies, automation, and building high-availability systems. Proven expertise in optimizing infrastructures and driving operational excellence.

Anatoliy F.

Solution Architect | Site Reliability | DevOps | AWS | GCP | Terraform

Kazakhstan

$40/hr

$0 earned

Hi! I’m Anatoliy, a senior certified DevOps / SRE Engineer with 8+ years of hands-on experience building and operating production-grade cloud infrastructure, observability platforms, and CI/CD systems. I specialize in designing reliable, scalable, and fully automated DevOps ecosystems in AWS/EKS environments, with a strong focus on observability and incident resilience.

What I do best

Production-grade AWS & Kubernetes (EKS)
- Design and operate highly available systems in AWS / GCP (EC2, EKS, ASG, Route53, ACM, S3)
- Kubernetes: multi-environment setups, ingress, canary & blue/green deployments
- Cluster architecture, networking, and failure recovery scenarios

Observability & Monitoring (from scratch)
- Build end-to-end monitoring systems: metrics, logs, tracing
- Stack: Prometheus, Grafana, Mimir / Thanos / VictoriaMetrics, ELK
- Alerting strategies: Alertmanager routing, SLOs, error budgets
- Log-to-metric pipelines and multi-source data integration

CI/CD & Automation
- Design fully automated pipelines (Jenkins, GitLab CI, GitHub Actions)
- Docker build pipelines, release strategies (canary, rollback, risk-based deploys)
- Infrastructure as Code: Terraform / Ansible
- Strong scripting: Bash, Python, PowerShell, Groovy

SRE & Incident Management
- Root cause analysis of complex production issues
- System reliability improvements and performance tuning
- Experience with real-world failures, not just greenfield setups

AI-assisted DevOps
- Daily use of tools like ChatGPT, Claude
- Automating configs, pipelines, and debugging workflows with AI

Key strengths
- Building observability platforms from zero to production
- Designing resilient architectures under failure conditions
- Deep hands-on with Prometheus ecosystem & time-series databases
- Strong focus on automation, reliability, and cost efficiency (FinOps mindset)

Mahesh P.

DevOps | Site Reliability Engineer | AWS | Kubernetes | CI/CD | Terraf

India

$50/hr

$0 earned

🚀 DevOps & Site Reliability Engineer | Automating Cloud Infrastructure & Scaling Solutions

Hi, I'm Mahesh, a passionate DevOps & SRE Engineer with hands-on experience delivering end-to-end infrastructure automation, CI/CD pipelines, cloud-native deployments, and application monitoring. With a strong foundation in AWS and Kubernetes, I help companies achieve faster, more secure, and reliable deployments.

✅ What I Can Help You With:
- Cloud Infrastructure: AWS, GCP, Azure – scalable & secure architecture design
- IaC (Infrastructure as Code): Terraform, CloudFormation, Ansible
- Containerization & Orchestration: Docker, Kubernetes (EKS, GKE, AKS), Helm
- CI/CD Pipelines: Jenkins, GitHub Actions, GitLab CI/CD, AWS CodePipeline
- Monitoring & Observability: Prometheus, Grafana, ELK Stack, AWS CloudWatch
- Security & Scanning: Trivy, SonarQube, OWASP Dependency-Check
- GitOps & Automation: ArgoCD, Flux, AWS Lambda, Bash/Python scripting
- System Hardening & Performance Optimization

Technologies and tools I have worked with:
🔹 AWS | GCP | Azure
🔹 Docker | Kubernetes | Helm
🔹 Terraform | Ansible | Packer
🔹 Jenkins | GitHub Actions | GitLab CI
🔹 ArgoCD | FluxCD
🔹 Prometheus | Grafana | Loki | ELK Stack
🔹 Trivy | SonarQube | OWASP ZAP
🔹 Bash | Python | Shell scripting
🔹 Linux | NGINX | Apache | HAProxy

Seun O.

Experienced DevOps / SRE Engineer

Nigeria

$20/hr

$15 earned

I'm Seun, a Senior DevOps Engineer with 6+ years of experience building secure, scalable infrastructure for fast-growing teams and regulated industries.

I specialize in:
- Kubernetes, Terraform, and cloud platforms (Azure, AWS)
- CI/CD pipeline automation using GitHub Actions, Azure DevOps, and CircleCI
- Secure deployments with PCI DSS-compliant architecture
- Monitoring & alerting using Prometheus, Grafana, Datadog
- Seamless collaboration with developers and product teams

I've helped banks and startups build and modernize platforms, from legacy system migrations to cloud-native microservices, while maintaining speed, security, and reliability. If you're looking for an engineer who can automate your delivery process, harden your infrastructure, and improve system uptime, let's talk.

Dmitriy M.

Senior SRE & Observability Engineer | Zero-Downtime Infrastructure

Ukraine

$60/hr

100% Job Success

$400K+ earned

Available now

Offers consultations

Every hour of downtime costs your enterprise an average of $540,000. That's not a DevOps metric - that's a board-level problem.

I design and operate infrastructure that stays up. My specialty is the full observability and reliability stack: detecting failures before users do, recovering in minutes, not hours, and building the self-healing systems that let your engineering team sleep at night.

With $400K+ earned on Upwork, 100% Job Success Score, and 10+ years in DevOps and Site Reliability Engineering, I bring the broadest monitoring stack you'll find in one engineer: Prometheus · Grafana · Datadog · ELK Stack · Zabbix · Sumo Logic - combined with deep Kubernetes and cloud expertise.

Typical Problems I Solve
→ Production incidents are frequent, and MTTR is too high
→ No observability - issues discovered by users, not monitoring
→ Need DevOps engineer for high availability infrastructure setup
→ Kubernetes clusters experiencing performance or reliability issues
→ Disaster recovery plan missing or untested
→ AI workloads need reliable, observable infrastructure

Proven Client Results

▪️ iLost Platform (Netherlands)
→ Challenge: EKS performance degrading, CI/CD unstable across Dev/Staging/Production, and OpenSearch causing timeouts.
→ Solution: EKS rightsizing, CI/CD pipeline optimization, IAM hardening, OpenSearch performance tuning.
→ Business Impact: Reduced bottlenecks, improved release predictability, and optimized infrastructure costs.

▪️ Qopla Infrastructure
→ Challenge: Long-term SRE engagement - ongoing reliability, monitoring, and performance needs.
→ Solution: Continuous SRE operations, incident management, and infrastructure optimization.
→ Business Impact: 657-hour engagement demonstrating sustained impact. Rate progressed $40/hr → $52/hr based on results.

▪️ GetChecked (Blockchain / Multi-cloud)
→ Challenge: Multi-cloud infrastructure (AWS + DigitalOcean + bare-metal) with Hyperledger Fabric in a regulated environment.
→ Solution: Upgraded Hyperledger Fabric, implemented monitoring, backup, DR/BCP, optimized CI/CD pipelines.
→ Business Impact: Increased system reliability and security in a strict regulatory context.

Core SRE Capabilities
→ DORA metrics: Deployment Frequency, Lead Time, MTTR, Change Failure Rate - measured and improved
→ Incident Management & On-Call: Runbooks, escalation paths, post-mortems
→ Disaster Recovery & Business Continuity planning and testing
→ Zero Trust Security: IAM, RBAC, SELinux, network hardening
→ High Availability: Multi-AZ, failover, load balancing, autoscaling
→ AI workload reliability: observability for ML pipelines and LLM infrastructure

Stack: AWS · GCP · Kubernetes · Terraform · Ansible · Prometheus · Grafana · ELK Stack · Datadog · Zabbix · Docker · Linux · GitHub Actions · GitLab CI · Incident Response Plan · Disaster Recovery

Available during US business hours (EST/PST overlap).

Downtime is a solved problem - if you have the right observability. Send an invite and let's audit your reliability posture.

Associated with

Spaceport

$1M+

earned

Yasmany C.

Proficient Site Reliability Engineer

Ecuador

$80/hr

$200K+ earned

Offers consultations

Highly skilled Software Engineer with 15 years of experience. Over those years, my skill set has helped companies reach their goals, and I work toward that every day.

Rizqi Fathi R.

DevOps | SRE | Software Engineer

Indonesia

$9/hr

$0 earned

DevOps Engineer and Software Engineer specialized in CI/CD pipelines and Kubernetes automation. Experienced in container orchestration, observability, and infrastructure reliability using Prometheus, Grafana, Jaeger, and Elasticsearch. Skilled in designing and maintaining scalable backend systems, implementing distributed tracing, and automating deployments across environments. As an aspiring Site Reliability Engineer (SRE), I focus on creating robust, monitored, and self-healing infrastructures that ensure consistent uptime and smooth delivery workflows. Passionate about automation, system performance, and building production-grade environments that balance speed and reliability.