Ansible OEL 8 DevOps · OEL 8 · Advanced

Ansible · Rolling Updates & Serial Execution

The keywords that turn a 50-host playbook from 'all hosts in parallel' into 'one host at a time, with health checks between each' — serial, max_fail_percentage, run_once, delegate_to, and the rolling-restart pattern for a database cluster.

The default Ansible execution: every host runs every task, in roughly parallel batches of forks. That's fine for stateless installs but a disaster for stateful systems. Restarting all 6 MySQL replicas at once = 6 minutes of read-traffic downtime. Restarting them one at a time, with replica-lag checks between each = zero downtime.
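
Within each batch, parallelism is still capped by the forks setting (default 5): serial decides how many hosts are in a batch, forks decides how many of those run a task simultaneously. A minimal ansible.cfg sketch:

INI — ansible.cfg
[defaults]
forks = 20        # run up to 20 hosts of the current batch in parallel

With serial: "25%" on a 100-host fleet, a batch is 25 hosts; forks = 20 means 20 of them execute at once and the remaining 5 wait for a free slot.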

Rolling update · three serial strategies
serial: 1 · one host at a time · safest, slowest (db1 done; db2–db6 waiting)
serial: 2 · two hosts per batch · balanced (db1–db2 done; db3–db6 waiting)
serial: "25%" · scales with fleet size (25% of 6 = 2 hosts/batch: db1–db2 batch 1, db3–db4 batch 2, db5–db6 batch 3)
max_fail_percentage: 25 · circuit-breaker (if more than 25% of a batch fails, e.g. 2 of 6 hosts, the play aborts, preventing cascading failure)

The serial keyword overrides Ansible's normal "all hosts at once" behaviour. It accepts a single number, a percentage, or a list for tapered rollouts:

YAML — serial in three forms
---
# Form 1: one host at a time (safest, slowest)
- hosts: mysql_replica
  serial: 1
  tasks:
    - name: Apply config and restart
      ansible.builtin.template:
        src: my.cnf.j2
        dest: /etc/my.cnf
      notify: restart mysql

# Form 2: percentage of fleet at a time (scales with size)
- hosts: webservers
  serial: "25%"      # 25% of fleet per batch
  tasks:
    - name: Deploy new release
      ansible.builtin.include_role:
        name: deploy_app

# Form 3: tapered rollout — 1 host first, then 3, then everyone
- hosts: api_servers
  serial:
    - 1              # canary: just one host
    - 3              # then three
    - "100%"         # then the rest
  tasks:
    - name: Apply patch
      ansible.builtin.dnf:
        name: "*"
        state: latest
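
The related play-level order keyword decides which hosts land in the first, riskiest batch; without it, inventory order applies, so the same host is always the de-facto canary. A sketch (order accepts inventory, reverse_inventory, sorted, reverse_sorted, and shuffle):

YAML — order: shuffle with serial
- hosts: webservers
  serial: 1
  order: shuffle     # randomise host order each run
  tasks:
    - name: Apply patch
      ansible.builtin.dnf:
        name: "*"
        state: latest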

By default, Ansible continues to the next batch even if the current one had failures. For risky changes, set a threshold — if more than N% of the batch fails, the play aborts before the next batch:

YAML — max_fail_percentage
---
- hosts: mysql_replica
  serial: 2
  max_fail_percentage: 25     # if >25% of a batch fails, stop
  tasks:
    - name: Restart mysql
      ansible.builtin.systemd:
        name: mysqld
        state: restarted

    - name: Wait for replication to catch up
      community.mysql.mysql_query:
        query: "SHOW SLAVE STATUS"
        login_user: root
        login_password: "{{ vault_mysql_root_password }}"
      register: status
      until: (status.query_result[0][0].Seconds_Behind_Master | int) < 5
      retries: 30
      delay: 10

# With serial: 2 + max_fail_percentage: 25:
# - If 1 host of 2 in a batch fails (50%), play aborts
# - If 0 of 2 fail, continue to next batch
💡 Tip: The default failure behaviour is fail-and-continue per host: a failed host drops out of the play while the rest carry on. max_fail_percentage is the circuit-breaker that stops a half-broken rollout from continuing into the next batch and breaking that one too (any_errors_fatal, covered below, is the stricter all-or-nothing variant).
YAML — rolling-restart-mysql.yml
---
- name: Rolling MySQL restart, one replica at a time
  hosts: mysql_replica
  serial: 1
  max_fail_percentage: 0      # zero tolerance — abort on any failure
  become: true
  tasks:
    - name: Drain this host from ProxySQL
      community.proxysql.proxysql_mysql_servers:
        login_host: "{{ proxysql_admin_host }}"
        login_user: admin
        login_password: "{{ vault_proxysql_admin_password }}"
        hostgroup_id: 20      # reader hostgroup
        hostname: "{{ ansible_default_ipv4.address }}"
        status: OFFLINE_SOFT
        state: present
      delegate_to: localhost

    - name: Wait for active queries to drain
      community.proxysql.proxysql_query:
        login_host: "{{ proxysql_admin_host }}"
        login_user: admin
        login_password: "{{ vault_proxysql_admin_password }}"
        query: "SELECT ConnUsed FROM stats_mysql_connection_pool WHERE srv_host='{{ ansible_default_ipv4.address }}'"
      register: pool
      until: pool.query_result[0].ConnUsed | int == 0
      retries: 30
      delay: 5
      delegate_to: localhost

    - name: Restart MySQL
      ansible.builtin.systemd:
        name: mysqld
        state: restarted

    - name: Wait for MySQL to be reachable
      ansible.builtin.wait_for:
        port: 3306
        timeout: 60

    - name: Wait for replication to catch up
      community.mysql.mysql_query:
        query: "SHOW SLAVE STATUS"
        login_user: root
        login_password: "{{ vault_mysql_root_password }}"
      register: status
      until: (status.query_result[0][0].Seconds_Behind_Master | int) < 5
      retries: 30
      delay: 10

    - name: Re-add to ProxySQL
      community.proxysql.proxysql_mysql_servers:
        login_host: "{{ proxysql_admin_host }}"
        login_user: admin
        login_password: "{{ vault_proxysql_admin_password }}"
        hostgroup_id: 20
        hostname: "{{ ansible_default_ipv4.address }}"
        status: ONLINE
        state: present
      delegate_to: localhost

For tasks that should fire exactly once across the whole inventory (registering a shard, running a database migration, sending a notification), use run_once:

YAML — run_once
---
- hosts: mongo
  tasks:
    - name: Initialise the replica set (only on one host)
      community.mongodb.mongodb_replicaset:
        replica_set: rs0
        members:
          - "{{ groups['mongo'][0] }}:27017"
          - "{{ groups['mongo'][1] }}:27017"
          - "{{ groups['mongo'][2] }}:27017"
      run_once: true              # ← critical: only one host runs this
      delegate_to: "{{ groups['mongo'][0] }}"

    - name: Apply schema migration (run once, delegated to the first webserver)
      ansible.builtin.command: liquibase --changeLogFile=db/changelog.xml update
      run_once: true
      delegate_to: "{{ groups['webservers'][0] }}"
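
One caveat when combining these keywords: under serial, run_once fires once per batch, not once per play, because Ansible re-evaluates it at the start of every batch. If a task must run exactly once during a rolling update, put it in its own non-serial play ahead of the serialised one:

YAML — exactly-once with serial (sketch)
---
# Play 1: no serial, so run_once really means once
- hosts: webservers
  tasks:
    - name: Apply schema migration exactly once
      ansible.builtin.command: liquibase --changeLogFile=db/changelog.xml update
      run_once: true

# Play 2: the serialised rollout
- hosts: webservers
  serial: "25%"
  tasks:
    - name: Deploy new release
      ansible.builtin.include_role:
        name: deploy_app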

delegate_to redirects a task to run on a specific host while keeping the play's host context. Common uses: cloud API calls, central log writes, load-balancer manipulations:

YAML — delegate_to patterns
---
- hosts: webservers
  serial: 1
  tasks:
    # 1. Call AWS API from localhost (control node), not from each web host
    - name: Update ELB target health (run from control node)
      community.aws.elb_target:
        target_group_arn: "{{ tg_arn }}"
        target_id: "{{ ansible_ec2_instance_id }}"
        deregister: true
      delegate_to: localhost

    # 2. Push notification to slack (run from one notifier host)
    - name: Notify deploy started
      ansible.builtin.uri:
        url: "{{ slack_webhook }}"
        method: POST
        body_format: json
        body:
          text: "Deploying {{ inventory_hostname }}"
      delegate_to: notifier.example.com

    # 3. Restart the actual host
    - name: Restart application
      ansible.builtin.systemd:
        name: myapp
        state: restarted
      # no delegate_to → runs on the inventory host

    # 4. Re-register with ELB
    - name: Re-add to ELB
      community.aws.elb_target:
        target_group_arn: "{{ tg_arn }}"
        target_id: "{{ ansible_ec2_instance_id }}"
        deregister: false
      delegate_to: localhost
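
Note that delegation moves the task, not the variable context: inside a delegated task, inventory_hostname and gathered facts still describe the original play host, which is why the ProxySQL drain tasks earlier can template ansible_default_ipv4 while running on localhost. The related delegate_facts keyword controls where newly gathered facts are stored; a sketch (db1.example.com is a placeholder hostname):

YAML — delegate_facts
- hosts: webservers
  tasks:
    - name: Gather facts about the DB host from the web play
      ansible.builtin.setup:
      delegate_to: db1.example.com
      delegate_facts: true    # facts land in hostvars['db1.example.com'], not on the web host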

By default, when one host fails, the others continue. With any_errors_fatal, any single host's failure aborts the play for everyone:

YAML — any_errors_fatal
---
- hosts: galera_cluster
  any_errors_fatal: true       # one node failing = abort entire play
  tasks:
    - name: Apply Galera config update
      ansible.builtin.template:
        src: wsrep.cnf.j2
        dest: /etc/my.cnf.d/wsrep.cnf
      notify: restart galera

# Use case: cluster ops where partial application breaks the cluster.
# If one node's config push fails, you don't want to restart only the
# other 2 — that would split-brain the cluster.
YAML — canary deploy: one host, soak, then the rest
---
- name: Canary phase
  hosts: webservers[0]          # the first webserver acts as the canary
  max_fail_percentage: 0
  tasks:
    - name: Deploy new release on canary
      ansible.builtin.include_role:
        name: deploy_app

    - name: Wait 10 minutes to soak
      ansible.builtin.pause:
        minutes: 10

- name: Roll out to everyone else
  hosts: webservers[1:]         # every webserver except the canary
  serial: "25%"
  max_fail_percentage: 25
  tasks:
    - name: Deploy new release
      ansible.builtin.include_role:
        name: deploy_app
✅ Tip: End of Section 7. With these orchestration tools you can roll out database changes across 50-node clusters with zero downtime, run migrations exactly once, drain hosts from load balancers safely, and abort entire plays the moment something looks wrong.

You've now covered everything Ansible offers beyond the basics: custom Python plugins, performance tuning to 5–10× speedup, multi-environment discipline with per-env vault passwords, and rolling-update orchestration with circuit breakers and canary phases.

Next — Section 8: Operations and CI/CD (3 pages, 63–65), the final section. Ansible Tower / AWX, integrating Ansible into GitLab CI / GitHub Actions, and testing playbooks with Molecule + ansible-lint.