Commit 31f66596 authored by Andrew Newdigate, committed by Brett Walker

Sidekiq Monitor Script, Mailroom Stop, Sidekiq Pullmirror Shutdown script

parent 31e02d3a
@@ -269,13 +269,11 @@ Running CI jobs will no longer be able to push updates. Jobs that complete now m
#### Phase 2: Commence Shutdown in Azure [📁](bin/scripts/02_failover/060_go/p02)
1. [ ] 🔪 {+ Chef-Runner +}: Stop mailroom on all the nodes
* Staging: `knife ssh "role:staging-base-be-mailroom OR role:gstg-base-be-mailroom" 'sudo gitlab-ctl stop mailroom'`
* Production: `knife ssh "role:gitlab-base-be-mailroom OR role:gprd-base-be-mailroom" 'sudo gitlab-ctl stop mailroom'`
* `/opt/gitlab-migration/bin/scripts/02_failover/060_go/p02/010-stop-mailroom.sh`
1. [ ] 🔪 {+ Chef-Runner +} **PRODUCTION ONLY**: Stop `sidekiq-pullmirror` in Azure
* `knife ssh roles:gitlab-base-be-sidekiq-pullmirror "sudo gitlab-ctl stop sidekiq-cluster"`
1. [ ] 🐺 {+ Coordinator +}: Disable Sidekiq crons that may cause updates on the primary
* In a separate rails console on the **primary**:
* `loop { Sidekiq::Cron::Job.all.reject { |j| ::Gitlab::Geo::CronManager::GEO_JOBS.include?(j.name) }.map(&:disable!); sleep 1 }`
* `/opt/gitlab-migration/bin/scripts/02_failover/060_go/p02/020-stop-sidekiq-pullmirror.sh`
1. [ ] 🐺 {+ Coordinator +}: Sidekiq monitor: start purge of non-mandatory jobs, disable Sidekiq crons and allow Sidekiq to wind-down:
* In a separate terminal on the deploy host: `/opt/gitlab-migration/bin/scripts/02_failover/060_go/p02/030-await-sidekiq-drain.sh`
* The `geo_sidekiq_cron_config` job or an RSS kill may re-enable the crons, which is why we run it in a loop
1. [ ] 🐺 {+ Coordinator +}: Wait for repository verification on the **primary** to complete
* Staging: https://staging.gitlab.com/admin/geo_nodes - `staging.gitlab.com` node
@@ -341,12 +339,10 @@ state of the secondary to converge.
* `loop { Sidekiq::Cron::Job.all.map(&:disable!); sleep 1 }`
* The `geo_sidekiq_cron_config` job or an RSS kill may re-enable the crons, which is why we run it in a loop
1. [ ] 🐺 {+ Coordinator +}: Wait for all Sidekiq jobs to complete on the secondary
* Staging: Navigate to [https://gstg.gitlab.com/admin/background_jobs](https://gstg.gitlab.com/admin/background_jobs)
* Production: Navigate to [https://gprd.gitlab.com/admin/background_jobs](https://gprd.gitlab.com/admin/background_jobs)
* Press `Queues -> Live Poll`
* Wait for all queues to reach 0, excepting `emails_on_push` and `mailers` (which are disabled)
* Wait for the number of `Enqueued` and `Busy` jobs to reach 0 (a manual console spot-check is sketched below, after this hunk)
* Staging: Some jobs (e.g., `file_download_dispatch_worker`) may refuse to exit. They can be safely ignored.
* Review status of the running Sidekiq monitor script started in [phase 2, above](#phase-2-commence-shutdown-in-azure-), wait for `--> Status: PROCEED`
* Need more details?
* Staging: Navigate to [https://gstg.gitlab.com/admin/background_jobs](https://gstg.gitlab.com/admin/background_jobs)
* Production: Navigate to [https://gprd.gitlab.com/admin/background_jobs](https://gprd.gitlab.com/admin/background_jobs)
1. [ ] 🔪 {+ Chef-Runner +}: Stop sidekiq in GCP
* This ensures the postgresql promotion can happen and gives a better guarantee of sidekiq consistency
* Staging: `knife ssh roles:gstg-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"`
......
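For reference, the counts the runbook asks you to watch can be spot-checked by hand from a Rails console using only the standard Sidekiq API. A minimal sketch (an illustrative aside, not part of this commit; assumes it is run inside `sudo gitlab-rails console`):

# Spot-check Sidekiq drain progress (illustrative sketch, not part of this commit)
require "sidekiq/api"

stats     = Sidekiq::Stats.new                                             # enqueued counts across all queues
busy      = Sidekiq::ProcessSet.new.inject(0) { |sum, p| sum + p["busy"] } # jobs currently being worked
retries   = Sidekiq::RetrySet.new.size
scheduled = Sidekiq::ScheduledSet.new.size

puts "Enqueued: #{stats.enqueued} | Busy: #{busy} | Retries: #{retries} | Scheduled: #{scheduled}"
puts((stats.enqueued + busy + retries + scheduled).zero? ? "--> Status: PROCEED" : "--> Status: WAIT")

Unlike the monitor script's PROCEED check, this sketch also waits for busy jobs to finish, matching the runbook step above.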
#!/usr/bin/env bash
set -euo pipefail
IFS=$'\n\t'
SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
UNSYMLINKED_SCRIPT_DIR="$(greadlink -f "${SCRIPT_DIR}" || readlink -f "${SCRIPT_DIR}" || echo "${SCRIPT_DIR}")"
# shellcheck disable=SC1091,SC1090
source "${UNSYMLINKED_SCRIPT_DIR}/../../../../workflow-script-commons.sh"
# --------------------------------------------------------------
if [[ ${FAILOVER_ENVIRONMENT} == "stg" ]]; then
  log_command remote_command "role:staging-base-be-mailroom OR role:gstg-base-be-mailroom" 'sudo gitlab-ctl stop mailroom'
elif [[ ${FAILOVER_ENVIRONMENT} == "prd" ]]; then
  log_command remote_command "role:gitlab-base-be-mailroom OR role:gprd-base-be-mailroom" 'sudo gitlab-ctl stop mailroom'
else
  die "Unknown environment"
fi
#!/usr/bin/env bash
set -euo pipefail
IFS=$'\n\t'
SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
UNSYMLINKED_SCRIPT_DIR="$(greadlink -f "${SCRIPT_DIR}" || readlink -f "${SCRIPT_DIR}" || echo "${SCRIPT_DIR}")"
# shellcheck disable=SC1091,SC1090
source "${UNSYMLINKED_SCRIPT_DIR}/../../../../workflow-script-commons.sh"
# --------------------------------------------------------------
PRODUCTION_ONLY
log_command remote_command roles:gitlab-base-be-sidekiq-pullmirror "sudo gitlab-ctl stop sidekiq-cluster"
#!/usr/bin/gitlab-rails runner
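# Sidekiq drain monitor for the failover. Every second it reports enqueued, retry,
# scheduled and busy job counts, purges queues on the allow-list below and disables
# non-Geo cron jobs (both only when $dry_run is false), and prints "--> Status: PROCEED"
# once the enqueued, retry and scheduled counts all reach zero.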
$purge_allowed = Set[
  "pages_domain_verification",
  "geo:geo_repository_verification_primary_shard"
]

# Staging: Some jobs (e.g., "file_download_dispatch_worker") may refuse to exit. They can be safely ignored.
if ENV["FAILOVER_ENVIRONMENT"] == "stg" || `hostname -f`.strip == "deploy.stg.gitlab.com" # Hack to get around the inability to pass env vars
  $purge_allowed << "file_download_dispatch_worker"
  $purge_allowed << "emails_on_push"
  $purge_allowed << "mailers"
  $purge_allowed << "background_migration"
end
$dry_run = true # Set to false to actually purge the allowed queues and disable non-Geo crons
def queue_can_be_purged(queue_name)
  # Make sure that the Geo queues are not included in the allow-list above...
  $purge_allowed.include?(queue_name)
end

def cronjob_can_be_disabled(cron_name)
  # Geo cron jobs must stay enabled; everything else may be disabled
  !::Gitlab::Geo::CronManager::GEO_JOBS.include?(cron_name)
end
def handle_named_set(title, named_set)
  queue_sizes = named_set.each_with_object({}) do |job, hash|
    hash[job.queue] = (hash[job.queue] || 0) + 1
  end

  if !queue_sizes.empty?
    puts "#{title}:"
    queue_sizes.each do |k, v|
      status = queue_can_be_purged(k) ? " (purged)" : ""
      puts " #{k}: #{v}#{status}"
    end
  end

  ## Remove jobs that we do not need to retry.
  unless $dry_run
    named_set.each do |job|
      job.delete if queue_can_be_purged(job.queue)
    end
  end
end
begin
  loop do
    puts "----------------------------------------------------------------------------------------"

    if $dry_run
      puts "NOTE: This script is in dry run mode and will not purge any queues"
    else
      puts "WARNING: This script is in terminator mode. It will hunt down unwanted jobs and kill them off"
    end

    pending_queues = Sidekiq::Queue.all.select { |q| q.size > 0 }.sort_by(&:name)
    unless pending_queues.empty?
      puts "Queues:"
      pending_queues.each do |q|
        if queue_can_be_purged(q.name)
          q.clear unless $dry_run
          puts " #{q.name}: #{q.size} (purged)"
        else
          puts " #{q.name}: #{q.size}"
        end
      end
    end

    handle_named_set("Retries", Sidekiq::RetrySet.new)
    handle_named_set("Scheduled", Sidekiq::ScheduledSet.new)
    # Ignore the dead queue....

    ps = Sidekiq::ProcessSet.new
    total_busy = ps.each_with_object({}) do |process, hash|
      # Hack: derive the Sidekiq cluster name from the process hostname, falling back to the raw hostname
      m = /sidekiq-(\w+)-\d+|(worker)\d+\.cluster\.gitlab\.com/.match(process["hostname"])
      group = m && (m[1] || m[2]) || process["hostname"]
      hash[group] = (hash[group] || 0) + process["busy"]
    end

    unless total_busy.empty?
      puts "Busy jobs per worker type:"
      total_busy.each { |k, v| puts " #{k}: #{v}" }
    end

    crons = Sidekiq::Cron::Job.all
    total_enabled_sidekiq_crons = crons.select { |cron| cron.status == "enabled" }.count

    ## Disable most cronjobs
    unless $dry_run
      crons.select { |cron| cron.status == "enabled" && cronjob_can_be_disabled(cron.name) }.each(&:disable!)
    end

    puts "Cron: #{total_enabled_sidekiq_crons} enabled out of #{crons.length} total"

    stats = Sidekiq::Stats.new
    total_enqueued = stats.enqueued
    total_retries = Sidekiq::RetrySet.new.size
    total_scheduled = Sidekiq::ScheduledSet.new.size

    puts "\nTotal Enqueued: #{total_enqueued} | Total Retries: #{total_retries} | Total Scheduled: #{total_scheduled}\n"

    if total_enqueued.zero? && total_retries.zero? && total_scheduled.zero?
      puts "--> Status: PROCEED.\n--> Continue with the migration.\n--> So long. And good luck (keep this script running)!"
    else
      puts "--> Status: WAIT"
    end

    puts "\n\n"
    sleep 1
  end
rescue Interrupt
  # Exit quietly on Ctrl-C
end
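The wrapper script below reads this Ruby file and passes its contents inline to `sudo gitlab-rails runner` on the deploy host via `knife ssh`, so it runs with the full Rails environment loaded; with `$dry_run` left at `true` it only reports and does not purge or disable anything.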
#!/usr/bin/env bash
set -euo pipefail
IFS=$'\n\t'
SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
UNSYMLINKED_SCRIPT_DIR="$(greadlink -f "${SCRIPT_DIR}" || readlink -f "${SCRIPT_DIR}" || echo "${SCRIPT_DIR}")"
# shellcheck disable=SC1091,SC1090
source "${UNSYMLINKED_SCRIPT_DIR}/../../../../workflow-script-commons.sh"
# --------------------------------------------------------------
if [[ ${FAILOVER_ENVIRONMENT} == "stg" ]]; then
  DEPLOY_HOST=deploy.stg.gitlab.com
elif [[ ${FAILOVER_ENVIRONMENT} == "prd" ]]; then
  DEPLOY_HOST=deploy.gitlab.com
else
  die "Unknown environment"
fi
log_command knife ssh "fqdn:${DEPLOY_HOST}" sudo gitlab-rails runner "$(cat "${SCRIPT_DIR}"/030-await-sidekiq-drain.rb)"
@@ -89,6 +89,15 @@ function log_command() {
(set -x; "$@")
}
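# Run a command on every node matching a Chef search query, e.g.:
#   remote_command "role:gprd-base-be-mailroom" 'sudo gitlab-ctl stop mailroom'
# knife ssh relies on SSH agent forwarding to reach the nodes, hence the warning below.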
function remote_command() {
if [[ -z "${SSH_AUTH_SOCK:-}" ]]; then
  >&2 echo "Warning: you do not appear to have SSH agent forwarding set up. You may need to add \`ForwardAgent yes\` to your SSH config"
fi
chef_query=$1; shift
knife ssh "${chef_query}" "$@"
}
function header() {
local full_path
full_path=$(gnu_readlink -f "${BASH_SOURCE[2]}")
......