Every data pipeline has a validation step — or should. Shell scripts that validate database contents after imports, migrations, and ETL runs catch data quality problems before they propagate to downstream systems. This page covers row count checks, referential integrity probes, range validations, and duplicate detection.
1. Row count and completeness checks
BASH
#!/usr/bin/env bash
# validate_import.sh — Validate a CSV import into MySQL
set -euo pipefail
DB_CONF="/etc/myapp/mysql.conf"
db() { mysql --defaults-file="${DB_CONF}" -BNs -D myapp -e "$1"; }
ERRORS=0
check() {
  local name="$1" condition="$2"
  local result
  result=$(db "${condition}")
  if [[ "${result}" == "1" ]]; then
    printf " %-40s PASS\n" "${name}"
  else
    printf " %-40s FAIL (got: %s)\n" "${name}" "${result}"
    # Note: (( ERRORS++ )) would return status 1 when ERRORS is 0,
    # which aborts the script under set -e. Use plain assignment instead.
    ERRORS=$((ERRORS + 1))
  fi
}
echo "=== Data Validation: $(date +%Y-%m-%d) ==="
# Row count checks
check "users table not empty" "SELECT COUNT(*) > 0 FROM users"
check "all users have email" "SELECT COUNT(*) = 0 FROM users WHERE email IS NULL OR email = ''"
check "email uniqueness" "SELECT COUNT(email) = COUNT(DISTINCT email) FROM users"
check "no future created_at" "SELECT COUNT(*) = 0 FROM users WHERE created_at > NOW()"
check "active flag valid" "SELECT COUNT(*) = 0 FROM users WHERE active NOT IN (0,1)"
# Referential integrity
check "orders have valid user_id" "SELECT COUNT(*) = 0 FROM orders o LEFT JOIN users u ON u.id=o.user_id WHERE u.id IS NULL"
check "order_items have valid order" "SELECT COUNT(*) = 0 FROM order_items oi LEFT JOIN orders o ON o.id=oi.order_id WHERE o.id IS NULL"
# Range validations
check "order totals non-negative" "SELECT COUNT(*) = 0 FROM orders WHERE total < 0"
check "order totals reasonable" "SELECT COUNT(*) = 0 FROM orders WHERE total > 100000"
echo ""
echo " Results: ${ERRORS} failures"
if (( ERRORS > 0 )); then
  echo " VALIDATION FAILED"
  exit 1
fi
echo " All validations passed"
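The script reads credentials from a MySQL option file so the password never appears in `ps` output or shell history. A minimal sketch of what `/etc/myapp/mysql.conf` might contain — the user, password, and host values are placeholders, and the fallback path here is only for illustration:

```shell
# Sketch of the defaults file the script reads; credentials are placeholders.
# Writing it from a provisioning script keeps the password off the command line.
conf="${DB_CONF:-/tmp/mysql.conf}"      # real deployment would use /etc/myapp/mysql.conf
cat > "$conf" <<'EOF'
[client]
user=validator
password=change-me
host=db.internal
EOF
chmod 600 "$conf"                       # readable only by the service user
```

The `[client]` group is picked up by the `mysql` command-line client; keeping a dedicated read-only `validator` account limits what a leaked credential can do.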
2. Duplicate detection and cardinality checks
BASH
# ── Find duplicate emails ─────────────────────────────────
mysql --defaults-file="${DB_CONF}" -BNs -D myapp -e "
SELECT email, COUNT(*) AS cnt
FROM users
GROUP BY email
HAVING cnt > 1
ORDER BY cnt DESC
LIMIT 20" | awk -F$'\t' '{printf " DUPLICATE: %-40s %d times\n",$1,$2}'
# ── Detect orphaned records ────────────────────────────────
mysql --defaults-file="${DB_CONF}" -BNs -D myapp -e "
SELECT COUNT(*) FROM order_items oi
WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.id=oi.order_id)" |
  awk '$1 > 0 {printf "WARN: %d orphaned order_items\n", $1}'
# ── Check column cardinality (detect runaway enums) ────────
mysql --defaults-file="${DB_CONF}" -BNs -D myapp -e "
SELECT status, COUNT(*) FROM orders GROUP BY status ORDER BY 2 DESC"
# ── Statistical anomaly check ─────────────────────────────
mysql --defaults-file="${DB_CONF}" -BNs -D myapp -e "
SELECT
AVG(total) AS mean,
STDDEV(total) AS stddev,
MIN(total) AS min_val,
MAX(total) AS max_val,
COUNT(*) AS cnt
FROM orders
WHERE DATE(created_at) = CURDATE()" | awk -F$'\t' '{
if ($1 > 0 && $4/$1 > 10)
printf "WARN: Max order (%.2f) is >10x average (%.2f)\n",$4,$1
}'
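The check above compares the day's maximum to its mean; a z-score variant flags individual totals instead. One possible sketch — the three-sigma cutoff and the `flag_outliers` helper name are illustrative, not part of the page's script:

```shell
# Sketch: flag values more than 3 standard deviations above the mean.
# Reads one numeric value per line on stdin; two passes over a temp file.
flag_outliers() {
  local tmp stats
  tmp=$(mktemp)
  cat > "$tmp"                         # first pass: buffer the input
  stats=$(awk '{s+=$1; ss+=$1*$1; n++}
               END {m=s/n; printf "%.6f %.6f", m, sqrt(ss/n - m*m)}' "$tmp")
  # second pass: report any value far above the mean
  awk -v m="${stats% *}" -v sd="${stats#* }" \
      'sd > 0 && ($1 - m) / sd > 3 {printf "WARN: outlier %s\n", $1}' "$tmp"
  rm -f "$tmp"
}
# Usage: pipe today's order totals through the check
# mysql --defaults-file="${DB_CONF}" -BNs -D myapp \
#   -e "SELECT total FROM orders WHERE DATE(created_at)=CURDATE()" | flag_outliers
```

The `sd > 0` guard skips degenerate inputs where every value is identical.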
✔ Tips: Run validation after every import and migration. Use COALESCE and IS NULL checks for nullable columns. Always check referential integrity after bulk inserts. Store validation results in a log file for audit trails. Make the script exit non-zero on any failure so CI/CD pipelines can gate on it.
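The audit-trail and CI-gating advice above can be combined in a small wrapper. A sketch — the `run_logged` name and log directory are assumptions, not part of the script on this page:

```shell
# Hypothetical wrapper: tee a command's output to a dated log file while
# preserving its exit status (a bare pipe to tee would mask failures).
run_logged() {
  local log="${LOG_DIR:-/tmp}/validation_$(date +%Y%m%d_%H%M%S).log"
  "$@" 2>&1 | tee "$log"
  return "${PIPESTATUS[0]}"            # bash-only: exit status of "$@", not of tee
}
# Usage: run_logged ./validate_import.sh || notify_oncall
```

Because the wrapper returns the validator's own exit code, a CI job can call it directly and still fail the build when any check fails.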