Shell Scripting Bash Text Processing Basics May 2026

Shell Scripting Text Processing Basics

Master grep, cut, sort, uniq, wc, head, tail, tr, paste and column — the essential Unix text tools that form the backbone of every log analysis, report generation, and data extraction pipeline.

Unix text tools follow the philosophy of doing one thing well and composing into pipelines: grep filters lines, cut extracts fields, sort orders them, uniq deduplicates, wc counts. Together they handle 80% of text-processing jobs before you ever reach for awk or sed — though awk does earn a cameo below for whitespace-split fields.
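A minimal sketch of that composition, using a throwaway sample.log built inline (the file name and log lines here are made up) so the pipeline runs anywhere:

```shell
# Build a tiny sample log, then compose: filter → extract → count.
printf '%s\n' \
  'ERROR db timeout' \
  'INFO request ok' \
  'ERROR db timeout' \
  'WARN slow query' > sample.log

grep 'ERROR' sample.log | wc -l          # 2 — filter, then count
grep 'ERROR' sample.log | cut -d' ' -f2  # db (twice) — filter, then extract
sort sample.log | uniq -c | sort -rn     # frequency table of whole lines
rm sample.log
```

Every stage reads stdin and writes stdout, which is exactly what makes the tools freely recombinable.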

BASH
# ── Basic matching ────────────────────────────────────────
grep "ERROR" app.log              # lines containing ERROR
grep -i "error" app.log           # case-insensitive
grep -v "DEBUG" app.log           # invert — lines NOT matching
grep -c "ERROR" app.log           # count matching lines
grep -n "ERROR" app.log           # show line numbers
grep -l "ERROR" /var/log/*.log    # list matching filenames only

# ── Extended regex (-E) ───────────────────────────────────
grep -E "ERROR|WARN|FATAL" app.log        # multiple patterns
grep -E "^[0-9]{4}-[0-9]{2}" app.log     # lines starting with date
grep -E "[0-9]+\.[0-9]+" access.log      # numbers with a decimal point (e.g. response times)

# ── Context lines ─────────────────────────────────────────
grep -A 3 "FATAL" app.log         # 3 lines After match
grep -B 2 "FATAL" app.log         # 2 lines Before match
grep -C 2 "FATAL" app.log         # 2 lines Context (before + after)

# ── Fixed string (-F) — faster for literal text ───────────
grep -F "user@example.com" mail.log  # no regex interpretation

# ── Recursive (-r) ────────────────────────────────────────
grep -r "DB_PASS" /etc/myapp/       # search all files in dir
grep -rl "TODO" /opt/scripts/       # list files containing TODO

# ── Extract matched portion only (-o) ────────────────────
grep -oE "[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+" access.log  # extract IPs
grep -oE "HTTP/[0-9.]+" access.log                     # extract HTTP versions
BASH
# ── Cut by delimiter ─────────────────────────────────────
cut -d: -f1 /etc/passwd             # field 1, colon delimited
cut -d: -f1,3 /etc/passwd           # fields 1 and 3
cut -d: -f1-3 /etc/passwd           # fields 1 through 3
cut -d, -f2 data.csv                # CSV field 2

# ── Cut by character position ─────────────────────────────
cut -c1-10 file.txt                 # first 10 characters
cut -c5- file.txt                   # from char 5 to end

# ── Practical: extract usernames from /etc/passwd ─────────
cut -d: -f1 /etc/passwd | sort

# ── Practical: get 2nd column from ps output ─────────────
ps aux | awk '{print $2}'           # awk is better for whitespace-delimited

# ── Cut limitation: can't handle variable whitespace ──────
# Use awk for space/tab delimited data where fields vary in width
df -h | awk '{print $1, $5}'       # filesystem and use%
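To see why the cut limitation matters, compare the two tools on a line with a run of spaces (the input string is invented for the demo):

```shell
# cut treats EVERY space as a delimiter, so a run of spaces
# creates empty fields — field 2 lands between space 1 and space 2
echo 'alpha   beta' | cut -d' ' -f2     # prints an empty line
echo 'alpha   beta' | awk '{print $2}'  # beta — awk collapses whitespace runs
```

This is the whole reason `ps aux | awk '{print $2}'` above uses awk instead of cut.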
BASH
# ── sort basics ───────────────────────────────────────────
sort names.txt                     # alphabetical
sort -r names.txt                  # reverse
sort -n numbers.txt                # numeric sort (not lexicographic)
sort -rn numbers.txt               # reverse numeric
sort -k2 data.txt                  # sort by field 2 (space delimited)
sort -t, -k3 -n data.csv           # sort CSV by field 3 numerically
sort -u names.txt                  # sort + remove duplicates
sort -h sizes.txt                  # human-readable sizes (1K 2M 3G)

# ── uniq — works on ADJACENT duplicates (sort first!) ─────
sort names.txt | uniq             # deduplicate
sort names.txt | uniq -c          # count occurrences
sort names.txt | uniq -d          # only show duplicates
sort names.txt | uniq -u          # only show unique (appear once)

# ── Top N pattern — most common IPs in access log ─────────
awk '{print $1}' /var/log/nginx/access.log \
  | sort | uniq -c | sort -rn | head -10

# ── Most common HTTP status codes ─────────────────────────
awk '{print $9}' /var/log/nginx/access.log \
  | sort | uniq -c | sort -rn
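The same counting idiom works on any stream; here it runs on inline made-up IPs so you can try it without a log file:

```shell
# sort groups duplicates, uniq -c counts each group, sort -rn ranks by count
printf '%s\n' 10.0.0.1 10.0.0.2 10.0.0.1 10.0.0.3 10.0.0.1 10.0.0.2 \
  | sort | uniq -c | sort -rn | head -2
# most frequent first: 10.0.0.1 (3 hits), then 10.0.0.2 (2 hits)
```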
BASH
# ── wc — word, line, character count ─────────────────────
wc -l file.txt          # line count
wc -w file.txt          # word count
wc -c file.txt          # byte count
wc -m file.txt          # character count
wc file.txt             # lines words bytes

# Count errors in log
errors=$(grep -c "ERROR" app.log)
echo "Found ${errors} errors"

# ── head and tail ─────────────────────────────────────────
head -5 file.txt         # first 5 lines
head -1 file.txt         # first line only (great for headers)
tail -5 file.txt         # last 5 lines
tail -1 file.txt         # last line only
tail -n +2 file.txt      # skip first line (remove CSV header)
tail -f /var/log/app.log # follow — live log monitoring
tail -F /var/log/app.log # follow + retry if file rotated

# ── Combining head and tail ───────────────────────────────
head -20 bigfile.txt | tail -10   # lines 11-20

# ── tr — translate and delete characters ──────────────────
echo "hello" | tr 'a-z' 'A-Z'      # HELLO
echo "HELLO" | tr 'A-Z' 'a-z'      # hello
echo "a,b,c" | tr ',' '\n'         # one per line
echo "  spaces  " | tr -d ' '      # delete spaces
echo "aabbcc" | tr -s 'a-z'        # squeeze repeats: abc
tr -d '\r' < file.txt              # remove Windows line endings (no cat needed)
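The intro also names paste and column, which deserve a quick sketch (letters.txt and numbers.txt are throwaway names): paste merges streams line by line, and column -t pads whitespace-separated fields into an aligned table. column ships with util-linux on Linux and in the BSD base system, so availability can vary.

```shell
# paste — join files line by line (default delimiter is a tab)
printf 'a\nb\nc\n' > letters.txt
printf '1\n2\n3\n' > numbers.txt
paste letters.txt numbers.txt        # a<TAB>1, b<TAB>2, c<TAB>3
paste -d, letters.txt numbers.txt    # a,1  b,2  c,3 — CSV-style join
paste -sd, letters.txt               # a,b,c — serialize one file onto one line

# column -t — align whitespace-separated fields into columns
printf 'name qty\napples 12\nkiwi 7\n' | column -t
rm letters.txt numbers.txt
```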
BASH
#!/usr/bin/env bash
# log_report.sh — Daily access log summary

LOG="/var/log/nginx/access.log"
DATE=$(date '+%d/%b/%Y')

echo "═══════════════════════════════════════"
echo "  Nginx Access Log Report — ${DATE}"
echo "═══════════════════════════════════════"

echo ""
echo "Total requests today:"
grep "${DATE}" "${LOG}" | wc -l

echo ""
echo "Top 10 IPs:"
grep "${DATE}" "${LOG}" \
  | awk '{print $1}' \
  | sort | uniq -c | sort -rn \
  | head -10 \
  | awk '{printf "  %6d  %s\n", $1, $2}'

echo ""
echo "HTTP Status breakdown:"
grep "${DATE}" "${LOG}" \
  | awk '{print $9}' \
  | grep -E '^[0-9]{3}$' \
  | sort | uniq -c | sort -rn \
  | awk '{printf "  HTTP %-4s  %d\n", $2, $1}'

echo ""
echo "Top 5 requested URLs:"
grep "${DATE}" "${LOG}" \
  | awk '{print $7}' \
  | sort | uniq -c | sort -rn \
  | head -5 \
  | awk '{printf "  %6d  %s\n", $1, $2}'
bash — log_report.sh
vriddh@prod-01:~/scripts$ ./log_report.sh
═══════════════════════════════════════
  Nginx Access Log Report — 01/May/2026
═══════════════════════════════════════

Total requests today:
14823

Top 10 IPs:
    2841  203.0.113.42
    1203  198.51.100.7
     892  10.0.1.15

HTTP Status breakdown:
  HTTP 200   12104
  HTTP 404   1842
  HTTP 500   103
✔ Tool selection guide — Use grep to filter lines. Use cut for simple fixed delimiters. Use awk when fields are whitespace-separated or you need arithmetic. Use sort | uniq -c | sort -rn for frequency counting. Use tr for character-level transformations. Chain them with pipes — each tool does one thing and passes to the next.
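Putting the whole guide into one pipeline — inline request lines with hypothetical paths and status codes stand in for a real log:

```shell
# filter (grep) → extract (cut) → count (sort | uniq -c | sort -rn) → format (awk)
printf '%s\n' 'GET /a 200' 'GET /b 404' 'GET /a 200' 'GET /c 200' \
  | grep -E ' [0-9]{3}$' \
  | cut -d' ' -f3 \
  | sort | uniq -c | sort -rn \
  | awk '{printf "HTTP %s: %d\n", $2, $1}'
# HTTP 200: 3
# HTTP 404: 1
```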