Post Mortem: Connection Spike Leads to Cascading SQL Performance Degradation

Summary

A spike in incoming mail connections overloaded our SQL server and exposed a latent performance problem in SQL lookups. Email delivery was delayed until the performance problem was mitigated.

Timeline

  • 11:30AM EST: Incoming mail connections spike to 30x steady state, causing increased load on our servers
  • 11:49AM [Downtime Begins]: Emails begin being delayed/backed up.
  • 11:58AM [First Alert Fired]: Automated alerting notifies Matthew Tse, who is on call, of a deliverability issue with Microsoft.
  • 11:59AM [First Responder Signs On]: Matthew Tse signs on and begins investigating
  • 12:08PM [Customers Alerted]: Matthew Tse posts an incident to status page
  • 12:42PM [Mitigation Attempted]: Matthew Tse pushes code that increases logging to track offending users, and also increases the connection limit on front door mail requests
  • 01:34PM [Mitigation Attempted]: Matthew Tse unlocks the max SMTP autoscaled server limit
  • 02:00PM [Recovery Begins]: The SQL servers begin to handle the load, and emails begin being delivered again, but delayed.
  • 02:10PM [Mitigation Attempted]: Matthew Tse adds further logging to find the root cause of the dropped front door messages. We discover that a SQL connection pool overload error is being emitted constantly.
  • 04:27PM [Mitigation Attempted]: Matthew Tse pushes async connection pool optimizations to decrease load on SQL servers
  • 07:10PM [Mitigation Attempted][Recovery Complete]: Matthew Tse pushes further async connection pool optimizations that fully eliminate the SQL error.
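
The exact connection pool changes are not described above. As a rough illustration of the general idea, the following is a minimal sketch that caps concurrent SQL lookups with an asyncio semaphore, so a burst of incoming mail waits for a free slot instead of opening more connections. The names BoundedPool, fake_lookup and simulate_spike are hypothetical and are not taken from our codebase.

```python
import asyncio


class BoundedPool:
    """Hypothetical wrapper that caps the number of concurrent SQL lookups."""

    def __init__(self, max_connections: int) -> None:
        self._slots = asyncio.Semaphore(max_connections)
        self.in_use = 0  # lookups currently holding a slot
        self.peak = 0    # highest concurrency observed so far

    async def run(self, query):
        # Wait for a free slot rather than opening another connection.
        async with self._slots:
            self.in_use += 1
            self.peak = max(self.peak, self.in_use)
            try:
                return await query()
            finally:
                self.in_use -= 1


async def fake_lookup() -> str:
    # Stand-in for a real SQL lookup (assumption: roughly 10 ms per query).
    await asyncio.sleep(0.01)
    return "ok"


async def simulate_spike(requests: int, max_connections: int):
    # Fire all requests at once, as happens during a connection spike.
    pool = BoundedPool(max_connections)
    results = await asyncio.gather(
        *(pool.run(fake_lookup) for _ in range(requests))
    )
    return pool.peak, len(results)
```

Calling asyncio.run(simulate_spike(300, 10)) completes all 300 lookups while never holding more than 10 slots at a time. Excess requests queue inside the process instead of hitting the database, which is the behavior a pool limit is meant to provide.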

Action Items

  • IMX-1337: Audit all Python SQL connection pool logic across all clients to ensure the thundering herd issue doesn't happen again.
  • IMX-1338: Audit all SQL limit changes made during the incident and make sure they persist across server reboots.
  • IMX-1339: Add metrics tracking the number of connections made to our SQL database, so that this issue surfaces immediately in the future.
  • IMX-1340: Add metrics tracking the number of mail rejections caused by unhandled SQL connection issues. This should further improve our speed and reliability during delivery.
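
To make the action items above more concrete, here is a minimal sketch of two of the ideas: retrying SQL connections with jittered exponential backoff (a common way to keep clients from reconnecting in lockstep, which is what creates a thundering herd) and counting connection attempts and rejections. This is an illustration under assumptions, not the audited implementation. The metric names, the in-process Counter and the function names are all hypothetical.

```python
import random
import time
from collections import Counter

# Hypothetical in-process stand-in for a real metrics client.
metrics = Counter()


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 5.0) -> float:
    """Return a random delay in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


def connect_with_retry(connect, max_attempts: int = 5, sleep=time.sleep):
    """Call connect() until it succeeds, spreading retries out with jitter."""
    for attempt in range(max_attempts):
        metrics["sql_connection_attempts"] += 1
        try:
            return connect()
        except ConnectionError:
            metrics["sql_connection_failures"] += 1
            if attempt < max_attempts - 1:
                # Randomized wait so clients do not all retry at the same instant.
                sleep(backoff_delay(attempt))
    # Every attempt failed: record the rejection before surfacing the error.
    metrics["mail_rejected_sql_unavailable"] += 1
    raise ConnectionError("SQL connection unavailable after retries")
```

Because each client waits a different random interval, retries after a failure are spread over time rather than arriving as a second spike. The counters give a direct signal for the connection volume and rejection metrics described in IMX-1339 and IMX-1340.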

We apologize for the downtime in our services.

If you have any questions, feel free to reach out to me at [email protected]

Matthew Tse
Owner and CEO of ImprovMX
