Summary: On 11/16, code containing a defect was released. This code was not in use until 12/02, when customers attempting to generate a PDF experienced service errors or increased latency.
Root Cause Analysis: After a detailed investigation, the GoFormz Engineering team determined that an underlying code defect in a third-party PDF tool caused higher-than-normal CPU usage. Under high traffic, this defect created a deadlock that stalled several servers when they attempted to generate a PDF. Consequently, our API and Web services experienced errors and timeouts, and our backend queues experienced increased service latency.
Mitigation: The PDF services were immediately scaled out to absorb the increased CPU load. The issue was fully mitigated once the engineering team corrected the code defect and deployed a hotfix to all PDF services.
Next steps: Although this issue is resolved, GoFormz Engineering continues to monitor each service closely for errors or instability.