Two Incidents, One Pattern In April 2023, Samsung Semiconductor disclosed 3 separate data leaks through ChatGPT inside 20 days. Engineers pasted proprietary source code, confidential meeting notes, and internal test sequences into a public LLM. Samsung banned ChatGPT company-wide. Engineers kept using it on personal devices.
In 2025, Notion workspace data reached third-party model servers through an AI assistant misconfiguration. Customer workspace content, including engineering docs and internal pricing models, was processed by providers that had not signed data processing agreements covering that content type.
Different companies. Different tools. Same root cause. Same fix.
This teardown walks through what happened, why policy did not stop it, and what AI code leak prevention actually requires.
Samsung: 3 Leaks in 20 Days Leak 1: A Samsung engineer pasted proprietary semiconductor source code into ChatGPT to debug an error. The prompt, including the full source, was retained by the provider for training purposes.
Leak 2: An engineer pasted internal test sequences for chip validation to ask ChatGPT for optimization advice. The sequences revealed the exact defect patterns Samsung tested for in production silicon.
Leak 3: An engineer pasted the transcript of an internal recorded meeting to generate meeting minutes. The transcript contained product roadmap discussion and competitive analysis.
Samsung disclosed the incidents internally. The banned the tool. Bloomberg reported the numbers. Samsung then built an internal ChatGPT clone.
The ban did not work. A TechCrunch follow up found engineers were still using the public tool through personal accounts and personal devices. The data was gone the moment it was pasted. No amount of policy recovered it.
Notion: The Assistant Misconfig Notion integrated a third-party AI assistant for workspace summarization and content generation. The integration routed workspace content through a model provider that Notion had vetted for customer-facing use but had not restricted in scope.
Workspace admins who enabled the assistant did not know that enabling it gave the provider access to every page the assistant was asked about. Enterprise customers found internal strategy docs and engineering plans had been sent outside their data boundary.
Notion issued a post mortem. They tightened the integration scope. Customers who had enabled the feature during the affected window had no recovery path for content that had already been sent.
The Common Root Cause Both breaches share a single cause: proprietary data entered an LLM API call with no enforcement layer between the user and the provider.
Samsung had a policy. The policy said do not paste source code into public LLMs. The engineers knew the policy. They pasted the code anyway because the tool was useful and the policy was a document, not a control.
Notion had a vendor review process. The process approved the AI provider. The process did not catch that the integration scope was broader than the review assumed. Policy without technical enforcement is a wish list.
This is the pattern across every AI code leak we have tracked. The 2025 LayerX Security report found 40 percent of files uploaded to ChatGPT contained PII or PCI data. The 2026 CISA disclosure found officials used ChatGPT for sensitive government work. Policy alone fails every time.
Why Bans Make It Worse Samsung banned ChatGPT. Engineers switched to personal accounts on personal devices. The company lost visibility. Before the ban, the ChatGPT traffic was at least traceable on corporate network logs. After the ban, it moved to home Wi-Fi.
This is called shadow AI. A Gartner survey found 63 percent of enterprises that banned consumer AI tools discovered employees using them anyway within 90 days. The ban moves the risk, it does not remove it.
The only working control is a technical one. Intercept the API call. Transform the payload before it leaves the machine. Produce an audit log.
What Mutation Does Pretense mutates proprietary identifiers before the API call is made. The LLM never sees the real names. If the provider is breached, if the API logs are subpoenaed, if the model is trained on the traffic, the leaked data is synthetic hashes, not your source code.
Samsung pastes getDefectPattern into the prompt. Pretense rewrites it to _fn7a2c before the request leaves the developer laptop. The LLM answers in synthetics. Pretense reverses the synthetics to real names in the response. The developer gets working output. The provider never sees the real identifier.
This is not redaction. Redaction removes information and breaks the output. Mutation preserves structure and preserves utility. The output quality stays within measurement noise of an unprotected request.
The Audit Trail The second thing Samsung and Notion both lacked: an audit trail. When the leak was discovered, neither team could answer basic questions quickly.
- -Which identifiers left the building? - Which developers made the requests? - Which projects were affected? - What time window was the exposure?
Pretense logs every mutation with timestamp, file path, identifier, synthetic hash, model, and user. Run pretense audit and the answer is immediate. Export the log as CSV for SOC2 evidence. The audit trail is the artifact that turns a security event into a closeable ticket.
What AI Code Leak Prevention Requires Based on these 2 teardowns plus the 6 incidents we have tracked across 2024 to 2026, AI code leak prevention requires 4 properties:
- -Intercept at the API boundary, not the policy boundary. Users route around policy. They cannot route around a proxy. 2. Mutate, do not redact. Redaction destroys output quality and drives users back to unprotected tools. 3. Local-first map. The mutation key never leaves the machine. If it did, the whole system collapses. 4. Audit log by default. You cannot respond to a breach you cannot describe.
Any tool that is missing 1 of these is a speed bump, not a firewall.
The Lesson Samsung lost chip design IP to a policy gap. Notion lost workspace data to a scope gap. In both cases, the gap was the same: a missing enforcement layer between the user and the LLM API.
The tools that exist today close this gap. Pretense is open source, deploys in 30 seconds, and generates the audit log your compliance team will need when the next incident lands.
The question is not whether your team will paste proprietary code into an LLM. Your team is doing it right now. The question is whether the API traffic that leaves your network carries real identifiers or synthetic hashes.