Enterprise privacy is more often viewed as a challenge and a bottleneck in adopting AI and cloud API-driven projects than as an opportunity to lower the risk of reputational and financial loss through novel technological solutions. There is a lot of focus on regulatory paperwork and declarative, formal privacy. Departments were created, roles were set, people were assigned, and privacy policies were written and published. Cookie warnings were implemented on corporate websites and in mobile apps. So, from the perspective of basic regulatory obligations, all the requirements are met, and fingers are crossed that incoming regulations will not require too much additional effort.
However, behind the closed doors of decision-makers, real privacy protection thanks to technology and processes is often portrayed as too hard and too costly to implement. So the most common strategy is to do as little as possible just to conform to formal requirements of current and incoming regulations.
This is a missed opportunity because true privacy protection can be achieved with the right set of processes and technologies without killing the budget. The strategy of deprioritization and effort minimization unfortunately means that many technology teams are not even aware of available privacy protection options and progress in the field.
No Privacy without Security
First, there is no privacy without security. We need to connect the two instead of treating them as separate issues, because they are heavily interdependent. The security landscape is demanding more attention than ever and is getting more complex. We observe increased sophistication of cyberattacks on both the social and technological levels. This does not mean that security and privacy improvements block each other. They can and should progress in parallel, but understanding their mutual dependencies is the key to success for both.
The basic rules and fundamentals of security are not that different from those of twenty years ago. Memory-safety vulnerabilities still dominate, as most system-level software is still written in unsafe C and C derivatives. Browsers suffer from client-side attacks targeting rendering and JavaScript engines. What has changed the most is the pace of change, driven by automation and the rise of state-sponsored hacking groups targeting competitors and enemies in trade and hybrid cyber warfare.
AI-driven attackers are much more successful and automated than they used to be. AI-supported defenders also get new tools and more advanced scanners at the source code, container, and runtime levels. However, the defense seems to be at a comparative disadvantage.
The fundamentals of security change slowly, which unfortunately does not mean they are perfectly implemented. Daily reports of data leaks typically come down to the usual set of mishaps on the side of data owners and data processors.
Data Needs to Be Protected
Decades-old strategies such as data minimization are still not the norm, as even simple transactions require users to provide more information than is necessary to perform them. The pressure to know and target customers better drives companies to collect tons of telemetry data, bordering on or crossing the line into privacy violations against individuals and organizations.
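As a rough illustration of data minimization at the point of collection, the sketch below keeps only an allow-listed set of fields before anything is stored. The field names and the `minimize` helper are hypothetical and chosen purely for illustration; a real allow-list would come from an analysis of what each transaction actually needs.

```python
# Hypothetical allow-list: the only fields a checkout step is assumed to need.
REQUIRED_FIELDS = {"order_id", "shipping_address", "payment_token"}


def minimize(record: dict, allowed: set = REQUIRED_FIELDS) -> dict:
    """Return a copy of `record` that keeps only allow-listed fields."""
    return {key: value for key, value in record.items() if key in allowed}


# Illustrative form submission that asks for more than the transaction needs.
submitted = {
    "order_id": "A-1001",
    "shipping_address": "1 Example Street",
    "payment_token": "tok_demo",
    "date_of_birth": "1990-01-01",
    "phone_contacts": ["+1-555-0100"],
}

stored = minimize(submitted)
print(sorted(stored))
```

Dropping fields before they reach storage also means there is less to track later under data retention policies.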
Avoiding data copies is another recommendation that holds true today and is not going away anytime soon. Unfortunately, data is often copied unnecessarily, which, without proper processes and tools, makes data retention policies much harder to enforce and creates a much larger attack surface. This also applies to the rise of observability in distributed systems, which can create copies of sensitive data in system logs that are hard to detect and remove.
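One way to limit sensitive copies in logs is to redact them before they are written. The following is a minimal sketch using Python's standard `logging` module. The `RedactEmails` class, the `orders` logger name, and the simple email pattern are assumptions made for illustration, and a production setup would need to cover more identifier types than email addresses.

```python
import logging
import re

# Deliberately simple email pattern, for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class RedactEmails(logging.Filter):
    """Replace email addresses in log messages before handlers emit them."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Render the final message first so values passed as args are covered.
        message = record.getMessage()
        record.msg = EMAIL_RE.sub("[REDACTED_EMAIL]", message)
        record.args = ()
        return True  # keep the record, only its content is changed


# Attach the filter to the handler so every record it emits is redacted.
handler = logging.StreamHandler()
handler.addFilter(RedactEmails())

logger = logging.getLogger("orders")  # hypothetical logger name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Password reset requested by %s", "alice@example.com")
```

Redaction at the logging layer is a safety net, not a substitute for avoiding sensitive values in log calls in the first place.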
Data retention conformance still relies more on declarative statements than on the physical destruction of data. For instance, already trained machine learning models make removing specific data almost impossible, and it is a secret that everyone in the AI industry knows well. The models would have to be retrained with the specific source data removed, which is very costly and time-consuming. And that is on top of the existing issues of backups and logs stored for years in digital archives.
The privacy of public LLM services has also proven questionable, as uploaded corporate data containing business secrets and sensitive personal data may then be used for model training. Free public services used by corporate users (instead of or alongside corporate chatbots) often come with a hidden privacy "price." As the saying goes, "if it appears to be free, you are the product."
Language models, and machine learning models in general, exhibit a memorization problem that can leak sensitive data under malicious prompting. Local Small Language Models (SLMs) combined with Retrieval-Augmented Generation (RAG) are an alternative that balances model performance with guarantees that no documents will be used to train global models. Another option is to sign agreements with big tech companies and rely on their declarations that corporate data will not be used for model training.
Multi-device usage and bring-your-own-device (BYOD) trends mean that mobile applications requesting too many permissions to track users may harm privacy by exposing the data of employees, customers, and patients. There are corporate policies that reduce the exposure of corporate data, and they need to be implemented correctly. Even so, additional devices increase the attack surface and the risks.
PET to the Rescue
Privacy Enhancing Technologies (PETs) such as Differential Privacy (DP) are already mature enough to be used in real-world applications. In practice, they mean a slight deterioration of data usability but make it much harder to identify individuals. The tradeoff between privacy and data value should be considered every time there is a risk of data exposure. With a slight addition of controllable noise, we can gain much stronger protection of data and machine learning models against membership inference attacks.
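To make "controllable noise" concrete, here is a minimal sketch of the Laplace mechanism, one common way to achieve differential privacy for a counting query. The function names, the sample `ages` data, and the choice of epsilon are assumptions for illustration. The standard-library random generator used here is not suitable for production DP, where vetted libraries and secure randomness should be preferred.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from Laplace(0, scale).

    The difference of two independent exponential samples with mean `scale`
    follows a Laplace distribution centered at zero.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Answer a counting query with epsilon-differential privacy.

    Adding or removing one person changes a count by at most 1 (sensitivity 1),
    so the noise scale is 1 / epsilon. Smaller epsilon means more noise.
    """
    true_count = sum(1 for value in values if predicate(value))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(42)  # fixed seed only to make the demo repeatable
ages = [23, 35, 41, 29, 52, 67, 38, 45]  # illustrative data

noisy = private_count(ages, lambda age: age >= 40, epsilon=0.5, rng=rng)
print(f"noisy count of people aged 40+: {noisy:.2f}")
```

The epsilon parameter is the control knob for the tradeoff described above: lowering it strengthens the privacy guarantee and reduces the accuracy of each answer.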
The federated learning strategy helps to train models without moving or directly accessing any sensitive data, avoiding data copy (and data retention) problems altogether. Data sharing agreements do not need to be signed, as there is no data sharing. Of course, the operations performed must be non-disclosive, and the models must not be overtrained or assumed to be 100% private, so there is room for secure aggregation and differential privacy.
The PET space is maturing fast. It is underestimated, but it could dramatically improve privacy protection when combined with enterprise security and a privacy-preserving mindset.
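The idea behind secure aggregation can be sketched in a few lines: each client adds random masks to its model update, the masks are constructed so that they cancel in the sum, and the server only learns the aggregate. The code below is a simplified single-process simulation under stated assumptions. The fixed-point `SCALE`, the `MODULUS`, and all function names are illustrative, and the masks come from one seeded generator, whereas a real protocol would derive them from pairwise key agreement between clients and handle client dropouts.

```python
import random

MODULUS = 2**32  # arithmetic is done modulo this value (assumption)
SCALE = 10**4    # fixed-point precision for encoding floats (assumption)


def encode(update):
    """Convert float updates to fixed-point integers modulo MODULUS."""
    return [round(x * SCALE) % MODULUS for x in update]


def pairwise_masks(num_clients, dim, seed):
    """Build one net mask per client such that all masks sum to zero.

    For each pair (i, j), client i adds a shared random vector and
    client j subtracts the same vector.
    """
    rng = random.Random(seed)
    masks = [[0] * dim for _ in range(num_clients)]
    for i in range(num_clients):
        for j in range(i + 1, num_clients):
            for k in range(dim):
                shared = rng.randrange(MODULUS)
                masks[i][k] = (masks[i][k] + shared) % MODULUS
                masks[j][k] = (masks[j][k] - shared) % MODULUS
    return masks


def mask_update(update, mask):
    """What a client would send: its encoded update plus its net mask."""
    return [(e + m) % MODULUS for e, m in zip(encode(update), mask)]


def aggregate(masked_updates):
    """Server side: sum the masked updates and return the mean update."""
    dim = len(masked_updates[0])
    mean = []
    for k in range(dim):
        total = sum(u[k] for u in masked_updates) % MODULUS
        if total >= MODULUS // 2:  # map back to a signed value
            total -= MODULUS
        mean.append(total / SCALE / len(masked_updates))
    return mean


# Illustrative local model updates from three clients.
local_updates = [[0.10, -0.20], [0.30, 0.40], [-0.10, 0.10]]
masks = pairwise_masks(len(local_updates), dim=2, seed=7)
masked = [mask_update(u, m) for u, m in zip(local_updates, masks)]

print(aggregate(masked))
```

The server sees only the masked vectors, which look random individually, yet their sum recovers the same average that plain federated averaging would compute.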
Confidential Computing
Another emerging technology is confidential computing, which protects companies and their data against malicious administrators of hosting and cloud services. No one except authorized people from the company can access the data at any time, as it is always kept encrypted and the cloud provider does not have the keys. The technologies are improving fast, and there are already practical applications. The near-term future is expected to deliver on the promise of CPU- and GPU-accelerated workloads, which is very important for the entire machine learning field, including generative AI.
Future
The pressure that regulators put on privacy is only going to increase. Fortunately, there is an entire ecosystem of strategies, technologies, and tools that help minimize the risks without significantly affecting IT budgets. Federated networks of (limited) trust are growing, and privacy enhancing techniques deliver better results at a lower cost to data value. Confidential computing is improving fast and slowly gaining traction. With generative AI news thrown at us daily, it is easy to overlook the significant technological progress in privacy protection. When combined with the right processes and, most importantly, a privacy-preserving attitude among individuals and entire organizations, privacy can become part of a competitive advantage, significantly lowering the risk of reputational damage and of financial loss due to regulatory violations.