Last week, the security community learned of a large cryptomining campaign that leveraged exposed Kubeflow clusters to mine Monero. The attack was possible because dashboards were publicly exposed to the internet, and it has sparked a healthy conversation about the security of machine learning platforms.
Being very familiar with the problem and the security ecosystem, I wanted to share why this problem is hard and what you can do about it. I hope this will help people in our community who aren't our customers.
ML access is valuable and resource-rich

ML environments are a rich target for attackers. Very few teams have access to massive amounts of data and compute; for ML teams, that is the standard. These teams feed their models with data ranging from user information to internal finances, and their data processing and model training consume a great deal of computing power every day.
From an attacker's perspective, this is a gold mine. Do you want to mine coins? There is plenty of computing power available, and day-to-day business operations already consume a lot of it. Do you want to encrypt data and extort the company? You have access to massive amounts of compute and data in an environment that reads and writes heavily every day. Do you want to extract some particular data from the company? Go ahead; your access might not even appear on the target's radar. All the common attack goals are achieved more efficiently with an ML platform at your disposal.
Furthermore, these environments give users a lot of flexibility by design. The field of ML moves fast, and teams need their data scientists to iterate faster than typical software teams. That need means organizations must give data scientists tools flexible enough to launch software with arbitrary access and compute patterns. If a team member can't launch a job, they are blocked and unproductive; the same goes if they can't access a given piece of data. This level of freedom creates a difficult balance between flexibility and security, and that tension is not new.
ML platforms are not the only systems under attack. We frequently hear of database breaches, browser exploits, downgrade attacks, and many other security issues. Our modern platforms are not immune either, as Kubernetes is also under attack. These systems are not insecure per se, but they do pose a security risk because of how they are designed and operated.
"Tyranny of the default" is a common term in software development. It means that most users will use the default settings provided by the system. Sometimes it's due to trust that the maintainer knows better, sometimes it's due to lack of awareness. But you can bet that most users will have the default configuration without thinking twice. How many times have you clicked "forward" while installing something without even reading? If the default is a lack of security, then that will be the case for most users. It doesn't matter how many security features you provide as an add-on; they rarely will be installed or used.
Unfortunately, that's how most systems are designed. Security is a patch that you apply on top of an existing solution, when and where you can. For example, by default, Kubernetes allows containers to talk freely with each other and to run with full system permissions. If you want that not to be the case, you need to opt in to security. And you need to do that for every security feature, one at a time. I bet you got bored just reading that.
The common justification for this design is that you have to balance security against convenience. For example, adding every service your code talks to onto an "allow list" is tedious and error-prone, but it makes denial the default. Balancing, however, doesn't mean you have to pick one or the other. You can design systems that strike a healthy balance between the two.
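To make the opt-in nature of Kubernetes security concrete, here is a minimal sketch using the official Kubernetes Python client. The namespace, labels, and the "feature store" service are hypothetical placeholders: first a default-deny policy for the whole namespace, then an explicit allow rule for the one dependency a workload actually needs.

```python
# Sketch: flip the Kubernetes default from "allow everything" to "deny by default".
# Namespace, labels, and service names below are hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
networking = client.NetworkingV1Api()

# Deny all ingress and egress for every pod in the namespace.
default_deny = client.V1NetworkPolicy(
    metadata=client.V1ObjectMeta(name="default-deny-all"),
    spec=client.V1NetworkPolicySpec(
        pod_selector=client.V1LabelSelector(),  # empty selector = all pods
        policy_types=["Ingress", "Egress"],
    ),
)

# Then allow only what a workload actually needs, one rule at a time.
allow_feature_store = client.V1NetworkPolicy(
    metadata=client.V1ObjectMeta(name="allow-feature-store-egress"),
    spec=client.V1NetworkPolicySpec(
        pod_selector=client.V1LabelSelector(match_labels={"app": "training-job"}),
        policy_types=["Egress"],
        egress=[client.V1NetworkPolicyEgressRule(
            to=[client.V1NetworkPolicyPeer(
                pod_selector=client.V1LabelSelector(match_labels={"app": "feature-store"})
            )],
        )],
    ),
)

for policy in (default_deny, allow_feature_store):
    networking.create_namespaced_network_policy(namespace="ml-workloads", body=policy)
```

The important part is the shape: deny everything once, then open each hole deliberately.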
Before we can understand how to make our systems both secure and convenient, we need to know how attacks happen. We won't go into the details of every possible attack, but attacks generally follow three steps: intrusion, expansion, and extraction. The attacker iterates through these steps until they reach their goal.
Intrusion is probably the step most people are familiar with: phishing emails, logins compromised through weak passwords, or simply reaching a service exposed to the internet. In this step, the attacker tries to gain access to a system they are not supposed to reach. If the attacker can't get in, they can't cause you harm. The IT community has gotten much better at this in general, but it's hard to cover all the bases. For example, an ordinary library that you install can open the door to your system. Let's be honest, how often do your data scientists install libraries without thinking about this attack vector? I'd say almost always.
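One way to close that particular door is to install only artifacts your team has pinned and reviewed ahead of time. Here is a small sketch of the idea in plain Python; the package name and digest below are placeholders, and pip can enforce the same thing natively with --require-hashes.

```python
# Sketch: reject any dependency artifact that doesn't match a pre-reviewed digest.
# The file name and digest in PINNED_SHA256 are placeholders, not real values.
import hashlib
from pathlib import Path

PINNED_SHA256 = {
    "example_lib-1.2.3-py3-none-any.whl":
        "0000000000000000000000000000000000000000000000000000000000000000",
}

def is_pinned_artifact(path: Path) -> bool:
    """Return True only if the file matches a digest pinned in advance."""
    expected = PINNED_SHA256.get(path.name)
    if expected is None:
        return False  # unknown artifact: reject by default
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return digest == expected

# Usage: check an artifact before it ever reaches `pip install`.
# if not is_pinned_artifact(Path("downloads/example_lib-1.2.3-py3-none-any.whl")):
#     raise RuntimeError("Unreviewed dependency, refusing to install")
```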
Once the attacker has a foothold in your system, it's time to expand. The techniques should also be familiar: the usual privilege escalation, or missing firewalls between networks. The attacker's goal is not necessarily to infect the most machines but to infect the devices that get them closer to their target. Even if you block the direct path, they might find a way around it. For example, assume you have blocked communication between two containers. If the attacker can escape one container and the nodes can talk freely, they can ignore your barrier. It might sound like adding more steps makes things much harder, but it doesn't really. It's pretty common these days to chain a sequence of exploits, each one giving up just a little more access than it should, until they get through.
Finally, they reach the target, and it's time to extract. Not every attack includes this step; an attacker who only wants to cause mayhem can skip it. But when a company is the target, the attacker usually wants to retrieve some data. It can be as small as a cryptographic signing key or as big as complete records on millions of customers. You'd be surprised how easy this step is in most cases. Sometimes the attacker has free access to the internet and can simply upload the data. Other times they only have access to the network inside your cloud provider, so they upload the data to a bucket they own. In rarer cases, they find a side-channel. For most systems, once someone has access to the data, they can probably just walk away with it.
As I said before, the challenge of protecting ML platforms is balancing the flexibility data scientists require with good, secure software engineering practices. I'm not going to touch on the obvious methods (e.g., authorization barriers, private networks, data encryption); if you are missing those, you have a deeper problem. But when it comes to ML workloads specifically, leveraging domain knowledge helps.
The first recommendation is to build distributed systems defensively. Assume that everything that is not your software is trying to cause you harm. Even if the piece you're working on is tiny, it can be exploited. It only needs to help a little bit in one of the three steps above to become the weak link that breaks the chain. Consider everything going in and out of your software and whether it's necessary.
Of course, this requires a lot of discipline. However, this is a case where ML platforms have an advantage. Most of the time, your customer doesn't need complex permissions or access. They don't need root; they won't mess with your network; they don't need to know precisely where things are. So your job (or your provider's job) is to make sure that the data scientist has a well-guarded sandbox. They can do absolutely anything they want inside that sandbox (including bringing in a bad actor), but nothing gets in or out without your permission.
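As a sketch of what such a sandbox can look like on Kubernetes (again using the Python client; the image, namespace, and resource limits are hypothetical), the workload runs as a non-root user, can't escalate privileges, gets no API credentials by default, and has a hard cap on the compute it can consume. Combined with the default-deny network policies above, nothing gets in or out unless you open the path.

```python
# Sketch: a locked-down pod for a data scientist's workload.
# Image, namespace, labels, and limits are hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()

sandboxed_pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="notebook-sandbox", labels={"app": "notebook"}),
    spec=client.V1PodSpec(
        automount_service_account_token=False,  # no free Kubernetes API credentials
        containers=[client.V1Container(
            name="notebook",
            image="example.registry/notebook:latest",  # placeholder image
            security_context=client.V1SecurityContext(
                run_as_non_root=True,                  # no root inside the container
                allow_privilege_escalation=False,
                read_only_root_filesystem=True,
                capabilities=client.V1Capabilities(drop=["ALL"]),
            ),
            resources=client.V1ResourceRequirements(
                limits={"cpu": "4", "memory": "16Gi"},  # cap what one user can burn
            ),
        )],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="ml-workloads", body=sandboxed_pod)
```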
Next, you need to give them access to the systems they need. They might need to talk to a database or to another service to do their job. That access pokes holes in their sandbox. It's your job to make sure that, when you have to trust something else, you grant the minimum level of trust. The attacker will use every piece of information you leak against your system. For example, consider the common web scenario of an email and password. Once you give them to one website, it knows your email provider, the address you might have used in other systems, and potentially the password too. Every time you reuse the same information in two places, you make the expansion step easier. Every time authentication credentials are shared, you double the return on an exploit.
Again, the patterns of ML systems can be your ally here. Most teams don't need unusual access patterns to external resources. If you provide a library that makes their job simpler, you can leverage that same library to control the security holes you create. For example, forget about giving a container full access to S3 for its entire lifetime; that's more power than it needs. Instead, grant the user permission to access the specific location, for the period they need it, tied to a unique identifier. If they're running a training job whose data fits on local disk, that window is a few minutes at the beginning of the run, and you're protected throughout the rest of it.
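Here is a sketch of that idea with AWS STS and boto3: instead of a long-lived, bucket-wide credential, the platform hands each job short-lived credentials scoped to one prefix and tagged with a unique session name. The role ARN, bucket, and prefix are hypothetical placeholders.

```python
# Sketch: issue short-lived, prefix-scoped S3 credentials per training job.
# The role ARN, bucket, and prefix are hypothetical placeholders.
import json
import uuid
import boto3

def issue_training_credentials(bucket: str, prefix: str, minutes: int = 15) -> dict:
    """Return temporary credentials limited to read-only access on one S3 prefix."""
    session_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
        }],
    }
    sts = boto3.client("sts")
    response = sts.assume_role(
        RoleArn="arn:aws:iam::123456789012:role/ml-data-reader",  # placeholder role
        RoleSessionName=f"training-{uuid.uuid4()}",               # unique identifier per job
        Policy=json.dumps(session_policy),                        # further restricts the role
        DurationSeconds=minutes * 60,                             # STS minimum is 900 seconds
    )
    return response["Credentials"]  # AccessKeyId, SecretAccessKey, SessionToken, Expiration
```

A side benefit: the unique session name shows up in the audit trail, so every access can be tied back to a specific job.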
Finally, you need to observe usage patterns and flag any anomaly. This practice is already common for live services in many organizations and is required by some regulations. If a service starts sending more data than usual, or a node starts to misbehave, something is up. It might be just a regular software bug, or it might be a bad actor. The operations team receives an alert and handles it, potentially even through an automated system.
To our advantage, ML workloads are remarkably predictable. Of course, there is a lot of potential variability, but it rarely shows up in practice. You can probably group your organization's workloads into a few categories: different types of training, types of data processing, and exploratory research. If one workload starts to deviate from its expected pattern, that's a point of concern; if the deviation persists, it's likely an issue. In the Kubeflow breach example, one could have observed that no data was being accessed while compute usage looked unusual. That observation alone would have flagged that an attacker was in the system.
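A toy sketch of that kind of check in plain Python: keep a per-category baseline of a metric such as GPU-hours per run and flag anything that sits far outside it. The categories, values, and threshold are hypothetical.

```python
# Sketch: flag workloads whose resource usage drifts from their category baseline.
# Categories, baseline values, and the z-score threshold are hypothetical.
from statistics import mean, stdev

def is_anomalous(sample: float, history: list[float], z_threshold: float = 3.0) -> bool:
    """Flag a measurement that sits far outside the category's historical range."""
    if len(history) < 10:
        return False  # not enough data to judge yet
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return sample != mu
    return abs(sample - mu) / sigma > z_threshold

# Example: GPU-hours per run for a "nightly-training" category.
baseline = [4.1, 3.9, 4.3, 4.0, 4.2, 3.8, 4.1, 4.0, 4.2, 3.9]
print(is_anomalous(4.1, baseline))   # False: looks like every other night
print(is_anomalous(19.7, baseline))  # True: someone is burning far more compute
```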
Here’s my main piece of advice: use your team's domain knowledge to protect your system. The attacker will use domain knowledge to improve their attack; there’s no doubt. Attacks only get better, not worse.
We saw why ML environments are attractive to attackers and why security is hard in general. But not all hope is lost. I could talk about general good security practices, but you can find those in thousands of other places. The fact that attacks generally look similar is also good news: known methods will catch most issues. But that is not enough.
It’s your duty to use domain expertise to understand how to prevent the attacks. Security practices tell you how to make systems secure; domain knowledge tells you how to make security less inconvenient. You can make your data scientists efficient while keeping your company safe.