Greenberg: I’m Aysylu Greenberg. We’ll talk about a very important topic, especially in light of recent events, Software Supply Chains for DevOps. I’m a senior software engineer at Google. I’m the tech lead of the GCP Container Analysis and Container Scanning team. I maintain the Grafeas and Kritis open source projects. I hosted the inaugural software supply chain track at QCon San Francisco in 2019. I can be found on Twitter @Aysylu22.
Software Supply Chain Attacks
Of course, what do these three have in common: npm event-stream, Kaseya, SolarWinds? The answer is software supply chain attacks. Specifically, in all three cases, a malicious update gave attackers a foothold to carry out the rest of the attack. In 2018, event-stream, which is a very popular Node.js library, changed ownership, and a malicious update was made to one of event-stream’s dependencies. It targeted Bitcoin wallets. To this day, the exact number of affected people is unknown. In 2020, SolarWinds had a vulnerability introduced via its update framework, which impacted a lot of very important companies and government organizations. Earlier this month, Kaseya had a software supply chain attack, where a malicious update allowed attackers to gain access to customer data. Ransomware in particular has become more common recently, in both the case of Kaseya and the Colonial Pipeline attack, which happened earlier this year. This was so serious and critical that these issues were recognized at the highest levels of the government, and an executive order was issued to highlight the importance of this subject and ensure that everyone is aware that this will only become more prominent and more common, and that we should do something about it.
Modern Challenges in Software Supply Chains
Why now? Why exactly are software supply chain attacks more common now? The answer is that in a containerized cloud computing world, more of our computing needs are taken care of by open source projects and by partner and vendor dependencies, and it’s infeasible at this point for any one engineer, team, or company to monitor and audit every single one of their dependencies, and their dependencies’ dependencies, and so on. It’s too complex of a problem. We need automated tooling and automated solutions in order to help keep our software supply chains safe and secure.
There is a blog post that came out earlier this year that gives an example of exactly how vulnerable our software supply chains are in the current world. As you can see, in red, these are all of the different weak spots through which the software supply chain could be attacked, such as submitting bad code, introducing bad dependencies, or compromising a package repository. SSC is shorthand for Software Supply Chains.
The talk will be in five parts. First, we’ll get on the same page about what software supply chains are. Then we’ll talk about the use of software supply chains for DevOps: how exactly DevOps could use these to monitor workloads, keep them secure, and increase their security posture. We’ll talk about existing solutions in this space, and some of the open challenges. At the end, I would love for us to walk away with three takeaways. One is a shared understanding of what software supply chains are. Two is how they can be used by DevOps to benefit us during incident handling, and to prevent bad things from happening by shifting detection as far left as possible. Three is the existing solutions, challenges, and open problems in this space.
What Are Software Supply Chains?
Let’s talk about software supply chains. Software supply chain is a more modern term for the software development lifecycle. I’ve specifically heard it used in the cloud computing world. Basically, what it entails is open source and proprietary libraries that go into the source code developed within the company. Then that code gets built and tested, deployed to staging and other pre-production environments, and later deployed to a production environment. Once it’s running in production, we can monitor workloads and detect and alert on incidents. Patch management is also part of this, both detecting that there is a patch available that could take care of one of our vulnerabilities, and actually applying it.
Roughly, these stages divide up as follows. The first stages, source code, build, and test, are taken care of by our continuous integration pipelines. Deploying to our different environments, including production, as well as deploying patches, is taken care of by continuous delivery pipelines. Then a lot of our observability tools exist to monitor workloads, declare incidents, and detect when an update that might patch a vulnerability becomes available.
Source to Prod
Now that we’re on the same page about what software supply chains mean in this context, let’s talk about using software supply chains for DevOps. Quickly walking through the general development lifecycle: we have our engineer who will be building and deploying code. We have a build process that, through CI/CD pipelines, leads to automated testing, scanning, and various types of analysis run on the code. Then deploy checks will run to decide whether something is good to go to production or not. Finally, it winds up in production.
Incident: O11y to the Rescue
Now, what happens? Our pagers go off, we have an incident. Alerts are triggered. What do we do? First and foremost, we have, hopefully, great observability tools that come to the rescue. We’ll look at our graphs, our dashboards, our tracing, our server logs, and so on, and try to figure out what might be happening. We might have a decent idea of what’s happening, but we don’t necessarily know when it got introduced, so we want to confirm it and cross-correlate it, or perhaps we want to understand the impact. Do we go back to sleep at 2:15 a.m. because it’s fine, it’s not a big deal, it’s only affecting a couple probes? Or do we stay up and make sure that we fix it as soon as possible, because the issue is already affecting, or might affect, a lot of customers, and we don’t want that to happen?
Incident: From Prod to Source
During this incident handling, we will now be questioning everything, from detection in production all the way back to the source. Every single stage in this supply chain will now come under investigation. Not only every stage in this supply chain, but also all of the open source dependencies that we have. Did anything change in the open source dependencies? Did we pull a new version? Did we accidentally roll back to an older version of the code? What happened? Did we miss a deploy check that should have run? Did we miss running some tests that could have detected this? Did a test get removed? If they all pass, then what happened? What happened in our code? Going all the way back: what’s different? Did the incident trigger because of something very new, or did it trigger because the issue had a slow burn and it finally crossed a threshold? Not only does our own code come into question here, but also all the third party dependencies, including open source dependencies, and all of our tooling as well, all the CI/CD pipelines. Is it possible that something went through because deploy checks did not run, because the continuous delivery pipeline had an issue?
In order to have answers to these questions, we really need to collect all of the data from every single stage in the software supply chain. Not only do we need to record every single result of the analyses and the tests, but also which deploy tool ran, which testing tool ran, and collect all of this. Sometimes it’s not possible: some of the tooling you might be using just throws away the data as soon as it runs the checks. Making sure that we collect all of this data along the way, and store it in some Universal Artifact Metadata store, allows us to go back, maybe even weeks after the incident was resolved, to estimate the impact, or just to go back to the day before. That is important in order for us to be able to figure out what went wrong, and why.
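As a rough illustration of the idea, here is a minimal sketch of a "Universal Artifact Metadata" store that keeps a record per stage, keyed by the artifact's digest. All class and field names here are invented for illustration; this is not the Grafeas API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class StageRecord:
    stage: str   # e.g. "build", "test", "deploy" (illustrative stage names)
    tool: str    # which tool produced this record
    detail: dict # tool-specific payload (test results, scan findings, ...)
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

class MetadataStore:
    """In-memory stand-in for a durable artifact metadata service."""

    def __init__(self):
        self._records = {}  # artifact digest -> list of StageRecord

    def record(self, digest: str, rec: StageRecord) -> None:
        self._records.setdefault(digest, []).append(rec)

    def history(self, digest: str) -> list:
        # Everything we know about one artifact, across all stages,
        # available even weeks after an incident is resolved.
        return list(self._records.get(digest, []))

store = MetadataStore()
store.record("sha256:abc", StageRecord("build", "ci-builder", {"commit": "deadbeef"}))
store.record("sha256:abc", StageRecord("test", "unit-runner", {"passed": True}))
print([r.stage for r in store.history("sha256:abc")])  # → ['build', 'test']
```

The key design point is that records are append-only and keyed by an immutable artifact identity (the digest), so the history survives the artifact being undeployed.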
The Level of Detail to Collect – Signal to Noise Ratio
What level of detail should we collect in order to make it useful for our DevOps? Really, it’s best to collect everything. You never know when you’ll need that information. Fine granularity does allow us to not have observability gaps. Maybe the workload is no longer deployed in production, but we found out a lot later that there was an incident that was impacting us. We would still want to be able to go back and analyze it.
A natural question is then, am I going to get overwhelmed with this information? The right signal to noise ratio is incredibly important. Just having lots of graphs isn’t sufficient if it doesn’t actually improve the observability of our system. For example, if our tool is reporting all vulnerabilities on an image, is it possible that some of them are false positives? If we have only a few of them, double digit vulnerabilities recorded, then that’s fine. Our operator can go through them by hand, and make sure that there isn’t anything there that should come under suspicion. When we’re suspecting everything, and we haven’t been able to quickly pinpoint what’s happening, then we’ll have to go through this with a finer tooth comb. If the tool is reporting 1000 vulnerabilities, is it really feasible for our operator to go through all of them by hand? This is when automated solutions become important. Not only that, but the tooling also needs to focus on making sure that it does not introduce false positives, or that false positives are minimized as much as possible.
On the other hand, there are compliance requirements. FedRAMP, for example, requires detailed information about the software supply chain in order to be in good standing, which means collecting everything that the auditors might require and that your customers would want to know. Striking the right balance between collecting everything possible, so you can look back at it and confirm you’re in good standing, and being able to derive insights, is incredibly important for our DevOps.
Vertical and Horizontal Querying
Another question is, how do we efficiently query for these details? We need vertical and horizontal querying in this case. What is horizontal querying? It’s a query across all artifacts for a specific property. For example, find all images that are built from a particular GitHub commit that is known to have introduced a security problem. A vertical query looks at metadata across the software development lifecycle for a specific artifact. For example, find all source, build, test, vulnerability, and deployment metadata for a given container image.
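The two query shapes can be sketched in a few lines of Python. The in-memory `occurrences` list stands in for a real metadata API, and the field names are illustrative, not a real schema.

```python
# Flattened per-artifact metadata records, one dict per "occurrence".
occurrences = [
    {"image": "img-a", "kind": "build", "commit": "bad123"},
    {"image": "img-a", "kind": "vulnerability", "cve": "CVE-2021-0001"},
    {"image": "img-b", "kind": "build", "commit": "good456"},
    {"image": "img-b", "kind": "deployment", "env": "prod"},
]

def horizontal(records, kind, **match):
    # Across ALL artifacts: e.g. every image built from a known-bad commit.
    return [r["image"] for r in records
            if r["kind"] == kind and all(r.get(k) == v for k, v in match.items())]

def vertical(records, image):
    # For ONE artifact: all metadata across the whole lifecycle.
    return [r for r in records if r["image"] == image]

print(horizontal(occurrences, "build", commit="bad123"))  # → ['img-a']
print(len(vertical(occurrences, "img-b")))                # → 2
```

A real store would index on both axes (by property and by artifact) so that neither query degrades to a full scan.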
Software Supply Chain Solutions
We’ve talked about how software supply chain information can be used for the benefit of DevOps, and about some of the challenges in finding the right signal to noise ratio in that space. Now let’s talk about some of the existing solutions. Going back to our initial diagram of the typical process that our engineers go through, we have our software supply chain. For the Universal Artifact Metadata, we have a couple solutions available. One is the open source project Grafeas. It’s an API to audit and govern your software supply chain. What it allows us to do is represent every stage in the software supply chain and record metadata about it, so that this structured metadata can be queried in an efficient way to derive insights. There’s a hosted version of it, the Container Analysis API on GCP. This diagram is taken from the public documentation. You see, it’s very similar to what we discussed earlier: the source information is collected there, along with the build information from continuous integration, the continuous delivery information about our deployments, and the deploy and runtime information from production, which all get collected into this metadata.
Example – Find All Workloads that Exceed Build Horizon
Here’s an example of how you can utilize all this metadata that was collected across the software supply chain. Suppose we want to find all workloads that exceed the build horizon. First, we will list all of the deployment information. There is a concept of notes and occurrences in Grafeas. Notes are some general, high-level information about an entity in the artifact metadata. Occurrences are instances of them, if I were to borrow object oriented terminology. We look at all of the deployment information. Then for all of those occurrences, all the bits of information about all of the images, we go through them one by one and look at any that are currently running. UndeployTime will be nil if the image is still running; UndeployTime will be set to some time if we already brought down that container image. We mark those as running images. We then iterate over all of the running images and look at the build information for each of them. Specifically, we look at the time at which the build information was created: when was the container image built? Then we compare it to the build horizon. Say your company’s policy is that no workloads built over 60 days ago should be running in production; this allows us to flag those workloads. This shows that the structured metadata can be used for a variety of flexible queries.
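The steps above can be sketched as follows, with plain dicts standing in for Grafeas deployment and build occurrences. The field names are illustrative, not the real Grafeas schema.

```python
from datetime import datetime, timedelta, timezone

BUILD_HORIZON = timedelta(days=60)  # company policy, per the example above
now = datetime.now(timezone.utc)

# Deployment "occurrences": undeploy_time is None while the image still runs.
deployments = [
    {"image": "img-old", "undeploy_time": None},
    {"image": "img-gone", "undeploy_time": now},  # already brought down
    {"image": "img-new", "undeploy_time": None},
]
# Build "occurrences": when each image was built.
builds = {
    "img-old": now - timedelta(days=90),
    "img-gone": now - timedelta(days=120),
    "img-new": now - timedelta(days=5),
}

# Step 1: images whose undeploy_time is unset are still running.
running = [d["image"] for d in deployments if d["undeploy_time"] is None]
# Step 2: flag any running image built before the horizon.
flagged = [img for img in running if now - builds[img] > BUILD_HORIZON]
print(flagged)  # → ['img-old']
```

Note that `img-gone` is older than the horizon but is not flagged, because it is no longer deployed; the policy only concerns running workloads.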
Deploy Policy Checks: Kritis and Voucher
Then, for the deploy time checks, there are a couple other open source tools. One is Kritis, which is a deploy time policy enforcer for Kubernetes applications. Another one is Voucher, which creates attestations for binary authorization. They also come with very simple, easy to use policies out of the box that allow you to get started with the two projects right away. What these allow you to do is define a policy, and then anytime a Kubernetes pod gets deployed, the policy engine will verify that it passes all the checks, and only then allow the deployment, based on the policies defined by the customer. The equivalent of that is Binary Authorization on GCP, which is basically hosted Kritis.
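To make the idea concrete, here is a minimal sketch of a deploy-time policy check in the spirit of Kritis and Binary Authorization. The policy shape and every field name are invented for illustration; this is not the real Kritis policy format.

```python
# Hypothetical policy: require an attestation and cap vulnerability severity.
POLICY = {
    "max_severity": "MEDIUM",
    "require_attestation": True,
}
SEVERITY_RANK = {"LOW": 0, "MEDIUM": 1, "HIGH": 2, "CRITICAL": 3}

def admit(image_meta, policy=POLICY):
    """Return (allowed, reason) for a pod's image at deploy time."""
    if policy["require_attestation"] and not image_meta.get("attested"):
        return False, "missing attestation"
    limit = SEVERITY_RANK[policy["max_severity"]]
    for vuln in image_meta.get("vulnerabilities", []):
        if SEVERITY_RANK[vuln["severity"]] > limit:
            return False, f"vulnerability {vuln['cve']} exceeds policy"
    return True, "ok"

ok, why = admit({"attested": True,
                 "vulnerabilities": [{"cve": "CVE-2021-1", "severity": "HIGH"}]})
print(ok, why)  # → False vulnerability CVE-2021-1 exceeds policy
```

In a real cluster this check would run as an admission webhook, consuming scan and attestation occurrences from the metadata store rather than an inline dict.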
SPDX SBOM (Software Bill of Material)
Now for the build process, to record everything that goes into specific software artifacts, we have the SPDX SBOM. SBOM stands for Software Bill of Materials. SPDX is a format that has existed for over a decade. It’s part of the Linux Foundation. The team has been solving this problem for a very long time, and they’ve done some incredible work there. The way they describe it, it’s an open standard for communicating Software Bill of Materials information, including components, licenses, copyrights, and security references. What you see here is a little screenshot from their website, which describes the level of detail that can be captured by an SPDX Software Bill of Materials: the document is the top level, then package information, files, snippets, licenses, and relationships describing all of them. It allows you to know everything that went into a specific artifact, a specific binary, and to share that with your customers so that they know to trust it and can inspect it. It is machine readable, so tooling can be built on top of it.
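As a sketch of the machine-readable idea, the snippet below emits a small SPDX-flavored SBOM for a built artifact. Only a handful of fields are shown, and the exact layout is simplified for illustration; real SPDX documents carry much more (licenses, relationships, snippets), so treat this as the shape of the idea, not the spec.

```python
import hashlib
import json

def make_sbom(name, version, files):
    """Build a minimal SPDX-like document for a package and its files.

    `files` maps file names to their raw bytes; each file gets a SHA-256
    checksum so a consumer can later verify what was distributed to them.
    """
    return {
        "spdxVersion": "SPDX-2.2",
        "name": name,
        "packages": [{
            "name": name,
            "versionInfo": version,
            "files": [{
                "fileName": fn,
                "checksum": {
                    "algorithm": "SHA256",
                    "checksumValue": hashlib.sha256(data).hexdigest(),
                },
            } for fn, data in files.items()],
        }],
    }

sbom = make_sbom("myapp", "1.0.0", {"app.bin": b"binary-bytes"})
print(json.dumps(sbom, indent=2)[:60])
```

The checksums are what make the document actionable: a customer can hash the artifact they received and compare it against the SBOM without trusting the delivery channel.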
Integrity of Supply Chain – in-toto
In-toto is in charge of verifying the integrity of the entire software supply chain. The team describes it as a framework to secure the integrity of software supply chains. It’s a project that came out of NYU. The way you work with it is you define a layout, where you describe all the sources for your source code, all the pipelines that are allowed to build your code and deploy it, and where it should be deployed. Then the framework validates that the integrity of the software supply chain is not compromised. It allows us to ensure that each of these stages happened as it claims it did, that nothing else happened, and that nothing that should have happened was skipped. Upfront, you declare what needs to happen; then you can validate that all the right steps were taken. All of these tools work really well in conjunction to provide solutions that improve the security and integrity of software supply chains.
We talked about the existing solutions. Now let’s talk about the open problems in this space. It’s not a complete list; this is mostly a list off the top of my head of the open problems that I, my team, and my colleagues in adjacent teams are solving. The first one is secure builds and secure supply chains. This is a relatively new area where a lot of bright minds are currently working to ensure that we can verify that a build was secure and that the supply chain’s security posture is improved. Another one is compliance. FedRAMP, for example, now requires all cloud service providers to ensure that they provide the right level of detail to their customers and stay in good standing. Collecting metadata and being able to query it is one of the many ways to help here, for both cloud service providers and their customers, to stay in good standing and ensure that they’re protected. Supply chain integrity is another open problem that a lot of people are currently working on. The difference between security and integrity, since these often get confused when used in the same context, is that security is about who is authorized to access things and who’s authorized to do what, whereas integrity is about ensuring that the correct steps were taken, the correct data was generated, and nothing was skipped over.
There’s also security and integrity in open source software, since pretty much all of the software running right now in small and large companies relies on open source, and there is a lot of complexity around ensuring that the open source dependencies being built in can be trusted. These are some of the open problems that we are all working on. Finally, there’s data quality in vulnerability scanning. I mentioned that having false positives makes it very difficult to trust the data. Just because you report a lot of vulnerabilities does not necessarily mean that they all apply. It’s a very hard problem to identify what does and doesn’t apply in a container image or VM image, and it’s something that all the different solutions in this space are working hard on. We want to make sure that the right level of detail is surfaced so that when DevOps, when administrators, go over the container images, they can validate them in a very efficient manner.
We talked about the open problems. To conclude, these are the three takeaways I hope you’re able to walk away with. First, the definition of software supply chains: basically, the software development lifecycle for cloud applications, with all the complexity that comes from the multitude of dependencies, including open source dependencies that accept contributions from everyone, and from ensuring that those dependencies can be trusted. Second, software supply chain metadata is needed to help DevOps derive insights, effectively determine what might be going wrong, deploy the right fixes, determine the impact of incidents, and hopefully prevent the same incidents from happening in the future. Finally, there are existing solutions. We covered Grafeas, Kritis, the Software Bill of Materials by SPDX, and in-toto for the integrity of software supply chains, and we talked about challenges and open problems.
Questions and Answers
Losio: Do software developers have to sign code before deploying, to increase security?
Greenberg: At least signing on deployment, once it is out in the public, yes, it does help. Any additional piece of information helps that you can provide to the customers who will be consuming the code, so that they can verify that this code comes from you. Whether it’s a package being distributed, or the part that you deploy, someone else can check that what they’re going to be connecting to is the thing that they intend to connect to, based on the chain of trust that they’ve created with you. Yes, that’s a possibility. The Software Bill of Materials addresses it from another angle, which is listing everything that goes in there so that you can validate it. You can also check the dependencies in whatever was distributed to you, and see if it actually matches the Software Bill of Materials that was provided. It would help increase the security posture of your application, and your customers’ applications as well.
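The consumer-side check described here can be sketched simply: hash the artifact you received and compare it against the checksum listed in the provider's SBOM. The SBOM entry shape below is illustrative.

```python
import hashlib

def matches_sbom(artifact_bytes, sbom_entry):
    """Verify a distributed artifact against its SBOM-listed SHA-256."""
    digest = hashlib.sha256(artifact_bytes).hexdigest()
    return digest == sbom_entry["sha256"]

# Hypothetical SBOM entry shipped alongside the artifact.
entry = {"fileName": "lib.tar.gz",
         "sha256": hashlib.sha256(b"trusted-release").hexdigest()}

print(matches_sbom(b"trusted-release", entry))   # → True
print(matches_sbom(b"tampered-release", entry))  # → False
```

A checksum only proves the artifact matches the SBOM; trusting the SBOM itself is where signing and the chain of trust mentioned above come in.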
Losio: This is cloud vendor agnostic. I was wondering if there’s any different or specific way you can do it if you are in a multi-cloud setup versus working with just one cloud provider. In that case, what is the shared responsibility when you talk about the cloud provider? Where’s the limit of the part you have to take care of yourself, basically?
Greenberg: First, I’ll address the multi-cloud provider aspect of it, and then the separate aspect, which is the delineation of responsibilities. Grafeas and Kritis, as presented here, are all open source. There is a hosted version on GCP available, but they can also be hosted on other cloud providers, including on a private cloud. They are meant to work really well with hybrid clouds. It’s just that a hosted version needs to be built elsewhere, or the person could choose to host it themselves using the available server implementation. They are meant to help those multi-cloud and hybrid cloud cases, because those are just a lot more popular than having one single cloud provider, or strictly having only a private cloud.
The second part of your question was around the responsibility across cloud providers?
Losio: Yes. You started talking about the case of an incident. Of course, there’s the boundary, the layer that you reach with your deployment, your responsibility, and then there’s the part that falls under the standard shared responsibility model with the cloud provider. When do I need to get the cloud provider involved in this? How do I automate that? Do you automatically trigger something in support as well? How do you deal with that part, or do you consider that out of scope for the solution, or for the SRE managing it?
Greenberg: The executive order called to everyone’s attention that providing a Software Bill of Materials should be the responsibility of the providers of the software, so I hope that, over time, the industry as a whole will move towards that model, so that we can always present what you’re getting with this. It really depends on your threat model in general, in terms of how far you want to go. You can take it all the way to only accepting things that are signed by trusted parties. Depending on the security profile, you might say, “I trust everything that comes from this service provider,” and move on, or do some hybrid approach for the most sensitive parts of the application. It really depends. Requiring or asking your service providers to give more of that information and more of that tooling, and to integrate with all the existing stuff that is out there, is pretty reasonable. For example, not every vendor has integrated with Grafeas; there are some. Increasing that adoption, so that you can switch from Travis to CircleCI but still preserve that metadata, would be very helpful. It does require either the consumer to integrate with this and provide their own custom integration, or asking their vendors to provide it.
Losio: How can we actually validate the tool that you use to validate our software supply chain? Basically, the tools that you mentioned contribute to our software supply chain, so how do we handle that?
Greenberg: You should definitely be requiring this. One approach is to require tool providers to use their own tooling on itself, bootstrapping it, so that the tool itself presents its metadata. That is definitely a very valid concern. Until everyone is using this tooling, and presenting the results of using it, every subsequent layer is just deferring that concern.
Losio: Of course, that responsibility.
The second question was related to the integration with Gatekeeper. I don’t know if you have any specific advice in that sense, in terms of deploy time policy. Does it integrate with Gatekeeper?
Greenberg: Currently, it does not, but there were discussions around that. There was also discussion about integrating with Open Policy Agent. It’s a matter of whether the community is really interested in having this, and contributing it back to open source. These kinds of integrations allow the ecosystem to not be so fractured, to work really well together, and to solve the smaller aspects of the bigger problem that we’re all trying to solve.
Losio: That’s really interesting. I assume it’s more like what the community can bring to that and what basically people are going to bring to the platform.
Greenberg: Exactly. One thing I’ve encountered with some of these problems is that the concerns can be theoretical, and having them be driven by business use cases, by your own use cases, in my experience, produces a lot more useful tooling. Because otherwise, it’s very easy to get very theoretical about what you might want out of your software supply chain, and what is actually immediately useful may be slightly different from the expectations. The way we’ve been developing some of the metadata kinds that we support is that either somebody in the community has a use case that they already know how to represent, or we have a use case and then we put it out there.
Losio: Otherwise it is nice in theory, but it’s not going to be really the main purpose of that.
Losio: You talked at the beginning about the poor engineer whose phone rings in the middle of the night and who has to decide whether to wait for a patch, or go back to bed because it’s not that bad. I was wondering, what’s the future? Are we going to integrate that with, for example, machine learning? Because there’s still a human being who, at some point, is presented with some flags and a lot of data and has to make a decision. Some of those decisions don’t require much judgment; some others feel like, I wouldn’t say fully automatic, but could be a next step. It could be something that gains us some time, or does the step for us, rather than just reporting to us.
Greenberg: I feel like that ties into the general concern of how much you can automate in your tooling. Having been building distributed systems, you want the system to be self-healing; you want to automate as much of your DevOps processes as possible, so that the system can fix itself, or at least defer until business hours. It’s a tricky problem. The way I’ve found it to be most successful is by getting to know the specifics of your system, and then over time automating the mundane things, so that when you are woken up, it’s for things that are actually complicated. Knowing that the incident is due to a vulnerability for which there is a patch coming is a hard problem.
Losio: It’s not as straightforward.
Greenberg: Exactly. In my opinion, the most useful way we can address this is by addressing the smaller pieces of how do you link the incident to that specific vulnerability? Then, how do you link that vulnerability to a specific patch? Then as we solve those smaller pieces, then continuing the integrations is probably the most effective way we could solve this.
Losio: Yes, because I understand from what you say that if you get to even the next step, even if it’s just shut down every system to protect yourself, that’s the easy part. The hard part is to reach that point.
Will these apps stop hackers from infiltrating cloud apps in the build cycle?
Greenberg: That’s the purpose of doing all of this. When talking about secure builds, not only do you need the tooling to record all of the data, collect the Software Bill of Materials, and have the build be signed, you also want to make sure that the build system itself is doing all the same steps, so that it can prove to you, and prove to itself, that it has not been infiltrated. Doing all of this, collecting all this information, constantly acting on it, and requiring these things from your build system are going to help. All of it has to be done together. Then, of course, it’s a cat and mouse game.
Losio: Stay one step ahead.