Modern software applications are underpinned by a large and growing web of APIs, microservices, and cloud services that must be highly available, fault tolerant, and secure. The underlying networking technology must support all of these requirements, of course, but also explosive growth.
Unfortunately, the previous generation of technologies is too expensive, brittle, and poorly integrated to adequately solve this challenge. Given sub-optimal organizational practices, regulatory compliance requirements, and the need to deliver software faster, a new generation of technology is needed to address these API, networking, and security challenges.
CAKES is an open-source application networking stack built to integrate and better solve these challenges. This stack is intended to be coupled with modern practices like GitOps, declarative configuration, and platform engineering. CAKES is built on the following open-source technologies:
- C – CNI (Container Network Interface) / Cilium, Calico
- A – Ambient Mesh / Istio
- K – Kubernetes
- E – Envoy / API gateway
- S – SPIFFE / SPIRE
In this article, we explore why we need CAKES and how these technologies fit together in a modern cloud environment, with a focus on speeding up delivery, reducing costs, and improving compliance.
Why CAKES?
Existing technology and organizational structures are impediments to solving the problems that arise with the explosion in APIs, the need for iteration, and an increased speed of delivery. Best-of-breed technologies that integrate well with each other, that are based on modern cloud principles, and that have been proven at scale are better equipped to handle the challenges we see.
Conway’s law strikes again
A major challenge in enterprises today is keeping up with the networking needs of modern architectures while also keeping existing technology investments running smoothly. Large organizations have multiple IT teams responsible for these needs, but at times, the information sharing and communication between these teams is less than ideal. Those responsible for connectivity, security, and compliance typically live across networking operations, information security, platform/cloud infrastructure, and/or API management. These teams often make decisions in silos, which causes duplication and integration friction with other parts of the organization. Oftentimes, “integration” between these teams is through ticketing systems.
For example, a networking operations team generally oversees technology for connectivity, DNS, subnets, micro-segmentation, load balancing, firewall appliances, monitoring/alerting, and more. An information security team is usually involved in policy for compliance and audit, managing web application firewalls (WAF), penetration testing, container scanning, deep packet inspection, and so on. An API management team takes care of onboarding, securing, cataloging, and publishing APIs.
If each of these teams independently picks the technology for its silo, then integration and automation will be slow, brittle, and expensive. Changes to policy, routing, and security will reveal cracks in compliance. Teams may become confused about which technology to use, as inevitably there will be overlap. Lead times for changes in support of app developer productivity will get longer and longer. In short, Conway’s law, which states that an organization’s systems often end up resembling the communication structure of that organization, rears its ugly head.
Sub-optimal organizational practices
Conway’s law isn’t the only issue here. Organizational practices in this area can be sub-optimal. Implementations on a use-case-by-use-case basis result in many isolated “network islands” within an organization because that’s how things “have always been done.”
For example, a new line of business spins up, which will provide services to other parts of the business and consume services from other parts. The modus operandi is to create a new VPC (virtual private cloud), install new F5 load balancers and new Palo Alto firewalls, create a new team to configure and manage it all, and so on. Doing this use case by use case causes a proliferation of these network islands, which are difficult to integrate and manage.
As time goes on, each team solves challenges in its environment independently. Little by little, these network islands start to drift away from each other. For example, we at Solo.io have worked with large financial institutions where it’s common to find dozens if not hundreds of these drifting network islands. Organizational security and compliance requirements become very difficult to keep consistent and auditable in an environment like that.
Outdated networking assumptions and controls
Finally, the assumptions we’ve made about perimeter network security and the controls we use to enforce security policy and network policy are no longer valid. We’ve traditionally assigned a lot of trust to the network perimeter and to “where” services are deployed within network islands or network segments. The “perimeter” deteriorates as we punch more holes in the firewall, use more cloud services, and deploy more APIs and microservices on premises and in public clouds (or in multiple public clouds as demanded by regulations). Once a malicious actor makes it past the perimeter, they can move laterally to other systems and get access to sensitive data. Security and compliance policies are typically based on IP addresses and network segments, which are ephemeral and can be reassigned. With rapid changes in the infrastructure, “policy bit rot” happens quickly and unpredictably.
Policy bit rot happens when we intend to enforce a policy, but because of a change in complex infrastructure and IP-based networking rules, the policy becomes skewed or invalid. Let’s take a simple example of service A running on VM 1 with IP address 10.0.1.1 and service B running on VM 2 with IP address 10.0.1.2. We can write a policy that says “service A should be able to talk to service B” and implement that as firewall rules allowing 10.0.1.1 to talk to 10.0.1.2.
Two simple things could happen here to rot our policy. First, a new service C could be deployed to VM 2. The result, which may not be intended, is that now service A can call service C. Second, VM 2 could become unhealthy and be recycled with a new IP address. The old IP address could be reassigned to a VM 3 with service D. Now service A can call service D but possibly not service B.
The previous example is a very simple use case, but if you extend this to hundreds of VMs with hundreds if not thousands of complex firewall rules, you can see how policy in environments like this can get skewed. When policy bit rot happens, it’s very hard to understand what the current policy is unless something breaks. But just because traffic isn’t breaking right now doesn’t mean that the policy posture hasn’t become vulnerable.
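The two failure modes above can be reproduced in a few lines. The following is a minimal Python sketch, not a model of any real firewall: the function `reachable_services`, the data layout, and the replacement address 10.0.1.9 are assumptions made only for illustration.

```python
def reachable_services(rules, hosts, src_ip):
    """Return the services src_ip can call, given IP-based allow rules.

    rules: set of (source_ip, destination_ip) pairs the firewall allows.
    hosts: dict mapping each IP address to the set of services running there.
    """
    allowed = set()
    for rule_src, rule_dst in rules:
        if rule_src == src_ip:
            allowed |= hosts.get(rule_dst, set())
    return allowed


# Intended policy: "service A should be able to talk to service B",
# implemented as an IP-to-IP allow rule that never changes below.
rules = {("10.0.1.1", "10.0.1.2")}

# Initial state: the rule matches the intent.
hosts = {"10.0.1.1": {"A"}, "10.0.1.2": {"B"}}
initial = reachable_services(rules, hosts, "10.0.1.1")

# Rot 1: service C is deployed to VM 2 alongside B.
hosts_colocated = {"10.0.1.1": {"A"}, "10.0.1.2": {"B", "C"}}
after_colocation = reachable_services(rules, hosts_colocated, "10.0.1.1")

# Rot 2: VM 2 is recycled. B comes back on a new address (10.0.1.9 is an
# assumed value) and the old address is reassigned to VM 3 running D.
hosts_recycled = {"10.0.1.1": {"A"}, "10.0.1.9": {"B"}, "10.0.1.2": {"D"}}
after_recycle = reachable_services(rules, hosts_recycled, "10.0.1.1")

print(sorted(initial))           # ['B']
print(sorted(after_colocation))  # ['B', 'C']
print(sorted(after_recycle))     # ['D']
```

The rule set is identical in all three cases. Only the infrastructure changed, yet the effective policy is different each time, and nothing in the rules themselves signals that.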
Conway’s law, complex infrastructure, and outdated networking assumptions make for a costly quagmire that slows the speed of delivery. Making changes in these environments leads to unpredictable security and policy impacts, makes auditing difficult, and undermines modern cloud practices and automation. For these reasons, we need a modern, holistic approach to application networking.
A better approach to application networking
Technology alone won’t solve some of the organizational challenges discussed above. More recently, the practices that have formed around platform engineering appear to give us a path forward. Organizations that invest in platform engineering teams to automate and abstract away the complexity around networking, security, and compliance enable their application teams to go faster.
Platform engineering teams take on the heavy lifting around integration and home in on the right user experience for the organization’s developers. By centralizing common practices, taking a holistic view of an organization’s networking, and using workflows based on GitOps to drive delivery, a platform engineering team can get the benefits of best practices, reuse, and economies of scale. This improves agility, reduces costs, and allows app teams to focus on delivering new value to the business.
For a platform engineering team to be successful, we need to give them tools that are better equipped to live in this modern, cloud-native world. When thinking about networking, security, and compliance, we should be thinking in terms of roles, responsibilities, and policy that can be mapped directly to the organization.
We should avoid relying on “where” things are deployed, what IP addresses are being used, and what micro-segmentation or firewall rules exist. We should be able to quickly look at our “intended” posture and easily compare it to the existing deployment or policy. This will make auditing simpler and compliance easier to ensure. How do we achieve it? We need three simple but powerful foundational concepts in our tools:
- Declarative configuration
- Workload identity
- Standard integration points
Declarative configuration
Intent and current state are often muddied by the complexities of an organization’s infrastructure. Trying to wade through thousands of lines of firewall rules based on IP addresses and network segmentation and understand intent can be nearly impossible. Declarative configuration formats help solve this.
Instead of thousands of imperative steps to achieve a desired posture, declarative configuration allows us to very clearly state what the intent or the end state of the system should be. We can look at the live state of a system and compare it with its intended state much more easily with declarative configuration than by trying to reverse engineer it from complex steps and rules. If the infrastructure changes, we can “recompile” the declarative policy to this new target, which allows for agility.
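To make the comparison of intended and live state concrete, here is a minimal Python sketch. It assumes both states can be expressed as sets of (caller, callee) pairs. The service names and the `diff_state` helper are hypothetical and not taken from any particular tool.

```python
def diff_state(intended, live):
    """Compare intended and live allow-lists, both sets of (caller, callee) pairs."""
    return {
        "missing": sorted(intended - live),     # intended but not enforced
        "unexpected": sorted(live - intended),  # enforced but never intended
    }


# Declared intent: which service may call which (hypothetical names).
intended = {("frontend", "orders"), ("orders", "payments")}

# What is actually in effect right now (hypothetical observation).
live = {("frontend", "orders"), ("frontend", "payments")}

drift = diff_state(intended, live)
print(drift)
# {'missing': [('orders', 'payments')], 'unexpected': [('frontend', 'payments')]}
```

Because the intent is data rather than a sequence of steps, detecting drift reduces to a set difference. A GitOps-style workflow could run a check like this continuously against the declared configuration.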
Writing network policy as declarative configuration is not enough, however. We’ve seen large organizations build good declarative configuration models, but the complexity of their infrastructure still leads to complex rules and brittle automation. Declarative configuration should be written in terms of strong workload identity that is tied to services mapped to organizational structure. This workload identity is independent of the infrastructure, IP addresses, or micro-segmentation. Workload identity helps reduce policy bit rot, reduces configuration drift, and makes it easier to reason about the intended state of the system and the actual state.
Workload identity
Previous methods of building policy based on “where” workloads are deployed are too susceptible to “policy bit rot.” Constructs like IP addresses and network segments are not durable, that is, they are ephemeral and can be changed, reassigned, or are not even relevant. Changes to these constructs can nullify intended policy. We need to identify workloads based on what they are and how they map within the organizational structure, and do so independently of where they are deployed. This decoupling allows intended policy to resist drift when the infrastructure changes, is deployed over hybrid environments, or experiences faults/failures.
With a more durable workload identity, we can write authentication and authorization policies with declarative configuration that are easier to audit and that map clearly to compliance requirements. A high-level compliance requirement such as “test and developer environments cannot interact with production environments or data” becomes easier to enforce. With workload identity, we know which workloads belong to which environments because it is encoded in their workload identity.
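As an illustration of how an environment encoded in a workload identity can drive such a rule, consider the minimal Python sketch below. It assumes SPIFFE-style URIs with a path layout of `/env/<environment>/svc/<service>`. That layout, the trust domain `example.org`, and the function names are assumptions chosen for this example, since the SPIFFE ID format leaves the path structure up to each organization.

```python
from urllib.parse import urlparse


def parse_identity(spiffe_id):
    """Split an ID of the assumed form spiffe://<trust-domain>/env/<env>/svc/<name>."""
    parsed = urlparse(spiffe_id)
    parts = parsed.path.strip("/").split("/")
    if (parsed.scheme != "spiffe" or len(parts) != 4
            or parts[0] != "env" or parts[2] != "svc"):
        raise ValueError(f"unexpected identity format: {spiffe_id}")
    return {"trust_domain": parsed.netloc, "env": parts[1], "service": parts[3]}


def is_allowed(caller_id, callee_id):
    """Deny any call where exactly one side is a production workload."""
    caller_is_prod = parse_identity(caller_id)["env"] == "prod"
    callee_is_prod = parse_identity(callee_id)["env"] == "prod"
    return caller_is_prod == callee_is_prod


test_orders = "spiffe://example.org/env/test/svc/orders"
dev_orders = "spiffe://example.org/env/dev/svc/orders"
prod_orders = "spiffe://example.org/env/prod/svc/orders"
prod_payments = "spiffe://example.org/env/prod/svc/payments"

print(is_allowed(test_orders, prod_payments))  # False
print(is_allowed(prod_orders, prod_payments))  # True
print(is_allowed(dev_orders, test_orders))     # True
```

The rule never mentions an IP address or a network segment, so it keeps its meaning when a workload is rescheduled, recycled, or moved to another cluster.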
Most organizations already have existing investments in identity and access management systems, so the last piece of the puzzle here is the need for standard integration points.
Standard integration points
A big pain point in existing networking and security implementations is the expensive integrations between systems that were not intended to work well together or that expose proprietary integration points. Some of these integrations are heavily UI-based, which makes them difficult to automate. Any system built on declarative configuration and strong workload identity will also need to integrate with other layers in the stack or supporting technology.