Red Hat Best Practices for Kubernetes

1. Foreword

This Red Hat Best Practices for Kubernetes document was originally created in 2020 by Verizon with assistance from Red Hat. Verizon’s vision of an open collaborative process to create industry standards for cloud-native workloads led this document to be where it is today. Over many iterations, and based on lessons learned and feature improvements in the areas of security, lifecycle management, networking, and many other components, the document has matured into its current state.

Many thanks to the team at Verizon for kick-starting this effort and for the many contributions Verizon has made to the document over the years.

2. Developing Containers and Operators for the OpenShift Container Platform (OCP)

Review the following information to learn more about developing Containers and Operators for the OpenShift Container Platform (OCP) in compliance with Red Hat certification requirements:

Building Operator that meets the Red Hat certification criteria
Partner guide for Red Hat Container, Operator, Helm Chart certifications
Partner guide for Red Hat Cloud Native Function (CNF) certification

2.1. Helm v3

Helm v3 is a serverless mechanism for defining templates that describe a complete Kubernetes application. This allows you to build generic templates for applications that you can use with site or deployment specific values to be provided as inputs to the template. Helm is roughly analogous to HEAT templates in the OpenStack environment.

For more information, see Understanding Helm.

Workload requirement

If you use Helm to deploy your application, you must use Helm v3 because of security issues with Helm v2.

See test case affiliated-certification-helm-version

Impacts and Risks of Non-Compliance: Helm v2 has known security vulnerabilities and lacks proper RBAC controls, creating significant security risks in production environments.

2.2. Kubernetes

Kubernetes is an open source container orchestration suite of software that is API driven with a datastore that manages the state of the deployments residing on the cluster.

The Kubernetes API is the mechanism that applications and users utilize to interact with the cluster. There are several ways to do this. For example, kubectl or oc CLI tools, web based UIs, or interacting directly with API using tools such as curl. You can use the SDK to build your own tools.

You can interact with the API in at least one of two ways. If the application or user is external to the cluster, the APIs can be accessed externally. If the application or user is directly connected to the cluster, they can access the cluster by using the Kubernetes Service Resource directly, bypassing the need to exit the cluster and log in again.

See test case platform-alteration-ocp-lifecycle

Impacts and Risks of Non-Compliance: End-of-life OpenShift versions lack security updates and support, creating significant security and operational risks.

2.3. CNI-OVN

OVN is the default pod network CNI plugin for OpenShift and is supported by Red Hat. OVN is Red Hat’s CNI for pods. It is a Geneve based overlay that requires L3 reachability between the host nodes. This L3 reachability can be over L2 or a pre-existing overlay network. OpenShift’s OVN forwarding is based on flow rules and implemented with nftables on the host operating system CNI pod.

2.4. Container storage (CSI)

Pod volumes are supported via local storage and the CSI for persistent volumes. Local storage is truly ephemeral storage, it is local only to the physical node that a pod is running on and is lost when a pod is killed and recreated. If a pod requires persistent storage, the CSI can be used via Kubernetes native primitives persistentVolume (PV) and persistentVolumeClaim (PVC) to get persistent storage, such as an NFS share via the CSI backed by NetApp Trident.

When using storage with Kubernetes, you can use storage classes. Refer to Block storage for a description of the available storage classes. Using storage classes, you can create volumes based on the parameters of the required storage.

Workloads should clear persistent storage by deleting the associated PV resources when removing the application from a cluster.

See test case lifecycle-persistent-volume-reclaim-policy

Impacts and Risks of Non-Compliance: Incorrect reclaim policies can lead to data persistence after application removal, causing storage waste and potential data security issues.

For more information, see Red Hat Persistent Storage.

2.5. Block storage

OpenShift Container Platform can provision raw block volumes. These volumes do not have a file system, and can provide performance benefits for applications that either write to the disk directly or implement their own storage service.

There are 2 types of storage connectivity and 2 levels of storage in each. All block storage is located on a NetApp appliance.

The two types of storage connectivity are:

NFS: is the default storage type
iSCSI: storage should only be used for database type applications

See Block Volume storage support for more information.

2.6. Ephemeral storage

Pods and containers can require ephemeral or transient local storage for their operation. The lifetime of this ephemeral storage does not extend beyond the life of the individual pod, and this ephemeral storage cannot be shared across pods.

Workload requirement

Pods must not place persistent data in ephemeral storage.

2.7. Local storage

Local storage is available on worker nodes for ephemeral storage only.

Workload requirement

Pods must not place persistent volumes in local storage.

See test case lifecycle-storage-provisioner

Impacts and Risks of Non-Compliance: Inappropriate storage provisioners can cause data persistence issues, performance problems, and storage failures.

2.8. Container runtime

OpenShift uses CRI-O as a CRI interface for Kubernetes. CRI-O manages runC for container image execution. CRI-O is an open-source container engine that provides a stable, performant platform for running OCI compatible runtimes. CRI-O is developed, tested and released in tandem with Kubernetes major and minor releases.

Images should be OCI compliant. Red Hat recommends that you build images using Red Hat’s open Universal Base Image (UBI).

See Red Hat Universal Base Images for additional information about UBI and support.

See test case platform-alteration-isredhat-release

Impacts and Risks of Non-Compliance: Non-Red Hat base images may lack security updates, enterprise support, and compliance certifications required for production use.

For more information about CRI-O, see the following:

This environment is maintained with the following open source tools:

2.9. CPU manager/pinning

The OpenShift platform can use the Kubernetes CPU Manager to support allocation of cores using kubernetes guaranteed QoS class for applications. Isolcpus is not enabled.

Create a Pod that gets assigned a QoS class of Guaranteed

Important note on using probes: If the CNF is running a DPDK process, do not use exec probes (executing a command within the container) as it may pile up and block the node eventually.

Workload Requirement: If a workload is doing CPU pinning, exec probes may not be used.

See test case networking-dpdk-cpu-pinning-exec-probe

Impacts and Risks of Non-Compliance: Exec probes on CPU-pinned DPDK workloads can cause performance degradation, interrupt real-time operations, and potentially crash applications due to resource contention.

Workloads MUST NOT apply tolerations for NoExecute, PreferNoSchedule, and NoSchedule

2.10. Container host operating system

Red Hat Enterprise Linux CoreOS (RHCOS) is the next generation container operating system. RCHOS is part of OpenShift Container Platform and is used as the operating system for the control plane. it is the default operating system for worker nodes. RHCOS is based on RHEL, has some immutability, leverages the CRI-O runtime, contains container tools, and is updated through the Machine Config Operator (MCO).

The controlled immutability of RHCOS does not support installing RPMs or additional packages in the traditional way. Some 3rd party services or functionalities need to run as agents on nodes of the cluster.

For more information, see About RHCOS.

See test cases platform-alteration-ocp-node-os-lifecycle, platform-alteration-boot-params

Impacts and Risks of Non-Compliance: Incompatible node operating systems can cause stability issues, security vulnerabilities, and lack of vendor support.

2.11. Red Hat Universal Base Images

Red Hat Universal Base Images (UBI) is designed to be a foundation for containerized cloud-native and web application use cases. You can build a containerized application by using UBI, push it to your choice of registry server, and easily share it with others. UBI is freely redistributable, even to non-Red Hat platforms. No subscription is required. Since it’s built on Red Hat Enterprise Linux, UBI has the same industry leading reliability, security and performance benefits.

2.11.1. Base Images

A set of three base images (Minimal, Standard, and Multi-service) are provided to provide optimum starting points for a variety of use cases.

See test case platform-alteration-isredhat-release

Impacts and Risks of Non-Compliance: Non-Red Hat base images may lack security updates, enterprise support, and compliance certifications required for production use.

2.11.1.1. Runtime Languages

A set of language runtime images (PHP, Perl, Python, Ruby, Node.js) enable developers to start coding out of the gate with the confidence that a Red Hat built container image provides.

2.11.1.2. Complementary packages

A set of associated YUM repositories/channels include RPM packages and updates that allow users to add application dependencies and rebuild UBI container images anytime they want.

Red Hat UBI images are the preferred images to build workload applications with as they leverage the fully supported Red Hat ecosystem. In addition, once a workload application is standardized on a Red Hat UBI, the image can become Red Hat certified.

Red Hat UBI images are free to vendors so there is a low barrier of entry to getting started. It is possible to utilize other base images to build containers that can be run on the OpenShift platform. See Red Hat Software Certification Workflow Guide and Red Hat OpenShift Software Certification Policy Guide for a view of the ease of support for containers utilizing various base images and differing levels of certification and supportability.

2.12. Pod security

SELinux should always be enabled within the OpenShift Container Platform and will be used to enforce syscalls that containers make. In addition, Kubernetes has another native function called pod security policies.

See test case platform-alteration-is-selinux-enforcing

Impacts and Risks of Non-Compliance: Non-enforcing SELinux reduces security isolation and can allow privilege escalation attacks and unauthorized resource access.

2.13. CI/CD framework

Applications should target a CI/CD approach for deployment and validation.

2.14. Kubernetes API versions

Review the Kubernetes and OpenShift API documentation:

Workload requirement

All workloads must verify that they are compliant with the correct release of REST API for Kubernetes and OpenShift. Please refer to the online documentation for deprecated APIs.

See test case platform-alteration-ocp-lifecycle

Impacts and Risks of Non-Compliance: End-of-life OpenShift versions lack security updates and support, creating significant security and operational risks.

2.15. OVN-Kubernetes CNI

OVN is Red Hat’s CNI for pod networking. It is a Geneve based overlay that requires L3 reachability between the host nodes. This L3 reachability can be over L2 or a pre-existing overlay network. Openshift’s OVN forwarding is based on flow rules and implemented with nftables on the host OS CNI pod.

For more information, see About the OVN-Kubernetes network plugin.

2.16. User plane functions

Develop user plane functions that meet the following requirements.

2.16.1. Node Tuning Operator

Red Hat created the Node Tuning Operator for low latency nodes.

In OpenShift Container Platform version 4.10 and previous versions, the Performance Addon Operator was used to implement automatic tuning to achieve low latency performance. Now this functionality is part of the Node Tuning Operator.

The emergence of Edge computing in the area of Telco plays a key role in reducing latency, congestion, and improving application performance. Many of the deployed applications in the Telco space require low latency and zero packet loss. OpenShift Container Platform provides a Node Tuning Operator to implement automatic tuning to achieve low latency performance for applications. The Node Tuning Operator is a meta-operator that leverages MachineConfig, Tuned and KubeletConfig resources, Topology Manager, and CPU Manager, to optimize the nodes.

2.16.2. Huge pages

In OpenShift Container Platform, nodes/hosts must pre-allocate huge pages.

For more information, see Configuring huge pages.

To request hugepages, pods must supply the following within the pod.spec for each container:

resources:
  limits:
    hugepages-2Mi: 100Mi
    memory: "1Gi"
    cpu: "1"
  requests:
    hugepages-2Mi: 100Mi
    memory: "1Gi"
    cpu: "1"

See test cases platform-alteration-hugepages-2m-only, platform-alteration-hugepages-config

Impacts and Risks of Non-Compliance: Using inappropriate hugepage sizes can cause memory allocation failures and reduce overall system performance and stability.

2.16.3. CPU isolation

The Node Tuning Operator manages host CPUs by dividing them into reserved CPUs for cluster and operating system housekeeping duties, and isolated CPUs for workloads. CPUs that are used for low latency workloads are set as isolated.

Device interrupts are load balanced between all isolated and reserved CPUs to avoid CPUs being overloaded, with the exception of CPUs where there is a guaranteed pod running. Guaranteed pod CPUs are prevented from processing device interrupts when the relevant annotations are set for the pod.

Workload requirement

To use isolated CPUs, specific annotations must be defined in the pod specification.

See test case lifecycle-cpu-isolation

Impacts and Risks of Non-Compliance: Improper CPU isolation can cause performance interference between workloads and fail to provide guaranteed compute resources.

2.16.4. Topology Manager and NUMA awareness

Topology Manager collects hints from the CPU Manager, Device Manager, and other Hint Providers to align pod resources, such as CPU, SR-IOV VFs, and other device resources, for all Quality of Service (QoS) classes on the same non-uniform memory access (NUMA) node. This topology information and the configured Topology manager policy determine whether a workload is accepted or rejected on a node.

To align CPU resources with other requested resources in a Pod spec, the CPU Manager must be enabled with the static CPU Manager policy.

The following Topology manager policies are available and dependent on the requirements of the workload can be enabled. For high performance workloads making use of SR-IOV VFs, NUMA awareness follows the NUMA node to which the SR-IOV capable network adapter is connected.

Best-effort policy: For each container in a pod with the best-effort topology management policy, kubelet calls each Hint Provider to discover their resource availability. Using this information, the Topology Manager stores the preferred NUMA Node affinity for that container. If the affinity is not preferred, Topology Manager stores this and admits the pod to the node.
Restricted policy: For each container in a pod with the restricted topology management policy, kubelet calls each Hint Provider to discover their resource availability. Using this information, the Topology Manager stores the preferred NUMA Node affinity for that container. If the affinity is not preferred, Topology Manager rejects this pod from the node, resulting in a pod in a Terminated state with a pod admission failure.
Single NUMA node policy: For each container in a pod with the single-numa-node topology management policy, kubelet calls each Hint Provider to discover their resource availability. Using this information, the Topology Manager determines if a single NUMA Node affinity is possible. If it is, the pod is admitted to the node. If a single NUMA Node affinity is not possible, the Topology Manager rejects the pod from the node. This results in a pod in a Terminated state with a pod admission failure. For more information about the Topology manager, see

Using CPU Manager and Topology Manager.

2.16.5. IPv4 & IPv6

Applications should discover services via DNS by doing an AAAA and A query. If an application gets a AAAA response the application should prefer using the IPv6 address in the AAAA response for application sockets.

In OpenShift Container Platform 4.7+, you can declare ipFamilyPolicy: PreferDualStack which will present an IPv4 and IPv6 address in the service.

Workload recommendation

IPv4 should only be used inside a pod when absolutely necessary.

See test case networking-icmpv4-connectivity

Impacts and Risks of Non-Compliance: Failure indicates potential network isolation issues that could prevent workload components from communicating, leading to service degradation or complete application failure.

Workload recommendation

Services should be created as IPv6 only services wherever possible. If an application requires dual stack it should create a dual stack service.

See test cases networking-dual-stack-service, networking-icmpv6-connectivity

Impacts and Risks of Non-Compliance: Single-stack IPv4 services limit network architecture flexibility and prevent migration to modern dual-stack infrastructures.

For more information, see IPv4/IPv6 dual-stack.

To configure IPv4/IPv6 dual-stack, set dual-stack cluster network assignments:

kube-apiserver:
  --service-cluster-ip-range=<IPv4 CIDR>,<IPv6 CIDR>

2.16.6. IPv6 NAT

Services advertised to other CNFs should utilize VIPs on the SPK that are part of the platform. This creation of external services is done via pushing a configmap with F5 AS3 formatted configuration tocreation of CRDs for SPK into the Kubernetes API.

Creation of IPv6 external services and mapping them to IPv4 internal services within the clusters is possible and happens automatically. The SPK will terminate the IPv6 traffic destined to the pods in the cluster and translate the traffic to an IPv4 Pod destination IP address and source NAT the traffic to an IPv4 address on the load balancer.

In order to reach an IPv6 external service from an IPv4 pod, the service must be configured in Verizon’s external DNS systems to the cluster. Once this is done reachability is achieved by the pod generating a DNS A and/or AAAA query for the IPv6 external service. This query is forwarded by the cluster through the SPK which consumes the request and re-originates a request on behalf of the cluster. When a DNS response comes back with only a AAAA record response, the F5 will reserve an IPv4 address local to it that is reachable from the cluster and return to the pod via the cluster DNS both an A and the external AAAA record. This IPv4 address that is reserved is a 1 to 1 mapping to the IPv6 destination. Because the pods only have an IPv4 address and socket they attempt to reach the IPv4 address that is resident on the SPK, when the load balancer receives traffic destined to that IPv4 address it translates the traffic to the IPv6 destination address that was in the AAAA response and source NATs the pod’s IPv4 address to an IPv6 source NAT address that will allow the traffic to reach the IPv6 destination.

2.16.7. VRFs (aka routing instances)

Virtual routing and forwarding (VRF) provides a way to have separate routing tables on the device enabling multiple L3 routing domains concurrently. This allows for traffic in different VRF to be treated independently of each other.

Generally a Load Balancer is used within the platform for L4 services and sometimes L7 load balancing services. In a multi-tenant environment, Network Functions (NFs) can be deployed within a single namespace. Supporting applications like an OAM platform for multiple NFs from the same vendor should be run in an additional separate namespace. The CNI interface should be used as the default mechanism for accessing VRFs. For traffic inbound to an application this is done through allocation of a VIP on the load balancer via the Kubernetes API on the appropriate VRF. For traffic outbound from an application selection of the VRF is done on the application’s behalf via the Load Balancer and destination routing. Multus will be supported within the platform for additional NICs within containers. However Multus should be used only for those cases that cannot be supported by the load balancer.

The POD and Services networks are unrouted address space, they are only reachable via service VIPs on the load balancers. The POD network will be NATed as traffic egresses the load balancer. Traffic inbound will be destination NATed to Service/Pod IP addresses.

Applications should use Network Policies for firewalling the application. Network Policies should be written with a default deny and only allow ports and protocols on an as needed basis for any pods and services.

See test case networking-network-policy-deny-all

Impacts and Risks of Non-Compliance: Without default deny-all network policies, workloads are exposed to lateral movement attacks and unauthorized network access, compromising security posture and potentially enabling data breaches.

2.16.8. Ports reserved by OpenShift

The following ports are reserved by OpenShift and should not be used by any application. These ports are blocked by iptables on the nodes and traffic will not pass. Port list:

22623
22624

Workload requirement

The following ports are reserved by OpenShift and must not be used by any application: 22623, 22624.

See test case networking-ocp-reserved-ports-usage

Impacts and Risks of Non-Compliance: Using OpenShift-reserved ports can cause critical platform services to fail, potentially destabilizing the entire cluster.

2.16.9. Handling user-plane workloads

A workload which handles user plane traffic or latency-sensitive payloads at line rate falls into this category, such as load balancing, routing, deep packet inspection, and so on. Some of these workloads may also need to process the packets at a lower level.

This kind of workload may need to:

Use SR-IOV interfaces
Fully or partially bypassing the kernel networking stack with userspace networking technologies, like DPDK, F-stack, VPP, OpenFastPath, etc. A userspace networking stack can not only improve the performance but also reduce the need for the CAP_NET_ADMIN and CAP_NET_RAW.

For Mellanox devices, those capabilities are requested if the application needs to configure the device(CAP_NET_ADMIN) and/or allocate raw ethernet queue through kernel drive(CAP_NET_RAW)

As CAP_IPC_LOCK is mandatory for allocating hugepage memory, this capability is granted to DPDK-based applications. Additionally, if the workload is latency-sensitive and needs the determinacy provided by the real-time kernel, the CAP_SYS_NICE is also required.

Here is an example pod manifest of a DPDK application:

apiVersion: v1
kind: Pod
metadata:
  name: dpdk-app
  namespace: <target_namespace>
  annotations:
    k8s.v1.cni.cncf.io/networks: dpdk-network
spec:
  containers:
  - name: testpmd
    image: <DPDK_image>
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]
    volumeMounts:
    - mountPath: /dev/hugepages
      name: hugepage
    resources:
      limits:
      openshift.io/mlxnics: "1"
      memory: "1Gi"
      cpu: "4"
      hugepages-2Mi: "4Gi"
    requests:
      openshift.io/mlxnics: "1"
      memory: "1Gi"
      cpu: "4"
      hugepages-2Mi: "4Gi"
    command: ["sleep", "infinity"]
volumes:
- name: hugepage
  emptyDir:
    medium: HugePages

apiVersion: security.openshift.io/v1
kind: SecurityContextConstraints
metadata:
  name: <workload_name>
users: []
groups: []
priority: null
allowHostDirVolumePlugin: false
allowHostIPC: false
allowHostNetwork: false
allowHostPID: false
allowHostPorts: false
allowPrivilegeEscalation: true
allowPrivilegedContainer: false
allowedCapabilities: [IPC_LOCK, NET_ADMIN, NET_RAW] defaultAddCapabilities: null
requiredDropCapabilities:
- KILL
- MKNOD
- SETUID
- SETGID
fsGroup:
  type: MustRunAs
readOnlyRootFilesystem: false
runAsUser:
  type: MustRunAsRange
seLinuxContext:
  type: MustRunAs
supplementalGroups:
  type: RunAsAny
volumes:
- configMap
- downwardAPI
- emptyDir
- persistentVolumeClaim
- projected
- secret

3. Workload developer guide

This section discusses recommendations and requirements for workload application builders.

3.1. Preface

Cloud-native workload applications are containerized instances of classic physical or virtual applications which have been decomposed into microservices supporting elasticity, lifecycle management, security, logging, and other capabilities in a Cloud-Native format.

3.2. Goal

This document is mainly for the developers of workloads, who need to build high-performance applications in a containerized environment. We have created a guide that any partner can take and follow when developing their workloads so that they can be deployed on the OpenShift Container Platform (OCP) in a secure, efficient and supportable way.

3.3. Non-goal

This is not a guide on how to build workload functionality.

3.4. Refactoring

Workloads should break their software down into the smallest set of microservices as possible. Running monolithic applications inside of a container is not the operating model to be in.

It is hard to move a 1000LB boulder. However, it is easy when that boulder is broken down into many pieces (pebbles). All workloads should break apart each piece of the functions/services/processes into separate containers. These containers will still be within kubernetes pods and all of the functions that perform a single task should be within the same namespace.

There is a quote from Lewis and Fowler that describes this best:

The microservice architectural style is an approach to developing a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API.These services are built around business capabilities and independently deployable by fully automated deployment machinery.

— Lewis and Fowler

3.5. Workload security

In OCP, it is possible to run privileged containers that have all of the root capabilities on a host machine, allowing the ability to access resources which are not accessible in ordinary containers. This, however, increases the security risk to the whole cluster. Containers should only request those privileges they need to run their legitimate functions. No containers will be allowed to run with full privileges without an exception.

The general guidelines are:

Only ask for the necessary privileges and access control settings for your application.
If the function required by your workload can be fulfilled by OCP components, your application should not be requesting escalated privilege to perform this function.
Avoid using any host system resource if possible.
Leveraging read only root filesystem when possible.

Workload requirement

Only ask for the necessary privileges and access control settings for your application

See test case access-control-security-context-non-root-user-id-check

Impacts and Risks of Non-Compliance: Running containers as root increases the blast radius of security vulnerabilities and can lead to full host compromise if containers are breached.

Workload requirement

If the function required by your workload can be fulfilled by OCP components, your application should not be requesting escalated privilege to perform this function.

See test case access-control-security-context-privilege-escalation

Impacts and Risks of Non-Compliance: Allowing privilege escalation can lead to containers gaining root access, compromising the security boundary between containers and hosts.

Workload requirement

Avoid using any host system resource.

See test cases access-control-pod-host-ipc, access-control-pod-host-pid

Impacts and Risks of Non-Compliance: Host IPC access allows containers to communicate with host processes, potentially exposing sensitive information and enabling privilege escalation.

Workload requirement

Do not mount host directories for device access.

See test case access-control-pod-host-path

Impacts and Risks of Non-Compliance: Host path mounts can expose sensitive host files to containers, enable container escape attacks, and compromise host system integrity.

Workload requirement

Do not use host network namespace.

See test case access-control-namespace

Impacts and Risks of Non-Compliance: Using inappropriate namespaces can lead to resource conflicts, security boundary violations, and administrative complexity in multi-tenant environments.

Workload requirement

Workloads may not modify the platform in any way.

See test cases platform-alteration-base-image, platform-alteration-sysctl-config

Impacts and Risks of Non-Compliance: Modified base images can introduce security vulnerabilities, create inconsistent behavior, and violate immutable infrastructure principles.

Application pods must avoid using hostNetwork. Applications may not use the host network, including nodePort for network communication. Any networking needs beyond the functions provided by the pod network and ingress/egress proxy must be serviced via a MULTUS connected interface.

Workload requirement

Applications may not use NodePorts or the hostNetwork.

See test case access-control-service-type

Impacts and Risks of Non-Compliance: NodePort services expose applications directly on host ports, creating security risks and potential port conflicts with host services.

3.5.1. Avoid accessing resource on host

It is not recommended for an application to access following resources on the host.

3.5.2. Avoid mounting host directories as volumes

It is not necessary to mount host /sys/ or host /dev/ directories as a volume in a pod in order to use a network device such as SR-IOV VF. The moving of a network interface into the pod network namespace is done automatically by CNI. Mounting the whole /sys/ or /dev/ directory in the container will overwrite the network device descriptor inside the container which causes device not found or no such file or directory error.

Network interface statistics can be queried inside the container using the same /sys/ path as was done when running directly on the host. When running network interfaces in containers, relevant /sys/ statistics interfaces are available inside the container, such as /sys/class/net/net1/statistics/, /proc/net/tcp and /proc/net/tcp6.

For running DPDK applications with SR-IOV VF, device specs (in case of vfio-pci) are automatically attached to the container via the Device Plugin. There is no need to mount the /dev/ directory as a volume in the container as the application can find device specs under /dev/vfio/ in the container.

3.6. Linux capabilities

Linux Capabilities allow you to break apart the power of root into smaller groups of privileges. The Linux capabilities(7) man page provides a detailed description of how capabilities management is performed in Linux. In brief, the Linux kernel associates various capability sets with threads and files. The thread’s Effective capability set determines the current privileges of a thread.

When a thread executes a binary program the kernel updates the various thread capability sets according to a set of rules that take into account the UID of thread before and after the exec system call and the file capabilities of the program being executed. Refer to the blog series in 10# for more details about []Linux capabilities and some examples. For Red Hat specific review of capabilities please refer to the link:Linux Capabilities in OpenShift blog.# An additional reference is link:Docker Run Reference.[]

Users may choose to specify the required permissions for their running application in the Security Context of the pod specification. In OCP, administrators can use the Security Context Constraint (SCC) admission controller plugin to control the permissions allowed for pods deployed to the cluster. If the pod requests permissions that are not allowed by the SCCs available to that pod, the pod will not be admitted to the cluster.

The following runtime and SCC attributes control the capabilities that will be granted to a new container:

The values in the SCC for allowedCapabilities, defaultAddCapabilities and requiredDropCapabilities
allowPrivilegeEscalation: controls whether a container can acquire extra privileges through setuid binaries or the file capabilities of binaries

The capabilities associated with a new container are determined as follows:

If the container has the UID 0 (root) its Effective capability set is determined according to the capability attributes requested by the pod or container security context and allowed by the SCC assigned to the pod. In this case, the SCC provides a way to limit the capabilities of a root container.
If the container has a UID non 0 (non root), the new container has an empty Effective capability set (see Kubernetes should configure the ambient capability set). In this case the SCC assigned to the pod controls only the capabilities the container may acquire through the file capabilities of binaries it will execute.

Considering the general recommendation to avoid running root containers, capabilities required by non-root containers are controlled by the pod or container security context and the SCC capability attributes but can only be acquired by properly setting the file capabilities of the container binaries.

Refer to Managing security context constraints for more details on how to define and use the SCC.

3.6.1. DEFAULT capabilities

The default capabilities that are allowed via the restricted SCC are as follows (see default cri-o Linux capabilities)

"CHOWN"
"DAC_OVERRIDE"
"FSETID"
"FOWNER"
"SETPCAP"
"NET_BIND_SERVICE"

The capabilities: "SETGID", "SETUID" &"KILL", have been removed from the default OpenShift capabilities.

3.6.2. IPC_LOCK

IPC_LOCK capability is required if any of these functions are used in an application:

mlock()
mlockall()
shmctl()
mmap()

Even though mlock() is not necessary on systems where page swap is disabled (for example on OpenShift), it may still be required as it is a function that is built into DPDK libraries, and DPDK based applications may indirectly call it by calling other functions.

See test case access-control-ipc-lock-capability-check

Impacts and Risks of Non-Compliance: IPC_LOCK capability can be exploited to lock system memory, potentially causing denial of service and affecting other workloads on the same node.

3.6.3. NET_ADMIN

NET_ADMIN capability is required to perform various network related administrative operations inside container such as:

MTU setting
Link state modification
MAC/IP address assignment
IP address flushing
Route insertion/deletion/replacement
Control network driver and hardware settings via ethtool

This doesn’t include:

add/delete a virtual interface inside a container. For example: adding a VLAN interface
Setting VF device properties

All the administrative operations (except ethtool) mentioned above that require the NET_ADMIN capability should already be supported on the host by various CNIs in Openshift.

Workload requirement

Only userplane applications or applications using SR-IOV or Multicast can request NET_ADMIN capability

See test case access-control-net-admin-capability-check

Impacts and Risks of Non-Compliance: NET_ADMIN capability allows network configuration changes that can compromise cluster networking, enable privilege escalation, and bypass network security controls.

3.6.4. Avoid SYS_ADMIN

This capability is very powerful and overloaded. It allows the application to perform a range of system administration operations to the host. So you should avoid requiring this capability in your application.

Workload requirement

Applications MUST NOT use the SYS_ADMIN Linux capability

See test case access-control-sys-admin-capability-check

Impacts and Risks of Non-Compliance: SYS_ADMIN capability provides extensive privileges that can compromise container isolation, enable host system access, and create serious security vulnerabilities.

3.6.5. SYS_NICE

In the case that a workload is running on a node and is using DPDK, SYS_NICE will be used to allow DPDK application to switch to SCHED_FIFO.

See test case access-control-sys-nice-realtime-capability

Impacts and Risks of Non-Compliance: Missing SYS_NICE capability on real-time nodes prevents applications from setting appropriate scheduling priorities, causing performance degradation.

3.6.6. SYS_PTRACE

This capability is required when using Process Namespace Sharing. This is used when processes from one Container need to be exposed to another Container. For example, to send signals like SIGHUP from a process in a Container to another process in another Container. See Share Process Namespace between Containers in a Pod for more details. For more information on these capabilities refer to Linux Capabilities in OpenShift.

See test case access-control-sys-ptrace-capability

Impacts and Risks of Non-Compliance: Missing SYS_PTRACE capability when using shared process namespaces prevents inter-container process communication, breaking application functionality.

3.7. Operations that shall be executed by OpenShift

The application should not require NET_ADMIN capability to perform the following administrative operations:

3.7.1. Setting the MTU

Configure the MTU for the cluster network, also known as the OVN or Openshift-SDN network, by modifying the manifests generated by openshift-installer before deploying the cluster. See Changing the MTU for the cluster network for more information.
Configure additional networks managed by the Cluster Network Operator by using NetworkAttachmentDefinition resources generated by the Cluster Network Operator. See Using high performance multicast for more information.
Configure SR-IOV interfaces by using the SR-IOV Network Operator, see Configuring an SR-IOV network device for more information.

3.7.2. Modifying link state

All the links should be set up before attaching it to a pod.

3.7.3. Assigning IP/MAC addresses

For all the networks, the IP/MAC address should be assigned to the interface during pod creation.
MULTUS also allows users to override the IP/MAC address. Refer to Attaching a pod to an additional network for more information.

3.7.4. Manipulating pod route tables

By default, the default route of the pod will point to the cluster network, with or without the additional networks. MULTUS also allows users to override the default route of the pod. Refer to Attaching a pod to an additional network for more information.
Non-default routes can be added to pod routing tables by various IPAM CNI plugins during pod creation.

3.7.5. Setting SR/IOV VFs

The SR-IOV Network Operator also supports configuring the following parameters for SR-IOV VFs. Refer to Configuring an SR-IOV Ethernet network attachment for more information.

vlan
linkState
maxTxRate
minRxRate
vlanQoS
spoofChk
trust

3.7.6. Configuring multicast

In OpenShift, multicast is supported for both the default interface (OVN or OpenShift-SDN) and the additional interfaces such as macvlan, SR-IOV, etc. Multicast is disabled by default. To enable it, refer to the following procedures:

Enabling multicast for a project
Configuring an SR-IOV interface for multicast
If your application works as a multicast source and you want to utilize the additional interfaces to carry the multicast traffic, then you don’t need the NET_ADMIN capability. Follow the instructions in Using high performance multicast to set the correct multicast route in the pod’s routing table.

NOTE

OpenShift SDN CNI is deprecated as of OpenShift Container Platform 4.14. As of OpenShift Container Platform 4.15, the network plugin is not an option for new installations. In a subsequent future release, the OpenShift SDN network plugin is planned to be removed and no longer supported. Red Hat will provide bug fixes and support for this feature until it is removed, but this feature will no longer receive enhancements. As an alternative to OpenShift SDN CNI, you can use OVN Kubernetes CNI instead. For more information, see OpenShift SDN CNI removal in OCP 4.17.

3.8. Operations that can not be executed by OpenShift

All the CNI plugins are only invoked during pod creation and deletion. If your workload needs perform any operations mentioned above at runtime, the NET_ADMIN capability is required.

There are some other functionalities that are not currently supported by any of the OpenShift components which also require NET_ADMIN capability:

Link state modification at runtime
IP/MAC modification at runtime
Manipulate pod’s route table or firewall rules at runtime
SR/IOV VF setting at runtime
Netlink configuration
For example, ethtool can be used to configure things like rxvlan, txvlan, gso, tso, etc.

Multicast

If your application works as a receiving member of IGMP groups, you need to specify the NET_ADMIN capability in the pod manifest. So that the app is allowed to assign multicast addresses to the pod interface and join an IGMP group.

Set SO_PRIORITY to a socket to manipulate the 802.1p priority in ethernet frames
Set IP_TOS to a socket to manipulate the DSCP value of IP packets

3.9. Analyzing your application

To find out which capabilities the application needs, Red Hat has developed a SystemTap script (container_check.stp). With this tool, the workload developer can find out what capabilities an application requires in order to run in a container. It also shows the syscalls which were invoked. Find more info at Capabilities and Seccomp Profiles on Kubernetes

Another tool is capable which is part of the BCC tools. It can be installed on RHEL8 with dnf install bcc.

3.10. Finding the capabilities that an application needs

Here is an example of how to find out the capabilities that an application needs. testpmd is a DPDK based layer-2 forwarding application. It needs the CAP_IPC_LOCK to allocate the hugepage memory.

Use container_check.stp. We can see CAP_IPC_LOCK and CAP_SYS_RAWIO are requested by testpmd and the relevant syscalls.

$ $ /usr/share/systemtap/examples/profiling/container_check.stp -c 'testpmd -l 1-2 -w 0000:00:09.0 -- -a --portmask=0x8 --nb-cores=1'

Example output

[...]
capabilities used by executables
    executable:   prob capability
    testpmd:      cap_ipc_lock
    testpmd:      cap_sys_rawio

capabilities used by syscalls
    executable,   syscall ( capability )    : count
    testpmd,      mlockall ( cap_ipc_lock ) : 1
    testpmd,      mmap ( cap_ipc_lock )     : 710
    testpmd,      open ( cap_sys_rawio )    : 1
    testpmd,      iopl ( cap_sys_rawio )    : 1

failed syscalls
    executable,          syscall =       errno:   count
    eal-intr-thread,  epoll_wait =       EINTR:       1
    lcore-slave-2,          read =            :       1
    rte_mp_handle,       recvmsg =            :       1
    stapio,                      =       EINTR:       1
    stapio,               execve =      ENOENT:       3
    stapio,        rt_sigsuspend =            :       1
    testpmd,               flock =      EAGAIN:       5
    testpmd,                stat =      ENOENT:      10
    testpmd,               mkdir =      EEXIST:       2
    testpmd,            readlink =      ENOENT:       3
    testpmd,              access =      ENOENT:    1141
    testpmd,              openat =      ENOENT:       1
    testpmd,                open =      ENOENT:      13
    [...]

Use the capable command:
```
$ /usr/share/bcc/tools/capable
```
Start the testpmd application from another terminal, and send some test traffic to it. For example:
```
$ testpmd -l 18-19 -w 0000:01:00.0 -- -a --portmask=0x1 --nb-cores=1
```

Check the output of the capable command. Below, CAP_IPC_LOCK was requested for running testpmd.

[...]
0:41:58 0 3591  testpmd CAP_IPC_LOCK  1
0:41:58 0 3591  testpmd CAP_IPC_LOCK  1
0:41:58 0 3591  testpmd CAP_IPC_LOCK  1
0:41:58 0 3591  testpmd CAP_IPC_LOCK  1
0:41:58 0 3591  testpmd CAP_IPC_LOCK  1
0:41:58 0 3591  testpmd CAP_IPC_LOCK  1
0:41:58 0 3591  testpmd CAP_IPC_LOCK  1
0:41:58 0 3591  testpmd CAP_IPC_LOCK  1
0:41:58 0 3591  testpmd CAP_IPC_LOCK  1
0:41:58 0 3591  testpmd CAP_IPC_LOCK  1
0:41:58 0 3591  testpmd CAP_IPC_LOCK  1
0:41:58 0 3591  testpmd CAP_IPC_LOCK  1
0:41:58 0 3591  testpmd CAP_IPC_LOCK  1
[...]

Also, try to run testpmd without CAP_IPC_LOCK set with capsh. Now we can see that the hugepage memory cannot be allocated.

$ capsh --drop=cap_ipc_lock -- -c testpmd -l 18-19 -w 0000:01:00.0 -- -a --portmask=0x1 --nb-cores=1

+ .Example output

EAL: Detected 24 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: No free hugepages reported in hugepages-1048576kB
EAL: Probing VFIO support...
EAL: VFIO support initialized
EAL: PCI device 0000:01:00.0 on NUMA socket 0
EAL: probe driver: 8086:10fb net_ixgbe
EAL: using IOMMU type 1 (Type 1)
EAL: Ignore mapping IO port bar(2)
EAL: PCI device 0000:01:00.1 on NUMA socket 0
EAL: probe driver: 8086:10fb net_ixgbe
EAL: PCI device 0000:07:00.0 on NUMA socket 0
EAL: probe driver: 8086:1521 net_e1000_igb
EAL: PCI device 0000:07:00.1 on NUMA socket 0
EAL: probe driver: 8086:1521 net_e1000_igb
EAL: cannot set up DMA remapping, error 12 (Cannot allocate memory) testpmd: mlockall() failed with error "Cannot allocate memory" testpmd: create a new mbuf pool <mbuf_pool_socket_0>: n=331456, size=2176, socket=0
testpmd: preferred mempool ops selected: ring_mp_mc
EAL: cannot set up DMA remapping, error 12 (Cannot allocate memory) testpmd: create a new mbuf pool <mbuf_pool_socket_1>: n=331456, size=2176,
socket=1
testpmd: preferred mempool ops selected: ring_mp_mc
EAL: cannot set up DMA remapping, error 12 (Cannot allocate memory) EAL: cannot set up DMA remapping, error 12 (Cannot allocate memory)

3.11. Securing workload networks

Workloads must have the least permissions possible and must implement Network Policies that drop all traffic by default and permit only the relevant ports and protocols to the narrowest ranges of addresses possible.

Workload requirement

Applications must define network policies that permit only the minimum network access the application needs to function.

See test case networking-network-policy-deny-all

3.11.1. Managing secrets

Secrets objects in OpenShift provide a way to hold sensitive information such as passwords, config files and credentials. There are 4 types of secrets; service account, basic auth, ssh auth and TLS. Secrets can be added via deployment configurations or consumed by pods directly. For more information on secrets and examples, see the following documentation.

Providing sensitive data to pods

3.11.2. Setting SCC permissions for applications

Permissions to use an SCC is done by adding a cluster role that has uses permissions for the SCC and then rolebindings for the users within a namespace to that role for users that need that SCC. Application admins can create their own role/rolebindings to assign permissions to a Service Account.

3.12. Cloud-native function expectations and permissions

Cloud-native applications are developed as loosely-coupled well-behaved manageable microservices running in containers managed by a container orchestration engine such as kubernetes.

3.12.1. Cloud-native design best practices

The following best practices highlight some key principles of cloud-native application design.

Single purpose w/messaging interface: A container should address a single purpose with a well-defined (typically RESTful API) messaging interface. The motivation here is that such a container image is more reusable and more replaceable/upgradeable.
High observability: A container must provide APIs for the platform to observe the container health and act accordingly. These APIs include health checks (liveness and readiness), logging to stderr and stdout for log aggregation (by tools such as Logstash or Filebeat), and integrate with tracing and metrics-gathering libraries (such as Prometheus or Metricbeat).
Lifecycle conformance: A container must receive important events from the platform and conform/react to these events properly. For example, a container should catch SIGTERM or SIGKILL from the platform and shut down as quickly as possible. Other typically important events from the platform are PostStart to initialize before servicing requests and PreStop to release resources cleanly before shutting down.

See test case lifecycle-container-poststart

Impacts and Risks of Non-Compliance: Missing PostStart hooks can cause containers to start serving traffic before proper initialization, leading to application errors.

See test case lifecycle-container-prestop

Impacts and Risks of Non-Compliance: Missing PreStop hooks can cause ungraceful shutdowns, data loss, and connection drops during container termination.

Image immutability: Container images are meant to be immutable; i.e. customized images for different environments should typically not be built. Instead, an external means for storing and retrieving configurations that vary across environments for the container should be used. Additionally, the container image should NOT dynamically install additional packages at runtime.
Process disposability: Containers should be as ephemeral as possible and ready to be replaced by another container instance at any point in time. There are many reasons to replace a container, such as failing a health check, scaling down the application, migrating the containers to a different host, platform resource starvation, or another issue.

This means that containerized applications must keep their state externalized or distributed and redundant. To store files or block level data, persistent volume claims should be used. For information such as user sessions, use of an external, low-latency, key-value store such as redis should be used. Process disposability also requires that the application should be quick in starting up and shutting down, and even be ready for a sudden, complete hardware failure.

Another helpful practice in implementing this principle is to create small containers. Containers in cloud-native environments may be automatically scheduled and started on different hosts. Having smaller containers leads to quicker start-up times because before being restarted, containers need to be physically copied to the host system.

A corollary of this practice is to "retry instead of crashing", for example, When one service in your application depends on another service, it should not crash when the other service is unreachable. For example, your API service is starting up and detects the database is unreachable. Instead of failing and refusing to start, you design it to retry the connection. While the database connection is down the API can respond with a 503 status code, telling the clients that the service is currently unavailable. This practice should already be followed by applications, but if you are working in a containerized environment where instances are disposable, then the need for it becomes more obvious.

Also related to this, by default containers are launched with shared images using COW filesystems which only exist as long as the container exists. Mounting Persistent Volume Claims enables a container to have persistent physical storage. Clearly defining the abstraction for what storage is persisted promotes the idea that instances are disposable.

Workload requirement

Application design should conform to cloud-native design principles to the maximum extent possible.

3.12.1.1. High-level workload expectations

Workloads shall be built to be cloud-native
Containers MUST NOT run as root (uid=0). See test case access-control-security-context-non-root-user-id-check

Impacts and Risks of Non-Compliance: Running containers as root increases the blast radius of security vulnerabilities and can lead to full host compromise if containers are breached.

Containers MUST run with the minimal set of permissions required. Avoid Privileged Pods. See test case access-control-security-context-privilege-escalation

Impacts and Risks of Non-Compliance: Allowing privilege escalation can lead to containers gaining root access, compromising the security boundary between containers and hosts.

Use the main CNI for all traffic - MULTUS/SRIOV/MacVLAN are for corner cases only (extreme throughput requirements, protocols that are unable to be load balanced)
Workloads should employ N+k redundancy models
Workloads MUST define their pod affinity/anti-affinity rules. See test cases lifecycle-affinity-required-pods, lifecycle-pod-high-availability

Impacts and Risks of Non-Compliance: Missing affinity rules can cause incorrect pod placement, leading to performance issues and failure to meet co-location requirements.

All secondary network interfaces employed by workloads with the use of MULTUS MUST support Dual-Stack IPv4/IPv6.

For platform using IPv4 addressing for CNI interfaces, and a NAT46/64 implementation in Services Proxy/Load Balancer for Ingress & Egress traffic. Workloads shall support this requirement.

Instantiation of a workload (via Helm chart or Operators or otherwise) shall result in a fully-functional workload ready to serve traffic, without requiring any post-instantiation configuration of system parameters
Workloads shall implement service resilience at the application layer and not rely on individual compute availability/stability
Workloads shall decouple application configuration from Pods, to allow dynamic configuration updates
Workloads shall support elasticity with dynamic scale up/down using kubernetes-native constructs such as ReplicaSets, etc. See test cases lifecycle-crd-scaling, lifecycle-statefulset-scaling, lifecycle-deployment-scaling

Impacts and Risks of Non-Compliance: CRD scaling failures can prevent operator-managed applications from scaling properly, limiting application availability and performance.

Workloads shall support canary upgrades
Workloads shall self-recover from common failures like pod failure, host failure, and network failure. Kubernetes native mechanisms such as health-checks (Liveness, Readiness and Startup Probes) shall be employed at a minimum. See test cases lifecycle-liveness-probe, lifecycle-readiness-probe, lifecycle-startup-probe

Impacts and Risks of Non-Compliance: Missing liveness probes prevent Kubernetes from detecting and recovering from application deadlocks and hangs.

Workload requirement

Containers must not run as root

See test case access-control-security-context-non-root-user-id-check

Impacts and Risks of Non-Compliance: Running containers as root increases the blast radius of security vulnerabilities and can lead to full host compromise if containers are breached.

Workload requirement

All secondary interfaces (MULTUS) must support dual stack

See test cases networking-icmpv4-connectivity-multus, networking-icmpv6-connectivity-multus

Impacts and Risks of Non-Compliance: Multus network connectivity issues can isolate workloads from secondary networks, breaking multi-network applications and reducing network redundancy.

Workload requirement

Workloads shall not use node selectors nor taints/tolerations to assign pod location

See test cases lifecycle-pod-scheduling, platform-alteration-tainted-node-kernel

Impacts and Risks of Non-Compliance: Node selectors can create scheduling constraints that reduce cluster flexibility and cause deployment failures when nodes are unavailable.

3.12.1.2. Pod permissions

By default, pods should not expect to be permitted to run as root. Pod restrictions are enforced by SCC within the OpenShift platform. See Managing security context constraints.

Pods will execute on worker nodes, by default being admitted to the cluster with the "restricted" SCC.

The "restricted" SCC:

Ensures that no containers within the pod can run with the allowPrivilegedContainer flag set.
Ensures that pods cannot mount host directory volumes.

See test case access-control-pod-host-path

Impacts and Risks of Non-Compliance: Host path mounts can expose sensitive host files to containers, enable container escape attacks, and compromise host system integrity.

Requires that a pod run as a user in a pre-allocated range of UIDs from the namespace annotation.
Requires that a pod run with a pre-allocated MCS label from the namespace annotation.
Allows pods to use any supplemental group.

Any pods requiring elevated privileges must document the required capabilities driven by application syscalls and a process to validate the requirements must occur.

3.12.1.3. Logging

Log aggregation and analysis

Containers are expected to write logs to stdout. It is highly recommended that stdout/stderr leverage some standard logging format for output.
Logs CAN be parsed to a limited extent so that specific vendor logs can be sent back to the workload if required.
Workloads requiring log parsing must leverage some standard logging library or format for all stdout/stderr. Examples of standard logging libraries include; klog, rfc5424, and oslo.

See test case observability-container-logging

Impacts and Risks of Non-Compliance: Improper logging configuration prevents log aggregation and monitoring, making troubleshooting and debugging difficult.

3.12.1.4. Monitoring

Workloads are expected to bring their own metrics collection functions (e.g. Prometheus) for their application specific metrics. This metrics collector will not be expected to nor able to poll platform level metric data.

3.12.1.5. CPU allocation

It is important to note that when the OpenShift scheduler is placing pods, it first reviews the Pod CPU request and schedules it if there is a node that meets the requirements. It will then impose the CPU "Limits" to ensure the Pod doesn’t consume more than the intended allocation. The limit can never be lower than the request.

NUMA Configuration: OpenShift provides a topology manager which leverages the CPU manager and Device manager to help associate processes to CPUs. Topology manager handles NUMA affinity. This feature is available as of OpenShift 4.6. For some examples on how to leverage the topology manager and creating workloads that work in real time, see Scheduling NUMA-aware workloads and Low latency tuning.

3.12.1.6. Memory allocation

Regarding memory allocation, there are a couple of considerations. How much of the platform is OpenShift itself using, and how much is left over to allocate for the applications running on OpenShift?

Once it has been determined how much memory is left over for the applications, quotas can be applied which specify both the requested amount of memory and limits. In the case of where a memory request has been specified, OpenShift will not schedule the pod unless the amount of memory required to launch it is available. In the case of a limit being specified, OpenShift will not allocate more memory to the application than the limit provides.

When the OpenShift scheduler is placing pods, it reviews the pod memory request and schedules the pod if there is a node that meets the requirements. It then imposes memory limits to ensure the pod doesn’t consume more than the intended allocation. The limit can never be lower than the request.

Workload requirement

Vendors must supply quotas per project/namespace

See test case access-control-namespace-resource-quota

Impacts and Risks of Non-Compliance: Without resource quotas, workloads can consume excessive cluster resources, causing performance issues and potential denial of service for other applications.

3.12.1.7. Pods

Pods are the smallest deployable units of computing that can be created and managed in Kubernetes.

A Pod can contain one or more running containers at a time. Containers running in the same Pod have access to several of the same Linux namespaces. For example, each application has access to the same network namespace, meaning that one running container can communicate with another running container over 127.0.0.1:<port>. The same is true for storage volumes so all containers are in the same Pod have access to the same mount namespace and can mount the same volumes.

3.12.1.7.1. Pod interaction and configuration

Pod configurations should be created in a kubernetes native manner, the most basic example of a kubernetes native manner of configuration deployment is the use of a ConfigMap CR. ConfigMap CRs can be loaded into Kubernetes and pods can consume the data in a configmap by using the data in the ConfigMap to populate container environment variables or can be consumed as volumes in a container and read by an application.

Interaction with a running pod should be done via oc exec or oc rsh commands. This allows API role-based access control (RBAC) to the pods and command line interaction for debugging.

Workload requirement

SSH daemons must NOT be used in Openshift for pod interaction.

See test case access-control-ssh-daemons

Impacts and Risks of Non-Compliance: SSH daemons in containers create additional attack surfaces, violate immutable infrastructure principles, and can be exploited for unauthorized access.

3.12.1.7.2. Pod exit status

The most basic requirement for the lifecycle management of pods in OpenShift is the ability to start and stop correctly. When starting up, health probes like liveness and readiness checks can be put into place to ensure the application is functioning properly.

There are different ways a pod can be stopped in Kubernetes. One way is that the pod can remain alive but non-functional. Another way is that the pod can crash and become non-functional. In the first case, if the administrator has implemented liveness and readiness checks, OpenShift can stop the pod and either restart it on the same node or a different node in the cluster. For the second case, when the application in the pod stops, it should exit with a code and write suitable log entries to help the administrator diagnose what the issue was that caused the problem.

Pods should use terminationMessagePolicy: FallbackToLogsOnError to summarize why they crashed and use stderr to report errors on crash

See test case observability-termination-policy

Impacts and Risks of Non-Compliance: Incorrect termination message policies can prevent proper error reporting and make failure diagnosis difficult.

Workload requirement

All pods shall have a liveness, readiness and startup probes defined

See test cases lifecycle-liveness-probe, lifecycle-readiness-probe, lifecycle-startup-probe

Impacts and Risks of Non-Compliance: Missing liveness probes prevent Kubernetes from detecting and recovering from application deadlocks and hangs. Missing readiness probes can cause traffic to be routed to non-ready pods, resulting in failed requests and poor user experience. Missing startup probes can cause slow-starting applications to be killed prematurely, preventing successful application startup.

3.12.1.7.3. Graceful termination

There are different reasons that a pod may need to shutdown on an OpenShift cluster. It might be that the node the pod is running on needs to be shut down for maintenance, or the administrator is doing a rolling update of an application to a new version which requires that the old versions are shutdown properly.

When pods are shut down by the platform they are sent a SIGTERM signal which means that the process in the container should start shutting down, closing connections and stopping all activity. If the pod doesn’t shut down within the default 30 seconds then the platform may send a SIGKILL signal which will stop the pod immediately. This method isn’t as clean and the default time between the SIGTERM and SIGKILL messages can be modified based on the requirements of the application.

Pods should exit with zero exit codes when they are gracefully terminated.

Workload requirement

All pods must respond to SIGTERM signal and shutdown gracefully with a zero exit code.

See test case lifecycle-container-prestop

Impacts and Risks of Non-Compliance: Missing PreStop hooks can cause ungraceful shutdowns, data loss, and connection drops during container termination.

3.12.1.7.4. Pod resource profiles

OpenShift has a default scheduler that is responsible for the currently available resources on the platform, placing containers or applications on the platform appropriately. In order for OpenShift to do this correctly, the application developer must create a resource profile for the application. This resource profile contains requirements such as how much memory, CPU, and storage that the application needs. At this point, the scheduler is aware of what nodes in the cluster can satisfy the workload. It places the application on one of those nodes. The scheduler can also place the application pod in a pending state until resources are available.

All pods should have a resource request that is the minimum amount of resources the pod is expected to use at steady state for both memory and CPU.

3.12.1.7.5. Storage: emptyDir

There are several options for volumes and reading and writing files in OpenShift. When the requirement is temporary storage and given the option to write files into directories in containers versus an external filesystems, choose the emptyDir option. This will provide the administrator with the same temporary filesystem - when the pod is stopped the dir is deleted forever. Also, the emptyDir can be backed by whatever medium is backing the node, or it can be set to memory for faster reads and writes.

Using emptyDir with requested local storage limits instead of writing to the container directories also allows enabling readonlyRootFilesystem on the container or pod.

3.12.1.7.6. Liveness readiness and startup probes

As part of the pod lifecycle, the OpenShift platform needs to know what state the pod is in at all times. This can be accomplished with different health checks. There are at least three states that are important to the platform: startup, running, shutdown. Applications can also be running, but not healthy, meaning, the pod is up and the application shows no errors, but it cannot serve any requests.

When an application starts up on OpenShift it may take a while for the application to become ready to accept connections from clients, or perform whatever duty it is intended for.

Two health checks that are required to monitor the status of the applications are liveness and readiness. As mentioned above, the application can be running but not actually able to serve requests. This can be detected with liveness checks. The liveness check will send specific requests to the application that, if satisfied, indicate that the pod is in a healthy state and operating within the required parameters that the administrator has set. A failed liveness check will result in the container being restarted.

There is also a consideration of pod startup. Here the pod may start and take a while for different reasons. Pods can be marked as ready if they pass the readiness check. The readiness check determines that the pod has started properly and is able to answer requests. There are circumstances where both checks are used to monitor the applications in the pods. A failed readiness check results in the container being taken out of the available service endpoints. An example of this being relevant is when the pod was under heavy load, failed the readiness check, gets taken out of the endpoint pool, processes requests, passes the readiness check and is added back to the endpoint pool.

For more information, see Configure Liveness, Readiness and Startup Probes.

See test cases lifecycle-liveness-probe, lifecycle-readiness-probe, lifecycle-startup-probe

If the workload is doing CPU pinning and running a DPDK process do not use exec probes (executing a command within the container); as this can pile up and eventually block the node.

See test case networking-dpdk-cpu-pinning-exec-probe

3.12.1.7.7. Affinity and anti-affinity

In OpenShift Container Platform pod affinity and pod anti-affinity allow you to constrain which nodes your pod are eligible to be scheduled based on the key/value labels on other pods. There are two types of affinity rules, required and preferred. Required rules must be met, whereas preferred rules are best effort.

These pod affinity/anti-affinity rules are set in the pod specification as matchExpressions to a labelSelector. See Placing pods relative to other pods using affinity and anti-affinity rules for more information. The following example Pod CR illustrates pod affinity:

apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: security
              operator: In
              values:
                - S1
        topologyKey: failure-domain.beta.kubernetes.io/zone
  containers:
    - name: with-pod-affinity
      image: docker.io/ocpqe/hello-pod

Workload requirement

Pods that need to be co-located on the same node need affinity rules. Pods that should not be co-located for resiliency purposes require anti-affinity rules.

See test case lifecycle-affinity-required-pods

Impacts and Risks of Non-Compliance: Missing affinity rules can cause incorrect pod placement, leading to performance issues and failure to meet co-location requirements.

Workload requirement

Pods that perform the same microservice and could be disrupted if multiple members of the service are unavailable must implement affinity/anti-affinity group rules or spread the pods across nodes to prevent disruption in the event of node failures, patches, or upgrades.

See test case lifecycle-pod-high-availability

Impacts and Risks of Non-Compliance: Missing anti-affinity rules can cause all pod replicas to be scheduled on the same node, creating single points of failure.

3.12.1.7.8. Upgrade expectations

The Kubernetes API deprecation policy defined in Kubernetes Deprecation Policy shall be followed.
Workloads are expected to maintain service continuity during platform upgrades, and during workload version upgrades
Workloads need to be prepared for nodes to reboot or shut down without notice
Workloads shall configure pod disruption budget appropriately to maintain service continuity during platform upgrades
Applications should not be tied to a specific version of Kubernetes or any of its components

Applications MUST specify a pod disruption budget appropriately to maintain service continuity during platform upgrades. The budget should be defined with a balance such that it allows operational flexibility for the cluster to drain nodes, but restrictive enough so that the service is not degraded over upgrades.

See test case lifecycle-pod-recreation

Impacts and Risks of Non-Compliance: Failed pod recreation indicates poor high availability configuration, leading to potential service outages during node failures.

Workload requirement

Pods that perform the same microservice and that could be disrupted if multiple members of the service are unavailable must implement pod disruption budgets to prevent disruption in the event of patches/upgrades.

See test case observability-pod-disruption-budget

Impacts and Risks of Non-Compliance: Improper disruption budgets can prevent necessary maintenance operations or allow too many pods to be disrupted simultaneously.

3.12.1.7.9. Taints and tolerations

Taints and tolerations allow the node to control which pods are scheduled on the node. A taint allows a node to refuse a pod to be scheduled unless that pod has a matching toleration.

You apply taints to a node through the node specification (NodeSpec) and apply tolerations to a pod through the pod specification (PodSpec). A taint on a node instructs the node to repel all pods that do not tolerate the taint.

Taints and tolerations consist of a key, value, and effect. An operator allows you to leave one of these parameters empty.

See Controlling pod placement using node taints for more information.

See test case platform-alteration-tainted-node-kernel

Impacts and Risks of Non-Compliance: Tainted kernels indicate unauthorized modifications that can introduce instability, security vulnerabilities, and support issues.

3.12.1.7.10. Requests and Limits in Kubernetes

Kubernetes provides mechanisms for defining resource usage per container:

Requests: The guaranteed minimum amount of a resource (e.g., CPU, memory). Used by the scheduler.
Limits: The maximum amount a container is allowed to consume. Enforced by the kubelet.
Quotas: Enforce aggregate resource usage at the namespace/project level to prevent resource overuse.

See: OpenShift Resource Quotas Per Project

Risks with Resource Limits

While limits can prevent runaway resource usage, they also introduce risk when misapplied, especially for CPU and memory.

CPU Limits Cause Throttling

Limits can throttle workloads even if unused CPU is available.
This leads to hangs, timeouts, and degraded performance.
CPU requests (without limits) often provide better fairness and stability.

Memory Limits Cause OOMKills

Limits on memory are strict—when exceeded, containers are killed.
Difficult to predict worst-case memory usage for infrastructure components.
Can result in crash loops, degraded service, and unrecoverable clusters.

Why Limits are a Problem for Cluster Components

Unlike with user workloads, setting resource limits for cluster components presents several challenges and is strongly discouraged:

Inability to Anticipate Scaling: Cluster components cannot predict their usage scaling across all customer environments, making it impossible to set one-size-fits-all limits.
Impeded Responsiveness: Setting static limits prevents administrators from reacting to changes in cluster needs, such as resizing control plane nodes to allocate more resources.
Undesirable Restarts: It is undesirable for cluster components to be restarted due to excessive resource consumption (e.g., OOMKills). Graceful handling without degrading cluster performance is preferred.

Therefore, cluster components SHOULD NOT be configured with resource limits.

However, cluster components MUST declare resource requests for both CPU and memory.

Benefits of Using Requests Without Limits

Guaranteed Minimums and Bursting: Specifying requests without limits ensures components receive their required minimum resources and can burst when needed.
Balancing Efficiency and Performance: When setting resource requests:
- If too low, the component may be starved under load, leading to degraded performance and service.
- If too high, the scheduler may be unable to place the component, leading to crash loops or failed deployments. Excessively high requests can also starve user workloads, particularly in small or single-node clusters.

Resource Requests: Compressible vs Incompressible

Kubernetes treats resources differently depending on their behavior under pressure:

Resource Type	Description	Examples
Compressible	Slower performance but still runs	CPU, network
Incompressible	Fails without required amount	Memory, storage

Resource Type

Description

Examples

Compressible

Slower performance but still runs

CPU, network

Incompressible

Fails without required amount

Memory, storage

Requesting Resources

Compressible (e.g., CPU): Requests should be balanced to ensure proportional system behavior and fairness.
Incompressible (e.g., memory): Requests should reflect minimum safe usage to avoid runtime failure.

See: More details on setting requests for different resource types

Alternatives to Resource Limits

Although limits are generally avoided for cluster components, the following mechanisms can help manage resources and prioritize workloads:

Pod Priority (PriorityClass): Preferred for ensuring essential core workloads have priority and sufficient resources.
- Allows critical components to avoid eviction during resource pressure.
- More effective than limits, as kubelet’s eviction does not consider limits.
Quotas: Can enforce requests and limits at the project level.
- Useful when guaranteed capacity is needed.
- Overcommitment trade-offs can be managed at the project, node, or cluster level.

Overcommitment

Nodes can be overcommitted, influencing how requests and limits are applied.

Overcommitted clusters often experience pod kills and kubelet restarts, degrading the product experience and increasing support costs.
Use quotas to enforce guaranteed capacity in production.
In development, overcommitment may be acceptable where performance trade-offs are tolerable.

See: Configuring your cluster to place pods on overcommitted nodes

See test case access-control-requests-and-limits

3.12.1.7.11. Use imagePullPolicy: IfNotPresent

If there is a situation where the container dies and needs to be restarted, the image pull policy becomes important. There are three image pull policies available: Always, Never and IfNotPresent. It is generally recommended to have a pull policy of IfNotPresent. This means that the if pod needs to restart for any reason, the kubelet will check on the node where the pod is starting and reuse the already downloaded container image if it’s available. OpenShift intentionally does not set AlwaysPullImages as turning on this admission plugin can introduce new kinds of cluster failure modes. Self-hosted infrastructure components are still pods: enabling this feature can result in cases where a loss of contact to an image registry can cause redeployment of an infrastructure or application pod to fail. We use PullIfNotPresent so that a loss of image registry access does not prevent the pod from restarting.

Container images that are protected by registry authentication have a condition whereby a user who is unable to download an image directly can still launch it by leveraging the host’s cached image.

See test case lifecycle-image-pull-policy

Impacts and Risks of Non-Compliance: Incorrect image pull policies can cause deployment failures when image registries are unavailable or during network issues.

3.12.1.7.12. Automount services for pods

Pods which do not require API access should set the value of automountServiceAccountToken to false within the pod spec, for example:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  serviceAccountName: examplesvcacct
  automountServiceAccountToken: false

Pods must include an explicit serviceAccountName in the pod spec. This is required to ensure that the pod is not automatically assigned the default service account.

See test case access-control-pod-automount-service-account-token

Impacts and Risks of Non-Compliance: Auto-mounted service account tokens expose Kubernetes API credentials to application code, creating potential attack vectors if applications are compromised.

3.12.1.7.13. Disruption budgets

When managing the platform there are at least two types of disruptions that can occur. They are voluntary and involuntary. When dealing with voluntary disruptions a pod disruption budget can be set that determines how many replicas of the application must remain running at any given time. For example, consider the case where an administrator is shutting down a node for

maintenance and the node has to be drained. If there is a pod disruption budget set then OpenShift will respect that and ensure that the required number of pods are available by bringing up pods on different nodes before draining the current node.

See test case observability-pod-disruption-budget

Impacts and Risks of Non-Compliance: Improper disruption budgets can prevent necessary maintenance operations or allow too many pods to be disrupted simultaneously.

3.12.1.7.14. No naked pods

Do not use naked Pods (that is, Pods not bound to a ReplicaSet, or StatefulSet deployment). Naked pods will not be rescheduled in the event of a node failure.

See test case lifecycle-pod-owner-type

Impacts and Risks of Non-Compliance: Naked pods and DaemonSets lack proper lifecycle management, making updates, scaling, and recovery operations difficult or impossible.

Workload requirement

Applications must not depend on any single pod being online for their application to function.

Workload requirement

Pods must be deployed as part of a Deployment or StatefulSet.

See test case lifecycle-pod-owner-type

Impacts and Risks of Non-Compliance: Naked pods and DaemonSets lack proper lifecycle management, making updates, scaling, and recovery operations difficult or impossible.

Workload requirement

Pods may not be deployed in a DaemonSet.

See test case lifecycle-pod-owner-type

Impacts and Risks of Non-Compliance: Naked pods and DaemonSets lack proper lifecycle management, making updates, scaling, and recovery operations difficult or impossible.

3.12.1.7.15. Image tagging

An image tag is a label applied to a container image in a repository that distinguishes a specific image from other images. Image tags may be used to categorize images (for example: latest, stable, development) and by versions within the categories. This allows the administrator to be specific when declaring which image to test, or which image to run in production.

See Tagging Images

See test case manageability-containers-image-tag

Impacts and Risks of Non-Compliance: Missing image tags make it difficult to track versions, perform rollbacks, and maintain deployment consistency.

3.12.1.7.16. One process per container

OpenShift organizes workloads into pods. Pods are the smallest unit of a workload that Kubernetes understands. Within pods, one can have one or more containers. Containers are essentially composed of the runtime that is required to launch and run a process.

Each container should run only one process. Different processes should always be split between containers, and where possible also separate into different pods. This can help in a number of ways, such as troubleshooting, upgrades and more efficient scaling.

However, OpenShift does support running multiple containers per pod. This can be useful if parts of the application need to share namespaces like networking and storage resources. Additionally, there are other models like launching init containers, sidecar containers, etc. which may justify running multiple containers in a single pod.

More information about pods can be found in Using pods.

See test case access-control-one-process-per-container

Impacts and Risks of Non-Compliance: Multiple processes per container complicate monitoring, debugging, and security assessment, and can lead to zombie processes and resource leaks.

3.12.1.7.17. init containers

Init containers can be used for running tools / commands / or any other action that needs to be done before the actual pod is started. For example, loading a database schema, or constructing a config file from a definition passed in via configMap or secret.

See Using init containers to perform tasks before a pod is deployed for more information.

3.12.1.8. Security and role-based access control

Roles / RoleBindings: A Role represents a set of permissions within a particular namespace. E.g: A given user can list pods/services within the namespace. The RoleBinding is used for granting the permissions defined in a role to a user or group of users. Applications may create roles and rolebindings within their namespace, however the scope of a role will be limited to the same permissions that the creator has or less.

See test case access-control-pod-role-bindings

Impacts and Risks of Non-Compliance: Cross-namespace role bindings can violate tenant isolation and create unintended privilege escalation paths.

ClusterRole / ClusterRoleBinding: A ClusterRole represents a set of permissions at the cluster level that can be used by multiple namespaces. The ClusterRoleBinding is used for granting the permissions defined in a ClusterRole to a user or group of users at a namespace level. Applications are not permitted to install cluster roles or create cluster role bindings. This is an administrative activity done by cluster administrators. Workloads should not use cluster roles; exceptions can be granted to allow this, however this is discouraged.

See Using RBAC to define and apply permissions for more information.

Workload requirement

Workloads may not create ClusterRole or ClusterRoleBinding CRs. Only cluster administrators should create these CRs.

See test case access-control-cluster-role-bindings

Impacts and Risks of Non-Compliance: Cluster-wide role bindings grant excessive privileges that can be exploited for lateral movement and privilege escalation across the entire cluster.

3.12.1.9. Custom role to access application CRDs

If an application requires installing/deploying CRDs (Custom Resource Definitions), the application must provide a role that allows necessary permissions to create CRs within the CRDs. The custom role to access CRDs must not create any permissions to access any other API resources than the CRDs.

Workload requirement

If an application creates CRDs; it must supply a role to access those CRDs and no other API resources/ permissions.

3.12.1.10. MULTUS

MULTUS is a meta-CNI that allows multiple CNIs that it delegates to. This allows pods to get additional interfaces beyond eth0 via additional CNIs. Having additional CNIs for SR-IOV and MacVLAN interfaces allow for direct routing of traffic to a pod without using the pod network via additional interfaces. This capability is being delivered for use in only corner case scenarios, it is not to be used in general for all applications. Example use cases include bandwidth requirements that necessitate SR-IOV and protocols that are unable to be supported by the load balancer. The OVN based pod network should be used for every interface that can be supported from a technical standpoint.

Workload requirement

Unless an application has a special traffic requirement that is not supported by SPK or ovn-kubernetes CNI the applications must use the pod network for traffic

See Understanding multiple networks for more information.

3.12.1.11. MULTUS SR-IOV / MACVLAN

SR-IOV is a specification that allows a PCIe device to appear to be multiple separate physical PCIe devices. The Performance Addon component allows you to validate SR-IOV by running DPDK, SCTP and device checking tests.

SR-IOV and MACVLAN interfaces are able to be requested for protocols that do not work with the default CNI or for exceptions where a workload has not been able to move functionality onto the CNI. These are exception use cases. MULTUS interfaces will be defined by the platform operations team for the workloads which can then consume them. VLANs will be applied by the SR-IOV VF, thus the VLAN / network that the SR-IOV interface requires must be part of the request for the namespace.

For more information, see About Single Root I/O Virtualization (SR-IOV) hardware networks.

By configuring the SR-IOV network, CRs named NetworkAttachmentDefinitions are exposed by the SR-IOV Operator in the workload namespace.

Different names will be assigned to different Network Attachment Definitions that are namespace specific. MACVLAN versus MULTUS interfaces will be named differently to distinguish the type of device assigned to them (created by configuring SR-IOV devices via the SRIOVNetworkNodePolicy CR).

For SR-IOV based SriovNetworkNodePolicy definitions, the MTU setting is omitted because it can cause conflicts with applications which set their own MTU value. It is required therefore that the application always manage its own MTU value for SR-IOV

Workload requirement

Applications must not use an MTU size greater than 8500 bytes

Workload requirement

Applications using SR-IOV multus Network Attachment Definitions must set their required MTU value for virtual functions (VFs) within their pod.

See test case networking-network-attachment-definition-sriov-mtu

From the workload perspective, a defined set of network attachment definitions will be available in the assigned namespace to serve secondary networks for regular usage or to serve for DPDK payloads.

The SR-IOV devices are configured by the cluster admin, and they will be available in the namespace assigned to the workload. The following command returns the list of secondary networks available in the namespace:

$ oc -n <workload_namespace> get network-attachment-definitions

3.12.1.12. SPK Integration via SPK Operator

The SPK runs in a separate namespace from the workload. The workload will not have direct permissions to access this namespace. In order to allow applications to manage the lifecycle of the SPK deployment, the SPK operator is used. The SPK Operator defines several CRDs for which CRs are created in the CNF namespace:

3.12.1.12.1. SPKSnatpool

The SPKSnatpool is provisioned within the application namespace to configure source network address translations (SNAT) on egress network traffic for the SPK. When internal Pods connect to external resources, their internal cluster IP address is translated to one of the available IP addresses in the SNAT pool. Example:

apiVersion: k8s.f5net.com/v1
kind: F5SPKSnatpool
metadata:
  name: egress-snatpool
  namespace: spk-democnf
spec:
  addressList:
    - 10.183.247.229
    - fdb0:5b22:e86a:1122::22
    - 10.183.247.23
    - fdb0:5b22:e86a:1122::230

More details on the SPK operator are in the "SPK Operator User Guide".

3.12.1.13. SR-IOV interface settings

The following settings must be negotiated with the cluster administrator, for each network type available in the namespace:

The type of netdevice to be used for the VF (kernel or userspace)
The vlan ID to be applied to a given set of VFs available in a namespace
For kernel-space devices, the IP allocation is provided directly by the cluster IP assignment mechanism.
The option to configure the IP of a given SR-IOV interface at runtime, see Adding a pod to an SR-IOV additional network.

SR-IOV settings are enabled by the cluster administrator.

The SRIOVnetwork CR creates the network-attach-definition within the target networkNamespace.

3.12.1.14. Attaching the VF to a pod

Once the right network attachment definition is found, applying the k8s.v1.cni.cncf.io/networks annotation with the name of the network attachment definition to the pod will add the additional network interfaces in the pod namespace, as per the following example:

apiVersion: v1
kind: Pod
metadata:
  name: sample-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: |-
      [
        {
          "name": "net1",
          "mac": "20:04:0f:f1:88:01",
          "ips": ["192.168.10.1/24", "2001::1/64"]
         }
      ]

3.12.1.15. Discovering SR-IOV devices properties from the application

All the properties of the interfaces are added to the pod’s k8s.v1.cni.cncf.io/network-status annotation. The annotation is json-formatted and for each network object contains information such as IPs (where available), MAC address, PCI address. For example:

k8s.v1.cni.cncf.io/network-status: |-
  [{
      "name": "",
      "interface": "eth0",
      "ips": [
        "10.132.3.148"
        ],
      "mac": "0a:58:0a:84:03:94",
      "default": true,
      "dns": {}
   }]

the IP information is not available if the driver specified is vf-io.

The same annotation is available as a file content inside the pod, at the /etc/podnetinfo/annotations path. A convenience library is available to easily consume those pieces of information from the application (bindings in C and Go).

For more information, see About Single Root I/O Virtualization (SR-IOV) hardware networks.

3.12.1.16. NUMA awareness

If the pod is using a guaranteed QoS class and the kubelet is configured with a suitable topology manager policy (restricted, single-numa node) then the VF assigned to the pod will belong to the same NUMA node as the other assigned resources (CPU and other NUMA aware devices). Please note that HugePages are currently not NUMA aware.

See Node Tuning Operator for NUMA awareness and more information about how HugePages are turned on.

3.12.1.17. Platform upgrade

Openshift upgrades happen as follows:

Consider this small example cluster:

master-0
master-1
master-2
worker-10
worker-11
worker-12
worker-13
loadbalancer-14
loadbalancer-15

In the above example cluster, there are three machine config pools: masters, workers, loadbalancers. This is an example cluster configuration, there may be more machine config pools based on functionality, e.g., 10 MCPs if needed.

When the cluster is upgraded, the API server and etcD are updated first. So the master config pool will be done first. Incrementally the cluster will go through and reboot master-0, 1, 2 to bring them to the new kubernetes version. After these are updated it will cycle to the next two machine pools one at a time. Openshift will consult the maxunavilable nodes in the machine config pool spec and reboot only as many as allowed by maxunavailable.

In a cluster as small as the above, maxUnavailable would be set to 1, so OpenShift would reboot loadbalancer-14 and worker-10 simultaneously as they are different machineconfigpools.

Openshift will wait until worker-10 is ready before proceeding onwards to worker-11 and continue. OpenShift will in parallel wait for loadbalancer-14 to become available again before restarting loadbalancer-15.

In clusters larger than the example cluster, the maxUnavailable for the worker pool may be set to a large number to reboot multiple nodes in parallel to speed up deployment of the new version of OpenShift. This number will take into account the work loads on the cluster to make sure sufficient resources are left to maintain application availability.

For an application to stay healthy during this process, if they are stateful at all, they should specify a statefulset or replicaset, kubernetes by default will attempt to schedule the set members across multiple nodes to give additional resiliency. In order to prevent kubernetes from stealing too many nodes out from under an application, an application that has a minimum number of pods that need to be running must specify a pod disruption budget. Pod disruption budgets allow an application to tell kubernetes that it needs N number of pods of said microservice alive at any given time. For example, a small stateful database may need 2 out of three pods available at any given time, so that application should set a pod disruption budget with a minavailable set to a value of 2. This will allow the scheduler to know that it should not take the second pod out of a set of 3 down at any given time during the series of node reboots.

Workload requirement

Applications may not set the pod disruption budget to minUnavailable equal to the number of pods in the deployment/replicaset or maxUnavailable pods to zero, operations will change your pod disruption budget to proceed with an upgrade at the risk of your application.

See test case observability-pod-disruption-budget

Additionally pod disruption budgets must not be set to very low numbers of unavailable; doing so prevents system patching from completing in a timely manner. Statefulsets/Replicasets/Deployments that have a requirement for pod antiaffinity spread to greater than 20% of the nodes in a given cluster must set the pod disruption budget to 20% for the set of pods.

Example

100 node cluster
20% of nodes would be 20 nodes
Any D/S 20 nodes or more would require 20% PDB

20 node deployment would have 4 pod PDB
40 node deployment would have 8 pod PDB

A corollary to the pod disruption budget is a strong readiness and health check. A well implemented readiness check is key for surviving these upgrades in that a pod should not report itself ready to kubernetes until it is actually ready to take over the load from another pod of the example set. An example of this being implemented poorly would be for a pod to report itself ready but it is not in sync with the other DB pods in the example above. Kubernetes could see that three of the pods are "ready" and destroy a second pod and cause disruption to the DB leading to failure of the application served by said DB.

See pod disruption budget reference, pod disruption budget policy & API.

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: db-pod-disruption-budget
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: db

By default, only one machine is allowed to be unavailable when applying the kubelet-related configuration to the available worker nodes. For a large cluster, it can take a long time for the configuration change to be reflected. At any time, you can adjust the number of machines that are updating to speed up the process.

Run:

$ oc edit machineconfigpool worker

Set maxUnavailable to the desired value.

spec:
  maxUnavailable: <node_count>

3.12.1.18. OpenShift virtualization and CNV best practices

OpenShift Virtualization is generally-available for enterprise workloads, such throughput- and latency-insensitive workloads that may be added to the cluster. VNFs and other throughput or latency-sensitive applications can be considered only after careful validation.

OpenShift Virtualization should be installed according to its documentation, and only documented supported features may be used unless an explicit exception has been granted. See About OpenShift Virtualization.

In order to improve overall virtualization performance and reduce CPU latency, critical workloads can take advantage of OpenShift Virtualization’s high-performance features. These can provide the workloads with the following features:

Dedicated resources for virtual machines
Dedicated CPU for QEMU emulators
Dedicated CPU resources so as to not affect the CPU latency for workloads.

Similar to OpenStack, OpenShift Virtualization supports the device role tagging mechanism for the network interfaces (same format as it is in OSP). Users will be able to tag Network interfaces in the API and identify them in device metadata provided to the guest OS via the config drive.

3.12.1.18.1. VM image import recommendations (CDI)

OpenShift Virtualization VMs store their persistent disks on kubernetes Persistent Volumes (PV). PVs are requested by VMs using kubernetes Persistent Volume Claims (PVC). VMs may require a combination of blank and pre-populated disks in order to function.

Blank disks can be initialized automatically by kubevirt when an empty PV is initially encountered by a starting VM. Other disks must be populated prior to starting the VM. OpenShift Virtualization provides a component called the Containerized Data Importer (CDI) which automates the preparation of pre-populated persistent disks for VMs. CDI integrates with KubeVirt to synchronize VM creation and deletion with disk preparation by using a custom resource called a DataVolume. Using DataVolumes, data can be imported into a PV from various sources including container registries and HTTP servers.

The following recommendations should be followed when managing persistent disks for VMs:

Blank disks: Create a PVC and associate it with the VM using a persistentVolumeClaim volume type in the volumes section of the VirtualMachine spec.
Populated disks: In the VirtualMachine spec, add a DataVolume to the dataVolumeTemplates section and always use the dataVolume volume type in the volumes section.

3.12.1.18.2. Working with large VM disk images

In contrast to container images, VM disk images can be quite large (30GiB or more is common). It is important to consider the costs of transferring large amounts of data when planning workflows involving the creation of VMs (especially when scaling up the number of VMs). The efficiency of an image import depends on the format of the file and also the transfer method used. The most efficient workflow, for two reasons, is to host a gzip-compressed raw image on a server and import via HTTP. Compression avoids transferring zeros present in the free space of the image, and CDI can stream the contents directly into the target PV without any intermediate conversion steps. In contrast, images imported from a container registry must be transferred, unarchived, and converted prior to being usable. These additional steps increase the amount of data transferred between a node and the remote storage.

3.12.1.19. Operator best practices

OLM Packaged operators contain an index of all the images required to install the operator, and the ClusterServiceVersion which instructs OpenShift to create resources as described in the cluster service version. The cluster service version is a list of the required resources that need to be created in the cluster, i.e. service accounts, crds, roles, etc that are necessary for the operator and software that the operator installs to be successful within the cluster.

The OLM Packaged operator will then run in openshift-operators namespace within the cluster. Users can then utilize this operator by creating CRs within the CRDs that were created by the operator OLM package, to deploy the software managed by the operator. The platform administrator handles the OLM based operator installation for the users by creating a custom catalog in the cluster that is targeted by the application. The users then express via CRs that are consumed by the operator what they would like the operator to create in the users namespace.

3.12.1.19.1. Workload Operator requirements

Workload requirement

Operators should be certified against the openshift version of the cluster they will be deployed on.

See test case affiliated-certification-operator-is-certified

Impacts and Risks of Non-Compliance: Uncertified operators may have security flaws, compatibility issues, and lack enterprise support, creating operational risks.

Workload requirement

Operators must be compatible with our version of openshift

See test case platform-alteration-ocp-lifecycle

Impacts and Risks of Non-Compliance: End-of-life OpenShift versions lack security updates and support, creating significant security and operational risks.

Workload requirement

Operators must be in OLM bundle format (Operator Framework).

See test case operator-install-source

Impacts and Risks of Non-Compliance: Non-OLM operators bypass lifecycle management and dependency resolution, creating operational complexity and update issues.

Workload requirement

Must be able to function without the use of openshift routes or ingress objects.

Workload requirement

All custom resources for operators require podspecs for both pod image override as well pod quotas.

Workload requirement

Operators must not use daemonsets

See test case lifecycle-pod-owner-type

Impacts and Risks of Non-Compliance: Naked pods and DaemonSets lack proper lifecycle management, making updates, scaling, and recovery operations difficult or impossible.

Workload requirement

The OLM operator CSV must support the “all namespaces” install method if the operator is upstream software "global operator". If the operator is a proprietary cnf operator "vendor operator" it must support single namespaced installation. It is recommended for an operator to support all OLM install modes to ensure flexibility in our environment.

Workload requirement

The operator must default to watch all namespaces if the target namespace is left NULL or empty string as this is how the OLM global-operators operator group functions.

Workload requirement

Multiple versions of the same operator cannot exist on a single cluster.

Workload requirement

All Operator and operand images must be referenced using digest image tags "@sha256". Openshift "imagecontentsourcepolicy" objects (ICSP) only support mirror-by-digest at this time.

For requesting Global operators (upstream 3rd party shared operators), the operators must come from one of the redhat provided operator catalogs * redhat-operator * certified-operator * community-operator

Workload requirement

Operators that are proprietary to a workload application must ensure that their CRD’s are unique, and will not conflict with other operators in the cluster.

See test case observability-crd-status

Impacts and Risks of Non-Compliance: Missing status subresources prevent proper monitoring and automation based on custom resource states.

Workload requirement

If a workload application requires a specific version of a third party non-proprietary operator for their app to function they will need to re-package the upstream third party operator and modify the APIs so that it will not conflict with the globally installed operator version.

Workload requirement

Successful operator installation and runtime must be validated in pre-deployment lab environments before being allowed to be deployed to production.

See test case operator-install-status-succeeded

Impacts and Risks of Non-Compliance: Failed operator installations can leave applications in incomplete states, causing functionality gaps and operational issues.

Workload requirement

All required RBAC must be included in the OLM operator bundle so that it’s managed by OLM.

Workload requirement

It is not recommended for a workload application to share a proprietary operator with another workload application if that application does not share the same version lifecycle. If a workload application does share an operator the CRDs must be backwards compatible.

Applications providing OLM catalogs to bring their operators into a platform environment must ensure that their catalog has less than 1,000 images per application team, however applications should target a much lower number than this (preferably under 150-200 images)

Workload requirement

Applications providing OLM catalogs to bring their operators into a platform environment must ensure that their catalog uses a FQDN name specific to their brand for their docker registry and provide a pull secret so that the images from their hosted registry can be accessed.

Workload recommendation

It is recommended applications limit the scope of their OLM catalog to only operator packages needed for lifecycle upgrades, and the latest version installs following operator framework best practices by utilizing olm.skipRange to keep operator catalogs as small as possible.

Workload requirement

When an operator subscription is created into a namespace, the operator’s install plan must install all resources to that specific operator namespace. The only exception here are cluster scoped objects such as CRD’s.

Workload requirement

Operators must install to our environment using a subscription object only. Pre-configuration or custom configuration using config maps should not be required to install your operator to our environment.

Workload requirement

Operators are not permitted to use huge pages.

Workload requirement

The size of operator catalog and all images required by operator catalog cannot exceed 200GB of storage for an application catalog.

Workload requirement

Application vendors will need to increment their operator version to supply any hotfix or upgrade. For example If your existing operator version is 1.6.9-0, and you need to supply a hotfix, then your new OLM package should be 1.6.9-1 and the new operator CSV should have a replace tag for the existing operator CSV.

Workload requirement

Global operators are versioned to a release of the platform. These versions only change when there is a new platform release.

VCP CNF requirement

Operators are not permitted to be installed into a tenant app namespace. Operators that are installed with the "Single Namespace" olm install mode must be installed into the tenants dedicated operator namespace. Upstream 3rd party operators will be installed Globally with "All Namespaces" olm install mode into the openshift-operators namespace.

3.13. Requirements for a workload

The application MUST declare all listening ports as containerPorts in the Pod specification it provides to Kubernetes.
The application MUST NOT listen on any other ports that are undeclared.
- Listening ports MUST be named in the pod specification with the protocol they Implement.

See test case networking-undeclared-container-ports-usage

Impacts and Risks of Non-Compliance: Undeclared ports can be blocked by security policies, causing unexpected connectivity issues and making troubleshooting difficult.

The name field in the ContainerPort section must be of the form <protocol> where <protocol> is one of the below, and the optional <suffix> can be chosen by the application.
Preferred prefixes: grpc, grpc-web, http, http2
Fallback prefixes: tcp, udp
Valid example: http-webapi or grpc

See test case manageability-container-port-name-format

Impacts and Risks of Non-Compliance: Incorrect port naming conventions can cause service discovery issues and configuration management problems.

The application MUST communicate with Kubernetes Services by their service IP instead of selecting Pods in that service individually.
The application MUST NOT encrypt outbound traffic on the cluster network interface.
The application MUST NOT decrypt inbound traffic on the cluster network interface.
The application SHOULD NOT manage certificates related to communication over the cluster network interface.
The application MUST NOT provide nftables or iptables rules.
The application MUST NOT define Kubernetes Custom Resources in *.istio.io or \*.aspenmesh.io namespaces.
The application MUST NOT define Kubernetes resources in the istio-system namespace.

See test case access-control-namespace

Impacts and Risks of Non-Compliance: Using inappropriate namespaces can lead to resource conflicts, security boundary violations, and administrative complexity in multi-tenant environments.

The application MUST propagate tracing headers when making outgoing requests based on incoming requests.
Example: If an application receives a request with a trace header identifying that request with traceid 785a908c8d93b2d2 , and decides based on application logic that it must make a new request to another application pod to fulfill that request, it must annotate the new request with the same traceid 785a908c8d93b2d2.
The application MUST propagate all of these tracing headers if present: x-request-id, x-b3-traceid, x-b3-spanId, x-b3-parentspanid, x-b3-sampled, x-b3-flags, b3.
The application MUST propagate the tracing headers by copying any header value from the original request to the new request.
The application SHOULD NOT modify any of these header values unless it understands the format of the headers and wishes to enhance them (e.g. implements OpenTracing)
If some or none of the headers are present, the application SHOULD NOT create them.
If an application makes a new request and it is not in service of exactly one incoming request, it MAY omit all tracing headers.
- The application does not have to generate headers in this case. It could generate headers if it implements e.g. OpenTracin.

3.13.1. Image standards

It is recommended that container images be built utilizing Red Hat’s Universal Base Image as they will have a solid security baseline as well as support from Red Hat.

Vendors must satisfy 3 requirements related to maintaining proper workload isolation in a containerized environment:

Workload requirement

Containerized workloads must work with Red Hat’s restricted SCC1.

Workload requirement

Containerized workloads must work with Red Hat’s default SELinux context. This is meant to forbid all changes to both primary config files (SCC, SEL) and the many related files referenced by these primary files. All security configuration files must be unchanged from the vendor’s released version.

See test cases platform-alteration-base-image, platform-alteration-is-selinux-enforcing

Impacts and Risks of Non-Compliance: Modified base images can introduce security vulnerabilities, create inconsistent behavior, and violate immutable infrastructure principles. Non-enforcing SELinux reduces security isolation and can allow privilege escalation attacks and unauthorized resource access.

Workload requirement

The container image must be secure.

See test cases platform-alteration-isredhat-release, platform-alteration-is-selinux-enforcing

Impacts and Risks of Non-Compliance: Non-Red Hat base images may lack security updates, enterprise support, and compliance certifications required for production use. Non-enforcing SELinux reduces security isolation and can allow privilege escalation attacks and unauthorized resource access.

The Red Hat UBI is able to meet these requirements and enables images built with it to meet these requirements. UBI is supported by a dedicated, full-time team providing releases of base image. UBI has the following features:

Scheduled release every 6 weeks to pick up less critical fixes.
On-demand release for critical or important CVE within 5 days of CVE public release.
Guarantees alignment with host OS packages and versions that run tightly coupled to the container artifacts. Many CVEs and potential attacks result from mismatch of untested versions of utility functions.
Ensures globally consistent time zone usage and resulting timestamps for global operators.
Enables continuous authorization to operate (ATO). Authorize once, use many times.
Meets requirements of the DOD, for example Air Force/DISA STIG.
Supports system-wide crypto consistency, for example, must have same crypto implementation as the Red Hat host operating system.
Provides authentication of the base layer via digital signature from originating vendor and strong signature authority.

3.13.2. Universal Base Image information

UBI is designed to be a foundation for cloud-native and web applications use cases developed in containers. You can build a containerized application using UBI, push it to your choice of registry server, easily share it with others - and because it’s freely redistributable — even deploy it on non-Red Hat platforms. And since it’s built on Red Hat Enterprise Linux, UBI is a platform that is reliable, secure, and performant.

Base Images: A set of three base images (Minimal, Standard, and Multi-service) are provided to provide optimum starting points for a variety of use cases.
Runtime Languages: A set of language runtime images (PHP, Perl, Python, Ruby, Node.js) enable developers to start coding out of the gate with the confidence that a Red Hat built container image provides.
Complementary packages: A set of associated YUM repositories/channels include RPM packages and updates that allow users to add application dependencies and rebuild UBI container images anytime they want.

Red Hat UBI images are the preferred images to build VNFs on as they will leverage the fully supported Red Hat ecosystem. In addition, once a VNF is standardized on a Red Hat UBI, the image can become Red Hat certified.

Red Hat UBI images are free to vendors so there is a low barrier of entry to getting started.

3.13.3. Application DNS configuration requirements

Workloads should use the service name only as a configuration parameter for attaching to a service within your namespace, the cluster will append namespace name and kubernetes service nomenclature on behalf of the application via search string in DNS. This allows a generic name for a service that works in all clusters no matter what the namespace name is and what the cluster base FQDN is.

For more information, see Kubernetes upstream reference for pod/service names and DNS.

3.13.4. Image standards

It is recommended that container images be built utilizing Red Hat’s Universal Base Image as they will have a solid security baseline as well as support from Red Hat.

Must satisfy 3 requirements related to maintaining proper workload isolation in a containerized environment:

Containerized workloads should work with a restricted SCC unless an exception is given

Containerized workloads should work with Red Hat’s default SELinux context. This is meant to forbid all changes to both primary config files (SCC, SEL) and the many related files referenced by these primary files. All security configuration files must be unchanged from the vendor’s released version.

See test cases platform-alteration-base-image, platform-alteration-is-selinux-enforcing

The container image must be secure.

See test cases platform-alteration-isredhat-release, platform-alteration-is-selinux-enforcing

Scheduled release every 6 weeks to pick up less critical fixes.
On-demand release for critical or important CVE within 5 days of CVE public release.
Guarantees alignment with host OS packages and versions that run tightly coupled to the container artifacts. Many CVEs and potential attacks result from mismatch of untested versions of utility functions.
Ensures globally consistent time zone usage and resulting timestamps for global operators.
Enables continuous authorization to operate (ATO). Authorize once, use many times.
Meets requirements of the DOD, for example Air Force/DISA STIG.
Supports system-wide crypto consistency, for example, must have same crypto implementation as the Red Hat host operating system.
Provides authentication of the base layer via digital signature from originating vendor and strong signature authority.

3.13.5. Universal Base Image information

Base Images: A set of three base images (Minimal, Standard, and Multi-service) are provided to provide optimum starting points for a variety of use cases.
Runtime Languages: A set of language runtime images (PHP, Perl, Python, Ruby, Node.js) enable developers to start coding out of the gate with the confidence that a Red Hat built container image provides.
Complementary packages: A set of associated YUM repositories/channels include RPM packages and updates that allow users to add application dependencies and rebuild UBI container images anytime they want.

Red Hat UBI images are the preferred images to build VNFs on as they will leverage the fully supported Red Hat ecosystem. In addition, once a VNF is standardized on a Red Hat UBI, the image can become Red Hat certified.

Red Hat UBI images are free to vendors so there is a low barrier of entry to getting started.

3.13.6. Application DNS configuration requirements

Example

search clspcoykvzwcscp-y-xx-w1-001.svc.cluster.local
svc.cluster.local cluster.local kub2-4.csp-1.vzwops.com
nameserver 198.223.0.10
options ndots:5

If an application deploys a service in the namespace clspcoykvzwcscp-y-xx-w1-001 and is attempting to access a service named worker, the application should just configure the client of the service with an FQDN of worker.

The DNS search suffix will append clspcoykvzwcscp-y-xx-w1-001.svc.cluster.local to the end of the name and result in a successful query for worker.clspcoykvzwcscp-y-xx-w1-001.svc.cluster.local. This allows an application to be less aware of the application’s namespace name and genericize the configuration of the application.

Workloads must use the service name only as a configuration parameter for attaching to a service within your namespace

For FQDNs that are outside of their namespace (in another cluster or in the same cluster), applications must append a . at the end of the FQDN so as not to trigger search strings for the FQDN

Example

nnrfe1-000.bbtpnj33.ne.nrf.5gc.vzims.com
clspcoykvzwcscp-y-xx-w1-001-scp-cache-headless.clspcoykvzwcscp-y-xx-w1-001.svc.cluster.local

For more information, see Kubernetes upstream reference for pod/service names and DNS.

Copyright

Information contained herein is provided AS IS and subject to change without notice. All trademarks used herein are property of their respective owners.