1. Refactoring software as cloud-native network functions
Break software down into the smallest practical set of microservices.
It is hard to move a 1000 lb boulder, but it is easy once that boulder is broken into many pieces. Containerized network functions (CNFs) should break each of their functions, services, and processes into separate containers. These containers still run within Kubernetes pods, and all of the functions that perform a single task should be within the same namespace.
There is a quote from Lewis and Fowler that describes this best:
The microservice architectural style is an approach to developing a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API. These services are built around business capabilities and independently deployable by fully automated deployment machinery.
1.1. Pods
Pods are the smallest deployable units of computing that can be created and managed in Kubernetes.
A Pod can contain one or more running containers at a time. Containers running in the same Pod share several Linux namespaces. For example, they share the same network namespace, meaning that one running container can communicate with another over 127.0.0.1:<port>. The same is true for storage: all containers in the same Pod share the same mount namespace and can mount the same volumes.
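A minimal Pod sketch (image names and the port are illustrative assumptions) showing two containers reaching each other over localhost and sharing a volume:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-namespace-demo
spec:
  containers:
  - name: web
    image: registry.example.com/web:1.0      # hypothetical image
    ports:
    - containerPort: 8080
    volumeMounts:
    - name: shared-data
      mountPath: /data
  - name: sidecar
    image: registry.example.com/sidecar:1.0  # hypothetical image
    # This container can reach the web container at 127.0.0.1:8080
    # because both containers share the Pod's network namespace.
    volumeMounts:
    - name: shared-data
      mountPath: /data
  volumes:
  - name: shared-data
    emptyDir: {}
```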
2. High-level workload expectations
- Workloads shall be built to be cloud-native.
- Containers MUST NOT run as root (uid=0). See test case access-control-security-context-non-root-user-check.
- Containers MUST run with the minimal set of permissions required. Avoid privileged Pods. See test case access-control-security-context-privilege-escalation.
- Use the main CNI for all traffic; MULTUS/SR-IOV/MacVLAN are for corner cases only (extreme throughput requirements, or protocols that cannot be load balanced).
- Workloads should employ N+k redundancy models.
- Workloads MUST define their pod affinity/anti-affinity rules. See test cases lifecycle-affinity-required-pods, lifecycle-pod-high-availability.
- All secondary network interfaces employed by workloads through MULTUS MUST support dual-stack IPv4/IPv6.
- Instantiation of a workload (via Helm chart, Operator, or otherwise) shall result in a fully functional workload ready to serve traffic, without requiring any post-instantiation configuration of system parameters.
- Workloads shall implement service resilience at the application layer and not rely on individual compute availability/stability.
- Workloads shall decouple application configuration from Pods, to allow dynamic configuration updates.
- Workloads shall support elasticity with dynamic scale up/down using Kubernetes-native constructs such as ReplicaSets. See test cases lifecycle-crd-scaling, lifecycle-statefulset-scaling, lifecycle-deployment-scaling.
- Workloads shall support canary upgrades.
- Workloads shall self-recover from common failures such as pod failure, host failure, and network failure. Kubernetes-native mechanisms such as health checks (liveness, readiness, and startup probes) shall be employed at a minimum. See test cases lifecycle-liveness-probe, lifecycle-readiness-probe, lifecycle-startup-probe.
Workload requirement: Containers must not run as root. See test case access-control-security-context-non-root-user-check.
Workload requirement: All secondary interfaces (MULTUS) must support dual stack.
Workload requirement: Workloads shall not use node selectors nor taints/tolerations to assign pod location. See test cases lifecycle-pod-scheduling, platform-alteration-tainted-node-kernel.
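The affinity/anti-affinity expectation above can be sketched as a Deployment fragment (names and image are illustrative) that forces replicas onto distinct nodes:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-cnf                 # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-cnf
  template:
    metadata:
      labels:
        app: example-cnf
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: example-cnf
            topologyKey: kubernetes.io/hostname   # at most one replica per node
      containers:
      - name: app
        image: registry.example.com/example-cnf:1.0   # hypothetical image
```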
2.1. Workload restrictions
- Workloads may not use host networking.
- Namespaces should not be created by the workload's deployment method (Helm / Operator).
- Workloads may not perform Role creation.
- Workloads may not perform RoleBinding creation.
- Workloads may not have ClusterRoles.
- Workloads are not authorized to bring their own CNI.
- Workloads may not deploy DaemonSets.
3. Workload developer guide
This section discusses recommendations and requirements for Workload application builders.
3.1. Cloud-native design best practices
The following best practices highlight some key principles of cloud-native application design.
- Single purpose with a messaging interface
  A container should address a single purpose with a well-defined (typically RESTful API) messaging interface. The motivation is that such a container image is more reusable and more replaceable/upgradeable.
- High observability
  A container must provide APIs for the platform to observe container health and act accordingly. These APIs include health checks (liveness and readiness), logging to stderr and stdout for log aggregation (by tools such as Logstash or Filebeat), and integration with tracing and metrics-gathering libraries (such as Prometheus or Metricbeat).
- Lifecycle conformance
  A container must receive important events from the platform and conform/react to these events properly. For example, a container should catch SIGTERM and shut down as quickly as possible (SIGKILL cannot be caught, and follows if the container does not exit in time). Other typically important events from the platform are PostStart, to initialize before servicing requests, and PreStop, to release resources cleanly before shutting down.
  See test cases lifecycle-container-shutdown, lifecycle-container-startup.
- Image immutability
  Container images are meant to be immutable; customized images for different environments should typically not be built. Instead, use an external means of storing and retrieving configuration that varies across environments. Additionally, the container image should NOT dynamically install additional packages at runtime.
- Process disposability
  Containers should be as ephemeral as possible and ready to be replaced by another container instance at any point in time. There are many reasons to replace a container, such as failing a health check, scaling down the application, migrating the containers to a different host, or platform resource starvation.
  This means that containerized applications must keep their state externalized or distributed and redundant. To store files or block-level data, use persistent volume claims. For information such as user sessions, use an external, low-latency key-value store such as Redis. Process disposability also requires that the application start up and shut down quickly, and even be ready for a sudden, complete hardware failure.
  Another helpful practice in implementing this principle is to create small containers. Containers in cloud-native environments may be automatically scheduled and started on different hosts; smaller images lead to quicker start-up times because the image must be copied to the host before the container can start.
  A corollary of this practice is to "retry instead of crashing": when one service in your application depends on another, it should not crash when that service is unreachable. For example, if your API service detects at startup that the database is unreachable, it should retry the connection instead of failing and refusing to start. While the database connection is down, the API can respond with a 503 status code, telling clients that the service is currently unavailable. Applications should follow this practice anyway, but in a containerized environment where instances are disposable, the need for it becomes more obvious.
  Related to this, containers are by default launched from shared images using copy-on-write (COW) filesystems that exist only as long as the container exists. Mounting persistent volume claims gives a container persistent physical storage. Clearly defining the abstraction for what storage is persisted promotes the idea that instances are disposable.
- Horizontal scaling and redundancy
  Support scaling and/or redundancy of your application by increasing or decreasing the number of application pods in your deployment, rather than by increasing or decreasing the resources of a single application pod.
  Also follow the principles described in Process disposability, i.e. stateless containers with quick startup and teardown times.
- Self-containment
  A container should contain everything it needs at build time. The container should rely only on the presence of the Linux kernel, with any additional libraries added at the time the container is built. The container image should NOT dynamically install additional packages at runtime.
  The only exceptions are things such as configuration, which varies between environments and must be provided at runtime; for example, through a Kubernetes ConfigMap.
- Runtime confinement
  A container declares its resource requirements (CPU, memory, networking, disk) to the platform and must stay confined to those indicated requirements.
- Use small container images
  Use the smallest base image possible and install only what is strictly needed inside the image. Clean up temporary files and avoid installing unnecessary packages. This reduces container size, build time, and the network time spent copying container images. Layer squashing may also be leveraged to avoid shipping secrets left behind in intermediate layers.
- Layer the application
  Package a single application per container; do not simply replace VMs with containers. Try to create images with common layers: after the initial download, only the layers that make each image unique need to be fetched, reducing download overhead.
- Image tagging
  Never build off the latest tag; doing so prevents builds from being reproducible over time. Tag your images properly: tagging lets users identify and download a specific version of your software. For this reason, tightly link the tagging of container images to the release policy of your software.
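The image-immutability and self-containment practices can be sketched with a ConfigMap (all names are illustrative): the same image runs in every environment, and only the injected configuration differs:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config                # hypothetical name
data:
  LOG_LEVEL: "info"               # varies per environment
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0   # hypothetical image, identical everywhere
    envFrom:
    - configMapRef:
        name: app-config          # configuration injected at runtime
```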
3.2. Container development best practices
3.2.1. Pod exit status
The most basic requirement for the lifecycle management of pods in OpenShift is the ability to start and stop correctly. When starting up, health probes like liveness and readiness checks can be put into place to ensure the application is functioning properly.
There are different ways a pod can stop functioning in Kubernetes: the pod can remain alive but become non-functional, or the pod can crash. In the first case, if the administrator has implemented liveness and readiness checks, OpenShift can stop the pod and restart it on the same node or a different node in the cluster. In the second case, when the application in the pod stops, it should exit with a meaningful exit code and write suitable log entries to help the administrator diagnose the problem.
Pods should use terminationMessagePolicy: FallbackToLogsOnError to summarize why they crashed, and should report errors on stderr when crashing. See test case observability-termination-policy.
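A hedged sketch of that setting (image name is a placeholder):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0        # hypothetical image
    # Fall back to the last chunk of container log output if the
    # container writes no termination message file on crash.
    terminationMessagePolicy: FallbackToLogsOnError
```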
Workload requirement: All pods shall have liveness, readiness, and startup probes defined. See test cases lifecycle-liveness-probe, lifecycle-readiness-probe, lifecycle-startup-probe.
OpenShift Container Platform and Kubernetes give application instances time to shut down before removing them from load balancing rotations. However, applications must ensure they cleanly terminate user connections as well before they exit.
On shutdown, OpenShift Container Platform sends a TERM signal to the processes in the container. On receiving SIGTERM, application code should stop accepting new connections, which ensures that load balancers route traffic to other active instances. The application code should then wait until all open connections are closed (or gracefully terminate individual connections at the next opportunity) before exiting.
After the graceful termination period expires, any process that has not exited is sent the KILL signal, which immediately ends the process. The terminationGracePeriodSeconds attribute of a pod or pod template controls the graceful termination period (default 30 seconds) and can be customized per application as necessary.
3.2.2. Graceful termination
There are different reasons that a pod may need to shutdown on an OpenShift cluster. It might be that the node the pod is running on needs to be shut down for maintenance, or the administrator is doing a rolling update of an application to a new version which requires that the old versions are shutdown properly.
When pods are shut down by the platform, they are sent a SIGTERM signal, meaning that the process in the container should start shutting down, closing connections and stopping all activity. If the pod doesn't shut down within the default 30 seconds, the platform sends a SIGKILL signal, which stops the pod immediately. This method isn't as clean, and the time between the SIGTERM and SIGKILL signals can be modified based on the requirements of the application.
Pods should exit with zero exit codes when they are gracefully terminated.
Workload requirement: All pods must respond to the SIGTERM signal and shut down gracefully with a zero exit code. See test case lifecycle-container-shutdown.
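A minimal sketch (image, command, and timings are illustrative) of tuning the SIGTERM-to-SIGKILL window and draining connections before shutdown:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  terminationGracePeriodSeconds: 60   # window between SIGTERM and SIGKILL (default 30)
  containers:
  - name: app
    image: registry.example.com/app:1.0    # hypothetical image
    lifecycle:
      preStop:
        exec:
          # Illustrative: give the load balancer time to stop routing
          # new traffic here before SIGTERM is delivered.
          command: ["/bin/sh", "-c", "sleep 5"]
```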
3.2.3. Pod resource profiles
The CaaS platform's default scheduler is responsible for tracking the currently available resources on the platform and placing containers/applications appropriately.
All pods should have a resource request that is the minimum amount of resources the pod is expected to use at steady state for both memory and CPU.
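As an illustrative fragment (values are placeholders, not recommendations), a steady-state request with limits might look like:

```yaml
resources:
  requests:          # expected steady-state usage
    cpu: 250m
    memory: 256Mi
  limits:            # ceiling the container must stay under
    cpu: 500m
    memory: 512Mi
```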
3.2.4. Storage: emptyDir
There are several options for volumes and for reading and writing files in OpenShift. When the requirement is temporary storage, and given the choice between writing files into container directories and an external filesystem, choose the emptyDir option. This provides the administrator with the same temporary filesystem: when the pod is stopped, the directory is deleted forever. An emptyDir can be backed by whatever medium backs the node, or it can be set to memory for faster reads and writes. Using emptyDir with a requested local storage limit, instead of writing to container directories, also allows enabling readOnlyRootFilesystem on the container or pod.
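A hedged sketch (image, paths, and size limit are illustrative) combining emptyDir scratch space with a read-only root filesystem:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0   # hypothetical image
    securityContext:
      readOnlyRootFilesystem: true        # possible because writes go to the volume
    volumeMounts:
    - name: scratch
      mountPath: /tmp
  volumes:
  - name: scratch
    emptyDir:
      sizeLimit: 1Gi      # cap local storage consumption
      # medium: Memory    # optional: back the volume with RAM
```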
3.2.5. Liveness readiness and startup probes
As part of the pod lifecycle, the OpenShift platform needs to know what state the pod is in at all times. This can be accomplished with different health checks. There are at least three states that are important to the platform: startup, running, shutdown. Applications can also be running, but not healthy, meaning, the pod is up and the application shows no errors, but it cannot serve any requests.
When an application starts up on OpenShift it may take a while for the application to become ready to accept connections from clients, or perform whatever duty it is intended for.
Two health checks that are required to monitor the status of the applications are liveness and readiness. As mentioned above, the application can be running but not actually able to serve requests. This can be detected with liveness checks. The liveness check will send specific requests to the application that, if satisfied, indicate that the pod is in a healthy state and operating within the required parameters that the administrator has set. A failed liveness check will result in the container being restarted.
There is also a consideration of pod startup: a pod may take a while to start for different reasons, and pods are marked as ready only once they pass the readiness check. The readiness check determines that the pod has started properly and is able to answer requests. A failed readiness check results in the container being taken out of the available service endpoints. For example, a pod under heavy load may fail the readiness check and be taken out of the endpoint pool; once it works through its pending requests and passes the readiness check again, it is added back to the endpoint pool. There are circumstances where both checks are used to monitor the applications in the pods.
For more information, see Configure Liveness, Readiness and Startup Probes.
Exec probes need to be avoided at all costs because of the resource overhead they require. Exec probes cannot be used on RT containers.
See test cases lifecycle-liveness-probe, lifecycle-readiness-probe, lifecycle-startup-probe
If the workload is doing CPU pinning and running a DPDK process, do not use exec probes (executing a command within the container), as these can pile up and eventually block the node. See test case networking-dpdk-cpu-pinning-exec-probe.
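A minimal sketch of the three probes using HTTP endpoints instead of exec probes (paths, port, and timings are illustrative assumptions):

```yaml
containers:
- name: app
  image: registry.example.com/app:1.0   # hypothetical image
  startupProbe:                # gates the liveness probe until startup completes
    httpGet: {path: /healthz, port: 8080}
    failureThreshold: 30
    periodSeconds: 10
  livenessProbe:               # failure triggers a container restart
    httpGet: {path: /healthz, port: 8080}
    periodSeconds: 10
  readinessProbe:              # failure removes the pod from service endpoints
    httpGet: {path: /ready, port: 8080}
    periodSeconds: 5
```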
3.2.6. Use imagePullPolicy: IfNotPresent
If a container dies and needs to be restarted, the image pull policy becomes important. There are three image pull policies available: Always, Never, and IfNotPresent. It is generally recommended to use IfNotPresent: if a pod needs to restart for any reason, the kubelet checks the node where the pod is starting and reuses the already downloaded container image if it is available. OpenShift intentionally does not enable the AlwaysPullImages admission plugin, as turning it on can introduce new kinds of cluster failure modes. Self-hosted infrastructure components are still pods: enabling this feature can result in cases where a loss of contact with an image registry causes redeployment of an infrastructure or application pod to fail. IfNotPresent is used so that a loss of image registry access does not prevent the pod from restarting.
Container images that are protected by registry authentication have a condition whereby a user who is unable to download an image directly can still launch it by leveraging the host's cached image.
See test case lifecycle-image-pull-policy
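An illustrative fragment (image name is a placeholder):

```yaml
containers:
- name: app
  image: registry.example.com/app:1.0   # hypothetical image; pin a tag, not latest
  imagePullPolicy: IfNotPresent         # reuse the node's cached image on restart
```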
3.2.7. No naked pods
Do not use naked pods (that is, pods not bound to a ReplicaSet, Deployment, or StatefulSet). Naked pods will not be rescheduled in the event of a node failure.
See test case lifecycle-pod-owner-type
Workload requirement: Applications must not depend on any single pod being online for the application to function.
Workload requirement: Pods must be deployed as part of a ReplicaSet, Deployment, or StatefulSet. See test case lifecycle-pod-owner-type.
Workload requirement: Pods may not be deployed in a DaemonSet. See test case lifecycle-pod-owner-type.
3.2.8. init containers
Init containers can be used to run tools, commands, or any other action that needs to complete before the application containers start; for example, loading a database schema, or constructing a config file from a definition passed in via a ConfigMap or Secret.
See Using init containers to perform tasks before a pod is deployed for more information.
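A hedged sketch (images, paths, and the ConfigMap name are hypothetical) of an init container preparing a config file before the application container starts:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  initContainers:
  - name: render-config
    image: registry.example.com/toolbox:1.0   # hypothetical tooling image
    # Illustrative: stage configuration before the app container runs.
    command: ["/bin/sh", "-c", "cp /config-src/app.conf /config/app.conf"]
    volumeMounts:
    - {name: config-src, mountPath: /config-src}
    - {name: config, mountPath: /config}
  containers:
  - name: app
    image: registry.example.com/app:1.0       # hypothetical image
    volumeMounts:
    - {name: config, mountPath: /etc/app}
  volumes:
  - name: config-src
    configMap:
      name: app-config                        # hypothetical ConfigMap
  - name: config
    emptyDir: {}
```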
3.2.9. Container security best practices
3.2.9.1. Avoid privileged containers
In OpenShift Container Platform, it is possible to run privileged containers that have all of the root capabilities of the host machine, allowing access to resources that are not accessible in ordinary containers. This, however, increases the security risk to the whole cluster. Containers should request only the privileges they need to run their legitimate functions. No container will be allowed to run with full privileges without an exception.
The general guidelines are:
- Only ask for the necessary privileges and access control settings for your application.
- If the function required by your CNF can be fulfilled by OCP components, your application should not request escalated privileges to perform it.
- Avoid using any host system resources where possible.
- Leverage a read-only root filesystem when possible.
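These guidelines can be sketched as a restrictive container securityContext (a baseline example, not a mandated configuration):

```yaml
securityContext:
  runAsNonRoot: true              # refuse to run as uid 0
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop: ["ALL"]                 # add back only what is strictly required
```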
3.2.10. Avoid accessing resource on host
It is not recommended for an application to access the host resources described in the following subsections.
3.2.11. Avoid mounting host directories as volumes
It is not necessary to mount the host /sys/ or /dev/ directories as volumes in a pod in order to use a network device such as an SR-IOV VF. The network interface is moved into the pod network namespace automatically by the CNI. Mounting the whole /sys/ or /dev/ directory in the container will overwrite the network device descriptors inside the container, causing device not found or no such file or directory errors.
Network interface statistics can be queried inside the container using the same /sys/ path as when running directly on the host. When running network interfaces in containers, the relevant /sys/ statistics interfaces are available inside the container, such as /sys/class/net/net1/statistics/, /proc/net/tcp and /proc/net/tcp6.
For DPDK applications using SR-IOV VFs, device specs (in the case of vfio-pci) are automatically attached to the container via the Device Plugin. There is no need to mount the /dev/ directory as a volume in the container; the application can find the device specs under /dev/vfio/ in the container.
3.2.12. Avoid the host network namespace
Application pods must avoid using hostNetwork. Applications may not use the host network, including nodePort, for network communication. Any networking needs beyond the functions provided by the pod network and the ingress/egress proxy must be serviced via a MULTUS-connected interface.
Workload requirement: Applications may not use nodePort services. See test case access-control-service-type.
3.2.13. Use of Capabilities
Linux Capabilities allow you to break apart the power of root into smaller groups of privileges. Platform administrators can use Pod Security Policy (PSP) to control permissions for pods. Users can also specify the necessary Security Context in the pod annotations.
- IPC_LOCK
  The IPC_LOCK capability is required if any of these functions are used in an application:
  - mlock()
  - mlockall()
  - shmctl()
  - mmap()
  Even though mlock() is not necessary on systems where page swap is disabled, it may still be required because it is built into DPDK libraries, and DPDK-based applications may call it indirectly via other functions.
- NET_ADMIN
  The NET_ADMIN capability is required to perform various network-related administrative operations inside the container, such as:
  - MTU setting
  - Link state modification
  - MAC/IP address assignment
  - IP address flushing
  - Route insertion/deletion/replacement
  - Controlling network driver and hardware settings via ethtool
  This doesn't include:
  - Adding or deleting a virtual interface inside a container (for example, adding a VLAN interface)
  - Setting VF device properties
  All the administrative operations (except ethtool) mentioned above that require the NET_ADMIN capability should already be supported on the host by the various CNIs on the CaaS platform.
- Avoid SYS_ADMIN
  This capability is very powerful and overloaded: it allows the application to perform a range of system administration operations on the host. Avoid requiring this capability in your application.
- SYS_NICE
  When a workload uses the real-time kernel, SYS_NICE is needed to allow a DPDK application to switch to the SCHED_FIFO scheduling policy.
- SYS_PTRACE
  This capability is required when using process namespace sharing, which is used when processes from one container need to be exposed to another container; for example, to send signals like SIGHUP from a process in one container to a process in another container.
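Requesting capabilities can be sketched as follows (which capabilities to add depends entirely on the workload; IPC_LOCK and NET_ADMIN here are illustrative):

```yaml
securityContext:
  capabilities:
    drop: ["ALL"]                      # start from nothing
    add: ["IPC_LOCK", "NET_ADMIN"]     # only what the workload actually needs
```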
3.3. Logging
- Log aggregation and analysis
  - Containers are expected to write logs to stdout. It is highly recommended that stdout/stderr output use some standard logging format.
  - Logs CAN be parsed to a limited extent so that specific vendor logs can be sent back to the workload if required.
  - Logs need to be properly labeled so that the logs' consumer can correlate and process them.
  - Workloads requiring log parsing must use some standard logging library or format for all stdout/stderr output. Examples of standard logging libraries include klog, rfc5424, and oslo.
See test case observability-container-logging.
3.4. Upgrade expectations
- The Kubernetes API deprecation policy defined in the Kubernetes Deprecation Policy shall be followed.
- Workloads are expected to maintain service continuity during platform upgrades and during workload version upgrades.
- Workloads need to be prepared for nodes to reboot or shut down without notice.
- Workloads shall configure pod disruption budgets appropriately to maintain service continuity during platform upgrades.
- Applications may NOT deploy pod disruption budgets that prevent any pod disruption (for example, maxUnavailable: 0).
- Applications should not be tied to a specific version of Kubernetes or any of its components.
Applications MUST specify a pod disruption budget appropriately to maintain service continuity during platform upgrades. The budget should be balanced: permissive enough to give the cluster the operational flexibility to drain nodes, but restrictive enough that the service is not degraded during upgrades. See test case lifecycle-pod-recreation.
Workload requirement: Pods that perform the same microservice, and that could be disrupted if multiple members of the service are unavailable, must implement pod disruption budgets to prevent disruption in the event of patches/upgrades. See test case observability-pod-disruption-budget.
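A minimal PodDisruptionBudget sketch (name, label, and threshold are illustrative) that keeps a floor of available replicas while still permitting node drains:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-cnf-pdb        # hypothetical name
spec:
  minAvailable: 2              # never voluntarily disrupt below two replicas
  selector:
    matchLabels:
      app: example-cnf         # hypothetical label
```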
3.5. Taints and tolerations
Taints and tolerations allow a node to control which pods may be scheduled on it. A taint allows a node to refuse to schedule a pod unless that pod has a matching toleration. You apply taints to a node through the node specification (NodeSpec) and apply tolerations to a pod through the pod specification (PodSpec). A taint on a node instructs the node to repel all pods that do not tolerate the taint.
Taints and tolerations consist of a key, value, and effect. An operator allows you to leave one of these parameters empty.
See Controlling pod placement using node taints for more information.
See test case platform-alteration-tainted-node-kernel
3.6. Requests/Limits
Requests and limits provide a way for a workload developer to ensure adequate resources are available to run the application. Requests can be made for storage, memory, CPU, and so on. Quotas can be used to enforce these requests and limits. See Resource quotas per project for more information.
Nodes can be overcommitted, which can affect the request/limit strategy. For example, when you need guaranteed capacity, use quotas to enforce it; in a development environment, you can overcommit where trading guaranteed performance for capacity is acceptable. Overcommitment can be done at the project, node, or cluster level.
See Configuring your cluster to place pods on overcommitted nodes for more information.
Workload requirement: Pods must define requests and limits values for CPU and memory. See test case access-control-requests-and-limits.
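Enforcement via quota can be sketched as a namespace-level ResourceQuota (values are illustrative):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota          # hypothetical name
spec:
  hard:
    requests.cpu: "8"          # cap on the sum of pod CPU requests in the namespace
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
```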
3.7. Security and role-based access control
- Roles / RoleBindings
  A Role represents a set of permissions within a particular namespace, e.g. permission for a given user to list pods/services within the namespace. A RoleBinding grants the permissions defined in a Role to a user or group of users. Applications may create Roles and RoleBindings within their namespace; however, the scope of a Role is limited to the permissions of its creator or less.
  See test case access-control-pod-role-bindings.
- ClusterRole / ClusterRoleBinding
  A ClusterRole represents a set of permissions at the cluster level that can be used by multiple namespaces. A ClusterRoleBinding grants the permissions defined in a ClusterRole to a user or group of users at a namespace level. Applications are not permitted to install ClusterRoles or create ClusterRoleBindings; this is an administrative activity performed by cluster administrators. Workloads should not use ClusterRoles; exceptions can be granted to allow this, but it is discouraged.
See Using RBAC to define and apply permissions for more information.
Workload requirement: Workloads may not create ClusterRoleBindings. See test case access-control-cluster-role-bindings.
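A namespace-scoped sketch (all names are hypothetical) consistent with the restriction to Roles and RoleBindings:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: example-cnf       # hypothetical namespace
rules:
- apiGroups: [""]
  resources: ["pods", "services"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: example-cnf
subjects:
- kind: ServiceAccount
  name: app-sa                 # hypothetical service account
  namespace: example-cnf
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```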
3.8. MULTUS
MULTUS is a meta-CNI that delegates to multiple other CNIs, allowing pods to get additional interfaces beyond eth0. Additional CNIs for SR-IOV and MacVLAN interfaces allow traffic to be routed directly to a pod without traversing the pod network. This capability is delivered for corner-case scenarios only; it is not to be used in general for all applications. Example use cases include bandwidth requirements that necessitate SR-IOV and protocols that the load balancer cannot support. The OVN-based pod network should be used for every interface that can be supported from a technical standpoint.
Workload requirement: Unless an application has a special traffic requirement that is not supported by SPK or the ovn-kubernetes CNI, applications must use the pod network for traffic.
See Understanding multiple networks for more information.
3.9. Labels
Labels are used to organize and select subsets of objects. For example, labels enable a service to find and direct traffic to an appropriate pod. While pods can come and go, when labeled appropriately, the service will detect new pods, or a lack of pods, and forward or reduce the traffic accordingly.
When designing your label scheme, it might make sense to map attributes such as application type, location, department, or role. The scheduler can then use these attributes when colocating pods or spreading them across the cluster. It is also possible to search for resources by label.
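As an illustrative sketch (names are placeholders), a Service uses labels to select the pods that receive its traffic:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: example-cnf            # hypothetical name
spec:
  selector:
    app: example-cnf           # traffic flows to any pod carrying these labels
    role: frontend
  ports:
  - port: 80
    targetPort: 8080
```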
4. Far Edge specific Workload requirements
Due to the specific deployment environment at the very far edge of the network, and the different CaaS/PaaS vendors being selected by Far Edge vs. Core/Edge, there are some Far Edge specific CNF requirements that are described in the following sections.
At the Far Edge, OpenShift is deployed as a Single Node OpenShift (SNO) cluster.
4.1. RT requirements and CPU engineering
With regard to CPU engineering and layout, the platform infrastructure pods and operating system processes require CPU resources in addition to the workload requirements; workload partitioning and performance profiles are used in tandem to accommodate this. Workload partitioning should be set up during the installation of the SNO cluster. An example cluster manifest is here.
Once these manifests are created, the machineConfig can be changed to modify the number of cores allocated to the housekeeping pods and processes. These partition manifests must correspond to the same configuration as the performance profile.
This configuration can be verified by examining the allowed CPUs for systemd:
# cat /proc/1/status|grep Cpus_allowed_list
Cpus_allowed_list: 0-1,32-33
Housekeeping pods should have the following annotation in their namespace:
annotations:
  workload.openshift.io/allowed: management
This ensures they are given access to the same cores as above. More information can be found in the workload partitioning documentation.
The platform provides different types of CPU pools to the various application threads. Low-latency applications, such as vRAN, should be designed and implemented to fully utilize this platform. More information can be found in the Low latency tuning documentation.
4.2. CPU
4.2.1. Application shared CPU pool
Burstable workloads can float across CPUs depending on the availability of resources. More information can be found in the Low latency tuning documentation.
A workload that is not sensitive to CPU migration and context switching, e.g. OAM, is suitable for the shared CPU pool. For more information see the Kubernetes Pod QoS class docs.
CPU resource request example:
resources:
# Container is in the application shared CPU pool
requests:
# CPU request is not an integer
cpu: 100m
limits:
cpu: 200m
4.2.2. Application exclusive CPU pool
CPUsets constrain the CPU and memory placement of tasks to the resources within a task’s current cpuset. Within a cgroup, the workload can set CPU affinity to reduce context switching; however, the CPU scheduler still load-balances non-CPU-pinned workloads across all the CPUs in the cpuset. In addition, the following annotation is required to accompany the workload. See Disabling CPU CFS quota:
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cpu-quota.crio.io: "disable"
spec:
  runtimeClassName: performance-<profile_name>
A workload should use the application exclusive CPU pool if it has low-latency requirements and can tolerate context switching with other application threads and kernel threads in the cpuset. An example workload that uses the application exclusive pool is a vRAN RT task.
CPU resource request example:
resources:
# Container is in the application exclusive CPU pool
requests:
# CPU request is an integer and matches with limits
cpu: 2
limits:
cpu: 2
4.2.3. Guidelines for the applications to use CPU pools
-
100% busy tight-loop polling threads with an RT CPU scheduling policy (for example, polling threads at less than
SCHED_FIFO:10/SCHED_RR
) need to implement uSleeps
to yield CPU time to other processes or threads.-
Sleep duration and periodicity need to be configurable parameters
-
As one of the RT application onboarding acceptance criteria, sleep duration and periodicity must be configured properly to ensure that the CaaS platform does not run into stability issues (for example, CPU starvation, process stalls).
-
-
A cloud native application should be decoupled into multiple small units (for example, a container) to ensure that each unit is running in only one type of CPU pool.
-
If one of the containers in the pod uses guaranteed CPUs, then all other containers in that pod must be of the Guaranteed QoS class as well.
-
The requirement is that the sum of all requests and limits must be equal (Pod QoS Guaranteed) and the pinned containers must use integer (and equal) values for CPU and memory requests and limits. This implies that the non-pinned containers also use requests=limits, but their values can be fractional ("2.5") or even whole numbers written as fractions (e.g. "1.0").
-
At least one core must be left for non-guaranteed workloads.
-
-
The workload running in the shared CPU pool should choose a non-RT CPU scheduling policy, like SCHED_OTHER, to always share the CPU with other applications and kernel threads.
-
The workload running in the application exclusive CPU pool should choose an RT CPU scheduling policy, like
SCHED_FIFO/SCHED_RR
, to achieve low latency, but should set its priority lower than 10 to avoid CPU starvation of the kernel threads (ksoftirqd SCHED_FIFO:11, ktimer SCHED_FIFO:11, rcuc SCHED_FIFO:11
) on the same CPU core. -
The workload running in the application-isolated exclusive CPU pool should choose an RT CPU scheduling policy, like
SCHED_FIFO/SCHED_RR
, and set a high priority to achieve the best low-latency performance. The workload should be CPU-pinned on a set of dedicated CPU cores, but periodically yielding CPU (calling the nanosleep
function) in its busy tight loop is required to ensure that the kernel threads on the same CPU core can get the minimum CPU time. -
If a workload with both RT tasks and non-RT tasks has to be implemented in a single Pod:
-
The real-time tasks within the Pod:
-
Should perform CPU pinning based on the rule of a single thread per logical core
-
Should use RT CPU scheduling policy, like
SCHED_FIFO/SCHED_RR
-
Can set higher priority, but periodically yielding CPU in its busy tight loop is required (refer to the earlier uSleep requirements).
-
-
-
The non-real-time tasks within the Pod:
-
Should use CPU cores separate from those of the real-time tasks, using taskset or pthread affinity
-
Need to take care of the scheduling/load balancing if the
SCHED_OTHER
scheduling policy is used -
Use an RT CPU scheduling policy, like SCHED_RR, with lower priority levels, which is equivalent to an SMP configuration of threads running on non-isolcpus cores.
-
The above applies for guaranteed and non-guaranteed QoS within a single Pod.
It is possible to create a Pod with multiple containers and only pin some of the CPUs.
Follow these rules:
-
Total resource requests must match total limits
-
Pinned containers must request an integer number of CPUs
apiVersion: v1
kind: Pod            # QoS: Guaranteed
metadata:            # total requests = limits
  name: guar-2s
spec:
  containers:
  - name: pinned-1   # PINNED
    resources:
      limits:        # requests are inferred
        cpu: 2       # integer value
        memory: "400Mi"
  - name: best-1
    resources:
      limits:        # Burstable
        cpu: 0.5     # non-integer value
        memory: "200Mi"
  - name: best-2
    resources:
      limits:        # Burstable
        cpu: 1.0     # non-integer value
        memory: "200Mi"
-
Workload applications should build resiliency into their pipelines to recover from unexpected outliers in the platform. Applications need to recover without causing the system or application to crash when an outlier is seen (e.g. a dropped symbol or TTI).
4.3. Huge pages allocation
Unlike CPU or memory, huge pages do not support overcommit: huge page requests must equal the limits. Huge pages are isolated at container scope, so each container has its own limit in its cgroup sandbox, as requested in the container spec. Applications that consume huge pages via shmget()
with SHM_HUGETLB
must run with a supplemental group that matches /proc/sys/vm/hugetlb_shm_group
. Huge page usage in a namespace is controllable via ResourceQuota, similar to other compute resources like CPU or memory, using the hugepages-1Gi token.
Please refer to Manage HugePages for more information.
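As an illustration of the requests-equal-limits rule, a container spec consuming 1Gi huge pages might look like the following sketch (pod name and image are placeholders; the node must have 1Gi huge pages pre-allocated):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hugepage-example                     # placeholder name
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest   # placeholder image
    volumeMounts:
    - mountPath: /hugepages
      name: hugepage
    resources:
      requests:
        hugepages-1Gi: 2Gi                   # must equal limits; no overcommit
        memory: 1Gi
        cpu: 1
      limits:
        hugepages-1Gi: 2Gi
        memory: 1Gi
        cpu: 1
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
```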
4.5. PTP synchronization
Network synchronization is key to optimal radio network performance. While there is no change to fundamental synchronization requirements in the move from 4G to 5G, wider use of Time Division Duplex (TDD) radio technology and growing demand for new network architectures that support demanding 5G use cases have made the need for time synchronization more critical in 5G.
Precision Time Protocol (PTP) is used as a transport-based synchronization solution to provide accurate timing to both the platform and applications. A hierarchical PTP master-slave architecture is implemented for clock distribution.
The IEEE 1588 PTP standards provide a wealth of options and the basis for highly reliable and accurate time synchronization solutions. However, the time synchronization needs of specific applications in different industries can vary quite significantly. These specific needs are defined in separate PTP profile standards, often in collaboration with other industry standards organizations.
For 5G, both IEEE and ITU-T provide relevant profiles that can be used to design ultra-reliable and highly accurate time synchronization solutions. The ITU-T PTP Telecom Profiles are:
-
G8265.1 was first used to deliver accurate frequency synchronization, but not time synchronization, to mobile base stations (as the delay asymmetries and packet delay variation present in PTP-unaware networks would make it impossible to meet the stringent accuracy and stability requirements).
-
G8275.1 is designed to deliver highly accurate frequency synchronization, phase synchronization and Time of Day (ToD), with support for accurate frequency synchronization from the network physical layer. (This is the recommended profile, but all elements in the network must be PTP-aware.)
-
G8275.2 is designed to deliver accurate frequency synchronization, phase synchronization and ToD with only partial support from the network where non-PTP nodes are also in the network path.
The following table shows the relevant options in Telecom profiles:
Typically, a PTP hierarchy has the following high-level components:
- Grandmaster (GM) clock
-
This is the primary reference time clock (PRTC) for the entire PTP network. It usually synchronizes its clock from an external Global Navigation Satellite System (GNSS) source.
- Boundary clock (BC)
-
This intermediate device has multiple PTP-capable network connections to accurately synchronize one network segment to another. It synchronizes its clock to a master and serves as a time source for ordinary clocks.
- Ordinary clock (OC)
-
In contrast to boundary clocks, this device only has a single PTP-capable network connection. Its main function is to synchronize its clock to a master, and in the event of losing the master, it can tolerate a loss of sync source for some period of time.
OpenShift Container Platform uses the PTP Operator to deploy PTP profiles (through the PTPConfig
CR) along with the linuxptp
Pod in each of the nodes requiring PTP support. This Pod runs ptp4l
and phc2sys
programs as containers.
The ptp4l
program represents the Linux implementation of PTP for the boundary and ordinary clocks. When using hardware timestamping, it directly synchronizes the PTP hardware clock (PHC) present on the NIC to the source clock (PRTC).
The phc2sys
container is responsible for synchronizing the two available clocks in a cluster node, typically these are the PHC and the system clocks. This program is used when hardware time stamping is configured. In such cases, it synchronizes the system clock from the PTP hardware clock on the defined network interface controller (NIC).
4.5.1. How to deploy and configure PTP
Before enabling PTP, ensure that NTP is disabled for the required nodes. This can be done by disabling the chrony
time service (chronyd
) using a MachineConfig
custom resource. For more information, see Disabling chrony time service.
After disabling the NTP service, the next thing is to properly assign labels to the cluster nodes. Labels ensure that the PTP operator applies PTP profiles to the correct cluster node. The use of labels facilitates in-cluster object searches, which are leveraged by the PTP operator (using a match field) to select the right cluster nodes and configure the ptp4l
and phc2sys
programs accordingly. See the following examples:
$ oc label node rh-sno-du ptp/boundary-clock="" # <- context: rh-sno-du cluster
$ oc label node rh-sno-ru ptp/ordinary-clock="" # <- context: rh-sno-ru cluster
These labels are used by the match
field in PtpConfig
CR to match a profile to one or more nodes.
Configuring PTP is a matter of installing the PTP Operator, selecting a capable NIC using the NodePtpDevice
CR and setting up the PtpConfig
CR with the appropriate values to work as:
-
An OC in an RU
Notice, however, that to fulfill Far Edge requirements regarding low latency, the PTP Operator linuxptp
services must be set to allow threads to run with a SCHED_FIFO policy.
The following topology illustrates the PTP operator running on cluster nodes where the O-RAN workloads are hosted:
The GM device sends its timestamps downstream and these are received in the boundary clock slave ports of a vDU (denoted in the image with a capital S). These timestamps are then used by the ptp4l
program to adjust the local PHC
in the corresponding cluster node. This also happens from the vDU master ports toward RU ordinary clocks slave ports.
At the same time, the phc2sys
process running in both the O-DU and O-RU nodes reads the PHC timestamps from the ptp4l
program through a shared Unix Domain Socket (UDS) and finally adjusts the SYSTEM_CLOCK
offset, as shown in the figure below.
This process is continuously repeated keeping all the PHC
instances in the O-RAN deployment effectively synchronized to the GNSS source clock, which acts as the GM in this scenario.
- Tested hardware
-
Supported NICs have their own physical on-board clock (known as PHC - PTP Hardware Clock) that is used to hardware-timestamp the incoming and outgoing PTP messages.
The recommended NICs with hardware PTP support are:
Vendor | PTP Profile setting
---|---
Intel Columbiaville 800 Series NICs | Ensure boundary_clock_jbod is set to 0.
Intel Fortville X710 Series NICs | Ensure boundary_clock_jbod is set to 1.
- Fast event detection
-
Cloud-native applications, such as virtual RAN (vRAN), require quick access to notifications about hardware timing events. This is critical for the proper functioning of the whole Far Edge network. Fast event notification is a Red Hat framework that enables early warnings on real-time PTP clock synchronization events.
This framework mitigates workload errors by allowing cluster nodes to directly communicate PTP clock sync status to the vRAN application running as part of the DU. These event notifications are available to RAN applications running on the same DU node using a publish/subscribe REST API approach that provides event notifications via a fast messaging bus.
Fast event notifications are generated by the PTP Operator in the OpenShift Container Platform for every PTP-capable network interface. Specifically, it uses an AMQP event notification bus provided by the AMQ Interconnect Operator that delivers flexible routing of messages between any AMQP-enabled endpoints. A high-level overview of the PTP fast events framework is below:
From the OpenShift PTP perspective, the idea is to install the AMQ Interconnect Operator and configure the PTP Operator.
Then, update the PtpConfig
CR to include the relevant values in each case (sample values for ptpClockThreshold
are shown below):
apiVersion: ptp.openshift.io/v1
kind: PtpConfig
metadata:
  name: <ptp_config_name>
  namespace: openshift-ptp
...
spec:
  profile:
  - name: "profile1"
    interface: "enp5s0f0"
    ptp4lOpts: "-2 -s --summary_interval -4"
    phc2sysOpts: "-a -r -m -n 24 -N 8 -R 16"
    ptp4lConf: ""
    ptpClockThreshold:
      holdOverTimeout: 5
      maxOffsetThreshold: 100
      minOffsetThreshold: -100
The ptpClockThreshold
setting configures how long the PTP Operator waits after losing the PTP master clock signal before triggering that error as a PTP event:
-
holdOverTimeout
is the time value (in seconds) before the PTP clock event state changes toFREERUN
, which means that synchronization from the PTP master clock is lost. -
maxOffsetThreshold
andminOffsetThreshold
settings configure the offset values (in nanoseconds) that compare against the values for CLOCK_REALTIME (phc2sys
) or master offset (ptp4l
). When theptp4l
orphc2sys
offset value is outside the specified range, the PTP clock state is then set toFREERUN
. When the offset value is within this range, the PTP clock state remains set toLOCKED
.
From the DU application perspective, a new cloud-event-proxy
sidecar container is appended to the DU application Pod; it is loosely coupled to the main DU application container on the DU node. It provides an event publishing framework that allows DU applications to subscribe to any published PTP event. A detailed explanation of how it works can be found here.
The steps to get a DU application subscribed are described here, along with several examples of use.
Monitoring PTP fast event metrics
PTP fast event metrics can be monitored in the OpenShift Container Platform web console by using the pre-configured and self-updating Prometheus monitoring stack, as explained here. From the DU application perspective, the Prometheus metrics support helps to set up custom alerting manager systems for early detection of issues.
4.6. Workload security
In OCP, it is possible to run privileged containers that have all of the root capabilities on a host machine, allowing the ability to access resources which are not accessible in ordinary containers. This, however, increases the security risk to the whole cluster. Containers should only request those privileges they need to run their legitimate functions. No containers will be allowed to run with full privileges without an exception.
The general guidelines are:
-
Only ask for the necessary privileges and access control settings for your application.
-
If the function required by your workload can be fulfilled by OCP components, your application should not be requesting escalated privilege to perform this function.
-
Avoid using any host system resource if possible.
-
Leverage a read-only root filesystem when possible.
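The guidelines above can be expressed directly in a Pod securityContext. The following is a minimal sketch; the pod name and image are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: least-privilege-example              # placeholder name
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest   # placeholder image
    securityContext:
      runAsNonRoot: true                     # no uid=0
      allowPrivilegeEscalation: false        # no setuid/file-cap escalation
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]                        # add back only what is needed
```

A spec like this should be admitted by the restricted SCC without any extra grants.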
Workload requirement
Only ask for the necessary privileges and access control settings for your application. See test case access-control-security-context-non-root-user-check |
Workload requirement
If the function required by your workload can be fulfilled by OCP components, your application should not be requesting escalated privilege to perform this function. See test case access-control-security-context-privilege-escalation |
Workload requirement
Avoid using any host system resource. See test cases access-control-pod-host-ipc, access-control-pod-host-pid |
Workload requirement
Do not mount host directories for device access. See test case access-control-pod-host-path |
Workload requirement
Do not use host network namespace. See test case access-control-namespace |
Workload requirement
Workloads may not modify the platform in any way. See test cases platform-alteration-base-image, platform-alteration-sysctl-config |
4.6.1. Avoid accessing resources on the host
It is not recommended for an application to access the following resources on the host.
4.6.2. Avoid mounting host directories as volumes
It is not necessary to mount the host /sys/ or /dev/ directory as a volume in a pod in order to use a network device such as an SR-IOV VF. The moving of a network interface into the pod network namespace is done automatically by CNI. Mounting the whole /sys/ or /dev/ directory in the container will overwrite the network device descriptor inside the container, which causes 'device not found' or 'no such file or directory' errors.
Network interface statistics can be queried inside the container using the same /sys/ path as was done when running directly on the host. When running network interfaces in containers, relevant /sys/ statistics interfaces are available inside the container, such as /sys/class/net/net1/statistics/
, /proc/net/tcp
and /proc/net/tcp6
.
For running DPDK applications with SR-IOV VF, device specs (in case of vfio-pci) are automatically attached to the container via the Device Plugin. There is no need to mount the /dev/
directory as a volume in the container as the application can find device specs under /dev/vfio/
in the container.
4.6.3. Avoid the host network namespace
Application pods must avoid using hostNetwork
. Applications may not use the host network, including nodePort for network communication. Any networking needs beyond the functions provided by the pod network ingress/egress proxy must be serviced via a multus connected interface.
Workload requirement
Applications may not use |
4.7. Linux capabilities
Linux Capabilities allow you to break apart the power of root into smaller groups of privileges. The Linux capabilities(7) man page provides a detailed description of how capabilities management is performed in Linux. In brief, the Linux kernel associates various capability sets with threads and files. The thread’s Effective capability set determines the current privileges of a thread.
When a thread executes a binary program, the kernel updates the various thread capability sets according to a set of rules that take into account the UID of the thread before and after the exec system call and the file capabilities of the program being executed.
-
Refer to the blog series for more details about Linux Capabilities and some examples.
-
For Red Hat specific review of Capabilities, see Linux Capabilities in OpenShift.
-
An additional reference is Docker run reference.
Users can choose to specify the required permissions for their running application in the Security Context of the pod specification. In OCP, administrators can use the Security Context Constraint (SCC) admission controller plugin to control the permissions allowed for pods deployed to the cluster. If the pod requests permissions that are not allowed by the SCCs available to that pod, the pod will not be admitted to the cluster.
The following runtime and SCC attributes control the capabilities that will be granted to a new container:
-
The capabilities granted to the CRI-O engine. The default capabilities are listed here
As of Kubernetes version 1.18, CRI-O no longer runs with NET_RAW or SYS_CHROOT by default. See CRI-O v1.18.0.
-
The values in the SCC for
allowedCapabilities
,defaultAddCapabilities
andrequiredDropCapabilities
-
allowPrivilegeEscalation
: controls whether a container can acquire extra privileges through setuid binaries or the file capabilities of binaries
The capabilities associated with a new container are determined as follows:
-
If the container has the UID 0 (root) its Effective capability set is determined according to the capability attributes requested by the pod or container security context and allowed by the SCC assigned to the pod. In this case, the SCC provides a way to limit the capabilities of a root container.
-
If the container has a UID non 0 (non root), the new container has an empty Effective capability set (see #56374). In this case the SCC assigned to the pod controls only the capabilities the container may acquire through the file capabilities of binaries it will execute.
Considering the general recommendation to avoid running root containers, capabilities required by non-root containers are controlled by the pod or container security context and the SCC capability attributes but can only be acquired by properly setting the file capabilities of the container binaries.
Refer to Managing security context constraints for more details on how to define and use the SCC.
4.7.1. DEFAULT capabilities
The default capabilities that are allowed via the restricted SCC are as follows.
-
"CHOWN"
-
"DAC_OVERRIDE"
-
"FSETID"
-
"FOWNER"
-
"SETPCAP"
-
"NET_BIND_SERVICE"
The capabilities "SETGID", "SETUID" & "KILL" have been removed from the default OpenShift capabilities. |
4.7.2. IPC_LOCK
IPC_LOCK capability is required if any of these functions are used in an application:
-
mlock()
-
mlockall()
-
shmctl()
-
mmap()
Even though mlock()
is not necessary on systems where page swap is disabled (for example on OpenShift), it may still be required as it is a function that is built into DPDK libraries, and DPDK based applications may indirectly call it by calling other functions.
See test case access-control-ipc-lock-capability-check
4.7.3. NET_ADMIN
The NET_ADMIN capability is required to perform various network-related administrative operations inside a container, such as:
-
MTU setting
-
Link state modification
-
MAC/IP address assignment
-
IP address flushing
-
Route insertion/deletion/replacement
-
Control network driver and hardware settings via
ethtool
This doesn’t include:
-
Adding/deleting a virtual interface inside a container, for example adding a VLAN interface
-
Setting VF device properties
All the administrative operations (except ethtool
) mentioned above that require the NET_ADMIN
capability should already be supported on the host by various CNIs in Openshift.
CNI TAP can be used to create tap devices and avoid NET_ADMIN.
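As a sketch, a userplane pod that requests NET_ADMIN through its security context could look like the following (name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: userplane-example                    # placeholder name
spec:
  containers:
  - name: app
    image: registry.example.com/upf:latest   # placeholder image
    securityContext:
      capabilities:
        add: ["NET_ADMIN"]                   # only for userplane/SR-IOV/multicast cases
```

The SCC assigned to the pod's service account must allow this capability, or the pod will not be admitted.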
Workload requirement
Only userplane applications or applications using SR-IOV or Multicast can request the NET_ADMIN capability. See test case access-control-net-admin-capability-check |
4.7.4. Avoid SYS_ADMIN
This capability is very powerful and overloaded. It allows the application to perform a range of system administration operations on the host, so you should avoid requiring this capability in your application.
Workload requirement
Applications MUST NOT use the SYS_ADMIN Linux capability. See test case access-control-sys-admin-capability-check |
4.7.5. SYS_NICE
When a workload is running on a node using the real-time kernel, SYS_NICE is used to allow a DPDK application to switch to SCHED_FIFO.
See test case access-control-sys-nice-realtime-capability
4.7.6. SYS_PTRACE
This capability is required when using Process Namespace Sharing. This is used when processes from one Container need to be exposed to another Container. For example, to send signals like SIGHUP from a process in a Container to another process in another Container. See Share Process Namespace between Containers in a Pod for more details. For more information on these capabilities refer to Linux Capabilities in OpenShift.
See test case access-control-sys-ptrace-capability
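As a sketch, a Pod using process namespace sharing together with SYS_PTRACE could look like the following (names and images are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-pidns-example                    # placeholder name
spec:
  shareProcessNamespace: true                   # one PID namespace for all containers
  containers:
  - name: app
    image: registry.example.com/app:latest      # placeholder image
  - name: sidecar
    image: registry.example.com/sidecar:latest  # placeholder image
    securityContext:
      capabilities:
        add: ["SYS_PTRACE"]                     # lets this container signal/inspect app processes
```

With this spec, the sidecar can, for example, send SIGHUP to a process running in the app container.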
4.8. Operations that shall be executed by OpenShift
The application should not require NET_ADMIN
capability to perform the following administrative operations:
4.8.1. Setting the MTU
-
Configure the MTU for the cluster network, also known as the OVN or Openshift-SDN network, by modifying the manifests generated by
openshift-installer
before deploying the cluster. See Changing the MTU for the cluster network for more information. -
Configure additional networks managed by the Cluster Network Operator by using
NetworkAttachmentDefinition
resources generated by the Cluster Network Operator. See Using high performance multicast for more information. -
Configure SR-IOV interfaces by using the SR-IOV Network Operator, see Configuring an SR-IOV network device for more information.
4.8.2. Modifying link state
-
All links should be set up before being attached to a pod.
4.8.3. Assigning IP/MAC addresses
-
For all the networks, the IP/MAC address should be assigned to the interface during pod creation.
-
MULTUS also allows users to override the IP/MAC address. Refer to Attaching a pod to an additional network for more information.
4.8.4. Manipulating pod route tables
-
By default, the default route of the pod will point to the cluster network, with or without the additional networks. MULTUS also allows users to override the default route of the pod. Refer to Attaching a pod to an additional network for more information.
-
Non-default routes can be added to pod routing tables by various IPAM CNI plugins during pod creation.
4.8.5. Setting SR/IOV VFs
The SR-IOV Network Operator also supports configuring the following parameters for SR-IOV VFs. Refer to Configuring an SR-IOV Ethernet network attachment for more information.
-
vlan
-
linkState
-
maxTxRate
-
minRxRate
-
vlanQoS
-
spoofChk
-
trust
4.8.6. Configuring multicast
In OpenShift, multicast is supported for both the default interface (OVN or OpenShift-SDN) and the additional interfaces such as macvlan, SR-IOV, etc. Multicast is disabled by default. To enable it, refer to the following procedures:
If your application works as a multicast source and you want to utilize the additional interfaces to carry the multicast traffic, then you don’t need the NET_ADMIN
capability. Follow the instructions in Using high performance multicast to set the correct multicast route in the pod’s routing table.
4.9. Operations that cannot be executed by OpenShift
All the CNI plugins are only invoked during pod creation and deletion. If your workload needs to perform any of the operations mentioned above at runtime, the NET_ADMIN
capability is required.
There are some other functionalities that are not currently supported by any of the OpenShift components which also require NET_ADMIN
capability:
-
Link state modification at runtime
-
IP/MAC modification at runtime
-
Manipulating a pod’s route table or firewall rules at runtime
-
SR/IOV VF setting at runtime
-
Netlink configuration
-
For example,
ethtool
can be used to configure things like rxvlan, txvlan, gso, tso, etc. -
Multicast
If your application works as a receiving member of IGMP groups, you need to specify the NET_ADMIN capability in the pod manifest so that the app is allowed to assign multicast addresses to the pod interface and join an IGMP group.
-
Set
SO_PRIORITY
to a socket to manipulate the 802.1p priority in ethernet frames -
Set
IP_TOS
to a socket to manipulate the DSCP value of IP packets
4.10. Analyzing your application
To find out which capabilities the application needs, Red Hat has developed a SystemTap script (container_check.stp
). With this tool, the workload developer can find out what capabilities an application requires in order to run in a container. It also shows the syscalls which were invoked. Find more info at https://linuxera.org/capabilities-seccomp-kubernetes/
Another tool is capable
which is part of the BCC tools. It can be installed on RHEL8 with dnf install bcc
.
4.11. Finding the capabilities that an application needs
Here is an example of how to find out the capabilities that an application needs. testpmd
is a DPDK based layer-2 forwarding application. It needs the CAP_IPC_LOCK
to allocate the hugepage memory.
-
Use container_check.stp. We can see
CAP_IPC_LOCK
andCAP_SYS_RAWIO
are requested bytestpmd
and the relevant syscalls.$ $ /usr/share/systemtap/examples/profiling/container_check.stp -c 'testpmd -l 1-2 -w 0000:00:09.0 -- -a --portmask=0x8 --nb-cores=1'
Example output:
[...]
capabilities used by executables
executable: prob capability
testpmd: cap_ipc_lock
testpmd: cap_sys_rawio
capabilities used by syscalls
executable, syscall ( capability ) : count
testpmd, mlockall ( cap_ipc_lock ) : 1
testpmd, mmap ( cap_ipc_lock ) : 710
testpmd, open ( cap_sys_rawio ) : 1
testpmd, iopl ( cap_sys_rawio ) : 1
failed syscalls
executable, syscall = errno: count
eal-intr-thread, epoll_wait = EINTR: 1
lcore-slave-2, read = : 1
rte_mp_handle, recvmsg = : 1
stapio, = EINTR: 1
stapio, execve = ENOENT: 3
stapio, rt_sigsuspend = : 1
testpmd, flock = EAGAIN: 5
testpmd, stat = ENOENT: 10
testpmd, mkdir = EEXIST: 2
testpmd, readlink = ENOENT: 3
testpmd, access = ENOENT: 1141
testpmd, openat = ENOENT: 1
testpmd, open = ENOENT: 13
[...]
-
Use the capable command:
$ /usr/share/bcc/tools/capable
-
Start the testpmd application from another terminal, and send some test traffic to it. For example:
$ testpmd -l 18-19 -w 0000:01:00.0 -- -a --portmask=0x1 --nb-cores=1
-
Check the output of the capable command. Below, CAP_IPC_LOCK was requested for running testpmd.
[...]
0:41:58 0 3591 testpmd CAP_IPC_LOCK 1
0:41:58 0 3591 testpmd CAP_IPC_LOCK 1
0:41:58 0 3591 testpmd CAP_IPC_LOCK 1
0:41:58 0 3591 testpmd CAP_IPC_LOCK 1
0:41:58 0 3591 testpmd CAP_IPC_LOCK 1
0:41:58 0 3591 testpmd CAP_IPC_LOCK 1
0:41:58 0 3591 testpmd CAP_IPC_LOCK 1
0:41:58 0 3591 testpmd CAP_IPC_LOCK 1
0:41:58 0 3591 testpmd CAP_IPC_LOCK 1
0:41:58 0 3591 testpmd CAP_IPC_LOCK 1
0:41:58 0 3591 testpmd CAP_IPC_LOCK 1
0:41:58 0 3591 testpmd CAP_IPC_LOCK 1
0:41:58 0 3591 testpmd CAP_IPC_LOCK 1
[...]
-
Also, try to run testpmd without CAP_IPC_LOCK set, using capsh. Now we can see that the hugepage memory cannot be allocated.
$ capsh --drop=cap_ipc_lock -- -c testpmd -l 18-19 -w 0000:01:00.0 -- -a --portmask=0x1 --nb-cores=1
Example output:
EAL: Detected 24 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: No free hugepages reported in hugepages-1048576kB
EAL: Probing VFIO support...
EAL: VFIO support initialized
EAL: PCI device 0000:01:00.0 on NUMA socket 0
EAL: probe driver: 8086:10fb net_ixgbe
EAL: using IOMMU type 1 (Type 1)
EAL: Ignore mapping IO port bar(2)
EAL: PCI device 0000:01:00.1 on NUMA socket 0
EAL: probe driver: 8086:10fb net_ixgbe
EAL: PCI device 0000:07:00.0 on NUMA socket 0
EAL: probe driver: 8086:1521 net_e1000_igb
EAL: PCI device 0000:07:00.1 on NUMA socket 0
EAL: probe driver: 8086:1521 net_e1000_igb
EAL: cannot set up DMA remapping, error 12 (Cannot allocate memory)
testpmd: mlockall() failed with error "Cannot allocate memory"
testpmd: create a new mbuf pool <mbuf_pool_socket_0>: n=331456, size=2176, socket=0
testpmd: preferred mempool ops selected: ring_mp_mc
EAL: cannot set up DMA remapping, error 12 (Cannot allocate memory)
testpmd: create a new mbuf pool <mbuf_pool_socket_1>: n=331456, size=2176, socket=1
testpmd: preferred mempool ops selected: ring_mp_mc
EAL: cannot set up DMA remapping, error 12 (Cannot allocate memory)
EAL: cannot set up DMA remapping, error 12 (Cannot allocate memory)
4.12. Image Security
Images will be scanned for vulnerabilities during the Red Hat certification process.
Images must include digital signatures that allow validating that the image comes from an authorized vendor, is part or all of an authorized CNF delivered by that vendor, has a current component version, and has not been modified since signing. At a minimum, the signature must cover the container base image as well as the entire container contents. Accompanying software artifacts, such as Helm charts and shell scripts, must be similarly signed individually.
4.13. Securing workload networks
Workloads must have the least permissions possible and must implement Network Policies that drop all traffic by default and permit only the relevant ports and protocols to the narrowest ranges of addresses possible.
Workload requirement
Applications must define network policies that permit only the minimum network access the application needs to function. See test case networking-network-policy-deny-all
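A default deny-all policy followed by narrow allow rules can be sketched as below. The namespace, labels, and port are illustrative placeholders; adapt them to the application:

```yaml
# Deny all ingress and egress traffic for every pod in the namespace by default.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: <application_namespace>
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
---
# Then permit only the specific traffic the application needs, for example
# ingress to TCP port 8080 from pods within the same namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-app-ingress
  namespace: <application_namespace>
spec:
  podSelector:
    matchLabels:
      app: some-app
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector: {}
    ports:
    - protocol: TCP
      port: 8080
```

Because policies are additive, the deny-all policy stays in place and each additional policy opens only the specific flows the workload requires.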
4.13.1. Managing secrets
Secret objects in OpenShift provide a way to hold sensitive information such as passwords, configuration files, and credentials. There are four types of secrets: service account, basic authentication, SSH authentication, and TLS. Secrets can be added via deployment configurations or consumed by pods directly. For more information on secrets and examples, see the following documentation.
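As an illustration, a basic-authentication secret can be defined and mounted into a pod as follows. The secret name, credentials, and image are placeholders:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: example-basic-auth
type: kubernetes.io/basic-auth
stringData:
  username: <username>
  password: <password>
---
apiVersion: v1
kind: Pod
metadata:
  name: example-app
spec:
  containers:
  - name: app
    image: <application_image>
    volumeMounts:
    # The secret's keys appear as files under /etc/credentials.
    - name: credentials
      mountPath: /etc/credentials
      readOnly: true
  volumes:
  - name: credentials
    secret:
      secretName: example-basic-auth
```

Mounting secrets as read-only volumes (rather than environment variables) keeps credentials out of the process environment and pod descriptions.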
4.13.1.1. Setting SCC permissions for applications
Permission to use an SCC is granted by creating a cluster role that has use permissions for the SCC, and then creating role bindings within a namespace that bind the users who need that SCC to the role. Application admins can create their own roles and role bindings to assign the permission to a service account.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: "restricted-scc-cat-1-role"
labels:
app: some-app
rules:
- apiGroups:
- security.openshift.io
resourceNames:
- restricted-cat-1
resources:
- securitycontextconstraints
verbs:
- use
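A matching RoleBinding then grants the cluster role to the service accounts that need the SCC. The namespace and service account names below are illustrative:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: restricted-scc-cat-1-binding
  namespace: <application_namespace>
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: restricted-scc-cat-1-role
subjects:
# Each service account that needs the SCC is listed as a subject.
- kind: ServiceAccount
  name: <application_service_account>
  namespace: <application_namespace>
```

Because the binding is namespaced, the SCC permission applies only within that namespace even though the role itself is cluster-scoped.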
4.13.1.2. SCC Application Categories
- Workloads that do not require advanced networking features (Category 1)
-
This is the default SCC for all users if your namespace does not use service mesh.
This kind of Workload shall:
-
Use the default CNI (OVN) network interface
-
Not request NET_ADMIN or NET_RAW for advanced networking functions
SCC definition (default):
kind: SecurityContextConstraints
apiVersion: security.openshift.io/v1
metadata:
name: restricted-cat-1
users: []
groups: []
priority: null
allowHostDirVolumePlugin: false
allowHostIPC: false
allowHostNetwork: false
allowHostPID: false
allowHostPorts: false
allowPrivilegeEscalation: true
allowPrivilegedContainer: false
allowedCapabilities: null
defaultAddCapabilities: null
requiredDropCapabilities:
- KILL
- MKNOD
- SETUID
- SETGID
- NET_RAW
fsGroup:
type: MustRunAs
readOnlyRootFilesystem: false
runAsUser:
type: MustRunAsRange
seLinuxContext:
type: MustRunAs
supplementalGroups:
type: RunAsAny
volumes:
- configMap
- downwardAPI
- emptyDir
- persistentVolumeClaim
- projected
- secret
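A pod securityContext compatible with this SCC, and with the non-root and no-privilege-escalation requirements stated earlier, can be sketched as follows (the image name is a placeholder):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cat-1-app
spec:
  containers:
  - name: app
    image: <application_image>
    securityContext:
      # Satisfies the MustRunAsRange/MustRunAsNonRoot constraints.
      runAsNonRoot: true
      # Stricter than the SCC requires, but matches the workload
      # expectation that privilege escalation is not used.
      allowPrivilegeEscalation: false
      capabilities:
        # Includes the SCC's required drops (KILL, MKNOD, SETUID,
        # SETGID, NET_RAW) and everything else.
        drop: ["ALL"]
```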
- Workloads that require Service Mesh (Category 1 - no-uid0)
-
Workloads that utilize Service Mesh sidecars for mTLS and load balancing features must use an alternative to the restricted SCC Category 1 defined in section “SCC Application Categories”, due to a Service Mesh limitation that requires the sidecar to run with a specific UID (1337). An alternate SCC called restricted-no-uid0 is provided in the platform and is available to all tenants without a special request.
kind: SecurityContextConstraints
apiVersion: security.openshift.io/v1
metadata:
  name: restricted-no-uid0
allowHostDirVolumePlugin: false
allowHostIPC: false
allowHostNetwork: false
allowHostPID: false
allowHostPorts: false
allowPrivilegeEscalation: true
allowPrivilegedContainer: false
allowedCapabilities: null
defaultAddCapabilities: null
fsGroup:
  type: RunAsAny
groups:
- system:authenticated
priority: null
readOnlyRootFilesystem: false
requiredDropCapabilities:
- KILL
- MKNOD
- SETUID
- SETGID
runAsUser:
type: MustRunAsNonRoot
seLinuxContext:
type: MustRunAs
supplementalGroups:
type: RunAsAny
users: []
volumes:
- configMap
- downwardAPI
- emptyDir
- persistentVolumeClaim
- projected
- secret
- Workloads that require advanced networking features (Category 2)
-
Workloads with the following characteristics may fall into this category:
-
Manipulate the low-level protocol flags, such as the 802.1p priority, VLAN tag, DSCP value, etc.
-
Manipulate the interface IP addresses or the routing table or the firewall rules on-the-fly.
-
Process Ethernet packets
-
This kind of Workload may:
-
Use a Macvlan interface to send and receive Ethernet packets
-
Request CAP_NET_RAW for creating raw sockets
-
Request CAP_NET_ADMIN for
-
Modifying the interface IP address on-the-fly
-
Manipulating the routing table on-the-fly
-
Manipulating firewall rules on-the-fly.
-
Setting the packet DSCP value
-
Recommended SCC definition:
kind: SecurityContextConstraints
apiVersion: security.openshift.io/v1
metadata:
name: cnf-catalog-2
users: []
groups: []
priority: null
allowHostDirVolumePlugin: false
allowHostIPC: false
allowHostNetwork: false
allowHostPID: false
allowHostPorts: false
allowPrivilegeEscalation: true
allowPrivilegedContainer: false
allowedCapabilities: [NET_ADMIN, NET_RAW]
defaultAddCapabilities: null
requiredDropCapabilities:
- KILL
- MKNOD
- SETUID
- SETGID
fsGroup:
type: MustRunAs
readOnlyRootFilesystem: false
runAsUser:
type: MustRunAsRange
seLinuxContext:
type: MustRunAs
supplementalGroups:
type: RunAsAny
volumes:
- configMap
- downwardAPI
- emptyDir
- persistentVolumeClaim
- projected
- secret
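A container admitted under this SCC requests the advanced networking capabilities explicitly in its securityContext, for example (the image name is a placeholder):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cat-2-app
spec:
  containers:
  - name: app
    image: <application_image>
    securityContext:
      capabilities:
        # Only the capabilities the SCC's allowedCapabilities list
        # permits; anything else is rejected at admission.
        add: ["NET_ADMIN", "NET_RAW"]
```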
- User-Plane Workloads (Category 3)
-
A Workload which handles user plane traffic or latency-sensitive payloads at line rate falls into this category, such as load balancing, routing, deep packet inspection, and so on. Some of these CNFs may also need to process the packets at a lower level.
This kind of Workload shall:
-
Use SR-IOV interfaces
-
Fully or partially bypass the kernel networking stack with userspace networking technologies, like DPDK, F-stack, VPP, OpenFastPath, etc. A userspace networking stack can not only improve performance but also reduce the need for CAP_NET_ADMIN and CAP_NET_RAW.
For Mellanox devices, those capabilities are requested if the application needs to configure the device.
As CAP_IPC_LOCK is mandatory for allocating hugepage memory, this capability shall be granted to DPDK-based applications. Additionally, if the workload is latency-sensitive and needs the determinacy provided by the real-time kernel, the CAP_SYS_NICE capability would also be required.
Here is an example pod manifest for a DPDK application:
apiVersion: v1
kind: Pod
metadata:
name: dpdk-app
namespace: <target_namespace>
annotations:
k8s.v1.cni.cncf.io/networks: dpdk-network
spec:
containers:
- name: testpmd
image: <DPDK_image>
securityContext:
capabilities:
add: ["IPC_LOCK"]
volumeMounts:
- mountPath: /dev/hugepages
name: hugepage
resources:
limits:
openshift.io/mlxnics: "1"
memory: "1Gi"
cpu: "4"
hugepages-2Mi: "4Gi"
requests:
openshift.io/mlxnics: "1"
memory: "1Gi"
cpu: "4"
hugepages-2Mi: "4Gi"
command: ["sleep", "infinity"]
volumes:
- name: hugepage
emptyDir:
medium: HugePages
More information can be found in the OpenShift documentation.
Recommended SCC definition:
kind: SecurityContextConstraints
apiVersion: security.openshift.io/v1
metadata:
name: cnf-catalog-3
users: []
groups: []
priority: null
allowHostDirVolumePlugin: false
allowHostIPC: false
allowHostNetwork: false
allowHostPID: false
allowHostPorts: false
allowPrivilegeEscalation: true
allowPrivilegedContainer: false
allowedCapabilities: [IPC_LOCK, NET_ADMIN, NET_RAW]
defaultAddCapabilities: null
requiredDropCapabilities:
- KILL
- MKNOD
- SETUID
- SETGID
fsGroup:
type: MustRunAs
readOnlyRootFilesystem: false
runAsUser:
type: MustRunAsRange
seLinuxContext:
type: MustRunAs
supplementalGroups:
type: RunAsAny
volumes:
- configMap
- downwardAPI
- emptyDir
- persistentVolumeClaim
- projected
- secret
5. Additional resources
Copyright
© Copyright 2023 Red Hat Inc.
All Rights Reserved.
Information contained herein is provided AS IS and subject to change without notice. All trademarks used herein are property of their respective owners.