The millions of devices that make up the Internet of Things (IoT) today reside not in the cloud but on-premises, from retail stores to factory floors. Many of them are tiny edge leaf devices such as cameras (IP cameras, USB cameras, etc.) and sensors (smart heat sensors on equipment, etc.). It has become increasingly difficult and impractical for developers to write bespoke solutions to detect and use each of these devices, and for core DevOps/deployment teams to manage the applications, and the clusters hosting them, that sit in far edge sites close to end users.
Most of these devices, however, are too small to run Kubernetes themselves. How can they be leveraged by a Kubernetes workload and how can a private cluster be managed from the cloud?
Anthos clusters on VMware, also called GKE on-prem, was an offering that enabled users to run GKE in their own datacenters, but that model requires vCenter Server and vSphere. Anthos on bare metal instead uses a "bring your own operating system" model: it runs atop physical or virtual instances and supports Red Hat Enterprise Linux 8.1/8.2, CentOS 8.1/8.2, and Ubuntu 18.04/20.04 LTS.
Akri, an open-source project from Deislabs (Microsoft), defines a Kubernetes-native approach to representing leaf devices on the edge as Kubernetes resources. It provides an abstraction layer similar to the Container Network Interface, but instead of abstracting the underlying network details, it removes the work of finding, utilizing, and monitoring the availability of leaf devices.
Consider a nationwide retail chain that needs to run camera-based applications (general store monitoring, sentiment analysis, gender analysis, mask detection, self-checkout, etc.) in its stores, let users such as store managers and local business development teams access the applications locally, and let the core team operate and manage the applications from HQ over a public cloud platform. In this scenario, the stores can be considered edge sites.
In this setting, the viable option is to use small-footprint devices like Intel NUCs, or any other SBCs with enough capacity to run Kubernetes, with USB cameras connected to them.
Anthos clusters on Bare Metal deploys Kubernetes and connects the private bare metal cluster at the edge site to Google Cloud. In the example use case above, HQ can image and pre-deploy/configure the NUCs with Anthos on Bare Metal, or use some form of automation (systemd, rc.local, etc.) to bootstrap the cluster. Once powered on, the cluster is automatically registered to the user's project on GCP using pre-configured service account keys, and the HQ team can then take over to deploy, manage, and update applications on the edge site. In this scenario, the local teams need no technical skills; they simply consume the end application.
In this post, Anthos Config Management is used to deploy the Akri components to the remote/private cluster from the Anthos portal on GCP. Once the deployment is complete, local users can access the cameras' footage using the local load balancer IP.
This is just an example scenario to assess the capabilities and usage of Anthos clusters on Bare Metal and Akri for managing Kubernetes and leaf devices at the edge.
Anthos on bare metal is a deployment option to run Anthos on physical servers, deployed on an operating system provided by the user, without a hypervisor layer. Anthos on bare metal ships with built-in networking, lifecycle management, diagnostics, health checks, logging, and monitoring. Users can deploy the cluster on supported CentOS, Red Hat Enterprise Linux (RHEL), and Ubuntu versions, all validated by Google.
With Anthos on bare metal, users can leverage existing investments and use the company’s standard hardware and operating system images, which are automatically checked and validated against Anthos infrastructure requirements. Anthos on bare metal follows the same billing model as all managed Anthos clusters and is based on the number of Anthos cluster vCPUs, charged on an hourly basis.
Users can deploy Anthos on bare metal using one of the following deployment models:
A standalone model allows users to manage every cluster independently. This is a good choice when running in an edge location or when clusters should be administered independently of one another. A multi-cluster model allows users to manage a fleet of clusters (user clusters) from a centralized cluster called the admin cluster, and is suitable for managing multiple clusters from a central location. A hybrid model, which is the one used later in this post, is an admin cluster that also runs user workloads and can still create user clusters if needed.
To meet different requirements, Anthos clusters on bare metal supports all three deployment models, each built on the admin cluster and user cluster constructs described above.
The topology in this post comprises three Intel NUCs. Two of them (16 GB RAM, 4 vCPUs, 250 GB SSD each) are used as the master and a node (Anthos publishes specific hardware requirements). The deployment model here is hybrid: the cluster is an admin cluster that also allows running user workloads, and it can be reused as an admin cluster to create user clusters if needed. The third NUC (2 vCPUs, 8 GB RAM) is used as a workstation where all the deployment operations are performed; it hosts the temporary KIND (Kubernetes in Docker) bootstrap cluster.
All the machines in the cluster are connected to a home network (private) with outbound connectivity to the internet. All the NUCs obtain IPs from the home network (node_ip and master_host_ip). Anthos clusters on bare metal use MetalLB running in L2 mode for data plane load balancing. Users can carve out a block of contiguous IP addresses on the private network to allocate to Kubernetes Services of type LoadBalancer, select a "start" IP address in that block as the ingressVIP, and pick another IP outside the range as the controlPlaneVIP.
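As an illustration, the load balancer section of the bmctl cluster config might look like the following sketch; the addresses are placeholder values for an example home network and should be adjusted to the actual private ranges:

```yaml
# Excerpt of the Cluster spec in the bmctl config: bundled L2 (MetalLB) load balancing
loadBalancer:
  mode: bundled
  ports:
    controlPlaneLBPort: 443
  vips:
    controlPlaneVIP: 192.168.1.200    # outside the address pool below
    ingressVIP: 192.168.1.240         # "start" IP of the pool below
  addressPools:
  - name: lb-pool-1
    addresses:
    - 192.168.1.240-192.168.1.250     # block reserved for LoadBalancer Services
```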
Anthos clusters on bare metal deploy L4 load balancers that run on either a dedicated pool of worker nodes or on the same nodes as the control plane. The above topology deploys LB in ‘bundled’ mode where a load balancer will be installed on load balancer nodes during cluster creation.
Anthos provides overlay networking and L4/L7 load balancing out of the box. Customers can also integrate with their own load balancers such as F5 and Citrix. For storage, users can deploy persistent workloads using CSI integration with their existing infrastructure.
Anthos clusters on bare metal are installed with the bmctl utility, which enables users to create a cluster with minimal effort.
The bmctl utility reads all the configuration from a config file; a template can be created using 'bmctl create config'. Users can pass '--enable-apis --create-service-accounts' to automatically enable the APIs and create the service accounts required for authenticating to the project in Google Cloud. The configuration file includes the key paths and SSH keys used to reach the master and node hosts to deploy the components, as well as NodePools, AddressPools, and storage-related configuration.
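A minimal sketch of that workflow, assuming a hypothetical cluster name edge-cluster-1 and that bmctl is run from the workstation:

```sh
# Generate a config template, enabling required APIs and creating service accounts
./bmctl create config -c edge-cluster-1 --enable-apis --create-service-accounts --project-id=<PROJECT_ID>

# Edit bmctl-workspace/edge-cluster-1/edge-cluster-1.yaml (SSH keys, node IPs, load balancer pools, ...)

# Create the cluster from the edited config
./bmctl create cluster -c edge-cluster-1
```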
During installation, Anthos on Bare Metal creates a temporary bootstrap cluster (a KIND cluster in which the components required to create the first admin/user cluster are deployed); in the scenario above this bootstrap cluster runs on the workstation host. After a successful installation the bootstrap cluster is deleted, leaving users with the admin and user cluster. There is generally no reason to interact with this cluster, but if needed it can be kept by passing '--cleanup-external-cluster=false'.
Cluster creation follows the cluster-api pattern with a custom GKE bare metal provider: the bootstrap KIND cluster running on the workstation uses the generated CRDs to bootstrap the hybrid cluster (or any other deployment model). The CRDs are also created on the admin cluster, which then enables users to create user clusters.
Cluster deployment steps:
All configuration files, along with preflight check results, cluster creation logs, and the kubeconfig required to access the main cluster, are created on the workstation in a specific directory.
Once the cluster is ready the bootstrap cluster is deleted, and users can start using the newly created hybrid cluster, including using it to create multiple user clusters as required. If the deployment model is standalone, a user cluster is created without any machinery (admin cluster) for creating further user clusters.
As the cluster above is deployed in hybrid mode, it includes the admin cluster machinery.
Cluster API components:
BareMetalCluster CRD:
BareMetalMachine CRD (as this is a hybrid cluster, the NUCs are added as objects):
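To inspect these objects directly from the workstation, something like the following can be used; the kubeconfig path is the one bmctl generates and the cluster name edge-cluster-1 is the hypothetical one from earlier:

```sh
# List the cluster-api objects created by the bare metal provider
kubectl --kubeconfig bmctl-workspace/edge-cluster-1/edge-cluster-1-kubeconfig \
  get baremetalclusters,baremetalmachines --all-namespaces
```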
Anthos clusters on bare metal are connected to Google Cloud. This connection lets users manage and observe the isolated clusters from the Cloud Console by using Connect. As part of cluster deployment a Connect agent is deployed in the cluster, and the bmctl utility automatically creates the service accounts required for establishing the connection (users who need more control can manually create the connect-agent, connect-register, and logging-monitoring service accounts) and registers the cluster to the Google Cloud GKE Hub.
Connect is an Anthos feature, not specific to Anthos on bare metal, which allows users to connect any Kubernetes cluster to Google Cloud regardless of where it is deployed. This gives access to cluster and workload management features, including a unified user interface, the Cloud Console, for interacting with the cluster. Connect allows Anthos Config Management to install or update the in-cluster agent and observe the sync status, and it also lets the metering agent observe the number of vCPUs in a connected cluster.
This is pivotal for a topology where applications at far edge sites in isolated network spaces are managed from a central location (HQ, a regional IT operations center, etc.).
The Connect Agent can be configured to traverse NATs, egress proxies, and firewalls to establish a long-lived, encrypted connection between the isolated bare metal cluster’s Kubernetes API server and Google Cloud project.
The connect-agent deployed on the bare metal cluster:
The anthos-creds namespace holds required secrets to connect with GCP services.
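For a quick check from the cluster side, the Connect agent and the credentials namespace can be inspected; the commands below assume the standard gke-connect namespace that the agent is deployed into:

```sh
# Verify the Connect agent is running and inspect the credentials namespace
kubectl get pods -n gke-connect
kubectl get secrets -n anthos-creds
```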
Anthos portal showing the registered cluster (before sign-in); the cluster information in the right panel is not shown until the user logs in:
Kubernetes Engine portal showing the registered cluster (before sign-in):
Users can use the Google Cloud Console to sign in to registered clusters in one of three ways: basic authentication, a bearer token, or an OIDC provider. In the scenario below, an admin user associated with the cluster-admin role is created and its token is used to log in (users can instead create a read-only cluster role depending on the level of operation required).
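A minimal sketch of that setup, assuming a hypothetical service account named console-admin bound to cluster-admin (the bearer token is then read from the service account's secret and pasted into the Cloud Console login dialog):

```yaml
# Service account whose bearer token is used to log in from the Cloud Console
apiVersion: v1
kind: ServiceAccount
metadata:
  name: console-admin          # hypothetical name
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: console-admin-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin          # use a read-only role for least privilege
subjects:
- kind: ServiceAccount
  name: console-admin
  namespace: kube-system
```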
Logging in to the cluster using the token:
As shown below, once the log-in is successful all the management options are enabled and cluster information is displayed. The info also shows the GKE Hub membership ID; users can run 'gcloud container hub memberships list' to list all connected clusters.
Anthos portal showing cluster information and the list of cluster features:
Once the Connect agent connects the cluster, users can use the Kubernetes Engine portal as a dashboard to access all Kubernetes components and manage workloads on the private Anthos on bare metal cluster as if they were on GKE.
The Anthos dashboard enables users to deploy Anthos-related components (Features) such as ACM and Service Mesh (it provides a catalog of features and instructions), but otherwise its functionality is limited to verifying cluster connectivity and high-level details such as cluster size and nodes. The Kubernetes Engine portal is the main cluster management entity: it lets users perform core Kubernetes operations such as creating, editing, updating, and deleting objects, access configuration objects (Secrets/ConfigMaps), deploy applications from the marketplace, browse objects, and manage or list storage elements such as PVs/PVCs.
Users can edit any Kubernetes object and the same will be reflected on the bare metal cluster.
Accessing cluster Services and Ingress:
Anthos clusters on bare metal use the local volume provisioner (LVP) to manage local persistent volumes. Three storage classes are created for local PVs in an Anthos clusters on bare metal deployment: LVP share (creates a local PV backed by subdirectories, created during cluster setup, in a local shared file system), LVP node mounts (creates a local PV for each mounted disk in the configured directory), and Anthos system (creates pre-configured local PVs during cluster creation that are used by Anthos system pods).
Accessing PVCs and available storage classes:
The Anthos portal lets users enable supported features on the connected clusters:
Logging and metrics agents are installed and activated in each cluster when users create a new admin or user cluster. The Stackdriver operator manages the lifecycle of all the other Stackdriver-related components (log-aggregator, log-forwarder, metadata-agent, and prometheus) deployed onto the cluster. Users can set up a Cloud Monitoring Workspace within the Cloud project using the monitoring portal to access logs of all system and Kubernetes components of the remote bare metal clusters from the console.
The Stackdriver collector for Prometheus constructs a Cloud Monitoring MonitoredResource for the Kubernetes objects from well-known Prometheus labels. A separate entity 'stackdriver-prometheus-app' is deployed with the stack and can be configured to monitor the applications in the cluster.
Google Cloud's operations suite is the built-in observability solution for Google Cloud. It offers a fully managed logging solution, metrics collection, monitoring, dashboards, and alerting. Cloud Monitoring uses Workspaces to organize and manage its information.
Integrated logging:
Akri discovers heterogeneous leaf devices, exposes them as resources, and creates a service for each device in the Kubernetes cluster. This enables applications running on Kubernetes to consume the inputs from the devices. Akri handles the automatic inclusion and removal of devices, as well as the allocation and deallocation of resources, to better optimize the cluster.
Akri is a standard Kubernetes extension implemented using two custom resource definitions (Configuration and Instance), an agent, and a controller. Once Akri is installed, users write an Akri Configuration for the device; the device plugin acts as an agent, finding hardware that matches the Configuration using the supported discovery protocols, and the Akri controller then manages the device through a broker running in a Kubernetes pod that exposes the device and its APIs to application code.
Akri is built on the Kubernetes device plugin framework, which gives vendors a mechanism (device plugins) to advertise devices, monitor them (e.g., health checks), hook them into the runtime to execute device-specific instructions (e.g., clean GPU memory, capture video), and make them available inside containers. This enables vendors to advertise and monitor their resources without writing additional code.
Akri currently supports the following protocols: udev (to discover anything in the Linux device file system), ONVIF (to discover IP cameras), and OPC UA (to discover OPC UA servers). Protocols like Bluetooth, simple scans for IP/MAC addresses, LoRaWAN, and Zeroconf are on the roadmap.
In this post, the udev (userspace /dev) protocol is used to discover USB cameras connected to two nodes of the Anthos on Bare Metal Kubernetes cluster.
Udev manages device nodes in the /dev directory, such as microphones, security chips, USB cameras, and so on. Udev can be used to find devices that are attached to or embedded in nodes. Akri’s udev discovery handler parses the udev rules listed in a Configuration, searches for matching devices using udev, and returns a list of discovered device nodes (e.g., /dev/video0). Users tell Akri which device(s) to find by passing udev rules in the Configuration spec.
A user can also allow multiple nodes to utilize a leaf device, thereby providing high availability in the case where a node goes offline. Furthermore, Akri will automatically create a Kubernetes service for each type of leaf device (or Akri Configuration), removing the need for an application to track the state of pods or nodes.
For example, the following is the information for the attached USB camera obtained with ‘udevadm’ on one of the nodes:
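The query looks roughly like this, assuming the camera shows up as /dev/video0 on that node:

```sh
# Inspect the udev attributes of the USB camera attached to this node
udevadm info --query=all --name=/dev/video0

# Walk the device's parent attributes (useful when writing udev match rules)
udevadm info --attribute-walk --name=/dev/video0
```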
Users can use Akri’s pre-defined grammar here to write specific udev rules in the protocol section of Akri’s Configuration spec, defining which devices should be discovered on a node.
The Akri agent, deployed as a DaemonSet, handles resource availability changes and enables resource sharing; these tasks let Akri find the configured resources (leaf devices), expose them to the Kubernetes cluster for workload scheduling, and allow resources to be shared by multiple nodes. The agent discovers resources via Discovery Handlers (DHs).
The Akri controller, deployed on the master node of the cluster, enables cluster access to the leaf devices and handles node disappearances. Broker pods are created automatically based on the protocol chosen and the capacity (resource sharing) defined.
The topology below illustrates a classic example of how GitOps can be configured and how applications can be deployed from cloud to edge. Akri is deployed on the Anthos on Bare Metal cluster in an isolated network space using Anthos Config Management from the cloud. Akri then takes care of all the other aspects required to discover the connected USB cameras and lets the streaming application consume them. End users can use the LB_IP (local network) to view the footage from the streaming app running at the far edge site (in this scenario the far edge site is the NUCs running the Anthos on Bare Metal Kubernetes cluster).
In the topology below an admin user provides ‘ConfigManagement’ spec to Anthos Config Management, the spec contains the cluster_selector (in this case the Anthos bare metal cluster is selected) and git information which holds all the deployment manifests of Akri.
ACM takes care of deploying the manifests in the specified git_repo to the private bare metal cluster and maintains sync with the git branch. The Akri machinery detects the USB cameras connected to the nodes (here both hosts serve as nodes, with the master taint removed) and local users can access the footage using the LB_IP on the local network.
ACM can be configured from the Anthos dashboard on the GCP console. By default all the clusters that are part of the project are considered for the update; users can target specific clusters or apply the config to all of them.
1. Git repository authentication:
2. Config Sync - holds information such as the git_url and the directory with the configuration manifests. A sample repo is used here; a sketch of the resulting ConfigManagement object is shown below.
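A minimal sketch of the ConfigManagement object that such a setup produces; the repository URL and branch are placeholders, and the field names follow the configmanagement.gke.io/v1 API:

```yaml
# ConfigManagement spec pointing the private cluster at the Akri manifests repo
apiVersion: configmanagement.gke.io/v1
kind: ConfigManagement
metadata:
  name: config-management
spec:
  sourceFormat: hierarchy          # hierarchical repo layout (cluster/, namespaces/, system/)
  git:
    syncRepo: https://github.com/<user>/<akri-acm-repo>   # placeholder repo URL
    syncBranch: master
    secretType: none               # public repo; use ssh/token for private repos
    policyDir: "."
```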
The repo configured above includes all the manifests required to deploy Akri on Kubernetes. The format follows the ACM-specified layout: the cluster directory includes all cluster-level objects (CRDs, ClusterRoles, ClusterRoleBindings, etc.), the namespaces directory includes a sub-directory named after each namespace, which holds the namespace object and all other namespaced objects (Deployments, DaemonSets, Services, etc.), and the system directory contains configs for the Operator.
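The resulting layout is roughly the following; the akri namespace directory name is an assumption and mirrors whatever namespace the sample repo uses:

```
cluster/                # cluster-scoped objects: Akri CRDs, ClusterRole, ClusterRoleBinding
namespaces/
  akri/                 # namespace object plus Deployments, DaemonSets, Services
system/                 # Config Management Operator configs (repo format, version)
```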
Once the ACM spec is submitted with a cluster_selector (in this case the Anthos bare metal cluster is selected), the controller on the bare metal cluster creates the required git-importer and monitor pods in the config-management-system namespace, which perform all GitOps-related tasks on the cluster.
Once the installation is complete, users can check the sync status from the Anthos Config Management section of the Anthos dashboard or with ‘nomos status’. As shown below, once the sync is complete all the Akri components are deployed on the private Anthos on Bare Metal cluster and can be verified from the Workloads section of Kubernetes Engine.
The Akri installation adds two CRDs to the cluster:
An Akri Configuration tells the Akri controller what kind of leaf device to discover. In the sample below (akri-udev-video), the udev protocol is used to discover USB cameras.
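A minimal sketch of such a Configuration is shown here, assuming the akri.sh/v0 API and the udev-video broker image published by the project; field names and image tags vary between Akri releases:

```yaml
# Sketch of an Akri Configuration for discovering USB cameras via udev
apiVersion: akri.sh/v0
kind: Configuration
metadata:
  name: akri-udev-video
spec:
  protocol:
    udev:
      udevRules:
      - 'KERNEL=="video[0-9]*"'     # match V4L device nodes such as /dev/video0
  capacity: 5                        # how many brokers may share each camera
  brokerPodSpec:
    containers:
    - name: akri-udev-video-broker
      image: ghcr.io/deislabs/akri/udev-video-broker:latest   # assumed image path
```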
The Akri agent, an implementation of the Kubernetes device plugin framework, searches for the leaf devices and checks their availability based on the rules specified. Once a device has been discovered, the Akri controller, which sees each Akri Instance as a representation of a leaf device, deploys a “broker” pod that facilitates the connection to the leaf device and utilizes it.
Apart from the core Akri components, an 'akri-video-streaming-app' is deployed, which streams video frames collected from the broker pods.
The streaming application can be configured using 'akri-configuration' object:
Settings like format, resolution, and frame rate can be configured through the broker pod spec:
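In the udev-video broker these settings are typically passed as environment variables on the broker container; the variable names below are assumptions based on the Akri udev-video sample and may differ between releases:

```yaml
# Excerpt of the brokerPodSpec: tuning format, resolution, and frame rate
brokerPodSpec:
  containers:
  - name: akri-udev-video-broker
    image: ghcr.io/deislabs/akri/udev-video-broker:latest   # assumed image path
    env:
    - name: FORMAT                  # assumed variable names
      value: "MJPG"
    - name: RESOLUTION_WIDTH
      value: "640"
    - name: RESOLUTION_HEIGHT
      value: "480"
    - name: FRAMES_PER_SECOND
      value: "10"
```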
The admin can manage all the Akri components from workloads section of Kubernetes Engine.
Pod logs can be accessed from the Kubernetes Engine portal.
The Stackdriver components installed on the cluster stream all the logs to the operations & logging suite, enabling users to filter for specific logs.
Anthos clusters on bare metal use MetalLB running in L2 mode for data plane load balancing. MetalLB uses the pool of IPs configured while bootstrapping the cluster and attaches them to Services of type LoadBalancer.
Users can edit a service from Kubernetes Engine - Services & Ingress portal if required:
As shown below, the akri-video-streaming-app is exposed as a Service of type LoadBalancer, and the load balancer IP can be used to access the footage.
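A sketch of what such a Service looks like; the selector labels and ports are assumptions, and the actual values come from the streaming app's manifests in the repo:

```yaml
# LoadBalancer Service exposing the streaming app on the local network via MetalLB
apiVersion: v1
kind: Service
metadata:
  name: akri-video-streaming-app
spec:
  type: LoadBalancer               # MetalLB assigns an IP from the configured address pool
  selector:
    app: akri-video-streaming-app  # assumed label
  ports:
  - port: 80                       # assumed ports
    targetPort: 5000
```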
In the topology above two cameras are attached to the two nodes of the same cluster, and the Akri controller deploys a broker pod for each instance (camera). As shown below, the cameras are pointed at two different objects:
The akri-video-streaming-app collects frames from both broker pods and displays them in a unified view. The resolution and frame rate are configured based on the akri-configuration spec.
This combination provides the automation required to create clusters, connect them to the cloud, and represent leaf devices as Kubernetes objects, together making Kubernetes a versatile edge computing solution.
Also published on: https://www.linkedin.com/pulse/anthos-bare-metal-akri-managing-leaf-devices-edge-clusters-chandra/