Pod run task
Task type: dphi.space.cg2.pod.run
Purpose
Launch a pod run on Clustergate-2 using a container image and runtime settings.
Definition
- id: pod_run_job
  type: dphi.space.cg2.pod.run
  description: Run a pod with parameters
  pod_name: test
  image:
    name: my-app
    tag: latest
  command: ["/bin/sh"]
  args: ["-c", "echo 'Debi tirar mas fotos' > /data/hello-from-space.txt"]
  env:
    - name: LOG_LEVEL
      value: info
    - name: CONFIG_PATH
      value: /data/config/config.yaml
  ports:
    - container: 8080
      host: 8080
      protocol: tcp
  node: default
  volume: payload
  max_duration: 30
  # Optional: merge contiguous pod_run manifests into one file (--- separated)
  merge: true
Inputs
- id: Name of the task.
- description: Description of the task.
- pod_name: Name attributed to the running pod. It must be unique across the execution.
- merge: Optional boolean. When set to true on a pod_run task, the bundler may merge its manifest with directly adjacent pod_run tasks that also set merge: true.
- image.name and image.tag: Container image reference.
- command and args: Optional override of the Docker image's command.
- env: Optional list of environment variables.
- ports: Optional container port mappings.
- node: Target node to run on (FPGA, MPU, or GPU).
- max_duration: Optional run timeout, in minutes. The pod is gracefully killed once the timeout expires.
- volume: Required. Name of the volume to attach to the pod.
Pod Hostname
Pods are reachable on the internal network using a tenant-prefixed hostname. The effective pod hostname is:
<tenant>-<pod_name>
Example: if your tenant is client1 and pod_name: test, the pod is reachable as client1-test.
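The naming rule above can be sketched in a few lines of Python. This is only an illustration: pod_hostname is a hypothetical helper, not part of any Clustergate-2 tooling, and it assumes the tenant and pod name are joined with a single hyphen exactly as shown.

```python
def pod_hostname(tenant: str, pod_name: str) -> str:
    """Build the tenant-prefixed hostname <tenant>-<pod_name>.

    Hypothetical helper for illustration; the platform derives
    this name itself when the pod is started.
    """
    return f"{tenant}-{pod_name}"


print(pod_hostname("client1", "test"))  # client1-test
```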
Manifest Merging (merge: true)
When merge: true is set, consecutive pod_run tasks can be packaged into a single multi-document YAML manifest (documents separated by ---). When merged, the pods in that contiguous group are submitted together, and are expected to be scheduled at the same time.
Rules:
- Only directly adjacent pod_run tasks with merge: true are merged.
- Merging happens per contiguous run. Any non-pod_run task breaks the merge group.
- A single isolated pod_run with merge: true does not change behavior.
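To make the grouping rules concrete, here is a minimal Python sketch of how adjacent tasks could be partitioned. It is not the bundler's implementation: merge_groups and is_mergeable are hypothetical names, tasks are modelled as plain dictionaries, and every task that is not part of a merge group is simply returned on its own.

```python
from itertools import groupby

POD_RUN_TYPE = "dphi.space.cg2.pod.run"


def is_mergeable(task: dict) -> bool:
    """A task can join a merge group only if it is a pod_run with merge: true."""
    return task.get("type") == POD_RUN_TYPE and task.get("merge") is True


def merge_groups(tasks: list[dict]) -> list[list[str]]:
    """Partition task ids, in order, into contiguous merge groups.

    Adjacent mergeable pod_run tasks end up in one group; every other
    task stands alone and therefore breaks the surrounding group.
    """
    groups: list[list[str]] = []
    for mergeable, run in groupby(tasks, key=is_mergeable):
        ids = [task["id"] for task in run]
        if mergeable:
            groups.append(ids)  # one shared multi-document manifest
        else:
            groups.extend([task_id] for task_id in ids)
    return groups


tasks = [
    {"id": "UPLINK", "type": "dphi.space.cg2.uplink"},
    {"id": "A", "type": POD_RUN_TYPE, "merge": True},
    {"id": "B", "type": POD_RUN_TYPE, "merge": True},
    {"id": "UPLINK_2", "type": "dphi.space.cg2.uplink"},
    {"id": "C", "type": POD_RUN_TYPE, "merge": True},
]
print(merge_groups(tasks))
# [['UPLINK'], ['A', 'B'], ['UPLINK_2'], ['C']]
```

In this example A and B share a group, while C stays isolated because the second uplink task sits between it and B.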
Timeout Calculation
The operation executes a sequence of steps. Each pod_run task becomes a pod_run step with a timeout_seconds value. When merge: true is used, each pod_run task still exists as its own step, but adjacent merged steps share a single multi-document manifest file.
max_duration is provided in minutes. The effective timeout budget for a pod_run step (timeout_seconds) includes:
- max_duration * 60 seconds
- an additional grace period (for scheduling and shutdown). Default: 300 seconds.
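As a worked example of the budget above, the following Python sketch assumes the two components are simply added and that nothing else contributes to timeout_seconds. The function name and the grace_seconds parameter are illustrative, not part of the task schema.

```python
DEFAULT_GRACE_SECONDS = 300  # default grace period stated above


def timeout_seconds(max_duration_minutes: int,
                    grace_seconds: int = DEFAULT_GRACE_SECONDS) -> int:
    """Convert max_duration (minutes) to seconds and add the grace period.

    Assumption: the step timeout is exactly the sum of both parts.
    """
    return max_duration_minutes * 60 + grace_seconds


print(timeout_seconds(30))  # 2100 (30 * 60 + 300)
print(timeout_seconds(1))   # 360
print(timeout_seconds(2))   # 420
```

Under that assumption, a task with max_duration: 30 gets a step budget of 2100 seconds, that is 35 minutes.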
Pod-To-Pod Communication
Pods can communicate with each other over the internal network. Use the tenant-prefixed hostname (<tenant>-<pod_name>) as the address.
Example: TCP Server/Client With merge: true
This example uplinks two scripts, starts a TCP server pod, then a TCP client pod. The two pod runs are adjacent and both set merge: true, so they may share a merged multi-document manifest in the bundle.
tasks:
  - id: UPLINK
    type: dphi.space.cg2.uplink
    source:
      - tcp_server.py
      - tcp_client.py
    destination: /
    volume: payload
    on_failure: stop
  - id: POD_RUN_TCP_SERVER
    pod_name: tcp-server
    type: dphi.space.cg2.pod.run
    image:
      name: python
      tag: "3.10"
    command: [/bin/sh]
    args:
      - -c
      - python /data/tcp_server.py --host 0.0.0.0 --port 23456 --out /data/tcp.txt --log-file /data/tcp-server-log.txt
    ports:
      - container: 23456
        host: 23456
        protocol: tcp
    node: Mpu
    volume: payload
    max_duration: 1
    on_failure: stop
    merge: true
  - id: POD_RUN_TCP_CLIENT
    pod_name: tcp-client
    type: dphi.space.cg2.pod.run
    image:
      name: python
      tag: "3.10"
    command: [/bin/sh]
    args:
      - -c
      # Note: the effective hostname is <tenant>-<pod_name>. For tenant 'client1' and pod_name 'tcp-server': client1-tcp-server
      - python /data/tcp_client.py --host client1-tcp-server --port 23456 --log-file /data/tcp-client-log.txt --message "hello from tcp client\n"
    node: Mpu
    volume: payload
    max_duration: 2
    on_failure: stop
    merge: true
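The contents of tcp_server.py and tcp_client.py are not shown here. The following Python sketch is one possible shape for their core logic, using only the standard socket module. All function names are hypothetical, argument parsing and log files are left out, and the retry loop is an assumption: because merged pods are expected to be scheduled at the same time, the client may start before the server is listening.

```python
import socket
import threading
import time


def serve_once(server: socket.socket, received: list) -> None:
    """Accept a single connection and collect everything the client sends."""
    with server:
        conn, _addr = server.accept()
        with conn:
            chunks = []
            while data := conn.recv(4096):
                chunks.append(data)
    received.append(b"".join(chunks).decode())


def send_message(host: str, port: int, message: str,
                 attempts: int = 5, delay: float = 1.0) -> None:
    """Connect to host:port and send message, retrying while the server starts."""
    for attempt in range(attempts):
        try:
            with socket.create_connection((host, port), timeout=10) as sock:
                sock.sendall(message.encode())
            return
        except OSError:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)


def demo() -> str:
    """Run both sides on the loopback interface instead of two pods."""
    server = socket.create_server(("127.0.0.1", 0))  # port 0: pick a free port
    port = server.getsockname()[1]
    received: list = []
    worker = threading.Thread(target=serve_once, args=(server, received))
    worker.start()
    send_message("127.0.0.1", port, "hello from tcp client\n")
    worker.join()
    return received[0]


if __name__ == "__main__":
    print(repr(demo()))
```

In the example above, the server side would bind to 0.0.0.0 on port 23456 and the client would pass client1-tcp-server as the host instead of the loopback address.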
Outputs
- status: Boolean success flag.
- message: Optional error message.
Common failure reasons
- Image name or tag is missing.
- Node selection is invalid.
- Remote manifest upload fails.
- Sequence execution fails.