Pod run task

Task type: dphi.space.cg2.pod.run

Purpose

Launch a pod run on Clustergate-2 using a container image and runtime settings.

Definition

  - id: pod_run_job
    type: dphi.space.cg2.pod.run
    description: Run a pod with parameters
    pod_name: test
    image:
      name: my-app
      tag: latest
    command: ["/bin/sh"]
    args: ["-c", "echo 'Debi tirar mas fotos' > /data/hello-from-space.txt"]
    env:
      - name: LOG_LEVEL
        value: info
      - name: CONFIG_PATH
        value: /data/config/config.yaml
    ports:
      - container: 8080
        host: 8080
        protocol: tcp
    node: default
    volume: payload
    max_duration: 30

    # Optional: merge contiguous pod_run manifests into one file (--- separated)
    merge: true

Inputs

  • id: Name of the task.
  • description: Description of the task.
  • pod_name: Name assigned to the running pod. It must be unique within the execution.
  • merge: Optional boolean. When set to true on a pod_run task, the bundler may merge its manifest with directly adjacent pod_run tasks that also set merge: true.
  • image.name and image.tag: Container image reference.
  • command and args: Optional override of the container image's default command and arguments.
  • env: Optional environment variables list.
  • ports: Optional container port mappings.
  • node: Target node to run on (FPGA, MPU, or GPU).
  • max_duration: Optional run timeout, in minutes. The pod is terminated gracefully once the timeout expires.
  • volume: Required volume name to attach to the pod.
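The constraints in the list above can be checked locally before a task file is submitted. The following is a minimal sketch, not part of the platform: the function name check_pod_run_tasks is hypothetical, and it assumes the tasks have already been loaded from YAML into plain Python dictionaries. It covers only what this page states: image name and tag must be present, volume is required, and pod_name must be unique within the execution.

```python
POD_RUN_TYPE = "dphi.space.cg2.pod.run"


def check_pod_run_tasks(tasks: list[dict]) -> list[str]:
    """Return human-readable problems found in the pod_run tasks.

    Hypothetical helper: it only mirrors the rules listed under Inputs.
    """
    problems = []
    seen_pod_names = set()
    for task in tasks:
        if task.get("type") != POD_RUN_TYPE:
            continue  # other task types are not checked here
        task_id = task.get("id", "<no id>")
        image = task.get("image") or {}
        if not image.get("name") or not image.get("tag"):
            problems.append(f"{task_id}: image name or tag is missing")
        if not task.get("volume"):
            problems.append(f"{task_id}: volume is required")
        pod_name = task.get("pod_name")
        if pod_name in seen_pod_names:
            problems.append(f"{task_id}: pod_name {pod_name!r} is not unique")
        elif pod_name is not None:
            seen_pod_names.add(pod_name)
    return problems
```

An empty list means none of these checks failed. It does not guarantee that the platform will accept the task.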

Pod Hostname

Pods are reachable on the internal network using a tenant-prefixed hostname. The effective pod hostname is:

<tenant>-<pod_name>

Example: if your tenant is client1 and pod_name: test, the pod is reachable as client1-test.
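The naming rule can be written as a one-line helper. This is an illustrative sketch. The function name pod_hostname is hypothetical and the code only restates the <tenant>-<pod_name> pattern described above.

```python
def pod_hostname(tenant: str, pod_name: str) -> str:
    """Build the tenant-prefixed hostname: <tenant>-<pod_name>."""
    return f"{tenant}-{pod_name}"


print(pod_hostname("client1", "test"))  # client1-test
```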

Manifest Merging (merge: true)

When merge: true is set, consecutive pod_run tasks can be packaged into a single multi-document YAML manifest (documents separated by ---). The pods in a merged contiguous group are submitted together and are expected to be scheduled at the same time.

Rules:

  • Only directly adjacent pod_run tasks with merge: true are merged.
  • Merging happens per contiguous run. Any non-pod_run task breaks the merge group.
  • A single isolated pod_run with merge: true does not change behavior.
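The rules above amount to grouping contiguous runs of pod_run tasks. The sketch below illustrates that grouping under stated assumptions: tasks are plain dictionaries, the function name merge_groups is hypothetical, and every eligible run is merged (the bundler is only said to be allowed to merge, so actual behavior may differ).

```python
POD_RUN_TYPE = "dphi.space.cg2.pod.run"


def merge_groups(tasks: list[dict]) -> list[list[str]]:
    """Group pod_run task ids by the manifest file they would share.

    Each inner list stands for one manifest. Adjacent pod_run tasks with
    merge: true share a list; any other task ends the current group.
    """
    groups: list[list[str]] = []
    current: list[str] = []
    for task in tasks:
        is_pod_run = task.get("type") == POD_RUN_TYPE
        if is_pod_run and task.get("merge") is True:
            current.append(task["id"])
            continue
        if current:  # a non-mergeable task breaks the group
            groups.append(current)
            current = []
        if is_pod_run:  # pod_run without merge keeps its own manifest
            groups.append([task["id"]])
    if current:
        groups.append(current)
    return groups
```

A group with a single id corresponds to the third rule: an isolated pod_run with merge: true behaves like an unmerged one.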

Timeout Calculation

The operation executes a sequence of steps. Each pod_run task becomes a pod_run step with a timeout_seconds value. When merge: true is used, each pod_run task still exists as its own step, but adjacent merged steps share a single multi-document manifest file.

max_duration is provided in minutes. The effective timeout budget for a pod_run step (timeout_seconds) includes:

  • max_duration * 60 seconds
  • An additional grace period for scheduling and shutdown (default: 300 seconds)
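As a worked example of this budget, the sketch below assumes that timeout_seconds is exactly the sum of the two components listed above. This page only says the budget includes them, so the real value may contain other terms. The function and constant names are hypothetical.

```python
DEFAULT_GRACE_SECONDS = 300  # default grace period stated above


def timeout_seconds(max_duration_minutes: float,
                    grace_seconds: int = DEFAULT_GRACE_SECONDS) -> int:
    """Estimate the step timeout: max_duration in seconds plus the grace period."""
    return int(max_duration_minutes * 60) + grace_seconds


print(timeout_seconds(30))  # 30 * 60 + 300 = 2100
```

Under the same assumption, max_duration: 1 yields 360 seconds and max_duration: 2 yields 420 seconds.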

Pod-To-Pod Communication

Pods can communicate with each other over the internal network. Use the tenant-prefixed hostname (<tenant>-<pod_name>) as the address.

Example: TCP Server/Client With merge: true

This example uplinks two scripts, starts a TCP server pod, then a TCP client pod. The two pod runs are adjacent and both set merge: true, so they may share a merged multi-document manifest in the bundle.

tasks:
  - id: UPLINK
    type: dphi.space.cg2.uplink
    source:
      - tcp_server.py
      - tcp_client.py
    destination: /
    volume: payload
    on_failure: stop

  - id: POD_RUN_TCP_SERVER
    pod_name: tcp-server
    type: dphi.space.cg2.pod.run
    image:
      name: python
      tag: "3.10"
    command: [/bin/sh]
    args:
      - -c
      - python /data/tcp_server.py --host 0.0.0.0 --port 23456 --out /data/tcp.txt --log-file /data/tcp-server-log.txt
    ports:
      - container: 23456
        host: 23456
        protocol: tcp
    node: Mpu
    volume: payload
    max_duration: 1
    on_failure: stop
    merge: true

  - id: POD_RUN_TCP_CLIENT
    pod_name: tcp-client
    type: dphi.space.cg2.pod.run
    image:
      name: python
      tag: "3.10"
    command: [/bin/sh]
    args:
      - -c
      # Note: the effective hostname is <tenant>-<pod_name>. For tenant 'client1' and pod_name 'tcp-server': client1-tcp-server
      - python /data/tcp_client.py --host client1-tcp-server --port 23456 --log-file /data/tcp-client-log.txt --message "hello from tcp client\n"
    node: Mpu
    volume: payload
    max_duration: 2
    on_failure: stop
    merge: true
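The contents of tcp_server.py and tcp_client.py are not shown on this page. As a rough idea of what such scripts might do, here is a hypothetical sketch of their core logic using Python's standard socket module. The function names are assumptions, and argument parsing and logging (the --host, --port, --out, --log-file, and --message flags) are left out.

```python
import socket


def serve_once(server_sock: socket.socket, out_path: str) -> None:
    """Accept one connection and write everything received to out_path."""
    conn, _addr = server_sock.accept()
    with conn, open(out_path, "wb") as out:
        while True:
            chunk = conn.recv(4096)
            if not chunk:  # empty read: the client closed the connection
                break
            out.write(chunk)


def send_message(host: str, port: int, message: str) -> None:
    """Connect to host:port, send the message, then close the connection."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(message.encode("utf-8"))
```

In the example above, the server side would listen on 0.0.0.0:23456 and write to /data/tcp.txt, while the client side would call something like send_message("client1-tcp-server", 23456, "hello from tcp client\n").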

Outputs

  • status: Boolean success flag.
  • message: Optional error message.

Common failure reasons

  • Image name or tag is missing.
  • Node selection is invalid.
  • Remote manifest upload fails.
  • Sequence execution fails.