The experiment lifecycle consists largely of creating an experiment and iterating over trials, each operating within its own namespace on the cluster. The number of concurrently running trials is determined by the number of namespaces available to run trials in.
An experiment and its trials run in several steps, coordinated by the Red Sky Ops Manager:
Note: If the Manager is not configured to use a Red Sky Ops Server, you can suggest trial configurations manually using redskyctl suggest. The experiment process then starts with the Create Trial step instead of Reconcile Experiment. (redskyctl suggest can still be used with a server configured, but the suggestion will be sent to the server to be queued.)
An experiment manifest is written and loaded into the cluster. When using the Enterprise product, this synchronizes the cluster state with the remote Red Sky API server and begins requesting suggested parameter assignments; otherwise, the system is idle until suggestions are provided manually.
The definition of the experiment includes a trial template, which is combined with the parameter assignments to form a new trial resource in the cluster. Any failure during the remaining stages will cause the trial to be marked as failed.
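The combination of trial template and parameter assignments can be pictured with a small sketch. It is not the Manager's actual code: resources are plain dictionaries, the helper name new_trial is hypothetical, and the field names (spec.parameters, spec.template, spec.assignments) are assumptions made for illustration.

```python
import copy


def new_trial(experiment, assignments):
    """Build a trial by copying the experiment's trial template and
    attaching the suggested parameter assignments (illustrative only)."""
    # Assumed layout: the experiment lists its parameters by name.
    allowed = {p["name"] for p in experiment["spec"]["parameters"]}
    unknown = set(assignments) - allowed
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")

    # Copy the template so the experiment itself is never modified.
    trial = copy.deepcopy(experiment["spec"]["template"])
    trial.setdefault("spec", {})["assignments"] = [
        {"name": name, "value": value}
        for name, value in sorted(assignments.items())
    ]
    return trial


# Hypothetical experiment used only to exercise the sketch.
experiment = {
    "spec": {
        "parameters": [{"name": "cpu"}, {"name": "memory"}],
        "template": {"spec": {"approximateRuntime": "2m"}},
    }
}
trial = new_trial(experiment, {"memory": 512, "cpu": 250})
```

Copying the template first keeps the experiment reusable for every subsequent trial.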
If the trial includes any setup tasks, a job is scheduled that runs each setup task in its own container. Setup tasks may incorporate parameter assignments, for example as a value in a Helm chart.
Using the patches from the experiment and the parameter assignments from the trial, an attempt is made to patch the cluster state. Empty patches are ignored. It may also be the case that parameter assignments established during setup tasks result in patch operations that do not change anything.
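As a rough illustration of the patch step, the sketch below renders each patch template with the trial's assignments and drops patches that render to nothing. Python's string.Template stands in for whatever template engine the product actually uses, and render_patches is a hypothetical helper, not part of any Red Sky API.

```python
from string import Template


def render_patches(patch_templates, assignments):
    """Render each patch template with the trial's assignments,
    skipping any patch that renders to an empty document."""
    rendered = []
    for text in patch_templates:
        patch = Template(text).substitute(assignments).strip()
        if not patch:
            continue  # empty patches are ignored
        rendered.append(patch)
    return rendered


# Hypothetical patch templates: one real patch and one empty one.
patches = render_patches(
    ["spec:\n  replicas: $replicas", "   "],
    {"replicas": 3},
)
```

Applying a rendered patch can still be a no-op, for instance when a setup task has already put the object into the requested state.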
Wait for Stabilization
For any deployment, stateful set or daemon set that was patched, a rollout status check will be performed. Once the patched objects are ready the trial can progress.
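What "ready" means can be sketched for the deployment case. The function below is loosely modeled on the conditions that kubectl rollout status checks for a Deployment; it operates on a plain dictionary rather than a live cluster object, and it is an assumption that the Manager applies the same conditions.

```python
def deployment_rollout_complete(deployment):
    """Return True once a patched deployment has finished rolling out."""
    meta = deployment.get("metadata", {})
    spec = deployment.get("spec", {})
    status = deployment.get("status", {})

    # The controller has not yet observed the patched spec.
    if status.get("observedGeneration", 0) < meta.get("generation", 0):
        return False

    desired = spec.get("replicas", 1)
    updated = status.get("updatedReplicas", 0)
    if updated < desired:
        return False  # new replicas are still being created
    if status.get("replicas", 0) > updated:
        return False  # old replicas are still being terminated
    if status.get("availableReplicas", 0) < updated:
        return False  # updated replicas are not yet available
    return True


# Hypothetical deployment snapshots used to exercise the check.
ready = {
    "metadata": {"generation": 2},
    "spec": {"replicas": 3},
    "status": {"observedGeneration": 2, "replicas": 3,
               "updatedReplicas": 3, "availableReplicas": 3},
}
rolling = {
    "metadata": {"generation": 2},
    "spec": {"replicas": 3},
    "status": {"observedGeneration": 2, "replicas": 4,
               "updatedReplicas": 2, "availableReplicas": 2},
}
```

Stateful sets and daemon sets would need their own status fields; they are left out of this sketch.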
Run Trial Job
The trial resource includes a job template which will be used to schedule a new job. If the container list of the job is empty, a container that performs a “sleep” will be injected (the amount of sleep time is determined by the approximateRuntime field on the trial). The start and completion times of the job are recorded on the trial (the recorded start time will be adjusted by the value of the startTimeOffset field on the trial).
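The two adjustments described here, the injected sleep container and the start time offset, can be sketched as follows. The container name and image are placeholders, both helper names are hypothetical, and it is an assumption that the offset is added to the observed start time (moving it later).

```python
import copy
from datetime import datetime, timedelta, timezone


def with_default_container(job_template, approximate_runtime):
    """Inject a sleep container when the job template has no containers."""
    job = copy.deepcopy(job_template)
    pod_spec = (job.setdefault("spec", {})
                   .setdefault("template", {})
                   .setdefault("spec", {}))
    if not pod_spec.get("containers"):
        seconds = int(approximate_runtime.total_seconds())
        pod_spec["containers"] = [{
            "name": "sleep",     # placeholder name
            "image": "busybox",  # placeholder image
            "command": ["sleep", str(seconds)],
        }]
    return job


def recorded_start_time(job_start, start_time_offset):
    """Shift the observed job start by the offset (assumed to be added)."""
    return job_start + start_time_offset


# An empty template gets a two-minute sleep container.
job = with_default_container({}, timedelta(minutes=2))
start = recorded_start_time(
    datetime(2020, 1, 1, 12, 0, 0, tzinfo=timezone.utc),
    timedelta(seconds=30),
)
```

A template that already defines containers is returned unchanged, so the sleep container only appears when nothing else would run.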
When the trial job completes, the metrics are collected according to their type, and the metric values are recorded on the trial resource. For Prometheus metrics, a check is made to ensure a final scrape has been performed before metric collection. Once all metrics have been collected, the trial is marked as finished.
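One plausible reading of the final scrape check is that collection waits until Prometheus has scraped at least once after the job completed, so the query covers the whole trial window. The sketch below encodes only that timestamp comparison; the function name and the way the last scrape time is obtained are assumptions.

```python
from datetime import datetime, timezone


def final_scrape_done(job_completion, last_scrape):
    """Return True when the most recent scrape happened at or after
    the trial job's completion time (illustrative check only)."""
    if last_scrape is None:
        return False  # nothing has been scraped yet
    return last_scrape >= job_completion


# Hypothetical timestamps used to exercise the check.
completed = datetime(2020, 1, 1, 12, 5, 0, tzinfo=timezone.utc)
early_scrape = datetime(2020, 1, 1, 12, 4, 50, tzinfo=timezone.utc)
late_scrape = datetime(2020, 1, 1, 12, 5, 5, tzinfo=timezone.utc)
```

When the check fails, collection would simply be retried later rather than failing the trial outright; that retry behavior is also an assumption.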
When using the Enterprise product, the metrics of finished trials are reported back to the remote Red Sky API server to improve the next round of suggested parameter assignments.
If the trial included setup tasks, a job is scheduled to delete the objects created during setup.