Automatic orchestration

This article will introduce you to the concept of automatic orchestration.

Automatic Orchestration is a mechanism that ensures datasets are refreshed dynamically based on changes in their dependencies or source datasets. Unlike fixed schedules or manual re-runs, it eliminates the need for constant monitoring and manual intervention, allowing data pipelines to maintain accuracy and consistency automatically.

This feature is particularly useful for complex pipelines where multiple datasets are interconnected, ensuring that any upstream changes trigger updates downstream as needed. With Automatic Orchestration, you can avoid processing outdated data or wasting resources on unnecessary computations.

Key Difference from Scheduling or Immediate DAG Re-execution

Automatic Orchestration differs from traditional approaches like scheduling or re-running the DAG immediately after execution in the following ways:

  1. Dependency-Driven Updates
    • Automatic Orchestration only refreshes datasets when their source datasets or dependencies have changed.
    • This minimizes redundant processing and optimizes resource utilization.
  2. Dynamic Triggers
    • Unlike scheduling, which runs tasks at fixed intervals regardless of necessity, Automatic Orchestration triggers updates dynamically based on actual data changes.
  3. Efficient Resource Usage
    • Immediate DAG re-execution processes the entire pipeline after each run, even if no changes have occurred, potentially wasting resources. Automatic Orchestration avoids this by refreshing only the required datasets.

Steps to Enable Automatic Orchestration

Verify Airflow Status:

  • Before enabling Automatic Orchestration, check that Airflow is running.
  • If Airflow is not running, navigate to the Infrastructure Page and start it.

Enable Automatic Mode:

  • Once Airflow is confirmed to be running, go to the Orchestration Page.
  • Set the Refresh Mode to "Automatic."
  • The Azure Orchestration Function will start automatically when Automatic Mode is enabled.
Enabling automatic orchestration

Monitor Dataset States:

  1. After enabling Automatic Mode, you can monitor the status of datasets directly on the Orchestration Page.
  2. The page includes a "Dataset Status Table," which displays the state of each dataset.

Dataset States

  1. NORMAL: The dataset is up-to-date and contains the latest data.
  2. REFRESHING: The dataset has been submitted to the Spark engine or is waiting for the completion of a preceding dataset.
  3. REFRESHING_OUTDATED: The dataset was submitted for processing before one of its source datasets was updated.
  4. OUTDATED: A source dataset has been updated, or the dataset’s job definition cannot be processed.

Important Notes:

Avoid Manual Interference:

Do not run jobs directly via the UI or Airflow while Automatic Mode is enabled, as this may cause inconsistencies in dataset states.

Resetting Dataset States:

If issues arise, turn off Automatic Mode and enable it again. This will reset the state of all datasets, ensuring a clean start.

Initial State:

When Automatic Orchestration is turned on, all datasets are initially marked as OUTDATED.

Non-Functional Job Definitions:

Datasets with non-functional job definitions will continue to process indefinitely unless resolved.