The $0 Scheduler That Almost Cost a Company Everything

Written by williamguo | Published 2025/04/16
Tech Story Tags: open-source | dolphinscheduler | apache-dolphinscheduler | jvm-memory-settings | cpu-load-spikes | big-data-platform | alibaba-cloud-migration | cpu-limits-in-dolphinscheduler


While migrating from Alibaba Cloud DataWorks to a self-built big data platform based on DolphinScheduler, a company encountered severe CPU load spikes, and even system crashes, whenever scheduled tasks were triggered.

After investigation, it turned out the issue wasn’t caused by too many tasks but by improper scheduler configuration. By adjusting the number of threads and CPU limits in DolphinScheduler, the CPU spike problem was successfully resolved, ensuring smooth task execution.

As this case shows, users should weigh machine load against concurrent task demand when modifying thread settings.

Background

The company planned to decommission Alibaba Cloud's DataWorks, which it had been using, and build its own big data platform entirely from open-source components.

Due to budget constraints, they couldn't afford many servers at the start, so they initially purchased 4 ECS servers and co-located all components on those nodes. As the migration progressed, they gradually scaled down DataWorks resources and reallocated the savings to the self-built platform.

For the scheduling platform, they chose the currently popular DolphinScheduler.

The Problem

Other team members migrated the offline tasks from DataWorks to DolphinScheduler as Hive SQL tasks.

After a while, a large number of tasks had accumulated in DolphinScheduler.

Besides the scheduled tasks at 2 AM and 3 AM daily, there were also many tasks scheduled every 5 minutes.

Later, they found that every time the 5-minute interval triggered, the CPU load on ECS machines spiked from below 5% to 100%, lasting for several seconds.

The situation was even worse at 2 AM and 3 AM, often causing other component roles on the machines to stop running. In severe cases, the operating system itself would crash.

Investigation Process

At first, they assumed that the high CPU usage was due to a large number of tasks being scheduled simultaneously at 5-minute intervals or 2 AM.

They put all tasks into a single task group to throttle concurrency (DolphinScheduler task groups cap how many task instances run at once) and observed resource usage.

The resource usage graph did become a bit smoother, but the spikes still occurred at 5-minute intervals and 2 AM — the issue remained unresolved.

Then they disabled the scheduling of all tasks except the Hive SQL ones scheduled every 5 minutes.

However, the CPU still spiked to 100% during those intervals.

Since the spikes persisted even after the number of scheduled tasks had been cut drastically, the issue clearly wasn't too many tasks executing at once.

Finally, they turned to investigate whether DolphinScheduler itself was consuming too many resources.

Solution

By examining the application.yaml configuration files for DolphinScheduler’s master and worker components, they found several settings related to threads, CPU, and memory.

In the master’s application.yaml:

# Thread-related configuration
# master fetch command num
fetch-command-num: 10
# master prepare execute thread number to limit handle commands in parallel
pre-exec-threads: 10
# master execute thread number to limit process instances in parallel
exec-threads: 100
# master dispatch task number per batch, if all the tasks dispatch failed in a batch, will sleep 1s.
dispatch-task-number: 3

# CPU and memory-related configuration
# master max cpuload avg, only higher than the system cpu load average, master server can schedule. default value -1: the number of cpu cores * 2
# If the 1-minute load average exceeds this value, the master stops scheduling until the load drops back below it. The default of -1 means a threshold of CPU cores * 2.
max-cpu-load-avg: -1
# master reserved memory, only lower than system available memory, master server can schedule. default value 0.3, the unit is G
# If available memory falls below this value (in GB), the master stops scheduling until it rises back above it. Default: 0.3 GB.
reserved-memory: 0.3

In the worker’s application.yaml:

# Thread-related configuration
# worker execute thread number to limit task instances in parallel
exec-threads: 100

# CPU and memory-related configuration
# worker max cpuload avg, only higher than the system cpu load average, worker server can be dispatched tasks. default value -1: the number of cpu cores * 2
# If the 1-minute load average exceeds this value, the worker stops accepting new tasks until the load drops back below it. The default of -1 means a threshold of CPU cores * 2.
max-cpu-load-avg: -1
# worker reserved memory, only lower than system available memory, worker server can be dispatched tasks. default value 0.3, the unit is G
# If available memory falls below this value (in GB), the worker stops accepting new tasks until it rises back above it. Default: 0.3 GB.
reserved-memory: 0.3

In the master’s start.sh, the JVM memory settings were:

JAVA_OPTS=${JAVA_OPTS:-"-server -Duser.timezone=${SPRING_JACKSON_TIME_ZONE} -Xms4g -Xmx4g -Xmn2g -XX:+PrintGCDetails -Xloggc:gc.log -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=dump.hprof"}

In the worker’s start.sh, the JVM memory settings were:

JAVA_OPTS=${JAVA_OPTS:-"-server -Duser.timezone=${SPRING_JACKSON_TIME_ZONE} -Xms4g -Xmx4g -Xmn2g -XX:+PrintGCDetails -Xloggc:gc.log -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=dump.hprof"}

If your machine doesn't have much memory, you can adjust the JVM memory settings in start.sh accordingly.
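For example (illustrative numbers, not from the original case), on a machine with limited RAM you might halve the heap, keeping -Xms equal to -Xmx so the heap size stays fixed at runtime:

# Hypothetical reduced-memory settings for start.sh: 2 GB heap, 1 GB young generation
JAVA_OPTS=${JAVA_OPTS:-"-server -Duser.timezone=${SPRING_JACKSON_TIME_ZONE} -Xms2g -Xmx2g -Xmn1g -XX:+PrintGCDetails -Xloggc:gc.log -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=dump.hprof"}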

They mainly modified the thread counts and CPU limits in the master and worker application.yaml files — setting all thread counts to 3 and changing the CPU limit from -1 to 4 (the servers have 8 CPU cores).

The specific values can be adjusted based on your server specs or by trial and error.
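As a sketch, the tuned sections of the two files would look roughly like this (values taken from the description above; adjust for your own hardware):

In the master's application.yaml:

# all thread-related values lowered to 3
fetch-command-num: 3
pre-exec-threads: 3
exec-threads: 3
dispatch-task-number: 3
# pause scheduling once the 1-minute load average exceeds 4 (half the 8 cores)
max-cpu-load-avg: 4

In the worker's application.yaml:

# worker task concurrency lowered to 3
exec-threads: 3
# stop accepting new tasks once the load average exceeds 4
max-cpu-load-avg: 4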

Afterward, they observed the CPU usage again and found that during the 5-minute triggers, the CPU stayed stable. Even at 2 AM and 3 AM, when a burst of tasks was scheduled, the CPU remained under control and tasks ran smoothly.

However, one thing to note:

After reducing the number of threads, the number of tasks that can be scheduled simultaneously also decreases.

For example, if each worker is set to 3 threads and there are 5 worker nodes, the max concurrency is 15 tasks.

So the thread count should be determined by balancing your server's load and the number of concurrent tasks you need to run.

The goal is to find a value that avoids sudden CPU spikes and ensures scheduled tasks finish within an acceptable time window.
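As a rough, hypothetical sizing check: with 3 exec-threads on each of 5 workers (15 concurrent slots), a nightly batch of 60 tasks averaging 10 minutes each would drain in about ceil(60 / 15) × 10 = 40 minutes. If that overruns your scheduling window, raise the thread counts slightly or add worker nodes until it fits.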


Written by williamguo | William Guo, WhaleOps CEO, Apache Software Foundation Member
Published by HackerNoon on 2025/04/16