Apache Flink Autoscaling

Design
Declarative resource management
In order to support the active/reactive mode as well as autoscaling, we propose to introduce
declarative resource management. In this model, the JobMaster no longer asks for each slot
individually but instead announces to the ResourceManager how many slots of which type it
needs. The ResourceManager will then try to fulfill these demands as well as possible by
allocating slots for the JobMaster.
The resource requirements should consist of a quadruplet of minimum required resources,
target value, maximum required resources and resource spec (min, target, max, rs). The target
value is what the JobMaster would like to get. The minimum value defines how many resources
are needed at least in order to begin executing the job.
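The quadruplet could be modeled as a small value type. The following is an illustrative sketch only; the class, field and method names are assumptions, not actual Flink API:

```java
// Illustrative sketch of the (min, target, max, rs) quadruplet described above.
// ResourceRequirement is a hypothetical name, not a Flink class; the resource
// spec is reduced to a descriptive string for brevity.
class ResourceRequirement {
    final int min;             // slots needed at least to begin executing the job
    final int target;          // slots the JobMaster would like to get
    final int max;             // upper bound on useful slots
    final String resourceSpec; // stand-in for the per-slot resource profile (rs)

    ResourceRequirement(int min, int target, int max, String resourceSpec) {
        if (min > target || target > max) {
            throw new IllegalArgumentException("expected min <= target <= max");
        }
        this.min = min;
        this.target = target;
        this.max = max;
        this.resourceSpec = resourceSpec;
    }

    /** The job may begin executing once at least `min` slots are allocated. */
    boolean canStart(int allocatedSlots) {
        return allocatedSlots >= min;
    }

    public static void main(String[] args) {
        ResourceRequirement req = new ResourceRequirement(2, 4, 8, "1 core / 1024 MB");
        System.out.println(req.canStart(1)); // prints false
        System.out.println(req.canStart(2)); // prints true
    }
}
```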
The FLIP-6 design of the JobMaster and ResourceManager interaction can be changed for this
purpose. Currently the JobMaster proactively asks the ResourceManager for the required slots.
Instead, the JobMaster can periodically derive the target number of slots from the user settings,
the job graph and the operator autoscaling policies. The target number of slots can be regularly
sent to the ResourceManager, declaring the desired state of resource consumption, e.g. via
heartbeats. If the desired number of slots is higher than the number already allocated for this
JobMaster, the ResourceManager can cover the missing slots by
- either starting the respective number of new TaskExecutors in active container mode,
- or allocating slots from the available free slots in reactive container mode.
The dedicated TaskExecutors will, as usual, register with the JobMaster, forming its view of the
actually available resources, which may or may not meet the required number.


Based on the required and available number of slots, the JobMaster can start the job, recover
from failures and periodically trigger up- or downscaling.
With declarative resource management, the difference between the active and reactive mode
boils down to setting the target value to the maximum parallelism in the reactive mode. That
way we ensure that all operators are resource-greedy and scale up to the maximum of the
available resources and the configured maximum resource consumption.


Slot allocation protocol
Currently, slot allocation is triggered by scheduling the ExecutionGraph. This allocates a slot for
each Execution. If the SlotPool does not have enough slots available, it will ask the
ResourceManager for more. This signal triggers the allocation of new containers.

The protocol can be slightly changed. Instead of requesting each slot individually from the
ResourceManager, the JobMaster will announce its resource requirements. Based on that the
ResourceManager will assign slots to the respective JobMaster.


Once the JobMaster has received enough slots to fulfill the start or rescaling conditions, the
ExecutionGraph will be started or rescaled respectively.
The difference is that we separate the resource announcement and the scheduling of the
ExecutionGraph into two steps. Moreover, this design does not require a strict matching
between slots and slot requests; thus, it might make the AllocationID superfluous.
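With announced requirements, the ResourceManager's side of the protocol reduces to topping a job up towards its declared target. A minimal sketch under that assumption (the class and method names are illustrative, not Flink API):

```java
// Sketch of the assignment step in the declared-requirements protocol:
// the ResourceManager covers the gap between a job's declared target and
// what it already holds, bounded by the free slots in the cluster.
class DeclarativeAllocationSketch {
    static int slotsToAssign(int declaredTarget, int alreadyAssigned, int freeSlots) {
        int missing = Math.max(0, declaredTarget - alreadyAssigned);
        return Math.min(missing, freeSlots);
    }

    public static void main(String[] args) {
        // job declares a target of 10 slots, already holds 4, cluster has 3 free
        System.out.println(slotsToAssign(10, 4, 3));  // prints 3
        // job already holds more than its target: nothing more is assigned
        System.out.println(slotsToAssign(10, 12, 5)); // prints 0
    }
}
```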

Scheduler component
In order to execute a JobGraph, we first need to extract the resource requirements from each
operator.
● Streaming:​ extract the requirements from all operators
● Batch: ​depending on the execution mode, only the requirements of the first stage need
to be known
Next, these requirements are given to the SlotPool, which forwards them to the
ResourceManager if it does not hold enough slots.
Once the SlotPool gets new resources assigned, it will notify the Scheduler about these
changes. The Scheduler will then check whether the current set of resources fulfills the minimum
resource requirements. If this is the case, it will start scheduling the ExecutionGraph with
the available set of slots.
If the ExecutionGraph is already running (with fewer resources than its target value), then the
Scheduler can decide to scale up in order to make use of the additional resources. The
Scheduler might wait for the resources to stabilize in order to avoid too many successive
rescaling operations, especially at startup of the job.

Requirements
● Decide on the scheduling strategy and announce the required resources
● Receive notifications of changed resources
○ Wait for stable resources
○ Trigger scheduling of executions once the min requirement has been fulfilled
● Manage rescaling policies
○ Periodically query the rescaling policies to update target values
● Decide on rescaling (when is it time to scale up/down)
Separation of scheduler/deployment and ExecutionGraph
In order to make the scheduling of the ExecutionGraph more flexible and to support different
implementations we should decouple the scheduling from the ExecutionGraph. The
ExecutionGraph should be the data structure which tracks the current state of the job and is
updated by the JobMaster.
The scheduling of the ExecutionGraph should be the responsibility of a dedicated component,
the scheduler. The scheduler will take the ExecutionGraph and check which Executions need to
be executed. It will then acquire the required slots from the SlotPool and deploy the individual
Executions.
Separating the scheduling from the ExecutionGraph could allow us to transform the
ExecutionGraph into a synchronous data structure at some point in time. Making it
single-threaded would simplify future maintenance considerably.
FLINK-10240 – Pluggable scheduling strategy for batch jobs – should also be taken into
account when designing the future scheduler component.

Calculation of target number of slots
In order to calculate the resource requirements, the scheduler/other component needs to iterate
over all operators of the JobGraph. Each operator should return the resource requirement
quadruplet (min, target, max, rs). Summing these values up with respect to the ResourceSpec
gives the resource requirements which will be announced to the ResourceManager.

Slot sharing
If a set of operators can share the same slot, then we only need to take the maximum over all
resource requirements of these operators and sum up the ResourceSpecs:
(max(min_i), max(target_i), max(max_i), sum(rs_i)).
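The slot sharing aggregation rule can be sketched as follows. The names are illustrative and the ResourceSpec is reduced to a single memory number for brevity:

```java
import java.util.List;

// Sketch of combining the requirement quadruplets of operators that share a
// slot: take the maximum over the parallelism values (min, target, max) and
// the sum over the resource specs.
class RequirementAggregation {
    record Req(int min, int target, int max, int rsMb) {}

    static Req combineSharingGroup(List<Req> operators) {
        int min = 0, target = 0, max = 0, rsMb = 0;
        for (Req r : operators) {
            min = Math.max(min, r.min());
            target = Math.max(target, r.target());
            max = Math.max(max, r.max());
            rsMb += r.rsMb(); // resource specs are summed, not maxed
        }
        return new Req(min, target, max, rsMb);
    }

    public static void main(String[] args) {
        Req combined = combineSharingGroup(List.of(
                new Req(1, 4, 8, 512),
                new Req(2, 4, 16, 256)));
        System.out.println(combined); // Req[min=2, target=4, max=16, rsMb=768]
    }
}
```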


Re-scaling policies
In order to support autoscaling, we can allow the target value of an operator to change
dynamically. This can be achieved by introducing a RescalingPolicy which the user can specify
for each operator. The RescalingPolicy can be periodically queried to learn about the current
target value. Changes in the target value will be propagated to the ResourceManager, resulting
in a changed resource set. Once the SlotPool gets new resources assigned, the Scheduler
could trigger the rescaling.
The current behaviour could be achieved by using a FixedRescalingPolicy which always returns
the initial target value.
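A minimal sketch of what such a policy could look like. The interface and class names come from the text above; the method name and everything else are assumptions:

```java
// Hypothetical shape of the per-operator rescaling policy described above.
interface RescalingPolicy {
    /** Current desired parallelism; queried periodically by the scheduler. */
    int currentTargetValue();
}

// Reproduces today's static behaviour: the target value never changes.
class FixedRescalingPolicy implements RescalingPolicy {
    private final int initialTarget;

    FixedRescalingPolicy(int initialTarget) {
        this.initialTarget = initialTarget;
    }

    @Override
    public int currentTargetValue() {
        return initialTarget;
    }

    public static void main(String[] args) {
        RescalingPolicy policy = new FixedRescalingPolicy(4);
        System.out.println(policy.currentTargetValue()); // prints 4
    }
}
```

A dynamic policy would implement the same interface but derive the target value from, e.g., observed load.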

Start and rescaling actions
The scheduler is responsible for starting and rescaling the running job. It gets notified about
resource changes by the SlotPool. If the slot number is stable for some time, then it can decide
to rescale the job:
- If not running yet, ​start​ the job with the maximum possible parallelism if all operators can
achieve their minimum parallelism
- If running, ​upscale​ if ….
- If running, ​downscale​ if ….

Fault recovery
In case of a fault the scheduler should check whether it needs to down-scale in order to run the
job (e.g. if a TaskManager died). The job should only be restarted if the minimum resource
requirements are fulfilled.

Configuration
Currently (prior to 1.7.0), Flink resolves the parallelism of every operator, holds it in the job
graph and starts the job with exactly this resolved parallelism. The parallelism of an operator is
currently resolved as follows:
- the ​fixed​ value if the user called ​setParallelism()
- otherwise the job parallelism if it is set on the ​cli (-p)
- or the ​default​ job parallelism from the ​Flink config​.
For declarative resource management, the user needs to be able to specify the minimum,
initial target and maximum resource values. If the user did not explicitly set the parallelism via
setParallelism​ of an operator, we could set the resource requirements to
● Active mode:​ (1, -p or cluster default, max parallelism)
● Reactive mode: ​(1, max parallelism, max parallelism)
If the user defined the parallelism via ​setParallelism(p)​: (p, p, p).
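These resolution rules can be sketched as a single function. This is illustrative only; `userParallelism == null` stands for "setParallelism was not called", and the names are assumptions:

```java
// Sketch of deriving the (min, target, max) parallelism triple per mode,
// following the rules above.
class DefaultRequirementsSketch {
    enum Mode { ACTIVE, REACTIVE }

    /** Returns {min, target, max}. */
    static int[] derive(Mode mode, Integer userParallelism,
                        int defaultParallelism, int maxParallelism) {
        if (userParallelism != null) {
            // explicit setParallelism(p): (p, p, p)
            return new int[] {userParallelism, userParallelism, userParallelism};
        }
        // active mode targets -p / cluster default; reactive mode targets the max
        int target = (mode == Mode.ACTIVE) ? defaultParallelism : maxParallelism;
        return new int[] {1, target, maxParallelism};
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(
                derive(Mode.ACTIVE, null, 4, 128)));   // [1, 4, 128]
        System.out.println(java.util.Arrays.toString(
                derive(Mode.REACTIVE, null, 4, 128))); // [1, 128, 128]
        System.out.println(java.util.Arrays.toString(
                derive(Mode.ACTIVE, 8, 4, 128)));      // [8, 8, 8]
    }
}
```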

Implementation scope and steps of this feature iteration for Flink 1.9.0
The first version of this feature will be implemented under the following assumptions: a passive
job cluster which runs a job in eager scheduling mode (= streaming jobs). With the assumption
of the passive job cluster (e.g. standalone mode) we know that all resources in the cluster will
be allocatable by the JobManager. Moreover, the ResourceManager cannot start new
TaskManagers and, thus, we don't need to start the execution of the ExecutionGraph to kick this
off. Consequently, we can already reason about the available set of resources without changing
the slot allocation protocol and without requiring that usable slots are registered in the SlotPool.
Moreover, by only considering the eager scheduling mode, we effectively exclude batch jobs
from these changes. This narrows the scope and will make it easier to implement a first working
prototype.

Implementation steps
Decouple ExecutionGraph from JobMaster [​FLINK-10498​]
With declarative resource management we want to react to the set of available resources. Thus,
we need a component which is responsible for scaling the ExecutionGraph accordingly. In order
to better do this and separate concerns, it is beneficial to introduce a
Scheduler/ExecutionGraphDriver component which is in charge of the ExecutionGraph. This
component owns the ExecutionGraph and is allowed to modify it. In the first version, this
component will simply accommodate all the existing logic of the JobMaster and the respective
JobMaster methods are forwarded to this component.
This new component should not change the existing behaviour of Flink.
Later this component will be in charge of announcing the required resources, deciding when to
rescale and executing the rescaling operation.

Introduce declarative resource management switch [​FLINK-10499​]
In order not to affect Flink's behaviour, we propose to add a feature flag to turn the declarative
resource management on and off. In the beginning, this feature flag should only be activated
when running a streaming job in per-job mode. The switch should control which type of
ExecutionGraphDriver will be instantiated.

The declarative resource management should become the default once it is fully implemented.

Let ExecutionGraphDriver react to fail signal [​FLINK-10500​]
In order to scale down when there are not enough resources available or if TMs died, the
ExecutionGraphDriver needs to learn about a failure. Depending on the failure type and the
available set of resources, it can then decide to scale the job down or simply restart. In the
scope of this issue, the ExecutionGraphDriver should simply call into the RestartStrategy.

Obtain resource overview of cluster [​FLINK-10501​]
In order to decide with which parallelism to run, the ExecutionGraphDriver needs to obtain an
overview over all available resources. This includes the resources managed by the SlotPool as
well as not yet allocated resources on the ResourceManager. This is a temporary workaround
until we adapted the slot allocation protocol to support resource declaration. Once this is done,
we will only take the SlotPool’s slots into account.

Periodically check for new resources [​FLINK-10503​]
In order to decide when to start scheduling or to rescale, we need to periodically check for new
resources (slots).

Wait for resource stabilization [​FLINK-10502​]
Add functionality to wait for resource stabilization. The available set of resources is considered
stable if it did not change for a given time. Only if the resource set is stable should we consider
triggering the initial scheduling or rescaling actions.
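One way to implement this is to remember when the slot count last changed and compare against a configurable window. A sketch under that assumption; the names and the explicit clock parameter (which keeps the logic testable) are illustrative:

```java
// Sketch of the resource stabilization check: the slot count is considered
// stable once it has been unchanged for at least the configured window.
class ResourceStabilityTracker {
    private final long stabilityWindowMillis;
    private int lastSlotCount = -1;
    private long lastChangeMillis;

    ResourceStabilityTracker(long stabilityWindowMillis) {
        this.stabilityWindowMillis = stabilityWindowMillis;
    }

    /** Report the current slot count; returns true once it is stable. */
    boolean reportAndCheckStable(int slotCount, long nowMillis) {
        if (slotCount != lastSlotCount) {
            lastSlotCount = slotCount;
            lastChangeMillis = nowMillis;
            return false; // the resource set just changed; restart the window
        }
        return nowMillis - lastChangeMillis >= stabilityWindowMillis;
    }

    public static void main(String[] args) {
        ResourceStabilityTracker tracker = new ResourceStabilityTracker(100);
        System.out.println(tracker.reportAndCheckStable(4, 0));   // false (first report)
        System.out.println(tracker.reportAndCheckStable(4, 50));  // false (window not over)
        System.out.println(tracker.reportAndCheckStable(4, 100)); // true  (stable for 100 ms)
        System.out.println(tracker.reportAndCheckStable(5, 150)); // false (count changed)
    }
}
```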

Decide actual parallelism based on available resources [​FLINK-10504​]
Check if a JobGraph can be scheduled with the available set of resources (slots). If the
minimum parallelism is fulfilled, then distribute the available set of slots across all available slot
sharing groups in order to decide on the actual runtime parallelism. In the absence of minimum,
target and maximum parallelism, assume minimum = target = maximum = parallelism defined in
the JobGraph.

Ideally, we make the slot assignment strategy pluggable.
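One possible pluggable strategy: give every slot sharing group its minimum first, then hand out the remaining slots round-robin up to each group's maximum. This is an illustrative sketch, not the actual implementation:

```java
import java.util.List;

// Sketch of distributing available slots across slot sharing groups to
// decide on the actual runtime parallelism.
class SlotDistributionSketch {
    /** Minimum and maximum parallelism of one slot sharing group. */
    record Group(int min, int max) {}

    /** Parallelism per group, or null if the minimum requirements cannot be met. */
    static int[] distribute(List<Group> groups, int availableSlots) {
        int[] parallelism = new int[groups.size()];
        int used = 0;
        for (int i = 0; i < groups.size(); i++) {
            parallelism[i] = groups.get(i).min(); // every group gets its minimum
            used += parallelism[i];
        }
        if (used > availableSlots) {
            return null; // minimum resource requirements not fulfilled
        }
        // hand out the remainder round-robin, respecting each group's maximum
        boolean progress = true;
        while (used < availableSlots && progress) {
            progress = false;
            for (int i = 0; i < groups.size() && used < availableSlots; i++) {
                if (parallelism[i] < groups.get(i).max()) {
                    parallelism[i]++;
                    used++;
                    progress = true;
                }
            }
        }
        return parallelism;
    }

    public static void main(String[] args) {
        List<Group> groups = List.of(new Group(1, 4), new Group(1, 2));
        System.out.println(java.util.Arrays.toString(distribute(groups, 5))); // [3, 2]
        System.out.println(distribute(groups, 1)); // null: minimum (2 slots) not met
    }
}
```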

Treat fail signal as scheduling event [​FLINK-10505​]
Instead of simply calling into the RestartStrategy, which restarts the existing ExecutionGraph
with the same parallelism, the ExecutionGraphDriver should treat a recovery similarly to the
initial scheduling operation. First, one needs to decide on the new parallelism of the
ExecutionGraph (scale up/scale down) with respect to the available set of resources. Only if the
minimum configuration is fulfilled will the potentially rescaled ExecutionGraph be restarted.

Introduce minimum, target and maximum parallelism to JobGraph
[​FLINK-10506​]
In order to run a job with a variable parallelism, one needs to be able to define the minimum and
maximum parallelism for an operator as well as the current target value. In the first
implementation, minimum could be 1 and maximum the max parallelism of the operator if no
explicit parallelism has been specified for an operator. If a parallelism p has been specified (via
setParallelism(p)), then minimum = maximum = p. The target value could be the command line
parameter -p or the default parallelism.

Scale job up if new resources are available [​FLINK-9957​]
Add a rescaling strategy to the ExecutionGraphDriver which decides when to rescale the
existing job. The simplest implementation could be to rescale whenever this is possible and
after a grace period between successive rescaling events has passed.

Set target parallelism to maximum when using the standalone job cluster
mode [​FLINK-10507​]
In order to enable the reactive container mode, we should set the target value to the maximum
parallelism if we run in standalone job cluster mode. That way, we will always use all available
resources and scale up if new resources are being added.


Future roadmap
tbd

Appendix: WIP - Autoscaling in passive container mode (previous design)
Scope of this feature iteration for Flink 1.7.0
This document outlines the design for the first iteration of the auto-scaling feature, mostly in the
context of the passive container mode. A passive container environment means that the Flink
job does not have active control over resource allocation or destruction in the cluster (task
manager workers). It can only detect that new resources have become available in the cluster or
that some old ones are gone (failed).


In active container mode, the Flink RM can proactively request resources from the cluster.
Downscaling can also be relevant for a failing/overloaded active mode cluster, where resources
temporarily cannot be allocated.
Current assumptions:
- There is only one job running in the cluster (single job mode).
- The job is in streaming mode.
Upon the initial start, Flink bootstraps the job with the default parallelism as before. If Flink
detects a change in the number of available slots, it tries to find the maximum possible
parallelism for the given number of slots and automatically restarts the job with this new
parallelism if possible.
Flink should also:
- preserve the parallelism fixed by the user (​setParallelism()​)
- uniformly distribute new slots between slot groups with non-fixed parallelism
- respect the restart strategy.

Implementation design
Configuration
- Enable the ​autoscaling flag​ in the cluster (could be separate flags for up- and
downscaling activation).
- The job parallelism (​cli run -p​ or config default) is no longer fixed but just the initial
value to bootstrap the job.
- The user has to activate ​checkpointing​; the activated checkpointing mode should be
checked for support of key group rescaling upon restoration.
- Optionally, the user can configure the ​minimum parallelism​ for the job to run; the job
will not be run with less parallelism and might optionally fail completely if enough slots
are still unavailable after e.g. some timeout. The default min parallelism is one.


High level view

Resource manager
The resource manager can forward slot reports from joining task managers to the job manager.
This can happen always or only when upscaling is activated.
The resource manager already holds the view of the available task managers. It updates this
view when new task managers join and detects their failure via heartbeating. Therefore the job
master can just query the resource manager for the currently available number of slots, e.g.:

ResourceManagerGateway.requestResourceOverview(...).getNumberRegisteredSlots()

Job manager
Job graph modifications
We have to incorporate a fixed parallelism flag into the ​JobVertex​ of the ​JobGraph​ if the user
explicitly sets the parallelism for the corresponding operator by calling ​setParallelism()​. This
flag will allow the job rescaler to respect it while re-evaluating the changed parallelism.
Alternatively, resolve it on the server side and keep the parallelism in the ​JobVertex​ as -1 if it is
not fixed.

Job rescaling
Upscaling of running job
The JM listens to notifications of new slots from the RM and passes them to the job upscaler.
The job upscaler (re)starts a debounce timer upon each notification. Alternatively, the upscaler
can just pull the available slot number from the RM with a fixed debounce delay.
When the debounce timer elapses, the job rescaler fetches the number of currently available
slots from the RM and asks the rescaling strategy for the new maximum possible parallelism. If
the parallelism increased, the job upscaler cancels the job and triggers a restart operation with
the new parallelism.
Failure restart and downscaling
This type of restart/rescaling can also be relevant for the active container mode, e.g. if the
cluster could not allocate more TMs to replace failed ones upon restart after the previous failure.
We can create a special restart strategy for the auto-downscaling case which substitutes, or
wraps and intercepts, the restart strategy configured by the user in the execution graph. It will
forward global failures to the job downscaler for a new rescaling attempt or a final global failure.
Failure cases:
- Just restart on any failure other than NoResourceAvailableException​. The starting or
running job can fail this way. We do not know whether it was a user attempt at
downscaling or some other, maybe temporary, failure. The job has to be given a chance
to restart with the same parallelism and request new resources to replace the failed
ones in active mode, or try the old ones again in passive mode. The original restart
strategy should be consulted on whether to continue the restart attempts.
- Downscale on NoResourceAvailableException​. The job start failed, most probably
because the available resources decreased or cannot be requested in passive mode
after the previous failure. The job downscaler can ask the rescaling strategy for the new
(most probably decreased) parallelism. The job can be restarted with the new parallelism
immediately or with a smaller timeout to reduce downtime, with or potentially without
consulting the original restart strategy, as the downscaler will eventually reach the min
parallelism below which the job cannot be run.
- Below min parallelism​. The job downscaler does not start the job with a parallelism
less than the minimum. It should wait for enough slots to become available. It can
optionally fail completely if the slot number does not increase after several checks with
a fixed delay.
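The three failure cases above boil down to a small decision function. The following sketch uses illustrative names and booleans; the real logic would live in the job downscaler / restart strategy wrapper:

```java
// Sketch of the failure-handling decision described above.
class DownscaleDecisionSketch {
    enum Action { RESTART_SAME_PARALLELISM, DOWNSCALE_AND_RESTART, WAIT_FOR_SLOTS }

    static Action onGlobalFailure(boolean noResourceAvailable,
                                  int proposedParallelism, int minParallelism) {
        if (!noResourceAvailable) {
            // could be a temporary failure: retry with the same parallelism first
            return Action.RESTART_SAME_PARALLELISM;
        }
        if (proposedParallelism >= minParallelism) {
            // resources decreased: restart with the reduced parallelism
            return Action.DOWNSCALE_AND_RESTART;
        }
        // below min parallelism: wait for slots (and possibly fail after a timeout)
        return Action.WAIT_FOR_SLOTS;
    }

    public static void main(String[] args) {
        System.out.println(onGlobalFailure(false, 8, 1)); // RESTART_SAME_PARALLELISM
        System.out.println(onGlobalFailure(true, 4, 1));  // DOWNSCALE_AND_RESTART
        System.out.println(onGlobalFailure(true, 0, 1));  // WAIT_FOR_SLOTS
    }
}
```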

Speeding up downscaling to decrease downtime:
If the job needs to downscale, first some TMs will die, because the user will usually downscale
by killing them in passive mode, or the cluster cannot provide the resources in active mode. The
running job will fail with some error, but we do not know why exactly. The next restart with the
same parallelism will fail with ​NoResourceAvailableException​, and then we can be sure we need
to downscale. The failure of unavailable slot requests will take some time (currently a 5 min
timeout for queued requests in the FLIP-6 code). The actual downscaling will be delayed by this,
increasing the job downtime. The slot request timeout might need some tuning to decrease the
downtime.



