feat: add test to check for ctx.read_json() #1212

Draft · wants to merge 3 commits into main

Conversation

westhide
Contributor

@westhide westhide commented Mar 20, 2025

Which issue does this PR close?

Closes #1209.

Rationale for this change

Add a test for #1209.
Also fixes #1214.
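
For reference, a read_json test on a Ballista cluster might look roughly like the sketch below (the scheduler URL, file path, and assertion are illustrative assumptions, not the PR's exact code).

use ballista::prelude::*; // brings SessionContextExt into scope for SessionContext::remote
use datafusion::prelude::{NdJsonReadOptions, SessionContext};

// Hypothetical shape of such a test; a scheduler is assumed to listen on localhost:50050.
#[tokio::test]
async fn read_json_on_ballista_cluster() -> datafusion::error::Result<()> {
    let ctx: SessionContext = SessionContext::remote("df://localhost:50050").await?;
    let df = ctx
        .read_json("examples/testdata/simple.json", NdJsonReadOptions::default())
        .await?;
    let batches = df.collect().await?;
    assert!(!batches.is_empty());
    Ok(())
}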

What changes are included in this PR?

Are there any user-facing changes?

@westhide
Contributor Author

take

@westhide
Contributor Author

The test seems to block on version 45.0.0; I'll try to fix it.

@milenkovicm
Contributor

apparently you found another bug:

for (_, task) in schedulable_tasks {

maybe if it is changed to:

            for (_, task) in schedulable_tasks {
                match self
                    .state
                    .task_manager
                    .prepare_task_definition(task.clone())
                {
                    Ok(task_definition) => tasks.push(task_definition),
                    Err(e) => {
                        let job_id = task.partition.job_id;
                        error!(
                            "Error preparing task for job_id: {} error: {:?} ",
                            job_id,
                            e.to_string(),
                        );
                        let _ = self
                            .state
                            .task_manager
                            .abort_job(&job_id, e.to_string())
                            .await;
                    }
                }
            }

The whole job gets cancelled in case of an error, wdyt?

@westhide
Contributor Author

westhide commented Mar 21, 2025

Yes, I will try to fix this bug by sending the scheduler_server error to the client side.

ballista_scheduler::scheduler_server::grpc: Error preparing task definition: DataFusionError(Internal("Unsupported plan and extension codec failed with [Internal error: unsupported plan type: NdJsonExec { base_config: object_store_url=ObjectStoreUrl { url: Url { scheme: \"file\", cannot_be_a_base: false, username: \"\", password: None, host: None, port: None, path: \"/\", query: None, fragment: None } }, statistics=Statistics { num_rows: Absent, total_byte_size: Absent, column_statistics: [ColumnStatistics { null_count: Absent, max_value: Absent, min_value: Absent, sum_value: Absent, distinct_count: Absent }] }, file_groups={1 group: [[home/westhide/Code/apache/datafusion-ballista/examples/testdata/simple.json]]}, projection=[a], projected_statistics: Statistics { num_rows: Absent, total_byte_size: Absent, column_statistics: [ColumnStatistics { null_count: Absent, max_value: Absent, min_value: Absent, sum_value: Absent, distinct_count: Absent }] }, metrics: ExecutionPlanMetricsSet { inner: Mutex { data: MetricsSet { metrics: [] } } }, file_compression_type: FileCompressionType { variant: UNCOMPRESSED }, cache: PlanProperties { eq_properties: EquivalenceProperties { eq_group: EquivalenceGroup { classes: [] }, oeq_class: OrderingEquivalenceClass { orderings: [] }, constants: [], constraints: Constraints { inner: [] }, schema: Schema { fields: [Field { name: \"a\", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} } }, partitioning: UnknownPartitioning(1), emission_type: Incremental, boundedness: Bounded, output_ordering: None } }.\nThis was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker]. Plan: NdJsonExec { base_config: object_store_url=ObjectStoreUrl { url: Url { scheme: \"file\", cannot_be_a_base: false, username: \"\", password: None, host: None, port: None, path: \"/\", query: None, fragment: None } }, statistics=Statistics { num_rows: Absent, total_byte_size: Absent, column_statistics: [ColumnStatistics { null_count: Absent, max_value: Absent, min_value: Absent, sum_value: Absent, distinct_count: Absent }] }, file_groups={1 group: [[home/westhide/Code/apache/datafusion-ballista/examples/testdata/simple.json]]}, projection=[a], projected_statistics: Statistics { num_rows: Absent, total_byte_size: Absent, column_statistics: [ColumnStatistics { null_count: Absent, max_value: Absent, min_value: Absent, sum_value: Absent, distinct_count: Absent }] }, metrics: ExecutionPlanMetricsSet { inner: Mutex { data: MetricsSet { metrics: [] } } }, file_compression_type: FileCompressionType { variant: UNCOMPRESSED }, cache: PlanProperties { eq_properties: EquivalenceProperties { eq_group: EquivalenceGroup { classes: [] }, oeq_class: OrderingEquivalenceClass { orderings: [] }, constants: [], constraints: Constraints { inner: [] }, schema: Schema { fields: [Field { name: \"a\", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} } }, partitioning: UnknownPartitioning(1), emission_type: Incremental, boundedness: Bounded, output_ordering: None } }"))

@milenkovicm
Contributor

I believe the whole job should be cancelled.

@westhide
Contributor Author

> I believe the whole job should be cancelled.

Yes, working on it.

@westhide westhide closed this Mar 21, 2025
@milenkovicm
Contributor

Ah no, sorry for the misunderstanding, please do not cancel this PR.
What I meant is that, in case of this type of error, the Ballista job should be cancelled.

@westhide
Contributor Author

westhide commented Mar 21, 2025 via email

@westhide
Contributor Author

> Ah no, sorry for the misunderstanding, please do not cancel this PR. What I meant is that, in case of this type of error, the Ballista job should be cancelled.

Hello @milenkovicm, the change to cancel the job after prepare_task_definition fails is ready for review, thx~

@milenkovicm
Contributor

I won't be able to review this PR for a few days. Will follow up ASAP.

}
}
}

unbind_prepare_failed_tasks(active_jobs, &prepare_failed_jobs).await;
Contributor

I wonder what's the reason for this method?

When we detect that preparation of a task failed, we cannot recover from it, so the job should be cancelled.

Would self.cancel_job(job_id) trigger cancellation of all running tasks for the given job and clean up the execution graph?

Contributor Author

When the task_manager executes prepare_task_definition, it sets task_info for the running stage. Without this unbind_prepare_failed_tasks function resetting the task_info back to None, the Scheduler will, when it tries to cancel the job, send a stop-task event to the Executor, which causes a task-stop-failure error log on the Executor side.
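
To illustrate what that reset amounts to, here is a minimal sketch with stand-in types (the struct and field names below are assumptions, not the real ballista types):

use std::collections::HashMap;

// Stand-in for a running stage: one task_info slot per partition; Some(..)
// means the scheduler believes the task was handed to an executor.
struct RunningStage {
    task_infos: Vec<Option<String /* executor_id */>>,
}

// Reset slots that were bound during prepare_task_definition but whose tasks
// never launched, so a later cancel does not send stop-task events for them.
fn unbind_prepare_failed_tasks(
    running_stages: &mut HashMap<usize, RunningStage>,
    prepare_failed: &HashMap<usize, Vec<usize>>, // stage_id -> failed partitions
) {
    for (stage_id, partitions) in prepare_failed {
        if let Some(stage) = running_stages.get_mut(stage_id) {
            for p in partitions {
                if let Some(slot) = stage.task_infos.get_mut(*p) {
                    *slot = None; // back to "unscheduled"
                }
            }
        }
    }
}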

Contributor
@milenkovicm milenkovicm Mar 30, 2025

So will it just show an error in the log, or crash the executor?

Contributor Author

Just an error log~

Contributor Author

FYI

The Scheduler gets running_tasks by filter_mapping over Some(task_info):

pub(super) fn running_tasks(&self) -> Vec<(usize, usize, usize, String)> {
    self.task_infos
        .iter()
        .enumerate()
        .filter_map(|(partition, info)| match info {
            Some(TaskInfo {
                task_id,
                task_status: task_status::Status::Running(RunningTask { executor_id }),
                ..
            }) => Some((*task_id, self.stage_id, partition, executor_id.clone())),
            _ => None,
        })
        .collect()
}

The Scheduler sends a CancelTasks event to the Executor:

QueryStageSchedulerEvent::JobCancel(job_id) => {
    self.metrics_collector.record_cancelled(&job_id);
    info!("Job {} Cancelled", job_id);
    match self.state.task_manager.cancel_job(&job_id).await {
        Ok((running_tasks, _pending_tasks)) => {
            event_sender
                .post_event(QueryStageSchedulerEvent::CancelTasks(
                    running_tasks,
                ))
                .await?;
        }
        Err(e) => {
            error!(
                "Fail to invoke cancel_job for job {} due to {:?}",
                job_id, e
            );
        }
    }
    self.state.clean_up_failed_job(job_id);

The Executor logs error!("Error cancelling task: {:?}", e); when cancel_task fails inside cancel_tasks:

async fn cancel_tasks(
    &self,
    request: Request<CancelTasksParams>,
) -> Result<Response<CancelTasksResult>, Status> {
    let task_infos = request.into_inner().task_infos;
    info!("Cancelling tasks for {:?}", task_infos);
    let mut cancelled = true;
    for task in task_infos {
        if let Err(e) = self
            .executor
            .cancel_task(
                task.task_id as usize,
                task.job_id,
                task.stage_id as usize,
                task.partition_id as usize,
            )
            .await
        {
            error!("Error cancelling task: {:?}", e);
            cancelled = false;
        }
    }
    Ok(Response::new(CancelTasksResult { cancelled }))
}

Contributor Author

Would it be better to encode the physical_plan to proto before create_task_info? What do you think?
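
A minimal sketch of that ordering, assuming datafusion-proto's AsExecutionPlan::try_from_physical_plan is the encoding entry point (it is what produces the codec error above); the wrapper function itself is hypothetical:

use std::sync::Arc;
use datafusion::error::Result;
use datafusion::physical_plan::ExecutionPlan;
use datafusion_proto::physical_plan::{AsExecutionPlan, PhysicalExtensionCodec};
use datafusion_proto::protobuf::PhysicalPlanNode;

// Hypothetical ordering: prove the plan is encodable first, and only bind task
// info once encoding has succeeded, so a codec failure never leaves a bound task behind.
fn encode_plan_before_binding(
    plan: Arc<dyn ExecutionPlan>,
    codec: &dyn PhysicalExtensionCodec,
) -> Result<PhysicalPlanNode> {
    // Fails here (e.g. the unsupported NdJsonExec above) without touching task_infos.
    PhysicalPlanNode::try_from_physical_plan(plan, codec)
}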

Contributor

This issue captures a very rare corner case, which should not happen in a properly configured cluster.

For the sake of simplicity and understanding, can we just cancel the job (provided the cluster state is consistent at the end)? If the consequence of cancelling a failed task is an error log, it may not be too big of a problem.

What do you think?

Contributor Author

Sure, we should keep the code simple. unbind_prepare_failed_tasks has been reverted.

@@ -248,7 +262,7 @@ impl<T: 'static + AsLogicalPlan, U: 'static + AsExecutionPlan> SchedulerState<T,
async fn launch_tasks(
Contributor

I wonder if it makes sense to propagate sender: EventSender<QueryStageSchedulerEvent> into task_manager.launch_multi_task and cancel the jobs there?

The task_manager.launch_multi_task semantics are not clean: does it yield an error if only one task fails?

@@ -524,24 +524,35 @@ impl<T: 'static + AsLogicalPlan, U: 'static + AsExecutionPlan> TaskManager<T, U>
 pub(crate) async fn launch_multi_task(
     &self,
     executor: &ExecutorMetadata,
-    tasks: Vec<Vec<TaskDescription>>,
+    tasks: HashMap<(String, usize), Vec<TaskDescription>>,
Contributor

Do we need to change this parameter? If it is just the job_id that we need, we can get that from task.partition, if I'm not mistaken.
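
A small sketch of that point with stand-in types; the only assumption is that TaskDescription exposes partition.job_id, as used earlier in this thread:

use std::collections::HashMap;

// Stand-ins for the relevant slice of the real types.
struct PartitionId { job_id: String }
struct TaskDescription { partition: PartitionId }

// Group tasks per job without changing the parameter type: the job_id can be
// read from each task directly.
fn group_by_job(tasks: Vec<Vec<TaskDescription>>) -> HashMap<String, Vec<TaskDescription>> {
    let mut by_job: HashMap<String, Vec<TaskDescription>> = HashMap::new();
    for task in tasks.into_iter().flatten() {
        by_job
            .entry(task.partition.job_id.clone())
            .or_default()
            .push(task);
    }
    by_job
}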

     executor_manager: &ExecutorManager,
-) -> Result<()> {
+) -> Result<HashMap<String, Vec<TaskDescription>>> {
Contributor

Not sure which would be the better approach: propagate the sender as a parameter and cancel the job there, or return the job_ids whose tasks failed. Either way, it does not look like we need to return anything but a hash set of failed job_ids.
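
A sketch of the "return the failed job ids" shape, again with stand-in types and a closure standing in for the real launch call (all names here are assumptions):

use std::collections::HashSet;

// Stand-ins, not the real scheduler types (job_id is flattened here for brevity).
struct TaskDescription { job_id: String }
struct LaunchError;

// Attempt each task group; remember which jobs had a failed launch and hand
// that set back so the caller can fail or cancel those jobs afterwards.
fn launch_multi_task(
    groups: Vec<Vec<TaskDescription>>,
    mut try_launch: impl FnMut(&[TaskDescription]) -> Result<(), LaunchError>,
) -> HashSet<String> {
    let mut failed_jobs = HashSet::new();
    for group in groups {
        if try_launch(&group).is_err() {
            failed_jobs.extend(group.iter().map(|t| t.job_id.clone()));
        }
    }
    failed_jobs
}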

match self.state.task_manager.prepare_task_definition(task) {
    Ok(task_definition) => tasks.push(task_definition),
    Err(e) => {
        error!("Error preparing task definition: {:?}", e);
        info!("Cancel prepare task definition failed job: {}", job_id);
        if let Err(err) = self.cancel_job(job_id).await {
Contributor
@milenkovicm milenkovicm Mar 30, 2025

Would it be better if we fail the job instead of cancelling it, i.e. QueryStageSchedulerEvent::JobRunningFailed {...}? Not sure about the queued_at property.
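
A rough sketch of that alternative with a stand-in event enum; the variant's fields below are assumptions (hence the note about queued_at):

// Stand-in for the scheduler event; field names are assumed, not verified.
enum QueryStageSchedulerEvent {
    JobRunningFailed {
        job_id: String,
        fail_message: String,
        queued_at: u64,
        failed_at: u64,
    },
}

// Instead of cancel_job(job_id), emit a "running failed" event so the job's
// terminal state reflects what actually happened.
fn fail_instead_of_cancel(
    job_id: &str,
    err: &str,
    queued_at: u64,
    now_ms: u64,
) -> QueryStageSchedulerEvent {
    QueryStageSchedulerEvent::JobRunningFailed {
        job_id: job_id.to_string(),
        fail_message: format!("Could not prepare task definition: {err}"),
        queued_at,  // unclear which timestamp to carry over, per the comment above
        failed_at: now_ms,
    }
}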

Contributor Author

Sure, it actually failed.

@milenkovicm
Contributor

Hey @westhide, are you still interested in getting this PR merged?

@milenkovicm milenkovicm marked this pull request as draft April 21, 2025 21:36
@milenkovicm
Contributor

Moving to draft as it's waiting for changes.

Labels: None yet
Projects: None yet
2 participants