Solvedairflow Migrate from 2.1.4 to 2.2.0
✔️Accepted Answer
SQL we used from our deployment playbook to clean up these tables
Obviously a disclaimer on running this blindly. Validate its rows you can safely remove first by just executing the CTEs
BEGIN;
-- Remove dag runs without a valid run_id
DELETE FROM dag_run WHERE run_id is NULL;
-- Remove task fails without a run_id
WITH task_fails_to_remove AS (
SELECT
task_fail.dag_id,
task_fail.task_id,
task_fail.execution_date
FROM
task_fail
LEFT JOIN
dag_run ON
dag_run.dag_id = task_fail.dag_id
AND dag_run.execution_date = task_fail.execution_date
WHERE
dag_run.run_id IS NULL
)
DELETE FROM
task_fail
USING
task_fails_to_remove
WHERE (
task_fail.dag_id = task_fails_to_remove.dag_id
AND task_fail.task_id = task_fails_to_remove.task_id
AND task_fail.execution_date = task_fails_to_remove.execution_date
);
-- Remove task instances without a run_id
WITH task_instances_to_remove AS (
SELECT
task_instance.dag_id,
task_instance.task_id,
task_instance.execution_date
FROM
task_instance
LEFT JOIN
dag_run
ON dag_run.dag_id = task_instance.dag_id
AND dag_run.execution_date = task_instance.execution_date
WHERE
dag_run.run_id is NULL
)
DELETE FROM
task_instance
USING
task_instances_to_remove
WHERE (
task_instance.dag_id = task_instances_to_remove.dag_id
AND task_instance.task_id = task_instances_to_remove.task_id
AND task_instance.execution_date = task_instances_to_remove.execution_date
);
COMMIT;
Other Answers:
Seems other people have similar issue #18912 so for me it looks highly unlikely this was manually added.
I think the number of reports we have in such short time (I saw 3 reports already) indicate that those rows can appear frequently as result of normal operations by Airflow, and many users might hve similar issues soon.
If this is realy result of "regular" airflow behaviour, for me it calls for a very quick 2.2.1 with improved migration to handle that case (cc: @kaxil @jedcunningham)
Problem is that there is no clue, which rows need to be deleted. Also there is no straight logic behind "corresponding rows" for me.
This part of code for me is the only source of some guidelines.
If there is better way to deal with this kind of problem in future, I will appreciate to hear it :)
If some references are lost, this kind of error should be repaired automatically IHMO.
What's the reason for those rows to appear in the DB ? How did they get there? Do we know it?
I think there are two possible scenarios:
- If it is result of normal operation of Airflow ( even some some special circumstances, manually triggered dags using CLI or UI)
If that's the case, I agree it should be handled better - Airflow cleanig them or adding "fake" run_ids for those during migration is what I would expect as we should handle this as "regular" migration scenario.
- They are result of user adding manually entries to the database.
If this is the case then the BEST we could do is to spit-out the exact SQL query to run to delete those rows. We should not delete them automaticaly.
Case 1. Yes, I'm triggering dags manually from UI or by RestApi.
Case 2. That was very first time I had to work with Airflow Database. I have never add or modified it manually.
I'm only reading connection from db.
Case 2. That was very first time I had to work with Airflow Database. I have never add or modified it manually.
I'm only reading connection from db.
Thanks for confirming @sbialkowski-pixel! @kaxil @jedcunningham I think we need to seriously consider 2.2.1
Apache Airflow version
2.2.0
Operating System
Linux
Versions of Apache Airflow Providers
default.
Deployment
Docker-Compose
Deployment details
Using airflow-2.2.0python3.7
What happened
Upgrading image from apache/airflow:2.1.4-python3.7
to apache/airflow:2.2.0-python3.7
Cause this inside scheduler, which is not starting:
What you expected to happen
Automatic database migration and properly working scheduler.
How to reproduce
Ugrade from 2.1.4 to 2.2.0 with some dags history.
Anything else
No response
Are you willing to submit PR?
Code of Conduct