Zum Hauptinhalt springen

Spacewalk: Failed channel syncs because of duplicate packages

A brief summary:
I’m working with Spacewalk for three years now, from version 2.4 on. We manage about 350 systems, 30+ channels, 50+ repositories and 105.000+ packages. We import new packages and errata on a daily basis, roll out critical patches to stage systems automatically and do weekly/monthly patchdays on productive systems exclusively via Spacewalk.

Lately the periodic syncs of our CentOS 6 x86_64 channel failed on every run. New packages stayed in the queue and could not be rolled out to our systems. The Taskomatic daemon sent out error notification emails to the configured administrator mailbox in which it basically tells you that the repo-sync-bunch failed and in which logfile you can read more about this issue.

The error:

13:37:56 ERROR: (23, 'ERROR: duplicate key value violates unique constraint "rhn_cnp_cid_nid_uq"', 'Could not update database entry.')
13:37:56 Repo URL: http://mirror.centos.org/centos/6/extras/x86_64/
Traceback (most recent call last):
File "/usr/bin/spacewalk-repo-sync", line 257, in 
[...]
spacewalk.server.rhnSQL.sql_base.SQLSchemaError: (99999, 'ERROR: current transaction is aborted, commands ignored until end of transaction block', '', InternalError('current transaction is aborted, commands ignored until end of transaction block\n',))
2018-07-12 11:30:47,557 [DefaultQuartzScheduler_Worker-10] ERROR com.redhat.rhn.taskomatic.task.RepoSyncTask - Executing a task threw an exception: org.quartz.JobExecutionException
2018-07-12 11:30:47,557 [DefaultQuartzScheduler_Worker-10] ERROR com.redhat.rhn.taskomatic.task.RepoSyncTask - Message: Command '[/usr/bin/spacewalk-repo-sync, --channel, centos6_x86_64, --type, yum]' exited with error code 1
2018-07-12 11:30:47,558 [DefaultQuartzScheduler_Worker-10] ERROR com.redhat.rhn.taskomatic.task.RepoSyncTask - Cause: null
2018-07-12 11:30:47,558 [DefaultQuartzScheduler_Worker-10] ERROR com.redhat.rhn.taskomatic.task.RepoSyncTask - Stack trace:org.quartz.JobExecutionException: Command '[/usr/bin/spacewalk-repo-sync, --channel, centos6_x86_64, --type, yum]' exited with error code 1
[...]

Okay so what do we have here? Obviously it’s something about the database and a duplicate primary key. We also have the repository that caused the error to happen. No hint which entry is the duplicate whatsoever. But we do have the failed command so let’s try that on our own and strace the process and it’s forks:

[...]
23319 sendto(5, "Q\0\0\0?select LOOKUP_PACKAGE_NEVRA"..., 64, MSG_NOSIGNAL, NULL, 0) = 64
23319 poll([{fd=5, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=5, revents=POLLIN}])
23319 recvfrom(5, "T\0\0\0\33\0\1id\0\0\0\0\0\0\0\0\0\6\244\377\377\377\377\377\377\0\0D\0\0\0"..., 16384, 0, NULL, NULL) = 65
23319 sendto(5, "Q\0\0\0\210select * from rhnPackage wh"..., 137, MSG_NOSIGNAL, NULL, 0) = 137
23319 poll([{fd=5, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=5, revents=POLLIN}])
23319 recvfrom(5, "T\0\0\3#\0\34id\0\0\0\240\4\0\1\0\0\6\244\377\377\377\377\377\377\0\0org_"..., 16384, 0, NULL, NULL) = 1346
23319 sendto(5, "Q\0\0\0\210select * from rhnPackage wh"..., 137, MSG_NOSIGNAL, NULL, 0) = 137
23319 poll([{fd=5, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=5, revents=POLLIN}])
23319 recvfrom(5, "T\0\0\3#\0\34id\0\0\0\240\4\0\1\0\0\6\244\377\377\377\377\377\377\0\0org_"..., 16384, 0, NULL, NULL) = 1682
23319 sendto(5, "Q\0\0\0\211select * from rhnPackage wh"..., 138, MSG_NOSIGNAL, NULL, 0) = 138
23319 poll([{fd=5, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=5, revents=POLLIN}])
23319 recvfrom(5, "T\0\0\3#\0\34id\0\0\0\240\4\0\1\0\0\6\244\377\377\377\377\377\377\0\0org_"..., 16384, 0, NULL, NULL) = 1478
23319 sendto(5, "Q\0\0\0l\n select channel"..., 109, MSG_NOSIGNAL, NULL, 0) = 109
23319 poll([{fd=5, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=5, revents=POLLIN}])
23319 recvfrom(5, "T\0\0\0#\0\1channel_id\0\0\0\242\255\0\1\0\0\6\244\377\377\377\377"..., 16384, 0, NULL, NULL) = 56
23319 sendto(5, "Q\0\0\0l\n select channel"..., 109, MSG_NOSIGNAL, NULL, 0) = 109
23319 poll([{fd=5, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=5, revents=POLLIN}])
23319 recvfrom(5, "T\0\0\0#\0\1channel_id\0\0\0\242\255\0\1\0\0\6\244\377\377\377\377"..., 16384, 0, NULL, NULL) = 56
23319 sendto(5, "Q\0\0\0l\n select channel"..., 109, MSG_NOSIGNAL, NULL, 0) = 109
23319 poll([{fd=5, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=5, revents=POLLIN}])
23319 recvfrom(5, "T\0\0\0#\0\1channel_id\0\0\0\242\255\0\1\0\0\6\244\377\377\377\377"..., 16384, 0, NULL, NULL) = 56
23319 sendto(5, "Q\0\0\1W\n select pac"..., 344, MSG_NOSIGNAL, NULL, 0) = 344
23319 poll([{fd=5, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=5, revents=POLLIN}])
23319 recvfrom(5, "T\0\0\0#\0\1package_id\0\0\0\242\255\0\2\0\0\6\244\377\377\377\377"..., 16384, 0, NULL, NULL) = 56

Not very helpful in my opinion as there is no clue on what the package_id is that caused the error. It only confirms what we already know. But when it’s something in correlation to the database we should be able to get more information by logging all the queries done during the sync process, right? After enabling log_destination = 'stderr' and log_directory = 'pg_log' in postgresql.conf (assuming your Spacewalk installation uses a pgSQL DB), I restarted the sync and watched the PostgreSQL log closely:

2018-07-12 13:11:09.736 CEST LOG: statement: select LOOKUP_PACKAGE_NAME('centos-release') id from dual
2018-07-12 13:11:09.737 CEST LOG: statement: select LOOKUP_PACKAGE_NAME('python-setuptools') id from dual
2018-07-12 13:11:09.737 CEST LOG: statement: select LOOKUP_PACKAGE_NAME('sysstat') id from dual
2018-07-12 13:11:09.738 CEST LOG: statement: select LOOKUP_EVR(NULL, '9.0.4', '33.el6_9.1') id from dual
2018-07-12 13:11:09.753 CEST LOG: statement: select LOOKUP_EVR(NULL, '6', '10.el6.centos.12.3') id from dual
2018-07-12 13:11:09.767 CEST LOG: statement: select LOOKUP_EVR(NULL, '0.6.10', '4.el6_9') id from dual
2018-07-12 13:11:09.794 CEST LOG: statement: BEGIN
2018-07-12 13:11:09.795 CEST LOG: statement: lock table rhnChecksum in exclusive mode
2018-07-12 13:11:09.797 CEST LOG: statement: select lookup_checksum_fast('sha256', '3cdb75f0d06fe1d39cee6f010a2a487d1b2ba595fa3197bbfac9e9050b2c233f') id from dual
2018-07-12 13:11:09.894 CEST LOG: statement: select lookup_checksum_fast('sha256', '65fa114e7d947143580f98a5feeb652ceb7ebdd8b27fba65ba0b383b250c63d3') id from dual
2018-07-12 13:11:09.922 CEST LOG: statement: select lookup_checksum_fast('sha256', 'e0c9c45557bc787d85417b1daa3799ae61bab899e6475b793cf3488309f94304') id from dual
2018-07-12 13:11:09.954 CEST LOG: statement: COMMIT
2018-07-12 13:11:09.955 CEST LOG: statement: select LOOKUP_PACKAGE_NEVRA(603, 28320, 120) id from dual
2018-07-12 13:11:09.959 CEST LOG: statement: select LOOKUP_PACKAGE_NEVRA(5339, 28321, 100) id from dual
2018-07-12 13:11:09.973 CEST LOG: statement: select LOOKUP_PACKAGE_NEVRA(5913, 28322, 120) id from dual
2018-07-12 13:11:09.974 CEST LOG: statement: select * from rhnPackage where org_id = 1 and name_id = 603 and evr_id = 28320 and package_arch_id = 120 and checksum_id = 10000270
2018-07-12 13:11:09.987 CEST LOG: statement: select * from rhnPackage where org_id = 1 and name_id = 5339 and evr_id = 28321 and package_arch_id = 100 and checksum_id = 9998526
2018-07-12 13:11:09.989 CEST LOG: statement: select * from rhnPackage where org_id = 1 and name_id = 5913 and evr_id = 28322 and package_arch_id = 120 and checksum_id = 10000274
2018-07-12 13:11:09.990 CEST LOG: statement:
select channel_id
from rhnChannelPackage
where package_id = 121940
2018-07-12 13:11:09.991 CEST LOG: statement:
select channel_id
from rhnChannelPackage
where package_id = 121941
2018-07-12 13:11:09.992 CEST LOG: statement:
select channel_id
from rhnChannelPackage
where package_id = 121942
2018-07-12 13:11:09.992 CEST LOG: statement:
select package_id
from rhnChannelPackage cp
join rhnPackage p
on p.id = cp.package_id
join rhnChannel c
on c.id = cp.channel_id
where cp.channel_id = 105
and c.org_id != p.org_id
2018-07-12 13:11:10.206 CEST LOG: statement: insert into rhnChannelPackage (channel_id, package_id) values (105, 121940)
2018-07-12 13:11:10.230 CEST LOG: statement: insert into rhnChannelPackage (channel_id, package_id) values (105, 121941)
2018-07-12 13:11:10.230 CEST LOG: statement: insert into rhnChannelPackage (channel_id, package_id) values (105, 121942)
2018-07-12 13:11:10.231 CEST LOG: statement: SELECT rhn_channel.refresh_newest_package(105, 'server.app.yumreposync', 603)
2018-07-12 13:11:10.406 CEST LOG: statement: SELECT rhn_channel.refresh_newest_package(105, 'server.app.yumreposync', 5339)
2018-07-12 13:11:10.459 CEST ERROR: duplicate key value violates unique constraint "rhn_cnp_cid_nid_uq"
2018-07-12 13:11:10.459 CEST DETAIL: Key (channel_id, name_id, package_arch_id)=(105, 5339, 100) already exists.
2018-07-12 13:11:10.459 CEST CONTEXT: SQL statement "insert into rhnChannelNewestPackage
(channel_id, name_id, evr_id, package_id, package_arch_id)
(select channel_id,
name_id, evr_id,
package_id, package_arch_id
from rhnChannelNewestPackageView
where channel_id = channel_id_in
and (package_name_id_in is null
or name_id = package_name_id_in)
)"
PL/pgSQL function rhn_channel.refresh_newest_package(numeric,character varying,numeric) line 9 at SQL statement
2018-07-12 13:11:10.459 CEST STATEMENT: SELECT rhn_channel.refresh_newest_package(105, 'server.app.yumreposync', 5339)
2018-07-12 13:11:10.462 CEST LOG: statement:
select k1.description as ca_cert_name, k1.key as ca_cert, k1.org_id as ca_cert_org,
k2.description as client_cert_name, k2.key as client_cert, k2.org_id as client_cert_org,
k3.description as client_key_name, k3.key as client_key, k3.org_id as client_key_org
from rhncontentsource cs inner join
rhncontentsourcessl csssl on cs.id = csssl.content_source_id inner join
rhncryptokey k1 on csssl.ssl_ca_cert_id = k1.id left outer join
rhncryptokey k2 on csssl.ssl_client_cert_id = k2.id left outer join
rhncryptokey k3 on csssl.ssl_client_key_id = k3.id
where cs.id = 507
2018-07-12 13:11:10.462 CEST ERROR: current transaction is aborted, commands ignored until end of transaction block
2018-07-12 13:11:10.462 CEST STATEMENT:
select k1.description as ca_cert_name, k1.key as ca_cert, k1.org_id as ca_cert_org,
k2.description as client_cert_name, k2.key as client_cert, k2.org_id as client_cert_org,
k3.description as client_key_name, k3.key as client_key, k3.org_id as client_key_org
from rhncontentsource cs inner join
rhncontentsourcessl csssl on cs.id = csssl.content_source_id inner join
rhncryptokey k1 on csssl.ssl_ca_cert_id = k1.id left outer join
rhncryptokey k2 on csssl.ssl_client_cert_id = k2.id left outer join
rhncryptokey k3 on csssl.ssl_client_key_id = k3.id
where cs.id = 507

Marked in red we have the package_ids that have to be in any way responsible for the failed sync. Let’s do a quick search via the GUI to find out more about them. After looking them up one by one I discovered that all three of them where still in the package queue and not associated with a channel yet. You can check that in the software package manager.

Orphaned packages

So whats next? I know that these packages remain in the queue because the import fails but the real cause of the error seems to be something that’s already in the database which has some attribute with the exact same value as one of these three packages. Let’s google that error message and see if someone had the same issue in the past.
Seems like this happend once in 2016 because of a duplicate package published by Red Hat. Let’s see if one of these packages already exists in the database… Bingo! The sysstat and python-setuptools package have already been published some month prior but with a faulty naming schema. Like in 2016 the package names didn’t contain a dot between the package version and the OS release version. In theory simply deleting the faulty package should solve the problem but let’s see…

Delete the duplicate package

Okay and now let’s try to rerun the channel synchronisation and see if it runs without errors.

13:51:02 Sync of channel completed in 0:04:30.
13:51:02 Total time: 0:04:30

Perfect :)