Repairing CENTOS 7 Journal Corruption

Tags

, ,


centos7

Prior to CENTOS 7, most Linux distributions used flat text files to store logging information of both user and system processes. For decades this has worked quite well, however with the introduction of Systemd in the Red Hat/CENTOS 7 distribution, the logging of system processes is primary logged via the systemd-journald service.

One recurring issue I have found with the newer journaling has been corruption when the filesystem it resides on (usually the root / file system) is filled by errant processes.

The default parameters for Journaling allocate 10% of the total filesystem space for system journals (you can fiddle these in the /etc/systemd/journal.conf file). This can lead to exhaustion of disk space when something goes wrong and results in the journals being corrupt.

To check your Journals using the following command:

#journalctl –verify
PASS: /var/log/journal/efc95b88533a43d58629e84ffd777577/system.journal
#

However, if there are errors the output might look like:

?[1;31m40d628: entry timestamp out of synchronization?[0m
?[1;31mFile corruption detected at /var/log/journal/efc95b88533a43d58629e84ffd777577/system.journal:40d628 (of 8388608 bytes, 50%).?[0m
FAIL: /var/log/journal/efc95b88533a43d58629e84ffd777577/system.journal (Bad message)

or less severe:

PASS: /var/log/journal/8532b6d43f5e489cb226da58aea1bc6b/system@1f8a714ec78d4b6b974d58f5c70c0c13-0000000000091e03-000539fa07dac97e.journal
?[1;31m000000: missing entry array?[0m
?[1;31mFile corruption detected at /var/log/journal/8532b6d43f5e489cb226da58aea1bc6b/user-10000.journal:000000 (of 8388608 bytes, 0%).?[0m
FAIL: /var/log/journal/8532b6d43f5e489cb226da58aea1bc6b/user-10000.journal (Bad message)
7f90ba8: unused data (entry_offset==0)

Since the journal are basically append only binary files, there is little in the way of repair, instead, removing the corruption so good journals remain is the solution.

The first step is to vacuum out old records and flushes these to separate “user journal” files in /var/log/journal/<UID>/, I use 7 days as a benchmark, so the command is:

#journalctl –vacuum-time=7d

then rerun the verify

#journalctl –verify

This results in similar output to:

Deleted archived journal /var/log/journal/0d4d88a0ea9b4f9cb08cd2ab16dece30/system@d1a52deaffee4d47b47073d442b4094a-00000000007964fd-0005401bf579e8d6.journal (8.0M).
Deleted archived journal /var/log/journal/0d4d88a0ea9b4f9cb08cd2ab16dece30/system@d1a52deaffee4d47b47073d442b4094a-0000000000797dc5-00054021129b332d.journal (8.0M).
Deleted archived journal /var/log/journal/0d4d88a0ea9b4f9cb08cd2ab16dece30/system@d1a52deaffee4d47b47073d442b4094a-0000000000799644-000540253e84faa7.journal (8.0M).
Deleted archived journal /var/log/journal/0d4d88a0ea9b4f9cb08cd2ab16dece30/system@d1a52deaffee4d47b47073d442b4094a-000000000079b01f-000540287f016d0a.journal (8.0M).
Deleted archived journal /var/log/journal/0d4d88a0ea9b4f9cb08cd2ab16dece30/system@d1a52deaffee4d47b47073d442b4094a-000000000079c8cc-0005402d7db02ac8.journal (8.0M).
Deleted archived journal /var/log/journal/0d4d88a0ea9b4f9cb08cd2ab16dece30/system@d1a52deaffee4d47b47073d442b4094a-000000000079e26e-00054031975b871b.journal (8.0M).
Deleted archived journal /var/log/journal/0d4d88a0ea9b4f9cb08cd2ab16dece30/system@d1a52deaffee4d47b47073d442b4094a-000000000079fc5c-0005403737defdba.journal (8.0M).
Deleted archived journal /var/log/journal/0d4d88a0ea9b4f9cb08cd2ab16dece30/system@d1a52deaffee4d47b47073d442b4094a-00000000007a157e-0005403bc60d9b33.journal (8.0M).
Deleted archived journal /var/log/journal/0d4d88a0ea9b4f9cb08cd2ab16dece30/system@d1a52deaffee4d47b47073d442b4094a-00000000007a2e98-0005404258f3f731.journal (8.0M).
Deleted archived journal /var/log/journal/0d4d88a0ea9b4f9cb08cd2ab16dece30/system@d1a52deaffee4d47b47073d442b4094a-00000000007a481c-00054046f6cb0ac8.journal (8.0M).
Deleted archived journal /var/log/journal/0d4d88a0ea9b4f9cb08cd2ab16dece30/system@d1a52deaffee4d47b47073d442b4094a-00000000007a6184-0005404b2efa4e62.journal (8.0M).
Deleted archived journal /var/log/journal/0d4d88a0ea9b4f9cb08cd2ab16dece30/system@d1a52deaffee4d47b47073d442b4094a-00000000007a7b19-0005404f8f9fba40.journal (8.0M).
Deleted archived journal /var/log/journal/0d4d88a0ea9b4f9cb08cd2ab16dece30/system@d1a52deaffee4d47b47073d442b4094a-00000000007a942f-0005405556e858c4.journal (8.0M).
Deleted archived journal /var/log/journal/0d4d88a0ea9b4f9cb08cd2ab16dece30/system@d1a52deaffee4d47b47073d442b4094a-00000000007aad3d-00054059a1796ab0.journal (8.0M).
Vacuuming done, freed 1.6G of archived journals on disk.

Checking the actual file system we get:

total 336092
drwxr-sr-x+ 2 root systemd-journal 36864 Nov 10 10:40 .
drwxr-sr-x 3 root systemd-journal 4096 Sep 25 02:17 ..
-rw-r—–+ 1 root systemd-journal 16777216 Nov 10 10:41 system.journal
-rw-r—–+ 1 root systemd-journal 8388608 Nov 3 17:20 system@d1a52deaffee4d47b47073d442b4094a-00000000007ac72e-0005405d5326732a.journal
-rw-r—–+ 1 root systemd-journal 8388608 Nov 3 21:59 system@d1a52deaffee4d47b47073d442b4094a-00000000007ae12b-0005406062cdd04e.journal
-rw-r—–+ 1 root systemd-journal 8388608 Nov 4 02:59 system@d1a52deaffee4d47b47073d442b4094a-00000000007afa6f-000540644b290732.journal
-rw-r—–+ 1 root systemd-journal 8388608 Nov 4 07:29 system@d1a52deaffee4d47b47073d442b4094a-00000000007b135b-000540687c8c3457.journal
-rw-r—–+ 1 root systemd-journal 8388608 Nov 4 11:11 system@d1a52deaffee4d47b47073d442b4094a-00000000007b2dca-0005406c40c48a26.journal
-rw-r—–+ 1 root systemd-journal 8388608 Nov 4 14:13 system@d1a52deaffee4d47b47073d442b4094a-00000000007b46f0-0005406f5c6ba199.journal
-rw-r—–+ 1 root systemd-journal 8388608 Nov 4 19:09 system@d1a52deaffee4d47b47073d442b4094a-00000000007b6087-00054071e740b136.journal
-rw-r—–+ 1 root systemd-journal 8388608 Nov 4 23:10 system@d1a52deaffee4d47b47073d442b4094a-00000000007b7970-00054076081ef38d.journal
-rw-r—–+ 1 root systemd-journal 8388608 Nov 5 06:51 system@d1a52deaffee4d47b47073d442b4094a-00000000007b92ec-0005407966c80ece.journal
-rw-r—–+ 1 root systemd-journal 8388608 Nov 5 10:55 system@d1a52deaffee4d47b47073d442b4094a-00000000007baceb-0005407fd5966782.journal
-rw-r—–+ 1 root systemd-journal 8388608 Nov 5 15:45 system@d1a52deaffee4d47b47073d442b4094a-00000000007bc6d8-00054083409fee69.journal
-rw-r—–+ 1 root systemd-journal 8388608 Nov 5 21:13 system@d1a52deaffee4d47b47073d442b4094a-00000000007be183-000540874c3496ac.journal
-rw-r—–+ 1 root systemd-journal 8388608 Nov 6 03:02 system@d1a52deaffee4d47b47073d442b4094a-00000000007bfab9-0005408be1daf276.journal
-rw-r—–+ 1 root systemd-journal 8388608 Nov 6 08:13 system@d1a52deaffee4d47b47073d442b4094a-00000000007c13bb-00054090c10b11e2.journal
-rw-r—–+ 1 root systemd-journal 8388608 Nov 6 12:27 system@d1a52deaffee4d47b47073d442b4094a-00000000007c2bf2-0005409519ef4465.journal
-rw-r—–+ 1 root systemd-journal 8388608 Nov 6 16:36 system@d1a52deaffee4d47b47073d442b4094a-00000000007c4536-00054098a52ebff7.journal
-rw-r—–+ 1 root systemd-journal 8388608 Nov 6 21:32 system@d1a52deaffee4d47b47073d442b4094a-00000000007c5fbe-0005409c1f2b127a.journal
-rw-r—–+ 1 root systemd-journal 8388608 Nov 7 02:57 system@d1a52deaffee4d47b47073d442b4094a-00000000007c77f8-000540a043a053d7.journal
-rw-r—–+ 1 root systemd-journal 8388608 Nov 7 08:54 system@d1a52deaffee4d47b47073d442b4094a-00000000007c906f-000540a4cd7ecee0.journal
-rw-r—–+ 1 root systemd-journal 8388608 Nov 7 13:28 system@d1a52deaffee4d47b47073d442b4094a-00000000007ca972-000540a9c88a20b6.journal
-rw-r—–+ 1 root systemd-journal 8388608 Nov 7 18:16 system@d1a52deaffee4d47b47073d442b4094a-00000000007cc348-000540ad9d82ebc9.journal
-rw-r—–+ 1 root systemd-journal 8388608 Nov 7 22:26 system@d1a52deaffee4d47b47073d442b4094a-00000000007cdc2c-000540b1a2a3e5df.journal
-rw-r—–+ 1 root systemd-journal 8388608 Nov 8 03:13 system@d1a52deaffee4d47b47073d442b4094a-00000000007cf677-000540b520aa7be8.journal
-rw-r—–+ 1 root systemd-journal 8388608 Nov 8 08:03 system@d1a52deaffee4d47b47073d442b4094a-00000000007d1183-000540b924e05138.journal
-rw-r—–+ 1 root systemd-journal 8388608 Nov 8 12:34 system@d1a52deaffee4d47b47073d442b4094a-00000000007d2bc0-000540bd327db354.journal
-rw-r—–+ 1 root systemd-journal 8388608 Nov 8 16:49 system@d1a52deaffee4d47b47073d442b4094a-00000000007d4556-000540c0fa5e1b76.journal
-rw-r—–+ 1 root systemd-journal 8388608 Nov 8 21:05 system@d1a52deaffee4d47b47073d442b4094a-00000000007d5f7c-000540c48b371469.journal
-rw-r—–+ 1 root systemd-journal 8388608 Nov 9 02:13 system@d1a52deaffee4d47b47073d442b4094a-00000000007d79bc-000540c81ffb6b0f.journal
-rw-r—–+ 1 root systemd-journal 8388608 Nov 9 07:08 system@d1a52deaffee4d47b47073d442b4094a-00000000007d91c9-000540cc6bf5443c.journal
-rw-r—–+ 1 root systemd-journal 8388608 Nov 9 11:46 system@d1a52deaffee4d47b47073d442b4094a-00000000007dac27-000540d08a8511c4.journal
-rw-r—–+ 1 root systemd-journal 8388608 Nov 9 15:53 system@d1a52deaffee4d47b47073d442b4094a-00000000007dc600-000540d46ce0d4fa.journal
-rw-r—–+ 1 root systemd-journal 8388608 Nov 9 20:38 system@d1a52deaffee4d47b47073d442b4094a-00000000007ddfb7-000540d7e14c3fa2.journal
-rw-r—–+ 1 root systemd-journal 8388608 Nov 10 01:35 system@d1a52deaffee4d47b47073d442b4094a-00000000007df979-000540dbdb37ad71.journal
-rw-r—–+ 1 root systemd-journal 8388608 Nov 10 06:41 system@d1a52deaffee4d47b47073d442b4094a-00000000007e12f5-000540e003437aaf.journal
-rw-r-x—+ 1 root systemd-journal 8388608 Nov 10 06:46 user-10000.journal
-rw-r-x—+ 1 root systemd-journal 8388608 Jan 1 2016 user-10003.journal
-rw-r-x—+ 1 root systemd-journal 8388608 Nov 10 06:46 user-10007.journal
-rw-r-x—+ 1 root systemd-journal 8388608 Nov 10 06:46 user-10008.journal
-rw-r-x—+ 1 root systemd-journal 8388608 Dec 14 2015 user-10009.journal

This is a lot of file, so there are holes and issues in the system journal thats resulted in a lot of fragmentation. Basically these journals are near useless. Now we can re-run the verify an see if the corruption is out of the system journal and moved to one or more user journals.

#journalctl –verify

The truncated output:

….

PASS: /var/log/journal/0d4d88a0ea9b4f9cb08cd2ab16dece30/system@d1a52deaffee4d47b47073d442b4094a-00000000007cc348-000540ad9d82ebc9.journal
PASS: /var/log/journal/0d4d88a0ea9b4f9cb08cd2ab16dece30/system@d1a52deaffee4d47b47073d442b4094a-00000000007dac27-000540d08a8511c4.journal
?[1;31m000000: missing entry array?[0m
?[1;31mFile corruption detected at /var/log/journal/0d4d88a0ea9b4f9cb08cd2ab16dece30/user-10007.journal:000000 (of 8388608 bytes, 0%).?[0m
FAIL: /var/log/journal/0d4d88a0ea9b4f9cb08cd2ab16dece30/user-10007.journal (Bad message)
PASS: /var/log/journal/0d4d88a0ea9b4f9cb08cd2ab16dece30/system@d1a52deaffee4d47b47073d442b4094a-00000000007c5fbe-0005409c1f2b127a.journal
PASS: /var/log/journal/0d4d88a0ea9b4f9cb08cd2ab16dece30/system@d1a52deaffee4d47b47073d442b4094a-00000000007baceb-0005407fd5966782.journal

…….

?[1;31m000000: missing entry array?[0m
?[1;31mFile corruption detected at /var/log/journal/0d4d88a0ea9b4f9cb08cd2ab16dece30/user-10009.journal:000000 (of 8388608 bytes, 0%).?[0m
FAIL: /var/log/journal/0d4d88a0ea9b4f9cb08cd2ab16dece30/user-10009.journal (Bad message)
7fc498: unused data (entry_offset==0)

…..

?[1;31m000000: missing entry array?[0m
?[1;31mFile corruption detected at /var/log/journal/0d4d88a0ea9b4f9cb08cd2ab16dece30/user-10008.journal:000000 (of 8388608 bytes, 0%).?[0m
FAIL: /var/log/journal/0d4d88a0ea9b4f9cb08cd2ab16dece30/user-10008.journal (Bad message)

?[1;31m000000: missing entry array?[0m
?[1;31mFile corruption detected at /var/log/journal/0d4d88a0ea9b4f9cb08cd2ab16dece30/user-10000.journal:000000 (of 8388608 bytes, 0%).?[0m
FAIL: /var/log/journal/0d4d88a0ea9b4f9cb08cd2ab16dece30/user-10000.journal (Bad message)

There are more errors but basically we can see that some of the user journals (i.e. user-10008.journal) are corrupted. We cant repair these, so just delete them:

#cd /var/log/journal/0d4d88a0ea9b4f9cb08cd2ab16dece30/
#rm -f user-10000.journal
#rm -f user-10008.journal
#rm -f user-10009.journal

Now re-run the verify to see that the failed files are gone. If there are no furthur issues then everything passes and all is good. If there are still issues then we can address those. After another verify run we get:

PASS: /var/log/journal/0d4d88a0ea9b4f9cb08cd2ab16dece30/system@d1a52deaffee4d47b47073d442b4094a-00000000007e12f5-000540e003437aaf.journal
PASS: /var/log/journal/0d4d88a0ea9b4f9cb08cd2ab16dece30/system@d1a52deaffee4d47b47073d442b4094a-00000000007ac72e-0005405d5326732a.journal
7fc498: unused data (entry_offset==0)
PASS: /var/log/journal/0d4d88a0ea9b4f9cb08cd2ab16dece30/system@d1a52deaffee4d47b47073d442b4094a-00000000007d5f7c-000540c48b371469.journal
7fec28: unused data (entry_offset==0)
7fec78: unused data (entry_offset==0)
7fed08: unused data (entry_offset==0)
PASS: /var/log/journal/0d4d88a0ea9b4f9cb08cd2ab16dece30/system@d1a52deaffee4d47b47073d442b4094a-00000000007dc600-000540d46ce0d4fa.journal
PASS: /var/log/journal/0d4d88a0ea9b4f9cb08cd2ab16dece30/system@d1a52deaffee4d47b47073d442b4094a-00000000007b92ec-0005407966c80ece.journal
PASS: /var/log/journal/0d4d88a0ea9b4f9cb08cd2ab16dece30/system@d1a52deaffee4d47b47073d442b4094a-00000000007c2bf2-0005409519ef4465.journal

We can see there are holes in the journal, this might not be serious but its not a clean PASS so re-run the vacuum with 1 day less, initially it was run with 7 days, now try 6 and keep going down till 1 day till the verify is clean and PASS’s with no issues.

After 5 day vacuum we get:

Deleted archived journal /var/log/journal/0d4d88a0ea9b4f9cb08cd2ab16dece30/system@d1a52deaffee4d47b47073d442b4094a-00000000007b46f0-0005406f5c6ba199.journal (8.0M).
Deleted archived journal /var/log/journal/0d4d88a0ea9b4f9cb08cd2ab16dece30/system@d1a52deaffee4d47b47073d442b4094a-00000000007b6087-00054071e740b136.journal (8.0M).
Deleted archived journal /var/log/journal/0d4d88a0ea9b4f9cb08cd2ab16dece30/system@d1a52deaffee4d47b47073d442b4094a-00000000007b7970-00054076081ef38d.journal (8.0M).
Deleted archived journal /var/log/journal/0d4d88a0ea9b4f9cb08cd2ab16dece30/system@d1a52deaffee4d47b47073d442b4094a-00000000007b92ec-0005407966c80ece.journal (8.0M).
Deleted archived journal /var/log/journal/0d4d88a0ea9b4f9cb08cd2ab16dece30/system@d1a52deaffee4d47b47073d442b4094a-00000000007baceb-0005407fd5966782.journal (8.0M).
Vacuuming done, freed 40.0M of archived journals on disk.

If a verify keeps showing issues and you have vacuum’ed down to 1 day, you can then use the “vacuum-size” option. Data removed from the system journal is copied to a user journal so its not gone. The size command can be in Gig G, Meg M or in Kilobytes K, in this case, I’m going to be aggressive and just cleanup to 10K using:

#journalctl –vacuum-size=10K

then a verify:

#journalctl –verify

The result is a clean system journal.

#journalctl –verify
PASS: /var/log/journal/0d4d88a0ea9b4f9cb08cd2ab16dece30/system.journal
PASS: /var/log/journal/0d4d88a0ea9b4f9cb08cd2ab16dece30/user-10003.journal
#

Worse case repair

In the worst case scenario, the system journal is corrupt, such as below:

?[1;31m40d628: entry timestamp out of synchronization?[0m
?[1;31mFile corruption detected at /var/log/journal/efc95b88533a43d58629e84ffd777577/system.journal:40d628 (of 8388608 bytes, 50%).?[0m
FAIL: /var/log/journal/efc95b88533a43d58629e84ffd777577/system.journal (Bad message)

No matter how small the system journal was vacuumed the verify failed. In this case just delete the system.journal file:

#rm -f /var/log/journal/efc95b88533a43d58629e84ffd777577/system.journal

Restart the jounrald service

#systemctl restart systemd-journald

You can see it has created a new journal with:

#ls -la /var/log/journal/efc95b88533a43d58629e84ffd777577

and a verify now returns:

#journalctl –verify’
PASS: /var/log/journal/efc95b88533a43d58629e84ffd777577/system.journal
#

Done.

 

-oOo-