[mb-users] Verify data integrity
Grant
emailgrant at gmail.com
Sun Jul 20 19:03:11 UTC 2008
>>>> Even so, if the FLAC CRC matches with AR's, the editor can be
>>>> confident in adding it to the MBz DB right? Or maybe the AR CRC is
>>>> against whole discs as opposed to individual tracks?
>>>> - Grant
>>>>
>>> My bad for calling everything CRC I guess, but no they don't use the same
>>> hashing algorithm and will never match.
>>>
>>
>> How does AR do its checksumming? Is it calculated based on a WAV or
>> ISO of the entire disc?
>>
>> If it can be determined that a FLAC rip matches with AR, the embedded
>> FLAC checksum, although different from whatever AR uses, could be
>> added to the MBz DB with certainty right?
>>
>> - Grant
>>
> I reverse-engineered the AR system a year or so ago; there's a Perl
> script that performs AR checking available from
> http://www.srcf.ucam.org/~cjk32/ARCue/
>
> The checksums are the (mod 2^32) sum of each 32-bit LR sample multiplied
> by its offset within the track. The first five frames of the first track
> and the last five frames of the last track are ignored to prevent
> problems with drives that cannot overread into the lead-in or lead-out.
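
To make the description above concrete, here is a minimal Python sketch of that checksum. It is not taken from ARCue and has not been checked against the AR database. The function names, the 1-based sample position, the exact skip boundary, and the packing of left/right into one 32-bit word (left in the low 16 bits) are all my assumptions. The only hard fact used is that a CD frame holds 588 stereo samples.

```python
FRAME_SAMPLES = 588   # stereo samples in one CD frame (2352 bytes / 4)
SKIP_FRAMES = 5       # frames ignored at the disc edges, per the description


def pack_lr(left, right):
    """Pack two signed 16-bit samples into one unsigned 32-bit word.

    Assumption: left channel in the low 16 bits, right in the high 16 bits.
    """
    return ((right & 0xFFFF) << 16) | (left & 0xFFFF)


def ar_style_checksum(samples, is_first_track=False, is_last_track=False):
    """Sum of sample * position, modulo 2^32.

    samples: sequence of unsigned 32-bit packed L/R words for one track.
    Positions are counted from 1 (an assumption, not confirmed).
    """
    skip = SKIP_FRAMES * FRAME_SAMPLES
    start = skip if is_first_track else 0
    end = len(samples) - skip if is_last_track else len(samples)
    total = 0
    for i in range(start, end):
        total = (total + samples[i] * (i + 1)) & 0xFFFFFFFF
    return total
```

For example, `ar_style_checksum([1, 2, 3])` is 1*1 + 2*2 + 3*3 = 14. Because each sample is weighted by its position, a read that is shifted by even one sample gives a completely different sum, which is why the drive offset matters so much.
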
>
> I do like the way AccurateRip works, but there are some limitations,
> and I've been wondering about how an improved system might operate.
>
> AR seems to work on the following principle. There are two kinds of
> errors one can suffer from: systematic errors and random noise.
>
> The only realistic systematic error that will be encountered is a
> constant offset of the samples read (e.g. when asked for sample 0, the
> drive actually returns sample 15), and EAC+AR deals with this by
> establishing the drive's offset, correcting by this amount, and making
> it difficult for the user to change it.
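
As an illustration of what "correcting by this amount" means, here is a rough Python sketch. The function name and the sign convention (a positive offset means the drive returns data from later in the stream than requested) are my assumptions, not EAC's actual implementation. Samples that the shift pushes off the edge are simply lost and replaced with padding.

```python
def correct_offset(read_samples, drive_offset, pad=0):
    """Shift a read so that index i holds the sample that belongs at i.

    drive_offset > 0: the drive returned sample i + drive_offset when asked
    for sample i, so the data is moved later and the start is padded.
    drive_offset < 0: the data is moved earlier and the end is padded.
    """
    n = len(read_samples)
    if drive_offset >= 0:
        k = min(drive_offset, n)
        return [pad] * k + list(read_samples[:n - k])
    k = min(-drive_offset, n)
    return list(read_samples[k:]) + [pad] * k
```

With the example from the text (asked for sample 0, got sample 15), `correct_offset(read, 15)` puts sample 15 back at index 15. The first 15 positions cannot be recovered from that read alone, which is the overread problem mentioned earlier.
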
>
> The second kind of error is random noise, caused by a damaged disc, a
> failing drive laser, etc. These errors are manifested as random changes
> in the data read, and will not be consistent across multiple reads
> (ignoring any caching performed by the drive). Because these errors are
> random and infrequent, if two independent reads of a disc give the same
> data (or almost equivalently, the same checksum), then it is
> overwhelmingly likely that both reads of the disc read the correct
> data. AR collects all checksum submissions for a given disc ID, and when
> it receives two or more identical checksums for a given track and disc
> ID, it considers them correct. As it is possible for multiple pressings
> to have different audio data but the same disc ID, it is quite possible
> to have multiple valid checksums for each track on that disc.
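
The matching rule described above can be sketched in a few lines of Python. The function name, the tuple layout, and the `min_matches` parameter are made up for illustration. It also assumes every submission is an independent read, which a real service would have to enforce somehow.

```python
from collections import Counter, defaultdict


def confirmed_checksums(submissions, min_matches=2):
    """Group submissions by (disc_id, track) and keep agreed checksums.

    submissions: iterable of (disc_id, track_number, checksum) tuples.
    Returns {(disc_id, track): {checksum: count}} containing only the
    checksums seen at least min_matches times.
    """
    counts = defaultdict(Counter)
    for disc_id, track, checksum in submissions:
        counts[(disc_id, track)][checksum] += 1
    return {
        key: {c: n for c, n in counter.items() if n >= min_matches}
        for key, counter in counts.items()
    }
```

Note that more than one checksum can survive for the same track, which is the "different pressings, same disc ID" case. The count attached to each checksum is what the confidence figure would be built from.
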
>
>
>
> There are a few problems with the current system.
>
> Firstly, the measured drive read offsets used by the whole AR+EAC system
> seem incorrect. The offset for one drive was established using an
> ingenious but flawed mechanism that gave an incorrect value. As this
> drive offset was then used as a reference to determine all the others,
> they all share the same error. More recent tests using a different and
> arguably better method have given a different drive offset, which is
> much more likely to be correct.
>
> Secondly, AR doesn't allow any validation of the leading and trailing
> five frames of audio; some drives cannot read this data, and it is hence
> not included in the checksums.
>
> Thirdly, it cannot (I believe) deal with audio hidden in the pregap.
>
> My personal preference would be to use an AR-like system, but with MD5
> hashes based upon all the data in the track (i.e. not cutting off leading
> and trailing frames), and using the newly measured 'correct' offset.
> Such hashes would be collected for each track of each discid, and where
> 2 or more match, they would be published as a correct hash for that
> track. The MD5 calculated for any track would be the same as the FLAC
> MD5 checksum.
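
A per-track MD5 along these lines is easy to sketch in Python. The function name is hypothetical. The byte layout (interleaved left/right, signed 16-bit, little-endian) is my understanding of what the FLAC MD5 signature covers for CD audio, so treat that as an assumption until checked against the FLAC documentation.

```python
import hashlib
import struct


def track_md5(frames):
    """MD5 over every sample of a track, with nothing trimmed.

    frames: iterable of (left, right) signed 16-bit sample pairs.
    Assumed layout: interleaved, little-endian, as in raw CD audio.
    """
    md5 = hashlib.md5()
    for left, right in frames:
        md5.update(struct.pack("<hh", left, right))
    return md5.hexdigest()
```

Unlike the position-weighted sum, this covers the leading and trailing frames as well, so it only matches between two rips if both used the same offset and both drives could read the disc edges.
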
>
> This system isn't ideal though, given the effort and infrastructure
> already invested into the existing system. One way to take advantage of
> the existing data might be to also calculate AR checksums using the
> current method, and accept submissions of both as a set. The confidence
> level for the AR checksums could then be applied to the MD5 hashes that
> they span. For example, if the AR checksums indicated that tracks 1-3
> were correct with a confidence of 50, you could then be sure that the
> MD5 hash for track 2 was also correct (because the range over which the
> AR checksums for tracks 1-2 are calculated wholly covers the range over
> which the MD5 hash for track 2 is calculated).
>
> Any thoughts?
>
> Chris
There is a new Windows program called tripleflac which checks a FLAC
file against the AccurateRip database. If the FLAC file's offset is
different from the one in the AR database, it will tell you. You can
then adjust the offset of the FLAC file with something like CUETools
(Windows) and then verify the FLAC file against the AR database via
tripleflac or ARcue.
Also, AccurateRip2 is said to work around the different pressings issue.
MBz?
- Grant