Bugzilla – Bug 6231
flaws in buit-in mp3 decoder
Last modified: 2009-09-08 09:28:54 UTC
Please find all related info inside thread http://forums.slimdevices.com/showthread.php?t=40409
marking target as undefined to flag review. Probably should be assigned to Sean after review, pending his investigation. cc'ing chris to expedite.
Important posts: http://forums.slimdevices.com/showpost.php?p=244035&postcount=1 http://forums.slimdevices.com/showpost.php?p=244489&postcount=18 http://forums.slimdevices.com/showpost.php?p=244778&postcount=36 http://forums.slimdevices.com/showpost.php?p=244790&postcount=37 http://forums.slimdevices.com/showpost.php?p=245170&postcount=45 http://forums.slimdevices.com/showpost.php?p=245174&postcount=46 http://forums.slimdevices.com/showpost.php?p=245175&postcount=47 workaround: http://forums.slimdevices.com/showpost.php?p=245924&postcount=82
Created attachment 2453 [details] Files for testing mp3 decoder Provided by sikahr, thanks.
Created attachment 2455 [details] SB3 toslink playback recording of wav and mp3
Created attachment 2456 [details] Image showing the difference between wav and mp3 decoding
CC'ing Sean.
Vidur took a quick look around about the MAD decoder accuracy issue he says: 1) there is an OPT_ACCURACY macro...i think it determines the accuracy of the fixed multiplier. i implemented the multiplier in IP3k assembly, IIRC, and i think i only implemented a single version...it might be worth looking at the accuracy of that implementation. 2) the size of the fixed (frac vs integer bits) may also be something to look at. i think i used the same number of bits used on other platform (see fixed.h), so i'm not sure there' s anything there. 3) http://www.underbit.com/products/mad/ talks about a circumstance where MAD is a "limited accuracy" decoder...i'm not sure what he's referring to...but it may be worth posting to the mad-dev list to ask.
Sean: Will you be able to look at this for your 7.0.1 maintenance release for TR and SB?
A few updates on this. It does not look like a problem with the multiplier. In the ip3k we are using the FPM_DEFAULT (portable) multiplier, the accurate one as opposed to the fast one. It looks like Vidur had a shot at coding an ip3k optimized version in assembler, but it was never used. I tested that one anyway, and found that it is not correct, which is probably why he never used it. Anyway, I confirmed that the multiplier we're using gives the same results on ip3k as on x86. That's this one: # define mad_f_mul(x, y) ((((x) + (1L << 11)) >> 12) * \ (((y) + (1L << 15)) >> 16)) Libmad will normally use an x86 optimized version. For the purpose of comparing the PCM output, I disabled this one and built it using the portable one. The results still do not match our output. Just eyeballing a hex dump of our PCM output, I can tell it is not a spectacularly clean sine wave, whereas libmad's looks reasonable even with the default/portable multiplier. Next will look at synth.c and perhaps try dumping out the data before the DCT...
hmm never mind. I was not looking at the right output from madplay. With FPM_DEFAULT, libmad on x86 outputs a a signal which looks about the same ours. So the issue is simply that the portable multiplier is not as good as the x86 version. For whatever reason I'm having trouble getting exactly phase-aligned excerpts from the PCM output, but using a script to automatically try and match them up yields the attached (sine-90db_fpm_default.txt). Clearly, they have the same issue. Will investigate if there's a better multiply routine we could be using.
Created attachment 3441 [details] sine-90db_fpm_default.txt PCM comparison of our mad vs mad on x86 using the portable fpm
Hi Sean and all, It's been interesting following this bug. I thought I'd offer my two cents. I hope I won't just be adding noise (no pun intended :) ). I took a look at the MAD code (the version found online). It appears that the platform-specific multiplication macros implement 32 by 32 into 64 bits multiplication (the result is stored in two registers). Then the result is right shifted by 28, in order to get back to the normal fixed point format. The _portable_ version of the multiplier assumes there is no full 32 bit multiplication so it first divides the operands to make them smaller (losing accuracy), and then multiplies - now the multiplication result will fit in 32 bits. According to the ip3000 documentation on Ubicom's site, it looks like the ip3k cannot do full 32x32->64 multiplication. Is this the case? If so, one option is to write a software routine to replace the missing hardware multiplication. I don't know if this is feasible performance-wise, but it's probably worth looking at. Here's an example found on the web of that kind of routine (see 'listing 2'): http://www.embedded.com/columns/15201575?_requestid=161305 I'm not sure if the compiler supports 64-bit types, but if so, consider defining #FPM_64BIT. In that case I think the compiler will be the one taking care of converting the 32x32->64 multiply into multiple instructions the hardware does support. Another thing I bumped into in the MAD code is something called a 'Subband Synthesis Optimization' (see comments in synth.c). If I get this right, this optimization allows using 32x32->32 instead of 32x32->64 multiplications with less loss of accuracy than what the default multiply would give. It may be worthwhile making sure this optimization is on (OPT_SSO is getting #defined) and seeing if it makes a difference. Roy
Roy, that is correct. The S3.28 fixed point multiply requires basically a 32x32 multiply followed by a shift left by 4 bits. The portable routine is an approximation, and not as accurate as what you can easily do on an architecture that will handle a 64 bit result. It's not just a matter of the compiler handling 64-bit types though - you'll get a very slow routine doing it that way, unless the architecture directly supports it. For a full-accuracy result, we need an architecture-specific routine for this particular fixed-point format (S3.28). I have been working with someone at Ubicom over the last few days to develop one. Certainly it's feasible to make such a routine, the question is whether it will be fast enough, as it requires quite a lot of arithmetic.
I think you're onto something with OPT_SSO. I'm looking into it.
SSO is already on - it's forced on in synth.c if using FPM_DEFAULT
> Certainly it's feasible to make such a routine, the question is whether it will be fast enough Yeah, that's what I meant by "I don't know if this is feasible performance-wise", sorry for the confusing wording.
Hi, I tried to code this needed multiplication macro/function in delphi. I did some tests and it seems to work. I at the moment don't have time to do conversion to IP3K assembler or C. Program multiply2; {$APPTYPE CONSOLE} const Power_2_28 : integer = $10000000; Power_2_56 : int64 = $0100000000000000; var x1,x2: integer; function Multiply16(x,y: word) : LongWord; begin result:=x*y; end; function Multiply32(x,y: integer) : Int64; // x,y : S3.28 fixed point // result : S3.28 fixed point = x*y var x_lo, x_hi, y_lo, y_hi : word; x_sign, y_sign : Integer; xx,yy : Longword; temp : LongWord; begin if (x < 0) then begin x_sign := -1; xx := -x; end else begin x_sign := 1; xx := x; end; if (y < 0) then begin y_sign := -1; yy := -y; end else begin y_sign := 1; yy := y; end; x_lo := xx AND $0000FFFF; x_hi := xx SHR 16; y_lo := yy AND $0000FFFF; y_hi := yy SHR 16; temp := ((Multiply16(x_lo,y_lo) + $08000000) SHR 28 ); temp := temp + ((Multiply16(x_hi,y_lo) + $00000800) SHR 12 ); temp := temp + ((Multiply16(y_hi,x_lo) + $00000800) SHR 12 ); temp := temp + ( Multiply16(x_hi,y_hi) SHL 4 ); result := temp * x_sign * y_sign; end; begin ReadLn(x1); ReadLn(x2); WriteLn('Multiply32=',Multiply32(x1,x2)/Power_2_28:13:10); WriteLn('64bit =',Int64(x1)*Int64(x2)/Power_2_56:13:10); end.
One correction (int64 is not needed) : function Multiply32(x,y: integer) : Integer;
Nenad, Thanks - it's kind of you to offer that, but I do have an optimized multiplier routine now. The challenge is that even though it's optimal for the architecture, it can never be as fast as the FPM_DEFAULT routine, which only uses a single multiply, so it is unlikely that we can use it. Also, even if it were as fast as FPM_DEFAULT, we'd still have a code size problem because this needs to be inlined a zillion times in a section of code that needs to fit in the CPU's on-chip RAM (sort of like cache). I'm doing some further experiments with MAD to see if there's anything else we can do, but it seems we may be at the limit of what is possible already. If you're interested here is the ip3k routine. " mulu %1, %2 \n\t" /* acc0 = xL * yL */ " lsr.4 d15, mac_lo, #16 \n\t" /* d15 = (xL * yL >> 16) */ " lsr.4 d14, %1, #16 \n\t" /* d14 = xH */ " lsr.4 d13, %2, #16 \n\t" /* d13 = yH */ " mulu %1, d13 \n\t" /* acc0 = xL * yH */ " add.4 d15, mac_lo, d15 \n\t" /* Accumulate results in d15 */ " mulu %2, d14 \n\t" /* acc0 = xH * yL */ " addc d15, mac_lo, d15 \n\t" /* Accumulate results in d15 */ " mulu d13, d14 \n\t" /* acc0 = xH * yH */ " lsr.4 d13, d15, #16 \n\t" /* d13 now has the upper half of d15 in the lower 16 bits */ " jmpcc.f 1f \n\t" /* If carry bit is clear skip next instruction. */ " bset d13, d13, #16 \n\t" /* Compensate for the carry bit. */ "1: add.4 source3, mac_lo, d13 \n\t" /* source3 now has the uppper 28 bits of the final result*/ " lsl.4 d15, d15, #16 \n\t" /* d15 now has the lower half of d15 in the upper 16 bits*/ " shftd %0, d15, #28 \n\t" /* Final result. */
Hi Sean, regarding code size problem, I don't understand. It seems to me if you define function and only call function in macro, code size problem doesn't exists, but maybe I don't understand. Nenad
Hi Sean, I tried to analyze IP3K routine and it seems to me this is unsigned multiply. If this is the case, I tried to speed optimize routine sacrificing some precision (lowest 4 bits) for speed. Precision is now 24 bit after fixed point (i believe it is more then enough). So here it is (reduction to 10 instructions): " lsr.4 d14, %1, #16 \n\t" /* d14 = xH */ " lsr.4 d13, %2, #16 \n\t" /* d13 = yH */ " mulu %1, d13 \n\t" /* acc0 = xL*yH */ " lsr.4 d15, mac_lo, #16 \n\t" /* d15 = xL*yH/(2^16) */ " mulu %2, d14 \n\t" /* acc0 = xH*yL */ " lsr.4 mac_lo, mac_lo, #16 \n\t" /* acc0 = xH*yL/(2^16) */ " add.4 d15, mac_lo, d15 \n\t" /* d15 = xL*yH/(2^16) + xH*yL/(2^16) */ " mulu d13, d14 \n\t" /* acc0 = xH*yH */ " add.4 d15, mac_lo, d15 \n\t" /* d15 = xH*yH + xL*yH/(2^16) + xH*yL/(2^16) */ " lsl.4 %0, d15, #4 \n\t" /* result = xH*yH*(2^4) + xL*yH/(2^12) + xH*yL/(2^12) */
Improved version ( constructed in dreams :)) Saved some lowest bits. Saved one register(D15). Same number of instructions (10). Nenad " lsr.4 d14, %1, #16 \n\t" /* d14 = xH */ " lsr.4 d13, %2, #16 \n\t" /* d13 = yH */ " mulu d13, d14 \n\t" /* acc0 = xH*yH */ " lsl.4 %0, mac_lo, #4 \n\t" /* result = xH*yH*(2^4) */ " mulu %1, d13 \n\t" /* acc0 = xL*yH */ " lsr.4 d13, mac_lo, #12 \n\t" /* d13 = xL*yH/(2^12) */ " mulu %2, d14 \n\t" /* acc0 = xH*yL */ " lsr.4 d14, mac_lo, #12 \n\t" /* d14 = xH*yL/(2^12) */ " add.4 #0, d13, #0 \n\t" /* result = result + d13 */ " add.4 #0, d14, #0 \n\t" /* result = result + d14 */ /* Final result = xH*yH*(2^4) + (xL*yH + xH*yL)/(2^12) */
Nenad, I am leaving on vacation for two weeks - going to New York for my wedding, so I'll be offline for a while. However I thought I would explain a few aspects of this problem that may not be immediately obvious - special constraints due to the embedded, real-time environment: 1) We have to manage CPU cycles very carefully. The codecs have a hard limit that they're allowed to use in order to ensure that the rest of the system can do I/O, visualizers, DRM decryption, etc. It's extremely unlikely that we would have time to do a full precision multiply. 2) Code space is limited. DSP code has to run from a small on-chip program memory region in order to execute at the full 250MIPS. The multiply routine has to be inlined for performance, and because it is used so many times, it can't be made much larger as it causes all the DSP code to expand very quickly. This is a classic "time/space tradeoff" - but unfortunately we can't afford either! 3) In order to improve on the FPM_DEFAULT routine, I expect you would really need to get a full precision S3.28 result. I have compared the output of FPM_DEFAULT to a full precision routine, and it's usually good to 3.22 or so. The OPT_SSO mechanism in MAD synth.c scales the inputs to improve the accuracy, so it's not an apples-to-apples comparison since a full precision routine would not use the scaling, if I understand it correctly. 4) This must be a signed multiply (the "S" in "S3.28" means the sign bit). I have not studied your algorithm yet, but if you are assuming positive multiplicands then it will unfortunately not work. A few more thoughts... Not being an expert on the DSP being done here, I don't know the exact relationship between multiplier precision and the ultimate PCM output accuracy. However, it is possible to get the same output from MAD on a PC as we currently generate, by simply changing FPM_INTEL to FPM_DEFAULT (this will also enable SSO automatically). I think it would be an interesting exercise to use audiodiffmaker to determine the RMS level of the difference, for real music signals. It has taken me a while to get up to speed on this since I have not worked with the MAD code base until now. However, now I have learned a lot more about this problem since you first brought this up... enough that I can now say it is very unlikely we can do any better, and I understand why the person who originally did this port settled on FPM_DEFAULT (he had also experimented with an ip3k optimized routine). I'm not throwing in the towel just yet, but I just want to let you know that's how it's looking to me at the moment. regards, Sean
Sean, Congratulations. join the club. Nenad
the brains on display here in this thread are extremely impressive. i also want to say congrats, having just got married myself about a month ago. even if current hardware has reached its limit, this thread should be remembered for new hardware, so it has the power to handle it. (although i remain fairly convinced no one can hear these differences, its still good to be as accurate as possible).
Congratulations Sean! I sincerely hope *not* to see a reply from you to this posting during the next two weeks :). Have fun... Here's a thought I had today. I looked again at the currently used FPM_DEFAULT routine - # define mad_f_mul(x, y) ((((x) + (1L << 11)) >> 12) * \ (((y) + (1L << 15)) >> 16)) The format of the operands (x and y) being used is S3.28. In other words - these numbers occupy 32-bit registers (in C - these are (signed) ints). The operand X, even after the right-shift, has have more than 16 meaningful bits. But the ip3k only has 16-bit multiplications. Here's what I'm getting at: is it possible that because we are multiplying ints (and not short ints) the compiler actually replaces these multiplications with its own multiply routine? If that is indeed the case, then the current firmware already includes a call to a multiply routine for each multiplication in the code. So there would be a much better chance that we could replace that multiply routine with one of the routines you and Nenad have suggested without exceeding the real-time and code size budgets. It would be nice if Sean of someone from slimdevices could post the assembly the compiler generates for synth.c for further analysis. > 3) In order to improve on the FPM_DEFAULT routine, I expect you would really need to get a full precision S3.28 result. I have compared the output of FPM_DEFAULT to a full precision routine, and it's usually good to 3.22 or so. The OPT_SSO mechanism in MAD synth.c scales the inputs to improve the accuracy, so it's not an apples-to-apples comparison since a full precision routine would not use the scaling, if I understand it correctly. If you choose to go with a multiply that is not full-precision but a few bits less (as Nenad has suggested), you might be able to minimize the loss of precision by still using SSO (the code hints that you can define OPT_SSO even without FPM_DEFAULT). I think this could help because SSO doesn't cover all of the multiplications, only some. So SSO would be used to take care of some of the multiplications, and for the remaining ones we'd use the new routine. Btw, Nenad - very nice job there optimizing the routine.
(In reply to comment #23) > > 3) In order to improve on the FPM_DEFAULT routine, I expect you would really > need to get a full precision S3.28 result. I have compared the output of > FPM_DEFAULT to a full precision routine, and it's usually good to 3.22 or so. > The OPT_SSO mechanism in MAD synth.c scales the inputs to improve the accuracy, > so it's not an apples-to-apples comparison since a full precision routine would > not use the scaling, if I understand it correctly. Sean, I did error/precision analysis of different multiplying routines and no way you can get S3.22 from x_hi*y_hi only. Roy must be right, compiler do something else there. Error analysis: x_hi, y_hi - s3.12 x_lo, y_lo - lowest 16 bits (x_hi*y_hi) << 4: Average error= 0.00026051355841111000 Maximal error= 0.00105388161365827000 Average error bits= 16.09364607987735910000 Aver. correct bits= 11.90635392012264090000 (bits after binary point) ((x_hi*y_hi) << 4) + ((x_hi*y_lo + x_lo*y_hi) >> 12): Average error= 0.00000000517356061000 Maximal error= 0.00000002196329107000 Average error bits= 0.47380467693124221000 Aver. correct bits= 27.52619532306875780000 (bits after binary point) PS1 Roy, thank You for your comments, very helpful. PS2 Sean, when You come back please send me x86 madplay compiled with FPM_DEFAULT.
Hi, here is signed multiply routine. Maybe you can find way to use it. I tested this (equivalent) routine in delphi an it seems to work. Peace, Nenad ; x_lo is %1 LSB, y_lo is %2 LSB, %0 is result ;x_hi := x >> 16 " lsr.4 d14, %1, #16 \n\t" /* d14 = xH */ ;y_hi := y >> 16 " lsr.4 d13, %2, #16 \n\t" /* d13 = yH */ ; if x_lo<0 then x_hi:=x_hi+1 " btst %1, #15 \n\t" /* If xL is negative (#15=1)*/ " jmpeq.T L1 \n\t" /* If zero bit is set goto L1: */ " add.4 d14, 1, d14 \n\t" /* xH = xH + 1 */ ;if y_lo<0 then y_hi:=y_hi+1; "L1: btst %2, #15 \n\t" /* If yL is negative (#15=1)*/ " jmpeq.T L2 \n\t" /* If zero bit is set goto L2: */ " add.4 d13, 1, d13 \n\t" /* yH = yH + 1 */ "L2: muls d13, d14 \n\t" /* acc0 = xH*yH */ " lsl.4 %0, mac_lo, #4 \n\t" /* result = xH*yH << 4 */ " muls %1, d13 \n\t" /* acc0 = xL*yH */ " asr.4 d13, mac_lo, #12 \n\t" /* d13 = xL*yH >> 12 */ " muls %2, d14 \n\t" /* acc0 = xH*yL */ " asr.4 d14, mac_lo, #12 \n\t" /* d14 = xH*yL >> 12 */ " add.4 %0, d13, %0 \n\t" /* result = result + d13 */ " add.4 %0, d14, %0 \n\t" /* result = result + d14 */ ; Final result = xH*yH << 4 + (xL*yH + xH*yL) >> 12
Hi guys! First, i apologize for intruding in to this highly fascinating and technical thread, because the last time I did any programming was more than a decade ago on long obsolete languages and processors. So while I understand the gist of the problem, I am not able to provide any concrete assistance. But here is my thought: isn’t vorbis far more compute-intensive than mp3? Especially since it normally makes heavy use of floating point operations… I remember threads in which it was discussed how difficult it was to cram vorbis’ integer decoder (based on tremor?) in to the SB3. But somehow you guys made it work! Fabulously, btw – the audio quality is awesome. Are there perhaps any lessons we can learn from that?
hello gentlemen! has there been any progress on this bug yet? it does have 11 votes and has been open for a year now...
Hi, There is nothing we can do now, All our base are belong to Sean :)
The higher precision routine is both too slow and too big. Sorry, but it is not even close to possible.
Bummer! So... I guess I just keep using the workaround of doing MP3 decoding server-side. It actually works okay, aside from little things like not being able to fast-forward (whether that's for anything with a server-side process in the pipeline, or has to do with mixing in synchronized players, too, I never bothered to clarify). May I suggest that if precisely correct in-player MP3 decoding will never happen, that you ensure that all needful outboard executables are delivered with packaged builds, and that you document the server-side workaround prominently (or even make it the config default, if you can do so without having to field too many phone calls)? What seems to be working for me is: mp3 flc * * [madplay] --output=wave:- --replay-gain=audiophile --attenuate=-0 --very-quiet --bit-depth=24 $FILE$ | [flac] -cs -- Thanks for trying.
all i would ask going forward, is that for future hardware, slim removes this limitation. please DON'T FORGET about it!
yeah... too bad. here's a suggestion which is impractical at best and impossible at worst: would using an entirely different MP3 decoder solve the problem? LAME is also regarded as one of the best MP3 codecs. but i suspect porting it to ip3k would be way too much effort for products which are already approaching EOL (except maybe boom and receiver. and controller.)? again, sorry i cant contribute more, but my programming skills have stopped at turbo pascal. ;-)