    FFmpeg & evaluating performance on the PowerPC Architecture HOWTO
    
    
    (c) 2003-2004 Romain Dolbeau <romain@dolbeau.org>
    
    The PowerPC architecture and its SIMD extension AltiVec offer some
    interesting tools to evaluate performance and improve the code.
    
    This document tries to explain how to use those tools with FFmpeg.
    
    The architecture itself offers two ways to evaluate the performance of
    a given piece of code:
    
    
    1) The Time Base Registers (TBL)
    2) The Performance Monitor Counter Registers (PMC)
    
    
    The first ones are always available, always active, but they're not very
    accurate: the registers increment by one every four *bus* cycles. On
    my 667 MHz tiBook (ppc7450), this means once every twenty *processor*
    cycles. So we won't use them.
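
    The granularity above can be worked out with a small sketch. The
    function names and the "cpu_bus_multiplier" parameter are made up for
    illustration; the multiplier of 5 is only inferred from the numbers
    quoted above (4 bus cycles = 20 processor cycles).

```c
#include <stdint.h>

/* Bus cycles per Time Base increment, as stated in the text above. */
#define BUS_CYCLES_PER_TB_TICK 4u

/* Processor cycles that elapse per Time Base tick.
 * cpu_bus_multiplier is a hypothetical parameter: the ratio of the
 * processor clock to the bus clock (5 for the 667 MHz example). */
static uint32_t cycles_per_tb_tick(uint32_t cpu_bus_multiplier)
{
    return BUS_CYCLES_PER_TB_TICK * cpu_bus_multiplier;
}

/* Number of complete Time Base ticks spanned by a given cycle count. */
static uint64_t tb_ticks_for_cycles(uint64_t cycles,
                                    uint32_t cpu_bus_multiplier)
{
    return cycles / cycles_per_tb_tick(cpu_bus_multiplier);
}
```

    With a multiplier of 5, a function that runs for a couple of hundred
    cycles spans only about ten ticks, so a one-tick error is already
    around ten percent of the measurement.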
    
    The PMC are much more useful: not only can they report cycle-accurate
    timing, but they can also be used to monitor many other parameters,
    such as the number of AltiVec stalls for every kind of instruction,
    or instruction cache misses. The downside is that not all processors
    support the PMC (all G3, all G4 and the 970 do support them), and
    they're inactive by default - you need to activate them with a
    dedicated tool. Also, the number of available PMC depends on the
    processor: the various 604 have 2, the various 75x (aka G3) have 4,
    and the various 74xx (aka G4) have 6.
    
    *WARNING*: The PowerPC 970 is not very well documented, and its PMC
    registers are 64 bits wide. To let the code know about this, you *must*
    tune for the 970 (using --tune=970), or the code will assume 32-bit
    registers.
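
    Why the register width matters can be illustrated with a minimal
    sketch. This is not FFmpeg's code: "pmc_delta32" and "pmc_delta64" are
    hypothetical helpers that compute the difference between two counter
    readings at a given width.

```c
#include <stdint.h>

/* Elapsed count between two readings of a 32-bit counter.
 * Unsigned subtraction wraps modulo 2^32, so a single wraparound
 * between the two readings is handled correctly. */
static uint32_t pmc_delta32(uint32_t start, uint32_t stop)
{
    return (uint32_t)(stop - start);
}

/* Same computation for a 64-bit counter (modulo 2^64). */
static uint64_t pmc_delta64(uint64_t start, uint64_t stop)
{
    return stop - start;
}
```

    If readings from a 64-bit counter are handled as 32-bit values, the
    upper half is silently dropped: the readings 0x1FFFFFFF0 and
    0x300000010 differ by 0x100000020, but their truncated 32-bit values
    differ by only 0x20.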
    
    PowerPC performance support in FFmpeg needs to be enabled by hand.
    First, you need to configure FFmpeg as usual, but add the
    "--powerpc-perf-enable" option. For instance:
    
    #####
    ./configure --prefix=/usr/local/ffmpeg-svn --cc=gcc-3.3 --tune=7450 --powerpc-perf-enable
    #####
    
    This will configure FFmpeg to install inside /usr/local/ffmpeg-svn,
    compiling with gcc-3.3 (you should try to use this one or a newer
    gcc), and tuning for the PowerPC 7450 (i.e. the newer G4; as a rule of
    thumb, those at 550 MHz and more). It will also enable the PMC.
    
    
    You may also edit the file "config.h" to enable the following line:
    
    #####
    // #define ALTIVEC_USE_REFERENCE_C_CODE 1
    #####
    
    
    If you enable this line, then the code will not make use of AltiVec,
    but will use the reference C code instead. This is useful to compare
    performance between two versions of the code.
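
    The general pattern behind such a switch can be sketched as follows.
    Only the macro name comes from this document; the "sum_bytes"
    functions are hypothetical stand-ins, and the "optimized" version is
    plain C rather than real AltiVec code.

```c
/* Uncomment to force the reference C path, mirroring the config.h line. */
/* #define ALTIVEC_USE_REFERENCE_C_CODE 1 */

/* Plain C reference implementation (hypothetical example function). */
static int sum_bytes_c(const unsigned char *p, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += p[i];
    return s;
}

/* Stand-in for an optimized version; real code would use AltiVec here. */
static int sum_bytes_opt(const unsigned char *p, int n)
{
    int s = 0, i = 0;
    for (; i + 4 <= n; i += 4)      /* four bytes per iteration */
        s += p[i] + p[i + 1] + p[i + 2] + p[i + 3];
    for (; i < n; i++)              /* leftover bytes */
        s += p[i];
    return s;
}

/* Entry point: the path is selected at compile time. */
static int sum_bytes(const unsigned char *p, int n)
{
#ifdef ALTIVEC_USE_REFERENCE_C_CODE
    return sum_bytes_c(p, n);
#else
    return sum_bytes_opt(p, n);
#endif
}
```

    Because callers only see the entry point, both builds can be measured
    with the same command line and the same input file.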
    
    Also, the number of enabled PMC is defined in "libavcodec/ppc/dsputil_ppc.h":
    
    If you have a G4 CPU, you can enable all 6 PMC. DO NOT enable more
    PMC than available on your CPU!
    
    Then, simply compile FFmpeg as usual (make && make install).
    
    The resulting FFmpeg binary can be used exactly as usual, but before
    exiting it will dump a per-function report that looks like this:

    #####
     Values are from the PMC registers, and represent whatever the
     registers are set to record.
    
     Function "gmc1_altivec" (pmc1):
            min: 231
            max: 1339867
            avg: 558.25 (255302)
     Function "gmc1_altivec" (pmc2):
            min: 93
            max: 2164
            avg: 267.31 (255302)
     Function "gmc1_altivec" (pmc3):
            min: 72
            max: 1987
            avg: 276.20 (255302)
    (...)
    #####
    
    
    In this example, PMC1 was set to record CPU cycles, PMC2 was set to
    record AltiVec Permute Stall Cycles, and PMC3 was set to record AltiVec
    Issue Stalls.
    
    The function "gmc1_altivec" was monitored 255302 times, and the
    minimum execution time was 231 processor cycles. The max and average
    aren't much use, as it's very likely the OS interrupted execution for
    reasons of its own :-(
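
    How min/max/avg figures of this kind can be accumulated is sketched
    below. This is an illustration under assumptions, not FFmpeg's actual
    bookkeeping: the "perf_stat" structure and its helpers are
    hypothetical, and each sample stands for the difference between two
    counter readings taken around one call.

```c
#include <stdint.h>

/* Running statistics for one function and one counter (hypothetical). */
struct perf_stat {
    uint64_t min, max, sum, count;
};

static void perf_stat_init(struct perf_stat *s)
{
    s->min = UINT64_MAX;    /* any real sample will be smaller */
    s->max = 0;
    s->sum = 0;
    s->count = 0;
}

/* Record one sample (counter value after the call minus value before). */
static void perf_stat_add(struct perf_stat *s, uint64_t delta)
{
    if (delta < s->min) s->min = delta;
    if (delta > s->max) s->max = delta;
    s->sum += delta;
    s->count++;
}

/* Average over all samples; 0 if nothing was recorded. */
static double perf_stat_avg(const struct perf_stat *s)
{
    return s->count ? (double)s->sum / (double)s->count : 0.0;
}
```

    A single interrupted call inflates both the maximum and the sum,
    which is why the minimum is the most trustworthy number in the report.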
    
    With the exact same settings and source file, but using the reference C
    code, we get:

    #####
     Values are from the PMC registers, and represent whatever the
     registers are set to record.
    
     Function "gmc1_altivec" (pmc1):
            min: 592
            max: 2532235
            avg: 962.88 (255302)
     Function "gmc1_altivec" (pmc2):
            min: 0
            max: 33
            avg: 0.00 (255302)
     Function "gmc1_altivec" (pmc3):
            min: 0
            max: 350
            avg: 0.03 (255302)
    (...)
    #####
    
    
    The minimum is now 592 cycles, so the fastest AltiVec execution (231
    cycles) is about 2.5x faster than the fastest C execution in this
    example. It's not perfect but it's not bad (well I wrote this function
    so I can't say otherwise :-).
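
    The ratio comes straight from the two "min" values of pmc1 in the
    reports. As a minimal sketch (the "speedup" helper is hypothetical):

```c
/* Ratio of baseline cycles to optimized cycles; above 1 means faster. */
static double speedup(unsigned long baseline_cycles,
                      unsigned long optimized_cycles)
{
    return (double)baseline_cycles / (double)optimized_cycles;
}
```

    Calling speedup(592, 231) gives a value slightly above 2.56, which is
    the "about 2.5x" quoted here.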
    
    Once you have that kind of report, you can try to improve things by
    finding what goes wrong and fixing it; in the example above, one
    should try to reduce the number of AltiVec stalls, as this *may*
    improve performance.
    
    IV) Enabling the PMC in Mac OS X
    
    This is easy. Use "MONster" and "monster". Those tools come from
    Apple's CHUD package, and can be found hidden in the developer web
    site & FTP site. "MONster" is the graphical application; use it to
    generate a config file specifying what each register should
    monitor. Then use the command-line application "monster" to use that
    config file, and enjoy the results.
    
    Note that "MONster" can be used for many other things, but it's
    documented by Apple and it's not my subject.
    
    If you are using CHUD 4.4.2 or later, you'll notice that MONster is
    no longer available. It's been superseded by Shark, where
    configuration of PMCs is available as a plugin.
    
    
    V) Enabling the PMC on Linux
    
    On Linux you may use oprofile from http://oprofile.sf.net. Depending on the
    version and the CPU, you may need to apply a patch[1] to access a set of the
    possible counters from the userspace application. You can always define them
    using the kernel interface /dev/oprofile/* .
    
    [1] http://dev.gentoo.org/~lu_zero/development/oprofile-g4-20060423.patch
    
    Romain Dolbeau <romain@dolbeau.org>
    Luca Barbato <lu_zero@gentoo.org>