STREAM Results for the HP ProLiants DL385, DL585, BL25p, BL35p, ML370 G4

From: Schmidt, David (Performance Eng.) (d.schmidt@hp.com)
Date: Mon Feb 14 2005 - 21:08:26 UTC

    John,

    Below are STREAM results for the HP ProLiant DL385 (2 CPU), DL585 (4
    CPU), BL25p (2 CPU), BL35p (2 CPU), and ML370 G4 (1 CPU). Each
    configuration is described along with its results:

    HP ProLiant DL385
    2x2.6GHz/1MB L2 Opteron 252 processors
    16GB PC3200 DDR memory (8x2GB DIMMs)
    SuSE Linux Enterprise Server 9 (x86_64)

    I used Revision 5.3 of the stream code and compiled with PGI C/C++ for
    Linux v.5.2-4:

       pgcc -O2 -Mvect=sse -Mnontemporal -Munsafe_par_align -mp -o ompstream stream_omp.c

    -------------------------------------------------------------
    This system uses 8 bytes per DOUBLE PRECISION word.
    -------------------------------------------------------------
    Array size = 1000000, Offset = 49152
    Total memory required = 22.9 MB.
    Each test is run 10 times, but only
    the *best* time for each is used.
    -------------------------------------------------------------
    Number of Threads requested = 2
    Number of Threads requested = 2
    -------------------------------------------------------------
    Your clock granularity/precision appears to be 1 microseconds.
    Each test below will take on the order of 4107 microseconds.
       (= 4107 clock ticks)
    Increase the size of the arrays if this shows that
    you are not getting at least 20 clock ticks per test.
    -------------------------------------------------------------
    WARNING -- The above is only a rough guideline.
    For best results, please be sure you know the
    precision of your system timer.
    -------------------------------------------------------------
    Function    Rate (MB/s)   Avg time   Min time   Max time
    Copy:         7928.7410     0.0018     0.0020     0.0020
    Scale:        7827.9323     0.0018     0.0020     0.0020
    Add:          8244.3322     0.0026     0.0029     0.0030
    Triad:        8247.0339     0.0026     0.0029     0.0029
    -------------------------------------------------------------
    Solution Validates
    -------------------------------------------------------------
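
The memory total and minimum times reported above can be cross-checked
against the array size. The following Python sketch assumes STREAM's usual
accounting: three arrays of 8-byte words, Copy and Scale moving two words
per element, Add and Triad moving three, rates in units of 10^6 bytes per
second, and the memory total in units of 2^20 bytes. The function names are
illustrative and are not part of the STREAM source.

```python
# Sketch: relate STREAM's reported figures to the array size.
# Assumptions (not taken from the message itself): three arrays of
# 8-byte doubles; Copy/Scale touch 2 arrays per element, Add/Triad
# touch 3; rates use 1 MB = 10**6 bytes, the memory total uses 2**20.

BYTES_PER_WORD = 8
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}


def total_memory_mib(n):
    """Memory for the three STREAM arrays, in units of 2**20 bytes."""
    return 3 * n * BYTES_PER_WORD / 2**20


def implied_min_time(kernel, n, rate_mb_s):
    """Best-case time in seconds implied by a reported rate."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_WORD * n
    return bytes_moved / (rate_mb_s * 1e6)


if __name__ == "__main__":
    n = 1_000_000  # array size of the DL385 run
    print(f"Total memory: {total_memory_mib(n):.1f} MB")
    # Rates copied from the DL385 table above.
    for kernel, rate in [("Copy", 7928.7410), ("Scale", 7827.9323),
                         ("Add", 8244.3322), ("Triad", 8247.0339)]:
        print(f"{kernel}: {implied_min_time(kernel, n, rate):.4f} s")
```

For the DL385 run this gives 22.9 MB and minimum times of 0.0020 s (Copy,
Scale) and 0.0029 s (Add, Triad), which agrees with the printed table.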

    =================================================================================

    HP ProLiant DL585
    4x2.6GHz/1MB L2 Opteron 852 processors
    32GB PC2700 memory (16x2GB DIMMs)
    SuSE Linux Enterprise Server 9 (x86_64)

    I used Revision 5.3 of the stream code and compiled with PGI C/C++ for
    Linux v.5.2-4:

       pgcc -O2 -Mvect=sse -Mnontemporal -Munsafe_par_align -mp -o ompstream stream_omp.c

    -------------------------------------------------------------
    This system uses 8 bytes per DOUBLE PRECISION word.
    -------------------------------------------------------------
    Array size = 2500000, Offset = 49152
    Total memory required = 57.2 MB.
    Each test is run 10 times, but only
    the *best* time for each is used.
    -------------------------------------------------------------
    Number of Threads requested = 4
    Number of Threads requested = 4
    Number of Threads requested = 4
    Number of Threads requested = 4
    -------------------------------------------------------------
    Your clock granularity/precision appears to be 1 microseconds.
    Each test below will take on the order of 14740 microseconds.
       (= 14740 clock ticks)
    Increase the size of the arrays if this shows that
    you are not getting at least 20 clock ticks per test.
    -------------------------------------------------------------
    WARNING -- The above is only a rough guideline.
    For best results, please be sure you know the
    precision of your system timer.
    -------------------------------------------------------------
    Function    Rate (MB/s)   Avg time   Min time   Max time
    Copy:        13893.0242     0.0026     0.0029     0.0029
    Scale:       13894.1747     0.0026     0.0029     0.0029
    Add:         14599.0393     0.0037     0.0041     0.0042
    Triad:       14562.7128     0.0037     0.0041     0.0041
    -------------------------------------------------------------
    Solution Validates
    -------------------------------------------------------------

    =================================================================================

    HP ProLiant BL25p
    2x2.6GHz/1MB L2 Opteron 252 processors
    16GB memory (8x2GB DIMMs)
    SuSE Linux Enterprise Server 9 (x86_64) - kernel 2.6.5-7.97-smp

    I used Revision 5.3 of the stream code and compiled with PGI C/C++ for
    Linux v.5.2-4:
       /usr/pgi/linux86-64/5.2/bin/pgcc -O2 -Mvect=sse -Mnontemporal -Munsafe_par_align -mp -o ompstream stream_omp.c

    -------------------------------------------------------------
    This system uses 8 bytes per DOUBLE PRECISION word.
    -------------------------------------------------------------
    Array size = 2500000, Offset = 512
    Total memory required = 57.2 MB.
    Each test is run 10 times, but only
    the *best* time for each is used.
    -------------------------------------------------------------
    Number of Threads requested = 2
    Number of Threads requested = 2
    -------------------------------------------------------------
    Your clock granularity/precision appears to be 1 microseconds.
    Each test below will take on the order of 8744 microseconds.
       (= 8744 clock ticks)
    Increase the size of the arrays if this shows that
    you are not getting at least 20 clock ticks per test.
    -------------------------------------------------------------
    WARNING -- The above is only a rough guideline.
    For best results, please be sure you know the
    precision of your system timer.
    -------------------------------------------------------------
    Function    Rate (MB/s)   Avg time   Min time   Max time
    Copy:         8435.0005     0.0043     0.0047     0.0048
    Scale:        8407.1036     0.0043     0.0048     0.0048
    Add:          8689.2563     0.0062     0.0069     0.0070
    Triad:        8670.6946     0.0062     0.0069     0.0070
    -------------------------------------------------------------
    Solution Validates
    -------------------------------------------------------------

    =================================================================================

    HP ProLiant BL35p
    2x2.4GHz/1MB L2 Opteron 250 processors
    8GB PC3200 memory (4x2GB DIMMs)
    SuSE Linux Enterprise Server 9 (x86_64)

    I used Revision 5.3 of the stream code and compiled with PGI C/C++ for
    Linux v.5.2-4:
       pgcc -O2 -Mvect=sse -Mnontemporal -mp -o ompstream stream_omp.c

    -------------------------------------------------------------
    This system uses 8 bytes per DOUBLE PRECISION word.
    -------------------------------------------------------------
    Array size = 6500000, Offset = 16384
    Total memory required = 148.8 MB.
    Each test is run 10 times, but only
    the *best* time for each is used.
    -------------------------------------------------------------
    Number of Threads requested = 2
    Number of Threads requested = 2
    -------------------------------------------------------------
    Your clock granularity/precision appears to be 1 microseconds.
    Each test below will take on the order of 83088 microseconds.
       (= 83088 clock ticks)
    Increase the size of the arrays if this shows that
    you are not getting at least 20 clock ticks per test.
    -------------------------------------------------------------
    WARNING -- The above is only a rough guideline.
    For best results, please be sure you know the
    precision of your system timer.
    -------------------------------------------------------------
    Function    Rate (MB/s)   Avg time   Min time   Max time
    Copy:         8742.5116     0.0107     0.0119     0.0120
    Scale:        8737.4332     0.0107     0.0119     0.0119
    Add:          9092.4575     0.0155     0.0172     0.0172
    Triad:        9062.3596     0.0155     0.0172     0.0172
    -------------------------------------------------------------
    Solution Validates
    -------------------------------------------------------------

    =================================================================================

    HP ProLiant ML370 G4
    1x3.6GHz/2MB L2 Xeon processor
    4GB memory (8x512MB DIMMs)
    Windows Server 2003

    Compiled with the Intel(R) C++ Compiler for 32-bit applications,
    Version 8.1 Build 20040802Z:

       icl -Qopenmp -QxW -Qparallel -O3 -w stream_d_omp.c win_second_wall.c -o omp2

    -------------------------------------------------------------
    This system uses 8 bytes per DOUBLE PRECISION word.
    -------------------------------------------------------------
    Array size = 10000000, Offset = 2048
    Total memory required = 228.9 MB.
    Each test is run 10 times, but only
    the *best* time for each is used.
    -------------------------------------------------------------
    Your clock granularity/precision appears to be 1 microseconds.
    Each test below will take on the order of 43215 microseconds.
       (= 43215 clock ticks)
    Increase the size of the arrays if this shows that
    you are not getting at least 20 clock ticks per test.
    -------------------------------------------------------------
    WARNING -- The above is only a rough guideline.
    For best results, please be sure you know the
    precision of your system timer.
    -------------------------------------------------------------
    Function    Rate (MB/s)   RMS time   Min time   Max time
    Copy:         3965.6825     0.0419     0.0403     0.0535
    Scale:        3925.5339     0.0409     0.0408     0.0413
    Add:          3515.2573     0.0684     0.0683     0.0685
    Triad:        3577.7362     0.0674     0.0671     0.0694
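
For a rough comparison across the systems, the Triad rates can be divided
by the CPU counts given at the top of this message. The sketch below is
only illustrative arithmetic on the numbers reported above: the runs differ
in array size, compiler, and operating system, so the per-CPU figures are
not a controlled comparison. The names RESULTS and per_cpu are hypothetical.

```python
# Triad rates (MB/s) and CPU counts copied from the results in this message.
RESULTS = {
    "DL385":    (2, 8247.0339),
    "DL585":    (4, 14562.7128),
    "BL25p":    (2, 8670.6946),
    "BL35p":    (2, 9062.3596),
    "ML370 G4": (1, 3577.7362),
}


def per_cpu(results):
    """Return the Triad rate divided by the CPU count for each system."""
    return {name: rate / cpus for name, (cpus, rate) in results.items()}


if __name__ == "__main__":
    for name, rate in per_cpu(RESULTS).items():
        print(f"{name:9s} {rate:10.1f} MB/s per CPU")
```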

    Thanks,

    David Schmidt
    Hewlett-Packard Company
    (281) 514-5039
    D.Schmidt@hp.com


