Re: 2010/432 Benchmarks
Posted by: T. Muntean
Date: 02 October 2010 09:21
I do apologize for my late reply!
Please find hereafter some comments to explain the meaning of our results reported in Section 8. Please note that our goal is NOT (yet, in this paper!) to publish the best parallel implementation, or a specific parallel implementation for a given architecture; we only propose a method for parallelising such an algorithm in order to obtain generic, efficient parallel versions. The performance figures are there to illustrate the results on usual, well-known platforms.
(Thanks to your comments, I've posted a new version of the paper today which, I hope, is more specific on that).
DETAILS FOR YOUR REMARKS:
HAGEN: The performance figures for Skein are curiously low. The official submission document of Skein reports 6.1 cycles/byte for Skein-512 on a 64-bit architecture, which should result in well over 300 MB/s hashing speed on a 2.2 GHz machine. Even for a Java implementation, sphlib reports over 100 MB/s. By contrast, Figure 9 of this paper shows speeds below 40 MB/s. Moreover, Table 1 seems to contradict the sphlib results, or at least reveal questionable benchmarking methods employed here.
MUNTEAN: The tests for our Skein implementation and for sphlib-2.0 were done on a Dell Latitude D830 (Intel(R) Core(TM)2 Duo CPU T7500 @ 2.20 GHz, Ubuntu 9.10), with a 700 MB file as input. Under these conditions the results obtained are those given in Table 1, the goal here being limited to comparing the two implementations under identical test conditions. By contrast, the results reported by sphlib were obtained on another platform (Intel Core 2 Duo, 2.4 GHz, Windows), and the quoted figure is for "long messages" of 8 Mbits (the initialization and finalization phases are not taken into account). The difference between the results may also be due to HDD performance (the tests were done on a laptop); results on a RAM disk would have been better.
Tests on smaller files with different input parameters have also been done (for example, 32 MB/s for a 500 MB file with the same input parameters), but the reported results are those for a deliberately chosen worst-case scenario: a big file (700 MB), big leaves (16 KB), and as many tree levels as possible.
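As a side note, Hagen's throughput estimate can be reproduced from the figures he quotes (6.1 cycles/byte for Skein-512, 2.2 GHz clock); the conversion is a one-liner, sketched here for reference:

```python
# Convert a cycles/byte figure to throughput at a given clock rate.
# The 6.1 cycles/byte and 2.2 GHz numbers are the ones quoted above.
def throughput_mb_per_s(cycles_per_byte: float, clock_hz: float) -> float:
    """Bytes hashed per second, expressed in MB/s (1 MB = 10**6 bytes)."""
    return clock_hz / cycles_per_byte / 1e6

print(throughput_mb_per_s(6.1, 2.2e9))  # roughly 360 MB/s, i.e. "well over 300 MB/s"
```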
HAGEN: Figure 10 is completely incomprehensible to me. Does it really show execution time, i.e., do larger numbers represent slower hashing? Then the sequential C implementation would be slower than the sequential Java implementation! If it is the other way around and higher means faster, then parallelization would actually slow down the Java implementation.
MUNTEAN: Figure 10 shows the speed results (MB/s) for Skein-512, using a 700 MB file as input, for the C implementation (from the submission CD), the sequential Java tree implementation, and the parallel implementation using the thread-pool method (for the tree implementation we used leaves of 16 KB and nodes of 512 bits).
Thus, the figure illustrates that, for this worst-case scenario, the sequential C tree implementation is slower than the parallel Java implementation but faster than the sequential Java tree implementation. Of course, larger messages take longer to hash, and if the input parameters are chosen correctly (at least two threads are created for the leaf level) then parallelization pays off. Otherwise a sequential implementation is faster, since the parallel implementation must create various objects (ThreadPool, Threads, etc.), which costs time.
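To make the thread-pool structure concrete, here is a minimal sketch of the scheme described above: the input is split into fixed-size leaves, a pool of worker threads hashes the leaves, and the concatenated leaf digests are hashed into the root. SHA-256 stands in for Skein-512 (the actual implementation discussed here is in Java), the 16 KB leaf size matches the reported test parameters, and a real tree mode would also encode node level and position:

```python
# Sketch of leaf-level tree hashing with a thread pool (SHA-256 as a
# stand-in for Skein-512; hypothetical helper, not the paper's code).
import hashlib
from concurrent.futures import ThreadPoolExecutor

LEAF_SIZE = 16 * 1024  # 16 KB leaves, as in the reported tests

def tree_hash(data: bytes, workers: int = 2) -> bytes:
    # Split the input into fixed-size leaves.
    leaves = [data[i:i + LEAF_SIZE] for i in range(0, len(data), LEAF_SIZE)]
    # Hash the leaves in parallel; map() preserves leaf order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        digests = list(pool.map(lambda leaf: hashlib.sha256(leaf).digest(), leaves))
    # Hash the concatenated leaf digests as the root node.
    return hashlib.sha256(b"".join(digests)).digest()
```

The single `map()` call gives one synchronization point at the end of the leaf level, which is the situation the reply below identifies as favourable.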
HAGEN: My second issue with the results is that the speed-ups due to parallelization shown in Figure 9 seem to be very modest. By contrast, the PySkein implementation shows more than 50% speed-up on similar hardware, starting from competitive levels of performance (e.g., from 370 MB/s to 579 MB/s by using two cores). This is achieved by parallelizing only the leaf level of the tree and alternatingly assigning leaf nodes to one of two threads. For moderately large leaf sizes, runtime is dominated by leaf node processing anyway, so that assigning inner nodes to their own threads mainly incurs overhead.
MUNTEAN: Our parallel implementation creates one thread for each node of the tree (maximum parallelism) and executes them whenever the OS thread scheduler decides (we have no realistic control over it), so parallelization happens at every level. When there is only one synchronization point (at the leaf level, where it can be controlled), you get better performance; with multiple synchronization points, plus the overhead of the thread scheduler, the resulting throughput is lower. We can deduce from this that a better implementation should reduce the number of synchronization points rather than aim for a higher (merely theoretical) speed-up. These results also show that your method and the one from the Skein reference paper offer better performance, but have consequences for the tree's structure (the height of the tree).
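The leaf-only scheme Hagen describes for PySkein (leaves assigned alternately to two threads, one join at the end) can be sketched as follows; again SHA-256 stands in for Skein-512, and the exact partitioning is an assumption beyond what the text states:

```python
# Sketch of leaf-level-only parallelism: leaf i goes to thread i % 2,
# and the only synchronization point is joining the two threads at the end.
# (SHA-256 stands in for Skein-512; illustrative, not PySkein's actual code.)
import hashlib
import threading

LEAF_SIZE = 16 * 1024

def two_thread_leaf_hash(data: bytes) -> bytes:
    leaves = [data[i:i + LEAF_SIZE] for i in range(0, len(data), LEAF_SIZE)]
    digests = [None] * len(leaves)

    def worker(parity: int) -> None:
        # Each thread handles every other leaf and writes into its own
        # slots, so no locking is needed while hashing.
        for i in range(parity, len(leaves), 2):
            digests[i] = hashlib.sha256(leaves[i]).digest()

    threads = [threading.Thread(target=worker, args=(p,)) for p in (0, 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # the single synchronization point
    return hashlib.sha256(b"".join(digests)).digest()
```

Compared with one thread per tree node, this keeps the inner nodes sequential and so trades theoretical parallelism for fewer synchronization points, which is exactly the trade-off discussed above.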
Hope this will help!
Edited 2 time(s). Last edit at 25-Oct-2010 06:24 by T. Muntean.