Solving the Multicore Challenge in an Established Enterprise Software Application

::Background

Pocket Soft's RTPatch Server software product uses a patented byte-level differencing algorithm to extract only the changes between two revisions of a file. RTPatch Server employs a resource efficient differencing algorithm that utilizes a fixed memory window and streaming read (i.e., single pass) of the input files, yielding linear build time (“patch” creation), with respect to input file size at a given change pattern. Resource efficiency and linear build time enables RTPatch Server to process multi-gigabyte sized files in a restricted and predictable timeframe, yielding byte-level reductions of 90-99%.

As with other forms of byte-level differencing, byte-level differencing in RTPatch Server is CPU-bound.

::Problem

RTPatch Server, version 4 and previous, used a single threaded differencing step, therefore execution on systems with multiple cores produced results no faster than a single core/processor alternative. Additionally, since the individual core clock-speed tends to be slower on a multicore machine compared to the single processor alternative, Build times on “faster” multicore machines were slightly longer than on contemporary single core/processor hardware.

RTPatch Server's inability to access all of the multicore CPU power of a typical server or developer's workstation was a technical limitation. Although this limitation was anticipated during the initial development of the RTPatch Server Build algorithm, the challenges of implementing a solution were significant. One option to make use of additional processors when computing differences for multiple files is to assign a file pair to each core; however, since RTPatch Server 5 is designed for very large file processing (multi-gigabyte), it is not uncommon in the “real world” that only a single file pair need be processed by a user of RTPatch Server 5. Assigning file pairs to different cores, therefore, provides no benefit to the common RTPatch Server 5 user. Further, even in cases where multiple files pairs require differencing, as is the case in online file backup applications, there typically remains a file that is large relative to all other files, and causes a bottle-neck on the overall application.

For these reasons, the problem in RTPatch Server 5 is utilization of multiple cores when differencing a single file pair.

The differencing process proceeds by comparing a block of the “new” file against a larger window (which has been extensively hashed and indexed) of the “old” file. Then, the matches that are found in this way are coded in such a fashion that they must be processed serially. This challenges the developer of a multicore algorithm by requiring extensive sharing of data across threads, the interleaving of asynchronous output with additional processing (coding), significant bookkeeping effort to keep track of what must be done/has been done, and special data structures to avoid redundant hashing/indexing. Of course, it would also be nice if the single-threaded code could be leveraged as much as possible, since that code is already debugged and functional.

::Goals

RTPatch Server 5 introduces multithreaded byte-level differencing, dramatically reducing Build time. Goals set at the start of development:

  1. Achieve 90+% sustained CPU utilization when differencing a file of sufficient size, and when instructed to use all available cores. In other words, the goal is to achieve sustained concurrency of the working cores throughout the entire patch creation process, which could take anywhere from seconds to hours, depending on the amount of data to difference and number of available cores.

  2. No Loss of Patch Efficiency (i.e., patch size). Dividing a single file into non-overlapped sections, differencing on those sections, then assembling into one patch is a relatively easy-to-implement approach; however, depending on the scattering of changes and the choice of division points, the generated patch file could be significantly larger than the single threaded approach, even in the common case.

  3. Overall Application Speedup. Portions of RTPatch Server 5 remain single-threaded (e.g., checksum generation) or do not otherwise benefit significantly from additional processing power, i.e., Amdahl's/Gustafson's Laws. However, the typical RTPatch Server input is relatively large (greater than 1GB), so the vast majority of the processing time is spent in the multicore enabled portion of code - identifying the changes and similarities between two files. Though overall application speedup is the primary goal of RTPatch Server 5, the amount of speedup will still depend on the input itself. In any case, the maximum application speedup that was considered achievable was n-1 times faster where “n” is the number of physical cores present on the test machine (e.g., 7x faster on an 8 core machine). Consider three examples:
    Example 1: A file pair that is a typical file update, resulting in a 90% reduced patch;

    Example 2: A file pair that is completely dissimilar and results in a 40% reduced patch (i.e., approximating simple compression);

    Example 3: A file pair that is nearly identical, and results in a 99+% reduced patch.
    In the case of Example 1, overall application time is primarily devoted to the actual difference processing (which is multithreaded). As a percentage of overall single-threaded runtime, the portion of code that have little to no impact by multithreading is relatively small, thus overall application speedup can in fact approach n-1 times faster. On the other hand, in Examples 2 and 3, the percentage of overall single-threaded runtime is either spent on writing out the patch bytes (Example 2), which is file I/O bound, or in setting up the difference process, computing checksums, etc. (Example 3). In the cases of Examples 2 and 3, overall application speedup should occur, but cannot approach the degree of speedup attributed to the multithreaded algorithm portion of code. Because the RTPatch Server 5 test suite is comprised of real-world data with varying change patterns and similarities of file pairs, our goal with RTPatch Server 5 was to realize broad range of overall application speedup between 2x faster and n-1 times faster as the max attainable.

  4. Take advantage of Intel® Hyper-Threading when available. Rather than limit tests to use a maximum of "n" threads, where "n" is the number of physical cores available on the machine, our tests were designed to use "2n" when Hyper-Threading is detected. As the results below show (dual quad core, i.e., 8 physical cores), setting a thread count to twice the number of physical cores (16 threads) yielded faster results, indicating that hyperthreading provided measurable benefits.

::Environment and Tools

Compiler: Visual Studio 2005, Microsoft
Language: C, assembly code
Hardware/OS: Windows 7 Ultimate, x64 Edition, dual Intel Xeon Quad Core @ 3.33Ghz (W5590 chips; 8 core total, w/HT enabled), 32GB RAM, Seagate 15K.6 SAS hard drive.

::Multicore Development Library

Pocket Soft chose Postulate5’s UnThread API as the Multicore Development Library for the RTPatch Server multicore project; UnThread provides executive-based (task dispatch) thread pooling with advanced affinity settings, but also (and more importantly for our purposes) has features that support all of the particular challenges that were anticipated in the multicore conversion.

The massive shared data requirement of RTPatch Server is handled by UnThread’s buffer mechanism. It was unclear at the outset whether direct shared memory would do too much data serialization, and UnThread’s buffers allow easy code change from shared buffers to separate buffers with data-copying.

The interleaving of the output data (with additional processing for coding) was handled cleanly by UnThread’s ordered server mechanism, which provides a separate thread to flexibly handle queued requests from any other threads (after optionally sorting the requests).

RTPatch Server’s bookkeeping requirements were met by UnThread’s Task Status Window (TSW) and Task Resource Array (TRA) objects. The TSW allows the thread pool executive to keep track of which blocks of the “new” file are currently being processed by threads in the pool, and make sure that the leading task never gets too far ahead of the trailing task (important for keeping memory requirements small) while the TRA allows the executive to maintain an array of per-task resources that are reused by threads in the pool so as to avoid extensive reinitialization of various data structures.

The avoidance of redundant hashing/indexing is facilitated by UnThread’s Shared Virtual Array (SVA) feature. These objects provide a flexible double-buffering scheme that allows worker threads in RTPatch Server to freely request access to needed hash/index structures without ever needing to build one of these tables more than once. The key to making this work is an atomic release/retrieve call that allows a worker thread to release previously used data and retrieve new data (some of which may be the same) without concern that another thread will cause needed data to be “swapped out” and require regeneration.

Postulate5’s UnThread met all of the design requirements for RTPatch Server.

::Results, Application Speedup

Pocket Soft's test suite is gathered from real-world applications of its byte-level differencing products. Note in particular that overall patch compression was not the main focus of these test results. Since RTPatch Server includes a range of speed and resource usage options, optimizing patch size for a file pair is a separate topic, and outside the scope of this document. For more details on resource usage and efficiency, as well as patch compression results, please contact Pocket Soft.

Several file pairs of note are listed below showing application speedup results, on an Intel dual quad-core machine (8 physical cores total):

Description: US Postal Service Data

Description: Structured Data File (proprietary)

Description: Japanese Map Data

* 16 threads used to take advantage of Intel Hyper-Threading (see Goal #4 above).

The Japanese city map data results are of particular note, as it is this file pair that was a primary driving factor in creation of a multicore version of RTPatch Server. Provided by an existing Fortune 500 customer, this data file suffered from poor simple compression results (less than 30%), as well as exceedingly long runtime in the single threaded versions of RTPatch Server. Though an optimal patch output of 77.8% compression is achievable with RTPatch Server, the results shown above are presented both to show variation, as well as for test expediency – the 77.8% compression case also benefited significantly from new version 5, yielding a 7.26x faster speedup at 16 threads, thus reducing the build time to several hours, down from over a day when using a single thread/core.

For more statistics on any file set, please contact Pocket Soft.

::Summary

Conversion of RTPatch Server to a multithreaded application met or exceeded all goals:

  1. Sustained CPU Utilization. Given sufficiently sized inputs and using all available cores, CPU utilization plateaus of 99% are common, with an average CPU utilization during the differencing phase, exceeding 90% throughout the differencing step, demonstrating a high level of sustained concurrency.

  2. No Loss of Patch Efficiency (i.e., patch size). Less than a +/- 1% variation in patch size was observed in the general case, when comparing single threaded differencing to multithreaded differencing.

  3. Overall Application Speedup. In the typical, real-world cases of GB+ sized files, RTPatch Server 5 reached up to 7x faster build times on an 8 core system, achieving the stated speedup goal of n-1 times faster. Average speedup of about 4x faster on an 8 core system was recorded for all tests cases, which included small files with little to no multithreading possible.

  4. Take advantage of Intel® Hyper-Threading when available. The max speedup that enabled RTPatch Server to achieve Goal #3 was due in part to the use of Hyper-Threading on the test machine. The extra threads served as "slack threads" to utilize spare cycles on otherwise idle cores.