Kyoung-Don Kang
The first new socket function, acceptex(), combines the functions accept(), getsockname(), and recv(). This modification resulted in a modest performance gain (about 1% for small transfers). The other function, send_file(), instructs the kernel to send a file over a socket, replacing read() and write() calls. The authors show that send_file() on its own actually introduces considerable performance degradation (as much as 18% for large transfers). However, it provides the support needed for fundamental changes in the operating system that overcome this penalty.
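To make the contrast between a read()/write() loop and a single send-file call concrete, here is a minimal sketch in Python. It is not the AIX send_file() interface studied in the paper; it uses Python's socket.sendfile(), which delegates to the kernel's sendfile where the platform offers it and falls back to plain sends otherwise. The helper names and the demo() harness are illustrative assumptions.

```python
import os
import socket
import tempfile


def send_with_read_write(path, sock, chunk_size=4096):
    """Copy loop: every chunk is read into user space, then written to the socket."""
    sent = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            sock.sendall(chunk)
            sent += len(chunk)
    return sent


def send_with_sendfile(path, sock):
    """Single call: the file is handed to the socket layer in one request."""
    with open(path, "rb") as f:
        return sock.sendfile(f)


def demo(payload=b"x" * 2048):
    """Send the same small file both ways over a local socket pair (illustrative harness)."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
        path = tmp.name
    received = {}
    try:
        for sender in (send_with_read_write, send_with_sendfile):
            left, right = socket.socketpair()
            with left, right:
                sender(path, left)
                left.shutdown(socket.SHUT_WR)  # signal end of data to the reader
                data = bytearray()
                while True:
                    part = right.recv(4096)
                    if not part:
                        break
                    data.extend(part)
                received[sender.__name__] = bytes(data)
    finally:
        os.unlink(path)
    return received
```

Both paths deliver the same bytes; the difference the review discusses lies in how many system calls and copies are needed to get there.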
Per-byte optimizations include an in-kernel caching mechanism that eliminates copying data from the file system to the network protocol stack and avoids unnecessary file system accesses. This could be (and, in the case of Zeus, has been) implemented by the application programmer by memory-mapping files, but the authors argue that including it in the OS enables file sharing with other protocols. In addition, they simulated the ability of the OS to offload the calculation of the Internet checksum to an adapter by disabling checksums (eliminating the need to copy data into the CPU). The combination of these two modifications showed substantial gains in server throughput (21% over the baseline under the SURGE load).
Per-connection optimizations were introduced to reduce the number of TCP packets sent during a typical HTTP session. The first modification piggybacks the server's FIN packet onto the last data packet sent to the client. Another piggybacks the client's last data ACK onto the FIN packet. The third removes the client's ACK of the server's SYN-ACK and instead allows the initial client data packet (the HTTP GET request) to carry that ACK. These changes increased server throughput by as much as 5% for small files. However, it is likely that this gain will be negligible when persistent connections (HTTP 1.1) are used.
The reason for including acceptex() in this study is not clear. The performance gained by this function appears to come from the fact that its implementation combines two kernel calls into a single kernel call. This is consistent with previous studies that focus on reducing kernel calls. Thus, its inclusion in this report is not enlightening.
The significant gains from these modifications seem to lie in the caching of file data so that file accesses are minimized. The authors note that the motivation for this cache is that AIX does not possess an integrated I/O system. Some sort of comparison should have been made between their modification to AIX and a case where an integrated I/O system was used.
The visual presentation of the data in this report was lacking. Results were presented in tabular format and generally showed only incremental performance gains or a comparison between the baseline and the end result. A plot of file size versus throughput for each level of modification should have been included.
The bulk of this study looked at workloads generated by WebStone. It would have been interesting to see how the performance improvements were affected by varying workloads under SURGE.
In addition, the report should have discussed in more detail the foreseen effects of switching from HTTP 1.0 to HTTP 1.1.
Susan Bibeault et al.
Under a lightly loaded network: with a lightly loaded server, the reply latency and its variance were both low, as expected. As the server became loaded, the latency and the variance both went up, but they remained of comparable order. This is nothing new and is as expected.
Under a heavily loaded network (characterized by high packet loss): some surprising results were reported. Under light server load, high variance and latency were observed for small files (1K and 20K); the reason could be the slow-start algorithm in the face of packet loss. A similar effect was observed for a heavily loaded server. The most surprising result was observed with 500K files. In this scenario, the heavily loaded server performed *better* than the lightly loaded one. The authors couldn't explain this in a satisfactory way, and I think it's a peculiarity of their experiment.
Another important observation is the inadequacy of active measurements (particularly using Poisson-based tools) for determining performance as affected by the network. The reason is that such tools don't take TCP's slow start and backoff strategies into consideration, so they present a much more optimistic estimate of network performance.
Avneesh Saxena et al.
However, they found another important problem in general resource allocation: application semantics are missing from resource allocation, and sometimes the OS charges an unfortunate process that is not necessarily related. They present a new abstraction called "resource containers" to remedy this problem: they separate the protection domain and the resource principal, which are combined in the conventional process notion. They introduce application semantics: an application is allocated a certain amount of total resource no matter how many threads/processes are invoked, and that application is charged for this resource allocation. These two papers are pretty good in the sense that they really show new ideas and their performance gain. However, some parts of their implementation and the simulation method were not very clear.
IO-Lite really focused on the multiple copy problem which impairs the server performance significantly. They presented a way to avoid this problem by introducing a kind of shared read-only file buffer called "immutable buffer". I think their original observation about performance degradation due to unnecessary copies was very impressive and the paper is well written in general. They clearly explained their implementation.
Kyoung-Don Kang
Of the three papers, I think the resource container paper is the most valuable one, since it combines the other two. The lazy receiver processing paper is more focused on the network part, while the IO-Lite paper is more OS-focused.
The value of the first two papers is that they provide a way to solve a problem that has been haunting web services for many years: service QoS. In current server/OS implementations, it is quite difficult for applications to provide QoS over web platforms and, as these two papers indicate, the underlying OS doesn't offer support for network resource and OS resource management. Even worse, as the LRP paper indicates, servers often perform very poorly at times of network overload.
The container paper provides a unified way to solve this problem. The resource container is really a brand new idea for resource allocation that can be used extensively beyond server structures.
As indicated by the LRP paper, it enables an application to do a kind of pre-filtering of incoming packets, depending on the identity of the clients. Besides possible protection against denial-of-service attacks, I think it is more important in that it enables an application to provide QoS according to client identities (still limited to the IP address form). This is very useful for the research currently being done on Web QoS.
The IO-Lite paper offers a good performance gain by avoiding memory copies, similar to the idea we discussed in the last class. A possible problem is its handling of frequently modified buffers, since all buffers are regarded as immutable.
Haibin Wang et al.
Jinze Liu et al.
I thought this was a good paper. Their idea has definite merit. However, resource containers only provide a framework for resource management. The use of resource containers can only be as effective as the underlying resource policies to track and control resource consumption. In addition, I was curious to see how performance was affected under extreme conditions (many clients operating at varying priority levels, as well as CGI processing, and dividing the server into QoS partitions by companies). They showed that overhead was negligible, but it would be interesting to see if that changes as scheduling and accounting become more difficult.
Susan Bibeault et al.
The good point of this idea is that it increases the server's throughput in overload situations, and its use of statistical data is very neat. However, this approach is not suitable for real-time systems, where the deadlines of critical transactions must be met. Also, although this could improve the server's throughput under heavy overload, I wonder whether the response time to the clients will be too long. And I guess it would be odd if, under light server load, the response time were still as long as, or even worse than, in a heavy overload situation.
A better way, in my opinion, is to combine active interrupts and soft timers so that critical transactions get a better response time and common transactions get a guaranteed response time. Meanwhile, the server can increase throughput.
Jinze Liu et al.
I think the main idea in this paper is easily applicable to a soft real-time application such as Internet video streaming. Such an application can accommodate the probabilistic limitations of the soft timer. At the same time, it can benefit significantly from rate-based clocking of packet transmission: since it has a large number of packets to send, an efficient transmission mechanism is very important. The server performance loss can be reduced thanks to relatively fewer context switches and cache/TLB misses. As a result, the server might be able to serve more multimedia transactions in a given time.
Kyoung-Don Kang
The main idea of the new version of select() is to preserve information about the change in the state of a socket between select_wakeup() and do_scan(). With this information, the number of sockets that need to be checked each time decreases substantially. For ufalloc(), they converted the linear search scheme to a logarithmic-time algorithm by using a two-level tree of bitmaps.
Experiments show that these changes improve the performance of Web servers and proxies on realistic benchmarks and on a live proxy, without harming performance on naive benchmarks. (A good way to solve the problem that event-driven servers perform poorly under real conditions.)
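The two-level bitmap idea can be illustrated with a short sketch. This is not the kernel code from the paper: the class name, the 64-bit word size, and the use of a Python integer as the summary level are all assumptions made for illustration. The summary bitmap records which leaf words are completely full, so finding the lowest free descriptor means inspecting one summary word and one leaf word instead of scanning every entry.

```python
class TwoLevelBitmap:
    """Lowest-free-slot allocator: a summary bitmap over fixed-size leaf bitmaps."""

    WORD = 64  # bits per leaf word (assumed size)

    def __init__(self, capacity):
        self.capacity = capacity
        n_words = (capacity + self.WORD - 1) // self.WORD
        self.leaves = [0] * n_words  # bit set = descriptor in use
        self.full = 0                # bit i set = leaf word i has no free slot
        # Mark padding bits past `capacity` as used so they are never handed out.
        extra = n_words * self.WORD - capacity
        if extra:
            self.leaves[-1] = ((1 << extra) - 1) << (self.WORD - extra)

    @staticmethod
    def _lowest_zero(x):
        """Index of the lowest clear bit in x."""
        inv = ~x
        return (inv & -inv).bit_length() - 1

    def alloc(self):
        """Return the lowest free descriptor number and mark it in use."""
        w = self._lowest_zero(self.full)
        if w >= len(self.leaves):
            raise OSError("no free descriptors")
        b = self._lowest_zero(self.leaves[w])
        self.leaves[w] |= 1 << b
        if self.leaves[w] == (1 << self.WORD) - 1:
            self.full |= 1 << w
        return w * self.WORD + b

    def free(self, fd):
        """Release a descriptor; its leaf word can no longer be full."""
        w, b = divmod(fd, self.WORD)
        self.leaves[w] &= ~(1 << b)
        self.full &= ~(1 << w)
```

After filling the table, freeing descriptors 70 and 5 and then allocating twice returns 5 and then 70, which matches the lowest-available rule that ufalloc() has to respect.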
Ying and Avneesh
The authors set out by designing their own web server, which gives them more control over which connections are served first. Though control in their server is not complete, it does give them more control than typical Web servers do. They explored two policies: size-independent (FIFO) scheduling of requests, and shortest-connection-first scheduling, where each device provides service only to the shortest connection at any point in time. Scheduling on their experimental server is done through three queues (protocol, disk, and network) and a listen thread. Each queue has an associated group of threads. The listen thread blocks on the accept() call, waiting for new connections. When a new connection arrives, it creates a connection descriptor, whose state includes two file descriptors, a memory buffer, and a progress indicator. Protocol threads gather the name and size of the file to be transferred. The disk thread reads a block of data from the file system and passes it to the network queue. Network threads then call write() on the associated socket to transfer the contents of the connection's buffer to the kernel's socket buffer. The disk and network threads dequeue the connection that has the fewest bytes remaining to be served (shortest-connection-first). The protocol queue is always FIFO (it is where the file size is obtained). Their server also has the performance benefits of using a single process with a fixed number of threads (no context switching, IPC, etc.). A limitation the authors face is their lack of control over the order of events in the OS. To gain more control over that order, they sacrifice throughput and limit the number of threads each queue has.
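The dequeue rule for the disk and network queues can be sketched with a priority queue keyed on bytes remaining. This is only an illustration of the shortest-connection-first idea, not the authors' server: the class and function names, the block size, and the single-threaded serve() loop are assumptions.

```python
import heapq
import itertools


class ShortestConnectionFirstQueue:
    """Queue that always hands back the connection with the fewest bytes left."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker: FIFO among equal sizes

    def put(self, conn_id, bytes_remaining):
        heapq.heappush(self._heap, (bytes_remaining, next(self._seq), conn_id))

    def get(self):
        bytes_remaining, _, conn_id = heapq.heappop(self._heap)
        return conn_id, bytes_remaining

    def __len__(self):
        return len(self._heap)


def serve(connections, block_size=8192):
    """Serve one block at a time; return connection ids in completion order.

    `connections` maps a connection id to its file size in bytes.
    """
    queue = ShortestConnectionFirstQueue()
    for conn_id, size in connections.items():
        queue.put(conn_id, size)
    finished = []
    while len(queue):
        conn_id, remaining = queue.get()
        remaining -= min(block_size, remaining)  # one block written to the socket
        if remaining == 0:
            finished.append(conn_id)
        else:
            queue.put(conn_id, remaining)  # re-queue with updated progress
    return finished
```

With connections of 50000, 1000, and 20000 bytes, the completion order is the 1000-byte one, then the 20000-byte one, then the 50000-byte one, regardless of arrival order.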
Their server, implementing the two aforementioned policies, was compared in several ways to the Apache server. Their experiments were done using the SURGE workload generator and based on the idea of heavy-tailed web-task sizes (a tiny number of the very largest files make up most of the load on a web server).
Their experiments yielded the following results. They found that response time of small files is independent of file size, but that response time increases linearly as a function of file size for larger files. They showed that Apache provided worse response time for small files than their experimental server, punishing short connections. The shortest-connection-first policy improved mean response time 4-5 times compared to the size-independent policy for 1000+ UEs (even larger disparity when compared to Apache). For 1200-1600 UEs, the shortest-connection-first policy did not cause large jobs to perform worse than with the size-independent policy (1800 UEs saw shortest-connection-first policy hurt large jobs). This was attributed to the fact that large jobs do not suffer as badly in a heavy-tailed distribution versus an exponential distribution (largest 1% interrupted by less than 50% of total work arriving in heavy-tailed distribution; largest 1% interrupted by 95% of total work arriving in exponential distribution). Varying the size of the thread pool (for the queues) brought different results. The difference in performance between SCF and size-independent scheduling is the same up to 35 threads (SCF being a lot better). But the difference in performance decreases from 35-60 threads. At 60 threads, there is no buildup in the network queues (all jobs are being served very quickly, so the scheduling policy does not matter). The just-mentioned result suggests SCF could be very advantageous in a system where the degree of control over kernel scheduling is great. The just-mentioned experiment also showed that byte-throughput increased with the number of threads being used by the queues. Throughput was sacrificed to gain some control of kernel scheduling.
The authors did a great job at setting up the experiment and showing the benefits of SCF. I thought their comparisons to both Apache and to size-independent scheduling (used in most web servers today) showed how effective SCF could be. Their suggestions for architecture improvements were also valuable. However, I believe there are several glaring holes in this paper. The first is that this scheduling is concerned with web servers that deal with static files. What about dynamic content? They imply that their work is not very applicable to dynamic content, when many of the more popular web sites today are filled with dynamic content. As the authors themselves state, the SCF policy explored does not prevent the starvation of jobs when the server is permanently overloaded (large jobs will never get served). Another major issue is the effectiveness of SCF: does the architecture of the server have to allow a great deal of control over scheduling? Will giving the user this control come at some cost? To gain this control, the number of threads in the queue pools is reduced. Is this a real benefit, since throughput will be hurt? The validity of their experimental results is somewhat questionable, too. The test used two clients. One of these clients had a port that was malfunctioning. This had to have had an effect on the results.
Tim Bellaire et al.
The concept of a Customer Behavior Model Graph (CBMG) is introduced. This graph represents the different states that a user can be in while navigating the web site, and the transitions (with probabilities) for going from one state to another. Different classes of users are constructed (e.g. occasional buyer, heavy buyer) and CBMGs are created for each class.
One comment on the CBMG is that there is not a lot of description given to 'user think time'. The authors indicate that the think time is generated from an exponential distribution, but do not really justify this choice. If think time is exponentially distributed, then it has the property of being memoryless, but it would seem that successive think times of one user might not have this property. For example, if I want to buy a gift for someone, I might take a long time to figure out what to buy, but my next purchase might be wrapping paper, which would take much less think time to put in my cart. Therefore, the think time of the second purchase is dependent upon the first purchase.
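A CBMG of the kind described above can be sketched as a Markov chain with exponentially distributed think times between transitions. The states, transition probabilities, and mean think time below are hypothetical values chosen only for illustration; none of them come from the paper.

```python
import random

# Hypothetical CBMG for one user class: state -> {next state: probability}.
CBMG = {
    "entry": {"browse": 0.7, "search": 0.3},
    "browse": {"browse": 0.4, "search": 0.2, "add_to_cart": 0.1, "exit": 0.3},
    "search": {"browse": 0.3, "search": 0.3, "add_to_cart": 0.1, "exit": 0.3},
    "add_to_cart": {"browse": 0.3, "pay": 0.4, "exit": 0.3},
    "pay": {"exit": 1.0},
}


def simulate_session(graph, rng, mean_think_time=5.0, start="entry", max_steps=1000):
    """Walk the graph until 'exit'; return a list of (state, arrival_time) pairs.

    Think time before each transition is drawn from an exponential
    distribution, which makes successive think times independent.
    """
    state = start
    clock = 0.0
    visits = [(state, clock)]
    for _ in range(max_steps):
        if state == "exit":
            break
        clock += rng.expovariate(1.0 / mean_think_time)
        next_states = list(graph[state])
        weights = [graph[state][s] for s in next_states]
        state = rng.choices(next_states, weights=weights, k=1)[0]
        visits.append((state, clock))
    return visits
```

A different user class (say, a heavy buyer) would be modeled by a second dictionary with a higher probability on the add_to_cart and pay transitions. Making think time depend on the previous purchase, as the comment above suggests, would require replacing the expovariate() draw with a state-dependent one.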
The goal of their system is to assign priorities to incoming users based upon their usage profiles, and other factors. Every user is assigned a high priority upon visiting the web site or upon placing an item in their shopping cart. Users who have been around for a long time, but have not placed items in their shopping cart are downgraded to medium, and eventually low priorities.
The authors list m1 as a time limit to go from high to medium, and m2 from medium to low. However, they do not list any real time values for these variables. Depending on the times chosen, things like network delay (over slow WANs) may cause users with slower network connections to be assigned lower priorities, even though they may be heavy buyers.
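The downgrade rule can be written out as a small function to show how sensitive it is to the choice of m1 and m2. This is a sketch under assumptions: the function name and arguments are hypothetical, m2 is taken to be measured from the moment of the first downgrade (so the drop to low happens at m1 + m2), and the threshold values used in the example are invented, since the paper gives none.

```python
HIGH, MEDIUM, LOW = "high", "medium", "low"


def session_priority(time_in_session, cart_items, m1, m2):
    """Priority of a session under the aging rule described above.

    time_in_session: seconds since the user arrived (assumed unit).
    cart_items: number of items currently in the shopping cart.
    m1: time limit for high -> medium; m2: further limit for medium -> low.
    """
    if cart_items > 0:
        return HIGH  # placing an item in the cart keeps the session at high priority
    if time_in_session < m1:
        return HIGH
    if time_in_session < m1 + m2:
        return MEDIUM
    return LOW
```

For example, with m1 = 60 and m2 = 120 (made-up values), a browsing-only session is high priority at 30 seconds, medium at 90 seconds, and low at 200 seconds. A user on a slow link who needs 70 seconds to load the first few pages would already have been downgraded, which is the concern raised above.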
This priority scheme allows users who will (most likely) make a purchase to obtain better response times from the server. This prevents potential buyers from experiencing delays and consequently canceling a purchase. The goal is to maximize revenue and to minimize angry customers (those who leave as a result of slow access times) and lost revenue (from people who would have made a purchase, but did not because of slow access time).
The authors created a simulation of an electronic bookstore, and used SURGE to place loads on the server. The results showed the following things:
1. Revenue/sec increases for heavy buyers as load increases, and is better with priorities than without.
2. Revenue/sec increases up to a certain arrival rate for occasional buyers, and then begins to decrease. This is because under heavy loads, the occasional buyer has a lower priority than the heavy buyer.
3. For lightly loaded servers, there are no angry customers, but as the arrival rate increases, the number of angry customers increases, but these customers do not decrease the total revenue (so most likely, these are occasional buyers, not heavy buyers).
4. With priorities, there is no lost revenue. Without priorities, there is.
The idea presented in this paper seems novel and hopefully future e-commerce web servers will incorporate it. However, perhaps a more carefully designed priority scheme would be necessary. For example, if users realized that they got better performance browsing as long as they had at least one item in their cart, they may put an item in simply for the purposes of getting better performance. There, perhaps, needs to be a timeout even on customers with items in their carts. But this scheme might not even work. For the most part, users will receive correct priorities and service, but there will always be a few users who receive higher performance when they should not.
Perry Myers et al.
The study is only a preliminary step in investigating differentiated QoS mechanisms in Web servers. They studied the case where only two levels of service quality are needed. I believe there would be more priority levels in reality, and the work should refine the priority levels in order to deal with different conditions. Also, this work focused on a single-machine server instead of a cluster of web servers, and it focused on static files; these are open areas that should be investigated in future work.
Jinghui Chen et al.
In this paper, the authors present a priority-based approach to providing differentiated services. They made two different changes: one at the user level, by modifying the Apache Web server program, and the other in the Linux kernel. The paper suggests quite a bit of future work: for example, how to derive the priority from the URL, and how to schedule overall resources on the server side.
Jinze Liu et al.
In this paper, priority-based request scheduling, one way to provide differentiated quality of service, is investigated. There are two approaches to modifying Web servers so that they differentiate among requests by priority. The differences between the user-level approach and the kernel-level approach are analyzed, and their performance is measured.
Ying & Avneesh
To accomplish this the authors "slowed down" the serving of background (low priority) processes to allow more resources for "high priority" processes.
Ways to make background requests interruptible:
Rob Schutt et al.
In my view, the most significant contribution was the introduction of session-level semantics for scheduling and prioritizing requests. In typical web transactions, it's necessary to provide a much higher level of service to the client whose session looks more likely to generate revenue.
Avneesh et al.
This paper addresses QoS issues from the server's point of view. It proposes a number of procedures necessary to implement this: classification, admission control, and scheduling. The aim of this tiered QoS service is to support different performance levels for different classes of users and to maintain predictable performance, even at times of server overload.
As the paper points out, servers are currently a significant component of end-to-end delay, so improvements are needed. A number of factors make this necessary: decreased network delay, flash crowds, and new technologies. So it is important for the server to have some kind of mechanism to guarantee service even at times of overload.
To achieve this, the paper proposes a WebQoS architecture, which includes several components: session control, measurement, and application control. Session control deals with request classification, admission control, session management, and request scheduling; application control deals with resource control and resource scheduling.
The prototype built in this paper is fairly flexible; it supports a number of policies: user-class and target-class based classification, two admission control trigger parameters, and a number of scheduling policies. Since the HP platform has a built-in resource scheduler, the paper is not explicit on this point, though more detail on it would be very important.
The issues discussed in this paper are quite practical, and its architecture is a very standard (typical) solution. Still, we would expect it to be more explicit on the resource scheduling part.
Haibin et al.
Ying & Avneesh
The second major part of the paper dealt with a TCP handoff mechanism that needed to be implemented for the LARD scheme to work. The front-end must actually complete the connection with the client to determine the correct back end (since it is dependent on target). They, thus, needed to construct a mechanism for handing that connection to the back-end server that allowed the client to remain ignorant of the cluster mechanisms.
Their approach is appealing, but it would be interesting to see what happens when load reaches saturation on the front-end (since the front-end must do more than simply assign requests round-robin). Also, a major weakness of the paper is that they have not explored the modifications that need to be made to LARD in order to use HTTP 1.1.
Susan Bibeault et al.
This paper presents a new approach to improving server performance in cluster-based network servers by using content-based request distribution. It is supposed to achieve both locality and load balancing among all the servers by mapping requests to a set of target servers.
Compared with the state-of-the-art round-robin approach, this one can definitely achieve better performance by exploiting server locality (I think here they meant servers specialized for a set of requests).
The following may be some problems:
A simulation was performed using ClarkNet to simulate different workloads. Priority-based scheduling in their model improved the response time for high-priority traffic; however, response time increased, as expected, for low-priority traffic. As the ratio of high-priority traffic was increased, the slowdown and response time of high-priority tasks increased (due to self-similar traffic and the fact that high-priority tasks were not as likely to jump over lower-priority tasks because fewer of them were being served). Low-priority task response time and slowdown changed little as the high-priority task ratio increased.
The authors also discussed task assignment schemes. Enhanced shortest_queue_first is a new scheduling scheme in which a new task is assigned to the server with the fewest waiting tasks of equal or higher priority than the new task. The authors also discussed the three parts of a task's waiting time: the delay caused by the task already in service at the server upon its arrival, the delay due to tasks already enqueued upon its arrival, and the delay due to higher-priority tasks arriving after it.
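The difference between plain shortest-queue-first and the enhanced variant can be shown in a few lines. This is a sketch under assumptions: the function names are hypothetical, each server queue is modeled as a list of task priorities, and a larger number is taken to mean higher priority.

```python
def plain_sqf(server_queues):
    """Pick the server with the fewest waiting tasks, ignoring priority."""
    return min(range(len(server_queues)), key=lambda i: len(server_queues[i]))


def enhanced_sqf(server_queues, new_priority):
    """Pick the server with the fewest waiting tasks that would be served
    before the new task, i.e. tasks of equal or higher priority.

    Larger numbers mean higher priority (an assumption of this sketch).
    Ties go to the lowest-numbered server.
    """
    def tasks_ahead(queue):
        return sum(1 for p in queue if p >= new_priority)

    return min(range(len(server_queues)), key=lambda i: tasks_ahead(server_queues[i]))
```

With queues [[1, 1, 1, 1], [2, 2], [2, 1, 1]], plain SQF sends every new task to server 1 because it has the shortest queue. Enhanced SQF sends a priority-2 task to server 0 instead, since none of the four waiting tasks there would be served ahead of it, while a priority-1 task still goes to server 1.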
I thought this paper was very disorganized and offered nothing new. Obviously, priority-based scheduling will produce better response times for higher-priority tasks. The authors discuss task assignment schemes and why a task waits, but to me the reasoning was unclear (they gave theories without statistical backup). Another issue was why they used ClarkNet instead of SURGE to simulate the workload.
Timothy Bellaire et al.
In this paper, a web server model is presented. In this model, QoS can be assured by keeping the system load within thresholds, which can be achieved by admission control, scheduling, and efficient task assignment schemes.
A new metric is used to test the system's performance: slowdown, the ratio of a task's response time to its service time. This is reasonable, because a user is often willing to wait longer for a big task.
The system's performance is tested. With priority-based scheduling, it is shown that high-priority requests incur low delay even when the system approaches full utilization. The relationship between the increase in the high-priority ratio and the mean slowdown/mean response time curve of the high-priority tasks is shown.
Finally, the task assignment schemes are compared. Enhanced_SQF is shown to be helpful under normal load, compared to SQF.
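A short worked example shows why slowdown is a fairer yardstick than raw response time. The function names and the numbers are illustrative assumptions, not values from the paper.

```python
def slowdown(response_time, service_time):
    """Slowdown of one task: response time divided by service time."""
    if service_time <= 0:
        raise ValueError("service_time must be positive")
    return response_time / service_time


def mean_slowdown(tasks):
    """Mean slowdown over (response_time, service_time) pairs."""
    values = [slowdown(r, s) for r, s in tasks]
    return sum(values) / len(values)
```

Suppose a small task needs 0.1 s of service and a large one needs 10 s, and both wait 0.9 s in the queue. Their response times are 1.0 s and 10.9 s, so the large task looks far worse by response time. Their slowdowns are about 10 and 1.09, which shows that the small task is the one that suffered more relative to its size.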
Ying and Avneesh
Their implementation required modification to the DNS service. The client browser queries the DNS server for all replicated servers, then probes them (via broadcast UDP) and chooses the server that it measures to have the best response time. The authors demonstrate that this technique distributes the load between the servers more fairly and produces better user response times than the mirror and DNS-based methods. Since modifying DNS is not an immediate option, they outline an altered QoS-based implementation where all replicated servers keep track of their peer servers. A client gets a list of all replicated servers when it accesses (polls) the initial server. If the client determines that the responsiveness of the polled server is not adequate, it will poll the replicas (again using broadcast UDP) to determine the best response time. This proposed implementation judges whether the response time is adequate from a single measurement (the initial poll), which the authors claim is a bad idea since the measurement can be skewed by load distribution. Why not then enforce the UDP broadcast? I don't think the authors explained the single measurement point well enough. A more serious problem is the additional server load introduced by the initial (HTTP) poll of the server that returns the replica addresses. This polling can potentially introduce so many additional server requests that it makes the method undesirable under certain load distributions. This issue definitely needs to be assessed.
Susan Bibeault et al.
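The two client-side selection strategies described above can be sketched as follows. Network probing is abstracted behind a `probe` callable that returns a measured response time for a server; the function names, the `adequate_seconds` threshold, and the sample timings are assumptions, and a real client would probe over the network (the paper uses broadcast UDP) rather than look values up in a table.

```python
def pick_fastest(servers, probe):
    """DNS-variant: probe every replica and take the best response time."""
    timings = {server: probe(server) for server in servers}
    return min(timings, key=timings.get)


def pick_with_initial_poll(initial, replicas, probe, adequate_seconds):
    """Altered variant: stay with the initial server if its single measured
    response time is adequate; otherwise probe the replicas as well."""
    first = probe(initial)
    if first <= adequate_seconds:
        return initial
    timings = {initial: first}
    timings.update({server: probe(server) for server in replicas})
    return min(timings, key=timings.get)
```

With measured times of 0.30 s for server a, 0.05 s for b, and 0.12 s for c, pick_fastest() returns b. pick_with_initial_poll() starting at a returns a when the threshold is 0.5 s and b when it is 0.1 s, which shows how much the outcome hinges on one measurement and one threshold.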
This paper is valuable in that it offers something different from the popular implementations of distributed web hosting: round-robin and locality reference. As its different test scenarios show, it outperforms the latter two algorithms in response time and availability.
The latter two load distribution algorithms don't take the actual load at different servers into consideration, so it is possible that some servers become overloaded while others remain underloaded. The QoS-based algorithm considers this and lets each browser decide its own load distribution. The paper proposes two implementations of this strategy: the first requires changes to DNS, the second does not. But both require changes on the browser side and the server side, which could be a major impediment to actual deployment.
Another issue is the real-time propagation of actual load information across the network. This could be a really difficult problem, and the paper doesn't give a persuasive answer to it, though possible research directions are pointed out. So a simple load distribution algorithm is ultimately traced back to a classical network propagation problem, for which there is no optimal solution. Still, the paper does point in a good direction.
Haibin et al.
M/S is feasible. Fewer nodes to process static content do not cause any problems. In studies it has been shown that most web servers have sufficient throughput to deliver static content at a rate greater than what the outgoing link can handle. M/S is also better for several reasons. It offers better expandability (recruitment of non-dedicated nodes) and better efficiency (M/S separates dynamic and static content processing so long-running CGI scripts don't slow down static content processing). M/S also offers better availability (fault tolerance is easily implemented by masking the failures).
Because few nodes were available for experimentation, a lot of simulation was performed to analyze the M/S architecture. M/S was compared with two other architectural solutions: Flat-R (user requests are evenly distributed to nodes) and Flat-C (requests are scheduled to the node with the fewest outstanding connections). M/S was shown in their experiments to improve on Flat-C by 23% and on Flat-R by 36%. It was also shown that as the average ratio of CGI processing rate to static request rate decreased, CGI activity became more intensive and the M/S architecture was better suited to handle this. This is because as the average processing time of CGI requests increases, optimizing CGI performance becomes more critical. There will be more CGI requests, too, at each node, which increases the waiting time of static requests. The M/S architecture also benefited greatly from the ability to recruit non-dedicated resources. M/S was also compared with M/S-ns (no sampling is used to assess I/O and CPU demands), M/S-nr (no reservation is used to keep a portion of master resources available for static content processing), and M/S-1 (all nodes treated as master nodes). M/S significantly improved on M/S-nr, while M/S-1 had significant performance degradation. The performance improvement of M/S over M/S-ns averaged 14%. Performance sensitivity was studied to find the optimal number of master nodes for the system. It was found to be 6 for a 32-node system and 25 for a 128-node system.
Tim Bellaire et al.
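The two baseline dispatch policies that M/S is compared against can be sketched in a few lines. The class names and the `outstanding` list (current open connections per node) are assumptions, and Flat-R is modeled here as round-robin, which is only one way to distribute requests evenly; the paper may use a different mechanism.

```python
import itertools


class FlatR:
    """Even distribution, modeled here as round-robin (an assumption)."""

    def __init__(self, num_nodes):
        self._next = itertools.cycle(range(num_nodes))

    def pick(self, outstanding):
        return next(self._next)


class FlatC:
    """Send each request to the node with the fewest outstanding connections."""

    def pick(self, outstanding):
        return min(range(len(outstanding)), key=lambda i: outstanding[i])
```

With outstanding connection counts of [5, 0, 3], FlatR picks nodes 0, 1, 2 in turn regardless of load, while FlatC picks node 1. Neither policy distinguishes static from dynamic requests, which is the gap the M/S design targets.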
This paper addresses the difference between processing dynamic and static content in web server clusters, and proposes a cluster architecture that separates dynamic and static content processing onto different machines.
The authors observe that static and dynamic content differ in processing time, and that treating them the same way often prolongs the response time of static content unfairly. So the paper proposes to separate them and process requests differently according to whether they are dynamic or static.
The cluster architecture is composed of master nodes and slave nodes. Master nodes take care of both dynamic and static content processing, and also forward part of the dynamic content processing to slave nodes. Slave nodes are supposed to process only dynamic content.
By changing the number of nodes in a cluster, the ratio of master nodes in the cluster, and the ratio of dynamic load shifted to slave nodes, the paper tries to get a better stretch factor for request processing. The stretch factor is defined as the ratio of the response times of a sequence of requests to the service demands of these requests. Loosely, it means the ratio of waiting time to processing time. It is a more reasonable performance indicator since it also reflects the server's load.
The whole paper rests on its analysis of a complex mathematical inequality. A number of assumptions are made, though not all of them are valid, such as the assumption that requests arrive according to a Poisson distribution. Still, it offers a good way of thinking about the problem.
Another new thing in the paper is its invention of the RCGI technique. Though they say that RCGI offers better performance than CGI on a busy server, I can't understand why they didn't simply use FastCGI, since it eliminates the overhead of the remote process fork and is apparently better for performance.
The paper is valuable in its addressing of the difference in processing dynamic contents.
Haibin et al.