
Lightweight processes behavior with a new PID [Resolved]

I've been experimenting with lightweight processes (LWPs): basically calling the clone function and assigning a new PID to the cloned LWP. This works fine; it lets me identify all the child threads of those LWPs. The small problem I ran into is performance: it degrades quite a bit (processing slows down by 30%). Now, I was reading that LWPs can be scheduled and priorities assigned to them (I did not try either). Would this help the performance?
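For illustration, the clone() call I'm doing is roughly like this (a minimal sketch, not my exact code; the flag set and stack size are placeholders):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define STACK_SIZE (1024 * 1024)

    static int worker(void *arg)
    {
        /* Shares the address space with the parent, but because CLONE_THREAD
           is NOT passed, the child gets its own PID and shows up as a
           separate task. */
        printf("worker PID %d, parent PID %d\n", getpid(), getppid());
        return 0;
    }

    int main(void)
    {
        char *stack = malloc(STACK_SIZE);
        if (!stack)
            return 1;

        /* Share memory and resources like a thread, but without CLONE_THREAD,
           so the child is a distinct PID. */
        int flags = CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | SIGCHLD;
        pid_t pid = clone(worker, stack + STACK_SIZE, flags, NULL);
        if (pid == -1) {
            perror("clone");
            return 1;
        }
        printf("parent PID %d cloned LWP %d\n", getpid(), pid);

        waitpid(pid, NULL, 0);
        free(stack);
        return 0;
    }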

One thing I noticed when running strace is that futex usage exploded by a factor of 8-10. Could the LWPs be a major cause of that? The explosion in context switches is understandable, but I thought that LWPs share the same memory space, so there should not be an explosion in futex usage.

Are there any tricks or best practices that should be followed when using LWPs or deciding to use them?

Would forking be a better or a worse option from the performance standpoint?


Question Credit: Marko Bencik
Question Reference
Asked July 20, 2019
Posted Under: Unix Linux
8 views
2 Answers

After some days of testing, I've found out the following.

The futexes come from sharing memory buffers between the threads (unfortunately this is unavoidable); the threads run math models at quite high frequencies. The futexes directly impact the execution latency, but not linearly; the impact depends mostly on how high the data frequency is.

There is a possibility to avoid some allocations with memory pools or something similar, since I know the size of the majority of the data. That has a positive effect on execution time and CPU load.
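Roughly the kind of pool I mean (a minimal sketch with placeholder sizes, not the real model code):

    #include <stdlib.h>

    /* Fixed-size block pool: one up-front reservation, a simple free list,
       no malloc()/free() on the hot path. Sizes are placeholders. */
    #define BLOCK_SIZE  4096
    #define BLOCK_COUNT 1024

    typedef struct block {
        struct block *next;
    } block_t;

    static char     pool_mem[BLOCK_SIZE * BLOCK_COUNT];
    static block_t *free_list;

    void pool_init(void)
    {
        for (size_t i = 0; i < BLOCK_COUNT; i++) {
            block_t *b = (block_t *)(pool_mem + i * BLOCK_SIZE);
            b->next = free_list;
            free_list = b;
        }
    }

    void *pool_alloc(void)
    {
        block_t *b = free_list;
        if (!b)
            return NULL;            /* pool exhausted */
        free_list = b->next;
        return b;
    }

    void pool_free(void *p)
    {
        block_t *b = p;
        b->next = free_list;
        free_list = b;
    }

This sketch is not thread-safe; if buffers are handed out from several threads, it needs a per-thread pool or a lock-free free list, otherwise the pool's own lock just re-introduces the futex traffic.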

The LWPs are cloned with a PID different from the parent's PID; this is OK on Linux, but it does not work with pthreads. Regarding performance, it degrades, but not significantly because of the LWPs themselves; the shared memory resources are a far bigger problem.

Regarding building the app with jemalloc, tcmalloc and locklessMalloc: none of those libraries gave me a competitive edge. tcmalloc is great if the core count is 4 or higher, and jemalloc is good if the caches should be big, but for me the results were within +/- 1% of the baseline across multiple scenarios.

Regarding mapping more memory regions into the process: that makes a big difference in the overall execution. filbranden was right on that part; it hits us very hard when execution starts or when the data streams increase the amount of data. We improved the behavior there with the pools.

One test series included running the application with SCHED_RR, which also improved the execution. The thing is, it also raises the threads' priority, so that has an impact. The advantage is that I am able to run just the physical cores, without hyperthreading, very reliably. The reason is the behavior of the application and the models; for unknown reasons, hyperthreading messes things up quite a bit.
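For reference, the SCHED_RR switch itself is roughly this (a sketch; the priority value is just an example, and the call needs root or CAP_SYS_NICE):

    #include <sched.h>
    #include <stdio.h>

    /* Put the calling thread into the SCHED_RR real-time class. Priority 10
       is an arbitrary example; valid values run from
       sched_get_priority_min(SCHED_RR) to sched_get_priority_max(SCHED_RR). */
    int use_sched_rr(void)
    {
        struct sched_param sp = { .sched_priority = 10 };

        if (sched_setscheduler(0, SCHED_RR, &sp) == -1) {
            perror("sched_setscheduler");
            return -1;
        }
        return 0;
    }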

Forking the individual models helps with identifying which threads belong to which model, but it does not give us any advantage in execution speed. It is definitely a trade-off, but also a solution for identifying models that are running badly and fixing them.


credit: Marko Bencik
Answered July 20, 2019

One thing I noticed when running strace is that futex usage exploded by a factor of 8-10. Could the LWPs be a major cause of that? [...], but I thought that LWPs share the same memory space, so there should not be an explosion in futex usage.

Yes, using LWPs will likely increase the usage of futexes, since they're meant for exactly this case: synchronization of different threads sharing the same memory.

Futexes are used for the slow path of lock operations on shared memory: the LWP or thread asks the kernel to block it until it is notified that the lock has been released.

The fast path uses atomic operations (the CPU atomically increments or decrements a counter, so it can detect whether it was the first to lock or the last to unlock), so no syscalls need to be issued on the fast path.
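To make that concrete, here is a stripped-down sketch of such a futex-based lock (in the spirit of the usual textbook futex lock, not glibc's actual implementation):

    #include <linux/futex.h>
    #include <stdatomic.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Lock word states: 0 = unlocked, 1 = locked, 2 = locked with waiters. */
    static atomic_int lock_word;

    static long futex(atomic_int *uaddr, int op, int val)
    {
        return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
    }

    void lock(void)
    {
        int c = 0;
        /* Fast path: uncontended, a single atomic compare-and-swap, no syscall. */
        if (atomic_compare_exchange_strong(&lock_word, &c, 1))
            return;

        /* Slow path: advertise a waiter (state 2) and sleep in the kernel.
           FUTEX_WAIT only sleeps if the word is still 2, so racing with
           unlock() just means we retry the exchange. */
        if (c != 2)
            c = atomic_exchange(&lock_word, 2);
        while (c != 0) {
            futex(&lock_word, FUTEX_WAIT, 2);
            c = atomic_exchange(&lock_word, 2);
        }
    }

    void unlock(void)
    {
        /* Fast path: nobody waiting, just clear the word. */
        if (atomic_exchange(&lock_word, 0) == 2)
            /* Slow path: there were waiters, wake one of them up. */
            futex(&lock_word, FUTEX_WAKE, 1);
    }

An uncontended lock/unlock pair never enters the kernel; only contention produces the futex() calls you see in strace.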

Increasing lock contention means more futex operations will happen, which is likely to impact performance, not only because of the syscalls themselves, but because whenever futexes are invoked, some LWPs or threads are sleeping, waiting for resources.

Code in glibc is aware of the use of multiple threads or LWPs, so even if you don't have explicit locks in your code, the system libraries will, and it's possible you'll have lock contention as a result, possibly slowing down your program as described.

It degrades quite a bit (processing slows down by 30%).

Another factor when you have many threads sharing memory is that there are also some in-kernel memory structures with coarse locks that might create lock contention as well.

In particular, the mmap_sem, which needs to be locked for writing every time you map more memory regions into your process. (For instance, allocating more memory with malloc() and friends might trigger this.)
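If you're on glibc, mallopt() gives a few knobs that can reduce how often those mappings change; a sketch, with example values that are not tuned for any particular workload:

    #include <malloc.h>

    /* glibc-specific knobs (values are only examples): grow the heap in bigger
       steps and reserve mmap() for very large requests, so fewer allocations
       end up changing the memory mappings (and taking mmap_sem for writing)
       while the process is running. */
    void tune_malloc(void)
    {
        mallopt(M_MMAP_THRESHOLD, 64 * 1024 * 1024); /* only >64 MiB allocs use mmap */
        mallopt(M_TOP_PAD,        16 * 1024 * 1024); /* extend the heap 16 MiB at a time */
    }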

Now, I was reading that LWPs can be scheduled and priorities assigned to them (I did not try either). Would this help the performance?

Possibly... It's hard to say. You'd have to benchmark.

If what you're seeing is lock contention, though, and it's spread across your code paths (not localized to one or a few LWPs), then it's unlikely to help.

You can use the perf tool to help you understand performance of a set of processes on Linux. It's able to show you the hot spots and also show you whether kernel hot spots exist.

Are there any tricks or best practices that should be followed when using LWPs or deciding to use them?

For LWPs or threads, when using a very large number of them, the implementation of malloc() becomes quite important, since there are issues both in the kernel (expanding the memory mappings causes potential contention on mmap_sem) and in userspace (using a single arena for all threads means locking is needed to reserve space from it).

Some malloc libraries were written to improve performance in a scenario where massive threading is used. For example, tcmalloc or jemalloc. Adopting these is usually simple (just link an additional library in) and can yield large performance boosts if that's indeed your bottleneck. As always, benchmark to find out if this helps or not.

Would forking be a better or a worse option from the performance standpoint?

Possibly better. It's hard to tell; you need to benchmark to see whether it's better in your specific case.

Same as for adopting LWPs, you should benchmark to tell whether that's actually worth it or not.

As described above, using shared memory between LWPs or threads (or having more LWPs or threads sharing the same memory) increases the potential for lock contention (and even if you don't have explicit locks yourself, glibc and the kernel will.) So it's quite possible LWPs are actually slowing you down.

I've witnessed a case in which a multi-threaded application was too slow. A developer changed it to use a single thread rather than 40 threads, and the application suddenly sped up by 1,000%. It turned out it was spending 90% of its time on lock contention!


credit: filbranden
Answered July 20, 2019