Kernel Mode Drivers

Part 13: Basic technique. Synchronization: Mutually exclusive access.



In the last article, we used synchronization to wait for a timer to fire and for a thread to finish. Another common problem that synchronization solves is exclusive access to data.

In the example we will look at this time, not one but several threads are created. All of them work with the same resource, a variable of size DWORD. The task of each thread is to increment this variable several times. When all the work is done, the value of the variable should equal the number of threads multiplied by the number of passes.

The variable whose value the threads increment stands in for some real data, for example statistics on the number of interceptions of one of the system services (we will deal with that in one of the following articles). If the threads are not synchronized, they will access the shared data in an unpredictable order, and instead of statistics we will, at best, get garbage; at worst, the driver will crash the system.

A mutex synchronization object is a perfect fit for synchronizing the threads in this case. In the kernel it is also called a mutant. The term mutex comes from the words "mutual exclusion", i.e. mutually exclusive access.

Before starting to work with a shared resource, a thread must acquire the mutex. Only one thread can own a mutex at a time, so if a thread manages to acquire it, the mutex was free. If the mutex is busy, the system suspends the thread until the mutex is released. Having acquired the mutex, the thread works with the resource, guaranteed that no matter how many threads try to acquire the same mutex, they will all wait until it is released. When it is done with the resource, the thread releases the mutex. Even if several threads are waiting for the mutex at that moment, only one of them will acquire it (which one is up to the scheduler). The mutex thus ensures that exactly one thread gets exclusive access to the resource.
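The same acquire / work / release pattern, written in C with the DDK functions that the macros used below wrap, might look roughly like this (a sketch only; g_Mutex and g_SharedCounter are stand-ins for the driver's own variables):

 KMUTEX g_Mutex;          /* initialized once with KeInitializeMutex(&g_Mutex, 0) */
 ULONG  g_SharedCounter;  /* the shared resource */

 VOID TouchSharedResource(VOID)
 {
     /* Acquire the mutex: blocks until no other thread owns it (PASSIVE_LEVEL only) */
     KeWaitForSingleObject(&g_Mutex, Executive, KernelMode, FALSE, NULL);

     g_SharedCounter++;   /* exclusive access is guaranteed here */

     /* Release the mutex so the next waiting thread can proceed */
     KeReleaseMutex(&g_Mutex, FALSE);
 }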



13.1 MutualExclusion Driver Source Code

 ;@echo off
 ;goto make

 ;:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
 ;
 ;  MutualExclusion - Mutually exclusive access to a shared resource
 ;
 ;:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

 .386
 .model flat, stdcall
 option casemap:none

 ;:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
 ;                               I N C L U D E D    F I L E S
 ;:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

 include \masm32\include\w2k\ntstatus.inc
 include \masm32\include\w2k\ntddk.inc
 include \masm32\include\w2k\ntoskrnl.inc

 includelib \masm32\lib\w2k\ntoskrnl.lib

 include \masm32\Macros\Strings.mac

 NUM_THREADS equ 5       ; must not exceed MAXIMUM_WAIT_OBJECTS (64)
 NUM_WORKS   equ 10

 ;:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
 ;                                       M A C R O S
 ;:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

 include Mutex.mac

 ;:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
 ;                                  C O N S T A N T    D A T A
 ;:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

 .const

 CCOUNTED_UNICODE_STRING "\\Device\\MutualExclusion", g_usDeviceName, 4
 CCOUNTED_UNICODE_STRING "\\DosDevices\\MutualExclusion", g_usSymbolicLinkName, 4

 ;:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
 ;                          U N I N I T I A L I Z E D    D A T A
 ;:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

 .data?

 g_pkWaitBlock       PKWAIT_BLOCK    ?
 g_apkThreads        DWORD NUM_THREADS dup(?)    ; Array of pointers to KTHREAD
 g_dwCountThreads    DWORD   ?
 g_kMutex            KMUTEX  <>
 g_dwWorkElement     DWORD   ?

 ;:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
 ;                                           C O D E
 ;:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

 .code

 ;:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
 ;                                        ThreadProc                                                 
 ;:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

 ThreadProc proc uses ebx Param:DWORD

 local liDelayTime:LARGE_INTEGER
 local pkThread:DWORD     ; PKTHREAD
 local dwWorkElement:DWORD

     invoke PsGetCurrentThread
     mov pkThread, eax
     invoke DbgPrint, $CTA0("MutualExclusion: Thread %08X is entering ThreadProc\n"), pkThread

     xor ebx, ebx
     .while ebx < NUM_WORKS

         invoke DbgPrint, $CTA0("MutualExclusion: Thread %08X is working on #%d\n"), pkThread, ebx
        
         MUTEX_WAIT addr g_kMutex

          ; Read the value of the shared resource

         push g_dwWorkElement
         pop dwWorkElement

          ; Simulate work with the shared resource

          invoke rand             ; Returns a pseudo-random number in the range 0 - 07FFFh
         shl eax, 4              ; * 16
          neg eax                 ; delay = 0 - ~50 ms
         or liDelayTime.HighPart, -1
         mov liDelayTime.LowPart, eax
         invoke KeDelayExecutionThread, KernelMode, FALSE, addr liDelayTime

          ; Modify the shared resource and write it back

         inc dwWorkElement

         push dwWorkElement
         pop g_dwWorkElement

         MUTEX_RELEASE addr g_kMutex

         mov eax, liDelayTime.LowPart
         neg eax
          mov edx, 3518437209     ; Magic number
          mul edx                 ; Division by 10000 via multiplication. Gives milliseconds.
         shr edx, 13
         invoke DbgPrint, $CTA0("MutualExclusion: Thread %08X work #%d is done (%02dms)\n"), \
                            pkThread, ebx, edx

         inc ebx

     .endw

     invoke DbgPrint, $CTA0("MutualExclusion: Thread %08X is about to terminate\n"), pkThread

     invoke PsTerminateSystemThread, STATUS_SUCCESS

     ret

 ThreadProc endp

 ;:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
 ;                                          CleanUp                                                  
 ;:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

 CleanUp proc pDriverObject:PDRIVER_OBJECT

     invoke IoDeleteSymbolicLink, addr g_usSymbolicLinkName

     mov eax, pDriverObject
     invoke IoDeleteDevice, (DRIVER_OBJECT PTR [eax]).DeviceObject

     .if g_pkWaitBlock != NULL
         invoke ExFreePool, g_pkWaitBlock
         and g_pkWaitBlock, NULL
     .endif

     ret

 CleanUp endp

 ;:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
 ;                                       DriverUnload                                                
 ;:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

 DriverUnload proc pDriverObject:PDRIVER_OBJECT

     invoke DbgPrint, $CTA0("MutualExclusion: Entering DriverUnload\n")
     invoke DbgPrint, $CTA0("MutualExclusion: Wait for threads exit...\n")

     ; Wait for all the threads to finish

     .if g_dwCountThreads > 0

         invoke KeWaitForMultipleObjects, g_dwCountThreads, addr g_apkThreads, WaitAll, \
                     Executive, KernelMode, FALSE, NULL, g_pkWaitBlock

         .while g_dwCountThreads
             dec g_dwCountThreads
             mov eax, g_dwCountThreads   ; zero-based
             fastcall ObfDereferenceObject, g_apkThreads[eax * type g_apkThreads]
         .endw

     .endif

     invoke CleanUp, pDriverObject

     ; Output the results. g_dwWorkElement should equal NUM_THREADS * NUM_WORKS

     invoke DbgPrint, $CTA0("MutualExclusion: WorkElement = %d\n"), g_dwWorkElement
 
     invoke DbgPrint, $CTA0("MutualExclusion: Leaving DriverUnload\n")

     ret

 DriverUnload endp

 ; Just in case, let's check at assembly time that NUM_THREADS does not exceed
 ; the maximum allowed value MAXIMUM_WAIT_OBJECTS.

 ; If we make the system wait on more than MAXIMUM_WAIT_OBJECTS objects,
 ; we will get a BSOD with code 0xC (MAXIMUM_WAIT_OBJECTS_EXCEEDED).

 IF NUM_THREADS GT MAXIMUM_WAIT_OBJECTS
     .ERR Maximum number of wait objects exceeded!
 ENDIF

 ;:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
 ;                                       StartThreads
 ;:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

 StartThreads proc uses ebx

 local hThread:HANDLE
 local i:DWORD
    
     and i, 0
     xor ebx, ebx
     .while i < NUM_THREADS
    
         invoke PsCreateSystemThread, addr hThread, THREAD_ALL_ACCESS, NULL, NULL, NULL, ThreadProc, 0
         .if eax == STATUS_SUCCESS

             invoke ObReferenceObjectByHandle, hThread, THREAD_ALL_ACCESS, NULL, KernelMode, \
                                     addr g_apkThreads[ebx * type g_apkThreads], NULL

             invoke ZwClose, hThread
             invoke DbgPrint, $CTA0("MutualExclusion: System thread created. Thread Object: %08X\n"), \
                                     g_apkThreads[ebx * type g_apkThreads]
             inc ebx
         .else
             invoke DbgPrint, $CTA0("MutualExclusion: Can't create system thread. Status: %08X\n"), eax
         .endif
         inc i
     .endw

     mov g_dwCountThreads, ebx
     .if ebx != 0
         mov eax, STATUS_SUCCESS
     .else
         mov eax, STATUS_UNSUCCESSFUL
     .endif

     ret

 StartThreads endp

 ;:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
 ;                                       DriverEntry                                                 
 ;:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

 DriverEntry proc pDriverObject:PDRIVER_OBJECT, pusRegistryPath:PUNICODE_STRING

 local status:NTSTATUS
 local pDeviceObject:PDEVICE_OBJECT
 local liTickCount:LARGE_INTEGER

     mov status, STATUS_DEVICE_CONFIGURATION_ERROR

     invoke IoCreateDevice, pDriverObject, 0, addr g_usDeviceName, \
                                FILE_DEVICE_UNKNOWN, 0, FALSE, addr pDeviceObject
     .if eax == STATUS_SUCCESS
         invoke IoCreateSymbolicLink, addr g_usSymbolicLinkName, addr g_usDeviceName
         .if eax == STATUS_SUCCESS

             mov eax, NUM_THREADS
             mov ecx, sizeof KWAIT_BLOCK
             xor edx, edx
             mul ecx

             and g_pkWaitBlock, NULL
             invoke ExAllocatePool, NonPagedPool, eax
             .if eax != NULL
                 mov g_pkWaitBlock, eax

                 MUTEX_INIT addr g_kMutex

                 invoke KeQueryTickCount, addr liTickCount

                 invoke srand, liTickCount.LowPart

                 and g_dwWorkElement, 0

                 invoke StartThreads
                 .if eax == STATUS_SUCCESS
                     mov eax, pDriverObject
                     mov (DRIVER_OBJECT PTR [eax]).DriverUnload, offset DriverUnload
                     mov status, STATUS_SUCCESS
                 .else
                     invoke CleanUp, pDriverObject
                 .endif
             .else
                 invoke CleanUp, pDriverObject
                 invoke DbgPrint, $CTA0("MutualExclusion: Couldn't allocate memory for Wait Block\n")
             .endif
         .else
             invoke IoDeleteDevice, pDeviceObject
         .endif
     .endif

     mov eax, status
     ret

 DriverEntry endp

 ;:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
 ;                                                                                                   
 ;:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

 end DriverEntry

 :make

 set drv=MutualExclusion

 \masm32\bin\ml /nologo /c /coff %drv%.bat
 \masm32\bin\link /nologo /driver /base:0x10000 /align:32 /out:%drv%.sys /subsystem:native /ignore:4078 %drv%.obj

 del %drv%.obj

 echo.
 pause



13.2 DriverEntry Procedure

             mov eax, NUM_THREADS
             mov ecx, sizeof KWAIT_BLOCK
             xor edx, edx
             mul ecx

             and g_pkWaitBlock, NULL
             invoke ExAllocatePool, NonPagedPool, eax
             .if eax != NULL
                 mov g_pkWaitBlock, eax

Allocate a block of memory of size NUM_THREADS * sizeof KWAIT_BLOCK. The NUM_THREADS constant determines how many threads we will start. We save the pointer to the allocated block in the g_pkWaitBlock variable. We will need this memory when the driver is unloaded. Why it is needed and why we allocate it during driver initialization, I will explain later; for now it is not used.

                 MUTEX_INIT addr g_kMutex

As I said, we will use a mutex for synchronization. Before it can be used, it must be initialized by calling the KeInitializeMutex function. I am using the MUTEX_INIT macro. In simplified form it looks like this (the full version of the macro is more universal - see Mutex.mac):

 MUTEX_INIT MACRO mtx:REQ
     invoke KeInitializeMutex, mtx, 0
 ENDM

The KeInitializeMutex function simply fills in the KMUTANT structure that describes the mutex object, putting the mutex into the signaled (free) state.

 KMUTANT STRUCT                           ; sizeof = 020h
     Header          DISPATCHER_HEADER <> ; 0000h
     MutantListEntry LIST_ENTRY        <> ; 0010h
     OwnerThread     PVOID             ?  ; 0018h  PTR KTHREAD
     Abandoned       BYTE              ?  ; 001Ch  BOOLEAN
     ApcDisable      BYTE              ?  ; 001Dh
                     WORD              ?  ; 001Eh  padding
 KMUTANT ENDS

So the mutex is ready to use.
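For reference, here is the C declaration of KeInitializeMutex as documented in the DDK (the second parameter is reserved; our MUTEX_INIT macro passes zero):

 VOID KeInitializeMutex(
     IN PRKMUTEX Mutex,    /* caller-allocated, resident KMUTEX/KMUTANT storage */
     IN ULONG    Level);   /* reserved; MUTEX_INIT passes 0 */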

                 invoke KeQueryTickCount, addr liTickCount

                 invoke srand, liTickCount.LowPart

To get as close as possible to real-world conditions, we need the threads to access the shared resource chaotically. To do this, we will generate a pseudo-random number and use it as the interval for which a thread is delayed. Ntoskrnl.exe exports the standard library function rand. This function returns a pseudo-random number in the range 0 - 07FFFh. Generation relies on a so-called seed, which is stored in a global, non-exported kernel variable. Initially the seed is set to one, i.e. the pseudo-random number generator is usable as is. Still, we will do everything by the rules: we call the srand function and initialize the seed with a pseudo-random value of its own, using the low-order part of the 64-bit value returned by the KeQueryTickCount function. KeQueryTickCount returns the number of timer ticks that have elapsed since the system was started.

                 and g_dwWorkElement, 0

g_dwWorkElement is our shared resource.



13.3 Create System Threads

     and i, 0
     xor ebx, ebx
     .while i < NUM_THREADS
    
         invoke PsCreateSystemThread, addr hThread, THREAD_ALL_ACCESS, NULL, NULL, NULL, ThreadProc, 0
         .if eax == STATUS_SUCCESS

             invoke ObReferenceObjectByHandle, hThread, THREAD_ALL_ACCESS, NULL, KernelMode, \
                                     addr g_apkThreads[ebx * type g_apkThreads], NULL

             invoke ZwClose, hThread

             inc ebx

         .endif
         inc i
     .endw

     mov g_dwCountThreads, ebx

There are no fundamental differences from what we did last time in the TimerWorks driver. The only difference is that we start not one but NUM_THREADS threads, store pointers to them in the g_apkThreads array, and store the number of actually created threads in the g_dwCountThreads variable. The start routine of all the threads is the same: ThreadProc.
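For reference, the C declarations of the two key calls here, as documented in the DDK (parameter names may differ slightly between header versions):

 NTSTATUS PsCreateSystemThread(
     OUT PHANDLE            ThreadHandle,
     IN  ULONG              DesiredAccess,
     IN  POBJECT_ATTRIBUTES ObjectAttributes OPTIONAL,
     IN  HANDLE             ProcessHandle    OPTIONAL,   /* NULL = system process */
     OUT PCLIENT_ID         ClientId         OPTIONAL,
     IN  PKSTART_ROUTINE    StartRoutine,
     IN  PVOID              StartContext);

 NTSTATUS ObReferenceObjectByHandle(
     IN  HANDLE                     Handle,
     IN  ACCESS_MASK                DesiredAccess,
     IN  POBJECT_TYPE               ObjectType        OPTIONAL,
     IN  KPROCESSOR_MODE            AccessMode,
     OUT PVOID                      *Object,            /* receives the KTHREAD pointer */
     OUT POBJECT_HANDLE_INFORMATION HandleInformation  OPTIONAL);

The handle returned by PsCreateSystemThread is closed right away with ZwClose; it is the referenced KTHREAD pointer that keeps the thread object alive for the later wait.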

And one more thing.

 IF NUM_THREADS GT MAXIMUM_WAIT_OBJECTS
     .ERR Maximum number of wait objects exceeded!
 ENDIF

These three lines will keep the driver from assembling if you mistakenly set NUM_THREADS to a value greater than MAXIMUM_WAIT_OBJECTS. MAXIMUM_WAIT_OBJECTS is the maximum number of objects that can be waited on at the same time, and it equals 64. Just as in the TimerWorks driver, in the DriverUnload procedure we will wait for all running threads, so their number must not exceed MAXIMUM_WAIT_OBJECTS.
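In a C driver, the same compile-time guard could be expressed, for example, with the C_ASSERT macro from the DDK headers (a sketch of the equivalent check, not part of this driver):

 #include <ntddk.h>

 #define NUM_THREADS 5

 /* Fails to compile if NUM_THREADS ever exceeds MAXIMUM_WAIT_OBJECTS (64) */
 C_ASSERT(NUM_THREADS <= MAXIMUM_WAIT_OBJECTS);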



13.4 Thread Procedure

So, we have created NUM_THREADS threads. Sooner or later, at the scheduler's discretion, all of them will start executing the ThreadProc procedure. On a multiprocessor system this may happen literally at the same time.

     invoke PsGetCurrentThread
     mov pkThread, eax
     invoke DbgPrint, $CTA0("MutualExclusion: Thread %08X is entering ThreadProc\n"), pkThread

Using the PsGetCurrentThread function, we get a pointer to the current thread's KTHREAD structure and print its address in a debug message.

     xor ebx, ebx
     .while ebx < NUM_WORKS

We set up a loop that repeats NUM_WORKS times. In the loop, we simulate the threads working with the shared resource at random moments. Each thread must increment the single global variable g_dwWorkElement by one and perform this operation NUM_WORKS times.

          MUTEX_WAIT addr g_kMutex

If a thread has reached this point, it has successfully acquired the mutex and is guaranteed that, until it releases the mutex, no one else can acquire it. This means the thread can work with the resource exclusively, since the remaining NUM_THREADS - 1 threads must acquire the same mutex to enter this piece of code.

Here, too, I am using a macro. In simplified form it looks like this (the full version of the macro is again more universal):

 MUTEX_WAIT MACRO mtx:REQ
     invoke KeWaitForMutexObject, mtx, Executive, KernelMode, FALSE, NULL
 ENDM

The parameters of the KeWaitForMutexObject function are completely identical to those of the KeWaitForSingleObject function, and everything said last time about KeWaitForSingleObject applies equally to KeWaitForMutexObject. Moreover, in the header files KeWaitForMutexObject is defined as follows:

 #define KeWaitForMutexObject KeWaitForSingleObject

To be clear, these two functions have the same entry point. That is, KeWaitForMutexObject and KeWaitForSingleObject are just two names for the same function.
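For reference, the shared C prototype as documented in the DDK:

 NTSTATUS KeWaitForSingleObject(
     IN PVOID            Object,       /* any waitable dispatcher object, our KMUTEX here */
     IN KWAIT_REASON     WaitReason,   /* Executive for driver code                       */
     IN KPROCESSOR_MODE  WaitMode,     /* KernelMode for system threads                   */
     IN BOOLEAN          Alertable,
     IN PLARGE_INTEGER   Timeout OPTIONAL);  /* NULL = wait forever */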

The kernel also has another kind of mutex: the fast mutex. It is called fast because acquiring and releasing it costs less. It differs significantly from a regular mutex and is less versatile.
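We do not use it here, but for comparison, the fast mutex API looks roughly like this (a sketch; note that, unlike a regular mutex, a fast mutex cannot be acquired recursively, and while it is held the IRQL stays at APC_LEVEL):

 FAST_MUTEX g_FastMutex;   /* must reside in non-paged memory */

 VOID FastMutexExample(VOID)
 {
     ExInitializeFastMutex(&g_FastMutex);   /* normally done once, e.g. in DriverEntry */

     ExAcquireFastMutex(&g_FastMutex);      /* raises IRQL to APC_LEVEL while held */
     /* ... exclusive access to the shared data ... */
     ExReleaseFastMutex(&g_FastMutex);
 }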

         push g_dwWorkElement
         pop dwWorkElement

We read the value of the shared resource into the thread's local storage (namely, onto its stack) and start working with it.

To simulate the thread's work with the resource, we simply pause its execution for a random interval in the range 0 - 50 milliseconds, i.e. it is assumed that all this time the thread is doing some kind of manipulation with the resource.

         invoke rand
         shl eax, 4              ; * 16
          neg eax                 ; delay = 0 - ~50 ms
         or liDelayTime.HighPart, -1
         mov liDelayTime.LowPart, eax
         invoke KeDelayExecutionThread, KernelMode, FALSE, addr liDelayTime

As I said, the rand function produces a pseudo-random number in the range 0 - 07FFFh. Multiplying it by 16 gives the delay we need (07FFFh * 16 = 524,272 intervals of 100 ns, i.e. roughly 52 ms at most). Everything said last time about the DueTime parameter of the KeSetTimerEx function applies fully to the DelayTime parameter of KeDelayExecutionThread. The only difference is that KeDelayExecutionThread receives not the 64-bit value itself but a pointer to it. As its name suggests, KeDelayExecutionThread pauses the thread for a while. This function may only be called at IRQL = PASSIVE_LEVEL. Internally, it uses a timer to wait.
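In C, the whole "delay for N milliseconds" idiom (a negative DelayTime means a relative interval, measured in 100-nanosecond units) could be sketched like this; DelayMilliseconds is just an illustrative helper name:

 VOID DelayMilliseconds(ULONG Milliseconds)
 {
     LARGE_INTEGER interval;

     /* negative = relative time; 1 ms = 10,000 units of 100 ns */
     interval.QuadPart = -(LONGLONG)Milliseconds * 10000;

     /* PASSIVE_LEVEL only; the wait is not alertable */
     KeDelayExecutionThread(KernelMode, FALSE, &interval);
 }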

A few more words should be said about the rand function. As I said, it generates a pseudo-random number based on the seed that we initialized by calling srand when the driver started. Each call to rand changes this seed, and since the seed is a global kernel variable, we also influence the result of a possible call to rand by some other code. But since that result is a random number anyway, it becomes neither less nor more random, so rand can be used without any problems. Incidentally, I did not find any use of this function by the kernel itself.

         inc dwWorkElement

         push dwWorkElement
         pop g_dwWorkElement

We modify the shared resource and write it back.

          MUTEX_RELEASE addr g_kMutex

The work is done, and the mutex must be released so that the other threads, which are probably already waiting for it, can do their part of the work as well.

And again a macro, the simplified form of which looks like this:

 MUTEX_RELEASE MACRO mtx:REQ
     invoke KeReleaseMutex, mtx, FALSE
 ENDM

The last parameter of the KeReleaseMutex function exists to improve overall performance. If, immediately after releasing the mutex, you are going to wait again, perhaps on some other object, the last parameter can be set to TRUE. Then KeReleaseMutex will not release the dispatcher database lock, the following KeWaitXxx function, accordingly, will not acquire it, and the KeReleaseMutex - KeWaitXxx pair will execute as a single atomic operation: the dispatcher database lock is taken on entry to KeReleaseMutex and released inside KeWaitXxx. I use FALSE because the MUTEX_INIT, MUTEX_WAIT and MUTEX_RELEASE macros are a general solution for all occasions. If timing is critical for you, you can resort to this optimization. I give the simplified versions of the macros so that, even if you are not comfortable with macros, you can see which functions are called and how.
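The optimized pattern would look roughly like this (a C sketch; g_SomeEvent is a hypothetical second object the thread is about to wait on):

 /* Tell the kernel that a wait follows immediately... */
 KeReleaseMutex(&g_Mutex, TRUE);

 /* ...so the release and the wait execute as one atomic operation */
 KeWaitForSingleObject(&g_SomeEvent, Executive, KernelMode, FALSE, NULL);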

The KeReleaseMutex function, by the way, is a wrapper around the more flexible KeReleaseMutant function.

A few more points about the regular (not fast) mutex. Every acquisition of a mutex must be matched by a release. A mutex can also be acquired recursively, i.e. several times in a row, but only by the same thread. If a thread has acquired a mutex several times, it must release it exactly the same number of times. The number of recursive acquisitions is limited by the MINLONG constant, equal to 80000000h. If you exceed this absurdly large value, you get a BSOD with the code STATUS_MUTANT_LIMIT_EXCEEDED. If you try to release a mutex you do not own - STATUS_MUTANT_NOT_OWNED.
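In other words, recursive acquisition must stay balanced. A sketch, using the same functions as above:

 /* The owning thread may acquire the mutex again without blocking */
 KeWaitForSingleObject(&g_Mutex, Executive, KernelMode, FALSE, NULL);  /* 1st acquire, count = 1 */
 KeWaitForSingleObject(&g_Mutex, Executive, KernelMode, FALSE, NULL);  /* 2nd acquire, count = 2 */

 /* ... work with the resource ... */

 KeReleaseMutex(&g_Mutex, FALSE);   /* count = 1, mutex is still owned */
 KeReleaseMutex(&g_Mutex, FALSE);   /* count = 0, mutex becomes free   */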

         mov eax, liDelayTime.LowPart
         neg eax
          mov edx, 3518437209     ; Magic number
         mul edx
         shr edx, 13
         invoke DbgPrint, $CTA0("MutualExclusion: Thread %08X work #%d is done (%02dms)\n"), \
                            pkThread, ebx, edx

For control, we print a message in which the thread reports which piece of work it has done and how long it took. Here I use "magic division", i.e. division via multiplication, which is much faster (although in this case speed is absolutely not critical). Since the delay is measured in 100-nanosecond intervals, the value has to be divided by 10,000 to get milliseconds. The constant 3518437209 is 2^45 / 10000 rounded up, so multiplying by it and shifting the 64-bit product right by 45 bits (32 of those bits fall away by taking EDX, the remaining 13 by the shr) gives the quotient. To compute such "magic numbers", look on http://www.wasm.ru/ in the "Educational programs" section for the Magic Divider utility by The Svin.
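If you want to convince yourself that the constant really divides by 10,000, a quick user-mode check will do (plain C, a throwaway test, not part of the driver):

 #include <stdio.h>

 int main(void)
 {
     const unsigned long long M = 3518437209ULL;    /* 2^45 / 10000, rounded up */
     unsigned long x;

     /* covers the whole range of delays the driver can produce (0 - 07FFF0h) */
     for (x = 0; x <= 0x7FFF0UL; x += 4111) {
         unsigned long magic = (unsigned long)((x * M) >> 45);
         if (magic != x / 10000)
             printf("mismatch at %lu\n", x);
     }
     printf("done\n");
     return 0;
 }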

         inc ebx

     .endw

We move on to the next piece of work.

     invoke PsTerminateSystemThread, STATUS_SUCCESS

When all the work is done, the thread terminates itself.



13.5 DriverUnload Procedure

Just as in the TimerWorks driver, before allowing the driver to be unloaded we must make sure that all the threads have terminated: the thread procedure lives in the body of the driver, and unloading it prematurely would crash the system. We solve this problem the same way as last time, by waiting, except that now there are several threads.

     .if g_dwCountThreads > 0

         invoke KeWaitForMultipleObjects, g_dwCountThreads, addr g_apkThreads, WaitAll, \
                     Executive, KernelMode, FALSE, NULL, g_pkWaitBlock

If there is someone to wait for, we call the KeWaitForMultipleObjects function. Its parameters are basically the same as those of KeWaitForSingleObject, but this function waits not for one but for several objects at once. Moreover, the objects may be of different types; the main thing is that they are waitable.

The first parameter is the number of objects in the array of object pointers whose address is passed in the second parameter. The third parameter determines whether we want to wait for all the objects or only for the first one that becomes signaled. If you wait for the first one, you can tell which object became signaled from the value returned by KeWaitForMultipleObjects: STATUS_WAIT_0 means the first object in the array, STATUS_WAIT_1 the second, and so on. The last parameter is a pointer to a block of memory that KeWaitForMultipleObjects uses for waiting. The size of this block must equal the size of the KWAIT_BLOCK structure multiplied by the number of objects. We already allocated the required memory in DriverEntry. We did it in advance because if we suddenly failed to allocate this memory now, while unloading the driver (unlikely, but still possible), how would we wait for our threads to finish?
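For reference, the C prototype as documented in the DDK; the last parameter is exactly the wait block array whose size we just discussed:

 NTSTATUS KeWaitForMultipleObjects(
     IN  ULONG            Count,                     /* <= MAXIMUM_WAIT_OBJECTS (64) */
     IN  PVOID            Object[],                  /* array of dispatcher objects  */
     IN  WAIT_TYPE        WaitType,                  /* WaitAll or WaitAny           */
     IN  KWAIT_REASON     WaitReason,                /* Executive for driver code    */
     IN  KPROCESSOR_MODE  WaitMode,
     IN  BOOLEAN          Alertable,
     IN  PLARGE_INTEGER   Timeout        OPTIONAL,   /* NULL = wait forever          */
     IN  PKWAIT_BLOCK     WaitBlockArray OPTIONAL);  /* Count * sizeof KWAIT_BLOCK   */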

If the number of objects is no more than THREAD_WAIT_OBJECTS (3), you do not need to allocate a wait block at all, because in that case the system uses the memory reserved directly in the thread object:

 KTHREAD STRUCT
 . . .
     WaitBlock KWAIT_BLOCK 4 dup(<>)
 . . .
 KTHREAD ENDS
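So, for three objects or fewer, a call like the following would be legal with WaitBlockArray = NULL (a sketch):

 /* Count <= THREAD_WAIT_OBJECTS (3): the wait blocks built into the current
    KTHREAD are used, so no separate allocation is needed */
 KeWaitForMultipleObjects(Count, Objects, WaitAll, Executive,
                          KernelMode, FALSE, NULL, NULL);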

         .while g_dwCountThreads
             dec g_dwCountThreads
             mov eax, g_dwCountThreads
             fastcall ObfDereferenceObject, g_apkThreads[eax * type g_apkThreads]
         .endw

Having waited for all the threads to finish, we decrement the reference counts of their thread objects, which we incremented when creating the threads.

     invoke DbgPrint, $CTA0("MutualExclusion: WorkElement = %d\n"), g_dwWorkElement

We output the results. g_dwWorkElement must equal NUM_THREADS * NUM_WORKS. If you comment out the MUTEX_WAIT and MUTEX_RELEASE macros, you will not get the correct result, because the threads will modify the shared resource in an uncontrolled way.

The full source code of the driver is in the archive.