Monday, November 16, 2020

S3 Sleep, Resume and Handling Them with Type-1 Hypervisor

This post explains how the system enters and resumes from S3 (Sleep) on a modern x86-64 system, by reviewing specifications and the implementation of Windows as an example. This post also outlines the challenges S3 poses for type-1 hypervisors and how to work around them.
TeaTea in the S3 state

Why S3 is Interesting

On normal system startup, UEFI-based system firmware goes through four execution phases before starting the OS. Those phases include Driver eXecution Environment (DXE), Boot Device Selection (BDS), and Transient System Load (TSL), where system configurations are set and third-party firmware modules may be executed. On the S3 resume boot path, on the other hand, those phases are skipped for faster start-up.

This has significant security implications because the S3 resume boot path needs to reapply the same security configurations as those made during the normal boot path, using entirely different code. Failing to do so securely leads to vulnerabilities, for example, unauthorized modification of a system firmware image if a firmware write-protection bit is not reapplied during resume.

Also, for a type-1 hypervisor that is loaded during the TSL phase, the lack of that phase means it cannot get loaded on resume. Since the processors were shut down on S3, processor-based virtualization features such as Intel VT-x stop working after resume even though the hypervisor module remains mapped in memory. This needs to be handled.

High-Level Flow

Before diving into details, let us review a high-level flow of S3 sleep and resume. The following are the highlights.
  1. Setting certain bits in the registers called Power Management (PM) 1 Control registers, or PM1a_CNT_BLK / PM1b_CNT_BLK puts the system into the S3 state.
  2. During the next system start-up, system firmware detects that shutdown was because of S3 and executes the S3 resume boot path, instead of the normal boot path.
  3. System firmware executes a collection of commands, called boot scripts, and code pointed to by the Firmware Waking Vector in the ACPI table. The latter is called an OS waking vector and is set up by the OS prior to entering S3.
  4. The waking vector resumes execution of the OS.

Entering S3

The platform enters S3 when software sets the SLP_EN bit to 1 and the SLP_TYP bits to 5 (0b101) in the PM1 control registers. The ACPI specification states that setting SLP_EN triggers the state transition.
Table 4.13: PM1 Control Registers Fixed Hardware Feature Control, from the ACPI spec
The explanation of the SLP_TYP bits in the table is not crystal clear, but it becomes more obvious with the specification of the Intel platform. The below is an excerpt from the table under 4.2.2 Power Management 1 Control (PM1_CNT) in one of the hardware models that implement ACPI.
 From Intel 495 Series Chipset Family On-Package Platform Controller Hub volume 2

Then, where are those registers? The ACPI specification does not define their locations but does define how to locate them. Under 4.8.3 PM1 Control Registers, it states that
Each register block has a unique 32-bit pointer in the Fixed ACPI Table (FADT) to allow the PM1 event bits to be partitioned between two chips.
The below are excerpts of the FADT format, which contains multiple fields indicating where the registers are. 
...
...
Depending on the implementation of ACPI, some fields may be unused. On my system, the SLEEP_CONTROL_REG field in the table tells that the register is located at IO-port 0x1804. 

RWEverything parsing the FACP table on Windows

So far, we learned that: 
  • the system enters S3 state when software sets SLP_EN and SLP_TYP bits in the PM1 control register.
  • the PM1 control register can be located through the FADT ACPI table. 
Note that the ACPI table itself can be easily located with platform specific ways, such as /sys/firmware/acpi on Linux, GetSystemFirmwareTable() on Windows, or EfiLocateFirstAcpiTable() on UEFI.
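To make the register write concrete, here is a minimal sketch of how the value could be composed. The bit positions (SLP_TYP in bits 12:10, SLP_EN in bit 13) follow the ACPI PM1 control register layout; the function names are hypothetical, and the SLP_TYP value of 5 is the one used on the platform discussed here, not a universal constant.

```c
#include <stdint.h>

/* Bit layout of the PM1 control register as given by the ACPI specification. */
#define PM1_CNT_SLP_TYP_SHIFT 10u
#define PM1_CNT_SLP_TYP_MASK  (0x7u << PM1_CNT_SLP_TYP_SHIFT)  /* bits 12:10 */
#define PM1_CNT_SLP_EN        (1u << 13)                        /* bit 13 */

/* SLP_TYP value for S3 on the platform in this post (assumption: platform specific). */
#define SLP_TYP_S3 5u

/* Returns the value to write, preserving the unrelated bits of the current value. */
uint16_t pm1_cnt_sleep_value(uint16_t current, uint8_t slp_typ)
{
    uint16_t value = (uint16_t)(current & ~(PM1_CNT_SLP_TYP_MASK | PM1_CNT_SLP_EN));
    value |= (uint16_t)((slp_typ & 0x7u) << PM1_CNT_SLP_TYP_SHIFT);
    value |= (uint16_t)PM1_CNT_SLP_EN;
    return value;
}

/* Returns nonzero when a written value requests entering the given sleep type. */
int pm1_cnt_requests_sleep(uint16_t value, uint8_t slp_typ)
{
    return (value & PM1_CNT_SLP_EN) != 0 &&
           ((value & PM1_CNT_SLP_TYP_MASK) >> PM1_CNT_SLP_TYP_SHIFT) == slp_typ;
}
```

The second helper is the kind of check that software observing writes to the register would need in order to tell a sleep request apart from other updates.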

Resuming from S3

On system start-up, system firmware executes the same initialization path as the normal boot path, and then diverges when it detects that the previous shutdown was an S3 entry. This resume-specific path is called the S3 resume boot path and is well explained in the UEFI Platform Initialization (PI) specification.

In a nutshell, the S3 resume boot path executes the boot scripts to re-initialize the platform, instead of executing the last three boot phases: DXE, BDS and TSL. The boot scripts are saved in non-volatile storage and replicate platform configuration made during normal boot. The below illustration from the spec highlights differences between normal and S3 resume boot paths, as well as how boot scripts are saved and consumed.
Normal and S3 resume boot paths, from the PI spec
As illustrated, after boot scripts are executed, an OS waking vector is executed to resume execution of the OS on the S3 resume boot path. The OS waking vector is the very first OS-specific code (the code that is developed by the OS vendor, and not part of system firmware). This is typically 16bit real-mode code that changes the processor mode to the long mode, resets registers to the same values as what they had before the system entered S3, and lets the OS execute further restoration code to fully resume the system. The OS sets up this OS waking vector right before entering S3. 

How does the OS set up the OS waking vector, and how does system firmware find its location? Again, ACPI defines the way.

The Firmware ACPI Control Structure (FACS) table defines a field called Firmware Waking Vector. The OS writes the address of the OS waking vector to this field, and system firmware reads it to locate and execute the OS waking vector.
Firmware Waking Vector in FACS, from the ACPI spec

To summarize the flow in chronological order:
  1. OS writes an address of the OS waking vector (ie, bootstrap code) to the Firmware Waking Vector field of the FACS table before entering S3.
  2. System firmware reads the field to know the address of the OS waking vector and transfers execution to the address during the S3 resume boot path.
  3. The OS waking vector eventually resumes system states using configurations kept in memory.
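As a rough illustration of steps 1 and 2, the sketch below models the FACS table and the two accesses to its Firmware Waking Vector field. The field order follows the ACPI specification's FACS definition; the type and function names are hypothetical, and a real OS or firmware would operate on the table located through the FADT rather than on a local structure.

```c
#include <stddef.h>
#include <stdint.h>

/* FACS layout per the ACPI specification (64 bytes in total). */
typedef struct {
    uint32_t signature;                 /* "FACS" */
    uint32_t length;
    uint32_t hardware_signature;
    uint32_t firmware_waking_vector;    /* 32-bit physical address of the OS waking vector */
    uint32_t global_lock;
    uint32_t flags;
    uint64_t x_firmware_waking_vector;  /* 64-bit variant of the waking vector */
    uint8_t  version;
    uint8_t  reserved0[3];
    uint32_t ospm_flags;
    uint8_t  reserved1[24];
} facs_t;

/* Step 1: the OS records where firmware should jump on resume. */
void os_set_waking_vector(facs_t *facs, uint32_t waking_vector)
{
    facs->firmware_waking_vector = waking_vector;
    facs->x_firmware_waking_vector = 0;  /* assumption: the 32-bit field is the one in use */
}

/* Step 2: firmware reads the address it must transfer execution to. */
uint32_t firmware_get_waking_vector(const facs_t *facs)
{
    return facs->firmware_waking_vector;
}
```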

Implementation on Windows and EDK2

Let us look into how what we reviewed above is implemented on Windows (build 18362) and EDK2.
EDK2 is a reference implementation of UEFI, the system firmware specification, and very commonly used as a base of commercial system firmware. 

Entering S3

On Windows, HaliAcpiSleep() is the main function that implements S3 handling and is called on all processors when a user requests entering S3. It roughly does the following, in order.
  1. Boot strap processor (BSP) sets up the OS waking vector with HalpSetupRealModeResume().
    *HalpWakeVector = HalpLowStubPhysicalAddress;
    //
    // Where HalpWakeVector is the address of the
    // Firmware Waking Vector field in the FACS table,
    // initialized at HaliInitPowerManagement()
    //
  2. BSP waits for all APs to complete saving their states.
    InterlockedAdd(&HalpFlushBarrier, 1);
    while (HalpFlushBarrier != ProcessorCount);
  3. Application processors (APs) save their registers with HalpSaveProcessorState().
  4. APs enter the loop that does not exit in a successful path in HalpFlushAndWait().
    InterlockedIncrement(&HalpFlushBarrier);
    while (HalpFlushBarrier);
  5. BSP writes to the PM1 control register(s) to set the following values with HalpAcpiPmRegisterWrite().
    • SLP_TYP = 5 (S3)
    • SLP_EN = 1 
This puts the system into the S3 state. Let us look into the resume path.

Resuming from S3

  1. On the system firmware (EDK2) side, the S3-resume-specific execution flow looks roughly like this.
    ...
      -> DxeLoadCore()
           -> S3RestoreConfig2()
                -> S3ResumeExecuteBootScript()
                     -> S3BootScriptExecutorEntryFunction()
  2. S3BootScriptExecutorEntryFunction() executes the boot script and jumps to the OS waking vector as indicated by Facs->FirmwareWakingVector at the end.
  3. The OS waking vector is a copy of HalpRMStub. This eventually brings the execution of the BSP to right after HalpSetupRealModeResume() with RAX=1, as if it returned from the function.
  4. The BSP wakes up the APs by sending INIT-SIPI-SIPI.
    //
    // This wakes up all APs with HalStartNextProcessor() calls
    //
    HalpAcpiPostSleep(...);
  5. The INIT-SIPI-SIPI brings the APs to right after HalpSaveProcessorState() with RAX=1, as if it returned from the function. For more details on how INIT-SIPI-SIPI starts up APs, please read the previous post.
  6. The BSP and all APs call HalpPostSleepMP() to restore other platform states, then return from HaliAcpiSleep(), continuing OS execution as usual.
If you are interested in how exactly the OS waking vector is set up and resumes the system states, I suggest reverse engineering HaliAcpiSleep() on your own. The way it factors code to keep the flow as straightforward as possible is a masterpiece.

Note that on VMware, step 1 of the pre-S3 steps and steps 1-3 of the post-S3 steps are skipped. Windows on VMware does not need them either, as the VMware hypervisor directly restores system states instead of going through the full S3 resume boot path.

Handling S3 with Type-1 Hypervisor

As mentioned previously, S3 is a challenge for a type-1 hypervisor that is loaded during the TSL phase because:
  • On resume, the TSL phase is skipped, so the hypervisor has no opportunity to be called.
  • On resume, virtualization is disabled and needs to be enabled again.
  • The hypervisor cannot add its own boot script to trigger reinitialization, because the boot script is already locked by the time the TSL phase starts.
One may employ a guest support module that subscribes to the resume event and notifies the hypervisor to trigger reinitialization, but that is neither secure, portable, nor reliable. Another quick-and-dirty way is to disable S3 by altering the ACPI table, which has an obvious user experience issue.

A far better way is to hook the OS waking vector. This works as follows:
  1. The hypervisor intercepts IO access to the PM1 control register(s)
  2. When the guest attempts to write to the register to enter sleep, the hypervisor 
    1. overwrites contents of the Firmware Waking Vector field with its own waking vector address, and
    2. writes to the register and lets the system enter S3
  3. When the system wakes up, hypervisor's waking vector is executed, and it
    1. reenables virtualization (with VMXON for example) 
    2. sets up the guest state to emulate execution of guest's waking vector (ie, guest's RIP is set to the guest waking vector)
    3. launches the guest (with VMLAUNCH for example)
Hypervisor resuming from S3
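The hook described above can be sketched as a small state machine. This is only a model under stated assumptions: the structure, its fields, and the function names are hypothetical, the port write is recorded instead of performed, and the VMX instructions are represented by comments. It is not taken from ACRN or Bitvisor.

```c
#include <stdint.h>

#define PM1_CNT_SLP_EN (1u << 13)  /* SLP_EN bit of the PM1 control register */

/* Hypothetical hypervisor state used only by this sketch. */
typedef struct {
    uint32_t *firmware_waking_vector;  /* points at the FACS Firmware Waking Vector field */
    uint32_t  host_waking_vector;      /* address of the hypervisor's own resume stub */
    uint32_t  guest_waking_vector;     /* saved guest value, used as the guest entry on resume */
    uint16_t  last_pm1_write;          /* stands in for the real port write */
} hv_s3_state_t;

/* Called on a VM-exit caused by a guest write to the PM1 control register (step 2). */
void hv_on_pm1_cnt_write(hv_s3_state_t *hv, uint16_t value)
{
    if (value & PM1_CNT_SLP_EN) {
        /* Remember where the guest wanted to resume, then redirect firmware to us. */
        hv->guest_waking_vector = *hv->firmware_waking_vector;
        *hv->firmware_waking_vector = hv->host_waking_vector;
    }
    /* A real hypervisor would now perform the port write on behalf of the guest. */
    hv->last_pm1_write = value;
}

/* Called from the hypervisor's waking vector (step 3); returns the guest entry address. */
uint32_t hv_on_resume(const hv_s3_state_t *hv)
{
    /* Re-enabling virtualization, guest-state setup, and the guest launch would happen here. */
    return hv->guest_waking_vector;
}
```

Writes without SLP_EN set pass through untouched, so the hook only takes effect when the guest actually requests a sleep transition.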

This way, the hypervisor can take control over the system before running any OS (guest) specific code. Implementation of this can be found in multiple hypervisors such as ACRN Embedded Hypervisor and Bitvisor.  

For completeness, note that a type-1 hypervisor that is part of an OEM firmware image or implemented as a PEI module does not have to do any of this. If the module were part of the OEM image, it would be able to add a boot script to register reinitialization, and if the module were a PEI module, it would be executed even on the S3 resume boot path.

Conclusion

Entering and resuming from S3 is complex work that involves the OS, system firmware, and hardware implementation, as well as multiple specifications such as PI and ACPI. However, studying it allows us to familiarize ourselves with the industry standards and intriguing low-level implementation details.

As a side note, I recommend learning type-1 hypervisors over OS kernel-based ones. A type-1 hypervisor is not just more flexible; it also lets you understand in greater detail how the system works (and is arguably the more common design among production-level hypervisors). I am still suspending registration for the public hypervisor development class, but I am looking into reopening it sometime next year as a remote class. If you are interested, please reach out to tanda.sat@gmail.com for details.

EDIT

  • Nov 25 - Correct that the boot phase relevant to 3rd party type 1 hypervisor is TSL and not BDS.

Monday, May 18, 2020

Introductory Study of IOMMU (VT-d) and Kernel DMA Protection on Intel Processors

This post is a write-up of an introductory study of Intel VT-d, especially how DMA remapping may be programmed and how Windows uses it. The hope is that this article helps you gain a basic understanding of it and start looking into more details that interest you.

Intel VT-d

Intel VT-d, formally called Intel Virtualization Technology for Directed I/O, consists of the following three features:
  • DMA Remapping
  • Interrupt Remapping
  • Interrupt Posting
DMA remapping is the most commonly discussed feature out of those and is the focus of this article.

DMA Remapping

DMA Remapping is an important feature because it allows software to implement security against Direct Memory Access (DMA) from malicious devices by configuring access permissions for each physical memory page. While ordinary memory page protections can be configured through the paging structures, and when Intel VT-x is used, through the Extended Page Tables (EPT), those configurations are completely ignored in the case of DMA access. Therefore, another protection mechanism is required to complete the protection of memory. DMA remapping achieves this.

The following illustration from the specification highlights that DMA goes through DMA remapping instead of CPU memory virtualization (ie, EPT). 


VT-x Not Required

Ignore the upper half of the illustration above. It is a typical misconception that VT-d (DMA remapping) is tied to VT-x, virtual machines, and such. DMA remapping is usable and useful without VT-x; Windows, for example, can enable a DMA remapping based security feature (called Kernel DMA Protection) without requiring VT-x based security (VBS: Virtualization Based Security) to be enabled.

The sample project shown in this post below enables DMA remapping independently as well.

IOMMU

DMA remapping is also referred to as IOMMU, as it functions like a Memory Management Unit (MMU) for IO memory access. Not only is the concept similar, but its programming interface is also very similar to that of the MMU, that is, the paging structures and EPT.

At a high level, the major difference is that DMA remapping uses two more tables for translation, on top of the familiar PML5, PML4, PDPT, PD, and PT. Simply put, translation with MMU is
  • Hardware register => PML4 => PDPT => ...
while that of IOMMU is
  • Hardware register => Root table => Context table => PML4 => PDPT => ...
The specification refers to the tables referenced from the context table as the second-level page tables. The below diagram illustrates the translation flow.
Notice that,
  • The entry of the root table is selected based on the bus number of the device requesting DMA.
  • The entry of the context table is selected based on a combination of device and function numbers of the device.
As an example of bus:device:function (referred to as source-id) assignment, my test DMA-capable device is listed as Bus 6 : Device 0 : Function 0 on one system.
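The table selection can be written down in a few lines. This is a sketch assuming the usual 16-bit source-id encoding (bus in bits 15:8, device in bits 7:3, function in bits 2:0) and 256-entry root and context tables; the function names are hypothetical.

```c
#include <stdint.h>

/* Source-id layout: bus in bits 15:8, device in bits 7:3, function in bits 2:0. */
uint16_t make_source_id(uint8_t bus, uint8_t device, uint8_t function)
{
    return (uint16_t)((bus << 8) | ((device & 0x1f) << 3) | (function & 0x7));
}

/* The root table entry is selected by the bus number. */
unsigned root_table_index(uint16_t source_id)
{
    return source_id >> 8;
}

/* The context table entry is selected by the device and function numbers combined. */
unsigned context_table_index(uint16_t source_id)
{
    return source_id & 0xff;
}
```

For the test device at Bus 6 : Device 0 : Function 0, the source-id is 0x0600, which selects root table entry 6 and context table entry 0.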


Sample Code and Demo

Let us jump into some code. The HelloIommuPkg is a runtime DXE driver that enables DMA remapping and protects its first page (PE header) from DMA read and write by any device.

Loading this will yield the following output and protect the page if successful.
Then, performing DMA read with the test PCI device using PCILeech demonstrates that the other page is readable,

but the protected page is not.

By inspecting one of the reported fault-recording registers using RWEverything, it can be confirmed that DMA was indeed blocked by a lack of read-permission.
  • The first column indicates the faulting address (0x6ff48000)
  • The third column indicates the source-id of the requesting device (Bus 6 : Device 0 : Function 0)
  • 6 in the fourth column indicates the lack of read-permission.
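The columns above can be decoded programmatically. The sketch below treats the 128-bit fault recording register as four 32-bit columns, matching the description above for the address, source-id, and fault reason. The positions of the request-type bit (126) and the fault-valid bit (127) are my reading of the VT-d specification and should be checked against the revision you use; the type and function names are hypothetical.

```c
#include <stdint.h>

typedef struct {
    uint64_t fault_address;  /* page-aligned address of the faulting request */
    uint16_t source_id;      /* bus:device:function of the requester */
    uint8_t  fault_reason;   /* 6 means lack of read permission */
    int      is_read;        /* nonzero for a read request (assumed bit 126) */
    int      is_valid;       /* nonzero when a fault is recorded (assumed bit 127) */
} dma_fault_t;

/* Decodes one 128-bit fault recording register given as four 32-bit columns. */
dma_fault_t decode_fault_record(const uint32_t dw[4])
{
    dma_fault_t fault;
    uint64_t low = ((uint64_t)dw[1] << 32) | dw[0];

    fault.fault_address = low & ~0xfffull;              /* bits 63:12 */
    fault.source_id     = (uint16_t)(dw[2] & 0xffff);   /* bits 79:64 */
    fault.fault_reason  = (uint8_t)(dw[3] & 0xff);      /* bits 103:96 */
    fault.is_read       = (int)((dw[3] >> 30) & 1);     /* bit 126 */
    fault.is_valid      = (int)((dw[3] >> 31) & 1);     /* bit 127 */
    return fault;
}
```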

Programming IOMMU 

Enabling DMA remapping at a minimum can be divided into the following steps:
  1. Locating the DMA Remapping Reporting (DMAR) ACPI table.
  2. Gathering information about the available DMA remapping hardware units from DMA-remapping hardware unit definition (DRHD) structures in (1).
  3. Configuring translation by initializing the tables mentioned above. 
  4. Writing hardware registers to use (3) and activating DMA remapping.
HelloIommuDxe.c roughly follows this sequence with some demonstration and error checking code. (1) and (2) are straightforward and can be validated with tools like RWEverything.

The complexity of (3) varies largely depending on how granular and selective the required translations and memory protections are. HelloIommuPkg allows any access from any device to anywhere, except to the single protected page, which simplifies this step. (4) is mostly just following the specification.
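To give a feel for step (3), here is a minimal sketch of filling one second-level page table so that every 4 KB page in a 2 MB range is identity-mapped with read and write permission, except one page that is left not present. The R (bit 0) and W (bit 1) permission bits follow the second-level paging entry format; the function name and the single-table scope are simplifications of mine, not how HelloIommuPkg is structured.

```c
#include <stdint.h>

#define SL_PTE_READ    (1ull << 0)  /* R: DMA reads allowed */
#define SL_PTE_WRITE   (1ull << 1)  /* W: DMA writes allowed */
#define PAGE_SIZE_4K   0x1000ull
#define PTES_PER_TABLE 512

/* Fills one second-level page table that identity-maps the 2 MB range starting
   at `base` (2 MB aligned), allowing all access except to `protected_page`. */
void fill_second_level_pt(uint64_t pt[PTES_PER_TABLE], uint64_t base,
                          uint64_t protected_page)
{
    for (int i = 0; i < PTES_PER_TABLE; i++) {
        uint64_t page = base + (uint64_t)i * PAGE_SIZE_4K;
        if (page == protected_page) {
            pt[i] = 0;  /* neither R nor W: any DMA to this page faults */
        } else {
            pt[i] = page | SL_PTE_READ | SL_PTE_WRITE;
        }
    }
}
```

A full configuration would still need the upper-level tables, a context table entry, and a root table entry pointing at them.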

Overall, the minimum steps are simple, and HelloIommuPkg's line count without comments is less than 700 lines.

Use of DMA Remapping on Windows 

Windows uses DMA remapping when available. If the system does not enable Kernel DMA Protection, it configures translations mostly to pass through all requests from all devices, with few exceptions.

The following screenshot taken from the system without Kernel DMA Protection shows translation for the DMA-capable device at Bus 7 : Device 0 : Function 0. The value 9 at the bottom right indicates DMA requests are passed through (See "TT: Translation Type" in the specification).

Notice that most of the entries point to the same context table at 0x1ac000, which is configured for pass-through, providing no protection.
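Why 9 means pass-through becomes clear when the low bits of the context entry are decoded. This sketch assumes the legacy-mode context entry layout, with the Present bit in bit 0 and the Translation Type in bits 3:2; the enum and function names are hypothetical.

```c
#include <stdint.h>

typedef enum {
    TT_UNTRANSLATED_ONLY = 0,  /* 00b: untranslated requests are translated */
    TT_ALL_REQUESTS      = 1,  /* 01b: Device-TLB requests are allowed as well */
    TT_PASS_THROUGH      = 2   /* 10b: untranslated requests bypass translation */
} translation_type_t;

/* Bit 0 of the low qword: the entry is present. */
int context_entry_present(uint64_t low)
{
    return (int)(low & 1);
}

/* Bits 3:2 of the low qword: the translation type. */
translation_type_t context_entry_tt(uint64_t low)
{
    return (translation_type_t)((low >> 2) & 0x3);
}
```

The value 9 is 1001b: bit 0 marks the entry as present, and bits 3:2 are 10b, which selects pass-through.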

As a side note, it would be technically possible for third-party Windows drivers to modify those translations and attempt to provide additional security against DMA unless VBS is enabled.

Use of DMA Remapping with Kernel DMA Protection

If Kernel DMA Protection is enabled, most of the translations are configured to fail. This is achieved by pointing to the second-level PML4 that is filled with zero, meaning translations are not present.

The below screenshot shows an example configuration with Kernel DMA Protection. Notice the context table at 0x1ac000 points to the second level PML4 at 0x251000, which is all zero.


Note that those memory locations are not visible if VBS is enabled. Disable it to inspect them.

Interestingly, I was not able to observe the described behavior of Kernel DMA Protection, in that, regardless of whether the screen is locked, performing DMA against the device resulted in bug check 0xE6: DRIVER_VERIFIER_DMA_VIOLATION (type 0x26). From what I read from Hal.dll, it made sense to bug check, but I doubt this is how Kernel DMA Protection is supposed to protect the system.

Conclusion

DMA remapping is part of the Intel VT-d architecture, providing security against DMA from malicious devices, and can be enabled without using Intel VT-x. The sample project HelloIommuPkg demonstrates the simple setup of DMA remapping from UEFI with less than 700 lines of code.

It is shown that Windows enables DMA remapping if available, and when the Kernel DMA Protection feature is enabled, DMA access is mostly blocked through the second-level PML4.

A cat protected from direct access.

Friday, March 20, 2020

Initializing Application Processors on Windows

This post guides you through the journey of starting up application processors (APs) on Windows. This post can be read just for fun but can also help you make more sense of the INIT-SIPI-SIPI VM-exit sequence you have to handle when writing a UEFI hypervisor.

AP Initialization and Overview of Its Implementation


Before running any software code, hardware selects the processor that gets initialized and starts executing firmware code. This processor is called a bootstrap processor (BSP) and is basically the sole active processor until an operating system starts up the rest of the processors. 

Those non-BSP processors are called APs and are initialized by the BSP sending a sequence of inter-processor interrupts (IPIs): INIT, Startup IPI, and the 2nd Startup IPI. This sequence is also referred to as INIT-SIPI-SIPI.

As noted in the previous post, a hypervisor that starts earlier than the operating system needs to handle VM-exits caused by those IPIs. But when does that happen exactly?

On Linux, this is relatively easy to find out. Searching "STARTUP IPI" in Linux source code or other developers' forums leads you to the implementation, smpboot.c. On Windows 10, this is done in HalpApicStartProcessor, called from kernel's KeStartAllProcessors, in short. The stack trace is shown below: 

00 hal!HalpApicStartProcessor
01 hal!HalpInterruptStartProcessor
02 hal!HalStartNextProcessor
03 nt!KeStartAllProcessors
04 nt!Phase1InitializationDiscard
05 nt!Phase1Initialization
06 nt!PspSystemThreadStartup
07 nt!KiStartSystemThread

Let us look into a little more detail on Windows 19H1 (18362.1.amd64fre.19h1_release.190318-1202) without Hyper-V enabled. To be clear, the execution path varies drastically if Hyper-V is enabled.

High Level Flow


KeStartAllProcessors captures various system register values with KxInitializeProcessorState, updates per-processor bookkeeping data structures, and calls HalStartNextProcessor for each registered processor one by one to start all of them.

HalpInterruptStartProcessor builds stub code and temporary data structures required for APs to go through real-mode, 32-bit protected-mode, and long-mode, such as page tables, GDT, and IDT. HalpLowStub (that is PROCESSOR_START_BLOCK according to this talk by Alex Ionescu) is the address where those are built and the very entry point of the AP. We will review the entry point code and how it goes up to the NT kernel.

HalpInterruptStartProcessor, after the stub is built, executes HalpApicStartProcessor, which is responsible for issuing the INIT-SIPI-SIPI sequence. Pseudo code of this function is shown below.

NTSTATUS HalpApicStartProcessor(
    UINT64,
    UINT32 LocalApicId,
    UINT64,
    UINT32 StartupIp
    )
{
    UINT32 sipiMessage;

    //
    // Assert INIT, then de-assert it. INIT-deassert IPI is done only for backward
    // compatibility.
    // See: 10.4.7.4 Local APIC State After It Receives an INIT-Deassert IPI
    //
    HalpApicWriteCommand(LocalApicId, 0xC500); // APIC_INT_LEVELTRIG | APIC_INT_ASSERT | APIC_DM_INIT
    KeStallExecutionProcessor(10u);

    HalpApicWriteCommand(LocalApicId, 0x8500); // APIC_INT_LEVELTRIG | APIC_DM_INIT
    KeStallExecutionProcessor(200u);

    //
    // Compute the SIPI message value and send it.
    // "the SIPI message contains a vector to the BIOS AP initialization code (at
    //  000VV000H, where VV is the vector contained in the SIPI message)."
    // See: 8.4.3 MP Initialization Protocol Algorithm for MP Systems
    //
    sipiMessage = ((StartupIp & 0xFF000) | 0x600000u) >> 12;  // APIC_DM_STARTUP
    HalpApicWriteCommand(LocalApicId, sipiMessage);
    KeStallExecutionProcessor(200u);
    HalpApicWaitForCommand();
    KeStallExecutionProcessor(100u);

    //
    // Send the 2nd startup IPI.
    //
    HalpApicWriteCommand(LocalApicId, sipiMessage);
    KeStallExecutionProcessor(200u);
}

Note that those HalpApic functions are function pointers that are set for either APIC or x2APIC according to the system configuration.

Then let us review how APs get initialized by following the stub code.

AP Initialization Code

HalpRMStub - Real-Mode 


The entry point code is symbolized as HalpRMStub. As the name suggests, it runs in real-mode, right after the SIPI. As seen in the screenshot below, the stub code sets CR0.PE (0x1), enabling protected mode, and jumps out to somewhere.


As it is 16-bit code, the instructions shown by Windbg are slightly broken. Below is the correct output.

Also, let us switch to physical addresses since the code runs in the real-mode.

From the code, the value of EDI is known to be 0x13000, because EDI is CS << 4, and CS holds bits [19:12] of the start-up IP, as stated in 8.4.3 (see the comment in the above pseudo code).
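The arithmetic linking the start-up address, the SIPI message, and the initial CS can be checked with a short sketch. It mirrors the computation in the pseudo code above and the 000VV000H rule quoted there; the function names are hypothetical.

```c
#include <stdint.h>

#define APIC_DM_STARTUP 0x600u  /* delivery mode 110b (Start-up) in ICR bits 10:8 */

/* Builds the low ICR dword for a SIPI that starts an AP at `startup_ip`
   (must be 4 KB aligned and below 1 MB). */
uint32_t sipi_message(uint32_t startup_ip)
{
    uint32_t vector = (startup_ip >> 12) & 0xff;  /* VV of 000VV000H */
    return APIC_DM_STARTUP | vector;
}

/* Physical address where the AP begins execution in real-mode. */
uint32_t ap_entry_address(uint32_t sipi)
{
    return (sipi & 0xff) << 12;
}

/* CS selector the AP starts with; CS << 4 equals the entry address. */
uint16_t ap_initial_cs(uint32_t sipi)
{
    return (uint16_t)((sipi & 0xff) << 8);
}
```

With a start-up address of 0x13000, the message is 0x613, the AP starts with CS = 0x1300, and CS << 4 gives back 0x13000, the value seen in EDI.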

HalpPMStub - Protected-Mode 


Following EDI+0x60 navigates us to the protected mode stub implemented as HalpPMStub.

This code is responsible for switching to the long-mode. As seen below, it
  • sets CR4.PSE (0x1000),
  • updates IA32_EFER, then
  • sets CR0.PG (0x80000000), to activate the long-mode (see the second screenshot).


Then, it jumps out to where RDI+0x66 specifies. 

HalpLMIdentityStub - Long-Mode under Identity Mapping


The JMP leads to the short stub whose sole responsibility is to retrieve the value of CR3 that can permanently be used, that is, the same value as that of BSP.

As the processor should already be working with virtual addresses, let us switch to them.

RDI+0x70 gives us HalpLMStub.

HalpLMStub - Long-Mode


This is the final stub that APs go through. The first thing this stub does is apply the permanent CR3 value to have the same memory layout as the BSP (and any other already initialized APs), followed by invalidation of TLBs.

After switching the page tables, it performs various initialization, and at the end, it jumps out to where RDI+0x278 indicates.
This ends up with nt!KiSystemStartup, letting the AP run the same initialization code as the BSP (except for a few things done exclusively by the BSP).

Conclusion

We reviewed how Windows initiates execution of APs with the INIT-SIPI-SIPI sequence and how APs go through from real-mode to the regular NT kernel initialization function on Windows 10 19H1 without Hyper-V.

Hopefully, you enjoyed this post and gained more context on the INIT-SIPI-SIPI VM-exits you may see while writing a hypervisor too.

Friday, March 13, 2020

Introduction and Notes on Design Considerations of UEFI-based Hypervisors

In this post, I am going to write up some of the lessons learned and the challenges I had to go through to write a UEFI-based hypervisor that supports booting Windows. I hope this post gives pointers to study and helps you get started with writing a similar hypervisor.
UEFI hypervisor brief design walk-through

Background


Lately, I spent some time studying EDK2-based UEFI programming and developed a hypervisor as a UEFI driver. It has been fun and turned out to be more straightforward than I initially imagined, but at the same time, there were some learning curves and technical challenges I had to take extra time to understand and overcome.

The major reason for taking extra time was the lack of write-ups or tutorials for my goal. Although there were a few open-source projects and many documents and presentations I was able to study, those were not focused on UEFI programming in the context of writing hypervisors. This is entirely understandable as I do not suppose those are common subjects, and that was also why I wrote up this post.

In this post, I will start by giving a high-level overview of UEFI, and unique aspects in its execution environment, then look into challenges of writing a hypervisor as a UEFI driver.

UEFI Execution Environment


UEFI vs EDK2


UEFI is the specification of firmware to replace legacy-BIOS, where no standard exists, and offers a well-defined execution environment and programming interfaces. EDK2 is the open-source, reference implementation of the specification and provides tools to develop firmware modules.

Application vs Driver


Firmware modules can be built as part of a whole firmware image or as a standalone module (file) to be separately deployed. The latter is how I compiled the module. Additionally, UEFI modules can be written as an application which is unloaded from memory once its execution finishes, or as a driver which remains loaded unless explicitly unloaded. Obviously, the driver is the natural choice for the hypervisor, although I will mention the other common approach later.

Boot Time vs Run Time


The execution environment of drivers can be separated into two different phases: boot time and run time.

Roughly speaking, the boot time is before execution is handed over to the operating system and the run time is after that. This transition happens when a UEFI defined API called ExitBootServices is called. In the case of Windows startup, this is sometime before winload.efi transfers its execution to ntoskrnl.exe.

Most of the firmware drivers loaded in memory are unloaded at this point because most of them, for example, a network driver for PXE boot, are no longer needed once execution is handed over to the operating system. This type of driver is called a boot driver and is not suitable for a hypervisor that is meant to stay alive even after the operating system is fully started.

Runtime drivers, on the other hand, are the type of driver that resides in memory throughout the system life span and are suited for the hypervisor.

Boot-time Services vs Run-time Services


UEFI defines a collection of APIs, and their availability is impacted by the boot-to-run time transition. The type of API called boot-time services can no longer be used after the transition because drivers that implement the API are unloaded. After this transition, runtime drivers can only use the run-time services, which drastically reduces the ability of the hypervisor to interact with the environment.

Physical Mode vs Virtual Mode


Another transition that the runtime drivers have to go through is the change of the memory address layout.

At the boot time, the system is in the long-mode, same as Windows. However, virtual to physical address mapping is pure 1:1, that is, the virtual address 0xdf2000 is translated into the physical address 0xdf2000. This mode is called physical mode.

Soon after the transition to run time, a bootloader (winload.efi in the case of Windows) sets up and configures new page tables to map runtime drivers to the addresses that work well with the operating system (eg, the physical address 0xdf2000 may be mapped to 0xfffff803`1ce40000). Then, the bootloader calls the SetVirtualAddressMap run-time service letting runtime drivers perform their preparation, switches to the new page table and discards the old page table. After this point, the runtime drivers are mapped to only the new address, just like regular Windows drivers. This mode is called virtual mode. This transition can be catastrophic if the hypervisor depends on the physical mode page tables. We will review how it can be a problem.

Application Processor Start-Up


Another unique event that the UEFI hypervisor has to handle is processor initialization. Processors that are not selected as a bootstrap processor (BSP; the processor initialized first) are called application processors (APs) and are initialized after transitioning to the virtual mode. This is done by the BSP signaling INIT and Startup-IPI (SIPI). When SIPI is signaled, APs start their execution in real-mode and go through mode transitions up to long-mode (in the case of 64-bit operating systems). This requires some extra VM-exit handling that was not relevant for Blue Pill-style hypervisors.

Those unique aspects of the UEFI environment pose technical challenges and require different hypervisor design considerations.

Challenges, Solutions, and Considerations


Host CR3


As mentioned, the host CR3 becomes invalid if the value at the time of driver load is used because that would be physical mode page tables that get destroyed. The most straightforward solution for this is to set up our own page tables with the same translation as the existing one (ie, physical mode page tables) and use them for the host. This may sound complicated but is implemented with just 50 lines of C code in MiniVisor.
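To show how little is involved, below is a minimal sketch of building identity-mapped page tables for the host. It is not MiniVisor's code: it maps the first 512 GB with 1 GB pages (which assumes the processor supports them), all names are hypothetical, and `translate` is only a software walk used to check the result.

```c
#include <stdint.h>

#define PTE_PRESENT  (1ull << 0)
#define PTE_WRITE    (1ull << 1)
#define PTE_LARGE    (1ull << 7)            /* PS: the entry maps a 1 GB page */
#define ADDR_MASK    0x000ffffffffff000ull
#define ENTRIES      512

typedef struct {
    _Alignas(4096) uint64_t pml4[ENTRIES];
    _Alignas(4096) uint64_t pdpt[ENTRIES];
} host_page_tables_t;

/* Builds tables that identity-map the first 512 GB with 1 GB pages.
   `pdpt_pa` is the physical address of `tables->pdpt`; under the 1:1
   mapping of the physical mode it equals the pointer value. */
void build_identity_map(host_page_tables_t *tables, uint64_t pdpt_pa)
{
    for (int i = 0; i < ENTRIES; i++) {
        tables->pml4[i] = 0;
        tables->pdpt[i] = ((uint64_t)i << 30) | PTE_PRESENT | PTE_WRITE | PTE_LARGE;
    }
    tables->pml4[0] = (pdpt_pa & ADDR_MASK) | PTE_PRESENT | PTE_WRITE;
}

/* Software walk of the tables above; returns 0 on success, -1 if not mapped. */
int translate(const host_page_tables_t *tables, uint64_t va, uint64_t *pa)
{
    uint64_t pml4e = tables->pml4[(va >> 39) & 0x1ff];
    if (!(pml4e & PTE_PRESENT)) {
        return -1;
    }
    uint64_t pdpte = tables->pdpt[(va >> 30) & 0x1ff];
    if (!(pdpte & PTE_PRESENT)) {
        return -1;
    }
    *pa = (pdpte & ADDR_MASK & ~0x3fffffffull) | (va & 0x3fffffffull);
    return 0;
}
```

The host CR3 would then be loaded with the physical address of `pml4`, so the host keeps working after the firmware's physical mode page tables are discarded.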

However, this results in having different address translations once the guest switches to the virtual mode and makes it significantly more difficult for the host to interact with the guest. For example, host code cannot be debugged with tools like Windbg anymore because none of the Windows code is mapped in a usable form while the host is running. If the hypervisor is going to need complex interaction with guest virtual addresses, other approaches might make it simpler in the end. In a private build, I implemented guest shell-code that runs in the same address space as the NT system process for interaction with the guest.
Injecting the guest agent that hooks Windows kernel API
For the same reason, it also becomes harder to access guest virtual memory from the host without implementing a guest-virtual-to-host-virtual mapping mechanism. MiniVisor implements this in MemoryAccess.c. This is essentially what every hypervisor implements.
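
The core of such a mechanism is a software walk of the guest's page tables. The following is a rough sketch and not the code in MemoryAccess.c: `TranslateGuestVa`, `ReadPhys64Fn`, and `SimulatedRead64` are hypothetical names, and it assumes 4-level paging and that guest physical addresses can be read through a callback (for example, because EPT is an identity mapping and the host maps physical memory).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PTE_PRESENT     (1ull << 0)
#define PTE_LARGE_PAGE  (1ull << 7)
#define PTE_ADDR_MASK   0x000FFFFFFFFFF000ull

/* Hypothetical callback that reads 8 bytes of guest physical memory. */
typedef uint64_t (*ReadPhys64Fn)(void *ctx, uint64_t pa);

/* Walk the guest's 4-level page tables starting from the guest CR3.
 * Returns false if any level is not present. */
bool TranslateGuestVa(ReadPhys64Fn read, void *ctx, uint64_t guest_cr3,
                      uint64_t va, uint64_t *pa_out)
{
    assert(read && pa_out);
    uint64_t table = guest_cr3 & PTE_ADDR_MASK;

    /* level 3 = PML4, 2 = PDPT, 1 = PD, 0 = PT */
    for (int level = 3; level >= 0; level--) {
        unsigned shift = 12u + 9u * (unsigned)level;
        uint64_t index = (va >> shift) & 0x1FF;
        uint64_t entry = read(ctx, table + index * 8);

        if (!(entry & PTE_PRESENT)) {
            return false;
        }
        /* 1 GB (PDPT) and 2 MB (PD) large pages end the walk early. */
        bool leaf = (level == 0) ||
                    ((level == 1 || level == 2) && (entry & PTE_LARGE_PAGE));
        if (leaf) {
            uint64_t page_mask = (1ull << shift) - 1;
            *pa_out = (entry & PTE_ADDR_MASK & ~page_mask) | (va & page_mask);
            return true;
        }
        table = entry & PTE_ADDR_MASK;
    }
    return false;
}

/* Stand-in for real physical memory access: ctx is an array of 64-bit
 * words that represents physical memory starting at address 0. */
uint64_t SimulatedRead64(void *ctx, uint64_t pa)
{
    return ((const uint64_t *)ctx)[pa / 8];
}
```

Once the physical address is known, the host still has to map it into its own address space before touching the data, which is where the host page tables come into play.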

Host IDT


For the same reason that the host CR3 becomes invalid, the host IDT becomes invalid if the value at the time of driver load is used. Although this does not cause an issue immediately because interrupts are disabled during execution of the host, any programming error that raises an exception will cause a triple fault without running any diagnostic code. The solution is for the host to create its own IDT.

Having its own IDT, however, means that an NMI can no longer be delivered to the Windows kernel if it occurs during execution of the host (reminder: NMIs still occur even if interrupts are disabled). MiniVisor discards the NMI for simplicity, but you should consider reinjecting it into the guest instead.
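
Reinjection could look roughly like the sketch below. This is not what MiniVisor does (it discards the NMI); `HostNmiState`, `HostNmiHandler`, and `TakePendingNmiInjection` are hypothetical names. The host's NMI handler only records that an NMI arrived, and the value returned later is meant to be written to the VM-entry interruption-information field before resuming the guest.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NMI_VECTOR             2u
#define INTERRUPTION_TYPE_NMI  2u          /* bits 10:8 of the field */
#define INTERRUPTION_VALID     (1u << 31)  /* bit 31 of the field */

/* Hypothetical per-processor state shared by the NMI handler and the
 * VM-exit handler. */
typedef struct {
    bool nmi_pending;
} HostNmiState;

/* Called from the host IDT's NMI entry: just remember the event. */
void HostNmiHandler(HostNmiState *state)
{
    assert(state);
    state->nmi_pending = true;
}

/* Called right before resuming the guest. Returns the value for the
 * VM-entry interruption-information field, or 0 if nothing is pending. */
uint32_t TakePendingNmiInjection(HostNmiState *state)
{
    assert(state);
    if (!state->nmi_pending) {
        return 0;
    }
    state->nmi_pending = false;
    return INTERRUPTION_VALID | (INTERRUPTION_TYPE_NMI << 8) | NMI_VECTOR;
}
```

A complete implementation would also need to respect the guest's NMI blocking state, for example by using NMI-window exiting, which this sketch leaves out.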

Host GDT


You may wonder about the GDT. Yes, the host needs its own GDT as well, and it also needs modification because the firmware does not set up the task state segment that is required for VMX.
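
As an illustration of that modification, the sketch below encodes the 16-byte, 64-bit TSS descriptor that could be appended to a copy of the firmware's GDT. The names (`TssDescriptor64`, `MakeTssDescriptor`, `TssDescriptorBase`) are hypothetical, and the choice of the "available" type (0x9) rather than "busy" is an assumption of this sketch, not something taken from MiniVisor.

```c
#include <assert.h>
#include <stdint.h>

/* Layout of a 64-bit TSS descriptor, which occupies two GDT slots. */
typedef struct {
    uint16_t limit_low;         /* limit 15:0 */
    uint16_t base_low;          /* base 15:0 */
    uint8_t  base_mid;          /* base 23:16 */
    uint8_t  access;            /* P, DPL, type */
    uint8_t  limit_high_flags;  /* limit 19:16 in the low nibble */
    uint8_t  base_high;         /* base 31:24 */
    uint32_t base_upper;        /* base 63:32 */
    uint32_t reserved;
} TssDescriptor64;

_Static_assert(sizeof(TssDescriptor64) == 16, "descriptor must be 16 bytes");

/* Encode a present, DPL 0, available 64-bit TSS descriptor. */
TssDescriptor64 MakeTssDescriptor(uint64_t base, uint32_t limit)
{
    assert(limit <= 0xFFFFFu);
    TssDescriptor64 d;
    d.limit_low        = (uint16_t)(limit & 0xFFFF);
    d.base_low         = (uint16_t)(base & 0xFFFF);
    d.base_mid         = (uint8_t)((base >> 16) & 0xFF);
    d.access           = 0x89;  /* P=1, DPL=0, type=0x9 */
    d.limit_high_flags = (uint8_t)((limit >> 16) & 0x0F);
    d.base_high        = (uint8_t)((base >> 24) & 0xFF);
    d.base_upper       = (uint32_t)(base >> 32);
    d.reserved         = 0;
    return d;
}

/* Decode the base address back out; handy for checking the encoding. */
uint64_t TssDescriptorBase(const TssDescriptor64 *d)
{
    return (uint64_t)d->base_low |
           ((uint64_t)d->base_mid << 16) |
           ((uint64_t)d->base_high << 24) |
           ((uint64_t)d->base_upper << 32);
}
```

The selector that refers to this descriptor (its GDT index multiplied by 8) is what would go into the host TR selector field, which VM-entry requires to be non-zero.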

Logging


The console output API is a boot-time service and cannot be used after the transition to run time. Hence, console-based logging must cease at that point. This could be addressed in several ways, such as hooking into the operating system's logging API, but the simplest solution is to use serial output instead of console output. This has its limitations but requires almost zero extra code.

Another sensible option is to have a ring buffer to store log entries and, later, let a client application pull and print them.
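
A minimal sketch of such a ring buffer follows. The names (`LogRing`, `LogRingWrite`, `LogRingRead`) and sizes are made up for illustration, the C library calls stand in for whatever string helpers the hypervisor has, and there is no locking, so a real implementation would need a lock or per-processor buffers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define LOG_SLOT_COUNT  8u   /* must divide 2^32 so counter wraparound is harmless */
#define LOG_SLOT_SIZE   64u

/* Hypothetical fixed-size log ring: the oldest entries are overwritten
 * when the reader does not keep up. */
typedef struct {
    char     slots[LOG_SLOT_COUNT][LOG_SLOT_SIZE];
    uint32_t head;  /* total number of entries written */
    uint32_t tail;  /* total number of entries read or dropped */
} LogRing;

void LogRingInit(LogRing *ring)
{
    memset(ring, 0, sizeof(*ring));
}

/* Host side: store one message, truncating it to the slot size. */
void LogRingWrite(LogRing *ring, const char *message)
{
    char *slot = ring->slots[ring->head % LOG_SLOT_COUNT];
    snprintf(slot, LOG_SLOT_SIZE, "%s", message);
    ring->head++;
    if (ring->head - ring->tail > LOG_SLOT_COUNT) {
        ring->tail = ring->head - LOG_SLOT_COUNT;  /* drop the oldest entry */
    }
}

/* Client side: copy out the oldest unread message.
 * Returns 1 if a message was copied, 0 if the ring is empty. */
int LogRingRead(LogRing *ring, char *out, size_t out_size)
{
    assert(out_size > 0);
    if (ring->tail == ring->head) {
        return 0;
    }
    snprintf(out, out_size, "%s", ring->slots[ring->tail % LOG_SLOT_COUNT]);
    ring->tail++;
    return 1;
}
```

The client application would reach `LogRingRead` through some channel into the hypervisor, such as a hypercall, and print whatever it gets back.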

Testing Application Processors Startup


Starting application processors requires the hypervisor to handle additional VM-exits and to properly emulate paging mode transitions, neither of which is relevant for blue pill-style hypervisors. Specifically, handling of INIT, SIPI, and CR0.PG access is required.
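
To give a feel for the INIT and SIPI part, the sketch below computes the guest state an AP should start with. The names (`ApStartState`, `HandleInitSignal`, `HandleSipi`) are hypothetical, and resetting the remaining registers on INIT as well as the CR0.PG emulation are omitted. It relies on the architectural rule that a SIPI with vector VV starts execution at real-mode address VV00:0000.

```c
#include <assert.h>
#include <stdint.h>

/* Values of the guest activity state field. */
#define ACTIVITY_STATE_ACTIVE         0u
#define ACTIVITY_STATE_WAIT_FOR_SIPI  3u

/* Hypothetical subset of the guest state the handlers would write to
 * the corresponding guest-state fields. */
typedef struct {
    uint16_t cs_selector;
    uint64_t cs_base;
    uint64_t rip;
    uint32_t activity_state;
} ApStartState;

/* INIT VM-exit: park the guest processor until a SIPI arrives. */
ApStartState HandleInitSignal(void)
{
    ApStartState state = { 0, 0, 0, ACTIVITY_STATE_WAIT_FOR_SIPI };
    return state;
}

/* SIPI VM-exit: start at VV00:0000 in real mode, where VV is the
 * vector reported with the exit. */
ApStartState HandleSipi(uint8_t vector)
{
    ApStartState state;
    state.cs_selector    = (uint16_t)((uint16_t)vector << 8);
    state.cs_base        = (uint64_t)vector << 12;
    state.rip            = 0;
    state.activity_state = ACTIVITY_STATE_ACTIVE;
    assert(state.cs_base == ((uint64_t)state.cs_selector << 4));
    return state;
}
```

For example, a SIPI with vector 0x10 makes the AP start at physical address 0x10000 with CS set to 0x1000.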

For me, this was one of the most challenging parts of writing a hypervisor that supports booting an operating system, mostly due to the lack of virtualization solutions usable as a test environment and the differences between them and the bare-metal environment (e.g., TLB, MSR, etc.), which require thorough testing on bare metal.

My recommendation is to buy and set up a single-board computer with a serial port so you can at least do printf-debugging (or even better, Direct Connect Interface support). I might blog about selecting devices and setting them up.
Testing with a single-board computer

Driver vs Standalone File

Compiling the hypervisor as a runtime driver works, as demonstrated in the project. However, the more common approach is to build the hypervisor as a separate file and have a UEFI application load it into memory and start executing it. That is how the VMware hypervisor and Hyper-V are implemented, for example. The standalone hypervisor format is often ELF because of wider cross-platform compiler and debugging tool support.

This approach has the advantage that the hypervisor code remains platform agnostic and reusable; for example, one can write a small Windows driver as a hypervisor loader without mixing platform-dependent loader code with hypervisor code that should be platform independent. The hypervisor module can then remain portable.


MiniVisor did not take this approach simply because it grew out of experimentation without much structure. I plan to restructure the project in this way.


Conclusion

We reviewed some unique aspects of the UEFI environment and how they impact the design and implementation of hypervisors compared with those designed under the blue-pill model. We also looked at how MiniVisor was designed to work with those new factors and the limitations they imply.

While this short blog post may not be sufficient for some readers to form a clear picture of those challenges and the solutions explained, I hope it gives you some pointers for studying the MiniVisor codebase and helps make sense of why things are written differently than in a blue pill-style Windows hypervisor.

Further Learning

As a final note, if you are particularly curious about tooling a hypervisor for research and/or just want a solid understanding of the underlying technologies and concepts, Bruce Dang and I plan to offer a 5-day class this October. It will let you write your own hypervisor for both Windows and UEFI environments, develop "something useful", and play with it on physical and virtual machines to internalize technical details.

Please sign up from this page or contact us if you are interested.
https://gracefulbits.regfox.com/hypervisor-development-for-security-analysis