Thursday, December 24, 2020

Experiment in extracting runtime drivers on Windows

This post explains the concept of UEFI runtime drivers, how they interact with the OS, and an experimental attempt to extract them.

Here is a quick takeaway from this article. 

  • UEFI runtime drivers are part of the firmware and run with ring-0 privilege before the OS starts.
  • They provide interfaces to some firmware-dependent features, called runtime services, to the OS. 
  • Windows saves the addresses of those runtime services into HalEfiRuntimeServicesBlock.
  • The base addresses of runtime drivers can be located from the contents of HalEfiRuntimeServicesBlock, but it is difficult to safely find HalEfiRuntimeServicesBlock and the base addresses.
  • Dumping the runtime drivers is useful for diagnosing issues with them, but the HalEfiRuntimeServicesBlock-based approach is fundamentally limited to drivers that implement runtime services.
Happy holidays?


What are UEFI runtime drivers?

Any retail x86_64 PC has UEFI-based system firmware these days, and such firmware must implement some modules that reside in memory from system start-up to shutdown. Those modules are called UEFI "runtime drivers" and are meant to provide interfaces to certain firmware-implemented services to an operating system (OS), as specified by the UEFI specification.

For example, UEFI defines ResetSystem() as one such interface and requires OEMs to ship firmware containing a runtime driver that implements it. Taking my ASUS laptop as an example, ResetSystem() is implemented in a runtime driver called SBRun. 

Those interfaces are called "runtime services", and 14 of them are defined. One can find the C-representation of their definitions in EDK2.
///
/// EFI Runtime Services Table.
///
typedef struct {
  ///
  /// The table header for the EFI Runtime Services Table.
  ///
  EFI_TABLE_HEADER                Hdr;

  //
  // Time Services
  //
  EFI_GET_TIME                    GetTime;
  EFI_SET_TIME                    SetTime;
  EFI_GET_WAKEUP_TIME             GetWakeupTime;
  EFI_SET_WAKEUP_TIME             SetWakeupTime;

  //
  // Virtual Memory Services
  //
  EFI_SET_VIRTUAL_ADDRESS_MAP     SetVirtualAddressMap;
  EFI_CONVERT_POINTER             ConvertPointer;

  //
  // Variable Services
  //
  EFI_GET_VARIABLE                GetVariable;
  EFI_GET_NEXT_VARIABLE_NAME      GetNextVariableName;
  EFI_SET_VARIABLE                SetVariable;

  //
  // Miscellaneous Services
  //
  EFI_GET_NEXT_HIGH_MONO_COUNT    GetNextHighMonotonicCount;
  EFI_RESET_SYSTEM                ResetSystem;

  //
  // UEFI 2.0 Capsule Services
  //
  EFI_UPDATE_CAPSULE              UpdateCapsule;
  EFI_QUERY_CAPSULE_CAPABILITIES  QueryCapsuleCapabilities;

  //
  // Miscellaneous UEFI 2.0 Service
  //
  EFI_QUERY_VARIABLE_INFO         QueryVariableInfo;
} EFI_RUNTIME_SERVICES;

Why do we care?

Runtime drivers have some interesting characteristics:
  • They start before the OS is loaded and can influence the boot process
  • They run with the ring-0 privilege
  • They are called during normal OS execution through the runtime services
  • They are not listed by any widely known monitoring tools or debuggers (unlike device drivers) on Windows
  • They can be developed by anyone and be loaded as long as Secure Boot is disabled
  • They may not exist as files in storage that is accessible from the OS
Because of the lack of visibility into them and access to their files, diagnosing issues with them can be more challenging than issues with OS-based kernel modules. For example, to reverse engineer runtime driver code, you first have to find the runtime drivers in memory and extract them to a file, instead of grabbing a file from the disk.

The other reason is that runtime drivers have gained popularity in the reverse engineering community and are used more and more widely in ways that break standard OS/system integrity. For example, the owner of the system may install a 3rd party runtime driver that overrides the runtime services provided by OEM firmware, to have a "backdoor" for reverse engineering. EfiGuard and efi-memory are examples of those. While that is the owner's very intention, some software may still want to detect this and be aware of the fact that system integrity might have been tampered with.

How can we find runtime drivers on Windows?

There is no documented interface to locate runtime drivers in the Windows kernel, unfortunately. However, there are a few implementation details that can be abused for this, for example:
  • The addresses of some runtime services are stored in HalEfiRuntimeServicesBlock
  • There is an EFI_RUNTIME_SERVICES global variable, which contains pointers to the runtime services as seen above, has a distinctive RUNTSERV signature, and is memory resident
  • Runtime drivers are also memory resident, mapped in a certain contiguous physical and virtual memory range, and have the DOS header at 4KB aligned addresses
  • Physical memory addresses backing runtime drivers are outside the ranges Windows manages
Based on those facts, one may scan physical or virtual memory for the DOS header and attempt to find all runtime drivers. 

The other, limited (details later) but arguably easier and safer way is to refer to HalEfiRuntimeServicesBlock. HalEfiRuntimeServicesBlock is a Windows-defined structure made up of copies of a handful of runtime service addresses [ref], as shown below. 

typedef struct _HAL_RUNTIME_SERVICES_BLOCK
{
    void* GetTime;
    void* SetTime;
    void* ResetSystem;
    void* GetVariable;
    void* GetNextVariableName;
    void* SetVariable;
    void* UpdateCapsule;
    void* QueryCapsuleCapabilities;
    void* QueryVariableInfo;
} HAL_RUNTIME_SERVICES_BLOCK;

Windows initializes this structure on its startup and uses it to invoke runtime services (as opposed to directly using the EFI_RUNTIME_SERVICES structure). By locating HalEfiRuntimeServicesBlock, one can find runtime services' addresses and runtime drivers implementing them. 
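To illustrate the idea, below is a minimal, conceptual sketch (written in C here, not kraft_dinner's actual Rust code) of deriving a runtime driver's base address from a runtime service address taken from HalEfiRuntimeServicesBlock: walk backwards one page at a time until the 'MZ' DOS signature appears. The 16MB search limit is an arbitrary assumption, and the raw dereferences would need to be guarded in practice, for the reasons discussed in the next section.

#include <ntddk.h>
#include <ntimage.h>

static PVOID FindRuntimeDriverBase(PVOID RuntimeServiceAddress)
{
    // Runtime driver images are page aligned, so start at the containing page.
    ULONG_PTR page = (ULONG_PTR)RuntimeServiceAddress & ~((ULONG_PTR)PAGE_SIZE - 1);

    // Walk back up to 16MB (an arbitrary bound) looking for the DOS header.
    for (ULONG i = 0; i < 0x1000; ++i, page -= PAGE_SIZE)
    {
        PIMAGE_DOS_HEADER dosHeader = (PIMAGE_DOS_HEADER)page;

        // NOTE: a real implementation must verify that this page is safely
        // readable first; see the PoC section below.
        if (dosHeader->e_magic == IMAGE_DOS_SIGNATURE)  // 'MZ'
        {
            return (PVOID)page;   // Likely the base of the runtime driver
        }
    }
    return NULL;
}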

PoC and challenges 

I authored a tool implementing this idea, named kraft_dinner. Greater details can be found in the source code; here are some notable discoveries with this approach.
  • HalEfiRuntimeServicesBlock can be found with HalQuerySystemInformation() only up until 19H2. One has to get creative for newer versions.
  • The physical memory address range backing runtime drivers is not known to Windows and not reported by MmGetPhysicalMemoryRanges(). This can be used to test a probable runtime driver address. 
  • MmCopyMemory() never succeeds in reading memory that backs runtime drivers, regardless of whether virtual or physical memory is specified. This makes implementing a safe search operation harder.

Limitations

While the HalEfiRuntimeServicesBlock approach works reasonably well, it has some fundamental limitations.

Firstly, runtime drivers that do not implement runtime services are not found. Such runtime drivers do not have any formally defined way to directly influence Windows and system integrity, but may still hook (patch) other code to implement a backdoor instead. umap and voyager are examples of such hacking drivers. A memory scanning-based approach would address this issue.

Secondly, runtime drivers can easily hide from this approach by writing trampoline code at the beginning of the original runtime service instead of replacing the pointer in EFI_RUNTIME_SERVICES, or by nullifying the PE header. Those may be mitigated with more intelligent analysis and dumping, but they are easy countermeasures against scanning.

Thirdly, as a general challenge with memory analysis, classifying memory-dumped files is not straightforward. Since memory contents can vary slightly between boots due to relocation (code patches), hash values change each time. Fuzzy hashing such as ssdeep is required to classify dump files and build a useful database. Also, a dump file does not contain the driver's name and GUID as found in the actual firmware.

Finally, this is Windows-specific and hacky. It depends heavily on Windows implementation details which may change at any time. While the PoC worked fine on multiple devices, I would not be comfortable deploying this logic to millions of systems.

So why did I do this?

I encountered a bug in one of the OEM runtime drivers and thought I would put together something quick, but in Rust :) 

References

Monday, November 16, 2020

S3 Sleep, Resume and Handling Them with Type-1 Hypervisor

This post explains how the system enters and resumes from S3 (Sleep) on a modern x86-64 system, by reviewing specifications and the implementation of Windows as an example. This post also outlines challenges with S3 for type-1 hypervisors and how to work around it.
TeaTea in the S3 state

Why S3 is Interesting

On normal system startup, UEFI-based system firmware goes through four execution phases before starting the OS. Those phases include Driver eXecution Environment (DXE), Boot Device Selection (BDS), and Transient System Load (TSL), where system configurations are set and 3rd party firmware modules may be executed. On the S3 resume boot path, on the other hand, those phases are skipped for faster start-up. 

This has significant security implications because the S3 resume boot path needs to reapply the same security configurations as those made during the normal boot path, using entirely different code. Failure to do so securely leads to vulnerabilities, for example, unauthorized modification of a system firmware image if a firmware write-protection bit is not reapplied during resume. 

Also, for a type-1 hypervisor that is loaded during the TSL phase, the lack of that phase means it does not get loaded on resume. Since the processors were shut down during S3, processor-based virtualization features such as Intel VT-x stop working after resume even though the hypervisor module remains mapped in memory. This needs to be handled. 

High-Level Flow

Before diving into details, let us review a high-level flow of S3 sleep and resume. The following are the highlights.
  1. Setting certain bits in the registers called Power Management (PM) 1 Control registers, or PM1a_CNT_BLK / PM1b_CNT_BLK puts the system into the S3 state.
  2. During the next system start-up, system firmware detects that shutdown was because of S3 and executes the S3 resume boot path, instead of the normal boot path.
  3. System firmware executes a collection of commands, called boot scripts, and code pointed by the Firmware Waking Vector in the ACPI table. This latter is called an OS waking vector and set up by the OS prior to entering S3.
  4. The waking vector resumes execution of the OS.

Entering S3

The platform enters S3 when software sets the SLP_EN bit to 1 and the SLP_TYP bits to 5 (0b101) in the PM1 control registers. Looking at the ACPI specification, it states that setting the SLP_EN bit triggers the state transition. 
Table 4.13: PM1 Control Registers Fixed Hardware Feature Control, from the ACPI spec
The explanation of the SLP_TYP bits in the table is not crystal clear, but it becomes more obvious with the specification of the Intel platform. The below is an excerpt from the table under 4.2.2 Power Management 1 Control (PM1_CNT) in one of the hardware models that implement ACPI.
 From Intel 495 Series Chipset Family On-Package Platform Controller Hub volume 2

Then, where are those registers? The ACPI specification does not define their locations but does define the way to locate them. Under 4.8.3 PM1 Control Registers, it states that
Each register block has a unique 32-bit pointer in the Fixed ACPI Table (FADT) to allow the PM1 event bits to be partitioned between two chips.
The below are excerpts of the FADT format, which contains multiple fields indicating where the registers are. 
...
...
Depending on the implementation of ACPI, some fields may be unused. On my system, the SLEEP_CONTROL_REG field in the table indicates that the register is located at IO port 0x1804. 

RWEverything parsing the FACP table on Windows

So far, we learned that: 
  • the system enters S3 state when software sets SLP_EN and SLP_TYP bits in the PM1 control register.
  • the PM1 control register can be located through the FADT ACPI table. 
Note that the ACPI table itself can be easily located in platform-specific ways, such as /sys/firmware/acpi on Linux, GetSystemFirmwareTable() on Windows, or EfiLocateFirstAcpiTable() on UEFI.
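As a concrete example of the Windows route, the following user-mode sketch fetches the FADT ("FACP") with GetSystemFirmwareTable() and prints the PM1a control block address. The 0x40 offset of the PM1a_CNT_BLK field comes from the FADT layout in the ACPI specification; error handling is intentionally minimal.

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    // The table ID is the signature "FACP" interpreted as a little-endian DWORD.
    DWORD tableId = 'F' | ('A' << 8) | ('C' << 16) | ('P' << 24);

    UINT size = GetSystemFirmwareTable('ACPI', tableId, NULL, 0);
    if (size == 0)
    {
        return 1;
    }

    BYTE* fadt = (BYTE*)malloc(size);
    GetSystemFirmwareTable('ACPI', tableId, fadt, size);

    // PM1a_CNT_BLK is the 32-bit field at offset 0x40 of the FADT.
    DWORD pm1aCntBlk = *(DWORD*)(fadt + 0x40);
    printf("PM1a_CNT_BLK = 0x%lx\n", pm1aCntBlk);

    free(fadt);
    return 0;
}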

Resuming from S3

On system start-up, system firmware executes the same initialization path as the normal boot path, and then diverges when it detects that the previous shutdown was an entry into S3. This resume-specific path is called the S3 resume boot path and is well explained in the UEFI Platform Initialization (PI) specification. 

In a nutshell, the S3 resume boot path executes the boot scripts to re-initialize the platform, instead of executing the last three boot phases: DXE, BDS and TSL. The boot scripts are saved in non-volatile storage and replicate platform configuration made during normal boot. The below illustration from the spec highlights differences between normal and S3 resume boot paths, as well as how boot scripts are saved and consumed.
Normal and S3 resume boot paths, from the PI spec
As illustrated, after boot scripts are executed, an OS waking vector is executed to resume execution of the OS on the S3 resume boot path. The OS waking vector is the very first OS-specific code (the code that is developed by the OS vendor, and not part of system firmware). This is typically 16bit real-mode code that changes the processor mode to the long mode, resets registers to the same values as what they had before the system entered S3, and lets the OS execute further restoration code to fully resume the system. The OS sets up this OS waking vector right before entering S3. 

How does the OS set up the OS waking vector, and how does system firmware find its location? Again, ACPI defines the way. 

The Firmware ACPI Control Structure (FACS) table defines a field called Firmware Waking Vector. The OS writes the address of the OS waking vector to this field, and system firmware reads it to locate and execute the OS waking vector. 
Firmware Waking Vector in FACS, from the ACPI spec

To summarize the flow in the chronological order:
  1. OS writes an address of the OS waking vector (ie, bootstrap code) to the Firmware Waking Vector field of the FACS table before entering S3.
  2. System firmware reads the field to know the address of the OS waking vector and transfers execution to the address during the S3 resume boot path.
  3. The OS waking vector eventually resumes system states using configurations kept in memory.

Implementation on Windows and EDK2

Let us look into how what we reviewed above is implemented on Windows (build 18362) and EDK2. 
EDK2 is a reference implementation of UEFI, the system firmware specification, and very commonly used as a base of commercial system firmware. 

Entering S3

On Windows, HaliAcpiSleep() is the main function that implements S3 handling and is called on all processors when a user requests entering S3. It roughly does the following, in this order. 
  1. Bootstrap processor (BSP) sets up the OS waking vector with HalpSetupRealModeResume().
    *HalpWakeVector = HalpLowStubPhysicalAddress;
    //
    // Where HalpWakeVector is the address of the
    // Firmware Waking Vector field in the FACS table,
    // initialized at HaliInitPowerManagement()
    //
  2. BSP waits for all APs to complete saving their states.
    InterlockedAdd(&HalpFlushBarrier, 1);
    while (HalpFlushBarrier != ProcessorCount);
  3. Application processors (APs) save their registers with HalpSaveProcessorState().
  4. APs enter the loop that does not exit in a successful path in HalpFlushAndWait().
    InterlockedIncrement(&HalpFlushBarrier);
    while (HalpFlushBarrier);
  5. BSP writes to the PM1 control register(s) to set the following values with HalpAcpiPmRegisterWrite().
    • SLP_TYP = 5 (S3)
    • SLP_EN = 1 
This puts the system into the S3 state. Below is a conceptual sketch of what that final register write amounts to; after that, let us look into the resume path.
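For illustration, this is roughly what that final write boils down to, expressed with EDK2's IoLib. The port 0x1804 is the PM1a control port observed on my system, and the SLP_TYP value of 5 is platform-specific (it normally comes from the \_S3 object in the DSDT), so treat this as a sketch rather than portable code. Do not run it casually; it puts the system to sleep.

#include <Library/IoLib.h>

VOID EnterS3Sketch(VOID)
{
    UINT16 value;

    value  = (5 << 10);   // SLP_TYP = 5 (bits 12:10)
    value |= (1 << 13);   // SLP_EN  = 1 (bit 13)

    IoWrite16(0x1804, value);   // The PM1a control port on my system
}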

Resuming from S3

  1. On the EDK2 (system firmware) side, the S3-resume-specific execution flow looks roughly like this.
    ...
      -> DxeLoadCore()
           -> S3RestoreConfig2()
                -> S3ResumeExecuteBootScript()
                     -> S3BootScriptExecutorEntryFunction()
  2. S3BootScriptExecutorEntryFunction() executes the boot script and jumps to the OS waking vector as indicated by Facs->FirmwareWakingVector at the end.
  3. The OS waking vector is a copy of HalpRMStub. This eventually brings the execution of BSP to the right after HalpSetupRealModeResume() with RAX=1, as if it returned from the function. 
  4. BSP wakes up other APs by sending INIT-SIPI-SIPI.
    //
    // This wakes up all APs with HalStartNextProcessor() calls
    //
    HalpAcpiPostSleep(...);
  5. The INIT-SIPI-SIPI brings APs to the right after HalpSaveProcessorState() with RAX=1, as if it returned from the function. For more details on how INIT-SIPI-SIPI starts up APs, please read the previous post
  6. Both the BSP and APs call HalpPostSleepMP() to restore other platform states, then return from HaliAcpiSleep(), continuing OS execution as usual.
If you are interested in how exactly the OS waking vector is set up and resumes the system states, I suggest reversing HaliAcpiSleep() on your own. The way it factors code to keep the flow as straightforward as possible is a masterpiece.  

Note that on VMware, step 1 of the pre-S3 steps and steps 1-3 of the post-S3 steps are skipped. Windows on VMware does not need them either, as the VMware hypervisor directly restores system states instead of going through the full S3 resume boot path. 

Handling S3 with Type-1 Hypervisor

As mentioned previously, S3 is a challenge for a type-1 hypervisor that is loaded during the TSL phase because:
  • On resume, the TSL phase is skipped, so the hypervisor has no opportunity to get loaded.
  • On resume, virtualization is disabled and needs to be enabled.
  • It cannot add its own boot script to trigger reinitialization, because the boot script is already locked by the TSL phase. 
One may employ a guest support module that subscribes to the resume event and notifies the hypervisor to trigger reinitialization, but that is neither secure, portable, nor reliable. Another quick-and-dirty way is to disable S3 by altering the ACPI table, which has an obvious user experience issue.    

The much superior way is to hook the OS waking vector. This works as follows:
  1. The hypervisor intercepts IO access to the PM1 control register(s)
  2. When the guest attempts to write to the register to enter sleep, the hypervisor 
    1. overwrites contents of the Firmware Waking Vector field with its own waking vector address, and
    2. writes to the register and lets the system enter S3
  3. When the system wakes up, hypervisor's waking vector is executed, and it
    1. reenables virtualization (with VMXON for example) 
    2. sets up the guest state to emulate execution of guest's waking vector (ie, guest's RIP is set to the guest waking vector)
    3. launches the guest (with VMLAUNCH for example)
Hypervisor resuming from S3

This way, the hypervisor can take control over the system before running any OS (guest) specific code. Implementation of this can be found in multiple hypervisors such as ACRN Embedded Hypervisor and Bitvisor.  
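A simplified sketch of step 2 above, as it might appear in an IO-write VM-exit handler, is shown below. The helpers (FindFacs, AsmHypervisorWakingVector) and the global holding the PM1a port are hypothetical, and the offset 12 of the Firmware Waking Vector within the FACS is from the ACPI specification.

#include <Uefi.h>
#include <Library/IoLib.h>

#pragma pack(push, 1)
typedef struct {
    UINT8  Reserved[12];
    UINT32 FirmwareWakingVector;   // Offset 12 of the FACS, per the ACPI spec
} FACS;
#pragma pack(pop)

extern UINT16 g_Pm1aControlPort;               // Discovered from the FADT at startup (hypothetical)
extern FACS*  FindFacs(VOID);                  // Locates the FACS through the FADT (hypothetical)
extern VOID   AsmHypervisorWakingVector(VOID); // Hypervisor's 16-bit waking stub, placed in low memory (hypothetical)

STATIC UINT32 g_GuestWakingVector;

VOID HandleIoWriteExit(UINT16 Port, UINT16 Value)
{
    if ((Port == g_Pm1aControlPort) && ((Value & (1 << 13)) != 0))  // SLP_EN being set
    {
        FACS* facs = FindFacs();

        // Remember where the OS wants to wake up, and divert the firmware to
        // the hypervisor's own waking vector first.
        g_GuestWakingVector = facs->FirmwareWakingVector;
        facs->FirmwareWakingVector = (UINT32)(UINTN)AsmHypervisorWakingVector;
    }

    // Pass the write through; the system enters S3 here.
    IoWrite16(Port, Value);
}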

For completeness, note that a type-1 hypervisor that is part of an OEM firmware image or is a PEI module does not have to do any of this. If the module were part of the OEM image, it would be able to add a boot script entry to perform reinitialization, and if the module were a PEI module, it would be executed even in the S3 resume boot path. 

Conclusion

Entering and resuming from S3 is complex work that involves the OS, system firmware, and hardware implementations, as well as multiple specifications such as PI and ACPI. However, studying it allows us to familiarize ourselves with the industry standards and intriguing low-level implementation details.  

As a side note, I recommend learning type-1 hypervisors over OS kernel-based ones. A type-1 hypervisor is not just more flexible; it lets you understand greater details of how the system works (and is arguably a more common design across production-level hypervisors). I am still suspending the registration of the public hypervisor development class, but I am looking into reopening it sometime in the next year as a remote class. If you are interested, please reach out to tanda.sat@gmail.com for details.  

References

EDIT

  • Nov 25 - Correct that the boot phase relevant to 3rd party type 1 hypervisor is TSL and not BDS.

Monday, May 18, 2020

Introductory Study of IOMMU (VT-d) and Kernel DMA Protection on Intel Processors

This post is a write-up of an introductory study of Intel VT-d, especially about how DMA remapping may be programmed and how Windows uses it. The hope is that this article helps you gain a basic understanding of it and start looking into more details as you are interested.

Intel VT-d

Intel VT-d, formally called Intel VT for Directed I/O, consists of the following three features:
  • DMA Remapping
  • Interrupt Remapping
  • Interrupt Posting
DMA remapping is the most commonly discussed feature out of those and is the focus of this article.

DMA Remapping

DMA Remapping is an important feature because it allows software to implement security against Direct Memory Access (DMA) from malicious devices by configuring access permissions for each physical memory page. While ordinary memory page protections can be configured through the paging structures, and, when Intel VT-x is used, through the Extended Page Tables (EPT), those configurations are completely ignored in the case of DMA access. Therefore, another protection mechanism is required to complete the protection of memory. DMA remapping achieves this.

The following illustration from the specification highlights that DMA goes through DMA remapping instead of CPU memory virtualization (ie, EPT). 


VT-x Not Required

Ignore the upper half of the illustration above. It is a typical misconception that VT-d (DMA remapping) is tied to VT-x, virtual machines, and such. DMA remapping is usable and useful without VT-x; Windows, for example, can enable a DMA remapping based security feature (called Kernel DMA Protection) without requiring VT-x based security (VBS: Virtualization Based Security) to be enabled.

The sample project shown in this post below enables DMA remapping independently as well.

IOMMU

DMA remapping is also referred to as IOMMU, as it functions like the Memory Management Unit (MMU) for IO memory access. Not only is the concept similar, but it also has a very similar programming interface to that of the MMU, that is, the paging structures and EPT. 

At a high level, the major difference is that DMA remapping uses two more tables for translation, on top of the familiar PML5, PML4, PDPT, PD, and PT. Simply put, translation with the MMU is
  • Hardware register => PML4 => PDPT => ...
while that of IOMMU is
  • Hardware register => Root table => Context table => PML4 => PDPT => ...
The specification refers to the tables referenced from the context table as the second-level page tables. The below diagram illustrates the translation flow.
Notice that,
  • The entry of the root table is selected based on the bus number of the device requesting DMA.
  • The entry of the context table is selected based on a combination of device and function numbers of the device.
As an example of bus:device:function (referred to as source-id) assignment, my test DMA-capable device is listed as Bus 6 : Device 0 : Function 0 on one system as shown below.
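As a rough sketch, this is how that source-id walks the tables (the entry layouts are simplified, and dereferencing physical addresses directly like this only works under a 1:1 mapping such as UEFI's boot-time environment):

#include <stdint.h>

typedef struct { uint64_t Low; uint64_t High; } ROOT_ENTRY;     // 256 entries, one per bus
typedef struct { uint64_t Low; uint64_t High; } CONTEXT_ENTRY;  // 256 entries, one per device:function

void LocateEntries(ROOT_ENTRY* rootTable, uint8_t bus, uint8_t device, uint8_t function)
{
    // The root entry is selected by the bus number (6 in the example above).
    ROOT_ENTRY* rootEntry = &rootTable[bus];

    // Bits 63:12 of the root entry hold the physical address of the context table.
    CONTEXT_ENTRY* contextTable = (CONTEXT_ENTRY*)(uintptr_t)(rootEntry->Low & ~0xfffull);

    // The context entry is selected by device (5 bits) and function (3 bits); 0 here.
    CONTEXT_ENTRY* contextEntry = &contextTable[(device << 3) | function];

    // The context entry, in turn, points to the second-level PML4 (or PML5).
    (void)contextEntry;
}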


Sample Code and Demo

Let us jump into some code. The HelloIommuPkg is a runtime DXE driver that enables DMA remapping and protects its first page (PE header) from DMA read and write by any devices. 

Loading this will yield the following output and protect the page if successful.
Then, performing DMA read with the test PCI device using PCILeech demonstrates that the other page is readable,

but the protected page is not.

By inspecting one of the reported fault-recording registers using RWEverything, it can be confirmed that DMA was indeed blocked by a lack of read-permission.
  • The first column indicates the faulting address (0x6ff48000)
  • The third column indicates the source-id of the requesting device (Bus 6 : Device 0 : Function 0)
  • 6 in the fourth column indicates the lack of read-permission.

Programming IOMMU 

Enabling DMA remapping at a minimum can be divided into the following steps:
  1. Locating the DMA Remapping Reporting (DMAR) ACPI table.
  2. Gathering information about the available DMA remapping hardware units from DMA-remapping hardware unit definition (DRHD) structures in (1).
  3. Configuring translation by initializing the tables mentioned above. 
  4. Writing hardware registers to use (3) and activating DMA remapping.
HelloIommuDxe.c roughly follows this sequence with some demonstration and error-checking code. (1) and (2) are straightforward and can be validated with tools like RWEverything.

The complexity of (3) varies largely depending on how granular and selective the required translations and memory protections are. HelloIommuPkg allows any access from any device to anywhere, except for the single protected page, which simplifies this step. (4) is mostly just following the specification.

Overall, the minimum steps are simple, and HelloIommuPkg's line count without comments is less than 700 lines.
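For steps (1) and (2), the sketch below shows how the DMAR table could be parsed in an EDK2 module. The structure definitions are written out from the byte offsets in the VT-d specification rather than copied from HelloIommuPkg, so take them as illustrative.

#include <Uefi.h>
#include <Library/UefiLib.h>
#include <IndustryStandard/Acpi.h>

#pragma pack(push, 1)
typedef struct {
    EFI_ACPI_DESCRIPTION_HEADER Header;   // Standard 36-byte ACPI header
    UINT8                       HostAddressWidth;
    UINT8                       Flags;
    UINT8                       Reserved[10];
    // Remapping structures (DRHD, RMRR, ...) follow
} DMAR_TABLE;

typedef struct {
    UINT16 Type;                  // 0 = DRHD
    UINT16 Length;
    UINT8  Flags;
    UINT8  Reserved;
    UINT16 SegmentNumber;
    UINT64 RegisterBaseAddress;   // The remapping hardware registers for this unit
} DMAR_DRHD;
#pragma pack(pop)

VOID EnumerateDrhd(VOID)
{
    DMAR_TABLE* dmar = (DMAR_TABLE*)EfiLocateFirstAcpiTable(SIGNATURE_32('D', 'M', 'A', 'R'));
    if (dmar == NULL)
    {
        return;
    }

    UINT8* current = (UINT8*)(dmar + 1);
    UINT8* end = (UINT8*)dmar + dmar->Header.Length;

    while (current < end)
    {
        DMAR_DRHD* drhd = (DMAR_DRHD*)current;
        if (drhd->Type == 0)
        {
            // drhd->RegisterBaseAddress is where this unit's GCMD, RTADDR,
            // fault recording and other remapping registers live.
        }
        current += drhd->Length;
    }
}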

Use of DMA Remapping on Windows 

Windows uses DMA remapping when available. If the system does not enable Kernel DMA Protection, it configures translations mostly to pass through all requests from all devices, with a few exceptions.

The following screenshot, taken from a system without Kernel DMA Protection, shows the translation for the DMA-capable device at Bus 7 : Device 0 : Function 0. The value 9 at the bottom right indicates DMA requests are passed through (see "TT: Translation Type" in the specification). 

Notice that most of the entries point to the same context table at 0x1ac000, which is configured for pass-through, providing no protection.

As a side note, it would be technically possible for third-party Windows drivers to modify those translations and attempt to provide additional security against DMA unless VBS is enabled.

Use of DMA Remapping with Kernel DMA Protection

If Kernel DMA Protection is enabled, most of the translations are configured to fail. This is achieved by pointing to the second-level PML4 that is filled with zero, meaning translations are not present.

The below screenshot shows an example configuration with Kernel DMA Protection. Notice the context table at 0x1ac000 points to the second level PML4 at 0x251000, which is all zero.


Note that those memory locations are not visible if VBS is enabled. Disable it to inspect them.

Interestingly, I was not able to observe the described behavior of Kernel DMA Protection, in that, regardless of whether the screen is locked, performing DMA against the device resulted in bug check 0xE6: DRIVER_VERIFIER_DMA_VIOLATION (type 0x26). From what I read from Hal.dll, it made sense to bug check, but I doubt this is how Kernel DMA Protection is supposed to protect the system.

Conclusion

DMA remapping is part of the Intel VT-d architecture, provides security against DMA from malicious devices, and can be enabled without Intel VT-x. The sample project HelloIommuPkg demonstrates a simple setup of DMA remapping from UEFI with less than 700 lines of code.

It is shown that Windows enables DMA remapping if available, and when the Kernel DMA Protection feature is enabled, DMA access is mostly blocked through the zeroed second-level PML4.

Further Learning Resources

A cat protected from direct access.

Friday, March 20, 2020

Initializing Application Processors on Windows

This post guides you through the journey of starting up application processors (APs) on Windows. This post can be read just for fun, but it can also help you make more sense of the INIT-SIPI-SIPI VM-exit sequence you have to handle when writing a UEFI hypervisor.

AP Initialization and Overview of Its Implementation


Before running any software code, hardware selects the processor that gets initialized and starts executing firmware code. This processor is called a bootstrap processor (BSP) and is basically the sole active processor until an operating system starts up the rest of the processors. 

Those non-BSP processors are called APs and are initialized by the BSP sending a sequence of inter-processor interrupts (IPIs): INIT, Startup IPI, and the 2nd Startup IPI. This sequence is also referred to as INIT-SIPI-SIPI.

As noted in the previous post, a hypervisor that starts earlier than the operating system needs to handle VM-exits caused by those IPIs. But when does that happen exactly? 

On Linux, this is relatively easy to find out. Searching "STARTUP IPI" in the Linux source code or on developers' forums leads you to the implementation, smpboot.c. On Windows 10, in short, this is done in HalpApicStartProcessor, called from the kernel's KeStartAllProcessors. The stack trace is shown below: 

00 hal!HalpApicStartProcessor
01 hal!HalpInterruptStartProcessor
02 hal!HalStartNextProcessor
03 nt!KeStartAllProcessors
04 nt!Phase1InitializationDiscard
05 nt!Phase1Initialization
06 nt!PspSystemThreadStartup
07 nt!KiStartSystemThread

Let us look into a little more detail on Windows 19H1 (18362.1.amd64fre.19h1_release.190318-1202) without Hyper-V enabled. To be clear, the execution path varies drastically if Hyper-V is enabled.

High Level Flow


KeStartAllProcessors captures various system register values with KxInitializeProcessorState, updates per-processor bookkeeping data structures, and calls HalStartNextProcessor for each registered processor one by one to start all of them. 

HalpInterruptStartProcessor builds stub code and temporary data structures required for APs to go through real-mode, 32-bit protected-mode, and long-mode, such as page tables, GDT, and IDT. HalpLowStub (that is, PROCESSOR_START_BLOCK according to this talk by Alex Ionescu) is the address where those are built and the very entry point of the AP. We will review the entry point code and how it goes up to the NT kernel. 

HalpInterruptStartProcessor, after the stub is built, executes HalpApicStartProcessor, which is responsible for issuing the INIT-SIPI-SIPI sequence. Pseudo code of this function is shown below.

NTSTATUS HalpApicStartProcessor(
    UINT64,
    UINT32 LocalApicId,
    UINT64,
    UINT32 StartupIp
    )
{
    //
    // Assert INIT, then de-assert it. INIT-deassert IPI is done only for backward
    // compatibility.
    // See: 10.4.7.4 Local APIC State After It Receives an INIT-Deassert IPI
    //
    HalpApicWriteCommand(LocalApicId, 0xC500); // APIC_INT_LEVELTRIG | APIC_INT_ASSERT | APIC_DM_INIT
    KeStallExecutionProcessor(10u);
    HalpApicWriteCommand(LocalApicId, 0x8500); // APIC_INT_LEVELTRIG | APIC_DM_INIT
    KeStallExecutionProcessor(200u);

    //
    // Compute the SIPI message value and send it.
    // "the SIPI message contains a vector to the BIOS AP initialization code (at
    //  000VV000H, where VV is the vector contained in the SIPI message)."
    // See: 8.4.3 MP Initialization Protocol Algorithm for MP Systems
    //
    sipiMessage = (StartupIp & 0xFF000 | 0x600000u) >> 12;  // APIC_DM_STARTUP
    HalpApicWriteCommand(LocalApicId, sipiMessage);
    KeStallExecutionProcessor(200u);
    HalpApicWaitForCommand();
    KeStallExecutionProcessor(100u);

    //
    // Send the 2nd startup IPI.
    //
    HalpApicWriteCommand(LocalApicId, sipiMessage);
    KeStallExecutionProcessor(200u);
}

Note that those HalpApic functions are function pointers that are set for either APIC or x2APIC according to the system configuration.

Then let us review how APs get initialized by following the stub code.

AP Initialization Code

HalpRMStub - Real-Mode 


The entry point code is symbolized as HalpRMStub. As the name suggests, it runs in real-mode, right after the SIPI. As seen in the screenshot below, the stub code sets CR0.PE (0x1), enabling protected mode, and jumps out to somewhere.


As it is 16-bit code, the instructions shown by Windbg are slightly broken. Below is the correct output.

Also, let us switch to physical addresses since the code runs in the real-mode.

From code, the value of EDI is known to be 0x13000, because EDI is CS << 4, and CS is [19:12] of the IP, as stated in 8.4.3 (see the comment in the above pseudo code).

HalpPMStub - Protected-Mode 


Following EDI+0x60 navigates us to the protected mode stub implemented as HalpPMStub.

This code is responsible for switching to the long-mode. As seen below, it
  • sets CR4.PSE (0x1000),
  • updates IA32_EFER, then
  • sets CR0.PG (0x80000000), to activate the long-mode (see the second screenshot).


Then, it jumps out to where RDI+0x66 specifies. 

HalpLMIdentityStub - Long-Mode under Identity Mapping


The JMP leads to the short stub whose sole responsibility is to retrieve the value of CR3 that can permanently be used, that is, the same value as that of BSP.

As the processor should already be working with virtual addresses, let us switch to them.

RDI+0x70 gives us HalpLMStub.

HalpLMStub - Long-Mode


This is the final stub that APs go through. The first thing this stub does is apply the permanent CR3 value to have the same memory layout as the BSP (and any other already initialized APs), followed by invalidation of TLBs.

After switching the page tables, it performs various initialization, and at the end, it jumps out to where RDI+0x278 indicates.
This ends up at nt!KiSystemStartup, letting the AP run the same initialization code as the BSP (except for a few things done exclusively by the BSP).

Conclusion

We reviewed how Windows initiates execution of APs with the INIT-SIPI-SIPI sequence and how APs go through from real-mode to the regular NT kernel initialization function on Windows 10 19H1 without Hyper-V.
  
Hopefully, you enjoyed this post and gained more context on the INIT-SIPI-SIPI VM-exits you may see while writing a hypervisor too.

Friday, March 13, 2020

Introduction and Notes on Design Considerations of UEFI-based Hypervisors

In this post, I am going to write up some of the lessons learned and the challenges I had to go through to write a UEFI-based hypervisor that supports booting Windows. I hope this post gives pointers to study and helps you get started with writing a similar hypervisor.
UEFI hypervisor brief design walk-through

Background


Lately, I spent some time studying EDK2-based UEFI programming and developed a hypervisor as a UEFI driver. It has been fun and turned out to be more straightforward than I initially imagined, but at the same time, there were some learning curves and technical challenges I had to take extra time to understand and overcome.

The major reason for taking extra time was the lack of write-ups or tutorials for my goal. Although there were a few open-source projects and many documents and presentations I was able to study, those were not focused on UEFI programming in the context of writing hypervisors. This is entirely understandable as I do not suppose those are common subjects, and that is also why I wrote up this post.

In this post, I will start by giving a high-level overview of UEFI, and unique aspects in its execution environment, then look into challenges of writing a hypervisor as a UEFI driver.

UEFI Execution Environment


UEFI vs EDK2


UEFI is the specification of firmware to replace legacy BIOS, for which no standard exists, and offers a well-defined execution environment and programming interfaces. EDK2 is the open-source, reference implementation of the specification and provides tools to develop firmware modules.

Application vs Driver


Firmware modules can be built as part of a whole firmware image or as a standalone module (file) to be separately deployed. The latter is how I compiled the module. Additionally, UEFI modules can be written as an application which is unloaded from memory once its execution finishes, or as a driver which remains loaded unless explicitly unloaded. Obviously, the driver is the natural choice for the hypervisor, although I will mention the other common approach later.

Boot Time vs Run Time


The execution environment of drivers can be separated into two different phases: boot time and run time.

Roughly speaking, the boot time is before execution is handed over to the operating system, and the run time is after that. This transition happens when a UEFI-defined API called ExitBootServices is called. In the case of Windows startup, this is sometime before winload.efi transfers its execution to ntoskrnl.exe.

Most of the firmware drivers loaded in memory are unloaded at this point because most of them, for example, a network driver for PXE boot, are no longer needed once execution is handed over to the operating system. These are called boot drivers and are not suitable for a hypervisor that is meant to stay alive even after the operating system is fully started.

Runtime drivers, on the other hand, are the type of driver that resides in memory throughout the system's life span and are suited for the hypervisor.

Boot-time Services vs Run-time Services


UEFI defines a collection of APIs, and their availability is affected by the boot-to-run time transition. The APIs called boot-time services can no longer be used after the transition because the drivers that implement them are unloaded. After this transition, runtime drivers can only use the run-time services, which drastically reduces the ability of the hypervisor to interact with the environment.

Physical Mode vs Virtual Mode


Another transition that the runtime drivers have to go through is the change of the memory address layout.

At boot time, the system is in the long-mode, same as Windows. However, virtual-to-physical address mapping is purely 1:1, that is, the virtual address 0xdf2000 is translated into the physical address 0xdf2000. This mode is called physical mode.

Soon after the transition to run time, a bootloader (winload.efi in the case of Windows) sets up and configures new page tables to map runtime drivers to the addresses that work well with the operating system (eg, the physical address 0xdf2000 may be mapped to 0xfffff803`1ce40000). Then, the bootloader calls the SetVirtualAddressMap run-time service letting runtime drivers perform their preparation, switches to the new page table and discards the old page table. After this point, the runtime drivers are mapped to only the new address, just like regular Windows drivers. This mode is called virtual mode. This transition can be catastrophic if the hypervisor depends on the physical mode page tables. We will review how it can be a problem.
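For reference, the sketch below shows the standard EDK2 pattern a runtime driver uses to survive that switch: register for the virtual-address-change event and convert any pointer it intends to keep using afterwards (mSomePointer is just a placeholder). A hypervisor that builds its own page tables, as discussed below, can reduce its reliance on this mechanism.

#include <Uefi.h>
#include <Guid/EventGroup.h>
#include <Library/UefiBootServicesTableLib.h>
#include <Library/UefiRuntimeServicesTableLib.h>

STATIC VOID*     mSomePointer;               // A pointer the driver wants to keep using at run time
STATIC EFI_EVENT mVirtualAddressChangeEvent;

STATIC
VOID
EFIAPI
OnVirtualAddressChange (
  IN EFI_EVENT Event,
  IN VOID*     Context
  )
{
  //
  // Translate the physical-mode (1:1) address into the new virtual address
  // chosen by the OS loader.
  //
  gRT->ConvertPointer (0, &mSomePointer);
}

EFI_STATUS
RegisterForVirtualMode (
  VOID
  )
{
  return gBS->CreateEventEx (
                EVT_NOTIFY_SIGNAL,
                TPL_NOTIFY,
                OnVirtualAddressChange,
                NULL,
                &gEfiEventVirtualAddressChangeGuid,
                &mVirtualAddressChangeEvent
                );
}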

Application Processor Start-Up


Another unique event that the UEFI hypervisor has to handle is processor initialization. Processors that are not selected as a bootstrap processor (BSP; the processor initialized first) are called application processors (APs) and are initialized after transitioning to the virtual mode. This is done by the BSP signaling INIT and Startup-IPI (SIPI). When SIPI is signaled, APs start their execution in real-mode and go through mode transitions up to the long-mode (in the case of 64-bit operating systems). This requires some extra VM-exit handling that was not relevant for blue pill-style hypervisors.

Those unique aspects of the UEFI environment pose technical challenges and require different hypervisor design considerations.

Challenges, Solutions, and Considerations


Host CR3


As mentioned, the host CR3 becomes invalid if the value at the time of driver load is used because that would be physical mode page tables that get destroyed. The most straightforward solution for this is to set up our own page tables with the same translation as the existing one (ie, physical mode page tables) and use them for the host. This may sound complicated but is implemented with just 50 lines of C code in MiniVisor.
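The sketch below shows the gist of such page tables: a 1:1 mapping of the first 512GB built with 2MB large pages. It is not MiniVisor's actual code; allocation, alignment, and memory-type handling are omitted, and it relies on pointers being equal to physical addresses, which holds only because it runs at boot time in physical mode.

#include <Uefi.h>

#define PAGE_PRESENT  (1ull << 0)
#define PAGE_WRITE    (1ull << 1)
#define PAGE_LARGE    (1ull << 7)

typedef struct {
    UINT64 Pml4[512];
    UINT64 Pdpt[512];
    UINT64 Pd[512][512];
} HOST_PAGE_TABLES;   // Must be allocated 4KB-aligned

VOID BuildIdentityPageTables(HOST_PAGE_TABLES* Tables)
{
    // One PML4 entry covers 512GB; point it to the single PDPT.
    Tables->Pml4[0] = (UINT64)(UINTN)Tables->Pdpt | PAGE_PRESENT | PAGE_WRITE;

    for (UINT64 i = 0; i < 512; ++i)
    {
        // Each PDPT entry covers 1GB; point it to a PD.
        Tables->Pdpt[i] = (UINT64)(UINTN)Tables->Pd[i] | PAGE_PRESENT | PAGE_WRITE;

        for (UINT64 j = 0; j < 512; ++j)
        {
            // Each PD entry maps a 2MB page whose physical address equals the
            // virtual address, replicating the firmware's 1:1 translation.
            Tables->Pd[i][j] = ((i * 512 + j) * 0x200000ull) | PAGE_PRESENT | PAGE_WRITE | PAGE_LARGE;
        }
    }

    // The physical address of Tables->Pml4 is what goes into the host CR3.
}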

However, this results in having different address translations once the guest switches to the virtual mode and makes it significantly more difficult for the host to interact with the guest. For example, host code cannot be debugged with tools like Windbg anymore because none of the Windows code is mapped in a usable form while the host is running. If the hypervisor is going to need complex interaction with the guest virtual address space, other approaches might make it simpler in the end. In a private build, I implemented a guest shell-code that runs in the same address space as the NT system process for interaction with the guest.
Injecting the guest agent that hooks Windows kernel API
It also makes it harder to access the guest virtual memory from the host for the same reason without implementing the guest-virtual-to-host-virtual mapping mechanism. MiniVisor implements this in MemoryAccess.c. This is essentially what every single hypervisor implements. 

Host IDT


For the same reason the host CR3 is discarded, the host IDT becomes invalid if the value at the time of driver load is used. Although this does not cause an issue immediately because interrupts are disabled during execution of the host, any programming error causing an exception will cause a triple fault without running any diagnostics code. The solution is to create its own IDT for the host.

Having its own IDT, however, means NMI can no longer be delivered to the Windows kernel if that occurs during the execution of the host (reminder: NMI still occurs even if interrupts are disabled). MiniVisor discards NMI for simplicity but you should consider reinjecting it into the guest instead.

Host GDT


You may wonder about the GDT. Yes, the GDT also needs to be created, but it also requires modification because firmware does not set up the task state segment that is required for VMX.

Logging


The console output API is a boot-time service that cannot be used after the transition to run time. Hence, console-based logging must cease after that point. This could be addressed in several ways, such as hooking into an operating system logging API, but the simplest solution is to use serial output instead of console output. This has its limitations but requires almost zero extra code.
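A minimal serial logging routine can look like this. It assumes the legacy COM1 port at 0x3F8, already initialized by firmware, and uses only port IO, so it keeps working after ExitBootServices and the switch to virtual mode.

#include <Library/IoLib.h>

#define COM1_PORT  0x3F8

VOID SerialWriteString(CONST CHAR8* String)
{
    for (; *String != '\0'; ++String)
    {
        // Wait until the Transmitter Holding Register is empty (LSR bit 5).
        while ((IoRead8(COM1_PORT + 5) & 0x20) == 0)
        {
        }
        IoWrite8(COM1_PORT, (UINT8)*String);
    }
}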

Another sensible option is to have a ring buffer to store log entries and, later, let a client application pull and print them out.

Testing Application Processors Startup


This requires the hypervisor to handle VM-exits as well as proper emulation of paging mode transitions that are not relevant for blue pill-style hypervisors. Specifically, handling of INIT, SIPI, and CR0.PG access is required, as sketched below.
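Below is a simplified sketch of the INIT and SIPI handling on Intel VT-x. The VMCS field encodings and activity-state values are from the SDM, while the VmxRead/VmxWrite wrappers are assumed helpers; real INIT handling also has to reset much more of the guest state than shown here.

#include <stdint.h>

#define VMCS_GUEST_CS_SELECTOR    0x0802
#define VMCS_GUEST_CS_BASE        0x6808
#define VMCS_GUEST_RIP            0x681E
#define VMCS_GUEST_ACTIVITY_STATE 0x4826

#define ACTIVITY_STATE_ACTIVE        0
#define ACTIVITY_STATE_WAIT_FOR_SIPI 3

uint64_t VmxRead(uint64_t field);                    // VMREAD wrapper (assumed)
void     VmxWrite(uint64_t field, uint64_t value);   // VMWRITE wrapper (assumed)

void HandleInitSignal(void)
{
    // Park the AP until it receives a SIPI, mirroring real hardware behavior.
    VmxWrite(VMCS_GUEST_ACTIVITY_STATE, ACTIVITY_STATE_WAIT_FOR_SIPI);
}

void HandleStartupIpi(uint64_t exitQualification)
{
    // The SIPI vector VV is in the exit qualification; the AP starts in
    // real-mode at physical address 000VV000h, i.e. CS = VV00 and IP = 0.
    uint64_t vector = exitQualification & 0xFF;

    VmxWrite(VMCS_GUEST_CS_SELECTOR, vector << 8);
    VmxWrite(VMCS_GUEST_CS_BASE, vector << 12);
    VmxWrite(VMCS_GUEST_RIP, 0);
    VmxWrite(VMCS_GUEST_ACTIVITY_STATE, ACTIVITY_STATE_ACTIVE);
}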

For me, this was one of the most challenging parts of writing a hypervisor that supports booting an operating system, mostly due to the lack of available virtualization solutions as a test environment and the differences between them and the bare-metal environment (eg, TLB, MSR etc), requiring thorough testing on bare-metal.

My recommendation is to buy and set up a single-board computer with a serial port so you can at least do printf-debugging (or even better, Direct Connect Interface support). I might blog about selecting devices and setting them up.
Testing with a single-board computer

Driver vs Standalone File

Compiling the hypervisor as a runtime driver works, as demonstrated in the project. However, the more common approach is to build the hypervisor as a separate file that a UEFI application loads into memory and starts executing. That is how the VMware hypervisor, as well as Hyper-V, is implemented, for example. The standalone hypervisor format is often ELF because of wider cross-platform compiler and debugging tool support. 

This approach has the advantage that the hypervisor code remains platform-agnostic and re-usable; for example, one can write a small Windows driver as a hypervisor loader without mixing up platform-dependent loader code and hypervisor code that should be platform-independent. Then, the hypervisor module can remain portable.


MiniVisor did not take this approach simply because it started from experimentation and lacked structure. I plan to restructure the project in this way. 


Conclusion

We reviewed some uniqueness of the UEFI environment and how those impact design and implementation of hypervisors compared with those designed under the blue-pill model. We also looked at how MiniVisor was designed to work with those new factors and implied limitations.

While this short blog post may not be sufficient for some readers to have a clear idea of those challenges and the explained solutions, I hope this post gives you some pointers to study the codebase of MiniVisor and helps make sense of why things are written in different ways than in a blue pill-style Windows hypervisor.

Further Learning

As a final note, if you are particularly curious about tooling hypervisors for research and/or just having a solid understanding of the underlying technologies and concepts, Bruce Dang and I plan to offer a 5-day class this October. This will let you write your own hypervisor for both Windows and UEFI environments, develop "something useful", and play with them on physical and virtual machines to internalize technical details. 

Please sign up from this page or contact us if you are interested.
https://gracefulbits.regfox.com/hypervisor-development-for-security-analysis