Background
Lately, I spent some time
studying EDK2-based UEFI programming and developed a hypervisor as a UEFI driver.
It has been fun and turned out to be more straightforward than I initially
imagined, but at the same time, there was a learning curve and there were technical
challenges I had to take extra time to understand and overcome.
The main reason for the
extra time was the lack of write-ups or tutorials for my goal. Although there were
a few open-source projects and many documents and presentations I was able to
study, those were not focused on UEFI programming in the context of writing
hypervisors. This is entirely understandable as I do not suppose those are
common subjects, and that is also why I wrote this post.
In this post, I will start by
giving a high-level overview of UEFI and the unique aspects of its execution
environment, then look into the challenges of writing a hypervisor as a
UEFI driver.
UEFI Execution Environment
UEFI vs EDK2
UEFI is the specification of
firmware that replaces legacy BIOS, for which no standard exists, and offers a
well-defined execution environment and programming interfaces. EDK2 is the
open-source reference implementation of the specification and provides tools
to develop firmware modules.
Application vs Driver
Firmware modules can be built
as part of a whole firmware image or as a standalone module (file) to be
separately deployed. The latter is how I compiled the module. Additionally,
UEFI modules can be written as an application which is unloaded from memory
once its execution finishes, or as a driver which remains loaded unless explicitly
unloaded. Obviously, the driver is the natural choice for the hypervisor, although I will mention the other common approach later.
Boot Time vs Run Time
The execution environment of
drivers can be separated into two different phases: boot time and run time.
Roughly speaking, boot time
is before execution is handed over to the operating system and run time is
after that. This transition happens when a UEFI-defined API called
ExitBootServices is called. In the case of Windows startup, this is sometime
before winload.efi transfers its execution to ntoskrnl.exe.
Most of the firmware drivers
loaded in memory are unloaded at this point because most of them, for example,
a network driver for PXE boot, are no longer needed once execution is handed
over to the operating system. Drivers of this type are called boot drivers and
are not suitable for a hypervisor that is meant to stay alive even after the
operating system is fully started.
Runtime drivers, on the other
hand, are the type of driver that resides in memory throughout the system life
span and are suited for the hypervisor.
Boot-time Services vs Run-time Services
UEFI defines a collection of
APIs, and their availability is impacted by the boot-to-run time transition.
The type of API called boot-time services can no longer be used after the
transition because drivers that implement the API are unloaded. After this transition,
runtime drivers can only use the run-time services, which drastically reduces
the ability of the hypervisor to interact with the environment.
Physical Mode vs Virtual Mode
Another transition that the runtime
drivers have to go through is the change of the memory address layout.
At boot time, the system is
in long mode, the same as Windows. However, virtual to physical address mapping
is pure 1:1, that is, the virtual address 0xdf2000 is translated into the
physical address 0xdf2000. This mode is called physical mode.
Soon after the transition to run
time, a bootloader (winload.efi in the case of Windows) sets up and configures
new page tables to map runtime drivers to the addresses that work well with the
operating system (e.g., the physical address 0xdf2000 may be mapped to
0xfffff803`1ce40000). Then, the bootloader calls the SetVirtualAddressMap run-time
service letting runtime drivers perform their preparation, switches to the new
page table and discards the old page table. After this point, the runtime
drivers are mapped to only the new address, just like regular Windows drivers.
This mode is called virtual mode. This transition can be catastrophic if the
hypervisor depends on the physical mode page tables. We will review how it can
be a problem.
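To make the address change concrete, here is a minimal sketch of the kind of translation a runtime driver needs for any pointer it keeps across the switch. In a real driver, this is typically done by calling the ConvertPointer run-time service from a handler for the virtual address change event. The `runtime_region` type and `convert_pointer` function below are hypothetical, simplified stand-ins written for illustration. They are not EDK2 definitions and not MiniVisor code.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096ULL

/* Hypothetical, simplified stand-in for one memory map entry that the
   bootloader passes along when it switches to virtual mode. */
typedef struct {
    uint64_t physical_start;  /* where the region lives in physical mode */
    uint64_t virtual_start;   /* where the bootloader decided to map it */
    uint64_t number_of_pages; /* region length in 4 KiB pages */
} runtime_region;

/* Returns the virtual mode address that corresponds to a physical mode
   address, or 0 if the address is not inside any listed region. */
static uint64_t convert_pointer(const runtime_region *map, size_t count,
                                uint64_t address)
{
    for (size_t i = 0; i < count; i++) {
        uint64_t size = map[i].number_of_pages * PAGE_SIZE;
        if (address >= map[i].physical_start &&
            address - map[i].physical_start < size) {
            return map[i].virtual_start + (address - map[i].physical_start);
        }
    }
    return 0;
}
```

With a single region that maps 0xdf2000 to 0xfffff803`1ce40000, a pointer to 0xdf2010 would become 0xfffff803`1ce40010. Any pointer that is not converted this way keeps pointing at an address that no longer means anything after the switch.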
Application Processor Start-Up
Another unique event that the
UEFI hypervisor has to handle is processor initialization. Processors that are
not selected as the bootstrap processor (BSP; the processor initialized first)
are called application processors (APs) and are initialized after transitioning
to the virtual mode. This is done by the BSP signaling INIT and Startup-IPI (SIPI).
When SIPI is signaled, APs start their execution in real mode and go through
mode transitions up to long mode (in the case of 64-bit operating
systems). This requires some extra VM-exit handling that was not relevant for
blue pill-style hypervisors.
Those unique aspects of the
UEFI environment pose technical challenges and require different hypervisor
design considerations.
Challenges, Solutions, and Considerations
Host CR3
As mentioned, the host CR3
becomes invalid if the value at the time of driver load is used because that
would be physical mode page tables that get destroyed. The most straightforward
solution for this is to set up our own page tables with the same translation as
the existing one (i.e., physical mode page tables) and use them for the host.
This may sound complicated but is implemented with just 50 lines of C code in
MiniVisor.
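As an illustration of how little is needed, below is a minimal sketch of identity-mapped page tables together with a software walk that checks the result. It is not MiniVisor's code. It assumes the processor supports 1 GiB pages and that mapping the first 512 GiB is enough, and all names (`g_pml4`, `g_pdpt`, `build_identity_map`, `translate`) are made up for this example. The table address is stored as if it were a physical address, which holds in physical mode because of the 1:1 mapping.

```c
#include <stdalign.h>
#include <stdint.h>

#define PAGE_PRESENT  (1ULL << 0)
#define PAGE_WRITE    (1ULL << 1)
#define PAGE_LARGE    (1ULL << 7)   /* in a PDPT entry: maps a 1 GiB page */
#define ENTRY_COUNT   512
#define SIZE_1GB      (1ULL << 30)
#define TABLE_MASK    0x000ffffffffff000ULL  /* bits 51:12 */
#define PAGE_1GB_MASK 0x000fffffc0000000ULL  /* bits 51:30 */

static alignas(4096) uint64_t g_pml4[ENTRY_COUNT];
static alignas(4096) uint64_t g_pdpt[ENTRY_COUNT];

/* Fills one PDPT with 512 entries of 1 GiB each so that virtual address X
   translates to physical address X for the first 512 GiB. */
static void build_identity_map(void)
{
    for (uint64_t i = 0; i < ENTRY_COUNT; i++) {
        g_pdpt[i] = (i * SIZE_1GB) | PAGE_PRESENT | PAGE_WRITE | PAGE_LARGE;
    }
    /* Only the first PML4 entry is used; the rest stay not-present. */
    g_pml4[0] = (uint64_t)(uintptr_t)g_pdpt | PAGE_PRESENT | PAGE_WRITE;
}

/* Walks the tables in software the way the processor would.
   Returns UINT64_MAX when the address is not mapped. */
static uint64_t translate(uint64_t va)
{
    uint64_t pml4e = g_pml4[(va >> 39) & 0x1ff];
    if (!(pml4e & PAGE_PRESENT)) {
        return UINT64_MAX;
    }
    const uint64_t *pdpt = (const uint64_t *)(uintptr_t)(pml4e & TABLE_MASK);
    uint64_t pdpte = pdpt[(va >> 30) & 0x1ff];
    if (!(pdpte & PAGE_PRESENT)) {
        return UINT64_MAX;
    }
    return (pdpte & PAGE_1GB_MASK) | (va & (SIZE_1GB - 1));
}
```

After `build_identity_map()`, `translate(0xdf2000)` gives back 0xdf2000, and the address of `g_pml4` is what would go into the host CR3. Because these tables belong to the hypervisor, they survive when the bootloader discards the physical mode page tables.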
However, this results in having
different address translations once the guest switches to the virtual mode and
makes it significantly more difficult for the host to interact with the guest. For
example, host code cannot be debugged with tools like WinDbg anymore because
none of the Windows code is mapped in a usable form while the host is running. If the hypervisor is going to need complex interaction with guest virtual addresses, other approaches might make it simpler in the end. In a private build, I implemented guest shellcode that runs in the same address space as the NT system process for interaction with the guest.
[Figure: Injecting the guest agent that hooks Windows kernel API]
It also makes it harder to access
the guest virtual memory from the host for the same reason without implementing
the guest-virtual-to-host-virtual mapping mechanism. MiniVisor implements this
in MemoryAccess.c. This is essentially what every single hypervisor implements.
Host IDT
For the same reason the host
CR3 becomes invalid, the host IDT becomes invalid if the value at the time of
driver load is used. Although this does not cause an issue immediately because
interrupts are disabled during execution of the host, any programming error
that causes an exception will cause a triple fault without running any diagnostics code.
The solution is to create a dedicated IDT for the host.
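For readers who have not built an IDT by hand, the following is a minimal sketch of filling one for long mode. It follows the 16-byte interrupt gate layout the processor defines, but the names (`idt_gate`, `make_interrupt_gate`, `fill_host_idt`) and the idea of pointing every vector at one default handler are assumptions made for this example, not how MiniVisor is written. A real host would normally point each vector at its own assembly stub so that diagnostics code knows which exception occurred.

```c
#include <stdint.h>

#define IDT_ENTRY_COUNT     256
#define GATE_PRESENT_INT64  0x8E  /* present, DPL 0, 64-bit interrupt gate */

/* One 16-byte interrupt gate descriptor in long mode. */
typedef struct {
    uint16_t offset_low;       /* handler address bits 15:0 */
    uint16_t selector;         /* code segment selector for the handler */
    uint8_t  ist;              /* interrupt stack table index, 0 = unused */
    uint8_t  type_attributes;  /* present bit, DPL and gate type */
    uint16_t offset_middle;    /* handler address bits 31:16 */
    uint32_t offset_high;      /* handler address bits 63:32 */
    uint32_t reserved;
} idt_gate;

_Static_assert(sizeof(idt_gate) == 16, "a long mode IDT gate is 16 bytes");

/* Splits the handler address across the three offset fields. */
static idt_gate make_interrupt_gate(uint64_t handler, uint16_t code_selector)
{
    idt_gate gate = { 0 };
    gate.offset_low = (uint16_t)(handler & 0xffff);
    gate.selector = code_selector;
    gate.type_attributes = GATE_PRESENT_INT64;
    gate.offset_middle = (uint16_t)((handler >> 16) & 0xffff);
    gate.offset_high = (uint32_t)(handler >> 32);
    return gate;
}

/* Reassembles the handler address, useful for sanity checks. */
static uint64_t gate_handler(const idt_gate *gate)
{
    return ((uint64_t)gate->offset_high << 32) |
           ((uint64_t)gate->offset_middle << 16) |
           gate->offset_low;
}

/* Points every vector at the same default handler (an assumption made to
   keep the sketch short). */
static void fill_host_idt(idt_gate *idt, uint64_t default_handler,
                          uint16_t code_selector)
{
    for (int vector = 0; vector < IDT_ENTRY_COUNT; vector++) {
        idt[vector] = make_interrupt_gate(default_handler, code_selector);
    }
}
```

The address of the filled array is then what the host IDTR base would be set to, instead of the value captured from the firmware at driver load.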
Having its own IDT, however,
means NMI can no longer be delivered to the Windows kernel if that occurs
during the execution of the host (reminder: NMI still occurs even if interrupts
are disabled). MiniVisor discards NMI for simplicity but you should consider reinjecting
it into the guest instead.
Host GDT
You may wonder about the GDT.
Yes, the host needs its own GDT too, and it also requires modification because
the firmware does not set up the task state segment that is required for VMX.
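One way to do this is to copy the firmware's GDT and append a TSS descriptor at the end. The sketch below shows that idea under a few assumptions: the names (`tss_descriptor`, `make_tss_descriptor`, `clone_gdt_with_tss`) are invented for the example, the GDT is treated as an array of 8-byte entries, and the TSS itself is assumed to be allocated elsewhere. It is not taken from MiniVisor.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TSS_PRESENT_AVAILABLE 0x89  /* present, DPL 0, available 64-bit TSS */

/* A 64-bit TSS descriptor takes 16 bytes, that is, two 8-byte GDT slots. */
typedef struct {
    uint16_t limit_low;         /* limit bits 15:0 */
    uint16_t base_low;          /* base bits 15:0 */
    uint8_t  base_middle;       /* base bits 23:16 */
    uint8_t  type_attributes;   /* type, DPL and present bit */
    uint8_t  limit_high_flags;  /* limit bits 19:16 in the low nibble */
    uint8_t  base_high;         /* base bits 31:24 */
    uint32_t base_upper;        /* base bits 63:32 */
    uint32_t reserved;
} tss_descriptor;

static tss_descriptor make_tss_descriptor(uint64_t base, uint32_t limit)
{
    tss_descriptor d = { 0 };
    d.limit_low = (uint16_t)(limit & 0xffff);
    d.base_low = (uint16_t)(base & 0xffff);
    d.base_middle = (uint8_t)((base >> 16) & 0xff);
    d.type_attributes = TSS_PRESENT_AVAILABLE;
    d.limit_high_flags = (uint8_t)((limit >> 16) & 0x0f);
    d.base_high = (uint8_t)((base >> 24) & 0xff);
    d.base_upper = (uint32_t)(base >> 32);
    return d;
}

static uint64_t tss_descriptor_base(const tss_descriptor *d)
{
    return ((uint64_t)d->base_upper << 32) | ((uint64_t)d->base_high << 24) |
           ((uint64_t)d->base_middle << 16) | d->base_low;
}

/* Copies the existing GDT into new_gdt and appends a TSS descriptor.
   Returns the selector to use for TR, or 0 on failure. */
static uint16_t clone_gdt_with_tss(const uint64_t *old_gdt, size_t old_count,
                                   uint64_t *new_gdt, size_t new_capacity,
                                   uint64_t tss_base, uint32_t tss_limit)
{
    if (old_count == 0 || new_capacity < old_count + 2) {
        return 0;
    }
    memcpy(new_gdt, old_gdt, old_count * sizeof(uint64_t));
    tss_descriptor d = make_tss_descriptor(tss_base, tss_limit);
    memcpy(&new_gdt[old_count], &d, sizeof(d));
    /* Selector = index * 8, with TI = 0 and RPL = 0. */
    return (uint16_t)(old_count * sizeof(uint64_t));
}
```

For a firmware GDT with three entries, the new TSS descriptor lands at index 3 and the returned selector is 0x18. A 64-bit TSS is 104 bytes, so a limit of 0x67 is the usual minimum.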
Logging
The console output API is a boot-time
service that cannot be used after the transition to run time. Hence,
console-based logging must cease at that point. This could be addressed
in several ways, such as hooking into an operating system logging API, but the
simplest solution is to use serial output instead of console output. This has
its limitations but requires almost zero extra code.
Another sensible option is to
have a ring buffer to store log entries, and later, let a client application
pull and print them out.
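The ring buffer option can be sketched in a few lines. This is only an illustration with made-up names (`log_ring`, `log_ring_write`, `log_ring_read`) and sizes. It assumes a single writer and a single reader, so a real hypervisor running on multiple processors would need a lock or per-processor buffers on top of it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LOG_ENTRY_COUNT 8    /* number of slots; oldest entries get dropped */
#define LOG_ENTRY_SIZE  128  /* maximum length of one entry, including NUL */

typedef struct {
    char entries[LOG_ENTRY_COUNT][LOG_ENTRY_SIZE];
    uint64_t write_index;  /* total number of entries ever written */
    uint64_t read_index;   /* index of the next entry to hand out */
} log_ring;

/* Stores one message, truncating it if it does not fit in a slot. */
static void log_ring_write(log_ring *ring, const char *message)
{
    char *slot = ring->entries[ring->write_index % LOG_ENTRY_COUNT];
    strncpy(slot, message, LOG_ENTRY_SIZE - 1);
    slot[LOG_ENTRY_SIZE - 1] = '\0';
    ring->write_index++;
    /* If the reader fell behind, skip the entries that were overwritten. */
    if (ring->write_index - ring->read_index > LOG_ENTRY_COUNT) {
        ring->read_index = ring->write_index - LOG_ENTRY_COUNT;
    }
}

/* Copies the oldest unread entry into out. Returns 1 if an entry was
   copied and 0 if there was nothing to read. */
static int log_ring_read(log_ring *ring, char *out, size_t out_size)
{
    if (ring->read_index == ring->write_index || out_size == 0) {
        return 0;
    }
    const char *slot = ring->entries[ring->read_index % LOG_ENTRY_COUNT];
    strncpy(out, slot, out_size - 1);
    out[out_size - 1] = '\0';
    ring->read_index++;
    return 1;
}
```

The host would call `log_ring_write` wherever it used to print to the console, and the client application would repeatedly ask the hypervisor for the result of `log_ring_read` (for example, through a hypercall) until it returns 0.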
Testing Application Processors Startup
Application processor start-up requires the hypervisor to
handle VM-exits and properly emulate paging mode transitions that are
not relevant for blue pill-style hypervisors. Specifically, handling of
INIT, SIPI and CR0.PG access is required.
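Two pieces of this emulation are simple enough to sketch. On SIPI, the AP starts in real mode at the address derived from the SIPI vector, which the VM-exit reports in the low 8 bits of the exit qualification. On a write to CR0.PG, long mode becomes active only when EFER.LME is also set. The code below is a sketch with hypothetical names (`ap_start_state`, `ap_state_from_sipi`, `efer_after_cr0_write`). It only computes values and leaves out writing them into the VMCS and updating the related VM-entry control, which a real handler would also need to do.

```c
#include <stdint.h>

#define CR0_PG    (1ULL << 31)  /* paging enable */
#define EFER_LME  (1ULL << 8)   /* long mode enable */
#define EFER_LMA  (1ULL << 10)  /* long mode active */

/* Where the guest AP should begin executing after SIPI. */
typedef struct {
    uint16_t cs_selector;
    uint64_t cs_base;
    uint64_t rip;
} ap_start_state;

/* For SIPI vector VV, the AP starts at VV00:0000 in real mode, that is,
   at physical address VV000h. */
static ap_start_state ap_state_from_sipi(uint64_t exit_qualification)
{
    uint8_t vector = (uint8_t)(exit_qualification & 0xff);
    ap_start_state state;
    state.cs_selector = (uint16_t)(vector << 8);
    state.cs_base = (uint64_t)vector << 12;
    state.rip = 0;
    return state;
}

/* Returns the EFER value the guest should observe after it writes new_cr0.
   LMA is set only while both LME and CR0.PG are set. */
static uint64_t efer_after_cr0_write(uint64_t efer, uint64_t new_cr0)
{
    if ((efer & EFER_LME) && (new_cr0 & CR0_PG)) {
        return efer | EFER_LMA;
    }
    return efer & ~EFER_LMA;
}
```

For example, a SIPI with vector 0x9a makes the AP start at CS selector 0x9a00 with base 0x9a000 and RIP 0.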
For me, this was one of the most
challenging parts of writing a hypervisor that supports booting an operating
system, mostly due to the lack of available virtualization solutions as a test
environment and the differences between them and the bare-metal environment (e.g.,
TLB, MSR etc.), requiring thorough testing on bare metal.
My recommendation is to buy and
set up a single-board computer with a serial port so you can at least do
printf-debugging (or even better, Direct Connect Interface support). I might blog
about selecting devices and setting them up.
[Figure: Testing with a single-board computer]
Driver vs Standalone File
Compiling the hypervisor as a runtime driver works, as demonstrated in the project. However, the more common approach is to build the hypervisor as a separate file and have a UEFI application load it into memory and start executing it. That is how the VMware hypervisor and Hyper-V are implemented, for example. The standalone hypervisor format is often ELF because of wider cross-platform compiler and debugging tool support.
This approach has the advantage that the hypervisor code remains platform agnostic and reusable; for example, one can write a small Windows driver as a hypervisor loader without mixing up platform-dependent loader code and hypervisor code that should be platform independent. Then, the hypervisor module can remain portable.
MiniVisor did not take this approach simply because it grew out of experimentation without much structure. I plan to restructure the project in this way.
Conclusion
We reviewed some unique aspects of
the UEFI environment and how they impact the design and implementation of
hypervisors compared with those designed under the blue-pill model. We also
looked at how MiniVisor was designed to work with those new factors and the
resulting limitations.
While this short blog post may
not be sufficient for some readers to have clear ideas of those challenges and the
solutions explained, I hope this post gives you some pointers to study the codebase of
MiniVisor and helps make sense of why things are written in different ways than
in blue pill-style Windows hypervisors.
Further Learning
As a final note, if you are
particularly curious about tooling hypervisors for research and/or just having a
solid understanding of the underlying technologies and concepts, Bruce Dang and
I plan to offer a 5-day class this October. This will let you write your
hypervisor for both Windows and UEFI environments, develop "something
useful" and
play with them on physical and virtual machines to internalize technical details.
Please sign up from this page or contact us if you are interested.
https://gracefulbits.regfox.com/hypervisor-development-for-security-analysis
Thank you, Satoshi, for sharing your great experience!
Moving hypervisor to the UEFI opens new horizons for the Platform Security and reverse-engineering!
I am waiting for the blog about single-board computer and their configuration.
Do you think it is possible to test/check the UEFI-based hypervisor inside virtual environment, such as Vbox and VmWare?
Yes, you can do that with VMware. See this for more details.
https://github.com/tandasat/MiniVisorPkg/blob/master/Docs/Building_and_Debugging.md#uefi-testing-with-other-virtualization-platform
Hi Satoshi,
Thanks for sharing your experience. I wanted to check whether the training in October is still going to be in Maryland, or do you plan to conduct it remotely? I wish to attend that but won't be able to travel due to the current situation. If you plan to conduct remotely please do let me know.
Hi Japs,
Sorry for inactivity. As you might have guessed, we no longer plan to have the class in October or physically. I recently started to look into re-organizing contents to include type-1 and adjustment for remote training, but have no concrete dates yet.
Once I feel comfortable with offering remote training and set dates, I will make an announcement on Twitter @standa_t and here. This would be early next year at the earliest.
Thanks! I will look forward to the announcement :)