Monday, November 16, 2020

S3 Sleep, Resume and Handling Them with Type-1 Hypervisor

This post explains how a modern x86-64 system enters and resumes from S3 (Sleep), by reviewing the specifications and using the implementation of Windows as an example. This post also outlines the challenges S3 poses for type-1 hypervisors and how to work around them.
TeaTea in the S3 state

Why S3 is Interesting

On normal system startup, UEFI-based system firmware goes through four execution phases before starting the OS. Those phases include Driver eXecution Environment (DXE), Boot Device Selection (BDS), and Transient System Load (TSL), where system configurations are set and 3rd party firmware modules may be executed. On the S3 resume boot path, on the other hand, those phases are skipped for faster start-up.

This has significant security implications because the S3 resume boot path needs to reapply the same security configurations that are made during the normal boot path, using entirely different code. Failing to do so securely leads to vulnerabilities, for example, unauthorized modification of a system firmware image if a firmware write-protection bit is not reapplied during resume.

Also, for a type-1 hypervisor that is loaded during the TSL phase, the lack of that phase means it does not get loaded on resume. Since the processors were shut down on S3, processor-based virtualization features such as Intel VT-x stop working after resume even though the hypervisor module remains mapped in memory. This needs to be handled.

High-Level Flow

Before diving into details, let us review the high-level flow of S3 sleep and resume. The following are the highlights.
  1. Setting certain bits in the registers called Power Management (PM) 1 Control registers, or PM1a_CNT_BLK / PM1b_CNT_BLK puts the system into the S3 state.
  2. During the next system start-up, system firmware detects that shutdown was because of S3 and executes the S3 resume boot path, instead of the normal boot path.
  3. System firmware executes a collection of commands, called boot scripts, and then the code pointed to by the Firmware Waking Vector in the ACPI table. The latter is called an OS waking vector and is set up by the OS prior to entering S3.
  4. The waking vector resumes execution of the OS.

Entering S3

The platform enters S3 when software writes 1 to the SLP_EN bit and 5 (0b101) to the SLP_TYP bits in the PM1 control registers. The ACPI specification states that setting SLP_EN triggers the state transition.
Table 4.13: PM1 Control Registers Fixed Hardware Feature Control, from the ACPI spec
The explanation of the SLP_TYP bits in the table is not crystal clear, but it becomes more obvious with the specification of the Intel platform. Below is an excerpt from the table under 4.2.2 Power Management 1 Control (PM1_CNT) for one of the hardware models that implement ACPI.
 From Intel 495 Series Chipset Family On-Package Platform Controller Hub volume 2
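
To make the bit layout concrete, here is a minimal C sketch that builds the value written to the PM1 control register. The bit positions (SLP_TYPx at bits 10 to 12, SLP_EN at bit 13) follow the ACPI table above. The function and macro names are hypothetical, and the SLP_TYP value of 5 is the one this post uses for the Intel platform. The port write itself is omitted because it requires kernel-mode access.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions in the PM1 control register, per the ACPI table above. */
#define PM1_CNT_SLP_TYP_SHIFT 10u
#define PM1_CNT_SLP_TYP_MASK  (0x7u << PM1_CNT_SLP_TYP_SHIFT)
#define PM1_CNT_SLP_EN        (1u << 13)

/* SLP_TYP value for S3 on the Intel platform discussed in this post. */
#define SLP_TYP_S3_INTEL 5u

/*
 * Hypothetical helper: returns the value to write to PM1a_CNT to request
 * a sleep state, preserving the unrelated bits of the current value.
 */
uint16_t pm1_cnt_make_sleep_value(uint16_t current, uint8_t slp_typ)
{
    uint32_t value = current;

    assert(slp_typ <= 7u);                             /* SLP_TYP is 3 bits wide */
    value &= ~(PM1_CNT_SLP_TYP_MASK | PM1_CNT_SLP_EN); /* clear the old request */
    value |= (uint32_t)slp_typ << PM1_CNT_SLP_TYP_SHIFT;
    value |= PM1_CNT_SLP_EN;                           /* triggers the transition */
    return (uint16_t)value;
}
```

For example, pm1_cnt_make_sleep_value(0x0001, SLP_TYP_S3_INTEL) keeps bit 0 as it was and sets bits 10, 12 and 13.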

Then, where are those registers? The ACPI specification does not define their location but does define how to locate them. Under 4.8.3 PM1 Control Registers, it states that
Each register block has a unique 32-bit pointer in the Fixed ACPI Table (FADT) to allow the PM1 event bits to be partitioned between two chips.
Below are excerpts of the FADT format, which contains multiple fields indicating where the registers are.
...
...
Depending on the implementation of ACPI, some fields may be unused. On my system, the SLEEP_CONTROL_REG field in the table indicates that the register is located at IO port 0x1804.

RWEverything parsing the FACP table on Windows

So far, we learned that: 
  • the system enters S3 state when software sets SLP_EN and SLP_TYP bits in the PM1 control register.
  • the PM1 control register can be located through the FADT ACPI table. 
Note that the ACPI table itself can be easily located with platform specific ways, such as /sys/firmware/acpi on Linux, GetSystemFirmwareTable() on Windows, or EfiLocateFirstAcpiTable() on UEFI.
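
As a rough illustration of the second point, the following C sketch reads the PM1a_CNT_BLK and PM1b_CNT_BLK fields out of a raw copy of the FADT. The byte offsets (64 and 68) are taken from the FADT layout in the ACPI specification. The function name is hypothetical, and the sketch ignores the extended X_PM1a_CNT_BLK fields and assumes the buffer was already obtained by one of the platform specific ways above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte offsets of the legacy 32-bit fields within the FADT. */
#define FADT_OFFSET_PM1A_CNT_BLK 64u
#define FADT_OFFSET_PM1B_CNT_BLK 68u
#define FADT_MIN_SIZE            72u

/* ACPI tables are little-endian; read byte by byte to avoid alignment issues. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Hypothetical helper: extracts the PM1 control block addresses.
 * Returns 0 on success, -1 if the buffer does not look like a FADT.
 */
int fadt_get_pm1_cnt_blocks(const uint8_t *table, size_t size,
                            uint32_t *pm1a, uint32_t *pm1b)
{
    assert(table != NULL && pm1a != NULL && pm1b != NULL);

    /* The FADT carries the signature "FACP" for historical reasons. */
    if (size < FADT_MIN_SIZE || memcmp(table, "FACP", 4) != 0) {
        return -1;
    }
    *pm1a = read_le32(table + FADT_OFFSET_PM1A_CNT_BLK);
    *pm1b = read_le32(table + FADT_OFFSET_PM1B_CNT_BLK); /* 0 if unused */
    return 0;
}
```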

Resuming from S3

On system start-up, system firmware executes the same initialization path as the normal boot path, and then diverges when it detects that the previous shutdown was an entry into S3. This resume-specific path is called the S3 resume boot path and is well explained in the UEFI Platform Initialization (PI) specification.

In a nutshell, the S3 resume boot path executes the boot scripts to re-initialize the platform, instead of executing the last three boot phases: DXE, BDS and TSL. The boot scripts are saved in non-volatile storage and replicate the platform configuration made during normal boot. The illustration below from the spec highlights the differences between the normal and S3 resume boot paths, as well as how boot scripts are saved and consumed.
Normal and S3 resume boot paths, from the PI spec
As illustrated, after the boot scripts are executed on the S3 resume boot path, an OS waking vector is executed to resume execution of the OS. The OS waking vector is the very first OS-specific code (the code that is developed by the OS vendor, and not part of system firmware). This is typically 16-bit real-mode code that switches the processor to long mode, resets registers to the values they had before the system entered S3, and lets the OS execute further restoration code to fully resume the system. The OS sets up this OS waking vector right before entering S3.
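
The record-and-replay idea behind boot scripts can be sketched in a few lines of C. This is a toy model, not the EDK2 data format: the types, function names, and the array standing in for hardware registers are all hypothetical. It only shows that writes made during normal boot are recorded, the script is then locked, and the recorded writes are replayed on resume.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BOOT_SCRIPT_MAX_ENTRIES 16u
#define FAKE_REGISTER_COUNT     8u   /* stand-in for platform registers */

typedef struct {
    uint8_t  reg_index;
    uint32_t value;
} boot_script_entry_t;

typedef struct {
    boot_script_entry_t entries[BOOT_SCRIPT_MAX_ENTRIES];
    size_t              count;
    int                 locked;  /* set once recording is closed */
} boot_script_t;

/* Normal boot: apply a register write and record it for later replay. */
int boot_script_write_and_save(boot_script_t *script, uint32_t *regs,
                               uint8_t reg_index, uint32_t value)
{
    assert(script != NULL && regs != NULL);
    if (script->locked || script->count >= BOOT_SCRIPT_MAX_ENTRIES ||
        reg_index >= FAKE_REGISTER_COUNT) {
        return -1;  /* refused: nothing is written or recorded */
    }
    regs[reg_index] = value;
    script->entries[script->count].reg_index = reg_index;
    script->entries[script->count].value = value;
    script->count++;
    return 0;
}

/* Closes the script so that later code cannot add entries. */
void boot_script_lock(boot_script_t *script)
{
    script->locked = 1;
}

/* S3 resume: replay the recorded writes in their original order. */
void boot_script_execute(const boot_script_t *script, uint32_t *regs)
{
    for (size_t i = 0; i < script->count; i++) {
        regs[script->entries[i].reg_index] = script->entries[i].value;
    }
}
```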

How does the OS set up the OS waking vector, and how does system firmware find its location? Again, ACPI defines the way.

The Firmware ACPI Control Structure (FACS) table defines a field called Firmware Waking Vector. The OS writes the address of the OS waking vector to this field, and system firmware reads it to locate and execute the OS waking vector.
Firmware Waking Vector in FACS, from the ACPI spec

To summarize the flow in chronological order:
  1. The OS writes the address of the OS waking vector (i.e., bootstrap code) to the Firmware Waking Vector field of the FACS table before entering S3.
  2. System firmware reads the field to know the address of the OS waking vector and transfers execution to the address during the S3 resume boot path.
  3. The OS waking vector eventually resumes system states using configurations kept in memory.
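
The first two steps can be sketched in C as follows. The structure mirrors the FACS layout in the ACPI specification, where the Firmware Waking Vector sits at byte offset 12. The function names are hypothetical, and the sketch deals only with the 32-bit field, leaving the 64-bit X_Firmware_Waking_Vector out of scope.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#pragma pack(push, 1)
typedef struct {
    char     signature[4];             /* "FACS" */
    uint32_t length;
    uint32_t hardware_signature;
    uint32_t firmware_waking_vector;   /* written by the OS, read by firmware */
    uint32_t global_lock;
    uint32_t flags;
    uint64_t x_firmware_waking_vector; /* not handled in this sketch */
    uint8_t  version;
    uint8_t  reserved0[3];
    uint32_t ospm_flags;
    uint8_t  reserved1[24];
} facs_t;
#pragma pack(pop)

_Static_assert(offsetof(facs_t, firmware_waking_vector) == 12, "FACS layout");
_Static_assert(sizeof(facs_t) == 64, "FACS size");

/* Step 1 (OS side): publish the physical address of the waking vector. */
void os_set_waking_vector(facs_t *facs, uint32_t physical_address)
{
    assert(facs != NULL);
    facs->firmware_waking_vector = physical_address;
}

/*
 * Step 2 (firmware side): fetch the address to jump to on S3 resume.
 * Returns 0 if the table is not a FACS or no vector was set.
 */
uint32_t firmware_get_waking_vector(const facs_t *facs)
{
    assert(facs != NULL);
    if (memcmp(facs->signature, "FACS", 4) != 0) {
        return 0;
    }
    return facs->firmware_waking_vector;
}
```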

Implementation on Windows and EDK2

Let us look into how what we reviewed above is implemented on Windows (build 18362) and EDK2.
EDK2 is a reference implementation of UEFI, the system firmware specification, and very commonly used as a base of commercial system firmware. 

Entering S3

On Windows, HaliAcpiSleep() is the main function that implements S3 handling and is called on all processors when a user requests entering S3. It roughly does the following, in order.
  1. The bootstrap processor (BSP) sets up the OS waking vector with HalpSetupRealModeResume().
    *HalpWakeVector = HalpLowStubPhysicalAddress;
    //
    // Where HalpWakeVector is the address of the
    // Firmware Waking Vector field in the FACS table,
    // initialized at HaliInitPowerManagement()
    //
  2. BSP waits for all APs to complete saving their states.
    InterlockedAdd(&HalpFlushBarrier, 1);
    while (HalpFlushBarrier != ProcessorCount);
  3. Application processors (APs) save their registers with HalpSaveProcessorState().
  4. APs enter a loop in HalpFlushAndWait() that never exits on the successful path.
    InterlockedIncrement(&HalpFlushBarrier);
    while (HalpFlushBarrier);
  5. BSP writes to the PM1 control register(s) to set the following values with HalpAcpiPmRegisterWrite().
    • SLP_TYP = 5 (S3)
    • SLP_EN = 1 
This puts the system into the S3 state. Let us look into the resume path.

Resuming from S3

  1. On the system firmware (EDK2) side, the execution flow specific to the S3 resume boot path looks roughly like this.
    ...
      -> DxeLoadCore()
           -> S3RestoreConfig2()
                -> S3ResumeExecuteBootScript()
                     -> S3BootScriptExecutorEntryFunction()
  2. S3BootScriptExecutorEntryFunction() executes the boot script and jumps to the OS waking vector as indicated by Facs->FirmwareWakingVector at the end.
  3. The OS waking vector is a copy of HalpRMStub. This eventually brings the execution of the BSP to right after HalpSetupRealModeResume() with RAX=1, as if it returned from the function.
  4. BSP wakes up other APs by sending INIT-SIPI-SIPI.
    //
    // This wakes up all APs with HalStartNextProcessor() calls
    //
    HalpAcpiPostSleep(...);
  5. The INIT-SIPI-SIPI brings the APs to right after HalpSaveProcessorState() with RAX=1, as if it returned from the function. For more details on how INIT-SIPI-SIPI starts up APs, please read the previous post.
  6. The BSP and all APs call HalpPostSleepMP() to restore other platform states, then return from HaliAcpiSleep(), continuing OS execution as usual.
If you are interested in how exactly the OS waking vector is set up and resumes the system states, I suggest reverse engineering HaliAcpiSleep() on your own. The way it factors code to keep the flow as straightforward as possible is a masterpiece.
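
The "returns a second time with RAX=1" behavior described in steps 3 and 5 resembles the setjmp()/longjmp() pattern in standard C. The sketch below is only an analogy under that assumption and is not how Windows implements it: every name is hypothetical, and longjmp() stands in for the whole sequence of powering down, running the firmware, and executing the OS waking vector.

```c
#include <assert.h>
#include <setjmp.h>

jmp_buf g_saved_context;  /* stands in for the saved processor state */
int g_sleep_requests;     /* how many times "S3" was entered */
int g_resumes;            /* how many times execution resumed */

/* Stand-in for entering S3 and for the waking vector restoring the state. */
static void fake_enter_s3_then_wake(void)
{
    g_sleep_requests++;
    longjmp(g_saved_context, 1);  /* "wake up" at the save point with 1 */
}

/* Toy counterpart of the sleep routine: returns 1 after the fake resume. */
int simulated_acpi_sleep(void)
{
    if (setjmp(g_saved_context) == 0) {
        /* First return (0): state was just saved, so go to sleep. */
        fake_enter_s3_then_wake();
        assert(!"not reached on the successful path");
    }
    /* Second return (1): execution continues as if the save call returned. */
    g_resumes++;
    return 1;
}
```

The same function body thus serves both the pre-sleep and post-resume paths, distinguished only by the return value of the save call.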

Note that on VMware, step 1 of the pre-S3 steps and steps 1-3 of the post-S3 steps are skipped. Windows on VMware does not need them because the VMware hypervisor directly restores system states instead of going through the full S3 resume boot path.

Handling S3 with Type-1 Hypervisor

As mentioned previously, S3 is a challenge for a type-1 hypervisor that is loaded during the TSL phase because:
  • On resume, the TSL phase is skipped, so the hypervisor has no opportunity to get called.
  • On resume, virtualization is disabled and needs to be enabled again.
  • The hypervisor cannot add its own boot script to trigger reinitialization, because the boot script is already locked by the TSL phase.
One may employ a guest support module that subscribes to the resume event and notifies the hypervisor to trigger reinitialization, but that is neither secure, portable, nor reliable. Another quick-and-dirty way is to disable S3 by altering the ACPI table, which has an obvious user experience issue.

A much better way is to hook the OS waking vector. This works as follows:
  1. The hypervisor intercepts IO access to the PM1 control register(s)
  2. When the guest attempts to write to the register to enter sleep, the hypervisor 
    1. overwrites contents of the Firmware Waking Vector field with its own waking vector address, and
    2. writes to the register and lets the system enter S3
  3. When the system wakes up, hypervisor's waking vector is executed, and it
    1. reenables virtualization (with VMXON for example) 
    2. sets up the guest state to emulate execution of the guest's waking vector (i.e., the guest's RIP is set to the guest waking vector)
    3. launches the guest (with VMLAUNCH for example)
Hypervisor resuming from S3

This way, the hypervisor can take control over the system before running any OS (guest) specific code. Implementation of this can be found in multiple hypervisors such as ACRN Embedded Hypervisor and Bitvisor.  
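
Step 2 of the list above can be sketched as an IO-write handler in C. This is a simplified model under several assumptions and is not taken from ACRN or Bitvisor: all names are hypothetical, the port write goes through a function pointer so the logic can run outside of a hypervisor, and the VMXON/VMLAUNCH work of step 3 is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PM1_CNT_SLP_TYP_SHIFT 10u
#define PM1_CNT_SLP_TYP_MASK  (0x7u << PM1_CNT_SLP_TYP_SHIFT)
#define PM1_CNT_SLP_EN        (1u << 13)

typedef struct {
    uint16_t pm1a_cnt_port;                    /* located through the FADT */
    uint8_t  slp_typ_s3;                       /* platform's SLP_TYP for S3 */
    volatile uint32_t *firmware_waking_vector; /* field in the FACS */
    uint32_t hv_waking_vector;                 /* hypervisor's own entry */
    uint32_t guest_waking_vector;              /* saved for use on resume */
    void (*port_write16)(uint16_t port, uint16_t value);
} s3_hook_t;

/* Mock of the OUT instruction that records the last write (for testing). */
uint16_t g_last_port;
uint16_t g_last_value;
void mock_port_write16(uint16_t port, uint16_t value)
{
    g_last_port = port;
    g_last_value = value;
}

/*
 * Would be called when the guest's write to an intercepted IO port traps
 * into the hypervisor. Returns true if an S3 request was hooked.
 */
bool s3_hook_on_io_write(s3_hook_t *hook, uint16_t port, uint16_t value)
{
    bool is_s3_request;
    uint32_t slp_typ = (value & PM1_CNT_SLP_TYP_MASK) >> PM1_CNT_SLP_TYP_SHIFT;

    assert(hook != NULL && hook->port_write16 != NULL);
    is_s3_request = port == hook->pm1a_cnt_port &&
                    (value & PM1_CNT_SLP_EN) != 0 &&
                    slp_typ == hook->slp_typ_s3;
    if (is_s3_request) {
        /* 2.1: remember the guest's vector, then install our own. */
        hook->guest_waking_vector = *hook->firmware_waking_vector;
        *hook->firmware_waking_vector = hook->hv_waking_vector;
    }
    /* 2.2: perform the write on behalf of the guest. */
    hook->port_write16(port, value);
    return is_s3_request;
}
```

On resume, the hypervisor's waking vector would use the saved guest_waking_vector as the guest's entry point, as described in step 3.2.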

For completeness, note that a type-1 hypervisor that is part of an OEM firmware image or implemented as a PEI module does not have to do any of this. If the module were part of the OEM image, it would be able to add a boot script to register reinitialization, and if the module were a PEI module, it would be executed even on the S3 resume boot path.

Conclusion

Entering and resuming from S3 is complex work that involves the OS, system firmware, and hardware implementation, as well as multiple specifications such as PI and ACPI. However, studying it allows us to familiarize ourselves with the industry standards and intriguing low-level implementation details.

As a side note, I recommend learning type-1 hypervisors over OS kernel-based ones. A type-1 hypervisor is not just more flexible; it also lets you understand in greater detail how the system works (and is arguably a more common design across production-level hypervisors). Registration for the public hypervisor development class is still suspended, but I am looking into reopening it sometime next year as a remote class. If you are interested, please reach out to tanda.sat@gmail.com for details.

EDIT

  • Nov 25 - Correct that the boot phase relevant to 3rd party type 1 hypervisor is TSL and not BDS.