Implementing a software scripting engine on Fermi to achieve safe memory re-clocking.
Here, "Fermi" refers to Nvidia GPUs based on the Fermi architecture.
Nvidia cards have long been able to reclock at least some of the engines of their GPUs. Up to and including the GeForce 7, reclocking happened at boot time and usually didn't involve memory reclocking at all.
This changed with the GeForce 8 (nv50), where almost all laptops gained the ability to reclock both the VRAM and the main engines. This was introduced to lower power consumption when the GPU was mostly idle. The default boot clocks were usually somewhere between the slowest and the fastest clocks. The reclocking process for these cards is mostly understood, and Nouveau is not far from being able to reclock safely on the fly, even while gaming.
GeForce 200 (nva3) introduced load-based reclocking on all cards. This became a real problem because the default boot clocks are a third to a half of the maximum clocks, which explains many of the performance problems on those cards with Nouveau.
The reclocking process on these cards has yet to be perfected, but it is hoped that it will soon reach the level of stability of nv50.
On Fermi (GeForce 400), the performance problem got worse, as default boot clocks are often set to about 10% of the maximum clocks. With such low clocks, even desktop usage suffers from many performance problems. This makes reclocking on these cards an urgent task. Reclocking Fermi's engines is mostly understood, while memory reclocking (GDDR5) is still being investigated.
For memory reclocking to be done safely, all applications must be denied access to VRAM. This can be done radeon-style (by unmapping the BOs in TTM while locking BO mapping) or by putting the card off-bus. The latter is more efficient, so it is the method used on nv50.
When the card is off-bus, the host cannot perform any MMIO access, so it cannot touch the PLL registers. This is why reclocking on nv50 is carried out by an internal "scripting" engine called HWSQ (HardWare SeQuencer). This engine can put the card off-bus, wait, and write to MMIO registers.
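To make the idea of an on-card sequencer concrete, here is a minimal host-side model in Python. It is only a sketch: the opcode names, the tuple encoding, the `FakeCard` class and the register address `0x4000` are all made up for illustration and do not reflect the real HWSQ encoding or any real register.

```python
from dataclasses import dataclass, field

# Hypothetical opcodes; the real HWSQ encoding is not described in this post.
OP_BUS_OFF, OP_BUS_ON, OP_WAIT_NS, OP_WRITE = range(4)

@dataclass
class FakeCard:
    """Stand-in for the GPU: a register file plus a bus-state flag."""
    regs: dict = field(default_factory=dict)
    on_bus: bool = True
    waited_ns: int = 0

    def host_write(self, addr, value):
        # The host can only reach MMIO while the card is on the bus.
        if not self.on_bus:
            raise RuntimeError("MMIO access while the card is off-bus")
        self.regs[addr] = value

def run_script(card, script):
    """Execute a list of (opcode, *args) tuples from 'inside' the card."""
    for op, *args in script:
        if op == OP_BUS_OFF:
            card.on_bus = False
        elif op == OP_BUS_ON:
            card.on_bus = True
        elif op == OP_WAIT_NS:
            card.waited_ns += args[0]   # simulated delay, no real sleeping
        elif op == OP_WRITE:
            addr, value = args
            card.regs[addr] = value     # internal write, also works off-bus
        else:
            raise ValueError(f"unknown opcode {op}")

# Usage: go off-bus, write a (made-up) PLL register, wait, come back.
card = FakeCard()
run_script(card, [
    (OP_BUS_OFF,),
    (OP_WRITE, 0x4000, 0x1),
    (OP_WAIT_NS, 5000),
    (OP_BUS_ON,),
])
```

The point of the model is the asymmetry: `host_write` fails while the card is off-bus, whereas the script's own writes go through, which is exactly why the sequence has to run on the card.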
Unfortunately, HWSQ was removed when Fermi was released, as it was superseded by a more general-purpose engine called PDAEMON. PDAEMON was introduced on GeForce 200 (nva3) and can be thought of as a 200 MHz microcontroller that has full access to the card's registers and can also send IRQs to the host. Nvidia uses it for both hardware monitoring and reclocking.
The ISA of this microcontroller is FµC (flexible microcode). This ISA is used in most engines of the GPU on Fermi and in almost all of them on Kepler. An open-source implementation of PDAEMON with basic command submission (host -> PDAEMON) and fan management has been written[1]. This implementation works on PDAEMON from nva3 to nvd9 and should work out of the box on Kepler (never tested, but the needed reverse engineering is done).
The goal of this project is to propose a replacement for HWSQ by implementing scripting capabilities in PDAEMON. This requires some improvements to PDAEMON, such as PDAEMON -> host communication and a very simple scheduler to manage multiple "applications" (thermal management, hardware monitoring and reclocking).
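As a rough illustration of what "a very simple scheduler" plus PDAEMON -> host messages could look like, here is a cooperative round-robin model in Python. It is a sketch under assumptions: the task names, the `outbox` queue and the tick counts are hypothetical, and the real firmware would be written in FµC, not Python.

```python
from collections import deque

def thermal_task(outbox):
    """Placeholder task: would sample a temperature sensor on each run."""
    tick = 0
    while True:
        tick += 1
        if tick % 3 == 0:
            outbox.append(("thermal", tick))   # message for the host
        yield                                  # give the engine back

def reclock_task(outbox, steps):
    """Placeholder task: a reclocking job that finishes after `steps` runs."""
    for _ in range(steps):
        yield
    outbox.append(("reclock", "done"))

def run(tasks, slots):
    """Round-robin: each time slot runs one task until it yields."""
    queue = deque(tasks)
    for _ in range(slots):
        if not queue:
            break
        task = queue.popleft()
        try:
            next(task)
        except StopIteration:
            continue            # finished tasks are dropped
        queue.append(task)

outbox = deque()
run([thermal_task(outbox), reclock_task(outbox, 2)], slots=10)
```

Because every task yields voluntarily, a long reclocking job cannot starve thermal monitoring as long as each step is short, which is the property the scheduler needs to preserve.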
Theoretically, it would be possible to replace HWSQ scripts with straight FµC and ask PDAEMON to execute it. However, FµC isn't a simple ISA, and embedding an assembler inside the Linux kernel doesn't seem like a good idea.
It is also a very stupid idea from a security point of view, as this engine can do pretty much everything.
Furthermore, it could cause side effects that hinder the GPU monitoring processes.
Lastly, reclocking is already hard on its own. Implementing it in FµC seems like an impossible mission, especially when considering all the chipset-specific variations.
In contrast, a script could be guaranteed not to interfere with the critical mission of PDAEMON (thermal monitoring).
To be useful, the proposed scripting ISA must be capable of cleanly implementing both engine and memory reclocking. Designing the ISA of this script and implementing it in PDAEMON will be the main challenge of this project.
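One way a restricted script format could give the non-interference guarantee is to validate every script before running it. The Python sketch below shows the idea only: the two opcodes, the byte layout and the "protected" register window are assumptions invented for this example, not the ISA this project will end up with.

```python
import struct

# Hypothetical opcodes and layout: 1-byte opcode + little-endian 32-bit args.
OP_WAIT, OP_WRITE = 0x01, 0x02
SIZES = {OP_WAIT: 5, OP_WRITE: 9}

# Made-up register window assumed to belong to thermal monitoring.
PROTECTED = range(0x20000, 0x20100)

def encode(script):
    """Pack ("wait", ns) and ("write", addr, value) tuples into bytes."""
    out = bytearray()
    for ins in script:
        if ins[0] == "wait":
            out += struct.pack("<BI", OP_WAIT, ins[1])
        elif ins[0] == "write":
            out += struct.pack("<BII", OP_WRITE, ins[1], ins[2])
        else:
            raise ValueError(f"unknown instruction {ins[0]!r}")
    return bytes(out)

def validate(blob):
    """Reject unknown opcodes, truncated instructions and protected writes."""
    pos = 0
    while pos < len(blob):
        op = blob[pos]
        size = SIZES.get(op)
        if size is None or pos + size > len(blob):
            return False
        if op == OP_WRITE:
            addr, _value = struct.unpack_from("<II", blob, pos + 1)
            if addr in PROTECTED:
                return False
        pos += size
    return True

good = encode([("write", 0x4000, 1), ("wait", 5000)])
bad = encode([("write", 0x20010, 0)])
```

With arbitrary FµC no such check is practical, since the code could compute addresses at run time; a script with only constant operands can be checked completely before it is executed.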
The scripting engine will then be put to the test by implementing safe engine reclocking and, if the reverse engineering has been completed, safe memory reclocking.
This task is likely to require reverse engineering and various fixes to accommodate all the testers.
In parallel with this task, PDAEMON should be merged into Linux. This involves creating a new Nouveau submodule, porting the user-space command-submission program to the kernel and then porting some hardware monitoring features to PDAEMON.
Apart from this, I am also a great supporter and promoter of FOSS documentation, so I plan to write a lot of documentation for Nouveau, which in my opinion is severely lacking at present.
Implementation Steps
Step 1 : Work with PDAEMON, improve its functionality, come up with a PDAEMON-to-host communication scheme and write a scheduler.
Step 2 : Design, implement and integrate scripting capabilities in FµC.
Step 3 : Integrate PDAEMON into the kernel tree and use it to safely reclock the engines.
Step 4 : Work with the community on testing and fixing bugs. This will take quite a bit of effort, as I will have to communicate remotely with users, ask them to test it, and then try to fix the bugs they run into.
Step 5 : Get the completed work merged, along with the fixes for reclocking.
Step 6 : Create documentation and work on PDAEMON to support automatic reclocking (to free the CPU from the PCOUNTER polling) and increase the smoothness of the solution.
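For the automatic reclocking mentioned in Step 6, the decision logic that would move from the CPU into PDAEMON could be as simple as a threshold policy with hysteresis. The Python sketch below is only an illustration: the thresholds, the number of performance levels and the load samples are made-up values, and the real load figures would come from PCOUNTER.

```python
def next_level(level, load, n_levels, up=0.8, down=0.3):
    """Pick the performance level to use after one load sample in [0.0, 1.0].

    Loads between `down` and `up` keep the current level, so the clocks
    do not bounce between two levels on every sample.
    """
    if load >= up and level < n_levels - 1:
        return level + 1
    if load <= down and level > 0:
        return level - 1
    return level

# Usage: start at the lowest level and feed in a burst of load.
level = 0
history = []
for load in [0.1, 0.9, 0.95, 0.5, 0.2, 0.2, 0.2]:
    level = next_level(level, load, n_levels=3)
    history.append(level)
```

Stepping one level at a time keeps each transition small; a real policy might instead jump straight to the highest level under heavy load, which is a design choice left open here.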
Nice!! Keep us posted on your progress :)
Thanks, and I will surely try to. :)
Thanks for your time and your work. Can we expect some API or utilities like nvclock to be available soon?
Well, I still have to completely test and merge FSE with PDAEMON. After that, PDAEMON will be sent to the community as an RFC and the required changes will be made. I don't really know whether others have something similar planned soon. I'll try to keep my blog updated.