
\input texinfo @c -*- texinfo -*-
@c %**start of header
@setfilename qemu-tech.info
@settitle QEMU Internals
@exampleindent 0
@paragraphindent 0
@c %**end of header

@iftex
@titlepage
@sp 7
@center @titlefont{QEMU Internals}
@sp 3
@end titlepage
@end iftex

@ifnottex
@node Top
@top

@menu
* Introduction::
* QEMU Internals::
* Regression Tests::
* Index::
@end menu
@end ifnottex

@contents

@node Introduction
@chapter Introduction

@menu
* intro_features::         Features
* intro_x86_emulation::    x86 emulation
* intro_arm_emulation::    ARM emulation
* intro_mips_emulation::   MIPS emulation
* intro_ppc_emulation::    PowerPC emulation
* intro_sparc_emulation::  SPARC emulation
@end menu

@node intro_features
@section Features

QEMU is a FAST! processor emulator using a portable dynamic
translator.

QEMU has two operating modes:

@itemize @minus

@item
Full system emulation. In this mode, QEMU emulates a full system
(usually a PC), including a processor and various peripherals. It can
be used to launch a different operating system without rebooting the
PC or to debug system code.

@item
User mode emulation (Linux host only). In this mode, QEMU can launch
Linux processes compiled for one CPU on another CPU. It can be used to
launch the Wine Windows API emulator (@url{http://www.winehq.org}) or
to ease cross-compilation and cross-debugging.

@end itemize

As QEMU requires no host kernel driver to run, it is very safe and
easy to use.

QEMU generic features:

@itemize

@item User space only or full system emulation.

@item Using dynamic translation to native code for reasonable speed.

@item Works on x86 and PowerPC hosts. Being tested on ARM, Sparc32, Alpha and S390.

@item Self-modifying code support.

@item Precise exceptions support.

@item The virtual CPU is a library (@code{libqemu}) which can be used
in other projects (see @file{qemu/tests/qruncom.c} for an
example of user mode @code{libqemu} usage).

@end itemize

QEMU user mode emulation features:
@itemize
@item Generic Linux system call converter, including most ioctls.

@item clone() emulation using the native clone() system call, so that the Linux scheduler is used for threads.

@item Accurate signal handling by remapping host signals to target signals.
@end itemize

QEMU full system emulation features:
@itemize
@item QEMU can either use a full software MMU for maximum portability or use the host system call mmap() to simulate the target MMU.
@end itemize

@node intro_x86_emulation
@section x86 emulation

QEMU x86 target features:

@itemize

@item The virtual x86 CPU supports 16 bit and 32 bit addressing with segmentation.
LDT/GDT and IDT are emulated. VM86 mode is also supported to run DOSEMU.

@item Support of host page sizes bigger than 4KB in user mode emulation.

@item QEMU can emulate itself on x86.

@item An extensive Linux x86 CPU test program is included (@file{tests/test-i386}).
It can be used to test other x86 virtual CPUs.

@end itemize

Current QEMU limitations:

@itemize

@item No SSE/MMX support (yet).

@item No x86-64 support.

@item IPC syscalls are missing.

@item The x86 segment limits and access rights are not tested at every
memory access (yet). Fortunately, very few OSes seem to rely on that for
normal use.

@item On non x86 host CPUs, @code{double}s are used instead of the non standard
10 byte @code{long double}s of x86 for floating point emulation to get
maximum performance.

@end itemize

@node intro_arm_emulation
@section ARM emulation

@itemize

@item Full ARM 7 user emulation.

@item NWFPE FPU support included in user Linux emulation.

@item Can run most ARM Linux binaries.

@end itemize

@node intro_mips_emulation
@section MIPS emulation

@itemize

@item The system emulation allows full MIPS32/MIPS64 Release 2 emulation,
including privileged instructions, FPU and MMU, in both little and big
endian modes.

@item The Linux userland emulation can run many 32 bit MIPS Linux binaries.

@end itemize

Current QEMU limitations:

@itemize

@item Self-modifying code is not always handled correctly.

@item 64 bit userland emulation is not implemented.

@item The system emulation is not complete enough to run real firmware.

@item The watchpoint debug facility is not implemented.

@end itemize

@node intro_ppc_emulation
@section PowerPC emulation

@itemize

@item Full PowerPC 32 bit emulation, including privileged instructions,
FPU and MMU.

@item Can run most PowerPC Linux binaries.

@end itemize

@node intro_sparc_emulation
@section SPARC emulation

@itemize

@item Full SPARC V8 emulation, including privileged
instructions, FPU and MMU. SPARC V9 emulation includes most privileged
and VIS instructions, FPU and I/D MMU. Alignment is fully enforced.

@item Can run most 32-bit SPARC Linux binaries, SPARC32PLUS Linux binaries and
some 64-bit SPARC Linux binaries.

@end itemize

Current QEMU limitations:

@itemize

@item IPC syscalls are missing.

@item Floating point exception support is buggy.

@item Atomic instructions are not correctly implemented.

@item Sparc64 emulators are not usable for anything yet.

@end itemize

@node QEMU Internals
@chapter QEMU Internals

@menu
* QEMU compared to other emulators::
* Portable dynamic translation::
* Register allocation::
* Condition code optimisations::
* CPU state optimisations::
* Translation cache::
* Direct block chaining::
* Self-modifying code and translated code invalidation::
* Exception support::
* MMU emulation::
* Hardware interrupts::
* User emulation specific details::
* Bibliography::
@end menu

@node QEMU compared to other emulators
@section QEMU compared to other emulators

Like Bochs [3], QEMU emulates an x86 CPU. But QEMU is much faster than
Bochs as it uses dynamic compilation. Bochs is closely tied to x86 PC
emulation while QEMU can emulate several processors.

Like Valgrind [2], QEMU does user space emulation and dynamic
translation. Valgrind is mainly a memory debugger, while QEMU has no
support for memory debugging (QEMU could be used to detect out-of-bounds
memory accesses as Valgrind does, but it cannot track uninitialised data
as Valgrind does). The Valgrind dynamic translator generates better code
than QEMU (in particular it does register allocation) but it is closely
tied to an x86 host and target and has no support for precise exceptions
and system emulation.

EM86 [4] is the closest project to user space QEMU (and QEMU still uses
some of its code, in particular the ELF file loader). EM86 was limited
to an Alpha host and used a proprietary and slow interpreter (the
interpreter part of the FX!32 Digital Win32 code translator [5]).

TWIN [6] is a Windows API emulator like Wine. It is less accurate than
Wine but includes a protected mode x86 interpreter to launch x86 Windows
executables.
Such an approach has greater potential because most of the
Windows API is executed natively, but it is far more difficult to develop
because all the data structures and function parameters exchanged
between the API and the x86 code must be converted.

User mode Linux [7] was the only solution before QEMU to launch a
Linux kernel as a process while not needing any host kernel
patches. However, user mode Linux requires heavy patches to the guest
kernel, while QEMU accepts unpatched Linux kernels. The price to pay is
that QEMU is slower.

The new Plex86 [8] PC virtualizer is done in the same spirit as the
qemu-fast system emulator. It requires a patched Linux kernel to work
(you cannot launch the same kernel on your PC), but the patches are
really small. As it is a PC virtualizer (no emulation is done except
for some privileged instructions), it has the potential of being
faster than QEMU. The downside is that a complicated (and potentially
unsafe) host kernel patch is needed.

The commercial PC Virtualizers (VMWare [9], VirtualPC [10], TwoOStwo
[11]) are faster than QEMU, but they all need specific, proprietary
and potentially unsafe host drivers. Moreover, they are unable to
provide cycle exact simulation as an emulator can.

@node Portable dynamic translation
@section Portable dynamic translation

QEMU is a dynamic translator. When it first encounters a piece of code,
it converts it to the host instruction set. Usually dynamic translators
are very complicated and highly CPU dependent. QEMU uses some tricks
which make it relatively easily portable and simple while achieving good
performance.

The basic idea is to split every x86 instruction into a few simpler
instructions. Each simple instruction is implemented by a piece of C
code (see @file{target-i386/op.c}).
Then a compile time tool
(@file{dyngen}) takes the corresponding object file (@file{op.o})
to generate a dynamic code generator which concatenates the simple
instructions to build a function (see @file{op.h:dyngen_code()}).

In essence, the process is similar to [1], but more work is done at
compile time.

A key idea to get optimal performance is that constant parameters can
be passed to the simple operations. For that purpose, dummy ELF
relocations are generated with gcc for each constant parameter. Then,
the tool (@file{dyngen}) can locate the relocations and generate the
appropriate C code to resolve them when building the dynamic code.

That way, QEMU is no more difficult to port than a dynamic linker.

To go even faster, GCC static register variables are used to keep the
state of the virtual CPU.

@node Register allocation
@section Register allocation

Since QEMU uses fixed simple instructions, no efficient register
allocation can be done. However, because RISC CPUs have a lot of
registers, most of the virtual CPU state can be put in registers without
doing complicated register allocation.

@node Condition code optimisations
@section Condition code optimisations

Efficient emulation of the CPU condition codes (the @code{EFLAGS}
register on x86) is critical for good performance. QEMU uses lazy
condition code evaluation: instead of computing the condition codes
after each x86 instruction, it just stores one operand (called
@code{CC_SRC}), the result (called @code{CC_DST}) and the type of
operation (called @code{CC_OP}).

@code{CC_OP} is almost never explicitly set in the generated code
because it is known at translation time.
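
As a rough illustration of lazy condition code evaluation, the following
C sketch records one operand, the result and the operation type when an
instruction is emulated, and derives a flag only when it is requested.
The type @code{struct lazy_flags}, the @code{CC_OP_*} tags and the helper
names are hypothetical and are not taken from the QEMU sources; only the
carry, zero and sign flags are shown.

@verbatim
```c
#include <stdint.h>

/* Hypothetical operation tags; the real CC_OP values in QEMU differ. */
enum cc_op { CC_OP_ADD32, CC_OP_SUB32 };

struct lazy_flags {
    uint32_t cc_src;   /* one operand of the last flag-setting operation */
    uint32_t cc_dst;   /* result of that operation */
    enum cc_op cc_op;  /* which operation produced the result */
};

/* Emulated 32 bit ADD: compute the result and remember just enough
 * to rebuild the flags later. No flag is computed here. */
uint32_t emu_add32(struct lazy_flags *f, uint32_t a, uint32_t b)
{
    uint32_t res = a + b;
    f->cc_src = b;
    f->cc_dst = res;
    f->cc_op = CC_OP_ADD32;
    return res;
}

/* Emulated 32 bit SUB, recorded the same way. */
uint32_t emu_sub32(struct lazy_flags *f, uint32_t a, uint32_t b)
{
    uint32_t res = a - b;
    f->cc_src = b;
    f->cc_dst = res;
    f->cc_op = CC_OP_SUB32;
    return res;
}

/* Carry flag, computed on demand from the recorded values. */
int lazy_carry(const struct lazy_flags *f)
{
    switch (f->cc_op) {
    case CC_OP_ADD32:
        /* a + b wrapped around iff the result is smaller than b */
        return f->cc_dst < f->cc_src;
    case CC_OP_SUB32:
        /* a - b borrows iff a < b, where a = result + b (mod 2^32) */
        return (uint32_t)(f->cc_dst + f->cc_src) < f->cc_src;
    }
    return 0;
}

/* Zero and sign flags depend only on the result. */
int lazy_zero(const struct lazy_flags *f)
{
    return f->cc_dst == 0;
}

int lazy_sign(const struct lazy_flags *f)
{
    return (int)(f->cc_dst >> 31);
}
```
@end verbatim

In this sketch an instruction whose flags are never read costs only the
three stores, and a conditional jump would call only the helper for the
flag it actually tests.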

In order to increase performance, a backward pass is performed on the
generated simple instructions (see
@code{target-i386/translate.c:optimize_flags()}). When it can be proved that
the condition codes are not needed by the next instructions, no
condition codes are computed at all.

@node CPU state optimisations
@section CPU state optimisations

The x86 CPU has many internal states which change the way it evaluates
instructions. In order to achieve a good speed, the translation phase
assumes that some state information of the virtual x86 CPU does not
change within the translated code. For example, if the SS, DS and ES
segments have a zero base, then the translator does not even generate an
addition for the segment base.

[The FPU stack pointer register is not handled that way yet].

@node Translation cache
@section Translation cache

A 16 MByte cache holds the most recently used translations. For
simplicity, it is completely flushed when it is full. A translation unit
contains just a single basic block (a block of x86 instructions
terminated by a jump or by a virtual CPU state change which the
translator cannot deduce statically).

@node Direct block chaining
@section Direct block chaining

After each translated basic block is executed, QEMU uses the simulated
Program Counter (PC) and other CPU state information (such as the CS
segment base value) to find the next basic block.

In order to accelerate the most common cases where the new simulated PC
is known, QEMU can patch a basic block so that it jumps directly to the
next one.

The most portable code uses an indirect jump. An indirect jump makes
it easier to make the jump target modification atomic.
On some host
architectures (such as x86 or PowerPC), the @code{JUMP} opcode is
directly patched so that the block chaining has no overhead.

@node Self-modifying code and translated code invalidation
@section Self-modifying code and translated code invalidation

Self-modifying code is a special challenge in x86 emulation because no
instruction cache invalidation is signaled by the application when code
is modified.

When translated code is generated for a basic block, the corresponding
host page is write protected if it is not already read-only (with the
system call @code{mprotect()}). Then, if a write access is done to the
page, Linux raises a SIGSEGV signal. QEMU then invalidates all the
translated code in the page and enables write accesses to the page.

Correct translated code invalidation is done efficiently by maintaining
a linked list of every translated block contained in a given page. Other
linked lists are also maintained to undo direct block chaining.

Although the overhead of doing @code{mprotect()} calls is significant,
most MSDOS programs can be emulated at reasonable speed with QEMU and
DOSEMU.

Note that QEMU also invalidates pages of translated code when it detects
that memory mappings are modified with @code{mmap()} or @code{munmap()}.

When using a software MMU, the code invalidation is more efficient: if
a given code page is invalidated too often because of write accesses,
then a bitmap representing all the code inside the page is
built. Every store into that page checks the bitmap to see if the code
really needs to be invalidated. It avoids invalidating the code when
only data is modified in the page.

@node Exception support
@section Exception support

longjmp() is used when an exception such as division by zero is
encountered.
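
The use of longjmp() for exceptions can be sketched in C as follows. This
is only an illustration under simplifying assumptions: the structure
@code{struct cpu_state}, its fields and the functions
@code{raise_exception()}, @code{op_divl()} and @code{cpu_exec_once()} are
names chosen for this example, and a single helper stands in for a whole
translated block.

@verbatim
```c
#include <setjmp.h>
#include <stdint.h>

#define EXCP_NONE (-1)
#define EXCP_DIVZ 0          /* x86 divide error vector */

/* Hypothetical, heavily reduced virtual CPU state. */
struct cpu_state {
    jmp_buf jmp_env;         /* where to land when an exception is raised */
    int exception_index;     /* pending exception, or EXCP_NONE */
    uint32_t eax;
};

/* Abandon the operation in progress and return to the main loop.
 * This function never returns to its caller. */
void raise_exception(struct cpu_state *env, int index)
{
    env->exception_index = index;
    longjmp(env->jmp_env, 1);
}

/* A simple operation that may fault in the middle of a block. */
void op_divl(struct cpu_state *env, uint32_t divisor)
{
    if (divisor == 0)
        raise_exception(env, EXCP_DIVZ);
    env->eax /= divisor;
}

/* One iteration of a main loop: run the "translated code" and report
 * which exception, if any, interrupted it. */
int cpu_exec_once(struct cpu_state *env, uint32_t divisor)
{
    env->exception_index = EXCP_NONE;
    if (setjmp(env->jmp_env) == 0) {
        /* normal path: execute the block */
        op_divl(env, divisor);
    }
    /* reached either after the block finished or after longjmp() */
    return env->exception_index;
}
```
@end verbatim

Because the jump discards the host stack frames of the interrupted
operation, the main loop regains control without every helper having to
propagate an error code.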

The host SIGSEGV and SIGBUS signal handlers are used to catch invalid
memory accesses. The exact CPU state can be retrieved because all the
x86 registers are stored in fixed host registers. The simulated program
counter is found by retranslating the corresponding basic block and by
looking where the host program counter was at the exception point.

The virtual CPU cannot retrieve the exact @code{EFLAGS} register because
in some cases it is not computed because of condition code
optimisations. It is not a big concern because the emulated code can
still be restarted in any case.

@node MMU emulation
@section MMU emulation

For system emulation, QEMU uses the mmap() system call to emulate the
target CPU MMU. It works as long as the emulated OS does not use an area
reserved by the host OS (such as the area above 0xc0000000 on x86
Linux).

In order to be able to launch any OS, QEMU also supports a soft
MMU. In that mode, the MMU virtual to physical address translation is
done at every memory access. QEMU uses an address translation cache to
speed up the translation.

In order to avoid flushing the translated code each time the MMU
mappings change, QEMU uses a physically indexed translation cache. It
means that each basic block is indexed with its physical address.

When MMU mappings change, only the chaining of the basic blocks is
reset (i.e. a basic block can no longer jump directly to another one).

@node Hardware interrupts
@section Hardware interrupts

In order to be faster, QEMU does not check at every basic block if a
hardware interrupt is pending. Instead, the user must asynchronously
call a specific function to tell that an interrupt is pending. This
function resets the chaining of the currently executing basic
block.
It ensures that the execution will return soon to the main loop
of the CPU emulator. Then the main loop can test if the interrupt is
pending and handle it.

@node User emulation specific details
@section User emulation specific details

@subsection Linux system call translation

QEMU includes a generic system call translator for Linux. It means that
the parameters of the system calls can be converted to fix the
endianness and 32/64 bit issues. The IOCTLs are converted with a generic
type description system (see @file{ioctls.h} and @file{thunk.c}).

QEMU supports host CPUs which have pages bigger than 4KB. It records all
the mappings the process does and tries to emulate the @code{mmap()}
system calls in cases where the host @code{mmap()} call would fail
because of bad page alignment.

@subsection Linux signals

Normal and real-time signals are queued along with their information
(@code{siginfo_t}) as it is done in the Linux kernel. Then an interrupt
request is made to the virtual CPU. When it is interrupted, one queued
signal is handled by generating a stack frame in the virtual CPU as the
Linux kernel does. The @code{sigreturn()} system call is emulated to return
from the virtual signal handler.

Some signals (such as SIGALRM) directly come from the host. Other
signals are synthesized from the virtual CPU exceptions such as SIGFPE
when a division by zero is done (see @code{main.c:cpu_loop()}).

The blocked signal mask is still handled by the host Linux kernel so
that most signal system calls can be redirected directly to the host
Linux kernel. Only the @code{sigaction()} and @code{sigreturn()} system
calls need to be fully emulated (see @file{signal.c}).

@subsection clone() system call and threads

The Linux clone() system call is usually used to create a thread.
QEMU uses the host clone() system call so that real host threads are
created for each emulated thread. One virtual CPU instance is created
for each thread.

The virtual x86 CPU atomic operations are emulated with a global lock so
that their semantics are preserved.

Note that currently there are still some locking issues in QEMU. In
particular, the translation cache flush is not protected yet against
reentrancy.

@subsection Self-virtualization

QEMU was conceived so that ultimately it can emulate itself. Although
it is not very useful, it is an important test to show the power of the
emulator.

Achieving self-virtualization is not easy because there may be address
space conflicts. QEMU solves this problem by being an executable ELF
shared object like the ld-linux.so ELF interpreter. That way, it can be
relocated at load time.

@node Bibliography
@section Bibliography

@table @asis

@item [1]
@url{http://citeseer.nj.nec.com/piumarta98optimizing.html}, Optimizing
direct threaded code by selective inlining (1998) by Ian Piumarta, Fabio
Riccardi.

@item [2]
@url{http://developer.kde.org/~sewardj/}, Valgrind, an open-source
memory debugger for x86-GNU/Linux, by Julian Seward.

@item [3]
@url{http://bochs.sourceforge.net/}, the Bochs IA-32 Emulator Project,
by Kevin Lawton et al.

@item [4]
@url{http://www.cs.rose-hulman.edu/~donaldlf/em86/index.html}, the EM86
x86 emulator on Alpha-Linux.

@item [5]
@url{http://www.usenix.org/publications/library/proceedings/usenix-nt97/@/full_papers/chernoff/chernoff.pdf},
DIGITAL FX!32: Running 32-Bit x86 Applications on Alpha NT, by Anton
Chernoff and Ray Hookway.

@item [6]
@url{http://www.willows.com/}, Windows API library emulation from
Willows Software.

@item [7]
@url{http://user-mode-linux.sourceforge.net/},
The User-mode Linux Kernel.

@item [8]
@url{http://www.plex86.org/},
The new Plex86 project.

@item [9]
@url{http://www.vmware.com/},
The VMWare PC virtualizer.

@item [10]
@url{http://www.microsoft.com/windowsxp/virtualpc/},
The VirtualPC PC virtualizer.

@item [11]
@url{http://www.twoostwo.org/},
The TwoOStwo PC virtualizer.

@end table

@node Regression Tests
@chapter Regression Tests

In the directory @file{tests/}, various interesting testing programs
are available. They are used for regression testing.

@menu
* test-i386::
* linux-test::
* qruncom.c::
@end menu

@node test-i386
@section @file{test-i386}

This program executes most of the 16 bit and 32 bit x86 instructions and
generates a text output. It can be compared with the output obtained with
a real CPU or another emulator. The target @code{make test} runs this
program and a @code{diff} on the generated output.

The Linux system call @code{modify_ldt()} is used to create x86 selectors
to test some 16 bit addressing and 32 bit with segmentation cases.

The Linux system call @code{vm86()} is used to test vm86 emulation.

Various exceptions are raised to test most of the x86 user space
exception reporting.

@node linux-test
@section @file{linux-test}

This program tests various Linux system calls. It is used to verify
that the system call parameters are correctly converted between target
and host CPUs.

@node qruncom.c
@section @file{qruncom.c}

Example of usage of @code{libqemu} to emulate a user mode i386 CPU.

@node Index
@chapter Index
@printindex cp

@bye