Xbyak - x86, x64 JIT assembler -

an ultimate optimization for x86(IA-32) and x64(AMD64, x86-64)
What's this?
Xbyak is a C++ header file that lets you assemble x86(IA-32) and x64(AMD64, x86-64) mnemonics dynamically. Because the binary code is generated while the program is running, it can be optimized for values that are known only at run time(ex. quantization, polynomial calculation).
application for fast encryption(High-Speed Software Implementation of the Optimal Ate Pairing over Barreto-Naehrig Curves)
header file only
You can use Xbyak's functions as soon as you include xbyak.h.
support Windows Xp(32bit, 64bit), Vista/Linux(32bit, 64bit)/Intel Mac
Xbyak runs on Visual Studio C++ 2005 Express Edition, VS2008 Pro, VC2010, mingw and gcc.
Note: the "-fno-operator-names" option is required on gcc so that "and", "or", etc. are not parsed as operators.
Or, define XBYAK_NO_OP_NAMES to use and_(), or_() instead.
support almost all mnemonics of Pentium for user application
MMX/MMX2/SSE/SSE2/SSE3/SSSE3/SSE4/FPU(partially)/AVX are available.
Output small binary code if possible
"cmp(eax, 5);" means "cmp eax, byte 5" on NASM.
The BSD 3-Clause License
How to use
On Linux,
  >sudo make install
or copy xbyak.h, xbyak_mnemonic.h and xbyak_bin2hex.h into the same directory(ex. /usr/local/include/xbyak/), and specify the include directory when compiling your source(-I/usr/local/include/).
New Feature
In AutoGrow mode, Xbyak grows the code buffer automatically when necessary. Call ready() before calling getCode() so that the jmp addresses are calculated.
struct Code : Xbyak::CodeGenerator {
    Code()
        : Xbyak::CodeGenerator(<default memory size>, Xbyak::AutoGrow)
    { /* write mnemonics here */ }
};
Code c;
c.ready(); // Don't forget to call this function
Create your class inheriting Xbyak::CodeGenerator and write x86, x64 mnemonics in a method of your class. After calling the method, call getCode() and cast the return value to a function pointer of the type you need.
NASM              Xbyak
mov eax, ebx  --> mov(eax, ebx);
inc ecx       --> inc(ecx);
ret           --> ret();
(ptr|dword|word|byte) [base + index * (1|2|4|8) + displacement]
                      [rip + 32bit disp] ; x64 only
Selector is not supported.
dword, word and byte are class members, so do not use these names for your own variables.
NASM                   Xbyak
mov eax, [ebx+ecx] --> mov (eax, ptr[ebx+ecx]);
test byte [esp], 4 --> test (byte [esp], 4);
You can omit the destination for most 3-operand mnemonics.
vaddps(xmm1, xmm2, xmm3); // xmm1 <- xmm2 + xmm3
vaddps(xmm2, xmm3); // xmm2 <- xmm2 + xmm3
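The shorthand can be pictured as a two-operand overload that forwards to the three-operand form. The sketch below is only an illustration: these vaddps() functions are hypothetical stand-ins that return assembly text, not Xbyak's real mnemonic functions.

```cpp
#include <cassert>
#include <string>

// Hypothetical text-emitting stand-in for a 3-operand mnemonic (not Xbyak's API).
std::string vaddps(const std::string& dst, const std::string& src1, const std::string& src2)
{
    return "vaddps " + dst + ", " + src1 + ", " + src2;
}

// Two-operand shorthand: the destination is also used as the first source.
std::string vaddps(const std::string& dst, const std::string& src)
{
    return vaddps(dst, dst, src);
}
```

With these stand-ins, vaddps("xmm2", "xmm3") produces the same text as vaddps("xmm2", "xmm2", "xmm3").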
Specify the label name as a string when you want to jump, and define the label with L(). Use T_NEAR when the relative address offset does not fit in 8 bits.
Otherwise an ERR_LABEL_IS_TOO_FAR exception will occur.
L("L1");
    jmp ("L1");

    jmp ("L2");
    (small code)
L("L2");

    jmp ("L3", T_NEAR);
    (large code)
L("L3");
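The difference between the two jump forms comes from the instruction encoding: a short jmp stores a signed 8-bit offset and a near jmp stores a 32-bit offset. The following is a minimal sketch of that choice, not Xbyak's implementation. The helper name encodeJmp and its parameters are hypothetical; the opcodes EB (jmp rel8) and E9 (jmp rel32) are the standard IA-32 encodings.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper, not part of Xbyak: encode a jmp located at offset
// `from` whose target is offset `to` in the same code buffer.
std::vector<uint8_t> encodeJmp(int64_t from, int64_t to, bool isNear)
{
    std::vector<uint8_t> code;
    if (!isNear) {
        // The offset is relative to the end of the 2-byte short jmp.
        const int64_t rel = to - (from + 2);
        if (rel < -128 || rel > 127) {
            // Corresponds to the situation reported as ERR_LABEL_IS_TOO_FAR.
            throw std::runtime_error("label is too far");
        }
        code.push_back(0xEB);                         // jmp rel8
        code.push_back(static_cast<uint8_t>(rel));
    } else {
        // The offset is relative to the end of the 5-byte near jmp.
        // Assumption: the offset fits in 32 bits.
        const uint32_t rel = static_cast<uint32_t>(to - (from + 5));
        code.push_back(0xE9);                         // jmp rel32
        for (int i = 0; i < 4; i++) {
            code.push_back(static_cast<uint8_t>(rel >> (8 * i))); // little endian
        }
    }
    return code;
}
```

A target 0x10 bytes ahead fits the 2-byte form, while a target 0x1000 bytes ahead is accepted only with isNear set, which mirrors the role of T_NEAR.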
Code Size
The default maximum code size is 2048 bytes. If you need a bigger size, specify it with CodeGenerator(int maxSize).
Tiny samples
sample 1
Generating add function
#include <stdio.h>
#include "xbyak/xbyak.h"

// 32-bit x86 only: the first argument is read from [esp+4]
struct AddFunc : public Xbyak::CodeGenerator {
    AddFunc(int y)
    {
        mov(eax, ptr[esp+4]);
        add(eax, y);
        ret();
    }
};

int main()
{
    AddFunc a(3);
    int (*add3)(int) = (int (*)(int))a.getCode();
    printf("3 + 2 = %d\n", add3(2));
}
The content indicated by the function pointer is the following.
    mov    eax, dword ptr [esp+4]
    add    eax, 3
    ret
sample 2
How to use jmp
    sum from 1 to n
#include <stdio.h>
#include <stdlib.h>
#include "xbyak/xbyak.h"
class Sample : public Xbyak::CodeGenerator {
public:
    Sample(int n)
    {
        mov(ecx, n); // -- (A)
        xor(eax, eax); // sum
        test(ecx, ecx);
        jz("exit"); // nothing to add when n == 0
        xor(edx, edx); // i
    L("lp");
        add(eax, edx);
        inc(edx);
        cmp(edx, ecx);
        jbe("lp"); // loop while i <= n
    L("exit");
        ret();
    }
};

int main(int argc, char *argv[])
{
    int n = argc < 2 ? 100 : atoi(argv[1]);
    try {
        Sample s(n);
        printf("1 + ... + %d = %d\n", n, ((int (*)())s.getCode())());
    } catch (Xbyak::Error err) {
        printf("ERR:%s(%d)\n", Xbyak::ConvertErrorToString(err), err);
    } catch (...) {
        printf("unknown error\n");
    }
    return 0;
}
In the Sample() constructor, Xbyak generates a function that returns the sum from 1 to n. When the line marked (A) is executed, the value of n is already determined, so Xbyak can assemble it as an immediate operand.
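To make the point about (A) concrete, here is a minimal sketch of what assembling "mov ecx, n" at run time amounts to: the current value of n is written into the instruction bytes as a constant. The helper encodeMovEcx is hypothetical and not part of Xbyak; the opcode B9 (B8 + register number 1 for ecx) followed by a little-endian 32-bit immediate is the standard IA-32 encoding.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of Xbyak: encode "mov ecx, imm32".
// The run-time value n becomes a constant inside the generated code.
std::vector<uint8_t> encodeMovEcx(uint32_t n)
{
    std::vector<uint8_t> code;
    code.push_back(0xB9);                                   // mov ecx, imm32
    for (int i = 0; i < 4; i++) {
        code.push_back(static_cast<uint8_t>(n >> (8 * i))); // little endian
    }
    return code;
}
```

For n = 100 this yields the bytes B9 64 00 00 00, so the generated function never has to load n from memory.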
Introduction to Xbyak
Where should Xbyak be used?
Xbyak is quite different from inline assembler. For example, if you write the following code by inline assembler,
void func(int n)
{
    __asm {
        mov eax, n
    }
}
then the compiler may automatically generate a stack frame like
    push    ebp
    mov     ebp, esp
    mov     eax, [ebp+8]
Xbyak does nothing like this. You must set up the stack frame yourself if necessary.
On the other hand, with inline assembler or a standard assembler you can't write "mov eax, n" as an immediate when the value of n is not determined at assemble time. You must write "mov ebx, [pointer to n] / mov eax, ebx".
quantize.cpp is an example of quantization, which is used in the encoding process of JPEG or MPEG. Quantization is an operation that divides each element of a given array by the corresponding element of another array.
void quantize(uint32 dest[64],
              const uint32 src[64], const uint32 qTbl[64])
{
    const int N = 64;
    for (int i = 0; i < N; i++) {
        dest[i] = src[i] / qTbl[i];
    }
}
qTbl[] is fixed while a JPEG is being encoded, but its content depends on the quality parameter.
Division is a much heavier operation than add/sub/mul, so we want to optimize it by avoiding division. For example, VC++ generates the following code.
// C
uint32 func(uint32 n)
{
    return n / 10;
}

// asm
    mov    eax, cccccccdH
    mul    DWORD PTR _n$[esp-4]
    mov    eax, edx
    shr    eax, 3
But this technique is possible only when the divisor is fixed at compile time, so the compiler can't apply it in quantize().
With Xbyak, we apply it while the code is running.
The Quantize::udiv() function generates optimized division code for a given divisor.
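The multiply-and-shift trick in the n / 10 listing above can also be derived at run time for an arbitrary divisor. The following is a minimal sketch under stated assumptions, not the code of Quantize::udiv(): the names Magic, findMagic and divByMagic are hypothetical, and the search uses the well-known condition 2^(32+s) <= mul * d <= 2^(32+s) + 2^s, which guarantees an exact quotient for every 32-bit unsigned n.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical types and names, not taken from Xbyak or quantize.cpp.
struct Magic {
    uint32_t mul;   // multiplier
    int shift;      // total right shift applied to the 64-bit product
    bool found;     // false when no 32-bit multiplier exists for d
};

// Search for (mul, shift) such that n / d == (n * mul) >> shift
// for every 32-bit unsigned n.
Magic findMagic(uint32_t d)
{
    if (d >= 2) {
        for (int s = 0; s < 32; s++) {
            const uint64_t pow2 = uint64_t(1) << (32 + s);
            const uint64_t m = (pow2 + d - 1) / d;       // ceil(2^(32+s) / d)
            if (m > 0xFFFFFFFFull) break;                // multiplier would need 33 bits
            if (m * d - pow2 <= (uint64_t(1) << s)) {    // rounding error is small enough
                return Magic{static_cast<uint32_t>(m), 32 + s, true};
            }
        }
    }
    return Magic{0, 0, false};
}

// Apply the multiplier: take the 64-bit product and shift it right.
uint32_t divByMagic(uint32_t n, const Magic& mg)
{
    return static_cast<uint32_t>((uint64_t(n) * mg.mul) >> mg.shift);
}
```

For d = 10 the search returns the multiplier 0xCCCCCCCD with a total shift of 35, which matches "mul" followed by "shr 3" on the high 32 bits in the listing. For divisors such as 7 no 32-bit multiplier satisfies the condition, so a generator built on this sketch would need another sequence or a plain div for them.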
Benchmark on Pentium D 2.8GHz + VC2005 Express Edition(unit:second).
Quantization speed: ordinary quantization compared with quantization optimized by Xbyak (unit: second)
quality   q = 1(low)   q = 10   q = 50   q = 100(high)
VC2005    8.0          8.0      8.0      8.0
Xbyak     1.6          0.8      0.5      0.5
The score of ordinary quantization is constant. It takes 8.0 * 2.8 * 10^9 / 64 / 10^7 = 35 clocks per division. On the other hand, the code generated by Xbyak takes 0.5 to 1.6 seconds, which is 5 to 16 times faster.
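The clock estimate can be reproduced with a few lines of arithmetic. This sketch reads the figures as a 2.8 * 10^9 Hz clock, 64 divisions per call, and 10^7 calls of quantize(); the last interpretation is an assumption, since the text does not say what the 10^7 factor counts. The function name clocksPerDivision is hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: average number of clocks spent per division.
// seconds          : measured time of the whole benchmark
// clockHz          : CPU clock frequency
// divisionsPerCall : divisions done by one call (64 table entries)
// calls            : number of calls (assumed to be the 10^7 factor)
double clocksPerDivision(double seconds, double clockHz,
                         double divisionsPerCall, double calls)
{
    return seconds * clockHz / divisionsPerCall / calls;
}
```

Under the same assumption, the Xbyak timings of 1.6 and 0.5 seconds correspond to about 7 and about 2.2 clocks per division.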
; generated code on q = 1.
    push        esi
    push        edi
    mov         edi,dword ptr [esp+0Ch]
    mov         esi,dword ptr [esp+10h]
    mov         eax,dword ptr [esi]
    shr         eax,4
    mov         dword ptr [edi],eax      ;  / 16
    mov         eax,dword ptr [esi+4]
    mov         edx,0BA2E8BA3h
    mul         eax,edx
    shr         edx,3                    ;  / 11

; generated code on q = 100
     push        esi
     push        edi
     mov         edi,dword ptr [esp+0Ch]
     mov         esi,dword ptr [esp+10h]
     mov         eax,dword ptr [esi]
     mov         dword ptr [edi],eax
     mov         eax,dword ptr [esi+4]    ; / 1
     mov         dword ptr [edi+4],eax
     mov         eax,dword ptr [esi+8]
     mov         dword ptr [edi+8],eax    ;  / 1
     mov         eax,dword ptr [esi+0Ch]
     mov         dword ptr [edi+0Ch],eax
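The two listings suggest that the generator picks a different instruction sequence for each divisor: a plain copy for / 1, a shift for a power of two such as / 16, and a multiply plus shift for other values such as / 11. A minimal sketch of such a decision follows. It is inferred from the listings only and is not the code of quantize.cpp; DivKind, DivPlan and planDivision are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical names; a sketch of the per-divisor decision only.
enum class DivKind { Copy, Shift, Multiply };

struct DivPlan {
    DivKind kind;
    int shift;   // shift count, meaningful only when kind == DivKind::Shift
};

DivPlan planDivision(uint32_t d)
{
    if (d == 0) throw std::invalid_argument("divisor must not be zero");
    if (d == 1) return DivPlan{DivKind::Copy, 0};        // x / 1 == x
    if ((d & (d - 1)) == 0) {                            // d is a power of two
        int k = 0;
        while ((uint32_t(1) << k) != d) k++;
        return DivPlan{DivKind::Shift, k};               // x / 2^k == x >> k
    }
    return DivPlan{DivKind::Multiply, 0};                // multiply by a reciprocal constant
}
```

For the first two table entries at q = 1, planDivision(16) selects a shift by 4 and planDivision(11) selects the multiply path, in line with the "shr eax,4" and "mul" instructions shown.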
polynomial calculation
calc.cpp is a tiny polynomial calculation sample with boost::spirit.
Create actions according to constant/variable x/add/sub/mul/div etc.
    void genPush(double n)
    {
        if (constTblPos_ == MAX_CONST_NUM) throw;
        constTbl_[constTblPos_] = static_cast<float>(n);
        if (regIdx_ == 7) throw;
        movss(Xbyak::Xmm(++regIdx_), ptr[edx+constTblPos_*sizeof(float)]);
        constTblPos_++; // advance to the next free slot of the constant table
    }
    void genSub(const char*, const char*)
    {
        subss(Xbyak::Xmm(regIdx_ - 1), Xbyak::Xmm(regIdx_)); regIdx_--;
    }
struct Grammar : public boost::spirit::grammar<Grammar> {
    FuncGen& f_;
    Grammar(FuncGen& f) : f_(f) { }
    template<typename ScannerT>
    struct definition {
        boost::spirit::rule<ScannerT> exp0, exp1, exp2, val;

        definition(const Grammar& self)
        {
            using namespace boost;
            using namespace boost::spirit;

            exp0 = exp1 >> *(('+' >> exp1)[bind(&FuncGen::genAdd, ref(self.f_), _1, _2)]
                           | ('-' >> exp1)[bind(&FuncGen::genSub, ref(self.f_), _1, _2)]);
            exp1 = exp2 >> *(('*' >> exp2)[bind(&FuncGen::genMul, ref(self.f_), _1, _2)]
                           | ('/' >> exp2)[bind(&FuncGen::genDiv, ref(self.f_), _1, _2)]);
            val = ch_p('x')[bind(&FuncGen::genX, ref(self.f_))];
            exp2 = real_p[bind(&FuncGen::genPush, ref(self.f_), _1)]
                 | val
                 | '(' >> exp0 >> ')';
        }
        const boost::spirit::rule<ScannerT>& start() const { return exp0; }
    };
};
Output values by generated function.
    void (*func)(float *ret, const float *x) = (void (*)(float *, const float*))funcGen.getCode();
    for (float x = 0; x < 10; x += 0.7f) {
        float y;
        func(&y, &x);
        printf("f(%f)=%f\n", x, y);
    }
For example, if you type "x+2*(x*x+3/x)", the following code is generated while the program is running.
    ; @param y [out] f(x)
    ; @param x [in] x
    ; void func(float *y, const float *x);
    mov         eax,dword ptr [esp+8]
    mov         edx,12FEA8h
    movss       xmm0,dword ptr [eax]
    movss       xmm1,dword ptr [edx]
    movss       xmm2,dword ptr [eax]
    movss       xmm3,dword ptr [eax]
    mulss       xmm2,xmm3
    movss       xmm3,dword ptr [edx+4]
    movss       xmm4,dword ptr [eax]
    divss       xmm3,xmm4
    addss       xmm2,xmm3
    mulss       xmm1,xmm2
    addss       xmm0,xmm1
    mov         eax,dword ptr [esp+4]
    movss       dword ptr [eax],xmm0
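The listing above follows from treating the xmm registers as an operand stack: each value is loaded into the next free register, and each binary operation combines the top two registers and pops one. The sketch below mimics that scheme with a text emitter so that it runs without Xbyak or boost::spirit. The class TextGen and its members are hypothetical, it prints simplified operands ("[eax]" instead of "dword ptr [eax]"), and the actions are called by hand in the order the grammar would trigger them.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical text emitter that mimics the register-stack scheme of
// genPush()/genSub(); it produces assembly text instead of machine code.
class TextGen {
    int regIdx_ = -1;                 // xmm register currently on top of the stack
    std::vector<float> constTbl_;     // constants, addressed through edx
    std::vector<std::string> out_;    // emitted lines

    static std::string xmm(int i) { return "xmm" + std::to_string(i); }
    void load(const std::string& src)
    {
        if (regIdx_ == 7) throw std::runtime_error("out of xmm registers");
        out_.push_back("movss " + xmm(++regIdx_) + "," + src);
    }
    void binary(const std::string& op)
    {
        if (regIdx_ < 1) throw std::runtime_error("operand stack underflow");
        out_.push_back(op + " " + xmm(regIdx_ - 1) + "," + xmm(regIdx_));
        regIdx_--;                    // the result stays in the lower register
    }
public:
    void genX() { load("[eax]"); }    // eax points to x
    void genPush(double n)
    {
        const std::size_t off = constTbl_.size() * sizeof(float);
        constTbl_.push_back(static_cast<float>(n));
        load(off == 0 ? std::string("[edx]") : "[edx+" + std::to_string(off) + "]");
    }
    void genAdd() { binary("addss"); }
    void genSub() { binary("subss"); }
    void genMul() { binary("mulss"); }
    void genDiv() { binary("divss"); }
    const std::vector<std::string>& lines() const { return out_; }
};
```

For "x+2*(x*x+3/x)" the actions fire in the order genX, genPush(2), genX, genX, genMul, genPush(3), genX, genDiv, genAdd, genMul, genAdd, which reproduces the register assignment of the movss/mulss/divss/addss sequence in the listing.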
>calc "x*x+3*x+5"
1st:2007/1/17, last update:2012/11/15

mailto:MITSUNARI Shigeo<herumi@nifty.com>