Seokho Song
8 min read

Recently, I expanded my contributions from the JS surface layer down to the lower-level compiler pipeline: the part that takes bytecode and produces machine instructions, or produces a new graph for a further compiler stage.

These are Maglev and Turbolev.

This post describes how I expanded my contribution area and merged my first four CLs for the Turbolev project.

Background

V8 and JIT

V8 is the JavaScript engine that powers Chrome, Node.js, Deno, and more. It has a multi-tier JIT compiler. Just-in-time (JIT) compilation is compilation (of computer code) during execution of a program (at run time) rather than before execution. [FYI: Wikipedia]

For example, the code below is compiled to bytecode and executed in V8’s virtual machine, i.e., not as actual machine instructions.

Code:

function add(a, b) {
  return a + b;
}

add(1, 2);

Bytecode:


devsdk@dave ~/workspace/chromium/v8/v8 % ./out/arm64.release/d8 ~/workspace/chromium/playground/add.js --print-bytecode
[generated bytecode for function:  (0x2f3e0005d7cd <SharedFunctionInfo>)]
Bytecode length: 28
Parameter count 1
Register count 4
Frame size 32
         0xaef00100084 @    0 : 13 00             LdaConstant [0]
         0xaef00100086 @    2 : cf                Star1
         0xaef00100087 @    3 : 1b fe f7          Mov <closure>, r2
         0xaef0010008a @    6 : 6d 6f 01 f8 02    CallRuntime [DeclareGlobals], r1-r2
         0xaef0010008f @   11 : 23 01 00          LdaGlobal [1], [0]
         0xaef00100092 @   14 : cf                Star1
         0xaef00100093 @   15 : 0d 01             LdaSmi [1]
         0xaef00100095 @   17 : ce                Star2
         0xaef00100096 @   18 : 0d 02             LdaSmi [2]
         0xaef00100098 @   20 : cd                Star3
         0xaef00100099 @   21 : 6b f8 f7 f6 02    CallUndefinedReceiver2 r1, r2, r3, [2]
         0xaef0010009e @   26 : d0                Star0
         0xaef0010009f @   27 : b5                Return
Constant pool (size = 2)
0xaef0010004d: [TrustedFixedArray]
 - map: 0x2f3e00000605 <Map(TRUSTED_FIXED_ARRAY_TYPE)>
 - length: 2
           0: 0x2f3e0005d81d <FixedArray[2]>
           1: 0x2f3e00003e8d <String[3]: #add>
Handler Table (size = 0)
Source Position Table (size = 0)
[generated bytecode for function: add (0x2f3e0005d82d <SharedFunctionInfo add>)]
Bytecode length: 6
Parameter count 3
Register count 0
Frame size 0
         0xaef001000c8 @    0 : 0b 04             Ldar a1
         0xaef001000ca @    2 : 3f 03 00          Add a0, [0]
         0xaef001000cd @    5 : b5                Return
Constant pool (size = 0)
Handler Table (size = 0)
Source Position Table (size = 0)

This bytecode acts as the machine instructions for V8’s virtual machine.

When a function becomes “hot” enough (is called frequently), the JIT compiler optimizes it into native machine instructions for faster execution.

function add(a, b) {
  return a + b;
}

%PrepareFunctionForOptimization(add);
add(1,2);
%OptimizeFunctionOnNextCall(add);
add(1,2);

The function add is compiled as:


devsdk@dave ~/workspace/chromium/v8/v8 % ./out/arm64.release/d8 ~/workspace/chromium/playground/add.js  --allow-natives-syntax --turbofan --print-opt-code
--- Raw source ---
(a, b) {
  return a + b;
}


--- Optimized code ---
optimization_id = 0
source_position = 12
kind = TURBOFAN_JS
name = add
compiler = turbofan
address = 0x2be2001001d5

Instructions (size = 288)
0x140000020     0  a9bf7bfd       stp fp, lr, [sp, #-16]!
0x140000024     4  910003fd       mov fp, sp
0x140000028     8  a9be03ff       stp xzr, x0, [sp, #-32]!
0x14000002c     c  a9016fe1       stp x1, cp, [sp, #16]
0x140000030    10  f8520342       ldur x2, [x26, #-224]



  .-')   .-. .-')             _ (`-.    _ (`-.    ('-.  _ .-') _   
 ( OO ). \  ( OO )           ( (OO  )  ( (OO  ) _(  OO)( (  OO) )  
(_)---\_),--. ,--.  ,-.-')  _.`     \ _.`     \(,------.\     .'_  
/    _ | |  .'   /  |  |OO)(__...--''(__...--'' |  .---',`'--..._) 
\  :` `. |      /,  |  |  \ |  /  | | |  /  | | |  |    |  |  \  ' 
 '..`''.)|     ' _) |  |(_/ |  |_.' | |  |_.' |(|  '--. |  |   ' | 
.-._)   \|  .   \  ,|  |_.' |  .___.' |  .___.' |  .--' |  |   / : 
\       /|  |\   \(_|  |    |  |      |  |      |  `---.|  '--'  / 
 `-----' `--' '--'  `--'    `--'      `--'      `------'`-------' 
 
 /* There are more than 100 lines of machine code; SKIPPED */

0x14000013c   11c  97fffffb       bl #-0x14 (addr 0x140000128)

Inlined functions (count = 0)

Deoptimization Input Data (deopt points = 4)
 index  bytecode-offset    pc
     0                2    NA
     1                2    NA
     2                2    NA
     3               -1    9c

Safepoints (stack slots = 6, entries = 2, byte size = 24)
0x1400000bc     9c  slots (sp->fp): 100000  deopt      3 trampoline:    11c
0x1400000e4     c4  slots (sp->fp): 000000

RelocInfo (size = 13)
0x1400000b4  full embedded object  (0x0263000446c5 <NativeContext[301]>)
0x1400000b8  near builtin entry
0x1400000e0  near builtin entry
0x140000110  constant pool (size 16)

--- End code ---

The demonstration above used V8’s current top-tier JIT pipeline: Turbofan and Turboshaft.

Maglev and Turbofan

As mentioned, V8 has a multi-tier JIT compiler. Maglev is a mid-tier JIT compiler. Turbofan and Turboshaft produce highly optimized code but take longer to compile.

Compiler   Compile Speed   Optimization Level    Structure
Maglev     Fast            Lower optimization    Simple
Turbofan   Slower          Higher optimization   Complex

V8’s tiering manager decides which compiler to use for a given function based on its decision logic.
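
To illustrate the idea, here is a minimal sketch of counter-based tier-up. This is not V8’s actual code; the names and thresholds are invented.

// Not V8's real tiering code: a minimal sketch with invented names
// and thresholds, just to illustrate counter-based tier-up.
enum class Tier { kIgnition, kMaglev, kTurbofan };

Tier ChooseTier(int invocation_count) {
  constexpr int kMaglevThreshold = 1000;     // invented threshold
  constexpr int kTurbofanThreshold = 10000;  // invented threshold
  if (invocation_count >= kTurbofanThreshold) return Tier::kTurbofan;
  if (invocation_count >= kMaglevThreshold) return Tier::kMaglev;
  return Tier::kIgnition;  // keep running interpreted bytecode
}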

Turbolev

The name comes from Maglev + Turboshaft. As you might guess, it’s a project aimed at replacing the Turbofan compiler frontend, using Maglev’s IR as the new starting point.

The previous top-tier JIT pipeline used Turbofan as its frontend and Turboshaft as its backend. Turbofan employed a Sea of Nodes representation for optimization, rather than the more conventional CFG (Control Flow Graph)-based approach. While theoretically powerful, it had some practical drawbacks in complexity and maintainability.

To address these issues, the idea is to build the Turboshaft backend graph from the mid-tier compiler, Maglev.

For more about the background and the beginnings of this project, read Land ahoy: leaving the Sea of Nodes · V8 by Darius :).

How did I join this project?

For a few years, I contributed to the JS feature surface layer of the V8 project. However, I realized that memory management and the actual JIT compilers were still a ‘black box’ to me. Meanwhile, opportunities to implement new ECMAScript features became harder to find as an external contributor. So I decided to expand my contribution area, both to look inside the black box and to broaden my opportunities.

I remembered that Darius had reviewed the JIT part of my Float16Array CLs, and he had recently published an article about Turbolev on v8.dev. I sent him an email to see if I could help with something.

Hi Darius,

I’ve noticed that it’s getting a bit tough to find contribution opportunities related to ECMAScript features as an external contributor. So I was wondering—are there any areas in the compiler side (like Turboshaft or Maglev), or GC/memory stuff, where I could possibly help out?

I’d love to learn more and hopefully get involved in a broader range of work in V8.

Regards,

Seokho

He generously suggested the Turbolev project! (With a very detailed explanation :))

Following his guidance, I decided to join the project. First, I wanted to find the easiest task.

The first target was Math.atan2. It seemed like a good candidate because there was an existing code path for Math.pow that handled a similar IEEE 754 binary operation.

I created a draft CL and emailed some questions about it to Darius, Marja, Victor, and Leszek. They kindly explained things to me and helped me run Pinpoint, the performance measurement tool for Googlers.

Contributions

Starting from optimizing Math.atan2, I contributed four CLs.


I optimized:

  1. Math.atan2
  2. Math.sqrt
  3. Array.prototype.at

… and some refactors.

I’ll introduce one of them:

Implement Math.sqrt in Turbolev - support the Math.sqrt function in V8’s mid- and top-tier compilers.

It brought about a 61% performance improvement.


In the code review, Victor said it was amazing!


Let me explain how I optimized this.

First, I added a Maglev IR node.

maglev-ir.h

class Float64Sqrt : public FixedInputValueNodeT<1, Float64Sqrt> {
  using Base = FixedInputValueNodeT<1, Float64Sqrt>;

 public:
  explicit Float64Sqrt(uint64_t bitfield) : Base(bitfield) {}

  static constexpr OpProperties kProperties = OpProperties::HoleyFloat64();
  static constexpr
      typename Base::InputTypes kInputTypes{ValueRepresentation::kHoleyFloat64};

  Input& input() { return Node::input(0); }

  void SetValueLocationConstraints();
  void GenerateCode(MaglevAssembler*, const ProcessingState&);
  void PrintParams(std::ostream&) const;
};

Then I registered MathSqrt in the Maglev graph builder, maglev-graph-builder.h:

...
  V(MathRound)                                 \
  V(MathSqrt)                                  \
  V(MathClz32)                                 \
  V(SetPrototypeHas)                           \
...

That macro generates a reducer function declaration, so the C++ compiler told me to create an implementation.
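
Roughly, the V(MathSqrt) entry expands into a declaration like this (a simplification on my part; the actual macro does a bit more):

MaybeReduceResult TryReduceMathSqrt(compiler::JSFunctionRef target,
                                    CallArguments& args);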

maglev-graph-builder.cc

MaybeReduceResult MaglevGraphBuilder::TryReduceMathSqrt(
    compiler::JSFunctionRef target, CallArguments& args) {
  if (args.count() < 1) {
    return GetRootConstant(RootIndex::kNanValue);
  }

  if (!CanSpeculateCall()) {
    ValueRepresentation rep = args[0]->properties().value_representation();
    if (rep == ValueRepresentation::kTagged) return {};
  }

  ValueNode* value =
      GetFloat64ForToNumber(args[0], NodeType::kNumber,
                            TaggedToFloat64ConversionType::kNumberOrUndefined);

  return AddNewNode<Float64Sqrt>({value});
}

This reducer attempts to simplify or convert a high-level operation (like a JS call to Math.sqrt) into a more optimized, lower-level IR node (Float64Sqrt), which will be used by Turbolev for the top-tier JIT and by Maglev for the mid-tier JIT.

For the mid-tier Maglev, on the arm64 architecture:

maglev-ir-arm64.cc

void Float64Sqrt::SetValueLocationConstraints() {
  UseRegister(input());
  DefineSameAsFirst(this);
}
void Float64Sqrt::GenerateCode(MaglevAssembler* masm,
                               const ProcessingState& state) {
  DoubleRegister value = ToDoubleRegister(input());
  DoubleRegister result_register = ToDoubleRegister(result());
  __ fsqrt(result_register, value);
}

Now, when V8 JIT-compiles this code, the Float64Sqrt node will generate the fsqrt machine instruction on arm64. Similarly, on x64 it will generate the sqrtsd SSE instruction.
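
For reference, here is a sketch of what the x64 counterpart could look like, modeled on the arm64 version above; the real maglev-ir-x64.cc may differ in its details.

void Float64Sqrt::SetValueLocationConstraints() {
  UseRegister(input());
  DefineSameAsFirst(this);
}
void Float64Sqrt::GenerateCode(MaglevAssembler* masm,
                               const ProcessingState& state) {
  DoubleRegister value = ToDoubleRegister(input());
  DoubleRegister result_register = ToDoubleRegister(result());
  // Emit the SSE2 square-root instruction mentioned above.
  __ Sqrtsd(result_register, value);
}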

For the top-tier Turbolev compilation:

The Float64Sqrt node is consumed by the turbolev-graph-builder.

turbolev-graph-builder.cc

  maglev::ProcessResult Process(maglev::Float64Sqrt* node,
                                const maglev::ProcessingState& state) {
    SetMap(node, __ Float64Sqrt(Map(node->input())));
    return maglev::ProcessResult::kContinue;
  }

This creates a Float64Sqrt Turboshaft graph node, using the corresponding node from the Maglev graph as input. From here, Turboshaft’s powerful CFG-based backend can perform further optimizations like loop unrolling, store elimination, and advanced register allocation.
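
As far as I understand, the __ Float64Sqrt(...) assembler helper above is essentially sugar for emitting a generic float unary operation into the Turboshaft graph, conceptually something like the following (not the literal Turboshaft source):

// Conceptual sketch: the helper builds a FloatUnaryOp node with the
// Sqrt kind on a Float64 input. Exact names in V8 may differ.
V<Float64> Float64Sqrt(V<Float64> input) {
  return FloatUnary(input, FloatUnaryOp::Kind::kSqrt,
                    FloatRepresentation::Float64());
}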

So, when Math.sqrt becomes hot and stable, the tiering manager can trigger a Turbolev compilation. (While this will likely generate the same machine instruction for the square root itself, the key benefit is that Turboshaft can now apply its powerful optimizations to all the surrounding code.)

Concluding

Through this work, I’ve gained much more familiarity with the Maglev and Turbolev areas of V8. There are still many unimplemented functions and opportunities to help. I’ll continue contributing to get more comfortable here, and then expand into other areas like Wasm, memory, or something else!

Thanks to Darius, Marja, Victor, and Leszek for the great guidance and code reviews :).



To Reader

Thank you for reading this post! If you find an error, a misspelling, or anything awkward, please feel free to contribute here!