|
|
|
<html> |
|
<head> |
|
<meta charset="UTF-8"> |
|
</head> |
|
<style> |
|
|
|
table td { vertical-align: top; } |
|
|
|
.stack-trie { white-space: nowrap; font-family: monospace; } |
|
.stack-trie ul { padding-left: 1ch; } |
|
.stack-trie li { margin-left: 1ch; list-style-type: none; } |
|
.stack-trie .marker { |
|
cursor: pointer; |
|
} |
|
.stack-trie .marker.collapsed::before { |
|
content: "+ "; |
|
} |
|
.stack-trie .marker:not(.collapsed)::before { |
|
content: "- "; |
|
} |
|
.stack-trie a { text-decoration: none; } |
|
.stack-trie a:hover { text-decoration: underline; } |
|
.status-missing { background-color: purple; color: white; } |
|
.status-error { background-color: red; color: white; } |
|
.status-empty { background-color: white; color: black; } |
|
.status-ok { background-color: green; color: white; } |
|
.status-break { background-color: lime; color: black; } |
|
summary::-webkit-details-marker { color: #00ACF3; font-size: 125%; margin-right: 2px; } |
|
summary:focus { outline-style: none; } |
|
article > details > summary { font-size: 28px; margin-top: 16px; } |
|
details > p { margin-left: 24px; } |
|
details details summary { font-size: 16px; } |
|
|
|
</style> |
|
<script> |
|
|
|
function toggleList(toggleItem) { |
|
const listItem = toggleItem.parentNode; |
|
const nestedList = listItem.querySelector('ul'); |
|
if (nestedList) { |
|
nestedList.style.display = nestedList.style.display === 'none' ? 'block' : 'none'; |
|
|
|
|
|
toggleItem.classList.toggle('collapsed'); |
|
} |
|
} |
|
|
|
</script> |
|
<body> |
|
<div> |
|
|
|
<h2>Stack trie</h2> |
|
<p> |
|
The <strong>stack trie</strong> gives you a quick orientation on where all the |
|
compilations in a model take place, which is especially useful when you are compiling an unfamiliar codebase. |
|
It is a tree of stack frames covering every stack that triggered PT2 compilation. If only a single |
|
stack is in the tree, you will simply see a plain list of frames (most recent call last). With |
|
multiple stacks, at every point where two stacks diverge from a common prefix, we increase |
|
the indentation of the list and show a separate sub-list per sub-tree. |
|
</p> |
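<p>
As a rough illustration of the structure described above, the following JavaScript sketch builds a trie from a few call stacks and renders it so that indentation grows only where stacks diverge. The names <code>buildStackTrie</code> and <code>renderStackTrie</code> and the sample frames are hypothetical; this is not the code that generated this page.
</p>

```javascript
// Hypothetical sketch: build a trie of stack frames from several stacks.
// Each stack is an array of frame strings, outermost call first.
function buildStackTrie(stacks) {
  const root = { frame: null, children: new Map() };
  for (const stack of stacks) {
    let node = root;
    for (const frame of stack) {
      // Reuse the existing child when the prefix is shared.
      if (!node.children.has(frame)) {
        node.children.set(frame, { frame, children: new Map() });
      }
      node = node.children.get(frame);
    }
  }
  return root;
}

// Render the trie as lines of text. Indentation only grows at nodes with
// more than one child, so a single stack prints as a flat list.
function renderStackTrie(node, depth = 0, lines = []) {
  const branching = node.children.size > 1;
  for (const child of node.children.values()) {
    const childDepth = branching ? depth + 1 : depth;
    lines.push(" ".repeat(childDepth) + child.frame);
    renderStackTrie(child, childDepth, lines);
  }
  return lines;
}

// Two made-up stacks that share a two-frame prefix and then diverge.
const trieLines = renderStackTrie(buildStackTrie([
  ["main.py:10 in main", "model.py:5 in forward", "ops.py:3 in add"],
  ["main.py:10 in main", "model.py:5 in forward", "ops.py:9 in mul"],
]));
```

<p>
With the two sample stacks, the shared frames stay at the outer level and only the two diverging frames are indented by one step.
</p>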
|
<p> |
|
Links to a particular compilation are color-coded by status: |
|
<span class="status-ok">[Success]</span>, |
|
<span class="status-break">[Success with restart (e.g., graph break)]</span>, |
|
<span class="status-empty">[Empty graph]</span>, |
|
<span class="status-error">[Error]</span>, |
|
<span class="status-missing">[Metrics were missing]</span> |
|
</p> |
|
<details><summary>Stack</summary><div class='stack-trie'><ul><li>/shared_volume/repos/quark/bench_qdq.py:161 in &lt;module&gt;<br> mean, median = do_bench(run_scaled_fake_quantize_comp, kwargs_scaled_fake_quantize, num_runs=num_runs, num_warmup=num_warmup, name="quark qdq")</li> |
|
<li>/shared_volume/repos/quark/bench_qdq.py:70 in do_bench<br> f(**kwargs)</li> |
|
<li><a href='#[0/0]' class='status-ok'>[0/0]</a> /shared_volume/repos/quark/bench_qdq.py:7 in run_scaled_fake_quantize<br> </li> |
|
</ul></div></details> |
|
</div> |
|
<div> |
|
|
|
<h2>IR dumps</h2> |
|
<p> |
|
The <strong>IR dumps</strong> collect intermediate products dumped at various points of the PT2 |
|
compilation process. The products are organized by compile id, and then sorted in chronological |
|
order. |
|
</p> |
|
<p> |
|
A <strong>compile id</strong> uniquely identifies a particular compilation inside a PT2 |
|
program. It is traditionally written as <code>[x/y]</code>, where the <strong>frame id</strong> x |
|
identifies the particular Python frame which we are compiling, and <strong>frame compile |
|
id</strong> y identifies how many times we've recompiled this same frame. For example, |
|
<code>[0/0]</code> refers to the very first frame compiled by PT2; <code>[0/1]</code> refers to the |
|
first recompilation of this frame, while <code>[1/0]</code> refers to a different frame, with a |
|
distinct code cache, which we are compiling next (perhaps because of a graph break). Although |
|
Dynamo treats distinct frames as completely unrelated, a frame compilation could overlap with another |
|
frame; for example, if you graph break in an inlined function, Dynamo will typically try to compile |
|
the nested frame again on an inner frame. You can identify the hierarchical relationship between |
|
frames by looking at the stack trie above. |
|
</p> |
|
<p> |
|
In some situations, the compile id will have an extra signifier <code>[x/y_z]</code>, where z is the |
|
<strong>attempt</strong> for this particular (re)compilation. Certain conditions will cause Dynamo to |
|
restart analysis when it discovers that it needs to undo a decision it previously made. The most |
|
common cause of a restart is a graph break in an inlined function call, which forces Dynamo to restart |
|
and avoid inlining the function in the first place. |
|
</p> |
|
<p> |
|
When compiled autograd is enabled, the compile id will include a prefix signifier <code>[!a/x/y]</code>, |
|
where a is the <strong>compiled autograd id</strong>. For instance, <code>[!0/-/-]</code> refers |
|
to the first graph captured by compiled autograd. It is then traced by torch.compile as <code>[!0/x/y_z]</code>. |
|
</p> |
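<p>
To make the notation above concrete, here is a minimal JavaScript sketch that splits a compile id string into its parts. The helper name <code>parseCompileId</code>, the field names, and the regular expression are assumptions made for illustration; they cover only the forms described on this page (<code>[x/y]</code>, <code>[x/y_z]</code>, <code>[!a/x/y]</code>, <code>[!a/x/y_z]</code>, and <code>-</code> placeholders) and are not PT2's own parser.
</p>

```javascript
// Hypothetical helper: parse a compile id such as "[0/1_2]" or "[!0/-/-]".
// Returns null when the string does not match the forms described above.
function parseCompileId(text) {
  // Optional "!a/" prefix, then frame id, frame compile id, optional "_z".
  const pattern = /^\[(?:!(\d+)\/)?(\d+|-)\/(\d+|-)(?:_(\d+))?\]$/;
  const match = pattern.exec(text);
  if (match === null) return null;
  // Absent parts and "-" placeholders are both reported as null.
  const toNumber = (s) => (s === undefined || s === "-" ? null : Number(s));
  return {
    compiledAutogradId: toNumber(match[1]),
    frameId: toNumber(match[2]),
    frameCompileId: toNumber(match[3]),
    attempt: toNumber(match[4]),
  };
}

const first = parseCompileId("[0/0]");
const restarted = parseCompileId("[1/0_1]");
const autogradOnly = parseCompileId("[!0/-/-]");
```

<p>
In this sketch a missing attempt suffix is reported as <code>null</code> rather than guessed, since this page only says the suffix appears "in some situations".
</p>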
|
<p> |
|
Here is a high-level description of PT2's compilation phases, and the intermediate products each |
|
phase generates: |
|
</p> |
|
<ol> |
|
<li><em>Optional:</em> If compiled autograd is enabled, and we are processing a backward call, compiled autograd will trace the autograd graph from the autograd engine, and produce an FX graph <code>compiled_autograd_graph</code> that Dynamo will then trace. Otherwise, Dynamo will directly trace the user's bytecode.</li> |
|
<li>Dynamo symbolically evaluates the Python bytecode of a program, producing <code>dynamo_output_graph</code>.</li> |
|
<li><em>Optional:</em> If <code>optimize_ddp</code> is enabled, the DDPOptimizer will split the Dynamo output graph to improve the pipelining of communications. Each split subgraph is <code>optimize_ddp_split_child_submod</code>, and the high-level graph that plumbs the subgraphs together is <code>optimize_ddp_split_graph</code>. If there are multiple splits, each subsequent build product will be produced multiple times, one for each split.</li> |
|
<li>AOTAutograd traces the (possibly split) Dynamo output graph, producing an <code>aot_joint_graph</code> if backwards is enabled. It then partitions the graph into <code>aot_forward_graph</code> and <code>aot_backward_graph</code>. If training is not needed, there may only be an <code>aot_inference_graph</code>.</li> |
|
<li>Inductor will apply some post-grad FX passes, producing <code>inductor_post_grad_graph</code>.</li> |
|
<li>Inductor will perform code generation, producing the final <code>inductor_output_code</code> which will be executed at runtime. This output is a valid Python program and can be directly run.</li> |
|
</ol> |
|
|
|
|
|
<h2>Chromium Events</h2> |
|
<p> |
|
PT2 generates <a href='chromium_events.json'>Chromium Trace Events</a> in JSON for specific events during compilation. |
|
You can download and view them in a tool like <a href='https://ui.perfetto.dev/'>Perfetto</a>. |
|
</p> |
|
|
|
<p> |
|
Build products below: |
|
</p> |
|
<ul> |
|
|
|
<li><a id="[0/0]">[0/0]</a> |
|
<ul> |
|
|
|
<li><a href="-_0_0_0/dynamo_output_graph_0.txt">-_0_0_0/dynamo_output_graph_0.txt</a> (0)</li> |
|
|
|
<li><a href="-_0_0_0/inductor_pre_grad_graph_1.txt">-_0_0_0/inductor_pre_grad_graph_1.txt</a> (1)</li> |
|
|
|
<li><a href="-_0_0_0/before_recompile_pre_grad_2.txt">-_0_0_0/before_recompile_pre_grad_2.txt</a> (2)</li> |
|
|
|
<li><a href="-_0_0_0/after_recompile_pre_grad_3.txt">-_0_0_0/after_recompile_pre_grad_3.txt</a> (3)</li> |
|
|
|
<li><a href="-_0_0_0/aot_forward_graph_fw_metadata_4.txt">-_0_0_0/aot_forward_graph_fw_metadata_4.txt</a> (4)</li> |
|
|
|
<li><a href="-_0_0_0/aot_inference_graph_5.txt">-_0_0_0/aot_inference_graph_5.txt</a> (5)</li> |
|
|
|
<li><a href="-_0_0_0/torch._functorch.config_6.txt">-_0_0_0/torch._functorch.config_6.txt</a> (6)</li> |
|
|
|
<li><a href="-_0_0_0/inductor_output_code_ch44xxkifazlcpkp6mi44xhqeej2j5mbgwmesiwx6y3oajzmixxp_7.html">-_0_0_0/inductor_output_code_ch44xxkifazlcpkp6mi44xhqeej2j5mbgwmesiwx6y3oajzmixxp_7.html</a> (7)</li> |
|
|
|
<li><a href="-_0_0_0/fx_graph_cache_hit_8.json">-_0_0_0/fx_graph_cache_hit_8.json</a> ✅ (8)</li> |
|
|
|
<li><a href="-_0_0_0/aotautograd_cache_bypass_9.json">-_0_0_0/aotautograd_cache_bypass_9.json</a> ❓ (9)</li> |
|
|
|
<li><a href="-_0_0_0/dynamo_cpp_guards_str_10.txt">-_0_0_0/dynamo_cpp_guards_str_10.txt</a> (10)</li> |
|
|
|
<li><a href="-_0_0_0/compilation_metrics_11.html">-_0_0_0/compilation_metrics_11.html</a> (11)</li> |
|
|
|
</ul> |
|
</li> |
|
|
|
</ul> |
|
</div> |
|
|
|
|
|
|
|
|
|
|
|
|
|
<script> |
|
document.addEventListener('DOMContentLoaded', function() { |
|
|
|
|
|
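// Propagate the current page's query parameters to every relative link. |
|
const queryParams = new URLSearchParams(window.location.search); |
|
// Nothing to propagate, so leave the links untouched. |
|
if (queryParams.size === 0) return; |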
|
|
|
function appendQueryParams(url) { |
|
// Resolve the (possibly relative) href against the current page URL. |
|
const newURL = new URL((new Request(url)).url); |
|
const newSearchParams = new URLSearchParams(newURL.searchParams); |
|
|
|
|
|
for (const [key, value] of queryParams) { |
|
newSearchParams.set(key, value); |
|
} |
|
|
|
newURL.search = newSearchParams; |
|
return newURL; |
|
} |
|
|
|
|
|
const relativeLinks = document.querySelectorAll('a[href]:not([href^="http://"]):not([href^="https://"]):not([href^="\#"])'); |
|
|
|
|
|
relativeLinks.forEach((link) => { |
|
link.setAttribute("href", appendQueryParams(link.getAttribute("href"))); |
|
}); |
|
}); |
|
</script> |
|
|
|
</body> |
|
</html> |
|
|