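<p>A few days ago I read <a href="https://xz.aliyun.com/t/5863" target="_blank">Understanding XSS Through Browser Decoding</a> and did not fully follow it, so I looked up <a href="http://bobao.360.cn/learning/detail/292.html" target="_blank">Deep Dive into Browser Parsing and XSS Vector Encoding</a>. That article is a translation and reads oddly in places, so I also had to consult the <a href="https://www.attacker-domain.com/2013/04/deep-dive-into-browser-parsing-and-xss.html" target="_blank">original</a>. After two days of chewing on it, I finally understood it.</p><p>The original gives several XSS payloads, along with the <a href="http://test.attacker-domain.com/browserparsing/answers.txt" target="_blank">answers</a> and a <a href="http://test.attacker-domain.com/browserparsing/tests.html" target="_blank">demo page</a>. The answers come without explanations, so I analyze them one by one below.</p><p>It is a bit like being back in school: the textbook makes no sense and the exercises seem impossible, until you work through them with the worked solutions in hand.</p><h2 id="toc-0">Basics</h2><h3 id="toc-1">1</h3><div class="highlight"><pre><a&nbsp;href="%6a%61%76%61%73%63%72%69%70%74:%61%6c%65%72%74%28%31%29"></a></pre></div><p>URL encoded "javascript:alert(1)"</p><p>Answer: The javascript will NOT execute.</p><p>There is no HTML-encoded content here, so HTML decoding changes nothing. The value of href is a URL, so it is handed straight to the URL parser. The scheme cannot be recognized, because <code>javascript:</code> itself is percent-encoded, so the value is never treated as a <code>javascript:</code> URL and nothing executes.</p><p>The URL standard requires the scheme, username, and password to be ASCII strings, and the scheme is identified before any percent-decoding takes place, so an encoded scheme is simply not a scheme.</p><p>A URL’s scheme is an ASCII string that identifies the 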
type of URL and can be used to dispatch a URL for further processing 
after parsing. It is initially the empty string.<br/>A URL’s username is an ASCII string identifying a username. It is initially the empty string.<br/>A URL’s password is an ASCII string identifying a password. It is initially the empty string.</p><p>from <a href="https://url.spec.whatwg.org/#concept-url" target="_blank">https://url.spec.whatwg.org/#concept-url</a></p><h3 id="toc-2">2</h3><div class="highlight"><pre><a&nbsp;href="&#x6a;&#x61;&#x76;&#x61;&#x73;&#x63;&#x72;&#x69;&#x70;&#x74;:%61%6c%65%72%74%28%32%29"></a></pre></div><p>Character entity encoded "javascript" and URL encoded "alert(2)"</p><p>Answer: The javascript will execute.</p><p>HTML decoding happens first, giving</p><p><code><a href="javascript:%61%6c%65%72%74%28%32%29"></code></p><p>The href value is a URL, and the URL parser now recognizes the <code>javascript</code> scheme. The rest of the URL is then percent-decoded, giving</p><p><code><a href="javascript:alert(2)"></code></p><p>Because the scheme is javascript, the decoded result is handed to the JavaScript engine and executes.</p><h3 id="toc-3">3</h3><div class="highlight"><pre><a&nbsp;href="javascript%3aalert(3)"></a></pre></div><p>URL encoded ":"</p><p>Answer: The javascript will NOT execute.</p><p>Same as 1: the colon is part of the scheme delimiter, and with it percent-encoded the URL parser finds no <code>javascript:</code> scheme.</p><h3 id="toc-4">4</h3><div class="highlight"><pre><div>&#60;img&nbsp;src=x&nbsp;onerror=alert(4)&#62;</div></pre></div><p>Character entity encoded < and ></p><p>Answer: The javascript will NOT execute.</p><p>This one contains HTML-encoded content. Think about it from the developer's side: HTML encoding exists precisely so that these special characters can be displayed without interfering with normal DOM parsing. The content therefore does not become an img element and nothing executes.</p><p>In terms of the HTML parsing mechanism: after reading <code><div></code> the tokenizer is in the data state. <code>&#60;</code> is HTML-decoded there, but the decoded character does not switch the tokenizer into the tag open state, so no <code>img</code> element is created and nothing executes.</p><h3 id="toc-5">5</h3><div class="highlight"><pre><textarea>&#60;script&#62;alert(5)&#60;/script&#62;</textarea></pre></div><p>Character entity encoded < and ></p><p>Answer: The javascript will NOT execute AND the character entities will NOT<br/>be decoded either</p><p><code><textarea></code> is an <code>RCDATA</code> element. It can contain text and character references, but note that it <strong>cannot contain other elements</strong>. (The quoted answer says the entities are not decoded, but per the HTML standard character references in RCDATA content are decoded; the result is treated as text, not markup.) HTML decoding gives</p><p><code><textarea><script>alert(5)</script></textarea></code></p><p>so the decoded text is simply displayed inside the textarea.</p><p>The <code>RCDATA</code> elements are <code>textarea</code> and <code>title</code>.</p><h3 id="toc-6">6</h3><div class="highlight"><pre><textarea><script>alert(6)</script></textarea></pre></div><p>Answer: The javascript will NOT execute.</p><p>Same as 5: a textarea cannot contain child elements, so the script tag is just text.</p><h2 id="toc-7">Advanced</h2><h3 id="toc-8">7</h3><div class="highlight"><pre><button&nbsp;onclick="confirm('7&#39;);">Button</button></pre></div><p>Character entity encoded '</p><p>Answer: The javascript will execute.</p><p>The content of <code>onclick</code> is an attribute value (compare <code>href</code> in 2), so it is HTML-decoded, giving</p><p><code><button onclick="confirm('7');">Button</button></code></p><p>which then executes.</p><h3 id="toc-9">8</h3><div class="highlight"><pre><button&nbsp;onclick="confirm('8\u0027);">Button</button></pre></div><p>Unicode escape sequence encoded '</p><p>Answer: The javascript will NOT execute.</p><p>The value of <code>onclick</code> is handed to the JavaScript engine. In JavaScript, Unicode escape sequences are only meaningful inside string literals (and similar literals) and <a href="https://developer.mozilla.org/zh-CN/docs/Glossary/Identifier" target="_blank">identifiers</a>. The closing <code>'</code> is a delimiter, not part of either. Inside the string, <code>\u0027</code> is just a quote character in the string's value, so the literal is never terminated and the script fails with a syntax error.</p><p>In string literals, regular expression literals, template
 literals and identifiers, any Unicode code point may also be expressed 
using Unicode escape sequences that explicitly express a code point's 
numeric value.</p><p>from <a href="https://www.ecma-international.org/ecma-262/10.0/index.html#sec-ecmascript-language-source-code" target="_blank">https://www.ecma-international.org/ecma-262/10.0/index.html#sec-ecmascript-language-source-code</a> (this link loads very slowly)</p><p>Identifier<br/>A sequence of characters in the code that identifies a variable, function, or property.<br/>In JavaScript, identifiers can contain only letters, digits, underscores ("_"), or dollar signs ("$"), and cannot start with a digit. An identifier differs from a string in that a string is data, while an identifier is part of the code. In JavaScript, there is no way to convert identifiers to strings, but sometimes it is possible to parse strings into identifiers.</p><p>from <a href="https://developer.mozilla.org/zh-CN/docs/Glossary/Identifier" target="_blank">https://developer.mozilla.org/zh-CN/docs/Glossary/Identifier</a></p><h3 id="toc-10">9</h3><div class="highlight"><pre><script>&#97;&#108;&#101;&#114;&#116;&#40;&#57;&#41;&#59;</script></pre></div><p>Character entity encoded alert(9);</p><p>Answer: The javascript will NOT execute.</p><p><code>script</code> is a raw text element. It <strong>can contain only text</strong>, and note that <strong>character references are not decoded</strong> in it. The content goes to the JavaScript engine as written, the engine cannot make sense of it, and execution fails.</p><p>The raw text elements are <code><script></code> and <code><style></code>.</p><h3 id="toc-11">10</h3><div class="highlight"><pre><script>\u0061\u006c\u0065\u0072\u0074(10);</script></pre></div><p>Unicode Escape sequence encoded alert</p><p>Answer: The javascript will execute.</p><p>Same rule as 8: the function name <code>alert</code> is an identifier, so the escapes are allowed and the script executes.</p><h3 id="toc-12">11</h3><div class="highlight"><pre><script>\u0061\u006c\u0065\u0072\u0074\u0028\u0031\u0031\u0029</script></pre></div><p>Unicode Escape sequence encoded alert(11)</p><p>Answer: The javascript will NOT execute.</p><p>Same rule as 8: the parentheses are punctuators, not part of a string or an identifier, so they cannot be written as Unicode escapes.</p><h3 id="toc-13">12</h3><div class="highlight"><pre><script>\u0061\u006c\u0065\u0072\u0074(\u0031\u0032)</script></pre></div><p>Unicode Escape sequence encoded alert and 12</p><p>Answer: The javascript will NOT execute.</p><p>At first glance this looks fine, but <code>\u0031\u0032</code> is not in quotes, so it is not a string literal, and a numeric literal cannot be written with escapes. Read as an identifier it would decode to <code>12</code>, which starts with a digit and is <strong>not a valid identifier</strong>. The script fails with a syntax error.</p><h3 id="toc-14">13</h3><div class="highlight"><pre><script>alert('13\u0027)</script></pre></div><p>Unicode escape sequence encoded '</p><p>Answer: The javascript will NOT execute.</p><p>Same as 8: the escaped quote does not close the string literal.</p><h3 id="toc-15">14</h3><div class="highlight"><pre><script>alert('14\u000a')</script></pre></div><p>Unicode escape sequence encoded line feed.</p><p>Answer: The javascript will 
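execute.</p><p>In JavaScript, <code>\u000a</code> inside a string literal is a line feed character in the string's value, the same as <code>\n</code>, so the script executes.</p><p>As a Java novice, I only just learned that in Java <code>\u000a</code> is a real line break: it is equivalent to pressing Enter in the source file, so whatever follows it ends up on a new line.</p><p>ECMAScript differs from the Java programming language in 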
the behaviour of Unicode escape sequences. In a Java program, if the 
Unicode escape sequence \u000A, for example, occurs within a single-line
 comment, it is interpreted as a line terminator (Unicode code point 
U+000A is LINE FEED (LF)) and therefore the next code point is not part 
of the comment. Similarly, if the Unicode escape sequence \u000A occurs 
within a string literal in a Java program, it is likewise interpreted as
 a line terminator, which is not allowed within a string literal—one 
must write \n instead of \u000A to cause a LINE FEED (LF) to be part of 
the String value of a string literal. In an ECMAScript program, a 
Unicode escape sequence occurring within a comment is never interpreted 
and therefore cannot contribute to termination of the comment. 
Similarly, a Unicode escape sequence occurring within a string literal 
in an ECMAScript program always contributes to the literal and is never 
interpreted as a line terminator or as a code point that might terminate
 the string literal.</p><p>from <a href="https://www.ecma-international.org/ecma-262/10.0/index.html#sec-ecmascript-language-source-code" target="_blank">https://www.ecma-international.org/ecma-262/10.0/index.html#sec-ecmascript-language-source-code</a></p><h2 id="toc-16">Bonus</h2><h3 id="toc-17">15</h3><div class="highlight"><pre><a&nbsp;href="&#x6a;&#x61;&#x76;&#x61;&#x73;&#x63;&#x72;&#x69;&#x70;&#x74;&#x3a;&#x25;&#x35;&#x63;&#x25;&#x37;&#x35;&#x25;&#x33;&#x30;&#x25;&#x33;&#x30;&#x25;&#x33;&#x36;&#x25;&#x33;&#x31;&#x25;&#x35;&#x63;&#x25;&#x37;&#x35;&#x25;&#x33;&#x30;&#x25;&#x33;&#x30;&#x25;&#x33;&#x36;&#x25;&#x36;&#x33;&#x25;&#x35;&#x63;&#x25;&#x37;&#x35;&#x25;&#x33;&#x30;&#x25;&#x33;&#x30;&#x25;&#x33;&#x36;&#x25;&#x33;&#x35;&#x25;&#x35;&#x63;&#x25;&#x37;&#x35;&#x25;&#x33;&#x30;&#x25;&#x33;&#x30;&#x25;&#x33;&#x37;&#x25;&#x33;&#x32;&#x25;&#x35;&#x63;&#x25;&#x37;&#x35;&#x25;&#x33;&#x30;&#x25;&#x33;&#x30;&#x25;&#x33;&#x37;&#x25;&#x33;&#x34;&#x28;&#x31;&#x35;&#x29;"></a></pre></div><p>Answer: The javascript will execute.</p><p>HTML decoding happens first, giving</p><p><code><a 
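href="javascript:%5c%75%30%30%36%31%5c%75%30%30%36%63%5c%75%30%30%36%35%5c%75%30%30%37%32%5c%75%30%30%37%34(15)"></a></code></p><p>The href value is handled by the URL parser, which recognizes the javascript scheme and percent-decodes the rest, giving</p><p><code>javascript:\u0061\u006c\u0065\u0072\u0074(15)</code></p><p>The result is then handed to the JavaScript engine, which resolves the Unicode escapes in the identifier, giving</p><p><code>javascript:alert(15)</code></p><p>which finally executes.</p><h2 id="toc-18">Summary</h2><ol><li><p>The content of <code><script></code> and <code><style></code> is text only; no HTML decoding or URL decoding is performed on it.</p></li><li><p>The content of <code><textarea></code> and <code><title></code> is HTML-decoded, but cannot contain child elements.</p></li><li><p>The content of other elements (such as <code>div</code>) and attribute values (such as <code>href</code>) is HTML-decoded.</p></li><li><p>Some attributes (such as <code>href</code>) are also URL-decoded, but the scheme in the URL must be plain ASCII, not encoded.</p></li><li><p>JavaScript decodes Unicode escapes in strings and identifiers.</p></li></ol><p>To construct an XSS payload, work backwards from the decoding the browser performs automatically.</p><p>Reposted from the Xianzhi community (先知社区)<br/></p>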
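<p>The decoding order in the summary can be sketched in a few lines of JavaScript (runnable in Node.js). This is only a rough model under stated assumptions, not how a real browser is implemented: every helper name below (<code>htmlDecode</code>, <code>splitScheme</code>, <code>parses</code>, <code>hrefWouldRun</code>, and the <code>to...</code> encoders) is hypothetical, <code>htmlDecode</code> handles numeric character references only, and <code>new Function</code> is used solely to check whether the source parses, without running it.</p>

```javascript
// Illustrative model of the decoding order summarized above.
// All helper names are hypothetical; this is not a real browser parser.

// Encoders, used to build a payload "in reverse".
const toHtmlRefs = (s) =>
  [...s].map((c) => `&#x${c.codePointAt(0).toString(16)};`).join("");
const toPercent = (s) =>
  [...s].map((c) => "%" + c.charCodeAt(0).toString(16).padStart(2, "0")).join("");
const toUnicodeEscapes = (s) =>
  [...s].map((c) => "\\u" + c.charCodeAt(0).toString(16).padStart(4, "0")).join("");

// Step 1: HTML decoding of an attribute value (numeric references only).
function htmlDecode(s) {
  return s
    .replace(/&#x([0-9a-f]+);/gi, (_, hex) => String.fromCodePoint(parseInt(hex, 16)))
    .replace(/&#([0-9]+);/g, (_, dec) => String.fromCodePoint(parseInt(dec, 10)));
}

// Step 2: the scheme must be literal ASCII; it is matched before percent-decoding.
function splitScheme(url) {
  const m = /^([a-z][a-z0-9+.-]*):(.*)$/is.exec(url);
  return m ? { scheme: m[1].toLowerCase(), rest: m[2] } : null;
}

// Step 3: does the source parse as JavaScript? new Function compiles, it does not run.
function parses(src) {
  try {
    new Function(src);
    return true;
  } catch (err) {
    return false;
  }
}

// Models an href value: HTML decode, find the scheme, percent-decode, parse as JS.
function hrefWouldRun(attrValue) {
  const parts = splitScheme(htmlDecode(attrValue));
  if (!parts || parts.scheme !== "javascript") return false;
  return parses(decodeURIComponent(parts.rest));
}

// Rebuild payload 15 from the inside out: JS escapes, then percent, then HTML.
const payload15 = toHtmlRefs(
  "javascript:" + toPercent(toUnicodeEscapes("alert")) + "(15)"
);

console.log(hrefWouldRun("%6a%61%76%61%73%63%72%69%70%74:%61%6c%65%72%74%28%31%29")); // payload 1
console.log(hrefWouldRun(toHtmlRefs("javascript") + ":" + toPercent("alert(2)"))); // payload 2
console.log(hrefWouldRun("javascript%3aalert(3)")); // payload 3
console.log(hrefWouldRun(payload15)); // payload 15
console.log(parses("\\u0061\\u006c\\u0065\\u0072\\u0074(10);")); // payload 10
console.log(parses("\\u0061\\u006c\\u0065\\u0072\\u0074(\\u0031\\u0032)")); // payload 12
```

<p>Run as written, the six lines print false, true, false, true, true, false, matching the answers for payloads 1, 2, 3, 15, 10, and 12. Building <code>payload15</code> by composing the encoders in the reverse of the decoding order is the "work backwards" idea from the last step of the summary.</p>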