1. 13.2 Parsing HTML documents
      1. 13.2.1 Overview of the parsing model
      2. 13.2.2 Parse errors
      3. 13.2.3 The input byte stream
        1. 13.2.3.1 Parsing with a known character encoding
        2. 13.2.3.2 Determining the character encoding
        3. 13.2.3.3 Character encodings
        4. 13.2.3.4 Changing the encoding while parsing
        5. 13.2.3.5 Preprocessing the input stream
      4. 13.2.4 Parse state
        1. 13.2.4.1 The insertion mode
        2. 13.2.4.2 The stack of open elements
        3. 13.2.4.3 The list of active formatting elements
        4. 13.2.4.4 The element pointers
        5. 13.2.4.5 Other parsing state flags
      5. 13.2.5 Tokenization
        1. 13.2.5.1 Data state
        2. 13.2.5.2 RCDATA state
        3. 13.2.5.3 RAWTEXT state
        4. 13.2.5.4 Script data state
        5. 13.2.5.5 PLAINTEXT state
        6. 13.2.5.6 Tag open state
        7. 13.2.5.7 End tag open state
        8. 13.2.5.8 Tag name state
        9. 13.2.5.9 RCDATA less-than sign state
        10. 13.2.5.10 RCDATA end tag open state
        11. 13.2.5.11 RCDATA end tag name state
        12. 13.2.5.12 RAWTEXT less-than sign state
        13. 13.2.5.13 RAWTEXT end tag open state
        14. 13.2.5.14 RAWTEXT end tag name state
        15. 13.2.5.15 Script data less-than sign state
        16. 13.2.5.16 Script data end tag open state
        17. 13.2.5.17 Script data end tag name state
        18. 13.2.5.18 Script data escape start state
        19. 13.2.5.19 Script data escape start dash state
        20. 13.2.5.20 Script data escaped state
        21. 13.2.5.21 Script data escaped dash state
        22. 13.2.5.22 Script data escaped dash dash state
        23. 13.2.5.23 Script data escaped less-than sign state
        24. 13.2.5.24 Script data escaped end tag open state
        25. 13.2.5.25 Script data escaped end tag name state
        26. 13.2.5.26 Script data double escape start state
        27. 13.2.5.27 Script data double escaped state
        28. 13.2.5.28 Script data double escaped dash state
        29. 13.2.5.29 Script data double escaped dash dash state
        30. 13.2.5.30 Script data double escaped less-than sign state
        31. 13.2.5.31 Script data double escape end state
        32. 13.2.5.32 Before attribute name state
        33. 13.2.5.33 Attribute name state
        34. 13.2.5.34 After attribute name state
        35. 13.2.5.35 Before attribute value state
        36. 13.2.5.36 Attribute value (double-quoted) state
        37. 13.2.5.37 Attribute value (single-quoted) state
        38. 13.2.5.38 Attribute value (unquoted) state
        39. 13.2.5.39 After attribute value (quoted) state
        40. 13.2.5.40 Self-closing start tag state
        41. 13.2.5.41 Bogus comment state
        42. 13.2.5.42 Markup declaration open state
        43. 13.2.5.43 Comment start state
        44. 13.2.5.44 Comment start dash state
        45. 13.2.5.45 Comment state
        46. 13.2.5.46 Comment less-than sign state
        47. 13.2.5.47 Comment less-than sign bang state
        48. 13.2.5.48 Comment less-than sign bang dash state
        49. 13.2.5.49 Comment less-than sign bang dash dash state
        50. 13.2.5.50 Comment end dash state
        51. 13.2.5.51 Comment end state
        52. 13.2.5.52 Comment end bang state
        53. 13.2.5.53 DOCTYPE state
        54. 13.2.5.54 Before DOCTYPE name state
        55. 13.2.5.55 DOCTYPE name state
        56. 13.2.5.56 After DOCTYPE name state
        57. 13.2.5.57 After DOCTYPE public keyword state
        58. 13.2.5.58 Before DOCTYPE public identifier state
        59. 13.2.5.59 DOCTYPE public identifier (double-quoted) state
        60. 13.2.5.60 DOCTYPE public identifier (single-quoted) state
        61. 13.2.5.61 After DOCTYPE public identifier state
        62. 13.2.5.62 Between DOCTYPE public and system identifiers state
        63. 13.2.5.63 After DOCTYPE system keyword state
        64. 13.2.5.64 Before DOCTYPE system identifier state
        65. 13.2.5.65 DOCTYPE system identifier (double-quoted) state
        66. 13.2.5.66 DOCTYPE system identifier (single-quoted) state
        67. 13.2.5.67 After DOCTYPE system identifier state
        68. 13.2.5.68 Bogus DOCTYPE state
        69. 13.2.5.69 CDATA section state
        70. 13.2.5.70 CDATA section bracket state
        71. 13.2.5.71 CDATA section end state
        72. 13.2.5.72 Character reference state
        73. 13.2.5.73 Named character reference state
        74. 13.2.5.74 Ambiguous ampersand state
        75. 13.2.5.75 Numeric character reference state
        76. 13.2.5.76 Hexadecimal character reference start state
        77. 13.2.5.77 Decimal character reference start state
        78. 13.2.5.78 Hexadecimal character reference state
        79. 13.2.5.79 Decimal character reference state
        80. 13.2.5.80 Numeric character reference end state
      6. 13.2.6 Tree construction
        1. 13.2.6.1 Creating and inserting nodes
        2. 13.2.6.2 Parsing elements that contain only text
        3. 13.2.6.3 Closing elements that have implied end tags
        4. 13.2.6.4 The rules for parsing tokens in HTML content
          1. 13.2.6.4.1 The "initial" insertion mode
          2. 13.2.6.4.2 The "before html" insertion mode
          3. 13.2.6.4.3 The "before head" insertion mode
          4. 13.2.6.4.4 The "in head" insertion mode
          5. 13.2.6.4.5 The "in head noscript" insertion mode
          6. 13.2.6.4.6 The "after head" insertion mode
          7. 13.2.6.4.7 The "in body" insertion mode
          8. 13.2.6.4.8 The "text" insertion mode
          9. 13.2.6.4.9 The "in table" insertion mode
          10. 13.2.6.4.10 The "in table text" insertion mode
          11. 13.2.6.4.11 The "in caption" insertion mode
          12. 13.2.6.4.12 The "in column group" insertion mode
          13. 13.2.6.4.13 The "in table body" insertion mode
          14. 13.2.6.4.14 The "in row" insertion mode
          15. 13.2.6.4.15 The "in cell" insertion mode
          16. 13.2.6.4.16 The "in select" insertion mode
          17. 13.2.6.4.17 The "in select in table" insertion mode
          18. 13.2.6.4.18 The "in template" insertion mode
          19. 13.2.6.4.19 The "after body" insertion mode
          20. 13.2.6.4.20 The "in frameset" insertion mode
          21. 13.2.6.4.21 The "after frameset" insertion mode
          22. 13.2.6.4.22 The "after after body" insertion mode
          23. 13.2.6.4.23 The "after after frameset" insertion mode
        5. 13.2.6.5 The rules for parsing tokens in foreign content
      7. 13.2.7 The end
      8. 13.2.8 Coercing an HTML DOM into an infoset
      9. 13.2.9 An introduction to error handling and strange cases in the parser
        1. 13.2.9.1 Misnested tags: <b><i></b></i>
        2. 13.2.9.2 Misnested tags: <b><p></b></p>
        3. 13.2.9.3 Unexpected markup in tables
        4. 13.2.9.4 Scripts that modify the page as it is being parsed
        5. 13.2.9.5 The execution of scripts that are moving across multiple documents
        6. 13.2.9.6 Unclosed formatting elements
    2. 13.3 Serializing HTML fragments
    3. 13.4 Parsing HTML fragments

13.2 Parsing HTML documents

This section only applies to user agents, data mining tools, and conformance checkers.

The rules for parsing XML documents into DOM trees are covered by the next section, entitled "The XML syntax".

User agents must use the parsing rules described in this section to generate the DOM trees from text/html resources. Together, these rules define what is referred to as the HTML parser.

While the HTML syntax described in this specification bears a close resemblance to SGML and XML, it is a separate language with its own parsing rules.

Some earlier versions of HTML (in particular from HTML2 to HTML4) were based on SGML and used SGML parsing rules. However, few (if any) web browsers ever implemented true SGML parsing for HTML documents; the only user agents to strictly handle HTML as an SGML application have historically been validators. The resulting confusion — with validators claiming documents to have one representation while widely deployed web browsers interoperably implemented a different representation — has wasted decades of productivity. This version of HTML thus returns to a non-SGML basis.

Authors interested in using SGML tools in their authoring pipeline are encouraged to use XML tools and the XML serialization of HTML.

For the purposes of conformance checkers, if a resource is determined to be in the HTML syntax, then it is an HTML document.

As stated in the terminology section, references to element types that do not explicitly specify a namespace always refer to elements in the HTML namespace. For example, if the spec talks about "a menu element", then that is an element with the local name "menu", the namespace "http://www.w3.org/1999/xhtml", and the interface HTMLMenuElement. Where possible, references to such elements are hyperlinked to their definition.

13.2.1 Overview of the parsing model

The input to the HTML parsing process consists of a stream of code points, which is passed through a tokenization stage followed by a tree construction stage. The output is a Document object.

Implementations that do not support scripting do not have to actually create a DOM Document object, but the DOM tree in such cases is still used as the model for the rest of the specification.

In the common case, the data handled by the tokenization stage comes from the network, but it can also come from script running in the user agent, e.g. using the document.write() API.

There is only one set of states for the tokenizer stage and the tree construction stage, but the tree construction stage is reentrant, meaning that while the tree construction stage is handling one token, the tokenizer might be resumed, causing further tokens to be emitted and processed before the first token's processing is complete.

In the following example, the tree construction stage will be called upon to handle a "p" start tag token while handling the "script" end tag token:

...
<script>
 document.write('<p>');
</script>
...

To handle these cases, parsers have a script nesting level, which must be initially set to zero, and a parser pause flag, which must be initially set to false.
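The reentrancy described above can be illustrated with a toy model. This is a minimal sketch, not the algorithm defined by this specification: the class `ToyParser`, its methods, and the idea of passing the script as a Python callable are all hypothetical names introduced for illustration. It only shows that a token written by a script is processed before the processing of the "script" end tag token has finished.

```python
import re


class ToyParser:
    """Toy model of a re-entrant tree construction stage (illustrative only)."""

    def __init__(self):
        self.script_nesting_level = 0   # initially zero
        self.parser_pause_flag = False  # initially false; not exercised in this toy
        self.trace = []

    def process_token(self, token, script=None):
        """Handle one token; a script attached to </script> runs before we return."""
        self.trace.append(f"begin {token} (nesting level {self.script_nesting_level})")
        if token == "</script>" and script is not None:
            self.script_nesting_level += 1
            script(self)  # may call document_write(), re-entering process_token()
            self.script_nesting_level -= 1
        self.trace.append(f"end {token}")

    def document_write(self, markup):
        """Stand-in for document.write(): tokenize crudely and process at once."""
        for token in re.findall(r"</?[a-z]+>|[^<]+", markup):
            self.process_token(token)


parser = ToyParser()
parser.process_token("<script>")
parser.process_token("</script>", script=lambda p: p.document_write("<p>"))
print("\n".join(parser.trace))
```

In the printed trace, the "begin <p>" and "end <p>" entries fall between "begin </script>" and "end </script>", and they are recorded while the nesting level is 1.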

13.2.2 Parse errors

This specification defines the parsing rules for HTML documents, whether they are syntactically correct or not. Certain points in the parsing algorithm are said to be parse errors. The error handling for parse errors is well-defined (it consists of the processing rules described throughout this specification), but user agents, while parsing an HTML document, may abort the parser at the first parse error that they encounter for which they do not wish to apply the rules described in this specification.

Conformance checkers must report at least one parse error condition to the user if one or more parse error conditions exist in the document and must not report parse error conditions if none exist in the document. Conformance checkers may report more than one parse error condition if more than one parse error condition exists in the document.

Parse errors are only errors with the syntax of HTML. In addition to checking for parse errors, conformance checkers will also verify that the document obeys all the other conformance requirements described in this specification.

Some parse errors have dedicated codes outlined in the table below that should be used by conformance checkers in reports.

Error descriptions in the table below are non-normative.

Code Description
abrupt-closing-of-empty-comment

This error occurs if the parser encounters an empty comment that is abruptly closed by a U+003E (>) code point (i.e., <!--> or <!--->). The parser behaves as if the comment is closed correctly.

abrupt-doctype-public-identifier

This error occurs if the parser encounters a U+003E (>) code point in the DOCTYPE public identifier (e.g., <!DOCTYPE html PUBLIC "foo>). In such a case, if the DOCTYPE is correctly placed as a document preamble, the parser sets the Document to quirks mode.

abrupt-doctype-system-identifier

This error occurs if the parser encounters a U+003E (>) code point in the DOCTYPE system identifier (e.g., <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "foo>). In such a case, if the DOCTYPE is correctly placed as a document preamble, the parser sets the Document to quirks mode.

absence-of-digits-in-numeric-character-reference

This error occurs if the parser encounters a numeric character reference that doesn't contain any digits (e.g., &#qux;). In this case the parser doesn't resolve the character reference.

cdata-in-html-content

This error occurs if the parser encounters a CDATA section outside of foreign content (SVG or MathML). The parser treats such CDATA sections (including leading "[CDATA[" and trailing "]]" strings) as comments.

character-reference-outside-unicode-range

This error occurs if the parser encounters a numeric character reference that references a code point that is greater than the valid Unicode range. The parser resolves such a character reference to a U+FFFD REPLACEMENT CHARACTER.

control-character-in-input-stream

This error occurs if the input stream contains a control code point that is not ASCII whitespace or U+0000 NULL. Such code points are parsed as-is and usually, where parsing rules don't apply any additional restrictions, make their way into the DOM.

control-character-reference

This error occurs if the parser encounters a numeric character reference that references a control code point that is not ASCII whitespace or is a U+000D CARRIAGE RETURN. The parser resolves such character references as-is except C1 control references that are replaced according to the numeric character reference end state.

end-tag-with-attributes

This error occurs if the parser encounters an end tag with attributes. Attributes in end tags are completely ignored and do not make their way into the DOM.

duplicate-attribute

This error occurs if the parser encounters an attribute in a tag that already has an attribute with the same name. The parser ignores all such duplicate occurrences of the attribute.

end-tag-with-trailing-solidus

This error occurs if the parser encounters an end tag that has a U+002F (/) code point right before the closing U+003E (>) code point (e.g., </div/>). Such a tag is treated as a regular end tag.

eof-before-tag-name

This error occurs if the parser encounters the end of the input stream where a tag name is expected. In this case the parser treats the beginning of a start tag (i.e., <) or an end tag (i.e., </) as text content.

eof-in-cdata

This error occurs if the parser encounters the end of the input stream in a CDATA section. The parser treats such CDATA sections as if they are closed immediately before the end of the input stream.

eof-in-comment

This error occurs if the parser encounters the end of the input stream in a comment. The parser treats such comments as if they are closed immediately before the end of the input stream.

eof-in-doctype

This error occurs if the parser encounters the end of the input stream in a DOCTYPE. In such a case, if the DOCTYPE is correctly placed as a document preamble, the parser sets the Document to quirks mode.

eof-in-script-html-comment-like-text

This error occurs if the parser encounters the end of the input stream in text that resembles an HTML comment inside script element content (e.g., <script><!-- foo).

Syntactic structures that resemble HTML comments in script elements are parsed as text content. They can be a part of a scripting language-specific syntactic structure or be treated as an HTML-like comment, if the scripting language supports them (e.g., parsing rules for HTML-like comments can be found in Annex B of the JavaScript specification). The common reason for this error is a violation of the restrictions for contents of script elements. [JAVASCRIPT]

eof-in-tag

This error occurs if the parser encounters the end of the input stream in a start tag or an end tag (e.g., <div id=). Such a tag is completely ignored.

incorrectly-closed-comment

This error occurs if the parser encounters a comment that is closed by the "--!>" code point sequence. The parser treats such comments as if they are correctly closed by the "-->" code point sequence.

incorrectly-opened-comment

This error occurs if the parser encounters the "<!" code point sequence that is not immediately followed by two U+002D (-) code points and that is not the start of a DOCTYPE or a CDATA section. All content that follows the "<!" code point sequence up to a U+003E (>) code point (if present) or to the end of the input stream is treated as a comment.

One possible cause of this error is using an XML markup declaration (e.g., <!ELEMENT br EMPTY>) in HTML.

invalid-character-sequence-after-doctype-name

This error occurs if the parser encounters any code point sequence other than "PUBLIC" and "SYSTEM" keywords after a DOCTYPE name. In such a case, the parser ignores any following public or system identifiers, and if the DOCTYPE is correctly placed as a document preamble, sets the Document to quirks mode.

invalid-first-character-of-tag-name

This error occurs if the parser encounters a code point that is not an ASCII alpha where the first code point of a start tag name or an end tag name is expected. If a start tag was expected, that code point and the preceding U+003C (<) are treated as text content, and all content that follows is treated as markup. If an end tag was expected, that code point and all content that follows, up to a U+003E (>) code point (if present) or to the end of the input stream, are treated as a comment.

For example, consider the following markup:

<42></42>

This will be parsed into:

    html
      head
      body
        #text: <42>
        #comment: 42

While the first code point of a tag name is limited to an ASCII alpha, a wide range of code points (including ASCII digits) is allowed in subsequent positions.

missing-attribute-value

This error occurs if the parser encounters a U+003E (>) code point where an attribute value is expected (e.g., <div id=>). The parser treats the attribute as having an empty value.

missing-doctype-name

This error occurs if the parser encounters a DOCTYPE that is missing a name (e.g., <!DOCTYPE>). In such a case, if the DOCTYPE is correctly placed as a document preamble, the parser sets the Document to quirks mode.

missing-doctype-public-identifier

This error occurs if the parser encounters a U+003E (>) code point where the start of the DOCTYPE public identifier is expected (e.g., <!DOCTYPE html PUBLIC >). In such a case, if the DOCTYPE is correctly placed as a document preamble, the parser sets the Document to quirks mode.

missing-doctype-system-identifier

This error occurs if the parser encounters a U+003E (>) code point where the start of the DOCTYPE system identifier is expected (e.g., <!DOCTYPE html SYSTEM >). In such a case, if the DOCTYPE is correctly placed as a document preamble, the parser sets the Document to quirks mode.

missing-end-tag-name

This error occurs if the parser encounters a U+003E (>) code point where an end tag name is expected, i.e., </>. The parser completely ignores the whole "</>" code point sequence.

missing-quote-before-doctype-public-identifier

This error occurs if the parser encounters the DOCTYPE public identifier that is not preceded by a quote (e.g., <!DOCTYPE html PUBLIC -//W3C//DTD HTML 4.01//EN">). In such a case, the parser ignores the public identifier, and if the DOCTYPE is correctly placed as a document preamble, sets the Document to quirks mode.

missing-quote-before-doctype-system-identifier

This error occurs if the parser encounters the DOCTYPE system identifier that is not preceded by a quote (e.g., <!DOCTYPE html SYSTEM http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">). In such a case, the parser ignores the system identifier, and if the DOCTYPE is correctly placed as a document preamble, sets the Document to quirks mode.

missing-semicolon-after-character-reference

This error occurs if the parser encounters a character reference that is not terminated by a U+003B (;) code point. Usually the parser behaves as if the character reference is terminated by the U+003B (;) code point; however, there are some ambiguous cases in which the parser includes subsequent code points in the character reference.

For example, &not;in will be parsed as "¬in" whereas &notin will be parsed as "∉".

missing-whitespace-after-doctype-public-keyword

This error occurs if the parser encounters a DOCTYPE whose "PUBLIC" keyword and public identifier are not separated by ASCII whitespace. In this case the parser behaves as if ASCII whitespace is present.

missing-whitespace-after-doctype-system-keyword

This error occurs if the parser encounters a DOCTYPE whose "SYSTEM" keyword and system identifier are not separated by ASCII whitespace. In this case the parser behaves as if ASCII whitespace is present.

missing-whitespace-before-doctype-name

This error occurs if the parser encounters a DOCTYPE whose "DOCTYPE" keyword and name are not separated by ASCII whitespace. In this case the parser behaves as if ASCII whitespace is present.

missing-whitespace-between-attributes

This error occurs if the parser encounters attributes that are not separated by ASCII whitespace (e.g., <div id="foo"class="bar">). In this case the parser behaves as if ASCII whitespace is present.

missing-whitespace-between-doctype-public-and-system-identifiers

This error occurs if the parser encounters a DOCTYPE whose public and system identifiers are not separated by ASCII whitespace. In this case the parser behaves as if ASCII whitespace is present.

nested-comment

This error occurs if the parser encounters a nested comment (e.g., <!-- <!-- nested --> -->). Such a comment will be closed by the first occurring "-->" code point sequence and everything that follows will be treated as markup.

noncharacter-character-reference

This error occurs if the parser encounters a numeric character reference that references a noncharacter. The parser resolves such character references as-is.

noncharacter-in-input-stream

This error occurs if the input stream contains a noncharacter. Such code points are parsed as-is and usually, where parsing rules don't apply any additional restrictions, make their way into the DOM.

non-void-html-element-start-tag-with-trailing-solidus

This error occurs if the parser encounters a start tag for an element that is not in the list of void elements or is not a part of foreign content (i.e., not an SVG or MathML element) that has a U+002F (/) code point right before the closing U+003E (>) code point. The parser behaves as if the U+002F (/) is not present.

For example, consider the following markup:

<div/><span></span><span></span>

This will be parsed into:

    html
      head
      body
        div
          span
          span

The trailing U+002F (/) in a start tag name can be used only in foreign content to specify self-closing tags. (Self-closing tags don't exist in HTML.) It is also allowed for void elements, but doesn't have any effect in this case.

null-character-reference

This error occurs if the parser encounters a numeric character reference that references a U+0000 NULL code point. The parser resolves such character references to a U+FFFD REPLACEMENT CHARACTER.

surrogate-character-reference

This error occurs if the parser encounters a numeric character reference that references a surrogate. The parser resolves such character references to a U+FFFD REPLACEMENT CHARACTER.

surrogate-in-input-stream

This error occurs if the input stream contains a surrogate. Such code points are parsed as-is and usually, where parsing rules don't apply any additional restrictions, make their way into the DOM.

Surrogates can only find their way into the input stream via script APIs such as document.write().

unexpected-character-after-doctype-system-identifier

This error occurs if the parser encounters any code points other than ASCII whitespace or closing U+003E (>) after the DOCTYPE system identifier. The parser ignores these code points.

unexpected-character-in-attribute-name

This error occurs if the parser encounters a U+0022 ("), U+0027 ('), or U+003C (<) code point in an attribute name. The parser includes such code points in the attribute name.

Code points that trigger this error are usually a part of another syntactic construct and can be a sign of a typo around the attribute name.

For example, consider the following markup:

<div foo<div>

Due to a forgotten U+003E (>) code point after foo the parser treats this markup as a single div element with a "foo<div" attribute.

As another example of this error, consider the following markup:

<div id'bar'>

Due to a forgotten U+003D (=) code point between an attribute name and value the parser treats this markup as a div element with the attribute "id'bar'" that has an empty value.

unexpected-character-in-unquoted-attribute-value

This error occurs if the parser encounters a U+0022 ("), U+0027 ('), U+003C (<), U+003D (=), or U+0060 (`) code point in an unquoted attribute value. The parser includes such code points in the attribute value.

Code points that trigger this error are usually a part of another syntactic construct and can be a sign of a typo around the attribute value.

U+0060 (`) is in the list of code points that trigger this error because certain legacy user agents treat it as a quote.

For example, consider the following markup:

<div foo=b'ar'>

Due to a misplaced U+0027 (') code point the parser sets the value of the "foo" attribute to "b'ar'".

unexpected-equals-sign-before-attribute-name

This error occurs if the parser encounters a U+003D (=) code point before an attribute name. In this case the parser treats U+003D (=) as the first code point of the attribute name.

The common reason for this error is a forgotten attribute name.

For example, consider the following markup:

<div foo="bar" ="baz">

Due to a forgotten attribute name the parser treats this markup as a div element with two attributes: a "foo" attribute with a "bar" value and a "="baz"" attribute with an empty value.

unexpected-null-character

This error occurs if the parser encounters a U+0000 NULL code point in the input stream in certain positions. In general, such code points are either completely ignored or, for security reasons, replaced with a U+FFFD REPLACEMENT CHARACTER.

unexpected-question-mark-instead-of-tag-name

This error occurs if the parser encounters a U+003F (?) code point where the first code point of a start tag name is expected. The U+003F (?) and all content that follows up to a U+003E (>) code point (if present) or to the end of the input stream is treated as a comment.

For example, consider the following markup:

<?xml-stylesheet type="text/css" href="style.css"?>

This will be parsed into:

    #comment: ?xml-stylesheet type="text/css" href="style.css"?
    html
      head
      body

The common reason for this error is an XML processing instruction (e.g., <?xml-stylesheet type="text/css" href="style.css"?>) or an XML declaration (e.g., <?xml version="1.0" encoding="UTF-8"?>) being used in HTML.

unexpected-solidus-in-tag

This error occurs if the parser encounters a U+002F (/) code point that is not a part of a quoted attribute value and not immediately followed by a U+003E (>) code point in a tag (e.g., <div / id="foo">). In this case the parser behaves as if it encountered ASCII whitespace.

unknown-named-character-reference

This error occurs if the parser encounters an ambiguous ampersand. In this case the parser doesn't resolve the character reference.
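Several rows of the table above describe how a numeric character reference is resolved once its number is known. The following sketch collects those rows into one function. The function name `resolve_numeric_reference` and the returned `(character, error_code)` pair are hypothetical, chosen for illustration. The replacement of C1 control references is approximated here by decoding the byte with Python's `cp1252` codec, which is an assumption of this sketch rather than the table that the numeric character reference end state actually defines.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER
ASCII_WHITESPACE = {0x09, 0x0A, 0x0C, 0x0D, 0x20}


def resolve_numeric_reference(code):
    """Return (resolved character, parse error code or None) for a reference number."""
    if code == 0x00:
        return REPLACEMENT, "null-character-reference"
    if code > 0x10FFFF:
        return REPLACEMENT, "character-reference-outside-unicode-range"
    if 0xD800 <= code <= 0xDFFF:
        return REPLACEMENT, "surrogate-character-reference"
    if 0xFDD0 <= code <= 0xFDEF or (code & 0xFFFE) == 0xFFFE:
        # Noncharacters are resolved as-is, but still reported.
        return chr(code), "noncharacter-character-reference"
    is_control = code <= 0x1F or 0x7F <= code <= 0x9F
    if code == 0x0D or (is_control and code not in ASCII_WHITESPACE):
        if 0x80 <= code <= 0x9F:
            try:
                # Assumption: cp1252 stands in for the C1 replacement table.
                return bytes([code]).decode("cp1252"), "control-character-reference"
            except UnicodeDecodeError:
                pass  # byte has no cp1252 mapping; fall through and keep it as-is
        return chr(code), "control-character-reference"
    return chr(code), None


print(resolve_numeric_reference(0x41))
print(resolve_numeric_reference(0x80))
print(resolve_numeric_reference(0xD800))
```

A conformance checker could use the second element of the pair as the error code to report, while a user agent would use only the first.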

13.2.3 The input byte stream

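The stream of code points that comprises the input to the tokenization stage will be initially seen by the user agent as a stream of bytes (typically coming over the network or from the local file system). The bytes encode the actual characters according to a particular character encoding, which the user agent uses to decode the bytes into characters.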

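For XML documents, the algorithm user agents are required to use to determine the character encoding is given by the XML specification. This section does not apply to XML documents. [XML]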

Usually, the encoding sniffing algorithm defined below is used to determine the character encoding.

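Given a character encoding, the bytes in the input byte stream must be converted to characters for the tokenizer's input stream, by passing the input byte stream and the character encoding to decode.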

A leading Byte Order Mark (BOM) causes the character encoding argument to be ignored and is itself skipped.
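The effect of a leading BOM on decoding can be sketched as follows. This is an illustration under stated assumptions, not the decode algorithm itself: the function name `decode_input_byte_stream` is hypothetical, and Python codec names and error handling are used as stand-ins for the encodings and decoders of the Encoding standard, with which they do not always agree.

```python
import codecs

# BOMs that override the caller-supplied encoding, paired with Python codec names.
BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)


def decode_input_byte_stream(data: bytes, encoding: str) -> str:
    """Decode bytes for the tokenizer; a leading BOM wins over `encoding`."""
    for bom, bom_encoding in BOMS:
        if data.startswith(bom):
            encoding = bom_encoding   # the encoding argument is ignored
            data = data[len(bom):]    # the BOM itself is skipped
            break
    # Invalid sequences become U+FFFD rather than aborting the decode.
    return data.decode(encoding, errors="replace")


print(decode_input_byte_stream(b"\xef\xbb\xbfcaf\xc3\xa9", "windows-1252"))
print(decode_input_byte_stream(b"caf\xe9", "windows-1252"))
```

Both calls print the same four-character string: the first is decoded as UTF-8 because of its BOM, despite the windows-1252 argument, and the second has no BOM, so the argument is used.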

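Bytes or sequences of bytes in the original byte stream that do not conform to the Encoding standard (e.g. invalid UTF-8 byte sequences in a UTF-8 input byte stream) are errors that conformance checkers are expected to report. [ENCODING]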

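The decoder algorithms describe how to handle invalid input; for security reasons, it is imperative that those rules be followed precisely. Differences in how invalid byte sequences are handled can result in, amongst other problems, script injection vulnerabilities.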
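The two concerns above, reporting invalid bytes and handling them safely, can be illustrated with a short sketch. The function names are hypothetical, and Python's built-in UTF-8 decoder is assumed to be a reasonable stand-in for the UTF-8 decoder of the Encoding standard. The sample input contains an overlong encoding of U+003C (<), which a lenient decoder might turn into the start of a tag.

```python
def find_invalid_utf8(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as error:
        return error.start
    return None


def decode_utf8_for_parsing(data: bytes) -> str:
    """Decode for the tokenizer, replacing invalid sequences with U+FFFD."""
    return data.decode("utf-8", errors="replace")


# 0xC0 0xBC is an overlong (and therefore invalid) encoding of "<".
suspicious = b"ok \xc0\xbcscript>"

print(find_invalid_utf8(suspicious))              # offset a checker could report
print("<" in decode_utf8_for_parsing(suspicious))
```

The second line of output is False: the invalid bytes are replaced instead of being decoded to a less-than sign, so no tag is created from them.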

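When the HTML parser is decoding an input byte stream, it must use a character encoding and an associated confidence. The confidence is either tentative, certain, or irrelevant. The encoding used, and whether the confidence in that encoding is tentative or certain, is used during parsing to determine whether to change the encoding. If no encoding is necessary, e.g. because the parser is operating on a Unicode stream and does not have to use a character encoding at all, then the confidence is irrelevant.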

Some algorithms feed the parser by directly adding characters to the input stream rather than adding bytes to the input byte stream.

13.2.3.1 Parsing with a known character encoding

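When the HTML parser is to operate on an input byte stream that has a known definite encoding, then the character encoding is that encoding and the confidence is certain.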

13.2.3.2 Determining the character encoding

In some cases, it might be impractical to unambiguously determine the encoding before parsing the document. Because of this, this specification provides for a two-pass mechanism with an optional pre-scan. Implementations are allowed, as described below, to apply a simplified parsing algorithm to whatever bytes they have available before beginning to parse the document. Then, the real parser is started, using a tentative encoding derived from this pre-parse and other out-of-band metadata. If, while the document is being loaded, the user agent discovers a character encoding declaration that conflicts with this information, then the parser can get reinvoked to perform a parse of the document with the real encoding.
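As a rough illustration of the first pass, the sketch below derives an encoding and a confidence from the bytes available so far. It is heavily simplified and makes several assumptions: the name `sniff_encoding` and its parameters are hypothetical, a regular expression stands in for the real prescan algorithm, Python's codec registry stands in for the list of supported encodings, and the steps involving user overrides, parent documents, previous visits, and autodetection are omitted.

```python
import codecs
import re

# Crude stand-in for the prescan: find a charset label inside a <meta> tag.
META_CHARSET_RE = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE
)

BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)


def is_supported(label: str) -> bool:
    """Assumption: 'supported' means Python has a codec for this label."""
    try:
        codecs.lookup(label)
        return True
    except LookupError:
        return False


def sniff_encoding(head: bytes, transport_charset=None, default="windows-1252"):
    """Return (encoding, confidence) for the first parsing pass."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name, "certain"
    if transport_charset and is_supported(transport_charset):
        return transport_charset.lower(), "certain"
    match = META_CHARSET_RE.search(head[:1024])  # only look at the first 1024 bytes
    if match:
        label = match.group(1).decode("ascii").lower()
        if is_supported(label):
            return label, "tentative"
    return default, "tentative"


print(sniff_encoding(b'<!DOCTYPE html><meta charset="utf-8"><title>x</title>'))
print(sniff_encoding(b"<p>hello", transport_charset="ISO-8859-2"))
print(sniff_encoding(b"<p>hello"))
```

A result with confidence tentative is the case in which a later, conflicting character encoding declaration could cause the document to be parsed again with the declared encoding.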

User agents must use the following algorithm, called the encoding sniffing algorithm, to determine the character encoding to use when decoding a document in the first pass. This algorithm takes as input any out-of-band metadata available to the user agent (e.g. the Content-Type metadata of the document) and all the bytes available so far, and returns a character encoding and a confidence that is either tentative or certain.

  1. If the result of BOM sniffing is an encoding, return that encoding with confidence certain.

    Although the decode algorithm will itself change the encoding to use based on the presence of a byte order mark, this algorithm sniffs the BOM as well in order to set the correct document's character encoding and confidence.

  2. If the user has explicitly instructed the user agent to override the document's character encoding with a specific encoding, optionally return that encoding with the confidence certain.

    Typically, user agents remember such user requests across sessions, and in some cases apply them to documents in iframes as well.

  3. The user agent may wait for more bytes of the resource to be available, either in this step or at any later step in this algorithm. For instance, a user agent might wait 500ms or 1024 bytes, whichever came first. In general preparsing the source to find the encoding improves performance, as it reduces the need to throw away the data structures used when parsing upon finding the encoding information. However, if the user agent delays too long to obtain data to determine the encoding, then the cost of the delay could outweigh any performance improvements from the preparse.

    The authoring conformance requirements for character encoding declarations limit them to only appearing in the first 1024 bytes. User agents are therefore encouraged to use the prescan algorithm below (as invoked by these steps) on the first 1024 bytes, but not to stall beyond that.

  4. If the transport layer specifies a character encoding, and it is supported, return that encoding with the confidence certain.

  5. Optionally prescan the byte stream to determine its encoding. The end condition is that the user agent decides that scanning further bytes would not be efficient. User agents are encouraged to only prescan the first 1024 bytes. User agents may decide that scanning any bytes is not efficient, in which case these substeps are entirely skipped.

    The aforementioned algorithm either aborts unsuccessfully or returns a character encoding. If it returns a character encoding, then return the same encoding, with confidence tentative.

  6. If the HTML parser for which this algorithm is being run is associated with a Document d whose browsing context is non-null and a child browsing context, then:

    1. Let parentDocument be d's browsing context's container document.

    2. If parentDocument's origin is same origin with d's origin and parentDocument's character encoding is not UTF-16BE/LE, then return parentDocument's character encoding, with the confidence tentative.

  7. Otherwise, if the user agent has information on the likely encoding for this page, e.g. based on the encoding of the page when it was last visited, then return that encoding, with the confidence tentative.

  8. The user agent may attempt to autodetect the character encoding from applying frequency analysis or other algorithms to the data stream. Such algorithms may use information about the resource other than the resource's contents, including the address of the resource. If autodetection succeeds in determining a character encoding, and that encoding is a supported encoding, then return that encoding, with the confidence tentative. [UNIVCHARDET]

    User agents are generally discouraged from attempting to autodetect encodings for resources obtained over the network, since doing so involves inherently non-interoperable heuristics. Attempting to detect encodings based on an HTML document's preamble is especially tricky since HTML markup typically uses only ASCII characters, and HTML documents tend to begin with a lot of markup rather than with text content.

    The UTF-8 encoding has a highly detectable bit pattern. Files from the local file system that contain bytes with values greater than 0x7F which match the UTF-8 pattern are very likely to be UTF-8, while documents with byte sequences that do not match it are very likely not. When a user agent can examine the whole file, rather than just the preamble, detecting for UTF-8 specifically can be especially effective. [PPUTF8] [UTF8DET]

  9. Otherwise, return an implementation-defined or user-specified default character encoding, with the confidence tentative.

    In controlled environments or in environments where the encoding of documents can be prescribed (for example, for user agents intended for dedicated use in new networks), the comprehensive UTF-8 encoding is suggested.

    In other environments, the default encoding is typically dependent on the user's locale (an approximation of the languages, and thus often encodings, of the pages that the user is likely to frequent). The following table gives suggested defaults based on the user's locale, for compatibility with legacy content. Locales are identified by BCP 47 language tags. [BCP47] [ENCODING]

    | Locale | Language | Suggested default encoding |
    |--------|----------|----------------------------|
    | ar | Arabic | windows-1256 |
    | ba | Bashkir | windows-1251 |
    | be | Belarusian | windows-1251 |
    | bg | Bulgarian | windows-1251 |
    | cs | Czech | windows-1250 |
    | el | Greek | ISO-8859-7 |
    | et | Estonian | windows-1257 |
    | fa | Persian | windows-1256 |
    | he | Hebrew | windows-1255 |
    | hr | Croatian | windows-1250 |
    | hu | Hungarian | ISO-8859-2 |
    | ja | Japanese | Shift_JIS |
    | kk | Kazakh | windows-1251 |
    | ko | Korean | EUC-KR |
    | ku | Kurdish | windows-1254 |
    | ky | Kyrgyz | windows-1251 |
    | lt | Lithuanian | windows-1257 |
    | lv | Latvian | windows-1257 |
    | mk | Macedonian | windows-1251 |
    | pl | Polish | ISO-8859-2 |
    | ru | Russian | windows-1251 |
    | sah | Yakut | windows-1251 |
    | sk | Slovak | windows-1250 |
    | sl | Slovenian | ISO-8859-2 |
    | sr | Serbian | windows-1251 |
    | tg | Tajik | windows-1251 |
    | th | Thai | windows-874 |
    | tr | Turkish | windows-1254 |
    | tt | Tatar | windows-1251 |
    | uk | Ukrainian | windows-1251 |
    | vi | Vietnamese | windows-1258 |
    | zh-CN | Chinese (People's Republic of China) | gb18030 |
    | zh-TW | Chinese (Taiwan) | Big5 |
    | All other locales | | windows-1252 |

    The contents of this table are derived from the intersection of Windows, Chrome, and Firefox defaults.

The document's character encoding must immediately be set to the value returned from this algorithm, at the same time as the user agent uses the returned value to select the decoder to use for the input byte stream.
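
The locale table in step 9 lends itself to a simple lookup. The following is a minimal sketch and not part of the algorithm itself: the names `LOCALE_DEFAULTS` and `default_encoding` are hypothetical, and falling back from a full tag such as `ru-RU` to its primary language subtag is an assumption made here, since the text does not say how tags are matched against the table.

```python
# Suggested defaults from the table above, keyed by lowercased BCP 47 tag.
LOCALE_DEFAULTS = {
    "ar": "windows-1256", "ba": "windows-1251", "be": "windows-1251",
    "bg": "windows-1251", "cs": "windows-1250", "el": "ISO-8859-7",
    "et": "windows-1257", "fa": "windows-1256", "he": "windows-1255",
    "hr": "windows-1250", "hu": "ISO-8859-2", "ja": "Shift_JIS",
    "kk": "windows-1251", "ko": "EUC-KR", "ku": "windows-1254",
    "ky": "windows-1251", "lt": "windows-1257", "lv": "windows-1257",
    "mk": "windows-1251", "pl": "ISO-8859-2", "ru": "windows-1251",
    "sah": "windows-1251", "sk": "windows-1250", "sl": "ISO-8859-2",
    "sr": "windows-1251", "tg": "windows-1251", "th": "windows-874",
    "tr": "windows-1254", "tt": "windows-1251", "uk": "windows-1251",
    "vi": "windows-1258", "zh-cn": "gb18030", "zh-tw": "Big5",
}


def default_encoding(locale: str) -> str:
    """Return the suggested legacy default encoding for a BCP 47 tag."""
    tag = locale.lower()  # BCP 47 tags are compared case-insensitively
    if tag in LOCALE_DEFAULTS:
        return LOCALE_DEFAULTS[tag]
    # Assumption: try the primary language subtag before giving up.
    primary = tag.split("-")[0]
    return LOCALE_DEFAULTS.get(primary, "windows-1252")
```

For example, `default_encoding("zh-TW")` gives `"Big5"`, while a locale missing from the table, such as `en-US`, falls through to `windows-1252`.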


When an algorithm requires a user agent to prescan a byte stream to determine its encoding, given some defined end condition, then it must run the following steps. These steps either abort unsuccessfully or return a character encoding. If at any point during these steps (including during instances of the get an attribute algorithm invoked by this one) the user agent either runs out of bytes (meaning the position pointer created in the first step below goes beyond the end of the byte stream obtained so far) or reaches its end condition, then abort the prescan a byte stream to determine its encoding algorithm unsuccessfully.

  1. Let position be a pointer to a byte in the input byte stream, initially pointing at the first byte.

  2. Loop: If position points to:

    A sequence of bytes starting with: 0x3C 0x21 0x2D 0x2D (`<!--`)

    Advance the position pointer so that it points at the first 0x3E byte which is preceded by two 0x2D bytes (i.e. at the end of an ASCII '-->' sequence) and comes after the 0x3C byte that was found. (The two 0x2D bytes can be the same as those in the '<!--' sequence.)

    A sequence of bytes starting with: 0x3C, 0x4D or 0x6D, 0x45 or 0x65, 0x54 or 0x74, 0x41 or 0x61, and one of 0x09, 0x0A, 0x0C, 0x0D, 0x20, 0x2F (case-insensitive ASCII '<meta' followed by a space or slash)
    1. Advance the position pointer so that it points at the next 0x09, 0x0A, 0x0C, 0x0D, 0x20, or 0x2F byte (the one in the sequence of bytes matched above).

    2. Let attribute list be an empty list of strings.

    3. Let got pragma be false.

    4. Let need pragma be null.

    5. Let charset be the null value (which, for the purposes of this algorithm, is distinct from an unrecognized encoding or the empty string).

    6. Attributes: Get an attribute and its value. If no attribute was sniffed, then jump to the processing step below.

    7. If the attribute's name is already in attribute list, then return to the step labeled attributes.

    8. Add the attribute's name to attribute list.

    9. Run the appropriate step from the following list, if one applies:

      If the attribute's name is "http-equiv"

      If the attribute's value is "content-type", then set got pragma to true.

      If the attribute's name is "content"

      Apply the algorithm for extracting a character encoding from a meta element, giving the attribute's value as the string to parse. If a character encoding is returned, and if charset is still set to null, let charset be the encoding returned, and set need pragma to true.

      If the attribute's name is "charset"

      Let charset be the result of getting an encoding from the attribute's value, and set need pragma to false.

    10. Return to the step labeled attributes.

    11. Processing: If need pragma is null, then jump to the step below labeled next byte.

    12. If need pragma is true but got pragma is false, then jump to the step below labeled next byte.

    13. If charset is failure, then jump to the step below labeled next byte.

    14. If charset is UTF-16BE/LE, then set charset to UTF-8.

    15. If charset is x-user-defined, then set charset to windows-1252.

    16. Abort the prescan a byte stream to determine its encoding algorithm, returning the encoding given by charset.

    A sequence of bytes starting with a 0x3C byte (<), optionally a 0x2F byte (/), and finally a byte in the range 0x41-0x5A or 0x61-0x7A (A-Z or a-z)
    1. Advance the position pointer so that it points at the next 0x09 (HT), 0x0A (LF), 0x0C (FF), 0x0D (CR), 0x20 (SP), or 0x3E (>) byte.

    2. Repeatedly get an attribute until no further attributes can be found, then jump to the step below labeled next byte.

    A sequence of bytes starting with: 0x3C 0x21 (`<!`)
    A sequence of bytes starting with: 0x3C 0x2F (`</`)
    A sequence of bytes starting with: 0x3C 0x3F (`<?`)

    Advance the position pointer so that it points at the first 0x3E byte (>) that comes after the 0x3C byte that was found.

    Any other byte

    Do nothing with that byte.

  3. Next byte: Move position so it points at the next byte in the input byte stream, and return to the step above labeled loop.
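
Two parts of the prescan are easy to get wrong: where a comment ends, and how steps 11 to 16 turn the collected flags into a result. The sketch below illustrates both under stated assumptions. The function names are hypothetical, `None` stands in for both "no result" and the failure value returned by getting an encoding, and the text's "UTF-16BE/LE" is read as either `UTF-16BE` or `UTF-16LE`.

```python
from typing import Optional


def find_comment_end(data: bytes, start: int) -> Optional[int]:
    """Given data[start:start+4] == b'<!--', return the index of the '>'
    that ends the comment, or None if the bytes run out first."""
    # The two '-' bytes before '>' may be the ones from '<!--' itself,
    # so the search starts at the first '-' (start + 2).
    idx = data.find(b"-->", start + 2)
    return None if idx == -1 else idx + 2


def resolve_meta_charset(got_pragma: bool,
                         need_pragma: Optional[bool],
                         charset: Optional[str]) -> Optional[str]:
    """Steps 11 to 16 for a meta element: return the encoding to report,
    or None to continue at the step labeled next byte."""
    if need_pragma is None:              # neither content nor charset was seen
        return None
    if need_pragma and not got_pragma:   # content without http-equiv=content-type
        return None
    if charset is None:                  # getting an encoding returned failure
        return None
    if charset in ("UTF-16BE", "UTF-16LE"):
        return "UTF-8"
    if charset == "x-user-defined":
        return "windows-1252"
    return charset
```

Note that `<!-->` counts as a complete comment here, because the closing `-->` is allowed to share its dashes with the opening `<!--`.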

When the prescan a byte stream to determine its encoding algorithm says to get an attribute, it means doing this:

  1. If the byte at position is one of 0x09 (HT), 0x0A (LF), 0x0C (FF), 0x0D (CR), 0x20 (SP), or 0x2F (/) then advance position to the next byte and redo this step.

  2. If the byte at position is 0x3E (>), then abort the get an attribute algorithm. There isn't one.

  3. Otherwise, the byte at position is the start of the attribute name. Let attribute name and attribute value be the empty string.

  4. Process the byte at position as follows:

    If it is 0x3D (=), and the attribute name is longer than the empty string
    Advance position to the next byte and jump to the step below labeled value.
    If it is 0x09 (HT), 0x0A (LF), 0x0C (FF), 0x0D (CR), or 0x20 (SP)
    Jump to the step below labeled spaces.
    If it is 0x2F (/) or 0x3E (>)
    Abort the get an attribute algorithm. The attribute's name is the value of attribute name, its value is the empty string.
    If it is in the range 0x41 (A) to 0x5A (Z)
    Append the code point b+0x20 to attribute name (where b is the value of the byte at position). (This converts the input to lowercase.)
    Anything else
    Append the code point with the same value as the byte at position to attribute name. (It doesn't actually matter how bytes outside the ASCII range are handled here, since only ASCII bytes can contribute to the detection of a character encoding.)
  5. Advance position to the next byte and return to the previous step.

  6. Spaces: If the byte at position is one of 0x09 (HT), 0x0A (LF), 0x0C (FF), 0x0D (CR), or 0x20 (SP), then advance position to the next byte and repeat this step.

  7. If the byte at position is not 0x3D (=), abort the get an attribute algorithm. The attribute's name is the value of attribute name, its value is the empty string.

  8. Advance position past the 0x3D (=) byte.

  9. Value: If the byte at position is one of 0x09 (HT), 0x0A (LF), 0x0C (FF), 0x0D (CR), or 0x20 (SP), then advance position to the next byte and repeat this step.

  10. Process the byte at position as follows:

    If it is 0x22 (") or 0x27 (')
    1. Let b be the value of the byte at position.
    2. Quote loop: Advance position to the next byte.
    3. If the value of the byte at position is the value of b, then advance position to the next byte and abort the "get an attribute" algorithm. The attribute's name is the value of attribute name, and its value is the value of attribute value.
    4. Otherwise, if the value of the byte at position is in the range 0x41 (A) to 0x5A (Z), then append a code point to attribute value whose value is 0x20 more than the value of the byte at position.
    5. Otherwise, append a code point to attribute value whose value is the same as the value of the byte at position.
    6. Return to the step above labeled quote loop.
    If it is 0x3E (>)
    Abort the get an attribute algorithm. The attribute's name is the value of attribute name, its value is the empty string.
    If it is in the range 0x41 (A) to 0x5A (Z)
    Append a code point b+0x20 to attribute value (where b is the value of the byte at position). Advance position to the next byte.
    Anything else
    Append a code point with the same value as the byte at position to attribute value. Advance position to the next byte.
  11. Process the byte at position as follows:

    If it is 0x09 (HT), 0x0A (LF), 0x0C (FF), 0x0D (CR), 0x20 (SP), or 0x3E (>)
    Abort the get an attribute algorithm. The attribute's name is the value of attribute name and its value is the value of attribute value.
    If it is in the range 0x41 (A) to 0x5A (Z)
    Append a code point b+0x20 to attribute value (where b is the value of the byte at position).
    Anything else
    Append a code point with the same value as the byte at position to attribute value.
  12. Advance position to the next byte and return to the previous step.
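
The get an attribute steps above can be sketched as a single function. This is an illustrative reading, not a reference implementation: the name `get_attribute` and its return shape `(name, value, new_position)` are choices made here, and running out of bytes is modeled by letting Python raise `IndexError`, which a caller would treat as aborting the prescan unsuccessfully.

```python
WHITESPACE = b"\t\n\x0c\r "  # 0x09, 0x0A, 0x0C, 0x0D, 0x20


def _lower(b: int) -> str:
    """Map a byte to a code point, lowercasing ASCII A-Z."""
    return chr(b + 0x20) if 0x41 <= b <= 0x5A else chr(b)


def get_attribute(data: bytes, pos: int):
    """Return (name, value, new_pos); name is None if there is no attribute."""
    # Step 1: skip whitespace and slashes.
    while data[pos] in WHITESPACE or data[pos] == 0x2F:
        pos += 1
    # Step 2: '>' means there is no attribute.
    if data[pos] == 0x3E:
        return None, None, pos
    name, value = "", ""
    # Steps 4 to 8: read the name, then look for '='.
    while True:
        b = data[pos]
        if b == 0x3D and name:
            pos += 1
            break
        if b in WHITESPACE:
            while data[pos] in WHITESPACE:   # step 6 (spaces)
                pos += 1
            if data[pos] != 0x3D:            # step 7
                return name, "", pos
            pos += 1                         # step 8
            break
        if b in (0x2F, 0x3E):
            return name, "", pos
        name += _lower(b)
        pos += 1
    # Step 9 (value): skip whitespace before the value.
    while data[pos] in WHITESPACE:
        pos += 1
    # Step 10: quoted value, empty value, or first unquoted byte.
    b = data[pos]
    if b in (0x22, 0x27):
        while True:                          # quote loop
            pos += 1
            if data[pos] == b:
                return name, value, pos + 1
            value += _lower(data[pos])
    if b == 0x3E:
        return name, "", pos
    value += _lower(b)
    pos += 1
    # Steps 11 and 12: rest of an unquoted value.
    while data[pos] not in WHITESPACE and data[pos] != 0x3E:
        value += _lower(data[pos])
        pos += 1
    return name, value, pos
```

Calling it repeatedly from the returned position walks through a tag's attributes until the name comes back as `None`.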

For the sake of interoperability, user agents should not use a pre-scan algorithm that returns different results than the one described above. (But, if you do, please at least let us know, so that we can improve this algorithm and benefit everyone...)

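13.2.3.3 Character encodings

User agents must support the encodings defined in the WHATWG Encoding standard, including, but not limited to, UTF-8, ISO-8859-2, ISO-8859-7, ISO-8859-8, windows-874, windows-1250, windows-1251, windows-1252, windows-1254, windows-1255, windows-1256, windows-1257, windows-1258, gb18030, Big5, ISO-2022-JP, Shift_JIS, EUC-KR, UTF-16BE/LE, and x-user-defined. User agents must not support other encodings.

The above prohibits supporting, for example, CESU-8, UTF-7, BOCU-1, SCSU, EBCDIC, and UTF-32. The algorithms in this specification make no attempt to support prohibited encodings; supporting and using prohibited encodings can therefore lead to unexpected behavior. [CESU8] [UTF7] [BOCU1] [SCSU]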

13.2.3.4 Changing the encoding while parsing

When the parser requires the user agent to change the encoding, it must run the following steps. This might happen if the encoding sniffing algorithm described above failed to find a character encoding, or if it found a character encoding that was not the actual encoding of the file.

  1. If the encoding that is already being used to interpret the input stream is UTF-16BE/LE, then set the confidence to certain and return. The new encoding is ignored; if it was anything but the same encoding, then it would be clearly incorrect.

  2. If the new encoding is UTF-16BE/LE, then change it to UTF-8.

  3. If the new encoding is x-user-defined, then change it to windows-1252.

  4. If the new encoding is identical or equivalent to the encoding that is already being used to interpret the input stream, then set the confidence to certain and return. This happens when the encoding information found in the file matches what the encoding sniffing algorithm determined to be the encoding, and in the second pass through the parser if the first pass found that the encoding sniffing algorithm described in the earlier section failed to find the right encoding.

  5. If all the bytes up to the last byte converted by the current decoder have the same Unicode interpretations in both the current encoding and the new encoding, and if the user agent supports changing the converter on the fly, then the user agent may change to the new converter for the encoding on the fly. Set the document's character encoding and the encoding used to convert the input stream to the new encoding, set the confidence to certain, and return.

  6. Otherwise, navigate to the document again, with historyHandling set to "replace", and using the same source browsing context, but this time skip the encoding sniffing algorithm and instead just set the encoding to the new encoding and the confidence to certain. Whenever possible, this should be done without actually contacting the network layer (the bytes should be re-parsed from memory), even if, e.g., the document is marked as not being cacheable. If this is not possible and contacting the network layer would involve repeating a request that uses a method other than `GET`, then instead set the confidence to certain and ignore the new encoding. The resource will be misinterpreted. User agents may notify the user of the situation, to aid in application development.

This algorithm is only invoked when a new encoding is found declared on a meta element.
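
The decision structure of the six steps above can be summarized as a pure function. This is a sketch under assumptions: the function name, the three boolean parameters, and the action labels are all hypothetical, encoding names are assumed to be already canonical, and "UTF-16BE/LE" is read as either `UTF-16BE` or `UTF-16LE`. In every branch the confidence ends up as certain.

```python
UTF16 = ("UTF-16BE", "UTF-16LE")


def change_encoding(current: str, new: str, *,
                    same_so_far: bool,
                    can_switch_on_the_fly: bool,
                    reparse_needs_non_get_request: bool):
    """Return (encoding_to_use, action) for a change-the-encoding request."""
    if current in UTF16:                   # step 1: new encoding is ignored
        return current, "keep"
    if new in UTF16:                       # step 2
        new = "UTF-8"
    if new == "x-user-defined":            # step 3
        new = "windows-1252"
    if new == current:                     # step 4
        return current, "keep"
    # Step 5 is optional ("may"); this sketch always takes it when allowed.
    if same_so_far and can_switch_on_the_fly:
        return new, "switch decoder"
    # Step 6: re-parse, unless that would repeat a non-GET request.
    if reparse_needs_non_get_request:
        return current, "keep (resource will be misinterpreted)"
    return new, "reparse"
```

Here `same_so_far` expresses that all bytes decoded up to now have the same Unicode interpretation in both encodings, and `reparse_needs_non_get_request` is true only when the bytes cannot be re-parsed from memory and refetching would repeat a non-`GET` request.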

13.2.3.5 Preprocessing the input stream

The input stream consists of the characters pushed into it as the input byte stream is decoded or from the various APIs that directly manipulate the input stream.

Any occurrences of surrogates are surrogate-in-input-stream parse errors. Any occurrences of noncharacters are noncharacter-in-input-stream parse errors and any occurrences of controls other than ASCII whitespace and U+0000 NULL characters are control-character-in-input-stream parse errors.
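
As an illustration, the three error classes can be expressed as a check on a single code point. The function name is hypothetical, and the ranges used for surrogates, noncharacters, controls, and ASCII whitespace are the usual Unicode and Infra definitions, which this section relies on but does not restate.

```python
from typing import Optional

ASCII_WHITESPACE = {0x09, 0x0A, 0x0C, 0x0D, 0x20}


def input_stream_error(cp: int) -> Optional[str]:
    """Return the parse error name for code point cp, or None if it is fine."""
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate-in-input-stream"
    # Noncharacters: U+FDD0..U+FDEF and the last two code points of each plane.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFF) in (0xFFFE, 0xFFFF):
        return "noncharacter-in-input-stream"
    # Controls: C0 controls and U+007F..U+009F, minus whitespace and NULL.
    is_control = cp <= 0x1F or 0x7F <= cp <= 0x9F
    if is_control and cp not in ASCII_WHITESPACE and cp != 0x00:
        return "control-character-in-input-stream"
    return None
```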

The handling of U+0000 NULL characters varies based on where the characters are found and happens at the later stages of the parsing. They are either ignored or, for security reasons, replaced with a U+FFFD REPLACEMENT CHARACTER. This handling is, by necessity, spread across both the tokenization stage and the tree construction stage.

Before the tokenization stage, the input stream must be preprocessed by normalizing newlines. Thus, newlines in HTML DOMs are represented by U+000A LF characters, and there are never any U+000D CR characters in the input to the tokenization stage.
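
Normalizing newlines means replacing each CR LF pair with a single LF and then each remaining CR with LF. A minimal sketch, with a hypothetical function name:

```python
def normalize_newlines(text: str) -> str:
    """Replace CR LF pairs, then any remaining CR, with LF."""
    # Order matters: handling the pairs first keeps "\r\n" from
    # turning into two line feeds.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```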

The next input character is the first character in the input stream that has not yet been consumed or explicitly ignored by the requirements in this section. Initially, the next input character is the first character in the input. The current input character is the last character to have been consumed.

The insertion point is the position (just before a character or just before the end of the input stream) where content inserted using document.write() is actually inserted. The insertion point is relative to the position of the character immediately after it; it is not an absolute offset into the input stream. Initially, the insertion point is undefined.

The "EOF" character in the tables below is a conceptual character representing the end of the input stream. If the parser is a script-created parser, then the end of the input stream is reached when an explicit "EOF" character (inserted by the document.close() method) is consumed. Otherwise, the "EOF" character is not a real character in the stream, but rather the lack of any further characters.

13.2.4 Parse state

13.2.4.1 The insertion mode

The insertion mode is a state variable that controls the primary operation of the tree construction stage.

Initially, the insertion mode is "initial". It can change to "before html", "before head", "in head", "in head noscript", "after head", "in body", "text", "in table", "in table text", "in caption", "in column group", "in table body", "in row", "in cell", "in select", "in select in table", "in template", "after body", "in frameset", "after frameset", "after after body", and "after after frameset" during the course of the parsing, as described in the tree construction stage. The insertion mode affects how tokens are processed and whether CDATA sections are supported.

Several of these modes, namely "in head", "in body", "in table", and "in select", are special, in that the other modes defer to them at various times. When the algorithm below says that the user agent is to do something "using the rules for the m insertion mode", where m is one of these modes, the user agent must use the rules described under the m insertion mode's section, but must leave the insertion mode unchanged unless the rules in m themselves switch the insertion mode to a new value.

When the insertion mode is switched to "text" or "in table text", the original insertion mode is also set. This is the insertion mode to which the tree construction stage will return.

Similarly, to parse nested template elements, a stack of template insertion modes is used. It is initially empty. The current template insertion mode is the insertion mode that was most recently added to the stack of template insertion modes. The algorithms in the sections below will push insertion modes onto this stack, meaning that the specified insertion mode is to be added to the stack, and pop insertion modes from the stack, which means that the most recently added insertion mode must be removed from the stack.


When the steps below require the UA to reset the insertion mode appropriately, it means the UA must follow these steps:

  1. Let last be false.

  2. Let node be the last node in the stack of open elements.

  3. Loop: If node is the first node in the stack of open elements, then set last to true, and, if the parser was created as part of the HTML fragment parsing algorithm (fragment case), set node to the context element passed to that algorithm.

  4. If node is a select element, run these substeps:

    1. If last is true, jump to the step below labeled done.

    2. Let ancestor be node.

    3. Loop: If ancestor is the first node in the stack of open elements, jump to the step below labeled done.

    4. Let ancestor be the node before ancestor in the stack of open elements.

    5. If ancestor is a template node, jump to the step below labeled done.

    6. If ancestor is a table node, switch the insertion mode to "in select in table" and return.

    7. Jump back to the substep above labeled loop (substep 3 of this list, not the outer loop).

    8. Done: Switch the insertion mode to "in select" and return.

  5. If node is a td or th element and last is false, then switch the insertion mode to "in cell" and return.

  6. If node is a tr element, then switch the insertion mode to "in row" and return.

  7. If node is a tbody, thead, or tfoot element, then switch the insertion mode to "in table body" and return.

  8. If node is a caption element, then switch the insertion mode to "in caption" and return.

  9. If node is a colgroup element, then switch the insertion mode to "in column group" and return.

  10. If node is a table element, then switch the insertion mode to "in table" and return.

  11. If node is a template element, then switch the insertion mode to the current template insertion mode and return.

  12. If node is a head element and last is false, then switch the insertion mode to "in head" and return.

  13. If node is a body element, then switch the insertion mode to "in body" and return.

  14. If node is a frameset element, then switch the insertion mode to "in frameset" and return. (fragment case)

  15. If node is an html element, run these substeps:

    1. If the head element pointer is null, switch the insertion mode to "before head" and return. (fragment case)

    2. Otherwise (the head element pointer is not null), switch the insertion mode to "after head" and return.

  16. If last is true, then switch the insertion mode to "in body" and return. (fragment case)

  17. Let node now be the node before node in the stack of open elements.

  18. Return to the step labeled loop.
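
The steps above can be sketched as follows. This is a simplified model rather than a full implementation: elements are represented by their tag names only (namespaces are ignored), the stack is a Python list whose first item is the html element and whose last item is the current node, and the function and parameter names are hypothetical.

```python
def reset_insertion_mode(stack, context=None, head_pointer_set=False,
                         template_modes=()):
    """Return the insertion mode chosen by 'reset the insertion mode
    appropriately'. `context` is the fragment-case context element, if any."""
    last = False
    index = len(stack) - 1               # start at the current node
    while True:
        node = stack[index]
        if index == 0:                   # first node in the stack
            last = True
            if context is not None:      # fragment case
                node = context
        if node == "select":
            if not last:
                ancestor = index
                while ancestor > 0:      # inner loop over earlier entries
                    ancestor -= 1
                    if stack[ancestor] == "template":
                        break
                    if stack[ancestor] == "table":
                        return "in select in table"
            return "in select"
        if node in ("td", "th") and not last:
            return "in cell"
        if node == "tr":
            return "in row"
        if node in ("tbody", "thead", "tfoot"):
            return "in table body"
        if node == "caption":
            return "in caption"
        if node == "colgroup":
            return "in column group"
        if node == "table":
            return "in table"
        if node == "template":
            return template_modes[-1]    # current template insertion mode
        if node == "head" and not last:
            return "in head"
        if node == "body":
            return "in body"
        if node == "frameset":
            return "in frameset"
        if node == "html":
            return "after head" if head_pointer_set else "before head"
        if last:
            return "in body"
        index -= 1
```

For instance, with the stack `["html", "body", "table", "tr", "td", "select"]` the inner loop walks back from `select` until it meets `table`, so the result is `"in select in table"`.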

13.2.4.2 The stack of open elements

Initially, the stack of open elements is empty. The stack grows downwards; the topmost node on the stack is the first one added to the stack, and the bottommost node of the stack is the most recently added node in the stack (notwithstanding when the stack is manipulated in a random access fashion as part of the handling for misnested tags).

The "before html" insertion mode creates the html document element, which is then added to the stack.

In the fragment case, the stack of open elements is initialized to contain an html element that is created as part of that algorithm. (The fragment case skips the "before html" insertion mode.)

The html node, however it is created, is the topmost node of the stack. It only gets popped off the stack when the parser finishes.

The current node is the bottommost node in this stack of open elements.

The adjusted current node is the context element if the parser was created as part of the HTML fragment parsing algorithm and the stack of open elements has only one element in it (fragment case); otherwise, the adjusted current node is the current node.

Elements in the stack of open elements fall into the following categories:

Special

The following elements have varying levels of special parsing rules: HTML's address, applet, area, article, aside, base, basefont, bgsound, blockquote, body, br, button, caption, center, col, colgroup, dd, details, dir, div, dl, dt, embed, fieldset, figcaption, figure, footer, form, frame, frameset, h1, h2, h3, h4, h5, h6, head, header, hgroup, hr, html, iframe, img, input, keygen, li, link, listing, main, marquee, menu, meta, nav, noembed, noframes, noscript, object, ol, p, param, plaintext, pre, script, section, select, source, style, summary, table, tbody, td, template, textarea, tfoot, th, thead, title, tr, track, ul, wbr, xmp; MathML mi, MathML mo, MathML mn, MathML ms, MathML mtext, and MathML annotation-xml; and SVG foreignObject, SVG desc, and SVG title.

An image start tag token is handled by the tree builder, but it is not in this list because it is not an element; it gets turned into an img element.

Formatting

The following HTML elements are those that end up in the list of active formatting elements: a, b, big, code, em, font, i, nobr, s, small, strike, strong, tt, and u.

Ordinary

All other elements found while parsing an HTML document.

Typically, the special elements have the start and end tag tokens handled specifically, while ordinary elements' tokens fall into "any other start tag" and "any other end tag" clauses, and some parts of the tree builder check if a particular element in the stack of open elements is in the special category. However, some elements (e.g., the option element) have their start or end tag tokens handled specifically, but are still not in the special category, so that they get the ordinary handling elsewhere.

The stack of open elements is said to have an element target node in a specific scope consisting of a list of element types list when the following algorithm terminates in a match state:

  1. Initialize node to be the current node (the bottommost node of the stack).

  2. If node is the target node, terminate in a match state.

  3. Otherwise, if node is one of the element types in list, terminate in a failure state.

  4. Otherwise, set node to the previous entry in the stack of open elements and return to step 2. (This will never fail, since the loop will always terminate in the previous step if the top of the stack — an html element — is reached.)
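
The scope check is a walk from the current node towards the top of the stack. The sketch below models elements by tag name only, which is a simplification (the real check compares nodes and element types with namespaces), and the names `has_element_in_specific_scope` and `TABLE_SCOPE` are hypothetical. The three types in `TABLE_SCOPE` are the ones the HTML Standard lists for table scope.

```python
def has_element_in_specific_scope(stack, target, scope_types):
    """stack[0] is the html element; stack[-1] is the current node."""
    for node in reversed(stack):          # bottommost node first
        if node == target:
            return True                   # match state
        if node in scope_types:
            return False                  # failure state
    # Not reached when the scope list contains html, as the standard's
    # scope lists do; kept so the function is total.
    return False


TABLE_SCOPE = {"html", "table", "template"}
```

With the stack `["html", "body", "table", "tr", "td"]`, `tr` is in table scope, but `body` is not, because the walk reaches `table` first.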

The stack of open elements is said to have a particular element in scope when it has that element in the specific scope consisting of the following element types:

    applet, caption, html, table, td, th, marquee, object, template, MathML mi, MathML mo, MathML mn, MathML ms, MathML mtext, MathML annotation-xml, SVG foreignObject, SVG desc, and SVG title.

The stack of open elements is said to have a particular element in list item scope when it has that element in the specific scope consisting of the following element types:

    All the element types listed above for the has an element in scope algorithm, plus ol and ul.

The stack of open elements is said to have a particular element in button scope when it has that element in the specific scope consisting of the following element types:

    All the element types listed above for the has an element in scope algorithm, plus button.

The stack of open elements is said to have a particular element in table scope when it has that element in the specific scope consisting of the following element types:

    html, table, and template.

The stack of open elements is said to have a particular element in select scope when it has that element in the specific scope consisting of all element types except the following:

    optgroup and option.

Nothing happens if at any time any of the elements in the stack of open elements are moved to a new location in, or removed from, the Document tree. In particular, the stack is not changed in this situation. This can cause, amongst other strange effects, content to be appended to nodes that are no longer in the DOM.

In some cases (namely, when closing misnested formatting elements), the stack is manipulated in a random-access fashion.

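13.2.4.3 The list of active formatting elements

Initially, the list of active formatting elements is empty. It is used to handle mis-nested formatting element tags.

The list contains elements in the formatting category, and markers. The markers are inserted when entering applet, object, marquee, template, td, th, and caption elements, and are used to prevent formatting from "leaking" into applet, object, marquee, template, td, th, and caption elements.

In addition, each element in the list of active formatting elements is associated with the token for which it was created, so that further elements can be created for that token if necessary.

When the steps below require the UA to push onto the list of active formatting elements an element element, the UA must perform the following steps:

  1. If there are already three elements in the list of active formatting elements after the last marker, if any, or anywhere in the list if there are no markers, that have the same tag name, namespace, and attributes as element, then remove the earliest such element from the list of active formatting elements. For these purposes, the attributes must be compared as they were when the elements were created by the parser; two elements have the same attributes if all their parsed attributes can be paired such that the two attributes in each pair have identical names, namespaces, and values (the order of the attributes does not matter).

    This is the Noah's Ark clause. But with three per family instead of two.

  2. Add element to the list of active formatting elements.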
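
The push steps above, including the Noah's Ark clause, can be sketched as follows. This is a minimal model under assumptions: `Element`, `MARKER`, and the function names are hypothetical, attributes are stored as a plain name-to-value dictionary (attribute namespaces are left out), and the dictionary is taken to hold the attributes as they were when the parser created the element.

```python
from dataclasses import dataclass, field

MARKER = object()  # stands in for a marker entry


@dataclass(eq=False)  # entries are compared by identity, not by value
class Element:
    name: str
    namespace: str = "http://www.w3.org/1999/xhtml"
    attrs: dict = field(default_factory=dict)


def same_signature(a: Element, b: Element) -> bool:
    # Dictionary equality ignores order, matching "order does not matter".
    return (a.name, a.namespace, a.attrs) == (b.name, b.namespace, b.attrs)


def push_formatting_element(active: list, element: Element) -> None:
    # Only entries after the last marker are considered (all, if no marker).
    start = 0
    for i in range(len(active) - 1, -1, -1):
        if active[i] is MARKER:
            start = i + 1
            break
    matches = [i for i in range(start, len(active))
               if same_signature(active[i], element)]
    if len(matches) >= 3:
        del active[matches[0]]  # drop the earliest matching entry
    active.append(element)
```

Pushing a fourth identical `b` element therefore leaves three in the list, with the oldest one removed.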

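When the steps below require the UA to reconstruct the active formatting elements, the UA must perform the following steps:

  1. If there are no entries in the list of active formatting elements, then there is nothing to reconstruct; stop this algorithm.

  2. If the last (most recently added) entry in the list of active formatting elements is a marker, or if it is an element that is in the stack of open elements, then there is nothing to reconstruct; stop this algorithm.

  3. Let entry be the last (most recently added) element in the list of active formatting elements.

  4. Rewind: If there are no entries before entry in the list of active formatting elements, then jump to the step labeled create.

  5. Let entry be the entry one earlier than entry in the list of active formatting elements.

  6. If entry is neither a marker nor an element that is also in the stack of open elements, go to the step labeled rewind.

  7. Advance: Let entry be the element one later than entry in the list of active formatting elements.

  8. Create: Insert an HTML element for the token for which the element entry was created, to obtain new element.

  9. Replace the entry for entry in the list with an entry for new element.

  10. If the entry for new element in the list of active formatting elements is not the last entry in the list, return to the step labeled advance.

This has the effect of reopening all the formatting elements that were opened in the current body, cell, or caption (whichever is youngest) that haven't been explicitly closed.

The way this specification is written, the list of active formatting elements always consists of elements in chronological order with the least recently added element first and the most recently added element last (except for while steps 7 to 10 of the above algorithm are running, of course).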
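
The rewind, advance, and create labels in the reconstruction algorithm map onto two loops. The sketch below is illustrative only: `Element`, `MARKER`, and `reconstruct_active_formatting` are hypothetical names, and `insert_element` is a caller-supplied stand-in for "insert an HTML element for the token", assumed to push the new element onto the stack of open elements and return it.

```python
MARKER = object()  # stands in for a marker entry


class Element:
    def __init__(self, name):
        self.name = name


def reconstruct_active_formatting(active, open_elements, insert_element):
    if not active:                                    # step 1
        return
    if active[-1] is MARKER or active[-1] in open_elements:
        return                                        # step 2
    i = len(active) - 1                               # step 3
    # Rewind (steps 4 to 6): walk back to the nearest marker or open
    # element, then advance one entry past it (step 7).
    while i > 0:
        i -= 1
        if active[i] is MARKER or active[i] in open_elements:
            i += 1
            break
    # Create (steps 8 to 10): re-create every entry from i to the end.
    while True:
        active[i] = insert_element(active[i])
        if i == len(active) - 1:
            return
        i += 1
```

Entries before the last marker are never touched, which is how markers keep formatting from leaking into elements such as table cells.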

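When the steps below require the UA to clear the list of active formatting elements up to the last marker, the UA must perform the following steps:

  1. Let entry be the last (most recently added) entry in the list of active formatting elements.

  2. Remove entry from the list of active formatting elements.

  3. If entry was a marker, then stop the algorithm at this point. The list has been cleared up to the last marker.

  4. Go to step 1.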
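
Clearing up to the last marker amounts to popping entries until a marker has been popped. A minimal sketch with hypothetical names; the guard against an empty list is an addition made here for safety, since the steps above assume a marker is present.

```python
MARKER = object()  # stands in for a marker entry


def clear_to_last_marker(active: list) -> None:
    """Pop entries from the end of the list, stopping after a marker."""
    while active:                  # guard: not part of the steps above
        if active.pop() is MARKER:
            return
```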

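13.2.4.4 The element pointers

Initially, the head element pointer and the form element pointer are both null.

Once a head element has been parsed (whether implicitly or explicitly), the head element pointer gets set to point to this node.

The form element pointer points to the last form element that was opened and whose end tag has not yet been seen. It is used to make form controls associate with forms even in the face of dramatically bad markup. It is ignored inside template elements.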

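13.2.4.5 Other parsing state flags

The scripting flag is set to "enabled" if scripting is enabled for the Document with which the parser is associated, and "disabled" otherwise.

The scripting flag can be enabled even when the parser was originally created for the HTML fragment parsing algorithm, even though script elements don't execute in that case.

The frameset-ok flag is set to "ok" when the parser is created. It is set to "not ok" after certain tokens are seen.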
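
The parse state described in section 13.2.4 can be pictured as one record. The sketch below is only an illustration of the initial values stated in the text; the class and field names are hypothetical, and booleans stand in for the "enabled"/"disabled" and "ok"/"not ok" flag values.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ParserState:
    insertion_mode: str = "initial"
    original_insertion_mode: Optional[str] = None
    template_insertion_modes: list = field(default_factory=list)
    open_elements: list = field(default_factory=list)       # first item is html
    active_formatting_elements: list = field(default_factory=list)
    head_element: Optional[object] = None                    # head element pointer
    form_element: Optional[object] = None                    # form element pointer
    scripting: bool = False                                  # scripting flag
    frameset_ok: bool = True                                 # frameset-ok flag

    @property
    def current_node(self):
        """The bottommost (most recently added) open element, if any."""
        return self.open_elements[-1] if self.open_elements else None

    @property
    def current_template_insertion_mode(self):
        """The most recently pushed template insertion mode, if any."""
        modes = self.template_insertion_modes
        return modes[-1] if modes else None
```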

13.2.5 Tokenization

Implementations must act as if they used the following state machine to tokenize HTML. The state machine must start in the data state. Most states consume a single character, which may have various side-effects, and either switches the state machine to a new state to reconsume the current input character, or switches it to a new state to consume the next character, or stays in the same state to consume the next character. Some states have more complicated behavior and can consume several characters before switching to another state. In some cases, the tokenizer state is also changed by the tree construction stage.

When a state says to reconsume a matched character in a specified state, that means to switch to that state, but when it attempts to consume the next input character, provide it with the current input character instead.

The exact behavior of certain states depends on the insertion mode and the stack of open elements. Certain states also use a temporary buffer to track progress, and the character reference state uses a return state to return to the state it was invoked from.

The output of the tokenization step is a series of zero or more of the following tokens: DOCTYPE, start tag, end tag, comment, character, end-of-file. DOCTYPE tokens have a name, a public identifier, a system identifier, and a force-quirks flag. When a DOCTYPE token is created, its name, public identifier, and system identifier must be marked as missing (which is a distinct state from the empty string), and the force-quirks flag must be set to off (its other state is on). Start and end tag tokens have a tag name, a self-closing flag, and a list of attributes, each of which has a name and a value. When a start or end tag token is created, its self-closing flag must be unset (its other state is that it be set), and its attributes list must be empty. Comment and character tokens have data.

When a token is emitted, it must immediately be handled by the tree construction stage. The tree construction stage can affect the state of the tokenization stage, and can insert additional characters into the stream. (For example, the script element can result in scripts executing and using the dynamic markup insertion APIs to insert characters into the stream being tokenized.)

Creating a token and emitting it are distinct actions. It is possible for a token to be created but implicitly abandoned (never emitted), e.g. if the file ends unexpectedly while processing the characters that are being parsed into a start tag token.

When a start tag token is emitted with its self-closing flag set, if the flag is not acknowledged when it is processed by the tree construction stage, that is a non-void-html-element-start-tag-with-trailing-solidus parse error.

When an end tag token is emitted with attributes, that is an end-tag-with-attributes parse error.

When an end tag token is emitted with its self-closing flag set, that is an end-tag-with-trailing-solidus parse error.

An appropriate end tag token is an end tag token whose tag name matches the tag name of the last start tag to have been emitted from this tokenizer, if any. If no start tag has been emitted from this tokenizer, then no end tag token is appropriate.
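
The token types and the checks made at emission time can be sketched with a few small classes. Everything named here (`Doctype`, `Tag`, `end_tag_errors`, `is_appropriate_end_tag`) is hypothetical; `None` is used for "missing" so that it stays distinct from the empty string, as the text requires.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Doctype:
    name: Optional[str] = None        # None means "missing", not ""
    public_id: Optional[str] = None
    system_id: Optional[str] = None
    force_quirks: bool = False        # flag starts off


@dataclass
class Tag:
    kind: str                         # "start" or "end"
    name: str = ""
    self_closing: bool = False        # flag starts unset
    attributes: list = field(default_factory=list)  # (name, value) pairs


@dataclass
class Comment:
    data: str = ""


@dataclass
class Character:
    data: str


@dataclass
class EndOfFile:
    pass


def end_tag_errors(token: Tag) -> list:
    """Parse errors raised when an end tag token is emitted."""
    errors = []
    if token.kind == "end" and token.attributes:
        errors.append("end-tag-with-attributes")
    if token.kind == "end" and token.self_closing:
        errors.append("end-tag-with-trailing-solidus")
    return errors


def is_appropriate_end_tag(token: Tag, last_start_tag_name: Optional[str]) -> bool:
    """last_start_tag_name is None if no start tag has been emitted yet."""
    return (token.kind == "end"
            and last_start_tag_name is not None
            and token.name == last_start_tag_name)
```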

A character reference is said to be consumed as part of an attribute if the return state is either attribute value (double-quoted) state, attribute value (single-quoted) state or attribute value (unquoted) state.

When a state says to flush code points consumed as a character reference, it means that for each code point in the temporary buffer (in the order they were added to the buffer) user agent must append the code point from the buffer to the current attribute's value if the character reference was consumed as part of an attribute, or emit the code point as a character token otherwise.
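
Flushing can be sketched as a branch on the return state. The names below are hypothetical: the current attribute is modeled as a dictionary with a `"value"` key, and `emit` is any callable that receives character tokens.

```python
ATTRIBUTE_VALUE_STATES = {
    "attribute value (double-quoted) state",
    "attribute value (single-quoted) state",
    "attribute value (unquoted) state",
}


def flush_code_points(temporary_buffer: str, return_state: str,
                      current_attribute: dict, emit) -> None:
    """Flush code points consumed as a character reference."""
    if return_state in ATTRIBUTE_VALUE_STATES:
        # Consumed as part of an attribute: extend the attribute's value.
        current_attribute["value"] += temporary_buffer
    else:
        # Otherwise emit each code point, in buffer order, as a character token.
        for code_point in temporary_buffer:
            emit(("character", code_point))
```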

Before each step of the tokenizer, the user agent must first check the parser pause flag. If it is true, then the tokenizer must abort the processing of any nested invocations of the tokenizer, yielding control back to the caller.

The tokenizer state machine consists of the states defined in the following subsections.

13.2.5.1 Data state

Consume the next input character:

U+0026 AMPERSAND (&)
Set the return state to the data state. Switch to the character reference state.
U+003C LESS-THAN SIGN (<)
Switch to the tag open state.
U+0000 NULL
This is an unexpected-null-character parse error. Emit the current input character as a character token.
EOF
Emit an end-of-file token.
Anything else
Emit the current input character as a character token.

13.2.5.2 RCDATA state

Consume the next input character:

U+0026 AMPERSAND (&)
Set the return state to the RCDATA state. Switch to the character reference state.
U+003C LESS-THAN SIGN (<)
Switch to the RCDATA less-than sign state.
U+0000 NULL
This is an unexpected-null-character parse error. Emit a U+FFFD REPLACEMENT CHARACTER character token.
EOF
Emit an end-of-file token.
Anything else
Emit the current input character as a character token.
13.2.5.3 RAWTEXT state

Consume the next input character:

U+003C LESS-THAN SIGN (<)
Switch to the RAWTEXT less-than sign state.
U+0000 NULL
This is an unexpected-null-character parse error. Emit a U+FFFD REPLACEMENT CHARACTER character token.
EOF
Emit an end-of-file token.
Anything else
Emit the current input character as a character token.
13.2.5.4 Script data state

Consume the next input character:

U+003C LESS-THAN SIGN (<)
Switch to the script data less-than sign state.
U+0000 NULL
This is an unexpected-null-character parse error. Emit a U+FFFD REPLACEMENT CHARACTER character token.
EOF
Emit an end-of-file token.
Anything else
Emit the current input character as a character token.
13.2.5.5 PLAINTEXT state

Consume the next input character:

U+0000 NULL
This is an unexpected-null-character parse error. Emit a U+FFFD REPLACEMENT CHARACTER character token.
EOF
Emit an end-of-file token.
Anything else
Emit the current input character as a character token.
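
The five text-like states above differ only in which characters they treat specially and in how they handle U+0000 NULL. The following Python sketch makes that comparison explicit with a small table. The table layout, the function name `text_state_step`, and the representation of tokens as strings are assumptions of this sketch; `None` stands for EOF.

```python
from typing import List, Optional, Tuple

# state name -> (handles "&", state entered on "<" or None, NULL becomes U+FFFD)
TEXT_STATES = {
    "data state": (True, "tag open state", False),
    "RCDATA state": (True, "RCDATA less-than sign state", True),
    "RAWTEXT state": (False, "RAWTEXT less-than sign state", True),
    "script data state": (False, "script data less-than sign state", True),
    "PLAINTEXT state": (False, None, True),
}

Step = Tuple[str, Optional[str], List[str], List[str]]


def text_state_step(state: str, ch: Optional[str]) -> Step:
    """Process one input character (None stands for EOF).

    Returns (next_state, return_state, emitted_tokens, parse_errors).
    Character tokens are single-character strings; the end-of-file token
    is the string "end-of-file" (a convention of this sketch only).
    """
    handles_amp, lt_state, replace_null = TEXT_STATES[state]
    if ch is None:
        return state, None, ["end-of-file"], []
    if ch == "&" and handles_amp:
        # Remember where to come back to after the character reference.
        return "character reference state", state, [], []
    if ch == "<" and lt_state is not None:
        return lt_state, None, [], []
    if ch == "\x00":
        emitted = "\ufffd" if replace_null else ch
        return state, None, [emitted], ["unexpected-null-character"]
    return state, None, [ch], []
```

Note that only the data state emits the NULL character unchanged; the other four states replace it with U+FFFD.
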
13.2.5.6 Tag open state

Consume the next input character:

U+0021 EXCLAMATION MARK (!)
Switch to the markup declaration open state.
U+002F SOLIDUS (/)
Switch to the end tag open state.
ASCII alpha
Create a new start tag token, set its tag name to the empty string. Reconsume in the tag name state.
U+003F QUESTION MARK (?)
This is an unexpected-question-mark-instead-of-tag-name parse error. Create a comment token whose data is the empty string. Reconsume in the bogus comment state.
EOF
This is an eof-before-tag-name parse error. Emit a U+003C LESS-THAN SIGN character token and an end-of-file token.
Anything else
This is an invalid-first-character-of-tag-name parse error. Emit a U+003C LESS-THAN SIGN character token. Reconsume in the data state.
13.2.5.7 End tag open state

Consume the next input character:

ASCII alpha
Create a new end tag token, set its tag name to the empty string. Reconsume in the tag name state.
U+003E GREATER-THAN SIGN (>)
This is a missing-end-tag-name parse error. Switch to the data state.
EOF
This is an eof-before-tag-name parse error. Emit a U+003C LESS-THAN SIGN character token, a U+002F SOLIDUS character token and an end-of-file token.
Anything else
This is an invalid-first-character-of-tag-name parse error. Create a comment token whose data is the empty string. Reconsume in the bogus comment state.
13.2.5.8 Tag name state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
Switch to the before attribute name state.
U+002F SOLIDUS (/)
Switch to the self-closing start tag state.
U+003E GREATER-THAN SIGN (>)
Switch to the data state. Emit the current tag token.
ASCII upper alpha
Append the lowercase version of the current input character (add 0x0020 to the character's code point) to the current tag token's tag name.
U+0000 NULL
This is an unexpected-null-character parse error. Append a U+FFFD REPLACEMENT CHARACTER character to the current tag token's tag name.
EOF
This is an eof-in-tag parse error. Emit an end-of-file token.
Anything else
Append the current input character to the current tag token's tag name.
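
As an illustration, the tag name state can be sketched as a loop over the input. The function name `run_tag_name_state`, its return shape, and the "end of file" marker are assumptions of this sketch; a real tokenizer would also emit tokens rather than just return the next state.

```python
from typing import List, Tuple

WHITESPACE = "\t\n\f "


def run_tag_name_state(text: str, pos: int) -> Tuple[str, str, int, List[str]]:
    """Consume a tag name starting at text[pos].

    Returns (tag_name, next_state, new_pos, parse_errors).
    """
    name: List[str] = []
    errors: List[str] = []
    while pos < len(text):
        ch = text[pos]
        pos += 1
        if ch in WHITESPACE:
            return "".join(name), "before attribute name state", pos, errors
        if ch == "/":
            return "".join(name), "self-closing start tag state", pos, errors
        if ch == ">":
            # The current tag token would be emitted here.
            return "".join(name), "data state", pos, errors
        if "A" <= ch <= "Z":
            # ASCII upper alpha: lowercase by adding 0x0020 to the code point.
            name.append(chr(ord(ch) + 0x0020))
        elif ch == "\x00":
            errors.append("unexpected-null-character")
            name.append("\ufffd")
        else:
            name.append(ch)
    # EOF inside a tag: the tag token is never emitted.
    errors.append("eof-in-tag")
    return "".join(name), "end of file", pos, errors
```
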
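13.2.5.9 RCDATA less-than sign state

Consume the next input character:

U+002F SOLIDUS (/)
Set the temporary buffer to the empty string. Switch to the RCDATA end tag open state.
Anything else
Emit a U+003C LESS-THAN SIGN character token. Reconsume in the RCDATA state.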
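13.2.5.10 RCDATA end tag open state

Consume the next input character:

ASCII alpha
Create a new end tag token, set its tag name to the empty string. Reconsume in the RCDATA end tag name state.
Anything else
Emit a U+003C LESS-THAN SIGN character token and a U+002F SOLIDUS character token. Reconsume in the RCDATA state.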
13.2.5.11 RCDATA end tag name state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
If the current end tag token is an appropriate end tag token, then switch to the before attribute name state. Otherwise, treat it as per the "anything else" entry below.
U+002F SOLIDUS (/)
If the current end tag token is an appropriate end tag token, then switch to the self-closing start tag state. Otherwise, treat it as per the "anything else" entry below.
U+003E GREATER-THAN SIGN (>)
If the current end tag token is an appropriate end tag token, then switch to the data state and emit the current tag token. Otherwise, treat it as per the "anything else" entry below.
ASCII upper alpha
Append the lowercase version of the current input character (add 0x0020 to the character's code point) to the current tag token's tag name. Append the current input character to the temporary buffer.
ASCII lower alpha
Append the current input character to the current tag token's tag name. Append the current input character to the temporary buffer.
Anything else
Emit a U+003C LESS-THAN SIGN character token, a U+002F SOLIDUS character token, and a character token for each of the characters in the temporary buffer (in the order they were added to the buffer). Reconsume in the RCDATA state.
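
The RCDATA end tag name state only ends the element when the candidate end tag is an appropriate end tag token; otherwise everything consumed so far is emitted as text. A Python sketch of that decision follows. It assumes `pos` points just after the `</` sequence, that `last_start_tag` is `None` when no start tag has been emitted, and that the function name and return shape are hypothetical.

```python
from typing import Optional, Tuple


def rcdata_end_tag_name(text: str, pos: int,
                        last_start_tag: Optional[str]) -> Tuple[str, str, str, int]:
    """Decide whether "</name" inside RCDATA closes the element.

    Returns (kind, value, next_state, new_pos), where kind is "end tag"
    (value is the tag name) or "characters" (value is the text to emit).
    """
    tag_name = ""
    temporary_buffer = ""
    while pos < len(text):
        ch = text[pos]
        appropriate = last_start_tag is not None and tag_name == last_start_tag
        if ch in "\t\n\f " and appropriate:
            return "end tag", tag_name, "before attribute name state", pos + 1
        if ch == "/" and appropriate:
            return "end tag", tag_name, "self-closing start tag state", pos + 1
        if ch == ">" and appropriate:
            return "end tag", tag_name, "data state", pos + 1
        if ch.isascii() and ch.isalpha():
            tag_name += ch.lower()   # the tag name is lowercased
            temporary_buffer += ch   # the buffer keeps the original character
            pos += 1
            continue
        break
    # Anything else (including EOF): emit "</" plus the buffered characters
    # as text and reconsume the current character in the RCDATA state.
    return "characters", "</" + temporary_buffer, "RCDATA state", pos
```

With `last_start_tag` set to `"title"`, the input `title>` closes the element, while `b>` yields the characters `</b` and tokenization continues in the RCDATA state. The RAWTEXT, script data, and script data escaped end tag name states follow the same pattern with a different state to return to.
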
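13.2.5.12 RAWTEXT less-than sign state

Consume the next input character:

U+002F SOLIDUS (/)
Set the temporary buffer to the empty string. Switch to the RAWTEXT end tag open state.
Anything else
Emit a U+003C LESS-THAN SIGN character token. Reconsume in the RAWTEXT state.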
13.2.5.13 RAWTEXT end tag open state

Consume the next input character:

ASCII alpha
Create a new end tag token, set its tag name to the empty string. Reconsume in the RAWTEXT end tag name state.
Anything else
Emit a U+003C LESS-THAN SIGN character token and a U+002F SOLIDUS character token. Reconsume in the RAWTEXT state.
13.2.5.14 RAWTEXT end tag name state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
If the current end tag token is an appropriate end tag token, then switch to the before attribute name state. Otherwise, treat it as per the "anything else" entry below.
U+002F SOLIDUS (/)
If the current end tag token is an appropriate end tag token, then switch to the self-closing start tag state. Otherwise, treat it as per the "anything else" entry below.
U+003E GREATER-THAN SIGN (>)
If the current end tag token is an appropriate end tag token, then switch to the data state and emit the current tag token. Otherwise, treat it as per the "anything else" entry below.
ASCII upper alpha
Append the lowercase version of the current input character (add 0x0020 to the character's code point) to the current tag token's tag name. Append the current input character to the temporary buffer.
ASCII lower alpha
Append the current input character to the current tag token's tag name. Append the current input character to the temporary buffer.
Anything else
Emit a U+003C LESS-THAN SIGN character token, a U+002F SOLIDUS character token, and a character token for each of the characters in the temporary buffer (in the order they were added to the buffer). Reconsume in the RAWTEXT state.
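13.2.5.15 Script data less-than sign state

Consume the next input character:

U+002F SOLIDUS (/)
Set the temporary buffer to the empty string. Switch to the script data end tag open state.
U+0021 EXCLAMATION MARK (!)
Switch to the script data escape start state. Emit a U+003C LESS-THAN SIGN character token and a U+0021 EXCLAMATION MARK character token.
Anything else
Emit a U+003C LESS-THAN SIGN character token. Reconsume in the script data state.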
13.2.5.16 Script data end tag open state

Consume the next input character:

ASCII alpha
Create a new end tag token, set its tag name to the empty string. Reconsume in the script data end tag name state.
Anything else
Emit a U+003C LESS-THAN SIGN character token and a U+002F SOLIDUS character token. Reconsume in the script data state.
13.2.5.17 Script data end tag name state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
If the current end tag token is an appropriate end tag token, then switch to the before attribute name state. Otherwise, treat it as per the "anything else" entry below.
U+002F SOLIDUS (/)
If the current end tag token is an appropriate end tag token, then switch to the self-closing start tag state. Otherwise, treat it as per the "anything else" entry below.
U+003E GREATER-THAN SIGN (>)
If the current end tag token is an appropriate end tag token, then switch to the data state and emit the current tag token. Otherwise, treat it as per the "anything else" entry below.
ASCII upper alpha
Append the lowercase version of the current input character (add 0x0020 to the character's code point) to the current tag token's tag name. Append the current input character to the temporary buffer.
ASCII lower alpha
Append the current input character to the current tag token's tag name. Append the current input character to the temporary buffer.
Anything else
Emit a U+003C LESS-THAN SIGN character token, a U+002F SOLIDUS character token, and a character token for each of the characters in the temporary buffer (in the order they were added to the buffer). Reconsume in the script data state.
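13.2.5.18 Script data escape start state

Consume the next input character:

U+002D HYPHEN-MINUS (-)
Switch to the script data escape start dash state. Emit a U+002D HYPHEN-MINUS character token.
Anything else
Reconsume in the script data state.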
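13.2.5.19 Script data escape start dash state

Consume the next input character:

U+002D HYPHEN-MINUS (-)
Switch to the script data escaped dash dash state. Emit a U+002D HYPHEN-MINUS character token.
Anything else
Reconsume in the script data state.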
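13.2.5.20 Script data escaped state

Consume the next input character:

U+002D HYPHEN-MINUS (-)
Switch to the script data escaped dash state. Emit a U+002D HYPHEN-MINUS character token.
U+003C LESS-THAN SIGN (<)
Switch to the script data escaped less-than sign state.
U+0000 NULL
This is an unexpected-null-character parse error. Emit a U+FFFD REPLACEMENT CHARACTER character token.
EOF
This is an eof-in-script-html-comment-like-text parse error. Emit an end-of-file token.
Anything else
Emit the current input character as a character token.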
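13.2.5.21 Script data escaped dash state

Consume the next input character:

U+002D HYPHEN-MINUS (-)
Switch to the script data escaped dash dash state. Emit a U+002D HYPHEN-MINUS character token.
U+003C LESS-THAN SIGN (<)
Switch to the script data escaped less-than sign state.
U+0000 NULL
This is an unexpected-null-character parse error. Switch to the script data escaped state. Emit a U+FFFD REPLACEMENT CHARACTER character token.
EOF
This is an eof-in-script-html-comment-like-text parse error. Emit an end-of-file token.
Anything else
Switch to the script data escaped state. Emit the current input character as a character token.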
13.2.5.22 Script data escaped dash dash state

Consume the next input character:

U+002D HYPHEN-MINUS (-)
Emit a U+002D HYPHEN-MINUS character token.
U+003C LESS-THAN SIGN (<)
Switch to the script data escaped less-than sign state.
U+003E GREATER-THAN SIGN (>)
Switch to the script data state. Emit a U+003E GREATER-THAN SIGN character token.
U+0000 NULL
This is an unexpected-null-character parse error. Switch to the script data escaped state. Emit a U+FFFD REPLACEMENT CHARACTER character token.
EOF
This is an eof-in-script-html-comment-like-text parse error. Emit an end-of-file token.
Anything else
Switch to the script data escaped state. Emit the current input character as a character token.
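13.2.5.23 Script data escaped less-than sign state

Consume the next input character:

U+002F SOLIDUS (/)
Set the temporary buffer to the empty string. Switch to the script data escaped end tag open state.
ASCII alpha
Set the temporary buffer to the empty string. Emit a U+003C LESS-THAN SIGN character token. Reconsume in the script data double escape start state.
Anything else
Emit a U+003C LESS-THAN SIGN character token. Reconsume in the script data escaped state.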
13.2.5.24 Script data escaped end tag open state

Consume the next input character:

ASCII alpha
Create a new end tag token, set its tag name to the empty string. Reconsume in the script data escaped end tag name state.
Anything else
Emit a U+003C LESS-THAN SIGN character token and a U+002F SOLIDUS character token. Reconsume in the script data escaped state.
13.2.5.25 Script data escaped end tag name state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
If the current end tag token is an appropriate end tag token, then switch to the before attribute name state. Otherwise, treat it as per the "anything else" entry below.
U+002F SOLIDUS (/)
If the current end tag token is an appropriate end tag token, then switch to the self-closing start tag state. Otherwise, treat it as per the "anything else" entry below.
U+003E GREATER-THAN SIGN (>)
If the current end tag token is an appropriate end tag token, then switch to the data state and emit the current tag token. Otherwise, treat it as per the "anything else" entry below.
ASCII upper alpha
Append the lowercase version of the current input character (add 0x0020 to the character's code point) to the current tag token's tag name. Append the current input character to the temporary buffer.
ASCII lower alpha
Append the current input character to the current tag token's tag name. Append the current input character to the temporary buffer.
Anything else
Emit a U+003C LESS-THAN SIGN character token, a U+002F SOLIDUS character token, and a character token for each of the characters in the temporary buffer (in the order they were added to the buffer). Reconsume in the script data escaped state.
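13.2.5.26 Script data double escape start state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
U+002F SOLIDUS (/)
U+003E GREATER-THAN SIGN (>)
If the temporary buffer is the string "script", then switch to the script data double escaped state. Otherwise, switch to the script data escaped state. Emit the current input character as a character token.
ASCII upper alpha
Append the lowercase version of the current input character (add 0x0020 to the character's code point) to the temporary buffer. Emit the current input character as a character token.
ASCII lower alpha
Append the current input character to the temporary buffer. Emit the current input character as a character token.
Anything else
Reconsume in the script data escaped state.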
13.2.5.27 Script data double escaped state

Consume the next input character:

U+002D HYPHEN-MINUS (-)
Switch to the script data double escaped dash state. Emit a U+002D HYPHEN-MINUS character token.
U+003C LESS-THAN SIGN (<)
Switch to the script data double escaped less-than sign state. Emit a U+003C LESS-THAN SIGN character token.
U+0000 NULL
This is an unexpected-null-character parse error. Emit a U+FFFD REPLACEMENT CHARACTER character token.
EOF
This is an eof-in-script-html-comment-like-text parse error. Emit an end-of-file token.
Anything else
Emit the current input character as a character token.
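13.2.5.28 Script data double escaped dash state

Consume the next input character:

U+002D HYPHEN-MINUS (-)
Switch to the script data double escaped dash dash state. Emit a U+002D HYPHEN-MINUS character token.
U+003C LESS-THAN SIGN (<)
Switch to the script data double escaped less-than sign state. Emit a U+003C LESS-THAN SIGN character token.
U+0000 NULL
This is an unexpected-null-character parse error. Switch to the script data double escaped state. Emit a U+FFFD REPLACEMENT CHARACTER character token.
EOF
This is an eof-in-script-html-comment-like-text parse error. Emit an end-of-file token.
Anything else
Switch to the script data double escaped state. Emit the current input character as a character token.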
13.2.5.29 Script data double escaped dash dash state

Consume the next input character:

U+002D HYPHEN-MINUS (-)
Emit a U+002D HYPHEN-MINUS character token.
U+003C LESS-THAN SIGN (<)
Switch to the script data double escaped less-than sign state. Emit a U+003C LESS-THAN SIGN character token.
U+003E GREATER-THAN SIGN (>)
Switch to the script data state. Emit a U+003E GREATER-THAN SIGN character token.
U+0000 NULL
This is an unexpected-null-character parse error. Switch to the script data double escaped state. Emit a U+FFFD REPLACEMENT CHARACTER character token.
EOF
This is an eof-in-script-html-comment-like-text parse error. Emit an end-of-file token.
Anything else
Switch to the script data double escaped state. Emit the current input character as a character token.
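13.2.5.30 Script data double escaped less-than sign state

Consume the next input character:

U+002F SOLIDUS (/)
Set the temporary buffer to the empty string. Switch to the script data double escape end state. Emit a U+002F SOLIDUS character token.
Anything else
Reconsume in the script data double escaped state.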
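13.2.5.31 Script data double escape end state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
U+002F SOLIDUS (/)
U+003E GREATER-THAN SIGN (>)
If the temporary buffer is the string "script", then switch to the script data escaped state. Otherwise, switch to the script data double escaped state. Emit the current input character as a character token.
ASCII upper alpha
Append the lowercase version of the current input character (add 0x0020 to the character's code point) to the temporary buffer. Emit the current input character as a character token.
ASCII lower alpha
Append the current input character to the temporary buffer. Emit the current input character as a character token.
Anything else
Reconsume in the script data double escaped state.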
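
The script data double escape start and end states mirror each other: both collect letters into the temporary buffer and compare it with "script" when a delimiter arrives, but they switch in opposite directions. The Python sketch below captures that symmetry in one function. The function names, the string state names used as values, and the return shape are assumptions of this sketch; `None` stands for EOF.

```python
from typing import Optional, Tuple

START = "script data double escape start state"
END = "script data double escape end state"
ESCAPED = "script data escaped state"
DOUBLE_ESCAPED = "script data double escaped state"


def double_escape_step(state: str, ch: Optional[str],
                       temporary_buffer: str) -> Tuple[str, str, str, bool]:
    """One step of the double escape start or end state.

    Returns (next_state, temporary_buffer, emitted_characters, reconsume).
    """
    if state == START:
        on_match, otherwise = DOUBLE_ESCAPED, ESCAPED
    else:
        on_match, otherwise = ESCAPED, DOUBLE_ESCAPED
    if ch is not None and ch in "\t\n\f />":
        next_state = on_match if temporary_buffer == "script" else otherwise
        return next_state, temporary_buffer, ch, False
    if ch is not None and ch.isascii() and ch.isalpha():
        # The buffer stores the lowercase letter; the original is emitted.
        return state, temporary_buffer + ch.lower(), ch, False
    # Anything else (including EOF): reconsume without emitting.
    return otherwise, temporary_buffer, "", True


def run(state: str, text: str) -> str:
    """Feed text through one of the two states until the state changes."""
    buffer = ""
    for ch in text:
        next_state, buffer, _, _ = double_escape_step(state, ch, buffer)
        if next_state != state:
            return next_state
    return state
```

In this sketch, `run(START, "script>")` ends in the script data double escaped state, while `run(START, "scripts>")` falls back to the script data escaped state because the buffer is not exactly "script".
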
13.2.5.32 Before attribute name state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
Ignore the character.
U+002F SOLIDUS (/)
U+003E GREATER-THAN SIGN (>)
EOF
Reconsume in the after attribute name state.
U+003D EQUALS SIGN (=)
This is an unexpected-equals-sign-before-attribute-name parse error. Start a new attribute in the current tag token. Set that attribute's name to the current input character, and its value to the empty string. Switch to the attribute name state.
Anything else
Start a new attribute in the current tag token. Set that attribute's name and value to the empty string. Reconsume in the attribute name state.
13.2.5.33 Attribute name state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
U+002F SOLIDUS (/)
U+003E GREATER-THAN SIGN (>)
EOF
Reconsume in the after attribute name state.
U+003D EQUALS SIGN (=)
Switch to the before attribute value state.
ASCII upper alpha
Append the lowercase version of the current input character (add 0x0020 to the character's code point) to the current attribute's name.
U+0000 NULL
This is an unexpected-null-character parse error. Append a U+FFFD REPLACEMENT CHARACTER character to the current attribute's name.
U+0022 QUOTATION MARK (")
U+0027 APOSTROPHE (')
U+003C LESS-THAN SIGN (<)
This is an unexpected-character-in-attribute-name parse error. Treat it as per the "anything else" entry below.
Anything else
Append the current input character to the current attribute's name.

When the user agent leaves the attribute name state (and before emitting the tag token, if appropriate), the complete attribute's name must be compared to the other attributes on the same token; if there is already an attribute on the token with the exact same name, then this is a duplicate-attribute parse error and the new attribute must be removed from the token.

If an attribute is so removed from a token, it, and the value that gets associated with it, if any, are never subsequently used by the parser, and are therefore effectively discarded. Removing the attribute in this way does not change its status as the "current attribute" for the purposes of the tokenizer, however.
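
The duplicate check can be sketched in Python as follows. The `Attribute` and `TagToken` classes and the function `leave_attribute_name_state` are hypothetical names chosen for this illustration. The sketch removes the new attribute from the token by identity, so the object can remain the tokenizer's "current attribute" and keep receiving a value that is then simply discarded.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Attribute:
    name: str = ""
    value: str = ""


@dataclass
class TagToken:
    name: str
    attributes: List[Attribute] = field(default_factory=list)


def leave_attribute_name_state(token: TagToken, current: Attribute,
                               errors: List[str]) -> None:
    """Call when the tokenizer leaves the attribute name state.

    current is the attribute being built; it is assumed to be already
    present in token.attributes.
    """
    duplicate = any(other is not current and other.name == current.name
                    for other in token.attributes)
    if duplicate:
        errors.append("duplicate-attribute")
        # Compare by identity, not equality: an earlier attribute may have
        # the same name and value as the new one at this point.
        token.attributes = [a for a in token.attributes if a is not current]
```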

13.2.5.34 After attribute name state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
Ignore the character.
U+002F SOLIDUS (/)
Switch to the self-closing start tag state.
U+003D EQUALS SIGN (=)
Switch to the before attribute value state.
U+003E GREATER-THAN SIGN (>)
Switch to the data state. Emit the current tag token.
EOF
This is an eof-in-tag parse error. Emit an end-of-file token.
Anything else
Start a new attribute in the current tag token. Set that attribute's name and value to the empty string. Reconsume in the attribute name state.
13.2.5.35 Before attribute value state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
Ignore the character.
U+0022 QUOTATION MARK (")
Switch to the attribute value (double-quoted) state.
U+0027 APOSTROPHE (')
Switch to the attribute value (single-quoted) state.
U+003E GREATER-THAN SIGN (>)
This is a missing-attribute-value parse error. Switch to the data state. Emit the current tag token.
Anything else
Reconsume in the attribute value (unquoted) state.
13.2.5.36 Attribute value (double-quoted) state

Consume the next input character:

U+0022 QUOTATION MARK (")
Switch to the after attribute value (quoted) state.
U+0026 AMPERSAND (&)
Set the return state to the attribute value (double-quoted) state. Switch to the character reference state.
U+0000 NULL
This is an unexpected-null-character parse error. Append a U+FFFD REPLACEMENT CHARACTER character to the current attribute's value.
EOF
This is an eof-in-tag parse error. Emit an end-of-file token.
Anything else
Append the current input character to the current attribute's value.
13.2.5.37 Attribute value (single-quoted) state

Consume the next input character:

U+0027 APOSTROPHE (')
Switch to the after attribute value (quoted) state.
U+0026 AMPERSAND (&)
Set the return state to the attribute value (single-quoted) state. Switch to the character reference state.
U+0000 NULL
This is an unexpected-null-character parse error. Append a U+FFFD REPLACEMENT CHARACTER character to the current attribute's value.
EOF
This is an eof-in-tag parse error. Emit an end-of-file token.
Anything else
Append the current input character to the current attribute's value.
13.2.5.38 Attribute value (unquoted) state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
Switch to the before attribute name state.
U+0026 AMPERSAND (&)
Set the return state to the attribute value (unquoted) state. Switch to the character reference state.
U+003E GREATER-THAN SIGN (>)
Switch to the data state. Emit the current tag token.
U+0000 NULL
This is an unexpected-null-character parse error. Append a U+FFFD REPLACEMENT CHARACTER character to the current attribute's value.
U+0022 QUOTATION MARK (")
U+0027 APOSTROPHE (')
U+003C LESS-THAN SIGN (<)
U+003D EQUALS SIGN (=)
U+0060 GRAVE ACCENT (`)
This is an unexpected-character-in-unquoted-attribute-value parse error. Treat it as per the "anything else" entry below.
EOF
This is an eof-in-tag parse error. Emit an end-of-file token.
Anything else
Append the current input character to the current attribute's value.
13.2.5.39 After attribute value (quoted) state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
Switch to the before attribute name state.
U+002F SOLIDUS (/)
Switch to the self-closing start tag state.
U+003E GREATER-THAN SIGN (>)
Switch to the data state. Emit the current tag token.
EOF
This is an eof-in-tag parse error. Emit an end-of-file token.
Anything else
This is a missing-whitespace-between-attributes parse error. Reconsume in the before attribute name state.
13.2.5.40 Self-closing start tag state

Consume the next input character:

U+003E GREATER-THAN SIGN (>)
Set the self-closing flag of the current tag token. Switch to the data state. Emit the current tag token.
EOF
This is an eof-in-tag parse error. Emit an end-of-file token.
Anything else
This is an unexpected-solidus-in-tag parse error. Reconsume in the before attribute name state.
13.2.5.41 Bogus comment state

Consume the next input character:

U+003E GREATER-THAN SIGN (>)
Switch to the data state. Emit the comment token.
EOF
Emit the comment token. Emit an end-of-file token.
U+0000 NULL
This is an unexpected-null-character parse error. Append a U+FFFD REPLACEMENT CHARACTER character to the comment token's data.
Anything else
Append the current input character to the comment token's data.
13.2.5.42 Markup declaration open state

If the next few characters are:

Two U+002D HYPHEN-MINUS characters (-)
Consume those two characters, create a comment token whose data is the empty string, and switch to the comment start state.
ASCII case-insensitive match for the word "DOCTYPE"
Consume those characters and switch to the DOCTYPE state.
The string "[CDATA[" (the five uppercase letters "CDATA" with a U+005B LEFT SQUARE BRACKET character before and after)
Consume those characters. If there is an adjusted current node and it is not an element in the HTML namespace, then switch to the CDATA section state. Otherwise, this is a cdata-in-html-content parse error. Create a comment token whose data is the "[CDATA[" string. Switch to the bogus comment state.
Anything else
This is an incorrectly-opened-comment parse error. Create a comment token whose data is the empty string. Switch to the bogus comment state (don't consume anything in the current state).
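
Unlike most states, the markup declaration open state looks ahead at several characters before consuming anything. A minimal Python sketch of that lookahead follows. The function name, the return shape, and the boolean `in_foreign_content` (standing in for "there is an adjusted current node and it is not an element in the HTML namespace") are assumptions of this sketch.

```python
from typing import List, Optional, Tuple

Result = Tuple[str, int, Optional[str], List[str]]


def markup_declaration_open(text: str, pos: int,
                            in_foreign_content: bool = False) -> Result:
    """Look ahead from text[pos], the position just after "<!".

    Returns (next_state, new_pos, comment_data, parse_errors), where
    comment_data is None when no comment token is created.
    """
    rest = text[pos:]
    if rest.startswith("--"):
        return "comment start state", pos + 2, "", []
    candidate = rest[:7]
    # ASCII case-insensitive match: reject non-ASCII look-alikes first.
    if candidate.isascii() and candidate.lower() == "doctype":
        return "DOCTYPE state", pos + 7, None, []
    if rest.startswith("[CDATA["):  # case-sensitive match
        if in_foreign_content:
            return "CDATA section state", pos + 7, None, []
        return ("bogus comment state", pos + 7, "[CDATA[",
                ["cdata-in-html-content"])
    # Nothing is consumed in this branch.
    return "bogus comment state", pos, "", ["incorrectly-opened-comment"]
```
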
13.2.5.43 Comment start state

Consume the next input character:

U+002D HYPHEN-MINUS (-)
Switch to the comment start dash state.
U+003E GREATER-THAN SIGN (>)
This is an abrupt-closing-of-empty-comment parse error. Switch to the data state. Emit the comment token.
Anything else
Reconsume in the comment state.
13.2.5.44 Comment start dash state

Consume the next input character:

U+002D HYPHEN-MINUS (-)
Switch to the comment end state.
U+003E GREATER-THAN SIGN (>)
This is an abrupt-closing-of-empty-comment parse error. Switch to the data state. Emit the comment token.
EOF
This is an eof-in-comment parse error. Emit the comment token. Emit an end-of-file token.
Anything else
Append a U+002D HYPHEN-MINUS character (-) to the comment token's data. Reconsume in the comment state.
13.2.5.45 Comment state

Consume the next input character:

U+003C LESS-THAN SIGN (<)
Append the current input character to the comment token's data. Switch to the comment less-than sign state.
U+002D HYPHEN-MINUS (-)
Switch to the comment end dash state.
U+0000 NULL
This is an unexpected-null-character parse error. Append a U+FFFD REPLACEMENT CHARACTER character to the comment token's data.
EOF
This is an eof-in-comment parse error. Emit the comment token. Emit an end-of-file token.
Anything else
Append the current input character to the comment token's data.
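13.2.5.46 Comment less-than sign state

Consume the next input character:

U+0021 EXCLAMATION MARK (!)
Append the current input character to the comment token's data. Switch to the comment less-than sign bang state.
U+003C LESS-THAN SIGN (<)
Append the current input character to the comment token's data.
Anything else
Reconsume in the comment state.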
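13.2.5.47 Comment less-than sign bang state

Consume the next input character:

U+002D HYPHEN-MINUS (-)
Switch to the comment less-than sign bang dash state.
Anything else
Reconsume in the comment state.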
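13.2.5.48 Comment less-than sign bang dash state

Consume the next input character:

U+002D HYPHEN-MINUS (-)
Switch to the comment less-than sign bang dash dash state.
Anything else
Reconsume in the comment end dash state.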
13.2.5.49 Comment less-than sign bang dash dash state

Consume the next input character:

U+003E GREATER-THAN SIGN (>)
EOF
Reconsume in the comment end state.
Anything else
This is a nested-comment parse error. Reconsume in the comment end state.
13.2.5.50 Comment end dash state

Consume the next input character:

U+002D HYPHEN-MINUS (-)
Switch to the comment end state.
EOF
This is an eof-in-comment parse error. Emit the comment token. Emit an end-of-file token.
Anything else
Append a U+002D HYPHEN-MINUS character (-) to the comment token's data. Reconsume in the comment state.
13.2.5.51 Comment end state

Consume the next input character:

U+003E GREATER-THAN SIGN (>)
Switch to the data state. Emit the comment token.
U+0021 EXCLAMATION MARK (!)
Switch to the comment end bang state.
U+002D HYPHEN-MINUS (-)
Append a U+002D HYPHEN-MINUS character (-) to the comment token's data.
EOF
This is an eof-in-comment parse error. Emit the comment token. Emit an end-of-file token.
Anything else
Append two U+002D HYPHEN-MINUS characters (-) to the comment token's data. Reconsume in the comment state.
13.2.5.52 Comment end bang state

Consume the next input character:

U+002D HYPHEN-MINUS (-)
Append two U+002D HYPHEN-MINUS characters (-) and a U+0021 EXCLAMATION MARK character (!) to the comment token's data. Switch to the comment end dash state.
U+003E GREATER-THAN SIGN (>)
This is an incorrectly-closed-comment parse error. Switch to the data state. Emit the comment token.
EOF
This is an eof-in-comment parse error. Emit the comment token. Emit an end-of-file token.
Anything else
Append two U+002D HYPHEN-MINUS characters (-) and a U+0021 EXCLAMATION MARK character (!) to the comment token's data. Reconsume in the comment state.
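
The comment states, from the comment start state through the comment end bang state, can be exercised together with a compact Python sketch. It is an illustration under stated assumptions, not a conforming tokenizer: the function `tokenize_comment` is a hypothetical name, it starts on the input that follows `<!--`, it returns only the comment data and the parse errors, and `None` stands for EOF.

```python
from typing import List, Optional, Tuple

# States in which EOF is an eof-in-comment parse error that emits the comment.
EOF_ERROR_STATES = {"comment start dash", "comment", "comment end dash",
                    "comment end", "comment end bang"}


def tokenize_comment(text: str) -> Tuple[str, List[str]]:
    """Run the comment states over the input that follows "<!--".

    Returns (comment_data, parse_errors) once the comment token is emitted.
    """
    data: List[str] = []
    errors: List[str] = []
    state, pos = "comment start", 0
    while True:
        ch: Optional[str] = text[pos] if pos < len(text) else None  # None = EOF
        pos += 1
        reconsume_in = None
        if ch is None and state in EOF_ERROR_STATES:
            errors.append("eof-in-comment")
            break
        if state in ("comment start", "comment start dash"):
            if ch == "-":
                state = "comment start dash" if state == "comment start" else "comment end"
            elif ch == ">":
                errors.append("abrupt-closing-of-empty-comment")
                break
            else:
                if state == "comment start dash":
                    data.append("-")
                reconsume_in = "comment"
        elif state == "comment":
            if ch == "<":
                data.append(ch)
                state = "comment less-than sign"
            elif ch == "-":
                state = "comment end dash"
            elif ch == "\x00":
                errors.append("unexpected-null-character")
                data.append("\ufffd")
            else:
                data.append(ch)
        elif state == "comment less-than sign":
            if ch == "!":
                data.append(ch)
                state = "comment less-than sign bang"
            elif ch == "<":
                data.append(ch)
            else:
                reconsume_in = "comment"
        elif state == "comment less-than sign bang":
            if ch == "-":
                state = "comment less-than sign bang dash"
            else:
                reconsume_in = "comment"
        elif state == "comment less-than sign bang dash":
            if ch == "-":
                state = "comment less-than sign bang dash dash"
            else:
                reconsume_in = "comment end dash"
        elif state == "comment less-than sign bang dash dash":
            if ch != ">" and ch is not None:
                errors.append("nested-comment")
            reconsume_in = "comment end"
        elif state == "comment end dash":
            if ch == "-":
                state = "comment end"
            else:
                data.append("-")
                reconsume_in = "comment"
        elif state == "comment end":
            if ch == ">":
                break
            elif ch == "!":
                state = "comment end bang"
            elif ch == "-":
                data.append("-")
            else:
                data.append("--")
                reconsume_in = "comment"
        else:  # comment end bang
            if ch == ">":
                errors.append("incorrectly-closed-comment")
                break
            data.append("--!")
            if ch == "-":
                state = "comment end dash"
            else:
                reconsume_in = "comment"
        if reconsume_in is not None:
            # "Reconsume" means the same character is processed again.
            state, pos = reconsume_in, pos - 1
    return "".join(data), errors
```

For instance, the input `a<!--b-->` yields the data `a<!--b` together with a nested-comment parse error, and `x--!>` yields the data `x` with an incorrectly-closed-comment parse error.
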
13.2.5.53 DOCTYPE state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
Switch to the before DOCTYPE name state.
U+003E GREATER-THAN SIGN (>)
Reconsume in the before DOCTYPE name state.
EOF
This is an eof-in-doctype parse error. Create a new DOCTYPE token. Set its force-quirks flag to on. Emit the token. Emit an end-of-file token.
Anything else
This is a missing-whitespace-before-doctype-name parse error. Reconsume in the before DOCTYPE name state.
13.2.5.54 Before DOCTYPE name state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
Ignore the character.
ASCII upper alpha
Create a new DOCTYPE token. Set the token's name to the lowercase version of the current input character (add 0x0020 to the character's code point). Switch to the DOCTYPE name state.
U+0000 NULL
This is an unexpected-null-character parse error. Create a new DOCTYPE token. Set the token's name to a U+FFFD REPLACEMENT CHARACTER character. Switch to the DOCTYPE name state.
U+003E GREATER-THAN SIGN (>)
This is a missing-doctype-name parse error. Create a new DOCTYPE token. Set its force-quirks flag to on. Switch to the data state. Emit the token.
EOF
This is an eof-in-doctype parse error. Create a new DOCTYPE token. Set its force-quirks flag to on. Emit the token. Emit an end-of-file token.
Anything else
Create a new DOCTYPE token. Set the token's name to the current input character. Switch to the DOCTYPE name state.
13.2.5.55 DOCTYPE name state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
Switch to the after DOCTYPE name state.
U+003E GREATER-THAN SIGN (>)
Switch to the data state. Emit the current DOCTYPE token.
ASCII upper alpha
Append the lowercase version of the current input character (add 0x0020 to the character's code point) to the current DOCTYPE token's name.
U+0000 NULL
This is an unexpected-null-character parse error. Append a U+FFFD REPLACEMENT CHARACTER character to the current DOCTYPE token's name.
EOF
This is an eof-in-doctype parse error. Set the DOCTYPE token's force-quirks flag to on. Emit that DOCTYPE token. Emit an end-of-file token.
Anything else
Append the current input character to the current DOCTYPE token's name.
13.2.5.56 After DOCTYPE name state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
Ignore the character.
U+003E GREATER-THAN SIGN (>)
Switch to the data state. Emit the current DOCTYPE token.
EOF
This is an eof-in-doctype parse error. Set the DOCTYPE token's force-quirks flag to on. Emit that DOCTYPE token. Emit an end-of-file token.
Anything else

If the six characters starting from the current input character are an ASCII case-insensitive match for the word "PUBLIC", then consume those characters and switch to the after DOCTYPE public keyword state.

Otherwise, if the six characters starting from the current input character are an ASCII case-insensitive match for the word "SYSTEM", then consume those characters and switch to the after DOCTYPE system keyword state.

Otherwise, this is an invalid-character-sequence-after-doctype-name parse error. Set the DOCTYPE token's force-quirks flag to on. Reconsume in the bogus DOCTYPE state.

13.2.5.57 After DOCTYPE public keyword state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
Switch to the before DOCTYPE public identifier state.
U+0022 QUOTATION MARK (")
This is a missing-whitespace-after-doctype-public-keyword parse error. Set the DOCTYPE token's public identifier to the empty string (not missing), then switch to the DOCTYPE public identifier (double-quoted) state.
U+0027 APOSTROPHE (')
This is a missing-whitespace-after-doctype-public-keyword parse error. Set the DOCTYPE token's public identifier to the empty string (not missing), then switch to the DOCTYPE public identifier (single-quoted) state.
U+003E GREATER-THAN SIGN (>)
This is a missing-doctype-public-identifier parse error. Set the DOCTYPE token's force-quirks flag to on. Switch to the data state. Emit that DOCTYPE token.
EOF
This is an eof-in-doctype parse error. Set the DOCTYPE token's force-quirks flag to on. Emit that DOCTYPE token. Emit an end-of-file token.
Anything else
This is a missing-quote-before-doctype-public-identifier parse error. Set the DOCTYPE token's force-quirks flag to on. Reconsume in the bogus DOCTYPE state.
13.2.5.58 Before DOCTYPE public identifier state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
Ignore the character.
U+0022 QUOTATION MARK (")
Set the DOCTYPE token's public identifier to the empty string (not missing), then switch to the DOCTYPE public identifier (double-quoted) state.
U+0027 APOSTROPHE (')
Set the DOCTYPE token's public identifier to the empty string (not missing), then switch to the DOCTYPE public identifier (single-quoted) state.
U+003E GREATER-THAN SIGN (>)
This is a missing-doctype-public-identifier parse error. Set the DOCTYPE token's force-quirks flag to on. Switch to the data state. Emit that DOCTYPE token.
EOF
This is an eof-in-doctype parse error. Set the DOCTYPE token's force-quirks flag to on. Emit that DOCTYPE token. Emit an end-of-file token.
Anything else
This is a missing-quote-before-doctype-public-identifier parse error. Set the DOCTYPE token's force-quirks flag to on. Reconsume in the bogus DOCTYPE state.
13.2.5.59 DOCTYPE public identifier (double-quoted) state

Consume the next input character:

U+0022 QUOTATION MARK (")
Switch to the after DOCTYPE public identifier state.
U+0000 NULL
This is an unexpected-null-character parse error. Append a U+FFFD REPLACEMENT CHARACTER character to the current DOCTYPE token's public identifier.
U+003E GREATER-THAN SIGN (>)
This is an abrupt-doctype-public-identifier parse error. Set the DOCTYPE token's force-quirks flag to on. Switch to the data state. Emit that DOCTYPE token.
EOF
This is an eof-in-doctype parse error. Set the DOCTYPE token's force-quirks flag to on. Emit that DOCTYPE token. Emit an end-of-file token.
Anything else
Append the current input character to the current DOCTYPE token's public identifier.
13.2.5.60 DOCTYPE public identifier (single-quoted) state

Consume the next input character:

U+0027 APOSTROPHE (')
Switch to the after DOCTYPE public identifier state.
U+0000 NULL
This is an unexpected-null-character parse error. Append a U+FFFD REPLACEMENT CHARACTER character to the current DOCTYPE token's public identifier.
U+003E GREATER-THAN SIGN (>)
This is an abrupt-doctype-public-identifier parse error. Set the DOCTYPE token's force-quirks flag to on. Switch to the data state. Emit that DOCTYPE token.
EOF
This is an eof-in-doctype parse error. Set the DOCTYPE token's force-quirks flag to on. Emit that DOCTYPE token. Emit an end-of-file token.
Anything else
Append the current input character to the current DOCTYPE token's public identifier.
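
Non-normative sketch: the two quoted public identifier states differ only in the closing quote character, so they can be illustrated with one function. The names DoctypeToken and read_public_identifier, and the "eof" state label, are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class DoctypeToken:
    # The preceding state has already set the identifier to the empty string.
    public_id: str = ""
    force_quirks: bool = False


def read_public_identifier(text, quote):
    """Run a quoted public identifier state over `text`.

    `text` is the input that follows the opening quote. Returns the token,
    the parse errors reported, and the name of the state switched to
    ("eof" is a hypothetical label for the end-of-file case).
    """
    token = DoctypeToken()
    errors = []
    for ch in text:
        if ch == quote:
            return token, errors, "after DOCTYPE public identifier"
        if ch == "\0":
            errors.append("unexpected-null-character")
            token.public_id += "\ufffd"
        elif ch == ">":
            errors.append("abrupt-doctype-public-identifier")
            token.force_quirks = True
            return token, errors, "data"  # the DOCTYPE token is emitted here
        else:
            token.public_id += ch
    # Input ran out before the closing quote was seen.
    errors.append("eof-in-doctype")
    token.force_quirks = True
    return token, errors, "eof"
```

Passing U+0022 as quote models the double-quoted state and U+0027 models the single-quoted state. The system identifier states follow the same shape with the system identifier in place of the public identifier.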
13.2.5.61 After DOCTYPE public identifier state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
Switch to the between DOCTYPE public and system identifiers state.
U+003E GREATER-THAN SIGN (>)
Switch to the data state. Emit the current DOCTYPE token.
U+0022 QUOTATION MARK (")
This is a missing-whitespace-between-doctype-public-and-system-identifiers parse error. Set the DOCTYPE token's system identifier to the empty string (not missing), then switch to the DOCTYPE system identifier (double-quoted) state.
U+0027 APOSTROPHE (')
This is a missing-whitespace-between-doctype-public-and-system-identifiers parse error. Set the DOCTYPE token's system identifier to the empty string (not missing), then switch to the DOCTYPE system identifier (single-quoted) state.
EOF
This is an eof-in-doctype parse error. Set the DOCTYPE token's force-quirks flag to on. Emit that DOCTYPE token. Emit an end-of-file token.
Anything else
This is a missing-quote-before-doctype-system-identifier parse error. Set the DOCTYPE token's force-quirks flag to on. Reconsume in the bogus DOCTYPE state.
13.2.5.62 Between DOCTYPE public and system identifiers state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
Ignore the character.
U+003E GREATER-THAN SIGN (>)
Switch to the data state. Emit the current DOCTYPE token.
U+0022 QUOTATION MARK (")
Set the DOCTYPE token's system identifier to the empty string (not missing), then switch to the DOCTYPE system identifier (double-quoted) state.
U+0027 APOSTROPHE (')
Set the DOCTYPE token's system identifier to the empty string (not missing), then switch to the DOCTYPE system identifier (single-quoted) state.
EOF
This is an eof-in-doctype parse error. Set the DOCTYPE token's force-quirks flag to on. Emit that DOCTYPE token. Emit an end-of-file token.
Anything else
This is a missing-quote-before-doctype-system-identifier parse error. Set the DOCTYPE token's force-quirks flag to on. Reconsume in the bogus DOCTYPE state.
13.2.5.63 After DOCTYPE system keyword state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
Switch to the before DOCTYPE system identifier state.
U+0022 QUOTATION MARK (")
This is a missing-whitespace-after-doctype-system-keyword parse error. Set the DOCTYPE token's system identifier to the empty string (not missing), then switch to the DOCTYPE system identifier (double-quoted) state.
U+0027 APOSTROPHE (')
This is a missing-whitespace-after-doctype-system-keyword parse error. Set the DOCTYPE token's system identifier to the empty string (not missing), then switch to the DOCTYPE system identifier (single-quoted) state.
U+003E GREATER-THAN SIGN (>)
This is a missing-doctype-system-identifier parse error. Set the DOCTYPE token's force-quirks flag to on. Switch to the data state. Emit that DOCTYPE token.
EOF
This is an eof-in-doctype parse error. Set the DOCTYPE token's force-quirks flag to on. Emit that DOCTYPE token. Emit an end-of-file token.
Anything else
This is a missing-quote-before-doctype-system-identifier parse error. Set the DOCTYPE token's force-quirks flag to on. Reconsume in the bogus DOCTYPE state.
13.2.5.64 Before DOCTYPE system identifier state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
Ignore the character.
U+0022 QUOTATION MARK (")
Set the DOCTYPE token's system identifier to the empty string (not missing), then switch to the DOCTYPE system identifier (double-quoted) state.
U+0027 APOSTROPHE (')
Set the DOCTYPE token's system identifier to the empty string (not missing), then switch to the DOCTYPE system identifier (single-quoted) state.
U+003E GREATER-THAN SIGN (>)
This is a missing-doctype-system-identifier parse error. Set the DOCTYPE token's force-quirks flag to on. Switch to the data state. Emit that DOCTYPE token.
EOF
This is an eof-in-doctype parse error. Set the DOCTYPE token's force-quirks flag to on. Emit that DOCTYPE token. Emit an end-of-file token.
Anything else
This is a missing-quote-before-doctype-system-identifier parse error. Set the DOCTYPE token's force-quirks flag to on. Reconsume in the bogus DOCTYPE state.
13.2.5.65 DOCTYPE system identifier (double-quoted) state

Consume the next input character:

U+0022 QUOTATION MARK (")
Switch to the after DOCTYPE system identifier state.
U+0000 NULL
This is an unexpected-null-character parse error. Append a U+FFFD REPLACEMENT CHARACTER character to the current DOCTYPE token's system identifier.
U+003E GREATER-THAN SIGN (>)
This is an abrupt-doctype-system-identifier parse error. Set the DOCTYPE token's force-quirks flag to on. Switch to the data state. Emit that DOCTYPE token.
EOF
This is an eof-in-doctype parse error. Set the DOCTYPE token's force-quirks flag to on. Emit that DOCTYPE token. Emit an end-of-file token.
Anything else
Append the current input character to the current DOCTYPE token's system identifier.
13.2.5.66 DOCTYPE system identifier (single-quoted) state

Consume the next input character:

U+0027 APOSTROPHE (')
Switch to the after DOCTYPE system identifier state.
U+0000 NULL
This is an unexpected-null-character parse error. Append a U+FFFD REPLACEMENT CHARACTER character to the current DOCTYPE token's system identifier.
U+003E GREATER-THAN SIGN (>)
This is an abrupt-doctype-system-identifier parse error. Set the DOCTYPE token's force-quirks flag to on. Switch to the data state. Emit that DOCTYPE token.
EOF
This is an eof-in-doctype parse error. Set the DOCTYPE token's force-quirks flag to on. Emit that DOCTYPE token. Emit an end-of-file token.
Anything else
Append the current input character to the current DOCTYPE token's system identifier.
13.2.5.67 After DOCTYPE system identifier state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
Ignore the character.
U+003E GREATER-THAN SIGN (>)
Switch to the data state. Emit the current DOCTYPE token.
EOF
This is an eof-in-doctype parse error. Set the DOCTYPE token's force-quirks flag to on. Emit that DOCTYPE token. Emit an end-of-file token.
Anything else
This is an unexpected-character-after-doctype-system-identifier parse error. Reconsume in the bogus DOCTYPE state. (This does not set the DOCTYPE token's force-quirks flag to on.)
13.2.5.68 Bogus DOCTYPE state

Consume the next input character:

U+003E GREATER-THAN SIGN (>)
Switch to the data state. Emit the DOCTYPE token.
U+0000 NULL
This is an unexpected-null-character parse error. Ignore the character.
EOF
Emit the DOCTYPE token. Emit an end-of-file token.
Anything else
Ignore the character.
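13.2.5.69 CDATA section state

Consume the next input character:

U+005D RIGHT SQUARE BRACKET (])
Switch to the CDATA section bracket state.
EOF
This is an eof-in-cdata parse error. Emit an end-of-file token.
Anything else
Emit the current input character as a character token.

U+0000 NULL characters are handled in the tree construction stage, as part of the in foreign content insertion mode, which is the only place where CDATA sections can appear.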

13.2.5.70 CDATA section bracket state

Consume the next input character:

U+005D RIGHT SQUARE BRACKET (])
Switch to the CDATA section end state.
Anything else
Emit a U+005D RIGHT SQUARE BRACKET character token. Reconsume in the CDATA section state.
13.2.5.71 CDATA section end state

Consume the next input character:

U+005D RIGHT SQUARE BRACKET (])
Emit a U+005D RIGHT SQUARE BRACKET character token.
U+003E GREATER-THAN SIGN (>)
Switch to the data state.
Anything else
Emit two U+005D RIGHT SQUARE BRACKET character tokens. Reconsume in the CDATA section state.
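
Non-normative sketch: the three CDATA section states form a small state machine that looks for the "]]>" terminator while passing every other character through. The function name tokenize_cdata and its return shape are assumptions made for this illustration.

```python
def tokenize_cdata(text):
    """Run the three CDATA section states over `text`.

    `text` is the input that follows "<![CDATA[". Returns the characters
    emitted as character tokens, the parse errors reported, and whether the
    data state was reached (False means end of file came first).
    """
    emitted = []
    errors = []
    state = "section"
    i = 0
    while True:
        ch = text[i] if i < len(text) else None  # None stands for EOF
        if state == "section":
            if ch == "]":
                state = "bracket"
                i += 1
            elif ch is None:
                errors.append("eof-in-cdata")
                return "".join(emitted), errors, False
            else:
                emitted.append(ch)
                i += 1
        elif state == "bracket":
            if ch == "]":
                state = "end"
                i += 1
            else:
                emitted.append("]")
                state = "section"  # reconsume: i is not advanced
        else:  # state == "end"
            if ch == "]":
                emitted.append("]")
                i += 1
            elif ch == ">":
                return "".join(emitted), errors, True
            else:
                emitted.append("]]")
                state = "section"  # reconsume: i is not advanced
```

Reconsuming is modelled by changing state without advancing the index, so the same character is examined again by the next state.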
13.2.5.72 Character reference state

Set the temporary buffer to the empty string. Append a U+0026 AMPERSAND (&) character to the temporary buffer. Consume the next input character:

ASCII alphanumeric
Reconsume in the named character reference state.
U+0023 NUMBER SIGN (#)
Append the current input character to the temporary buffer. Switch to the numeric character reference state.
Anything else
Flush code points consumed as a character reference. Reconsume in the return state.
13.2.5.73 Named character reference state

Consume the maximum number of characters possible, where the consumed characters are one of the identifiers in the first column of the named character references table. Append each character to the temporary buffer when it's consumed.

If there is a match

If the character reference was consumed as part of an attribute, and the last character matched is not a U+003B SEMICOLON character (;), and the next input character is either a U+003D EQUALS SIGN character (=) or an ASCII alphanumeric, then, for historical reasons, flush code points consumed as a character reference and switch to the return state.

Otherwise:

  1. If the last character matched is not a U+003B SEMICOLON character (;), then this is a missing-semicolon-after-character-reference parse error.

  2. Set the temporary buffer to the empty string. Append one or two characters corresponding to the character reference name (as given by the second column of the named character references table) to the temporary buffer.

  3. Flush code points consumed as a character reference. Switch to the return state.
Otherwise
Flush code points consumed as a character reference. Switch to the ambiguous ampersand state.

If the markup contains (not in an attribute) the string I'm &notit; I tell you, the character reference is parsed as "not", as in, I'm ¬it; I tell you (and this is a parse error). But if the markup was I'm &notin; I tell you, the character reference would be parsed as "notin;", resulting in I'm ∉ I tell you (and no parse error).

However, if the markup contains the string I'm &notit; I tell you in an attribute, no character reference is parsed and the string remains intact (and there is no parse error).
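
Non-normative sketch of the matching rules above, using a five-entry excerpt of the named character references table. The function name consume_named_reference and its return shape are hypothetical, and the ambiguous ampersand state is not modelled.

```python
# Tiny excerpt of the named character references table, for illustration only.
NAMED = {
    "amp": "&",
    "amp;": "&",
    "not": "\u00ac",
    "not;": "\u00ac",
    "notin;": "\u2209",
}
LONGEST = max(len(name) for name in NAMED)


def consume_named_reference(rest, in_attribute):
    """`rest` is the input that follows the "&".

    Returns (text flushed, number of characters consumed, parse errors).
    """
    match = None
    # Try the longest candidate first so that "notin;" wins over "not".
    for length in range(min(len(rest), LONGEST), 0, -1):
        if rest[:length] in NAMED:
            match = rest[:length]
            break
    if match is None:
        # No match: only the "&" is flushed; the ambiguous ampersand
        # state (not modelled here) takes over.
        return "&", 0, []
    following = rest[len(match):len(match) + 1]
    if (in_attribute and not match.endswith(";")
            and (following == "="
                 or (following.isascii() and following.isalnum()))):
        # Historical rule: leave the text untouched inside attributes.
        return "&" + match, len(match), []
    errors = []
    if not match.endswith(";"):
        errors.append("missing-semicolon-after-character-reference")
    return NAMED[match], len(match), errors
```

With this excerpt, the three examples above give "¬" with a parse error, "∉" without one, and the unmodified "&not" inside an attribute.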

13.2.5.74 Ambiguous ampersand state

Consume the next input character:

ASCII alphanumeric
If the character reference was consumed as part of an attribute, then append the current input character to the current attribute's value. Otherwise, emit the current input character as a character token.
U+003B SEMICOLON (;)
This is an unknown-named-character-reference parse error. Reconsume in the return state.
Anything else
Reconsume in the return state.
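13.2.5.75 Numeric character reference state

Set the character reference code to zero (0).

Consume the next input character:

U+0078 LATIN SMALL LETTER X
U+0058 LATIN CAPITAL LETTER X
Append the current input character to the temporary buffer. Switch to the hexadecimal character reference start state.
Anything else
Reconsume in the decimal character reference start state.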
13.2.5.76 Hexadecimal character reference start state

Consume the next input character:

ASCII hex digit
Reconsume in the hexadecimal character reference state.
Anything else
This is an absence-of-digits-in-numeric-character-reference parse error. Flush code points consumed as a character reference. Reconsume in the return state.
13.2.5.77 Decimal character reference start state

Consume the next input character:

ASCII digit
Reconsume in the decimal character reference state.
Anything else
This is an absence-of-digits-in-numeric-character-reference parse error. Flush code points consumed as a character reference. Reconsume in the return state.
13.2.5.78 Hexadecimal character reference state

Consume the next input character:

ASCII digit
Multiply the character reference code by 16. Add a numeric version of the current input character (subtract 0x0030 from the character's code point) to the character reference code.
ASCII upper hex digit
Multiply the character reference code by 16. Add a numeric version of the current input character as a hexadecimal digit (subtract 0x0037 from the character's code point) to the character reference code.
ASCII lower hex digit
Multiply the character reference code by 16. Add a numeric version of the current input character as a hexadecimal digit (subtract 0x0057 from the character's code point) to the character reference code.
U+003B SEMICOLON
Switch to the numeric character reference end state.
Anything else
This is a missing-semicolon-after-character-reference parse error. Reconsume in the numeric character reference end state.
13.2.5.79 Decimal character reference state

Consume the next input character:

ASCII digit
Multiply the character reference code by 10. Add a numeric version of the current input character (subtract 0x0030 from the character's code point) to the character reference code.
U+003B SEMICOLON
Switch to the numeric character reference end state.
Anything else
This is a missing-semicolon-after-character-reference parse error. Reconsume in the numeric character reference end state.
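
Non-normative sketch: the numeric, start, and accumulation states above can be folded into one loop that builds the character reference code digit by digit. The function name accumulate_numeric_reference is hypothetical, and the checks performed by the numeric character reference end state are not modelled.

```python
def accumulate_numeric_reference(rest):
    """`rest` is the input that follows "&#".

    Returns (character reference code, characters consumed, parse errors).
    The code is None when no digit was found.
    """
    i = 0
    if rest[:1] in ("x", "X"):
        i = 1
        base, digits = 16, "0123456789abcdefABCDEF"
    else:
        base, digits = 10, "0123456789"
    start = i
    code = 0
    while i < len(rest) and rest[i] in digits:
        ch = rest[i]
        if "0" <= ch <= "9":
            value = ord(ch) - 0x30  # ASCII digit
        elif "A" <= ch <= "F":
            value = ord(ch) - 0x37  # ASCII upper hex digit
        else:
            value = ord(ch) - 0x57  # ASCII lower hex digit
        code = code * base + value
        i += 1
    if i == start:
        return None, i, ["absence-of-digits-in-numeric-character-reference"]
    errors = []
    if rest[i:i + 1] == ";":
        i += 1
    else:
        errors.append("missing-semicolon-after-character-reference")
    return code, i, errors
```

The three subtraction constants map "0", "A", and "a" to 0, 10, and 10 respectively, which is why the upper and lower hex cases use 0x0037 and 0x0057.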
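13.2.5.80 Numeric character reference end state

Check the character reference code:

Set the temporary buffer to the empty string. Append a code point equal to the character reference code to the temporary buffer. Flush code points consumed as a character reference. Switch to the return state.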

13.2.6 Tree construction

The input to the tree construction stage is a sequence of tokens from the tokenization stage. The tree construction stage is associated with a DOM Document object when a parser is created. The "output" of this stage consists of dynamically modifying or extending that document's DOM tree.

This specification does not define when an interactive user agent has to render the Document so that it is available to the user, or when it has to begin accepting user input.


As each token is emitted from the tokenizer, the user agent must follow the appropriate steps from the following list, known as the tree construction dispatcher:

If the stack of open elements is empty
If the adjusted current node is an element in the HTML namespace
If the adjusted current node is a MathML text integration point and the token is a start tag whose tag name is neither "mglyph" nor "malignmark"
If the adjusted current node is a MathML text integration point and the token is a character token
If the adjusted current node is a MathML annotation-xml element and the token is a start tag whose tag name is "svg"
If the adjusted current node is an HTML integration point and the token is a start tag
If the adjusted current node is an HTML integration point and the token is a character token
If the token is an end-of-file token
Process the token according to the rules given in the section corresponding to the current insertion mode in HTML content.
Otherwise
Process the token according to the rules given in the section for parsing tokens in foreign content.
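
Non-normative sketch of the dispatcher as a single predicate. The Node and Token shapes are hypothetical, and the two integration-point properties are supplied by the caller as boolean flags rather than computed here.

```python
from dataclasses import dataclass

HTML_NS = "http://www.w3.org/1999/xhtml"
MATHML_NS = "http://www.w3.org/1998/Math/MathML"
SVG_NS = "http://www.w3.org/2000/svg"


@dataclass
class Node:
    namespace: str
    local_name: str
    # Assumed to be precomputed by the caller.
    mathml_text_integration_point: bool = False
    html_integration_point: bool = False


@dataclass
class Token:
    kind: str  # "start tag", "end tag", "character", "end-of-file", ...
    name: str = ""


def dispatch(adjusted_current_node, token):
    """Return which rule set processes `token`.

    `adjusted_current_node` is None when the stack of open elements is empty.
    """
    n = adjusted_current_node
    if (
        n is None
        or n.namespace == HTML_NS
        or (n.mathml_text_integration_point and token.kind == "start tag"
            and token.name not in ("mglyph", "malignmark"))
        or (n.mathml_text_integration_point and token.kind == "character")
        or (n.namespace == MATHML_NS and n.local_name == "annotation-xml"
            and token.kind == "start tag" and token.name == "svg")
        or (n.html_integration_point
            and token.kind in ("start tag", "character"))
        or token.kind == "end-of-file"
    ):
        return "html content"
    return "foreign content"
```

Each clause of the condition corresponds to one entry in the list above, in the same order.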

The next token is the token that is about to be processed by the tree construction dispatcher (even if the token is subsequently just ignored).

A node is a MathML text integration point if it is one of the following elements:

A node is an HTML integration point if it is one of the following elements:

If the node in question is the context element passed to the HTML fragment parsing algorithm, then the start tag token for that element is the "fake" token created by that HTML fragment parsing algorithm.


Not all of the tag names mentioned below are conformant tag names in this specification; many are included to handle legacy content. They still form part of the algorithm that implementations are required to implement to claim conformance.

The algorithm described below places no limit on the depth of the DOM tree generated, or on the length of tag names, attribute names, attribute values, Text nodes, etc. While implementers are encouraged to avoid arbitrary limits, it is recognized that practical concerns will likely force user agents to impose nesting depth constraints.

13.2.6.1 Creating and inserting nodes

While the parser is processing a token, it can enable or disable foster parenting. This affects the following algorithm.

The appropriate place for inserting a node, optionally using a particular override target, is the position in an element returned by running the following steps:

  1. If there was an override target specified, then let target be the override target.

    Otherwise, let target be the current node.

  2. Determine the adjusted insertion location using the first matching steps from the following list:

    If foster parenting is enabled and target is a table, tbody, tfoot, thead, or tr element

    Foster parenting happens when content is misnested in tables.

    Run these substeps:

    1. Let last template be the last template element in the stack of open elements, if any.

    2. Let last table be the last table element in the stack of open elements, if any.

    3. If there is a last template and either there is no last table, or there is one, but last template is lower (more recently added) than last table in the stack of open elements, then: let adjusted insertion location be inside last template's template contents, after its last child (if any), and abort these steps.

    4. If there is no last table, then let adjusted insertion location be inside the first element in the stack of open elements (the html element), after its last child (if any), and abort these steps. (fragment case)

    5. If last table has a parent node, then let adjusted insertion location be inside last table's parent node, immediately before last table, and abort these steps.

    6. Let previous element be the element immediately above last table in the stack of open elements.

    7. Let adjusted insertion location be inside previous element, after its last child (if any).

    These steps are involved in part because it's possible for elements, the table element in this case in particular, to have been moved by a script around in the DOM, or indeed removed from the DOM entirely, after the element was inserted by the parser.

    Otherwise

    Let adjusted insertion location be inside target, after its last child (if any).

  3. If the adjusted insertion location is inside a template element, let it instead be inside the template element's template contents, after its last child (if any).

  4. Return the adjusted insertion location.
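
Non-normative sketch of the steps above. An insertion location is modelled as a pair (parent, before), where before is None to mean "after the last child". The Element shape and the function names are assumptions made for this illustration; the stack of open elements is a list whose last item is the current node.

```python
from dataclasses import dataclass
from typing import Optional

TABLE_CONTEXT = {"table", "tbody", "tfoot", "thead", "tr"}


@dataclass(eq=False)  # compare elements by identity
class Element:
    name: str
    parent: Optional["Element"] = None
    template_contents: Optional["Element"] = None


def last_index(stack, name):
    """Index of the last (most recently added) element called `name`."""
    for index in range(len(stack) - 1, -1, -1):
        if stack[index].name == name:
            return index
    return None


def appropriate_place(stack, foster_parenting=False, override_target=None):
    target = override_target if override_target is not None else stack[-1]
    if foster_parenting and target.name in TABLE_CONTEXT:
        template = last_index(stack, "template")
        table = last_index(stack, "table")
        if template is not None and (table is None or template > table):
            location = (stack[template].template_contents, None)
        elif table is None:
            location = (stack[0], None)  # fragment case
        elif stack[table].parent is not None:
            location = (stack[table].parent, stack[table])
        else:
            # The table was detached from the DOM, e.g. by a script.
            location = (stack[table - 1], None)
    else:
        location = (target, None)
    parent, _ = location
    if parent.name == "template" and parent.template_contents is not None:
        location = (parent.template_contents, None)
    return location
```

With the stack html, body, table, tr and foster parenting enabled, the result is (body, table): the new node goes inside body, immediately before the table.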


When the steps below require the UA to create an element for a token in a particular given namespace and with a particular intended parent, the UA must run the following steps:

  1. Let document be intended parent's node document.

  2. Let local name be the tag name of the token.

  3. Let is be the value of the "is" attribute in the given token, if such an attribute exists, or null otherwise.

  4. Let definition be the result of looking up a custom element definition given document, given namespace, local name, and is.

  5. If definition is non-null and the parser was not created as part of the HTML fragment parsing algorithm, then let will execute script be true. Otherwise, let it be false.

  6. If will execute script is true, then:

    1. Increment document's throw-on-dynamic-markup-insertion counter.

    2. If the JavaScript execution context stack is empty, then perform a microtask checkpoint.

    3. Push a new element queue onto document's relevant agent's custom element reactions stack.

  7. Let element be the result of creating an element given document, local name, given namespace, null, and is. If will execute script is true, set the synchronous custom elements flag; otherwise, leave it unset.

    This will cause custom element constructors to run, if will execute script is true. However, since we incremented the throw-on-dynamic-markup-insertion counter, this cannot cause new characters to be inserted into the tokenizer, or the document to be blown away.

  8. Append each attribute in the given token to element.

    This can enqueue a custom element callback reaction for the attributeChangedCallback, which might run immediately (in the next step).

    Even though the is attribute governs the creation of a customized built-in element, it is not present during the execution of the relevant custom element constructor; it is appended in this step, along with all other attributes.

  9. If will execute script is true, then:

    1. Let queue be the result of popping from document's relevant agent's custom element reactions stack. (This will be the same element queue as was pushed above.)

    2. Invoke custom element reactions in queue.

    3. Decrement document's throw-on-dynamic-markup-insertion counter.

  10. If element has an xmlns attribute in the XMLNS namespace whose value is not exactly the same as the element's namespace, that is a parse error. Similarly, if element has an xmlns:xlink attribute in the XMLNS namespace whose value is not the XLink Namespace, that is a parse error.

  11. If element is a resettable element, invoke its reset algorithm. (This initializes the element's value and checkedness based on the element's attributes.)

  12. If element is a form-associated element and not a form-associated custom element, the form element pointer is not null, there is no template element on the stack of open elements, element is either not listed or doesn't have a form attribute, and the intended parent is in the same tree as the element pointed to by the form element pointer, then associate element with the form element pointed to by the form element pointer and set element's parser inserted flag.

  13. Return element.


When the steps below require the user agent to insert a foreign element for a token in a given namespace, the user agent must run these steps:

  1. Let the adjusted insertion location be the appropriate place for inserting a node.

  2. Let element be the result of creating an element for the token in the given namespace, with the intended parent being the element in which the adjusted insertion location finds itself.

  3. If it is possible to insert element at the adjusted insertion location, then:

    1. If the parser was not created as part of the HTML fragment parsing algorithm, then push a new element queue onto element's relevant agent's custom element reactions stack.

    2. Insert element at the adjusted insertion location.

    3. If the parser was not created as part of the HTML fragment parsing algorithm, then pop the element queue from element's relevant agent's custom element reactions stack, and invoke custom element reactions in that queue.

    If the adjusted insertion location cannot accept more elements, e.g. because it's a Document that already has an element child, then element is dropped on the floor.

  4. Push element onto the stack of open elements so that it is the new current node.

  5. Return element.

When the steps below require the user agent to insert an HTML element for a token, the user agent must insert a foreign element for the token, in the HTML namespace.


When the steps below require the user agent to adjust MathML attributes for a token, then, if the token has an attribute named definitionurl, change its name to definitionURL (note the case difference).

When the steps below require the user agent to adjust SVG attributes for a token, then, for each attribute on the token whose attribute name is one of the ones in the first column of the following table, change the attribute's name to the name given in the corresponding cell in the second column. (This fixes the case of SVG attributes that are not all lowercase.)

Attribute name on token Attribute name on element
attributename attributeName
attributetype attributeType
basefrequency baseFrequency
baseprofile baseProfile
calcmode calcMode
clippathunits clipPathUnits
diffuseconstant diffuseConstant
edgemode edgeMode
filterunits filterUnits
glyphref glyphRef
gradienttransform gradientTransform
gradientunits gradientUnits
kernelmatrix kernelMatrix
kernelunitlength kernelUnitLength
keypoints keyPoints
keysplines keySplines
keytimes keyTimes
lengthadjust lengthAdjust
limitingconeangle limitingConeAngle
markerheight markerHeight
markerunits markerUnits
markerwidth markerWidth
maskcontentunits maskContentUnits
maskunits maskUnits
numoctaves numOctaves
pathlength pathLength
patterncontentunits patternContentUnits
patterntransform patternTransform
patternunits patternUnits
pointsatx pointsAtX
pointsaty pointsAtY
pointsatz pointsAtZ
preservealpha preserveAlpha
preserveaspectratio preserveAspectRatio
primitiveunits primitiveUnits
refx refX
refy refY
repeatcount repeatCount
repeatdur repeatDur
requiredextensions requiredExtensions
requiredfeatures requiredFeatures
specularconstant specularConstant
specularexponent specularExponent
spreadmethod spreadMethod
startoffset startOffset
stddeviation stdDeviation
stitchtiles stitchTiles
surfacescale surfaceScale
systemlanguage systemLanguage
tablevalues tableValues
targetx targetX
targety targetY
textlength textLength
viewbox viewBox
viewtarget viewTarget
xchannelselector xChannelSelector
ychannelselector yChannelSelector
zoomandpan zoomAndPan
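
Non-normative sketch: adjusting SVG attributes amounts to a dictionary lookup keyed on the lowercased name. Only a few rows of the table above are reproduced, and the function name adjust_svg_attributes is hypothetical.

```python
# Excerpt of the table above; a full implementation would list every row.
SVG_ATTRIBUTE_NAMES = {
    "attributename": "attributeName",
    "gradientunits": "gradientUnits",
    "preserveaspectratio": "preserveAspectRatio",
    "stddeviation": "stdDeviation",
    "viewbox": "viewBox",
}


def adjust_svg_attributes(attributes):
    """Return a copy of `attributes` with SVG attribute names case-corrected.

    `attributes` maps the names found on the token to their values.
    Names that are not in the table are left unchanged.
    """
    return {
        SVG_ATTRIBUTE_NAMES.get(name, name): value
        for name, value in attributes.items()
    }
```

For example, a token carrying viewbox and width attributes comes out with viewBox and width.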

When the steps below require the user agent to adjust foreign attributes for a token, then, if any of the attributes on the token match the strings given in the first column of the following table, let the attribute be a namespaced attribute, with the prefix being the string given in the corresponding cell in the second column, the local name being the string given in the corresponding cell in the third column, and the namespace being the namespace given in the corresponding cell in the fourth column. (This fixes the use of namespaced attributes, in particular lang attributes in the XML namespace.)

Attribute name Prefix Local name Namespace
xlink:actuate xlink actuate XLink namespace
xlink:arcrole xlink arcrole XLink namespace
xlink:href xlink href XLink namespace
xlink:role xlink role XLink namespace
xlink:show xlink show XLink namespace
xlink:title xlink title XLink namespace
xlink:type xlink type XLink namespace
xml:lang xml lang XML namespace
xml:space xml space XML namespace
xmlns (none) xmlns XMLNS namespace
xmlns:xlink xmlns xlink XMLNS namespace
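
Non-normative sketch of the table above as a lookup from attribute name to (prefix, local name, namespace). The namespace URIs are the standard values for the XLink, XML, and XMLNS namespaces; they are not spelled out in the table itself. The function name adjust_foreign_attribute is hypothetical.

```python
XLINK_NS = "http://www.w3.org/1999/xlink"
XML_NS = "http://www.w3.org/XML/1998/namespace"
XMLNS_NS = "http://www.w3.org/2000/xmlns/"

FOREIGN_ATTRIBUTES = {
    "xlink:actuate": ("xlink", "actuate", XLINK_NS),
    "xlink:arcrole": ("xlink", "arcrole", XLINK_NS),
    "xlink:href": ("xlink", "href", XLINK_NS),
    "xlink:role": ("xlink", "role", XLINK_NS),
    "xlink:show": ("xlink", "show", XLINK_NS),
    "xlink:title": ("xlink", "title", XLINK_NS),
    "xlink:type": ("xlink", "type", XLINK_NS),
    "xml:lang": ("xml", "lang", XML_NS),
    "xml:space": ("xml", "space", XML_NS),
    "xmlns": (None, "xmlns", XMLNS_NS),
    "xmlns:xlink": ("xmlns", "xlink", XMLNS_NS),
}


def adjust_foreign_attribute(name):
    """Return (prefix, local name, namespace) for an attribute name.

    Names that are not in the table stay un-namespaced: no prefix, the
    full name as local name, and no namespace.
    """
    return FOREIGN_ATTRIBUTES.get(name, (None, name, None))
```

Note that a name such as xlink:foo is not in the table, so it remains an ordinary attribute whose local name contains a colon.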

When the steps below require the user agent to insert a character while processing a token, the user agent must run the following steps:

  1. Let data be the characters passed to the algorithm, or, if no characters were explicitly specified, the character of the character token being processed.

  2. Let the adjusted insertion location be the appropriate place for inserting a node.

  3. If the adjusted insertion location is in a Document node, then return.

    The DOM will not let Document nodes have Text node children, so they are dropped on the floor.

  4. If there is a Text node immediately before the adjusted insertion location, then append data to that Text node's data.

    Otherwise, create a new Text node whose data is data and whose node document is the same as that of the element in which the adjusted insertion location finds itself, and insert the newly created node at the adjusted insertion location.
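
Non-normative sketch of the character insertion steps, showing why consecutive character tokens end up in a single Text node. The Text and Parent shapes are hypothetical, and the insertion location is reduced to a child index inside a parent.

```python
from dataclasses import dataclass, field


@dataclass
class Text:
    data: str


@dataclass
class Parent:
    kind: str  # "document" or "element"
    children: list = field(default_factory=list)


def insert_characters(parent, index, data):
    """Insert `data` at position `index` among the children of `parent`."""
    if parent.kind == "document":
        return  # Document nodes cannot have Text node children
    if index > 0 and isinstance(parent.children[index - 1], Text):
        # Extend the Text node immediately before the insertion location.
        parent.children[index - 1].data += data
    else:
        parent.children.insert(index, Text(data))
```

Inserting "A" and then "B" at the end of an empty element leaves one Text node containing "AB". A non-Text child between two insertions forces a second Text node.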

Here are some sample inputs to the parser and the corresponding number of Text nodes that they result in, assuming a user agent that executes scripts.

Input Number of Text nodes
A<script>
var script = document.getElementsByTagName('script')[0];
document.body.removeChild(script);
</script>B
One Text node in the document, containing "AB".
A<script>
var text = document.createTextNode('B');
document.body.appendChild(text);
</script>C
Three Text nodes; "A" before the script, the script's contents, and "BC" after the script (the parser appends to the Text node created by the script).
A<script>
var text = document.getElementsByTagName('script')[0].firstChild;
text.data = 'B';
document.body.appendChild(text);
</script>C
Two adjacent Text nodes in the document, containing "A" and "BC".
A<table>B<tr>C</tr>D</table>
One Text node before the table, containing "ABCD". (This is caused by foster parenting.)
A<table><tr> B</tr> C</table>
One Text node before the table, containing "A B C" (A-space-B-space-C). (This is caused by foster parenting.)
A<table><tr> B</tr> </em>C</table>
One Text node before the table, containing "A BC" (A-space-B-C), and one Text node inside the table (as a child of a tbody) with a single space character. (Space characters separated from non-space characters by non-character tokens are not affected by foster parenting, even if those other tokens then get ignored.)

When the steps below require the user agent to insert a comment while processing a comment token, optionally with an explicit insertion position position, the user agent must run the following steps:

  1. Let data be the data given in the comment token being processed.

  2. If position was specified, then let the adjusted insertion location be position. Otherwise, let adjusted insertion location be the appropriate place for inserting a node.

  3. Create a Comment node whose data attribute is set to data and whose node document is the same as that of the node in which the adjusted insertion location finds itself.

  4. Insert the newly created node at the adjusted insertion location.


DOM mutation events must not fire for changes caused by the UA parsing the document. This includes the parsing of any content inserted using document.write() and document.writeln() calls. [UIEVENTS]

However, mutation observers do fire, as required by DOM.

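13.2.6.2 Parsing elements that contain only text

The generic raw text element parsing algorithm and the generic RCDATA element parsing algorithm consist of the following steps. These algorithms are always invoked in response to a start tag token.

  1. Insert an HTML element for the token.

  2. If the algorithm that was invoked is the generic raw text element parsing algorithm, switch the tokenizer to the RAWTEXT state; otherwise, if the algorithm that was invoked is the generic RCDATA element parsing algorithm, switch the tokenizer to the RCDATA state.

  3. Let the original insertion mode be the current insertion mode.

  4. Then, switch the insertion mode to "text".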
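
Non-normative sketch of the four steps above. The Parser shape is hypothetical, and inserting an HTML element is reduced to pushing the tag name onto a list that stands in for the stack of open elements.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Parser:
    insertion_mode: str = "in head"
    original_insertion_mode: Optional[str] = None
    tokenizer_state: str = "data"
    open_elements: list = field(default_factory=list)


def parse_text_only_element(parser, tag_name, algorithm):
    """`algorithm` is "raw text" or "RCDATA"."""
    # Step 1: stand-in for "insert an HTML element for the token".
    parser.open_elements.append(tag_name)
    # Step 2: pick the tokenizer state that matches the algorithm.
    parser.tokenizer_state = "RAWTEXT" if algorithm == "raw text" else "RCDATA"
    # Step 3: remember where to return once the element's text is consumed.
    parser.original_insertion_mode = parser.insertion_mode
    # Step 4: switch to the "text" insertion mode.
    parser.insertion_mode = "text"
```

Saving the original insertion mode before switching is what lets the "text" insertion mode hand control back to the mode that was active when the start tag was seen.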

13.2.6.3 Closing elements that have implied end tags

When the steps below require the UA to generate implied end tags, then, while the current node is a dd element, a dt element, an li element, an optgroup element, an option element, a p element, an rb element, an rp element, an rt element, or an rtc element, the UA must pop the current node off the stack of open elements.

If a step requires the UA to generate implied end tags but lists an element to exclude from the process, then the UA must perform the above steps as if that element was not in the above list.

When the steps below require the UA to generate all implied end tags thoroughly, then, while the current node is a caption element, a colgroup element, a dd element, a dt element, an li element, an optgroup element, an option element, a p element, an rb element, an rp element, an rt element, an rtc element, a tbody element, a td element, a tfoot element, a th element, a thead element, or a tr element, the UA must pop the current node off the stack of open elements.
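
Non-normative sketch of both variants. The stack of open elements is reduced to a list of tag names whose last item is the current node, and the function name generate_implied_end_tags is hypothetical.

```python
IMPLIED = frozenset(
    {"dd", "dt", "li", "optgroup", "option", "p", "rb", "rp", "rt", "rtc"}
)
THOROUGH = IMPLIED | {
    "caption", "colgroup", "tbody", "td", "tfoot", "th", "thead", "tr",
}


def generate_implied_end_tags(stack, exclude=None, thoroughly=False):
    """Pop implied-end-tag elements off `stack` in place.

    `exclude` names one element to treat as if it were not in the list.
    """
    names = THOROUGH if thoroughly else IMPLIED
    while stack and stack[-1] in names and stack[-1] != exclude:
        stack.pop()
```

Popping stops at the first current node that is not in the list, so elements lower in the stack are never touched even if they appear in the list.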

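13.2.6.4 The rules for parsing tokens in HTML content
13.2.6.4.1 The "initial" insertion mode

When the user agent is to apply the rules for the "initial" insertion mode, the user agent must handle the token as follows:

A character token that is one of U+0009 CHARACTER TABULATION, U+000A LINE FEED (LF), U+000C FORM FEED (FF), U+000D CARRIAGE RETURN (CR), or U+0020 SPACE

Ignore the token.

A comment token

Insert a comment as the last child of the Document object.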

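A DOCTYPE token

If the DOCTYPE token's name is not a case-sensitive match for the string "html", or the token's public identifier is not missing, or the token's system identifier is neither missing nor a case-sensitive match for the string "about:legacy-compat", then there is a parse error.

Append a DocumentType node to the Document node, with the name attribute set to the name given in the DOCTYPE token, or the empty string if the name was missing; the publicId attribute set to the public identifier given in the DOCTYPE token, or the empty string if the public identifier was missing; the systemId attribute set to the system identifier given in the DOCTYPE token, or the empty string if the system identifier was missing; and the other attributes specific to DocumentType objects set to null and empty lists as appropriate. Associate the DocumentType node with the Document object so that it is returned as the value of the doctype attribute of the Document object.

Then, if the document is not an iframe srcdoc document, and the DOCTYPE token matches one of the conditions in the following list, then set the Document to quirks mode:

Otherwise, if the document is not an iframe srcdoc document, and the DOCTYPE token matches one of the conditions in the following list, then set the Document to limited-quirks mode:

The system identifier and public identifier strings must be compared to the values given in the lists above in an ASCII case-insensitive manner. A system identifier whose value is the empty string is not considered missing for the purposes of the conditions above.

Then, switch the insertion mode to "before html".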

任何其他情况

如果该文档 不是 一个 iframe srcdoc 文档, 那么这是一个 解析错误; 把 Document 设置为 怪异模式

任何情况下,都把 插入模式 设置为 "before html",然后开始重新处理标记。

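13.2.6.4.2 The "before html" insertion mode

When the user agent is to apply the rules for the "before html" insertion mode, the user agent must handle the token as follows:

A DOCTYPE token

Parse error. Ignore the token.

A comment token

Insert a comment as the last child of the Document object.

A character token that is one of U+0009 CHARACTER TABULATION, U+000A LINE FEED (LF), U+000C FORM FEED (FF), U+000D CARRIAGE RETURN (CR), or U+0020 SPACE

Ignore the token.

A start tag whose tag name is "html"

Create an element for the token in the HTML namespace, with the Document as the intended parent. Append it to the Document object. Put this element in the stack of open elements.

If the Document is being loaded as part of navigation of a browsing context, run these steps:

  1. If the result of running match service worker registration for the document's URL is non-null, run the application cache selection algorithm passing the Document object with no manifest.

  2. Otherwise, run these substeps:

    1. If the newly created element has a manifest attribute whose value is not the empty string, then parse the value of that attribute, relative to the newly created element's node document, and if that is successful, run the application cache selection algorithm passing the Document object with the result of applying the URL serializer algorithm to the resulting URL record with the exclude fragment flag set.

    2. Otherwise, run the application cache selection algorithm passing the Document object with no manifest.

Switch the insertion mode to "before head".

An end tag whose tag name is one of: "head", "body", "html", "br"

Act as described in the "anything else" entry below.

Any other end tag

Parse error. Ignore the token.

Anything else

Create an html element whose node document is the Document object. Append it to the Document object. Put this element in the stack of open elements.

If the Document is being loaded as part of navigation of a browsing context, then run the application cache selection algorithm with no manifest, passing it the Document object.

Switch the insertion mode to "before head", then reprocess the token.

The document element can end up being removed from the Document object, e.g. by scripts; nothing in particular happens in such cases, content continues being appended to the nodes as described in the next section.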

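13.2.6.4.3 The "before head" insertion mode

When the user agent is to apply the rules for the "before head" insertion mode, the user agent must handle the token as follows:

A character token that is one of U+0009 CHARACTER TABULATION, U+000A LINE FEED (LF), U+000C FORM FEED (FF), U+000D CARRIAGE RETURN (CR), or U+0020 SPACE

Ignore the token.

A comment token

Insert a comment.

A DOCTYPE token

Parse error. Ignore the token.

A start tag whose tag name is "html"

Process the token using the rules for the "in body" insertion mode.

A start tag whose tag name is "head"

Insert an HTML element for the token.

Set the head element pointer to the newly created head element.

Switch the insertion mode to "in head".

An end tag whose tag name is one of: "head", "body", "html", "br"

Act as described in the "anything else" entry below.

Any other end tag

Parse error. Ignore the token.

Anything else

Insert an HTML element for a "head" start tag token with no attributes.

Set the head element pointer to the newly created head element.

Switch the insertion mode to "in head".

Reprocess the current token.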

13.2.6.4.4 The "in head" insertion mode

When the user agent is to apply the rules for the "in head" insertion mode, the user agent must handle the token as follows:

A character token that is one of U+0009 CHARACTER TABULATION, U+000A LINE FEED (LF), U+000C FORM FEED (FF), U+000D CARRIAGE RETURN (CR), or U+0020 SPACE

Insert the character.

A comment token

Insert a comment.

A DOCTYPE token

Parse error. Ignore the token.

A start tag whose tag name is "html"

Process the token using the rules for the "in body" insertion mode.

A start tag whose tag name is one of: "base", "basefont", "bgsound", "link"

Insert an HTML element for the token. Immediately pop the current node off the stack of open elements.

Acknowledge the token's self-closing flag, if it is set.

A start tag whose tag name is "meta"

Insert an HTML element for the token. Immediately pop the current node off the stack of open elements.

Acknowledge the token's self-closing flag, if it is set.

If the element has a charset attribute, and getting an encoding from its value results in an encoding, and the confidence is currently tentative, then change the encoding to the resulting encoding.

Otherwise, if the element has an http-equiv attribute whose value is an ASCII case-insensitive match for the string "Content-Type", and the element has a content attribute, and applying the algorithm for extracting a character encoding from a meta element to that attribute's value returns an encoding, and the confidence is currently tentative, then change the encoding to the extracted encoding.
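The two encoding checks for a meta element can be sketched as a single decision function. This is an illustration under stated assumptions: `get_encoding` and `extract_from_content` are hypothetical callables standing in for "getting an encoding" and the algorithm for extracting a character encoding from a meta element, neither of which is implemented here.

```python
import string
from typing import Callable, Optional

# Translation table for ASCII-only lowercasing (str.lower() would also fold non-ASCII).
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def encoding_declared_by_meta(
    attrs: dict,
    confidence: str,
    get_encoding: Callable[[str], Optional[str]],
    extract_from_content: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return the encoding to change to, or None if the meta element changes nothing."""
    if confidence != "tentative":
        return None  # both branches require a tentative confidence
    if "charset" in attrs:
        encoding = get_encoding(attrs["charset"])
        if encoding is not None:
            return encoding
    # "Otherwise": reached when any part of the charset condition failed.
    http_equiv = attrs.get("http-equiv")
    if (
        http_equiv is not None
        and http_equiv.translate(_ASCII_LOWER) == "content-type"
        and "content" in attrs
    ):
        return extract_from_content(attrs["content"])
    return None
```

Note that an unrecognised charset value does not end the check: the http-equiv branch is still tried.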

A start tag whose tag name is "title"

Follow the generic RCDATA element parsing algorithm.

A start tag whose tag name is "noscript", if the scripting flag is enabled
A start tag whose tag name is one of: "noframes", "style"

Follow the generic raw text element parsing algorithm.

A start tag whose tag name is "noscript", if the scripting flag is disabled

Insert an HTML element for the token.

Switch the insertion mode to "in head noscript".

A start tag whose tag name is "script"

Run these steps:

  1. Let the adjusted insertion location be the appropriate place for inserting a node.

  2. Create an element for the token in the HTML namespace, with the intended parent being the element in which the adjusted insertion location finds itself.

  3. Set the element's parser document to the Document, and unset the element's "non-blocking" flag.

    This ensures that, if the script is external, any document.write() calls in the script will execute in-line, instead of blowing the document away, as would happen in most other cases. It also prevents the script from executing until the end tag is seen.

  4. If the parser was created as part of the HTML fragment parsing algorithm, then mark the script element as "already started". (fragment case)

  5. If the parser was invoked via the document.write() or document.writeln() methods, then optionally mark the script element as "already started". (For example, the user agent might use this clause to prevent execution of cross-origin scripts inserted via document.write() under slow network conditions, or when the page has already taken a long time to load.)

  6. Insert the newly created element at the adjusted insertion location.

  7. Push the element onto the stack of open elements so that it is the new current node.

  8. Switch the tokenizer to the script data state.

  9. Let the original insertion mode be the current insertion mode.

  10. Switch the insertion mode to "text".

An end tag whose tag name is "head"

Pop the current node (which will be the head element) off the stack of open elements.

Switch the insertion mode to "after head".

An end tag whose tag name is one of: "body", "html", "br"

Act as described in the "anything else" entry below.

A start tag whose tag name is "template"

Insert an HTML element for the token.

Insert a marker at the end of the list of active formatting elements.

Set the frameset-ok flag to "not ok".

Switch the insertion mode to "in template".

Push "in template" onto the stack of template insertion modes so that it is the new current template insertion mode.

An end tag whose tag name is "template"

If there is no template element on the stack of open elements, then this is a parse error; ignore the token.

Otherwise, run these steps:

  1. Generate all implied end tags thoroughly.

  2. If the current node is not a template element, then this is a parse error.

  3. Pop elements from the stack of open elements until a template element has been popped from the stack.

  4. Clear the list of active formatting elements up to the last marker.

  5. Pop the current template insertion mode off the stack of template insertion modes.

  6. Reset the insertion mode appropriately.

A start tag whose tag name is "head"
Any other end tag

Parse error. Ignore the token.

Anything else

Pop the current node (which will be the head element) off the stack of open elements.

Switch the insertion mode to "after head".

Reprocess the token.

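13.2.6.4.5 The "in head noscript" insertion mode

When the user agent is to apply the rules for the "in head noscript" insertion mode, the user agent must handle the token as follows:

A DOCTYPE token

Parse error. Ignore the token.

A start tag whose tag name is "html"

Process the token using the rules for the "in body" insertion mode.

An end tag whose tag name is "noscript"

Pop the current node (which will be a noscript element) from the stack of open elements; the new current node will be a head element.

Switch the insertion mode to "in head".

A character token that is one of U+0009 CHARACTER TABULATION, U+000A LINE FEED (LF), U+000C FORM FEED (FF), U+000D CARRIAGE RETURN (CR), or U+0020 SPACE
A comment token
A start tag whose tag name is one of: "basefont", "bgsound", "link", "meta", "noframes", "style"

Process the token using the rules for the "in head" insertion mode.

An end tag whose tag name is "br"

Act as described in the "anything else" entry below.

A start tag whose tag name is one of: "head", "noscript"
Any other end tag

Parse error. Ignore the token.

Anything else

Parse error.

Pop the current node (which will be a noscript element) from the stack of open elements; the new current node will be a head element.

Switch the insertion mode to "in head".

Reprocess the token.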

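13.2.6.4.6 The "after head" insertion mode

When the user agent is to apply the rules for the "after head" insertion mode, the user agent must handle the token as follows:

A character token that is one of U+0009 CHARACTER TABULATION, U+000A LINE FEED (LF), U+000C FORM FEED (FF), U+000D CARRIAGE RETURN (CR), or U+0020 SPACE

Insert the character.

A comment token

Insert a comment.

A DOCTYPE token

Parse error. Ignore the token.

A start tag whose tag name is "html"

Process the token using the rules for the "in body" insertion mode.

A start tag whose tag name is "body"

Insert an HTML element for the token.

Set the frameset-ok flag to "not ok".

Switch the insertion mode to "in body".

A start tag whose tag name is "frameset"

Insert an HTML element for the token.

Switch the insertion mode to "in frameset".

A start tag whose tag name is one of: "base", "basefont", "bgsound", "link", "meta", "noframes", "script", "style", "template", "title"

Parse error.

Push the node pointed to by the head element pointer onto the stack of open elements.

Process the token using the rules for the "in head" insertion mode.

Remove the node pointed to by the head element pointer from the stack of open elements. (It might not be the current node at this point.)

The head element pointer cannot be null at this point.

An end tag whose tag name is "template"

Process the token using the rules for the "in head" insertion mode.

An end tag whose tag name is one of: "body", "html", "br"

Act as described in the "anything else" entry below.

A start tag whose tag name is "head"
Any other end tag

Parse error. Ignore the token.

Anything else

Insert an HTML element for a "body" start tag token with no attributes.

Switch the insertion mode to "in body".

Reprocess the current token.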
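The "anything else" entries of the "initial", "before html", "before head", "in head" and "after head" insertion modes chain together through reprocessing: one token that matches none of the specific entries creates the html, head and body elements implicitly. The sketch below models only that chain. The class and field names are hypothetical, elements are reduced to tag names, and the application cache steps are left out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EarlyParserState:
    """Hypothetical, heavily reduced parser state (names are illustrative)."""
    mode: str = "initial"
    is_iframe_srcdoc: bool = False
    quirks: bool = False
    parse_errors: int = 0
    open_elements: list = field(default_factory=list)
    head_pointer: Optional[str] = None


def run_anything_else_chain(state: EarlyParserState) -> None:
    """Reprocess one "anything else" token until the "in body" mode is reached."""
    while state.mode != "in body":  # each iteration is one "reprocess the token"
        if state.mode == "initial":
            if not state.is_iframe_srcdoc:
                state.parse_errors += 1
                state.quirks = True
            state.mode = "before html"
        elif state.mode == "before html":
            state.open_elements.append("html")
            state.mode = "before head"
        elif state.mode == "before head":
            state.open_elements.append("head")
            state.head_pointer = "head"
            state.mode = "in head"
        elif state.mode == "in head":
            state.open_elements.pop()  # the current node is the head element
            state.mode = "after head"
        elif state.mode == "after head":
            state.open_elements.append("body")
            state.mode = "in body"
        else:
            raise ValueError(f"mode not covered by this sketch: {state.mode}")
```

Starting from a fresh state, the chain ends in "in body" with html and body on the stack, which is why a document consisting only of text still gets a complete element skeleton.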

13.2.6.4.7 The "in body" insertion mode

When the user agent is to apply the rules for the "in body" insertion mode, the user agent must handle the token as follows:

A character token that is U+0000 NULL

Parse error. Ignore the token.

A character token that is one of U+0009 CHARACTER TABULATION, U+000A LINE FEED (LF), U+000C FORM FEED (FF), U+000D CARRIAGE RETURN (CR), or U+0020 SPACE

Reconstruct the active formatting elements, if any.

Insert the token's character.

Any other character token

Reconstruct the active formatting elements, if any.

Insert the token's character.

Set the frameset-ok flag to "not ok".

A comment token

Insert a comment.

A DOCTYPE token

Parse error. Ignore the token.

A start tag whose tag name is "html"

Parse error.

If there is a template element on the stack of open elements, then ignore the token.

Otherwise, for each attribute on the token, check to see if the attribute is already present on the top element of the stack of open elements. If it is not, add the attribute and its corresponding value to that element.

A start tag whose tag name is one of: "base", "basefont", "bgsound", "link", "meta", "noframes", "script", "style", "template", "title"
An end tag whose tag name is "template"

Process the token using the rules for the "in head" insertion mode.

A start tag whose tag name is "body"

Parse error.

If the second element on the stack of open elements is not a body element, if the stack of open elements has only one node on it, or if there is a template element on the stack of open elements, then ignore the token. (fragment case)

Otherwise, set the frameset-ok flag to "not ok"; then, for each attribute on the token, check to see if the attribute is already present on the body element (the second element) on the stack of open elements, and if it is not, add the attribute and its corresponding value to that element.
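Both the duplicate "html" and the duplicate "body" start tag entries copy over only those attributes that the existing element lacks. A minimal sketch, assuming attributes are held in plain dictionaries (the function name is hypothetical):

```python
def merge_missing_attributes(element_attrs: dict, token_attrs: dict) -> None:
    """Add each token attribute to the element unless the element already has it."""
    for name, value in token_attrs.items():
        # setdefault leaves an existing value untouched, so the first tag wins.
        element_attrs.setdefault(name, value)
```

Given an existing `{"lang": "en"}` and a second tag carrying `{"lang": "fr", "class": "x"}`, the element keeps `lang="en"` and gains `class="x"`.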

A start tag whose tag name is "frameset"

Parse error.

If the stack of open elements has only one node on it, or if the second element on the stack of open elements is not a body element, then ignore the token. (fragment case)

If the frameset-ok flag is set to "not ok", ignore the token.

Otherwise, run the following steps:

  1. Remove the second element on the stack of open elements from its parent node, if it has one.

  2. Pop all the nodes from the bottom of the stack of open elements, from the current node up to, but not including, the root html element.

  3. Insert an HTML element for the token.

  4. Switch the insertion mode to "in frameset".

An end-of-file token

If the stack of template insertion modes is not empty, then process the token using the rules for the "in template" insertion mode.

Otherwise, follow these steps:

  1. If there is a node in the stack of open elements that is not either a dd element, a dt element, an li element, an optgroup element, an option element, a p element, an rb element, an rp element, an rt element, an rtc element, a tbody element, a td element, a tfoot element, a th element, a thead element, a tr element, the body element, or the html element, then this is a parse error.

  2. Stop parsing.

An end tag whose tag name is "body"

If the stack of open elements does not have a body element in scope, this is a parse error; ignore the token.

Otherwise, if there is a node in the stack of open elements that is not either a dd element, a dt element, an li element, an optgroup element, an option element, a p element, an rb element, an rp element, an rt element, an rtc element, a tbody element, a td element, a tfoot element, a th element, a thead element, a tr element, the body element, or the html element, then this is a parse error.

Switch the insertion mode to "after body".

An end tag whose tag name is "html"

If the stack of open elements does not have a body element in scope, this is a parse error; ignore the token.

Otherwise, if there is a node in the stack of open elements that is not either a dd element, a dt element, an li element, an optgroup element, an option element, a p element, an rb element, an rp element, an rt element, an rtc element, a tbody element, a td element, a tfoot element, a th element, a thead element, a tr element, the body element, or the html element, then this is a parse error.

Switch the insertion mode to "after body".

Reprocess the token.
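The end-of-file, "body" end tag and "html" end tag entries share the same check for elements that should not still be open. A small sketch, with the stack modelled as a list of tag names and a hypothetical function name:

```python
# Elements that may still be open without triggering the parse error above.
MAY_REMAIN_OPEN = frozenset({
    "dd", "dt", "li", "optgroup", "option", "p", "rb", "rp", "rt", "rtc",
    "tbody", "td", "tfoot", "th", "thead", "tr", "body", "html",
})


def has_unexpected_open_element(open_elements) -> bool:
    """Return True if any node on the stack is outside the allowed list."""
    return any(name not in MAY_REMAIN_OPEN for name in open_elements)
```

An unclosed p is tolerated silently, whereas an unclosed div is reported as a parse error.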

A start tag whose tag name is one of: "address", "article", "aside", "blockquote", "center", "details", "dialog", "dir", "div", "dl", "fieldset", "figcaption", "figure", "footer", "header", "hgroup", "main", "menu", "nav", "ol", "p", "section", "summary", "ul"

If the stack of open elements has a p element in button scope, then close a p element.

Insert an HTML element for the token.

A start tag whose tag name is one of: "h1", "h2", "h3", "h4", "h5", "h6"

If the stack of open elements has a p element in button scope, then close a p element.

If the current node is an HTML element whose tag name is one of "h1", "h2", "h3", "h4", "h5", or "h6", then this is a parse error; pop the current node off the stack of open elements.

Insert an HTML element for the token.

A start tag whose tag name is one of: "pre", "listing"

If the stack of open elements has a p element in button scope, then close a p element.

Insert an HTML element for the token.

If the next token is a U+000A LINE FEED (LF) character token, then ignore that token and move on to the next one. (Newlines at the start of pre blocks are ignored as an authoring convenience.)

Set the frameset-ok flag to "not ok".
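The newline rule for pre and listing (the textarea entry uses the same rule) drops at most one leading line feed. A minimal sketch, assuming the tokens that follow the start tag are available as a list of single characters; the function name is hypothetical:

```python
def skip_leading_newline(character_tokens: list) -> list:
    """Drop the first token if it is U+000A LINE FEED; keep everything else."""
    if character_tokens and character_tokens[0] == "\n":
        return character_tokens[1:]
    return character_tokens
```

Only the first line feed is ignored, so a pre block that starts with two newlines still begins with one.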

A start tag whose tag name is "form"

If the form element pointer is not null, and there is no template element on the stack of open elements, then this is a parse error; ignore the token.

Otherwise:

If the stack of open elements has a p element in button scope, then close a p element.

Insert an HTML element for the token, and, if there is no template element on the stack of open elements, set the form element pointer to point to the element created.

A start tag whose tag name is "li"

Run these steps:

  1. Set the frameset-ok flag to "not ok".

  2. Initialize node to be the current node (the bottommost node of the stack).

  3. Loop: If node is an li element, then run these substeps:

    1. Generate implied end tags, except for li elements.

    2. If the current node is not an li element, then this is a parse error.

    3. Pop elements from the stack of open elements until an li element has been popped from the stack.

    4. Jump to the step labeled done below.

  4. If node is in the special category, but is not an address, div, or p element, then jump to the step labeled done below.

  5. Otherwise, set node to the previous entry in the stack of open elements and return to the step labeled loop.

  6. Done: If the stack of open elements has a p element in button scope, then close a p element.

  7. Finally, insert an HTML element for the token.
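The loop in the "li" start tag entry can be sketched as follows. This is an illustration only: the stack holds tag names, `SPECIAL_SUBSET` is an abbreviated stand-in for the special category, the button scope check covers HTML-namespace boundary elements only, and step 1 (the frameset-ok flag) is omitted. All function names are hypothetical.

```python
# Elements whose end tags may be implied.
IMPLIED = frozenset({"dd", "dt", "li", "optgroup", "option", "p", "rb", "rp", "rt", "rtc"})
# Assumption: abbreviated stand-in for the much longer "special" category.
SPECIAL_SUBSET = frozenset({"address", "body", "div", "html", "li", "ol", "p", "table", "td", "ul"})
# Assumption: HTML-namespace boundaries of button scope; foreign elements omitted.
BUTTON_SCOPE_BOUNDARIES = frozenset(
    {"applet", "button", "caption", "html", "marquee", "object", "table", "td", "template", "th"}
)


def generate_implied_end_tags(stack, exclude):
    while stack and stack[-1] in IMPLIED and stack[-1] != exclude:
        stack.pop()


def has_p_in_button_scope(stack):
    for name in reversed(stack):
        if name == "p":
            return True
        if name in BUTTON_SCOPE_BOUNDARIES:
            return False
    return False


def pop_until(stack, name):
    """Pop nodes until one with the given tag name has been popped."""
    while stack.pop() != name:
        pass


def handle_li_start_tag(stack, errors):
    """Apply steps 2 to 7 of the "li" start tag entry to a stack of tag names."""
    for node in reversed(list(stack)):  # steps 2, 3, 5: walk up from the current node
        if node == "li":
            generate_implied_end_tags(stack, exclude="li")
            if stack[-1] != "li":
                errors.append("current node is not an li element")
            pop_until(stack, "li")
            break
        if node in SPECIAL_SUBSET and node not in ("address", "div", "p"):
            break  # step 4: a special element stops the search
    if has_p_in_button_scope(stack):  # step 6 ("done"): close a p element
        generate_implied_end_tags(stack, exclude="p")
        if stack[-1] != "p":
            errors.append("current node is not a p element")
        pop_until(stack, "p")
    stack.append("li")  # step 7: insert an HTML element for the token
```

Because ul is in the special category, a new li inside a nested list stops at that ul and does not close the li of the outer list.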

A start tag whose tag name is one of: "dd", "dt"

Run these steps:

  1. Set the frameset-ok flag to "not ok".

  2. Initialize node to be the current node (the bottommost node of the stack).

  3. Loop: If node is a dd element, then run these substeps:

    1. Generate implied end tags, except for dd elements.

    2. If the current node is not a dd element, then this is a parse error.

    3. Pop elements from the stack of open elements until a dd element has been popped from the stack.

    4. Jump to the step labeled done below.

  4. If node is a dt element, then run these substeps:

    1. Generate implied end tags, except for dt elements.

    2. If the current node is not a dt element, then this is a parse error.

    3. Pop elements from the stack of open elements until a dt element has been popped from the stack.

    4. Jump to the step labeled done below.

  5. If node is in the special category, but is not an address, div, or p element, then jump to the step labeled done below.

  6. Otherwise, set node to the previous entry in the stack of open elements and return to the step labeled loop.

  7. Done: If the stack of open elements has a p element in button scope, then close a p element.

  8. Finally, insert an HTML element for the token.

A start tag whose tag name is "plaintext"

If the stack of open elements has a p element in button scope, then close a p element.

Insert an HTML element for the token.

Switch the tokenizer to the PLAINTEXT state.

Once a start tag with the tag name "plaintext" has been seen, that will be the last token ever seen other than character tokens (and the end-of-file token), because there is no way to switch out of the PLAINTEXT state.

A start tag whose tag name is "button"

  1. If the stack of open elements has a button element in scope, then run these substeps:

    1. Parse error.

    2. Generate implied end tags.

    3. Pop elements from the stack of open elements until a button element has been popped from the stack.

  2. Reconstruct the active formatting elements, if any.

  3. Insert an HTML element for the token.

  4. Set the frameset-ok flag to "not ok".

An end tag whose tag name is one of: "address", "article", "aside", "blockquote", "button", "center", "details", "dialog", "dir", "div", "dl", "fieldset", "figcaption", "figure", "footer", "header", "hgroup", "listing", "main", "menu", "nav", "ol", "pre", "section", "summary", "ul"

If the stack of open elements does not have an element in scope that is an HTML element with the same tag name as that of the token, then this is a parse error; ignore the token.

Otherwise, run these steps:

  1. Generate implied end tags.

  2. If the current node is not an HTML element with the same tag name as that of the token, then this is a parse error.

  3. Pop elements from the stack of open elements until an HTML element with the same tag name as the token has been popped from the stack.

An end tag whose tag name is "form"

If there is no template element on the stack of open elements, then run these substeps:

  1. Let node be the element that the form element pointer is set to, or null if it is not set to an element.

  2. Set the form element pointer to null.

  3. If node is null or if the stack of open elements does not have node in scope, then this is a parse error; return and ignore the token.

  4. Generate implied end tags.

  5. If the current node is not node, then this is a parse error.

  6. Remove node from the stack of open elements.

If there is a template element on the stack of open elements, then run these substeps instead:

  1. If the stack of open elements does not have a form element in scope, then this is a parse error; return and ignore the token.

  2. Generate implied end tags.

  3. If the current node is not a form element, then this is a parse error.

  4. Pop elements from the stack of open elements until a form element has been popped from the stack.

An end tag whose tag name is "p"

If the stack of open elements does not have a p element in button scope, then this is a parse error; insert an HTML element for a "p" start tag token with no attributes.

Close a p element.

An end tag whose tag name is "li"

If the stack of open elements does not have an li element in list item scope, then this is a parse error; ignore the token.

Otherwise, run these steps:

  1. Generate implied end tags, except for li elements.

  2. If the current node is not an li element, then this is a parse error.

  3. Pop elements from the stack of open elements until an li element has been popped from the stack.

An end tag whose tag name is one of: "dd", "dt"

If the stack of open elements does not have an element in scope that is an HTML element with the same tag name as that of the token, then this is a parse error; ignore the token.

Otherwise, run these steps:

  1. Generate implied end tags, except for HTML elements with the same tag name as the token.

  2. If the current node is not an HTML element with the same tag name as that of the token, then this is a parse error.

  3. Pop elements from the stack of open elements until an HTML element with the same tag name as the token has been popped from the stack.

An end tag whose tag name is one of: "h1", "h2", "h3", "h4", "h5", "h6"

If the stack of open elements does not have an element in scope that is an HTML element and whose tag name is one of "h1", "h2", "h3", "h4", "h5", or "h6", then this is a parse error; ignore the token.

Otherwise, run these steps:

  1. Generate implied end tags.

  2. If the current node is not an HTML element with the same tag name as that of the token, then this is a parse error.

  3. Pop elements from the stack of open elements until an HTML element whose tag name is one of "h1", "h2", "h3", "h4", "h5", or "h6" has been popped from the stack.

An end tag whose tag name is "sarcasm"

Take a deep breath, then act as described in the "any other end tag" entry below.

A start tag whose tag name is "a"

If the list of active formatting elements contains an a element between the end of the list and the last marker on the list (or the start of the list if there is no marker on the list), then this is a parse error; run the adoption agency algorithm for the token, then remove that element from the list of active formatting elements and the stack of open elements if the adoption agency algorithm didn't already remove it (it might not have if the element is not in table scope).

In the non-conforming stream <a href="a">a<table><a href="b">b</table>x, the first a element would be closed upon seeing the second one, and the "x" character would be inside a link to "b", not to "a". This is despite the fact that the outer a element is not in table scope (meaning that a regular </a> end tag at the start of the table wouldn't close the outer a element). The result is that the two a elements are indirectly nested inside each other — non-conforming markup will often result in non-conforming DOMs when parsed.

Reconstruct the active formatting elements, if any.

Insert an HTML element for the token. Push onto the list of active formatting elements that element.

A start tag whose tag name is one of: "b", "big", "code", "em", "font", "i", "s", "small", "strike", "strong", "tt", "u"

Reconstruct the active formatting elements, if any.

Insert an HTML element for the token. Push onto the list of active formatting elements that element.

A start tag whose tag name is "nobr"

Reconstruct the active formatting elements, if any.

If the stack of open elements has a nobr element in scope, then this is a parse error; run the adoption agency algorithm for the token, then once again reconstruct the active formatting elements, if any.

Insert an HTML element for the token. Push onto the list of active formatting elements that element.

An end tag whose tag name is one of: "a", "b", "big", "code", "em", "font", "i", "nobr", "s", "small", "strike", "strong", "tt", "u"

Run the adoption agency algorithm for the token.

A start tag whose tag name is one of: "applet", "marquee", "object"

Reconstruct the active formatting elements, if any.

Insert an HTML element for the token.

Insert a marker at the end of the list of active formatting elements.

Set the frameset-ok flag to "not ok".

An end tag token whose tag name is one of: "applet", "marquee", "object"

If the stack of open elements does not have an element in scope that is an HTML element with the same tag name as that of the token, then this is a parse error; ignore the token.

Otherwise, run these steps:

  1. Generate implied end tags.

  2. If the current node is not an HTML element with the same tag name as that of the token, then this is a parse error.

  3. Pop elements from the stack of open elements until an HTML element with the same tag name as the token has been popped from the stack.

  4. Clear the list of active formatting elements up to the last marker.

A start tag whose tag name is "table"

If the Document is not set to quirks mode, and the stack of open elements has a p element in button scope, then close a p element.

Insert an HTML element for the token.

Set the frameset-ok flag to "not ok".

Switch the insertion mode to "in table".

An end tag whose tag name is "br"

Parse error. Drop the attributes from the token, and act as described in the next entry; i.e. act as if this was a "br" start tag token with no attributes, rather than the end tag token that it actually is.

A start tag whose tag name is one of: "area", "br", "embed", "img", "keygen", "wbr"

Reconstruct the active formatting elements, if any.

Insert an HTML element for the token. Immediately pop the current node off the stack of open elements.

Acknowledge the token's self-closing flag, if it is set.

Set the frameset-ok flag to "not ok".

A start tag whose tag name is "input"

Reconstruct the active formatting elements, if any.

Insert an HTML element for the token. Immediately pop the current node off the stack of open elements.

Acknowledge the token's self-closing flag, if it is set.

If the token does not have an attribute with the name "type", or if it does, but that attribute's value is not an ASCII case-insensitive match for the string "hidden", then: set the frameset-ok flag to "not ok".
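The frameset-ok decision for an input start tag depends only on its type attribute. A minimal sketch, assuming the token's attributes are a dictionary with lowercase names; the function name is hypothetical:

```python
import string

# ASCII-only lowercasing, to match "ASCII case-insensitive" rather than Unicode folding.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def input_sets_frameset_not_ok(attrs: dict) -> bool:
    """Return True when the input start tag must set the frameset-ok flag to "not ok"."""
    type_value = attrs.get("type")
    if type_value is None:
        return True  # no type attribute at all
    return type_value.translate(_ASCII_LOWER) != "hidden"
```

A hidden input therefore leaves the flag alone, so a later frameset start tag can still replace the body.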

A start tag whose tag name is one of: "param", "source", "track"

Insert an HTML element for the token. Immediately pop the current node off the stack of open elements.

Acknowledge the token's self-closing flag, if it is set.

A start tag whose tag name is "hr"

If the stack of open elements has a p element in button scope, then close a p element.

Insert an HTML element for the token. Immediately pop the current node off the stack of open elements.

Acknowledge the token's self-closing flag, if it is set.

Set the frameset-ok flag to "not ok".

A start tag whose tag name is "image"

Parse error. Change the token's tag name to "img" and reprocess it. (Don't ask.)

A start tag whose tag name is "textarea"

Run these steps:

  1. Insert an HTML element for the token.

  2. If the next token is a U+000A LINE FEED (LF) character token, then ignore that token and move on to the next one. (Newlines at the start of textarea elements are ignored as an authoring convenience.)

  3. Switch the tokenizer to the RCDATA state.

  4. Let the original insertion mode be the current insertion mode.

  5. Set the frameset-ok flag to "not ok".

  6. Switch the insertion mode to "text".

A start tag whose tag name is "xmp"

If the stack of open elements has a p element in button scope, then close a p element.

Reconstruct the active formatting elements, if any.

Set the frameset-ok flag to "not ok".

Follow the generic raw text element parsing algorithm.

A start tag whose tag name is "iframe"

Set the frameset-ok flag to "not ok".

Follow the generic raw text element parsing algorithm.

A start tag whose tag name is "noembed"
A start tag whose tag name is "noscript", if the scripting flag is enabled

Follow the generic raw text element parsing algorithm.

A start tag whose tag name is "select"

Reconstruct the active formatting elements, if any.

Insert an HTML element for the token.

Set the frameset-ok flag to "not ok".

If the insertion mode is one of "in table", "in caption", "in table body", "in row", or "in cell", then switch the insertion mode to "in select in table". Otherwise, switch the insertion mode to "in select".
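The choice of insertion mode after a select start tag reduces to a membership test. A short sketch with a hypothetical function name:

```python
# Table-related insertion modes named in the entry above.
TABLE_RELATED_MODES = frozenset({"in table", "in caption", "in table body", "in row", "in cell"})


def mode_after_select(current_mode: str) -> str:
    """Return the insertion mode to switch to after inserting a select element."""
    return "in select in table" if current_mode in TABLE_RELATED_MODES else "in select"
```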

A start tag whose tag name is one of: "optgroup", "option"

If the current node is an option element, then pop the current node off the stack of open elements.

Reconstruct the active formatting elements, if any.

Insert an HTML element for the token.

A start tag whose tag name is one of: "rb", "rtc"

If the stack of open elements has a ruby element in scope, then generate implied end tags. If the current node is not now a ruby element, this is a parse error.

Insert an HTML element for the token.

A start tag whose tag name is one of: "rp", "rt"

If the stack of open elements has a ruby element in scope, then generate implied end tags, except for rtc elements. If the current node is not now an rtc element or a ruby element, this is a parse error.

Insert an HTML element for the token.

A start tag whose tag name is "math"

Reconstruct the active formatting elements, if any.

Adjust MathML attributes for the token. (This fixes the case of MathML attributes that are not all lowercase.)

Adjust foreign attributes for the token. (This fixes the use of namespaced attributes, in particular XLink.)

Insert a foreign element for the token, in the MathML namespace.

If the token has its self-closing flag set, pop the current node off the stack of open elements and acknowledge the token's self-closing flag.

A start tag whose tag name is "svg"

Reconstruct the active formatting elements, if any.

Adjust SVG attributes for the token. (This fixes the case of SVG attributes that are not all lowercase.)

Adjust foreign attributes for the token. (This fixes the use of namespaced attributes, in particular XLink in SVG.)

Insert a foreign element for the token, in the SVG namespace.

If the token has its self-closing flag set, pop the current node off the stack of open elements and acknowledge the token's self-closing flag.

A start tag whose tag name is one of: "caption", "col", "colgroup", "frame", "head", "tbody", "td", "tfoot", "th", "thead", "tr"

Parse error. Ignore the token.

Any other start tag

Reconstruct the active formatting elements, if any.

Insert an HTML element for the token.

This element will be an ordinary element.

Any other end tag

Run these steps:

  1. Initialize node to be the current node (the bottommost node of the stack).

  2. Loop: If node is an HTML element with the same tag name as the token, then:

    1. Generate implied end tags, except for HTML elements with the same tag name as the token.

    2. If node is not the current node, then this is a parse error.

    3. Pop all the nodes from the current node up to node, including node, then stop these steps.

  3. Otherwise, if node is in the special category, then this is a parse error; ignore the token, and return.

  4. Set node to the previous entry in the stack of open elements.

  5. Return to the step labeled loop.
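The "any other end tag" loop can be sketched as a walk up the stack of open elements. As an assumption for this example, the stack holds tag names, node identity is tracked by index, and `SPECIAL_SUBSET` is an abbreviated stand-in for the special category; the function name is hypothetical.

```python
IMPLIED = frozenset({"dd", "dt", "li", "optgroup", "option", "p", "rb", "rp", "rt", "rtc"})
# Assumption: abbreviated stand-in for the much longer "special" category.
SPECIAL_SUBSET = frozenset({"address", "body", "div", "html", "li", "p", "table", "ul"})


def handle_any_other_end_tag(stack, tag_name, errors) -> bool:
    """Return True if an element was closed, False if the token was ignored."""
    for index in range(len(stack) - 1, -1, -1):  # step 1: start at the current node
        node = stack[index]
        if node == tag_name:  # step 2
            # Generate implied end tags, except for elements named tag_name.
            while stack[-1] in IMPLIED and stack[-1] != tag_name:
                stack.pop()
            if index != len(stack) - 1:
                errors.append("node is not the current node")
            del stack[index:]  # pop from the current node up to and including node
            return True
        if node in SPECIAL_SUBSET:  # step 3
            errors.append("special element reached before a match")
            return False
        # Steps 4 and 5: continue with the previous entry in the stack.
    return False
```

The early exit at a special element is what keeps a stray end tag such as `</foo>` from closing elements across a div boundary.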

When the steps above say the user agent is to close a p element, it means that the user agent must run the following steps:

  1. Generate implied end tags, except for p elements.

  2. If the current node is not a p element, then this is a parse error.

  3. Pop elements from the stack of open elements until a p element has been popped from the stack.

The adoption agency algorithm, which takes as its only argument a token token for which the algorithm is being run, consists of the following steps:

  1. Let subject be token's tag name.

  2. If the current node is an HTML element whose tag name is subject, and the current node is not in the list of active formatting elements, then pop the current node off the stack of open elements, and return.

  3. Let outer loop counter be zero.

  4. Outer loop: If outer loop counter is greater than or equal to eight, then return.

  5. Increment outer loop counter by one.

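  6. Let formatting element be the last element in the list of active formatting elements that is between the end of the list and the last marker in the list, if any, or the start of the list otherwise, and that has the tag name subject.

    If there is no such element, then return and instead act as described in the "any other end tag" entry above.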

  7. If formatting element is not in the stack of open elements, then this is a parse error; remove the element from the list, and return.

  8. If formatting element is in the stack of open elements, but the element is not in scope, then this is a parse error; return.

  9. If formatting element is not the current node, this is a parse error. (But do not return.)

  10. Let furthest block be the topmost node in the stack of open elements that is lower in the stack than formatting element, and is an element in the special category. There might not be one.

  11. If there is no furthest block, then the UA must first pop all the nodes from the bottom of the stack of open elements, from the current node up to and including formatting element, then remove formatting element from the list of active formatting elements, and finally return.

  12. Let common ancestor be the element immediately above formatting element in the stack of open elements.

  13. Let a bookmark note the position of formatting element in the list of active formatting elements relative to the elements on either side of it in the list.

  14. Let node and last node be furthest block. Follow these steps:

    1. Let inner loop counter be zero.

    2. Inner loop: Increment inner loop counter by one.

    3. Let node be the element immediately above node in the stack of open elements, or if node is no longer in the stack of open elements (e.g. because it got removed by this algorithm), the element that was immediately above node in the stack of open elements before node was removed.

    4. If node is formatting element, then go to the next step in the overall algorithm.

    5. If inner loop counter is greater than three and node is in the list of active formatting elements, then remove node from the list of active formatting elements.

    6. If node is not in the list of active formatting elements, then remove node from the stack of open elements and then go back to the step labeled inner loop.

    7. Create an element for the token for which the element node was created, in the HTML namespace, with common ancestor as the intended parent; replace the entry for node in the list of active formatting elements with an entry for the new element, replace the entry for node in the stack of open elements with an entry for the new element, and let node be the new element.

    8. If last node is furthest block, then move the aforementioned bookmark to be immediately after the new node in the list of active formatting elements.

    9. Insert last node into node, first removing it from its previous parent node if any.

    10. Let last node be node.

    11. Return to the step labeled inner loop.

  15. Insert whatever last node ended up being in the previous step at the appropriate place for inserting a node, but using common ancestor as the override target.

  16. Create an element for the token for which formatting element was created, in the HTML namespace, with furthest block as the intended parent.

  17. Take all of the child nodes of furthest block and append them to the element created in the last step.

  18. Append that new element to furthest block.

  19. Remove formatting element from the list of active formatting elements, and insert the new element into the list of active formatting elements at the position of the aforementioned bookmark.

  20. Remove formatting element from the stack of open elements, and insert the new element into the stack of open elements immediately below the position of furthest block in that stack.

  21. Jump back to the step labeled outer loop.

This algorithm's name, the "adoption agency algorithm", comes from the way it causes elements to change parents, and is in contrast with other possible algorithms for dealing with misnested content.
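The two lookups that drive the algorithm, finding the formatting element (step 6) and the furthest block (step 10), can be sketched as follows. This is a simplified model, not the full algorithm: elements are represented by unique tag-name strings, `MARKER` is a hypothetical stand-in for a marker entry, `SPECIAL_SUBSET` is a partial stand-in for the special category, and the search in step 6 is assumed to stop at the last marker and to match on tag name.

```python
MARKER = None  # stand-in for a marker in the list of active formatting elements
# Assumption: a small subset of the "special" category, for illustration only.
SPECIAL_SUBSET = frozenset({"html", "body", "div", "p", "table", "td", "th", "li"})


def find_formatting_element(active_formatting, subject):
    """Last entry after the last marker whose tag name is `subject`, or None."""
    for entry in reversed(active_formatting):
        if entry is MARKER:
            return None
        if entry == subject:
            return entry
    return None


def find_furthest_block(stack, formatting_element):
    """Topmost special element lower in the stack than `formatting_element`.

    `stack` lists tag names with the current node last, so "lower in the
    stack" means a larger index. Returns None when there is no such element.
    """
    position = stack.index(formatting_element)
    for node in stack[position + 1:]:
        if node in SPECIAL_SUBSET:
            return node
    return None
```

For the markup `<b>1<p>2</b>`, the stack is `html, body, b, p` when the end tag arrives, so the furthest block is the p element. For `<p>1<b>2<i>3</b>`, nothing special sits below b, so there is no furthest block and step 11 applies.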

13.2.6.4.8 The "text" insertion mode

When the user agent is to apply the rules for the "text" insertion mode, the user agent must handle the token as follows:

A character token

Insert the token's character.

This can never be a U+0000 NULL character; the tokenizer converts those to U+FFFD REPLACEMENT CHARACTER characters.

An end-of-file token

Parse error.

If the current node is a script element, mark the script element as "already started".

Pop the current node off the stack of open elements.

Switch the insertion mode to the original insertion mode and reprocess the token.

An end tag whose tag name is "script"

If the JavaScript execution context stack is empty, perform a microtask checkpoint.

Let script be the current node (which will be a script element).

Pop the current node off the stack of open elements.

Switch the insertion mode to the original insertion mode.

Let the old insertion point have the same value as the current insertion point. Let the insertion point be just before the next input character.

Increment the parser's script nesting level by one.

Prepare the script. This might cause some script to execute, which might cause new characters to be inserted into the tokenizer, and might cause the tokenizer to output more tokens, resulting in a reentrant invocation of the parser.

Decrement the parser's script nesting level by one. If the parser's script nesting level is zero, then set the parser pause flag to false.

Let the insertion point have the value of the old insertion point. (In other words, restore the insertion point to its previous value. This value might be the "undefined" value.)

At this stage, if there is a pending parsing-blocking script, then:

If the script nesting level is not zero:

Set the parser pause flag to true, and abort the processing of any nested invocations of the tokenizer, yielding control back to the caller. (Tokenization will resume when the caller returns to the "outer" tree construction stage.)

The tree construction stage of this particular parser is being called reentrantly, say from a call to document.write().

Otherwise:

Run these steps:

  1. Let the script be the pending parsing-blocking script. There is no longer a pending parsing-blocking script.

  2. Block the tokenizer for this instance of the HTML parser, such that the event loop will not run tasks that invoke the tokenizer.

  3. If the parser's Document has a style sheet that is blocking scripts or the script's "ready to be parser-executed" flag is not set: spin the event loop until the parser's Document has no style sheet that is blocking scripts and the script's "ready to be parser-executed" flag is set.

  4. If this parser has been aborted in the meantime, return.

    This could happen if, e.g., while the spin the event loop algorithm is running, the browsing context gets closed, or the document.open() method gets invoked on the Document.

  5. Unblock the tokenizer for this instance of the HTML parser, such that tasks that invoke the tokenizer can again be run.

  6. Let the insertion point be just before the next input character.

  7. Increment the parser's script nesting level by one (it should be zero before this step, so this sets it to one).

  8. Execute the script.

  9. Decrement the parser's script nesting level by one. If the parser's script nesting level is zero (which it always should be at this point), then set the parser pause flag to false.

  10. Let the insertion point be undefined again.

  11. If there is once again a pending parsing-blocking script, then repeat these steps from step 1.

Any other end tag

Pop the current node off the stack of open elements.

Switch the insertion mode to the original insertion mode.

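13.2.6.4.9 The "in table" insertion mode

When the user agent is to apply the rules for the "in table" insertion mode, the user agent must handle the token as follows:

A character token, if the current node is a table, tbody, tfoot, thead, or tr element

Let the pending table character tokens be an empty list of tokens.

Let the original insertion mode be the current insertion mode.

Switch the insertion mode to "in table text" and reprocess the token.

A comment token

Insert a comment.

A DOCTYPE token

Parse error. Ignore the token.

A start tag whose tag name is "caption"

Clear the stack back to a table context. (See below.)

Insert a marker at the end of the list of active formatting elements.

Insert an HTML element for the token, then switch the insertion mode to "in caption".

A start tag whose tag name is "colgroup"

Clear the stack back to a table context. (See below.)

Insert an HTML element for the token, then switch the insertion mode to "in column group".

A start tag whose tag name is "col"

Clear the stack back to a table context. (See below.)

Insert an HTML element for a "colgroup" start tag token with no attributes, then switch the insertion mode to "in column group".

Reprocess the current token.

A start tag whose tag name is one of: "tbody", "tfoot", "thead"

Clear the stack back to a table context. (See below.)

Insert an HTML element for the token, then switch the insertion mode to "in table body".

A start tag whose tag name is one of: "td", "th", "tr"

Clear the stack back to a table context. (See below.)

Insert an HTML element for a "tbody" start tag token with no attributes, then switch the insertion mode to "in table body".

Reprocess the current token.

A start tag whose tag name is "table"

Parse error.

If the stack of open elements does not have a table element in table scope, ignore the token.

Otherwise:

Pop elements from the stack of open elements until a table element has been popped from the stack.

Reset the insertion mode appropriately.

Reprocess the token.

An end tag whose tag name is "table"

If the stack of open elements does not have a table element in table scope, this is a parse error; ignore the token.

Otherwise:

Pop elements from the stack of open elements until a table element has been popped from the stack.

Reset the insertion mode appropriately.

An end tag whose tag name is one of: "body", "caption", "col", "colgroup", "html", "tbody", "td", "tfoot", "th", "thead", "tr"

Parse error. Ignore the token.

A start tag whose tag name is one of: "style", "script", "template"
An end tag whose tag name is "template"

Process the token using the rules for the "in head" insertion mode.

A start tag whose tag name is "input"

If the token does not have an attribute with the name "type", or if it does, but that attribute's value is not an ASCII case-insensitive match for the string "hidden", then act as described in the "anything else" entry below.

Otherwise:

Parse error.

Insert an HTML element for the token.

Pop that input element off the stack of open elements.

Acknowledge the token's self-closing flag, if it is set.

A start tag whose tag name is "form"

Parse error.

If there is a template element on the stack of open elements, or if the form element pointer is not null, ignore the token.

Otherwise:

Insert an HTML element for the token, and set the form element pointer to point to the element created.

Pop that form element off the stack of open elements.

An end-of-file token

Process the token using the rules for the "in body" insertion mode.

Anything else

Parse error. Enable foster parenting, process the token using the rules for the "in body" insertion mode, and then disable foster parenting.

When the steps above require the UA to clear the stack back to a table context, it means that the UA must, while the current node is not a table, template, or html element, pop elements from the stack of open elements.

This is the same list of elements as used in the has an element in table scope steps.

The current node being an html element after this process is a fragment case.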
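Clearing the stack back to a table context is a simple pop loop. The sketch below is a minimal illustration that models the stack of open elements as a list of tag names (current node last); `clear_stack_back_to` and `TABLE_CONTEXT` are hypothetical names. The context is passed as a parameter because the table body and table row variants differ only in the set of elements that stop the loop.

```python
# Elements that stop the loop for a table context.
TABLE_CONTEXT = frozenset({"table", "template", "html"})


def clear_stack_back_to(stack, context):
    """Pop elements until the current node is one of the `context` elements."""
    while stack and stack[-1] not in context:
        stack.pop()
    return stack
```

When the loop stops at the html element, as with a stack of `html, tr`, the parser is in the fragment case.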

13.2.6.4.10 The "in table text" insertion mode

When the user agent is to apply the rules for the "in table text" insertion mode, the user agent must handle the token as follows:

A character token that is U+0000 NULL

Parse error. Ignore the token.

Any other character token

Append the character token to the pending table character tokens list.

Anything else

If any of the tokens in the pending table character tokens list are character tokens that are not ASCII whitespace, then this is a parse error: reprocess the character tokens in the pending table character tokens list using the rules given in the "anything else" entry in the "in table" insertion mode.

Otherwise, insert the characters given by the pending table character tokens list.

Switch the insertion mode to the original insertion mode and reprocess the token.
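The decision made in the "anything else" entry depends only on whether the buffered characters are all ASCII whitespace. A minimal sketch of that decision follows; `flush_pending_table_characters` is a hypothetical helper, and the returned action labels are illustrative rather than terms from this specification.

```python
ASCII_WHITESPACE = frozenset("\t\n\f\r ")


def flush_pending_table_characters(pending):
    """Decide how the pending table character tokens are handled.

    `pending` is a list of one-character strings. Returns a tuple
    (action, text, is_parse_error). The action is "insert" for
    whitespace-only text, or "foster-parent" when the characters must be
    reprocessed through the "anything else" entry of "in table".
    """
    text = "".join(pending)
    if any(ch not in ASCII_WHITESPACE for ch in text):
        return ("foster-parent", text, True)
    return ("insert", text, False)
```

This is why stray text directly inside a table ends up before the table in the DOM, while indentation between rows stays where it was written.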

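13.2.6.4.11 The "in caption" insertion mode

When the user agent is to apply the rules for the "in caption" insertion mode, the user agent must handle the token as follows:

An end tag whose tag name is "caption"

If the stack of open elements does not have a caption element in table scope, this is a parse error; ignore the token. (fragment case)

Otherwise:

Generate implied end tags.

Now, if the current node is not a caption element, then this is a parse error.

Pop elements from the stack of open elements until a caption element has been popped from the stack.

Clear the list of active formatting elements up to the last marker.

Switch the insertion mode to "in table".

A start tag whose tag name is one of: "caption", "col", "colgroup", "tbody", "td", "tfoot", "th", "thead", "tr"
An end tag whose tag name is "table"

If the stack of open elements does not have a caption element in table scope, this is a parse error; ignore the token. (fragment case)

Otherwise:

Generate implied end tags.

Now, if the current node is not a caption element, then this is a parse error.

Pop elements from the stack of open elements until a caption element has been popped from the stack.

Clear the list of active formatting elements up to the last marker.

Switch the insertion mode to "in table".

Reprocess the token.

An end tag whose tag name is one of: "body", "col", "colgroup", "html", "tbody", "td", "tfoot", "th", "thead", "tr"

Parse error. Ignore the token.

Anything else

Process the token using the rules for the "in body" insertion mode.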

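13.2.6.4.12 The "in column group" insertion mode

When the user agent is to apply the rules for the "in column group" insertion mode, the user agent must handle the token as follows:

A character token that is one of U+0009 CHARACTER TABULATION, U+000A LINE FEED (LF), U+000C FORM FEED (FF), U+000D CARRIAGE RETURN (CR), or U+0020 SPACE

Insert the character.

A comment token

Insert a comment.

A DOCTYPE token

Parse error. Ignore the token.

A start tag whose tag name is "html"

Process the token using the rules for the "in body" insertion mode.

A start tag whose tag name is "col"

Insert an HTML element for the token. Immediately pop the current node off the stack of open elements.

Acknowledge the token's self-closing flag, if it is set.

An end tag whose tag name is "colgroup"

If the current node is not a colgroup element, then this is a parse error; ignore the token.

Otherwise, pop the current node from the stack of open elements. Switch the insertion mode to "in table".

An end tag whose tag name is "col"

Parse error. Ignore the token.

A start tag whose tag name is "template"
An end tag whose tag name is "template"

Process the token using the rules for the "in head" insertion mode.

An end-of-file token

Process the token using the rules for the "in body" insertion mode.

Anything else

If the current node is not a colgroup element, then this is a parse error; ignore the token.

Otherwise, pop the current node from the stack of open elements.

Switch the insertion mode to "in table".

Reprocess the token.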

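13.2.6.4.13 The "in table body" insertion mode

When the user agent is to apply the rules for the "in table body" insertion mode, the user agent must handle the token as follows:

A start tag whose tag name is "tr"

Clear the stack back to a table body context. (See below.)

Insert an HTML element for the token, then switch the insertion mode to "in row".

A start tag whose tag name is one of: "th", "td"

Parse error.

Clear the stack back to a table body context. (See below.)

Insert an HTML element for a "tr" start tag token with no attributes, then switch the insertion mode to "in row".

Reprocess the current token.

An end tag whose tag name is one of: "tbody", "tfoot", "thead"

If the stack of open elements does not have an element in table scope that is an HTML element with the same tag name as the token, this is a parse error; ignore the token.

Otherwise:

Clear the stack back to a table body context. (See below.)

Pop the current node from the stack of open elements. Switch the insertion mode to "in table".

A start tag whose tag name is one of: "caption", "col", "colgroup", "tbody", "tfoot", "thead"
An end tag whose tag name is "table"

If the stack of open elements does not have a tbody, thead, or tfoot element in table scope, this is a parse error; ignore the token.

Otherwise:

Clear the stack back to a table body context. (See below.)

Pop the current node from the stack of open elements. Switch the insertion mode to "in table".

Reprocess the token.

An end tag whose tag name is one of: "body", "caption", "col", "colgroup", "html", "td", "th", "tr"

Parse error. Ignore the token.

Anything else

Process the token using the rules for the "in table" insertion mode.

When the steps above require the UA to clear the stack back to a table body context, it means that the UA must, while the current node is not a tbody, tfoot, thead, template, or html element, pop elements from the stack of open elements.

The current node being an html element after this process is a fragment case.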

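13.2.6.4.14 The "in row" insertion mode

When the user agent is to apply the rules for the "in row" insertion mode, the user agent must handle the token as follows:

A start tag whose tag name is one of: "th", "td"

Clear the stack back to a table row context. (See below.)

Insert an HTML element for the token, then switch the insertion mode to "in cell".

Insert a marker at the end of the list of active formatting elements.

An end tag whose tag name is "tr"

If the stack of open elements does not have a tr element in table scope, this is a parse error; ignore the token.

Otherwise:

Clear the stack back to a table row context. (See below.)

Pop the current node (which will be a tr element) from the stack of open elements. Switch the insertion mode to "in table body".

A start tag whose tag name is one of: "caption", "col", "colgroup", "tbody", "tfoot", "thead", "tr"
An end tag whose tag name is "table"

If the stack of open elements does not have a tr element in table scope, this is a parse error; ignore the token.

Otherwise:

Clear the stack back to a table row context. (See below.)

Pop the current node (which will be a tr element) from the stack of open elements. Switch the insertion mode to "in table body".

Reprocess the token.

An end tag whose tag name is one of: "tbody", "tfoot", "thead"

If the stack of open elements does not have an element in table scope that is an HTML element with the same tag name as the token, this is a parse error; ignore the token.

If the stack of open elements does not have a tr element in table scope, ignore the token.

Otherwise:

Clear the stack back to a table row context. (See below.)

Pop the current node (which will be a tr element) from the stack of open elements. Switch the insertion mode to "in table body".

Reprocess the token.

An end tag whose tag name is one of: "body", "caption", "col", "colgroup", "html", "td", "th"

Parse error. Ignore the token.

Anything else

Process the token using the rules for the "in table" insertion mode.

When the steps above require the UA to clear the stack back to a table row context, it means that the UA must, while the current node is not a tr, template, or html element, pop elements from the stack of open elements.

The current node being an html element after this process is a fragment case.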

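13.2.6.4.15 The "in cell" insertion mode

When the user agent is to apply the rules for the "in cell" insertion mode, the user agent must handle the token as follows:

An end tag whose tag name is one of: "td", "th"

If the stack of open elements does not have an element in table scope that is an HTML element with the same tag name as that of the token, then this is a parse error; ignore the token.

Otherwise:

Generate implied end tags.

Now, if the current node is not an HTML element with the same tag name as the token, then this is a parse error.

Pop elements from the stack of open elements until an HTML element with the same tag name as the token has been popped from the stack.

Clear the list of active formatting elements up to the last marker.

Switch the insertion mode to "in row".

A start tag whose tag name is one of: "caption", "col", "colgroup", "tbody", "td", "tfoot", "th", "thead", "tr"

If the stack of open elements does not have a td or th element in table scope, then this is a parse error; ignore the token. (fragment case)

Otherwise, close the cell (see below) and reprocess the token.

An end tag whose tag name is one of: "body", "caption", "col", "colgroup", "html"

Parse error. Ignore the token.

An end tag whose tag name is one of: "table", "tbody", "tfoot", "thead", "tr"

If the stack of open elements does not have an element in table scope that is an HTML element with the same tag name as that of the token, then this is a parse error; ignore the token.

Otherwise, close the cell (see below) and reprocess the token.

Anything else

Process the token using the rules for the "in body" insertion mode.

Where the steps above say to close the cell, they mean to run the following algorithm:

  1. Generate implied end tags.

  2. If the current node is not now a td element or a th element, then this is a parse error.

  3. Pop elements from the stack of open elements until a td element or a th element has been popped from the stack.

  4. Clear the list of active formatting elements up to the last marker.

  5. Switch the insertion mode to "in row".

The stack of open elements cannot have both a td and a th element in table scope at the same time, nor can it have neither when the close the cell algorithm is invoked.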

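13.2.6.4.16 The "in select" insertion mode

When the user agent is to apply the rules for the "in select" insertion mode, the user agent must handle the token as follows:

A character token that is U+0000 NULL

Parse error. Ignore the token.

Any other character token

Insert the token's character.

A comment token

Insert a comment.

A DOCTYPE token

Parse error. Ignore the token.

A start tag whose tag name is "html"

Process the token using the rules for the "in body" insertion mode.

A start tag whose tag name is "option"

If the current node is an option element, pop that node from the stack of open elements.

Insert an HTML element for the token.

A start tag whose tag name is "optgroup"

If the current node is an option element, pop that node from the stack of open elements.

If the current node is an optgroup element, pop that node from the stack of open elements.

Insert an HTML element for the token.

An end tag whose tag name is "optgroup"

First, if the current node is an option element, and the node immediately before it in the stack of open elements is an optgroup element, then pop the current node from the stack of open elements.

If the current node is an optgroup element, then pop that node from the stack of open elements. Otherwise, this is a parse error; ignore the token.

An end tag whose tag name is "option"

If the current node is an option element, then pop that node from the stack of open elements. Otherwise, this is a parse error; ignore the token.

An end tag whose tag name is "select"

If the stack of open elements does not have a select element in select scope, this is a parse error; ignore the token. (fragment case)

Otherwise:

Pop elements from the stack of open elements until a select element has been popped from the stack.

Reset the insertion mode appropriately.

A start tag whose tag name is "select"

Parse error.

If the stack of open elements does not have a select element in select scope, ignore the token. (fragment case)

Otherwise:

Pop elements from the stack of open elements until a select element has been popped from the stack.

Reset the insertion mode appropriately.

It just gets treated like an end tag.

A start tag whose tag name is one of: "input", "keygen", "textarea"

Parse error.

If the stack of open elements does not have a select element in select scope, ignore the token. (fragment case)

Otherwise:

Pop elements from the stack of open elements until a select element has been popped from the stack.

Reset the insertion mode appropriately.

Reprocess the token.

A start tag whose tag name is one of: "script", "template"
An end tag whose tag name is "template"

Process the token using the rules for the "in head" insertion mode.

An end-of-file token

Process the token using the rules for the "in body" insertion mode.

Anything else

Parse error. Ignore the token.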

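13.2.6.4.17 The "in select in table" insertion mode

When the user agent is to apply the rules for the "in select in table" insertion mode, the user agent must handle the token as follows:

A start tag whose tag name is one of: "caption", "table", "tbody", "tfoot", "thead", "tr", "td", "th"

Parse error.

Pop elements from the stack of open elements until a select element has been popped from the stack.

Reset the insertion mode appropriately.

Reprocess the token.

An end tag whose tag name is one of: "caption", "table", "tbody", "tfoot", "thead", "tr", "td", "th"

Parse error.

If the stack of open elements does not have an element in table scope that is an HTML element with the same tag name as that of the token, then ignore the token.

Otherwise:

Pop elements from the stack of open elements until a select element has been popped from the stack.

Reset the insertion mode appropriately.

Reprocess the token.

Anything else

Process the token using the rules for the "in select" insertion mode.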

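13.2.6.4.18 The "in template" insertion mode

When the user agent is to apply the rules for the "in template" insertion mode, the user agent must handle the token as follows:

A character token
A comment token
A DOCTYPE token

Process the token using the rules for the "in body" insertion mode.

A start tag whose tag name is one of: "base", "basefont", "bgsound", "link", "meta", "noframes", "script", "style", "template", "title"
An end tag whose tag name is "template"

Process the token using the rules for the "in head" insertion mode.

A start tag whose tag name is one of: "caption", "colgroup", "tbody", "tfoot", "thead"

Pop the current template insertion mode off the stack of template insertion modes.

Push "in table" onto the stack of template insertion modes so that it is the new current template insertion mode.

Switch the insertion mode to "in table", and reprocess the token.

A start tag whose tag name is "col"

Pop the current template insertion mode off the stack of template insertion modes.

Push "in column group" onto the stack of template insertion modes so that it is the new current template insertion mode.

Switch the insertion mode to "in column group", and reprocess the token.

A start tag whose tag name is "tr"

Pop the current template insertion mode off the stack of template insertion modes.

Push "in table body" onto the stack of template insertion modes so that it is the new current template insertion mode.

Switch the insertion mode to "in table body", and reprocess the token.

A start tag whose tag name is one of: "td", "th"

Pop the current template insertion mode off the stack of template insertion modes.

Push "in row" onto the stack of template insertion modes so that it is the new current template insertion mode.

Switch the insertion mode to "in row", and reprocess the token.

Any other start tag

Pop the current template insertion mode off the stack of template insertion modes.

Push "in body" onto the stack of template insertion modes so that it is the new current template insertion mode.

Switch the insertion mode to "in body", and reprocess the token.

Any other end tag

Parse error. Ignore the token.

An end-of-file token

If there is no template element on the stack of open elements, then stop parsing. (fragment case)

Otherwise, this is a parse error.

Pop elements from the stack of open elements until a template element has been popped from the stack.

Clear the list of active formatting elements up to the last marker.

Pop the current template insertion mode off the stack of template insertion modes.

Reset the insertion mode appropriately.

Reprocess the token.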
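The start tag entries of the "in template" insertion mode amount to a lookup from tag name to the insertion mode that replaces the current template insertion mode. The sketch below is one possible way to express that lookup; the function names are hypothetical, and the stack of template insertion modes is modeled as a plain list of mode-name strings.

```python
# Start tags that "in template" hands to the "in head" rules without remapping.
IN_HEAD_START_TAGS = frozenset(
    {"base", "basefont", "bgsound", "link", "meta",
     "noframes", "script", "style", "template", "title"}
)


def template_mode_for_start_tag(tag_name):
    """Insertion mode to switch to for a start tag, or None for "in head" tags."""
    if tag_name in IN_HEAD_START_TAGS:
        return None
    if tag_name in {"caption", "colgroup", "tbody", "tfoot", "thead"}:
        return "in table"
    if tag_name == "col":
        return "in column group"
    if tag_name == "tr":
        return "in table body"
    if tag_name in {"td", "th"}:
        return "in row"
    return "in body"


def retarget_template_mode(template_modes, tag_name):
    """Replace the current template insertion mode and return the new mode."""
    mode = template_mode_for_start_tag(tag_name)
    if mode is not None:
        template_modes.pop()          # pop the current template insertion mode
        template_modes.append(mode)   # push its replacement
    return mode
```

This remapping is what lets a template hold content such as a bare `<td>` that would otherwise be dropped by the "in body" rules.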

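13.2.6.4.19 The "after body" insertion mode

When the user agent is to apply the rules for the "after body" insertion mode, the user agent must handle the token as follows:

A character token that is one of U+0009 CHARACTER TABULATION, U+000A LINE FEED (LF), U+000C FORM FEED (FF), U+000D CARRIAGE RETURN (CR), or U+0020 SPACE

Process the token using the rules for the "in body" insertion mode.

A comment token

Insert a comment as the last child of the first element in the stack of open elements (the html element).

A DOCTYPE token

Parse error. Ignore the token.

A start tag whose tag name is "html"

Process the token using the rules for the "in body" insertion mode.

An end tag whose tag name is "html"

If the parser was created as part of the HTML fragment parsing algorithm, this is a parse error; ignore the token. (fragment case)

Otherwise, switch the insertion mode to "after after body".

An end-of-file token

Stop parsing.

Anything else

Parse error. Switch the insertion mode to "in body" and reprocess the token.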

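13.2.6.4.20 The "in frameset" insertion mode

When the user agent is to apply the rules for the "in frameset" insertion mode, the user agent must handle the token as follows:

A character token that is one of U+0009 CHARACTER TABULATION, U+000A LINE FEED (LF), U+000C FORM FEED (FF), U+000D CARRIAGE RETURN (CR), or U+0020 SPACE

Insert the character.

A comment token

Insert a comment.

A DOCTYPE token

Parse error. Ignore the token.

A start tag whose tag name is "html"

Process the token using the rules for the "in body" insertion mode.

A start tag whose tag name is "frameset"

Insert an HTML element for the token.

An end tag whose tag name is "frameset"

If the current node is the root html element, then this is a parse error; ignore the token. (fragment case)

Otherwise, pop the current node from the stack of open elements.

If the parser was not created as part of the HTML fragment parsing algorithm (fragment case), and the current node is no longer a frameset element, then switch the insertion mode to "after frameset".

A start tag whose tag name is "frame"

Insert an HTML element for the token. Immediately pop the current node off the stack of open elements.

Acknowledge the token's self-closing flag, if it is set.

A start tag whose tag name is "noframes"

Process the token using the rules for the "in head" insertion mode.

An end-of-file token

If the current node is not the root html element, then this is a parse error.

The current node can only be the root html element in the fragment case.

Stop parsing.

Anything else

Parse error. Ignore the token.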

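13.2.6.4.21 The "after frameset" insertion mode

When the user agent is to apply the rules for the "after frameset" insertion mode, the user agent must handle the token as follows:

A character token that is one of U+0009 CHARACTER TABULATION, U+000A LINE FEED (LF), U+000C FORM FEED (FF), U+000D CARRIAGE RETURN (CR), or U+0020 SPACE

Insert the character.

A comment token

Insert a comment.

A DOCTYPE token

Parse error. Ignore the token.

A start tag whose tag name is "html"

Process the token using the rules for the "in body" insertion mode.

An end tag whose tag name is "html"

Switch the insertion mode to "after after frameset".

A start tag whose tag name is "noframes"

Process the token using the rules for the "in head" insertion mode.

An end-of-file token

Stop parsing.

Anything else

Parse error. Ignore the token.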

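13.2.6.4.22 The "after after body" insertion mode

When the user agent is to apply the rules for the "after after body" insertion mode, the user agent must handle the token as follows:

A comment token

Insert a comment as the last child of the Document object.

A DOCTYPE token
A character token that is one of U+0009 CHARACTER TABULATION, U+000A LINE FEED (LF), U+000C FORM FEED (FF), U+000D CARRIAGE RETURN (CR), or U+0020 SPACE
A start tag whose tag name is "html"

Process the token using the rules for the "in body" insertion mode.

An end-of-file token

Stop parsing.

Anything else

Parse error. Switch the insertion mode to "in body" and reprocess the token.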

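13.2.6.4.23 The "after after frameset" insertion mode

When the user agent is to apply the rules for the "after after frameset" insertion mode, the user agent must handle the token as follows:

A comment token

Insert a comment as the last child of the Document object.

A DOCTYPE token
A character token that is one of U+0009 CHARACTER TABULATION, U+000A LINE FEED (LF), U+000C FORM FEED (FF), U+000D CARRIAGE RETURN (CR), or U+0020 SPACE
A start tag whose tag name is "html"

Process the token using the rules for the "in body" insertion mode.

An end-of-file token

Stop parsing.

A start tag whose tag name is "noframes"

Process the token using the rules for the "in head" insertion mode.

Anything else

Parse error. Ignore the token.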

13.2.6.5 The rules for parsing tokens in foreign content

When the user agent is to apply the rules for parsing tokens in foreign content, the user agent must handle the token as follows:

A character token that is U+0000 NULL

Parse error. Insert a U+FFFD REPLACEMENT CHARACTER character.

A character token that is one of U+0009 CHARACTER TABULATION, U+000A LINE FEED (LF), U+000C FORM FEED (FF), U+000D CARRIAGE RETURN (CR), or U+0020 SPACE

Insert the token's character.

Any other character token

Insert the token's character.

Set the frameset-ok flag to "not ok".

A comment token

Insert a comment.

A DOCTYPE token

Parse error. Ignore the token.

A start tag whose tag name is one of: "b", "big", "blockquote", "body", "br", "center", "code", "dd", "div", "dl", "dt", "em", "embed", "h1", "h2", "h3", "h4", "h5", "h6", "head", "hr", "i", "img", "li", "listing", "menu", "meta", "nobr", "ol", "p", "pre", "ruby", "s", "small", "span", "strong", "strike", "sub", "sup", "table", "tt", "u", "ul", "var"
A start tag whose tag name is "font", if the token has any attributes named "color", "face", or "size"

Parse error.

If the parser was created as part of the HTML fragment parsing algorithm, then act as described in the "any other start tag" entry below. (fragment case)

Otherwise:

Pop an element from the stack of open elements, and then keep popping more elements from the stack of open elements until the current node is a MathML text integration point, an HTML integration point, or an element in the HTML namespace.

Then, reprocess the token.

Any other start tag

If the adjusted current node is an element in the MathML namespace, adjust MathML attributes for the token. (This fixes the case of MathML attributes that are not all lowercase.)

If the adjusted current node is an element in the SVG namespace, and the token's tag name is one of the ones in the first column of the following table, change the tag name to the name given in the corresponding cell in the second column. (This fixes the case of SVG elements that are not all lowercase.)

Tag name Element name
altglyph altGlyph
altglyphdef altGlyphDef
altglyphitem altGlyphItem
animatecolor animateColor
animatemotion animateMotion
animatetransform animateTransform
clippath clipPath
feblend feBlend
fecolormatrix feColorMatrix
fecomponenttransfer feComponentTransfer
fecomposite feComposite
feconvolvematrix feConvolveMatrix
fediffuselighting feDiffuseLighting
fedisplacementmap feDisplacementMap
fedistantlight feDistantLight
fedropshadow feDropShadow
feflood feFlood
fefunca feFuncA
fefuncb feFuncB
fefuncg feFuncG
fefuncr feFuncR
fegaussianblur feGaussianBlur
feimage feImage
femerge feMerge
femergenode feMergeNode
femorphology feMorphology
feoffset feOffset
fepointlight fePointLight
fespecularlighting feSpecularLighting
fespotlight feSpotLight
fetile feTile
feturbulence feTurbulence
foreignobject foreignObject
glyphref glyphRef
lineargradient linearGradient
radialgradient radialGradient
textpath textPath

If the adjusted current node is an element in the SVG namespace, adjust SVG attributes for the token. (This fixes the case of SVG attributes that are not all lowercase.)

Adjust foreign attributes for the token. (This fixes the use of namespaced attributes, in particular XLink in SVG.)

Insert a foreign element for the token, in the same namespace as the adjusted current node.

If the token has its self-closing flag set, then run the appropriate steps from the following list:

If the token's tag name is "script", and the new current node is in the SVG namespace

Acknowledge the token's self-closing flag, and then act as described in the steps for a "script" end tag below.

Otherwise

Pop the current node off the stack of open elements and acknowledge the token's self-closing flag.

An end tag whose tag name is "script", if the current node is an SVG script element

Pop the current node off the stack of open elements.

Let the old insertion point have the same value as the current insertion point. Let the insertion point be just before the next input character.

Increment the parser's script nesting level by one. Set the parser pause flag to true.

Process the SVG script element according to the SVG rules, if the user agent supports SVG. [SVG]

Even if this causes new characters to be inserted into the tokenizer, the parser will not be executed reentrantly, since the parser pause flag is true.

Decrement the parser's script nesting level by one. If the parser's script nesting level is zero, then set the parser pause flag to false.

Let the insertion point have the value of the old insertion point. (In other words, restore the insertion point to its previous value. This value might be the "undefined" value.)

Any other end tag

Run these steps:

  1. Initialize node to be the current node (the bottommost node of the stack).

  2. If node's tag name, converted to ASCII lowercase, is not the same as the tag name of the token, then this is a parse error.

  3. Loop: If node is the topmost element in the stack of open elements, then return. (fragment case)

  4. If node's tag name, converted to ASCII lowercase, is the same as the tag name of the token, pop elements from the stack of open elements until node has been popped from the stack, and then return.

  5. Set node to the previous entry in the stack of open elements.

  6. If node is not an element in the HTML namespace, return to the step labeled loop.

  7. Otherwise, process the token according to the rules given in the section corresponding to the current insertion mode in HTML content.
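The "any other end tag" steps for foreign content can be sketched as a walk up the stack that stops at the first HTML element. This is an illustrative model only: `Element` and `foreign_any_other_end_tag` are hypothetical names, the stack is a non-empty list with the current node last, and the second return value signals that the caller should fall back to the rules for the current insertion mode in HTML content (step 7).

```python
import string
from dataclasses import dataclass

HTML_NS = "http://www.w3.org/1999/xhtml"
SVG_NS = "http://www.w3.org/2000/svg"
# Translation table that lowercases only ASCII letters.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


@dataclass
class Element:
    name: str
    namespace: str


def foreign_any_other_end_tag(stack, tag_name):
    """Returns (errors, use_html_rules); pops from `stack` in place."""
    errors = []
    index = len(stack) - 1
    node = stack[index]                       # step 1: the current node
    if node.name.translate(_ASCII_LOWER) != tag_name:
        errors.append("parse error")          # step 2
    while True:
        if index == 0:                        # step 3: topmost element
            return errors, False              # (fragment case)
        if node.name.translate(_ASCII_LOWER) == tag_name:
            del stack[index:]                 # step 4: pop through node
            return errors, False
        index -= 1                            # step 5: previous entry
        node = stack[index]
        if node.namespace == HTML_NS:         # steps 6 and 7
            return errors, True
```

The ASCII-lowercase comparison is what allows an end tag tokenized as "lineargradient" to close an element whose adjusted name is linearGradient.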

13.2.7 The end

Once the user agent stops parsing the document, the user agent must run the following steps:

  1. Set the current document readiness to "interactive" and the insertion point to undefined.

  2. Pop all the nodes off the stack of open elements.

  3. If the list of scripts that will execute when the document has finished parsing is not empty, run these substeps:

    1. Spin the event loop until the first script in the list of scripts that will execute when the document has finished parsing has its "ready to be parser-executed" flag set and the parser's Document has no style sheet that is blocking scripts.

    2. Execute the first script in the list of scripts that will execute when the document has finished parsing.

    3. Remove the first script element from the list of scripts that will execute when the document has finished parsing (i.e. shift out the first entry in the list).

    4. If the list of scripts that will execute when the document has finished parsing is still not empty, repeat these substeps again from substep 1.

  4. Queue a global task on the DOM manipulation task source given the Document's relevant global object to run the following substeps:

    1. Fire an event named DOMContentLoaded at the Document object, with its bubbles attribute initialized to true.

    2. Enable the client message queue of the ServiceWorkerContainer object whose associated service worker client is the Document object's relevant settings object.

  5. Spin the event loop until the set of scripts that will execute as soon as possible and the list of scripts that will execute in order as soon as possible are empty.

  6. Spin the event loop until there is nothing that delays the load event in the Document.

  7. Queue a global task on the DOM manipulation task source given the Document's relevant global object to run the following substeps:

    1. Set the current document readiness to "complete".

    2. Load event: If the Document object's browsing context is non-null, then fire an event named load at the Document object's relevant global object, with legacy target override flag set.

  8. If the Document object's browsing context is non-null, then queue a global task on the DOM manipulation task source given the Document's relevant global object to run these steps:

    1. If the Document's page showing flag is true, then return (i.e. don't fire the event below).

    2. Set the Document's page showing flag to true.

    3. Fire an event named pageshow at the Document object's relevant global object, using PageTransitionEvent, with the persisted attribute initialized to false, and legacy target override flag set.

  9. If the Document has any pending application cache download process tasks, then queue each such task in the order they were added to the list of pending application cache download process tasks, and then empty the list of pending application cache download process tasks. The task source for these tasks is the networking task source.

  10. If the Document's print when loaded flag is set, then run the printing steps.

  11. The Document is now ready for post-load tasks.

  12. Completely finish loading the Document.

When the user agent is to abort a parser, it must run the following steps:

  1. Throw away any pending content in the input stream, and discard any future content that would have been added to it.

  2. Set the current document readiness to "interactive".

  3. Pop all the nodes off the stack of open elements.

  4. Set the current document readiness to "complete".

13.2.8 Coercing an HTML DOM into an infoset

When an application uses an HTML parser in conjunction with an XML pipeline, it is possible that the constructed DOM is not compatible with the XML tool chain in certain subtle ways. For example, an XML toolchain might not be able to represent attributes with the name xmlns, since they conflict with the Namespaces in XML syntax. There is also some data that the HTML parser generates that isn't included in the DOM itself. This section specifies some rules for handling these issues.

If the XML API being used doesn't support DOCTYPEs, the tool may drop DOCTYPEs altogether.

If the XML API doesn't support attributes in no namespace that are named "xmlns", attributes whose names start with "xmlns:", or attributes in the XMLNS namespace, then the tool may drop such attributes.

The tool may annotate the output with any namespace declarations required for proper operation.

If the XML API being used restricts the allowable characters in the local names of elements and attributes, then the tool may map all element and attribute local names that the API wouldn't support to a set of names that are allowed, by replacing any character that isn't supported with the uppercase letter U and the six digits of the character's code point when expressed in hexadecimal, using digits 0-9 and capital letters A-F as the symbols, in increasing numeric order.

For example, the element name foo<bar, which can be output by the HTML parser, though it is neither a legal HTML element name nor a well-formed XML element name, would be converted into fooU00003Cbar, which is a well-formed XML element name (though it's still not legal in HTML by any means).

As another example, consider the attribute xlink:href. Used on a MathML element, it becomes, after being adjusted, an attribute with a prefix "xlink" and a local name "href". However, used on an HTML element, it becomes an attribute with no prefix and the local name "xlink:href", which is not a valid NCName, and thus might not be accepted by an XML API. It could thus get converted, becoming "xlinkU00003Ahref".

The resulting names from this conversion conveniently can't clash with any attribute generated by the HTML parser, since those are all either lowercase or those listed in the adjust foreign attributes algorithm's table.
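The name mapping described above can be sketched as follows. This is only an illustration: the function names are hypothetical, and `isSupported` assumes an XML API that accepts a conservative ASCII subset of name characters. A real tool would substitute whatever check matches the API it is targeting.

```javascript
// Assumption: the hypothetical XML API accepts only this conservative ASCII
// subset in local names. Real XML name rules also admit many non-ASCII
// characters; substitute the check that matches the API actually in use.
function isSupported(ch, index) {
  if (/[A-Za-z_]/.test(ch)) return true;
  // Digits, "-" and "." are allowed anywhere except the first position.
  return index > 0 && /[0-9.-]/.test(ch);
}

// Replace each unsupported character with "U" followed by its code point
// written as six uppercase hexadecimal digits.
function coerceLocalName(name) {
  let out = "";
  let index = 0;
  // for...of iterates by code point, so astral characters stay whole.
  for (const ch of name) {
    if (isSupported(ch, index)) {
      out += ch;
    } else {
      const hex = ch.codePointAt(0).toString(16).toUpperCase();
      out += "U" + hex.padStart(6, "0");
    }
    index++;
  }
  return out;
}

console.log(coerceLocalName("foo<bar"));    // fooU00003Cbar
console.log(coerceLocalName("xlink:href")); // xlinkU00003Ahref
```

Because the inserted "U" is uppercase and parser-generated HTML names are lowercase, the output of this sketch keeps the no-clash property noted above.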

If the XML API restricts comments from having two consecutive U+002D HYPHEN-MINUS characters (--), the tool may insert a single U+0020 SPACE character between any such offending characters.

If the XML API restricts comments from ending in a U+002D HYPHEN-MINUS character (-), the tool may insert a single U+0020 SPACE character at the end of such comments.

If the XML API restricts allowed characters in character data, attribute values, or comments, the tool may replace any U+000C FORM FEED (FF) character with a U+0020 SPACE character, and any other literal non-XML character with a U+FFFD REPLACEMENT CHARACTER.
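The three comment and character rules above might be implemented along these lines. The function names are hypothetical, and `isXmlChar` assumes that "non-XML character" means a code point outside the XML 1.0 `Char` production; an API with different limits would need a different predicate.

```javascript
// Insert a space between consecutive hyphens, and after a trailing hyphen.
function coerceCommentData(data) {
  // The lookahead leaves the second hyphen in place, so runs of three or
  // more hyphens are fully separated.
  let out = data.replace(/-(?=-)/g, "- ");
  if (out.endsWith("-")) out += " ";
  return out;
}

// Assumption: "XML character" means the XML 1.0 Char production.
function isXmlChar(cp) {
  return cp === 0x9 || cp === 0xA || cp === 0xD ||
    (cp >= 0x20 && cp <= 0xD7FF) ||
    (cp >= 0xE000 && cp <= 0xFFFD) ||
    (cp >= 0x10000 && cp <= 0x10FFFF);
}

// Replace FORM FEED with a space and other non-XML characters with U+FFFD.
function coerceText(text) {
  let out = "";
  for (const ch of text) {
    if (ch === "\f") out += " ";
    else if (isXmlChar(ch.codePointAt(0))) out += ch;
    else out += "\uFFFD";
  }
  return out;
}

console.log(coerceCommentData("a--b")); // a- -b
console.log(JSON.stringify(coerceText("a\fb"))); // "a b"
```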

If the tool has no way to convey out-of-band information, then the tool may drop the following information:

The mutations allowed by this section apply after the HTML parser's rules have been applied. For example, a <a::> start tag will be closed by a </a::> end tag, and never by a </aU00003AU00003A> end tag, even if the user agent is using the rules above to then generate an actual element in the DOM with the name aU00003AU00003A for that start tag.

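13.2.9 An introduction to error handling and strange cases in the parser

This section is non-normative.

This section examines some erroneous markup and discusses how the HTML parser handles these cases.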

13.2.9.1 Misnested tags: <b><i></b></i>

This section is non-normative.

The most-often discussed example of erroneous markup is as follows:

<p>1<b>2<i>3</b>4</i>5</p>

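The parsing of this markup is straightforward up to the "3". At this point, the DOM looks like this:

Here, the stack of open elements has five elements on it: html, body, p, b, and i. The list of active formatting elements just has two: b and i. The insertion mode is "in body".

Upon receiving the end tag token with the tag name "b", the "adoption agency algorithm" is invoked. This is a simple case, in that the formatting element is the b element, and there is no furthest block. Thus, the stack of open elements ends up with just three elements: html, body, and p, while the list of active formatting elements has just one: i. The DOM tree is unmodified at this point.

The next token is a character ("4"), which triggers the reconstruction of the active formatting elements, in this case just the i element. A new i element is thus created for the "4" Text node. After the end tag token for the "i" is also received, and the "5" Text node is inserted, the DOM looks as follows: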

13.2.9.2 Misnested tags: <b><p></b></p>

This section is non-normative.

A case similar to the previous one is the following:

<b>1<p>2</b>3</p>

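Up to the "2" the parsing here is straightforward:

The interesting part is when the end tag token with the tag name "b" is parsed.

Before that token is seen, the stack of open elements has four elements on it: html, body, b, and p. The list of active formatting elements just has the one: b. The insertion mode is "in body".

Upon receiving the end tag token with the tag name "b", the "adoption agency algorithm" is invoked, as in the previous example. However, in this case, there is a furthest block, namely the p element. Thus, this time the adoption agency algorithm isn't skipped over.

The common ancestor is the body element. A conceptual "bookmark" marks the position of the b in the list of active formatting elements, but since that list has only one element in it, the bookmark won't have much effect.

As the algorithm progresses, node ends up set to the formatting element (b), and last node ends up set to the furthest block (p).

The last node gets appended (moved) to the common ancestor, so that the DOM looks like:

A new b element is created, and the children of the p element are moved to it:

Finally, the new b element is appended to the p element, so that the DOM looks like:

The b element is removed from the list of active formatting elements and the stack of open elements, so that when the "3" is parsed, it is appended to the p element: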

13.2.9.3 Unexpected markup in tables

This section is non-normative.

Error handling in tables is, for historical reasons, especially strange. For example, consider the following markup:

<table><b><tr><td>aaa</td></tr>bbb</table>ccc

The highlighted b element start tag is not allowed directly inside a table like that, and the parser handles this case by placing the element before the table. (This is called foster parenting.) This can be seen by examining the DOM tree as it stands just after the table element's start tag has been seen:

...and then immediately after the b element start tag has been seen:

At this point, the stack of open elements has on it the elements html, body, table, and b (in that order, despite the resulting DOM tree); the list of active formatting elements just has the b element in it; and the insertion mode is "in table".

The tr start tag causes the b element to be popped off the stack and a tbody start tag to be implied; the tbody and tr elements are then handled in a rather straight-forward manner, taking the parser through the "in table body" and "in row" insertion modes, after which the DOM looks as follows:

Here, the stack of open elements has on it the elements html, body, table, tbody, and tr; the list of active formatting elements still has the b element in it; and the insertion mode is "in row".

The td element start tag token, after putting a td element on the tree, puts a marker on the list of active formatting elements (it also switches to the "in cell" insertion mode).

The marker means that when the "aaa" character tokens are seen, no b element is created to hold the resulting Text node:

The end tags are handled in a straight-forward manner; after handling them, the stack of open elements has on it the elements html, body, table, and tbody; the list of active formatting elements still has the b element in it (the marker having been removed by the "td" end tag token); and the insertion mode is "in table body".

Thus it is that the "bbb" character tokens are found. These trigger the "in table text" insertion mode to be used (with the original insertion mode set to "in table body"). The character tokens are collected, and when the next token (the table element end tag) is seen, they are processed as a group. Since they are not all spaces, they are handled as per the "anything else" rules in the "in table" insertion mode, which defer to the "in body" insertion mode but with foster parenting.

When the active formatting elements are reconstructed, a b element is created and foster parented, and then the "bbb" Text node is appended to it:

The stack of open elements has on it the elements html, body, table, tbody, and the new b (again, note that this doesn't match the resulting tree!); the list of active formatting elements has the new b element in it; and the insertion mode is still "in table body".

Had the character tokens been only ASCII whitespace instead of "bbb", then that ASCII whitespace would just be appended to the tbody element.

Finally, the table is closed by a "table" end tag. This pops all the nodes from the stack of open elements up to and including the table element, but it doesn't affect the list of active formatting elements, so the "ccc" character tokens after the table result in yet another b element being created, this time after the table:

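13.2.9.4 Scripts that modify the page as it is being parsed

This section is non-normative.

Consider the following markup, which for this example we will assume is the document with URL https://example.com/inner, being rendered as the content of an iframe in another document with the URL https://example.com/outer: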

<div id=a>
 <script>
  var div = document.getElementById('a');
  parent.document.body.appendChild(div);
 </script>
 <script>
  alert(document.URL);
 </script>
</div>
<script>
 alert(document.URL);
</script>

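Up to the first "script" end tag, before the script is parsed, the result is relatively straightforward:

After the script is parsed, though, the div element and its child script element are gone:

They are, at this point, in the Document of the aforementioned outer browsing context. However, the stack of open elements still contains the div element.

Thus, when the second script element is parsed, it is inserted into the outer Document object.

Scripts parsed into a different Document than the one the parser was created for do not execute, so the first alert does not show.

Once the div element's end tag is parsed, the div element is popped off the stack, and so the next script element is in the inner Document:

This script does execute, resulting in an alert that says "https://example.com/inner".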

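13.2.9.5 The execution of scripts that are moving across multiple documents

This section is non-normative.

Elaborating on the example in the previous section, consider the case where the second script element is an external script (i.e. one with a src attribute). Since the element was not in the parser's Document when it was created, that external script is not even downloaded.

In a case where a script element with a src attribute is parsed normally into its parser's Document, but the element is moved to another document while the external script is still being downloaded, the script continues to download but does not execute.

In general, moving script elements between Documents is considered a bad practice.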

13.2.9.6 Unclosed formatting elements

This section is non-normative.

The following markup shows how nested formatting elements (such as b) get collected and continue to be applied even as the elements they are contained in are closed, but that excessive duplicates are thrown away.

<!DOCTYPE html>
<p><b class=x><b class=x><b><b class=x><b class=x><b>X
<p>X
<p><b><b class=x><b>X
<p></b></b></b></b></b></b>X

The resulting DOM tree is as follows:

Note how the second p element in the markup has no explicit b elements, but in the resulting DOM, up to three of each kind of formatting element (in this case three b elements with the class attribute, and two unadorned b elements) get reconstructed before the element's "X".

Also note how this means that in the final paragraph only six b end tags are needed to completely clear the list of active formatting elements, even though nine b start tags have been seen up to this point.
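The counts in this example can be reproduced with a small model of the list of active formatting elements. The sketch below is a simplification under stated assumptions: entries are plain objects compared by tag name and attributes only (namespaces and markers are ignored), the helper names are hypothetical, and the reconstruction triggered by the second paragraph is assumed to leave the number of entries unchanged.

```javascript
// Two entries are "the same kind" when tag name and attributes match.
// Attribute order is normalized by sorting the keys.
function sameKind(a, b) {
  const ka = Object.keys(a.attrs).sort();
  const kb = Object.keys(b.attrs).sort();
  return a.tag === b.tag &&
    ka.length === kb.length &&
    ka.every((k, i) => k === kb[i] && a.attrs[k] === b.attrs[k]);
}

// Push an entry, first discarding the earliest matching entry if three of
// the same kind are already present.
function pushFormatting(list, entry) {
  const matches = list.filter((e) => sameKind(e, entry));
  if (matches.length >= 3) {
    list.splice(list.indexOf(matches[0]), 1);
  }
  list.push(entry);
}

const b = () => ({ tag: "b", attrs: {} });
const bx = () => ({ tag: "b", attrs: { class: "x" } });

const list = [];
// First paragraph: <b class=x><b class=x><b><b class=x><b class=x><b>
for (const make of [bx, bx, b, bx, bx, b]) pushFormatting(list, make());
console.log(list.length); // 5

// Third paragraph: <b><b class=x><b>
for (const make of [b, bx, b]) pushFormatting(list, make());
console.log(list.length); // 6
```

After the first paragraph the model holds three `class=x` entries and two plain ones, which matches the five elements reconstructed for the second paragraph. After the third paragraph it holds six, one for each `b` end tag in the final paragraph.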

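13.3 Serializing HTML fragments

The following steps form the HTML fragment serialization algorithm. The algorithm takes as input a DOM Element, Document, or DocumentFragment referred to as the node, and returns a string.

This algorithm serializes the children of the node being serialized, not the node itself.

  1. Let s be a string, and initialize it to the empty string.

  2. If the node is a template element, then let the node instead be the template element's template contents (a DocumentFragment node).

  3. For each child node of the node, in tree order, run the following steps:

    1. Let current node be the child node being processed.

    2. Append the appropriate string from the following list to s: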

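      If current node is an Element

      If current node is an element in the HTML namespace, the MathML namespace, or the SVG namespace, then let tagname be current node's local name. Otherwise, let tagname be current node's qualified name.

      Append a U+003C LESS-THAN SIGN character (<), followed by tagname.

      For HTML elements created by the HTML parser or createElement(), tagname will be lowercase.

      For each attribute that the element has, append a U+0020 SPACE character, the attribute's serialized name as described below, a U+003D EQUALS SIGN character (=), a U+0022 QUOTATION MARK character ("), the attribute's value, escaped as described below in attribute mode, and a second U+0022 QUOTATION MARK character (").

      An attribute's serialized name for the purposes of the previous paragraph must be determined as follows:

      If the attribute has no namespace

      The attribute's serialized name is the attribute's local name.

      For attributes on HTML elements set by the HTML parser or by Element.setAttribute(), the local name will be lowercase.

      If the attribute is in the XML namespace

      The attribute's serialized name is the string "xml:" followed by the attribute's local name.

      If the attribute is in the XMLNS namespace and the attribute's local name is xmlns

      The attribute's serialized name is the string "xmlns".

      If the attribute is in the XMLNS namespace and the attribute's local name is not xmlns

      The attribute's serialized name is the string "xmlns:" followed by the attribute's local name.

      If the attribute is in the XLink namespace

      The attribute's serialized name is the string "xlink:" followed by the attribute's local name.

      If the attribute is in some other namespace

      The attribute's serialized name is the attribute's qualified name.

      While the exact order of attributes is UA-defined, and may depend on factors such as the order that the attributes were given in the original markup, the sort order must be stable, such that consecutive invocations of this algorithm serialize an element's attributes in the same order.

      Append a U+003E GREATER-THAN SIGN character (>).

      If current node is an area, base, basefont, bgsound, br, col, embed, frame, hr, img, input, keygen, link, meta, param, source, track, or wbr element, then continue on to the next child node at this point.

      Append the value of running the HTML fragment serialization algorithm on the current node element (thus recursing into this algorithm for that element), followed by a U+003C LESS-THAN SIGN character (<), a U+002F SOLIDUS character (/), tagname again, and finally a U+003E GREATER-THAN SIGN character (>).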

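      If current node is a Text node

      If the parent of current node is a style, script, xmp, iframe, noembed, noframes, or plaintext element, or if the parent of current node is a noscript element and scripting is enabled for the node, then append the value of current node's data IDL attribute literally.

      Otherwise, append the value of current node's data IDL attribute, escaped as described below.

      If current node is a Comment

      Append the literal string "<!--" (U+003C LESS-THAN SIGN, U+0021 EXCLAMATION MARK, U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS), followed by the value of current node's data IDL attribute, followed by the literal string "-->" (U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN).

      If current node is a ProcessingInstruction

      Append the literal string "<?" (U+003C LESS-THAN SIGN, U+003F QUESTION MARK), followed by the value of current node's target IDL attribute, followed by a single U+0020 SPACE character, followed by the value of current node's data IDL attribute, followed by a single U+003E GREATER-THAN SIGN character (>).

      If current node is a DocumentType

      Append the literal string "<!DOCTYPE" (U+003C LESS-THAN SIGN, U+0021 EXCLAMATION MARK, U+0044 LATIN CAPITAL LETTER D, U+004F LATIN CAPITAL LETTER O, U+0043 LATIN CAPITAL LETTER C, U+0054 LATIN CAPITAL LETTER T, U+0059 LATIN CAPITAL LETTER Y, U+0050 LATIN CAPITAL LETTER P, U+0045 LATIN CAPITAL LETTER E), followed by a space (U+0020 SPACE), followed by the value of current node's name IDL attribute, followed by the literal string ">" (U+003E GREATER-THAN SIGN).

  4. The result of the algorithm is the string s.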
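The steps above can be sketched over a minimal tree of plain objects, which keeps the example runnable without a DOM. The node shape (`type`, `namespace`, `localName`, `qualifiedName`, `attributes`, `children`, `templateContents`) and all function names are assumptions made for illustration. They are not DOM interfaces, and the sketch checks the HTML namespace for void and raw-text elements as a simplification.

```javascript
const HTML_NS = "http://www.w3.org/1999/xhtml";
const MATHML_NS = "http://www.w3.org/1998/Math/MathML";
const SVG_NS = "http://www.w3.org/2000/svg";
const XML_NS = "http://www.w3.org/XML/1998/namespace";
const XMLNS_NS = "http://www.w3.org/2000/xmlns/";
const XLINK_NS = "http://www.w3.org/1999/xlink";

const VOID = new Set(["area", "base", "basefont", "bgsound", "br", "col",
  "embed", "frame", "hr", "img", "input", "keygen", "link", "meta", "param",
  "source", "track", "wbr"]);
const RAW_TEXT_PARENTS = new Set(["style", "script", "xmp", "iframe",
  "noembed", "noframes", "plaintext"]);

// Escaping as described in the steps that follow the algorithm.
function escapeString(s, attributeMode) {
  s = s.replace(/&/g, "&amp;").replace(/\u00A0/g, "&nbsp;");
  return attributeMode
    ? s.replace(/"/g, "&quot;")
    : s.replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

function serializedName(attr) {
  if (!attr.namespace) return attr.localName;
  if (attr.namespace === XML_NS) return "xml:" + attr.localName;
  if (attr.namespace === XMLNS_NS) {
    return attr.localName === "xmlns" ? "xmlns" : "xmlns:" + attr.localName;
  }
  if (attr.namespace === XLINK_NS) return "xlink:" + attr.localName;
  return attr.qualifiedName;
}

function serialize(node, scriptingEnabled = true) {
  let s = "";
  // Step 2: a template element is serialized through its template contents.
  if (node.localName === "template" && node.templateContents) {
    node = node.templateContents;
  }
  const parentIsHtml = node.namespace === HTML_NS;
  for (const cur of node.children || []) {
    if (cur.type === "element") {
      const known = [HTML_NS, MATHML_NS, SVG_NS].includes(cur.namespace);
      const tagname = known ? cur.localName : cur.qualifiedName;
      s += "<" + tagname;
      for (const attr of cur.attributes || []) {
        s += " " + serializedName(attr) + '="' +
          escapeString(attr.value, true) + '"';
      }
      s += ">";
      // Void elements have no contents and no end tag.
      if (cur.namespace === HTML_NS && VOID.has(cur.localName)) continue;
      s += serialize(cur, scriptingEnabled) + "</" + tagname + ">";
    } else if (cur.type === "text") {
      const raw = parentIsHtml && (RAW_TEXT_PARENTS.has(node.localName) ||
        (node.localName === "noscript" && scriptingEnabled));
      s += raw ? cur.data : escapeString(cur.data, false);
    } else if (cur.type === "comment") {
      s += "<!--" + cur.data + "-->";
    } else if (cur.type === "pi") {
      s += "<?" + cur.target + " " + cur.data + ">";
    } else if (cur.type === "doctype") {
      s += "<!DOCTYPE " + cur.name + ">";
    }
  }
  return s;
}

// Usage: build a small tree and serialize the children of <body>.
const el = (localName, children = [], attributes = []) =>
  ({ type: "element", namespace: HTML_NS, localName, attributes, children });
const text = (data) => ({ type: "text", data });

const body = el("body", [
  el("p", [text("a < b & c")], [{ localName: "title", value: 'say "hi"' }]),
  el("br"),
  el("script", [text("if (a < b) {}")]),
]);
console.log(serialize(body));
// <p title="say &quot;hi&quot;">a &lt; b &amp; c</p><br><script>if (a < b) {}</script>
```

Note that the text inside the script element is emitted literally while the text inside the p element is escaped, which is the distinction drawn in the Text node case.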

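It is possible that the output of this algorithm, if parsed with an HTML parser, will not return the original tree structure. Tree structures that do not roundtrip a serialize and reparse step can also be produced by the HTML parser itself, although such cases are typically non-conforming.

For instance, if a textarea element to which a Comment node has been appended is serialized and the output is then reparsed, the comment will end up being displayed in the text control. Similarly, if, as a result of DOM manipulation, an element contains a comment that contains the literal string "-->", then when the result of serializing the element is parsed, the comment will be truncated at that point and the rest of the comment will be interpreted as markup. More examples would be making a script element contain a Text node with the text string "</script>", or having a p element that contains a ul element (as the ul element's start tag would imply the end tag for the p).

This can enable cross-site scripting attacks. An example of this would be a page that lets the user enter some font family names that are then inserted into a CSS style block via the DOM and which then uses the innerHTML IDL attribute to get the HTML serialization of that style element. If the user enters "</style><script>attack</script>" as a font family name, innerHTML will return markup that, if parsed in a different context, would contain a script node, even though no script node existed in the original DOM.

For example, consider the following markup: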

<form id="outer"><div></form><form id="inner"><input>

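This will be parsed into:

The input element will be associated with the inner form element. Now, if this tree structure is serialized and reparsed, the <form id="inner"> start tag will be ignored, and so the input element will be associated with the outer form element instead.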

<html><head></head><body><form id="outer"><div><form id="inner"><input></form></div></form></body></html>

As another example, consider the following markup:

<a><table><a>

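This will be parsed into:

That is, the a elements are nested, because the second a element is foster parented. After a serialize-reparse roundtrip, the a elements and the table element would all be siblings, because the second <a> start tag implicitly closes the first a element.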

<html><head></head><body><a><a></a><table></table></a></body></html>

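For historical reasons, this algorithm does not round-trip an initial U+000A LINE FEED (LF) character in pre, textarea, or listing elements, even though (in the first two cases) the markup being round-tripped can be conforming. The HTML parser will drop such a character during parsing, but this algorithm does not serialize an extra U+000A LINE FEED (LF) character.

For example, consider the following markup: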

<pre>

Hello.</pre>

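When this document is first parsed, the pre element's child text content starts with a single newline character. After a serialize-reparse roundtrip, the pre element's child text content is simply "Hello.".

Escaping a string (for the purposes of the algorithm above) consists of running the following steps:

  1. Replace any occurrence of the "&" character by the string "&amp;".

  2. Replace any occurrences of the U+00A0 NO-BREAK SPACE character by the string "&nbsp;".

  3. If the algorithm was invoked in the attribute mode, replace any occurrences of the """ character by the string "&quot;".

  4. If the algorithm was not invoked in the attribute mode, replace any occurrences of the "<" character by the string "&lt;", and any occurrences of the ">" character by the string "&gt;".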
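The four escaping steps translate directly into string replacements. The sketch below is one possible rendering. The function name and the boolean `attributeMode` parameter are assumptions made for this example.

```javascript
function escapeString(s, attributeMode) {
  // Step 1 must run first; otherwise the "&" introduced by the later
  // replacements would itself be escaped again.
  let out = s.replace(/&/g, "&amp;");
  // Step 2: U+00A0 NO-BREAK SPACE.
  out = out.replace(/\u00A0/g, "&nbsp;");
  if (attributeMode) {
    // Step 3: only the double quote needs escaping inside attribute values.
    out = out.replace(/"/g, "&quot;");
  } else {
    // Step 4: angle brackets are escaped outside attribute mode.
    out = out.replace(/</g, "&lt;").replace(/>/g, "&gt;");
  }
  return out;
}

console.log(escapeString('1 < 2 && "x"', false)); // 1 &lt; 2 &amp;&amp; "x"
console.log(escapeString('1 < 2 && "x"', true));  // 1 < 2 &amp;&amp; &quot;x&quot;
```

The two calls show that the modes are complementary: attribute mode leaves "<" and ">" untouched, and the other mode leaves the double quote untouched.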

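13.4 Parsing HTML fragments

The following steps form the HTML fragment parsing algorithm. The algorithm takes as input an Element node, referred to as the context element, which gives the context for the parser, as well as input, a string to parse, and returns a list of zero or more nodes.

Parts marked fragment case in algorithms in the parser section are parts that only occur if the parser was created for the purposes of this algorithm. The algorithms have been annotated with such markings for informational purposes only; such markings have no normative weight. If it is possible for a condition described as a fragment case to occur even when the parser wasn't created for the purposes of handling this algorithm, then that is an error in the specification.

  1. Create a new Document node, and mark it as being an HTML document.

  2. If the node document of the context element is in quirks mode, then let the Document be in quirks mode. Otherwise, if the node document of the context element is in limited-quirks mode, then let the Document be in limited-quirks mode. Otherwise, leave the Document in no-quirks mode.

  3. Create a new HTML parser, and associate it with the just created Document node.

  4. Set the state of the HTML parser's tokenization stage as follows, switching on the context element: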

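    title
    textarea
    Switch the tokenizer to the RCDATA state.
    style
    xmp
    iframe
    noembed
    noframes
    Switch the tokenizer to the RAWTEXT state.
    script
    Switch the tokenizer to the script data state.
    noscript
    If the scripting flag is enabled, switch the tokenizer to the RAWTEXT state. Otherwise, leave the tokenizer in the data state.
    plaintext
    Switch the tokenizer to the PLAINTEXT state.
    Any other element
    Leave the tokenizer in the data state.

    For performance reasons, an implementation that does not report errors and that uses the actual state machine described in this specification directly could use the PLAINTEXT state instead of the RAWTEXT and script data states where those are mentioned in the list above. Except for rules regarding parse errors, they are equivalent, since there is no appropriate end tag token in the fragment case, yet they involve far fewer state transitions.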

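  5. Let root be a new html element with no attributes.

  6. Append the element root to the Document node created above.

  7. Set up the parser's stack of open elements so that it contains just the single element root.

  8. If the context element is a template element, push "in template" onto the stack of template insertion modes so that it is the new current template insertion mode.

  9. Create a start tag token whose name is the local name of context and whose attributes are the attributes of context.

    Let this start tag token be the start tag token of the context node, e.g. for the purposes of determining if it is an HTML integration point.

  10. Reset the parser's insertion mode appropriately.

    The parser will reference the context element as part of that algorithm.

  11. Set the parser's form element pointer to the nearest node to the context element that is a form element (going straight up the ancestor chain, and including the element itself, if it is a form element), if any. (If there is no such form element, the form element pointer keeps its initial value, null.)

  12. Place the input into the input stream for the HTML parser just created. The encoding confidence is irrelevant.

  13. Start the parser and let it run until it has consumed all the characters just inserted into the input stream.

  14. Return the child nodes of root, in tree order.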
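Two of the set-up steps lend themselves to a short sketch: choosing the tokenizer's initial state (step 4) and locating the form element pointer (step 11). The function names, the state names returned as strings, and the plain-object nodes with `localName` and `parent` fields are all assumptions made for illustration. Namespace checks are omitted for brevity.

```javascript
// Step 4: choose the tokenizer's initial state from the context element.
function initialTokenizerState(contextLocalName, scriptingEnabled) {
  switch (contextLocalName) {
    case "title":
    case "textarea":
      return "RCDATA";
    case "style":
    case "xmp":
    case "iframe":
    case "noembed":
    case "noframes":
      return "RAWTEXT";
    case "script":
      return "script data";
    case "noscript":
      return scriptingEnabled ? "RAWTEXT" : "data";
    case "plaintext":
      return "PLAINTEXT";
    default:
      return "data";
  }
}

// Step 11: walk up from the context element (inclusive) to the nearest
// form element, or return null if there is none.
function findFormPointer(context) {
  for (let node = context; node; node = node.parent) {
    if (node.localName === "form") return node;
  }
  return null;
}

// Usage with hypothetical plain-object nodes.
const form = { localName: "form", parent: null };
const div = { localName: "div", parent: form };
console.log(initialTokenizerState("textarea", true)); // RCDATA
console.log(findFormPointer(div) === form);            // true
```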