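This section only describes the rules for resources labeled with an HTML MIME type. Rules for XML resources are discussed in the section below entitled "The XML syntax".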
This section only applies to documents, authoring tools, and markup generators. In particular, it does not apply to conformance checkers; conformance checkers must use the requirements given in the next section ("parsing HTML documents").
Documents must consist of the following parts, in the given order:
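1. Optionally, a single U+FEFF BYTE ORDER MARK (BOM) character.
2. Any number of comments and ASCII whitespace.
3. A DOCTYPE.
4. Any number of comments and ASCII whitespace.
5. The document element, in the form of an html element.
6. Any number of comments and ASCII whitespace.
The various types of content mentioned above are described in the next few sections.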
In addition, there are some restrictions on how character encoding declarations are to be serialized, as discussed in the section on that topic.
ASCII whitespace before the html element, at the start of the html element and before the head element, will be dropped when the document is parsed; ASCII whitespace after the html element will be parsed as if it were at the end of the body element. Thus, ASCII whitespace around the document element does not round-trip.
It is suggested that newlines be inserted after the DOCTYPE, after any comments that are before the document element, after the html element's start tag (if it is not omitted), and after any comments that are inside the html element but before the head element.
Many strings in the HTML syntax (e.g. the names of elements and their attributes) are case-insensitive, but only for ASCII upper alphas and ASCII lower alphas. For convenience, in this section this is just referred to as "case-insensitive".
A DOCTYPE is a required preamble.
DOCTYPEs are required for legacy reasons. When omitted, browsers tend to use a different rendering mode that is incompatible with some specifications. Including the DOCTYPE in a document ensures that the browser makes a best-effort attempt at following the relevant specifications.
A DOCTYPE must consist of the following components, in this order:
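1. A string that is an ASCII case-insensitive match for the string "<!DOCTYPE".
2. One or more ASCII whitespace.
3. A string that is an ASCII case-insensitive match for the string "html".
4. Optionally, a DOCTYPE legacy string.
5. Zero or more ASCII whitespace.
6. A U+003E GREATER-THAN SIGN character (>).
In other words, <!DOCTYPE html>, case-insensitively.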
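For the purposes of HTML generators that cannot output HTML markup with the short DOCTYPE "<!DOCTYPE html>", a DOCTYPE legacy string may be inserted into the DOCTYPE (in the position defined above). This string must consist of:
1. One or more ASCII whitespace.
2. A string that is an ASCII case-insensitive match for the string "SYSTEM".
3. One or more ASCII whitespace.
4. A U+0022 QUOTATION MARK or U+0027 APOSTROPHE character (the quote mark).
5. The literal string "about:legacy-compat".
6. A matching U+0022 QUOTATION MARK or U+0027 APOSTROPHE character (i.e. the same character as in the earlier step labeled quote mark).
In other words, <!DOCTYPE html SYSTEM "about:legacy-compat"> or <!DOCTYPE html SYSTEM 'about:legacy-compat'>, case-insensitively except for the part in single or double quotes.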
The DOCTYPE legacy string should not be used unless the document is generated from a system that cannot output the shorter string.
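As a non-normative illustration, the two DOCTYPE forms described above can be checked with a single regular expression. This is only a sketch in Python: the name is_conforming_doctype is hypothetical, and the pattern covers just the DOCTYPE itself, not the comments or whitespace around it.

```python
import re

# ASCII whitespace: tab, LF, FF, CR, space.
_WS = r"[\t\n\f\r ]"

# Everything is matched ASCII case-insensitively except the quoted
# "about:legacy-compat" part, which is wrapped in (?-i:...) to stay exact.
_DOCTYPE_RE = re.compile(
    rf"<!DOCTYPE{_WS}+html"
    rf"(?:{_WS}+SYSTEM{_WS}+(?P<quote>[\"'])(?-i:about:legacy-compat)(?P=quote))?"
    rf"{_WS}*>",
    re.IGNORECASE | re.ASCII,
)


def is_conforming_doctype(text: str) -> bool:
    """Return True if text is exactly one DOCTYPE in the short or legacy form."""
    return _DOCTYPE_RE.fullmatch(text) is not None
```

The named group and its backreference ensure that the closing quote is the same character as the opening one.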
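There are six different kinds of elements: void elements, the template element, raw text elements, escapable raw text elements, foreign elements, and normal elements.
Void elements: area, base, br, col, embed, hr, img, input, link, meta, param, source, track, wbr
The template element: template
Raw text elements: script, style
Escapable raw text elements: textarea, title
Foreign elements: elements from the MathML namespace and the SVG namespace.
Normal elements: all other allowed HTML elements.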
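The element kinds listed above can be expressed as a small lookup, which is a convenient starting point for a serializer. The following is a sketch only: element_kind and its is_foreign parameter are hypothetical names, and the caller is assumed to already know whether an element is in the MathML or SVG namespace (a foreign script or title element is not a raw text element).

```python
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "param", "source", "track", "wbr",
})
RAW_TEXT_ELEMENTS = frozenset({"script", "style"})
ESCAPABLE_RAW_TEXT_ELEMENTS = frozenset({"textarea", "title"})


def element_kind(tag_name: str, is_foreign: bool = False) -> str:
    """Classify an element by the syntactic kind that governs its contents."""
    if is_foreign:
        return "foreign"
    # HTML element names only use ASCII alphanumerics, so lower() is safe here.
    name = tag_name.lower()
    if name in VOID_ELEMENTS:
        return "void"
    if name == "template":
        return "template"
    if name in RAW_TEXT_ELEMENTS:
        return "raw text"
    if name in ESCAPABLE_RAW_TEXT_ELEMENTS:
        return "escapable raw text"
    return "normal"
```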
Tags are used to delimit the start and end of elements in the markup. Raw text, escapable raw text, and normal elements have a start tag to indicate where they begin, and an end tag to indicate where they end. The start and end tags of certain normal elements can be omitted, as described below in the section on optional tags. Those that cannot be omitted must not be omitted. Void elements only have a start tag; end tags must not be specified for void elements. Foreign elements must either have a start tag and an end tag, or a start tag that is marked as self-closing, in which case they must not have an end tag.
The contents of the element must be placed between just after the start tag (which might be implied, in certain cases) and just before the end tag (which again, might be implied in certain cases). The exact allowed contents of each individual element depend on the content model of that element, as described earlier in this specification. Elements must not contain content that their content model disallows. In addition to the restrictions placed on the contents by those content models, however, the six kinds of elements have additional syntactic requirements.
Void elements can't have any contents (since there's no end tag, no content can be put between the start tag and the end tag).
The template element can have template contents, but such template contents are not children of the template element itself. Instead, they are stored in a DocumentFragment associated with a different Document (one without a browsing context) so as to avoid the template contents interfering with the main Document.
The markup for the template contents of a template element is placed just after the template element's start tag and just before the template element's end tag (as with other elements), and may consist of any text, character references, elements, and comments, but the text must not contain the character U+003C LESS-THAN SIGN (<) or an ambiguous ampersand.
Raw text elements can have text, though it has restrictions described below.
Escapable raw text elements can have text and character references, but the text must not contain an ambiguous ampersand. There are also further restrictions described below.
Foreign elements whose start tag is marked as self-closing can't have any contents (since, again, as there's no end tag, no content can be put between the start tag and the end tag). Foreign elements whose start tag is not marked as self-closing can have text, character references, CDATA sections, other elements, and comments, but the text must not contain the character U+003C LESS-THAN SIGN (<) or an ambiguous ampersand.
The HTML syntax does not support namespace declarations, even in foreign elements.
For instance, consider the following HTML fragment:
<p>
<svg>
<metadata>
<!-- this is invalid -->
<cdr:license xmlns:cdr="https://www.example.com/cdr/metadata" name="MIT"/>
</metadata>
</svg>
</p>
The innermost element, cdr:license, is actually in the SVG namespace, as the "xmlns:cdr" attribute has no effect (unlike in XML). In fact, as the comment in the fragment above says, the fragment is actually non-conforming. This is because SVG 2 does not define any elements called "cdr:license" in the SVG namespace.
Normal elements can have text, character references, other elements, and comments, but the text must not contain the character U+003C LESS-THAN SIGN (<) or an ambiguous ampersand. Some normal elements also have yet more restrictions on what content they are allowed to hold, beyond the restrictions imposed by the content model and those described in this paragraph. Those restrictions are described below.
Tags contain a tag name, giving the element's name. HTML elements all have names that only use ASCII alphanumerics. In the HTML syntax, tag names, even those for foreign elements, may be written with any mix of lower- and uppercase letters that, when converted to all-lowercase, matches the element's tag name; tag names are case-insensitive.
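Start tags must have the following format:
1. The first character of a start tag must be a U+003C LESS-THAN SIGN character (<).
2. The next few characters of a start tag must be the element's tag name.
3. If there are to be any attributes in the next step, there must first be one or more ASCII whitespace.
4. Then, the start tag may have a number of attributes, the syntax for which is described below. Attributes must be separated from each other by one or more ASCII whitespace.
5. After the attributes, or after the tag name if there are no attributes, there may be one or more ASCII whitespace. (Some attributes are required to be followed by a space. See the attributes section below.)
6. Then, if the element is one of the void elements, or if the element is a foreign element, then there may be a single U+002F SOLIDUS character (/). This character has no effect on void elements, but on foreign elements it marks the start tag as self-closing.
7. Finally, start tags must be closed by a U+003E GREATER-THAN SIGN character (>).
End tags must have the following format:
1. The first character of an end tag must be a U+003C LESS-THAN SIGN character (<).
2. The second character of an end tag must be a U+002F SOLIDUS character (/).
3. The next few characters of an end tag must be the element's tag name.
4. After the tag name, there may be one or more ASCII whitespace.
5. Finally, end tags must be closed by a U+003E GREATER-THAN SIGN character (>).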
Attributes for an element are expressed inside the element's start tag.
Attributes have a name and a value. Attribute names must consist of one or more characters other than controls, U+0020 SPACE, U+0022 ("), U+0027 ('), U+003E (>), U+002F (/), U+003D (=), and noncharacters. In the HTML syntax, attribute names, even those for foreign elements, may be written with any mix of ASCII lower and ASCII upper alphas.
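The attribute name rule above can be sketched as a character-by-character check. The helper names below are hypothetical. The definitions of "control" (U+0000 to U+001F and U+007F to U+009F) and "noncharacter" (U+FDD0 to U+FDEF plus every code point ending in FFFE or FFFF) are assumed here to follow the usual Unicode and Infra definitions, which this section does not restate.

```python
_FORBIDDEN_IN_NAME = frozenset(' "\'>/=')


def _is_control(cp: int) -> bool:
    # C0 controls, DELETE, and C1 controls.
    return cp <= 0x1F or 0x7F <= cp <= 0x9F


def _is_noncharacter(cp: int) -> bool:
    # U+FDD0..U+FDEF, plus U+xFFFE and U+xFFFF in every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_valid_attribute_name(name: str) -> bool:
    """Check a name against the syntactic rule for attribute names."""
    if not name:
        return False  # one or more characters are required
    for ch in name:
        cp = ord(ch)
        if ch in _FORBIDDEN_IN_NAME or _is_control(cp) or _is_noncharacter(cp):
            return False
    return True
```

Note that tab, line feed, form feed, and carriage return are rejected because they are controls, so no separate whitespace check is needed beyond U+0020 SPACE.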
Attribute values are a mixture of text and character references, except with the additional restriction that the text cannot contain an ambiguous ampersand.
Attributes can be specified in four different ways:
Empty attribute syntax: Just the attribute name. The value is implicitly the empty string.
In the following example, the disabled attribute is given with the empty attribute syntax:
<input disabled>
If an attribute using the empty attribute syntax is to be followed by another attribute, then there must be ASCII whitespace separating the two.
Unquoted attribute value syntax: The attribute name, followed by zero or more ASCII whitespace, followed by a single U+003D EQUALS SIGN character, followed by zero or more ASCII whitespace, followed by the attribute value, which, in addition to the requirements given above for attribute values, must not contain any literal ASCII whitespace, any U+0022 QUOTATION MARK characters ("), U+0027 APOSTROPHE characters ('), U+003D EQUALS SIGN characters (=), U+003C LESS-THAN SIGN characters (<), U+003E GREATER-THAN SIGN characters (>), or U+0060 GRAVE ACCENT characters (`), and must not be the empty string.
In the following example, the value attribute is given with the unquoted attribute value syntax:
<input value=yes>
If an attribute using the unquoted attribute syntax is to be followed by another attribute or by the optional U+002F SOLIDUS character (/) allowed in step 6 of the start tag syntax above, then there must be ASCII whitespace separating the two.
Single-quoted attribute value syntax: The attribute name, followed by zero or more ASCII whitespace, followed by a single U+003D EQUALS SIGN character, followed by zero or more ASCII whitespace, followed by a single U+0027 APOSTROPHE character ('), followed by the attribute value, which, in addition to the requirements given above for attribute values, must not contain any literal U+0027 APOSTROPHE characters ('), and finally followed by a second single U+0027 APOSTROPHE character (').
In the following example, the type attribute is given with the single-quoted attribute value syntax:
<input type='checkbox'>
If an attribute using the single-quoted attribute syntax is to be followed by another attribute, then there must be ASCII whitespace separating the two.
Double-quoted attribute value syntax: The attribute name, followed by zero or more ASCII whitespace, followed by a single U+003D EQUALS SIGN character, followed by zero or more ASCII whitespace, followed by a single U+0022 QUOTATION MARK character ("), followed by the attribute value, which, in addition to the requirements given above for attribute values, must not contain any literal U+0022 QUOTATION MARK characters ("), and finally followed by a second single U+0022 QUOTATION MARK character (").
In the following example, the name attribute is given with the double-quoted attribute value syntax:
<input name="be evil">
If an attribute using the double-quoted attribute syntax is to be followed by another attribute, then there must be ASCII whitespace separating the two.
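A markup generator has to pick one of these four syntaxes for each attribute. The sketch below shows one possible policy (prefer the shortest form that is still legal). It is not a rule from this section, and serialize_attribute is a hypothetical name. Every "&" is written as "&amp;" so that no ambiguous ampersand can appear in the output.

```python
# Characters that may not appear literally in an unquoted attribute value.
_UNQUOTED_FORBIDDEN = frozenset("\t\n\f\r \"'=<>`")


def serialize_attribute(name: str, value: str) -> str:
    """Serialize one attribute, choosing a syntax that is legal for value."""
    if value == "":
        return name  # empty attribute syntax
    escaped = value.replace("&", "&amp;")
    if not (_UNQUOTED_FORBIDDEN & set(escaped)):
        return f"{name}={escaped}"  # unquoted attribute value syntax
    if '"' not in escaped:
        return f'{name}="{escaped}"'  # double-quoted syntax
    if "'" not in escaped:
        return f"{name}='{escaped}'"  # single-quoted syntax
    # Both quote characters occur: escape the double quotes and use them.
    return name + '="' + escaped.replace('"', "&quot;") + '"'
```

The caller is still responsible for separating the result from a following attribute, or from a trailing U+002F SOLIDUS character, with ASCII whitespace.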
There must never be two or more attributes on the same start tag whose names are an ASCII case-insensitive match for each other.
When a foreign element has one of the namespaced attributes given by the local name and namespace of the first and second cells of a row from the following table, it must be written using the name given by the third cell from the same row.
Local name | Namespace | Attribute name |
---|---|---|
actuate | XLink namespace | xlink:actuate |
arcrole | XLink namespace | xlink:arcrole |
href | XLink namespace | xlink:href |
role | XLink namespace | xlink:role |
show | XLink namespace | xlink:show |
title | XLink namespace | xlink:title |
type | XLink namespace | xlink:type |
lang | XML namespace | xml:lang |
space | XML namespace | xml:space |
xmlns | XMLNS namespace | xmlns |
xlink | XMLNS namespace | xmlns:xlink |
No other namespaced attribute can be expressed in the HTML syntax.
Whether the attributes in the table above are conforming or not is defined by other specifications (e.g. SVG 2 and MathML); this section only describes the syntax rules if the attributes are serialized using the HTML syntax.
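A serializer can treat the table above as a lookup from (local name, namespace) to the name to write. The following is a sketch, with html_syntax_attribute_name as a hypothetical helper. The namespace URIs are taken from the XLink, XML, and XMLNS namespace definitions rather than from this section.

```python
from typing import Optional

XLINK_NS = "http://www.w3.org/1999/xlink"
XML_NS = "http://www.w3.org/XML/1998/namespace"
XMLNS_NS = "http://www.w3.org/2000/xmlns/"

_SERIALIZED_NAMES = {
    ("actuate", XLINK_NS): "xlink:actuate",
    ("arcrole", XLINK_NS): "xlink:arcrole",
    ("href", XLINK_NS): "xlink:href",
    ("role", XLINK_NS): "xlink:role",
    ("show", XLINK_NS): "xlink:show",
    ("title", XLINK_NS): "xlink:title",
    ("type", XLINK_NS): "xlink:type",
    ("lang", XML_NS): "xml:lang",
    ("space", XML_NS): "xml:space",
    ("xmlns", XMLNS_NS): "xmlns",
    ("xlink", XMLNS_NS): "xmlns:xlink",
}


def html_syntax_attribute_name(local_name: str, namespace: Optional[str]) -> Optional[str]:
    """Return the name to write in the HTML syntax, or None if inexpressible."""
    if namespace is None:
        return local_name  # attribute in no namespace: written as-is
    return _SERIALIZED_NAMES.get((local_name, namespace))
```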
Certain tags can be omitted.
Omitting an element's start tag in the situations described below does not mean the element is not present; it is implied, but it is still there. For example, an HTML document always has a root html element, even if the string <html> doesn't appear anywhere in the markup.
An html element's start tag may be omitted if the first thing inside the html element is not a comment.
For example, in the following case it's ok to remove the "<html>" tag:
<!DOCTYPE HTML>
<html>
<head>
<title>Hello</title>
</head>
<body>
<p>Welcome to this example.</p>
</body>
</html>
Doing so would make the document look like this:
<!DOCTYPE HTML>
<head>
<title>Hello</title>
</head>
<body>
<p>Welcome to this example.</p>
</body>
</html>
This has the exact same DOM. In particular, note that whitespace around the document element is ignored by the parser. The following example would also have the exact same DOM:
<!DOCTYPE HTML><head>
<title>Hello</title>
</head>
<body>
<p>Welcome to this example.</p>
</body>
</html>
However, in the following example, removing the start tag moves the comment to before the html element:
<!DOCTYPE HTML>
<html>
<!-- where is this comment in the DOM? -->
<head>
<title>Hello</title>
</head>
<body>
<p>Welcome to this example.</p>
</body>
</html>
With the tag removed, the document actually turns into the same as this:
<!DOCTYPE HTML>
<!-- where is this comment in the DOM? -->
<html>
<head>
<title>Hello</title>
</head>
<body>
<p>Welcome to this example.</p>
</body>
</html>
This is why the tag can only be removed if it is not followed by a comment: removing the tag when there is a comment there changes the document's resulting parse tree. Of course, if the position of the comment does not matter, then the tag can be omitted, as if the comment had been moved to before the start tag in the first place.
An html element's end tag may be omitted if the html element is not immediately followed by a comment.
A head element's start tag may be omitted if the element is empty, or if the first thing inside the head element is an element.
A head element's end tag may be omitted if the head element is not immediately followed by ASCII whitespace or a comment.
A body element's start tag may be omitted if the element is empty, or if the first thing inside the body element is not ASCII whitespace or a comment, except if the first thing inside the body element is a meta, link, script, style, or template element.
A body element's end tag may be omitted if the body element is not immediately followed by a comment.
Note that in the example above, the head element start and end tags, and the body element start tag, can't be omitted, because they are surrounded by whitespace:
<!DOCTYPE HTML>
<html>
<head>
<title>Hello</title>
</head>
<body>
<p>Welcome to this example.</p>
</body>
</html>
(The body and html element end tags could be omitted without trouble; any spaces after those get parsed into the body element anyway.)
Usually, however, whitespace isn't an issue. If we first remove the whitespace we don't care about:
<!DOCTYPE HTML><html><head><title>Hello</title></head><body><p>Welcome to this example.</p></body></html>
Then we can omit a number of tags without affecting the DOM:
<!DOCTYPE HTML><title>Hello</title><p>Welcome to this example.</p>
At that point, we can also add some whitespace back:
<!DOCTYPE HTML>
<title>Hello</title>
<p>Welcome to this example.</p>
This would be equivalent to this document, with the omitted tags shown in their parser-implied positions; the only whitespace text node that results from this is the newline at the end of the head element:
<!DOCTYPE HTML>
<html><head><title>Hello</title>
</head><body><p>Welcome to this example.</p></body></html>
An li element's end tag may be omitted if the li element is immediately followed by another li element or if there is no more content in the parent element.
A dt element's end tag may be omitted if the dt element is immediately followed by another dt element or a dd element.
A dd element's end tag may be omitted if the dd element is immediately followed by another dd element or a dt element, or if there is no more content in the parent element.
A p element's end tag may be omitted if the p element is immediately followed by an address, article, aside, blockquote, details, div, dl, fieldset, figcaption, figure, footer, form, h1, h2, h3, h4, h5, h6, header, hgroup, hr, main, menu, nav, ol, p, pre, section, table, or ul element, or if there is no more content in the parent element and the parent element is an HTML element that is not an a, audio, del, ins, map, noscript, or video element, or an autonomous custom element.
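The rule for the p element is the most involved of these, so a sketch of it as a predicate may help. The function name and its parameters are hypothetical: following is the tag name of the element that comes immediately after the p element, "#text" (or any other marker) for non-element content, or None when there is no more content in the parent. The parent is assumed to be an HTML element.

```python
from typing import Optional

# Elements whose start tag, immediately after a p element, implies its end tag.
_P_FOLLOWERS = frozenset({
    "address", "article", "aside", "blockquote", "details", "div", "dl",
    "fieldset", "figcaption", "figure", "footer", "form", "h1", "h2", "h3",
    "h4", "h5", "h6", "header", "hgroup", "hr", "main", "menu", "nav", "ol",
    "p", "pre", "section", "table", "ul",
})

# Parents inside which the end tag must be kept even at the end of the content.
_PARENTS_REQUIRING_END_TAG = frozenset({
    "a", "audio", "del", "ins", "map", "noscript", "video",
})


def may_omit_p_end_tag(
    following: Optional[str],
    parent: str,
    parent_is_autonomous_custom_element: bool = False,
) -> bool:
    """Decide whether </p> may be omitted, per the rule for the p element."""
    if following is not None:
        return following in _P_FOLLOWERS
    return (
        parent not in _PARENTS_REQUIRING_END_TAG
        and not parent_is_autonomous_custom_element
    )
```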
We can thus simplify the earlier example further:
<!DOCTYPE HTML><title>Hello</title><p>Welcome to this example.
An rt element's end tag may be omitted if the rt element is immediately followed by an rt or rp element, or if there is no more content in the parent element.
An rp element's end tag may be omitted if the rp element is immediately followed by an rt or rp element, or if there is no more content in the parent element.
An optgroup element's end tag may be omitted if the optgroup element is immediately followed by another optgroup element, or if there is no more content in the parent element.
An option element's end tag may be omitted if the option element is immediately followed by another option element, or if it is immediately followed by an optgroup element, or if there is no more content in the parent element.
A colgroup element's start tag may be omitted if the first thing inside the colgroup element is a col element, and if the element is not immediately preceded by another colgroup element whose end tag has been omitted. (It can't be omitted if the element is empty.)
A colgroup element's end tag may be omitted if the colgroup element is not immediately followed by ASCII whitespace or a comment.
A caption element's end tag may be omitted if the caption element is not immediately followed by ASCII whitespace or a comment.
A thead element's end tag may be omitted if the thead element is immediately followed by a tbody or tfoot element.
A tbody element's start tag may be omitted if the first thing inside the tbody element is a tr element, and if the element is not immediately preceded by a tbody, thead, or tfoot element whose end tag has been omitted. (It can't be omitted if the element is empty.)
A tbody element's end tag may be omitted if the tbody element is immediately followed by a tbody or tfoot element, or if there is no more content in the parent element.
A tfoot element's end tag may be omitted if there is no more content in the parent element.
A tr element's end tag may be omitted if the tr element is immediately followed by another tr element, or if there is no more content in the parent element.
A td element's end tag may be omitted if the td element is immediately followed by a td or th element, or if there is no more content in the parent element.
A th element's end tag may be omitted if the th element is immediately followed by a td or th element, or if there is no more content in the parent element.
The ability to omit all these table-related tags makes table markup much terser.
Take this example:
<table>
<caption>37547 TEE Electric Powered Rail Car Train Functions (Abbreviated)</caption>
<colgroup><col><col><col></colgroup>
<thead>
<tr>
<th>Function</th>
<th>Control Unit</th>
<th>Central Station</th>
</tr>
</thead>
<tbody>
<tr>
<td>Headlights</td>
<td>✔</td>
<td>✔</td>
</tr>
<tr>
<td>Interior Lights</td>
<td>✔</td>
<td>✔</td>
</tr>
<tr>
<td>Electric locomotive operating sounds</td>
<td>✔</td>
<td>✔</td>
</tr>
<tr>
<td>Engineer's cab lighting</td>
<td></td>
<td>✔</td>
</tr>
<tr>
<td>Station Announcements - Swiss</td>
<td></td>
<td>✔</td>
</tr>
</tbody>
</table>
The exact same table, modulo some whitespace differences, could be marked up as follows:
<table>
<caption>37547 TEE Electric Powered Rail Car Train Functions (Abbreviated)
<colgroup><col><col><col>
<thead>
<tr>
<th>Function
<th>Control Unit
<th>Central Station
<tbody>
<tr>
<td>Headlights
<td>✔
<td>✔
<tr>
<td>Interior Lights
<td>✔
<td>✔
<tr>
<td>Electric locomotive operating sounds
<td>✔
<td>✔
<tr>
<td>Engineer's cab lighting
<td>
<td>✔
<tr>
<td>Station Announcements - Swiss
<td>
<td>✔
</table>
Since the cells take up much less room this way, this can be made even terser by having each row on one line:
<table>
<caption>37547 TEE Electric Powered Rail Car Train Functions (Abbreviated)
<colgroup><col><col><col>
<thead>
<tr> <th>Function <th>Control Unit <th>Central Station
<tbody>
<tr> <td>Headlights <td>✔ <td>✔
<tr> <td>Interior Lights <td>✔ <td>✔
<tr> <td>Electric locomotive operating sounds <td>✔ <td>✔
<tr> <td>Engineer's cab lighting <td> <td>✔
<tr> <td>Station Announcements - Swiss <td> <td>✔
</table>
The only difference between these tables, at the DOM level, is the precise position of the (in any case semantically neutral) whitespace.
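A generator that wants the terse one-row-per-line form can emit it directly. The sketch below uses a hypothetical terse_table function and omits the colgroup for brevity. It relies on the end tags of caption, thead, tbody, tr, th, and td all being omissible in the positions where they would fall. Cell text is escaped with Python's html.escape so that "<" and "&" cannot appear literally.

```python
import html
from typing import Iterable, List


def terse_table(caption: str, header: Iterable[str], rows: Iterable[Iterable[str]]) -> str:
    """Emit a table using only start tags wherever end tags may be omitted."""

    def row(cell_tag: str, values: Iterable[str]) -> str:
        # Each cell's end tag is implied by the next cell or by the end of the row.
        cells = " ".join(f"<{cell_tag}>{html.escape(v, quote=False)}" for v in values)
        return "<tr> " + cells

    lines: List[str] = [
        "<table>",
        "<caption>" + html.escape(caption, quote=False),
        "<thead>",
        row("th", header),
        "<tbody>",
    ]
    lines.extend(row("td", r) for r in rows)
    lines.append("</table>")
    return "\n".join(lines)
```

As in the hand-written examples, the whitespace between cells ends up inside the preceding cell, which changes the DOM only in the position of whitespace.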
However, a start tag must never be omitted if it has any attributes.
Returning to the earlier example with all the whitespace removed and then all the optional tags removed:
<!DOCTYPE HTML><title>Hello</title><p>Welcome to this example.
If the body element in this example had to have a class attribute and the html element had to have a lang attribute, the markup would have to become:
<!DOCTYPE HTML><html lang="en"><title>Hello</title><body class="demo"><p>Welcome to this example.
This section assumes that the document is conforming, in particular, that there are no content model violations. Omitting tags in the fashion described in this section in a document that does not conform to the content models described in this specification is likely to result in unexpected DOM differences (this is, in part, what the content models are designed to avoid).
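For historical reasons, certain elements have extra restrictions beyond even the restrictions given by their content model.
A table element must not contain tr elements, even though these elements are technically allowed inside table elements according to the content models described in this specification. (If a tr element is put inside a table in the markup, it will in fact imply a tbody start tag before it.)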
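A single newline may be placed immediately after the start tag of pre and textarea elements. This does not affect the processing of the element. The otherwise optional newline must be included if the element's contents themselves start with a newline (because otherwise the leading newline in the contents would be treated like the optional newline, and ignored).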
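In a serializer, this rule amounts to doubling a leading newline. A minimal sketch follows, assuming the text to serialize uses U+000A LINE FEED for newlines and contains only text (no child elements); serialize_preformatted is a hypothetical name.

```python
import html


def serialize_preformatted(tag_name: str, text: str) -> str:
    """Serialize a pre or textarea element whose only content is text."""
    if tag_name not in ("pre", "textarea"):
        raise ValueError("only pre and textarea drop a leading newline")
    body = html.escape(text, quote=False)
    if text.startswith("\n"):
        # Without this extra newline the parser would discard the first
        # newline of the content as the optional one.
        body = "\n" + body
    return f"<{tag_name}>{body}</{tag_name}>"
```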
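The text in raw text and escapable raw text elements must not contain any occurrences of the string "</" (U+003C LESS-THAN SIGN, U+002F SOLIDUS) followed by characters that case-insensitively match the tag name of the element followed by one of U+0009 CHARACTER TABULATION (tab), U+000A LINE FEED (LF), U+000C FORM FEED (FF), U+000D CARRIAGE RETURN (CR), U+0020 SPACE, U+003E GREATER-THAN SIGN (>), or U+002F SOLIDUS (/).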
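This restriction can be checked mechanically before text is written into, say, a script or textarea element. The following is an illustrative sketch; raw_text_is_safe is a hypothetical name, and the caller passes the element's own tag name.

```python
import re


def raw_text_is_safe(tag_name: str, text: str) -> bool:
    """Return True if text has no '</tagname' followed by a terminating character."""
    pattern = re.compile(
        r"</" + re.escape(tag_name) + r"[\t\n\f\r />]",
        re.IGNORECASE | re.ASCII,  # ASCII case-insensitive match on the name
    )
    return pattern.search(text) is None
```

For example, a script containing the literal text "</script>" inside a string would be rejected, while the same script with that string split into two concatenated pieces would pass.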
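Text is allowed inside elements, attribute values, and comments. Extra constraints are placed on what is and what is not allowed in text based on where the text is to be put, as described in the other sections.
Newlines in HTML may be represented either as U+000D CARRIAGE RETURN (CR) characters, U+000A LINE FEED (LF) characters, or pairs of U+000D CARRIAGE RETURN (CR), U+000A LINE FEED (LF) characters in that order.
Where character references are allowed, a character reference of a U+000A LINE FEED (LF) character (but not a U+000D CARRIAGE RETURN (CR) character) also represents a newline.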
In certain cases described in other sections, text may be mixed with character references. These can be used to escape characters that couldn't otherwise legally be included in text.
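Character references must start with a U+0026 AMPERSAND character (&). Following this, there are three possible kinds of character references:
Named character references: The ampersand must be followed by one of the names given in the named character references section, using the same case. The name must be one that is terminated by a U+003B SEMICOLON character (;).
Decimal numeric character reference: The ampersand must be followed by a U+0023 NUMBER SIGN character (#), followed by one or more ASCII digits, representing a base-ten integer that corresponds to a code point that is allowed according to the definition below. The digits must then be followed by a U+003B SEMICOLON character (;).
Hexadecimal numeric character reference: The ampersand must be followed by a U+0023 NUMBER SIGN character (#), which must be followed by either a U+0078 LATIN SMALL LETTER X character (x) or a U+0058 LATIN CAPITAL LETTER X character (X), which must then be followed by one or more ASCII hex digits, representing a hexadecimal integer that corresponds to a code point that is allowed according to the definition below. The digits must then be followed by a U+003B SEMICOLON character (;).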
The numeric character reference forms described above are allowed to reference any code point excluding U+000D CR, noncharacters, and controls other than ASCII whitespace.
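The code point restriction just stated can be sketched as a predicate. The name is_allowed_numeric_reference is hypothetical, and the definitions of "control" and "noncharacter" are assumed to be the usual ones (U+0000 to U+001F and U+007F to U+009F; U+FDD0 to U+FDEF and every code point ending in FFFE or FFFF). The sketch implements only the sentence above and applies no further checks.

```python
# ASCII whitespace other than CR, which is excluded explicitly.
_ALLOWED_CONTROLS = frozenset({0x09, 0x0A, 0x0C})


def is_allowed_numeric_reference(cp: int) -> bool:
    """Return True if a numeric character reference may reference cp."""
    if not 0 <= cp <= 0x10FFFF:
        return False  # not a code point at all
    if cp == 0x0D:
        return False  # U+000D CR
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return False  # noncharacters
    is_control = cp <= 0x1F or 0x7F <= cp <= 0x9F
    if is_control and cp not in _ALLOWED_CONTROLS:
        return False  # controls other than ASCII whitespace
    return True
```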
An ambiguous ampersand is a U+0026 AMPERSAND character (&) that is followed by one or more ASCII alphanumerics, followed by a U+003B SEMICOLON character (;), where these characters do not match any of the names given in the named character references section.
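Ambiguous ampersands can be located with a short scan. The sketch below assumes that Python's html.entities.html5 table is an adequate stand-in for the named character references section; its keys include the terminating semicolon. The function name find_ambiguous_ampersands is hypothetical.

```python
import re
from html.entities import html5
from typing import List

# "&", one or more ASCII alphanumerics, then ";".
_CANDIDATE = re.compile(r"&([0-9A-Za-z]+);")


def find_ambiguous_ampersands(text: str) -> List[int]:
    """Return the offsets of ampersands that look like, but are not, named references."""
    return [
        m.start()
        for m in _CANDIDATE.finditer(text)
        if m.group(1) + ";" not in html5
    ]
```

An ampersand that is not followed by alphanumerics and a semicolon, as in "AT&T", is not reported, because it does not fit the definition.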
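CDATA sections must consist of the following components, in this order:
1. The string "<![CDATA[".
2. Optionally, text, with the additional restriction that the text must not contain the string "]]>".
3. The string "]]>".
CDATA sections can only be used in foreign content (MathML or SVG). In this example, a CDATA section is used to escape the contents of a MathML ms element:
<p>You can add a string to a number, but this stringifies the number:</p>
<math>
<ms><![CDATA[x<y]]></ms>
<mo>+</mo>
<mn>3</mn>
<mo>=</mo>
<ms><![CDATA[x<y3]]></ms>
</math>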
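The three components translate directly into a small builder. This is a sketch with a hypothetical name, make_cdata_section. It rejects text containing the terminator instead of trying to split it across several sections, and it leaves it to the caller to use the result only in foreign content.

```python
def make_cdata_section(text: str) -> str:
    """Wrap text in a CDATA section, refusing text that contains the terminator."""
    if "]]>" in text:
        raise ValueError('CDATA text must not contain the string "]]>"')
    return "<![CDATA[" + text + "]]>"
```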
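Comments must have the following format:
1. The string "<!--".
2. Optionally, text, with the additional restriction that the text must not start with the string ">", nor start with the string "->", nor contain the strings "<!--", "-->", or "--!>", nor end with the string "<!-".
3. The string "-->".
The text is allowed to end with the string "<!", as in <!--My favorite operators are > and <!-->.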
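The restrictions on comment text are simple string tests, sketched below. The name is_valid_comment_text is hypothetical; the function checks only the text between "<!--" and "-->".

```python
def is_valid_comment_text(text: str) -> bool:
    """Check comment text against the start, containment, and end restrictions."""
    if text.startswith(">") or text.startswith("->"):
        return False
    if any(s in text for s in ("<!--", "-->", "--!>")):
        return False
    if text.endswith("<!-"):
        return False
    return True
```

The text of the comment <!--My favorite operators are > and <!--> passes this check, since ending with "<!" is permitted.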