1. Three specifications related to urlencode
- RFC1738: Uniform Resource Locators (URL)
- RFC2396: Uniform Resource Identifiers (URI): Generic Syntax
- W3C HTML4.01: application/x-www-form-urlencoded
This is the default content type. Forms submitted with this content type must be encoded as follows:
1. Control names and values are escaped. Space characters are replaced by '+', and then reserved characters are escaped as described in [RFC1738], section 2.2: Non-alphanumeric characters are replaced by '%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e., '%0D%0A').
2. The control names/values are listed in the order they appear in the document. The name is separated from the value by '=' and name/value pairs are separated from each other by '&'.
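The two rules quoted above can be sketched in Java roughly as follows. This is only an illustration, assuming Java 10 or later (for the `URLEncoder.encode(String, Charset)` overload); the class name `FormEncodingSketch`, the method `encodeForm`, and the sample control names are hypothetical and do not come from the specification.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper for illustration only; not part of any library.
class FormEncodingSketch {

    // Escapes each name and value, joins them with '=', and joins the pairs
    // with '&' in the iteration order of the map.
    static String encodeForm(Map<String, String> controls) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> entry : controls.entrySet()) {
            String name = URLEncoder.encode(entry.getKey(), StandardCharsets.UTF_8);
            String value = URLEncoder.encode(entry.getValue(), StandardCharsets.UTF_8);
            body.add(name + "=" + value);
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, standing in for document order.
        Map<String, String> controls = new LinkedHashMap<>();
        controls.put("full name", "Ada Lovelace");
        controls.put("note", "a&b=c");
        // Prints: full+name=Ada+Lovelace&note=a%26b%3Dc
        System.out.println(encodeForm(controls));
    }
}
```

Spaces become '+', while the '&' and '=' inside a value are percent-escaped so they cannot be confused with the separators.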
2. About W3C and RFC
The RFC series contains technical and organizational documents about the Internet, including the specifications and policy documents produced by four streams: the Internet Engineering Task Force (IETF), the Internet Research Task Force (IRTF), the Internet Architecture Board (IAB), and Independent Submissions.
The World Wide Web Consortium (W3C) is an international community that develops open standards to ensure the long-term growth of the Web.
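In other words, RFCs cover every aspect of networking, whereas the W3C focuses on Web technology standards such as HTML and other hypertext technologies (HTTP itself is specified in RFCs).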
3. Java implementation
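In Java, URLEncoder encodes a space as a plus sign (+), while URLDecoder decodes both + and %20 back to a space. As the Javadoc excerpts below state, URLEncoder and URLDecoder encode and decode according to the application/x-www-form-urlencoded format: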
Translates a string into application/x-www-form-urlencoded format using a specific encoding scheme. This method uses the supplied encoding scheme to obtain the bytes for unsafe characters.
Note: The World Wide Web Consortium Recommendation states that UTF-8 should be used. Not doing so may introduce incompatibilities.
Decodes a application/x-www-form-urlencoded string using a specific encoding scheme. The supplied encoding is used to determine what characters are represented by any consecutive sequences of the form "%xy".
Note: The World Wide Web Consortium Recommendation states that UTF-8 should be used. Not doing so may introduce incompatibilities.
In other words, URL encoding and decoding in Java follow the W3C standard. For details, see: Non-ASCII characters in URI attribute values
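The space handling described above can be checked with a small sketch like the one below. It assumes Java 10 or later for the `Charset` overloads of `URLEncoder.encode` and `URLDecoder.decode`; the class name `SpaceHandlingSketch` and its wrapper methods are hypothetical.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only.
class SpaceHandlingSketch {

    // Thin wrapper around URLEncoder using UTF-8.
    static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    // Thin wrapper around URLDecoder using UTF-8.
    static String decode(String s) {
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(encode("a b"));    // a+b
        System.out.println(decode("a+b"));    // a b
        System.out.println(decode("a%20b"));  // a b
        System.out.println(encode("a+b"));    // a%2Bb (a literal plus is escaped)
    }
}
```

Because a literal plus sign is escaped as %2B on encoding, decoding + as a space does not lose information for strings that were produced by URLEncoder.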
Interestingly, though, the URI class in Java escapes characters according to RFC2396 instead. In the example below, every space is escaped as %20:
URI uri = new URI("http", "//www.someurl.com/has spaces in url?param a=value b", null);
System.out.println(uri);
// Output:
http://www.someurl.com/has%20spaces%20in%20url?param%20a=value%20b
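To put the two behaviours side by side, here is a minimal sketch that builds the same URL with the five-argument `URI` constructor (scheme, authority, path, query, fragment) and encodes a single parameter name with `URLEncoder`. It assumes Java 10 or later; the class name `UriVersusFormSketch` and its methods are hypothetical, and the host is the one from the example above.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only.
class UriVersusFormSketch {

    // Lets java.net.URI quote illegal characters in the path and query.
    static String viaUri(String path, String query) throws URISyntaxException {
        return new URI("http", "www.someurl.com", path, query, null).toString();
    }

    // Form-style encoding of a single name or value.
    static String viaUrlEncoder(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws URISyntaxException {
        // http://www.someurl.com/has%20spaces%20in%20url?param%20a=value%20b
        System.out.println(viaUri("/has spaces in url", "param a=value b"));
        // param+a
        System.out.println(viaUrlEncoder("param a"));
    }
}
```

Note that the `URI` constructor leaves '=' and '&' in the query untouched because they are legal there, so it is not a substitute for escaping individual parameter values that may themselves contain those characters.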