6. Go strings

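A string in Go is an immutable sequence of bytes.

Unlike languages such as Python or Java, Go does not represent strings internally as Unicode. As a result, there is no conversion from bytes to an internal representation when a string is read from a file or a network connection, and no conversion to a code page when a string is written to a file.

Go strings do not assume any particular code page. They are just bytes.

Go source files are always UTF-8, so string literals defined in source code are UTF-8 strings as well.

In addition, standard library functions that do things like convert characters to upper or lower case assume that the raw bytes represent a UTF-8 encoded Unicode string and perform the conversion using Unicode rules.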
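
To make the "just bytes" point concrete, here is a minimal sketch that contrasts byte length with the number of UTF-8 encoded code points. The helper name byteAndRuneCount is made up for this illustration.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// byteAndRuneCount (hypothetical helper) returns the number of bytes and
// the number of UTF-8 encoded code points (runes) in s.
func byteAndRuneCount(s string) (int, int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	s := "h\u00e9llo" // "héllo": the letter é takes 2 bytes in UTF-8
	nBytes, nRunes := byteAndRuneCount(s)
	fmt.Printf("bytes: %d, runes: %d\n", nBytes, nRunes)

	// range decodes UTF-8 and yields the byte offset of each rune
	for i, r := range s {
		fmt.Printf("offset %d: %c\n", i, r)
	}

	// case conversion follows Unicode rules, not only ASCII
	fmt.Println(strings.ToUpper(s))
}
```

len counts bytes, so indexing and slicing a string also work on byte offsets, not on characters.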

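Note the difference between string literals delimited by double quotes (interpreted literals), such as "bar", and those delimited by backticks (raw literals), such as `foo`. The text between double quotes forms the value of the literal, with backslash escapes interpreted, for example "\n" for a newline. The text between backticks is taken as-is, uninterpreted (and implicitly UTF-8 encoded); in particular, backslashes have no special meaning and the string may contain newlines.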
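
The difference between the two kinds of literal can be sketched by comparing byte lengths. The function name literalLengths is an assumption for this example.

```go
package main

import "fmt"

// literalLengths (hypothetical helper) returns the byte lengths of an
// interpreted and a raw literal that look the same in source code.
func literalLengths() (int, int) {
	interpreted := "a\nb" // \n becomes a single newline byte: 3 bytes
	raw := `a\nb`         // backslash and n are kept as written: 4 bytes
	return len(interpreted), len(raw)
}

func main() {
	i, r := literalLengths()
	fmt.Printf("interpreted: %d bytes, raw: %d bytes\n", i, r)
}
```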

Basic string usage:

var s string // empty string ""
s1 := "string\nliteral\nwith\tescape characters"
s2 := `raw string literal
which doesn't recognize escape characters like \n
`

// strings can be concatenated with +
fmt.Printf("sum of string: %s\n", s+s1+s2)

// strings can be compared with ==
if s1 == s2 {
    fmt.Printf("s1 is equal to s2\n")
} else {
    fmt.Printf("s1 is not equal to s2\n")
}

fmt.Printf("substring of s1: %s\n", s1[3:5])
fmt.Printf("byte (character) at position 3 in s1: %d\n", s1[3])

// C-style string formatting
s = fmt.Sprintf("%d + %f = %s", 1, float64(3), "4")
fmt.Printf("s: %s\n", s)

sum of string: string
literal
with escape charactersraw string literal
which doesn't recognize escape characters like \n

s1 is not equal to s2
substring of s1: in
byte (character) at position 3 in s1: 105
s: 1 + 3.000000 = 4

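Important standard library packages for working with strings:

strings implements string searching, splitting, and case conversion
bytes has the same functionality as strings, but operates on []byte (byte slices)
strconv converts strings to and from integers and floating-point numbers
unicode/utf8 encodes and decodes UTF-8
regexp implements regular expressions
text/scanner scans and tokenizes UTF-8 encoded text
text/template generates larger strings from templates
html/template has all the functionality of text/template and additionally understands the structure of HTML, which protects against code injection attacks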

Find a substring in a string

Use strings.Index to find the position of a substring:

s := "where hello is?"
toFind := "hello"
idx := strings.Index(s, toFind)
fmt.Printf("'%s' is in s starting at position %d\n", toFind, idx)

// returns -1 when the substring is not found
idx = strings.Index(s, "not present")
fmt.Printf("Index of non-existent substring is: %d\n", idx)

'hello' is in s starting at position 6
Index of non-existent substring is: -1

Find a substring starting from the end

Use strings.LastIndex to search from the end:

s := "hello and second hello"
toFind := "hello"
idx := strings.LastIndex(s, toFind)
fmt.Printf("when searching from end, '%s' is in s at position %d\n", toFind, idx)

when searching from end, 'hello' is in s at position 17

Find all occurrences of a substring

The examples above find only a single occurrence. To find all occurrences of a substring:

s := "first is, second is, third is"
toFind := "is"
currStart := 0
for {
    idx := strings.Index(s, toFind)
    if idx == -1 {
        break
    }
    fmt.Printf("found '%s' at position %d\n", toFind, currStart+idx)
    currStart += idx + len(toFind)
    s = s[idx+len(toFind):]
}

found 'is' at position 6
found 'is' at position 17
found 'is' at position 27

Check whether a string contains another string

Use strings.Contains to check whether a string contains another string:

s := "is hello there?"
toFind := "hello"
if strings.Contains(s, toFind) {
    fmt.Printf("'%s' contains '%s'\n", s, toFind)
} else {
    fmt.Printf("'%s' doesn't contain '%s'\n", s, toFind)
}

'is hello there?' contains 'hello'

Check whether a string starts with another string

Use strings.HasPrefix:

s := "this is string"
toFind := "this"
if strings.HasPrefix(s, toFind) {
    fmt.Printf("'%s' starts with '%s'\n", s, toFind)
} else {
    fmt.Printf("'%s' doesn't start with '%s'\n", s, toFind)
}

'this is string' starts with 'this'

Check whether a string ends with another string

Use strings.HasSuffix:

s := "this is string"
toFind := "string"
if strings.HasSuffix(s, toFind) {
    fmt.Printf("'%s' ends with '%s'\n", s, toFind)
} else {
    fmt.Printf("'%s' doesn't end with '%s'\n", s, toFind)
}

'this is string' ends with 'string'

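Comparing strings

Strings can be compared with ==, > and <. The comparison is performed on the raw bytes.

This works as you would expect for ASCII (that is, English) text, but it may not be what you want when strings mix upper and lower case (for example, "abba" is > "Zorro") or use letters from non-English alphabets.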
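
The byte-order pitfall can be sketched as follows. The helper sortIgnoringCase is a hypothetical name, and comparing lower-cased copies is only a simple approximation, not full language-aware collation.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// sortIgnoringCase (hypothetical helper) returns a sorted copy of words,
// comparing lower-cased versions instead of raw bytes.
func sortIgnoringCase(words []string) []string {
	out := append([]string(nil), words...)
	sort.Slice(out, func(i, j int) bool {
		return strings.ToLower(out[i]) < strings.ToLower(out[j])
	})
	return out
}

func main() {
	// 'Z' is byte 90 and 'a' is byte 97, so "abba" > "Zorro" is true
	fmt.Println("abba" > "Zorro")

	words := []string{"abba", "Zorro", "beta"}
	fmt.Println(sortIgnoringCase(words))
}
```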

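Comparing strings with strings.Compare

You can also compare strings with strings.Compare, which returns -1, 0 or 1 and orders strings the same way as ==, > and <.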
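
A small sketch of how the three possible results of strings.Compare can be used. The function name describe is made up for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// describe (hypothetical helper) reports how a compares to b.
func describe(a, b string) string {
	switch strings.Compare(a, b) {
	case -1:
		return "less"
	case 1:
		return "greater"
	default:
		return "equal"
	}
}

func main() {
	fmt.Println(describe("string one", "string two"))
	fmt.Println(describe("b", "a"))
	fmt.Println(describe("same", "same"))
}
```

In everyday code the built-in operators are usually clearer; the three-way result is mainly handy when a comparison function is expected.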

Case-insensitive comparison

Use strings.EqualFold to compare strings while ignoring case:

s1 := "gone"
s2 := "GoNe"
if strings.EqualFold(s1, s2) {
    fmt.Printf("'%s' is equal '%s' when ignoring case\n", s1, s2)
} else {
    fmt.Printf("'%s' is not equal '%s' when ignoring case\n", s1, s2)
}

'gone' is equal 'GoNe' when ignoring case

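The exact rule is: both strings are treated as UTF-8 encoded and their characters are compared using Unicode case folding.

For contrast, here is a comparison using the ==, > and < operators, which work on raw bytes: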

s1 := "string one"
s2 := "string two"

if s1 == s2 {
    fmt.Printf("s1 is equal to s2\n")
} else {
    fmt.Printf("s1 is not equal to s2\n")
}

if s1 == s1 {
    fmt.Printf("s1 is equal to s1\n")
} else {
    fmt.Printf("inconceivable! s1 is not equal to itself\n")
}

if s1 > s2 {
    fmt.Printf("s1 is > than s2\n")
} else {
    fmt.Printf("s1 is not > than s2\n")
}

if s1 < s2 {
    fmt.Printf("s1 is < than s2\n")
} else {
    fmt.Printf("s1 is not < than s2\n")
}

s1 is not equal to s2
s1 is equal to s1
s1 is not > than s2
s1 is < than s2

Case conversion

s := "Mixed Case"
fmt.Printf("ToLower(s): '%s'\n", strings.ToLower(s))
fmt.Printf("ToUpper(s): '%s'\n", strings.ToUpper(s))
fmt.Printf("ToTitle(s): '%s'\n", strings.ToTitle(s))

ToLower(s): 'mixed case'
ToUpper(s): 'MIXED CASE'
ToTitle(s): 'MIXED CASE'

Converting a string to int, float32, float64

s := "234"
i, err := strconv.Atoi(s)
if err != nil {
    fmt.Printf("strconv.Atoi() failed with: '%s'\n", err)
}
fmt.Printf("strconv.Atoi('%s'): %d\n", s, i)

i, err = strconv.Atoi("not a number")
if err != nil {
    fmt.Printf("strconv.Atoi('not a number') failed with: '%s'\n", err)
}

i64, err := strconv.ParseInt(s, 10, 64)
if err != nil {
    fmt.Printf("strconv.ParseInt() failed with: '%s'\n", err)
}
fmt.Printf("strconv.ParseInt('%s', 64): %d\n", s, i64)

s = "-3.234"
f64, err := strconv.ParseFloat(s, 64)
if err != nil {
    fmt.Printf("strconv.ParseFloat() failed with: '%s'\n", err)
}
fmt.Printf("strconv.ParseFloat('%s', 64): %g\n", s, f64)

var f2 float64
_, err = fmt.Sscanf(s, "%f", &f2)
if err != nil {
    fmt.Printf("fmt.Sscanf() failed with: '%s'\n", err)
}
fmt.Printf("fmt.Sscanf(): %g\n", f2)

strconv.Atoi('234'): 234
strconv.Atoi('not a number') failed with: 'strconv.Atoi: parsing "not a number": invalid syntax'
strconv.ParseInt('234', 64): 234
strconv.ParseFloat('-3.234', 64): -3.234
fmt.Sscanf(): -3.234
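
The examples above only produce float64 values. A minimal sketch of parsing into float32, assuming a hypothetical helper named parseFloat32, could look like this:

```go
package main

import (
	"fmt"
	"strconv"
)

// parseFloat32 (hypothetical helper) parses s as a 32-bit float.
// strconv.ParseFloat always returns a float64; with bitSize 32 the result
// can be converted to float32 without changing its value.
func parseFloat32(s string) (float32, error) {
	f, err := strconv.ParseFloat(s, 32)
	if err != nil {
		return 0, err
	}
	return float32(f), nil
}

func main() {
	f, err := parseFloat32("-3.234")
	if err != nil {
		fmt.Printf("parseFloat32 failed with: '%s'\n", err)
		return
	}
	fmt.Printf("%T %g\n", f, f)

	if _, err = parseFloat32("not a number"); err != nil {
		fmt.Printf("parseFloat32 failed with: '%s'\n", err)
	}
}
```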

Trimming strings (removing characters or substrings)

strings.TrimSpace(s string) removes whitespace from the beginning and end of a string:

s := "  str\n "
fmt.Printf("TrimSpace: %#v => %#v\n", s, strings.TrimSpace(s))

TrimSpace: "  str\n " => "str"

strings.TrimPrefix and strings.TrimSuffix remove a given string from the beginning or end of a string:

prefix := "aba"
s1 := "abacdda"
trimmed1 := strings.TrimPrefix(s1, prefix)
fmt.Printf("TrimPrefix %#v of %#v => %#v\n\n", prefix, s1, trimmed1)

s2 := "abacdda"
suffix := "da"
trimmed2 := strings.TrimSuffix(s2, suffix)
fmt.Printf("TrimSuffix %#v of %#v => %#v\n\n", suffix, s2, trimmed2)

TrimPrefix "aba" of "abacdda" => "cdda"

TrimSuffix "da" of "abacdda" => "abacd"

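strings.Trim removes all leading and trailing characters that are in the given cutset; strings.TrimLeft and strings.TrimRight do the same on one side only: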

s := "abacdda"
cutset := "zab"

trimmed := strings.Trim(s, cutset)
fmt.Printf("Trim chars %#v from %#v => %#v\n\n", cutset, s, trimmed)

trimmed = strings.TrimLeft(s, cutset)
fmt.Printf("TrimLeft chars %#v from %#v => %#v\n\n", cutset, s, trimmed)

trimmed = strings.TrimRight(s, cutset)
fmt.Printf("TrimRight chars %#v from %#v => %#v\n\n", cutset, s, trimmed)

Trim chars "zab" from "abacdda" => "cdd"

TrimLeft chars "zab" from "abacdda" => "cdda"

TrimRight chars "zab" from "abacdda" => "abacdd"
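
Because the cutset functions work on sets of characters and not on substrings, mixing them up with TrimPrefix or TrimSuffix is a common mistake. The sketch below illustrates it; both helper names are invented for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// withoutTxtSuffix (hypothetical helper) removes a literal ".txt" suffix.
func withoutTxtSuffix(name string) string {
	return strings.TrimSuffix(name, ".txt")
}

// withoutTxtCutset (hypothetical helper) shows the pitfall: TrimRight
// treats ".txt" as the set of characters '.', 't' and 'x' and keeps
// removing trailing characters as long as they are in that set.
func withoutTxtCutset(name string) string {
	return strings.TrimRight(name, ".txt")
}

func main() {
	name := "report.txt"
	fmt.Printf("TrimSuffix: %#v\n", withoutTxtSuffix(name))
	fmt.Printf("TrimRight:  %#v\n", withoutTxtCutset(name))
}
```

Here TrimRight also eats the final "t" of "report", since it is part of the cutset.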

Replacing substrings with strings.Replace

s := "this is string"
toRemove := " is"

after := strings.Replace(s, toRemove, "", -1)    
fmt.Printf("Removed %#v from %#v => %#v\n\n", toRemove, s, after)

Removed " is" from "this is string" => "this string"

s := "original string original"
s2 := strings.Replace(s, "original", "replaced", -1)
fmt.Printf("s2: '%s'\n", s2)

s2: 'replaced string replaced'

The last argument of strings.Replace is the maximum number of replacements; -1 means replace all occurrences.
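
To show the effect of that last argument, here is a small sketch that limits the number of replacements to one. The helper names replaceFirst and replaceEvery are assumptions for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// replaceFirst (hypothetical helper) replaces only the first occurrence.
func replaceFirst(s, old, repl string) string {
	return strings.Replace(s, old, repl, 1)
}

// replaceEvery (hypothetical helper) replaces all occurrences.
func replaceEvery(s, old, repl string) string {
	return strings.Replace(s, old, repl, -1)
}

func main() {
	s := "original string original"
	fmt.Printf("n = 1:  '%s'\n", replaceFirst(s, "original", "replaced"))
	fmt.Printf("n = -1: '%s'\n", replaceEvery(s, "original", "replaced"))
}
```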

Replacing strings with a regular expression

s := "original string original"
rx := regexp.MustCompile("(?U)or.*al")
s2 := rx.ReplaceAllString(s, "replaced")
fmt.Printf("s2: '%s'\n", s2)

s2: 'replaced string replaced'

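This shows how to replace parts of a string using a regular expression. (?U)or.*al is a non-greedy (the (?U) flag) regular expression that matches "original".

We replace all parts of the string that match the regular expression with "replaced".

The regular expression is non-greedy, which means it matches the shortest possible string, as opposed to the default greedy matching, which looks for the longest possible match.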
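
The difference between greedy and non-greedy matching can be sketched by running the same pattern with and without the (?U) flag. The two helper names are made up for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// replaceGreedy (hypothetical helper) uses the default greedy matching.
func replaceGreedy(s string) string {
	rx := regexp.MustCompile("or.*al")
	return rx.ReplaceAllString(s, "replaced")
}

// replaceNonGreedy (hypothetical helper) uses the (?U) flag.
func replaceNonGreedy(s string) string {
	rx := regexp.MustCompile("(?U)or.*al")
	return rx.ReplaceAllString(s, "replaced")
}

func main() {
	s := "original string original"
	// greedy: one match from the first "or" to the last "al"
	fmt.Printf("greedy:     '%s'\n", replaceGreedy(s))
	// non-greedy: each "original" is matched separately
	fmt.Printf("non-greedy: '%s'\n", replaceNonGreedy(s))
}
```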

Splitting and joining

Split a string into a []string, or join a []string back into a single string:

s := "this is a string"
a := strings.Split(s, " ")
fmt.Printf("a: %#v\n", a)

s2 := strings.Join(a, ",")
fmt.Printf("s2: %#v\n", s2)

a: []string{"this", "is", "a", "string"}
s2: "this,is,a,string"

Joining many strings with strings.Join is generally faster than concatenating them one by one with +.
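
As a rough illustration of the two approaches, the sketch below builds the same result with repeated + and with strings.Join. The function names are hypothetical. The + version may allocate a new string on every step, while Join can size the result once.

```go
package main

import (
	"fmt"
	"strings"
)

// joinWithPlus (hypothetical helper) concatenates parts with repeated +.
func joinWithPlus(parts []string, sep string) string {
	s := ""
	for i, p := range parts {
		if i > 0 {
			s += sep
		}
		s += p
	}
	return s
}

// joinWithJoin (hypothetical helper) delegates to strings.Join.
func joinWithJoin(parts []string, sep string) string {
	return strings.Join(parts, sep)
}

func main() {
	parts := []string{"this", "is", "a", "string"}
	fmt.Printf("%#v\n", joinWithPlus(parts, ","))
	fmt.Printf("%#v\n", joinWithJoin(parts, ","))
}
```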

Formatting text

The fmt package in Go's standard library implements C-style string formatting:

s := fmt.Sprintf("Hello %s", "World")
fmt.Printf("s: '%s'\n", s)
s = fmt.Sprintf("%d + %f = %d", 2, float64(3), 5)
fmt.Println(s)

s: 'Hello World'
2 + 3.000000 = 5

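The first argument to fmt.Sprintf is the format string, which defines how the subsequent arguments are formatted. The subsequent arguments are the values to format.

fmt.Sprintf creates a formatted string.

For convenience, there are also:

fmt.Fprintf(w io.Writer, format string, args ...interface{}), which writes a formatted string to the given writer
fmt.Printf(format string, args ...interface{}), which writes a formatted string to os.Stdout

Sprintf formats the string given as its first argument, replacing the verbs with the values of the subsequent arguments, and returns the result. Printf formats in the same way but prints the string instead of returning it.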
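
Since fmt.Fprintf accepts any io.Writer, the same formatting code can target the console, a file, or an in-memory buffer. A minimal sketch, with invented helper names writeGreeting and greetingString:

```go
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
)

// writeGreeting (hypothetical helper) writes a formatted greeting to w.
func writeGreeting(w io.Writer, name string) error {
	_, err := fmt.Fprintf(w, "Hello %s\n", name)
	return err
}

// greetingString (hypothetical helper) captures the output in memory;
// *strings.Builder implements io.Writer.
func greetingString(name string) string {
	var sb strings.Builder
	if err := writeGreeting(&sb, name); err != nil {
		return ""
	}
	return sb.String()
}

func main() {
	// writing to os.Stdout has the same effect as fmt.Printf
	if err := writeGreeting(os.Stdout, "World"); err != nil {
		fmt.Println("write failed:", err)
	}
	fmt.Printf("%q\n", greetingString("Go"))
}
```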

List of format verbs:

%v // the value in a default format; when printing structs, the plus flag (%+v) adds field names
%#v // a Go-syntax representation of the value
%T // a Go-syntax representation of the type of the value
%% // a literal percent sign; consumes no value

Boolean:

%t // the word true or false

Integer:

%b // base 2
%c // the character represented by the corresponding Unicode code point
%d // base 10
%o // base 8
%q // a single-quoted character literal safely escaped with Go syntax.
%x // base 16, with lower-case letters for a-f
%X // base 16, with upper-case letters for A-F
%U // Unicode format: U+1234; same as "U+%04X"

Floating-point and complex constituents:

%b // decimalless scientific notation with exponent a power of two, in the manner of strconv.FormatFloat with the 'b' format, e.g. -123456p-78
%e // scientific notation, e.g. -1.234456e+78
%E // scientific notation, e.g. -1.234456E+78
%f // decimal point but no exponent, e.g. 123.456
%F // synonym for %f
%g // %e for large exponents, %f otherwise
%G // %E for large exponents, %F otherwise

String and slice of bytes (treated equivalently with these verbs):

%s // the uninterpreted bytes of the string or slice
%q // a double-quoted string safely escaped with Go syntax
%x // base 16, lower-case, two characters per byte
%X // base 16, upper-case, two characters per byte

Pointer:

%p // base 16 notation, with leading 0x
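
A few of the verbs listed above in action. This is only a sketch: the helper format and the point type are made up for the illustration.

```go
package main

import "fmt"

// format (hypothetical helper) applies a single verb to a single value.
func format(verb string, v interface{}) string {
	return fmt.Sprintf(verb, v)
}

func main() {
	type point struct{ X, Y int } // hypothetical type for illustration
	p := point{1, 2}

	fmt.Println(format("%v", p))         // {1 2}
	fmt.Println(format("%+v", p))        // {X:1 Y:2}
	fmt.Println(format("%t", true))      // true
	fmt.Println(format("%b", 5))         // 101
	fmt.Println(format("%x", 255))       // ff
	fmt.Println(format("%c", 65))        // A
	fmt.Println(format("%U", 'A'))       // U+0041
	fmt.Println(format("%e", 1234.5678)) // 1.234568e+03
	fmt.Println(format("%q", "hi"))      // "hi"
	fmt.Println(format("%x", "hi"))      // 6869
}
```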

Parsing text

Using fmt.Sscanf
fmt.Sscanf is the reverse of fmt.Sprintf. Given a string and a format string, it parses the string into components.

// extract int and float from a string
s := "48 123.45"
var f float64
var i int
nParsed, err := fmt.Sscanf(s, "%d %f", &i, &f)
if err != nil {
    log.Fatalf("first fmt.Sscanf failed with %s\n", err)
}
fmt.Printf("i: %d, f: %f, extracted %d values\n", i, f, nParsed)

var i2 int
_, err = fmt.Sscanf(s, "%d %f %d", &i, &f, &i2)
if err != nil {
    fmt.Printf("second fmt.Sscanf failed with %s\n", err)
}

i: 48, f: 123.450000, extracted 2 values
second fmt.Sscanf failed with EOF

fmt.Sscanf supports largely the same formatting verbs as fmt.Sprintf (a few, such as %p and %T, are not implemented for scanning).

If the format string doesn't match the parsed string, fmt.Sscanf returns an error. In our example the error is EOF because we tried to extract more values than the string contains.

Using strings.Split
strings.Split splits a string by a separator.

s := "this,. is,. a,. string"
a := strings.Split(s, ",.")
fmt.Printf("a: %#v\n", a)

a: []string{"this", " is", " a", " string"}

Reading a file line by line

Read the file into memory and split it into lines

// ReadFileAsLines reads a file and splits it into lines
func ReadFileAsLines(path string) ([]string, error) {
    d, err := ioutil.ReadFile(path)
    if err != nil {
        return nil, err
    }
    s := string(d)
    lines := strings.Split(s, "\n")
    return lines, nil
}

There are 32 lines in 'main.go'
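
One detail worth knowing when splitting on "\n": if the text ends with a newline, strings.Split returns an extra empty string as the last element. A possible way to avoid that, sketched here with a hypothetical helper named splitLines that works on a string instead of a file:

```go
package main

import (
	"fmt"
	"strings"
)

// splitLines (hypothetical helper) splits s into lines on "\n". Unlike a
// bare strings.Split it does not report an extra empty line when s ends
// with a newline, and it returns no lines for an empty string.
func splitLines(s string) []string {
	if s == "" {
		return nil
	}
	s = strings.TrimSuffix(s, "\n")
	return strings.Split(s, "\n")
}

func main() {
	s := "first\nsecond\n"
	fmt.Printf("strings.Split: %#v\n", strings.Split(s, "\n"))
	fmt.Printf("splitLines:    %#v\n", splitLines(s))
}
```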

Iterating over lines in a file
Processing one line at a time is more efficient than reading the whole file into memory.

We can do this with bufio.Scanner:

func IterLinesInFile(filePath string, process func (s string) bool) error {
    file, err := os.Open(filePath)
    if err != nil {
        return err
    }
    defer file.Close()
    scanner := bufio.NewScanner(file)
    // Scan() reads next line and returns false when reached end or error
    for scanner.Scan() {
        line := scanner.Text()
        // process the line; stop early when the callback returns false
        if !process(line) {
            return nil
        }
    }
    // check if Scan() finished because of error or because it reached end of file
    return scanner.Err()
}

38 lines in 'main.go'
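
The same scanning loop works on any io.Reader, which makes it easy to try out without a file. The sketch below counts lines; the helper names and the 1 MiB line limit are assumptions for this example. By default a bufio.Scanner rejects lines longer than bufio.MaxScanTokenSize (64 KiB), and Buffer raises that limit.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// countLines (hypothetical helper) counts the lines available from r.
func countLines(r io.Reader) (int, error) {
	scanner := bufio.NewScanner(r)
	// assumption: allow lines of up to 1 MiB instead of the default 64 KiB
	scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	n := 0
	for scanner.Scan() {
		n++
	}
	// Err returns nil when scanning stopped at the end of input
	return n, scanner.Err()
}

// countLinesInString (hypothetical helper) runs countLines on a string.
func countLinesInString(s string) int {
	n, err := countLines(strings.NewReader(s))
	if err != nil {
		return -1
	}
	return n
}

func main() {
	fmt.Printf("%d lines\n", countLinesInString("one\ntwo\r\nthree"))
}
```

The default split function also strips a trailing "\r" from each line, so Windows line endings are handled here as well.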

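Normalizing newlines

3 kinds of newlines

There are 3 common ways to represent a newline:

Unix: a single character, LF, which is byte 10 (0x0a), written as "\n" in a Go string literal.
Windows: 2 characters, CR LF, which are bytes 13 10 (0x0d, 0x0a), written as "\r\n" in a Go string literal.
Mac OS (classic): a single character, CR, which is byte 13 (0x0d), written as "\r" in a Go string literal. This is the least common.

When splitting a string into lines, you have to decide how to handle this.

You can assume that your code will only ever see, say, Unix-style line endings and handle only "\n", but that will not work at all for files with Mac line endings, and lines from files with Windows line endings will carry a trailing CR character.

A simple way to handle multiple newline representations is to normalize the newlines and then operate on the normalized version.

Finally, you can write code that handles all newline styles directly. Inevitably, such code is a bit more complicated.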
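
As a sketch of that last option, the function below splits a string into lines while treating all three terminators as equivalent. The name splitLinesAny is an assumption for this example. Scanning byte by byte is safe for UTF-8 text because the CR and LF bytes never occur inside a multi-byte sequence.

```go
package main

import "fmt"

// splitLinesAny (hypothetical helper) splits s into lines, treating
// "\r\n", "\r" and "\n" each as one line terminator. Terminators are not
// included in the returned lines.
func splitLinesAny(s string) []string {
	var lines []string
	start := 0
	for i := 0; i < len(s); i++ {
		switch s[i] {
		case '\n':
			lines = append(lines, s[start:i])
			start = i + 1
		case '\r':
			lines = append(lines, s[start:i])
			// treat CR LF as a single terminator
			if i+1 < len(s) && s[i+1] == '\n' {
				i++
			}
			start = i + 1
		}
	}
	// text after the last terminator, if any
	if start < len(s) {
		lines = append(lines, s[start:])
	}
	return lines
}

func main() {
	fmt.Printf("%#v\n", splitLinesAny("unix\nwindows\r\nmac\rlast"))
}
```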

Newline normalization:

// NormalizeNewlines normalizes \r\n (windows) and \r (mac)
// into \n (unix)
func NormalizeNewlines(d []byte) []byte {
    // replace CR LF \r\n (windows) with LF \n (unix)
    d = bytes.Replace(d, []byte{13, 10}, []byte{10}, -1)
    // replace CR \r (mac) with LF \n (unix)
    d = bytes.Replace(d, []byte{13}, []byte{10}, -1)
    return d
}

"new\nline"