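This is my first technical blog post on this platform; I hope I can keep it up.
Since the topic is embedded systems, I'll start with a common kind of analysis. The platform is, naturally, ARM, and the operating system is, of course, Linux.
Without further ado, here is the log first. You can't analyze anything without a log.
#################### Here is the log ####################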
[ 1003.716695] Unable to handle kernel NULL pointer dereference at virtual address 00000009
[ 1003.723854] pgd = c0204000
[ 1003.726445] [00000009] *pgd=00000000
[ 1003.730004] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
[ 1003.735296] Modules linked in: ecm ath9k ath9k_common iptable_nat ath9k_hw ath10k_pci ath10k_core ath nf_nat_pptp nf_nat_ipv4 nf_nat_amanda nf_conntrack_pptp nf_conni
[ 1003.965769] CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 3.14.77 #87
[ 1003.972016] task: dd478b80 ti: dd482000 task.ti: dd482000
[ 1003.977492] PC is at ieee80211_wake_txqs+0x120/0x2cc [mac80211]
[ 1003.983349] LR is at ieee80211_wake_txqs+0x25c/0x2cc [mac80211]
[ 1003.989204] pc : [<bf89b7a4>] lr : [<bf89b8e0>] psr: 80000113
[ 1003.989204] sp : dd483e80 ip : 00000000 fp : 00000000
[ 1004.000661] r10: 00000000 r9 : d5df8000 r8 : d8578cdc
[ 1004.005870] r7 : 00000000 r6 : d8578c40 r5 : d8578c40 r4 : d85a4500
[ 1004.012379] r3 : d5df9ec0 r2 : 00000040 r1 : d5f6fb20 r0 : 00000000
[ 1004.018892] Flags: Nzcv IRQs on FIQs on Mode SVC_32 ISA ARM Segment kernel
[ 1004.026182] Control: 10c5787d Table: 58e8806a DAC: 00000015
[ 1004.031913] Process ksoftirqd/0 (pid: 3, stack limit = 0xdd482238)
[ 1004.038074] Stack: (0xdd483e80 to 0xdd484000)
[ 1004.042421] 3e80: d8579620 d8579620 d857df20 00000040 d8578fe8 d85a4500 d85a4a4c d8578c40
[ 1004.050577] 3ea0: 00000004 d85790b8 bf9a1740 d5df8000 ddc1de80 d8579250 d8579254 c08602bc
[ 1004.058737] 3ec0: 00000000 00000000 00000100 40000002 00000006 c02333dc dd482030 00000004
[ 1004.066898] 3ee0: c0864090 dd482000 c0864098 c02329d4 c024e180 60000013 ffffffff 0000000a
[ 1004.075057] 3f00: 04208040 000112e1 00000000 00000003 00000000 dd482038 00000000 c0870f14
[ 1004.083217] 3f20: 00000001 dd482008 00000000 00000000 00000000 c0232b94 c0232b64 dd482000
[ 1004.091376] 3f40: dd423880 c024e3bc dd4238c0 00000000 dd423880 c024e160 00000000 00000000
[ 1004.099537] 3f60: 00000000 c0248360 00000000 00000001 00000000 dd423880 00000000 00030003
[ 1004.107696] 3f80: dd483f80 dd483f80 00000000 00000000 dd483f90 dd483f90 dd483fac dd4238c0
[ 1004.115855] 3fa0: c0248288 00000000 00000000 c0208d30 00000000 00000000 00000000 00000000
[ 1004.124013] 3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 1004.132174] 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000
[ 1004.140494] [<bf89b7a4>] (ieee80211_wake_txqs [mac80211]) from [<c02333dc>] (tasklet_action+0x8c/0xec)
[ 1004.149631] [<c02333dc>] (tasklet_action) from [<c02329d4>] (__do_softirq+ 0x104/0x294)
[ 1004.157526] [<c02329d4>] (__do_softirq) from [<c0232b94>] (run_ksoftirqd+0x30/0x90)
[ 1004.165171] [<c0232b94>] (run_ksoftirqd) from [<c024e3bc>] (smpboot_thread_fn+0x25c/0x274)
[ 1004.173419] [<c024e3bc>] (smpboot_thread_fn) from [<c0248360>] (kthread+0xd8/0xec)
[ 1004.180966] [<c0248360>] (kthread) from [<c0208d30>] (ret_from_fork+0x14/0x24)
[ 1004.188168] Code: e59d200c e0893002 e2833d7a e593b02c (e5db3009)
[ 1004.194328] ---[ end trace 891b6516b6873ba4 ]---
#################### Analysis ####################
First of all, a kernel oops follows a fixed format:
• Error Summary
• Error Type
• CPU#/PID#/Kernel-Version
• Hardware
• CPU Register Dump
  – PC/LR
• Stack Dump
• Backtrace
##################################################
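A kernel crash or panic can have many causes; here I use a NULL pointer dereference as the example. The log lines quoted in the steps below deserve the closest attention.
1)
The most important clue is the PC. The kernel died at ieee80211_wake_txqs+0x120/0x2cc. So what does the trailing 0x120/0x2cc mean?
It is offset/length: the offset is measured from the start address of the function, and the length is the size of the function.
That raises the next question: where does the start address of ieee80211_wake_txqs come from? This is where objdump comes in. Every platform has its own toolchain, so use the objdump that ships with your toolchain.
The command is simple: objdump -S your_module.ko > dump.log
You can then look up the start address of ieee80211_wake_txqs in the dump file. Say it is 0xc1234567; the crash then happened at 0xc1234567 + 0x120 = 0xc1234687. Find that instruction in the disassembly and patiently work out which line of C code it corresponds to.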
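The offset arithmetic above is easy to get wrong by hand, so here is a minimal sketch of it in Python. The helper names are my own, and the start address 0xc1234567 is the same made-up value used in the text, not something taken from a real dump.

```python
import re

def parse_symbol_offset(text):
    """Split 'name+0xOFF/0xLEN' into (name, offset, length)."""
    m = re.fullmatch(r"([\w.]+)\+0x([0-9a-fA-F]+)/0x([0-9a-fA-F]+)", text.strip())
    if m is None:
        raise ValueError(f"not a symbol+offset/length string: {text!r}")
    return m.group(1), int(m.group(2), 16), int(m.group(3), 16)

def fault_address(start, text):
    """Add the offset from the oops to a function start address."""
    _, offset, length = parse_symbol_offset(text)
    if offset >= length:
        raise ValueError("offset lies outside the function")
    return start + offset

# Made-up start address, as in the text above (an assumption, not a real value).
EXAMPLE_START = 0xC1234567
print(hex(fault_address(EXAMPLE_START, "ieee80211_wake_txqs+0x120/0x2cc")))
```

The length check is only a sanity test: an offset that is not smaller than the function length usually means the string was copied incorrectly.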
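Tips:
1. If your ko is built in both a debug and a non-debug version, congratulations: running objdump on the debug version gives a pleasant surprise, because -S interleaves the C source with the assembly. If there is no debug version, that's fine too; add the -g option when compiling with gcc, and the resulting ko carries plenty of debug information.
2. If the oops happened inside the kernel itself and you have the matching vmlinux, you can still use objdump, but first use nm to find the start address of the function (ieee80211_wake_txqs stands in for it here). nm is short for "names"; it lists the symbols in a file (in plain terms, functions, global variables and so on).
nm vmlinux | grep ieee80211_wake_txqs
Suppose the result is 0xc1234567; the faulting address is then 0xc1234687 = 0xc1234567 + 0x120.
objdump -S vmlinux --start-address=0xc1234567 --stop-address=0xc1234687 > dump.log
3. addr2line is another great tool: as long as the ko/vmlinux is a debug build, it can also find the exact line of code that failed.
addr2line -e vmlinux -a 0xc1234687 -f
Note: if the function is built into the kernel, the address that nm or objdump reports for vmlinux is the same runtime address the oops prints in a line like pc : [<bf89b7a4>]. If it belongs to a ko, as ieee80211_wake_txqs [mac80211] does in this log, the address that nm or objdump reports for the ko is not the final address in the running kernel, because the module is relocated when it is loaded. Usually that doesn't matter, since what we care about is which line of code the offset maps to. If you do want the runtime address, look the function up in /proc/kallsyms. The story of kallsyms is left for the next post.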
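To illustrate the /proc/kallsyms lookup, here is a small sketch that searches text in the kallsyms layout (address, type, name, optional [module]). The sample lines are not real kallsyms output: I reconstructed the two start addresses by subtracting the offsets in the oops from the printed addresses (0xbf89b7a4 - 0x120 and 0xc02333dc - 0x8c), and the symbol type letters are a guess.

```python
def find_symbol(kallsyms_text, name):
    """Return (address, module) for the first symbol called `name`, or None."""
    for line in kallsyms_text.splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[2] != name:
            continue
        # Symbols that live in a module carry a trailing "[module]" field.
        module = fields[3].strip("[]") if len(fields) > 3 else None
        return int(fields[0], 16), module
    return None

# Reconstructed sample in kallsyms layout (assumption, see the paragraph above).
SAMPLE = """\
c0233350 t tasklet_action
bf89b684 t ieee80211_wake_txqs\t[mac80211]
"""

addr, module = find_symbol(SAMPLE, "ieee80211_wake_txqs")
print(hex(addr), module)
print(hex(addr + 0x120))  # start address plus offset gives back the PC
```

On a live target you would pass the contents of /proc/kallsyms instead of SAMPLE; note that the addresses may read as zeros when the file is read without sufficient privileges.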
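That may look like the whole job, but stopping here is like eating at McDonald's: not enough nutrition.
The rest of the oops offers further clues, and finding the faulting line is often only the first step in finding the root cause.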
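2)
Oops: 5 [#1] PREEMPT SMP ARM
The number after "Oops:" is the error code of the oops. The bit table usually quoted for it is the x86 page-fault error code:
bit    description
bit 0  0 means no page found, 1 means a protection fault
bit 1  0 means read, 1 means write
bit 2  0 means kernel, 1 means user-mode
On 32-bit ARM, as far as I can tell from arch/arm/mm/fault.c, the value printed is instead the fault status register (FSR) of the abort, so the 5 here should not be decoded with the x86 table. With the classic 2-level page tables, a status of 5 is a section translation fault, which matches the *pgd=00000000 line in the log.
[#1] is the number of times an oops has occurred. Multiple oopses can be triggered as a cascading effect of the first one.
PREEMPT means the kernel supports preemption, SMP means multi-core support, and ARM/THUMB names the instruction set.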
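Here is a hedged sketch of how the code after "Oops:" could be decoded on 32-bit ARM. It assumes the value is the fault status register in the short-descriptor (non-LPAE) format, where the status is bits [3:0] plus bit 10 and bit 11 marks a write; the name table is deliberately partial and is my reading of the kernel's fault table, so treat it as an assumption rather than a reference.

```python
FSR_WRITE = 1 << 11  # assumed: set when the faulting access was a write

# Partial table of fault status values (assumption, short-descriptor format).
FAULT_NAMES = {
    0x5: "section translation fault",
    0x7: "page translation fault",
    0xD: "section permission fault",
    0xF: "page permission fault",
}

def decode_arm_fsr(code):
    """Split an ARM fault status value into status, name and access type."""
    status = (code & 0xF) | ((code & (1 << 10)) >> 6)
    return {
        "status": status,
        "name": FAULT_NAMES.get(status, "not in this partial table"),
        "write": bool(code & FSR_WRITE),
    }

print(decode_arm_fsr(0x5))
print(decode_arm_fsr(0x805))
```

Read this way, the 5 in this log is a translation fault on a read, which fits a load through a NULL-based pointer at virtual address 00000009.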
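3)
CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 3.14.77 #87
This says the oops happened on CPU0. Comm gives the process that was running, here ksoftirqd/0 with PID 3. "Not tainted" means the kernel is not tainted, and the kernel version is 3.14.77.
The meaning of each Tainted flag can be found in kernel/panic.c in the kernel source:
Tainted  description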
‘G’ if all modules loaded have a GPL or compatible license
‘P’ if any proprietary module has been loaded. Modules without a MODULE_LICENSE or with a MODULE_LICENSE that is not recognised by insmod as GPL compatible are assumed to be proprietary.
‘F’ if any module was force loaded by “insmod -f”.
‘S’ if the Oops occurred on an SMP kernel running on hardware that hasn’t been certified as safe to run multiprocessor. Currently this occurs only on various Athlons that are not SMP capable.
‘R’ if a module was force unloaded by “rmmod -f”.
‘M’ if any processor has reported a Machine Check Exception.
‘B’ if a page-release function has found a bad page reference or some unexpected page flags.
‘U’ if a user or user application specifically requested that the Tainted flag be set.
‘D’ if the kernel has died recently, i.e. there was an OOPS or BUG.
‘W’ if a warning has previously been issued by the kernel.
‘C’ if a staging module / driver has been loaded.
‘I’ if the kernel is working around a severe bug in the platform’s firmware (BIOS or similar).
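To pull the fields out of the CPU line and expand the taint letters, something like the following sketch could be used. The regular expression is my own guess at the layout based on the single line in this log, the flag table is abridged from the list above, and the tainted example line is hypothetical.

```python
import re

# Abridged from the flag list above.
TAINT_FLAGS = {
    "G": "all loaded modules are GPL compatible",
    "P": "a proprietary module has been loaded",
    "F": "a module was force loaded",
    "R": "a module was force unloaded",
    "D": "the kernel has died recently (oops or BUG)",
    "W": "a warning was issued earlier",
    "C": "a staging driver has been loaded",
}

CPU_LINE = re.compile(
    r"CPU: (\d+) PID: (\d+) Comm: (.+?) (Not tainted|Tainted: [A-Z ]+?) +(\S+) (#\d+)"
)

def parse_cpu_line(line):
    """Extract CPU, PID, command, taint letters and version from an oops line."""
    m = CPU_LINE.search(line)
    if m is None:
        raise ValueError("not a CPU/PID/Comm line")
    taint = m.group(4)
    letters = [] if taint == "Not tainted" else [c for c in taint[len("Tainted:"):] if c != " "]
    return {
        "cpu": int(m.group(1)),
        "pid": int(m.group(2)),
        "comm": m.group(3),
        "taint": letters,
        "version": m.group(5),
        "build": m.group(6),
    }

info = parse_cpu_line("[ 1003.965769] CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 3.14.77 #87")
print(info)

# Hypothetical tainted line, only to show the flag expansion.
tainted = parse_cpu_line("CPU: 1 PID: 42 Comm: insmod Tainted: P  W 3.14.77 #87")
for letter in tainted["taint"]:
    print(letter, "=", TAINT_FLAGS.get(letter, "see kernel/panic.c"))
```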
---------------------
4)
Flags: Nzcv IRQs on FIQs on Mode SVC_32 ISA ARM Segment kernel
This line decodes the Current Program Status Register (CPSR) as it was when the oops happened (the raw value is the psr field in the register dump). ARM also has a Saved Program Status Register (SPSR) for each exception mode; User mode and System mode do not have an SPSR.
The allocation of the bits within the CPSR (and SPSR) is:
bit    31  30  29  28  27  24  19…16    9  8  7  6  5  4…0
field  N   Z   C   V   Q   J   GE[3:0]  E  A  I  F  T  M[4:0]
Bits not listed are omitted here.
NZCV are the condition flags:
Negative: is set to bit 31 of the result, so N is 1 if the signed value is negative, and cleared if the result is positive or zero.
Zero: is set if the result is zero; this is usual to denote an equal result from a comparison. If the result is non-zero, this flag is cleared.
Carry: Is more complex:
With the instructions ADC, ADD, and CMN, this flag is set if the result would produce an unsigned overflow.
With the instructions CMP, SBC, and SUB, this flag is cleared if the result would produce an unsigned underflow (a borrow), and set otherwise.
For other instructions that use shifting, this flag is set to the value of the last bit shifted out by the shifter.
Other instructions usually leave this flag alone.
oVerflow: for addition and subtraction, this flag is set if a signed overflow occurred. Otherwise, it is generally left alone. Note that some API conventions may specifically set oVerflow to flag an error condition.
IRQs on FIQs on
The Interrupt flags are as follows:
I: when set, disables IRQ interrupts
F: when set, disables FIQ interrupts
A: [ARMv6 and later] when set, disables imprecise aborts (this is an abort on a memory write that has been held in a write buffer in the processor and not written to memory until later, perhaps after another abort or interrupt is in progress).
In short:
Nzcv are the flags. If a flag is in uppercase it is set, otherwise it is clear.
N = Last result was Negative
Z = Last result was Zero
C = Last result needed/produced a Carry bit
V = Last result oVerflowed
IRQs on means hardware interrupts (IRQs) are enabled.
FIQs on means fast interrupts (FIQs), which are handled with a faster context switch, are enabled as well.
Mode is the CPU mode; SVC_32 indicates that the code was privileged.
Control: 10c5787d Table: 58e8806a DAC: 00000015
These are CPU control registers set up by the kernel: the system control register, the translation table base, and the domain access control register.
Mode SVC_32 ISA ARM Segment kernel
The CPU was in 32-bit SVC (supervisor) mode, executing ARM instructions, with the kernel segment.
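As a cross-check, the Flags line can be rebuilt from the raw psr value in the register dump (psr: 80000113). The sketch below is my own illustration of the bit layout described above; it ignores the J bit and lists only the common mode numbers.

```python
# Common values of the M[4:0] mode field.
MODES = {0x10: "USER", 0x11: "FIQ", 0x12: "IRQ", 0x13: "SVC",
         0x17: "ABT", 0x1B: "UND", 0x1F: "SYS"}

def decode_psr(psr):
    """Decode the condition flags, interrupt masks, ISA and mode from a PSR value."""
    flags = "".join(
        name if psr & (1 << bit) else name.lower()
        for name, bit in (("N", 31), ("Z", 30), ("C", 29), ("V", 28))
    )
    return {
        "flags": flags,
        "irqs_on": not psr & (1 << 7),   # I bit set means IRQs are masked
        "fiqs_on": not psr & (1 << 6),   # F bit set means FIQs are masked
        "isa": "Thumb" if psr & (1 << 5) else "ARM",
        "mode": MODES.get(psr & 0x1F, "unknown"),
    }

print(decode_psr(0x80000113))
```

For 0x80000113 this yields the flags Nzcv, IRQs and FIQs on, ISA ARM and mode SVC, which agrees with the Flags line printed by the kernel.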
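5)
Stack: (0xdd483e80 to 0xdd484000)
The dump starts at the current stack pointer, 0xdd483e80 (the sp value in the register dump), and runs to the top of the stack at 0xdd484000; the hex lines that follow it in the log are the stack contents.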
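The stack lines only print the low 16 bits of each address, so a small helper makes it easier to ask "what is stored at address X". This is a sketch with names of my own choosing; it assumes the whole dump shares the upper 16 address bits of sp, and it uses two lines copied from the log above.

```python
import re

STACK_LINE = re.compile(r"\b([0-9a-f]{4}): ((?:[0-9a-f]{8} ?)+)$")

def parse_stack_dump(lines, sp):
    """Map each dumped stack address to the 32-bit word stored there."""
    high = sp & ~0xFFFF  # assumption: dump does not cross a 64 KiB boundary
    words = {}
    for line in lines:
        m = STACK_LINE.search(line.strip())
        if m is None:
            continue
        base = high | int(m.group(1), 16)
        for i, word in enumerate(m.group(2).split()):
            words[base + 4 * i] = int(word, 16)
    return words

LOG = [
    "[ 1004.050577] 3ea0: 00000004 d85790b8 bf9a1740 d5df8000 ddc1de80 d8579250 d8579254 c08602bc",
    "[ 1004.058737] 3ec0: 00000000 00000000 00000100 40000002 00000006 c02333dc dd482030 00000004",
]

stack = parse_stack_dump(LOG, sp=0xDD483E80)
print(hex(stack[0xDD483ED4]))        # a saved return address inside tasklet_action
print(0xDD484000 - 0xDD483E80)       # number of bytes covered by the full dump
```

The word at 0xdd483ed4 is c02333dc, the same address the backtrace reports for tasklet_action+0x8c, which shows how return addresses can be spotted in the raw dump.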
6) The backtrace starts here
[<bf89b7a4>] (ieee80211_wake_txqs [mac80211]) from [<c02333dc>] (tasklet_action+0x8c/0xec)
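Each backtrace line has the shape "[<addr>] (callee) from [<return addr>] (caller+offset/length)", so the call chain can be extracted mechanically. The regular expression below is my own sketch based on the lines in this log, not a format specification; the two input lines are copied from the log.

```python
import re

FRAME = re.compile(
    r"\[<([0-9a-f]{8})>\] \((\S+?)(?: \[(\w+)\])?\) from "
    r"\[<([0-9a-f]{8})>\] \((\S+?)\+0x([0-9a-f]+)/0x([0-9a-f]+)(?: \[\w+\])?\)"
)

def parse_backtrace(lines):
    """Return one dict per backtrace line, innermost frame first."""
    frames = []
    for line in lines:
        m = FRAME.search(line)
        if m is None:
            continue
        frames.append({
            "callee": m.group(2),
            "module": m.group(3),            # None for built-in code
            "return_addr": int(m.group(4), 16),
            "caller": m.group(5),
            "caller_offset": int(m.group(6), 16),
        })
    return frames

LOG = [
    "[ 1004.140494] [<bf89b7a4>] (ieee80211_wake_txqs [mac80211]) from [<c02333dc>] (tasklet_action+0x8c/0xec)",
    "[ 1004.157526] [<c02329d4>] (__do_softirq) from [<c0232b94>] (run_ksoftirqd+0x30/0x90)",
]

for frame in parse_backtrace(LOG):
    print(frame["caller"], "->", frame["callee"], frame["module"])
```

Reading the chain from the bottom of the real backtrace upward gives the path kthread, smpboot_thread_fn, run_ksoftirqd, __do_softirq, tasklet_action and finally ieee80211_wake_txqs, so the crash happened in tasklet context.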
7) Finally
The kernel's own documentation on oops:
https://elixir.bootlin.com/linux/v3.18.125/source/Documentation/oops-tracing.txt
Discussion welcome: bmebob_zhao@163.com
That's all.