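Preface
A while ago I set out to write a small text-processing program that needed to read text efficiently. So I collected the common ways of reading a file, timed each of them, and read some of the relevant source code, hoping to explain the reasons behind the test results and to be able to pick the best method the next time I need one. To cut down on distractions, the source excerpts below omit code such as argument checks, and some of them have been simplified.
Five common ways to read a file
Using BufferedReader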
static long testBuffered(String fileName) throws IOException{
Long startTime = System.currentTimeMillis();
BufferedReader reader = new BufferedReader(new FileReader(fileName));
int count;
char[] buffer=new char[8*1024];
long sum = 0;
while((count=reader.read(buffer))!=-1)
{
sum += count;
}
reader.close();
Long endTime = System.currentTimeMillis();
System.out.println("Total time of BufferedReader is "+ (endTime - startTime) + " milliseconds, Total byte is " + sum);
return endTime - startTime;
}
BufferedReader is a very common way to read a file. The buffer size here is 8*1024, chosen to match the size of BufferedReader's internal buffer. The constructors of BufferedReader are as follows:
private char cb[];
private static int defaultCharBufferSize = 8192;
public BufferedReader(Reader in, int sz) {
super(in);
this.in = in;
cb = new char[sz];
nextChar = nChars = 0;
}
public BufferedReader(Reader in) {
this(in, defaultCharBufferSize);
}
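As we can see, if no size is passed to the constructor, the default defaultCharBufferSize is used, that is, $8192 = 8 \times 1024$. A private array cb is created with this size; I guess the name is short for "char buffer". To read a run of characters, BufferedReader calls the following function.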
public int read(char cbuf[], int off, int len) throws IOException {
synchronized (lock) {
int n = read1(cbuf, off, len);
if (n <= 0) return n;
while ((n < len) && in.ready()) {
int n1 = read1(cbuf, off + n, len - n);
if (n1 <= 0) break;
n += n1;
}
return n;
}
}
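As we can see, it calls read1 in a loop to fill the array passed in (cbuf) up to the requested length (len). What follows is a long chain of calls, shown in the figure below.
After several layers of nested calls, the read ends up in FileChannel, which is also the fourth method in this article. BufferedReader adds its own buffering and byte-to-character decoding on top of that, so it is naturally slower than reading the bytes directly.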
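The constructor and the read loop above can be tried out with a small sketch. The class name BufferedReadSketch and the method countChars are made up for illustration; the sketch only assumes the standard two-argument BufferedReader constructor quoted above, and it reads from a StringReader so that it runs without a file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class BufferedReadSketch {
    // Counts the characters delivered by a Reader, going through a
    // BufferedReader whose internal cb array has the given size.
    static long countChars(Reader source, int bufferSize) throws IOException {
        long sum = 0;
        // try-with-resources closes the reader even if read() throws
        try (BufferedReader reader = new BufferedReader(source, bufferSize)) {
            char[] chunk = new char[bufferSize]; // same size as the internal buffer
            int count;
            while ((count = reader.read(chunk)) != -1) {
                sum += count;
            }
        }
        return sum;
    }

    public static void main(String[] args) throws IOException {
        // 8 * 1024 matches defaultCharBufferSize
        System.out.println(countChars(new StringReader("hello world"), 8 * 1024));
    }
}
```

Note that the value counted here is a number of characters, not bytes; the two only coincide for single-byte content such as the all-'a' test files used later.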
Using RandomAccessFile
static long testRandomAccess(String fileName) throws IOException{
Long startTime = System.currentTimeMillis();
RandomAccessFile reader = new RandomAccessFile(fileName,"r");
int count;
byte[] buffer=new byte[8*1024];// read buffer
long sum = 0;
while((count=reader.read(buffer))!=-1){
sum += count;
}
reader.close();
Long endTime = System.currentTimeMillis();
System.out.println("Total time of RandomAccess is "+ (endTime - startTime) + " milliseconds, Total byte is " + sum);
return endTime - startTime;
}
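Why is the buffer in the code above also 8 KB? Because of the call chain shown below.
As we can see, the call chain of this function is short, and the work is done by a native function. The relevant code, at the end of the chain in io_util.c, is as follows: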
#define BUF_SIZE 8192
jint
readBytes(JNIEnv *env, jobject this, jbyteArray bytes,
jint off, jint len, jfieldID fid)
{
jint nread;
char stackBuf[BUF_SIZE];
char *buf = stackBuf;
if (len > BUF_SIZE) {
buf = malloc(len);
}
fd = GET_FD(this, fid);
nread = IO_Read(fd, buf, len);
(*env)->SetByteArrayRegion(env, bytes, off, nread, (jbyte *)buf);
if (buf != stackBuf) {
free(buf);
}
return nread;
}
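From the code above we can see that if the requested length is no more than 8192, the local stack array is used directly. If it is larger, a block of memory of that length has to be allocated. That is why the test code uses a length of 8192: it avoids a heap allocation on every call, since malloc and free in C are comparatively expensive and would drag the read speed down.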
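When the destination array has to be larger than 8 KB, one way to stay on the stack-buffer path is to ask for at most 8192 bytes per call. The following is a minimal sketch under the assumption that the JDK in use has the BUF_SIZE of 8192 quoted above; the class ChunkedReadSketch, the constant NATIVE_STACK_BUF, and the method readFully are hypothetical names, not part of the JDK.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

class ChunkedReadSketch {
    // Assumption: mirrors BUF_SIZE in the io_util.c excerpt above.
    static final int NATIVE_STACK_BUF = 8192;

    // Fills dest from the file, never requesting more than 8192 bytes
    // per call, and returns the number of bytes actually read.
    static int readFully(RandomAccessFile file, byte[] dest) throws IOException {
        int total = 0;
        while (total < dest.length) {
            int want = Math.min(NATIVE_STACK_BUF, dest.length - total);
            int n = file.read(dest, total, want);
            if (n == -1) {
                break; // end of file reached before dest was full
            }
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("chunked", ".bin");
        Files.write(tmp, new byte[20000]);
        try (RandomAccessFile file = new RandomAccessFile(tmp.toFile(), "r")) {
            System.out.println(readFully(file, new byte[32 * 1024]));
        }
        Files.delete(tmp);
    }
}
```

Whether this is actually faster than a single large read depends on the platform and JDK version, so it is worth measuring before relying on it.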
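Using FileInputStream
This approach is also very common. As the name suggests, it turns the file into an input stream and reads it byte by byte. The generic InputStream.read(byte[], int, int) shown below fills the array by calling read() once per byte. Note that FileInputStream itself overrides this method with a native readBytes call, so reading a real file into an array does not go through this loop. The InputStream code is as follows: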
public int read(byte b[], int off, int len) throws IOException {
int c = read();
if (c == -1) {
return -1;
}
b[off] = (byte)c;
int i = 1;
try {
for (; i < len ; i++) {
c = read();
if (c == -1) {
break;
}
b[off + i] = (byte)c;
}
} catch (IOException ee) {
}
return i;
}
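The main method later in this article calls testFileStream, but its body is not shown. The following is a minimal sketch of what it might look like, assuming it mirrors the other test methods (8 KB byte buffer, same output format); the wrapping class FileStreamTestSketch and the helper countBytes are hypothetical names added for illustration.

```java
import java.io.FileInputStream;
import java.io.IOException;

class FileStreamTestSketch {
    // Reads the whole file through FileInputStream in 8 KB chunks
    // and returns the number of bytes read.
    static long countBytes(String fileName) throws IOException {
        long sum = 0;
        try (FileInputStream reader = new FileInputStream(fileName)) {
            byte[] buffer = new byte[8 * 1024];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                sum += count;
            }
        }
        return sum;
    }

    // Assumed shape of the missing test method: time the read and
    // print the result in the same format as the other tests.
    static long testFileStream(String fileName) throws IOException {
        long startTime = System.currentTimeMillis();
        long sum = countBytes(fileName);
        long endTime = System.currentTimeMillis();
        System.out.println("Total time of FileStream is " + (endTime - startTime)
                + " milliseconds, Total byte is " + sum);
        return endTime - startTime;
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            testFileStream(args[0]);
        }
    }
}
```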
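Using FileChannel with ByteBuffer
This approach is much like the tail end of the call chain in the first method, so in principle its speed should be decent. The code is as follows: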
static long testFileStreamChannel(String fileName) throws IOException{
Long startTime = System.currentTimeMillis();
FileInputStream reader = new FileInputStream(fileName);
FileChannel ch = reader.getChannel();
ByteBuffer bb = ByteBuffer.allocate(8*1024);
long sum = 0;
int count;
while ((count=ch.read(bb)) != -1 )
{
sum += count;
bb.clear();
}
reader.close();
Long endTime = System.currentTimeMillis();
System.out.println("Total time of FileStreamChannel is "+ (endTime - startTime) + " milliseconds, Total byte is " + sum);
return endTime - startTime;
}
The FileChannel read function it calls is actually implemented by read in IOUtil. The code is as follows:
static int read(FileDescriptor fd, ByteBuffer dst, long position, NativeDispatcher nd) throws IOException
{
if (dst instanceof DirectBuffer)
return readIntoNativeBuffer(fd, dst, position, nd);
ByteBuffer bb = Util.getTemporaryDirectBuffer(dst.remaining());
try {
int n = readIntoNativeBuffer(fd, bb, position, nd);
bb.flip();
if (n > 0)
dst.put(bb);// copy into the buffer passed in by the caller
return n;
} finally {
Util.offerFirstTemporaryDirectBuffer(bb);
}
}
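It requests a temporary off-heap DirectByteBuffer as large as the remaining space in the buffer passed in, reads the file into it, and finally copies the data back into the caller's buffer.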
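The first branch of the read function above suggests that passing a direct buffer skips the temporary buffer and the extra copy. The following is a minimal sketch of that variant, using the standard ByteBuffer.allocateDirect and FileChannel.open; the class DirectBufferReadSketch and the method countBytes are hypothetical names, and the sketch makes no claim about how much time this saves in practice.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

class DirectBufferReadSketch {
    // Reads the whole file through a FileChannel into an 8 KB direct
    // (off-heap) buffer and returns the number of bytes read.
    static long countBytes(Path path) throws IOException {
        long sum = 0;
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            ByteBuffer bb = ByteBuffer.allocateDirect(8 * 1024);
            int count;
            while ((count = ch.read(bb)) != -1) {
                sum += count;
                bb.clear(); // reset position so the next read can reuse the buffer
            }
        }
        return sum;
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            System.out.println(countBytes(Paths.get(args[0])));
        }
    }
}
```

Allocating a direct buffer is itself relatively costly, so this only makes sense when the same buffer is reused across many reads, as in the loop here.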
Using FileChannel with MappedByteBuffer
This approach is rarely seen. The test code is as follows:
static long testFileStreamChannelMap(String fileName) throws IOException{
Long startTime = System.currentTimeMillis();
FileInputStream reader = new FileInputStream(fileName);
FileChannel ch = reader.getChannel();
MappedByteBuffer mb =ch.map( FileChannel.MapMode.READ_ONLY,0L, ch.size() );// this is the key line
long sum = 0;
sum = mb.limit();
reader.close();
Long endTime = System.currentTimeMillis();
System.out.println("Total time of testFileStreamChannelMap is "+ (endTime - startTime) + " milliseconds, Total byte is " + sum);
return endTime - startTime;
}
Now let's look at what the commented line above actually does:
public MappedByteBuffer map(MapMode mode, long position, long size) throws IOException
{
int pagePosition = (int)(position % allocationGranularity);
long mapPosition = position - pagePosition;
long mapSize = size + pagePosition;
try {
// native method; returns the address of the memory mapping
addr = map0(imode, mapPosition, mapSize);
} catch (OutOfMemoryError x) {
// out of memory: trigger a GC manually, then try again
System.gc();
try {
Thread.sleep(100);
} catch (InterruptedException y) {
Thread.currentThread().interrupt();
}
try {
addr = map0(imode, mapPosition, mapSize);
} catch (OutOfMemoryError y) {
throw new IOException("Map failed", y);
}
}
// construct a Buffer from the address and return it
return Util.newMappedByteBufferR(isize, addr + pagePosition, mfd, um);
}
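The name Util.newMappedByteBufferR in the code above is easy to misread: what it actually constructs is a DirectByteBufferR, a subclass of DirectByteBuffer, which is in turn a subclass of MappedByteBuffer. In other words, it obtains the address at which the file is mapped in virtual memory and constructs a DirectByteBufferR on top of it. The advantage of this type is that it operates on that virtual memory directly.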
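To show what operating on the mapped memory looks like once the buffer has been obtained, here is a minimal sketch that actually visits every mapped byte. The class MappedReadSketch and the method sumBytes are hypothetical names; like the test code above, the sketch assumes the file is small enough to be mapped in one piece (map rejects sizes above Integer.MAX_VALUE).

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

class MappedReadSketch {
    // Maps the whole file read-only and adds up every byte, which
    // forces each mapped page to be brought into memory.
    static long sumBytes(Path path) throws IOException {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            MappedByteBuffer mb = ch.map(FileChannel.MapMode.READ_ONLY, 0L, ch.size());
            long sum = 0;
            while (mb.hasRemaining()) {
                sum += mb.get(); // reads straight from the mapped region
            }
            return sum;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            System.out.println(sumBytes(Paths.get(args[0])));
        }
    }
}
```

Closing the channel does not release the mapping; the mapped region stays valid until the buffer itself is garbage collected.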
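Testing, analysis, and summary
We can now measure the read speed of these five methods. The test generates files of roughly 1 KB, 128 KB, 256 KB, 512 KB, 768 KB, 1 MB, 128 MB, 256 MB, 512 MB, 768 MB, and 1 GB, and reads each of them.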
static boolean generateFile(String fileName,long size){
try {
BufferedWriter writer = new BufferedWriter(new FileWriter(fileName),8*1024);
for(int count = 0;count < size;count ++){
writer.write('a');
}
writer.close();
}catch (IOException e){
e.printStackTrace();
return false;
}
return true;
}
public static void main(String[] args) {
String fileName = "data.txt";
long m = 1024 ;
long size[] = {m,m * 128,m * 256,m * 512,m * 768,m * 1024,m * 1024 * 128,m * 1024 * 256,m * 1024 * 512,m * 1024 * 768,m * 1024 * 1024};
for (int i = 0;i < size.length;i ++ ) {
generateFile(fileName, size[i]);
try {
testBuffered(fileName);
testRandomAccess(fileName);
testFileStream(fileName);
testFileStreamChannel(fileName);
testFileStreamChannelMap(fileName);
} catch (IOException e) {
e.printStackTrace();
}
System.out.println("--------------------------------------------------------");
}
}
The test produces the following output:
Total time of BufferedReader is 1 milliseconds, Total byte is 1024
Total time of RandomAccess is 1 milliseconds, Total byte is 1024
Total time of FileStream is 0 milliseconds, Total byte is 1024
Total time of FileStreamChannel is 17 milliseconds, Total byte is 1024
Total time of testFileStreamChannelMap is 3 milliseconds, Total byte is 1024
--------------------------------------------------------
Total time of BufferedReader is 16 milliseconds, Total byte is 131072
Total time of RandomAccess is 0 milliseconds, Total byte is 131072
Total time of FileStream is 0 milliseconds, Total byte is 131072
Total time of FileStreamChannel is 0 milliseconds, Total byte is 131072
Total time of testFileStreamChannelMap is 0 milliseconds, Total byte is 131072
--------------------------------------------------------
Total time of BufferedReader is 5 milliseconds, Total byte is 262144
Total time of RandomAccess is 1 milliseconds, Total byte is 262144
Total time of FileStream is 0 milliseconds, Total byte is 262144
Total time of FileStreamChannel is 1 milliseconds, Total byte is 262144
Total time of testFileStreamChannelMap is 0 milliseconds, Total byte is 262144
--------------------------------------------------------
Total time of BufferedReader is 9 milliseconds, Total byte is 524288
Total time of RandomAccess is 0 milliseconds, Total byte is 524288
Total time of FileStream is 0 milliseconds, Total byte is 524288
Total time of FileStreamChannel is 1 milliseconds, Total byte is 524288
Total time of testFileStreamChannelMap is 0 milliseconds, Total byte is 524288
--------------------------------------------------------
Total time of BufferedReader is 10 milliseconds, Total byte is 786432
Total time of RandomAccess is 0 milliseconds, Total byte is 786432
Total time of FileStream is 0 milliseconds, Total byte is 786432
Total time of FileStreamChannel is 5 milliseconds, Total byte is 786432
Total time of testFileStreamChannelMap is 0 milliseconds, Total byte is 786432
--------------------------------------------------------
Total time of BufferedReader is 2 milliseconds, Total byte is 1048576
Total time of RandomAccess is 1 milliseconds, Total byte is 1048576
Total time of FileStream is 0 milliseconds, Total byte is 1048576
Total time of FileStreamChannel is 3 milliseconds, Total byte is 1048576
Total time of testFileStreamChannelMap is 1 milliseconds, Total byte is 1048576
--------------------------------------------------------
Total time of BufferedReader is 146 milliseconds, Total byte is 134217728
Total time of RandomAccess is 43 milliseconds, Total byte is 134217728
Total time of FileStream is 44 milliseconds, Total byte is 134217728
Total time of FileStreamChannel is 89 milliseconds, Total byte is 134217728
Total time of testFileStreamChannelMap is 0 milliseconds, Total byte is 134217728
--------------------------------------------------------
Total time of BufferedReader is 230 milliseconds, Total byte is 268435456
Total time of RandomAccess is 88 milliseconds, Total byte is 268435456
Total time of FileStream is 85 milliseconds, Total byte is 268435456
Total time of FileStreamChannel is 107 milliseconds, Total byte is 268435456
Total time of testFileStreamChannelMap is 0 milliseconds, Total byte is 268435456
--------------------------------------------------------
Total time of BufferedReader is 463 milliseconds, Total byte is 536870912
Total time of RandomAccess is 193 milliseconds, Total byte is 536870912
Total time of FileStream is 393 milliseconds, Total byte is 536870912
Total time of FileStreamChannel is 379 milliseconds, Total byte is 536870912
Total time of testFileStreamChannelMap is 0 milliseconds, Total byte is 536870912
--------------------------------------------------------
Total time of BufferedReader is 844 milliseconds, Total byte is 805306368
Total time of RandomAccess is 282 milliseconds, Total byte is 805306368
Total time of FileStream is 273 milliseconds, Total byte is 805306368
Total time of FileStreamChannel is 255 milliseconds, Total byte is 805306368
Total time of testFileStreamChannelMap is 0 milliseconds, Total byte is 805306368
--------------------------------------------------------
Total time of BufferedReader is 1097 milliseconds, Total byte is 1073741824
Total time of RandomAccess is 407 milliseconds, Total byte is 1073741824
Total time of FileStream is 348 milliseconds, Total byte is 1073741824
Total time of FileStreamChannel is 395 milliseconds, Total byte is 1073741824
Total time of testFileStreamChannelMap is 0 milliseconds, Total byte is 1073741824
--------------------------------------------------------
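As we can see, the first method takes the longest, which is exactly what we expected. The last method works on the mapped memory directly, so its time is negligible; note, though, that the test only maps the file and reads limit() without touching the mapped bytes, so the 0 ms mostly reflects the cost of setting up the mapping. The FileChannel-based methods also spend some time on small files, because they first have to set up their buffer cache.
From this we can draw a few conclusions. BufferedReader is slow at every file size in these tests and can be dropped when raw read speed is what matters. If a small file is only going to be read once, avoid the FileChannel-based methods. Do not make the input buffer larger than 8 KB: most default buffers are 8 KB, so matching that size is easy.
Although FileChannel with MappedByteBuffer did very well on large files in the test, it is used fairly rarely in practice. It has a number of problems, such as memory usage and unpredictable file closing: a file opened through it is only closed when the buffer is garbage collected, and that point in time cannot be predicted. Most programmers detest problems like these, since the behavior is outside their control. Bugs that cannot be reproduced are the hardest ones to fix.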