技术说明 9

原文:https://www.lua.org/notes/ltn009.html

Technical Note 9 技术说明 9

Last update: Mon Aug 12 15:51:58 EST 2002

Creating Strings Piece by Piece 逐个字符创建字符串

by Roberto Ierusalimschy 罗伯托·伊鲁萨林斯基

Abstract 摘要

In Lua, “accumulation” of string results (that is, loops over a code like s = s..x ) can be quite expensive. This note describes an efficient way to create a string piece by piece in Lua.

​ 在 Lua 中,“累积”字符串结果(即循环遍历类似 s = s..x 的代码)可能非常昂贵。本说明介绍了一种在 Lua 中逐个字符高效创建字符串的方法。

The Problem 问题

Suppose you are building a string piecemeal, for instance reading a file line by line. Your typical code would look like this:

​ 假设您正在逐个字符构建字符串,例如逐行读取文件。您的典型代码如下所示:

-- WARNING: bad code ahead!!
local buff = ""
while 1 do
  local line = read()
  if line == nil then break end
  buff = buff..line.."\n"
end
-- 警告:前有错误代码!!
local buff = ""
while 1 do
local line = read()
if line == nil then break end
buff = buff..line.."\n"
end

Despite its innocent look, this code in Lua can cause a huge performance penalty for large files: For instance, it takes almost a minute to read a 350 Kbyte file. 尽管这段 Lua 代码看起来无害,但它可能会导致处理大型文件时性能大幅下降:例如,读取一个 350 KB 的文件几乎需要一分钟时间。

Frequently this is not a problem. For small strings, the above loop is OK. To read a whole file, you can use the "*all" option, that reads it at once. But sometimes you have no such simple solution. Then, the only solution is a more efficient algorithm for your problem. Here we show one (algorithm, not problem).

​ 通常情况下,这不是问题。对于小字符串,上面的循环是可行的。要读取整个文件,可以使用 "*all" 选项,它会一次性读取文件。但有时你没有这么简单的解决方案。那么,唯一的解决方案就是为你的问题找到一种更高效的算法。我们这里展示一种(算法,不是问题)。

The Solution 解决方案

The heart of the algorithm is a stack, that keeps the large strings already created in its bottom, while small strings enter through the top. The main invariant of this stack is similar to that of the popular (among programmers, I mean) “Tower of Hanoy”: A string in the stack can never sit over a shorter string. Whenever a new string is pushed over a shorter one, then (and only then) the algorithm concatenates both. This concatenation creates a larger string, that now may be larger than its neighbor in the previous floor. If that happens, they are also joined. Those concatenations go down the stack until the loop reaches a larger string or the stack bottom.

​ 该算法的核心是一个栈,它将已创建的大字符串保存在底部,而小字符串则从顶部进入。此栈的主要不变性类似于程序员中流行的“河内塔”:栈中的字符串永远不会位于较短的字符串之上。每当一个新字符串被推到一个较短的字符串之上时,算法就会(且仅会)将两者连接起来。这种连接会创建一个更大的字符串,现在它可能比前一层中的相邻字符串更大。如果发生这种情况,它们也会连接起来。这些连接会一直向下进行,直到循环到达一个更大的字符串或栈底。

function newBuffer ()
  return {n=0}     -- 'n' counts number of elements in the stack
end函数 newBuffer () 返回 {n=0} -- 'n' 计数栈中的元素数量 结束function addString (stack, s)
  tinsert(stack, s)       -- push 's' into the top of the stack
  for i=stack.n-1, 1, -1 do
    if strlen(stack[i]) > strlen(stack[i+1]) then break end
    stack[i] = stack[i]..tremove(stack)
  end
end
函数 addString (stack, s) tinsert(stack, s) -- 将 's' 压入栈顶 for i=stack.n-1, 1, -1 do if strlen(stack[i]) > strlen(stack[i+1]) then break end stack[i] = stack[i]..tremove(stack) end end

To get the final contents of the buffer, we just need to concatenate all strings down the bottom:

​ 要获取缓冲区的最终内容,我们只需要连接底部的所有字符串:

function toString (stack)
  for i=stack.n-1, 1, -1 do
    stack[i] = stack[i]..tremove(stack)
  end
  return stack[1]
end
函数 toString (stack)
for i=stack.n-1, 1, -1 do
stack[i] = stack[i]..tremove(stack)
end
返回 stack[1]
end

Using this new data structure, we can rewrite our program as follows:

​ 使用这个新的数据结构,我们可以将我们的程序重写如下:

local s = newBuffer()
while 1 do
  local line = read()
  if line == nil then break end
  addString(s, line.."\n")  
end
s = toString(s)
local s = newBuffer()
while 1 do
local line = read()
if line == nil then break end
addString(s, line.."\n")
end
s = toString(s)

This new program reduces our original time to read a 350 Kbyte file from 40 seconds to 0.5 seconds. (The call read"*all" is still faster, finishing the job in 0.02 seconds.)

​ 这个新程序将我们读取 350 Kbyte 文件的原始时间从 40 秒减少到 0.5 秒。(调用 read"*all" 仍然更快,在 0.02 秒内完成作业。)

Explanation 说明

To understand what happens with the naive approach, let us assume that we are in the middle of the reading; buff is already a string with 50 Kbytes, and each line has 20 bytes. After the assignment

​ 为了理解朴素方法会发生什么,让我们假设我们正在读取过程中; buff 已经是一个包含 50 Kbyte 的字符串,并且每行有 20 个字节。在分配之后

    buff = buff..line.."\n"

buff is a new string with 50,020 bytes, and the old string in now garbage. After two loop cycles, buff is a string with 50,040 bytes, and there are two old strings making a total of more than 100 Kbytes of garbage. Therefore, Lua decides, quite correctly, that it is a good time to run its garbage collector, and so it frees those 100 Kbytes. The problem is that this will happen every two cycles, and so Lua will run its garbage collector two thousand times before finishing the loop. Even with all this work, its memory usage will be around three times the file size. To make things worse, each concatenation must copy the whole string content (50 Kbytes and growing) into the new string.

buff 是一个包含 50,020 字节的新字符串,而旧字符串现在是垃圾。经过两个循环周期后, buff 是一个包含 50,040 字节的字符串,并且有两个旧字符串,总共产生了超过 100 KB 的垃圾。因此,Lua 非常正确地决定,现在是运行垃圾回收器的好时机,于是它释放了这 100 KB。问题是这种情况每两个周期就会发生一次,因此在完成循环之前,Lua 将运行其垃圾回收器两千次。即使做了所有这些工作,其内存使用量仍将是文件大小的三倍左右。更糟糕的是,每次连接都必须将整个字符串内容(50 KB 及以上)复制到新字符串中。

This problem is not exclusive of Lua: other languages with true garbage collection, and where strings are immutable objects, present a similar behavior (Java being the most famous example).

​ 这个问题并非 Lua 独有:其他具有真正垃圾回收功能的语言以及字符串为不可变对象的语言也会出现类似行为(Java 是最著名的示例)。

Our original loop did a “linear” approach to the problem, concatenating small strings one by one into the accumulator. The new algorithm avoids this, using a binary approach. It concatenates many small strings among them, and once in a while it concatenates the resulting large strings into larger ones.

​ 我们的原始循环对问题采用了“线性”方法,将小字符串一个接一个地连接到累加器中。新算法避免了这种情况,采用了二进制方法。它在它们之间连接了许多小字符串,并且偶尔将结果大字符串连接成更大的字符串。

最后修改 January 31, 2024: 更新 (e0896df)