[讨论] JavaScript中String的存储

lvzixun 2013-10-29

今天跟朋友谈到了JavaScript里面String和对象的问题,他们说String是primitive value,所以String存储的字符串是在stack上分配的。 之后给我截了个图:

我对Lua还算是比较熟悉,Lua里面string是copy on write机制的,相同的字符只会存在一份。 JavaScript我用的不多,所以他说String的字符串是在stack里面存的,有点不太相信,因为一旦string是在stack中存储的话,对于函数调用每次都要copy string。还有函数返回时还要对string做特殊的处理,等等。感觉这样设计不太合理,R大能否解答下? 

 

PS:

抱歉,RednaxelaFX 大。 一直都要说来HLLVM混的,但是这段时间工作上的事情太忙了,我自己的A2都没时间去改,等忙过了这一阵子HLLVM肯定多逛  ;D 

 

RednaxelaFX 2013-10-29
lvzixun 写道
今天跟朋友谈到了JavaScript里面String和对象的问题,他们说String是primitive value,所以String存储的字符串是在stack上分配的。之后给我截了个图:

哈哈哈这是个经典误解。上面绿色的部分是对的,但红色的部分就不对了。

我也读过这本书见过这张图,但我想不起来这是哪本书了。
印象中是还在读大学的时候读的,当时我也被这图忽悠了。有人这是哪本书么?回头我把书名补充上来晒一下hall of shame。

要讨论这种问题得分开几个层面。

---------------------------------------------------------------

定义层面

首先是定义:String、stack、heap在这个上下文里的定义。讨论语言问题要定义肯定要去查规范。不幸的是,ECMAScript 5.1规范在“内存”的定义上非常模糊,并没有明确定义出ECMAScript实现中的各个运行时区域的划分,也就不存在“stack/heap划分”。

所以,从定义层面看没有所谓“String存储的字符串是在stack上分配的”。

唯一提到运行时调用栈相关的地方是10.3
ECMAScript 5.1 写道
10.3 Execution Contexts

When control is transferred to ECMAScript executable code, control is entering an execution context. Active execution contexts logically form a stack. The top execution context on this logical stack is the running execution context. A new execution context is created whenever control is transferred from the executable code associated with the currently running execution context to executable code that is not associated with that execution context. The newly created execution context is pushed onto the stack and becomes the running execution context.


而heap这个词压根没在该规范里出现过。

String的相关规定倒是很明确:
ECMAScript 5.1 写道
4.3.2 primitive value

member of one of the types Undefined, Null, Boolean, Number, or String as defined in Clause 8

NOTE A primitive value is a datum that is represented directly at the lowest level of the language implementation.

4.3.16 String value

primitive value that is a finite ordered sequence of zero or more 16-bit unsigned integer

NOTE A String value is a member of the String type. Each integer value in the sequence usually represents a single 16-bit unit of UTF-16 text. However, ECMAScript does not place any restrictions or requirements on the values except that they must be 16-bit unsigned integers.

4.3.17 String type

set of all possible String values

4.3.18 String object

member of the Object type that is an instance of the standard built-in String constructor

NOTE A String object is created by using the String constructor in a new expression, supplying a String value as an argument. The resulting object has an internal property whose value is the String value. A String object can be coerced to a String value by calling the String constructor as a function (15.5.1).

8 Types

Algorithms within this specification manipulate values each of which has an associated type. The possible value types are exactly those defined in this clause. Types are further subclassified into ECMAScript language types and specification types.

An ECMAScript language type corresponds to values that are directly manipulated by an ECMAScript programmer using the ECMAScript language. The ECMAScript language types are Undefined, Null, Boolean, String, Number, and Object.

...

8.4 The String Type

The String type is the set of all finite ordered sequences of zero or more 16-bit unsigned integer values (“elements”). The String type is generally used to represent textual data in a running ECMAScript program, in which case each element in the String is treated as a code unit value (see Clause 6). Each element is regarded as occupying a position within the sequence. These positions are indexed with nonnegative integers. The first element (if any) is at position 0, the next element (if any) at position 1, and so on. The length of a String is the number of elements (i.e., 16-bit values) within it. The empty String has length zero and therefore contains no elements.

When a String contains actual textual data, each element is considered to be a single UTF-16 code unit. Whether or not this is the actual storage format of a String, the characters within a String are numbered by their initial code unit element position as though they were represented using UTF-16. All operations on Strings (except as otherwise stated) treat them as sequences of undifferentiated 16-bit unsigned integers; they do not ensure the resulting String is in normalised form, nor do they ensure language-sensitive results.

NOTE: The rationale behind this design was to keep the implementation of Strings as simple and high-performing as possible. The intent is that textual data coming into the execution environment from outside (e.g., user input, text read from a file or received over the network, etc.) be converted to Unicode Normalised Form C before the running program sees it. Usually this would occur at the same time incoming text is converted from its original character encoding to Unicode (and would impose no additional overhead). Since it is recommended that ECMAScript source code be in Normalised Form C, string literals are guaranteed to be normalised (if source text is guaranteed to be normalised), as long as they do not contain any Unicode escape sequences.


11.9.6 The Strict Equality Comparison Algorithm

The comparison x === y, where x and y are values, produces true or false. Such a comparison is performed as follows:

If Type(x) is different from Type(y), return false.
If Type(x) is Undefined, return true.
If Type(x) is Null, return true.
If Type(x) is Number, then
If x is NaN, return false.
If y is NaN, return false.
If x is the same Number value as y, return true.
If x is +0 and y is −0, return true.
If x is −0 and y is +0, return true.
Return false.
If Type(x) is String, then return true if x and y are exactly the same sequence of characters (same length and same characters in corresponding positions); otherwise, return false.
If Type(x) is Boolean, return true if x and y are both true or both false; otherwise, return false.
Return true if x and y refer to the same object. Otherwise, return false.
NOTE This algorithm differs from the SameValue Algorithm (9.12) in its treatment of signed zeroes and NaNs.


根据4.3.2的规定可以知道,String值在JavaScript里是一种“原始值”(primitive value)。
引申一下,而其对应的类型String是一种原始类型(primitive type)。

JavaScript有Object类型,但String不是一种JavaScript Object。

根据4.3.18的规定,还有一种叫做String object的东西,它是通过new String()创建出来的对象,是一种JavaScript Object,跟原始类型的String不是一回事。这种东西实际上是包装着String值的对象,而不是String值本身。

这组定义大概就是各种误解的源头了。



我先去睡个觉…回来再继续码字。等我!!

> s = "abc"
"abc"
> a = new String(s)
String {0: "a", 1: "b", 2: "c"}
> b = new String(s)
String {0: "a", 1: "b", 2: "c"}
> a === b
false


---------------------------------------------------------------

实现层面

回来再码

===============================================================

关于Lua string

可变的东西才会有“write”,然后才会引申出“copy-on-write”的做法;Lua string跟JavaScript string一样不可变,所以也谈不上CoW。
Lua string与其说是“copy-on-write”,还不如说是“always interned”更准确一些。用源码里的话说是“all strings are always internalized.”,内容相同的字符串确实只有一份。

像这样的一段代码:
local a = "f".."oo"
local b = "fo".."o"
print(a == b)    -- prints: true

执行过后变量a与变量b所指向的字符串的内容相同,都是"foo";事实上两个变量会指向同一个TString实例。

Lua的全局状态global_State里有个名为strt的成员,是专门驻留字符串的hashtable。所有Lua string实例在创建的时候都会到这个strt去查找一遍跟它内容相同的字符串之前是不是已经有了,有的话就直接用那个实例,没的话才真正创建新实例并插入到strt里。典型的interning。

有个有趣的细节是:在创建新字符串、从strt查找已有字符串时,如果找到了内容相同、但已被标记为“死掉”的字符串实例,Lua会把它的标记改回“活的”然后直接复用那个实例。

LuaJIT 2此处的实现跟官方Lua基本一样。

更加有趣的是Mike Pall大大之前给Lua做过一个fast string patch,不过很明显Lua 5.1和5.2都没有使用它。这个patch的本质跟下面提到的C++ std::string的某些实现所用的SSO技巧一样,把字符串内容直接藏在“TString值”里面。

===============================================================

关于C++ std::string

这才是copy-on-write的字符串实现好例子,特别是C++98/03的std::string的典型实现方式。

它最有趣的地方在于它既可变(mutable),又具有值语义(value semantics)。

拿一个典型的CoW式std::string来看,g++/libstdc++的std::basic_string。
下面引用自g++-4.8.0的文档
引用
A string looks like this:
                                     [_Rep]
                                     _M_length
[basic_string<char_type>]            _M_capacity
_M_dataplus                          _M_refcount
_M_p ---------------->               unnamed array of char_type

(为啥选libstdc++?因为它大概是现在主流Standard C++ Library实现里唯一一个没有使用SSO而且还在用CoW的std::basic_string的版本了orz)

在这种实现里,如果我们运行下面代码:
#include <string>
#include <iostream>
using namespace std;

int main() {
  string foo("foo");   // (1)
  string goo = foo;    // (2)
  goo[0] = 'g';        // (3)
  cout << foo << endl;
  cout << goo << endl;
  return 0;
}


由于CoW在多线程环境中有不少坑,C++11开始不允许std::string使用copy-on-write(或者叫shared buffer)方式实现。具体理由在此有阐述:Concurrency Modifications to Basic String
Herb Sutter大大写过一篇文章讲解为啥用CoW实现String不一定好:Optimizations That Aren't (In a Multithreaded World)
Scott Meyers大大有提到short string optimization (SSO)技巧:The View from Aristeia: std::string, SSO, and Move Semantics
John Ahlgren: Small String Optimization and Move Operations
Scott Meyers大大的书《Effective STL》里有更详细的讲解。

陈硕大大写过讨论“值语义”的文章:C++ 工程实践(8):值语义

说到C++ std::string,顺带笔记一下一个用expression template技巧来延迟运算,减少连续运算的中间结果带来的开销。现代主流JavaScript引擎都会自动采用类似的延迟运算策略,所以字符串拼接速度普遍比老JavaScript引擎有质的飞跃。
lvzixun 2013-11-03

@RednaxelaFX  果然那张图的描述是太诡异了orz.  JS内部对基础类型string的实现上应该会对相同的字符串做唯一处理吧?

 

lua的string用“always interned” 确实更加准确。  fast string 和SSO 对于小字符串的优化很有意思,看来value slot 结构中可以存储各种你想存的东西, 改天也跟A2的string加上SSO这样的优化。哈哈哈

 

kungo 2014-09-28
为什么我看到的截图是个广告
Global site tag (gtag.js) - Google Analytics