I. VMA
See http://www.makelinux.net/books/lkd2/ch14lev1sec2
Memory areas are represented by a memory area object, which is stored in the vm_area_struct structure, defined in <linux/mm.h>. Memory areas are often called virtual memory areas, or VMAs, in the kernel.
The vm_area_struct structure describes a single memory area over a contiguous interval in a given address space. The kernel treats each memory area as a unique memory object. Each memory area shares certain properties, such as permissions and a set of associated operations. In this manner, the single VMA structure can represent multiple types of memory areas, for example, memory-mapped files or the process's user-space stack. This is similar to the object-oriented approach taken by the VFS layer (see Chapter 12, "The Virtual Filesystem"). Here's the structure, with comments added describing each field:
struct vm_area_struct {
    struct mm_struct *vm_mm;             /* associated mm_struct */
    unsigned long vm_start;              /* VMA start, inclusive */
    unsigned long vm_end;                /* VMA end, exclusive */
    struct vm_area_struct *vm_next;      /* list of VMAs */
    pgprot_t vm_page_prot;               /* access permissions */
    unsigned long vm_flags;              /* flags */
    struct rb_node vm_rb;                /* VMA's node in the tree */
    union {         /* links to address_space->i_mmap or i_mmap_nonlinear */
        struct {
            struct list_head list;
            void *parent;
            struct vm_area_struct *head;
        } vm_set;
        struct prio_tree_node prio_tree_node;
    } shared;
    struct list_head anon_vma_node;      /* anon_vma entry */
    struct anon_vma *anon_vma;           /* anonymous VMA object */
    struct vm_operations_struct *vm_ops; /* associated ops */
    unsigned long vm_pgoff;              /* offset within file */
    struct file *vm_file;                /* mapped file, if any */
    void *vm_private_data;               /* private data */
};
Recall that each memory descriptor is associated with a unique interval in the process's address space. The vm_start field is the initial (lowest) address in the interval and the vm_end field is the first byte after the final (highest) address in the interval. That is, vm_start is the inclusive start and vm_end is the exclusive end of the memory interval. Thus, vm_end - vm_start is the length in bytes of the memory area, which exists over the interval [vm_start, vm_end). Intervals in different memory areas in the same address space cannot overlap.
The vm_mm field points to this VMA's associated mm_struct. Note that each VMA is unique to the mm_struct with which it is associated. Therefore, even if two separate processes map the same file into their respective address spaces, each has a unique vm_area_struct to identify its unique memory area. Conversely, two threads that share an address space also share all the vm_area_struct structures therein.
VMA Flags
The vm_flags field contains bit flags, defined in <linux/mm.h>, that specify the behavior of and provide information about the pages contained in the memory area. Unlike permissions associated with a specific physical page, the VMA flags specify behavior for which the kernel is responsible, not the hardware. Furthermore, vm_flags contains information that relates to each page in the memory area, or the memory area as a whole, and not specific individual pages. Table 14.1 is a listing of the possible vm_flags values.
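Table 14.1 is not reproduced in this excerpt. As a reminder, a few of the commonly used flags from <linux/mm.h> are listed below; this is a partial list and the short descriptions are paraphrased rather than copied from the book's table:
VM_READ        /* pages can be read from */
VM_WRITE       /* pages can be written to */
VM_EXEC        /* pages can be executed */
VM_SHARED      /* pages are shared between processes */
VM_IO          /* the area maps a device's I/O space */
VM_DONTCOPY    /* the area is not copied across fork() */
VM_GROWSDOWN   /* the area can grow downward (used for stacks) */
VM_LOCKED      /* pages in the area are locked and cannot be swapped out */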
VMA Operations
The vm_ops field in the vm_area_struct structure points to the table of operations associated with a given memory area, which the kernel can invoke to manipulate the VMA. The vm_area_struct acts as a generic object for representing any type of memory area, and the operations table describes the specific methods that can operate on this particular instance of the object.
The operations table is represented by struct vm_operations_struct and is defined in <linux/mm.h>:
struct vm_operations_struct {
    void (*open) (struct vm_area_struct *);
    void (*close) (struct vm_area_struct *);
    struct page * (*nopage) (struct vm_area_struct *, unsigned long, int);
    int (*populate) (struct vm_area_struct *, unsigned long, unsigned long,
                     pgprot_t, unsigned long, int);
};
Here's a description of each individual method:
void open(struct vm_area_struct *area)
This function is invoked when the given memory area is added to an address space.
void close(struct vm_area_struct *area)
This function is invoked when the given memory area is removed from an address space.
struct page * nopage(struct vm_area_struct *area,
                     unsigned long address,
                     int unused)
This function is invoked by the page fault handler when a page that is not present in physical memory is accessed.
int populate(struct vm_area_struct *area,
             unsigned long address,
             unsigned long len, pgprot_t prot,
             unsigned long pgoff, int nonblock)
This function is invoked by the remap_file_pages() system call to prefault a new mapping.
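To make this concrete, here is a minimal sketch (not from the original text) of how a hypothetical character driver could fill in this table for a vmalloc()-backed buffer. The names my_dev_buf, MY_BUF_SIZE, and the my_* functions are invented for illustration, and the nopage prototype follows the 2.6-era signature shown above:
#include <linux/mm.h>
#include <linux/vmalloc.h>
#include <linux/module.h>

extern char *my_dev_buf;              /* assumed: buffer allocated with vmalloc() elsewhere */
#define MY_BUF_SIZE (64 * 1024)       /* assumed buffer size */

static void my_vma_open(struct vm_area_struct *area)
{
    /* a new VMA referencing the device was created; e.g. bump a usage count */
}

static void my_vma_close(struct vm_area_struct *area)
{
    /* the VMA is going away; drop the count taken in my_vma_open() */
}

static struct page *my_vma_nopage(struct vm_area_struct *area,
                                  unsigned long address, int unused)
{
    unsigned long offset = address - area->vm_start;
    struct page *page;

    if (offset >= MY_BUF_SIZE)
        return NOPAGE_SIGBUS;          /* fault beyond the end of the buffer */

    /* translate the faulting offset into the struct page backing it */
    page = vmalloc_to_page(my_dev_buf + offset);
    get_page(page);                    /* take a reference before returning it */
    return page;
}

static struct vm_operations_struct my_vm_ops = {
    .open   = my_vma_open,
    .close  = my_vma_close,
    .nopage = my_vma_nopage,
};
The driver's mmap file operation would then simply set vma->vm_ops = &my_vm_ops; before returning, leaving the actual page mapping to the fault handler.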
Lists and Trees of Memory Areas | |
As discussed, memory areas are accessed via both the mmap and the mm_rb fields of the memory descriptor. These two data structures independently point to all the memory area objects associated with the memory descriptor. In fact, they both contain pointers to the very same vm_area_struct structures, merely represented in different ways. | |
The first field, mmap, links together all the memory area objects in a singly linked list. Each vm_area_struct structure is linked into the list via its vm_next field. The areas are sorted by ascended address. The first memory area is the vm_area_struct structure to which mmap points. The last structure points to NULL. | |
The second field, mm_rb, links together all the memory area objects in a red-black tree. The root of the red-black tree is mm_rb, and each vm_area_struct structure in this address space is linked to the tree via its vm_rb field. | |
A red-black tree is a type of balanced binary tree. Each element in a red-black tree is called a node. The initial node is called the root of the tree. Most nodes have two children: a left child and a right child. Some nodes have only one child, and the final nodes, called leaves, have no children. For any node, the elements to the left are smaller in value, whereas the elements to the right are larger in value. Furthermore, each node is assigned a color (red or black, hence the name of this tree) according to two rules: The children of a red node are black and every path through the tree from a node to a leaf must contain the same number of black nodes. The root node is always red. Searching of, insertion to, and deletion from the tree is an O(log(n)) operation. | |
The linked list is used when every node needs to be traversed. The red-black tree is used when locating a specific memory area in the address space. In this manner, the kernel uses the redundant data structures to provide optimal performance regardless of the operation performed on the memory areas. | |
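As an illustration (not from the original text), the sketch below walks every VMA through the list and then uses find_vma(), which searches the red-black tree, to locate the area containing a single address. The function name dump_vmas is invented; find_vma(), mm->mmap, and mm->mmap_sem are the kernel interfaces of that era:
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/sched.h>

/* Illustrative only: the caller must ensure mm stays valid. */
static void dump_vmas(struct mm_struct *mm, unsigned long addr)
{
    struct vm_area_struct *vma;

    down_read(&mm->mmap_sem);          /* protect the VMA list and tree */

    /* list traversal: O(n), used when every area must be visited */
    for (vma = mm->mmap; vma; vma = vma->vm_next)
        printk(KERN_DEBUG "vma %08lx-%08lx\n", vma->vm_start, vma->vm_end);

    /* tree lookup: O(log n), used to find the area covering one address.
     * find_vma() returns the first VMA whose vm_end is greater than addr,
     * so vm_start must still be checked before concluding addr lies inside. */
    vma = find_vma(mm, addr);
    if (vma && vma->vm_start <= addr)
        printk(KERN_DEBUG "%08lx lies in vma %08lx-%08lx\n",
               addr, vma->vm_start, vma->vm_end);

    up_read(&mm->mmap_sem);
}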
Memory Areas in Real Life
Let's look at a particular process's address space and the memory areas inside. For this task, I'm using the useful /proc filesystem and the pmap(1) utility. The example is a very simple user-space program, which does absolutely nothing of value, except act as an example:
int main(int argc, char *argv[])
{
    return 0;
}
Take note of a few of the memory areas in this process's address space. Right off the bat, you know there are the text section, data section, and bss. Assuming this process is dynamically linked with the C library, these three memory areas also exist for libc.so and again for ld.so. Finally, there is also the process's stack.
The output from /proc/<pid>/maps lists the memory areas in this process's address space:
rml@phantasy:~$ cat /proc/1426/maps
00e80000-00faf000 r-xp 00000000 03:01 208530 /lib/tls/libc-2.3.2.so
00faf000-00fb2000 rw-p 0012f000 03:01 208530 /lib/tls/libc-2.3.2.so
00fb2000-00fb4000 rw-p 00000000 00:00 0
08048000-08049000 r-xp 00000000 03:03 439029 /home/rml/src/example
08049000-0804a000 rw-p 00000000 03:03 439029 /home/rml/src/example
40000000-40015000 r-xp 00000000 03:01 80276 /lib/ld-2.3.2.so
40015000-40016000 rw-p 00015000 03:01 80276 /lib/ld-2.3.2.so
4001e000-4001f000 rw-p 00000000 00:00 0
bfffe000-c0000000 rwxp fffff000 00:00 0
The data is in the form
start-end permission offset major:minor inode file
The pmap(1) utility formats this information in a bit more readable manner:
rml@phantasy:~$ pmap 1426
example[1426]
00e80000 (1212 KB) r-xp (03:01 208530) /lib/tls/libc-2.3.2.so
00faf000 (12 KB)   rw-p (03:01 208530) /lib/tls/libc-2.3.2.so
00fb2000 (8 KB)    rw-p (00:00 0)
08048000 (4 KB)    r-xp (03:03 439029) /home/rml/src/example
08049000 (4 KB)    rw-p (03:03 439029) /home/rml/src/example
40000000 (84 KB)   r-xp (03:01 80276)  /lib/ld-2.3.2.so
40015000 (4 KB)    rw-p (03:01 80276)  /lib/ld-2.3.2.so
4001e000 (4 KB)    rw-p (00:00 0)
bfffe000 (8 KB)    rwxp (00:00 0)      [ stack ]
mapped: 1340 KB writable/private: 40 KB shared: 0 KB
The first three rows are the text section, data section, and bss of libc.so, the C library.
The next two rows are the text and data section of our executable object.
The following three rows are the text section, data section, and bss for ld.so, the dynamic linker.
The last row is the process's stack.
Note how the text sections are all readable and executable, which is what you expect for object code. On the other hand, the data section and bss (which both contain global variables) are marked readable and writable, but not executable. The stack is, naturally, readable, writable, and executable; it would not be of much use otherwise.
The entire address space takes up about 1340KB, but only 40KB are writable and private. If a memory region is shared or nonwritable, the kernel keeps only one copy of the backing file in memory. This might seem like common sense for shared mappings, but the nonwritable case can come as a bit of a surprise. If you consider the fact that a nonwritable mapping can never be changed (the mapping is only read from), it is clear that it is safe to load the image only once into memory. Therefore, the C library need only occupy 1212KB in physical memory, and not 1212KB multiplied by every process using the library. Because this process has access to about 1340KB worth of data and code, yet consumes only about 40KB of physical memory, the space savings from such sharing are substantial.
Note the memory areas without a mapped file that are on device 00:00 and inode zero. This is the zero page. The zero page is a mapping that consists of all zeros. By mapping the zero page over a writable memory area, the area is in effect "initialized" to all zeros. This is important in that it provides a zeroed memory area, which is expected by the bss. Because the mapping is not shared, as soon as the process writes to this data a copy is made (à la copy-on-write) and the value updated from zero.
Each of the memory areas associated with the process corresponds to a vm_area_struct structure. Because this process is not a thread sharing another task's address space, it has a unique mm_struct structure referenced from its task_struct.
II. Binder mmap
In traditional IPC, data is usually passed in a store-and-forward manner, which copies the data twice: from user space into kernel space, and then from kernel space back into user space. The binder driver takes a different approach: data is copied only once, from the sender's user space into kernel space, and that single copy completes the transfer from one process to another. The key is that the binder driver and the receiving process map the same physical memory into their respective virtual address spaces; the allocation and freeing of that physical memory is handled entirely inside the binder driver, and the receiving process has read-only access to it.
The binder driver has no real hardware behind it, yet it still implements an mmap function. That mmap is not used to map a physical medium into user space; it is what lets the binder driver move data with a single copy. Here is how this mmap is used: typically, after a process opens the /dev/binder node, it calls mmap:
fd = open("/dev/binder", O_RDWR);
mmap(NULL, MAP_SIZE, PROT_READ, MAP_PRIVATE, fd, 0); --> this ultimately calls binder_mmap()
The binder receiver now has a receive buffer of MAP_SIZE bytes. The return value of mmap() is the user-space address of the mapping, but this region is managed by the binder driver; the user neither needs to nor can access it directly (the mapping type is PROT_READ, read-only). Once the receive buffer is mapped, it serves as a buffer pool for receiving and storing data.
struct binder_proc contains the fields that manage this buffer region;
struct binder_buffer is the structure that represents an individual buffer.
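For reference, the buffer-management pieces of those two structures look roughly like this in a representative version of binder.c; the exact field set varies between kernel releases, and unrelated fields are elided:
struct binder_proc {
    ...
    struct vm_area_struct *vma;       /* the receiver's user-space mapping */
    void *buffer;                     /* kernel virtual address of the same memory */
    ptrdiff_t user_buffer_offset;     /* constant delta: user address - kernel address */
    struct list_head buffers;         /* every binder_buffer, in address order */
    struct rb_root free_buffers;      /* free buffers, keyed by size */
    struct rb_root allocated_buffers; /* in-use buffers, keyed by address */
    size_t buffer_size;               /* total size of the mapping */
    size_t free_async_space;          /* space still available to async transactions */
    struct page **pages;              /* physical pages backing the mapping */
    ...
};

struct binder_buffer {
    struct list_head entry;           /* entry in proc->buffers */
    struct rb_node rb_node;           /* in free_buffers or allocated_buffers */
    unsigned free:1;                  /* is this buffer currently unused? */
    ...
    struct binder_transaction *transaction;
    struct binder_node *target_node;
    size_t data_size;
    size_t offsets_size;
    uint8_t data[0];                  /* the transferred payload itself */
};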
==> The binder_mmap() function
When it is called:
[vma->vm_start, vma->vm_end) is the address range the kernel has assigned to this mapping; the difference between the two equals the length argument passed to the mmap() system call, and vma->vm_start is the value mmap() returns. Note that vma->vm_start and vma->vm_end here are both virtual addresses in the calling process's user space. When the process issues the mmap system call, control reaches the binder driver's file_operations->mmap() member, which is the binder_mmap() function.
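That hookup is just the driver's file_operations table. As a rough sketch of how a typical version of the driver registers it (the exact set of members differs between releases):
static const struct file_operations binder_fops = {
    .owner          = THIS_MODULE,
    .poll           = binder_poll,
    .unlocked_ioctl = binder_ioctl,
    .mmap           = binder_mmap,    /* mmap(2) on /dev/binder lands here */
    .open           = binder_open,
    .flush          = binder_flush,
    .release        = binder_release,
};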
Function definition:
static int binder_mmap(struct file *filp, struct vm_area_struct *vma)
{
    // The binder driver implements its own mmap(), but not to map a physical
    // medium into user space; it exists to set up the buffer area used to
    // receive data. The secret of binder copying data between user space and
    // kernel space only once lies in how the driver manages this receive buffer.
    // An application normally uses binder like this:
    //     fd = open("/dev/binder", O_RDWR);
    //     mmap(NULL, MAP_SIZE, PROT_READ, MAP_PRIVATE, fd, 0);
    // If the application only issues asynchronous requests, the process could do
    // without an allocated and managed receive buffer.
    // binder_mmap() allows at most 4MB of virtual address space to be mapped,
    // and the application is granted only read access to that memory.
    // * filp ==> proc (filp->private_data) : lives in kernel virtual address space
    // * vma  ==> lives in the process's user virtual address space
    int ret;
    struct vm_struct *area;
    struct binder_proc *proc = filp->private_data; // binder_proc of the current process
    const char *failure_string;
    struct binder_buffer *buffer; // the structure binder uses to hold the data of one one-way transfer

    if ((vma->vm_end - vma->vm_start) > SZ_4M)
        vma->vm_end = vma->vm_start + SZ_4M; // the mapped region may not exceed 4MB

    // #define FORBIDDEN_MMAP_FLAGS (VM_WRITE)
    if (vma->vm_flags & FORBIDDEN_MMAP_FLAGS) { // the region may only be mapped read-only
        ret = -EPERM;
        failure_string = "bad vm_flags";
        goto err_bad_arg;
    }

    // VM_DONTCOPY: do not copy this vma on fork(); clearing VM_MAYWRITE keeps the
    // mapping from ever being made writable later (e.g. via mprotect())
    vma->vm_flags = (vma->vm_flags | VM_DONTCOPY) & ~VM_MAYWRITE;

    if (proc->buffer) { // binder_proc.buffer holds the kernel virtual address of the mapping
        ret = -EBUSY;
        failure_string = "already mapped"; // mmap has already been called for this process
        goto err_already_mapped;
    }

    // get_vm_area(): reserve a contiguous region of kernel virtual address space
    area = get_vm_area(vma->vm_end - vma->vm_start, VM_IOREMAP);
    if (area == NULL) {
        ... // error handling
    }
}
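The excerpt stops here. For completeness, in a representative version of the driver binder_mmap() continues roughly as follows; the field names, the binder_update_page_range() and binder_insert_free_buffer() helpers, and the error label are taken from a typical binder.c and may differ between kernel releases. This is where the one-copy mechanism is completed: the same physical pages become visible both at proc->buffer in kernel space and at vma->vm_start in the receiver's user space.
    // sketch of the remainder, not part of the excerpt above
    proc->buffer = area->addr;                                   // kernel virtual address of the area
    proc->user_buffer_offset = vma->vm_start - (uintptr_t)proc->buffer; // fixed user/kernel VA delta

    proc->pages = kzalloc(sizeof(proc->pages[0]) *
                          ((vma->vm_end - vma->vm_start) / PAGE_SIZE),
                          GFP_KERNEL);                           // one struct page * per mapped page
    proc->buffer_size = vma->vm_end - vma->vm_start;

    vma->vm_ops = &binder_vm_ops;    // open/close callbacks for this vma
    vma->vm_private_data = proc;

    // Allocate physical pages for the first page of the area and map them into
    // BOTH the kernel region (map_vm_area) and the user vma (vm_insert_page);
    // the rest of the area is populated lazily as buffers are handed out.
    if (binder_update_page_range(proc, 1, proc->buffer,
                                 proc->buffer + PAGE_SIZE, vma)) {
        ret = -ENOMEM;
        failure_string = "alloc small buf";
        goto err_alloc_small_buf_failed;
    }

    // the whole mapping initially forms one large free binder_buffer
    buffer = proc->buffer;
    INIT_LIST_HEAD(&proc->buffers);
    list_add(&buffer->entry, &proc->buffers);
    buffer->free = 1;
    binder_insert_free_buffer(proc, buffer);
    proc->free_async_space = proc->buffer_size / 2; // async transactions may use at most half
    ...
    return 0;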