The Lustre Book.
Hacker's Guide to Lustre Filesystem.
=====================================
Arshad Hussain arshad.super@gmail.com
=====================================
Lustre FS from 100 feet: the design, and how the individual components interact.
            -----------------------------
            |      MGS/MDS Module       |
            |                           |
            |        -----------        |
            |MGT01 <-|         |->      |
            |MDT01 <-| Shared  |->      |
            |   |    | Storage |    |   |
            |   |    |         |    |   |
            |   |    -----------    |   |
            |   |                   |   |
            |  ----------- -----------  |
            |  |  MDS01  | |  MDS02  |  |
            |  | Active  | | Standby |  |
            |  ----------- -----------  |
            -----------------------------
                          |
 -----------              |
 |         |              |
 | Clients |              |   -----------------------------
 |         |--------------+---|         OSS Module        |
 -----------    NETWORK       |                           |
                              |        -----------        |
                              |OST01 <-|         |-> OST02|
                              |      <-| Shared  |->      |
                              |   |    | Storage |    |   |
                              |   |    |         |    |   |
                              |   |    -----------    |   |
                              |   |                   |   |
                              |  ----------- -----------  |
                              |  |  OSS01  | |  OSS02  |  |
                              |  | Active  | | Active  |  |
                              |  ----------- -----------  |
                              -----------------------------
Lustre is a distributed filesystem in which individual servers, each with
local disks, are tied together to present a single unified storage pool that
works coherently. Data to be written or read is broken into smaller subsets
(objects, or data stripes) and stored across the cluster, which increases
throughput because reads and writes proceed in parallel.
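A quick way to see this aggregation from a client is 'lfs df', which lists
every MDT and OST behind the mount point (assuming the filesystem is mounted
at /mnt/lustre, as in the examples later in this text):
# Show per-target usage; the filesystem_summary line is the sum of all OSTs -
# the single unified store the client sees.
lfs df -h /mnt/lustre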
Lustre is composed of 7 important components.
=============================================
1 Client : Combines all the server components and presents a POSIX view of
    the mountable filesystem. Initiates data requests to read or write.
2 MGS : Management Server - Maintains the cluster configuration.
3 MGT : Management Target - The storage target for the MGS.
4 MDS : Metadata Server - Manages access to FS metadata (inodes, paths,
    permissions).
5 MDT : Metadata Target - An ldiskfs-formatted disk where the actual
    inodes/directories are stored. This holds the FS metadata.
6 OSS : Object Storage Server - Manages access to FS data.
7 OST : Object Storage Target - Holds the actual FS data.
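For a feel of how these pieces get wired together, here is a minimal
single-node setup sketch. The device names and the NID 10.0.0.1@tcp are
illustrative, not taken from this document:
# Combined MGS+MDS: one disk holds both the MGT and MDT0.
mkfs.lustre --fsname=lustre --mgs --mdt --index=0 /dev/sdb
mount -t lustre /dev/sdb /mnt/mdt
# One OST, registered with the MGS.
mkfs.lustre --fsname=lustre --ost --index=0 --mgsnode=10.0.0.1@tcp /dev/sdc
mount -t lustre /dev/sdc /mnt/ost0
# Client mount: combines all the servers into one POSIX filesystem.
mount -t lustre 10.0.0.1@tcp:/lustre /mnt/lustre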
PTLRPC
======
All components connect through the LNET layer (an abstraction over TCP/IP or
InfiniBand). PTLRPC is built on top of LNET and handles all client-server
communication.
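LNET can be poked at directly with lctl; a sketch (the peer NID below is
illustrative):
# List this node's LNET identifiers (NIDs).
lctl list_nids
# Check LNET-level reachability of a peer, below the RPC layer.
lctl ping 10.0.0.1@tcp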
LDLM
====
Filesystem coherency is achieved using the Lustre Distributed Lock Manager
(LDLM). LDLM provides a unified view of locks to all participating servers
and clients, so all access to resources (metadata, data) is serialised. LDLM
works by maintaining global lock queues: every time a lock needs to be
obtained it is enqueued, and once the lock is relinquished by its holder it
is dequeued. There are three global queues -
1. Granted    - Locks that have been successfully taken.
2. Converting - Locks that are being converted (e.g. demoted).
3. Blocked    - Locks that cannot be granted yet.
LDLM uses 'callbacks', or ASTs, to notify a process that the work is done or
that some attention is required (e.g. a blocking AST asks a client to drop a
lock that another client is waiting on).
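On a live system the LDLM state can be sampled through tunable parameters;
the exact parameter paths vary a little between Lustre versions:
# Locks currently held, per namespace (one namespace per target/client).
lctl get_param ldlm.namespaces.*.lock_count
# Locks in the granted state, as tracked by the lock pool.
lctl get_param ldlm.namespaces.*.pool.granted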
Replay & Failover
=================
Disks are protected by RAID; Lustre has no control over that. Node failover
(HA), however, is managed around Lustre. A heartbeat is set up between the
primary and secondary nodes using either Pacemaker or Corosync. If the
primary fails, there is first an attempt to umount() the primary; if that
takes too long, the primary is STONITHed (killed forcefully). Meanwhile the
target is mounted on the secondary node. Upon successful re-connection, the
client checks its pending TIDs (Transaction IDs) against the server's
last_rcvd record and replays every request with a TID greater than or equal
to the last committed TID.
Client                                   Server
======                                   ======
00. Client sends an RPC request
    tagged with an XID.
                                         01. Server uses the XID as the
                                             descriptor for the requested
                                             task.
                                         02. Server sends back a TID.
03. Client queues the TID and waits
    for a completion event on the
    XID.
                                         04. Meanwhile the server executes
                                             the request.
                                         05. After completion or failure of
                                             the request, the server sends
                                             back a response with the return
                                             status.
06. Client checks the response and
    drops the TID from its queue.
    (The job is completed.)
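While replay is running the servers expose its progress, so after a failover
the new primary can be watched from the server side (parameter names may
differ slightly by version):
# MDT recovery: status, connected/evicted clients, replayed requests.
lctl get_param mdt.*.recovery_status
# The same for the OSTs.
lctl get_param obdfilter.*.recovery_status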
LLOG
====
Llogs, or Lustre logs, are Lustre's transaction logs. They help maintain an
'atomic' commit across a distributed transaction - for example, an unlink
that starts on the MDS and completes on the OSS. Llogs also hold the
cluster-wide configuration.
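The configuration llogs can be inspected from the MGS node. A sketch,
assuming the filesystem is named 'lustre':
# List the catalog of config logs held by the MGS.
lctl --device MGS llog_catlist
# Dump one of them, e.g. the client configuration log.
lctl --device MGS llog_print lustre-client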
Layout & objects
================
Layouts are maps that point a combination of IO offset and size to one or
more objects, which may or may not reside on separate OSTs. Layouts are kept
in an EA (extended attribute). At file create or open time, the MDT reads the
layout and maps it through the Object Index (OI) to retrieve the FID. The FID
then carries all the information needed to do IO independently of the MDT.
Objects are the storehouses for data within an OST; they are stored under
O/0/d* on the OSTs. At file creation time the number of objects can be
specified in the form of a 'stripe count'. The layout is then frozen until
the file is migrated (see PFL). If the stripe count is greater than 1, each
object holds part of the file, which also increases parallelism during the
actual IO. If the stripe count is 1, the single object is the whole file -
not a clever layout choice when there is a lot of data to be written or read.
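The stripe count and stripe size are chosen per file (or inherited from the
parent directory) at create time. A sketch, with illustrative values:
# Create a file striped over 2 OSTs, 1 MiB per stripe.
lfs setstripe -c 2 -S 1M /mnt/lustre/striped_file
# Read the layout back from the EA.
lfs getstripe /mnt/lustre/striped_file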
system call path under Lustre
=============================
Client                                   Server
======                                   ======
00. Client makes a POSIX system call.
01. llite/file.c -> entry point. Traps
    the call and does a sanity check;
    if the call is not sane it returns
    an error (e.g. -ENOTSUPP).
02. llite/vvp_io.c -> takes locks on
    the range, if any, and initiates
    CLIO for the IO (the IO loops
    within...), finally calling
    cl_io_fini().
03. Before cl_io_fini() - calls
    osc_io.c to prepare the RPC and
    send it across to the server. At
    this point the vvp_object (inode)
    and cl_page (page) are prepared.
                                         04. Server receives the RPC and does
                                             a sanity check on it.
                                         05. Through dt_object (which points
                                             to the ldiskfs operations),
                                             ldiskfs is invoked to make the
                                             actual FS modification.
                                         06. Server returns the RPC reply.
07. Client receives the reply, takes
    the appropriate action on it, and
    passes the result on to the
    application.
The Write Path:
12:37:00 mrpel7 kernel: [<ffffffff81633ec4>] dump_stack+0x19/0x1b
12:37:00 mrpel7 kernel: [<ffffffffa10aa0e7>] vvp_io_fini+0x337/0x830 [lustre]
12:37:00 mrpel7 kernel: [<ffffffffa0afdf42>] ? lov_io_fini+0x282/0x460 [lov]
12:37:00 mrpel7 kernel: [<ffffffffa0683c25>] cl_io_fini+0x75/0x240 [obdclass]
12:37:00 mrpel7 kernel: [<ffffffffa105b64c>] ll_file_io_generic+0x2fc/0xb00 [lustre]
12:37:00 mrpel7 kernel: [<ffffffffa105c12d>] ll_file_aio_write+0x12d/0x1f0 [lustre]
12:37:00 mrpel7 kernel: [<ffffffffa105c2be>] ll_file_write+0xce/0x1e0 [lustre]
12:37:00 mrpel7 kernel: [<ffffffff811de5fd>] vfs_write+0xbd/0x1e0
12:37:00 mrpel7 kernel: [<ffffffff811df09f>] SyS_write+0x7f/0xe0
12:37:00 mrpel7 kernel: [<ffffffff816448c9>] system_call_fastpath+0x16/0x1b
How write works in lustre: Sample use case.
===========================================
# From the client side, start I/O. Effectively we are writing 512*10 = 5120
# bytes: a 512-byte block, 10 times.
yes 'k' | dd of=/mnt/lustre/five bs=512 count=10
# Get the name of the file, its FID, and the open/create intent flags.
00000002:00010000:0.0:1463692949.850728:0:20027:0:(mdc_locks.c:1260:mdc_intent_lock()) (name: five,[0x200000400:0x1:0x0]) in obj [0x200000007:0x1:0x0], intent: open|creat flags 0101102
# Starts with vvp_io_write_start()
00000080:00200000:0.0:1463692949.851794:0:20027:0:(vvp_io.c:806:vvp_io_write_start()) write: nob 512, result: 512
# Take an extent lock on the range.
# Then call a write. The IO type (iot) is 1 for WRITE and 0 for READ;
# ll_file_io_generic() has returned 512 (one block of dd's block size).
00000080:00200000:0.0:1463692949.851811:0:20027:0:(file.c:1234:ll_file_io_generic()) iot: 1, result: 512
# Start CLIO - this will loop until done and pass the IO to the OSC, where
# the RPC will be created and sent to the server.
00000020:00200000:0.0:1463692949.851815:0:20027:0:(cl_io.c:238:cl_io_rw_init()) header@ffff880010162d20[0x0, 3, [0x200000400:0x1:0x0] hash]
# Identify the range to write
00000020:00200000:0.0:1463692949.851815:0:20027:0:(cl_io.c:238:cl_io_rw_init()) io range: 1 [3584, 4096) 0 0
# Start write
00000080:00200000:0.0:1463692949.851834:0:20027:0:(vvp_io.c:791:vvp_io_write_start()) write: [3584, 4096)
# The write starts with ll_write_begin().
# Here 3584 is where the write will start and 512 is the length of the
# write (in dd's case, its block size of 512).
00000080:00200000:0.0:1463692949.851835:0:20027:0:(rw26.c:548:ll_write_begin()) Writing 0 of 3584 to 512 bytes
# The write ends with ll_write_end(). 'queued page: 1' indicates that one
# page worth of data was queued; in our dd's case that is 512 bytes. Had the
# write been 8192 bytes, it would span write size / page size = 8192/4096 = 2
# pages, and 2 pages would be queued (looping twice).
00000080:00200000:0.0:1463692949.851842:0:20027:0:(rw26.c:677:ll_write_end()) queued page: 1.
# Commit write (or queue write) - the async page is submitted for write.
# Remember, a sync write would behave differently and be written directly.
00000080:00200000:0.0:1463692949.851842:0:20027:0:(vvp_io.c:700:vvp_io_write_commit()) commit async pages: 1, from 3584, to 4096
# The actual write goes to disk.
00000080:00200000:0.0:1463692949.851843:0:20027:0:(vvp_io.c:722:vvp_io_write_commit()) Committed 1 pages 512 bytes, tot: 512
# Finally, update the inode with the newly written size by calling
# ll_merge_lvb().
00000080:00200000:0.0:1463692949.851844:0:20027:0:(file.c:1049:ll_merge_lvb()) [0x200000400:0x1:0x0] updating i_size 4096
# ll_file_io_generic() drives the write system call and returns the bytes
# written or a negative errno. It starts with ppos = 0, and ppos increments
# by 512 as dd writes in 512-byte chunks.
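Debug lines like the ones above can be captured on any client; a sketch (the
dump file name is illustrative, and debug=-1 is very noisy):
# Enable all debug messages, run the IO, then dump the kernel debug buffer.
lctl set_param debug=-1
yes 'k' | dd of=/mnt/lustre/five bs=512 count=10
lctl dk /tmp/ldebug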
Mapping Inode to Data
=====================
This form of debugging is required when there is suspicion that the layout is
corrupted. Given just the pathname, we can get at the data contents. This
type of debugging also verifies that the metadata (inode) correctly maps to
the data.
# Given just the pathname.
[root@mrpel7 lustre-release]# stat /mnt/lustre/five
File: '/mnt/lustre/five'
Size: 9 Blocks: 1 IO Block: 4194304 regular file
Device: 2c54f966h/743766374d Inode: 144115205272502273 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2017-12-03 17:26:20.000000000 +0530
Modify: 2017-12-03 17:26:36.000000000 +0530
Change: 2017-12-03 17:26:36.000000000 +0530
Birth: -
# Get object-id of the file
[root@mrpel7 lustre-release]# lfs getstripe /mnt/lustre/five
/mnt/lustre/five
lmm_stripe_count: 1
lmm_stripe_size: 1048576
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: 0
obdidx objid objid group
0 2 0x2 0
# Get the number of OSTs. The object for file 'five' could be on any OST.
[root@mrpel7 lustre-release]# mount | grep lustre-ost | cut -d' ' -f1
/tmp/lustre-ost1
/tmp/lustre-ost2
# Find the backing disk mapping. Object 2 will be either under /dev/loop1 or
# /dev/loop2.
[root@mrpel7 lustre-release]# losetup -a
/dev/loop0: [2050]:2762852 (/tmp/lustre-mdt1)
/dev/loop1: [2050]:2762853 (/tmp/lustre-ost1)
/dev/loop2: [2050]:2762854 (/tmp/lustre-ost2)
# Lustre keeps striping info in the EA of the inode. The EA contains the
# list of all object IDs and their object indexes (OST indexes). Lustre uses
# the object ID as the file name on the OST. If the object ID is 'X', the
# formula O/0/d$((X % 32))/X gives the path (and from there the extents).
# In this example we replace 'X' with 2.
[root@mrpel7 lustre-release]# debugfs -c -R "stat O/0/d$((2 % 32))/2" /dev/loop1
debugfs 1.42.13.wc4 (28-Nov-2015)
/dev/loop1: catastrophic mode - not reading inode or group bitmaps
Inode: 101 Type: regular Mode: 0666 Flags: 0x80000
Generation: 3829217146 Version: 0x00000001:00000002
User: 0 Group: 0 Size: 9
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 8
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x5a23e674:00000000 -- Sun Dec 3 17:26:36 2017
atime: 0x5a23e664:00000000 -- Sun Dec 3 17:26:20 2017
mtime: 0x5a23e674:00000000 -- Sun Dec 3 17:26:36 2017
crtime: 0x5a23e63f:c93ffc9c -- Sun Dec 3 17:25:43 2017
Size of extra inode fields: 32
Extended attributes stored in inode body:
invalid EA entry in inode
EXTENTS:
(0):33113
# We now know the inode is 101 and the block is 33113.
# We could also use the 'blocks' subcommand with the inode to get the blocks.
[root@mrpel7 lustre-release]# debugfs -c -R "blocks <101>" /dev/loop1
debugfs 1.42.13.wc4 (28-Nov-2015)
/dev/loop1: catastrophic mode - not reading inode or group bitmaps
33113
# We can now 'dd' from that location and check whether the contents are what
# we are looking for: read from device /dev/loop1, skipping 33113 blocks of
# the 4096-byte block size (33113 * 4096 gives the byte offset).
# dd if=/dev/loop1 of=output_file bs=4096 count=1 skip=33113
# Check the contents
[root@mrpel7 lustre-release]# xxd output_file | head -7
0000000: 4c75 7374 7265 4653 0a00 0000 0000 0000 LustreFS........
0000010: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0000020: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0000030: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0000040: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0000050: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0000060: 0000 0000 0000 0000 0000 0000 0000 0000 ................
# Check the contents POSIX style
[root@mrpel7 lustre-release]# cat /mnt/lustre/five
LustreFS