The Lustre Book.
Hacker's Guide to Lustre Filesystem.
=====================================
Arshad Hussain arshad.super@gmail.com
=====================================
Lustre FS from 100 feet: the design, and how the individual components interact.
            -----------------------------
            |      MGS/MDS Module       |
            |                           |
            |        -----------        |
            |MGT01 <-|         |->      |
            |MDT01 <-| Shared  |->      |
            |   |    | Storage |    |   |
            |   |    |         |    |   |
            |   |    -----------    |   |
            |   |                   |   |
            |  ----------- -----------  |
            |  |  MDS01  | |  MDS02  |  |
            |  | Active  | | Standby |  |
            |  ----------- -----------  |
            -----------------------------
                          |
 -----------              |
 |         |              |
 | Clients |              |   -----------------------------
 |         |--------------+---|         OSS Module        |
 -----------    NETWORK       |                           |
                              |        -----------        |
                              |OST01 <-|         |-> OST02|
                              |      <-| Shared  |->      |
                              |   |    | Storage |    |   |
                              |   |    |         |    |   |
                              |   |    -----------    |   |
                              |   |                   |   |
                              |  ----------- -----------  |
                              |  |  OSS01  | |  OSS02  |  |
                              |  | Active  | | Active  |  |
                              |  ----------- -----------  |
                              -----------------------------
Lustre is a distributed filesystem in which individual servers, each with
local disks, are tied together to present a single unified storage pool that
works coherently. Data to be written or read is broken into smaller subsets
(objects, or data stripes) and stored across the cluster, which increases
throughput because reads and writes proceed in parallel.
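A quick way to see this aggregation from a client is 'lfs df', which lists
every MDT and OST behind the mount point (assuming the filesystem is mounted
at /mnt/lustre, as in the examples later in this text):
# Show per-target usage; the filesystem_summary line is the sum of all OSTs -
# the single unified store the client sees.
lfs df -h /mnt/lustre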
Lustre is composed of 7 important components.
=============================================
1 Client : Combines all the server components and presents a POSIX view of
    the mountable filesystem. Initiates data requests to read or write.
2 MGS : Management Server - Maintains the cluster configuration.
3 MGT : Management Target - The storage target for the MGS.
4 MDS : Metadata Server - Manages access to FS metadata (inodes, paths,
    permissions).
5 MDT : Metadata Target - An ldiskfs-formatted disk where the actual
    inodes/directories are stored. This holds the FS metadata.
6 OSS : Object Storage Server - Manages access to FS data.
7 OST : Object Storage Target - Holds the actual FS data.
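For a feel of how these pieces get wired together, here is a minimal
single-node setup sketch. The device names and the NID 10.0.0.1@tcp are
illustrative, not taken from this document:
# Combined MGS+MDS: one disk holds both the MGT and MDT0.
mkfs.lustre --fsname=lustre --mgs --mdt --index=0 /dev/sdb
mount -t lustre /dev/sdb /mnt/mdt
# One OST, registered with the MGS.
mkfs.lustre --fsname=lustre --ost --index=0 --mgsnode=10.0.0.1@tcp /dev/sdc
mount -t lustre /dev/sdc /mnt/ost0
# Client mount: combines all the servers into one POSIX filesystem.
mount -t lustre 10.0.0.1@tcp:/lustre /mnt/lustre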
PTLRPC
======
All components connect through the LNET layer (an abstraction over TCP/IP or
InfiniBand). PTLRPC is built on top of LNET and handles all client-server
communication.
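LNET can be poked at directly with lctl; a sketch (the peer NID below is
illustrative):
# List this node's LNET identifiers (NIDs).
lctl list_nids
# Check LNET-level reachability of a peer, below the RPC layer.
lctl ping 10.0.0.1@tcp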
LDLM
====
Filesystem coherency is achieved using the Lustre Distributed Lock Manager
(LDLM). LDLM provides a unified view of locks to all participating servers
and clients, so all access to resources (metadata, data) is serialised. LDLM
works by maintaining global lock queues: every time a lock needs to be
obtained it is enqueued, and once the lock is relinquished by its holder it
is dequeued. There are three global queues -
1. Granted    - Locks that have been successfully taken.
2. Converting - Locks that are being converted (e.g. demoted).
3. Blocked    - Locks that cannot be granted yet.
LDLM uses 'callbacks', or ASTs, to notify a process that the work is done or
that some attention is required (e.g. a blocking AST asks a client to drop a
lock that another client is waiting on).
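On a live system the LDLM state can be sampled through tunable parameters;
the exact parameter paths vary a little between Lustre versions:
# Locks currently held, per namespace (one namespace per target/client).
lctl get_param ldlm.namespaces.*.lock_count
# Locks in the granted state, as tracked by the lock pool.
lctl get_param ldlm.namespaces.*.pool.granted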
Replay & Failover
=================
Disks are protected by RAID; Lustre has no control over that. Node failover
(HA), however, is managed around Lustre. A heartbeat is set up between the
primary and secondary nodes using either Pacemaker or Corosync. If the
primary fails, there is first an attempt to umount() the primary; if that
takes too long, the primary is STONITHed (killed forcefully). Meanwhile the
target is mounted on the secondary node. Upon successful re-connection, the
client checks its pending TIDs (Transaction IDs) against the server's
last_rcvd record and replays every request with a TID greater than or equal
to the last committed TID.
Client                                   Server
======                                   ======
00. Client sends an RPC request
    tagged with an XID.
                                         01. Server uses the XID as the
                                             descriptor for the requested
                                             task.
                                         02. Server sends back a TID.
03. Client queues the TID and waits
    for a completion event on the
    XID.
                                         04. Meanwhile the server executes
                                             the request.
                                         05. After completion or failure of
                                             the request, the server sends
                                             back a response with the return
                                             status.
06. Client checks the response and
    drops the TID from its queue.
    (The job is completed.)
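While replay is running the servers expose its progress, so after a failover
the new primary can be watched from the server side (parameter names may
differ slightly by version):
# MDT recovery: status, connected/evicted clients, replayed requests.
lctl get_param mdt.*.recovery_status
# The same for the OSTs.
lctl get_param obdfilter.*.recovery_status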
LLOG
====
Llogs, or Lustre logs, are Lustre's transaction logs. They help maintain an
'atomic' commit across a distributed transaction - for example, an unlink
that starts on the MDS and completes on the OSS. Llogs also hold the
cluster-wide configuration.
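The configuration llogs can be inspected from the MGS node. A sketch,
assuming the filesystem is named 'lustre':
# List the catalog of config logs held by the MGS.
lctl --device MGS llog_catlist
# Dump one of them, e.g. the client configuration log.
lctl --device MGS llog_print lustre-client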
Layout & objects
================
Layouts are maps that point a combination of IO offset and size to one or
more objects, which may or may not reside on separate OSTs. Layouts are kept
in an EA (extended attribute). At file create or open time, the MDT reads the
layout and maps it through the Object Index (OI) to retrieve the FID. The FID
then carries all the information needed to do IO independently of the MDT.
Objects are the storehouses for data within an OST; they are stored under
O/0/d* on the OSTs. At file creation time the number of objects can be
specified in the form of a 'stripe count'. The layout is then frozen until
the file is migrated (see PFL). If the stripe count is greater than 1, each
object holds part of the file, which also increases parallelism during the
actual IO. If the stripe count is 1, the single object is the whole file -
not a clever layout choice when there is a lot of data to be written or read.
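The stripe count and stripe size are chosen per file (or inherited from the
parent directory) at create time. A sketch, with illustrative values:
# Create a file striped over 2 OSTs, 1 MiB per stripe.
lfs setstripe -c 2 -S 1M /mnt/lustre/striped_file
# Read the layout back from the EA.
lfs getstripe /mnt/lustre/striped_file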
system call path under Lustre
=============================
Client                                   Server
======                                   ======
00. Client makes a POSIX system call.
01. llite/file.c -> entry point. Traps
    the call and does a sanity check;
    if the call is not sane it returns
    an error (e.g. -ENOTSUPP).
02. llite/vvp_io.c -> takes locks on
    the range, if any, and initiates
    CLIO for the IO (the IO loops
    within...), finally calling
    cl_io_fini().
03. Before cl_io_fini() - calls
    osc_io.c to prepare the RPC and
    send it across to the server. At
    this point the vvp_object (inode)
    and cl_page (page) are prepared.
                                         04. Server receives the RPC and does
                                             a sanity check on it.
                                         05. Through dt_object (which points
                                             to the ldiskfs operations),
                                             ldiskfs is invoked to make the
                                             actual FS modification.
                                         06. Server returns the RPC reply.
07. Client receives the reply, takes
    the appropriate action on it, and
    passes the result on to the
    application.
The Write Path:
12:37:00 mrpel7 kernel: [<ffffffff81633ec4>] dump_stack+0x19/0x1b
12:37:00 mrpel7 kernel: [<ffffffffa10aa0e7>] vvp_io_fini+0x337/0x830 [lustre]
12:37:00 mrpel7 kernel: [<ffffffffa0afdf42>] ? lov_io_fini+0x282/0x460 [lov]
12:37:00 mrpel7 kernel: [<ffffffffa0683c25>] cl_io_fini+0x75/0x240 [obdclass]
12:37:00 mrpel7 kernel: [<ffffffffa105b64c>] ll_file_io_generic+0x2fc/0xb00 [lustre]
12:37:00 mrpel7 kernel: [<ffffffffa105c12d>] ll_file_aio_write+0x12d/0x1f0 [lustre]
12:37:00 mrpel7 kernel: [<ffffffffa105c2be>] ll_file_write+0xce/0x1e0 [lustre]
12:37:00 mrpel7 kernel: [<ffffffff811de5fd>] vfs_write+0xbd/0x1e0
12:37:00 mrpel7 kernel: [<ffffffff811df09f>] SyS_write+0x7f/0xe0
12:37:00 mrpel7 kernel: [<ffffffff816448c9>] system_call_fastpath+0x16/0x1b
How write works in lustre: Sample use case.
===========================================
# From the client side, start I/O. Effectively we are writing 512*10 = 5120
# bytes: a 512-byte block, 10 times.
yes 'k' | dd of=/mnt/lustre/five bs=512 count=10
# Get the name of the file, its FID, and the open/create intent flags.
00000002:00010000:0.0:1463692949.850728:0:20027:0:(mdc_locks.c:1260:mdc_intent_lock()) (name: five,[0x200000400:0x1:0x0]) in obj [0x200000007:0x1:0x0], intent: open|creat flags 0101102
# Starts with vvp_io_write_start()
00000080:00200000:0.0:1463692949.851794:0:20027:0:(vvp_io.c:806:vvp_io_write_start()) write: nob 512, result: 512
# Take an extent lock on the range.
# Then call a write. The IO type (iot) is 1 for WRITE and 0 for READ;
# ll_file_io_generic() has returned 512 (one block of dd's block size).
00000080:00200000:0.0:1463692949.851811:0:20027:0:(file.c:1234:ll_file_io_generic()) iot: 1, result: 512
# Start CLIO - this will loop until done and pass the IO to the OSC, where
# the RPC will be created and sent to the server.
00000020:00200000:0.0:1463692949.851815:0:20027:0:(cl_io.c:238:cl_io_rw_init()) header@ffff880010162d20[0x0, 3, [0x200000400:0x1:0x0] hash]
# Identify the range to write
00000020:00200000:0.0:1463692949.851815:0:20027:0:(cl_io.c:238:cl_io_rw_init()) io range: 1 [3584, 4096) 0 0
# Start write
00000080:00200000:0.0:1463692949.851834:0:20027:0:(vvp_io.c:791:vvp_io_write_start()) write: [3584, 4096)
# The write starts with ll_write_begin().
# Here 3584 is where the write will start and 512 is the length of the
# write (in dd's case, its block size of 512).
00000080:00200000:0.0:1463692949.851835:0:20027:0:(rw26.c:548:ll_write_begin()) Writing 0 of 3584 to 512 bytes
# The write ends with ll_write_end(). 'queued page: 1' indicates that one
# page worth of data was queued; in our dd's case that is 512 bytes. Had the
# write been 8192 bytes, it would span write size / page size = 8192/4096 = 2
# pages, and 2 pages would be queued (looping twice).
00000080:00200000:0.0:1463692949.851842:0:20027:0:(rw26.c:677:ll_write_end()) queued page: 1.
# Commit write (or queue write) - the async page is submitted for write.
# Remember, a sync write would behave differently and be written directly.
00000080:00200000:0.0:1463692949.851842:0:20027:0:(vvp_io.c:700:vvp_io_write_commit()) commit async pages: 1, from 3584, to 4096
# The actual write goes to disk.
00000080:00200000:0.0:1463692949.851843:0:20027:0:(vvp_io.c:722:vvp_io_write_commit()) Committed 1 pages 512 bytes, tot: 512
# Finally, update the inode with the newly written size by calling
# ll_merge_lvb().
00000080:00200000:0.0:1463692949.851844:0:20027:0:(file.c:1049:ll_merge_lvb()) [0x200000400:0x1:0x0] updating i_size 4096
# ll_file_io_generic() drives the write system call and returns the bytes
# written or a negative errno. It starts with ppos = 0, and ppos increments
# by 512 as dd writes in 512-byte chunks.
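Debug lines like the ones above can be captured on any client; a sketch (the
dump file name is illustrative, and debug=-1 is very noisy):
# Enable all debug messages, run the IO, then dump the kernel debug buffer.
lctl set_param debug=-1
yes 'k' | dd of=/mnt/lustre/five bs=512 count=10
lctl dk /tmp/ldebug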
Mapping Inode to Data
=====================
This form of debugging is required when there is suspicion that the layout is
corrupted. Given just the pathname, we can get at the data contents. This
type of debugging also verifies that the metadata (inode) correctly maps to
the data.
# Given just the pathname.
[root@mrpel7 lustre-release]# stat /mnt/lustre/five
File: '/mnt/lustre/five'
Size: 9 Blocks: 1 IO Block: 4194304 regular file
Device: 2c54f966h/743766374d Inode: 144115205272502273 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2017-12-03 17:26:20.000000000 +0530
Modify: 2017-12-03 17:26:36.000000000 +0530
Change: 2017-12-03 17:26:36.000000000 +0530
Birth: -
# Get object-id of the file
[root@mrpel7 lustre-release]# lfs getstripe /mnt/lustre/five
/mnt/lustre/five
lmm_stripe_count: 1
lmm_stripe_size: 1048576
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: 0
obdidx objid objid group
0 2 0x2 0
# Get the number of OSTs. The object for file 'five' could be on any OST.
[root@mrpel7 lustre-release]# mount | grep lustre-ost | cut -d' ' -f1
/tmp/lustre-ost1
/tmp/lustre-ost2
# Find the backing disk mapping. Object 2 will be either under /dev/loop1 or
# /dev/loop2.
[root@mrpel7 lustre-release]# losetup -a
/dev/loop0: [2050]:2762852 (/tmp/lustre-mdt1)
/dev/loop1: [2050]:2762853 (/tmp/lustre-ost1)
/dev/loop2: [2050]:2762854 (/tmp/lustre-ost2)
# Lustre keeps striping info in the EA of the inode. The EA contains the
# list of all object IDs and their object indexes (OST indexes). Lustre uses
# the object ID as the file name on the OST. If the object ID is 'X', the
# formula O/0/d$((X % 32))/X gives the path (and from there the extents).
# In this example we replace 'X' with 2.
[root@mrpel7 lustre-release]# debugfs -c -R "stat O/0/d$((2 % 32))/2" /dev/loop1
debugfs 1.42.13.wc4 (28-Nov-2015)
/dev/loop1: catastrophic mode - not reading inode or group bitmaps
Inode: 101 Type: regular Mode: 0666 Flags: 0x80000
Generation: 3829217146 Version: 0x00000001:00000002
User: 0 Group: 0 Size: 9
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 8
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x5a23e674:00000000 -- Sun Dec 3 17:26:36 2017
atime: 0x5a23e664:00000000 -- Sun Dec 3 17:26:20 2017
mtime: 0x5a23e674:00000000 -- Sun Dec 3 17:26:36 2017
crtime: 0x5a23e63f:c93ffc9c -- Sun Dec 3 17:25:43 2017
Size of extra inode fields: 32
Extended attributes stored in inode body:
invalid EA entry in inode
EXTENTS:
(0):33113
# We now know the inode is 101 and the block is 33113.
# We could also use the 'blocks' subcommand with the inode to get the blocks.
[root@mrpel7 lustre-release]# debugfs -c -R "blocks <101>" /dev/loop1
debugfs 1.42.13.wc4 (28-Nov-2015)
/dev/loop1: catastrophic mode - not reading inode or group bitmaps
33113
# We can now 'dd' from that location and check whether the contents are what
# we are looking for: read from device /dev/loop1, skipping 33113 blocks of
# the 4096-byte block size (33113 * 4096 gives the byte offset).
# dd if=/dev/loop1 of=output_file bs=4096 count=1 skip=33113
# Check the contents
[root@mrpel7 lustre-release]# xxd output_file | head -7
0000000: 4c75 7374 7265 4653 0a00 0000 0000 0000 LustreFS........
0000010: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0000020: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0000030: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0000040: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0000050: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0000060: 0000 0000 0000 0000 0000 0000 0000 0000 ................
# Check the contents POSIX style
[root@mrpel7 lustre-release]# cat /mnt/lustre/five
LustreFS