Sharing

Wednesday, September 28, 2011

Test Ceph RBD + iSCSI performance

# On local disk
root@ubuntu1104-64-5:/mnt$ dd if=/dev/zero of=testfile bs=4096 count=10000 conv=fdatasync
10000+0 records in
10000+0 records out
40960000 bytes (41 MB) copied, 4.41257 s, 9.3 MB/s

# Testing the rbd image locally
root@ubuntu1104-64-5:/mnt$ dd if=/dev/zero of=testfile bs=4096 count=10000 conv=fdatasync
10000+0 records in
10000+0 records out
40960000 bytes (41 MB) copied, 3.79785 s, 10.8 MB/s

# Testing rbd from the client

root@ubuntu1104-64-6:~# dd if=/dev/zero of=testfile bs=4096k count=100 conv=fdatasync
100+0 records in
100+0 records out
419430400 bytes (419 MB) copied, 4.2075 s, 99.7 MB/s
root@ubuntu1104-64-6:~# dd if=/dev/zero of=testfile bs=4096k count=1000 conv=fdatasync
1000+0 records in
1000+0 records out
4194304000 bytes (4.2 GB) copied, 40.8692 s, 103 MB/s




# local 
root@ubuntu1104-64-5:~# dd if=/dev/zero of=testfile bs=4096k count=100 conv=fdatasync
100+0 records in
100+0 records out
419430400 bytes (419 MB) copied, 4.59282 s, 91.3 MB/s
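Comparing runs by eye gets tedious; the MB/s figure can be pulled out of dd's summary line instead. A minimal helper, assuming the summary format shown in the transcripts above (the function name is made up for this sketch):

```shell
# parse_dd_speed: read a dd summary line on stdin and print the speed value.
# Assumed format (matches the outputs above):
#   "<bytes> bytes (<size>) copied, <secs> s, <speed> MB/s"
parse_dd_speed() {
  awk -F', ' '/copied/ { split($NF, a, " "); print a[1] }'
}

echo "419430400 bytes (419 MB) copied, 4.2075 s, 99.7 MB/s" | parse_dd_speed
# → 99.7
```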






Tuesday, September 20, 2011

Export Ceph RBD with iSCSI

I originally wanted to use LIO, but iSCSI target support didn't seem to be merged into Linux kernel 2.6.38 yet, so I switched to iET (iSCSI Enterprise Target) to try out the RBD (Rados Block Device) that Ceph provides.

If you are still unfamiliar with iSCSI, a couple of introductory articles are a good place to start.
For the configuration part, there are several guides worth consulting.


Step one: get familiar with iSCSI using a simple partition first.

Set up the Target

root@ubuntu1104-64-5:/dev$ apt-get install iscsitarget
root@ubuntu1104-64-5:/dev$ apt-get install open-iscsi

After installation a warning appears: iscsitarget not enabled in "/etc/default/iscsitarget", not starting... ... (warning). To fix this, edit /etc/default/iscsitarget and change "ISCSITARGET_ENABLE=false" to "ISCSITARGET_ENABLE=true", or the iSCSI target service will not start.
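That edit is a one-line sed; sketched here against a scratch copy of the file so it is safe to run anywhere (point it at /etc/default/iscsitarget, as root, to apply it for real):

```shell
# Work on a scratch copy; substitute the real path when applying for real.
f=/tmp/iscsitarget.default
printf 'ISCSITARGET_ENABLE=false\n' > "$f"

# Flip the enable flag so the target service will start.
sed -i 's/^ISCSITARGET_ENABLE=false/ISCSITARGET_ENABLE=true/' "$f"
cat "$f"
# → ISCSITARGET_ENABLE=true
```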

Next, configure iET's /etc/iet/ietd.conf:

iSNSServer 172.16.33.5
iSNSAccessControl No
Target iqn.2011-09.com.example:storage.lun1
# ubuntu1104--64--5-lvol1 is a partition carved out earlier; we export it via fileio
# Not sure yet what ScsiId and ScsiSN are for, so ignore them for now
Lun 0 Path=/dev/mapper/ubuntu1104--64--5-lvol1,Type=fileio,ScsiId=xyz,ScsiSN=xyz
 

root@ubuntu1104-64-5:/dev$ /etc/init.d/iscsitarget restart
 * Removing iSCSI enterprise target devices:
   ...done.
 * Stopping iSCSI enterprise target service:
   ...done.
 * Removing iSCSI enterprise target modules:
   ...done.
 * Starting iSCSI enterprise target service
   ...done.
   ...done.


Then set the allowed connections in /etc/iet/initiators.allow. The last line reads ALL ALL, which by default allows all initiators to all targets; for early testing we just leave it that way.

Set up the Initiator

First, set node.startup to automatic in /etc/iscsi/iscsid.conf.

# Restart the initiator service
root@ubuntu1104-64-6:/etc/iscsi$ /etc/init.d/open-iscsi restart
* Disconnecting iSCSI targets
...done.
* Stopping iSCSI initiator service
...done.
* Starting iSCSI initiator service iscsid
...done.
* Setting up iSCSI targets
...done.

# Then discover the iSCSI target we just exported
root@ubuntu1104-64-6:/etc/iscsi$ iscsiadm -m discovery -t st -p 172.16.33.5
172.16.33.5:3260,1 iqn.2011-09.com.example:storage.lun1

root@ubuntu1104-64-6:/etc/iscsi# iscsiadm -m node
172.16.33.5:3260,1 iqn.2011-09.com.example:storage.lun1

# If it connected, its directory should be visible here
root@ubuntu1104-64-6:/etc/iscsi$ ll /etc/iscsi/nodes/
total 12
drw------- 3 root root 4096 2011-09-20 16:06 ./
drwxr-xr-x 5 root root 4096 2011-09-20 15:20 ../
drw------- 3 root root 4096 2011-09-20 16:06 iqn.2011-09.com.example:storage.lun1/

# Log in to this node; the earlier steps normally log in automatically, so you may see "already exists" here
root@ubuntu1104-64-6:/etc/iscsi$ iscsiadm -m node -T iqn.2011-09.com.example:storage.lun1 -p 172.16.33.5 -l
Logging in to [iface: default, target: iqn.2011-09.com.example:storage.lun1, portal: 172.16.33.5,3260]
Login to [iface: default, target: iqn.2011-09.com.example:storage.lun1, portal: 172.16.33.5,3260]: successful

# Check with fdisk: a new sdb device appears, with a single sdb1 partition
root@ubuntu1104-64-6:/etc/iscsi$ fdisk -l
Disk /dev/sdb: 53.7 GB, 53687091200 bytes
64 heads, 32 sectors/track, 51200 cylinders
Units = cylinders of 2048 * 512 = 1048576 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xe4a1139a

Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1       51200    52428784   83  Linux


# Format the partition
root@ubuntu1104-64-6:/etc/iscsi$ mkfs.ext4 /dev/sdb1
mke2fs 1.41.14 (22-Dec-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
3276800 inodes, 13107196 blocks
655359 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
400 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424

Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 35 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.

# Mount it
root@ubuntu1104-64-6:/data$ mount /dev/sdb1 /data/scsi

# Check all current mounts
root@ubuntu1104-64-6:/data$ mount
/dev/mapper/ubuntu1104--64--6-root on / type ext4 (rw,errors=remount-ro)
proc on /proc type proc (rw,noexec,nosuid,nodev)
none on /sys type sysfs (rw,noexec,nosuid,nodev)
fusectl on /sys/fs/fuse/connections type fusectl (rw)
none on /sys/kernel/debug type debugfs (rw)
none on /sys/kernel/security type securityfs (rw)
none on /dev type devtmpfs (rw,mode=0755)
none on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
none on /dev/shm type tmpfs (rw,nosuid,nodev)
none on /var/run type tmpfs (rw,nosuid,mode=0755)
none on /var/lock type tmpfs (rw,noexec,nosuid,nodev)
/dev/sda1 on /boot type ext2 (rw)
/dev/mapper/ubuntu1104--64--6-lvol0 on /data/osd.2 type btrfs (rw,noatime)
/dev/mapper/ubuntu1104--64--6-lvol1 on /data/osd.3 type btrfs (rw,noatime)
/dev/sdb1 on /data/scsi type ext4 (rw)

# Check partition usage
root@ubuntu1104-64-6:/data$ df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/ubuntu1104--64--6-root
47328184   2595704  42328336   6% /
none                  12358244       224  12358020   1% /dev
none                  12366300         0  12366300   0% /dev/shm
none                  12366300        48  12366252   1% /var/run
none                  12366300         0  12366300   0% /var/lock
/dev/sda1               233191     45272    175478  21% /boot
/dev/mapper/ubuntu1104--64--6-lvol0
52428800   1033620  49277220   3% /data/osd.2
/dev/mapper/ubuntu1104--64--6-lvol1
52428800   1034584  49276440   3% /data/osd.3
/dev/sdb1             51606124    184136  48800552   1% /data/scsi

To tear down the iSCSI connection:

iscsiadm -m node -T iqn.2011-09.com.example:storage.lun1 -p 172.16.33.5 -u
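Since login and logout differ only in the final flag (-l vs -u), both commands can come out of one tiny wrapper. A hypothetical sketch using this post's target and portal; it only prints the command lines:

```shell
# Hypothetical wrapper: build the iscsiadm node command for this post's target.
TARGET="iqn.2011-09.com.example:storage.lun1"
PORTAL="172.16.33.5"

iscsi_node_cmd() {  # $1 = -l (login) or -u (logout)
  echo "iscsiadm -m node -T $TARGET -p $PORTAL $1"
}

iscsi_node_cmd -l   # the login command used earlier
iscsi_node_cmd -u   # the logout command shown above
```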

Create Ceph RBD

Reference: http://ceph.newdream.net/wiki/RBD

# Create an rbd image, 5 GB in size
root@ubuntu1104-64-5:/dev/rbd/rbd$ rbd create foo --size 5120

# List what's currently in the rbd pool
root@ubuntu1104-64-5:/dev/rbd/rbd$ rbd list
foo

# Inspect foo via rbd
root@ubuntu1104-64-5:~$ rbd info foo
rbd image 'foo':
        size 5120 MB in 1280 objects
        order 22 (4096 KB objects)
        block_name_prefix: rb.0.3
        parent:  (pool -1)

# Check the rbd pool from rados; a foo.rbd object has appeared
root@ubuntu1104-64-5:~$ rados ls -p rbd
foo.rbd
rb.0.1.000000000000
rb.0.1.000000000001
rbd_directory
rbd_info

root@ubuntu1104-64-5:~$ modprobe rbd

# Register the image with the kernel so the device will be visible
root@ubuntu1104-64-5:/dev$ echo "172.16.33.5 name=admin,secret=AQDeRGdOMNL3MhAAuzvelwICjpYhLIk7IMcX2g== rbd foo" > /sys/bus/rbd/add
root@ubuntu1104-64-5:/dev$ mknod /dev/rbd0 b 254 0
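The string echoed into /sys/bus/rbd/add above follows the pattern `<mon_addr> name=<user>,secret=<key> <pool> <image>`. A small helper (hypothetical, with a placeholder key) that assembles it:

```shell
# Assemble a /sys/bus/rbd/add line: "<mon> name=<user>,secret=<key> <pool> <image>".
rbd_add_line() {
  mon=$1 user=$2 secret=$3 pool=$4 image=$5
  echo "$mon name=$user,secret=$secret $pool $image"
}

rbd_add_line 172.16.33.5 admin PLACEHOLDER_KEY rbd foo
# → 172.16.33.5 name=admin,secret=PLACEHOLDER_KEY rbd foo
# For real use (as root, with the rbd module loaded):
#   rbd_add_line <mon> <user> <key> <pool> <image> > /sys/bus/rbd/add
```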


# Check whether the rbd device showed up
root@ubuntu1104-64-5:~$ ls /sys/bus/rbd/devices
0
root@ubuntu1104-64-5:~$ ls /dev/rbd/rbd
foo:0

# Format rbd0
root@ubuntu1104-64-5:/dev$ mkfs -t ext3 /dev/rbd0
mke2fs 1.41.14 (22-Dec-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
327680 inodes, 1310720 blocks
65536 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=1342177280
40 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736

Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information:
done

# Mount it
root@ubuntu1104-64-5:/dev$ mount -t ext3 /dev/rbd0 /mnt

root@ubuntu1104-64-5:/dev$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/ubuntu1104--64--5-root
                       46G  6.6G   37G  16% /
none                   12G  228K   12G   1% /dev
none                   12G     0   12G   0% /dev/shm
none                   12G   64K   12G   1% /var/run
none                   12G     0   12G   0% /var/lock
/dev/sda1             228M   45M  172M  21% /boot
/dev/mapper/ubuntu1104--64--5-lvol2
                       50G  1.1G   47G   3% /data/osd.0
/dev/mapper/ubuntu1104--64--5-lvol0
                       50G  1.1G   47G   3% /data/osd.1
/dev/rbd0             5.0G  139M  4.6G   3% /mnt

# Testing finished; unmount it
root@ubuntu1104-64-5:/dev$ umount /mnt

Export RBD via iSCSI

See http://ceph.newdream.net/wiki/ISCSI for reference.
First, change the settings in /etc/iet/ietd.conf:

Target iqn.2011-09.net.newdream.ceph:rados.iscsi.001
# Remember to use blockio here, not fileio
        Lun 0 Path=/dev/rbd0,Type=blockio

Then restart the target and the initiator respectively; the steps are the same as above.

root@ubuntu1104-64-7:~$ fdisk -l

Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000c4797

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          32      248832   83  Linux
Partition 1 does not end on cylinder boundary.
/dev/sda2              32       60802   488134657    5  Extended
/dev/sda5              32       60802   488134656   8e  Linux LVM

Disk /dev/sdb: 5368 MB, 5368709120 bytes
166 heads, 62 sectors/track, 1018 cylinders
Units = cylinders of 10292 * 512 = 5269504 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

# Because blockio exports it as a raw device, there is no partition table on it
Disk /dev/sdb doesn't contain a valid partition table

# Mount it directly
root@ubuntu1104-64-7:~$ mount /dev/sdb /mnt

# Check the result: a new 5 GB sdb
root@ubuntu1104-64-7:~$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/ubuntu1104--64--7-root
                      435G   11G  402G   3% /
none                   12G  216K   12G   1% /dev
none                   12G     0   12G   0% /dev/shm
none                   12G   60K   12G   1% /var/run
none                   12G     0   12G   0% /var/lock
/dev/sda1             228M   45M  172M  21% /boot
/dev/sdb              5.0G  139M  4.6G   3% /mnt

That's pretty much it. Done!

Wednesday, September 7, 2011

Setup Ceph Cluster



Create the three logical volumes:
pjack@ubuntu1104-64-5:/etc/ceph$ sudo lvcreate -L 50G ubuntu1104-64-5
  Logical volume "lvol0" created
pjack@ubuntu1104-64-5:/etc/ceph$ sudo lvcreate -L 50G ubuntu1104-64-5
  Logical volume "lvol1" created
pjack@ubuntu1104-64-5:/etc/ceph$ sudo lvcreate -L 50G ubuntu1104-64-5
  Logical volume "lvol2" created



Format them as ext3, ext4, and btrfs respectively.

The first one is ext3:
pjack@ubuntu1104-64-5:/etc/ceph$ sudo mkfs -t ext3 /dev/ubuntu1104-64-5/lvol0
mke2fs 1.41.14 (22-Dec-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
3276800 inodes, 13107200 blocks
655360 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
400 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424

Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 23 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.

The second is ext4:
pjack@ubuntu1104-64-5:/etc/ceph$ sudo mkfs -t ext4 /dev/ubuntu1104-64-5/lvol1
mke2fs 1.41.14 (22-Dec-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
3276800 inodes, 13107200 blocks
655360 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
400 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424

Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 31 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.


The third is btrfs, which is also the officially recommended setup:
pjack@ubuntu1104-64-5:/etc/ceph$ sudo mkfs -t btrfs /dev/ubuntu1104-64-5/lvol2

WARNING! - Btrfs Btrfs v0.19 IS EXPERIMENTAL
WARNING! - see http://btrfs.wiki.kernel.org before using

fs created label (null) on /dev/ubuntu1104-64-5/lvol2
        nodesize 4096 leafsize 4096 sectorsize 4096 size 50.00GB
Btrfs Btrfs v0.19


# Mount the three volumes
pjack@ubuntu1104-64-5:/mnt$ sudo mount /dev/mapper/ubuntu1104--64--5-lvol0 /mnt/lvol0
pjack@ubuntu1104-64-5:/mnt$ sudo mount /dev/mapper/ubuntu1104--64--5-lvol1 /mnt/lvol1
pjack@ubuntu1104-64-5:/mnt$ sudo mount /dev/mapper/ubuntu1104--64--5-lvol2 /mnt/lvol2

# Check the result
pjack@ubuntu1104-64-5:/mnt$ df
Filesystem                               1K-blocks      Used Available Use% Mounted on
/dev/mapper/ubuntu1104--64--5-lvol0       51606140    184268  48800432   1% /mnt/lvol0
/dev/mapper/ubuntu1104--64--5-lvol1       51606140    184136  48800564   1% /mnt/lvol1
/dev/mapper/ubuntu1104--64--5-lvol2       52428800        56  50302976   1% /mnt/lvol2

# Check each volume's filesystem type
pjack@ubuntu1104-64-5:/lib/modules/2.6.38-8-server/kernel/fs$ mount -l
/dev/mapper/ubuntu1104--64--5-lvol0 on /mnt/lvol0 type ext3 (rw)
/dev/mapper/ubuntu1104--64--5-lvol1 on /mnt/lvol1 type ext4 (rw)
/dev/mapper/ubuntu1104--64--5-lvol2 on /mnt/lvol2 type btrfs (rw)

However, Ceph has a few requirements for ext4:

  1. user_xattr
  2. noatime
  3. nodiratime
  4. disable the ext journal

The ext4 partition must be mounted with -o user_xattr or else mkcephfs will fail. Also using noatime,nodiratime boosts performance at no cost. When using ext4, you should disable the ext4 journal, because Ceph does its own journalling. This will boost performance.       

Data Mode
=========
There are 3 different data modes:

* writeback mode
In data=writeback mode, ext4 does not journal data at all. This mode provides a similar level of journaling as that of XFS, JFS, and ReiserFS in its default mode - metadata journaling. A crash+recovery can cause incorrect data to appear in files which were written shortly before the crash. This mode will typically provide the best ext4 performance.

* ordered mode
In data=ordered mode, ext4 only officially journals metadata, but it logically groups metadata information related to data changes with the data blocks into a single unit called a transaction. When it's time to write the new metadata out to disk, the associated data blocks are written first. In general, this mode performs slightly slower than writeback but significantly faster than journal mode.

* journal mode
data=journal mode provides full data and metadata journaling. All new data is written to the journal first, and then to its final location.
In the event of a crash, the journal can be replayed, bringing both data and
metadata into a consistent state. This mode is the slowest except when data
needs to be read from and written to disk at the same time, where it outperforms all other modes. Currently ext4 does not have delayed allocation support if this data journalling mode is selected.

After the changes, check the result again:
pjack@ubuntu1104-64-5:/lib/modules/2.6.38-8-server/kernel/fs$ mount -l
/dev/mapper/ubuntu1104--64--5-lvol0 on /mnt/lvol0 type ext3 (rw)
/dev/mapper/ubuntu1104--64--5-lvol2 on /mnt/lvol2 type btrfs (rw)
/dev/mapper/ubuntu1104--64--5-lvol1 on /mnt/lvol1 type ext4 (rw,noatime,nodiratime,user_xattr,data=writeback)
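The steps that produced the options above can be collected into one dry-run sketch (device path taken from this post; `run` prints instead of executing unless RUN=1, since the commands need root and an unmounted filesystem). One caveat: once the journal is removed, the data=writeback option becomes moot; if you keep the journal, mount with data=writeback as in the output above.

```shell
#!/bin/sh
# Dry-run sketch of the ext4 re-setup for Ceph; set RUN=1 to actually execute.
DEV=/dev/mapper/ubuntu1104--64--5-lvol1
MNT=/mnt/lvol1
run() { if [ "$RUN" = 1 ]; then "$@"; else echo "$@"; fi; }

run umount "$MNT"
run tune2fs -O '^has_journal' "$DEV"   # Ceph journals on its own; drop ext4's journal
run mount -o noatime,nodiratime,user_xattr "$DEV" "$MNT"
```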


For convenience later on, generate an ssh key on every server and import it into the others.
pjack@ubuntu1104-64-5:/etc/ceph$ sudo ssh-keygen -d
Generating public/private dsa key pair.
Enter file in which to save the key (/root/.ssh/id_dsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_dsa.
Your public key has been saved in /root/.ssh/id_dsa.pub.
The key fingerprint is:
82:8a:85:37:a2:17:f2:41:4f:e8:96:d0:a6:1b:c9:6c root@ubuntu1104-64-5
The key's randomart image is:
+--[ DSA 1024]----+
|                 |
| . .             |
|. = .            |
|oO + .           |
|BEX o . S        |
|o@ =   .         |
|+ +              |
| .               |
|                 |
+-----------------+

root@ubuntu1104-64-5:~$ ssh-copy-id -i /root/.ssh/id_dsa.pub root@172.16.33.6


Next, copy sample.ceph.conf and sample.fetch_config into /etc/ceph
and adjust the settings; most of them need no changes.

[global]
        ; enable secure authentication
        auth supported = cephx

        ; allow ourselves to open a lot of files
        max open files = 131072

        ; set log file
        log file = /var/log/ceph/$name.log
        ; log_to_syslog = true        ; uncomment this line to log to syslog

        ; set up pid files
        pid file = /var/run/ceph/$name.pid

        ; If you want to run a IPv6 cluster, set this to true. Dual-stack isn't possible
        ;ms bind ipv6 = true


        keyring = /etc/ceph/keyring.admin

[mon]
        mon data = /data/$name
[mon.alpha]
        host = ubuntu1104-64-5
        mon addr = 172.16.33.5:6789
[mds]
        ; where the mds keeps it's secret encryption keys
        keyring = /data/keyring.$name

        ; mds logging to debug issues.
        ;debug ms = 1
        ;debug mds = 20

[mds.alpha]
        host = ubuntu1104-64-5
[osd]
        ; This is where the btrfs volume will be mounted.
        osd data = /data/$name
        keyring = /etc/ceph/keyring.$name

        osd journal = /data/$name/journal
        osd journal size = 1000 ; journal size, in megabytes

[osd.0]
        host = ubuntu1104-64-5
        btrfs devs = /dev/mapper/ubuntu1104--64--5-lvol2

[osd.1]
        host = ubuntu1104-64-5
        btrfs devs = /dev/mapper/ubuntu1104--64--5-lvol0

[osd.2]
        host = ubuntu1104-64-6
        btrfs devs = /dev/mapper/ubuntu1104--64--6-lvol0

[osd.3]
        host = ubuntu1104-64-6
        btrfs devs = /dev/mapper/ubuntu1104--64--6-lvol1


The fetch_config script simply copies ceph.conf over from the monitor host:

#!/bin/sh
conf="$1"
scp -i /root/.ssh/id_dsa root@172.16.33.5:/etc/ceph/ceph.conf $conf

Because ceph.conf uses hostnames rather than IP addresses, /etc/hosts has to be set up accordingly, or the scripts will break later:

127.0.0.1       localhost
172.16.33.5     ubuntu1104-64-5
172.16.33.6     ubuntu1104-64-6
172.16.33.7     ubuntu1104-64-7



I also noticed what looks like a bug in Ceph's script, so I added one line to it; otherwise it seemed to clobber the config path set in ceph.conf:
-------- snip ------------
[ -z "$conf" ] && [ -n "$dir" ] && conf="$dir/conf"

# Added this line
[ -z "$conf" ] && [ -z "$dir" ] && conf=$default_conf
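The added line's effect is easy to verify in isolation: conf falls back to default_conf only when both conf and dir are empty, so an explicitly given config path is never overwritten:

```shell
# Reproduce the two guard lines with empty inputs to show the fallback.
default_conf=/etc/ceph/ceph.conf
conf=""
dir=""
[ -z "$conf" ] && [ -n "$dir" ] && conf="$dir/conf"     # original line: no-op here
[ -z "$conf" ] && [ -z "$dir" ] && conf=$default_conf   # added line: applies
echo "$conf"
# → /etc/ceph/ceph.conf
```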



After this long list of preliminaries, we can finally build the Ceph filesystem:

root@ubuntu1104-64-5:/tmp$ /sbin/mkcephfs -a --mkbtrfs
here 0 /etc/ceph/ceph.conf
[/etc/ceph/fetch_config /tmp/fetched.ceph.conf.13131]
ceph.conf                                                         100% 4455     4.4KB/s   00:00
temp dir is /tmp/mkcephfs.pt2DlXEHkB
here 0 /tmp/fetched.ceph.conf.13131
preparing monmap in /tmp/mkcephfs.pt2DlXEHkB/monmap
/usr/bin/monmaptool --create --clobber --add alpha 172.16.33.5:6789 --print /tmp/mkcephfs.pt2DlXEHkB
/monmap
/usr/bin/monmaptool: monmap file /tmp/mkcephfs.pt2DlXEHkB/monmap
/usr/bin/monmaptool: generated fsid 9cc6b2d5-1eba-50b2-bd43-7b3807ce301b
epoch 1
fsid 9cc6b2d5-1eba-50b2-bd43-7b3807ce301b
last_changed 2011-09-07 18:18:00.219236
created 2011-09-07 18:18:00.219236
0: 172.16.33.5:6789/0 mon.alpha
/usr/bin/monmaptool: writing epoch 1 to /tmp/mkcephfs.pt2DlXEHkB/monmap (1 monitors)

=== osd.0 ===
here 0 /tmp/mkcephfs.pt2DlXEHkB/conf
umount: /data/osd.0: not mounted
umount: /dev/mapper/ubuntu1104--64--5-lvol2: not mounted

WARNING! - Btrfs Btrfs v0.19 IS EXPERIMENTAL
WARNING! - see http://btrfs.wiki.kernel.org before using

fs created label (null) on /dev/mapper/ubuntu1104--64--5-lvol2
        nodesize 4096 leafsize 4096 sectorsize 4096 size 50.00GB
Btrfs Btrfs v0.19
Scanning for Btrfs filesystems
here 0 /tmp/mkcephfs.pt2DlXEHkB/conf
2011-09-07 18:18:00.995785 7fb8f3dfc760 created object store /data/osd.0 journal /data/osd.0/journal
 for osd0 fsid 9cc6b2d5-1eba-50b2-bd43-7b3807ce301b
creating private key for osd.0 keyring /etc/ceph/keyring.osd.0
creating /etc/ceph/keyring.osd.0

=== osd.1 ===
here 0 /tmp/mkcephfs.pt2DlXEHkB/conf
umount: /data/osd.1: not mounted
umount: /dev/mapper/ubuntu1104--64--5-lvol0: not mounted

WARNING! - Btrfs Btrfs v0.19 IS EXPERIMENTAL
WARNING! - see http://btrfs.wiki.kernel.org before using

fs created label (null) on /dev/mapper/ubuntu1104--64--5-lvol0
        nodesize 4096 leafsize 4096 sectorsize 4096 size 50.00GB
Btrfs Btrfs v0.19
Scanning for Btrfs filesystems
here 0 /tmp/mkcephfs.pt2DlXEHkB/conf
2011-09-07 18:18:01.689284 7fa826bb2760 created object store /data/osd.1 journal /data/osd.1/journal
 for osd1 fsid 9cc6b2d5-1eba-50b2-bd43-7b3807ce301b
creating private key for osd.1 keyring /etc/ceph/keyring.osd.1
creating /etc/ceph/keyring.osd.1

=== osd.2 ===
pushing conf and monmap to ubuntu1104-64-6:/tmp/mkfs.ceph.13131
umount: /data/osd.2: not mounted
umount: /dev/mapper/ubuntu1104--64--6-lvol0: not mounted

WARNING! - Btrfs Btrfs v0.19 IS EXPERIMENTAL
WARNING! - see http://btrfs.wiki.kernel.org before using

fs created label (null) on /dev/mapper/ubuntu1104--64--6-lvol0
        nodesize 4096 leafsize 4096 sectorsize 4096 size 50.00GB
Btrfs Btrfs v0.19
Scanning for Btrfs filesystems
2011-09-07 18:18:03.823692 7fdc3c04c760 created object store /data/osd.2 journal /data/osd.2/journal for osd2 fsid 9cc6b2d5-1eba-50b2-bd43-7b3807ce301b
creating private key for osd.2 keyring /etc/ceph/keyring.osd.2
creating /etc/ceph/keyring.osd.2
collecting osd.2 key


=== osd.3 ===
pushing conf and monmap to ubuntu1104-64-6:/tmp/mkfs.ceph.13131
umount: /data/osd.3: not mounted
umount: /dev/mapper/ubuntu1104--64--6-lvol1: not mounted

WARNING! - Btrfs Btrfs v0.19 IS EXPERIMENTAL
WARNING! - see http://btrfs.wiki.kernel.org before using

fs created label (null) on /dev/mapper/ubuntu1104--64--6-lvol1
        nodesize 4096 leafsize 4096 sectorsize 4096 size 50.00GB
Btrfs Btrfs v0.19
Scanning for Btrfs filesystems
 ** WARNING: Ceph is still under development.  Any feedback can be directed  **
 **          at ceph-devel@vger.kernel.org or http://ceph.newdream.net/.     **
2011-09-07 18:18:06.293806 7f91b9b63760 created object store /data/osd.3 journal /data/osd.3/journal for osd3 fsid 9cc6b2d5-1eba-50b2-bd43-7b3807ce301b
creating private key for osd.3 keyring /etc/ceph/keyring.osd.3
creating /etc/ceph/keyring.osd.3
collecting osd.3 key

=== mds.alpha ===
here 0 /tmp/mkcephfs.pt2DlXEHkB/conf
creating private key for mds.alpha keyring /data/keyring.mds.alpha
creating /data/keyring.mds.alpha
here 0 /tmp/mkcephfs.pt2DlXEHkB/conf
Building generic osdmap
 highest numbered osd in /tmp/mkcephfs.pt2DlXEHkB/conf is osd.3
 num osd = 4
/usr/bin/osdmaptool: osdmap file '/tmp/mkcephfs.pt2DlXEHkB/osdmap'
/usr/bin/osdmaptool: writing epoch 1 to /tmp/mkcephfs.pt2DlXEHkB/osdmap
Generating admin key at /tmp/mkcephfs.pt2DlXEHkB/keyring.admin
creating /tmp/mkcephfs.pt2DlXEHkB/keyring.admin
Building initial monitor keyring
added entity mds.alpha auth auth(auid = 18446744073709551615 key=AQDeRGdOQBGPLhAAA3owSiBl0H4ozL4dy0H7Rg== with 0 caps)
added entity osd.0 auth auth(auid = 18446744073709551615 key=AQDZRGdOKLL3ABAAlU4NM3xNTe+m/dUXEvKCRw== with 0 caps)
added entity osd.1 auth auth(auid = 18446744073709551615 key=AQDZRGdOAJ1MKhAAY69HXzl8QLxZ3/MCHP2Cnw== with 0 caps)
added entity osd.2 auth auth(auid = 18446744073709551615 key=AQDbRGdOeEtQMhAAHhbE8EuTxqpobHIUR0SCdg== with 0 caps)
added entity osd.3 auth auth(auid = 18446744073709551615 key=AQDeRGdOYImzEhAAHQkcZtR4E8npHlgpAT8NpQ== with 0 caps)
=== mon.alpha ===
here 0 /tmp/mkcephfs.pt2DlXEHkB/conf
/usr/bin/cmon: created monfs at /data/mon.alpha for mon.alpha
placing client.admin keyring in /etc/ceph/keyring.admin

Check the key placement
# Check Server 1 (ubuntu1104-64-5)
root@ubuntu1104-64-5:/etc/ceph$ ll
drwxr-xr-x  2 root root 4096 2011-09-07 18:16 ./
drwxr-xr-x 86 root root 4096 2011-09-07 18:18 ../
-rw-r--r--  1 root root 4455 2011-09-07 17:31 ceph.conf
-rwxr-xr-x  1 root root  392 2011-09-07 11:32 fetch_config*
-rw-------  1 root root   92 2011-09-07 18:18 keyring.admin
-rw-------  1 root root   85 2011-09-07 18:18 keyring.osd.0
-rw-------  1 root root   85 2011-09-07 18:18 keyring.osd.1

root@ubuntu1104-64-5:/etc/ceph$ cauthtool -l keyring.admin
[client.admin]
        key = AQDeRGdOMNL3MhAAuzvelwICjpYhLIk7IMcX2g==
        auid = 18446744073709551615

# Check Server 2 (ubuntu1104-64-6)
root@ubuntu1104-64-6:/etc/ceph$ ll
drwxr-xr-x  2 root root 4096 2011-09-07 18:16 ./
drwxr-xr-x 86 root root 4096 2011-09-07 18:18 ../
-rwxr-xr-x  1 root root  392 2011-09-07 11:56 fetch*
-rw-------  1 root root   85 2011-09-07 18:18 keyring.osd.2
-rw-------  1 root root   85 2011-09-07 18:18 keyring.osd.3


With the keys in place on every node, let's bring up the services!

root@ubuntu1104-64-5:/tmp$ service ceph -a start
[/etc/ceph/fetch_config /tmp/fetched.ceph.conf.16083]
ceph.conf                                                         100% 4455     4.4KB/s   00:00
=== mon.alpha ===
Starting Ceph mon.alpha on ubuntu1104-64-5...
starting mon.alpha rank 0 at 172.16.33.5:6789/0 mon_data /data/mon.alpha fsid 9cc6b2d5-1eba-50b2-bd43-7b3807ce301b
=== mds.alpha ===
Starting Ceph mds.alpha on ubuntu1104-64-5...
starting mds.alpha at 0.0.0.0:6800/16268
=== osd.0 ===
Mounting Btrfs on ubuntu1104-64-5:/data/osd.0
Scanning for Btrfs filesystems
Starting Ceph osd.0 on ubuntu1104-64-5...
starting osd0 at 0.0.0.0:6801/16371 osd_data /data/osd.0 /data/osd.0/journal
=== osd.1 ===
Mounting Btrfs on ubuntu1104-64-5:/data/osd.1
Scanning for Btrfs filesystems
Starting Ceph osd.1 on ubuntu1104-64-5...
starting osd1 at 0.0.0.0:6804/16464 osd_data /data/osd.1 /data/osd.1/journal
=== osd.2 ===
Mounting Btrfs on ubuntu1104-64-6:/data/osd.2
Scanning for Btrfs filesystems
Starting Ceph osd.2 on ubuntu1104-64-6...
starting osd2 at 0.0.0.0:6800/14475 osd_data /data/osd.2 /data/osd.2/journal
=== osd.3 ===
Mounting Btrfs on ubuntu1104-64-6:/data/osd.3
Scanning for Btrfs filesystems
Starting Ceph osd.3 on ubuntu1104-64-6...
starting osd3 at 0.0.0.0:6803/14676 osd_data /data/osd.3 /data/osd.3/journal


Check the overall status and the authentication list:
root@ubuntu1104-64-5:/etc/ceph$ ceph -s
2011-09-07 18:29:24.305413    pg v160: 792 pgs: 792 active+clean; 24 KB data, 112 MB used, 191 GB / 200 GB avail
2011-09-07 18:29:24.307445   mds e4: 1/1/1 up {0=alpha=up:active}
# There are 4 osds; all 4 are up and in the storage pool
2011-09-07 18:29:24.307483   osd e7: 4 osds: 4 up, 4 in   
2011-09-07 18:29:24.307539   log 2011-09-07 18:29:20.760469 osd3 172.16.33.6:6803/14676 130 : [INF] 1.8c scrub ok
2011-09-07 18:29:24.307617   mon e1: 1 mons at {alpha=172.16.33.5:6789/0}

root@ubuntu1104-64-5:/etc/ceph$ ceph auth list
2011-09-07 18:29:41.564151 mon <- [auth,list]
2011-09-07 18:29:41.564718 mon0 -> 'installed auth entries:
mon.
        key: AQDeRGdOiEk2NBAAVHVGzaeOFcgSmbZZ2xPu+w==
mds.alpha
        key: AQDeRGdOQBGPLhAAA3owSiBl0H4ozL4dy0H7Rg==
        caps: [mds] allow
        caps: [mon] allow rwx
        caps: [osd] allow *
osd.0
        key: AQDZRGdOKLL3ABAAlU4NM3xNTe+m/dUXEvKCRw==
        caps: [mon] allow rwx
        caps: [osd] allow *
osd.1
        key: AQDZRGdOAJ1MKhAAY69HXzl8QLxZ3/MCHP2Cnw==
        caps: [mon] allow rwx
        caps: [osd] allow *
osd.2
        key: AQDbRGdOeEtQMhAAHhbE8EuTxqpobHIUR0SCdg==
        caps: [mon] allow rwx
        caps: [osd] allow *
osd.3
        key: AQDeRGdOYImzEhAAHQkcZtR4E8npHlgpAT8NpQ==
        caps: [mon] allow rwx
        caps: [osd] allow *
client.admin
        key: AQDeRGdOMNL3MhAAuzvelwICjpYhLIk7IMcX2g==
        caps: [mds] allow
        caps: [mon] allow *
        caps: [osd] allow *
' (0)


Now mount ceph, first with the kernel client. For some unknown reason it kept printing "No such device"; I couldn't resolve it and gave up on this approach.
root@ubuntu1104-64-5:/etc/ceph$ mount -t ceph 172.16.33.5:6789:/ /mnt/ceph -v -o name=admin,secret=AQDeRGdOMNL3MhAAuzvelwICjpYhLIk7IMcX2g==
parsing options: rw,name=admin,secret=AQDeRGdOMNL3MhAAuzvelwICjpYhLIk7IMcX2g==
error adding secret to kernel, key name client.admin: No such device.

Switching to cfuse worked without a problem. Odd; maybe mount.ceph needs to be updated?
root@ubuntu1104-64-5:/etc/ceph$ cfuse -m 172.16.33.5:6789 /mnt/ceph
 ** WARNING: Ceph is still under development.  Any feedback can be directed  **
 **          at ceph-devel@vger.kernel.org or http://ceph.newdream.net/.     **
cfuse[3506]: starting ceph client
cfuse[3506]: starting fuse
root@ubuntu1104-64-5:/etc/ceph$ df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/ubuntu1104--64--5-root
                      47328184   5929696  38994344  14% /
none                  12358244       220  12358024   1% /dev
none                  12366300         0  12366300   0% /dev/shm
none                  12366300        60  12366240   1% /var/run
none                  12366300         0  12366300   0% /var/lock
/dev/sda1               233191     45262    175488  21% /boot
/dev/mapper/ubuntu1104--64--5-lvol2
                      52428800     30244  50275500   1% /data/osd.0
/dev/mapper/ubuntu1104--64--5-lvol0
                      52428800     31272  50274416   1% /data/osd.1
cfuse                209715200   8616960 201098240   5% /mnt/ceph

How to add a single OSD

See http://ceph.newdream.net/wiki/OSD_cluster_expansion/contraction for reference.
In practice there is a prerequisite: copy /etc/ceph/keyring.admin to the new machine first, or these commands cannot be run.

root@ubuntu1104-64-6:/etc/ceph$ mkfs.btrfs /dev/mapper/ubuntu1104--64--6-lvol1
WARNING! - Btrfs Btrfs v0.19 IS EXPERIMENTAL
WARNING! - see http://btrfs.wiki.kernel.org before using
fs created label (null) on /dev/mapper/ubuntu1104--64--6-lvol1
        nodesize 4096 leafsize 4096 sectorsize 4096 size 50.00GB
Btrfs Btrfs v0.19

root@ubuntu1104-64-6:/etc/ceph$ mount /dev/mapper/ubuntu1104--64--6-lvol1 /data/osd.3

root@ubuntu1104-64-6:/etc/ceph$ cosd -i 3 --mkfs --monmap /tmp/monmap --mkkey
 ** WARNING: Ceph is still under development.  Any feedback can be directed  **
 **          at ceph-devel@vger.kernel.org or http://ceph.newdream.net/.     **
2011-09-07 15:17:17.756554 7ff1ca3a4760 created object store /data/osd.3 journal /data/osd.3/journal for osd3 fsid d9dbbbfc-12ec-7d89-49cd-c91d6c598715
2011-09-07 15:17:17.756987 7ff1ca3a4760 created new key in keyring /etc/ceph/keyring.osd.3


root@ubuntu1104-64-6:/etc/ceph$ ceph auth add osd.3 osd 'allow *' mon 'allow rwx' -i /etc/ceph/keyring.osd.3
2011-09-07 15:17:55.574961 7f7d4bcc5740 read 85 bytes from /etc/ceph/keyring.osd.3
2011-09-07 15:17:55.578720 mon <- [auth,add,osd.3,osd,allow *,mon,allow rwx]
2011-09-07 15:17:55.790027 mon0 -> 'added key for osd.3' (0)


root@ubuntu1104-64-6:/etc/ceph$ ceph osd setmaxosd 4
2011-09-07 15:19:50.847283 mon <- [osd,setmaxosd,4]
2011-09-07 15:19:51.210703 mon0 -> 'set new max_osd = 4' (0)

root@ubuntu1104-64-6:/etc/ceph$ service ceph start osd.3
[/etc/ceph/fetch_config /tmp/fetched.ceph.conf.9423]
ceph.conf                                                         100% 4454     4.4KB/s   00:00
=== osd.3 ===
Mounting Btrfs on ubuntu1104-64-6:/data/osd.3
Scanning for Btrfs filesystems
Starting Ceph osd.3 on ubuntu1104-64-6...
starting osd3 at 0.0.0.0:6803/9532 osd_data /data/osd.3 /data/osd.3/journal

root@ubuntu1104-64-6:/etc/ceph$ ceph -s
2011-09-07 15:21:52.259399    pg v112: 594 pgs: 594 active+clean; 24 KB data, 65856 KB used, 191 GB / 200 GB avail
2011-09-07 15:21:52.260887   mds e4: 1/1/1 up {0=alpha=up:active}
2011-09-07 15:21:52.260924   osd e7: 4 osds: 4 up, 4 in
2011-09-07 15:21:52.260979   log 2011-09-07 15:21:48.274685 osd2 172.16.33.6:6800/9040 127 : [INF] 1.1p2 scrub ok
2011-09-07 15:21:52.261057   mon e1: 1 mons at {alpha=172.16.33.5:6789/0}
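The whole expansion sequence above can be collected into one dry-run sketch (the OSD id, device, and paths are this post's values; it only prints the commands unless RUN=1, since they need root and a live cluster):

```shell
#!/bin/sh
# Dry-run sketch of adding osd.3; set RUN=1 to actually execute on a cluster.
ID=3
DEV=/dev/mapper/ubuntu1104--64--6-lvol1
DIR=/data/osd.$ID
run() { if [ "$RUN" = 1 ]; then "$@"; else echo "$@"; fi; }

run mkfs.btrfs "$DEV"
run mount "$DEV" "$DIR"
run cosd -i "$ID" --mkfs --monmap /tmp/monmap --mkkey
run ceph auth add osd.$ID osd 'allow *' mon 'allow rwx' -i /etc/ceph/keyring.osd.$ID
run ceph osd setmaxosd 4
run service ceph start osd.$ID
```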


Ceph 0.34 build from source

If you want to install the latest Ceph rather than the official 0.24 release in Ubuntu 11.04,
see http://ceph.newdream.net/wiki/Debian ("Building from source").


# Step 1: install the packages required to build the source code
root@ubuntu1104-64-5:~/src$ apt-get install debhelper autotools-dev autoconf automake g++ gcc cdbs libfuse-dev libboost-dev libedit-dev libssl-dev libtool libexpat1-dev libfcgi-dev libatomic-ops-dev libgoogle-perftools-dev pkg-config libgtkmm-2.4-dev libcrypto++-dev python-dev

# Step 2: get the source code
root@ubuntu1104-64-5:~/src$ git clone git://ceph.newdream.net/git/ceph.git

# Step 3: get the stable version
root@ubuntu1104-64-5:~/src$ cd ceph
root@ubuntu1104-64-5:~/src/ceph$ git checkout -b stable origin/stable

# Step 4: Build the .deb installation package
root@ubuntu1104-64-5:~/src/ceph$ dpkg-buildpackage -j16

The build takes a while; once done, go up one directory and the .deb packages are all there.

root@ubuntu1104-64-5:~/src$ ls
ceph                                    libceph-dev_0.34-1_amd64.deb
ceph_0.34-1_amd64.changes               librados2_0.34-1_amd64.deb
ceph_0.34-1_amd64.deb                   librados2-dbg_0.34-1_amd64.deb
ceph_0.34-1.dsc                         librados-dev_0.34-1_amd64.deb
ceph_0.34-1.tar.gz                      librbd1_0.34-1_amd64.deb
ceph-client-tools_0.34-1_amd64.deb      librbd1-dbg_0.34-1_amd64.deb
ceph-client-tools-dbg_0.34-1_amd64.deb  librbd-dev_0.34-1_amd64.deb
ceph-dbg_0.34-1_amd64.deb               librgw1_0.34-1_amd64.deb
ceph-fuse_0.34-1_amd64.deb              librgw1-dbg_0.34-1_amd64.deb
ceph-fuse-dbg_0.34-1_amd64.deb          librgw-dev_0.34-1_amd64.deb
gceph_0.34-1_amd64.deb                  obsync_0.34-1_amd64.deb
gceph-dbg_0.34-1_amd64.deb              python-ceph_0.34-1_amd64.deb
libceph1_0.34-1_amd64.deb               radosgw_0.34-1_amd64.deb
libceph1-dbg_0.34-1_amd64.deb           radosgw-dbg_0.34-1_amd64.deb

So install all the .debs; during installation a few more dependency packages turned out to be missing:

root@ubuntu1104-64-5:~/src$ apt-get install libxslt1.1 python-boto python-pyxattr python-lxml
root@ubuntu1104-64-5:~/src$ dpkg -i *.deb

# Check the version after installation
root@ubuntu1104-64-5:~/src$ ceph --version
ceph version 0.34-4-g7a8ab74 (commit:7a8ab747addf493cb4b82351aeb3c2e07ba46a95)


The whole process was fairly smooth, with few problems.


Addendum 2011/12/20:
Later versions can all be installed with sudo apt-get install ceph python-ceph, but if you still modify the code yourself you can follow the original approach; I tried it with 0.39.
Another page worth referencing:
http://ceph.newdream.net/wiki/Checking_out