Using High-Speed Interconnects with MySQL Cluster

Even before the design of NDB Cluster began in 1996, it was evident that one of the major problems in building parallel databases is communication between the nodes in the network. Thus, from the very beginning, NDB Cluster was designed around a transporter concept that allows for different transporters.

At the moment the code base includes four different transporters, of which three are currently working. Most users today use TCP/IP over Ethernet, since it is available on all machines. It is also by far the most well-tested transporter in MySQL Cluster.

Within MySQL we are working hard to ensure that communication with the ndbd process is done in as large chunks as possible, since this benefits all communication media; every means of transport gains from sending a few large messages rather than many small ones.

For users who desire top performance, it is also possible to use cluster interconnects to increase performance even further. There are two ways to achieve this: either a transporter can be designed to handle this case, or one can use socket implementations that bypass the TCP/IP stack to a lesser or greater extent.

We have made some experiments with both of these variants, using the SCI technology developed by Dolphin (www.dolphinics.no).

Configuring MySQL Cluster to use SCI Sockets

In this section we show how a cluster configured for normal TCP/IP communication can be made to use SCI Sockets instead. A prerequisite is that the machines that are to communicate are equipped with SCI cards. This documentation is based on SCI Socket version 2.3.0 as of 1 October 2004.

To use SCI Sockets, any version of MySQL Cluster can be used. The tests were performed on an early 4.1.6 version. No special builds are needed, since SCI Sockets use normal socket calls, which is the normal configuration set-up for MySQL Cluster. SCI Sockets are currently supported only on Linux 2.4 and 2.6 kernels. The SCI Transporter works on more operating systems, although only Linux 2.4 has been verified.

There are essentially four things needed to enable SCI Sockets. First, the SCI Socket libraries must be built. Second, the SCI Socket kernel libraries need to be installed. Third, one or two configuration files need to be installed. Finally, the SCI Socket kernel library needs to be enabled, either for the entire machine or for the shell from which the MySQL Cluster processes are started. This process needs to be repeated for each machine in the cluster that will use SCI Sockets to communicate.

Two packages need to be retrieved to get SCI Sockets working. The first package builds the libraries that SCI Sockets are built upon, and the second is the actual SCI Socket libraries. Currently the distribution is available only in source code format.

Check

http://www.dolphinics.no/support/downloads.html

for the latest versions of these packages. The versions used here were:

http://www.dolphinics.no/ftp/source/DIS_GPL_2_5_0_SEP_10_2004.tar.gz
http://www.dolphinics.no/ftp/source/SCI_SOCKET_2_3_0_OKT_01_2004.tar.gz
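For example, assuming the wget utility is available on the machine, the two packages can be fetched as follows:

shell> wget http://www.dolphinics.no/ftp/source/DIS_GPL_2_5_0_SEP_10_2004.tar.gz
shell> wget http://www.dolphinics.no/ftp/source/SCI_SOCKET_2_3_0_OKT_01_2004.tar.gz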

The next step is to unpack these packages; SCI Sockets is unpacked below the DIS code. Then the code base is compiled. The example below shows the commands used on Linux/x86 to do this.

shell> tar xzf DIS_GPL_2_5_0_SEP_10_2004.tar.gz
shell> cd DIS_GPL_2_5_0_SEP_10_2004/src/
shell> tar xzf ../../SCI_SOCKET_2_3_0_OKT_01_2004.tar.gz
shell> cd ../adm/bin/Linux_pkgs
shell> ./make_PSB_66_release

If the build is done on an Opteron box and is to use the 64-bit extensions, then use make_PSB_66_X86_64_release instead; if the build is done on an Itanium box, then use make_PSB_66_IA64_release instead. The X86-64 variant should also work for Intel EM64T architectures, but no tests of this are known yet.

After building, the code base has been put into a zipped tar file whose name is composed of DIS, the OS, and the date. It is now time to install the package in the proper place. In this example we place the installation in /opt/DIS. These actions will most likely require you to be logged in as the root user.

shell> cp DIS_Linux_2.4.20-8_181004.tar.gz /opt/
shell> cd /opt
shell> tar xzf DIS_Linux_2.4.20-8_181004.tar.gz
shell> mv DIS_Linux_2.4.20-8_181004 DIS

Now that all the libraries and binaries are in their proper place, we need to ensure that the SCI cards get proper node identities within the SCI address space. Since SCI is networking equipment, it is necessary to decide on the network structure first.

There are three types of network structures: the first is a simple one-dimensional ring, the second uses SCI switches with one ring per switch port, and finally there are 2D/3D toruses. Each type has its own standard for assigning node ids.

A simple ring uses node ids spaced 4 apart:

4, 8, 12, ....

The next possibility uses switches. An SCI switch has 8 ports, and on each port it is possible to place a ring. Here it is necessary to ensure that the rings on the switch use different node id spaces, so the first port uses node ids below 64, the next 64 node ids are allocated to the second port, and so forth:

4, 8, 12, ..., 60    Ring on first port
68, 72, ..., 124     Ring on second port
132, 136, ..., 188   Ring on third port
..
452, 456, ..., 508   Ring on the eighth port

2D/3D torus network structures take into account where each node is in each dimension: the node id is incremented by 4 for each node in the first dimension, by 64 in the second dimension, and by 1024 in the third dimension. See the Dolphin documentation for more thorough coverage of this.
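For illustration, starting from the node id of some existing node (68 here is only a hypothetical value; consult the Dolphin documentation for the authoritative numbering scheme), the ids of its neighbours can be computed in a bash-compatible shell:

shell> BASE=68                   # node id of an existing node (hypothetical)
shell> echo $(( BASE + 4 ))      # neighbour in the first dimension
shell> echo $(( BASE + 64 ))     # neighbour in the second dimension
shell> echo $(( BASE + 1024 ))   # neighbour in the third dimension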

In our testing we have used switches. Most of the really big cluster installations use a 2D/3D torus. The extra feature that switches provide is that, with dual SCI cards and dual switches, we can easily build a redundant network where failover time on the SCI network is around 100 microseconds. This feature is supported by the SCI transporter and is currently also being developed for the SCI Socket implementation.

Failover for the 2D/3D torus is also possible but requires sending out new routing indexes to all nodes. Even this completes in around 100 milliseconds, which should be acceptable for most high-availability cases.

By placing the NDB nodes in proper places in the switched architecture, it is possible to use two switches to build a structure where 16 computers can be interconnected and no single failure can hamper more than one of them. With 32 computers and two switches it is possible to configure the cluster so that no single failure can hamper more than two nodes, and in that case it is also known which pair will be hit. Thus, by placing those two in separate NDB node groups, it is possible to build a safe MySQL Cluster installation. We won't go into the details of how this is done, since it is likely to be of interest only to users wanting to go really deep into this.

To set the node id of an SCI card, use the following command from the /opt/DIS/sbin directory. -c 1 refers to the number of the SCI card; 1 is the number to use if only one card is in the machine. In this case, always use adapter 0 (set by -a 0). 68 is the node id set in this example.

shell> ./sciconfig -c 1 -a 0 -n 68

If you have several SCI cards in your machine, the only safe way to discover which card is in which slot is by issuing the following command:

shell> ./sciconfig -c 1 -gsn

This gives the serial number, which can also be found on the back of the SCI card and on the card itself. Repeat this for -c 2 and onwards, for as many cards as there are in the machine. This identifies which physical card corresponds to which card number. Then set the node ids of all the cards.
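For example, on a machine with two SCI cards, the checks and node id assignments might look like the following sketch (the node ids 68 and 72 are hypothetical; choose values that match your network structure):

shell> cd /opt/DIS/sbin
shell> for C in 1 2; do ./sciconfig -c $C -gsn; done
shell> ./sciconfig -c 1 -a 0 -n 68
shell> ./sciconfig -c 2 -a 0 -n 72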

Now we have installed the necessary libraries and binaries. We have also set the SCI node ids. The next step is to set the mapping from hostnames (or IP addresses) to SCI node ids.

The configuration file for SCI Sockets is placed in /etc/sci/scisock.conf. This file contains a mapping from hostnames (or IP addresses) to SCI node ids; the SCI node id directs traffic for that hostname through the proper SCI card. Below is a very simple example of such a configuration file.

#host           #nodeId
alpha           8
beta            12
192.168.10.20   16
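If you prefer to create this file from the shell, one way is a here-document run as the root user (this sketch assumes the /etc/sci directory may not yet exist and simply reuses the mapping shown above):

shell> mkdir -p /etc/sci
shell> cat > /etc/sci/scisock.conf << 'EOF'
#host           #nodeId
alpha           8
beta            12
192.168.10.20   16
EOF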

It is also possible to limit this configuration so that it applies only to a subset of the ports of these hostnames. To do this, another configuration file is used, placed in /etc/sci/scisock_opt.conf:

#-key                        -type        -values
EnablePortsByDefault                      yes
EnablePort                  tcp           2200
DisablePort                 tcp           2201
EnablePortRange             tcp           2202 2219
DisablePortRange            tcp           2220 2231

Now we are ready to install the drivers. We need to install the low-level drivers first and then the SCI Socket driver:

shell> cd DIS/sbin/
shell> ./drv-install add PSB66
shell> ./scisocket-install add

If desired, one can now check the installation by invoking a script that verifies that all nodes in the SCI Socket configuration files are accessible:

shell> cd /opt/DIS/sbin/
shell> ./status.sh

If you discover an error and need to change the SCI Socket configuration files, it is necessary to use the program ksocketconfig to change the configuration:

shell> cd /opt/DIS/util
shell> ./ksocketconfig -f

To check that SCI Sockets are actually being used, you can use the test program latency_bench, which has a server component to which clients can connect in order to test the latency; whether SCI is enabled is very clear from the latency you get. Before you use these programs you also need to set the LD_PRELOAD variable in the same manner as shown below.

To set up a server, use the following command:

shell> cd /opt/DIS/bin/socket
shell> ./latency_bench -server

To run a client, use the following command:

shell> cd /opt/DIS/bin/socket
shell> ./latency_bench -client hostname_of_server
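Combining this with the LD_PRELOAD requirement mentioned above, a client run that preloads the kernel SCI Socket library (the library path is the one used later in this section) might look as follows in a bash shell; hostname_of_server is, as before, a placeholder:

shell> export LD_PRELOAD=/opt/DIS/lib/libkscisock.so
shell> cd /opt/DIS/bin/socket
shell> ./latency_bench -client hostname_of_server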

Now the SCI Socket configuration is complete. MySQL Cluster is ready to use both SCI Sockets and the SCI transporter documented in the section called “Defining SCI Transporter Connections in a MySQL Cluster”.

The next step is to start up MySQL Cluster. To enable the use of SCI Sockets, it is necessary to set the environment variable LD_PRELOAD before starting the ndbd, mysqld, and ndb_mgmd processes. The LD_PRELOAD variable should point to the kernel library for SCI Sockets.

So, as an example, to start up ndbd in a bash shell, use the following commands:

bash-shell> export LD_PRELOAD=/opt/DIS/lib/libkscisock.so
bash-shell> ndbd

From a tcsh environment the same thing would be accomplished with the following commands.

tcsh-shell> setenv LD_PRELOAD /opt/DIS/lib/libkscisock.so
tcsh-shell> ndbd

Noteworthy here is that MySQL Cluster can only use the kernel variant of SCI Sockets.
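The same applies to mysqld and ndb_mgmd on the hosts where they run. As a minimal bash sketch, assuming the management server is started with ndb_mgmd and the MySQL Server is launched through mysqld_safe, one would use

shell> export LD_PRELOAD=/opt/DIS/lib/libkscisock.so
shell> ndb_mgmd

on the management server host, and

shell> export LD_PRELOAD=/opt/DIS/lib/libkscisock.so
shell> mysqld_safe &

on each MySQL Server host.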

Low-level benchmarks to understand the impact of cluster interconnects

The ndbd process has a number of simple constructs that are used to access the data in MySQL Cluster. We made a very simple benchmark to check the performance of each of these constructs and the effect that various interconnects have on their performance.

There are four access methods, illustrated with a brief sketch after the list:

Primary key access

This is a simple access of one record through its primary key. In the simplest case only one record is accessed at a time, which means that the full cost of setting up a number of TCP/IP messages, plus a number of context-switch costs, is borne by this single request. In a batched case, where for example 32 primary key accesses are sent in one batch, those 32 accesses share the set-up cost of TCP/IP messages and context switches (if the accesses are for different destinations, then naturally several TCP/IP messages need to be set up).

Unique key access

Unique key accesses are very similar to primary key accesses, except that they are executed as a read of an index table followed by a primary key access on the table. However, only one request is sent from the MySQL Server; the read of the index table is handled by the ndbd process. Thus these requests also benefit from being accessed in batches.

Full table scan

When no indexes exist for the lookup on a table, a full table scan is performed. This is issued as a single request to the ndbd process, which divides the table scan into a set of parallel scans on all ndbd processes in the cluster. In future versions the MySQL server will be able to push down some filtering into these scans.

Range scan using ordered index

When an ordered index is used, a scan is performed in the same manner as the full table scan, except that it scans only those records that are in the range used by the query set up by the MySQL server. In future versions a special optimisation will ensure that, when the bound index attributes include all attributes of the partitioning key, only one partition will be scanned instead of all partitions in parallel.
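As a rough illustration only: for a hypothetical NDB table test.t1 with a primary key pk, a unique key on uk, an ordered index on idx_col, and an unindexed column plain_col (all of these names are assumptions, not part of the benchmark), EXPLAIN shows which access method the MySQL server chooses for a given query:

shell> mysql -e "EXPLAIN SELECT * FROM test.t1 WHERE pk = 1"                    # primary key access
shell> mysql -e "EXPLAIN SELECT * FROM test.t1 WHERE uk = 1"                    # unique key access
shell> mysql -e "EXPLAIN SELECT * FROM test.t1 WHERE plain_col = 1"             # full table scan
shell> mysql -e "EXPLAIN SELECT * FROM test.t1 WHERE idx_col BETWEEN 1 AND 100" # range scan using the ordered index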

To check the base performance of these access methods, we developed a set of benchmarks. One such benchmark, testReadPerf, issues simple primary and unique key accesses as well as batched primary and unique key accesses. The benchmark also measures the set-up cost of range scans by issuing scans that return a single record, and finally there is a variant that uses a range scan to fetch a batch of records.

In this manner we can test the cost of issuing single key accesses and single record scan accesses, and measure the impact the implementation of the communication medium has on these base access methods.

We executed these base benchmarks using both a normal transporter over TCP/IP sockets and a similar set-up using SCI Sockets. The figures reported below are for small accesses of 20 records of data per access. The difference between serial and batched access drops by a factor of 3-4 when using 2 kB records instead. SCI Sockets were not tested with 2 kB records. Tests were performed on a two-node cluster with two dual-CPU machines equipped with AMD MP1900+ processors.

Access type:         TCP/IP sockets           SCI Socket
Serial pk access:    400 microseconds         160 microseconds
Batched pk access:    28 microseconds          22 microseconds
Serial uk access:    500 microseconds         250 microseconds
Batched uk access:    70 microseconds          36 microseconds
Indexed eq-bound:   1250 microseconds         750 microseconds
Index range:          24 microseconds          12 microseconds

We also ran another set of tests to check the performance of SCI Sockets compared to the SCI transporter, and both compared to the TCP/IP transporter. All these tests used primary key accesses, either serially, multi-threaded, or multi-threaded and batched simultaneously.

More or less all of these tests showed that SCI Sockets were about 100% faster than TCP/IP. The SCI transporter was faster than SCI Sockets in most cases. One notable case, however, with many threads in the test program, showed that the SCI transporter behaved really badly when used in the mysqld process.

Thus our overall conclusion is that, for most benchmarks, SCI Sockets improve performance by around 100% compared to TCP/IP, except in rare cases when communication performance is not an issue, such as when scan filters make up most of the processing time or when very large batches of primary key accesses are used. In those cases the CPU processing in the ndbd processes becomes a fairly large part of the cost.

Using the SCI transporter instead of SCI Sockets is only of interest for communication between ndbd processes. Using the SCI transporter is also only of interest if a CPU can be dedicated to the ndbd process, since the SCI transporter ensures that ndbd never goes to sleep. It is also important to ensure that the ndbd process priority is set in such a way that the process does not lose priority due to running for an extended time (as can be done by locking processes to CPUs in Linux 2.6). If such a configuration is possible, the ndbd process will benefit by 10-70% compared to using SCI Sockets (the larger figures apply when performing updates and probably also for parallel scan activities).
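As a brief sketch of the CPU locking mentioned above (this assumes the taskset utility available on Linux 2.6 systems; the CPU mask 0x2, meaning the second CPU, is only an example):

shell> taskset 0x2 ndbd    # start ndbd bound to the second CPU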

Several other implementations of optimised socket variants for clusters have been reported in various papers, including optimised socket variants for Myrinet, Gigabit Ethernet, InfiniBand, and the VIA interface. So far we have tested MySQL Cluster only with SCI Sockets, and the documentation above shows how to set up SCI Sockets using an ordinary TCP/IP configuration for MySQL Cluster.