Planet MySQL

  • Sysbench evaluation of RonDB
    Introduction

    Sysbench is a tool for benchmarking open source databases. We have integrated Sysbench into the RonDB installation, which makes it extremely easy to run benchmarks with RonDB. This paper describes the use of these benchmarks in RonDB. These benchmarks were executed with 1 cluster connection per MySQL Server, which limited the scalability per MySQL Server to about 12 VCPUs. Since we executed those benchmarks we have increased the number of cluster connections per MySQL Server to 4, providing scalability to at least 32 VCPUs per MySQL Server.

    As preparation for these benchmarks we created a RonDB cluster using the Hopsworks framework that is currently used to create RonDB clusters. In these tests all MySQL Servers use the c5.4xlarge VM instance type in AWS (16 VCPUs with 32 GB memory). We have a RonDB management server using the t3a.medium VM instance type. We tested two different RonDB clusters, both with 2 data nodes. The first test uses the r5.4xlarge instance type (16 VCPUs and 128 GB memory) and the second test uses the r5n.8xlarge (32 VCPUs and 256 GB memory). It was necessary to use the r5n class since we needed more than 10 Gbit/second of network bandwidth for the OLTP RW test with 32 VCPUs on the data nodes.

    In the RonDB documentation you can find more details on how to set up your own RonDB cluster, either in our managed version (currently supporting AWS) or using our open source shell scripts (currently supporting GCP and Azure). RonDB is a new distribution of MySQL NDB Cluster. All experiments in this blog were performed using RonDB 21.04.0.

    The graph below shows the throughput results from the larger data nodes using 8-12 MySQL Server VMs. In order to make sense of these numbers we will explain a bit more about Sysbench and how you can tweak it to serve your purposes for testing RonDB.

    Description of Sysbench OLTP RW

    The Sysbench OLTP RW benchmark consists of 20 SQL queries.
    These run within a transaction: it starts with a BEGIN statement and ends with a COMMIT statement. After the BEGIN statement follow 10 SELECT statements that each select one row using the primary key of the table. Next follow 4 SELECT queries that select 100 rows within a range, using SELECT DISTINCT, SELECT ... ORDER BY, a plain SELECT, and SELECT SUM(..) respectively. Finally there is one INSERT, one DELETE and 2 UPDATE queries.

    In pseudo code:

      BEGIN
      Repeat 10 times: SELECT col(s) FROM TAB WHERE PK=pk
      SELECT col(s) FROM TAB WHERE key >= start AND key < (start + 100)
      SELECT DISTINCT col(s) FROM TAB WHERE key >= start AND key < (start + 100)
      SELECT col(s) FROM TAB WHERE key >= start AND key < (start + 100) ORDER BY key
      SELECT SUM(col) FROM TAB WHERE key >= start AND key < (start + 100)
      INSERT INTO TAB VALUES (....)
      UPDATE TAB SET col=val WHERE PK=pk
      UPDATE TAB SET col=val WHERE key=key
      DELETE FROM TAB WHERE PK=pk
      COMMIT

    This is the standard OLTP RW benchmark.

    Benchmark Execution

    Now I will describe some changes that the Sysbench installation in RonDB can handle. To understand this we will start by showing the default configuration file for Sysbench:

      #
      # Software definition
      #
      MYSQL_BIN_INSTALL_DIR="/srv/hops/mysql"
      BENCHMARK_TO_RUN="sysbench"
      #
      # Storage definition (empty here)
      #
      #
      # MySQL Server definition
      #
      SERVER_HOST=";"
      MYSQL_PASSWORD='3*=13*8@20.*0@7$?=45'
      #
      # NDB node definitions (empty here)
      #
      #
      # Benchmark definition
      #
      SYSBENCH_TEST="oltp_rw"
      SYSBENCH_INSTANCES="2"
      THREAD_COUNTS_TO_RUN="1;2;4;8;12;16;24;32;48;64;96;112;128"
      MAX_TIME="30"

    In this configuration file we provide a pointer to the RonDB binaries, the type of benchmark we want to execute, the password to the MySQL Servers, and the number of threads to execute in each step of the benchmark.
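As a concrete illustration, the transaction described above can be sketched as a small script that emits its 20 statements. The table and column names (sbtest1, id, k, c, pad) follow the usual Sysbench schema and are assumptions for illustration, not something RonDB-specific:

```shell
# Emit the statements of one Sysbench OLTP RW transaction.
# Table/column names (sbtest1, id, k, c, pad) are the usual Sysbench
# schema, assumed here for illustration only.
emit_oltp_rw() {
  start=$1
  echo "BEGIN;"
  for i in 1 2 3 4 5 6 7 8 9 10; do
    echo "SELECT c FROM sbtest1 WHERE id = $((start + i));"
  done
  echo "SELECT c FROM sbtest1 WHERE k >= $start AND k < $((start + 100));"
  echo "SELECT DISTINCT c FROM sbtest1 WHERE k >= $start AND k < $((start + 100));"
  echo "SELECT c FROM sbtest1 WHERE k >= $start AND k < $((start + 100)) ORDER BY k;"
  echo "SELECT SUM(k) FROM sbtest1 WHERE k >= $start AND k < $((start + 100));"
  echo "INSERT INTO sbtest1 (k, c, pad) VALUES ($start, 'aaa', 'bbb');"
  echo "UPDATE sbtest1 SET c = 'ccc' WHERE id = $start;"
  echo "UPDATE sbtest1 SET k = k + 1 WHERE k = $start;"
  echo "DELETE FROM sbtest1 WHERE id = $start;"
  echo "COMMIT;"
}

emit_oltp_rw 1000 | wc -l   # 20 statements: BEGIN + 18 queries + COMMIT
```

Such a stream could be piped into a mysql client to eyeball the per-statement cost of one transaction, though Sysbench itself drives these via prepared statements.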
    There is also a list of IP addresses of the MySQL Servers in the cluster, and finally we provide the number of Sysbench instances we want to execute.

    This configuration file is created automatically by the managed version of RonDB. It is available on the API nodes you created when you created the cluster. It is also available on the MySQL Server VMs if you want to test running with a single MySQL Server colocated with the application.

    The default setup will run the standard Sysbench OLTP RW benchmark with one Sysbench instance per MySQL Server. To execute this benchmark the following steps are done:

    Step 1: Log in to the API node VM from which you want to run the benchmark. The username is ubuntu (in AWS). Thus log in using e.g. ssh ubuntu@IP_address. The IP address is the external IP address that you will find in AWS where your VM instances are listed.

    Step 2: After successfully logging in, switch to the mysql user using the command:

      sudo su - mysql

    Step 3: Move to the right directory:

      cd benchmarks

    Step 4: Execute the benchmark with --default-directory /home/mysql/benchmarks/sysbench_multi

    Benchmark Results

    As you will discover, there are also sysbench_single, dbt2_single and dbt2_multi directories. These are set up for different benchmarks that we will describe in future papers. sysbench_single is the same as sysbench_multi but with only a single MySQL Server. It also exists on the MySQL Server VMs if you want to benchmark from those.
    Executing a benchmark from the Sysbench machine increases latency since it represents a 3-tiered setup, whereas executing Sysbench on the MySQL Server represents a 2-tiered setup where the latency is lower.

    If you want to study the benchmark in real time, repeat Steps 1, 2 and 3 above and then perform the following commands:

      cd sysbench_multi/sysbench_results
      tail -f oltp_rw_0_0.res

    This will display the output from the first Sysbench instance, providing latency and throughput numbers for that instance. When the benchmark has completed, the total throughput is found in the file:

      /home/mysql/benchmarks/sysbench_multi/final_results.txt

    Modifying the Sysbench benchmark

    The configuration for sysbench_multi is found in:

      /home/mysql/benchmarks/sysbench_multi/autobench.conf

    Thus if you want to modify the benchmark you can edit this file.

    So how can we modify this benchmark? First you can decide how many SELECT statements using the primary key should be issued. The default is 10. To change this, add the following line to autobench.conf:

      SB_POINT_SELECTS="5"

    With this setting, 5 primary key SELECTs will instead be issued for each transaction.

    Next you can decide that you want those primary key SELECTs to retrieve a batch of primary keys. In this case the SELECT will use IN (key1, key2, ..., keyN) in the WHERE clause. To use this, set the number of keys to retrieve per statement in SB_USE_IN_STATEMENT. Thus to set this to 100, add the following line to autobench.conf:

      SB_USE_IN_STATEMENT="100"

    This means that if SB_POINT_SELECTS is set to 5 and SB_USE_IN_STATEMENT is set to 100, there will be 500 key lookups performed per Sysbench OLTP transaction.

    Next it is possible to set the number of range scan SELECTs to perform per transaction. To e.g. disable all range scans we can add the following lines to autobench.conf:

      SB_SIMPLE_RANGES="0"
      SB_ORDER_RANGES="0"
      SB_DISTINCT_RANGES="0"
      SB_SUM_RANGES="0"

    Now it is also possible to modify the range scans.
    I mentioned that the range scans retrieve 100 rows. The number 100 is changeable through the configuration parameter SB_RANGE_SIZE.

    The default behaviour is to retrieve all 100 rows and send them back to the application, thus no filtering. We also have an option to perform filtering in those range scans. In this case only 1 row will be returned, but we will still scan the number of rows specified in SB_RANGE_SIZE. This feature of Sysbench is activated by adding the following line to autobench.conf:

      SB_USE_FILTER="yes"

    Finally it is possible to remove the use of the INSERT, DELETE and UPDATE queries. This is done by changing the configuration parameter SYSBENCH_TEST from oltp_rw to oltp_ro.

    There are many more ways to change the configuration of how to run Sysbench, but these settings are enough for this paper. For more details see the documentation of dbt2-0.37.50, and also see the Sysbench tree.

    Benchmark Configurations

    In the benchmarking reported in this paper I used 2 different configurations. Later we will report more variants of Sysbench testing as well as other benchmark variants. The first is the standard Sysbench OLTP RW configuration. The second is the standard benchmark with SB_USE_FILTER="yes" added. This was added since the standard benchmark becomes limited by network bandwidth when using r5.8xlarge instances for the data nodes. This instance type is limited to 10G Ethernet and it needs almost 20 Gb/s in networking capacity with the performance that RonDB delivers. This bandwidth is achievable using the r5n instances.

    Each test of Sysbench creates the tables and fills them with data. To have a reasonable execution time of the benchmark, each table is filled with 1M rows. Each Sysbench instance uses its own table. It is possible to set the number of rows per table, and it is also possible to use multiple tables per Sysbench instance.
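Putting the knobs above together, a modified section of autobench.conf could look like the following sketch. The values are illustrative examples, not recommendations; autobench.conf uses shell-style variable assignments, so the combined effect can also be checked with shell arithmetic:

```shell
# Example overrides for autobench.conf (illustrative values only)
SB_POINT_SELECTS="5"        # 5 point SELECTs per transaction (default 10)
SB_USE_IN_STATEMENT="100"   # each point SELECT fetches 100 PKs via IN (...)
SB_RANGE_SIZE="100"         # rows scanned per range SELECT (the default)
SB_USE_FILTER="yes"         # each range scan returns 1 row instead of 100

# 5 point SELECTs x 100 keys each = 500 key lookups per transaction
echo $((SB_POINT_SELECTS * SB_USE_IN_STATEMENT))
```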
    Here we have used the default settings.

    The test runs are executed for a fairly short time to be able to test a large variety of test cases. This means that results are expected to be a bit better than in long-running tests. To see how results are affected by running for a long time, we also ran a few selected tests where we ran a single benchmark for more than 1 hour. The results in this case are around 10% lower than the numbers from the shorter runs. This is mainly due to variance in the throughput introduced by the execution of checkpoints in RonDB. Checkpoints consume around 5-10% of the CPU capacity in the RonDB data nodes.

    Benchmark setup

    In all tests set up here we have started the RonDB cluster using the Hopsworks infrastructure. In all tests we used c5.4xlarge as the VM instance type for the MySQL Servers. This VM has 16 VCPUs and 32 GB of memory, which means more or less 8 Intel Xeon CPU cores. In all tests there are 2 RonDB data nodes, and we tested with 2 types of VM instances: the r5.4xlarge, which has 16 VCPUs with 128 GB of memory, and the r5n.8xlarge, which has 32 VCPUs and 256 GB of memory.

    In the standard Sysbench OLTP RW test the network became a bottleneck when using r5.8xlarge. These VMs can use up to 10 Gb/sec, but in reality we could see that some instances could not go beyond 7 Gb/sec. When switching to r5n.8xlarge this jumped up to 13 Gb/sec immediately, so clearly this bottleneck was due to the AWS infrastructure. To ensure that the network bottleneck was removed, we switched to r5n.8xlarge instances for those benchmarks. These instances are the same as r5.8xlarge except that they can use up to 25 Gb/sec of network bandwidth instead of 10 Gb/sec.

    Standard Sysbench OLTP RW

    The first test we present here is the standard OLTP RW benchmark. When we run this benchmark, most of the CPU consumption happens in the MySQL Servers. Each MySQL Server is capable of processing about 4000 TPS for this benchmark.
    A MySQL Server can process a bit more if the responsiveness of the data node is better; this is likely caused by the CPU caches being hotter when the response comes back to the MySQL Server. Two 16 VCPU data nodes can in this case handle the load from 4 MySQL Servers; adding a 5th increases the performance slightly, but not much. We compared this to 2 data nodes using 32 VCPUs, and in principle the load these data nodes could handle was doubled.

    The response time was very similar in both cases. At extreme loads the larger data nodes had larger latency increases, most likely because we got much closer to the limits of what the network could handle.

    The top number here was 34870 TPS at 64 threads from 10 MySQL Servers. In this case 95% of the transactions had a latency of less than 19.7 ms, which means that the time for each SQL query was below 1 millisecond. This meant that almost 700k SQL queries per second were executed. These queries reported back 14.5M rows per second to the application for the larger data nodes, most of them coming from the 4 range scan queries in the benchmark. Each of those rows is a bit larger than 100 bytes, thus around 2 GByte per second of application data is transferred to the application (about 25% of this is aggregated in the MySQL Server when using the SUM range scan).

    Sysbench OLTP RW with filtering of scans

    Given that Sysbench OLTP RW is to a great extent a networking test, we also wanted to perform a test that did a bit more processing while reporting back a smaller number of rows. We achieved this by setting SB_USE_FILTER="yes" in the benchmark configuration file. This means that instead of each range scan SELECT reporting back 100 rows, it will read 100 rows, filter out 99 of them, and report only 1 of the 100 rows. This decreases the number of rows to process down to about 1M rows per second.
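A back-of-envelope check of the throughput numbers above. The ~410 rows per transaction (4 range scans of 100 rows plus ~10 point lookups) and ~140 bytes per row are rough assumptions consistent with the figures in the text:

```shell
tps=34870                # top result: 64 threads, 10 MySQL Servers
echo "QPS : $((tps * 20))"               # 20 statements/transaction -> ~697k/s
rows_per_txn=410                         # 4 x 100 range rows + ~10 PK rows (assumed)
echo "rows: $((tps * rows_per_txn))"     # ~14.3M rows/s, close to the 14.5M reported
echo "MB/s: $((tps * rows_per_txn * 140 / 1000000))"  # at ~140 bytes/row -> ~2 GB/s
```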
    Thus this test is a better evaluator of the CPU efficiency of RonDB, whereas the standard Sysbench OLTP RW is a good evaluator of RonDB's ability to ship tons of rows between the application and the database engine.

    First we wanted to see the effect the number of MySQL Servers had on the throughput in this benchmark. We see the results of this in the image above. There is an increase in throughput going from 8 to 12 MySQL Servers, but the additional effect of each added MySQL Server is diminishing, and there is very little to gain in going beyond 10 MySQL Servers. The optimal use of computing resources is most likely achieved around 8-9 MySQL Servers. Adding more MySQL Servers also has an impact on the variability of the latency. So probably the overall best fit here is to use about 2x more CPU resources on the MySQL Servers compared to the CPU resources in the RonDB data nodes. This rule is based on this benchmark and isn't necessarily true for other use cases. In the results, the smaller data nodes (r5.4xlarge) are the red line, which used 5 MySQL Servers in the test.

    The rule definitely changes when using the key-value store APIs that RonDB provides. These are at least 100% more efficient compared to using SQL.

    A key-value store needs to be a LATS database (low Latency, high Availability, high Throughput, Scalable storage). In this paper we have focused on showing Throughput and Latency. Above is the graph showing how latency is affected by the number of threads in the MySQL Server.

    Many applications have strict requirements on the maximum latency of transactions. For example, if the application requires the response time to be smaller than 20 ms, then we can see in the graph that we can use around 60 threads towards each MySQL Server. At this number of threads r5.4xlarge delivers 22500 TPS (450k QPS) and r5n.8xlarge delivers twice that number, 45000 TPS (900k QPS).

    The base latency in an unloaded cluster is a bit below 6 milliseconds.
    This number varies a bit depending on where exactly the VMs that get started for you are located. Most of this latency is network latency. Each network jump in AWS has been reported to be around 40-50 microseconds, and one transaction performs around 100 of those network jumps in sequence. Thus almost two-thirds of the base latency comes from the latency of getting messages across the network. At higher loads, the queueing while waiting for messages to be executed becomes dominating. Benchmarks where everything executes on a single computer have a base latency of around 2 milliseconds per Sysbench transaction, which confirms the above calculation.
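The latency decomposition above can be checked with simple arithmetic, using the lower 40-microsecond figure per network jump:

```shell
hops=100; us_per_hop=40
network_us=$((hops * us_per_hop))
echo "network: ${network_us} us"           # 4000 us = 4 ms of the ~6 ms base latency
echo "rest   : $((6000 - network_us)) us"  # ~2 ms, matching the single-machine runs
```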

  • Migrate from Amazon Aurora to MySQL Database Service and Start Saving
    You've been using Amazon Aurora to manage your data but recently found out that you could manage the same data on Oracle cloud for ⅓ of your cost on Amazon. You also learned that you could run your queries 1100x faster than on Amazon Aurora, including analytics queries, with no need for ETL. You can just use the same MySQL Database Service for your OLTP and OLAP queries. Most importantly, MySQL Database Service is the only database service for MySQL that requires no ETL for OLAP.

    Watch the full webinar on how to migrate from Amazon Aurora to MySQL Database Service
    Download the Powerpoint presentation (login required)
    See the benchmarks: MySQL Database Service vs Amazon Aurora
    Read what Larry Ellison has to say about MySQL Database Service vs Amazon Aurora
    Learn more about MySQL Database Service and HeatWave
    Try MySQL Database Service with HeatWave now

  • MySQL Disaster Recovery — Are You Feeling Lucky?
    Disasters are waiting to happen, like the OVH data center fire on March 9, 2021. Do you have a solid MySQL Disaster Recovery (DR) plan in place to activate when the time inevitably comes? If not, Continuent is happy to review your current MySQL HA/DR deployment and recommend possible changes. Naturally, free of charge.

    Tags: data center fire, ovh fire, mysql disaster recovery (dr), multi-site mysql

  • InnoDB File Growth Weirdness
    There is a common pattern in life: you often discover or understand things by accident. Many scientific discoveries fit such a description. In our database world, I was looking to see how BLOB/TEXT columns are allocated using overlay pages and I stumbled upon something interesting and unexpected. Let me present my findings, along with my attempt at explaining what is happening.

    InnoDB Tablespaces

    The first oddity I found is a bunch of free pages in each tablespace that InnoDB is skipping. Here's an example from a simple table with only an integer primary key and a char(32) column:

      root@LabPS8_1:~/1btr# innodb_space -f /var/lib/mysql/test/t1.ibd space-extents
      start_page  page_used_bitmap
      0           #####################################...........................  <--- free pages
      64          ################################################################
      128         ################################################################
      ...

    The innodb_space tool comes from the InnoDB Ruby project of Jeremy Cole. If you want to explore InnoDB file formats, you need these tools. As you can see, the first extent has 27 free pages. These free pages are reserved for node pages (non-leaf) and will eventually all be used. At this point, I thought a table with 34 index pages, just starting to use the second extent for the leaf pages, would have 90 free pages (27 + 63) and use 2MB of disk space. While the previous statement proved to be true, I was in for quite a surprise.

    InnoDB File Growth

    To better illustrate the amount of free space available in an InnoDB data file, I decided to follow the evolution of the tablespace file size as I added index pages. The following figure shows my result. As I added rows, more leaf pages were allocated until the file segment of the leaf pages reached 32 pages. At this point, the table has 33 index pages: one root and 32 leaves. The allocation of another page forces InnoDB to fully allocate the first extent and add a second one for the leaves.
    At this point, the size on disk is 2MB. If we keep inserting rows, the following page allocation triggers InnoDB to allocate 7 reserved extents of 1MB each, and the tablespace size on disk reaches 9MB.

    InnoDB uses the reserved extents for btree maintenance operations. They are not accounted for in the free space of the tablespace. Now, reserving 7 extents of 1MB each for a table containing only 560KB of data is pretty insane. At this point, the InnoDB tablespace has a size of 9MB on disk. This is extremely inefficient: about 8.4MB is just free space filled with "0". Of course, as the table grows, the size impact of these reserved extents is diluted. The amount of reserved space grows by about 1MB (1 extent) for every 100MB allocated.

    This allocation of reserved extents is far from optimal, especially in a multi-tenant era where it is fairly common to see MySQL servers handling more than 100k tables. 100k tables, each with only 1MB of data in them, will use 900GB of disk space. This phenomenon is not new; a bug report was created back in 2013 and is still open. The bug is considered low priority and non-critical. A lot of effort has been devoted to improving the capacity of MySQL 8.0 to handle a large number of tables. Until the allocation of reserved extents is fixed, be aware of this issue when planning your storage allocation.

    Of course, if you are using ZFS, the impact is more limited:

      root@LabPS8_1:~# du -hs /var/lib/mysql/test/t1.ibd
      225K    /var/lib/mysql/test/t1.ibd
      root@LabPS8_1:~# du -hs --apparent-size /var/lib/mysql/test/t1.ibd
      9,0M    /var/lib/mysql/test/t1.ibd

    My lab setup uses ZFS for LXC and KVM instances. LZ4 compression does magic on extents full of "0": the actual consumed space is reduced to 225KB. If you want to explore the use of ZFS with MySQL, I invite you to read a ZFS post I wrote a few years ago.
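The sizes above are easy to sanity-check, assuming the default 16KB InnoDB page size (the 35-page count is an inference from the 560KB figure, not something measured separately):

```shell
page_kb=16
echo "data      : $((35 * page_kb)) KB"    # ~35 used pages -> the 560 KB mentioned above
echo "reserved  : $((7 * 1024)) KB"        # 7 reserved extents of 1 MB each
echo "100k such tables: $((100000 * 9)) MB"  # 9 MB tablespaces -> ~900 GB on disk
```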
Ok, time now to close this parenthesis and go back to the storage of BLOB/TEXT columns. That will be for a future post though!

  • Online DDL with Group Replication Percona Server 8.0.22 (and MySQL 8.0.23)
    While I was working on my grFailOver POC, I also did some additional parallel testing. One of the things I wanted to see was how online DDL is executed inside a Group Replication cluster.

    The online DDL feature provides support for instant and in-place table alterations and concurrent DML. Checking the Group Replication (GR) official documentation, I was trying to identify whether any limitation exists, but the only thing I found was:

    "Concurrent DDL versus DML Operations. Concurrent data definition statements and data manipulation statements executing against the same object but on different servers is not supported when using multi-primary mode. During execution of Data Definition Language (DDL) statements on an object, executing concurrent Data Manipulation Language (DML) on the same object but on a different server instance has the risk of conflicting DDL executing on different instances not being detected."

    This only impacts multi-primary scenarios, which are NOT recommended and not my case. So in theory GR should be able to handle online DDL without problems.

    My scenario: I have two DCs and I am going to perform actions on my DC1 and see how they propagate, and what impact I will have.

    The test

    For the test I will run an insert from select, and a select, on my Primary node gr1:

      insert into windmills_test select null,uuid,millid,kwatts_s,date,location,active,time,strrecordtype from windmills7 limit 1;

    while on another connection I execute the ALTER:

      ALTER TABLE windmills_test ADD INDEX idx_1 (`uuid`,`active`), ALGORITHM=INPLACE, LOCK=NONE;

    As you may have noticed, I am EXPLICITLY asking for ALGORITHM=INPLACE and LOCK=NONE. So in case MySQL cannot satisfy this, it should exit and not execute the command. In the meantime, on all the other nodes, I will run a check command to see WHEN my ALTER is taking place.
    Let us roll the ball. On my Primary, the command to insert the data:

      [root@gr1 grtest]# while [ 1 = 1 ];do da=$(date +'%s.%3N');/opt/mysql_templates/mysql-8P/bin/mysql --defaults-file=./my.cnf -uroot -D windmills_s -e "insert into windmills_test select null,uuid,millid,kwatts_s,date,location,active,time,strrecordtype from windmills7 limit 1;" -e "select count(*) from windmills_s.windmills_test;" > /dev/null;db=$(date +'%s.%3N'); echo "$(echo "($db - $da)"|bc)";sleep 1;done

    Again on the Primary, another session to execute the ALTER:

      DC1-1(root@localhost) [windmills_s]>ALTER TABLE windmills_test ADD INDEX idx_1 (`uuid`,`active`), ALGORITHM=INPLACE, LOCK=NONE;

    On the other nodes, to monitor when the ALTER will start:

      while [ 1 = 1 ];do echo "$(date +'%T.%3N')";/opt/mysql_templates/mysql-8P/bin/mysql --defaults-file=./my.cnf -uroot -D windmills_s -e "show processlist;"|grep -i "alter";sleep 1;done

    What happens: data is inserted by the loop. The ALTER starts, but I can still insert data in my table, and most importantly the data is propagated to all nodes of the DC1 cluster. There is no ALTER action on the other nodes yet. The insert timings:

      .559
      .502
      .446
      .529
      .543
      .553
      .533
      .602
      .458  <---- end of the alter locally

    Once the ALTER is complete on the local node (Primary) it is then executed (broadcast) to all the nodes participating in the cluster. At that point no INSERTs are allowed; writes are suspended, waiting for a table metadata lock:

      37411 | root | localhost | windmills_s | Query | 19 | Waiting for table metadata lock | insert into windmills_test select null,uuid,millid,kwatts_s,date,location,active,time,strrecordtype

    and they eventually time out:

      ERROR 1205 (HY000) at line 1: Lock wait timeout exceeded; try restarting transaction  <--- waiting for handler commit

    The other point is that any write hangs until the slowest node has applied the ALTER. It is important to note that all nodes, not only the PRIMARY, remain pending, waiting for the slow node: the slowest drives all.
    GR3:

      11:01:28.649  48 system user windmills_s Query 171 altering table             ALTER TABLE windmills_test ADD INDEX idx_1 (`uuid`,`active`), ALGORITHM=INPLACE, LOCK=NONE
      11:01:29.674  48 system user windmills_s Query 172 waiting for handler commit ALTER TABLE windmills_test ADD INDEX idx_1 (`uuid`,`active`), ALGORITHM=INPLACE, LOCK=NONE

    GR2 starts:

      11:00:14.438  18 system user windmills_s Query 97  altering table ALTER TABLE windmills_test ADD INDEX idx_1 (`uuid`,`active`), ALGORITHM=INPLACE, LOCK=NONE

    and ends:

      11:02:00.107  18 system user windmills_s Query 203 altering table ALTER TABLE windmills_test ADD INDEX idx_1 (`uuid`,`active`), ALGORITHM=INPLACE, LOCK=NONE

    Finally, when the last node in the GR cluster has applied the ALTER, the writes resume, and the Replica node on DC2 starts its ALTER operation, on its PRIMARY first, then on its other nodes.

    Summarizing

    We can say we have 3 phases. Phase 1 is when the Primary starts to write and we have an ONLINE operation. Phase 2 is when the Primary ends the local operation and transmits the changes to the Secondary nodes; here all writes are locked by the metadata lock. Phase 3 is when all nodes have finalized the operation, the ALTER is passed over the replication channel (async replica) and the metadata lock is removed.

    In short:
    1. Writes are executed on the Primary. The ALTER is executed on the Primary; the DDL does not impact the write operations and respects the non-blocking directive.
    2. The ALTER is completed on the Primary and passed to all nodes. A metadata lock is raised on the nodes, and the whole cluster waits for the slowest node to complete.
    3. When all is done in DC1, the action is replicated to DC2, going back to point 2.

    BUT!!! If the Primary goes down, the cluster is unable to elect a new Primary until a Secondary has completed the operations. Meaning that if the ALTER takes 10 hours, we have a possible cluster without a Primary for 10 hours. See also the related bug report.

    Conclusion

    It seems that at the moment we have partial coverage of the online DDL feature when using Group Replication.
    Of course, having to wait for the SECONDARY nodes is better and less impacting than waiting for the PRIMARY first and then the SECONDARIES. But it is confusing, given that I was expecting either full online coverage (I had explicitly asked for that in the DDL command) or a message telling me it cannot be executed online. Of course I would prefer to have FULL online coverage ;0)

    Keep in mind that my setup was also pretty standard and that changing group_replication_consistency does not affect the outcome. I am not sure I can classify this as a bug; it is more an unexpected, undesirable behavior.

    Finally, I cannot avoid saying that it seems a bit too much to expect users to accept waiting two times the duration of an ALTER, first on the primary and then on the secondary nodes. I know we have PT-OSC to save the day, but I really think that native support should be better, allowing the ALTER to complete when it completes on the Primary, not after doubling the required time.