The Cost of Streaming Data from MySQL

Repost of my blog post in Wix Engineering Blog with some extra information.

Though the question of “How to query data from a database?” is quite an old one, we still have a reason to investigate it nowadays. True, it is common to query a list of rows from a database since that usually means less code and configuration and it’s also the default approach. However, in certain scenarios, this method is just not optimal. In these cases, we don’t need all rows in memory right away, and sometimes we just can’t afford it.

When addressing the issue of querying a significant amount of data, limited by your server capacity, you may have to go with one of the following solutions: Limiting row count could work, but you won’t get all the data; pagination is another solution. However, you’ll need to perform several queries instead of a single one; and finally, streaming seems like a good option as you still send a single query and receive all the data. However, with streaming, your single query has many underlying requests under the hood to MySQL DB, which may result in longer data processing time. Since requests are performed from a driver, it’s not related to MySQL DB storage engine.

While looking into these solutions, I realized there weren’t any benchmark tests for streaming at all. So I decided to test the performance cost for streaming in JVM.

What is streaming

Note: all source code examples are in Scala.

Here’s how it works:

// conn is java.sql.Connection
val conn = getConnection(driver)
val st: java.sql.Statement = conn.createStatement(
java.sql.ResultSet.TYPE_FORWARD_ONLY,
java.sql.ResultSet.CONCUR_READ_ONLY
)
st.setFetchSize(Integer.MIN_VALUE)

val rs: java.sql.ResultSet = st.executeQuery("SELECT * FROM table WHERE id = 123")

When you configure the Statement like that on the executeQuery method call, you get a streaming ResultSet which contains only the first row. In this ResultSet, the next row will be loaded from a database only on the next method call.

When is it applicable?

The straightforward way of getting an actual state of an object in event sourcing is to play all events from an event stream. However, because this is a fold operation and not all events are needed right away, you will need to apply events one by one.

Common sense suggests that applying streaming to event sourcing cannot possibly be “free.” There should be some performance cost for it, otherwise, why do we need a special mode in the MySQL driver?

So… Let’s test!

Obviously, we can’t compare a general use case of streaming, so instead, we will test getting a list of rows via streaming and via the default mode (getting everything at once). This should show us any overhead of the MySQL driver and additional network interaction between our application and the MySQL server.

How to test?

def selectAtOnce(driver: MysqlDriver, limit: Int): List[TestTableRow] = {
val st = getConnection(driver).createStatement()
val rs = st.executeQuery(s"SELECT id, name FROM $TableName LIMIT $limit")
// at this point in rs we have all rows!
val result = mutable.ListBuffer[TestTableRow]()
while (rs.next()) {
result += mapRow(rs)
}
result.toList
}

In the selectAtOnce method, we simply establish the connection to the database, select limit rows, and map it to the result list. This is what the test code should look like so that the driver knows it should turn on streaming. The only distinction is the setFetchSite (Int.MinValue) method call, which makes all the difference:

def selectViaStreaming(driver: MysqlDriver): List[TestTableRow] = {
val st = getConnection(driver).createStatement()
st.setFetchSize(Int.MinValue)
val rs = st.executeQuery(s"SELECT id, name FROM $TableName LIMIT $limit")
// in rs we don’t have any data until the next() method call
val result = mutable.ListBuffer[TestTableRow]()
while (rs.next()) {
result += mapRow(rs)
}
result.toList
}

In selectViaStreaming, we do the same, but we hint the driver that we need the special streaming ResultSet. This ResultSet fetches data only on the next method call.

What To Test

For each driver, I ran two sets of tests: one for the default querying (.atOnce) and one for the streaming (.stream). Each set consisted of several test cases to query different amount of rows (via LIMIT clause); from 1 to 1000.

To check the network influence, I performed tests in three different configurations: Benchmark and MySQL on the same server (local). Benchmark and MySQL on different servers connected through Wi-Fi (wifi = simulation of a “slow” network with ping greater than a millisecond). Benchmark and MySQL on different servers but connected through the wire (wire = fast network).

Results

Local

Local, microseconds

Wi-Fi (slow network with latency)

Wi-Fi (slow network simulation), microseconds

Wire

In the case of MariaDB the difference is insignificant. We may see the difference between streaming and non-streaming decreasing from two times for small amount of rows to tens of percent for big amount of rows.

ConnectorJ’s non-streaming is slightly better than everything else.

Wire, microseconds

You can play with charts here.

Conclusion

So, if you have a local installation of a MySQL server or a powerful network between the application server and the MySQL server, then you may want to consider using streaming for your service from scratch. The difference in performance is not critical, plus you won’t need to worry about scaling so it won’t hurt your service reliability in the future.

Testing Environment (or, how I tested it)

The use case for the event sourcing simulation is 1000 rows with unique ID’s and 200 bytes of payload (which is an average size for events that we have in projects at Wix).

To test the correctness of my benchmark I used an open source library wix-embedded-mysql. It is a highly useful tool that enables you not to think about MySQL server installation (it downloads, runs, and shutdowns a real MySQL server instance in your test).

You can check out my GitHub repo for more details.

Extra

weird case on wire, microseconds

My only assumption here was that there was something really strange with network and MariaDB went mad, but then same thing we would see on Wi-Fi chart.

Software developer, moved to Israel from Russia, trying to be aware of things.