
New options and better performance in cqlsh copy


Introduction

COPY FROM and COPY TO are cqlsh commands for importing and exporting data to and from Cassandra using CSV files. They were introduced in version 1.1.3 and were previously discussed here. These commands have been enhanced in the latest releases with the following new features:

  • CASSANDRA-9302 improves the performance of COPY FROM with batched prepared statements that are sent to the correct replica via token aware routing.
  • CASSANDRA-9304 improves the performance of COPY TO with token aware multi-process data export.
  • CASSANDRA-9303 introduces new options to fine-tune COPY operations to specific use cases.

We will review these new features in this post; they will be available in the following Cassandra releases: 2.1.13, 2.2.5, 3.0.3 and 3.2.

Synopsis

Here is a brief reminder on how to use COPY commands in cqlsh:

COPY table_name ( column, ...)
FROM ( 'file_pattern1, file_pattern2, ...' | STDIN )
WITH option = 'value' AND ...

COPY table_name ( column , ... )
TO ( 'file_name' | STDOUT )
WITH option = 'value' AND ...

File patterns are either file names or valid Python glob expressions such as folder/*.csv and so forth. You can use file patterns to import multiple files. Refer to the examples at the end of this post.

How it works

COPY commands launch worker processes that perform the actual import or export in parallel. By default one worker process is launched per CPU core, with one core reserved for the parent process and a maximum of 16 workers. This number can be changed with the new NUMPROCESSES option, which accepts any value with no maximum. Worker and parent processes communicate via Python queues.

When exporting data, the ring tokens are first retrieved in order to split the ring into token ranges. These ranges are passed to worker processes, which send one export request per token range to one of the replicas where the range is located. The results are finally concatenated in the CSV output file by the parent process. For this reason, the export order across partitions is not deterministic. Parallel export via token ranges is only available for the random and murmur3 partitioners. For other partitioners the behaviour of COPY TO has not changed: the entire data set is retrieved from the server to which cqlsh is connected.

When importing data, the parent process reads chunks of CHUNKSIZE rows from the input file(s) and sends each chunk to a worker process. Each worker process then analyses its chunk, looking for rows with common partition keys. If at least 2 rows with the same partition key are found, they are batched and sent to a replica that owns the partition. You can control this minimum number of rows with the new MINBATCHSIZE option, but it is advisable to leave it set to 2. Rows that do not share a partition key are batched with other rows whose partition keys belong to a common replica; these batches are split so that they contain at most MAXBATCHSIZE rows, 20 by default, and are then sent to the replicas where the partitions are located. Batches are of type UNLOGGED in both cases.
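
For example, the chunking and batching described above can be tuned when invoking the command; this is a hedged sketch using a hypothetical ks.table and file name:

COPY ks.table FROM 'table.csv' WITH CHUNKSIZE=5000 AND MAXBATCHSIZE=50 AND NUMPROCESSES=8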

Performance benchmarks

The benchmark below compares release 2.2.3 (before the changes were implemented) and 2.2.5 (2.2 HEAD as of January 12, 2016). The data was generated by inserting 1 million rows with cassandra-stress into a 3-node cluster running locally on a laptop (8-core Intel i7-4702HQ CPU @ 2.20GHz, 16 GB of memory and a hybrid HDD). The data is exported to a CSV file using default options and then re-imported.

[Figure: copy_perf benchmark chart comparing import and export rates in 2.2.3 and 2.2.5]

We can observe an improvement of 70% for importing and 1,280% for exporting data. The data in this benchmark is random, with mostly unique partition keys; data that shares partition keys more consistently would import at a substantially higher rate.

Options

Common options

The following options can be used for both COPY TO and COPY FROM.

These options existed already and have not changed:

  • DELIMITER, a character used to separate fields in the input and output files. It defaults to ‘,’.
  • QUOTE, a character used to quote fields containing special characters, such as the delimiter. It defaults to ‘"’.
  • ESCAPE, a character used to escape other characters such as the quoting character or the delimiter. It defaults to ‘\’.
  • HEADER, a boolean indicating if the first line contains a header, which should be skipped when importing. When exporting, it indicates if the header should be printed. It defaults to false.
  • NULL, a string that represents null values. It defaults to the empty string. When exporting values that are missing, they will be indicated by this string. When importing, if this string is encountered it is assumed the value should be imported as missing (null). You should set this string to some other character sequence if you need to import blank strings.
  • DATETIMEFORMAT, which used to be called TIMEFORMAT, a string containing the Python strftime format for date and time values, such as ‘%Y-%m-%d %H:%M:%S%z’. It defaults to the time_format value in cqlshrc.

These options are new:

  • MAXATTEMPTS, an integer indicating the maximum number of attempts to import a batch or to export a range in case of a server error. Note that for some permanent errors, such as parse errors, no further attempts are made. It defaults to 5.
  • REPORTFREQUENCY, a decimal number indicating the interval in seconds between status updates. It defaults to 0.25, i.e. four updates per second.
  • DECIMALSEP, a character representing the separator for decimal values. It defaults to ‘.’. When this is specified, numbers will be formatted with this separator when exporting. When importing, numbers are parsed using this separator.
  • THOUSANDSSEP, a character representing the thousands separator in digits. It defaults to empty. When this is specified, numbers will be formatted with this separator when exporting. When importing, numbers are parsed using this separator.
  • BOOLSTYLE, a string containing two case insensitive words separated by a comma that represent boolean values. It defaults to ‘true, false’. Valid examples are ‘yes, no’ or ‘1, 0’ and so forth.
  • NUMPROCESSES, an integer indicating the number of worker processes. It defaults to the number of cores minus one, capped at 16.
  • CONFIGFILE, a string pointing to a configuration file with the same format as .cqlshrc (see the Python ConfigParser documentation). In this file you can specify other options under the following sections: [copy], [copy-to], [copy-from], [copy:ks.table], [copy-to:ks.table], [copy-from:ks.table], where ks is your keyspace name and table is your table name. Not all sections are mandatory and options are read from these sections in the order just specified. Command line options always override options in configuration files. Depending on the COPY direction, only the relevant copy-from or copy-to sections are used. If no configuration file is specified, then the default .cqlshrc is searched.
  • RATEFILE, a string pointing to an optional file where output statistics will be printed. These are the same progress statistics that are normally displayed to standard output.

COPY TO

The following options are applicable to COPY TO.

This option existed already and has not changed:

  • ENCODING, a string representing the encoding for the output files. It defaults to ‘utf8’.

These options are new:

  • PAGESIZE, an integer indicating the page size for fetching results. It defaults to 1,000. The bigger the page size, the longer the page timeout should be. You may want to change this parameter if you have very large or very small partitions.
  • PAGETIMEOUT, an integer indicating the timeout in seconds for fetching each page. It defaults to 10 seconds. Increase it for large page sizes or large partitions: if you notice timeouts, you should consider increasing this value. In case of server timeouts, an exponential backoff policy kicks in automatically, so you may notice delays, but this is to prevent overloading the server even further. The driver can also generate timeouts; in this case there is a small chance data may be missing or duplicated, since the driver doesn’t know if the server will later drop the request or return a result. Increasing this value is very helpful to prevent driver-generated timeouts.
  • BEGINTOKEN, a string representing the minimum token to consider when exporting data. Records with smaller tokens will not be exported. It defaults to empty, which indicates no minimum token.
  • ENDTOKEN, a string representing the maximum token to consider when exporting data. Records with bigger tokens will not be exported. It defaults to empty, which indicates no maximum token.
  • MAXREQUESTS, an integer indicating the maximum number of in-flight requests each worker process can work on. The total level of parallelism when exporting data is given by the number of worker processes multiplied by this value. It defaults to 6. Each request will export the data of an entire token range.
  • MAXOUTPUTSIZE, an integer indicating the maximum size of the output file measured in number of lines. Beyond this value, the output file will be split into segments. It defaults to -1, which indicates an unlimited maximum and therefore a unique output file.

COPY FROM

The following options are applicable to COPY FROM; they are all new:

  • CHUNKSIZE, an integer indicating the size of chunks passed to worker processes. It defaults to 1,000. The parent process reads chunks from the input files and delivers them to worker processes. Worker processes will then sort individual chunks to group records with the same partition key, and failing this, belonging to the same replica. The bigger the chunk the higher the probability of successfully batching records by partition key, but also the higher the amount of memory used.
  • INGESTRATE, an integer indicating an approximate ingest rate in rows per second. It defaults to 100,000. The parent process will not send additional chunks to worker processes if this rate is exceeded. Note that 100,000 may be higher than the maximum rate your cluster supports, in which case the ingest rate reported by the output statistics would be lower.
  • MINBATCHSIZE, an integer indicating the minimum size of an import batch. It defaults to 2. If there are at least MINBATCHSIZE records with the same partition key in a chunk, then a batch with these records is sent. It makes sense to keep this parameter as low as 2 because each single partition corresponds to a single insert operation server side, so 2 rows would be inserted at the price of 1.
  • MAXBATCHSIZE, an integer indicating the maximum size of an import batch. It defaults to 20. Records that do not have common partition keys in a chunk, end up in a left-overs group for each replica. These rows are then batched together up to this maximum number of rows. Even though batched rows are sent to the correct replica, a number too large here may cause timeouts or batch size warnings server side.
  • MAXROWS, an integer indicating the maximum number of rows to be imported. It defaults to -1. When this number is a positive integer, data import stops when this number of rows is reached. If there is a header row and you’ve specified HEADER=TRUE, it does not count towards this maximum.
  • SKIPROWS, an integer indicating the number of rows to skip. It defaults to 0, so no rows will be skipped. You can specify this number to skip an initial number of rows. If there is a header row and you’ve specified HEADER=TRUE, it does not count since it is always skipped.
  • SKIPCOLS, a string containing a comma separated list of column names to skip. By default no columns are skipped. To specify which columns are in your input file you should use the COPY FROM syntax as usual, here you can specify columns that are in the file but that should not be imported.
  • MAXPARSEERRORS, an integer indicating the maximum global number of parsing errors. It defaults to -1. When this number is a positive integer and the total number of parse errors reaches it, data import stops.
  • MAXINSERTERRORS, an integer indicating the maximum global number of insert errors. It defaults to -1. When this number is a positive integer and the total number of server-side insert errors reaches it, data import stops.
  • ERRFILE, a string pointing to a file where all rows that could not be imported are saved. It defaults to import_ks_table.err where ks is your keyspace and table is your table name. You’ll find in this file all rows that could not be imported for any reason. If an existing err file is found, it is renamed with a suffix that contains the current date and time.
  • TTL, an integer indicating the time to live in seconds; by default data will not expire. Use this option to insert data that will expire after the specified number of seconds. This option is only available in Cassandra 3.2.

Examples

Importing all csv files in folder1 and folder2:

COPY ks.table FROM 'folder1/*.csv, folder2/*.csv'

Splitting an output file into segments of 1,000 lines each:

COPY ks.table TO 'table.csv' WITH MAXOUTPUTSIZE=1000

Here is how to fix the ingest rate to 50,000 rows per second:

COPY ks.table FROM 'table.csv' WITH INGESTRATE=50000
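
Skipping the first ten data rows and importing at most 1,000 rows (an additional hedged example with a hypothetical ks.table; with HEADER=TRUE the header row counts towards neither limit):

COPY ks.table FROM 'table.csv' WITH HEADER=TRUE AND SKIPROWS=10 AND MAXROWS=1000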

Here is how to add COPY options to the cqlshrc configuration file, taken from cqlshrc.sample:

; optional options for COPY TO and COPY FROM
[copy]
maxattempts=10
numprocesses=4

; optional options for COPY FROM
[copy-from]
chunksize=5000
ingestrate=50000

; optional options for COPY TO
[copy-to]
pagesize=2000
pagetimeout=20

Testing Apache Cassandra with Jepsen


Introduction

As a developer, I’ve found that looking for new ways to effectively test software has a dramatic payoff. We’ve written about some of the techniques we use at DataStax in the past.

During my summer internship at DataStax, I looked at how we could use the Jepsen framework for distributed systems analysis in testing Cassandra. In the following sections, I’ll give overviews of why Jepsen interests us, how we used Jepsen, what we learned, and our plans for the future.

What’s Jepsen?

If you’re familiar with Jepsen, feel free to skip ahead. If not, we can help clear things up. Jepsen is often used as an umbrella term for a few related things: the open-source framework for testing distributed systems, and the well-known series of published analyses built on top of it.

In this post, we’ll be talking about tests and analysis we independently performed using the Jepsen framework.

Why Jepsen?

At first glance, it might not be clear how Apache Cassandra could benefit from Jepsen in comparison to the testing approaches we already use: unit tests and distributed tests (dtests, for short). To highlight the differences, we can evaluate these styles of testing with respect to the criteria of state space coverage, controllability, and observability.

Unit tests excel at controllability and observability of Cassandra; since they are written and maintained as part of Cassandra’s source, they are most effective at manipulating a single Cassandra process. On the other hand, they aren’t very good at extensively exploring the state space of a Cassandra process or cluster. They exercise specific state space traces of a single process; they do not effectively model the operation of a real cluster and do not reflect the variety of inputs in a real deployment. Unit tests enable only limited observability outside of the Cassandra process.

Dtests offer a powerful balance of the three categories. Since they use CCM to start and observe a real Cassandra cluster at the client boundaries, coverage of the state space of a Cassandra cluster increases. At the same time, given that they run on a single node, the environment realities of a Cassandra cluster do not get adequately explored. System-level tools power observability, and CCM offers effective controllability of the Cassandra processes, but targeted exploration of specific state space traces is not easy. Dtests are conventionally written as a sequence of specific, deterministic steps; for example, one might write a test that performs writes, bootstraps a node, and then confirms that the writes are still present.

Because it uses SSH for configuration and manipulation, Jepsen allows system-level controllability and observability. For us, its strength lies in its ability to better explore the state space of both a single node and a whole cluster. As opposed to the targeted explorations of a typical dtest, a Jepsen test embraces concurrency and nondeterminism as the default mode of operation. Its powerful generator abstraction enables the test author to rapidly design and implement sequences of random instructions dictating how multiple processes should manipulate the database or environment. For example, several processes might be reading or writing to the database while another process periodically partitions the network. After the completion of such a randomized run, Jepsen checks whether this test’s execution trace preserves certain properties. As a result of this more accurate environmental modeling and improved state space coverage, Jepsen tests necessarily take longer to run and require more system resources. This means they work well as another complementary step to find more nuanced paths to failure.

Jepsen in Practice

Jepsen tests work by checking invariants against a history produced by a test run. The test progresses through several high-level phases:

  1. Set up the OS on the nodes under test
  2. Set up Cassandra on the nodes under test
  3. Client/nemesis processes run operations provided by generator
  4. Checkers verify invariants against history

Our Jepsen tests implement several variations on each of the steps above. They permit the installation of any of the Cassandra 2.1, 2.2, and 3.x releases. Environment variables allow easy configuration of the compaction strategy, hints, commitlog compression, and various other parameters at test runtime. Clients exist for read and write of CQL sets, CQL maps, counters, batches, lightweight transactions, and materialized views. We use checkers to ensure that data structures function as expected at a variety of read-write consistencies, that lightweight transactions are linearizable, and that materialized views accurately reflect the base table. We verify all these properties under failure conditions such as network partitions, process crashes, and clock drift. Because Cassandra seeks to maintain these safety properties while undergoing cluster membership changes, we implemented a conductor abstraction that allows multiple nemeses to run concurrently. This allowed us to execute the above tests while adding or removing nodes from the cluster. In particular, we paid great attention to lightweight transactions, as linearizability is a more challenging consistency model to provide.

These tests helped to identify and reproduce issues in existing subsystems leading up to the release of 3.0: CASSANDRA-10231, CASSANDRA-10001 and CASSANDRA-9851. CASSANDRA-10231 in particular is a powerful example of how randomization can produce a hard-to-predict interleaving of cluster state composition changes and failure conditions.

Our Jepsen tests helped to stabilize the new materialized views feature before its release in Cassandra 3.0. Materialized views offer a nice case study for the value of these tests. Modeling materialized views is simple; we want eventual consistency between the contents of the base table and the view table. In our clients, we write to the base table and read from the view table. It is particularly important that the cluster undergoes changes in composition during the test since the pairing from base replica to view replica will change. The types of issues detected during testing reinforce this priority, as they mostly stem from interactions between environmental failures, materialized views, and changes in cluster membership. Issues identified during targeted testing of materialized views include CASSANDRA-10413, CASSANDRA-10674, and CASSANDRA-10068. In addition to these reported issues, our tests also helped prevent mistakes earlier in development.

We did not identify any new issues in the 2.1 or 2.2 versions of Cassandra as a result of Jepsen testing. We still found great value in this test coverage, as it further reinforced our confidence in the quality of these releases.

Work We Shared

We hope our work with Jepsen can help you with testing Cassandra and other distributed systems. The tests themselves are available on our GitHub. As part of our infrastructure for running tests, we’ve made available our tools for running Jepsen tests in Docker and multi-box Vagrant: Jepsen Docker and Jepsen Vagrant. Multi-box Vagrant support allows testing of clock drift on JVM databases: because libfaketime doesn’t work well with most JVMs and Linux containers do not provide separate clocks, we need separate kernels for these tests. We greatly enjoyed working with Jepsen; we contributed fixes in PR #59 and PR #62 for the very minor issues we encountered. We’ve also suggested a fix for an issue in an upstream library.

Plans for the Future

I found Jepsen to promote several important ideas in its design. First, it highlights the importance of well-defined models and invariants in testing. This encourages unambiguous design and communication, and it also makes the purpose of the test clear. Second, Jepsen enables the tester to easily test these invariants in realistic conditions through test composability; the same core test can be run concurrently with node crashes, network partitions, or other pathological failures. Lastly, by embracing generative testing, it acknowledges the difficulty of thoroughly testing distributed systems using deterministic, hand-selected examples.

We’re actively working on embracing these philosophies in our testing tools at DataStax. In particular, we aim to integrate these ideas with more flexible provisioning and orchestration, allowing us to easily test more cluster configurations and scales. We believe this will help us to ensure that Cassandra remains stable and correct throughout the course of further improvements.

New token allocation algorithm in Cassandra 3.0


Background

Token allocation for a distributed database like Cassandra is not a trivial problem. One wants to have an even split of the token range, so that load [1] can be well distributed between nodes, as well as the ability to add new nodes and have them take a fair share of the load without the necessity to move data between the existing nodes. The two requirements are at odds, as having a well-split range means that new nodes can easily break the balance.

A common solution to the problem, also the one usually taken by Cassandra until recently, is to use a high number of randomly-allocated token ranges per node (“virtual nodes” or “vnodes”). Since the vnode tokens are random it is very easy to add a new node without significantly affecting the load distribution of the cluster. The ranges served by individual vnodes may vary wildly in size, but the averaging effect of the number of vnodes should keep the load variation between nodes in control. Or, at least, this is the expectation.

Problem

Unfortunately, as the number of nodes in the cluster grows, the disproportions in the size of the individual vnode ranges, as well as in the overall size of the token space served by a node, are theoretically guaranteed to keep growing as more nodes are added. Significantly underused, as well as significantly overused, nodes always emerge. To make matters worse, it is not trivial to improve the situation once a heavily loaded node is identified: simply adding new members to the cluster would usually not help the problematic node, and the effects of replication provide an additional challenge if one tries to allocate tokens manually. To prevent this from happening, i.e. to be able to keep the distribution under control, one needs to increase the number of vnodes per node as the cluster increases in size, which brings with it a range of complications, including performance problems from dealing with a superlinearly increasing overall vnode count and the need to move data between existing nodes.

Solution

A better method of generation of tokens for new nodes was needed. Such a method is provided in Cassandra 3.0, where a token allocation algorithm can be triggered during bootstrap. The algorithm is engaged by specifying the allocate_tokens_for_keyspace parameter in cassandra.yaml in combination with num_tokens. As the replication strategies and factors are specified as part of the keyspace definition, the former parameter is needed to specify a keyspace from which the algorithm can find the replication to optimize for. When bootstrap is complete, the new node will have been allocated num_tokens new tokens which try to optimize the replicated token ownership distribution in the cluster.
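
A minimal cassandra.yaml sketch for a node being bootstrapped (the keyspace name is illustrative; the keyspace must already exist with the replication settings to optimize for):

num_tokens: 4
allocate_tokens_for_keyspace: my_keyspace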

For new clusters choosing this method over random allocation permits good distributions for much smaller numbers of vnodes per node: for example, experimenting with randomly generated distributions shows that for 1000 nodes with replication factor 3, one would require about 250 vnodes per node to keep overutilization below 30%. With the new algorithm a similar or lower maximum can be achieved with just 4, and more importantly, the expected over- and underutilization would be stable and not degrade as more and more machines are added to the cluster.

For existing clusters, using this method when adding new machines should quickly take load away from the most heavily overutilized nodes, and will gradually improve the spread as new nodes are added, until the variability in the allocated shares of both old and new nodes comes down to a couple of percent (for the old default of 256 vnodes).

Algorithm

The central idea of the algorithm is to generate candidate tokens, and figure out what would be the effect of adding each of them to the ring as part of the new node. The new token will become primary for part of the range of the next one in the ring, but it will also affect the replication of preceding ones.

The algorithm is able to quickly assess the effects thanks to some observations which lead to a simplified but equivalent version of the replication topology [2]:

  • Replication is defined per datacentre and replicas for data for this datacentre are only picked from local nodes. That is, no matter how we change nodes in other datacentres, this cannot affect what replicates where in the local one. Therefore in analysing the effects of adding a new token to the ring, we can work with a local version of the ring that only contains the tokens belonging to local nodes.
  • If there are no defined racks (or the datacentre is a single rack), data must be replicated in distinct nodes. If racks are defined, data must be replicated in distinct racks [3]. In either case, there is a clearly defined separation of all token ranges in the local ring into groups where only one replica of any data can reside.
  • The <n> token ranges where a data item is replicated are allocated going onwards in the ring, skipping token ranges in the same replication group. “Skipping” is difficult to deal with efficiently, but this turns out to be equivalent to saying that a vnode has responsibility for a contiguous span of tokens that ends at its token, and begins at the nearest token of the <n>-th distinct replication group that precedes it, or at the nearest other member of the same replication group, whichever is closer.
  • The latter observation makes it relatively easy to assess the changes to the responsibility caused by adding a new token in the ring.
  • Under the assumptions [1] of the algorithm, the size of that responsibility (“replicated ownership”) is proportional to the load that the presence of this vnode causes. The load over the whole node is thus proportional to the sum of the replicated ownerships of its vnodes.

The latter, the sum of the replicated ownerships of each node’s vnodes, is what the algorithm tries to distribute evenly. We do this by evaluating the standard deviation of the ownership across all nodes, together with the effect that selecting a specific token would have on this deviation, and picking the best token among a set of candidates. To keep complexity under control, the candidate tokens are chosen to be the midpoints between existing ones [4]. Doing this repeatedly for all requested vnodes, plus some optimizations, gives the allocation algorithm.
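
To make the central idea concrete, here is a small self-contained Java toy (my own sketch, simplified to replication factor 1 and ignoring the replication grouping and optimizations described above; it is not the actual Cassandra code):

import java.util.*;

// Toy sketch of the candidate-selection idea, simplified to replication factor 1 on a
// unit-length ring: each node owns the span ending at each of its tokens. We try the
// midpoint of every existing range and keep the candidate that minimizes the standard
// deviation of per-node ownership. Hypothetical code, not the actual Cassandra implementation.
public class TokenAllocationSketch {

    // standard deviation of the ownership values
    static double stdDev(Map<String, Double> ownership) {
        double mean = ownership.values().stream().mapToDouble(d -> d).average().orElse(0);
        double var = ownership.values().stream()
                .mapToDouble(d -> (d - mean) * (d - mean)).average().orElse(0);
        return Math.sqrt(var);
    }

    // per-node ownership for a sorted token -> node map (RF = 1, ring normalized to length 1)
    static Map<String, Double> ownership(TreeMap<Double, String> ring) {
        Map<String, Double> result = new HashMap<>();
        double prev = ring.lastKey() - 1.0; // wrap around the ring
        for (Map.Entry<Double, String> e : ring.entrySet()) {
            result.merge(e.getValue(), e.getKey() - prev, Double::sum);
            prev = e.getKey();
        }
        return result;
    }

    public static void main(String[] args) {
        TreeMap<Double, String> ring = new TreeMap<>();
        ring.put(0.10, "A"); ring.put(0.20, "A"); ring.put(0.55, "B"); ring.put(0.90, "B");

        double bestToken = Double.NaN, bestDev = Double.MAX_VALUE;
        double prev = ring.lastKey() - 1.0;
        for (double token : ring.keySet()) {
            double candidate = (prev + token) / 2;       // midpoint of an existing range
            TreeMap<Double, String> trial = new TreeMap<>(ring);
            trial.put(candidate, "C");                   // what if new node C took this token?
            double dev = stdDev(ownership(trial));
            if (dev < bestDev) { bestDev = dev; bestToken = candidate; }
            prev = token;
        }
        System.out.printf("best candidate token for new node C: %.3f (stddev %.4f)%n",
                bestToken, bestDev);
    }
}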

For full details, see CASSANDRA-7032.

To provide some further flexibility in situations where different sizes of nodes are present, we modulate the load distribution by the number of vnodes in a node — similarly to what random allocation results in for the majority of the nodes. For example, if a user specifies node A to use 4 vnodes, and node B to use 6, we assume that node B needs to take 50% higher load than node A. Another example would be an existing cluster where nodes use 32 vnodes each and several years later new machines, 4 times as powerful as the original members of the cluster, need to be added. In that case the user can and should request an allocation of 128 vnodes for the new machines, and the algorithm will try to allocate 4 times more load for them.

Footnotes

[1] To be able to reason about load at a good level of abstraction, this post assumes perfect partition key hashing, as well as (relatively) even load per partition.
[2] Given for NetworkTopologyStrategy, as SimpleStrategy is just its simplification with no racks and datacentres.
[3] Cassandra also attempts to deal with situations where racks are defined but are fewer than the replication factor, by allocating in all of the distinct racks and then allowing rack repeats to fill in the required replication count. This is a very rarely used scenario that complicates reasoning significantly and isn’t supported by the algorithm.
[4] Empirical testing shows this to be no worse than using greater variability, e.g. choosing 4 different candidates for each existing range at 1/5 intervals.

DataStax Java Driver: 3.0.0 released!


It’s finally here! The Java driver team is pleased to announce that the long-awaited 3.0.0 version has just been released.

Among various new features and improvements, 3.0.0 brings full compatibility with Cassandra 2.2 and 3.0 and a major feature: custom codecs.

With the switch to using semantic versioning, we seized the opportunity of this major release to clean up the API; as a consequence, version 3.0.0 is not binary compatible with older versions and has breaking changes – all of them documented in the upgrade guide (we strongly suggest reviewing it before upgrading the driver).

  1. Compatibility with Cassandra 2.2 and 3.0+
    1. Support for new CQL types
    2. Unset values
    3. Changes to Schema Metadata API
    4. Server Warnings
    5. New Exception Types
    6. Custom payloads
  2. Custom Codecs
    1. Optional Codecs
  3. Other Major Improvements
    1. RetryPolicy enhancements
    2. Named parameters in SimpleStatement
    3. Per-statement read timeouts
    4. Additions to the Host API
  4. Getting the driver

Compatibility with Cassandra 2.2 and 3.0+

Thanks to JAVA-572, the driver now fully supports the native protocol version 4, which comes with interesting additions:

Support for new CQL types

JAVA-404 and JAVA-786 brought support for four new CQL types: DATE, TIME, SMALLINT and TINYINT.

Methods to set and retrieve such types have been added to all relevant classes (Row, BoundStatement, TupleValue and UDTValue):

  • getByte() / setByte(byte) for TINYINT;
  • getShort() / setShort(short) for SMALLINT;
  • getTime() / setTime(long) for TIME;
  • getDate() / setDate(LocalDate) for DATE.

Note that to remain consistent with CQL type names, the methods to retrieve and set TIMESTAMP values have been renamed to getTimestamp() and setTimestamp(). They were formerly named getDate() and setDate(), but these now represent the DATE type.

SMALLINT and TINYINT are respectively 16 and 8-bit integers, so their usage should be quite straightforward.

DATE represents a day with no corresponding time value; it is encoded as a 32-bit unsigned integer representing a number of days, with “the Epoch” (January 1st, 1970) at the center of the range (2^31). TIME is the time of the day (with no specific date); it is encoded as a 64-bit signed integer representing the number of nanoseconds since midnight.

Here is a small example of how to use SMALLINT and TINYINT:

session.execute("CREATE TABLE IF NOT EXISTS small_ints(s smallint PRIMARY KEY, t tinyint)");
PreparedStatement pst = session.prepare("INSERT INTO small_ints (s, t) VALUES (:s, :t)");
session.execute(pst.bind(Short.MIN_VALUE, Byte.MAX_VALUE));

Row row = session.execute("SELECT * FROM small_ints").one();

short s = row.getShort("s");
byte t = row.getByte("t");

There is one minor catch: Java’s integer literals default to int, which the driver serializes as CQL INTs. So the following will fail:

session.execute(pst.bind(1, 1));
// InvalidTypeException: Invalid type for value 0 of CQL type smallint,
// expecting class java.lang.Short but class java.lang.Integer provided

The workaround is simply to coerce your arguments to the correct type:

session.execute(pst.bind((short)1, (byte)1));

And here is a small example of all 3 CQL temporal types, TIMESTAMP, DATE and TIME:

session.execute("CREATE TABLE IF NOT EXISTS dates(ts timestamp PRIMARY KEY, d date, t time)");
session.execute("INSERT INTO dates (ts, d, t) VALUES ('2015-01-28 11:47:58', '2015-01-28', '11:47:58')");

Row row = session.execute("SELECT * FROM dates").one();
Date ts = row.getTimestamp("ts");
LocalDate d = row.getDate("d");
long t = row.getTime("t");

As you see, TIMESTAMP is still mapped to java.util.Date, whereas TIME is mapped by the driver to primitive longs, representing the number of nanoseconds since midnight. As for DATE values, the driver encapsulates them in a new class, LocalDate. As it can be quite cumbersome to work with raw DATE literals (especially because Java doesn’t have unsigned integers), the LocalDate class aims to hide all that complexity behind utility methods to convert LocalDate instances to and from integers representing the number of days since the Epoch.
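
For instance, LocalDate values can be created and inspected along these lines (a short sketch; the method names below are taken from the 3.0 javadocs, so double-check them against your driver version):

LocalDate d1 = LocalDate.fromYearMonthDay(2015, 1, 28);
LocalDate d2 = LocalDate.fromDaysSinceEpoch(0);  // 1970-01-01
int days = d1.getDaysSinceEpoch();               // days since the Epoch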

Should the driver’s default mappings for temporal types not suit your needs, we have good news: the new “extras” module – see below – contains alternative codecs that map DATE and TIME to the Java 8 and Joda Time APIs.

And for those who prefer to keep it low-level and avoid the overhead of creating container classes:

  • SimpleDateCodec maps DATE to primitive ints representing the number of days since the Epoch; and
  • SimpleTimestampCodec maps TIMESTAMP to primitive longs representing milliseconds since the Epoch.

Unset values

For Protocol V3 or below, all variables in a statement must be bound. With Protocol V4, variables can be left “unset”, in which case they will be ignored server-side (no tombstones will be generated). If you’re reusing a bound statement you can use the unset methods to unset variables that were previously set:

BoundStatement bound = ps1.bind().setString("foo", "bar");
// Unset by name
bound.unset("foo");
// Unset by index
bound.unset(0);

Note that this will not work under lower protocol versions; attempting to do so would result in an IllegalStateException urging you to explicitly set all values in your statement.

Changes to Schema Metadata API

As you probably already know, CASSANDRA-6717 has completely changed the way Cassandra internally stores information about schemas, while CASSANDRA-6477 introduced Materialized Views, and CASSANDRA-7395 introduced User-defined Functions and Aggregates. On top of that, secondary indexes have been deeply refactored by CASSANDRA-9459.

The driver now fully supports all these features and changes; let’s see how.

Retrieving metadata on a materialized view is straightforward:

MaterializedViewMetadata mv = cluster.getMetadata()
    .getKeyspace("test").getMaterializedView("my_view");

Alternatively, you can obtain the view from its parent table:

TableMetadata table = cluster.getMetadata()
    .getKeyspace("test").getTable("my_table")
MaterializedViewMetadata mv = table.getView("my_view");
// You can also query all views in that table:
System.out.printf("Table %s has the following views: %s%n", table.getName(), table.getViews());

To illustrate the driver’s support for user-defined functions and aggregates, let’s consider the following example:

USE test;
CREATE FUNCTION plus(x int, y int)
            RETURNS NULL ON NULL INPUT
            RETURNS int LANGUAGE java AS 'return x + y;';
CREATE AGGREGATE sum(int) 
            SFUNC plus 
            STYPE int 
            INITCOND 0;

To retrieve metadata on the function defined above:

FunctionMetadata plus = cluster.getMetadata()
    .getKeyspace(keyspace)
    .getFunction("plus", DataType.cint(), DataType.cint());
System.out.printf("Function %s has signature %s and body '%s'%n",
    plus.getSimpleName(), plus.getSignature(), plus.getBody());

To retrieve metadata on the aggregate defined above:

AggregateMetadata sum = cluster.getMetadata()
    .getKeyspace(keyspace)
    .getAggregate("sum", DataType.cint());
System.out.printf("%s is an aggregate that computes a result of type %s%n",
    sum.getSimpleName(), sum.getReturnType());
FunctionMetadata plus = sum.getStateFunc();
System.out.printf("%s is a function that operates on %s%n",
    plus.getSimpleName(), plus.getArguments());

Note that, in order to retrieve a function or aggregate from a keyspace, you need to specify its name and its argument types, to distinguish between overloaded versions.

The way to retrieve metadata on a secondary index has changed from 2.1: the former one-to-one relationship between a column and its (only) index has been replaced with a one-to-many relationship between a table and its many indexes. This is reflected in the driver’s API by the new methods TableMetadata.getIndexes() and TableMetadata.getIndex(String name):

TableMetadata table = cluster.getMetadata()
        .getKeyspace("test")
        .getTable("my_table");
IndexMetadata index = table.getIndex("my_index");
System.out.printf("Table %s has index %s targeting %s%n", table.getName(), index.getName(), index.getTarget());

To retrieve the column an index operates on, you should now inspect the result of the getTarget() method:

IndexMetadata index = table.getIndex("my_index");
ColumnMetadata indexedColumn = table.getColumn(index.getTarget());
System.out.printf("Index %s operates on column %s%n", index.getName(), indexedColumn);

Beware however that the code above only works for built-in indexes where the index target is a single column name. If in doubt, make sure to read the upgrade guide should you need to migrate existing code.

Server Warnings

With Protocol V4, you need never again miss the oracles emitted by Cassandra!

Joking aside, Cassandra can now send warnings along with the server response; these can include useful information such as batches being too large, too many tombstones being read, etc. With the Java driver, you can retrieve them by simply inspecting the ExecutionInfo object:

ResultSet rs = session.execute(...);
List<String> warnings = rs.getExecutionInfo().getWarnings();

New Exception Types

New exception types have been added to handle additional server-side errors introduced in Cassandra 2.2: ReadFailureException, WriteFailureException and FunctionExecutionException.

Also, note that thanks to JAVA-1006, the whole exceptions hierarchy has been redesigned in this version.

Custom payloads

And finally, custom payloads are generic key-value maps that can be sent alongside a query. They are used to convey additional metadata when you deploy a custom query handler on the server side.
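
Concretely, a payload is attached to a statement before execution, and any payload returned by the server can be read from the ExecutionInfo. Here is a hedged sketch (the key and value are illustrative, and a custom query handler is assumed to be deployed server-side):

// attach an outgoing payload to the statement
Map<String, ByteBuffer> payload = new HashMap<String, ByteBuffer>();
payload.put("request-tag", ByteBuffer.wrap("audit".getBytes()));

Statement statement = new SimpleStatement("SELECT * FROM t WHERE pk = 1");
statement.setOutgoingPayload(payload);

// read back whatever the server-side handler returned
ResultSet rs = session.execute(statement);
Map<String, ByteBuffer> serverPayload = rs.getExecutionInfo().getIncomingPayload();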

Custom Codecs

JAVA-721 introduced an exciting new feature: custom codecs.

In short, where before the driver had a hard-coded set of mappings between CQL types and Java types, now it has a fully dynamic, pluggable and customizable mechanism of handling CQL-to-Java conversions.

With custom codecs, users can now define their own mappings, and the driver will use them wherever appropriate, seamlessly. The possibilities are endless: map CQL temporal types to Java 8 Time API or to Joda Time, as we already mentioned; provide transparent JSON-to-Java and XML-to-Java conversion; map CQL collections to Java arrays or – why not? – to Scala collections…

Just to give you a hint of how powerful this feature can be, imagine that you developed your own codec to convert JSON strings stored in Cassandra to Java objects; the code to retrieve your objects would be as simple as this:

// Roll your own codec
TypeCodec<MyPojo> myJsonCodec = ...;
// register it so the driver can use it
cluster.getConfiguration().getCodecRegistry().register(myJsonCodec);
// query some JSON data
Row row = session.execute("SELECT json FROM t WHERE pk = 'I CAN HAZ JSON'").one();
// Let the driver convert it for you...
MyPojo myPojo = row.get("json", MyPojo.class);

When designing new codecs for the Java driver, just make sure to read our online documentation as well as the javadocs for TypeCodec and CodecRegistry.

Optional Codecs

Custom codecs gave us the opportunity to introduce a new member in the Java driver family: the “extras” module.

This module has been created to host additions to the driver that, albeit useful, cannot make it into the core API, mainly for backwards-compatibility reasons, or because they target a more specific audience (e.g. they require Java 8 or higher, while the driver must remain compatible with older versions of Java).

To use this new module in your own application, simply pull the following Maven dependency:

<dependency>
  <groupId>com.datastax.cassandra</groupId>
  <artifactId>cassandra-driver-extras</artifactId>
  <version>3.0.0</version>
</dependency>

To celebrate the event, we included in this version a rich set of codecs that we hope will be useful to many of you.

Check our online documentation for more details.

Note that the mapping framework has been retrofitted to use custom codecs when appropriate. One of the consequences is that the @Enumerated annotation is gone, replaced with codecs from the extras module. Again, please read the upgrade guide for more details if you need to migrate existing code.

Other Major Improvements

RetryPolicy enhancements

Thanks to JAVA-819, RetryPolicy now has a new method: onRequestError(). This method gives the user the ability to decide what to do in the following cases:

  1. On a client timeout, while waiting for the server response;
  2. On a connection error (socket closed, etc.);
  3. When the contacted host replies with an unusual error, such as IS_BOOTSTRAPPING, OVERLOADED, or SERVER_ERROR.

To distinguish among these error cases, one should inspect the DriverException that is passed to the method call. Here is a summary of the possible situations:

  • Timeout (no server response): OperationTimedOutException
  • Network failure (socket closed, etc.): ConnectionException
  • IS_BOOTSTRAPPING: BootstrappingException
  • OVERLOADED: OverloadedException
  • SERVER_ERROR: ServerError

Until now, the driver had a hardcoded behavior for all these cases: retry the query. But this behavior is actually dangerous if the query being executed is not idempotent; from now on, users can override the default behavior if they need to. And to make their lives even easier, the driver provides the new IdempotenceAwareRetryPolicy, which conveniently decorates any existing RetryPolicy with idempotence awareness, based on the statement’s idempotence flag; here is an example:

Cluster cluster = Cluster.builder()
        .addContactPoints("127.0.0.1")
        // by default, statements will be considered non-idempotent
        .withQueryOptions(new QueryOptions().setDefaultIdempotence(false))
        // make your retry policy idempotence-aware
        .withRetryPolicy(new IdempotenceAwareRetryPolicy(DefaultRetryPolicy.INSTANCE))
        .build();

Session session = cluster.connect();

// by default, statements like this one will not be retried
session.execute("INSERT INTO table (pk, c1) VALUES (42, 'foo')");

// but this one will
session.execute(new SimpleStatement("SELECT c1 FROM table WHERE pk = 42").setIdempotent(true));

Named parameters in SimpleStatement

Thanks to JAVA-1037, you now have the ability to set named parameters on a SimpleStatement. Simply use the new constructor that takes a Map argument:

// Note the use of named parameters in the query
String query = "SELECT * FROM measures WHERE sensor_id=:sensor_id AND day=:day";
Map<String, Object> params = new HashMap<String, Object>();
params.put("sensor_id", 42);
params.put("day", "2016-01-28");
SimpleStatement statement = new SimpleStatement(query, params);

One caveat though: named parameters were introduced in Protocol V3, and thus require Cassandra 2.1 or higher. Check our online documentation on simple statements for more information.

Per-statement read timeouts

With JAVA-1033, you now have the possibility to specify read timeouts (i.e. the amount of time the driver will wait for a response before giving up) on a per-statement basis: the new method Statement.setReadTimeoutMillis() overrides the default per-host read timeout defined by SocketOptions.setReadTimeoutMillis().

This can be useful for statements that are granted longer timeouts server-side (for example, aggregation queries). Again, our online documentation on socket options has more about this.
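
For instance, a long-running aggregation query could be granted a longer driver-side timeout, as in this short sketch (the query and timeout value are illustrative):

Statement statement = new SimpleStatement("SELECT count(*) FROM ks.large_table");
// allow more time than the global default, for this statement only
statement.setReadTimeoutMillis(65000);
ResultSet rs = session.execute(statement);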

Additions to the Host API

JAVA-1035 and JAVA-1042 have enriched the Host API with new useful methods:

  • getBroadcastAddress() returns the node’s broadcast address. This corresponds to the broadcast_address setting in cassandra.yaml.
  • getListenAddress() returns the node’s listen address. This corresponds to the listen_address setting in cassandra.yaml.
  • getDseVersion() returns the DSE version the host is running (when applicable).
  • getDseWorkload() returns the DSE workload the host is running (when applicable).

Note however that these methods are provided for informational purposes only; depending on the cluster version, on the cluster type (DSE or not), and on the host the information has been fetched from, they may return null at any time.
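
A quick hedged illustration of reading these properties for every host known to the driver (remember that any of them may be null):

for (Host host : cluster.getMetadata().getAllHosts()) {
    System.out.printf("%s: broadcast=%s, listen=%s, DSE version=%s, DSE workload=%s%n",
        host.getAddress(), host.getBroadcastAddress(), host.getListenAddress(),
        host.getDseVersion(), host.getDseWorkload());
}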

Getting the driver

As always, the driver is available from Maven and from our downloads server.

We’re also running a platform and runtime survey to improve our testing infrastructure. Your feedback would be most appreciated.

DataStax PHP Driver: 1.1 GA Released!


We are pleased to announce the 1.1 GA release of the PHP driver for Apache Cassandra. This release includes all the features necessary to take full advantage of Apache Cassandra 2.1, including support for tuples, user defined types (UDTs), nested collections, client-side timestamps, and binding named arguments when using simple statements. In addition to supporting Cassandra 2.1 features, the release also brings with it support for PHP 7, retry policies, raw paging token access, and the ability to disable schema metadata. Example code for all the new features found in this release can be found in the features directory in the driver’s source code.

What’s new

Support for PHP 7

The PHP driver can now be used with PHP 7! The driver also continues to work with officially supported versions of PHP 5.

User Defined Types

User defined types (UDTs for short), introduced in Apache Cassandra 2.1, allow for creating arbitrarily nested, composite data types with multiple fields in a single column. This can be useful for simplifying schema by grouping related fields into a single UDT instead of using multiple columns. More information about using UDTs can be found in this post.

Inserting a user defined type

$cluster = Cassandra::cluster()->build();
$session = $cluster->connect("music");

$statement = new Cassandra\SimpleStatement(
    "CREATE TYPE IF NOT EXISTS song_metadata (duration int, bit_rate set<text>, encoding text)");
$session->execute($statement);

$statement = new Cassandra\SimpleStatement(
    "CREATE TABLE IF NOT EXISTS songs (id uuid PRIMARY KEY, name text, metadata frozen<song_metadata>)");
$session->execute($statement);

# The UDT can be retrieved from the schema metadata
$songMetadataType = $session->schema()->keyspace("music")->userType("song_metadata");

# Construct a UDT value from the UDT type
$songMetadata = $songMetadataType->create(
    "duration", 180,
    "bit_rate", Cassandra\Type::set(Cassandra\Type::text())->create("128kbps", "256kbps"),
    "encoding", "mp3");

# Bind and execute using the constructed UDT value
$statement = new Cassandra\SimpleStatement("INSERT INTO songs(id, name, metadata) VALUES (?, ?, ?)");
$options = new Cassandra\ExecutionOptions(
    array(
        "arguments" => array(
            new Cassandra\Uuid(),
            "Some Song",
            $songMetadata
        )
    )
);

$session->execute($statement , $options);

User defined types can also be constructed programmatically. This can be useful, for instance, where schema metadata is disabled or unavailable.

$songMetadataType = Cassandra\Type::userType(
    "duration", Cassandra\Type::int(),
    "bit_rates", Cassandra\Type::set(Cassandra\Type::text()),
    "encoding", Cassandra\Type::text()
);

$songMetadata = $songMetadataType->create(
    "duration", 180,
    "bit_rate", Cassandra\Type::set(Cassandra\Type::text())->create("128kbps", "256kbps"),
    "encoding", "mp3");

# ...

Tuples

Tuples, also introduced in Apache Cassandra 2.1, are useful for creating positional, fixed-length sets with mixed types. They’re similar to UDTs in that they are arbitrary composite types. However, tuple fields are unnamed, therefore they can only be referenced by position. This also means that it is not possible to add new fields to a tuple.

$cluster = Cassandra::cluster()->build();
$session = $cluster->connect("music");

$statement = new Cassandra\SimpleStatement(
    "CREATE TABLE IF NOT EXISTS songs_using_tuple (id uuid PRIMARY KEY, name text, metadata tuple<int, frozen<set<text>>, text>)");
$session->execute($statement);

# Create a new tuple type
$songMetadataType = Cassandra\Type::tuple(
    Cassandra\Type::int(),
    Cassandra\Type::set(Cassandra\Type::text()),
    Cassandra\Type::text()
);

# Construct a tuple value using the tuple type
$songMetadata = $songMetadataType->create(
    180,
    Cassandra\Type::set(Cassandra\Type::text())->create("128kbps", "256kbps"),
    "mp3"
);

# Bind and execute using the constructed tuple value
$statement = new Cassandra\SimpleStatement("INSERT INTO songs_using_tuple (id, name, metadata) VALUES (?, ?, ?)");
$options = new Cassandra\ExecutionOptions(
    array(
        "arguments" => array(
            new Cassandra\Uuid(),
            "Some Song",
            $songMetadata
        )
    )
);

$session->execute($statement , $options);

Nested Collections

List, map, and set types can now be arbitrarily nested. Other collections can even be used as keys in maps and sets.

use Cassandra\Type;
use Cassandra\Decimal;

$setType = Type::set(Type::int());
$map = Type::map($setType, Type::text())->create(
    $setType->create(1, 2, 3), "abc",
    $setType->create(4, 5, 6), "xyz"
);

echo "The value of {4, 5, 6} is : " . $map->get($setType->create(4, 5, 6)) . "\n"; # "xyz"

$listType = Type::collection(Type::decimal());
$set = Type::set($listType)->create(
    $listType->create(new Decimal("0.0"), new Decimal("1.0")),
    $listType->create(new Decimal("2.0"), new Decimal("3.0"), new Decimal("4.0"))
);

if ($set->has($listType->create(new Decimal("2.0"), new Decimal("3.0"), new Decimal("4.0")))) {
    echo "Yup! It's in there.\n";
}

Client-side Timestamps

Apache Cassandra uses timestamps to serialize write operations: values with a more current timestamp are considered to be the most up-to-date version of that information. Previous versions of the PHP driver only allowed timestamps to be assigned server-side by Cassandra, which is not ideal for all applications. This release of the driver allows timestamps to be generated client-side; this is enabled either by setting a global timestamp generator or by assigning a specific timestamp to a statement or batch. By default, the driver uses a server-side timestamp generator and behaves the same as previous versions of the driver. The driver also includes a monotonic timestamp generator, which assigns microsecond-granularity timestamps client-side and is useful for applications that make rapid mutations from a single driver instance; in that case, it can prevent writes from a single driver instance from being reordered.

Using the monotonic timestamp generator

$cluster = Cassandra::cluster()
              ->withContactPoints('127.0.0.1')
              ->withTimestampGenerator(new Cassandra\TimestampGenerator\Monotonic())
              ->build();

# Insert and update requests will now be assigned a client-side timestamp using the
# monotonic timestamp generator...

Timestamps can also be assigned for each individual request using Cassandra\ExecutionOptions.

Assigning a client-side timestamp per request

$simple = new Cassandra\SimpleStatement(
    "INSERT INTO playlists (id, song_id, artist, title, album) " .
    "VALUES (62c36092-82a1-3a00-93d1-46196ee77204, ?, ?, ?, ?)"
);

# Positional arguments must follow the column order: song_id, artist, title, album
$arguments = array(
    new Cassandra\Uuid('756716f7-2e54-4715-9f00-91dcbea6cf50'),
    'Joséphine Baker',
    'La Petite Tonkinoise',
    'Bye Bye Blackbird'
);
$options = new Cassandra\ExecutionOptions(array(
    'arguments' => $arguments,
    'timestamp' => 1234 # A timestamp can be assigned per request in execution options
));
$session->execute($simple, $options);

$statement = new Cassandra\SimpleStatement(
  "SELECT artist, title, album, WRITETIME(song_id) FROM simplex.playlists");
$result    = $session->execute($statement);

foreach ($result as $row) {
  echo $row['artist'] . ": " . $row['title'] . " / " . $row['album'] . " (". $row['writetime(song_id)'] . ")\n";
}

Support Named Arguments when using Cassandra\SimpleStatement

It is now possible to name arguments when using SimpleStatement. In previous releases only positional arguments were supported for simple statement queries, that is, arguments denoted with “?” needed to be bound to a query in the same order as they appeared in the query string.

Named parameters now work with simple statement insert queries

$simple = new Cassandra\SimpleStatement(
    "INSERT INTO playlists (id, song_id, artist, title, album) " .
    "VALUES (62c36092-82a1-3a00-93d1-46196ee77204, ?, ?, ?, ?)"
);

# Using named arguments now works with simple statements!
$arguments = array(
    'song_id' => new Cassandra\Uuid('756716f7-2e54-4715-9f00-91dcbea6cf50'),
    'title'   => 'La Petite Tonkinoise',
    'album'   => 'Bye Bye Blackbird',
    'artist'  => 'Joséphine Baker'
);

$options = new Cassandra\ExecutionOptions(array(
    'arguments' => $arguments,
));

$session->execute($simple, $options);

This version of the driver also allows parameters to be named using the “:<name>” syntax. Named arguments can still be used in conjunction with prepared queries, but are most useful for non-prepared queries, where metadata for the parameters’ names is not available.

Using “:<name>” parameters with a simple statement

$statement = new Cassandra\SimpleStatement(
    "SELECT * FROM simplex.playlists " .
    "WHERE id = :id AND artist = :artist AND title = :title AND album = :album"
);

$options = new Cassandra\ExecutionOptions(
    array('arguments' =>
        array(
            'id'     => new Cassandra\Uuid('62c36092-82a1-3a00-93d1-46196ee77204'),
            'artist' => 'Joséphine Baker',
            'title'  => 'La Petite Tonkinoise',
            'album'  => 'Bye Bye Blackbird'
        )
    )
);

$result = $session->execute($statement, $options);

$row = $result->first();
echo $row['artist'] . ": " . $row['title'] . " / " . $row['album'] . "\n";

Retry Policies

The use of retry policies allows the PHP driver to automatically handle server-side failures when Cassandra is unable to fulfill the consistency requirements of a request. The default retry policy will only retry a request when doing so preserves the original consistency level and when the retry is likely to succeed (there are enough replicas). The default retry policy can be overridden per session by using Cluster::withRetryPolicy() or it can be set per request using the execution option "retry_policy".

Changing the default policy to the downgrading consistency policy

$cluster     = Cassandra::cluster()
                 ->withContactPoints('127.0.0.1')
                 ->withRetryPolicy(new Cassandra\RetryPolicy\DowngradingConsistency())
                 ->build();

$session     = $cluster->connect();

# ...

The driver also provides a fall-through policy that always returns an error and a logging policy which can be used in conjunction with other policies to log their retry decisions.

Chaining the downgrading policy to the logging policy

$retry_policy = new Cassandra\RetryPolicy\DowngradingConsistency();

$cluster     = Cassandra::cluster()
                 ->withContactPoints('127.0.0.1')
                 ->withRetryPolicy(new Cassandra\RetryPolicy\Logging($retry_policy))
                 ->build();

$session     = $cluster->connect();

# ...

Retry policies can also be assigned per-request using the "retry_policy" execution option.

Assigning a retry policy to a specific request

$statement   = new Cassandra\SimpleStatement("INSERT INTO playlists (id, song_id, artist, title, album)
                                              VALUES (62c36092-82a1-3a00-93d1-46196ee77204, ?, ?, ?, ?)");

$arguments   = array(new Cassandra\Uuid('756716f7-2e54-4715-9f00-91dcbea6cf50'),
                    'Joséphine Baker',
                    'La Petite Tonkinoise',
                    'Bye Bye Blackbird'
                    );

$retry_policy = new Cassandra\RetryPolicy\DowngradingConsistency();

# This specific retry policy is used only for this single request
$options     = new Cassandra\ExecutionOptions(array(
                    'consistency' => Cassandra::CONSISTENCY_QUORUM,
                    'arguments' => $arguments,
                    'retry_policy' => new Cassandra\RetryPolicy\Logging($retry_policy)
                    ));

$session->execute($statement, $options);

# ...

Raw Paging Token

Previously, the PHP driver handled paging transparently by managing the paging state internally. It is now possible to access this paging state token using Cassandra\Rows::pagingStateToken() and later use it to resume paging by setting the "paging_state_token" execution option when executing a statement. This allows client applications to store the token for later use. The paging state should not be exposed to or come from untrusted environments.

Using the paging state token to page results

$cluster   = Cassandra::cluster()
               ->withContactPoints('127.0.0.1')
               ->build();
$session   = $cluster->connect("simplex");
$statement = new Cassandra\SimpleStatement("SELECT * FROM entries");
$options = array('page_size' => 2);
$result = $session->execute($statement, new Cassandra\ExecutionOptions($options));

foreach ($result as $row) {
  printf("key: '%s' value: %d\n", $row['key'], $row['value']);
}

while ($result->pagingStateToken()) {
    # The previous paging state token is used to get the next page of results
    $options = array(
        'page_size' => 2,
        'paging_state_token' => $result->pagingStateToken()
    );

    $result = $session->execute($statement, new Cassandra\ExecutionOptions($options));

    foreach ($result as $row) {
      printf("key: '%s' value: %d\n", $row['key'], $row['value']);
    }
}

Disable Schema Metadata

The driver keeps schema metadata up to date for use by client applications, either directly or, as of the 1.1 release, to construct complex data types such as UDTs, tuples, and collections. It is also used by the token aware policy to determine the replication strategy of keyspaces. However, some applications might wish to eliminate this overhead, so it is now possible to prevent the driver from retrieving and maintaining schema metadata. This can improve startup performance in applications with short-lived sessions or applications where schema metadata isn't used.

$cluster   = Cassandra::cluster()
                   ->withContactPoints('127.0.0.1')
                   ->withSchemaMetadata(false) # Disable schema metadata
                   ->build();
$session   = $cluster->connect("simplex");
$schema    = $session->schema();
print count($schema->keyspaces()) . "\n"; # "0"

Internal improvements

This release also includes the following internal improvements:

  • The default consistency is now LOCAL_ONE instead of ONE
  • Fixed encoding/decoding for decimal and varint

Looking forward

This release brings with it full support for Apache Cassandra 2.1 along with many other great features including support for PHP 7! In the next release we will be focusing our efforts on supporting Apache Cassandra 2.2 and 3.0. Let us know what you think about the 1.1 GA release. Your feedback is important to us and it influences what features we prioritize. To provide feedback use the following:

Understanding the Guarantees, Limitations, and Tradeoffs of Cassandra and Materialized Views

The new Materialized Views feature in Cassandra 3.0 offers an easy way to accurately denormalize data so it can be efficiently queried. It's meant to be used on high cardinality columns where the use of secondary indexes is not efficient due to fan-out across all nodes. An example would be creating a secondary index on a user_id. As the number of users in the system grows, it takes longer for a secondary index to locate the data, since secondary indexes store data locally. With a materialized view you can partition the data on user_id, so finding a specific user becomes a direct lookup, with the added benefit of holding other denormalized data from the base table along with it, similar to a DynamoDB global secondary index.
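
To make that pattern concrete, here is a minimal sketch using the Python driver. The keyspace, table, and column names (demo.users_by_city, users_by_id) are hypothetical, and it assumes a Cassandra 3.0+ node reachable on localhost; the view simply re-partitions the same rows by user_id so that looking up a user becomes a single-partition read.

from uuid import uuid4
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect()

# Hypothetical schema: the base table is partitioned by city
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.users_by_city (
        city    text,
        user_id uuid,
        name    text,
        PRIMARY KEY (city, user_id)
    )
""")
# The view holds the same rows, re-partitioned by user_id
session.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS demo.users_by_id AS
        SELECT city, user_id, name FROM demo.users_by_city
        WHERE user_id IS NOT NULL AND city IS NOT NULL
        PRIMARY KEY (user_id, city)
""")

user_id = uuid4()
session.execute(
    "INSERT INTO demo.users_by_city (city, user_id, name) VALUES (%s, %s, %s)",
    ('Austin', user_id, 'alice'))

# Finding a user is now a direct lookup on the view's partition key
# (note: view updates may be applied asynchronously, as discussed below)
rows = list(session.execute(
    "SELECT * FROM demo.users_by_id WHERE user_id = %s", (user_id,)))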

Materialized views are a very useful feature to have in Cassandra but before you go jumping in head first, it helps to understand how this feature was designed and what the guarantees are.

Primarily, since materialized views live in Cassandra they can offer at most what Cassandra offers, namely a highly available, eventually consistent version of materialized views.

A quick refresher of the Cassandra guarantees and tradeoffs:

C* Guarantees:

  • Writes to a single table are guaranteed to be eventually consistent across replicas – meaning divergent versions of a row will be reconciled and reach the same end state.
  • Lightweight transactions are guaranteed to be linearizable for table writes within a data center or globally depending on the use of LOCAL_SERIAL vs SERIAL consistency level respectively.
  • Batched writes across multiple tables are guaranteed to succeed completely or not at all (by using a durable log).
  • Secondary indexes (once built) are guaranteed to be consistent with their local replicas data.

C* Limitations:

  • Cassandra provides read uncommitted isolation by default.  (Lightweight transactions provide linearizable isolation)

C* Tradeoffs:

  • Using lower consistency levels yields higher availability and better latency at the price of weaker consistency.
  • Using higher consistency levels yields lower availability and higher request latency with the benefit of stronger consistency.

Another tradeoff to consider is how Cassandra deals with data safety in the face of hardware failures.  Say your disk dies or your datacenter has a fire and you lose machines; how safe is your data?  Well, it depends on a few factors, mainly replication factor and consistency level used for the write.  With consistency level QUORUM and RF=3 your data is safe on at least two nodes so if you lose one node you still have a copy.  However, if you only have RF=1 and lose a node forever you’ve lost data forever.

An extreme example of this is if you have RF=3 but write at CL.ONE and the write only succeeds on a single node, followed directly by the death of that node.  Unless the coordinator was a different node you probably just lost data.


Given Cassandra's system properties, maintaining materialized views manually in your application is likely to create permanent inconsistencies between the views and the base data: your application will need to read the existing state from Cassandra and then modify the views to clean up any updates to existing rows. Besides the added latency, if other updates are going to the same rows, your reads will end up in a race condition and fail to clean up all the state changes. This is the scenario the mvbench tool compares against.

The Materialized Views feature in Cassandra 3.0 was written to address these and other complexities surrounding manual denormalization, but that is not to say it comes without its own set of guarantees and tradeoffs to consider. To understand the internal design of Materialized Views please read the design document. At a high level though, we chose correctness over raw performance for writes, but did our best to avoid needless write amplification. A simple way to think about this write amplification problem is: if I have a base table with RF=3 and a view table with RF=3, a naive approach would send a write to each base replica, and each base replica would send a view update to each view replica; RF+RF^2 writes per mutation (12 writes for RF=3)! C* Materialized Views instead pairs each base replica with a single view replica. This simplifies to RF+RF writes per mutation (6 for RF=3) while still guaranteeing convergence.

Materialized View Guarantees:

  • All changes to the base table will be eventually reflected in the view tables unless there is a total data loss in the base table (as described in the previous section)

Materialized View Limitations:

  • All updates to the view happen asynchronously unless the corresponding view replica is the same node.  We must do this to ensure availability is not compromised.  It's easy to imagine a worst case scenario of 10 Materialized Views for which each update to the base table requires writing to 10 separate nodes. Under normal operation views will see the data quickly and there are new metrics to track it (ViewWriteMetrics).
  • There is no read repair between the views and the base table, meaning a read repair on the view will only correct that view's data, not the base table's data.  If you are reading from the base table, though, read repair will send updates to both the base and the view.
  • Mutations on a base table partition must happen sequentially per replica if the mutation touches a column in a view (this will improve after ticket CASSANDRA-10307)

Materialized View Tradeoffs:

  • With materialized views you are trading performance for correctness. It takes more work to ensure the views will see all the state changes to a given row; local locks and local reads are required.  If you don't need consistency or never update/delete data, you can bypass materialized views and simply write to many tables from your client.  There is also a ticket, CASSANDRA-9779, that will offer a way to bypass the performance hit in the case of insert-only workloads.
  • The data loss scenario described in the section above (there exists only a single copy on a single node that dies) has different effects depending on whether the base or the view was affected.  If view data was lost from all replicas you would need to drop and re-create the view.  If the base table lost data, though, there would be an inconsistency between the base and the view, with the view having data the base doesn't.  Currently, there is no way to fix the base from the view; ticket CASSANDRA-10346 was added to address this.

One final point on repair. As described in the design document, repairs mean different things depending on whether you are repairing the base or the view.  If you repair only the view you will see a consistent state across the view replicas (not the base).  If you repair the base you will repair both the base and the view.  This is accomplished by passing streamed base data through the regular write path, which in turn updates the views.  Bootstrapping new nodes and SSTable loading work the same way, providing consistent materialized views.


How to Write a Dtest

What are Dtests?

Apache Cassandra’s functional test suite, cassandra-dtest, short for “distributed tests”, is an open-source Python project on GitHub where much of the Apache Cassandra test automation effort takes place. Unlike Cassandra’s unit tests, the dtests are end-to-end, black box tests that run against Cassandra clusters via CCM. The Cassandra Cluster Manager, or CCM, is a Python library that runs local C* clusters by hosting multiple JVMs on the same box. Each test’s runtime is anywhere from thirty seconds to several minutes. Many are general purpose functional tests, while others are regression tests for specific tickets from the Apache Cassandra JIRA.

Where are Dtests used?

Continuous integration for dtests runs on a publicly accessible Jenkins server at cassci.datastax.com. As patches are written, contributors can use CassCI to run the C* unit tests and the dtest suite against their new code, as discussed here.

Writing a Dtest

Adding a new dtest is quite simple. You’ll want to choose the appropriate module and/or test suite for your new test, or add one if necessary. Add a new test method to the file you’ve chosen; make sure that “test” is in the method name, or nosetests won’t pick it up. Now is a good time to add your test’s docstring. The docstring should include a description of what your test is trying to verify and how, as well as some doxygen markup. See dtest’s contributing.md for more on the appropriate doxygen annotations to use.
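
As a rough sketch of that boilerplate (the class, method, and ticket number here are hypothetical, and the annotations shown are only examples; contributing.md is the authority on which doxygen tags to use):

from dtest import Tester


class MyFeatureTest(Tester):

    def insert_and_read_test(self):
        """
        Insert a single row into a fresh cluster and verify it can be read back.

        @jira_ticket CASSANDRA-0000
        @since 3.0
        """
        # NOTE: the ticket number and annotations above are illustrative placeholders
        pass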

Now that the boilerplate is taken care of, you’re ready to begin writing your test. The first step is to launch a C* cluster, like so:

cluster = self.cluster
cluster.populate(3).start(wait_for_binary_proto=True)

You can modify the number of nodes in the cluster, the number of datacenters, or any of the cassandra.yaml options.

cluster = self.cluster
cluster.set_configuration_options(values={'hinted_handoff_enabled': False}) # Set a cassandra.yaml option
cluster.populate([2, 2]).start(wait_for_binary_proto=True) # A four node cluster. Two nodes in each of two datacenters

Remember that this is using CCM, so all of these processes are running on your laptop. Thus, it’s best not to launch more than five nodes. Most tests run against three nodes.

To create an object representing a connection to your C* cluster, you’ll want to use one of the following methods from dtest.py:

def cql_connection(self, node)
def exclusive_cql_connection(self, node)
def patient_cql_connection(self, node)
def patient_exclusive_cql_connection(self, node)

Use patient_cql_connection, unless you have a specific need for one of the others.

cluster = self.cluster
cluster.populate(3).start(wait_for_binary_proto=True)
node1, node2, node3 = cluster.nodelist()

session = self.patient_cql_connection(node1)

From here out will be the actual testing logic. You can use the Python driver to interact with C*, mostly via CQL, or the ccmlib API to run cassandra-stress, nodetool, or any other tool that ships in the C* source.

session.execute("CREATE KEYSPACE ks WITH replication = { 'class':'SimpleStrategy', 'replication_factor':1} AND DURABLE_WRITES = true")
session.execute("USE ks")
session.execute("CREATE TABLE t (id int PRIMARY KEY, v int)")
session.execute("INSERT INTO t (id, v) VALUES (1, 2)")
rows = session.execute("SELECT * FROM t")
node1, node2, node3 = cluster.nodelist()

node1.stress(['write', 'n=1M', '-rate', 'threads=10'])
node2.decommission()
node3.repair()

You can use assertions.py and Python unittest’s built-in assertions to assert C*’s correctness.

from assertions import assert_one

session.execute("CREATE KEYSPACE ks WITH replication = { 'class':'SimpleStrategy', 'replication_factor':1} AND DURABLE_WRITES = true")
session.execute("USE ks")
session.execute("CREATE TABLE t (id int PRIMARY KEY, v int)")
session.execute("INSERT INTO t (id, v) VALUES (1, 2)")
assert_one(session, "SELECT * FROM t", [1, 2])
rows = list(session.execute("SELECT * FROM t"))
self.assertEqual(rows[0][0], 1)
self.assertEqual(rows[0][1], 2)

Make sure you only use these, and not the Python assert keyword, as they offer significantly improved debug output on failures.

rows = list(session.execute("SELECT * FROM t"))
assert rows[0][0] == 1 # Do not do this.
assert rows[0][1] == 2

There’s no need to check for errors in C* logs, as that is automatically handled for you by dtest’s teardown.

Once you have finished with your test, make sure your new code is compliant with PEP8. See contributing.md for how to do so, along with further style guidelines. Now just open a pull request against the riptano/cassandra-dtest repository, and we’ll be happy to review and merge it.

Cassandra Unit Testing with Byteman

Following on from recent posts on testing in the Apache Cassandra project with dtests and Jepsen, I wanted to look at an interesting tool which we’ve recently begun to explore in our unit tests.

There are some things which are notoriously difficult to cover in unit testing. Verifying behaviours without easily observable side effects is one such case, for example verifying that a particular code path is followed under specific conditions. This type of observability can be increased by refactoring and employing techniques such as dependency injection, but this often comes at the expense of clarity and concision in the code.

Another example is fault injection testing, where pathological conditions are artificially induced at test runtime. This can be extremely useful to exercise those corners of a codebase which deal with error handling, particularly when those errors are difficult to reproduce in a test environment. Verifying correct responses to scenarios such as a disk filling up or a network partition is clearly crucial to developing robust systems, yet these are often hard to model in unit tests. Higher level testing frameworks can provide mechanisms for creating or simulating these scenarios, such as Jepsen's nemeses, but as with the observability problem, unit tests have often had to rely on dependency injection and mocks or stubs to force execution of error handling code paths.

Byteman is an open source tool, primarily developed by JBoss, which enables additional Java code to be injected into a running JVM. From the project’s home page: “You can inject code almost anywhere you want and there is no need to prepare the original source code in advance. You can even remove injected code and reinstall different changes while the application continues to execute.”

Injections are known as rules and are scripted using a simple DSL, with primitives for tracing and modifying behaviour as well as for defining trigger points and conditions in the existing code. This clearly meshes well with both the observability and fault injection concepts, and in fact Byteman ships with a JUnit test runner to support integration with test fixtures through annotations.

Let's look at a recently committed Cassandra unit test which uses Byteman. The intent in the test case added for CASSANDRA-10972 is to assert that the HintsBufferPool provides some backpressure to its callers by drawing its write buffers from a BlockingQueue.

The annotation on the test method specifies an action and defines the point at which to execute it:

@Test
@BMRule(name = "Greatest name in the world",
        targetClass="HintsBufferPool",
        targetMethod="switchCurrentBuffer",
        targetLocation="AT INVOKE java.util.concurrent.BlockingQueue.take",
        action="org.apache.cassandra.hints.HintsBufferPoolTest.blockedOnBackpressure = true;")
public void testBackpressure() throws Exception

We specify an action to perform when our rule is triggered:

action="org.apache.cassandra.hints.HintsBufferPoolTest.blockedOnBackpressure = true;"

which simply flips a boolean flag in the test case. Next, the trigger point:

targetClass="HintsBufferPool",
targetMethod="switchCurrentBuffer",
targetLocation="AT INVOKE java.util.concurrent.BlockingQueue.take",

These attributes represent the code coordinates at which to perform the action; in this case, during a call to HintsBufferPool::switchCurrentBuffer. More specifically, whenever BlockingQueue::take is called during execution of that method, the defined action is performed. Note that switchCurrentBuffer is a private method; Byteman can inspect and inject absolutely anywhere in application, library, or even Java runtime code. Several options are available when specifying the targetLocation, including directly before or after execution of the target method, when named variables are read or written, when particular method calls are made (as in this example), and even at specific lines in the source.

Finally, the test asserts that the flag was set, indicating that at some point during switchCurrentBuffer the pool did draw from the recycled buffer queue:

assertTrue(blockedOnBackpressure);

In future, expect to see more fault injection in Cassandra’s unit tests and probably also in dtests. Hooks are already in place in CCM (the library used by dtests to manage local clusters) for starting nodes with the Byteman agent installed and submitting scripts to those nodes. Using them to deterministically invoke corner cases in a dtest cluster will help expand our test coverage and complement the Jepsen tests.

Further reading:


Debugging SSTables in 3.0 with sstabledump

Cassandra 3.0.4 and 3.4 introduce sstabledump, a new utility for exploring SSTables. sstabledump is the spiritual successor to and a replacement for sstable2json. sstable2json was removed from Cassandra in version 3.0, but examining SSTable data is still a useful diagnostic tool. sstabledump can export SSTable content to a human-readable JSON format.

How SSTable data is stored on disk has changed in Cassandra 3.0, as previously covered in ‘Putting some structure in the storage engine’. Previously, SSTables were composed of partition keys and their cells; now SSTables are composed of partitions and their rows.

This eliminates quite a bit of overhead present in prior versions of Cassandra. Metadata such as clustering key values, timestamps and TTLs are now defined at the row level, rather than repeated for each individual cell within a row. This new layout now matches how data is represented in CQL, and is more understandable.

A nice enhancement of sstabledump over sstable2json is that the utility can be run in ‘client mode’, so the system data does not have to be read to determine schema. sstabledump can be executed outside of the Cassandra environment, and cassandra.yaml is not required in the classpath for the tool to work.

Note that sstabledump only supports Cassandra 3.X SSTables.

Visualizing the Storage Engine changes in 3.0

To demonstrate sstabledump and the changes in SSTable layout in 3.0, we’ll use sstable2json and sstabledump to contrast the SSTables created by a Cassandra 2.2 node and those created by a Cassandra 3.0 node.

First, let’s generate a small SSTable for a table that represents stock ticker data. This should be done within a cqlsh session on each Cassandra cluster:

-- Create the schema

CREATE KEYSPACE IF NOT EXISTS ticker WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };

USE ticker;

CREATE TABLE IF NOT EXISTS symbol_history (	
  symbol    text,
  year      int,
  month     int,
  day       int,
  volume    bigint,
  close     double,
  open      double,
  low       double,
  high      double,
  idx       text static,
  PRIMARY KEY ((symbol, year), month, day)
) with CLUSTERING ORDER BY (month desc, day desc);

-- Insert some records

INSERT INTO symbol_history (symbol, year, month, day, volume, close, open, low, high, idx) 
VALUES ('CORP', 2015, 12, 31, 1054342, 9.33, 9.55, 9.21, 9.57, 'NYSE') USING TTL 604800;

INSERT INTO symbol_history (symbol, year, month, day, volume, close, open, low, high, idx) 
VALUES ('CORP', 2016, 1, 1, 1055334, 8.2, 9.33, 8.02, 9.35, 'NASDAQ') USING TTL 604800;

INSERT INTO symbol_history (symbol, year, month, day, volume, close, open, low, high) 
VALUES ('CORP', 2016, 1, 4, 1054342, 8.54, 8.2, 8.2, 8.65) USING TTL 604800;

INSERT INTO symbol_history (symbol, year, month, day, volume, close, open, low, high) 
VALUES ('CORP', 2016, 1, 5, 1054772, 8.73, 8.54, 8.44, 8.75) USING TTL 604800;

-- Update a column value

UPDATE symbol_history USING TTL 604800 set close = 8.55 where symbol = 'CORP' and year = 2016 and month = 1 and day = 4;

Next, let’s flush memtables to disk as SSTables using nodetool:

$ bin/nodetool flush

Then in a cqlsh session we will delete a column value and an entire row to generate some tombstones:

-- Delete a column value

USE ticker;
DELETE high FROM symbol_history WHERE symbol = 'CORP' and year = 2016 and month = 1 and day = 1;

-- Delete an entire row

DELETE FROM symbol_history WHERE symbol = 'CORP' and year = 2016 and month = 1 and day = 5;

We proceed to flush again to generate a new SSTable, and then perform a major compaction yielding a single SSTable.

$ bin/nodetool flush; bin/nodetool compact ticker

Now that we have a single SSTable representing operations on our CQL table we can use the appropriate tool to examine its contents.

C* 2.2 sstable2json Output

$ tools/bin/sstable2json data/data/ticker/symbol_history-d7197900e5aa11e590210b5b92b49507/la-3-big-Data.db

[
{"key": "CORP:2016",
 "cells": [["::idx","NASDAQ",1457495762169139,"e",604800,1458100562],
           ["1:5:_","1:5:!",1457495781073797,"t",1457495781],
           ["1:4:","",1457495762172733,"e",604800,1458100562],
           ["1:4:close","8.55",1457495767496569,"e",604800,1458100567],
           ["1:4:high","8.65",1457495762172733,"e",604800,1458100562],
           ["1:4:low","8.2",1457495762172733,"e",604800,1458100562],
           ["1:4:open","8.2",1457495762172733,"e",604800,1458100562],
           ["1:4:volume","1054342",1457495762172733,"e",604800,1458100562],
           ["1:1:","",1457495762169139,"e",604800,1458100562],
           ["1:1:close","8.2",1457495762169139,"e",604800,1458100562],
           ["1:1:high",1457495780,1457495780541716,"d"],
           ["1:1:low","8.02",1457495762169139,"e",604800,1458100562],
           ["1:1:open","9.33",1457495762169139,"e",604800,1458100562],
           ["1:1:volume","1055334",1457495762169139,"e",604800,1458100562]]},
{"key": "CORP:2015",
 "cells": [["::idx","NYSE",1457495762164052,"e",604800,1458100562],
           ["12:31:","",1457495762164052,"e",604800,1458100562],
           ["12:31:close","9.33",1457495762164052,"e",604800,1458100562],
           ["12:31:high","9.57",1457495762164052,"e",604800,1458100562],
           ["12:31:low","9.21",1457495762164052,"e",604800,1458100562],
           ["12:31:open","9.55",1457495762164052,"e",604800,1458100562],
           ["12:31:volume","1054342",1457495762164052,"e",604800,1458100562]]}
]

As previously stated, the sstable2json output demonstrates that the storage engine prior to Cassandra 3.0 represents data as partition keys and their cells.

A large portion of the presented data in cells is redundant. For example, when we executed INSERT queries, each cell representing a column value shares the same timestamp and TTL. Additionally, each cell contains not only the full name of the column, but also the values of the clustering keys that cell belongs to. This overhead contributes a large portion to the size of the SSTable.

C* 3.0 sstabledump output

$ tools/bin/sstabledump data/data/ticker/symbol_history-6d6bfc70e5ab11e5aeae7b4a82a62e48/ma-3-big-Data.db 

[
  {
    "partition" : {
      "key" : [ "CORP", "2016" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "static_block",
        "position" : 48,
        "cells" : [
          { "name" : "idx", "value" : "NASDAQ", "tstamp" : 1457484225583260, "ttl" : 604800, "expires_at" : 1458089025, "expired" : false }
        ]
      },
      {
        "type" : "row",
        "position" : 48,
        "clustering" : [ "1", "5" ],
        "deletion_info" : { "deletion_time" : 1457484273784615, "tstamp" : 1457484273 }
      },
      {
        "type" : "row",
        "position" : 66,
        "clustering" : [ "1", "4" ],
        "liveness_info" : { "tstamp" : 1457484225586933, "ttl" : 604800, "expires_at" : 1458089025, "expired" : false },
        "cells" : [
          { "name" : "close", "value" : "8.54" },
          { "name" : "high", "value" : "8.65" },
          { "name" : "low", "value" : "8.2" },
          { "name" : "open", "value" : "8.2" },
          { "name" : "volume", "value" : "1054342" }
        ]
      },
      {
        "type" : "row",
        "position" : 131,
        "clustering" : [ "1", "1" ],
        "liveness_info" : { "tstamp" : 1457484225583260, "ttl" : 604800, "expires_at" : 1458089025, "expired" : false },
        "cells" : [
          { "name" : "close", "value" : "8.2" },
          { "name" : "high", "deletion_time" : 1457484267, "tstamp" : 1457484267368678 },
          { "name" : "low", "value" : "8.02" },
          { "name" : "open", "value" : "9.33" },
          { "name" : "volume", "value" : "1055334" }
        ]
      }
    ]
  },
  {
    "partition" : {
      "key" : [ "CORP", "2015" ],
      "position" : 194
    },
    "rows" : [
      {
        "type" : "static_block",
        "position" : 239,
        "cells" : [
          { "name" : "idx", "value" : "NYSE", "tstamp" : 1457484225578370, "ttl" : 604800, "expires_at" : 1458089025, "expired" : false }
        ]
      },
      {
        "type" : "row",
        "position" : 239,
        "clustering" : [ "12", "31" ],
        "liveness_info" : { "tstamp" : 1457484225578370, "ttl" : 604800, "expires_at" : 1458089025, "expired" : false },
        "cells" : [
          { "name" : "close", "value" : "9.33" },
          { "name" : "high", "value" : "9.57" },
          { "name" : "low", "value" : "9.21" },
          { "name" : "open", "value" : "9.55" },
          { "name" : "volume", "value" : "1054342" }
        ]
      }
    ]
  }
]

The new tool's output is more verbose, and therefore less compact, than that of sstable2json. However, it displays the enriched structure of the 3.0 storage engine, and it is apparent that there is less repeated data, which leads to a dramatically reduced SSTable storage footprint.

Looking at the output, note that clustering, timestamp and TTL information is now presented at the row level, instead of being repeated in individual cells. This change is a large factor in optimizing disk space. While column names appear in each cell of the output, the full column names are no longer stored on disk for each cell as they were previously. You can read more about these optimizations and others in the aforementioned blog post.

Internal Representation Format

As previously mentioned, sstabledump’s JSON representation is more verbose than sstable2json. sstabledump also provides an alternative ‘debug’ output format that is more concise than its json counterpart. While initially difficult to understand, it is a more compact and convenient format for advanced users to grok the contents of an SSTable. To view data in this format, simply pass the -d parameter to sstabledump:

$ tools/bin/sstabledump data/data/ticker/symbol_history-6d6bfc70e5ab11e5aeae7b4a82a62e48/ma-3-big-Data.db -d

[CORP:2016]@0 Row[info=[ts=-9223372036854775808] ]: STATIC | [idx=NASDAQ ts=1457496014384090 ttl=604800 ldt=1458100814]
[CORP:2016]@0 Row[info=[ts=-9223372036854775808] del=deletedAt=1457496035375251, localDeletion=1457496035 ]: 1, 5 | 
[CORP:2016]@66 Row[info=[ts=1457496014387922 ttl=604800, let=1458100814] ]: 1, 4 | [close=8.55 ts=1457496020899876 ttl=604800 ldt=1458100820], [high=8.65 ts=1457496014387922 ttl=604800 ldt=1458100814], [low=8.2 ts=1457496014387922 ttl=604800 ldt=1458100814], [open=8.2 ts=1457496014387922 ttl=604800 ldt=1458100814], [volume=1054342 ts=1457496014387922 ttl=604800 ldt=1458100814]
[CORP:2016]@141 Row[info=[ts=1457496014384090 ttl=604800, let=1458100814] ]: 1, 1 | [close=8.2 ts=1457496014384090 ttl=604800 ldt=1458100814], [high=<tombstone> ts=1457496034857652 ldt=1457496034], [low=8.02 ts=1457496014384090 ttl=604800 ldt=1458100814], [open=9.33 ts=1457496014384090 ttl=604800 ldt=1458100814], [volume=1055334 ts=1457496014384090 ttl=604800 ldt=1458100814]
[CORP:2015]@204 Row[info=[ts=-9223372036854775808] ]: STATIC | [idx=NYSE ts=1457496014379236 ttl=604800 ldt=1458100814]
[CORP:2015]@204 Row[info=[ts=1457496014379236 ttl=604800, let=1458100814] ]: 12, 31 | [close=9.33 ts=1457496014379236 ttl=604800 ldt=1458100814], [high=9.57 ts=1457496014379236 ttl=604800 ldt=1458100814], [low=9.21 ts=1457496014379236 ttl=604800 ldt=1458100814], [open=9.55 ts=1457496014379236 ttl=604800 ldt=1458100814], [volume=1054342 ts=1457496014379236 ttl=604800 ldt=1458100814]

Other than the inclusion of this internal representation format, the usage between sstabledump and sstable2json is exactly the same.

Additional Links

Python Driver 3.1.0 Released

The DataStax Python Driver 3.1.0 for Apache Cassandra has been released. The work in this release was largely focused on the object mapper, cqlengine, but it also includes a number of other features and bug fixes. A complete list of issues is available in the CHANGELOG. Here I briefly outline the new cqlengine features:

Existing state via LWTException

When using Lightweight Transactions (LWT), LWTException is raised when the transaction is not applied. Now, the existing state information is attached to that exception, allowing the application to inspect the attributes that caused the transaction to fail:

try:
    TestIfNotExistsModel.if_not_exists().create(id=id, count=9, text='1111')
except LWTException as e:
    print e.existing # dict of attributes from the existing object

Collection “contains” Support

QuerySet has a new ‘__contains’ filtering operator for restricting indexed collections:

class Test(Model):
    k = columns.UUID(primary_key=True, default=uuid4)
    s = columns.Set(columns.Integer, index=True)
Test.filter(s__contains=1)

DISTINCT query operator

QuerySet now has a distinct query operator for selecting just primary keys and static columns.

for o in Test.objects().distinct():
    print o

By default this returns objects with just the partition keys populated. Clients can also request static columns by specifying column names to distinct.
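
For example, here is a sketch with a hypothetical model that has a static column; it assumes distinct() accepts the list of column names to select, per the description above:

from cassandra.cqlengine import columns
from cassandra.cqlengine.models import Model

class Scores(Model):   # hypothetical model
    game      = columns.Text(primary_key=True)       # partition key
    player    = columns.Text(primary_key=True)       # clustering key
    top_score = columns.Integer(static=True)          # static column

# One result per partition, with the partition key and static column populated
for o in Scores.objects().distinct(['game', 'top_score']):
    print o.game, o.top_score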

Tuples in Models

Tuple type has been added to cqlengine models:

class Table(Model):
    key = Integer(primary_key=True)
    value = Tuple(Text, Integer)

IF EXISTS Lightweight Transactions

Model.if_exists was added, rounding out the set of LWT operations in the mapper:

try:
    Table.objects(id=id).if_exists().update(count=9, text='111111111111')
except LWTException as e:
    # handle failure case
    pass

Nested Collection Modeling

cqlengine now allows collections to be nested.

class Table(Model):
    key = Integer(primary_key=True)
    value = Map(Text, List(Integer))

This was previously explicitly rejected when the model was defined. Now it is allowed, with a caveat: the composition of types must adhere to Python rules. Sets can only contain hashable types, and Maps can only be keyed by hashable types. This is because cqlengine coerces to builtin Python types during model IO. In a future major revision we will be changing this to use the driver's more flexible custom types and avoid this limitation.
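
As a concrete illustration of that caveat (hypothetical model names):

from cassandra.cqlengine import columns
from cassandra.cqlengine.models import Model

class NestedOk(Model):   # hypothetical model
    key   = columns.Integer(primary_key=True)
    value = columns.Map(columns.Text, columns.List(columns.Integer))  # map keys are hashable text

# By contrast, something like columns.Set(columns.List(columns.Integer)) runs into the
# caveat above: set elements become Python lists during model IO, and lists are unhashable.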

Query fetch size

cqlengine now has the ability to control automatic paging behavior via fetch_size. Not only does this allow for controlling how many rows are fetched in each round trip, it also has an ancillary benefit of enabling queries using ORDER BY with IN clauses.
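
A brief sketch, reusing the Test model from above and assuming the QuerySet-level fetch_size() method this refers to:

# Fetch 100 rows per page instead of the default page size
for o in Test.objects().fetch_size(100):
    print o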

Case-Sensitive Table Names

Previous versions of cqlengine lower-cased table names, even when set explicitly using the Model.__table_name__ attribute. In this version we introduce an attribute to override that behavior.

class Table(Model):
    __table_name_case_sensitive__ = True
    __table_name__ = "CrAZyCaSEName"

When set, the __table_name__ will be used without transformation. This currently defaults to False to preserve legacy behavior. In a future version, explicit table names will be case-sensitive by default. The mapper now warns users who may be affected by this switch.

Wrap

As always, thanks to all who provided contributions and bug reports. The continued involvement of the community is appreciated:

Things you didn’t think you could do with DSE Search and CQL

Intro

CQL and DSE Search promise to make access to a Lucene-backed index scalable, highly available, operationally simple, and user friendly.

There have been a couple of developments in DSE 4.8 point releases that may have gone unnoticed by the community of DSE Search users.

One of the main benefits of using DSE Search is that you are able to query the search indexes through CQL, directly from your favorite DataStax driver. Avoiding the Solr HTTP API altogether means that you:

1) Don't need two sets of DAOs in your app, plus application logic around which to use for what purpose

2) You don’t need a load balancer in front of Solr/Tomcat because the DataStax drivers are cluster aware and will load balance for you

3) You don’t need to worry about one node going down under your load balancer and having a fraction of your queries failing upon node failure

4) When security is enabled, requests through the HTTP API are significantly slower, to quote the DSE docs:

“Due to the stateless nature of HTTP Basic Authentication, this can have a significant performance impact as the authentication process must be executed on each HTTP request.”

Why ever use the HTTP API?

The CQL interface is designed to return rows and columns, so features like Solr's numFound and faceting were not built in to the first few releases.

These features have snuck in via patches in point releases, and users who aren't studiously reading the release notes may not have noticed the changes.

How would I go about getting numFound and performing facet queries in the latest (DSE 4.8.1+) version of DSE?

Show me how

If you know you just need the count (and not the data that comes along with it) then you can just specify count(*) and keep the solr_query WHERE clause. DSE intercepts the query and brings back numDocs from DSE Search instead of actually performing the count in Cassandra:

SELECT count(*) FROM test.pymt WHERE solr_query = '{"q":"countryoftravel:\"United States\""}' ;

 count
-------
 39709 

Here it is with tracing enabled; notice that even my wide-open count(*) query comes back in microseconds:

cqlsh> SELECT count(*) FROM test.pymt WHERE solr_query = '{"q":"*:*"}' ;

 count
--------
 817000

(1 rows)

Tracing session: 7020df80-e7a9-11e5-9c31-37116dd067c6

 activity                                                                                        | timestamp                  | source    | source_elapsed
-------------------------------------------------------------------------------------------------+----------------------------+-----------+----------------
                                                                              Execute CQL3 query | 2016-03-11 11:51:02.136000 | 127.0.0.1 |              0
 Parsing SELECT count(*) FROM test.pymt WHERE solr_query = '{"q":"*:*"}' ; [SharedPool-Worker-1] | 2016-03-11 11:51:02.136000 | 127.0.0.1 |             34
                                                       Preparing statement [SharedPool-Worker-1] | 2016-03-11 11:51:02.136000 | 127.0.0.1 |             84
                                                                                Request complete | 2016-03-11 11:51:02.146918 | 127.0.0.1 |          10918

The same goes for facet queries. Note that because of the way the cql protocol is designed (around rows and columns), DSE returns the facet results inside a single cell in JSON format. Pretty slick:

select * FROM test.pymt WHERE solr_query='{"facet":{"pivot":"physicianprimarytype"},"q":"*:*"}' ;  
 facet_pivot
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 {"physicianprimarytype":[{"field":"physicianprimarytype","value":"doctor","count":813638},{"field":"physicianprimarytype","value":"medical","count":720967},{"field":"physicianprimarytype","value":"of","count":92671},{"field":"physicianprimarytype","value":"osteopathy","count":60123},{"field":"physicianprimarytype","value":"dentistry","count":17132},{"field":"physicianprimarytype","value":"optometry","count":11447},{"field":"physicianprimarytype","value":"medicine","count":3969},{"field":"physicianprimarytype","value":"podiatric","count":3969},{"field":""physicianprimarytype","value":"chiropractor","count":192}]}

TL;DR

You don't have to use the HTTP API for search queries, even if you need numFound and faceting. It is now supported via CQL and solr_query.

Futures

Remember I mentioned that the CQL protocol is designed around rows and columns? Well, check out this ticket resolved in C* 2.2.0 beta 1: CASSANDRA-8553. If you use your imagination, there are some improvements that can be made once DSE gets C* 3.0 under the hood to make Search functionality even more slick.

Stay tuned!

More Features!

I meant to stop here, but when I asked folks (char0) to review this post, they had some additional DSE Search features that get overlooked. I'll briefly describe them and link to documentation. If you're new to DSE Search, definitely read on:

Partition routing:

Partition routing is a great multi-tenant feature that lets you limit the amount of fan-out that a search query incurs under the hood. Essentially, you're able to specify a Cassandra partition that you are interested in limiting your search to. This limits the number of nodes that DSE Search requires to fulfill your request.

JSON queries

If you're looking to do advanced queries through CQL (beyond just a simple search), check out the DataStax documentation for JSON queries.

timeAllowed

Many search use cases don't actually require the backend to scan the entire dataset. If you're just trying to fill out a page with search results, and latency matters more than having a complete result set (when you don't care about numFound), the timeAllowed parameter lets you set a maximum latency and DSE Search will just return the results it has found so far.

Please comment if you have any additional DSE Search Features that you think are overlooked!

DataStax C/C++ Driver: 2.3 GA released!

We are pleased to announce the 2.3 GA release of the C/C++ driver for Apache Cassandra. This release includes all the features necessary to take full advantage of Apache Cassandra 3.0, including support for materialized view metadata.

Thanks to a community contribution, this release also includes support for the blacklist, blacklist DC, and whitelist DC load balancing policies.

What’s new

Materialized view metadata

Cassandra 3.0 added support for materialized views. The 2.3 release of the C/C++ driver adds support for inspecting the metadata for materialized views. Building from the example found in the materialized view post we can retrieve materialized views from either the keyspace metadata or the materialized view’s base table.

CREATE KEYSPACE game
       WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'dc1' : 3 };

USE game;

CREATE TABLE scores
(
  user TEXT,
  game TEXT,
  year INT,
  month INT,
  day INT,
  score INT,
  PRIMARY KEY (user, game, year, month, day)
);

CREATE MATERIALIZED VIEW alltimehigh AS
       SELECT user FROM scores
       WHERE game IS NOT NULL AND score IS NOT NULL AND user IS NOT NULL AND 
             year IS NOT NULL AND month IS NOT NULL AND day  IS NOT NULL
       PRIMARY KEY (game, score, user, year, month, day)
       WITH CLUSTERING ORDER BY (score desc);

Retrieving a materialized view from a keyspace

const CassSchemaMeta* schema_meta = cass_session_get_schema_meta(session);

const CassKeyspaceMeta* keyspace_meta = cass_schema_meta_keyspace_by_name(schema_meta, "game");

const CassMaterializedViewMeta* mv_meta;

/* Retrieve the materialized view by name from the keyspace */
mv_meta = cass_keyspace_meta_materialized_view_by_name(keyspace_meta, "alltimehigh");

/* ... */

/* Materialized views in a keyspace can also be iterated over */
CassIterator* iterator = cass_iterator_materialized_views_from_keyspace_meta(keyspace_meta);

/* Iterate over materialized views in the "game" keyspace */
while (cass_iterator_next(iterator)) {
  mv_meta = cass_iterator_get_materialized_view_meta(iterator);
  
  /* Use materialized view metadata... */    
}

cass_iterator_free(iterator);
cass_schema_meta_free(schema_meta);

Retrieving a materialized view from a table

const CassKeyspaceMeta* keyspace_meta = cass_schema_meta_keyspace_by_name(schema_meta, "game");

const CassTableMeta* table_meta = cass_keyspace_meta_table_by_name(keyspace_meta, "scores");

const CassMaterializedViewMeta* mv_meta;

/* The materialized view can be retrieved by name */ 
mv_meta = cass_table_meta_materialized_view_by_name(table_meta, "alltimehigh");

/* OR by index */
mv_meta =  cass_table_meta_materialized_view(table_meta, 0);

/* ... */

cass_schema_meta_free(schema_meta);

Materialized view metadata is almost exactly the same as the metadata for tables because materialized views are themselves just tables. The difference is that materialized views have a base table, which can be retrieved using cass_materialized_view_meta_base_table(). This example shows how table metadata might be used, and the same approach applies when using materialized views.

Looking forward

This release brings with it full support for Apache Cassandra 3.0 along with several other great features including some community contributed features. We truly appreciate the community involvement, thank you. Keep the pull requests and feedback coming! Let us know what you think about the 2.3 GA release. Your involvement is important to us and it influences what features we prioritize. Use the following resources to get involved:

Tuning DSE Search – Indexing latency and query latency

Introduction

DSE offers out-of-the-box search indexing for your Cassandra data. The days of double writes or ETLs between separate DBMS and search clusters are gone.

I have my cql table, I execute the following API call, and (boom) my cassandra data is available for:

1) full text/fuzzy search
2) ad hoc filtering powered by Lucene secondary indexes, and
3) geospatial search

Here is my API call:

$ bin/dsetool create_core <keyspace>.<table> generateResources=true reindex=true

or if you prefer curl (or are using basic auth) use the following:

$ curl "http://localhost:8983/solr/admin/cores?action=CREATE&name=<keyspace>.<table>&generateResources=true"

Rejoice! We are in inverted-index, single-cluster, operational-simplicity bliss!

The remainder of this post will be focused on advanced tuning for DSE Search both for a) search indexing latency (the time it takes for data to be searchable after it has been inserted through cql), and b) search query latency (timings for your search requests).

Indexing latency

In this section I’ll talk about the kinds of things we can do in order to

1) instrument and monitor DSE Search indexing and
2) tune indexing for lower latencies and increased performance

Note: DSE Search ships with Real Time (RT) indexing which will give you faster indexing with 4.7.3, especially when it comes to the tails of your latency distribution. Here’s one of our performance tests. It shows you real time vs near-real time indexing as of 4.7.0:

[image: indexing chart]

Perhaps more importantly, as you get machines with more cores, you can continue to increase your indexing performance linearly:
[image: rt vs nrt]

Be aware, however, that you should only run one RT search core per cluster, since it is significantly more resource hungry than near real time (NRT) indexing.

Side note on GC: Because Solr and Cassandra run on the same JVM in DSE Search and the indexing process generates a lot of Java objects, running Search requires a larger JVM heap. When running traditional CMS, we recommend a 14gb heap with about 2gb new gen. Consider Stump's CASSANDRA-8150 settings when running search with CMS. G1GC has been found to perform quite well with search workloads; I personally run with a 25gb heap (do not set new gen with G1, the whole point of G1 is that it sets it itself based on your workload!) and gc_pause_ms at about 1000 (go higher for higher throughput or lower to minimize latencies / p99's; don't go below 500).

1) Instrumentation

Index Pool Stats:

DSE Search parallelizes the indexing process and allocates work to a thread pool for indexing of your data.

Using JMX, you can see statistics on your indexing threadpool depth, completion, timings, and whether backpressure is active.

This is important because if your indexing queues get too deep, we risk having too much heap pressure => OOM’s. Backpressure will throttle commits and eventually load shed if search can’t keep up with an indexing workload. Backpressure gets triggered when the queues get too large.

The mbean is called:

com.datastax.bdp.search.<keyspace>.<table>.IndexPool

[image: indexing queues]

Commit/Update Stats:

You can also see statistics on indexing performance (in microseconds) based on the particular stage of the indexing process for both commits and updates.

The stages for commit are:

FLUSH – Comprising the time spent flushing the async indexing queue.

EXECUTE – Comprising the time spent actually executing the commit on the index.

The mbean is called:

com.datastax.bdp.search.<keyspace>.<table>.CommitMetrics

The stages for update are:

WRITE – Comprising the time spent converting the Solr document and writing it into Cassandra (only available when indexing via the Solrj HTTP APIs); if you're using CQL this will be 0.

QUEUE – Comprising the time spent by the index update task waiting in the index pool queue.

PREPARE – Comprising the time spent preparing the actual index update.

EXECUTE – Comprising the time spent actually executing the index update on Lucene.

The mbean is:

com.datastax.bdp.search.<keyspace>.<table>.UpdateMetrics

[image: indexing stats]

Here, the average latency for the QUEUE stage of the update is 767 micros. See our docs for more details on the metrics mbeans and their stages.

2) Tuning

Almost everything in C* and DSE is configurable. Here are the key levers to get better search indexing performance; based on what you see in your instrumentation, you can tune accordingly.

The main lever is soft autocommit: the amount of time that passes before newly inserted data becomes searchable. With RT we can set it to 250 ms or even as low as 100 ms, given the right hardware. Tune this based on your SLAs.

The next most important lever is concurrency per core (or max_solr_concurrency_per_core). You can usually set this to the number of CPU cores available to maximize indexing throughput.

The backpressure threshold will become more important as your load increases. Larger boxes can handle higher backpressure thresholds.

Don't forget to set the ramBuffer to 2gb per the docs when you turn on RT indexing.

Query Latency

Now I'll go over how we can monitor query performance in DSE Search, identify issues, and apply some tips and tricks to improve search query performance. I will cover how to:

1) instrument and monitor DSE Search queries and
2) tune queries for lower latencies and increased performance.

Similar to how search indexing performance scales with CPUs, search query performance scales with RAM. Keeping your search indexes in OS page cache is the biggest thing you can do to minimize latencies, so scale deliberately!

1) Instrumentation

There are multiple tools available for monitoring search performance.

OpsCenter:

OpsCenter supports a few search metrics that can be configured per node, datacenter, and solr core:

1) search latencies
2) search requests
3) index size
4) search timeouts
5) search errors

[image: opscenter]

Metrics mbeans:

In the same way that indexing has performance metrics, DSE Search query performance metrics are available through JMX and can be useful for troubleshooting performance issues. You can use the query.name parameter in your DSE Search queries to capture metrics for specifically tagged queries.

The stages of query are:

COORDINATE – Comprises the total amount of time spent by the coordinator node to distribute the query and gather/process results from shards. This value is computed only on query coordinator nodes.

EXECUTE – Comprises the time spent by a single shard to execute the actual index query. This value is computed on the local node executing the shard query.

RETRIEVE – Comprises the time spent by a single shard to retrieve the actual data from Cassandra. This value will be computed on the local node hosting the requested data.

The mbean is:

com.datastax.bdp.search.<keyspace>.<table>.QueryMetrics

Query Tracing:

When using solr_query via cql, query tracing can provide useful information as to where a particular query spent time in the cluster.

Query tracing is available in cqlsh (tracing on), in DevCenter (in the tab at the bottom of the screen), and via probabilistic tracing, which is configurable via nodetool.

DSE Search slow query log:

When users complain about a slow query and you need to find out what it is, the DSE Search slow query log is a good starting point.

dsetool perf solrslowlog enable

The results are stored in Cassandra, in the dse_perf.solr_slow_sub_query_log table.

2) Tuning

Now let’s focus on some tips for how you can improve search query performance.

Index size

Index size is so important that I wrote a separate post just on that subject.

Q vs. FQ

In order to take advantage of the Solr filter cache, build your queries using fq, not q. The filter cache is the only Solr cache that persists across commits, so don't spend time or valuable RAM trying to leverage the other caches.
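
For example, here is a sketch using the Python driver (the table and field names are hypothetical placeholders for any search-enabled table); it assumes the solr_query JSON accepts an "fq" key, per the DSE Search documentation:

from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect()

# The restriction goes in "fq" so the filter cache can serve repeated filters;
# "q" stays wide open.
query = """SELECT * FROM test.pymt WHERE solr_query = '{"q":"*:*", "fq":"physicianprimarytype:doctor"}'"""
rows = session.execute(query)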

Solr query routing

Partition routing is a great multi-tenancy feature in DSE Search that lets you limit the amount of fan-out that a search query incurs under the hood. Essentially, you're able to specify a Cassandra partition that you are interested in limiting your search to. This limits the number of nodes that DSE Search requires to fulfill your request.

Use docvalues for Faceting and Sorting.

To get improved performance and to avoid OOMs from the field cache, always remember to turn on docvalues on fields that you will be sorting and faceting over. This may become mandatory in DSE at some point so plan ahead.

Other DSE Differentiators

If you’re comparing DSE Search against other search offerings / technologies, the following two differentiators are unique to DSE Search.

Fault tolerant distributed queries

If a node dies during a query, we retry the query on another node.

Node health

DSE Search monitors node health and shard router behavior, and makes distributed query routing decisions based on the following:

1) Uptime: a node that just started may well be lacking the most up-to-date data (to be repaired via HH or AE).
2) Number of dropped mutations.
3) Number of hints the node is a target for.
4) “failed reindex” status.

All you need to do to take advantage of this is to be on a modern DSE version.

DataStax Enterprise and Windows

We reach out from time to time to everyone currently using or interested in DataStax Enterprise to ensure we're putting everything into our platform that you need. Our "Quick Polls" are one way of doing that, and we have a new one about porting DSE to Windows that we'd like you to take about 2 seconds to participate in. Visit our DSE Home Page and quickly make your voice heard using the 3 options in our new poll (lower right on the page). Thanks!

[image: windows poll]

DataStax Java Driver: 2.1.10 released!

The Java driver team is pleased to announce that version 2.1.10 has just been released.

Before diving into this release’s major additions, we would like to point out that the 2.1 branch is now entering maintenance mode. Basically, this means that from now on, this branch will only get critical bug fixes and no new features will get added to it.

The 2.1 branch has been around for over 1.5 years, and in that time Apache Cassandra has released 2 major versions and switched to the tick-tock release schedule. The most active Java driver branch is currently 3.0, which is the only branch providing support for Apache Cassandra 2.2 and 3.0, while also receiving all the bug fixes made to the 2.1 branch.

We have prepared extensive documentation on upgrading from 2.1 to 3.0, and we'd like to encourage you to start looking into upgrading right now, so that you can benefit from the latest features and also be prepared for when the time comes to update your clusters to newer versions of DataStax Enterprise and Apache Cassandra.

Native calls with Java Native Runtime library (JNR)

Thanks to JAVA-727 and JAVA-444, the driver now uses the Java Native Runtime library (JNR) to access native code.

Two areas are already taking advantage of JNR to perform native system calls: timestamp generators now use gettimeofday() to generate microsecond-precision timestamps (see JAVA-727 and our online manual for details), and the time-based UUID generator now includes process information gathered through calls to getpid() (see JAVA-444 for details).

A new Maven dependency on JNR has been added to the driver: if you are using a Maven-based build tool, you don’t need to do anything to start benefiting from JNR right away. But don’t worry: if the library is not found in your application’s classpath, or if the operating system does not meet the minimum requirements, the driver will transparently fall back to non-native replacements (a warning will be logged in such cases).

Stay tuned, as more components will certainly benefit from JNR in the future!

Percentile-based latency trackers are now production-ready

Remember when we first mentioned percentile-based latency trackers when we released 2.0.10? Today we decided that they are mature enough for production use, so we just removed their “beta” status. We encourage all users to try them out in latency-aware components such as speculative execution policies and the slow query logger.

But there’s more: we now offer two implementations of LatencyTracker that use latency histograms behind the scenes:

  1. A brand-new ClusterWidePercentileTracker that records one single latency histogram for the whole cluster, thus comparing hosts against each other and making the latencies of slower hosts appear in the higher percentiles. We recommend this implementation for most clusters.
  2. And the already-existing PerHostPercentileTracker that maintains separate histograms for each host, thus comparing each host solely against itself.

Note that the API has evolved since the first beta preview, so please read the javadocs carefully to understand how to use the aforementioned components. Also, if you intend to use them, you still need to explicitly include the Maven dependency below, because the driver declares it as optional:

<dependency>
  <groupId>org.hdrhistogram</groupId>
  <artifactId>HdrHistogram</artifactId>
  <version>2.1.4</version>
</dependency>

Retry policies and idempotence inference

When the driver version 3.0.0 was released two months ago, we introduced several enhancements to retry policies. Today we have good news for our 2.1 users: we decided that these changes were so useful that they deserved to be back-ported to 2.1.10!

In order not to break binary compatibility though, the additions to RetryPolicy were included in 2.1.10 in a sub-interface, ExtendedRetryPolicy. But don’t worry: all retry policies shipped with the driver implement it, and as far as the default retry policy is concerned, nothing changes: it still behaves in the exact same way as before.

Because statement idempotence determines whether or not it is safe to re-execute a statement, the driver has recently improved its idempotence awareness in many ways: some of you might already be familiar with the new IdempotenceAwareRetryPolicy, which has also been back-ported to 2.1.10. To further help our users adopt this new retry policy, the driver now transparently propagates the idempotent flag from RegularStatement to PreparedStatement and then down to BoundStatement, so that they can all safely be used in conjunction with IdempotenceAwareRetryPolicy. See JAVA-863 for details.

Furthermore, the Query Builder’s ability to infer idempotence has been refined thanks to JAVA-1089: the Query Builder now automatically detects when a statement is a lightweight transaction (LWT) and marks it as non-idempotent. The rationale behind this decision is that LWT statements should never be retried, as a retry could break transaction linearizability.

Idempotence inference has also been improved in the Object Mapper: thanks to JAVA-923, all queries generated by the mapper are now automatically marked as idempotent, thus making them eligible for retries under IdempotenceAwareRetryPolicy. Similarly, the @QueryParameters annotation now accepts a new idempotent attribute, should you need to tune the idempotence of your Accessor queries.

Finally, everything you should know about idempotence is explained in our new online documentation page on idempotence: check it out!

(You might wonder why the driver does not provide a general mechanism to infer idempotence for all queries: this is unfortunately not possible because the driver does not parse queries and thus has very limited knowledge of what’s being executed – except, to a certain extent, for queries generated with the Query Builder. However, Cassandra itself could help the driver determine if a prepared statement is idempotent: if you are interested in this feature, vote for CASSANDRA-10813.)

Other major improvements

Usability

Thanks to JAVA-1019, the Java driver now has Schema Builder support for CREATE, ALTER and DROP KEYSPACE statements. Here’s a glimpse of what the fluent API looks like (ImmutableMap comes from Guava):

import com.google.common.collect.ImmutableMap;
import static com.datastax.driver.core.schemabuilder.SchemaBuilder.createKeyspace;

SchemaStatement statement = createKeyspace("ks")
        .with()
        .durableWrites(true)
        .replication(
                ImmutableMap.<String, Object> of(
                        "class", "SimpleStrategy",
                        "replication_factor", 1
                )
         );

And thanks to JAVA-1040, the Query Logger now supports logging parameters for SimpleStatements, the same way it does for BoundStatements.

Performance

JAVA-1070 has changed the way the Object Mapper prepares its queries internally, so that it no longer blocks while waiting for its statements to be prepared.

Fault tolerance

JAVA-852 has strengthened the driver’s resilience to gossip bugs in Cassandra by ignoring peers that have invalid entries in any of the key fields of the system.peers table (this crucial table is read by the driver when it performs node discovery and when it detects changes in the cluster’s topology).

Getting the driver

As always, the driver is available from Maven and from our downloads server.

We’re also running a platform and runtime survey to improve our testing infrastructure. Your feedback would be most appreciated.


Introducing Planet TinkerPop – a New Website for Graph Database Technology Professionals

We’re excited to let you know about a new website devoted to the education and advancement of graph database technology – Planet TinkerPop. Put together by those in the graph community along with our graph technology pros here at DataStax, Planet TinkerPop is the place to go to collaborate with and learn from others working in graph technology.

Because Apache TinkerPop™ is used by virtually every graph database on the market, you’ll discover a who’s-who of graph database companies participating along with members of the TinkerPop open source community. You’ll find technical articles, success stories, and much more available on the site, so whether you’re just now investigating graph technology to see what it can do for you or already neck-deep in graph work today, bookmark Planet TinkerPop and check back often to see what the TinkerPop community is doing.

Python Driver 3.2.0 Released

The DataStax Python Driver 3.2.0 for Apache Cassandra has been released. As with the previous release, work in this one was again focused on the object mapper, cqlengine. A complete list of issues is available in the CHANGELOG. In this short post I highlight a few of the most interesting new features.

Implicit Deferred Fields in Read Queries

The latest version of cqlengine now “defers” (does not select) columns for which values are already known. This has two interesting upshots. First, values for any equals-constrained columns are not read back from the server, meaning less network traffic and deserialization overhead for the results. Second, identical values are shared among records, meaning the memory footprint is not bloated with redundant values for large result sets.

Model instances still come back fully-populated with these values — they are just managed more efficiently under the hood.
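
For example, with a hypothetical cqlengine model (all names below are made up for illustration), an equality filter on the partition key means that column is no longer read back from the server:

from cassandra.cqlengine import columns, connection
from cassandra.cqlengine.models import Model

class Reading(Model):                      # hypothetical model, for illustration only
    __keyspace__ = 'mykeyspace'
    sensor_id = columns.Text(partition_key=True)
    ts = columns.DateTime(primary_key=True)
    value = columns.Double()

connection.setup(['127.0.0.1'], 'mykeyspace')

# sensor_id is equality-constrained, so the generated SELECT no longer fetches it;
# each returned Reading still has sensor_id == 'kitchen', shared rather than duplicated.
rows = list(Reading.objects(sensor_id='kitchen'))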

Token Aware Routing

The mapper layer now takes advantage of token-aware routing in the core driver. Now, the driver will compute the routing key whenever all partition key columns are present in a statement execution. This improves latency (and serial throughput) for most workloads where replication factor is less than node count.

Routing key calculation is enabled by default. In the event that your workload does not warrant token aware routing (e.g., where replication factor == number of nodes), it can be disabled using the new __compute_routing_key__ model attribute.
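
Here is a sketch of opting out on a hypothetical model:

from cassandra.cqlengine import columns
from cassandra.cqlengine.models import Model

class Event(Model):                        # hypothetical model, for illustration only
    __keyspace__ = 'mykeyspace'
    __compute_routing_key__ = False        # skip routing key computation for this model
    node_id = columns.Text(partition_key=True)
    ts = columns.DateTime(primary_key=True)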

More Conditional Operators for LWT Conditional Statements

Previously, the conditional API only supported equality predicates:

table.objects(k=0).iff(v=0).update(v=1)

Now, the iff comparisons can use the full range of filtering operators* supported in the rest of the API:

table.objects(k=0).iff(v=0).update(v=1)
table.objects(k=0).iff(v__gte=1).update(v=2)
table.objects(k=0).iff(v__gt=1).update(v=3)
table.objects(k=0).iff(v__ne=5).update(v=4)

* supported in Cassandra 2.1+

Wrap

As always, thanks to all who provided contributions and bug reports. The continued involvement of the community is appreciated.

Don’t forget to shut the front door (and secure your OpsCenter)

Databases are one of the most critical aspects, if not the most critical, of any application infrastructure, which makes security a top concern for anyone relying on or managing a database. Because our customers rely heavily on DataStax Enterprise (DSE), we take security very seriously. Our Security Program covers the key aspects of keeping our customers secure.

While most of our customers are well aware of how important security is for their DSE clusters, I think it’s important to write this post as a reminder. Beyond securing your DSE cluster it’s just as important to ensure that your installation of DataStax OpsCenter has the proper security settings enabled.

All network security starts with strict and proper firewall rules, allowing only the absolute minimum traffic in or out of your network.  This is especially important when running your infrastructure in a public cloud.  OpsCenter only requires that a single port be available (8888 by default) to end users so that they can access the OpsCenter web server.  Whether this server is running in a public cloud or not, it’s important to enable authentication and HTTPS.  OpsCenter authentication can be configured to use built-in storage or to authenticate against a central LDAP server.  When authentication is enabled customers can also configure fine grained permissions, ensuring that users only have access to the things they need.  Enabling HTTPS is important to prevent man-in-the-middle attacks, especially when authentication is enabled.
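
As a rough sketch only (option names and sections can differ between OpsCenter releases, so treat the following as an assumption and verify against the documentation for your version), authentication and HTTPS are typically enabled in opscenterd.conf along these lines:

[webserver]
port = 8888
# Assumed option names for enabling HTTPS on the OpsCenter web server.
ssl_keyfile = /path/to/opscenter.key
ssl_certfile = /path/to/opscenter.pem
ssl_port = 8443

[authentication]
# Assumed option name for turning on OpsCenter authentication.
enabled = True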

In the best case scenario all of the network traffic on your nodes will be limited to your private infrastructure and other nodes in the cluster.  Even in this case it is still a good idea to ensure that the communication between OpsCenter and the DataStax Agents on each node is also secure.  This is achieved by enabling SSL encryption with two-way authentication on the connection between the two components.

And just in case, the OpsCenter Best Practice Service has your back.  There are several security rules enabled by default that will alert you if any of these best practices are not enabled in your infrastructure.

If you’re not sure if you configured things properly or simply have questions or concerns, don’t hesitate to contact DataStax Support, or email security@datastax.com to report any security concerns.

Familiarize yourself with the Best Practice Service in OpsCenter

Introduction

On the OpsCenter team, we work on features to simplify the operational aspects of managing and maintaining your DataStax Enterprise (DSE) clusters. If you are interested in learning how OpsCenter can help ensure your DSE clusters are following best practices, please continue reading.

What is the Best Practice Service?

OpsCenter’s Best Practice Service periodically scans your DSE clusters in the background to detect issues that can threaten a cluster’s security, availability and performance. The service utilizes a set of expert rules spanning categories such as OS, configuration and replication to verify that your deployment is in good shape. Any deviations from best practices are reported back to you with advice on how to resolve the issues.

The Best Practice Service is configurable to your needs on a per cluster basis. Each rule can be turned on/off and set to run at a time and frequency of your choice. OpsCenter’s Best Practice Service ensures your deployment is configured, secured and optimized to run as efficiently as possible, removing the burden of doing that yourself.

Workflow

OpsCenter’s Best Practice Service is ready to go out of the box with all rules preconfigured to run daily. Let me take you through a workflow to show you how it works:

  • You can access the Best Practice Service from the services tab.

  • Each individual rule is automatically run at a specified time and frequency.  

  • Hovering over a rule tells you more about it. The time when a rule scan occurs and its frequency can be configured individually for each rule by clicking configure.

  • If a rule scan fails, an “ALERT” event is generated.

  • By clicking on a failed rule, you can see additional details regarding the failure.

Try it out

For more information about OpsCenter’s Best Practice Service check out our documentation. To download the latest version of DataStax OpsCenter, go to our downloads page.

 

Six parameters to tune for cqlsh COPY FROM performance

Introduction

The purpose of this article is to describe how to improve performance when importing csv data into Cassandra via cqlsh COPY FROM. If you are not familiar with cqlsh COPY, you can read the documentation here. There is also a previous blog article describing recent changes in cqlsh COPY.

The practical steps described here follow from this companion blog entry, which describes the reasoning behind these techniques.

Setup

By default cqlsh ships with an embedded Python driver that is not built with C extensions. In order to increase COPY performance, the Python driver should be installed with the Cython and libev C extensions:

  • after installing the dependencies described here,
  • install the driver by following the instructions that are available here.

Once the driver is installed, cqlsh must be told to ignore the embedded driver and to use the installed driver instead. This can be done by setting the CQLSH_NO_BUNDLED environment variable to any value, for example on Linux:

export CQLSH_NO_BUNDLED=TRUE

Using the benchmark described here, an improvement of approximately 60% was achieved by using a driver compiled with C extensions.

You can also compile the cqlsh copy module with Cython. Whilst the driver has been optimized for Cython and using a driver with Cython extensions makes a significant difference, compiling the cqlsh copy module (a standard Python module) may only give an additional performance boost of approximately 5%. Should you wish to compile the copy module with Cython, first install Cython if it is not already available:

pip install cython

Then, in the Cassandra installation folder, go to pylib and compile the module with the following command:

python setup.py build_ext --inplace

This command will create pylib/cqlshlib/copyutil.c and copyutil.so. Should you need to revert to pure interpreted Python code, simply delete these two files.

On Linux, it is advantageous to increase CPU time slices by setting the CPU scheduling type to SCHED_BATCH. To do so, launch cqlsh with chrt as follows:

chrt -b 0 cassandra_install_dir/bin/cqlsh

Changing CPU scheduling may boost performance by an additional 5%.

Parameters that affect performance

There are six COPY FROM parameters that you can experiment with to optimize performance for a specific workload.

NUMPROCESSES=N

This is the number of worker processes; its default value is the number of cores on the machine minus one, capped at 16. You can observe CPU idle time via dstat (for example, dstat -lvrn 10); if there is idle CPU time with the default number of worker processes, you can increase it.

CHUNKSIZE=5000

This is the number of rows sent from the feeder process (which reads data from files) to worker processes; depending on the average row size of the data set, it may be advantageous to increase the value of this parameter.

MINBATCHSIZE=10

For each chunk, a worker process will batch by ring position as long as there are at least MINBATCHSIZE entries for each position; otherwise it will batch by replica. The advantage of batching by ring position is that any replica for that position can process the batch. Depending on the size of the chunk, the number of nodes in the cluster and the number of VNODES for each node, this value may need adjusting:

  • the larger the chunk size, the easier it is to find batching opportunities;
  • the larger the number of tokens (VNODES x NUM_NODES), the harder it is to find batching opportunities.

MAXBATCHSIZE=20

If batching by replica, worker processes will split batches when they reach this size. Increasing this value can be useful if you notice timeouts, that is, if the cluster has almost reached capacity. However, a batch size that is too large may trigger warnings and eventually cause batches to be rejected. Two parameters in cassandra.yaml control this behavior:

  • batch_size_warn_threshold_in_kb
  • batch_size_fail_threshold_in_kb.

INGESTRATE=100000

This is the speed, in rows per second, at which the feeder process sends data to the worker processes. Normally there is no need to change this value, unless a rate greater than 100,000 rows per second is achievable, in which case it can be increased.

PREPAREDSTATEMENTS=True

Setting this parameter to False disables prepared statements. With prepared statements enabled (the default), worker processes not only parse csv data, they also convert csv fields into the types corresponding to the CQL column types; the string “3”, for example, will be converted to an integer with value 3, and so forth. Python is not efficient at this type of processing, especially with complex data types such as datetime values or composite types (collections or user types). Disabling prepared statements makes the Cassandra process perform the type parsing instead, but it also forces the Cassandra process to compile each CQL batch statement, resulting in considerable pressure on it. It is recommended that prepared statements remain enabled in most cases, unless cluster overload is not a concern.
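
Putting it all together, here is a sketch of an import that sets all six parameters explicitly; the keyspace, table, columns and file pattern are placeholders, and the values are merely starting points to experiment with rather than recommendations:

COPY mykeyspace.mytable (id, name, value)
FROM 'data/*.csv'
WITH NUMPROCESSES = 12
 AND CHUNKSIZE = 5000
 AND MINBATCHSIZE = 10
 AND MAXBATCHSIZE = 20
 AND INGESTRATE = 100000
 AND PREPAREDSTATEMENTS = True;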

For further background information on why these parameters affect performance, please refer to the companion blog entry.
