Channel: Dev Posts – DataStax

How we optimized Cassandra cqlsh COPY FROM


Introduction

This article is a summary of the changes introduced by CASSANDRA-11053 and the reasons behind them. It demonstrates how performance can be improved from a baseline benchmark of circa 35,000 rows per second to as much as 117,000 rows per second, and describes performance optimization techniques that can be applied to Python programs in general.

In addition to reading this article, you can read this companion blog entry, which describes the practical steps required to take advantage of the techniques described here.

Measuring performance

Before optimizing, it is necessary to determine where in the code most of the time is spent.

Two profilers were used to optimize cqlsh COPY FROM: the Python cProfile module and the line profiler.

cProfile reports execution times per function call, which is very useful for identifying the functions where most of the execution time is spent. Given one or more functions, the line profiler can then report execution times line by line.

The following helper functions were used to start and stop profiling:

[Code listing: profiler helper functions]

The code for these helper functions is available in this Cassandra source code file.
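For orientation, here is a minimal sketch of what such helpers can look like. This is not the Cassandra implementation linked above; it assumes the line_profiler package and that function objects (rather than just names) are registered for line-by-line profiling.

import cProfile
import pstats

try:
    from line_profiler import LineProfiler
except ImportError:
    LineProfiler = None

def profile_on(fcns=None):
    # Use the line profiler when it is available and functions were requested,
    # otherwise fall back to cProfile.
    if fcns and LineProfiler is not None:
        pr = LineProfiler()
        for fcn in fcns:
            pr.add_function(fcn)  # line_profiler registers function objects
        pr.enable()
        return pr

    pr = cProfile.Profile()
    pr.enable()
    return pr

def profile_off(pr, file_name):
    # Stop profiling and save the results, as text, to file_name.
    pr.disable()
    with open(file_name, 'w') as f:
        if LineProfiler is not None and isinstance(pr, LineProfiler):
            pr.print_stats(stream=f)
        else:
            pstats.Stats(pr, stream=f).sort_stats('cumulative').print_stats()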

If the line profiler is installed and function names are passed to profile_on(), then the line profiler is used; otherwise cProfile is used by default. Either way, profile_on() returns an object that must be passed back to profile_off(), along with a filename where the profile results will be saved in text format. For example, given a boolean flag, PROFILE_ON, that indicates whether profiling is enabled, a high level function can be profiled as follows:

[Code listing: invoking the profiler helpers]
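A rough sketch of that invocation pattern, using the helpers sketched above (PROFILE_ON, import_records and the output filename are hypothetical placeholders):

import os

PROFILE_ON = True  # hypothetical flag indicating whether profiling is enabled

def import_records(records):
    # Hypothetical high level function wrapped with the profiling helpers.
    pr = profile_on() if PROFILE_ON else None
    try:
        return [r.upper() for r in records]  # stand-in for the real work
    finally:
        if pr is not None:
            profile_off(pr, file_name='profile_%d.txt' % os.getpid())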

In addition to profiling the code, the Linux strace command is very useful for determining whether significant time is spent on specific system calls, including lock contention (futex calls). Given a process pid, strace can be attached to the process as follows:

strace -vvvv -c -o strace.out -e trace=all -p pid&

To strace Python child processes that are spawned via the multiprocessing module, you can utilize the process.pid that is available after starting the process:

os.system("strace -vvvv -c -o strace.{pid}.out -e trace=all -p {pid}&".format(pid=process.pid))

A very useful tool to monitor performance on Linux is dstat:

dstat -lvrn 10

This command displays machine parameters such as CPU, disk IO, memory and network activity. The command above updates information every second, and it creates a new line every 10 seconds. Sample output:

[Sample dstat output]

The --procs-- and --total-cpu-usage-- sections are particularly useful: too many blocked (blk) tasks are bad and indicate that the CPU cannot keep up. At the same time, you should aim to minimize CPU idle time (idl) so as not to waste CPU cycles that are available. Other informative sections are --io/total-- and --net/total--, which indicate disk and network activity respectively. Finally, the --system-- section displays interrupts and context switches.

What we optimized

In CPython, the global interpreter lock (GIL) prevents native threads from executing Python bytecode simultaneously. As a consequence, introducing multiple threads does not improve performance; in fact, the GIL overhead degrades it, especially on multiprocessor systems.

In order to scale Python programs running in CPython, or in other Python implementations with a GIL such as PyPy, it is necessary to spawn child processes. This is relatively simple to accomplish via the Python multiprocessing module: its API is designed to be similar to that of the threading module, but it applies to processes rather than threads.

For example, locks are implemented with POSIX semaphores on Unix, which can synchronize processes. There is also support for shared memory and for interprocess communication via pipes. Be mindful, however, that the capabilities of this module differ significantly between Unix and Windows, mostly due to the absence of os.fork() on Windows.
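As a minimal illustration of the approach (a generic sketch, not the actual COPY FROM code), the multiprocessing module makes it straightforward to split work across one process per core while reserving a core for the parent:

from multiprocessing import Process, cpu_count

def worker(worker_id, rows):
    # Stand-in for the real per-process import work.
    print('worker %d handling %d rows' % (worker_id, len(rows)))

if __name__ == '__main__':
    rows = list(range(100))
    num_workers = max(cpu_count() - 1, 1)  # leave one core for the parent/feeder
    chunks = [rows[i::num_workers] for i in range(num_workers)]
    processes = [Process(target=worker, args=(i, chunk))
                 for i, chunk in enumerate(chunks)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()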

COPY FROM is implemented using the following processes:

Parent process

This is the original cqlsh process, and its responsibilities include receiving results, displaying the progress report, and terminating other processes when the operation completes.

Feeder process

This is a single dedicated process that reads data from disk, splits it into chunks and sends it to worker processes. The reason for a single feeder process is to support parameters such as the ingest rate, skip rows and max rows, which determine how fast data should be fed to worker processes, whether rows need to be skipped at the beginning of the data set, and whether the import should terminate after a maximum number of rows.

Worker processes

These are the processes that perform the actual import of the data. By default, there is one process per core, except that one core is reserved for the combined feeder and parent processes.

In the initial optimization of CASSANDRA-9302, the feeder process did not exist and the parent process was performing all the reading and csv decoding. In CASSANDRA-11053, the feeder process was introduced and csv decoding was moved to worker processes. This improved performance significantly as the parent process was previously a bottleneck. It also resulted in real-time progress reporting becoming more accurate, since the parent process is no longer dividing its time between receiving results and sending more work.

Another important optimization was to reduce the number of results sent from worker processes back to the parent process. By aggregating results at the chunk level, rather than sending a result for each single batch, worker processes spend less time reporting and more time decoding and importing data, increasing throughput.

Initially, communication across processes was implemented via the Queue class of the Python multiprocessing module. Whilst this is very convenient to use, on Unix it is implemented via a non-duplex pipe, guarded by a bounded semaphore and fed by a thread that queues messages. This single shared pipe was a source of lock contention amongst worker processes. Performance improved by replacing this queue with a pool of pipes, so that each worker process has a dedicated channel to report back results and to receive work. As a consequence, there is now one channel per process. The downside is that the parent needs to monitor all incoming channels: on Unix this is easy to do via a select call, but on Windows select is not available for pipes, and therefore every single channel must be polled for a short period of time.
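The pattern of one dedicated channel per worker, with the parent multiplexing over the channels via select on Unix, can be sketched as follows (a toy example under those assumptions, not the cqlsh implementation):

import select
from multiprocessing import Process, Pipe

def worker(conn):
    # Dedicated channel per worker: receive chunks, send back one result per chunk.
    for chunk in iter(conn.recv, None):  # None acts as a poison pill
        conn.send(('imported', len(chunk)))
    conn.close()

if __name__ == '__main__':
    workers = []
    for _ in range(4):
        parent_end, child_end = Pipe()  # one duplex pipe per worker
        p = Process(target=worker, args=(child_end,))
        p.start()
        parent_end.send(['row1', 'row2', 'row3'])
        workers.append((p, parent_end))

    # On Unix, select() tells the parent which channels have results ready.
    pending = [conn for _, conn in workers]
    while pending:
        ready, _, _ = select.select(pending, [], [])
        for conn in ready:
            print(conn.recv())
            conn.send(None)  # tell the worker to exit
            pending.remove(conn)

    for p, _ in workers:
        p.join()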

All of the Python performance recommendations described here were extremely useful. Replacing a built-in call, bisect_right, operating on custom Python types with the same call operating on the plain integers wrapped by those types improved the execution time of this call over 1 million entries from 2.2 to 0.4 seconds. The test code is available here.
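The following toy benchmark illustrates the same idea; the Token wrapper class is invented for illustration and is not one of the driver's actual types:

import bisect
import time

class Token(object):
    # Toy wrapper type standing in for a richer Python object.
    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        return self.value < other.value

wrapped = [Token(i) for i in range(1000000)]
plain = [t.value for t in wrapped]

start = time.time()
for i in range(0, 1000000, 10):
    bisect.bisect_right(wrapped, Token(i))
print('wrapped objects: %.2fs' % (time.time() - start))

start = time.time()
for i in range(0, 1000000, 10):
    bisect.bisect_right(plain, i)
print('plain integers:  %.2fs' % (time.time() - start))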

Python name lookup is another source of delays on critical paths: the Python interpreter takes time to look up references to functions, so storing a function reference in a local variable before entering a loop reduced latency in critical-path methods.
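For example (a generic illustration, not code from the patch), caching a function reference in a local variable before a hot loop avoids the repeated attribute lookup:

import math

def slow(values):
    # math.sqrt is looked up on every iteration.
    return [math.sqrt(v) for v in values]

def fast(values):
    # Cache the function reference in a local variable before the loop.
    sqrt = math.sqrt
    return [sqrt(v) for v in values]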

Results

This benchmark ran COPY FROM on an Amazon r3.2xlarge virtual machine against Cassandra running on 8 i2.2xlarge virtual machines.

Initial benchmark performance

Initial performance was about 35,000 rows per second for the 1 kbyte test case, where each partition is approximately 1 kbyte in size and all partition and clustering keys are unique. The partition key is of type TEXT, the clustering key is BIGINT and there is a single data column of type TEXT.

Out-of-the-box final performance

Our newly optimized version increased performance to 70,000 rows per second if running cqlsh out of the box.

Python C extensions via Cython

Performance can be increased further to 110,000 rows per second if using a driver built with C extensions, and to 117,000 rows per second if compiling the cqlsh copy module with Cython as well.

The remaining performance recommendations described in this blog post were also applied. Specifically, running with batch CPU scheduling is worth about 6,000-7,000 rows per second, and the following parameters were tuned for this specific benchmark by observing CPU usage with dstat: NUMPROCESSES=12 (there were 8 cores on the VM) and CHUNKSIZE=5,000.

The tests were performed without vnodes; increasing the number of vnodes degrades performance, but this can be mitigated by increasing MINBATCHSIZE and MAXBATCHSIZE.

One thing worth noting is that this benchmark only uses very simple types, TEXT and BIGINT. Performance may be significantly lower with complex data types such as date-times, collections and user-defined types. For these types, compiling the cqlsh copy module with Cython is worth considering.

For more details on the final results, you can refer to this comment, and the suggested future optimizations described at the end.


The Mechanics of Gremlin OLAP


A Gremlin traversal is an abstract description of a legal path through a graph. In the beginning, a single traverser is created that will birth more traversers as a function of the instructions dictated by the traversal. A branching familial tree of traversers is generated from this one primordial, patient zero, adamic traverser. Many traversers will die along the way. They will be filtered out, they will walk down dead-end subgraphs, or they will meet other such fates which conflict with the specification of the traversal as defined by the user (the true sadist in this story). However, the traversers that are ultimately returned are the result of a traverser lineage that has survived the traversal-guided journey across the graph. These traversers are recognized for the answers they provide, but it is only because of the unsung heroes that died along the way that we know that their results are sound and complete.


Consider the following traversal, which answers the question: "What is the distribution of labels of the vertices known by people?" That is, what are the types and counts of the things that people know? This traversal assumes a graph where a person might know an animal, a robot, or just maybe, another person. The result is a Map such as [person:107, animal:1252, robot:256].

g.V().hasLabel("person").
  out("knows").label().
  groupCount()

Every legal path of the traversal through the graph is walked by a Traverser. A traverser holds a reference to both its current object in the graph (e.g. a Vertex) and its current Step in the traversal (e.g. label()). If the traverser is currently at vertex v[1] and step label(), then the traverser will walk to the String label of v[1]. As such, label() executes a one-to-one mapping (MapStep), as a vertex can have one and only one label. A one-to-many mapping (FlatMapStep) occurs if the traverser is currently at v[1] and at the step out("knows"). In this situation, the traverser will branch the traverser family tree by splitting itself across all "knows"-adjacent vertices of v[1]. A many-to-one mapping occurs via a ReducingBarrierStep, which aggregates all the traversers up to that step and then emits a single traverser representing an analysis of that aggregate. The groupCount()-step is an example of a reducing barrier step. Finally, there is a one-to-maybe mapping (FilterStep). The step hasLabel("person") will either let the traverser pass if it is at a person vertex or it will filter it out of the data stream.

The generic forms of the steps mentioned above are the fundamental processes of any Gremlin traversal. It is important to note that a traversal does not define how these processes are to be evaluated. It is up to the Gremlin traversal machine to determine the means by which the traversal is executed. The Gremlin traversal machine is an abstract computing machine that is able to execute Gremlin traversals against any TinkerPop-enabled graph system. In general, the machine's algorithm moves traversers (pointers) through a graph (data) as dictated by the steps (instructions) of the traversal (program). The Gremlin traversal machine distributed by Apache TinkerPop provides two implementations of this algorithm.


  1. Chained Iterator Algorithm (OLTP): Each step in the traversal reads an iterator of traversers from “the left” and outputs an iterator of traversers to “the right” in a stream-based, lazy fashion. This is also known as the standard OLTP execution model.
  2. Message Passing Algorithm (OLAP): In a distributed environment, each step is able to read traverser messages from "the left" and write traverser messages to "the right." If a message references an object that is locally accessible, then the traverser message is further processed. If the traverser references a remote object, then the traverser is serialized and it continues its journey at the remote location. This is also known as the computer OLAP execution model.

While both algorithms are semantically equivalent, the first is pull-based and the second is push-based. TinkerPop’s Gremlin traversal machine supports both modes of execution and thus, is able to work against both OLTP graph databases and OLAP graph processors. Selecting which algorithm is used is a function of defining a TraversalSource that will be used for subsequent traversals.

g = graph.traversal()                // OLTP
g = graph.traversal().withComputer() // OLAP

This article is specifically about Gremlin OLAP and its message passing algorithm. The following sections will discuss different aspects of this algorithm in order to help elucidate the mechanics of Gremlin OLAP.


Vertex-Centric Computing

Every TinkerPop-enabled OLAP graph processor implements the GraphComputer interface. A GraphComputer is able to evaluate a VertexProgram. A vertex program can be understood as a "chunk of code" that is evaluated at each vertex in a (logically) parallel manner. In this way, the computation happens from the "perspective" of the vertices and thus, the name vertex-centric computing. Another term for this distributed computing model is bulk synchronous parallel. The vertex program's chunk does three things in a while(!terminated)-loop:

  1. It reads messages sent to its vertex.
  2. It alters its vertex’s state in some way.
  3. It sends messages to other vertices (adjacent or otherwise).

The vertex program typically terminates when there are no more messages being sent. TraversalVertexProgram is a particular vertex program distributed with Apache TinkerPop that knows how to evaluate a Gremlin traversal using message passing. The chunk of code is (logically) distributed to each vertex which contains a Traversal.clone() (a worker traversal). A vertex receives traverser messages that reference a step in the traversal clone. That step is evaluated. If the result is a traverser that does not reference data at the local vertex, then the traverser is messaged away to where that data is. This process continues until no more traverser messages exist in the computation. Besides the distributed worker traversals, there also exists a single master traversal that serves as the coordinator of the computation — determining when the computation is complete and handling global barriers that synchronize the workers at particular steps in the traversal.
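To make the loop above concrete, here is a toy sketch of bulk synchronous message passing, written in Python purely for illustration; it is not TinkerPop code, and the graph and per-vertex "program" are invented:

# A toy graph: each vertex maps to its adjacent vertices.
graph = {'v1': ['v2', 'v3'], 'v2': ['v3'], 'v3': []}

def vertex_program(vertex, incoming):
    # The "chunk of code" run at each vertex: forward each message to every neighbour.
    outgoing = {}
    for msg in incoming:
        for neighbour in graph[vertex]:
            outgoing.setdefault(neighbour, []).append(msg + 1)
    return outgoing

messages = {'v1': [0]}  # a single primordial "traverser" starts at v1
while messages:         # terminate when no messages remain in flight
    next_messages = {}
    for vertex, incoming in messages.items():  # (logically) evaluated in parallel
        for dest, msgs in vertex_program(vertex, incoming).items():
            next_messages.setdefault(dest, []).extend(msgs)
    messages = next_messages
print('computation complete')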


Worker Graph Partitions

The graph data structure ingested by any OLAP GraphComputer is an adjacency list. Each entry in this list represents a vertex, its properties, and its incident edges. In TinkerPop, a single vertex entry is known as a StarVertex. Thus, the adjacency list read by a graph processor can be abstractly defined as List<StarVertex>. Typically, a graph processor supports parallel execution, whether parallelization is accomplished via threads in a machine, machines in a cluster, or threads in machines in a cluster. Each parallel worker processes a subgraph of the entire graph called a graph partition. The partitions of the graph's adjacency list are abstractly defined as List<List<StarVertex>>. If this list is called partitions and there are n workers, then partitions.size() == n and worker i is responsible for processing partitions.get(i).

What does worker i do with its particular List<StarVertex> partition? The worker will iterate through the list and for each StarVertex it will process any messages associated with that vertex. For TraversalVertexProgram, the messages are simply Traversers. If the traverser's current graph (data) location is v[1], then it will attach itself to v[1] and then evaluate its current step (instruction) location in the worker's Traversal.clone(). That step will yield output traversers according to its form: one-to-many, many-to-one, one-to-one, etc. If the output traversers reference objects at v[1], then they will continue to execute. For instance, outE() will put a traverser at every outgoing incident edge of v[1], where these incident edges are contained in the StarVertex data structure. Moreover, values("name") will put a traverser at the String name of v[1]. There are three situations that do not allow the traverser to continue its processing at the current StarVertex.

  1. The traverser no longer references a step in the traversal, at which point it halts. This means the traverser has completed its journey through the graph and the traversal, and it is stored in a special vertex property called HALTED_TRAVERSERS, which contains all the traversers that have halted at the respective StarVertex. The halted traversers form a subset of the final result.
  2. The traverser no longer references an object at the local StarVertex and thus must turn itself into a message and transport itself to the StarVertex that it does reference. The traverser detaches itself and serializes itself across the network (or is stored locally if the StarVertex in question is accessible at the current worker's partition).
  3. The traverser no longer references any data object and thus, is considered dead and is removed from the computation. This occurs when the previous graph location of the traverser is deemed not acceptable by the traversal.

This message passing process continues for all workers until all traversers are either destroyed or halted. The final answer to the traversal query is the aggregation of all the graph locations of the halted traversers distributed across the HALTED_TRAVERSERS of the vertices.


Barrier Synchronization

There are some steps whose computation cannot be evaluated in parallel and require an aggregation at the master traversal. Such steps implement an interface called `Barrier` and include count(), max(), min(), sum(), fold(), groupCount(), group(), etc. Barrier steps are handled in a special way by the TraversalVertexProgram. When a traverser enters a barrier step at a worker traversal, it does not come out the other side. Instead, Barrier.nextBarrier() is used to grab all the traversers that were barriered at the current worker, and they are then sent to the master traversal for aggregation along with the sibling worker barriers of the analogous step. For ReducingBarrierSteps, distributed processing occurs to yield a barrier that is not the aggregate of all traversers, but instead an aggregate of their reduced associative/commutative form. For instance, CountStep.nextBarrier() produces a single Long number traverser. The master traversal's representation of the barrier step aggregates all the distributed barriers via Barrier.addBarrier(). Then that master barrier step, like any other step, is next()'d to generate the single traverser from the many. If that single traverser references a graph object, it is messaged to the respective StarVertex for further processing by a worker traversal.


An OLAP traversal undulates from a distributed execution across worker traversal instances, to a local execution at the master traversal, back to a distributed execution across workers, and so on until all traversers have halted and the computation is complete. Note that there are other interesting barrier concepts, such as `LocalBarrier`, that can be studied by the interested reader in Apache TinkerPop's documentation.


The Future of Gremlin OLAP

As of TinkerPop 3.2.0, Gremlin OLAP's GraphComputer assumes that the input data is organized as an adjacency list (i.e. List<StarVertex>). Moreover, it assumes that each worker processes a subset of that list and that when a traverser leaves the current StarVertex, it must send itself to the respective remote StarVertex that it does reference. These two assumptions can be lifted in order to support GraphComputer implementations that may be more efficient (and/or expressive) for certain types of graphs and traversals.

  1. Subgraph-Centric Computing: If a single worker partition can hold its entire List<StarVertex> partition in memory, then when a traverser leaves the current StarVertex, it may still be able to execute deeper within the local partition's in-memory subgraph representation. Only when a traverser leaves a partition's subgraph would a message pass be required. This would significantly increase the speed of OLAP at the expense of requiring subgraphs to fit into memory. This model would also benefit greatly from a good partitioning strategy that ensures that worker subgraphs have more intra-partition edges than inter-partition edges.
  2. Edge-Centric Computing: A single StarVertex may contain a significant amount of data, especially as the graph grows. For example, famous people on Twitter can have on the order of 10 million+ incoming follows-edges. In order to reduce the memory requirements of the OLAP processor, as well as to better load balance a computation across machines, an edge-centric model can be used where the OLAP-ingested graph is an edge list abstractly defined as List<Edge>.

Both these models may one day be introduced into the current GraphComputer model. If so, Apache TinkerPop would support vertex-centric, subgraph-centric, and edge-centric computing spanning the gamut of useful distributed graph computing models. Fortunately, the user would be blind to the underlying execution algorithm. Behind the scenes, a traversal would infer its space/time-requirements and ask the GraphComputer to use a particular representation best suited for its evaluation.


Conclusion

The TraversalVertexProgram that drives the evaluation of a distributed Traversal is simple, containing only a few hundred lines of code. The complexity of the computation resides in both the vendor's GraphComputer implementation and Apache TinkerPop's Traversal implementation. TraversalVertexProgram merely stands between these two constructs, routing traversers amongst worker partitions in order to effect a distributed, OLAP-based evaluation of a Gremlin traversal over a TinkerPop-enabled graph processor.

Using (or plan to use) Orchestration Tools? Let Us Know


We’d love it if you’d take 5 seconds and vote in our new web poll (now running on the DSE home page) on your use of orchestration software. Thanks!

Python Driver 3.3.0 Released


The DataStax Python Driver 3.3.0 for Apache Cassandra has been released. This release had no specific area of focus, but brings a number of new features and improvements. A complete list of issues is available in the CHANGELOG. Here I will mention some of the new features.

New Address Translation Interface

The driver now provides a general purpose AddressTranslator Interface, which can be specialized to translate from cluster-defined RPC addresses to alternate interfaces for the native connection (private addresses, for example). This is a generally extensible interface, but will also be used in the future to add address resolution for common cloud providers.
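For instance, a translator might map private cluster addresses to public ones. A minimal sketch, where the class name and addresses are hypothetical:

from cassandra.cluster import Cluster
from cassandra.policies import AddressTranslator

class PrivateToPublicTranslator(AddressTranslator):
    # Hypothetical mapping from cluster-reported private IPs to public IPs.
    MAPPING = {'10.0.0.1': '54.10.20.30', '10.0.0.2': '54.10.20.31'}

    def translate(self, addr):
        return self.MAPPING.get(addr, addr)

cluster = Cluster(['54.10.20.30'], address_translator=PrivateToPublicTranslator())
session = cluster.connect()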

“Retry Next Host” Retry Policy Decision

The retry policy extension interface now supports a new decision — RETRY_NEXT_HOST. Instead of retrying the operation on the current host, this will cause the driver to send it on the next host in the query plan. This is useful for retrying in situations that can arise from failures on an isolated node. This action is not used by default, but it is available for use in custom retry policies.
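As an illustration, a custom policy might send an operation to the next host when a node reports itself unavailable. The class below is a hypothetical sketch, not a policy shipped with the driver:

from cassandra.cluster import Cluster
from cassandra.policies import RetryPolicy

class NextHostOnUnavailable(RetryPolicy):
    # Hypothetical policy: try the next host in the query plan once on "unavailable".
    def on_unavailable(self, query, consistency, required_replicas,
                       alive_replicas, retry_num):
        if retry_num == 0:
            return self.RETRY_NEXT_HOST, None
        return self.RETHROW, None

cluster = Cluster(default_retry_policy=NextHostOnUnavailable())
session = cluster.connect()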

SSL hostname verification

This version of the driver introduces a mechanism for forcing SSL hostname verification, which is required for maximum security using SSL. Clients can now set check_hostname in ssl_options and the driver will use ssl.match_hostname to verify.
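For example, assuming a CA certificate at a placeholder path:

import ssl
from cassandra.cluster import Cluster

ssl_opts = {
    'ca_certs': '/path/to/rootca.crt',  # placeholder path to the CA certificate
    'cert_reqs': ssl.CERT_REQUIRED,
    'check_hostname': True,             # verify the server hostname via ssl.match_hostname
}
cluster = Cluster(['node1.example.com'], ssl_options=ssl_opts)
session = cluster.connect()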

Contact Point DNS Resolution

The driver now resolves contact points when establishing the initial connection. All addresses resolved from a contact point are used as contact candidates, meaning multiple A records can be used to define a list of contact points by a single domain name. This address resolution also removes a previous deficiency of possibly duplicating hosts in cluster discovery.

Additional Server Information in Host Metadata

The driver now attaches more server information to the host metadata. Added in this release are broadcast_address, listen_address, and release_version. Please note that the address attributes are “best effort”, since they are not available in the local server tables in all versions of Cassandra.

Improved Exception Hierarchy

This release has a number of improvements around exceptions raised. First, all exceptions explicitly raised from the driver derive from DriverException, so they can be more easily distinguished from more generic types. Second, there is now additional hierarchy among the exceptions returned by server-side messages: they are now grouped under RequestExecutionException, CoordinationFailure, RequestValidationException, or ConfigurationException.

Wrap

As always, thanks to all who provided contributions and bug reports. The continued involvement of the community is appreciated.

Materialized View Performance in Cassandra 3.x


Materialized views (MV) landed in Cassandra 3.0 to simplify common denormalization patterns in Cassandra data modeling.  This post will cover what you need to know about MV performance; for examples of using MVs, see Chris Batey’s post here.

How Materialized Views Work

Let’s start with the example from Tyler Hobbs’s introduction to data modeling:

CREATE TABLE users (
    id uuid PRIMARY KEY,
    username text,
    email text,
    age int
);

We want to be able to look up users by username and by email.  In a relational database, we’d use an index on the users table to enable these queries.  With Cassandra, an index is a poor choice because indexes are local to each node.  That means that if we created this index:

CREATE INDEX users_by_name ON users (username);

… a query that accessed it would need to fan out to each node in the cluster, and collect the results together.  Put another way, even though the username field is unique, the coordinator doesn’t know which node to find the requested user on, because the data is partitioned by id and not by name. Thus, each node contains a mixture of usernames across the entire value range (represented as a-z in the diagram):

[Diagram: each node holds a mixture of usernames spanning the full a-z range]

This causes index performance to scale poorly with cluster size: as the cluster grows, the overhead of coordinating the scatter/gather starts to dominate query performance.

Thus, for performance-critical queries the recommended approach has been to denormalize into another table, as Tyler outlined:

CREATE TABLE users_by_name (
    username text PRIMARY KEY,
    id uuid
);

Now we can look up users with a partitioned primary key lookup against a single node, giving us performance identical to primary key queries against the base table itself, but these tables must be kept in sync with the users table by application code.

Materialized views give you the performance benefits of denormalization, but are automatically updated by Cassandra whenever the base table is:

CREATE MATERIALIZED VIEW users_by_name AS 
SELECT * FROM users 
WHERE username IS NOT NULL
PRIMARY KEY (username, id);

Now the view will be repartitioned by username, and just as with manually denormalized tables, our query only needs to access a single partition on a single machine since that is the only one that owns the j-m username range:

[Diagram: the users_by_name view is partitioned by username, so the query touches only the node owning the j-m range]

The performance difference is dramatic even for small clusters, but more importantly, we see that indexed performance levels off when doubling from 8 to 16 nodes in the (AWS m3.xl) cluster, as the scatter/gather overhead starts to become significant:

[Chart: reads per second for index vs. materialized view as cluster size grows]

Sidebar: When are Indexes Useful?

Indexes can still be useful when pushing analytical predicates down to the data nodes, since analytical queries tend to touch all or most nodes in the cluster anyway, making the primary advantage of materialized views irrelevant.  

Indexes are also useful for full text search–another query type that often needs to touch many nodes–now that the new SASI indexes have been released.

Performance Impact of Materialized Views on Writes

What price do we pay at write time, to get this performance for reads against materialized views?  

Recall that Cassandra avoids reading existing values on UPDATE.  New values are appended to a commitlog and ultimately flushed to a new data file on disk, but old values are purged in bulk during compaction.

Materialized views change this equation.  When an MV is added to a table, Cassandra is forced to read the existing value as part of the UPDATE.  Suppose user jbellis wants to change his username to jellis:

UPDATE users
SET username = 'jellis'
WHERE id = 'fcc1c301-9117-49d8-88f8-9df0cdeb4130';

Cassandra needs to fetch the existing row identified by fcc1c301-9117-49d8-88f8-9df0cdeb4130 to see that the current username is jbellis, and remove the jbellis materialized view entry.

(Even for local indexes, Cassandra does not need a read-before-write. The difference is that an MV denormalizes the entire row and not just the primary key, which makes reads more performant at the expense of paying the entire consistency price at write time.)

Materialized views also introduce a per-replica overhead of tracking which MV updates have been applied.

Added together, here’s the performance impact we see adding materialized views to a table.  As a rough rule of thumb, we lose about 10% performance per MV:

[Chart: write throughput impact of adding materialized views to a table]

Materialized Views vs Manual Denormalization

Denormalization is necessary to scale reads, so the performance hits of read-before-write and batchlog are necessary whether via materialized view or application-maintained table.  But can Cassandra beat manual denormalization?

We wrote a custom benchmarking tool to find out. mvbench compares the cost of maintaining four denormalizations for a playlist application, implemented via manual updates versus materialized views.

Here’s what manual vs MV looks like in a 3-node m4.xl EC2 cluster with RF=3, in an insert-only workload:

[Chart: insert throughput over time, manual denormalization vs. materialized views]

What we see is that after the initial JVM warmup, the manually denormalized insert (where we can “cheat” because we know from application logic that no prior values existed, so we can skip the read-before-write) hits a plateau and stays there.  The MV, while faster on average, has performance that starts to decline from its initial peak.

To understand these results, we need to explain what the mvbench workload looks like.  The data model is a table of playlists and four associated MV:

CREATE TABLE user_playlists
(
    user_name           text,
    playlist_name       text,
    song_id             text,
    added_time          bigint,
    artist_name         text,
    genre               text,
    last_played         bigint,
    PRIMARY KEY (user_name, playlist_name, song_id)
);

The MV created are song_to_user, artist_to_user, genre_to_user, and recently_played. For the sake of brevity I will show only the last:

CREATE MATERIALIZED VIEW IF NOT EXISTS mview.recently_played AS
      SELECT song_id, user_name
      FROM user_playlists
      WHERE song_id IS NOT NULL
      AND playlist_name IS NOT NULL
      AND user_name IS NOT NULL
      AND last_played IS NOT NULL
      PRIMARY KEY (user_name, last_played, playlist_name, song_id);

What is important to note here is that the base user_playlists table has a compound primary key. What is happening to cause the deteriorating MV performance over time is that our sstable-based bloom filter, which is keyed by partition, stops being able to short-circuit the read-old-value part of the MV maintenance logic, and we have to perform the rest of the primary key lookup before inserting the new data.

MV Performance Summarized

As a general rule then, you can apply the following rules of thumb for MV performance:

  • Reading from a normal table or MV has identical performance.
  • Each MV will cost you about 10% performance at write time.
  • For simple primary keys (tables with one row per partition), MV will be about twice as fast as manually denormalizing the same data.
  • For compound primary keys, MV are still twice as fast for updates but manual denormalization can better optimize inserts.  The crossover point where manual becomes faster is a few hundred rows per partition.  CASSANDRA-9779 is open to address this limitation.

DataStax Ruby Driver 3.0.0 Released!


Today, we are proud to release the first major revision of the Ruby driver in over a year. This release adds support for Cassandra 2.2 and 3.0+. Cassandra 2.2 introduced native protocol v4, and driver 3.0 fully supports it. While I will summarize the most important changes in this article, you can always refer to the changelog for the nitty gritty details.

Features and Improvements

Cassandra 2.2 and 3.0+

Many of our improvements are tied to new features in Cassandra 2.2 or Cassandra 3.0:

  • Support for schema metadata for user-defined functions (UDFs), user-defined aggregates (UDAs), materialized views, and indexes:
            # Print out the names of all materialized views in the "simplex" keyspace and the tables they apply to
            keyspace = cluster.keyspace('simplex')
            keyspace.materialized_views.each do |mv|
              puts "#{mv.name}  --  #{mv.base_table.name}"
            end
    
            # Print out the cql for each index on the "my_table" table
            keyspace.table('my_table').indexes.each do |index|
              puts index.to_cql
            end
            

    For more schema metadata examples, check out our usage docs. You may also want to read these blog posts for UDFs, UDAs, and materialized views.

  • Augment the Cassandra::Table object to expose many more attributes: id, options, keyspace, partition_key, clustering_columns, and clustering_order. This makes it significantly easier to write administration scripts that report various attributes of your schema, which may help to highlight areas for improvement.
  • Add support for smallint, tinyint, date (Cassandra::Date) and time (Cassandra::Time) data types:
            session.execute("CREATE TABLE IF NOT EXISTS my_table (f1 int PRIMARY KEY, duration time)")
            # Calculate the duration of some operation and record it.
            start = Time.now
            # ... perform the operation being timed ...
            duration = Time.now - start
    
            insert = session.prepare('INSERT INTO simplex.t9 (f1, duration) values (?, ?)')
    
            # Cassandra::Time takes a nano-seconds arg. So we convert our duration to (int) nanosecond precision.
            session.execute(insert, arguments: [1, Cassandra::Time.new((duration * 1_000_000_000).to_i)])
    
            # Retrieve the row with id 1.
            result = session.execute('select * from simplex.t9 where f1=1').first
            puts result['duration'].class # Cassandra::Time
            puts "#{result['duration'].seconds} seconds elapsed to do the op"
            
  • Allow the user to choose the protocol version for communication with Cassandra nodes. The driver has always auto-negotiated the highest protocol version supported by all nodes, but in the unlikely event it makes an incorrect choice, you can now configure it yourself:
            cluster = Cassandra.cluster(protocol_version: 3)
            

    Valid values are in the range 1-4; an ArgumentError is raised otherwise. If a node does not support the specified protocol version, it will be considered an unavailable node. For homogeneous clusters, this leads to a Cassandra::Errors::NoHostsAvailable or Cassandra::Errors::ProtocolError error depending on which version of Cassandra is involved. For mixed-version clusters, it limits coordinator nodes to those that support the specified protocol version.

  • Support the ReadError, WriteError, and FunctionCallError Cassandra error responses introduced in Cassandra 2.2.
  • Add support for unset variables in bound statements. In versions of Cassandra prior to 2.2, all variables in a statement had to be bound; an exception was thrown otherwise. Cassandra 2.2 introduced the ability to issue statements without binding all parameters, and those parameters will simply be null when inserting a new row or remain unchanged when updating an existing row:
            session.execute("CREATE TABLE IF NOT EXISTS my_table (k int PRIMARY KEY, v0 int, v1 int)")
            insert = session.prepare("INSERT INTO my_table (k, v0, v1) VALUES (?, ?, ?)")
            session.execute(insert, arguments: {k: 0, v0: 1})  # k==0, v0==1, v1==null
            session.execute(insert, arguments: {k: 0, v1: 2})  # k==0, v0==1, v1==2
            session.execute(insert, arguments: {k: 0, v0: 3, v1: Cassandra::NOT_SET})  # k==0, v0==3, v1==2
            
  • Support DSE security (DseAuthenticator, configured for LDAP). Previously, only the PasswordAuthenticator mechanism built into Cassandra was supported.
  • Expose server warnings on server exceptions and Cassandra::Execution::Info instances. Cassandra 2.2 and later send warnings along with server responses. These can contain useful information such as the batch being too large, too many tombstones being read, etc.
            begin
              results = session.execute('select * from simplex.t9 where f1=1')
              puts results.execution_info.warnings.join("\n")
            rescue Cassandra::Errors::ExecutionError => e
              puts e.execution_info.warnings.join("\n")
            end
            
  • Support sending custom payloads when preparing or executing statements and expose custom payloads received with responses on server exceptions and Cassandra::Execution::Info instances. Custom payloads combine with custom query handlers in Cassandra to extend Cassandra functionality. Learn more about this advanced feature in the Java driver documentation and our usage docs.

Other improvements

We’ve also made several general improvements in the driver to augment usability and performance:

  • Add Cassandra::Logger class to make it easy for users to enable debug logging in the client. This logger reports thread-id’s and timestamps in addition to debug messages making it possible to track down issues in complex multi-threaded applications. You can also use the logger instance in your own application logic; it conforms to the standard Ruby Logger interface.
            cluster = Cassandra.cluster(logger: Cassandra::Logger.new($stderr))
            
  • Default request timeout increased from 10 seconds to 12 seconds. This change was made because by default nodes have a 10 second request processing timeout, and we want to give the client a chance to receive the server error (which may be more specific); two seconds of buffer accomplishes that.
  • Add a timeout option to Cassandra::Future#get. This allows the user to bound how long he is willing to wait for a result, independent of the request timeout. This is useful if you want to “check up on” an asynchronously running request but not abort it if the response is not ready:
            f = session.execute_async('some query')
            result = nil
            begin
              # Wait up to 5 seconds for the result
              result = f.get(5)
            rescue Cassandra::Errors::TimeoutError
              # Ok, so we still don't have a result. Do some other work.
              other_task(some_arg)
    
              # Now, we really do need that result...
              result = f.get
            end
    
            # Do something with the result...
            
  • Add connections_per_local_node, connections_per_remote_node, and requests_per_connection cluster configuration options to tune parallel query execution and resource usage. Prior to Cassandra 2.1, the protocol supported a maximum of 1024 concurrent requests on one connection. Thus, the driver would set up two connections to each local node in the cluster. Cassandra 2.1 introduced the v3 native protocol, which supports 32,768 concurrent requests on one connection. This alleviated the need to have multiple connections to a node, except for one issue: query handlers processing concurrent requests on a connection do not necessarily use multiple threads; that decision depends on the idempotency of requests and potentially other factors. So, it’s possible that request processing overserializes in a node. Having multiple connections alleviates that. Depending on workload, the user can tune these parameters to optimize response times. NOTE: the driver defaults these settings as follows:
    Setting                        v2 and earlier   v3 and later
    connections_per_local_node     2                1
    connections_per_remote_node    1                1
    requests_per_connection        128              1024
  • Support specifying statement idempotence with the new :idempotent option to the Cassandra::Session methods execute, execute_async, prepare, and prepare_async. We also support specifying idempotence when constructing simple statements (Cassandra::Statements::Simple). This helps the driver decide if it’s safe to retry requests for certain types of failures.
  • Add new retry policy decision Cassandra::Retry::Policy#try_next_host. This decision is made in the default retry policy in the following cases:
    • There was a read timeout and either/both the consistency requirement was not met or some data had been received from the replica that was asked when the timeout occurred.
    • There was a write timeout for an idempotent statement and no replica has received the request.
    • Enough nodes were not available to reach the desired consistency level, causing an “unavailable” error.

    Previously, under the above conditions, the error was propagated to the application. This new behavior provides an opportunity to retry the request and avoid the error, making the system more resilient.

Bug Fixes

A large number of defects have been remedied as well. Here are some of the critical fixes:

  • RUBY-120: Tuples and UDTs can be used in sets and hash keys.
  • RUBY-143: Retry querying system table for metadata of new hosts when prior attempts fail, ultimately enabling use of new hosts.
  • RUBY-150: Fixed a protocol decoding error that occurred when multiple messages are available in a stream. This primarily occurred when the system was under load.
  • RUBY-151: Decode incomplete UDTs properly.
  • RUBY-155: Request timeout timer should not include request queuing time. For high volume clients, asynchronously executing requests can queue in the client until the node is ready. Since the amount of time spent in the queue can vary dramatically over time, the request timeout had to be set conservatively to account for the worst case queue time. In 3.0, the request timeout timer encapsulates the time from when the request is ready to write on the wire to the time the response is received.
  • RUBY-161: Protocol version negotiation in mixed version clusters should not fall back to v1 unless it is truly warranted.
  • RUBY-214: Ensure client timestamps have microsecond precision in JRuby. Previously, some row updates would get lost in high transaction environments.

Breaking Changes

As this is a major release packed with new features, some existing api’s needed to adjust to accommodate them. In addition, some behaviors that proved to be inconsistent with other drivers or behaviors that were otherwise undesirable were also rectified in this release:

  • The Datacenter-aware load balancing policy (Cassandra::LoadBalancing::Policies::DCAwareRoundRobin) defaults to using nodes in the local DC only. In prior releases, the policy would fall back to remote nodes after exhausting local nodes. In this release, a statement may error out. Specify a positive value (or nil for unlimited) for max_remote_hosts_to_use when initializing the policy to allow remote node use:
            lbp = Cassandra::LoadBalancing::Policies::DCAwareRoundRobin.new(nil)
            cluster = Cassandra.cluster(load_balancing_policy: lbp)
            

    Note: This only applies if you explicitly specify a load-balancing policy in your application. The load balancing policy initialized by default has historically been configured with max_remote_hosts_to_use set to 0, so this change has no impact on applications that do not specify a load-balancing policy.

  • Cassandra::Future#join is now an alias to Cassandra::Future#get and will raise an error if the future is resolved with one. Previously, join would not block and return nil unless the future resolved.
  • Unspecified variables in statements previously resulted in an exception. Now they are essentially ignored or treated as null. See above for details.
  • Default consistency level is now LOCAL_ONE. Previously, the default was ONE, meaning that a remote node could meet the consistency requirement.
  • Enable tcp no-delay by default. This reduces the time for a request to be sent to a node because the OS doesn’t wait for more pending data before sending out a TCP packet. However, because a larger number of smaller packets are sent on the network, there is a greater possibility of packet collision and TCP retries. If congestion proves to be too costly in your environment, turn off tcp_nodelay in cluster options.
  • Unavailable errors are retried on the next host in the load balancing plan by default. This should make the system more resilient to errors.
  • Statement execution is no longer retried on timeouts, unless the statement is declared idempotent (either in a statement object or as an option to an execute method). Previously, unsafe (non-idempotent) statements were retried on timeouts, potentially causing data corruption. Now statements are only retried if they are idempotent and the error indicates that the statement may succeed if retried.
  • As of driver v2.1, query arguments must be specified in the :arguments option to the Cassandra::Session#execute* and Cassandra::Batch#add methods.
            # pre 2.1 driver
            session.execute('SELECT COUNT(*) FROM simplex.foo WHERE f1 = ?', [7], consistency: :one)
            batch.add('SELECT COUNT(*) FROM simplex.foo WHERE f1 = ?', [7], [Cassandra::Types.int])
    
            # v2.1 of the driver and later
            session.execute('SELECT COUNT(*) FROM simplex.foo WHERE f1 = ?', arguments: [7], consistency: :one)
            batch.add('SELECT COUNT(*) FROM simplex.foo WHERE f1 = ?', arguments: [7], type_hints: [Cassandra::Types.int])
            

    You must update your applications accordingly.

Deprecated Items

As the driver has evolved, it has grown a little long in the tooth in certain areas.

MRI 1.9.3, 2.0.x, 2.1.x, and Rubinius

MRI 1.9.3 and 2.0.x are no longer maintained by the Ruby community, even for security patches. The 2.1.x line is only open to security patches. The maintainers highly recommend users upgrade to Ruby 2.2.x or Ruby 2.3.x.

Rubinius has proven to be unstable in our testing, frequently dumping core.

We are therefore phasing out support for these older Rubies in upcoming releases of the Ruby driver. If this is an issue for you, please post a message on the ruby driver mailing list.

Cassandra 1.2.x

Cassandra 1.2.x is no longer being actively maintained, and DataStax no longer tests the Ruby driver against that version. The driver still has all of the necessary functionality to communicate with Cassandra 1.2.x and there are no concrete plans to remove that functionality. However, as we no longer test the driver against Cassandra 1.2.x, there is potential for breaking changes that we are unaware of.

Please consider upgrading your Cassandra cluster to the latest 3.x version to benefit from the latest features and fixes.

Getting the driver

The new driver gems are available on rubygems.org, so just update your Gemfile, bundle install, and you’re all set.

For more code samples showcasing all of our features, check out our usage docs.

Enjoy!

DataStax C/C++ Driver: 2.4 GA released!


We are excited to announce the 2.4 GA release of the C/C++ driver for Apache Cassandra. This release brings with it a couple of new features.

What’s new

Custom Authenticator (SASL)

In previous releases the driver only supported plain text authentication with Cassandra (using cass_cluster_set_credentials()), but it is now possible to implement a custom authenticator. This is useful for integrating more complex authentication systems such as Kerberos. An example implementing a custom plain text authenticator is included with this release.

Hostname Resolution with SSL support

Until this release, SSL only supported peer identity verification using the IP address of Cassandra nodes. This release adds the ability to resolve the domain name of Cassandra nodes using reverse DNS (PTR records), which enables peer identity verification using the domain name present in the peer certificate’s common name (CN) or subject alternative names (SAN). More information can be found in the driver’s SSL documentation.

Hostname resolution (reverse DNS) is enabled when constructing a cluster object.

/* Enable reverse DNS */
cass_cluster_set_use_hostname_resolution(cluster, cass_true);

Use the SSL verify flags to enable peer identity verification using the peer certificate’s domain name.

/* Or use: CASS_SSL_VERIFY_PEER_IDENTITY_DNS (domain name) */
cass_ssl_set_verify_flags(ssl, CASS_SSL_VERIFY_PEER_CERT | CASS_SSL_VERIFY_PEER_IDENTITY_DNS);

Feedback

More detailed information about all the features, improvements and fixes included in this release can be found in the changelog. Let us know what you think about the release; your involvement is important to us, and it influences which features we prioritize.

DataStax Enterprise 5.0.0 Released


Today we released DataStax Enterprise 5.0.0 which introduces many new features with DSE Graph being the largest one. To find out more about DSE Graph, please check out http://www.datastax.com/products/datastax-enterprise-graph.

Also check out http://www.datastax.com/products/datastax-enterprise for more information on many of the other new features such as Advanced Replication, Multi-Instance Deployment, Tiered Storage and improved Security.


New DataStax Enterprise Drivers Available Now


Today, we are thrilled to introduce a new set of DataStax Enterprise Drivers built on top of the widely used open source DataStax drivers for Apache Cassandra™ and enhanced to ease the development of cloud applications powered by DataStax Enterprise.

DataStax Enterprise (DSE) accelerates your ability to deliver real-time value at epic scale by providing a unique always-on architecture. DSE supports mixed workload management with its Transactional, Search, and Analytics components, as well as adaptive data management (multi-model capabilities), with all data (tabular, graph, JSON, and key-value) stored in Cassandra. This provides developers the distributed, responsive and intelligent foundation needed to build and run cloud applications.

With the new DataStax Enterprise Drivers, developers will benefit from a unified and cohesive development experience when using the adaptive data management and mixed workload capabilities provided in DataStax Enterprise 5.0 in order to build incredibly fast, continuously available and distributed applications.

DataStax Enterprise Drivers for C#, Java, Node.js, Python (with more to follow shortly) are available now. Full disclosure, while the source code for the DataStax Enterprise Drivers is published on GitHub, the license of these drivers allows their usage only in conjunction with the use of DataStax Enterprise software.  Read on to learn more about what’s new, when and how to upgrade, and the support provided for DataStax Enterprise Graph.

What’s New

DataStax Enterprise Drivers rely on the latest version of our drivers for Cassandra to provide full compatibility with Apache Cassandra 3.0 while also maintaining compatibility with previous versions. Additionally, they feature support for the new unified authentication and also geospatial types introduced in DataStax Enterprise 5.0:

// Unified authentication
import com.datastax.driver.dse.auth.DseGSSAPIAuthProvider;

DseCluster dseCluster = DseCluster.builder()
        .addContactPoint("127.0.0.1")
        .withAuthProvider(new DseGSSAPIAuthProvider())
        .build();

// Geospatial types
import com.datastax.driver.dse.geometry.Point;

Row row = dseSession.execute("SELECT coords FROM points_of_interest WHERE name =
'Eiffel Tower'").one();
Point coords = row.get("coords", Point.class);

dseSession.execute("INSERT INTO points_of_interest (name, coords) VALUES (?, ?)",
        "Washington Monument", new Point(38.8895, 77.0352));

Last, but definitely not least, DataStax Enterprise Drivers have added support for DSE Graph, which is covered in detail below.

When to Upgrade

You want to upgrade to the new DSE drivers if:

  1. You want to take full advantage of the new features introduced in DataStax Enterprise 5.0.
  2. You plan to use DataStax Enterprise Graph for learning or building exciting new cloud applications that use graph. DataStax Enterprise Drivers expand the familiar query API to support Gremlin queries.

How to Upgrade

Upgrading to the DataStax Enterprise Drivers should be a fairly painless process:

  1. Pull the new dependency through your favorite dependency manager. The new drivers are available through the usual channels: Nuget, Maven, Npm, pip, etc.
  2. Update the corresponding imports and the initialization of cluster and session to use DseCluster and DseSession:
// old code
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

Cluster cluster = null;
try {
   cluster = Cluster.builder()
           .addContactPoint("127.0.0.1")
           .build();
   Session session = cluster.connect();

   Row row = dseSession.execute("select release_version from system.local").one();
   System.out.println(row.getString("release_version"));
} 
finally {
   if (cluster != null) cluster.close();
}

// new code
import com.datastax.driver.dse.DseCluster;
import com.datastax.driver.dse.DseSession;

DseCluster dseCluster = null;
try {
   dseCluster = DseCluster.builder()
           .addContactPoint("127.0.0.1")
           .build();
   DseSession dseSession = dseCluster.connect();

   Row row = dseSession.execute("select release_version from system.local").one();
   System.out.println(row.getString("release_version"));
} 
finally {
   if (dseCluster != null) dseCluster.close();
}

If you are not yet using the latest version of the DataStax drivers for Apache Cassandra, there might be other changes; the upgrade guides for each driver provide all the necessary details.

DataStax Enterprise Graph Support

DSE Graph Gremlin queries are routed intelligently inside the DSE cluster, with query results that often differ from the tabular format of standard CQL queries. We have enhanced the API exposed through the drivers to work seamlessly with both CQL and DSE Graph, while maintaining a coherent API and providing the same level of smart features underneath that you’ve gotten used to (e.g. connection pooling, smart request routing, retry policies, etc.).

To give you a glimpse at how easy it is to work with Gremlin through the DataStax Enterprise Drivers, check out the sample code below:

GraphNode r = dseSession.executeGraph("g.V().hasLabel('test_vertex')").one();
Vertex vertex = r.asVertex();

You’ll find more details about using DSE Graph with the DataStax Enterprise Drivers in the official docs.

A Coherent Set of Features and API in the DataStax Drivers for Apache Cassandra and DataStax Enterprise

Before closing, it’s worth emphasizing that while all the code snippets included in this article are in Java, you’ll find the other drivers exposing a similar API which remains consistent with the API and features we have provided through our drivers for Apache Cassandra. And yes, we will continue to develop and improve the drivers for Apache Cassandra as they represent the core building blocks on which DataStax Enterprise Drivers are built.

We look forward to learning about the amazingly smart cloud applications you are building using DataStax Enterprise and how the new DataStax Enterprise Drivers are helping you. Please leave a comment with your feedback or questions.

Python DataStax Enterprise Driver 1.0 and Driver 3.5.0 with Execution Profiles

Last week we released a new Python DSE Driver 1.0.0 in conjunction with DataStax Enterprise 5.0. The DSE driver builds on the existing DataStax Python Driver for Apache Cassandra, adding support for DSE-specific data types, authentication mechanisms, and graph query execution. In this post I will introduce the new DSE driver features, and discuss a new Execution Profiles API introduced in the core driver 3.5.0, which we also released last week.

DataStax Enterprise Python Driver

The DSE Python Driver is a new package that depends on the core driver. The source repository is on github, and the source distribution is published as cassandra-driver-dse.

The driver documentation contains information and examples for all the additional features provided on top of the core driver functionality. Please refer to these pages for DSE features, including authentication, geometric types, and graph request execution. There is also an installation page, and an upgrade guide for adopting the DSE driver where the core driver was used previously. Most applications can upgrade by simply changing a package import. Those using custom load balancing configuration, timeouts, or certain execution parameters will need to know about Execution Profiles, a new feature in the core 3.5.0 release. That feature is introduced below. More detail about upgrading can be found in the upgrade guide.

Execution Profiles

Execution Profiles is introduced as follows in the documentation:

Execution profiles are an experimental API aimed at making it easier to execute requests in different ways within a single connected Session. Execution profiles are being introduced to deal with the exploding number of configuration options, especially as the database platform evolves more complex workloads.

The Execution Profile API is being introduced now, in an experimental capacity, in order to take advantage of it in existing projects, and to gauge interest and feedback in the community. For now, the legacy configuration remains intact, but legacy and Execution Profile APIs cannot be used simultaneously on the same client Cluster.

Execution profiles provide more flexibility for configuring request execution parameters without creating multiple Clusters and Sessions. For a very simple example, we could define an alternate profile that has a longer timeout and returns dicts instead of the default namedtuples.

from cassandra.cluster import Cluster, ExecutionProfile
from cassandra.query import dict_factory

cluster = Cluster(execution_profiles={'df': ExecutionProfile(request_timeout=30.0, row_factory=dict_factory)})
session = cluster.connect()

session.execute('SELECT rpc_address FROM system.local')[0]  # uses default profile
#    Row(rpc_address='127.0.0.1')

session.execute('SELECT rpc_address FROM system.local', execution_profile='df')[0]  # uses named profile
#    {u'rpc_address': '127.0.0.1'}

Another more interesting pattern is to define different load balancing, for example to target statements to different datacenters — possibly serving different workloads. Here is an example configuring the default profile to target one datacenter (with other parameters defaulted), and another profile to target a separate datacenter with some alternative parameters:

from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy
from cassandra.query import tuple_factory

ep1 = ExecutionProfile(load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy(local_dc='dc1')))
ep2 = ExecutionProfile(load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy(local_dc='dc2')),
                       row_factory=tuple_factory, request_timeout=None)  # target dc2, return tuples, never timeout
session = Cluster(execution_profiles={EXEC_PROFILE_DEFAULT: ep1, 'other-dc': ep2}).connect()

# cluster topology
{(h.address, h.datacenter) for h in session.cluster.metadata.all_hosts()}
#    {('127.0.0.1', u'dc1'), ('127.0.0.2', u'dc1'), ('127.0.0.3', u'dc2')}

# default profile cycles between nodes in 'dc1'
session.execute('SELECT rpc_address, data_center FROM system.local')[0]  
#    Row(rpc_address='127.0.0.1', data_center=u'dc1')

session.execute('SELECT rpc_address, data_center FROM system.local')[0]
#    Row(rpc_address='127.0.0.2', data_center=u'dc1')

session.execute('SELECT rpc_address, data_center FROM system.local')[0]
#    Row(rpc_address='127.0.0.1', data_center=u'dc1')

session.execute('SELECT rpc_address, data_center FROM system.local')[0]
#    Row(rpc_address='127.0.0.2', data_center=u'dc1')

# other profile is pinned to 'dc2'
session.execute('SELECT rpc_address, data_center FROM system.local', execution_profile='other-dc')[0]
#    ('127.0.0.3', u'dc2')

session.execute('SELECT rpc_address, data_center FROM system.local', execution_profile='other-dc')[0]
#    ('127.0.0.3', u'dc2')

session.execute('SELECT rpc_address, data_center FROM system.local', execution_profile='other-dc')[0]
#    ('127.0.0.3', u'dc2')

There are plenty of details and more ways to use this in the overview document.

We are eager to hear any feedback from community members who elect to try this API. If it proves useful, the ultimate goal would be to retire the legacy execution parameters and commit to Execution Profiles.

Wrap

As always, thanks to all who provided contributions and bug reports. The continued involvement of the community is appreciated:

DataStax Enterprise Java Driver: 1.0.0 released!

A few days ago, we announced the release of DataStax Enterprise (DSE) 5.0 and, shortly after that, the general availability of new dedicated drivers.

In this post we are going to focus on the DataStax Enterprise Java Driver 1.0.0.

Overview

The DataStax Enterprise Java Driver 1.0.0 is built on top of the “core” driver and supports additional features provided by DSE 5.0, such as Unified Authentication, Geospatial types, and Graph (more on these below).

From an API perspective, the most noticeable additions in the DataStax Enterprise Java Driver are the dedicated Cluster and Session wrappers, namely, DseCluster and DseSession.

These specialized versions have a few additional capabilities (but they will certainly be enhanced with many more DSE-specific features yet to come):

  1. DseSession can execute graph statements (see below);
  2. DseCluster will by default enable geospatial types (see below);
  3. DseCluster will by default use a special load balancing policy, DseLoadBalancingPolicy, which optimizes query plans for Graph OLAP queries.

Below is an example of how to create a DseCluster and a DseSession:

import com.datastax.driver.dse.DseCluster;
import com.datastax.driver.dse.DseSession;
 
DseCluster dseCluster = null;
try {
   dseCluster = DseCluster.builder()
           .addContactPoint("127.0.0.1")
           .build();
   DseSession dseSession = dseCluster.connect();
 
   Row row = dseSession.execute("select release_version from system.local").one();
   System.out.println(row.getString("release_version"));
} finally {
   if (dseCluster != null) dseCluster.close();
}

Upgrading from older DSE drivers

For previous versions of DSE, specific extensions were published as a sub-module of the core Java driver, under the coordinates com.datastax.cassandra:cassandra-driver-dse.

These extensions are now deprecated and will not be maintained anymore. Starting with DSE 5.0, they are superseded by the DataStax Enterprise Java Driver 1.0.0, which is now a standalone project, published under new coordinates, com.datastax.cassandra:dse-driver, with independent versioning. It is also compatible with older versions of DSE.
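
In Maven terms, the coordinate change looks roughly like this (a sketch only; the version of the legacy artifact is whatever your project used previously):

<!-- old, deprecated extensions (sub-module of the core driver) -->
<dependency>
  <groupId>com.datastax.cassandra</groupId>
  <artifactId>cassandra-driver-dse</artifactId>
  <version><!-- your previous driver version --></version>
</dependency>

<!-- new, standalone DataStax Enterprise Java Driver -->
<dependency>
  <groupId>com.datastax.cassandra</groupId>
  <artifactId>dse-driver</artifactId>
  <version>1.0.0</version>
</dependency>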

If you were using the legacy driver extensions for DSE, you should now migrate your existing applications to use DataStax Enterprise Java Driver 1.0.0. Please read the Upgrade guide for further instructions.

DSE Unified Authentication

For clients connecting to a DSE cluster secured with DSE Unified Authentication, two authentication providers are included:

  1. DsePlainTextAuthProvider: plain-text authentication;
  2. DseGSSAPIAuthProvider: GSSAPI authentication.

Here is an example of DSE authentication using DseGSSAPIAuthProvider:

import com.datastax.driver.dse.auth.DseGSSAPIAuthProvider;
 
DseCluster dseCluster = DseCluster.builder()
        .addContactPoint("127.0.0.1")
        .withAuthProvider(new DseGSSAPIAuthProvider())
        .build();
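
For username/password authentication, DsePlainTextAuthProvider can be configured in much the same way (a minimal sketch; the credentials below are placeholders):

import com.datastax.driver.dse.auth.DsePlainTextAuthProvider;

DseCluster dseCluster = DseCluster.builder()
        .addContactPoint("127.0.0.1")
        .withAuthProvider(new DsePlainTextAuthProvider("username", "password"))
        .build();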

See the javadocs of each implementation for more details.

Geospatial types

DataStax Enterprise 5.0 comes with a set of additional types to represent geospatial data: PointType, LineStringType, and PolygonType. Here is an example of a table containing a column of type PointType:

CREATE TABLE points_of_interest(name text PRIMARY KEY, coords 'PointType');

The CQL literal representing a geospatial type is simply its Well-known Text (WKT) form. Inserting a row into the table above using plain CQL is as easy as:

INSERT INTO points_of_interest (name, coords) VALUES ('Eiffel Tower', 'POINT(48.8582 2.2945)');

Of course, you are not limited to string literals to manipulate geospatial types; the DataStax Enterprise Java Driver 1.0.0 also includes Java representations of these types, which can be sent as query parameters, or retrieved back from query results:

import com.datastax.driver.dse.geometry.Point;

Row row = dseSession.execute("SELECT coords FROM points_of_interest WHERE name = 'Eiffel Tower'").one();
Point coords = row.get("coords", Point.class);
System.out.println(coords.X());

dseSession.execute("INSERT INTO points_of_interest (name, coords) VALUES (?, ?)", 
    "Washington Monument", new Point(38.8895, 77.0352));

Please refer to both the DSE 5.0 Geospatial Search documentation and the driver documentation on geospatial types for more details.

DSE Graph

The DataStax Enterprise Java Driver 1.0.0 now supports two different query languages: CQL for interacting with standard Cassandra tables, and Gremlin, the graph traversal language that allows you to interact with DSE Graph.

DseSession has dedicated methods to execute graph queries. These methods accept GraphStatement instances and return a GraphResultSet, which, in turn, can be seen as a tree-like structure where each node is a GraphNode.

Here is a simple example of some graph queries:

import com.datastax.driver.dse.graph.GraphStatement;
import com.datastax.driver.dse.graph.SimpleGraphStatement;

GraphStatement s1 = new SimpleGraphStatement("g.addV(label, 'test_vertex')");
dseSession.executeGraph(s1);

GraphResultSet rs = dseSession.executeGraph("g.V().hasLabel('test_vertex')");
GraphNode n = rs.one();
Vertex vertex = n.asVertex();

Please refer to both the DSE 5.0 Graph documentation and the driver documentation on graph for further information about graph queries.

Getting the driver

The DataStax Enterprise Java Driver binaries are available from Maven Central. If you use Maven, simply add the following dependency to your project:

<dependency>
  <groupId>com.datastax.cassandra</groupId>
  <artifactId>dse-driver</artifactId>
  <version>1.0.0</version>
</dependency>

Beware that the DataStax Enterprise Java Driver is published under specific license terms that allow its usage solely in conjunction with DataStax Enterprise software.

Documentation for the DataStax Enterprise Java Driver can be found at the following locations:

Note that, for convenience, the API reference includes combined javadocs for both the DataStax Enterprise Java Driver and the core Java Driver for Apache Cassandra.

New C# Driver for DataStax Enterprise

Last week, we released DataStax Enterprise (DSE) 5.0 and, to support the additional features of the platform, we announced the availability of a new set of drivers.

The DataStax Enterprise C# Driver is a new package that depends on the “core” C# driver for Apache Cassandra. It exposes new dedicated interfaces, IDseCluster and IDseSession, which inherit from the ICluster and ISession interfaces respectively, and provide additional capabilities such as DSE Graph support, Unified Authentication, and Geospatial types.

Upgrading from the Core Driver

To upgrade from CassandraCSharpDriver, change your dependency to the new Dse package and add the Dse namespace:

using Dse;

Then create IDseCluster and IDseSession instances:

IDseCluster cluster = DseCluster.Builder()
    .AddContactPoint("127.0.0.1")
    .Build();
IDseSession session = cluster.Connect();

Unified Authentication

For clients connecting to a cluster secured with DSE Unified Authentication, two authentication providers are included:

  1. DsePlainTextAuthProvider: plain-text authentication;
  2. DseGssapiAuthProvider: GSSAPI authentication.

You can specify the authentication provider when initializing the cluster:

using Dse;
using Dse.Auth;
IDseCluster dseCluster = DseCluster.Builder()
    .AddContactPoint("127.0.0.1")
    .WithAuthProvider(new DseGssapiAuthProvider())
    .Build();

See the API docs for DsePlainTextAuthProvider and DseGssapiAuthProvider for more information.

DSE Graph Support

The DSE driver supports two different query languages: CQL for interacting with Cassandra tables, and Gremlin, the graph traversal language that allows you to interact with DSE Graph.

IDseSession includes the ExecuteGraph() method to execute graph queries:

GraphResultSet rs = session.ExecuteGraph("g.V()");
Vertex vertex = rs.First();
Console.WriteLine(vertex.Label);

There is also a non-blocking method counterpart, ExecuteGraphAsync() which returns a Task<GraphResultSet> that can be awaited on.

Check out the DSE 5.0 Graph documentation for more information on DSE Graph / Gremlin, and the driver documentation on graph for further information about executing graph queries and setting graph options.

Geospatial Types

DataStax Enterprise 5.0 comes with a set of additional CQL types to represent geospatial data: PointType, LineStringType and PolygonType.

cqlsh> CREATE TABLE points_of_interest(name text PRIMARY KEY, coords 'PointType');
cqlsh> INSERT INTO points_of_interest (name, coords) VALUES ('Eiffel Tower', 'POINT(48.8582 2.2945)');

The DSE driver includes C# representations of these types in the Dse.Geometry namespace that can be used directly as parameters in queries. All C# geospatial types implement ToString(), which returns the string representation in Well-known Text (WKT) format.

using Dse.Geometry;
Row row = session.Execute("SELECT coords FROM points_of_interest WHERE name = 'Eiffel Tower'").First();
Point coords = row.GetValue<Point>("coords");

var statement = new SimpleStatement("INSERT INTO points_of_interest (name, coords) VALUES (?, ?)",
    "Washington Monument", 
    new Point(38.8895, 77.0352));
session.Execute(statement);

See the driver documentation on geospatial types for more details.

Looking Forward

We will continue to develop and improve the C# driver for Apache Cassandra, as it represents the core building block on which the DataStax Enterprise C# Driver is built, and we will also continue to add DSE-specific functionality in this new driver.

To provide feedback use the following:

The DSE driver is available on Nuget.org.

Note that the DataStax Enterprise C# Driver is published under specific license terms that allow its usage only in conjunction with DataStax Enterprise software.



Master DataStax Graph with DataStax Studio

DataStax Enterprise 5.0 introduces DSE Graph, the first massively scalable and high performance graph database, as part of the adaptive data model support required today by all modern cloud applications.

Together with OpsCenter 6.0 and the new DataStax Enterprise Drivers, we are thrilled to release the first version of DataStax Studio, an intuitive web-based application for querying, exploring, analyzing, and visualizing graph data. DataStax Studio has been designed from the ground up to satisfy the needs of architects, developers, and data scientists using DSE Graph.

DataStax Studio is a feature-rich and friendly learning tool that will help you quickly become a DSE Graph expert. DataStax Studio is also a super productive environment for seasoned graph developers and data analysts that will help visually uncover the information needed to make critical operational and business decisions.

Without further ado, here is a sneak peek at the features of DataStax Studio that I’m most excited about.

The most advanced Gremlin editor for DSE Graph

Learning new APIs is part of our daily jobs, but becoming proficient in and then mastering a fairly large domain-specific language like Gremlin takes time. A smart editor can help a lot in this journey by making intelligent suggestions at every step and ensuring that the queries you are writing are error free. DataStax Studio comes with an intelligent code editor for Gremlin that offers code completions, schema completions, and a wide range of validations. We will soon publish an in-depth article about the features of the Gremlin editor, but until then, I can tell you that it’s amazing:

Gremlin Code Editor

Whiteboard your Graph Schema

Knowing the schema of your DSE graph at all times can be very helpful for ensuring the correctness and efficiency of your queries, and also for exploring and validating different hypotheses about your data. DataStax Studio provides a whiteboard-like view of your graph vertices, edges, and properties, revealing the graph organization and connections:

Schema+Editor+Graph

Query profiler

Writing correct queries is just the first step. Next comes making sure that these queries are as efficient and performant as required. This is where the profiler view included in DataStax Studio helps by providing you with the execution details of a query:

Query Profiler view

Visualize Graph data the way you want

Once your error-free queries are in place, it’s time to see and work with the results. DataStax Studio comes with support for numerous output formats and visualizations. First come the grid, JSON, and plain text viewers; while not as visually appealing as their siblings, they are still useful:

Results - Grid

Next, DataStax Studio also supports a series of commonly used charts: pie, bar, line, area, and scatter.

Results - Bar chart

Last, and definitely the nicest looking, is the graph view:

Results - Graph detail

The graph view allows you to click around to see details about vertices and edges, zoom in and out, and drag elements around to get the best view of your DSE graph data.

Try DataStax Studio 1.0 Today

Summarizing, DataStax Studio 1.0 provides:

  1. The most intelligent Gremlin editor on the market
  2. A rich set of data visualizations and numerous output formats that can be used to surface insight from data and present it in the most telling and beautiful formats
  3. An interaction model that makes it a great solution for anything from learning to creating amazing self-documented, executable data-driven documents that uncover information needed to make critical business and operational decisions

There are many other exciting features in DataStax Studio, but I’d encourage you to download and try it yourself.  We are looking forward to hearing your feedback at studio-feedback@datastax.com.

Node.js Driver for DataStax Enterprise

Following the release of DataStax Enterprise (DSE) 5.0, we announced the availability of a new set of drivers specifically designed for the platform.

The DataStax Enterprise Node.js Driver is built on top of the “core” Node.js driver for Apache Cassandra and supports additional features added by that platform, like DSE Graph, Geospatial types and Unified Authentication.

The most noticeable change in the DSE driver API is the Client prototype which extends the core driver counterpart and exposes a few additional capabilities:

  • It can execute graph statements
  • It can represent geospatial data
  • It uses a special load balancing policy by default, DseLoadBalancingPolicy
  • It uses Execution profiles to better handle mixed workloads

Upgrading from the Core Driver

Upgrading from the cassandra-driver can be as simple as changing the import statement to point to the dse package:

const cassandra = require('cassandra-driver');
const client = new cassandra.Client({ contactPoints: ['host1'] });

Becomes:

const dse = require('dse-driver');
const client = new dse.Client({ contactPoints: ['host1'] });

The DSE driver module also exports the same submodules as the core driver.

Unified Authentication

For clients connecting to a cluster secured with DSE Unified Authentication, two authentication providers are included:

  • DsePlainTextAuthProvider: plain-text authentication;
  • DseGssapiAuthProvider: GSSAPI authentication.

You can set the authentication provider when creating the Client instance:

const dse = require('dse-driver');
const client = new dse.Client({
  contactPoints: ['h1', 'h2'], 
  keyspace: 'ks1',
  authProvider: new dse.auth.DseGssapiAuthProvider()
});

See the API docs for DseGssapiAuthProvider and DsePlainTextAuthProvider for more information.

DSE Graph Support

The DSE driver supports two different query languages: CQL for interacting with Cassandra tables, and Gremlin, the graph traversal language that allows you to interact with DSE Graph.

The Client prototype includes the executeGraph() method to execute graph queries:

client.executeGraph('g.V()', function (err, result) {
  assert.ifError(err);
  const vertex = result.first();
  console.log(vertex.label);
});

Check out the DSE 5.0 Graph documentation for more information on DSE Graph / Gremlin, and the driver documentation on graph for further information about executing graph queries and setting graph options.

Geospatial Types

DataStax Enterprise 5.0 comes with a set of additional CQL types to represent geospatial data: PointType, LineStringType and PolygonType.

cqlsh> CREATE TABLE points_of_interest(name text PRIMARY KEY, coords 'PointType');
cqlsh> INSERT INTO points_of_interest (name, coords) VALUES ('Eiffel Tower', 'POINT(48.8582 2.2945)');

The DSE driver includes encoders and representations of these types in the geometry module that can be used directly as parameters in queries. All JavaScript geospatial types implement toString(), which returns the string representation in Well-known Text (WKT) format, and toJSON(), which returns the JSON representation in GeoJSON format.

const dse = require('dse-driver');
const Point = dse.geometry.Point;
const insertQuery = 'INSERT INTO points_of_interest (name, coords) VALUES (?, ?)';
client.execute(insertQuery, ['Eiffel Tower', new Point(48.8582, 2.2945)], callback);
const selectQuery = 'SELECT coords FROM points_of_interest WHERE name = ?';
client.execute(selectQuery, ['Eiffel Tower'], function (err, result) {
  assert.ifError(err);
  const row = result.first();
  const point = row['coords'];
  console.log(point instanceof Point); // true
  console.log('x: %d, y: %d', point.x, point.y); // x: 48.8582, y: 2.2945
});

See the driver documentation on geospatial types for more details.

Working with Mixed Workloads

The driver features Execution Profiles that provide a mechanism to group together a set of configuration options and reuse them across different query executions.

Execution Profiles are especially useful when dealing with different workloads like DSE Graph and CQL workloads, allowing you to use a single Client instance for all workloads, for example:

const client = new Client({ 
  contactPoints: ['host1'], 
  profiles: [
    new ExecutionProfile('time-series', {
      consistency: consistency.localOne,
      readTimeout: 30000,
      serialConsistency: consistency.localSerial
    }),
    new ExecutionProfile('graph', {
      loadBalancing: new DseLoadBalancingPolicy('graph-us-west'),
      consistency: consistency.localQuorum,
      readTimeout: 10000,
      graphOptions: { name: 'myGraph' }
    })
  ]
});

// Use an execution profile for a CQL query
client.execute('SELECT * FROM system.local', null, { executionProfile: 'time-series' }, callback);

// Use an execution profile for a gremlin query
client.executeGraph('g.V().count()', null, { executionProfile: 'graph' }, callback);

Looking Forward

We will continue to develop and improve the Node.js driver for Apache Cassandra, as it represents the core building block on which the DataStax Enterprise Node.js Driver is built, and we will also continue to add DSE-specific functionality in this new driver.

To provide feedback use the following:

The DSE driver is available on npm: dse-driver.

Note that the DataStax Enterprise Node.js Driver is published under specific license terms that allow its usage only in conjunction with DataStax Enterprise software.



DataStax Enterprise Ruby Driver: 1.0.0 released!

Recently, we announced the release of DataStax Enterprise (DSE) 5.0 and, shortly after that, the general availability of new dedicated drivers.

In this post we are going to focus on the DataStax Enterprise Ruby Driver 1.0.0.

Overview

The DataStax Enterprise Ruby Driver 1.0.0 is built on top of the “core” driver and supports additional features provided by DSE 5.0, such as Unified Authentication, Geospatial types, and Graph.

The driver is intended to have the same look and feel as the core driver to make upgrading from the core driver trivial. The only change is to replace references to the Cassandra module with Dse when creating the cluster object:

require 'dse'

# This returns a Dse::Cluster instance
cluster = Dse.cluster

# This returns a Dse::Session instance
session = cluster.connect
rs = session.execute('select * from system.local')

These specialized versions have a few additional capabilities (but they will certainly be enhanced with many more DSE-specific features yet to come):

  • Dse::Session can execute graph statements (see below).
  • Dse::Cluster will by default use a special load balancing policy, Dse::LoadBalancing::Policies::HostTargeting, which optimizes query plans for Graph OLAP queries. For OLTP graph queries and CQL queries, this policy falls back to the datacenter-aware/token-aware/round-robin policy, the same as the core driver.

Although the driver exposes new DSE 5.0 features, it is backward-compatible with older DSE releases.

Unified Authentication

DSE 5.0 introduces DSE Unified Authentication, which supports multiple authentication schemes concurrently. Thus, different clients may authenticate with any authentication provider that is supported under the “unified authentication” umbrella: internal authentication, LDAP, and Kerberos.

NOTE: the authentication providers described below are backward-compatible with legacy authentication mechanisms provided by older DSE releases. So, feel free to use these providers regardless of your DSE environment.

Internal and LDAP Authentication

Just as Cassandra::Auth::Providers::Password handles internal and LDAP authentication with Cassandra, the Dse::Auth::Providers::Password provider handles these types of authentication in DSE 5.0 configured with DseAuthenticator. The Ruby DSE driver makes it very easy to authenticate with username and password:

cluster = Dse.cluster(username: 'user', password: 'pass')

The driver creates the provider under the hood and configures the cluster object appropriately.

Kerberos Authentication

To enable Kerberos authentication with DSE nodes, set the auth_provider of the cluster to a Dse::Auth::Providers::GssApi instance:

require 'dse'

# Create a provider for the 'dse' service and have it use the first ticket in the default ticket cache for
# authentication with nodes, which have hostname entries in the Kerberos server.
provider = Dse::Auth::Providers::GssApi.new

# Specify different principal to use for authentication. This principal must already have a valid
# ticket in the Kerberos ticket cache. Also, the principal name is case-sensitive, so make sure it
# *exactly* matches your Kerberos ticket.
provider = Dse::Auth::Providers::GssApi.new('dse', true, 'cassandra@DATASTAX.COM')

cluster = Dse.cluster(auth_provider: provider)

For more information check out the docs.

Geospatial Types

DataStax Enterprise 5.0 comes with a set of additional types to represent geospatial data: PointType, LineStringType, and PolygonType. Here is an example of a table containing a column of type PointType:

CREATE TABLE points_of_interest(name text PRIMARY KEY, coords 'PointType');

The CQL literal representing a geospatial type is simply its Well-known Text (WKT) form. Inserting a row into the table above using plain CQL is as easy as:

INSERT INTO points_of_interest (name, coords) VALUES ('Eiffel Tower', 'POINT(48.8582 2.2945)');

Of course, you are not limited to string literals to manipulate geospatial types; the DataStax Enterprise Ruby Driver 1.0.0 also includes its own representations of these types, which can be sent as query parameters, or retrieved back from query results:

# The geospatial types are defined in the Dse::Geometry module. Include the module
# here so that we can refer to the classes with their base names.
include Dse::Geometry

# Create a table with a PointType column and insert a row into it.
session.execute("CREATE TABLE IF NOT EXISTS points_of_interest" \
                " (name text PRIMARY KEY, coords 'PointType')")
session.execute('INSERT INTO points_of_interest (name, coords) VALUES (?, ?)',
                arguments: ['Empire State', Point.new(38.0, 21.0)])

# Now retrieve the point.
rs = session.execute('SELECT * FROM points_of_interest')
rs.each do |row|
  # We can emit the point in its WKT representation.
  puts "#{row['name']}   #{row['coords'].wkt}"

  # Or the x and y coordinates
  puts "#{row['name']}   #{row['coords'].x},#{row['coords'].y}"

  # Which is really the to_s of the point, so you can do this:
  puts "#{row['name']}   #{row['coords']}"
end

Please refer to both the DSE 5.0 Geospatial Search documentation and the driver documentation on geospatial types for more details.

Graph

The DSE Graph service processes graph queries written in the Gremlin language. Session#execute_graph and Session#execute_graph_async are responsible for transmitting graph queries to DSE graph. The response is a graph result set, which may contain domain object representations of graph objects:

require 'dse'

# Connect to DSE and create a session whose graph queries will be tied to the graph
# named 'mygraph' by default. See the documentation for Dse::Graph::Options for all
# supported graph options.
cluster = Dse.cluster(graph_name: 'mygraph')
session = cluster.connect

# Run a query to get all the vertices in our graph.
results = session.execute_graph('g.V()')

# Each result is a Dse::Graph::Vertex.
# Print out the label and a few of its properties.
puts "Number of vertex results: #{results.size}"
results.each do |v|
   # Start with the label
   puts "#{v.label}:"
   
   # Graph object properties support multiple values.
   # Emit the 'name' property's first value; 
   puts "  name: #{v['name'][0].value}"
   
   # Print all the values of the 'name' property
   puts "  all names: #{v['name'].values.join(',')}"
end

Please refer to both the DSE 5.0 Graph documentation and the driver documentation on graph for further information about graph queries.

Getting the driver

The new driver gems (for both MRI and JRuby) are available on rubygems.org, so just update your Gemfile to include dse-driver, bundle install, and you’re all set.

Be aware that the DataStax Enterprise Ruby Driver is published under specific license terms that allow its usage solely in conjunction with DataStax Enterprise software.

Documentation for the DataStax Enterprise Ruby Driver can be found at the following locations:

Enjoy!


Groovier Gremlin Editing with DataStax Studio 1.0.0

DataStax Studio, our fantastic new web-based notebook editor for use with DSE Graph 5.0, is a great application for executing your graph queries and visualizing your results.  In this post, we’ll give you a glimpse (you’ll still have to try it to believe it) of the powerful code editor that comes with Studio.  The editor is packed with many of the features you’ve come to expect from advanced integrated development environments (IDE), as well as other exciting features that will help you be more productive when working with DSE Graph 5.0 and the Apache TinkerPop™ query language Gremlin.

So without further ado, let’s take a tour of two areas in the Studio Gremlin editor that will help you write Gremlin queries better and faster:  Validations and Content Assist.

Stay Productive with Validations

Studio comes with a few different types of validations: Groovy Syntax Validations, Type Checking Validations and Domain Specific Validations.  Let’s take a look at each of them.

Groovy Support and Syntax Validations

Ultimately your code is executed within DSE Graph as Groovy so we’ve added enough Groovy syntax support for you to be very productive and happy as you craft your Gremlin statements.  What better way to demonstrate this than with a screenshot of a syntax error!

syntax error support - fig 1
Figure 1:  Syntax Error Support – errors are marked in the gutter and the specific snippet causing the error is underlined.  Hovering over the gutter marker gives you details about the error.

The editor does not support the full Groovy syntax, but fret not, Groovy developers.  If there is some Groovy syntax you’d like to use that is not currently supported by the editor, there is a handy toggle button to disable Studio validations, as shown below in Figure 2.

toggle button - fig 2
Figure 2:  Toggle button to disable editor validations.

For more details on what Groovy syntax is and is not supported, please check out the Studio documentation.

Type Checking Validations

Pure syntax validation is handy, but when learning a new API it’s even more important to get type validation.  In Figures 3 and 4 you can see that the Studio editor validates method names and type signatures, and performs the other standard type validations you’d expect:

method name validation - fig 3
Figure 3:  Method name validation.

argument type validation - fig 4
Figure 4:  Argument type validation.

In Figure 4, it is worth noting that default Groovy behavior would coerce the value to a String.  However, we believe it’s more useful to be stricter to help catch programmatic errors earlier.

Domain Specific Validations for TinkerPop

DataStax Studio also provides domain specific validations.  As an example, let’s take a look at a validation that will help you avoid a common mistake when first learning TinkerPop – not iterating your traversals.  This gotcha is explained in-depth in the DataStax Studio Tutorial notebook.  But for the purposes of this post, it’s enough to say that creating a traversal and never iterating it is bad.  But don’t worry!  As long as you are writing your Gremlin in Studio, we’ll gently remind you to correct the issue.

iterate your traversals warning - fig 5
Figure 5:  The `iterate your traversals` warning. Note: The last line does not have a warning because DSE Graph will automatically iterate the final traversal of a notebook cell.

Getting Help with Content Assist

Rote memorization of APIs or constantly referring to documentation is no fun.  So as you’d expect when you invoke content assist (ctrl+space) in Studio you’ll get method, variable, class, and other types of proposals.

Studio prioritizes proposals from the TinkerPop and DSE Graph APIs so it’s easy for you to find methods specific to the task of writing Gremlin queries.  If that wasn’t enough, Studio will also make proposals based on the state of your DSE Graph schema.  We call this schema assist, and we’ll talk more about that after some quick examples of more traditional content assist proposals: method and variable proposals.

Method Proposals

method proposals
Figure 6:  Method proposals –  the most obvious yet super useful form of content assist.

In Figure 6, there are a couple of details worth pointing out:

  1. The editor is prioritizing methods that are declared on the most specific type being invoked. In this case,  GraphTraversalSource.
  2. There is always an implicit variable called g in your notebook cells.  g is a pre-defined variable that is of type GraphTraversalSource: http://tinkerpop.apache.org/javadocs/current/core/org/apache/tinkerpop/gremlin/process/traversal/dsl/graph/GraphTraversalSource.html

Variable Proposals

Studio will propose, where appropriate, predefined variables as well as the variables that you’ve created that are in scope.  Variable scope is based purely on cell editor content and is ordered from top to bottom of the notebook.  

proposal of variables
Figure 7:  Proposals of variables, both predefined (g, graph, schema), as well as those defined in notebook cells (customerName and orderId).

Proposal Filtering

For all types of proposals after invoking content assist, you can type to filter the proposals as shown below:

example of filtering proposals
Figure 8:  Example of filtering proposals.  Any type of proposal can be filtered by typing.

Schema Assist

Many of the Gremlin APIs allow you to filter vertices or edges based on property keys or labels present on those graph elements.  Those keys and labels are part of your graph’s schema.  With any schema that is not trivially small it can be easy to forget its structure.  One of the ways Studio helps is by giving a great visualization of your graph’s schema as seen in Figure 9.

visualization of graph gods
Figure 9:  A visualization of the ‘Graph of the Gods’ schema from the DataStax Studio tutorial.

But as you are writing your queries, you shouldn’t have to refer to the schema visualization repeatedly.  So we’ve leveraged content assist as a place to provide you with suggestions for vertex/edge property keys and label values.

Sweet!  Now your eyes can stay focused on where you are typing.  Let’s take a look at what happens when we invoke content assist on a traversal that takes a property key as an argument; using the same schema that is visualized in Figure 9.

schema assist
Figure 10:  Schema assist, oh the possibilities!

That’s a lot of proposals.  Let’s break them down:

  • First, the editor is intelligent enough to know that this is a Vertex based traversal (because of type inference!) and only presents schema proposals that are relevant for vertex properties and keys.  Amazing!
  • Because there are multiple `has` methods, Studio proposes possibilities for all of these variations
  • There are also multiple possibilities as to which property key you might want, and the editor concretely makes a proposal for each of those.

Basic Filtering

What happens for the next proposal within that same method call?  Let’s see!

schema assist filtering
Figure 11:  Schema assist filtering, proposals are further pruned to only propose property keys available on vertices that have type `monster`.

The second proposal list is much shorter.  That’s because you already selected vertices with the label `monster` and so only property keys from vertices that have a label `monster` are proposed.  

Dynamic Incident Element Filtering

Even better is that Studio will use the graph schema to inspect your traversal to filter what is proposed in subsequent method calls.  In other words, only labels and keys from graph elements (vertices or edges) incident to the current graph element are proposed. Let’s look at a few examples. This time from the DSE Graph Quick Start Tutorial schema.

proposals filtered
Figure 12:  Proposals filtered by the fact that a filter for only vertices of type ‘recipe’ was applied earlier in the traversal.

In Figure 12, only proposals that are valid from recipe vertices are made if a label filter of ‘recipe’ is applied earlier in the traversal.  In this case, recipes have two properties, instructions and name, and only one label to filter on that is still valid: the recipe label itself.

What about edge and direction based filtering?  We’ve got you covered there as well!

proposals filtered by vertices
Figure 13:  Proposals filtered by vertices with the ‘author’ label, only propose out edges ‘authored’ and ‘created’.

Again, you can see here that Studio understands this traversal and only proposes edge labels ‘authored’ and ‘created’.  Awesome!

What happens if content assist is invoked for the method inE after filtering by ‘author’?  Let’s find out:

edge labels
Figure 14:  All edge labels are proposed if the dynamic filtering produces no results.

In Figure 14 above dynamic filtering produced no results (because author has no in edges).  So instead, Studio falls back on proposing all graph elements of the appropriate type for that method parameter.  In this case, all edge labels are proposed.

The behavior of proposing everything if dynamic filtering produces no proposals applies to any invocation of schema assist.

P, T, and Enum Proposals

In Figure 15, the last argument can either be a value of any type, or a predicate of type P.  Neither is a schema element, so this next tidbit is more about pure content assist.  Here is what happens when we invoke content assist for the third argument:

proposals implicitly imported
Figure 15:  Proposals for implicitly imported P(predicate) helper methods

While Studio can’t propose values for you, it does have built-in support for special TinkerPop helper classes and enums like P, T, Pop, Direction, Cardinality, etc.  In this case, the static helper methods that produce a predicate of type P are proposed for you.

Wrapping Up

Thanks for taking a tour of the Studio Gremlin editor!  We think these features will help you to be even more successful with DSE Graph and the Gremlin query language.  If you haven’t already, download a copy of Studio here, and take it for a spin.  We already have some great ideas on additional features and validations, but if there is a feature you’d like to see that isn’t there yet, please drop us a note at studio-feedback@datastax.com.  

DataStax DevCenter 1.6.0 Released

DevCenter 1.6.0 has been released and is now available for download.

This release introduces a number of new features, including support for DSE Search, keyspace export, large result sets, and more. Please see the release notes for a complete list and more details.

Also included are a number of bug fixes:

  • Scrolling a very long query trace result might freeze DevCenter
  • Special characters should be filtered out from query trace events
  • No horizontal scroll bar for query trace table.
  • Query trace on linux is not viewable
  • No validation error for missing keyspace in ‘DROP INDEX’ statement.
  • No validation error for renaming non primary key column.
  • Create Index content assist should not suggest static column.
  • Content assist should not propose existing keyspaces in the CREATE KEYSPACE statement.
  • Drop dialogs: Unable to drop multiple schema elements from different keyspaces.
  • Create Aggregate Content Assist (keyword && with final function): Stype is not suggested.
  • Support inequality expressions in UPDATE IF statement
  • Drop UDA: Wrong UDA dropped in mixed case Keyspace/UDA scenario.
  • No validation error for Create Materialized View without clustering key statement.
  • Details View on Win7 and Ubuntu: Timestamp is not split into 2 lines.
  • Incorrect timestamp format (12 hour without AM/PM indicator) in Results view
  • SELECT COUNT query fails when using solr_query
  • Missing duplicate index validation error on CREATE INDEX statement.
  • Table element in “Query Trace” view has fixed horizontal size

User Defined Aggregations with Spark in DSE 5.0

There are already a couple of blog posts and presentations about UDF/UDA. But for those who do not know how to use Cassandra User Defined Functions (UDF) and User Defined Aggregates (UDA), here is a short introduction on how to use them from Spark to push down partition-level aggregations.

Introduction to User Defined Functions (UDF)

Cassandra 2.2 introduced User-Defined-Functions and User-Defined-Aggregates, which allowed users to write their own scalar functions and use these to build their own aggregations. UDF and UDA are executed on the coordinator, the node that executes the query sent to the cluster. UDF, UDA and built-in aggregations (e.g. count, min, max, avg, built-in functions) are applied to the result with respect to the actual consistency level. In other words, Cassandra first performs the read operations as it would without the functions and then applies the functions/aggregation on the result set.

UDF are pure scalar functions. Pure functions depend on the input arguments and produce a single result. One or more columns can be passed into a UDF or UDA. UDF should really be written in Java.

In order to sum up the column values, a UDF could look like this:

CREATE FUNCTION sum_two(a int, b int)
RETURNS NULL ON NULL INPUT
RETURNS int
LANGUAGE java
AS 'return a + b;';

It could then be used like this: SELECT key, sum_two(column_a, column_b) AS the_sum FROM some_table WHERE key = 123;

Aggregations work on multiple rows and produce a single row. In order to do that, an aggregation needs an initial state. For each row that has been read, the aggregation’s state function is called with both the current state and the value(s) from the current row, returning the new state. After all rows are processed, an optional final function converts the last state into the aggregation’s result – otherwise the last state is the aggregation’s result.

For example:

CREATE AGGREGATE sum_it(int)
SFUNC sum_two
STYPE int
INITCOND 0;

A more thorough description of UDF and UDA is available here.

User Defined Functions (UDF) and User Defined Aggregates (UDA) Background

UDF Compilation

Each Java UDF will be compiled to a Java class – so it’s bytecode for the Java Virtual Machine (JVM). In order to compile the UDF, we need a Java source file. Basically, we only put the UDF code into a Java source template and add some necessary (de)serialization code. Package and class names are random. Take a look at the source of that template here.

Scripted UDF are just compiled as specified in the code part of the CREATE FUNCTION statement.

The Sandbox

For C* 2.2 we put the label experimental on both UDF and UDA, and you had to explicitly enable UDF in cassandra.yaml. In fact, you can do anything you want with UDF in C* 2.2 – there’s nothing that protects the node from an evil UDF, apart from permissions via authentication/authorization.

For Cassandra 3.0, and therefore DataStax Enterprise 5.0, we added a sandbox for UDF. That sandbox ensures that UDF do not do evil things. But you still have to explicitly enable UDF in cassandra.yaml.

So, what does evil mean? In short, an evil UDF is one that is not pure as described in CASSANDRA-7395. Pure UDF get input parameter values, operate on these values and return a result. UDF depend only on the input parameters. There are no side effects and no exposure to C* internals.

Let’s recap what the sandbox is for and how it protects the system:

  • no I/O (disk, files, network, punch card)
  • no thread management
  • don’t exit or halt or freeze the JVM
  • no access to internal classes (like org.apache.cassandra.*, com.datastax.*, etc.)
  • no access to 3rd party libraries (like com.google.*, sigar, JNA, etc.)
  • but allow access to required Java Driver classes (not all driver classes)
  • no use of locks and synchronized keyword
  • deny creation of UDF that somehow try to inject other code like a static block in Java UDF
  • detect and safely stop UDF running too long
  • detect and safely stop UDF that consume too much heap

The straightforward approach could be to use the Java Security Manager and that’s it, right? … Sorry, that is definitely not enough. In fact, the Java Security Manager is just a tiny part of the sandbox. For example, it does not protect you from creating a new thread if the current thread is not in the root thread group. Additionally, it is not able to apply a runtime or heap quota.

The most important parts of the sandbox are the Java bytecode inspection (for Java UDF) and restricted class loading (for all UDF) introduced in DSE 5 (Cassandra 3.0).

The bytecode inspection will prevent creation of a UDF that tries to do evil things, like injecting a static code block that would be executed when the UDF is created. It also detects usages of the synchronized keyword, among other things.

Such a UDF will not pass validation:

CREATE FUNCTION udf_func(input int)
CALLED ON NULL INPUT
RETURNS int
LANGUAGE java
AS $$
    return input;
}
static {
    System.exit(0);
$$;
InvalidRequest: code=2200 [Invalid query] message="Could not compile function 'udfdemo.udf_func' from Java source: org.apache.cassandra.exceptions.InvalidRequestException: Java UDF validation failed: [static initializer declared]"

Detecting UDF executions that run too long or consume too much heap is not rocket science – it is about getting the consumed CPU time and heap values from the JVM. The problem is not to detect such a situation – the problem is how to handle the situation. Naively one could just kill the thread, right? But the answer to that is – No. Killing a thread is almost never the solution. In fact, it destabilizes the whole process in every programming language.

All currently released versions of Cassandra can detect runtime and heap quota violations and fail fast. Fail fast in this case means that a non-recoverable situation has been detected by Cassandra and the only “way out” is to stop the node. Each UDF execution is performed in a separate thread pool.

Our current proposal is to manipulate the compiled bytecode and inject some guardian code that checks the runtime and heap quotas and aborts the UDF. Additionally, this allows us to execute Java UDF directly without the need for a separate thread pool. Unfortunately, this only works for Java UDF and not for JavaScript or any other JSR 223 provider. Evil JavaScript code can only be detected – and that still means fail fast and stop the node.

Since Java and scripted UDF fundamentally differ from the sandbox’s point of view, there is a second option in cassandra.yaml that needs to be enabled when you want to enable scripted UDF in addition to Java UDF.
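
As a sketch, assuming the option names of the stock cassandra.yaml shipped with Cassandra 3.0 / DSE 5.0 (verify against your own configuration file), enabling both Java and scripted UDF looks like this:

# cassandra.yaml
enable_user_defined_functions: true             # required for any UDF
enable_scripted_user_defined_functions: true    # additionally required for scripted (JSR 223) UDF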

Why all this effort?

The answer is two-fold. First, imagine that one accidentally coded an endless loop (e.g. a loop with the wrong break condition) or produced an infinitely growing list or string. In this case, the goal is to not let the DSE/Cassandra cluster fail on coding bugs (you know, there is nothing like bug-free code).

The second and probably most important goal is to prevent intentional attacks via UDF. Since there is nothing like a perfect system, UDF still needs to be explicitly enabled in cassandra.yaml. Last but not least, please be aware that detecting or even preventing bad things in scripted UDF is really difficult and nearly impossible. Therefore, scripted UDF have to be enabled separately in cassandra.yaml before they can be used.

Aggregation is not analytics

People sometimes get confused about aggregation in a distributed database. Somehow they assume that aggregation in a distributed database is the essential building block for distributed analytics. But analytics is so much more.

Distributed analytics in Spark uses resilient distributed datasets (RDD), which are immutable. RDD come along with processing instructions. Spark can then build a directed acyclic graph (DAG) and distribute the work if and where appropriate.

Analytics itself supports map-reduce, grouping, aggregation, re-partitioning and a whole zoo of more operations on the data as well as support for distributed state. Nowadays analytics has to support many programming languages like Scala, Java, Python, R to solve a specific problem in the most efficient way. Additionally, it has to support different kinds of sources and targets like Cassandra, Kafka, relational databases, flat files in various formats like parquet, CSV, tabular.

Another false assumption that I sometimes hear is that UDF and UDA add magic and are able to work on a vast number of partitions. Although that would be really cool, it is just not true.

And frankly, why should we build another analytics engine inside Cassandra? Again, UDF and UDA are building blocks for something bigger. Database and analytics are related – but not the same thing.

Code walkthrough

The following lines are meant to provide a simple example of how to benefit from UDF and UDA. This example assumes a simple sensor network measuring weather data like temperature. The minimum requirement for the code walkthrough is a DSE 5.0 single-node “cluster” with analytics and UDF enabled. In cassandra.yaml, set enable_user_defined_functions: true before you issue dse cassandra -k to start DSE with analytics enabled.

The first thing needed is a table and just enough data for this example.

CREATE KEYSPACE udfdemo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
USE udfdemo;

CREATE TABLE measurements (
sensor text,
time_bucket text,
ts timestamp,
temp float,
PRIMARY KEY ( (sensor, time_bucket), ts));

INSERT INTO measurements (sensor, time_bucket, ts, temp) VALUES
('11-111', '2016-04-26', '2016-04-26 12:00:00', 7.8);
INSERT INTO measurements (sensor, time_bucket, ts, temp) VALUES
('11-111', '2016-04-26', '2016-04-26 13:00:00', 8.3);
INSERT INTO measurements (sensor, time_bucket, ts, temp) VALUES
('11-111', '2016-04-26', '2016-04-26 14:00:00', 9.2);
INSERT INTO measurements (sensor, time_bucket, ts, temp) VALUES
('11-111', '2016-04-27', '2016-04-27 12:00:00', 10.2);
INSERT INTO measurements (sensor, time_bucket, ts, temp) VALUES
('11-111', '2016-04-27', '2016-04-27 13:00:00', 9.3);
INSERT INTO measurements (sensor, time_bucket, ts, temp) VALUES
('11-111', '2016-04-27', '2016-04-27 14:00:00', 9.9);
INSERT INTO measurements (sensor, time_bucket, ts, temp) VALUES
('11-222', '2016-04-26', '2016-04-26 12:00:00', 21.9);
INSERT INTO measurements (sensor, time_bucket, ts, temp) VALUES
('11-222', '2016-04-26', '2016-04-26 13:00:00', 20.2);
INSERT INTO measurements (sensor, time_bucket, ts, temp) VALUES
('11-222', '2016-04-26', '2016-04-26 14:00:00', 21.5);
INSERT INTO measurements (sensor, time_bucket, ts, temp) VALUES
('11-222', '2016-04-27', '2016-04-27 12:00:00', 22.4);
INSERT INTO measurements (sensor, time_bucket, ts, temp) VALUES
('11-222', '2016-04-27', '2016-04-27 13:00:00', 24.0);
INSERT INTO measurements (sensor, time_bucket, ts, temp) VALUES
('11-222', '2016-04-27', '2016-04-27 14:00:00', 23.7);

SELECT * FROM measurements;

 sensor | time_bucket | ts                       | temp
--------+-------------+--------------------------+------
 11-111 |  2016-04-27 | 2016-04-27 10:00:00+0000 | 10.2
 11-111 |  2016-04-27 | 2016-04-27 11:00:00+0000 |  9.3
 11-111 |  2016-04-27 | 2016-04-27 12:00:00+0000 |  9.9
 11-111 |  2016-04-26 | 2016-04-26 10:00:00+0000 |  7.8
 11-111 |  2016-04-26 | 2016-04-26 11:00:00+0000 |  8.3
 11-111 |  2016-04-26 | 2016-04-26 12:00:00+0000 |  9.2
 11-222 |  2016-04-27 | 2016-04-27 10:00:00+0000 | 22.4
 11-222 |  2016-04-27 | 2016-04-27 11:00:00+0000 |   24
 11-222 |  2016-04-27 | 2016-04-27 12:00:00+0000 | 23.7
 11-222 |  2016-04-26 | 2016-04-26 10:00:00+0000 | 21.9
 11-222 |  2016-04-26 | 2016-04-26 11:00:00+0000 | 20.2
 11-222 |  2016-04-26 | 2016-04-26 12:00:00+0000 | 21.5

Globally, some people use Celsius as a measurement for temperature, while others prefer Fahrenheit. One useful addition is a function that converts Celsius to Fahrenheit, because the sensors report Celsius.

CREATE FUNCTION celsius_to_fahrenheit(temp float)
RETURNS NULL ON NULL INPUT
RETURNS float
LANGUAGE java
AS 'return temp * 1.8f + 32;';

To return the temperature in Fahrenheit, we need to apply the celsius_to_fahrenheit UDF to the temp column:

SELECT sensor, time_bucket, ts, temp as temp_celsius, celsius_to_fahrenheit(temp) as temp_fahrenheit FROM measurements;

 sensor | time_bucket | ts                       | temp_celsius | temp_fahrenheit
--------+-------------+--------------------------+--------------+-----------------
 11-111 |  2016-04-27 | 2016-04-27 10:00:00+0000 |         10.2 |           50.36
 11-111 |  2016-04-27 | 2016-04-27 11:00:00+0000 |          9.3 |           48.74
 11-111 |  2016-04-27 | 2016-04-27 12:00:00+0000 |          9.9 |           49.82
 11-111 |  2016-04-26 | 2016-04-26 10:00:00+0000 |          7.8 |           46.04
 11-111 |  2016-04-26 | 2016-04-26 11:00:00+0000 |          8.3 |           46.94
 11-111 |  2016-04-26 | 2016-04-26 12:00:00+0000 |          9.2 |           48.56
 11-222 |  2016-04-27 | 2016-04-27 10:00:00+0000 |         22.4 |           72.32
 11-222 |  2016-04-27 | 2016-04-27 11:00:00+0000 |           24 |            75.2
 11-222 |  2016-04-27 | 2016-04-27 12:00:00+0000 |         23.7 |           74.66
 11-222 |  2016-04-26 | 2016-04-26 10:00:00+0000 |         21.9 |           71.42
 11-222 |  2016-04-26 | 2016-04-26 11:00:00+0000 |         20.2 |           68.36
 11-222 |  2016-04-26 | 2016-04-26 12:00:00+0000 |         21.5 |            70.7

In order to get the average temperature per day, we need a UDA for our average. (Side note: Cassandra 3.0/DSE 5.0 already has built-in aggregations like min, max and avg.)

An aggregate needs at least a state function, which combines the current (or initial) state with the values from a row. Additionally, it may require a final function that produces the result to be returned from the last state value. To calculate an average we need a state that sums up all values and counts the number of values. In this example we simply use a tuple with an int and a double for the state.

First, we need the UDF that maintains the current state of the aggregation. It is called avg_state in the following code snippet and takes the state as the first argument and the column value from the current row as the second argument. The next line, RETURNS NULL ON NULL INPUT, specifies that the function must not be called if any of its arguments is null. The opposite would be CALLED ON NULL INPUT, in which case you would have to deal with null values yourself; in our case, considering null values does not make sense. The third line, RETURNS tuple<int,double>, defines the return type, which must be the same as the type of the first argument, because a UDA state function takes the state as the first argument and returns the new state. LANGUAGE java declares that the UDF body is written in Java. The remaining lines define the UDF code, which increments the value counter (the first field of the state tuple) and adds the value to the second field of the state tuple.

CREATE OR REPLACE FUNCTION avg_state(state tuple<int, double>, val float)
RETURNS NULL ON NULL INPUT
RETURNS tuple<int, double>
LANGUAGE java
AS $$
  state.setInt(0, state.getInt(0) + 1);
  state.setDouble(1, state.getDouble(1) + val);
  return state;
$$;

Since we want the result of the aggregation to be a floating-point value, we have to convert the last state using a final function, called avg_final in this example. A final function takes just one argument: the last state. For this example, it is sufficient to divide the sum of all values by the number of values.

CREATE OR REPLACE FUNCTION avg_final(state tuple<int, double>)
RETURNS NULL ON NULL INPUT
RETURNS float
LANGUAGE java
AS 'return (state.getInt(0) == 0) ? 0f : (float)(state.getDouble(1) / state.getInt(0));';

The user-defined aggregate itself takes a couple of things. First, it declares the argument type; argument names are not necessary. The state type, in our case the tuple of int and double, is defined in line 3. The state type is the first argument to the state function, and the remaining state function argument types are those of the aggregate itself. In this example the state function arguments are tuple<int,double> and float, which are exactly the argument types of the declared state function avg_state. The final function, needed to convert the last state into a “simple” float, is declared in line 4 and is avg_final. Line 5 defines the initial value for the state, which is 0 for both the count and the sum of values in the tuple.

CREATE AGGREGATE temp_avg(float)
SFUNC avg_state
STYPE tuple<int, double>
FINALFUNC avg_final
INITCOND (0, 0);

To retrieve the average temperature for the first sensor on 2016-04-27, we need the following CQL:

SELECT sensor, time_bucket, temp_avg(temp) FROM measurements WHERE sensor='11-111' AND time_bucket='2016-04-27';

 sensor | time_bucket | udfdemo.temp_avg(temp)
--------+-------------+------------------------
 11-111 |  2016-04-27 |                    9.8
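
As a quick sanity check, the 2016-04-27 readings for sensor 11-111 are 10.2, 9.3 and 9.9, and (10.2 + 9.3 + 9.9) / 3 = 29.4 / 3 = 9.8, which matches the result above.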

Oh, you want it in Fahrenheit? Use our celsius_to_fahrenheit function!

SELECT sensor, time_bucket, celsius_to_fahrenheit(temp_avg(temp)) FROM measurements WHERE sensor='11-111' AND time_bucket='2016-04-27';

 sensor | time_bucket | udfdemo.celsius_to_fahrenheit(udfdemo.temp_avg(temp))
--------+-------------+-------------------------------------------------------
 11-111 |  2016-04-27 |                                                 49.64
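
Again the arithmetic checks out: 9.8 × 1.8 + 32 = 49.64.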

The nice thing about UDAs, and other built-in aggregations, is that only the aggregated result is sent over the wire to the client. This becomes more important as the amount of data to aggregate grows.

Spark and UDA Walkthrough

Combining the demo above with Spark is straightforward. The requirements are DSE 5.0 or newer with enable_user_defined_functions: true set in cassandra.yaml and DSE Analytics enabled using dse cassandra -k.

In order to run Spark, start the Spark Shell using the dse spark command. It automatically creates a Spark context and connects to Cassandra and Spark in DSE Analytics. To verify that the connection is set up correctly and the schema is available, execute :showSchema udfdemo. It should print the following output.

scala> :showSchema udfdemo
========================================
 Keyspace: udfdemo
========================================
 Table: measurements
----------------------------------------
 - sensor      : String         (partition key column)
 - time_bucket : String         (partition key column)
 - ts          : java.util.Date (clustering column)
 - temp        : Float         

The next example assumes that we get (or have) a list of sensor and day tuples for which the average temperature shall be computed. Since sensor and day (the time_bucket) form the partition key of the measurements table, pushing the aggregation (computation of the average temperature per sensor per day) down into Cassandra is absolutely fine. It is a very cheap computation and saves some bandwidth on the wire between Cassandra and Spark.

Note: I’ve omitted the scala> prefix that appears in the Spark Shell to make it easier to copy & paste the code.

Line #1 just sets up the sequence with the sensor and day tuples.

Line #2 creates an RDD from the sequence so Spark can parallelize the work on the sequence.

Line #3 joins the sensor and day tuples with the CQL partitions in the measurements table.

SomeColumns defines the columns to be returned, which include the call to our user-defined aggregate temp_avg operating on the temp column. Note that support for calling user-defined aggregates and user-defined functions via FunctionCallRef was introduced in version 1.5.0-M3 of the spark-cassandra-connector. DSE 5.0 ships with version 1.6 of the connector.

val sensorsAndDays = Seq( ("11-111", "2016-04-26"), ("11-111", "2016-04-27"), ("11-222", "2016-04-26"), ("11-222", "2016-04-27") )
val sensorsAndDaysRDD = sc.parallelize(sensorsAndDays.map( (_, "") ))
val result = sensorsAndDaysRDD.
  map( _._1 ).
  joinWithCassandraTable("udfdemo","measurements",
                         SomeColumns("sensor",
                                     "time_bucket",
                                     FunctionCallRef("temp_avg", Seq(Right("temp")), Some("avg_temp"))))
result.collect.foreach(println)
((11-111,2016-04-26),CassandraRow{sensor: 11-111, time_bucket: 2016-04-26, avg_temp: 8.433333})
((11-111,2016-04-27),CassandraRow{sensor: 11-111, time_bucket: 2016-04-27, avg_temp: 9.8})
((11-222,2016-04-26),CassandraRow{sensor: 11-222, time_bucket: 2016-04-26, avg_temp: 21.2})
((11-222,2016-04-27),CassandraRow{sensor: 11-222, time_bucket: 2016-04-27, avg_temp: 23.366667})

Wrap-up

UDFs and UDAs are nice building blocks for new features and are also handy to use from analytics code like Spark. But please do not forget that the execution of a UDA always takes place on the coordinator node. This is not because we are lazy but because we still have to respect the consistency level provided with the query, and only the coordinator can provide consistency-level behaviors and guarantees. Under the covers we still perform normal reads but apply the UDF or UDA on top of these reads.

The UDF sandbox is pretty solid in my opinion and will become even better. Although UDFs are disabled by default in cassandra.yaml, enabling Java UDFs with the sandbox in place should provide enough security.

Some things to keep in mind and really respect:

  • If you are tempted to use JavaScript UDFs, use Java UDFs instead – really.
  • Check your UDFs thoroughly. Long-running UDFs, as well as UDFs consuming a lot of heap, will cause the node to “fail fast” – i.e. stop the node – and probably not just one node…
  • UDAs are absolutely fine when applied to a single partition. This is nothing new – single-partition reads are what you should do with Cassandra in general.
  • UDAs are nice building blocks for other analytics engines like Spark, so you should perform single-partition aggregations in Cassandra and compute the grand aggregate in Spark.
  • Do not misuse UDFs to perform really expensive computations.

Python Driver 3.6.0 Released


The DataStax Python Driver 3.6.0 for Apache Cassandra has been released. This release had no specific area of focus, but brings a number of new features and improvements. A complete list of issues is available in the CHANGELOG. Here I will mention some of the new features.

Handle null values in NumpyProtocolHandler

Before this release, it was a known limitation that the NumpyProtocolHandler did not work with null values. The driver can now handle them, using numpy.ma.masked_array to do so. Masked arrays are arrays that may have missing or invalid entries. Since MaskedArray is a subclass of numpy.ndarray, they can be used as normal without modifying your code:

from cassandra.protocol import NumpyProtocolHandler
from cassandra.query import tuple_factory

# session is an already-connected cassandra.cluster.Session
session.row_factory = tuple_factory  # required for Numpy results
session.client_protocol_handler = NumpyProtocolHandler

results = session.execute('select * from test.automobile;')
counts = results.current_rows[0]['count']
# counts --> masked_array(data = [0 -- 2 -- 4 -- 6 -- 8 --],
#   mask = [False True False True False True False True False True], fill_value = 999999)
counts.sum()  # numpy.ndarray.sum()
# 20
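
Because the result is a standard numpy masked array, the usual numpy.ma API applies when you want to deal with the null entries explicitly. A small sketch, reusing the counts array from the snippet above:

import numpy.ma as ma

# Drop the masked (null) entries entirely
valid_counts = counts.compressed()   # array([0, 2, 4, 6, 8])

# Or substitute a fill value for the null entries before further processing
filled_counts = ma.filled(counts, 0)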

Collect greplin scales stats per cluster

Previously, if your application had multiple Cluster objects, their metrics were collected together and the statistics were effectively meaningless. The driver now collects metrics per cluster. We added two new API functions to the Metrics class:

  • Metrics.get_stats(): get the stats
  • Metrics.set_stats_name(name): set a different Scales stats name

from greplin import scales
from cassandra.cluster import Cluster

cluster1 = Cluster(...)
cluster2 = Cluster(...)

# Get cluster1 metrics
cluster1.metrics.get_stats()
scales.getStats()['cassandra']    # always points to the first registered cluster's metrics
scales.getStats()['cassandra-0']

# Get cluster2 metrics
cluster2.metrics.set_stats_name('Cluster2')
cluster2.metrics.get_stats()
scales.getStats()['Cluster2']

Skip result set metadata for prepared statement

In the v2+ native protocol, a flag may be set to indicate that result set metadata (column names, types, etc.) can be omitted from query responses. As a performance improvement, the driver now sets this flag automatically for prepared statements.
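
No application code changes are needed to take advantage of this; it applies to ordinary prepared statement usage. A minimal sketch, reusing the test.automobile table from the earlier examples (here assumed to have a manufacturer text column):

from cassandra.cluster import Cluster

cluster = Cluster()
session = cluster.connect()

# Result set metadata is skipped transparently for prepared statements
prepared = session.prepare('SELECT * FROM test.automobile WHERE manufacturer = ?')
rows = session.execute(prepared, ('honda',))
for row in rows:
    print(row)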

CQLEngine: A new context manager to allow models to switch the keyspace easily: ContextQuery

In this release, we introduce a new context manager that can be used to switch a model's context easily: the ContextQuery. Presently, the context only specifies a keyspace for model IO, but it may be extended in the future; for example, it could also be used to select a target cluster from multiple distinct cluster connections. We also improved the sync_table and drop_table management functions to support multiple keyspaces.

from cassandra.cqlengine import columns
from cassandra.cqlengine.management import create_keyspace_simple, sync_table
from cassandra.cqlengine.models import Model
from cassandra.cqlengine.query import ContextQuery

# assumes a cqlengine connection has already been set up via cassandra.cqlengine.connection.setup()

KEYSPACES = ('test', 'test1', 'test2', 'test3', 'test4')
for keyspace in KEYSPACES:
    create_keyspace_simple(keyspace, 1)

class Automobile(Model):
    manufacturer = columns.Text(primary_key=True)
    year = columns.Integer(primary_key=True)
    model = columns.Text()

sync_table(Automobile, keyspaces=KEYSPACES) 

with ContextQuery(Automobile, keyspace='test2') as A:
    A.objects.create(manufacturer='honda', year=2008, model='civic')
    print len(A.objects.all())  # 1 result

with ContextQuery(Automobile, keyspace='test4') as A:
    print len(A.objects.all())  # 0 result

Wrap

As always, thanks to all who provided contributions and bug reports. The continued involvement of the community is appreciated.

Scale Quickly with DataStax Enterprise on Google Container Engine


Containers are very effective for managing stateless applications. In a containerized environment, web tiers can be scaled quickly and failed containers can be replaced just as quickly. Tools like Kubernetes automate that management. Google Container Engine (GKE) further simplifies the process by providing tight integration between Kubernetes and a top-tier cloud provider, Google Cloud Platform (GCP). Simplified management means less time spent managing machines, resulting in a lower total cost of ownership (TCO) for containerized environments than for traditional virtualized environments.

Managing stateful workloads is more complex. In a stateful system, replacing failed nodes with brand-new, empty nodes would result in an unacceptable loss of data. One example of a stateful workload is a database; in particular, this blog post focuses on DataStax Enterprise (DSE), a highly scalable distributed database built on Apache Cassandra™.

Rather than separating the web tier from the database tier, the integration with GKE allows a user to deploy their entire application, including the data store, in GKE and then manage its full lifecycle using Kubernetes. This simplifies administration, improves reliability, and provides a more holistic framework for building applications than is otherwise available. All of this leads to a lower cost of ownership and improved time to market.

Deploying a DSE cluster

DataStax and Google have been working closely to build an integration between DSE and GKE.  This integration is available on GitHub at: https://github.com/DSPN/google-container-engine-dse.

Deploying a cluster is really simple. Detailed instructions are given here. At a slightly higher level, all you need to do is run the deploy.sh script.

GKE-1

Once a cluster is deployed, you can log in to DataStax OpsCenter and view the DSE nodes running:

OpsCenter

Wrapping Up…

The integration described here is a fast way to get started using DataStax Enterprise on Google Container Engine. At this point, the integration is more demo-grade than production-ready. But we’re actively working on improving it!

We welcome your thoughts and contributions for this project.  We’re tracking issues on GitHub here: https://github.com/DSPN/google-container-engine-dse/issues.

Additionally, feel free to reach out to ben.lackey@datastax.com or on Twitter @benofben.
