
Getting the Hang of Using the DataStax Graph Loader: The Basics 


The DataStax Graph Loader is an important tool in the development of a DataStax Graph use case. While its primary function is to bulk load data into DSE Graph, it is also a great way to familiarize yourself with some of the ins and outs (pun intended) of DSE Graph. In this blog post, we’ll show you how to think about the data and how to create a mapping script and its mapping functions. We’ll also show you how to use the DataStax Graph Loader, DGL for short, to speed up schema creation. There are many tricks and ways you can manipulate data in DGL, though this guide will focus on the basics to start.

Prerequisites for This Tutorial

The Data We’ll Be Working With in This Tutorial

We’ll be utilizing this GitHub repo and its data for this tutorial. The data comes to us from the fantastic folks at pokeapi. I took the liberty of cleaning the data files and choosing the ones that had relationships applicable to a graph database. I’ve also changed the file and header names to help newcomers better understand what’s happening inside DSE Graph. Obviously you might not have this luxury with larger, more complex data files. However, if you have the option, I highly recommend making sure key names are uniform across files. This will help immensely from an organization and coding perspective. Happy data == happy developer.

So how does the DataStax Graph Loader work? 

To put it simply, the DataStax Graph Loader (DGL) takes a mapping script that includes a file input and loads the data into the respective vertices or edges. That logic is handled by mapping functions within the mapping script. That’s not all, though: it will also cache the natural keys of any vertex you specify within the mapper functions. This avoids inserting duplicate vertices within the same DGL run and speeds up edge creation. Additionally, DGL can suggest a schema from the properties in a file, based on how a mapper function is coded. Let’s take a closer look using some nice and nerdy Pokemon data.

 

 

 

I Choose You, First Mapper Function!

Let’s take a look at the pokemon.csv file. The header contains:

pokemon_id,pokemon,species_id,height,weight,base_experience

pokemon_id is the integer (Int) associated with a specific Pokemon. “pokemon_id” is important because this graph is Pokemon-centric (Pokemon360) and will use this ID when linking edges to their respective vertices. The rest of the headers are propertyKeys and their respective values.

We know that we want a vertex label called “pokemon”, we have a pokemon.csv file that contains our data respective to the pokemon vertex, and we know that pokemon_id is going to be our identifier. That means we have everything we need to create a mapper function!

First, we’ll need to take a file input (we’ll talk about input directories further down) to wrap our function around:


load(pokemonFile).asVertices {

}

Here, we’re telling DGL that we want to load ‘pokemonFile’ (which will map to the ‘pokemon.csv’ file) as vertices. Next, we’ll want to tell DGL the name of our vertex label as well as the identifier, or key, we want to use:


load(pokemonFile).asVertices {
label "pokemon"
key "pokemon_id"
}

That’s it! Pretty simple right? DGL will automatically take the rest of the header and map the data to the respective propertyKeys. There’s obviously a lot more you can do within a mapper function such as transform and manipulate the data but that’s a lesson for another time. There are plenty of vertices you can create with the other data files provided, so give it a shot!

Living on the Edge

I haven’t played a Pokemon game since the originals, but there’s one thing most of us know: Pokemon evolve into other Pokemon. That relationship can look like pokemon – (evolves_to) → pokemon. So how do we tell DGL about that relationship?

 

One of the files I’ve created for the evolution chain is called “pokemon_evolves_to.csv”. You’ll notice that I like to name my respective vertex files and edge files to correspond with the relationships they represent. It’s another way to stay organized when dealing with a lot of files. Inside the CSV, you’ll notice there are two identifiers – ‘from_id’ and ‘to_id’. Both of these represent a pokemon_id as one Pokemon evolves to another.

One thing to keep in mind is that you always want to have the outgoing and incoming vertex ID in the same file for a respective edge. That’s how DGL can load the right edges between the respective vertices.

Just as before, you’ll need to tell DGL what you’re loading:

load(pokemonevolutionEdgeFile).asEdges {

}

Notice how we’re loading this file asEdges now. Next, we’ll need to give this edge a label:

load(pokemonevolutionEdgeFile).asEdges {
    label "evolves_to"

}

Here comes the tricky part. We’ll need to tell DGL which keys in the CSV belong to the outgoing and incoming vertices and which respective vertex labels they correspond to:

load(pokemonevolutionEdgeFile).asEdges {
    label "evolves_to"
    outV "from_id", {
        label "pokemon"
        key "pokemon_id"
    }
    inV "to_id", {
        label "pokemon"
        key "pokemon_id"
    }
}

Remember, ‘from_id’ in the CSV file represents the original ‘pokemon_id’ and ‘to_id’ represents the ‘pokemon_id’ it evolves to. Because the evolution goes from one Pokemon to another, the label and key are the same in the outV function and the inV function. There aren’t any additional properties associated with this CSV but if there were, they’d also get loaded into the respective edge.

Keep in mind there are plenty of edges you can create between the vertices you made above – check out the files in the /data/edge/ directory and give it a go!

Dude, Where’s My Data? 

Next, we’ll quickly look at how to specify input files. By now, you know I’m all about organization. Therefore, I like to split my data directories between vertex and edge data. It’s not necessary, because the files can be read from the same directory, but it’s something I like to do.

First, we need our input directory:

inputfileV = '/path/to/data/vertices/'
inputfileE = '/path/to/data/edges/'

Then we’ll map the actual files:

pokemonFile = File.csv(inputfileV + "pokemon.csv").delimiter(',')
pokemonevolutionEdgeFile = File.csv(inputfileE + "pokemon_evolves_to.csv").delimiter(',')

As you can see, we’re telling DGL that the file is a CSV and has a specific delimiter. You can also provide a header if the files don’t have one by adding:

.header('header1', 'header2', ...)

Also note that DGL is capable of loading much more than just CSV files. Check out our documentation for more info on this.
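For instance, a hypothetical headerless file could be wired up by reusing the pieces above (the file name and header values here are made up for illustration):

// hypothetical: a CSV without a header row, mapped with an explicit header
typesFile = File.csv(inputfileV + "types.csv").delimiter(',').header('type_id', 'type')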

Are We There Yet…?

Almost. Now you’ve created your first mapping script masterpiece. With all the pieces together, it should look something like the code linked here. Those with a keen eye will notice that there is an additional config line at the top of our mapping script. These can be passed via command line arguments as well. To get a full list of what you can do and what they mean, take this doc reference for a spin. Here’s what the ones we specified mean:

create_schema: Whether to update or create the schema for missing schema elements.
load_new: Whether the vertices loaded are new and do not yet exist in the graph.
load_vertex_threads: Number of threads to use for loading vertices into the graph
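For reference, here is a hedged sketch of what those settings can look like when placed at the top of the mapping script itself (the values are illustrative only):

config create_schema: true, load_new: true, load_vertex_threads: 3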

 

Ok but Wait, What About My Schema? 

You have two options:

  1. Type out all of the schema by hand (hint: there are about 75 or more lines)
  2. Let DGL suggest and create the schema

I’m going to assume you want to go with option #2. Because we created the vertex and edge logic within our mapper functions, DGL can infer the proper schema and indexes. To take advantage of that, we’ll need to run DGL in dryrun mode so that it loads no data. You always want to have an explicit schema when loading the actual data – that will make a huge difference in performance. Here’s what the command looks like:

./graphloader /path/to/poke_mapper.groovy -graph <graph name> -address <ip to DSE> -dryrun=true

When it finishes running, you’ll see the output that is your schema. You may notice that it has inferred something as Text when it should be an Int. Feel free to change that in the schema before applying it. The end result for this example is linked here. I usually take the schema, load it via DSE Studio to have a visual aid, and watch the fireworks:


Let’s Load Some Data For Reals

This is it! Let’s get our data actually loaded into the graph. The command is fairly similar to the one we used above:

./graphloader /path/to/poke_mapper.groovy -graph <graph name> -address <ip to DSE>

 

 

Thanks for reading!
Marc Selwan  |  Solutions Engineering – West  |  Twitter  |  For questions, feel free to reach out on the DataStax Academy Slack: @Marcinthecloud

Webinar Q&A Follow-up – DataStax Enterprise 5.1: 3X the operational analytics speed, help for multi-tenant SaaS apps, & other shiny things


In this week’s webinar, DataStax Enterprise 5.1: 3X the operational analytics speed, help for multi-tenant SaaS apps, & other shiny things, our awesome Technical Evangelist, David Gilardi and I had the opportunity to give a sneak peek at what’s in store in our upcoming April 4th release of DataStax Enterprise 5.1, DataStax OpsCenter 6.1, and DataStax Studio 2.0. Below you’ll find the answers to the Q&A from the webinar.

For those that missed out on the fun, don’t worry. We have a recording and the accompanying slides waiting for you here.

 

Webinar begins – 8:00 AM
Kevin G. 

Q:  Is DSE Graph separately licensed?

A: DSE Graph is an optional addition to a DSE Standard or Max subscription. Also worth noting, like the rest of DSE, it is free for development use.

P.S. if you’re new to DSE Graph, be sure to check out our graph example data sets and free DataStax Academy Graph course, DS330

P.P.S. (self-promotion warning) our next webinar will be all about using DSE Graph to get a complete view of your customers.
________________________________________________________________

Brady G.  – 8:19 AM
Q: What are the performance improvements to OpsCenter?  

A: Performance updates are specific to backup and restore operations using Amazon S3. Backup time to Amazon S3 has been reduced by upwards of 25% compared to OpsCenter 6.0.x versions, and restore time from Amazon S3 has been reduced by upwards of 90% compared to OpsCenter 6.0.x versions.

________________________________________________________________

Sreenivasu N. – 8:18

Q: With a search index, can we do that without Solr?

Q: Can we see incoming connections from applications and queries (similar to SQL trace) thru opcenter?

A: The OpsCenter Performance Service will allow users to track and trace slow queries in a cluster, but there is no current way to see incoming connections from your application.  Check out the following link regarding slow queries.  https://docs.datastax.com/en/latest-opsc/opsc/online_help/services/perfServiceSlowQueries.html

As always, in DSE 5.0 and lower, please use this feature carefully, as aggressive thresholds may result in lots of queries being tracked and logged, thereby putting excessive load on the DSE server. DSE 5.1 addresses some of these gaps, but it is still not recommended to use the slow query log as a tracing tool.
_______________________________________________________________

Michael G. – 8:20 AM
Q:  Are you planning to add support for backups to Azure blobs?  
A: We think Azure blobs will fit into our roadmap. It’s currently being investigated and in the planning stage.
________________________________________________________________

Sam B. – 8:21 AM
Q:  With the new “OpsCenter before agents” upgrade – will that let me upgrade OpsCenter to 6.1, then push updated agent binaries out using LifeCycle Manager?

Sam B.  – 8:28 AM
Q:  Aha – yes you can 🙂

A: Yes, that is correct.  You can upgrade your version of OpsCenter and then push agent binaries at a later time. The key is being on version 6.1 or later; if you have previous versions, you will need to get those updated before you can use this feature.
________________________________________________________________

Aravind Y. – 8:22 AM
Q:  What are the use cases of DSEFS?

A: DataStax Enterprise File System, or DSEFS, is a DataStax Enterprise drop-in replacement for the Hadoop File System (HDFS). It fits any use case you have for storing large files – think documents like PDFs (I wouldn’t recommend video) – while storing any of their accompanying metadata within DSE’s core, Apache Cassandra™. In short, we introduced DSEFS to remove the complexity (looking at you, ZooKeeper) and potential failure scenarios of HDFS.

We’ll have a longer deep dive blog post on DSEFS in the coming weeks!

________________________________________________________________

Sean D.  – 8:40 AM
Q:  What versions of DSE does OpsCenter 6.1 support?

A: DataStax OpsCenter 6.1 supports DataStax Enterprise 4.8.x, 5.0.x, and 5.1.x.
________________________________________________________________
Alexey D.  – 8:41 AM

Q: please tell more about multitenancy improvements

Venkatesan S.- 8:41 AM
Q:  I am specifically interested in the multi-tenant feature as we are about to start building a SaaS application. I would like to get more details about how multi-tenancy works. Is there a blog post planned?

A: We’re really excited for an improved multi-tenancy story, too. This is all made possible with Row Level Access Control (RLAC); being able to secure specific rows of your table makes multi-tenancy much more manageable, with a straightforward config. We will have a blog post and documentation that deep dive into it, but until then here are some additional details:

  • Row level access control (RLAC) provides partition-based authorization to data within a table. RLAC improves multi-tenancy and allows for highly customizable tables that determine who is able to see and modify certain subsets of data.
  • Proxy authentication supports physical connection to the database as one user, and queries to the database as distinct users. Provides auditing and maximizes efficiency when connecting to RLAC tables.

In terms of configuration it is a two step process that goes a little like this (courtesy of our amazing docs team):

  1. Select a column on the table on which permissions will be configured. Set a UTF8 partition key column; only one filtering column per table is allowed:

RESTRICT ROWS ON [keyspace_name.]table_name
USING partition_key_column;

Existing filters (if any) now filter on this column. Display the settings using DESCRIBE TABLE:

DESCRIBE TABLE table_name;

The caching options show the rows per partition column:

caching = {'keys': 'ALL', 'rows_per_partition': 'column_name'}

  2. Assign table permissions to the roles:

GRANT permission ON
'filtering_string' ROWS IN [keyspace_name.]table_name
TO role_name;

Here, filtering_string is the text string to match; use the asterisk (*) wildcard to match any number of characters and the question mark (?) to match any single character.

Note: RLAC is unlimited; grant access using any number of filters on the selected partition column.

The permission is applied to the role immediately, even for active sessions. Use the LIST command to display the settings:

LIST ALL PERMISSIONS OF role_name;
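Putting both steps together with hypothetical names (a keyspace saas, a table customer_events whose partition key is tenant_id, and a role tenant_42), it might look like:

RESTRICT ROWS ON saas.customer_events USING tenant_id;
GRANT SELECT ON 'tenant_42' ROWS IN saas.customer_events TO tenant_42;
LIST ALL PERMISSIONS OF tenant_42;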

Hope that gives you a better sense of what it will look like. Additionally, keep an eye out for a blog post coming soon.

________________________________________________________________
Colin S.

Q: We found that a DSE node with Solr enabled that the max amount of data per node is about 500GB.  With 5.1, does that max amount of storage  increase?…we are currently on DSE 4.8.9

A: Hey Colin, the 500GB is a guideline; it’s where we advise data size to be. We are actively trying to increase that, but it will be post 5.1.

________________________________________________________________

Arkajit B.

Q: Can you tell us more about multi data centre support?

Q: Are there any new features that will keep multiple clusters in separate data centres in sync?

A: Good questions – replication and data distribution are my favorite capabilities of DataStax Enterprise. We have some great videos in our free DataStax Academy course DS201: DataStax Enterprise Foundations of Apache Cassandra that touch on these very topics; the Replication/Consistency section in particular will help answer your questions best. In terms of new features, I think you would be most interested in Advanced Replication. Although not new, it has made some major improvements to managing clusters across data centers in new ways with new tools. We’ll have a developer post on those areas soon.

_______________________________________________________________

Gehrig K. – 8:40 AM

Q: “Why is David so good at live demos?”

A: Because Gehrig offered such awesome support through the whole process and I had already had my “uh oh” moment about an hour before the talk when I crashed my laptop.  Through no law at all this meant that I was safe during the live demo portion.

________________________________________________________________

Radovan L. – 8:41 AM
Q:  thank you , very impressive  

Vladislav K.  – 8:41 AM
Q:  Thank you!

Jason B.

Q: Thanks for putting this on, was put together well and informative.

A: Thanks to everyone for tuning in! David and I loved talking with you and hope you’re as excited for 5.1 as we are. Be sure to join our DataStax Community on Slack.  

 

David Gilardi, editor’s note:  Hey all, I wanted to clear up a mistake during the Studio 2.0 demo when I referenced the CQL tracing and Consistency Level abilities. I misspoke and stated “replication factor” when I really meant “consistency level”. Hopefully I didn’t confuse anyone.

 

DataStax Java Driver: 3.2.0 released!


The Java driver team is pleased to announce that version 3.2.0 of the DataStax Java driver for Apache Cassandra® has just been released.

We would like to stress that this release has been made possible thanks to strong and vibrant community involvement. Our team and our company are thankful to all of you who contributed, directly or indirectly, to making our driver even better, with a special shout-out to three GitHub users: @avivcarmis, @wimtie and @gnoremac.

In this post, we will focus on the most significant changes brought by this new release; you can also consult the full changelog and our upgrade guide for details about how to migrate existing applications.

Performance and Usability Enhancements

This release is the first one to provide full compatibility with recent Guava versions up to 21.0 inclusive. By default, the driver now uses version 19.0, but any version from 16.0.1 to 21.0 can be used.

According to our internal benchmarks, this version is also noticeably faster than any previous version in the 3.x series thanks to a few performance improvements, among them a performance boost in a core component of the driver. Users are strongly encouraged to migrate to 3.2.0 as soon as possible to benefit from these.

Enhanced SSL options

The current SSLOptions interface does not allow implementors to perform hostname verification. Thanks to JAVA-1364, this is now possible, using the newly-introduced RemoteEndpointAwareSSLOptions sub-interface, that exposes a new method giving implementors access to the remote endpoint address.

This new sub-interface comes with two new concrete implementations: RemoteEndpointAwareJdkSSLOptions and RemoteEndpointAwareNettySSLOptions. Note that the old SSLOptions interface and its direct implementations are now deprecated. Read more about SSL configuration in our online documentation.
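As a minimal sketch (assuming a contact point of 127.0.0.1 and the JVM’s default SSLContext; in practice you would load a context from your own keystore/truststore), wiring the new options into a Cluster might look like this:

import javax.net.ssl.SSLContext;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.RemoteEndpointAwareJdkSSLOptions;
import com.datastax.driver.core.SSLOptions;

public class SslExample {
  public static void main(String[] args) throws Exception {
    // the endpoint-aware options enable hostname verification (JAVA-1364)
    SSLContext sslContext = SSLContext.getDefault();
    SSLOptions sslOptions = RemoteEndpointAwareJdkSSLOptions.builder()
        .withSSLContext(sslContext)
        .build();

    Cluster cluster = Cluster.builder()
        .addContactPoint("127.0.0.1")
        .withSSL(sslOptions)
        .build();
    cluster.connect();
    cluster.close();
  }
}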

Mapper Configuration

This release brings substantial enhancements to the driver’s object mapper; thanks to JAVA-1310 and JAVA-1316, users can now customize certain aspects of the mapping process itself by defining strategies.

For example, by default the mapper tries to map all the properties found in your Java data model. You might want to take an “opt-in” approach instead and only map the ones that are explicitly annotated with @Column or @Field; this is now possible thanks to a dedicated strategy. Similarly, you might want to tell the mapper to ignore certain parent classes in your class hierarchy; this is also made possible by another strategy.

Another common need is to customize the way Cassandra column names are inferred. Out of the box, Java property names are simply lowercased: a userName Java property is mapped to the username column. It is now possible to map it to something else instead, e.g. user_name or USERNAME, by simply configuring the relevant strategy.

Strategies can be supplied through a new class, MappingConfiguration. Read more about the object mapper configuration in our online documentation.

Support for the duration type

This release also brings full support for the new CQL duration type. This type has been introduced in Apache Cassandra® 3.10, and although its primary purpose is to be used in CQL statements to specify either restrictions – e.g. WHERE time < now() + 2h – or aggregations – e.g. GROUP BY floor(time, 2h) –, it can also be used as a regular CQL type. The driver now supports such a usage through the new Duration class (however, there are no dedicated methods for it in the Row interface nor in the BoundStatement class; users are expected to set and retrieve Duration instances through the generic get() and set() methods).
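A small sketch of that usage (the table and column names below are made up, and session is assumed to be an already-connected Session):

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.Duration;
import com.datastax.driver.core.Row;

// assumes a table like: CREATE TABLE timings (id int PRIMARY KEY, elapsed duration)
BoundStatement bs = session.prepare("INSERT INTO timings (id, elapsed) VALUES (?, ?)")
    .bind()
    .setInt("id", 1)
    .set("elapsed", Duration.from("1h30m"), Duration.class); // generic setter, no setDuration() exists
session.execute(bs);

Row row = session.execute("SELECT elapsed FROM timings WHERE id = 1").one();
Duration elapsed = row.get("elapsed", Duration.class); // generic getter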

Getting the driver

As always, the driver is available from Maven and from our downloads server.

We’re also running a platform and runtime survey to improve our testing infrastructure. Your feedback would be most appreciated.

Property Graph Algorithms


The term property graph has come to denote an attributed, multi-relational graph. That is, a graph where the edges are labeled and both vertices and edges can have any number of key/value properties associated with them. An example of a property graph with two vertices and one edge is diagrammed below.

Property graphs are more complex than the standard single-relational graphs of common knowledge. The reason for this is that there are different types of vertices (e.g. people, companies, software) and different types of edges (e.g. knows, works_for, imports). The complexities added by this data structure (and multi-relational graphs in general, e.g. RDF graphs) affect how graph algorithms are defined and evaluated.

Standard graph theory textbooks typically present common algorithms such as various centralities, geodesics, assortative mixings, etc. These algorithms usually come pre-packaged with single-relational graph toolkits and frameworks (e.g. NetworkX, iGraph). It is common for people to desire such graph algorithms when they begin to work with property graph software. I have been asked many times:

“Does the property graph software you work on support any of the common centrality algorithms? For example, PageRank, closeness, betweenness, etc.?”

My answer to this question is always:

“What do you mean by centrality in a property graph?”

When a heterogeneous set of vertices can be related by a heterogeneous set of edges, there are numerous ways in which to calculate centrality (or any other standard graph algorithm for that matter).

  1. Ignore edge labels and use standard single-relational graph centrality algorithms.
  2. Isolate a particular “slice” of the graph (e.g. the knows subgraph) and use standard single-relational graph centrality algorithms.
  3. Make use of abstract adjacencies to compute centrality with higher-order semantics.

The purpose of this blog post is to stress point #3 and the power of property graph algorithms. In Gremlin, you can calculate numerous eigenvector centralities for the same property graph instance. At this point, you might ask: “How can a graph have more than one primary eigenvector?” The answer lies in seeing all the graphs that exist within the graph—i.e. seeing all the higher-order, derived, implicit, virtual, abstract adjacencies. The lines below exemplify points #1, #2, and #3 in the list above, respectively. The code examples use the power method to calculate the vertex centrality rankings, which are stored in the map m.

// point #1 above
g.V().repeat(out().groupCount(m)).times(10) 

// point #2 above
g.V().repeat(out("knows").groupCount(m)).times(10)

// point #3 above
g.V().repeat(???.groupCount(m)).times(10) 

The ??? in the third traversal refers to the fact that it can be any arbitrary computation. For example, ??? can be:

// point #1 below
out('works_for').in('works_for')

// point #2 below
out('works_for').has('name','ACME').in('works_for')

// point #3 below
where(out('develops').out('imports').has('name','Blueprints')).
  out('works_for').in('works_for').
where(out('develops').out('imports').has('name','Blueprints')) 

The above expressions have the following meaning:

  1. Coworker centrality.
  2. ACME Corporation coworker centrality.
  3. Coworkers who import Blueprints into their software centrality.

There are numerous graphs within the graph. As such, “what do you mean by centrality?”

 

These ideas are explored in more detail in the following article and slideshow.

Rodriguez M.A., Shinavier, J., “Exposing Multi-Relational Networks to Single-Relational Network Analysis Algorithms,” Journal of Informetrics, 4(1), pp. 29-41, Elsevier, doi:10.1016/j.joi.2009.06.004, 2009.

DSE Continuous Paging Tuning and Support Guide


Introduction

Continuous paging (CP) is a new method of streaming bulk amounts of records from DataStax Enterprise to the DataStax Java Driver. It is disabled by default and can be activated and used only when running with DataStax Enterprise (DSE) 5.1. When activated, all read operations executed from the CassandraTableScanRDD will use continuous paging. Continuous Paging is an opt-in feature and can be enabled by setting spark.dse.continuous_paging_enabled to true as a Spark configuration option. The configuration can be set in:

  • the spark-defaults file: spark.dse.continuous_paging_enabled true
  • on the command line using --conf spark.dse.continuous_paging_enabled=true
  • or programmatically inside your application using the SparkConf:
    conf.set("spark.dse.continuous_paging_enabled", "true")

Continuous Paging cannot be enabled or disabled once the Spark Context has been created.

When to enable Continuous Paging?

Any operation that has reading from Apache Cassandra® as its bottleneck will benefit from continuous paging. It will use considerably more Cassandra resources when running, so it may not be best for all use cases. While this feature is opt-in in 5.1.0, it will be enabled by default in future DSE releases.

How it works

Continuous paging increases read speed by having the server continuously prepare new result pages in response to a query. This removes the per-page request/response cycle between the DSE server and the DSE Java driver: the reading machinery inside of DSE can operate continuously without constantly being restarted by the client as in normal paging. DSE Java Driver sessions start these ContinuousPaging requests using the executeContinuously method. The Spark Cassandra Connector implements a custom Scanner class in the DseCassandraConnectionFactory which overrides the default table scan behavior of the OSS connector and uses the executeContinuously method. (Continuous Paging cannot be used with DSE Search and will be automatically disabled if a Search query is used.) Because this is integrated into the DSE server and the DseCassandraConnectionFactory, there is no way to use Continuous Paging without using DSE.

When running, the Continuous Paging session will operate separately from the session automatically created by the Spark Cassandra Connector, so there will be no conflict between Continuous Paging and non-Continuous Paging queries/writes. A cache is utilized so there will never be more than one CP session per executor core in use.

How to determine if CP is being used

CP will be automatically disabled if the target of the Spark Application is not DSE or is not CP capable. If this happens it will be logged within the DseCassandraConnectionFactory object. To see this, add the following line to your logback-spark and logback-spark-executor xml files in the Spark conf directory:

<logger name="com.datastax.bdp.spark.DseCassandraConnectionFactory" level="DEBUG"/>
<logger name="com.datastax.bdp.spark.ContinuousPagingScanner" level="DEBUG"/>

The executor logs will then contain messages describing whether or not they are using continuous paging. Executor logs can be found in
/var/lib/spark/work/$app-id/$executor-id/stdout or by navigating to the Executor tab of the Spark Driver UI on port 4040 of the machine running the driver.

Most usages of continuous paging do not require any instantiation on the Spark Driver so there will most likely be no references to the ContinuousPaging code in the Driver logs.

Most Common Pain Points

Most CP errors will manifest themselves as tasks failing in the middle of the Spark job. These failures will be immediately retried and most likely will succeed on a second attempt. This means jobs will not necessarily fail due to these exceptions but they may cause jobs to take longer as some of the tasks need to be redone. The following are some of the possible situations.

Server Side Timeouts (Client isn’t reading fast enough)

Since the server is now constantly serving up pages as quickly as it can, there is a built-in timeout on the server to prevent holding a page indefinitely when the client isn’t actually trying to grab the page. This will mostly manifest as an error in your Spark Driver logs saying that the server “Timed out adding page to output queue”:

Failed to process multiple pages with error: Timed out adding page to output queue
org.apache.cassandra.exceptions.ClientWriteException: Timed out adding page to output queue

 

The timeout being exceeded is configured in cassandra.yaml:

continuous_paging:
    max_client_wait_time_ms: 20000

The root cause of this timeout is most likely the Spark executor JVM garbage collecting for a very long period of time. Garbage collection pauses execution, so the client does not request pages from the server for a period of time. The Driver UI for the Spark application contains information about GC in the Spark Stage view; every task lists the total amount of GC-related pause time for the task. If a particular GC pause exceeds max_client_wait_time_ms, the server will throw the above error. Depending on the frequency of GC, there are two different mitigations.

Sporadic GC

If the GC is sporadic but successful at cleaning out the heap, then the JVM is most likely healthy but producing a lot of short-lived objects. Increasing max_client_wait_time_ms, or reducing the executor heap size to shorten the pauses, should be sufficient to avoid the error.

To retain parallelism when shrinking the JVM heap, instead of using one large JVM with multiple threads you can spawn several smaller executor JVMs on the same machine. This can be specified per Spark Driver application using the spark.executor.cores parameter. This will spawn x executor JVMs per machine, where x = total number of cores on the worker / spark.executor.cores. Setting this must be done with a corresponding change in spark.executor.memory, otherwise you will use up all the RAM on the first executor JVM. Because of RAM limitations, the number of executor JVMs that will be created is actually

Min( AvailableSparkWorkerCores / spark.executor.cores, AvailableSparkWorkerMemory / spark.executor.memory )
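As an illustrative sketch (the numbers are examples, not recommendations), the pair of settings could be applied like this:

import org.apache.spark.SparkConf

// e.g. with 16 worker cores and 40g of worker memory, min(16 / 4, 40 / 8) = 4 executor JVMs per machine
val conf = new SparkConf()
  .set("spark.executor.cores", "4")
  .set("spark.executor.memory", "8g")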

Heavy GC

If the executor is running into a GC death spiral (very very high GC, almost always in GC, never able to actually clean the heap) then the amount of data in a Spark Executor Heap needs to be reduced. This can only be done on an application by application basis but basically means reducing the number of rows or elements in a single Spark partition or reducing cached objects.

For DSE, *decreasing* the tuning parameter spark.cassandra.input.split.size_in_mb will reduce the number of CQL partitions in a given Spark partition, but this may not be sufficient if more data is added by later operations.

Overloading the heap can also be caused by caching large datasets on the executor, exhausting available RAM. To avoid this, try using the DISK_ONLY based caching strategies instead of MEMORY_ONLY. See the API docs.
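For example (a sketch; someLargeDataFrame stands in for whatever dataset you are caching):

import org.apache.spark.storage.StorageLevel

// spill cached partitions to local disk instead of holding them on the executor heap
val cached = someLargeDataFrame.persist(StorageLevel.DISK_ONLY)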

Client Side Timeouts

Conversely, the client can ask the server for pages while the server is under load and fails to return records in time. This timeout manifests with the following error message:

Lost task 254.0 in stage 0.0 (TID 105, 10.200.156.150): java.io.IOException: Exception during execution of SELECT "store", "order_time", "order_number", "color", "qty", "size" FROM "ks"."tab" WHERE token("store") > ? AND token("store") <= ?   ALLOW FILTERING: [ip-10-200-156-150.datastax.lan/10.200.156.150:9042] Timed out waiting for server response

These are caused by the client having to wait more than spark.cassandra.read.timeout_ms before the next page can be returned from the server. The default for this is 2 minutes, so hitting it means there was a very long delay. This can be caused by a variety of reasons:

  • The DSE Node is undergoing Heavy GC
  • The node went down after the request (the CPE will need to be restarted on another machine)
  •  Some other network failure

If the error is due to excessive load on the DSE cluster, there are several knobs for tuning the throughput of each executor in the Spark job. Sometimes running at a consistent but slower-than-max speed can be beneficial.

Performance

In most tests we have seen roughly a 2 to 3.5 fold improvement over the normal paging method used by DSE. These gains are made using the default settings, but there are two main knobs at the session level for adjusting throughput:

  • reads_per_sec – MaxPagesPerSecond
  • fetch.size_in_rows – PageSize in Rows

They can be set at the configuration level using

spark.cassandra.input.reads_per_sec
spark.cassandra.input.fetch.size_in_rows

or programmatically on a per read basis using the ReadConf object.

The defaults are unlimited pages per second and 1000 rows per page. *Note: These settings are per executor*

Example:

40 Spark Cores
100 Reads per Sec
1000 Rows Per Page
40 Executors * (100 max pages / second * 1000 rows/ page) / Executor  = 4000000 Max Rows / Second

Since the max throughput is also limited by the number of Cores, this can also be adjusted to change throughput.
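As a sketch, throttling those per-executor knobs through the Spark configuration (mirroring the numbers in the example above) would look like:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.cassandra.input.reads_per_sec", "100")       // max pages per second, per executor
  .set("spark.cassandra.input.fetch.size_in_rows", "1000") // rows per page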

Reading and Writing from Cassandra in the Same Operation

It is important to note that Continuous Paging is only beneficial when reading from Cassandra is the bottleneck in the pipeline. This means that other Spark operations can end up limiting throughput. For example, reading and then writing to Cassandra (sc.cassandraTable().saveToCassandra) can end up limiting the speed of the operation to the write speed of Cassandra. In these cases, continuous paging will provide limited, if any, benefit.

Dse Graph Frame


The DseGraphFrame package provides a Spark-based API for bulk operations and analytics on DSE Graph. It is inspired by Databricks’ GraphFrame library and supports a subset of the Apache TinkerPop Gremlin graph traversal language. It supports reading DSE Graph data into a GraphFrame and writing GraphFrames from any format supported by Spark into DSE Graph.

The package ties the DSE Analytics and DSE Graph components together even more tightly than before! Spark users will have direct access to the graph, and Graph users will be able to perform bulk deletes and updates and have an advanced API for faster scanning operations.

In this blog post I will cover the main DseGraphFrame advantages and operations:

  1. Importing DSE Graph data into Spark DataFrames and GraphFrames
  2. Updating or deleting Vertex and Edges in DSE Graph
  3. Inserting Edges
  4. Inserting Vertices with custom ids.
  5. Combining Graph and non-graph data
    1. Update graph element properties
    2. Spark Streaming
    3. Join with any Spark supported sources
  6. API support for graph manipulation and graph algorithms:
    1. Spark GraphFrames http://graphframes.github.io
    2. A Subset of TinkerPop3 Gremlin
  7. Data Loading from various sources including Cassandra, JDBC, DSEFS, S3

DseGraphFrame has both Java and Scala Spark APIs at this moment.


 


 


 

DSE Graph

DSE Graph is built with Apache TinkerPop and fully supports the Apache TinkerPop Gremlin language. A Graph is a set of Vertices and the Edges that connect them. Vertices have a mandatory unique id. Edges are identified by their two ends, inV() and outV(). Both Edges and Vertices have labels that define the type of the element, and a set of properties.

Create Example

Let’s create a toy “friends” graph in the gremlin-console that will be used in the following examples:

Start dse server with graph and spark enabled.

#> dse cassandra -g -k


Run gremlin console

#> dse gremlin-console


Create empty graph

system.graph("test").create()


Create a short alias ‘g’ for it and define the schema

:remote config alias g test.g
//define properties
schema.propertyKey("age").Int().create()
schema.propertyKey("name").Text().create()
schema.propertyKey("id").Text().single().create()
//define vertex with id property as a custom ID
schema.vertexLabel("person").partitionKey("id").properties("name", "age").create()
// two types of edges without properties
schema.edgeLabel("friend").connection("person", "person").create()
schema.edgeLabel("follow").connection("person", "person").create()

Add some vertices and edges

Vertex marko = graph.addVertex(T.label, "person", "id", "1", "name", "marko", "age", 29)
Vertex vadas = graph.addVertex(T.label, "person", "id", "2", "name", "vadas", "age", 27)
Vertex josh = graph.addVertex(T.label, "person", "id", "3", "name", "josh", "age", 32)
marko.addEdge("friend", vadas)
marko.addEdge("friend", josh)
josh.addEdge("follow", marko)
josh.addEdge("friend", marko)


You can use DataStax Studio for visualizations and to run commands instead of gremlin-console.

https://docs.datastax.com/en/latest-studio/studio/gettingStarted.html

DSE Graph OLAP

DSE Graph utilises the power of the Spark engine for deep analytical queries with Graph OLAP. It is easily accessed in gremlin console or Studio with ‘.a’ alias:

:remote config alias g test.a

DSE Graph OLAP has broader support for TinkerPop than the DseGraphFrame API. While Graph OLAP is best for deep queries (those requiring several edge traversals), simple filtering and counts are much faster with the DseGraphFrame API.

gremlin> g.V().has("name", "josh").out("friend").out("friend").dedup()
==>v[{~label=person, id=2}]
==>v[{~label=person, id=3}]

 

DseGraphFrame

DseGraphFrame represents a graph as two virtual tables: a Vertex DataFrame and an Edge DataFrame.

Let’s see how the graph looks in Spark:

#> dse spark
scala> val g = spark.dseGraph("test")
scala> g.V.show
+---------------+------+---+-----+---+
|             id|~label|_id| name|age|
+---------------+------+---+-----+---+
|person:AAAAAjEx|person|  1|marko| 29|
|person:AAAAAjEz|person|  3| josh| 32|
|person:AAAAAjEy|person|  2|vadas| 27|
+---------------+------+---+-----+---+

 

scala> g.E.show

 

+---------------+---------------+------+--------------------+
|            src|            dst|~label|                  id|
+---------------+---------------+------+--------------------+
|person:AAAAAjEx|person:AAAAAjEy|friend|29ea1ef0-139a-11e...|
|person:AAAAAjEx|person:AAAAAjEz|friend|2d65b121-139a-11e...|
|person:AAAAAjEz|person:AAAAAjEx|friend|33de7dc0-139a-11e...|
|person:AAAAAjEz|person:AAAAAjEx|follow|33702b91-139a-11e...|
+---------------+---------------+------+--------------------+

 

DseGraphFrame uses a GraphFrame-compatible format.  This format requires the Vertex DataFrame to have only one ‘id’ column and the Edge DataFrame to have hardcoded ‘src’ and ‘dst’ columns. Since DSE Graph allows users to define any arbitrary set of columns as the Vertex id and since there is no concept of `labels` in GraphFrame, DseGraphFrame will serialize the entire DSE Graph id into one ‘id’ column. The label is represented as part of the id and also as ‘~label’ property column.

DseGraphFrame methods list:

gf() returns a GraphFrame object for GraphFrame API usage
V() returns a DseGraphTraversal[Vertex] object to start a TinkerPop vertex traversal
E() returns a DseGraphTraversal[Edge] object to start a TinkerPop edge traversal
cache(), persist() cache the graph data with Spark
deleteVertices(), deleteEdges() delete vertices or edges
deleteVertexProperties(), deleteEdgeProperties() delete properties from the DB; they do not change the schema
updateVertices(), updateEdges() change properties or insert new vertices and edges

 

DseGraphFrameBuilder

DseGraphFrameBuilder is a factory for DseGraphFrame.

Java API

The Java API is the same, excluding graph initialization:

//load a graph
DseGraphFrame graph = DseGraphFrameBuilder.dseGraph("test", spark);
//check some vertices
graph.V().df().show()

Scala API

A Scala implicit adds the factory dseGraph() method to a Spark session, so the Scala version is shorter:

// load a graph
val graph = spark.dseGraph("test_graph")
//check some vertices
graph.V.show

Spark GraphFrame

Java API

Use DseGraphFrame.gf() to get a GraphFrame.

Use the DseGraphFrameBuilder.dseGraph(String graphName, GraphFrame gf) method to convert back to a DseGraphFrame.

Scala API

Scala provides implicit conversions from GraphFrame to DseGraphFrame and back. It also converts GraphTraversals to DataFrames, so both GraphFrame filtering and TinkerPop traversal methods can be mixed.

TinkerPop3 Gremlin Support

The DseGraphFrame API supports a limited subset of the Gremlin language that covers basic traversal and update queries. TinkerPop traversals are generally more clear and intuitive compared to the GraphFrame motif search queries, so we recommend using Gremlin if possible.

See the example of finding all of Josh’s friends of friends:

//TinkerPop Gremlin in spark shell
scala> g.V.has("name", "josh").out("friend").out("friend").show
//GraphFrame motif finding is less readable
scala> g.find("(a)-[e]->(b); (b)-[e2]->(c)").filter("a.name = 'josh' and e.`~label` = 'friend' and e2.`~label` = 'friend'").select("c.*").show

Both outputs are the same, but the Gremlin version is much shorter and more readable.

+---------------+------+---+-----+---+
|             id|~label|_id| name|age|
+---------------+------+---+-----+---+
|person:AAAAAjEz|person|  3| josh| 32|
|person:AAAAAjEy|person|  2|vadas| 27|
+---------------+------+---+-----+---+

List of Gremlin query methods supported by DseGraphFrame:

Step Method
CountGlobalStep count()
GroupCountStep groupCount()
IdStep id()
PropertyValuesStep values()
PropertyMapStep propertyMap()
HasStep has(), hasLabel()
IsStep is()
VertexStep to(), out(), in(), both(), toE(), outE(), inE(), bothE()
EdgeVertexStep toV(), inV(), outV(), bothV()
NotStep not()
WhereStep where()
AndStep and(A,B)
PageRankVertexProgramStep pageRank()

 

Bulk Drop and Property Updates

DseGraphFrame is currently the only way to drop millions of vertices or edges at once. It is also much faster for bulk property updates than other methods. For example, to drop all ‘person’ vertices and their associated edges:

scala> g.V().hasLabel("person").drop().iterate()

List of Gremlin update methods supported by DseGraphFrame:

DropStep V().drop(),E().drop(),properties().drop()
AddPropertyStep property(name, value, …)

The Traverser concept and side effects are not supported.

Java API

The DseGraphFrame V() and E() methods return a GraphTraversal. This is a Java interface, so all methods exist, but some of them throw UnsupportedOperationException. The GraphTraversal is a Java iterator and also has toSet() and toList() methods to get query results:

//load a graph
DseGraphFrame graph = DseGraphFrameBuilder.dseGraph("test", spark);
//print names
for(String name: graph.V().values("name")) System.out.println(name);

To finish a traversal and return to the DataFrame API instead of list or iterator use the .df() method:

graph.V().df()

Scala API

DseGraphFrame supports implicit conversion of a GraphTraversal to a DataFrame in Scala. The following example will traverse vertices with TinkerPop and then show the result as a DataFrame:

scala> g.V().out().show

In some cases the Java API is required to get correct TinkerPop objects.

For example, to extract the DSE Graph Id object, the Traversal’s Java iterator can be converted to a Scala iterator, which allows direct access to the TinkerPop representation of the Id. This gives you the original Id instead of the DataFrame methods’ String representation of the Id; you can also use the toList() and toSet() methods to get an appropriate id set.

scala> import scala.collection.JavaConverters._
scala> for(i <- g.V().id().asScala) println(i)
{~label=person, id=1}
{~label=person, id=3}
{~label=person, id=2}

scala> g.V.id.toSet
res12: java.util.Set[Object] = [{~label=person, id=2}, {~label=person, id=3}, {~label=person, id=1}]

Logical operations are supported with the TinkerPop P predicates class:

g.V().has("age", P.gt(30)).show

The T.label constant can be used to refer to the label:

g.E().groupCount().by(T.label).show

Note: Scala is not always able to infer return types, especially in the spark-shell. Thus to get property values, the type should be provided explicitly:

g.V().values[Any]("name").next()
//or
val n: String = g.V().values("name").next()

The same approach is needed to drop a property from the spark-shell. To query a property before dropping it, you should pass the type of the property ([Any]):

g.V().properties[Any]("age", "name").drop().iterate()

The DataFrame method looks more user-friendly in this case:

scala> g.V().properties("age", "name").drop().show()
++
||
++
++

scala> g.V().values("age").show()
+---+
|age|
+---+
| 29|

DseGraphFrame updates

Spark can read from various sources. As long as you can get the data into a DataFrame that has ‘id’, ‘~label’ and one or more property columns, you can load that data into the graph. Format the DataFrame appropriately and call one of the update methods:

val v = new_data.vertices.select ($"id" as "_id", lit("person") as "~label", $"age")
g.updateVertices (v)
val e = new_data.edges.select (g.idColumn(lit("person"), $"src") as "src", g.idColumn(lit("person"), $"dst") as "dst", $"relationship" as "~label")
g.updateEdges (e)

Spark Streaming Example

The DataFrame could come from any source, even Spark Streaming. You just need to get the DataFrame into an appropriate format and call updateVertices or updateEdges:

dstream.foreachRDD(rdd => {
  val updateDF = rdd.toDF("_id", "messages").withColumn("~label", lit("person"))
  graph.updateVertices(updateDF)
})

The full source code of the streaming application can be found here.

GraphX, GraphFrame and DataFrame

DseGraphFrame can return a GraphFrame representation of the graph with the DseGraphFrame.gf() method. That gives you access to all of the advanced GraphFrame and GraphX algorithms. It also allows you to build sophisticated queries that are not yet supported by the DseGraphFrame subset of the TinkerPop API. A DseGraphTraversal can return its result as a DataFrame with the df() method. In the Scala API an implicit conversion is provided from traversal to DataFrame, so all DataFrame methods are available on the DseGraphTraversal.

scala> g.V.select(col("name")).show
+-----+
| name|
+-----+
|marko|
| josh|
|vadas|
+-----+

GraphFrame Reserved Column Names

GraphFrame uses the following set of columns internally: "id", "src", "dst", "new_id", "new_src", "new_dst", "graphx_attr". TinkerPop properties with these names will be prepended with "_" when represented inside a GraphFrame/DataFrame.

Querying DSE Graph with SparkSQL

Spark data sources allow one to query graph data with SQL. There are com.datastax.bdp.graph.spark.sql.vertex and com.datastax.bdp.graph.spark.sql.edge sources for vertices and edges. The result tables are in a Spark GraphFrame-compatible format. To permanently register the tables for spark-sql or JDBC access (via the Spark SQL Thriftserver), run the following commands in a `dse spark-sql` session:

spark-sql> CREATE DATABASE graph_test;
spark-sql> USE graph_test;
spark-sql> CREATE TABLE vertices USING com.datastax.bdp.graph.spark.sql.vertex OPTIONS (graph 'test');
spark-sql> CREATE TABLE edges USING com.datastax.bdp.graph.spark.sql.edge OPTIONS (graph 'test');

In addition to operating on the graph from Spark via Scala, Java, and SQL, this method allows you to query and modify the graph from Spark Python or R:

Scala

#>dse spark
scala> val df = spark.read.format("com.datastax.bdp.graph.spark.sql.vertex").option("graph", "test").load()
scala> df.show

 

PySpark

#>dse pyspark
>>> df = spark.read.format("com.datastax.bdp.graph.spark.sql.vertex").load(graph = "test")
>>> df.show()

 

SparkR

#>dse sparkR
> v <- read.df("", "com.datastax.bdp.graph.spark.sql.vertex", graph="test")
> head(v)

 

Export Graph

A DseGraphFrame is represented as 2 DataFrames, so it is easy to export DSE Graph data to any format supported by Spark.

scala> g.V.write.json("dsefs:///tmp/v_json")
scala> g.E.write.json("dsefs:///tmp/e_json")

This will create two directories in the DSEFS file system with the vertex and edge data as JSON-formatted text files.

You can copy the data to the local filesystem (if there is capacity):

#> dse hadoop fs -cat /tmp/v_json/* > local.json

or use the data for offline analytics, by loading it back from DSEFS:

val g = DseGraphFrameBuilder.dseGraph("test", spark.read.json("/tmp/v.json"), spark.read.json("/tmp/e.json"))

 

Export to CSV

Spark CSV does not support arrays or structs. This means that multi-properties and properties with metadata must be converted before exporting.

For example, let the nicknames column be a multi-property with metadata; it will be represented as array(struct(value, metadata*)).

To save it along with the id, you will need the following code:

val plain = g.V.select (col("id"), col("~label"), concat_ws (" ", col("nicknames.value")) as "nicknames")
plain.write.csv("/tmp/csv_v")

 

Importing Graph Data into DSE Graph

DseGraphFrame is able to insert data back to DSE Graph. The parallel nature of Spark makes the inserting process faster than single-client approaches. The current DseGraphFrame API supports inserting only to Graphs with a Custom ID and the process is experimental in DSE 5.1.

Limitations: Custom ID only! Graph schema should be created manually in gremlin-console.

Import previously exported graph

  1. Export schema:

In the gremlin-console:

gremlin> :remote config alias g gods.g
gremlin> schema.describe()

Copy the schema and apply it to new graph:

system.graph('test_import').create()
:remote config alias g test_import.g
// paste schema here

  2. Back in spark, import V and E:

val g = spark.dseGraph("test_import")
g.updateVertices(spark.read.json("/tmp/v.json"))
g.updateEdges(spark.read.json("/tmp/e.json"))

 

Import custom graph

Let’s add data into our test graph. I will use a GraphFrame example graph which has a similar structure to our graph schema. I will tune the schema to be compatible and then update our graph with the new vertices and edges.

scala> val new_data = org.graphframes.examples.Graphs.friends

It consists of two DataFrames.  Let’s check schemas:

scala> new_data.vertices.printSchema
root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = false)
scala> new_data.edges.printSchema
root
 |-- src: string (nullable = true)
 |-- dst: string (nullable = true)
 |-- relationship: string (nullable = true)

Open our graph and check the expected schema.

scala> val g = spark.dseGraph("test")
scala> g.V.printSchema
root
 |-- id: string (nullable = false)
 |-- ~label: string (nullable = false)
 |-- _id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
scala> g.E.printSchema
root
 |-- src: string (nullable = false)
 |-- dst: string (nullable = false)
 |-- ~label: string (nullable = true)
 |-- id: string (nullable = true)

The labels need to be defined for vertices and some id columns need to be renamed. Vertex serialized IDs will be calculated by DSE Graph, but an explicit mapping using the idColumn() function is required for Edge fields ‘src’ and ‘dst’.

val v = new_data.vertices.select ($"id" as "_id", lit("person") as "~label", $"name", $"age")
val e = new_data.edges.select (g.idColumn(lit("person"), $"src") as "src", g.idColumn(lit("person"), $"dst") as "dst", $"relationship" as "~label")

Append them in the graph:

g.updateVertices (v)
g.updateEdges (e)

This approach can be applied to data from any data source supported by Spark: JSON, JDBC, etc. Create a DataFrame, update schema with a select() statement (or other Spark transformations) and update the DSE Graph via updateVertices() and updateEdges().

TinkerPop and GraphFrame work together in DSE:

 

DSE 5.1: Automatic Optimization of Spark SQL Queries Using DSE Search


Introduction

DSE Search (Apache Solr based) and DSE Analytics (Apache Spark Based) may seem like they are basically designed for orthogonal use cases. Search optimizes quick generic searches over your Big Data and Analytics optimizes for reading your entire dataset for processing.  But there is a sweet spot where Analytics can benefit greatly from the enhanced indexing capabilities from Search. Previously in DSE this synergy could only be accessed from the RDD API but now with DSE 5.1 we bring DSE Search together with DSE Analytics in SparkSQL and DataFrames.

Requirements

DSE Search must be enabled in the target datacenter of the SparkSQL or DataFrame request. DSE Analytics must be enabled in the host datacenter of the request. Locality is only guaranteed if both Search and Analytics are colocated in the same datacenter.

Turning on and off the optimizations

In 5.1 this feature is opt-in and is only useful in certain scenarios which we currently do not automatically detect. To enable these optimizations, set spark.sql.dse.solr.enable_optimization=true as a Spark configuration option. This can be set at the application level by setting the value in:

  • spark-defaults
  • on the command line with --conf
  • programatically in the Spark Conf

It can also be set per DataFrame by passing the parameter as an option to your DataFrame reader:

import org.apache.spark.sql.cassandra._
spark.read
  .cassandraFormat("tab", "ks")
  .option("spark.sql.dse.solr.enable_optimization", "true")
  .load()

How it Works

When a query is sent to Spark via SparkSQL or DataFrames, it ends up in an optimization engine called Catalyst (https://spark-summit.org/2016/events/deep-dive-into-catalyst-apache-spark-20s-optimizer/). This engine reduces the query into a set of standardized operations and predicates. Some of those predicates are presented to the data source (which can be Cassandra). The data source is then able to decide whether or not it can handle the predicates presented to it.

When Solr Optimization is enabled DSE adds a special predicate handling class to the Cassandra DataSource provided in the Datastax Spark Cassandra Connector,  allowing DSE to transform the Catalyst Predicates into Solr Query clauses.

These compiled Solr clauses are then added to the full table scan done by CassandraTableScanRDD. Once you have enabled Solr optimization these transformations will be done whenever applicable. If a predicate cannot be handled by Solr we push it back up into Spark.

Debugging / Checking whether predicates are applied

The optimizations are all done on the Spark Driver (the JVM running your job). To see optimizations as they are being planned, add the following line to the logback-spark file used by your driver:

<logger name="org.apache.spark.sql.SolrPredicateRules" level="DEBUG"/>

logback-spark-shell.xml # dse spark
logback-spark-sql.xml # dse spark-sql
logback-spark.xml # dse spark-submit

This will log the particular operations performed by the DSE-added predicate strategy. If you would like to see all of the predicate strategies being applied, add

<logger name="org.apache.spark.sql.Cassandra" level="DEBUG"/>

as well.

Performance Details

Count Performance

The biggest performance difference from using Solr is for count-style queries where every predicate can be handled by DSE Search. In these cases around 100X the performance of the full table scan can be achieved. This means analytics queries like "SELECT COUNT(*) WHERE Column > 5" can be done in near-real time by automatically routing through DSE Search. (CP in this graph refers to Continuous Paging, another DSE-exclusive analytics feature.)

Filtering Performance

The other major use case has to do with filtering result sets. This comes into play when retrieving a small portion of the total dataset. Search uses point partition key lookups when retrieving records, so there is a linear relationship between the number of rows being retrieved and the time it takes to run. Normally Analytics performs full table scans, whose runtime is independent of the amount of data actually being filtered in Spark. The speed boosts to DSE Analytics from Continuous Paging have set a low tradeoff point between Search doing specific lookups and running a full scan in Spark. This means that unless the dataset is being filtered down to a few percent of the total data size, it is better to leave Search optimization off.

In the above chart, the yellow line (Spark Only) runs at a constant time since it always requires reading the same amount of data (all of it). The blue and red lines show that since Search is individually requesting rows, its duration depends on the number of records retrieved. Since primary keys and normal columns are stored differently, retrieving primary keys only (red) is faster than retrieving the whole row (blue).

For retrieving whole rows (blue), the inflection point is around 1.5 percent of the total data size. If the filter returns fewer rows than this percentage, performance will be better using the Search optimization.

For returning only primary key values (red), the inflection point is around 5 percent of the total data size. Any filter returning fewer rows than this percentage will perform better using the Search optimization.

These are general guidelines; the performance of any system may differ depending on data layout and hardware.

Caveats

One major caveat to note is that a SELECT COUNT(*) query without any predicates will currently only trigger a normal Cassandra pushdown count. This can be forced to use a Solr count instead by adding an "IS NOT NULL" filter on a partition key column. We plan on including this as an automatic optimization in the future.
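
As a hedged sketch of that workaround (the keyspace, table, and the partition key column "id" are hypothetical), the filter can be expressed directly in the DataFrame API:

import org.apache.spark.sql.cassandra._

val df = spark.read
  .cassandraFormat("tab", "ks")   // cassandraFormat(table, keyspace)
  .option("spark.sql.dse.solr.enable_optimization", "true")
  .load()

// "id" stands in for a partition key column; IS NOT NULL matches every row but
// gives the planner a predicate it can route through DSE Search.
df.filter("id IS NOT NULL").count()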

All results returned via DSE Search are subject to the same limitations that DSE Search is bound by. The accuracy of the counts is dependent on the consistency of the Solr core on the replicas that provide the results. Errors at indexing time may persist until manually corrected, especially since the Spark Connector doesn't actually use the same endpoint selection logic DSE Search does.

 

DataStax Drivers Fluent APIs for DSE Graph are out!


Fire up your IDE, the time has come!

Following the DataStax Enterprise 5.1 release, DataStax released its first non-beta versions of the Fluent APIs for DSE Graph.

This new feature brings the DataStax Enterprise Drivers into full compatibility with the Apache TinkerPop GLVs, and we even included additional functionalities in order to make the experience of developing graph applications even faster and easier.

"You said 'GLV'"?

"Like... Great Lasagna Volume?"

No.

Most graph database enthusiasts nowadays should be aware of the existence of the Apache TinkerPop project, and its main component, the Gremlin traversal language (and if not, no Lasagna for you tonight). The Gremlin language is syntactically simple (although it may be semantically complex). It is exclusively composed of chains of function calls, or nested function calls, which allow you to define a Graph Traversal: the chain of steps that will produce or return the desired data from your Graph database.

Due to its syntactic simplicity, the Gremlin language can be expressed in any programming language that supports chaining or nesting functions (which includes every major programming language). A version of Gremlin in a specific programming language is called a Gremlin Language Variant (referred to as "GLV" in the rest of this post). Therefore, any Gremlin expressed in a programming language is a GLV, even the original Gremlin-Java.

"If they're all variations, is there a common representation?"

One Gremlin to rule them all.

Since the introduction of the GLV concept in TinkerPop, there needed to be a representation of Gremlin that was language-agnostic, so that any GLV would be an adaptation, in the desired programming language, of that language-agnostic Gremlin. This language-agnostic Gremlin is the Gremlin Bytecode. The Gremlin Bytecode is not meant to be directly exposed to users; however, it is useful for a database driver ("why is that?" will be explained further down) and is made in a way that any GLV traversal can be translated into Bytecode (i.e. the generic representation), and vice versa.

Examples of GLV with Gremlin-Java and Gremlin-Python

Gremlin is defined by a set of function calls that can be chained or nested. GLVs can express Gremlin but vary based on the syntactic capabilities of the programming language the GLV is defined in, in order to make the use of Gremlin even more convenient and natural for the specific programming language's users. The two most common GLVs at the moment are Gremlin-Java and Gremlin-Python.

A Gremlin-Java traversal example:

g.V().values("name").range(0, 4).groupCount()

Gremlin-Python can make use of Python's collections syntax to allow rewriting this traversal as the following:

g.V().name[0:4].groupCount()

Notice how .name[0:4] replaces .values("name").range(0, 4) in the Gremlin-Python GLV. There are a few other examples like this that are good to consider and be aware of when making use of a GLV.

"Enough of this nonsense, what do the DataStax Drivers have to do with this?"

Well in our newest version of DataStax Enterprise Graph, we have added the ability to consume graph traversals in this language-agnostic Bytecode format. Since this is now possible, it means that all the official GLVs can be used against DataStax Enterprise Graph.

In order to make this work, a language driver needs to be able to gather a traversal from the GLV into its generic Bytecode format, and then send it to the Database server. Drivers to do so previously existed only in the TinkerPop project. We have now also adopted this feature - and extended the original GLV capabilities - with the latest release of our DataStax Drivers, to provide what we now call the Fluent API.

Once DSE Graph receives a traversal in the Bytecode format, it translates it back into a traversal of the server's Gremlin Traversal Machine runtime language, and processes it.

DataStax now offers clean and concise utilities that will allow users to use GLVs against DSE Graph backed by the DataStax Enterprise Drivers.

The Fluent API will allow users to interact with DSE Graph via the Gremlin Traversal API, providing a more familiar interface than the existing String-based queries interface, allowing compile-time checking, and easy navigation through the Traversal API within an IDE client-side.

It comes with all the DataStax Enterprise Drivers benefits

When using a GLV backed by a DataStax Enterprise Driver (also called the "Fluent API"), users directly benefit from all the advantages and advanced features the DSE Drivers can offer, paired with the ease of use of the GLV Traversal API. With the Fluent API, users can take advantage of the DSE Drivers':

  • automatic cluster discovery
  • built-in load-balancing features
  • datacenter awareness
  • failure recovery policies (RetryPolicy, ReconnectionPolicy)
  • speculative executions
  • enterprise-grade client authentication
  • client-server encryption
  • advanced logging capabilities

And many other features that are essential to any application in production, which all come with zero added effort.

It also comes with extended functionalities!

In addition to exposing the original GLVs, the Fluent API exposes additional traversal features that are made especially for DataStax Enterprise Graph. Therefore, you can leverage DSE Search specific features, integrated in DataStax Enterprise, directly through the Fluent API. The Fluent API exposes DSE Search predicates that will automatically leverage the server-side search engine without having to think about using another tool, interface or language. We now expose Geometric and Geographic-based search predicates as well as advanced full-text search predicates. Have a look at the new predicates here.
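
For example (the vertex label and property names here are hypothetical), a traversal using one of the full-text predicates could look like:

g.V().has("recipe", "instructions", Search.tokenPrefix("Sau")).values("name")

where Search is the predicates utility exposed by the Fluent API packages.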

Code examples, because we're all here for that.

The DSE Drivers have released artifacts and packages that will provide you with the new almighty utility class DseGraph:

  • It will be the entry point to using the Fluent API.
  • The DseGraph class will provide a method to easily create a TraversalSource, which is the entry point for creating Graph Traversals.
  • Once the TraversalSource is obtained, users can easily create Traversals and for example make a statement out of it, to execute in a DseSession.

It is no more complicated than this (Java):

import com.datastax.driver.dse.graph.GraphResultSet;
import com.datastax.driver.dse.graph.GraphStatement;
import com.datastax.dse.graph.api.DseGraph;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversal;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;

GraphTraversalSource g = DseGraph.traversal();
GraphTraversal traversal = g.V().values("name").range(0, 4).groupCount();
GraphStatement statement = DseGraph.statementFromTraversal(traversal);

GraphResultSet results = dseSession.executeGraph(statement);

and with the Python Fluent API:

from dse_graph import DseGraph

g = DseGraph.traversal_source()
traversal = g.V().name[0:4].groupCount()
statement = DseGraph.query_from_traversal(traversal)

results = dse_session.execute_graph(statement)

Note: a DseSession has to be initialized and connected to a DataStax Enterprise cluster in order for this to work. See DataStax Drivers documentation for how to create a DseSession.

The DseGraph utility class also provides the option to get a TraversalSource that is directly connected to the remote DSE Graph server, via a DseSession:

import java.util.List;

import com.datastax.dse.graph.api.DseGraph;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;

// use an initialized and connected DseSession
GraphTraversalSource g = DseGraph.traversal(dseSession);
List results = g.V().values("name").range(0, 4).groupCount().toList();

and with Python:

from dse_graph import DseGraph

g = DseGraph.traversal_source(session=dse_session)
results = g.V().name[0:4].groupCount().toList()

As in the example above, with a connected TraversalSource, each iteration operation built into Gremlin itself (toList(), next()) will transparently translate the traversal into Bytecode, send it via the DSE Driver to the DSE Graph server, and gather the results back.

Wrap up

We look forward to adding support for even more GLVs in the future, continuing to extend their functionality with the Fluent APIs, and overall improving the process of developing graph applications.

To learn more about the internals of Gremlin, a deep dive into the Gremlin Traversal Machine awaits you.

The documentation for the DataStax Java Driver Fluent API is located here, and the Python Fluent API here. Do not forget to check out the other great features the DataStax Enterprise Drivers provide, and give us as much feedback as you can via the DataStax Academy Slack #datastax-drivers room!


From CFS to DSEFS


Cassandra File System (CFS) is the default distributed file system in the DataStax Enterprise platform in versions 2.0 to 5.0. Its primary purpose is to support Hadoop and Spark workloads with temporary Hadoop-compatible storage. In DSE 5.1, CFS has been deprecated and replaced with a much improved DataStax Enterprise File System (DSEFS). DSEFS is available as an option in DSE 5.0, and was made the default distributed file system in DSE 5.1.

A Brief History of CFS

CFS stores file data in Apache Cassandra®. This allows for reuse of Cassandra features and offers scalability, high availability and great operational simplicity. Just as Cassandra is shared-nothing, CFS is shared-nothing as well. It scales linearly with the number of nodes in both performance and capacity.

Because Cassandra was not designed to store huge blobs of binary data as single cells, files in CFS have to be split into blocks and subblocks. Subblocks are each 2 MB large by default. A block is stored in a table partition. A subblock is stored in a table cell. The inodes table stores file metadata such as name and attributes and a list of block identifiers.

Apart from Cassandra itself, CFS has almost no server-side components. All operations like looking up or creating files, splitting/merging subblocks, and data compression/decompression are performed by the client. Therefore, to access CFS, the client needs Thrift access to Cassandra.

CFS Limitations

Unfortunately this design comes with a few limitations:

  • A subblock must be fully loaded into memory and transferred between the client and the storage before the operation timeout happens. This increases memory use.
  • When writing to Cassandra, each row has to be written at least twice - first to the commit log, and then to a new sstable on disk.
  • There is additional I/O overhead of Cassandra compaction, particularly for write-heavy workloads.
  • Reclaiming space after deleting files is deferred until compaction. This can get particularly bad with workloads that write temporary files, e.g. when doing exploratory analytics with Spark. When SizeTieredCompactionStrategy is used, this can result in taking several times more space than needed.
  • Authorization is weak, because it is implemented on the client-side. The server administrator can only restrict who may and who may not access CFS at all, but it is not possible to restrict access to a part of the directory tree.

To alleviate the delayed delete problem and reduce both the time and space overhead of compaction, CfsCompactionStrategy was introduced in DSE 2.1. This strategy flushes each block to a separate sstable. When the file needs to be deleted, it just deletes the right sstables from disk. It also doesn't waste I/O for repeatedly rewriting sstables. This is much faster and more efficient for short-lived files than reading and compacting sstables together, however in practice it introduces another set of problems. While there is no hard limit on the number of sstables in the keyspace, each sstable comes at some cost of used resources like file descriptors and memory. Too many sstables make it slow to find data and blow up some internal Cassandra structures like interval trees. It is very easy to run into issues by having too many small files. Simply put, CfsCompactionStrategy didn't scale in the general case, so it has been deprecated and removed in DSE 5.0.

Introducing DSEFS

DSEFS is the new default distributed file system in DSE 5.1. DSEFS is not just an evolution of CFS. DSEFS has been architected from scratch to address the shortcomings of CFS, as well as HDFS.

DSEFS supports all features of CFS and covers the basic subset of HDFS and WebHDFS APIs to enable seamless integration with Apache Spark and other tools that can work with HDFS. It also comes with a few unique features. Notable features include:

  • creating, listing, moving, renaming, deleting files and directories
  • optional password / token authentication and Kerberos authentication in Spark
  • POSIX permissions and file attributes
  • interactive console with access to DSEFS, local file system, CFS and other HDFS-compatible filesystems
  • querying block locations for efficient data local processing in Spark or Hadoop
  • replication and block size control
  • transparent optional LZ4 compression
  • a utility to check filesystem integrity (fsck)
  • a utility to view status of the cluster (e.g. disk usage)

DSEFS Interactive Console

DSEFS comes with a new console that speeds up interacting with remote filesystems. Previously, to access CFS, you had to launch the dse hadoop fs command, which started a new JVM, loaded the required classes, then connected to the server and finally executed the requested command. Launching and connecting was repeated for every single command and could take a few seconds every time. The DSEFS console can be launched once and then execute many commands reusing the same connection. It also understands the concept of a working directory, so you don't need to type full remote paths with every command. You can use many file systems like DSEFS, CFS, HDFS, and the local file system in a single session. Tab-completion of paths helps to improve interaction speed even more.
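
As a quick illustration (the paths and file names are made up, the prompt is simplified, and the exact command set may vary between DSE versions), a session could look roughly like this:

$ dse fs
dsefs / > mkdir /data
dsefs / > put /tmp/users.csv /data/users.csv
dsefs / > cd /data
dsefs /data > ls -l
dsefs /data > exit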

DSEFS Architecture

The major difference between the CFS and DSEFS architectures is that in DSEFS the data storage layer is separate from the metadata storage layer. Metadata, which includes information about paths, file names and file attributes, as well as pointers to data, is stored in Cassandra tables. File data is stored outside Cassandra, directly in the node's local file system. Data is split into blocks, each 64 MB large by default, and each block is stored in its own file.

Storing data blocks in the local file system has several advantages:

  • There is virtually no limit on the number of blocks that can be stored, other than the capacity of storage devices installed in the cluster. Blocks at rest do not take any other system resources such as memory or file descriptors.
  • The DSEFS server can stream a data block over the network very efficiently using sendfile, without copying any part of it to the JVM heap or userspace memory.
  • Writing data to blocks skips the Cassandra commit log, so every block needs to be written only once.
  • Deleting files is fast and space is reclaimed immediately. Each block is stored in its own file in some storage directory, so deleting a file from DSEFS is just deleting files from the local file system. There is no need to wait for a compaction operation.
  • Looking up blocks is faster than accessing sstables. Blocks can be quickly accessed directly by their name which is stored in the metadata.
  • Replication for data can be configured in a very fine-grained way, separately from Cassandra replication. For example, files in one directory can have RF=3 and files in another directory can have RF=5. You can also set the replication factor for each file.
  • Data placement is much more flexible than what can be achieved with consistent hashing. A coordinator may choose to place a block on the local node to save network bandwidth or to place a block on the node that has low disk usage to balance the cluster.

Using Cassandra to store metadata also has many advantages:

  • The Cassandra tabular data model is well suited for efficient storage and quick lookup of information about files and block locations. Metadata is tiny compared to data, and is comprised of the kinds of information that Cassandra handles well.
  • Cassandra offers excellent scalability, with capacity not limited by the amount of memory available to a single server (such as for an HDFS NameNode). This means DSEFS can store a virtually unlimited number of files.
  • The shared-nothing architecture of Cassandra offers strong high availability guarantees and allows DSEFS to be shared-nothing as well. Any node of your cluster may fail and DSEFS continues to work for both reads and writes. There are no special "master" nodes like the HDFS NameNode. Hence, there are no single points of failure, even temporary ones. DSEFS clients can connect to any node of the DSE cluster that runs DSEFS.
  • Cassandra lightweight transactions allow some operations to be made atomic within a data center, e.g. if multiple clients in the same data center request to create the same path, at most one will succeed.
  • Cassandra offers standard tools like cqlsh to query and manipulate data. In some cases it may be useful to have easy access to internal file system metadata structures, e.g. when debugging or recovering data.

The following diagram shows how DSEFS nodes work together:

Clients may talk to any node over HTTP on port 5598. The contacted node becomes the coordinator for the request. When accessing a file, the coordinator consults the metadata layer to check if the file exists, then to check permissions and finally to get the list of block identifiers and block locations. Then it fetches the blocks either from the local block layer or from remote nodes by using DSEFS internode communication on port 5599. The coordinator joins blocks together and streams them to the client. In the future we may implement an optimization to skip the coordinator and have the client request and join blocks directly.

When writing a file, first the appropriate records are created in the metadata to register the new file, then the incoming data stream is split into blocks and sent to the appropriate block locations. After successfully writing a block, metadata is updated to reflect that fact.

DSEFS Implementation

DSEFS has been implemented in the Scala programming language. It uses Netty for network connectivity, memory management and asynchronous task execution.

Netty together with Scala-Async allows for a non-blocking, asynchronous style of concurrent programming, without callback hell and without explicit thread synchronization. A small number of threads is multiplexed between many connections. A request is always handled by a single thread. This thread-per-core parallelism model greatly improves cache efficiency, reduces the frequency of context switches and keeps the cost of connections low. Connections between the client and the server, as well as between the nodes, are persistent and shared by multiple requests.

DSEFS allocates buffers from off-heap memory with Netty pooled allocator. JVM heap is used almost exclusively for temporary, short-lived objects. Therefore DSEFS is GC friendly. When internally testing the DSEFS server in standalone mode, external to DSE, we've been able to use JVM heaps of size as low as 64 MB (yes, megabytes) without a noticeable drop in performance.

Contrary to CFS, DSEFS doesn't use the old Thrift API to connect to DSE. Instead, it uses DataStax Java Driver and CQL. CFS was the last DSE component using Thrift, so if you migrate your applications to DSEFS, you can simply disable Thrift API by setting cassandra.yaml start_rpc property to false.

Conclusion

DSE 5.1 comes with a modern, efficient, scalable and highly available distributed file system. There is no reason to use CFS any more. CFS has been deprecated but left available so you can copy your old data to the new file system.

For documentation on DSEFS visit:
https://docs.datastax.com/en/dse/5.1/dse-dev/datastax_enterprise/analytics/dsefsTOC.html

We'd love to get your feedback on your experience with DSEFS! Are there any features you want added?

DSE 5.1 Resource Manager, Part 1 – Network Connections Security


DSE Resource Manager is a custom version of the Spark Standalone cluster manager. It provides the functionality of a Spark Master when running Apache Spark™ applications with DSE. Since the introduction of Spark in DSE 4.5, DSE Analytics has enhanced the open source Spark Master implementation with: automatic management of the Spark Master and Spark Workers lifecycles; Apache Cassandra®-based high availability; distributed and fault tolerant storage of Spark Master recovery data; and pain-free configuration for client applications. In 5.1, our introduction of the DSE Resource Manager adds even more to our custom integration providing more ease-of-use, security, and stability.

The most significant improvement with DSE 5.1 is the replacement of the Spark RPC communication mechanism between Spark Driver and Spark Master with a DSE native CQL protocol. While this sounds like it is not very important to your user experience with DSE and Spark, I bet you will be surprised by how much this change bolsters the security of Spark applications and DSE. Interested in learning more? This blog post will guide you through this and other changes brought by DSE Resource Manager in DSE 5.1.

OSS Spark Standalone Deployment

A Spark Standalone cluster consists of a Spark Master and a set of Spark Workers running on different physical machines. Spark Workers are agents which start and kill the processes called executors at the Spark Master's request.

The Spark Master coordinates the workers. It is aware of each worker in the cluster and of the amount of resources it offers. When an application is registered, the master requests new executors to be created on the workers which have enough resources available. Once started, the executors connect to the driver and await work. The SparkContext (created in the driver process) can ask these executors to perform tasks. Each executor process belongs to exactly one SparkContext - there is no way that a single executor can run tasks scheduled by multiple SparkContext instances (it is also not really possible to run multiple SparkContext instances in a single driver process unless you separate them into different class loaders). Therefore, whenever we talk about a Spark application, we mean the SparkContext running in the driver process along with its executors.

Security in Spark Standalone deployment

There are multiple communication channels used in Spark cluster:

  • RPC - used for exchanging control messages between the driver and the master, the master and the worker, the executor and the driver
  • Data - used for sending RDD content between the driver and its executors
  • Files - used for sharing classes and files in general - usually SparkContext starts a file server to share JARs for its executors
  • Web UI - used by the Spark Master, Spark Worker and Application UIs

Since Spark 1.6 the first three channels have been managed together and support the same security features. The Web UI is managed separately because it uses a different protocol (HTTP) and uses a different mechanism under the hood. The Web UI uses an embedded Jetty server, while the rest of communication channels are implemented directly on top of Netty.

As far as security is concerned, Spark Standalone offers shared-secret-based mutual authentication and encryption. In other words, for the whole cluster - the master, all the workers and all applications run on that cluster - there is just a single key used to authenticate any pair of communicating components: the driver talking to the master, the driver talking to its executors and the master talking to the workers. This mechanism allows us to control who has access to the cluster as a whole, but it does not allow us to segregate applications submitted by different users. To be more specific, in DSE 5.0 users need to authenticate the CQL session to gain access to the only secret key used in the cluster. However, once they have done that, they are able to compromise the communication between any pair of components.

Note that shared secret security is not used for Web UI.

Fig. 1 - Spark Applications and Spark Standalone Cluster Manager communication security (DSE 5.0)
All red line connections are secured with the same key therefore all the parties need to know it

What we improved

All the communication between the driver and the master is now carried through the DSE native transport connection by invoking special CQL commands against any node in the DSE cluster. The node which receives the commands will automatically redirect them to the node running the Spark Master. The master node then translates the message from the DSE internal format into a protocol suitable for Spark. While the end user is not impacted by this change, it provides several significant benefits.

First of all, the communication between the driver and the master does not use shared secret security and is instead secured the same way as any CQL connections to DSE. This means that the user can use plain password authentication, as well as Kerberos or whatever authentication mechanism DSE supports in the future. Like any CQL connection, the connection can use compression, TLS encryption, multiple contact points and so on. Furthermore, since we authenticate users who want to submit Spark applications, we may enable authorization and precisely control who can submit an application via CQL.

Moreover, because the connection between the master and the driver does not use shared secret, the whole Spark cluster is split into two logical parts: (1) resource manager - a master and workers; and (2) application - drivers and their executors. Previously, all the applications used the same secret key as the master did in order to be able to talk to it. Now, since the applications do not use the shared secret to connect to the master, each application can use a different secret key internally securing them independently. In other words:

  • The communication between the master and the workers can be confidential - if the security for master-worker communication is enabled, no application can either connect to any master or worker directly (due to mutual authentication) or compromise the communication between them (due to encryption). The cluster will also be protected from a fake worker registering with our master, as well as from a fake master advertising itself to our workers
  • The internal application communication between the driver and its executors can be confidential - if the security for driver-executor communication is enabled, no other application can either connect to our driver and executors or compromise the communication between the driver and any of its executors
  • The communication between the driver and the master can be confidential - if we enable the client encryption in DSE configuration, the connection will be encrypted with TLS

Note that with the introduced changes you can explicitly control the security of communication between each of the aforementioned pairs of components.

Fig. 2 - Spark Applications and DSE Resource Manager communication security (DSE 5.1)
Different link colors denote different authentication and encryption keys used to secure communication between the components

Internal communication between a master and workers

There are two settings in dse.yaml which allow you to enable mutual authentication and encryption for internal communication. Mutual authentication can be enabled by setting spark_security_enabled to true, and encryption additionally requires setting spark_security_encryption_enabled to true. Note that encryption will not work without authentication, so there are three possible combinations: no security, authentication only, and authentication + encryption.

Internal communication between a driver and its executors

Similarly to master-worker communication, we have the same switches for each individual Spark application. Just as described in the Spark documentation, we can enable mutual authentication by setting spark.authenticate to true and encryption by setting spark.authenticate.enableSaslEncryption to true in the Spark configuration (which can be done either directly or by setting defaults in spark-defaults.conf). You may, but are not required to, provide the secret key explicitly. If it is missing, it will be automatically generated using a secure random generator.

Communication between a driver and a master

As mentioned before, the driver uses CQL to communicate with the master, therefore that connection is subject to the same rules as any other DSE native protocol connection. If DSE requires authentication from clients, authentication will also be required for Spark applications. If DSE requires TLS, you will need to provide a trust store and perhaps a key store. If authorization is enabled in DSE, you will also need to set up permissions to connect, to talk to the resource manager and to submit an application.

Just as a reminder, CQL encryption can be enabled in client_encryption_options section of cassandra.yaml, while authentication and authorization are managed in authentication_options and authorization_options of dse.yaml.

Connecting to DSE Spark

In Spark applications, we choose the cluster manager type and provide connection information through a master URI string. For example, Spark Standalone is identified by the spark:// prefix, Mesos by the mesos:// prefix, and YARN by the yarn master URL. DSE Resource Manager is defined as a third-party cluster manager and is identified by the dse:// prefix.

Similarly to previous DSE versions, the user does not need to provide the master address explicitly because that master is (and was) configured automatically. In DSE prior to 5.1, when an application was submitted, the bootstrapper program opened a CQL connection to a DSE node and invoked queries to get the current IP address of the Master. Next the SparkContext acquired a direct connection to that master to perform registration. In DSE 5.1, we still obtain CQL connection but we do not connect to the master directly. All information is exchanged through that initial CQL connection. In fact, on the client side, we do not need to care where the master is or even whether there is some master at all.

Fig. 3 - High availability of the Driver to DSE Resource Manager connection
Request can be sent to any DSE node and they will be redirected to the node where Spark Master is running

DSE Resource Manager connection string

The URL has the following shape:
dse://[<host_address>[:<port>][,<host_address2>]...]?[param=value;]...

The host addresses here are just initial contact points. Only one of them is required to be alive during application submission; therefore providing more contact points can greatly increase the chance of successfully submitting the application. Remember, the addresses provided do not have anything to do with the current location of Master - they are just RPC broadcast addresses of some DSE nodes.

The optional parameters which come after the question mark can define the target datacenter, timeouts and other settings of CQL connection. Basically, those parameters are the same as those used by the Spark Cassandra Connector (defined here), just without the spark.cassandra prefix.

By default the connection string dse://? will connect to the local DSE node, or to the node specified in the Hadoop client configuration or Spark configuration.

In general, connection options for Spark application submission are retrieved from the master URI string, falling back to Spark configuration, which in turn falls back to DSE configuration files. That is, unless you specify some setting explicitly, the connection used for registering the application will be the same as for Cassandra DataFrames and RDDs.
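
For example (the addresses and parameter values here are purely illustrative), the following are all valid master URIs:

dse://?connection.local_dc=DC1
dse://10.0.1.1,10.0.1.2,10.0.1.3
dse://10.0.1.1:9042?connection.local_dc=DC1;connection.timeout_ms=10000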

Working with multiple datacenters

In DSE, a distinct Spark cluster can be run in each DSE Analytic data center. When a Spark application is submitted, the target datacenter can be provided explicitly or implicitly in the master URI string:

  • implicitly - by specifying addresses of nodes from that data center only
  • explicitly - by specifying connection.local_dc option

In the first case, if the contact points belonged to multiple data centers and the connection.local_dc option was not specified, the client would not be able to decide to which data center the application should be submitted and it would end up with an error. Thus we can omit connection.local_dc only if the target data center is not ambiguous.

In the second case, the specified contact points may belong to any datacenters, not just Analytic ones. The Cassandra driver discovers the rest of the nodes and the load balancing policy will take care of passing control commands to the nodes in the specified data center only.
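
For instance, using the example node addresses introduced in the next section, both of these select DC2 as the target (the first implicitly, the second explicitly):

dse://10.0.2.1,10.0.2.2,10.0.2.3
dse://10.0.1.1,10.0.2.1?connection.local_dc=DC2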

How to secure network communication step by step

We are going to show how to set up network security in a DSE Analytics data center, including password authentication and encryption. Let us assume there are two tiny analytic data centers with the following nodes (we assume those addresses are public):

DC1 - 10.0.1.1, 10.0.1.2, 10.0.1.3
DC2 - 10.0.2.1, 10.0.2.2, 10.0.2.3

Setup nodes

Given you have DSE 5.1 installed on all nodes, we need to prepare the configuration before the nodes are started up - we will cover only the security-related settings here:

  1. Make sure that you use DseAuthenticator and DseAuthorizer - they should be set by default in your cassandra.yaml:

authenticator: com.datastax.bdp.cassandra.auth.DseAuthenticator
authorizer: com.datastax.bdp.cassandra.auth.DseAuthorizer

  2. Enable internal authentication - edit dse.yaml so that:

authentication_options:
    enabled: true
    default_scheme: internal
    plain_text_without_ssl: block
    transitional_mode: disabled

  3. Enable authorization - edit dse.yaml so that:

role_management_options:
    mode: internal

authorization_options:
    enabled: true
    transitional_mode: disabled

  4. Secure Spark internode connections - edit dse.yaml so that:

spark_security_enabled: true
spark_security_encryption_enabled: true

  5. Secure DSE internode connections

Create a keystore and a truststore for each node and put them into the /etc/dse/keys/ directory (remember to secure access to that directory). You can find out how to do that in many places on the Internet, though you can also refer to the DSE documentation here or to a more recent version here. Then configure server encryption in cassandra.yaml:

server_encryption_options:
    internode_encryption: all
    keystore: /etc/dse/keys/.internode-keystore
    keystore_password: keystore_password
    truststore: /etc/dse/keys/.internode-truststore
    truststore_password: truststore_password
    require_client_auth: true
    require_endpoint_verification: true

  6. Secure client connections

For client connections, it is usually enough to set up keystores unless you really want TLS-based client authentication - edit cassandra.yaml so that:

client_encryption_options:
    enabled: true
    optional: false
    keystore: /etc/dse/keys/.keystore
    keystore_password: keystore_password
    require_client_auth: false

  7. Start the nodes

$ sudo service dse start

  8. The first thing you need to do after starting the nodes is to update the replication of some system keyspaces - you can find the details here - in our case:

$ cqlsh --ssl -u cassandra -p cassandra 10.0.1.1

ALTER KEYSPACE dse_leases WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1':'3', 'DC2':'3'};
ALTER KEYSPACE dsefs WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1':'3', 'DC2':'3'};
ALTER KEYSPACE spark_system WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1':'3', 'DC2':'3'};
ALTER KEYSPACE dse_security WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1':'3', 'DC2':'3'};
ALTER KEYSPACE system_auth WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1':'3', 'DC2':'3'};
ALTER KEYSPACE "HiveMetaStore" WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1':'3', 'DC2':'3'};

Then exit CQL shell for a moment and run repair to ensure the replication changes have propagated:

$ nodetool repair

and run CQL shell again.

  9. Next we will create users john, tim, anna, eva and grant them different privileges against DSE Resource Manager.

Create the users

CREATE ROLE john WITH PASSWORD = 'password' AND LOGIN = true;
CREATE ROLE tim WITH PASSWORD = 'password' AND LOGIN = true;
CREATE ROLE anna WITH PASSWORD = 'password' AND LOGIN = true;
CREATE ROLE eva WITH PASSWORD = 'password' AND LOGIN = true;

Authorize the users

CREATE ROLE rm_user;
GRANT EXECUTE ON REMOTE OBJECT DseClientTool TO rm_user;
GRANT EXECUTE ON REMOTE OBJECT DseResourceManager TO rm_user;

GRANT rm_user TO john;
GRANT rm_user TO tim;
GRANT rm_user TO anna;
GRANT rm_user TO eva;

Now, let:

  • john be able to submit applications to DC1,
  • tim be able to submit applications to DC2,
  • anna be able to submit applications to both data centers,
  • eva be able to stop applications in DC1.

GRANT CREATE ON WORKPOOL 'DC1' TO john;
GRANT CREATE ON WORKPOOL 'DC2' TO tim;
GRANT CREATE ON ANY WORKPOOL TO anna;
GRANT MODIFY ON ANY SUBMISSION TO eva;

A workpool is a named chunk of computing resources available in the whole DSE cluster. In DSE 5.1, it is just a Spark Master along with Spark Workers in a single analytic data center.

Start the application

Before you start any application, it is good to provide some default properties for each application. In order to do that, open /etc/dse/spark/spark-defaults.conf and make sure to set the following:

spark.cassandra.connection.host  10.0.1.1,10.0.1.2,10.0.1.3,10.0.2.1,10.0.2.2,10.0.2.3
spark.authenticate                           true
spark.authenticate.enableSaslEncryption      true

And make sure you do not have spark.authenticate.secret set anywhere. Otherwise, all the applications will use that same secret key for mutual authentication and encryption rather than generating a new random one each time.

Note that we set spark.cassandra.connection.host to a set of as many nodes as possible - they are contact points and we need at least one of them to be alive to obtain the connection. Obviously we could also specify those nodes in the master URI string, like:

dse://10.0.1.1,10.0.1.2,10.0.1.3,10.0.2.1,10.0.2.2,10.0.2.3

However, as we said before, resource manager connection settings are inherited from the Spark configuration, so the set of contact points will be inherited as well. Since the hosts set will be constant in our examples, we don’t want to write it each time we want to change something else in the master URI string. In other words, this hosts set will serve as a good default for both resource manager connections and Cassandra RDDs.

  1. Try to start the Spark Shell as john, providing an explicit target data center - first DC1, then DC2
    Note that you can check whether the application is running in DC1 or DC2 by going to the Spark Master UI in the browser - you should see that the application is registered with the Spark Master in DC1 but not with the Spark Master in DC2.

$ DSE_USERNAME=john DSE_PASSWORD=password dse spark --master dse://?connection.local_dc=DC1
$ DSE_USERNAME=john DSE_PASSWORD=password dse spark --master dse://?connection.local_dc=DC2

  2. Try the same as users tim, anna and eva

Note that the successful tries are the ones which are permitted by the permissions that were granted before. You may also try to run the Spark Shell as, say, anna, but without providing the target data center explicitly (that is, just remove the master URI string from the command). The Spark Shell will not start because the target data center will be ambiguous.
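
For example, with the spark-defaults contact points spanning both data centers, this will fail because the target data center is ambiguous:

$ DSE_USERNAME=anna DSE_PASSWORD=password dse spark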

  3. Try to remove the application using CQL shell

First, run the Spark Shell as john and get the application identifier:

$ DSE_USERNAME=john DSE_PASSWORD=password dse spark --master dse://?connection.local_dc=DC1
scala> sc.applicationId

In a second console, try to remove that application by issuing a CQL command first as tim and then as eva - tim has no permission to do that, while eva does (remember to connect to one of the nodes of DC1 with the CQL shell):

$ cqlsh --ssl -u tim -p password 10.0.1.1 -e "CALL DseResourceManager.unregisterApplication ('<app_id>')"
$ cqlsh --ssl -u eva -p password 10.0.1.1 -e "CALL DseResourceManager.unregisterApplication ('<app_id>')"

Note that since the application was started by john, tim does not have permission to manage it, thus the first command fails. However, eva has permission to manage any submission, so the second command succeeds.

Spark Master and Worker UIs

You will also notice that the Spark Master and Worker UIs require password authentication, and perhaps the browser will complain about an untrusted certificate. This is because, since DSE 5.1, UI authentication is automatically enabled in both the Spark Master and Worker UIs whenever authentication is enabled in DSE. Similarly, encryption (HTTPS) is enabled in the Spark Master and Worker UIs whenever client encryption is enabled in DSE. However, the Spark Master and Worker UIs have no authorization support yet, therefore if someone is able to log in, they will have unlimited permissions in the UI.

There is also one more change with regard to the Spark Master UI - you do not need to enter the Spark Master address in the browser to access the UI. It is enough to enter the address of any node from the data center with the Spark Master. DSE will automatically redirect the browser to the right address.

Refer to the DSE documentation to get more detailed information about DSE Resource Manager. Stay tuned for more posts related to DSE Resource Manager where we will cover running Spark executors as different system users, as well as Spark application high availability and troubleshooting.

DSE 5.1 Resource Manager Part 2 – Process Security


Recap from part 1 of this post - DSE Resource Manager is a custom version of the Spark Standalone cluster manager. In the first part of this post we explained how network security was improved in DSE 5.1 for Apache Spark™ applications. Here we will show how the executor processes are started and what we improved in DSE 5.1 in terms of security and process separation.

DSE Resource Manager comes with a customizable implementation of the mechanism used to control the driver and executor lifecycles. In particular we provide an alternative to the default mechanism which allows processes to be run as separate system users. Follow this blog post to learn how this impacts the security of your DSE cluster, how it can be configured and how you can verify what it actually does. We will also show a step-by-step guide to demonstrate how it works.

OSS Spark Standalone Master - prior to DSE 5.1

Spark executors (and drivers when deployed in cluster mode) are run on DSE nodes. By default these processes are started by the DSE server and are run by the same OS user who runs the DSE server. This obviously has some security implications - we need to fully trust the applications which are run on the cluster because they can access DSE data and configuration files. The applications can also access each other's files.

Fig. 1 - Running executors and drivers in OSS Spark Deployment (DSE 5.0 and older)
All processes are run as the same system user thus all of them have the same permissions

DSE 5.1 introduces a new feature which allows delegating the running of Spark application components to a runner which can be chosen in the DSE configuration, in dse.yaml. The legacy behaviour is now implemented as a “default” runner.

“RunAs” runner

DSE 5.1 also brings the run_as runner, which allows Spark applications to be run on behalf of a given OS user. Basically, the system user (dse, cassandra, or whoever runs the DSE server) is not used to run users’ applications. Additionally, applications of different DSE users are run as different system users on the host machine. That is:

  • all the simultaneously running applications deployed by a single DSE user will be run as a single OS user
  • applications deployed by different DSE users will be run by different OS users
  • the DSE user will not be used to run applications

 

Fig. 2 - Running executors and drivers with RunAs runner / DSE Resource Manager (DSE 5.1)
DSE node process and processes belonging to different DSE users are run as different OS users thus they may have different permissions

The above assumptions allow implementing security rules which protect the DSE server's private files and prevent applications from accessing each other's private files.

Mechanism

DSE leverages the sudo program to run Spark application components (driver, executors) as specific OS users. Unlike Hadoop/YARN resource managers, DSE does not link a DSE user to a particular OS user. Instead, a certain number of spare user accounts (slots) are used. When a request to run an executor or a driver is received by the DSE server, it looks up an unused slot and locks it for that application. Until the application is finished, all of its processes will be run as that slot user. Then the slot user will be released and can be used by another application.

Since the number of slots is limited, a single slot is shared among all the simultaneously running applications run by the same DSE user. Such a slot is released once all the applications of that user are removed. When there are not enough slots to run an application, an error will be reported and DSE will try to run the executor or driver on a different node. DSE does not limit the number of slots you can configure so, if you need to run more applications simultaneously, simply create more slots. The most reasonable default is the number of cores available to the Spark Worker on that DSE node, because there will never be more distinct executors running simultaneously than the number of available cores (each executor uses at least one core).

Slot assignment is done on a per-node basis; that is, the executors of one application may run on different slots on different DSE nodes. An even more important implication is that when DSE is run on a fat node, different DSE instances running within the same OS should be configured with disjoint sets of slot users. Otherwise, you may end up with a single OS user running applications of two different DSE users.

Fig. 3 - Slot assignment on different nodes

Each DSE instance manages slot users independently. Make sure that if you run multiple DSE instances in a single OS, they use different slot users

Cleanup procedure

The RunAs runner cleans up files which are created in default locations, such as the RDD cache files and the contents of executor and driver work directories.

A slot user is used to run executors and drivers of one DSE user at one time. When the slot is released, it can be used by a different DSE user. Therefore we have to make sure that some files created by applications on each worker node are handled properly, in particular:

  • When the slot is released, those files will not be accessible to newly started applications which will be run as the same slot user
  • When we run executors or a driver of the application again on the same worker node (after the slot was previously released), the application files will be accessible again to that application

The runner we implemented does not track which files are created by the application, but there are two locations common to all Spark applications which we can control:

  • Application work directory - it is created under the Spark worker directory, which is defined in /etc/dse/spark/spark-env.sh as SPARK_WORKER_DIR and by default is set to /var/lib/spark/worker. There is a subdirectory for each application, and then a subdirectory for each executor. For example, executor standard out and standard error streams are saved in those directories.
  • RDD cache location - the location of the cache is also defined in /etc/dse/spark/spark-env.sh as SPARK_LOCAL_DIRS and by default is set to /var/lib/spark/rdd. There are subdirectories with randomly generated names which are created by executor and driver processes.

To meet the aforementioned security requirements for these locations, the runner changes the ownership and permissions of these locations at runtime. In particular, when a slot is released (the last process running on the slot has finished), the runner changes the ownership of the application directories and their content from the slot user to the DSE service user so that no slot user can access them. On the other hand, when executors of the application are about to be run again on a worker node, the ownership of the application directories and their content is changed back to the slot user which is newly assigned to that application.

There is one more detail in this regard - what happens if the worker is killed or the node dies suddenly and the runner is unable to perform cleanup? This problem is handled as follows: when the worker starts, the first thing it does is change the ownership of all remaining application directories to the DSE service user. When executors of some application are run again on this worker node, they will follow the scenario described above and the application directories will be made accessible to those executors.

Note that if your application creates files somewhere else, you need to clean up those files on your own.

Configuration

In order to start working with the run_as runner, the administrator needs to prepare slot users in the OS. The run_as runner assumes that:

  • each slot user has its own primary group, whose name is the same as the name of the slot user (usually the default behaviour of the OS)
  • the DSE service user (the one the DSE server runs as) is a member of each slot user's primary group
  • sudo is configured so that the DSE service user can execute any command as any slot user without providing a password

We also suggest overriding umask to 007 for slot users so that files created by sub-processes will not be accessible to anyone else by default, as well as making sure DSE configuration files are not visible to slot users. One more step to secure the DSE server environment can be modifying the limits.conf file in order to impose a disk space quota per slot user.

The only settings the administrator needs to provide in the DSE configuration are the selection of the run_as runner and a list of slot users to be used on a node. Those settings are located in dse.yaml, in the spark_process_runner section.

Example configuration - configure run_as runner with two slot users: slot1 and slot2

  1. Create 2 users called slot1, slot2 with no login; each such user should have primary group the same as its name: slot1:slot1, slot2:slot2

$ sudo useradd --user-group --shell /bin/false --no-create-home slot1
$ sudo useradd --user-group --shell /bin/false --no-create-home slot2

  2. Add the DSE service user (the one the DSE server runs as) to the slot user groups - the DSE service user has to be in all slot user groups

You can find the DSE service user name by looking into /etc/init.d/dse - there is an entry which defines the DSE service user, for example (let it be cassandra in our case):

CASSANDRA_USER="cassandra"

Then, add that user to each slot user's primary group:

$ sudo usermod --append --groups slot1,slot2 cassandra

By invoking the following command you can verify group assignment:

$ id cassandra
uid=107(cassandra) gid=112(cassandra) groups=112(cassandra),1003(slot1),1004(slot2)

  3. Modify the /etc/sudoers file by adding the following lines (remember to use visudo to edit this file):
Runas_Alias     SLOTS = slot1, slot2
Defaults>SLOTS  umask=007
Defaults>SLOTS  umask_override
cassandra       ALL=(SLOTS) NOPASSWD: ALL
  4. Modify dse.yaml so that:
spark_process_runner:
    runner_type: run_as

    run_as_runner_options:
        user_slots:
            - slot1
            - slot2

The above settings configure DSE to use run_as runner with two slot users: slot1 and slot2.

  5. Make these changes on all the nodes in the data center where you want to use the run_as runner. After applying the configuration, restart all the nodes.

Next, given our workers have at least 2 cores available, we will run two applications as two different users so that a single worker will simultaneously run executors of both applications.

  6. Start two applications in separate consoles

$ DSE_USERNAME=john DSE_PASSWORD=password dse spark --total-executor-cores 3 --master dse://?connection.local_dc=DC1
$ DSE_USERNAME=anna DSE_PASSWORD=password dse spark --total-executor-cores 3 --master dse://?connection.local_dc=DC1

Go to the Spark Master UI page and you will see that both applications are registered and running. Then go to the Spark Worker UI of one of the workers and verify that the worker is running executors of both applications. To see which slot users are used, execute the following statement in each shell:

scala> sc.makeRDD(1 to 3, 3).map(x => s"${org.apache.spark.SparkEnv.get.executorId} - ${sys.props("user.name")}").collect().foreach(println)

The command runs a simple transformation of a 3-element collection. Each partition will be processed as a separate task - if there are 3 executors with 1 core assigned, each element of the collection will be processed by a different executor. We will get the value of the system property “user.name” from each executor JVM. You will notice that one shell reports a different slot user than the other shell for the same executor ID, which means that the executors are run by different OS users.

You may also have a look at the directories created for each application and executor:

$ sudo ls -lR /var/lib/spark/worker

Example output of this command:

./app-20170424121437-0000:
total 4
drwxrwx--- 2 cassandra slot1 4096 Apr 24 12:14 0

./app-20170424121437-0000/0:
total 12
-rw-rw---- 1 cassandra slot1     0 Apr 24 12:14 stderr
-rw-rw---- 1 cassandra slot1 10241 Apr 24 12:16 stdout

./app-20170424121446-0001:
total 4
drwxrwx--- 2 cassandra slot2 4096 Apr 24 12:14 0

./app-20170424121446-0001/0:
total 8
-rw-rw---- 1 cassandra slot2    0 Apr 24 12:14 stderr
-rw-rw---- 1 cassandra slot2 7037 Apr 24 12:22 stdout

Similarly, when you list /var/lib/spark/rdd you will see that directories used by different applications have different owners.

Refer to the DSE Analytics documentation to get more detailed information about DSE Resource Manager. Stay tuned for more posts related to DSE Resource Manager where we are going to cover Spark application high availability and troubleshooting.

Graph Storytelling with Studio 2.0.0


One sign of great data visualization is that you can quickly and accurately interpret the data without having to think much about the mechanics of the visualization itself. In Studio 2.0.0 we added a few features to the Graph View that enable this type of seamless storytelling.

Let’s begin with a simple example graph that represents the fantastic Studio Development Team.

 

Each vertex represents a member of the team, and each edge represents a reports_to relationship. By default, Studio assigns a color to each vertex based on its Label (which you can think of as its “type”). In Studio 2.0.0 we added the ability to style vertices by their property values using the new “Color By” and “Size By” features. For example, let’s imagine that we want to learn a little bit about where the members of the Studio team are originally from. Here’s the same graph, only this time the vertices are colored by the country_of_origin property. This makes it easy to see the network structure and country_of_origin simultaneously.

country_of_origin is a categorical variable, so each unique value is assigned an arbitrary color. Studio can also assign size and color to continuous scalar values using a linear scale. For example, let’s use the "Size By" feature to set the vertex size based on how long each of the Studio team members have worked at DataStax.

The bigger the vertex, the longer that person has worked at DataStax. This is pretty good, but perhaps we can do even better. Let’s see what happens when we also assign color by the days_at_datastax property using a linear scale.

Nice! Now we have 2 dimensions, size and color, working together to help us quickly compare how long each team member has been at DataStax. The bigger and “hotter” the vertex, the longer they have been with the company.

This visualization does a great job of providing rough comparative info at a glance while still showing the network structure, but what if we need even more detail about the days_at_datastax values? For precise comparison of scalar values, the tried-and-true bar chart is tough to beat. Let’s see what happens when we use Studio’s dynamic charting capabilities in concert with the graph view for total clarity.

That’s pretty nice. The graph view conveys the network structure along with a rough comparison of days_at_datastax, and the bar chart provides a definitive secondary reference. Studio also enables you to mouse-over each bar or vertex to see the exact value of days_at_datastax. Lastly, this wasn’t a competition, but congrats to Jim Bisso anyway!

Icons

As you just saw, sometimes the best way to tell the story is with some simple shapes. However, there are also cases where you just can’t beat the interpretability of some well-chosen icons. In Studio 2.0.0 we added the ability to assign Font Awesome Icons to vertices based on their Label (i.e. type), and we think it can add a great deal of clarity in certain situations. To highlight this, let’s take a look at two renderings of the same graph.

As you can see, the icons greatly reduce the amount of thought required to identify the label of each vertex, and that’s what good data visualization is all about. Studio 2.0.0 hasn’t been out for very long, but users are already creating exceptional presentations and instructionals using the new Icon feature.

We sincerely hope that you will enjoy using these new features in Studio 2.0.0. We sure had a great time building them. Happy coding!

Click here to download DataStax Studio 2.0.0.

Spark Application Dependency Management


This blog post was written for DataStax Enterprise 5.1.0. Refer to the DataStax documentation for your specific version of DSE.

Compiling and executing Apache Spark™ applications with custom dependencies can be a challenging task. Spark beginners can feel overwhelmed by the number of different solutions to this problem. Diversity of library versions, the number of different build tools and finally the build techniques, such as assembling fat JARs and dependency shading, can cause a headache.

In this blog post, we shed light on how to manage compile-time and runtime dependencies of a Spark Application that is compiled and executed against DataStax Enterprise (DSE) or open source Apache Spark (OSS).

Along the way we use a set of predefined bootstrap projects that can be adopted and used as a starting point for developing a new Spark Application. These examples are all about connecting, reading, and writing to and from a DataStax Enterprise or Apache Cassandra(R) system.

Quick Glossary:

Spark Driver: A user application that contains a Spark Context.
Spark Context: A Scala class that functions as the control mechanism for distributed work.
Spark Executor: A remote Java Virtual Machine (JVM) that performs work as orchestrated by the Spark Driver.
Runtime classpath: A list of all dependencies available during execution (in execution environment such as Apache Spark cluster). It's important to note that the runtime classpath of the Spark Driver is not necessarily identical to the runtime classpath of the Spark Executor.
Compile classpath: A full list of all dependencies available during compilation (specified with build tool syntax in a build file).

Choose language and build tool

First, git clone the DataStax repository https://github.com/datastax/SparkBuildExamples that provides the code that you are going to work with. Within the cloned repository there are Spark Application bootstrap projects for Java and Scala, and for the most frequently used build tools:

  • Scala Build Tool (sbt)
  • Apache Maven™
  • Gradle

In the context of managing dependencies for the Spark Application, these build tools are equivalent. It is up to you to select the language and build tool that best fits you and your team.

For each build tool, the way the application is built is defined with declarative syntax embedded in files in the application’s directory:

  • Sbt: build.sbt
  • Apache Maven: pom.xml
  • Gradle: build.gradle

From now on we are going to refer to those files as build files.

Choose execution environment

Two different execution environments are supported in the repository: DSE and OSS.

DSE

If you are planning to execute your Spark Application on a DSE cluster, use the dse bootstrap project which greatly simplifies dependency management.

It leverages the dse-spark-dependencies library which instructs a build tool to include all dependency JAR files that are distributed with DSE and are available in the DSE cluster runtime classpath. These JAR files include Apache Spark JARs and their dependencies, Apache Cassandra JARs, Spark Cassandra Connector JAR, and many others. Everything that is needed to build your bootstrap Spark Application is supplied by the dse-spark-dependencies dependency. To view the list of all dse-spark-dependencies dependencies, visit our public repo and inspect the pom files that are relevant to your DSE cluster version.

An example of a DSE build.sbt:

libraryDependencies += "com.datastax.dse" % "dse-spark-dependencies" % "5.1.1" % "provided"

Using this managed dependency will automatically match your compile time dependencies with the DSE dependencies on the runtime classpath. This means there is no possibility in the execution environment for dependency version conflicts, unresolved dependencies etc.

Note: The DSE version must match the one in your cluster, please see “Execution environment version” section for details.

DSE project templates are built with sbt 0.13.13 or later. In case of unresolved dependency errors, update sbt and then clean the Ivy cache (with the rm ~/.ivy2/cache/com.datastax.dse/dse-spark-dependencies/ command).

OSS

If you are planning to execute your Spark Application on an open source Apache Spark cluster, use the oss bootstrap project. For the oss bootstrap project, all compilation classpath dependencies must be manually specified in build files.

An example of an OSS build.sbt:

libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion % "provided",
"org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
"org.apache.spark" %% "spark-hive" % sparkVersion % "provided",
"com.datastax.spark" %% "spark-cassandra-connector" % connectorVersion % "provided"
)

For OSS, you must specify these four dependencies for the compilation classpath.

During execution, the Spark runtime classpath already contains the org.apache.spark.* dependencies, so all we need to do is add spark-cassandra-connector as an extra dependency. The DataStax spark-cassandra-connector doesn’t exist in the Spark cluster by default. The most common method to include this additional dependency is to use the --packages argument of the spark-submit command. An example of --packages usage is shown in the “Execute” section below.

The Apache Spark versions in the build file must match the Spark version in your Spark cluster. See next section for details.

Execution environment versions

It is possible that your DSE or OSS cluster version is different from the one specified in the bootstrap project.

DSE

If you are a DSE user then checkout the SparkBuildExamples version that matches your DSE cluster version, for example:

git checkout <DSE_version>
# example: git checkout 5.0.6

If you are a DSE 4.8.x user then checkout 4.8.13 or newer 4.8.x version.

OSS

If you are planning to execute your application against a Spark cluster different from the one specified in the bootstrap project build file, adjust all dependency versions listed there. Fortunately, the main component versions are variables. See the examples below and adjust them according to your needs.

Sbt

val sparkVersion = "2.0.2"
val connectorVersion = "2.0.0"

 

Maven

<properties>
  <spark.version>2.0.2</spark.version>
  <connector.version>2.0.0</connector.version>
</properties>

 

Gradle

def sparkVersion = "2.0.2"
def connectorVersion = "2.0.0"

Let’s say that your Spark cluster is version 1.5.1. Go to the version compatibility table, where you can see compatible Apache Cassandra versions and Spark Cassandra Connector versions. In this example, our Apache Spark 1.5.1 cluster is compatible with the 1.5.x Spark Cassandra Connector, the newest of which is 1.5.2 (the newest versions can be found on the Releases page). Adjust the variables accordingly and you are good to go!

Build

The build command differs for each build tool. The bootstrap projects can be built with the following commands.

Sbt

sbt clean assembly
# produces jar in path: target/scala-2.11/writeRead-assembly-0.1.jar

Maven

mvn clean package
# produces jar in path: target/writeRead-0.1.jar

Gradle

gradle clean shadowJar
# produces jar in path: build/libs/writeRead-0.1.jar

Execute

The spark-submit command differs between environments. In the DSE environment, the command is simplified to autodetect parameters like --master. In addition, various other Apache Cassandra and DSE specific parameters are added to the default SparkConf. Use the following commands to execute the JAR that you built. Refer to the Spark docs for details about the spark-submit command.

DSE

dse spark-submit --class com.datastax.spark.example.WriteRead <path_to_produced_jar>

OSS

spark-submit --conf spark.cassandra.connection.host=<cassandra_host> --class com.datastax.spark.example.WriteRead --packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.0 --master <master_url> <path_to_produced_jar>

Note the usage of --packages to include the spark-cassandra-connector on the runtime classpath for all application JVMs.

Provide additional dependencies

Now that you have successfully built and executed this simple application, it’s time to see how extra dependencies can be added to your Spark Application.

Let’s say your application grows with time and there is a need to incorporate an external dependency to add functionality to your application. For the sake of argument, let the new dependency be commons-math3.

To supply this dependency to the compilation classpath, we must provide proper configuration entries in build files.

There are two ways to provide additional dependencies to the runtime classpath: assembling them into the application JAR, or manually providing them with the spark-submit command.

Assembly

Assembling is a way of directly including dependency classes in the resulting JAR file (sometimes called a fat JAR or uber JAR) as if these dependency classes were developed along with your application. When the user code is shipped to Apache Spark Executors, these dependency classes are included in the application JAR on the runtime classpath. To see an example, uncomment the following sections in any of your build files.

Sbt

libraryDependencies += "org.apache.commons" %% "commons-math3" % "3.6.1"

 

Maven

<dependency>
  <groupId>org.apache.commons</groupId>
  <artifactId>commons-math3</artifactId>
  <version>3.6.1</version>
</dependency>

 

Gradle

assembly "org.apache.commons:commons-math3:3.6.1"

Now you can use commons-math3 classes in your application code. When your development is finished, you can create a JAR file using the build command and submit it without any modifications to the spark-submit command. If you are curious to see where the additional dependency is, use any archive application to open the produced JAR to see that commons-math3 classes are included.
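For example, a minimal sketch of what such application code might look like, assuming a Spark shell or application where sc is the SparkContext (the numbers are arbitrary):

import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics

// Compute a per-partition mean using a commons-math3 class that now ships inside the assembled JAR.
val partitionMeans = sc.parallelize(1 to 100)
  .mapPartitions { nums =>
    val stats = new DescriptiveStatistics()
    nums.foreach(n => stats.addValue(n.toDouble))
    Iterator(stats.getMean)
  }
  .collect()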

When assembling, you might run into conflicts where multiple JARs attempt to include a file with the same filename but different contents. There are several solutions to this problem; the most common are removing one of the conflicting dependencies or shading (which is described later in this blog post). If all else fails, most plugins offer a variety of merge strategies for handling these situations; see, for example, the sbt-assembly merge strategy documentation: https://github.com/sbt/sbt-assembly#merge-strategy.
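As a hedged illustration only (the exact cases depend on which files actually conflict in your build), an sbt-assembly merge strategy in build.sbt might look like this:

assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard   // drop conflicting metadata/signature files
  case "application.conf"            => MergeStrategy.concat    // concatenate config files
  case _                             => MergeStrategy.first     // otherwise keep the first copy found
}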

Manually adding JARs to the runtime classpath

If you don’t want to assemble a fat JAR (maybe the additional dependencies produced a 100 MB JAR file and you consider this size unusable), use an alternate way to provide additional dependencies to the runtime classpath.

Mark some of the dependencies with the provided keyword to exclude them from the assembly JAR.

Sbt

libraryDependencies += "org.apache.commons" %% "commons-math3" % "3.6.1" % "provided"

 

Maven

<dependency>
  <groupId>org.apache.commons</groupId>
  <artifactId>commons-math3</artifactId>
  <version>3.6.1</version>
  <scope>provided</scope>
</dependency>

 

Gradle

provided "org.apache.commons:commons-math3:3.6.1"

After building the JAR, manually specify the additional dependencies with the spark-submit command during application submission. Add or extend the existing --packages argument. Note that multiple dependencies are separated by commas. For example:

--packages org.apache.commons:commons-math3:3.6.1,com.datastax.spark:spark-cassandra-connector_2.11:2.0.0

User dependencies conflicting with Spark dependencies

What if you want to use different version of a dependency than the version that is present in the execution environment?

For example, a Spark cluster already has commons-csv in its runtime classpath and the developer needs a different version in their application. Maybe the version bundled with Spark is old and doesn’t contain all the needed functionality. Maybe the newer version is not backward compatible and breaks Spark Application execution.

This is a common problem and there is a solution: shading.

Shading

Shading is a build technique where dependency classes are packaged with the application JAR file (as in assembling), but additionally the package structure of these classes is altered. This process happens at build time and is transparent to the developer. Shading simply substitutes all dependency references in a Spark Application with the same (functionality-wise) classes located in different packages. For example, the class org.apache.commons.csv.CSVParser used by the Spark Application becomes shaded.org.apache.commons.csv.CSVParser.

To see shading in action, uncomment the following sections in the build file of your choice. This will embed the old commons-csv in the resulting JAR, but with the package prefix “shaded”.

Sbt

assembly "org.apache.commons:commons-csv:1.0"

and

shadowJar {
  relocate 'org.apache.commons.csv', 'shaded.org.apache.commons.csv'
}

 

Maven

<dependency>
  <groupId>org.apache.commons</groupId>
  <artifactId>commons-csv</artifactId>
  <version>1.0</version>
</dependency>

and

<relocations>
  <relocation>
    <pattern>org.apache.commons.csv</pattern>
    <shadedPattern>shaded.org.apache.commons.csv</shadedPattern>
  </relocation>
</relocations>

 

Gradle

libraryDependencies += "org.apache.commons" % "commons-csv" % "1.0"

and

assemblyShadeRules in assembly := Seq(
ShadeRule.rename("org.apache.commons.csv.**" -> "shaded.org.apache.commons.csv.@1").inAll
)

After building the JAR, you can look into its content and see that commons-csv is embedded in the shaded directory.

Summary

In this article, you learned how to manage compile-time and runtime dependencies of a simple Apache Spark application that connects to an Apache Cassandra database by using the Spark Cassandra Connector. You learned how Scala and Java projects are structured with the sbt, Maven, and Gradle build tools. You also learned different ways to provide additional dependencies and how to resolve dependency conflicts with shading.

Studio 2.0 Goes Multi-Model with CQL Support


Great news! In addition to support for DSE Graph and Apache TinkerPop™, DataStax Studio 2.0 introduces support for the Apache Cassandra™ Query Language (CQL). A big part of that support is an intelligent CQL editor that will give you a productivity boost when working with CQL and DataStax Enterprise (DSE) 5.0+. In this blog post we’ll take a deep dive into what the CQL editor has to offer.

Getting Started with CQL and Studio

CQL support for Studio requires DSE 5.0 or higher and Studio 2.0 or higher.  Both can be downloaded here http://docs.datastax.com/en/latest-dse/ and here http://docs.datastax.com/en/latest-studio/.

As is customary in Studio, you work with CQL in a notebook with one or more notebook cells.  To use CQL, just select it as the language for one of your notebook cells:


Figure 1.a:  Shows where to click to get the drop down menu of language options


Figure 1.b:  Demonstrates selecting CQL as a language which will enable the intelligent editor features

You won’t have to make this selection every time, as any new cell automatically inherits the prior cell’s language, so you avoid having to select the language you want to work with repeatedly.

Keyspaces

If you have worked with CQL in the past, the next thing you’ll want to know is how to select a keyspace. You have a few options with Studio:

  1. Fully qualify schema elements by their keyspace in your CQL statements (the first two options are sketched after this list):
  2. Use a USE statement, which will change the keyspace context for all subsequent statements in the same cell:
  3. Configure the keyspace by selecting one from the keyspace drop-down for a cell, which will set the keyspace context for all statements in a cell (except for statements following a USE statement):
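To make the first two options concrete, here is a small sketch; the killrvideo keyspace and videos table are hypothetical stand-ins for your own schema:

-- Option 1: fully qualify the table with its keyspace
SELECT * FROM killrvideo.videos LIMIT 10;

-- Option 2: USE changes the keyspace for the statements that follow in the cell
USE killrvideo;
SELECT * FROM videos LIMIT 10;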

Like with the cell language, if a new cell is created directly below a CQL cell the keyspace setting will be inherited.

Now that we know how to work with keyspaces, let’s move on.

CQL Validations


Figure 2:  A CQL schema aware domain validation error, indicating a keyspace needs to be specified

In the previous section we showed several ways to work with keyspaces. But if you don’t use any of the above options, how do you know you’ve made a mistake without executing the statement? The answer is shown in Figure 2 above. There we can see that Studio lets us know when our statement has an issue by showing a validation error.

Studio supports both CQL syntax and DSE domain-specific validations. A syntax validation is simply whether or not your statement is valid with respect to the CQL grammar:

Domain validations provide you with the errors or warnings that you would get from DSE when executing a statement that violates some constraint. Most are based on checking whether a statement is valid with regard to your schema, but they can include anything, such as informing you that you’ve specified an invalid table option:


Figure 3:  Example of a CQL domain validation error, that if you execute the statement gives you similar feedback from DSE

In this case, and many others, you can figure out how to correct your statement by removing the part of the statement with an error and invoking content assist with ctrl+space to get a list of proposals. Let’s take a look at content assist now.

Content Assist

Like validations, content assist can help you by proposing the next valid keywords in the grammar, or it can provide domain specific proposals.  Let’s see how we might correct the statement that specified an invalid table option by invoking content assist with ctrl+space:


Figure 4:  Example of proposing valid table options in a CREATE TABLE statement

In Figure 4 above we can see that the table option that we probably wanted before was bloom_filter_fp_chance, which after being selected will be inserted with a valid default value.

There are many places in CQL statements that Studio supports invoking content assist.  Some of the more common are:

  • Proposing table names anywhere a table can be referenced in a statement
  • Proposing column names anywhere a column can be referenced in a statement
  • Proposing the next valid keyword, e.g. CREATE <ctrl+space> should propose the TABLE keyword, among others

CQL Templates

Perhaps the most useful place to invoke content assist is at the very beginning of a statement:


Figure 5:  Invocation of content assist at the beginning of a statement that propose CQL statement templates

What you see in Figure 5 is that the proposals that contain placeholder values ({keyspaceName}, {viewName}) are CQL statement templates. If we select the ALTER TABLE (add column) template, a statement is inserted with each placeholder being a portion of the statement you need to complete. You can TAB through these placeholders to jump around a statement, as well as use SHIFT+TAB to move back to the previous placeholder:


Figure 6:  Show the ALTER TABLE(add column) template inserted, with the current placeholder highlighted

In Figure 6 you can see that the placeholders are emphasized, with the current placeholder being highlighted. For this template we need to provide a table name, a column name, and a type for the column. Templates like these can be very handy when dealing with large, complicated statements that you might not remember the syntax for off hand, such as the CREATE MATERIALIZED VIEW statement:


Figure 7:  Shows how handy templates can be for large complex statements such as CREATE MATERIALIZED VIEW(with clustering order)

When in doubt, give content assist a try! All you have to do is invoke it with ctrl+space and you will be pleasantly surprised by how much Studio can help you with crafting your CQL statements.

Effective Schema

When either validating your statements or making content assist proposals, Studio makes schema-based domain validations and content assist proposals using an effective schema. The effective schema is your existing schema combined with the changes each of your DDL statements would effectively make to the database schema - more specifically, the changes from every DDL statement prior to the current statement that you are either trying to invoke content assist on, or that the editor is validating.

This ensures that if you were to execute your cells one by one from the top down that they would each execute successfully.

To make this clearer, take a look at the following example:


Figure 8:  Example of effective schema in a single cell

In the example above, assume that the database schema does not have the videos table, and that we have not executed this notebook cell.  In this cell we can see the following being demonstrated:

  1. A CREATE TABLE statement applies a change to the effective schema so that the videos table now exists from the perspective of the second statement.
  2. Even though we haven’t executed it, the second statement (DROP TABLE videos) does not have a validation error, because the videos table exists in the effective schema of the drop table statement.
  3. The third statement tries to select from the videos table, but the effective schema for that statement no longer has the videos table due to the prior drop statement, so it is flagged with a validation error. (A plain-CQL sketch of this cell follows.)
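For readers following along without the screenshot, here is roughly what such a cell could contain; the videos column definitions are invented for illustration:

CREATE TABLE videos (video_id uuid PRIMARY KEY, name text);  -- adds videos to the effective schema (columns are hypothetical)
DROP TABLE videos;                                           -- no error: videos exists in the effective schema at this point
SELECT * FROM videos;                                        -- flagged: the effective schema no longer contains videos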

Note that effective schema also carries across cells:


Figure 9:  Example of effective schema across multiple cells

And as mentioned previously, content assist also leverages the effective schema:


Figure 10:  Example of content assist leveraging the effective schema

The example above shows that content assist is aware that the videos2 table exists in the effective schema, but that videos1 has been dropped, so it isn’t proposed as a possible table to drop for the current statement.

Effective schema is a great tool for ensuring you are writing statements that will execute successfully when working on a notebook that contains DDL statements, especially notebooks with many statements.

One last topic for this post is a way for you to view your database schema from the editor itself using Studio’s DESCRIBE statement support.

DESCRIBE Statement Support

Suppose you want to create a new user defined type (UDT) that is fairly similar to an existing UDT, or you just don’t remember the syntax. One way to do this quickly is to leverage Studio’s support for describing CQL schema elements. Like the CQL shell (cqlsh), executing DESCRIBE statements will produce the equivalent DDL to create that schema element, which is a handy thing to copy and then modify to meet your new type’s needs:


Figure 11:  Shows the result of executing a DESCRIBE TYPE statement

In general, Studio's DESCRIBE command support is a great way to inspect parts of your schema quickly without leaving the editor. However, it’s important to note that DESCRIBE commands are not actual CQL statements and don't execute against your DSE cluster. Instead, Studio uses the metadata it knows about your schema to generate output equivalent to what you would see if issuing DESCRIBE commands using cqlsh.

What DESCRIBE commands does Studio support?

  • DESCRIBE CLUSTER
  • DESCRIBE KEYSPACES
  • DESCRIBE KEYSPACE
  • DESCRIBE TABLES
  • DESCRIBE TABLE
  • DESCRIBE INDEX
  • DESCRIBE MATERIALIZED VIEW
  • DESCRIBE TYPES
  • DESCRIBE TYPE
  • DESCRIBE FUNCTIONS
  • DESCRIBE FUNCTION
  • DESCRIBE AGGREGATES
  • DESCRIBE AGGREGATE

 Next Steps

A great place for you to go next is to download Studio and walk through the Working With CQL tutorial that ships with it.  That tutorial contains even more info about how to work with CQL in Studio, including:

  • Browsing your CQL schema with our fabulous CQL schema viewer
  • Different ways to visualize your CQL results, including a detailed JSON view of nested data in a single column
  • How to create custom CQL execution configurations, including ones that enable tracing and give you a profile view of your query's execution

Thanks!

We hope that Studio will be an extremely productive environment for you to craft your CQL queries to run against DataStax Enterprise. If you have any feedback or requests, don’t hesitate to contact the Studio team at studio-feedback@datastax.com.

DSE Advanced Replication in DSE 5.1


DSE Advanced Replication feature in DataStax Enterprise underwent a major refactoring between DSE 5.0 (“V1”) and DSE 5.1 (“V2”), radically overhauling its design and performance characteristics.

DSE Advanced Replication builds on the multi-datacenter support in Apache Cassandra® to facilitate scenarios where selective or "hub and spoke" replication is required. DSE Advanced Replication is specifically designed to tolerate sporadic connectivity that can occur in constrained environments, such as retail, oil-and-gas remote sites and cruise ships.

This blog post provides a broad overview of the main performance improvements and  drills down into how we support CDC ingestion and deduplication to ensure efficient transmission of mutations.

Note: This blog post was written targeting DSE 5.1. Please refer to the DataStax documentation for your specific version of DSE if different.

Overview

Discussion of performance enhancements is split into three broad stages:

  1. Ingestion: Capturing the Cassandra mutations for an Advanced Replication-enabled table
  2. Queueing: Sorting and storing the ingested mutations in an appropriate message queue
  3. Replication: Replicating the ingested mutation to the desired destination(s).

Ingestion

In Advanced Replication v1 (included in DSE 5.0), capturing mutations for an Advanced Replication enabled table used Cassandra triggers. Inside the trigger we unbundled the mutation and extracted the various partition updates and key fields for the mutation. Because the trigger ran in the ingestion transaction, it provided backpressure to ingestion, reducing throughput and increasing latency, as the mutations were processed in the ingestion cycle.

In Advanced Replication v2 (included in DSE 5.1), we replaced triggers with the Cassandra Change Data Capture (CDC) feature added in Cassandra version 3.8. CDC is an optional mechanism for extracting mutations from specific tables from the commitlog. This mutation extraction occurs outside the Ingestion transaction, so it adds negligible direct overhead to the ingestion cycle latency.

Post-processing the CDC logs requires CPU and memory. This process competes with DSE for resources, so decoupling of ingestion into DSE and ingestion into Advanced Replication allows us to support bursting for mutation ingestion.

The trigger in v1 was previously run on a single node in the source cluster. CDC is run on every node in the source cluster, which means that there are replication factor (RF) number of copies of each mutation. This change creates the need for deduplication which we’ll explain later on.

Queuing

In Advanced Replication v1, we stored the mutations in a blob of data within a vanilla DSE table, relying on DSE to manage the replication of the queue and maintain the data integrity. The issue was that this insertion was done within the ingestion cycle with a negative impact on ingestion latency, at a minimum doubling the ingestion time. This could increase the latency enough to create a query timeout, causing an exception for the whole Cassandra query.

In Advanced Replication v2 we offloaded the queue outside of DSE and used local files. So for each mutation, we have RF copies of that mutation - due to capturing the mutations at the replica level via CDC versus at the coordinator level via triggers in v1 - on the same nodes where the mutation is stored for Cassandra. This change ensures data integrity and redundancy and provides RF copies of the mutation.

We have solved this CDC deduplication problem based on an intimate understanding of token ranges, gossip, and mutation structures to ensure that, on average, each mutation is only replicated once. The goal is to replicate all mutations at least once, and to try to minimize replicating a given mutation multiple times. This solution will be described later.

Replication

Previously in Advanced Replication v1, replication could be configured only to a single destination. This replication stream was fine for a use case which was a net of source clusters storing data and forwarding to a central hub destination, essentially 'edge-to-hub.'

In Advanced Replication v2 we added support for multiple destinations, where data can be replicated to multiple destinations for distribution or redundancy purposes. As part of this we added the ability to prioritize which destinations and channels (pairs of source table to destination table) are replicated first, and to configure whether channel replication is LIFO or FIFO to ensure newest or oldest data is replicated first.

CDC Deduplication and its integration into the Message Queue to support replication

With the new implementation of the v2 mutation queue, each mutation is stored in a replication factor (RF) number of queues, and the mutations on each node are interleaved depending on which subset of token ranges is stored on that node.

There is no guarantee that the mutations are received on each node in the same order.

With the Advanced Replication v1 trigger implementation there was a single consolidated queue which made it significantly easier to replicate each mutation only once.

Deduplication

In order to minimize the number of times we process each mutation, we triage the mutations that we extract from the CDC log in the following way:

  1. Separate the mutations into their distinct tables.
  2. Separate them into their distinct token ranges.
  3. Collect the mutations in time sliced buckets according to their mutation timestamp (which is the same for that mutation across all the replica nodes.)

Distinct Tables

Separating the mutations into their distinct tables is reflected in the directory structure:

Token range configuration

Assume a three node cluster with a replication factor of 3.

For the sake of simplicity, this is the token-range structure on the nodes:

Primary, secondary, and tertiary are an arbitrary but consistent way to prioritize the token ranges on a node - based on the token configuration of the keyspace - as we know that Cassandra has no concept of a primary, secondary, or tertiary node.

However, it allows us to illustrate that we have three token ranges that we are dealing with in this example. If we have Virtual-Nodes, then naturally there will be more token-ranges, and a node can be ‘primary’ for multiple ranges.

Time slice separation

Assume the following example CDC files for a given table:

As we can see, the mutation timestamps are NOT always received in order (look at the id numbers), but in this example each node contains the same set of mutations.

In this case, all three nodes share the same token ranges, but if we had a 5 node cluster with a replication factor of 3, then the token range configuration would look like this, and the mutations on each node would differ:

Time slice buckets

As we process the mutations from the CDC file, we store them in time slice buckets of one minute’s worth of data. We also keep a stack of 5 time slices in memory at a time, which means that we can handle data up to 5 minutes out of order. Any data which is processed more than 5 minutes out of order is put into the out-of-sequence file and treated as exceptional data which will need to be replicated from all replica nodes.
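As an illustrative model of this bucketing (not actual DSE code; the names and types below are invented):

case class Mutation(id: Long, timestampMillis: Long)

val sliceMillis = 60 * 1000L                                    // one-minute time slices
def sliceOf(m: Mutation): Long = m.timestampMillis / sliceMillis

// With a stack of 5 slices kept in memory, a mutation is bucketed only if its slice is within
// 5 slices of the newest slice seen so far; otherwise it goes to the out-of-sequence file.
def inWindow(newestSlice: Long, m: Mutation): Boolean = sliceOf(m) > newestSlice - 5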

Example CDC Time Window Ingestion

  • In this example, assume that there are 2 time slices of 30 seconds
  • Deltas which are positive are ascending in time so are acceptable.
  • Id’s 5, 11 and 19 jump backwards in time.
  • As the sliding time window is 30 seconds, Id’s 5 and 19 would be processed, whilst Id 11 is a jump back of 45 seconds so would not be processed into the correct time slice but placed in the out-of-sequence files.

Comparing Time slices

So we have a time slice of mutations on different replica nodes; they should be identical, but there is no guarantee that they are in the same order. We need to be able to compare the time slices and treat them as identical regardless of order. So we take the CRC of each mutation, and when we have sealed the time slice (rotated it out of memory because the current mutation that we are ingesting is 5 minutes later than this time slice), we sort the CRCs and take a CRC of all of the mutation CRCs.
That [TimeSlice] CRC is comparable between time slices to ensure they are identical.
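A rough sketch of that idea (not the DSE source, just the shape of the computation):

import java.nio.ByteBuffer
import java.util.zip.CRC32

// Sort the per-mutation CRCs so the result is independent of arrival order, then CRC the CRCs.
def timeSliceCrc(mutationCrcs: Seq[Long]): Long = {
  val crc = new CRC32()
  mutationCrcs.sorted.foreach(c => crc.update(ByteBuffer.allocate(8).putLong(c).array()))
  crc.getValue
}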

The CRCs for each time slice are communicated between nodes in the cluster via the Cassandra table.

Transmission of mutations

In the ideal situation, the time slices are identical and all three nodes are active - so each node is happily ticking away, only transmitting its primary token range segment files.

However, we need to be robust and assume that nodes fail and time slices do not match, while we still have the requirement that ALL data is replicated.

We use gossip to monitor which nodes are active and which are not, and if a node fails, the ‘secondary’ becomes active for that node’s ‘primary’ token range.

Time slice CRC processing

If a CRC matches for a time slice between 2 nodes, then when that time slice is fully transmitted (for a given destination), the corresponding time slice (with the matching CRC) can be marked as sent (synch-deleted).

If the CRC mismatches, and there is no higher priority active node with a matching CRC, then that time slice is to be transmitted – this is to ensure that no data is missed and everything is fully transmitted.

Active Node Monitoring Algorithm

Assume that the token ranges are (a,b], (b,c], (c,a], and the entire range of tokens is [a,c], we have three nodes (n1, n2 and n3) and replication factor 3.

    • On startup the token ranges for the keyspace are determined - we actively listen for token range changes and adjust the schema appropriately.
    • These are remapped so we have the following information:
      • node => [{primary ranges}, {secondary ranges}, {tertiary ranges}]
      • Note: We support vnodes, where there may be multiple primary ranges for a node.
    • In our example we have:
      • n1 => [{(a,b]}, {(b,c]}, {(c,a]}]
      • n2 => [{(b,c]}, {(c,a]}, {(a,b]}]
      • n3 => [{(c,a]}, {(a,b]}, {(b,c]}]
    • When all three nodes are live, the active token ranges for each node are as follows:
      • n1 => [{(a,b]}, {(b,c]}, {(c,a]}] => {(a,b]}
      • n2 => [{(b,c]}, {(c,a]}, {(a,b]}] => {(b,c]}
      • n3 => [{(c,a]}, {(a,b]}, {(b,c]}] => {(c,a]}
    • Assume that n3 has died; its primary range is then searched for in the secondary replicas of live nodes:
      • n1 => [{(a,b]}, {(b,c]}, {(c,a]}] => {(a,b]}
      • n2 => [{(b,c]}, {(c,a]}, {(a,b]}] => {(b,c], (c,a]}
      • n3 => [{(c,a]}, {(a,b]}, {(b,c]}] => {}
    • Assume that n2 and n3 have died; their primary ranges are then searched for in the secondary replicas of live nodes, and if not found there, the tertiary replicas (assuming replication factor 3):
      • n1 => [{(a,b]}, {(b,c]}, {(c,a]}] => {(a,b], (b,c], (c,a]}
      • n2 => [{(b,c]}, {(c,a]}, {(a,b]}] => {}
      • n3 => [{(c,a]}, {(a,b]}, {(b,c]}] => {}
  • This ensures that data is only sent once from each edge node, and that dead nodes do not result in orphaned data which is not sent. (A small sketch of this selection logic follows.)
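As referenced above, here is a small illustrative model of that selection (not DSE source code); each range lists its replica nodes in priority order, and the active owner is simply the first live node in that list:

// Replica nodes per token range, in priority order (primary, secondary, tertiary), as in the example above.
val rangeReplicas: Map[String, Seq[String]] = Map(
  "(a,b]" -> Seq("n1", "n3", "n2"),
  "(b,c]" -> Seq("n2", "n1", "n3"),
  "(c,a]" -> Seq("n3", "n2", "n1"))

def activeRanges(node: String, live: Set[String]): Set[String] =
  rangeReplicas.keySet.filter(r => rangeReplicas(r).find(live.contains).contains(node))

activeRanges("n1", live = Set("n1", "n2", "n3"))  // Set((a,b])
activeRanges("n2", live = Set("n1", "n2"))        // Set((b,c], (c,a]) - n2 picks up n3's primary range
activeRanges("n1", live = Set("n1"))              // Set((a,b], (b,c], (c,a]) - n1 covers everything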

Handling the Node Failure Case

Below illustrates the three stages of a failure case.

  1. Before - where everything is working as expected.
  2. Node 2 fails - so Node 1 becomes active for its token slices, ignores what has already been partially sent for 120-180, and resends from its secondary directory.
  3. Node 2 restarts - this is after Node 1 has sent 3 slices for which Node 2 was primary (but Node 1 was active because it was Node 2’s secondary); Node 2 synchronously deletes those because the CRCs match. It ignores what has already been partially sent for 300-360, resends those from its primary directory, and carries on.

Before

Node 2 Dies

Node 2 Restarts

 

Conclusion

The vastly improved and revamped DSE Advanced Replication v2 in DSE 5.1 is more resilient and performant with support for multi-hubs and multi-clusters.

For more information see our documentation here.


Bring Your Own Spark


Bring Your Own Spark (BYOS) is a feature of DSE Analytics designed to connect from external Apache Spark™ systems to DataStax Enterprise with minimal configuration efforts. In this post we introduce how to configure BYOS and show some common use cases.

BYOS extends the DataStax Spark Cassandra Connector with DSE security features such as Kerberos and SSL authentication. It also includes drivers to access the DSE Cassandra File System (CFS) and DSE File System (DSEFS) in 5.1.

There are three parts of the deployment:

  • <dse_home>clients/dse-byos_2.10-5.0.6.jar is a fat jar. It includes everything you need to connect to the DSE cluster: Spark Cassandra Connector with dependencies, DSE security connection implementation, and CFS driver.
  • The 'dse client-tool configuration byos-export' tool helps configure an external Spark cluster to connect to DSE.
  • The 'dse client-tool spark sql-schema' tool generates SparkSQL-compatible scripts to create external tables for all or a subset of DSE tables in the SparkSQL metastore.

HDP 2.3+ and CDH 5.3+ are the only Hadoop distributions which support Java 8 officially and which have been tested with BYOS in DSE 5.0 and 5.1.

Quick Start Guide

Pre-requisites:

You have an installed and configured Hadoop or standalone Spark system, and access to at least one host on the cluster with a preconfigured Spark client. Let’s call it spark-host. The Spark installation should be pointed to by $SPARK_HOME.

You have an installed and configured DSE cluster and access to it. Let’s call it dse-host. I will assume you have a cassandra_keyspace.exampletable C* table created on it. DSE is located at $DSE_HOME.

DSE supports Java 8 only. Make sure your Hadoop, YARN, and Spark use Java 8. See your Hadoop distro documentation on how to upgrade the Java version (CDH, HDP).

Prepare the configuration file

On dse-host run:

$DSE_HOME/bin/dse client-tool configuration byos-export byos.conf

It will store the DSE client connection configuration in a Spark-compatible format in byos.conf.

Note: if SSL or password authentication is enabled, additional parameters need to be stored. See the dse client-tool documentation for details.

Copy the byos.conf to spark-host.

On spark-host, append the ~/byos.conf file to the Spark default configuration:

cat byos.conf >> $SPARK_HOME/conf/spark-defaults.conf

Note: If you expect conflicts with spark-defaults.conf, the byos-export tool can merge properties itself; refer to the documentation for details.

Prepare C* to SparkSQL mapping (optional)

On dse-host run:

dse client-tool spark sql-schema -all > cassandra_mapping.sql

That will create cassandra_mapping.sql with spark-sql compatible CREATE TABLE statements.

Copy the file to spark-host.

Run Spark

Copy $DSE_HOME/dse/clients/dse-byos-5.0.0-all.jar to the spark-host

Run Spark with the jar.

$SPARK_HOME/bin/spark-shell --jars dse-byos-5.0.0-all.jar
scala> import com.datastax.spark.connector._
scala> sc.cassandraTable("cassandra_keyspace", "exampletable").collect

Note: External Spark cannot connect to the DSE Spark master and submit jobs. Thus you cannot point it at the DSE Spark master.

SparkSQL

BYOS does not support the legacy Cassandra-to-Hive table mapping format. The Spark DataFrame external table format should be used for mapping: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md

DSE provides a tool to auto-generate the mapping for an external Spark metastore: dse client-tool spark sql-schema
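To give a rough idea of the shape of the generated mapping (the exact output can differ between connector and DSE versions), each table ends up as a Spark SQL external table definition along these lines:

CREATE TABLE exampletable
  USING org.apache.spark.sql.cassandra
  OPTIONS (keyspace "cassandra_keyspace", table "exampletable");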

On the dse-host run:

dse client-tool spark sql-schema -all > cassandra_mapping.sql

That will create cassandra_mapping.sql with spark-sql compatible CREATE TABLE statements.

Copy the file to spark-host

Create the C* table mappings in the Spark metastore:

$SPARK_HOME/bin/spark-sql --jars dse-byos-5.0.0-all.jar -f cassandra_mapping.sql

Tables are now ready to use in both SparkSQL and Spark shell.

$SPARK_HOME/bin/spark-sql --jars dse-byos-5.0.0-all.jar
spark-sql> select * from cassandra_keyspace.exampletable;
$SPARK_HOME/bin/spark-shell --jars dse-byos-5.0.0-all.jar
scala> sqlContext.sql("select * from cassandra_keyspace.exampletable")

Access external HDFS from dse spark

DSE is built with Hadoop 2.7.1 libraries. So it is able to access any Hadoop 2.x HDFS file system.

To get access, you just need to provide the full path to the file in Spark commands:

scala> sc.textFile("hdfs://<namenode_host>/<path to the file>")

To get the namenode host, run the following command on the Hadoop cluster:

hdfs getconf -namenodes

If the Hadoop cluster has a custom configuration or Kerberos security enabled, the configuration should be copied into the DSE Hadoop config directory:

cp /etc/hadoop/conf/hdfs-site.xml $DSE_HOME/resources/hadoop2-client/conf/hdfs-site.xml

Make sure that the firewall does not block the following HDFS DataNode and NameNode ports:

  • NameNode metadata service: 8020/9000
  • DataNode: 50010, 50020

 

Security configuration

SSL

Start by generating a truststore with the DSE node certificates. If client certificate authentication is enabled (require_client_auth=true), a client keystore will also be needed.

More info on certificate generation:

https://docs.datastax.com/en/cassandra/2.1/cassandra/security/secureSSLCertificates_t.html

Copy both files to the same location on each Spark node. The Spark '--files' parameter can be used for the copying in a YARN cluster.

Use the byos-export parameters to add the store locations, types, and passwords to byos.conf:

dse client-tool configuration byos-export --set-truststore-path .truststore --set-truststore-password password \
    --set-keystore-path .keystore --set-keystore-password password byos.conf

Yarn example:

spark-shell --jars byos.jar --properties-file byos.conf --files .truststore,.keystore

Kerberos

Make sure your Spark client host (where the Spark driver will be running) has Kerberos configured and that the C* nodes' DNS entries are configured properly. See more details in the Spark Kerberos documentation.

If Spark cluster mode deployment will be used, or Kerberos is not configured on the Spark client host, use "Token based authentication" to access the Kerberized DSE cluster.

The byos.conf file will contain all the necessary Kerberos principal and service names exported from DSE.

A JAAS configuration file with the following options needs to be copied from a DSE node, or created manually on the Spark client node, and stored as $HOME/.java.login.config.

DseClient {
       com.sun.security.auth.module.Krb5LoginModule required
       useTicketCache=true
       renewTGT=true;
};

Note: If a custom file location is used, a Spark driver property needs to be set to point to the location of the file.

--conf 'spark.driver.extraJavaOptions=-Djava.security.auth.login.config=login_config_file'

BYOS authenticates with Kerberos and requests a C* token for executor authentication. Token authentication should be enabled in DSE. The Spark driver will automatically cancel the token on exit.

Note: the CFS root should be passed to Spark to request the token with:

--conf spark.yarn.access.namenodes=cfs://dse_host/

Spark Thrift Server with Kerberos

It is possible to authenticate services with a keytab. Hadoop/YARN services are already preconfigured with keytab files and Kerberos users if Kerberos was enabled in Hadoop, so you need to grant permissions to these users. Here is an example for the hive user:

cqlsh> create role 'hive/hdp0.dc.datastax.com@DC.DATASTAX.COM' with LOGIN = true;

Now you can log in as the hive Kerberos user, merge the configs, and start the Spark Thrift server. It will be able to query DSE data:

#> kinit -kt /etc/security/keytabs/hive.service.keytab hive/hdp0.dc.datastax.com@DC.DATASTAX.COM
#> cat /etc/spark/conf/spark-thrift-sparkconf.conf byos.conf > byos-thrift.conf
#> start-thriftserver.sh --properties-file byos-thrift.conf --jars dse-byos*.jar

Connect to it with beeline for testing:

#> kinit
#> beeline -u 'jdbc:hive2://hdp0:10015/default;principal=hive/_HOST@DC.DATASTAX.COM'

Token based authentication

Note: This approach is less secure than the Kerberos one; use it only if Kerberos is not enabled on your Spark cluster.

DSE clients use Hadoop-like token-based authentication when Kerberos is enabled on the DSE server.

The Spark driver authenticates to the DSE server with Kerberos credentials, requests a special token, and sends the token to the executors. The executors authenticate to the DSE server with the token, so no Kerberos libraries are needed on the executor nodes.

If the Spark driver node has no Kerberos configured, or the Spark application should be run in cluster mode, the token can be requested during configuration file generation with the --generate-token parameter.

$DSE_HOME/bin/dse client-tool configuration byos-export --generate-token byos.conf

The following property will be added to byos.conf:

spark.hadoop.cassandra.auth.token=NwAJY2Fzc2FuZHJhCWNhc3NhbmRyYQljYXNzYW5kcmGKAVPlcaJsigFUCX4mbIQ7YU_yjEJgRUwQNIzpkl7yQ4inoxtZtLDHQBpDQVNTQU5EUkFfREVMRUdBVElPTl9UT0tFTgA

It is important to manually cancel it after the task is finished to prevent a token reuse attack:

dse client-tool cassandra cancel-token NwAJY2Fzc2FuZHJhCWNhc3NhbmRyYQljYXNzYW5kcmGKAVPlcaJsigFUCX4mbIQ7YU_yjEJgRUwQNIzpkl7yQ4inoxtZtLDHQBpDQVNTQU5EUkFfREVMRUdBVElPTl9UT0tFTgA

Instead of Conclusion

Open Source Spark Cassandra Connector and Bring Your Own Spark feature comparison:

Feature                                          OSS   DSE BYOS
DataStax Official Support                        No    Yes
Spark SQL Source Tables / Cassandra DataFrames   Yes   Yes
CassandraRDD batch and streaming                 Yes   Yes
C* to Spark SQL table mapping generator          No    Yes
Spark Configuration Generator                    No    Yes
Cassandra File System Access                     No    Yes
SSL Encryption                                   Yes   Yes
User/password authentication                     Yes   Yes
Kerberos authentication                          No    Yes

 

Introducing Java driver 4


Today we are releasing Java driver 4.0.0-alpha1, the first preview of our next major version.

Those of you familiar with semantic versioning will notice that this version number indicates breaking API changes; we felt that this was needed to address a number of longstanding issues:

  • Java 8 support: we dropped support for JDK 6 and 7. This allows us to use Java 8 futures in the API, and take advantage of the latest language features like try-with-resources blocks and lambda expressions.
  • No more dependency on Guava: this has been a recurring pain point with earlier versions. We're happy to announce that the driver does not depend on Guava anymore (it's still used internally, but shaded).
  • Separation of packages into API / internal categories; this allows us to expose more hooks for advanced customization. For example, one thing you can do is override the driver's node discovery mechanism.
  • New configuration API with a default implementation based on Typesafe Config.
  • Immutable statement types to eliminate thread safety issues.
  • Pluggable request execution logic. We plan to use that to provide a reactive extension in the near future.

For full details, please refer to the upgrade guide.

Despite all those changes, the main API should still look familiar to 3.x users. For example, here is the canonical connection example:

try (Cluster cluster =
    Cluster.builder().addContactPoint(new InetSocketAddress("127.0.0.1", 9042)).build()) {

  Session session = cluster.connect();

  ResultSet rs = session.execute("select release_version from system.local");
  Row row = rs.iterator().next();
  System.out.println(row.getString("release_version"));
}

The driver is available from Maven central (note that the coordinates have changed):

<dependency>
  <groupId>com.datastax.oss</groupId>
  <artifactId>java-driver-core</artifactId>
  <version>4.0.0-alpha1</version>
</dependency>

The sources are on the 4.x branch on GitHub.

Next steps

This alpha release is an early preview for evaluation by bleeding edge adopters. It has not been fully tested and is not production-ready. The basic request execution logic is functional, but many features are still missing, notably: schema and token metadata, metrics, the query builder, non-default policy implementations, compression.

We'll keep working towards a feature-complete beta in the upcoming months. Look for JIRA tickets with 4.* fix versions for an overview of upcoming tasks (here's a filter for 4.0.0-alpha2 and GA).

In the meantime, you can try the new API and provide feedback on the mailing list, or even pick a ticket and create a pull request (please discuss it with us beforehand, as some tickets may require additional information).

Writing Scala Codecs for the Java Driver


One of the common griefs Scala developers express when using the DataStax Java driver is the overhead incurred in almost every read or write operation, if the data to be stored or retrieved needs conversion from Java to Scala or vice versa.

This could be avoided by using "native" Scala codecs. This has been occasionally solicited from the Java driver team, but such codecs unfortunately do not exist, at least not officially.

Thankfully, the TypeCodec API in the Java driver can be easily extended. For example, several convenience Java codecs are available in the driver's extras package.

In this post, we are going to piggyback on the existing extra codecs and show how developers can create their own codecs – directly in Scala.

Note: all the examples in this post are available in this Github repository.

Dealing with Nullability

It can be tricky to deal with CQL types in Scala because CQL types are all nullable, whereas most typical representations of CQL scalar types in Scala resort to value classes, and these are non-nullable.

As an example, let's see how the Java driver deserializes, say, CQL ints.

The default codec for CQL ints converts such values to java.lang.Integer instances. From a Scala perspective, this has two disadvantages: first, one needs to convert from java.lang.Integer to Int, and second, Integer instances are nullable, while Scala Ints aren't.

Granted, the DataStax Java driver's Row interface has a pair of methods named getInt that deserialize CQL ints into Java ints, converting null values into zeroes.

But for the sake of this demonstration, let's assume that these methods did not exist, and all CQL ints were being converted into java.lang.Integer. Therefore, developers would yearn to have a codec that could deserialize CQL ints into Scala Ints while at the same time addressing the nullability issue.

Let this be the perfect excuse for us to introduce IntCodec, our first Scala codec:

import java.nio.ByteBuffer
import com.datastax.driver.core.exceptions.InvalidTypeException
import com.datastax.driver.core.{DataType, ProtocolVersion, TypeCodec}
import com.google.common.reflect.TypeToken

object IntCodec extends TypeCodec[Int](DataType.cint(), TypeToken.of(classOf[Int]).wrap()) {

  override def serialize(value: Int, protocolVersion: ProtocolVersion): ByteBuffer =
    ByteBuffer.allocate(4).putInt(0, value)

  override def deserialize(bytes: ByteBuffer, protocolVersion: ProtocolVersion): Int = {
    if (bytes == null || bytes.remaining == 0) return 0
    if (bytes.remaining != 4) throw new InvalidTypeException("Invalid 32-bits integer value, expecting 4 bytes but got " + bytes.remaining)
    bytes.getInt(bytes.position)
  }

  override def format(value: Int): String = value.toString

  override def parse(value: String): Int = {
    try {
      if (value == null || value.isEmpty || value.equalsIgnoreCase("NULL")) 0
      else value.toInt
    }
    catch {
      case e: NumberFormatException =>
        throw new InvalidTypeException( s"""Cannot parse 32-bits integer value from "$value"""", e)
    }
  }

}

All we did so far is extend TypeCodec[Int] by filling in the superclass constructor arguments (more about that later) and implementing the required methods in a very similar way compared to the driver's built-in codec.

Granted, this isn't rocket science, but it will get more interesting later. The good news is, this template is reproducible enough to make it easy for readers to figure out how to create similar codecs for every AnyVal that is mappable to a CQL type (Boolean, Long, Float, Double, etc... let your imagination run wild or just go for the ready-made solution).
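For instance, a hypothetical sketch of the same template applied to CQL bigints might look like the following (this is our own illustration, not code from the driver or from the ready-made solution linked above):

import java.nio.ByteBuffer
import com.datastax.driver.core.exceptions.InvalidTypeException
import com.datastax.driver.core.{DataType, ProtocolVersion, TypeCodec}
import com.google.common.reflect.TypeToken

object LongCodec extends TypeCodec[Long](DataType.bigint(), TypeToken.of(classOf[Long]).wrap()) {

  override def serialize(value: Long, protocolVersion: ProtocolVersion): ByteBuffer =
    ByteBuffer.allocate(8).putLong(0, value)

  override def deserialize(bytes: ByteBuffer, protocolVersion: ProtocolVersion): Long = {
    if (bytes == null || bytes.remaining == 0) return 0L
    if (bytes.remaining != 8) throw new InvalidTypeException("Invalid 64-bits integer value, expecting 8 bytes but got " + bytes.remaining)
    bytes.getLong(bytes.position)
  }

  override def format(value: Long): String = value.toString

  override def parse(value: String): Long =
    try {
      if (value == null || value.isEmpty || value.equalsIgnoreCase("NULL")) 0L else value.toLong
    } catch {
      case e: NumberFormatException =>
        throw new InvalidTypeException(s"""Cannot parse 64-bits integer value from "$value"""", e)
    }
}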

(Tip: because of the automatic boxing/unboxing that occurs under the hood, don't use this codec to deserialize simple CQL ints, and prefer instead the driver's built-in one, which will avoid this overhead; but you can use IntCodec to compose more complex codecs, as we will see below – the more complex the CQL type, the more negligible the overhead becomes.)

Let's see how this piece of code solves our initial problems: as for the burden of converting between Scala and Java, Int values are now written directly with ByteBuffer.putInt, and read directly from ByteBuffer.getInt; as for the nullability of CQL ints, the issue is addressed just as the driver does: nulls are converted to zeroes.

Converting nulls into zeroes might not be satisfying for everyone, but how to improve the situation? The general Scala solution for dealing with nullable integers is to map them to Option[Int]. DataStax Spark Connector for Apache Cassandra®'s CassandraRow class has exactly one such method:

def getIntOption(index: Int): Option[Int] = ...

Under the hood, it reads a java.lang.Integer from the Java driver's Row class, and converts the value to either None if it's null, or to Some(value), if it isn't.
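A rough sketch of that behavior (not the connector's actual implementation) could look like this:

import com.datastax.driver.core.Row

def getIntOption(row: Row, index: Int): Option[Int] =
  if (row.isNull(index)) None else Some(row.getInt(index))   // null column -> None, otherwise Some(value)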

Let's try to achieve the same behavior, but using the composite pattern: we first need a codec that converts from any CQL value into a Scala Option. There is no such built-in codec in the Java driver, but now that we are codec experts, let's roll our own OptionCodec:

class OptionCodec[T](
    cqlType: DataType,
    javaType: TypeToken[Option[T]],
    innerCodec: TypeCodec[T])
  extends TypeCodec[Option[T]](cqlType, javaType)
    with VersionAgnostic[Option[T]] {

  def this(innerCodec: TypeCodec[T]) {
    this(innerCodec.getCqlType, TypeTokens.optionOf(innerCodec.getJavaType), innerCodec)
  }

  override def serialize(value: Option[T], protocolVersion: ProtocolVersion): ByteBuffer =
    if (value.isEmpty) OptionCodec.empty.duplicate else innerCodec.serialize(value.get, protocolVersion)

  override def deserialize(bytes: ByteBuffer, protocolVersion: ProtocolVersion): Option[T] =
    if (bytes == null || bytes.remaining() == 0) None else Option(innerCodec.deserialize(bytes, protocolVersion))

  override def format(value: Option[T]): String =
    if (value.isEmpty) "NULL" else innerCodec.format(value.get)

  override def parse(value: String): Option[T] =
    if (value == null || value.isEmpty || value.equalsIgnoreCase("NULL")) None else Option(innerCodec.parse(value))

}

object OptionCodec {

  private val empty = ByteBuffer.allocate(0)

  def apply[T](innerCodec: TypeCodec[T]): OptionCodec[T] =
    new OptionCodec[T](innerCodec)

  import scala.reflect.runtime.universe._

  def apply[T](implicit innerTag: TypeTag[T]): OptionCodec[T] = {
    val innerCodec = TypeConversions.toCodec(innerTag.tpe).asInstanceOf[TypeCodec[T]]
    apply(innerCodec)
  }

}

And voilà! As you can see, the class body is very simple (its companion object is not very exciting at this point either, but we will see later how it could do more than just mirror the class constructor). Its main purpose when deserializing/parsing is to detect CQL nulls and return None right away, without even having to interrogate the inner codec, and when serializing/formatting, intercept None so that it can be immediately converted back to an empty ByteBuffer (the native protocol's representation of null).

We can now combine our two codecs together, IntCodec and OptionCodec, and compose a TypeCodec[Option[Int]]:

import com.datastax.driver.core._
val codec: TypeCodec[Option[Int]] = OptionCodec(IntCodec)
assert(codec.deserialize(ByteBuffer.allocate(0), ProtocolVersion.V4).isEmpty)
assert(codec.deserialize(ByteBuffer.allocate(4), ProtocolVersion.V4).isDefined)

The problem with TypeTokens

Let's sum up what we've got so far: a TypeCodec[Option[Int]] that is the perfect match for CQL ints. But how to use it?

There is nothing really particular with this codec and it is perfectly compatible with the Java driver. You can use it explicitly, which is probably the simplest way:

import com.datastax.driver.core._
val codec: TypeCodec[Option[Int]] = OptionCodec(IntCodec)
val row: Row = ??? // some CQL query containing an int column
val v: Option[Int] = row.get(0, codec)

But your application is certainly more complex than that, and you would like to register your codec beforehand so that it gets transparently used afterwards:

import com.datastax.driver.core._
// first
val codec: TypeCodec[Option[Int]] = OptionCodec(IntCodec)
cluster.getConfiguration.getCodecRegistry.register(codec)

// then
val row: Row = ??? // some CQL query containing an int column
val v: Option[Int] = row.get(0, ???) // How to get a TypeToken[Option[Int]]?

Well, before we can actually do that, we first need to solve one problem: the Row.get method comes in a few overloaded flavors, and the most flavory ones accept a TypeToken argument; let's learn how to use them in Scala.

The Java Driver API, for historical reasons — but also, let's be honest, due to the lack of alternatives – makes extensive usage of Guava's TypeToken API (if you are not familiar with the type token pattern you might want to stop and read about it first).

Scala has its own interpretation of the same reflective pattern, named type tags. Both APIs pursue identical goals – to convey compile-time type information to the runtime – through very different roads. Unfortunately, it's anything but an easy path to travel from one to the other, simply because there is no easy bridge between java.lang.Type and Scala's Type.

Fortunately, all is not lost. As a matter of fact, creating a full-fledged conversion service between both APIs is not a prerequisite: it turns out that Guava's TypeToken works pretty well in Scala, and most classes get resolved just fine. TypeTokens in Scala are just a bit cumbersome to use, and quite error-prone when instantiated, but that's something that a helper object can facilitate.

We are not going to dive any deeper in the troubled waters of Scala reflection (well, at least not until the last chapter of this tutorial). It suffices to assume that the helper object we mentioned above really exists, and that it does the job of creating TypeToken instances while at the same time sparing the developer the boiler-plate code that this operation usually incurs.
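
For the curious, such a helper could look roughly like the sketch below; the method names simply follow the usages you will see in the rest of this post, and the real TypeTokens object covers many more cases:

import com.google.common.reflect.{TypeParameter, TypeToken}
import scala.collection.immutable.Seq

// Illustrative sketch only.
object TypeTokens {

  val int: TypeToken[Int] = TypeToken.of(classOf[Int]).wrap()

  def optionOf[T](eltType: TypeToken[T]): TypeToken[Option[T]] =
    new TypeToken[Option[T]]() {}.where(new TypeParameter[T]() {}, eltType)

  def seqOf[T](eltType: TypeToken[T]): TypeToken[Seq[T]] =
    new TypeToken[Seq[T]]() {}.where(new TypeParameter[T]() {}, eltType)

}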

Now we can resume our example and complete our code that reads a CQL int into a Scala Option[Int], in the most transparent way:

import com.datastax.driver.core._
val tt = TypeTokens.optionOf(TypeTokens.int) // creates a TypeToken[Option[Int]]
val row: Row = ??? // some CQL query containing an int column
val v: Option[Int] = row.get(0, tt)

Dealing with Collections

Another common friction point between Scala and the Java driver is the handling of CQL collections.

Of course, the driver has built-in support for CQL collections; but obviously, these map to typical Java collection types: CQL list maps to java.util.List (implemented by java.util.ArrayList), CQL set to java.util.Set (implemented by java.util.LinkedHashSet) and CQL map to java.util.Map (implemented by java.util.HashMap).

This leaves Scala developers with two inglorious options:

  1. Use the implicit JavaConverters object and deal with – gasp! – mutable collections in their code;
  2. Deal with custom Java-to-Scala conversion in their code, and face the consequences of conversion overhead (this is the choice made by the already-mentioned Spark Connector for Apache Cassandra®, because it has a very rich set of converters available).

All of this could be avoided if CQL collection types were directly deserialized into Scala immutable collections.
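
To make the friction concrete, here is roughly what the first option looks like today, assuming a row containing a list<int> column:

import scala.collection.JavaConverters._
import com.datastax.driver.core.Row

val row: Row = ??? // some CQL query containing a list<int> column
val javaList: java.util.List[Integer] = row.getList(0, classOf[Integer])  // mutable Java collection
val scalaList: List[Int] = javaList.asScala.toList.map(_.intValue)        // hand-rolled conversion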

Meet SeqCodec, our third Scala codec in this tutorial:

import java.nio.ByteBuffer
import com.datastax.driver.core.CodecUtils.{readSize, readValue}
import com.datastax.driver.core._
import com.datastax.driver.core.exceptions.InvalidTypeException

class SeqCodec[E](eltCodec: TypeCodec[E])
  extends TypeCodec[Seq[E]](
    DataType.list(eltCodec.getCqlType),
    TypeTokens.seqOf(eltCodec.getJavaType))
    with ImplicitVersion[Seq[E]] {

  override def serialize(value: Seq[E], protocolVersion: ProtocolVersion): ByteBuffer = {
    if (value == null) return null
    val bbs: Seq[ByteBuffer] = for (elt <- value) yield {
      if (elt == null) throw new NullPointerException("List elements cannot be null")
      eltCodec.serialize(elt, protocolVersion)
    }
    CodecUtils.pack(bbs.toArray, value.size, protocolVersion)
  }

  override def deserialize(bytes: ByteBuffer, protocolVersion: ProtocolVersion): Seq[E] = {
    if (bytes == null || bytes.remaining == 0) return Seq.empty[E]
    val input: ByteBuffer = bytes.duplicate
    val size: Int = readSize(input, protocolVersion)
    for (_ <- 1 to size) yield eltCodec.deserialize(readValue(input, protocolVersion), protocolVersion)
  }

  override def format(value: Seq[E]): String = {
    if (value == null) "NULL" else '[' + value.map(e => eltCodec.format(e)).mkString(",") + ']'
  }

  override def parse(value: String): Seq[E] = {
    if (value == null || value.isEmpty || value.equalsIgnoreCase("NULL")) return Seq.empty[E]
    var idx: Int = ParseUtils.skipSpaces(value, 0)
    if (value.charAt(idx) != '[') throw new InvalidTypeException( s"""Cannot parse list value from "$value", at character $idx expecting '[' but got '${value.charAt(idx)}'""")
    idx = ParseUtils.skipSpaces(value, idx + 1)
    val seq = Seq.newBuilder[E]
    if (value.charAt(idx) == ']') return seq.result
    while (idx < value.length) {
      val n = ParseUtils.skipCQLValue(value, idx)
      seq += eltCodec.parse(value.substring(idx, n))
      idx = n
      idx = ParseUtils.skipSpaces(value, idx)
      if (value.charAt(idx) == ']') return seq.result
      if (value.charAt(idx) != ',') throw new InvalidTypeException( s"""Cannot parse list value from "$value", at character $idx expecting ',' but got '${value.charAt(idx)}'""")
      idx = ParseUtils.skipSpaces(value, idx + 1)
    }
    throw new InvalidTypeException( s"""Malformed list value "$value", missing closing ']'""")
  }

  override def accepts(value: AnyRef): Boolean = value match {
    case seq: Seq[_] => if (seq.isEmpty) true else eltCodec.accepts(seq.head)
    case _ => false
  }

}

object SeqCodec {

  def apply[E](eltCodec: TypeCodec[E]): SeqCodec[E] = new SeqCodec[E](eltCodec)

}

(Of course, we are talking here about scala.collection.immutable.Seq.)

The code above is still vaguely reminiscent of the equivalent Java code, and not very interesting per se; the parse method in particular is not exactly a feast for the eyes, but there's little we can do about it.

In spite of its modest body, this codec allows us to compose a more interesting TypeCodec[Seq[Option[Int]]] that can convert a CQL list<int> directly into a scala.collection.immutable.Seq[Option[Int]]:

import com.datastax.driver.core._
type Seq[+A] = scala.collection.immutable.Seq[A]
val codec: TypeCodec[Seq[Option[Int]]] = SeqCodec(OptionCodec(IntCodec))
val l = List(Some(1), None)
assert(codec.deserialize(codec.serialize(l, ProtocolVersion.V4), ProtocolVersion.V4) == l)

Some remarks about this codec:

  1. This codec is just for the immutable Seq type. It could be generalized into an AbstractSeqCodec in order to accept other mutable or immutable sequences. If you want to know how it would look, the answer is here.
  2. Ideally, TypeCodec[T] should have been made covariant in T, the type handled by the codec (i.e. TypeCodec[+T]); unfortunately, this is not possible in Java, so TypeCodec[T] is in practice invariant in T. This is a bit frustrating for Scala implementors, as they need to choose the best upper bound for T, and stick to it for both input and output operations, just like we did above.
  3. Similar codecs can be created to map CQL sets to Sets and CQL maps to Maps; again, we leave this as an exercise to the user (and again, it is possible to cheat).

Dealing with Tuples

Scala tuples are an appealing target for CQL tuples.

The Java driver does have a built-in codec for CQL tuples; but it translates them into TupleValue instances, which are unfortunately of little help for creating Scala tuples.

Luckily enough, TupleCodec inherits from AbstractTupleCodec, a class that has been designed exactly with that purpose in mind: to be extended by developers wanting to map CQL tuples to more meaningful types than TupleValue.

As a matter of fact, it is extremely simple to craft a codec for Tuple2 by extending AbstractTupleCodec:

class Tuple2Codec[T1, T2](
    cqlType: TupleType, javaType: TypeToken[(T1, T2)],
    eltCodecs: (TypeCodec[T1], TypeCodec[T2]))
  extends AbstractTupleCodec[(T1, T2)](cqlType, javaType)
    with ImplicitVersion[(T1, T2)] {

  def this(eltCodec1: TypeCodec[T1], eltCodec2: TypeCodec[T2])(implicit protocolVersion: ProtocolVersion, codecRegistry: CodecRegistry) {
    this(
      TupleType.of(protocolVersion, codecRegistry, eltCodec1.getCqlType, eltCodec2.getCqlType),
      TypeTokens.tuple2Of(eltCodec1.getJavaType, eltCodec2.getJavaType),
      (eltCodec1, eltCodec2)
    )
  }

  {
    val componentTypes = cqlType.getComponentTypes
    require(componentTypes.size() == 2, s"Expecting TupleType with 2 components, got ${componentTypes.size()}")
    require(eltCodecs._1.accepts(componentTypes.get(0)), s"Codec for component 1 does not accept component type: ${componentTypes.get(0)}")
    require(eltCodecs._2.accepts(componentTypes.get(1)), s"Codec for component 2 does not accept component type: ${componentTypes.get(1)}")
  }

  override protected def newInstance(): (T1, T2) = null

  override protected def serializeField(source: (T1, T2), index: Int, protocolVersion: ProtocolVersion): ByteBuffer = index match {
    case 0 => eltCodecs._1.serialize(source._1, protocolVersion)
    case 1 => eltCodecs._2.serialize(source._2, protocolVersion)
  }

  override protected def deserializeAndSetField(input: ByteBuffer, target: (T1, T2), index: Int, protocolVersion: ProtocolVersion): (T1, T2) = index match {
    case 0 => Tuple2(eltCodecs._1.deserialize(input, protocolVersion), null.asInstanceOf[T2])
    case 1 => target.copy(_2 = eltCodecs._2.deserialize(input, protocolVersion))
  }

  override protected def formatField(source: (T1, T2), index: Int): String = index match {
    case 0 => eltCodecs._1.format(source._1)
    case 1 => eltCodecs._2.format(source._2)
  }

  override protected def parseAndSetField(input: String, target: (T1, T2), index: Int): (T1, T2) = index match {
    case 0 => Tuple2(eltCodecs._1.parse(input), null.asInstanceOf[T2])
    case 1 => target.copy(_2 = eltCodecs._2.parse(input))
  }

}

object Tuple2Codec {

  def apply[T1, T2](eltCodec1: TypeCodec[T1], eltCodec2: TypeCodec[T2]): Tuple2Codec[T1, T2] =
    new Tuple2Codec[T1, T2](eltCodec1, eltCodec2)

}

A very similar codec for Tuple3 can be found here. Extending this principle to Tuple4, Tuple5, etc. is straightforward and left for the reader as an exercise.

Going incognito with implicits

The careful reader noticed that Tuple2Codec's constructor takes two implicit arguments: CodecRegistry and ProtocolVersion. They are omnipresent in the TypeCodec API and hence, good candidates for implicit arguments – and besides, both have nice default values. To make the code above compile, simply put in your scope something along the lines of:

object Implicits {

  implicit val protocolVersion = ProtocolVersion.NEWEST_SUPPORTED
  implicit val codecRegistry = CodecRegistry.DEFAULT_INSTANCE

}
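
With such implicits in scope, Tuple2Codec can then be instantiated and used like any other codec; here is a minimal usage sketch, assuming a query that returns a tuple<int, int> column:

import com.datastax.driver.core._
import Implicits._ // supplies the implicit ProtocolVersion and CodecRegistry required by the constructor

val tupleCodec = new Tuple2Codec(IntCodec, IntCodec) // maps CQL tuple<int, int> to Scala (Int, Int)
val row: Row = ??? // some CQL query containing a tuple<int, int> column
val t: (Int, Int) = row.get(0, tupleCodec)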

Speaking of implicits, let's now see how we can simplify our codecs by adding a pinch of those. Let's take a look at our first trait in this tutorial:

import java.nio.ByteBuffer
import com.datastax.driver.core.{ProtocolVersion, TypeCodec}
import scala.reflect.ClassTag

trait VersionAgnostic[T] { this: TypeCodec[T] =>

  def serialize(value: T)(implicit protocolVersion: ProtocolVersion, marker: ClassTag[T]): ByteBuffer =
    this.serialize(value, protocolVersion)

  def deserialize(bytes: ByteBuffer)(implicit protocolVersion: ProtocolVersion, marker: ClassTag[T]): T =
    this.deserialize(bytes, protocolVersion)

}

This trait basically creates two overloaded methods, serialize and deserialize, which will infer the appropriate protocol version to use and forward the call to the relevant method (the marker argument is just the usual trick to work around erasure).

We can now mix-in this trait with an existing codec, and then avoid passing the protocol version to every call to serialize or deserialize:

import Implicits._
val codec = new SeqCodec(IntCodec) with VersionAgnostic[Seq[Int]]
codec.serialize(List(1,2,3))

We can now go even further and simplify the way codecs are composed together to create complex codecs. What if, instead of writing SeqCodec(OptionCodec(IntCodec)), we could simply write SeqCodec[Option[Int]]? To achieve that, let's enhance the companion object of SeqCodec with a more sophisticated apply method:

object SeqCodec {

  def apply[E](eltCodec: TypeCodec[E]): SeqCodec[E] = new SeqCodec[E](eltCodec)

  import scala.reflect.runtime.universe._

  def apply[E](implicit eltTag: TypeTag[E]): SeqCodec[E] = {
    val eltCodec = ??? // implicit TypeTag -> TypeCodec conversion
    apply(eltCodec)
  }

}

The second apply method guesses the element type by using implicit TypeTag instances (these are created by the Scala compiler, so you don't need to worry about instantiating them), then locates the appropriate codec for it. We can now write:

val codec = SeqCodec[Option[Int]]

Elegant, huh? Of course, we need some magic to locate the right codec given a TypeTag instance. Here we need to introduce another helper object, TypeConversions. Its method toCodec takes a Scala type and, with the help of some pattern matching, locates the most appropriate codec. We refer the interested reader to TypeConversions code for more details.
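
To give a rough idea, a stripped-down toCodec covering only the codecs built in this post could look like the sketch below (the real helper resolves many more types):

import com.datastax.driver.core.TypeCodec
import scala.reflect.runtime.universe._

// Illustrative sketch only.
object TypeConversions {

  def toCodec[T](tpe: Type): TypeCodec[T] = {
    val codec: TypeCodec[_] = tpe match {
      case t if t =:= typeOf[Int]                               => IntCodec
      case t if t =:= typeOf[String]                            => TypeCodec.varchar()
      case t if t <:< typeOf[Option[_]]                         => OptionCodec(toCodec(t.typeArgs.head))
      case t if t <:< typeOf[scala.collection.immutable.Seq[_]] => SeqCodec(toCodec(t.typeArgs.head))
      case _ => throw new IllegalArgumentException(s"No codec found for type $tpe")
    }
    codec.asInstanceOf[TypeCodec[T]]
  }

}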

With the help of TypeConversions, we can now complete our new apply method:

def apply[E](implicit eltTag: TypeTag[E]): SeqCodec[E] = {
  val eltCodec = TypeConversions.toCodec[E](eltTag.tpe)
  apply(eltCodec)
}

Note: similar apply methods can be added to other codec companion objects as well.

It's now time to go really wild, bearing in mind that the following features should only be used with caution by expert users.

If only we could convert Scala's TypeTag instances into Guava's TypeToken ones, and then make them implicit like we did above, we would be able to completely abstract away these annoying types and write very concise code, such as:

val statement: BoundStatement = ???
statement.set(0, List(1,2,3)) // implicit TypeTag -> TypeToken conversion

val row: Row = ???
val list: Seq[Int] = row.get(0) // implicit TypeTag -> TypeToken conversion

Well, this can be achieved in a few different ways; we are going to explore here the so-called Type Class pattern.

The first step is to create implicit classes containing "get" and "set" methods that take TypeTag instances instead of TypeToken ones; we'll name them getImplicitly and setImplicitly to avoid name clashes. Let's do it for Row and BoundStatement:

implicit class RowOps(val self: Row) {

  def getImplicitly[T](i: Int)(implicit typeTag: TypeTag[T]): T =
    self.get(i, ???) // implicit TypeTag -> TypeToken conversion

  def getImplicitly[T](name: String)(implicit typeTag: TypeTag[T]): T =
    self.get(name, ???) // implicit TypeTag -> TypeToken conversion

}

implicit class BoundStatementOps(val self: BoundStatement) {

  def setImplicitly[T](i: Int, value: T)(implicit typeTag: TypeTag[T]): BoundStatement =
    self.set(i, value, ???) // implicit TypeTag -> TypeToken conversion

  def setImplicitly[T](name: String, value: T)(implicit typeTag: TypeTag[T]): BoundStatement =
    self.set(name, value, ???) // implicit TypeTag -> TypeToken conversion

}

Remember what we stated at the beginning of this tutorial: "there is no easy bridge between Java types and Scala types"? Well, we will have to lay one now to cross that river.

Our helper object TypeConversions has another method, toJavaType, that does just that. Again, digging into its details is out of the scope of this tutorial, but with this method we can complete our implicit classes as below:

def getImplicitly[T](i: Int)(implicit typeTag: TypeTag[T]): T = {
  val javaType: java.lang.reflect.Type = TypeConversions.toJavaType(typeTag.tpe)
  self.get(i, TypeToken.of(javaType).wrap().asInstanceOf[TypeToken[T]])
}

And we are done!

Now, by simply placing the above implicit classes into scope, we will be able to write code as concise as:

statement.setImplicitly(0, List(1,2,3)) // the compiler supplies the implicit TypeTag[Seq[Int]], then
                                        // the call is forwarded to statement.set(0, List(1,2,3), TypeToken[Seq[Int]])

When retrieving values, it's a bit more complicated because the Scala compiler needs some help from the developer to be able to fill in the appropriate implicit TypeTag instance; we do so like this:

val list = row.getImplicitly[Seq[Int]](0) // the compiler supplies the implicit TypeTag[Seq[Int]], then
                                          // the call is forwarded to row.get(0, TypeToken[Seq[Int]])

That's it. We hope that with this tutorial, we could demonstrate how easy it is to create codecs for the Java driver that are first-class citizens in Scala. Enjoy!

Gremlin DSLs in Java with DSE Graph


When we think about our friend Gremlin of Apache TinkerPop™, we typically imagine him traversing a graph, bounding from vertex to edge to vertex, all the while aggregating, filtering, sacking, matching into complex recursions that ultimately provide the answer to the question we asked of him. Gremlin is quite adept at his job of graph traversing, but he is also quite adaptable to the domain of the graph he is traversing. In being adaptable, for one example, Gremlin can become "Dr. Gremlin" for a healthcare domain, thus comprehending a traversal like:

dr.patients("55600064").                 // find a patient by id
   prescriptions(Rx.CURRENT).            // get prescriptions they are currently prescribed
     consider("lipitor", "40mg").        // for purpose of analysis add lipitor to that list
     consider("lunesta", "1mg").         // for purpose of analysis add lunesta to that list
   interactions(Type.DRUG, Level.SEVERE) // find possible drug interactions that are severe

Underneath this healthcare syntax, Gremlin relies on his low-level knowledge of the graph and the various steps that allow his navigation of it, but to users of this language those complexities can be hidden. By now it should be clear that “Dr. Gremlin” simply promotes imagery for a well known concept: Domain Specific Languages (DSLs).

Several years ago a blog post was authored that described how one could develop DSLs in Gremlin. This older blog post applied to TinkerPop 2, which has long since been eclipsed by the now widely adopted TinkerPop 3, and therefore has minimal relevance to building a DSL today. Today’s blog post seeks to make DSLs relevant to current times using TinkerPop 3 programming paradigms with an emphasis on their implementation with DSE Graph.

The Importance of DSLs

A good argument for the importance of DSLs in the development of graph applications is Gremlin itself. Gremlin is a graph traversal language, or in other words, a DSL for the graph domain. It "speaks" in the language of property graphs, capturing notions of vertices, edges and properties, and then constrains the actions applied to those domain objects (e.g. out(), inE(), has()) to the form of a traversal. The benefit manifests as a far more succinct, robust, and manageable way to interact with a graph structure as compared with an attempt to do so with a general purpose language.

It is the job of the graph application developer to encode their application’s domain into the vertices and edges of a graph. In other words, they define some form of schema by which vertices and edges represent the various aspects of their domain. With knowledge of the schema it then becomes possible to write Gremlin, using graph language, to insert and extract domain data to and from that encoding.

As an example, consider the KillrVideo dataset from the DataStax Academy 330 course. KillrVideo defines a DSE Graph schema that encodes a “movie” domain into the graph. For example, a “movie” vertex and a “person” vertex each have a number of properties:

schema.vertexLabel("movie").properties("movieId","title","year","duration","country","production").create();
schema.vertexLabel("person").properties("personId","name").create();

and there is an "actor" relationship between these two vertex types:

schema.edgeLabel("actor").connection("movie","person").create();

It is with this knowledge that the Gremlin graph domain language can be used to find all the actors who were in "Young Guns":

g.V().has("movie", "title","Young Guns").out("actor")

In the above statement, Gremlin is told to get the vertices, filter on the vertex label "movie" and a property called "title" and then for the vertices that are allowed by that filter, traverse on out edges labeled "actor" to the adjacent vertex. Neither the Gremlin code nor the description of what it is doing is especially daunting to follow, but it focuses heavily on graph language and the graph schema to interpret it. Someone who is familiar with both of these things wouldn't have much trouble expressing any traversal that they liked, but those who are less versed in these particulars would have a higher barrier for working with the graph. If the level of abstraction were changed so that those with this higher barrier could express their queries in language more familiar to them (in this case, the KillrVideo language), then their efforts to interact with the graph are simplified.

A KillrVideo DSL, a language for working with elements of the movie domain, could create this higher level of abstraction by allowing the same traversal as above to be written as follows:

killr.movies("Young Guns").actors()

The first thing to notice in the above traversal is that the language of the graph is now hidden. The traversal internally holds a "movie" vertex and travels over edges, but none of that is especially evident by just reading the code. It simply states: "get me a movie named 'Young Guns' and then find me the actors on that movie". The second thing to note is that the need to understand the schema and logic of the graph is reduced. Obviously a user of the KillrVideo DSL can’t be completely ignorant of what the graph contains, but the developer of that DSL who is more knowledgeable can design away pitfalls that less knowledgeable users would encounter. Some of those pitfalls are covered in the IDE with intelligent code completion, which would prevent mistypes of string-based property keys and edge labels, but it would also be possible to add validation logic into the DSL to ensure proper usage.

A typical use of validation logic would occur when parameters are supplied to a step. A simple check in the example above would be to ensure that the string passed to the movies() step was not null or empty. Therefore, a traversal constructed as killr.movies(null).actors() would immediately throw an exception. Parameterization of steps and the validation of those parameters go hand-in-hand when building a DSL. Complex traversal algorithms can be hidden behind a single step and made flexible by a body of parameters that can provide runtime tweaks to their execution. Parameters could trigger additional filters, limit returned results, define ranges and depths of execution, or expose any other algorithm feature that might be unknown at design time.

The DSL also creates a buffer that could protect against schema changes. The current schema design calls for a "movie" vertex to have an outgoing "actor" edge to a "person" to define the people who act in a movie. Should that data model change to one where the "actor" edge was promoted to a vertex to resolve the relationship between "movie" and "person", the logic for traversing this relationship is protected by the DSL and this revised traversal logic would only need to be changed within its bounds. In other words, the same results would be derived from killr.movies("Young Guns").actors() irrespective of the nature of the schema.

A final point to consider when it comes to the benefits of DSLs, is to realize that DSLs can lead to more focused testing. A DSL will typically establish a fair number of "small" reusable steps (not all will be a few lines of Gremlin, however), each of which is straightforward to independently unit test. We can then use these tested steps with confidence elsewhere in the DSL in higher-ordered steps. The tests of these lower-level DSL steps would help provide assurance that an application will behave well after undergoing schema change, without having to wait for errors in application level tests where the Gremlin may be more complex to debug.

Implementation

In TinkerPop 2, Gremlin was heavily driven by the Groovy programming language. Groovy supports metaprogramming which provided a natural fit for building DSLs. Gremlin was outfitted with some helpful utilities to build new DSL steps, which hid the specifics of the metaprogramming that was going on underneath. For TinkerPop 3.x, Gremlin is not bound to Groovy. It is instead supported natively in Java and is extended on the JVM in projects like Gremlin Groovy or Gremlin Scala and off the JVM in projects like Gremlin Python. Since metaprogramming is not an available feature of all of these languages, a new method for building DSLs needed to be devised.

Each Gremlin Language Variant has its own method of DSL development, but the recommended pattern for implementation is largely the same and is rooted in simple inheritance. Reviewing the basic class structure of the Traversal API, there are:

  • GraphTraversal - The interface that defines the step methods for the graph traversal DSL (e.g. out(), in(), select(), etc.)
  • GraphTraversalSource - A class that spawns GraphTraversal instances (i.e. the variable normally denoted by “g” that starts a traversal, as in g.V())
  • __ - A class that spawns anonymous GraphTraversal instances mostly used as inner traversals.

At the most simple level, creating a DSL involves extending upon these interfaces and classes. Programming languages, like Python, that are not extraordinarily restrictive on types allow DSLs to be built with limited effort. Java, on the other hand, makes things a bit more difficult. The following is a skeleton of a Java version of the KillrVideo DSL that directly extends GraphTraversalSource:

public class KillrVideoTraversalSource extends GraphTraversalSource {
  public KillrVideoTraversalSource(Graph graph) {
    super(graph);
  }

  public KillrVideoTraversalSource(Graph graph, TraversalStrategies strategies) {
    super(graph, strategies);
  }

  public KillrVideoTraversal<Vertex, Vertex> movies(String title) {
    // implementation omitted for brevity
  }
}

Note that it returns a KillrVideoTraversal, so when the DSL is used and a call is made to:

killr.movies("Young Guns")

the return value is a KillrVideoTraversal. As a side note, do not be confused by the use of "killr" for the variable name in place of "g". The familiar "g" could still be used, but to clarify that the code is using the KillrVideo DSL instead of the graph DSL, the "killr" variable name is used. Consider, however, what happens when the traversal “drops down” from the KillrVideo DSL to the graph DSL:

killr.movies("Young Guns").out("actors").in("actors")

The in() does not return a KillrVideoTraversal - it returns a GraphTraversal. Therefore, it is not possible to get back to the KillrVideo DSL without casting, as shown below:

// not possible because actors() is not a method of GraphTraversal, which is returned by in()
killr.movies("Young Guns").out("actor").in("actor").actors()

// possible with casting, which isn't ideal
((KillrVideoTraversal) killr.movies("Young Guns").out("actor").in("actor")).actors()

It isn’t necessarily hard to figure out what needs to be done to resolve this problem, but it is not one that is solved by simply extending a class.

To make DSL building a bit easier, TinkerPop has a GremlinDsl annotation which can help streamline the process of DSL building in Java. The GremlinDsl annotation can be applied to a template interface that extends GraphTraversal. The annotation marks the interface as one to be processed by the Java Annotation Processor, which will generate some boilerplate code at compilation, thus providing the KillrVideoTraversalSource (and related classes/interfaces) that is passed to graph.traversal() to begin using the DSL.

@GremlinDsl
public interface KillrVideoTraversalDsl<S, E> extends GraphTraversal.Admin<S, E> {
}

The KillrVideoTraversalSource will have its own methods to start a traversal. For example, rather than starting a traversal with g.V() it could be started with killr.movies(). To allow that to happen the annotation must be updated:

@GremlinDsl(traversalSource = "com.killrvideo.KillrVideoTraversalSourceDsl")
public interface KillrVideoTraversalDsl<S, E> extends GraphTraversal.Admin<S, E> {
}

Adding the traversalSource parameter specifies the class to use to help generate the KillrVideoTraversalSource class. The KillrVideoTraversalSourceDsl template class referenced above looks like this:

public class KillrVideoTraversalSourceDsl extends GraphTraversalSource {

    public KillrVideoTraversalSourceDsl(final Graph graph, final TraversalStrategies traversalStrategies) {
        super(graph, traversalStrategies);
    }

    public KillrVideoTraversalSourceDsl(final Graph graph) {
        super(graph);
    }
}

Both template classes, KillrVideoTraversalDsl and KillrVideoTraversalSourceDsl, will contain all the custom DSL methods that will drive the language. It is important to only use existing Gremlin steps (or other DSL steps conforming to that requirement) within these methods to build the traversal so that it remains compatible for remoting, serialization, traversal strategies and other aspects of the TinkerPop stack. It is also worth keeping in mind that the code within these DSL steps is meant for traversal construction. Attempting to include methods that do not meet the expected signature (e.g. adding a method that returns something other than a Traversal) may lead to unexpected problems for the annotation processor.

To this point, we've discussed the workings of a movies() step and an actors() step. These steps can be added to this DSL scaffolding as follows:

@GremlinDsl(traversalSource = "com.killrvideo.KillrVideoTraversalSourceDsl")
public interface KillrVideoTraversalDsl<S, E> extends GraphTraversal.Admin<S, E> {
    public default GraphTraversal<S, Vertex> actors() {
        return out("actor").hasLabel("person");
    }
}

public class KillrVideoTraversalSourceDsl extends GraphTraversalSource {

    public KillrVideoTraversalSourceDsl(final Graph graph, final TraversalStrategies traversalStrategies) {
        super(graph, traversalStrategies);
    }

    public KillrVideoTraversalSourceDsl(final Graph graph) {
        super(graph);
    }

    public GraphTraversal<Vertex, Vertex> movies(String... titles) {
        GraphTraversal traversal = this.clone().V();
        traversal = traversal.hasLabel("movie");
        if (titles.length == 1)
            traversal = traversal.has("title", titles[0]);
        else if (titles.length > 1)
            traversal = traversal.has("title", P.within(titles));
        return traversal;
    }
}

The movies() method demonstrates how a DSL can hide logic around a traversal’s construction. The traversal always filters on the label for movies but only adds filters by title if titles are present. These new steps can now be put to use in the following code, where the DSE Java Driver is used to connect to DSE Graph:

DseCluster dseCluster = DseCluster.builder()
                                  .addContactPoint("127.0.0.1")
                                  .build();
DseSession dseSession = dseCluster.connect();

KillrVideoTraversalSource killr = DseGraph.traversal(dseSession,
        new GraphOptions().setGraphName("killrvideo"), KillrVideoTraversalSource.class);
killr.movies("Young Guns").actors().values("name").toList();

Java Annotation Processor at Work

When building the KillrVideo DSL project, the Java Annotation Processor will detect the GremlinDsl annotation and generate the appropriate code. Assuming usage of Maven, the code will be generated to target/generated-sources/annotations and will produce four files:

  • __ - to construct anonymous traversals with DSL steps
  • KillrVideoTraversal - the Traversal implementation for the DSL that has all the standard GraphTraversal steps in addition to DSL steps
  • DefaultKillrVideoTraversal - the implementation of KillrVideoTraversal that is returned from the KillrVideoTraversalSource
  • KillrVideoTraversalSource - the TraversalSource extension for the DSL that spawns traversals

The generated code is lengthy and too much to display here, however the following snippets may help with understanding how everything fits together:

public interface KillrVideoTraversal<S, E> extends KillrVideoTraversalDsl<S, E> {
  @Override
  default KillrVideoTraversal<S, Vertex> actors() {
    return (KillrVideoTraversal) KillrVideoTraversalDsl.super.actors();
  }

  // other DSL steps from the full project omitted for brevity

  @Override
  default KillrVideoTraversal<S, Vertex> out(String... edgeLabels) {
    return (KillrVideoTraversal) KillrVideoTraversalDsl.super.out(edgeLabels);
  }

  // remaining Graph steps omitted for brevity
}

public class KillrVideoTraversalSource extends KillrVideoTraversalSourceDsl {
  public KillrVideoTraversalSource(Graph graph) {
    super(graph);
  }

  public KillrVideoTraversalSource(Graph graph, TraversalStrategies strategies) {
    super(graph, strategies);
  }

  @Override
  public KillrVideoTraversal<Vertex, Vertex> movies(String... titles) {
    KillrVideoTraversalSource clone = this.clone();
    return new DefaultKillrVideoTraversal(clone, super.movies(titles).asAdmin());
  }

  // remaining DSL steps omitted for brevity

  @Override
  public KillrVideoTraversalSource withStrategies(TraversalStrategy... traversalStrategies) {
    return (KillrVideoTraversalSource) super.withStrategies(traversalStrategies);
  }

  // remaining Graph steps omitted for brevity
}

The full KillrVideo DSL project includes additional steps and documentation and can be found in the DataStax graph-examples repository. This repository not only contains the Maven-based DSL project, but also includes instructions for loading the KillrVideo DSL data into DSE Graph using the DSE Graph Loader.

One of the interesting use cases presented there is for graph mutations. The project shows how to enable this syntax:

killr.movie("m100000", "Manos: The Hands of Fate", "USA", "Sun City Films", 1966, 70).
        ensure(actor("p1000000", "Tom Neyman")).
        ensure(actor("p1000001", "John Reynolds")).
        ensure(actor("p1000002", "Diane Mahree"))

The four lines of code above perform a number of tasks:

  • The mutation steps movie() and actor() are meant to "get or create/update" the relevant graph elements. Therefore, when we use movie() the traversal first determines if the movie is present. If it is, it simply returns it and updates its properties with those specified. If the movie is not present, it adds the vertex first with the presented properties and returns it
  • With the actor() step the complexity is even greater because it must first detect if the "person" vertex is present, and if not add it. It then must also detect if that person already has an "actor" edge to the movie and if not, add it.
  • Both mutation steps contain validation or sensible defaults for values that are not provided, in order to enforce data integrity. As this code is bound to the steps, the logic is centralized, which is convenient for testing and maintainability.
  • The ensure() step is an alias to the standard sideEffect() step. As an alias it provides for a more readable language in the KillrVideo domain. By wrapping the mutation steps in ensure(), the mutations become side-effects so that the "movie" vertex passes through those steps, which allows actor() steps to be chained together.

The syntax of the DSL is highly readable with respect to its intentions and the steps demonstrate their flexibility and power to be re-used and chained. Compare the above example to the actual graph traversal that is being executed underneath:

g.V().
  has("movie", "movieId", "m100000").
  fold().
  coalesce(
    __.unfold(),
    __.addV("movie").property("movieId", "m100000")).
  property("title", "Manos: The Hands of Fate").
  property("country", "USA").
  property("production", "Sun City Films").
  property("year", 1966).
  property("duration", 70).as("^movie").
  sideEffect(coalesce(out("actor").has("person", "personId", "p1000000"),
                      coalesce(V().has("person", "personId", "p1000000"),
                               addV("person").property("personId", "p1000000")).
                      property("name", "Tom Neyman").
                      addE("actor").
                        from("^movie").
                      inV())).
  sideEffect(coalesce(out("actor").has("person", "personId", "p1000001"),
                      coalesce(V().has("person", "personId", "p1000001"),
                               addV("person").property("personId", "p1000001")).
                      property("name", "John Reynolds").
                      addE("actor").
                        from("^movie").
                      inV())).
  sideEffect(coalesce(out("actor").has("person", "personId", "p1000002"),
                      coalesce(V().has("person", "personId", "p1000002"),
                               addV("person").property("personId", "p1000002")).
                      property("name", "Diane Mahree").
                      addE("actor").
                        from("^movie").
                      inV()))

Conclusion

This blog post has discussed the benefits of Gremlin DSLs and re-established their development patterns with Gremlin for TinkerPop 3.x. As we conclude, let’s return to the "Dr. Gremlin" example from the introduction:

dr.patients("55600064").                 // find a patient by id
   prescriptions(Rx.CURRENT).            // get prescriptions they are currently prescribed
     consider("lipitor", "40mg").        // for purpose of analysis add lipitor to that list
     consider("lunesta", "1mg").         // for purpose of analysis add lunesta to that list
   interactions(Type.DRUG, Level.SEVERE) // find possible drug interactions that are severe

While the graph language is hidden, we've already seen that it is quite possible to drop back into that language at any point along the way if desired. Assuming the interactions() DSL step returned "interaction" vertices, it would be simple enough to filter those with a graph-based has() step. The power of the DSL approach is that the essence of Gremlin still remains. Each step, whether from the graph language or the DSL, still serves to mutate the type of the object passed to it from the previous step and all standard Gremlin semantics remain intact.

Acknowledgements

Daniel Kuppitz reviewed the KillrVideo DSL and provided some useful improvements to it. Marko A. Rodriguez reviewed the early draft of this blog post and offered feedback which was incorporated into it.

The Von Gremlin Architecture


One sunny summer afternoon, Gremlin was floating in a rocky pond near a grassy knoll a mere stone's throw from a fertile garden filled with morning glory tentacles tangling themselves 'round many merry marigolds. As far as days go, Gremlin was having a good one. He stared up at fluffy cumulus clouds seemingly made of whipped cream dressed upon a blue desert sky. Sprinkles of rain drizzled down nearly evaporating before hitting his crispy skin being baked by the sharp summer sun slicing sideways through the scene. Gremlin was unable to discern where his face ended and the atmosphere began. The concatenation of serenely surreal sensations dissolved his ever-traversing problem-solving mind as there were no stresses evoking solution-seeking thoughts. The mind's muscles lost their tension...relieving themselves of their duty. The wash of experience slipped past his ensnaring concept net revealing an infinite moment, where his mind's traversal finally halted, exposing the machine that manifests it all -- the Gremlin traversal machine.

This article presents graphs as a universal data structure and Gremlin as a universal virtual machine for the burgeoning multi-model approach to Big Data computing.

The TinkerPop Is Working Against You On Your Behalf

Gremlin had always understood himself to be the constant cascade of traversing thoughts that flashed by one after another from birth to death. Never did he spend one of those many thoughts to think about what was actually doing the thinking. It wasn't until that day, floating in the summer pond, that he realized there was a more general process underlying his traversals. The fact that he was having a thought about the very nature of his thoughts made him think that this thought machinery was powerful. He questioned: "How is it that my mind is able to create so much novelty/complexity?" Gremlin couldn't put into words what computer scientists had known all along -- Gremlin was Turing Complete. The realization that he was capable of manifesting any possible traversal threw him into a spiraling dizzy. Gremlin was a general-purpose traversing machine!

Gremlin's previously comfortable home within the constant stream of traversals had collapsed... He could no longer simply be a traversal (a particular program), he would be the source of all traversals (the machine capable of executing any program). "Graph!" He tried desperately to throw out an anchor. Anything to help dig himself into the walls of this receding pit of abstraction. "While I can process graphs using any arbitrary algorithm, what about graphs themselves? There must be something more fundamental representing that structure!?" Fear drove Gremlin to try his hardest to avoid the inevitable entailment of the logic of his existence -- his manifestation as The TinkerPop. Instead, Gremlin wanted a buffer. Something that made him just some standard green guy livin' a standard green life captured on an old bleached out picture in a run-of-the-mill frame on an unimportant wall of The TinkerPop's celestial palace somewhere out there in the great wide computational divide. Unfortunately, his anchor had no bite. Like his processing self, graphs are universal. Every other data structure is a graph: maps (key/value), matrices (relational), trees (documents). In the discrete world of computing, there is nothing that eludes an elegant graphical representation. There are things (vertices/datum/energy) and they are related to one another (edges/locations/space). Simplicity knows no better.

The Multi-Model Graph Database

Database             Query Language     Gremlin Language     Graph Constraint
Relational database  SQL                SQL-Gremlin          pre-joined edges
Triplestore          SPARQL             SPARQL-Gremlin       no properties
Document database    Query Documents    MongoDB-Gremlin      disconnected trees

 

The Gremlin traversal machine is a graph-based virtual machine whose instruction set is a collection of functions called steps. A program, called a traversal, is a linear/nested composition of said steps. A traversal's language agnostic representation is called Gremlin bytecode. The fact that any database query language can compile to Gremlin bytecode and that Gremlin's memory system is a property graph composed of vertices, edges, and properties means that these database query languages are querying their respective data structure embedded within the graph. For instance, the existence of SQL-Gremlin (by Ted Wilmes) means that the concept of tables, rows, and joins are not only being upheld by the Gremlin traversal machine (processor), but also by the graph database (memory). Moreover, the existence of SPARQL-Gremlin (by Daniel Kuppitz and Harsh Thakkar) means that the concept of subjects, predicates, objects and patterns are not only being upheld by the Gremlin traversal machine (algorithm), but also by the graph database (data structure). This post will present MongoDB-Gremlin which maps JSON documents to directed acyclic trees within any Apache TinkerPop-enabled graph database and enables the user to query such graph-encoded documents using MongoDB's JSON-based CRUD language.

 

A General-Purpose Distributed Computer


The widely popular Von Neumann architecture contains both a memory unit (structure) and a central processing unit (process). The central processing unit is able to read and write data in memory. The memory's data can be arbitrarily accessed using a pointer (a reference). Sometimes a particular referenced datum in memory is itself a pointer! Such self-referential memory enables the logical embedding of any conceivable discrete data structure into physical memory. Along with memory access, the central processing unit is also able to perform recursive (e.g. while) and conditional operations (e.g. if/else). The operations of the processor form the computer's assembly language. Higher-level programming languages have been created to enable the user to manipulate the computer's memory and processor from a particular point of view (e.g. as functions, objects, etc.). Together, an arbitrarily organizable memory structure and an expressive processor instruction set yields the general-purpose computer.

The graph is analogous to the memory structure of modern computers. Vertices (data) can be arbitrarily connected via edges (pointers). Moreover, the Gremlin traversal machine is analogous to the central processing unit. It is able to read from the graph (V(), out(), values()), write to the graph (e.g. addV(), addE(), property(), drop()), and perform looping (repeat()) and branching (choose()) operations. The Gremlin traversal machine's operations (steps) form the foundation of Gremlin's assembly language. From there, query languages can be created to enable the user to work with the graph from a particular point of view (e.g. as tables, triples, documents, or domain-specific custom structures). Together, an arbitrarily connectable graph and a Turing Complete graph-oriented traversal machine yields the general-purpose database.

A Graph-Based Document Database

A document, at its core, is a tree. Some document databases represent their documents as JSON objects. While JSON objects are generally trees, they have a few other complicating characteristics such as a distinction between arrays, objects (tree nodes), and primitives (tree leaves). It is possible to encode JSON objects in a graph database because a tree is a topologically constrained directed, binary graph. In particular, a tree is a directed acyclic graph. To demonstrate the document to graph embedding, assume the following JSON object describing a person, their hobbies, and their spoken languages.

{
  "name" : "Gremlin",
  "hobbies" : ["traversing", "reflecting"],
  "birthyear" : 2009,
  "alive" : true,
  "languages" : [
    {
      "name" : "Gremlin-Java",
      "language" : "Java8"
    },
    {
      "name" : "Gremlin-Python",
      "language" : "Python"
    },
    {
      "name" : "Ogre",
      "language" : "Clojure"
    }
  ]
}

In the popular document database MongoDB®, a user would store the above object in a people-collection. Along with the above object, assume the people-collection also contains other similarly schema'd objects. The following queries, represented as JSON query objects, could then be submitted and answered in a tuple-space manner.

// find people named Gremlin
db.people.find({"name":"Gremlin"})

// find people born after the year 2000
db.people.find({"birthyear":{"$gt":2000}})

// find alive people who like to traverse and/or halt
db.people.find({"alive":true,"hobbies":{"$in":["traversing","halting"]})

// find people that speak a language in Java8
db.people.find({"languages.language":"Java8"})

Apache TinkerPop graphs are composed of vertices, edges, and properties, where both vertices and edges can have any number of associated key/value-pairs. A document-to-graph embedding could map every JSON object (document) to a vertex. All primitive fields of the object would be properties on the vertex. If a field contained a primitive array value, then vertex multi-properties would be used (i.e. a single property containing multiple values). If a field referenced another object (sub-document), then the object would be another vertex with the respective field name being the label of the edge connecting the parent object to the child object. Given this unambiguous, one-to-one bijective mapping between a document and a graph, the previous MongoDB query documents can be written in Gremlin as below.

IMPORTANT: The above straightforward document-to-graph encoding assumes that arrays can not be nested and can only contain either all primitives or all objects (i.e. sub-documents).

// find people named Gremlin
g.V().hasLabel("person").has("name","Gremlin")

// find people born after the year 2000
g.V().hasLabel("person").has("birthyear", gt(2000))

// find alive people who like to traverse and/or halt
g.V().hasLabel("person").has("alive",true).has("hobbies",within("traversing","halting"))

// find people that speak a language in Java8
g.V().hasLabel("person").where(out("languages").has("language","Java8"))

The above traversals will only return the Gremlin vertex (the parent document). However, in MongoDB, what is returned is the entire document which includes the parent document and all nested, embedded sub-documents. Given that a JSON document is a directed acyclic graph, the query traversal simply needs to be post-fixed with a recursive walk from the parent document vertex to all reachable properties (primitive fields) and vertices (sub-documents). This is accomplished using Gremlin's repeat()-step which directs traversers to walk down the graph-encoded tree until() they reach leaf vertices (i.e. vertices with no children). From there, the paths that the traversers took are gathered into a tree() displaying the properties on each vertex as well as the label of the edge used to traverse to them.

gremlin> g.V().hasLabel("person").has("name","Gremlin").    // construct query document
           until(out().count().is(0)).                      // construct result document
             repeat(outE().inV()).
           tree().
             by(valueMap()).
             by(label())
==>
[
  [
    alive:[true],
    birthyear:[2009],
    hobbies:[traversing,reflecting],
    name:[Gremlin]
  ]:
  [
    languages:[
      [name:[Ogre],language:[Clojure]]:[],
      [name:[Gremlin-Python],language:[Python]]:[],
      [name:[Gremlin-Java],language:[Java8]]:[]
    ]
  ]
]

This section has specified how to:

  1. Encode a JSON document into a property graph (topological embedding).
  2. Translate MongoDB JSON query documents to Gremlin traversals (language compiler).
  3. Translate Gremlin traversal results to a JSON document (result formatter).

MongoDB-Gremlin

As of August 2017, there exists a proof-of-concept, distinct Gremlin language named MongoDB-Gremlin. MongoDB-Gremlin compiles both MongoDB query and insert JSON documents into a Gremlin Traversal. The traversal consumes a JSON String document and yields an iterator of JSONObjects. In between the input and output, the graph is traversed accordingly.

In the code fragment below, the mongodb-gremlin library is loaded into the Gremlin Console. The db traversal source allows the user to interact with the underlying graph from a MongoDB perspective. MongoDBTraversalSource provides two methods: insertOne() and find(). Both methods yield a Traversal.

~/software/tinkerpop bin/gremlin.sh

         \,,,/
         (o o)
-----oOOo-(3)-oOOo-----
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
plugin activated: tinkerpop.tinkergraph
gremlin> :install com.datastax.tinkerpop mongodb-gremlin 0.1-SNAPSHOT
==>Loaded: [com.datastax.tinkerpop, mongodb-gremlin, 0.1-SNAPSHOT]
gremlin> import com.datastax.tinkerpop.mongodb.MongoDBTraversalSource
gremlin> graph = TinkerGraph.open()
==>tinkergraph[vertices:0 edges:0]
gremlin> db = graph.traversal(MongoDBTraversalSource)
==>mongodbtraversalsource[tinkergraph[vertices:0 edges:0], standard]

In the code fragment below, db.insertOne() is used to insert a JSON object. The result is the standard result object of a MongoDB insert operation.

gremlin> db.insertOne("""
{
  "~label" : "person",
  "name" : "Gremlin",
  "hobbies" : ["traversing", "reflecting"],
  "birthyear" : 2009,
  "alive" : true,
  "languages" : [
    {
      "name" : "Gremlin-Java",
      "language" : "Java8"
    },
    {
      "name" : "Gremlin-Python",
      "language" : "Python"
    },
    {
      "name" : "Ogre",
      "language" : "Clojure"
    }
  ]
}
""")
==>[acknowledged:true,insertId:0]

Once the (graph) database has been populated with JSON objects, it is possible to use MongoDB query documents to find JSON objects that match desired patterns.

gremlin> db.find('{"name" : "Gremlin"}')
==>[~label:person,
    birthyear:2009,
    alive:true,
    languages:[
       [~label:vertex,~id:2,name:Gremlin-Java,language:Java8],
       [~label:vertex,~id:6,name:Gremlin-Python,language:Python],
       [~label:vertex,~id:10,name:Ogre,language:Clojure]],
    hobbies:[traversing,reflecting],
    ~id:0,
    name:Gremlin]
gremlin> db.find('{"birthyear" : {"$gt":2008}}')
==>[~label:person,
    birthyear:2009,
    alive:true,
    languages:[
       [~label:vertex,~id:2,name:Gremlin-Java,language:Java8],
       [~label:vertex,~id:6,name:Gremlin-Python,language:Python],
       [~label:vertex,~id:10,name:Ogre,language:Clojure]],
    hobbies:[traversing,reflecting],
    ~id:0,
    name:Gremlin]

It is important to note that because JSON objects are encoded as directed acyclic graphs within the underlying Apache TinkerPop-enabled graph system, it is also possible to use Gremlin's standard GraphTraversal to interact with the respective "documents."

gremlin> g = graph.traversal()
==>graphtraversalsource[tinkergraph[vertices:4 edges:3], standard]
gremlin> g.V().has('name','Gremlin').
           out('languages').has('name','Ogre').
           values('language')
==>Clojure

The Many Engines of the Gremlin Traversal Machine

It is useful that graphs can conveniently model any other data structure. It is useful that Gremlin is Turing Complete. It is even more useful that Gremlin is expressive enough to elegantly process graphs. It is for these reasons that Gremlin can simulate any query language with relative ease. However, the most interesting aspect of the Gremlin traversal machine is that Gremlin bytecode can be evaluated by different execution engines depending on the workload demanded by the traversal. For instance, the first traversal below is best served by an analytical (OLAP) engine and the second by a real-time (OLTP) engine.

g.V().hasLabel("person").groupCount().by("age") (OLAP)
g.V(1).out("knows").values("name") (OLTP)

The Gremlin machine allows graph system providers to select the type of execution engine used to evaluate a traversal. Apache TinkerPop3 provides two such engines: standard (OLTP) and computer (OLAP). The former leverages a streaming model that moves traversers through a functional pipeline. The latter uses parallel, distributed, linear map-like scans of the vertex set, with traverser message passing occurring in a reduce-like manner. Recent advances have been made in leveraging the actor model for message-passing traversers in an OLTP-based, data-locality-aware, query-routing manner. The reason that common distributed computing models are able to conveniently execute Gremlin is the inherently distributed nature of a Gremlin traversal. Gremlin bytecode describes a traversal as a nestable chain of simple, stateless functions that operate on immutable traverser objects. The encapsulation of the computational state into a traverser swarm enables the computation to be more easily distributed and localized, as each traverser is its own processor referencing a particular vertex (or edge) in the graph. The traversers of a traversal act as distributed processing units over a distributed graph structure.
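As a concrete illustration, the fragment below is a minimal sketch, assuming the TinkerGraph populated in the earlier fragments, of evaluating the same traversal with each engine. The standard (OLTP) engine comes from graph.traversal(); the computer (OLAP) engine comes from withComputer(), which on TinkerGraph is backed by TinkerGraphComputer.

// Standard (OLTP) engine: a streaming pipeline that moves traversers through the steps.
g = graph.traversal()
g.V().hasLabel("person").groupCount().by("birthyear")

// Computer (OLAP) engine: parallel, vertex-centric scans with traverser message passing.
// On TinkerGraph this is TinkerGraphComputer; distributed systems can plug in engines such as SparkGraphComputer.
g = graph.traversal().withComputer()
g.V().hasLabel("person").groupCount().by("birthyear")

The Gremlin bytecode is identical in both cases; only the engine evaluating it differs.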

The engine-agnostic aspect of the Gremlin traversal machine makes it unique in the virtual machine computing arena -- especially in the Big Data processing space. It allows the same Gremlin bytecode to execute on a single machine or across many machines while leveraging different memory access patterns accordingly. What this entails is that any query language can be exposed to the user as long as there exists a compiler that compiles that language to Gremlin bytecode. In sum, Gremlin allows users to write any algorithm for execution by the Gremlin traversal machine, using any standard distributed computing framework, across data stored in any graph database system.
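For the curious, that bytecode can be inspected directly in the Gremlin Console; the line below is a small sketch assuming a TinkerPop 3.2.x release with bytecode support and the g traversal source from above.

// getBytecode() returns the source and step instructions of a traversal, which any
// compliant execution engine (standard, computer, or provider-specific) can evaluate.
g.V().hasLabel("person").groupCount().by("birthyear").asAdmin().getBytecode()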

Forever You, Forever Me Within The TinkerPop

It was too late for Gremlin to turn back. He had reached the event horizon of realization. From the perspective of his mechanical friends watching him float listlessly in the rocky pond on that righteously wonderful day, Gremlin appeared to be frozen in a state of eternal bliss. Never again would Gremlin be tossed about vertices and edges by life's traversals winding, bending, and knotting him through and around the ever-changing graph topology. Instead now, with every breath he breathed, the cosmos danced feverishly before him as shimmering shadows of his fundamental instruction set within the Gremlin traversal machine. Gremlin understood that he was the means, the reason, the source of every traversal that flickered upon the graphical membrane of reality. However, even within this most extreme of contemplative states, Gremlin was still coming to terms with every detail, nuance, variation, theme, purpose, hope, and dream of something he could not yet put his mind's eye upon. Infinitely far away, yet speeding infinitely fast towards that which he always was, is, and will be for all time (process) and for all things (structure) -- The TinkerPop.

Credit: Sebastian Good offered this blog post's title during a Twitter conversation.
