The document discusses lessons learned from using Spark to load data from Oracle into Cassandra. It describes problems encountered with Spark SQL handling Oracle NUMBER and timeuuid fields incorrectly. It also discusses issues with generating IDs across RDDs and with Scala's 22-field tuple limit when returning wide rows from an RDD. The resources section provides references for learning more about Spark, Scala, and using Spark with Cassandra.
Co-Organizer of Dallas Cassandra Meetup Group
Certified Apache Cassandra Developer
CQL COPY command - This is a quick-and-dirty way of getting data from a text file into a C* table. The primary limiting factor is that the data in the text file has to match the schema of the target table.
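As a rough illustration of the COPY approach (the keyspace, table, columns, and file name here are all made up), the cqlsh invocation looks something like:

```sql
-- Load users.csv into demo.users; columns in the file must line up
-- with the columns listed here, in this order.
COPY demo.users (id, name, email) FROM 'users.csv' WITH HEADER = true;
```

Note there is no hook for transforming values on the way in, which is exactly the limitation described above.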
Sqoop - this is a tool from the Hadoop ecosystem, but it works for C*, too. It's meant for moving data to/from an RDBMS, and it's pretty limited in any kind of transformation you want to do.
Write a Java program. It's pretty simple to write a Java program that reads from a text file and uses the CQL driver to write to C*. If you set the write consistency level to ANY and use the executeAsync() methods, you can get it to run pretty darn fast.
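A minimal sketch of that loader shape, with the driver's session.executeAsync() stood in for by CompletableFuture so the fire-and-forget pattern is visible without a live cluster (the demo.users schema and column names are hypothetical):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

public class AsyncCsvLoader {

    // Turn one CSV line into a CQL INSERT (hypothetical schema: id, name).
    static String toInsert(String csvLine) {
        String[] cols = csvLine.split(",", 2);
        return "INSERT INTO demo.users (id, name) VALUES ("
                + cols[0].trim() + ", '" + cols[1].trim() + "')";
    }

    public static void main(String[] args) {
        List<String> lines = List.of("1, alice", "2, bob", "3, carol");
        ExecutorService pool = Executors.newFixedThreadPool(4);

        // Submit every insert without blocking on each one individually,
        // mirroring session.executeAsync(statement) in the real driver.
        List<CompletableFuture<String>> futures = lines.stream()
                .map(line -> CompletableFuture.supplyAsync(() -> toInsert(line), pool))
                .collect(Collectors.toList());

        // Block only once, at the end, to drain the outstanding writes.
        for (CompletableFuture<String> f : futures) {
            System.out.println(f.join());
        }
        pool.shutdown();
    }
}
```

The speedup comes from keeping many writes in flight at once instead of a synchronous execute-per-row loop.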
Write a Spark program. This is a great option if you want to transform the schema of the source before writing to the C* destination. You can get the data from any number of sources (text files, RDBMS, etc.), use a map transformation to reshape the data into the right format, and then use the Spark Cassandra Connector to write the data to C*.
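To make the "map then write" step concrete, here is a sketch of the kind of per-row reshape you would hand to the RDD's map before the connector's saveToCassandra call. Plain Java streams stand in for the RDD so the sketch runs standalone; the source and target column layouts (ID|FULL_NAME into user_id, first_name) are hypothetical:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class TransformSketch {

    // Reshape an Oracle-style row "ID|FULL_NAME" into the C* target shape
    // (user_id, first_name) -- the kind of per-row rewrite a map() would do.
    static Map.Entry<Long, String> reshape(String oracleRow) {
        String[] cols = oracleRow.split("\\|");
        String firstName = cols[1].split(" ")[0];
        return new SimpleEntry<>(Long.parseLong(cols[0].trim()), firstName);
    }

    public static void main(String[] args) {
        List<String> sourceRows = List.of("10|Ada Lovelace", "11|Grace Hopper");

        // In a real Spark job this stage would be roughly:
        //   rdd.map(TransformSketch::reshape) followed by the connector's
        //   saveToCassandra against the target keyspace/table.
        List<Map.Entry<Long, String>> target = sourceRows.stream()
                .map(TransformSketch::reshape)
                .collect(Collectors.toList());

        target.forEach(e -> System.out.println(e.getKey() + " -> " + e.getValue()));
    }
}
```

Because the transform is an ordinary function, it is easy to unit test independently of both the source database and the cluster.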