Scoobi

From Knowitall

Scoobi (Github) is a Scala library for writing Hadoop MapReduce jobs. We have a number of Scoobi jobs in the browser-hadoop project, under edu.washington.cs.knowitall.browser.hadoop.scoobi.

Running

To run a Scoobi job, set the main class in the browser-hadoop pom.xml and build a standalone jar with mvn clean compile assembly:single. You can then test the job locally by running java -jar myjob.jar [args], or run it on Hadoop with a command like hadoop jar myjob.jar -Dmapred.task.timeout=1200000 -Dmapred.child.java.opts=-Xmx4G [args] -- scoobi nolibjars.

If you're getting an error like java.lang.ClassNotFoundException: com.nicta.scoobi.impl.exec.MscrMapper, it's probably because you forgot to add -- scoobi nolibjars to the end of the command.
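Putting the steps above together, a typical session looks like the following sketch (myjob.jar stands in for whatever jar name your build actually produces, and [args] for your job's arguments):

```shell
# Build a standalone jar; the main class is whatever you set in pom.xml
mvn clean compile assembly:single

# Smoke-test the job locally first
java -jar myjob.jar [args]

# Run on the cluster; note the trailing "-- scoobi nolibjars"
hadoop jar myjob.jar -Dmapred.task.timeout=1200000 \
    -Dmapred.child.java.opts=-Xmx4G [args] -- scoobi nolibjars
```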

Scoobi gotchas

Sometimes jobs will work on your local machine but fail on the cluster. One possible reason is that you can't use "static" variables (vals) defined directly in the object that extends ScoobiApp. This is likely because ScoobiApp, like scala.App, relies on Scala's DelayedInit: the object's constructor body, including val initializers, is deferred until the program's entry point runs, and the mapper tasks on the cluster use the object without ever running that deferred code, leaving the vals null. If you want static variables, put them in a separate object and import them whenever you want to use them. For example:

Bad:

object MyScoobiJob extends ScoobiApp {
  val myStaticVar = new Something()

  def myMethod(input: DList[String]): DList[String] =
    // NullPointerException on the cluster: myStaticVar was never initialized
    input.map(s => myStaticVar.doSomething(s))
}

Good:

object MyScoobiJobStaticVars {
  val myStaticVar = new Something()
}

object MyScoobiJob extends ScoobiApp {
  def myMethod(input: DList[String]): DList[String] = {
    // myStaticVar now lives in a plain object, so it initializes normally
    import MyScoobiJobStaticVars._
    input.map(s => myStaticVar.doSomething(s))
  }
}
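The likely mechanism can be reproduced without Hadoop using plain scala.App, which also mixes in DelayedInit. DelayedDemo and greeting below are made-up names for illustration:

```scala
// Under Scala 2, App uses DelayedInit, so this val assignment is moved
// out of the object's constructor and only executes when main() is called.
// Reading DelayedDemo.greeting before then yields null on Scala 2 --
// and on the cluster, the mapper tasks presumably never run main() at all.
object DelayedDemo extends App {
  val greeting = "hello"
}
```

Once main() has run, the field is populated as expected; the bug only bites code that touches the object before (or without) the deferred initialization.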