Hadoop

Guidelines

We are using the FairScheduler with one pool per user. Additionally, there is a pool for background tasks. This pool has a weight of 0, so jobs in it will never start while other jobs are running.

If there are problems with scheduling, first check the scheduler page: http://rv-n11.cs.washington.edu:50030/scheduler. Second, you can make a hot configuration change to conf/fairscheduler.xml, such as adding a minMaps guarantee for a pool; these changes take effect within seconds. Go ahead and make changes if you need to, but please keep Michael informed (send schmmd@cs.washington.edu an email); otherwise it gets too confusing.
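
For illustration, here is roughly what such a change might look like (the webcrawl pool name is made up; the background pool is the weight-0 pool described above):

  <?xml version="1.0"?>
  <allocations>
    <!-- guarantee 10 map slots to the (hypothetical) webcrawl pool -->
    <pool name="webcrawl">
      <minMaps>10</minMaps>
    </pool>
    <!-- background pool: weight 0, only gets slots no other pool wants -->
    <pool name="background">
      <weight>0</weight>
    </pool>
  </allocations>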

A number of people have suggested characteristics for good jobs. The following defines a "nice" job.

  1. Mappers finish in more than 60 seconds (preferably several minutes) and less than an hour (preferably much less).
  2. There are no more than 9 reducers.
  3. The entire job finishes within 48 hours.

You can reduce the number of mappers by raising "mapred.min.split.size", and you can increase the number of mappers by decreasing the block size of your file (dfs.block.size). When we upgrade (next quarter), we will be able to specify per-pool reducer limits.
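
For example (the jar and class names below are placeholders, and the -D form assumes the job parses generic options via ToolRunner):

  # fewer mappers: ask for splits of at least 256 MB
  hadoop jar myjob.jar MyJob -D mapred.min.split.size=268435456 /in /out

  # more mappers: store the file with a smaller (32 MB) block size at upload time
  hadoop fs -D dfs.block.size=33554432 -put bigfile.txt /data/bigfile.txt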

Jobs should not be run if they have any of the following characteristics. These are "pathological" jobs.

  1. Mappers finish in less than 30 seconds or more than 4 hours.
  2. There are more than 18 reducers. Exception: more reducers are allowable if they finish quickly (all reducers take less than 10 min using all slots).
  3. The job will take more than 7 days.

If you need to run a pathological job, you need to email the users of the cluster explaining why you are running such a job, and buy the group a pitcher of beer. If you see a pathological job that was not explained, you should contact the owner and ask them to kill it. If 30 minutes pass and you have not heard back, you may kill the job yourself.

We will not allow people to use the Hadoop cluster if they are under 21.

Tips

Track your jobs at the Job Tracker page (http://rv-n11.cs.washington.edu:50030/).

How to run streaming jobs

You don't necessarily have to write a Java/Scala program and create a jar file to run jobs. You can also use the Hadoop streaming API, which is a wrapper that allows you to run arbitrary mapper and reducer commands that work with standard input and standard output.

To run a streaming job, run:

  hadoop jar ~knowall/hadoop/contrib/streaming/hadoop-streaming-1.0.2.jar \
      -file mymapper.py -mapper 'python mymapper.py' \
      -file myreducer.py -reducer 'python myreducer.py' \
      -input /dir/to/input/in/hdfs -output /dir/to/output/in/hdfs

Pass in any files you need using the -file option. Note that the mapper and reducer can be any commands that read from standard input and print to standard output. So, for example, if you only want the map step to transform each line of a large input file, you can use cat as your reducer, i.e., -reducer cat.
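
As a concrete sketch, here is a minimal word-count pair that fits the mymapper.py/myreducer.py placeholders in the command above (the job itself is illustrative, not a recommendation):

  # mymapper.py: emit "word<TAB>1" for each word on standard input
  import sys

  for line in sys.stdin:
      for word in line.split():
          sys.stdout.write('%s\t1\n' % word)

  # myreducer.py: streaming sorts mapper output by key, so all counts for a
  # word arrive consecutively; sum them and emit one line per word
  import sys

  current_word, current_count = None, 0
  for line in sys.stdin:
      word, count = line.rstrip('\n').split('\t', 1)
      if word == current_word:
          current_count += int(count)
      else:
          if current_word is not None:
              sys.stdout.write('%s\t%d\n' % (current_word, current_count))
          current_word, current_count = word, int(count)
  if current_word is not None:
      sys.stdout.write('%s\t%d\n' % (current_word, current_count))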

Reference: Writing streaming jobs in Hadoop.

Troubleshooting

Dead nodes

If a node is dead according to the DFS Health page, do the following steps:

  1. ssh into the dead node as knowall (ask Michael)
  2. Run hadoop datanode to start the data node.

This process is now tied to your terminal, so you need to disown it (an example session follows these steps):

  1. Press Ctrl+Z to suspend the program
  2. Run bg to run it in the background
  3. Run disown -h
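
The whole sequence looks roughly like this (output abridged; job numbers will differ):

  $ hadoop datanode
  ...datanode log output...
  ^Z
  [1]+  Stopped                 hadoop datanode
  $ bg
  [1]+ hadoop datanode &
  $ disown -h
  $ exit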

If you get a shell prompt back after that, it's safe to exit. If the nodes still don't appear to be working in the JobTracker, then do these steps:

  1. ssh into rv-n11 as knowall
  2. run stop-mapred.sh
  3. run start-mapred.sh

If stop-mapred.sh fails to stop something, you need to ssh into the nodes and kill those processes manually. It's safe-ish to kill mapred processes, but do not kill anything related to DFS, or else we might experience data corruption.
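
One way to tell the processes apart is jps, which lists the running Java daemons by name (the PIDs below are made up):

  $ jps
  4242 TaskTracker
  4243 DataNode
  4244 Jps
  $ kill 4242

Here 4242, the TaskTracker, is a mapred process and safe-ish to kill; 4243 is the DataNode, which belongs to DFS and must be left alone.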

If you are seeing the error org.apache.hadoop.security.AccessControlException, run the following commands as knowall:

  • hadoop fs -chmod a+x hdfs:///hadoop/mapred/system
  • hadoop fs -chmod a+r hdfs:///hadoop/mapred/system
  • hadoop fs -chmod a+w hdfs:///hadoop/mapred/system
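
(The three commands can also be collapsed into one: hadoop fs -chmod a+rwx hdfs:///hadoop/mapred/system.)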