Hadoop

Guidelines

We are using the FairScheduler with one pool per user, plus a pool for background tasks. The background pool has a weight of 0, so jobs in it will never start while other jobs are running.

If there are problems with scheduling, first check the scheduler page: http://rv-n11.cs.washington.edu:50030/scheduler. Second, you can make a hot configuration change to conf/fairscheduler.xml, such as adding a minMaps setting for a pool; these changes take effect in seconds. Go ahead and make changes if you need to, but please keep Michael informed (send schmmd@cs.washington.edu an email). Otherwise it gets too confusing.
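
As a rough sketch (pool names and numbers are illustrative, and the exact tags depend on our FairScheduler version), a hot change to conf/fairscheduler.xml that guarantees one user's pool some map slots, alongside the weight-0 background pool, might look like:

  <?xml version="1.0"?>
  <allocations>
    <!-- Illustrative: guarantee this user's pool 10 map slots -->
    <pool name="someuser">
      <minMaps>10</minMaps>
    </pool>
    <!-- Background pool: weight 0, so its jobs only start when nothing else is running -->
    <pool name="background">
      <weight>0.0</weight>
    </pool>
  </allocations>

The scheduler picks this file up on its own (that is what makes it a hot change); check the scheduler page above to confirm the change took effect.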

A number of people have suggested characteristics for good jobs. The following defines a "nice" job.

  1. Mappers finish in more than 60 seconds (preferably some minutes) and less than an hour (preferably much less).
  2. There are no more than 9 reducers.
  3. The entire job finishes within 48 hours.

You can reduce the number of mappers by raising "mapred.min.split.size", and you can increase the number of mappers by decreasing the block size of your file (dfs.block.size). When we upgrade (next quarter) we will be able to specify per-pool reducer limits.
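
For example, assuming your job goes through ToolRunner so it accepts the generic -D options (jar names, paths, and sizes below are only illustrative), these settings can be passed on the command line:

  # Fewer mappers: raise the minimum split size to 512 MB for this job
  hadoop jar myjob.jar MyJob -D mapred.min.split.size=536870912 input/ output/

  # Stay within the "nice" limit of 9 reducers
  hadoop jar myjob.jar MyJob -D mapred.reduce.tasks=9 input/ output/

  # More mappers: write the input file with a smaller block size (64 MB here)
  hadoop fs -D dfs.block.size=67108864 -put mydata.txt /data/mydata.txt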

Jobs should not be run if they have any of the following characteristics. These are "pathological" jobs.

  1. Mappers finish in less than 30 seconds or more than 4 hours.
  2. There are more than 18 reducers. Exception: more reducers are allowable if they finish quickly (all reducers take less than 10 min using all slots).
  3. The job will take more than 7 days.

If you need to run a pathological job, you need to email the users of the cluster, explain why you are running such a job, and buy the group a pitcher of beer. If you see a pathological job that was not explained, you should contact the owner and ask them to kill their job. If 30 minutes pass and you have not heard back, you may kill the job.

We will not allow people to use the Hadoop cluster if they are under 21.

Troubleshooting

Dead nodes

If a node is dead according to the DFS Health page (http://rv-n11.cs.washington.edu:50070/dfshealth.jsp), do the following steps:

  1. ssh into the dead node as knowall (ask Michael)
  2. Run hadoop datanode to start the data node.

This process is tied to your terminal, so you need to disown it:

  1. Press Ctrl+z to pause the program
  2. Run bg to run it in the background
  3. Run disown -h
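
Put together, the session on the dead node might look roughly like this (the hostname, prompt, and job-control output are illustrative):

  knowall@rv-nXX:~$ hadoop datanode
  ... datanode startup output ...
  ^Z
  [1]+  Stopped                 hadoop datanode
  knowall@rv-nXX:~$ bg
  [1]+ hadoop datanode &
  knowall@rv-nXX:~$ disown -h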

If you get a terminal prompt after that, it's safe to exit.

If the nodes don't appear to be working in the JobTracker (http://rv-n11.cs.washington.edu:50030/jobtracker.jsp), then do these steps:

  1. ssh into rv-n11 as knowall
  2. Run stop-mapred.sh
  3. Run start-mapred.sh

If stop-mapred fails to stop something, then you need to ssh into the nodes and manually kill those processes. It's safe-ish to kill mapred processes, but do not kill anything related to DFS, or else we might experience data corruption.
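
A sketch of what that looks like on a worker node (PIDs and hostname are illustrative; double-check what you are killing):

  # List the Hadoop daemons running on this node
  knowall@rv-nXX:~$ jps
  12345 TaskTracker
  12346 DataNode
  23456 Jps
  # TaskTracker is the mapred daemon and is the safe-ish one to kill.
  # DataNode belongs to DFS -- leave it (and anything else DFS-related) alone.
  knowall@rv-nXX:~$ kill 12345

If jps is not available on the node, ps aux | grep hadoop should show the same daemons by their java command lines.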