Monday, 25 February 2013

maven-hadoop-plugin

Submitting a Hadoop job to a remote machine is not a complicated process, but it takes a lot of time: 10 minutes, sometimes 15. There are many steps to go through before the final MapReduce result is on the local file system:

-compile the MapReduce code and build the jar file
-upload the jar to the remote server via ftp
-connect to the server with an ssh client
-prepare the input data in HDFS (usually only once)
-submit the job using the ‘hadoop jar…’ command
-copy the MapReduce output from HDFS to the remote server's local file system
-download the output to the local machine
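
The steps above can be sketched as a sequence of command lines. This is only an illustration, not code from the plugin: the class `ManualSubmitSteps` is hypothetical, `scp` stands in for the ftp upload, the directory `books` on the server is made up, and the job is assumed to take the HDFS input and output directories as its two arguments.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of maven-hadoop-plugin: it only builds the
// command lines a person would type for the manual steps listed above.
class ManualSubmitSteps {

    static List<String> commands(String login, String host, String jar,
                                 String mainClass, String serverInput,
                                 String hdfsIn, String hdfsOut, String localOut) {
        String remote = login + "@" + host;
        return Arrays.asList(
            // Build the jar locally (assumes the build writes target/<jar>).
            "mvn package",
            // Upload the jar; scp stands in for the ftp upload here.
            "scp target/" + jar + " " + remote + ":",
            // Put the input data into HDFS (usually needed only once).
            "ssh " + remote + " 'hadoop fs -put " + serverInput + " " + hdfsIn + "'",
            // Submit the job; the input/output argument order is an assumption.
            "ssh " + remote + " 'hadoop jar " + jar + " " + mainClass
                + " " + hdfsIn + " " + hdfsOut + "'",
            // Copy the job output from HDFS to the server's local file system.
            "ssh " + remote + " 'hadoop fs -get " + hdfsOut + " " + localOut + "'",
            // Download the output to the local machine.
            "scp -r " + remote + ":" + localOut + " target/"
        );
    }

    public static void main(String[] args) {
        // Placeholder values; replace them with a real host, login and paths.
        for (String command : commands("login", "ipAddress", "WordCounter.jar",
                "org.gkolpu.hadoop.BookWordCounter", "books",
                "/books/input", "/books/users/gkolpu/output30", "output30")) {
            System.out.println(command);
        }
    }
}
```

Even with ssh running each remote command directly, that is six commands to type and wait for on every run.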

As I mentioned, doing all those steps manually takes time, so I was looking for a way to do it faster and more easily. A short search turned up no tool or solution, so I decided that the easiest and best way was to develop something myself. I wondered how to go about it. I could write a standalone application, but that would still not be comfortable enough: it would require a few steps from the user, like switching from the Java IDE to another window and finding the jar in the file system. I thought it might be better to write an Eclipse plugin. Everything would be in one place, but it would have a weakness too: no usage outside Eclipse. The next thought was Maven. It is integrated with … everything, it can be used from the console and, importantly, plugin development for it is easy and pleasant, so I took this idea and started development.

The result of my work looks really good. Now I submit my job with ‘one button click’: I have a Maven execution (hadoop:execute) configured in Eclipse.



My sample word count project needs to have maven-hadoop-plugin configured in its pom.xml file:

<build>
 <plugins>
  <plugin>
   <groupId>org.apache.maven.plugins</groupId>
   <artifactId>maven-hadoop-plugin</artifactId>
   <version>0.0.1-SNAPSHOT</version>
   <configuration>
    <host>ipAddress</host>
    <login>login</login>
    <password>password</password>
    <outputDir>output30</outputDir>
    <hdfsOutputDir>
      /books/users/gkolpu/output30
    </hdfsOutputDir>
    <hdfsInputDir>
      /books/input
    </hdfsInputDir>
    <className>
      org.gkolpu.hadoop.BookWordCounter
    </className>
    <jarName>WordCounter.jar</jarName>
   </configuration>
  </plugin>
 </plugins>
</build>
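
One detail of the configuration above is that the values written on their own lines (`hdfsOutputDir`, `hdfsInputDir`, `className`) carry the surrounding line breaks and indentation as part of their XML text. The sketch below reads such a `<configuration>` fragment with the JDK's DOM parser and trims every value. The class `PluginConfigReader` is hypothetical and only illustrates the point; it is not how the plugin or Maven itself reads the parameters.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical reader for a <configuration> fragment; illustration only.
class PluginConfigReader {

    // Returns element name -> trimmed text for each direct child element.
    static Map<String, String> read(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> values = new LinkedHashMap<>();
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                // Skip the whitespace text nodes between the elements.
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    // trim() drops the line breaks and indentation that
                    // multi-line values carry around the actual text.
                    values.put(child.getNodeName(),
                            child.getTextContent().trim());
                }
            }
            return values;
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse configuration", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<configuration>\n"
                + " <hdfsInputDir>\n"
                + "   /books/input\n"
                + " </hdfsInputDir>\n"
                + " <jarName>WordCounter.jar</jarName>\n"
                + "</configuration>";
        // Print each parameter in brackets to make stray whitespace visible.
        for (Map.Entry<String, String> e : read(xml).entrySet()) {
            System.out.println(e.getKey() + " = [" + e.getValue() + "]");
        }
    }
}
```

Writing short values on a single line, as with `jarName`, avoids the question altogether.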



When I run the hadoop:execute goal from Eclipse, I see all the logs, coloured, in the IDE console.



When the job finishes on the remote server, the MapReduce output is downloaded automatically to the target directory.




The output files can easily be opened in the IDE editors.




You can find this plugin, with its source code, in my GitHub repository. There is only one goal so far, but the project is still under development and other Maven goals are planned.

https://github.com/gkolpuc/maven-hadoop-plugin
