Monday, 25 February 2013

The Pirate Bay - Away From Keyboard


This is a documentary about The Pirate Bay founders and the legal case against them. I recommend watching it and would love to see more films like this. Maybe Kim Dotcom's story will be next?


It’s the day before the trial starts. Fredrik packs a computer into a rusty old Volvo. Along with his Pirate Bay co-founders, he faces $13 million in damage claims to Hollywood in a copyright infringement case. Fredrik is on his way to install a new computer in the secret server hall. This is where the world’s largest file sharing site is hidden.
When the hacker prodigy Gottfrid, the internet activist Peter and the network nerd Fredrik are found guilty, they are confronted with the reality of life offline – away from keyboard. But deep down in dark data centres, clandestine computers quietly continue to duplicate files.




maven-hadoop-plugin

Submitting a Hadoop job to a remote machine is not a complicated process, but it takes a lot of time, sometimes 10 or even 15 minutes. There are quite a few steps to go through before the final MapReduce result lands on the local file system (a rough command-line sketch of the whole round trip follows the list):

-compile the MapReduce code and build a jar file
-upload the jar to the remote server via FTP
-connect to the server via an SSH client
-prepare the input data in HDFS (usually only once)
-submit the job using the ‘hadoop jar…’ command
-copy the MapReduce output from HDFS to the remote server's local file system
-download the output to the local machine
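
A rough sketch of that manual round trip is below. The host, login, paths and jar name are the placeholders from my word count example further down, and I use scp instead of FTP purely for brevity, so treat it as an illustration rather than the exact commands.

# build the jar locally
mvn clean package

# upload it and log in to the remote machine
scp target/WordCounter.jar login@ipAddress:~
ssh login@ipAddress

# on the server: load the input into HDFS (usually only once), submit the job,
# then copy the MapReduce output from HDFS to the server's local file system
# (assuming the driver class takes the input and output paths as arguments)
hadoop fs -put books /books/input
hadoop jar WordCounter.jar org.gkolpu.hadoop.BookWordCounter /books/input /books/users/gkolpu/output30
hadoop fs -get /books/users/gkolpu/output30 output30

# back on the local machine: download the result
scp -r login@ipAddress:output30 .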

As I mentioned, doing all of those steps manually takes time. I was looking for a way to do it faster and easier. After a short search I didn't find any existing tool or solution, so I decided the easiest and best way was to develop something myself. I wondered how to do it. I could write a standalone application, but that would not be comfortable enough either and would still require a few steps from the user, like switching from the Java IDE to another window and finding the jar in the file system. Then I thought it might be better to write an Eclipse plugin. Everything would be in one place, but that has a weakness too: no use outside of Eclipse. The next thought was Maven. It integrates with… everything, it can be used from the console and, what is important, plugin development for it is easy and pleasant, so I took this idea and started coding.

The result of my work looks really good. Now I submit my job with ‘one button click’: I have the Maven execution configured in Eclipse (hadoop:execute).



My sample word count project needs to have maven-hadoop-plugin configured in its pom.xml file:

<build>
 <plugins>
  <plugin>
   <groupId>org.apache.maven.plugins</groupId>
   <artifactId>maven-hadoop-plugin</artifactId>
   <version>0.0.1-SNAPSHOT</version>
   <configuration>
    <host>ipAddress</host>
    <login>login</login>
    <password>password</password>
    <outputDir>output30</outputDir>
    <hdfsOutputDir>
      /books/users/gkolpu/output30
    </hdfsOutputDir>
    <hdfsInputDir>
      /books/input
    </hdfsInputDir>
    <className>
      org.gkolpu.hadoop.BookWordCounter
    </className>
    <jarName>WordCounter.jar</jarName>
   </configuration>
  </plugin>
 </plugins>
</build>



Running my hadoop:execute goal from Eclipse, I see all the logs coloured in the IDE console.
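
The same goal can also be run from a plain console. Assuming Maven resolves the hadoop plugin prefix, the short form works; otherwise the fully qualified coordinates from the pom above always do:

mvn hadoop:execute
# or, fully qualified:
mvn org.apache.maven.plugins:maven-hadoop-plugin:0.0.1-SNAPSHOT:execute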



When the job finishes on the remote server, the MapReduce output is downloaded automatically to the target directory.




Output files can be easily opened in IDE editors




You can find this plugin, with its source code, in my GitHub repository. There is currently only one goal, but the project is still under development and other Maven goals are planned.

https://github.com/gkolpuc/maven-hadoop-plugin

Sunday, 17 February 2013

Dumping stack and heap

The JVM ships with quite a few executable applications/tools by default. Most developers use them only occasionally, or not at all. I would like to show two of them that let us dump the stack and the heap of a specific running JVM. Why put them together? Both are very helpful when hunting for memory leaks, deadlocks, or blocking and slow parts of our code. A stack dump contains the stack trace of every active thread in a particular Java process; the jstack tool is responsible for it.

jstack -l $PROCESS_ID > "/home/gkolpu/stack-$PROCESS_ID-$TIME"

-m
     prints mixed mode (both Java and native C/C++ frames) stack trace.
-h
     prints a help message.
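
Both tools need the process id of the JVM we want to inspect. The jps tool, which ships alongside jstack and jmap, is a quick way to find it:

jps -l
# sample output (illustrative only):
#   4242 org.gkolpu.hadoop.BookWordCounter
#   4311 sun.tools.jps.Jps
PROCESS_ID=4242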

If we find something in the stack traces, we will probably know which part of the code is sensitive and which classes and objects the problem lies in; then we can analyse a heap dump. jmap will create it:
jmap  -J-d64 -dump:format=b,file=heap.$PROCESS_ID.hprof $PROCESS_ID

< no option > 
     When no option is used jmap prints shared object mappings. 
-heap
     Prints a heap summary. 
-histo
     Prints a histogram of the heap. 
-permstat
     Prints class loader wise statistics of permanent generation of Java heap.
-h
     Prints a help message.
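
For a quick look without writing a full dump file, the -histo option listed above is often enough; the :live modifier makes jmap count only live objects (it triggers a full GC first):

jmap -histo:live $PROCESS_ID | head -n 20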

Both programs can be found in the Java bin directory.

ORA-31013: Invalid XPATH expression - multiple namespaces

While preparing an SQL query whose goal was to search an XML document using XPath and the Oracle API, I came across a little problem due to multiple namespaces in the XML source. I prepared the query:
SELECT id,
   TO_NUMBER( extract(xml,
   '/content:MESSAGE/data:CITY/data:POPULATION/text()')) pop
  FROM xml_table where id='E0437C858C6E5A22097EF739EC045';
It was not working correctly; Oracle responded with the following message:
ORA-31013: Invalid XPATH expression
31013. 00000 -  "Invalid XPATH expression"
*Cause:    XPATH expression passed to the function is invalid.
*Action:   Check the xpath expression for possible syntax errors.
Error at Line: 3 Column: 7
The error message was a little confusing for an inexperienced Oracle user, because the XPath itself was correct. I found out what was wrong: I just had to pass the namespace definitions as an extra parameter to the extract function.
   SELECT id,
    TO_NUMBER( extract(xml,
     '/content:MESSAGE/data:CITY/data:POPULATION/text()',
     'xmlns:content="http://www.gkolpu.blogspot.com/ContentNamespace"
      xmlns:data="http://www.gkolpu.blogspot.com/DataNamespace"')) pop
   FROM xml_table where id='E0437C858C6E5A22097EF739EC045';
  

Friday, 15 February 2013

"too many arguments" problem with tar large files set

Unix commands have a limit on the number of arguments. I had a set of 500k files and wanted to tar them into one big archive. It was not as easy as I expected. I tried this command first:
 tar -cvzf ../archive.tar.gz *xml 
It didn't work; it failed with a "too many arguments" message. After a little research I built the list of file names I wanted to pack with find and fed it to tar through the --files-from option, reading the list from stdin:
find . -name '*.xml' -print | tar -cvzf ../arch.tar.gz --files-from -
This approach is very useful not only for the tar command but also for others like rm, cp, mv...
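
For commands that have no --files-from equivalent, the same file list can be fed through xargs. A small sketch: the backup directory is just a placeholder, and cp -t is the GNU coreutils variant.

# remove the same 500k files without hitting the argument limit
# (-print0 / -0 keep file names with spaces safe)
find . -name '*.xml' -print0 | xargs -0 rm

# copy them into another directory
find . -name '*.xml' -print0 | xargs -0 cp -t backup/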