Friday, January 4, 2013

Hadoop. MapReduce File InputFormat

In Hadoop it matters a lot how data is read for MapReduce. The standard input is a set of files in HDFS. I will try to explain how to define your own file input format. The first step is to implement the InputFormat interface or extend one of its implementations, as in the example below.
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class EmailInputFormat extends FileInputFormat<Text, Email> {

 @Override
 public RecordReader<Text, Email> getRecordReader(InputSplit split,
   JobConf job, Reporter reporter) throws IOException {
  // Report which split is being processed, then hand out our reader.
  reporter.setStatus(split.toString());
  return new EmailRecordReader(job, (FileSplit) split);
 }

}
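The Email type used as the value class above is not part of Hadoop, so it has to be written by hand. Hadoop value types must implement Writable so the framework can serialize them between tasks. A minimal sketch, with purely hypothetical sender and body fields:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class Email implements Writable {
 // Hypothetical fields; a real class would model whatever the job needs.
 private Text sender = new Text();
 private Text body = new Text();

 public void set(String sender, String body) {
  this.sender.set(sender);
  this.body.set(body);
 }

 public void write(DataOutput out) throws IOException {
  sender.write(out);
  body.write(out);
 }

 public void readFields(DataInput in) throws IOException {
  sender.readFields(in);
  body.readFields(in);
 }
}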
The getRecordReader method has to return a RecordReader implementation, so let's create one more class.
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;

public class EmailRecordReader implements RecordReader<Text, Email> {
 // Wrapped line-oriented reader that does the actual file access.
 private LineRecordReader lineReader;
 private LongWritable lineKey;
 private Text lineValue;

 public EmailRecordReader(JobConf job, FileSplit split) throws IOException {
  lineReader = new LineRecordReader(job, split);

  lineKey = lineReader.createKey();
  lineValue = lineReader.createValue();
 }

 public boolean next(Text key, Email value) throws IOException {
  // TODO: read from lineReader, fill in key and value, and return true;
  // return false once the split is exhausted.
  return false;
 }

 public Text createKey() {
  return new Text("");
 }

 public Email createValue() {
  return new Email();
 }

 public long getPos() throws IOException {
  return lineReader.getPos();
 }

 public void close() throws IOException {
  lineReader.close();
 }

 public float getProgress() throws IOException {
  return lineReader.getProgress();
 }

}
We need to implement the constructor to make use of the FileSplit, but the most important part of this code is the next method. The Hadoop engine keeps calling this method until it returns false, so we can use it to produce as many input records as we want.
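For illustration only, here is one possible way to fill in next: delegate to the wrapped LineRecordReader and assume, hypothetically, that every input line holds a single email laid out as "sender<TAB>body":
 public boolean next(Text key, Email value) throws IOException {
  // Pull the next raw line through the wrapped reader.
  if (!lineReader.next(lineKey, lineValue)) {
   return false; // split exhausted, tell Hadoop we are done
  }
  // Assumed layout: sender and body separated by a tab.
  String[] parts = lineValue.toString().split("\t", 2);
  key.set(parts[0]);
  value.set(parts[0], parts.length > 1 ? parts[1] : "");
  return true;
 }
Note that next must fill in the key and value objects passed by the framework rather than create new ones; Hadoop reuses these objects between calls.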

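To actually use the new format, register it on the JobConf in the job driver. A minimal sketch, with a placeholder driver class and paths taken from the command line:
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class EmailJob { // placeholder driver class
 public static void main(String[] args) throws IOException {
  JobConf conf = new JobConf(EmailJob.class);
  conf.setJobName("email-processing");
  conf.setInputFormat(EmailInputFormat.class); // plug in the custom format
  FileInputFormat.setInputPaths(conf, new Path(args[0]));
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));
  // Mapper and reducer classes would be configured here as well.
  JobClient.runJob(conf);
 }
}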