
Preprocess Enron Email Dataset

Samantha Chan edited this page Jun 18, 2014 · 11 revisions

To prepare the dataset for the benchmark applications, first compile StormEmailBenchmark, then use the CoalesceEnronDataset class to combine all emails into a single file.

Compile "StormEmailBenchmark"

  1. Go to the root directory of "StormEmailBenchmark".
  2. Run "mvn package" at the command line.
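The build step can be wrapped in a small shell helper that also confirms the jar used later on this page was produced. This is only a sketch: the function names `build_benchmark` and `jar_exists` are hypothetical, and the jar path is taken from the preprocessing commands below rather than from the build configuration.

```shell
# Sketch: build the project, then confirm the assembled jar exists.
# JAR_REL is the jar path used by the preprocessing commands on this page;
# build_benchmark and jar_exists are hypothetical helper names.
JAR_REL="target/storm-email-benchmark-1.0-jar-with-dependencies.jar"

# jar_exists PROJECT_ROOT
#   Succeeds only if the jar is present under PROJECT_ROOT.
jar_exists() {
    [ -f "$1/$JAR_REL" ]
}

# build_benchmark PROJECT_ROOT
#   Runs "mvn package" in a subshell so the caller's working directory
#   is left unchanged, then checks for the jar.
build_benchmark() {
    project_root=$1
    (cd "$project_root" && mvn package) || return 1
    jar_exists "$project_root"
}
```

For example, `build_benchmark /path/to/StormEmailBenchmark && echo "jar ready"` prints the message only when both the Maven build and the jar check succeed.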

Preprocessing:

  1. Download the Enron dataset from http://www.cs.cmu.edu/~./enron/enron_mail_20110402.tgz and extract the archive.
  2. To generate the full dataset, run the following from the project root with "no" as the final argument: java -cp target/storm-email-benchmark-1.0-jar-with-dependencies.jar com.ibm.storm.email.benchmark.testing.CoalesceEnronDataset <path_to_downloaded_data>/enron_mail_20110402/maildir <output_file_path> no
  3. To generate the 25% dataset, run the same command with "yes" as the final argument: java -cp target/storm-email-benchmark-1.0-jar-with-dependencies.jar com.ibm.storm.email.benchmark.testing.CoalesceEnronDataset <path_to_downloaded_data>/enron_mail_20110402/maildir <output_file_path> yes
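Because the two invocations differ only in their last argument, they can be generated by one helper. The sketch below is an illustration, not part of the project: `coalesce_cmd` is a hypothetical function name, it prints the command rather than running it so it can be inspected first, and it assumes paths without spaces. The jar path and class name are copied from the steps above.

```shell
# Sketch: build the CoalesceEnronDataset command line for either dataset size.
# JAR and MAIN_CLASS are copied from the steps above; coalesce_cmd is a
# hypothetical helper name.
JAR="target/storm-email-benchmark-1.0-jar-with-dependencies.jar"
MAIN_CLASS="com.ibm.storm.email.benchmark.testing.CoalesceEnronDataset"

# coalesce_cmd MAILDIR OUTPUT SUBSET
#   SUBSET is "no" for the full dataset or "yes" for the 25% dataset.
#   Prints the java command; rejects any other SUBSET value.
coalesce_cmd() {
    maildir=$1
    output=$2
    subset=$3
    case "$subset" in
        yes|no) ;;
        *) echo "third argument must be yes or no" >&2; return 1 ;;
    esac
    printf 'java -cp %s %s %s %s %s\n' \
        "$JAR" "$MAIN_CLASS" "$maildir" "$output" "$subset"
}
```

Calling `coalesce_cmd <path_to_downloaded_data>/enron_mail_20110402/maildir <output_file_path> yes` from the project root prints the 25% command; piping that output to `sh` runs it.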

Next Steps:

[Create dataset for Apache Storm benchmark](Create dataset for Apache Storm benchmark)