Preprocess Enron Email Dataset

Samantha Chan edited this page Jun 25, 2014 · 11 revisions

To prepare the dataset for the benchmark applications, first compile StormEmailBenchmark, then use the CoalesceEnronDataset class to combine all emails into a single file.

Compile "StormEmailBenchmark"

  1. Go to the root directory of "StormEmailBenchmark"
  2. Execute mvn package at the command line

Preprocessing:

  1. Download the Enron dataset from http://www.cs.cmu.edu/~./enron/enron_mail_20110402.tgz and extract the archive, so that the directory <path_to_downloaded_data>/enron_mail_20110402/maildir exists.
  2. To generate the full dataset, run the following from the root directory of "StormEmailBenchmark", passing no as the last argument: java -cp target/storm-email-benchmark-1.0-jar-with-dependencies.jar com.ibm.streamsx.storm.email.benchmark.testing.CoalesceEnronDataset <path_to_downloaded_data>/enron_mail_20110402/maildir <output_file_path> no
  3. To generate the 25% dataset, run the same command with yes as the last argument: java -cp target/storm-email-benchmark-1.0-jar-with-dependencies.jar com.ibm.streamsx.storm.email.benchmark.testing.CoalesceEnronDataset <path_to_downloaded_data>/enron_mail_20110402/maildir <output_file_path> yes
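
The steps above can be wrapped in a small shell helper. This is only a sketch, not part of StormEmailBenchmark: the function names `fetch_enron` and `coalesce`, the `DRY_RUN` switch, and the use of `curl` and `tar` are assumptions made for illustration. It also assumes the archive unpacks into an `enron_mail_20110402/maildir` directory, as the paths in the commands above suggest, and that it is sourced from the root directory of "StormEmailBenchmark" after `mvn package` has produced the jar.

```shell
#!/bin/sh
# Sketch only: helper names and the DRY_RUN switch are hypothetical.

JAR="target/storm-email-benchmark-1.0-jar-with-dependencies.jar"
MAIN="com.ibm.streamsx.storm.email.benchmark.testing.CoalesceEnronDataset"
URL="http://www.cs.cmu.edu/~./enron/enron_mail_20110402.tgz"

# Download and unpack the archive into directory $1.
# Skipped when the maildir is already present (assumed archive layout).
fetch_enron() {
    mkdir -p "$1" || return 1
    if [ ! -d "$1/enron_mail_20110402/maildir" ]; then
        curl -L -o "$1/enron_mail_20110402.tgz" "$URL" &&
            tar -xzf "$1/enron_mail_20110402.tgz" -C "$1"
    fi
}

# Run CoalesceEnronDataset.
#   $1 = directory the archive was extracted into
#   $2 = output file path
#   $3 = "no" for the full dataset, "yes" for the 25% dataset
# With DRY_RUN=1 the command is printed instead of executed.
coalesce() {
    set -- java -cp "$JAR" "$MAIN" "$1/enron_mail_20110402/maildir" "$2" "$3"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Example usage (paths are placeholders):
#   fetch_enron "$HOME/enron"
#   coalesce "$HOME/enron" enron_full.txt no
#   coalesce "$HOME/enron" enron_25pct.txt yes
```

Setting `DRY_RUN=1` before calling `coalesce` prints the exact `java` command, which is a cheap way to check the paths before starting a long run.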

Next Steps:

[Create dataset for Apache Storm benchmark](Create dataset for Apache Storm benchmark)

[Create dataset for InfoSphere Streams benchmark](Create dataset for InfoSphere Streams benchmark)