Preprocess Enron Email Dataset

To prepare the dataset for the benchmark applications, first compile StormEmailBenchmark, then use the CoalesceEnronDataset class to combine all emails into a single file.

Compile "StormEmailBenchmark"

Go to the root directory of "StormEmailBenchmark".
Type "mvn package" at the command line.
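
As a quick sanity check after the build, the sketch below tests whether the jar used by the preprocessing commands exists. The jar path is taken from those commands; the helper name `check_jar` is hypothetical and not part of the project.

```shell
# Assumption: run from the StormEmailBenchmark root after "mvn package".
# The jar path matches the one used in the preprocessing commands.
JAR="target/storm-email-benchmark-1.0-jar-with-dependencies.jar"

# check_jar (hypothetical helper): succeeds only if the given file
# exists and is non-empty; otherwise prints a hint and returns 1.
check_jar() {
  if [ -s "$1" ]; then
    echo "found: $1"
  else
    echo "missing: $1 (run 'mvn package' first)" >&2
    return 1
  fi
}
```

A typical use would be `mvn package && check_jar "$JAR"`, assuming Maven and a JDK are on the PATH.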

Preprocessing:

Download the Enron dataset from http://www.cs.cmu.edu/~./enron/enron_mail_20110402.tgz and extract the archive.
To generate the full dataset, pass "no" as the last argument:

    java -cp target/storm-email-benchmark-1.0-jar-with-dependencies.jar \
        com.ibm.storm.email.benchmark.testing.CoalesceEnronDataset \
        <path_to_downloaded_data>/enron_mail_20110402/maildir \
        <output_file_path> \
        no

To generate the 25% dataset, pass "yes" as the last argument:

    java -cp target/storm-email-benchmark-1.0-jar-with-dependencies.jar \
        com.ibm.storm.email.benchmark.testing.CoalesceEnronDataset \
        <path_to_downloaded_data>/enron_mail_20110402/maildir \
        <output_file_path> \
        yes
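
The download, extraction, and coalescing steps can be chained in one script. The following is a minimal sketch, not the project's own tooling: `DATA_DIR`, `OUT_FILE`, `SAMPLE`, `DRY_RUN`, and the `run` helper are hypothetical names, and it assumes `curl`, `tar`, and `java` are installed and that the archive unpacks to `enron_mail_20110402/maildir` as the commands above imply. By default it only prints the commands; set `DRY_RUN=0` to execute them.

```shell
# Hypothetical settings; adjust to your environment.
URL="http://www.cs.cmu.edu/~./enron/enron_mail_20110402.tgz"
ARCHIVE="enron_mail_20110402.tgz"
DATA_DIR="${DATA_DIR:-./data}"               # where the archive is stored and unpacked
OUT_FILE="${OUT_FILE:-./enron_coalesced}"    # single output file
SAMPLE="${SAMPLE:-no}"                       # "no" = full dataset, "yes" = 25% dataset
JAR="target/storm-email-benchmark-1.0-jar-with-dependencies.jar"

# run (hypothetical helper): print the command, and execute it
# only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

run mkdir -p "$DATA_DIR"
run curl -L -o "$DATA_DIR/$ARCHIVE" "$URL"
run tar -xzf "$DATA_DIR/$ARCHIVE" -C "$DATA_DIR"
run java -cp "$JAR" \
    com.ibm.storm.email.benchmark.testing.CoalesceEnronDataset \
    "$DATA_DIR/enron_mail_20110402/maildir" \
    "$OUT_FILE" \
    "$SAMPLE"
```

Printing before executing makes it easy to review the exact paths before starting a large download; the printed lines are for inspection only and are not quoted for re-use when paths contain spaces.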

Next Steps:

[Create dataset for Apache Storm benchmark](Create dataset for Apache Storm benchmark)
