Preprocess Enron Email Dataset

To prepare the dataset for the benchmark applications, first compile StormEmailBenchmark, then use the CoalesceEnronDataset class to combine all emails into a single file.

Compile "StormEmailBenchmark"

1. Go to the root directory of "StormEmailBenchmark".
2. Execute `mvn package` at the command line.

Preprocessing:

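1. Download the Enron dataset from http://www.cs.cmu.edu/~./enron/enron_mail_20110402.tgz and extract it, so that the directory `<path_to_downloaded_data>/enron_mail_20110402/maildir` exists.
2. From the root directory of "StormEmailBenchmark", run CoalesceEnronDataset with three arguments: the path to `maildir`, the output file path, and `no` for the full dataset or `yes` for the 25% dataset.
   - To generate the full dataset: `java -cp target/storm-email-benchmark-1.0-jar-with-dependencies.jar com.ibm.streamsx.storm.email.benchmark.testing.CoalesceEnronDataset <path_to_downloaded_data>/enron_mail_20110402/maildir <output_file_path> no`
   - To generate the 25% dataset: `java -cp target/storm-email-benchmark-1.0-jar-with-dependencies.jar com.ibm.streamsx.storm.email.benchmark.testing.CoalesceEnronDataset <path_to_downloaded_data>/enron_mail_20110402/maildir <output_file_path> yes`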
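
For readers who want a feel for what "combining all emails into a single file" involves, here is a minimal sketch of the general idea. It is not the source of CoalesceEnronDataset: the class name `CoalesceSketch`, the `coalesce` method, the separator line, and the output layout are all assumptions made for illustration, and the 25% sampling option is not modeled.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration only; not the actual CoalesceEnronDataset class.
public class CoalesceSketch {

    // Walks maildir recursively and writes every regular file (one email each)
    // to outputFile, each followed by a separator line.
    // Returns the number of emails written.
    static int coalesce(Path maildir, Path outputFile, String separator) throws IOException {
        List<Path> emails;
        // Collect the file list first so the walk is closed before writing starts.
        try (Stream<Path> walk = Files.walk(maildir)) {
            emails = walk.filter(Files::isRegularFile)
                         .sorted()
                         .collect(Collectors.toList());
        }
        // ISO-8859-1 maps every byte to a character, so no email is rejected
        // for containing bytes that are invalid in another encoding.
        try (BufferedWriter out =
                 Files.newBufferedWriter(outputFile, StandardCharsets.ISO_8859_1)) {
            for (Path email : emails) {
                byte[] raw = Files.readAllBytes(email);
                out.write(new String(raw, StandardCharsets.ISO_8859_1));
                out.newLine();
                out.write(separator); // assumed record delimiter
                out.newLine();
            }
        }
        return emails.size();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: CoalesceSketch <maildir> <output_file_path>");
            System.exit(1);
        }
        int written = coalesce(Paths.get(args[0]), Paths.get(args[1]),
                               "=====END-OF-EMAIL=====");
        System.out.println("Wrote " + written + " emails");
    }
}
```

The sketch collects the file list before opening the output, so an output file placed under `maildir` is not picked up as an email. For the benchmark itself, use the CoalesceEnronDataset commands above, since the benchmark applications expect the format that class produces.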

Next Steps:

[Create dataset for Apache Storm benchmark](Create dataset for Apache Storm benchmark)

[Create dataset for InfoSphere Streams benchmark](Create dataset for InfoSphere Streams benchmark)
