Preprocess Enron Email Dataset
Samantha Chan edited this page Jun 25, 2014 · 11 revisions
To prepare the dataset for the benchmark applications, first compile StormEmailBenchmark, then use the `CoalesceEnronDataset` class to combine all emails into a single file.
- Go to the root directory of "StormEmailBenchmark"
- Execute `mvn package` at the command line
- Download the Enron dataset from http://www.cs.cmu.edu/~./enron/enron_mail_20110402.tgz and extract it
- To generate the full dataset (last argument `no`):

      java -cp target/storm-email-benchmark-1.0-jar-with-dependencies.jar \
        com.ibm.streamsx.storm.email.benchmark.testing.CoalesceEnronDataset \
        <path_to_downloaded_data>/enron_mail_20110402/maildir \
        <output_file_path> \
        no

- To generate the 25% dataset (last argument `yes`):

      java -cp target/storm-email-benchmark-1.0-jar-with-dependencies.jar \
        com.ibm.streamsx.storm.email.benchmark.testing.CoalesceEnronDataset \
        <path_to_downloaded_data>/enron_mail_20110402/maildir \
        <output_file_path> \
        yes

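
For readers who want a rough idea of what the coalescing step involves, the following is a minimal sketch and not the project's actual `CoalesceEnronDataset` code. It assumes the tool walks the `maildir` tree and appends each email file to one output file. The class name `CoalesceSketch`, the `SEPARATOR` line, and the "keep every fourth file" rule for the 25% dataset are all hypothetical choices made for illustration; the real record format and sampling method are not described on this page.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch; NOT the project's CoalesceEnronDataset implementation.
public class CoalesceSketch {

    // Assumed record separator; the real output format may differ.
    static final String SEPARATOR = "-----END-OF-EMAIL-----";

    /**
     * Concatenates every regular file under maildir into a single output file.
     * When sample is true, only every fourth file is kept (an assumed way to
     * approximate a 25% subset). Returns the number of emails written.
     */
    static int coalesce(Path maildir, Path output, boolean sample) throws IOException {
        List<Path> files;
        // Collect and sort the file list first so the result is deterministic.
        try (Stream<Path> walk = Files.walk(maildir)) {
            files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        int written = 0;
        // ISO-8859-1 maps every byte to one character, so no input byte is rejected.
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.ISO_8859_1)) {
            for (int i = 0; i < files.size(); i++) {
                if (sample && i % 4 != 0) {
                    continue; // skip three out of every four files
                }
                byte[] raw = Files.readAllBytes(files.get(i));
                out.write(new String(raw, StandardCharsets.ISO_8859_1));
                out.newLine();
                out.write(SEPARATOR);
                out.newLine();
                written++;
            }
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("usage: CoalesceSketch <maildir> <output_file> <yes|no>");
            System.exit(1);
        }
        boolean sample = "yes".equalsIgnoreCase(args[2]);
        int n = coalesce(Paths.get(args[0]), Paths.get(args[1]), sample);
        System.out.println("Wrote " + n + " emails to " + args[1]);
    }
}
```

The sketch takes the same three positional arguments as the commands above (input directory, output file, `yes` or `no`), so it can be pointed at a small test directory to see the shape of the output before running the real tool on the full dataset.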
[Create dataset for Apache Storm benchmark](Create dataset for Apache Storm benchmark)
[Create dataset for InfoSphere Streams benchmark](Create dataset for InfoSphere Streams benchmark)