Created a map/reduce program for parallel processing of the publicly available DBLP dataset, which contains entries for publications at many different venues (e.g., conferences and journals).
Each entry in the dataset describes a publication and contains the list of authors, the title, the publication venue, and a few other attributes. The file is approximately 2.5 GB.
The generated graph represents all UIC CS faculty researchers and their work.
(Pro tip: download and zoom in for the details.)
Consider the following entry in the dataset.
<inproceedings mdate="2017-05-24" key="conf/icst/GrechanikHB13">
<author>Mark Grechanik</author>
<author>B. M. Mainul Hossain</author>
<author>Ugo Buy</author>
<title>Testing Database-Centric Applications for Causes of Database Deadlocks.</title>
This entry lists a paper published in 2013 at the IEEE International Conference on Software Testing, Verification and Validation (ICST). One of its authors is my former Ph.D. student at UIC, now a tenured Associate Professor at the University of Dhaka, Dr. B. M. Mainul Hossain, whose advisor, Mark Grechanik, is a co-author on this paper. The third co-author is Prof. Ugo Buy, a faculty member at our CS department. The presence of two authors, Mark Grechanik and Ugo Buy, in a single publication like this one establishes a connection between these faculty members. Your job is to create a "friendship" connectivity graph between UIC CS faculty members using the information extracted from this dataset. Partitioning this dataset into shards is easy, since it only requires preserving the well-formedness of the XML. Most likely, you will write a simple program to partition the dataset into shards of approximately equal size.
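The partitioning idea mentioned above can be sketched in a few lines. This is a minimal illustration in plain Python, not necessarily how this project splits its input: the function name `shard_records` and the `max_bytes` parameter are hypothetical, and the records are assumed to have already been extracted as whole XML strings so that no entry is cut in half.

```python
def shard_records(records, max_bytes):
    """Group whole records into shards of roughly max_bytes each.

    A record is never split, so every shard stays well-formed once it is
    wrapped in a root element. A single record larger than max_bytes
    ends up in a shard of its own.
    """
    shards, current, size = [], [], 0
    for record in records:
        record_size = len(record.encode("utf-8"))
        # Start a new shard when adding this record would exceed the limit.
        if current and size + record_size > max_bytes:
            shards.append(current)
            current, size = [], 0
        current.append(record)
        size += record_size
    if current:
        shards.append(current)
    return shards


def write_shard(records):
    """Wrap a shard in a root element so it is a well-formed XML document."""
    return "<dblp>\n" + "\n".join(records) + "\n</dblp>"
```

Choosing `max_bytes` as the total input size divided by the desired number of shards gives shards of approximately equal size.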
After creating and testing this map/reduce program locally, the job was deployed and run on Amazon Elastic MapReduce (EMR).
- Run `sbt clean assembly`, which runs the test cases and generates the JAR.
- Copy the JAR to the machine (the machine needs Java 1.8).
- Run the following command:
hadoop jar Mohammed_Siddiq_HW3-assembly-0.1.jar RelateAuthors.JobRunner path/to/input/file path/to/output/file
Please refer to the video uploaded on YouTube to see how to run the job in the cloud.
- Implemented XmlInputFormat, which implements the DataRecordReader to generate key/value pairs based on multiple start and end tags. The start and end tags considered are:
<article , <inproceedings , <proceedings , <book , <incollection , <phdthesis , <mastersthesis
</article>, </inproceedings>, </proceedings>, </book>, </incollection>, </phdthesis>, </mastersthesis>
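To illustrate how records can be cut out of the input using several start and end tags, here is a minimal sketch in plain Python. It is not the project's XmlInputFormat: it works on an in-memory string rather than on Hadoop input splits, and the names `RECORD_TAGS`, `START_RE`, and `iter_records` are hypothetical.

```python
import re

# Element names taken from the start/end tags listed above.
RECORD_TAGS = ("article", "inproceedings", "proceedings", "book",
               "incollection", "phdthesis", "mastersthesis")

# A start tag is "<name" followed by whitespace or ">", so that "<book"
# does not accidentally match "<booktitle>".
START_RE = re.compile(r"<(%s)[\s>]" % "|".join(RECORD_TAGS))


def iter_records(text):
    """Yield each publication record in text as a raw XML string."""
    pos = 0
    while True:
        match = START_RE.search(text, pos)
        if match is None:
            return
        # The end tag must correspond to the start tag that was found.
        end_tag = "</%s>" % match.group(1)
        end = text.find(end_tag, match.start())
        if end == -1:
            return  # truncated record at the end of the input
        end += len(end_tag)
        yield text[match.start():end]
        pos = end
```

Each yielded string is one complete entry, which is the unit of work a mapper would receive as its value.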
The Mapper:
- Processes the individual papers/publications supplied as XML by the InputFormat.
- Extracts the authors and keeps only the UIC CS faculty.
- Emits each individual author as a key with 1 as the value.
- Also emits each pair of co-authors as a key with 1 as the value.
- Together, these counts capture each author's own publications and the publications co-authored with other UIC CS faculty.
For example, if a1, a2, and a3 are the authors of a paper, then the mapper emits: (a1)->1, (a2)->1, (a3)->1, (a1,a2)->1, (a1,a3)->1, (a2,a3)->1
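The emission rule above can be sketched in plain Python as follows. This is an illustration only, not the project's mapper: `UIC_CS_FACULTY` is a hypothetical three-name subset of the real faculty list, and lowercasing names and sorting each pair are assumptions made so that a given pair always produces the same key.

```python
from itertools import combinations

# Hypothetical subset; the real job filters against the full UIC CS list.
UIC_CS_FACULTY = {"mark grechanik", "ugo buy", "ouri wolfson"}


def map_publication(authors, faculty=UIC_CS_FACULTY):
    """Return the (key, 1) pairs emitted for one publication."""
    # Normalise case and keep only faculty members; sorting makes the
    # pair keys independent of the author order in the entry.
    kept = sorted({a.strip().lower() for a in authors} & faculty)
    pairs = [(name, 1) for name in kept]
    pairs += [(",".join(pair), 1) for pair in combinations(kept, 2)]
    return pairs


print(map_publication(["Mark Grechanik", "B. M. Mainul Hossain", "Ugo Buy"]))
# [('mark grechanik', 1), ('ugo buy', 1), ('mark grechanik,ugo buy', 1)]
```

For the ICST entry shown earlier, the non-faculty co-author is dropped and a single pair key is produced for the two faculty members.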
The Combiner and Reducer:
- The combiner and reducer add up the values for each key, yielding the number of publications for each individual author and for each pair of co-authors.
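The summation step can be sketched in plain Python as follows. This is an illustration, not the project's reducer, and `reduce_counts` is a hypothetical name. Because addition is associative and commutative, the same logic can serve as both combiner and reducer: summing partial sums gives the same totals as summing the raw values.

```python
from collections import Counter


def reduce_counts(pairs):
    """Sum the values for each key in an iterable of (key, value) pairs."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Two hypothetical mapper outputs, each combined locally first.
partial_1 = reduce_counts([("a1", 1), ("a1,a2", 1), ("a2", 1)])
partial_2 = reduce_counts([("a1", 1), ("a1", 1)])

# The reducer then sums the partial counts for each key.
final = reduce_counts(list(partial_1.items()) + list(partial_2.items()))
print(final)  # {'a1': 3, 'a1,a2': 1, 'a2': 1}
```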
Sample output of the reducer looks like this:
a. prasad sistla 130
a. prasad sistla,isabel f. cruz 2
a. prasad sistla,lenore d. zuck 7
a. prasad sistla,ouri wolfson 22
a. prasad sistla,robert h. sloan 1
a. prasad sistla,v. n. venkatakrishnan 8
ajay d. kshemkalyani 113
ajay d. kshemkalyani,ugo buy 1
anastasios sidiropoulos 101
anastasios sidiropoulos,bhaskar dasgupta 1
andrew e. johnson 102
andrew e. johnson,barbara di eugenio 2
andrew e. johnson,luc renambot 26
andrew e. johnson,tanya y. bergerwolf 1
balajee vamanan 15
barbara di eugenio 116
barbara di eugenio,bing liu 1
barbara di eugenio,brian d. ziebart 1
barbara di eugenio,isabel f. cruz 2
barbara di eugenio,luc renambot 1
barbara di eugenio,ouri wolfson 2
barbara di eugenio,peter c. nelson 3
bhaskar dasgupta 142
bhaskar dasgupta,nasim mobasheri 9
bhaskar dasgupta,ouri wolfson 5
bhaskar dasgupta,robert h. sloan 2
bhaskar dasgupta,tanya y. bergerwolf 12
The final output directory of the reducers is passed to the graph visualization tool, written with Graphviz, to generate a PNG image representing the friendship graph between professors. Individual nodes represent the CS faculty, and the edges between them represent friendship (co-publication).
The weights on the edges represent the number of times the two authors published together. The weights on the individual nodes represent the author's total number of publications.
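As a rough illustration of how the reducer output can be turned into a Graphviz DOT description, here is a minimal sketch in plain Python. It is not the linked implementation: the function name `to_dot` and the label layout are assumptions, and each line is assumed to hold a key followed by whitespace and a count, with a comma in the key marking a co-author pair.

```python
def to_dot(lines):
    """Build an undirected DOT graph from 'key count' output lines."""
    nodes, edges = {}, {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # The count is the last whitespace-separated field on the line.
        key, count = line.rsplit(None, 1)
        if "," in key:
            first, second = key.split(",", 1)
            edges[(first, second)] = int(count)
        else:
            nodes[key] = int(count)

    out = ["graph coauthors {"]
    for name, total in sorted(nodes.items()):
        # Node weight: the author's total number of publications.
        out.append('  "%s" [label="%s (%d)"];' % (name, name, total))
    for (first, second), shared in sorted(edges.items()):
        # Edge weight: the number of joint publications.
        out.append('  "%s" -- "%s" [label="%d"];' % (first, second, shared))
    out.append("}")
    return "\n".join(out)
```

Saving the returned text to a file such as `graph.dot` and running `dot -Tpng graph.dot -o graph.png` renders it as a PNG.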
Link to GraphViz implementation to generate the DOT file from the output and present the graph