Update Release
samujjwaal committed Jan 11, 2021
1 parent c095a16 commit 8331a4d
Showing 4 changed files with 532 additions and 74 deletions.
4 changes: 1 addition & 3 deletions .github/workflows/deploy.yml
@@ -55,6 +55,4 @@ jobs:
upload_url: ${{ steps.create_release.outputs.upload_url }}
asset_path: ./hw2_dblp_mapred.jar
asset_name: dblp_mapreduce.jar
asset_content_type: application/jar
135 changes: 73 additions & 62 deletions README.md
@@ -1,10 +1,16 @@
# Homework 2 : DBLP Map Reduce
![workflow status](https://github.com/samujjwaal/dblp-mapreduce/workflows/Upload%20JAR%20Release/badge.svg)
![GitHub repo size](https://img.shields.io/github/repo-size/samujjwaal/dblp-mapreduce)
![GitHub last commit (branch)](https://img.shields.io/github/last-commit/samujjwaal/dblp-mapreduce/master)
![GitHub top language](https://img.shields.io/github/languages/top/samujjwaal/dblp-mapreduce)
![](https://img.shields.io/badge/-Scala-DE3423?style=flat&logo=scala&logoColor=white)

# Map Reduce on DBLP data

### Description: Design and implement an instance of the Hadoop MapReduce computational model to perform analyses on DBLP publication data

## Overview

As part of this project, a MapReduce program is created for the parallel processing of the publicly available [DBLP dataset](https://dblp.uni-trier.de/xml/). The dataset contains records for various publications by author(s) at different types of venues (like conferences, schools, books, and journals). Multiple map/reduce jobs have been defined to extract various insights from the dataset.

The map/reduce jobs created are:

@@ -16,122 +22,127 @@ The map/reduce jobs created are :

## Instructions to Execute

- Set up the Hadoop environment on the target system (follow [these steps](hadoop.md)). Skip this if already done.

- Generate the executable jar file

- Clone this repository

- Open the root folder of the project in the terminal and assemble the project jar using the command:

`sbt clean compile assembly`

This command compiles the source code, executes the test cases, and builds the executable jar file “*hw2_dblp_mapred.jar*” in the folder “*target/scala-2.13*”

- Set up the Hadoop environment

- Start Hadoop DFS & YARN services using:

- Start NameNode & DataNode daemons

`start-dfs.sh`

- Start ResourceManager & NodeManager daemons

`start-yarn.sh`

- Verify that the daemons are running using:

`jps`

- Create a directory in HDFS to store the input file:

`hdfs dfs -mkdir input`

- Place the dblp.xml file in the directory created above:

`hdfs dfs -put path/to/dblp.xml input`

- Execute the jar file

- Run the jar file using:

`hadoop jar hw2_dblp_mapred.jar job_num input`

- The argument ‘*job_num*’ must be provided by the user and can take the values 1/2/3/4/5, corresponding to the job to be performed as described [above](#overview) and [below](#mapreduce-jobs); a minimal driver sketch showing how this argument might be combined with the configured output paths follows these instructions

- The output folders for the job results are set in the config file ‘*JobSpec.conf*’ as follows:

```text
# Output paths for MapReduce jobs
master_output_path = "output_hw2"
Job1_output_path = "/top_10_authors_at_venues"
Job2_output_path = "/pubs_with_1_author_at_venues"
Job3_output_path = "/pubs_with_max_authors_at_venues"
Job4_output_path = "/top_100_authors_max_coauthors"
Job5_output_path = "/100_authors_0_coauthors"
```

- The main output folder ‘output_hw2’ must be deleted before repeating any map/reduce job, or else an error is raised. Delete the folder using:

`hdfs dfs -rm -r output_hw2`

- *After executing all jobs*, extract the output files from HDFS into a local directory “*mapreduce_output*” using:

`hdfs dfs -get output_hw2 mapreduce_output`

Output of all jobs is in CSV format.

- Stop Hadoop services

- Stop all daemons after execution is completed using:

```bash
stop-yarn.sh
stop-dfs.sh
```
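
To make the job selection concrete, here is a minimal Scala sketch, assuming the driver simply combines the ‘*job_num*’ argument with the paths from ‘*JobSpec.conf*’ shown above. The object name and the exact way the project loads its configuration are assumptions for illustration, not the repository's actual entry point.

```scala
import com.typesafe.config.ConfigFactory

// Hypothetical driver sketch; the project's real entry point and config layout may differ.
object DriverSketch {
  def main(args: Array[String]): Unit = {
    require(args.length >= 2, "usage: hadoop jar hw2_dblp_mapred.jar job_num input")

    val jobNum    = args(0).toInt   // 1..5, selects which map/reduce job to run
    val inputPath = args(1)         // HDFS directory containing dblp.xml

    // JobSpec.conf is assumed to be bundled on the classpath of the assembled jar
    val conf      = ConfigFactory.parseResources("JobSpec.conf")
    val masterOut = conf.getString("master_output_path")          // "output_hw2"
    val jobOut    = conf.getString(s"Job${jobNum}_output_path")   // e.g. "/top_10_authors_at_venues"

    println(s"Job $jobNum: reading from $inputPath, writing to $masterOut$jobOut")
    // ...configure and submit the corresponding Hadoop job here...
  }
}
```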

## Application Design

- ### XML parsing

- For parsing the dblp.xml file using the dblp.dtd schema, I have used the [multiple tag XMLInputFormatter](https://github.com/Mohammed-siddiq/hadoop-XMLInputFormatWithMultipleTags) by Mohammed Siddiq, which is an implementation of Mahout's XMLInputFormat with support for multiple input and output tags.
- The input and output tags are mentioned in the config file.
- The tags considered are:
`<article ,<book ,<incollection ,<inproceedings ,<mastersthesis ,<proceedings ,<phdthesis ,<www `
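
A rough Scala sketch of how such an input format is typically wired into a Hadoop `Job`, with the record tags passed in through the `Configuration`, is shown below. The property names `xmlinput.start`/`xmlinput.end` follow Mahout's XmlInputFormat convention, and the end-tag list is my assumption; the actual multi-tag library and this project's driver may use different keys, class names and wiring.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.{InputFormat, Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

object XmlJobWiringSketch {
  // Sketch only: the two property names below follow Mahout's XmlInputFormat
  // convention, and the end-tag list is assumed; the multi-tag library may
  // expect different keys.
  def buildJob(hadoopConf: Configuration,
               xmlInputFormat: Class[_ <: InputFormat[_, _]],
               mapper: Class[_ <: Mapper[_, _, _, _]],
               reducer: Class[_ <: Reducer[_, _, _, _]],
               input: String,
               output: String): Job = {
    hadoopConf.set("xmlinput.start",
      "<article ,<book ,<incollection ,<inproceedings ,<mastersthesis ,<proceedings ,<phdthesis ,<www ")
    hadoopConf.set("xmlinput.end",
      "</article>,</book>,</incollection>,</inproceedings>,</mastersthesis>,</proceedings>,</phdthesis>,</www>")

    val job = Job.getInstance(hadoopConf, "dblp-mapreduce")
    job.setJarByClass(getClass)                 // ship the assembled jar's classes
    job.setInputFormatClass(xmlInputFormat)     // the multi-tag XML input format
    job.setMapperClass(mapper)                  // e.g. one of the mapper classes listed below
    job.setReducerClass(reducer)
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[Text])
    FileInputFormat.addInputPath(job, new Path(input))
    FileOutputFormat.setOutputPath(job, new Path(output))
    job
  }
}
```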
### MapReduce Jobs
- ### Job 1
- Mapper Class: `VenueTopTenAuthorsMapper`
- Reducer Class: `VenueTopTenAuthorsReducer`
- Output path: `output_hw2/top_10_authors_at_venues`
- Output format: `key:<venue name> & value:<list of authors (separated by ';')>`
- ### Job 2
- Mapper Class: `VenueOneAuthorMapper`
- Reducer Class: `VenueOneAuthorReducer`
- Output path: `output_hw2/pubs_with_1_author_at_venues`
- Output format: `key:<venue name> & value:<list of publications (separated by ';')>`
- ### Job 3
- Mapper Class: `VenueTopPubMapper`
- Reducer Class: `VenueTopPubReducer`
- Output path: `output_hw2/pubs_with_max_authors_at_venues`
- Output format: `key:<venue name> & value:<publication name>`
- ### Job 4
- Mapper Class: `CoAuthorCountMapper`
- Reducer Class: `MostCoAuthorCountReducer`
- Output path: `output_hw2/top_100_authors_max_coauthors`
- Output format: `key:<author name> & value:<max. number of coauthors>`
- ### Job 5
- Mapper Class: `CoAuthorCountMapper`
- Reducer Class: `ZeroCoAuthorCountReducer`
- Output path: `output_hw2/100_authors_0_coauthors`
- Output format: `key:<author name> & value:<0>`
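
To illustrate the shape of these jobs, here is a hedged Scala sketch in the spirit of Jobs 4 and 5: a mapper that emits each author of a record together with that record's co-author count, and a reducer that keeps the maximum count per author. The class names, the regex-based record parsing, and the omission of the top-100 selection are simplifications, not the repository's actual `CoAuthorCountMapper`/`MostCoAuthorCountReducer` logic.

```scala
import java.lang.{Iterable => JIterable}
import scala.jdk.CollectionConverters._

import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Mapper, Reducer}

// Illustrative sketch only; the repository's classes may parse records and
// select the top 100 authors differently.
class CoAuthorSketchMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val authorTag = "<author[^>]*>(.*?)</author>".r

  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
    // `value` holds one XML publication record; pull out its author names
    val authors = authorTag.findAllMatchIn(value.toString).map(_.group(1)).toList
    val coAuthors = math.max(authors.size - 1, 0)
    // Every author of this record has `coAuthors` co-authors on it
    authors.foreach(a => context.write(new Text(a), new IntWritable(coAuthors)))
  }
}

class MaxCoAuthorSketchReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: JIterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    // Keep the largest co-author count observed for this author across all records
    val maxCount = values.asScala.map(_.get).max
    context.write(key, new IntWritable(maxCount))
    // Restricting the output to the top 100 authors would need an extra step (omitted here)
  }
}
```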