Commit 931fc64

PdfTextExtract
1 parent 6ef90cf commit 931fc64

7 files changed: +7641 -3041 lines changed

README.md

+4
@@ -56,6 +56,8 @@ Bonus:
 - Supports PostgresSQL using 'pg' module.
 - Supports Excel File Read/Write using 'excel-js' module.
 - Converts HTML Reports to Zip format which can shared across.
+- Extracts Text from PDF files.
+- Shows Page performance using Lighthouse Library.
 
 ### Built With
 
@@ -67,6 +69,7 @@ Bonus:
 - [ESLint](https://eslint.org/)
 - [SonarQube](https://www.sonarqube.org/)
 - [Lighthouse](https://developers.google.com/web/tools/lighthouse)
+- [pdfjs-dist-es5](https://www.npmjs.com/package/pdfjs-dist-es5)
 
 ## Getting Started
 
@@ -225,6 +228,7 @@ Once logger object is created you can use this instead of console.log in your fr
 ```JS
 npx cross-env ENV=qa npm run test:ui
 ```
+24. For Extracting text from PDF we are using `pdfjs-dist-es5` library. You can run the test case `PdfToText.test.ts` to verify contents of PDF file. `getPDFText()` method in `lib/WebActions.ts` class is used for extracting text from PDF file.
 
 ## Reports
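
The new README entry says `PdfToText.test.ts` verifies the contents of a PDF file, but the test itself is not shown here. As a rough sketch of one way such a check could work once the text is available as a string: the helpers `normalizeWhitespace` and `pdfTextContains` and the sample string below are hypothetical and not part of this commit. Collapsing whitespace before comparing is an assumption made because extracted text often differs from the visual layout in spacing.

```typescript
// Hypothetical helpers, not part of the commit: compare extracted PDF text
// against an expected phrase after collapsing runs of whitespace.
function normalizeWhitespace(text: string): string {
  return text.replace(/\s+/g, ' ').trim();
}

function pdfTextContains(extracted: string, expected: string): boolean {
  return normalizeWhitespace(extracted).includes(normalizeWhitespace(expected));
}

// Sample string standing in for the value returned by getPDFText().
const sampleText = 'Invoice   No: 1234\nTotal  due: 56.00 ';
console.log(pdfTextContains(sampleText, 'Invoice No: 1234')); // true
console.log(pdfTextContains(sampleText, 'Total due: 99.00')); // false
```

In a real test the sample string would be replaced by the awaited result of `getPDFText()` and the boolean would feed an assertion.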

lib/WebActions.ts

+20
@@ -4,6 +4,7 @@ import type { Page } from '@playwright/test';
 import { BrowserContext, expect } from '@playwright/test';
 import { Workbook } from 'exceljs';
 import { testConfig } from '../testConfig';
+import * as pdfjslib from 'pdfjs-dist-es5';
 
 export class WebActions {
     readonly page: Page;
@@ -54,4 +55,23 @@ export class WebActions {
             throw error;
         });
     }
+
+    async getPdfPageText(pdf: any, pageNo: number) {
+        const page = await pdf.getPage(pageNo);
+        const tokenizedText = await page.getTextContent();
+        const pageText = tokenizedText.items.map((token: any) => token.str).join('');
+        return pageText;
+    }
+
+    async getPDFText(filePath: any): Promise<string> {
+        const dataBuffer = fs.readFileSync(filePath);
+        const pdf = await pdfjslib.getDocument(dataBuffer).promise;
+        const maxPages = pdf.numPages;
+        const pageTextPromises = [];
+        for (let pageNo = 1; pageNo <= maxPages; pageNo += 1) {
+            pageTextPromises.push(this.getPdfPageText(pdf, pageNo));
+        }
+        const pageTexts = await Promise.all(pageTextPromises);
+        return pageTexts.join(' ');
+    }
 }
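
To make the joining behaviour of the two new methods easier to see, here is a minimal sketch that runs the same logic against an in-memory stand-in instead of a real PDF. The interfaces `TextToken`, `PdfPageLike`, and `PdfDocumentLike` and the `fakePdf` helper are assumptions for illustration only. They mirror just the members the diff uses (`numPages`, `getPage`, `getTextContent`, `items[].str`) and are not the library's published types.

```typescript
// Stand-ins for the parts of a loaded PDF document that the diff relies on.
// These shapes are assumptions for illustration, not the library's own types.
interface TextToken { str: string; }
interface PdfPageLike { getTextContent(): Promise<{ items: TextToken[] }>; }
interface PdfDocumentLike {
  numPages: number;
  getPage(pageNo: number): Promise<PdfPageLike>;
}

// Same logic as getPdfPageText in the diff: token strings are joined with no separator.
async function pageText(pdf: PdfDocumentLike, pageNo: number): Promise<string> {
  const page = await pdf.getPage(pageNo);
  const content = await page.getTextContent();
  return content.items.map((token) => token.str).join('');
}

// Same logic as getPDFText once the document is loaded: pages are numbered
// from 1, requested concurrently, and joined with a single space.
async function allPagesText(pdf: PdfDocumentLike): Promise<string> {
  const pending: Promise<string>[] = [];
  for (let pageNo = 1; pageNo <= pdf.numPages; pageNo += 1) {
    pending.push(pageText(pdf, pageNo));
  }
  return (await Promise.all(pending)).join(' ');
}

// Hypothetical in-memory document: each inner array holds one page's tokens.
function fakePdf(pages: string[][]): PdfDocumentLike {
  return {
    numPages: pages.length,
    getPage: async (pageNo: number) => ({
      getTextContent: async () => ({
        items: pages[pageNo - 1].map((str) => ({ str })),
      }),
    }),
  };
}

allPagesText(fakePdf([['Hello', ' ', 'World'], ['Page', '2']]))
  .then((text) => console.log(text)); // Hello World Page2
```

Because `Promise.all` resolves to results in the order the promises were pushed, page order is preserved even though the pages are requested concurrently. Tokens within a page are concatenated directly, so words only stay separated when the token stream itself carries the spaces, as the `Page2` result shows.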
