Create a Reader for .doc, .docx, and .pdf files #46

Open
3 of 6 tasks
GStefanowich opened this issue Apr 29, 2023 · 4 comments

@GStefanowich
Contributor

GStefanowich commented Apr 29, 2023

There are currently two unimplemented template files: one for reading .doc and .docx files, and one for reading .pdf files.

  • .odt files
  • .doc files
  • .docx files
  • .pdf files
  • .ppt files
  • .pptx files

Reading Word documents and PDF files is a bit less straightforward than reading plaintext files. A library that is compatible with this project's license may be advisable.

public sealed class PdfReader : ILineReader
{
    /// <inheritdoc />
    public IAsyncEnumerable<string?> ReadLineAsync(CancellationToken cancellation = default)
        => throw new NotImplementedException();

    /// <inheritdoc />
    public ValueTask DisposeAsync()
        => ValueTask.CompletedTask;
}

public sealed class DocumentReader : ILineReader
{
    /// <inheritdoc />
    public IAsyncEnumerable<string?> ReadLineAsync(CancellationToken cancellation = default)
        => throw new NotImplementedException();

    /// <inheritdoc />
    public ValueTask DisposeAsync()
        => ValueTask.CompletedTask;
}

@jaimevisser

Docx is zipped XML, so e-mails should be readable using a plaintext search of the extracted XML.

@GStefanowich
Contributor Author

@jfbourke Do you want to split your OpenDocumentTextReader.cs to support .docx files?

The only difference I can see at first glance is .odt reads from content.xml and .docx reads from word/document.xml.

If you don't want to, I can, but I want everyone to have a chance to contribute and your implementation works well 😃
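
As a rough illustration of the point above (that .odt and .docx differ mainly in which archive entry holds the body XML), here is a minimal sketch. It is not the project's OpenDocumentTextReader.cs; the class name `ZippedXmlText` and the methods `EntryPathFor` and `ReadBodyText` are hypothetical names chosen for this example. It assumes both formats are ZIP archives and simply concatenates every XML text node, so paragraph boundaries are not preserved.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, for illustration only (not part of the project).
public static class ZippedXmlText
{
    // Map a file extension to the archive entry that holds the document body.
    // The two paths are the ones mentioned in the comment above.
    public static string EntryPathFor(string extension) =>
        extension.ToLowerInvariant() switch
        {
            ".odt"  => "content.xml",
            ".docx" => "word/document.xml",
            _ => throw new NotSupportedException($"Unsupported extension: {extension}")
        };

    // Open the file as a ZIP archive, load the body XML, and return its text nodes joined together.
    public static string ReadBodyText(Stream file, string extension)
    {
        using var archive = new ZipArchive(file, ZipArchiveMode.Read, leaveOpen: true);

        ZipArchiveEntry entry = archive.GetEntry(EntryPathFor(extension))
            ?? throw new InvalidDataException("Archive does not contain the expected body XML entry.");

        using Stream xml = entry.Open();
        XDocument document = XDocument.Load(xml);

        // Assumption: every text node is wanted; no attempt is made to split paragraphs or lines.
        return string.Concat(document.DescendantNodes().OfType<XText>().Select(t => t.Value));
    }
}
```

With a shared routine like this, a format-specific reader would only need to supply its extension (or entry path). A real ILineReader implementation would additionally have to yield lines asynchronously and decide how paragraphs map to lines.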

@jfbourke
Contributor

jfbourke commented May 2, 2023

@GStefanowich I've split the implementation to support the two file formats, and added .pptx support as well.

@GStefanowich
Contributor Author

@jfbourke Looks great!
