Reads PDF files from a local directory and extracts their text content using
pdftools::pdf_text(). Designed to work with PDFs downloaded via
auto_download().
Value
A data.frame with columns:
- file
Character. The PDF filename (without directory path).
- text
Character. The extracted text, with pages collapsed into a single string separated by newlines.
- pages
Integer. The number of pages in the PDF.
Details
Requires the pdftools package. Install it with
install.packages("pdftools") if not already available.
Files that fail to parse (e.g., scanned/image-only PDFs) are included in
the result with NA text and NA pages along with a warning.
Examples
if (FALSE) { # \dontrun{
# After downloading PDFs
auto_download(format = "pdf", year = 2024, download_dir = "gao_reports")
texts <- extract_text("gao_reports/pdf")
head(texts$file)
nchar(texts$text[1])
} # }
