Python! Extracting Text from PDFs

Adrian Dolinay

pypdf2, pypdf2 extract text, pytesseract, pytesseract extract text, python extract text from pdf, python get text from pdf, python ocr

description

A tutorial on extracting text from PDF files. Learn the difference between natively digital and scanned PDFs, then extract text from a natively digital PDF using PyPDF2 and from a scanned PDF using optical character recognition (OCR) with pytesseract.
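
As a rough sketch of the natively digital case, the snippet below reads each page with PyPDF2 and joins the extracted text. It assumes a recent PyPDF2 release that provides `PdfReader` and `page.extract_text()`. The helper names and the `min_chars` threshold are hypothetical and are not taken from the video. The `looks_scanned` check is only a heuristic assumption: a scanned PDF usually has no text layer, so extraction returns little or nothing.

```python
from typing import Iterable, List


def extract_page_texts(pages: Iterable) -> List[str]:
    """Return the text of each page; pages only need an extract_text() method."""
    # extract_text() may return None or an empty string for image-only pages.
    return [(page.extract_text() or "").strip() for page in pages]


def looks_scanned(page_texts: List[str], min_chars: int = 20) -> bool:
    """Heuristic (an assumption): almost no extractable text suggests a scanned PDF."""
    return sum(len(text) for text in page_texts) < min_chars


def extract_pdf_text(path: str) -> str:
    """Extract text from a natively digital PDF with PyPDF2."""
    # Imported lazily so the helpers above can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    return "\n".join(extract_page_texts(reader.pages))
```

Calling `extract_pdf_text("example.pdf")` (a placeholder file name) would return the text of all pages as one string. If `looks_scanned` reports `True` for the per-page texts, the file probably needs OCR instead.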

Tesseract executable download for Windows: https://github.com/UB-Mannheim/tesseract/wiki
Tesseract Installation for Linux: https://linuxhint.com/install-tesseract-ocr-linux/
Tesseract Installation for Mac: https://www.oreilly.com/library/view/building-computer-vision/9781838644673/95de5b35-436b-4668-8ca2-44970a6e2924.xhtml
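
For the scanned case, pytesseract is only a wrapper, so the Tesseract executable has to be installed separately (see the installation links above). The sketch below is one possible way to OCR a scanned PDF. It assumes the pdf2image package (which itself needs Poppler) to turn pages into images; the video may use a different conversion step. The function names, the `dpi` default, and the `tesseract_cmd` parameter are hypothetical.

```python
from typing import Callable, Iterable, List, Optional


def ocr_images(images: Iterable, ocr: Callable[[object], str]) -> str:
    """Run an OCR callable over page images and join the per-page results."""
    page_texts: List[str] = [ocr(image).strip() for image in images]
    return "\n\n".join(page_texts)


def ocr_scanned_pdf(path: str, tesseract_cmd: Optional[str] = None, dpi: int = 300) -> str:
    """OCR every page of a scanned PDF with pytesseract."""
    # Imported lazily: both packages are third-party dependencies.
    import pytesseract
    from pdf2image import convert_from_path  # assumption: not confirmed by the source

    if tesseract_cmd:
        # Point pytesseract at the Tesseract executable when it is not on PATH.
        pytesseract.pytesseract.tesseract_cmd = tesseract_cmd

    images = convert_from_path(path, dpi=dpi)  # one PIL image per page
    return ocr_images(images, pytesseract.image_to_string)
```

On Windows, `tesseract_cmd` would typically be the full path to `tesseract.exe` chosen during installation. On Linux and macOS the executable is usually found on `PATH`, so the argument can be left out.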

The notebook can be found in the "Data Science with Python" folder within the repo below.
GitHub Repo: https://github.com/ad17171717/YouTube-Tutorials/tree/main/Python/Extract%20Text%20from%20PDF

CONNECT:
LinkedIn: https://www.linkedin.com/in/adrian-dolinay-frm-96a289106/
GitHub: https://github.com/ad17171717
Twitter: https://twitter.com/DolinayG
Odysee: https://odysee.com/@adriandolinay:0
Medium: https://medium.com/@adriandolinay

|-Video Chapters-|
0:00 - Intro
0:10 - Installing packages
1:41 - Text extraction definition
2:21 - Extracting text from a natively digital PDF
4:44 - Extracting text from a scanned PDF using OCR
8:35 - References and additional learning
...
https://www.youtube.com/watch?v=Ohz1f-e0ick

created

2023-04-17

staked

0.0 LBC

license

Copyrighted (contact publisher)

File size

35347336 Bytes