Python! Extracting Text from PDFs
Adrian Dolinay
A tutorial on extracting text from PDF files. Learn the difference between natively digital and scanned PDFs, extract text from a natively digital PDF using PyPDF2, and extract text from a scanned PDF using optical character recognition (OCR) with pytesseract.
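The two approaches can be sketched roughly as follows. This is not the code from the video or the notebook. It assumes PyPDF2 2.0 or later (which provides `PdfReader`), plus pytesseract and Pillow with the Tesseract executable installed. The function names and the `min_chars` threshold are hypothetical, chosen only for illustration.

```python
def extract_digital_text(pdf_path):
    """Return a list holding the text of each page of a natively digital PDF."""
    # Imported lazily so this module still loads if PyPDF2 is not installed.
    from PyPDF2 import PdfReader  # assumption: PyPDF2 >= 2.0

    reader = PdfReader(pdf_path)
    # extract_text() can yield nothing for pages without a text layer.
    return [page.extract_text() or "" for page in reader.pages]


def looks_scanned(page_texts, min_chars=20):
    """Heuristic (an assumption, not from the tutorial): treat the PDF as
    scanned when the pages average fewer than min_chars characters."""
    total = sum(len(text.strip()) for text in page_texts)
    return total < min_chars * max(len(page_texts), 1)


def ocr_image(image_path):
    """Run Tesseract OCR on a single page image and return the recognized text."""
    # Imported lazily; both packages need to be installed separately.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img)
```

A scanned PDF has to be rendered to page images before `ocr_image` can be applied. That step is left out here because the description does not say which tool the tutorial uses for it.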
Tesseract executable download for Windows: https://github.com/UB-Mannheim/tesseract/wiki
Tesseract installation for Linux: https://linuxhint.com/install-tesseract-ocr-linux/
Tesseract installation for Mac: https://www.oreilly.com/library/view/building-computer-vision/9781838644673/95de5b35-436b-4668-8ca2-44970a6e2924.xhtml
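Once Tesseract is installed, pytesseract still needs to find the executable, which can be an issue on Windows when the install folder is not on the PATH. A minimal sketch of one way to handle this is below. The helper names are hypothetical, and any path passed in is whatever location you installed to.

```python
import shutil


def locate_tesseract(explicit_path=None):
    """Return the Tesseract executable path, or None if it cannot be found."""
    if explicit_path:
        # Trust a caller-supplied location (e.g. a custom Windows install folder).
        return explicit_path
    # Otherwise search the directories listed in PATH.
    return shutil.which("tesseract")


def configure_pytesseract(explicit_path=None):
    """Point pytesseract at the Tesseract executable and return the path used."""
    import pytesseract  # imported lazily; needs to be installed separately

    path = locate_tesseract(explicit_path)
    if path is None:
        raise FileNotFoundError("Tesseract executable not found on PATH")
    pytesseract.pytesseract.tesseract_cmd = path
    return path
```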
The notebook can be found in the "Data Science with Python" folder in the repo below.
GitHub Repo: https://github.com/ad17171717/YouTube-Tutorials/tree/main/Python/Extract%20Text%20from%20PDF
CONNECT:
LinkedIn: https://www.linkedin.com/in/adrian-dolinay-frm-96a289106/
GitHub: https://github.com/ad17171717
Twitter: https://twitter.com/DolinayG
Odysee: https://odysee.com/@adriandolinay:0
Medium: https://medium.com/@adriandolinay
|-Video Chapters-|
0:00 - Intro
0:10 - Installing packages
1:41 - Text extraction definition
2:21 - Extracting text from a natively digital PDF
4:44 - Extracting text from a scanned PDF using OCR
8:35 - References and additional learning
...
https://www.youtube.com/watch?v=Ohz1f-e0ick