ReadPDF Duplicating Information
-
Hi Allan.
After several unsuccessful attempts I decided to share the problem below:
OpenRPA.Utilities -> ReadPDF
The output repeats letters and/or complete sentences from previous pages.
I have added some visual markings to help you understand.
Note: for this example I used a simple workflow that copies the result to the clipboard.
I tried other means of output (CSV, DataTable, Excel), and the results had the same problems.
Link to the PDF document used in the example:
download pdf example -
I also have this same problem
-
I have not been super active the last few weeks.
I'm being pushed extremely hard to get version 1.3 of OpenFlow finished, while also having to support my family, because my mother has become very sick. Once that is done, I will return some love to OpenRPA and the forum/rocketchat again.
You are not forgotten or ignored, I'm just not able to answer as fast as I normally do. Sorry about that. -
Allan, I'm sorry, I didn't intend to pressure you; your support and attention have been beyond my expectations.
I'm sorry about your mother, and I wish her well. -
@allan-zimmermann
Allan,
Did you get to look at the OCR reader problem? -
I tested 6-7 different NuGet packages and they all had issues with your PDF ... PDFs are not text, but vector graphics, so it may not always be possible to extract the content as "real" text ...
But I finally found one that seemed to work ..
If you search for TikaOnDotNet and then install
TikaOnDotNet.TextExtractor
Then close the robot and restart it ( for some reason it keeps getting stuck on installing .. need to look at that later ) it should then install TikaOnDotNet and all dependencies
Add a string variable "text", add an Invoke Code activity, set it to C#, and add this code:
    Console.WriteLine("init");
    var textExtractor = new TikaOnDotNet.TextExtraction.TextExtractor();
    Console.WriteLine("Extract");
    var result = textExtractor.Extract(@"C:\Users\Allan\Downloads\test_reading_pdf.pdf");
    Console.WriteLine("save result");
    text += result;
Then modify the file path to match yours ...
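A hedged variation on the snippet above, in case only the extracted text is wanted. Appending `result` itself relies on the result object's string conversion, which may include more than the text, whereas TikaOnDotNet's result object also exposes a `Text` property. This is a sketch that has not been tested inside OpenRPA: `pdfPath` is a hypothetical local name, and `text` is assumed to be the workflow string variable created above. Types are written with their full namespace because the Invoke Code body is compiled as a function body, so `using` directives are not available there.

```csharp
// Body of an Invoke Code activity set to C# (not a standalone program).
// Assumption: "text" is a workflow variable of type String.
var pdfPath = @"C:\Users\Allan\Downloads\test_reading_pdf.pdf"; // hypothetical name; change the path to match your file

// Fail early with a clear message if the path is wrong.
if (!System.IO.File.Exists(pdfPath))
{
    throw new System.IO.FileNotFoundException("PDF not found", pdfPath);
}

var textExtractor = new TikaOnDotNet.TextExtraction.TextExtractor();
var result = textExtractor.Extract(pdfPath);

// Keep only the extracted text.
text = result.Text;
```

Assigning with `=` instead of `+=` overwrites any earlier content of `text`; keep `+=` if the variable should accumulate text from several files.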
-
@allan-zimmermann I will install it and then post the results.
Thank you very much -
@allan-zimmermann
Allan, here's a summary of the tests.
In Visual Studio it worked well (see screenshots).
However, in OpenRPA, the errors described below occur:
LogError:
Thanks for your help.
-
Invoke Code is not a class or a class file, it is a function. You cannot use using directives inside a function.
Remove the using part ( so line 1 ) -
The first time I ran it I did as you instructed, but I got the error "namespace name or type 'TikaOnDotNet' cannot be found".
Note: the TikaOnDotNet package is already installed and works fine in Visual Studio, as I showed in the previous screenshots. -
Did you install TikaOnDotNet using the package manager on the project?
-
@allan-zimmermann No. I installed it from the Visual Studio package manager.
Can you guide me in installing this package?
Note: I'm running in Docker. -
Go to "Open Project" -> Select a project -> Click "Open Package Manager"
Select "Nuget.org", type "tika", select "TikaOnDotNet" and click "Install" when you see the version number in the dropdown list.
What do you mean by "run in docker"? You cannot run OpenRPA in Docker. -
@allan-zimmermann
I want to thank you for your help and the speed with which you always respond. Thanks a lot again, everything is working.