XML processing and website scraping in Java
This book is 100% complete
Completed on 2017-11-16
About the Book
This book is about XML and HTML processing in the Java world.
First, I explain how I came to this topic and how I discovered XMLBeam and JSoup.
This book is practical: it contains a lot of sample code, and you can use it as a tutorial for XMLBeam and JSoup. However, you should still read the documentation of both frameworks.
I cover converting XML on the Google App Engine (to PDF, RTF, DocX, and HTML), and I build a website scraper with both JSoup and XMLBeam to compare their performance.
Finally, I show how you can customize the toString output of both frameworks.
- What took me the most time?
XML Processing and the Google App Engine
- Why GAE?
- Getting the data
- XML to HTML
- XML to PDF
- XML to RTF
- XML to “.*X”
- Exporting the files in GAE
XML Processing Advanced
- XML Extractor – my version of solving the problem (brute force)
- Beam me up, Sven
XML processing when memory matters
Website scraping with JSoup and XMLBeam
- Introduction of the task
- Comparison of runtime
Runtime comparison advanced
- Visualizing the data
- No network I/O – caching the sites in the memory
- No file I/O – parse the sites only
- One more performance tweak
Upgrade to Java 8
- New features in Java 8
- Using parallel streams
- Making the data preparation faster
- Parallelizing the content scraping
Using other XPath engines with XMLBeam
Custom printing for HTML with JSoup
- Prettier printing
- Conclusion and source code
Printing XMLBeam projections
- Java 6 or 7
- Java 8
- Java 8 revisited
- Source code