XML processing and website scraping in Java
How to use JSoup and XMLBeam in practice
About the Book
This book is about XML and HTML processing in the Java world.
First, I explain how I came to this topic and how I learned about XMLBeam and JSoup.
This book is practical: it contains a lot of sample code, and you can use it as a tutorial for XMLBeam and JSoup. However, you should read the documentation of both frameworks as well.
I cover converting XML on the Google App Engine (to PDF, RTF, DocX, and HTML), and I build a website scraper with both JSoup and XMLBeam to compare their performance.
At the end, I show how you can customize the toString output of both frameworks.
- What took me the most time?
XML Processing and the Google App Engine
- Why GAE?
- Getting the data
- XML to HTML
- XML to PDF
- XML to RTF
- XML to “.*X”
- Exporting the files in GAE
XML Processing Advanced
- XML Extractor – my version of solving the problem (brute force)
- Beam me up, Sven
XML processing when memory matters
Website scraping with JSoup and XMLBeam
- Introduction of the task
- Comparison of runtime
Runtime comparison advanced
- Visualizing the data
- No network I/O – caching the sites in the memory
- No file I/O – parse the sites only
- One more performance tweak
Upgrade to Java 8
- New features in Java 8
- Using parallel streams
- Making the data preparation faster
- Parallelizing the content scraping
Using other XPath engines with XMLBeam
Custom printing for HTML with JSoup
- Prettier printing
- Conclusion and source code
Printing XMLBeam projections
- Java 6 or 7
- Java 8
- Java 8 revisited
- Source code