XML processing and website scraping in Java
How to use JSoup and XMLBeam in practice
About the Book
This book is about XML and HTML processing in the Java world.
First, I explain how I came to this topic and how I discovered XMLBeam and JSoup.
This book is practical: it contains a lot of sample code, and you can use it as a tutorial for XMLBeam and JSoup. However, you should also read the documentation of both frameworks.
I cover converting XML on the Google App Engine (to PDF, RTF, DocX, and HTML) and build a website scraper with both JSoup and XMLBeam to compare their performance.
At the end, I show how you can customize the toString output of both frameworks.
- What took me the most time?
XML Processing and the Google App Engine
- Why GAE?
- Getting the data
- XML to HTML
- XML to PDF
- XML to RTF
- XML to “.*X”
- Exporting the files in GAE
XML Processing Advanced
- XML Extractor – my version of solving the problem (brute force)
- Beam me up, Sven
XML processing when memory matters
Website scraping with JSoup and XMLBeam
- Introduction of the task
- Comparison of runtime
Runtime comparison advanced
- Visualizing the data
- No network I/O – caching the sites in the memory
- No file I/O – parse the sites only
- One more performance tweak
Upgrade to Java 8
- New features in Java 8
- Using parallel streams
- Making the data preparation faster
- Parallelizing the content scraping
Using other XPath engines with XMLBeam
Custom printing for HTML with JSoup
- Prettier printing
- Conclusion and source code
Printing XMLBeam projections
- Java 6 or 7
- Java 8
- Java 8 revisited
- Source code