mark :: blog :: apache

01 Apr 2023: Searching my email attachments with Ponymail and Tika

Moving my email archives to Ponymail went well

One feature I forgot that zoe had was how it indexed some attachments. If you have an email with a PDF attachment and that PDF attachment had plain text in it (it wasn't just a scan) then you could search on words in the PDF. That's super handy. Ponymail doesn't do that, in fact you don't get to search on any text in attachments, even if they are plain text (or things like patches). Let's fix that!

Remember how I said whenever I need some code I first look if there is an Apache community that has a project that does something similar? Well Apache Tika is an awesome project that will return the plain text of pretty much whatever you throw at it. PDF? sure. Patches? definitely. Word docs? yup. Images? yes. Wait, images? so Tika will go and use Tesseract and do an OCR of an image.

Okay, so let's add a field to the mbox index, attachmenttext, populate it with Tika, and search on it. For now if some text in an attachment matches your search query you'll see the result, but you won't know exactly where the text appears (perhaps later it could highlight which attachment it appears in).

I wrote a quick Python script that runs through all emails in Ponymail (or some search query subset), and if they have attachments runs all the attachments through Apache Tika, storing the plain texts in the attachmenttext field. We ignore anything that's already got something in that field, so we can just run this periodically rather than on import. Then a one-line patch to Ponymail also searches the attachmenttext field. 40,000 attachments and two hours later, it was all done and working.

It's not ready for a PR yet; probably for Ponymail upstream we'd want the option of doing this at import, although I chose not too so we can be deliberately careful as parsing untrusted attachments is risky

So there we have it; a way to search your emails including inside most attachments, outside the cloud, using Open Source projects and a few little patches.

28 Mar 2023: How Ponymail helped me keep my email archive searchable but out of the cloud.

I have a lot of historical personal email, back as far as 1991, and from time to time there's a need to find some old message. Although services like GMail would like to you keep all your mail in their cloud and pride themselves on searching, I'd rather keep my email archive offline and encrypted at rest unless there's some need to do a search. Indeed I always use Google takeout every month to remove all historic GMail messages. Until this year I used a tool called Zoe for allowing searchable email archives. You can import your emails, it uses Apache Lucene as a back end, and gives you a nice web based interface to find your mails. But Zoe has been unmaintained for over a decade and has mostly vanished from the net. It was time to replace it.

Whenever I need some open source project my first place to look is if there is an Apache Software Foundation community with a project along the same lines. And the ASF is all about communities communicating over Email, so not only is there an ASF project with a solution, but that project is used to provide the web interface for all the archived ASF mailing lists too. "Ponymail Foal" is the project and lists.apache.org is where you can see it running. (Note that the Ponymail website refers to the old version of Pony Mail before "Foal")

Internally the project is mostly Python, HTML and Javascript, using Python scripts to import emails into elasticsearch, so it's really straightforward to get up and running following the project instructions.

So I can just import my several hundred thousand email messages I have in random text mbox format files and be done? Well, nearly. It almost worked but it needed a few tweaks:

Ponymail wasn't able to parse a fair number of email messages. Analysing the mails led to only three root causes of mails not being able to be imported:

Bad "Content-Type" headers. Even my bank gets this wrong with the header Content-Type: text/html; charset="utf-8 charset=\"iso-8859-1\"". I just made the code ignore similar bad headers and try the fallbacks. Patch here

Messages with no text or HTML body and no attachments. These are fairly common for example a calendar entry might be sent as "Content-Type: text/calendar". I just made it so that if there is no recognised body it just uses whatever the last section it found was, regardless of content type. Patch here

Google Chat messages from many years ago. These have no useful anything, no body, no to: no message id, no return address. Rather than note them as failures I use made the code ignore them completely. Since this is just a warning, no upstream patch prepared.

Handling List-Id's. Ponymail likes to sort mails by the List-Id which makes a lot of sense where you have the thousands of Apache lists. But with personal email, and certainly when you subscribe to various newsletters, or get bills, or spam that got into the archives then you end up with lots of list id's that are only used once or twice or are not useful. Working on open source projects there's lots of lists that I'm on that I want the email to get archived, but it would be nice if it was separated out in the Ponymail UI. So really I needed the ability to have an 'allow list' of list id's that I want to have separate, with everything else defaulting to a generic list id (being my email address where all those mails came into). Patch here

HTML email. Where an email contains only HTML and no text version then Ponymail will make and store a text conversion of the HTML, but sometimes, especially those pesky bank emails, it's useful to be able to see the HTML with all the embedded images. Displaying HTML email in HTML isn't really a goal for the project, especially since you have to be really careful you don't end up parsing untrusted javascript for example. And you might not want all those tracking images to suddenly start getting pinged. But I'd really like a button that you could use on selected emails to display them in HTML. Fortunately Ponymail stores a complete raw copy of the email, any my proof-of-concept worked, so this can be easy to add in the future.

Managing a personal email archive can be a daunting task especially with the volume of email correspondence. However, with Ponymail, it's possible to take control of your email archive, keep it local and secure, and search through it quickly and efficiently using the power of ElasticSearch.

26 Feb 2008: XSS vs Remote Execution of Arbitrary Code

Last Friday, just as I was finishing work for the day, an email appeared in my mailbox from the UK CPNI announcing a public remote code execution flaw in Apache on HP-UX. As Chair of the Apache Software Foundation Security Team I knew there were no outstanding remote code execution flaws in Apache HTTP server (in fact we've not had a remote code execution flaw for many years) so I was expecting to invoke the Red Hat Critical Action Plan which would have meant a rather long weekend for me, my team, and various development and quality engineering staff.

First thing to do was to find the original source of the advisory, as co-ordination centres and research firms are known to often play the Telephone game, with advisory texts mangled beyond recognition. Following the links led to the actual advisory on the HP site. This describes the vulnerability as follows:

A potential security
vulnerability has been identified with HP-UX running Apache. The vulnerability
could be exploited remotely to execute arbitrary code

But then they give the CVE name for the flaw, CVE-2007-6388, which is a known public flaw fixed last month in various Apache versions from the ASF and in updates from various vendors that ship Apache (including Red Hat).

This flaw is a cross-site scripting flaw in the mod_status module. Note that the server-status page is not enabled by default and it is best practice to not make this publicly available. I wrote mod_status over 12 years ago and so I know that this flaw is exactly how the ASF describes it; it definitely can't let a remote attacker execute arbitrary code on your Apache HTTP server, under any circumstances.

I fired off a quick email to a couple of contacts in the HP security team and they confirmed that the flaw they fixed is just the cross-site scripting flaw, not a remote code flaw. The CVSS ratings they give in their advisory are consistent with it being a cross-site scripting flaw too.

So happy with a false alarm we cancelled our Critical Action Plan and I went off and had a nice weekend practicing taking panoramas without a tripod ready for an upcoming holiday. My first attempt came out better than I expected:

16 Jul 2007: A year of Apache Security Response

For the past 12 months I've been keeping metrics on the types of issues that get reported to the private Apache Software Foundation security alert email address. Here's the summary for Jul 2006-Jun 2007 based on 154 reports:

User reports a security vulnerability
(this includes things later found not to be vulnerabilities) 47 (30%)

User is confused because they visited a site "powered by Apache"
(happens a lot when some phishing or spam points to a site that is taken down and replaced with the default Apache httpd page) 39 (25%)

User asks a general product support question
38 (25%)

User asks a question about old security vulnerabilities
21 (14%)

User reports being compromised, although non-ASF software was at fault
(For example through PHP, CGI, other web applications) 9 (6%)

That last one is worth restating: in the last 12 months no one who contacted the ASF security team reported a compromise that was found to be caused by ASF software.

28 Jul 2006: CVE-2006-3747 mod_rewrite off-by-one

There's a new Apache HTTP Server security issue out today, an off-by-one bug that affects the Rewrite module, mod_rewrite. We've not had many serious Apache bugs in some time, in fact the last one of note was four years ago, the Chunked Encoding Vulnerability.

This issue is technically interesting as the off-by-one only lets you write one pointer to the space immediately after a stack buffer. So the ability to exploit this issue is totally dependent on the stack layout for a particular compiled version of mod_rewrite. If the compiler used has added padding to the stack immediately after the buffer being overwritten, this issue can not be exploited, and Apache httpd will continue operating normally. Many older (up to a year or so ago) versions of gcc pad stack buffers on most architectures.

The Red Hat Security Response Team analysed Red Hat Enterprise Linux 3 and Red Hat Enterprise Linux 4 binaries for all architectures as shipped by Red Hat and determined that these versions cannot be exploited. We therefore do not plan on providing updates for this issue.

In contrast, our Fedora Core 4 and 5 builds are vulnerable as the compiler version used adds no stack padding. For these builds, the pointer being overwritten overwrites a saved register and, unfortunately, one that has possible security consequences. It's still quite unlikely we'll see a worm appear for this issue that affects Fedora though: for one thing, the vulnerability can only be exploited when mod_rewrite is enabled and a specific style of RewriteRule is used. So it's likely to be different on every vulnerable site (unless someone has some third party product that relies on some vulnerable rewrite rules). Even then, you still need to be able to defeat the Fedora Core randomization to be able to reliably do anything interesting with this flaw.

So, as you can probably tell, I spent a few days this week analysing assembler dumps of our Apache binaries on some architectures. It was more fun than expected; mostly because I used to code full-time in assembler, although that was over 15 years ago.

In the past I've posted timelines of when we found out about issues and dealt with them in Apache; so for those who are interested:

20060721-23:29 Mark Dowd forwards details of issue to security@apache.org
20060722-07:42 Initial response from Apache security team
20060722-08:14 Investigation, testing, and patches created
20060724-19:04 Negotiated release date with reporter
20060725-10:00 Notified NISCC and CERT to give vendors heads up
20060727-17:00 Fixes committed publically
20060727-23:30 Updates released to Apache site
20060828       Public announcement from Apache, McAfee, CERT, NISCC

Here is the patch against 2.0, the patch against 1.3 or 2.2 is almost identical.

04 May 2005: Apache vulnerability database

We've not really given Apache Week any priority in the last few months -- in fact we've not posted a new issue since October 2004. So I'm glad we didn't rename it Apache Month. Time to register apachewhenthereissomethinginteresting.com.

Anyway, the most useful thing that I've kept up to date in Apache Week is the database of vulnerabilities that affects the Apache Web server v1.3 and v2.0. This list was even being linked to directly by httpd.apache.org so I made good on a promise I made a year ago and moved the database to the official site. Apache Week uses xslt for transforming the database, but the Apache site used velocity for page markup, but no one seemed to mind me adding ant-trax.jar to the site so the database gets converted from xslt to the page format that gets marked up by velocity. The end result is a couple of nice HTML pages on the official Apache site that list all the vulnerabilities that is easy for us to keep up to date.

02 Mar 2005: Happy Birthday!

Roy Fielding sent out a message reminding us all that the Apache web server just celebrated it's tenth birthday.

In January 1995 I found a security flaw affecting the NCSA web server and I'd forwarded my patch on to Brian Behlendorf. The flaw affected the Wired.com site he was the administrator of. He told me about the Apache project and and I was invited to join the group and share the numerous patches I'd made to NCSA httpd, so my first post was back in April 1995. I can't believe that was ten years ago!

Anyway in my official Red Hat blog I've been posting stuff about the recent comparisons of security issues in Microsoft and Red Hat, and we've published a ton of useful data. See Counting Teapots and Real Data.

21 Sep 2004: Survivability

In the Red Hat earnings call last night, Matthew Szulik mentioned some statistics on the survivability of Red Hat Enterprise Linux 3. In August 2004, SANS Internet Storm Center published statistics on the survival time of Windows by looking at the average time between probes/worms that could affect an unpatched system. The findings showed that it would take only 20 minutes on average for a machine to be compromised remotely, less than the time it would take to download all the updates to protect against those flaws. See some news about the report.

Red Hat Enterprise Linux 3 was released in November 2003 and I wanted to find out what it's survival rate on x86 would likely be to compare to Windows. We'll first look at the worst case and find what flaws have been fixed in RHEL3 that could possibly be remotely exploited. Then from that work out how often they are exploited to come up with a survivability time for RHEL3.

Firstly we need to discount flaws that require user interaction as they are not included in a survivability study - for example CAN-2004-0597 or CAN-2004-0722 where a user would have to visit a malicious web page, preview a malicious email, or open a malicious file with a particular application. So we won't include CAN-2004-0006, a flaw in Gaim that requires a user to be sent a malicious packet from a trusted buddy, for example.

From the release of RHEL3 until 19th August 2004 we have the following flaws that could be triggered remotely:

CAN-2004-0493 is a memory leak in Apache. This allowed a remote attacker to perform a denial of service attack against the server by forcing it to consume large amounts of memory. This flaw could possibly be use in combination with a flaw in PHP to execute arbitrary code. However no exploit has been seen in the wild for this issue, and it looks incredibly difficult to exploit. A second memory leak affected SSL enabled Apache, CAN-2004-0113. These wouldn't allow a full installation of RHEL3 to be remotely compromised.

Flaws in OpenSSL were found that could lead to a crash, CAN-2004-0079. Any service that accepts SSL traffic using OpenSSL to decode it could be vulnerable to this issue. However for servers like Apache, a single child crashing is automatically recovered and will not even cause a Denial of Service.

A flaw in the ISAKMP daemon in racoon could lead to a DoS, CAN-2004-0403, but this daemon is not used by default.

Using one of the above flaws, remote probes could cause a service to crash or exceed OS resource limits and be terminated. These have little impact on the survivability of a machine. What affects survivability are flaws that could lead to remote code execution or a total machine crash.

Two of these type of flaws are in CVS, CAN-2004-0396 and CAN-2004-0414. These flaws could allow a remote user who has the permission to connect to a CVS server to execute arbitrary code on the server. An exploit for these flaws is widely available. However the majority of systems would not be running a CVS server, it certainly isn't default, and in order for this to be remotely exploited by an unknown attacker a system would need to have been set up to allow remote anonymous CVS access.

A flaw in Samba SWAT, CAN-2004-0600, allows remote code execution but only where the SWAT (administration) port is open to the internet. This is not the default, and not a sensible or usual configuration.

The final issue is an overflow in rsync, CAN-2003-0962. This flaw is similar to the CVS flaw in that it requires a system to be running an open rsync server to be exploited.

So a full install of a Red Hat Enterprise Linux 3 box that was connected to the internet in November 2003 even without the firewall and without receiving updates would still remain uncompromised (and still running) to this day.

It's not to say that a RHEL3 user couldn't get compromised - but that's not the point of the survivability statistuc. In order to get compromised, a user would have to have either enabled anonymous rsync, SWAT, or be running an open CVS server, none of which are default or common. Or a user would have to take some action like visiting a malicious web site or receiving and opening a malicious email.

14 Apr 2004: Apache Critical Flaw! (not)

So according to a Secunia advisory I just read there is a new flaw in Apache that allows attackers to "compromise a vulnerable system". [source]

They got that information from a Connectiva security advisory. That advisory actually says "trigger the execution of arbitrary commands" but if you read the context you'll find that in fact what it means is that a cunning attacker could use a minor flaw in Apache that allows it to log escape characters in order to exploit possible flaws in terminal emulators to execure arbitrary commands if you view the log file. [source].

So we've magically turned an issue which is of quite minor risk and minor severity into one classed as "Moderately Critical". Using the same logic you could then use publicised (but fixed) flaws in the Linux kernel to gain root privileges and we've got a remote root exploit in Apache folks! It's Chinese Whisper Security Advisories at their best.

13 Feb 2004: 8 years of Apache Week

As I was commiting the template for this weeks issue of Apache Week I noticed that it has now been exactly eight years since I wrote the first issue. Back then Apache wasn't so popular and the documentation was lacking. Apache Week was designed specifically to give administrators the confidence to try the Apache web server on their machines without having to parse the hundreds of messages each week on the developer mailing list. That first issue was written over a 64k ISDN dial-up line from a computer perched on stark IKEA tabletop. Friday afternoons were spent writing up what had happened during the week. Not much has changed. Actually, I think that IKEA tabletop is still sitting in storage somewhere at Red Hat in Guildford. I wish I'd kept hold of it, it would have been useful for my girlfriends sons train layout.

Over the years there have been many times when we've thought about stopping production, usually when a competitor announced some other Apache magazine that we thought would do a better job than we do. But most of them gave up. They probably realised that there wasn't any money to be made from an Apache httpd journal.

UK Web became C2Net which became Red Hat, and Apache Week is still going strong. We'll have to think of something exciting to do for our tenth birthday.