Moving my email archives to Ponymail went well
One feature I forgot that zoe had was how it indexed some attachments. If you have an email with a PDF attachment and that PDF attachment had plain text in it (it wasn't just a scan) then you could search on words in the PDF. That's super handy. Ponymail doesn't do that, in fact you don't get to search on any text in attachments, even if they are plain text (or things like patches). Let's fix that!
Remember how I said whenever I need some code I first look if there is an Apache community that has a project that does something similar? Well Apache Tika is an awesome project that will return the plain text of pretty much whatever you throw at it. PDF? sure. Patches? definitely. Word docs? yup. Images? yes. Wait, images? so Tika will go and use Tesseract and do an OCR of an image.
Okay, so let's add a field to the mbox index, attachmenttext, populate it with Tika, and search on it. For now if some text in an attachment matches your search query you'll see the result, but you won't know exactly where the text appears (perhaps later it could highlight which attachment it appears in).
I wrote a quick Python script that runs through all emails in Ponymail (or some search query subset), and if they have attachments runs all the attachments through Apache Tika, storing the plain texts in the attachmenttext field. We ignore anything that's already got something in that field, so we can just run this periodically rather than on import. Then a one-line patch to Ponymail also searches the attachmenttext field. 40,000 attachments and two hours later, it was all done and working.
It's not ready for a PR yet; probably for Ponymail upstream we'd want the option of doing this at import, although I chose not too so we can be deliberately careful as parsing untrusted attachments is risky
So there we have it; a way to search your emails including inside most attachments, outside the cloud, using Open Source projects and a few little patches.
Whenever I need some open source project my first place to look is if there is an Apache Software Foundation community with a project along the same lines. And the ASF is all about communities communicating over Email, so not only is there an ASF project with a solution, but that project is used to provide the web interface for all the archived ASF mailing lists too. "Ponymail Foal" is the project and lists.apache.org is where you can see it running. (Note that the Ponymail website refers to the old version of Pony Mail before "Foal")
Internally the project is mostly Python, HTML and Javascript, using Python scripts to import emails into elasticsearch, so it's really straightforward to get up and running following the project instructions.
So I can just import my several hundred thousand email messages I have in random text mbox format files and be done? Well, nearly. It almost worked but it needed a few tweaks:
Managing a personal email archive can be a daunting task especially with the volume of email correspondence. However, with Ponymail, it's possible to take control of your email archive, keep it local and secure, and search through it quickly and efficiently using the power of ElasticSearch.
First thing to do was to find the original source of the advisory, as co-ordination centres and research firms are known to often play the Telephone game, with advisory texts mangled beyond recognition. Following the links led to the actual advisory on the HP site. This describes the vulnerability as follows:
But then they give the CVE name for the flaw, CVE-2007-6388, which is a known public flaw fixed last month in various Apache versions from the ASF and in updates from various vendors that ship Apache (including Red Hat).
This flaw is a cross-site scripting flaw in the mod_status module. Note that the server-status page is not enabled by default and it is best practice to not make this publicly available. I wrote mod_status over 12 years ago and so I know that this flaw is exactly how the ASF describes it; it definitely can't let a remote attacker execute arbitrary code on your Apache HTTP server, under any circumstances.
I fired off a quick email to a couple of contacts in the HP security team and they confirmed that the flaw they fixed is just the cross-site scripting flaw, not a remote code flaw. The CVSS ratings they give in their advisory are consistent with it being a cross-site scripting flaw too.
So happy with a false alarm we cancelled our Critical Action Plan and I went off and had a nice weekend practicing taking panoramas without a tripod ready for an upcoming holiday. My first attempt came out better than I expected:
User reports a security vulnerability (this includes things later found not to be vulnerabilities) | 47 (30%) | ||
User is confused because they visited a site "powered by Apache" (happens a lot when some phishing or spam points to a site that is taken down and replaced with the default Apache httpd page) | 39 (25%) | ||
User asks a general product support question | 38 (25%) | ||
User asks a question about old security vulnerabilities | 21 (14%) | ||
User reports being compromised, although non-ASF software was at fault (For example through PHP, CGI, other web applications) | 9 (6%) |
That last one is worth restating: in the last 12 months no one who contacted the ASF security team reported a compromise that was found to be caused by ASF software.
This issue is technically interesting as the off-by-one only lets you write one pointer to the space immediately after a stack buffer. So the ability to exploit this issue is totally dependent on the stack layout for a particular compiled version of mod_rewrite. If the compiler used has added padding to the stack immediately after the buffer being overwritten, this issue can not be exploited, and Apache httpd will continue operating normally. Many older (up to a year or so ago) versions of gcc pad stack buffers on most architectures.
The Red Hat Security Response Team analysed Red Hat Enterprise Linux 3 and Red Hat Enterprise Linux 4 binaries for all architectures as shipped by Red Hat and determined that these versions cannot be exploited. We therefore do not plan on providing updates for this issue.
In contrast, our Fedora Core 4 and 5 builds are vulnerable as the compiler version used adds no stack padding. For these builds, the pointer being overwritten overwrites a saved register and, unfortunately, one that has possible security consequences. It's still quite unlikely we'll see a worm appear for this issue that affects Fedora though: for one thing, the vulnerability can only be exploited when mod_rewrite is enabled and a specific style of RewriteRule is used. So it's likely to be different on every vulnerable site (unless someone has some third party product that relies on some vulnerable rewrite rules). Even then, you still need to be able to defeat the Fedora Core randomization to be able to reliably do anything interesting with this flaw.
So, as you can probably tell, I spent a few days this week analysing assembler dumps of our Apache binaries on some architectures. It was more fun than expected; mostly because I used to code full-time in assembler, although that was over 15 years ago.
In the past I've posted timelines of when we found out about issues and dealt with them in Apache; so for those who are interested:
20060721-23:29 Mark Dowd forwards details of issue to security@apache.org 20060722-07:42 Initial response from Apache security team 20060722-08:14 Investigation, testing, and patches created 20060724-19:04 Negotiated release date with reporter 20060725-10:00 Notified NISCC and CERT to give vendors heads up 20060727-17:00 Fixes committed publically 20060727-23:30 Updates released to Apache site 20060828 Public announcement from Apache, McAfee, CERT, NISCCHere is the patch against 2.0, the patch against 1.3 or 2.2 is almost identical.
Anyway, the most useful thing that I've kept up to date in Apache Week is the database of vulnerabilities that affects the Apache Web server v1.3 and v2.0. This list was even being linked to directly by httpd.apache.org so I made good on a promise I made a year ago and moved the database to the official site. Apache Week uses xslt for transforming the database, but the Apache site used velocity for page markup, but no one seemed to mind me adding ant-trax.jar to the site so the database gets converted from xslt to the page format that gets marked up by velocity. The end result is a couple of nice HTML pages on the official Apache site that list all the vulnerabilities that is easy for us to keep up to date.
In January 1995 I found a security flaw affecting the NCSA web server and I'd forwarded my patch on to Brian Behlendorf. The flaw affected the Wired.com site he was the administrator of. He told me about the Apache project and and I was invited to join the group and share the numerous patches I'd made to NCSA httpd, so my first post was back in April 1995. I can't believe that was ten years ago!
Anyway in my official Red Hat blog I've been posting stuff about the recent comparisons of security issues in Microsoft and Red Hat, and we've published a ton of useful data. See Counting Teapots and Real Data.
Red Hat Enterprise Linux 3 was released in November 2003 and I wanted to find out what it's survival rate on x86 would likely be to compare to Windows. We'll first look at the worst case and find what flaws have been fixed in RHEL3 that could possibly be remotely exploited. Then from that work out how often they are exploited to come up with a survivability time for RHEL3.
Firstly we need to discount flaws that require user interaction as they are not included in a survivability study - for example CAN-2004-0597 or CAN-2004-0722 where a user would have to visit a malicious web page, preview a malicious email, or open a malicious file with a particular application. So we won't include CAN-2004-0006, a flaw in Gaim that requires a user to be sent a malicious packet from a trusted buddy, for example.
From the release of RHEL3 until 19th August 2004 we have the following flaws that could be triggered remotely:
CAN-2004-0493 is a memory leak in Apache. This allowed a remote attacker to perform a denial of service attack against the server by forcing it to consume large amounts of memory. This flaw could possibly be use in combination with a flaw in PHP to execute arbitrary code. However no exploit has been seen in the wild for this issue, and it looks incredibly difficult to exploit. A second memory leak affected SSL enabled Apache, CAN-2004-0113. These wouldn't allow a full installation of RHEL3 to be remotely compromised.
Flaws in OpenSSL were found that could lead to a crash, CAN-2004-0079. Any service that accepts SSL traffic using OpenSSL to decode it could be vulnerable to this issue. However for servers like Apache, a single child crashing is automatically recovered and will not even cause a Denial of Service.
A flaw in the ISAKMP daemon in racoon could lead to a DoS, CAN-2004-0403, but this daemon is not used by default.
Using one of the above flaws, remote probes could cause a service to crash or exceed OS resource limits and be terminated. These have little impact on the survivability of a machine. What affects survivability are flaws that could lead to remote code execution or a total machine crash.
Two of these type of flaws are in CVS, CAN-2004-0396 and CAN-2004-0414. These flaws could allow a remote user who has the permission to connect to a CVS server to execute arbitrary code on the server. An exploit for these flaws is widely available. However the majority of systems would not be running a CVS server, it certainly isn't default, and in order for this to be remotely exploited by an unknown attacker a system would need to have been set up to allow remote anonymous CVS access.
A flaw in Samba SWAT, CAN-2004-0600, allows remote code execution but only where the SWAT (administration) port is open to the internet. This is not the default, and not a sensible or usual configuration.
The final issue is an overflow in rsync, CAN-2003-0962. This flaw is similar to the CVS flaw in that it requires a system to be running an open rsync server to be exploited.
So a full install of a Red Hat Enterprise Linux 3 box that was connected to the internet in November 2003 even without the firewall and without receiving updates would still remain uncompromised (and still running) to this day.
It's not to say that a RHEL3 user couldn't get compromised - but that's not the point of the survivability statistuc. In order to get compromised, a user would have to have either enabled anonymous rsync, SWAT, or be running an open CVS server, none of which are default or common. Or a user would have to take some action like visiting a malicious web site or receiving and opening a malicious email.
They got that information from a Connectiva security advisory. That advisory actually says "trigger the execution of arbitrary commands" but if you read the context you'll find that in fact what it means is that a cunning attacker could use a minor flaw in Apache that allows it to log escape characters in order to exploit possible flaws in terminal emulators to execure arbitrary commands if you view the log file. [source].
So we've magically turned an issue which is of quite minor risk and minor severity into one classed as "Moderately Critical". Using the same logic you could then use publicised (but fixed) flaws in the Linux kernel to gain root privileges and we've got a remote root exploit in Apache folks! It's Chinese Whisper Security Advisories at their best.
Over the years there have been many times when we've thought about stopping production, usually when a competitor announced some other Apache magazine that we thought would do a better job than we do. But most of them gave up. They probably realised that there wasn't any money to be made from an Apache httpd journal.
UK Web became C2Net which became Red Hat, and Apache Week is still going strong. We'll have to think of something exciting to do for our tenth birthday.