Archive for the ‘Regexps’ Category

Regular Expression to Validate URLs

February 10th, 2010 No comments

Here is a short regexp that checks whether a user has entered a correctly formed URL. It might be useful in any script dealing with user data.

I will show it as PHP code:

preg_match('/^(http:\/\/|https:\/\/)([^\.\/]+\.)*([a-zA-Z0-9])([a-zA-Z0-9-]*)\.([a-zA-Z]{2,4})(\/.*)?$/i', $_POST['url']);

It will check whether the URL matches the pattern. It may not be ideal, but it works. :)
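
To make the check reusable, the same pattern can be wrapped in a small function. This is only a minimal sketch: the name is_valid_url is my own assumption, not part of the original snippet. Keep in mind that the pattern accepts top-level domains of 2 to 4 letters only and does not allow a port number, so it rejects some real URLs.

```php
<?php
// Hypothetical helper: the function name is an assumption for illustration.
function is_valid_url(string $url): bool
{
    // Same pattern as in the post: http/https scheme, optional subdomains,
    // a domain label, a 2-4 letter TLD and an optional path.
    $pattern = '/^(http:\/\/|https:\/\/)([^\.\/]+\.)*([a-zA-Z0-9])([a-zA-Z0-9-]*)\.([a-zA-Z]{2,4})(\/.*)?$/i';

    // preg_match() returns 1 on a match, 0 on no match and false on error,
    // so compare strictly against 1.
    return preg_match($pattern, $url) === 1;
}
```

In a form handler you could call it as is_valid_url($_POST['url'] ?? '') and show an error message when it returns false.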

Regular Expression to Extract all E-mail Addresses from a File With PHP

October 8th, 2008 2 comments

Sometimes you need to extract some data from text files: e-mails, passwords, or just some simple tags. Whatever it is, regular expressions are usually your best choice. I will show you a PHP script that extracts all valid e-mails from a text file.

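<?php
// Input file to scan and output file for the extracted addresses.
$fs = fopen("best.txt", "r");
$f3 = fopen("clean.txt", "a");
$pattern = "/[_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)*\.(([0-9]{1,3})|([a-zA-Z]{2,3})|(aero|coop|info|museum|name))/";

// Read line by line so that a big file never has to fit into memory.
while (($gan = fgets($fs)) !== false)
{
    // Write an address only when the line actually contains one.
    if (preg_match($pattern, $gan, $matches))
    {
        fwrite($f3, trim($matches[0]) . "\r\n");
    }
}
fclose($f3);
fclose($fs);

?>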

best.txt is the input file containing the e-mail addresses. clean.txt will contain e-mail addresses only. We're checking every line of best.txt against a regular expression that represents a valid e-mail pattern. "/[_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)*\.(([0-9]{1,3})|([a-zA-Z]{2,3})|(aero|coop|info|museum|name))/" is the pattern, and I don't think it is necessary to explain what it means. If you're familiar with regular expressions, you'll be able to modify it to find any specific e-mails. If not, you may use this example as it is and you will find that it really works. There are some programs on the net that do the same thing, but they work under Windows and can't process big files.

This script can work with big files, but don't forget to set the time limit to 0, either with max_execution_time in php.ini (that is how I have it) or with set_time_limit(0) in the script. Happy parsing! :)
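
The script takes at most one address per line. If a line can hold several, preg_match_all can collect them all. Here is a minimal sketch, assuming a helper called extract_emails (my name, not part of the script above). I have also moved the long endings (aero, coop, info, museum, name) in front of the 2-3 letter alternative, because otherwise an address ending in .info is cut down to .inf. That reordering is my own adjustment to the pattern.

```php
<?php
// Hypothetical helper: the name and the reordered alternation are assumptions.
function extract_emails(string $text): array
{
    // Same building blocks as the pattern in the post, but the long endings
    // are tried first so that ".info" is not truncated to ".inf".
    $pattern = '/[_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)*'
             . '\.((aero|coop|info|museum|name)|([a-zA-Z]{2,3})|([0-9]{1,3}))/';

    // preg_match_all() fills $matches[0] with every full match in the text.
    preg_match_all($pattern, $text, $matches);

    return $matches[0];
}
```

Inside the reading loop you could then write every element of extract_emails($gan) to clean.txt instead of only the first match.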

Regular Expression to Parse Text Between Simple Tags (XML)

September 6th, 2008 2 comments

It is often necessary to extract text from a variable that contains HTML or XML code. I've created a simple regular expression that will help you extract all text between certain tags into an array. It is a PHP solution, though the regular expression itself is compatible with other programming languages.

preg_match_all("/<tag>(.*?)<\/tag>/", $source, $results);

This construction will create an array with the extracted data. All you need to do is change "tag" to any tag you like. This line was written to parse XML files, but it will work for simple HTML tags without attributes too.

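The function above will extract all occurrences of the regular expression match. $results will contain an array with the extracted values: $results[0] holds the full matches including the tags, and $results[1] holds the text between them. Please run var_dump to check what's in this array.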
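
If you need this for more than one tag, the call can be wrapped in a function that takes the tag name as a parameter. This is just a sketch under my own assumptions: the name extract_tag_text is made up, and I added the "s" modifier so that the dot also matches line breaks, which the original pattern does not do.

```php
<?php
// Hypothetical helper: the function name is an assumption for illustration.
function extract_tag_text(string $source, string $tag): array
{
    // preg_quote() escapes regex metacharacters in the tag name;
    // "/" is passed as the delimiter so that it gets escaped too.
    $name = preg_quote($tag, '/');

    // The "s" modifier (an addition to the post's pattern) lets "." match
    // newlines, so text spanning several lines is captured as well.
    preg_match_all('/<' . $name . '>(.*?)<\/' . $name . '>/s', $source, $results);

    // $results[1] holds only the text between the opening and closing tags.
    return $results[1];
}
```

For example, extract_tag_text($source, 'title') would return the contents of every <title>...</title> pair found in $source.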

Preventing Bandwidth Leak With Correct Out URLs

May 21st, 2008 1 comment

It is a common practice to route outgoing links through your own site in order to track visitor activity. For example, your site is www.site.com, and external links look like www.site.com/out.php?url=http://anothersite.com . This is fine for counters and traffic tracking, but it can also be abused. In this example anyone can replace http://anothersite.com with any other site, and your site will redirect to it. Do you see what this can be used for?

Spamming, phishing and similar schemes often rely on such bugs. You may receive abuse complaints from anybody, because you cannot know what kind of sites will be promoted through your redirect or how it will be done.
Even giants like Google and ADRiver have such a traffic leak. I recently found e-mails in my mailbox with links to:
http://www.google.fr/pagead/SOME_PARAMS&adurl=SPAMMER’s URL and http://ad.doubleclick.net/SOME_PARAMS?SPAMMER’s URL .
How do you prevent such things? First of all, never use a construction like url=http:// and so on. Instead, assign a unique ID to each URL, store it in a database or text file, and build your outgoing URLs from these IDs. www.site.com/out.php?id=YOUR_ID is much better and will save you from this malicious activity.
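
The ID approach could look roughly like this. It is only a sketch with made-up names: $outLinks stands in for your database or text file, and resolve_out_url is a hypothetical helper. The point is that out.php redirects only to URLs you stored yourself and never to anything taken from the query string.

```php
<?php
// Hypothetical lookup table: in practice this would be loaded from
// a database or a text file. The ID "a1" is an example value.
$outLinks = [
    'a1' => 'http://anothersite.com',
];

// Hypothetical helper: returns the stored URL for a known ID, or null
// for anything else, so arbitrary targets can never be injected.
function resolve_out_url(string $id, array $links): ?string
{
    return $links[$id] ?? null;
}

// Sketch of how out.php might use it (left as comments on purpose):
// $target = resolve_out_url($_GET['id'] ?? '', $outLinks);
// if ($target === null) { http_response_code(404); exit; }
// header('Location: ' . $target);
// exit;
```

An unknown or tampered ID simply yields null, so the worst a spammer can do is trigger a 404 on your site.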

Be careful with standard scripts, as some of them contain this vulnerability. For example, Autorank Pro and some others may use such URL syntax. Have a nice day and build your URLs the correct way!