0

Given a URL-encoded URL like this:

http%3a%2f%2fwww.google.com%2fpagead%2fconversion%2f1001680686%2f%3flabel%3d4dahCKKczAYQrt7R3QM%26value%3d%26muid%3d_0RQqV8nf-ENh3b4qRJuXQ%26bundleid%3dcom.google.android.youtube%26appversion%3d5.10

I want to replace the

%3a%2f%2f

with

://

and get rid of all the content after ".com", so that in the end I just get

http://www.google.com

How can I implement this in Java using a regex?

msrd0
chrisTina
  • Java has a regex library you can use for this. Try looking at this article: http://www.vogella.com/tutorials/JavaRegularExpressions/article.html – ryekayo Nov 03 '14 at 18:26
  • URL decode the value first and then use URI/URL (http://docs.oracle.com/javase/6/docs/api/java/net/URI.html) to get scheme, host values to construct what you want. – srkavin Nov 03 '14 at 18:28

3 Answers

2

You can use:

String u = URLDecoder.decode(url, "UTF-8").replaceFirst("(\\.[^/]+).*$", "$1");
// http://www.google.com
anubhava
  • Your method works for most URLs, but I got one result like this: "http\://nfl.demdex.net/event?d_uuid=78914312359887357297063319424411977817&d_dpid=1327&d_dpuuid=2A05BC680507A5C5-60000108200532A8&d_ptfm=android&d_dst=1&d_rtbd=json". I think I should only get "http://nfl.demdex.net", right? What's the issue? Is there anything you can improve or modify? – chrisTina Nov 03 '14 at 18:58
  • Yes sure, try my answer now. – anubhava Nov 03 '14 at 18:59
  • This looks really good. Can you give some brief explanation about your answer? I will accept this answer after that, thanks. – chrisTina Nov 03 '14 at 19:09
  • Yes sure. This regex finds the first dot using `\\.` and then, using the negated character class `[^/]+`, it matches text until it hits a slash `/`. `(\\.[^/]+)` captures this in captured group #1 and `.*$` matches everything until the end of the URL. In the replacement part we just use a back reference `$1` for captured group #1, thus giving us `http://nfl.demdex.net` and discarding the rest. – anubhava Nov 03 '14 at 19:14
  • What if I get a URL like this: when decoded, it becomes: www.google.com:80/other part... I do not want the port information (:80) either. How could I change the regex to handle this? @anubhava – chrisTina Nov 04 '14 at 17:01
  • You can use: `String u = URLDecoder.decode(url, "UTF-8").replaceFirst("(\\.[^:/]+).*$", "$1");` – anubhava Nov 04 '14 at 17:13
  • Hi, can you take a look at this question? http://stackoverflow.com/questions/30404224/new-line-and-dollar-sign-in-java-regular-expression – chrisTina May 22 '15 at 19:18
  • Ok let me take a look. – anubhava May 22 '15 at 19:25
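The two `replaceFirst` variants discussed in the comments above can be compared side by side. This is only a minimal sketch: the class and method names (`ReplaceFirstDemo`, `stripPath`, `stripPathAndPort`) and the shortened sample inputs are made up for illustration, and it assumes Java 10 or later for the `URLDecoder.decode(String, Charset)` overload.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are not from the original answer.
class ReplaceFirstDemo {
    // Keeps the text up to the first '/' that follows the first dot.
    static String stripPath(String encoded) {
        String decoded = URLDecoder.decode(encoded, StandardCharsets.UTF_8);
        return decoded.replaceFirst("(\\.[^/]+).*$", "$1");
    }

    // Same idea, but the character class also stops at ':' so a port is dropped.
    static String stripPathAndPort(String encoded) {
        String decoded = URLDecoder.decode(encoded, StandardCharsets.UTF_8);
        return decoded.replaceFirst("(\\.[^:/]+).*$", "$1");
    }

    public static void main(String[] args) {
        // http://www.google.com
        System.out.println(stripPath("http%3a%2f%2fwww.google.com%2fpagead%2fconversion%2f"));
        // http://www.google.com:80 (the port survives the first regex)
        System.out.println(stripPath("http%3a%2f%2fwww.google.com%3a80%2fother"));
        // http://www.google.com
        System.out.println(stripPathAndPort("http%3a%2f%2fwww.google.com%3a80%2fother"));
    }
}
```

Note that both patterns anchor on the first dot in the string, so a host without a dot (for example `localhost`) would not be trimmed the way the host names above are.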
1

So you have a URL of this form after decoding it (e.g. with java.net.URLDecoder.decode()):

http://www.google.com/here/is/some/content

To get the protocol and the domain from the input, you can use a regex like this:

String input = URLDecoder.decode("http%3a%2f%2fwww.google.com%2fpagead%2fconversion%2f1001680686%2f%3flabel%3d4dahCKKczAYQrt7R3QM%26value%3d%26muid%3d_0RQqV8nf-ENh3b4qRJuXQ%26bundleid%3dcom.google.android.youtube%26appversion%3d5.10");
Matcher m = Pattern.compile("(http[s]?)://([^/]+)(/.*)?").matcher(input);
if (!m.matches()) return;
String protocol = m.group(1);
String domain   = m.group(2);
System.out.println(protocol + "://" + domain);

Explanation of the regex:

(http[s]?)://([^/]+)(/.*)?
|---1----|-2-|--3--|--4---|
  1. Matches the protocols http and https
  2. Matches the :// after the protocol
  3. Matches the domain name ([^/]+ is any non-empty string that doesn't contain a slash)
  4. Matches everything after the domain (must start with a slash)
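The regex explained above can be wrapped into a small self-contained program. The following is a sketch under a few assumptions: the class and method names (`SchemeAndHost`, `extract`) are invented here, UTF-8 is assumed as the encoding, and the `URLDecoder.decode(String, Charset)` overload requires Java 10 or later.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper class; the names are not from the original answer.
class SchemeAndHost {
    // Group 1: protocol, group 2: everything up to the first '/', group 3: optional rest.
    private static final Pattern URL_PATTERN =
            Pattern.compile("(https?)://([^/]+)(/.*)?");

    // Returns "protocol://domain", or an empty Optional if the input does not match.
    static Optional<String> extract(String encoded) {
        String decoded = URLDecoder.decode(encoded, StandardCharsets.UTF_8);
        Matcher m = URL_PATTERN.matcher(decoded);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(m.group(1) + "://" + m.group(2));
    }

    public static void main(String[] args) {
        String encoded = "http%3a%2f%2fwww.google.com%2fpagead%2fconversion%2f1001680686%2f";
        // Prints http://www.google.com
        System.out.println(extract(encoded).orElse("no match"));
    }
}
```

Because group 2 is just "anything up to the first slash", a port such as `:80` would stay in the result; the character class would need to exclude `:` as well to drop it.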
msrd0
  • I'm just leaving the note that you're using a deprecated version `URLDecoder.decode(String)` ... – Tom Nov 03 '14 at 18:45
  • @Tom I linked to the JavaDoc so I know this is deprecated. If you know which encoding should be used, feel free to add it – msrd0 Nov 03 '14 at 18:47
0

One way:

java.net.URI uri = new java.net.URI(java.net.URLDecoder.decode(url, "UTF-8"));

System.out.println( uri.getScheme() + "://" + uri.getHost() );
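For completeness, here is the same idea as a runnable sketch. The class and method names (`UriSchemeAndHost`, `extract`) are hypothetical, the null check is an addition not present in the snippet above, and the `URLDecoder.decode(String, Charset)` overload assumes Java 10 or later.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical wrapper class; the names are not from the original answer.
class UriSchemeAndHost {
    // Decodes the input, parses it as a URI and returns "scheme://host".
    static String extract(String encoded) {
        String decoded = URLDecoder.decode(encoded, StandardCharsets.UTF_8);
        try {
            URI uri = new URI(decoded);
            // getScheme() and getHost() return null when the component is undefined.
            if (uri.getScheme() == null || uri.getHost() == null) {
                throw new IllegalArgumentException("No scheme or host in: " + decoded);
            }
            return uri.getScheme() + "://" + uri.getHost();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Not a valid URI after decoding: " + decoded, e);
        }
    }

    public static void main(String[] args) {
        // Prints http://www.google.com (getHost() does not include the port)
        System.out.println(extract("http%3a%2f%2fwww.google.com%3a80%2fpagead%2f%3flabel%3dx"));
    }
}
```

One caveat with decoding the whole string first: if the decoded text contains characters that are illegal in a URI (a space, for instance), the `URI` constructor throws `URISyntaxException`, which this sketch rethrows as an `IllegalArgumentException`.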
Alex K.