Author
Message
hakimfullmetal
9
Years of Service
User Offline
Joined: 17th Feb 2015
Location:
Posted: 18th Dec 2016 23:28 Edited at: 18th Dec 2016 23:41
Hello everybody.

Does anybody know a good way to pull a URL out of a web page?
I am trying to parse a downloaded web page.
For now I'm only looking for URLs in the page that point to images.

For example, I am trying to find this URL in the downloaded webpage.
http://vignette2.wikia.nocookie.net/artonelico/images/6/61/At2-nenesha.jpg/revision/latest/scale-to-width-down/300?cb=20100623143410

I know that the URL will be preceded by
Quote: "<img src="

It should look like this in the downloaded web page:
Quote: "<img src="http://vignette2.wikia.nocookie.net/artonelico/images/6/61/At2-nenesha.jpg/revision/latest/scale-to-width-down/300?cb=20100623143410" "


I should be able to find the start of the URL by using the command
Quote: "FIND SUB STRING$(source, string)"

and searching for <img src="

My question is, how do I return only the URL from the whole jumble of text in a web page?
How do I return this URL
Quote: "http://vignette2.wikia.nocookie.net/artonelico/images/6/61/At2-nenesha.jpg/revision/latest/scale-to-width-down/300?cb=20100623143410"

from the attached downloaded web page?

I vaguely remember that we can search for the start and end of a string using some commands, but I can't remember which.
Does anybody know a good way to do this?

Chris Tate
DBPro Master
15
Years of Service
User Offline
Joined: 29th Aug 2008
Location: London, England
Posted: 19th Dec 2016 01:59 Edited at: 19th Dec 2016 01:59
If the webpage uses strict XHTML, you could parse the website as an XML document by using the XML plugin. The URL could then be returned with the XML GET ATTRIBUTE command after locating the img tag.

If it is not XHTML, you could convert the HTML into XHTML prior to parsing it as XML.

If not, and you have the Matrix1 plugin installed, you can use the Mid$ command to return the string that comes after src=" and before the closing quotation mark. The InStr command can be used to find the positions of the opening and closing quotation marks that follow the src= attribute.
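
A minimal sketch of that search-and-slice idea, written in Python rather than DBPro just to show the logic. Python's str.find stands in for InStr here (note it counts from 0 and returns -1 when nothing is found) and slicing stands in for Mid$. The function name find_image_urls and the sample page string are made up for illustration, and it assumes the attribute is written exactly as <img src=" with double quotes.

```python
def find_image_urls(html):
    """Return every URL that directly follows <img src=" in the given HTML text."""
    marker = '<img src="'
    urls = []
    pos = 0
    while True:
        start = html.find(marker, pos)   # position of the next marker, or -1
        if start == -1:
            break
        start += len(marker)             # first character of the URL
        end = html.find('"', start)      # closing quotation mark
        if end == -1:
            break                        # unterminated attribute, stop searching
        urls.append(html[start:end])
        pos = end + 1                    # continue after this URL
    return urls


# Hypothetical page fragment built around the URL from the first post.
page = ('<p>text</p><img src="http://vignette2.wikia.nocookie.net/artonelico/'
        'images/6/61/At2-nenesha.jpg/revision/latest/scale-to-width-down/300'
        '?cb=20100623143410" alt="">')
print(find_image_urls(page))
```

Real pages may put other attributes before src or use single quotes, so a search like this can miss some images.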
hakimfullmetal
9
Years of Service
User Offline
Joined: 17th Feb 2015
Location:
Posted: 19th Dec 2016 09:23 Edited at: 19th Dec 2016 09:30
Thank you, I remember the second method now lol.

Another question though. It's about HTML links.
For example, let's go to this page:
Quote: "http://artonelico.wikia.com/wiki/Nenesha"


On that web page I see several types of link. Some give the full link:
Quote: "href="http://neptunia.wikia.com/wiki/Hyperdimension_Neptunia_Wiki""


But others give only a 'partial' reference link:
Quote: "href="/wiki/Ar_tonelico:_Melody_of_Elemia_Official_Visual_Book""


My question is, if it's only a 'partial' link, how do we get the full address so that we can follow the link?
We are on the page http://artonelico.wikia.com/wiki/Nenesha
On that page, we see this 'partial' link:
Quote: "href="/wiki/Ar_tonelico:_Melody_of_Elemia_Official_Visual_Book""

How do we get the full URL for that link? Is the 'root' URL hidden somewhere in the jumble of HTML?
Chris Tate
DBPro Master
15
Years of Service
User Offline
Joined: 29th Aug 2008
Location: London, England
Posted: 19th Dec 2016 18:54 Edited at: 19th Dec 2016 18:55
URLs which begin with a forward slash (/) are resolved against the root of the current site, which turns them into absolute paths.

So in HTML linking terms, on the page you are viewing, '/wiki/Ar_tonelico:_Melody_of_Elemia_Official_Visual_Book' is seen as the same as 'artonelico.wikia.com/wiki/Ar_tonelico:_Melody_of_Elemia_Official_Visual_Book'.

Prepend the domain name of the current page to the links which begin with a forward slash.

But there are issues to consider when '../otherfolder/Ar_tonelico:_Melody_of_Elemia_Official_Visual_Book' or similar is used, where the ../ token stands for the parent folder. Here you would need to drop the current folder, step up to its parent, and then append the rest of the path without the ../ token.

../ links can be a security concern and are rejected on some servers, because they could allow access to private sibling or ancestor folders on a server.
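
For illustration, here is a small Python sketch of how both cases resolve. It uses urljoin from Python's standard urllib.parse module, which applies the usual URL resolution rules. The helper name resolve_link and the example.com address are hypothetical; the other URLs are the ones from this thread.

```python
from urllib.parse import urljoin


def resolve_link(page_url, href):
    """Turn an href found on page_url into an absolute URL."""
    return urljoin(page_url, href)


page_url = "http://artonelico.wikia.com/wiki/Nenesha"

# Link starting with /: the domain of the current page is prepended.
print(resolve_link(page_url, "/wiki/Ar_tonelico:_Melody_of_Elemia_Official_Visual_Book"))

# Full link: returned unchanged.
print(resolve_link(page_url, "http://neptunia.wikia.com/wiki/Hyperdimension_Neptunia_Wiki"))

# ../ link on a hypothetical page: the current folder is replaced by its parent.
print(resolve_link("http://example.com/folder/sub/index.html", "../otherfolder/page.html"))
```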
hakimfullmetal
9
Years of Service
User Offline
Joined: 17th Feb 2015
Location:
Posted: 19th Dec 2016 21:17
Is there any good way to obtain the domain name?
Like, does a web page routinely store its domain name somewhere?

For now all I can think of is to strip the full URL down to its bare domain name.
Or is there another good method?
Ortu
DBPro Master
16
Years of Service
User Offline
Joined: 21st Nov 2007
Location: Austin, TX
Posted: 19th Dec 2016 22:58
If the link starts with a /, it is a link to other content on the current site (it doesn't link to another, external site).

You should know the domain, because you sent an HTTP request to it to get this HTML text to begin with.


A single player RPG featuring a branching, player driven storyline of meaningful choices and multiple endings alongside challenging active combat and intelligent AI.
http://games.joshkirklin.com/sulium
Chris Tate
DBPro Master
15
Years of Service
User Offline
Joined: 29th Aug 2008
Location: London, England
Posted: 19th Dec 2016 23:01 Edited at: 19th Dec 2016 23:02
At some point you would have used an absolute web address; otherwise there is no way to get to any website on the WAN.

So the last absolute address used in your viewer will contain the domain name to use.

All links from then on refer to that domain name when a forward slash is prefixed to the URL, until a new domain name is visited.

Edit: As Ortu stated
hakimfullmetal
9
Years of Service
User Offline
Joined: 17th Feb 2015
Location:
Posted: 19th Dec 2016 23:58 Edited at: 20th Dec 2016 00:12
I probably would not know the domain name 'directly', because I accessed the web page using its complete URL with relative links.
For example:
Quote: "http://artonelico.wikia.com/wiki/Nenesha"

The domain name should be artonelico.wikia.com. But I wouldn't know that, because I began with a 'complete' URL http://artonelico.wikia.com/wiki/Nenesha

Maybe I should explain what I intend to do.

In the game, I want the player to input keywords so DBPro can return the search results from Google. I've tried it and it seems it can be done.
In the search results there will be URLs, descriptions and everything. I can isolate the link URLs because they all start with <a href="/url?url=https://www.
But the URLs returned from Google are full URLs, not just domain URLs. So I wouldn't know what the domain names are from the returned Google search page.
I would use these URLs to go directly to the suggested web pages.
But once I'm inside those web pages, I will encounter various links. Some are complete URLs with the http:// protocol, while others are only relative links such as "/wiki/Ar_tonelico:_Melody_of_Elemia_Official_Visual_Book"
I need the domain name to combine with the relative link, so I can jump to the page it points to.
But I don't know what the domain URL (artonelico.wikia.com) is, because I've only ever used (http://artonelico.wikia.com/wiki/Nenesha) since the beginning.

Or am I missing something? Is the domain name included somewhere in the Google search result, so that I can isolate it?
If not, then I guess I'll find a way to strip the full URL down to just the domain URL.

Here's the downloaded google search result, just in case.

Chris Tate
DBPro Master
15
Years of Service
User Offline
Joined: 29th Aug 2008
Location: London, England
Posted: 20th Dec 2016 01:57 Edited at: 20th Dec 2016 01:58
Quote: "I probably would not know the domain name 'directly', because I accessed the web page using its complete URL with relative links."


The first link or URL you used contains the domain name.
Quote: "I began with a 'complete' URL http://artonelico.wikia.com/wiki/Nenesha"


So http://artonelico.wikia.com/wiki/Nenesha is the first page you visit, and therefore the first domain name is available. From this point onward your browser is using the file 'folder' associated with that domain name on the server. It gets a little more complicated when we start talking about subdomains and TLDs, but to illustrate the point:

Example:

A google search for 'Egg':

shows this listing to a website: <a href="/url?sa=t&rct=j&q=&esrc=s&source=web&cd=4&cad=rja&uact=8&ved=0ahUKEwiYpfjN04HRAhXSdlAKHWv4BBAQFghWMAM&url=http%3A%2F%2Fwww.egglondon.co.uk%2F&usg=AFQjCNFp-vV-5AZ0j2BunVLdAkUvKQDLiA&sig2=UxP1e4CYEQIJ37RoibP_nA" onmousedown="return rwt(this,'','','','4','AFQjCNFp-vV-5AZ0j2BunVLdAkUvKQDLiA','UxP1e4CYEQIJ37RoibP_nA','0ahUKEwiYpfjN04HRAhXSdlAKHWv4BBAQFghWMAM','','',event)" data-href="http://www.egglondon.co.uk/">Egg London</a>

Quite an ugly looking URL, but we know that url=http:// is going to be followed by a domain name (in this listing the url= value is percent-encoded as http%3A%2F%2F, while data-href holds it in plain form),

so the opening token of "http://www.egglondon.co.uk/" is http:// (or it could have been https://) and the closing token is the first occurrence of / after it, which gives us www.egglondon.co.uk

the opening token of "http://artonelico.wikia.com/wiki/Nenesha" is http:// and the closing token is the first occurrence of / after it, which gives us 'artonelico.wikia.com'

The characters between the opening token (http://) and the closing token (the first occurrence of / after it) represent the domain name
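
The opening/closing token idea can be sketched roughly like this. It is Python rather than DBPro, only to show the steps; domain_of and google_target are hypothetical helper names. The sketch looks for :// so that http and https are both covered, and it uses parse_qs from Python's standard urllib.parse module to decode the percent-encoded url= value in the Google link above.

```python
from urllib.parse import urlsplit, parse_qs


def domain_of(url):
    """Return the text between '://' and the next '/', or '' if there is no '://'."""
    opening = "://"
    start = url.find(opening)
    if start == -1:
        return ""
    start += len(opening)            # first character after the opening token
    end = url.find("/", start)       # closing token: the next forward slash
    if end == -1:
        end = len(url)               # no path at all, the domain runs to the end
    return url[start:end]


def google_target(href):
    """Return the decoded url= value of a Google result link, or '' if absent."""
    values = parse_qs(urlsplit(href).query).get("url", [])
    return values[0] if values else ""


# The href value from the listing above.
href = ("/url?sa=t&rct=j&q=&esrc=s&source=web&cd=4&cad=rja&uact=8"
        "&ved=0ahUKEwiYpfjN04HRAhXSdlAKHWv4BBAQFghWMAM"
        "&url=http%3A%2F%2Fwww.egglondon.co.uk%2F"
        "&usg=AFQjCNFp-vV-5AZ0j2BunVLdAkUvKQDLiA&sig2=UxP1e4CYEQIJ37RoibP_nA")

print(google_target(href))
print(domain_of(google_target(href)))
print(domain_of("http://artonelico.wikia.com/wiki/Nenesha"))
```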
hakimfullmetal
9
Years of Service
User Offline
Joined: 17th Feb 2015
Location:
Posted: 20th Dec 2016 02:23
So the / is the key.
That seems very consistent.
Thank you. I'll try it out.
Ortu
DBPro Master
16
Years of Service
User Offline
Joined: 21st Nov 2007
Location: Austin, TX
Posted: 20th Dec 2016 03:20 Edited at: 20th Dec 2016 03:29
Yes, exactly as Chris says, you just want to extract everything between // and the next /

I would look specifically for // instead of http://, as you may also encounter https://

In detail, a URL has a specific format:

[Protocol]://[address]/[resource]

The address portion is resolved through DNS to find the server. It is composed of at least 2 parts, [domain].[topLevelDomain], as in mysite.com

The domain can be divided into 'subdomains' or hosts, [host1].[domain].[TLD], as in forum.thegamecreators.com

It can also include a port as part of the address, if the server is using a non-standard port for a given protocol: [domain].[TLD]:[port]/[resource], as in mysite.com:32400/index.html

Everything after the first / is pretty arbitrary. It identifies a resource on the server: it can refer to a web page, a file download, or an API endpoint. It's up to the server to evaluate what is being requested and how to serve that request.

The resource can also pass data with the request by using ?

[domain].[TLD]/[resource]?[variable]=[value] as mysite.com/search.php?name=bob

These are intended to be arguments to a GET request
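
To make the format concrete, here is a short Python sketch that splits a URL into those parts with urlsplit and parse_qs from the standard urllib.parse module. The describe helper and its dictionary keys are made up for illustration, and the sample address simply combines the port and query examples above into one URL.

```python
from urllib.parse import urlsplit, parse_qs


def describe(url):
    """Split a URL into the parts named above and return them as a dict."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "host": parts.hostname,              # address without the port
        "port": parts.port,                  # None when no explicit port is given
        "resource": parts.path,
        "arguments": parse_qs(parts.query),  # each name maps to a list of values
    }


print(describe("http://mysite.com:32400/search.php?name=bob"))
print(describe("https://forum.thegamecreators.com/thread/218443"))
```

Note that urlsplit only recognises the address when the protocol and // are present; a bare mysite.com/search.php would be treated as a path.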


hakimfullmetal
9
Years of Service
User Offline
Joined: 17th Feb 2015
Location:
Posted: 22nd Dec 2016 17:58
Thanks guys. This is the domain name stripper. Just input the full URL

Chris Tate
DBPro Master
15
Years of Service
User Offline
Joined: 29th Aug 2008
Location: London, England
Posted: 23rd Dec 2016 03:07 Edited at: 23rd Dec 2016 03:13
Not quite right there.

Quote: "domainResult$ = mid$( webPage$, 1, Finish-1 )"

The 1 indicates that it is not starting from your variable named Start. 'googleURL=http://www.secureserverspace.com/pagename.php?value=5' would end up returning googleURL=http://www.secureserverspace.com. Should be mid$( webPage$, Start, Finish-1 )

Quote: "INC Start,8"

http is not the only protocol; another common one is https://, which would need a 9 character increment. This forum uses https, as you can see in your browser address bar.

Quote: "ENDFUNCTION domainResult$"

domainResult$ is only assigned a value inside your IF block. If the IF check is false, domainResult$ would be undefined or, worse, an unintentional reference to a global variable. The result would be unpredictable unless you strictly declare it as a local variable. The same goes for all of your function variables if you want to avoid referring to a global of the same name. When you start writing tens of thousands of lines of code, duplicate names can appear quite often. I would recommend declaring local variables and using UDTs where possible, which will help reduce conflicts and reduce compile attempts over your overall development span.

Hope this saves you from encountering any issues
hakimfullmetal
9
Years of Service
User Offline
Joined: 17th Feb 2015
Location:
Posted: 23rd Dec 2016 12:59
Whoops, you're right. It doesn't return anything on a failed attempt.

Would this be OK?
If successful, it'll return the domain name
If not successful, but the input URL is valid, then it'll return the original URL (which might be a domain name already)
If not successful, and the input is not a URL at all, then it will output "No Valid Domain Found"
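
Those three outcomes could look roughly like the Python sketch below. This is not the DBPro function from the post, only one way to get the same behaviour; the name strip_to_domain is hypothetical. The result starts out as the failure message, so every path returns a defined value, and searching for :// avoids a fixed skip of 8 characters.

```python
def strip_to_domain(url):
    """Return protocol + domain of url, or "No Valid Domain Found"."""
    result = "No Valid Domain Found"     # default, used when url is not a URL
    marker = url.find("://")
    if marker > 0:                       # at least one protocol character before ://
        host_start = marker + 3
        slash = url.find("/", host_start)
        # With no further slash the input is already protocol + domain.
        host_end = slash if slash != -1 else len(url)
        if host_end > host_start:        # the domain part must not be empty
            result = url[:host_end]
    return result


print(strip_to_domain("https://forum.thegamecreators.com/thread/218443"))
print(strip_to_domain("https://forum.thegamecreators.com"))
print(strip_to_domain("hello world"))
```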



Regarding the inc 8 thing,
I made the function with the assumption that we already know what webpage we're in, so we can input the address in the function to get domain name. For example, we can input this
https://forum.thegamecreators.com/thread/218443
and get this
https://forum.thegamecreators.com

The inc 8 is just there so it skips the whole https:// or http:// part and searches for the next / that marks the end of the domain name. So it should work with both https:// and http://




Another thing though. I'm not good with HTML terminology.
You mention that you used googleURL=http://www.secureserverspace.com/pagename.php?value=5 as the input string.
The function would not work if we include that googleURL= bit at the start. I didn't make the function search for the Start position, because we assume all links fed to the function start with http.
My question is, is googleURL=http://www.secureserverspace.com/pagename.php?value=5 what people usually call a 'complete' URL?
https://forum.thegamecreators.com is the domain name.
Then what about https://forum.thegamecreators.com/thread/218443 . What do we call that kind of URL?
Or is googleURL=http://www.secureserverspace.com/pagename.php?value=5 what people usually call a URL?


Chris Tate
DBPro Master
15
Years of Service
User Offline
Joined: 29th Aug 2008
Location: London, England
Posted: 27th Dec 2016 17:32 Edited at: 27th Dec 2016 17:34
Declare the local variables like so [Local VariableName].

webPage$ is already declared as a local variable because it is a function parameter.



Quote: "You mention that you used googleURL=http://www.secureserverspace.com/pagename.php?value=5 as the input string.
The function would not work if we include that googleURL= bit at the start. I didn't make the function search for the Start position, because we assume all links fed to the function start with http.
My question is, is googleURL=http://www.secureserverspace.com/pagename.php?value=5 what people usually call a 'complete' URL?"


Some hyperlinks use scripting to describe the web address to be launched by some client or server side function. googleURL= is the parameter assignment, and its value is http://www.secureserverspace.com/pagename.php?value=5. If you encounter such hyperlinks in your code, you can use the Left$ function to check that the string does indeed start with http:, or use your Start position to bypass the unwanted prefix.

If none of the links you search for use scripting, then there is nothing to worry about.
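
As a rough Python sketch of both checks: startswith plays the part of the Left$ comparison, and find locates the Start position so the prefix can be skipped. The helper names starts_with_protocol and skip_prefix are made up for illustration.

```python
def starts_with_protocol(text):
    """True when text begins with http:// or https:// (the Left$ style check)."""
    return text.startswith(("http://", "https://"))


def skip_prefix(text):
    """Return text from the first http:// or https:// onward, or '' if neither occurs."""
    positions = [p for p in (text.find("http://"), text.find("https://")) if p != -1]
    return text[min(positions):] if positions else ""


link = "googleURL=http://www.secureserverspace.com/pagename.php?value=5"
print(starts_with_protocol(link))
print(skip_prefix(link))
```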

Quote: "Then what about https://forum.thegamecreators.com/thread/218443 . What do we call that kind of URL?"


This is an absolute URL. The number 218443 is probably a folder which automatically opens a default server side page file.

Quote: " googleURL=http://www.secureserverspace.com/pagename.php?value=5 is what people usually call a URL?"


And that is a URL parameter: something which assigns a value to a variable called googleURL. Notice that at the end there is a ?value=5. This is a GET request parameter, here read by a PHP script. pagename.php is the page name.
hakimfullmetal
9
Years of Service
User Offline
Joined: 17th Feb 2015
Location:
Posted: 20th Jan 2017 07:49
I've seen some links like this:
<a href="javascript:void(0);" class="like bookmark" id="bookmark">Bookmark</a>

My question is, what does the 'javascript' part refer to? What does it do?
I assume these links don't point to other web pages?
Ortu
DBPro Master
16
Years of Service
User Offline
Joined: 21st Nov 2007
Location: Austin, TX
Posted: 20th Jan 2017 16:13 Edited at: 20th Jan 2017 16:19
It effectively prevents the default behavior of a hyperlink (which is to navigate to a new page). Typically, this means they have custom handling for the click event of this link, which is doing something other than navigation, or is doing something before allowing the navigation to proceed.

It's not really a best practice, as you are better off using a span or other element that doesn't have default behavior that needs to be prevented. Any text element can be styled with CSS to look like a clickable link, and any event handler can trigger a page navigation.
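
If you are collecting links to follow, one way to skip these is to look at the protocol part of the href. A small Python sketch, assuming you only want relative links and http/https URLs; the name is_followable is hypothetical, and urlsplit comes from Python's standard urllib.parse module.

```python
from urllib.parse import urlsplit


def is_followable(href):
    """True for links that point to a page: relative links or http/https URLs."""
    href = href.strip()
    if not href or href.startswith("#"):
        return False                     # empty or same-page anchor
    scheme = urlsplit(href).scheme.lower()
    return scheme in ("", "http", "https")


print(is_followable("javascript:void(0);"))
print(is_followable("/wiki/Nenesha"))
print(is_followable("http://artonelico.wikia.com/wiki/Nenesha"))
```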
