DarkBASIC Professional Discussion / How to strip text bodies from HTML pages?

hakimfullmetal
9 Years of Service
Joined: 17th Feb 2015
Posted: 19th Dec 2016 23:39 Edited at: 19th Dec 2016 23:40
Hello guys.
Let me just say that I'm a zero at HTML.

I want to salvage text from various websites and display it in my game.
So I've downloaded several webpages.

But it seems that every webpage uses different tags to enclose the text body.
For example, the descriptive text on Wikipedia is wrapped in different tags than the text on www.thegamecreators.com.
Quote: "<meta name="twitter:description" content="Nenesha was Infel's friend and the 14th Maiden of Homura/Fuero. Four hundread years ago, she wanted to create Metafalica with Infel but their plan failed. Infel opened her Heart to everyone but..." />"

Quote: "<div class="description">New to AppGameKit? Fear not for there are people on-hand to help you out in here.</div>"


Is there a universal tag that I can use as a marker so I can strip out the text between the tags?
Just the descriptive text, or titles, that people can read.
I can't grasp the common patterns/tags that wrap the text, so I cannot use them as markers to strip out the text inside them.

The only one that I know of looks like this:
Quote: "<p>
You can use all the well known tags and CSS properties to format text, fonts and
colors.
</p>"


Are there any common tags that can help me identify text bodies for text stripping? Or rules that may simplify this?
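
To illustrate why the two quoted snippets need different handling: in the first, the readable text sits in the content attribute of a <meta> tag; in the second, it is the text inside a <div>. The following is only a minimal sketch in Python (used purely for illustration, it is not DBPro code), built on the standard library's html.parser module. The names DescriptionFinder and find_descriptions are hypothetical, and matching on "description" is an assumption that fits these two samples, not a universal rule.

```python
from html.parser import HTMLParser


class DescriptionFinder(HTMLParser):
    """Collects text from <meta name="...description" content="..."> tags
    and from <div class="description">...</div> elements."""

    def __init__(self):
        super().__init__()
        self.found = []       # descriptions in document order
        self._div_depth = 0   # > 0 while inside a description <div>
        self._buffer = []     # text gathered inside the current <div>

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").endswith("description"):
            # Case 1: the text lives in an attribute, not between tags.
            self.found.append(attr.get("content") or "")
        elif tag == "div":
            if self._div_depth:
                self._div_depth += 1  # nested <div> inside the description
            elif "description" in (attr.get("class") or "").split():
                self._div_depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self._div_depth:
            self._div_depth -= 1
            if self._div_depth == 0:
                # Case 2: the text lives between the opening and closing tag.
                self.found.append("".join(self._buffer).strip())
                self._buffer = []

    def handle_data(self, data):
        if self._div_depth:
            self._buffer.append(data)


def find_descriptions(html_text):
    finder = DescriptionFinder()
    finder.feed(html_text)
    finder.close()
    return finder.found
```

Feeding it a page that contains both kinds of markup would return both texts in one list, but any site that labels its text differently would still need its own rule.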
Chris Tate
DBPro Master
15 Years of Service
Joined: 29th Aug 2008
Location: London, England
Posted: 20th Dec 2016 01:46
There are so many HTML tags and scripts out there that it would be very tedious to write parsers to interpret them all.

For example, <p>, as you indicated, is a paragraph.
<h1> is heading 1, <h2> is heading 2
<b> is bold text (the old way)
<strong> marks important text (usually shown bold)
<td> is a table cell
<span> is a span of text (usually)
<div> is a division (a generic container)

All of these could contain text, but they are not all of the elements, and some text-based elements are nested inside non-text-based elements, so this is just the tip of the iceberg.

If you could use a WebBrowser control, a simple call to Control.InnerText would return a string containing all the text in the website; in an ideal world you would not need to write the HTML parser yourself.

If you for some reason wanted to interpret all the elements in HTML (bypassing non-text elements), then here is a list of their descriptions: http://www.w3schools.com/tags/default.asp
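
As a rough sketch of what "interpret all the elements, bypassing non-text elements" can look like when an existing parser does the tag recognition: the Python below (chosen only for illustration, it is not DBPro code) uses the standard library's html.parser module. The names TextExtractor and visible_text are hypothetical, and the SKIP and BLOCK tag sets are assumptions that cover common cases, not the full HTML element list.

```python
from html.parser import HTMLParser

# Assumption: elements whose contents are never readable text.
SKIP = {"script", "style"}
# Assumption: elements that should start a new line in the output.
BLOCK = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6",
         "td", "tr", "li", "br", "title"}


class TextExtractor(HTMLParser):
    """Keeps the text of every element except those listed in SKIP."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self._skip_depth += 1
        elif tag in BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP:
            if self._skip_depth:
                self._skip_depth -= 1
        elif tag in BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def visible_text(html_text):
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    # Collapse runs of whitespace and drop empty lines.
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)
```

Nested tags such as <b> inside <p> need no special handling here, because text is kept wherever it appears and only the skipped elements are excluded. The result still includes menus, footers, and other page furniture, since nothing in the markup says which text is the "body".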
hakimfullmetal
9 Years of Service
Joined: 17th Feb 2015
Posted: 20th Dec 2016 02:19
Quote: "If you could use a WebBrowser control, a simple call to Control.InnerText would return a string containing all the text in the website; in an ideal world you would not need to write the HTML parser yourself."


Err how do I use the WebBrowser control?
External DLL?

WickedX once mentioned some DLLs used by Internet Explorer. Is that it?
Quote: "urlmon.dll
browseui.dll
ieframe.dll
iertutil.dll
mshtml.dll
shdocvw.dll
urlmon.dll
wininet.dll"
Ortu
DBPro Master
16 Years of Service
Joined: 21st Nov 2007
Location: Austin, TX
Posted: 20th Dec 2016 03:47
Parsing arbitrary, unknown HTML is frankly a nightmare, and automating it in high volume is often against a site's terms of use.

Ideally you want to use a site or service that exposes an API that returns the data you are after directly in either JSON or XML. Those are consumable formats; HTML is a display format.

Of course this isn't always possible, but what you are trying to do will always be unreliable at best.
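
To show what "consumable" means in practice, here is a minimal Python sketch (Python is used only for illustration). The payload and its "title" and "description" field names are made up for this example and do not come from any real service; read_description is a hypothetical helper.

```python
import json

# Hypothetical API response; a real service defines its own field names.
payload = '{"title": "Example page", "description": "A short summary."}'


def read_description(raw):
    """Return (title, description) from a JSON document with those two keys."""
    data = json.loads(raw)
    return data["title"], data["description"]


title, description = read_description(payload)
```

The text arrives already labelled, so there is no guessing about which tag wraps it, which is the difference from scraping HTML.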


A single player RPG featuring a branching, player driven storyline of meaningful choices and multiple endings alongside challenging active combat and intelligent AI.
http://games.joshkirklin.com/sulium
Kevin Picone
21 Years of Service
Joined: 27th Aug 2002
Location: Australia
Posted: 20th Dec 2016 10:44 Edited at: 20th Dec 2016 10:46
Here's an example I wrote a while back.
Strip Html From String

PlayBASIC To HTML5/WEB - Convert PlayBASIC To Machine Code
hakimfullmetal
9 Years of Service
Joined: 17th Feb 2015
Posted: 20th Dec 2016 12:26 Edited at: 21st Dec 2016 00:44
Thank you Ortu. It's good to know that before I go on a wild goose chase.
I guess I'll really have to study HTML if I want to present the text in more elegant ways.

Kevin Picone, that code would be a lifesaver.
I tried searching for a PlayBASIC command listing, but the only ones I can find come without parameters, so I don't completely understand what the code does.
Do you have access to a PlayBASIC command listing with parameters?
Or is there any other way besides installing PlayBASIC?

EDIT: Nvm I was being lazy I guess. I downloaded PlayBasic.
hakimfullmetal
9 Years of Service
Joined: 17th Feb 2015
Posted: 21st Dec 2016 00:40 Edited at: 21st Dec 2016 03:02
Here's the HTML stripper originally made by Kevin Picone in PlayBASIC, edited into a DBPro-friendly format.
It also requires IanM's MatrixUtils plugin.
Just dumping it here in case anybody wants it later.



Replace webpage.txt in the line open to read 1, "webpage.txt" with the HTML file you want to strip, and put that file in your project folder.
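
For readers without DBPro, the general idea of a tag stripper can be sketched in a few lines of Python. This is not the DBPro or PlayBASIC code from this thread, only a simplified stand-in that drops everything between < and > and decodes entities; strip_tags and strip_file are hypothetical names, and the default file name webpage.txt is borrowed from the instruction above.

```python
import html


def strip_tags(source):
    """Remove everything between '<' and '>' and tidy the remaining text."""
    out = []
    in_tag = False
    for ch in source:
        if ch == "<":
            in_tag = True
        elif ch == ">" and in_tag:
            in_tag = False
            out.append(" ")  # keep words on either side of a tag apart
        elif not in_tag:
            out.append(ch)
    text = html.unescape("".join(out))  # turn &amp; into &, and so on
    return " ".join(text.split())       # collapse runs of whitespace


def strip_file(path="webpage.txt"):
    """Read an HTML file and return its text with the tags removed."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return strip_tags(handle.read())
```

This simple approach has known gaps: script and style contents are kept as if they were text, a tag in the middle of a word splits the word, and a < inside ordinary text is treated as the start of a tag.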
You can try the stripper with the sample file attached to the original post.
Chris Tate
DBPro Master
15 Years of Service
Joined: 29th Aug 2008
Location: London, England
Posted: 21st Dec 2016 02:48
Not bad for 321 lines of code aye!
