The resurrection of atari-wiki.com

Latest Atari related news.
exxos
Site Admin
Posts: 28377
Joined: 16 Aug 2017 23:19
Location: UK

Re: The resurrection of atari-wiki.com

Post by exxos »

I've noticed a lot of them are coming up as "internal error" because of "Malformed UTF-8 characters". If I edit the page and simply resubmit it, then it works correctly. So I think the spam bots were not always using legitimate characters that the wiki software accepts.
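
For anyone wanting to see what "malformed UTF-8" means in practice, here is a minimal sketch in Python. It is only an illustration and not the check the wiki software itself performs; the sample byte strings are made up.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical samples: clean text, and a 3-byte sequence that is cut short.
samples = {
    "clean": "Atari ST".encode("utf-8"),
    "broken": b"Atari \xe2\x82 ST",
}

for name, raw in samples.items():
    print(name, is_valid_utf8(raw))
```

Strict decoding raises on the first bad byte, so a page body can be screened this way without involving the wiki at all, assuming the raw bytes are available.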

I am working on the assumption that people who genuinely uploaded content to the wiki tested their pages afterwards, so those pages should all be correct.

I have written a script which will churn out a list of the links that come up with "Internal error". The problem is it would have to download all 2 million pages individually, one by one, to test them :roll:

So I will test a batch of something like 100 links and see how long it actually takes..
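
A rough sketch of that kind of scan in Python, for illustration only. The original script is not shown in this thread, so the function names, the case-insensitive marker match, and the handling of HTTP error responses are all assumptions.

```python
from typing import Callable, Iterable, List
from urllib.error import HTTPError
from urllib.request import urlopen

MARKER = "internal error"  # assumed marker text, compared case-insensitively


def fetch_page(url: str, timeout: float = 30.0) -> str:
    """Download a page and return its body as text (bad bytes are replaced)."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except HTTPError as err:
        # An error status (e.g. 500) still carries a body worth inspecting.
        return err.read().decode("utf-8", errors="replace")


def find_error_pages(
    urls: Iterable[str], fetch: Callable[[str], str] = fetch_page
) -> List[str]:
    """Return the URLs whose page body contains the marker text."""
    flagged = []
    for url in urls:
        if MARKER in fetch(url).lower():
            flagged.append(url)
    return flagged
```

Passing the fetch function in as a parameter means the scan logic can be tried against canned page bodies before pointing it at a live server.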

EDIT:

So out of my sample set of 1,030 links, 437 came back as "internal error", and the run took 2.6622693339984 minutes.

Let's see if I can work this out properly now..

2,200,000 links / 1,030 = about 2,136 batches
2,136 x 2.67 minutes = about 5,704 minutes
5,704 / 60 = about 95 hours, meh! :lol:

Need to find out how to speed that up really. Maybe run the script multiple times with different link files, and take the browser out of the equation so it all runs on the command line.
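
The same estimate as a small Python helper, so the later timings can be plugged in too. The function name is made up for this sketch, and it assumes the time per batch stays constant across the whole run.

```python
def estimate_hours(total_links: int, sample_size: int, sample_minutes: float) -> float:
    """Scale the time taken by one sample batch up to the full link list."""
    batches = total_links / sample_size
    return batches * sample_minutes / 60


# Figures from the post: 2,200,000 links, a 1,030-link sample, 2.67 minutes.
print(round(estimate_hours(2_200_000, 1030, 2.67), 1))  # 95.0
```

Swapping in 2.4 or 2.2 minutes for the last argument gives the equivalent figure for the faster runs.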

EDIT2:

2.4mins from the command line :lol: :roll:

EDIT3:

2.2mins without downloading the whole page.
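
A minimal sketch of checking only the start of a response, in Python. Whether the error text actually appears within the first 500 bytes is an assumption here, and the helper name is invented. The demonstration uses an in-memory stream rather than a real request.

```python
import io
from typing import BinaryIO


def head_contains_marker(
    stream: BinaryIO, marker: bytes = b"internal error", limit: int = 500
) -> bool:
    """Read at most `limit` bytes from the stream and look for the marker."""
    head = stream.read(limit)
    return marker in head.lower()


# Stand-in for a response body; the marker sits well inside the first 500 bytes.
fake_body = io.BytesIO(b"<html><title>Internal error</title>" + b"x" * 10_000)
print(head_contains_marker(fake_body))
```

With `urllib.request.urlopen(url)` the response object can be passed in as the stream. This only saves transfer time; the server still has to do whatever work it does to produce the page.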
exxos
Site Admin
Posts: 28377
Joined: 16 Aug 2017 23:19
Location: UK

Re: The resurrection of atari-wiki.com

Post by exxos »

Basically I'm running the script 36 times now, so theoretically, 36 times faster execution time :lol: I know the scripts are running in the background because they're in the htop list, though they do not actually output anything until they have scanned all their links.
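
One way to split the work into 36 lists, sketched in Python. This assumes the 36 runs correspond to page titles starting with 0-9 and A-Z; the function name and the "other" bucket for everything else are made up for the example.

```python
import string
from collections import defaultdict
from typing import Dict, Iterable, List

BUCKETS = string.digits + string.ascii_uppercase  # 10 digits + 26 letters


def bucket_titles(titles: Iterable[str]) -> Dict[str, List[str]]:
    """Group page titles by their first character, one bucket per script run."""
    buckets = defaultdict(list)
    for title in titles:
        first = title[:1].upper()
        # Empty titles and anything outside 0-9/A-Z go into a catch-all bucket.
        key = first if first and first in BUCKETS else "other"
        buckets[key].append(title)
    return dict(buckets)


# Hypothetical titles, just to show the grouping.
print(bucket_titles(["3D_Tools", "atari_st", "IDE_Drivers", "_misc"]))
```

Each bucket can then be written to its own link file and handed to a separate run of the scanning script.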

The current scans look for "internal error" messages on the wiki pages, as previously mentioned. I am taking that as a "100% safe" method of finding spam pages, as only the spam pages should be causing wiki errors. Based on my sample set of roughly 1,000, I guesstimated that about 50% would fail that way (437 of 1,030 is closer to 42%). But that could be way off considering there are over 2 million pages to check. Once I have that list, those pages will be removed from the wiki. Then I will work on the next stage of deleting spam.

Anyway, it'll be interesting to see how many it ultimately finds :ball:

EDIT:

Well, I think that failed. I think MySQL has died, couldn't take the heat :lol: :roll:

ah :roll:

Capture.PNG

Will have to keep an eye on how much memory it's actually using... sad times when 8GB isn't enough :lol: :roll:

I've done 0-9 then A-L and it's got 2GB free, but MySQL is maxing out the CPU and running a tad slow now. So I will leave it at that for now.

Capture.PNG
RIP AMD EPYC 4 CORE CPU
:2k2:
elterwater
Posts: 53
Joined: 14 Jan 2022 14:43

Re: The resurrection of atari-wiki.com

Post by elterwater »

Have you considered running this task through something like JMeter, to give yourself a way of working through the list concurrently, with blocks of URLs checked in parallel? That would cut down the processing time a lot. If the internal error also manifests itself when making a call to the MediaWiki API for a given page, that would speed up the processing as well, as you're only getting the JSON back rather than the full HTML page.
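
To make the idea concrete without JMeter, here is a hedged Python sketch using a thread pool. The API base URL is a placeholder, the `check` callable is hypothetical (it would fetch the URL and decide whether the page is broken), and whether the API call surfaces the same internal error is untested here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List
from urllib.parse import urlencode


def api_parse_url(api_base: str, title: str) -> str:
    """Build a MediaWiki API request that returns the parsed page as JSON."""
    query = urlencode(
        {"action": "parse", "page": title, "prop": "text", "format": "json"}
    )
    return f"{api_base}?{query}"


def check_concurrently(
    titles: Iterable[str], check: Callable[[str], bool], max_workers: int = 4
) -> List[str]:
    """Run `check` on every title in parallel and return the flagged titles."""
    titles = list(titles)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(check, titles))  # results keep input order
    return [title for title, bad in zip(titles, results) if bad]


# Placeholder endpoint, for illustration only.
print(api_parse_url("https://wiki.example/api.php", "IDE_&_SCSI_Drivers"))
```

The worker count is kept small and adjustable on purpose, since more parallel requests only help while the server behind them can keep up.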

If the rogue content has been generated using a series of bot accounts which still exist as users in the MediaWiki database, would it be easier to work out which content has been touched by those bot accounts and attack the problem that way?

Apologies if you've already been over this :D
exxos
Site Admin
Posts: 28377
Joined: 16 Aug 2017 23:19
Location: UK

Re: The resurrection of atari-wiki.com

Post by exxos »

Thanks for the suggestions, I've not heard of that program. I've run multiple scripts at once to do the work. But MySQL is terribly slow, so it's maxing out the CPU and causing the scripts to run a lot slower. So I'm not sure the parallel method is a good one now.

I limited the page fetch to 500 bytes, but the wiki would still generate the whole page. Though I don't think the page generation code is the bottleneck. It's the SQL database slowing it all down. With over 2 million pages I'm not really surprised. This is even all running on a premium server as well.

There could be more efficient ways of getting the job done. But it's the time to work it all out. The scripts should be done in a couple days. I could spend longer writing scripts to get the job done faster. I'm not a great programmer and I have RSI. So every letter I press hurts.

The user accounts method is, ironically, more complicated. There are very few tools for deleting stuff. Even if all the spam accounts are found, there's nothing to delete them and their related posts from the database. As is well known with MediaWiki, its tools are basically zero. It's a case of working out who's a genuine poster and who's not. It's not so simple.

@Icky was talking about something similar. But he found there are millions of accounts, along with case variations of actual proper users and admins. I think he was writing scripts for that method, but it's not a simple job.
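
As an illustration of the case-variation problem, here is a small Python sketch that groups usernames differing only by letter case. It is not @Icky's script, which isn't shown here; the function name and the sample usernames are invented.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def case_variant_groups(usernames: Iterable[str]) -> Dict[str, List[str]]:
    """Group usernames that are identical apart from upper/lower case."""
    groups = defaultdict(set)
    for name in usernames:
        groups[name.casefold()].add(name)
    # Only keep groups where more than one spelling exists.
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}


# Hypothetical account list.
print(case_variant_groups(["AtariFan", "Atarifan", "AtariFAN", "Falcon030"]))
```

This only surfaces the lookalike groups; deciding which spelling in each group is the genuine account still needs a human or some other signal.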

I'm doing it in stages to try not to lose any proper content. But it's a waiting game to see how many topics were throwing out errors. I hope it will at least find a million posts. Once those are deleted, I can look for any other common issues across the spam pages anyway.
elterwater
Posts: 53
Joined: 14 Jan 2022 14:43

Re: The resurrection of atari-wiki.com

Post by elterwater »

No worries :D Performance testing and optimisation is my day job so this is something I could possibly help with, but it sounds like you're quite a lot of the way there anyway. I would need a backup of the MySQL database to filter/clean/screw up and restore multiple times so not sure how feasible that would be.

The page generation and the MySQL database performance are closely tied together as bottlenecks, since the code relies heavily on database queries to build the different content on the page. Asking for 500 bytes isn't going to help here, as you've rightly pointed out, because the PHP still has to wander off and generate the page with all its queries first, before truncating and returning just the first 500 bytes to you. There's probably an opportunity to add extra indexing to some of the database tables to speed up some of these queries, or to update the offending content directly in the tables to bypass the page generation overhead.

I'm aware that I'm just a stranger spouting words at you over the internet (albeit an actual customer who has ordered from you :D) but if there is a way you can share a backup of the database for me to have a crack at in isolation from your own efforts, please let me know as I'd be happy to help.
exxos
Site Admin
Posts: 28377
Joined: 16 Aug 2017 23:19
Location: UK

Re: The resurrection of atari-wiki.com

Post by exxos »

So this is currently the dump from numbers 0-9, as in, any pages beginning with those numbers are extracted into the relevant files.

If people can please take a look through them and make sure there does not appear to be any Atari related content...

output_0.zip

Probably around 50,000 pages I guess so far.

A-L pages have been running for 22 hours so far. Even though this is probably incredibly slow, it sure beats going through them one by one manually :lol:

I'm not really sure how fast the processing is, but I think it is only doing something like one page per second overall :roll: Slow and steady wins the race :lol:

EDIT:

"I" has just finished. 10,156 topics!

output_I.zip

So some quick mathematical guesswork, and it appears that that one particular script is working at about 7 pages per minute, which is rather slow. But the thing is, it is processing 12 scripts at once, so another guess would be around 84 a minute overall. So I guess my one per second processing estimate was probably not far out.

It also probably suggests that running the script multiple times is not really helping much. Probably running something like 3-5 scripts maximum would be more economical.
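
The guesswork above as a quick Python calculation. It assumes the "I" script had been running for the full 22 hours and that all 12 scripts run at the same rate, neither of which is confirmed; the function name is made up.

```python
def pages_per_minute(pages: int, hours: float) -> float:
    """Average throughput of one script over its whole run."""
    return pages / (hours * 60)


per_script = pages_per_minute(10_156, 22)  # figures quoted for the "I" run
combined = per_script * 12                 # assumes 12 equally fast scripts

print(f"per script: {per_script:.1f} pages/min")
print(f"combined:   {combined:.1f} pages/min, {combined / 60:.2f} pages/sec")
```

Under those assumptions the numbers land in the same region as the rough figures in the post: several pages a minute per script, and somewhere above one page a second overall.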
chronicthehedgehog
Site sponsor
Posts: 383
Joined: 08 May 2022 18:11
Location: The Midlands

Re: The resurrection of atari-wiki.com

Post by chronicthehedgehog »

Saw this in Output_I:

IDE_&_SCSI_Drivers
exxos
Site Admin
Posts: 28377
Joined: 16 Aug 2017 23:19
Location: UK

Re: The resurrection of atari-wiki.com

Post by exxos »

chronicthehedgehog wrote: 19 Jul 2023 13:47
Saw this in Output_I:

IDE_&_SCSI_Drivers

Ah cool, thanks. That page must have been faulty.

EDIT: Yep.

Capture.PNG
2.PNG
chronicthehedgehog
Site sponsor
Posts: 383
Joined: 08 May 2022 18:11
Location: The Midlands

Re: The resurrection of atari-wiki.com

Post by chronicthehedgehog »

Maybe this from the first zip file:

output_3
31.75mm_1.25"_38.1mm_1.5"
exxos
Site Admin
Posts: 28377
Joined: 16 Aug 2017 23:19
Location: UK

Re: The resurrection of atari-wiki.com

Post by exxos »

chronicthehedgehog wrote: 19 Jul 2023 16:47
Maybe this from the first zip file:

output_3
31.75mm_1.25"_38.1mm_1.5"

Something about tubes :shrug:

Capture.PNG