Ignareo the Carillon, a web spider template of ultimate concurrency built for leprechauns. Carillons as the best web spiders; Long live the golden years of leprechauns!
To love another person is to see the face of God.
https://github.com/Hecate2/ISMLautovoter or https://github.com/Hecate2/Ignareo
Ultimate High-performance HTTP I/O originated for Chtholly Nota Seniorious, and for ISML, www.internationalsaimoe.com/voting.
Launches 100k (one hundred thousand) HTTP requests in < 0.7 seconds on a single 4GHz Ryzen 3600 core with 2×8G 3200MHz memory.
Python 3.6 √
Python 3.7 √ (recommended, for better SSL experience)
For users of Python 3.8 on Windows: you probably need to include the following code to avoid NotImplementedError.
import platform
if platform.system() == "Windows":
    import asyncio
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
Voters of all lands, UNITE!
Thanks to the increasing stars, I have decided to rewrite this README to be civil-programmer-oriented.
Feel free to raise Issues including "我永远喜欢珂朵莉", "I love Chtholly forever", "私はいつまでもクトリが好きです".
The code in this repository was initially written for voting in ISML. Consequently, everything was edited rapidly with simple tools such as IDLE and Notepad, and designed so that anyone (even non-programmers) could launch intensive firepower instantly with a double click on Windows. Therefore, the code may be poorly formatted, redundant and not quite to a programmer's taste. I apologize for that.
The structure of the original Ignareo is shown below, referring to the code in ./DestroyerIGN. Feel free to copy or rewrite any code to shape your own components.
gevent.spawn(a_function_with_socket_operations, param, param, parameters…)
Ignareo is still a bare core for now, but it is essentially different from, and cannot be replaced by, other libraries like grequests. I have been looking for ways to integrate further, but integration sacrifices transparency and simplicity. Therefore I will not package Ignareo as a series of APIs or as a scrapy-like engine.
Generally speaking, I'm sorry but you may have to read the source code of DestroyerIGN/IgnareoG.py, because reconstructing Ignareo for your own use requires an understanding of its structure.
This is the original "military" purpose for which I built Ignareo. Civil users do not need to actually run the codes, but may follow this as an example.
In ./DestroyerIGN, start captchaServer.py, and then IgnareoG.py. Finally, provide Destroyer Ignareo with ammunition (proxy IPs) by starting Ammunition.py.
For new hands: Note that you should tell Ammunition.py where to get the proxies and at what frequency. You may find tens of thousands of free proxies or purchase millions from the Internet. Search Ammunition.py and you will find tuples like ('http://localhost:55556/',1),. Replace the default tuples with your own URLs and time intervals (in seconds) for fetching proxies. Ammunition.py visits your URL, extracts every XXX.XXX.XXX.XXX:XXXXX from the webpage, and sends the extracted proxies on to Ignareo.
IGN does not open fire until you run Ammunition.py at last!
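The proxy-collection step described above can be sketched as follows. This is a minimal illustration, not the actual code of Ammunition.py: the names PROXY_SOURCES, PROXY_PATTERN and extract_proxies, as well as the regular expression, are assumptions made for this example. Only the (URL, interval) tuple shape and the XXX.XXX.XXX.XXX:XXXXX address format come from the text above.

```python
import re

# Hypothetical source list in the (URL, interval-in-seconds) shape described above.
PROXY_SOURCES = [
    ('http://localhost:55556/', 1),
]

# Assumed pattern for "XXX.XXX.XXX.XXX:XXXXX" (IPv4 address, colon, port).
PROXY_PATTERN = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}:\d{1,5}\b')


def extract_proxies(page_text):
    """Return every ip:port string found in the page, in order, without duplicates."""
    seen = set()
    proxies = []
    for match in PROXY_PATTERN.findall(page_text):
        if match not in seen:
            seen.add(match)
            proxies.append(match)
    return proxies
```

In a real producer, each URL in the source list would be fetched once per interval and the resulting list would be POSTed to the Ignareo server.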
You can also try IgnareoA, which is built on asyncio, but IgnareoA is not fully reliable when thousands of concurrent connections have to be handled in a single process, especially on Windows, because the number of files that asyncio can open is limited. On the other hand, IgnareoA uses slightly less memory and CPU than IgnareoG.
IgnareoMT, implemented with threads, may also be a choice, but its performance is significantly lower. IgnareoMT also leaks memory in the long run; you have to restart your computer to reclaim it.
The order in which the three engines were built is:
[earliest] A → MT → G [latest]
This part guides you through casting a simulated machine vote with HTTP requests. Assume that you are a real human voter who visits https://www.internationalsaimoe.com/voting , and you are given a long list of characters to select from. After you select the characters, you fill in the CAPTCHA. You must spend at least 90 (or 120, 180, 190) seconds before you are allowed to submit your vote. To prevent anyone from casting multiple votes in a match, only one vote is allowed per IP address.
During the process stated above, your browser sends HTTP requests to the ISML server. Now we are to simulate your browser's HTTP operations with Ignareo. So, from the perspective of HTTP requests, what happens in the process described above?
Your browser sends an HTTP GET request to http://www.internationalsaimoe.com/voting , and the server responds with an HTML page in which all the characters are listed. This can be simulated with
r = requests.get('http://www.internationalsaimoe.com/voting')
r.text is the HTML you want. Note that the code given in this chapter is just a conceptual example, not the best practice in real voting.
Besides, in order to distinguish "who you are", a "voting_token" is given in the returned HTML to identify you. If your IP address has already cast a vote, you will not be given the token.
Now we have to wait for 90 (or 120, 180, 190) seconds before we submit the vote. For now we are just to record the time when we acquired the HTML response.
You may search IgnareoG.py for the corresponding reference.
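As a rough illustration of this step, the sketch below pulls a token out of the HTML and records when the page was fetched. The markup matched by TOKEN_PATTERN is purely hypothetical (the real page may embed voting_token differently), and parse_voting_page and seconds_left are names invented for this example, not functions from IgnareoG.py.

```python
import re
import time

# Hypothetical markup, e.g.  var voting_token = "abc123";
TOKEN_PATTERN = re.compile(r'voting_token["\']?\s*[:=]\s*["\']([0-9A-Za-z]+)["\']')


def parse_voting_page(html):
    """Return (voting_token, fetch_time); the token is None if the page has none."""
    match = TOKEN_PATTERN.search(html)
    token = match.group(1) if match else None
    return token, time.monotonic()


def seconds_left(fetch_time, required_wait=90):
    """How long we still have to wait before the vote may be submitted."""
    elapsed = time.monotonic() - fetch_time
    return max(0.0, required_wait - elapsed)
```

A missing token is worth checking explicitly, since the text above says an IP address that has already voted receives none.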
Your browser POSTs a canvas fingerprint to the server. This is an MD5 string which is almost unique to each computer and browser, and is used to prevent multiple votes from the same device.
We simply generate a random fingerprint and post it.
See def PostFingerprint(self): in IgnareoG.py.
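Generating such a random fingerprint could look like the sketch below. It is an assumption about one reasonable way to do it, not the code of PostFingerprint: the function name random_fingerprint is hypothetical, and the value is simply 16 random bytes rendered as 32 hexadecimal characters, which has the same shape as an MD5 digest.

```python
import secrets


def random_fingerprint():
    """Return a random 32-character lowercase hex string shaped like an MD5 digest."""
    return secrets.token_hex(16)
```

Each simulated voter would call this once and reuse the value for the rest of its session.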
We are to recognize the letters and numbers in the captcha image. We can get the image with
r = requests.get('https://www.internationalsaimoe.com/captcha/%s/%s' % (self.voting_token, int(time.time() * 1000)))
With r.content as the image, we POST this image to our captcha server (captchaServer.py), which recognizes it. My methods cannot ensure that every character in the image is recognized (because character detection is implemented with traditional computer vision techniques rather than deep learning), so we may need to download multiple captcha images from the server.
See def AIDeCaptcha(self): in IgnareoG.py for this part.
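The "download another image until one is recognized" idea can be sketched as below. This is not the code of AIDeCaptcha. The captcha URL follows the request shown above, while solve_captcha, max_attempts and the two callables download and recognize are hypothetical stand-ins for the HTTP GET and for the POST to captchaServer.py.

```python
import time


def captcha_url(voting_token):
    """Build the captcha URL as in the request above (millisecond timestamp at the end)."""
    return 'https://www.internationalsaimoe.com/captcha/%s/%s' % (
        voting_token, int(time.time() * 1000))


def solve_captcha(voting_token, download, recognize, max_attempts=5):
    """Fetch captcha images until one is recognized or the attempts run out.

    download(url) -> bytes and recognize(image) -> str or None are supplied
    by the caller; both are assumptions made for this sketch.
    """
    for _ in range(max_attempts):
        image = download(captcha_url(voting_token))
        answer = recognize(image)
        if answer:
            return answer
    return None  # every attempt failed
```

Passing the two operations in as callables keeps the retry logic free of any particular HTTP library.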
Now we are to POST your selected characters, along with the voting_token and the captcha recognition result.
Here I am not going to explain how to generate the data to be posted. It was just some simple but tedious work implemented in charaSelector.py, which defines which characters to vote for and with what probability. Civil users may ignore these details.
Military voters can debug charaSelector.py using the sample webpage (.htm file). Make sure you understand the structure of the webpage and of my code. You should always edit your charaSelector.py and check it very carefully for each match. Note that you are never guaranteed to win even if you cast billions of correct votes, because ISML operators select the winner mostly according to their own preference.
See def Submit(self): in IgnareoG.py.
Through an ordinary manual vote, you can see a record after you submit your vote.
r = session.get('http://www.internationalsaimoe.com/voting')
You should GET the webpage with your cookies. See the official documentation of requests to learn about session. An individual session is used for every simulated voter in Ignareo.
Now you can save the record r.text for fun, or save billions of records and drop them at your opponents' campsite as a military menace (lol).
The whole process described above is run by def Vote(self): in IgnareoG. You only need to write the logic for casting a single vote, without having to care about concurrency; the architecture of Ignareo handles large numbers of concurrent HTTP I/O tasks for you.
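The order of the steps walked through in this chapter can be summarized with a skeleton like the one below. It is only a sketch: PostFingerprint, AIDeCaptcha, Submit and Vote are the method names mentioned above, whereas the class name VoterSketch and the methods EnterVotingPage and SaveRecord are hypothetical, and every body merely records that the step ran instead of doing any network I/O.

```python
class VoterSketch:
    """Hypothetical skeleton of one simulated voter; no real requests are sent."""

    def __init__(self):
        self.log = []

    def EnterVotingPage(self):
        # Hypothetical name: GET /voting, keep the voting_token and the fetch time.
        self.log.append('EnterVotingPage')

    def PostFingerprint(self):
        # POST a random MD5-shaped fingerprint.
        self.log.append('PostFingerprint')

    def AIDeCaptcha(self):
        # Download captcha images and have the captcha server recognize one.
        self.log.append('AIDeCaptcha')

    def Submit(self):
        # After the mandatory wait, POST the selection, token and captcha answer.
        self.log.append('Submit')

    def SaveRecord(self):
        # Hypothetical name: GET /voting again with the same session to read the record.
        self.log.append('SaveRecord')

    def Vote(self):
        """Run the steps for a single vote, written as plain sequential logic."""
        self.EnterVotingPage()
        self.PostFingerprint()
        self.AIDeCaptcha()
        self.Submit()
        self.SaveRecord()
        return self.log
```

The point of the layout is that Vote reads like ordinary sequential code; concurrency comes from running many such voters at once, not from anything inside the class.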
Ignareo does lack convenient features, but can make use of most ready-made wheels safely.
Retrying middleware implemented as decorators:
./DestroyerIGN/retryapi.py serves as an example of how to implement middleware with decorators. Apply such a decorator to the function you want retried.
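A retry decorator in this spirit might look like the following. It is a generic sketch and not a copy of retryapi.py: the decorator name retry and its parameters times, delay and exceptions are assumptions, and flaky_request is a made-up function that fails twice before succeeding.

```python
import functools
import time


def retry(times=3, delay=0.0, exceptions=(Exception,)):
    """Call the wrapped function again when it raises one of `exceptions`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, times + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == times:
                        raise  # out of attempts: let the caller see the error
                    time.sleep(delay)
        return wrapper
    return decorator


attempts = {'count': 0}


@retry(times=3, exceptions=(ConnectionError,))
def flaky_request():
    """Simulated request that fails twice, then succeeds."""
    attempts['count'] += 1
    if attempts['count'] < 3:
        raise ConnectionError('simulated network failure')
    return 'ok'
```

Because the retry policy lives in the decorator, the voting logic itself stays free of try/except loops.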
In IgnareoG.py, to keep any single captcha server from being overwhelmed, Ignareo posts each captcha image to a different captcha server. Load balancing is usually implemented on the server side, but in my code it is done purely on the client side.
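Client-side load balancing of this kind can be as simple as rotating through a list of servers. The sketch below assumes a round-robin policy, which may differ from what IgnareoG.py actually does; the addresses and the names captcha_servers and next_captcha_server are hypothetical.

```python
import itertools

# Hypothetical addresses; in practice this would be your own list of captcha servers.
captcha_servers = [
    'http://localhost:8001/',
    'http://localhost:8002/',
    'http://localhost:8003/',
]

_server_cycle = itertools.cycle(captcha_servers)


def next_captcha_server():
    """Return servers in round-robin order so consecutive images go to different servers."""
    return next(_server_cycle)
```

Picking a server at random would serve the same purpose; round-robin just makes the spread perfectly even.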
The whole voting system has a broker architecture. Three aspects (obtaining IP addresses, voting and captcha recognition; you may have heard of aspect-oriented programming, AOP) are distributed across three types of nodes (Ammunition.py, IgnareoA/G.py and captchaServer.py). The broker architecture has proved to be a simple but effective, and thus popular, pattern. You may refer to a well-known open standard called Common Object Request Broker Architecture (CORBA), which has provided many guides for creating a standardized application. ~~Well, I did not read those guides at all when I built Ignareo.~~
IgnareoA/G is an asynchronous HTTP server which listens for POSTs from Ammunition.py. These POSTs carry IP addresses which are used as proxies in voting. The event loop in the Ignareo server is also used for sending asynchronous HTTP requests to ISML.
You can certainly run multiple processes of Ignareo by changing portList in Ammunition.py, IgnareoA/G.py and captchaServer.py. You may regard Ignareo and its captcha servers as elastic microservices instead of a single heavy spider application. Control your task flow with
You should tell the client where the servers are. That is, you should change the value of captchaServers in IgnareoG.py if you launch more captcha servers, and change portList in Ammunition.py if you deploy more Ignareo processes.
Note that you can write all kinds of blocking code in IgnareoG: gevent turns blocking socket operations and time.sleep() into non-blocking pauses. That means you get high performance automatically. But time-consuming non-socket code (computation or long-running HDD I/O) should be transferred to other processes.
To control the network I/O process of IgnareoA, write your own codes in Voter.py.
To change which characters to vote for, modify charaSelector.py.
The trendy architecture of Ignareo has been fully tested in real combats. Just trust her as your reliable partner!
IgnareoG uses gevent, which makes socket asynchronous. It means you can feed gevent with multi-threaded web spider code (typically written with requests) and enjoy asynchronous performance. In principle you can even connect to databases asynchronously. The event loop of gevent on different platforms is documented at http://www.gevent.org/loop_impls.html. According to the page, Windows users have libuv, which is likely to outperform Linux thanks to IOCP.
IgnareoA uses the classical Python library asyncio. The code in IgnareoA has to be literally asynchronous.
To summarize, you can just use Ignareo with low-level APIs provided by asyncio or gevent. The event loop is running forever in the server.
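To give a feeling for what "literally asynchronous" code looks like with asyncio, here is a small self-contained sketch. It is not taken from IgnareoA: simulated_voter, run_voters and main are hypothetical names, and asyncio.sleep stands in for both the network I/O and the mandatory wait before submitting. The loop is created explicitly so that the sketch also runs on Python 3.6.

```python
import asyncio
import time


async def simulated_voter(voter_id, wait_seconds):
    """One voter: awaiting yields to the event loop instead of blocking it."""
    await asyncio.sleep(wait_seconds)  # stand-in for I/O and the wait before submitting
    return voter_id


async def run_voters(count, wait_seconds):
    """Run all voters concurrently on the same event loop."""
    return await asyncio.gather(
        *(simulated_voter(i, wait_seconds) for i in range(count)))


def main(count=1000, wait_seconds=0.1):
    """Return (results, elapsed_seconds) for `count` concurrent simulated voters."""
    loop = asyncio.new_event_loop()
    try:
        started = time.monotonic()
        results = loop.run_until_complete(run_voters(count, wait_seconds))
        return list(results), time.monotonic() - started
    finally:
        loop.close()
```

All the waits overlap, so the total time stays close to a single wait rather than growing with the number of voters.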
You may also give httpx a try. I have not implemented such a version, so help yourself and happy coding!
Referring to IGN, your web spider can be fabricated conceptually with 3 cascades:
Every later cascade is an http server for the previous one.
The first cascade is a task producer like Ammunition.py, which collects proxies from the Internet.
In the early stages of developing the spider we may, by instinct, ask each worker thread to obtain one proxy for its own use. In our practice this has proved inefficient when large numbers of worker threads with complicated logic are launched. Workers should be launched passively, in response to generated tasks, and proxies are better loaded actively by an external process. Otherwise there can be heavy coupling and a performance bottleneck.
Sometimes you may need multiple types of information to start a task (e.g. a cookie and a proxy). In such cases I suggest building separate task producers for different types of necessary information.
The main network I/O cascade should execute the tasks as Ignareo does. If multiple types of information are necessary for a task, you may need an integrated task queue or even a lightweight database in Ignareo. Classical message queues like RocketMQ and RabbitMQ, or even Redis, are also possible alternatives, but they do make the whole system heavier. Since you are running HTTP-oriented tasks, you can get everything done with HTTP.
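An integrated task queue that waits for one item of each kind before emitting a task could be sketched as below. This is an illustration under assumptions, not part of Ignareo: it uses in-process asyncio queues rather than HTTP, the names pair_tasks, demo and run_demo are hypothetical, and the proxy and cookie values are made-up samples.

```python
import asyncio


async def pair_tasks(proxy_queue, cookie_queue, task_queue, count):
    """Wait for one proxy and one cookie, then emit a complete task; repeat `count` times."""
    for _ in range(count):
        proxy = await proxy_queue.get()
        cookie = await cookie_queue.get()
        await task_queue.put({'proxy': proxy, 'cookie': cookie})


async def demo():
    proxy_queue = asyncio.Queue()
    cookie_queue = asyncio.Queue()
    task_queue = asyncio.Queue()
    # Two independent "producers"; here they only enqueue sample values.
    for proxy in ('1.2.3.4:8080', '5.6.7.8:3128'):
        proxy_queue.put_nowait(proxy)
    for cookie in ('session=a', 'session=b'):
        cookie_queue.put_nowait(cookie)
    await pair_tasks(proxy_queue, cookie_queue, task_queue, 2)
    return [task_queue.get_nowait() for _ in range(2)]


def run_demo():
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(demo())
    finally:
        loop.close()
```

Keeping one queue per kind of information lets each producer run at its own pace, while the worker only ever sees complete tasks.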
Computation-intensive tasks should be extracted from the main I/O engine. This is to avoid running out of memory and single-core CPU capacity too quickly. Running out of CPU in a single Python network I/O process may cause bunches of unhandled HTTP responses to pile up. That is also one reason why I use 10 processes of IgnareoG to decrease the average load of each process.
Actually I once assembled a web spider system named Valgulious for ISML, which had a captcha recognition system in each spider process. This proved to be overwhelmingly heavy for most personal computers in combat. In the later version named Seniorious I extracted the captcha module as a discrete service, which marked the first giant leap in building the ISML auto voter as microservices.
This 3-cascade paradigm provides an example involving all kinds of HTTP I/O and computation. All the processes can be easily distributed on different machines.
Hopefully Ignareo serves just as a concept of high performance HTTP I/O engine, rather than a heavy framework. I tried to make no decision for you, except for performance and convenience of transplanting your other web spiders.
I had already developed an auto voter in 2018, using Scrapy. Scrapy is truly a great framework.
The first reason why I abandoned Scrapy is that it can cause critical trouble when you need a non-blocking pause between two requests. It seems you have to set the pause before receiving the response to the first request (https://stackoverflow.com/questions/36984696/scrapy-non-blocking-pause). Consequently it becomes difficult to control the interval between requests.
Secondly, Scrapy probably does not keep its connection to the server alive. It starts another TCP connection with a new SYN for every new request. This characteristic not only leads to redundant SYN traffic, but also makes it easier for the website to detect the spider. For example, Scrapy may start a new TCP connection just to send a POST request. Real browsers never do so!
Last but not least, the code for a Scrapy spider, crammed into a single class and run in a single process, can be chaotic and frustrating to understand and maintain.
It's very difficult to predict or control the detailed behavior of Scrapy, because it is such a great framework, making most decisions for you, hiding most of its source codes. This is good in many cases, but when you execute your personalized demands and decisions, you have to fight the framework.
Destroyer Ignareo, as a reinvented wheel, is re-designed for the future. With only hundreds of lines of codes in the core, she allows you to define your own work flow, and understand everything about her. Now you can focus more on data parsing and dependency.
Screaming high concurrency! Light weight Convolutional Neural Network! Against captcha within 0.06 seconds per image on CPU! Easy for distributed deployment!
The structure of IGN can in principle be applied to any saimoe voting and even more (I'm using IgnareoA to monitor IoT devices). Using IGN for other purposes is also welcome. Feel free to raise Issues including "我永远喜欢珂朵莉", "I love Chtholly forever", "私はいつまでもクトリが好きです", and even more! (Twisted Chthollists, please do not post two thousand of them overnight...)
However, BE CAREFUL in case your operations may result in a CC attack! (I've killed an SQL service imprudently with IGN...)
By reading documents of many (possibly) useful libraries!
June 7, 2020: No warranty for ISML 2020. No possibility for darkest horses to win.
Apr. 13th, 2020: Long time no see!
To developers who want to integrate more services (e.g. selenium browser cluster) with Ignareo:
Ignareo can be "Cloud Native". Docker-compose and Kubernetes may help you manage millions of machines running billions of services.
Mar. 21st, 2020:
An image library of Sukasuka/Sukamoka:
Oct. 12th, 2019: Ignareo.py will be renamed as IgnareoA.py. The name Ignareo now refers to the whole series of programs (the whole project)
Sep. 13th, 2019: >asyncio.ensurefuture(self.post('httpc://chtholly.68',data=r'祝妖精仓库中秋快乐！'.encode('月饼')))
July 19th, 2019: [Strongly Recommended] Welcome to the novel (under GPL-3.0 license) at /DestroyerIGN/CINT the Space Fleet Hecate2
For more information: log.md
First, please allow me to ascribe the reason dogmatically to spammers using pristine selenium wildly.
By running a miniature version of stress test involving a mixture of selenium, multi-thread spamming programs and IGN, I would like to give the following suggestion:
STOP USING IGN when ISML is really slow. If you persist in reloading IGN with ammunition very quickly, it can become extremely difficult for everyone to submit a vote: neither bots nor humans will be able to submit. As a result, ISML would receive very few votes for hours, until someone quits.
Well, selenium and multi-threaded web spiders have a limited number of "workers" (the word may refer to processes, threads or coroutines), but the number of workers in IGN is almost unlimited. Jobs launched by workers but not yet answered by the server are usually kept in the program, waiting. Putting pressure on a server with a limited number of workers can hardly lead to a Denial of Service, because no new spamming attempts are started while all the workers are waiting. So most of the workers receive their responses sooner or later. But IGN does not care about slow responses and keeps launching more attempts. With the passage of time (perhaps within only a few minutes in actual battles), IGN is likely to own the vast majority of unanswered requests, which will be far more than ISML can handle. The growth of unanswered requests, however, feeds back positively into generating even more unhandled spamming attempts, since IGN does not care about the unanswered ones! Only timeouts and proxy failures can stop IGN from holding more requests.
And do not assume yourself as the undisputed winner when you take up all the hardware resources of ISML with your ancient spamming programs. You can't exclude IGN at all, but you rile everyone.
As the saying goes: Victorque has hundreds of thousands of bilibili accounts.
Before you challenge a rich supporting group, you should always consider this likely fact: though IGN can support 2500 to 5000 voters on your computer, your opponent has hundreds of machines, each of which can run 25 to 50 selenium browsers. Meanwhile, you cannot afford the cost of proxies.
Remember that: Nobody is with almighty justice. Nobody sees the end of war until death.
Again: Thank you very much for fighting against all the bad sides in all kinds of saimoes!
ISMLnextGen, which contains some prototypes and basic code blocks, is the lab for the development of IGN.
First of all, please allow me to extend my sincerest gratitude to
Chtholly Nota Seniorious,
Tiat Sheba Ignareo,
Ithea Myse Valgulious,
Nephren Ruq Insania,
who always charge my will to conquer all the difficulties.
Thanks to Lilya Aspray the Legal Brave, and all the leprechauns (including all that live in the past or future) from SukaSuka/SukaMoka series.
Thanks to all humans that support Ignareo, including but not limited to the contributors of all the magnificent open codes utilized by Ignareo.
Thanks to my supporting group, which invested huge funds to give birth to the previous generations of auto voter programs. Thanks for their trust in me and their cultivation of me.
Thanks to ISML, as well as other saimoe platforms, which give me the opportunity and an arena to practice web spiders and other technologies. Perhaps saimoe tournaments in the future should give up deciding who is the most moe character, and focus more on training programmers.
Thanks to some of my opponents in saimoe, who developed brilliant programs that inspire me and spur me on.
Thanks to all the Chthollists who love Chtholly and SukaSuka.