Scrapinghub · Sep 13th 2018
About the job:
At Scrapinghub, we are developing a next-generation platform for automatic crawling and extraction, combining state-of-the-art machine learning with microservices to scale it up.
The platform will be used directly by our customers via an API, as well as by ourselves for internal projects. So far, our extraction capabilities include automated product and article extraction from single pages; we plan to expand this to whole-domain extraction and to support more page types, such as jobs and news. The service is still in the early stages of development and is serving its first customers.
Our platform has several components communicating via Apache Kafka. Most components are written in Python, with a few implemented in Scala using Kafka Streams. The current priorities are improving the reliability and scalability of the system, integrating with other Scrapinghub services, and adding new features such as auto-scaling. This is going to be a challenging journey for any good backend engineer!
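The worker pattern this paragraph describes — components consuming messages from one Kafka topic, processing them, and producing results to another — can be sketched in a few lines. This is a minimal illustration only: it uses in-memory queues as stand-ins for Kafka topics, and every name in it is hypothetical rather than part of the platform's actual API.

```python
import queue

# In-memory queues stand in for Kafka topics here; a real worker would
# use a Kafka client library. All names are illustrative.
raw_pages = queue.Queue()   # upstream topic: fetched pages
extracted = queue.Queue()   # downstream topic: extraction results

def extract_product(page: dict) -> dict:
    """Hypothetical extraction step; the real system applies ML models."""
    return {"url": page["url"], "title": page["html"].strip()}

def run_worker(in_q: queue.Queue, out_q: queue.Queue) -> None:
    """Consume each message, process it, and emit the result downstream."""
    while not in_q.empty():
        msg = in_q.get()
        out_q.put(extract_product(msg))
        in_q.task_done()

raw_pages.put({"url": "https://example.com/p/1", "html": " Example Product "})
run_worker(raw_pages, extracted)
result = extracted.get()
print(result)  # {'url': 'https://example.com/p/1', 'title': 'Example Product'}
```

Decoupling stages through topics like this is what lets individual components be scaled, restarted, or rewritten (e.g. in Scala with Kafka Streams) independently.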
Apply and join an excellent team of engineers and data scientists, including one of the world’s top-ranked Kaggle masters!
Due to business requirements, the successful candidate must be based in Ireland for the duration of the project; therefore, only candidates based in Ireland, or based in the EU and willing to relocate to Ireland, will be considered.
Job responsibilities:
Design and implementation of a large-scale web crawling and extraction service,
Solution architecture for large-scale crawling and data extraction: design, hardware and development effort estimation, writing proposal drafts, and explaining and justifying the solution to customers,
Implementation and troubleshooting of Apache Kafka applications: workers, hardware estimation, performance tuning, and debugging,
Interaction with data science engineers and customers,
Writing careful code for critical, production environments, communicating well, and continuously learning.
Requirements:
Experience building at least one large-scale data processing system or high-load service, and an understanding of the CPU and memory demands of a given piece of code,
Good knowledge of Python,
Experience with any distributed messaging system (RabbitMQ, Kafka, ZeroMQ, etc.),
Basic knowledge of Docker containers,
Good communication skills in English,
An understanding of the different ways a problem can be solved, and the ability to choose wisely between a quick hotfix, a long-term solution, and a design change.
Bonus points for:
Kafka Streams and microservices based on Apache Kafka, including an understanding of Kafka message delivery semantics and how to achieve them in practice,
HBase: data model, selecting access patterns, maintenance processes,
An understanding of how the web works: research on link structure, major components of link graphs,
A background in algorithms and data structures,
Experience with web data processing tasks: web crawling, finding similar items, mining data streams, link analysis, etc.
Experience with microservices,
Experience with the JVM,
Open source activity.
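On the delivery-semantics bonus point above: at-least-once processing typically means committing a consumer offset only after the message has been fully processed, so a crash between processing and commit causes reprocessing rather than data loss. Below is a minimal sketch of that idea, using a plain list as a stand-in for a Kafka partition; all names are illustrative, not from any real client API.

```python
# Illustrative sketch of at-least-once semantics. A plain list stands in
# for a Kafka partition; a real consumer would use a Kafka client library.
log = ["msg-0", "msg-1", "msg-2"]  # stand-in partition
committed_offset = 0               # last committed consumer position
processed = []

def process(msg: str) -> None:
    processed.append(msg.upper())

# At-least-once: process first, commit only afterwards. If the worker
# crashes between process() and the commit, the message is re-delivered
# on restart -- duplicates are possible, but loss is not.
while committed_offset < len(log):
    msg = log[committed_offset]
    process(msg)
    committed_offset += 1  # the "commit" happens after processing

print(processed)  # ['MSG-0', 'MSG-1', 'MSG-2']
```

Committing before processing would flip this to at-most-once (loss possible, no duplicates); exactly-once requires making the processing and the commit atomic, which is what Kafka's transactional features and Kafka Streams provide.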