Answers by Aleksandr Blekh on Quora

Recent Answers Written by Aleksandr Blekh on Quora

How can I run front-end unit tests in a CI/CD pipeline if the unit tests require a browser like Chrome or Mozilla to execute?

I’m far from being a TDD purist, and not even a big fan of TDD. However, depending on the organizational and other context (industry, scale of the company, specifics of the application, etc.), automated UI testing should IMHO at least be considered and, if found feasible, implemented.

Now back to your direct question. I’m not an expert in the topic by any means, but… Ever heard of front-end (and related) testing frameworks? There are Selenium, Cucumber (some people do not recommend using these two together), PhantomJS, CasperJS and likely some more interesting open source software projects. Embedding relevant scripts into a CI/CD pipeline should be relatively easy (says the author, just planning to dive into the front-end testing ocean… :-).

See Questions On Quora
Posted: December 20, 2016, 9:12 am

Is it true that tech degrees like information systems become obsolete 10 or 15 years down the road?

Firstly, Information Systems is not a purely tech degree - it is an interdisciplinary degree that studies socio-technical systems and intersects with such disciplines as management science. Secondly, nobody can accurately and consistently predict the future. Thirdly, even assuming the same rapid pace of innovation in machine learning (ML) and artificial intelligence (AI) and corresponding potential negative impact on labor markets, it is highly unlikely that 10–15 years would be enough to replace all human talent, working in the IT industry or any other industry, for that matter.

Having said that, some positions where people perform relatively routine operations (data entry, cleaning, conversion, integration, etc., even some data science work) will likely be replaced by ML/AI systems. Similarly, humans whose jobs involve driving vehicles along highly predefined routes (big trucks, postal and food delivery, warehouses, etc.) might be replaced by self-driving vehicles (cars, drones, etc.), though it is not clear how soon this will happen on a scale that significantly affects the relevant job markets. Keep in mind that all those ML/AI systems need to be architected, designed, implemented, produced, serviced, improved, etc.

Thus, for most of those highly intellectual tasks, people will still likely dominate as the main force of labor and innovation, at least within the 10–15 year time horizon. However, the number of professionals needed for the rapidly growing industry of smart machines will likely be lower (or much lower) than current trends suggest. So, in my opinion, the answer to your question is likely NO (“tech degrees are unlikely to become obsolete in 10–15 years”), with the above-mentioned reservations.

Posted: December 19, 2016, 1:04 am

Is it legal to develop an open-source clone of a proprietary software?

If by “clone” you mean exact (or very similar) functionality, then, AFAIK, the answer is “Yes, but…”. The “but” part refers to various precautions that one has to take in order to stay out of legal trouble. For example, I would think one needs to explore patents that are clearly or potentially related to the functionality in question, as well as corresponding trademarks and other potential elements of risk. For more details and an assessment of a particular situation, I would advise consulting a legal professional specializing in IP law (some of them are affiliated with, or work pro bono for, various open source-related non-profit organizations, so such a consultation could be free of charge). Hope this helps.

Posted: December 10, 2016, 12:04 am

How can one run a persistent R session on a server?

You would most likely want to use either GNU Screen or the more modern and advanced tmux or Byobu (which is a wrapper around tmux or Screen). See some details in this nice answer, "Best practice submit long-running R-jobs, retrieve later?", as well as in the relevant documentation for Screen and Byobu. Some additional interesting information and details are available in this article: "5 Ways to Keep Remote SSH Sessions and Processes Running After Disconnection". Hope this helps.
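As a rough sketch of the tmux route, the following Python helpers only build the command lines for starting R in a detached tmux session and for reattaching to it later. The session name `rwork` and the helper names are hypothetical, chosen for illustration; `new-session -d -s` and `attach-session -t` are standard tmux subcommands and options.

```python
import shlex


def tmux_start(session, command="R"):
    """Return the argv that starts `command` in a detached tmux session."""
    # -d: do not attach to the new session, -s: name of the session
    return ["tmux", "new-session", "-d", "-s", session, command]


def tmux_attach(session):
    """Return the argv that reattaches to an existing tmux session."""
    return ["tmux", "attach-session", "-t", session]


def as_shell(argv):
    """Render an argv list as a single, safely quoted shell command line."""
    return shlex.join(argv)
```

On the server, one would run `as_shell(tmux_start("rwork"))` once, disconnect, and later run the `tmux_attach("rwork")` command from a new SSH login; the R process keeps running in between.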

Posted: August 16, 2016, 9:37 am

Is there a commercial or open source package that lets one integrate Siri-like NLP into an enterprise application?

There are many open source software (OSS) packages, and even more commercial packages / services, focused on the AI-based intelligent personal assistants (IPA) domain and market (see the list of selected projects and products below). Having said that, I would be very careful when considering integrating immature / inactive OSS projects (obvious risks) or small commercial products / services (vendor lock-in, shutdown via M&A, and financial risks). Anyway, enjoy the list. I hope that my answer is helpful.

OSS Packages

Non-OSS Free Products / Services

  • Wit (service/APIs/framework; non-traditional business model or lack thereof)

Non-OSS Commercial Products / Services

  • (IMHO nice offering, though I generally prefer OSS; comprehensive; small free tier)
  • Viv (focused on manufacturers and product/service vendors)

Posted: August 14, 2016, 7:10 pm

How long does it take to learn R programming?

It is impossible to answer this question directly. Firstly, one needs to define what level of mastery is implied by “learn R programming”. It is all relative. For almost any modern comprehensive programming language, there is no particular point in time when one can say that she / he has learned the language. There is almost no limit. One can only say that they know a particular language well enough to perform certain tasks. This is doubly true for R due to its extreme flexibility in allowing one to achieve a particular goal through various methods.

Secondly, the time to achieve a certain level of mastery of R, another programming language or any other topic, for that matter, is extremely individual and depends on a variety of factors, including the level of prior exposure to computer programming, the person’s traits, their enthusiasm and dedication toward mastering the language, time spent (on both reading and practicing) and much more.

Posted: July 11, 2016, 8:48 am

How is success measured in the GitHub community?

Well, on a per-project basis, the obvious metrics of success in a GitHub community are two-fold: interest/use metrics and development metrics. The former include the number of forks, stars and watches for the project, while the latter - the number of contributors, releases, commits. However, while these metrics might be enough to get a ballpark understanding of the health of a project, they paint an incomplete picture. There are many other metrics that reflect various aspects of FLOSS success, such as technical (an average speed of defect fixing, …), community (level of activity of discussions within development team, structure of the team, governance maturity, …) and ecosystem (interactions with other FLOSS projects, such as code reuse).
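To make the interest/use metrics a bit more concrete, here is a minimal sketch that extracts them from a repository object as returned by the GitHub REST API's single-repository endpoint. It assumes the JSON has already been fetched and parsed into a dict; the function name is hypothetical. Note that in that payload `subscribers_count` is the number of watchers, while `watchers_count` mirrors the star count. The development metrics (contributors, releases, commits) live behind separate endpoints and are not covered here.

```python
def interest_metrics(repo):
    """Extract interest/use counters from a parsed GitHub repository object.

    `repo` is assumed to be the dict parsed from the single-repository
    API response; missing fields default to 0.
    """
    return {
        "stars": repo.get("stargazers_count", 0),
        "forks": repo.get("forks_count", 0),
        # "subscribers_count" is the number of users watching the repository
        "watchers": repo.get("subscribers_count", 0),
    }
```

For example, calling it on a payload with made-up numbers such as `{"stargazers_count": 120, "forks_count": 15, "subscribers_count": 9}` yields a small dict that can be tracked over time or compared across projects.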

Posted: July 9, 2016, 12:17 am

What tools are good for drawing neural network architecture diagrams?

For 2D diagrams like the first one, you can easily use one of the diagramming packages, either general (cross-platform) ones, like Graphviz, or ones focused on your favorite programming or markup language. For example, for the R language, the usual suspects would be the CRAN packages diagram and DiagrammeR, among many others. For Python, NetworkX is likely a good package to explore. Bokeh is also very nice, though it might be overkill for this task. For JavaScript and, thus, the Web, D3.js is pretty much a de facto standard.
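As an illustration of the Graphviz option, the sketch below generates DOT source for a simple fully connected network from a list of layer sizes. The function name and the node naming scheme (`L<layer>_<unit>`) are my own assumptions; the output is plain DOT text that Graphviz can render.

```python
def network_to_dot(layer_sizes):
    """Return Graphviz DOT source for a fully connected feed-forward network."""
    lines = ["digraph network {", "  rankdir=LR;", "  node [shape=circle];"]
    # Declare one node per unit, named L<layer>_<unit>
    for layer, size in enumerate(layer_sizes):
        for unit in range(size):
            lines.append(f"  L{layer}_{unit};")
    # Connect every unit to every unit of the next layer
    for layer in range(len(layer_sizes) - 1):
        for src in range(layer_sizes[layer]):
            for dst in range(layer_sizes[layer + 1]):
                lines.append(f"  L{layer}_{src} -> L{layer + 1}_{dst};")
    lines.append("}")
    return "\n".join(lines)
```

Saving the result of `network_to_dot([3, 4, 2])` to a file such as `network.dot` and running `dot -Tpng network.dot -o network.png` should produce a left-to-right diagram with three layers.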

For more complex 3D diagrams, like the second and third in your question’s details, and, of course, for 2D as well, I would suggest using LaTeX and one or more of relevant packages. Specifically, the marvelous TikZ comes to mind (see examples: Diagram of an artificial neural network, Drawing a Neural Network architecture and Fully connected network diagram), along with the nice smartdiagram.

Posted: May 25, 2016, 6:43 am

How complex are Microsoft's Cortana, IBM's Watson and Amazon's Alexa?

I don't know the exact details about the complexity of each of those platforms. However, I would argue that it might not make much sense to compare them in the first place, since they have different goals, use case scenarios, target user profiles, etc.

Having said that, I'm pretty sure that Watson is the most sophisticated platform of them, simply due to the fact that it is designed to be the most general AI (please do not confuse this term with artificial general intelligence / AGI!) software with a wide range of potential use cases in a multitude of domains. Cortana and Alexa, on the other hand, are consumer-focused products, which suggests that they are likely much more limited in features and, thus, complexity (when compared with Watson).

Posted: May 10, 2016, 2:50 am

What is the most efficient approach to recommender systems?

In addition to Abhinav Maurya's valid point on the need to include proper criteria in such questions' formulation, there are other aspects that make it impossible to provide a comprehensive answer within Quora's limitations on an answer's size.

Basically, the main problem is that success of any information system depends on the context of the system's use. There are various types of recommender systems (with major ones being collaborative, content-based and hybrid), scenarios of use for each type of system as well as extremely diverse user population and personas with their own requirements for the system's accuracy, performance, usability and much more.

Considering the fact that all those major factors have a multitude of potential values, it becomes clear that their combination represents a very large feature space (as data scientists, we should use the right terminology, shouldn't we? :-). So, while the task of finding the most efficient (however it is defined for a particular case) approach to building a recommender system is certainly doable, it is clearly not a trivial optimization task and, more importantly, is highly dependent on the input and other parameters or, in other words, the context. Therefore, even considering the need to somehow limit the practically infinite feature space (for example, by using dimensionality reduction), there seems to be no general solution to the problem.

Posted: May 8, 2016, 4:35 am

What are some good examples of patent trolling?

The following is another story on the subject that has been published very recently (which, actually, I have just read today). It is a story by Gil Elbaz about his company Factual and their dramatic fight with the company called Locata. Please read it here: Beating Back the Patent Trolls — NewCo Shift.

While I liked (and upvoted) the story, I was curious enough to visit Locata's website and relevant Wikipedia article, before labeling them a Non-Practicing Entity (aka "patent troll"). Perhaps, I'm mistaken, but the company (at least, the main one, Locata Corporation) doesn't seem like an NPE to me, considering documented deployments of their technology. I'm not making any conclusions, since there is not enough information on the topic, but I felt that it's important to "hear the other side".

Posted: April 28, 2016, 1:18 am

What is the second generation of neural networks?

There exist various classifications of artificial neural networks (ANNs), based on approaches used, their architectures and other characteristics. The development of ANNs across these dimensions along the time scale is quite difficult to specify due to different granularity of various advancements in the field. Having said that, one of the most comprehensive treatments of the topic in this context is IMHO the work by Professor Jürgen Schmidhuber (found outside of Quora: Juergen Schmidhuber's home page), in particular, his outstanding review paper Deep Learning in Neural Networks: An Overview (currently at 4th revision, last updated in mid-2014).

As for determining and defining the currently most relevant / trendy topic in neural networks, I would argue that it is so-called Large Deep Neural Networks. According to Ilya Sutskever (just a guess on the profile :-), this term should be preferred to "neural networks" and "deep learning" when describing modern ANN research and practice. For the rationale and more details on the term and the topic, I refer readers to his excellent guest blog post A Brief Overview of Deep Learning.

My mentioning of review articles on ANNs in the context of their classification would be largely incomplete without mentioning the following excellent resources:

Depending on the perspective, one can argue that the fourth (or whatever the number is for the next period, based on a particular classification) generation of ANNs is artificial general intelligence (AGI). However, from this perspective, ANNs are just a mechanism for achieving AGI. Alternatively, from a purely ANN-bound perspective, the fourth / next generation will likely refer either to neural networks based on combining current approaches with some novel architectures or math methods, or to networks that achieve extremely high levels of accuracy regardless of their nature.

P.S. I'm not an expert in the field, so take my thoughts with a grain of salt.

Posted: April 4, 2016, 3:48 am

What are good PhD research topics in machine learning in 2016?

I cannot tell about the whole machine learning (ML) field, as it is quite large, but, perhaps, you might become interested in arguably one of the most important topics in ML: kernel methods. With that in mind, take a look at Aleksandr Blekh's answer to Statistical Theory: What are some interesting theoretical results relating to Kernel Density Estimation? I hope that the information and references within will help you.

Posted: April 3, 2016, 6:36 am

How do you calculate Life Time Value of a customer for a monthly subscription model and no long term contract?

As you likely know, LTV depends on customer lifetime (CLT). There are various approaches to assessing CLT. The simplest is to measure your customer churn rate C per period (here, per month), then CLT = 1/C periods. However, some people think that it's an oversimplification:
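As a rough illustration of the simple CLT = 1/C estimate, here is a minimal sketch. The LTV formula used (average monthly revenue per customer, times gross margin, times expected lifetime in months) is a common simplification that I am assuming here, not something taken from the posts referenced in this answer; the function and parameter names are hypothetical.

```python
def customer_lifetime(monthly_churn):
    """Expected customer lifetime in months, using CLT = 1 / C."""
    if not 0 < monthly_churn <= 1:
        raise ValueError("monthly_churn must be in the interval (0, 1]")
    return 1.0 / monthly_churn


def lifetime_value(monthly_revenue, monthly_churn, gross_margin=1.0):
    """Simplified LTV: monthly revenue x gross margin x expected lifetime.

    Assumes constant churn and constant revenue per customer.
    """
    return monthly_revenue * gross_margin * customer_lifetime(monthly_churn)
```

For instance, a $30/month subscription with 5% monthly churn and an 80% gross margin gives an expected lifetime of about 20 months and an LTV of roughly $480. Because the estimate assumes constant churn, it tends to be least reliable for young products whose churn is still changing.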

As for LTV, see this post: The following Quora discussion also has some nice answers:

Finally, do not obsess with all those metrics. If you don't believe me, maybe you will believe this smart VC:

Posted: April 1, 2016, 11:00 pm

Are there any open source projects to which a project manager could donate services?

Absolutely. The following presents a couple of suggestions:

Posted: March 30, 2016, 4:04 am

I am 48 years old. Can I pursue a career in Data Analysis?

Just 48? Have a look at the video below and you will have the answer to your question. :-)

Posted: March 27, 2016, 2:00 pm

Should I start as Data analyst or Software Engineer to become a Data scientist?

The answer largely depends on your specific situation. You have shared some details, but they're not enough to give the optimal suggestion. Nevertheless, I will try my best.

Firstly, I think that there are multiple routes to becoming a data scientist, even an aspiring one. Since data science is a highly interdisciplinary field, combining bodies of knowledge from various domains (Figure 1), one can approach data science from the domain they are most familiar with, while picking up other knowledge and skills on the way. In your case, it seems that you're relatively comfortable with software engineering and know some statistics and data analytics. Thus, you should focus your work on improving your stats/analytics knowledge and skills, while keeping your software engineering ones up-to-date. As Josh Wills famously said, "a data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician".

Figure 1. One of more recent adaptations of the original data science Venn diagram. Source: Data Science Venn Diagram v2.0.

Secondly, based on the above-mentioned arguments, you have several options, which you should prioritize, according to your specific situation. These options include:

  • If you have the financial ability not to work for a period of time and are seriously interested in becoming a data scientist, you can consider enrolling in one of the widely available structured programs. They range from relevant MOOCs (tracks with or without a certificate) and data science courses / bootcamps / schools to certificate programs and full-scale master's programs in data science from major universities, e.g., Data Science Certificate and Master of Science in CSE, respectively (as you see, names of the programs vary from containing "data science" to being programs in CSE, analytics, data analytics, business analytics and even statistics).

    There are also several interesting data science fellowship programs (essentially, more prestigious bootcamps) that I would recommend taking a look at. Two sister programs, Insight Data Science Fellows Program and Insight Data Engineering Fellows Program, are highly regarded by top technology companies. The core difference between them is that the former is focused on the machine learning aspect of data science and accepts only Ph.D. graduates, whereas the latter is focused on the data engineering (software development) aspect of data science and doesn't have this limitation (it seems that you are a better fit for the data engineering program, especially assuming that you're not post-doctoral). Another interesting and popular program is The Data Incubator, which offers both local (NYC / Washington, DC / SFBA) and remote (online) attendance options. Unlike many other bootcamps, these fellowships are absolutely free (minus living expenses for local options), as their costs are covered by potential employers interested in hiring top talent in the field.
  • If you don't have the financial ability not to work for some time or have family or other responsibilities, I would suggest that you use an alternative path to becoming a data scientist. I would call this path organic in comparison to the focused or intensive one described above. By organic path I mean finding a software engineering job and combining working on the job with self-study along several dimensions that you arguably need to improve upon / master (mathematics, statistics, etc.). This is the path I've chosen to follow, based on my life circumstances (even though I don't necessarily have a definitive goal to become a data scientist, but want to acquire relevant knowledge and skills for various reasons). There are tons of offline and online resources for mastering data science domains through self-study. For example, see my Quora answer on resources for data visualization: Aleksandr Blekh's answer to What are the best resources for learning data visualization? My other relevant answers include Aleksandr Blekh's answer to What are the best data science masters programs? and this meta-answer on Stack Exchange.

Hope this helps. Good luck!

Posted: March 27, 2016, 10:44 am

What is the biggest contribution to the open-source community that a large company has made?

I agree with Marcas Neal that Google is likely the biggest (or, at least, certainly one of the biggest) contributors to the open source community, mostly based on GSoC and Android. More specifically, based on estimates by the Open Hub portal, the Android codebase is currently valued at more than $230M: Estimated Cost Page. Unfortunately, the Android contribution, while likely the largest in financial value, is far from free of negative effects on the open source ecosystem due to Google’s iron grip on Android: Controlling open source by any means necessary.

Having said that, for fairness, several other major corporate contributors to open source need to be mentioned (along with a very long list of companies that have contributed and continue to contribute in a more modest, by size, way). Specifically, I would emphasize the role of IBM and Facebook. The former donated in 2001 their now very popular Eclipse platform, valued at about $40M (IBM Donates $40 Million of Software to Open Source Community; Estimated Cost Page) and is an active contributor to a large ecosystem of Apache open source projects. The latter open sourced their HipHop Virtual Machine for PHP (HHVM), which is currently valued at about $45M: HipHop Virtual Machine for PHP.

Posted: March 27, 2016, 4:00 am

What's the best way to remember to stop your AWS instances when you're not using them?

I am aware of the following approaches to automating instances in AWS:

Posted: March 20, 2016, 10:15 pm

Should I go to Oxford University for Philosophy, Politics and Economics, or to Georgia Tech for Chemical and Biomolecular Engineering?

If you are equally passionate about both areas of study (or, rather, large domains), then I would suggest choosing the domain that is more stable and/or attractive career-wise (which depends on many factors, including where you want to live and what you want to work on after graduation). Having said that, I find it quite difficult to believe that you are equally passionate about two such different domains.

I understand that it is not easy to precisely find one's own perfect career. And being interested in many, often very different, domains is IMHO absolutely normal, and I, myself, am guilty of that as well (even now, after being through a significant portion of my career).

Nevertheless, your phrase about being "flexible on long-term career goals" shows that you basically don't know what you want to do career-wise. And this is OK. However, I think that it is much better to make a decision as significant as choosing your domain of study and, potentially, work after some more comprehensive thinking about, and experience of, both fields. What I would recommend, if your situation allows, of course, is to take some time (perhaps a year or two) to immerse yourself in the respective domains (through work or other work-related opportunities) to "feel" them and get a better understanding of both domains and your passion for each of them. Then you will be able to compare your sentiments toward each direction and make a much more informed decision that you won't regret later. Good luck!

Disclaimer: While I currently work at Georgia Tech :-), this answer expresses only my personal opinion and not that of my employer.

Posted: March 19, 2016, 11:32 am

I am an unemployed software developer that trades stocks part time and make 10-20k a month. I want to launch a startup. Should I just go for it?

10-20k a month? Holy cow, I think I'd be in a much better (financial) situation now, had I spent time on learning trading rather than on getting my three degrees... :-)

I'm just kidding (though, as you know, there is some truth in any joke). Anyway, back to your question. My interpretation of your words is that you want to launch a startup just to launch a startup, whatever secondary rationale is involved in this (coolness factor, money, fame, etc.). You don't seem to have a clear idea(s) that you like, want, no... have a burning desire to implement or a clear problem that you're dying to solve (thinking about it every day and every night). I think that one has to have that kind of enthusiasm, dedication and focus in order to succeed in the startup world. I think that you are not ready.

I would advise you to explore and take one of the following two routes:

  • get a software development job and save money for your future startup's runway - you will need it and you will be glad to depend less or not at all on VC funding with many benefits, among which complete freedom to decide what is best for your company and not parting with valuable equity, when it's cheap (early stages of a startup);
  • join an existing, promising startup in a software development or similar role - not only will you save and earn money (from whatever salary + your "little trading hobby"), but, more importantly, if the startup is really good, you will gain a ton of valuable experience that will be extremely helpful to you in your future ventures (plus, you will build a network of people that might help you in various ways, when the time is right). Good luck!

Posted: February 24, 2016, 12:57 am

Would it be feasible to start an open-source community with the sole focus of reducing battery costs?

I assume that you are talking about an open source hardware community? If so, I don't see reasons why it would not be feasible. Having said that, unlike typical open source hardware communities, which IMHO are artifact-building, such a community would be largely a research community, which is not a bad thing at all... Good luck!

Posted: February 16, 2016, 2:28 am

What are the most difficult hurdles to building a successful open-source community?

It depends. But, if I had to guess, most likely it is the community's sustainability - there are too many abandoned open source software projects (and relevant communities) out there. Please see the books that I have recommended in Aleksandr Blekh's answer to How are open source communities organized and how can I join them? (second section). I'm sure that you will find answers to this question in any of those books.

Posted: February 16, 2016, 2:19 am

Why might someone in Atlanta stay on Comcast now that Google Fiber is available?

Google Fiber is certainly an attractive offering. It is quite difficult to recommend Comcast over Google in this regard, especially considering Comcast's overall less-than-stellar customer service (though, to be fair, I have to say that I had a relatively decent experience with Comcast's cable Internet service in Atlanta during 2007-2013; if the wait time for service appointments had been shorter, not days, it would not have been that bad).

Having said that, the following are some potential reasons, in my opinion, for staying with Comcast over switching to or selecting Google Fiber in ATL (take with a grain of salt):

1. You already have a cable modem and don't need speeds over 75 Mbps, which for many people is fast enough. This will allow you to save $10/mo. (first year only, though you might get discount rates by threatening to switch) and to have high-speed Internet in locations that don't yet have Google Fiber or won't have it at all.

2. You live within a geographical location in Atlanta or in a particular dwelling that won't be getting Google Fiber service soon enough or at all.

3. You don't have a Netflix or similar streaming TV subscription, but would like to get some streaming TV for free (although quite limited, Comcast offers bundles, where you pay, say, $60 (first year only, though you might get discount rates by threatening to switch) for a 75 Mbps Internet service with some premium TV channels streaming; I think these bundles are referred to as Double Play; see more details here: High Speed Internet Service from XFINITY® by Comcast).

4. You are cost-averse and Comcast offers you attractive enough discount rates as a marketing initiative or after being threatened with a potential switch to Google Fiber.

5. You are an extremely loyal and/or extremely satisfied customer of Comcast :-).

Posted: February 11, 2016, 7:05 am

How do I contribute to open source projects localization of my language?

First of all, I would suggest that you become familiar with the process of open source software (OSS) project localization. For that you can use various resources, for example, the FOSS Localization wiki book and the rather detailed Localisation Guide.

As Vladislav Zorov and James Dixon have mentioned, different OSS projects use different approaches and tools in the form of either external or internal translation / localization platforms. Most OSS projects' websites inform potential contributors about what platforms their projects use and what the localization workflow is.

Should you decide to use a similar platform for your own OSS project or one you are joining (that hasn't yet established a localization workflow), there is no shortage of feature-rich platforms, many of which are free and open source. The simplest way to approach this would be, perhaps, to use the closed source Google Translator Toolkit (About Translator Toolkit), but there are much better alternatives, depending on the project's needs. For example, the following OSS or OSS-friendly platforms are IMHO worth considering:

Posted: February 7, 2016, 7:41 am

Is it legal to modify or extend an open source software and make a commercial software?

You need to carefully read the Eclipse Public License (or any other license, for that matter) and, if needed (if anything is not clear to you), consult with an IP attorney, specializing in open source software licensing.

Posted: February 7, 2016, 12:35 am

How do I get a Product Management role at Google?

Firstly, I would recommend that you read the book Cracking the PM Interview: How to Land a Product Manager Job in Technology by Gayle Laakmann McDowell and Jackie Bavaro.

Secondly, I would advise you to find and read the most popular blogs focused on product management (PM) at major technology companies.

Thirdly, I think that you would be a better fit, and, thus, have better chances of getting a PM role within Google, in their Advanced Technology and Projects Group (Google ATAP) or in any other R&D division or group across the company.

Posted: February 7, 2016, 12:01 am

Is there any open source deep learning tool available for speaker recognition?

You can probably use open source deep learning software for speech recognition in order to perform speaker identification (i.e., by assembling a set of features that comprises a particular speaker's "voice profile"). Please don't ask me for further details on how to implement it: I don't have specific experience in that and am just sharing my advice, based on common sense and some understanding of the ML domain.

There is a rather limited number of open source deep learning software packages focused on speech recognition. I'm aware of one package in two incarnations: pannous/caffe-speech-recognition (for Caffe) and pannous/tensorflow-speech-recognition (for TensorFlow). Most other deep learning tools for speech recognition are most likely proprietary, for obvious reasons (Microsoft => Skype, Baidu => ?, IBM => Watson, etc.). Having said that, you might be interested in relevant relatively recent research achievements in this subject domain, especially by Microsoft and Baidu (Scaling up end-to-end speech recognition).

Finally, it might be a good idea to combine traditional speech recognition approaches and their open source implementations (e.g., CMU Sphinx) with their deep learning counterparts. As far as I know, James Baker has some interesting ideas in that regard.

Posted: February 4, 2016, 8:29 am

How can I deal with IP addresses in machine learning algorithms in traffic analysis and anomaly detection?

The following is a copy of my answer on Cross Validated. While it doesn't address the IP addresses (no pun intended) aspect, it has some potentially valuable references to materials on the topics of machine learning in traffic analysis and anomaly detection. I hope that you will find it useful.

I'm definitely not an expert on anomaly detection. However, it's an interesting area and here's my two cents. First, regarding your note that "Mahalanobis distance could be only applied to normally distributed features": I ran across some research that argues that it is still possible to use that metric in cases of non-normal data. Take a look for yourself at this paper and this technical report.
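For readers unfamiliar with the metric, here is a minimal pure-Python sketch of the Mahalanobis distance for two-dimensional feature vectors (say, two hypothetical per-flow traffic features). It estimates the mean and sample covariance from reference data and then scores a new point; the function names are my own, and a real analysis would more likely rely on a numerical library and more than two features.

```python
import math


def mean_and_covariance(points):
    """Return the mean and the 2x2 sample covariance of (x, y) points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))


def mahalanobis(point, mean, cov):
    """Mahalanobis distance of a 2D point from a distribution (mean, cov)."""
    (sxx, sxy), (_, syy) = cov
    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    dx = point[0] - mean[0]
    dy = point[1] - mean[1]
    # d^2 = [dx dy] * inverse(cov) * [dx dy]^T, with the 2x2 inverse written out
    d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(max(d2, 0.0))
```

Points whose distance exceeds some chosen threshold can then be flagged as candidate anomalies; how to pick that threshold for non-normal data is exactly the kind of question the references above discuss.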

I also hope that you'll find useful the following resources on unsupervised anomaly detection (AD) in the IT network security context, using various approaches and methods: this paper, presenting a geometric framework for unsupervised AD; this paper, which uses a density-based and grid-based clustering approach; these presentation slides, which mention the use of self-organizing maps for AD.

Finally, I suggest that you take a look at the following answers of mine, which I believe are relevant to the topic and, thus, might be helpful: answer on clustering approaches, answer on non-distance-based clustering and answer on software options for AD.

Posted: February 1, 2016, 8:23 am

What is a predatory journal?

There is a lot (or, at least, enough) of information on the subject of predatory journals. Firstly, there is a nice relevant article on Wikipedia: Predatory open access publishing. Secondly, there are many relevant questions and answers on Academia Stack Exchange, for example, this one: What are "fake", "shady", and/or "predatory" journals? Finally, if you need even more information, there is Internet search... ;-). Hope this helps.

Posted: January 19, 2016, 6:23 am

How many hours of coding should you have to be a statistician in industry?

I'm not a statistician, but, in my opinion, it is impossible to answer this question with reasonable precision, simply because there is a wide range of variation in people's skills, previous experience, industry verticals and subject domains, tasks at hand, business and IT environments, among many other factors. If you want to become an expert in coding, try deliberate practice for 10,000 hours: Outliers (book) - I'm mostly joking, of course, as this topic is quite controversial and is a subject of significant scientific debate: Aleksandr Blekh's answer to How can I improve my Java programming skills as fast as possible?.

Posted: January 6, 2016, 3:49 am

What is the total AUM for US-based VC as of 2015?

I don't think that AUM (as both the term and an indicator) is applicable to VC funds at all. I guess AUM can refer either to the amount of money raised by a particular VC firm, targeting a specific fund, or to the total valuation of a VC fund's portfolio companies (my interpretation - might be wrong). On the former, Mark Suster argues that AUM is not a term applicable to VC funds. However, for the latter, unlike Mark's reasoning in Does the Size of a VC Fund Matter? | Bothsides of the Table, I think that AUM is not applicable to VC funds simply because most of their "assets" (investments) are startups with poorly defined or highly fluid valuations, so AUM makes little sense as either a term or an indicator.

Posted: January 1, 2016, 4:02 am

How would you feel about being asked to reformat 10,000 lines of code?

Unless you are specifically told to do the reformatting manually (which is either simply stupid or a complete waste of a software developer's time), I am pretty sure that you've been asked to do that task to test your ability to think as a real (good) software developer (who typically tries to automate things as much as possible, where appropriate). Therefore, I would feel that I was being tested; I would write a corresponding script (or use relevant tools, e.g., from the IDE) and then see how the bosses react. If they were happy with the approach (and, to a lesser degree, the implementation) of automating the task, I would feel that I'm in the right place; otherwise, I would question the value of working for that employer.
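To give an idea of how small such a script can be, here is a rough sketch, assuming the "reformatting" amounts to whitespace cleanup (tabs to spaces, trailing whitespace, a single final newline). The function names, the default tab width and the "*.py" file pattern are my own illustrative choices. It is not language-aware, so it would also expand tabs inside string literals; for real work, a proper formatter for the language in question is the better choice.

```python
from pathlib import Path

def normalize_source(text, tab_width=4):
    """Expand tabs, strip trailing whitespace, and end the text with one newline."""
    lines = [line.expandtabs(tab_width).rstrip() for line in text.splitlines()]
    # Drop blank lines at the end so that exactly one final newline remains.
    while lines and not lines[-1]:
        lines.pop()
    return "\n".join(lines) + "\n" if lines else ""

def reformat_tree(root, pattern="*.py"):
    """Rewrite every matching file under root and return the paths that changed."""
    changed = []
    for path in sorted(Path(root).rglob(pattern)):
        original = path.read_text(encoding="utf-8")
        cleaned = normalize_source(original)
        if cleaned != original:
            path.write_text(cleaned, encoding="utf-8")
            changed.append(path)
    return changed
```

Calling reformat_tree on a project directory rewrites only the files that actually change, so the resulting diff stays reviewable.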

Posted: December 31, 2015, 1:28 pm

Is it possible to self learn data science?

It is certainly possible to self-learn data science (and any other topic, for that matter): List of notable autodidacts. Whether it is the best / optimal choice for a particular person is another (and quite difficult) question. Good luck!

Posted: December 31, 2015, 5:28 am

Has anyone ever tried to estimate the market value of top open source software?

Various attempts have been made and various approaches (with the COCOMO model likely being the most popular) have been used in order to estimate the economic value of open source software (OSS). While the numbers are obviously very approximate and differ between estimations, the following resources are both interesting and useful (if you want to try to learn how to approach such estimation for your amusement, education or sizing up a market sector, for example for your startup).

IMPORTANT NOTE: Please keep in mind that the above resources represent assessments of either or both of economic value of OSS use and economic value of OSS development, which are two very different things.
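As an illustration of the COCOMO approach mentioned above, here is a minimal sketch of the Basic COCOMO formulas: effort = a * KLOC^b (person-months) and schedule = c * effort^d (months), with the standard published coefficients for the three project modes. The monthly_cost parameter and its default value are my own assumptions, used only to turn effort into a rough development cost; the result is an estimate of what it would cost to develop the code, not of its market value.

```python
# Basic COCOMO coefficients (a, b, c, d) for each project mode.
COEFFICIENTS = {
    "organic": (2.4, 1.05, 2.5, 0.38),
    "semi-detached": (3.0, 1.12, 2.5, 0.35),
    "embedded": (3.6, 1.20, 2.5, 0.32),
}

def basic_cocomo(kloc, mode="organic", monthly_cost=10_000):
    """Estimate effort, schedule and cost for kloc thousand lines of code."""
    a, b, c, d = COEFFICIENTS[mode]
    effort = a * kloc ** b        # person-months
    schedule = c * effort ** d    # calendar months
    return {
        "effort_pm": effort,
        "schedule_months": schedule,
        "cost": effort * monthly_cost,  # monthly_cost is a hypothetical loaded rate
    }
```

Published OSS valuations typically feed line counts of a whole distribution or project into a model of this kind, which is one reason the results vary so much with the chosen mode and cost rate.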

Posted: December 28, 2015, 10:00 pm

What software programs exist that have algorithms that can identify patterns, trends and correlations?

This question is too general and, thus, too large to be answered comprehensively. However, a lot of software focused on data science (and some machine learning software, as Tejas Mehta mentioned) includes, to varying degrees, the functionality that you are seeking (data of not so high quality is actually a norm for real-life data science applications, so that is implied). Having said that, since you are interested in business data, I think that you will get faster and better results by exploring the practical sub-domain of data science usually referred to as business intelligence. See the corresponding Wikipedia article (Business intelligence) for more details and further references.

Posted: December 28, 2015, 9:10 am

Where can I find business partners in the USA?

Obviously, it is not easy to enter any developed IT services market, especially the US one. The competition (both internal and external) is pretty brutal. However, it is not impossible. Two main things that I would pay attention to are: 1) the stability, skills and experience of your team; 2) having your company's portfolio of successfully implemented projects and a nice/solid presentation of all the above online. Good luck!

Posted: December 27, 2015, 9:17 pm

Is it a good idea to do MS in MIS instead of doing MS in CS?

I agree with John Czerwiec (with the addition of an IT consultant as another potential role for MIS graduates). You might also check my recent and related answer (though the context is somewhat different): Aleksandr Blekh's answer to I am in the 3rd year of my degree in Information Technology and I wish to go to the US for a masters degree. Which is better MIS or MS in Comp Science? Ultimately, if you want to become a software engineer or developer, MS in CS is likely the best option (if you pursue an MS at all - see below).

However, as I have mentioned in the answer linked above, this is not the only way to achieve this goal. Having said that, I think that pursuing an MIS degree after getting a BS in CS might be considered by some people as somewhat negative. Thus, I would suggest either going the MS in CS route, or, depending on your priorities, considering not getting an additional CS degree at all (unless, as John said, you want a deeper exposure to CS topics, for example, for research or academic / teaching practice).

Posted: December 27, 2015, 12:03 am

I am in the 3rd year of my degree in Information Technology and I wish to go to the US for a masters degree. Which is better MIS or MS in Comp Science?

Which degree is better depends on your career goals, preferences and other personal factors. If you want to be a software engineer/developer, it is IMHO better to get an MS in CS, but if you want to work as an IT manager or similar, an MIS degree is enough. That is not to say that an MIS graduate cannot be an excellent software engineer/developer or, vice versa, that all MSCS graduates are excellent software professionals. It very much depends on a particular person, rather than a degree program (by the way, there are tons of highly skilled and experienced software engineers/developers without an MS degree, a CS degree or any other degree, for that matter; not that I'm advocating not getting a degree, but...). In regard to MIS, you can find more details in Aleksandr Blekh's answer to Is it worth doing an MIS program? Also, search Quora for "MIS vs MSCS" questions - there are some excellent answers out there. Hope this helps. Good luck with whatever you choose!

Posted: December 25, 2015, 4:10 am

Is Elon Musk right to say that lithium is a practically infinite resource because any salt water has it?

It is indeed a serious question in general, from the alternative (green) energy perspective, and, obviously, for Tesla, considering that Tesla's Lithium Supply Constraints Might Hamper Its Growth.

While extracting valuable materials from seawater is not a novel approach, applying it to Li has its own challenges. While the amount of Li dissolved in seawater is very significant and even enormous, it still cannot be called "practically infinite" in the strict scientific sense. Unlike some other chemical elements, Li is present in seawater in very small concentrations; therefore, its extraction from seawater represents a complex scientific and technological task. One of the most challenging parts of the problem is not the process per se, but its feasibility, especially from the energy consumption standpoint. The 2010 research paper by Ugo Bardi, published in Sustainability, provides a fascinating analysis of extracting minerals from seawater from the energy perspective. While the author refers to the relevant Li process as "feasible", the assumptions and calculations are very approximate and, thus, require further research clarifications and confirmations.

Recent research efforts, in particular those by Japanese scientists, are promising. For example, see the research paper on an innovative method of recovering Li from seawater (Innovative lithium recovery technique from seawater by using world-first dialysis with a lithium ionic superconductor). However, while the author says that the method "shows good energy efficiency", there is no support for this statement, so it is not clear from the paper what the energy requirements for this method are, even approximately.

Finally, I think that there is a financial aspect of the problem, as follows. In order to transition to industrial extraction of lithium (or any other minerals, for that matter) from seawater, companies should have financial incentives to do so. In other words, the price of lithium on the market should be high enough for such production to be profitable. On the other hand, potential large-scale use of lithium in automotive batteries (by Tesla and other auto manufacturers) is possible only when the price of lithium is low enough. At some point, the price should reach some equilibrium; however, it is unclear (to me) whether that price will be sufficiently low for large-scale production of Li-based batteries and relevant electric vehicles.

P.S. I'm definitely not an expert in the subject domain; thus, all the above are just my humble thoughts on one of the topics of my interest. Take them with a grain of salt.

Posted: December 8, 2015, 12:37 am

Is it true there's no word for "fun" in Russian?

Interesting question. I agree with most people who have answered and expressed the idea that it is difficult and often impossible to answer such questions due to the extreme richness of various languages in describing concepts. In addition to focusing on concepts rather than words, people should take into account two other important factors: context and culture (of both a speaker and their environment). So, in my opinion, the answer to your question is

"No, it is not true. There are many words for fun in Russian".

So, based on what concept the speaker wants to emphasize, as well as on the context of the phrase and the cultural environment, potential translations for "fun" might be:

здорово, весело, прикольно, отпадно, классно, круто и даже интересно.

Posted: December 6, 2015, 4:34 am

How does open source software differ from proprietary, closed source? How is each distributed?

Have you tried to read something on the topic? The Internet is full of such information. Perhaps the following Wikipedia article is a good starting point: Open-source software.

Posted: December 3, 2015, 11:23 pm

What is the difference between Data Analytics, Data Analysis, Data Mining, Data Science, Machine Learning, and Big Data?

Fundamentals. Essentially, the difference lies in the focus. Data scientist is an umbrella term for both wide- and narrow-focused professionals in data analysis and engineering. On the other hand, a machine learning engineer is simply a data scientist focusing on the machine learning (ML) domain of the larger data science field.

Knowledge. Different titles do not imply that machine learning engineers know less than their less focused colleagues; perhaps quite the opposite is true, at least often enough. Practically all ML engineers I've ever met (online and in person) were at least as knowledgeable in overall data science concepts, methods and tools as us, data scientists. ML engineers are often more knowledgeable, perhaps due to the relative complexity of the ML domain (especially its AI area) in comparison to general statistical data analysis.

Terminology. One more note. I disagree with Dima Korolev that the title ML engineer implies a focus on engineering vs. science. From my experience, it is not the case. I'd say that ML engineer and ML scientist mean the same thing; the former is just a more popular term, probably used to emphasize the complexity of and involvement in algorithmic engineering, which is a core characteristic of the ML domain. Having said that, the difference in a similar pair of general data science terms indeed exists (and, I believe, that is what prompted Dima's note on distinction). That is, data scientist and data engineer are indeed two very different titles, representing two very different perspectives and areas of focus in data science. A nice illustration of this argument is the existence of two separate relevant and popular data science (an umbrella term, again) fellowship programs: Insight Data Science Fellows Program and Insight Data Engineering Fellows Program.

Posted: November 16, 2015, 3:52 am

On what topic I can conduct a research related to information system of an organization using grounded theory?

I suggest that you first formulate a problem by either projecting a topic of your interest onto the enterprise information systems domain, or by performing a (research and/or popular) literature search and extracting problems of that nature from it. This will give you a problem formulation, which will direct you to the next step: gathering data. Hint: grounded theory is applied to problems and domains where theories do not exist, or when a researcher wants to create their own theory, even if some exist. For more information, please refer to the relevant Wikipedia article (Grounded theory) and links within, as a starting point. Hope this helps. Good luck!

Posted: November 15, 2015, 7:13 am

How do I tell about my open source project?

Just post the relevant message on your blog or, better, website (you should have one, if you're serious about your professional brand), your social network channels (LinkedIn, Quora, Twitter, Facebook, Tumblr, etc.) and relevant professional groups within those sites.

If your project is related to software development and/or IT infrastructural issues, I suggest also posting a message on YCombinator's Hacker News as well as corresponding Reddit channels. If you host your project on GitHub (which I highly recommend), consider socializing there via following people and watching others' projects, so that your professional brand and, hence, projects, are exposed to many people with similar interests (Be Social - User Documentation).

If your project is focused on a specific subject domain, industry or research area, I think it's a good idea to also post relevant information on websites, forums and other publication outlets, dedicated to those specific topics and/or domains. Good luck!

Posted: November 14, 2015, 10:54 pm

What are the Open Source Accounting + Billing systems?

Unfortunately, I cannot recommend any open source accounting software, as I don't have direct experience, working with it. Having said that, there are various Top N Best articles on the Internet, which you can analyze, based on your requirements. Additionally, a good starting point might be this Wikipedia article's section: Comparison of accounting software. Good luck!

Posted: October 15, 2015, 11:48 pm

Is there a simple open source document delivery system that a copywriter could use to deliver documents for clients?

Firstly, you are wrong about the availability of WordPress plugins that either focus on editorial workflows or contain that functionality. See for yourself: WordPress › collaboration « Tags « WordPress Plugins. This approach would be my first recommendation, if your website is powered by WordPress.

Secondly, there are some open source software solutions that you can install on your website (obviously, if you have administrative rights on your server), some functionality of which is relevant to your tasks. That software category is usually referred to as product information management (PIM), but not always. This approach is not as direct as the above-mentioned one and likely more complex (as these are larger-scope solutions), but it could allow you to construct a workflow system that would be very well matched to your specific needs. For example, take a look at Akeneo and pimcore.

Posted: October 13, 2015, 7:12 pm

If someone wants to pursue a PhD in a different field than their undergrad major, is it better to simply do some coursework related to the new major, or do a second degree?

While there is no definite answer for your general question, I will try to answer it in the context of your particular situation. Below I will refer to data science only, but, obviously, genome science is somewhat different (despite some intersecting data science-related aspects), since you have to master biology and related material.

A software development background is indeed helpful in regard to career prospects in data science. However, one of the most important factors of success in data science is mastery of statistics and mathematics, as they are the cornerstones of data science approaches, methods and tools. Being in a situation similar to the one you have described (though I have a more technical background), I can attest that it is definitely not easy, but as they say, "a thousand mile journey starts with the first step". With that in mind, if you want to get a second degree or a certificate or simply take classes (traditional or MOOC), I would definitely suggest skipping the computer science path (you can always improve your knowledge and skills in CS via MOOCs, books, personal or open source projects, etc.) and focusing on mathematics and, especially, statistics (e.g., in the scope of a Master's in statistics or, alternatively, applied mathematics or operations research).

One more point. Considering your background in linguistics and your desire to break into data science, I would recommend using this combination by (after learning general data science aspects) focusing on data science areas that deal with natural language processing (NLP): machine translation, voice recognition, artificial intelligence, etc. These areas are extremely popular now and IMHO have excellent career prospects. Hope this helps. Good luck!

Posted: October 13, 2015, 4:12 pm

When mapping political and business relationships between people and organizations in a society, what are some useful open source tools?

If you want purely mapping functionality, then I suggest looking at mind mapping tools: List of concept- and mind-mapping software. If, on the other hand, you want to study those relationships quantitatively as part of theory (or hypotheses) testing, then I would recommend exploring structural equation modeling (SEM) tools.

Tools of both types I've mentioned exist in open source variants and the selection is pretty wide. For example, for open source mind mapping tools, see links in Aleksandr Blekh's answer to Is there an open-source or self-hosted, purely Web-based alternative to TheBrain, an idea-mapping tool? and, for open source SEM tools, see R packages sem, lavaan, OpenMx, plspm, Onyx (GUI for models generation and analysis) and others.

Posted: October 12, 2015, 3:19 am

I need to write a research proposal for postgraduate admission kindly send me your research proposals field construction?

I doubt that someone will send you any documents of that nature, first and foremost because, as Professor Porter has mentioned, such documents can be easily found online (see below). His answer is IMHO too comprehensive for your current needs, but still there are some useful pieces of advice that you can take advantage of (e.g., talking to local professors or researchers).

I don't think that a postgraduate research proposal (whatever it means) is any different from graduate or any other research proposals (also sometimes referred to as idea papers). Since I guess you're asking about a research proposal's structure (as nobody can help you with your content), perform an Internet search for terms like "research proposal structure" or "research proposal outline". The latter query results in many useful documents, for example: Guidelines on writing a research proposal and, especially, Writing a Research Proposal - Organizing Your Social Sciences Research Paper - Research Guides at University of Southern California (also see references within).

Posted: October 9, 2015, 2:49 am

User Aleksandr Blekh - Academia Stack Exchange

Answer by Aleksandr Blekh for infinite/sustainable hosting of a web-interface to a research database

Let me offer you several strategies. Firstly, you can consider, instead of or in addition to developing a LAMP-based Web application, publishing your research database as an open data set with a clearly documented structure (schema, ontology, etc.). The benefits of that include much wider options for long-term preservation as well as opening various opportunities for other researchers to reproduce, enhance and build new knowledge on top of your results: open data => reproducible research => scientific innovation. For this option, you can consider using some solid free open data repositories, such as figshare, Zenodo, CKAN-based Datahub and GitHub (see examples).

Secondly, you can consider a hybrid approach, which is to combine an open data set, published as mentioned above, with the relevant open source code of a Web application that anyone could download, install and use to interface with your data set. Considering the open source hosting aspect, of the above-mentioned options, the GitHub one is especially attractive, as you could seamlessly host both data and relevant Web application code. If you (or someone who can help you) are technical enough, you could make access to your data set, using this approach, even easier, by providing a containerized (such as Docker) version of your data and application (if the data set is not too large, you can even push a relevant public Docker image to DockerHub or other services that host public images for free). Similarly, you can publish a free software appliance - virtual machine (VM) - perhaps, some of the above-mentioned repositories (and/or maybe others) offer hosting open VMs for free.

Thirdly, you can propose developing and hosting a Web application that would provide open access to your data set to (in addition to some universities) relevant non-profit organizations working in your particular domain. If successful, the costs of developing and maintaining the database would be covered (at least, for some time) by relevant scholarships, grants or similar financial vehicles. For example, for social sciences, including humanities, you can review funding opportunities at Social Science Research Council, The Rockefeller Foundation, Carnegie Corporation, Ford Foundation, Russell Sage Foundation and many other non-profits.

Posted: September 10, 2016, 6:10 am

Answer by Aleksandr Blekh for Where can I download a large sample bibliography collection in BibTeX?

For your purposes, I would highly recommend using The Collection of Computer Science Bibliographies by Alf-Christian Achilles. This extensive collection contains 3M+ references on various CS subjects (grouped in about 1500 collections) and, besides offering search and browse interfaces, allows one to download the actual bibliographic data in BibTeX format - just select a particular bibliography and you will see the links to the source files - uncompressed and/or zipped.

P.S. Don't forget to acknowledge the value of this resource to the maintainer of this meta-collection (a thank you note will do) and, perhaps, even attribute the source, if your software will be citable.

Posted: June 26, 2016, 6:34 am

Answer by Aleksandr Blekh for What is the academic value of posts on LinkedIn?

LinkedIn certainly has some value, as a general professional networking tool. However, that value has been declining for quite a while and rather rapidly more recently due to various factors, mainly inability (or lack of care/desire) of LinkedIn's management to manage the quality of the community, provide consistent user experience, fix issues and improve features, just to name a few. Whether the recent acquisition of LinkedIn by Microsoft will help LinkedIn to remain a major player in the market and improve its dominance or, vice versa, will enable its stagnation and transform it into Microsoft's technology- and talent-focused support division, remains to be seen (I make no bets).

Having said that, the value of LinkedIn from the academic publishing perspective is quite bleak (which is a nice way to say "close to zero"), in my humble opinion. The following are some of the reasons for this assessment.

  • Quality / scientific rigor. LinkedIn lacks a peer review process, which means that any piece published there should be taken with many more grains of salt than if such a process were in place (not that one is expected).

  • Relevancy. LinkedIn is not very relevant to academia. The LinkedIn networks of people from academic circles tend to be much less comprehensive than their specialized academic networks, because some of their colleagues, collaborators, etc. use LinkedIn rarely, if ever, or just do not have any presence there at all. Therefore, disseminating scientific information using LinkedIn is a much less effective option. Nevertheless, if one has important academic contacts on LinkedIn that are missing from the person's other networks, it might make sense to publish there a brief post (similar to an abstract) with a link to a full-text article (preferably, a DOI link).

  • Information persistence. LinkedIn lacks a mechanism of persistent identifiers (again, not that we can expect that from a general networking platform), which implies lack of guarantee that a link to an article published there will not become broken over time (which jeopardizes scientific information dissemination).

P.S. There is no such term as "job CV" - I understand what you're trying to say, but IMHO it sounds pretty bad and, thus, I would recommend against using such a word combination in any context. HTH

Posted: June 24, 2016, 1:01 am

Answer by Aleksandr Blekh for Is there a conventional word that describes a professor for whom you were a TA

I would initially suggest terms supervising lecturer or supervising teaching professor. However, both terms are not perfect due to potential interpretation of "lecturer" and "teaching professor" as formal positions. In order to improve this, it might make sense to add clarifying term "class" and remove "teaching" from the second option. Therefore, my final suggestions are the following two options:

  • supervising class (course) lecturer;
  • supervising class (course) professor.
Posted: June 22, 2016, 12:08 am

Answer by Aleksandr Blekh for Trouble with advisor in final Ph.D. phase

I'm sorry to hear about your situation. I had to change my Ph.D. advisor (and I'm very glad I did), but it was in the early phase of my dissertation process. I'm quite surprised by your "discovery" that your advisor might not care about your career. Firstly, it is unlikely (why would she "tolerate" you for 5+ years then?). Secondly, if your advisor truly did not care about your career, it should have been pretty clear early in your collaboration, so either your assumption is not true, or you paid no attention to this aspect at all, which is quite difficult to believe.

Anyway, in regard to your potential actions, I strongly recommend that you consider all possibilities to avoid changing your advisor, considering how far along you are in the program. Changing an advisor is not only an administrative / logistical nightmare, but, if it would require you to start your research from scratch or almost from scratch, it would be extremely depressing, to say the least.

If you could save five years of work and life by defending your dissertation and graduating, even if you part with your advisor not very amicably, I would say that it is worth serious consideration. The two obvious risks in this case would be: 1) not being able to defend your dissertation successfully; 2) potential problems with obtaining a recommendation letter from your advisor (she could either decline, or give a negative or not so positive one). The second aspect is quite important, as your postgraduate applications, not listing your dissertation advisor as a referee, might raise quite a lot of eyebrows, with potentially negative consequences in regard to your postdoctoral offers / career.

You have to carefully think about all these (and other) aspects, consider feedback from people here and your own environment, but, ultimately, only you can decide the best course of action, based on various details, known to you only, as well as your gut feeling, as some new research suggests.

Regardless of what you decide on the subject and how you part with your advisor, I wish you a successful graduation and the achievement of your professional and personal goals in the future. Be strong in staying your course, but flexible in the ways of reaching your destination. Or, as Lao Tzu has said,

Nothing is softer or more flexible than water, yet nothing can resist it.

Posted: June 14, 2016, 6:44 am

Answer by Aleksandr Blekh for Harvard VS APA: the differences? How about mixing styles for a better clarity?

Even if your current university and situation do not call for using a specific publication style, I would strongly recommend against mixing two or more styles, even if they are not much different. The reason is pretty clear: consistency, for the sake of readers of your publications as well as for the sake of your own sanity. Following a single style will make your life easier - if you can choose, just pick the one you feel more comfortable with or the one that is popular or, perhaps, a de facto standard in your field (the latter is IMHO much more important - again, that "for the sake of readers" argument).

Posted: June 8, 2016, 2:18 am

Answer by Aleksandr Blekh for Is my work a good research work?

Good question (+1). In my opinion, academic research work should be focused more on learning in general and learning how to perform research correctly in particular, rather than on doing grandiose, novel or even "the right" research. This is especially applicable to the Master's level research, where implementation-focused work and theses are very popular (obviously, it is quite field-dependent, but here I imply the software engineering / computer science areas of research).

I don't see any reason why a good implementation-focused research work could not be published as a research paper in a solid journal. In fact, I have seen a lot of such papers (of varied quality), especially in the above-mentioned domains, published in respected peer-reviewed outlets.

Posted: June 5, 2016, 8:34 am

Answer by Aleksandr Blekh for CV for a PhD application in applied mathematics

  • Firstly, the Career Objective section is a thing of the past and should not be present in a CV or resume. Not only is it old-fashioned, it actually makes one change their CV or resume every time one applies to a different organization and position. It is much better to place relevant position-focused information in a cover letter, which should be adjusted to a particular position anyway.

  • Secondly, do not put personal details, like a mailing or physical address, on a CV or resume. An e-mail address and, maybe, a phone number are more than enough. You don't expect potential employers to send you postal mail, do you? Plus, the physical address would jeopardize the security of one's identity.

  • Thirdly, the section Research Interests should be higher in the list - I would say, even prior to the section Education (or, at least, right after it).

  • Fourthly, I suggest that you create two versions of your CV (the following is not applicable to a resume) - one with references, for organizations that require them as part of the initial application, and another without them, for those that require them later or via a different communication channel (say, Interfolio).

  • Fifthly, go ahead and search the Internet for examples of academic cover letters (there are plenty of them - stick with the ones from reputable universities). Hope this helps. Good luck!

P.S. I would reword section titles, as follows: Conference Presentations => Talks & Presentations; Research Interest => Research Interests; Co-curricular Activities => not sure it makes sense to extract them in a separate section - why not list them below relevant educational info; Extra-curricular Activities => Extracurricular Activities.

Posted: June 5, 2016, 6:58 am

Answer by Aleksandr Blekh for Any data for average number of papers per year at different career stages?

In regard to the data, I would suggest that you look at NSF's Survey of Doctorate Recipients (SDR) (select the Data tab for data sets). A potentially more convenient or flexible way to access and select data of interest might be via NSF's SESTAT Data Tool (which provides access to the SDR data as well).

Some data (or data sources) might be extracted from relevant literature. In particular, the study Comparing Research Productivity Across Disciplines and Career Stages uses the 2003 SDR dataset (see Table 3 for some ready-to-use numbers). Beyond the above-mentioned direct and indirect data sources, I would recommend reviewing related studies that might potentially contain or refer to relevant data. In particular, check the following papers (obviously, a non-exhaustive list).

Posted: June 3, 2016, 9:07 am

Answer by Aleksandr Blekh for Do academics look down on well-designed academic websites?

Your question is not only too broad and opinionated, but it is also formulated in such a way that it is quite difficult to answer, in general, simply because there is no universal definition of what the attribute "well-designed" means. It could mean different things to different people. There are no clear and universal criteria for judging whether a website (or any other object, for that matter) is well designed or not and, if yes, how well. Certainly, there are various heuristics and checklists for assessing the quality of design of a website, but they are not universal at all, as each criterion's weight is strongly dependent on the context, which, in this particular case, includes goals of the assessment, the website's audience, the assessor's judgement, the layout and essence of the site's content.

In addition to the above, the academic audience is likely to pay more attention to the essence of a website, rather than its design (unless it shows a clear disrespect to potential visitors - in the form of poor spelling, offensive language, excessive use of advertising, frequently appearing mailing list pop-up windows, extremely bright or unbalanced colors as well as accessibility, readability and navigability issues, among others).

Having said that, I don't see any reason why academics would look down on a well-designed academic website (provided that it is somehow determined that the website in question is indeed a well-designed one). That is, of course, unless the site contains irrelevant or poor quality content.

Posted: May 25, 2016, 10:16 pm

Answer by Aleksandr Blekh for Leadership Ph.D alternative

There is a wide range of choices in managerial education, in general, and in leadership education, in particular. Most universities' business schools offer various executive education programs, which range from continuing education / professional development programs, such as (hereafter, I am using Harvard University just as an example for some types of programs) these smaller scale programs, to comprehensive executive leadership programs. Executive leadership programs can be general as well as industry-oriented, such as this higher education-focused one or this healthcare-focused one. Other leadership education options include more lightweight alternatives, such as relevant MOOCs with certificates (either single courses, or thematic tracks), university certificate programs (such as this one at MIT) and relevant educational programs by think tanks (like ones by Aspen Institute), non-profits (like ones by Center for Creative Leadership) and similar organizations.

Posted: March 22, 2016, 2:16 am

Answer by Aleksandr Blekh for When is it appropriate to describe research as "recent"?

Good question. The semantics of the word "recent", in general, and in academic writing, in particular, is not clearly defined (that is, fuzzy), which makes its practical use quite tricky, as evidenced by your question.

While @vonbrand's answer offers some valuable insights, such as considering the fluidity of a particular scientific field or domain, I would suggest a more practical solution to this problem, as follows. Consider the literature that you reference in a particular paper. What is the temporal range of the sources? I think that this aspect could guide you as to where the word "recent" is appropriate and where not so much.

For example, if you cite sources from the current century as well as the 1930s, then a paper from 2010 should be considered recent, but not one from 1950. If, on the other hand, the temporal range of your references is rather narrow, say, the last 20 years, then you should describe as "recent" only sources from approximately the last 4-5 years. You can come up with your own rule of thumb (10-20% of the total range sounds pretty reasonable). The most important aspect would be not the actual value (for the rule of thumb), but rather your consistency in applying it throughout the paper.
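To make the rule of thumb concrete, here is a minimal sketch in Python. The function name is_recent, the default fraction of 20% and the specific years are my own illustrative assumptions, not part of any style guide; the only point is that one fixed cutoff is applied consistently to every source.

```python
def is_recent(source_year, oldest_year, newest_year, fraction=0.2):
    """Return True if source_year lies in the newest `fraction` of the cited range.

    `fraction` is the rule-of-thumb share of the total temporal range
    (an assumed default of 20%; pick your own value and keep it fixed).
    """
    span = newest_year - oldest_year
    cutoff = newest_year - fraction * span
    return source_year >= cutoff


# Wide range of references (assumed here to be 1930-2016): cutoff is about 1998.8.
print(is_recent(2010, 1930, 2016))  # True
print(is_recent(1950, 1930, 2016))  # False

# Narrow range (assumed here to be 1996-2016): cutoff is 2012.
print(is_recent(2013, 1996, 2016))  # True
print(is_recent(2005, 1996, 2016))  # False
```

Whatever fraction you settle on, computing the cutoff once per paper and checking each use of "recent" against it is an easy way to stay consistent.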

Posted: March 9, 2016, 1:58 am

Answer by Aleksandr Blekh for How to properly cite a comment from reddit

I believe that your best guess is pretty close to the right answer. According to the APA Style (6th ed.), you should list as much information as possible for non-periodical publications, which you have done well. I think that your resource falls under the category "Nonperiodical Web Document or Report", as described on this page of the Purdue OWL's APA Formatting and Style Guide.

However, on second thought, it seems that a more correct option would be APA's electronic sources guidelines for "Online Forum or Discussion Board Posting". Not only does Reddit fit this category better, but the category also allows you to specify the author of the quote you are citing. Therefore, the optimal citation in question, in my opinion, should be as follows (note that I took the liberty of removing the date of retrieval, as the link you provide is a permalink and, thus, pretty stable):

Snowden, E. (2015). Just days left to kill mass surveillance under Section 215 of the Patriot Act. We are Edward Snowden and the ACLU's Jameel Jafer. AUA. Retrieved from

Posted: February 21, 2016, 6:56 am

Answer by Aleksandr Blekh for What does "to be enjoyed with all rights and privileges pertaining thereto" mean on a French diploma?

That phrase is clearly not France-specific, as @ff524 mentioned. In order to add to and further illustrate the nice answer by @vonbrand, I will share the following paper, which discusses Roman origins and Medieval expressions of the relevant phrase(s):

In addition to some comments and answers for the above-mentioned most likely duplicate question, I would add that the modern practical meaning of this phrase significantly depends on the graduate's field of study and the institution they graduated from.

In regard to the field of study, rights and privileges might include (beyond the implied rights and privileges to say that one graduated with a specific degree from a particular institution, to wear the institution's regalia, to be referred to as a Dr. [for Ph.D. graduates], etc.) being able to practice in specific regulated fields, such as medicine and law (upon satisfying additional conditions, such as attending medical residency or passing a specific state's bar examination, correspondingly).

In regard to the institution, some rights and privileges include participating in alumni activities, retaining the institution's e-mail address, and getting discounts on various products and services as well as on attending individual classes and, even, enrolling into certain degree programs at one's alma mater.

Posted: February 21, 2016, 1:59 am

Answer by Aleksandr Blekh for How should I state 'MS dropout' in my resume when applying for data scientist positions?

First, some advice. I agree with @gnometorule, but I would state it more strongly: IMHO and based on the limited information you've shared, it would be a mistake to drop out so close to graduation. Even though the current culture within the startup ecosystem and, overall, the tech industry largely ignores education credentials in favor of "being a hustler", "being a doer", "being street smart", etc., the data science subset of both areas actually seems to have more respect for and pay more attention to people's education. This is quite understandable, considering the relative complexity of data science and, especially, its machine learning and artificial intelligence fields of study and practice.

I would strongly suggest that you consider things in perspective and do your best to successfully finish the program. Not only will it give you some advantages when competing in the job market, but it also might be useful to you, should you decide in the future to go for a Ph.D., teach at some educational institution or pursue other opportunities (e.g., scientific research or consulting).

In regard to your specific question - should you decide to ignore my advice - I think that it would be better to formulate in your resume the phrase "MS dropout" not as such or, even, not as

"University XYZ, MS program, Statistics, Years Range, Incomplete",

but rather as a positive fact / achievement:

"University XYZ, MS program, Statistics, Years Range, Completed 90% of curriculum".

Having said that, again, I strongly suggest that you consider finishing your Master's program.

Posted: February 18, 2016, 6:40 am

Answer by Aleksandr Blekh for Pitfalls of Academic Blogging

Let me start with the following disclaimer. Firstly, I also consider myself a junior academic (I defended my Ph.D. in April 2015, though I have quite a bit of industry experience). Secondly, while I have thought for a while about starting professional blogging (in the sense of covering both academia/research and industry, plus various interests) and even created my own WordPress-powered website with a blog section, I have yet to find time to start and continue blogging regularly. Having said that, everyone's situation and circumstances are different. Also, having some kind of writer's block or, rather, fear, I decided that mostly answering (and sometimes asking) questions on Stack Exchange sites as well as Quora is a gentle way of preparing myself for a more serious :-) blogging exposure.

Now, on to your questions (take my advice with a grain of salt, considering the disclaimer above).

Is academic blogging a good idea?

In my humble opinion, absolutely. I've seen a lot of academic blogs. Most of them are of good to excellent quality. Reading such a blog immediately adds some virtual respect points to its author's virtual balance in my brain. Sometimes it helps me to find answers to my specific questions. Often, it increases my awareness of some topics or subject domains. It also helps me to understand who might be a good potential collaborator for future research or an advisor for a science-focused venture / startup. All of the above-mentioned points are potential benefits toward good professional exposure / visibility for an academic blogger.

Does it become too much effort?

As I said, I have no direct experience in blogging, but, based on my experience with answering questions, it depends on your desired involvement. I guess that blogging is more about setting a schedule that is comfortable for the author and sticking to it. Answering questions is a more flexible way.

Is it worthwhile?

See answer to Q1.

How likely is blog-death?

Since one of the major, if not the major, benefits of blogging is training one's brain to formulate and express thoughts and arguments, I think that "blog-death" is not only over-rated, but irrelevant. Even if nobody reads your blog now, 1) at some point, some people will start reading it, if it is worth reading and, more importantly, 2) you will still be improving yourself in so many ways.

In general, what are pitfalls to watch out for when starting an academic blog?

IMHO potential factors of success are (obviously, potential pitfalls would be the opposite aspects):

  • finding interesting topics;
  • expressing yourself via original and quality writing;
  • creating a visually appealing blog (likely, not critical, but still...);
  • creating a realistic schedule and sticking to it;
  • having faith in yourself.

Posted: February 14, 2016, 3:49 am

Answer by Aleksandr Blekh for Determining the appropriate research design

In this instance, I respectfully disagree with @Wrzlprmft's comment. The core of this question is clearly focused on the research design / methodology and, thus, fits the scope of Academia.SE very well. Software development in this case simply represents the context of the applied research.

In regard to the essence of the question, I can offer the following insights and recommendations.

  • In my opinion, grounded theory (GT) is not a good fit for your study, as GT is typically used for designing general theories, rather than applied ones.

  • Design science seems like a pretty good fit, so you might read up on it and consider strategies for compressing research performed using this approach to fit your time frame.

  • You might also consider other qualitative research designs and approaches, such as content analysis, narrative analysis and others (see more details, for example, at Qualitative Research Guidelines Project).

  • I would especially recommend that you pay attention to action research, as this research approach seems to fit quite well with your planned study. For more details on action research and other qualitative approaches, as well as on ATLAS.ti, likely the most popular qualitative research software, see this page.

Finally, I would urge you to pay attention to your terminology, as it might be very confusing not only to potential readers, but to the authors themselves. For example, your phrase "theoretical design of the software application" sounds... quite strange and reduces the clarity of your work and, thus, potentially, can hurt readers' impressions of it. I think that "conceptual design of a software application" sounds and reads much clearer. Hope this helps. Good luck with your thesis!

Posted: February 5, 2016, 8:31 am

Answer by Aleksandr Blekh for How to arrange a multi-topic academic homepage

Firstly, this question is not really academia-specific, since it is applicable to any relatively complex multi-topic multi-user-type website (or any other information resource, for that matter) - however, I will answer it, since some time ago I was facing the same problem, so I understand your situation.

Secondly, this problem lies within the scope of the very large interdisciplinary field of information architecture (IA) (for some introduction, beyond the corresponding Wikipedia article, see, for example, this page and this page).

Thirdly, there is a multitude of approaches to solving this problem and finding the one ("good way", as you put it), which is close to the optimal approach, requires consideration and prioritization of multiple factors, including perspectives for different types of potential users of the site.

Since different types of users have different priorities and preferences, your analysis will most likely generate several (relatively) optimal designs. As some have already mentioned in other answers, those user-type-based optimal designs might be combined on a single site via a tabbed interface, with each tab focused on a particular type of user. Then, within each tab area, relevant topics can be arranged, based on the topic hierarchy, using various methods (smaller tabs, navigational side tree or menu, etc.). In addition, the hierarchy's content might be adjusted, based on the relevant type of users and their interests. This is just one of the most straightforward and simple ideas. While the sky is the limit in generating site designs, I suggest applying the KISS principle to the site's IA for the optimal UX.

There are many nice academic websites out there, but I can't really recommend much due to their diversity and lack of time (I'd have to dig through my vast number of bookmarks). If you care, feel free to visit my own personal professional website, which targets both academia and industry, but academic content is quite limited so far. Please keep in mind that I haven't had a chance to fully update the site in terms of both design and, especially, content, which I hope to get my hands on eventually. Nevertheless, overall IA of the site might give you some useful ideas for implementing on your site (i.e., main menu structure, project types dynamic filtering in the Portfolio section, etc.).

P.S. Despite the warning above, I have decided to make a quick review of my bookmarks in regard to the topic and here is a tiny subset of academic websites that I find useful, interesting and attractive:

Posted: January 24, 2016, 1:50 pm

Answer by Aleksandr Blekh for Database of funded US Department of Defense (DoD) component grant proposals with abstracts: Does it exist?

I recommend that you review the following two databases (other sources might be available as well):

  • Federal grants, affiliated with US Department of Defense (DoD) (I see only posted and closed grant opportunities, but couldn't find how to get the funded ones - see the other source below.)

  • USA Spending Map (Here you definitely can get funded grant opportunities - just select Agency: DoD, Award Type: Grants, Fiscal Year and other parameters, if any. This database is also nice, because it offers options to either download data sets, or use its RESTful APIs.)

P.S. Should you become interested in DoD contracts, that information is available on their website.

Posted: January 24, 2016, 2:20 am

Answer by Aleksandr Blekh for How to overcome discouragement on finding major error in work just before paper submission?

I think that the best way to overcome your situation is to realize that nothing out of the ordinary happened - to err is human. In my opinion, research is about discovering truth and enriching knowledge (including the researchers' own). And making mistakes is a natural part of the process. I would just discuss my work with my advisor openly and do my best to learn from that.

I don't think that you have wasted anybody's time. I (along with many other people) believe that negative results are also valuable (for example, see this paper, this journal and this workshop). The same applies to other kinds of results, such as results similar to existing ones, unimpressive results, etc.

As for strategies for not making mistakes, I don't think there are any that prevent them completely, as I said, but you can reduce their probability by not advancing too fast in your research (i.e., not rushing to obtain results or to publish) as well as by asking for feedback on your work from people beyond your advisor and others closely involved in the research (perhaps, even, from other disciplines, in order to obtain opinions based on different perspectives).

Posted: October 4, 2015, 10:52 pm

Answer by Aleksandr Blekh for Are 'Dr' for medical doctor used in the same sense as a PhD?

While all those titles share the same linguistic roots, obviously, the meaning is somewhat different. When referring to a Ph.D., the term doctor is used in the context of general knowledge acquisition. That is why the full title is doctor of philosophy, where philosophy implies "love of wisdom". On the other hand, a medical doctor (M.D.) or Doctor of Osteopathic Medicine (D.O.) title or one of the dental doctor titles refers to a specialist in one or more areas of medicine. A relatively popular alternative term for medical doctor is physician, which some people might confuse with physicist. The origins of the word "physician" and its relation to the word "doctor" are discussed in this interesting article in Science Friday.

The original meaning of the word "doctor" as "license to teach" has likely been transferred to the medical knowledge domain IMHO due to the important role that medicine played as one of the cornerstones of science at that particular time and place (medieval Europe). You may also find additional interesting information in this related discussion on StackExchange.

Posted: June 12, 2015, 1:19 am

User Aleksandr Blekh - Data Science Stack Exchange

most recent 30 from

Answer by Aleksandr Blekh for Steps in exploratory methods for mild-sized data with mixed categorical and numerical values?

You can get a reasonably good approximation of the steps for exploratory data analysis (EDA) by reviewing the EDA section of the NIST Engineering Statistics Handbook. Additionally, you might find parts of my related answer here on Data Science SE helpful.

Methods related to EDA are so diverse that it is not feasible to discuss them in a single answer. I will just mention several approaches. If you are interested in applying classification to your data set, you might find the information mentioned in my other answer helpful. In order to detect structures in a data set, you can try to apply principal component analysis (PCA). If, on the other hand, you are interested in exploring latent structures in data, consider using exploratory factor analysis (EFA).
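To give a feel for what PCA does with the numerical columns, here is a minimal sketch in plain Python that finds only the first principal component via power iteration on the sample covariance matrix. The function name and the tiny data set are made up for illustration; in practice you would use an established implementation (for example, in R or Python's scientific stack), and this sketch covers neither categorical variables nor EFA.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, explained_variance_ratio) of the first principal component."""
    n = len(rows)
    d = len(rows[0])
    # Center each column so that the covariance is computed around the mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly multiply a vector by cov and renormalize.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # The Rayleigh quotient gives the variance captured along direction v.
    variance_along_v = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)
    )
    total_variance = sum(cov[i][i] for i in range(d))
    return v, variance_along_v / total_variance


# Made-up data: the second column is roughly twice the first.
data = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]
direction, ratio = first_principal_component(data)
print(direction, ratio)
```

For this toy data nearly all of the variance lies along a single direction, which is the kind of structure PCA is meant to reveal; with real data you would also standardize the columns first if they are measured on different scales.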

Posted: October 25, 2015, 12:10 am

Answer by Aleksandr Blekh for Sampling for multi categorical variable

Let me give you some pointers (assuming that I'm right on this, which might not necessarily be true, so proceed with caution :-). First, I'd figure out the applicable terminology. It seems to me that your case can be categorized as multivariate sampling from a categorical distribution (see this section on categorical distribution sampling). Perhaps the simplest approach is to use the R ecosystem's rich functionality. In particular, the standard stats package contains the rmultinom function (link).
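If it helps to see the idea outside of R, here is a minimal sketch in Python that uses only the standard library. The function name sample_multinomial, the category names and the probabilities are hypothetical; the sketch simply draws n items from a categorical distribution and counts them, which is the same kind of result that a multinomial sampler returns.

```python
import random
from collections import Counter


def sample_multinomial(n, probs, seed=None):
    """Draw n items from the categories in `probs` and return counts per category.

    `probs` maps each category to its probability (the weights need not sum
    to exactly 1, since random.choices normalizes them).
    """
    rng = random.Random(seed)
    categories = list(probs)
    weights = [probs[c] for c in categories]
    draws = rng.choices(categories, weights=weights, k=n)
    tally = Counter(draws)
    # Report every category, including those that were never drawn.
    return {c: tally.get(c, 0) for c in categories}


# Hypothetical categorical variable with three levels.
counts = sample_multinomial(1000, {"red": 0.5, "green": 0.3, "blue": 0.2}, seed=42)
print(counts)
```

Fixing the seed makes the draw reproducible, which is handy when the sample feeds into a later analysis.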

If you need more complex types of sampling, there are other packages that might be worth exploring, for example sampling (link), miscF (link), offering rMultinom function (link). If your complex sampling is focused on survey data, consider reading this interesting paper "Complex Sampling and R" by Thomas Lumley.

If you use languages other than R, check the multinomial function from Python's numpy package and, for Stata, this blog post. Finally, if you are interested in Bayesian statistics, the following two documents seem to be relevant: this blog post and this survey paper. Hope this helps.

Posted: October 12, 2015, 3:48 pm

Answer by Aleksandr Blekh for Are there any tools for feature engineering?

Very interesting question (+1). While I am not aware of any software tools that currently offer comprehensive functionality for feature engineering, there is definitely a wide range of options in that regard. Currently, as far as I know, feature engineering is still largely a laborious and manual process (e.g., see this blog post). Speaking about the feature engineering subject domain, this excellent article by Jason Brownlee provides a rather comprehensive overview of the topic.

Ben Lorica, Chief Data Scientist and Director of Content Strategy for Data at O'Reilly Media Inc., has written a very nice article, describing the state-of-the-art (as of June 2014) approaches, methods, tools and startups in the area of automating (or, as he put it, streamlining) feature engineering.

I took a brief look at some startups that Ben has referenced and a product by Skytree indeed looks quite impressive, especially in regard to the subject of this question. Having said that, some of their claims sound really suspicious to me (e.g., "Skytree speeds up machine learning methods by up to 150x compared to open source options"). Continuing on the topic of commercial data science and machine learning offerings, I have to mention solutions by Microsoft, in particular their Azure Machine Learning Studio. This Web-based product is quite powerful and elegant and offers some feature engineering functionality (FEF). For an example of some simple FEF, see this nice video.

Returning to the question, I think that the simplest approach one can apply for automating feature engineering is to use corresponding IDEs. Since you are (as am I) interested in the R language as a data science backend, I would suggest checking, in addition to RStudio, another similar open source IDE, called RKWard. One of the advantages of RKWard vs RStudio is that it supports writing plugins for the IDE, thus enabling data scientists to automate feature engineering and streamline their R-based data analysis.

Finally, on the other side of the spectrum of feature engineering solutions we can find some research projects. The two most notable seem to be Stanford University's Columbus project, described in detail in the corresponding research paper, and Brainwash, described in this paper.

Posted: October 3, 2015, 8:05 am

Answer by Aleksandr Blekh for Looking for language and framework for data munging/wrangling

If you are interested in a very high-level (enterprise architecture) framework, I suggest that you take a look at the MIKE2.0 Methodology. Being an information management framework, MIKE2.0 certainly has much wider coverage than the domain of your interest, but it is a solid, interesting and open (licensed under the Creative Commons Attribution License) framework. A better fit for your focus is the Extract, transform, load (ETL) framework, which is extremely popular in the contexts of Business Intelligence and Data Warehousing. On a more practical note, you might want to check my answer on Quora on open source master data management (MDM) solutions. Pay attention to the Talend solutions (disclaimer: I am not affiliated with this or any other company), which cover a wide spectrum of MDM, ETL and data integration domains as open source and commercial offerings.

Posted: September 30, 2015, 9:12 pm

Answer by Aleksandr Blekh for How to start analysing and modelling data for an academic project, when not a statistician or data scientist

Typically, quantitative analysis is planned and performed based on a research study's goals. Focusing on research goals and corresponding research questions, a researcher would propose a model (or several models) and a set of hypotheses associated with the model(s). The model(s) and the types of its/their elements usually dictate (suggest) quantitative approaches that would make sense in a particular situation. For example, if your model includes latent variables, you would have to use appropriate methods to perform data analysis (e.g., structural equation modeling). Otherwise, you can apply a variety of other methods, such as time series analysis or, as you mentioned, multiple regression and machine learning. For more details on research workflow with latent variables, also see section #3 in my relevant answer.

One last note: whatever methods you use, pay enough attention to the following two very important aspects - performing full-scale exploratory data analysis (EDA) (see my relevant answer) and trying to design and perform your analysis in the reproducible research fashion (see my relevant answer).

Posted: September 22, 2015, 7:42 am

Answer by Aleksandr Blekh for Program to fine-tune pre-trained word embeddings on my data set

While I am not aware of software specifically for tuning trained word embeddings, perhaps the following open source software might be helpful, if you can figure out what parts can be modified for the fine-tuning part (just an idea off the top of my head - I'm not too familiar with the details):

Posted: July 31, 2015, 4:39 am

Answer by Aleksandr Blekh for Do I need an Artificial Intelligence API?

One needs to use an artificial intelligence (AI) API, if there is a need to add AI functionality to a software application - this is pretty obvious. Traditionally, my advice on machine learning (ML) software includes the following two excellent curated lists of resources: this one and this one.

However, keep in mind that ML is just a subset of the AI domain, so if your tasks involve AI areas beyond ML, you need more AI-focused tools or platforms. For example, you can take a look at ai-one's AI platforms and APIs as well as at OpenCog, an interesting open source general AI project.

In addition to the above-mentioned AI-focused platforms, IBM's Watson AI system deserves a separate mention, as it is quite cool and promising. It offers its own ecosystem for developers, called IBM Watson Developer Cloud, based on IBM's Bluemix cloud computing platform-as-a-service (PaaS). However, at the present time, I find this offering to be quite expensive as well as limiting, especially for individual developers, small startups and other small businesses, due to its tight integration with and reliance only on a single PaaS (Bluemix). It will be interesting to watch this space, as competition in the AI domain and marketplace IMHO will surely intensify in the future.

Posted: June 10, 2015, 3:53 am

Answer by Aleksandr Blekh for What is the definition of knowledge within data science?

Knowledge is a general term and I don't think that there exist definitions of knowledge for specific disciplines, domains and areas of study. Therefore, in my opinion, knowledge, for a particular subject domain, can be defined just as a domain-specific (or context-specific, as mentioned by @JGreenwell +1) perspective (projection) of a general concept of knowledge.

Posted: June 7, 2015, 5:38 am

Answer by Aleksandr Blekh for Ideas for next step of Machine Learning

I would suggest that you check this excellent presentation by Li Deng (Microsoft Research). Many of the slides contain references to relevant research papers and even several interesting books on the topics of interest (they should be pretty easy to find). It might also be helpful to check the references listed in this research paper by Prof. Andrew Ng and his colleagues at Baidu Research. Finally, a focused Internet search will provide you with a comprehensive list of resources for further research.

Posted: May 21, 2015, 5:33 am

Answer by Aleksandr Blekh for Airline Fares - What analysis should be used to detect competitive price-setting behavior and price correlations?

In addition to exploratory data analysis (EDA), both descriptive and visual, I would try to use time series analysis as a more comprehensive and sophisticated analysis. Specifically, I would perform time series regression analysis. Time series analysis is a huge research and practice domain, so, if you're not familiar with the fundamentals, I suggest starting with the above-linked Wikipedia article, gradually searching for more specific topics and reading corresponding articles, papers and books.

Since time series analysis is a very popular approach, it is supported by most open source and closed source commercial data science and statistical environments (software), such as R, Python, SAS, SPSS and many others. If you want to use R for this, check my answers on general time series analysis and on time series classification and clustering. I hope that this is helpful.
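As a rough first check before any proper time series regression, one could look at whether one airline's fares track a competitor's with a delay. The following is a minimal Python sketch under my own assumptions: the fare series are made up, and pearson() and lagged_correlation() are hypothetical helper names, not functions from any package.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def lagged_correlation(leader, follower, lag):
    """Correlation between leader[t] and follower[t + lag], for lag >= 0."""
    if lag == 0:
        return pearson(leader, follower)
    return pearson(leader[:-lag], follower[lag:])

# Made-up daily fares: airline B repeats airline A's fare one day later.
fares_a = [200, 210, 205, 230, 220, 240, 235, 250]
fares_b = [198] + fares_a[:-1]

for lag in range(3):
    print(lag, round(lagged_correlation(fares_a, fares_b, lag), 3))
```

A correlation that peaks at a non-zero lag hints at a leader-follower pattern, but keep in mind that raw price levels sharing a common trend can correlate spuriously, so in practice one would usually difference the series first and then move on to a proper time series regression.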

Posted: May 18, 2015, 2:32 am

Answer by Aleksandr Blekh for Application of Control Theory in Data Science

Have you tried an Internet search? The results should be able to answer most, if not all, of your questions. The topics of your interest sound rather general or high-level. I'm sure that they can, in one form or another, be applied in the data science context. In my opinion, those topics are more related to operations research (OR); therefore, I would recommend performing some research on the Internet on the intersections between control systems (theory) and data science.

Having said that, the first thing that comes to my mind is that the most likely candidates for the use of control theory concepts and methods in the data science context would be distributed systems and algorithms for data analysis, such as MapReduce (Hadoop, etc.), as well as other parallel processing systems. If there exists an intersection between OR's area of optimization and control theory, then it very well could be used for optimization of big data algorithms, among other tasks.

Posted: May 17, 2015, 8:30 am

Answer by Aleksandr Blekh for Attributing causality to single quasi-independent variable

I would suggest considering a direct dimensionality reduction approach. Check my relevant answer on this site. Another valid option is to use latent variable modeling, for example, structural equation modeling. You can start with relevant articles on Wikipedia (this and this, correspondingly) and then, as needed, read more specialized or more practical articles, papers and books.

Posted: May 16, 2015, 2:36 am

Answer by Aleksandr Blekh for Best or recommended R package for logit and probit regression

Unless you have some very specific or exotic requirements, in order to perform logit and probit regression analysis in R, you can use the standard (built-in and loaded by default) stats package. In particular, you can use the glm() function, as shown in the following nice tutorials from UCLA: logit in R tutorial and probit in R tutorial.

If you are interested in multinomial logistic regression, this UCLA tutorial might be helpful (note that glm() itself has no multinomial family, so you would use packages such as nnet, glmnet or mlogit). For the above-mentioned very specific or exotic requirements, many other R packages are available, for example, logistf or elrm.

I also recommend another nice tutorial on GLMs from Princeton University (by Germán Rodríguez), which discusses some modeling aspects, not addressed in the UCLA materials, in particular updating models and model selection.

Posted: May 13, 2015, 2:47 am

Answer by Aleksandr Blekh for Use of Nash-Equilibrium in big data environments

I have very limited knowledge of game theory, but hope to learn more. However, I think that potential applications of Nash equilibrium in the context of big data environments imply the need to analyze a large number of features (representing various strategic pathways or traits) as well as a large number of cases (representing a significant number of actors). Considering these points, I would think that complexity and, consequently, performance requirements for Nash equilibrium in big data applications grow exponentially. For some examples from the Internet load-balancing domain, see the paper by Even-Dar, Kesselman and Mansour (n.d.).
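To make the complexity point a bit more tangible, here is a minimal Python sketch (my own illustration, not taken from the referenced paper) of a brute-force search for pure-strategy Nash equilibria in a two-player game. The function name pure_nash_equilibria() and the payoff matrices are assumptions for the example only.

```python
from itertools import product

def pure_nash_equilibria(payoffs_a, payoffs_b):
    """Return all (row, col) pure-strategy Nash equilibria of a bimatrix game.

    payoffs_a[r][c] is the row player's payoff and payoffs_b[r][c] the
    column player's payoff when row r and column c are played.
    """
    n_rows, n_cols = len(payoffs_a), len(payoffs_a[0])
    equilibria = []
    for r, c in product(range(n_rows), range(n_cols)):
        # The row player cannot gain by switching rows, given column c.
        row_best = all(payoffs_a[r][c] >= payoffs_a[k][c] for k in range(n_rows))
        # The column player cannot gain by switching columns, given row r.
        col_best = all(payoffs_b[r][c] >= payoffs_b[r][k] for k in range(n_cols))
        if row_best and col_best:
            equilibria.append((r, c))
    return equilibria

# Prisoner's dilemma: strategy 0 = cooperate, 1 = defect.
pd_row = [[-1, -3], [0, -2]]
pd_col = [[-1, 0], [-3, -2]]
print(pure_nash_equilibria(pd_row, pd_col))  # [(1, 1)]
```

With two players this enumeration is cheap, but with m actors who each have s strategies the number of strategy profiles to check is s to the power m, which is the kind of exponential growth I have in mind above.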

The above-mentioned points touch only the volume aspect of the 4V big data model (an update of Gartner's original 3V model). If you add to that the other aspects (variety, velocity and veracity), the situation seems to become even more complex. Perhaps people with an econometrics background and experience will have some of the most comprehensive opinions on this interesting question. A lot of such people are active on Cross Validated, so I will let them know about this question - hopefully, some of them will be interested in sharing their views by answering it.


Even-Dar, E., Kesselman, A., & Mansour, Y. (n.d.). Convergence time to Nash equilibria.

Posted: May 8, 2015, 6:57 am

Answer by Aleksandr Blekh for How can I use Data Science to profoundly contribute to Humanity?

Since I have already answered a similar question on Data Science StackExchange site, plus some related ones, I will mention all of them here and let you decide, if you find them helpful:

Posted: April 21, 2015, 10:56 pm

Answer by Aleksandr Blekh for Abstract data type?

Any platform, focused on social networking (not necessarily Twitter), at its core uses the most appropriate and natural abstract data type (ADT) for such domain - a graph data structure.

If you use Python, you can check out the nice NetworkX package, used for "the creation, manipulation, and study of the structure, dynamics, and functions of complex networks". Of course, there are many other software tools for various programming languages for building, using and analyzing network structures. You might also find useful the relevant book "Social Network Analysis for Startups: Finding connections on the social web", which provides a nice introduction to social network analysis (SNA) and uses the above-mentioned NetworkX software for SNA examples. P.S. I have no affiliation whatsoever with the NetworkX open source project or the book's authors.
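Just to illustrate what I mean by a graph as an ADT, here is a minimal standard-library Python sketch of an undirected "who is connected to whom" graph. It is a toy under my own assumptions (the Graph class and its method names are hypothetical) and is in no way a substitute for NetworkX.

```python
from collections import defaultdict, deque

class Graph:
    """Toy undirected graph stored as an adjacency list (dict of sets)."""

    def __init__(self):
        self._adj = defaultdict(set)

    def add_edge(self, u, v):
        # An undirected edge is stored in both directions.
        self._adj[u].add(v)
        self._adj[v].add(u)

    def neighbors(self, u):
        return set(self._adj.get(u, ()))

    def degree(self, u):
        return len(self._adj.get(u, ()))

    def distance(self, source, target):
        """Number of hops on a shortest path (BFS), or None if unreachable."""
        seen = {source}
        queue = deque([(source, 0)])
        while queue:
            node, dist = queue.popleft()
            if node == target:
                return dist
            for nxt in self._adj.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
        return None

network = Graph()
for a, b in [("alice", "bob"), ("bob", "carol"), ("carol", "dave")]:
    network.add_edge(a, b)

print(network.degree("bob"))              # 2
print(network.distance("alice", "dave"))  # 3
```

The ADT is the set of operations (add an edge, list neighbors, find a path); the dict-of-sets storage is just one of several possible implementations.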

Posted: April 15, 2015, 6:04 pm

Answer by Aleksandr Blekh for Possibility of working on KDDCup data in local system

I think that you have, at least, the following major options for your data analysis scenario:

  1. Use big data-enabling R packages on your local system. You can find most of them via the corresponding CRAN Task View that I reference in this answer (see point #3).

  2. Use the same packages on a public cloud infrastructure, such as Amazon Web Services (AWS) EC2. If your analysis is non-critical and tolerant to potential restarts, consider using AWS Spot Instances, as their pricing allows for significant financial savings.

  3. Use the above-mentioned public cloud option with the standard R platform, but on more powerful instances (for example, on AWS you can opt for memory-optimized EC2 instances or general purpose on-demand instances with more memory).

In some cases, it is possible to tune a local system (or a cloud on-demand instance) to enable R to work with big(ger) data sets. For some help in this regard, see my relevant answer.

For both of the above-mentioned cloud (AWS) options, you may find it more convenient to use R-focused pre-built VM images. See my relevant answer for details. You may also find useful this excellent comprehensive list of big data frameworks.

Posted: April 12, 2015, 5:23 am

Answer by Aleksandr Blekh for Extracting model equation and other data from 'glm' function in R

In order to extract some data from the fitted glm model object, you need to figure out where that data resides (use documentation and str() for that). Some data might be available from the summary.glm object, while more detailed data is available from the glm object itself. For extracting model parameters, you can use coef() function or direct access to the structure.


From Princeton's* introduction to R course's website, GLM section (see it for details and examples):

The functions that can be used to extract results from the fit include

- 'residuals' or 'resid', for the deviance residuals
- 'fitted' or 'fitted.values', for the fitted values (estimated probabilities)
- 'predict', for the linear predictor (estimated logits)
- 'coef' or 'coefficients', for the coefficients, and
- 'deviance', for the deviance. 

Some of these functions have optional arguments; for example, you can extract five different types of residuals, called "deviance", "pearson", "response" (response - fitted value), "working" (the working dependent variable in the IRLS algorithm - linear predictor), and "partial" (a matrix of working residuals formed by omitting each term in the model). You specify the one you want using the type argument, for example residuals(lrfit,type="pearson").

*) More accurately, this website is by Germán Rodríguez from Princeton University.
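For readers who want to see what some of those residual types actually compute, here is a small Python sketch of the standard formulas for a binary (0/1) response. It is only an illustration under my own assumptions: binomial_residuals() is a hypothetical helper, the fitted probabilities are made up, and in R you would simply call residuals() with the type argument instead.

```python
from math import copysign, log, sqrt

def binomial_residuals(y, mu, kind="deviance"):
    """Residuals for 0/1 outcomes y and fitted probabilities mu (0 < mu < 1)."""
    resid = []
    for yi, mi in zip(y, mu):
        raw = yi - mi  # "response" residual: observed minus fitted
        if kind == "response":
            resid.append(raw)
        elif kind == "pearson":
            # Raw residual scaled by the binomial standard deviation.
            resid.append(raw / sqrt(mi * (1 - mi)))
        elif kind == "deviance":
            # Signed square root of each observation's deviance contribution.
            unit = -2 * (log(mi) if yi == 1 else log(1 - mi))
            resid.append(copysign(sqrt(unit), raw))
        else:
            raise ValueError("unknown residual type: " + kind)
    return resid

observed = [1, 0, 1]
fitted = [0.8, 0.3, 0.5]  # made-up fitted probabilities

for kind in ("response", "pearson", "deviance"):
    print(kind, [round(r, 3) for r in binomial_residuals(observed, fitted, kind)])

# The model deviance is the sum of squared deviance residuals.
print(round(sum(r * r for r in binomial_residuals(observed, fitted)), 3))
```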

Posted: April 9, 2015, 3:27 pm

Answer by Aleksandr Blekh for Building a static local website using Rmarkdown: step by step procedure

In most things, related to R, there are many approaches to solve a problem, sometimes too many, I would say. The task of building a static website, using RMarkdown, is not an exception.

One of the best, albeit somewhat brief, sets of workflows on the topic is the following one by Daniel Wollschlaeger, which includes this workflow, based on R, nanoc and Jekyll, as well as this workflow, based on R and WordPress. Another good workflow is this one by Jason Bryer, which is focused on R(Markdown), Jekyll and GitHub Pages.

Not everyone likes GitHub Pages, Jekyll, Octopress and Ruby, so some people came up with alternative solutions. For example, this workflow by Edward Borasky is based on R and, for a static website generator, on Python-based Nikola (instead of Ruby-based Jekyll or nanoc). Speaking about static website generators, there are tons of them, in various programming languages, so, if you want to experiment, check this amazing website, listing almost all of them. Almost, because some are missing - for example, Samantha and Ghost, listed here.

Some other interesting workflows include this one by Joshua Lande, which is based on Jekyll and GitHub Pages, but includes some nice examples of customization for integrating a website with Disqus, Google Analytics and Twitter as well as getting custom URL for the site and more.

Those who want a pure R-based static site solution now have some options, including rsmith, a static site generator by Hadley Wickham, and Poirot, a static site generator by Ramnath Vaidyanathan.

Finally, I would like to mention an interesting project (from an open science perspective) that I recently ran across - an open source software by Mark Madsen for a lab notebook static site, which is based on GitHub Pages and Jekyll, but also supports pandoc, R, RMarkdown and knitr.

Posted: April 8, 2015, 3:47 am

Answer by Aleksandr Blekh for Learning resources for data science to win political campaigns?

This is an interesting and relevant question. I think that, from a data science perspective, it should not be, in principle, any different from any other similar data science tasks, such as prediction, forecasting or other analyses. Similarly to any data science work, the quality of applying data science to politics very much depends on understanding not only data science approaches, methods and tools, but, first and foremost, the domain being analyzed, that is, the politics domain.

The rapidly rising popularity of data science and machine learning (ML), in general, certainly has a significant impact on particular verticals and politics is not an exception. This impact can be seen not only in increased research interest in applying data science and ML to political science (for example, see this presentation, this paper, this overview paper and this whole virtual/open issue in a prominent Oxford journal), but also in practical applications. Moreover, a new term - political informatics or poliInformatics or poli-informatics - has been coined to name an interdisciplinary field, whose stated goal is to study and use data science, big data and ML in the government and politics domains. As I've said earlier, the interest in applying data science to politics goes beyond research and often results in politics-focused startups, such as PoliticIt or Para Bellum Labs. Following the unfortunate, but established trend in the startup ecosystem, many of those ventures fail. For example, read the story of one such startup.

I am pretty sure that you will not be able to find either the proprietary algorithms that political startups or election data science teams used and use, or their data sets. However, I am rather positive that you can get some understanding about typical data sets as well as data collection and analysis methods via the resources that I have referenced above. Hope this helps.

Posted: April 4, 2015, 6:57 am

Answer by Aleksandr Blekh for Do data scientists use Excel?

Do experienced data scientists use Excel?

I've seen some experienced data scientists, who use Excel - either due to their preference, or due to their workplace's business and IT environment specifics (for example, many financial institutions use Excel as their major tool, at least, for modeling). However, I think that most experienced data scientists recognize the need to use tools, which are optimal for particular tasks, and adhere to this approach.

Can you assume a lack of experience from someone who does primarily use Excel?

No, you cannot. This is the corollary from my above-mentioned thoughts. Data science does not automatically imply big data - there is plenty of data science work that Excel can handle quite well. Having said that, if a data scientist (even an experienced one) does not have knowledge (at least, basic) of modern data science tools, including big data-focused ones, it is somewhat disturbing. This is because experimentation is deeply ingrained into the nature of data science due to exploratory data analysis being an essential and, even, a crucial part of it. Therefore, a person, who does not have an urge to explore other tools within their domain, could rank lower among candidates in the overall fit for a data science position (of course, this is quite fuzzy, as some people are very quick in learning new material, plus, people might not have had an opportunity to satisfy their interest in other tools due to various personal or workplace reasons).

Therefore, in conclusion, I think that the best answer an experienced data scientist might have to a question in regard to their preferred tool is the following: My preferred tool is the optimal one, that is the one that best fits the task at hand.

Posted: April 3, 2015, 10:37 pm

Answer by Aleksandr Blekh for What is the term for when a model acts on the thing being modeled and thus changes the concept?

Though it is not a term specific to machine learning, I would refer to such behavior of a statistical model using the general term side effect (while adding some clarifying adjectives, such as expected or unexpected, desired or undesired, and similar). Modeling outcome or transitive feedback loop outcome might be some of the alternative terms.

Posted: April 3, 2015, 12:51 am

Answer by Aleksandr Blekh for how to modify sparse survey dataset with empty data points?

I would consider approaching this situation from the following two perspectives:

  • Missing data analysis. Although formally the values in question are empty and not NA, I think that effectively incomplete data can (and should) be considered as missing. If that is the case, you need to automatically recode those values and then apply standard missing data handling approaches, such as multiple imputation. If you use R, you can use packages Amelia (if the data is multivariate normal), mice (supports non-normal data) or some others. For a nice overview of approaches, methods and software for multiple imputation of data with missing values, see the excellent 2007 article by Nicholas Horton and Ken Kleinman "Much ado about nothing: A comparison of missing data methods and software to fit incomplete data regression models".

  • Sparse data analysis, such as sparse regression. I'm not too sure how well this approach would work for variables with high levels of sparsity, but you can find a lot of corresponding information in my relevant answer.
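The recoding step mentioned in the first bullet can be sketched as follows. This is a minimal Python illustration under my own assumptions: the set of "empty" markers, the survey rows and the helper names recode_missing() and missing_share() are all hypothetical, and the imputation itself would still be done with a dedicated package.

```python
# Hypothetical markers that count as "empty" in this made-up survey export.
EMPTY_MARKERS = {"", "n/a", "na", "-"}

def recode_missing(rows):
    """Replace empty-looking string values with None (an explicit missing value)."""
    recoded = []
    for row in rows:
        clean = {}
        for key, value in row.items():
            is_empty = isinstance(value, str) and value.strip().lower() in EMPTY_MARKERS
            clean[key] = None if is_empty else value
        recoded.append(clean)
    return recoded

def missing_share(rows):
    """Fraction of missing (None) values per column."""
    columns = rows[0].keys()
    return {c: sum(r[c] is None for r in rows) / len(rows) for c in columns}

survey = [
    {"age": "34", "income": ""},
    {"age": " ", "income": "52000"},
    {"age": "41", "income": "N/A"},
    {"age": "29", "income": "48000"},
]

recoded = recode_missing(survey)
print(missing_share(recoded))  # {'age': 0.25, 'income': 0.5}
```

Looking at the share of missing values per variable first also helps to decide whether imputation is reasonable at all or whether a sparse data approach is the better fit.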

Posted: April 2, 2015, 11:38 pm

Answer by Aleksandr Blekh for How does SQL Server Analysis Services compare to R?

In my opinion, it seems that SSAS makes more sense for someone who:

  • has significantly invested in Microsoft's technology stack and platform;
  • prefers point-and-click interface (GUI) to command line;
  • focuses on data warehousing (OLAP cubes, etc.);
  • has limited needs in terms of statistical methods and algorithms variety;
  • has limited needs in cross-language integration;
  • doesn't care much about openness, cross-platform integration and vendor lock-in.

You may find useful this blog post by Sami Badawi. However, note that the post is not recent, so some information might be outdated. Plus, the post contains an initial review, which might not be very accurate or comprehensive. If you're thinking about data science, while considering staying within the Microsoft ecosystem, I suggest taking a look at Microsoft's own machine learning platform Azure ML. This blog post presents a brief comparison of (early) Azure ML and SSAS.

Posted: March 27, 2015, 11:22 am

Answer by Aleksandr Blekh for General approach to extract key text from sentence (nlp)

You need to analyze sentence structure and extract corresponding syntactic categories of interest (in this case, I think it would be noun phrase, which is a phrasal category). For details, see corresponding Wikipedia article and "Analyzing Sentence Structure" chapter of NLTK book.

In regard to available software tools for implementing the above-mentioned approach and beyond, I would suggest considering either NLTK (if you prefer Python), or StanfordNLP software (if you prefer Java). For many other NLP frameworks, libraries and support for various programming languages, see the corresponding (NLP) sections in this excellent curated list.
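To show the idea of extracting noun phrases without pulling in any library, here is a deliberately tiny Python sketch that chunks an already POS-tagged sentence. Everything in it is an assumption for illustration: the tags are typed in by hand (Penn Treebank style), noun_phrases() is a hypothetical helper, and the rule "optional determiner, any adjectives, then nouns" is far cruder than the chunking tools described in the NLTK book chapter.

```python
def noun_phrases(tagged):
    """Collect simple noun phrases (DT? JJ* NN+) from (word, tag) pairs."""
    phrases, current, has_noun = [], [], False
    # A sentinel token flushes a phrase that ends the sentence.
    for word, tag in list(tagged) + [("", "END")]:
        if tag.startswith("NN"):
            current.append(word)
            has_noun = True
        elif tag in ("DT", "JJ") and not has_noun:
            # Determiners and adjectives may only precede the nouns.
            current.append(word)
        else:
            if has_noun:
                phrases.append(" ".join(current))
            current, has_noun = [], False
            if tag in ("DT", "JJ"):
                current.append(word)  # this token starts the next candidate
    return phrases

sentence = [
    ("the", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
    ("jumped", "VBD"), ("over", "IN"),
    ("the", "DT"), ("lazy", "JJ"), ("dog", "NN"),
]
print(noun_phrases(sentence))  # ['the quick brown fox', 'the lazy dog']
```

In a real project the tagging and chunking would, of course, come from NLTK or StanfordNLP rather than from hand-written rules like these.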

Posted: March 21, 2015, 8:58 pm

Answer by Aleksandr Blekh for Data Science in C (or C++)

In my opinion, ideally, to be a more well-rounded professional, it would be nice to know at least one programming language for the most popular programming paradigms (procedural, object-oriented, functional). Certainly, I consider R and Python as the two most popular programming languages and environments for data science and, therefore, primary data science tools.

Julia is impressive in certain aspects, but it is still trying to catch up with those two and establish itself as a major data science tool. However, I don't see this happening any time soon, simply due to R/Python's popularity, very large communities as well as enormous ecosystems of existing and newly developed packages/libraries, covering a very wide range of domains / fields of study.

Having said that, many packages and libraries, focused on data science, ML and AI areas, are implemented and/or provide APIs in languages other than R or Python (for the proof, see this curated list and this curated list, both of which are excellent and give a solid perspective about the variety in the field). This is especially true for performance-oriented or specialized software. For that software, I've seen projects with implementation and/or APIs mostly in Java, C and C++ (Java is especially popular in the big data segment of data science - due to its closeness to Hadoop and its ecosystem - and in the NLP segment), but other options are available, albeit to a much more limited, domain-based, extent. None of these languages is a waste of time; however, you have to prioritize mastering any or all of them according to your current work situation, projects and interests. So, to answer your question about the viability of C/C++ (and Java), I would say that they are all viable, not as primary data science tools, but as secondary ones.

Answering your questions on 1) C as a potential data science tool and 2) its efficiency, I would say that: 1) while it's possible to use C for data science, I would recommend against doing it, because you'd have a very hard time finding corresponding libraries or, even more so, trying to implement corresponding algorithms by yourself; 2) you shouldn't worry about efficiency, as many performance-critical segments of code are implemented in low-level languages like C, plus, there are options to interface popular data science languages with, say, C (for example, the Rcpp package for integrating R with C/C++). This is in addition to simpler, but often rather effective, approaches to performance, such as consistent use of vectorization in R as well as using various parallel programming frameworks, packages and libraries. For R ecosystem examples, see CRAN Task View "High-Performance and Parallel Computing with R".

Speaking about data science, I think that it makes quite a lot of sense to mention the importance of reproducible research approach as well as the availability of various tools, supporting this concept (for more details, please see my relevant answer). I hope that my answer is helpful.

Posted: March 21, 2015, 8:09 pm

Answer by Aleksandr Blekh for IDE alternatives for R programming (RStudio, IntelliJ IDEA, Eclipse, Visual Studio)

Here's R Language Support for IntelliJ IDEA. However, keep in mind that this support is not in the form of built-in functionality or official plug-in, but rather a third-party plug-in. I haven't tried it, so my opinion on it is limited to the point above.

In my opinion, a better option would be Eclipse, which offers R support via StatET IDE. However, I find Eclipse IDE too heavyweight. Therefore, my preferred option is RStudio IDE - I don't know why one would prefer other options. I especially like RStudio's ability of online access to the full development environment via RStudio Server.

Posted: March 19, 2015, 12:21 am

Answer by Aleksandr Blekh for Python or R for implementing machine learning algorithms for fraud detection

I would say that it is your call and purely depends on your comfort with (or desire to learn) the language. Both languages have extensive ecosystems of packages/libraries, including some, which could be used for fraud detection. I would consider anomaly detection as the main theme for the topic. Therefore, the following resources illustrate the variety of approaches, methods and tools for the task in each ecosystem.

Python Ecosystem

  • scikit-learn library: for example, see this page;
  • LSAnomaly, a Python module, improving OneClassSVM (a drop-in replacement): see this page;
  • Skyline: an open source example of implementation, see its GitHub repo;
  • A relevant discussion on StackOverflow;
  • pyculiarity, a Python port of Twitter's AnomalyDetection R Package (as mentioned in 2nd bullet of R Ecosystem below "Twitter's Anomaly Detection package").
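Just to make the anomaly detection theme concrete, here is a minimal standard-library Python sketch based on the modified z-score (median and median absolute deviation, with the commonly used 3.5 cut-off). It is a generic textbook-style technique and not the method used by any of the libraries listed above; the transaction amounts and the mad_anomalies() name are my own assumptions.

```python
from statistics import median

def mad_anomalies(values, threshold=3.5):
    """Indices of values whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)  # median absolute deviation
    if mad == 0:
        return []  # no spread to compare against; nothing is flagged
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]

# Made-up card transaction amounts with one suspicious payment.
amounts = [25, 30, 27, 31, 29, 26, 950, 28]
print(mad_anomalies(amounts))  # [6]
```

Using the median instead of the mean matters here, since a single large fraudulent amount would otherwise inflate the very statistics used to detect it.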

R Ecosystem

Additional General Information

Posted: February 21, 2015, 10:08 am

Answer by Aleksandr Blekh for High-dimensional data: What are useful techniques to know?

This is a very broad question, which I think is impossible to cover comprehensively in a single answer. Therefore, I think that it would be more beneficial to provide some pointers to relevant answers and/or resources. This is exactly what I will do by providing the following information and thoughts of mine.

First of all, I should mention the excellent and comprehensive tutorial on dimensionality reduction by Burges (2010) from Microsoft Research. He touches on high-dimensional aspects of data frequently throughout the monograph. This work, referring to dimensionality reduction as dimension reduction, presents a theoretical introduction to the problem, suggests a taxonomy of dimensionality reduction methods, consisting of projective methods and manifold modeling methods, and provides an overview of multiple methods in each category.

The projective methods reviewed include independent component analysis (ICA), principal component analysis (PCA) and its variations, such as kernel PCA and probabilistic PCA, canonical correlation analysis (CCA) and its kernel CCA variation, linear discriminant analysis (LDA), kernel dimension reduction (KDR) and some others. The manifold methods reviewed include multidimensional scaling (MDS) and its landmark MDS variation, Isomap, Locally Linear Embedding and graphical methods, such as Laplacian eigenmaps and spectral clustering. I'm listing most of the reviewed methods here in case the original publication is inaccessible for you, either online (link above), or offline (References).
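As a small illustration of the simplest projective method from that list, here is a standard-library Python sketch of PCA that finds the first principal component by power iteration on the covariance matrix. It is a teaching toy under my own assumptions (the function name and the data are made up), not the treatment given in the monograph; for real work you would use an existing PCA implementation.

```python
from math import sqrt

def first_principal_component(rows, iterations=200):
    """Return (direction, scores) of the first principal component.

    Uses power iteration on the sample covariance matrix. Assumes the data
    is not constant and that the all-ones start vector is not orthogonal to
    the leading eigenvector.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    vec = [1.0] * d
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # Project the centered data onto the component: d dimensions become one.
    scores = [sum(c[j] * vec[j] for j in range(d)) for c in centered]
    return vec, scores

points = [(1, 2), (2, 4), (3, 6), (4, 8)]  # all points lie on the line y = 2x
direction, scores = first_principal_component(points)
print([round(x, 3) for x in direction])  # [0.447, 0.894]
```

Since the toy data lies exactly on a line, a single component captures all of its variance, which is the whole point of projecting onto fewer dimensions.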

There is a caveat for the term "comprehensive" that I've applied to the above-mentioned work. While it is indeed rather comprehensive, this is relative, as some of the approaches to dimensionality reduction are not discussed in the monograph, in particular, the ones, focused on unobservable (latent) variables. Some of them are mentioned, though, with references to another source - a book on dimensionality reduction.

Now, I will briefly cover several narrower aspects of the topic in question by referring to my relevant or related answers. In regard to nearest neighbors (NN)-type approaches to high-dimensional data, please see my answers here (I especially recommend checking paper #4 in my list). One of the effects of the curse of dimensionality is that high-dimensional data is frequently sparse. Considering this fact, I believe that my relevant answers here and here on regression and PCA for sparse and high-dimensional data might be helpful.


Burges, C. J. C. (2010). Dimension reduction: A guided tour. Foundations and Trends® in Machine Learning, 2(4), 275-365. doi:10.1561/2200000002

Posted: January 26, 2015, 8:00 am

Answer by Aleksandr Blekh for Is the R language suitable for Big Data

Some good answers here. I would like to join the discussion by adding the following three notes:

1) The question's emphasis on the volume of data while referring to Big Data is certainly understandable and valid, especially considering the problem of data volume growth outpacing technological capacities' exponential growth per Moore's Law.

2) Having said that, it is important to remember about other aspects of the big data concept, based on Gartner's definition (emphasis mine - AB): "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization." (usually referred to as the "3Vs model"). I mention this, because it forces data scientists and other analysts to look for and use R packages that focus on aspects of big data other than volume (enabled by the richness of the enormous R ecosystem).

3) While existing answers mention some R packages, related to big data, for a more comprehensive coverage, I'd recommend referring to CRAN Task View "High-Performance and Parallel Computing with R", in particular, sections "Parallel computing: Hadoop" and "Large memory and out-of-memory data".

Posted: July 19, 2014, 2:19 am


User Aleksandr Blekh - Cross Validated

most recent 30 from

Answer by Aleksandr Blekh for Predict user behaviour with constantly changing input variables

Interesting question (+1). I'm not an expert in recommendation systems, so my attempt to help will be limited to emphasizing the following point (please don't ask about implementation details - you will have to figure that out by yourself or ask other people):

  • I would think that there are two somewhat different approaches: 1) one that you seem to suggest, where, if I understood correctly, you want to predict next (website) destinations of the user and generate corresponding recommendations, based on those predicted destinations; 2) another is to generate recommendations, based on tracking the user's N most recent actions (perhaps, a more dynamic option).
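Although I said not to ask me about implementation details, the second approach is simple enough to sketch. The following Python toy is purely my own assumption of how it could look: it keeps only the N most recent pages in a fixed-size window and recommends pages that have previously followed them. The class and method names are hypothetical, and a real system would learn the "what follows what" counts from many users.

```python
from collections import Counter, deque

class RecentActionRecommender:
    """Recommend pages that tended to follow the user's N most recent pages."""

    def __init__(self, n_recent=3):
        self.recent = deque(maxlen=n_recent)  # sliding window of recent pages
        self.follows = {}  # page -> Counter of pages visited right after it

    def record(self, page):
        if self.recent:
            previous = self.recent[-1]
            self.follows.setdefault(previous, Counter())[page] += 1
        self.recent.append(page)  # the oldest page drops out automatically

    def recommend(self, k=2):
        scores = Counter()
        for page in self.recent:
            scores.update(self.follows.get(page, {}))
        for page in self.recent:
            scores.pop(page, None)  # do not recommend what was just visited
        return [page for page, _ in scores.most_common(k)]

recommender = RecentActionRecommender(n_recent=2)
for page in ["home", "shoes", "cart", "home", "shoes", "socks", "home"]:
    recommender.record(page)

print(recommender.recommend(k=1))  # ['shoes']
```

Because the window slides with every action, the recommendations adapt as the input changes, which is why I called this option the more dynamic one.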
Posted: May 19, 2015, 10:47 am

Answer by Aleksandr Blekh for Calculating CIs for $\eta^2$ via Z scores - sample size?

In case you are still interested in this topic, I would recommend taking a look at the papers referenced in my answer, especially the first one (by Lakens). Also, check the MBESS R package: see home page and JSS paper (note that the software's current version most likely contains additional features and improvements, not described in the referenced original JSS paper).

Posted: May 10, 2015, 3:09 am

Answer by Aleksandr Blekh for Difference between regression analysis and curve fitting

In addition to @NickCox's excellent answer (+1), I wanted to share my subjective impression on this somewhat fuzzy terminology topic. I think that a rather subtle difference between the two terms lies in the following. On one hand, regression often, if not always, implies an analytical solution (reference to regressors implies determining their parameters, hence my argument about analytical solution). On the other hand, curve fitting does not necessarily imply producing an analytical solution and IMHO often might be and is used as an exploratory approach.

Posted: May 8, 2015, 5:59 pm

Answer by Aleksandr Blekh for Invariance test after CFA model

As far as I know, measurement invariance testing is usually performed in the SEM context, when the research sample contains multiple groups. In the SEM context, measurement invariance is often referred to as factorial invariance. It is definitely a good idea to perform both measurement invariance analysis and common method bias analysis prior to creating structural models, and this approach is actually recommended in the literature (e.g., Podsakoff, MacKenzie, Lee & Podsakoff, 2003; van de Schoot, Lugtig & Hox, 2012).

Gaskin (2012) provides excellent textual and video tutorials on performing CFA, including measurement model invariance testing and common method bias testing. While I don't have experience in performing CFA in AMOS (I prefer R), you are in luck :-), since many of Gaskin's tutorials (and CFA ones, in particular) are focused on using AMOS. I highly recommend his materials, both textual and, especially, video. I hope that my answer is helpful.


Gaskin, J. (2012). Confirmatory factor analysis. Gaskination's StatWiki.

Podsakoff, P. M., MacKenzie, S. B., Lee, J. Y., & Podsakoff, N. P. (2003). Common method biases in behavioral research: A critical review of the literature and recommended remedies. Journal of Applied Psychology, 88(5), 879-903. doi:10.1037/0021-9010.88.5.879

van de Schoot, R., Lugtig, P., & Hox, J. (2012). A checklist for testing measurement invariance. European Journal of Developmental Psychology, 1(7). doi:10.1080/17405629.2012.686740

Posted: May 7, 2015, 12:59 am

Answer by Aleksandr Blekh for software library to compute KL divergence?

It's great that you came up with the solution (+1). I meant to post an answer to this question much earlier, but was busy traveling to my dissertation defense (which was successful :-). You are likely to be happy with your solution, but, in addition to the possibility of computing KL divergences for certain distributions in R, for example, via function KLdiv from the flexmix package, I ran across another and, in my opinion, much better option, which might be of interest to you.

It is a very comprehensive piece of autonomous open source software, relevant to the topic, called Information Theoretical Estimators (ITE) Toolbox. It is written in MATLAB/Octave and supports various information theoretic measures. So, sending thanks and kudos to the author of this software, I'm excited to share it here and hope that it will be useful to you and the community.
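For completeness, the quantity itself is easy to compute for discrete distributions. Below is a minimal standard-library Python sketch of the textbook definition, the sum of p(i) * log(p(i) / q(i)); it is my own illustration (kl_divergence() is a hypothetical helper) and is unrelated to how flexmix or the ITE Toolbox estimate divergences from samples.

```python
from math import log

def kl_divergence(p, q):
    """KL divergence D(p || q) in nats for two discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # by convention, 0 * log(0 / q) contributes nothing
        if qi == 0:
            return float("inf")  # p puts mass where q has none
        total += pi * log(pi / qi)
    return total

fair = [0.5, 0.5]
biased = [0.9, 0.1]
print(round(kl_divergence(fair, biased), 4))  # 0.5108
print(round(kl_divergence(biased, fair), 4))  # 0.3681
```

Note that the two results differ: KL divergence is not symmetric, so it matters which distribution is treated as the reference.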

Posted: May 1, 2015, 5:14 am

Answer by Aleksandr Blekh for Can you run clustering algorithms on perfectly collinear data?

The following is not an attempt to comprehensively answer your interesting (+1) question, but rather conveniently store and share with you and others some relevant, in my opinion, papers:

Posted: April 21, 2015, 3:38 pm

Answer by Aleksandr Blekh for Covariance between variables

If you're talking about correlation between predictor variables in a regression model, then the phenomenon you're describing is referred to as multicollinearity. To detect multicollinearity, at a minimum, you have to calculate the variance inflation factor (VIF) for each predictor, but there are other tests for this task as well. While detecting multicollinearity is relatively easy, dealing with it is not. Therefore, it might be beneficial to prevent it prior to the analysis or, at least, reduce it during the analysis. For more information on preventing and reducing multicollinearity, check my relevant answer.
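To make the VIF idea concrete, here is a minimal sketch in plain Python. It assumes the simplest case of exactly two predictors, where the VIF of either predictor reduces to 1 / (1 - r^2), with r being their Pearson correlation. The data and the function names (pearson_r, vif_two_predictors) are made up for illustration; with more predictors, each VIF would come from regressing one predictor on all the others.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors of a two-predictor model: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical data: x2 is almost a linear function of x1, while x3 is not.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
x3 = [5.0, 1.0, 4.0, 2.0, 6.0, 3.0]

vif_collinear = vif_two_predictors(x1, x2)  # very large: strong multicollinearity
vif_unrelated = vif_two_predictors(x1, x3)  # close to 1: little multicollinearity

print(round(vif_collinear, 1), round(vif_unrelated, 2))
```

A frequently quoted rule of thumb treats VIF values above roughly 5 or 10 as a sign of problematic multicollinearity, but such thresholds are conventions rather than formal tests.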

Posted: April 17, 2015, 4:14 pm

Answer by Aleksandr Blekh for How to describe meaning of R squared?

As @MattReichenbach said, if Age is the only predictor in your model, then your wording is fine. However, in order to avoid specifying a particular variable, I would suggest the following wording: "the model explains 30% of variation of the car condition index" (also note the use of present tense, which to me sounds more natural and correct). Using "the model" will make it easier to modify your results reporting in the future, for example, if and when you add more predictors to the model.

Posted: April 17, 2015, 3:54 pm

Answer by Aleksandr Blekh for Regression analysis or Structural Equation Modelling

First of all, especially considering that your model is not that simple, I suggest that, for this study, you switch from using the term regression analysis to using the term latent variable modeling (LVM) or, more commonly, structural equation modeling (SEM). The main reason is not the terminology, but emphasizing the fact that SEM encompasses a comprehensive analysis of both the measurement model and the structural model. In SEM terminology, to analyze a measurement model, you need to perform confirmatory factor analysis (CFA), after you've done EFA, while to analyze a structural model, you need to perform path analysis, also referred to as path modeling (PM) or simply SEM.

In terms of the SEM process, as I said earlier, it is quite a challenge to grasp all the concepts and, especially, tie them all into one neat framework. So, I would suggest that you start with this excellent tutorial, then read this paper (theoretical parts) to better understand SEM in general as well as the two major approaches to SEM (CB-SEM and PLS-SEM), and then, perhaps, take a quick look at this paper to get a sense (don't try to understand everything right away) of how the full SEM analysis (EFA $\rightarrow$ CFA $\rightarrow$ PM/SEM) should be performed and reported. Then you can return to this question to post small clarifying questions or post them as separate questions. Hope this helps.

Note. Two important aspects: 1) your full SEM model (both measurement and structural models) should be hypothesized by you, based on theory or, if theory doesn't exist for that knowledge domain, literature review as well as your assumptions and arguments; 2) the mapping between 26 items and 4 latent factors is exactly that hypothesized measurement model I was talking about.

Posted: April 16, 2015, 10:23 am

Answer by Aleksandr Blekh for When the dependent variable and random effects 'overlap' in mixed effects models

My knowledge of mixed effects models (MEM) is rather fuzzy so far, so I will just share with you the following two nice blog post tutorials on MEM in R by Jared Knowles: "Getting Started with Mixed Effect Models in R" and "Mixed Effects Tutorial 2: Fun with merMod Objects". I hope that it's helpful.

Posted: April 16, 2015, 9:47 am

Answer by Aleksandr Blekh for How to run regression analysis without extracted factors from factor analysis?

I'm confused about what you're confused about. If I understood your question correctly, your plan is to perform regression analysis, using factors, extracted during exploratory factor analysis (EFA). Let's assume that your original data set contains $N$ observations and $k$ columns, equal to the total number of factors. Your EFA resulted in 4 extracted factors (not the corresponding data, as you rightly noted), let's call them $f_1, f_2, f_3, f_4$. So, the next step, I think, would be to perform regression analysis on a subset of the original data set, containing only the columns that correspond to the extracted factors. Therefore, both goals will be achieved: performing EFA and regression.

Posted: April 16, 2015, 4:20 am

Answer by Aleksandr Blekh for KFold Cross Validation Package/Library in C++?

I'm sure that you will find that many of the C++ libraries, listed in this section of that nice curated list of machine learning (ML) libraries, support cross-validation. Also, if you don't mind using C++ within .NET, check out Accord.NET, an interesting ML framework - it does indeed support cross-validation.

Posted: April 16, 2015, 3:47 am

Answer by Aleksandr Blekh for Confidence measures for Gaussian mixture models

I will answer these questions in reverse order, as it seems to make more sense.

I'm playing around with densityMclust in the mclust R package, and it doesn't seem to be returning any confidence measure (analogous to a p-value).

It seems to me that the R package mclust used to have confidence measures reporting functionality in some of its previous versions, but it has been removed or disabled for some reason. That functionality included calculating (via bootstrapping) and reporting significance (p-values) as well as standard errors and confidence intervals for estimated parameters. Based on the current CRAN documentation, the functionality was available via the functions mclustBootstrapLRT() and MclustBootstrap().

Considering the above, I think that you have the following options:

  1. Determine the latest version of mclust, which contained needed functionality, install that version and perform the analysis.

  2. Implement the missing functionality in end-user R code, based on the information, formulas and references provided in the documentation's description of the mclustBootstrapLRT() and MclustBootstrap() functions. IMHO, a much better source of information for manual implementation is a nice blog post "EM Algorithm: Confidence Intervals" by Stephanie Hicks.

  3. Consider using mixtools package, which seems to contain at least significance (p-values) calculating and reporting functionality, similar to the one of mclustBootstrapLRT() function (see page 26 in the corresponding JSS paper).

When generating Gaussian mixture models using expectation maximization with Bayesian Information Criterion, is it necessary to report a confidence measure?

Unless it is very difficult (skill-wise or time-wise) for you to use one of the above-mentioned options, I think that it is quite important to include such reporting in your analysis' results, as it demonstrates (academic or industrial) professional level of statistical rigor.

How do you know that the algorithms are returning the optimal models?

I think that the EM algorithm returns models that are optimal only in a local sense. The M-step is the optimizing one (M stands for maximization), and an iteration cannot decrease the log-likelihood, but the algorithm iterates only until it converges to a local maximum of the log-likelihood function, which is not guaranteed to be the global one. Consequently, the result may depend on the starting values.

Additional information on the EM algorithm can be found in the following papers: brief, medium and large (a 280+ page book, ironically called "gentle tutorial" :-). Also of interest might be this paper on estimating standard errors for the EM algorithm and this general paper on estimating confidence intervals for mixture models.
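To illustrate the convergence behavior described above, here is a minimal sketch of EM for a two-component, one-dimensional Gaussian mixture in plain Python. It is not what mclust or mixtools do internally; the simulated data, the starting values and the function names (normal_pdf, em_two_gaussians) are assumptions made for this example only. The sketch records the log-likelihood at every iteration, so that you can see that it never decreases.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of the normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def em_two_gaussians(data, mu, sigma, weight, n_iter=50):
    """Fit a two-component 1-D Gaussian mixture by EM.

    mu, sigma and weight are 2-element lists with the starting values.
    Returns the fitted parameters and the log-likelihood seen at each iteration.
    """
    mu, sigma, weight = list(mu), list(sigma), list(weight)
    history = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation
        resp = []
        loglik = 0.0
        for x in data:
            dens = [weight[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)]
            total = dens[0] + dens[1]
            loglik += math.log(total)
            resp.append([d / total for d in dens])
        history.append(loglik)
        # M-step: weighted maximum likelihood updates of the parameters
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            sigma[k] = math.sqrt(var)
            weight[k] = nk / len(data)
    return mu, sigma, weight, history


# Hypothetical data: two well-separated groups, centered at 0 and at 5.
random.seed(0)
data = ([random.gauss(0.0, 1.0) for _ in range(100)]
        + [random.gauss(5.0, 1.0) for _ in range(100)])

mu, sigma, weight, history = em_two_gaussians(
    data, mu=[1.0, 4.0], sigma=[1.0, 1.0], weight=[0.5, 0.5])

print([round(m, 2) for m in mu], round(history[-1], 2))
```

Because only a local maximum is guaranteed, a common precaution is to run the procedure from several different starting values and keep the fit with the highest final log-likelihood.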

Posted: April 16, 2015, 3:21 am

Answer by Aleksandr Blekh for Exploratory data analysis for a dataset with continuous and categorical variables

First of all, it is possible to calculate correlation for both continuous and categorical variables, as long as the latter ones are ordered. This type of correlation is referred to as polychoric correlation.

In order to calculate polychoric correlation, since you plan to use R, you have at least two options: 1) the psych package offers polychoric() and related functions; 2) the polycor package offers the hetcor() function. Analysis of models containing ordered categorical (ordinal) variables includes some other methods as well, including, but not limited to, numeric recoding, ordinal regression and the latent variables approach.

Posted: April 15, 2015, 5:27 am

Answer by Aleksandr Blekh for Trend Analysis: How to tell random fluctuations from actual changes in trends?

Basically, you have to perform trend analysis, which is an exploratory time series technique, often based on the ARMA family of models, of which ARIMA is most likely the most popular one. However, for your purposes, I think that it might be enough to just perform time series decomposition, where, along with seasonality and cyclical pattern, trend is one of the main components. More details on time series decomposition as well as some examples can be found here. In regard to some existing rules of thumb for the minimum sample size of a time series, Prof. Rob J. Hyndman dismisses such guidelines as "misleading and unsubstantiated in theory or practice" in this relevant blog post.
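As a rough illustration of what classical additive decomposition does, here is a minimal sketch in plain Python. It estimates the trend with a centered moving average and the seasonal component with the average detrended value per position in the cycle. The quarterly series and the function names (centered_moving_average, seasonal_means) are made up for this example; in R, you would normally rely on functions such as decompose() or stl() instead.

```python
def centered_moving_average(series, period):
    """Trend estimate via a centered moving average.

    For an even period, the two end points of the window get half weight
    (the so-called 2 x period moving average). Positions where the window
    does not fit are returned as None.
    """
    n = len(series)
    half = period // 2
    trend = [None] * n
    for t in range(half, n - half):
        window = series[t - half:t + half + 1]
        if period % 2 == 0:
            total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        else:
            total = sum(window)
        trend[t] = total / period
    return trend


def seasonal_means(series, trend, period):
    """Average detrended value for each position in the seasonal cycle."""
    sums = [0.0] * period
    counts = [0] * period
    for t, (y, tr) in enumerate(zip(series, trend)):
        if tr is not None:
            sums[t % period] += y - tr
            counts[t % period] += 1
    return [s / c for s, c in zip(sums, counts)]


# Hypothetical quarterly series: linear trend plus a repeating seasonal pattern.
season = [3.0, -1.0, -4.0, 2.0]  # sums to zero over one year
series = [10.0 + 0.5 * t + season[t % 4] for t in range(24)]

trend = centered_moving_average(series, period=4)
seasonal = seasonal_means(series, trend, period=4)

print(trend[2:6], seasonal)
```

With real data, whatever is left after subtracting the trend and the seasonal component is the remainder; if apparent "changes in trend" are small relative to the spread of that remainder, they are more likely to be random fluctuations.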

Posted: April 14, 2015, 8:29 am

Answer by Aleksandr Blekh for How to implement GLM computationally in C++ (or other languages)?

While there is definitely some educational value in re-implementing the GLM framework (or any other statistical framework, for that matter), I question the feasibility of this approach due to its complexity and, consequently, the time and effort involved. Having said that, if you indeed want to go this route and review existing open source GLM implementations, you have, at least, the following options:

  • Standard GLM implementation by R package stats. See the corresponding source code here on GitHub or by typing the function name (without parentheses) in R's command line.

  • Alternative and specific GLM implementations for R include the following packages: glm2, glmnet and some others. Additionally, GLM-related R packages are listed in this blog post.

  • Excellent GLM Notes webpage (by Michael Kane and Bryan W. Lewis) offers a wealth of interesting and useful details on standard and alternative R GLM implementations aspects.

  • For Julia GLM implementations, check the GLM and GLMNet packages, which are similar to their R counterparts.

  • For Python GLM implementations, check the one in statsmodels library and the one in scikit-learn library (implements Ridge, OLS and Lasso - find corresponding modules).

  • For .NET GLM implementations, check IMHO very interesting Accord.NET framework - the GLM source code is here on GitHub.

  • For C/C++ GLM implementations, check apophenia C library (this source code seems to be relevant) and, perhaps, C++ GNU Scientific Library (GSL) (see this GitHub repo, but I was unable to find the relevant source code). Also of interest could be: this C++ IRLS GLM implementation (which uses GSL) as well as the Bayesian Object Oriented Modeling (BOOM) C++ library (GLM-focused source code is here on GitHub).

Posted: April 14, 2015, 3:39 am

Answer by Aleksandr Blekh for How can I determine if a time-series is statistically stable?

There exist various approaches to testing whether a time series is stationary. One of the most popular approaches is based on the unit root family of tests, which includes the Augmented Dickey-Fuller (ADF) test (available in R as tseries::adf.test()), the Zivot-Andrews test and several others (see the links in the unit root test Wikipedia article). Another approach is to use the KPSS test, which is considered complementary to unit root testing. Finally, there are approaches based on spectrum analysis, which include the Priestley-Subba Rao (PSR) test and the wavelet spectrum test. Some theoretical discussion and examples are available via the previous link as well as in the corresponding section of the online textbook "Forecasting: principles and practice" by professors Rob J. Hyndman and George Athanasopoulos.

Posted: April 13, 2015, 9:57 pm

Answer by Aleksandr Blekh for How does R package 'quantmod' receive (almost) real-time data?

Reviewing quantmod package's documentation (the up-to-date one, located on CRAN, since documentation on the package's website is obsolete), it appears that, currently, R package quantmod supports, aside from local data sets (MySQL, CSV, RData), the following public and private online data sources (availability varies from function to function).

Posted: April 13, 2015, 5:27 am

Answer by Aleksandr Blekh for Is it valid to reduce noise in the test data from noisy experiments by averaging over multiple runs?

I think that the experimenters' decision fits into general resampling statistical strategy. Having said that, I'm not sure what specific aspects, if any, might be used to criticize this approach from the machine learning perspective.

In regard to reducing noisy data, while I'm not sure how applicable it is in your subject domain, you might want to check my hopefully relevant answer. Moreover, I think that it might make sense to use clustering to detect and eliminate noisy data by applying bootstrapping technique. Please see my answer on using bootstrapping for clustering.

Posted: April 13, 2015, 12:17 am

Answer by Aleksandr Blekh for Machine Learning for Image Processing book recommendation

The book by Prince, recommended by @seanv507 is indeed an excellent book on the topic (+1). And while it is not really compact, it has very logical structure and even a generous refresher chapter on probability as well as great focus on machine learning within computer vision context.

However, I'd like to recommend another excellent book on the topic (also freely downloadable), which, while having more focus on computer vision per se, IMHO contains enough machine learning material to qualify for an answer. The book that I'm talking about is "Computer Vision: Algorithms and Applications" by Richard Szeliski (Microsoft Research). One of the advantages of this book versus the one by Prince is... narrower margins, which allow for larger font size and, thus, better readability. Also, the book by Szeliski is very practical. Since both books share significant content, but have somewhat different focus, in my opinion, they complement each other very well. All this, among other advantages, makes it very easy for me to highly recommend Szeliski's book.

Posted: April 12, 2015, 6:27 am

Answer by Aleksandr Blekh for 3 categorical IV and 1 categorical DV -- what test to use?

I would suggest the following high-level data analysis strategy/workflow:

  1. Start with performing exploratory data analysis (EDA). This will provide you with a sense of your data set as well as reveal the data set's features, which might be helpful in further steps (assumptions, etc.).

  2. Perform regression analysis. Your statement about the inability to use logistic regression is incorrect, but this is due to a common confusion: the term logistic regression is often used to refer to a model with a binary DV. Indeed, logistic regression is applicable in your case and is referred to as multinomial logistic regression, since your DV is of unordered categorical type. Should your DV be ordered, that would be a case for ordered logistic regression. The analysis IMHO should include evaluating the model's goodness-of-fit (GoF) and other relevant metrics (see above-referenced articles as a starting point, including for information on tests, etc.).

  3. Interpret the results of your analysis, based on your research goals and questions.

Posted: April 9, 2015, 9:01 am

Answer by Aleksandr Blekh for Change Point Analysis for Environmental Data

In addition to the previous nice answers (+1 to both), I'd like to offer the following insights:

  • Consider using entropy-based approaches, methods and measures for change point analysis. Check my related answer for some ideas (it focuses on time series, but I see no reason why the same approach cannot be applied to some other domains).

  • Consider using the Early Warning Signals (EWS) Toolbox and the corresponding R package earlywarnings. The toolbox (methods and software) includes, in addition to time series analysis, spatial data analysis, which AFAIK is a significant part of environmental data analysis (e.g., see the EWS site's menus Spatial Indicators and Case Studies).

Posted: April 9, 2015, 7:58 am

Answer by Aleksandr Blekh for Locally weighted regression VS kernel linear regression?

Here's how I understand the distinction between the two methods (don't know what third method you're referring to - perhaps, locally weighted polynomial regression due to the linked paper).

Locally weighted regression is a general non-parametric approach, based on linear and non-linear least squares regression. Kernel linear regression is IMHO essentially an adaptation (variant) of a general locally weighted regression in the context of kernel smoothing. It seems that the main advantage of kernel linear regression is that it automatically eliminates the domain boundaries bias, associated with locally weighted approach (Hastie, Tibshirani & Friedman, 2009; for that as well as a general overview, see sections 6.1-6.3, pp. 192-201). This phenomenon is called automatic kernel carpentry (Hastie & Loader, 1993; Hastie et al., 2009; Müller, 1993). More details on locally weighted regression can be found in the paper by Ruppert and Wand (1994).

Because of their different presentation styles, some other sources on the topic might also be helpful: for example, this book (Chapter 20.2 on linear smoothing, which replaces a page whose link is now dead), these class notes slides on kernel methods and this class notes page on local learning approaches. I also like this blog post and this blog post, as they are relevant and nicely blend theory with examples in R and Python, respectively.


Hastie, T., & Loader, C. (1993). Local regression: Automatic kernel carpentry. Statistical Science, 8(2), 120-143.

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference and prediction (2nd ed.). New York: Springer-Verlag.

Müller, H.-G. (1993). [Local Regression: Automatic Kernel Carpentry]: Comment. Statistical Science, 8(2), 134-139.

Ruppert, D., & Wand, M. (1994). Multivariate locally weighted least-squares regression. The Annals of Statistics, 22(3), 1346-1370.

Posted: March 27, 2015, 5:23 am

Answer by Aleksandr Blekh for What is the "partial" in partial least squares methods?

I would like to answer this question largely from the historical perspective, which is quite interesting. Herman Wold, who invented the partial least squares (PLS) approach, did not start using the term PLS (or even mentioning the term partial) right away. During the initial period (1966-1969), he referred to this approach as NILES - an abbreviation of the title of his initial paper on this topic, Nonlinear Estimation by Iterative Least Squares Procedures, published in 1966.

As we can see, the procedures that would later be called partial were initially referred to as iterative, with the focus on the iterative nature of the procedure of estimating weights and latent variables (LVs). The "least squares" term comes from using ordinary least squares (OLS) regression to estimate other unknown parameters of a model (Wold, 1980). It seems that the term "partial" has its roots in the NILES procedures, which implemented "the idea of split the parameters of a model into subsets so they can be estimated in parts" (Sanchez, 2013, p. 216; emphasis mine).

The first use of the term PLS occurred in the paper Nonlinear iterative partial least squares (NIPALS) estimation procedures, whose publication marks the next period of PLS history - the NIPALS modeling period. The 1970s and 1980s became the soft modeling period, when, influenced by Karl Jöreskog's LISREL approach to SEM, Wold transformed the NIPALS approach into soft modeling, which essentially formed the core of the modern PLS approach (the term PLS became mainstream at the end of the 1970s). The 1990s, the next period in PLS history, which Sanchez (2013) calls the "gap" period, were marked largely by a decrease in its use. Fortunately, starting from the 2000s (the consolidation period), PLS has enjoyed a return as a very popular approach to SEM analysis, especially in the social sciences.

UPDATE (in response to amoeba's comment):

  • Perhaps, Sanchez's wording is not ideal in the phrase that I've cited. I think that "estimated in parts" applies to latent blocks of variables. Wold (1980) describes the concept in detail.
  • You're right that NIPALS was originally developed for PCA. The confusion stems from the fact that there exist both linear PLS and nonlinear PLS approaches. I think that Rosipal (2011) explains the differences very well (at least, this is the best explanation that I've seen so far).

UPDATE 2 (further clarification):

In response to concerns, expressed in amoeba's answer, I'd like to clarify some things. It seems to me that we need to distinguish the use of the word "partial" between NIPALS and PLS. That creates two separate questions about 1) the meaning of "partial" in NIPALS and 2) the meaning of "partial" in PLS (that's the original question by Phil2014). While I'm not sure about the former, I can offer further clarification about the latter.

According to Wold, Sjöström and Eriksson (2001),

The "partial" in PLS indicates that this is a partial regression, since ...

In other words, "partial" stems from the fact that data decomposition by NIPALS algorithm for PLS may not include all components, hence "partial". I suspect that the same reason applies to NIPALS in general, if it's possible to use the algorithm on "partial" data. That would explain "P" in NIPALS.

In terms of using the word "nonlinear" in NIPALS definition (do not confuse with nonlinear PLS, which represents nonlinear variant of the PLS approach!), I think that it refers not to the algorithm itself, but to nonlinear models, which can be analyzed, using linear regression-based NIPALS.

UPDATE 3 (Herman Wold's explanation):

While Herman Wold's 1969 paper seems to be the earliest paper on NIPALS, I have managed to find another one of the earliest papers on this topic. That is a paper by Wold (1974), where the "father" of PLS presents his rationale for using the word "partial" in NIPALS definition (p. 71):

3.1.4. NIPALS estimation: Iterative OLS. If one or more variables of the model are latent, the predictor relations involve not only unknown parameters, but also unknown variables, with the result that the estimation problem becomes nonlinear. As indicated in 3.1 (iii), NIPALS solves this problem by an iterative procedure, say with steps s = 1, 2, ... Each step s involves a finite number of OLS regressions, one for each predictor relation of the model. Each such regression gives proxy estimates for a sub-set of the unknown parameters and latent variables (hence the name partial least squares), and these proxy estimates are used in the next step of the procedure to calculate new proxy estimates.
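To show what "estimated in parts" can look like in practice, here is a minimal sketch of NIPALS for the simplest case mentioned above, namely extracting the first principal component in PCA. It is written in plain Python, it is not Wold's PLS path modeling algorithm, and the data and the function name nipals_first_component are made up for illustration. Each iteration alternates between two least squares steps: the loadings are estimated with the scores held fixed, and then the scores are estimated with the loadings held fixed.

```python
import math


def nipals_first_component(rows, n_iter=100, tol=1e-12):
    """First principal component of a column-centered data matrix via NIPALS.

    rows: list of observations, each a list of centered variable values.
    Returns (scores t, unit-length loadings p).
    """
    n_vars = len(rows[0])
    t = [row[0] for row in rows]  # start from the first column as proxy scores
    p = [0.0] * n_vars
    for _ in range(n_iter):
        # Part 1: least squares regression of each column on t gives the loadings
        tt = sum(v * v for v in t)
        p_new = [sum(row[j] * ti for row, ti in zip(rows, t)) / tt
                 for j in range(n_vars)]
        norm = math.sqrt(sum(v * v for v in p_new))
        p_new = [v / norm for v in p_new]
        # Part 2: least squares regression of each row on p gives new proxy scores
        t = [sum(x * pj for x, pj in zip(row, p_new)) for row in rows]
        change = sum((a - b) ** 2 for a, b in zip(p_new, p))
        p = p_new
        if change < tol:
            break
    return t, p


# Hypothetical data: the second variable is roughly twice the first one.
raw = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.0)]
means = [sum(col) / len(raw) for col in zip(*raw)]
centered = [[v - m for v, m in zip(row, means)] for row in raw]

scores, loadings = nipals_first_component(centered)

print([round(v, 3) for v in loadings])
```

Neither step estimates everything at once: each regression treats the other block of unknowns as known proxies, which matches the "sub-set of the unknown parameters and latent variables" wording in the quote above.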


Rosipal, R. (2011). Nonlinear partial least squares: An overview. In H. Lodhi and Y. Yamanishi (Eds.), Chemoinformatics and Advanced Machine Learning Perspectives: Complex Computational Methods and Collaborative Techniques, pp. 169-189. ACCM, IGI Global.

Sanchez, G. (2013). PLS path modeling with R. Berkeley, CA: Trowchez Editions.

Wold, H. (1974). Causal flows with latent variables: Partings of the ways in the light of NIPALS modelling. European Economic Review, 5, 67-86. North Holland Publishing.

Wold, H. (1980). Model construction and evaluation when theoretical knowledge is scarce: Theory and applications of partial least squares. In J. Kmenta and J. B. Ramsey (Eds.), Evaluation of econometric models, pp. 47-74. New York: Academic Press.

Wold, S., Sjöström, M., & Eriksson, L. (2001). PLS-regression: A basic tool of chemometrics. Chemometrics and Intelligent Laboratory Systems, 58, 109-130. doi:10.1016/S0169-7439(01)00155-1

Posted: January 29, 2015, 7:08 pm

Answer by Aleksandr Blekh for How to determine Forecastability of time series?

Parameters m and r, involved in calculation of approximate entropy (ApEn) of time series, are window (sequence) length and tolerance (filter value), correspondingly. In fact, in terms of m, r as well as N (number of data points), ApEn is defined as "natural logarithm of the relative prevalence of repetitive patterns of length m as compared with those of length m + 1" (Balasis, Daglis, Anastasiadis & Eftaxias, 2011, p. 215):

$$ ApEn(m, r, N) = \Phi^m(r) - \Phi^{m+1}(r), $$

where

$$ \Phi^m(r) = \frac{1}{N - m + 1} \sum_{i=1}^{N - m + 1} \ln C^m_i(r) $$

Therefore, it appears that changing the tolerance r allows one to control the (temporal) granularity of determining the entropy of a time series. Nevertheless, using the default values for both the m and r parameters in the pracma package's entropy function calls works fine. The only fix that needs to be made to see the correct relation between the entropy values for all three time series (lower entropy for more well-defined series, higher entropy for more random data) is to increase the length of the random data vector:

 library(pracma)  # provides approx_entropy()
 all.series <- list(series1 = AirPassengers,
                    series2 = sunspot.year,
                    series3 = rnorm(500)) # <== size increased
 sapply(all.series, approx_entropy)
  series1   series2   series3 
  0.5157758 0.7622430 1.4741971 

The results are as expected - as the predictability of fluctuations decreases from the most deterministic series1 to the most random series3, their entropy consequently increases: ApEn(series1) < ApEn(series2) < ApEn(series3).
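For readers who want to see how the ApEn definition above translates into code, here is a minimal sketch in plain Python. It follows the formula directly (Chebyshev distance between windows, self-matches included) and is not the pracma implementation; the default tolerance of 0.2 standard deviations is a commonly used choice that I assume here, and the two test series are made up for illustration.

```python
import math
import random


def approx_entropy(series, m=2, r=None):
    """Approximate entropy ApEn(m, r, N) of a sequence of numbers."""
    n = len(series)
    if r is None:
        # Assumed default tolerance: 0.2 times the sample standard deviation
        mean = sum(series) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in series) / (n - 1))
        r = 0.2 * sd

    def phi(length):
        count = n - length + 1
        windows = [series[i:i + length] for i in range(count)]
        total = 0.0
        for w in windows:
            # C_i(r): share of windows within tolerance r of window i
            matches = sum(
                1 for v in windows
                if max(abs(a - b) for a, b in zip(w, v)) <= r
            )
            total += math.log(matches / count)
        return total / count

    return phi(m) - phi(m + 1)


# Hypothetical series: a strictly periodic one and a purely random one.
periodic = [0.0, 1.0] * 100
random.seed(1)
noise = [random.gauss(0.0, 1.0) for _ in range(200)]

apen_periodic = approx_entropy(periodic)
apen_noise = approx_entropy(noise)

print(round(apen_periodic, 4), round(apen_noise, 4))
```

The periodic series should yield a value very close to zero, while the random one should yield a clearly larger value, mirroring the ordering of the three R series above. Note that this direct implementation needs on the order of N^2 window comparisons, so it is suitable only for short series.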

In regard to other measures of forecastability, you may want to check mean absolute scaled errors (MASE) - see this discussion for more details. Forecastable component analysis also seems to be an interesting and new approach to determining forecastability of time series. And, expectedly, there is an R package for that, as well - ForeCA.

 library(ForeCA)
 sapply(all.series,
        Omega, spectrum.control = list(method = "wosa"))
 series1   series2   series3 
 41.239218 25.333105  1.171738 

Here $\Omega \in [0, 1]$ is a measure of forecastability, where $\Omega(\text{white noise}) = 0\%$ and $\Omega(\text{sinusoid}) = 100\%$.


Balasis, G., Daglis, I. A., Anastasiadis, A., & Eftaxias, K. (2011). Detection of dynamical complexity changes in Dst time series using entropy concepts and rescaled range analysis. In W. Liu and M. Fujimoto (Eds.), The Dynamic Magnetosphere, IAGA Special Sopron Book, Series 3, 211. doi:10.1007/978-94-007-0501-2_12. Springer.

Goerg, G. M. (2013). Forecastable component analysis. JMLR, W&CP (2), 64-72.

Posted: January 19, 2015, 11:56 pm

Answer by Aleksandr Blekh for Dynamic Time Warping Clustering

Yes, you can use DTW approach for classification and clustering of time series. I've compiled the following resources, which are focused on this very topic (I've recently answered a similar question, but not on this site, so I'm copying the contents here for everybody's convenience):

Posted: January 5, 2015, 3:49 pm

Answer by Aleksandr Blekh for Complete machine learning library for Java/Scala

You may find this extensive curated list of ML libraries, frameworks and software tools helpful. In particular, it contains the resources that you're looking for - ML lists for Java and for Scala.

Posted: August 28, 2014, 7:14 am


User Aleksandr Blekh - Stack Overflow

most recent 30 from

Answer by Aleksandr Blekh for Generating models for Flask-AppBuilder using flask-sqlqcodegen

After some Internet searching, I ran across an issue on GitHub, which described exactly the same problem. However, the most recent recommendation at the time produced another error instead of the original one. In the discussion with the author of flask-sqlcodegen, it turned out that there exists a pull request (PR), kindly provided by a project contributor, that apparently should fix the problem. After updating my local repository, followed by rebuilding and reinstalling the software, I was able to successfully generate models for my database. The whole process consists of the following steps.

  1. Change to directory with a local repo of flask-sqlcodegen.
  2. If you made any changes, like I did, stash them: git stash.
  3. Update repo: git pull origin master (now includes that PR).
  4. Rebuild/install software: python setup.py install.
  5. If you need your prior changes, restore them: git stash pop. Otherwise, discard them: git reset --hard.
  6. Change to your Flask application directory and auto-generate the models, as follows.

    sqlacodegen --flask --outfile MODELS_FILE postgresql+psycopg2://USER:PASS@HOST/DBNAME

Acknowledgements: Big thank you to Kamil Sindi (the flask-sqlcodegen's author) for the nice software and rapid & helpful feedback as well as to Alisdair Venn for that valuable pull request.

Posted: July 31, 2016, 1:53 am

Answer by Aleksandr Blekh for Strange MySQL "read-only" error

Based on my question's comments (special thanks to @Eborbob) and my update, I have figured out that some process in the system resets the read-only flag to ON (1), which seems to trigger the issue and results in the website becoming inaccessible. In order to fix the problem as well as make this fix persistent across software and server restarts, I decided to update the MySQL configuration file my.cnf and restart the DB server.

After making the relevant update (in my case, addition) to the configuration file


let's verify that the flag is indeed set to OFF (0):

# mysql
mysql> SELECT @@global.read_only;
+--------------------+
| @@global.read_only |
+--------------------+
|                  0 |
+--------------------+
1 row in set (0.00 sec)

Finally, let's restart the MySQL server (for some reason, dynamic reloading of the MySQL configuration via /etc/init.d/mysql reload didn't work, so I had to restart the database server explicitly):

service mysql stop
service mysql start

Voila! Now access to the website is restored. I will update my answer if anything changes.

Posted: February 18, 2016, 2:34 am

Answer by Aleksandr Blekh for Error trying to start Notification Server

I have just figured this out. As I said in the recent update, I was trying to start the notification server as a non-'root' user. Looking again at the permissions of the /var/tmp/aphlict/pid folder, the problem suddenly became crystal clear and trivial.

ls -l /var/tmp/aphlict

total 4
drwxr-xr-x 2 root root 4096 Nov 16 13:40 pid

Therefore, all that needed to be done to fix the problem is to make the directory writable for everyone (I hope that this approach does not create a potential security issue):

chmod go+w /var/tmp/aphlict/pid

su MY_NON_ROOT_USER_NAME -c './bin/aphlict start'
Aphlict Server started.

Problem solved. By the way, for the Notification Server to work properly, do I need to open port 22281, in addition to already opened 22280? (Please answer in comments. Thank you!)

Posted: November 17, 2015, 6:58 pm

Answer by Aleksandr Blekh for Converting to JSON (key,value) pair using R

The output that you're seeing is produced by jsonlite, when a data set is a list:



Make sure that your data set is indeed a data frame and you will see the expected output:

library(jsonlite)
toJSON(head(iris), pretty = TRUE)

[
    {
        "Sepal.Length": 5.1,
        "Sepal.Width": 3.5,
        "Petal.Length": 1.4,
        "Petal.Width": 0.2,
        "Species": "setosa"
    },
    {
        "Sepal.Length": 4.9,
        "Sepal.Width": 3,
        "Petal.Length": 1.4,
        "Petal.Width": 0.2,
        "Species": "setosa"
    },
    {
        "Sepal.Length": 4.7,
        "Sepal.Width": 3.2,
        "Petal.Length": 1.3,
        "Petal.Width": 0.2,
        "Species": "setosa"
    },
    {
        "Sepal.Length": 4.6,
        "Sepal.Width": 3.1,
        "Petal.Length": 1.5,
        "Petal.Width": 0.2,
        "Species": "setosa"
    },
    {
        "Sepal.Length": 5,
        "Sepal.Width": 3.6,
        "Petal.Length": 1.4,
        "Petal.Width": 0.2,
        "Species": "setosa"
    },
    {
        "Sepal.Length": 5.4,
        "Sepal.Width": 3.9,
        "Petal.Length": 1.7,
        "Petal.Width": 0.4,
        "Species": "setosa"
    }
]

Posted: April 13, 2015, 8:54 am

Answer by Aleksandr Blekh for View selected sample for each replication in bootstrap loop

Based on your comments, I've fixed the code. Here's the version that I tested and it seems to work:

x <- c(20,54,18,65,87,49,45,94,22,15,16,15,84,55,44,13,16,65,48,98,74,56,97,11,25,43,32,74,45,19,56,874,3,56,89,12,28,71,93)
n <- length(x)

nBoot <- 3; mn <- numeric(nBoot)
repl <- matrix(x, nrow=nBoot, ncol=length(x))

for (boots in 1:nBoot) {
  repl[boots, ] <- sample(x, n, replace=TRUE)  # draw one bootstrap sample
  print(repl[boots, ])                         # view the sample selected in this replication
  mn[boots] <- mean(repl[boots, ])             # mean of this replication only
}

Posted: April 8, 2015, 1:43 pm

Answer by Aleksandr Blekh for Algorithm for multiple extended string matching

I think that it might make sense to start by reading the relevant Wikipedia article section. You can then perform a literature review on algorithms that implement regular expression pattern matching.

In terms of practical implementation, there is a large variety of regular expression (regex) engines in the form of libraries, each targeting one or more programming languages. Most likely, the best and most popular option is the C/C++ PCRE library, with its newest version, PCRE2, released in 2015. Another C++ regex library, which is quite popular at Google, is RE2. I recommend reading this paper, along with the two others linked within the article, for details on algorithms, implementation and benchmarks. Just recently, Google released RE2/J, a linear-time version of RE2 for Java: see this blog post for details. Finally, I ran across an interesting pure C regex library, TRE, which offers way too many cool features to list here. However, you can read about them all on this page.

P.S. If the above is not enough for you, feel free to visit this Wikipedia page for details of many more regex engines/libraries and their comparison across several criteria. Hope my answer helps.

Posted: March 10, 2015, 8:57 am

Answer by Aleksandr Blekh for Add existing scripts to an Rstudio project

Technically, you can change the working directory programmatically within a project, but this is considered a very poor practice and is strongly discouraged. However, you can set the working directory at the project's top level (the full path to Folder A, in your example) and then refer to scripts and objects located in Folders 1-3 via the corresponding relative paths. For example: "./Folder1/MyScript.R" or "./Folder2/MyData.csv".
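
To make the relative-path idea concrete, here is a minimal sketch. The folder layout is created in a temporary directory, and the file contents and variable names are hypothetical; the setwd() call only stands in for what RStudio does when it opens a project, so it is not something I would put into project code.

```r
# Hypothetical project layout created in a temporary directory
projectRoot <- file.path(tempdir(), "FolderA")
dir.create(file.path(projectRoot, "Folder1"), recursive = TRUE, showWarnings = FALSE)
writeLines("answer <- 42", file.path(projectRoot, "Folder1", "MyScript.R"))

# Stand-in for RStudio, which sets the working directory when a project is opened
oldWd <- setwd(projectRoot)

# Inside the project, scripts are referenced relative to the top level
scriptPath <- file.path(".", "Folder1", "MyScript.R")
source(scriptPath)

setwd(oldWd)  # restore the previous working directory
```

Because the script is addressed relative to the project's top level, the same line keeps working when the whole project folder is moved or shared.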

Posted: February 24, 2015, 7:56 pm

Answer by Aleksandr Blekh for R equivalent to matrix row insertion in Matlab

You certainly can have similar functionality by using R's integration with the clipboard. In particular, standard R functions that provide support for clipboard operations include connection functions (base package), such as file(), url(), pipe() and others, clipboard text transfer functions (utils package), such as readClipboard() and writeClipboard(), as well as data import functions that accept a connection argument, such as scan() (base package) or read.table() (utils package).

This functionality differs from platform to platform. In particular, on Windows you need to use the connection name "clipboard", whereas on Mac (OS X) you can use pipe("pbpaste") (see this StackOverflow discussion for more details and alternative methods). It appears that the Kmisc package offers a platform-independent approach to that functionality; however, I haven't used it so far, so I can't really confirm that it works as expected. See this discussion for details.

The following code is the simplest example of how you would use the above-mentioned functionality:

read.table("clipboard", sep="\t", header=header, ...)

An explanation and further examples are available in this blog post. As far as plotting the imported data goes, RStudio not only allows you to use standard R approaches, but also adds an element of interactivity via its bundled manipulate package. See this post for more details and examples.
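
As a rough sketch of the platform differences described above, the two variants can be wrapped in one function. Both helper names are hypothetical, only Windows and OS X are covered, and the last line uses literal tab-separated text because a clipboard is not always available.

```r
# Pick a clipboard source for the current platform (hypothetical helper)
clipboardSource <- function() {
  sysname <- Sys.info()[["sysname"]]
  if (sysname == "Windows") {
    "clipboard"          # special connection name on Windows
  } else if (sysname == "Darwin") {
    pipe("pbpaste")      # OS X: read the output of pbpaste
  } else {
    stop("no clipboard source defined for platform: ", sysname)
  }
}

# Read tab-separated data (for example, copied from a spreadsheet)
readClipboardTable <- function(header = TRUE, ...) {
  read.table(clipboardSource(), sep = "\t", header = header, ...)
}

# The same parsing, demonstrated on literal tab-separated text
demo <- read.table(text = "x\ty\n1\t2\n3\t4", sep = "\t", header = TRUE)
print(demo)
```

After copying a block of cells, calling readClipboardTable() should return them as a data frame on the two supported platforms.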

Posted: February 15, 2015, 6:20 am

Answer by Aleksandr Blekh for R: Export CrossTable to Latex

Based on the gmodels package documentation, the CrossTable() function returns its results as a list. Therefore, I don't see any problems with exporting the results to LaTeX format. You just need to convert that list into a data frame. Then you have a choice of various R packages containing functions that convert a data frame into LaTeX format. For example, you can use df2latex() from the psych package. Alternatively, you can use either latex() or latexTabular(), both from the Hmisc package. The former converts a data frame into a TeX file, whereas the latter converts a data frame into LaTeX code for the corresponding object in a tabular environment (a LaTeX table).


Initial attempt - doesn't work, as CrossTable()'s result is not a simple list:


let <- sample(c("A","B"), 10, replace = TRUE)
num <- sample(1:3, 10, replace = TRUE)
tab <- CrossTable(let, num, prop.c = FALSE, prop.t = FALSE, prop.chisq = FALSE)

myList <- lapply(1:ncol(tab), function(x) as.character(unlist(tab[, x])))
myDF <-, stringsAsFactors = FALSE)
myLatex <- latexTabular(myDF)

Further efforts

Well, it's a little trickier than I initially thought, but there are two ways, as I see it. Please see below.

The first option is to convert the CrossTable to data frame

myDF <- as.data.frame(tab)

and then manually reshape the initial data frame per your requirements (sorry, I'm not too familiar with cross-tabulation).

The second option uses the Rz package (installation is a bit annoying, as it wants to install Gtk, but after closing the GUI you can call functions in the R session normally), as follows.


let <- sample(c("A","B"), 10, replace = TRUE)
num <- sample(1:3, 10, replace = TRUE)
tab <- crossTable(let, num) # note that I use crossTable() from 'Rz' package

# Console (default) output

let     1      2      3    Total 
A          0      2      1      3
        0.0%  66.7%  33.3%   100%
B          1      2      4      7
       14.3%  28.6%  57.1%   100%
Total      1      4      5     10
       10.0%  40.0%  50.0%   100%

Chi-Square Test for Independence

Number of cases in table: 10 
Number of factors: 2 
Test for independence of all factors:
    Chisq = 1.4286, df = 2, p-value = 0.4895
    Chi-squared approximation may be incorrect
Please install vcd package to output Cramer's V.

# Now use LaTeX output

summary(tab, latex = TRUE)
    \caption{let $\times$ num}
         &                      \multicolumn{3}{c}{num}                      &                           \\
    let  &\multicolumn{1}{c}{1}&\multicolumn{1}{c}{2}&\multicolumn{1}{c}{3}&\multicolumn{1}{c}{Total} \\
    A    &             0        &             2        &             1        &               3           \\
         &        0.0\%        &       66.7\%        &       33.3\%        &          100\%           \\
    B    &             1        &             2        &             4        &               7           \\
         &       14.3\%        &       28.6\%        &       57.1\%        &          100\%           \\
    Total&             1        &             4        &             5        &              10           \\
         &       10.0\%        &       40.0\%        &       50.0\%        &          100\%           \\

Chi-Square Test for Independence

Number of cases in table: 10 
Number of factors: 2 
Test for independence of all factors:
    Chisq = 1.4286, df = 2, p-value = 0.4895
    Chi-squared approximation may be incorrect
Please install vcd package to output Cramer's V.


Posted: January 29, 2015, 4:39 am

Answer by Aleksandr Blekh for How can I create a graph in R from a table with four variables? (Likert scale)

If you prefer a ggplot2-based solution, as an alternative to suggested base R graphics solution, I think that it should be along the following lines. A minimal reproducible example (MRE), based on your data follows.

if (!suppressMessages(require(ggplot2))) install.packages('ggplot2')
if (!suppressMessages(require(reshape))) install.packages('reshape')

myData <- data.frame('Gov. agencies' = c(3, 10, 1, 8, 7), 'Local authority' = c(3, 6, 3, 4, 13), 'Police forces' = c(3, 6, 3, 4, 13), 'NGO/third sector' = c(2, 5, 1, 10, 11), response = c('Not familiar', 'Somewhat familiar', 'Neutral', 'Familiar', 'Very familiar'))

levels(myData$response) <- c('Not familiar', 'Somewhat familiar', 'Neutral', 'Familiar', 'Very familiar')

myDataMelted <- melt(myData, id.vars = 'response')

ggplot(myDataMelted, aes(x=response, y=value, fill = variable))+
    geom_bar(stat = "identity", position = "dodge", color = "black")

The result:

enter image description here

WARNING! Please note that the above code is posted as a proof-of-concept and it is not only not complete in terms of labeling/beautification, but it contains an error (I think, not a major one), which I hope more knowledgeable people here will help me to fix, so that you could have an alternative solution (and I could have some educational experience and peace of mind, after all the trouble :-). The error is that groups are not in the correct order / do not belong to the correct categories. I've tried to alleviate that problem via levels(), but probably still missed or forgot some other point.
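
One possible cause of the ordering problem, offered as a sketch and not as a confirmed diagnosis: when response is a factor, levels<- only renames the existing (alphabetically sorted) levels, whereas factor() with an explicit levels argument reorders them without touching the values. The small data frame below is a cut-down, hypothetical version of the data above.

```r
responses <- c('Not familiar', 'Somewhat familiar', 'Neutral',
               'Familiar', 'Very familiar')

# Cut-down data; stringsAsFactors = TRUE mimics the default of R versions before 4.0
myData <- data.frame(response = responses,
                     value = c(3, 10, 1, 8, 7),
                     stringsAsFactors = TRUE)

# levels<- relabels the alphabetical levels, so values get attached to wrong labels
relabelled <- myData$response
levels(relabelled) <- responses

# factor() with explicit levels changes only the order of the levels
reordered <- factor(as.character(myData$response), levels = responses)

print(as.character(relabelled))  # no longer matches 'responses'
print(as.character(reordered))   # matches 'responses'
print(levels(reordered))         # levels in the intended Likert order
```

With the reordered factor in place of the levels() call, the bars should follow the intended Likert order on the x axis.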

Posted: January 13, 2015, 2:51 am

Answer by Aleksandr Blekh for Gain Package Installation error in R 3.1.2

I believe that the problem lies in your corrupted, incomplete or otherwise incorrect R environment. I was able to install that package without any problems at all just by issuing the default command:

> install.packages("gains")
Installing package into ‘C:/Users/Alex/Documents/R/win-library/3.1’
(as ‘lib’ is unspecified)
trying URL ''
Content type 'application/zip' length 35802 bytes (34 Kb)
opened URL
downloaded 34 Kb

package ‘gains’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

loaded via a namespace (and not attached):
[1] tools_3.1.1

As a quick solution to the problem, I suggest specifying the CRAN mirror explicitly:

install.packages("gains", repos = "")
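
Alternatively, the mirror can be set once per session. This is a minimal sketch; the CRAN cloud mirror URL below is my assumption and not necessarily the mirror used in the command above.

```r
# Assumption: the CRAN cloud mirror; substitute any mirror you prefer
options(repos = c(CRAN = "https://cloud.r-project.org"))

# Later calls can then omit the 'repos' argument, for example:
# install.packages("gains")

# Check which mirror is currently configured
print(getOption("repos")[["CRAN"]])
```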

Posted: January 2, 2015, 11:18 am

Answer by Aleksandr Blekh for Matrix specification for simple diagram, using 'diagram' package

Finally, I have figured it out myself. It's a little tricky, but not rocket science. Thanks to everyone who tried to help or, at least, read the question. Actually, after I figured this out, I took another look at @jbaums' suggestion above and realized that it is basically the same, discounting non-essential details. The suggested solution (which was appearing incorrectly, as shown above) was tested in my RStudio, whereas, since my machine with RStudio Server was down, I had to test my solution on R-Fiddle... The same company. Same (similar) technology. Go figure. Anyway, here is my obligatory minimal reproducible example (MRE):


connect <- c(0,0,0,0,

M <- matrix(nrow=4, ncol=4, byrow=TRUE, data=connect)
p <- plotmat(M, pos=c(1, 2, 1), name='', box.col="lightblue", curve=0)

MRE result:

enter image description here

Posted: November 24, 2014, 9:15 pm

Answer by Aleksandr Blekh for How to save an object through GGally in R

While @CMichael's comment is nice (I didn't know that, hence +1), it's applicable only if you want to save a particular plot from GGally-generated plot matrix. I believe that you'd like to save the whole plot matrix - the need, which I've recently also experienced. Therefore, you can use a standard R approach and save the graphics by opening corresponding (to desired format) graphical device, printing the object and closing the device, which will effectively save the graphics in a desired format.

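# use pdf() instead of svg(), if you want PDF output
svg("myPlotMatrix.svg", height = 7, width = 7)
g <- ggpairs(...)
print(g)   # render the whole plot matrix to the open device
dev.off()  # close the device, which writes the file

Posted: November 18, 2014, 5:33 am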

Answer by Aleksandr Blekh for knitr templating - Dynamic chunks issue

Finally, I've figured out what was causing the issue. The first part was easy. Due to the suggested simplification, I had switched from ggplot2 to standard R graphics functions. The problem is that plot() doesn't return a value/object, which is why NULLs were appearing in the output instead of plots.

The second part was a bit more tricky, but an answer to a related question clarified the situation. Based on that information, I was able to modify my MRE correspondingly, and the resulting document appears with the correct content (the same applies to the generated LaTeX source, which seems to be ready for cross-referencing).

I'm thinking about converting this code into a more generic function for reuse across my project, if time permits [shouldn't take long] (@Yihui, could this be useful for the knitr project?). Thanks to everyone who took the time to analyze, help or just read this question. I think that knitr's documentation should be clearer on issues related to producing PDF documents from RMarkdown source. My solution for the MRE follows.

---
title: "MRE: a dynamic chunk issue"
author: "Aleksandr Blekh"
output:
  pdf_document:
    fig_caption: yes
    keep_tex: yes
    highlight: NULL
---

```{r, echo=FALSE, include=FALSE}
library(knitr)    # opts_knit, opts_chunk, knit_expand(), knit()
library(ggplot2)  # qplot()

opts_knit$set(progress = F, verbose = F)
opts_chunk$set(comment=NA, warning=FALSE, message=FALSE, echo=FALSE, tidy=FALSE)
```

```{r Preparation, results='hide'}

g1 <- qplot(mpg, wt, data=mtcars)
g2 <- qplot(mpg, hp, data=mtcars)

myPlots <- list(g1, g2)

bcRefStr <- list("objType" = "fig",
                 "objs" = c("g1", "g2"),
                 "str" = "Plots \\ref{fig:g1} and \\ref{fig:g2}")
```

```{r DynamicChunk, include=FALSE}

latexObjLabel <- paste0("{{name}}\\\\label{", bcRefStr$objType, ":{{name}}", "}")

chunkName <- "{{name}}"
chunkHeader <- paste0("```{r ", chunkName, ", ")
chunkOptions <- paste0("include=TRUE, results='asis', fig.height=4, fig.width=4, fig.cap='", latexObjLabel, "'")
chunkHeaderFull <- paste0(chunkHeader, chunkOptions, "}")
chunkBody <- "print(get('{{name}}'))"

chunkText <- c(chunkHeaderFull,
               chunkBody,
               "```", "\n")

figReportParts <- lapply(bcRefStr$objs, function (x) knit_expand(text = chunkText, name = x))
```

`r knit(text = unlist(figReportParts))`

Posted: November 13, 2014, 6:42 am

Answer by Aleksandr Blekh for Has anyone tried to parallelize multiple imputation in 'mice' package?

Recently, I've tried to parallelize multiple imputation (MI) via mice package externally, that is, by using R multiprocessing facilities, in particular parallel package, which comes standard with R base distribution. Basically, the solution is to use mclapply() function to distribute a pre-calculated share of the total number of needed MI iterations and then combine resulting imputed data into a single object. Performance-wise, the results of this approach are beyond my most optimistic expectations: the processing time decreased from 1.5 hours to under 7 minutes(!). That's only on two cores. I've removed one multilevel factor, but it shouldn't have much effect. Regardless, the result is unbelievable!
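
The approach can be sketched roughly as follows. This is not my original code: the helper splitImputations(), the nhanes example data, the seeds and the core count are assumptions made for illustration, and the mice part runs only if that package is installed.

```r
library(parallel)

# Split 'm' imputations into per-worker shares that sum to 'm' (hypothetical helper)
splitImputations <- function(m, workers) {
  base <- m %/% workers
  extra <- m %% workers
  shares <- rep(base, workers) + c(rep(1, extra), rep(0, workers - extra))
  shares[shares > 0]
}

shares <- splitImputations(m = 10, workers = 4)   # 3 3 2 2

if (requireNamespace("mice", quietly = TRUE)) {
  # mclapply() relies on forking, so only one core can be used on Windows
  cores <- if (.Platform$OS.type == "windows") 1L else 2L

  # Each worker runs mice() for its own share, with its own seed
  parts <- mclapply(seq_along(shares), function(i) {
    mice::mice(mice::nhanes, m = shares[i], seed = 100 + i, printFlag = FALSE)
  }, mc.cores = cores)

  # Combine the partial results into a single imputed data object
  imputed <- Reduce(mice::ibind, parts)
}
```

Giving every worker a different seed matters here; otherwise the workers could produce identical imputations.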

Posted: October 2, 2014, 3:44 am

Answer by Aleksandr Blekh for ggplot2 log transformation for data and scales

Finally, I have figured out the issues, removed my previous answer and I'm providing my latest solution below (the only thing I haven't solved is legend panel for components - it doesn't appear for some reason, but for an EDA to demonstrate the presence of mixture distribution I think that it is good enough). The complete reproducible solution follows. Thanks to everybody on SO who helped w/this directly or indirectly.



library(ggplot2)
library(mixtools)      # normalmixEM()
library(RColorBrewer)  # brewer.pal()

NUM_COMPONENTS <- 2    # assumed value; the original definition is missing

set.seed(12345) # for reproducibility

data(diamonds, package='ggplot2')  # use built-in data
myData <- diamonds$price

calc.components <- function(x, mix, comp.number) {

  mix$lambda[comp.number] *
    dnorm(x, mean = mix$mu[comp.number], sd = mix$sigma[comp.number])
}

# 'comp.fun' and 'mix.info' are assumed names; the originals are missing
overlayHistDensity <- function(data, comp.fun) {

  # extract 'k' components from mixed distribution 'data'
  mix.info <- normalmixEM(data, k = NUM_COMPONENTS,
                          maxit = 100, epsilon = 0.01)

  numComponents <- length(mix.info$sigma)
  message("Extracted number of component distributions: ",
          numComponents)

  DISTRIB_COLORS <-
    suppressWarnings(brewer.pal(NUM_COMPONENTS, "Set1"))

  # create (plot) histogram and ...
  g <- ggplot(as.data.frame(data), aes(x = data)) +
    geom_histogram(aes(y = ..density..),
                   binwidth = 0.01, alpha = 0.5) +
    theme(legend.position = 'top', legend.direction = 'horizontal')

  comp.labels <- lapply(seq(numComponents),
                        function (i) paste("Component", i))

  # ... fitted densities of components
  distComps <- lapply(seq(numComponents), function (i)
    stat_function(fun = comp.fun,
                  args = list(mix = mix.info, comp.number = i),
                  size = 2, color = DISTRIB_COLORS[i]))

  legend <- list(scale_colour_manual(name = "Legend:",
                                     values = DISTRIB_COLORS,
                                     labels = unlist(comp.labels)))

  return (g + distComps + legend)
}

overlayPlot <- overlayHistDensity(log10(myData), 'calc.components')
print(overlayPlot)


enter image description here

Posted: September 3, 2014, 9:43 am


User Aleksandr Blekh - Open Data Stack Exchange


Answer by Aleksandr Blekh for Free public real time social data APIs

A significant number of free public APIs are available through the Mashape API Marketplace (freemium and commercial ones are available as well). For example, its catalog includes social data APIs. I hope this is helpful.

Posted: October 25, 2015, 4:33 pm

Answer by Aleksandr Blekh for where can I find shapefiles for the highways of Puerto Rico?

Since I promised, I will answer this question without waiting for its migration, if it ever happens. Basically, I think that the best and latest data set that you can find now is this one, from the TIGER/Line database in the official US open data repository. This page is generated from a relevant search (Puerto Rico) and might also contain some data sets of interest to you.

Other potentially useful data sets include ones within U.S. Atlas TopoJSON repository (on how to use the data via R, see this nice tutorial) as well as this repository of U.S. major roads ESRI shapefile and geoJSON data sets (you have to check whether this repository contains PR data).

Posted: April 5, 2015, 7:52 am

Answer by Aleksandr Blekh for How can I get a full list of US Zipcodes with their associated names/CSAs/MSAs/lats/longs?

It appears that obtaining this data is not as trivial as it might seem at first. The following are my suggestions regarding the requested data sources and other options. It seems that currently there are two relatively solid sources of the data you're looking for:

The following additional database, which is unofficial, less solid and somewhat outdated, might also be helpful (also check the links in the "Other Sources ..." section, especially the GNIS data set; however, the GNIS data is used in the SBA's Web service).

Posted: March 24, 2015, 4:04 am

Answer by Aleksandr Blekh for Where can I find project risk management data?

Some project risk management data can be found within the following resources:

NOTES: 1) I don't think that the Project Management Institute (PMI) has project risk management data, as @Joe suggested. At least, I haven't been able to find it. 2) Obviously, there exist other industry-focused project risk management data sets, similar to the one referenced above, focused on the software / IT industry.

Posted: January 5, 2015, 1:11 pm

Answer by Aleksandr Blekh for Estimate of total public expenditure from governments around world?

Adding to the previous good answer, I think that you might find the following WDI indicators useful:

Posted: February 27, 2014, 1:03 am

Answer by Aleksandr Blekh for Results of past NCAA games

Take a look at this College Football Statistics & History site:

Posted: February 25, 2014, 7:02 pm

Answer by Aleksandr Blekh for Demography vs. political preference data sources

Check this collection of static and real-time data sets: Most indicators should be on a per-country (including per-EU-country) basis.

Also, see:

Posted: February 25, 2014, 6:53 pm
