User Aleksandr Blekh - Academia Stack Exchange

Answer by Aleksandr Blekh for Where to find the most valid databases about the ranking of countries in research and scientific production?

It seems to me that you haven't made any effort to find the information in question, which is easily available online. To illustrate, just a very brief Internet search resulted in the following arguably relevant sources: SCImago Journal & Country Rank - International Science Ranking as well as Nature Index - Country Outputs. Note that Nature Index is a much less comprehensive resource, since it is based on a "selected group of 82 high-quality science journals". Thus, it is likely less representative, though it might still be representative enough, depending on how representative those 82 journals are. On the other hand, SCImago Ranking is based on the Scopus database and advertises coverage of 5000+ publishers and citations across 239 countries.

Obviously, if you are serious about your research, you have to perform a comprehensive search for relevant sources (as well as relevant papers) and analyze their quality before using any of them to draw any conclusions.

Posted: July 26, 2018, 9:25 pm

Answer by Aleksandr Blekh for Popular Science References in Statement of Purpose for Ph.D

I can offer some advice, but take it with a grain of salt. Firstly, I suggest that the decision to include such content in a SoP should be primarily driven by the relevant expectations of the schools you have decided to apply to (many schools likely even have guidelines on preparing a SoP and other required or desired documents). Secondly, if the relevant expectations for the SoP are flexible, I suggest that you have a section (instead of a paragraph) dedicated to explaining your motivation. IMO, it should include not only motivation specific to your chosen domain (theoretical physics), but also motivation to dive into science, in general. However, to prevent this from being interpreted as less serious / too lightweight, it is important to find and use relevant solid references. One example of such a semi-popular, but solid and well-known, source is the famous talk by Dr. Richard W. Hamming "You and Your Research" (text of the talk is available here and relevant video of the talk - here). Thirdly, such semi-popular references ideally should be complemented by references to solid research papers (seminal and/or survey papers) on specific aspects of your research interests. Good luck!

Posted: December 4, 2017, 7:01 am

Answer by Aleksandr Blekh for How to find out the tier of a university in the USA?

Typically, when people talk about tiers in context of universities, the assumption is that they refer to levels of teaching and/or research activity within some standardized framework. As far as I know, the most well-known framework of this kind is Carnegie Classification of Institutions of Higher Education. Based on this question, I further assume that it implies an interest in university research tiers (which matches the traditional Carnegie Classification framework - the Basic criterion).

Thus, in order to find the research tier of a university, click the LOOKUP link in the main menu of the above-mentioned website and enter your search criteria on the Institutional Lookup page. The request will generate a list of results or a single result, depending on your criteria. Click on the linked title of the relevant institution and browse the resulting webpage. On that page, find the Basic classification row, which contains the target value. For example, performing the search for my current institution (employer), Georgia Institute of Technology (aka Georgia Tech), we find that it belongs to the category of Doctoral Universities: Highest Research Activity. This is what is usually referred to as the (highest) R1 tier (for more details about the shorthand labels, see this page).

Note that, while level of research activity is the most popular classification criterion, there are other criteria (see Listings -> Standard Listings). Also note that Carnegie Classification is focused on the academic ecosystem in the United States. I am not familiar with similar national or international frameworks (but have no doubt that some exist).

Posted: August 31, 2017, 10:37 am

Answer by Aleksandr Blekh for Do you need to define standard abbreviations like "EEG" and "fMRI" in the abstract?

Generally, in such situations, there are multiple factors at play. They include the specific field of study and relevant de facto standards (community consensus), the specific publication and relevant author instructions as well as the required (or chosen) publication style. If these factors do not clearly prescribe an abbreviation policy, I would suggest using the following strategy:

  • do not use any abbreviations in the abstract;
  • define abbreviations at their first mention after the abstract;
  • use relevant abbreviations throughout the rest of the text (occasionally returning to using the abbreviation definitions, if the frequency of appearance of the corresponding items is high).

Posted: May 18, 2017, 1:27 am

Answer by Aleksandr Blekh for infinite/sustainable hosting of a web-interface to a research database

Let me offer you several strategies. Firstly, you can consider, instead of or in addition to developing a LAMP-based Web application, publishing your research database as an open data set with a clearly documented structure (schema, ontology, etc.). The benefits of that include much wider options for long-term preservation as well as opening various opportunities for other researchers to reproduce, enhance and build new knowledge on top of your results: open data => reproducible research => scientific innovation. For this option, you can consider using some solid free open data repositories, such as figshare, Zenodo, CKAN-based Datahub and GitHub (see examples).

Secondly, you can consider a hybrid approach, which is to combine an open data set, published as mentioned above, with the relevant open source code of a Web application that anyone could download, install and use to interface with your data set. Considering the open source hosting aspect, from the above-mentioned options, the GitHub one is especially attractive, as you could seamlessly host both the data and the relevant Web application code. If you (or someone who can help you) are technical enough, you could make access to your data set, using this approach, even easier, by providing a containerized (such as Docker) version of your data and application (if the data set is not too large, you can even push the relevant public Docker image to DockerHub or other services that host public images for free). Similarly, you can publish a free software appliance - virtual machine (VM) - perhaps, some of the above-mentioned repositories (and/or maybe others) offer hosting of open VMs for free.

Thirdly, you can approach relevant non-profit organizations working in your particular domain (in addition to some universities) with a proposal to develop and host a Web application that would provide open access to your data set. If successful, the costs of developing and maintaining the database would be covered (at least, for some time) by relevant scholarships, grants or similar financial vehicles. For example, for social sciences, including humanities, you can review funding opportunities at Social Science Research Council, The Rockefeller Foundation, Carnegie Corporation, Ford Foundation, Russell Sage Foundation and many other non-profits.

Posted: September 10, 2016, 6:10 am

Answer by Aleksandr Blekh for Where can I download a large sample bibliography collection in BibTeX?

For your purposes, I would highly recommend using The Collection of Computer Science Bibliographies by Alf-Christian Achilles. This extensive collection contains 3M+ references on various CS subjects (grouped in about 1500 collections) and, besides offering search and browse interfaces, allows one to download the actual bibliographic data in BibTeX format - just select a particular bibliography and you will see the links to the source files - uncompressed and/or zipped.

P.S. Don't forget to acknowledge the value of this resource to the maintainer of this meta-collection (a thank you note will do) and, perhaps, even attribute the source, if your software will be citable.

Posted: June 26, 2016, 6:34 am

Answer by Aleksandr Blekh for What is the academic value of posts on LinkedIn?

LinkedIn certainly has some value, as a general professional networking tool. However, that value has been declining for quite a while and rather rapidly more recently due to various factors, mainly inability (or lack of care/desire) of LinkedIn's management to manage the quality of the community, provide consistent user experience, fix issues and improve features, just to name a few. Whether the recent acquisition of LinkedIn by Microsoft will help LinkedIn to remain a major player in the market and improve its dominance or, vice versa, will enable its stagnation and transform it into Microsoft's technology- and talent-focused support division, remains to be seen (I make no bets).

Having said that, the value of LinkedIn from the academic publishing perspective is quite bleak (which is a nice way to say "close to zero"), in my humble opinion. The following are some of the reasons for this assessment.

  • Quality / scientific rigor. LinkedIn lacks a peer review process, which means that any piece published there should be taken with many more grains of salt than if such a process were in place (not that one is expected).

  • Relevancy. LinkedIn is not very relevant to academia. LinkedIn's network of people from academic circles tends to be much less comprehensive than academia's specialized networks, because some colleagues, collaborators, etc. use LinkedIn rarely, if ever, or have no presence there at all. Therefore, disseminating scientific information via LinkedIn is a much less effective option. Nevertheless, if one has important academic contacts on LinkedIn that are missing from the person's other networks, it might make sense to publish there a brief post (similar to an abstract) with a link to a full-text article (preferably, a DOI link).

  • Information persistence. LinkedIn lacks a mechanism of persistent identifiers (again, not that we can expect that from a general networking platform), which implies lack of guarantee that a link to an article published there will not become broken over time (which jeopardizes scientific information dissemination).

P.S. There is no such term, as "job CV" - I understand what you're trying to say, but IMHO it sounds pretty bad and, thus, I would recommend against using such word combination in any context. HTH

Posted: June 24, 2016, 1:01 am

Answer by Aleksandr Blekh for Is there a conventional word that describes a professor for whom you were a TA

I would initially suggest the terms supervising lecturer or supervising teaching professor. However, neither term is perfect due to the potential interpretation of "lecturer" and "teaching professor" as formal positions. In order to improve this, it might make sense to add the clarifying term "class" and remove "teaching" from the second option. Therefore, my final suggestions are the following two options:

  • supervising class (course) lecturer;
  • supervising class (course) professor.

Posted: June 22, 2016, 12:08 am

Answer by Aleksandr Blekh for Trouble with advisor in final Ph.D. phase

I'm sorry to hear about your situation. I had to change my Ph.D. advisor (and I'm very glad I did), but it was in the early phase of my dissertation process. I'm quite surprised by your "discovery" that your advisor might not care about your career. Firstly, it is unlikely (why would she "tolerate" you for 5+ years then?). Secondly, if your advisor truly did not care about your career, it should have been pretty clear early in your collaboration, so either your assumption is not true, or you paid no attention to this aspect at all, which is quite difficult to believe.

Anyway, in regard to your potential actions. I strongly recommend that you consider all possibilities to avoid changing your advisor, considering how far you are in the program. Changing an advisor is not only an administrative / logistical nightmare, but, if it would require you to start your research from scratch or almost from scratch, it would be extremely depressing, to say the least.

If you could save five years of work and life by defending your dissertation and graduating, even if parting with your advisor not very amicably, I would say that it is worth serious consideration. The two obvious dangers in this case would be: 1) not being able to defend your dissertation successfully; 2) potential problems with obtaining a recommendation letter from your advisor (she could either decline, or give a negative or not so positive one). The second aspect is quite important, as your postgraduate applications, not listing your dissertation advisor as a referee, might raise quite a lot of eyebrows, with potentially negative consequences in regard to your postdoctoral offers / career.

You have to carefully think about all these (and other) aspects, consider feedback from people here and your own environment, but, ultimately, only you can decide the best course of action, based on various details, known to you only, as well as your gut feeling, as some new research suggests.

Regardless of what you decide on the subject and how you part with your advisor, I wish you to successfully graduate and achieve your professional and personal goals in the future. Be strong in staying your courses, but flexible in ways of reaching your destinations. Or, as Lao Tzu has said,

Nothing is softer or more flexible than water, yet nothing can resist it.

Posted: June 14, 2016, 6:44 am

Answer by Aleksandr Blekh for Harvard VS APA: the differences? How about mixing styles for a better clarity?

Even if your current university and situation do not call for using a specific publication style, I would strongly recommend against mixing two or more styles, even if they are not much different. The reason is pretty clear: consistency, for the sake of the readers of your publications as well as for the sake of your own sanity. Following a single style will make your life easier - if you can choose, just pick the one you feel more comfortable with or the one that is popular, or, perhaps, a de facto standard, in your field (the latter is IMHO much more important - again, that "for the sake of readers" argument).

Posted: June 8, 2016, 2:18 am

Answer by Aleksandr Blekh for Is my work a good research work?

In my opinion, academic research work should be focused more on learning in general and learning how to perform research correctly in particular, rather than on doing grandiose, novel or even "the right" research. This is especially applicable to the Master's level research, where implementation-focused work and theses are very popular (obviously, it is quite field-dependent, but here I imply the software engineering / computer science areas of research).

I don't see any reason why a good implementation-focused research work could not be published as a research paper in a solid journal. In fact, I have seen a lot of such papers (of varied quality), especially in the above-mentioned domains, published in respected peer-reviewed outlets.

Posted: June 5, 2016, 8:34 am

Answer by Aleksandr Blekh for CV for a PhD application in applied mathematics

  • Firstly, the Career Objective section is a thing of the past and should not be present in a CV or resume. Not only is it old-fashioned, it actually makes one change their CV or resume every time one applies to a different organization and position. It is much better to place relevant position-focused information in a cover letter, which should be adjusted to a particular position anyway.

  • Secondly, do not put personal details, like mailing and physical address, on CV or resume. An e-mail address and, maybe, a phone number is more than enough. You don't expect potential employers to send you postal mail, do you? Plus, the physical address would jeopardize the security of one's identity.

  • Thirdly, the section Research Interests should be higher in the list - I would say, even prior to the section Education (or, at least, right after it).

  • Fourthly, I suggest that you create two versions of your CV (the following is not applicable to a resume) - one with references, for organizations that require them as part of the initial application, and another without them, for those that require them later or via a different communication channel (say, Interfolio).

  • Fifthly, go ahead and search the Internet for examples of academic cover letters (there are plenty of them - stick with the ones from reputable universities). Hope this helps. Good luck!

P.S. I would reword section titles, as follows: Conference Presentations => Talks & Presentations; Research Interest => Research Interests; Co-curricular Activities => not sure it makes sense to extract them in a separate section - why not list them below relevant educational info; Extra-curricular Activities => Extracurricular Activities.

Posted: June 5, 2016, 6:58 am

Answer by Aleksandr Blekh for Any data for average number of papers per year at different career stages?

In regard to the data, I would suggest looking at NSF's Survey of Doctorate Recipients (SDR) (select Data tab for data sets). A potentially more convenient or flexible way to access and select data of interest might be via NSF's SESTAT Data Tool (provides access to the SDR data as well).

Some data (or data sources) might be extracted from relevant literature. In particular, the study Comparing Research Productivity Across Disciplines and Career Stages uses the 2003 SDR dataset (see Table 3 for some ready-to-use numbers). Beyond the above-mentioned direct and indirect data sources, I would recommend reviewing related studies that might potentially contain or refer to relevant data. In particular, check the following papers (obviously, a non-exhaustive list).

Posted: June 3, 2016, 9:07 am

Answer by Aleksandr Blekh for Do academics look down on well-designed academic websites?

Your question is not only too broad and opinionated, but it is also formulated in such a way that it is quite difficult to answer, in general, simply because there is no universal definition of what the attribute "well-designed" means. It could mean different things to different people. There are no clear and universal criteria for judging whether a website (or any other object, for that matter) is well designed or not and, if yes, how well. Certainly, there are various heuristics and checklists for assessing the quality of design of a website, but they are not universal at all, as each criterion's weight is strongly dependent on the context, which, in this particular case, includes the goals of the assessment, the website's audience, the assessor's judgement, and the layout and essence of the site's content.

In addition to the above, the academic audience is likely to pay more attention to the essence of a website, rather than its design (unless it shows a clear disrespect to potential visitors - in the form of poor spelling, offensive language, excessive use of advertising, frequently appearing mailing list pop-up windows, extremely bright or unbalanced colors as well as accessibility, readability and navigability issues, among others).

Having said that, I don't see any reason why academics would look down on a well-designed academic website (provided that it is somehow determined that the website in question is indeed a well-designed one). That is, of course, unless the site contains irrelevant or poor quality content.

Posted: May 25, 2016, 10:16 pm

Answer by Aleksandr Blekh for Leadership Ph.D alternative

There is a wide range of choices in managerial education, in general, and in leadership education, in particular. Most universities' business schools offer various executive education programs, which range from continuing education / professional development programs, such as (hereafter, I am using Harvard University just as an example for some types of programs) these smaller scale programs, to comprehensive executive leadership programs. Executive leadership programs can be general as well as industry-oriented, such as this higher education-focused or this healthcare-focused one. Other leadership education options include more lightweight alternatives, such as relevant MOOCs with certificates (either single courses, or thematic tracks), university certificate programs (such as this one at MIT) and relevant educational programs by think tanks (like ones by Aspen Institute), non-profits (like ones by Center for Creative Leadership) and similar organizations.

Posted: March 22, 2016, 2:16 am

Answer by Aleksandr Blekh for When is it appropriate to describe research as "recent"?

Good question. The semantics of the word "recent", in general, and in academic writing, in particular, is not clearly defined (that is, fuzzy), which makes its practical use quite tricky, as evidenced by your question.

While @vonbrand's answer offers some valuable insights, such as considering the fluidity of a particular scientific field or domain, I would suggest a more practical solution to this problem, as follows. Consider the literature that you reference in a particular paper. What is the temporal range of the sources? I think that this aspect could guide you as to where the word "recent" is appropriate and where not so much.

For example, if you cite sources from the current century as well as the 1930s, then a paper from 2010 should be considered recent, but not one from 1950. If, on the other hand, your temporal range of references is rather narrow, say, the last 20 years, then you should refer to as "recent" only sources that are from approximately the last 4-5 years. You can come up with your own rule of thumb (10-20% of the total range sounds pretty reasonable). The most important aspect would be not the actual value (for the rule of thumb), but rather your consistency in applying it throughout the paper.
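The rule of thumb above can be written down as a tiny calculation. The following is only an illustrative sketch: the function names, the sample years and the default fraction of 0.2 are my own assumptions, not part of any style guide.

```python
def recent_cutoff_year(reference_years, fraction=0.2):
    """Return the earliest year that still counts as "recent".

    `fraction` is the share of the total temporal range of the cited
    sources that is treated as recent (hypothetical parameter; the rule
    of thumb above suggests something like 0.1 to 0.2).
    """
    oldest, newest = min(reference_years), max(reference_years)
    span = newest - oldest
    return newest - round(span * fraction)


def is_recent(year, reference_years, fraction=0.2):
    """Apply the same cutoff to every source, for consistency."""
    return year >= recent_cutoff_year(reference_years, fraction)


# Hypothetical reference list spanning 20 years.
years = [1996, 2003, 2010, 2016]
cutoff = recent_cutoff_year(years, fraction=0.2)
```

Whatever fraction you pick, computing the cutoff once and applying it to every source keeps the usage of "recent" consistent throughout the paper.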

Posted: March 9, 2016, 1:58 am

Answer by Aleksandr Blekh for How to properly cite a comment from reddit

I believe that your best guess is pretty close to the right answer. According to the APA Style (6th ed.), you should list as much information as possible for non-periodical publications, which you have done well. I think that your resource falls under category "Nonperiodical Web Document or Report", as described on this page of the Purdue OWL's APA Formatting and Style Guide.

However, on second thought, it seems that a more correct option to use would be APA's electronic sources guidelines for "Online Forum or Discussion Board Posting". Not only does Reddit fit this category better, but it also allows you to specify the author of the quote you are citing. Therefore, the optimal citation in question, in my opinion, should be as follows (note that I took the liberty of removing the date of retrieval, as the link you provide is a permalink and, thus, pretty stable):

Snowden, E. (2015). Just days left to kill mass surveillance under Section 215 of the Patriot Act. We are Edward Snowden and the ACLU's Jameel Jafer. AUA. Retrieved from

Posted: February 21, 2016, 6:56 am

Answer by Aleksandr Blekh for What does "to be enjoyed with all rights and privileges pertaining thereto" mean on a French diploma?

That phrase is clearly not France-specific, as @ff524 mentioned. In order to add to and further illustrate the nice answer by @vonbrand, I will share the following paper, which discusses Roman origins and Medieval expressions of the relevant phrase(s):

In addition to some comments and answers for the above-mentioned most likely duplicate question, I would add that the modern practical meaning of this phrase significantly depends on the graduate's field of study and the institution they graduated from.

In regard to the field of study, rights and privileges might include (beyond the implied rights and privileges to say that one graduated with specific degree from a particular institution, to wear the institution's regalia, to be referred to as a Dr. [for Ph.D. graduates], etc.): to be able to practice in specific regulated fields, such as medicine and law (upon satisfying additional conditions, such as attending medical residency or passing specific state's bar examination, correspondingly).

In regard to the institution, some rights and privileges include to participate in alumni activities, to retain institution's e-mail address, to get discounts on various products and services as well as on attending individual classes and, even, enrolling into certain degree programs at one's alma mater.

Posted: February 21, 2016, 1:59 am

Answer by Aleksandr Blekh for How should I state 'MS dropout' in my resume when applying for data scientist positions?

First, some advice. I agree with @gnometorule, but I would state it more strongly: IMHO and based on the limited information you've shared, it would be a mistake to drop out so close to graduation. Even though the current culture within the startup ecosystem and, overall, the tech industry largely ignores education credentials in favor of "being a hustler", "being a doer", "being street smart", etc., the data science subset of both areas actually seems to have more respect for and pay more attention to people's education. This is quite understandable, considering the relative complexity of data science and, especially, its machine learning and artificial intelligence fields of study and practice.

I would strongly suggest that you consider things in perspective and do your best to successfully finish the program. Not only will it give you some advantages when competing in the job market, but it also might be useful to you, should you decide in the future to go for a Ph.D., teach at some educational institution or pursue other opportunities (e.g., scientific research or consulting).

In regard to your specific question - should you decide to ignore my advice - I think that it would be better to formulate in your resume the phrase "MS dropout" not as such or, even, not as

"University XYZ, MS program, Statistics, Years Range, Incomplete",

but rather as a positive fact / achievement:

"University XYZ, MS program, Statistics, Years Range, Completed 90% of curriculum".

Having said that, again, I strongly suggest that you consider finishing your Master's program.

Posted: February 18, 2016, 6:40 am

Answer by Aleksandr Blekh for Pitfalls of Academic Blogging

Let me start with the following disclaimer. Firstly, I also consider myself a junior academician (defended my Ph.D. in April 2015; though I have quite a bit of industry experience). Secondly, while I have thought about starting professional (in the sense of covering both academia/research and industry, plus various interests) blogging for a while and even created my own WordPress-powered website with a blog section, I have yet to find time to start and continue blogging regularly. Having said that, everyone's situation and circumstances are different. Also, having some kind of writer's block or, rather, fear, I decided that mostly answering (and sometimes asking) questions on Stack Exchange sites as well as Quora is a gentle way of preparing myself for a more serious :-) blogging exposure.

Now, on to your questions (take my advice with a grain of salt, considering the disclaimer above).

Is academic blogging a good idea?

In my humble opinion, absolutely. I've seen a lot of academic blogs. Most of them are of good to excellent quality. Reading such an academic blog immediately adds some virtual respect points to its author's virtual balance in my brain. Sometimes it helps to find answers to my specific questions. Often, it increases my awareness of some topics or subject domains. It also helps me to understand who might be a good potential collaborator for future research or an advisor for a science-focused venture / startup. All of the above-mentioned points are potential benefits toward a good professional exposure / visibility for an academic blogger.

Does it become too much effort?

As I said, I have no direct experience in blogging, but, based on my experience with answering questions, it depends on your desired involvement. I guess, for blogging it is more about setting a schedule that is comfortable for the author and sticking to it. Answering questions is a more flexible way.

Is it worthwhile?

See answer to Q1.

How likely is blog-death?

Since one of the major, if not the major, benefits of blogging is training one's brain to formulate and express thoughts and arguments, I think that "blog-death" is not only over-rated, but irrelevant. Even if zero people read your blog now, 1) at some point, some people will start reading it, if it is worth reading and, more importantly, 2) you will still be self-improving in so many ways.

In general, what are pitfalls to watch out for when starting an academic blog?

IMHO potential factors of success are (obviously, potential pitfalls would be the opposite aspects):

  • finding interesting topics;
  • expressing yourself via original and quality writing;
  • creating a visually appealing blog (likely, not critical, but still...);
  • creating a realistic schedule and sticking to it;
  • having faith in yourself.

Posted: February 14, 2016, 3:49 am

Answer by Aleksandr Blekh for Are 'Dr' for medical doctor used in the same sense as a PhD?

While all those titles share the same linguistic roots, obviously, the meaning is somewhat different. When referring to a Ph.D., the term doctor is used in the context of general knowledge acquisition. That is why the full title is doctor of philosophy, where philosophy implies "love of wisdom". On the other hand, a medical doctor (M.D.) or Doctor of Osteopathic medicine (D.O.) title or one of the dental doctor titles refers to a specialist in one or more areas of medicine. A relatively popular alternative term for medical doctor is physician, which some people might confuse with physicist. The origins of the word "physician" and its relation to the word "doctor" are discussed in this interesting article in Science Friday.

The original meaning of the word "doctor" as "license to teach" has likely been transferred to the medicine knowledge domain IMHO due to the important role of one of the cornerstones of science that medicine played at that particular time period and place (medieval Europe). You may also find additional interesting information in this related discussion on StackExchange.

Posted: June 12, 2015, 1:19 am

User Aleksandr Blekh - Data Science Stack Exchange

Answer by Aleksandr Blekh for Steps in exploratory methods for mild-sized data with mixed categorical and numerical values?

You can get a reasonably good approximation of steps for exploratory data analysis (EDA) by reviewing the EDA section of the NIST Engineering Statistics Handbook. Additionally, you might find helpful parts of my related answer here on Data Science SE.

Methods related to EDA are so diverse that it is not feasible to discuss them in a single answer. I will just mention several approaches. If you are interested in applying classification to your data set, you might find the information mentioned in my other answer helpful. In order to detect structures in a data set, you can try to apply principal component analysis (PCA). If, on the other hand, you are interested in exploring latent structures in data, consider using exploratory factor analysis (EFA).
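To make the PCA suggestion a bit more concrete, here is a minimal sketch in Python with numpy, computing the principal components via singular value decomposition. The function name pca and the synthetic data are my own assumptions for illustration; note also that plain PCA applies only to the numerical columns, so categorical variables would have to be encoded or handled separately.

```python
import numpy as np


def pca(X, n_components=2):
    """Minimal PCA via SVD on a numeric matrix (rows = observations)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # center each column
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]               # principal directions
    scores = Xc @ components.T                   # data projected onto them
    explained_variance = (S ** 2) / (X.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    return scores, components, ratio[:n_components]


# Synthetic example (assumption): two strongly correlated columns plus noise.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([
    t,
    2 * t + rng.normal(scale=0.1, size=200),
    rng.normal(scale=0.1, size=200),
])
scores, components, ratio = pca(X, n_components=2)
```

If the first entry of ratio is close to 1, most of the variance lies along a single direction, which is the kind of structure PCA is meant to reveal during EDA.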

Posted: October 25, 2015, 12:10 am

Answer by Aleksandr Blekh for Sampling for multi categorical variable

Let me give you some pointers (assuming that I'm right on this, which might not necessarily be true, so proceed with caution :-). First, I'd figure out the applicable terminology. It seems to me that your case can be categorized as multivariate sampling from a categorical distribution (see this section on categorical distribution sampling). Perhaps the simplest approach is to use the rich functionality of the R ecosystem. In particular, the standard stats package contains the rmultinom function (link).

If you need more complex types of sampling, there are other packages that might be worth exploring, for example, sampling (link) and miscF (link), the latter offering the rMultinom function (link). If your complex sampling is focused on survey data, consider reading this interesting paper "Complex Sampling and R" by Thomas Lumley.

If you use languages other than R, check the multinomial function from Python's numpy package and, for Stata, this blog post. Finally, if you are interested in Bayesian statistics, the following two documents seem to be relevant: this blog post and this survey paper. Hope this helps.
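To make the idea concrete, here is a minimal sketch of drawing multinomial counts from a categorical distribution, using only Python's standard library. The category names, probabilities and the helper name sample_multinomial are made up for illustration; numpy's multinomial function, mentioned above, should give you the same kind of result in a single call.

```python
import random
from collections import Counter

def sample_multinomial(categories, probs, n, seed=None):
    """Draw n observations from a categorical distribution; return counts per category."""
    rng = random.Random(seed)
    draws = rng.choices(categories, weights=probs, k=n)
    counts = Counter(draws)
    # Report every category, including those that were never drawn.
    return {c: counts.get(c, 0) for c in categories}

# Hypothetical categories and probabilities, for illustration only.
counts = sample_multinomial(["red", "green", "blue"], [0.5, 0.3, 0.2], n=1000, seed=42)
print(counts)
```

The counts always add up to n, and each category's share should be close to its probability for a large enough n.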

Posted: October 12, 2015, 3:48 pm

Answer by Aleksandr Blekh for Are there any tools for feature engineering?

Very interesting question (+1). While I am not aware of any software tools that currently offer comprehensive functionality for feature engineering, there is definitely a wide range of options in that regard. Currently, as far as I know, feature engineering is still largely a laborious and manual process (e.g., see this blog post). Speaking about the feature engineering subject domain, this excellent article by Jason Brownlee provides a rather comprehensive overview of the topic.

Ben Lorica, Chief Data Scientist and Director of Content Strategy for Data at O'Reilly Media Inc., has written a very nice article, describing the state-of-the-art (as of June 2014) approaches, methods, tools and startups in the area of automating (or, as he put it, streamlining) feature engineering.

I took a brief look at some startups that Ben has referenced and a product by Skytree indeed looks quite impressive, especially in regard to the subject of this question. Having said that, some of their claims sound really suspicious to me (e.g., "Skytree speeds up machine learning methods by up to 150x compared to open source options"). Continuing with commercial data science and machine learning offerings, I have to mention solutions by Microsoft, in particular their Azure Machine Learning Studio. This Web-based product is quite powerful and elegant and offers some feature engineering functionality (FEF). For an example of some simple FEF, see this nice video.

Returning to the question, I think that the simplest approach one can apply for automating feature engineering is to use corresponding IDEs. Since you (like me) are interested in the R language as a data science backend, I would suggest checking, in addition to RStudio, another similar open source IDE, called RKWard. One of the advantages of RKWard vs RStudio is that it supports writing plugins for the IDE, thus enabling data scientists to automate feature engineering and streamline their R-based data analysis.

Finally, on the other side of the spectrum of feature engineering solutions we can find some research projects. The two most notable seem to be Stanford University's Columbus project, described in detail in the corresponding research paper, and Brainwash, described in this paper.

Posted: October 3, 2015, 8:05 am

Answer by Aleksandr Blekh for Looking for language and framework for data munging/wrangling

If you are interested in a very high-level (enterprise architecture) framework, I suggest you take a look at the MIKE2.0 Methodology. Being an information management framework, MIKE2.0 certainly has much wider coverage than the domain of your interest, but it is a solid, interesting and open (licensed under the Creative Commons Attribution License) framework. A better fit for your focus is the Extract, transform, load (ETL) framework, which is extremely popular in the contexts of Business Intelligence and Data Warehousing. On a more practical note, you might want to check my answer on Quora on open source master data management (MDM) solutions. Pay attention to the Talend solutions (disclaimer: I am not affiliated with this or any company), which cover a wide spectrum of MDM, ETL and data integration domains as open source and commercial offerings.

Posted: September 30, 2015, 9:12 pm

Answer by Aleksandr Blekh for How to start analysing and modelling data for an academic project, when not a statistician or data scientist

Typically, quantitative analysis is planned and performed based on a research study's goals. Focusing on the research goals and corresponding research questions, the researcher would propose a model (or several models) and a set of hypotheses associated with the model(s). The model(s) and the types of their elements usually dictate (suggest) quantitative approaches that would make sense in a particular situation. For example, if your model includes latent variables, you would have to use appropriate methods to perform data analysis (e.g., structural equation modeling). Otherwise, you can apply a variety of other methods, such as time series analysis or, as you mentioned, multiple regression and machine learning. For more details on research workflow with latent variables, also see section #3 in my relevant answer.

One last note: whatever methods you use, pay enough attention to the following two very important aspects - performing full-scale exploratory data analysis (EDA) (see my relevant answer) and trying to design and perform your analysis in the reproducible research fashion (see my relevant answer).

Posted: September 22, 2015, 7:42 am

Answer by Aleksandr Blekh for Program to fine-tune pre-trained word embeddings on my data set

While I am not aware of software specifically for tuning trained word embeddings, perhaps the following open source software might be helpful, if you can figure out what parts can be modified for the fine-tuning part (just an idea off the top of my head - I'm not too familiar with the details):

Posted: July 31, 2015, 4:39 am

Answer by Aleksandr Blekh for Do I need an Artificial Intelligence API?

One needs to use an artificial intelligence (AI) API, if there is a need to add AI functionality to a software application - this is pretty obvious. Traditionally, my advice on machine learning (ML) software includes the following two excellent curated lists of resources: this one and this one.

However, keep in mind that ML is just a subset of the AI domain, so if your tasks involve AI areas beyond ML, you need more AI-focused tools or platforms. For example, you can take a look at ai-one's AI platforms and APIs as well as the interesting general AI open source project OpenCog.

In addition to the above-mentioned AI-focused platforms, IBM's Watson AI system deserves a separate mention, as it is quite cool and promising. It offers its own ecosystem for developers, called IBM Watson Developer Cloud, based on IBM's Bluemix cloud computing platform-as-a-service (PaaS). However, at the present time, I find this offering to be quite expensive as well as limiting, especially for individual developers, small startups and other small businesses, due to its tight integration with and reliance on a single PaaS (Bluemix). It will be interesting to watch this space, as competition in the AI domain and marketplace will, IMHO, surely intensify in the future.

Posted: June 10, 2015, 3:53 am

Answer by Aleksandr Blekh for What is the definition of knowledge within data science?

Knowledge is a general term and I don't think that there exist definitions of knowledge for specific disciplines, domains and areas of study. Therefore, in my opinion, knowledge, for a particular subject domain, can be defined just as a domain-specific (or context-specific, as mentioned by @JGreenwell +1) perspective (projection) of a general concept of knowledge.

Posted: June 7, 2015, 5:38 am

Answer by Aleksandr Blekh for Ideas for next step of Machine Learning

I would suggest that you check this excellent presentation by Li Deng (Microsoft Research). Many of the slides contain references to relevant research papers and even several interesting books on the topics of interest (they should be pretty easy to find). It might also be helpful to check the references listed in this research paper by Prof. Andrew Ng and his colleagues at Baidu Research. Finally, a focused Internet search will provide you with a comprehensive list of resources for further research.

Posted: May 21, 2015, 5:33 am

Answer by Aleksandr Blekh for Airline Fares - What analysis should be used to detect competitive price-setting behavior and price correlations?

In addition to exploratory data analysis (EDA), both descriptive and visual, I would try to use time series analysis as a more comprehensive and sophisticated analysis. Specifically, I would perform time series regression analysis. Time series analysis is a huge research and practice domain, so, if you're not familiar with the fundamentals, I suggest starting with the above-linked Wikipedia article, gradually searching for more specific topics and reading corresponding articles, papers and books.

Since time series analysis is a very popular approach, it is supported by most open source and closed source commercial data science and statistical environments (software), such as R, Python, SAS, SPSS and many others. If you want to use R for this, check my answers on general time series analysis and on time series classification and clustering. I hope that this is helpful.
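Before moving to a full time series regression, a simple descriptive check is often useful: the lagged cross-correlation between two carriers' fare series. The sketch below is only that (it is not a regression model), and the fare numbers and function names are hypothetical, chosen for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def lagged_correlation(leader, follower, lag):
    """Correlate leader[t] with follower[t + lag] for a non-negative lag."""
    if lag == 0:
        return pearson(leader, follower)
    return pearson(leader[:-lag], follower[lag:])

# Hypothetical daily fares: airline B repeats airline A's fare one day later.
fares_a = [200, 210, 190, 220, 230, 205, 215, 240]
fares_b = [195] + fares_a[:-1]

# Lag (in days) at which B's fares track A's fares most closely.
best_lag = max(range(0, 4), key=lambda k: lagged_correlation(fares_a, fares_b, k))
print(best_lag)
```

A strong correlation at a positive lag would only hint that one carrier follows the other; a proper time series regression would still be needed to control for seasonality and common cost shocks.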

Posted: May 18, 2015, 2:32 am

Answer by Aleksandr Blekh for Application of Control Theory in Data Science

Have you tried an Internet search? The results should be able to answer most, if not all, of your questions. The topics of your interest sound rather general or high-level. I'm sure that they can, in one form or another, be applied in the data science context. In my opinion, those topics are more related to operations research (OR); therefore, I would recommend that you perform some research on the Internet on the intersections between control systems (theory) and data science.

Having said that, the first thing that comes to my mind is that the most likely candidate for use of control theory concepts and methods in the data science context would be distributed systems and algorithms for data analysis, such as MapReduce (Hadoop, etc.), as well as other parallel processing systems. If there exists an intersection between OR's area of optimization and control theory, then it very well could be used for optimizing big data algorithms, among other tasks.

Posted: May 17, 2015, 8:30 am

Answer by Aleksandr Blekh for Attributing causality to single quasi-independent variable

I would suggest that you consider a direct dimensionality reduction approach. Check my relevant answer on this site. Another valid option is to use latent variable modeling, for example, structural equation modeling. You can start with relevant articles on Wikipedia (this and this, correspondingly) and then, as needed, read more specialized or more practical articles, papers and books.

Posted: May 16, 2015, 2:36 am

Answer by Aleksandr Blekh for Best or recommended R package for logit and probit regression

Unless you have some very specific or exotic requirements, in order to perform logit and probit regression analysis in R, you can use the standard (built-in and loaded by default) stats package. In particular, you can use the glm() function, as shown in the following nice tutorials from UCLA: logit in R tutorial and probit in R tutorial.

If you are interested in multinomial logistic regression, this UCLA tutorial might be helpful (you can use glm() or packages, such as glmnet or mlogit). For the above-mentioned very specific or exotic requirements, many other R packages are available, for example, logistf or elrm.

I also recommend another nice tutorial on GLMs from Princeton University (by Germán Rodríguez), which discusses some modeling aspects, not addressed in the UCLA materials, in particular updating models and model selection.

Posted: May 13, 2015, 2:47 am

Answer by Aleksandr Blekh for Use of Nash-Equilibrium in big data environments

I have very limited knowledge of game theory, but hope to learn more. However, I think that potential applications of Nash equilibrium in the context of big data environments imply the need to analyze a large number of features (representing various strategic pathways or traits) as well as a large number of cases (representing a significant number of actors). Considering these points, I would think that complexity and, consequently, performance requirements for Nash equilibrium in big data applications grow exponentially. For some examples from the Internet load-balancing domain, see the paper by Even-Dar, Kesselman and Mansour (n.d.).

The above-mentioned points touch only the volume aspect of 4V big data model (an update of Gartner's original 3V model). If you add to that other aspects (variety, velocity and veracity), the situation seems to become even more complex. Perhaps, people with econometrics background and experience will have some of the most comprehensive opinions on this interesting question. A lot of such people are active on Cross Validated, so I will let them know about this question - hopefully, some of them will be interested to share their view by answering this question.


Even-Dar, E., Kesselman, A., & Mansour, Y. (n.d.). Convergence time to Nash equilibria. Retrieved from

Posted: May 8, 2015, 6:57 am

Answer by Aleksandr Blekh for How can I use Data Science to profoundly contribute to Humanity?

Since I have already answered a similar question on Data Science StackExchange site, plus some related ones, I will mention all of them here and let you decide, if you find them helpful:

Posted: April 21, 2015, 10:56 pm

Answer by Aleksandr Blekh for Abstract data type?

Any platform, focused on social networking (not necessarily Twitter), at its core uses the most appropriate and natural abstract data type (ADT) for such domain - a graph data structure.

If you use Python, you can check the nice NetworkX package, used for "the creation, manipulation, and study of the structure, dynamics, and functions of complex networks". Of course, there are many other software tools for various programming languages for building, using and analyzing network structures. You might also find useful the relevant book "Social Network Analysis for Startups: Finding connections on the social web", which provides a nice introduction to social network analysis (SNA) and uses the above-mentioned NetworkX software for SNA examples. P.S. I have no affiliation whatsoever with the NetworkX open source project or the book's authors.
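To illustrate what a graph ADT looks like at its simplest, here is a minimal sketch of an undirected graph stored as an adjacency mapping, using only the standard library. The class, its method names and the user names are my own illustration and are not the NetworkX API; for real work, a dedicated package is the better choice.

```python
from collections import defaultdict

class Graph:
    """Undirected graph stored as an adjacency mapping: node -> set of neighbours."""

    def __init__(self):
        self._adj = defaultdict(set)

    def add_edge(self, u, v):
        # An undirected edge is recorded in both directions.
        self._adj[u].add(v)
        self._adj[v].add(u)

    def neighbors(self, u):
        return set(self._adj.get(u, set()))

    def degree(self, u):
        return len(self._adj.get(u, ()))

    def mutual(self, u, v):
        """Nodes connected to both u and v (e.g. mutual friends)."""
        return self.neighbors(u) & self.neighbors(v)

# Hypothetical "follows each other" relationships.
g = Graph()
for a, b in [("ann", "bob"), ("ann", "cat"), ("bob", "cat"), ("cat", "dan")]:
    g.add_edge(a, b)

print(g.degree("cat"), sorted(g.mutual("ann", "bob")))
```

The operations exposed here (add an edge, list neighbours, count connections) are the abstract part; the dictionary of sets is just one possible underlying data structure.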

Posted: April 15, 2015, 6:04 pm

Answer by Aleksandr Blekh for Possibility of working on KDDCup data in local system

I think that you have, at least, the following major options for your data analysis scenario:

  1. Use big data-enabling R packages on your local system. You can find most of them via the corresponding CRAN Task View that I reference in this answer (see point #3).

  2. Use the same packages on a public cloud infrastructure, such as Amazon Web Services (AWS) EC2. If your analysis is non-critical and tolerant to potential restarts, consider using AWS Spot Instances, as their pricing allows for significant financial savings.

  3. Use the above-mentioned public cloud option with the standard R platform, but on more powerful instances (for example, on AWS you can opt for memory-optimized EC2 instances or general purpose on-demand instances with more memory).

In some cases, it is possible to tune a local system (or a cloud on-demand instance) to enable R to work with big(ger) data sets. For some help in this regard, see my relevant answer.

For both above-mentioned cloud (AWS) options, you may find it more convenient to use R-focused pre-built VM images. See my relevant answer for details. You may also find this excellent comprehensive list of big data frameworks useful.

Posted: April 12, 2015, 5:23 am

Answer by Aleksandr Blekh for Extracting model equation and other data from 'glm' function in R

In order to extract some data from the fitted glm model object, you need to figure out where that data resides (use documentation and str() for that). Some data might be available from the summary.glm object, while more detailed data is available from the glm object itself. For extracting model parameters, you can use coef() function or direct access to the structure.


From Princeton's* introduction to R course's website, GLM section - see for details & examples:

The functions that can be used to extract results from the fit include

- 'residuals' or 'resid', for the deviance residuals
- 'fitted' or 'fitted.values', for the fitted values (estimated probabilities)
- 'predict', for the linear predictor (estimated logits)
- 'coef' or 'coefficients', for the coefficients, and
- 'deviance', for the deviance. 

Some of these functions have optional arguments; for example, you can extract five different types of residuals, called "deviance", "pearson", "response" (response - fitted value), "working" (the working dependent variable in the IRLS algorithm - linear predictor), and "partial" (a matrix of working residuals formed by omitting each term in the model). You specify the one you want using the type argument, for example residuals(lrfit,type="pearson").

*) More accurately, this website is by Germán Rodríguez from Princeton University.
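To show what the residual types in the quoted passage actually compute, here is a small sketch for a binary (0/1) response, written from the textbook definitions rather than taken from R's source. It covers only the "response", "pearson" and "deviance" types; the function name and the fitted probabilities are made up for illustration.

```python
from math import log, sqrt

def residuals(y, p, kind="deviance"):
    """Residuals for a binary (0/1) response y and fitted probabilities p."""
    out = []
    for yi, pi in zip(y, p):
        if kind == "response":
            r = yi - pi
        elif kind == "pearson":
            # Raw residual scaled by the binomial standard deviation.
            r = (yi - pi) / sqrt(pi * (1 - pi))
        elif kind == "deviance":
            # Signed square root of the observation's deviance contribution.
            d = -2 * (log(pi) if yi == 1 else log(1 - pi))
            r = sqrt(d) if yi == 1 else -sqrt(d)
        else:
            raise ValueError(f"unsupported residual type: {kind}")
        out.append(r)
    return out

# Hypothetical observed outcomes and fitted probabilities.
y = [1, 0, 1, 0]
p_hat = [0.8, 0.25, 0.6, 0.4]
for kind in ("response", "pearson", "deviance"):
    print(kind, [round(r, 3) for r in residuals(y, p_hat, kind)])
```

A handy sanity check: the squared deviance residuals sum to the model deviance, that is, minus twice the log-likelihood for 0/1 data.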

Posted: April 9, 2015, 3:27 pm

Answer by Aleksandr Blekh for Building a static local website using Rmarkdown: step by step procedure

In most things, related to R, there are many approaches to solve a problem, sometimes too many, I would say. The task of building a static website, using RMarkdown, is not an exception.

One of the best, albeit somewhat brief, sets of workflows on the topic is the one by Daniel Wollschlaeger, which includes this workflow, based on R, nanoc and Jekyll, as well as this workflow, based on R and WordPress. Another good workflow is this one by Jason Bryer, which is focused on R(Markdown), Jekyll and GitHub Pages.

Not everyone likes GitHub Pages, Jekyll, Octopress and Ruby, so some people came up with alternative solutions. For example, this workflow by Edward Borasky is based on R and, for a static website generator, on Python-based Nikola (instead of Ruby-based Jekyll or nanoc). Speaking about static website generators, there are tons of them, in various programming languages, so, if you want to experiment, check this amazing website, listing almost all of them. Almost, because some are missing - for example, Samantha and Ghost, listed here.

Some other interesting workflows include this one by Joshua Lande, which is based on Jekyll and GitHub Pages, but includes some nice examples of customization for integrating a website with Disqus, Google Analytics and Twitter as well as getting custom URL for the site and more.

Those who want a pure R-based static site solution now have some options, including rsmith, a static site generator by Hadley Wickham, and Poirot, a static site generator by Ramnath Vaidyanathan.

Finally, I would like to mention an interesting project (from an open science perspective) that I recently ran across - an open source software by Mark Madsen for a lab notebook static site, which is based on GitHub Pages and Jekyll, but also supports pandoc, R, RMarkdown and knitr.

Posted: April 8, 2015, 3:47 am

Answer by Aleksandr Blekh for Learning resources for data science to win political campaigns?

This is an interesting and relevant question. I think that from a data science perspective, it should not be, in principle, any different from any other similar data science tasks, such as prediction, forecasting or other analyses. As with any data science work, the quality of applying data science to politics very much depends on understanding not only data science approaches, methods and tools, but, first and foremost, the domain being analyzed, that is, the politics domain.

The rapidly rising popularity of data science and machine learning (ML), in general, certainly has a significant impact on particular verticals, and politics is not an exception. This impact can be seen not only in increased research interest in applying data science and ML to political science (for example, see this presentation, this paper, this overview paper and this whole virtual/open issue in a prominent Oxford journal), but also in practical applications. Moreover, a new term - political informatics or poliInformatics or poli-informatics - has been coined to name an interdisciplinary field, whose stated goal is to study and use data science, big data and ML in the government and politics domains. As I've said earlier, the interest in applying data science to politics goes beyond research and often results in politics-focused startups, such as PoliticIt or Para Bellum Labs. Following the unfortunate, but established trend in the startup ecosystem, many of those ventures fail. For example, read the story of one such startup.

I am pretty sure that you will not be able to find either the proprietary algorithms that political startups or election data science teams have used and use, or their data sets. However, I am rather positive that you can get some understanding of typical data sets as well as data collection and analysis methods via the resources that I have referenced above. Hope this helps.

Posted: April 4, 2015, 6:57 am

Answer by Aleksandr Blekh for Do data scientists use Excel?

Do experienced data scientists use Excel?

I've seen some experienced data scientists, who use Excel - either due to their preference, or due to their workplace's business and IT environment specifics (for example, many financial institutions use Excel as their major tool, at least, for modeling). However, I think that most experienced data scientists recognize the need to use tools, which are optimal for particular tasks, and adhere to this approach.

Can you assume a lack of experience from someone who does primarily use Excel?

No, you cannot. This is a corollary of my above-mentioned thoughts. Data science does not automatically imply big data - there is plenty of data science work that Excel can handle quite well. Having said that, if a data scientist (even an experienced one) does not have knowledge (at least, basic) of modern data science tools, including big data-focused ones, it is somewhat disturbing. This is because experimentation is deeply ingrained into the nature of data science due to exploratory data analysis being an essential and, even, a crucial part of it. Therefore, a person who does not have an urge to explore other tools within their domain could rank lower among candidates in the overall fit for a data science position (of course, this is quite fuzzy, as some people are very quick in learning new material, plus, people might not have had an opportunity to satisfy their interest in other tools due to various personal or workplace reasons).

Therefore, in conclusion, I think that the best answer an experienced data scientist might have to a question in regard to their preferred tool is the following: My preferred tool is the optimal one, that is the one that best fits the task at hand.

Posted: April 3, 2015, 10:37 pm

Answer by Aleksandr Blekh for What is the term for when a model acts on the thing being modeled and thus changes the concept?

Though it is not a term specific to machine learning, I would refer to such behavior of a statistical model using the general term side effect (while adding some clarifying adjectives, such as expected or unexpected, desired or undesired, and similar). Modeling outcome or transitive feedback loop outcome might be some of the alternative terms.

Posted: April 3, 2015, 12:51 am

Answer by Aleksandr Blekh for how to modify sparse survey dataset with empty data points?

I would consider approaching this situation from the following two perspectives:

  • Missing data analysis. Although formally the values in question are empty and not NA, I think that effectively incomplete data can (and should) be considered as missing. If that is the case, you need to automatically recode those values and then apply standard missing data handling approaches, such as multiple imputation. If you use R, you can use packages Amelia (if the data is multivariate normal), mice (supports non-normal data) or some others. For a nice overview of approaches, methods and software for multiple imputation of data with missing values, see the excellent 2007 article by Nicholas Horton and Ken Kleinman "Much ado about nothing: A comparison of missing data methods and software to fit incomplete data regression models".

  • Sparse data analysis, such as sparse regression. I'm not too sure how well this approach would work for variables with high levels of sparsity, but you can find a lot of corresponding information in my relevant answer.
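The recoding step from the first point can be sketched as follows. This covers only turning empty-looking answers into explicit missing values and measuring how much is missing per column; the imputation itself is left to dedicated packages such as Amelia or mice. The column names, the list of "empty" markers and the function names are assumptions made for illustration.

```python
def recode_empty(rows, empty_markers=("", "NA", "N/A")):
    """Replace empty-looking survey answers with None so they count as missing."""
    return [
        {k: (None if isinstance(v, str) and v.strip() in empty_markers else v)
         for k, v in row.items()}
        for row in rows
    ]

def missing_share(rows):
    """Fraction of missing (None) values per column."""
    cols = rows[0].keys()
    return {c: sum(r[c] is None for r in rows) / len(rows) for c in cols}

# Hypothetical survey responses with blank answers.
survey = [
    {"age": "34", "income": "", "city": "Oslo"},
    {"age": "", "income": "52000", "city": " "},
    {"age": "41", "income": "", "city": "Bergen"},
    {"age": "29", "income": "48000", "city": "Oslo"},
]

clean = recode_empty(survey)
print(missing_share(clean))
```

The per-column share of missing values is a reasonable first input when deciding between imputation and a sparse-data approach.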

Posted: April 2, 2015, 11:38 pm

Answer by Aleksandr Blekh for How does SQL Server Analysis Services compare to R?

In my opinion, it seems that SSAS makes more sense for someone who:

  • has significantly invested in Microsoft's technology stack and platform;
  • prefers a point-and-click interface (GUI) to the command line;
  • focuses on data warehousing (OLAP cubes, etc.);
  • has limited needs in terms of statistical methods and algorithms variety;
  • has limited needs in cross-language integration;
  • doesn't care much about openness, cross-platform integration and vendor lock-in.

You may find this blog post by Sami Badawi useful. However, note that the post is not recent, so some information might be outdated. Plus, the post contains an initial review, which might not be very accurate or comprehensive. If you're thinking about data science, while considering staying within the Microsoft ecosystem, I suggest you take a look at Microsoft's own machine learning platform Azure ML. This blog post presents a brief comparison of (early) Azure ML and SSAS.

Posted: March 27, 2015, 11:22 am

Answer by Aleksandr Blekh for General approach to extract key text from sentence (nlp)

You need to analyze sentence structure and extract the corresponding syntactic categories of interest (in this case, I think it would be the noun phrase, which is a phrasal category). For details, see the corresponding Wikipedia article and the "Analyzing Sentence Structure" chapter of the NLTK book.

In regard to available software tools for implementing the above-mentioned approach and beyond, I would suggest considering either NLTK (if you prefer Python) or StanfordNLP software (if you prefer Java). For many other NLP frameworks and libraries, and for support in various programming languages, see the corresponding (NLP) sections in this excellent curated list.
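To give a rough idea of what noun phrase extraction does, here is a toy chunker in plain Python. It assumes the tokens are already part-of-speech tagged (the Penn Treebank-style tags below are hand-assigned for illustration) and uses one simple pattern: an optional determiner, any adjectives, then one or more nouns. It is a sketch of the idea, not a replacement for NLTK or StanfordNLP, and the function name is my own.

```python
def noun_phrases(tagged):
    """Collect maximal runs of: optional determiner, adjectives, one or more nouns."""
    phrases, current, has_noun = [], [], False

    def flush():
        # Keep the run only if it actually contains a noun, then start over.
        nonlocal current, has_noun
        if has_noun:
            phrases.append(" ".join(current))
        current, has_noun = [], False

    for word, tag in tagged:
        if tag.startswith("NN"):
            current.append(word)
            has_noun = True
        elif tag in ("DT", "JJ") and not has_noun:
            if tag == "DT":
                current = []  # a determiner starts a fresh phrase
            current.append(word)
        else:
            flush()
            if tag in ("DT", "JJ"):
                current.append(word)  # this word may open the next phrase
    flush()
    return phrases

# Hand-tagged example sentence (tags are assumptions, not tagger output).
sentence = [("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
            ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
            ("dog", "NN")]
print(noun_phrases(sentence))
```

In practice, the tagging step and a much richer grammar would come from the NLP library itself.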

Posted: March 21, 2015, 8:58 pm

Answer by Aleksandr Blekh for IDE alternatives for R programming (RStudio, IntelliJ IDEA, Eclipse, Visual Studio)

Here's R Language Support for IntelliJ IDEA. However, keep in mind that this support is not in the form of built-in functionality or official plug-in, but rather a third-party plug-in. I haven't tried it, so my opinion on it is limited to the point above.

In my opinion, a better option would be Eclipse, which offers R support via StatET IDE. However, I find Eclipse IDE too heavyweight. Therefore, my preferred option is RStudio IDE - I don't know why one would prefer other options. I especially like RStudio's ability of online access to the full development environment via RStudio Server.

Posted: March 19, 2015, 12:21 am

Answer by Aleksandr Blekh for Python or R for implementing machine learning algorithms for fraud detection

I would say that it is your call and purely depends on your comfort with (or desire to learn) the language. Both languages have extensive ecosystems of packages/libraries, including some, which could be used for fraud detection. I would consider anomaly detection as the main theme for the topic. Therefore, the following resources illustrate the variety of approaches, methods and tools for the task in each ecosystem.

Python Ecosystem

  • scikit-learn library: for example, see this page;
  • LSAnomaly, a Python module, improving OneClassSVM (a drop-in replacement): see this page;
  • Skyline: an open source example of implementation, see its GitHub repo;
  • A relevant discussion on StackOverflow;
  • pyculiarity, a Python port of Twitter's AnomalyDetection R Package (as mentioned in 2nd bullet of R Ecosystem below "Twitter's Anomaly Detection package").
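As a very simple baseline for the anomaly detection theme, here is a sketch based on the modified z-score (median and median absolute deviation), using only the standard library. This is a generic textbook technique that I'm substituting for illustration; it is not what scikit-learn, Skyline or the other tools listed above implement. The transaction amounts and the function name are hypothetical.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread around the median: the score is undefined
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical transaction amounts with one suspicious entry.
amounts = [25, 30, 27, 31, 29, 26, 950, 28, 30, 27]
print(mad_outliers(amounts))
```

Because it relies on the median rather than the mean, this check is not distorted much by the very outliers it is trying to find, which makes it a reasonable first filter before trying model-based detectors.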

R Ecosystem

Additional General Information

Posted: February 21, 2015, 10:08 am

Answer by Aleksandr Blekh for Machine learning toolkit for Excel

As far as I know, currently there are not that many projects and products that allow you to perform serious machine learning (ML) work from within Excel.

However, the situation seems to be changing rapidly due to Microsoft's active efforts in popularizing its ML cloud platform Azure ML (along with ML Studio). The recent acquisition of R-focused company Revolution Analytics by Microsoft (which appears to me to be, to a large extent, an acqui-hire) is an example of the company's aggressive data science market strategy.

In regard to ML toolkits for Excel, as a confirmation that we should expect most Excel-enabled ML projects and products to be Azure ML-focused, consider the following two projects (the latter is open source):

Posted: January 28, 2015, 10:33 am

Answer by Aleksandr Blekh for High-dimensional data: What are useful techniques to know?

This is a very broad question, which I think is impossible to cover comprehensively in a single answer. Therefore, I think that it would be more beneficial to provide some pointers to relevant answers and/or resources. This is exactly what I will do by providing the following information and thoughts of mine.

First of all, I should mention the excellent and comprehensive tutorial on dimensionality reduction by Burges (2010) from Microsoft Research. He touches on high-dimensional aspects of data frequently throughout the monograph. This work, referring to dimensionality reduction as dimension reduction, presents a theoretical introduction to the problem, suggests a taxonomy of dimensionality reduction methods, consisting of projective methods and manifold modeling methods, and provides an overview of multiple methods in each category.

The "projective pursuit" methods reviewed include independent component analysis (ICA), principal component analysis (PCA) and its variations, such as kernel PCA and probabilistic PCA, canonical correlation analysis (CCA) and its kernel CCA variation, linear discriminant analysis (LDA), kernel dimension reduction (KDR) and some others. The manifold methods reviewed include multidimensional scaling (MDS) and its landmark MDS variation, Isomap, Locally Linear Embedding and graphical methods, such as Laplacian eigenmaps and spectral clustering. I'm listing most of the reviewed methods here in case the original publication is inaccessible to you, either online (link above) or offline (References).

There is a caveat for the term "comprehensive" that I've applied to the above-mentioned work. While it is indeed rather comprehensive, this is relative, as some of the approaches to dimensionality reduction are not discussed in the monograph, in particular, the ones, focused on unobservable (latent) variables. Some of them are mentioned, though, with references to another source - a book on dimensionality reduction.

Now, I will briefly cover several narrower aspects of the topic in question by referring to my relevant or related answers. In regard to nearest neighbors (NN)-type approaches to high-dimensional data, please see my answers here (I especially recommend checking paper #4 in my list). One of the effects of the curse of dimensionality is that high-dimensional data is frequently sparse. Considering this fact, I believe that my relevant answers here and here on regression and PCA for sparse and high-dimensional data might be helpful.


Burges, C. J. C. (2010). Dimension reduction: A guided tour. Foundations and Trends® in Machine Learning, 2(4), 275-365. doi:10.1561/2200000002

Posted: January 26, 2015, 8:00 am

Answer by Aleksandr Blekh for Is the R language suitable for Big Data

Some good answers here. I would like to join the discussion by adding the following notes:

1) The question's emphasis on the volume of data while referring to Big Data is certainly understandable and valid, especially considering the problem of data volume growth outpacing technological capacities' exponential growth per Moore's Law.

2) Having said that, it is important to remember the other aspects of the big data concept, based on Gartner's definition (emphasis mine - AB): "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization." (usually referred to as the "3Vs model"). I mention this because it forces data scientists and other analysts to look for and use R packages that focus on aspects of big data other than volume (enabled by the richness of the enormous R ecosystem).

3) While existing answers mention some R packages related to big data, for more comprehensive coverage I'd recommend referring to the CRAN Task View "High-Performance and Parallel Computing with R", in particular, the sections "Parallel computing: Hadoop" and "Large memory and out-of-memory data".

Posted: July 19, 2014, 2:19 am

User Aleksandr Blekh - Cross Validated

Answer by Aleksandr Blekh for Predict user behaviour with constantly changing input variables

Interesting question (+1). I'm not an expert in recommendation systems, so my attempt to help will be limited to emphasizing the following point (please don't ask about implementation details - you will have to figure that out by yourself or ask other people):

  • I would think that there are two somewhat different approaches: 1) one that you seem to suggest, where, if I understood correctly, you want to predict next (website) destinations of the user and generate corresponding recommendations, based on those predicted destinations; 2) another is to generate recommendations, based on tracking the user's N most recent actions (perhaps, a more dynamic option).
Posted: May 19, 2015, 10:47 am

Answer by Aleksandr Blekh for Calculating CIs for $\eta^2$ via Z scores - sample size?

In case you are still interested in this topic, I would recommend taking a look at the papers referenced in my answer, especially the first one (by Lakens). Also, check the MBESS R package: see the home page and the JSS paper (note that the software's current version most likely contains additional features and improvements, not described in the referenced original JSS paper).

Posted: May 10, 2015, 3:09 am

Answer by Aleksandr Blekh for Difference between regression analysis and curve fitting

In addition to @NickCox's excellent answer (+1), I wanted to share my subjective impression on this somewhat fuzzy terminology topic. I think that a rather subtle difference between the two terms lies in the following. On one hand, regression often, if not always, implies an analytical solution (reference to regressors implies determining their parameters, hence my argument about analytical solution). On the other hand, curve fitting does not necessarily imply producing an analytical solution and IMHO often might be and is used as an exploratory approach.

Posted: May 8, 2015, 5:59 pm

Answer by Aleksandr Blekh for Invariance test after CFA model

As far as I know, measurement invariance testing is usually performed in the SEM context, when the research sample contains multiple groups. In the SEM context, measurement invariance is often referred to as factorial invariance. It is definitely a good idea to perform both measurement invariance analysis and common method bias analysis prior to creating structural models, and this approach is actually recommended in the literature (e.g., Podsakoff, MacKenzie, Lee & Podsakoff, 2003; van de Schoot, Lugtig & Hox, 2012).

Gaskin (2012) provides excellent textual and video tutorials on performing CFA, including measurement model invariance testing and common method bias testing. While I don't have experience in performing CFA in AMOS (I prefer R), you are in luck :-), since many of Gaskin's tutorials (and the CFA ones, in particular) are focused on using AMOS. I highly recommend his materials, both textual and, especially, video. I hope that my answer is helpful.


Gaskin, J. (2012). Confirmatory factor analysis. Gaskination's StatWiki.

Podsakoff, P. M., MacKenzie, S. B., Lee, J. Y., & Podsakoff, N. P. (2003). Common method biases in behavioral research: A critical review of the literature and recommended remedies. Journal of Applied Psychology, 88(5), 879-903. doi:10.1037/0021-9010.88.5.879

van de Schoot, R., Lugtig, P., & Hox, J. (2012). A checklist for testing measurement invariance. European Journal of Developmental Psychology, 1(7). doi:10.1080/17405629.2012.686740

Posted: May 7, 2015, 12:59 am

Answer by Aleksandr Blekh for software library to compute KL divergence?

It's great that you came up with the solution (+1). I meant to post an answer to this question much earlier, but was busy traveling to my dissertation defense (which was successful :-). You are likely to be happy with your solution, but, in addition to the possibility of computing KL divergences for certain distributions in R, for example, via the function KLdiv from the flexmix package, I ran across another and, in my opinion, much better option, which might be of interest to you.

It is a very comprehensive piece of autonomous open source software, relevant to the topic, called Information Theoretical Estimators (ITE) Toolbox. It is written in MATLAB/Octave and supports various information theoretic measures. So, sending thanks and kudos to the author of this software, I'm excited to share it here and hope that it will be useful to you and the community.
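For readers who only need the discrete case, the quantity itself is short enough to sketch by hand. The following is a minimal illustration in Python (not the API of flexmix or of the ITE Toolbox); the function name `kl_divergence` and the example distributions are assumptions made up for this sketch.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q), in nats, for two discrete distributions.

    Hypothetical helper for illustration: p and q are sequences of
    probabilities over the same outcomes.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # the term 0 * log(0 / q) is taken to be 0
        if qi == 0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total

p = [0.5, 0.5]
q = [0.9, 0.1]
d_pq = kl_divergence(p, q)  # D(p || q)
d_qp = kl_divergence(q, p)  # D(q || p), generally a different number
print(round(d_pq, 4), round(d_qp, 4))
```

Note that the measure is not symmetric, which is why both directions are computed above.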

Posted: May 1, 2015, 5:14 am

Answer by Aleksandr Blekh for Can you run clustering algorithms on perfectly collinear data?

The following is not an attempt to comprehensively answer your interesting (+1) question, but rather conveniently store and share with you and others some relevant, in my opinion, papers:

Posted: April 21, 2015, 3:38 pm

Answer by Aleksandr Blekh for Covariance between variables

If you're talking about correlation between predictor variables in a regression model, then the phenomenon you're describing is referred to as multicollinearity. In order to detect multicollinearity, as a minimum, you have to calculate the variance inflation factor (VIF), but there are other tests for this task as well. While detecting multicollinearity is relatively easy, dealing with it is not. Therefore, it might be beneficial to prevent it prior to the analysis or, at least, reduce it during the analysis. For more information on preventing and reducing multicollinearity, check my relevant answer.
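To make the VIF concrete: for predictor j it equals 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors. With only two predictors, R_j^2 is simply their squared Pearson correlation. The sketch below (in Python, for illustration) covers just that two-predictor case; the function names and the data are assumptions invented for the example.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF shared by both predictors when the model has exactly two of them."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# Made-up data: x2_related nearly duplicates x1, x2_unrelated does not.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2_related = [1.1, 1.9, 3.2, 3.8, 5.1, 6.2]
x2_unrelated = [2.0, -1.0, 0.5, 0.5, -1.0, 2.0]

print(vif_two_predictors(x1, x2_related))    # large value signals multicollinearity
print(vif_two_predictors(x1, x2_unrelated))  # a value of 1 means no inflation
```

A common rule of thumb treats VIF values above 5 or 10 as a warning sign, although such thresholds are conventions rather than formal tests.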

Posted: April 17, 2015, 4:14 pm

Answer by Aleksandr Blekh for How to describe meaning of R squared?

As @MattReichenbach said, if Age is the only predictor in your model, then your wording is fine. However, in order to avoid specifying a particular variable, I would suggest the following wording: "the model explains 30% of variation of the car condition index" (also note the use of present tense, which to me sounds more natural and correct). Using "the model" will make it easier to modify the reporting of results (more flexibility) in the future, for example, if and when you add more predictors to the model.
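As a reminder of where that percentage comes from, here is a minimal Python sketch of R squared for a one-predictor least squares fit, computed as 1 - SS_res / SS_tot. The data (car age versus a condition index) and the function name are hypothetical and chosen only for illustration.

```python
def r_squared(x, y):
    """R squared of the ordinary least squares line y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    intercept = my - slope * mx
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))  # unexplained variation
    ss_tot = sum((b - my) ** 2 for b in y)                                  # total variation
    return 1.0 - ss_res / ss_tot

# Hypothetical data: car age in years and a condition index.
age = [1, 2, 3, 4, 5]
condition = [9, 7, 8, 4, 5]

r2 = r_squared(age, condition)
print(f"the model explains {r2:.0%} of the variation of the car condition index")
```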

Posted: April 17, 2015, 3:54 pm

Answer by Aleksandr Blekh for Regression analysis or Structural Equation Modelling

First of all, especially considering that your model is not that simple, I suggest that, for this study, you switch from using the term regression analysis to using the term latent variable modeling (LVM) or, more commonly, structural equation modeling (SEM). The main reason is not the terminology, but emphasizing the fact that SEM encompasses a comprehensive analysis of both the measurement model and the structural model. In SEM terminology, to analyze a measurement model, you need to perform confirmatory factor analysis (CFA), after you've done EFA, while to analyze a structural model, you need to perform path analysis, also referred to as path modeling (PM) or simply SEM.

In terms of the SEM process, as I said earlier, it is quite a challenge to grasp all concepts and, especially, tie them all into one neat framework. So, I would suggest that you start with this excellent tutorial, after that - this paper (theoretical parts) to better understand SEM in general as well as the two major approaches to SEM (CB-SEM and PLS-SEM) and then, perhaps, take a quick look at this paper to get a sense (don't try to understand everything right away) of how the full SEM analysis (EFA $\rightarrow$ CFA $\rightarrow$ PM/SEM) should be performed and reported. Then you can return to this question to post small clarifying questions or post them as separate questions. Hope this helps.

Note. Two important aspects: 1) your full SEM model (both measurement and structural models) should be hypothesized by you, based on theory or, if theory doesn't exist for that knowledge domain, literature review as well as your assumptions and arguments; 2) the mapping between 26 items and 4 latent factors is exactly that hypothesized measurement model I was talking about.

Posted: April 16, 2015, 10:23 am

Answer by Aleksandr Blekh for When the dependent variable and random effects 'overlap' in mixed effects models

My knowledge of mixed effects models (MEM) is rather fuzzy so far, so I will just share with you the following two nice blog post tutorials on MEM in R by Jared Knowles: "Getting Started with Mixed Effect Models in R" and "Mixed Effects Tutorial 2: Fun with merMod Objects". I hope that it's helpful.

Posted: April 16, 2015, 9:47 am

Answer by Aleksandr Blekh for How to run regression analysis without extracted factors from factor anlaysis?

I'm confused what you're confused about. If I understood your question correctly, your plan is to perform regression analysis, using factors, extracted during exploratory factor analysis (EFA). Let's assume that your original data set contains $N$ observations and $k$ columns, equal to the total number of factors. Your EFA resulted in 4 extracted factors (not the corresponding data, as you rightly noted), let's call them $f_1, f_2, f_3, f_4$. So, the next step, I think, would be to perform regression analysis on a subset of the original data set, containing only columns, corresponding to the extracted factors. Therefore, both goals will be achieved: performing EFA and regression.

Posted: April 16, 2015, 4:20 am

Answer by Aleksandr Blekh for KFold Cross Validation Package/Library in C++?

I'm sure that you will find that many of the C++ libraries, listed in this section of that nice curated list of machine learning (ML) libraries, support cross-validation. Also, if you don't mind using C++ within .NET, check an interesting ML framework, Accord.NET - it indeed does support cross-validation.
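If it helps when evaluating those libraries, the core of k-fold cross-validation is only an index-splitting routine, which ports easily to C++. Below is a minimal language-agnostic sketch written in Python; the function name `k_fold_indices` is hypothetical, and the sketch assumes the data have already been shuffled if shuffling is needed.

```python
def k_fold_indices(n_samples, k):
    """Return k (train_indices, test_indices) pairs covering range(n_samples).

    Folds are contiguous and as equal in size as possible; the first
    n_samples % k folds receive one extra element.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]                 # held-out fold
        train = indices[:start] + indices[start + size:]   # everything else
        folds.append((train, test))
        start += size
    return folds

folds = k_fold_indices(10, 3)
for train, test in folds:
    print("train:", train, "test:", test)
```

Each observation lands in exactly one test fold, so averaging the k test scores uses every data point once for validation.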

Posted: April 16, 2015, 3:47 am

Answer by Aleksandr Blekh for Confidence measures for Gaussian mixture models

I will start answering these questions in the reverse order, as it seems to make more sense.

I'm playing around with densityMclust in the mclust R package, and it doesn't seem to be returning any confidence measure (analogous to a p-value).

It seems to me that R package mclust used to have confidence measures reporting functionality in some of its previous versions, but it has been removed or disabled for some reasons. That functionality included calculating (via bootstrapping) and reporting significance (p-values) as well as standard errors and confidence intervals for estimated parameters. Based on current CRAN documentation, the functionality was available via functions mclustBootstrapLRT() and MclustBootstrap().

Considering the above, I think that you have the following options:

  1. Determine the latest version of mclust, which contained needed functionality, install that version and perform the analysis.

  2. Implement missing functionality in end-user R code, based on information, formulas and references, provided in the documentation's description for mclustBootstrapLRT() and MclustBootstrap() functions. IMHO, a much better source of information for manual implementation is a nice blog post " EM Algorithm: Confidence Intervals" by Stephanie Hicks.

  3. Consider using mixtools package, which seems to contain at least significance (p-values) calculating and reporting functionality, similar to the one of mclustBootstrapLRT() function (see page 26 in the corresponding JSS paper).

When generating Gaussian mixture models using expectation maximization with Bayesian Information Criterion, is it necessary to report a confidence measure?

Unless it is very difficult (skill-wise or time-wise) for you to use one of the above-mentioned options, I think that it is quite important to include such reporting in your analysis' results, as it demonstrates (academic or industrial) professional level of statistical rigor.

How do you know that the algorithms are returning the optimal models?

The M-step of the EM algorithm is the optimizing one (M stands for maximization), so an iteration never decreases the log-likelihood. Having said that, the EM algorithm iterates until it converges to a local maximum of the log-likelihood function, so the returned model is optimal only locally and may depend on the starting values.

Additional information on the EM algorithm can be found in the following papers: brief, medium and large (a 280+ pages book, ironically called "gentle tutorial" :-). Also of interest might be this paper on estimating standard errors for the EM algorithm and this general paper on estimating confidence intervals for mixture models.
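To illustrate the convergence behaviour discussed above, here is a minimal Python sketch of EM for a one-dimensional mixture of two Gaussians that records the log-likelihood at every iteration. It is not how mclust or mixtools implement the algorithm; the function names, the initialisation (means at the data minimum and maximum) and the toy data are all assumptions made for this example.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(data, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture; returns parameters and log-likelihood trace."""
    n = len(data)
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    weights = [0.5, 0.5]
    means = [min(data), max(data)]          # assumed, simplistic starting values
    variances = [overall_var, overall_var]
    log_liks = []
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each point
        resp = []
        log_lik = 0.0
        for x in data:
            dens = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(dens)
            log_lik += math.log(total)
            resp.append([d / total for d in dens])
        log_liks.append(log_lik)
        # M-step: weighted maximum likelihood updates
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var_k, 1e-6)  # small floor guards against collapse
    return weights, means, variances, log_liks

data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.3, 4.7, 5.1, 4.9]
weights, means, variances, log_liks = em_two_gaussians(data)
print(means, log_liks[0], log_liks[-1])
```

The log-likelihood trace never decreases, but nothing in the procedure guarantees that the plateau it reaches is the global maximum; running from several starting points is the usual safeguard.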

Posted: April 16, 2015, 3:21 am

Answer by Aleksandr Blekh for Exploratory data analysis for a dataset with continuous and categorical variables

First of all, it is possible to calculate correlation for both continuous and categorical variables, as long as the latter ones are ordered. This type of correlation is referred to as polychoric correlation.

In order to calculate polychoric correlation, since you plan to use R, you have at least two options: 1) the psych package offers polychoric() and related functions; 2) the polycor package offers the hetcor() function. Analysis of models containing ordered categorical (ordinal) variables can also rely on other methods, including, but not limited to, numeric recoding, ordinal regression and the latent variables approach.

Posted: April 15, 2015, 5:27 am

Answer by Aleksandr Blekh for Trend Analysis: How to tell random fluctuations from actual changes in trends?

Basically, you have to perform trend analysis, which is a time series exploratory technique, based on the ARMA family of models, of which ARIMA is most likely the most popular one. However, for your purposes, I think that it might be enough to just perform time series decomposition, where, along with seasonality and cyclical pattern, trend is one of the main components. More details on time series decomposition as well as some examples can be found here. In regard to some existing rules of thumb for time series' minimum sample size, Prof. Rob J. Hyndman dismisses such guidelines as "misleading and unsubstantiated in theory or practice" in this relevant blog post.
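The trend-extraction step of a classical additive decomposition can be sketched with a centered moving average whose window equals the seasonal period, so that the seasonal swings average out. The Python snippet below is a simplified illustration under that assumption (odd window only); the function name and the synthetic series are made up for the example.

```python
def centered_moving_average(series, window):
    """Centered moving average; positions without a full window are left as None."""
    if window < 1 or window % 2 == 0:
        raise ValueError("this sketch only supports odd window lengths")
    half = window // 2
    trend = [None] * len(series)
    for i in range(half, len(series) - half):
        trend[i] = sum(series[i - half:i + half + 1]) / window
    return trend

# Synthetic series: linear trend 2 * i plus a period-3 seasonal pattern that sums to zero.
seasonal_pattern = [1.0, -3.0, 2.0]
series = [2.0 * i + seasonal_pattern[i % 3] for i in range(12)]

trend = centered_moving_average(series, window=3)
# What remains after removing the trend is seasonality (plus noise, in real data).
remainder = [s - t if t is not None else None for s, t in zip(series, trend)]
print(trend)
print(remainder)
```

On real data the remainder would still contain random fluctuations, and comparing their size with the movement of the trend component is one informal way to judge whether a change in trend is more than noise.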

Posted: April 14, 2015, 8:29 am

Answer by Aleksandr Blekh for How to implement GLM computationally in C++ (or other languages)?

While there is definitely some educational value of re-implementing GLM framework (or any other statistical framework, for that matter), I question the feasibility of this approach due to complexity and, consequently, time and efforts involved. Having said that, if you indeed want to go this route and review existing open source GLM implementations, you have, at least, the following options:

  • Standard GLM implementation by R package stats. See the corresponding source code here on GitHub or by typing the function name (without parentheses) in R's command line.

  • Alternative and specific GLM implementations for R include the following packages: glm2, glmnet and some others. Additionally, GLM-related R packages are listed in this blog post.

  • Excellent GLM Notes webpage (by Michael Kane and Bryan W. Lewis) offers a wealth of interesting and useful details on standard and alternative R GLM implementations aspects.

  • For Julia GLM implementations, check the GLM and GLMNet packages, which are similar to R's.

  • For Python GLM implementations, check the one in statsmodels library and the one in scikit-learn library (implements Ridge, OLS and Lasso - find corresponding modules).

  • For .NET GLM implementations, check IMHO very interesting Accord.NET framework - the GLM source code is here on GitHub.

  • For C/C++ GLM implementations, check apophenia C library (this source code seems to be relevant) and, perhaps, C++ GNU Scientific Library (GSL) (see this GitHub repo, but I was unable to find the relevant source code). Also of interest could be: this C++ IRLS GLM implementation (which uses GSL) as well as the Bayesian Object Oriented Modeling (BOOM) C++ library (GLM-focused source code is here on GitHub).
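Before reading those code bases, it may help to see the basic numerical idea in isolation. GLMs are commonly fitted by iteratively reweighted least squares (IRLS): each iteration builds a working response and weights from the current fit and solves a weighted least squares problem. The sketch below, in Python for brevity, does this for Poisson regression with a log link, an intercept and a single predictor. It is a teaching sketch under those assumptions, not the code of any library listed above, and the function name `poisson_irls` is hypothetical.

```python
import math

def poisson_irls(x, y, n_iter=25, tol=1e-10):
    """Fit log(mu) = b0 + b1 * x by iteratively reweighted least squares."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0  # start from the intercept-only fit
    for _ in range(n_iter):
        eta = [b0 + b1 * xi for xi in x]          # linear predictor
        mu = [math.exp(e) for e in eta]           # inverse log link
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]  # working response
        w = mu                                    # IRLS weights for Poisson with log link
        # Weighted normal equations for a 2-parameter model, solved in closed form
        s_w = sum(w)
        s_wx = sum(wi * xi for wi, xi in zip(w, x))
        s_wxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        s_wz = sum(wi * zi for wi, zi in zip(w, z))
        s_wxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = s_w * s_wxx - s_wx ** 2
        new_b0 = (s_wxx * s_wz - s_wx * s_wxz) / det
        new_b1 = (s_w * s_wxz - s_wx * s_wz) / det
        converged = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1

x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 4.0, 9.0, 16.0]  # made-up counts
b0, b1 = poisson_irls(x, y)
print(b0, b1)
```

A general implementation replaces the closed-form 2x2 solve with a QR or Cholesky decomposition and parameterises the link and variance functions, which is where most of the engineering effort goes.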

Posted: April 14, 2015, 3:39 am

Answer by Aleksandr Blekh for How can I determine if a time-series is statistically stable?

There exist various approaches to testing whether a time series is stationary. One of the most popular approaches is based on the unit root family of tests, which includes the Augmented Dickey-Fuller (ADF) test (available in R as tseries::adf.test()), the Zivot-Andrews test and several others (see the links in the unit root test Wikipedia article). Another approach is to use the KPSS test, which is considered complementary to unit root testing. Finally, there are approaches based on spectrum analysis, which include the Priestley-Subba Rao (PSR) test and the wavelet spectrum test. Some theoretical discussion and examples are available via the previous link as well as in the corresponding section of the online textbook "Forecasting: principles and practice" by professors Rob J. Hyndman and George Athanasopoulos.

Posted: April 13, 2015, 9:57 pm

Answer by Aleksandr Blekh for How does R package 'quantmod' receive (almost) real-time data?

Reviewing quantmod package's documentation (the up-to-date one, located on CRAN, since documentation on the package's website is obsolete), it appears that, currently, R package quantmod supports, aside from local data sets (MySQL, CSV, RData), the following public and private online data sources (availability varies from function to function).

Posted: April 13, 2015, 5:27 am

Answer by Aleksandr Blekh for Is it valid to reduce noise in the test data from noisy experiments by averaging over multiple runs?

I think that the experimenters' decision fits into general resampling statistical strategy. Having said that, I'm not sure what specific aspects, if any, might be used to criticize this approach from the machine learning perspective.

In regard to reducing noisy data, while I'm not sure how applicable it is in your subject domain, you might want to check my hopefully relevant answer. Moreover, I think that it might make sense to use clustering to detect and eliminate noisy data by applying bootstrapping technique. Please see my answer on using bootstrapping for clustering.

Posted: April 13, 2015, 12:17 am

Answer by Aleksandr Blekh for Locally weighted regression VS kernel linear regression?

Here's how I understand the distinction between the two methods (don't know what third method you're referring to - perhaps, locally weighted polynomial regression due to the linked paper).

Locally weighted regression is a general non-parametric approach, based on linear and non-linear least squares regression. Kernel linear regression is IMHO essentially an adaptation (variant) of a general locally weighted regression in the context of kernel smoothing. It seems that the main advantage of kernel linear regression is that it automatically eliminates the domain boundaries bias, associated with locally weighted approach (Hastie, Tibshirani & Friedman, 2009; for that as well as a general overview, see sections 6.1-6.3, pp. 192-201). This phenomenon is called automatic kernel carpentry (Hastie & Loader, 1993; Hastie et al., 2009; Müller, 1993). More details on locally weighted regression can be found in the paper by Ruppert and Wand (1994).
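The boundary effect can be seen in a few lines. The Python sketch below contrasts a plain kernel-weighted average with a kernel-weighted local linear fit on noiseless linear data, evaluated at the left edge of the domain. The Gaussian kernel, the bandwidth and the function names are assumptions chosen for illustration, and the naming of the two estimators follows Hastie et al. (2009) rather than any particular library.

```python
import math

def gaussian_kernel(u):
    return math.exp(-0.5 * u * u)

def kernel_average(x, y, x0, bandwidth):
    """Kernel-weighted average of y around x0 (locally constant fit)."""
    w = [gaussian_kernel((xi - x0) / bandwidth) for xi in x]
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

def local_linear(x, y, x0, bandwidth):
    """Kernel-weighted least squares line around x0, evaluated at x0."""
    w = [gaussian_kernel((xi - x0) / bandwidth) for xi in x]
    d = [xi - x0 for xi in x]  # center at x0 so the intercept is the fitted value
    s_w = sum(w)
    s_wd = sum(wi * di for wi, di in zip(w, d))
    s_wdd = sum(wi * di * di for wi, di in zip(w, d))
    s_wy = sum(wi * yi for wi, yi in zip(w, y))
    s_wdy = sum(wi * di * yi for wi, di, yi in zip(w, d, y))
    det = s_w * s_wdd - s_wd ** 2
    return (s_wdd * s_wy - s_wd * s_wdy) / det

x = [i / 10 for i in range(11)]     # grid on [0, 1]
y = [2.0 * xi + 1.0 for xi in x]    # noiseless line, true value at 0 is 1

print(kernel_average(x, y, 0.0, 0.3))  # pulled upward: all neighbours lie to the right
print(local_linear(x, y, 0.0, 0.3))    # recovers the true value at the boundary
```

At the boundary every neighbour lies on one side, so the weighted average is biased toward them, whereas fitting a line locally corrects this to first order.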

Due to different presentation style, some other information on the topic might also be helpful. For example, this book (Chapter 20.2 on linear smoothing; the page that I originally linked is dead), this class notes presentation slides document on kernel methods, and this class notes page on local learning approaches. I also like this blog post and this blog post, as they are relevant and nicely blend theory with examples in R and Python, correspondingly.


Hastie, T., & Loader, C. (1993). Local regression: Automatic kernel carpentry. Statistical Science, 8(2), 120-143.

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference and prediction (2nd ed.). New York: Springer-Verlag.

Müller, H.-G. (1993). [Local Regression: Automatic Kernel Carpentry]: Comment. Statistical Science, 8(2), 134-139.

Ruppert, D., & Wand, M. (1994). Multivariate locally weighted least-squares regression. The Annals of Statistics, 22(3), 1346–1370.

Posted: March 27, 2015, 5:23 am

Answer by Aleksandr Blekh for What is the "partial" in partial least squares methods?

I would like to answer this question, largely based on the historical perspective, which is quite interesting. Herman Wold, who invented the partial least squares (PLS) approach, did not start using the term PLS (or even mentioning the term partial) right away. During the initial period (1966-1969), he referred to this approach as NILES - an abbreviation of the term and the title of his initial paper on this topic, Nonlinear Estimation by Iterative Least Squares Procedures, published in 1966.

As we can see, procedures that later will be called partial, have been referred to as iterative, focusing on the iterative nature of the procedure of estimating weights and latent variables (LVs). The "least squares" term comes from using ordinary least squares (OLS) regression to estimate other unknown parameters of a model (Wold, 1980). It seems that the term "partial" has its roots in the NILES procedures, which implemented "the idea of split the parameters of a model into subsets so they can be estimated in parts" (Sanchez, 2013, p. 216; emphasis mine).

The first use of the term PLS occurred in the paper Nonlinear iterative partial least squares (NIPALS) estimation procedures, whose publication marks the next period of PLS history - the NIPALS modeling period. The 1970s and 1980s became the soft modeling period, when, influenced by Karl Joreskog's LISREL approach to SEM, Wold transformed the NIPALS approach into soft modeling, which essentially formed the core of the modern PLS approach (the term PLS became mainstream at the end of the 1970s). The 1990s, the next period in PLS history, which Sanchez (2013) calls the "gap" period, were marked largely by a decrease in its use. Fortunately, starting from the 2000s (consolidation period), PLS enjoyed its return as a very popular approach to SEM analysis, especially in social sciences.

UPDATE (in response to amoeba's comment):

  • Perhaps, Sanchez's wording is not ideal in the phrase that I've cited. I think that "estimated in parts" applies to latent blocks of variables. Wold (1980) describes the concept in detail.
  • You're right that NIPALS was originally developed for PCA. The confusion stems from the fact that there exist both linear PLS and nonlinear PLS approaches. I think that Rosipal (2011) explains the differences very well (at least, this is the best explanation that I've seen so far).

UPDATE 2 (further clarification):

In response to concerns, expressed in amoeba's answer, I'd like to clarify some things. It seems to me that we need to distinguish the use of the word "partial" between NIPALS and PLS. That creates two separate questions about 1) the meaning of "partial" in NIPALS and 2) the meaning of "partial" in PLS (that's the original question by Phil2014). While I'm not sure about the former, I can offer further clarification about the latter.

According to Wold, Sjöström and Eriksson (2001),

The "partial" in PLS indicates that this is a partial regression, since ...

In other words, "partial" stems from the fact that data decomposition by NIPALS algorithm for PLS may not include all components, hence "partial". I suspect that the same reason applies to NIPALS in general, if it's possible to use the algorithm on "partial" data. That would explain "P" in NIPALS.

In terms of using the word "nonlinear" in NIPALS definition (do not confuse with nonlinear PLS, which represents nonlinear variant of the PLS approach!), I think that it refers not to the algorithm itself, but to nonlinear models, which can be analyzed, using linear regression-based NIPALS.

UPDATE 3 (Herman Wold's explanation):

While Herman Wold's 1969 paper seems to be the earliest paper on NIPALS, I have managed to find another one of the earliest papers on this topic. That is a paper by Wold (1974), where the "father" of PLS presents his rationale for using the word "partial" in NIPALS definition (p. 71):

3.1.4. NIPALS estimation: Iterative OLS. If one or more variables of the model are latent, the predictor relations involve not only unknown parameters, but also unknown variables, with the result that the estimation problem becomes nonlinear. As indicated in 3.1 (iii), NIPALS solves this problem by an iterative procedure, say with steps s = 1, 2, ... Each step s involves a finite number of OLS regressions, one for each predictor relation of the model. Each such regression gives proxy estimates for a sub-set of the unknown parameters and latent variables (hence the name partial least squares), and these proxy estimates are used in the next step of the procedure to calculate new proxy estimates.
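The alternating character that Wold describes is easiest to see in the PCA form of NIPALS, where each pass consists of two groups of simple OLS regressions: one updates the loadings with the scores held fixed, the other updates the scores with the loadings held fixed. The Python sketch below extracts only the first component and is an illustration of that idea, not a reconstruction of Wold's original procedure; the function name, starting vector and toy data are assumptions.

```python
import math

def nipals_first_component(X, n_iter=500, tol=1e-12):
    """First principal component of X (list of rows) via NIPALS-style alternating regressions."""
    n, m = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(m)]
    Xc = [[row[j] - means[j] for j in range(m)] for row in X]  # column-centered data
    t = [row[0] for row in Xc]  # initial scores: first column (assumed not constant)
    for _ in range(n_iter):
        tt = sum(v * v for v in t)
        # Regress each column of Xc on the scores t: partial estimate of the loadings
        p = [sum(Xc[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
        norm = math.sqrt(sum(v * v for v in p))
        p = [v / norm for v in p]
        # Regress each row of Xc on the loadings p: partial estimate of the scores
        t_new = [sum(Xc[i][j] * p[j] for j in range(m)) for i in range(n)]
        change = sum((a - b) ** 2 for a, b in zip(t_new, t))
        t = t_new
        if change < tol:
            break
    return t, p

# Toy data lying exactly on the direction (1, 2).
X = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
scores, loadings = nipals_first_component(X)
print(loadings)
```

Neither regression alone solves the problem; each one estimates only part of the unknowns, and the proxy estimates from one step feed the next.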


Rosipal, R. (2011). Nonlinear partial least squares: An overview. In Lodhi H. and Yamanishi Y. (Eds.), Chemoinformatics and Advanced Machine Learning Perspectives: Complex Computational Methods and Collaborative Techniques, pp. 169-189. ACCM, IGI Global.

Sanchez, G. (2013). PLS path modeling with R. Berkeley, CA: Trowchez Editions.

Wold, H. (1974). Causal flows with latent variables: Partings of the ways in the light of NIPALS modelling. European Economic Review, 5, 67-86. North Holland Publishing.

Wold, H. (1980). Model construction and evaluation when theoretical knowledge is scarce: Theory and applications of partial least squares. In J. Kmenta and J. B. Ramsey (Eds.), Evaluation of econometric models, pp. 47-74. New York: Academic Press.

Wold, S., Sjöström, M., & Eriksson, L. (2001). PLS-regression: A basic tool of chemometrics. Chemometrics and Intelligent Laboratory Systems, 58, 109-130. doi:10.1016/S0169-7439(01)00155-1

Posted: January 29, 2015, 7:08 pm

Answer by Aleksandr Blekh for How to determine Forecastability of time series?

Parameters m and r, involved in calculation of approximate entropy (ApEn) of time series, are window (sequence) length and tolerance (filter value), correspondingly. In fact, in terms of m, r as well as N (number of data points), ApEn is defined as "natural logarithm of the relative prevalence of repetitive patterns of length m as compared with those of length m + 1" (Balasis, Daglis, Anastasiadis & Eftaxias, 2011, p. 215):

$$ ApEn(m, r, N) = \Phi^m(r) - \Phi^{m+1}(r), $$

where

$$ \Phi^m(r) = \frac{1}{N - m + 1} \sum_i \ln C^m_i(r) $$

Therefore, it appears that changing the tolerance r allows one to control the (temporal) granularity of determining time series' entropy. Nevertheless, using the default values for both m and r parameters in pracma package's entropy function calls works fine. The only fix that needs to be done to see the correct entropy values relation for all three time series (lower entropy for more well-defined series, higher entropy for more random data) is to increase the length of the random data vector:

 library(pracma)  # provides approx_entropy()

 all.series <- list(series1 = AirPassengers,
                    series2 = sunspot.year,
                    series3 = rnorm(500)) # <== size increased
 sapply(all.series, approx_entropy)
 #   series1   series2   series3
 # 0.5157758 0.7622430 1.4741971

The results are as expected - as the predictability of fluctuations decreases from the most determined series1 to the most random series3, their entropy consequently increases: ApEn(series1) < ApEn(series2) < ApEn(series3).
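For readers who want to see how the definition above turns into a computation, here is a minimal Python sketch of ApEn that follows the formula directly, with $C^m_i(r)$ taken as the fraction of length-m windows within tolerance r of window i (maximum absolute difference, self-match included). The defaults used here (m = 2, r = 0.2 times the standard deviation) are common choices and an assumption of this sketch, not necessarily identical to pracma's; the test series are synthetic.

```python
import math
import random

def approx_entropy(series, m=2, r=None):
    """Approximate entropy ApEn(m, r, N) = Phi^m(r) - Phi^(m+1)(r)."""
    n = len(series)
    if r is None:
        mean = sum(series) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in series) / (n - 1))
        r = 0.2 * sd  # assumed default tolerance

    def phi(length):
        count = n - length + 1
        windows = [series[i:i + length] for i in range(count)]
        total = 0.0
        for wi in windows:
            # Number of windows within tolerance r of wi (includes wi itself)
            matches = sum(
                1 for wj in windows
                if max(abs(a - b) for a, b in zip(wi, wj)) <= r
            )
            total += math.log(matches / count)
        return total / count

    return phi(m) - phi(m + 1)

rng = random.Random(42)
periodic = [math.sin(2 * math.pi * i / 10) for i in range(200)]
noise = [rng.gauss(0.0, 1.0) for _ in range(200)]

apen_periodic = approx_entropy(periodic)
apen_noise = approx_entropy(noise)
print(apen_periodic, apen_noise)
```

The perfectly periodic series yields a value close to zero, while the noise series yields a clearly larger one, which mirrors the ordering reported above.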

In regard to other measures of forecastability, you may want to check mean absolute scaled errors (MASE) - see this discussion for more details. Forecastable component analysis also seems to be an interesting and new approach to determining forecastability of time series. And, expectedly, there is an R package for that, as well - ForeCA.

 library(ForeCA)  # provides Omega()

 sapply(all.series,
        Omega, spectrum.control = list(method = "wosa"))
 #   series1   series2   series3
 # 41.239218 25.333105  1.171738

Here $\Omega \in [0, 1]$ is a measure of forecastability where $\Omega(\text{white noise}) = 0\%$ and $\Omega(\text{sinusoid}) = 100\%$.


Balasis, G., Daglis, I. A., Anastasiadis, A., & Eftaxias, K. (2011). Detection of dynamical complexity changes in Dst time series using entropy concepts and rescaled range analysis. In W. Liu and M. Fujimoto (Eds.), The Dynamic Magnetosphere, IAGA Special Sopron Book, Series 3, 211. doi:10.1007/978-94-007-0501-2_12. Springer.

Goerg, G. M. (2013). Forecastable component analysis. JMLR, W&CP (2), 64-72.

Posted: January 19, 2015, 11:56 pm

Answer by Aleksandr Blekh for multivariate control chart

I'm not sure if you're still interested in the topic, but I will provide a brief answer for you and other people who are interested in working with quality control charts (QCC), using the R language. For a theoretical introduction to the topic, I would suggest reviewing the corresponding section of StatSoft's nice electronic textbook. For a much more advanced treatment of the topic, I'd suggest this relevant thesis, titled "An investigation of some characteristics of univariate and multivariate control charts" (see links to PDF chapters).

Traditionally, the R ecosystem offers a wide variety of packages to choose from for a specific domain. This applies to QCC analysis as well. The most frequently used package for QCC analysis is qcc (Quality Control Charts); however, there are many other packages with varying ranges of functionality:

  • IQCC: Improved Quality Control Charts
  • MSQC: Multivariate Statistical Quality Control
  • qcr: Quality control and reliability
  • qualityTools: Statistical Methods for Quality Science
  • SPCadjust: Functions for calibrating control charts
  • CMPControl: Control Charts for Conway-Maxwell-Poisson Distribution
  • edcc: Economic Design of Control Charts
  • MetaQC: Objective Quality Control and Inclusion/Exclusion Criteria for Genomic Meta-Analysis
  • graphicsQC: Quality Control for Graphics in R
  • QCGWAS: Quality Control of Genome Wide Association Study results
  • GWAtoolbox: GWAS Quality Control
  • qAnalyst (removed from CRAN)
  • SixSigma: Six Sigma Tools for Quality and Process Improvement
  • qicharts: Quality Improvement Charts

For qcc there is a hard-to-find vignette by Luca Scrucca, which can be complemented by this blog post. For those considering using QCC in an educational setting, there is an interesting paper, describing the process (no code, though). Finally, for anyone interested in using QCC in the larger context of Six Sigma and in the R environment, the book "Six Sigma with R: Statistical engineering for process improvement", published by Springer, might be helpful.

Posted: January 19, 2015, 6:10 pm

Answer by Aleksandr Blekh for Dynamic Time Warping Clustering

Yes, you can use the DTW approach for classification and clustering of time series. I've compiled the following resources, which focus on this very topic (I recently answered a similar question, but not on this site, so I'm copying the contents here for everybody's convenience):
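Separately from those resources, the basic idea can be sketched in a few lines of base R: compute pairwise DTW distances and pass them to a standard clustering routine. The function `dtw_distance()` and the toy series below are hypothetical names and data introduced only for this illustration; this is the textbook dynamic programming recurrence without any windowing constraint, not an optimized implementation.

```r
# Classic DTW: cumulative cost matrix with absolute difference as local cost.
dtw_distance <- function(a, b) {
  n <- length(a)
  m <- length(b)
  cost <- matrix(Inf, nrow = n + 1, ncol = m + 1)
  cost[1, 1] <- 0
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      d <- abs(a[i] - b[j])
      # Extend the cheapest of the three admissible predecessor paths.
      cost[i + 1, j + 1] <- d + min(cost[i, j + 1], cost[i + 1, j], cost[i, j])
    }
  }
  cost[n + 1, m + 1]
}

# Toy series of different lengths (an assumption for illustration).
series <- list(
  up1   = c(1, 2, 3, 4, 5),
  up2   = c(1, 1, 2, 3, 4, 5),
  down1 = c(5, 4, 3, 2, 1),
  down2 = c(5, 5, 4, 3, 2, 1)
)

# Pairwise DTW distance matrix.
k <- length(series)
d <- matrix(0, k, k, dimnames = list(names(series), names(series)))
for (i in seq_len(k)) {
  for (j in seq_len(k)) {
    d[i, j] <- dtw_distance(series[[i]], series[[j]])
  }
}

# Hierarchical clustering on the DTW distances, cut into two groups.
clusters <- cutree(hclust(as.dist(d), method = "average"), k = 2)
```

Because DTW aligns series of unequal length, `up1` and `up2` end up at distance zero here even though one of them repeats its first value.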

Posted: January 5, 2015, 3:49 pm

Answer by Aleksandr Blekh for Complete machine learning library for Java/Scala

You may find this extensive curated list of ML libraries, frameworks and software tools helpful. In particular, it contains the resources that you're looking for: ML lists for Java and for Scala.

Posted: August 28, 2014, 7:14 am

User Aleksandr Blekh - Stack Overflow

Answer by Aleksandr Blekh for Generating models for Flask-AppBuilder using flask-sqlqcodegen

After some Internet searching, I ran across an issue on GitHub, which described exactly the same problem. However, the most recent recommendation at the time produced another error instead of the original one. In the discussion with the author of flask-sqlcodegen, it turned out that there was a pull request (PR), kindly provided by a project contributor, that apparently should fix the problem. After updating my local repository, followed by rebuilding and reinstalling the software, I was able to successfully generate models for my database. The whole process consists of the following steps.

  1. Change to directory with a local repo of flask-sqlcodegen.
  2. If you made any changes, like I did, stash them: git stash.
  3. Update repo: git pull origin master (now includes that PR).
  4. Rebuild/install software: python setup.py install.
  5. If you need your prior changes, restore them: git stash pop. Otherwise, discard them: git reset --hard.
  6. Change to your Flask application directory and auto-generate the models, as follows.

    sqlacodegen --flask --outfile postgresql+psycopg2://USER:PASS@HOST/DBNAME

Acknowledgements: Big thank you to Kamil Sindi (the flask-sqlcodegen's author) for the nice software and rapid & helpful feedback as well as to Alisdair Venn for that valuable pull request.

Posted: July 31, 2016, 1:53 am

Answer by Aleksandr Blekh for Strange MySQL "read-only" error

Based on my question's comments (special thanks to @Eborbob) and my update, I have figured out that some process in the system resets the read-only flag to ON (1), which seems to trigger the issue and results in the website becoming inaccessible. In order to fix the problem as well as make this fix persistent across software and server restarts, I decided to update the MySQL configuration file my.cnf and restart the DB server.

After making the relevant update (in my case, addition) to the configuration file


let's verify that the flag is indeed set to OFF (0):

# mysql
mysql> SELECT @@global.read_only;
| @@global.read_only |
|                  0 |
1 row in set (0.00 sec)

Finally, let's restart the MySQL server. For some reason, a dynamic reloading of the MySQL configuration (/etc/init.d/mysql reload) didn't work, so I had to restart the database server explicitly:

service mysql stop
service mysql start

Voila! Now access to the website is restored. I will update my answer if anything changes.

Posted: February 18, 2016, 2:34 am

Answer by Aleksandr Blekh for Error trying to start Notification Server

I have just figured this out. As I said in the recent update, I was trying to start the notification server as a non-'root' user. Looking again at the permissions of the /var/tmp/aphlict/pid folder, the problem suddenly became crystal clear and trivial.

ls -l /var/tmp/aphlict

total 4
drwxr-xr-x 2 root root 4096 Nov 16 13:40 pid

Therefore, all that needed to be done to fix the problem is to make the directory writable for everyone (I hope that this approach does not create a potential security issue):

chmod go+w /var/tmp/aphlict/pid

su MY_NON_ROOT_USER_NAME -c './bin/aphlict start'
Aphlict Server started.

Problem solved. By the way, for the Notification Server to work properly, do I need to open port 22281, in addition to already opened 22280? (Please answer in comments. Thank you!)

Posted: November 17, 2015, 6:58 pm

Answer by Aleksandr Blekh for Converting to JSON (key,value) pair using R

The output that you're seeing is produced by jsonlite, when a data set is a list:



Make sure that your data set is indeed a data frame and you will see the expected output:

toJSON(head(iris), pretty = TRUE)

[
    {
        "Sepal.Length": 5.1,
        "Sepal.Width": 3.5,
        "Petal.Length": 1.4,
        "Petal.Width": 0.2,
        "Species": "setosa"
    },
    {
        "Sepal.Length": 4.9,
        "Sepal.Width": 3,
        "Petal.Length": 1.4,
        "Petal.Width": 0.2,
        "Species": "setosa"
    },
    {
        "Sepal.Length": 4.7,
        "Sepal.Width": 3.2,
        "Petal.Length": 1.3,
        "Petal.Width": 0.2,
        "Species": "setosa"
    },
    {
        "Sepal.Length": 4.6,
        "Sepal.Width": 3.1,
        "Petal.Length": 1.5,
        "Petal.Width": 0.2,
        "Species": "setosa"
    },
    {
        "Sepal.Length": 5,
        "Sepal.Width": 3.6,
        "Petal.Length": 1.4,
        "Petal.Width": 0.2,
        "Species": "setosa"
    },
    {
        "Sepal.Length": 5.4,
        "Sepal.Width": 3.9,
        "Petal.Length": 1.7,
        "Petal.Width": 0.4,
        "Species": "setosa"
    }
]

Posted: April 13, 2015, 8:54 am

Answer by Aleksandr Blekh for View selected sample for each replication in bootstrap loop

Based on your comments, I've fixed the code. Here's the version that I tested and it seems to work:

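x <- c(20,54,18,65,87,49,45,94,22,15,16,15,84,55,44,13,16,65,48,98,74,56,97,11,25,43,32,74,45,19,56,874,3,56,89,12,28,71,93)
n <- length(x)

nBoot <- 3
mn <- numeric(nBoot)
# one row per bootstrap replication, filled in inside the loop
repl <- matrix(NA_real_, nrow = nBoot, ncol = n)

for (boots in 1:nBoot) {
  # draw a sample of size n with replacement and store it as a row
  repl[boots, ] <- sample(x, n, replace = TRUE)
  # mean of the current replication only (not of the whole matrix)
  mn[boots] <- mean(repl[boots, ])
}

# view the selected sample for each replication
print(repl)
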
Posted: April 8, 2015, 1:43 pm

Answer by Aleksandr Blekh for Algorithm for multiple extended string matching

I think that it might make sense to start by reading the relevant section of the Wikipedia article on the topic. You can then perform a literature review on algorithms implementing regular expression pattern matching.

In terms of practical implementation, there is a large variety of regular expression (regex) engines in the form of libraries, focused on one or more programming languages. Most likely, the best and most popular option is the C/C++ PCRE library, with its newest version, PCRE2, released in 2015. Another C++ regex library, which is quite popular at Google, is RE2. I recommend reading this paper, along with the two others linked within the article, for details on algorithms, implementation and benchmarks. Just recently, Google has released RE2/J, a linear-time version of RE2 for Java: see this blog post for details. Finally, I ran across an interesting pure C regex library, TRE, which offers way too many cool features to list here. However, you can read about them all on this page.
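As a small, hedged illustration of matching several extended patterns in one pass, the sketch below joins them into a single alternation and hands it to R's built-in regex functions with `perl = TRUE`, which uses the PCRE engine mentioned above. The patterns and the input text are made up for this example, and plain alternation is only the simplest approach, not a dedicated multi-pattern matching algorithm.

```r
# Hypothetical patterns and input text, chosen only for illustration.
patterns <- c("ab+c", "x[0-9]+", "foo|bar")
text <- "abbc x42 foo abc bar x7"

# Wrap each pattern in a non-capturing group so that the alternation
# between patterns does not interact with alternation inside a pattern.
combined <- paste0("(?:", patterns, ")", collapse = "|")

# Find all non-overlapping matches of any pattern in a single scan.
matches <- regmatches(text, gregexpr(combined, text, perl = TRUE))[[1]]

# For each match, record which of the original patterns it satisfies.
which_pattern <- sapply(matches, function(m) {
  hits <- sapply(patterns, function(p) {
    grepl(paste0("^(?:", p, ")$"), m, perl = TRUE)
  })
  patterns[which(hits)[1]]
})
```

For large pattern sets, an automaton-based engine such as RE2 is likely to scale better than a long alternation in a backtracking engine.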

P.S. If the above is not enough for you, feel free to visit this Wikipedia page for details of many more regex engines/libraries and their comparison across several criteria. Hope my answer helps.

Posted: March 10, 2015, 8:57 am

Answer by Aleksandr Blekh for Add existing scripts to an Rstudio project

Technically, you can change working directory programmatically within a project, but this is considered a very poor practice and is strongly recommended against. However, you can set working directory at a project's top level (full path to Folder A, in your example) and then refer to scripts and objects, located in Folders 1-3 via corresponding relative paths. For example: "./Folder1/MyScript.R" or "./Folder2/MyData.csv".

Posted: February 24, 2015, 7:56 pm

Answer by Aleksandr Blekh for R equivalent to matrix row insertion in Matlab

You certainly can have a similar functionality by using R's integration with a clipboard. In particular, standard R functions that provide support for clipboard operations include connection functions (base package), such as file(), url(), pipe() and others, clipboard text transfer functions (utils package), such as readClipboard(), writeClipboard(), as well as data import functions (base package), which use connection argument, such as scan() or read.table().

This functionality differs from platform to platform. In particular, on Windows you need to use the connection name clipboard, whereas on Mac (OS X) you can use pipe("pbpaste") (see this StackOverflow discussion for more details and alternative methods). It appears that the Kmisc package offers a platform-independent approach to that functionality; however, I haven't used it so far, so I can't really confirm that it works as expected. See this discussion for details.

The following code is the simplest example of how you would use the above-mentioned functionality (on Windows):

read.table("clipboard", sep = "\t", header = TRUE)
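Since the clipboard connection itself is platform-specific, the parsing step can be tried out anywhere by letting a character string stand in for the clipboard contents. In the sketch below, `clipboard_text` is a hypothetical stand-in for a few tab-separated rows copied from a spreadsheet; only the `text =` argument differs from the call above.

```r
# Hypothetical clipboard contents: a header row plus two tab-separated rows.
clipboard_text <- "id\tvalue\n1\t3.5\n2\t4.25\n"

# Same parsing options as for the real clipboard connection;
# 'text =' makes read.table() read from the string instead of a file.
df <- read.table(text = clipboard_text, sep = "\t", header = TRUE)

str(df)
```

Once the parsing options work on such a sample, the `text =` argument can be swapped for the platform's clipboard connection.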

An explanation and further examples are available in this blog post. As far as plotting the imported data goes, RStudio not only allows you to use standard R approaches, but also adds an element of interactivity via its bundled manipulate package. See this post for more details and examples.

Posted: February 15, 2015, 6:20 am

Answer by Aleksandr Blekh for R: Export CrossTable to Latex

Based on the gmodels package documentation, the function CrossTable() returns its results as a list. Therefore, I don't see any problems with exporting the results to LaTeX format. You just need to convert that list into a data frame. Then you have a choice of various R packages containing functions to convert a data frame into LaTeX format. For example, you can use df2latex() from the psych package. Alternatively, you can use either latex() or latexTabular(), both from the Hmisc package. The former converts a data frame into a TeX file, whereas the latter converts a data frame into LaTeX code for the corresponding object in a tabular environment (a LaTeX table).


Initial attempt - doesn't work, as CrossTable()'s result is not a simple list:


let <- sample(c("A","B"), 10, replace = TRUE)
num <- sample(1:3, 10, replace = TRUE)
tab <- CrossTable(let, num, prop.c = FALSE, prop.t = FALSE, prop.chisq = FALSE)

myList <- lapply(1:ncol(tab), function(x) as.character(unlist(tab[, x])))
myDF <-, stringsAsFactors = FALSE)
myLatex <- latexTabular(myDF)

Further efforts

Well, it's a little trickier than I initially thought, but there are two ways, as I see it. Please see below.

The first option is to convert the CrossTable to data frame

myDF <-

and then manually reshape the initial data frame per your requirements (sorry, I'm not too familiar with cross-tabulation).
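A rough sketch of that reshaping idea follows. It uses base R's table() as a stand-in for the CrossTable() result (an assumption made so that the example runs without extra packages), and the seed and variable names are arbitrary.

```r
set.seed(42)
# Fix the factor levels so that the table always has 2 rows and 3 columns.
let <- factor(sample(c("A", "B"), 10, replace = TRUE), levels = c("A", "B"))
num <- factor(sample(1:3, 10, replace = TRUE), levels = 1:3)

# Plain contingency table of counts (stand-in for the CrossTable object).
counts <- table(let, num)

# Long format: one row per (let, num) combination with its frequency.
long <- as.data.frame(counts, stringsAsFactors = FALSE)

# Wide format: rows are levels of 'let', columns are levels of 'num'.
wide <- as.data.frame.matrix(counts)

# Add the row totals, as in the cross-tabulation output.
wide$Total <- rowSums(wide)
```

A data frame like `wide` can then be passed to one of the data-frame-to-LaTeX functions mentioned earlier.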

The second option uses the Rz package (installation is a bit annoying, as it wants to install Gtk, but after closing the GUI, you can call its functions in an R session normally), as follows.


let <- sample(c("A","B"), 10, replace = TRUE)
num <- sample(1:3, 10, replace = TRUE)
tab <- crossTable(let, num) # note that I use crossTable() from 'Rz' package

# Console (default) output

let     1      2      3    Total 
A          0      2      1      3
        0.0%  66.7%  33.3%   100%
B          1      2      4      7
       14.3%  28.6%  57.1%   100%
Total      1      4      5     10
       10.0%  40.0%  50.0%   100%

Chi-Square Test for Independence

Number of cases in table: 10 
Number of factors: 2 
Test for independence of all factors:
    Chisq = 1.4286, df = 2, p-value = 0.4895
    Chi-squared approximation may be incorrect
Please install vcd package to output Cramer's V.

# Now use LaTeX output

summary(tab, latex = TRUE)
    \caption{let $\times$ num}
         &                      \multicolumn{3}{c}{num}                      &                           \\
    let  &\multicolumn{1}{c}{1}&\multicolumn{1}{c}{2}&\multicolumn{1}{c}{3}&\multicolumn{1}{c}{Total} \\
    A    &             0        &             2        &             1        &               3           \\
         &        0.0\%        &       66.7\%        &       33.3\%        &          100\%           \\
    B    &             1        &             2        &             4        &               7           \\
         &       14.3\%        &       28.6\%        &       57.1\%        &          100\%           \\
    Total&             1        &             4        &             5        &              10           \\
         &       10.0\%        &       40.0\%        &       50.0\%        &          100\%           \\

Chi-Square Test for Independence

Number of cases in table: 10 
Number of factors: 2 
Test for independence of all factors:
    Chisq = 1.4286, df = 2, p-value = 0.4895
    Chi-squared approximation may be incorrect
Please install vcd package to output Cramer's V.


Posted: January 29, 2015, 4:39 am

Answer by Aleksandr Blekh for How can I create a graph in R from a table with four variables? (Likert scale)

If you prefer a ggplot2-based solution, as an alternative to suggested base R graphics solution, I think that it should be along the following lines. A minimal reproducible example (MRE), based on your data follows.

if (!suppressMessages(require(ggplot2))) install.packages('ggplot2')
if (!suppressMessages(require(reshape))) install.packages('reshape')

myData <- data.frame('Gov. agencies' = c(3, 10, 1, 8, 7), 'Local authority' = c(3, 6, 3, 4, 13), 'Police forces' = c(3, 6, 3, 4, 13), 'NGO/third sector' = c(2, 5, 1, 10, 11), response = c('Not familiar', 'Somewhat familiar', 'Neutral', 'Familiar', 'Very familiar'))

levels(myData$response) <- c('Not familiar', 'Somewhat familiar', 'Neutral', 'Familiar', 'Very familiar')

myDataMelted <- melt(myData, id.vars = 'response')

ggplot(myDataMelted, aes(x=response, y=value, fill = variable))+
    geom_bar(stat = "identity", position = "dodge", color = "black")

The result:

enter image description here

WARNING! Please note that the above code is posted as a proof-of-concept and it is not only not complete in terms of labeling/beautification, but it contains an error (I think, not a major one), which I hope more knowledgeable people here will help me to fix, so that you could have an alternative solution (and I could have some educational experience and peace of mind, after all the trouble :-). The error is that groups are not in the correct order / do not belong to the correct categories. I've tried to alleviate that problem via levels(), but probably still missed or forgot some other point.
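One likely cause of the ordering problem, assuming that `response` is stored as a factor (as it would be with `stringsAsFactors = TRUE`), is that assigning to levels() renames the existing, alphabetically sorted levels instead of reordering them. The base R sketch below shows the difference on the same five labels; the variable names are made up for the illustration.

```r
response <- c("Not familiar", "Somewhat familiar", "Neutral",
              "Familiar", "Very familiar")

# By default, factor() sorts the levels alphabetically.
f <- factor(response)
levels(f)

# Assigning to levels() renames the existing levels in place; it does not
# reorder them, so labels end up attached to the wrong observations.
relabeled <- f
levels(relabeled) <- response
as.character(relabeled)

# Passing the desired order to factor() keeps every label with its own
# observation and fixes the order used for the plot axis.
ordered_response <- factor(response, levels = response)
levels(ordered_response)
```

Under that assumption, replacing the levels() assignment in the code above with a call of the form `factor(myData$response, levels = c(...))` should put the groups in the intended order.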

Posted: January 13, 2015, 2:51 am

Answer by Aleksandr Blekh for Gain Package Installation error in R 3.1.2

I believe that the problem lies in your corrupted, incomplete or otherwise incorrect R environment. I was able to install that package without any problems at all just by issuing the default command:

> install.packages("gains")
Installing package into ‘C:/Users/Alex/Documents/R/win-library/3.1’
(as ‘lib’ is unspecified)
trying URL ''
Content type 'application/zip' length 35802 bytes (34 Kb)
opened URL
downloaded 34 Kb

package ‘gains’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

loaded via a namespace (and not attached):
[1] tools_3.1.1

As a quick solution to the problem, I suggest to specify CRAN mirror explicitly:

install.packages("gains", repos = "")
Posted: January 2, 2015, 11:18 am

Answer by Aleksandr Blekh for Matrix specification for simple diagram, using 'diagram' package

Finally, I have figured it out myself. It's a little tricky, but not rocket science. Thanks to everyone who tried to help or, at least, read the question. Actually, after I figured this out, I took another look at @jbaums' suggestion above and realized that it is basically the same, discounting non-essential details. The suggested solution (which was appearing incorrectly, as shown above) was tested in my RStudio, whereas, since my machine with RStudio Server was down, I had to test my solution on R-Fiddle... The same company. Same (similar) technology. Go figure. Anyway, here is my obligatory minimal reproducible example (MRE):


connect <- c(0,0,0,0,

M <- matrix(nrow=4, ncol=4, byrow=TRUE, data=connect)
p <- plotmat(M, pos=c(1, 2, 1), name='', box.col="lightblue", curve=0)

MRE result:

enter image description here

Posted: November 24, 2014, 9:15 pm

Answer by Aleksandr Blekh for How to save an object through GGally in R

While @CMichael's comment is nice (I didn't know that, hence +1), it's applicable only if you want to save a particular plot from a GGally-generated plot matrix. I believe that you'd like to save the whole plot matrix, a need that I've recently experienced as well. For that, you can use the standard R approach: open the graphical device corresponding to the desired format, print the object and close the device, which effectively saves the graphics in that format.

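# use pdf() instead of svg(), if you want PDF output
svg("myPlotMatrix.svg", height = 7, width = 7)
g <- ggpairs(...)
print(g)   # draw the whole plot matrix on the open device
dev.off()  # close the device, which writes the file
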
Posted: November 18, 2014, 5:33 am

Answer by Aleksandr Blekh for knitr templating - Dynamic chunks issue

Finally, I've figured out what was causing the issue. The first part was easy. Due to the suggested simplification, I had switched from ggplot2 to standard R graphics functions. The problem is that plot() doesn't appear to return a value/object, which is why NULLs were seen in the output instead of plots.

The second part was a bit more tricky, but an answer to a related question clarified the situation. Based on that information, I was able to modify my MRE correspondingly, and the resulting document appears with correct content (the same applies to the generated LaTeX source, which seems to be ready for cross-referencing).

I'm thinking about converting this code into a more generic function for reuse across my project, if time permits [shouldn't take long] (@Yihui, could this be useful for the knitr project?). Thanks to everyone who took time to analyze, help or just read this question. I think that knitr's documentation should be clearer on issues related to producing PDF documents from RMarkdown source. My solution for the MRE follows.

---
title: "MRE: a dynamic chunk issue"
author: "Aleksandr Blekh"
output:
  pdf_document:
    fig_caption: yes
    keep_tex: yes
    highlight: NULL
---

```{r, echo=FALSE, include=FALSE}
library(knitr)    # opts_knit, opts_chunk, knit_expand(), knit()
library(ggplot2)  # qplot()

opts_knit$set(progress = F, verbose = F)
opts_chunk$set(comment=NA, warning=FALSE, message=FALSE, echo=FALSE, tidy=FALSE)
```

```{r Preparation, results='hide'}

g1 <- qplot(mpg, wt, data=mtcars)
g2 <- qplot(mpg, hp, data=mtcars)

myPlots <- list(g1, g2)

bcRefStr <- list("objType" = "fig",
                 "objs" = c("g1", "g2"),
                 "str" = "Plots \\ref{fig:g1} and \\ref{fig:g2}")
```

```{r DynamicChunk, include=FALSE}

latexObjLabel <- paste0("{{name}}\\\\label{", bcRefStr$objType, ":{{name}}", "}")

chunkName <- "{{name}}"
chunkHeader <- paste0("```{r ", chunkName, ", ")
chunkOptions <- paste0("include=TRUE, results='asis', fig.height=4, fig.width=4, fig.cap='", latexObjLabel, "'")
chunkHeaderFull <- paste0(chunkHeader, chunkOptions, "}")
chunkBody <- "print(get('{{name}}'))"

# template for one generated chunk: header, body, closing fence
chunkText <- c(chunkHeaderFull,
               chunkBody,
               "```", "\n")

figReportParts <- lapply(bcRefStr$objs, function (x) knit_expand(text = chunkText, name = x))
```

`r knit(text = unlist(figReportParts))`

Posted: November 13, 2014, 6:42 am

User Aleksandr Blekh - Open Data Stack Exchange

Answer by Aleksandr Blekh for Free public real time social data APIs

A significant number of free public APIs are available through the Mashape API Marketplace (freemium and commercial ones are available as well), including a category of social data APIs. I hope this is helpful.

Posted: October 25, 2015, 4:33 pm

Answer by Aleksandr Blekh for where can I find shapefiles for the highways of Puerto Rico?

As promised, I will answer this question without waiting for its migration, if it ever happens. Basically, I think that the best and latest data set that you can find now is this one, from the TIGER/Line database in the official US open data repository. This page is generated from a relevant search (Puerto Rico) and might also contain some other data sets of interest to you.

Other potentially useful data sets include ones within U.S. Atlas TopoJSON repository (on how to use the data via R, see this nice tutorial) as well as this repository of U.S. major roads ESRI shapefile and geoJSON data sets (you have to check whether this repository contains PR data).

Posted: April 5, 2015, 7:52 am

Answer by Aleksandr Blekh for How can I get a full list of US Zipcodes with their associated names/CSAs/MSAs/lats/longs?

It appears that obtaining this data is not as trivial as it might seem at first. The following are my suggestions regarding the requested data sources and other options. It seems that currently there are two relatively solid sources of the data you're looking for:

The following additional database, which is unofficial, less solid and somewhat outdated, might also be helpful (also check the links in the "Other Sources ..." section, especially the GNIS data set; note, however, that GNIS data is used in the SBA's Web service).

Posted: March 24, 2015, 4:04 am

Answer by Aleksandr Blekh for Where can I find project risk management data?

Some project risk management data can be found within the following resources:

NOTES: 1) I don't think that the Project Management Institute (PMI) has project risk management data, as @Joe suggested. At least, I haven't been able to find it. 2) Obviously, there exist other industry-focused project risk management data sets, similar to the one referenced above, which is focused on the software / IT industry.

Posted: January 5, 2015, 1:11 pm

Answer by Aleksandr Blekh for Estimate of total public expenditure from governments around world?

Adding to the previous good answer, I think that you might find the following WDI indicators useful:

Posted: February 27, 2014, 1:03 am

Answer by Aleksandr Blekh for Results of past NCAA games

Take a look at this College Football Statistics & History site:

Posted: February 25, 2014, 7:02 pm

Answer by Aleksandr Blekh for Demography vs. political preference data sources

Check this collection of static and real-time data sets. Most indicators should be on a per-country (including per-EU-country) basis.

Also, see:

Posted: February 25, 2014, 6:53 pm
